Computational Biology
Dariusz Mrozek
Scalable Big Data Analytics for Protein Bioinformatics: Efficient Computational Solutions for Protein Structures
Computational Biology Volume 28
Editors-in-Chief
Andreas Dress, CAS-MPG Partner Institute for Computational Biology, Shanghai, China
Michal Linial, Hebrew University of Jerusalem, Jerusalem, Israel
Olga Troyanskaya, Princeton University, Princeton, NJ, USA
Martin Vingron, Max Planck Institute for Molecular Genetics, Berlin, Germany

Editorial Board
Robert Giegerich, University of Bielefeld, Bielefeld, Germany
Janet Kelso, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
Gene Myers, Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
Pavel A. Pevzner, University of California, San Diego, CA, USA

Advisory Board
Gordon Crippen, University of Michigan, Ann Arbor, MI, USA
Joe Felsenstein, University of Washington, Seattle, WA, USA
Dan Gusfield, University of California, Davis, CA, USA
Sorin Istrail, Brown University, Providence, RI, USA
Thomas Lengauer, Max Planck Institute for Computer Science, Saarbrücken, Germany
Marcella McClure, Montana State University, Bozeman, MT, USA
Martin Nowak, Harvard University, Cambridge, MA, USA
David Sankoff, University of Ottawa, Ottawa, ON, Canada
Ron Shamir, Tel Aviv University, Tel Aviv, Israel
Mike Steel, University of Canterbury, Christchurch, New Zealand
Gary Stormo, Washington University in St. Louis, St. Louis, MO, USA
Simon Tavaré, University of Cambridge, Cambridge, UK
Tandy Warnow, University of Illinois at Urbana-Champaign, Champaign, IL, USA
Lonnie Welch, Ohio University, Athens, OH, USA
The Computational Biology series publishes the very latest, high-quality research devoted to specific issues in computer-assisted analysis of biological data. The main emphasis is on current scientific developments and innovative techniques in computational biology (bioinformatics), bringing to light methods from mathematics, statistics and computer science that directly address biological problems currently under investigation. The series offers publications that present the state-of-the-art regarding the problems in question; show computational biology/bioinformatics methods at work; and finally discuss anticipated demands regarding developments in future methodology. Titles can range from focused monographs, to undergraduate and graduate textbooks, and professional text/reference works.
More information about this series at http://www.springer.com/series/5769
Dariusz Mrozek Silesian University of Technology Gliwice, Poland
ISSN 1568-2684  Computational Biology
ISBN 978-3-319-98838-2
ISBN 978-3-319-98839-9 (eBook)
https://doi.org/10.1007/978-3-319-98839-9
Library of Congress Control Number: 2018950968

© Springer Nature Switzerland AG 2018

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.
For my always smiling and beloved wife Bożena, and my lively and infinitely active sons Paweł and Henryk, with all my love. To my parents, thank you for your support, concern and faith in me.
Foreword
High-performance computing most generally refers to the practice of aggregating computing power in a way that delivers much higher performance than one could get out of a typical desktop computer or workstation, in order to solve large problems in science, engineering, or business. Big Data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization. This timely book by Dariusz Mrozek gives you a quick introduction to the area of proteins and their structures, to protein structure similarity searching carried out at the main representation levels, and to various techniques that can be used to accelerate similarity searches using high-performance Cloud computing and Big Data concepts. It presents introductory concepts of a formal model of 3D protein structures for functional genomics, comparative bioinformatics, and molecular modeling, and the use of multi-threading for efficient approximate searching on protein secondary structures. In addition, there is material on finding 3D protein structure similarities accelerated with high-performance computing techniques. The book is required reading for anyone working in the area of data analytics for structural bioinformatics and the use of high-performance computing. It explores the area of proteins and their structures in depth and provides practical approaches to many problems that may be encountered. It is especially useful to applications developers, scientists, students, and teachers. I have enjoyed and learned from this book and feel confident that you will as well. Knoxville, USA June 2018
Jack Dongarra
University of Tennessee
Preface
International efforts focused on understanding living organisms at various levels of molecular organization, including the genomic, proteomic, metabolomic, and cell signaling levels, have led to a huge proliferation of biological data collected in dedicated, and frequently public, repositories. The amount of data deposited in these repositories increases every year, and the cumulative volume has grown to sizes that are difficult to handle with traditional analysis tools. This growth of biological data is stimulated by various international projects, such as 1000 Genomes. The project aims at sequencing the genomes of at least one thousand anonymous participants from a number of different ethnic groups in order to establish a detailed catalog of human genetic variations. As a result, it generates terabytes of genetic data. Apart from international initiatives and projects like the 1000 Genomes, the proliferation of biological data is further accelerated by newly developed technologies for DNA sequencing, like next-generation sequencing (NGS) methods. These methods are getting faster and less expensive every year. They produce huge amounts of genetic data that require fast analysis in various phases of molecular profiling, medical diagnostics, and treatment of patients who suffer from serious diseases. Indeed, for the last three decades we have witnessed the continuous exponential growth of biological data in repositories such as GenBank, Sequence Read Archive (SRA), RefSeq, Protein Data Bank, and UniProt/SwissProt. The specificity of the data has inspired the scientific community to develop many algorithms that can be used to analyze the data and draw useful conclusions. The huge volume of biological data has made many of the existing algorithms inefficient due to their computational complexity. Fortunately, the rapid development of computer science in the last decade has brought many technological innovations that can also be used in the field of bioinformatics and the life sciences. Algorithms of significant utility value, which until recently were perceived as too time-consuming, can now be used efficiently by applying the latest technological achievements, like Hadoop and Spark for analyzing Big Data sets, multi-threading, graphics processing units (GPUs), or cloud computing.
Scope of the Book

The book focuses on proteins and their structures. It presents various scalable solutions for protein structure similarity searching carried out at the main representation levels and for the prediction of 3D structures of proteins. It specifically focuses on various techniques that can be used to accelerate similarity searches and protein structure modeling processes. But why proteins? one might ask. I could answer the question by following Arthur M. Lesk in his book entitled Introduction to Protein Science: Architecture, Function, and Genomics: because proteins are where the action is. Understanding proteins, their structures, functions, mutual interactions, activity in cellular reactions, interactions with drugs, and expression in body cells is a key to efficient medical diagnosis, drug production, and treatment of patients. I have been fascinated with proteins and their structures for fifteen years. I fell in love with the beauty of protein structures at first sight, inspired by the research conducted by the late Lech Znamirowski from the Silesian University of Technology, Gliwice, Poland. I decided to continue his research on proteins and to develop new, efficient tools for their analysis and exploration. I believe this book will be interesting for scientists, researchers, and software developers working in the field of structural bioinformatics and biomedical databases. I hope that readers of the book will find it interesting and helpful in their everyday work.
Chapter Overview

The content of the book is divided into four parts. The first part provides background information on proteins and their representation levels, including a formal model of a 3D protein structure used in computational processes, and a brief overview of the technologies used in the solutions presented in this book.

• Chapter 1: Formal Model of 3D Protein Structures for Functional Genomics, Comparative Bioinformatics, and Molecular Modeling
This chapter shows how proteins can be represented in computational processes performed in scientific fields such as functional genomics, comparative bioinformatics, and molecular modeling. The chapter provides a general definition of protein spatial structure that is then referenced to the four representation levels of protein structure: primary, secondary, tertiary, and quaternary structures.

• Chapter 2: Technological Roadmap
This chapter provides a technological roadmap for the solutions presented in this book. It covers a brief introduction to the concept of Cloud computing and to cloud service and deployment models. It also defines the Big Data challenge and
presents the benefits of using multi-threading in scientific computations. It then explains graphics processing units (GPUs) and the CUDA architecture. Finally, it focuses on relational databases and the SQL language used for declarative querying.

The second part of the book is focused on Cloud services that are utilized in the development of scalable and reliable cloud applications for 3D protein structure similarity searching and protein structure prediction.

• Chapter 3: Azure Cloud Services
Microsoft Azure Cloud Services support the development of scalable and reliable cloud applications that can be used for scientific computing. This chapter provides a brief introduction to the Microsoft Azure cloud platform and its services. It focuses on Azure Cloud Services that allow building a cloud-based application with the use of Web roles and Worker roles. Finally, it shows a sample application that can be quickly developed on the basis of these two types of roles, and the role of queues in passing messages between components of the built system.

• Chapter 4: Scaling 3D Protein Structure Similarity Searching with Cloud Services
In this chapter, you will see how the Cloud computing architecture and Azure Cloud Services can be utilized to scale out and scale up protein similarity searches using the system called Cloud4PSi, which was developed for the Microsoft Azure public cloud. The chapter presents the architecture of the system, its components, the communication flow, and the advantages of using a queue-based model over direct communication between computing units. It also shows the results of various experiments confirming that similarity searching can be successfully scaled on cloud platforms by using computation units of different sizes and by adding more computation units.

• Chapter 5: Cloud Services for Efficient Ab Initio Predictions of 3D Protein Structures
In this chapter, you will see how Cloud Services may help to solve problems of protein structure prediction by scaling the computations in the role-based and queue-based Cloud4PSP system, deployed in the Microsoft Azure cloud. The chapter shows the system architecture, the Cloud4PSP processing model, and the results of various scalability tests that speak in favor of the presented architecture.

The third part of the book shows the utilization of scalable Big Data computational frameworks, like Hadoop and Spark, in massive 3D protein structure alignments and in the identification of intrinsically disordered regions in protein structures.

• Chapter 6: Foundations of the Hadoop Ecosystem
At the moment, the Hadoop ecosystem covers a broad collection of platforms, frameworks, tools, libraries, and other services for fast, reliable, and scalable data analytics. This chapter briefly describes the Hadoop ecosystem and focuses on two elements of the ecosystem: the Apache Hadoop and the Apache Spark.
It provides details of the MapReduce processing model and the differences between MapReduce 1.0 and MapReduce 2.0. The concepts defined in this chapter are important for the understanding of the complex systems presented in the following chapters of this part of the book.

• Chapter 7: Hadoop and the MapReduce Processing Model in Massive Structural Alignments Supporting Protein Function Identification
Undoubtedly, given the variety of biological data and the variety of scenarios in which these data can be processed and analyzed, Hadoop and the MapReduce processing model bring the potential to make a step forward toward the development of solutions that will allow insights into various biological processes to be gained much faster. In this chapter, you will see a MapReduce-based computational solution for efficient mining of similarities in 3D protein structures and for structural superposition. The solution benefits from the Map-only processing pattern of the MapReduce, which is presented and formally defined in this chapter. You will also see the results of performance tests carried out when scaling up nodes of the Hadoop cluster and increasing the degree of parallelism with the intention of improving the efficiency of the computations.

• Chapter 8: Scaling 3D Protein Structure Similarity Searching on Large Hadoop Clusters Located in a Public Cloud
In this chapter, you will see how 3D protein structure similarity searching can be accelerated by distributing computations on large Hadoop/HBase (HDInsight) clusters that can be broadly scaled out and up in the Microsoft Azure public cloud. This chapter shows that the utilization of public clouds to perform scientific computations is very beneficial and can be successfully applied when performing time-consuming computations over biological data.

• Chapter 9: Scalable Prediction of Intrinsically Disordered Protein Regions with Spark Clusters on Microsoft Azure Cloud
Computational identification of disordered regions in protein amino acid sequences has become an important branch of 3D protein structure prediction and modeling. In this chapter, you will see the IDPP meta-predictor, which applies an ensemble of primary predictors in order to increase the quality of prediction of intrinsically disordered proteins. This chapter presents a highly scalable implementation of the meta-predictor on a Spark cluster (Spark-IDPP) that mitigates the problem of the exponentially growing number of protein amino acid sequences in public repositories.

The fourth part of the book focuses on finding 3D protein structure similarities accelerated with the use of GPUs, and on the use of multi-threading and relational databases for efficient approximate searching on protein secondary structures.
• Chapter 10: Massively Parallel Searching of 3D Protein Structure Similarities on CUDA-Enabled GPU Devices
Graphics processing units (GPUs) and general-purpose graphics processing units (GPGPUs) promise a high speedup of many time-consuming and computationally demanding processes over their original implementations on CPUs. In this chapter, you will see that a massive parallelization of 3D structure similarity searching on many-core CUDA-enabled GPU devices reduces the execution time of the process and allows it to be performed in real time.

• Chapter 11: Exploration of Protein Secondary Structures in Relational Databases with Multi-threaded PSS-SQL
In this chapter, you will see how protein secondary structures can be stored in a relational database and explored with the use of the PSS-SQL query language. PSS-SQL is an extension of the SQL language. It allows formulating queries against a relational database in order to find proteins having secondary structures similar to a structural pattern specified by the user. In this chapter, you will see how this process can be accelerated by a parallel implementation of the alignment using multiple threads working on multi-core CPUs.
Summary

In this book, you will see advanced techniques and computational architectures that benefit from recent achievements in the fields of computing and parallelism. The techniques and methods presented in the successive chapters of this book are based on various types of parallelism, including multi-threading, massive GPU-based parallelism, and distributed many-task computing in Big Data and Cloud computing environments (Fig. 1). Most of the problems are implemented as pleasantly or embarrassingly parallel processes, except for the SQL-based search engine presented in Chap. 11, which employs multiple CPU threads in a single search process. The beautiful structures of proteins are definitely worth the effort of creating efficient methods for their exploration and analysis, with the aim of mining knowledge that will improve human life in the longer perspective. While writing this book, I tried to pass through the various representation levels of protein structures and show various techniques for their efficient exploration. In the successive chapters of the book, I described methods that were developed either by myself or as a part of projects that I was involved in. In the bibliography lists at the end of each chapter, I also cited other solutions for the presented problems and gave recommendations for further
Fig. 1 Preliminary architecture of the cloud-based solution for protein structure similarity searching drawn by me during the meeting (March 6, 2013) with Artur Kłapciński, my associate in this project. Institute of Informatics, Silesian University of Technology, Gliwice, Poland
reading. I hope that the solutions presented in the book will turn out to be interesting and helpful for scientists, researchers, and software developers working in the field of protein bioinformatics. Gliwice, Poland June 2018
Dariusz Mrozek
Acknowledgements
For many years, I have been trying to develop various efficient solutions for proteins and their structures. During this time, many people were involved in the research and development works that I carried out, and I find it hard to mention all of them. I would like to thank my wife Bożena Małysiak-Mrozek, and also Tomasz Baron, Miłosz Brożek, Paweł Daniłowicz, Paweł Gosk, Artur Kłapciński, Bartek Socha, and Marek Suwała, for their direct cooperation in the research leading to the emergence of this book. Brief information about some of them is given below. I would like to thank Alina Momot for her valuable advice on mathematical formulas, Henryk Małysiak for his mental support and constructive guidance resulting from decades of experience in academic and scientific work, and Stanisław Kozielski, a former Head of the Institute of Informatics at the Silesian University of Technology, Gliwice, Poland, for giving me a space where I grew up as a scientist and where I could continue my research.

Bożena Małysiak-Mrozek received the M.Sc. and Ph.D. degrees in computer science from the Silesian University of Technology, Gliwice, Poland. She is an Assistant Professor in the Institute of Informatics at the Silesian University of Technology, Gliwice, Poland, and also a Member of the IBM Competence Center. Her scientific interests cover information systems, computational intelligence, bioinformatics, databases, Big Data, cloud computing, and soft computing methods. She participated in the development of all the solutions and systems for protein structure exploration presented in this book.
Tomasz Baron received the M.Sc. degree in computer science from the Silesian University of Technology, Gliwice, Poland in 2016. He currently works for the Comarch S.A. company in Poland as a software engineer. His interests cover cloud computing, front-end frameworks, and Internet technologies. He participated in the development of the Spark-based system for the prediction of intrinsically disordered regions in protein structures presented in Chap. 9.

Miłosz Brożek received the M.Sc. degree in computer science from the Silesian University of Technology, Gliwice, Poland in 2012. He currently works for the JSofteris company in Poland as a Java programmer. His interests in IT cover microservices, cloud applications, and Amazon Web Services. He participated in the development of the CASSERT algorithm for protein similarity searching on CUDA-enabled GPU devices presented in Chap. 10.

Paweł Daniłowicz received the M.Sc. degree in computer science from the Silesian University of Technology, Gliwice, Poland in 2014. He currently works for the Asseco Poland S.A. company in Poland as a senior programmer. His interests in IT cover databases and business intelligence. He participated in the development of the HDInsight-/HBase-/Hadoop-based system for 3D protein structure similarity searching presented in Chap. 8.
Marek Suwała received the M.Sc. degree in computer science from the Silesian University of Technology, Gliwice, Poland in 2013. He currently works for Bank Zachodni WBK in Wrocław, Poland, as a systems analyst. His interests cover business process modeling and Web Services technologies. He participated in the development of the MapReduce-based application for the identification of protein functions on the basis of protein structure similarity presented in Chap. 7.
Additional contributors to the development of the presented scalable and high-performance solutions were: (1) Paweł Gosk, who participated in the implementation of the scalable system for 3D protein structure prediction working in the Microsoft Azure cloud presented in Chap. 5; (2) Artur Kłapciński, who was the main programmer while constructing the cloud-based system for 3D protein structure alignment and similarity searching presented in Chap. 4; and (3) Bartek Socha, who participated in the development of the multi-threaded version of the PSS-SQL language for the efficient exploration of protein secondary structures in relational databases presented in Chap. 11.

Also, I would like to thank Microsoft Research for providing me with free access to the computational resources of the Microsoft Azure cloud within the Microsoft Azure for Research Award grant. My special thanks go to Alice Crohas and Kenji Takeda from Microsoft, without whom my adventure with the Azure cloud would not be so long, interesting, and full of new challenges.

The emergence of this book was supported by the Statutory Research funds of the Institute of Informatics, Silesian University of Technology, Gliwice, Poland (grant No. BK/213/RAU2/2018).

On a personal note, I would like to thank my family for all their love, patience, unconditional support, and understanding in the moments of my absence resulting from my desire to write this book.
Contents

Part I  Background

1  Formal Model of 3D Protein Structures for Functional Genomics, Comparative Bioinformatics, and Molecular Modeling  3
   1.1  Introduction  4
   1.2  General Definition of Protein Spatial Structure  4
   1.3  A Reference to Representation Levels  6
        1.3.1  Primary Structure  6
        1.3.2  Secondary Structure  8
        1.3.3  Tertiary Structure  10
        1.3.4  Quaternary Structure  13
   1.4  Relative Coordinates of Protein Structures  15
   1.5  Energy Properties of Protein Structures  20
   1.6  Summary  23
   References  23

2  Technological Roadmap  29
   2.1  Cloud Computing  30
        2.1.1  Cloud Service Models  31
        2.1.2  Cloud Deployment Models  33
   2.2  Big Data Challenge  33
        2.2.1  The 5V Model of Big Data  34
        2.2.2  Hadoop Platform  35
   2.3  Multi-threading and Multi-threaded Applications  36
   2.4  Graphics Processing Units and the CUDA  39
        2.4.1  Graphics Processing Units  39
        2.4.2  CUDA Architecture and Threads  40
   2.5  Relational Databases and SQL  42
        2.5.1  Relational Database Management Systems  43
        2.5.2  SQL For Manipulating Relational Data  44
   2.6  Scalability  45
   2.7  Summary  46
   References  47

Part II  Cloud Services for Scalable Computations

3  Azure Cloud Services  51
   3.1  Microsoft Azure  51
   3.2  Virtual Machines, Series, and Sizes  55
   3.3  Cloud Services in Action  59
   3.4  Summary  65
   References  67

4  Scaling 3D Protein Structure Similarity Searching with Azure Cloud Services  69
   4.1  Introduction  69
        4.1.1  Why We Need Cloud Computing in Protein Structure Similarity Searching  71
        4.1.2  Algorithms for Protein Structure Similarity Searching  71
        4.1.3  Other Cloud-Based Solutions for Bioinformatics  75
   4.2  Cloud4PSi for 3D Protein Structure Alignment  75
        4.2.1  Use Case: Interaction with the Cloud4PSi  77
        4.2.2  Architecture and Processing Model of the Cloud4PSi  78
        4.2.3  Scaling Cloud4PSi  87
   4.3  Scalability of the Cloud4PSi  89
        4.3.1  Horizontal Scalability  90
        4.3.2  Vertical Scalability  93
        4.3.3  Influence of the Package Size  96
        4.3.4  Scaling Up or Scaling Out?  97
   4.4  Discussion  99
   4.5  Summary  99
   References  100

5  Cloud Services for Efficient Ab Initio Predictions of 3D Protein Structures  103
   5.1  Introduction  103
        5.1.1  Computational Approaches for 3D Protein Structure Prediction  104
        5.1.2  Cloud and Grid Computing in Protein Structure Determination  105
   5.2  Cloud4PSP for 3D Protein Structure Prediction  107
        5.2.1  Prediction Method  108
        5.2.2  Cloud4PSP Architecture  110
        5.2.3  Cloud4PSP Processing Model  114
        5.2.4  Extending Cloud4PSP  116
        5.2.5  Scaling the Cloud4PSP  116
   5.3  Performance of the Cloud4PSP  118
        5.3.1  Vertical Scalability  119
        5.3.2  Horizontal Scalability  121
        5.3.3  Influence of the Task Size  123
        5.3.4  Scale Up, Scale Out, or Combine?  125
   5.4  Discussion  127
   5.5  Summary  129
   5.6  Availability  131
   References  131

Part III  Big Data Analytics in Protein Bioinformatics

6  Foundations of the Hadoop Ecosystem  137
   6.1  Big Data  137
   6.2  Hadoop  138
        6.2.1  Hadoop Distributed File System  138
        6.2.2  MapReduce Processing Model  140
        6.2.3  MapReduce 1.0 (MRv1)  141
        6.2.4  MapReduce 2.0 (MRv2)  142
   6.3  Apache Spark  143
   6.4  Hadoop Ecosystem  146
   6.5  Summary  148
   References  149

7  Hadoop and the MapReduce Processing Model in Massive Structural Alignments Supporting Protein Function Identification  151
   7.1  Introduction  151
   7.2  Scalable Solutions for 3D Protein Structure Alignment and Similarity Searching  152
   7.3  A Brief Overview of H4P  155
   7.4  Map-Only Pattern of the MapReduce Processing Model  156
   7.5  Implementation of the Map-Only Processing Pattern in the H4P  159
   7.6  Performance of the H4P  164
        7.6.1  Runtime Environment  164
        7.6.2  Data Set  165
        7.6.3  A Course of Experiments  165
        7.6.4  Map-Only Versus MapReduce-Based Execution  166
        7.6.5  Scalability in One-to-Many Comparison Scenario with Sequential Files  168
        7.6.6  Scalability in Batch One-to-One Comparison Scenario with Individual PDB Files  170
        7.6.7  One-to-Many Versus Batch One-to-One Comparison Scenarios  172
        7.6.8  Influence of the Number of Map Tasks on the Acceleration of Computations  174
        7.6.9  H4P Performance Versus Other Approaches  175
   7.7  Discussion  179
   7.8  Summary  180
   References  181

8  Scaling 3D Protein Structure Similarity Searching on Large Hadoop Clusters Located in a Public Cloud  183
   8.1  Introduction  183
   8.2  HDInsight on Microsoft Azure Public Cloud  186
   8.3  HDInsight4PSi  187
   8.4  Implementation  188
   8.5  Performance Evaluation  194
        8.5.1  Evaluation Metrics  196
        8.5.2  Comparing Individual Proteins in One-to-One Comparison Scenario  198
        8.5.3  Working with Sequential Files in One-To-Many Comparison Scenario  200
        8.5.4  FullMapReduce Versus Map-Only Execution Pattern  203
        8.5.5  Performance of Various Algorithms  205
        8.5.6  Influence of Protein Size  206
        8.5.7  Scalability of the Solution  207
   8.6  Discussion  211
   8.7  Summary  212
   References  213

9  Scalable Prediction of Intrinsically Disordered Protein Regions with Spark Clusters on Microsoft Azure Cloud  215
   9.1  Intrinsically Disordered Proteins  215
   9.2  IDP Predictors  217
   9.3  IDPP Meta-Predictor  218
   9.4  Architecture of the IDPP Meta-Predictor  219
   9.5  Reaching Consensus  221
   9.6  Filtering Outliers  224
   9.7  IDPP on the Apache Spark  226
        9.7.1  Architecture of the Spark-IDPP  226
        9.7.2  Implementation of the IDPP on Spark  227
   9.8  Experimental Results  229
        9.8.1  Runtime Environment  229
        9.8.2  Data Set  229
        9.8.3  A Course of Experiments  230
        9.8.4  Effectiveness of the Spark-IDPP Meta-predictor  230
        9.8.5  Performance of IDPP-Based Prediction on the Cloud  237
   9.9  Discussion  241
   9.10  Summary  243
   9.11  Availability  243
   References  243

Part IV  Multi-threaded Solutions for Protein Bioinformatics

10  Massively Parallel Searching of 3D Protein Structure Similarities on CUDA-Enabled GPU Devices  251
    10.1  Introduction  251
          10.1.1  What Makes a Problem  252
          10.1.2  CUDA-Enabled GPUs in Processing Biological Data  253
    10.2  CASSERT for Protein Structure Similarity Searching  254
          10.2.1  General Course of the Matching Method  257
          10.2.2  First Phase: Low-Resolution Alignment  258
          10.2.3  Second Phase: High-Resolution Alignment  259
          10.2.4  Third Phase: Structural Superposition and Alignment Visualization  260
    10.3  GPU-Based Implementation of the CASSERT  261
          10.3.1  Data Preparation  262
          10.3.2  Implementation of Two-Phase Structural Alignment in a GPU  264
          10.3.3  First Phase of Structural Alignment in the GPU  265
          10.3.4  Second Phase of Structural Alignment in the GPU  270
    10.4  GPU-CASSERT Efficiency Tests  272
    10.5  Discussion  277
    10.6  Summary  279
    References  279

11  Exploration of Protein Secondary Structures in Relational Databases with Multi-threaded PSS-SQL  283
    11.1  Introduction  283
    11.2  Storing and Processing Secondary Structures in a Relational Database  286
          11.2.1  Data Preparation and Storing  287
          11.2.2  Indexing of Secondary Structures  287
          11.2.3  Alignment Algorithm  289
          11.2.4  Multi-threaded Implementation  291
          11.2.5  Consensus on the Area Size  295
    11.3  SQL as the Interface Between User and the Database  298
          11.3.1  Pattern Representation in PSS-SQL Queries  299
          11.3.2  Sample Queries in PSS-SQL  300
    11.4  Efficiency of the PSS-SQL  304
    11.5  Discussion  306
    11.6  Summary  307
    References  308

Index  311
Acronyms

AFP  Aligned fragment pair
BLOB  Binary large object
CASP  Critical Assessment of protein Structure Prediction
CE  Combinatorial Extension
CPU  Central processing unit
CUDA  Compute Unified Device Architecture
DAG  Directed acyclic graph
DBMS  Database management system
DNA  Deoxyribonucleic acid
ETL  Extract, transform, and load
FATCAT  Flexible structure AlignmenT by Chaining Aligned fragment pairs allowing Twists
GPGPU  General-purpose graphics processing units
GPU  Graphics processing unit
GUI  Graphical user interface
H4P  Hadoop for proteins
HDFS  Hadoop Distributed File System
IaaS  Infrastructure as a Service
MAS  Multi-agent system
MR  MapReduce
NoSQL  Non-SQL, non-relational
OODB  Object-oriented database
PaaS  Platform as a Service
PDB  Protein Data Bank
RDBMS  Relational database management system
RDD  Resilient distributed data set
RMSD  Root-mean-square deviation
SaaS  Software as a Service
SIMD  Single instruction, multiple data
SIMT  Single instruction, multiple thread
SQL  Structured Query Language
SSE  Secondary structure element
SVD  Singular value decomposition
VM  Virtual machine
XML  Extensible Markup Language
YARN  Yet Another Resource Negotiator
Part I
Background
Proteins are complex molecules that play key roles in biochemical reactions in the cells of living organisms. They are built up of hundreds of amino acids and thousands of atoms, which makes the analysis of their structures difficult and time-consuming. This part of the book provides background information on proteins and their representation levels, including a formal model of a 3D protein structure used in computational processes related to protein structure alignment, superposition, similarity searching, and modeling. It also includes a brief overview of the technologies used in the solutions presented in this book, solutions that aim at accelerating the computations underlying protein structure exploration.
Chapter 1
Formal Model of 3D Protein Structures for Functional Genomics, Comparative Bioinformatics, and Molecular Modeling
The great promise of structural bioinformatics is predicated on the belief that the availability of high-resolution structural information about biological systems will allow us to precisely reason about the function of these systems and the effects of modifications or perturbations Jenny Gu, Philip E. Bourne, 2009
Abstract Proteins are the main molecules of life. Understanding their structures, functions, mutual interactions, activity in cellular reactions, interactions with drugs, and expression in body cells is a key to efficient medical diagnosis, drug production, and treatment of patients. This chapter shows how proteins can be represented in processes performed in scientific fields such as functional genomics, comparative bioinformatics, and molecular modeling. The chapter begins with the general definition of protein spatial structure, which can be treated as a base for deriving other forms of representation. The general definition is then referenced to the four representation levels of protein structure: primary, secondary, tertiary, and quaternary structures. This is followed by a short description of protein geometry. Finally, at the end of the chapter, we discuss energy features that can be calculated based on the general description of protein structure. The formal model defined in this chapter will be used in the description of the efficient solutions and algorithms presented in the following chapters of the book.

Keywords 3D protein structure · Formal model · Primary structure · Secondary structure · Tertiary structure · Quaternary structure · Energy features · Molecular modeling
1.1 Introduction

From the biological point of view, the functioning of living organisms is tightly related to the presence and activity of proteins. Proteins are macromolecules that play a key role in all biochemical reactions in the cells of living organisms. For this reason, they are said to be molecules of life. And indeed, they are involved in many processes, including reaction catalysis (enzymes), energy storage, signal transmission, maintaining the cell's cytoskeleton, immune response, stimuli response, cellular respiration, transport of small bio-molecules, and regulation of cell growth and division. In terms of their general construction, proteins are macromolecules with a molecular mass above 10 kDa (1 Da = 1.66 × 10⁻²⁴ g) built up of amino acids (>100 amino acids, aa). The amino acids are linked to each other by peptide bonds, forming linear chains [5]. Proteins can be described with the use of four representation levels: primary structure, secondary structure, tertiary structure, and quaternary structure. The last three levels define the protein conformation, or protein spatial structure. The computer analysis of protein structures is usually carried out on one of these representation levels.

The computer analysis of protein spatial structures is very important for the identification of protein functions, the recognition of protein activity, and the analysis of the reactions and interactions that a particular protein is involved in. This implies the exploration of various geometrical features of protein structures. There is no doubt that even the structures of small proteins are very complex: proteins are built up of hundreds of amino acids and, consequently, thousands of atoms. This makes the computer analysis of protein structures difficult, and it contributes to the high computational complexity of the algorithms used for this analysis.

For any investigation related to protein bioinformatics, it is essential to assume some representation of proteins as macromolecules. Methods that operate on proteins in scientific fields such as functional genomics, comparative bioinformatics, and molecular modeling usually assume some model of protein structure. Formal models, in general, make it possible to define all the concepts that are used in the area under consideration. They guarantee that all concepts used while designing and performing a process will be understood exactly as they are defined by the author of the method or procedure. This chapter attempts to capture a common model of protein structure, which can be treated as a base model for the creation of dedicated models, derived either by extension or by restriction, and used for the computations carried out in the selected area. In the following sections, we will introduce a general definition of protein spatial structure, and we will reference it to the four representation levels of protein structure.
1.2 General Definition of Protein Spatial Structure

We define a 3D structure $S^{3D}$ of protein $P$ as a pair shown in Eq. 1.1.
Fig. 1.1 Fragment of sample protein structure: (left) atoms and bonds, (right) bonds only. Colors and letters assigned to atoms distinguish their chemical elements. Visualized using RasMol [52]
$$
S^{3D} = \langle A^{3D}, B^{3D} \rangle, \tag{1.1}
$$

where $A^{3D}$ is a set of atoms defined as follows:

$$
A^{3D} = \left\{ a_n : n \in (1, \ldots, N) \ \land \ \exists f_E : A^{3D} \longrightarrow E \right\}, \tag{1.2}
$$

where $N$ is the number of atoms in a structure, and $f_E$ is a function which for each atom $a_n$ assigns an element from the set of chemical elements $E$ (e.g., N for nitrogen, O for oxygen, C for carbon, H for hydrogen, S for sulfur). $B^{3D}$ is a set of bonds $b_{ij}$ between two atoms $a_i, a_j \in A^{3D}$, defined as follows:

$$
B^{3D} = \left\{ b_{ij} : b_{ij} = (a_i, a_j) = (a_j, a_i) \ \land \ i, j \in (1, \ldots, N) \right\}. \tag{1.3}
$$
Fragment of a sample protein structure is shown in Fig. 1.1. Each atom $a_n$ is described in three-dimensional space by Cartesian coordinates $x, y, z$:

$$
a_n = (x_n, y_n, z_n)^T, \tag{1.4}
$$

where $x_n, y_n, z_n \in \mathbb{R}$. Therefore, the length of the bond $b_{ij}$ between two atoms $a_i$ and $a_j$ can be calculated using the Euclidean distance:

$$
b_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2}, \tag{1.5}
$$

which is equivalent to the norm calculation [8]:

$$
b_{ij} = \| a_i - a_j \| = \sqrt{(a_i - a_j)^T (a_i - a_j)}. \tag{1.6}
$$

We can also state that:
$$
a_n \in A^{3D} \implies \forall_{n \in \{1, \ldots, N\}} \ \exists f_{Va} : A^{3D} \longrightarrow \mathbb{N}^+ \ \land \ \exists f_{Ve} : E \longrightarrow \mathbb{N}^+, \tag{1.7}
$$

where $f_{Va}$ is a function determining the valence of an atom and $f_{Ve}$ is a function determining the valence of a chemical element. For example, $f_{Ve}(\mathrm{C}) = 4$ and $f_{Ve}(\mathrm{O}) = 2$.
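To make the formal model of Eqs. 1.1-1.7 concrete, here is a minimal Python sketch; it is an illustration added for this text, not code from the solutions described in this book, and the class names and coordinates are hypothetical. It stores atoms with their chemical elements (the role of $f_E$) and Cartesian coordinates, and computes the bond length of Eq. 1.5:

```python
from dataclasses import dataclass
import math

@dataclass(frozen=True)
class Atom:
    """An atom a_n with its chemical element f_E(a_n) and coordinates (x, y, z)."""
    element: str  # value of f_E, e.g., "C", "N", "O"
    x: float
    y: float
    z: float

def bond_length(a_i: Atom, a_j: Atom) -> float:
    """Length of the bond b_ij as the Euclidean distance between a_i and a_j (Eq. 1.5)."""
    return math.sqrt((a_i.x - a_j.x) ** 2
                     + (a_i.y - a_j.y) ** 2
                     + (a_i.z - a_j.z) ** 2)

# A protein structure S^3D = <A^3D, B^3D>: a list of atoms and a set of bonds,
# each bond stored as an unordered pair of atom indices (Eq. 1.3).
atoms = [Atom("N", 9.297, 16.110, 10.758),  # coordinates are illustrative only
         Atom("C", 9.875, 15.692, 9.461)]
bonds = [(0, 1)]

for i, j in bonds:
    print(f"|b_{i}{j}| = {bond_length(atoms[i], atoms[j]):.3f} angstroms")
```

Since covalent bonds between heavy atoms in proteins are roughly 1.2-1.8 Å long, computed lengths in that range serve as a quick sanity check on coordinate data.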
1.3 A Reference to Representation Levels

Having formulated such a general definition of protein spatial structure, we can study the relationships between this structure and the four main representation levels of protein structures, i.e., the primary, secondary, tertiary, and quaternary structures. These relationships are described in the following sections.
1.3.1 Primary Structure

Proteins are polypeptides built up of many amino acids, usually more than one hundred, that are joined to each other by peptide bonds, thus forming a linear amino acid chain. The way one amino acid joins another, e.g., during translation from the mRNA, is not accidental. Each amino acid has an N-terminus (also known as the amino-terminus) and a C-terminus (also known as the carboxyl-terminus). When two amino acids join to each other, they form a peptide bond between the C-terminus of the first amino acid and the N-terminus of the second amino acid. When a single amino acid joins the forming chain during protein synthesis, it links its N-terminus to the free C-terminus of the last amino acid in the chain. Therefore, the amino acid chain is created from the N-terminus to the C-terminus.

The primary structure of a protein is often represented as the amino acid sequence of the protein (also called the protein sequence or polypeptide sequence), as presented in Fig. 1.2. The sequence is reported from the N-terminus to the C-terminus. Each letter in the sequence corresponds to one amino acid. Actually, the sequence is usually recorded in the one-letter code, and rarely in the three-letter code. The protein sequence is determined by the nucleotide sequence of the appropriate gene in the DNA. There are twenty standard amino acids encoded by the genetic code in living organisms. However, in some organisms two additional amino acids can be encoded, i.e., selenocysteine and pyrrolysine. All amino acids differ in chemical properties and have various atomic constructions. Proteins can have one or many amino acid chains. The order of amino acids in the amino acid chain is unique and determines the function of the protein.

The representation of a protein structure as a sequence of amino acids, as in Fig. 1.2a, is very simple and frequently used by many algorithms and tools for protein comparison and similarity searching, such as the Needleman–Wunsch [46] and Smith–Waterman [58] algorithms and the BLAST [1] and FASTA [49] families of tools. The representation
Fig. 1.2 Primary structures of Deoxyhemoglobin S chain A in Homo Sapiens [PDB ID: 2HBS] [19]: a in a one-letter code describing amino acid types, b in a three-letter code describing amino acid types. First line provides some descriptive information
is also used by methods that predict protein structures from their sequences, like I-TASSER [63], Rosetta@home [29], Quark [64], and many others, e.g., [61] and [69].

Let us now reference the primary structure to the general definition of the spatial structure defined in the previous section. We can state that a protein structure $S^{3D}$ consists of $M$ amino acids $P_m^{3D} \subset S^{3D}$ such that:

$$
P_m^{3D} = \langle A_m^P, B_m^P \rangle, \tag{1.8}
$$

where $A_m^P$ is a subset of the set of atoms $A^{3D}$, and $B_m^P$ is a subset of the set of bonds $B^{3D}$:

$$
A_m^P \subset A^{3D} \quad \text{and} \quad B_m^P \subset B^{3D}. \tag{1.9}
$$

A sample protein $P$ can now be recorded as a sequence of peptides $p_m$:

$$
P = \left\{ p_m \mid m = 1, 2, \ldots, M \ \land \ \exists f_R : P \longrightarrow \Omega \right\}, \tag{1.10}
$$

where $M$ is the length of the sequence (in peptides), and $f_R$ is a function which for each peptide $p_m$ assigns a type of amino acid from the set $\Omega$ containing the twenty (twenty-two) standard amino acids. Assuming that $p_m = P_m^{3D}$, we can associate the primary structure with the spatial structure $S^{3D}$ (Fig. 1.3):

$$
S^{3D} = \left\{ P_m^{3D} \mid m = 1, 2, \ldots, M \right\}. \tag{1.11}
$$

Although:

$$
\bigcup_{m=1}^{M} P_m^{3D} \subset S^{3D}, \tag{1.12}
$$
Fig. 1.3 Fragment of a sample protein structure showing the relationship between the primary structure and spatial structure. Successive amino acids are separated by dashed lines
in many situations related to processing of protein structures, we can assume that:

$$
S^{3D} = \bigcup_{m=1}^{M} P_m^{3D}. \tag{1.13}
$$
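As a simple illustration of Eqs. 1.10-1.13 (added for this text, not part of the original model's implementations; the residue data below are hypothetical), a protein can be coded as a list of residues, each owning its subset $A_m^P$ of the structure's atoms, with a dictionary playing the role of the typing function $f_R$:

```python
# A minimal sketch of Eqs. 1.10-1.13: a protein as a sequence of residues,
# where each residue P_m^3D owns a subset A_m^P of the structure's atoms.
THREE_TO_ONE = {"VAL": "V", "HIS": "H", "LEU": "L"}  # fragment of the 20-letter code

def f_R(residue_name: str) -> str:
    """Assign an amino acid type (one-letter code) to a residue, as f_R does in Eq. 1.10."""
    return THREE_TO_ONE[residue_name]

# Hypothetical chain fragment: (residue name, atom names belonging to A_m^P).
chain = [
    ("VAL", ["N", "CA", "C", "O", "CB", "CG1", "CG2"]),
    ("HIS", ["N", "CA", "C", "O", "CB", "CG", "ND1", "CD2", "CE1", "NE2"]),
    ("LEU", ["N", "CA", "C", "O", "CB", "CG", "CD1", "CD2"]),
]

# Primary structure: the sequence of f_R values, reported from N- to C-terminus.
sequence = "".join(f_R(name) for name, _ in chain)
print(sequence)  # VHL

# In the spirit of Eq. 1.13: the union of the per-residue atom sets
# reassembles the atom set A^3D of the whole structure.
all_atoms = [atom for _, atoms in chain for atom in atoms]
print(len(all_atoms))  # 25
```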
1.3.2 Secondary Structure

Secondary structure reveals specific spatial shapes in the construction of proteins. It shows how the linear chain of amino acids is formed into spiral α-helices, wavy β-strands, or loops. Indeed, these three shapes, α-helices, β-strands, and loops, are the main categories of secondary structures. Secondary structure itself does not describe the location of particular atoms in 3D space. It reflects local hydrogen interactions between some atoms of amino acids that are close in the amino acid chain. A protein structure represented by means of secondary structure elements can have the following form:

$$
S^S = \left\{ s_k^{se} \mid k = 1, 2, \ldots, K \ \land \ \exists f_S : S^S \longrightarrow \Sigma \right\}, \tag{1.14}
$$
Fig. 1.4 Secondary structures of Deoxyhemoglobin S chain A in Homo Sapiens [PDB ID: 2HBS]. First line provides some descriptive information
where $s_k^{se}$ is the $k$th secondary structure element, $K$ is the number of secondary structure elements in the protein, and $f_S$ is a function which for each element $s_k^{se}$ assigns a type of secondary structure from the set $\Sigma$ of possible secondary structure types. Actually, $f_S$ is a function that is sought by many researchers: secondary structure prediction methods, like GOR [17], PREDATOR [15], or PredictProtein [51], try to model and implement this function in some way based on the amino acid sequence. In order to cover all parts of the protein structure, the set $\Sigma$ distinguishes four (sometimes more) types of secondary structures:
α-helix, β-sheet or β-strand, loop, turn or coil, and undetermined structure.
The first three types of secondary structures are visible in Fig. 1.7 (right) in the tertiary structure of a sample protein. Each element $s_k^{se}$ is characterized by two values:

$$
s_k^{se} = [SSE_k, L_k], \tag{1.15}
$$

where $SSE_k$ describes the type of secondary structure (as mentioned above), $L_k \le M$ is the length of the $k$th element $s_k^{se}$ (measured in amino acids), and $M$ is the length of the amino acid chain. Secondary structure defined in this way can be represented as shown in Fig. 1.4, where the particular symbols stand for: H - α-helix, E - β-strand, C/L - loop, turn or coil, U - unassigned structure. The representation of protein secondary structures defined in Eqs. 1.14 and 1.15 and shown in Fig. 1.4 is used in some phases of the LOCK2 [55], CASSERT [36], and GPU-CASSERT [33] algorithms for 3D protein structure similarity searching, and in the indexing technique used in [18] and in the PSS-SQL [31, 40, 45] domain query languages for the exploration of secondary structures of proteins.

Referencing the secondary structure to the general definition of the spatial structure, we can state that a single element $s_k^{se}$ is a substructure of the spatial structure $S^{3D}$, usually containing several amino acids:

$$
s_k^{se} = \langle A_k^S, B_k^S \rangle, \tag{1.16}
$$

where $A_k^S$ is a subset of the set of atoms $A^{3D}$, and $B_k^S$ is a subset of the set of bonds $B^{3D}$:

$$
A_k^S \subset A^{3D} \quad \text{and} \quad B_k^S = (B_k^{S*} \cup H_k) \subset B^{3D}. \tag{1.17}
$$
1 Formal Model of 3D Protein Structures for Functional …
In formula (1.17), we take into account the standard set of covalent bonds between atoms in the secondary structure $s_k^{se}$, represented by the set $B_k^{S*}$, and the additional hydrogen bonds stabilizing the constructions of secondary structure elements, represented by the set $H_k$. The spatial structure of a sample protein can now be recorded as a sequence of secondary structure elements $s_k^{se}$:

$$ S^{3D} = \{ s_k^{se} \mid k = 1, 2, \ldots, K \} \;\wedge\; \exists f^{L}: A_k^{S} \longrightarrow \mathbb{R}^3, \tag{1.18} $$
where $K$ is the number of secondary structure elements in the protein, and $f^{L}$ is a function which assigns to each atom $a_n$ of the secondary structure $s_k^{se}$ a location in space described by Cartesian coordinates $(x_n, y_n, z_n)$. There are many approaches to modeling the function $f^{L}$ and finding the Cartesian coordinates for atoms of the protein structure. Physical methods rely on physical forces and interactions between atoms in a protein. Representatives of this approach include the already mentioned I-TASSER [63], Rosetta@home [29], Quark [64], WZ [61], and NPF [69]. Comparative methods rely on already known structures that are deposited in macromolecular data repositories, such as the Protein Data Bank (PDB) [4]. Representatives of the comparative approach are Robetta [26], Modeller [13], RaptorX [24], HHpred [59], and Swiss-Model [2] for homology modeling, and Sparks-X [66], Raptor [65], and Phyre [25] for fold recognition. It is also interesting to follow the relationship between protein secondary structure and primary structure. We can record a single element $s_k^{se}$ as a subsequence of amino acids:

$$ s_k^{se} = (p_l, p_{l+1}, \ldots, p_m), \quad \text{where } 1 \le l \le m \le M, \tag{1.19} $$

and where element $p$ is any amino acid forming part of the secondary structure $s_k^{se}$, and $M$ is the length of the protein (in amino acids). It can also be noted that for any $p_m = P_m^{3D} = \langle A_m^{P}, B_m^{P} \rangle$:

$$ A_m^{P} \subseteq A_k^{S} \quad \text{and} \quad B_m^{P} \subseteq B_k^{S}. \tag{1.20} $$
Such a relationship between the secondary structure and the primary structure is usually represented as shown in Fig. 1.5, and can be visualized in a similar fashion to that shown in Fig. 1.6. The representation of protein secondary structures shown in Fig. 1.5 is used as one of the protein geometry features in algorithms for 3D protein structure similarity searching, e.g., CTSS [9] and the previously mentioned CASSERT [36].
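To make the representation of Eqs. 1.14 and 1.15 concrete, the following minimal Python sketch encodes a secondary structure as a list of $(SSE_k, L_k)$ elements and expands it into the per-residue string of Fig. 1.4. All names and the sample structure are illustrative only, not taken from any of the cited tools:

```python
# Minimal sketch of the secondary structure model from Eqs. 1.14-1.15.
# SIGMA is the set of secondary structure types; each element s_k^se
# is a pair [SSE_k, L_k]: a type and a length in amino acids.
SIGMA = {"H", "E", "C", "U"}  # H: alpha-helix, E: beta-strand, C: loop/turn/coil, U: unassigned

def expand(elements):
    """Expand a list of (SSE_k, L_k) pairs into a per-residue string."""
    assert all(sse in SIGMA and length > 0 for sse, length in elements)
    return "".join(sse * length for sse, length in elements)

# A hypothetical structure with K = 3 elements:
ss = [("C", 4), ("H", 12), ("E", 7)]
print(expand(ss))        # CCCCHHHHHHHHHHHHEEEEEEE
print(len(expand(ss)))   # 23 = sum of all L_k, bounded by the chain length M
```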
1.3.3 Tertiary Structure

Tertiary structure is a higher degree of organization. Proteins achieve their tertiary structures through the protein folding process, in which a polypeptide chain acquires its correct three-dimensional structure and adopts its biologically active native state [5].

Fig. 1.5 Secondary structure and primary structure of Deoxyhemoglobin S chain A in Homo Sapiens [PDB ID: 2HBS]. The first line provides some descriptive information

Fig. 1.6 Relationship between secondary structure and primary structure of Deoxyhemoglobin S chain A in Homo Sapiens [PDB ID: 2HBS] visualized graphically at the Protein Data Bank [4] Web site (http://www.pdb.org, accessed on March 7, 2018)

Many proteins have only one amino acid chain, so tertiary structure is enough to describe their spatial structure. Those composed of more than one chain also have a quaternary structure. Tertiary structure requires the 3D coordinates of all atoms of the protein structure to be determined. Therefore, we can state that if the number of polypeptide chains $H = 1$, the general spatial structure $S^{3D}$ describes the tertiary structure $S^{T}$ of the protein:
$$ H = 1 \iff S^{T} = S^{3D}, \tag{1.21} $$

and

$$ S^{T} = \langle A^{T}, B^{T} \rangle. \tag{1.22} $$
At this point, the description of tertiary structure is the same as the description of the general spatial structure $S^{3D}$ given in Sect. 1.2. An example of tertiary structure is presented in Fig. 1.7.
Fig. 1.7 Tertiary structure of sample protein Cyclin Dependent Kinase CDK2 [PDB ID: 1B38] [7]: (left) representation showing atoms and bonds, (right) representation showing secondary structures and their relative orientation. Visualized using RasMol [52]
From the viewpoint of secondary structures, the tertiary structure specifies the positional relationships of secondary structures [8], as presented in Fig. 1.7 (right). The set of atoms of the tertiary structure $A^{T}$ consists of atoms forming all of the secondary structures packed into the protein structure (represented as the set $A^{T*}$). It also includes possible atoms from additional functional groups (represented as the set $A^{FG}$), e.g., prosthetic groups, inhibitors, solvent molecules, and ions for which coordinates are supplied. An example of a prosthetic group is presented in Fig. 1.8. Similarly, in addition to covalent and non-covalent bonds between atoms forming amino acids of the protein chain (represented as the set $B^{T*}$):

$$ B^{T*} = \bigcup_{m=1}^{M} B_m^{P}, \tag{1.23} $$
the set of bonds of the tertiary structure $B^{T}$ may also contain bonds between atoms from the functional groups (represented as the set $B^{FG}$) and additional bonds stabilizing the tertiary structure (represented as the set $B^{stab}$), e.g., disulfide bridges (S–S) between cysteine residues (Fig. 1.9). Therefore:

$$ A^{T} = A^{T*} \cup A^{FG} \quad \wedge \quad B^{T} = B^{T*} \cup B^{FG} \cup B^{stab}. \tag{1.24} $$
The representation of the 3D protein structure, taking into account formulas (1.21–1.24) and the earlier formulas (1.1–1.7), is used by many algorithms for protein structure alignment and similarity searching, including DALI [21], LOCK2 [55], FATCAT [67], CTSS [9], CE [56], FAST [68], and others [36]. To complete the search task, these algorithms usually do not explore the whole sets of atoms $A^{T}$ and bonds $B^{T}$, but use reduced sets $A^{T'}$ of chosen atoms, e.g., the Cα atoms of the backbone, and distances between the atoms (calculated using formula (1.5) or (1.6)):
Fig. 1.8 Prosthetic heme group responsible for oxygen binding, distinguished in the structure of Myoglobin [PDB ID: 1MBN] [62]
$$ A_{\alpha}^{T'} = \{ a_m \mid m = 1, 2, \ldots, M \;\wedge\; \forall_m \; a_m \in A_m^{P} \;\wedge\; f^{Va}(a_m) = C_\alpha \}, \tag{1.25} $$
where $M$ is the length of the protein chain (in residues). Some algorithms, like SSAP [48], also use the Cβ atoms in order to include information on the orientation of the side chains:

$$ A_{\beta}^{T'} = \{ a_m \mid m = 1, 2, \ldots, M' \;\wedge\; M' \le M \;\wedge\; \forall_m \; a_m \in A_m^{P} \;\wedge\; f^{Va}(a_m) = C_\beta \}, \tag{1.26} $$
or combinations of the two atom types:

$$ A_{\alpha\beta}^{T'} \subset A_{\alpha}^{T'} \cup A_{\beta}^{T'}. \tag{1.27} $$
Molecular viewers, like Chime [11], QMOL [16], Jmol [23], PMV [39], RasMol [52], PyMOL [53], and MViewer [60], also make use of the whole set of atoms $A^{T}$ and bonds $B^{T}$, or just subsets of them (depending on the display mode), during protein structure visualization. For example, in the balls-and-sticks display mode (Fig. 1.7, left), they use the whole sets of atoms and bonds, and in the backbone mode they use just the positions of the Cα atoms to display the protein backbone.
1.3.4 Quaternary Structure

Quaternary structure describes the spatial structures of proteins that have more than one polypeptide chain. It shows the mutual location of the tertiary structures of these chains in three-dimensional space.

Fig. 1.9 Disulfide bridge between two sulphur atoms in cysteine residues in sample protein Glutaredoxin-1-Ribonucleotide Reductase B1 [PDB ID: 1QFN] [3]

Therefore, we can represent a quaternary structure as follows:

$$ S^{Q} = \{ c_h \mid h = 1, 2, \ldots, H \;\wedge\; \exists f^{CID}: S^{Q} \longrightarrow \{A, B, C, \ldots, X, Y, Z\} \;\wedge\; \exists f^{T}: S^{Q} \longrightarrow S^{T} \}, \tag{1.28} $$
where $H$ is the number of protein chains, $f^{CID}$ is a function which assigns a chain identifier (e.g., A, B, …, Z) to each chain $c_h$ of the quaternary structure $S^{Q}$, and $f^{T}$ is a function which assigns a tertiary structure $S^{T}$ to each chain $c_h$ of the quaternary structure $S^{Q}$. Therefore, we can state that if the number of polypeptide chains $H > 1$, the general spatial structure $S^{3D}$ describes the quaternary structure $S^{Q}$ of the whole protein:

$$ H > 1 \iff S^{Q} = S^{3D}. \tag{1.29} $$
Protein structures that are composed of a number of chains are called oligomeric complexes [8]. Examples of quaternary structures are presented in Figs. 1.10 and 1.11. If each chain $c_h$ has its tertiary structure, we can note that:

$$ c_h = \langle A_h^{T}, B_h^{T} \rangle, \tag{1.30} $$

and that:

$$ A^{3D} = A^{Q} = \bigcup_{h=1}^{H} A_h^{T} \cup A^{FG}, \tag{1.31} $$
Fig. 1.10 Quaternary structure of Human Deoxyhemoglobin [PDB ID: 4HHB] [14] containing four chains and heme
$$ B^{3D} = B^{Q} = \bigcup_{h=1}^{H} B_h^{T} \cup B^{FG} \cup B^{stab}. \tag{1.32} $$
Again, the set of atoms $A^{Q}$ forming the quaternary structure of a protein consists of atoms belonging directly to the particular component polypeptide chains ($A_h^{T}$) and atoms of additional functional groups ($A^{FG}$). The set of bonds $B^{Q}$ consists of covalent bonds linking atoms within each of the polypeptide chains ($B_h^{T}$), bonds linking atoms of functional groups ($B^{FG}$), and bonds stabilizing the quaternary structure ($B^{stab}$), e.g., inter-chain disulfide bridges.
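As a small illustration of the hierarchical model built up in formulas (1.21)–(1.32), the following Python sketch represents a structure as sets of atoms and bonds grouped into chains. All class and field names are hypothetical, introduced only for this example:

```python
from dataclasses import dataclass, field

# Illustrative sketch of S^3D = <A^3D, B^3D> with chains (Eqs. 1.28-1.32).
@dataclass(frozen=True)
class Atom:
    serial: int     # unique atom identifier
    element: str    # chemical element, e.g., "C", "N", "S"
    xyz: tuple      # Cartesian coordinates (x, y, z) assigned by f^L

@dataclass
class Chain:        # one chain c_h with its tertiary structure <A_h^T, B_h^T>
    chain_id: str   # chain identifier assigned by f^CID, e.g., "A"
    atoms: set = field(default_factory=set)
    bonds: set = field(default_factory=set)   # pairs of Atom serials

@dataclass
class Structure:    # quaternary structure S^Q (tertiary if len(chains) == 1)
    chains: list
    functional_group_atoms: set = field(default_factory=set)  # A^FG
    stabilizing_bonds: set = field(default_factory=set)       # B^stab

    def all_atoms(self):
        """A^3D: union of A_h^T over all chains, plus A^FG (Eq. 1.31)."""
        atoms = set().union(*(c.atoms for c in self.chains))
        return atoms | self.functional_group_atoms
```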
1.4 Relative Coordinates of Protein Structures

Some of the computational processes performed in protein exploration use relative coordinates rather than the absolute coordinates of particular atoms of protein structures. For example, protein structure prediction by energy minimization uses many different relative coordinates while performing the computational process.
Fig. 1.11 Quaternary structure of Insulin Hormone [PDB ID: 1ZNJ] [57] containing six chains and zinc atoms
These relative coordinates can be derived from the protein structure $S^{3D}$, for which the absolute coordinates are known. We have already had the opportunity to see one of the relative coordinates when we talked about the set of bonds, the $B^{3D}$ component of the protein structure $S^{3D}$, in formula (1.3) in Sect. 1.2. These were bond lengths. Bond lengths (Fig. 1.12) have been studied intensively over the years, and collected statistics show that the lengths of bonds between particular types of atoms in the protein backbone are similar: the bond length for N − Cα is 1.47 Å (1 Å = $10^{-10}$ m), for Cα − C it is 1.53 Å, and for C − N it is 1.32 Å [54]. However, the investigation of differences and similarities between bond lengths is still interesting. Some computational procedures require bond lengths to be calculated. For example, while comparing two protein structures, selected types of bonds, like Cα − C′, can be compared for each pair of compared amino acids. Bond lengths are also used while calculating the bond stretching component of the total potential energy of a protein structure (Sect. 1.5). Bond lengths can be calculated according to formulas (1.5) and (1.6) shown earlier in this chapter. Interatomic distances can be seen as a generalization of bond lengths. They describe the distance between two atoms (Fig. 1.12). However,
Fig. 1.12 Graphical interpretation of bond length (top left), interatomic distance (top right), and bond angle (bottom)
these atoms do not have to be connected by any bond. Interatomic distances can be calculated according to the same formulas (1.5) and (1.6) as bond lengths. They are very useful when we want to study interactions between particular atoms in a protein structure or between atoms of two molecules, e.g., two substrates of a cellular reaction. They are also frequently calculated in protein structure comparison. For example, the popular DALI algorithm [21] uses distances between Cα atoms to calculate so-called distance matrices that represent protein structures in the comparison process. Another relative feature studied by researchers in the fields of chemistry and molecular biology is bond angles. Bond angles, or valence angles, are, next to bond lengths, the principal relative features that control the shape of 3D protein structures. In order to calculate a bond angle, we have to know the positions of three atoms (Fig. 1.12). The angle between the two bonds $b_{ij}$ and $b_{kj}$ linking these three atoms (Fig. 1.12, bottom) can be calculated from the dot product of their respective vectors:

$$ \cos \theta_j = \frac{\mathbf{b}_{ij} \cdot \mathbf{b}_{kj}}{\|\mathbf{b}_{ij}\| \, \|\mathbf{b}_{kj}\|}. \tag{1.33} $$
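As an illustration, the following short Python sketch (hypothetical coordinates; NumPy assumed available) computes an interatomic distance, a Cα distance matrix of the kind used by DALI-style comparisons, and the bond angle of formula (1.33):

```python
import numpy as np

def distance(a, b):
    """Euclidean distance between two atoms given as (x, y, z) tuples."""
    return np.linalg.norm(np.asarray(a) - np.asarray(b))

def distance_matrix(ca_coords):
    """Pairwise C-alpha distance matrix over a list of coordinates."""
    c = np.asarray(ca_coords)
    diff = c[:, None, :] - c[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def bond_angle(ai, aj, ak):
    """Valence angle theta_j (degrees) at atom a_j, from formula (1.33)."""
    b_ij = np.asarray(ai) - np.asarray(aj)
    b_kj = np.asarray(ak) - np.asarray(aj)
    cos_theta = np.dot(b_ij, b_kj) / (np.linalg.norm(b_ij) * np.linalg.norm(b_kj))
    return np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

# Hypothetical backbone fragment N-CA-C (coordinates in angstroms):
n, ca, c = (0.0, 0.0, 0.0), (1.47, 0.0, 0.0), (1.96, 1.45, 0.0)
print(distance(n, ca))        # ~1.47, the typical N-CA bond length
print(bond_angle(n, ca, c))   # valence angle at the CA atom
```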
Torsion angles also provide very important information for the analysis of 3D protein structures. Torsion angles are dihedral angles that describe the rotation of the protein polypeptide backbone around particular bonds. There are three types of torsion angles calculated for protein structures, i.e., Phi (φ), Psi (ψ), and Omega (ω). The Phi torsion angle describes the rotation around the N − Cα bond, the Psi torsion angle describes the rotation around the Cα − C′ bond, and the Omega torsion angle describes the rotation around the C′ − N bond (see Fig. 1.13).
Fig. 1.13 Overview of protein construction. Visible atoms forming the main chain of the polypeptide. Side chains are marked as R0, R1, R2
Looking at Fig. 1.13, we can notice that the peptide bond formed by the C, O, N, H atoms is planar, which restricts the rotation around the C′ − N bond. The Omega angle is then essentially fixed at 180 degrees due to the partial double-bond character of the peptide bond. Therefore, main chain rotations are restricted to the Phi and Psi angles, and these angles provide the flexibility required for folding the protein backbone. This information is utilized in some algorithms for protein structure prediction, e.g., WZ [61] and NPF [69], that model the protein structure by randomly choosing and rotating the Phi and Psi angles. Torsion angles can be calculated using the dot product of the normal vectors of the two planes defined by the successive atoms $a_i, a_j, a_k$ and $a_j, a_k, a_l$, as presented in Fig. 1.14. These normals can be calculated from the cross products of the vectors creating the particular planes:

$$ \mathbf{n}_1 = \mathbf{b}_{ij} \times \mathbf{b}_{kj} \quad \text{and} \quad \mathbf{n}_2 = \mathbf{b}_{jk} \times \mathbf{b}_{lk}, \tag{1.34} $$
Fig. 1.14 Calculation of the dihedral angle between two planes based on normal vectors (left). Calculation of the normal vector as a cross product of vectors defining a plane (right). Redrawn based on [8]
and then used to calculate a dihedral angle from the dot product $\mathbf{n}_1 \cdot \mathbf{n}_2$:

$$ \cos \omega = \frac{\mathbf{n}_1 \cdot \mathbf{n}_2}{\|\mathbf{n}_1\| \, \|\mathbf{n}_2\|}, \tag{1.35} $$
where ω is the calculated torsion angle (Phi, Psi, or Omega), depending on which successive atoms of the backbone are inserted in place of $a_i, a_j, a_k, a_l$. For the Phi torsion angle these should be the atoms $C'_{i-1} - N_i - C_{\alpha i} - C'_i$, for the Psi torsion angle $N_i - C_{\alpha i} - C'_i - N_{i+1}$, and for the Omega torsion angle $C_{\alpha i} - C'_i - N_{i+1} - C_{\alpha i+1}$. Theoretically, the Phi and Psi angles can take values ranging from −180 to 180 degrees. However, in protein molecules, rotations of the Phi and Psi torsion angles are restricted to certain values due to steric collisions between main chain and side chain atoms. Moreover, protein regions that form a particular secondary structure impose additional constraints on the values of these torsion angles. This was noticed by Ramachandran and colleagues in [50]. The chart showing the real values of the Phi and Psi angles and the possible combinations of these values for various types of secondary structures is known today as the Ramachandran plot (Fig. 1.15). On the Ramachandran plot, values of the Phi angle are plotted on the x-axis and values of the Psi angle are plotted on the y-axis. For many years, the Ramachandran plot has been widely used by scientists to validate torsion angles and assess the quality and correctness of protein structures obtained by means of experimental methods (X-ray crystallography and NMR spectroscopy) or by homology modeling [13]. For example, Ramachandran plots are created by the popular PROCHECK, a program that provides a detailed check on the stereochemistry of protein structures [27].
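The calculation in formulas (1.34) and (1.35) translates directly into code. The Python sketch below (NumPy assumed; atom coordinates are hypothetical) computes a signed dihedral angle from four successive backbone atoms, e.g., C′(i−1), N(i), Cα(i), C′(i) for Phi. Note that the cosine formula alone loses the sign of the angle; recovering it via arctan2, as done here, is one common convention beyond what formula (1.35) states:

```python
import numpy as np

def dihedral(ai, aj, ak, al):
    """Signed torsion angle (degrees) defined by four successive atoms,
    in the spirit of formulas (1.34)-(1.35): the normals of the planes
    (ai, aj, ak) and (aj, ak, al) are compared via their dot product."""
    ai, aj, ak, al = map(np.asarray, (ai, aj, ak, al))
    b1, b2, b3 = aj - ai, ak - aj, al - ak
    n1 = np.cross(b1, b2)                 # normal of the first plane
    n2 = np.cross(b2, b3)                 # normal of the second plane
    m = np.cross(n1, b2 / np.linalg.norm(b2))
    return np.degrees(np.arctan2(np.dot(m, n2), np.dot(n1, n2)))

# Hypothetical coordinates of C'(i-1), N(i), CA(i), C'(i):
phi = dihedral((0.0, 1.3, 0.0), (0.0, 0.0, 0.0),
               (1.4, -0.5, 0.0), (2.0, -0.3, 1.2))
print(phi)  # Phi torsion angle in degrees, in (-180, 180]
```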
Fig. 1.15 Ramachandran plot showing distribution of torsion angles Phi and Psi for a sample protein structure. Generated by PROCHECK program [27]
1.5 Energy Properties of Protein Structures

The protein structure $S^{3D}$ can also be analyzed in terms of the forces that act on each atom within the molecule. In such an approach, atoms are considered as masses that interact with each other. Various forces between interacting atoms cause changes in the potential energy of the molecular system $S^{3D}$. The molecular system can then be modeled by molecular mechanics, where the potential energy of the set of atoms $A^{3D}$ is described by empirical force fields providing a functional form for the potential energy and containing a set of parameters for the particular atoms in the set $A^{3D}$. This kind of description of the molecular system $S^{3D}$ usually takes place while studying the molecular dynamics of proteins or modeling the protein structure by minimizing the conformational energy. Scientists assume here that when a protein stabilizes the positions of its atoms, the energy of such a molecular system is minimized. Consequently, any changes in the protein conformation causing deviations of bond lengths, angles, and intermolecular distances from reference values come with energy penalties [28].
There are various types of force fields, derived experimentally or by using quantum mechanical calculations. The most popular ones include Assisted Model Building and Energy Refinement (AMBER) [12], Chemistry at HARvard Molecular Mechanics (CHARMM) [6], and the GROningen MOlecular Simulation package (GROMOS) [47], but there are also many others. These force fields provide different functional forms that model the potential energy of the molecular system $S^{3D}$. However, they usually contain the following common energy terms:

$$ E_T(S^{3D}) = E_{BS} + E_{AB} + E_{TA} + E_{VDW} + E_{CC}, \tag{1.36} $$
where $E_T(S^{3D})$ denotes the total potential energy, and the particular component energies contributing to the total potential $E_T$ are defined as follows:

• bond stretching ($E_{BS}$)

$$ E_{BS}(S^{3D}) = \sum_{j=1}^{\text{bonds}} \frac{k_j}{2} (d_j - d_j^0)^2, \tag{1.37} $$

where $k_j$ is a bond stretching force constant, $d_j$ is the distance between two atoms (the real bond length), and $d_j^0$ is the optimal bond length;

• angle bending ($E_{AB}$)

$$ E_{AB}(S^{3D}) = \sum_{j=1}^{\text{angles}} \frac{k_j}{2} (\theta_j - \theta_j^0)^2, \tag{1.38} $$

where $k_j$ is a bending force constant, $\theta_j$ is the actual value of the valence angle, and $\theta_j^0$ is the optimal valence angle;

• torsional angle ($E_{TA}$)

$$ E_{TA}(S^{3D}) = \sum_{j=1}^{\text{torsions}} \frac{V_j}{2} (1 + \cos(n\omega - \gamma)), \tag{1.39} $$

where $V_j$ denotes the height of the torsional barrier, $n$ is the periodicity, $\omega$ is the torsion angle, and $\gamma$ is a phase factor;

• van der Waals ($E_{VDW}$)

$$ E_{VDW}(S^{3D}) = \sum_{k=1}^{N} \sum_{j=k+1}^{N} 4\varepsilon_{kj} \left[ \left( \frac{\sigma_{kj}}{r_{kj}} \right)^{12} - \left( \frac{\sigma_{kj}}{r_{kj}} \right)^{6} \right], \tag{1.40} $$

where $r_{kj}$ denotes the distance between atoms $k$ and $j$, $\sigma_{kj}$ is a collision diameter, $\varepsilon_{kj}$ is a well depth, and $N$ is the number of atoms in the structure $S^{3D}$;
Fig. 1.16 Schematic interpretation of bonded interactions: (top left) bond stretching, (top right) angle bending, (bottom) torsional angle
Fig. 1.17 Schematic interpretation of non-bonded interactions: (left) electrostatic, (right) van der Waals
• electrostatic ($E_{CC}$), also known as Coulomb or charge–charge

$$ E_{CC}(S^{3D}) = \sum_{k=1}^{N} \sum_{j=k+1}^{N} \frac{q_k q_j}{4\pi\varepsilon_0 r_{kj}}, \tag{1.41} $$
where $q_k$, $q_j$ are atomic charges, $r_{kj}$ denotes the distance between atoms $k$ and $j$, $\varepsilon_0$ is the dielectric constant, and $N$ is the number of atoms in the structure $S^{3D}$. The first three terms are called bonded interactions, since they occur between atoms that are covalently bonded. Their graphical interpretation is shown in Fig. 1.16. The last two terms are referred to as non-bonded interactions, since they occur between non-bonded atoms. Graphical interpretations of these two terms are shown in Fig. 1.17.
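As an illustration of the non-bonded terms, the following Python sketch evaluates formulas (1.40) and (1.41) over all atom pairs. The charges and Lennard-Jones parameters are hypothetical, units are deliberately left abstract, and the arithmetic/geometric means used for the pairwise parameters are a common but not universal combining rule:

```python
import numpy as np

def nonbonded_energy(coords, charges, sigma, epsilon, coulomb_const=1.0):
    """Sum of the van der Waals (Eq. 1.40) and electrostatic (Eq. 1.41)
    terms over all unordered atom pairs. coulomb_const stands in for
    1/(4*pi*eps0); units are abstract in this sketch."""
    n = len(coords)
    e_vdw, e_cc = 0.0, 0.0
    for k in range(n):
        for j in range(k + 1, n):
            r = np.linalg.norm(np.asarray(coords[k]) - np.asarray(coords[j]))
            s = 0.5 * (sigma[k] + sigma[j])        # combining rule (assumed)
            e = np.sqrt(epsilon[k] * epsilon[j])   # combining rule (assumed)
            e_vdw += 4.0 * e * ((s / r) ** 12 - (s / r) ** 6)
            e_cc += coulomb_const * charges[k] * charges[j] / r
    return e_vdw, e_cc

# Three hypothetical atoms:
coords = [(0.0, 0.0, 0.0), (3.5, 0.0, 0.0), (0.0, 4.0, 0.0)]
print(nonbonded_energy(coords, charges=[0.3, -0.2, -0.1],
                       sigma=[3.4, 3.4, 3.1], epsilon=[0.1, 0.1, 0.07]))
```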
There can be more energy terms in the function describing the total potential energy. A further description of these and other component energies is beyond the scope of this book; readers interested in the details of these potentials are encouraged to read the book Molecular Modelling: Principles and Applications by A. Leach [28]. As can also be seen, calculations of the potential energy require both components $A^{3D}$ and $B^{3D}$ of the defined model of the protein structure $S^{3D}$, as well as some of the relative coordinates that can be derived from the structure. It is also worth noting methods that make use of the energy properties of protein structures for investigating protein sequence–structure–function relationships, protein conformational modifications [34], and protein activity in cellular reactions [10, 22, 44]. Representatives of these methods are ePros [20] and successive versions of the EAST method [30, 35, 41], including FN-EAST [37] and FS-EAST [32, 42], which use the EDB database [38] and the EDML data exchange format [43].
1.6 Summary

The model of protein structure $S^{3D}$ shown in this chapter has a general purpose and can be used to describe protein molecules in many different processes related to functional genomics, comparative biology, and molecular modeling. In fact, protein structures can be described by many various features, and those presented in this chapter do not cover all of them. Which of the features are used depends on the particular process. However, most of them, if not all, can be derived from the general model $S^{3D}$. The general model of protein structure shown in this chapter is especially useful in any process related to protein modeling, drug design, or protein structure comparison. In these processes, acting at the level of individual atoms and inspecting their positions is particularly important. Some of the methods and particular representations of protein structures will be shown in the following chapters of the book.
References

1. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic local alignment search tool. J. Mol. Biol. 215(3), 403–410 (1990). http://www.sciencedirect.com/science/article/pii/S0022283605803602
2. Arnold, K., Bordoli, L., Kopp, J., Schwede, T.: The SWISS-MODEL workspace: a web-based environment for protein structure homology modelling. Bioinformatics 22(2), 195–201 (2006)
3. Berardi, M., Bushweller, J.: Binding specificity and mechanistic insight into glutaredoxin-catalyzed protein disulfide reduction. J. Mol. Biol. 292, 151–161 (1999)
4. Berman, H., et al.: The Protein Data Bank. Nucleic Acids Res. 28, 235–242 (2000)
5. Branden, C., Tooze, J.: Introduction to protein structure, 2nd edn. Garland Science, New York (1999)
6. Brooks, B.R., Bruccoleri, R.E., Olafson, B.D., States, D.J., Swaminathan, S., Karplus, M.: CHARMM: A program for macromolecular energy, minimization, and dynamics calculations. J. Comput. Chem. 4(2), 187–217 (1983). https://onlinelibrary.wiley.com/doi/abs/10.1002/jcc.540040211
7. Brown, N.R., Noble, M.E.M., Lawrie, A.M., Morris, M.C., Tunnah, P., Divita, G., Johnson, L.N., Endicott, J.A.: Effects of phosphorylation of threonine 160 on cyclin-dependent kinase 2 structure and activity. J. Biol. Chem. 274(13), 8746–8756 (1999). http://www.jbc.org/content/274/13/8746.abstract
8. Burkowski, F.: Structural bioinformatics: an algorithmic approach, 1st edn. Chapman and Hall/CRC, Boca Raton (2008)
9. Can, T., Wang, Y.F.: CTSS: a robust and efficient method for protein structure alignment based on local geometrical and biological features. In: Computational Systems Bioinformatics. CSB2003. Proceedings of the 2003 IEEE Bioinformatics Conference, pp. 169–179 (2003)
10. Chen, P.Y., Lin, K.C., Lin, J.P., Tang, N.Y., Yang, J.S., Lu, K.W., Chung, J.G.: Phenethyl isothiocyanate (PEITC) inhibits the growth of human oral squamous carcinoma HSC-3 cells through G0/G1 phase arrest and mitochondria-mediated apoptotic cell death. Evidence-Based Complementary and Alternative Medicine 2012 (Article ID 718320), 1–12 (2012)
11. Chime and Jmol Homepage: Molecular Visualization Resources. http://www.umass.edu/microbio/chime/
12. Cornell, W., Cieplak, P., Bayly, C., Gould, I., Merz, K.J., Ferguson, D., Spellmeyer, D., Fox, T., Caldwell, J., Kollman, P.: A second generation force field for the simulation of proteins, nucleic acids, and organic molecules. J. Am. Chem. Soc. 117, 5179–5197 (1995)
13. Eswar, N., Webb, B., Marti-Renom, M.A., Madhusudhan, M., Eramian, D., Shen, M.Y., Pieper, U., Sali, A.: Comparative protein structure modeling using Modeller. Curr. Protoc. Bioinform. 15(1), 5.6.1–5.6.30 (2014). https://currentprotocols.onlinelibrary.wiley.com/doi/abs/10.1002/0471250953.bi0506s15
14. Fermi, G., Perutz, M., Shaanan, B., Fourme, R.: The crystal structure of human deoxyhaemoglobin at 1.74 A resolution. J. Mol. Biol. 175, 159–174 (1984)
15. Frishman, D., Argos, P.: 75% accuracy in protein secondary structure prediction. Proteins 27, 329–335 (1997)
16. Gans, J., Shalloway, D.: Qmol: a program for molecular visualization on Windows-based PCs. J. Mol. Graph. Model. 19(6), 557–559 (1998)
17. Garnier, J., Gibrat, J., Robson, B.: GOR method for predicting protein secondary structure from amino acid sequence. Methods Enzym. 266, 540–553 (1996)
18. Hammel, L., Patel, J.M.: Searching on the secondary structure of protein sequences. In: Bernstein, P.A., Ioannidis, Y.E., Ramakrishnan, R., Papadias, D. (eds.) VLDB '02: Proceedings of the 28th International Conference on Very Large Databases, pp. 634–645. Morgan Kaufmann, San Francisco (2002)
19. Harrington, D.J., Adachi, K., Royer, W.E.: The high resolution crystal structure of deoxyhemoglobin S. J. Mol. Biol. 272(3), 398–407 (1997). http://www.sciencedirect.com/science/article/pii/S0022283697912535
20. Heinke, F., Schildbach, S., Stockmann, D., Labudde, D.: eProS – a database and toolbox for investigating protein sequence-structure-function relationships through energy profiles. Nucleic Acids Res. 41(D1), D320–D326 (2013). http://dx.doi.org/10.1093/nar/gks1079
21. Holm, L., Kaariainen, S., Rosenstrom, P., Schenkel, A.: Searching protein structure databases with DaliLite v. 3. Bioinformatics 24, 2780–2781 (2008)
22. Hong, H.J., Chen, P.Y., Shih, T.C., Ou, C.Y., Jhuo, M.D., Huang, Y.Y., Cheng, C.H., Wu, Y.C., Chung, J.G.: Computational pharmaceutical analysis of anti-Alzheimer's Chinese medicine Coptidis Rhizoma alkaloids. Mol. Med. Rep. 5(1), 142–147 (2012)
23. Jmol Homepage: Jmol: An Open-source Java Viewer for Chemical Structures in 3D. http://www.jmol.org
24. Källberg, M., Wang, H., Wang, S., Peng, J., Wang, Z., Lu, H., Xu, J.: Template-based protein structure modeling using the RaptorX web server. Nat. Protoc. 7, 1511–1522 (2012)
25. Kelley, L., Sternberg, M.: Protein structure prediction on the web: a case study using the Phyre server. Nat. Protoc. 4(3), 363–371 (2009)
26. Kim, D., Chivian, D., Baker, D.: Protein structure prediction and analysis using the Robetta server. Nucleic Acids Res. 32(Suppl 2), W526–W531 (2004)
27. Laskowski, R.A., MacArthur, M.W., Moss, D.S., Thornton, J.M.: PROCHECK: a program to check the stereochemical quality of protein structures. J. Appl. Crystallogr. 26(2), 283–291 (1993). https://onlinelibrary.wiley.com/doi/abs/10.1107/S0021889892009944
28. Leach, A.: Molecular modelling: principles and applications, 2nd edn. Pearson Education EMA, Essex (2001)
29. Leaver-Fay, A., Tyka, M., Lewis, S., Lange, O., Thompson, J., Jacak, R.: ROSETTA3: an object-oriented software suite for the simulation and design of macromolecules. Methods Enzym. 487, 545–574 (2011)
30. Małysiak, B., Momot, A., Kozielski, S., Mrozek, D.: On using energy signatures in protein structure similarity searching. In: Rutkowski, L., Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds.) Artificial Intelligence and Soft Computing – ICAISC 2008. Lecture Notes in Computer Science, vol. 5097, pp. 939–950. Springer, Berlin (2008)
31. Małysiak-Mrozek, B., Kozielski, S., Mrozek, D.: Server-side query language for protein structure similarity searching, pp. 395–415. Springer, Berlin (2012). https://doi.org/10.1007/978-3-642-23172-8_26
32. Małysiak-Mrozek, B., Mrozek, D.: An improved method for protein similarity searching by alignment of fuzzy energy signatures. Int. J. Comput. Intell. Syst. 4(1), 75–88 (2011). https://doi.org/10.1080/18756891.2011.9727765
33. Mrozek, D., Brozek, M., Małysiak-Mrozek, B.: Parallel implementation of 3D protein structure similarity searches using a GPU and the CUDA. J. Mol. Model. 20, 2067 (2014)
34. Mrozek, D., Małysiak, B., Kozielski, S.: Energy profiles in detection of protein structure modifications. In: 2006 International Conference on Computing Informatics, pp. 1–6 (2006)
35. Mrozek, D., Małysiak, B., Kozielski, S.: An optimal alignment of proteins energy characteristics with crisp and fuzzy similarity awards. In: 2007 IEEE International Fuzzy Systems Conference, pp. 1–6 (2007)
36. Mrozek, D., Małysiak-Mrozek, B.: CASSERT: A two-phase alignment algorithm for matching 3D structures of proteins. In: Kwiecień, A., Gaj, P., Stera, P. (eds.) Computer Networks. Communications in Computer and Information Science, vol. 370, pp. 334–343. Springer International Publishing, Berlin (2013)
37. Mrozek, D., Małysiak-Mrozek, B., Kozielski, S.: Alignment of protein structure energy patterns represented as sequences of fuzzy numbers. In: NAFIPS 2009 – 2009 Annual Meeting of the North American Fuzzy Information Processing Society, pp. 1–6 (2009)
38. Mrozek, D., Małysiak-Mrozek, B., Kozielski, S., Świerniak, A.: The energy distribution data bank: Collecting energy features of protein molecular structures. In: 2009 Ninth IEEE International Conference on Bioinformatics and BioEngineering, pp. 301–306 (2009)
39. Mrozek, D., Mastej, A., Małysiak, B.: Protein molecular viewer for visualizing structures stored in the PDBML format. In: Pietka, E., Kawa, J. (eds.) Information Technologies in Biomedicine, AISC, vol. 47, pp. 377–386. Springer, Berlin (2008)
40. Mrozek, D., Wieczorek, D., Małysiak-Mrozek, B., Kozielski, S.: PSS-SQL: protein secondary structure – structured query language. In: 2010 Annual International Conference of the IEEE Engineering in Medicine and Biology, pp. 1073–1076 (2010)
41. Mrozek, D., Małysiak, B., Kozielski, S.: EAST: energy alignment search tool. In: Wang, L., Jiao, L., Shi, G., Li, X., Liu, J. (eds.) Fuzzy Systems and Knowledge Discovery. Lecture Notes in Computer Science, vol. 4223, pp. 696–705. Springer, Berlin (2006)
42. Mrozek, D., Małysiak-Mrozek, B., Kozielski, S.: Protein comparison by the alignment of fuzzy energy signatures. In: Wen, P., Li, Y., Polkowski, L., Yao, Y., Tsumoto, S., Wang, G. (eds.) Rough Sets and Knowledge Technology. Lecture Notes in Computer Science, vol. 5589, pp. 289–296. Springer, Berlin (2009)
43. Mrozek, D., Małysiak-Mrozek, B., Kozielski, S., Górczyńska-Kosiorz, S.: The EDML format to exchange energy profiles of protein molecular structures. In: Huang, D.S., Jo, K.H., Lee, H.H., Kang, H.J., Bevilacqua, V. (eds.) Emerging Intelligent Computing Technology and Applications. Lecture Notes in Computer Science, vol. 5754, pp. 146–157. Springer, Berlin (2009)
44. Mrozek, D., Małysiak-Mrozek, B., Kozielski, S., Górczyńska-Kosiorz, S.: Energy properties of protein structures in the analysis of the human RAB5A cellular activity. In: Cyran, K.A., Kozielski, S., Peters, J.F., Stańczyk, U., Wakulicz-Deja, A. (eds.) Man-Machine Interactions. Advances in Intelligent and Soft Computing, vol. 59, pp. 121–131. Springer, Berlin (2009)
45. Mrozek, D., Socha, B., Kozielski, S., Małysiak-Mrozek, B.: An efficient and flexible scanning of databases of protein secondary structures. J. Intell. Inf. Syst. 46(1), 213–233 (2016). https://doi.org/10.1007/s10844-014-0353-0
46. Needleman, S.B., Wunsch, C.D.: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48(3), 443–453 (1970). http://www.sciencedirect.com/science/article/pii/0022283670900574
47. Oostenbrink, C., Villa, A., Mark, A.E., Van Gunsteren, W.F.: A biomolecular force field based on the free enthalpy of hydration and solvation: The GROMOS force field parameter sets 53A5 and 53A6. J. Comput. Chem. 25(13), 1656–1676 (2004). https://onlinelibrary.wiley.com/doi/abs/10.1002/jcc.20090
48. Orengo, C.A., Taylor, W.R.: A local alignment method for protein structure motifs. J. Mol. Biol. 233(3), 488–497 (1993). http://www.sciencedirect.com/science/article/pii/S0022283683715263
49. Pearson, W.R.: Flexible sequence similarity searching with the FASTA3 program package. Methods Mol. Biol. 132, 185–219 (2000)
50. Ramachandran, G., Ramakrishnan, C., Sasisekharan, V.: Stereochemistry of polypeptide chain configurations. J. Mol. Biol. 7, 95–99 (1963)
51. Rost, B., Liu, J.: The PredictProtein server. Nucleic Acids Res. 31(13), 3300–3304 (2003)
52. Sayle, R.: RasMol, Molecular graphics visualization tool. Biomolecular Structures Group, Glaxo Wellcome Research & Development, Stevenage, Hertfordshire (2013). http://www.umass.edu/microbio/rasmol/
53. Schrödinger, LLC: The PyMOL molecular graphics system, version 1.3r1 (2010). http://www.pymol.org
54. Schulz, G., Schirmer, R.: Principles of protein structure. Springer, New York (1979)
55. Shapiro, J., Brutlag, D.: FoldMiner and LOCK2: protein structure comparison and motif discovery on the Web. Nucleic Acids Res. 32, 536–541 (2004)
56. Shindyalov, I., Bourne, P.: Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 11(9), 739–747 (1998)
57. Smith, G.D., Dodson, G.G.: The structure of a rhombohedral R6 insulin hexamer that binds phenol. Biopolymers 32(4), 441–445 (1992). https://onlinelibrary.wiley.com/doi/abs/10.1002/bip.360320422
58. Smith, T., Waterman, M.: Identification of common molecular subsequences. J. Mol. Biol. 147(1), 195–197 (1981). http://www.sciencedirect.com/science/article/pii/0022283681900875
59. Söding, J.: Protein homology detection by HMM-HMM comparison. Bioinformatics 21(7), 951–960 (2005)
60. Stanek, D., Mrozek, D., Małysiak-Mrozek, B.: MViewer: Visualization of protein molecular structures stored in the PDB, mmCIF and PDBML data formats. In: Kwiecień, A., Gaj, P., Stera, P. (eds.) Computer Networks. Communications in Computer and Information Science, vol. 370, pp. 323–333. Springer, Berlin (2013)
61. Warecki, S., Znamirowski, L.: Random simulation of the nanostructures conformations. In: Proceedings of International Conference on Computing, Communication and Control Technology, vol. 1, pp. 388–393. The International Institute of Informatics and Systemics, Austin, Texas (2004)
62. Watson, H.: The stereochemistry of the protein myoglobin. Prog. Stereochem. 4, 299 (1969)
63. Wu, S., Skolnick, J., Zhang, Y.: Ab initio modeling of small proteins by iterative TASSER simulations. BMC Biol. 5(17) (2007)
64. Xu, D., Zhang, Y.: Ab initio protein structure assembly using continuous structure fragments and optimized knowledge-based force field. Proteins 80(7), 1715–1735 (2012)
65. Xu, J., Li, M., Kim, D., Xu, Y.: RAPTOR: optimal protein threading by linear programming, the inaugural issue. J. Bioinform. Comput. Biol. 1(1), 95–117 (2003)
66. Yang, Y., Faraggi, E., Zhao, H., Zhou, Y.: Improving protein fold recognition and template-based modeling by employing probabilistic-based matching between predicted one-dimensional structural properties of query and corresponding native properties of templates. Bioinformatics 27(15), 2076–2082 (2011)
67. Ye, Y., Godzik, A.: Flexible structure alignment by chaining aligned fragment pairs allowing twists. Bioinformatics 19(2), 246–255 (2003)
68. Zhu, J., Weng, Z.: FAST: a novel protein structure alignment algorithm. Proteins 58, 618–627 (2005)
69. Znamirowski, L.: Non-gradient, sequential algorithm for simulation of nascent polypeptide folding. In: Sunderam, V.S., van Albada, G.D., Sloot, P.M., Dongarra, J.J. (eds.) Computational Science – ICCS 2005. Lecture Notes in Computer Science, vol. 3514, pp. 766–774. Springer, Berlin (2005)
Chapter 2
Technological Roadmap
Technology is a useful servant but a dangerous master
Christian Lous Lange

Abstract Scientific solutions presented in this book rely on various technologies that emerged in computer science. Some of them emerged recently and are quite new in the bioinformatics field. Others have been widely used for many years in developing efficient and reliable IT systems supporting various forms of business, but are not frequently used in bioinformatics. This chapter provides a technological road map for the solutions presented in this book. It covers a brief introduction to the concept of cloud computing, cloud service, and deployment models. It also defines the Big Data challenge and presents the benefits of using multi-threading in scientific computations. It then explains graphics processing units (GPU) and the CUDA architecture. Finally, it focuses on relational databases and the SQL language used for declarative querying.

Keywords Cloud computing · Big Data · Multi-threading · GPU · CUDA · Relational databases · SQL

Driven by huge amounts of biological data, bioinformatics and computational biology have made huge progress in recent years and have grown into disciplines that lie at the heart of biological research. The data-driven nature of both disciplines makes it necessary to reach for new technological achievements in computer science, as well as for proven technologies that have a well-established position supported by years of experience. This chapter shows a technological road map for the solutions presented in the whole book. It provides definitions and explanations of general concepts of leading IT technologies that were successfully utilized to implement the programs, systems, and solutions presented in this book, including:

• Cloud computing,
• Big Data,
• Multi-threading,
• Graphics processing units and CUDA,
• Relational databases and SQL.

Those of you who are already familiar with these concepts can skip this chapter and go directly to the next part of this book. However, I recommend at least briefly reviewing its content. This chapter is mainly addressed to those readers who have good biological knowledge, but feel deficiencies in modern IT technologies or do not know them at all.
2.1 Cloud Computing

Cloud computing is a leading technology today that delivers various computer resources for deploying IT solutions in a cost-effective and efficient way. For example, instead of buying new hardware for storing and processing data, companies can simply use the hardware, with all its storage and compute resources, available in the cloud, and perform their activities on the leased hardware. There can be various reasons for doing this, including the following:

• a company or an institution needs to perform resource-demanding computations occasionally, and does not want to buy new, costly hardware for this purpose,
• a company or an institution needs quick access to higher than average computer resources, but cannot afford to buy them,
• the predicted or planned utilization of computer resources in a company or an institution will fluctuate highly, so there is no need to invest in huge computer resources that would remain unused most of the time,
• a company or an institution wants to use computer resources, but does not want to maintain the computer infrastructure and prefers somebody else to do this,
• a company or an institution wants to quickly test its IT solutions before investing in hardware that will be kept on premises.

These are just several sample scenarios, but there can be many others. Cloud computing provides huge amounts of computational power that can be provisioned on a pay-as-you-go basis. Cloud computing emerged as a result of requirements for the public availability of computing power, new technologies for data processing, and the need for their global standardization, becoming a mechanism allowing control over the development of hardware and software resources by introducing the idea of virtualization. Cloud computing is a model that allows convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction [18]. The idea of delivering computer resources in the Cloud is similar to delivering electricity to homes through electrical outlets. Anyone can plug home equipment into an electrical outlet and perform an activity, e.g., do the laundry, iron, write emails on a laptop, or vacuum the
flat. Then they have to pay for the electricity they used. The idea of the Cloud is similar: you can plug into the Cloud, use its resources and services, and then pay for what you used. The use of cloud platforms can be particularly beneficial for companies and institutions that need to quickly gain access to a computer system with higher than average computing power. In this case, the use of cloud computing services can be more cost-effective and faster to implement than using owned resources (servers and computing clusters) or buying new ones. For this reason, cloud computing is widely used in business, and according to Forbes [17], the market value of such services will significantly increase in the coming years. According to the National Institute of Standards and Technology (NIST), there are five essential characteristics of Cloud computing [18]:

1. On-demand self-service—provisioning computer resources (storage, processing, memory, network bandwidth) can be performed on demand and automatically, without any interaction with the cloud service provider;
2. Broad network access—resources and services are available over the network through the use of various client platforms, like computers, laptops, mobile phones, or tablets;
3. Resource pooling—resources are pooled, can be dynamically allocated and freed according to the customer's will, may be delivered to multiple customers, and may be located in various independent data centers;
4. Rapid elasticity—resources and services can be provisioned and released elastically, scaled according to current needs, on demand or automatically;
5. Measured service—the utilization of resources and services is monitored, controlled, optimized, and measured, and appropriate reports are delivered to the customer.

For understanding what the Cloud gives to users, it is also important to know the cloud service models and cloud deployment models explained in the following two sections.
2.1.1 Cloud Service Models

Service models define the types of services that can be accessed on a cloud computing platform. Each of the service models also provides a level of abstraction for users that deploy their solutions in the Cloud. This abstraction level reduces the amount of work required when developing and deploying the cloud-based solution. Among many others, three types of services are universally accepted. They are usually presented in the form of a stack, as in Fig. 2.1. The basis of the stack of services provided in clouds (Fig. 2.1) is the Infrastructure as a Service (IaaS) layer. IaaS provides basic computing resources in a virtualized form, including processing power (CPUs), memory (RAM), storage space, and appropriate bandwidth for transferring data over the network, making it possible to
Fig. 2.1 Cloud service models defining types of components that will be delivered to the consumer: (1) Infrastructure as a Service (IaaS), (2) Platform as a Service (PaaS), (3) Software as a Service (SaaS)
deploy and run any application. These resources are now available as services that can be managed through appropriate management portals (e.g., https://portal.azure.com) or by executing code scripts (e.g., PowerShell scripts). The IaaS service provider is responsible for the cloud infrastructure and its maintenance. This is how the American NIST defines Infrastructure as a Service [18]:

The capability provided to the consumer is to provision processing, storage, networks, and other fundamental computing resources where the consumer is able to deploy and run arbitrary software, which can include operating systems and applications. The consumer does not manage or control the underlying cloud infrastructure but has control over operating systems, storage, and deployed applications; and possibly limited control of select networking components (e.g., host firewalls).
Platform as a Service (PaaS) allows consumers to create custom applications based on a variety of services delivered by the cloud provider. In addition to what IaaS offers, PaaS provides operating systems, applications, development platforms, transactions, and control structures. The cloud provider manages the infrastructure, operating systems, and provided tools. NIST defines Platform as a Service as follows [18]:

The capability provided to the consumer is to deploy onto the cloud infrastructure consumer-created or acquired applications created using programming languages, libraries, services, and tools supported by the provider. The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, or storage, but has control over the deployed applications and possibly configuration settings for the application-hosting environment.
Software as a Service (SaaS) provides services and applications, with their user interfaces, that are available on an on-demand basis. The consumer is provided with an application running in the cloud infrastructure. The consumer does not take care of the infrastructure, operating systems, and other components underlying the application. The consumer's only responsibility is appropriate interaction with the user interface and entering appropriate data into the application. The user can also change the configuration of the application and customize the user interface, if possible. NIST defines Software as a Service as follows [18]:
The capability provided to the consumer is to use the provider’s applications running on a cloud infrastructure. The applications are accessible from various client devices through either a thin client interface, such as a web browser (e.g., web-based email), or a program interface. The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, storage, or even individual application capabilities, with the possible exception of limited user-specific application configuration settings.
2.1.2 Cloud Deployment Models

Deployment models decide where the infrastructure of the cloud will be located and managed, and who will use the cloud-based solution. We can distinguish four widely accepted types (Fig. 2.2) [18]:

• Public cloud—the infrastructure of the cloud is available for public use or for a large industry group and is owned by an organization selling cloud services;
• Private cloud—the cloud infrastructure is for the exclusive use of a single organization comprising multiple consumers (e.g., its organizational units), regardless of whether the cloud is managed by the organization or located in its offices; key factors for establishing private clouds seem to be legal constraints, security, reliability, and lower costs for large organizations and dedicated solutions;
• Community cloud—the cloud infrastructure is made available for the exclusive use of a community of consumers from organizations that share common goals or are subject to common legal restrictions;
• Hybrid cloud—the cloud infrastructure is a combination of two or more of the above cloud infrastructures; if needed, it allows the use of public cloud resources to cover a potential increased demand for resources (cloud bursting).
2.2 Big Data Challenge

Big Data is a frequently used term in recent years that arose on the ground of modern techniques of data harvesting, newly identified data sources, and the growing capacities of storage systems that can accommodate the growth of incoming large data volumes. The era of Big Data that we entered several years ago has changed our imagination about the type and the volume of data that can be processed, as well as the value of data. With these modern techniques for data gathering, new sources of data, and extended storage capacities, the amount of data that can be collected and may constitute a value for making important decisions is constantly growing. This is where the Big Data challenge (problem or opportunity) arises.
Fig. 2.2 Cloud deployment models defining who will use the cloud resources and services

Fig. 2.3 5V model of Big Data
2.2.1 The 5V Model of Big Data

The Big Data challenge usually arises when data sets are so large that conventional database management and data analysis tools are insufficient to process them [26]. However, the large amount of data that must be processed (large volume) is not the only characteristic of Big Data. The 5V model of Big Data presented in Fig. 2.3 shows all the characteristics:

• volume—refers to the amount of data generated and stored,
• velocity—refers to the pace at which the data are generated and the speed at which the data are processed to satisfy the requirements of the current analytics,
• variety—refers to the various formats the data are stored in, e.g., structured, unstructured, semi-structured, and the type of the data, e.g., text, images, audio, video,
• veracity—refers to the quality of data, including cleanness and accuracy,
• value—refers to the capability of transforming the data into a valuable source of information that will drive the decisions made by an institution.

Big Data solutions, including those presented in this book, usually address more than one of the Vs. The Big Data challenge is now visible in many fields which are experiencing an explosion of data that are considered relevant, including, e.g.:

• social networks [6–8, 11, 13, 14, 30, 32],
• multimedia processing [29],
• Internet of Things (IoT) [3],
• intelligent transport [12, 15, 16, 31],
• medicine and bioinformatics [21–23, 25],
• finance [1],
and many others [19, 33].
2.2.2 Hadoop Platform

One of the top-notch technologies that address the challenges of Big Data is Apache Hadoop. Apache Hadoop is an open-source software platform that allows storing and processing very large volumes of data in a distributed manner on computer clusters built from commodity hardware. Development of Hadoop started in 2006. Work on the system was inspired by two papers published by Google: "Google File System" [10] in 2003 and "MapReduce: Simplified Data Processing on Large Clusters" [9] in 2008. In general, as a tool, Hadoop allows processing data with the use of the MapReduce processing model and efficiently storing the data with the use of the Hadoop Distributed File System (HDFS, Fig. 2.4). However, over the years the platform has evolved significantly. Today's Hadoop is not only a single processing tool, but a large ecosystem consisting of many tools that provide various services for data storage, data processing, data access, data governance, security, and operations. More details on the Hadoop ecosystem and platform, and the MapReduce processing model, are provided in Chap. 6. Hadoop became a popular platform for processing and storing Big Data sets, since it guarantees the following:

• Scalability and performance—data are processed locally to each node in a cluster in a distributed way, which makes Hadoop capable of storing, managing, processing, and analyzing data at petabyte scale.
• Reliability—data are automatically replicated across cluster storage nodes in preparation for future node failures. Possible failures of individual nodes in the cluster cause re-direction of computations to another node holding the replica.
• Flexibility—data can be stored in any format, including semi-structured or unstructured formats. Any schema is applied to the data on read, when the data are needed for processing.
Fig. 2.4 General architecture of the Hadoop platform: the MapReduce processing model on top of the Hadoop Distributed File System
• Low costs—Hadoop does not require specialized hardware; it can be set up on low-cost commodity hardware.
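To give a feel for the MapReduce model mentioned above, the following self-contained Python sketch simulates the map, shuffle, and reduce phases in a single process on the canonical word-count example. This is not Hadoop code; a real cluster would distribute the map tasks across nodes and shuffle intermediate pairs over the network:

```python
from collections import defaultdict
from itertools import chain

def map_phase(record):
    """Map: emit (key, value) pairs -- here (word, 1) for each word."""
    return [(word.lower(), 1) for word in record.split()]

def shuffle(pairs):
    """Shuffle: group all intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: aggregate the values of one key -- here, sum the counts."""
    return key, sum(values)

records = ["MapReduce simplified data processing",
           "Hadoop stores data in HDFS and processes data with MapReduce"]
pairs = chain.from_iterable(map_phase(r) for r in records)
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["data"])       # 3
print(counts["mapreduce"])  # 2
```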
2.3 Multi-threading and Multi-threaded Applications

In recent years, the increased compute capabilities of many computers have been achieved by increasing the number of CPUs (processors) or CPU cores. Multi-core processors are computing units with two or more independent processing subelements called cores. A typical processor in a personal computer today has two, four, or eight such CPU cores. The CPU cores are able to read and execute program instructions. A single CPU can run multiple instructions on each of the cores simultaneously. This may lead to increased overall performance of executed programs or the tasks run by the operating system. In Fig. 2.5, we can see a simplified architecture of a multi-core processor running two concurrent threads on each of the CPU cores through the use of Intel hyper-threading technology. Multi-level caches, including L1, L2, and (if available) L3 and L4 caches, provide fast access to data residing in the RAM memory (off-chip) of the computer, reduce the memory latency, and speed up the execution of instructions on CPU cores, leading to increased performance. Gaining performance out of a multi-core processor is further possible through the use of one of the parallel processing techniques, like multi-threading. In the hardware sense, multi-threading refers to performing multiple threads simultaneously on one core of the processor. Modern CPUs may have significant time periods when they are idle, resulting from missing data in the CPU cache, wrong execution paths for branch instructions, and dependencies between processed data. Multi-threading allows using this idle time for the execution of another thread. For example, the hyper-threading technology, originally provided by Intel in 2002 for Xeon server processors and Pentium 4 desktop processors, allows the operating system to see two logical CPU cores for each physical core. Then, the workload can be shared
Fig. 2.5 Multi-core processor running two concurrent threads per each CPU core in hyper-threading
between these logical or virtual cores by executing two threads in parallel on a single CPU core. However, gaining performance for a single process requires creating a multi-threaded application. In the software development sense, multi-threading is the programming model that allows creating and executing multiple threads within the context of one process. Threads are pieces of the whole program or process that perform particular parts of the process concurrently. They can share some of the resources of the whole process, like memory or files. However, they are able to execute independently. In Fig. 2.6, we can see how parallel execution of threads may speed up the execution of a sample process or the interaction with multiple users. In the first scenario, presented in Fig. 2.6a, two threads of a multi-threaded application process input data independently in parallel. In the most optimistic case, when there are no dependencies between processed input data (embarrassingly parallel processing), the performance of the processing should double. In the second scenario, presented in Fig. 2.6b, two threads of a multi-threaded database server accept query requests for data from two users. These requests are executed in two independent threads, which avoids queuing users' requests and increases processing efficiency in a multi-user environment. In this scenario, an additional gain in performance can be achieved if data are cached on the server, which decreases the number of I/O operations that must be performed against the database storage. Multi-threaded programs provide several advantages, including:

• Parallelization of the main process—multiple threads can be used to divide the whole computational process into smaller tasks, and each thread can perform a piece of work leading to the completion of the whole process.
Fig. 2.6 Multi-threaded applications in two scenarios: a two parallel threads process independent parts of data, b concurrent execution of users' query requests against a database server
• Faster execution—parallel, multi-threaded execution of the process on multi-core CPUs usually leads to increased performance and faster completion of the whole process.
• Responsiveness—separating the working threads from the main thread of the program allows performing the required computations without the main program freezing and becoming non-responsive to interaction with a user.
• Efficient communication—as threads may share the process's resources, like memory and process variables, the communication between them is usually more efficient.
• Better usage of resources—a program may use multiple threads to serve multiple users or external applications at the same time, using the same resources in the interaction with the users or applications. For example, a database management system may use multiple threads to execute users' queries concurrently and utilize cached data for many query requests.
Multi-threaded programs also have some disadvantages, e.g.:
• Synchronization—running threads that perform a piece of work for the main process may need to be synchronized from time to time, e.g., in situations when the performed computations depend on the output of other threads, or in order
to perform computations in a correct order. This usually slows down the parallel execution, as some threads have to wait for others.
• Thread crashes—exceptions that occur within threads may lead to the crash of the whole process, including the other threads that perform parts of it.
An example of a multi-threaded process for the exploration of protein secondary structures is presented in Chap. 11.
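Before moving on, the following minimal C# sketch makes the multi-threading programming model concrete. It processes the elements of an array in parallel with Parallel.For from the .NET Task Parallel Library; the input data and the Process function are hypothetical placeholders, not code from the systems described in this book.

using System;
using System.Threading.Tasks;

public class ParallelProcessingDemo
{
    public static void Main()
    {
        // Hypothetical input data; each element can be processed independently
        double[] input = new double[1000000];
        double[] output = new double[input.Length];

        // Parallel.For partitions the iteration range among threads from the
        // .NET thread pool; since the iterations are independent, no
        // synchronization between the threads is required
        Parallel.For(0, input.Length, i =>
        {
            output[i] = Process(input[i]);
        });

        Console.WriteLine("Processed " + output.Length + " items");
    }

    // Placeholder for the per-element computation
    private static double Process(double x)
    {
        return x * x;
    }
}

Because the iterations are independent, this is an instance of embarrassingly parallel processing; on a multi-core CPU, such a loop typically completes several times faster than its sequential counterpart, provided the per-element work dominates the scheduling overhead.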
2.4 Graphics Processing Units and the CUDA
The evolution of computer science and computer architectures constantly leads to new hardware solutions that can be used to accelerate various time-consuming processes. Recent years have also shown that promising results in many scientific computations can be obtained by using graphics processing units (GPUs) and general-purpose graphics processing units (GPGPUs). GPU devices, which were originally conceived as a means to render increasingly complex computer graphics, can now be used to perform computations that are required in completely different domains. For this reason, GPU devices, especially those utilizing the NVidia Compute Unified Device Architecture (CUDA) [27, 28], are now widely used to solve computationally intensive problems, including those encountered in bioinformatics.
2.4.1 Graphics Processing Units
Graphics processing units (GPUs) are specialized single-chip processors primarily used to accelerate the rendering of video and graphics. The introduction of GPU devices freed the CPUs of computers from performing all the calculations related to rendering graphics. GPUs are able to perform these operations much faster, since they possess many computation units (cores) that can be utilized to run many parallel threads. In such a way, it is possible to perform multiple calculations at the same time and speed up the whole calculation process. Today, GPUs are widely used in video cards installed in personal computers and workstations. They are also used in mobile phones, game consoles, and embedded systems. Moreover, their large compute capabilities are frequently used not only to perform graphics operations, but also for general-purpose computing. Intensive computations, which can be parallelized in a reasonable way, can now be passed from the host to the GPU or the general-purpose GPU. The computational engine of all graphics processing units consists of streaming multiprocessors (SMs). Depending on the type and the hardware architecture they implement, GPUs may have various numbers of streaming multiprocessors. The number of SMs usually ranges from two to tens. Each such streaming multiprocessor contains the following elements (Fig. 2.7):
Fig. 2.7 Architecture of the GPU computing device, showing streaming multiprocessors, scalar processor cores, registers, and global, shared, constant, and texture memories
• Scalar processors (SPs, or cores) that are able to perform integer and floating-point arithmetic operations;
• A warp scheduler that manages the instruction dispatch to scalar processors/cores;
• Thousands of 32-bit registers for storing data needed by threads;
• Shared memory that allows threads to exchange data;
• Constant cache, a read-only memory that allows delivering data to a streaming multiprocessor;
• Texture cache, which provides a read-only memory with a certain access pattern, originally designed for texture manipulation.
The GPU device also has an off-chip global memory of large size, typically several gigabytes.
2.4.2 CUDA Architecture and Threads
CUDA (originally Compute Unified Device Architecture) is an architecture and programming model designed by NVidia for performing parallel computations on GPU devices. It allows software developers to use the power of CUDA-enabled graphics processing units to solve general-purpose numerical processing problems much faster than on standard CPUs. In CUDA-enabled GPU devices, high scalability is achieved by the hierarchical organization of many threads, which are the basic execution units. Threads execute, in parallel, a user-defined procedure, called the kernel, which implements some computational logic working on different data. Threads are executed by scalar processors (GPU cores). Threads are organized in a one-, two-, or three-dimensional structure, called the block. Each thread
Fig. 2.8 Execution of threads on the GPU device
has its own index, the vector of coordinates corresponding to its location in the thread block. Each thread block is processed by a streaming multiprocessor (SM), which has many scalar processor cores (SPs). Thread blocks form a one- or two-dimensional structure called a grid. A kernel is executed as a grid of thread blocks (Fig. 2.8). Threads can access global memory, which is the off-chip memory that has a relatively low bandwidth but provides a high storage capacity. Each thread also has access to the on-chip read-write shared memory, as well as to the read-only constant memory and texture memory, both of which are cached on-chip. Access to these three types of memory is much faster than access to the global memory, but they all provide limited storage space and are used in specific situations. Multiprocessors employ an architecture called SIMT (single instruction, multiple thread). In this architecture, a multiprocessor maps each thread to a scalar processor core, where each thread executes independently with its own instruction address and register state. The multiprocessor SIMT unit (a warp scheduler) creates, manages, schedules, and executes block threads in groups of 32 parallel threads called warps. Warps are executed physically in parallel on a streaming multiprocessor. Threads in a warp perform the same instructions, but operate on different data, as in the SIMD (single instruction, multiple data) architecture. Therefore, appropriate preparation and arrangement of data is sometimes desirable before the kernel
Fig. 2.9 A typical execution of CUDA kernel: (1) transferring data to the global memory of the GPU device, (2) initiation of the GPU kernel by the CPU, (3) parallel kernel execution on GPU cores, (4) transferring the output data to the main memory of the host
execution begins, and this is one of the factors that influence the efficiency of any GPU-based implementation [27]. A typical execution of CUDA kernels includes the following steps (Fig. 2.9):
1. Transferring data from the main memory of the host to the memory of the GPU device;
2. Initiating the GPU kernel by the CPU;
3. Parallel kernel execution on GPU cores;
4. Transferring the results of computations from GPU memory to the main memory of the host.
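As a concrete illustration of these four steps, consider the following minimal CUDA C sketch. The square kernel and the sizes chosen here are hypothetical examples, not code from the GPU-based methods presented later in this book.

#include <stdio.h>
#include <stdlib.h>

// Hypothetical kernel: each thread squares one element of the input array
__global__ void square(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        out[i] = in[i] * in[i];
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_in  = (float *)malloc(bytes);
    float *h_out = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc((void **)&d_in, bytes);
    cudaMalloc((void **)&d_out, bytes);

    // (1) transfer input data from host memory to the GPU global memory
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    // (2) the CPU initiates the kernel; (3) the kernel executes in parallel
    // as a grid of one-dimensional thread blocks of 256 threads each
    square<<<(n + 255) / 256, 256>>>(d_in, d_out, n);

    // (4) transfer the results back to the main memory of the host
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    printf("out[3] = %.1f\n", h_out[3]);

    cudaFree(d_in); cudaFree(d_out);
    free(h_in); free(h_out);
    return 0;
}

The grid and block dimensions in the launch configuration (<<<...>>>) are chosen by the programmer; 256 threads per block is only one reasonable choice.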
In this book, CUDA-enabled GPU devices will be used in the implementation of a massively parallel method for protein structure similarity searching presented in Chap. 10.
2.5 Relational Databases and SQL
For scientists studying the structures and functions of proteins, one of the serious alternatives for collecting and processing protein macromolecular data is relational databases. Relational databases are digital repositories of data that operate on the basis of the relational model proposed by E. F. Codd in 1970 [4]. They collect data in tables (relations) describing a part of reality, where data are arranged in columns (attributes) and rows. Figure 2.10 shows a diagram of a sample relational database for storing information about proteins and the scientific publications that are related to them. Rows, in the relational model also called records or tuples, represent an object
Fig. 2.10 Diagram of a sample relational database for storing information about proteins and scientific publications that are related to them

AccessionNo  ProtName                                      Abbrev  ProtID        Organism
-----------  -------------------------------------------   ------  -----------   ------------------------
P14907       Nucleoporin NSP1                              NSP1    NSP1_YEAST    Saccharomyces cerevisiae
P20339       Ras-related protein Rab-5A                    RAB5A   RAB5A_HUMAN   Homo sapiens
P38398       Breast cancer type 1 susceptibility protein   BRCA1   BRCA1_HUMAN   Homo sapiens
P97377       Cyclin-dependent kinase 2                     CDK2    CDK2_MOUSE    Mus musculus
...
Fig. 2.11 A part of data from the Proteins table from the sample database
from the described reality. Columns store the information describing the object (its attributes). Figure 2.11 shows sample data from the Proteins table from the database presented in Fig. 2.10.
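For illustration, the Proteins table visible in Figs. 2.10 and 2.11 could be defined with a statement similar to the sketch below; the column types and lengths are assumptions made for this example, as they are not specified in the diagram:

CREATE TABLE Proteins (
    AccessionNo VARCHAR(10) PRIMARY KEY,  -- e.g., P38398
    ProtID      VARCHAR(20),              -- e.g., BRCA1_HUMAN
    ProtName    VARCHAR(100),
    Abbrev      VARCHAR(10),
    Organism    VARCHAR(50)
);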
2.5.1 Relational Database Management Systems
Relational databases are managed by relational database management systems (RDBMSs), which are software systems that control many operations performed in databases. Examples of RDBMSs are IBM DB2, Microsoft SQL Server, MySQL, Oracle, and PostgreSQL, to mention just a few. A typical relational database management system performs the following functions:
• Storing data for transactional or analytical processing,
• Maintaining relationships between data in the database,
• Maintaining consistency of stored data and controlling correctness of defined constraints,
• Responding to requests of client applications,
• Securing data,
• Recovering data to a coherent state in case of a failure.
Relational databases have several features, which can be important in various applications:
• Schematization—data are structured, and their organization is simple and intuitive in the analysis (columns and rows),
• Operational automation—concurrency, integrity, consistency, and data-type validity are controlled by the RDBMS,
• User-oriented nature—modern relational databases also provide a declarative query language, SQL, that allows retrieving and processing the collected data.
The SQL language has gained great power in processing regular data, hiding the details of the processing under a quite simple SELECT statement. This allows moving the burden of processing to the DBMS and leaves more freedom in managing the workload.
2.5.2 SQL for Manipulating Relational Data
SQL (Structured Query Language) is a query language that allows retrieving and managing data in relational database management systems. It was initially developed at IBM in the early 1970s and later implemented by Relational Software, Inc. (now Oracle) in its RDBMS [5]. The great power of SQL lies in the fact that it is a declarative language. While writing SQL queries, SQL programmers and database developers are responsible only for specifying what they want to get, where the data are stored, i.e., in which tables of the database, and how to filter them; it is the role of the database management system to build the execution plan for the query, optimize it, and perform all the physical operations that are necessary to generate the final result. A simple SELECT statement, which is used to retrieve and display data, and is at the same time one of the most frequently used statements of SQL, may have the following general form:

SELECT A1, ..., Ak FROM T WHERE C;
In the query, the SELECT clause contains the list of columns A1, ..., Ak that will be returned and displayed in the final result, the FROM clause indicates the table(s) T to retrieve data from, and the WHERE clause indicates the filtering condition C, which can be simple or complex. Other clauses, like GROUP BY for grouping and aggregating data and HAVING for filtering groups of data, are also possible, but we will omit them in these considerations for the sake of clarity. The simple query retrieves the specified columns from the table T and displays only those rows that satisfy the condition C. What is important for our considerations is that the table T can be one of the tables existing in the database, can be the result of another nested SELECT statement (the result of any SELECT query is a table), or can be a table returned by a table function that is invoked in the FROM clause. The last option will be utilized in the PSS-SQL language for finding proteins on the basis of their secondary structure composition, presented in Chap. 11 and our works [20, 24].
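To illustrate the second option, the following sketch nests a SELECT statement in the FROM clause of an outer query over the sample Proteins table; the inner query plays the role of the table T here, and the LIKE filter is only a hypothetical example:

SELECT AccessionNo, ProtName
FROM (SELECT AccessionNo, ProtName
      FROM Proteins
      WHERE Organism = 'Homo sapiens') AS HumanProteins
WHERE ProtName LIKE '%kinase%';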
AccessionNo  Abbrev  CiteOrder  PMID      Title                                    Year
-----------  ------  ---------  --------  ---------------------------------------  ----
P38398       BRCA1   1          7545954   A strong candidate for the breast a...   1994
P38398       BRCA1   2          8938427   Complete genomic sequence and analy...   1996
P38398       BRCA1   3          9010228   Differential subcellular localizati...   1997
P20339       RAB5A   1          2501306   The human Rab genes encode a family...   1989
P20339       RAB5A   2          7991565   Rab geranylgeranyl transferase cata...   1994
...
Fig. 2.12 Partial results of the query presented in Listing 2.1 showing proteins together with related publications for Homo sapiens organism
Listing 2.1 shows a sample query executed against the sample database presented in Fig. 2.10. The query retrieves proteins together with related publications for the Homo sapiens organism. The displayed columns—accession numbers of proteins AccessionNo, their abbreviations Abbrev, PubMed identifiers of related publications PMID, their titles Title, and the year of publication Year—are specified in the SELECT clause. In order to display the data, the query joins three tables from the sample database in the FROM clause. The WHERE clause contains the filtering condition for the Homo sapiens organism. Results are sorted according to the abbreviation Abbrev and the citation order CiteOrder in the ORDER BY clause.

1  SELECT P.AccessionNo, Abbrev, CiteOrder, Pu.PMID, Title, Year
2  FROM Proteins P JOIN RelPubs RP ON P.AccessionNo=RP.AccessionNo
3       JOIN Publications Pu ON RP.PMID=Pu.PMID
4  WHERE Organism = 'Homo sapiens'
5  ORDER BY Abbrev, CiteOrder
Listing 2.1 Sample SQL query that retrieves proteins (accession numbers and abbreviations) together with related publications for the Homo sapiens organism.
The query produces the results presented in Fig. 2.12. Relational databases will be used as the main data repository for the GPU-CASSERT algorithm presented in Chap. 10 and for the PSS-SQL language for the exploration of protein secondary structures presented in Chap. 11. They are also used for storing some results of the 3D protein structure modeling processes presented in Chap. 5.
2.6 Scalability
Scalability of a system means that it is able to handle a growing amount of work in a capable manner or is able to be enlarged to accommodate that growth [2]. A scalable system is not paralyzed by an increased workload, but can adapt to it by increasing its computing capabilities, contracting and allocating additional computational resources from a pool of available resources. For example, a scalable system for protein structure prediction is one that can be expanded to process longer amino acid sequences, which requires more simulations and thus more CPU power. A scalable system for massive 3D protein structure similarity searching is one that can be upgraded to compare and align more and more 3D protein structures by adding
new processors, cluster nodes, storage space, and other resources, without the necessity to restart the system and the ongoing process. There are two widely accepted scaling techniques:
• Horizontal scaling (or scaling out/in)—means increasing (or decreasing) the number of compute units, e.g., adding (or removing) a computer to a distributed system, a node to a database cluster, or a virtual machine to a virtualized computer cluster, and distributing tasks between these compute units.
• Vertical scaling (or scaling up/down)—means raising the computational capabilities of the compute units, like increasing the number of processor cores, adding more memory, or moving the workload to a compute unit possessing better performance parameters. This usually allows running more processes or threads on such compute units.
In the following chapters, we will see many examples of scalable systems that use both scaling techniques and implement the scalability in different ways.
2.7 Summary
The scientific solutions for protein bioinformatics presented in this book rely on various technologies that emerged in computer science. Some of them, like Cloud or Big Data, have emerged recently and are quite new in the bioinformatics field. Others, like relational databases, have been widely used for many years in developing efficient and reliable IT systems supporting various forms of business, but are not frequently used in the field of bioinformatics. This chapter provided a technological road map for all the solutions presented in this book. You learned about the new Cloud computing model and about cloud service and deployment models. We defined the Big Data challenge and presented the benefits of using multi-threading in scientific computations. We then explained graphics processing units (GPUs) and the complex CUDA architecture that utilizes the concept of multiple threads for building high-performance solutions. Finally, we focused on relational databases and the SQL language used for declarative querying. The presented technologies will be used in the following parts of the book. In part II, we will see how to build systems for protein structure exploration and modeling that work in the Cloud. In part III, we will learn more about Big Data technologies and see how they are utilized for the same purposes. In part IV, we will see how protein structure exploration can be significantly accelerated with the use of multi-threading and massively parallel implementations on GPUs, and embedded in declarative querying in relational databases. For further reading on the concepts presented in this chapter, please refer to the cited references.
References
1. Bai, C., Dhavale, D., Sarkis, J.: Complex investment decisions using rough set and Fuzzy C-means: an example of investment in green supply chains. Eur. J. Oper. Res. 248(2), 507–521 (2016) 2. Bondi, A.: Characteristics of scalability and their impact on performance. In: 2nd International Workshop on Software and Performance, WOSP 2000, pp. 195–203 (2000) 3. Chang, H., Mishra, N., Lin, C.: IoT Big-Data centred knowledge granule analytic and cluster framework for BI applications: a case base analysis. PLoS ONE 10, 1–23 (2015) 4. Codd, E.F.: A relational model of data for large shared data banks. Commun. ACM 13(6), 377–387 (1970). https://doi.org/10.1145/362384.362685 5. Date, C.: An Introduction to Database Systems, 8th edn. Addison-Wesley (2003) 6. Davis, G.B., Carley, K.M.: Clearing the fog: fuzzy, overlapping groups for social networks. Soc. Netw. 30(3), 201–212 (2008) 7. De Maio, C., Fenza, G., Loia, V., Senatore, S.: Hierarchical web resources retrieval by exploiting fuzzy formal concept analysis. Inf. Process. Manag. 48(3), 399–418 (2012). http://www.sciencedirect.com/science/article/pii/S0306457311000458 8. De Maio, C., Fenza, G., Loia, V., Parente, M.: Time aware knowledge extraction for microblog summarization on Twitter. Inf. Fusion 28, 60–74 (2016) 9. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008). https://doi.org/10.1145/1327452.1327492 10. Ghemawat, S., Gobioff, H., Leung, S.T.: The Google file system. SIGOPS Oper. Syst. Rev. 37(5), 29–43 (2003). https://doi.org/10.1145/1165389.945450 11. Ghosh, G., Banerjee, S., Yen, N.Y.: State transition in communication under social network: an analysis using fuzzy logic and density based clustering towards Big Data paradigm. Future Gener. Comput. Syst. 65, 207–220 (2016). http://www.sciencedirect.com/science/article/pii/S0167739X16300309 12. Guo, K., Zhang, R., Kuang, L.: TMR: towards an efficient semantic-based heterogeneous transportation media Big Data retrieval. Neurocomputing 181, 122–131 (2016) 13. Kundu, S., Pal, S.K.: FGSN: Fuzzy granular social networks model and applications. Inf. Sci. 314, 100–117 (2015). http://www.sciencedirect.com/science/article/pii/S0020025515002388 14. Kundu, S., Pal, S.: Fuzzy-rough community in social networks. Pattern Recognit. Lett. 67, Part 2, 145–152 (2015). http://www.sciencedirect.com/science/article/pii/S0167865515000537 (Granular Mining and Knowledge Discovery) 15. Lu, H., Sun, Z., Qu, W., Wang, L.: Real-time corrected traffic correlation model for traffic flow forecasting. Math. Probl. Eng. 2015, 1–7 (2015) 16. Lu, H., Sun, Z., Qu, W.: Big Data-driven based real-time traffic flow state identification and prediction. Discret. Dyn. Nat. Soc. 2015, 1–11 (2015) 17. McKendrick, J.: Cloud computing market hot, but how hot? Estimates are all over the map (2012). Accessed 24 Aug 2015. http://www.forbes.com/sites/joemckendrick/2012/02/13/cloud-computing-market-hot-but-how-hot-estimates-are-all-over-the-map/ 18. Mell, P., Grance, T.: The NIST definition of Cloud Computing. Special Publication 800-145 (2011). Accessed 10 Oct 2017. http://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-145.pdf 19. Meng, L., Tan, A., Wunsch, D.: Adaptive scaling of cluster boundaries for large-scale social media data clustering. IEEE Trans. Neur. Net. Lear. 27(12), 2656–2669 (2015) 20. Mrozek, D., Wieczorek, D., Malysiak-Mrozek, B., Kozielski, S.: PSS-SQL: protein secondary structure - structured query language.
In: 2010 Annual International Conference of the IEEE Engineering in Medicine and Biology, pp. 1073–1076 (2010) 21. Mrozek, D., Małysiak-Mrozek, B., Kłapciński, A.: Cloud4Psi: cloud computing for 3D protein structure similarity searching. Bioinformatics 30(19), 2822–2825 (2014) 22. Mrozek, D., Gosk, P., Małysiak-Mrozek, B.: Scaling Ab Initio predictions of 3D protein structures in Microsoft Azure cloud. J. Grid Comput. 13, 561–585 (2015)
23. Mrozek, D., Daniłowicz, P., Małysiak-Mrozek, B.: HDInsight4PSi: boosting performance of 3D protein structure similarity searching with HDInsight clusters in Microsoft Azure cloud. Inf. Sci. 349–350, 77–101 (2016) 24. Mrozek, D., Socha, B., Kozielski, S., Małysiak-Mrozek, B.: An efficient and flexible scanning of databases of protein secondary structures. J. Intell. Inf. Syst. 46(1), 213–233 (2016). https://doi.org/10.1007/s10844-014-0353-0 25. Mrozek, D., Kasprowski, P., Małysiak-Mrozek, B., Kozielski, S.: Life sciences data analysis. Inf. Sci. 384, 86–89 (2017) 26. National Research Council: Frontiers in Massive Data Analysis. National Academy Press, Washington, D.C. (2013) 27. NVIDIA CUDA C Programming Guide (2018). http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html 28. Sanders, J., Kandrot, E.: CUDA by Example: An Introduction to General-Purpose GPU Programming, 1st edn. Addison-Wesley Professional, Pearson Education Inc, Boston (2010) 29. Tripathy, B.K., Mittal, D.: Hadoop based uncertain possibilistic kernelized C-means algorithms for image segmentation and a comparative analysis. Appl. Soft Comput. 46, 886–923 (2016) 30. Wang, Z., Tu, L., Guo, Z., Yang, L.T., Huang, B.: Analysis of user behaviors by mining large network data sets. Future Gener. Comput. Syst. 37, 429–437 (2014) 31. Wang, C., Li, X., Zhou, X., Wang, A., Nedjah, N.: Soft computing in Big Data intelligent transportation systems. Appl. Soft Comput. 38, 1099–1108 (2016) 32. Wei, X., Luo, X., Li, Q., Zhang, J., Xu, Z.: Online comment-based hotel quality automatic assessment using improved fuzzy comprehensive evaluation and Fuzzy Cognitive Map. IEEE Trans. Fuzzy Syst. 23(1), 72–84 (2015) 33. Zhong, Y., Zhang, L., Xing, S., Li, F., Wan, B.: The Big Data processing algorithm for water environment monitoring of the Three Gorges Reservoir area. In: Abstract and Applied Analysis 2014 (2014)
Part II
Cloud Services for Scalable Computations
Microsoft Azure Cloud Services support development of scalable and reliable cloud applications that can be used for scientific computations. In this part of the book, we will see two solutions developed for scalable computations related to various processes performed with protein structures: Chap. 4 shows the Cloud4PSi system for 3D protein structure similarity searching; Chap. 5 presents the CloudPSP system for modeling 3D structures of proteins; both are built upon the Azure Cloud services. These two chapters are preceded by an introductory Chap. 3 devoted to foundations of the Microsoft Azure cloud platform and its services, virtualization of compute resources on the Azure platform, and Azure Cloud Services, which allow building a cloud-based application with the use of Web roles and worker roles.
Chapter 3
Azure Cloud Services
Dynamic cloud services are like LEGO blocks, each with their own features and costs. Some blocks are bigger and simpler while other blocks are smaller and more complex, but each block serves a specific purpose Stewart Hyman, IBM, 2014
Abstract Microsoft Azure Cloud Services support development of scalable and reliable cloud applications that can be used for scientific computing. This chapter provides a brief introduction to the Microsoft Azure cloud platform and its services. It focuses on Azure Cloud Services, which allow building a cloud-based application with the use of web roles and worker roles. Finally, it shows a sample application that can be quickly developed on the basis of these two types of roles, and it emphasizes the role of queues in passing messages between components of the built system. Keywords Cloud computing · Cloud platform · PaaS · Microsoft Azure · Cloud services · Virtual machines · Web roles · Worker roles · Queues
3.1 Microsoft Azure
Microsoft Azure is Microsoft's application platform for the public cloud. It allows building and deploying cloud-based solutions for various purposes and managing them across a global network of data centers spread around the world and operated by Microsoft (Fig. 3.1). Microsoft Azure provides its services in the Platform as a Service (PaaS) model and the Infrastructure as a Service (IaaS) model, while capabilities of the Software as a Service (SaaS) model are offered by Microsoft Online Services. There are four basic categories of cloud-based services provided by Microsoft Azure:
Fig. 3.1 Regions of Microsoft’s data centers for the Azure platform spread around the world (source: https://azure.microsoft.com/en-au/global-infrastructure/regions/)
• Compute services,
• Network services,
• Data services,
• App services.
These four basic categories are flexible and cover the various services that are currently available on the Azure cloud platform and can be used by a cloud application developer. In Fig. 3.2, we can see how a user benefits from an application built on top of these services and deployed to the Microsoft Azure cloud. The stack consists of the following elements:
• Application—the application that is made available to the community of users in the cloud, usually accessed by its web-based interface, Web service, mobile interface, or programming script.
• Compute—refers to the compute capabilities of the Microsoft Azure platform, providing separate services for particular needs, e.g.:
– Virtual Machines—this service provides basic compute units; it provides instances of virtual machines with preinstalled operating systems (e.g., Windows Server or Linux) that can be used for general-purpose or specialized computing on the Azure cloud;
– Cloud Services—this service allows building and deploying highly scalable applications with the use of component roles that implement the logic of the application and introduce an abstraction layer over the cloud infrastructure:
Web role—is typically used for providing a Web-based front-end for the cloud application through a preinstalled dedicated IIS Web server on the virtual machine that runs the role;
Worker role—is usually used for generalized development of programs that perform background processing and scalable computations, accept and
Fig. 3.2 Application deployed to Microsoft Azure which serves as a virtualized infrastructure, platform for developers, and gateway for hosting applications
respond to requests, and perform long-running or intermittent tasks that are independent of user interaction;
– Web Sites—is a service that can be used to build new Web sites hosted in the cloud for various business needs;
– Mobile Services—support building highly functional mobile applications with Microsoft Azure;
• Data Services—provide the ability to store, modify, and report on data in Microsoft Azure; for example, the following components of Data Services can be used to store data:
– BLOBs—allow storing unstructured text or binary data (video, audio, and images) as named files along with their metadata;
– Tables—can store large amounts of non-relational (NoSQL) data in a structured storage space;
– SQL Database—allows storing large amounts of relational data in a SQL database hosted in Microsoft's data center;
– HDInsight—provides a fully managed, open-source analytics service for a wide spectrum of data analyses performed within enterprises with the use of open-source frameworks, like Hadoop, Spark, Hive, LLAP, Kafka, Storm, R, and others.
Fig. 3.3 Main services available on the Azure platform within various categories and service models (source: [1])
– and others (SQL Data Sync, SQL Reporting, Search).
• Networking—provides general connectivity and routing at the TCP/IP and DNS level;
• App Services—provide multiple services related to security, performance, workflow management, and finally, messaging, including Storage Queues and Service Bus, which provide efficient communication between application tiers running in Microsoft Azure.
• Fabric—the entire compute, storage (e.g., hard drives), and network infrastructure, usually implemented as virtualized clusters; it constitutes a resource pool that consists of one or more servers (called scale units).
The four basic categories of Azure services presented here are not the only way these services can be categorized and should be treated flexibly. The plethora of new services that are offered by the Azure platform nowadays may lead to defining other, more specialized categories. The spectrum of services and tools available on the Microsoft Azure cloud platform expands every year. In Fig. 3.3, we show the main services that were available on the Azure platform at the time when the book was written. Importantly, the services of the Microsoft Azure platform are provided within various service models. For example, Virtual Machines are available within the Infrastructure as a Service (IaaS) model, while Azure Cloud Services are available within the Platform as a Service (PaaS) model. On the other hand, Office 365 is provided within the Software as a Service (SaaS) model. Microsoft Azure is an open platform, providing cloud services that allow developing cloud applications using almost any programming language, framework, or tool.
Table 3.1 Available sizes of A-series virtual machines (VMs) based on [2]

VM/server type   Number of CPU cores   Memory (GB)   Disk space for local storage (GB)
Extra small      1                     0.768         20
Small            1                     1.75          225
Medium           2                     3.5           490
Large            4                     7             1000
Extra large      8                     14            2040
A5               2                     14            490
A6               4                     28            1000
A7               8                     56            2040
The solutions developed for the Azure cloud can be integrated with an existing IT environment kept on premises, thus, extending the capabilities of the solutions and the IT environment.
3.2 Virtual Machines, Series, and Sizes
Azure Cloud Services, including various types of roles, make use of virtual machines (VMs) to perform their operations. These virtual machines fall into different series that are optimized toward performing computations that require more or less resources of a particular type. Within particular series, there are usually several sizes of VMs. The size of a VM affects the processing, memory, and storage capacity of the virtual machine. The size of the virtual machine also affects the pricing. Table 3.1 shows a list of features for basic A-series VMs. Virtual machines from the A-series have CPU performance and memory configurations best suited for entry-level workloads performed while developing applications and testing them. They are economical and provide a low-cost option to get started with Azure cloud services. A-series compute-intensive virtual machines feature more CPU cores per instance (8–16 cores) and much more memory (56 or 112 GB) than standard A-series VMs. They are best suited for more compute-intensive workloads. Table 3.2 shows a list of features for A-series compute-intensive VMs. The Av2-series is the newer generation of A-series virtual machines with similar CPU performance and a faster disk. Microsoft recommends these VMs for development workloads, build servers, code repositories, low-traffic Web sites and Web applications, microservices, early product experiments, and small databases. A list of basic features for Av2-series VMs is shown in Table 3.3. Table 3.4 shows a list of features for D-series VMs. Virtual machines from the D-series are equipped with solid-state drives (SSDs), fast CPUs, and an optimal CPU-to-memory configuration, making them suitable for most general-purpose applications.
Table 3.2 Available sizes of A-series compute-intensive virtual machines (VMs) based on [2]

VM/server type   Number of CPU cores   Memory (GB)   Disk space for local storage (GB)
A8               8                     56            1817
A9               16                    112           1817
A10              8                     56            1817
A11              16                    112           1817
Table 3.3 Available sizes of Av2-series virtual machines (VMs) based on [2]

VM/server type    Number of CPU cores   Memory (GB)   Disk space for local storage (GB)
Standard_A1_v2    1                     2             10
Standard_A2_v2    2                     4             20
Standard_A4_v2    4                     8             40
Standard_A8_v2    8                     16            80
Standard_A2m_v2   2                     16            20
Standard_A4m_v2   4                     32            40
Standard_A8m_v2   8                     64            80
Table 3.4 Available sizes of D-series virtual machines (VMs) based on [2]

VM/server type   Number of CPU cores   Memory (GB)   Disk space for local storage (GB)
Standard_D1      1                     3.5           50
Standard_D2      2                     7             100
Standard_D3      4                     14            200
Standard_D4      8                     28            400
Standard_D11     2                     14            100
Standard_D12     4                     28            200
Standard_D13     8                     56            400
Standard_D14     16                    112           800
Some of the D-series VMs also provide more memory per CPU, making them suitable for applications that require higher amounts of memory. Dv2-series instances are a newer generation of D-series instances that carry more powerful CPUs, which are on average about 35% faster than those of D-series instances, and provide the same memory and disk configurations as the D-series. Dv2-series instances are based on the 2.4 GHz Intel Xeon E5-2673 v3 (Haswell) processor, and with Intel Turbo Boost Technology 2.0 can go up to 3.2 GHz. Table 3.5 shows a list of features for Dv2-series VMs.
Table 3.5 Available sizes of Dv2-series virtual machines (VMs) based on [2]

VM/server type    Number of CPU cores   Memory (GB)   Disk space for local storage (GB)
Standard_D1_v2    1                     3.5           50
Standard_D2_v2    2                     7             100
Standard_D3_v2    4                     14            200
Standard_D4_v2    8                     28            400
Standard_D5_v2    16                    56            800
Standard_D11_v2   2                     14            100
Standard_D12_v2   4                     28            200
Standard_D13_v2   8                     56            400
Standard_D14_v2   16                    112           800
Standard_D15_v2   20                    140           1000
Table 3.6 Available sizes of Dv3-series virtual machines (VMs) based on [2]

VM/server type    Number of CPU cores   Memory (GB)   Disk space for local storage (GB)
Standard_D2_v3    2                     8             50
Standard_D4_v3    4                     16            100
Standard_D8_v3    8                     32            200
Standard_D16_v3   16                    64            400
Standard_D32_v3   32                    128           800
Standard_D64_v3   64                    256           1600
Dv3-series virtual machines are the newest generation of D-series instances and provide more CPU cores and more memory than Dv2-series instances. Dv3-series instances are based on the 2.3 GHz Intel Xeon E5-2673 v4 (Broadwell) processor, and with Intel Turbo Boost Technology 2.0 can go up to 3.5 GHz. Table 3.6 shows a list of features for Dv3-series VMs. The E-series virtual machines are optimized for heavy in-memory applications, e.g., SAP HANA. The E-series VMs possess high memory-to-core ratios, which makes them well-suited for relational database servers with medium to large caches, and for in-memory analytics. These VMs range from 2 to 64 vCPUs and from 16 to 432 GB of RAM. Table 3.7 shows a list of features for the newest Ev3-series VMs. G-series virtual machines feature processors from the Intel Xeon E5 v3 family, two times more memory, and four times more solid-state drive (SSD) storage than the general-purpose D-series. The G-series provides up to 0.5 TB of RAM and 32 CPU cores, offering unparalleled computational performance, memory, and local SSD storage for the most demanding applications. Table 3.8 shows a list of features for the G-series VMs.
Table 3.7 Available sizes of Ev3-series virtual machines (VMs) based on [2]

VM/server type    Number of CPU cores   Memory (GB)   Disk space for local storage (GB)
Standard_E2_v3    2                     16            50
Standard_E4_v3    4                     32            100
Standard_E8_v3    8                     64            200
Standard_E16_v3   16                    128           400
Standard_E32_v3   32                    256           800
Standard_E64_v3   64                    432           1600
Table 3.8 Available sizes of G-series virtual machines (VMs) based on [2]

VM/server type   Number of CPU cores   Memory (GB)   Disk space for local storage (GB)
Standard_G1      2                     28            384
Standard_G2      4                     56            768
Standard_G3      8                     112           1536
Standard_G4      16                    224           3072
Standard_G5      32                    448           6144
Table 3.9 Available sizes of H-series virtual machines (VMs) based on [2]

VM/server type   Number of CPU cores   Memory (GB)   Disk space for local storage (GB)
Standard_H8      8                     56            1000
Standard_H16     16                    112           2000
Standard_H8m     8                     112           1000
Standard_H16m    16                    224           2000
Standard_H16r    16                    112           2000
Standard_H16mr   16                    224           2000
The H-series family comprises next-generation high-performance computing virtual machines. This series of VMs is built on Intel Haswell processor technology, specifically E5-2667 v3 processors, with 8- and 16-core VM sizes, both featuring DDR4 memory and local SSD-based storage. The H-series also provides various options for RDMA (remote direct memory access) and low-latency networking using InfiniBand, along with several memory configurations to support memory-intensive computational requirements. Table 3.9 shows a list of features for H-series VMs.
Fig. 3.4 Architecture of a sample application developed with the use of Azure Cloud Services showing: one Web role serving a Web site, queue for passing messages, and one or more Worker roles for computations
3.3 Cloud Services in Action
Microsoft Azure Cloud Services support the development of scalable and reliable cloud applications. Azure Cloud Services are hosted on virtual machines, and application developers can control the VMs, e.g., by installing their own software on the VMs that host the developed Cloud Services or by connecting to the VMs remotely. In this section, we will see how to build a cloud-based application with the use of web roles and worker roles provided within Azure Cloud Services. We will also see how to use queues in order to pass messages between components of the built system. As we could read earlier in this chapter, any application developed with the use of the Azure Cloud Services can be created on the basis of two types of roles:
• Web roles serving a Web site,
• Worker roles for computations and processing.
The sample application that will be built in this chapter will use both types of roles. The Web role will simply accept a short text entered by a user on the served web form. Then, the text will be passed as a message to the Worker role(s) through a queue (Azure Queue storage service), as presented in Fig. 3.4. The Web role of the sample application provides a simple Web form that contains a logo, a text box for entering messages, and a button for sending messages (Fig. 3.5). When the user enters the message and presses the Send message button, the message is sent to the myqueue queue.
Fig. 3.5 Simple Web form provided by the Web role implemented in the sample application developed with the use of Azure Cloud Services
The queue can be created before the application is executed in the Cloud. However, this step is usually performed by an appropriate procedure executed as a part of the code of the running Web role. The C# code presented in Listing 3.1 shows how to create a queue, if it does not already exist. First, it is necessary to establish a connection to the cloud storage account. This is done in line 2 of the sample code. The StorageConnectionString holds all the information that is needed to connect to the appropriate storage account on the Azure subscription of the developer. After establishing a connection to the cloud storage account, the queue client is created in line 5. It is needed in order to retrieve a reference to a queue (line 8). In line 8, we get a reference to our myqueue queue. If the queue does not exist, it is created by invoking the CreateIfNotExists method for the queue object (line 10).

1  // Retrieve storage account from connection string
2  CloudStorageAccount storageAccount = CloudStorageAccount.Parse(
       CloudConfigurationManager.GetSetting("StorageConnectionString"));
3
4  // Create the queue client
5  CloudQueueClient queueClient = storageAccount.CreateCloudQueueClient();
6
7  // Retrieve a reference to a queue
8  CloudQueue queue = queueClient.GetQueueReference("myqueue");
9
10 // Create the queue if it doesn't already exist
11 queue.CreateIfNotExists();
Listing 3.1 Sample C# code that allows creation of a Storage queue, called myqueue, in Microsoft Azure cloud storage.
The StorageConnectionString with the information that is needed to connect to the appropriate storage account on the Azure subscription of the developer is stored in configuration files. Connection strings to the Azure storage account for the Web role and the Worker role are kept in the environment settings of the Cloud Service .NET project. There are separate settings to be used when the Cloud Services application runs locally and when it runs in the cloud. They are stored in two files:
• ServiceConfiguration.Cloud.cscfg—for the application running in the cloud,
• ServiceConfiguration.Local.cscfg—for the application running locally.
A sample configuration file for the application running in the cloud may look like the one presented in Listing 3.2 (account names and keys are elided as placeholders). In the configuration file, the developer should not only provide connection strings to the Azure storage account for the Web role (line 6) and the Worker role (line 13), but also specify the number of virtual machines that Azure will use to run the Web role and Worker role code on (lines 4 and 11, respectively). This configuration option allows developers to scale out the whole system according to the requirements, before it is published to the Microsoft Azure cloud. Then, the application can be scaled out through the Azure portal.

1  <?xml version="1.0" encoding="utf-8"?>
2  <ServiceConfiguration serviceName="SampleCloudService" xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceConfiguration">
3    <Role name="WebRole1">
4      <Instances count="1" />
5      <ConfigurationSettings>
6        <Setting name="StorageConnectionString" value="DefaultEndpointsProtocol=https;AccountName=...;AccountKey=..." />
7        <Setting name="Microsoft.WindowsAzure.Plugins.Diagnostics.ConnectionString" value="..." />
8      </ConfigurationSettings>
9    </Role>
10   <Role name="WorkerRole1">
11     <Instances count="2" />
12     <ConfigurationSettings>
13       <Setting name="StorageConnectionString" value="DefaultEndpointsProtocol=https;AccountName=...;AccountKey=..." />
14       <Setting name="Microsoft.WindowsAzure.Plugins.Diagnostics.ConnectionString" value="..." />
15     </ConfigurationSettings>
16   </Role>
17 </ServiceConfiguration>
Listing 3.2 Sample configuration file for the application running in the cloud containing connection strings for the Web and the Worker roles.
Additionally, the ServiceDefinition.csdef definition file specifies the settings that are used by the Azure platform to configure a cloud service. A sample definition file may look like the one presented in Listing 3.3. Among many settings, the file contains information on the sizes of the virtual machines to run the Web and the Worker roles on (vmsize attribute, lines 3 and 21, respectively). This allows configuring appropriate sizes of VMs and scaling the components of the system up and down.

1  <?xml version="1.0" encoding="utf-8"?>
2  <ServiceDefinition name="SampleCloudService" xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition">
3    <WebRole name="WebRole1" vmsize="Small">
4      <Sites>
5        <Site name="Web">
6          <Bindings>
7            <Binding name="Endpoint1" endpointName="Endpoint1" />
8          </Bindings>
9        </Site>
10     </Sites>
11     <Endpoints>
12       <InputEndpoint name="Endpoint1" protocol="http" port="80" />
13     </Endpoints>
14     <Imports>
15       <Import moduleName="Diagnostics" />
16     </Imports>
17     <ConfigurationSettings>
18       <Setting name="StorageConnectionString" />
19     </ConfigurationSettings>
20   </WebRole>
21   <WorkerRole name="WorkerRole1" vmsize="Small">
22     <ConfigurationSettings>
23       <Setting name="StorageConnectionString" />
24       <Setting name="Microsoft.WindowsAzure.Plugins.Diagnostics.ConnectionString" />
25     </ConfigurationSettings>
26   </WorkerRole>
27 </ServiceDefinition>
Listing 3.3 Sample definition file for the cloud service running in the cloud containing specification of sizes of the Web and the Worker roles.
Pressing the Send message button on the Web form provided by our Web role inserts the text of the message into the myqueue queue. The C# code of the event procedure responsible for this task is shown in Listing 3.4. The code contains the part that establishes the connection to the queue (lines 9–18), which was explained earlier. Then, in order to insert a message into the myqueue queue, we first create a new message (an object of the CloudQueueMessage class, line 23). Next, the AddMessage method is called in order to add the message to the queue. A queue message can be created from either a string (in line 23 it is retrieved from the MessageTxtBox field on the Web form) or a byte array.

1  using System;
2  using Microsoft.WindowsAzure.Storage;
3  using Microsoft.WindowsAzure.Storage.Queue;
4  ...
5
6  protected void SendMsgBtn_Click(object sender, EventArgs e)
7  {
8      // Retrieve storage account from connection string
9      CloudStorageAccount storageAccount = CloudStorageAccount.Parse(
           CloudConfigurationManager.GetSetting("StorageConnectionString"));
10
11     // Create the queue client
12     CloudQueueClient queueClient = storageAccount.CreateCloudQueueClient();
13
14     // Retrieve a reference to a queue
15     CloudQueue queue = queueClient.GetQueueReference(queueName: "myqueue");
16
17     // Create the queue if it doesn't already exist
18     queue.CreateIfNotExists();
19
20     // Create a message and add it to the queue.
21     if (!String.IsNullOrEmpty(MessageTxtBox.Text))
22     {
23         CloudQueueMessage message = new CloudQueueMessage(MessageTxtBox.Text);
24         queue.AddMessage(message);
25     }
26 }
Listing 3.4 Event procedure that enters a message into the existing queue in Microsoft Azure.
Fig. 3.6 The Hello world... message sent by the Web role, stored temporarily (before it is peeked by the Worker role) in the myqueue queue, visible in the Cloud Explorer available in the Microsoft Visual Studio .NET software development environment
Figure 3.6 shows the message sent by the Web role to the myqueue queue, in the Cloud Explorer available in the Microsoft Visual Studio .NET software development environment. The activity of the Worker role in our sample application is very simple—it waits for incoming messages in the myqueue queue, consumes them, and displays the message, e.g., on the standard output. The main activity of our Worker is implemented in the Run method of the WorkerRole class (Listing 3.5). As in the case of the Web role, the Worker role first must establish a connection to the myqueue queue (lines 21–27). Then, the Worker role monitors the queue for incoming messages in an infinite loop (line 31) and retrieves a message, if one has been posted, with the use of the GetMessage method (line 34). The GetMessage method makes the retrieved message invisible to any other code reading messages from the myqueue queue for 30 seconds (by default). The text sent through the message is retrieved and displayed on the screen (lines 38–39). Here we use the Trace class and its TraceInformation method for monitoring the execution of our application while it is running. When the message is processed, it is permanently deleted in line 41 by calling the DeleteMessage method.

1  using System;
2  using System.Diagnostics;
3  using System.Net;
4  using System.Threading;
5  using System.Threading.Tasks;
6  using Microsoft.WindowsAzure.ServiceRuntime;
7  using Microsoft.WindowsAzure.Storage;
8  using Microsoft.WindowsAzure.Storage.Queue;
9
10 namespace WorkerRole
11 {
12     public class WorkerRole : RoleEntryPoint
13     {
14
15         ...
16         public override void Run()
17         {
18             Trace.TraceInformation("WorkerRole is running");
19
20             // Retrieve storage account from connection string
21             CloudStorageAccount storageAccount =
22                 CloudStorageAccount.DevelopmentStorageAccount;
23             // Create the queue client.
24             CloudQueueClient queueClient = storageAccount.CreateCloudQueueClient();
25
26             // Retrieve a reference to a queue.
27             CloudQueue queue = queueClient.GetQueueReference(queueName: "myqueue");
28
29             CloudQueueMessage peekedMessage;
30
31             while (true)
32             {
33                 // Get the next message
34                 peekedMessage = queue.GetMessage();
35                 if (peekedMessage != null)
36                 {
37                     // Display message, e.g., Console.WriteLine(peekedMessage.AsString);
38                     Trace.TraceInformation("WorkerRole received message: "
39                         + peekedMessage.AsString);
40                     // Process the message in less than 30 seconds, and then delete the message
41                     queue.DeleteMessage(peekedMessage);
42                 }
43             }
44
45             ...
46         }
47
48         public override bool OnStart()
49         {
50             ...
51             bool result = base.OnStart();
52             Trace.TraceInformation("WorkerRole has been started");
53
54             return result;
55         }
56
57         public override void OnStop()
58         {
59             Trace.TraceInformation("WorkerRole is stopping");
60
61             this.cancellationTokenSource.Cancel();
62             this.runCompleteEvent.WaitOne();
63             base.OnStop();
64
65             Trace.TraceInformation("WorkerRole has stopped");
66         }
67
68         ...
69     }
70 }
Listing 3.5 Sample C# code for the Worker role running in Microsoft Azure cloud.
Other methods of the WorkerRole class, including OnStart and OnStop, are used here just for displaying trace information to the developer to simplify monitoring of the application execution. In Figs. 3.7 and 3.8, we can observe tracing results for the Web role and the Worker role displayed in the Microsoft Azure Compute Emulator, a local environment for testing developed and deployed cloud applications.
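One practical detail is worth noting here: when processing a message may take longer than the default 30-second visibility timeout, the GetMessage method of the Azure Storage client library accepts an explicit timeout. The five-minute value in the following one-line sketch is only an illustrative assumption:

// Request a 5-minute visibility timeout instead of the default 30 seconds
peekedMessage = queue.GetMessage(TimeSpan.FromMinutes(5));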
Fig. 3.7 Trace results for the Web role, displayed in the Microsoft Azure Compute Emulator
3.4 Summary
Azure Cloud Services are, next to Azure Batch, one of the possibilities for developing applications that perform scalable and long-running computations in the Azure cloud. However, even though applications are hosted and executed on a pool of virtual machines, Azure Cloud Services provide their capabilities in the Platform as a Service (PaaS) service model, not the Infrastructure as a Service (IaaS) model. This simplifies the management of the whole environment, since it is the role of the cloud provider to manage all VMs and, e.g., update the operating systems of the virtual machines that are provisioned. Azure Cloud Services allow developers to avoid creating virtual machines manually and simplify the development of cloud applications by providing the concepts of Web roles and Worker roles. This makes it possible to separate the parts of the system into front-end and processing services that are temporally decoupled. In such a way, the parts of the system can be developed separately and can perform their functions at their own pace, as processing tasks are sent through Storage queues. Azure Cloud Services also simplify scaling. Developers do not have to create many virtual machines. Instead, they only configure the number and the size of instances for the Web and Worker roles in the configuration and definition files. Then, the Azure platform creates as many instances of both types of roles as declared in the configuration files.
Fig. 3.8 Trace results for the Worker role, displayed in the Microsoft Azure Compute Emulator, showing the received Hello world message sent by the Web role
In this chapter, we provided a brief introduction to the Microsoft Azure cloud platform and its services. We also saw how to use Azure Cloud Services to build a simple cloud-based application that utilizes web roles and worker roles. We showed a sample application developed on the basis of these two types of roles, and the role of queues in passing messages between components of the built system. The presented concepts will be used in the following chapters of this part of the book, in which we present the Cloud4PSi system for 3D protein structure similarity searching [3] and the CloudPSP system for modeling 3D structures of proteins [4], both built upon the Azure Cloud Services. For further reading on the Azure Cloud Services, please refer to the Microsoft Azure cloud documentation at https://docs.microsoft.com/en-us/azure/cloud-services/. In the next chapter, we will present how Cloud Services are utilized in the Cloud4PSi system for performing scalable similarity searches of protein structures, and the benefits of using a dedicated role-based and queue-based architecture.
References
1. Meleg, T.: Microsoft Azure: The Big Picture. MSDN Mag. 30(10) (2015). https://msdn.microsoft.com/en-us/magazine/mt573712.aspx 2. Microsoft Documentation: Sizes for Cloud Services (2018). Accessed 7 May 2018. https://docs.microsoft.com/en-us/azure/cloud-services/cloud-services-sizes-specs 3. Mrozek, D., Małysiak-Mrozek, B., Kłapciński, A.: Cloud4Psi: cloud computing for 3D protein structure similarity searching. Bioinformatics 30(19), 2822–2825 (2014) 4. Mrozek, D., Gosk, P., Małysiak-Mrozek, B.: Scaling Ab Initio predictions of 3D protein structures in Microsoft Azure cloud. J. Grid Comput. 13, 561–585 (2015)
Chapter 4
Scaling 3D Protein Structure Similarity Searching with Azure Cloud Services
Frankly, it is hard to predict what new capabilities the cloud may enable. The cloud has a trajectory that is hard to plot and a scope that reaches into so many aspects of our daily life that innovation can occur across a broad range Barrie Sosinsky, 2011
Abstract Azure Cloud Services allow building scalable and reliable software applications that perform computations in the Cloud. These applications are built with the use of Web roles and Worker roles that abstract from the Cloud infrastructure. In this chapter, we will see how the Cloud computing architecture and Azure Cloud Services can be utilized to scale out and scale up protein similarity searches by means of the Cloud4PSi system, which was developed for the Microsoft Azure public cloud. We will see the architecture of the system, its components, the communication flow, and the advantages of using a queue-based model over direct communication between computing units. Results of various experiments confirm that protein structure similarity searching can be successfully scaled on cloud platforms by using computation units of different sizes and by adding more computation units. Keywords Proteins · 3D Protein structure · Tertiary structure · Similarity searching · Structure comparison · Structure alignment · Superposition · Cloud computing · Microsoft Azure · Azure Cloud Services · Parallel computing · Software as a service · SaaS
4.1 Introduction
3D protein structures exhibit high conservation in the evolution of organisms, and even if protein sequences have diverged significantly, comparison of 3D protein structures and finding structural similarities allow drawing conclusions on the functional similarity
Fig. 4.1 Comparison and structural alignment of two 3D protein structures capable of oxygen binding in muscle tissue and red blood cells: myoglobin from the sperm whale Physeter catodon (PDB ID: 1MBN, chain A) [32] and wild-type deoxyhemoglobin from Homo sapiens (PDB ID: 1XZ2, chain D) [12]: a two 3D protein structures aligned and superimposed on each other (displayed as cartoons); matching fragments of both proteins are marked with blue and orange colors, fragments that do not match are marked using the gray color, b projection of the structural alignment on the amino acid sequences of both proteins, showing weak sequence similarity and gaps
of proteins from various, sometimes evolutionarily distant organisms. For this reason, protein structure similarity searching is of great importance in structural bioinformatics, systems biology, and molecular modeling. 3D protein structure similarity searching refers to the process in which a given protein structure is compared to another protein structure or a set of protein structures collected in a database [20]. The comparison of protein structures usually involves structural alignment and superposition of these structures (Fig. 4.1). The aim of the process is to find common fragments of the compared protein structures, fragments that match each other. Matching fragments and protein structure similarities may indicate common ancestry of the proteins, and then of organisms, their evolutionary relationships, functional similarities of the investigated molecules, the existence of common functional regions, and many other things [4]. The role of the process is especially important in situations where sequence similarity searching fails or delivers too few clues [7]; then, 3D protein structure similarity searching becomes the primary technique for drawing reasonable conclusions, e.g., regarding the function of an unknown
protein. There are also several processes, e.g., validation of predicted protein models, where protein structure similarity searching plays a supportive role [15].
4.1.1 Why We Need Cloud Computing in Protein Structure Similarity Searching
Protein structure similarity searching belongs to the group of primary tasks performed in structural bioinformatics. However, it is still a very difficult and time-consuming process. Three key factors are responsible for this:
1. 3D protein structures are complex: proteins are built up of hundreds of amino acids, and therefore, thousands of atoms, and they may have several chains in their quaternary structures, which makes the comparison process difficult;
2. the similarity searching process is computationally intensive: the problem is NP-hard; most of the widely accepted algorithms, like VAST [6, 17], DALI [9, 10], LOCK2 [30], FATCAT [34], CE [31], FAST [35], MICAN [18], and CASSERT [19, 21], have high computational complexity, and the process itself is usually carried out in a pairwise manner, comparing a given query structure to successive structures from a collection in pairs, one by one;
3. the number of 3D structures in macromolecular data repositories, such as the Protein Data Bank (PDB) [3], grows exponentially; as of April 30, 2018, there were 139,717 structures in the PDB.
These factors motivate scientific efforts to develop new methods for 3D structure similarity searching and to build scalable platforms that allow completing the task much faster. Cloud computing provides huge amounts of computational power that can be provisioned for such tasks and can speed them up significantly.
4.1.2 Algorithms for Protein Structure Similarity Searching
Popular methods for 3D protein structure similarity searching that produce not only similarity measures, but also the final alignment and structural superposition, like CE [31] and FATCAT [34], are still very time-consuming. As a consequence, similarity searching against large repositories of structural data requires increased computational resources that are not available to everyone. Cloud4PSi, which utilizes Cloud services and will be described in the following sections, allows searching for protein structure similarities by means of two algorithms: jFATCAT and jCE [28]. jFATCAT and jCE are new, enhanced implementations of the FATCAT and CE algorithms for protein structure alignment and similarity searching. Both algorithms have a very well-established position among researchers and are publicly available through the Protein Data Bank Web site for those who want to search for structural neighbors.
Fig. 4.2 3D protein structure alignment by combining AFP elements: a 3D structure of a sample protein, its backbone, and splitting into fixed-length fragments (sections), b sections of protein structures A (black) and B (gray) forming two aligned fragment pairs AFP_i and AFP_j, c alignment matrix showing various AFP elements and an alignment path created by joining (red arrows) several AFPs
Moreover, both algorithms are used for the pre-calculated 3D-structure comparisons of the whole PDB that are updated on a weekly basis [27]. FATCAT and CE, and their new versions jFATCAT and jCE, work on the basis of matching protein structures using aligned fragment pairs (AFPs), which represent parts of protein structures that fit each other (Fig. 4.2). It can be noticed that these methods are consistent at the level of representation of protein structures during their comparison: they both represent protein structures by means of local geometry, rather than global features such as the orientation of secondary structures and overall topology. Both methods construct an alignment path that shows which parts of the protein structures can be treated as identical or similar. However, they use different strategies to verify whether two parts of protein structures are compatible or not, and different computational procedures.
jFATCAT tests the compatibility of consecutive AFP pairs by calculating the root mean square deviation (RMSD) between distance matrices of residues located in these parts of each protein that form the connected AFPs m and k:

$$D_{mk} = \sqrt{\frac{1}{L}\sum_{s=1}^{L}\left(d^{1}_{b_1(m)+s,\,b_1(k)+s} - d^{2}_{b_2(m)+s,\,b_2(k)+s}\right)^{2}}, \qquad (4.1)$$
where d^1_{i,j} and d^2_{i,j} are the distances between residues i and j in the first and the second protein, b_1(m), b_1(k), b_2(m), and b_2(k) are the starting positions of AFPs m and k in the first and the second protein, and L is the length of each AFP. On the other hand, jCE uses the following distance measure to evaluate how well two protein fragments forming an AFP match each other:

$$D_{ij} = \frac{1}{m^{2}}\sum_{k=0}^{m-1}\sum_{l=0}^{m-1}\left| d^{A}_{p^{A}_{i}+k,\,p^{A}_{j}+l} - d^{B}_{p^{B}_{i}+k,\,p^{B}_{j}+l} \right|, \qquad (4.2)$$
where D_{ij} denotes the distance between two combinations of two fragments from proteins A and B defined by two AFPs at positions i and j in the alignment path, p^A_i denotes the AFP's starting residue position in protein A at the ith position in the alignment path (similarly for p^B_i), d^A_{i,j} is the distance between residues i and j in protein A based on the coordinates of the Cα atoms (similarly for d^B_{i,j}), and m denotes the size of the AFP fragment. Regarding the computational procedure leading to the calculation of the alignment path, jCE involves a combinatorial approach. Combinations of AFPs that represent possible continuous alignment paths are selectively extended or discarded, thereby leading to a single optimal alignment. jFATCAT uses a dynamic programming algorithm to connect AFPs by combining gaps and twists between consecutive AFPs, each with its own score penalty. Importantly, jFATCAT-flexible eliminates drawbacks of many existing methods that treat proteins as rigid bodies rather than flexible structures. The research conducted by the authors of FATCAT has shown that the rigid representation causes many similarities, even very strong ones, to be missed. In contrast, jFATCAT allows twists to be introduced in protein structures while matching their fragments, providing better alignments in a number of cases. One of the cases is presented in Fig. 4.3. It shows how two protein structures are structurally aligned using the jCE algorithm and the jFATCAT algorithm. In Fig. 4.3a, we can see the structures of proteins (PDB ID: 2SPC.A [33]) and (PDB ID: 1AJ3.A [26]) and their structural alignment generated by the jCE algorithm, which treats structures as rigid bodies. With the use of different colors, we have marked the parts of the structures that were aligned. Both structures are highly homologous, which is also reflected in their sequence alignment. However, a different orientation of the rest of the compared chains causes these parts not to be regarded as structurally
Fig. 4.3 Results of alignment and superposition of two 3D protein structures (PDB ID: 1AJ3.A) and (PDB ID: 2SPC.A) produced by the jCE algorithm (a) and the jFATCAT algorithm (b). Parts of the structures that were structurally aligned are marked by using different colors. In the case of jFATCAT (bottom), the structure (PDB ID: 2SPC.A) is transformed by introducing a twist, which produces a better alignment
similar. This applies not only to the jCE algorithm, but also to other algorithms treating proteins as rigid bodies. jFATCAT-flexible is able to handle such deformations and various orientations by introducing gaps and twists (rigid body movements) and by using an appropriate penalty system. The penalty system is used in order to limit the number of twist operations. In Fig. 4.3 (bottom), we can see the structural alignment of the same two structures after introducing gaps and twists. As a result, jFATCAT finds new regions reflecting structural similarity.
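To make the distance measures above more concrete, the following minimal sketch shows how the jCE inter-AFP distance from Eq. (4.2) could be computed, assuming precomputed Cα distance matrices. The method and variable names are illustrative only and do not come from the jCE source code.

using System;

public static class AfpDistance
{
    // Minimal sketch of Eq. (4.2): average absolute difference between
    // intra-molecular C-alpha distances of two AFP combinations.
    // dA, dB - precomputed distance matrices of proteins A and B,
    // pA, pB - starting residue positions of AFPs in the alignment path,
    // i, j   - positions of the two AFPs in the alignment path,
    // m      - size of the AFP fragment.
    public static double CeDistance(double[,] dA, double[,] dB,
                                    int[] pA, int[] pB,
                                    int i, int j, int m)
    {
        double sum = 0.0;
        for (int k = 0; k < m; k++)
            for (int l = 0; l < m; l++)
                sum += Math.Abs(dA[pA[i] + k, pA[j] + l]
                              - dB[pB[i] + k, pB[j] + l]);
        return sum / (m * m);
    }
}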
4.1.3 Other Cloud-Based Solutions for Bioinformatics
The concept of cloud computing is also becoming increasingly popular in scientific applications, for which the theoretically infinite resources of the cloud allow solving computationally intensive problems. Also, in the domain of bioinformatics, there are many dedicated tools that are cloud-ready and several that have been created with the aim of working in the cloud. For example, CloVR for automated sequence analysis can utilize cloud computing resources. CloVR is implemented as a single portable virtual machine (VM) that provides several automated analysis pipelines for microbial genomics, including 16S, whole genome, and metagenome sequence analysis [2]. Cloud-based CloVR was developed for Amazon EC2 and automatically provisions additional VM instances if the computational process requires this. Hydra [16] is an example of a cloud-ready tool that uses Hadoop and MapReduce in the identification of peptide sequences from spectra in mass spectrometry. Cloud BioLinux [14] is a publicly accessible virtual machine (VM) that enables scientists to quickly provision on-demand infrastructures for high-performance bioinformatics computing using cloud platforms. Users have instant access to a range of pre-configured command line and graphical software applications, including a full-featured desktop interface, documentation, and over 135 bioinformatics packages for applications including sequence alignment, clustering, assembly, display, editing, and phylogeny [14]. In the area of protein structure similarity searching, it is worth noting the work [11] by Che-Lun Hung and Yaw-Ling Lin and the PH2 system [8]. The authors of the first paper present a method for protein structure alignment and their own refinement algorithm, implemented in Hadoop and deployed on a virtualized computing environment. The PH2 system stores PDB files in a replicated way on the Hadoop Distributed File System and then allows SQL queries concerning 3D protein structures to be formulated. Cloud-based solutions for bioinformatics usually relate to problems that require increased computational resources. As we know from the previous section, protein 3D structure similarity searching is one of the computationally complex and time-consuming processes. 3D protein structure similarity searching requires scalable platforms that would allow accelerating the computations related to the process. Cloud computing provides such a kind of scalable platform for high-performance computing.
4.2 Cloud4PSi for 3D Protein Structure Alignment
One of the few systems in the world that utilize the cloud computing architecture to perform 3D protein structure similarity searching is Cloud4PSi (Fig. 4.4). The system was developed in the Institute of Informatics at the Silesian University of Technology in Gliwice, Poland, within the Cloud4Proteins group. The development works
Fig. 4.4 Cloud4PSi Web site (originally created within [13] with later improvements)
were started in 2013. The main constructors of the system were Artur Kłapciński, Bożena Małysiak-Mrozek, and me. Artur Kłapciński was my student and assistant in this project at the time, and further works were continued by me and my research group within the Microsoft Azure for Research Award grant for the project entitled Cloud4PSi: Cloud Computing in the Service of 3D Protein Structure Similarity Searching, sponsored by Microsoft Research in the USA. These works resulted in Artur's Master thesis [13] and our joint papers [22, 24]. The Cloud4PSi system can be scaled out (horizontal scaling) and scaled up (vertical scaling) in the Microsoft Azure public cloud. Scaling up allows for the expansion of computational resources, like increasing the number of processor cores, adding more memory, or moving the workload to a computation unit possessing better performance parameters. Horizontal scaling, or scaling out, is achieved by increasing the number of the same units and appropriate allocation of tasks among these units. Microsoft Azure allows both types of scaling to be combined. It is worth noting that in the case of Cloud4PSi, vertical scalability required designing and implementing the application code in such a way that it utilizes the many processing cores available after scaling the system up. For horizontal scalability, the Cloud4PSi code had to be properly designed in order to allow appropriate scheduling of computations and the division of tasks between computation units.
Fig. 4.5 Interaction of the user with the Cloud4PSi system. Typical use cases for similarity searching and scaling the system
4.2.1 Use Case: Interaction with the Cloud4PSi
Users may interact with the Cloud4PSi system on two levels. They may generate requests for similarity searching (a typical end user), or they may configure Cloud4PSi (an administrator), e.g., when they want to scale up or scale out the system (Fig. 4.5). Execution of the similarity searching and displaying its results remains the basic scenario implemented by the Cloud4PSi. The process includes entering a query protein structure through a dedicated Web site, either by providing the PDB ID code with or without the chain identifier, or by uploading the user's structure from a local hard drive to the Cloud4PSi storage system. Users may also choose one of the available algorithms for similarity searching (jCE, jFATCAT-rigid, jFATCAT-flexible) and define parameters of the process if they do not want to use default values.
A special token number is generated for each search submission. The token number can be used to get partial or full results of the submitted search after some time. Users do not have to follow changes on the Web site, since the similarity searching can be a long-running process. They can return to the Web site at any convenient moment and check whether the process has already completed. The configuration and scaling of the Cloud4PSi are mainly reserved for advanced users of the system and can be performed outside of the Cloud4PSi itself, e.g., in the Microsoft Azure management portal. Cloud4PSi can be scaled up by raising the capabilities of compute units (see Sect. 3.2) or scaled out by adding more searching instances.
4.2.2 Architecture and Processing Model of the Cloud4PSi
As we know from Chap. 3, cloud services developed for the Microsoft Azure platform are composed of a set of roles performing some predefined tasks. The breakdown of the roles depends on the process that is implemented and delivered by the cloud-based application. Cloud4PSi has been developed in such a way that it consists of several types of roles and storage modules responsible for gathering and exchanging data between computing units (Fig. 4.6). The set of roles R_C4 working in the Cloud4PSi system is defined as follows:

$$R_{C4} = \{r_W\} \cup \{r_M\} \cup R_S, \qquad (4.3)$$
where r_W is the Web role responsible for the interaction with Cloud4PSi users, r_M is the Manager role that distributes requests received from the Web role and prepares the workload for Searcher role instances, and R_S is a set of Searcher role instances that perform parallel similarity searching.
4.2.2.1 Web Role
Cloud4PSi allows users to execute similarity searching through a dedicated Web site. The Web site is provided by the Web role. The Web role provides the graphical user interface (GUI) and has an additional logical layer for event handling. Through the Web role, users can initiate similarity searches or receive the results of the searches. The logical layer is responsible for converting parameters received from the user into the form of a request message that is passed to the Manager role through the Input queue (Fig. 4.6). The Web role has access to the Storage Table that provides results of the ongoing or finished similarity searches. It also has access to the virtual hard drive (VHD) for PDB files (protein structures), when the user decides to send his or her own PDB file to be compared by the Cloud4PSi. Submission of the search request requires that the user inputs a query protein structure and chooses one of the algorithms for similarity searching (jCE, jFATCAT-rigid, or jFATCAT-flexible).
Fig. 4.6 Architecture of the Cloud4PSi system for 3D protein structure similarity searching, developed and deployed in Microsoft Azure cloud: users interact with the system through the Web role, search requests are placed in the Input queue; Manager role consumes search requests, divides the search job into smaller tasks, and distributes the tasks to instances of the Searcher role through the Output queue; instances of the Searcher role perform searches for proteins in data packages specified in task descriptors. Protein structures are taken from Storage BLOBs; results of the partial searches are stored in the Storage Table
Additionally, the user can specify parameters of the search process. The query protein structure can be defined by the PDB ID code (the unique code of protein structures in the PDB repository) or by uploading the user's protein structure from the local hard drive. When the user starts the search process, the Web role behaves according to the pseudocode of Algorithm 1. It generates and returns a token number for the search request (line 2). The user may return to the Web site after some time with this token number and check whether the search process has already finished. If the user chooses to upload his or her own structure as the query protein, the Web role mounts the virtual hard drive in the full mode, uploads the structure to the hard drive, and encodes the locator of the structure in the search request message (lines 3-6). If the user chooses to search for similarities by providing the PDB ID code of a query protein structure that exists in the repository, the Web role encodes the code in the search request message (lines 7-8). Then, the Web role generates the search request message that is sent to the Input queue (line 10). The format of the input message is presented in Listing 4.1.
Algorithm 1 Web role: Search request processing algorithm
1: for each search request do
2:   Generate and return a token number
3:   if user uploads own protein structure then
4:     Mount virtual hard drive (VHD) in the full mode
5:     Upload user's query protein structure
6:     Encode the locator of user's protein in the search request message
7:   else
8:     Encode PDB ID code of the query protein in the search request message
9:   end if
10:  Enqueue the search request in the Input queue
11: end for
CloudQueueMessage searchRequest = new CloudQueueMessage(
    token.ToString() + "|" +
    package_size + "|" +
    repository_size + "|" +
    pdb_id + "|" +
    upload_name + "|" +
    messageTime + "|" +
    algorithm + "|" +
    ByChain.Checked.ToString());
inputQueue.AddMessage(searchRequest);
Listing 4.1 Format of the input message containing a search request [13].
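On the receiving side, the Manager role has to decode such a message back into its components. A minimal sketch of how this could be done with the same queue API is shown below; the variable names are illustrative assumptions and are not taken from the Cloud4PSi source code.

// Illustrative sketch: decoding the eight-field search request from
// Listing 4.1 (names are assumptions).
CloudQueueMessage msg = inputQueue.GetMessage();
if (msg != null)
{
    string[] fields = msg.AsString.Split('|');
    string token          = fields[0];
    int    packageSize    = int.Parse(fields[1]);
    int    repositorySize = int.Parse(fields[2]);
    string pdbId          = fields[3];
    string uploadName     = fields[4];
    string messageTime    = fields[5];
    string algorithm      = fields[6];
    bool   byChain        = bool.Parse(fields[7]);

    inputQueue.DeleteMessage(msg); // remove the consumed request
    // ... divide the repository into packages and enqueue task descriptors
}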
Messages in the Azure queueing system may have text or binary formats. In Cloud4PSi, we have chosen a text format and separated the components of the search request by the | symbol. The message consists of eight components: a randomly generated search request ID (token) that is returned to the user; the number of proteins in the package that should be compared by each instance of the Searcher role in a single task (package_size); the number of proteins from the repository that will be used in the whole comparison process (repository_size, used for performance tests); the pdb_id identifier of the user's query protein to quickly locate it in the repository; the locator of the query protein structure that was uploaded from the user's computer (upload_name); a marker defining the time of dispatch of the search request (messageTime, used for time statistics); the name of the algorithm used for similarity searching (algorithm); and finally, the information whether the comparison is performed using the whole protein structures or just between selected chains (ByChain). Some of the components (e.g., token) are used later while accessing the Storage Tables service to identify the specific outcome of the search job. The Web role also allows users to check the results of the similarity searching through an appropriate Web form. The Web role asks the user to provide the token number that was generated during the execution of the process (Fig. 4.7). Results related to the given token number are then retrieved from the Storage Tables service and are displayed to the user. These results include identifiers of protein structures (sorted by a chosen similarity measure) and similarity measures specific to the similarity algorithm, e.g., Z-score,
Fig. 4.7 Retrieving similarity searching results from the Cloud4PSi. Reproduced from [13] with permissions
RMSD, alignment length, P-value, TM-score, and others. The user may also display a detailed structural alignment report for a pair consisting of the query protein structure and a selected database structure returned by the Cloud4PSi. A sample detailed structural alignment report is shown in Fig. 4.8.
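Result retrieval by token can be sketched as a simple lookup in Azure Table storage. The entity schema below is an assumption made for illustration; the actual Cloud4PSi table layout is not shown in the text.

using System.Collections.Generic;
using Microsoft.WindowsAzure.Storage.Table;

// Hypothetical entity describing one comparison outcome.
public class ResultEntity : TableEntity
{
    public string PdbId { get; set; }  // identifier of the compared structure
    public double Score { get; set; }  // e.g., Z-score or TM-score
}

public static class ResultReader
{
    // Returns all results stored for a given search token, assuming the
    // token is used as the PartitionKey of the result rows.
    public static IEnumerable<ResultEntity> GetResults(CloudTable table,
                                                       string token)
    {
        string filter = TableQuery.GenerateFilterCondition(
            "PartitionKey", QueryComparisons.Equal, token);
        return table.ExecuteQuery(new TableQuery<ResultEntity>().Where(filter));
    }
}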
4.2.2.2 Manager Role
The Manager role has a specific functionality as one of the Worker roles in the Cloud4PSi. It schedules tasks for execution on idle Searcher roles based on requests received from the Web role, passes parameters, arranges the scope of the similarity searching, and manages the associated computational load between instances of the Searcher role. The Manager is also responsible for the preparation of the read-only virtual hard drive located on Azure BLOB Storage, which stores all candidate protein structures and, if uploaded, the given query protein structure, all used by instances of the Searcher role when performing parallel structural alignments. The Manager role implements the pseudocode of Algorithm 2. The role listens for search requests in the Input queue (line 2). Incoming requests are immediately captured by the Manager Worker role, which divides the whole range of molecules
Fig. 4.8 Detailed report showing structural alignment of sample protein structures [PDB ID: 1BSN.A] and [PDB ID: 1EWC.A]. Parts of chains marked using dark green and light green colors reflect regions of structures that correspond to each other. A vertical line between residues reflects structural equivalence and identical residues, a colon means structural equivalence and similar residues, and a dot means structural equivalence, but not similar residues
into packages (lines 3–5). Packages contain a small number of protein structures from the main repository R of PDB files that should be compared with the user’s query protein in a single task by a single Searcher role.
Algorithm 2 Manager role: Search request processing and task creation
1: while true do
2:   Check messages in the Input queue
3:   if exists a message then
4:     Retrieve the message (search request) and extract parameters
5:     Divide repository R into smaller packages according to the defined package size
6:     for each package Pi ⊂ R do
7:       Create task descriptor as output message
8:       Encode package metadata and other parameters in the output message
9:       Enqueue the output message in the Output queue
10:    end for
11:  end if
12: end while
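Lines 5-9 of Algorithm 2 can be sketched as a simple loop that shifts the start point by the package size for each consecutive task descriptor (the formatting of the descriptor itself is shown later in Listing 4.2). The variable names below are illustrative assumptions.

// Illustrative sketch of lines 5-9 of Algorithm 2: the repository of
// repositorySize structures is divided into packages of packageSize
// structures, and one task descriptor per package is enqueued.
int n = repositorySize / packageSize;  // number of packages and tasks
for (int i = 0; i < n; i++)
{
    int startPoint = i * packageSize;  // first structure of package Pi
    string descriptor = string.Join("|", token, packageSize, repositorySize,
                                    pdbId, uploadName, startPoint,
                                    messageTime, snapshotUri,
                                    algorithm, byChain);
    outputQueue.AddMessage(new CloudQueueMessage(descriptor));
}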
Scheduling computations through proper orchestration of task execution is an important part of every system that processes large data sets on a distributed computer infrastructure. The parameter-sweep application model has emerged as a killer application model for developing high-throughput computing (HTC) applications for processing on highly distributed computing environments [1], like Grids or Clouds. This model essentially assumes processing different data sets by n independent tasks on s distributed compute units, where n is typically much larger than s. This high-throughput parametric computing model is simple, yet powerful enough to enable efficient distributed execution of computations performed in various areas related to bioinformatics and life sciences. The parameter-sweep computing model is sufficient for pleasingly parallel problems, especially if there are no additional constraints, like application execution makespan, computation budget limitations, or hardware limitations [24]. Cloud4PSi also uses the parameter-sweep processing model. Let $R = \{R_1, R_2, \ldots, R_{|R|}\}$ be the repository of |R| candidate protein structures to be compared and aligned by s Searcher roles ($|R| \gg s$), and $P = \{P_1, P_2, \ldots, P_n\}$ be a finite set of n packages ($n < |R|$, $n \gg s$) that satisfies the following three conditions:

$$\forall_{P_i \in P}\; P_i \subseteq R, \qquad (4.4)$$

$$\exists_{f_1}\; R \xrightarrow{f_1} P, \qquad (4.5)$$

$$|P_i| = \text{const}. \qquad (4.6)$$
The last condition states that the number of proteins in the package P_i is fixed, and we can state that

$$R = \bigcup_{i=1}^{n} P_i \quad \text{and} \quad \sum_{i=1}^{n} |P_i| = |R|, \qquad (4.7)$$

where P_i is the ith package of protein structures, and n is the number of packages that should be processed by all s instances of the Searcher role.
Fig. 4.9 Format of the task descriptor sent to the Output queue: search request ID | package size |Pi| | repository size |R| | query protein | upload flag | start point | message time | snapshot URI | algorithm | byChain flag
Packages satisfy the following relationship:

$$\forall_{1 \le i, j \le n}\; i \ne j \implies P_i \cap P_j = \emptyset. \qquad (4.8)$$

The number of packages n determines the number of all search tasks $T = \{T_1, T_2, \ldots, T_n\}$ that will be executed on Searcher roles $R_S = \{r_{S_1}, r_{S_2}, \ldots, r_{S_s}\}$. The number of packages and tasks (n) depends on the number of protein structures in the repository R and the number of protein structures in package P_i:

$$n = \frac{|R|}{|P_i|}. \qquad (4.9)$$
The number of proteins in the package P_i should be chosen experimentally in order to minimize the whole computation time when processing search requests. Descriptors of successive tasks with sweep parameters that determine the package content are sent by the Manager role as messages to the Output queue, where they wait to be processed (lines 6-9). In such a way, the Manager role creates the search tasks T that are scheduled for execution. Given that the number of protein structures in each package is fixed, we call these packages fixed number of proteins packages, and the scheduling scheme will be called the fixed number of proteins package-based (FNPP-based) scheduling scheme. Since protein structures have various sizes (various numbers of chains, amino acids in each chain, and atoms), the sizes of packages may vary significantly, and the processing time of each package may differ. The format of the message containing the task descriptor that is sent to the Output queue is shown in Fig. 4.9. The task descriptor starts with the search request ID (also called the job identifier or token), which is a unique identifier of the search request the task belongs to. The identifier is followed by the package size expressed as the number of proteins contained in the package (|P_i|). The next field, repository size, contains the information on the number of protein structures in the whole repository (|R|). The field query protein contains the PDB ID code of the query protein, if it is also present in the repository, or the name of the uploaded file with macromolecular data describing the input protein structure. The following upload flag field tells whether the query protein was uploaded as a file (True) or is one of the proteins in the repository (False). The next field, start point, defines the sweep starting parameter for each instance of the Searcher role that consumes the task: Searchers process |P_i| successive protein structures from the repository (by aligning the query protein structure to candidates from the repository) starting from the start point. In Fig. 4.10, each square can be interpreted as a part of the repository (a package) that will be retrieved by a single instance of the Searcher role while executing the task whose description is retrieved from the Output queue. For each task descriptor that is generated by the Manager role, the
Fig. 4.10 Division of the repository of |R| proteins into n packages of the size |Pi | and the start point for each instance of the Searcher role
start point is incremented by the value of the package size. The message time field (Fig. 4.9) stores the information on the time the task descriptor was created, which is needed to determine the task processing time in the system. Additionally, the task descriptor contains the URL address of the shared virtual hard drive image that contains the repository of protein structures (snapshot URI field), the encoded name of the alignment algorithm that should be used when comparing proteins (algorithm field), and the byChain flag that informs the Searcher whether to align single chains of protein structures or whole molecules. Task descriptors are encoded as messages, where successive fields are separated by the | character. A sample message from the Output queue is presented below:

3bb398c1-e8aa-24ad-0c3a-8abab76c7572|10|100000|1a0t.A|True|20|3/10/2018 9:38:45 AM|http://prot.blob.core.windows.net/drives/pdb.vhd?snapshot=...|1|True
Sample code responsible for formatting the task descriptor as a message sent to the Output queue is shown in Listing 4.2.
CloudQueueMessage taskDescriptor = new CloudQueueMessage(
    token + "|" +
    package_size + "|" +
    repository_size + "|" +
    pdb_id + "|" +
    upload_name + "|" +
    start_point + "|" +
    messageTime + "|" +
    snapshotUri + "|" +
    algorithm + "|" +
    byChain);
outputQueue.AddMessage(taskDescriptor);
Listing 4.2 Sample code responsible for formatting the task descriptor as a message sent to the Output queue [13].
4.2.2.3 Searcher Roles
Searcher roles, which are also Worker roles, bear the computational load associated with the process of protein comparison. Instances of the Searcher role receive from the Manager role messages with the information on the scope of the main tasks that should be performed by the particular Searcher, the name of the comparison algorithm that should be used, and a list of files from the virtual hard drive (the package) that should be compared to the query protein structure. The Searcher roles execute the search task according to the task specification encoded in the retrieved message and perform similarity searches for the query protein and proteins from the package defined in the task specification. Finally, instances of the Searcher role are responsible for writing the results to a table in the Storage Table service. The set of instances of the Searcher role R_S is defined as follows:

$$R_S = \{r_{S_i} \mid i = 1, \ldots, s\}, \qquad (4.10)$$
where r_{S_i} is a single instance of the Searcher role, and s is the number of Searchers working in the system. Each instance r_{S_i} of the Searcher role processes a package of proteins by comparing the query protein structure to the candidate structures whose identifiers are contained in the package (Algorithm 3). Identifiers of the query protein structure and candidate structures are passed in the task descriptor message, together with the name of the comparison algorithm (mapped to an integer) that should be used (line 4).

Algorithm 3 Searcher role: Task processing algorithm
1: while true do
2:   Check messages in the Output queue
3:   if exists a message then
4:     Retrieve the message and extract parameters
5:     Get query protein structure SQ from virtual hard drive
6:     for each database structure SD ∈ Pi do
7:       Get the candidate database structure SD from repository R on VHD
8:       Compare structures SQ, SD with the use of selected algorithm
9:       Collect comparison results in a dedicated array
10:    end for
11:    Save collected results in the Azure Storage Table
12:  end if
13: end while
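A minimal sketch of this loop, using the queue API from the earlier listings, could look as follows. The Compare and SaveToTable calls, the repository array, and the queryStructure variable are hypothetical placeholders for the jCE/jFATCAT invocation and the table write; the sketch assumes the System.Threading and System.Collections.Generic namespaces.

// Illustrative sketch of Algorithm 3 (names are assumptions).
while (true)
{
    CloudQueueMessage task = outputQueue.GetMessage();
    if (task == null)
    {
        Thread.Sleep(1000);  // no tasks yet; poll again (line 2)
        continue;
    }

    string[] p = task.AsString.Split('|');   // line 4
    int packageSize = int.Parse(p[1]);
    int startPoint  = int.Parse(p[5]);

    var results = new List<ResultEntity>();
    for (int i = startPoint; i < startPoint + packageSize; i++)
    {
        // Compare() stands for the selected jCE/jFATCAT alignment (line 8)
        results.Add(Compare(queryStructure, repository[i]));
    }
    SaveToTable(results);            // line 11
    outputQueue.DeleteMessage(task); // remove the completed task
}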
If the Searcher role operates on a compute node (virtual machine) possessing many CPU cores, all cores of the compute node are used (the task is parallelized inside the compute node). Candidate protein structures described by a task descriptor message are taken from the read-only virtual hard drive (VHD) located in the Storage
Drive service of the Azure cloud (lines 6-7). After comparing all structures in the package (lines 8-9), the outcomes of the comparison, i.e., PDB identifiers of structures and similarity measures, are sent to the table with results available through the Azure Storage Tables service (line 11). Instances of the Searcher role work in a loop. After processing a task, the role returns to listening for and capturing messages from the Output queue (lines 1-4). Successive tasks are processed until there are no more messages in the queue.
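Persisting one comparison outcome (line 11 of Algorithm 3) could be sketched as follows, reusing the hypothetical ResultEntity shown earlier. The token serves as the PartitionKey, so that the Web role can later retrieve all results of the search job with a single partition query.

// Illustrative sketch of writing one outcome to the Azure Storage Table.
var entity = new ResultEntity
{
    PartitionKey = token,           // search job identifier
    RowKey       = candidatePdbId,  // identifier of the compared structure
    Score        = similarityScore  // e.g., Z-score or TM-score
};
resultsTable.Execute(TableOperation.Insert(entity));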
4.2.2.4 Other Components of the Cloud4PSi
The computing architecture of the Cloud4PSi system for protein structure similarity searching shown in Fig. 4.6 consists of the three types of roles already mentioned and additional modules responsible for storing and exchanging data. These are the following [13]:
• Table (from the Azure Storage service), which stores the results of similarity searches, time stamps of the key moments of the application run (needed when studying the performance of the system), and technical parameters used globally by all roles.
• A pageable BLOB (from the Azure Storage service) that contains the virtual hard drive (VHD). The VHD is mounted by the Web role in the full mode, if the user chooses to upload his or her own protein PDB file as a query structure, or in the read-only mode as a current image of the PDB repository for instances of the Searcher role that perform parallel, distributed similarity searches.
• Input queue, which collects similarity search requests from the Web role and provides these requests to the Manager role, where they are distributed among the instances of the Searcher role.
• Output queue, which stores messages with descriptors of tasks and the information on what part of the PDB repository should be processed by the instance of the Searcher role that receives the particular message, together with comparison parameters.
The flow of messages in the entire system is collectively presented in the UML sequence diagram in Fig. 4.11. A sketch of how these storage components could be initialized is shown below.
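All four components can be obtained from a single storage account with the Azure Storage client library. The connection string and the resource names below are assumptions made for illustration, not the actual names used by Cloud4PSi.

using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;
using Microsoft.WindowsAzure.Storage.Queue;
using Microsoft.WindowsAzure.Storage.Table;

// Illustrative initialization of the storage components (names assumed).
CloudStorageAccount account = CloudStorageAccount.Parse(connectionString);

CloudQueueClient queueClient = account.CreateCloudQueueClient();
CloudQueue inputQueue  = queueClient.GetQueueReference("input");
CloudQueue outputQueue = queueClient.GetQueueReference("output");

CloudTable resultsTable =
    account.CreateCloudTableClient().GetTableReference("results");

// BLOB container holding the page BLOB with the VHD image of the repository
CloudBlobContainer drives =
    account.CreateCloudBlobClient().GetContainerReference("drives");

inputQueue.CreateIfNotExists();
outputQueue.CreateIfNotExists();
resultsTable.CreateIfNotExists();
drives.CreateIfNotExists();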
4.2.3 Scaling Cloud4PSi
Microsoft Azure allows users to create and publish applications that can be scaled up (vertical scaling) or scaled out (horizontal scaling). Recall that scaling up means raising the computational capabilities of compute units, like increasing the number of processor cores, adding more memory, or moving the workload to a computation unit possessing better performance parameters. Horizontal scaling, or scaling out, means increasing the number of the same units and distributing tasks between these entities.
Fig. 4.11 Flow of information in the Cloud4PSi during the execution of protein structure similarity searching
Cloud4PSi can be scaled out and scaled up during the similarity searching process. Scaling mainly applies to the Searcher roles. When scaling out, we change the value of s in Eq. (4.10), i.e., the number of instances of the Searcher role. When scaling up, we change the size of the Searcher role; for example, for the basic A-series virtual machines for Cloud services:

$$size(r_{S_i}) \in \{XS, S, M, L, XL, A5, A6, A7\}, \qquad (4.11)$$
where XS denotes the ExtraSmall size, S the Small size, M the Medium size, L the Large size, and XL the ExtraLarge size. Sizes of virtual machines for the Searcher roles were described in Sect. 3.2. In the Cloud4PSi, we made the following assumption regarding all Searcher roles:

$$\forall_{r_{S_i}, r_{S_j} \in R_S,\; i, j \in \{1, \ldots, s\},\; i \ne j}\quad size(r_{S_i}) = size(r_{S_j}). \qquad (4.12)$$
The number of instances s of the Searcher role depends only on the administrator’s choice and the range of services and resources that are provided by the Microsoft company as the owner of the Microsoft Azure cloud.
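In practice, for Azure Cloud Services both kinds of scaling are expressed in the service's configuration files. The fragments below are illustrative only (the role name is an assumption): scaling out changes the instance count in ServiceConfiguration.cscfg, while scaling up changes the vmsize attribute in ServiceDefinition.csdef and, as discussed later, requires re-publishing the service.

<!-- ServiceConfiguration.cscfg: scaling out (illustrative role name) -->
<Role name="Searcher">
  <Instances count="16" />
</Role>

<!-- ServiceDefinition.csdef: scaling up (illustrative role name) -->
<WorkerRole name="Searcher" vmsize="Large">
  <!-- endpoints, imports, etc. -->
</WorkerRole>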
4.3 Scalability of the Cloud4PSi
In order to assess the scalability of the system, we conducted a series of tests for different configurations of the cloud and various parameters of the search process. The Web role and the Manager role were running on computational units of the Small size. The sizes of the virtual machines for instances of the Searcher role varied in different experiments. While testing performance, we used ten selected query protein structures with lengths between 54 and 754 amino acids. These were randomly chosen molecules that represent different classes according to the SCOP classification [25], i.e., all α, all β, α + β, α/β, α and β, coiled coil proteins, and others. Here, we present results for a sample query protein structure, 1A0T, chain P. This chain is one of three chains that can be distinguished in the protein structure (Fig. 4.12); it has a length of 413 amino acids, contains mainly β-strands and some α-helices in its fold [5], and represents a medium-sized protein in the repository used in our tests. Similarity searching was carried out against a repository containing 1,000 different structures from the Protein Data Bank and using both algorithms, jCE and
Fig. 4.12 3D structure of the sucrose-specific porin ScrY from Salmonella typhimurium and its complex with sucrose. Displayed as cartoons in RasMol [29]. Protein PDB ID: 1A0T [5]
jFATCAT (we used jFATCAT-flexible; results for jFATCAT-rigid were similar). In particular, we have analyzed:
• the horizontal scalability of the similarity searching in the Cloud4PSi, i.e., the efficiency of the computational process depending on the number of instances of the Searcher role,
• the vertical scalability of the similarity searching in the Cloud4PSi, i.e., the efficiency of the computational process depending on the size of instances of the Searcher role,
• the efficiency of the similarity searching in the Cloud4PSi depending on the size of the data package assigned to a task.
4.3.1 Horizontal Scalability
The dynamics of the horizontal scaling in the Azure cloud was assessed on the basis of a series of experiments evaluating the rate of obtaining complete results of protein similarity searches. Experiments were performed for various numbers of instances of the Searcher role (from 1 to 30 instances) and for both algorithms. Instances of the Searcher role were running on virtual machines of the Small size. The package size was set to 10 protein structures on the basis of experiments (see Sect. 4.3.3). In Fig. 4.13, we can observe the total execution time as a function of the number of instances of the Searcher role for the jCE and jFATCAT algorithms. Total execution time means the period of time needed to complete the search process for the given query molecule. We noted that by increasing the number of instances of the Searcher role
Fig. 4.13 Execution time as a function of the number of instances of the Searcher role for jCE and jFATCAT algorithms
Fig. 4.14 Acceleration of the similarity searching as a function of the number of Searcher roles for jCE and jFATCAT algorithms
from 1 to 30 for the given protein structure, the total execution time of the search process decreased from 15,882 s (approx. 4 h and 25 min) to 598 s (approx. 10 min) for the jFATCAT, and from 11,122 s (approx. 3 h and 5 min) to 426 s (approx. 7 min) for the jCE. The decrease in the execution time was expected and was inversely proportional to the degree of parallelism. The average execution time per protein was at the level of 16 s for jFATCAT and 11 s for jCE, regardless of the number of instances of the Searcher role participating in the process. The average execution time per package for jFATCAT oscillated between 154 and 159 s for various numbers of instances of the Searcher role participating in the process, and for jCE it was at the level of 109 s regardless of the number of instances of the Searcher role. In Fig. 4.14, we show the acceleration of the similarity searching as a function of the number of Searcher roles for both algorithms. We noted that employing two instances of the Searcher role accelerated the process almost twofold. Adding more Searcher roles additionally accelerated the process. However, we noted a decrease in the dynamics of the acceleration. By increasing the number of Searcher roles from 1 up to 30, we gained an acceleration ratio at the level of 26.11 for the jCE and 26.46 for the jFATCAT. Although, based on the execution time measurements, we noticed that the jFATCAT algorithm is 25-30% slower than the jCE, we can see that the n-fold speedups are almost identical for both algorithms. This indicates that the effectiveness of the scaling process does not depend on the algorithm and the computational procedure, but there are some other factors slowing down the dynamics of the acceleration. Among these factors, we can distinguish the communication between roles through the queueing system, the necessity of retrieving the query protein structure from
Fig. 4.15 Comparison of the execution time of the progressive realization of the similarity searching for one (blue) and eighteen (red) instances of the Searcher role (jCE algorithm). Reproduced from [13] with permissions
Fig. 4.16 Comparison of the execution time of the progressive realization of the similarity searching for one (blue) and eighteen (red) instances of the Searcher role (jFATCAT algorithm). Reproduced from [13] with permissions
the repository on the virtual hard drive, and the necessity to store partial results of the similarity searches for each processed package in the Azure Storage Table. In Figs. 4.15 and 4.16, we can observe a comparison of the rate at which the same percentage of the completion of the process is achieved using one or eighteen instances of the Searcher role. The test was performed by Artur Kłapciński, my assistant in this project. These graphs show that the horizontal scaling is more time-effective for larger computational problems than for smaller ones (time differences increase with the level of advancement of the process, i.e., the number of candidate protein structures processed). On this basis, we could conclude that it should be more profitable to
scale out the system if the similarity searches were performed against the whole PDB repository, i.e., a repository containing tens of thousands of protein structures.
4.3.2 Vertical Scalability
The basic offer of Microsoft Azure provides different sizes of virtual machines for the Web and Worker role instances. These machines differ in the size of available memory and the number of virtual CPU cores. For example, the ExtraSmall size is generally used when testing applications that will work in the cloud, as these are the cheapest compute units available in the cloud. However, beginning from the Small size up to the ExtraLarge size, we can observe a gradual doubling of memory capacity and of the number of cores. Detailed values were presented in Sect. 3.2. While testing vertical scalability, we started with a configuration consisting of two ExtraSmall instances of the Searcher role. Then, while keeping the number of instances constant, we increased the size of the virtual machines up to ExtraLarge. This gradually increased the number of computing cores available to instances of the Searcher role from 2 to 16, and the amount of total available RAM from 1.5 GB (2 × 0.768 GB) to 28 GB (2 × 14 GB). During the whole experiment, the package size was set to 10 protein structures.
Fig. 4.17 Execution time as a function of the size of instances of the Searcher role for jCE and jFATCAT algorithms. Reproduced from [13] with permissions
In Fig. 4.17, we can observe the total execution time as a function of the size of instances of the Searcher role for the jCE and the jFATCAT algorithms. Tests were performed by my assistant in the project, Artur Kłapciński (within his Master thesis [13] and our joint publication [22]), who kindly agreed to reproduce the results in this book. The execution time was measured for the given query molecule. We noticed that by increasing the size of instances of the Searcher role from ExtraSmall to ExtraLarge for the given protein structure, the total execution time of the search process decreased from 8,037 s (approx. 2 h and 14 min) to 1,953 s (approx. 33 min) for the jFATCAT, and from 5,645 s (approx. 1 h and 34 min) to 986 s (approx. 16 min) for the jCE. Similarly to the horizontal scaling, the decrease in the execution time was expected and was inversely proportional to the degree of parallelism: larger roles have more CPU cores, so computations could be parallelized on multiple threads created on each instance of the Searcher role, according to the number of cores. The average execution time per protein was at the level of 20 s for jFATCAT (with a standard deviation of 5.46 s) and 12 s for jCE (with a standard deviation of 1.17 s), and the time was dependent on the size of instances of the Searcher role participating in the process. The average execution time per package oscillated between 154 s (for the Small size) and 295 s (for the ExtraLarge size) for the jFATCAT, and between 109 s (for the Small size) and 136 s (for the ExtraLarge size) for the jCE. The acceleration of the similarity searching performed with the use of both algorithms for various sizes of computing instances is shown in Fig. 4.18. We can observe that, apart from the first upgrade from the ExtraSmall to the Small size, the acceleration trend is similar to that presented for horizontal scaling. For the jCE algorithm, doubling the computing resources, although done in a different way than for horizontal scaling, accelerates the similarity searching almost twofold; similarly for the jFATCAT algorithm when upgrading from the Small size to Medium. The dynamics of the acceleration growth slows down for the Large and ExtraLarge sizes, which is especially visible for the jFATCAT algorithm. We measured that searches performed on multiple cores of ExtraLarge-sized instances with the use of jFATCAT are up to 50% slower than searches performed with the use of the jCE. Finally, during our tests on two ExtraLarge instances of the Searcher role, we achieved a 5.73-fold speedup of the similarity searching for the jCE and a 4.12-fold speedup for the jFATCAT over the initial configuration consisting of two ExtraSmall instances of the Searcher role. In Figs. 4.19 and 4.20, we can observe a comparison of the rate at which the same level of advancement of similarity searching is achieved by using Small and ExtraLarge instances of the Searcher role. These figures also show a general, linear tendency of the processing time while using particular sizes of the Searcher role. The results can be extrapolated to larger repositories of protein structures, which is very important when planning the use of such an architecture in future works. We can also notice some distortions repeating at the same progress degrees. They probably resulted from the fact that the processing times of particular packages differ, depending on the sizes of structures in the package.
Moreover, all experiments were performed on the same repository of sample protein structures, which explains why these distortions occur at the same places.
Fig. 4.18 Acceleration of the similarity searching as a function of the size of instances of the Searcher role for jCE and jFATCAT algorithms. Reproduced from [13] with permissions
Fig. 4.19 Comparison of the execution time of the progressive realization of the similarity searching for Small (blue) and ExtraLarge (red) instances of the Searcher role (jCE algorithm). Reproduced from [13] with permissions
Fig. 4.20 Comparison of the execution time of the progressive realization of the similarity searching for Small (blue) and ExtraLarge (red) instances of the Searcher role (jFATCAT algorithm). Reproduced from [13] with permissions
4.3.3 Influence of the Package Size
The package size determines how many proteins will be compared by every single instance of the Searcher role. Since instances of the Searcher role may handle various search requests at any time, they retrieve the query protein structure for each package of protein structures that must be processed. However, during our experiments, we noticed that the package size affects the average searching time per package. Therefore, while testing the system, we also checked how the assumed package size influences the execution time of the whole similarity searching. For this purpose, we configured the system with 4 Medium-sized instances of the Searcher role, each of which possessed 2 cores, which gave us 8 cores in total. Tests were performed again by my assistant Artur Kłapciński and published in [13, 24]. For both algorithms (jCE and jFATCAT), we measured the total execution time for the following package sizes: 1, 5, 10, 20, 50, 100, and 200 protein structures. In Fig. 4.21, we present the total execution time for various sizes of packages. The results confirmed that for small packages the execution time was shorter than for large packages, even though a larger number of small packages causes each instance of the Searcher role to repeat some fixed actions in its operating cycle more often, such as retrieving the query protein structure from the repository, creating an array for results, saving results, and others, which certainly negatively influences the processing time. However, smaller packages with fewer structures processed in one task seem to be a more flexible solution. They allow for more effective management of free processing power in the system. Moreover, for small packages, the distribution of packages among Searchers is more balanced. For packages containing 20 protein structures or more, the total
Fig. 4.21 Total execution time as a function of package size for jCE and jFATCAT algorithms. Results obtained for 4 Medium-sized instances of the Searcher role. Reproduced from [13, 24] with permissions
execution time gradually increased due to the lower flexibility in allocating packages to individual instances of the Searcher role. Large packages decrease the number of packages and tasks to be processed for a single search request and cause one Searcher to perform its work for a long time, while other Searchers may be idle because of the absence of further structures to be processed. We found that the most effective searches were achieved for packages of 5 structures (2,217 s for the jFATCAT, 1,476 s for the jCE) or 10 structures (2,200 s for the jFATCAT, 1,496 s for the jCE). For 200 structures per package, the execution time increased to 3,826 s for the jFATCAT and 2,617 s for the jCE.
4.3.4 Scaling Up or Scaling Out?
Administrators of the cloud application may choose to scale the system up or out. The choice of the scaling technique for 3D protein structure similarity searching depends on individual needs. For some users, the priority may be the speed at which they get the results of the process; for others, the cost of the process. If we assume that there is a stable group of users and searches are executed against a large repository of protein structures, the optimization of costs gains importance.
Table 4.1 Total execution time for corresponding numbers of processing CPU cores for horizontal and vertical scaling

Number of cores                        2           4            8           16
Configuration for horizontal scaling   2 × Small   4 × Small    8 × Small   16 × Small
Horizontal scaling - time (s)          5,601       2,848        1,452       782
Configuration for vertical scaling     2 × Small   2 × Medium   2 × Large   2 × ExtraLarge
Vertical scaling - time (s)            5,601       2,931        1,593       986
In this section, we compare both scaling approaches. For the available series and sizes of Microsoft Azure compute units (virtual machines), the processor clock rates do not change. Particular compute units differ in the number of CPU cores and the amount of available memory per instance of the Worker role. In order to compare both scaling techniques, we measured the total execution time for a varying number of processing CPU cores. We did it by controlling the sizes (for vertical scaling) and the number of instances (for horizontal scaling) of the Searcher role. We gathered results for system configurations providing the same number of processing cores in both scaling techniques. Results are presented in Table 4.1. We can notice that both scaling techniques proposed by the Microsoft Azure platform brought a similar increase in time efficiency, with a slight predominance of horizontal scaling (1,452 s versus 1,593 s for 8 cores, and 782 s versus 986 s for 16 cores). Scaling out, however, has another advantage, which is a higher flexibility in controlling the computing power; e.g., we can choose an odd number of virtual machines. Vertical scaling, with the exception of the use of a single role of the Small or ExtraSmall size, can only double the number of cores. Scaling out is not constrained in this way. In many cases, we can use a mixed approach by combining both scaling techniques. However, vertical scaling is more difficult to implement, because it requires a multithreaded implementation of the process that is executed in parallel on the computational unit. Moreover, changing the size of the virtual machine for the role requires a re-publication of the system in the Microsoft Azure cloud. In contrast, horizontal scaling can be configured at later stages of the application deployment and execution, after the application has been published in the cloud, and according to current needs. The application code must be properly designed in order to allow the division of tasks between computational units. At the moment, Microsoft Azure allows users to run dynamic scaling, which automatically adjusts the number of instances of the selected roles, depending on the level of Internet traffic or the demands of external users of the published application.
4.4 Discussion
Cloud4PSi represents a novel architectural approach in building cloud-based systems for protein structure similarity searching and bioinformatics by implementing a dedicated role-based and queue-based model. Most solutions developed so far are mainly based on pre-configured virtual machines. Their images can be set up in a cloud if a user wants to scale out the computational process. However, these solutions do not provide the full features of the SaaS layer. Cloud4PSi is a fully SaaS solution. It requires the user to get familiar just with the Web-based graphical user interface (GUI), and everything else is hidden under the GUI. From the viewpoint of the maintenance of the system, the role-based model applied in the Cloud4PSi provides higher portability (inside the same cloud provider) and significantly higher flexibility in the deployment of the Cloud4PSi on various operating systems. An interesting alternative for such a processing problem is provided by systems that are built on the Hadoop platform, like the one developed by Che-Lun Hung and Yaw-Ling Lin [11], HDInsight4PSi [20], and H4P [23]. These systems, however, represent a different approach to the distributed implementation of the similarity searching process, which is based on the MapReduce paradigm presented in Chap. 6. Cloud4PSi uses its own, dedicated scheduling technique with various types of roles and queues. Using queues has several advantages. Since queues provide an asynchronous messaging model, users of the Cloud4PSi need not be online at the same time. Queues reliably store requests as messages until the Cloud4PSi is ready to process them. Cloud4PSi can be adjusted and scaled out according to current needs. As the depth of the request queue grows, more computing resources can be provisioned. Therefore, such an approach allows saving money, taking into account the amount of infrastructure required to service the application load. Finally, the queue-based approach enables load balancing: as the number of requests in a queue increases, more Searcher roles can be added to read from the queue.
4.5 Summary

Cloud4PSi has been developed as a real Software as a Service solution. Its architecture has been adapted to the Microsoft Azure role-based and queue-based model, which brings many operational benefits. This places the system among the scalable, high-performance, and reliable solutions for protein similarity searching and function identification. Cloud4PSi benefits from the idea of cloud computing by utilizing computation units to scale the process of 3D protein structure similarity searching, a process that is time-consuming and very important from the perspective of structural bioinformatics, comparative genomics, and computational biology. In this chapter, we could observe that scaling the process in the cloud improves the efficiency of the search process without reducing the computational complexity of the alignment methods used.
For further reading on Microsoft Azure and Cloud Services, I recommend the book entitled Microsoft Azure Essentials: Fundamentals of Azure by Michael S. Collier and Robin E. Shahan. For readers interested in various applications of the Hadoop framework and the MapReduce programming model in bioinformatics, I recommend the paper by Quan Zou et al. [36]. In the next chapter, we will see a modified architecture of the system developed with the use of roles and queues for 3D protein structure prediction.
References

1. Abramson, D., Giddy, J., Kotler, L.: High performance parametric modeling with Nimrod/G: Killer application for the global Grid? In: Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS 2000), pp. 1–5. IEEE Computer Society Press, Los Alamitos, CA (2000)
2. Angiuoli, S., Matalka, M., Gussman, A., Galens, K., Vangala, M., Riley, D.R., Arze, C., White, J.R., White, O., Fricke, W.F.: CloVR: a virtual machine for automated and portable sequence analysis from the desktop using cloud computing. BMC Bioinform. 12, 356 (2011)
3. Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., Bourne, P.E.: The Protein Data Bank. Nucleic Acids Res. 28(1), 235–242 (2000). https://doi.org/10.1093/nar/28.1.235
4. Burkowski, F.: Structural Bioinformatics: An Algorithmic Approach, 1st edn. Chapman and Hall/CRC, Boca Raton (2008)
5. Forst, D., Welte, W., Wacker, T., Diederichs, K.: Structure of the sucrose-specific porin ScrY from Salmonella typhimurium and its complex with sucrose. Nat. Struct. Biol. 5(1), 37–46 (1998)
6. Gibrat, J., Madej, T., Bryant, S.: Surprising similarities in structure comparison. Curr. Opin. Struct. Biol. 6(3), 377–385 (1996)
7. Gu, J., Bourne, P.: Structural Bioinformatics (Methods of Biochemical Analysis), 2nd edn. Wiley-Blackwell, Hoboken (2009)
8. Hazelhurst, S.: PH2: an Hadoop-based framework for mining structural properties from the PDB database. In: Proceedings of the 2010 Annual Research Conference of the South African Institute of Computer Scientists and Information Technologists, pp. 104–112 (2010)
9. Holm, L., Kaariainen, S., Rosenstrom, P., Schenkel, A.: Searching protein structure databases with DaliLite v. 3. Bioinformatics 24, 2780–2781 (2008)
10. Holm, L., Sander, C.: Protein structure comparison by alignment of distance matrices. J. Mol. Biol. 233(1), 123–138 (1993)
11. Hung, C.L., Lin, Y.L.: Implementation of a parallel protein structure alignment service on cloud. Int. J. Genomics 439681, 1–8 (2013)
12. Kavanaugh, J.S., Rogers, P.H., Arnone, A., Hui, H.L., Wierzba, A., DeYoung, A., Kwiatkowski, L.D., Noble, R.W., Juszczak, L.J., Peterson, E.S., Friedman, J.M.: Intersubunit interactions associated with Tyr42α stabilize the quaternary-T tetramer but are not major quaternary constraints in deoxyhemoglobin. Biochemistry 44(10), 3806–3820 (2005). https://doi.org/10.1021/bi0484670
13. Kłapciński, A.: Scaling the process of protein structure similarity searching in cloud computing. Master's thesis, Institute of Informatics, Silesian University of Technology, Gliwice, Poland (2013)
14. Krampis, K., Booth, T., Chapman, B., Tiwari, B., Bicak, M., Field, D., Nelson, K.E.: Cloud BioLinux: pre-configured and on-demand bioinformatics computing for the genomics community. BMC Bioinform. 13(1), 42 (2012). https://doi.org/10.1186/1471-2105-13-42
15. Lesk, A.: Introduction to Protein Science: Architecture, Function, and Genomics, 2nd edn. Oxford University Press, New York (2010)
16. Lewis, S., Csordas, A., Killcoyne, S., Hermjakob, H.: Hydra: a scalable proteomic search engine which utilizes the Hadoop distributed computing framework. BMC Bioinform. 13, 324 (2012)
17. Madej, T., Lanczycki, C., Zhang, D., Thiessen, P., Geer, R., Marchler-Bauer, A., Bryant, S.: MMDB and VAST+: tracking structural similarities between macromolecular complexes. Nucleic Acids Res. 42(Database issue), D297–D303 (2014)
18. Minami, S., Sawada, K., Chikenji, G.: MICAN: a protein structure alignment algorithm that can handle multiple-chains, inverse alignments, Cα only models, alternative alignments, and non-sequential alignments. BMC Bioinform. 14(24), 1–22 (2013)
19. Mrozek, D., Brozek, M., Małysiak-Mrozek, B.: Parallel implementation of 3D protein structure similarity searches using a GPU and the CUDA. J. Mol. Model. 20, 2067 (2014)
20. Mrozek, D., Daniłowicz, P., Małysiak-Mrozek, B.: HDInsight4PSi: boosting performance of 3D protein structure similarity searching with HDInsight clusters in Microsoft Azure cloud. Inf. Sci. 349–350, 77–101 (2016)
21. Mrozek, D., Małysiak-Mrozek, B.: CASSERT: a two-phase alignment algorithm for matching 3D structures of proteins. In: Kwiecień, A., Gaj, P., Stera, P. (eds.) Computer Networks, Communications in Computer and Information Science, vol. 370, pp. 334–343. Springer International Publishing, Berlin (2013)
22. Mrozek, D., Małysiak-Mrozek, B., Kłapciński, A.: Cloud4Psi: cloud computing for 3D protein structure similarity searching. Bioinformatics 30(19), 2822–2825 (2014)
23. Mrozek, D., Suwała, M., Małysiak-Mrozek, B.: High-throughput and scalable protein function identification with Hadoop and Map-only pattern of the MapReduce processing model. Knowledge and Information Systems (in press). http://dx.doi.org/10.1007/s10115-018-1245-3
24. Mrozek, D., Kłapciński, A., Małysiak-Mrozek, B.: Orchestrating task execution in Cloud4PSi for scalable processing of macromolecular data of 3D protein structures. In: Nguyen, N.T., Tojo, S., Nguyen, L.M., Trawiński, B. (eds.) Intelligent Information and Database Systems. Lecture Notes in Computer Science, vol. 10192, pp. 723–732. Springer International Publishing, Cham (2017)
25. Murzin, A.G., Brenner, S.E., Hubbard, T., Chothia, C.: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247(4), 536–540 (1995). http://www.sciencedirect.com/science/article/pii/S0022283605801342
26. Pascual, J., Pfuhl, M., Walther, D., Saraste, M., Nilges, M.: Solution structure of the spectrin repeat: a left-handed antiparallel triple-helical coiled-coil. J. Mol. Biol. 273(3), 740–751 (1997). http://www.sciencedirect.com/science/article/pii/S0022283697913449
27. Prlić, A., Bliven, S., Rose, P., Bluhm, W., Bizon, C., Godzik, A., Bourne, P.: Pre-calculated protein structure alignments at the RCSB PDB website. Bioinformatics 26, 2983–2985 (2010)
28. Prlić, A., Yates, A., Bliven, S., et al.: BioJava: an open-source framework for bioinformatics in 2012. Bioinformatics 28, 2693–2695 (2012)
29. Sayle, R.: RasMol, molecular graphics visualization tool. Biomolecular Structures Group, Glaxo Wellcome Research & Development, Stevenage, Hertfordshire (May 2013). http://www.umass.edu/microbio/rasmol/
30. Shapiro, J., Brutlag, D.: FoldMiner and LOCK2: protein structure comparison and motif discovery on the Web. Nucleic Acids Res. 32, 536–541 (2004)
31. Shindyalov, I., Bourne, P.: Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 11(9), 739–747 (1998)
32. Watson, H.: The stereochemistry of the protein myoglobin. Prog. Stereochem. 4, 299 (1969)
33. Yan, Y., Winograd, E., Viel, A., Cronin, T., Harrison, S., Branton, D.: Crystal structure of the repetitive segments of spectrin. Science 262(5142), 2027–2030 (1993). http://science.sciencemag.org/content/262/5142/2027
34. Ye, Y., Godzik, A.: Flexible structure alignment by chaining aligned fragment pairs allowing twists. Bioinformatics 19(2), 246–255 (2003)
35. Zhu, J., Weng, Z.: FAST: a novel protein structure alignment algorithm. Proteins 58, 618–627 (2005)
36. Zou, Q., Li, X.B., Jiang, W.R., Lin, Z.Y., Li, G.L., Chen, K.: Survey of MapReduce frame operation in bioinformatics. Brief. Bioinform. 15(4), 637–647 (2014). http://dx.doi.org/10.1093/bib/bbs088
Chapter 5
Cloud Services for Efficient Ab Initio Predictions of 3D Protein Structures
Cloud computing is the third wave of the digital revolution.
Lowell McAdam, Verizon Communications, 2013
Abstract Computational methods for protein structure prediction enable determination of a three-dimensional structure of a protein based on its pure amino acid sequence. However, conventional calculations of protein structure may be time-consuming and may require ample computational resources, especially when carried out with the use of ab initio methods. In this chapter, we will see how Cloud Services may help to solve these problems by scaling the computations in a role-based and queue-based Cloud4PSP system, deployed in the Microsoft Azure cloud. The chapter shows the system architecture, the Cloud4PSP processing model, and results of various scalability tests that speak in favor of the presented architecture.

Keywords Bioinformatics · Proteins · 3D Protein structure · Protein structure prediction · Tertiary structure prediction · Ab initio · Protein structure modeling · Cloud computing · Distributed computing · Microsoft Azure · Azure Cloud Services · Parallel computing · Software as a Service · SaaS · Scalability
5.1 Introduction

Protein structure prediction is one of the most important and yet difficult processes for modern computational biology and structural bioinformatics [18]. The practical role of protein structure prediction is becoming even more important in the face of the dynamically growing number of protein sequences obtained through the translation of DNA sequences coming from large-scale sequencing projects. Needless to say, the pace of determination of protein structures with the experimental methods, such as X-ray crystallography or Nuclear Magnetic Resonance (NMR), is much slower than the pace of determination of protein sequences. As a consequence, the number
of protein structures in repositories, such as the worldwide Protein Data Bank (PDB) [2], is only a small percentage of the number of all known sequences. Therefore, computational procedures that allow for the determination of a protein structure from its amino acid sequence are attractive alternatives to experimental methods of protein structure determination, such as X-ray crystallography and NMR.
5.1.1 Computational Approaches for 3D Protein Structure Prediction

Protein structure prediction refers to the computational procedure that delivers a three-dimensional structure of a protein based on its amino acid sequence (Fig. 5.1). There are various approaches to the problem and many algorithms have been developed. These methods generally fall into two groups: (1) physical and (2) comparative [28, 60]. Physical methods rely on physical forces and interactions between atoms in a protein. Most of them try to reproduce nature's algorithm and implement it as a computational procedure in order to give proteins their unique 3D native conformations [36]. Following the rules of thermodynamics, it is assumed that the native conformation of a protein is the one that possesses the minimum potential energy. Therefore, physical methods try to find the global minimum of the potential energy function [34]. The functional form of the energy is described by empirical force fields [7] that define the mathematical function of the potential energy, its components describing various interactions between atoms, and the parameters needed for computations of each of the components (see Sect. 5.2.1 for details). The potential energy of an atomic system is composed of several components depending on the force field type.
Fig. 5.1 3D protein structure prediction: from amino acid sequence (Input) to 3D structure (Output). Modeling software uses different methods for prediction purposes: (1) physical, (2) comparative
Methods that rely on such a physical approach belong to the so-called ab initio protein structure prediction methods. Representatives of the approach include I-TASSER [56], Rosetta@home [35], Quark [57], WZ [54] and NPF [61]. In this chapter, we will focus on this approach. On the other hand, comparative methods rely on already known structures that are deposited in macromolecular data repositories, such as the Protein Data Bank (PDB). Comparative methods try to predict the structure of the target protein by finding homologs among sequences of proteins of already determined structures and by "dressing" the target sequence in one of the known structures. Comparative methods can be split into several groups, including: (1) homology modeling methods, e.g., Swiss-Model [1], Modeller [10], RaptorX [26], Robetta [29], HHpred [51]; (2) fold recognition methods, like Phyre [27], Raptor [58], and Sparks-X [59]; and (3) secondary structure prediction methods, e.g., PREDATOR [14], GOR [15], and PredictProtein [45]. Both physical and comparative approaches are still being developed, and their effectiveness is assessed every two years in the CASP experiment (Critical Assessment of protein Structure Prediction).
It must also be admitted that ab initio prediction methods require significant computational resources, either powerful supercomputers [42, 48] or distributed computing. Proof of the latter is given by projects such as Folding@home [50] and Rosetta@Home [6] that make use of the Grid computing architecture. A promising alternative also seems to be provided by cloud computing, which enables convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction [37].
5.1.2 Cloud and Grid Computing in Protein Structure Determination

Cloud computing evolved from various forms of distributed computing, including Grid computing. Both Clouds and Grids provide plenty of computational and storage resources, which is very attractive for scientific computations performed in the domain of Life sciences. Therefore, both computing models became popular in scientific applications for which a large pool of computational resources allows various computationally intensive problems to be solved. In the domain of protein structure prediction, it is worth noting two cloud-based solutions. The first one is the open-source Cloud BioLinux [31], a publicly accessible virtual machine (VM) that enables scientists to quickly provision on-demand infrastructures for high-performance bioinformatics computing using cloud platforms. Cloud BioLinux provides a range of pre-configured command line and graphical software applications, including a full-featured desktop interface, documentation and over 135 bioinformatics packages for applications including sequence
alignment, clustering, assembly, display, editing, and phylogeny. The release of the Cloud BioLinux reported in [25] includes the already mentioned PredictProtein suite, which provides methods for the prediction of secondary structure and solvent accessibility (profphd), nuclear localization signals (predictnls), and intrinsically disordered regions (norsnet).
The second solution is Rosetta@Cloud [22]. Rosetta@Cloud is a commercial, cloud-based pay-as-you-go server for predicting protein 3D structures using ab initio methods. Services provided by Rosetta@Cloud are analogous to the well-established Rosetta computational software suite, with a variety of tools developed for macromolecular modeling, structure prediction and functional design. Rosetta@Cloud was originally designed to work on the cloud provided by Amazon Web Services (AWS). It delivers a scalable environment, a fully functional graphical user interface, and a dedicated pricing model for using computing units (AWS instances) in the prediction process. However, Rosetta@Cloud is currently in a limited beta release.
Great progress in scaling molecular simulations has also been brought by scientific initiatives that make use of the Grid computing architecture. Laganà et al. [32] presented the foundations and structures of the building blocks of the ab initio Grid Empowered Molecular Simulator (GEMS) implemented on the EGEE computing Grid. In GEMS, the computational problem is split into a sequence of three computational blocks, namely INTERACTION, DYNAMICS, and OBSERVABLES, each with a different purpose. Ab initio calculations determining the structure of a molecular system are performed in the first block. The main efforts to implement GEMS on the EGEE Grid focused on the second block, which is responsible for the calculation of the dynamics of the molecular system. The authors show that, for various reasons, the parallelization of molecular simulations is possible through the application of a simple parameter sweeping distribution scheme, which is also used in the Cloud4PSP presented in this chapter.
WeNMR [55] is an example of a successful initiative for moving the analysis of Nuclear Magnetic Resonance (NMR) and Small Angle X-ray Scattering (SAXS) imaging data for atomic and near-atomic resolution molecular structures to the Grid. WeNMR makes use of an operational Grid that was initially based on the gLite middleware, developed in the context of the EGEE project series [12]. Although WeNMR focuses on NMR and SAXS, its Web portal provides access to various services and tools, including those used for the prediction of biological complexes (HADDOCK [8]), calculating structures from NMR data (Xplor-NIH [46], CYANA [19] and CS-ROSETTA [49]), and molecular structure refinement and molecular dynamics simulations (AMBER [4] and GROMACS [53]). A typical workflow is based on Grid job pooling: a specialized process listens for Grid job packages that should be executed in the Grid, another process periodically checks running jobs for their status and retrieves the results when ready, and, finally, another process performs post-processing of the results before they are presented to the user. The Cloud4PSP presented in this chapter uses a similar workflow scheme, with prediction jobs that are divided into many prediction tasks. These tasks are enqueued in the system for further execution by processing units that work in a dedicated role-based and queue-based architecture that we designed.
Gesing et al. [16] report on a very important security issue in carrying out computations in structural bioinformatics, molecular modeling, and quantum chemistry with the use of the Molecular Simulation Grid (MoSGrid). MoSGrid is a science gateway that makes use of WS-PGRADE [11] as a user interface, gUSE [24] as a high-level middleware service layer for workload storage, workload execution, monitoring and others, UNICORE [52] as a Grid resources middleware layer, and XtreemFS [21] as a file system for storage purposes. Various security elements are employed in particular layers of the MoSGrid security infrastructure with the aim of protecting the intellectual property of companies performing scientific calculations.
The projects mentioned above prove that computations performed in bioinformatics require ample computational resources that are available through the use of Grid or cloud computing. The Cloud4PSP system, presented in detail in the next section, responds to these requirements perfectly. It is based on a role-based and queue-based architecture similar to the one developed in the Cloud4PSi [38, 39, 41] presented in Chap. 4. However, the architecture of the Cloud4PSP has been extended with additional queues, and its lower-level roles perform large numbers of simulations; instead of processing (consuming) large amounts of data, they produce them.
5.2 Cloud4PSP for 3D Protein Structure Prediction

Cloud4PSP (Cloud for Protein Structure Prediction) is an extensible service that allows for ab initio prediction of protein structures from amino acid sequences. The construction of the system started in 2012. The development works were carried out by my associates, Paweł Gosk (within his MSc thesis [17]) and Bożena Małysiak-Mrozek from the Institute of Informatics at my university, and by me within the Microsoft Azure for Research Award grant for the project entitled Boosting performance of 3D protein structure predictions in the Azure Cloud, sponsored by Microsoft Research. Our joint work resulted in the paper [40] describing the system.
Cloud4PSP was intentionally designed to work in the Microsoft Azure cloud. At the moment, Cloud4PSP uses only the Warecki–Znamirowski (WZ) method [54] as a sample procedure for protein structure prediction. However, the collection of prediction methods can be extended by using the available programming interfaces developed by Paweł Gosk; information on possible extensions is provided in Sect. 5.2.4. In terms of efficiency, the WZ method is slower than popular optimization methods like gradient-based steepest descent [5], Newton–Raphson [9], Fletcher–Powell [13], or Broyden–Fletcher–Goldfarb–Shanno (BFGS) [47]. However, it finds the global minimum of the potential energy with higher probability and has a lower tendency to converge on and get stuck in the nearest local minimum, which is not necessarily the global minimum of the energy function [54]. This robustness comes from the modified Monte Carlo approach on which the method is based. The following sections give details of the prediction method, the architecture and processing model used by Cloud4PSP, and details of how the system is scaled in order to handle a growing amount of work related to the prediction process.
5.2.1 Prediction Method

The Warecki–Znamirowski (WZ) method for protein structure prediction models a protein backbone by (1) sampling numerous protein conformations determined by a series of φ and ψ torsion angles of the protein backbone (Fig. 1.13 in Chap. 1), and (2) finding the conformation with the lowest potential energy, which minimizes the expression:

$$
\min_{\varphi_0,\psi_0,\varphi_1,\psi_1,\ldots,\varphi_{n-1},\psi_{n-1}} E(\varphi_0,\psi_0,\varphi_1,\psi_1,\ldots,\varphi_{n-1},\psi_{n-1}),
\tag{5.1}
$$

where E is a function describing the potential energy of the current conformation, n is the number of amino acids in the predicted protein structure, and φ_i, ψ_i are torsion angles of the ith amino acid. The basic assumption of the method is that the peptide bond is planar (see Fig. 1.13 in Chap. 1), which restricts the rotation around the C′–N bond, and the ω angle is essentially fixed to 180 degrees due to the partial double-bond character of the peptide bond. Therefore, the main changes of the protein conformation are possible by chain rotations around the φ and ψ angles, and these angles provide the flexibility required for folding the protein backbone.

The following potential energy function E is used in the calculations of energy for the current conformation (configuration A^N of N atoms, determined by a series of φ and ψ torsion angles in Eq. 5.1):

$$
E(A^N) = \sum_{j=1}^{\mathrm{bonds}} \frac{k_j^b}{2}\,(d_j - d_j^0)^2
       + \sum_{j=1}^{\mathrm{angles}} \frac{k_j^a}{2}\,(\theta_j - \theta_j^0)^2
       + \sum_{j=1}^{\mathrm{torsions}} \frac{V}{2}\,\bigl(1 + \cos(p\omega - \gamma)\bigr)
       + \sum_{k=1}^{N}\sum_{j=k+1}^{N} 4\varepsilon_{kj}\left[\left(\frac{\sigma_{kj}}{r_{kj}}\right)^{12} - \left(\frac{\sigma_{kj}}{r_{kj}}\right)^{6}\right]
       + \sum_{k=1}^{N}\sum_{j=k+1}^{N} \frac{q_k q_j}{4\pi\varepsilon_0 r_{kj}},
\tag{5.2}
$$

where the first term represents the bond stretching component energy, k_j^b is a bond stretching force constant, d_j is the distance between two atoms (real bond length), and d_j^0 is the optimal bond length; the second term represents the angle bending component energy, k_j^a is a bending force constant, θ_j is the actual value of the valence angle, and θ_j^0 is the optimal valence angle; the third term represents the torsional angle component energy, V denotes the height of the torsional barrier, p is the periodicity, ω is the torsion angle, and γ is a phase factor; the fourth term represents the van der Waals component energy described by the Lennard–Jones potential, r_kj denotes the distance between atoms k and j, σ_kj is the collision diameter, ε_kj is the well depth, and N is the number of atoms in the structure A^N; the fifth term represents the electrostatic component energy, where q_k, q_j are atomic charges, r_kj denotes the distance between atoms k and j, ε_0 is a dielectric constant, and N is the number of atoms in the structure A^N.
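As a toy illustration of the two nonbonded terms of Eq. (5.2), the function below evaluates the Lennard–Jones and electrostatic contributions for a single atom pair. This is a sketch under illustrative parameter values only; real computations use consistent force field parameter sets, such as the AMBER96 set mentioned later in this chapter:

```python
def nonbonded_pair_energy(r, sigma, epsilon, q1, q2, ke=332.0636):
    """Lennard-Jones + electrostatic terms of Eq. (5.2) for one atom pair.

    ke approximates 1/(4*pi*eps0) in kcal*A/(mol*e^2), a convention common
    in molecular mechanics; all inputs here are illustrative.
    """
    lj = 4.0 * epsilon * ((sigma / r) ** 12 - (sigma / r) ** 6)
    return lj + ke * q1 * q2 / r

# At the Lennard-Jones minimum r = 2**(1/6) * sigma, a neutral pair
# has energy exactly -epsilon:
print(nonbonded_pair_energy(r=2**(1/6) * 3.4, sigma=3.4, epsilon=0.24, q1=0, q2=0))
```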
The Warecki–Znamirowski (WZ) algorithm minimizes the expression Eq. 5.2 for current values of the variables defining degrees of freedom, referential values of the variables taken from force field parameter sets [30], and constraints imposed on the torsion angles φ and ψ resulting from Ramachandran plots [44]. These constraints are derived from experimental observations of the possible values of torsion angles for particular types of amino acids published in [20]. The full algorithm of Warecki–Znamirowski consists of two phases:

• Phase I: Monte Carlo phase. In the first phase, random values of φ and ψ torsion angles are generated for each amino acid. The space of possible values for torsion angles (360 × 360 degrees) is restricted based on the type of amino acid and the Ramachandran plot for the particular type [20]. Protein conformation is then defined by a vector of φ and ψ torsion angles. When all angles are generated for the given amino acid chain, the energy of the conformation E is calculated. The whole process is repeated for a given number of tries iTotal (specified by the user), and the bc best solutions are collected in a dedicated array. The result of the Monte Carlo phase is the array holding the best conformations of the given amino acid chain. These conformations represent approximations of valid solutions that might still need some adjustment.

• Phase II: Angles Adjustment phase. The adjustment is performed in the second phase. In this phase, each conformation stored in the array of bc best conformations is examined once again. The torsion angles φ_i and ψ_i (i = 0, 1, ..., n − 1) of each conformation are modified one by one by given values ±kΔφ and ±kΔψ. This simulates shaking the protein molecule and continuous motions of particular atoms, wherein the amplitude of the motions is determined by the ±Δφ and ±Δψ values. The k parameter is positive and decreases during the consecutive steps of the minimization of the potential energy function. The shaking process is continued until the k parameter reaches a fixed value. After each modification of a torsion angle, the conformational energy E′(A^N) is calculated once again, and if it is lower than the current minimum, E′(A^N) < E_min, the modified conformation and its energy are accepted and saved as the current best result. The whole adjustment process is performed for each of the bc best conformations.

The number of random tries iTotal that Phase I is repeated for is defined by the user. In Ref. [54], Warecki and Znamirowski give the formula by which users can estimate the minimal number of iterations needed to model a molecule containing n amino acids:

$$
iTotal \approx \frac{\log(1 - \alpha)}{\log(1 - \xi^n)},
\tag{5.3}
$$

where α is the probability of finding the optimal solution with the accuracy ξ ∈ (0; 1), which is a part of the range for each ith pair of torsional angles, i ∈ 0, ..., n − 1.
For example, when modeling a protein containing 5 amino acids (n = 5) and assuming the accuracy of localization of the optimal solution to be half of the range for each pair of torsional angles, ξ = 0.5, with probability α = 0.9, the number of iterations iTotal = 73. It is worth noting that, with the same assumptions, iTotal = 2,357 for n = 10 amino acids, iTotal = 75,450 for n = 15 amino acids, and iTotal = 2,414,435 for n = 20 amino acids. This shows that the number of possible conformations that should be explored grows quickly with the number of amino acids. The best conformation is the one that is characterized by the lowest value of the potential energy E. Energies are calculated with the Tinker package [43], AMBER96 [30] molecular mechanics potentials, and the generalized Born continuum solvation model (GBSA).
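The iteration counts in this example follow directly from Eq. (5.3); a minimal sketch in plain Python reproduces them:

```python
import math

def i_total(n: int, alpha: float = 0.9, xi: float = 0.5) -> int:
    """Minimal number of Monte Carlo tries according to Eq. (5.3)."""
    return math.ceil(math.log(1.0 - alpha) / math.log(1.0 - xi ** n))

# Prints 73, 2357, 75450, and 2414435 for n = 5, 10, 15, 20.
for n in (5, 10, 15, 20):
    print(n, i_total(n))
```

The exponential growth of iTotal with n is the direct motivation for parallelizing the method, as described next.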
5.2.2 Cloud4PSP Architecture

Cloud4PSP parallelizes the WZ method by distributing the computations in the cloud. Cloud4PSP has been developed to work on the Microsoft Azure public cloud. In the Microsoft Azure cloud, applications can be delivered by deploying pre-configured virtual machines (IaaS) or as fully functional SaaS solutions. For the latter case, Microsoft provides a dedicated application programming interface (API) for developers of any software application that is intended to work on the Cloud. Cloud4PSP was developed with the use of Azure Cloud Services, and it is composed of a set of roles: a Web role providing a graphical user interface (GUI) in the form of a Web site, and Worker roles implementing the application logic. The roles reside in and execute the application logic on virtual machines provided by Microsoft, as a cloud provider. These virtual machines come from Microsoft's standard gallery of virtual machines and provide various computational capabilities depending on their size (see Sect. 3.2). Roles working in the Cloud4PSP system create a distributed set of processing units Ω_PU defined as follows:

$$
\Omega_{PU} = \{r_W\} \cup \{r_{PM}\} \cup R_{PW},
\tag{5.4}
$$
where r_W is the Web role responsible for the interaction with Cloud4PSP users, r_PM is the PredictionManager role distributing requests received from the Web role and preparing the workload, and R_PW is a set of m instances of the PredictionWorker role performing the prediction of protein structures in parallel. The architecture of the Cloud4PSP consisting of the mentioned processing units is shown in Fig. 5.2.

Fig. 5.2 Architecture of the Cloud4PSP: a cloud-based system for ab initio protein structure prediction

The Web role provides a front-end for users of the system, the PredictionManager role mediates the distribution of the prediction process and stores the results in the Azure SQL Database, and the PredictionWorker roles predict the protein structure according to the logic of the chosen prediction method. Control instructions and parameters are transferred through appropriate queues. Roles have
access to various storage resources, including Azure BLOBs (for PDB files with predicted 3D structures) and Azure SQL Database (for descriptions of the results).
The Web role provides a Web site which allows users to interact with the entire system. Through the Web role, users input the amino acid sequence of the protein whose 3D structure should be predicted. They also choose the prediction method (only one is available at the moment, but the system provides programming interfaces allowing users to bind new methods), specify its parameters, e.g., the number of iterations (random tries of the modified Monte Carlo method), and provide a short description of the input molecule.
The PredictionManager role distributes the prediction process across many instances of the PredictionWorker role. In other words, the PredictionManager implements the entire logic of how the prediction process is parallelized, which may depend on the prediction method used. Algorithm 4 describes how calculations are scheduled and parallelized by the PredictionManager role, with part of the code adapted to the specificity of the WZ prediction algorithm used.
Algorithm 4 PredictionManager role: Processing prediction jobs
1: while true do
2:   Check messages in the Input queue
3:   if exists a message then
4:     Retrieve the message and extract parameters of the prediction job (protein sequence, #tasks, prediction method, and its parameters)
5:     Save job description in Azure SQL Database
6:     for i ← 1, #tasks do          ⊲ This part of the code is specific to the WZ-oriented PredictionManager
7:       Create a prediction task
8:       Generate seed for the pseudorandom number generator
9:       Assign the seed to the prediction task
10:      Save prediction task description in Azure SQL Database
11:      Encode prediction task (seed, sequence, prediction method, and its parameters) in the task description message
12:      Enqueue task description message in the Prediction Input queue
13:    end for
14:  end if
15:  Check messages in the Prediction Output queue
16:  if exists a message then
17:    Retrieve the message and extract results of the prediction task (name of the file with predicted structure, potential energy)
18:    Save task results in Azure SQL Database
19:    if all tasks completed then
20:      Finalize the prediction job in Azure SQL Database
21:    end if
22:  end if
23: end while
When the prediction is performed according to the WZ algorithm, the entire prediction process (in the system also called a prediction job) consists of a vast number of iterations (random selections of torsion angles) that, in the Cloud4PSP, are performed in parallel by many instances of the PredictionWorker role. The number of iterations is specified by the user through the Web role. In the configuration of the prediction job, the user specifies the number of tasks (#tasks) and the number of tries performed in each task (#iterations). This gives flexibility in choosing the stopping criterion and in configuring prediction jobs (and, in consequence, the number of candidate structures after the prediction, since in this implementation of the parallel prediction only the best conformations generated in Phase I by each task are adjusted in Phase II, and these tuned conformations are returned). The total number of iterations performed within the prediction job (iTotal) is the product of the number of tasks (#tasks) and the number of tries performed in each task (#iterations):

$$
iTotal = \#tasks \times \#iterations.
\tag{5.5}
$$
The PredictionManager role creates a pool of #tasks tasks (lines 6–7). Descriptions of these tasks, which contain the information on the prediction algorithm, its parameters, the random seed used to initialize a pseudorandom number generator for
the modified Monte Carlo method, and the number of tries #iterations, are stored in the Prediction Input queue (lines 11–12). Prediction tasks are then consumed by instances of the PredictionWorker role. Instances of the PredictionWorker role act according to Algorithm 5. They consume messages from the Prediction Input queue (lines 2–4), execute the chosen prediction method, generate successive protein structures, calculate potential energies for them (lines 6–7), store protein structures in PDB files, and then save these files in Azure BLOBs (line 8). The set of PredictionWorker role instances R_PW is defined as follows:

$$
R_{PW} = \{r_{PW_i} \mid i = 1, \ldots, m\},
\tag{5.6}
$$

where r_{PW_i} is a single instance of the PredictionWorker role, and m is the number of PredictionWorker role instances working in the system. Usually:

$$
\#tasks \gg m.
\tag{5.7}
$$
Algorithm 5 PredictionWorker role: Processing prediction task
1: while true do
2:   Check messages in the Prediction Input queue
3:   if exists a message then
4:     Retrieve the message and extract parameters of the prediction task (protein sequence, prediction method, and its parameters, including seed)
5:     Update task start time in Azure SQL Database
6:     Execute prediction process according to the specified method
7:     Collect results of the prediction (name of the file with predicted structure, potential energy)
8:     Save protein structure to Azure BLOB under given filename
9:     Update task end time in Azure SQL Database
10:    Serialize task execution results in the prediction output message
11:    Enqueue the message in the Prediction Output queue
12:  end if
13: end while
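To make the interplay of Algorithms 4 and 5 concrete, the sketch below emulates the task fan-out in a single Python process: queue.Queue stands in for the Azure queues, a dictionary stands in for the database, and all names are hypothetical. The energy computation is a random placeholder, not the WZ method itself:

```python
import queue
import random
import threading

# Stand-ins for the Azure queues and the SQL database (all hypothetical).
prediction_input: queue.Queue = queue.Queue()
prediction_output: queue.Queue = queue.Queue()
task_registry: dict = {}

def prediction_manager(sequence: str, n_tasks: int, n_iterations: int) -> None:
    """Fan a prediction job out into n_tasks seeded tasks (Algorithm 4, lines 6-12)."""
    for task_id in range(n_tasks):
        seed = random.randrange(2**32)          # one seed per task
        task_registry[task_id] = "queued"
        prediction_input.put({"id": task_id, "seed": seed,
                              "sequence": sequence, "iterations": n_iterations})

def prediction_worker() -> None:
    """Consume tasks and return best-energy results (Algorithm 5)."""
    while True:
        task = prediction_input.get()
        rng = random.Random(task["seed"])       # reproducible per-task stream
        # Placeholder for Phases I/II: 'energy' is just a random stand-in.
        energy = min(rng.uniform(-100.0, 0.0) for _ in range(task["iterations"]))
        prediction_output.put({"id": task["id"], "energy": energy})
        prediction_input.task_done()

threading.Thread(target=prediction_worker, daemon=True).start()
prediction_manager("MKV...", n_tasks=4, n_iterations=100)
prediction_input.join()
while not prediction_output.empty():
    result = prediction_output.get()
    task_registry[result["id"]] = f"done (E = {result['energy']:.2f})"
print(task_registry)
```

The per-task seeds mirror Algorithm 4, lines 8–9: they keep the Monte Carlo streams of concurrently running tasks statistically independent and each task reproducible.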
Steering instructions that control the entire system and the course of action inside the Cloud4PSP are transferred as messages through the queueing system. There are four queues present in the architecture of the Cloud4PSP:

• Input queue: collects structure prediction requests (Prediction Job descriptions) generated by users through the Web site.
• Output queue: collects notifications of prediction completion for particular users (based on a generated token number).
• Prediction Input queue: transfers parameters of the prediction process for a single instance of the PredictionWorker role, among others: the amino acid sequence provided by the user, the number of iterations #iterations to be performed, and parameters of the prediction method.
• Prediction Output queue: transfers descriptions of results, e.g., the PDB file name with the protein structure and the potential energy value for the structure, from each instance of the PredictionWorker role back to the PredictionManager role.

The Input queue and Prediction Input queue are very important for buffering prediction requests and steering the prediction process. The prediction process may take several hours and is performed asynchronously. Users' requests are placed in the Input queue with the typical FIFO discipline, and they are processed when there are idle instances of the PredictionWorker role that can realize the request. Many users can generate many prediction requests, so there must be a buffering mechanism for these requests; queues fulfill this task perfectly. One prediction request enqueued in the Input queue causes the creation of many prediction orders (tasks) for instances of the PredictionWorker role. The PredictionManager role consumes prediction requests from the Input queue, generates many prediction tasks, each for performing the specified number of iterations by the PredictionWorker roles available in the system, and sends task description messages to the Prediction Input queue. The Prediction Output queue is used by instances of the PredictionWorker role to return descriptions of results to the PredictionManager role. Such feedback allows the PredictionManager role to verify whether all the prediction tasks have been completed. In the case of failure to complete a prediction task, its completion can be delegated to another PredictionWorker instance. The Output queue allows the Web role to be notified that the whole prediction process is completed.
5.2.3 Cloud4PSP Processing Model

Cloud4PSP provides a dedicated processing model (Fig. 5.3) that is used by the Cloud4PSP and can be used by developers extending the system in the future. In the Cloud4PSP processing model, every prediction request generated by a user causes the creation of a Prediction Job (step 1). This is a special object with the information on the entire prediction that must be performed. The Prediction Job is submitted for execution (step 2); i.e., it is serialized to a message that is placed in the Input queue. The PredictionManager role consumes Prediction Jobs, initializes them (step 3) and, based on their description and the available resources, creates so-called Prediction Tasks (step 4). Specifications of these Prediction Tasks are serialized to messages that are sent to the Prediction Input queue (step 5) and are then consumed by successive PredictionWorker role instances. PredictionWorker role instances execute tasks according to the description contained in the retrieved message and using an appropriate prediction method (step 6), and finally save the predicted protein structures in the Azure BLOB storage service (step 7). After completing the Prediction Task, the PredictionWorker confirms completion of the task and sends a description of the generated protein structures to the PredictionManager (step 8). The PredictionManager saves the results in the database managed by the Azure SQL Database (step 9).

Fig. 5.3 Cloud4PSP processing model showing components and objects involved in the prediction process, and the flow of messages and data between components of the system

By using an additional register stored in the database,
the PredictionManager maintains awareness of the progress of the Prediction Job execution. This emphasizes the importance of the confirmations obtained in step 8. When the PredictionManager states that all tasks have been completed, it finalizes the job by setting an appropriate property in the job’s register in the database (step 10) and sends the notification to the Web role through the Output queue (step 11). The Output queue stores notifications together with the token number generated for the Prediction Job at the beginning of the prediction process, i.e., when submitting the job for execution. The Web role periodically checks the Output queue for messages reporting the completion of the Prediction Job and the statuses of prediction tasks in the database (step 12). Partial results (results of already finished tasks) are retrieved from the database. In this way, the user knows whether the prediction process is completed or is still in progress. Such a processing model provides a kind of abstraction layer for developers of the system, hides some implementation details, e.g., implementation of communication and serialization of jobs and tasks, and finally, provides a framework for future extensions of the system.
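Steps 10–12 amount to token-based polling on the client side. The following is a minimal, hypothetical sketch of that loop; get_status and all other names are assumptions standing in for the Web role's checks of the Output queue and the task statuses in the database:

```python
import time

def wait_for_job(get_status, job_token: str, poll_every: float = 5.0) -> dict:
    """Client-side polling loop for steps 11-12 (illustrative sketch only).

    get_status(token) stands in for checking the Output queue and the task
    statuses in the database; it returns a dict with a 'completed' flag and
    any partial results already available.
    """
    while True:
        status = get_status(job_token)
        if status["completed"]:
            return status
        print("still running, partial results:", status.get("partial", []))
        time.sleep(poll_every)
```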
5.2.4 Extending Cloud4PSP

Although Cloud4PSP has been designed to work on the Microsoft Azure Cloud, its architecture and workflow have a general character and can also be implemented on other Clouds. This, however, would require adoption of the programming model specific to the cloud provider and tailoring of elements of the architecture to the available compute resources. The architecture of the Cloud4PSP could also be adapted and extended to hybrid clouds by combining on-premises compute resources and Microsoft Azure cloud resources. This would require some additional effort, related to moving the PredictionManager to the on-premises cluster or private cloud and parceling out prediction tasks across available resources. However, this would also allow for the use of public cloud resources to satisfy potentially increased demand, especially for compute units. If the system needs some extra compute power, e.g., due to an excessively long and computationally demanding prediction process, part of the prediction job (some prediction tasks) could be scheduled for execution on the Azure cloud. Then, after the prediction process is completed, these commercial compute resources could be released.
Cloud4PSP has been developed in such a way that it allows developers to extend its functionality by adding new modules. This is important, for example, in situations when new prediction algorithms have to be added to the system. This determines the openness of the system. Cloud4PSP has been mainly developed as software, which places it in the SaaS layer of the cloud stack [37]. However, extensions are possible not only by using the processing model presented in Sect. 5.2.3, but also by a set of programming interfaces designed by my assistant Paweł Gosk for almost all components of the system. In Fig. 5.4 we show the interfaces, their implementations for the PredictionManager (for the parallelization logic) and the PredictionWorker (for the prediction method) in the form of base classes, and the inheritance for classes representing particular prediction methods. By using these base classes, advanced users and programmers can plug in their own prediction methods (by inheritance from the PredictionBase class) and their own computation scheduling algorithms (by inheritance from the PredictionManagerBase class).
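As an illustration only, the plug-in contract of Fig. 5.4 boils down to subclassing the two base classes. The sketch below mimics that contract in Python; the class names PredictionBase and PredictionManagerBase come from Fig. 5.4, while every method name and body is an assumption, not the system's actual API:

```python
from abc import ABC, abstractmethod

class PredictionBase(ABC):
    """Mimics the PredictionBase contract of Fig. 5.4 (method names assumed)."""
    @abstractmethod
    def predict(self, sequence: str, seed: int, iterations: int) -> dict:
        """Run one prediction task; return the best structure and its energy."""

class PredictionManagerBase(ABC):
    """Mimics PredictionManagerBase: how a job is split into tasks (assumed)."""
    @abstractmethod
    def create_tasks(self, sequence: str, total_iterations: int) -> list:
        """Partition a prediction job into task descriptions."""

class MyPrediction(PredictionBase):
    def predict(self, sequence: str, seed: int, iterations: int) -> dict:
        # A new prediction method would be implemented here.
        return {"pdb_file": "result.pdb", "energy": 0.0}

class MyPredictionManager(PredictionManagerBase):
    def create_tasks(self, sequence: str, total_iterations: int) -> list:
        n_tasks = 10  # a custom scheduling policy would go here
        return [{"sequence": sequence, "seed": i,
                 "iterations": total_iterations // n_tasks}
                for i in range(n_tasks)]
```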
5.2.5 Scaling the Cloud4PSP

Recall that the scalability of a system means that it is able to handle a growing amount of work in a capable manner or is able to be enlarged to accommodate that growth [3]. Cloud4PSP has this property. Cloud4PSP is a cloud-based, high-performance, and highly scalable system for 3D protein structure prediction. The scalability of the system is provided by the Microsoft Azure cloud platform. The system can be scaled up (vertical scaling) or scaled out (horizontal scaling). Microsoft Azure allows us to combine both types of scaling.
Fig. 5.4 Interfaces, their implementations for PredictionManager and PredictionWorker in the form of base classes, and inheritance for classes representing a particular prediction method. DummyPredictionManager and DummyPrediction represent the sample test manager and prediction method, MonteCarloPredictionManager and MonteCarloPrediction reflect implementations of the scheduling/parallelization and prediction procedure for the described WZ method. Reproduced from [17] with permissions
Cloud4PSP can be scaled out and scaled up during the protein structure prediction process. Scaling out means changing the value of m in Eq. 5.6. Scaling up implies changing the size or series of the virtual machines for the PredictionWorker role. For example, for the basic A-series VMs:

$$
size(r_{PW_i}) \in \{XS, S, M, L, XL, A5, A6, A7\},
\tag{5.8}
$$
where XS denotes the ExtraSmall size, S the Small size, M Medium, L Large, and XL ExtraLarge. The sizes of the virtual machines for PredictionWorker role instances are described in Sect. 3.2.
In the Cloud4PSP, the following assumption regarding all instances of the PredictionWorker role was made:

$$
\forall_{r_{PW_i},\, r_{PW_j} \in R_{PW},\; i,j \in 1,\ldots,m,\; i \neq j}\quad size(r_{PW_i}) = size(r_{PW_j}).
\tag{5.9}
$$
The assumption of the same size for all instances of the PredictionWorker role is motivated by the ease of configuration and scaling of the system and by the ease of analysis of its scalability. However, when working in hybrid clouds, the sizes of virtual machines for various instances of the PredictionWorker role can be different. The number of instances m of the PredictionWorker role depends on the configuration of the Cloud4PSP; it can be measured in thousands. The value of m is limited only by the user's subscription for Microsoft Azure cloud resources and the number of compute units available in the Azure cloud.
5.3 Performance of the Cloud4PSP

Prediction of 3D protein structures from scratch with the use of the WZ method requires many Monte Carlo iterations, which takes some time. The number of iterations, and therefore the total execution time, depend on the size of the protein modeled, i.e., the number of residues in the input sequence. During our tests we successfully modeled a 20 amino acid long part of the non-structural protein NSP1 molecule from the Semliki Forest virus with an RMSD of 0.99 Å. The NMR structure of the molecule (PDB ID code: 1FW5) [33] from the Protein Data Bank is shown in Fig. 5.5 (top). The structural alignment and superposition of the modeled molecule and the NMR molecule are presented in Fig. 5.5 (bottom).

Fig. 5.5 Crystal structure of the part of the NSP1 molecule (PDB ID code: 1FW5) [33] from the Semliki Forest virus that was modeled during experiments: (top left) protein skeletal structure made by covalent bonds between atoms (sticks display mode in Jmol [23]), (top right) secondary structure showing the mainly α-helical character of this fragment (ribbon display mode in Jmol), (bottom) structural alignment and superposition of the modeled molecule and the NMR molecule

Modeling the 20 amino acid long part of the NSP1 protein was a sample computational procedure used for testing the efficiency of the designed Cloud4PSP architecture in the real (Microsoft Azure) cloud environment. Tests were performed with the use of twenty CPU cores. Two of these twenty cores, i.e., two compute units/virtual machines of the Small size, were allocated to the Web role and the PredictionManager role. The remaining eighteen cores were used for the PredictionWorker role instances and to test the scalability of the system. In particular, the following characteristics of the Cloud4PSP were tested:

• the vertical scalability of the system: the efficiency of the prediction process depending on the size of instances of the PredictionWorker role,
• the horizontal scalability of the system: the efficiency of the prediction process depending on the number of instances of the PredictionWorker role,
• the efficiency of the prediction process depending on the number of prediction tasks and the number of iterations performed by each task.

In all cases, the scalability was determined on the basis of execution time measurements. At least two replicates of each measurement were carried out; the results were then averaged, and the averaged values are presented in the following sections. The averaged measurements were also used to determine n-fold speedups for both scaling techniques.
5.3.1 Vertical Scalability

In the first phase of our tests, we wanted to check which of the basic sizes for the virtual machines offered by Microsoft Azure is most efficient. The basic tier for the Web and Worker roles consists of the following five sizes: ExtraSmall (XS), Small (S), Medium (M), Large (L), and ExtraLarge (XL). Recall that their capabilities were described in Sect. 3.2 and briefly compared in Table 3.1. It is worth noting in Table 3.1 that, beginning from the Small size up to the ExtraLarge size, the amount of available memory and the number of cores double step by step. The ExtraSmall size provides one shared core and is generally used when testing applications that should work on the Cloud, so as not to generate unnecessary costs for the testing process.
While testing vertical scalability, we changed the size of the PredictionWorker role from ExtraSmall (one shared CPU core) to ExtraLarge (eight CPU cores). Only one PredictionWorker was used in this test. The prediction job was configured to explore 80,000 random conformations (iTotal) that were performed in 16 prediction tasks. Multiple prediction tasks were assigned to those instances of the PredictionWorker that possessed more than one CPU core, proportionally to the number of cores (according to the rule: one core, one task). The input sequence contained 20 amino acids of the NSP1 molecule. Each task was configured to generate 5,000 random protein conformations (5,000 iterations per task) in Phase I of the WZ method,
and to tune only one best conformation in Phase II of the method (Δφ = Δψ = 8 degrees for k = 3, 2, 1 in successive adjustment iterations).

Fig. 5.6 Results of vertical scaling. Acceleration (n-fold speedup) of the 3D protein structure prediction (prediction job) as a function of the size of instances of the PredictionWorker role

In Fig. 5.6 we can observe the n-fold speedup when scaling the system vertically. We observed that increasing the size of the PredictionWorker role from Small to Medium accelerated the prediction process almost twofold. Above the Medium size (2 cores), the dynamics of the acceleration slowed down a bit: for the Large size (4 cores) the acceleration reached 3.90, and for the ExtraLarge size (8 cores) it reached 7.30. When analyzing the results of the tests, we deduced that the slowdown in the acceleration dynamics was an effect of two factors. The first factor was the overhead caused by the necessity of handling multiple threads. For example, on ExtraLarge-sized PredictionWorkers we ran eight prediction tasks concurrently (8 CPU cores = 8 threads = 8 prediction tasks). The second factor was the cumulative I/O operations performed on a local hard drive by many threads of the PredictionWorker role. During computations, the PredictionWorker role makes use of various executable programs and additional files that must be read from the local hard drive. It also stores some intermediate results on the hard drive. Multiple threads of the role that run on the compute unit multiply these read/write operations and interfere with each other. Therefore, the more threads were running on the compute unit during the experiments, the more I/O operations were performed, which influenced the n-fold speedup. This is also visible when analyzing average execution times for prediction tasks, as shown in Fig. 5.7.
Fig. 5.7 Results of vertical scaling. Average execution time for a single prediction task as a function of the size of the PredictionWorker role. Reproduced from [17, 40] with permissions
could host only one. Therefore, the total execution time for the whole prediction job performed on ExtraLarge instances was 7.30 times shorter than the same job performed on a single Small instance.
5.3.2 Horizontal Scalability

In the second series of performance tests, we checked the horizontal scalability of the Cloud4PSP. Tests were performed in the Microsoft Azure cloud. During this series of tests, the number of instances of the PredictionWorker role was increased from one to 18. Instances of the PredictionWorker role were hosted on compute units of the Small size (one CPU core). The Small-sized virtual machine represents the cheapest computational unit providing one unshared core, and is the smallest size recommended for production workloads. Moreover, as proved by the experimental results shown in Fig. 5.7, this size guaranteed the fastest processing of prediction tasks.
While testing the horizontal scalability, we used the same input sequence as in the vertical scalability tests. The sequence contained 20 amino acids of the NSP1 enzyme. The prediction was configured to explore 4,000 random conformations (iTotal) that were performed in 40 tasks. Each task was configured to generate 100 random protein conformations (100 iterations per task) in Phase I of the WZ method, and to tune only one best conformation in Phase II of the method (Δφ = Δψ = 8 degrees for k = 3, 2, 1 in successive angle adjustment iterations). In Fig. 5.8 we can observe changes in the total execution time of the whole prediction as a function of the number of instances of the PredictionWorker role. The results
Fig. 5.8 Results of horizontal scaling. Total execution time for a prediction job as a function of the number of instances of the PredictionWorker role (Small size)
Fig. 5.9 Results of horizontal scaling. Acceleration (n-fold speedup) of the 3D protein structure prediction as a function of the number of instances of the PredictionWorker role (Small size)
obtained proved that the total prediction time can be significantly reduced by increasing the number of compute units. Upon increasing the number of instances from one to eighteen, the prediction time was reduced from 9 h and 31 min to only 39 min. Based on the execution time measurements obtained during the performance tests, we calculated the n-fold speedup for various configurations of the Cloud4PSP with respect to a single-instance configuration. Figure 5.9 shows how the n-fold speedup changes as a function of the number of instances of the PredictionWorker role. We can notice that employing two instances of the PredictionWorker role increased the speed of the prediction process more than twofold. Adding more instances of the
Fig. 5.10 Efficiency of a single compute unit in the first phase of the prediction process. Dependency between the execution time and the number of randomly generated structures (Monte Carlo trials) for various sizes of compute units. Reproduced from [17, 40] with permissions
role proportionally accelerated the process. Finally, the acceleration ratio (n-fold speedup) reached the level of 14.52 when the number of instances of the PredictionWorker role was increased from one to eighteen. A slowdown of the acceleration dynamics was observed when using more than four compute units. This was caused by the uneven processing times of individual tasks—the average execution time per task was 840 s, with a minimum of 340 s and a maximum of 1,722 s. Moreover, we have to remember that the execution times and the n-fold speedup were measured for the whole system. Therefore, the execution times covered not only the execution of particular prediction tasks, but also the processing and storing of the results of these tasks by the PredictionManager role. Consequently, by increasing the degree of parallelism we raised the pressure on the PredictionManager, which serially processed incoming prediction results. An interesting question is why the processing times of the tasks differ so much. The answer lies in the prediction method itself, and specifically in its second phase. Phase I is fairly predictable in terms of the speed of generating random protein conformations and calculating their energies (Fig. 5.10). The problem lies in Phase II, which can take much longer, depending on the length of the input sequence, the tuning parameters, and the course of the tuning itself, which in turn depends on the conformation generated in the first phase. We observed that, for the tested input sequence, Phase I took 5% and Phase II took 95% of the whole execution time.
5.3.3 Influence of the Task Size
Since in Phase II of the WZ method the angle adjustment is performed only for the structure with the lowest energy, while performing our experiments we expected that
Fig. 5.11 Dependency between execution time and configuration of the prediction job (#tasks × #iterations) for the constant number of randomly generated structures (Monte Carlo trials, iTotal = 4000) for sixteen instances of the PredictionWorker role
the total execution time of the prediction process might depend on the configuration of the prediction tasks. We decided to verify how the task size (i.e., the number of iterations) and the entire job configuration influence the whole prediction time and how strong that influence is. For this purpose, we decided to explore 4,000 random conformations (iTotal) in the prediction process, for the same input sequence as in the vertical and horizontal scalability tests. The sequence contained 20 amino acids of the NSP1 molecule. Nine different configurations of the prediction job were tested. Keeping the number of random conformations (iTotal) constant, we appropriately selected the number of iterations per task and the number of tasks. Phase II of the WZ method was set up to tune only the one best conformation for ∆φ = ∆ψ = 8 degrees for k = 3, 2, 1 in successive angle adjustment iterations. Tests were performed using sixteen Small-sized instances of the PredictionWorker role. The total execution times for various configurations of the prediction job are shown in Fig. 5.11. As can be observed, the most efficient configuration is the one with the lowest number of tasks and the highest number of iterations (16t × 250i). The prediction process for this configuration took 19 min and 25 s. The configuration with the largest number of small tasks (800t × 5i) proved to be extremely inefficient. The prediction process for this configuration took 11 h and 42 min. The reasons for such behavior of the system are the same as those discussed in the previous section. The most time-consuming phase of the WZ method is Phase II, which usually takes 90–98% of the task execution time. By configuring the system to use many small prediction tasks, we multiplied the execution of the longest phase, which resulted in a long execution time for the whole prediction job. For a small number of tasks, it is advisable to choose the number of tasks as a multiple of the number of instances of the PredictionWorker role. For a larger
number of tasks (if #tasks > 2m) the rule does not apply, since the execution times of particular tasks differ. It is difficult to define a single best configuration of the prediction job explicitly, as it depends on many factors, including the purpose of the prediction process, the size of the input molecule, and others. However, choosing a low number of tasks and a large number of iterations has the following consequences:
• the efficiency of the prediction process is higher,
• users obtain fewer candidate structures after the prediction is finished, and some good candidates can be missed,
• fewer prediction results are stored in the database and fewer 3D structures are stored in BLOBs, which lowers the long-term costs of storage in the Cloud,
• the cost of the computation process is lower due to higher efficiency.
On the other hand, choosing a large number of tasks and a low number of iterations has the following consequences:
• the efficiency of the prediction process is lower,
• users obtain more candidate protein structures after the prediction process, which increases the chance of reaching the global minimum of the energy function,
• more prediction results are stored in the database and more candidate 3D structures are stored in BLOBs, which increases the long-term costs of storage in the Cloud,
• the cost of the computation process is higher due to lower efficiency.
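To make this trade-off concrete, the short sketch below enumerates job configurations that cover a fixed pool of Monte Carlo trials and flags those whose task count is a multiple of the instance count. The helper name, constants, and printed flag are illustrative assumptions, not part of Cloud4PSP.

```python
# A minimal sketch (assumed helper, not Cloud4PSP code) that enumerates
# (#tasks x #iterations) configurations for a fixed number of trials.

I_TOTAL = 4000   # total Monte Carlo trials (iTotal), as in the experiments
INSTANCES = 16   # Small-sized PredictionWorker instances

def job_configurations(i_total):
    """Yield (tasks, iterations) pairs that cover i_total trials exactly."""
    for tasks in range(1, i_total + 1):
        if i_total % tasks == 0:
            yield tasks, i_total // tasks

for tasks, iterations in job_configurations(I_TOTAL):
    # Phase II dominates the task execution time (90-98%), so every extra
    # task repeats the most expensive phase once more; configurations whose
    # task count is a multiple of the instance count keep all units busy.
    if 16 <= tasks <= 800:
        balanced = (tasks % INSTANCES == 0)
        print(f"{tasks}t x {iterations}i, balanced={balanced}")
```

For iTotal = 4000 and sixteen instances, the printed configurations include the nine tested in Fig. 5.11, from 16t × 250i to 800t × 5i.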
5.3.4 Scale Up, Scale Out, or Combine?
Microsoft Azure allows us to scale out and scale up applications and systems deployed in the Cloud. The choice is left to the developer and administrator of the system. In both cases, the system must be properly implemented to utilize the resources of a multi-core compute unit (when scaling up) and to distribute, and then process, parts of the main job on many instances of the Worker role (when scaling out). The development effort required to support each of the scaling types is difficult to express in numbers. However, there are some technical reasons that speak in favor of particular scaling techniques. Moreover, we have conducted a series of tests in order to compare similar configurations of the Cloud4PSP architecture and determine which of the configurations is more efficient. We tested various configurations of the Cloud4PSP by changing the size of the PredictionWorker role from ExtraSmall (one shared CPU core) to ExtraLarge (eight CPU cores) and by changing the number of instances of the role from one to eighteen (where possible, depending on the size of the compute unit). The prediction was configured to explore 100,000 random conformations (iTotal), which were processed in 18 tasks. Multiple prediction tasks were assigned to instances that possessed more than one CPU core, proportionally to the number of cores (according to the rule of one task per core). The input sequence contained 20 amino acids of the NSP1 enzyme. Each
task was configured to generate 5,000 random protein conformations (i.e., 5,000 iterations per task) in Phase I of the WZ method, and to tune only the one best conformation in Phase II of the method (∆φ = ∆ψ = 8 degrees for k = 3, 2, 1 in successive adjustment iterations). We tested the following configurations of the Cloud4PSP with respect to resource consumption by the PredictionWorker role:
• one to eighteen ExtraSmall and Small compute units,
• one to nine Medium compute units,
• one to four Large compute units,
• one to two ExtraLarge compute units.
In Fig. 5.12 we can see the number of prediction tasks completed within one hour as a function of the number of instances of the PredictionWorker role for various sizes of the instances. The results of the tests indicate a slight advantage of horizontal scaling over vertical scaling. Comparing both scaling techniques, we noted that the number of prediction tasks completed within one hour by eight Small (1-core) instances was higher than the number completed by one ExtraLarge (8-core) instance. Similar behavior was observed when comparing four Small instances vs. one Large (4-core) instance, and two Small instances vs. one Medium (2-core) instance, but the differences were not as significant. We could draw the same conclusion from the charts showing n-fold speedups for vertical scaling (presented in Fig. 5.6) and horizontal scaling (presented in Fig. 5.9). For example, the acceleration ratio was
Fig. 5.12 Comparison of the efficiency of horizontal and vertical scaling, and the combination of both scaling techniques. The number of prediction tasks completed within one hour as a function of the number of instances of the PredictionWorker role for various sizes of the instances. Measurements for ExtraSmall (XS) and Small (S) sizes are similar. Reproduced from [17, 40] with permissions
slightly better for eight Small instances of the PredictionWorker role (7.45) than for a single ExtraLarge (8-core) instance (7.30). When combining both scaling techniques, our results again speak in favor of horizontal scaling with the use of Small-sized instances, e.g., eighteen Small instances vs. nine Medium instances, sixteen Small instances vs. two ExtraLarge instances, eight Small instances vs. two Large instances. The differences are not significant, but they increase with the number of simulation hours. We must also remember that, although in Fig. 5.12 we present measurements for particular parameters of the prediction job, the entire experiment reveals the relationships between particular configurations of the Cloud4PSP. The actual execution times depend on the size of the input sequence, the number of Monte Carlo trials, the number of tasks, and the parameters of the WZ method. It should also be noted that horizontal scaling is easier to implement. Once the Cloud4PSP system is designed to be scaled horizontally, scaling can be done at any moment after the system is published in the Cloud, and only when the system requires more resources (more compute units). Moreover, it can be done not only manually by the administrator of the system, but also automatically, based on statistics of the system performance and resource usage indicators.
5.4 Discussion
Building Cloud computing services for the prediction of 3D protein structures, such as the presented Cloud4PSP, responds to the current demand for widely available and scalable platforms solving these types of difficult problems. This is very important not only for scientists and research laboratories trying to predict a single 3D protein structure, but also for the biotechnology and pharmaceutical industry developing new drugs. Cloud4PSP is such a platform, allowing a highly scalable prediction process in the Cloud, with all its advantages and drawbacks. Cloud4PSP was originally designed to be deployed in Microsoft’s commercial cloud, where computing resources can be dynamically allocated as needed. As we have shown, the prediction process can be scaled up by adding more powerful computing instances, or scaled out by allocating more computing units of the same types. The results of our experiments have shown that scaling out by adding more Small-sized computing units turned out to be more effective and to give more development flexibility than scaling up (the acceleration ratio for eight cores was 7.45 when scaling out and 7.30 when scaling up). However, in the case of big molecules being modeled, we may be forced to use larger compute units, e.g., Medium, Large, or ExtraLarge, and to combine horizontal and vertical scaling. Delivering Cloud4PSP as a service relieves bio-oriented companies of maintaining their own IT infrastructures, which can now be outsourced from the cloud provider. This has another positive consequence—a low barrier to entry. Even a laboratory or a company that cannot afford to possess a high-performance computational infrastructure can outsource it in the Cloud and perform a prediction
process starting with several computation units, then grow, if needed, and scale its system at any time. On the other hand, important aspects of processing data in the Cloud are privacy and data protection. As rightly pointed out by Gesing et al. [16], both the input and output data of molecular simulations are sensitive and may constitute valuable intellectual property. Therefore, they need to be stored securely. Cloud security is still an evolving domain, and defensive mechanisms are implemented on various levels of the functioning of a cloud-based system. In order to protect the systems deployed in the Cloud, Microsoft Azure provides a rich set of security elements, including SSL mutual authentication mechanisms while accessing various components of the system, encryption of data stored in Azure Storage and encryption of data transmission, and firewalled and partitioned networks to help protect against unwanted traffic from the Internet. Many of these security elements are used by Cloud4PSP explicitly, and many of them implicitly, e.g., secure access to Azure BLOBs using HTTPS, secure authentication when accessing Azure SQL Database, where prediction results are stored, and SSL-based communication between internal components. In terms of data privacy, Microsoft is committed to safeguarding the privacy of data and follows the provisions of the E.U. Data Protection Directive. Users of the Azure cloud, i.e., cloud application developers, may specify the geographic areas where the data are stored and replicated for redundancy. The distribution and deployment model of the Cloud4PSP also supports privacy and data protection. On the deployment level, various users and companies are expected to establish their private Cloud4PSP-based clusters in the Azure cloud for their own simulations. This ensures the separation of data and computations performed by two companies, for which the predicted 3D structures of proteins may be a highly protected intellectual property. Finally, on the application level, Cloud4PSP provides an additional security mechanism of 128-bit GUID1-based tokens for protecting access to the results of prediction processes. Comparing Cloud4PSP to the other solutions mentioned in Sect. 5.1.2, we can draw the following conclusions. Firstly, Cloud4PSP, in terms of what it offers to the end user, is a typical Software as a Service (SaaS) solution, and also an extensible computational framework, where advanced users may add their own prediction methods. This is the fundamental feature that differentiates the presented cloud-based software from Cloud BioLinux. Cloud BioLinux provides a virtual machine and, in the context of the presented service stack, functions in the Infrastructure as a Service (IaaS) model, providing pre-configured software on a pre-configured Ubuntu operating system and leaving users responsible for maintaining the operating system and configuring the best way to scale the available applications. The second fundamental difference between the two solutions is the software used for structure prediction. Cloud BioLinux is rich in various software packages, allowing us to perform many processes in the domain of bioinformatics. However, in terms of protein structure prediction, it represents a different approach to the problem. It relies on PredictProtein and provides tools for the prediction of secondary structures of
1 GUID—Globally Unique Identifier.
macromolecules, not 3D structures. Unlike Cloud BioLinux, Cloud4PSP is focused on ab initio prediction methods for tertiary structure. Considering these two features, i.e., the service model and the prediction methods, Cloud4PSP is more similar to Rosetta@Cloud and its AbinitioRelax application. In both systems, prediction services are provided in the SaaS model. Although the two systems use different prediction methods, both methods belong to the same ab initio class. Both systems are prepared to be published to the Cloud, giving users the ability to establish their private cloud-based clusters for 3D protein structure prediction, which is also important from the viewpoint of data protection. Rosetta@Cloud builds a cluster of compute units on Amazon Web Services (AWS), while Cloud4PSP works on the Microsoft Azure cloud. There are also some similarities and differences in the architectures of the two systems. Rosetta@Cloud uses a so-called Master Node that controls the distribution of workload among Worker Nodes. The Master Node in Rosetta@Cloud is a counterpart of the PredictionManager role in Cloud4PSP, and Worker Nodes are counterparts of instances of the PredictionWorker role. However, Rosetta’s GUI for launching predictions is installed locally on the user’s computer as the Rosetta@Cloud Launcher; i.e., it works outside the cloud infrastructure, while the Cloud4PSP GUI is designed to be available through a Web site (provided by the Web role working inside the cloud), and therefore users do not have to install anything locally. On a critical note, we have to admit that, although Cloud4PSP implements a relatively new algorithm for 3D protein structure prediction, based on Monte Carlo simulations and structure refinement, at the current stage of development it is unable to catch up with Rosetta@Cloud in terms of the wealth of deployed software for predicting protein structures. However, the advantage of the WZ method used in the Cloud4PSP is that it eliminates the drawbacks of widely used gradient-based methods, which brings it closer to finding the global minimum of the energy function. The implementation of other prediction methods remains a direction for future development of the system. Advanced users may also extend the system by adding their own, new prediction methods. It is also worth mentioning that our tests were performed for the prediction of small molecular structures. For large proteins, the number of iterations needed for modeling the structure increases exponentially, and the process requires more computational resources than those available during the presented tests.
5.5 Summary
Predicting protein structures from scratch using ab initio methods is one of the most challenging tasks for modern computational biology. It requires huge computing resources to allow us to perform calculations in parallel in order to reduce the prediction time. Cloud computing provides theoretically unlimited computing resources in a pay-as-you-go model. This is a very important feature of the cloud architecture. Clouds provide substantial computing power that can be provisioned
on demand and according to particular needs, which perfectly fits the character of the prediction process. Cloud4PSP is a full Software as a Service solution. As we could see in this chapter, it was built with Azure Cloud Services, including various types of roles that have different goals, and Storage Queues for buffering prediction requests, task distribution, and collecting results. The system has an architecture similar to that of the Cloud4PSi system presented in Chap. 4. However, PredictionWorker roles do not have to analyze a lot of macromolecular data, like the Searcher roles in the Cloud4PSi [39]. Instead, they perform a lot of Monte Carlo simulations and generate a lot of protein models that are stored in the Azure cloud storage space. Like the Cloud4PSi, the Cloud4PSP benefits from using queues, as they provide load leveling and load balancing. Moreover, queues allow for temporal decoupling of system components. Web roles and Worker roles do not have to communicate through messages synchronously, because messages are stored durably in a queue. The Web role does not have to wait for any reply from the Worker roles in order to keep on processing prediction requests. The Web role is able to send prediction requests at a different rate than the PredictionManager role can consume them and the PredictionWorker roles can perform all required simulations. Additional queues introduced in the Cloud4PSP allow upper-level roles to be notified of the success of the prediction task or prediction job execution. The Prediction Output queue is used by instances of the PredictionWorker role to return descriptions of results to the PredictionManager role. Such feedback allows the PredictionManager role to verify that all prediction tasks have been completed. In case of a failure in the execution of a prediction task, its completion can be delegated to another PredictionWorker instance. The Output queue allows the Web role to be notified that the whole prediction process is completed. In this chapter, we could see the dedicated architecture of the Cloud-based, scalable, and extensible Cloud4PSP system focused on protein structure prediction. The architecture of the system is oriented toward the specificity of the prediction process, but it can be easily modified to implement other compute-intensive problems. The system benefits from roles and queues and provides a standard processing model that involves the creation of prediction jobs with accompanying collections of prediction tasks. By applying Cloud Services, it was developed as a SaaS solution that is ready to deploy in the Azure cloud, and whose good scalability was proved by a series of tests. In the next chapters, we will see how massive 3D protein structure alignments and the prediction of intrinsically disordered proteins can be effectively accelerated on Big Data platforms, like Apache Hadoop or Apache Spark.
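As a closing illustration of the queue-based decoupling described in this summary, the sketch below shows a producer and a consumer communicating through an Azure Storage Queue. The queue name, the message format, and the task handler are assumptions made for this example, using the azure-storage-queue (v12) SDK rather than the actual Cloud4PSP code.

```python
# A minimal sketch of temporal decoupling through an Azure Storage Queue.
# Names, payloads, and the handler below are assumed for illustration.
from azure.storage.queue import QueueClient

def run_prediction_task(payload):
    """Hypothetical stand-in for a Monte Carlo prediction task."""
    print("processing", payload)

conn_str = "<storage-account-connection-string>"  # placeholder
queue = QueueClient.from_connection_string(conn_str, "prediction-input")

# Producer (e.g., the Web role): enqueue a prediction request and return
# immediately, without waiting for any worker to pick the message up.
queue.send_message('{"jobId": 42, "sequence": "MAFLKR", "iterations": 100}')

# Consumer (e.g., a PredictionWorker instance): poll at its own rate.
for message in queue.receive_messages():
    run_prediction_task(message.content)
    # Delete only after success; an undeleted message becomes visible
    # again and can be completed by another PredictionWorker instance.
    queue.delete_message(message)
```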
5.6 Availability
For the latest news on the Cloud4PSP, visit the Web site of the Cloud4Proteins non-profit scientific group: http://www.zti.aei.polsl.pl/w3/dmrozek/science/cloud4proteins.htm
References
1. Arnold, K., Bordoli, L., Kopp, J., Schwede, T.: The SWISS-MODEL workspace: a web-based environment for protein structure homology modelling. Bioinformatics 22(2), 195–201 (2006)
2. Berman, H., et al.: The Protein Data Bank. Nucleic Acids Res. 28, 235–242 (2000)
3. Bondi, A.: Characteristics of scalability and their impact on performance. In: 2nd International Workshop on Software and Performance, WOSP 2000, pp. 195–203 (2000)
4. Case, D., Cheatham 3rd, T., Darden, T., Gohlke, H., Luo, R., Merz, K.J., Onufriev, A., Simmerling, C., Wang, B., Woods, R.: The Amber biomolecular simulation programs. J. Comput. Chem. 26, 1668–1688 (2005)
5. Chen, C., Huang, Y., Ji, X., Xiao, Y.: Efficiently finding the minimum free energy path from steepest descent path. J. Chem. Phys. 138(16), 164122 (2013)
6. Chivian, D., Kim, D.E., Malmström, L., Bradley, P., Robertson, T., Murphy, P., Strauss, C.E., Bonneau, R., Rohl, C.A., Baker, D.: Automated prediction of CASP-5 structures using the Robetta server. Proteins: Struct. Funct. Bioinf. 53(S6), 524–533 (2003)
7. Cornell, W., Cieplak, P., Bayly, C., Gould, I., Merz, K.J., Ferguson, D., Spellmeyer, D., Fox, T., Caldwell, J., Kollman, P.: A second generation force field for the simulation of proteins, nucleic acids, and organic molecules. J. Am. Chem. Soc. 117, 5179–5197 (1995)
8. De Vries, S., van Dijk, A., Krzeminski, M., van Dijk, M., Thureau, A., Hsu, V., Wassenaar, T., Bonvin, A.: HADDOCK versus HADDOCK: new features and performance of HADDOCK2.0 on the CAPRI targets. Proteins 69, 726–733 (2007)
9. Edic, P., Isaacson, D., Saulnier, G., Jain, H., Newell, J.: An iterative Newton-Raphson method to solve the inverse admittivity problem. IEEE Trans. Biomed. Eng. 45(7), 899–908 (1998)
10. Eswar, N., Webb, B., Marti-Renom, M.A., Madhusudhan, M., Eramian, D., Shen, M., Pieper, U., Sali, A.: Comparative Protein Structure Modeling Using MODELLER, chap. 5. Wiley, New York (2007)
11. Farkas, Z., Kacsuk, P.: P-GRADE portal: a generic workflow system to support user communities. Future Gener. Comput. Syst. 27(5), 454–465 (2011)
12. Ferrari, T., Gaido, L.: Resources and services of the EGEE production infrastructure. J. Grid Comput. 9, 119–133 (2011)
13. Fletcher, R., Powell, M.: A rapidly convergent descent method for minimization. Comput. J. 6(2), 163–168 (1963)
14. Frishman, D., Argos, P.: 75% accuracy in protein secondary structure prediction. Proteins 27, 329–335 (1997)
15. Garnier, J., Gibrat, J., Robson, B.: GOR method for predicting protein secondary structure from amino acid sequence. Methods Enzymol. 266, 540–53 (1996)
16. Gesing, S., Grunzke, R., Krüger, J., Birkenheuer, G., Wewior, M., Schäfer, P., et al.: A single sign-on infrastructure for science gateways on a use case for structural bioinformatics. J. Grid Comput. 10, 769–790 (2012)
17. Gosk, P.: Modeling of protein structures using cloud computing. Master’s thesis, Institute of Informatics, Silesian University of Technology, Gliwice, Poland (2013)
18. Gu, J., Bourne, P.: Structural Bioinformatics (Methods of Biochemical Analysis), 2nd edn. Wiley-Blackwell, Hoboken (2009)
19. Herrmann, T., Güntert, P., Wüthrich, K.: Protein NMR structure determination with automated NOE assignment using the new software CANDID and the torsion angle dynamics algorithm DYANA. J. Mol. Biol. 319, 209–227 (2002)
20. Hovmöller, S., Zhou, T., Ohlson, T.: Conformations of amino acids in proteins. Acta Cryst. D58, 768–776 (2002)
21. Hupfeld, F., Cortes, T., Kolbeck, B., Stender, J., Focht, E., Hess, M., et al.: The XtreemFS architecture - a case for object-based file systems in Grids. Concurrency Computat.: Pract. Exper. 20(17), 2049–2060 (2008)
22. Insilicos: Rosetta@Cloud: Macromolecular modeling in the Cloud. Fact Sheet (2012). Accessed 9 Mar 2018. https://rosettacloud.files.wordpress.com/2012/08/rc-fact-sheet_bp5en2a.pdf
23. Jmol Homepage: Jmol: an open-source Java viewer for chemical structures in 3D (2018). Accessed 7 May 2018. http://www.jmol.org
24. Kacsuk, P., Farkas, Z., Kozlovszky, M., Hermann, G., Balasko, A., Karóczkai, K., Márton, I.: WS-PGRADE/gUSE generic DCI gateway framework for a large variety of user communities. J. Grid Comput. 10(4), 601–630 (2012)
25. Kaján, L., Yachdav, G., Vicedo, E., Steinegger, M., Mirdita, M., Angermüller, C., Böhm, A., Domke, S., Ertl, J., Mertes, C., Reisinger, E., Staniewski, C., Rost, B.: Cloud prediction of protein structure and function with PredictProtein for Debian. BioMed Res. Int. 2013(398968), 1–6 (2013)
26. Källberg, M., Wang, H., Wang, S., Peng, J., Wang, Z., Lu, H., Xu, J.: Template-based protein structure modeling using the RaptorX web server. Nat. Protoc. 7, 1511–1522 (2012)
27. Kelley, L., Sternberg, M.: Protein structure prediction on the Web: a case study using the Phyre server. Nat. Protoc. 4(3), 363–371 (2009)
28. Kessel, A., Ben-Tal, N.: Introduction to Proteins: Structure, Function, and Motion. Chapman & Hall/CRC Mathematical & Computational Biology. CRC Press, Boca Raton (2010)
29. Kim, D., Chivian, D., Baker, D.: Protein structure prediction and analysis using the Robetta server. Nucleic Acids Res. 32(Suppl 2), W526–31 (2004)
30. Kollman, P.: Advances and continuing challenges in achieving realistic and predictive simulations of the properties of organic and biological molecules. Acc. Chem. Res. 29, 461–469 (1996)
31. Krampis, K., Booth, T., Chapman, B., Tiwari, B., et al.: Cloud BioLinux: pre-configured and on-demand bioinformatics computing for the genomics community. BMC Bioinf. 13, 42 (2012)
32. Laganà, A., Costantini, A., Gervasi, O., Lago, N.F., Manuali, C., Rampino, S.: COMPCHEM: progress towards GEMS a Grid empowered molecular simulator and beyond. J. Grid Comput. 8(4), 571–586 (2010)
33. Lampio, A., Kilpeläinen, I., Pesonen, S., Karhi, K., Auvinen, P., Somerharju, P., Kääriäinen, L.: Membrane binding mechanism of an RNA virus-capping enzyme. J. Biol. Chem. 275(48), 37853–9 (2000)
34. Leach, A.: Molecular Modelling: Principles and Applications, 2nd edn. Pearson Education EMA, Essex (2001)
35. Leaver-Fay, A., Tyka, M., Lewis, S., Lange, O., Thompson, J., Jacak, R., et al.: ROSETTA3: an object-oriented software suite for the simulation and design of macromolecules. Methods Enzymol. 487, 545–74 (2011)
36. Lesk, A.: Introduction to Protein Science: Architecture, Function, and Genomics, 2nd edn. Oxford University Press, NY (2010)
37. Mell, P., Grance, T.: The NIST definition of Cloud Computing. Special Publication 800-145 (2011). Accessed 7 May 2018. http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf
38. Mrozek, D.: High-Performance Computational Solutions in Protein Bioinformatics. Springer Briefs in Computer Science. Springer International Publishing, Berlin (2014)
39. Mrozek, D., Małysiak-Mrozek, B., Kłapciński, A.: Cloud4Psi: cloud computing for 3D protein structure similarity searching. Bioinformatics 30(19), 2822–2825 (2014)
40. Mrozek, D., Gosk, P., Małysiak-Mrozek, B.: Scaling Ab Initio predictions of 3D protein structures in Microsoft Azure cloud. J. Grid Comput. 13, 561–585 (2015)
41. Mrozek, D., Kłapciński, A., Małysiak-Mrozek, B.: Orchestrating task execution in Cloud4PSi for scalable processing of macromolecular data of 3D protein structures. In: Nguyen, N.T., Tojo, S., Nguyen, L.M., Trawiński, B. (eds.) Intelligent Information and Database Systems. Lecture Notes in Computer Science, vol. 10192, pp. 723–732. Springer International Publishing, Cham (2017)
42. Pierce, L., Salomon-Ferrer, R., de Oliveira, C., McCammon, J., Walker, R.: Routine access to millisecond time scale events with accelerated molecular dynamics. J. Chem. Theory Comput. 8(9), 2997–3002 (2012)
43. Ponder, J.: TINKER - software tools for molecular design (2001), Dept. of Biochemistry & Molecular Biophysics, Washington University, School of Medicine, St. Louis
44. Ramachandran, G., Ramakrishnan, C., Sasisekaran, V.: Stereochemistry of polypeptide chain configurations. J. Mol. Biol. 7, 95–9 (1963)
45. Rost, B., Liu, J.: The PredictProtein server. Nucleic Acids Res. 31(13), 3300–3304 (2003)
46. Schwieters, C., Kuszewski, J., Tjandra, N., Clore, G.: The Xplor-NIH NMR molecular structure determination package. J. Magn. Reson. 160, 65–73 (2003)
47. Shanno, D.: On Broyden-Fletcher-Goldfarb-Shanno method. J. Optimiz. Theory Appl. 46 (1985)
48. Shaw, D.E., Dror, R.O., Salmon, J.K., Grossman, J.P., Mackenzie, K.M., Bank, J.A., Young, C., Deneroff, M.M., Batson, B., Bowers, K.J., Chow, E., Eastwood, M.P., Ierardi, D.J., Klepeis, J.L., Kuskin, J.S., Larson, R.H., Lindorff-Larsen, K., Maragakis, P., Moraes, M.A., Piana, S., Shan, Y., Towles, B.: Millisecond-scale molecular dynamics simulations on Anton. In: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, pp. 39:1–39:11. SC ’09, ACM, New York, NY, USA (2009)
49. Shen, Y., Vernon, R., Baker, D., Bax, A.: De novo protein structure generation from incomplete chemical shift assignments. J. Biomol. NMR 43, 63–78 (2009)
50. Shirts, M., Pande, V.: COMPUTING: screen savers of the world unite!. Science 290(5498), 1903–4 (2000)
51. Söding, J.: Protein homology detection by HMM-HMM comparison. Bioinformatics 21(7), 951–960 (2005)
52. Streit, A., Bala, P., Beck-Ratzka, A., Benedyczak, K., Bergmann, S., Breu, R., et al.: Unicore 6 - recent and future advancements. JUEL 4319 (2010)
53. Van Der Spoel, D., Lindahl, E., Hess, B., Groenhof, G., Mark, A., Berendsen, H.: GROMACS: fast, flexible, and free. J. Comput. Chem. 26, 1701–1718 (2005)
54. Warecki, S., Znamirowski, L.: Random simulation of the nanostructures conformations. In: Proceedings of International Conference on Computing, Communication and Control Technology, vol. 1, pp. 388–393. The International Institute of Informatics and Systemics, Austin, Texas (2004)
55. Wassenaar, T.A., van Dijk, M., Loureiro-Ferreira, N., van der Schot, G., de Vries, S.J., Schmitz, C., van der Zwan, J., Boelens, R., Giachetti, A., Ferella, L., Rosato, A., Bertini, I., Herrmann, T., Jonker, H.R., Bagaria, A., Jaravine, V., Güntert, P., Schwalbe, H., Vranken, W.F., Doreleijers, J.F., Vriend, G., Vuister, G., Franke, D., Kikhney, A., Svergun, D.I., Fogh, R.H., Ionides, J., Laue, E.D., Spronk, C., Jurkša, S., Verlato, M., Badoer, S., Dal Pra, S., Mazzucato, M., Frizziero, E., Bonvin, A.M.: WeNMR: structural biology on the Grid. J. Grid Comput. 10(4), 743–767 (2012)
56. Wu, S., Skolnick, J., Zhang, Y.: Ab initio modeling of small proteins by iterative TASSER simulations. BMC Biol. 5(17) (2007)
57. Xu, D., Zhang, Y.: Ab initio protein structure assembly using continuous structure fragments and optimized knowledge-based force field. Proteins 80(7), 1715–35 (2012)
58. Xu, J., Li, M., Kim, D., Xu, Y.: RAPTOR: optimal protein threading by linear programming, the inaugural issue. J. Bioinform. Comput. Biol. 1(1), 95–117 (2003)
59. Yang, Y., Faraggi, E., Zhao, H., Zhou, Y.: Improving protein fold recognition and template-based modeling by employing probabilistic-based matching between predicted one-dimensional structural properties of query and corresponding native properties of templates. Bioinformatics 27(15), 2076–2082 (2011)
60. Zhang, Y.: Progress and challenges in protein structure prediction. Curr. Opin. Struct. Biol. 18(3), 342–348 (2008)
61. Znamirowski, L.: Non-gradient, sequential algorithm for simulation of nascent polypeptide folding. In: Sunderam, V.S., van Albada, G.D., Sloot, P.M., Dongarra, J.J. (eds.) Computational Science - ICCS 2005. Lecture Notes in Computer Science, vol. 3514, pp. 766–774. Springer, Berlin (2005)
Part III
Big Data Analytics in Protein Bioinformatics
Big Data tools allow for distributed analysis and processing of large volumes of unstructured or semi-structured data that are commonly produced in bioinformatics. In this part of the book, we present three solutions developed for performing scalable computations over Big Data sets of protein structures. In Chap. 7, we show how massive protein structure alignments can be performed with the use of the Apache Hadoop Big Data framework located on a private cloud; in Chap. 8, we will see how the same process is performed on HDInsight Hadoop-based clusters scaled on the public cloud; and in Chap. 9, we will see how Apache Spark helps in the identification of intrinsically disordered regions in protein structures. These three chapters are preceded by an introductory Chap. 6 devoted to the foundations of Big Data, the MapReduce processing model, and Big Data frameworks, like Apache Hadoop and Apache Spark, that allow for scalable data processing and analysis.
Chapter 6
Foundations of the Hadoop Ecosystem
The ability to collect, analyze, triangulate, and visualize vast amounts of data in real time is something the human race has never had before. This new set of tools, often referred by the lofty term ’Big Data,’ has begun to emerge as a new approach to addressing some of the biggest challenges facing our planet
Rick Smolan
Abstract The era of Big Data that we entered several years ago has changed our imagination about the type and the volume of data that can be processed, as well as the value of the data. Hadoop and the MapReduce processing model have revolutionized the way we process and analyze data today and how much important and valuable information we can extract from them. At the moment, the Hadoop ecosystem covers a broad collection of platforms, frameworks, tools, libraries, and other services for fast, reliable, and scalable data analytics. In this chapter, we will briefly describe the Hadoop ecosystem. We will also focus on two elements of the ecosystem—the Apache Hadoop and the Apache Spark. We will provide details of the MapReduce processing model and the differences between MapReduce 1.0 and MapReduce 2.0. The concepts defined here are important for the understanding of the complex systems presented in the following chapters of this part of the book.
Keywords Big Data · Scalable computations · MapReduce · Apache Hadoop · Apache Spark · Hadoop ecosystem
6.1 Big Data
The era of Big Data that we entered several years ago has changed our imagination about the type and the volume of data that can be processed, as well as the value of the data. Recall that the Big Data challenge usually arises when data sets
are so large that conventional database management and data analysis tools are insufficient to process them [2]. Indeed, we are now interested in collecting and processing data that we would not have imagined collecting ten years ago, since they were not expected to bring any value. For example, we can now collect a lot of medical data from electronic health records or medical equipment, weather indicators from various sensors located in weather stations, TV channel activity in cable television, parameters of production machines in a factory, the number of passengers entering public transport vehicles, and other sorts of data. But why do we do this now? The answer is quite simple—because with the new technologies for processing and analyzing the data, we are now able to get a new value that can make our lives better, optimize some processes or services, or bring other measurable benefits. All we need for doing this is the right technology platform that will allow us to better understand our data, our customers, our processes, our problems, or our field of action. Hadoop and the MapReduce processing model have revolutionized the way we process and analyze data today and how much important and valuable information we can extract from the data. At the moment, the Hadoop ecosystem covers a broad collection of platforms, frameworks, tools, libraries, and other services for fast, reliable, and scalable data analytics. Let’s take a closer look at these technologies to better understand the solutions presented in the following chapters of the book.
6.2 Hadoop
Apache Hadoop [4] is one of the dominant technologies today for analyzing big data. Hadoop is an open-source software framework that allows data to be distributed and computations to be performed on many nodes of a computational cluster by using two main components [1]:
1. the Hadoop Distributed File System (HDFS),
2. the MapReduce processing model.
The Hadoop Distributed File System (HDFS) provides reliable shared storage by spreading data across multiple DataNodes (servers) and storing them in a replicated manner. The MapReduce paradigm then allows that data to be processed and analyzed in parallel by MapReduce jobs created by a developer.
6.2.1 Hadoop Distributed File System
Large portions of the data processed by Hadoop are usually stored in the Hadoop Distributed File System (HDFS). HDFS is designed for the distributed storage of large data sets, while ensuring reliable and quick access to the data. Typical sizes of files stored on the HDFS range from a few gigabytes to several terabytes, and the files are
Fig. 6.1 Master–slave architecture of Hadoop Distributed File System (HDFS)
usually processed in a batch manner. Once saved on the HDFS, the data are read many times in various analytical processes. The Hadoop Distributed File System is designed to run on low-cost commodity hardware and may use hundreds of nodes to store parts of the HDFS data. The data stored in HDFS are divided into blocks, which are in turn distributed over all cluster nodes. The block size is configurable; a typical block size is 64 MB or 128 MB. The HDFS has a master–slave architecture (see Fig. 6.1) that consists of a NameNode and a number of DataNodes. The NameNode manages the filesystem namespace and controls all the input–output operations executed by clients. It mediates operations on files and directories, like opening, closing, and renaming. It also manages the metadata of the files stored in the HDFS and the information on the DataNodes used to store file data. The DataNodes store and manage the blocks of data the files are split into. They are responsible for various operations on blocks, like the creation, deletion, and replication of blocks, requested by the NameNode. Hadoop protects stored data against loss by replicating every single block of data to multiple nodes of the cluster, thereby enabling reliable and fast computations. Unless the last replica of the data is lost, the owner does not even become aware of any damage that may occur on the data server side. The replication factor is configurable.
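To illustrate these concepts from a client’s perspective, the sketch below writes and reads a file through the pyarrow HDFS bindings; the NameNode address, the path, and the chosen replication factor and block size are assumptions for illustration only.

```python
# A minimal sketch of an HDFS client based on pyarrow (which wraps libhdfs);
# the connection parameters below are assumed, not taken from a real cluster.
from pyarrow import fs

# Connect to the NameNode; the replication factor and the block size are
# configurable, as noted above (here: 3 replicas, 128 MB blocks).
hdfs = fs.HadoopFileSystem(host="namenode", port=8020,
                           replication=3,
                           default_block_size=128 * 1024 * 1024)

# Write a file: HDFS splits the stream into blocks that the NameNode
# places on DataNodes, and each block is replicated three times.
with hdfs.open_output_stream("/data/pdb/titles.txt") as out:
    out.write(b"Rotavirus NSP1 peptide ...\n")

# Read the file back: the NameNode returns the block locations, and the
# client fetches the blocks directly from the DataNodes.
with hdfs.open_input_stream("/data/pdb/titles.txt") as src:
    print(src.read().decode())
```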
Fig. 6.2 Sample data flow in single Map and Reduce tasks for aggregating the number of occurrences of short names of proteins in titles of scientific papers
6.2.2 MapReduce Processing Model
MapReduce is a data processing model for processing highly parallelizable data sets. The MapReduce model defines how parallel processing of huge data sets is performed in a distributed environment. There are generally two stages of performing calculations in systems that implement the MapReduce model: the Map phase, which comprises parallel processing of records of the input data set, and the Reduce phase, which aggregates related data. Hadoop implements the MapReduce computational model, thus allowing data analysts to easily create and run applications that process data in a distributed cluster of computers. Hadoop accepts MapReduce applications, called MapReduce jobs, as a set of specific processing work to be done on the specified distributed data with some specified configuration parameters. The MapReduce job is then divided by Hadoop into smaller tasks; there are two main types of these tasks: Map tasks and Reduce tasks. This division is directly related to the concept of performing two-stage calculations in the MapReduce model. Input data, subdivided into smaller fragments (splits) containing particular data records, are processed in parallel by multiple Map tasks that generate results in the form of key–value pairs. These results are then sorted and merged based on the key (the shuffle mid-phase). Reduce tasks may then perform aggregations of the values related to particular keys and generate the output in the form of a list of key–value pairs. A sample MapReduce data flow for aggregating the number of occurrences of short names of proteins in the titles of scientific papers is presented in Fig. 6.2. As the input, the Map phase accepts titles of scientific papers, e.g., from the Protein Data Bank. The key is the row identifier, which is further ignored as it is not needed in the presented example. The Map function extracts short names of proteins and produces a list of the names (as the key) with the information that they occurred (as the value). For simplicity of further aggregation, the value of 1 is generated for each occurrence. The output from the Map function is then processed by the MapReduce framework (shuffle), which sorts and groups the key–value pairs by key. This produces the input for the Reduce phase. The Reduce function aggregates the provided values for each extracted protein name. It generates the list of short names (key) together with the aggregated number of occurrences (value) as the output.
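The data flow of Fig. 6.2 can be sketched as a pair of Hadoop Streaming scripts; the protein-name dictionary, the file names, and the tokenization are simplifying assumptions, and real implementations are usually written against the Java MapReduce API.

```python
#!/usr/bin/env python3
# mapper.py -- a minimal sketch of the Map function from Fig. 6.2: for each
# paper title read from stdin, emit (protein name, 1) for every occurrence.
import sys

PROTEIN_NAMES = {"NSP1", "BRCA1"}  # assumed lookup set of short names

for title in sys.stdin:
    for token in title.split():
        name = token.strip(".,;()")
        if name in PROTEIN_NAMES:
            print(f"{name}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- the Reduce function: the shuffle phase delivers mapper output
# sorted by key, so all occurrences of one protein name arrive contiguously.
import sys

current, count = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = key, 0
    count += int(value)
if current is not None:
    print(f"{current}\t{count}")
```

Such scripts would be submitted with the hadoop-streaming JAR shipped with Hadoop, with the input and output paths pointing to HDFS directories.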
Fig. 6.3 Architecture of the Hadoop cluster and execution of the MapReduce v1 job on the Hadoop
6.2.3 MapReduce 1.0 (MRv1)
Splitting the MapReduce job into tasks enables their distributed execution on cluster nodes. Hadoop tries to perform computations as close to the data as possible. HDFS provides interfaces for applications to move themselves to the DataNode where the data block needed for computations is located. Moving computations to the data is a key design principle, since it minimizes the network traffic and raises the throughput of the system, especially if the volume of data that must be processed is huge. Each Map task can be executed on one of the Hadoop cluster nodes that performs the needed computations. Many Map tasks are thus executed on multiple nodes of the Hadoop cluster. The same applies to Reduce tasks, which can also be run simultaneously on multiple cluster nodes. Each cluster node can accept for execution many Map and many Reduce tasks during the whole MapReduce job lifetime. In MapReduce 1.0 (also called MRv1, Fig. 6.3), the execution of the MapReduce job and all its tasks on the Hadoop platform is controlled by two services provided by the system and run on the cluster nodes in a master–slave manner:
1. JobTracker—accepts and coordinates MapReduce jobs and monitors their status during execution; it is also responsible for the distribution of Map and Reduce tasks to cluster nodes, where their execution is controlled by TaskTrackers.
2. TaskTracker—controls the execution of individual Map and Reduce tasks on nodes of the Hadoop cluster, and reports the execution progress to the JobTracker.
In the most frequent scenarios, the Hadoop compute nodes (hosting the JobTracker and TaskTrackers) and the HDFS storage nodes (the NameNode and DataNodes) are the
same, i.e., the MapReduce framework and the Hadoop Distributed File System use the same computer cluster nodes. Such an approach allows Hadoop to effectively schedule tasks on the cluster nodes where data are stored, which results in very high aggregate bandwidth across the cluster.
6.2.4 MapReduce 2.0 (MRv2)
The execution process in MapReduce 2.0 (MRv2), the next generation of MapReduce, is different. The introduction of MRv2 resulted from the scalability bottlenecks that the previous generation of MapReduce (MRv1) experienced on very large clusters. MRv2 makes use of YARN (Yet Another Resource Negotiator), which is responsible for resource management and delivers consistent operations, security, and data governance tools across Hadoop clusters. In MRv2, resource management and job scheduling/monitoring, which were the responsibilities of the JobTracker in MRv1, are split up and managed by two separate services (Fig. 6.4):
1. a Resource Manager, which manages resources across the cluster,
2. a per-application Application Master, which manages the application (a job or a DAG of jobs) by negotiating resources from the Resource Manager and working with the NodeManager(s) to execute and monitor the tasks.
Fig. 6.4 Execution of the MapReduce v2 job with YARN
Resources required by an application are represented as Containers and may cover elements such as memory, CPU, disk, network, etc. These containers are used to run the Application Master (one per Hadoop job) and the MapReduce tasks. They are scheduled by the Resource Manager and managed by the NodeManagers. The NodeManagers create and monitor containers on each node of the cluster and report their status to the Resource Manager. The Application Master negotiates appropriate resource containers from the Resource Manager, tracks the status of the containers, and monitors the progress of task execution. For example, in Fig. 6.4 we can see two clients (blue and green) that submit MapReduce jobs to the YARN-managed Hadoop cluster. The Resource Manager starts two Application Masters for the jobs in containers allocated on two nodes. Each Application Master creates a number of Map tasks for the input splits located in the HDFS, and a number of Reduce tasks. Next, each of the Application Masters requests containers for all the Map and Reduce tasks from the Resource Manager. The containers are allocated as close to the data as possible, and the tasks are executed within them on particular DataNodes. It is recommended that the YARN Resource Manager and the HDFS NameNode do not reside on the same nodes, but have dedicated machines allocated for their activity.
6.3 Apache Spark
Apache Spark [7] is a platform for large-scale data processing and general-purpose computing originally developed in the AMPLab at the University of California, Berkeley. At the moment, the platform is developed within the Apache Software Foundation. Spark was created to run computations on computer clusters, allowing computations to be distributed over the cluster nodes. However, in contrast to the two-phase, disk-based MapReduce paradigm applied in Hadoop, Spark can execute programs even 100 times faster in memory, and up to 10 times faster on disk, than Hadoop MapReduce. In 2014, the Spark platform won the Daytona GraySort Contest for the fastest sorting of 100TB of data [5]. Spark sorted 100TB of data in 23 min on a 206-node cluster with 6,592 virtual CPU cores, while the previous winner of the competition—Hadoop MapReduce—did it in 72 min on a 2,100-node cluster with 50,400 physical CPU cores. This gives a threefold speedup using 10 times fewer cluster nodes, which shows that in many solutions Spark can be a worthy successor to the Hadoop platform and the MapReduce programming model. Spark applications are executed on a cluster as collections of processes coordinated by the SparkContext service in the main application program, called the driver program. Like Hadoop, Spark cooperates with cluster managers that manage resources and schedule computations on the cluster nodes (Fig. 6.5). At the time of writing, Spark supported the following cluster managers:
Fig. 6.5 Overview of the Spark application execution
• Standalone—the simplest cluster manager, included with Spark, that facilitates easy cluster setup,
• Apache Mesos—a general cluster manager that can also run Hadoop MapReduce and service applications,
• Hadoop YARN—the resource manager in Hadoop MapReduce 2.0,
• Kubernetes—an open-source cluster manager that schedules computations, automates deployment, enables scaling, and manages containerized applications.
Spark divides the computational job into a number of tasks and executes the tasks on the Worker nodes of the cluster within processes called executors. In addition to performing tasks, executors are responsible for keeping data in memory or disk storage across tasks. Operational data that must be processed on a Spark cluster are delivered as resilient distributed data sets (RDDs). An RDD [6] is a fault-tolerant collection of elements that Spark operates on in parallel. RDDs can be created by parallelizing an already existing data collection through a transformation or by referencing a data set in an external storage system, such as the Hadoop Distributed File System (HDFS). Both methods will be used in Chap. 9, devoted to the prediction of intrinsically disordered proteins on Spark. Data delivered in the form of RDDs are divided into a number of partitions. Spark then adjusts the number of tasks sent for execution to the number of partitions the RDD collection is cut into. RDDs allow processing results to be stored in memory at any stage of data processing, e.g., between various transformations. Transformations perform various operations on data and create new data sets (RDDs) from existing ones. The following list provides examples of the most commonly used transformations [3]:
• map(func)—returns a new distributed data set produced by processing each element of the source by a given function func,
• filter(func)—returns a new data set produced by selecting those elements of the input on which func returns true,
• sample—samples a fraction of the data,
• union(otherDataset)—returns a new data set that contains the union of the elements in the source RDD and the other data set,
• reduceByKey(func, [numPartitions])—when called on a data set of key–value (K, V) pairs, returns a data set of key–value (K, V) pairs where the values for each key are aggregated using the given reduce function func,
• pipe(command, [envVars])—pipes each partition of the RDD collection through a shell command, e.g., a Perl or bash script,
• and many others.
Aside from transformations, Spark also supports performing actions, which return results to the main program after performing computations on the RDD data set. Examples of actions are [3]:
• reduce(func)—aggregates the elements of the RDD collection by using a commutative and associative function func that can be computed correctly in parallel,
• collect()—returns all the elements of the RDD collection as an array at the driver program after filtering or another operation that returns a sufficiently small subset of the data,
• count()—returns the number of elements in the RDD collection,
• takeOrdered(n, [ordering])—returns the first n elements of the RDD collection, sorting them in their natural order or with the use of a custom comparator,
• saveAsTextFile(path)—writes the elements of the RDD data set as a text file (or a collection of text files) in a given directory in the local filesystem, HDFS, or any other Hadoop-supported file system,
• foreach(func)—executes a function func on each element of the RDD collection,
• and others.
The success of the Apache Spark lies in its important capability to persist (or cache) an RDD data set in memory across operations. When an RDD is persisted, each cluster node stores in memory any partitions of the RDD that it operates on. These RDD partitions are then reused in other operations performed on that data set (or on data sets derived from it). This speeds up future actions, even by a factor of 10. Caching is very important for all algorithms that operate in an iterative manner and must use previously computed data (Fig. 6.6). RDD collections can be persisted by invoking the persist() or cache() methods on them. After RDD data sets are computed for the first time in an action, they are kept in memory on the cluster nodes. The cache provided by Spark is fault-tolerant. In case of losing any partition of the RDD data set, Spark automatically recomputes it by performing the transformations that originally led to the creation of the RDD partition.
Fig. 6.6 Sample modes of data processing in Spark: a low latency computations with many queries using cached data set in memory, b iterative computations with intermediate results stored in memory
Moreover, each persisted RDD data set can be stored using a different storage level. This enables, for example, persisting the RDD data set on disk, or in memory but as a collection of serialized Java objects (to save space), or replicating the RDD across cluster nodes.
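To make the notions of transformations, actions, and caching concrete, the following minimal sketch in Java mirrors the protein-mention counting example of Fig. 6.6. It creates an RDD from external storage, caches it, and runs two different computations over the same in-memory data. The HDFS path and the protein names are illustrative assumptions, not part of any real pipeline:

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class ProteinMentions {
      public static void main(String[] args) {
        // the master URL is supplied externally, e.g., by spark-submit
        SparkConf conf = new SparkConf().setAppName("ProteinMentions");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
          // create an RDD by referencing a data set in external storage (HDFS);
          // the path is an illustrative assumption
          JavaRDD<String> abstracts = sc.textFile("hdfs:///pubmed/abstracts.txt");

          // persist the RDD so that both queries below reuse the cached partitions;
          // cache() is shorthand for persist() with the default storage level
          abstracts.cache();

          // transformation (filter) followed by an action (count): query 1
          long nsp1Abstracts = abstracts.filter(line -> line.contains("NSP1")).count();

          // transformations (flatMap, filter, mapToPair, reduceByKey)
          // followed by an action (collect): query 2, reusing the cached RDD
          JavaPairRDD<String, Integer> counts = abstracts
              .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
              .filter(word -> word.equals("NSP1") || word.equals("BRCA1"))
              .mapToPair(word -> new Tuple2<>(word, 1))
              .reduceByKey(Integer::sum);

          counts.collect()
              .forEach(t -> System.out.println(t._1() + " | " + t._2()));
          System.out.println("Abstracts mentioning NSP1: " + nsp1Abstracts);
        }
      }
    }

Because the abstracts RDD is cached after its first materialization, the second query does not re-read the input from HDFS, which is exactly the low-latency, multiple-query mode sketched in Fig. 6.6a.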
6.4 Hadoop Ecosystem

The Hadoop ecosystem contains not only the Hadoop and Spark computational frameworks presented in the previous sections, but also many other tools and frameworks that facilitate efficient and scalable computations on Big Data sets. Figure 6.7 shows the broad Hadoop ecosystem with its many components for fast processing of various types of data. Popular components of the Hadoop ecosystem include (the list is not exhaustive) the following:
• Spark—an in-memory data processing platform for fast and general computations,
• Storm—a streaming data processing platform for continuous computations and real-time analytics over stream data,
Fig. 6.7 Broad Hadoop ecosystem
• Solr—a highly reliable, scalable, and fault-tolerant search engine that allows for efficient full-text search and near real-time distributed indexing,
• Pig—a platform for development and execution of high-level Pig Latin language scripts for complex ETL and data analysis jobs on Hadoop data sets,
• Hive—a data warehouse infrastructure that provides data aggregation and ad hoc querying,
• HBase—a scalable, distributed NoSQL database that provides a structured data storage space for large tables,
• Cassandra—an open-source distributed NoSQL database management system for handling large amounts of data and providing high availability,
• Impala—a massively parallel processing (MPP) analytical database and interactive SQL-based query engine for real-time analytics,
• Mahout—a library of scalable statistical, analytical, and machine learning algorithms for data mining and analysis, implemented on top of Apache Hadoop with the use of the MapReduce paradigm,
• Giraph—an iterative graph processing engine built on top of the Hadoop framework,
• Oozie—a workflow scheduler system to manage Apache Hadoop jobs,
• Sqoop—a tool for efficiently transferring bulk data between Apache Hadoop and structured data stores, such as relational databases,
• Flume—a distributed service for collecting, aggregating, and moving large amounts of streaming log data into the HDFS,
• Kafka—a distributed streaming platform for building real-time data pipelines and streaming applications,
• Ambari—a Web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop,
• ZooKeeper—a high-performance coordination service for distributed applications, e.g., maintaining Hadoop configuration information and enabling coordination among distributed Hadoop processes,
• Ranger—a centralized framework for monitoring and managing comprehensive data security across the Hadoop platform,
• Knox—an application gateway, which acts as a reverse proxy and provides perimeter security for Hadoop clusters.
6.5 Summary

The Hadoop ecosystem covers a broad collection of platforms, frameworks, tools, libraries, and other services for fast, reliable, and scalable data analysis and processing. The list of elements of the Hadoop ecosystem is not exhaustive, and by the time you read this book, many other elements will probably have been developed to support various types of operations performed in the ecosystem. In this chapter, we focused on and briefly described only two elements of the ecosystem: Hadoop and Spark.
Hadoop and the MapReduce processing model have revolutionized the way we process and analyze data today, and how much important and valuable information we can extract from the data. Spark is a successor of the Hadoop platform when it comes to the speed of performing certain calculations. However, both can be used for different purposes. For example, Spark is very useful in jobs that consist of many iterative operations on the processed data, or that need results to be provided quickly (fast data transformations) or in near real time. For batch-mode processing and non-iterative, long-running computations, Hadoop and the MapReduce may remain sufficient. Another difference is that Hadoop provides its own storage space, the Hadoop Distributed File System (HDFS). Spark does not provide any storage space or file system; it must use one: if not the HDFS, then another file system. For further reading on the Hadoop framework, I would recommend the book "Hadoop—The Definitive Guide: Storage and Analysis at Internet Scale" by Tom White. For further reading on Spark, I would recommend the book "Spark: The Definitive Guide—Big Data Processing Made Simple" by Matei Zaharia and Bill Chambers. In the following three chapters, we will see how particular components of the Hadoop ecosystem can be used to perform complex calculations related to the analysis of protein data. In Chap. 7, we will see how the Hadoop Distributed File System is used to store macromolecular data of proteins, and how the MapReduce is applied in massive 3D protein structure alignments performed on a local computer cluster. In Chap. 8, we will extend the use of Hadoop to the public cloud for unlimited scaling of 3D protein structure similarity searches, and we will use the HBase data warehouse for storing results. In Chap. 9, we will see Spark performing scalable predictions of intrinsically disordered regions in protein structures for increasing volumes of protein sequences.
References
1. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)
2. National Research Council: Frontiers in Massive Data Analysis. National Academy Press, Washington, D.C. (2013)
3. The Apache Software Foundation: RDD Programming Guide (2018). https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-programming-guide
4. White, T.: Hadoop - The Definitive Guide: Storage and Analysis at Internet Scale, 3rd edn. O'Reilly, Ireland (2012)
5. Xin, R.: Apache Spark officially sets a new record in large-scale sorting. Technical report, Engineering Blog (2014). https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
6. Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauly, M., Franklin, M.J., Shenker, S., Stoica, I.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Presented as part of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), pp. 15–28. USENIX, San Jose, CA (2012). https://www.usenix.org/conference/nsdi12/technical-sessions/presentation/zaharia
7. Zaharia, M., Xin, R.S., Wendell, P., Das, T., Armbrust, M., Dave, A., Meng, X., Rosen, J., Venkataraman, S., Franklin, M.J., Ghodsi, A., Gonzalez, J., Shenker, S., Stoica, I.: Apache Spark: a unified engine for big data processing. Commun. ACM 59(11), 56–65 (2016). https://doi.org/10.1145/2934664
Chapter 7
Hadoop and the MapReduce Processing Model in Massive Structural Alignments Supporting Protein Function Identification
Without big data analytics, companies are blind and deaf, wandering out onto the web like deer on a freeway.
Geoffrey Moore

The world is one big data problem.
Andrew McAfee
Abstract Undoubtedly, for a variety of biological data and a variety of scenarios of how these data can be processed and analyzed, Hadoop and the MapReduce processing model bring the potential to make a step forward toward the development of solutions that will allow us to gain insights into various biological processes much faster. In this chapter, we will see a MapReduce-based computational solution for efficient mining of similarities in 3D protein structures and for structural superposition. The solution benefits from the Map-only processing pattern, which utilizes only the Map phase of the MapReduce model. We will also see the results of performance tests carried out when scaling up the nodes of the Hadoop cluster and increasing the degree of parallelism, with the intention of improving the efficiency of the computations.

Keywords Bioinformatics · Big Data · Proteins · Scalable computations · Hadoop · MapReduce · 3D protein structures · Structural alignment · Similarity searching · Superposition
7.1 Introduction

In the previous chapter, we saw how dedicated frameworks, platforms, and tools from the Hadoop ecosystem support efficient analysis of large data volumes by dividing the data into smaller pieces and by distributing computations on computer clusters. However, business data are not the only data experiencing a huge explosion. The volume of biological data, including DNA sequences, protein sequences, and their
3D structures, collected in dedicated, frequently public repositories, increases every year. This growth is caused by international efforts to understand living organisms at various levels, determined by the biological information that is analyzed. The number of 3D protein structures in the Protein Data Bank, a worldwide repository for collecting structures of proteins and other biomolecules, increases every year. Thousands of newly discovered proteins, both those whose structures were determined experimentally by X-ray crystallography or Nuclear Magnetic Resonance (NMR) and those determined by computational methods, need to be compared with the existing ones to find similarities, identify their functions, and classify them into families. This involves performing structural alignments between these proteins and comparing hundreds of millions of atomic positions. Structural alignments can also be used to find structural homologs and compare proteins with low sequence similarity, in order to detect evolutionary relationships between proteins that share very little common sequence. Structural alignment methods can be used in comparisons of individual structures (single one-to-one comparisons), one-to-many comparisons against a set of protein structures collected in a repository, or many-to-many comparisons performed for all protein structures within the repository (Fig. 7.1). In the latter two cases, however, performing structural alignments can be very time-consuming and thus requires a distributed computational approach in order to complete the task in a reasonable time. Fortunately, recent advances in processing and analyzing data by utilization of the MapReduce processing model and the Hadoop platform [25] make it possible to bring computations to the data and to scale the computations with the growing volume of data and the growing demand for computational power.
7.2 Scalable Solutions for 3D Protein Structure Alignment and Similarity Searching

The growing volume of biological data and the variety of data gathered in different areas of bioinformatics, together with the high rate at which the data are generated, caused the necessity to search for efficient computational solutions that would support parallel data exploration and increase the performance of the process. Recent technological advances in parallel programming, distributed computing, and data processing brought several solutions for accelerating protein structure comparison with the use of GPU devices, farms of computers or virtual machines located in the Cloud, and Hadoop clusters. The first group, of GPU-based approaches, contains solutions proposed by Leinweber et al. [7–9] for structural comparisons of protein binding sites. These approaches rely on a feature-based representation of protein structure and graph comparisons. The authors reported a significant superiority of GPU-based executions of the comparison process over CPU-based ones in terms of the runtimes of the performed experiments. The mentioned works focus on the comparison of protein binding sites, not protein structures as a whole.
Fig. 7.1 Various comparison scenarios for 3D protein structures: a one-to-one comparisons between pairs of protein structures (left) and the result of a single structural alignment (right), b one-to-many comparison between given 3D protein structure and structures in the repository/collection, c many-to-many comparisons between 3D protein structures in the repository/collection
In contrast, fold-based methods for protein comparison focus on entire protein structures. Examples of GPU-accelerated fold-based methods are SA Tableau Search by Stivala et al. [23], pssAlign by Pang et al. [19], and GPU-CASSERT, reported in one of our previous works [13] and presented in Chap. 10. The SA Tableau Search makes use of orientations of secondary structure elements and distance matrices to represent protein structures, and of simulated annealing for optimal alignment. The pssAlign uses locations of the Cα atoms to represent proteins in the comparison and a dynamic programming procedure for the alignment. Finally, the GPU-CASSERT represents proteins as reduced chains of secondary structure elements and chains of molecular residue descriptors, and it compares proteins with the use of a dynamic programming-based two-phase alignment method. All three GPU-based implementations of fold-based methods significantly accelerate calculations related to assessing the similarity of protein molecules. However, they also require appropriate GPU devices and specific preparation of data. Moreover, they only measure the similarity of protein molecules without providing the final superposition. The second group of approaches consists of solutions that utilize farms of computers or virtual machines to perform protein structure comparison. Examples of such solutions are MAS4PSi [11], Cloud4PSi [12, 17], described earlier in Chap. 4, and CloudPSR [15]. All of them rely on fold-based methods for 3D protein structure comparison. The MAS4PSi utilizes the JADE multi-agent system for reliable and efficient computations, in which software agents residing on a farm of physical computers perform pairwise alignments. On the other hand, the Cloud4PSi and the CloudPSR utilize a collection of virtual machines (Worker roles) located in the public Cloud for protein structure alignment. Various scheduling schemes for computations were tested in both systems. In all presented systems, parallelization brought a significant acceleration of computations, mainly by scaling the systems horizontally. However, they are specific to the public cloud provider. The third group contains Hadoop and MapReduce-based approaches for protein structure alignment. This group contains the system developed by Che-Lun Hung and Yaw-Ling Lin [6] and the HDInsight4PSi [14] that will be described in Chap. 8. The Hadoop-based system developed by Hung and Lin uses two popular fold-based alignment methods—DALI [5] and VAST [3]. The Map phase performs structural alignments for given pairs of protein structures, and the Reduce phase refines the results of the produced alignments. The system was tested with one thousand 3D protein structures randomly chosen from the Protein Data Bank on a private virtualized computing environment containing eight Data nodes, showing good, linear scalability of the performed computations. In our previous work, published in [14], we also proved that using sequential files (instead of processing individual structures) may increase the performance of MapReduce-based parallel protein structure similarity searches. This was especially visible when processing large repositories of macromolecular structures. The HDInsight4PSi, presented in Chap. 8, was developed for HDInsight/HBase clusters deployed in the Microsoft Azure public cloud. The system uses the MapReduce procedure when performing one-to-many protein structure comparisons against large collections of proteins. However, besides huge scaling
capabilities, provisioning the HDInsight/HBase clusters from public cloud providers may generate significant costs for potential users. Thus, some of them may prefer to make use of their own, existing compute environments or to establish a private cloud [10, 22] for economic or security reasons. For example, some companies or laboratories may treat their data as intellectual property and may want to use on-premises storage to keep the data within their own data centers.
7.3 A Brief Overview of H4P

Undoubtedly, for a variety of biological data and a variety of scenarios of how these data can be processed and analyzed, the MapReduce processing model brings the potential to make a step forward toward the development of solutions that will allow us to gain insights into various biological processes much faster. In 2013, we started our work on a MapReduce-based computational solution, called H4P, for efficient mining of similarities in 3D protein structures and for structural superposition. Development was carried out within the Cloud4Proteins group by Marek Suwała, Bożena Małysiak-Mrozek, and me at the Institute of Informatics at the Silesian University of Technology in Gliwice, Poland. The H4P system benefits from the Map-only processing pattern of the MapReduce model, presented and formally defined in Sect. 7.4. The H4P system [18] is implemented for the Hadoop framework and intended for deployment to private clouds. It was tested on a virtualized computer cluster with the Hadoop framework. The H4P combines the advantages of the system developed by Hung and Lin for small virtualized compute environments with the large-scale scanning capabilities of the HDInsight4PSi, thus making it possible to perform structural alignments in a number of usage scenarios, including comparison of pairs of 3D protein structures during evaluation of predicted protein models, one-to-many comparisons while identifying possible functions of a given structure, or all-to-all alignments while investigating the divergence between known protein structures and classifying proteins by their fold. The Map-only pattern of the MapReduce processing model was tested in the H4P system with the use of two popular methods for 3D protein structure alignment, namely jCE and jFATCAT [20]. These methods were chosen mainly due to the good quality of the alignments they provide and their capability of handling circular permutations (jCE-CP). In the H4P, both methods were implemented as MapReduce procedures. The jCE and the jFATCAT are enhanced versions of the Combinatorial Extension (CE) [21] and the Flexible structure AlignmenT by Chaining Aligned fragment pairs allowing Twists (FATCAT) [26], respectively. The CE and FATCAT methods are relatively slow, which justifies the use of highly scalable platforms for their parallelization, but on the other hand, they generate high-quality alignments with respect to the RMSD measure. Both methods are publicly available through the Protein Data Bank (PDB) Web site for those who want to search for structural neighbors [4], which confirms their good reputation in the structural bioinformatics community.
7.4 Map-Only Pattern of the MapReduce Processing Model

H4P uses the Hadoop framework and was originally developed for the MapReduce 1.0 processing model in order to parallelize computations related to massive 3D protein structure alignments. For the purpose of defining the Map-only pattern of the MapReduce processing model, let us consider a commonly used one-to-many comparison scenario, in which a dedicated implementation of the MapReduce application compares and aligns 3D structures of the given query protein(s) to 3D structures of candidate proteins stored in the repository. (Other comparison scenarios, i.e., one-to-one and many-to-many, will be considered as restrictions or extensions of the one-to-many scenario.) The collection (repository) of candidate structures can be described as follows:

$$C_P = \{c_{P,v} \mid v = 1, 2, \ldots, V\}, \qquad (7.1)$$
where $V$ is the number of candidate protein structures $c_P$ in the collection (repository). These candidate protein structures can be processed individually, while comparing pairs of 3D protein structures in the one-to-one comparison scenario, or in one-to-many scenarios when the size of the repository is relatively small. However, processing individual candidate protein structures in large repositories negatively affects the performance due to the small sizes of most macromolecular data files, ranging from kilobytes to several megabytes (compared to the 64 MB/128 MB block size of the HDFS). For this reason, for one-to-many and many-to-many scenarios with large collections of candidate proteins, we decided to group candidate protein structures in so-called sequential files that fit the size of the HDFS data block and contain many records, each representing a candidate protein structure $c_P$. When processing these data in the MapReduce application, the Hadoop logically represents sequential files as input splits. These input splits can be defined as follows:

$$s_H = \{c_{P,m}, c_{P,m+1}, \ldots, c_{P,n}\}, \quad \text{where } 1 \le m \le n \le V \text{ and } m, n, V \in \mathbb{N}^+, \qquad (7.2)$$

where $m$ and $n$ determine the boundaries of the input splits. Each input split is a subset of the whole collection:

$$s_H \subseteq C_P \quad \text{and} \quad \forall k, l > 0,\ k \ne l:\ s_{H,k} \cap s_{H,l} = \emptyset, \qquad (7.3)$$

which means that each candidate protein structure belongs to only one sequential file. We assume that the size of each input split fits the 64 MB block size of the HDFS:

$$SizeOf(s_H) = 64\,\mathrm{MB}. \qquad (7.4)$$
The number of protein structures in a Hadoop split may vary, and it depends on the sizes of the particular protein structures included in the sequential files:

$$|s_H| = n - m, \qquad (7.5)$$
where $m$ and $n$ determine the boundaries of the input split. The collection of input splits is defined as:

$$S_H = \{s_{H,u} \mid u = 1, 2, \ldots, U\}, \qquad (7.6)$$

where $U$ is the number of input splits, equal to the number of sequential files with candidate protein structures. Protein structures located in all input splits cover the whole collection of candidate proteins:

$$C_P = \bigcup_{k=1}^{U} s_{H,k}, \qquad (7.7)$$

where $U$ is the number of Hadoop input splits, and the cardinality of the collection $C_P$ is:

$$|C_P| = \sum_{k=1}^{U} |s_{H,k}|. \qquad (7.8)$$
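To make these definitions concrete, consider the test data set used later in this chapter (Sect. 7.6.2): $V = 1{,}000$ candidate structures grouped into $U = 66$ sequential files. Since the input splits partition the collection, the cardinalities of the splits must add up to the size of the repository:

$$|C_P| = \sum_{k=1}^{66} |s_{H,k}| = 1{,}000.$$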
The Hadoop framework processes data by creating and executing MapReduce jobs. A MapReduce job divides the whole work into smaller tasks. Let us recall that there are two types of tasks that can appear in the processing workflow: Map tasks, which are followed by Reduce tasks. Map tasks are used for processing data, and Reduce tasks are usually used to consolidate the results produced in the Map phase. The Map and the Reduce phases are not necessarily sequential, but the Reduce phase depends on the output of the Map phase. For general purposes, from the viewpoint of implemented and executed tasks, a MapReduce job can therefore be defined as:

$$J_{MR} = T_M \cup T_R, \qquad (7.9)$$

where $T_M$ is the set of Map tasks and $T_R$ is the set of Reduce tasks. In the H4P, massive 3D protein structure alignments are performed in the Map phase, as presented in Fig. 7.2. Results of the alignment processes (identifiers of aligned structures and a set of similarity measures) are then stored in a database for future reuse. Storing the results does not involve any consolidation of results in the Reduce phase, as in the standard MapReduce procedure. Running the Reduce phase just for storing the results of the alignment processes would introduce unnecessary overhead, since the output of the Map phase is written to local disks and then transferred across the network to the reducer nodes. Our early tests performed with the use of the full MapReduce implementation showed that it is 15–20% less efficient than the Map-only implementation (see Sect. 7.6.4). Therefore, we assume that in the Map-only pattern $T_R = \emptyset$, and then:

$$J_{MR} = T_M, \qquad (7.10)$$
Fig. 7.2 Simplified idea of 3D protein structure alignment implemented as the MapReduce job with: full MapReduce (a) and Map-only (b) processing patterns. The alignment implements the one-to-many comparison scenario with 3D structures of candidate proteins delivered as sequential files. c Single BSON document generated as the output of the Map phase
and, in consequence, we can define the MapReduce job as a set of Map tasks:

$$J_{MR} = \{t_{M,s} \mid s = 1, 2, \ldots, S\}, \qquad (7.11)$$
where $S$ is the number of all Map tasks. This defines the Map-only processing pattern of the MapReduce model. We can state that during MapReduce job execution with the use of the Map-only processing pattern:

$$\exists f_1:\ S_H \xrightarrow{f_1} T_M, \qquad (7.12)$$
where $f_1$ is the function that assigns input splits (sequential files with candidate protein structures) to Map tasks in such a way that:

$$\forall s_{H,u}, s_{H,w} \in S_H,\ u, w = 1, 2, \ldots, U:\quad f_1(s_{H,u}) = f_1(s_{H,w}) \implies s_{H,u} = s_{H,w}, \qquad (7.13)$$
which means that each input split is processed in a different Map task. The Hadoop cluster can be defined as a set of computing nodes:

$$C_H = \{c_{H,d} \mid d = 1, 2, \ldots, D\}, \qquad (7.14)$$
where $D$ is the number of cluster nodes $c_H$. We can state that during Hadoop computations with the use of the Map-only processing pattern:

$$\exists f_2:\ J_{MR} \xrightarrow{f_2} C_H, \qquad (7.15)$$

where $f_2$ is the function that assigns Map tasks to the Hadoop cluster nodes in such a way that:

$$\forall c_H \in C_H,\ \exists t_M \in J_{MR}:\ f_2(t_M) = c_H. \qquad (7.16)$$
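In Hadoop terms, the condition $T_R = \emptyset$ from Eq. (7.10) is expressed simply by configuring the job with zero Reduce tasks. The driver below is a minimal sketch of such a Map-only job, not the H4P's actual code; the h4p.* property names and the AlignmentMapper stub are hypothetical placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class MapOnlyAlignmentJob {

      // hypothetical stub; the real mapper would implement Listing 7.1,
      // performing the alignment and writing a BSON document to MongoDB
      public static class AlignmentMapper
          extends Mapper<Text, BytesWritable, NullWritable, NullWritable> {
        @Override
        protected void map(Text key, BytesWritable value, Context context) {
          // placeholder for the alignment logic
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // job-level parameters read later in the Mapper's setup() method
        // (hypothetical property names)
        conf.set("h4p.align.algorithm", "jFATCAT");
        conf.set("h4p.query.pdb", "pdb2hhb.ent");
        conf.set("h4p.query.chain", "A");

        Job job = Job.getInstance(conf, "massive 3D protein structure alignment");
        job.setJarByClass(MapOnlyAlignmentJob.class);
        job.setMapperClass(AlignmentMapper.class);

        // the Map-only pattern: no Reduce tasks, so T_R is the empty set
        job.setNumReduceTasks(0);

        // input: sequential files with candidate structures
        // (key = PDB file name, value = compressed file content)
        job.setInputFormatClass(SequenceFileInputFormat.class);
        SequenceFileInputFormat.addInputPath(job, new Path(args[0]));

        // results are written to the database inside the map function,
        // so no HDFS output is produced
        job.setOutputFormatClass(NullOutputFormat.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(NullWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Setting the number of Reduce tasks to zero not only skips the Reduce phase but also eliminates the shuffle, which is the source of the 15–20% overhead observed in Sect. 7.6.4.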
7.5 Implementation of the Map-Only Processing Pattern in the H4P

The H4P uses the Map-only processing pattern when parallelizing computations related to massive 3D protein structure alignment. When the H4P Map-based application is executed on the Hadoop cluster (Fig. 7.3), it creates the MapReduce job that is parallelized on the nodes of the cluster. The H4P application passes all configuration parameters to the MapReduce job and submits the job for execution. The job is executed under the control of the JobTracker service (the ApplicationMaster in newer versions [18]). The JobTracker initializes the MapReduce job (step 3) and arranges data into input splits (step 4), which are logical representations of macromolecular data stored in file blocks (in the Hadoop Distributed File System).
Fig. 7.3 Execution of the H4P MapReduce application in Hadoop. Reproduced from [18, 24] with permissions
The number of input splits depends on the format and size of the input data. The H4P accepts input splits with protein structures stored as individual files or as sequential files. Sequential files contain many records, and each single record contains macromolecular data for one candidate 3D protein structure (specifically, the file name and the compressed binary content). For each such input split, the JobTracker creates a Map task (step 5), which processes each record of the input split (each candidate protein structure) by executing the dedicated map function. The pseudocode of the Map task with the map function is presented in Listing 7.1. The many Map tasks created for the input splits are executed in parallel on the available Data nodes of the Hadoop cluster (steps 6–8). Execution of Map tasks on a single compute node of the Hadoop cluster is controlled by the TaskTracker service. The appropriate data processing algorithm is invoked in the map function, executed within the Map task. The map function is responsible for finding common fragments in two protein structures (the given query structure and a candidate protein structure from the input split) by performing structural alignment and 3D superposition of the compared protein structures.
Once the Map task is completed (step 9), another Map task can be assigned to the worker Data node for execution, until all input splits are processed [24]. The algorithm of processing data in a single Map task in the one-to-many comparison scenario with the use of sequential files (against large repositories of candidate protein structures) is presented in Fig. 7.4 and Listing 7.1.

     1  // prepare query protein structure for comparison
     2  // and get alignment algorithm name from the job context
     3  setup(context)
     4  {
     5    algorithmName = context.getAlignAlgorithmName();
     6    qChainID = context.getQChainID();
     7    queryProtChain = getStructureHDFS(
     8        context.queryProtein(qChainID));
     9  }
    10
    11  // key: name of the PDB file with a single candidate protein
    12  // value: content of the PDB file with a single candidate protein
    13  map(key, value, context)
    14  {
    15    candidatePdbId = key;
    16    List isomerList = value.getIsomers();
    17
    18    for each isomer in isomerList do
    19    {
    20      // get list of chains of the candidate protein
    21      List chainList = isomer.getChains();
    22      for each chain in chainList do
    23      {
    24        afpChain = align(queryProtChain, chain, algorithmName);
    25        identity = afpChain.getIdentity();
    26        similarity = afpChain.getSimilarity();
    27        rmsd = afpChain.getRMSD();
    28        score = afpChain.getScore();
    29        probability = afpChain.getProbability();
    30        alignmentHTML = afpChain.getWebDisplayResult();
    31        docBSON = createBSONdocument(jobId, timeStamp,
    32            candidatePdbId, candidatePdbChainId,
    33            identity, similarity, rmsd, score,
    34            probability, alignmentHTML);
    35        writeToMongoDB(queryProtChain, candidatePdbId, docBSON);
    36      }
    37    }
    38  }
Listing 7.1 Pseudocode of the Map task and the map function (on the basis of [24] with permissions).
The first step of the Map task involves retrieving the name of the alignment algorithm and getting the 3D structure of the user's query protein (Q). This is done in the setup function on the basis of the job configuration available in the job context (lines 5–6, Listing 7.1). Macromolecular data for the specified query protein structure are retrieved from an appropriate PDB/mmCIF file representing the given protein structure (lines 7–8).
Fig. 7.4 Algorithm of macromolecular data processing, alignment, and similarity searching in the Map task for one-to-many comparison scenario with the use of sequential files (against large repositories of candidate protein structures). Reproduced from [18, 24] with permissions
The file should be located in the HDFS. The name of the file and its path in the HDFS file system are given as parameters in the job context. The name of the algorithm used for structural alignment is also passed through the job context (line 5). The use of sequential files causes the map function to be invoked multiple times in a single Map task, once for every record of the sequential file (every candidate protein). For this reason, the macromolecular data of the query structure and the alignment method are stored in the internal memory of the task (private attributes of the Mapper class), which allows for reusing them in each invocation of the map function (line 13). Such a solution ensures that the macromolecular data of the query protein structure will be
loaded and processed only once within the Map task, which positively affects the performance of the whole process. The map function is executed individually for each record of the sequential file (each candidate protein structure). As input parameters, the map function accepts the name of the macromolecular data file for a candidate protein structure (as the key parameter) and the macromolecular data passed to it through the value (lines 11–15). This organization is also adopted in the sequential files, which makes them compatible with the input of the map function. The map function retrieves the list of isomers for the candidate protein structure (some proteins contain many isomers, line 16). For each isomer, it then retrieves all chains (line 21) and performs alignment of each chain to the chain of the query protein (lines 22–24). Comparison and alignment of two protein structures, which generates a set of similarity measures (lines 25–29), is performed by means of the jCE or the jFATCAT method. Importantly, the alignment comprises a pair of protein chains, which includes:
• a specified chain of the given query protein Q (queryProtChain, line 24),
• all successive chains of the candidate protein C (chain, line 24), as proteins may have many chains, and each of the chains has its own tertiary (3D) structure and is therefore processed separately.
Each alignment generates a chain of AFPs (aligned fragment pairs), which is the result of the jCE or jFATCAT alignment method used. Depending on the number of chains in the candidate protein structure C, in each single invocation of the map function the alignment can be executed more than once. As a result, a single invocation of the map function can provide many similarity results (many sets of similarity measures), but each subsequent outcome is related to a different pair of compared protein chains. Finally, the results of each alignment process, including algorithm-specific similarity measures (like identity, similarity, RMSD, score, and probability), a timestamp, and the identifier of the candidate protein structure and its chain, are gathered and saved as BSON documents that are stored in the MongoDB database [1] (lines 31–35). A single entry in the MongoDB database contains the entry id, identifiers of the compared proteins and their chains, and the binary BSON document with the outcome of a single alignment (see Fig. 7.2c). For one-to-one comparison scenarios (or, more precisely, when comparing several pairs of proteins), the map function is invoked only once in the Map task, for a pair of compared proteins. For many-to-many comparison scenarios, a separate set of Map tasks is created for each given query protein structure, and within each such set, processing occurs in the same way as presented in Fig. 7.4 and Listing 7.1.
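As an illustration of this last step, the fragment below is a minimal sketch, using the MongoDB Java driver, of how a single alignment outcome could be stored as a BSON document of the shape shown in Fig. 7.2c. The connection string and the database and collection names are illustrative assumptions, and the field values are the sample values from the figure; this is not the H4P's actual persistence code:

    import org.bson.Document;
    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;

    public class AlignmentResultWriter {
      public static void main(String[] args) {
        // connection string, database, and collection names are assumptions
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
          MongoCollection<Document> alignments =
              client.getDatabase("h4p").getCollection("alignments");

          // one result of a single chain-to-chain alignment
          // (sample values taken from Fig. 7.2c)
          Document docBSON = new Document("jobId", "12A2-2FF67SDS...")
              .append("timeStamp", "20180203")
              .append("qChain", "1n6h.A")
              .append("cPdbId", "1n6a")
              .append("cChain", "A")
              .append("algorithm", "jFATCAT")
              .append("similarity", 18.18)   // in percent
              .append("RMSD", 4.81)
              .append("visual", "<html>...</html>"); // alignment visualization

          alignments.insertOne(docBSON);
        }
      }
    }

Storing the HTML visualization alongside the numeric measures allows the alignment to be redisplayed later without recomputing it, which is the "future reuse" mentioned above.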
7.6 Performance of the H4P

The parallel, Map-based procedures for massive 3D protein structure alignments were extensively tested in order to verify their performance. The main goal of the experiments was to address the following questions:
• What is the efficiency of massive structural alignments performed on Hadoop with the use of the parallel, Map-based versions of the jCE and the jFATCAT methods?
• How does the use of sequential files affect the performance of similarity searches executed in a private cloud environment?
• How scalable is the H4P with the implemented parallel, Map-based methods?
• What is the efficiency of H4P's procedures compared to the other mentioned GPU-based and cloud-based systems, and to the MapReduce-based procedures for structural alignment?
A part of the tests presented in this section was performed by Marek Suwała within the work that led to his Master's thesis [24], prepared under my guidance. Results of these tests are presented here with his permission. Later, some of these tests were repeated by me on a much more powerful Hadoop cluster to confirm the good scalability of the solution.
7.6.1 Runtime Environment

The runtime environment used for the performed experiments was established on a virtualized computer cluster equipped with two 4-core (eight threads in hyper-threading) physical Intel Xeon 2.40 GHz CPUs and 96 GB RAM. In this environment, we created eight virtual machines, each assigned two logical (virtual) CPUs (from the pool of 16 available) and 30 GB of local storage. One of the machines was assigned 4 GB of RAM, and the remaining machines were assigned 2 GB of RAM each. Each of the virtual machines worked under the control of the Ubuntu 12.04 LTS operating system, and each had the Hadoop framework and all software required for parallel structural alignments installed on it. All virtual machines were connected by a common network, which made it possible to configure and run a Hadoop cluster containing up to eight nodes. In this computational cluster:
• one virtual machine served both as the Master node (NameNode/JobTracker node) and as a Worker node (DataNode/TaskTracker node),
• the remaining (up to seven) virtual machines served exclusively as Worker nodes (DataNodes/TaskTracker nodes) that performed protein comparisons.
The environment prepared in this way was used to run the H4P system with the parallel procedures for structural alignments and to carry out the assumed performance experiments.
7.6.2 Data Set

Preparation of the test data for the performed experiments involved sampling and downloading a certain number of files in the PDB format from the PDB repository. The set of test data was a subset of the PDB containing 1,000 randomly selected protein structures (as in the tests performed by Hung and Lin). PDB files representing the selected protein structures were downloaded from the PDB FTP server and stored hierarchically in 66 folders (this corresponded to the way these files were stored in the PDB repository). Then, by using an auxiliary program (also a MapReduce-based program, implemented as an additional component of the H4P), the downloaded files were converted and stored as sequential files. In this way, for the 66 folders containing a total of 1,000 PDB files, 66 sequential files (of various sizes) were created. These sequential files were then placed in the HDFS distributed file system of the running Hadoop cluster, enabling repeated processing of the files in the H4P system. For the purposes of our research, in the HDFS we also placed a set of test data in the form of non-transformed, individual PDB files. All the data in the HDFS were subjected to automatic replication, with the replication factor dependent on the number of cluster nodes used in the particular test case [24].
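The fragment below is a simplified sketch of such a converter (it is not the H4P's actual auxiliary program). It packs a list of PDB files into a single Hadoop sequence file, with the file name as the key and the gzip-compressed file content as the value, which matches the record layout assumed by the map function in Listing 7.1; the output path is an illustrative assumption:

    import java.io.ByteArrayOutputStream;
    import java.io.InputStream;
    import java.util.zip.GZIPOutputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PdbToSequenceFile {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path(args[0]); // e.g., hdfs:///pdb/seq/folder00.seq

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
            SequenceFile.Writer.file(out),
            SequenceFile.Writer.keyClass(Text.class),
            SequenceFile.Writer.valueClass(BytesWritable.class))) {
          // each remaining argument is one PDB file to pack into the sequence file
          for (int i = 1; i < args.length; i++) {
            Path pdb = new Path(args[i]);
            byte[] compressed = gzip(fs.open(pdb));
            writer.append(new Text(pdb.getName()), new BytesWritable(compressed));
          }
        }
      }

      // gzip-compress the whole input stream into a byte array
      private static byte[] gzip(InputStream in) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (InputStream is = in; GZIPOutputStream gz = new GZIPOutputStream(bos)) {
          byte[] buf = new byte[8192];
          int n;
          while ((n = is.read(buf)) != -1) {
            gz.write(buf, 0, n);
          }
        }
        return bos.toByteArray();
      }
    }

Packing many small PDB files into one sequence file sized to the HDFS block, as formalized in Eq. (7.4), is what lets each Map task process a whole block-sized batch of structures instead of one tiny file.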
7.6.3 A Course of Experiments

In our investigations of the scalability of the H4P system, we carried out many numerical experiments in which we used (deoxy)hemoglobin (alpha chain) [2] as the query protein structure (Q). The protein is identified by the 2HHB PDB code in the Protein Data Bank (PDB). This is a medium-sized protein that consists of two types of amino acid chains, and the structural alignment in each of the experiments concerned chain A. Structural alignments were performed with the use of the parallel, Map-based versions of the jCE and the jFATCAT (flexible) methods. For each experiment scheme, the process of protein similarity searching comprised the entire data set of 1,000 proteins, which gave a total of 3,321 amino acid chains being aligned with the given query protein chain. The number of computing nodes of the Hadoop cluster, on which the experiments were conducted, ranged from one to eight, namely: d ∈ {1, 2, 4, 6, 8}. Depending on the experiment scheme, each computing node of the cluster was adapted to carry out one or, simultaneously, two mapping operations (Map tasks). As a result, we distinguished two schemes for the conducted experiments, where:
• a single compute cluster node was responsible for the execution of one Map task; i.e., the maximum number of parallel mapping operations running in the cluster was equal to the number of compute nodes of the cluster currently in use, i.e., s′ ∈ {1, 2, 4, 6, 8};
• a single compute cluster node was responsible for the simultaneous execution of two Map tasks; i.e., the maximum number of parallel mapping operations running in the cluster was twice the number of compute nodes of the cluster currently in use, i.e., s′ ∈ {2, 4, 8, 12, 16}.
Table 7.1 Values of the data replication parameter in HDFS for various numbers of cluster nodes. Reproduced from [18, 24] with permissions

Number of Hadoop computing nodes    Data replication factor
1                                   1
2                                   2
4                                   3
6                                   3
8                                   3
In order to test various comparison scenarios, the experiment schemes described above were carried out for the test data provided both in the form of Hadoop sequential files and in their direct form, as individual files in the PDB format. At this point, it should also be noted that the total number of Map tasks executed in the system was always equal to the number of input data files. Therefore, for tests conducted using sequential files, the number of Map tasks was equal to the number of these files, i.e., 66. In experiments carried out with the use of individual PDB files, the number of Map tasks executed in the system reached 1,000. It is also important that, in the course of the experiments, we changed the degree of data replication. Macromolecular data were dispersed throughout the HDFS file system. Table 7.1 shows the values that the data replication parameter received in the system, depending on the number of active nodes in the cluster [24]. Performance tests were conducted for both experiment schemes (one Map task per node versus two Map tasks per node) and for two comparison scenarios (one-to-one and one-to-many scenarios). The many-to-many comparison scenario is an extension of the one-to-many scenario with the use of sequential files. In all cases, performance was assessed on the basis of execution time measurements. At least two repetitions of each measurement were carried out. The obtained results were then averaged, and they are presented in this form in the following sections. The averaged values of the measurements were also used to determine the n-fold speedups when scaling computations on the Hadoop cluster.
7.6.4 Map-Only Versus MapReduce-Based Execution

In the first series of tests, we empirically verified the assumption that skipping the Reduce phase improves the performance of the whole massive alignment process. For this purpose, we performed experiments according to both experiment schemes (one Map task per cluster node and two Map tasks per cluster node) with sequential files (one-to-many comparison scenario) and individual PDB files (batch one-to-one comparison scenario) on the 8-node Hadoop cluster.
Fig. 7.5 Execution time for various MapReduce execution patterns (Map-only and full MapReduce) and various experiments schemes for the jFATCAT (a, c) and the jCE (b, d) structural alignment algorithms. Tests performed for the 8-node Hadoop cluster for one-to-many comparison scenario with sequential files (a, b) and batch one-to-one comparison scenario with individual PDB files (c, d)
The results of these experiments, presented in Fig. 7.5, show the execution time for each comparison scenario and MapReduce implementation. As can be observed from Fig. 7.5, the Map-only execution pattern of the MapReduce processing model is faster in all tested comparison scenarios (one-to-many and batch one-to-one) and experiment schemes (one Map task/node and two Map tasks/node). For example, for the one-to-many comparison scenario with sequential files, the jFATCAT algorithm, and two Map tasks running on each node (Fig. 7.5a, right chart bars), the execution with the use of the Map-only pattern took 636 s, while the full MapReduce implementation took 739 s (16% longer). Similarly, for the batch one-to-one comparison scenario with individual PDB files, the jCE algorithm, and one Map task running on each node (Fig. 7.5d, left chart bars), the execution with the use of the Map-only pattern took 1,063 s, while the full MapReduce implementation took 1,251 s (18% longer). In all tested cases, the full MapReduce implementation was 15–20% less efficient than the Map-only execution pattern.
Fig. 7.6 Dependency between the execution time and the number of computing nodes of the Hadoop cluster (for one-to-many comparison scenario with sequential files) for parallel, Map-based versions of the jFATCAT (a) and the jCE (b) structural alignment algorithms. Reproduced from [18, 24] with permissions
7.6.5 Scalability in One-to-Many Comparison Scenario with Sequential Files

Hadoop sequential files are the default format of the data processed in the H4P system. The results presented in this subsection relate to experiments carried out with molecular data stored in the form of sequential files. Figure 7.6 shows the results of experiments performed according to both experiment schemes (one Map task per cluster node and two Map tasks per cluster node) and illustrates the change in computation time depending on the number of Hadoop compute nodes.
Table 7.2 Acceleration of calculations for the growing number of Hadoop computing nodes (for the one-to-many comparison scenario with sequential files) for parallel, Map-based versions of the jFATCAT and the jCE structural alignment algorithms. Reproduced from [18, 24] with permissions

Number of Hadoop    SpeedUp
computing nodes     jFATCAT                               jCE
                    1 Map task/node   2 Map tasks/node    1 Map task/node   2 Map tasks/node
1                   1.00              1.98                1.00              1.81
2                   2.16              3.64                2.01              3.77
4                   4.04              5.95                4.25              6.99
6                   5.05              6.52                5.99              8.20
8                   7.14              8.12                7.26              8.97
As can be observed, with the growing number of Hadoop nodes, the time needed to complete the whole MapReduce job drops significantly. For structural alignments with the jFATCAT algorithm, the execution time was reduced from 5,167 s (on one node) to 724 s (on eight nodes of the Hadoop cluster) for the one Map task/node experiment scheme, and from 2,612 s (on one node) to 636 s (on eight nodes of the Hadoop cluster) for the two Map tasks/node experiment scheme. For structural alignments with the jCE algorithm, the execution time was reduced from 2,844 s (on one node) to 392 s (on eight nodes of the Hadoop cluster) for the one Map task/node experiment scheme, and from 1,574 s (on one node) to 317 s (on eight nodes of the Hadoop cluster) for the two Map tasks/node experiment scheme. In order to better illustrate the results, we show the acceleration of the computations in Table 7.2. The acceleration was calculated according to the following formula and obtained by comparing particular execution times to the case in which the calculations were performed by only one Hadoop node executing one Map task (this mimics the stationary/serial implementation of protein structure alignment):

$$S_{m,d} = \frac{T_{1,1}}{T_{m,d}}, \qquad (7.17)$$
where $m = 1$ or $2$ is the number of Map tasks performed in parallel on a single node of the Hadoop cluster, $d$ is the number of Data nodes used, $T_{1,1}$ is the execution time obtained while performing computations with the use of a 1-node cluster and one Map task, and $T_{m,d}$ is the execution time obtained while performing computations on the $d$-node cluster configured to execute in parallel $m$ Map tasks per node. From the obtained results, it can be concluded that both the increase in the number of nodes and the increase in the number of Map tasks resulted in a significant acceleration of the calculations and thus in an increase of the overall system performance.
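As a quick sanity check of Eq. (7.17) against the measurements reported above, take the jFATCAT runs with one Map task per node:

$$S_{1,8} = \frac{T_{1,1}}{T_{1,8}} = \frac{5{,}167\ \mathrm{s}}{724\ \mathrm{s}} \approx 7.14,$$

which is exactly the value listed for eight nodes in Table 7.2.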
Table 7.3 Average time for matching pairs of amino acid chains (for the one-to-many comparison scenario with sequential files) for parallel, Map-based versions of the jFATCAT and the jCE structural alignment algorithms. Reproduced from [18, 24] with permissions

Number of Hadoop    Average time for matching pairs of amino acid chains (s)
computing nodes     jFATCAT                               jCE
                    1 Map task/node   2 Map tasks/node    1 Map task/node   2 Map tasks/node
1                   1.56              0.79                0.86              0.47
2                   0.72              0.43                0.43              0.23
4                   0.39              0.26                0.20              0.12
6                   0.31              0.24                0.14              0.10
8                   0.22              0.19                0.12              0.095
Table 7.3 shows how the average time of matching a pair of amino acid chains changed with the number of Hadoop nodes and the number of Map tasks executed in parallel on each node.
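These averages follow directly from the measured execution times and the 3,321 aligned chains; e.g., for jFATCAT with one Map task on a single node, 5,167 s / 3,321 chains ≈ 1.56 s per chain, consistent with the first entry in Table 7.3.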
7.6.6 Scalability in Batch One-to-One Comparison Scenario with Individual PDB Files

Stationary implementations of the jCE and the jFATCAT algorithms accept individual PDB/mmCIF files as input data. Our Map-based implementations of the algorithms also make it possible to perform structural alignments for pairs of unprocessed, individual files from the PDB repository and individual protein structures from local collections stored in private compute environments, e.g., private clouds. In order to verify the performance of such a comparison scenario, we repeated the experiments described in the previous subsection, but this time for a different format of input data. In this case, the source of input data for the system were unprocessed, individual protein structures stored in the PDB/mmCIF format. Results of the experiments performed with the use of input data stored in the standard PDB format are presented in a manner analogous to that of the previous subsection. Batch (exactly, 1,000) one-to-one comparisons were performed with the use of individual protein structures. In Fig. 7.7, we show the dependency between the execution time and the number of compute nodes in the Hadoop cluster. Similarly to the experiments with sequential files, we can observe a vast acceleration of similarity searches while increasing the number of Hadoop nodes. For structural alignments with the jFATCAT algorithm, the execution time was reduced from 9,771 s (on one node) to 1,395 s (on eight nodes of the Hadoop cluster) for the one Map task/node experiment scheme, and from 6,062 s (on one node) to 1,264 s (on eight nodes of the Hadoop cluster) for the two Map tasks/node experiment scheme.
Fig. 7.7 Dependency between the execution time and the number of computing nodes of the Hadoop cluster (for the batch one-to-one comparison scenario with individual PDB files) for parallel, Map-based versions of the jFATCAT (a) and the jCE (b) structural alignment algorithms. Reproduced from [18, 24] with permissions
For structural alignments with the jCE algorithm, the execution time was reduced from 7,106 s (on one node) to 1,063 s (on eight nodes of the Hadoop cluster) for the one Map task/node experiment scheme, and from 4,739 s (on one node) to 965 s (on eight nodes of the Hadoop cluster) for the two Map tasks/node experiment scheme. The speedup calculated on the basis of the execution time measurements for the particular Hadoop cluster configurations and experiment schemes is shown in Table 7.4. From Table 7.4, for experiments with individual PDB files, we can draw conclusions similar to those related to the use of sequential files. The course of the experiments showed that also in this case, with an increasing number of computing nodes of
Table 7.4 Acceleration of calculations for the growing number of Hadoop computing nodes (for the batch one-to-one comparison scenario with individual PDB files) for parallel, Map-based versions of the jFATCAT and the jCE structural alignment algorithms. Reproduced from [18, 24] with permissions

Number of Hadoop    SpeedUp
computing nodes     jFATCAT                               jCE
                    1 Map task/node   2 Map tasks/node    1 Map task/node   2 Map tasks/node
1                   1.00              1.61                1.00              1.50
2                   2.08              3.04                1.99              3.00
4                   3.91              5.14                4.09              5.70
6                   5.02              6.65                5.63              6.59
8                   7.00              7.73                6.68              7.36
Table 7.5 Average time for matching pairs of amino acid chains (for the batch one-to-one comparison scenario with individual PDB files) for parallel, Map-based versions of the jFATCAT and the jCE structural alignment algorithms. Reproduced from [18, 24] with permissions

Number of Hadoop    Average time for matching pairs of amino acid chains (s)
computing nodes     jFATCAT                               jCE
                    1 Map task/node   2 Map tasks/node    1 Map task/node   2 Map tasks/node
1                   2.94              1.83                2.14              1.43
2                   1.42              0.97                1.08              0.71
4                   0.75              0.57                0.52              0.38
6                   0.59              0.44                0.38              0.32
8                   0.42              0.38                0.32              0.29
the Hadoop cluster, the time required to perform similarity searches is significantly reduced relative to the case of using a single node with one Map task ($T_{1,1}$). Table 7.5 shows how the average time of matching a pair of amino acid chains changed with the number of Hadoop nodes and the number of Map tasks executed in parallel on each node.
7.6.7 One-to-Many Versus Batch One-to-One Comparison Scenarios

Let us now compare results obtained while performing one-to-many comparisons of protein structures with sequential files and batch one-to-one comparisons of protein
structures with individual PDB/mmCIF files, and verify the acceleration rates for both experiment schemes (one Map task/node and two Map tasks/node). Figure 7.8 compares the runtimes of both comparison scenarios, depending on the number of computing nodes of the Hadoop cluster, for both experiment schemes. Figure 7.8a, c presents the results obtained for the experiment scheme in which each node executed only one Map task, whereas Fig. 7.8b, d presents the results of the same experiment, except that each node could execute a maximum of two Map tasks in parallel. The total number of comparisons of protein structures performed in both scenarios was equal to 1,000 (with respect to the number of protein structures) or 3,321 (with respect to the number of amino acid chains).

Fig. 7.8 Comparison of execution times for the varying number of computing nodes of the Hadoop cluster for one-to-many comparisons of protein structures with sequential files and batch one-to-one comparisons of protein structures with individual PDB/mmCIF files, performed with the experiment scheme where each node was responsible for execution of only one Map task per node (a, c) and two Map tasks per node (b, d), for the jFATCAT (a, b) and the jCE (c, d) alignment algorithms

Analyzing the results gathered in Fig. 7.8, we can see that by delivering macromolecular data in the form of sequential files we were able to achieve a nearly two- to threefold increase in system performance in both described experiment schemes and
in all tested cases. This is also visible in Table 7.6, which shows speedups for both experiment schemes and various configurations of the Hadoop cluster, calculated according to the following equation:

S_{f,d} = \frac{T_{i,d}}{T_{s,d}},    (7.18)

where d is the number of nodes of the Hadoop cluster, T_{i,d} is the execution time obtained while performing computations with individual PDB/mmCIF files (in the batch one-to-one comparison scenario) on the d-node cluster, and T_{s,d} is the execution time obtained while performing computations with sequential files (in the one-to-many comparison scenario) on the same d-node cluster.

Table 7.6 Acceleration of calculations performed with sequential files in relation to individual PDB files for a varying number of Hadoop cluster nodes, for both experiment schemes (one Map task per cluster node and two Map tasks per cluster node), and for parallel, Map-based versions of the jFATCAT and the jCE structural alignment algorithms

Number of Hadoop     SpeedUp
computing nodes      jFATCAT                                jCE
                     1 Map task/node   2 Map tasks/node     1 Map task/node   2 Map tasks/node
1                    1.89              2.32                 2.50              3.01
2                    1.97              2.26                 2.53              3.14
4                    1.95              2.19                 2.59              3.06
6                    1.90              1.85                 2.66              3.11
8                    1.93              1.99                 2.71              3.04
Average              1.93              2.21                 2.60              3.07
Standard deviation   0.03              0.18                 0.08              0.04

On the basis of the results presented in Table 7.6, we can conclude that the acceleration S_{f,d} obtained by using sequential files is quite stable for a varying number of nodes of the Hadoop cluster in each experiment scheme: average speedups range between 1.93 and 3.07 with standard deviations between 0.03 and 0.18, and the speedup is larger when computing nodes execute two Map tasks in parallel. For example, on eight nodes with one Map task per node, the jFATCAT needed 1,395 s with individual files, so the corresponding sequential-file run took about 1395/1.93 ≈ 723 s.
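Sequential files of the kind used above can be produced by packing many compressed PDB files into a Hadoop SequenceFile keyed by PDB ID. The following is a minimal sketch of this preprocessing step, not the H4P tooling; the file layout and the pdbIdOf helper are illustrative assumptions.

import java.io.File;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackPdbFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // args[0]: local directory with compressed PDB files; args[1]: HDFS output path
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path(args[1]), Text.class, BytesWritable.class);
        try {
            for (File pdb : new File(args[0]).listFiles()) {
                byte[] bytes = Files.readAllBytes(pdb.toPath());
                // Key: PDB ID derived from the file name; value: the (still compressed) file
                writer.append(new Text(pdbIdOf(pdb.getName())), new BytesWritable(bytes));
            }
        } finally {
            writer.close();
        }
    }

    // Hypothetical helper; the real naming convention may differ
    private static String pdbIdOf(String fileName) {
        return fileName.replaceAll("^pdb|\\.ent\\.gz$", "").toUpperCase();
    }
}

Packing many small structures into one large, block-sized file is what lets a single Map task stream through many candidate structures, which explains the two- to threefold gain reported in Table 7.6.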
7.6.8 Influence of the Number of Map Tasks on the Acceleration of Computations

To further investigate the behavior of the parallel, Map-based procedures for structural alignment of proteins implemented in the H4P, we conducted additional experiments.
Table 7.7 Acceleration of calculations performed for the one-to-many comparison scenario with sequential files for a varying number of logical processors and a varying number of Map tasks performed in parallel on a single node of the Hadoop cluster. Reproduced from [18, 24] with permissions

Number of logical   SpeedUp S_{m,c}
CPUs per node       1 Map task   2 Map tasks   3 Map tasks   4 Map tasks   5 Map tasks
1                   1.00         1.02          0.99          1.00          1.00
2                   2.77         5.48          5.42          5.46          5.45
3                   2.81         5.95          8.01          8.01          8.01
4                   2.79         5.95          8.25          9.99          9.42
In these experiments, we tried to identify how the number of logical processors assigned to a single node of the Hadoop cluster and the number of parallel Map tasks influence the acceleration of the structural alignments. All tests were conducted for the one-to-many comparison scenario with the use of sequential files as input data. Table 7.7 shows the variability of the speedup for the varying number of logical processors per node and the varying number of Map tasks performed in parallel on a single node of the Hadoop cluster. The speedup was calculated according to the following formula:

S_{m,c} = \frac{T_{1,1}}{T_{m,c}},    (7.19)

where m = 1 ... 5 is the number of Map tasks performed in parallel on a single Data node of the Hadoop cluster, c is the number of logical CPUs assigned to a single node, T_{1,1} is the execution time obtained while performing computations with one logical CPU assigned to the cluster node and one Map task, and T_{m,c} is the execution time obtained while performing computations with c logical CPUs assigned to a single cluster node configured to execute m Map tasks in parallel. The analysis of the results of this experiment allowed us to conclude that increasing the number of Map tasks simultaneously executed on a single computing node is meaningful only up to the point where this number reaches the number of logical processors assigned to the node. After exceeding this threshold, the acceleration stabilized at a certain level, and despite further increasing the number of simultaneously executed Map tasks, we did not record any further growth of the acceleration ratio [24].
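The per-node ceiling on concurrent Map tasks is a cluster configuration parameter. The following is a minimal sketch assuming the classic (MRv1) Hadoop deployment used by the H4P, where the property mapred.tasktracker.map.tasks.maximum sets the number of Map slots per TaskTracker; the actual H4P configuration is not reproduced here, and in practice the property is set cluster-wide in mapred-site.xml on every node.

import org.apache.hadoop.conf.Configuration;

public class MapSlotsConfig {
    public static Configuration twoMapTasksPerNode() {
        Configuration conf = new Configuration();
        // Maximum number of Map tasks a single TaskTracker runs concurrently;
        // per Table 7.7, values above the number of logical CPUs bring no further gain
        conf.setInt("mapred.tasktracker.map.tasks.maximum", 2);
        return conf;
    }
}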
7.6.9 H4P Performance Versus Other Approaches

MapReduce-based approaches to massive structural alignments, like the H4P, the HDInsight4PSi, and Hung and Lin's system, are one possible way to increase
the performance of the alignment process. Let us compare the H4P to other Hadoop-based approaches, to approaches that rely on a farm of single-core virtual machines, like the appropriately configured Cloud4PSi and the CloudPSR, and to the GPU-CASSERT, which makes use of GPU devices to accelerate computations. The Cloud4PSi and the CloudPSR have dedicated, role-based computer architectures, and their processing model does not utilize MapReduce patterns at all. Both are dedicated to the Microsoft Azure public cloud and utilize its virtual machines (VMs). These virtual machines can have various compute capabilities depending on the VM series used. However, we used one to eight single-core A-series (small) VMs and data packages with one candidate protein (by default, the Cloud4PSi uses ten proteins per data package) in order to perform many (exactly 1,000) one-to-one comparisons and to investigate the performance of this naïve parallelization approach. The HDInsight4PSi and the H4P utilize the MapReduce processing model, but the HDInsight4PSi scales computations on HBase clusters in the Microsoft Azure public cloud and allows an optional Reduce phase to take place. All four systems provide parallel procedures for structural alignments performed with the jCE and the jFATCAT algorithms. Hung and Lin's system, which also utilizes Hadoop and MapReduce, scales refined DALI [5] alignments on a private 8-node Hadoop cluster. The GPU-CASSERT parallelizes similarity searches with the CASSERT algorithm [16] on many-core streaming multiprocessors of GPU devices.

In Table 7.8 we can observe the comparison of average execution times per protein chain when comparing 1,000 protein structures in one-to-one and one-to-many comparison scenarios for various parallelization approaches. The results clearly show that the H4P (H4P-IF-2MT, one-to-one comparisons with two Map tasks per node) was faster than the farm of single-core virtual machines in the Cloud4PSi (0.99 s for the jFATCAT, 0.70 s for the jCE) and in the CloudPSR (2.16 s, only jFATCAT), even though it processed individual files (0.38 s for the jFATCAT, 0.29 s for the jCE). When utilizing sequential files, the H4P (H4P-SF-2MT, one-to-many comparisons with two Map tasks per node) was twice as fast (0.19 s for the jFATCAT and 0.0955 s for the jCE) as the H4P processing individual files (H4P-IF-2MT). The HDInsight4PSi was the fastest of all MapReduce-based systems (0.15 s for the jFATCAT and 0.17 s for the jCE), which results from a higher mapper-to-node ratio (31 parallel Map tasks on the 8-node HBase cluster versus 16 Map tasks in the H4P). The GPU-CASSERT outperformed all approaches, with an average execution time of 1.76183E-05 s per compared pair of protein chains. This results from using the many cores of the GPU device and its streaming multiprocessors. However, the method only calculates the similarity of proteins on the GPU device. Together with the structural superposition, which is executed on the CPU, the average execution time was 0.67 s, which is worse than the runtimes of the H4P in any configuration.

The Map-only procedures for structural alignments of 3D protein structures implemented in the H4P are also among the most efficient cloud-dedicated solutions for the problem. In Fig. 7.9 we show a performance comparison of four systems that implement parallel structural alignments in cloud environments (Cloud4PSi, CloudPSR, HDInsight4PSi, and H4P), expressed in terms of the Structural Alignment Speed measure [14], which in Fig. 7.9 is averaged per compute unit/task executed in parallel.
Table 7.8 Average execution time per protein chain for various parallelization approaches. Results for the H4P performing alignments with individual (IF, many one-to-one comparison scenario) and sequential (SF, one-to-many comparison scenario) files with two Map tasks (2MT) performed in parallel on a single node of the Hadoop cluster

Used approach                           Average time per chain (s)
                                        jFATCAT        jCE
Cloud4PSi (eight VMs, single-core)      0.99           0.70
CloudPSR (eight VMs, single-core)       2.16           N/A
HDInsight4PSi (eight nodes)             0.15           0.17
H4P-IF-2MT (eight nodes)                0.38           0.29
H4P-SF-2MT (eight nodes)                0.19           0.0955
GPU-CASSERT                             1.76183E-05    -
GPU-CASSERT with CPU superposition      0.67           -
The Structural Alignment Speed shows how many residue-to-residue comparisons are performed in a unit of time and allows measuring the performance of protein alignments regardless of the used collection of protein structures:

Speed_{SA} = \frac{size_r(Q) \times size_r(C_P)}{T},    (7.20)

where size_r(Q) denotes the number of residues in the query protein structure, size_r(C_P) denotes the total number of residues in the collection (repository) of protein structures, and T is the execution time.

As we can observe from Fig. 7.9, the H4P achieves the best average structural alignment speed of 20,069 residues²/s per single Map task executed in parallel for the jCE alignment algorithm, which is almost twice the performance of the HDInsight4PSi and the Cloud4PSi, and more than five times the performance of the CloudPSR. For the jFATCAT structural alignment algorithm, the performance of the H4P is slightly worse (10,003 residues²/s) than the performance of the HDInsight4PSi (11,581 residues²/s), but much better than the performance of the Cloud4PSi (6,617 residues²/s) and the CloudPSR (3,426 residues²/s). The slightly worse performance of the H4P for the jFATCAT is a result of the high memory consumption of the jFATCAT alignment algorithm and the relatively limited memory resources that were available for Map tasks working in the virtualized environment of the established private cloud.

Fig. 7.9 Structural Alignment Speed per compute task achieved by parallelizing computations in four systems for massive structural alignments of 3D protein structures: a for the jFATCAT alignment algorithm, b for the jCE alignment algorithm. Tests performed for the one-to-many comparison scenario in all systems, with sequential files (only HDInsight4PSi and H4P). The H4P was configured for parallel execution of two Map tasks per node

The Map-only procedures for structural alignments implemented in the H4P scale very well in various Hadoop cluster configurations. In Fig. 7.10 we show speedups over serial executions achieved by the approaches that utilize a farm of VMs (Cloud4PSi and CloudPSR) and the three MapReduce-based implementations of procedures for structural alignment (Hung and Lin's, HDInsight4PSi, and H4P), when scaling those systems from one to eight Data nodes. Both Hung and Lin's system and the H4P were tested in private clouds, while the HDInsight4PSi was tested in the Microsoft Azure public cloud.
Fig. 7.10 Speedups achieved by three MapReduce-based implementations of structural alignment (Hung and Lin's, HDInsight4PSi, and H4P) and two VM-based approaches (Cloud4PSi and CloudPSR): a H4P, HDInsight4PSi, Cloud4PSi and CloudPSR utilizing the jFATCAT alignment approach, b H4P, HDInsight4PSi and Cloud4PSi utilizing the jCE alignment approach. Tests performed when comparing 1,000 protein structures. The H4P was configured for parallel execution of two Map tasks per node

Hung and Lin reported linear (ideal) speedup when scaling their system from one to eight Hadoop Data nodes: 4.0 for four Data nodes, and 8.0 for eight Data nodes. When scaling the HDInsight4PSi within the same range of Data nodes of the HBase cluster, the system achieved sublinear speedup: 5.72 for the jCE and 6.41 for the jFATCAT on eight Data nodes. (Much better speedup was achieved when scaling the HDInsight4PSi above 16 Data nodes, due to the growing #Map tasks/#CPUs ratio, but neither Hung and Lin's system nor the H4P was scaled out above eight Data nodes. Moreover, in Hung and Lin's system and the H4P, the #Map tasks/#CPUs ratio was constant.) The H4P achieved a very good 8.97-fold (jCE) and 8.12-fold (jFATCAT) super-linear speedup on the 8-node Hadoop cluster over serial execution of the alignment process in the experiment scheme with two Map tasks executed per Hadoop node. In terms of efficiency gain, the approaches working on the basis of a farm of single-core virtual machines were worse than the H4P. The Cloud4PSi achieved a 7.66-fold speedup over the serial execution for the jFATCAT and a 7.54-fold speedup for the jCE. The CloudPSR was much worse (6.54-fold speedup for the jFATCAT) than the H4P and slightly better than the HDInsight4PSi (Fig. 7.10a).
7.7 Discussion

Efficient computational solutions for the identification of protein functions or finding structural homologs of proteins gain importance in the era of structural genomics and in the face of growing volumes of biological data. Structural alignments, which underlie these two processes, take a lot of time to complete, especially when performed for large collections of 3D protein structures. Fortunately, structural alignments can be carried out on well-separable and independent subsets of the whole macromolecular data repository, which perfectly fits the MapReduce processing paradigm of bringing computations to data. In this chapter, we could see how protein function identification and finding structural homologs can be efficiently accelerated with the use of the MapReduce procedure executed on a Hadoop cluster established in a virtualized compute environment or a private cloud. For this purpose, we proposed the Map-only processing pattern, which utilizes only the Map phase of the MapReduce processing model. The presented solution joins the advantages of performing computations in small virtualized compute environments with large-scale computations in public clouds, thus allowing structural alignments to be performed in a number of usage scenarios, including comparison of pairs of 3D protein structures during evaluation of predicted protein models, one-to-many comparisons while identifying possible functions of a given structure, or all-to-all alignments while investigating the divergence between known protein structures and classifying proteins by their fold. Results of performance tests confirmed that scaling out the nodes of the Hadoop cluster and increasing the degree of parallelism improve the efficiency of the computations.
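In Hadoop, the Map-only pattern amounts to declaring zero Reduce tasks, so that Map outputs are emitted directly without a shuffle phase. The following minimal driver sketch illustrates this; it assumes the Hadoop 2.x (MRv2) client API, and the class names, job parameter, and mapper (StructuralAlignmentMapper) are hypothetical stand-ins rather than the H4P source.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyAlignmentDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("alignment.algorithm", "jFATCAT"); // hypothetical job parameter
        Job job = Job.getInstance(conf, "map-only-structural-alignment");
        job.setJarByClass(MapOnlyAlignmentDriver.class);
        job.setMapperClass(StructuralAlignmentMapper.class); // hypothetical Mapper class
        job.setNumReduceTasks(0); // the Map-only pattern: no Reduce phase at all
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}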
This chapter showed one of the possible applications of the Map-only pattern of the MapReduce processing model. Results of the performed experiments showed that with the implementation of the pattern in the H4P, we can reduce the average time of matching a pair of protein structures by at least one order of magnitude, from seconds to fractions of seconds, depending on the number of tasks executed in parallel, by scaling computations from just one to eight compute units. During the course of the experiments, we could also observe that by implementing the Map-only pattern (especially in a hardware environment with limited compute resources) we could keep the number of Map tasks constant during the whole computation process. This was possible due to saving the compute resources that are usually consumed by Reduce tasks (which can run in parallel once the first Map tasks return their outcomes) and that were omitted in the Map-only pattern implemented in the H4P. As also shown in this chapter, the number of Map tasks executed in parallel at each Data node of the Hadoop cluster may vary depending on the hardware configuration. However, the best results were obtained for those hardware configurations where the number of parallel Map tasks did not exceed the number of logical CPUs assigned to the compute unit (Hadoop Data node). Manipulation of the number of logical CPUs assigned to compute units is quite easy, especially in private clouds (to which the H4P is dedicated), where virtual machines can be flexibly tuned with respect to various parameters influencing the performance of the computational cluster. The H4P is thus fully adaptable to the computational capabilities of a Hadoop cluster established in such a virtualized environment of the private cloud. Among the unique features of the H4P's implementation of structural alignment, it is worth mentioning its flexibility in various comparison scenarios. Although the best performance was achieved for alignments performed with the use of sequential files, this approach to data feeding is suitable for one-to-many and many-to-many comparison scenarios. For single one-to-one and batch one-to-one comparison scenarios, it is better to use the slower, but still parallel, version that consumes individual PDB/mmCIF files with single protein structures. This slower implementation of the parallel alignment procedure works similarly to the one presented by Hung and Lin [6], which processed individual pairs of protein structures. However, the system developed by Hung and Lin was implemented on the basis of the full MapReduce processing pattern, with the Map phase used for structural alignment and the Reduce phase for refinement of the obtained results. The H4P also differs from Hung and Lin's system in the implemented algorithms for structural alignment.
7.8 Summary

We can draw several conclusions on the basis of the results of the experiments performed for the presented approach. First of all, we can notice that the implementation of procedures for structural alignment as parallel, Map-only jobs for the Hadoop computing framework leads to a significant increase in the performance of the alignment process, proportional to the number of Hadoop Data nodes used in the computations.
This results in a shortening of the average time of the alignment processes performed in various comparison scenarios, especially in one-to-many and many-to-many structural alignments with the use of large collections of protein structures, but also in batch one-to-one alignments. Secondly, the Map-only pattern of the MapReduce processing model serves sufficiently well when performing computations, such as the alignment of 3D protein structures, that do not require secondary processing (sorting, grouping, and aggregating data) of primary results. Thirdly, for the various comparison scenarios, the H4P provides two versions of parallel, Map-only procedures for structural alignment: the slower version that processes individual files is used in one-to-one comparison scenarios, and the faster version that consumes sequential files is used in one-to-many and many-to-many comparison scenarios. The solutions presented in this and the next chapter are among the few worldwide that show great promise in demonstrating how advances in computational solutions can keep pace with the progress of large-scale data acquisition in bioinformatics. In the next chapter, we will see how similar computations over 3D protein structures are performed with the use of Hadoop clusters scaled out in the public cloud, and how another component of the Hadoop ecosystem, the HBase data warehouse, is used for storing results.
References

1. Chodorow, K.: MongoDB: The Definitive Guide, Powerful and Scalable Data Storage, 2nd edn. O'Reilly Media, Sebastopol (2013)
2. Fermi, G., Perutz, M., Shaanan, B., Fourme, R.: The crystal structure of human deoxyhaemoglobin at 1.74 A resolution. J. Mol. Biol. 175, 159–174 (1984)
3. Gibrat, J., Madej, T., Bryant, S.: Surprising similarities in structure comparison. Curr. Opin. Struct. Biol. 6(3), 377–385 (1996)
4. Gu, J., Bourne, P.: Structural Bioinformatics (Methods of Biochemical Analysis), 2nd edn. Wiley-Blackwell, Hoboken (2009)
5. Holm, L., Kaariainen, S., Rosenstrom, P., Schenkel, A.: Searching protein structure databases with DaliLite v. 3. Bioinformatics 24, 2780–2781 (2008)
6. Hung, C.L., Lin, Y.L.: Implementation of a parallel protein structure alignment service on Cloud. Int. J. Genomics 439681, 1–8 (2008)
7. Leinweber, M., Baumgärtner, L., Mernberger, M., Fober, T., Hüllermeier, E., Klebe, G., Freisleben, B.: GPU-based cloud computing for comparing the structure of protein binding sites. In: 2012 6th IEEE International Conference on Digital Ecosystems and Technologies (DEST), pp. 1–6 (2012)
8. Leinweber, M., Fober, T., Freisleben, B.: GPU-based point cloud superpositioning for structural comparisons of protein binding sites. IEEE/ACM Trans. Comput. Biol. Bioinform. PP(99), 1–14 (2018)
9. Leinweber, M., Fober, T., Strickert, M., Baumgärtner, L., Klebe, G., Freisleben, B., Hüllermeier, E.: CavSimBase: a database for large scale comparison of protein binding sites. IEEE Trans. Knowl. Data Eng. 28(6), 1423–1434 (2016)
10. Mell, P., Grance, T.: The NIST definition of Cloud Computing. Special Publication 800-145 (2011). Accessed on 7 May 2018. http://csrc.nist.gov/publications/nistpubs/800-145/SP800145.pdf
11. Momot, A., Małysiak-Mrozek, B., Kozielski, S., Mrozek, D., Hera, Ł., Górczyńska-Kosiorz, S., Momot, M.: Improving performance of protein structure similarity searching by distributing computations in hierarchical multi-agent system. In: Pan, J.S., Chen, S.M., Nguyen, N.T. (eds.) Computational Collective Intelligence. Technologies and Applications, pp. 320–329. Springer, Berlin (2010)
12. Mrozek, D.: High-Performance Computational Solutions in Protein Bioinformatics. SpringerBriefs in Computer Science. Springer International Publishing, Berlin (2014)
13. Mrozek, D., Brozek, M., Małysiak-Mrozek, B.: Parallel implementation of 3D protein structure similarity searches using a GPU and the CUDA. J. Mol. Model. 20, 2067 (2014)
14. Mrozek, D., Daniłowicz, P., Małysiak-Mrozek, B.: HDInsight4PSi: boosting performance of 3D protein structure similarity searching with HDInsight clusters in Microsoft Azure cloud. Inf. Sci. 349–350, 77–101 (2016)
15. Mrozek, D., Kutyła, T., Małysiak-Mrozek, B.: Accelerating 3D protein structure similarity searching on Microsoft Azure cloud with local replicas of macromolecular data. In: Parallel Processing and Applied Mathematics, PPAM 2015. Lecture Notes in Computer Science. Springer, Heidelberg (2015)
16. Mrozek, D., Małysiak-Mrozek, B.: CASSERT: a two-phase alignment algorithm for matching 3D structures of proteins. In: Kwiecień, A., Gaj, P., Stera, P. (eds.) Computer Networks, Communications in Computer and Information Science, vol. 370, pp. 334–343. Springer International Publishing, Berlin (2013)
17. Mrozek, D., Małysiak-Mrozek, B., Kłapciński, A.: Cloud4Psi: cloud computing for 3D protein structure similarity searching. Bioinformatics 30(19), 2822–2825 (2014)
18. Mrozek, D., Suwała, M., Małysiak-Mrozek, B.: High-throughput and scalable protein function identification with Hadoop and Map-only pattern of the MapReduce processing model. Knowl. Inf. Syst. (in press). http://dx.doi.org/10.1007/s10115-018-1245-3
19. Pang, B., Zhao, N., Becchi, M., Korkin, D., Shyu, C.R.: Accelerating large-scale protein structure alignments with graphics processing units. BMC Res. Notes 5(1), 116 (2012). https://doi.org/10.1186/1756-0500-5-116
20. Prlić, A., Yates, A., Bliven, S., et al.: BioJava: an open-source framework for bioinformatics in 2012. Bioinformatics 28, 2693–2695 (2012)
21. Shindyalov, I., Bourne, P.: Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 11(9), 739–747 (1998)
22. Singh, S., Chana, I.: Cloud resource provisioning: survey, status and future research directions. Knowl. Inf. Syst. 1–65 (2016). http://dx.doi.org/10.1007/s10115-016-0922-3
23. Stivala, A.D., Stuckey, P.J., Wirth, A.I.: Fast and accurate protein substructure searching with simulated annealing and GPUs. BMC Bioinf. 11(1), 446 (2010). https://doi.org/10.1186/1471-2105-11-446
24. Suwała, M.: Scaling-out protein structure similarity searching on the Hadoop platform. Master's thesis, Institute of Informatics, Silesian University of Technology, Gliwice, Poland (2013)
25. White, T.: Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale, 3rd edn. O'Reilly, Ireland (2012)
26. Ye, Y., Godzik, A.: Flexible structure alignment by chaining aligned fragment pairs allowing twists. Bioinformatics 19(2), 246–255 (2003)
Chapter 8
Scaling 3D Protein Structure Similarity Searching on Large Hadoop Clusters Located in a Public Cloud
The science of today is the technology of tomorrow
Edward Teller

At its essence, the field of bioinformatics is about comparisons
Jonathan Pevsner
Abstract For many reasons, protein structures are worth exploring, and this exploration still leaves a lot of room for potential applications of its results. 3D protein structure similarity searching is one of the important exploration processes performed in structural bioinformatics. Due to the complexity of 3D protein structures and the exponential growth of the number of protein structures in public repositories, like the Protein Data Bank, the process is time-consuming and requires increased computational resources. In this chapter, we will see how 3D protein structure similarity searching can be accelerated by distributing computations on large Hadoop/HBase (HDInsight) clusters that can be broadly scaled out and up in the Microsoft Azure public cloud. We will see that the utilization of public clouds to perform scientific computations is very beneficial and can be successfully applied when performing time-consuming computations over biological data.

Keywords Proteins · 3D protein structure · Tertiary structure · Similarity searching · Structure alignment · Superposition · Cloud computing · Parallel computing · Hadoop · MapReduce · Microsoft Azure · Public cloud
8.1 Introduction

Proteins are complex molecules and their structures are determined by experimental and theoretical methods. The Protein Data Bank (PDB) [1] is the first and the most popular repository established to collect macromolecular data describing 3D protein structures. When it was created in the 1970s, it contained only a few protein
structures, but due to the development of methods for the determination of protein structures, the growth of macromolecular data is now exponential (Fig. 8.1).

Fig. 8.1 Yearly growth of the number of 3D structures in the Protein Data Bank on the basis of published statistics at https://www.rcsb.org/

In the PDB, descriptions of protein structures, including primarily their geometries, are stored in text files. Apart from pure geometry, protein structures possess many features that determine their physical and chemical properties. All this information is stored in the text files. These files have a particular format that allows appropriate data to be stored in specific, dedicated, and self-descriptive sections. Popular formats for storing protein structures are PDB [27], mmCIF [3], and PDBML [26]. All are used today by the Protein Data Bank (PDB) for storing and exchanging macromolecular data. For example, in PDB files all data are arranged in records that keep some information on the protein structure, like the title, molecule name, its function, primary structure, secondary structure, locations of particular atoms, and chemical elements (Fig. 8.2). These records, however, are not the records that we know from relational databases. Apart from some attempts to keep and process macromolecular data in relational databases, including projects such as BioSQL [2], PSS-SQL [15, 17], and P3D-SQL [16], PDB files and other types of files for protein structures form a collection of NoSQL data. Taking into account that each PDB file may consist of tens of thousands of records, and that the Protein Data Bank contains more than one hundred thousand biological macromolecular structures (140,390 files, as of May 22, 2018), this gives an estimated half a billion atoms and atomic coordinates that must be explored. Additionally, since we observe the exponential growth of macromolecular data in the PDB repository, we can conclude that we deal with a large amount of molecular data, which fits the volume feature of the 5V model for Big
Data presented in Chap. 2. Moreover, the appearance of any new protein structure sooner or later causes the need to compare it with all structures that exist in the repository (the one-to-many comparison scenario presented in Chap. 7). The velocity at which macromolecular data are (and will be) generated, compared to the time in which the data can be processed to meet the goals of the similarity searching process, puts pressure on existing computer systems.

Fig. 8.2 A part of the PDB file describing a sample protein structure in the Protein Data Bank: the Title section (top) contains records used to describe the experiment and the biological macromolecules, the secondary structure section (middle) describes helices and sheets found in structures of proteins and polypeptides, and the Coordinate section (bottom) provides atomic coordinates describing locations of particular atoms

3D protein structure similarity searching is a computationally intensive problem that belongs to the NP-hard class, and the process itself is usually carried out in a pairwise manner, comparing a given query (user's) structure to successive structures from a collection. Finally, the variety of formats for macromolecular data and the variety of sources the data come from are a challenge for institutions collecting 3D protein structures and performing all-to-all comparisons. This motivates scientific efforts to scale similarity searches, performed with the use of existing methods, by means of new computing paradigms on computer clusters.

In Chap. 7, we could see the use of the Hadoop platform and the MapReduce processing model in the parallel implementation of massive 3D protein structure alignments in various comparison scenarios. The performance of the H4P system presented in Chap. 7 was tested in a virtualized environment of a private cloud, which had limited scaling capabilities. As we know from Chap. 2, cloud computing allows users to access and use configurable computing resources (e.g., networks, servers, storage, applications) as a service, without having to build the entire infrastructure that supports the resources within their own companies. This can be particularly beneficial for establishing a Hadoop-based computer cluster, since computing clouds allow bringing the computations to the data, scaling out working computer systems according to current needs, and quickly gaining access to a higher than average computing power. Cloud computing provides such a kind of scalable, high-performance computational platform that can also be used for solving the problem of missing computer power in 3D protein structure similarity searching. In this chapter, we will see a Hadoop-based solution for 3D protein structure similarity searching, which is scaled out on the Microsoft Azure public cloud, providing almost unlimited scaling capabilities.
8.2 HDInsight on Microsoft Azure Public Cloud

The Microsoft Azure cloud platform delivers hardware infrastructure and services for building scalable applications. Microsoft Azure allows developing, deploying, and managing applications and services through a network of data centers located in various countries throughout the world [5]. Microsoft Azure is a public cloud, which means that the infrastructure of the cloud is available for public use and is owned by Microsoft, which sells cloud services. The cloud provides computing resources in a virtualized form, including processing power, RAM, storage space, and appropriate bandwidth for transferring data over the network, within the Infrastructure as a Service (IaaS) service model [10, 25]. As we also know from Chap. 3, within the Platform as a Service model [10, 25], Azure also delivers a platform and dedicated
cloud service programming model for developing applications that are intended to work in the cloud. As clouds enable storing large amounts of data, Microsoft Azure provides a rich set of Data services for various scenarios, including storing, modifying, and reporting on data in Microsoft Azure, e.g., BLOBs that allow storing unstructured text or binary data (video, audio, and images), Tables that can store large amounts of structured non-relational (NoSQL) data, and the Azure SQL Database for storing large amounts of relational data.

The large volume, variety, and velocity of NoSQL macromolecular data raise the opportunity to treat them as big data. This entails the possibility of using computational solutions that are typical for big data analysis, like Hadoop or Spark. HDInsight is one of Microsoft's computational solutions that covers the possibility to create Hadoop or Spark as a Service on the Azure cloud. Azure HDInsight is a Hadoop-based service that brings the Apache Hadoop solution to the cloud, aiming at efficient analysis of even petabytes of data. HDInsight Hadoop uses the Hadoop Distributed File System (HDFS) to collect data on the computational cluster, and to distribute and replicate the data across multiple cluster nodes hosted on virtual machines leased from the cloud provider. MapReduce jobs are also executed in parallel on the nodes of this virtualized cluster, moving the processing to the analyzed data and then combining the results again. HDInsight clusters may use various series and sizes of virtual machines (VMs) that provide different compute capabilities (number of cores, CPU/core speed, amount of memory, efficiency of the I/O channel), as presented in Chap. 3. Aside from HDInsight, data can also be stored in the Azure Storage Vault (ASV), e.g., in Azure BLOBs. This is especially beneficial when the HDInsight computational cluster is not in use and should be released in order to reduce infrastructure leasing costs, while the data are still kept in the cloud. Within the Hadoop/HDInsight environment, data can also be collected in HBase, which is an Apache, open-source, NoSQL database [6]. Built on Hadoop, HBase provides random access and strong consistency for large amounts of unstructured and semistructured data. HBase stores data in rows of a table. However, it does not require a prior definition of table columns and data types for these columns. This schemaless feature is very important for the solution presented in this chapter, since it allows storing various outcomes of 3D protein structure similarity searches performed by the users of the presented system.
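The schemaless property means that only a column family must be declared when a results table is created; the per-measure columns appear implicitly at write time. A minimal sketch using the HBase 0.98-era client API follows; the table and family names are illustrative assumptions, not the actual HDInsight4PSi schema.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateResultsTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        HTableDescriptor table = new HTableDescriptor(TableName.valueOf("alignment_results"));
        // Only the column family is fixed up front; individual columns (one per
        // similarity measure) are created implicitly when results are written
        table.addFamily(new HColumnDescriptor("results"));
        admin.createTable(table);
        admin.close();
    }
}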
8.3 HDInsight4PSi

HDInsight4PSi is a Hadoop MapReduce-based system for efficient and highly scalable 3D protein structure similarity searching through structural alignment, performed according to the one-to-many comparison scenario known from Chap. 7. The system was developed in the Institute of Informatics at the Silesian University of Technology in Gliwice, Poland, within the Cloud4Proteins group. The development works started in 2014. The main constructors of the system were Paweł Daniłowicz
(within his Master thesis [4]), Bożena Małysiak-Mrozek, and me. At the time of development, we already had considerable experience gained during the construction of the H4P system presented in Chap. 7. The HDInsight4PSi allows for massive, distributed 3D protein structure similarity searches with the use of both the Map-only and the full MapReduce execution patterns. It is able to process individual protein structures (PDB files) and sequential files. Like the H4P system, the HDInsight4PSi searches for protein similarities with the use of two popular methods for 3D protein structure alignment, jCE and jFATCAT [20], which are enhanced versions of the Combinatorial Extension (CE) [24] and the Flexible structure AlignmenT by Chaining Aligned fragment pairs allowing Twists (FATCAT) [28]. Both the jCE and the jFATCAT algorithms have a very well-established position among researchers and are publicly available through the Protein Data Bank (PDB) Web site for those who want to search for structural neighbors. Moreover, both algorithms are used for pre-calculated all-to-all 3D-structure comparisons for the whole PDB that are updated on a weekly basis [19]. The HDInsight4PSi is similar to its predecessor, the H4P, in many areas. However, in contrast to the H4P presented in Chap. 7, the HDInsight4PSi:

• was originally developed for MapReduce 2.0 (MRv2) and uses YARN as a resource manager of the computational cluster;
• allows scaling all computations in the publicly available, commercial Azure cloud, which results in achieving very good, experimentally proved efficiency of the solution;
• enables execution of similarity searches on a large scale, with the use of the full repository of protein structures.
8.4 Implementation

Searching for protein structure similarities with the use of the HDInsight4PSi requires access to the Microsoft Azure cloud, then creation of a Hadoop/HBase cluster using the hardware infrastructure of the cloud (storage and virtual machines), and finally, creation of appropriate MapReduce jobs implementing the algorithms for similarity searching. Access to the Microsoft Azure commercial cloud is possible for holders of an active Azure subscription, which is usually served on a pay-as-you-go basis. When developing the HDInsight4PSi, we used the subscription obtained within the Microsoft Azure for Research Award program, which allowed us to create a Hadoop cluster having 1–48 Data nodes (originally based on A3/A4 virtual machines, but later we used A10 compute-intensive VMs). Creation of a general-purpose Hadoop/HBase cluster (hereinafter referred to as the HDInsight cluster, due to the name of the service under which it is delivered in the Azure cloud) is usually an easy task, which can be done from the Microsoft Azure management console, a Web-based tool for managing objects of the cloud infrastructure and platform (like cloud services, storage, and virtual machines). However, 3D protein
structure similarity searching requires higher than usual computational resources for each single protein comparison that is performed. Each Map task requires at least 1.5 GB of RAM for its execution, and each Reduce task requires at least 2 GB of RAM. Therefore, for the proper creation and configuration of the cluster, we had to use the Azure PowerShell console and a PowerShell script; a sketch of the corresponding task-memory settings is shown after the version list below. The HDInsight4PSi was originally created for the following versions of software and services:
• Hadoop ver. 2.4.0 (HDInsight 3.1),
• HBase ver. 0.98.4,
• BioJava ver. 3.0.5,
• Java ver. 1.7.
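On YARN (MRv2), the per-task memory requirements quoted above translate into standard MapReduce properties. The following minimal sketch mirrors those requirements; it is illustrative only, as the actual PowerShell provisioning script is not reproduced here.

import org.apache.hadoop.conf.Configuration;

public class TaskMemoryConfig {
    public static Configuration forAlignmentJobs() {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.map.memory.mb", 1536);    // >= 1.5 GB per Map task
        conf.setInt("mapreduce.reduce.memory.mb", 2048); // >= 2 GB per Reduce task
        // JVM heaps must stay below the container limits requested above
        conf.set("mapreduce.map.java.opts", "-Xmx1280m");
        conf.set("mapreduce.reduce.java.opts", "-Xmx1792m");
        return conf;
    }
}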
The architecture of the HDInsight4PSi working in the Azure cloud is shown in Fig. 8.3. The system enables parallel execution of similarity searches for a protein structure specified by a user. Similarity searching is executed from a batch script or an HDInsight PowerShell execution script. Execution of the process requires several parameters to be provided, including the user's query protein structure (QS) for which the similarity searching will be performed, and the name and variant of the algorithm used for this purpose. This causes the creation of the MapReduce job for protein similarity searching, whose execution is managed by the Application Master. NodeManagers execute in parallel Map tasks that perform pairwise comparisons and alignments of the query protein structure (QS) to subsets of candidate structures (CSs) from the whole repository located in the HDFS. Results of the alignments performed in the Map phase are stored in the HBase database. Due to the nature of the process, in which the results are independent of each other and do not require additional grouping or sorting, the system may work with or without the Reduce phase. Therefore, saving the results, depending on the implementation variant, occurs in the Reduce phase (variant MR) or directly in the Map phase (variant M). The Application Master and YARN continuously monitor the progress of the execution of parallel Map and Reduce tasks (see Fig. 8.4).

Fig. 8.3 HDInsight4PSi running on the Hadoop/HBase cluster located in the Microsoft Azure cloud

Fig. 8.4 MapReduce job execution and progress monitoring

Candidate protein structures (CSs) that are compared with the given structure QS are delivered from the Data nodes of the HDInsight cluster in portions, as individual compressed PDB files or grouped in so-called sequential files holding many protein structures (also compressed). The typical size of a compressed PDB file describing a single structure ranges from kilobytes to megabytes. This size of data is not optimal for the Hadoop Distributed File System, whose block size (the smallest unit of data that a file system can store) is set to 64 MB (128 MB in some distributions of Hadoop). Therefore, we decided to test the performance of both variants of data delivery. Results of the performed experiments are presented in Sect. 8.5.

MapReduce jobs for protein similarity searching are actually parallel implementations of the jCE and jFATCAT algorithms. These jobs consist of a number of Map tasks and, optionally, a number of Reduce tasks, both executed in parallel on the nodes of the HDInsight/HBase cluster. As input, a single Map task requires the query protein structure (QS) and a set of candidate protein structures (CSs) to be compared (Fig. 8.5). Both are located in the HDFS. Each Map task performs a comparison and alignment of the query structure to each candidate structure that was delivered to it and generates the results of the alignments. Since entries in MapReduce are recorded as (key, value) pairs, PDB ID codes of candidate proteins or names of files are assigned to keys, and 3D structures are assigned to values (Fig. 8.6). Alignments are performed by means of the algorithm specified in the HDInsight execution script (jCE or jFATCAT). The location of the query protein structure (QS), the alignment algorithm specified by the user, and the type of files processed (single/individual or sequential files) are provided as configuration parameters (the context argument) of the Map function. Results of the alignments performed by Map tasks are transferred to the Reduce phase (Fig. 8.5a) or directly to the HBase database (Fig. 8.5b). These results also have the form of (key, value) pairs, where the key contains the concatenated PDB ID code of the query protein (QPDB ID) and the PDB ID code of the ith candidate protein (PDB ID i), and the value consists of the values of similarity measures produced by the alignment method (specific to the method used). The flow of the Map phase is shown in Algorithm 6. Figure 8.6 shows the input and output of the Map function when processing proteins stored in individual files and in sequential files.

Fig. 8.5 Execution of 3D protein structure similarity searching as a MapReduce job: a variant with the Reduce phase (full MapReduce, variant MR), b variant without the Reduce phase (only Map, variant M). On the basis of [12]
Fig. 8.6 Input and output of the Map task when processing proteins stored in: a individual files, b sequential files
Algorithm 6 Pseudocode of the Map phase for 3D protein structure similarity searching on the HDInsight cluster

1: for each Map(key=pdb_id, value=candidate_structure_or_structures_in_sequence_file, context=configuration) do
2:     QS ← context.query_structure
3:     algorithm ← context.algorithm
4:     for i ← 1, value.#structures do
5:         CS ← candidate_structure[i]
6:         key ← QS.pdb_id & "–" & CS.pdb_id
7:         value ← align3D(QS, CS, algorithm)
8:     end for
9: end for
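A Java Mapper corresponding to Algorithm 6 might look as follows. This is a sketch, not the HDInsight4PSi source: the job parameters (alignment.algorithm, alignment.query.path) are hypothetical names, decompression of gzipped content is omitted, and the alignment itself is assumed to be delegated to BioJava's jCE/jFATCAT implementations via StructureAlignmentFactory.

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.biojava.bio.structure.Atom;
import org.biojava.bio.structure.Structure;
import org.biojava.bio.structure.StructureTools;
import org.biojava.bio.structure.align.StructureAlignment;
import org.biojava.bio.structure.align.StructureAlignmentFactory;
import org.biojava.bio.structure.align.model.AFPChain;
import org.biojava.bio.structure.io.PDBFileParser;

public class AlignmentMapper extends Mapper<Text, BytesWritable, Text, Text> {

    private Structure query;      // QS, loaded once per Map task
    private String algorithmName; // e.g., "jCE" or "jFatCat_flexible" (BioJava identifiers)

    @Override
    protected void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        algorithmName = conf.get("alignment.algorithm");          // hypothetical parameter
        Path qsPath = new Path(conf.get("alignment.query.path")); // hypothetical parameter
        try (InputStream in = FileSystem.get(conf).open(qsPath)) {
            query = new PDBFileParser().parsePDBFile(in);         // parse QS from HDFS
        }
    }

    @Override
    protected void map(Text pdbId, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        try {
            // CS: the candidate structure delivered as the value of the input record
            Structure candidate = new PDBFileParser()
                    .parsePDBFile(new ByteArrayInputStream(value.copyBytes()));
            Atom[] ca1 = StructureTools.getAtomCAArray(query);
            Atom[] ca2 = StructureTools.getAtomCAArray(candidate);
            StructureAlignment algorithm = StructureAlignmentFactory.getAlgorithm(algorithmName);
            AFPChain afp = algorithm.align(ca1, ca2);             // align3D from Algorithm 6
            // Output key: concatenated "QPDBID-PDBID i" (cf. Fig. 8.6)
            context.write(new Text(query.getPDBCode() + "-" + pdbId),
                          new Text(afp.getTMScore() + ";" + afp.getTotalRmsdOpt()));
        } catch (Exception e) {
            throw new IOException("Alignment failed for " + pdbId, e);
        }
    }
}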
The optional Reduce phase (Fig. 8.5a) is only responsible for storing results in the HBase database. Entries of the Reduce task are also recorded as (key, value) pairs. Reducers accept the output of the Map tasks: the key contains the concatenated QPDB ID-PDB ID i codes, and the value consists of the values of similarity measures serialized to a representative object. Keys are then stored in this concatenated form in the HBase database. This allows for fast filtering of results when retrieving and displaying data with the use of the Results viewer, a dedicated tool for viewing alignment results and visualizing alignments (see Fig. 8.7). Values of similarity measures are extracted and stored in separate fields of the database, which simplifies browsing results and sorting structures by a specified measure.
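In the MR variant, storing results in HBase can be implemented with HBase's TableReducer. The sketch below is written under assumptions: the column family and qualifier are illustrative, and a single serialized value is stored, whereas the real HDInsight4PSi schema extracts each measure into a separate field.

import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

public class StoreResultsReducer extends TableReducer<Text, Text, ImmutableBytesWritable> {

    private static final byte[] FAMILY = Bytes.toBytes("results");

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text measures : values) {
            // Row key: concatenated "QPDBID-PDBID i", enabling fast prefix filtering
            Put put = new Put(Bytes.toBytes(key.toString()));
            put.add(FAMILY, Bytes.toBytes("measures"), Bytes.toBytes(measures.toString()));
            context.write(null, put);
        }
    }
}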
8.5 Performance Evaluation

The performance of the presented solution was tested in a series of tests. Our group expected that by parallelizing computations in HDInsight on the Azure cloud, the similarity searching would be completed much faster than the same process performed on a regular workstation. In order to verify this thesis, we created two testing environments:

• a stand-alone PC workstation with an Intel Core i5-4460 3.2 GHz Quad-Core processor, 8 GB RAM, and a 1 TB 7200 RPM HDD,
• an HDInsight/HBase cluster with two Head nodes (hosting the Hadoop NameNodes and the Resource Manager as services) and n Data nodes (n = 1, 2, 4, 8, 12, 16, 24, 48) running up to mMR MapReduce tasks in parallel; the cluster worked on the Microsoft Azure cloud.

The maximum total number of MapReduce tasks executed in parallel (mMR), determined by the number of YARN containers, varied according to the number of Data nodes of the HDInsight cluster. The maximum total number of MapReduce tasks executed concurrently and the total numbers of CPUs utilized in various configurations of the HDInsight cluster are shown in Table 8.1.

Fig. 8.7 Main window of the Results viewer, a dedicated tool used for browsing results of similarity searches retrieved from the HBase database

Table 8.1 Specification of the HDInsight cluster used in tests for the varying number of Data nodes

#Data nodes               1    2    4    8    12   16   24   48
#CPUs                     3    4    6    10   14   18   26   50
#CPUs with Zookeeper      4    5    8    13   17   21   29   53
Max #parallel MR tasks    3    7    15   31   47   63   95   191

We used A10 virtual machines for the Head nodes and Data nodes of the HDInsight cluster, and A10 units for the Zookeeper (used for highly reliable distributed coordination). For the hardware parameters of A10 VMs, please refer to Chap. 3.

In the various performance tests, we used a collection of query protein structures with lengths between 50 and 781 amino acids. These were arbitrarily chosen molecules from the Protein Data Bank that represent different classes according to the SCOP classification [18], i.e., all α, all β, α + β, α/β, α and β, coiled coil proteins, and others. Similarity searching was carried out with the use of the whole (at the time of testing) repository of macromolecular structures containing 93,121 proteins from the Protein Data Bank (we filtered out molecules that were not proteins), although in some tests we used subsets of the repository. The performance was tested for both algorithms, jCE and jFATCAT (we used the flexible variant of jFATCAT; results for jFATCAT-Rigid were similar). In particular, we tested and analyzed:

• the performance of similarity searches executed for individual candidate protein structures (batch one-to-one comparison scenario),
• the performance of similarity searches executed for candidate protein structures assembled in sequential files (one-to-many comparison scenario),
• the performance of similarity searches executed with and without the Reduce phase (variant MR and variant M),
• the performance of similarity searches executed for different alignment methods,
• the performance of similarity searches for various sizes of protein structures,
• the scalability of the presented solution in the Microsoft Azure cloud.

In all cases, the performance evaluation was conducted on the basis of execution time measurements. At least two replicates of each measurement were carried out; the obtained results were averaged and are presented in this form in the following sections.
8.5.1 Evaluation Metrics

To demonstrate the performance of the parallel, MapReduce versions of the algorithms for protein structure similarity searching, traditional performance evaluation metrics were used, and we introduced an additional metric (Structural Alignment Speed) that
is dedicated to the specificity of the problem. The following metrics were used to evaluate the performance of the parallelized methods:

• execution time T;
• speedup S_p;
• Structural Alignment Speed Speed_{SA}.

Execution time T is a basic metric that allows us to assess the performance of computations. The time was measured from the moment the MapReduce job started to the moment it stopped, i.e., when the last MapReduce task was completed. We can assume that the total execution time of the MapReduce job is determined by several components, including the data decomposition and task assignment time T_{dec}, the computation time T_{comp}, the communication time T_{comm}, and the idle time T_{idle} resulting from an unbalanced load of computational units.

Speedup allows us to assess how much faster we are running our parallel implementations of the similarity searching algorithms on the HDInsight/HBase cluster. The speedup is defined as follows:

S_p = \frac{T_1}{T_p},    (8.1)

where T_1 is the execution time when performing protein structure similarity searches on a computation unit with one processor (p = 1), and T_p is the execution time when performing protein structure similarity searches on an HDInsight/HBase cluster with p processors.

The above-mentioned metrics are very useful for evaluating the performance of parallel task execution. Scientists working in structural bioinformatics would also appreciate knowing how many proteins they can compare in a time unit. However, as we will observe in the following sections, the execution time T highly depends on the size of the query protein structure size_r(Q) given by a user and the size of the repository size_r(Repository) used in the experiments, i.e., the sizes of proteins in the repository, measured in residues. This is expected if we analyze how the jCE and the jFATCAT algorithms work, since larger molecules compared in the similarity searching produce longer chains of aligned fragment pairs (AFPs), and therefore larger similarity/distance matrices are built for every pair of compared molecules. The Protein Data Bank repository of macromolecular data evolves all the time and increases in size every year. It is then difficult to compare obtained results of performance tests with results obtained in experiments carried out by other researchers, unless we know the exact data set that was used in their experiments. For this reason, in order to eliminate the factors of the size of the query molecule and the size of the data set in structural comparisons, and to make results comparable in the future, we defined the Structural Alignment Speed measure (Speed_{SA}):

Speed_{SA} = \frac{size_r(Q) \times size_r(Repository)}{T},    (8.2)
where size_r(Q) denotes the number of residues in the query protein structure, size_r(Repository) denotes the total number of residues in the repository of protein structures, and T is the execution time. The speed is expressed in residues²/s, and it shows how many residue-to-residue comparisons are performed in a unit of time. The size size_r(S^3D) of any (query or repository) protein structure S^3D, measured in residues, is calculated by summing the number of residues in all amino acid chains of the protein structure:

$$ size_r(S^{3D}) = \sum_{h=1}^{H} size_r(c_h), \qquad (8.3) $$

where H is the number of amino acid chains in the structure of the protein S^3D, and size_r(c_h) is the number of residues in the h-th amino acid chain c_h of the protein. The size of the repository size_r(Repository), measured in residues, is calculated by summing the sizes of all protein structures in the repository:

$$ size_r(Repository) = \sum_{v=1}^{|R|_s} size_r(S_v^{3D}), \qquad (8.4) $$

where |R|_s is the number of protein structures in the repository, and size_r(S_v^3D) is the size, measured in residues, of the v-th protein S_v^3D from the repository, calculated according to Eq. 8.3.
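As a small, purely illustrative example of Eq. 8.2 (the numbers here are hypothetical, not measured values from the experiments): a query protein of size_r(Q) = 200 residues searched against a repository of size_r(Repository) = 1,000,000 residues in T = 100 s yields Speed_SA = 200 × 1,000,000 / 100 = 2 × 10⁶ residues²/s.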
8.5.2 Comparing Individual Proteins in One-to-One Comparison Scenario

In the first series of experiments, we tested the execution time of similarity searches performed against a set of one hundred proteins that were stored individually, each in a separate, compressed macromolecular (PDB) data file. This is the standard way in which protein structures are stored in and exchanged with the Protein Data Bank. Tests were performed on the PC workstation (serial execution) and on two small HDInsight clusters with one Data node and four Data nodes. HDInsight clusters were configured to execute both the Map and Reduce phases (variant MR). Results are shown for the query protein PDB ID: 3CFM [9], which is a middle-sized molecule containing 2 chains, each 118 amino acids (residues) long.
Execution times for similarity searches performed in all three hardware configurations are shown in Fig. 8.8. We noted that for both algorithms, the process completed the fastest on the stand-alone PC workstation: it took exactly 6 min and 1 s for jCE and 5 min and 22 s for jFATCAT. The same process performed on the 1-node HDInsight cluster took slightly more than 52 min for jCE and more than 46 min for jFATCAT, which was more than eight times slower for both algorithms.
Fig. 8.8 Execution time of similarity searches performed for individual protein structures depending on hardware configuration for jCE (left) and jFATCAT (right) structural alignment algorithms
Table 8.2 Comparison of performance metrics for similarity searches executed with jCE and jFATCAT algorithms for individual protein structures in three hardware configurations: PC workstation (serial) and two HDInsight clusters (parallel)

jCE                    | PC workstation | 1-node HDInsight | 4-node HDInsight
CPUs                   | 1              | 4                | 8
Execution time (s)     | 361            | 3,129            | 435
Speedup                | 1.00           | 0.115            | 0.829
Speed_SA (residues²/s) | 29,542         | 3,408            | 24,517

jFATCAT                | PC workstation | 1-node HDInsight | 4-node HDInsight
CPUs                   | 1              | 4                | 8
Execution time (s)     | 322            | 2,794            | 388
Speedup                | 1.00           | 0.115            | 0.830
Speed_SA (residues²/s) | 33,121         | 3,817            | 27,487
The process performed on the 4-node HDInsight cluster took more than 7 min for jCE and 6 min 28 s for jFATCAT, which was slightly slower than on the PC workstation. In Table 8.2 we can study the values of all performance metrics used for evaluation. As can be observed, the speedup remains at a very low level for similarity searches executed on both HDInsight clusters, for both alignment algorithms. When comparing and aligning protein structures stored in individual files, the structural alignment speed Speed_SA for the larger 4-node HDInsight cluster hardly reached 27,487 residues²/s for jFATCAT and 24,517 residues²/s for jCE. In this series of experiments, the PC workstation achieved a speed of 33,121 residues²/s for jFATCAT and 29,542 residues²/s for jCE.
The obtained results can be surprising at first glance, since both executions on the HDInsight cluster turned out to be slower than on the PC workstation.
However, we have to remember that Hadoop has been designed to deal with Big Data, and its file system (HDFS) is optimized to handle very large files that typically range in size from megabytes to petabytes. Therefore, the default HDFS block size is set to 64 MB in order to reduce the number of requests to the file system. On the other hand, compressed PDB files that store protein structures range in size from kilobytes to single megabytes, so their sizes do not fit the HDFS block size. All requests for protein structures have to be processed by the Hadoop Name node (running as a service on the HDInsight Head node) to figure out where the protein structure is located, and then go across the network to the HDFS. Moreover, a single PDB file with a protein structure is assigned to a single Map task. This incurs a lot of overhead compared to processing protein structures taken from the local hard drive on the PC workstation, and it is the reason for the obtained results. It might seem that increasing the amount of data per file so that it fits the HDFS block size should be beneficial, and this was the subject of the experiments presented in the following section.
8.5.3 Working with Sequential Files in One-to-Many Comparison Scenario

In the second series of experiments, we tested the performance of similarity searches against a collection of proteins that were not stored individually in separate files, but were joined in so-called sequential files in such a way that they filled the 64 MB HDFS block size (a sketch of such a packing step is shown below). The number of protein structures in the collection represented 14% of the whole repository, exactly 12,984 protein structures, which gave 20 sequence files (20 Hadoop splits). Similarly to the previous case, tests were performed on the PC workstation (serial execution) and on two small HDInsight clusters with one Data node and four Data nodes. HDInsight clusters were configured to execute both the Map and Reduce phases (variant MR). Results are shown for the same query protein PDB ID: 3CFM [9], containing 2 chains, each 118 amino acids (residues) long.
Execution times for similarity searches performed with all three hardware configurations are shown in Fig. 8.9. We noted that for both algorithms, the process completed the fastest on the 4-node HDInsight cluster: it took 1 h and 40 min for jCE and 1 h and 24 min for jFATCAT. The same process performed on the 1-node HDInsight cluster took 9 h and 9 min for jCE and slightly more than 8 h for jFATCAT. The process performed on the stand-alone PC workstation took 19 h and 31 min for jCE and 17 h and 25 min for jFATCAT. In Table 8.3 we can study the values of all performance metrics used for evaluation. By using the 1-node HDInsight cluster, we gained an n-fold speedup of 2.13 for jCE and 2.16 for jFATCAT over the stand-alone PC workstation. By scaling to the 4-node HDInsight cluster, we gained an n-fold speedup of 11.61 for jCE and 12.34 for jFATCAT over the stand-alone PC workstation, and 5.45 for jCE and 5.71 for jFATCAT over the 1-node HDInsight cluster.
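The following Java sketch illustrates how many small, compressed PDB files could be packed into a single Hadoop sequence file. This is a minimal example of the general technique, not the exact tool used in the experiments; the class name and the key/value layout (file name as key, raw compressed bytes as value) are assumptions made for illustration.

import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PdbSequenceFilePacker {
    public static void main(String[] args) throws Exception {
        File inputDir = new File(args[0]);   // local directory with compressed PDB files
        Path output = new Path(args[1]);     // target sequence file on HDFS
        Configuration conf = new Configuration();
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(output),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (File pdb : inputDir.listFiles((dir, name) -> name.endsWith(".ent.gz"))) {
                byte[] content = Files.readAllBytes(pdb.toPath());
                // key: the original file name; value: the raw (still compressed) file content
                writer.append(new Text(pdb.getName()), new BytesWritable(content));
            }
        }
    }
}

Packing many small files into block-sized sequence files means that one Map task receives a whole 64 MB split of candidate structures instead of a single small file, which is exactly what reduces the per-file overhead discussed in the previous section.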
Fig. 8.9 Execution time of similarity searches performed for protein structures assembled in sequential files (one-to-many comparison scenario) depending on hardware configuration for the jCE (left) and the jFATCAT (right) structural alignment algorithms. HDInsight clusters were configured to execute both Map and Reduce phases (variant MR, full MapReduce)

Table 8.3 Comparison of performance metrics for similarity searches executed with the jCE and the jFATCAT algorithms for proteins grouped in sequential files in three hardware configurations: PC workstation (serial execution) and two HDInsight clusters (parallel execution). HDInsight clusters were configured to execute both Map and Reduce phases (variant MR, full MapReduce)

jCE                    | PC workstation | 1-node HDInsight | 4-node HDInsight
CPUs                   | 1              | 4                | 8
Execution time (s)     | 70,276         | 32,983           | 6,053
Speedup                | 1.00           | 2.13             | 11.61
Speed_SA (residues²/s) | 20,805         | 44,329           | 241,551

jFATCAT                | PC workstation | 1-node HDInsight | 4-node HDInsight
CPUs                   | 1              | 4                | 8
Execution time (s)     | 62,746         | 29,025           | 5,084
Speedup                | 1.00           | 2.16             | 12.34
Speed_SA (residues²/s) | 23,302         | 50,374           | 287,590
When comparing the values of performance metrics achieved by both HDInsight clusters when using sequential files in the one-to-many comparison scenario (Table 8.3) to analogous metrics obtained for individual files in the batch one-to-one comparison scenario (Table 8.2), we can notice a significant improvement in terms of execution times, speedup, and Speed_SA. By joining protein structures in sequential files, the Structural Alignment Speed Speed_SA increased from 24,517 residues²/s to 241,551 residues²/s for jCE, and from 27,487 residues²/s to 287,590 residues²/s for jFATCAT on the larger 4-node HDInsight cluster. Depending on the HDInsight cluster configuration and the alignment algorithm used, grouping proteins in sequential files gave a 9-fold to 13-fold improvement over the previous parallelization technique dealing with individual files (see Fig. 8.10 for details).
Fig. 8.10 Speedup S achieved when aligning proteins assembled in sequential files (over processing individual files) depending on hardware configuration and structural alignment algorithm
Since in the two experiments we worked on different data sets, the speedup S was calculated based on the Structural Alignment Speed metric, as the quotient of the speed achieved by the program when processing sequential files, Speed_SA,sf, and the speed achieved when processing individual files, Speed_SA,if:

$$ S = \frac{Speed_{SA,sf}}{Speed_{SA,if}}. \qquad (8.5) $$

These results confirm the conclusions presented in Chap. 7 that sequential files used in the one-to-many comparison scenario bring significant performance improvements. It is also worth noting that in this series of experiments, the PC workstation achieved a speed of 23,302 residues²/s for jFATCAT and 20,805 residues²/s for jCE, and that the 1-node HDInsight cluster turned out to be only 2.13–2.16 times faster than the PC workstation, although it could execute up to three Map tasks in parallel. The reasons for this result lie in the MapReduce processing model and the limited resources of the 1-node HDInsight cluster. After completing several Map tasks, Hadoop executes the Shuffle and then the Reduce phases concurrently. These phases require and consume additional resources that must be allocated by limiting the number of concurrent Map tasks. We noticed that in such a case, only one Map task was executed on the 1-node HDInsight cluster. On the 4-node HDInsight cluster, the amount of resources was sufficient to maintain multiple simultaneous Map tasks and keep an appropriate degree of parallelism. This conclusion is also confirmed by the results of experiments presented in the following section.
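Returning to Eq. 8.5 and Fig. 8.10 for a quick numerical check against the measured values: on the 4-node cluster, jCE achieved 241,551 residues²/s with sequential files and 24,517 residues²/s with individual files, giving S = 241,551 / 24,517 ≈ 9.85, which matches the value reported in Fig. 8.10; the analogous quotient for jFATCAT, 287,590 / 27,487 ≈ 10.46, matches as well.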
8.5.4 Full MapReduce Versus Map-Only Execution Pattern

As we know from Chap. 7, the specificity of protein structure similarity searching does not require a second step in which data are grouped, sorted, or aggregated after the similarity search is completed. This led us to believe that the Reduce phase is unnecessary and that we can omit it and additionally accelerate the entire process. This was verified in our experiments by measuring the execution time of similarity searches performed without the Reduce phase. The Map phase was modified appropriately, in such a way that it stored the results of protein comparisons in the HBase database as a part of the Map task activity. Tests were performed with the same collection of proteins stored in 64 MB sequential files. Similarly to the previous case, the number of protein structures in the collection represented 14% of the whole repository, exactly 12,984 protein structures grouped in 20 sequence files (20 Hadoop splits). Tests were performed on two small HDInsight clusters with one Data node and four Data nodes. Execution results are shown for the same query protein PDB ID: 3CFM, containing 2 chains, each 118 amino acids (residues) long. Results of the serial execution on the PC workstation are the same as in the previous experiment.
Execution times for similarity searches performed with all three hardware configurations are shown in Fig. 8.11. Similarly to the full MapReduce implementation (variant MR), in this series of experiments we also noted that similarity searches completed the fastest on the 4-node HDInsight cluster for both alignment algorithms: 1 h and 25 min for jCE and 1 h and 11 min for jFATCAT. This gave a significant speedup over the implementation on the stand-alone PC workstation: 13.78 for jCE and 14.72 for jFATCAT (Table 8.4). The same process performed on the 1-node HDInsight cluster took 4 h and 40 min for jCE and 4 h and 27 min for jFATCAT. This gave an n-fold speedup over the implementation on the stand-alone PC workstation of 4.18 for jCE and 3.91 for jFATCAT.
Fig. 8.11 Execution time of similarity searches performed for protein structures assembled in sequential files (one-to-many comparison scenario) depending on hardware configuration for the jCE (left) and the jFATCAT (right) structural alignment algorithms. HDInsight clusters were configured to work in the Map-only execution pattern without the Reduce phase (variant M)
Table 8.4 Comparison of performance metrics for similarity searches executed with the jCE and the jFATCAT algorithms for proteins grouped in sequential files (one-to-many comparison scenario) in three hardware configurations: PC workstation (serial execution) and two HDInsight clusters (parallel execution). HDInsight clusters were configured to work without the Reduce phase (Map-only, variant M)

jCE                    | PC workstation | 1-node HDInsight | 4-node HDInsight
CPUs                   | 1              | 4                | 8
Execution time (s)     | 70,276         | 16,801           | 5,149
Speedup                | 1.00           | 4.18             | 13.78
Speed_SA (residues²/s) | 20,805         | 87,025           | 283,960

jFATCAT                | PC workstation | 1-node HDInsight | 4-node HDInsight
CPUs                   | 1              | 4                | 8
Execution time (s)     | 62,746         | 16,057           | 4,274
Speedup                | 1.00           | 3.91             | 14.72
Speed_SA (residues²/s) | 23,302         | 91,057           | 342,094
Taking into account the execution time measurements, the 4-node HDInsight cluster was 3.26 times faster than the 1-node HDInsight cluster for jCE and 3.76 times faster for jFATCAT.
Skipping the Reduce phase in the Map-only implementation (variant M) led to a shortening of the search time on both clusters. On the 4-node HDInsight cluster, the execution time was reduced from 1 h and 40 min to 1 h and 25 min for jCE, and from 1 h and 24 min to 1 h and 11 min for jFATCAT (cf. Figs. 8.9 and 8.11). Depending on the alignment algorithm (see Fig. 8.12), this gave a 15% (jCE) and 16% (jFATCAT) performance improvement over the implementation with the Reduce phase (variant MR), for which execution times were presented in Fig. 8.9. The most significant reduction of the execution time was noticed for the 1-node HDInsight cluster: from 9 h and 9 min to 4 h and 40 min for jCE, and from 8 h and 3 min to 4 h and 27 min for jFATCAT. This gave us an almost 50% reduction of the execution time (49% for jCE and 45% for jFATCAT) over the full MapReduce implementation (variant MR).
When comparing the values of performance metrics achieved by the Map-only implementation (variant M) without the Reduce phase (Table 8.4) to analogous metrics obtained for the full MapReduce-based implementation (variant MR) with the Reduce phase (Table 8.3), we can notice an additional improvement in terms of execution time, speedup, and Speed_SA. Elimination of the Reduce phase (variant M) increased the Structural Alignment Speed from 287,590 residues²/s to 342,094 residues²/s for jFATCAT, and from 241,551 residues²/s to 283,960 residues²/s for jCE, when working on the 4-node HDInsight cluster with a subset of the repository (20 sequence files). The speedup over the serial implementation on the PC workstation increased from 11.61 to 13.78 for jCE, and from 12.34 to 14.72 for jFATCAT.
Fig. 8.12 Reduction of the execution time achieved by elimination of the Reduce phase (in the Map-only implementation, variant M, of the MapReduce) depending on hardware configuration and structural alignment algorithm. Protein structures were assembled in 20 sequence files
This confirms our observations from Chap. 7 that for processes like protein structure similarity searching, skipping the Shuffle and Reduce phases saves resources that can be used to run many Map tasks in parallel and keeps the degree of parallelism constant during the whole computation process.
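To make the Map-only pattern concrete, the following Java sketch shows how such a job could be configured with the Hadoop MapReduce API: the Reduce phase is disabled by setting the number of Reduce tasks to zero, and results are written to HBase directly from the Map tasks through TableOutputFormat. This is a simplified illustration of the technique, not the exact HDInsight4PSi code; the table name and the skeletal AlignmentMapper are hypothetical.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

public class MapOnlySimilarityJob {

    // Skeletal mapper: in the real system, map() would run the structural alignment
    // of the query against the candidate structure carried in the value.
    public static class AlignmentMapper
            extends Mapper<Text, BytesWritable, ImmutableBytesWritable, Put> {
        @Override
        protected void map(Text pdbId, BytesWritable structure, Context context)
                throws IOException, InterruptedException {
            // ... run jCE/jFATCAT alignment here; below only a placeholder score is stored
            Put put = new Put(Bytes.toBytes(pdbId.toString()));
            put.addColumn(Bytes.toBytes("result"), Bytes.toBytes("score"), Bytes.toBytes("0.0"));
            context.write(new ImmutableBytesWritable(put.getRow()), put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set(TableOutputFormat.OUTPUT_TABLE, "alignment_results"); // hypothetical table
        Job job = Job.getInstance(conf, "protein-similarity-map-only");
        job.setJarByClass(MapOnlySimilarityJob.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        job.setMapperClass(AlignmentMapper.class);
        job.setOutputFormatClass(TableOutputFormat.class);
        job.setOutputKeyClass(ImmutableBytesWritable.class);
        job.setOutputValueClass(Put.class);
        job.setNumReduceTasks(0); // variant M: skip the Shuffle and Reduce phases entirely
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Setting the number of Reduce tasks to zero is the standard Hadoop way of requesting the Map-only pattern; the framework then writes the Map output directly through the configured output format, with no Shuffle.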
8.5.5 Performance of Various Algorithms

Various structural alignment algorithms differ in their computational procedures (they might also return different results in a number of cases). The computational procedure determines the execution time of each of the tested algorithms. In the fourth series of experiments, we investigated the execution time of all of the algorithms used for 3D structure alignment and similarity searching. Tests were performed on both the 1-node and 4-node HDInsight clusters without the Reduce phase (Map-only, variant M), with the same collection of proteins stored in 20 sequence files, each 64 MB in size. Results are shown for the same query protein PDB ID: 3CFM, containing 2 chains, each 118 amino acids long.
Execution times for similarity searches performed with the use of all three algorithms (jCE, jFATCAT-Rigid, jFATCAT-Flexible) are shown in Fig. 8.13. During this series of experiments, we observed that both variants of jFATCAT, i.e., jFATCAT-Rigid and jFATCAT-Flexible, have almost the same average execution times. On the other hand, the jCE algorithm was slower in all hardware configurations.
Fig. 8.13 Execution time of similarity searches performed on both HDInsight clusters for all tested algorithms (jCE, jFATCAT-Rigid, jFATCAT-Flexible) on protein structures assembled in sequential files in one-to-many comparison scenario. HDInsight clusters were configured to work in the Map-only pattern of the MapReduce without the Reduce phase (variant M)
It was 4% slower than the jFATCAT-Rigid and 5% slower than the jFATCAT-Flexible on the 1-node HDInsight cluster, and 17% slower than the jFATCAT-Rigid and 20% slower than the jFATCAT-Flexible on the 4-node HDInsight cluster.
8.5.6 Influence of Protein Size

Similarity searching time also depends on the sizes of the proteins that are compared. While the size of the whole repository remains stable for some time, the size of the user's query protein highly influences the execution time. We therefore investigated how the execution time depends on the protein size for both similarity searching algorithms (jCE and jFATCAT). In these experiments we used the full repository of protein structures (93,121 proteins) stored in 186 sequence files, each 64 MB in size. Because of the long search times, tests were performed only on the 4-node HDInsight cluster for the Map-only implementation without the Reduce phase (variant M).
Execution times for similarity searches performed with the use of both algorithms (jCE, jFATCAT) are shown in Fig. 8.14. Results are shown for six molecules that differ in size: from 50 residues for PDB ID: 1PTQ [29], through 100 residues for PDB ID: 3QDA [23], 200 residues for PDB ID: 3VTG [8], 300 residues for PDB ID: 1XQZ [22], and 500 residues for PDB ID: 1CWY [21], up to 781 residues for PDB ID: 4BS9 [7]. We observed that the execution time increased with the growing size of the query protein, which was compared to the whole repository. This was expected, since for bigger query molecules larger alignment matrices must be calculated.
Fig. 8.14 Execution time of similarity searches performed against the whole repository of protein structures (93,121 protein structures assembled in 64 MB sequential files) on the 4-node HDInsight cluster for both tested algorithms (jCE, jFATCAT). HDInsight cluster was configured to work with the Map-only implementation, without the Reduce phase (variant M)
8.5.7 Scalability of the Solution

As we could observe, grouping protein structures in sequential files and skipping the Reduce phase in the MapReduce parallel computational procedure led to a significant reduction of the execution time, increased the speedup over the serial implementation on the PC workstation, and increased the Structural Alignment Speed. However, these tests were performed on small HDInsight clusters with just a subset of the PDB repository. In the experiments performed by our group, we wanted to verify the scalability of the solution by scaling out protein similarity searches on multiple nodes of the HDInsight cluster. In these experiments we used the full repository of protein structures (93,121 proteins) stored in 186 sequence files, each 64 MB in size. During the tests we changed the number of nodes of the HDInsight cluster from one to 48. Similarity searches were performed according to the Map-only execution pattern, without the Reduce phase (variant M).
Execution times for similarity searches performed as a serial procedure on the PC workstation and as a parallel procedure on the HDInsight cluster, as a function of the number of nodes of the cluster, are shown in Fig. 8.15. Results were collected for both algorithms, jCE and jFATCAT. Execution of the serial procedure for protein structure similarity searching took more than one hundred hours for both algorithms: 109 h and 5 min for jCE, and 105 h and 29 min for jFATCAT. Parallel execution on the HDInsight cluster reduced this time significantly. Even on the 1-node HDInsight cluster, we could observe a significant improvement. Then, by scaling the HDInsight4PSi system horizontally from one to 48 Data nodes, we were able to reduce the execution time from 40 h to 36 min for jCE, and from 35 h and 24 min to 33 min for jFATCAT.
Fig. 8.15 Execution time of similarity searches performed as a serial procedure on the PC workstation and as a parallel procedure on the HDInsight cluster against the whole repository of protein structures (93,121 protein structures), as a function of the number of nodes of the HDInsight cluster, for both tested algorithms (jCE (a), jFATCAT (b)). HDInsight cluster was configured to work with sequential files in one-to-many comparison scenario with Map-only execution pattern without the Reduce phase (variant M)
Speedups achieved by performing protein similarity searches as a parallel procedure on the HDInsight cluster over the serial execution on the PC workstation, as a function of the number of CPUs consumed by the HDInsight cluster, for both tested algorithms (jCE, jFATCAT), are shown in Fig. 8.16. The number of CPUs includes those utilized by the (two) Head nodes (hosting NameNodes), all Data nodes in action, and the Zookeeper. The utilization of CPUs for particular configurations of the HDInsight cluster can be found in Table 8.1. As can be observed in Fig. 8.16, the speedup of the Hadoop-based implementation over the PC-based implementation is significant. For the 48-node HDInsight (Hadoop/HBase) cluster, the Map-only implementation of jCE was 181 times faster than the serial implementation on the PC, and the Map-only implementation of jFATCAT was almost 191 times faster than its PC-based serial implementation. This significant (superlinear) speedup results from running multiple Map tasks on each Data node of the HDInsight cluster.
Table 8.5 shows how many Map tasks can be executed in parallel for various cluster sizes. The number of parallel Map tasks increases with the number of Data nodes, the amount of compute resources, and the number of YARN containers that are available to run the Map tasks in. Table 8.5 also shows the average number of
Fig. 8.16 Speedup achieved by performing similarity searches as a parallel procedure on the HDInsight cluster over serial execution on the PC workstation, as a function of the number of CPUs utilized by nodes of the HDInsight cluster, for both tested algorithms (jCE (a), jFATCAT (b)). Tests performed for the whole repository of protein structures (93,121 protein structures). HDInsight cluster was configured to work with sequential files in one-to-many comparison scenario with Map-only execution pattern without the Reduce phase (variant M)
cores per Map task. We can see that the average number of cores per Map task never falls below two, which confirms that the Application Master is able to negotiate an appropriate number of containers with the YARN Resource Manager in each cluster configuration (cluster size). For the 48-node cluster, the maximum number of Map tasks (191) exceeds the number of data splits (186), so all sequential files can be processed in one iteration. For smaller clusters, several iterations are needed, depending on the cluster size.
Important conclusions can also be drawn when studying the dependency between the Structural Alignment Speed achieved by the parallel procedure for similarity searching and the number of Data nodes of the HDInsight cluster, presented in Fig. 8.17 for both tested alignment algorithms (jCE, jFATCAT).
Table 8.5 Maximum total number of parallel MapReduce tasks (m_MR) and the CPU core-to-Map task ratio for the varying number of Data nodes of the HDInsight cluster used in tests

#Data nodes       | 1    | 2    | 4    | 8    | 12   | 16   | 24   | 48
#CPU cores¹       | 8    | 16   | 32   | 64   | 96   | 128  | 192  | 384
m_MR              | 3    | 7    | 15   | 31   | 47   | 63   | 95   | 191
#CPU cores¹/m_MR  | 2.67 | 2.29 | 2.13 | 2.06 | 2.04 | 2.03 | 2.02 | 2.01

¹ #CPU cores available on Data nodes of the Hadoop cluster (excl. Head nodes and the Zookeeper)
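As a worked reading of Table 8.5: the 4-node cluster exposes 32 CPU cores on its Data nodes and can run at most 15 Map tasks concurrently, i.e., 32/15 ≈ 2.13 cores per Map task; with 186 splits, it needs ⌈186/15⌉ = 13 iterations of Map tasks, whereas the 48-node cluster, with 191 concurrent Map tasks, processes all 186 splits in a single iteration.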
Fig. 8.17 Structural Alignment Speed achieved by parallel procedure for similarity searching for various sizes of the HDInsight cluster, for both tested algorithms: a jCE, b jFATCAT. Tests performed for the whole repository of protein structures (93,121 protein structures). HDInsight cluster was configured to work with sequential files in one-to-many comparison scenario and with Map-only execution pattern of the MapReduce (variant M)
When scaling the HDInsight cluster from one to 48 Data nodes, the Structural Alignment Speed increased from 82×10³ residues²/s to 5,465×10³ residues²/s for jCE, and from 93×10³ residues²/s to 5,941×10³ residues²/s for jFATCAT. In this series of experiments, the PC workstation achieved a speed of 30,101 residues²/s for jCE and 31,127 residues²/s for jFATCAT. It is worth noting that the Structural Alignment Speed reached by HDInsight4PSi is practically linearly proportional to the number of HDInsight cluster nodes. Small fluctuations are caused by the idle time of some nodes of the cluster, resulting from the various sizes of protein structures in particular sequential files and the unbalanced load of compute units.
8.6 Discussion

For many reasons, protein structures are worth exploring, and this exploration still leaves much room for potential applications of its results. The amount of macromolecular data in repositories such as the Protein Data Bank is constantly growing, which makes processing and analyzing the data on a single machine difficult, if not impossible. This leads to the necessity of using computer clusters for their sophisticated exploration, comprising similarity searching, alignment, and superposition. The application of Hadoop and the MapReduce data processing pattern in this exploration is an important way of bringing the latest advances in data analysis to the processing of 3D structures of protein molecules.
As we could see, HDInsight on Azure is able to handle the growing amount of macromolecular data and perform time-consuming similarity searches much faster than a single workstation. The implementations of 3D protein structure similarity searching that utilized the Map-only execution pattern on the 48-node HDInsight cluster were almost 200 times faster than serial implementations of the alignment algorithms on the standard PC workstation. This hardware configuration allowed us to achieve the highest Structural Alignment Speed of almost 6 million residues²/s on the A10 virtual machines used for scaling the cluster.
The presented HDInsight4PSi possesses several unique features that positively affect the computational efficiency of the system. The most important characteristics are the use of sequential files, the application of the Map-only execution pattern, and scaling the MapReduce computations in the Azure cloud, which allows an on-demand allocation of additional resources. It is worth noting that the 48 Data nodes that we used in our experiments are not an upper limit for scaling the HDInsight cluster, but were enough to perform similarity searches against the whole repository with the use of 64 MB sequential files in a single iteration (there were enough YARN containers to run all Map tasks at the same time and process all available sequential files). With the growth of the PDB database in the future, it is possible to scale the cluster by adding more nodes.
The utilization of the Map-only execution pattern of the MapReduce and skipping the Reduce phase in the HDInsight4PSi once again confirmed additional time savings: it brought a 15–16% improvement over the full MapReduce variant for larger clusters.
Since the whole alignment process is performed in the Map phase, the results of the alignment do not have to be grouped, sorted, or aggregated, and since the Reduce phase in the full MapReduce variant was only responsible for collecting and saving results in the HBase database, we could skip the Reduce phase and save the results in the final steps of the Map phase.
HDInsight4PSi was also faster than its predecessors: Cloud4PSi [11, 13], presented in Chap. 4, and H4P [14], presented in Chap. 7. In terms of Structural Alignment Speed, we experimentally checked that Cloud4PSi achieved 94,603 residues²/s (working with sixteen Searcher roles) for jFATCAT, H4P achieved 160,046 residues²/s (on an 8-node Hadoop cluster with a two-Map-tasks-per-node execution scheme) for jFATCAT, while HDInsight4PSi reached up to 304,434 residues²/s [12] (A3-sized VMs) and 356,276 residues²/s (A10-sized VMs) on the 4-node HDInsight cluster (providing the same degree of parallelism) for the jFATCAT alignment algorithm.
While testing the performance of HDInsight4PSi, we used the full repository of macromolecular data (93,121 structures) available at the time. This allowed us to discover potential problems when processing such a large volume of data. It is worth noting that the sizes of some protein structures may pose a problem for HDInsight clusters that are based on small-sized virtual machines (e.g., the A3 machines used in the experiments presented in [12]). These problems occur due to the limited memory available on such machines (for details on the available memory for particular sizes of virtual machines, please refer to Chap. 3). Fortunately, Microsoft Azure allows not only scaling out the cluster of computers by adding more computing nodes, but also scaling up, i.e., raising the size of the compute units in order to provide higher computational capabilities.
8.7 Summary

The growth of macromolecular data in repositories such as the Protein Data Bank is unavoidable, since the number of protein structures is lagging far behind the number of protein sequences that are already known. However, in order to gain insights into the molecular basis of many diseases, the scientific community needs this type of information, as well as methods and systems that will deliver useful clues based on the variety and the large volume of data in a reasonable time. The parallel implementation of protein similarity searching presented in this chapter shortens this time significantly. Moreover, the HDInsight4PSi system with the implemented parallel procedure for similarity searching can be easily scaled in the Cloud, both vertically and horizontally, in order to handle the dynamic growth of macromolecular data. At the same time, the implemented MapReduce variants of two popular alignment methods guarantee the quality of results, which makes the presented approach very useful in many scientific areas that draw conclusions from the analyses of 3D protein structures.
At the moment, the HDInsight service in the Microsoft Azure cloud allows creating not only Hadoop clusters, but also Spark clusters for fast, sophisticated computations.
In the next chapter, we will see how the Spark cluster created in the Azure cloud is utilized for predicting intrinsically disordered regions in protein structures.

Acknowledgements We would like to thank Microsoft Research for providing us with free access to the computational resources of the Microsoft Azure cloud within the Microsoft Azure for Research Award grant.
References

1. Berman, H., et al.: The Protein Data Bank. Nucleic Acids Res. 28, 235–242 (2000)
2. BioSQL Homepage: http://biosql.org/. Accessed on: January 20, 2018
3. Bourne, P., Berman, H., Watenpaugh, K., et al.: The macromolecular crystallographic information file (mmCIF). Methods Enzymol. 277, 571–590 (1997)
4. Daniłowicz, P.: Protein structure similarity searching in distributed system. Master's thesis, Institute of Informatics, Silesian University of Technology, Gliwice, Poland (2014)
5. Gannon, D., Fay, D., Green, D., Takeda, K., Yi, W.: Science in the cloud: lessons from three years of research projects on Microsoft Azure. In: Proceedings of the 5th ACM Workshop on Scientific Cloud Computing, pp. 1–8 (2014)
6. George, L.: HBase: The Definitive Guide, 1st edn. O'Reilly Media, Sebastopol, CA, USA (2011)
7. Koehnke, J., Bent, A.F., Zollman, D., Smith, K., Houssen, W.E., Zhu, X., Mann, G., Lebl, T., Scharff, R., Shirran, S., Botting, C.H., Jaspars, M., Schwarz-Linek, U., Naismith, J.H.: The cyanobactin heterocyclase enzyme: A processive adenylase that operates with a defined order of reaction. Angewandte Chemie International Edition 52(52), 13991–13996 (2013), https://onlinelibrary.wiley.com/doi/abs/10.1002/anie.201306302
8. Kudo, N., Yasumasu, S., Iuchi, I., Tanokura, M.: Crystal structure of high choriolytic enzyme 1 (HCE-1), a hatching enzyme from Oryzias latipes (Medaka fish), https://www.rcsb.org/structure/3VTG
9. Lima, L., da Silva, A., de Palmieri, C., Oliveira, M., Foguel, D., Polikarpov, I.: Identification of a novel ligand binding motif in the transthyretin channel. Bioorg Med Chem. 18(1), 100–110 (2010)
10. Mell, P., Grance, T.: The NIST definition of Cloud Computing. Special Publication 800-145 (2011), http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf (accessed on May 7, 2018)
11. Mrozek, D.: High-Performance Computational Solutions in Protein Bioinformatics. SpringerBriefs in Computer Science. Springer International Publishing (2014)
12. Mrozek, D., Daniłowicz, P., Małysiak-Mrozek, B.: HDInsight4PSi: Boosting performance of 3D protein structure similarity searching with HDInsight clusters in Microsoft Azure cloud. Information Sciences 349–350, 77–101 (2016)
13. Mrozek, D., Małysiak-Mrozek, B., Kłapciński, A.: Cloud4Psi: Cloud computing for 3D protein structure similarity searching. Bioinformatics 30(19), 2822–2825 (2014)
14. Mrozek, D., Suwała, M., Małysiak-Mrozek, B.: High-throughput and scalable protein function identification with Hadoop and Map-only pattern of the MapReduce processing model. J Knowl Inf Syst (in press), http://dx.doi.org/10.1007/s10115-018-1245-3
15. Mrozek, D., Wieczorek, D., Malysiak-Mrozek, B., Kozielski, S.: PSS-SQL: Protein Secondary Structure - Structured Query Language. In: 2010 Annual International Conference of the IEEE Engineering in Medicine and Biology, pp. 1073–1076 (2010)
16. Mrozek, D., Małysiak-Mrozek, B., Adamek, R.: P3D-SQL: Extending Oracle PL/SQL capabilities towards 3D protein structure similarity searching. In: Ortuño, F., Rojas, I. (eds.) Bioinformatics and Biomedical Engineering. Lecture Notes in Comput. Sci., vol. 9043, pp. 548–556. Springer International Publishing, Cham (2015)
17. Mrozek, D., Socha, B., Kozielski, S., Małysiak-Mrozek, B.: An efficient and flexible scanning of databases of protein secondary structures. Journal of Intelligent Information Systems 46(1), 213–233 (2016), https://doi.org/10.1007/s10844-014-0353-0
18. Murzin, A.G., Brenner, S.E., Hubbard, T., Chothia, C.: SCOP: A structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology 247(4), 536–540 (1995), http://www.sciencedirect.com/science/article/pii/S0022283605801342
19. Prlić, A., Bliven, S., Rose, P., Bluhm, W., Bizon, C., Godzik, A., Bourne, P.: Pre-calculated protein structure alignments at the RCSB PDB website. Bioinformatics 26, 2983–2985 (2010)
20. Prlić, A., Yates, A., Bliven, S., et al.: BioJava: an open-source framework for bioinformatics in 2012. Bioinformatics 28, 2693–2695 (2012)
21. Przylas, I., Tomoo, K., Terada, Y., Takaha, T., Fujii, K., Saenger, W., Sträter, N.: Crystal structure of amylomaltase from Thermus aquaticus, a glycosyltransferase catalysing the production of large cyclic glucans. Journal of Molecular Biology 296(3), 873–886 (2000), http://www.sciencedirect.com/science/article/pii/S0022283699935039
22. Qian, K.C., Wang, L., Hickey, E.R., Studts, J., Barringer, K., Peng, C., Kronkaitis, A., Li, J., White, A., Mische, S., Farmer, B.: Structural basis of constitutive activity and a unique nucleotide binding mode of Human Pim-1 Kinase. Journal of Biological Chemistry 280(7), 6130–6137 (2005), http://dx.doi.org/10.1074/jbc.m409123200
23. Raimondi, S., Barbarini, N., Mangione, P., Esposito, G., Ricagno, S., Bolognesi, M., Zorzoli, I., Marchese, L., Soria, C., Bellazzi, R., Monti, M., Stoppini, M., Stefanelli, M., Magni, P., Bellotti, V.: The two tryptophans of β2-microglobulin have distinct roles in function and folding and might represent two independent responses to evolutionary pressure. BMC Evolutionary Biology 11(1), 159 (Jun 2011), https://doi.org/10.1186/1471-2148-11-159
24. Shindyalov, I., Bourne, P.: Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Engineering 11(9), 739–747 (1998)
25. Sosinsky, B.: Cloud Computing Bible, 1st edn. Wiley, New York, USA (2011)
26. Westbrook, J., Ito, N., Nakamura, H., Henrick, K., Berman, H.: PDBML: the representation of archival macromolecular structure data in XML. Bioinformatics 21(7), 988–992 (2005)
27. Westbrook, J., Fitzgerald, P.: The PDB format, mmCIF, and other data formats. Methods Biochem Anal. 44, 161–179 (2003)
28. Ye, Y., Godzik, A.: Flexible structure alignment by chaining aligned fragment pairs allowing twists. Bioinformatics 19(2), 246–255 (2003)
29. Zhang, G., Kazanietz, M.G., Blumberg, P.M., Hurley, J.H.: Crystal structure of the Cys2 activator-binding domain of protein kinase C delta in complex with phorbol ester. Cell 81(6), 917–924 (1995), http://www.sciencedirect.com/science/article/pii/009286749590011X
Chapter 9
Scalable Prediction of Intrinsically Disordered Protein Regions with Spark Clusters on Microsoft Azure Cloud
Big data is mostly about taking numbers and using those numbers to make predictions about the future. The bigger the data set you have, the more accurate the predictions about the future will be.
– Anthony Goldbloom
Abstract Intrinsically disordered proteins (IDPs) constitute a wide range of molecules that act in cells of living organisms and mediate many protein–protein interactions and many regulatory processes. Computational identification of disordered regions in protein amino acid sequences, thus, became an important branch of 3D protein structure prediction and modeling. In this chapter, we will see the IDP meta-predictor that applies an ensemble of primary predictors in order to increase the quality of IDP prediction. We will also see the highly scalable implementation of the meta-predictor on the Spark cluster (Spark-IDPP) that mitigates the problem of the exponentially growing number of protein amino acid sequences in public repositories. Spark-IDPP responds very well to the current needs of IDP prediction by parallelizing computations on the Spark cluster that can be scaled on demand on the Microsoft Azure cloud according to particular requirements for computing power.

Keywords Proteins · 3D protein structure · Tertiary structure · Intrinsically disordered proteins · Cloud computing · Parallel computing · Spark · Microsoft Azure · Public cloud
9.1 Intrinsically Disordered Proteins

Determination of 3D protein structures became an important branch of structural biology, since the knowledge of 3D protein structures allows drawing conclusions about molecular mechanisms of cellular biochemical reactions or particular diseases, and it supports drug design. However, as was gradually observed from the 1990s, not all known proteins have a stable (ordered) native 3D structure. Some proteins have shorter or longer segments that indicate instability or flexibility, which are called intrinsically disordered regions (IDRs).
Fig. 9.1 Human SUMO1 protein with identified intrinsically disordered regions in N- and C-terminal regions. Based on the (PDB:1A5R) [4] structure from the Protein Data Bank. Visualized with the use of NGL viewer [56]
Such proteins are usually known as intrinsically disordered proteins (IDPs, Fig. 9.1), and they may consist of one or several IDRs, or, in the extreme case, can be completely unstructured (intrinsically unstructured proteins, IUPs). Intrinsically disordered proteins play many important roles in cells of living organisms, and on the basis of various studies carried out by scientists on whole proteomes, it is estimated that the percentage of IDPs in mammals is very large [16].
Determination of 3D structures of intrinsically disordered proteins with traditional methods, like X-ray crystallography or Nuclear Magnetic Resonance (NMR), is difficult due to, e.g., the lack of electron density in crystal structures (which is marked in PDB files describing protein macromolecules by the REMARK465 record). For this reason, IDP predictors have come to play an important role in the determination of unstructured regions. IDP prediction became an important part of computational protein structure determination. It supports studies of protein structural features, functional analysis of proteins, and investigations of the relationships between IDRs and the occurrence of particular diseases. As strictly computational tools, IDP predictors are able to predict protein disorder from the pure amino acid sequence.
However, the number of known genetic sequences (DNA sequences) that may encode proteins is constantly growing. The huge gap between the number of known genetic sequences, protein sequences (which are encoded by genes in DNA), and 3D protein structures increases every year. For example, as of June 20, 2018 there were 209,775,348 DNA sequences and 639,804,105 WGS sequences in the GenBank database [5], 557,713 reviewed and 116,030,110 non-reviewed protein amino acid sequences in the UniProtKB/Swiss-Prot and UniProtKB/TrEMBL databases [7, 62], and (only) 141,616 three-dimensional macromolecular (mainly protein) structures in the Protein Data Bank (PDB) [6]. Taking into account the growth of the number of DNA sequences in the GenBank, and consequently, protein amino acid sequences in the UniProtKB/Swiss-Prot database, it is essential to provide efficient and effective tools for IDP prediction. These tools must be able to scale the computational procedure in order to accommodate the growing volume of DNA and, subsequently, protein sequences.
9.2 IDP Predictors

Accurate and efficient prediction of disordered regions in protein structures on the basis of the pure amino acid sequence is one of the challenges in computational biology and structural genomics. Existing methods for IDP prediction rely on the analysis of various features of proteins, e.g., protein amino acid composition or physical–chemical characteristics of particular amino acids. Some of the methods are grounded in statistical observations; others use machine learning algorithms.
For example, GlobPlot [35] is a simple prediction method that identifies regions of globularity and disorder within protein sequences on the basis of the propensities of particular amino acids to be in globular or non-globular states. GlobPlot calculates a running sum of the propensity for amino acids to be in an ordered or disordered state and uses it to classify regions of the given amino acid sequence (a minimal sketch of this running-sum idea is shown at the end of this section). IUPred [15] also rests on physical foundations and statistical observations of inter-residue interactions in protein structures. To discriminate between the ordered and disordered class, IUPred uses statistical interaction potentials. On the other hand, DisEMBL [34] uses trained artificial neural networks (ANNs) to predict protein disorder. There are three variants of the method: (1) Coils, used to predict classic loops and coils as defined by DSSP [28]; (2) Hot loops, for prediction of flexible loops with a high degree of mobility determined by temperature factors (B-factors); (3) Remark 465, which predicts missing coordinates in X-ray structures as defined by REMARK465 records in the PDB files describing macromolecular structures of proteins. DISpro [10] uses evolutionary information in the form of protein profiles, predicted secondary structure and relative solvent accessibility, and ensembles of 1D-recursive neural networks. RONN [72] classifies disordered regions on the basis of observed homology between protein sequences. The homology is expressed in alignment scores calculated while comparing given protein sequences with a series of amino acid sequences of known folding state (ordered, disordered, or a mixture of both). The obtained alignment scores are then used in the prediction process performed by a suitably trained regional order neural network. SPINE-D [74] classifies residues into three classes: structurally ordered, disordered, and semi-ordered. The method also uses a trained artificial neural network. Several methods, including DISOPRED2 [68], Poodle-s [58], Poodle-l [20], PrDOS [25], and Spritz [64], use SVM-based classifiers to predict disordered regions on the basis of various features extracted from protein sequences. For example, Poodle-s, Poodle-l, and PrDOS perform prediction on the basis of position-specific scoring matrices (PSSMs) generated by PSI-BLAST [1], while Spritz uses amino acid frequencies in disordered regions. PSSMs with respect to physicochemical properties of amino acids are also used in the iPDA server [60] and its underlying DisPSSMP predictor.
Recent works in this area, including Xue et al. [70] and Kozlowski and Bujnicki [30], show that meta-prediction with the use of ensembles composed of subsets of the described basic predictors may improve the quality of prediction results. The method presented in this chapter also works on the basis of an ensemble of basic predictors. However, so far, none of the prediction or meta-prediction methods was designed to perform large-scale identification of disordered regions and to deal with large volumes of protein sequences.
Meanwhile, there are real-life problems in which massive parallelization of computations on Apache Hadoop or Spark and the use of scalable environments, like the Cloud, brought significant improvements in the performance of data processing and analysis. The Big Data challenge was observed and solved in various works devoted to intelligent transport and smart cities [11, 18, 37, 38, 65, 66, 71], water monitoring [12, 21, 75], social network analysis [13, 14, 67], multimedia processing [63, 69], the internet of things (IoT) [9], social media monitoring [45], life sciences [39, 41, 51, 61], telecommunication [26], and finance [2], to mention just a few. Many hot issues in various sub-fields of bioinformatics were also solved with the use of Big Data ecosystems and Cloud computing, e.g., mapping next-generation sequencing data to the human genome and other reference genomes, for use in a variety of biological analyses including SNP discovery, genotyping, and personal genomics [57], sequence analysis and assembly [17, 29, 31, 32, 43, 54], multiple alignments of DNA and RNA sequences [76], codon analysis with local MapReduce aggregations [55], NGS data analysis [8], phylogeny [23, 44], proteomics [33], and analysis of protein–ligand binding sites [22]. Regarding the analysis of 3D protein structures, it is worth mentioning several works, including Hazelhurst et al. [19] and Małysiak-Mrozek et al. [42], devoted to the exploration of various atomic interactions within protein structures, the works of Che-Lun Hung and Yaw-Ling Lin [24] and Mrozek et al. [46, 47, 49, 50], devoted to comparison and alignment of 3D protein structures, and the cloud-based system for 3D protein structure modeling presented in [48]. However, none of the mentioned works was focused on the prediction of disordered regions.
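As promised above, the following Java sketch illustrates the running-sum idea behind GlobPlot-style propensity profiles. The propensity values used here are purely illustrative placeholders (GlobPlot derives its actual scales, e.g., the Russell/Linding scale, from structural statistics), and the class is an assumption made for illustration, not code from any of the cited tools.

import java.util.HashMap;
import java.util.Map;

public class PropensityProfile {

    // Illustrative propensity values only (disorder-promoting > 0, order-promoting < 0);
    // GlobPlot derives its real scales from structural statistics.
    private static final Map<Character, Double> PROPENSITY = new HashMap<>();
    static {
        PROPENSITY.put('P', 0.55);
        PROPENSITY.put('E', 0.40);
        PROPENSITY.put('S', 0.30);
        PROPENSITY.put('I', -0.50);
        PROPENSITY.put('F', -0.70);
        PROPENSITY.put('W', -0.90);
        // remaining amino acids omitted; unknown residues contribute 0.0 below
    }

    // Returns the cumulative propensity profile; upward-sloping stretches of this
    // curve suggest disordered regions, downward-sloping stretches suggest order.
    public static double[] runningSum(String sequence) {
        double[] profile = new double[sequence.length()];
        double sum = 0.0;
        for (int i = 0; i < sequence.length(); i++) {
            sum += PROPENSITY.getOrDefault(sequence.charAt(i), 0.0);
            profile[i] = sum;
        }
        return profile;
    }
}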
9.3 IDPP Meta-Predictor

With the growing volume of biological data describing various aspects of living forms, the increasing demand for fast data analysis (velocity), and the variety of data formats, structural bioinformatics has been facing the Big Data challenge. This requires changing the way we cope with biological data and redeveloping existing tools for data analysis and processing, thus directing us to Big Data ecosystems. As we know from Chap. 6, Apache Hadoop with the MapReduce processing model and Apache Spark are both popular Big Data frameworks. However, since Spark allows running most computations in memory, without the necessity of storing intermediate results in a file system, it provides better performance for many applications [73]. Therefore, in order to meet the requirements for fast IDP predictions, our group decided to use Spark for the development of an efficient IDP meta-predictor.
The IDP meta-predictor (IDPP, intrinsically disordered proteins meta-predictor) and its Spark-based version for high-throughput and scalable predictions (Spark-IDPP) were developed in the Institute of Informatics at the Silesian University of Technology in Gliwice, Poland, within the Cloud4Proteins group. The development work started in 2016. The main constructors of the system were Tomasz Baron (within his Master's thesis [3]), Bożena Małysiak-Mrozek, and me.
Fig. 9.2 Main phases of the prediction of disordered regions in the IDPP meta-predictor: Voting (primary prediction), Consensus (aggregating votes), and Fuzzy filtering
The IDPP is an ensemble method for predicting intrinsically disordered proteins. The IDPP classifies disordered regions on the basis of a consensus of votes cast by component, basic predictors (Fig. 9.2). Several basic (primary) predictors are used to vote. They identify which regions of protein sequences can be disordered by assigning classes (ordered/disordered) or probabilities of disorder to each residue in the protein amino acid chain. Then, the votes are aggregated in the Consensus phase. The IDPP meta-predictor can work in four consensus modes, in which it aggregates votes in different ways. Then, it processes the predicted disordered regions in order to eliminate outliers.
The Spark-IDPP [40] is the implementation of the ensemble method on the Spark cluster to speed up computations related to IDP prediction. Spark-IDPP allows for large-scale prediction of intrinsically disordered proteins or intrinsically disordered regions of protein structures on the basis of amino acid sequences. It provides significantly better performance for the large volumes of sequence data that are processed and analyzed in protein bioinformatics than the local IDPP application. Since Spark clusters are available as a service on public clouds, we decided to use the Microsoft Azure cloud platform in order to quickly create the Spark cluster for IDP prediction and perform computations on a large scale. This also allowed us to flexibly scale the cluster on demand in order to accommodate data growth, raise the computing power, and test various cluster configurations for particular volumes of protein sequence data.
In the following sections, we will see the architecture of the IDPP meta-predictor, the foundations of the prediction method working on the basis of consensus, the filtering method used for the elimination of outliers, and details of the implementation of the IDPP meta-predictor on the Spark cluster [40].
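To give a flavor of how such a consensus over component predictors can be computed, the following Java sketch implements a simple binary majority vote per residue. It is a minimal illustration only: the real IDPP supports four consensus modes (described in Sect. 9.5), and the class and method names here are assumptions.

public class BinaryConsensus {

    // votes[p][i] == 1 if component predictor p classifies residue i as disordered.
    // threshold is the fraction of predictors that must agree (0.5 = simple majority).
    public static int[] vote(int[][] votes, double threshold) {
        int residues = votes[0].length;
        int[] consensus = new int[residues];
        for (int i = 0; i < residues; i++) {
            int agreeing = 0;
            for (int[] predictor : votes) {
                agreeing += predictor[i];
            }
            consensus[i] = (agreeing >= threshold * votes.length) ? 1 : 0;
        }
        return consensus;
    }
}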
9.4 Architecture of the IDPP Meta-Predictor

The IDPP meta-predictor performs prediction of intrinsically disordered proteins or disordered regions for given protein amino acid sequences. The general architecture of the IDPP meta-predictor is presented in Fig. 9.3. It consists of the following modules:
• Component Predictors Module (CPM),
• Consensus Module (CoM),
• Analysis Module (AM).
Fig. 9.3 Architecture of the IDPP meta-predictor. Reproduced from [3, 40] with permissions
Component Predictors Module (CPM) is responsible for the prediction of disordered regions. Prediction can be performed for many protein sequences provided in the FASTA format [36] at the input of the CPM. A sample input sequence in the FASTA format is presented in Listing 9.1. The first line of each entry in the FASTA file contains descriptive information, e.g., identifiers of the protein sequence in various public repositories. The following lines contain the amino acid sequence of the protein.
>DisProt|DP00004|uniprot|P49913|unigene|Hs.51120|sp|CAMP_HUMAN
MKTQRNGHSLGRWSLVLLLLGLVMPLAIIAQVLSYKEAVLRAIDGINQR
SSDANLYRLLDLDPRPTMDGDPDTPKPVSFTVKETVCPRTTQQSPEDC
DFKKDGLVKRCMGTVTLNQARGSFDISCDKDNKRFALLGDFFRKSKEK
IGKEFKRIVQRIKDFLRNLVPRTES
Listing 9.1 A sample input sequence in the FASTA format.
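As an illustration of the format, the following sketch reads such entries by collecting the description line and concatenating the sequence lines that follow it. It is an assumption-based example, not part of the IDPP code; the FastaReader class and readFasta method are hypothetical names.

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FastaReader {
    // Maps each description line (without the leading '>') to its concatenated sequence
    static Map<String, String> readFasta(List<String> lines) {
        Map<String, String> sequences = new LinkedHashMap<>();
        String header = null;
        StringBuilder seq = new StringBuilder();
        for (String line : lines) {
            if (line.startsWith(">")) {            // a new FASTA entry begins
                if (header != null) sequences.put(header, seq.toString());
                header = line.substring(1).trim();
                seq.setLength(0);
            } else {
                seq.append(line.trim());           // sequence lines are concatenated
            }
        }
        if (header != null) sequences.put(header, seq.toString());
        return sequences;
    }
}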
Component Predictors Module contains several basic predictors for intrinsically disordered regions. Each of the component predictors accepts an amino acid sequence at the input and performs independent predictions of intrinsically disordered regions on the basis of the input sequence [3]. The CPM of the IDPP meta-predictor implements the following basic prediction algorithms:
• RONN [72];
• DisEMBL in three variants: Coils, Remark465, and Hotloops [34];
• IUPred Short and IUPred Long [15];
• GlobPlot [35].
Each of the component predictors generates its results containing the list of regions which are supposed to be disordered. These results contain the information on the probability that each amino acid belongs to a disordered region, the binary classification of belonging to a disordered region (1—belongs, 0—does not belong), and lists of disordered regions expressed as ranges of amino acid positions. The Consensus Module (CoM) aggregates results from component predictors on the basis of consensus. There are four consensus approaches (also called consensus modes) implemented and tested in the Consensus Module: two operating on the binary classification from component predictors and two operating on float values of the returned probability. All will be described in detail in Sect. 9.5. In the next phase, the CoM performs filtering of the consensus results with the use of a fuzzy smoothing filter in order to remove probable outliers. After filtering out single disordered positions, the Consensus Module finally classifies particular amino acids in the protein sequence as belonging to a disordered region or not, on the basis of aggregated probabilities or binary classification results. The final classification is performed for the given cutoff threshold obtained experimentally [3, 40]. The Analysis Module (AM) allows assessing the quality of classification and finding potential cutoff thresholds for component predictors. The AM allows analyzing results of the prediction process and comparing them to a ground truth (actual disordered regions that were discovered experimentally or collected manually from the literature, stored in the DisProt database [52]). Results of the analyses are provided in the form of true-positive, true-negative, false-positive, and false-negative rates. Results of the prediction process are returned as text files, saved on a hard disk drive, or on the standard output of the IDPP meta-predictor. For example, a result of a prediction of disordered regions for a single amino acid sequence of the protein Cyclin-dependent kinase inhibitor 1B from Homo sapiens (accession number: P46527, entry name: CDN1B_HUMAN in the UniProtKB/Swiss-Prot database) may look as shown in Listing 9.2. The first line of the result contains the information that allows identifying the protein. The second line contains the information on disordered regions identified in the sequence, expressed as ranges of amino acid positions.
>DisProt|DP00018|uniprot|P46527|unigene|Hs.238990|sp|CDN1B_HUMAN
#22-34 #96-108
Listing 9.2 A result of prediction of disordered regions for a single amino acid sequence.
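The region ranges in such result lines can be recovered programmatically. The small sketch below is illustrative only (the RegionParser class and parseRegions helper are hypothetical, not part of the IDPP code) and assumes hyphen-separated ranges as in Listing 9.2.

import java.util.ArrayList;
import java.util.List;

public class RegionParser {
    // Parses tokens such as "#22-34" into [start, end] position pairs
    static List<int[]> parseRegions(String line) {
        List<int[]> regions = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (token.startsWith("#")) {
                String[] bounds = token.substring(1).split("-");
                regions.add(new int[]{Integer.parseInt(bounds[0]), Integer.parseInt(bounds[1])});
            }
        }
        return regions;
    }
}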
9.5 Reaching Consensus
The IDPP meta-predictor consists of several component predictors that classify amino acids as belonging to a disordered region or not. These predictors work on the basis of various algorithms, and they differ in prediction effectiveness. However, we assumed that the aggregation of their votes may improve the prediction effectiveness. In the IDPP meta-predictor, we applied the following four consensus modes—two operating on results of binary classification from seven component
Fig. 9.4 Overview of the information flow in the Consensus Module in various consensus modes: the Simple/Weighted Binary Consensus produces a sequence of classes (0/1) that is smoothed by the Fuzzy Smoothing Filter for Binary Values, while the Simple/Weighted Float Consensus produces a sequence of probabilities that is smoothed by the Fuzzy Smoothing Filter for Float Values; in both paths, the final classification based on λ yields a list of disordered regions and a sequence of probabilities. Reproduced from [3, 40] with permissions and corrected according to changes in the system design
predictors (RONN, IUPred in two variants, GlobPlot, and DisEMBL in three variants) and two operating on float values of the returned probabilities of belonging to a disordered region (as presented in Fig. 9.4):
• Simple Binary,
• Weighted Binary,
• Simple Float,
• Weighted Float.
The Simple Binary (SB) consensus mode makes use of only binary scores returned by all component predictors (0/1 classification), and votes of all component predictors are equally important while aggregating component decisions (equal weights). In this consensus mode, the IDPP meta-predictor evaluates votes of component predictors and aggregates ranks for each ith residue in the protein amino acid sequence according to the following formula:

preScore_i^{IDPP-Bin} = \frac{\sum_{j \in J} w_j \cdot Score_{i,j}^{Bin}}{\sum_{j \in J} w_j},     (9.1)
where J is a set of basic component predictors (RONN, GlobPlot, IUPred Short, IUPred Long, DisEMBL Hotloops, DisEMBL Remark465, DisEMBL Coils), Score_{i,j}^{Bin} is the binary score returned by the jth component predictor for the ith residue (1—belongs to a disordered region, 0—does not belong), and w_j = 1.0 are weights of importance of particular component predictors (w_j = 0.0, if we want to eliminate the jth predictor from voting).
Like the Simple Binary consensus mode, the Weighted Binary (WB) consensus mode uses only binary scores returned by all component predictors (0/1 classification), but votes of all component predictors have different weights while aggregating component decisions. Weights of importance for particular component predictors (w_j in Eq. 9.1) are calculated according to the following formula:

S_{W,j} = \frac{S}{S_{max}} = \frac{W_{disorder} \cdot TP - W_{order} \cdot FP + W_{order} \cdot TN - W_{disorder} \cdot FN}{W_{disorder} \cdot (TP + FN) + W_{order} \cdot (TN + FP)},     (9.2)
where W_{disorder} equals the fraction of disordered residues, W_{order} equals the fraction of ordered residues, TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives. The S_{W,j} weighted score rewards a correct disorder prediction higher than a correct order prediction [27]. This is done to avoid over-prediction of the ordered state due to the fact that ordered regions are more common in known proteins. Values of the S_{W,j} weighted score for particular component predictors of the IDPP were obtained experimentally (see Table 9.3 in Sect. 9.8.4).
In both binary consensus modes, particular amino acids in the protein sequence (i) are then pre-classified as belonging to a disordered region or not:

preClass_i^{IDPP-Bin} = \begin{cases} 1 & \text{if } preScore_i^{IDPP-Bin} \geq \lambda_0 \\ 0 & \text{if } preScore_i^{IDPP-Bin} < \lambda_0 \end{cases},     (9.3)
where preScore_i^{IDPP-Bin} is the score aggregated in one of the binary consensus modes (Eq. 9.1), and λ_0 is a qualification threshold set experimentally to 0.5.
The Simple Float (SF) consensus mode makes use of probabilities returned by all component predictors for each residue of the amino acid chain. Votes of all component predictors are equally important while aggregating component decisions (equal weights, as in the Simple Binary consensus mode). In this consensus mode, the IDPP meta-predictor evaluates votes of component predictors and aggregates probabilities for each ith residue in the protein amino acid sequence according to the following formula:

preScore_i^{IDPP-Flo} = \frac{\sum_{j \in J} w_j \cdot Prob_{i,j}}{\sum_{j \in J} w_j},     (9.4)
where J is the set of basic component predictors, Prob_{i,j} is the probability returned by the jth component predictor for the ith residue that the residue belongs to a disordered region, and w_j = 1.0 are weights of importance of particular component predictors (w_j = 0.0, if we want to eliminate the jth predictor from voting). Similarly, the Weighted Float (WF) consensus mode works on the probabilities, calculated by component predictors, that a residue belongs to a disordered region. However, while aggregating component decisions, votes of all component predictors are weighted according to the S_{W,j} score (Eq. 9.2). In contrast to binary consensus modes, results of float consensus modes are not pre-classified; the fuzzy filtering in the next phase is performed directly on the aggregated probabilities preScore_i^{IDPP-Flo}.
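To make Eqs. 9.1, 9.3, and 9.4 concrete, the following sketch aggregates component votes as a weighted mean and applies the binary pre-classification. It is an illustration only—ConsensusSketch and its method names are assumptions, not the authors' implementation.

public class ConsensusSketch {

    // Weighted mean of component scores for one residue (Eqs. 9.1 and 9.4).
    // For binary modes, scores[j] holds 0/1 votes; for float modes, probabilities.
    // Simple modes use equal weights (all 1.0); weighted modes use S_{W,j} values.
    static double preScore(double[] scores, double[] weights) {
        double num = 0.0, den = 0.0;
        for (int j = 0; j < scores.length; j++) {
            num += weights[j] * scores[j];
            den += weights[j];          // w_j = 0.0 removes predictor j from voting
        }
        return num / den;
    }

    // Pre-classification for binary consensus modes (Eq. 9.3, lambda0 = 0.5)
    static int preClass(double preScore, double lambda0) {
        return preScore >= lambda0 ? 1 : 0;
    }
}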
9.6 Filtering Outliers
Disordered regions are usually formed by many successive residues in the protein chain, rather than by single amino acids. Therefore, single amino acids or short segments predicted as disordered regions, separated by short segments of ordered regions, should be eliminated from the final result. To this purpose, we implemented a fuzzy smoothing filter that discards such small disordered “islands.” The filter runs through the sequence of classes (0/1) or probabilities that each residue in the protein sequence was assigned in the Consensus Module and replaces each entry by a new probability value. The new probability value for the ith residue is calculated on the basis of classes preClass_k^{IDPP-Bin} (for binary consensus modes):

Prob_i^{IDPP} = \frac{\sum_{k=i-2}^{i+2} \mu(l) \cdot preClass_k^{IDPP-Bin}}{\sum_{l=1}^{5} \mu(l)}, \quad l = k - i + 3,     (9.5)
or scores preScore_k^{IDPP-Flo} (for float consensus modes):

Prob_i^{IDPP} = \frac{\sum_{k=i-2}^{i+2} \mu(l) \cdot preScore_k^{IDPP-Flo}}{\sum_{l=1}^{5} \mu(l)}, \quad l = k - i + 3,     (9.6)
for neighboring residues. The pattern of neighboring residues is called the window, and it slides, residue by residue, over the entire sequence of amino acids. In Eqs. 9.5 and 9.6, the window size is equal to 5. For the ith residue, the 5-residue window consists of the residues located at absolute positions i − 2, i − 1, i, i + 1, i + 2 in the protein amino acid chain. The index k iterates through neighboring residues (points to absolute positions) in the sliding window with respect to the ith residue being processed, l transforms the absolute position of the kth residue to the position in the sliding window (l = 1..5), and µ(l) is the weight for the lth residue in the 5-residue sliding window. The µ(l) weight is calculated as presented in Fig. 9.5. Such a fuzzy set defined by a membership function allows assigning lower weights to the first two residues in the sliding window. In consequence, this weakens the influence of the first two elements (elements preceding the current ith element in the sequence), which is important when switching between classes. If the following elements i + 1 and i + 2 are of the same class as the ith element, the influence of the preceding two elements i − 1 and i − 2, especially if one of them is of a different IDP class than the ith element, decreases with the distance from the current element. In other words, this fuzzy set models the uncertainty at the border of regions belonging to two different classes. With the position-specific weight function as defined in Fig. 9.5, we can notice that the following condition always holds:
Fig. 9.5 A fuzzy set assigning weights for disorders identified at particular positions of the sliding window of the used filter (µ weight plotted against positions 1–5 in the sliding window)
\sum_{l=1}^{5} \mu(l) = 4.0.     (9.7)
For positions in the sliding window that go beyond the beginning or end of the protein sequence, we assume that the IDPP meta-predictor assigns the class of 1, i.e., disordered, since residues located at each end of protein sequences are, on average, more frequently disordered than residues located in the middle of the protein chain. Values of the new probability calculated for each residue in the protein sequence are then compared to the experimentally determined cutoff threshold λ (the default value of λ is 0.5). The ith residue is classified as disordered if the probability is greater than or equal to the λ threshold:

Class_i^{IDPP} = \begin{cases} 1 & \text{if } Prob_i^{IDPP} \geq \lambda \\ 0 & \text{if } Prob_i^{IDPP} < \lambda \end{cases}.     (9.8)
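The sketch below combines Eqs. 9.5 and 9.8 in code. It is an illustration, not the authors' implementation; in particular, the µ weights are an assumption read off Fig. 9.5 (0.4 and 0.6 for the first two window positions, 1.0 for the rest), which is consistent with Eq. 9.7 since they sum to 4.0.

public class FuzzyFilterSketch {
    // Window weights mu(l); the values are assumed from Fig. 9.5 (sum = 4.0, Eq. 9.7)
    static final double[] MU = {0.4, 0.6, 1.0, 1.0, 1.0};

    // Smooths consensus scores (Eqs. 9.5/9.6) and classifies residues (Eq. 9.8)
    static int[] smoothAndClassify(double[] preScore, double lambda) {
        int n = preScore.length;
        int[] classes = new int[n];
        for (int i = 0; i < n; i++) {
            double num = 0.0, den = 0.0;
            for (int k = i - 2; k <= i + 2; k++) {
                int l = k - i + 2;   // 0-based position in the 5-residue window
                // positions beyond the sequence ends are assumed disordered (class 1)
                double v = (k < 0 || k >= n) ? 1.0 : preScore[k];
                num += MU[l] * v;
                den += MU[l];
            }
            classes[i] = (num / den >= lambda) ? 1 : 0;
        }
        return classes;
    }
}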
After this step, each residue is finally classified as ordered or disordered, and the Consensus Module returns a list of disordered regions accompanied by the sequence of disorder probabilities for each protein sequence provided at the input of the IDPP meta-predictor. An example of the filtering applied to a sequence of classes produced in binary classification for a sample protein from the DisProt database is shown in Fig. 9.6.

Fig. 9.6 Result of filtering applied on a sequence of classes produced in binary classification for a sample protein from the DisProt database
9.7 IDPP on the Apache Spark
The Spark-IDPP meta-predictor performs predictions of disordered regions on a large scale by parallelizing computations on an Apache Spark cluster located in the Microsoft Azure cloud. The Spark-IDPP was created and tested on Spark 1.6.1 working on the Linux platform within the Microsoft Azure HDInsight service HDI 3.4.
9.7.1 Architecture of the Spark-IDPP
An overview of the execution of the IDP prediction on Spark with the use of the Spark-IDPP meta-predictor is presented in Fig. 9.7. Input data for the Spark-IDPP (many protein sequences in the FASTA format) should be available on the HDFS in the storage space of the Azure HDInsight service. The input data, usually distributed in many FASTA files, are retrieved from the HDFS and loaded into an RDD collection located on the Master node of the Spark cluster with the use of the JavaPairRDD<String, String> wholeTextFiles(String path, int minPartitions) function. The wholeTextFiles function allows reading folders containing many small text files. Each file is read as a single record and is added to the RDD collection as a key–value pair, where the key is the path to the file, and the value is the content of the file. The RDD collection consisting of such key–value pairs is then divided into partitions. The number of partitions is passed as the second argument of the wholeTextFiles function. If the argument is not set, the Apache Spark environment sets it automatically by itself, but the number of partitions cannot exceed the number of input files with protein sequences [3]. Spark creates a task for each data partition in the RDD collection and places it into the FIFO queue. Tasks are then sent to the multi-core Spark Worker nodes and executed by executors. If the number of partitions is greater than the number of Spark Worker nodes, after completing calculations, an idle Worker node takes the next enqueued task from the FIFO queue. Input data retrieved from the HDFS are passed as data streams to IDPP meta-predictor processes running on Spark Worker nodes. This is done by using the JavaRDD<String> pipe(String command) transformation. The pipe transformation is one of the mechanisms for communication between running processes and allows data exchange between these processes. Each RDD partition is piped through shell commands, i.e., the bash execution script. Elements of the RDD partition, i.e., protein amino acid sequences, are written to the standard input of the executed
Fig. 9.7 Spark-IDPP—execution of the IDP prediction on Spark: FASTA files stored on the HDFS are read with wholeTextFiles() into an RDD collection on the Spark Master node; partitions are enqueued as tasks in a FIFO queue; each Spark Worker node pipes its partition through the bash execution script to the IDP meta-predictor (voting, consensus, filtering); results are saved back to the HDFS with saveAsTextFile(). Reproduced from [3, 40] with permissions
process, and results are written to the standard output of the process and returned as an RDD of strings. The bash script executed by the pipe transformation saves the data from the standard input to a local text file on a Worker node. This is necessary, since component predictors in the CPM require paths to text files specified as execution arguments. After saving the data in a text file, a Worker node executes the IDPP meta-predictor, which returns results on the standard output. These results are saved in the RDD collection on the Master node of the Spark cluster and then stored as text files on the HDFS with the use of the saveAsTextFile action (Fig. 9.7).
9.7.2 Implementation of the IDPP on Spark
Details of the execution of the IDP prediction on Spark with the Spark-IDPP meta-predictor are shown in Listings 9.3 and 9.4. Within the Spark driver program, the Spark-IDPP application runs the main method presented in Listing 9.3. The method reads the execution parameters (line 2) and loads the Spark-IDPP configuration (lines 3-4). The configuration is loaded from the configuration file indicated by the file path extracted from one of the execution parameters (line 3). The configuration file consists of much information that controls the execution of the IDP prediction on Spark, including the path to the bash IDPP execution script, the path to the folder with input files, the path to the folder where the output files should be saved, and the
number of partitions Spark should use to run the calculations. These values set the appropriate attributes (respectively: command, inputFolderPath, outputFolderPath, numberOfPartitions) of the SparkTools class, which is a part of the Spark driver program. Then, the main method creates the JavaSparkContext class object (line 5) and starts calculations by executing the run method of the SparkTools class object (line 6), passing the Spark context with the Spark configuration as an argument.

1  public static void main(String[] args) {
2      Params params = new Params(args);
3      SparkTools sparkTools = new SparkTools(params.getConfigPath());
4      SparkConf sparkConf = sparkTools.getSparkConf();
5      JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
6      sparkTools.run(sparkContext);
7      sparkContext.close();
8  }
Listing 9.3 Main function of the Spark driver program for the Spark-IDPP meta-predictor [3, 40].
Implementation of the run method is shown in Listing 9.4. At the beginning, after reading the start time of the job (lines 2-3), the method loads the amino acid sequences by reading the files located in a folder indicated by the path specified in the inputFolderPath attribute (line 5). The files are loaded with the use of the wholeTextFiles function (a method of the JavaSparkContext class object). The function accepts two input parameters: the first one is the path to the folder with files; the second one is optional, but recommended, and it suggests the number of partitions the input data should be split into. Execution of the function creates a key–value RDD collection (line 5), a JavaPairRDD class object, where the key is the name of the file, and the value contains the content of the file. In the next step, the pipe transformation is invoked on the RDD collection (line 6). The argument of the transformation is the bash IDPP execution script (stored in the command attribute) that runs the IDPP meta-predictor. This produces the output RDD collection with results. These results are stored in the specified location (indicated by the outputFolderPath attribute) by invocation of the saveAsTextFile action (executed on the output RDD collection, line 7) [3, 40].

1  public void run(JavaSparkContext sparkContext) {
2      SimpleDateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd-HH-mm-ss");
3      String jobStartDate = dateFormat.format(new Date(this.startJobTime));
4
5      JavaPairRDD<String, String> inputFiles = sparkContext.wholeTextFiles(this.inputFolderPath, this.numberOfPartitions);
6      JavaRDD<String> output = inputFiles.pipe(this.command);
7      output.saveAsTextFile(this.outputFolderPath + jobStartDate);
8  }
Listing 9.4 Execution of the prediction process in the run method [3, 40].
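For readers who want to experiment outside the authors' codebase, the fragment below is a self-contained sketch of the same wholeTextFiles → pipe → saveAsTextFile pattern. The master URL, input/output paths, partition count, and the idpp.sh script name are illustrative assumptions, not parts of Spark-IDPP.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class PipeSketch {
    public static void main(String[] args) {
        // local master and paths are placeholders for a real cluster configuration
        SparkConf conf = new SparkConf().setAppName("idpp-pipe-sketch").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // each FASTA file becomes one (path, content) record, split into 64 partitions
            JavaPairRDD<String, String> files = sc.wholeTextFiles("hdfs:///input/fasta", 64);
            // stream file contents through an external predictor script on each Worker
            JavaRDD<String> results = files.values().pipe("./idpp.sh");
            results.saveAsTextFile("hdfs:///output/idpp");
        }
    }
}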
9.8 Experimental Results
The parallel, Spark-based implementation of the IDPP meta-predictor (Spark-IDPP) was extensively tested in order to verify its effectiveness and performance. The main goal of the experiments was to address the following questions:
• What is the quality of predictions provided by the designed meta-predictor for various consensus modes?
• What is the efficiency of massive predictions performed on Apache Spark with the use of the designed meta-predictor?
• How does the performance depend on the number of input files passed for execution?
• How scalable is the Spark-IDPP?
• What is the efficiency of Spark-IDPP compared to the local, sequential version of the predictor?
9.8.1 Runtime Environment
The Spark cluster used for most of the performed experiments was established on the Microsoft Azure public cloud as the HDInsight service (HDI 3.4) hosted on D13 v2-sized virtual machines (VMs) with the Linux operating system. Virtual machines in the Dv2-series are intended for applications that demand faster CPUs, better temporary storage performance, or have higher memory demands. The whole series is based on the 2.4 GHz Intel Xeon E5-2673 v3 (Haswell) processor and, with the Intel Turbo Boost Technology 2.0, can go up to 3.1 GHz. The D13 v2-sized virtual machines used as Worker nodes of the Spark cluster had 8 virtual CPUs, 56 GB RAM, and 400 GB of temporary storage space on SSD drives. The prepared Spark cluster was used to run parallel procedures of the Spark-IDPP meta-predictor and carry out the assumed performance experiments.
9.8.2 Data Set
During the experiments, two databases of protein sequences were used—one for testing the effectiveness of the Spark-IDPP meta-predictor and other basic predictors, and another data set for testing the efficiency of the Spark-IDPP. While testing the effectiveness of the Spark-IDPP meta-predictor, we used the DisProt data set [52, 59]. The data set that we used contained 1,539 disordered regions located in 694 proteins. These regions were discovered with the use of experimental methods, and evidence of disorder was manually collected from the literature. The DisProt data set provided a “ground truth” while testing the effectiveness of all primary predictors and the Spark-
IDPP, and allowed us to calculate weighted scores SW for basic predictors in all consensus modes. The second database, UniProtKB [62], was used to test the efficiency of the Spark-IDPP. This database provides millions of protein amino acid sequences and was used due to its large size. In our experiments, we used various subsets of the database.
9.8.3 A Course of Experiments
Experiments involved testing the quality of predictions of disordered regions and the efficiency of the proposed meta-predictor on the Spark cluster. The quality of predictions was tested for the four consensus modes in order to select the best one. Then, we tested the performance of the meta-predictor working on the Spark cluster for the selected consensus mode. While testing performance, we first experimentally established how to divide data into data chunks, and what number and size of data chunks should be used in order to minimize the execution time for the current cluster configuration (the number of nodes). The most efficient division of data into data chunks was used in the following experiments. Afterward, we changed the cluster size for the same amount of input data in order to verify the scalability of the solution. Finally, we verified the efficiency and speedup achieved by the system built on the 32-node Spark cluster (cluster size was constant) for the growing volume of data.
9.8.4 Effectiveness of the Spark-IDPP Meta-predictor
The effectiveness of the prediction performed with the use of the Spark-IDPP meta-predictor was verified in a series of tests and compared to the effectiveness of component predictors. All component predictors and the Spark-IDPP meta-predictor were tested on the DisProt data set ver. 6.02. This data set contains protein sequences together with the information on the experimentally identified disordered regions. We had to exclude three protein sequences from the data set due to errors generated by particular component predictors: DP00642 and DP00651 (errors generated by DisEMBL), and DP00195 (not fulfilling the conditions of RONN).
9.8.4.1 Effectiveness Measures
Predictors were examined in terms of sensitivity (recall, TPR), accuracy (ACC), specificity (SPC), precision (PREC), F1-score, and the Matthews correlation coefficient (MCC) [53], on the basis of the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) retrieved from the confusion matrix obtained for each of the tested predictors. The confusion matrix, also known as a contingency table or an error matrix, contains information about actual and predicted values (Table 9.1).
Table 9.1 Confusion matrix for performance evaluation of the created IDPP meta-predictor

                 | Predicted Positive                                                          | Predicted Negative
Real Positive    | TP (actual disordered protein residues correctly classified as disordered)  | FN (actual disordered protein residues incorrectly classified as ordered)
Real Negative    | FP (actual ordered protein residues incorrectly classified as disordered)   | TN (actual ordered protein residues correctly classified as ordered)
TP is the number of disordered protein residues correctly classified, TN is the number of ordered protein residues correctly classified, FP is the number of ordered protein residues incorrectly classified as disordered, and FN is the number of disordered protein residues incorrectly classified as ordered. One of the measures that we used in our evaluation is accuracy (ACC), which is the proportion of properly predicted disordered and ordered residues (both true positives and true negatives) among the total number of cases examined:

ACC = \frac{TP + TN}{TP + TN + FP + FN}.     (9.9)
Accuracy measures how well the prediction model correctly identifies particular protein residues as ordered and disordered (the closer to 1.0, the better). Sensitivity (Recall, or true-positive rate, TPR) measures the proportion of true positives (TP) that are correctly identified for all the cases that are positive in the diagnostic test:

TPR = \frac{TP}{TP + FN}.     (9.10)

Sensitivity can be treated as the measure that examines the probability of detection of true-positive cases (correctly predicted disordered protein residues). With higher sensitivity (closer to 1.0), fewer real positive cases are misclassified. Specificity is the proportion of cases that are true negative (correctly classified ordered protein residues) for all of the cases that are assessed as negatives:

SPEC = \frac{TN}{TN + FP}.     (9.11)
With higher specificity (closer to 1.0), fewer real ordered residues (negative cases) are labeled as disordered, so this ratio can be regarded as the percentage of
ordered protein residues (negative cases) correctly predicted as belonging to the ordered region of the protein. Precision (PREC, or Positive Predictive Value, PPV) is the proportion of positive cases that are correctly identified (TP) for all cases that are classified as positive:

PREC = \frac{TP}{TP + FP}.     (9.12)
High values of the precision (closer to 1.0) indicate better performance of the classification model. Two additional measures of the quality of the prediction were also used: the F-measure and the Matthews correlation coefficient (MCC). The F-measure is the weighted harmonic mean that can be used to find the balance between the precision (PREC) and the recall (sensitivity, TPR). It can be calculated according to the following formula:

F\text{-}measure = 2 \cdot \frac{PREC \cdot TPR}{PREC + TPR}.     (9.13)
The Matthews correlation coefficient (MCC) was introduced by B.W. Matthews in 1975. It takes values from the range ⟨−1; +1⟩, where +1 denotes a perfect prediction. The MCC can be calculated according to the following formula:

MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}.     (9.14)
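A compact code rendering of these measures, including the weighted score of Eq. 9.2, is given below. It is an illustration only; the Measures class and method names are assumptions, not the authors' code.

public class Measures {
    static double acc(long tp, long tn, long fp, long fn) {
        return (double) (tp + tn) / (tp + tn + fp + fn);          // Eq. 9.9
    }
    static double tpr(long tp, long fn)  { return (double) tp / (tp + fn); }   // Eq. 9.10
    static double spec(long tn, long fp) { return (double) tn / (tn + fp); }   // Eq. 9.11
    static double prec(long tp, long fp) { return (double) tp / (tp + fp); }   // Eq. 9.12

    static double fMeasure(long tp, long tn, long fp, long fn) {
        double p = prec(tp, fp), r = tpr(tp, fn);
        return 2.0 * p * r / (p + r);                             // Eq. 9.13
    }
    static double mcc(long tp, long tn, long fp, long fn) {
        double denom = Math.sqrt((double) (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn));
        return (tp * (double) tn - fp * (double) fn) / denom;     // Eq. 9.14
    }
    // Weighted score of Eq. 9.2; wDis and wOrd are the fractions of
    // disordered and ordered residues in the reference data set
    static double sw(long tp, long tn, long fp, long fn, double wDis, double wOrd) {
        double s = wDis * tp - wOrd * fp + wOrd * tn - wDis * fn;
        return s / (wDis * (tp + fn) + wOrd * (tn + fp));
    }
}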
Additionally, we used ROC curves (Receiver Operating Characteristic curves) and the Area Under the Curve (AUC) in order to assess the quality of our IDPP meta-predictor and other IDP predictors. ROC curves and the AUC measure are frequently used in the prediction of intrinsically disordered proteins to evaluate the performance of prediction (classification) models. The ROC curve graphically illustrates the relative trade-off between the true-positive rate (TPR), indicating benefits, and the false-positive rate (FPR), which indicates the cost, at various settings of the discrimination threshold. The AUC can be considered as a measure indicating the accuracy of the predictive model, where 1 is the best possible value and 0.5 is equivalent to a random prediction.
9.8.4.2 Evaluation of Primary, Component Predictors
At the beginning, we evaluated the primary, component predictors that are known from the literature and were also implemented in the IDPP meta-predictor. Results of the evaluation of component predictors are presented in Table 9.2. As can be observed from Table 9.2, the RONN predictor achieved the best sensitivity TPR = 0.695 with the lowest specificity SPC = 0.667. On the other hand, out of all component predictors, the DisEMBL Remark465 predictor is characterized by the best specificity
Table 9.2 Effectiveness of component predictors (the best achieved result is marked in bold)

Name                 TPR    ACC    SPC    PREC   F1-score  MCC
DisEMBL Coils        0.507  0.688  0.744  0.379  0.434     0.229
DisEMBL Hotloops     0.487  0.683  0.744  0.370  0.420     0.212
DisEMBL Remark465    0.364  0.756  0.878  0.478  0.413     0.267
IUPred Short         0.619  0.727  0.761  0.444  0.517     0.343
IUPred Long          0.633  0.727  0.756  0.445  0.523     0.350
GlobPlot             0.330  0.700  0.814  0.354  0.342     0.148
RONN                 0.695  0.673  0.667  0.391  0.501     0.311
Table 9.3 Values of the SW score and the AUC measure for particular component predictors (ordered by AUC) [3, 40]

Predictor            SW     AUC
IUPred Long          0.390  0.746
IUPred Short         0.380  0.740
RONN                 0.362  0.721
DisEMBL Coils        0.251  0.639
DisEMBL Remark465    0.242  0.626
DisEMBL Hotloops     0.231  0.636
GlobPlot             0.145  0.572
SPC = 0.878, accuracy ACC = 0.756, and the precision PREC = 0.478, but also the lowest sensitivity TPR = 0.364. The results for the IUPred predictors, especially IUPred Long, are quite high for each of the measures (though not the best), and the predictor is characterized by the best F1-score = 0.523 and the best Matthews correlation coefficient MCC = 0.350. On the basis of the results of the effectiveness tests, for each component predictor, we plotted the ROC curve and calculated the AUC measure and the SW weighted score (according to Eq. 9.2). ROC curves for particular component predictors are presented in Fig. 9.9. Values of the SW score and the AUC measure for particular component predictors are shown in Table 9.3. As can be observed by analyzing the results presented in Table 9.3, the IUPred predictors, especially IUPred Long, and the RONN predictor achieved the highest values of the AUC in our tests. They also had the highest values of the SW coefficient, and therefore, they contribute the highest weighted scores to the final decision made in the Consensus Module of the IDPP meta-predictor working in the Weighted Binary and Weighted Float modes.
Table 9.4 Effectiveness of the IDPP predictor working in each of the four consensus modes

Consensus mode        TPR    ACC    SPC    PREC   F1-score  MCC    AUC
Simple Binary (SB)    0.166  0.775  0.963  0.579  0.258     0.218  0.572
Weighted Binary (WB)  0.663  0.720  0.737  0.438  0.527     0.355  0.743
Simple Float (SF)     0.676  0.712  0.723  0.429  0.525     0.350  0.742
Weighted Float (WF)   0.708  0.700  0.697  0.419  0.526     0.351  0.752

9.8.4.3 Evaluation of the IDPP Meta-Predictor
After the calculation of the SW weighted scores, we examined the effectiveness of the proposed IDPP meta-predictor working in all four consensus modes. All tests were conducted with the use of the same DisProt data set ver. 6.02 as for testing component predictors. On the basis of the confusion matrices obtained for the IDPP meta-predictor working in each of the four consensus modes, we calculated values of the effectiveness measures (Table 9.4). Results of the effectiveness tests presented in Table 9.4 show that the IDPP meta-predictor working in the Simple Binary consensus mode (IDPP-SB) is characterized by the worst prediction quality. The value of AUC calculated for the IDPP-SB was 0.572, which reflects that the predictor is close to a random predictor, for which the AUC = 0.5. The IDPP meta-predictor working in the remaining three consensus modes achieved significantly better prediction quality. Results are similar, though the Weighted Float consensus mode led to the highest AUC = 0.752, and the Weighted Binary consensus mode yielded the best MCC = 0.355. The obtained results are slightly better (AUC, MCC) than or close to the results achieved by particular component predictors presented in Tables 9.2 and 9.3. The Simple Float-based IDPP meta-predictor (IDPP-SF), which uses a regular average while striving for the consensus, is slightly worse. It reached the AUC of 0.742 and the MCC of 0.350. The Weighted Binary-based and the Weighted Float-based IDPP meta-predictors (IDPP-WB and IDPP-WF) use the weighted mean while seeking the consensus, where the weights are the SW coefficients calculated for particular component predictors (Table 9.3). The IDPP-WB reached the AUC of 0.743 and the MCC of 0.355, and the IDPP-WF reached the AUC of 0.752 and the MCC of 0.351. This shows that the application of the SW weighted score is beneficial, especially for IDPP predictors working on the basis of Binary consensus (compare AUC and MCC for the SB and WB modes in Table 9.4). ROC curves for the IDPP meta-predictor working in all four consensus modes are presented in Fig. 9.8. A comparison of ROC curves for the proposed IDPP meta-predictor and component predictors is shown in Fig. 9.9. For the clarity of presentation, we show the ROC curve for the IDPP meta-predictor working in the Weighted Float consensus mode (IDPP-WF).
Fig. 9.8 ROC curves (TPR versus FPR) for the proposed IDPP meta-predictor working in various consensus modes: Simple Binary (IDPP-SB), Weighted Binary (IDPP-WB), Simple Float (IDPP-SF), Weighted Float (IDPP-WF)
Fig. 9.9 ROC curves (TPR versus FPR) for the proposed IDPP meta-predictor (IDPP-WF) and the component predictors: RONN, GlobPlot, IUPred-Short, IUPred-Long, DisEMBL Coils, DisEMBL Hotloops, DisEMBL Remark465
9.8.4.4 IDPP Meta-Predictor With and Without Fuzzy Filtering
We also checked how the fuzzy filtering influences the results of the prediction with the use of the IDPP meta-predictor. In Fig. 9.10, we show ROC curves for the proposed IDPP meta-predictor working with and without fuzzy filtering in all consensus modes. Figure 9.10 clearly shows that the fuzzy filtering brings an improvement in the prediction quality in all consensus modes of the IDPP meta-predictor. This improvement can be measured by the change in the AUC presented in Table 9.5. Results presented in Table 9.5 show that the relative improvement after using the fuzzy filtering ranges between 2 and 4% in all implemented consensus modes.
Fig. 9.10 ROC curves for the proposed IDPP meta-predictor working with and without filtering in various consensus modes: Simple Binary (a), Weighted Binary (b), Simple Float (c), Weighted Float (d)

Table 9.5 Area under the ROC curve (AUC) calculated for the IDPP predictor working with and without the fuzzy filtering in each of the four consensus modes

Consensus mode        AUC without filtering  AUC with filtering  Relative improvement (%)
Simple Binary (SB)    0.562                  0.572               2
Weighted Binary (WB)  0.722                  0.743               3
Simple Float (SF)     0.721                  0.742               3
Weighted Float (WF)   0.726                  0.752               4
9.8.5 Performance of IDPP-Based Prediction on the Cloud
We conducted a broad series of tests in order to verify the performance of the proposed Spark-IDPP method. In those tests, we investigated the execution time and speedup achieved for various configurations of the Spark cluster and various data loads.
9.8.5.1 Execution Time versus the Number and the Size of Files (Data Chunks)
In this series of tests, we wanted to verify how the performance of the prediction process carried out with the Spark-IDPP meta-predictor depends on the number and the size of input data files (input chunks) stored on the HDFS. These tests allowed us to experimentally select the appropriate files-to-nodes ratio for various sizes of the input data set used in other performance experiments. In these tests, we used a 32 MB part of the UniProtKB/Swiss-Prot database, which was divided into a number of files of various sizes: 16 partial files of size up to 2,000 kB, 32 partial files of size up to 1,000 kB, 64 partial files of size up to 500 kB, and 320 partial files of size up to 100 kB. Experiments were carried out on the Spark cluster with 16 Worker nodes. Results of the experiments are presented in Table 9.6. As can be observed from Table 9.6, for the 32 MB database, the Spark-IDPP achieved the best performance for many (320) files of size up to 100 kB. The execution time was 10,146 s. For the same 32 MB input data set, the configuration with 16 files of size up to 2,000 kB and a files-to-nodes ratio equal to 1 was almost twice less efficient (19,703 s) than the configuration with 320 files of size up to 100 kB, for which the number of files is 20 times the number of nodes (files-to-nodes ratio of 20). We can also observe that with the growing size of input chunks and with the decreasing files-to-nodes ratio, the Spark platform automatically decreased the number of data partitions. For the higher files-to-nodes ratio (20), Spark created 16 partitions, and for the low files-to-nodes ratio (1), Spark created only 11 partitions, leaving 5 nodes of the computation cluster idle. For this reason, the configuration with 16 input chunks of size up to 2,000 kB and the lowest files-to-nodes ratio turned out to be the slowest. This was caused by the fact that the capabilities of the Spark cluster were not fully utilized and the workload was unbalanced. Only 11 nodes of the computation cluster were used, and five of them had to perform prediction for two data chunks, while
Table 9.6 Spark-IDPP total execution time for various sizes of input files (input chunks) [3, 40]

File size (kB)  #Files  Files-to-nodes ratio  Execution time (s)  #Partitions
100             320     20                    10,146              16
500             64      4                     12,377              15
1,000           32      2                     14,719              13
2,000           16      1                     19,703              11
Fig. 9.11 Dependency between the prediction time (plotted on a logarithmic scale) and the number of cluster nodes for the 256.1 MB Swiss-Prot data set. Computations performed on the Spark cluster established on A-series and Dv2-series virtual machines
the other nodes were idle after processing a single data chunk. Results of these tests allowed us to confirm that the best execution times are achieved when the number of files is much larger than the number of nodes.
9.8.5.2 Execution Time versus Cluster Size
In the next series of performance tests, we examined the scalability of the Spark-IDPP by changing the number of Worker nodes of the Spark cluster. During the experiments, the IDPP meta-predictor was launched on the Spark cluster with 1, 4, 8, 16, and 32 nodes. The size of the whole input data set was constant during the course of the experiments and equal to 256.1 MB (the whole Swiss-Prot data set). The data set was divided into smaller chunks in such a way that allowed keeping the files-to-nodes ratio at the level of 20, e.g., 640 files of up to 410 kB each for the 32-node Spark cluster. The results of these tests are presented in Fig. 9.11. The prediction time decreases proportionally with the growing number of Spark Worker nodes. On the Spark cluster with only one Worker node, the prediction took more than 112 h (on Dv2-series VMs). The IDP prediction with the Spark-IDPP executed on the 32-node cluster took less than 4 h on Dv2-series VMs and less than 6 h on A-series VMs. This gave an almost ideal n-fold speedup when scaling out the cluster from one to 32 Worker nodes on the Azure cloud. The speedup curves are presented in Fig. 9.12. The n-fold speedup was calculated according to the following equation:

S_d = \frac{T_1}{T_d},     (9.15)
Fig. 9.12 n-fold speedup for the IDP prediction performed on the Apache Spark cluster for various sizes of the cluster (Spark-IDPP on Dv2-series VMs, Spark-IDPP on A-series VMs, and the ideal speedup)
where d is the number of nodes of the Spark cluster, T_d is the execution time obtained while performing computations on the d-node cluster, and T_1 is the execution time obtained while performing computations on the 1-node cluster.
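For example, taking the Dv2-series times reported above, scaling from one node (112.4 h) to 32 nodes (3.58 h) gives S_32 = 112.4/3.58 ≈ 31.4, which is close to the ideal 32-fold speedup.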
9.8.5.3 Performance for Growing Volume of Data
We also tested the performance of the Spark-IDPP for the growing volume of protein amino acid sequences. We wanted to verify the gain resulting from using the Spark cluster in the IDP prediction with respect to the desktop version of the meta-predictor (Desktop-IDPP). The Desktop-IDPP was tested on a workstation PC with a Core i7 4700MQ 2.4 GHz CPU (four cores, eight threads), 16 GB RAM, and a 1 TB HDD, running the Microsoft Windows 7 64-bit operating system. The Spark-IDPP was tested on the 32-node Spark cluster (Dv2-series VMs) located on the Microsoft Azure cloud. Results of the performance tests for both versions of the IDPP predictors are presented in Fig. 9.13. For predictions performed with the Spark-IDPP, data sets were divided into smaller data chunks (files) in such a way that allowed keeping the files-to-nodes ratio at the level of 20. The number of files depended on the whole data set size and the number of protein sequences in the input data set (Table 9.7). Results of the performance tests presented in Fig. 9.13 show that the execution time for the Desktop-IDPP predictor grows very quickly with the size of the input data set—for 256.1 MB of input data the prediction took more than 5 days. This shows how time-consuming the prediction process is. Prediction with the Spark-IDPP on the 32-node Spark cluster took less than 4 h for the same data set. Significant differences
Fig. 9.13 Comparison of the execution time (prediction time) for the desktop version of the IDPP meta-predictor (Desktop-IDPP) and the Spark-IDPP working on the 32-node Spark cluster for varying size of the input data set

Table 9.7 Sizes of input data sets used in the prediction process performed on the 32-node Spark cluster, together with sizes of data chunks the data sets were divided into, and the number of partitions created by Spark during execution of the Spark-IDPP

Data size (MB)  Chunk size (kB)  #Partitions
5.06            8                32
20.08           33               32
40.15           65               32
100.0           160              32
200.1           320              32
256.1           410              32
in execution times for large data sets confirm that the use of the Spark cluster is all the more justified, the larger the data set we process. For larger data sets, like 256.1 MB, the reduction of time was significant, and even the time needed to create the Spark cluster on the Cloud (which usually takes around 20 min) was just a fraction of the time taken by the desktop version of the IDP predictor. In Table 9.7, we can also observe that the number of data partitions for various sizes of the input data set was constant and equal to the number of nodes of the Spark cluster. This was achieved by dividing the input data set into many (640) chunks and keeping the files-to-nodes ratio at a relatively high level (20).
9.9 Discussion
Prediction of disordered regions for protein amino acid sequences became an important branch of 3D protein structure prediction and modeling. The knowledge flowing from correctly resolved protein structures translates very well into drug design, or at least, into recognition of molecular mechanisms underlying many civilization diseases. Disordered proteins constitute a wide range of molecules that play important roles in these molecular mechanisms. Since the number of protein amino acid sequences in world repositories grows exponentially, the availability of efficient methods that are able to predict IDPs on highly scalable computer clusters is very important. This belief underlay our research. The Spark-IDPP responds to these needs very well by parallelizing computations on the Spark cluster, which can be scaled on the Cloud on demand according to current requirements for computing power. Results of our performance tests show that we are able to perform predictions of disordered regions on a large scale and to handle the growing amount of protein sequences by scaling the cluster horizontally and vertically. As also shown in this chapter, IDP prediction on Spark clusters is most beneficial when it is performed for large data sets divided into smaller chunks (files). The best results were obtained when the number of data chunks was much larger than the number of Data nodes—we established the files-to-nodes ratio at the level of 20. This ensured that the computational capabilities of the Spark cluster were fully utilized, as the number of Spark data partitions was equal to the number of nodes. Only then were we able to achieve an almost linear, above 31-fold, speedup on the 32-node Spark cluster, and to significantly reduce the execution time. The Spark-IDPP was developed for Spark clusters hosted in local data centers or in the Cloud. However, the low entry barrier, a huge storage space, and the wide compute and flexible scaling capabilities of the Cloud made it an attractive alternative to local compute infrastructures kept on premises. With its wide, horizontal and vertical, scaling capabilities, the Cloud enabled us to create a Spark cluster that is able to respond to current needs of IDP prediction and appropriately accommodate the growth of protein data in public repositories. The use of public cloud platforms, like Microsoft Azure or Amazon Web Services, simplifies many tasks related to the creation and maintenance of the Spark cluster. Firstly, within these platforms Spark is provided as a service, i.e., it can be easily created and configured on demand, when needed, and removed after performing the required computations. This ease in maintaining Spark clusters clearly distinguishes the solution from local Spark installations and clusters created in IaaS clouds, although it comes at the cost of the broader configurability available to clusters created in local data centers or IaaS clouds. Secondly, once created on the Cloud platform, the Spark cluster can be dynamically scaled out by adding more cluster nodes or scaled down by releasing unnecessary compute resources. Thirdly, cloud platforms usually provide a specialized fleet of virtual machines that have differentiated compute capabilities (various sizes), optimized for different tasks, e.g., for compute-intensive or memory-intensive calculations. As a result, it is easy to scale up by using more powerful
virtual machines. We started testing the Spark-IDPP on the Spark cluster created on A-series virtual machines. However, we quickly scaled up the solution to Dv2-series virtual machines, since they provided higher compute capabilities (better CPUs and more memory). This allowed us to further accelerate the IDP prediction for large data sets (see Fig. 9.13). The necessity of paying for the Spark cluster performing IDP predictions must be mentioned as a disadvantage of using cloud platforms. On the other hand, however, users do not have to cover the costs of maintenance of the whole hardware infrastructure kept on premises. Results of our experiments on the quality of predictions prove that the proposed method is able to achieve higher prediction effectiveness than primary predictors. With the Spark-IDPP working in the Weighted Float consensus mode (Spark-IDPP-WF), we were able to improve the quality of predictions compared to basic predictors. The quality improvement was not as significant as in MetaDisorder reported in [30] and PONDR-FIT reported in [70], since we were not able to implement all component methods in the Spark-IDPP (e.g., some of them are not available as program packages). Nevertheless, our experiments confirmed that meta-prediction with the use of many component predictors, which operate on various features extracted from amino acid sequences and characteristics of particular amino acids, may increase the prediction quality. The most important unique feature of the Spark-IDPP is its capability to work with large amounts of protein sequence data and to provide prediction results fast, adequately to the size of the Spark cluster. In this way, it addresses the volume characteristics of the Big Data challenge. To the best of our knowledge, this is the first method that addresses this feature for the prediction of IDPs. Among other unique features of the Spark-IDPP, it is worth mentioning its four consensus modes for combining results of basic predictors and its fuzzy filtering method. Three out of four consensus modes allow predicting disordered regions with reasonable quality, confirmed by the majority of the effectiveness measures that were calculated. The best results, in terms of the AUC and the MCC, were obtained for the Spark-IDPP working in the Weighted Float consensus mode (Spark-IDPP-WF), where we used the weighted score proposed in [27] to make decisions on the classification to ordered/disordered classes. The same weighted score was used in the MetaDisorder method [30], but we calculated our own values of weights for the used component predictors and for the DisProt data set. The quality of predictions returned by the Spark-IDPP working in the Weighted Binary and the Simple Float consensus modes was only slightly worse in terms of the AUC and the MCC, but even better in terms of accuracy and precision. Only the Simple Binary consensus mode did not bring satisfactory results. Its prediction capabilities turned out to be only slightly better than random guessing. In addition, the fuzzy filtering method that we used allowed us to efficiently eliminate short segments outlying from neighboring ones, resulting in smoothing the final outcome of the prediction.
9.10 Summary
The Spark-IDPP meta-predictor can be a valuable tool for the whole field of protein structure modeling and fold recognition. It complements the collection of existing tools by providing better predictive capabilities with significantly higher performance, which can be adapted to the current needs and amount of input data. This makes it an important alternative to desktop software tools, for which scalability is very limited or sometimes impossible to implement. All solutions presented in this part of the book proved to be highly scalable on Big Data frameworks, like Hadoop and Spark. All of them are dedicated to accelerating long-running processes performed in protein bioinformatics. Calculations for a single biomedical data record take seconds to minutes, which is not typical, e.g., of business scenarios, which involve calculations for millions of small records. Nevertheless, the presented solutions that utilize Hadoop, Spark, and the HDFS turned out to be the most efficient for the performed processes of massive 3D protein structure alignment, protein similarity searching and structural superposition, and prediction of intrinsically disordered regions in protein structures. In the next part of this book, we will see other approaches that allow speeding up calculations related to protein analysis. In Chap. 10, we will see the use of graphics processing units (GPUs) in protein similarity searching. In Chap. 11, we will see the exploration of protein secondary structures nested in SQL queries.
9.11 Availability
Further development of the system will be carried out by the Cloud4Proteins non-profit, scientific group (http://www.zti.aei.polsl.pl/w3/dmrozek/science/cloud4proteins.htm).
References
1. Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J.: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25(17), 3389–3402 (1997). https://doi.org/10.1093/nar/25.17.3389
2. Bai, C., Dhavale, D., Sarkis, J.: Complex investment decisions using rough set and fuzzy c-means: an example of investment in green supply chains. Eur. J. Oper. Res. 248(2), 507–521 (2016)
3. Baron, T.: Prediction of intrinsically disordered proteins in Apache Spark. Master's thesis, Institute of Informatics, Silesian University of Technology, Gliwice, Poland (2016)
4. Bayer, P., Arndt, A., Metzger, S., Mahajan, R., Melchior, F., Jaenicke, R., Becker, J.: Structure determination of the small ubiquitin-related modifier SUMO-1. J. Mol. Biol. 280(2), 275–286 (1998). http://www.sciencedirect.com/science/article/pii/S0022283698918393
5. Benson, D.A., Cavanaugh, M., Clark, K., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., Sayers, E.W.: GenBank. Nucleic Acids Res. 45(D1), D37–D42 (2017). https://doi.org/10.1093/nar/gkw1070
6. Berman, H., et al.: The Protein Data Bank. Nucleic Acids Res. 28, 235–242 (2000)
7. Boutet, E., Lieberherr, D., Tognolli, M., Schneider, M., Bansal, P., Bridge, A.J., Poux, S., Bougueleret, L., Xenarios, I.: UniProtKB/Swiss-Prot, the Manually Annotated Section of the UniProt KnowledgeBase: How to Use the Entry View, pp. 23–54. Springer, New York (2016)
8. Ceri, S., Kaitoua, A., Masseroli, M., Pinoli, P., Venco, F.: Data management for heterogeneous genomic datasets. IEEE/ACM Trans. Comput. Biol. Bioinform. 99, 1–1 (2016)
9. Chang, H., Mishra, N., Lin, C.: IoT Big-Data centred knowledge granule analytic and cluster framework for BI applications: a case base analysis. Plos One 10, 1–23 (2015)
10. Cheng, J., Sweredoski, M.J., Baldi, P.: Accurate prediction of protein disordered regions by mining protein structure data. Data Min. Knowl. Discov. 11(3), 213–222 (2005). https://doi.org/10.1007/s10618-005-0001-y
11. Cupek, R., Ziebinski, A., Huczala, L., Erdogan, H.: Agent-based manufacturing execution systems for short-series production scheduling. Comput. Ind. 82, 245–258 (2016)
12. Czerniak, J.M., Dobrosielski, W.T., Apiecionek, Ł., Ewald, D.: Representation of a trend in OFN during fuzzy observance of the water level from the Crisis control center. In: Proceedings of the 2015 Federated Conference on Computer Science and Information Systems (FedCSIS), pp. 443–447 (2015)
13. Davis, G.B., Carley, K.M.: Clearing the fog: fuzzy, overlapping groups for social networks. Soc. Netw. 30(3), 201–212 (2008)
14. De Maio, C., Fenza, G., Loia, V., Parente, M.: Time aware knowledge extraction for microblog summarization on Twitter. Inf. Fus. 28, 60–74 (2016)
15. Dosztányi, Z., Csizmok, V., Tompa, P., Simon, I.: IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics 21(16), 3433–3434 (2005). https://doi.org/10.1093/bioinformatics/bti541
16. Dunker, A.K., Silman, I., Uversky, V.N., Sussman, J.L.: Function and structure of inherently disordered proteins. Curr. Opin. Struct. Biol. 18(6), 756–764 (2008)
17. Feng, X., Grossman, R., Stein, L.: PeakRanger: a cloud-enabled peak caller for ChIP-seq data. BMC Bioinform. 12(1), 1–11 (2011). https://doi.org/10.1186/1471-2105-12-139
18. Guo, K., Zhang, R., Kuang, L.: TMR: towards an efficient semantic-based heterogeneous transportation media Big Data retrieval. Neurocomputing 181, 122–131 (2016)
19. Hazelhurst, S.: PH2: an Hadoop-based framework for mining structural properties from the PDB database. In: Proceedings of the 2010 Annual Research Conference of the South African Institute of Computer Scientists and Information Technologists, pp. 104–112 (2010)
20. Hirose, S., Shimizu, K., Kanai, S., Kuroda, Y., Noguchi, T.: POODLE-L: a two-level SVM prediction system for reliably predicting long disordered regions. Bioinformatics 23(16), 2046–2053 (2007). https://doi.org/10.1093/bioinformatics/btm302
21. Hu, C., Ren, G., Liu, C., Li, M., Jie, W.: A Spark-based genetic algorithm for sensor placement in large scale drinking water distribution systems. Clust. Comput. 20(2), 1089–1099 (2017). https://doi.org/10.1007/s10586-017-0838-z
22. Hung, C.L., Hua, G.J.: Cloud Computing for protein-ligand binding site comparison. Biomed Res. Int. 170356 (2013)
23. Hung, C.L., Lin, C.Y.: Open reading frame phylogenetic analysis on the cloud. Int. J. Genomics 2013(614923), 1–9 (2013)
24. Hung, C.L., Lin, Y.L.: Implementation of a parallel protein structure alignment service on cloud. Int. J. Genomics 439681, 1–8 (2013)
25. Ishida, T., Kinoshita, K.: PrDOS: prediction of disordered protein regions from amino acid sequence. Nucleic Acids Res. 35(suppl_2), W460–W464 (2007). https://doi.org/10.1093/nar/gkm363
26. Jensen, K., Nguyen, H.T., Do, T.V., Årnes, A.: A big data analytics approach to combat telecommunication vulnerabilities. Clust. Comput. 20(3), 2363–2374 (2017). https://doi.org/10.1007/s10586-017-0811-x
References
245
27. Jin, Y., Dunbrack, R.: Assessment of disorder predictions in CASP6. Proteins 61, 167–175 (2005) 28. Kabsch, W., Sander, C.: Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22(12), 2577–2637 (1987) 29. Kelley, D.R., Schatz, M.C., Salzberg, S.L.: Quake: quality-aware detection and correction of sequencing errors. Genome Biol. 11(11), 1–13 (2010). https://doi.org/10.1186/gb-2010-1111-r116 30. Kozlowski, L.P., Bujnicki, J.M.: MetaDisorder: a meta-server for the prediction of intrinsic disorder in proteins. BMC Bioinform. 13(1), 111 (2012). https://doi.org/10.1186/1471-210513-111 31. Langmead, B., Hansen, K.D., Leek, J.T.: Cloud-scale RNA-sequencing differential expression analysis with Myrna. Genome Biol. 11(8), 1–11 (2010). https://doi.org/10.1186/gb-2010-118-r83 32. Langmead, B., Schatz, M.C., Lin, J., Pop, M., Salzberg, S.L.: Searching for SNPs with Cloud computing. Genome Biol. 10(11), 1–10 (2009). https://doi.org/10.1186/gb-2009-10-11-r134 33. Lewis, S., Csordas, A., Killcoyne, S., Hermjakob, H., et al.: Hydra: a scalable proteomic search engine which utilizes the Hadoop distributed computing framework. BMC Bioinform. 13, 324 (2012) 34. Linding, R., Jensen, L.J., Diella, F., Bork, P., Gibson, T.J., Russell, R.B.: Protein disorder prediction: implications for structural proteomics. Structure 11(11), 1453–1459 (2003). http:// www.sciencedirect.com/science/article/pii/S0969212603002351 35. Linding, R., Russell, R.B., Neduva, V., Gibson, T.J.: GlobPlot: exploring protein sequences for globularity and disorder. Nucleic Acids Res. 31(13), 3701–3708 (2003). https://doi.org/10. 1093/nar/gkg519 36. Lipman, D., Pearson, W.: Rapid and sensitive protein similarity searches. Science 227(4693), 1435–1441 (1985) 37. Lu, H., Sun, Z., Qu, W.: Big Data-driven based real-time traffic flow state identification and prediction. Discret. Dyn. Nat. Soc. 2015, 1–11 (2015) 38. Lu, H., Sun, Z., Qu, W., Wang, L.: Real-time corrected traffic correlation model for traffic flow forecasting. Math. Probl. Eng. 2015, 1–7 (2015) 39. Mahmud, S., Iqbal, R., Doctor, F.: Cloud enabled data analytics and visualization framework for health-shocks prediction. Future Gener. Comput. Syst. 65, 169–181 (2016). http://www. sciencedirect.com/science/article/pii/S0167739X15003271. (special Issue on Big Data in the Cloud) 40. Małysiak-Mrozek, B., Baron, T., Mrozek, D.: Spark-IDPP: High throughput and scalable prediction of intrinsically disordered protein regions with Spark clusters on the Cloud, J. Clus. Comp, 1–35 (in review) 41. Małysiak-Mrozek, B., Stabla, M., Mrozek, D.: Soft and declarative fishing of information in Big Data lake. IEEE Trans. Fuzzy Syst. 99, 1–1 (2018) 42. Małysiak-Mrozek, B., Zur, K., Mrozek, D.: In-memory management system for 3D protein macromolecular structures. Curr. Proteomics 15 (2018). https://doi.org/10.2174/ 1570164615666180320151452 43. Matsunaga, A., Tsugawa, M., Fortes, J.: Cloudblast: combining MapReduce and virtualization on distributed resources for bioinformatics applications. In: Proceedings of the IEEE Fourth International Conference on eScience (ESCIENCE ’08), pp. 222–229 (2008) 44. Matthews, S.J., Williams, T.L.: MrsRF: an efficient MapReduce algorithm for analyzing large collections of evolutionary trees. BMC Bioinform. 11(1), 1–9 (2010). https://doi.org/10.1186/ 1471-2105-11-S1-S15 45. Meng, L., Tan, A., Wunsch, D.: Adaptive scaling of cluster boundaries for large-scale social media data clustering. IEEE Trans. 
Neural Netw. Learn. 27(12), 2656–2669 (2015) 46. Mrozek, D.: High-Performance Computational Solutions in Protein Bioinformatics. SpringerBriefs in Computer Science. Springer International Publishing, Cham (2014) 47. Mrozek, D., Daniłowicz, P., Małysiak-Mrozek, B.: HDInsight4PSi: boosting performance of 3D protein structure similarity searching with HDInsight clusters in Microsoft Azure cloud. Inf. Sci. 349–350, 77–101 (2016)
246
9 Scalable Prediction of Intrinsically Disordered …
48. Mrozek, D., Gosk, P., Małysiak-Mrozek, B.: Scaling Ab Initio predictions of 3D protein structures in Microsoft Azure cloud. J Grid Comput. 13, 561–585 (2015) 49. Mrozek, D., Kutyła, T., Małysiak-Mrozek, B.: Accelerating 3D protein structure similarity searching on Microsoft Azure Cloud with local replicas of macromolecular data. In: Wyrzykowski, R. (ed.) Parallel Processing and Applied Mathematics - PPAM 2015. Lecture Notes in Computer Science, vol. 9574, pp. 1–12. Springer, Heidelberg (2016) 50. Mrozek, D., Małysiak-Mrozek, B., Kłapci´nski, A.: Cloud4Psi: cloud computing for 3D protein structure similarity searching. Bioinformatics 30(19), 2822–2825 (2014) 51. Mrozek, D., Kasprowski, P., Małysiak-Mrozek, B., Kozielski, S.: Life sciences data analysis. Inform. Sci. 384, 86–89 (2017) 52. Piovesan, D., Tabaro, F., Miˇceti´c, I., Necci, M., Quaglia, F., Oldfield, C.J., Aspromonte, M.C., Davey, N.E., Davidovi´c, R., Dosztányi, Z., Elofsson, A., Gasparini, A., Hatos, A., Kajava, A.V., Kalmar, L., Leonardi, E., Lazar, T., Macedo-Ribeiro, S., Macossay-Castillo, M., Meszaros, A., Minervini, G., Murvai, N., Pujols, J., Roche, D.B., Salladini, E., Schad, E., Schramm, A., Szabo, B., Tantos, A., Tonello, F., Tsirigos, K.D., Veljkovi´c, N., Ventura, S., Vranken, W., Warholm, P., Uversky, V.N., Dunker, A.K., Longhi, S., Tompa, P., Tosatto, S.C.: DisProt 7.0: a major update of the database of disordered proteins. Nucleic Acids Res. 45(D1), D219–D227 (2017). https://doi.org/10.1093/nar/gkw1056 53. Powers, D.: Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. Int. J. Mach. Learn. Technol. 2, 37–63 (2011) 54. Qiu, X., Ekanayake, J., Beason, S., Gunarathne, T., Fox, G., Barga, R., Gannon, D.: Cloud technologies for bioinformatics applications. In: Proceedings of the 2nd Workshop on ManyTask Computing on Grids and Supercomputers, pp. 6:1–6:10. MTAGS ’09, ACM, New York, NY, USA (2009). https://doi.org/10.1145/1646468.1646474 55. Radenski, A., Ehwerhemuepha, L.: Speeding-up codon analysis on the cloud with local MapReduce aggregation. Inf. Sci. 263, 175–185 (2014) 56. Rose, A.S., Hildebrand, P.W.: NGL viewer: a web application for molecular visualization. Nucleic Acids Res. 43(W1), W576–W579 (2015). https://doi.org/10.1093/nar/gkv402 57. Schatz, M.C.: CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics 25(11), 1363–1369 (2009) 58. Shimizu, K., Hirose, S., Noguchi, T.: POODLE-S: web application for predicting protein disorder by using physicochemical features and reduced amino acid set of a position-specific scoring matrix. Bioinformatics 23(17), 2337–2338 (2007). https://doi.org/10.1093/bioinformatics/ btm330 59. Sickmeier, M., Hamilton, J.A., LeGall, T., Vacic, V., Cortese, M.S., Tantos, A., Szabo, B., Tompa, P., Chen, J., Uversky, V.N., Obradovic, Z., Dunker, A.K.: DisProt: the database of disordered proteins. Nucleic Acids Res. 35(suppl_1), D786–D793 (2007). https://doi.org/10. 1093/nar/gkl893 60. Su, C.T., Chen, C.Y., Hsu, C.M.: iPDA: integrated protein disorder analyzer. Nucleic Acids Res. 35(suppl_2), W465–W472 (2007). https://doi.org/10.1093/nar/gkm353 61. Teijeiro, D., Pardo, X.C., Penas, D.R., González, P., Banga, J.R., Doallo, R.: A cloud-based enhanced differential evolution algorithm for parameter estimation problems in computational systems biology. Clust. Comput. 20(3), 1937–1950 (2017). https://doi.org/10.1007/s10586017-0860-1 62. The UniProt consortium: Uniprot: the universal protein knowledgebase. Nucleic Acids Res. 
45(D1), D158–D169 (2017). https://doi.org/10.1093/nar/gkw1099 63. Tripathy, B.K., Mittal, D.: Hadoop based uncertain possibilistic kernelized c-means algorithms for image segmentation and a comparative analysis. Appl. Soft Comput. 46, 886–923 (2016) 64. Vullo, A., Bortolami, O., Pollastri, G., Tosatto, S.C.E.: Spritz: a server for the prediction of intrinsically disordered regions in protein sequences using kernel machines. Nucleic Acids Res. 34(suppl_2), W164–W168 (2006). https://doi.org/10.1093/nar/gkl166 65. Wang, H., Li, J., Hou, Z., Fang, R., Mei, W., Huang, J.: Research on parallelized real-time map matching algorithm for massive GPS data. Clust. Comput. 20(2), 1123–1134 (2017). https:// doi.org/10.1007/s10586-017-0869-5
References
247
66. Wang, C., Li, X., Zhou, X., Wang, A., Nedjah, N.: Soft computing in Big Data intelligent transportation systems. Appl. Soft Comput. 38, 1099–1108 (2016) 67. Wang, Z., Tu, L., Guo, Z., Yang, L.T., Huang, B.: Analysis of user behaviors by mining large network data sets. Future Gener. Comput. Syst. 37, 429–437 (2014) 68. Ward, J.J., McGuffin, L.J., Bryson, K., Buxton, B.F., Jones, D.T.: The DISOPRED server for the prediction of protein disorder. Bioinformatics 20(13), 2138–2139 (2004). https://doi.org/ 10.1093/bioinformatics/bth195 69. Xu, Z., Mei, L., Hu, C., Liu, Y.: The big data analytics and applications of the surveillance system using video structured description technology. Clust. Comput. 19(3), 1283–1292 (2016). https://doi.org/10.1007/s10586-016-0581-x 70. Xue, B., Dunbrack, R.L., Williams, R.W., Dunker, A.K., Uversky, V.N.: Pondr-fit: a metapredictor of intrinsically disordered amino acids. Biochim. Biophys. Acta (BBA) - Proteins Proteomics 1804(4), 996–1010 (2010). http://www.sciencedirect.com/science/article/pii/ S1570963910000130 71. Yang, C.T., Chen, S.T., Yan, Y.Z.: The implementation of a cloud city traffic state assessment system using a novel big data architecture. Clust. Comput. 20(2), 1101–1121 (2017). https:// doi.org/10.1007/s10586-017-0846-z 72. Yang, Z.R., Thomson, R., McNeil, P., Esnouf, R.M.: RONN: the bio-basis function neural network technique applied to the detection of natively disordered regions in proteins. Bioinformatics 21(16), 3369–3376 (2005). https://doi.org/10.1093/bioinformatics/bti534 73. Zaharia, M., Xin, R.S., Wendell, P., Das, T., Armbrust, M., Dave, A., Meng, X., Rosen, J., Venkataraman, S., Franklin, M.J., Ghodsi, A., Gonzalez, J., Shenker, S., Stoica, I.: Apache Spark: a unified engine for big data processing. Commun. ACM 59(11), 56–65 (2016). https:// doi.org/10.1145/2934664 74. Zhang, T., Faraggi, E., Li, Z., Zhou, Y.: Intrinsic disorder and Semi-disorder prediction by SPINE-D, pp. 159–174. Springer, New York (2017). https://doi.org/10.1007/978-1-49396406-2_12 75. Zhong, Y., Zhang, L., Xing, S., Li, F., Wan, B.: The Big Data processing algorithm for water environment monitoring of the three gorges reservoir area. Abstr. Appl. Anal. 2014 (2014) 76. Zou, Q., Hu, Q., Guo, M., Wang, G.: HAlign: fast multiple similar DNA/RNA sequence alignment based on the centre star strategy. Bioinformatics 31(15), 2475–2481 (2015)
Part IV
Multi-threaded Solutions for Protein Bioinformatics
Multi-threaded solutions allow multiple threads to be executed within a single process in order to accelerate the computations related to that process, especially for compute-intensive tasks. In this part of the book, we will see two solutions that use the idea of multi-threading to perform parallel calculations in computationally complex alignment processes. In Chap. 10, we will utilize graphics processing units (GPUs) with CUDA compute capabilities and many threads in 3D protein structure similarity searching. In Chap. 11, we will see the exploration of protein secondary structures in a relational database that makes use of many-core CPUs to speed up the exploration and querying.
Chapter 10
Massively Parallel Searching of 3D Protein Structure Similarities on CUDA-Enabled GPU Devices
The structural alignment between two proteins: is there a unique answer?
Adam Godzik

Proteins are the machinery of living tissue that builds the structures and carries out the chemical reactions necessary for life.
Michael Behe
Abstract Finding common molecular substructures in complex 3D protein structures is still challenging. This is especially visible when scanning entire databases containing tens or even hundreds of thousands of protein structures. Graphics processing units (GPUs) and general-purpose graphics processing units (GPGPUs) promise a high speedup of many time-consuming and computationally demanding processes over their original implementations on CPUs. In this chapter, we will see that massive parallelization of 3D structure similarity searching on many-core CUDA-enabled GPU devices reduces the execution time of the process and allows it to be performed in real time.

Keywords Proteins · 3D protein structure · Graphics processing units · Similarity searching · Structural alignment · Parallel computing · GPU · CUDA
10.1 Introduction

As we know from previous chapters, 3D protein structure similarity searching is a process in which we try to find matching fragments within two or more protein structures. In the frequently performed one-to-many scenario, a given protein structure is compared to another protein structure or to a set of protein structures collected in a database or other repository. On the basis of the similarities between proteins found during this process, scientists can draw useful conclusions about the common ancestry of proteins, and thus of the organisms that the proteins came from, their evolutionary relationships, functional similarities, the existence of common functional regions, and many other things [6]. This process is especially important in situations where sequence similarity searches fail or deliver too few clues [14]. There are also other processes in which protein structure similarity searching plays a supportive role, such as the validation of predicted protein models [25]. Finally, we believe that in the very near future scientists will have the opportunity to study the beautiful structures of proteins as a regular diagnostic procedure that will utilize comparison methods to highlight areas of proteins that are inadequately constructed, leading to dysfunctions of the body and serious diseases. This goal is currently motivating work on the development of similarity searching methods that return results in real time.
10.1.1 What Makes a Problem

We also know that although protein structure similarity searching belongs to the group of primary tasks performed in structural bioinformatics, it is still a very difficult and time-consuming process. Three key factors are decisive here:
1. the 3D structures of proteins are highly complex,
2. the similarity searching process is computationally complex,
3. the number of 3D structures stored in macromolecular data repositories such as the Protein Data Bank (PDB) [2] is growing exponentially.
Among these three problems, bioinformaticians can attempt to ease the second one by developing new, more efficient algorithms, and, at least partially, to help with the first one by using appropriate representative features of protein 3D structures that can then be fed into their algorithms. The collection of algorithms that have been developed for protein structure similarity searching over the last two decades is large, and it includes methods such as VAST [13], DALI [16, 17], LOCK2 [50], FATCAT [56], CTSS [7], CE [51], FAST [61], and others [36, 47]. These methods use various representative features when performing protein structure similarity searches in order to reduce the huge search space. For example, local geometric features and selected biological characteristics are used in the CTSS [7] algorithm. Shape signatures that include information on Cα atom positions, torsion angles, and the types of secondary structure present are calculated for each residue in a protein structure. The very popular DALI algorithm [16, 17] compares proteins based on distance matrices built for each of the compared proteins. Each cell of a distance matrix contains the distance between the Cα atoms of a pair of residues in the same structure (inter-residue distances). Fragments of 6 × 6 elements of the matrix are called contact patterns, which are compared between two proteins to find the best match. On the other hand, the VAST algorithm [13], which is available through the Web site of the National Center for Biotechnology Information (NCBI), uses secondary structure elements (SSEs: α-helices and β-sheets), which form the cores of the compared proteins. These SSEs are then mapped to representative vectors, which simplifies the analysis and comparison process. During the comparison, the algorithm attempts to match vectors of pairs of protein structures. Other methods, like LOCK2 [50], also utilize the SSE representation of protein structures in the comparison process. The CE [51] algorithm uses the combinatorial extension of an alignment path formed by aligned fragment pairs (AFPs). AFPs are fragments of both structures that indicate clear structural similarity and are described by local geometrical features, including the positions of Cα atoms. The idea of AFPs is also used in FATCAT [56]. A more detailed overview of methodologies used for protein structure comparison and similarity searching is given in [3, 8, 9].
Even though better methods are developed every year, performing a protein structure similarity search against a whole database of protein 3D structures is still a challenge. As shown in the works [30, 34] on the effectiveness and scalability of the process, performing the search with the FATCAT algorithm for a sample query protein structure using twenty alignment agents working in parallel took 25 hours (without applying any additional acceleration techniques). The tests were carried out using a database containing the 3D structures of 106,858 protein chains. This shows how time-consuming the process can be, and it is one of the main motivations for designing and developing the new methods that are reported every year, such as RAPIDO [35], FS-EAST [31], DEDAL [11], MICAN [33], CASSERT [36], ClusCo [19], and others [41, 43, 57, 58, 60].
10.1.2 CUDA-Enabled GPUs in Processing Biological Data

The performance issues of various algorithms used to solve different problems have created the need for computational solutions, like CUDA-enabled GPU devices, that speed up the execution of these algorithms. The computational potential of GPU devices has also been noticed by specialists working in the domain of life sciences, including bioinformatics. Given the successful applications of GPUs in the fields of sequence similarity [26–28, 32, 44, 48, 54], phylogenetics [55], molecular dynamics [12, 45], and microarray data analysis [5], it is clear that GPU devices are beginning to play a significant role in 3D protein structure similarity searching. It is worth mentioning three related GPU-based implementations of the process. These methods use different representations of protein structures and different computational procedures, but demonstrate a clear improvement in performance over the CPU-based implementations. The first one, SA Tableau Search presented in [53], uses simulated annealing for tableau-based protein structure similarity searching. Tableaux are based on the orientations of secondary structure elements and distance matrices. The GPU-based implementation of the algorithm parallelizes two areas: multiple iterations of the simulated annealing procedure and multiple comparisons of the query protein structure to many database structures. The second one, called pssAlign [42], consists of two alignment phases: fragment-level alignment and residue-level alignment. Both phases use dynamic programming [1]. In the fragment-level alignment phase, so-called seeds between the target protein and each database protein are used to generate initial alignments. These seeds are represented by the locations of the Cα atoms. The initial alignments are then refined in the residue-level alignment phase. pssAlign parallelizes both alignment phases. The third GPU-based approach was proposed by Leinweber et al. [22–24] for structural comparisons of protein binding sites. This approach relies on a feature-based representation of protein structure and on graph comparisons. The authors reported a significant superiority of GPU-based executions of the comparison process over the CPU-based ones in terms of the runtimes of the performed experiments. The implementations proposed by Leinweber et al. focus on the comparison of protein binding sites, not protein structures as a whole. In contrast, fold-based methods for protein comparison, like SA Tableau Search, pssAlign, and the GPU-CASSERT presented in this chapter, focus on entire protein structures.
In the following sections, we will see the GPU-based implementation of CASSERT [36], one of the newest algorithms for 3D protein structure similarity searching. Like pssAlign, CASSERT is based on two-phase alignment. However, it uses an extended set of structural features to describe protein structures, and the computational procedure differs too. Originally, CASSERT was designed and implemented as a CPU-based procedure, and its effectiveness is reported in [36]. Its GPU-based implementation will be referred to as GPU-CASSERT throughout the chapter.
10.2 CASSERT for Protein Structure Similarity Searching

CASSERT is a two-phase algorithm used for finding similarities in protein structures. 3D protein structure similarity searching is typically realized by performing pairwise comparisons of the query protein (Q) specified by the user with successive proteins (D) from the database of protein structures (the one-to-many comparison scenario). Here, we will see how protein structures are represented in both phases of the comparison process performed by the CASSERT. Let us assume that Q represents the structure of the query protein that is q residues (amino acids) long, and D is the structure of a candidate protein in the database that is d residues (amino acids) long. In the first phase of the alignment algorithm, protein structures Q and D are compared by aligning their reduced chains of secondary structures formed by secondary structure elements $SE_i$:

$$Q = (SE_1^Q, SE_2^Q, \ldots, SE_n^Q), \qquad (10.1)$$

where $n \le q$ is the number of secondary structures in the chain of the query protein Q, and

$$D = (SE_1^D, SE_2^D, \ldots, SE_m^D), \qquad (10.2)$$

where $m \le d$ is the number of secondary structures in the chain of the database protein D.
Fig. 10.1 Secondary structure elements: (left) four α-helices in a sample structure (PDB ID: 1CE9 [29]), (right) two β-strands joined by a loop in a sample structure (PDB ID: 1E0Q [59]); visualized by MViewer [52]. Full and reduced chains of secondary structure elements for the marked subunit (left) and the whole structure (right) are visible below
Each element $SE_i$, which is a part of the chain that has been selected on the basis of its secondary structure, is characterized by two values, i.e.,

$$SE_i = [SSE_i, L_i], \qquad (10.3)$$

where $SSE_i$ describes the type of the secondary structure selected, and $L_i$ is the length of the ith element $SE_i$ (measured in residues). The alignment method distinguishes between three basic types of secondary structures (Fig. 10.1):
• α-helix (H),
• β-sheet or β-strand (E),
• loop, turn, coil, or undetermined structure (L).
Elements $SE_i^Q$ and $SE_j^D$, hereinafter referred to as SE regions or SE fragments, are built from groups of adjacent amino acids that form the same type of secondary structure. For example, six successive residues folded into an α-helix form one SE region. Hence, the overall protein structures are, at this stage, represented by the reduced chains of secondary structures.
In the second phase of the alignment algorithm, protein structures Q and D are represented in more detail. At the residue level, successive residues are described by so-called molecular residue descriptors $s_i$. Proteins are represented as chains of descriptors $s_i$:

$$Q = (s_1^Q, s_2^Q, \ldots, s_q^Q), \qquad (10.4)$$
where q is the length of the query protein Q (i.e., the number of residues it contains), and each $s_i^Q$ corresponds to the ith residue in the chain of protein Q, and

$$D = (s_1^D, s_2^D, \ldots, s_d^D), \qquad (10.5)$$

where d is the length of the database protein D, and each $s_i^D$ corresponds to the ith residue in the chain of protein D. Each descriptor $s_i$ is defined by the following vector of features:

$$s_i = \langle |C_i|, \gamma_i, SSE_i, r_i \rangle, \qquad (10.6)$$

where $|C_i|$ is the length of the vector between the Cα atoms of the ith and (i+1)th amino acids in a protein chain, $\gamma_i$ is the angle between the successive vectors $C_i$ and $C_{i+1}$, $SSE_i$ is the type of secondary structure formed by the ith residue, and $r_i$ is the type of amino acid (Fig. 10.2).
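For readers who prefer code, the two representations translate directly into plain data structures. The following is a minimal sketch in CUDA C (hypothetical types, not taken from the GPU-CASSERT sources); the enum values and field types are assumptions made for illustration:

    // A minimal sketch of the two protein representations used by CASSERT.
    // Types of secondary structure distinguished by the method (H, E, L).
    enum SSEType { HELIX = 0, STRAND = 1, LOOP = 2 };

    // One SE region of the reduced chain (Eq. 10.3): type and length in residues.
    struct SERegion {
        SSEType type;   // SSE_i
        int     len;    // L_i
    };

    // Molecular residue descriptor (Eq. 10.6): |C_i|, gamma_i, SSE type, residue.
    struct ResidueDesc {
        float   clen;   // |C_i|, length of the vector between Calpha atoms
        float   gamma;  // gamma_i, angle between vectors C_i and C_i+1
        SSEType sse;    // SSE_i, secondary structure type of residue i
        int     aa;     // r_i, amino acid type (e.g., an index into BLOSUM62)
    };

A reduced chain is then simply an array of SERegion values, and a full chain an array of ResidueDesc values; these hypothetical types are reused in the sketches later in this chapter.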
Fig. 10.2 Structural features included in molecular residue descriptors marked on part of a sample protein structure: residue type (Met, Gln, Ile, Phe), secondary structure type (β-strand in this case), length of the vector between Cα atoms ($|C_i|$), and the $\gamma_i$ angle
Fig. 10.3 Overview of the two-phase alignment algorithm (panels: phase 1 similarity matrix $SSE$; phase 2 similarity matrix $S$). In phase 1, low-resolution alignment is performed; protein structures are represented as reduced chains of secondary structures; the similarity matrix $SSE$ used in the alignment is small, proportional to the number of secondary structures in both proteins. In phase 2, high-resolution alignment is performed; protein structures are represented as chains of molecular residue descriptors; the similarity matrix $S$ used in the alignment is therefore large, proportional to the lengths of both proteins
10.2.1 General Course of the Matching Method

Pairwise comparisons of protein 3D structures are performed using the matching method, which consists of two phases (Fig. 10.3):
1. The first phase involves the coarse alignment of spatial structures represented by secondary structure elements (SSEs). This is the low-resolution alignment phase, because groups of amino acids occurring in each structure are grouped into one representative element (the SE region). This phase allows us to run fast alignments in which small similarity matrices are constructed. This eliminates the need for computationally costly alignments for proteins that are entirely dissimilar. Proteins that exhibit secondary structure similarity are subjected to more thorough analysis in the second phase.
2. The second phase involves the detailed alignment of spatial structures represented by the molecular residue descriptors. This alignment is performed based on the results of the coarse alignment realized in the first phase. The second phase is the high-resolution alignment phase, because amino acids are not grouped in it. Instead, each amino acid found in the structure is represented by the corresponding molecular residue descriptor $s_i$. Therefore, in this phase CASSERT aligns sequences of molecular residue descriptors using much larger similarity matrices than those utilized in the first phase. In the second phase, the algorithm analyzes more features describing protein structures, and the protein itself is represented in more detail.
In both phases, the alignments are carried out using dynamic programming procedures that are specifically adapted to the molecular descriptions of protein structures in each phase. The detailed courses of both alignment phases are described in the following sections.
10.2.2 First Phase: Low-Resolution Alignment

The low-resolution alignment phase is performed in order to filter out molecules that do not show secondary structural similarity. Originally, this phase was also used to establish initial alignments that were projected onto the similarity matrix in the second phase. However, since both phases are executed independently in the GPU-based implementation, alignment paths are not transferred between alignment phases in the GPU-based approach. In order to match the structures of proteins Q and D that are represented as reduced chains of secondary structures, the algorithm builds the similarity matrix $SSE$ of size n × m, where n and m describe the numbers of secondary structures in the compared chains of proteins Q and D. Successive cells of the $SSE$ matrix are filled according to the following rules. For $0 \le i \le n$ and $0 \le j \le m$:

$$SSE_{i,0} = SSE_{0,j} = 0, \qquad (10.7)$$

$$SSE_{i,j}^{(1)} = SSE_{i-1,j-1} + \delta_{ij}, \qquad (10.8)$$

$$SSE_{i,j}^{(2)} = E_{i,j}, \qquad (10.9)$$

$$SSE_{i,j}^{(3)} = F_{i,j}, \qquad (10.10)$$

$$SSE_{i,j} = \max_{v=1..3}\left\{SSE_{i,j}^{(v)}, 0\right\}, \qquad (10.11)$$

where $\delta_{ij}$ is the similarity reward, which reflects the degree of similarity between the two regions $SE_i^Q$ and $SE_j^D$ of proteins Q and D, respectively, and the vectors E and F define the possible horizontal and vertical penalties for inserting a gap. The similarity reward $\delta_{ij}$ takes values from the interval [0, 1], where 0 means no similarity and 1 means the regions are identical. The degree of similarity is calculated using the formula:

$$\delta_{ij} = \sigma_{ij} - \sigma_{ij} \cdot \frac{|L_j^D - L_i^Q|}{L_j^D + L_i^Q}, \qquad (10.12)$$

where $L_i^Q$, $L_j^D$ are the lengths of the compared regions $SE_i^Q$ and $SE_j^D$, while $\sigma_{ij}$ describes the similarity degree of the secondary structures building the ith and jth SE regions of the compared proteins Q and D. This parameter can take three possible values according to the following rules:
(i) $\sigma_{ij} = 1$, when both SE regions have the same secondary structure of α-helix or β-strand;
(ii) $\sigma_{ij} = 0.5$, when at least one of the regions is a loop, turn, or coil, or its secondary structure is undefined;
(iii) $\sigma_{ij} = 0$, when one of the regions has the construction of an α-helix and the other the construction of a β-strand.
The values of the gap penalty vectors are calculated as follows:

$$E_{i,j} = \max\left\{\begin{array}{l} E_{i-1,j} - g_E \\ SSE_{i-1,j} - g_O \end{array}\right\}, \qquad (10.13)$$

$$F_{i,j} = \max\left\{\begin{array}{l} F_{i,j-1} - g_E \\ SSE_{i,j-1} - g_O \end{array}\right\}. \qquad (10.14)$$

In order to assess the similarity between two reduced chains of secondary structures, CASSERT uses the Score measure, which is equal to the highest value in the similarity matrix $SSE$:

$$Score = \max_{i,j}\{SSE_{i,j}\}. \qquad (10.15)$$

The auxiliary vectors E and F allow us to perform the alignment procedure and to calculate the Score similarity measure in linear space, because the value of cell $SSE_{i,j}$ depends only on the values of the cells $SSE_{i-1,j-1}$, $SSE_{i-1,j}$, and $SSE_{i,j-1}$. During the calculation of the similarity matrix $SSE$, CASSERT has to store the position of the maximum value of the Score in the matrix as well as the value itself.
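To make the recurrence concrete, the sketch below computes the Score measure (Eqs. 10.7–10.15) for two reduced chains of secondary structures in linear space, keeping only one row of the matrix $SSE$ and of the vector F. It reuses the hypothetical SERegion and SSEType types sketched earlier; this is a CPU-side illustration, not the GPU-CASSERT source, and the gap penalties gO and gE are assumed parameters:

    #include <algorithm>
    #include <cstdlib>
    #include <vector>

    // Similarity degree sigma_ij of two secondary structure types, rules (i)-(iii).
    float sigma(SSEType a, SSEType b) {
        if (a == LOOP || b == LOOP) return 0.5f;  // rule (ii)
        return (a == b) ? 1.0f : 0.0f;            // rules (i) and (iii)
    }

    // Similarity reward delta_ij for two SE regions (Eq. 10.12).
    float delta(const SERegion& q, const SERegion& d) {
        float s = sigma(q.type, d.type);
        return s - s * std::abs(d.len - q.len) / float(d.len + q.len);
    }

    // Score of the first-phase alignment (Eqs. 10.7-10.15) in linear space.
    float phase1Score(const std::vector<SERegion>& Q, const std::vector<SERegion>& D,
                      float gO, float gE) {
        const size_t n = Q.size(), m = D.size();
        std::vector<float> prev(n + 1, 0.0f), cur(n + 1, 0.0f); // rows j-1 and j of SSE
        std::vector<float> F(n + 1, 0.0f);                      // vertical gaps (Eq. 10.14)
        float score = 0.0f;                                     // Eq. 10.15
        for (size_t j = 1; j <= m; ++j) {
            float E = 0.0f;                                     // horizontal gap (Eq. 10.13)
            for (size_t i = 1; i <= n; ++i) {
                E    = std::max(E - gE, cur[i - 1] - gO);
                F[i] = std::max(F[i] - gE, prev[i] - gO);
                float diag = prev[i - 1] + delta(Q[i - 1], D[j - 1]); // Eq. 10.8
                cur[i] = std::max({diag, E, F[i], 0.0f});             // Eq. 10.11
                score  = std::max(score, cur[i]);
            }
            std::swap(prev, cur);
        }
        return score;
    }

In GPU-CASSERT the same recurrence is evaluated by thousands of threads at once, one database chain per thread, as described in Sect. 10.3.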
10.2.3 Second Phase: High-Resolution Alignment

Molecules that pass the first phase (based on the user-defined cutoff value) are further aligned in the second phase. A database protein structure is qualified for the second phase if the following condition is satisfied:

$$\frac{Score_{QD}}{Score_{QQ}} \ge Q_t, \qquad (10.16)$$

where $Score_{QD}$ is the similarity measure obtained when matching the query protein structure to the database protein structure, $Score_{QQ}$ is the similarity measure obtained when matching the query protein structure to itself (i.e., the maximum Score that the compared chain can achieve), and $Q_t \in [0, 1]$ is a user-defined qualification threshold for structural similarity.
The second phase is performed similarly to the first phase, except that the alignment is carried out at the residue level, where the aligned molecules Q and D are represented by chains of molecular residue descriptors. However, the way that GPU-CASSERT calculates the similarity reward for two compared residue molecular descriptors $s_i$ and $s_j$ is different. The similarity reward $ss_{ij}$ is calculated according to the following formula:

$$ss_{ij} = w_C \sigma_{ij}^C + w_\gamma \sigma_{ij}^\gamma + w_{SSE} \sigma_{ij}^{SSE} + w_r \sigma_{ij}^r, \qquad (10.17)$$

where $\sigma_{ij}^C$ is the degree of similarity of a pair of vectors $C_i^Q$ and $C_j^D$ in proteins Q and D, $\sigma_{ij}^\gamma$ is the similarity of the angles $\gamma_i^Q$ and $\gamma_j^D$ in proteins Q and D, $\sigma_{ij}^{SSE}$ is the degree of similarity of the secondary structures of residues i and j (calculated according to the rules (i)–(iii) listed for the first phase), $\sigma_{ij}^r$ is the degree of similarity of the residues defined by means of the BLOSUM62 substitution matrix [15] normalized to the range [0, 1], and $w_C$, $w_\gamma$, $w_{SSE}$, $w_r$ are the weights of all of the components (with a default value of 1).
The similarity of the vectors $C_i^Q$ and $C_j^D$ is defined according to the formula:

$$\sigma_{ij}^C = e^{-\frac{1}{\kappa}\left(|C_i^Q| - |C_j^D|\right)^2}, \qquad (10.18)$$

for $\kappa = 1$, where $|C_i^Q|$ and $|C_j^D|$ are the lengths of the vectors $C_i^Q$ and $C_j^D$, respectively, and the similarity of the angles $\gamma_i^Q$ and $\gamma_j^D$ is defined as follows:

$$\sigma_{ij}^\gamma = e^{-\frac{1}{\kappa}\left(\gamma_i^Q - \gamma_j^D\right)^2}, \qquad (10.19)$$

for $\kappa = 4$. In high-resolution alignment, the value of the degree of similarity of molecular residue descriptors $ss_{ij}$ (Eq. 10.17) replaces the similarity reward $\delta_{ij}$ (Eq. 10.8). The relative strength of each component in the similarity search (Eq. 10.17) can be controlled using the participation weights. The default value for each is 1, but this can be changed by the user. For example, researchers who are looking for surprising structural similarities without sequence similarity can disable the component for the primary structure by setting $w_r = 0$. The Score similarity measure, the basic measure of the similarity of protein structures, is also calculated in this phase. Its value incorporates all possible rewards for a match, mismatch penalties, and penalties for inserting gaps in the alignment. The Score is also used to rank the highly similar proteins that are returned by GPU-CASSERT.
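The similarity reward of Eq. 10.17 translates directly into a small scoring function. The sketch below reuses the hypothetical ResidueDesc type and the sigma() helper from the earlier sketches; the blosum01 table stands for BLOSUM62 rescaled to [0, 1], and its layout is an assumption made for illustration:

    #include <cmath>

    // Similarity reward ss_ij for two molecular residue descriptors (Eq. 10.17).
    // Default participation weights of 1 reproduce the behavior described above.
    float ssReward(const ResidueDesc& q, const ResidueDesc& d,
                   const float blosum01[24][24],
                   float wC = 1.0f, float wG = 1.0f, float wS = 1.0f, float wR = 1.0f) {
        float dc = q.clen - d.clen;
        float sigC = std::exp(-dc * dc);              // Eq. 10.18 with kappa = 1
        float dg = q.gamma - d.gamma;
        float sigG = std::exp(-dg * dg / 4.0f);       // Eq. 10.19 with kappa = 4
        float sigS = sigma(q.sse, d.sse);             // rules (i)-(iii)
        float sigR = blosum01[q.aa][d.aa];            // normalized BLOSUM62
        return wC * sigC + wG * sigG + wS * sigS + wR * sigR;
    }

Setting wR to 0, for instance, reproduces the use case mentioned above in which sequence similarity is ignored entirely.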
10.2.4 Third Phase: Structural Superposition and Alignment Visualization

In the third phase, the algorithm performs superposition of the protein structures on the basis of the aligned chains of molecular residue descriptors. The purpose of this step is to match two protein structures by performing a set of rotation and translation operations that minimizes the root mean square deviation (RMSD):

$$RMSD = \sqrt{\frac{1}{N}\sum_{i=1}^{N} d_i^2}, \qquad (10.20)$$

where N is the number of aligned Cα atoms in the protein backbones, and $d_i$ is the distance between the ith pair of atoms. Two approaches are widely used to complete this step. One of the approaches uses quaternions [18]. CASSERT uses the approach proposed by Kabsch [20, 21], which makes use of the singular value decomposition (SVD) technique. These two approaches are said to be computationally equivalent [10], but in some circumstances one may be more convenient than the other. CASSERT performs the superposition of protein structures on the CPU of the host workstation. In this phase, CASSERT also calculates the full similarity matrix S in order to allow backtracking from the maximum value and full visualization of the structural alignment at the residue level. This step is performed on the CPU of the host and only for a limited number (M, which is configured by the user) of the most similar molecules.
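Once two structures have been superposed, Eq. 10.20 amounts to only a few lines of code. The sketch below is a hypothetical illustration (the Vec3 type is an assumption); it expects the two coordinate arrays to be already superposed, e.g., by the Kabsch procedure:

    #include <cmath>
    #include <cstddef>

    struct Vec3 { float x, y, z; };

    // RMSD over N aligned pairs of Calpha atoms (Eq. 10.20).
    float rmsd(const Vec3* a, const Vec3* b, std::size_t N) {
        double sum = 0.0;
        for (std::size_t i = 0; i < N; ++i) {
            double dx = a[i].x - b[i].x;
            double dy = a[i].y - b[i].y;
            double dz = a[i].z - b[i].z;
            sum += dx * dx + dy * dy + dz * dz;  // d_i squared
        }
        return static_cast<float>(std::sqrt(sum / N));
    }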
10.3 GPU-Based Implementation of the CASSERT

GPU devices can accelerate calculations greatly, but this also requires adapting to the CUDA programming model. GPU-CASSERT is a massively parallel software tool for searching 3D protein structure similarities on CUDA-enabled GPU devices. Like most of the solutions presented in this book, it was developed in the Institute of Informatics at the Silesian University of Technology in Gliwice, Poland. The development work started in 2012. The main constructors of the whole solution were Miłosz Brożek (as part of his Master's thesis [4]), Bożena Małysiak-Mrozek, and me. At the time of development, we already had considerable experience gained while developing a similar GPU-based tool for protein sequence similarity [44].
GPU-CASSERT [37] implements the one-to-many comparison scenario for 3D protein structure similarity searching, in which a given protein is compared to a collection of many proteins stored in a repository. In the case of the GPU-CASSERT, a relational database was used to store the information on protein chains and the locations of the atoms of their 3D structures. Therefore, execution of the CASSERT algorithm on a GPU device also involves appropriate extraction and preparation of the data that will be processed. In this section, we will focus on the GPU-based implementation of the CASSERT and data preparation, and we will gain insight into the implementation details of both CASSERT alignment phases on the GPU.
10.3.1 Data Preparation

Early tests of the first implementations of the CASSERT algorithm on GPU devices showed that read operations from the database system storing structural data were too slow and could become a bottleneck in the entire similarity searching process. Therefore, the present implementation of the GPU-CASSERT does not read data directly from the database, because a single execution of the searching procedure would take too long. GPU-CASSERT uses binary files instead. These files contain data packages that are ready to be sent to the GPU device. The only data that are read directly from the database are those that describe the query protein structure Q. But, even in this situation, the data are stored in an appropriate way in binary files. Using binary files with data packages allows the initialization time of the GPU device to be reduced severalfold. This is necessary to ensure that the GPU-CASSERT has a fast response time. The binary files are refreshed in two cases:
• changes in the content of the database,
• changes in the parameters affecting the construction of data packages.
Data packages that are sent to the GPU device have the same general structure, regardless of what is stored inside. Due to the size of the data packages utilized by the CASSERT algorithm, these packages are placed in the global memory of the GPU device. As we know from Sect. 2.4, where we discussed GPUs and CUDA, global memory is the slowest type of memory available. For this reason, it is worth minimizing the number of accesses made to this type of memory. Access operations are carried out in 32-, 64-, or 128-byte transactions. When a warp (which is composed of 32 threads) reaches a read/write operation, the GPU device attempts to perform this operation using a minimum number of transactions. Basically, the greater the number of transactions needed, the greater the amount of unnecessary data transmitted. This unnecessary overhead can be minimized for CUDA 2.x if the memory cells that are read by all warp threads are located within a single 128-byte memory segment. In order to satisfy this condition, the address of this area must be aligned to 128 bytes and the threads need to read data from adjacent memory cells. For devices with compute capabilities 1.0 or 1.1, on which the GPU-CASSERT can also run, there is the additional restriction that warp threads must be in the same order as the memory cells being read [40]. If these conditions are met, we can get 4 bytes of data for each of the threads in a single 128-byte transaction. These 4 bytes correspond to a single number of type int or float, which is used while encoding data in data packages. The preferred allocation of a 128-byte memory segment to threads is presented in Fig. 10.4.

Fig. 10.4 Preferred allocation of a 128-byte memory segment to warp threads. Thread 0 takes the first 4 bytes of the transaction, thread 1 takes the next 4 bytes, etc.

Data are transmitted to the GPU device in the form of a two-dimensional array of unsigned integers (Fig. 10.5). The array is organized in row-major order. This means that the cells in adjacent columns are located next to each other in memory. This has an important influence on performance when processing an array, because contiguous array cells can usually be accessed more quickly than cells that are not contiguous. Each column of the array is assigned to a single block thread. Threads start at an index given by the following code:

    int tid = blockIdx.x * blockDim.x + threadIdx.x;

Fig. 10.5 Arrangement of chains of structural descriptors S1, S2, S3, ... in a memory array. Block threads are assigned to particular columns. One column stores one chain of structural descriptors (secondary structure elements or components of molecular residue descriptors). Each cell contains 4 bytes of data (structural descriptors). All block threads read contiguous memory areas (coalesced access)
where blockIdx.x is the block index along the x dimension (GPU-CASSERT uses one-dimensional blocks), blockDim.x stores the number of threads along the x dimension of the block, and threadIdx.x is the thread index within the block. A single chain of structural descriptors is stored in a single column of the array (Fig. 10.5). Such a solution satisfies the condition that contiguous addresses must be read, because block threads will always read adjacent cells, moving from the beginning to the end of the chain (from top to bottom). Every cell in the array is 4 bytes in size, so the transfer of data to a warp's 32 threads will be made in one 128-byte read transaction. This allows full advantage to be taken of the data transfer from the memory to the registers of the GPU device. This way of organizing data in memory is used and described in [32, 44].
Another factor affecting the performance is the density at which the data are packed in memory cells. The distribution of data in memory cells depends on the phase of the algorithm and the type of structural descriptors that are used in that phase. There are five types of data that are sent to the memory of the GPU device:
• reduced chains of secondary structures formed by secondary structure elements $SE_i$ (phase 1),
• secondary structure elements $SSE_i$ that are components of non-reduced chains of molecular residue descriptors (phase 2),
• amino acid residue types $r_i$ that are components of non-reduced chains of molecular residue descriptors (phase 2),
• lengths of the vectors between the Cα atoms of subsequent residues that are components of non-reduced chains of molecular residue descriptors (phase 2),
• $\gamma_i$ angles between successive vectors $C_i$ and $C_{i+1}$ that are components of non-reduced chains of molecular residue descriptors (phase 2).
Regardless of the type of data present in the memory cells, the chains included in the package may be of various lengths. For this reason, all chains of structural descriptors are aligned to the length of the longest chain. Empty cells are filled with zeros. In principle, comparing these zeros in the course of the algorithm does not affect the assumed scoring system or the final results. Chains of structural descriptors (secondary structure elements or components of molecular residue descriptors) contained in a data package are sorted by their lengths in ascending order. In this way, we minimize the differences in processing time for individual block threads and their idle times (threads that have already completed their work must wait for the other threads to finish processing). A similar method is used in the work presented in [32, 44]. Data packages are divided into subpackages. Each subpackage consists of 32 chains of structural descriptors. This is exactly the same as the number of warp threads. The sketch below illustrates this column-wise layout from a kernel's point of view.
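The following fragment is a minimal, hypothetical CUDA kernel skeleton (not the GPU-CASSERT source) showing how the column-per-thread layout yields coalesced reads; the array layout and parameter names are assumptions made for illustration:

    // One thread processes one chain (one column of the array). In iteration k,
    // the 32 threads of a warp read 32 adjacent 4-byte cells of row k, so the
    // access is served by a single 128-byte transaction (coalesced access).
    __global__ void scanChains(const unsigned int* chains, // row-major, maxLen x numChains
                               int numChains, int maxLen) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;   // column index = chain index
        if (tid >= numChains) return;
        for (int k = 0; k < maxLen; ++k) {
            unsigned int cell = chains[k * numChains + tid];
            // ... decode the structural descriptors packed in 'cell' and use them
            // in the dynamic programming recurrence (Eqs. 10.7-10.11) ...
            (void)cell;  // placeholder so the sketch compiles
        }
    }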
10.3.2 Implementation of Two-Phase Structural Alignment in a GPU

Implementation of the two-phase structural alignment algorithm in a GPU with CUDA requires a dedicated approach. GPU-CASSERT operates according to Algorithm 7. In both alignment phases, the similarity matrix is stored in the global memory of the GPU device as an array of type float. This means that a read/write of a single element requires just one transaction. It is also worth noting that, due to memory restrictions, each thread remembers only the last row of the similarity matrix. This is sufficient to determine the maximum element of the similarity matrix, which also provides a value for the Score similarity measure, which is needed to check whether a database structure qualifies for the second phase. The similarity measure alone is sufficient to assess the quality of the alignment before the second phase.

Algorithm 7 GPU-CASSERT: a general algorithm
1: Read data packages describing database protein structures from binary files
2: Read query protein structure (Q) from database and create appropriate data package with query profile
3: for all database proteins D do
4:   Perform (in parallel) the first phase of the structural alignment on the GPU device
5: end for
6: Qualify proteins for the second phase according to formula (10.16)
7: Prepare data packages describing database protein structures for the second phase
8: Read data packages describing database protein structures from binary files
9: Read query protein structure (Q) from database and create appropriate data packages with query profiles
10: for all qualified database proteins D do
11:   Perform (in parallel) the second phase of the structural alignment on the GPU device
12: end for
13: Return a list of the top M database molecules that are most similar to the query molecule, together with similarity measures
14: if the user wants to visualize the alignment then
15:   Perform the second phase on the CPU of the host computer for the molecules from the list of those most similar to the query molecule returned by the GPU device
16:   Perform structural superposition
17:   Return alignment visualization to the user
18: end if
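In host code, the outer flow of Algorithm 7 might look roughly as follows. This is a heavily simplified, hypothetical sketch: the kernel, the Package type, and the function names are assumptions, the kernel body is only a stub, and error handling is omitted:

    #include <cuda_runtime.h>
    #include <vector>

    // Stub of the phase-1 kernel; the real kernel evaluates Eqs. 10.7-10.15
    // (see Algorithm 8) and writes one Score_QD/Score_QQ ratio per chain.
    __global__ void phase1Kernel(const unsigned int* chains, int numChains,
                                 int maxLen, float* ratios) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < numChains) ratios[tid] = 0.0f;  // placeholder result
    }

    // One prepared binary data package already resident in GPU global memory.
    struct Package { unsigned int* devData; int numChains; int maxLen; };

    // Outer flow of Algorithm 7 for the first phase (hypothetical driver).
    void runPhase1(const std::vector<Package>& packages, float Qt) {
        for (const Package& p : packages) {                    // Alg. 7, lines 3-5
            float* devRatios = nullptr;
            cudaMalloc((void**)&devRatios, p.numChains * sizeof(float));
            int block = 256, grid = (p.numChains + block - 1) / block;
            phase1Kernel<<<grid, block>>>(p.devData, p.numChains, p.maxLen, devRatios);
            cudaDeviceSynchronize();
            // Copy the ratios back and qualify chains with ratio >= Qt for the
            // second phase (Alg. 7, line 6); phase 2 then repeats this pattern
            // with its own packages and kernel (Alg. 7, lines 10-12).
            cudaFree(devRatios);
        }
        (void)Qt;  // qualification against Qt happens on the copied-back ratios
    }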
On the other hand, the second phase is performed on the GPU device for all qualified structures, and once again on the CPU of the host for the database proteins that are most similar to the query molecule, in order to get alignment paths and to perform structural superposition. As a result, the user obtains a list of the structures that most closely match the query structure and a visualization of the local alignments of these structures at the residue level.
10.3.3 First Phase of Structural Alignment in the GPU

The first phase requires data to be delivered in the form of data packages containing reduced chains of secondary structures (SE regions). Separate data packages are built for the query protein and for the candidate protein structures from the database. For the purpose of processing, SE regions are encoded using two bytes: one byte for the type of secondary structure and one byte for its length (Eq. 10.3). Types of secondary structures are mapped to integers. In Sect. 10.3.1, where we talked about the overall structure of a data package, we also mentioned that the data in memory are arranged into 4-byte cells. In such a 4-byte cell, we can store two encoded SE regions. This is illustrated in Fig. 10.6.

Fig. 10.6 Encoding a reduced chain of secondary structures in a data package. The secondary structure of the protein is first translated to a reduced chain of SE regions. Subsequently, every two SE regions are placed in a data package in the manner shown, taking up 4 bytes, and in such a way they are loaded into the global memory of the GPU device. On the basis of [4] with improvements

The data package for the query chain of secondary structures is built on the basis of a slightly different principle. If it were created in the same way as the data packages for database structures, then in order to extract the similarity coefficient of secondary structures $\sigma_{i,j}$ we would have to read the cell $(SSE_i^A, SSE_j^B)$ from a predefined matrix of coefficients (a kind of substitution matrix constructed based on rules (i)–(iii) in Sect. 10.2.2), which would affect performance negatively. We can avoid this by pre-computing and writing all possible similarity coefficients directly into the data package of the query protein, creating something like the query-specific substitution matrix proposed in [46] and called a query profile in the GPU-based alignment algorithm for sequence similarity presented in [32]. Therefore, the data package for the query protein passes through an additional preparation step. For each SE region, four versions of the similarity coefficient are created, one for each of the secondary structure types and one for the neutral element 0 (as shown in Fig. 10.7). In the query profile created, the row index is defined by the index of the structural region SE divided by 2, and the column index is defined by the type of secondary structure present (with the additional neutral element 0). The coefficients are converted to integers in order to fit them into 1 byte, according to the following rules:
• if coefficient $\sigma_{i,j} = 0$, it is encoded as 0,
• if coefficient $\sigma_{i,j} = 1$, it is encoded as 1,
• if coefficient $\sigma_{i,j} = 0.5$, it is encoded as 2.
The lengths of SE regions do not change. This process is illustrated in Fig. 10.7, and a small encoding sketch is shown below.
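As a hypothetical illustration of the packing scheme of Fig. 10.6 (the byte order within a cell is an assumption; the actual GPU-CASSERT layout may differ), two SE regions can be packed into and unpacked from one 4-byte cell as follows:

    // Pack two encoded SE regions into one 4-byte cell (cf. Fig. 10.6):
    // one byte for the secondary structure type and one byte for the length.
    unsigned int packTwoRegions(const SERegion& a, const SERegion& b) {
        return  ((unsigned int)(unsigned char)a.type)
              | ((unsigned int)(unsigned char)a.len)  << 8
              | ((unsigned int)(unsigned char)b.type) << 16
              | ((unsigned int)(unsigned char)b.len)  << 24;
    }

    // Decode the k-th region (k = 0 or 1) from a packed cell.
    SERegion unpackRegion(unsigned int cell, int k) {
        unsigned int half = cell >> (16 * k);
        SERegion r;
        r.type = (SSEType)(half & 0xFFu);
        r.len  = (int)((half >> 8) & 0xFFu);
        return r;
    }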
Fig. 10.7 Encoding the reduced chain of secondary structures for the query protein Q (left) and construction of the query profile (right). The query profile shows all possible (encoded) scores when comparing the reduced query chain of secondary structures to SE regions from candidate protein structures from the database
Once the data packages are loaded into the host memory and a data package for the reduced query chain is created, the program transfers the data to the GPU device. To do this, it uses four streams. Each stream has its own memory buffers on the GPU device side and in the page-locked memory on the host side. The host loads data into the page-locked memory and then initiates an asynchronous data transfer to the GPU device for each of the streams. This allows transmission to take place in parallel with the ongoing calculations, again improving performance. Results are received prior to the transfer of the next data package or after all available packages have been processed.
Block threads perform parallel alignments of reduced chains of secondary structures. Each block thread performs a pairwise alignment of the query protein vs. one candidate database protein. In order to limit the number of accesses to the global memory of the GPU device, the similarity matrix $SSE$ is not calculated cell-by-cell but is divided into rectangular areas of size 2 × 4. Calculations are performed area-by-area, and row-by-row in each area, from left to right, as shown in Fig. 10.8. Structural elements (SE regions) of the candidate database structure are (virtually) located along the vertical edge of the matrix, and SE regions of the query protein structure are located along the horizontal edge of the matrix. The pseudocode of the CUDA kernel for the calculation of the matrix $SSE$ by a block thread is presented in Algorithm 8.
Fig. 10.8 Calculation of the similarity matrix $SSE$. Structural elements (SE regions) of the candidate database structure are (virtually) located along the vertical edge of the matrix and SE regions of the query protein structure along the horizontal edge of the matrix. Calculations are performed in areas 2 × 4 in size. Values of the cells in these areas are calculated according to the given order. Colors reflect the type of read/write operation required and the memory resources that are affected
The thread reads four consecutive elements $SE_j^D, SE_{j+1}^D, SE_{j+2}^D, SE_{j+3}^D$ of the database protein from the global memory of the GPU and saves them in registers (lines 1–3). They will be used many times while calculating the successive areas to the right of the leftmost area (Fig. 10.8, left and middle). Then, for each successive pair of elements $SE_i^Q, SE_{i+1}^Q$ of the query protein, the thread reads the values of $SSE_{i,j-1}, SSE_{i+1,j-1}$ and $F_{i,j-1}, F_{i+1,j-1}$ (calculated for the previous area, if any) from the global memory of the GPU and saves them in registers (lines 4–6). These values stored in registers will be swapped many times with the current values of $SSE_{i,j}, SSE_{i+1,j}$ and $F_{i,j}, F_{i+1,j}$ during the calculation of the area rows, since, at the end of the calculation, we do not actually need the whole similarity matrix $SSE$, but only the $Score_{QD}$ value. In the next step, for each row of the area, the thread reads the elements $SE_i^Q, SE_{i+1}^Q$ of the query protein from the texture memory (lines 7–8). These two elements of the query protein correspond to only one row of the query profile. In line 9, the thread calculates the values of $F_{i,j}, F_{i+1,j}, E_{i,j}, E_{i+1,j}$ and saves them in registers. They are required to calculate the values of $SSE_{i,j}$ and $SSE_{i+1,j}$ of the matrix $SSE$ according to formulas (10.7)–(10.11) (line 10). The value of $SSE_{i-1,j-1}$, which is also required for the calculation, is stored in registers as well.
Algorithm 8 Phase 1: kernel pseudocode for the calculation of the matrix $SSE$ by a block thread (GM - global memory, TM - texture memory)
1: for each consecutive four elements $SE_j^D, SE_{j+1}^D, SE_{j+2}^D, SE_{j+3}^D$: j = 1, ..., m do
2:   Reset registers
3:   Read from GM elements $SE_j^D, \ldots, SE_{j+3}^D$ and save in registers
4:   for each successive pair of elements $(SE_i^Q, SE_{i+1}^Q)$: i = 1, ..., n do
5:     Read from GM values $SSE_{i,j-1}, SSE_{i+1,j-1}$ and save in registers
6:     Read from GM values $F_{i,j-1}, F_{i+1,j-1}$ and save in registers
7:     for each row of the area do
8:       Read from TM the element of the query profile that corresponds to $(SE_i^Q, SE_{i+1}^Q)$
9:       Calculate $F_{i,j}, F_{i+1,j}, E_{i,j}, E_{i+1,j}$ and save in registers
10:      Calculate $SSE_{i,j}$ and $SSE_{i+1,j}$ according to formulas (10.7)–(10.11)
11:      $Score_{QD} \leftarrow \max(Score_{QD}, SSE_{i,j}, SSE_{i+1,j})$
12:      Save in registers the values of $SSE_{i,j}, SSE_{i+1,j}$ for the next row of the area
13:      Save in a register the value of $SSE_{i+1,j}$ for the next pair $(SE_i^Q, SE_{i+1}^Q)$ (next area)
14:    end for
15:    Save the values of $SSE_{i,j}, SSE_{i+1,j}$ in the GM
16:    Save the values of $F_{i,j}, F_{i+1,j}$ in the GM
17:    Save in a register the value of $SSE_{i+1,j}$ that will be used as the diagonal value for another area
18:  end for
19: end for
20: Save in GM the value of $Score_{QD}/Score_{QQ}$
The values of $SSE_{i-1,j}$ and $E_{i-1,j}$ are equal to 0 for the leftmost areas (Fig. 10.8, left and right) or are stored in registers after the calculation of the previous area (Fig. 10.8, middle). In line 11, a temporary value of the $Score_{QD}$ similarity measure is calculated. In line 12, the current values of $SSE_{i,j}, SSE_{i+1,j}$ are stored in registers, replacing the old values $SSE_{i,j-1}, SSE_{i+1,j-1}$. Values of $SSE_{i+1,j}$ for successive rows are also stored in an additional set of registers for the calculation of the next area to the right (line 13). They serve as the values $SSE_{i-1,j}$ for the successive rows of the next area to the right of the current area. At the end of the calculation of the area, the thread writes the values of $SSE_{i,j}, SSE_{i+1,j}$ and $F_{i,j}, F_{i+1,j}$, calculated for the last row, to the global memory (lines 15–16). They will be read and used again when the thread processes the area below the current area. The value of $SSE_{i+1,j}$ for the last row of the area is stored in an additional register (line 17). It will be used as the diagonal value $SSE_{i-1,j-1}$ at the beginning of the calculation of another area (down-right). Finally, when all cells of the matrix $SSE$ have been calculated, the thread knows the final $Score_{QD}$ and is able to calculate the value of $Score_{QD}/Score_{QQ}$, which will decide whether the candidate protein is qualified for the second phase. The value is stored in the global memory (line 20).
During the calculation of each 2 × 4 area, the values of the four elements of the vector E representing the horizontal gap penalty and the four elements of the matrix $SSE$ to the left of the current area are stored in GPU registers. Four consecutive elements of the reduced chain of secondary structures for the database protein are read from the global memory once, before the calculation of each leftmost area of the matrix begins. They are also stored in GPU registers and reused during the calculation of the other areas located to the right of the leftmost area.
The calculation of a 2 × 4 area requires two reads and two writes to the global memory for the vector $F$ representing the vertical gap penalty, and two reads and two writes for the similarity matrix $SSE$. It also requires four reads for the query profile placed in the texture memory. In total, the calculation of the 8 cells of an area of the similarity matrix $SSE$ requires eight read/write transactions to the global memory of the GPU device and four reads from the texture memory. The order of calculation of cells and the read/write operations performed are shown in Fig. 10.8.
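To make the data flow of this first phase more concrete, the sketch below shows a drastically simplified CUDA kernel in the same spirit: one thread scores one database SSE chain with a local alignment and affine gaps, keeping only a single column of values per thread, because phase 1 needs only the final $Score_{QD}$. It deliberately omits the 2 × 4 register blocking, the packed byte layout, and the texture-memory query profile of GPU-CASSERT; all names and scoring constants are illustrative assumptions, not the original code.

```cuda
// Hedged sketch only: a much-simplified analogue of the phase-1 scoring kernel.
#include <cstdio>
#include <cuda_runtime.h>

#define MAX_Q 512                     // assumed upper bound on query length
__constant__ char c_query[MAX_Q];     // query SSE chain kept in constant memory

__global__ void phase1Scores(const char *dbase, const int *offset, const int *length,
                             int nProteins, int qLen, float *score)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per database protein
    if (p >= nProteins) return;
    const char *seq = dbase + offset[p];

    float H[MAX_Q + 1], E[MAX_Q + 1];  // previous column of scores and horizontal gaps
    for (int i = 0; i <= qLen; ++i) H[i] = E[i] = 0.0f;

    const float match = 4.0f, mism = -1.0f, open = -1.0f, ext = -0.5f;
    float best = 0.0f;
    for (int j = 1; j <= length[p]; ++j) {           // database elements
        float diag = 0.0f, F = 0.0f;                 // H(i-1,j-1) and vertical gap
        for (int i = 1; i <= qLen; ++i) {            // query elements
            float up = H[i];                         // H(i,j-1) before overwriting
            E[i] = fmaxf(E[i] + ext, up + open);     // extend or open a horizontal gap
            F    = fmaxf(F + ext, H[i - 1] + open);  // extend or open a vertical gap
            float s = (c_query[i - 1] == seq[j - 1]) ? match : mism;
            H[i] = fmaxf(fmaxf(0.0f, diag + s), fmaxf(E[i], F));
            best = fmaxf(best, H[i]);                // running Score_QD
            diag = up;
        }
    }
    score[p] = best;                   // single write of the result to global memory
}

int main() {
    const char q[] = "HHHHEECCC";                    // toy query SSE chain (9 SSEs)
    const char db[] = "HHHEECCCC" "CCEEHH";          // two concatenated database chains
    int off[2] = {0, 9}, len[2] = {9, 6};
    char *d_db; int *d_off, *d_len; float *d_sc;
    cudaMalloc(&d_db, sizeof(db));   cudaMalloc(&d_off, sizeof(off));
    cudaMalloc(&d_len, sizeof(len)); cudaMalloc(&d_sc, 2 * sizeof(float));
    cudaMemcpy(d_db, db, sizeof(db), cudaMemcpyHostToDevice);
    cudaMemcpy(d_off, off, sizeof(off), cudaMemcpyHostToDevice);
    cudaMemcpy(d_len, len, sizeof(len), cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(c_query, q, sizeof(q));
    phase1Scores<<<1, 32>>>(d_db, d_off, d_len, 2, 9, d_sc);
    float sc[2]; cudaMemcpy(sc, d_sc, sizeof(sc), cudaMemcpyDeviceToHost);
    printf("scores: %.1f %.1f\n", sc[0], sc[1]);
    return 0;
}
```

The register blocking described above goes further than this sketch: by keeping a 2 × 4 window of intermediate values in registers, GPU-CASSERT replaces most of the per-cell global-memory traffic shown here with register reuse.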
10.3.4 Second Phase of Structural Alignment in the GPU

After filtering candidate database proteins based on the qualification threshold $Q_T$, the program creates new, smaller data packages that are needed in the second phase. Separate data packages are built for each of the features included in the molecular residue descriptors. In the data packages for amino acid types and secondary structure types, we can store elements of four successive molecular residue descriptors in every 4 bytes (and then in every 4-byte memory cell). The arrangement of bytes and cells in memory is similar to that used in the first phase (see Fig. 10.5). Vector lengths and angles occupy 4 bytes each, which is one cell of the prepared array in memory. For the query protein structure, data packages for amino acid types and secondary structures are generated in a manner similar to how this is done in the first phase. The program creates separate query profiles for secondary structures and for residue types. The query profile for secondary structures is formed from the secondary structure similarity coefficients $\sigma_{i,j}$ in such a way that the row index is the index of the current element from the query chain divided by 4, and the column index is the type of the secondary structure of the element from the compared database protein (Fig. 10.9). The query profile for residue types is derived from the normalized BLOSUM62 substitution matrix in such a way that the row index is the index of the current element from the query chain divided by 4, and the column index is the type of the residue from the compared database chain. Data packages containing vector lengths and angles between these vectors, for the query protein structure, are created by rewriting these values to separate packages. The transfer of data packages to the device is performed in the same manner as in the first phase. Four streams are used for this purpose. After the first part of the data has been transferred to the GPU device, the high-resolution alignment procedure is initiated. Block threads perform parallel alignments of chains of molecular residue descriptors. Each block thread performs a pairwise alignment of the query protein vs. one candidate database protein. In order to limit the number of accesses to the global memory of the GPU device, the similarity matrix S is divided into rectangular areas of size 4 × 4. Calculations are performed area-by-area, and row-by-row inside areas, from left to right, as shown in Fig. 10.10. Molecular residue descriptors of the candidate database structure are (virtually) located along the left vertical edge of the matrix S, and molecular residue descriptors of the query protein structure are located along the top horizontal edge of the matrix.
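As a concrete illustration of this query-profile layout, the hedged host-side sketch below packs the scores of four consecutive query SSEs into one 4-byte profile cell, so the row index is the query position divided by 4 and the column index is the database SSE type. The three-letter alphabet, the sigma() stand-in, and all names are assumptions for illustration, not the actual GPU-CASSERT code.

```cpp
// Hedged sketch of building a packed query profile (row = i/4, column = SSE type).
#include <cstdint>
#include <vector>

static const char ALPHABET[3] = {'H', 'E', 'C'};   // assumed SSE alphabet

// Stand-in for the secondary structure similarity coefficient sigma(i,j).
static uint32_t sigma(char a, char b) { return a == b ? 255u : 0u; }

std::vector<uint32_t> buildQueryProfile(const char *query, int qLen)
{
    int rows = (qLen + 3) / 4;                     // one row per four query positions
    std::vector<uint32_t> profile(rows * 3, 0u);
    for (int i = 0; i < qLen; ++i)
        for (int c = 0; c < 3; ++c)                // column = database SSE type
            profile[(i / 4) * 3 + c] |= sigma(query[i], ALPHABET[c]) << (8 * (i % 4));
    return profile;                                // ready to be bound to texture memory
}
```

With this layout, one 4-byte read from the profile delivers the scores of four successive query elements against a single database SSE type, which matches the packed arrangement shown in Fig. 10.9.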
Fig. 10.9 Encoding the secondary structure elements (SSEs) from chains of molecular residue descriptors for query protein Q (left) and construction of the query profile (right). The query profile shows all possible (encoded) scores when comparing the query chain of SSEs to SSEs from candidate protein structures from the database
During the calculation of each 4 × 4 area, the values of four elements of the vector E representing the horizontal gap penalty and the molecular residue descriptors for four successive elements of the database chain are stored in GPU registers. Calculation of a 4 × 4 area requires four reads and four writes to the global memory for the vector F representing the vertical gap penalty, and four reads and four writes for the similarity matrix S. It is also necessary to perform four reads for the query profile for secondary structures, four reads for the query profile for residue types, four reads for vector lengths, and four reads for angles between vectors. These reads are performed from the texture memory, where these structural features are placed and arranged in an appropriate manner. In total, the calculation of the 16 cells in each area of the similarity matrix S requires 16 read/write transactions to the global memory of the GPU device and 16 reads from the texture memory. The order of calculation of cells and the read/write operations performed are shown in Fig. 10.10. The kernel pseudocode is similar to the one presented for the first phase, with the exception that the thread processes 4 × 4 areas, which implies more I/O operations, and the similarity is calculated according to formula (10.17).
Fig. 10.10 Calculation of the similarity matrix S in the second phase of alignment. Molecular residue descriptors of the candidate database structure are (virtually) located along the vertical edge of the matrix, and molecular residue descriptors of the query protein structure are located along the horizontal edge of the matrix. Calculations are performed in areas of size 4 × 4. Values of the cells in these areas are calculated according to the given order. Colors reflect the types of read/write operations that are required and the memory resources that are affected
10.4 GPU-CASSERT Efficiency Tests

The efficiency of the GPU-CASSERT algorithm was tested in a series of experiments. In this section, we will see the results of these tests and compare GPU-CASSERT to its CPU-based implementation, which was published in [36]. Both implementations, i.e., the GPU-based and the CPU-based one, were tested on a Lenovo ThinkStation D20 with two Intel Xeon E5620 2.4 GHz processors, 16 GB of RAM, and a GeForce GTX 560 Ti graphics card with 2 GB of GDDR5 memory. The workstation had the Microsoft Windows Server 2008 R2
Datacenter 64-bit operating system installed, together with the CUDA SDK version 4.2. The CUDA compute capability supported by the graphics card was 2.1. The graphics card had the following features:

• 8 multiprocessors (384 processing cores),
• 48 KB of shared memory per block,
• 64 KB of total constant memory,
• 32,768 registers per block,
• 2 GB of total global memory.
Tests were conducted using the DALI database (the same as that used by the DALI algorithm [16, 17]), which contained the structures of 105,580 protein chains. While testing performance, 14 selected query protein structures with lengths between 29 and 2005 amino acids were used. These were randomly selected molecules representing different classes according to the SCOP classification [38], i.e., all α, all β, α + β, α/β, α&β, coiled-coil proteins, and others. The list of query protein structures used in the tests is shown in Table 10.1. Tests were performed using different qualification thresholds $Q_T = 0.01, 0.2, 0.4, 0.6, 0.8$ that the structures had to attain in order to pass from the first phase to the second phase of CASSERT. CASSERT execution times for $Q_T = 0.01$ and $Q_T = 0.2$ are shown in Fig. 10.11. The thresholds used were not chosen randomly. The value $Q_T = 0.2$ is an experimentally determined threshold that filters out a reasonable number of structures based on the secondary structure similarity but still allows short local similarities to be found. This will be discussed further later in the section. The value $Q_T = 0.01$ means that almost no filtering is done based on the secondary structure similarity, and almost all structures in the database qualify for the second phase. The results of the efficiency tests presented in Fig. 10.11 prove that GPU-CASSERT scans the database much faster than the CPU-based implementation. Upon analyzing the execution times for the first phase of the CASSERT algorithm (Fig. 10.11a, b) for both qualification thresholds, we can see that increasing the query protein's length causes the execution time of the algorithm to increase too. This is expected, since a longer query protein chain implies a longer alignment time for every pair of compared proteins.
Table 10.1 Query protein structures used in the performance tests

PDB ID  Chain  Length    PDB ID  Chain  Length
2CCE    A          29    1AYE    _         400
2A2B    A          40    2EPO    B         600
1BE3    G          80    1KK7    A         802
1A1A    B         101    1URJ    A        1027
1AYY    B         142    2PDA    A        1230
2RAS    A         199    2R93    A        1421
1TA3    B         300    2PFF    B        2005
Fig. 10.11 Total execution time for the first phase (a, b) and average execution time of both phases per protein that qualified for the second phase (c, d) for qualification thresholds of 0.01 (a, c) and 0.2 (b, d) as a function of the length of the query protein structure Q. Time is plotted on a log10 scale. Comparison of two implementations of the CASSERT algorithm: CPU-based (red) and GPU-based (CUDA, blue). Results for 14 selected query protein structures between 29 and 2005 amino acids long. Searches were performed against the DALI database, containing 105,580 structures
Small fluctuations that are visible for short chains when using the GPU-based implementation and $Q_T = 0.01$ (Fig. 10.11a, blue) are caused by variations in the number of secondary structures identified in the investigated proteins, which affect the alignment time. We can observe a similar (expected) dependency between the length of the query protein and the execution time when analyzing the execution times measured after both phases of the CASSERT algorithm for both qualification thresholds (Fig. 10.11c, d). However, since the number of proteins that qualify for the second phase varies and depends on the length and complexity of the query structure, average execution times per qualified protein are shown in Fig. 10.11c, d.
Fig. 10.12 Acceleration achieved by GPU-CASSERT with respect to CPU-CASSERT as a function of query protein length after the first phase (blue) and both alignment phases (red) with qualification thresholds 0.01 (a) and 0.2 (b)
When executing CASSERT for various query proteins, it can be noticed that, in some cases, more database protein structures qualify for the second phase for shorter query protein structures than for longer ones (between 1000 and 2000 residues). The execution time measurements obtained during the performance tests allowed us to calculate acceleration ratios for GPU-CASSERT with respect to CPU-CASSERT. Figure 10.12 shows how the acceleration ratio changes as a function of query protein length for the first phase and for both phases, for $Q_T = 0.01$ and $Q_T = 0.2$. We can see that the acceleration ratio for the first phase remains stable. In this phase, GPU-CASSERT is on average 120 times faster than CPU-CASSERT. However, for the whole alignment, i.e., after the first and second phases, the acceleration ratio greatly depends on the length of the query protein structure, its construction, and its complexity. The whole alignment process performed on the GPU is 30–300 times faster than the same process performed on the CPU. Actually, for qualification thresholds $Q_T \geq 0.1$, it is possible to observe a kind of compensation effect. For longer query protein chains, which also have more complicated constructions in terms of secondary structure, the number of candidate structures from the database that qualify for the second phase decreases with the length of the query protein. This causes a situation in which fewer database proteins need to be aligned during the entire process. But, at the same time, the length of the query protein grows, causing the alignment time to increase. This growth is compensated for by the smaller number of database structures that need to be aligned.
Fig. 10.13 Number of structures from the database that qualified for the second phase as a function of query protein length for various values of the qualification threshold
Figure 10.13 shows the relationship between query protein length and the number of structures that qualified for the second phase when various values of the qualification threshold $Q_T$ were applied. For example, for $Q_T = 0.01$ we can see that almost all of the database structures qualified for the second phase, regardless of query protein length. In this case, there is practically no filtering based on the secondary structures identified in the query protein. On the other hand, for $Q_T = 0.8$, we can notice that for query proteins over 150 residues in length, only individual database structures are eligible for further processing. In many situations, such a high value of the qualification threshold will filter out too many molecules. However, this depends on the purpose for which the entire similarity searching process is carried out. For example, in homology modeling, we may want to find reference protein structures that are very similar to the given query protein structure. For functional annotation and when searching for homologous structures, $Q_T = 0.2$ could be a reasonable threshold, since it filters out many candidate molecules and, even for very long query proteins, it allows at least several thousand structures to pass through to the second phase. We should also remember that the first alignment phase can be turned off completely by specifying $Q_T = 0.0$. Then, all of the database molecules pass through to the second phase, which prolongs the similarity searching process.
10.5 Discussion

The results of the efficiency tests have confirmed our expectations. Using a graphics card with CUDA compute capability is one of the most efficient approaches to performing protein structure similarity searching. Upon comparing execution times, we can see that the GPU-based implementation is several dozen to several hundred times faster (an average of 180 times faster for $Q_T = 0.2$) than the CPU-based implementation. This is very important, since the number of protein structures in macromolecular databases, such as the Protein Data Bank, is growing very quickly, and the dynamics of this growth is also increasing. The use of GPU-based implementations is particularly convenient for such processes because GPU devices are reasonably inexpensive compared to, say, big computer clusters. The presented experiments were performed on a middle-class GPU device, which was set up on a small PC workstation with two processors. For this reason, GPU devices can be usefully applied in the implementation of many algorithms in the field of bioinformatics. The novelty of CPU-CASSERT lies mainly in the fast preselection phase based on secondary structures (the low-resolution alignment phase), which precedes the phase of detailed alignment (the high-resolution alignment phase). This allows the number of structures that will be processed in the second, costly phase to be limited, which, in turn, significantly accelerates the method itself. A comparison of CPU-CASSERT with the popular DALI and FATCAT algorithms is presented in [36]. GPU-CASSERT provides additional acceleration over its CPU-based version by executing the computational procedure in parallel threads on many cores of the GPU device. The resulting increase in speed is even greater than those achieved with the methods mentioned in Sect. 10.1.2. SA Tableau Search provides a 33-fold increase in speed when using a GTX 285 graphics card and a 24-fold increase when using a C1060 GPU device rather than the CPU implementation. Its optimization procedure is based on simulated annealing, which is run in parallel CUDA threads. Individual thread blocks perform the optimization procedure for different candidate protein structures from a database. Protein structures are represented as tableaux containing the orientations of secondary structure elements, together with distance matrices. However, one of the problems with this algorithm is encountered when comparing big protein structures that generate big tableaux and distance matrices, as they cannot be stored inside the constant and shared memory during computations. This makes it necessary to use a slower version of the GPU kernel which exploits the global memory rather than the faster constant and shared memory. GPU-CASSERT avoids this problem by using a different representation of protein structures: linear sequences of structural descriptors (in which secondary structure elements are also included) are employed rather than two-dimensional representative structures. In terms of the representation of protein structures and the implementation of the method, GPU-CASSERT is closer to pssAlign [42], which shows up to a 35-fold increase in speed with the NVIDIA Tesla C2050 GPU over its CPU-based implementation. Both algorithms consist of two alignment phases. The fragment-level alignment phase of pssAlign uses an index-based matched fragment set (MFS) in order to find so-called seeds between the target protein and each database protein.
These seeds, which are represented by the locations of the Cα atoms, are used to generate initial alignments, which are then refined in the residue-level alignment phase. Just like GPU-CASSERT, both phases utilize dynamic programming. However, in GPU-CASSERT, the low-resolution alignment is treated as a preselection phase for the detailed alignment. In contrast to pssAlign, both phases are executed independently in GPU-CASSERT. GPU-CASSERT does not store alignment paths after the first phase of the algorithm, which was done in the original CASSERT published in [36]. Consequently, it also does not perform backtracking in the kernel of the first phase, since GPU-CASSERT only needs the $Score$ measure to calculate the qualification threshold $Q_T$ for the next phase. The $Score$ is calculated in linear space, which also improves efficiency. Backtracking is also not performed on the GPU after the high-resolution alignment phase. It is executed on the host instead, and only for the highest-scoring database molecules that are returned for the user to visualize. This allows computational time to be saved. Additional savings can be achieved when working with small query structures. After filtering candidate database proteins based on the qualification threshold, the program creates new, smaller data packages that are needed in the second phase. This usually takes some time. For this reason, for shorter query proteins (less than 100 amino acids in length), it is reasonable to omit the first phase by setting the qualification threshold to 0.0. The probability that such a small protein structure (after it has been reduced to a chain of $SE$ regions) will be similar to many of the database proteins is very high. This means that all or almost all of the proteins qualify for the next phase (this is visible in Fig. 10.13), which makes the first preselection phase almost useless for very small molecules. GPU-CASSERT also provides additional unique features. Following research into GPU-based sequence alignments [26, 27, 32, 44], the data are arranged in an appropriate manner before being sent to the global memory of the GPU device. Chains of structural descriptors representing protein structures are stored in a prepared memory array that guarantees coalesced access to the global memory in a single transaction. Structural descriptors are not transferred to the global memory of the GPU device directly from a database; instead, they are stored in binary files, which enables faster transfer, and they are sorted by their lengths in order to reduce thread idle time once they are processed. Moreover, secondary structure descriptors of query protein structures (in both phases) and residue types (in the second phase) are encoded as query profiles, i.e., appropriate matrices of all possible scores. During the computations performed on the GPU device, the query profile and the substitution matrix (needed in the second phase) are located in the texture memory. The texture memory is cached on the chip of the graphics card and provides a higher effective bandwidth, reducing the number of requests made to off-chip global memory. Streaming is also applied in GPU-CASSERT in order to alternate kernel launches and memory copies, resulting in further acceleration. Finally, kernel codes are optimized to avoid introducing branching via conditional statements.
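The streaming mentioned here can be illustrated with a short, hedged CUDA sketch: copies for later data packages overlap with kernels for earlier ones, using four streams in round-robin, as described for GPU-CASSERT. The dummyKernel, package sizes, and buffer layout are placeholders; pinned host memory (cudaMallocHost) is required for the asynchronous copies to actually overlap with kernel execution.

```cuda
// Hedged sketch of alternating memory copies and kernel launches with CUDA streams.
#include <cuda_runtime.h>

__global__ void dummyKernel(const char *pkg, int n) { /* alignment work goes here */ }

int main() {
    const int NS = 4;                       // number of streams, as in GPU-CASSERT
    const int PKG = 1 << 20, NPKG = 16;     // package size and count (illustrative)
    cudaStream_t s[NS];
    char *h_buf, *d_buf[NS];
    cudaMallocHost(&h_buf, (size_t)NPKG * PKG);          // pinned host buffer
    for (int i = 0; i < NS; ++i) { cudaStreamCreate(&s[i]); cudaMalloc(&d_buf[i], PKG); }
    for (int p = 0; p < NPKG; ++p) {
        int i = p % NS;                     // stream ordering protects d_buf[i] reuse
        cudaMemcpyAsync(d_buf[i], h_buf + (size_t)p * PKG, PKG,
                        cudaMemcpyHostToDevice, s[i]);
        dummyKernel<<<256, 256, 0, s[i]>>>(d_buf[i], PKG);
    }
    cudaDeviceSynchronize();
    for (int i = 0; i < NS; ++i) { cudaStreamDestroy(s[i]); cudaFree(d_buf[i]); }
    cudaFreeHost(h_buf);
    return 0;
}
```

Because operations within one stream execute in order, the copy for package p + 4 cannot start before the kernel for package p in the same stream has finished, so each device buffer can be safely reused.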
10.6 Summary

Protein 3D structure similarity searching still needs efficient methods and new implementations in order to generate results in a reasonable time. This is especially true in view of the exponentially growing number of protein structures in macromolecular repositories. It seems that, at the current stage of development of computer science, GPU devices provide an excellent alternative to very expensive computer infrastructures, as they allow large increases in speed over CPU-based implementations of the same computational methods. Moreover, taking into account that the number of processing cores and the amount of memory in modern GPU devices are constantly growing, the computational capabilities of GPU devices are growing at the same time. Although implementing computational methods on GPUs requires some additional effort, including the need to become familiar with the CUDA architecture and programming model and to refactor the code of existing procedures into GPU kernels, in return we can achieve much faster processing. This is very important because, for many processes such as 3D protein structure similarity searching, reducing the computational complexity is a very difficult, if not impossible, task. GPU-based implementations like the one presented in this chapter do not reduce the complexity, but they can speed up the process through massive parallelization, thus reducing the overall time required for process execution. For the latest news on GPU-CASSERT, please visit the project Web site: http://zti.polsl.pl/dmrozek/science/gpucassert/cassert.htm. For further reading on GPU-based implementations of other algorithms for bioinformatics, I would like to recommend the book entitled Bioinformatics: High-Performance Parallel Computer Architectures by Bertil Schmidt [49] and the paper Graphics processing units in bioinformatics, computational biology, and systems biology by Nobile et al. [39]. In the next chapter, we will see the exploration of protein molecules on the basis of their secondary structures, and how it can be performed with the use of SQL queries in relational databases and accelerated by multi-threaded execution.
References

1. Bellman, R.: On the theory of dynamic programming. Proc. Natl. Acad. Sci. 38(8), 716–719 (1952). http://www.pnas.org/content/38/8/716
2. Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., Bourne, P.E.: The Protein Data Bank. Nucleic Acids Res. 28, 235–242 (2000)
3. Brown, N.P., Orengo, C.A., Taylor, W.R.: A protein structure comparison methodology. Comput. Chem. 20(3), 359–380 (1996). http://www.sciencedirect.com/science/article/pii/0097848595000623
4. Brożek, M.: Protein structure similarity searching with the use of CUDA. Master's thesis, Institute of Informatics, Silesian University of Technology, Gliwice, Poland (2012)
5. Buckner, J., Wilson, J., Seligman, M., Athey, B., Watson, S., Meng, F.: The gputools package enables GPU computing in R. Bioinformatics 26(1), 134–135 (2010). https://doi.org/10.1093/bioinformatics/btp608
6. Burkowski, F.: Structural Bioinformatics: An Algorithmic Approach, 1st edn. Chapman and Hall/CRC, Boca Raton (2008)
7. Can, T., Wang, Y.F.: CTSS: a robust and efficient method for protein structure alignment based on local geometrical and biological features. In: Computational Systems Bioinformatics. CSB2003. Proceedings of the 2003 IEEE Bioinformatics Conference (CSB2003), pp. 169–179 (2003)
8. Carugo, O.: Recent progress in measuring structural similarity between proteins. Curr. Protein Pept. Sci. 8(3), 219–241 (2007). https://www.ingentaconnect.com/content/ben/cpps/2007/00000008/00000003/art00001
9. Carugo, O., Pongor, S.: Recent progress in protein 3D structure comparison. Curr. Protein Pept. Sci. 3(4), 441–449 (2002). http://www.eurekaselect.com/node/81461/article
10. Coutsias, E.A., Seok, C., Dill, K.A.: Using quaternions to calculate RMSD. J. Comput. Chem. 25(15), 1849–1857 (2004). https://doi.org/10.1002/jcc.20110
11. Daniluk, P., Lesyng, B.: A novel method to compare protein structures using local descriptors. BMC Bioinform. 12(1), 344 (2011). https://doi.org/10.1186/1471-2105-12-344
12. Friedrichs, M.S., Eastman, P., Vaidyanathan, V., Houston, M., Legrand, S., Beberg, A.L., Ensign, D.L., Bruns, C.M., Pande, V.S.: Accelerating molecular dynamic simulation on graphics processing units. J. Comput. Chem. 30(6), 864–872 (2009). https://doi.org/10.1002/jcc.21209
13. Gibrat, J., Madej, T., Bryant, S.: Surprising similarities in structure comparison. Curr. Opin. Struct. Biol. 6(3), 377–385 (1996)
14. Gu, J., Bourne, P.: Structural Bioinformatics (Methods of Biochemical Analysis), 2nd edn. Wiley, Hoboken (2009)
15. Henikoff, S., Henikoff, J.G.: Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. 89(22), 10915–10919 (1992). http://www.pnas.org/content/89/22/10915
16. Holm, L., Kaariainen, S., Rosenstrom, P., Schenkel, A.: Searching protein structure databases with DaliLite v. 3. Bioinformatics 24, 2780–2781 (2008)
17. Holm, L., Sander, C.: Protein structure comparison by alignment of distance matrices. J. Mol. Biol. 233(1), 123–38 (1993)
18. Horn, B.K.P.: Closed-form solution of absolute orientation using unit quaternions. J. Opt. Soc. Am. A 4(4), 629–642 (1987). http://josaa.osa.org/abstract.cfm?URI=josaa-4-4-629
19. Jamroz, M., Kolinski, A.: ClusCo: clustering and comparison of protein models. BMC Bioinform. 14(1), 62 (2013). https://doi.org/10.1186/1471-2105-14-62
20. Kabsch, W.: A solution for the best rotation to relate two sets of vectors. Acta Crystallogr. Sect. A 32(5), 922–923 (1976). https://doi.org/10.1107/S0567739476001873
21. Kabsch, W.: A discussion of the solution for the best rotation to relate two sets of vectors. Acta Crystallogr. Sect. A 34(5), 827–828 (1978). https://doi.org/10.1107/S0567739478001680
22. Leinweber, M., Baumgärtner, L., Mernberger, M., Fober, T., Hüllermeier, E., Klebe, G., Freisleben, B.: GPU-based cloud computing for comparing the structure of protein binding sites. In: 2012 6th IEEE International Conference on Digital Ecosystems and Technologies (DEST), pp. 1–6 (2012)
23. Leinweber, M., Fober, T., Freisleben, B.: GPU-based point cloud superpositioning for structural comparisons of protein binding sites. IEEE/ACM Trans. Comput. Biol. Bioinform. PP(99), 1–14 (2018)
24. Leinweber, M., Fober, T., Strickert, M., Baumgärtner, L., Klebe, G., Freisleben, B., Hüllermeier, E.: CavSimBase: a database for large scale comparison of protein binding sites. IEEE Trans. Knowl. Data Eng. 28(6), 1423–1434 (2016)
25. Lesk, A.: Introduction to Protein Science: Architecture, Function, and Genomics, 2nd edn. Oxford University Press, USA (2010)
26. Liu, Y., Maskell, D.L., Schmidt, B.: CUDASW++: optimizing Smith-Waterman sequence database searches for CUDA-enabled graphics processing units. BMC Res. Notes 2(1), 73 (2009). https://doi.org/10.1186/1756-0500-2-73
27. Liu, Y., Schmidt, B., Maskell, D.L.: CUDASW++2.0: enhanced Smith-Waterman protein database search on CUDA-enabled GPUs based on SIMT and virtualized SIMD abstractions. BMC Res. Notes 3(1), 93 (2010). https://doi.org/10.1186/1756-0500-3-93
28. Liu, Y., Wirawan, A., Schmidt, B.: CUDASW++ 3.0: accelerating Smith-Waterman protein database search by coupling CPU and GPU SIMD instructions. BMC Bioinform. 14(1), 117 (2013). https://doi.org/10.1186/1471-2105-14-117
29. Lu, M., Shu, W., Ji, H., Spek, E., Wang, L., Kallenbach, N.R.: Helix capping in the GCN4 leucine zipper. J. Mol. Biol. 288(4), 743–752 (1999). http://www.sciencedirect.com/science/article/pii/S0022283699927079
30. Małysiak-Mrozek, B., Momot, A., Mrozek, D., Hera, Ł., Kozielski, S., Momot, M.: Scalable system for protein structure similarity searching. In: Jedrzejowicz, P., Nguyen, N.T., Hoang, K. (eds.) Computational Collective Intelligence. Technologies and Applications. Lecture Notes in Computer Science, vol. 6923, pp. 271–280. Springer, Berlin (2011)
31. Małysiak-Mrozek, B., Mrozek, D.: An improved method for protein similarity searching by alignment of fuzzy energy signatures. Int. J. Comput. Intell. Syst. 4(1), 75–88 (2011). https://doi.org/10.1080/18756891.2011.9727765
32. Manavski, S.A., Valle, G.: CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment. BMC Bioinform. 9(2), S10 (2008). https://doi.org/10.1186/1471-2105-9-S2-S10
33. Minami, S., Sawada, K., Chikenji, G.: MICAN: a protein structure alignment algorithm that can handle multiple-chains, inverse alignments, Cα only models, alternative alignments, and non-sequential alignments. BMC Bioinform. 14(24), 1–22 (2013)
34. Momot, A., Małysiak-Mrozek, B., Kozielski, S., Mrozek, D., Hera, Ł., Górczyńska-Kosiorz, S., Momot, M.: Improving Performance of Protein Structure Similarity Searching by Distributing Computations in Hierarchical Multi-Agent System. Lecture Notes in Computer Science, vol. 6421, pp. 320–329. Springer, Berlin (2010)
35. Mosca, R., Brannetti, B., Schneider, T.R.: Alignment of protein structures in the presence of domain motions. BMC Bioinform. 9(1), 352 (2008). https://doi.org/10.1186/1471-2105-9-352
36. Mrozek, D., Małysiak-Mrozek, B.: CASSERT: a two-phase alignment algorithm for matching 3D structures of proteins. In: Kwiecień, A., Gaj, P., Stera, P. (eds.) Computer Networks. Communications in Computer and Information Science, vol. 370, pp. 334–343. Springer International Publishing, New York (2013)
37. Mrozek, D., Brożek, M., Małysiak-Mrozek, B.: Parallel implementation of 3D protein structure similarity searches using a GPU and the CUDA. J. Mol. Model. 20, 2067 (2014)
38. Murzin, A.G., Brenner, S.E., Hubbard, T., Chothia, C.: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247(4), 536–540 (1995). http://www.sciencedirect.com/science/article/pii/S0022283605801342
39. Nobile, M.S., Cazzaniga, P., Tangherloni, A., Besozzi, D.: Graphics processing units in bioinformatics, computational biology and systems biology. Brief. Bioinform. 18(5), 870–885 (2017). https://doi.org/10.1093/bib/bbw058
40. NVIDIA CUDA C Programming Guide (2018). http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
41. Ortiz, A.R., Strauss, C.E., Olmea, O.: MAMMOTH (matching molecular models obtained from theory): an automated method for model comparison. Protein Sci. 11(11), 2606–2621 (2009). https://doi.org/10.1110/ps.0215902
42. Pang, B., Zhao, N., Becchi, M., Korkin, D., Shyu, C.R.: Accelerating large-scale protein structure alignments with graphics processing units. BMC Res. Notes 5(1), 116 (2012). https://doi.org/10.1186/1756-0500-5-116
43. Pascual-García, A., Abia, D., Ortiz, A.R., Bastolla, U.: Cross-over between discrete and continuous protein structure space: insights into automatic classification and networks of protein structures. PLOS Comput. Biol. 5(3), 1–20 (2009). https://doi.org/10.1371/journal.pcbi.1000331
44. Pawłowski, R., Małysiak-Mrozek, B., Kozielski, S., Mrozek, D.: Fast and accurate similarity searching of biopolymer sequences with GPU and CUDA. In: Xiang, Y., Cuzzocrea, A., Hobbs, M., Zhou, W. (eds.) Algorithms and Architectures for Parallel Processing. Lecture Notes in Computer Science, vol. 7016, pp. 230–243. Springer, Berlin, Heidelberg (2011)
45. Roberts, E., Stone, J.E., Sepulveda, L., Hwu, W.M.W., Luthey-Schulten, Z.: Long time-scale simulations of in vivo diffusion using GPU hardware. In: 2009 IEEE International Symposium on Parallel Distributed Processing, pp. 1–8 (2009)
46. Rognes, T., Seeberg, E.: Six-fold speed-up of Smith-Waterman sequence database searches using parallel processing on common microprocessors. Bioinformatics 16(8), 699–706 (2000). https://doi.org/10.1093/bioinformatics/16.8.699
47. Sam, V., Tai, C.H., Garnier, J., Gibrat, J.F., Lee, B., Munson, P.J.: Towards an automatic classification of protein structural domains based on structural similarity. BMC Bioinform. 9(1), 74 (2008). https://doi.org/10.1186/1471-2105-9-74
48. Schatz, M.C., Trapnell, C., Delcher, A.L., Varshney, A.: High-throughput sequence alignment using graphics processing units. BMC Bioinform. 8(1), 474 (2007). https://doi.org/10.1186/1471-2105-8-474
49. Schmidt, B.: Bioinformatics: High Performance Parallel Computer Architectures (Embedded Multi-Core Systems), 1st edn. CRC Press, Boca Raton (2010)
50. Shapiro, J., Brutlag, D.: FoldMiner and LOCK2: protein structure comparison and motif discovery on the Web. Nucleic Acids Res. 32, 536–41 (2004)
51. Shindyalov, I., Bourne, P.: Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 11(9), 739–747 (1998)
52. Stanek, D., Mrozek, D., Małysiak-Mrozek, B.: MViewer: visualization of protein molecular structures stored in the PDB, mmCIF and PDBML data formats. In: Kwiecień, A., Gaj, P., Stera, P. (eds.) Computer Networks. Communications in Computer and Information Science, vol. 370, pp. 323–333. Springer, Berlin (2013)
53. Stivala, A.D., Stuckey, P.J., Wirth, A.I.: Fast and accurate protein substructure searching with simulated annealing and GPUs. BMC Bioinform. 11(1), 446 (2010). https://doi.org/10.1186/1471-2105-11-446
54. Striemer, G.M., Akoglu, A.: Sequence alignment with GPU: performance and design challenges. In: 2009 IEEE International Symposium on Parallel Distributed Processing, pp. 1–10 (2009)
55. Suchard, M.A., Rambaut, A.: Many-core algorithms for statistical phylogenetics. Bioinformatics 25(11), 1370–1376 (2009). https://doi.org/10.1093/bioinformatics/btp244
56. Ye, Y., Godzik, A.: Flexible structure alignment by chaining aligned fragment pairs allowing twists. Bioinformatics 19(2), 246–255 (2003)
57. Yuan, C., Chen, H., Kihara, D.: Effective inter-residue contact definitions for accurate protein fold recognition. BMC Bioinform. 13(1), 292 (2012). https://doi.org/10.1186/1471-2105-13-292
58. Zemla, A.: LGA: a method for finding 3D similarities in protein structures. Nucleic Acids Res. 31(13), 3370–3374 (2003). https://doi.org/10.1093/nar/gkg571
59. Zerella, R., Williams, D.H., Chen, P.Y., Evans, P.A., Raine, A.: Structural characterization of a mutant peptide derived from ubiquitin: implications for protein folding. Protein Sci. 9(11), 2142–2150 (2000). https://doi.org/10.1110/ps.9.11.2142
60. Zhang, Y., Skolnick, J.: TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 33(7), 2302–2309 (2005). https://doi.org/10.1093/nar/gki524
61. Zhu, J., Weng, Z.: FAST: a novel protein structure alignment algorithm. Proteins 58, 618–627 (2005)
Chapter 11
Exploration of Protein Secondary Structures in Relational Databases with Multi-threaded PSS-SQL
…; life was no longer considered to be a result of mysterious and vague phenomena acting on organisms, but instead the consequence of numerous chemical processes made possible thanks to proteins Amit Kessel, Nir Ben-Tal, 2010
Abstract Protein secondary structure reveals important information regarding protein construction and the regular spatial shapes, including alpha-helices, beta-strands, and loops, which the protein amino acid chain can adopt in some of its regions. The relevance of this information and the scope of its practical applications create a need for its effective storage and processing. In this chapter, we will see how protein secondary structures can be stored in a relational database and processed with the use of PSS-SQL. PSS-SQL is an extension to the SQL language. It allows queries to be formulated against a relational database in order to find proteins having secondary structures similar to a structural pattern specified by a user. In the chapter, we will also see how this process can be accelerated by a parallel implementation of the alignment using multiple threads working on multi-core CPUs.

Keywords Proteins · Secondary structure · Query language · SQL · Relational database · Multi-threading · Parallel computing · Alignment
11.1 Introduction

Secondary structures are a kind of intermediate organizational level of protein structures, a level between the simple amino acid sequence and the complex 3D structure. The analysis of protein structures on the basis of their secondary structures supports many processes related to studying protein folding principles, involving both short- and long-range interactions that affect secondary structure stability and
conformation. Algorithms comparing protein 3D structures and looking for structural similarities quite often make use of the secondary structure representation at the beginning, as one of the features distinguishing one protein from another. Secondary structures are taken into account in algorithms such as VAST [9], LOCK2 [22], CTSS [6], and CASSERT [16]. Also, in protein 3D structure prediction by comparative modeling [13, 30], particular regions of protein structures are modeled by adopting particular secondary structure types from proteins whose structures have already been determined and deposited in a database. The secondary structure organizational level also shows what types of secondary structure a protein molecule is composed of and how they are arranged, whether they are segregated or alternate with each other. Based on this information, proteins are classified by systems such as CATH [21] and SCOP [20]. All these examples show how important the description by means of secondary structures is. For scientists studying the structures and functions of proteins, it is very important to collect data describing protein construction in one place and to have the ability to search for particular structures that satisfy given search criteria. Consequently, this requires an appropriate representation of protein structures allowing for effective storage and searching. The problem is particularly important in the face of the dynamically growing amount of biological and biomedical data in databases, such as PDB [4] or Swiss-Prot [3, 5]. At the current stage of the development of IT technologies, relational databases have reached a well-established position in terms of collecting and managing various types of data [7]. Recall that relational databases collect data in tables (describing a part of reality) where data are arranged in columns and rows. Modern relational databases also provide a declarative query language, SQL, that allows retrieving and processing the collected data. The SQL language has gained great power in processing regular data, hiding the details of the processing under a quite simple SELECT statement (for more information on relational databases and SQL, please refer to Sect. 2.5). However, processing biological data, such as protein secondary structures, by means of relational databases is hindered by several factors:

• Data describing protein structures have to be managed by database management systems (DBMSs), which work excellently in commercial uses but are not dedicated to storing and processing biological data. They do not provide native support for processing biological data with the use of the SQL language, which is a fundamental, declarative way of data manipulation in most modern relational database systems.
• Processing of biological data must be performed by external tools and software applications, forming an additional layer in the IT system architecture, which is a disadvantage.
• Currently, results of data processing are returned in different formats, like table-form data sets, TXT, HTML, or XML files, and users must adopt them in their software applications.
• Secondary processing of the data is difficult and requires additional external tools.
In other words, modern relational databases require some enhancements in order to deal with data on the secondary structures of proteins. The possibility of collecting protein structural data in an appropriate manner and processing the data by submitting simple queries to a database simplifies the work of many researchers working in the area of protein bioinformatics. Actually, the problem of storing biological data describing biopolymer structures of proteins and DNA/RNA molecules and providing an appropriate query language allowing processing of the data has been noticed in the last decade and reported in several papers. There are only a few initiatives in the world reporting this kind of solution. For example, ODM BLAST [25] is an implementation of the BLAST family of methods in the commercial Oracle database management system. ODM BLAST extends the SQL language by providing appropriate functions for local alignment and similarity searching of DNA/RNA and protein amino acid sequences. ODM BLAST works fast, but in terms of protein molecules it is limited only to the primary structure. In [10], the authors describe their extension to the SQL language, which allows searching on the secondary structures of protein sequences. The extension was developed in Periscope (a dedicated engine) and in Oracle (a commercial database system). In the solution, secondary structures are represented by segments of different types of secondary structure elements, e.g., hhhllleee. In [26], the authors show the Periscope/SQ extension of the Periscope system. Periscope/SQ is a declarative tool for querying primary and secondary structures. For this purpose, the authors introduced the new language PiQL, new data types, and algebraic operators according to the defined query algebra PiQA. The PiQL language has many possibilities. In the paper [27], the authors present their extensions to an object-oriented database (OODB), adding the ProteinQL query language and the Protein-OODB middle layer for requests submitted to the OODB. ProteinQL allows simple queries that operate on the primary, secondary, and tertiary levels to be formulated. Finally, in 2010, a group of researchers from my university (the Silesian University of Technology in Gliwice, Poland) and I developed PSS-SQL (Protein Secondary Structure - Structured Query Language) [15, 17, 28, 29], which is an extension to the Transact-SQL language and the Microsoft SQL Server DBMS allowing for searching protein similarities on the secondary structure level (Fig. 11.1). I had the opportunity to be the manager and supervisor of the project, and I have never stopped thinking about its improvement in the following years. New versions of PSS-SQL [18, 19] contain many improvements leading to a significant growth in the efficiency of PSS-SQL queries, including:

• parallel and multi-threaded execution of the alignment procedure used in the searching process,
• reduction of the computational complexity of the alignment algorithm by using gap penalty matrices,
• indexing of sequences of secondary structure elements.
Fig. 11.1 Exploration of protein secondary structures in relational databases using PSS-SQL language. Secondary structure description of protein molecules is stored in relational database. The Database Management System (DBMS) has the PSS-SQL extension that interprets queries submitted by users. Users can connect to the database from various tools, desktop software applications, and web applications. They obtain results of their queries in a table-like format or as an XML document
The PSS-SQL language containing these improvements will be described in this chapter. We will also see the results of performance tests for sample queries in the PSS-SQL language, and how to return query results as table-like result sets and as XML documents.
11.2 Storing and Processing Secondary Structures in a Relational Database

Searching for protein similarities on secondary structures by formulating queries in PSS-SQL requires that data describing secondary structures be stored in a database in an appropriate format. The format should guarantee efficient processing of the data. In PSS-SQL, the search process is carried out in two phases:

1. Multiple scanning of a dedicated segment index for secondary structures.
2. Alignment of the found segments in order to return the k-best solutions.

All these steps, including data preparation, creating and scanning the segment index, and alignment, will be discussed in the following sections.
11.2.1 Data Preparation and Storing

PSS-SQL uses a specific representation of protein secondary structures while storing them in a database. Let us assume we have a protein P described by the amino acid sequence (primary structure):

$$P = \{p_i \mid i = 1, 2, \ldots, d \wedge p_i \in \Pi \wedge d \in \mathbb{N}\}, \tag{11.1}$$

where d is the length of the protein amino acid chain, i.e., the number of amino acids, and Π is the set of twenty common types of amino acids. The secondary structure of protein P can then be described as a sequence of secondary structure elements (SSEs) related to the amino acids in the protein chain:

$$S = \{s_i \mid i = 1, 2, \ldots, d \wedge s_i \in \Sigma \wedge d \in \mathbb{N}\}, \tag{11.2}$$
where each element $s_i$ corresponds to a single element $p_i$, and Σ is a set of secondary structure types. The set Σ may be defined in various ways. A widely accepted definition of the set is provided by DSSP [11, 12]. The DSSP code distinguishes the following secondary structure types:

• H = alpha helix,
• B = residue in isolated beta-bridge,
• E = extended strand, participates in beta ladder,
• G = 3-helix (3/10 helix),
• I = 5-helix (pi helix),
• T = hydrogen-bonded turn,
• S = bend.

In practice, the set is often reduced to three general types [8]:

• H = alpha helix,
• E = beta strand (or beta sheet),
• C = loop, turn, or coil.

An example of such a representation of a protein structure is shown in Fig. 11.2, where we can see the primary and secondary structures of a sample protein recorded as sequences. In such a way, both sequences can be effectively stored in a relational database, as presented in Fig. 11.3.
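A small helper illustrating this reduction is sketched below. The chapter does not spell out how the remaining DSSP letters map onto the three general types, so the grouping used here (H, G, I to helix; E, B to strand; everything else to loop/coil) is a common convention and should be read as an assumption.

```cpp
// Hedged sketch: reducing the DSSP alphabet to the three general SSE types.
#include <cstdio>

char reduceDSSP(char dssp) {
    switch (dssp) {
        case 'H': case 'G': case 'I': return 'H';  // helices
        case 'E': case 'B':           return 'E';  // strands and isolated bridges
        default:                      return 'C';  // turn, bend, loop, coil
    }
}

int main() {
    const char *full = "HGIEBTS";                  // the seven DSSP codes
    for (const char *p = full; *p; ++p) printf("%c -> %c\n", *p, reduceDSSP(*p));
    return 0;
}
```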
11.2.2 Indexing of Secondary Structures

At the level of the DBMS, PSS-SQL uses additional data structures and indexing in order to accelerate the similarity searching. A dedicated segment table is created for the table field storing sequences of secondary structure elements.
Fig. 11.2 Sample amino acid sequence of the Zinc transport system ATP-binding protein adcC in Streptococcus pneumoniae with the corresponding sequence of secondary structure elements
Fig. 11.3 Sample relational table storing sequences of secondary structure elements SSEs (secondary field), amino acid sequences (primary field), and additional information of proteins from the Swiss-Prot database. The table (called ProteinTbl) will be used in sample queries presented in the next sections. Secondary structures were predicted from amino acid sequences using the Predator program [8]

Fig. 11.4 Part of the segment table
The segment table consists of the secondary structures and their lengths extracted from the sequences of SSEs, together with the locations of the particular secondary structures in the molecule (identified by the residue number, Fig. 11.4). Then, an additional segment index is created for the segment table. The segment index is a B-Tree clustered index whose leaf-level data pages hold the data from the segment table. The idea of using the segment table and segment index is adopted from [10]. The segment index supports preliminary filtering of protein structures that are not similar to the query pattern. During the filtering, the PSS-SQL extension extracts the most characteristic features of the query pattern and, on the basis of the information in the index, eliminates proteins that do not meet the search criteria. Afterward, proteins that pass the filtering process are aligned to the query pattern. If we take a closer look at the segment table, we will see that it stores secondary structures in the form described in Sect. 1.3.2. During the scanning of the segment index, the search engine of PSS-SQL tries to match segments distinguished in the given query pattern to segments of the index.
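The construction of the segment table can be sketched as a simple run-length encoding of the stored SSE sequence: each maximal run of one type yields a row with the structure type, its position (residue number), and its length, mirroring the columns visible in Fig. 11.4. The struct and function names below are illustrative, not the actual PSS-SQL internals.

```cpp
// Hedged sketch: deriving segment-table rows from an SSE sequence.
#include <cstdio>
#include <cstring>
#include <vector>

struct Segment { int proteinId; char type; int startPos; int length; };

std::vector<Segment> extractSegments(int proteinId, const char *sse) {
    std::vector<Segment> segs;
    int n = (int)strlen(sse);
    for (int i = 0; i < n; ) {
        int j = i;
        while (j < n && sse[j] == sse[i]) ++j;              // extend the run of equal SSEs
        segs.push_back({proteinId, sse[i], i + 1, j - i});  // 1-based residue number
        i = j;
    }
    return segs;
}

int main() {
    for (const Segment &s : extractSegments(1, "CCHHHHHHEEEECC"))
        printf("protein %d: %c at %d, length %d\n",
               s.proteinId, s.type, s.startPos, s.length);
    return 0;
}
```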
11.2.3 Alignment Algorithm

The alignment implemented in PSS-SQL is inspired by the Smith-Waterman method [23]. The method allows two biopolymer sequences to be aligned, originally DNA/RNA sequences or amino acid sequences of proteins. When scanning a database, the alignment is performed for each pair of sequences: the query sequence given by a user and a successive, qualified sequence from the database. In PSS-SQL, after performing multiple scanning of the segment index (MSSI), a database protein structure $S^D$ of length d residues is represented as a sequence of segments (see also formulas (1.14) and (1.15) in Sect. 1.3.2), which can be expanded to the following form:

$$S^D = SSE^D_1 L_1, SSE^D_2 L_2, \ldots, SSE^D_n L_n, \tag{11.3}$$

where $SSE^D_j \in \Sigma$ describes the type of secondary structure (as defined in Sect. 11.2.1), n is the number of segments (secondary structures) in a database protein, and $L_j \leq d$ is the length of the jth segment of the database protein $S^D$. The query protein structure $S^Q$, given by a user in the form of a string pattern, is represented by ranges, which gives more flexibility in defining search criteria against proteins in a database:

$$S^Q = SSE^Q_1(L_1; U_1), SSE^Q_2(L_2; U_2), \ldots, SSE^Q_m(L_m; U_m), \tag{11.4}$$
where $SSE^Q_i \in \Sigma$ describes the type of secondary structure (as defined in Sect. 11.2.1), $L_i \leq U_i \leq q$ are the lower and upper limits for the number of successive SSEs of the same type, q is the length of the query protein $S^Q$ measured in residues, which is the maximal length of the string query pattern resulting from expanding the ranges of the pattern, and m is the number of segments in the query pattern. Additionally, the $SSE^Q_i$ can be replaced by the wildcard symbol '?', which denotes any type of secondary structure element from Σ, and the value of $U_i$ can be replaced by the wildcard symbol '*', which denotes $U_i = +\infty$. The advantage of the used alignment method is that it finds local, optimal alignments with possible gaps between corresponding elements. A big drawback is that it is computationally costly, which negatively affects the efficiency of the search process carried out against the whole database. The computational complexity of the original algorithm is $O(nm(n+m))$ when allowing for gaps calculated in the traditional way. However, in PSS-SQL we have modified the way gap penalties are calculated, which results in better efficiency. While aligning two protein structures $S^Q$ and $S^D$, the search engine of PSS-SQL calculates the similarity matrix D according to the following formulas:

$$D_{i,0} = 0 \text{ for } i \in [0, q], \tag{11.5}$$

$$D_{0,j} = 0 \text{ for } j \in [0, d], \tag{11.6}$$

and
$$D_{i,j} = \max \begin{cases} 0 \\ D_{i-1,j-1} + d_{i,j} \\ E_{i,j} \\ F_{i,j} \end{cases}, \tag{11.7}$$
for $i \in [1, q]$, $j \in [1, d]$, where q, d are the lengths of proteins $S^Q$ and $S^D$, and $d_{i,j}$ is the matching degree between elements $SSE^D_j L_j$ and $SSE^Q_i(L_i; U_i)$ of both structures, calculated using the following formula:
$$d_{i,j} = \begin{cases} \omega^{+} & \text{if } SSE^Q_i = SSE^D_j \wedge L_j \geq L_i \wedge L_j \leq U_i \\ \omega^{-} & \text{otherwise} \end{cases}, \tag{11.8}$$
where $\omega^{+}$ is the matching award, and $\omega^{-}$ is the mismatch penalty. If the element $SSE^Q_i$ is equal to '?', then the matching procedure ignores the condition $SSE^D_j = SSE^Q_i$. Similarly, if we assign the '*' symbol to $U_i$, the procedure ignores the condition $L_j \leq U_i$. The auxiliary matrices E and F, called gap penalty matrices, allow horizontal and vertical gap penalties to be calculated with O(1) computational complexity (as opposed to the original method, where this was possible with O(n) computational complexity for each direction). In the first version of PSS-SQL, the calculation of the current element of the matrix D required an inspection of all previously calculated elements in the same row (for a horizontal gap) and all previously calculated elements in the same column (for a vertical gap). By using gap penalty matrices, we only need to check one previous element in a row and one previous element in a column. Such an improvement gives a significant acceleration of the alignment method, and the acceleration is greater for longer sequences of secondary structure elements and larger similarity matrices D. The elements of the gap penalty matrices E and F are calculated according to the following equations:
$$E_{i,j} = \max \begin{cases} E_{i-1,j} - \delta \\ D_{i-1,j} - \sigma \end{cases}, \tag{11.9}$$

and

$$F_{i,j} = \max \begin{cases} F_{i,j-1} - \delta \\ D_{i,j-1} - \sigma \end{cases}, \tag{11.10}$$
where σ is the penalty for opening a gap in the alignment, and δ is the penalty for extending the gap, and
$$E_{i,0} = 0, \quad F_{i,0} = 0 \quad \text{for } i \in [0, q], \tag{11.11}$$

$$E_{0,j} = 0, \quad F_{0,j} = 0 \quad \text{for } j \in [0, d]. \tag{11.12}$$
PSS-SQL uses the following values: matching award $\omega^{+} = 4$, mismatch penalty $\omega^{-} = -1$, gap open penalty $\sigma = -1$, and gap extension penalty $\delta = -0.5$. The filled similarity matrix D encodes many possible paths along which two sequences of SSEs can be aligned. Backtracking from the highest-scoring matrix cell and going along the path until a cell with score 0 is encountered allows the highest-scoring alignment path to be found. However, in the version of the alignment method that is implemented in PSS-SQL, the search engine finds the k-best alignments by searching for consecutive maxima in the similarity matrix D. This is necessary, since the pattern is usually not defined precisely, and contains ranges of SSEs or undefined elements. Therefore, there can be many regions in a protein structure that fit the pattern. In the process of finding alternative alignment paths, the alignment method follows the value of the internal parameter $MPE$ (Minimum Path End), which defines the stop criterion. The search engine finds alignment paths until the next maximum in the similarity matrix D is lower than the value of the $MPE$ parameter. The value of the $MPE$ depends on the specified pattern, according to the following formula:

$$MPE = (MPL \times \omega^{+}) + (NoIS \times \omega^{-}), \tag{11.13}$$
where $MPL$ is the minimum pattern length, and $NoIS$ is the number of imprecise segments in the pattern, i.e., segments for which $L_i \neq U_i$. For example, for the structural pattern h(10;20),e(1;10),c(5),e(5;20), containing an α-helix of length 10–20 elements, a β-strand of length 1–10 elements, a loop of length 5 elements, and a β-strand of length 5–20 elements, $MPL = 21$ (10 elements of type h, 1 element of type e, 5 elements of type c, and 5 elements of type e), $NoIS = 3$ (the first, second, and fourth segments), and therefore $MPE = 81$.
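The formula and the example above can be checked mechanically. The small, hedged parser below reads patterns of the form h(10;20),e(1;10),c(5),e(5;20) (ignoring the '?' and '*' wildcards), accumulates $MPL$ as the sum of lower limits and $NoIS$ as the count of segments with $L_i \neq U_i$, and reproduces $MPE = 81$ for the example; the code is illustrative only, not the PSS-SQL parser.

```cpp
// Hedged sketch: computing MPE according to formula (11.13).
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cstring>

int computeMPE(const char *pattern, int award = 4, int mismatch = -1) {
    int MPL = 0, NoIS = 0;
    const char *p = pattern;
    while ((p = strchr(p, '(')) != nullptr) {
        int L = atoi(p + 1);                       // lower limit of the segment
        const char *semi  = strchr(p, ';');
        const char *close = strchr(p, ')');
        int U = (semi && semi < close) ? atoi(semi + 1) : L;
        MPL += L;
        if (L != U) ++NoIS;                        // imprecise segment
        p = close + 1;
    }
    return MPL * award + NoIS * mismatch;          // formula (11.13)
}

int main() {
    int mpe = computeMPE("h(10;20),e(1;10),c(5),e(5;20)");
    assert(mpe == 81);                             // matches the example in the text
    printf("MPE = %d\n", mpe);
    return 0;
}
```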
11.2.4 Multi-threaded Implementation

In the original PSS-SQL [17], the calculation of the similarity matrix D was performed by a single thread. This negatively affected the performance of PSS-SQL queries or, at least, left a kind of computational reserve unused in the era of multi-core CPUs. In the new version of PSS-SQL [18], we have re-implemented procedures and functions in order to use all processor cores that are available on the computer hosting the database with the PSS-SQL extension. The main constructors of the extension were Bartłomiej Socha (within his MSc thesis [24]), Bożena Małysiak-Mrozek, and me. The multi-threaded implementation of PSS-SQL required a different approach to calculating the values of particular cells of the similarity matrix D.
Fig. 11.5 Calculation of cells in the similarity matrix D by using the wavefront approach. Calculation is performed for cells on diagonals, since their values depend on previously calculated cells. Arrows show dependencies of particular cells and the direction of value derivation
Successive cells cannot be calculated one by one, as in the original version; instead, calculations are carried out for cells located on successive diagonals, as presented in Fig. 11.5. This is because, according to Eqs. 11.7, 11.9, and 11.10, each cell D_{i,j} can be calculated only if the cells D_{i−1,j−1}, D_{i−1,j}, and D_{i,j−1} have already been calculated. Such an approach to the calculation of the similarity matrix is called a wavefront [2, 14]. Moreover, in order to avoid too many synchronizations between running threads (which may lead to significant delays), the entire similarity matrix is divided into so-called areas (Fig. 11.6a). These areas are parts of the similarity matrix and have a smaller size q′ × d′ than the whole similarity matrix.
Fig. 11.6 Division of the similarity matrix D into areas: a arrows show mutual dependencies between areas during the calculation of the matrix; b the order in which areas are calculated in a sample similarity matrix
Assuming that the entire similarity matrix has the size q × d, where q and d are the lengths of the two compared sequences of secondary structure elements, the number of areas that must be calculated is equal to:

n_A = ⌈q/q′⌉ × ⌈d/d′⌉.    (11.14)

For example, for the matrix D of the size 382 × 108 and the area size q′ = 10 and d′ = 10, n_A = ⌈382/10⌉ × ⌈108/10⌉ = 39 × 11 = 429. Areas are assigned to threads working in the system. Each thread is assigned to one area, which is an atomic portion of calculation for the thread. Areas are calculated according to the same wavefront paradigm. The area A_{z,v} can be calculated if the areas A_{z−1,v} and A_{z,v−1} (for z > 0 and v > 0) have been calculated, which implies an earlier calculation of the area A_{z−1,v−1}. The area A_{0,0} is calculated first, since there are no restrictions on its calculation. In order to synchronize the calculations (due to the computational dependencies), each area has a semaphore assigned to it. Semaphores guarantee that an area will not be calculated until the areas it depends on have been calculated. When all cells of an area have been calculated, its semaphore is unlocked. Therefore, each area waits for the unlocking of two semaphores: those of the areas A_{z−1,v} and A_{z,v−1} for z > 0 and v > 0. While calculating an area, each thread executes the algorithm whose pseudocode is presented in Algorithm 9. In Algorithm 9, after the initialization of variables (lines 2–4), the thread enters the critical section marked with the lock keyword (line 5). Entering the critical section means that the thread obtains the mutual-exclusion lock for a given object; it then executes some statements and finally releases the lock. In our case, the thread obtains exclusive access to the coordinates (z, v) of the area that should be calculated, by calling the GetAreaZ() and GetAreaV() methods (lines 6–7). In the critical section, the thread also triggers the calculation of the (z, v) coordinates of the next area that should be calculated by another thread (line 8). Lines 9–11 determine whether this will be the last area calculated by any thread. Upon leaving the critical section, the current thread waits until the areas A_{z−1,v} and A_{z,v−1} are unlocked (lines 13–14). Then, based on the coordinates (z, v) and the area size in both dimensions, the thread determines the absolute coordinates (i, j) of the first cell of the area (lines 15–16). These coordinates are used inside the two following for loops in order to establish the absolute coordinates (i, j) of the current cell of the area. Figure 11.7 helps to interpret the variables used in the algorithm. The value of the current cell is calculated in line 21, according to formulas (11.5)–(11.7). When the thread completes the calculation of the current area, it unlocks the area (line 24) and asks for another area (lines 25–27).
Fig. 11.7 Interpretation of the variables used in Algorithm 9 for the calculated area
Algorithm 9 The algorithm for the calculation of an area by a thread
1: procedure CalculateArea
2:   z ← 0
3:   v ← 0
4:   boolFinish ← true
5:   lock                                  ⊲ starts critical section
6:     z ← GetAreaZ()
7:     v ← GetAreaV()
8:     Calculate (z, v) coordinates of the next area
9:     if calculation successful (i.e., there exists a next area) then
10:      boolFinish ← false
11:    end if
12:  endlock
13:  Wait for unlocking of the area A_{z−1,v}
14:  Wait for unlocking of the area A_{z,v−1}
15:  absStart_i ← z ∗ areaSizeZ
16:  absStart_j ← v ∗ areaSizeV
17:  for rel_i ← 0 to areaSizeZ − 1 do
18:    for rel_j ← 0 to areaSizeV − 1 do
19:      i ← absStart_i + rel_i
20:      j ← absStart_j + rel_j
21:      Calculate cell D_{i,j} according to formulas 11.5–11.7
22:    end for
23:  end for
24:  Unlock area A_{z,v}
25:  if ¬boolFinish then
26:    Apply for the next area (enqueue for execution)
27:  end if
28: end procedure
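The following C# fragment sketches how the per-area synchronization of Algorithm 9 can map onto .NET primitives (the extension itself is written in C#). The type and member names (WavefrontMatrix, CalculateCell) are illustrative assumptions rather than the actual ProteinLibrary code, and a ManualResetEventSlim plays the role the text assigns to a per-area semaphore: it is signaled exactly once and can then be awaited by any number of dependent areas.

using System;
using System.Threading;

class WavefrontMatrix
{
    readonly int areaSizeZ, areaSizeV;
    readonly ManualResetEventSlim[,] done;  // one "semaphore" per area

    public WavefrontMatrix(int q, int d, int areaSizeZ, int areaSizeV)
    {
        this.areaSizeZ = areaSizeZ;
        this.areaSizeV = areaSizeV;
        int nZ = (q + areaSizeZ - 1) / areaSizeZ;   // ceil(q / q'), see Eq. (11.14)
        int nV = (d + areaSizeV - 1) / areaSizeV;   // ceil(d / d')
        done = new ManualResetEventSlim[nZ, nV];
        for (int z = 0; z < nZ; z++)
            for (int v = 0; v < nV; v++)
                done[z, v] = new ManualResetEventSlim(false);
    }

    // Executed by a worker thread for the area A_{z,v}
    public void CalculateArea(int z, int v)
    {
        if (z > 0) done[z - 1, v].Wait();   // dependency on A_{z-1,v}
        if (v > 0) done[z, v - 1].Wait();   // dependency on A_{z,v-1}

        int absStartI = z * areaSizeZ, absStartJ = v * areaSizeV;
        for (int relI = 0; relI < areaSizeZ; relI++)
            for (int relJ = 0; relJ < areaSizeV; relJ++)
                CalculateCell(absStartI + relI, absStartJ + relJ);

        done[z, v].Set();                   // unlock A_{z,v} for dependent areas
    }

    // Placeholder for the cell recurrence (formulas 11.5-11.7)
    void CalculateCell(int i, int j) { /* update D[i,j], E[i,j], F[i,j] */ }
}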
The order in which areas are calculated is determined by a scheduling algorithm that dispatches areas to threads. For example, the order of calculation of particular areas in a similarity matrix of the size 5 × 5 areas is presented in Fig. 11.6b. Such a division of the similarity matrix into areas reduces the number of tasks related to the initialization of the semaphores needed for synchronization purposes and reduces the synchronization time itself, which increases the efficiency of the alignment algorithm.
11.2.5 Consensus on the Area Size

Area sizes for the multi-threaded variant of the alignment procedure that uses multiple scanning of the segment index (+MT+MSSI) were chosen experimentally, taking into account various patterns and a varying number of CPU cores. Tests were performed on Microsoft SQL Server 2012 EE working on nodes of a virtualized cluster controlled by the Hyper-V hypervisor hosted on Microsoft Windows 2008 R2 Datacenter Edition 64-bit. The host server had the following parameters: 2x Intel Xeon CPU E5620 2.40 GHz, 92 GB RAM, 3x 1 TB HDD 7,200 RPM. Cluster nodes were configured to use from 1 up to 4 virtual CPU cores and 4 GB RAM per node, and worked under the Microsoft Windows 2008 R2 Enterprise Edition 64-bit operating system. Tests were performed on the database storing 6,360 protein structures and for PSS-SQL queries containing patterns representing various classes:

• class 1: short patterns and patterns frequently occurring in the database, e.g., c(10;20),h(2;5),c(2;40);
• class 2: patterns with precisely defined regions, e.g., the region e(15) in the sample pattern e(4;20),c(3;10),e(4;20),c(3;10),e(15),c(3;10),e(1;10),c(3;10),e(5;12);
• class 3: patterns with unique regions, e.g., the region h(243) in the sample pattern h(10;20),c(1;10),h(243),c(1;10),h(5;10),c(1;10),h(10;15);
• class 4: patterns with an undefined type of secondary structure (wildcard symbol ?) and with an unlimited length of one of the regions (wildcard symbol *), e.g., the regions ?(1;30) and e(5;*) in the sample pattern c(10;20),h(2;5),c(2;40),?(1;30),e(5;*).

Query patterns chosen for our tests had different characteristics and were representative of the patterns that can be entered by users. During the experiments, we did not observe any dependency between the types of the secondary structures specified in the patterns and the execution time. The aim of this series of tests was to determine the best possible size of the area assigned to every single thread. Together with Bartek Socha, we checked area sizes with heights and widths of 1, 2, 3, …, 9. We tested popular configurations of 1, 2, 3, and 4 threads working in parallel, which corresponded to 1, 2, 3, and 4 CPU cores available for the database management system. This gave 4 × 9² = 324 combinations that were examined for each query pattern.
Table 11.1 Execution times (s) for the PSS-SQL query with the sample pattern from class 2, parallelized on 4 threads, for various sizes (W×H; H for the query sequence, W for the database sequence) of the area

H\W     1      2      3      4      5      6      7      8      9
1    7.074  5.580  5.209  4.712  4.622  5.293  4.437  4.490  4.770
2    5.825  4.955  4.806  4.514  4.422  4.365  4.521  4.358  4.399
3    5.672  4.945  4.599  4.519  4.491  4.428  4.483  4.377  4.372
4    5.569  4.891  4.643  4.566  4.602  4.397  4.295  4.444  4.398
5    5.384  5.026  5.336  4.565  4.571  5.870  4.379  4.487  4.684
6    5.296  4.712  4.872  5.620  4.522  4.535  4.490  4.402  4.462
7    5.325  5.052  4.618  4.558  4.568  4.512  4.382  4.534  4.484
8    5.289  4.733  4.866  4.567  6.289  4.524  4.536  4.456  4.375
9    5.276  4.581  4.791  4.538  4.543  6.133  4.376  4.423  4.510
In Table 11.1, we can see the execution times for the sample PSS-SQL query containing the sample pattern from class 2. This class represents complex patterns that consist of many segments. As we can notice in Table 11.1, smaller area sizes (especially the size 1 × 1) result in higher execution times. Increasing the area size above 2 × 2 reduces the execution time. However, changes of the area size above 3 × 3 do not affect the execution time significantly. The same tendency was observed in all tested cases, i.e., while testing various numbers of threads for different query patterns. Figure 11.8 shows relative execution times for the same pattern class. Relative execution times are presented as a heat map, where particular cells are colored from red (the worst result), through yellow, to green (the best results).
H\W      1      2      3      4      5      6      7      8      9
1     0.00  53.76  67.11  84.99  88.23  64.09  94.89  92.98  82.91
2    44.94  76.25  81.61  92.12  95.43  97.48  91.87  97.73  96.26
3    50.45  76.61  89.06  91.94  92.95  95.21  93.23  97.05  97.23
4    54.16  78.55  87.48  90.25  88.95  96.33 100.00  94.64  96.29
5    60.81  73.70  62.54  90.28  90.07  43.32  96.98  93.09  86.00
6    63.98  84.99  79.24  52.32  91.83  91.36  92.98  96.15  93.99
7    62.94  72.76  88.38  90.54  90.18  92.19  96.87  91.40  93.20
8    64.23  84.24  79.45  90.21  28.25  91.76  91.33  94.21  97.12
9    64.70  89.71  82.15  91.26  91.08  33.86  97.09  95.39  92.26

Fig. 11.8 Relative execution times t%_{i,j} for the PSS-SQL query with the sample pattern from class 2, parallelized on 4 threads, for various sizes (W×H) of the area (%)
Results are expressed as a percentage: 0% denotes the longest execution time and 100% denotes the shortest execution time. Values are calculated according to the following expression:

t%_{i,j} = (max(t) − t_{i,j}) / (max(t) − min(t)) · 100%,    (11.15)
where t_{i,j} is the measured execution time of the sample query taken from Table 11.1 for the corresponding area size, and max(t) and min(t) are the maximal and minimal execution times of the query (in Table 11.1). The heat map (Fig. 11.8) reveals the preferred and recommended area sizes (green) and those that should be avoided (red and orange). We have to remember that various query patterns and the number of possessed CPU cores may move the best point in any direction. Therefore, we have made these types of statistics, as presented in Table 11.1 and Fig. 11.8, for all tested patterns in all tested classes, for all tested n-core CPU configurations and area sizes. Then, in order to determine a universal area size for common patterns, we calculated a weighted arithmetic mean, taking into account the market share of popular n-core processors (n = 1, 2, 3, 4) and the possible assignment of logical CPU cores in virtualized environments. We arbitrarily chose the following weights: 15% for 1-core CPUs (we assumed 1 core = 1 thread of execution), 40% for 2-core CPUs, 5% for 3-core CPUs, and 40% for 4-core CPUs. The results are presented in Fig. 11.9. The best execution times (relative to the worst case) were obtained for area widths greater than 6.
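The normalization (11.15) and the weighted averaging are straightforward to compute; the C# sketch below illustrates both steps. The weight values come from the text, while the method names and the array layout are illustrative assumptions.

using System;

static class Timing
{
    // Normalization from Eq. (11.15): 0% = slowest measured time, 100% = fastest.
    public static double[,] RelativeTimes(double[,] t)
    {
        double max = double.MinValue, min = double.MaxValue;
        foreach (double v in t) { max = Math.Max(max, v); min = Math.Min(min, v); }

        var rel = new double[t.GetLength(0), t.GetLength(1)];
        for (int i = 0; i < t.GetLength(0); i++)
            for (int j = 0; j < t.GetLength(1); j++)
                rel[i, j] = (max - t[i, j]) / (max - min) * 100.0;
        return rel;
    }

    // Weighted mean over per-core-count heat maps; weights per the text:
    // 0.15 (1 core), 0.40 (2 cores), 0.05 (3 cores), 0.40 (4 cores).
    public static double[,] WeightedMean(double[][,] maps, double[] weights)
    {
        var avg = new double[maps[0].GetLength(0), maps[0].GetLength(1)];
        for (int k = 0; k < maps.Length; k++)
            for (int i = 0; i < avg.GetLength(0); i++)
                for (int j = 0; j < avg.GetLength(1); j++)
                    avg[i, j] += weights[k] * maps[k][i, j];
        return avg;
    }
}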
H\W      1      2      3      4      5      6      7      8      9
1    36.49  45.79  63.88  70.29  73.26  75.49  77.51  79.44  79.46
2    44.13  52.33  67.74  72.89  75.43  77.18  79.25  80.29  80.31
3    56.68  62.91  73.89  76.78  79.04  80.28  82.61  82.45  82.58
4    61.38  65.60  74.35  77.04  77.54  77.87  82.57  80.83  79.67
5    64.27  67.12  72.63  75.10  75.81  77.35  78.28  79.01  79.45
6    65.67  68.07  72.64  75.42  75.95  77.27  77.65  78.54  78.93
7    66.72  69.47  73.91  73.48  75.49  77.66  79.75  80.44  80.96
8    66.97  69.76  76.11  75.51  74.90  75.98  79.46  79.57  80.47
9    68.00  70.54  76.36  75.28  74.54  75.95  79.84  80.07  80.39

Fig. 11.9 Weighted arithmetic mean of the relative execution times of all PSS-SQL queries parallelized on various numbers of threads (1, 2, 3, 4), depending on the size (W×H) of the area (%)
Fig. 11.10 Histogram of the number of secondary structures (segments) identified in the proteins deposited in the tested database
The height of the area has no significant impact, since compressed query sequences of secondary structures are usually short, while database sequences are relatively longer. Figure 11.10 shows the histogram of the number of secondary structures identified in the proteins stored in the tested database. It shows that most of the proteins have fewer than 100 secondary structures (segments). Therefore, the widths of whole similarity matrices after the compression of sequences of secondary structures should be lower than 100 elements in most cases. The heights of whole similarity matrices depend on the query pattern. For example, the sample query pattern from pattern class 2 contains 9 segments, and consequently, the height of the similarity matrix is 9. Therefore, on the basis of our experiments, we have chosen 3×7 (H×W) as the area size, i.e., 3 along the query pattern and 7 along the database sequence of secondary structures.
11.3 SQL as the Interface Between User and the Database

PSS-SQL (Protein Secondary Structure - Structured Query Language) extends the standard syntax of the SQL language by providing additional functions that allow searching for protein similarities on secondary structures. The SQL language thus becomes the user interface (UI) between the user, who is a data consumer, and the database management system hosting secondary structures of proteins. PSS-SQL exposes three important functions for scanning protein secondary structures: containSequence, sequencePosition, and sequenceMatch; all of them are described in this chapter. PSS-SQL also covers a series of supplementary procedures and functions, which are used implicitly, e.g., for extracting segments of particular types of SSEs, building additional segment tables, indexing sequences of SSEs, processing these sequences, aligning the target structures from a database to the query pattern, validating patterns, and many other operations.
Fig. 11.11 General architecture of the system with the PSS-SQL extension. The PSS-SQL extension is registered in the Microsoft SQL Server DBMS. When the user submits a query invoking PSS-SQL functions (actually, Transact-SQL functions), the DBMS redirects the call to the PSS-SQL extension, which invokes appropriate functions assembled in the ProteinLibrary DLL library, passing the appropriate parameters
The PSS-SQL extension was developed in the C# programming language. All procedures were assembled in the ProteinLibrary DLL file and registered in Microsoft SQL Server 2008 R2/2012 (Fig. 11.11).
11.3.1 Pattern Representation in PSS-SQL Queries

While searching for protein similarities on secondary structures, we need to pass the query structure (query pattern) as a parameter of the search procedure. In PSS-SQL queries, the pattern is represented as in formula (11.4). Such a representation allows users to formulate a large number of various query types with different degrees of complexity. Moreover, we assumed that query patterns should be as simple as possible and should not cause any syntax difficulties. Therefore, we have defined a corresponding grammar that helps in constructing the query pattern. In simple words, in PSS-SQL queries the pattern is represented by blocks of segments. Each segment is determined by its type and length. The segment length can be specified precisely or as an interval. It is possible to define segments for which the type is unimportant or undefined (wildcard symbol '?'), and segments for which the upper limit of the interval is not defined (wildcard symbol '*'). The grammar for defining patterns, written in the Chomsky notation, is formally defined as the ordered quad-tuple:

G_pss = (N_pss, Σ_pss, P_pss, S_pss),    (11.16)

where N_pss is a finite set of non-terminal symbols, Σ_pss is a finite set of terminal symbols, P_pss is a finite set of production rules, and S_pss is a distinguished symbol S ∈ N_pss that is the start symbol.
The following terms are compliant with the defined grammar G_pss:

• h(1;10), representing an α-helix of the length 1–10 elements;
• e(2;5),h(10;*),c(1;20), representing a β-strand of the length 2–5 elements, followed by an α-helix of the length at least 10 elements and a loop of the length 1–20 elements;
• e(10;15),?(5;20),h(35), representing a β-strand of the length 10–15 elements, followed by any element of the length 5–20 elements and an α-helix of the exact length 35 elements.

With such a representation of the query pattern, we can start the search process using one of the functions exposed by the PSS-SQL extension.
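The production rules of G_pss are not reproduced here, but for simple syntax checking the pattern language can also be captured by a regular expression, since it is a flat, comma-separated list of segments. The following C# sketch is inferred from the examples above and is an assumption, not the book's grammar definition:

using System.Text.RegularExpressions;

static class PatternSyntax
{
    // A segment is t(l) or t(l;u), where t is h, e, c, or '?', l is a positive
    // integer, and u is a positive integer or the '*' wildcard; segments are
    // separated by commas.
    public static readonly Regex PatternGrammar =
        new Regex(@"^[hec?]\(\d+(;(\d+|\*))?\)(,[hec?]\(\d+(;(\d+|\*))?\))*$");
}

// PatternSyntax.PatternGrammar.IsMatch("e(2;5),h(10;*),c(1;20)")  ->  true
// PatternSyntax.PatternGrammar.IsMatch("x(5)")                    ->  false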
11.3.2 Sample Queries in PSS-SQL

The PSS-SQL extension provides a set of functions and procedures for processing protein secondary structures. Three of the functions can be effectively invoked from SQL commands, usually the SELECT statement. The containSequence function verifies whether a particular protein or a set of database proteins contains the structural pattern specified as a query pattern. This function returns the Boolean value 1 (true) if the database protein contains the specified pattern, or 0 (false) if the protein does not include the pattern. A sample invocation of the function is shown in Listing 11.1.
SELECT protID, protAC
FROM ProteinTbl
WHERE name LIKE '%Escherichia coli%'
  AND dbo.containSequence(id, 'secondary', 'h(5;15),c(3),?(6),c(1;5)') = 1
Listing 11.1 Sample query invoking the containSequence function and returning identifiers of proteins from Escherichia coli that contain the given secondary structure pattern.
The sample query returns identifiers and Accession Numbers of proteins from Escherichia coli having a structural region that contains an α-helix of the length 5–15 elements, followed by a loop of exactly 3 elements, a segment of any type of the length 6 elements, and a loop of the length 1–5 elements. The input arguments of the PSS-SQL functions are described in Table 11.2.
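Because the PSS-SQL functions are invoked from ordinary SELECT statements, a client application needs nothing beyond a standard SQL Server connection. The C# sketch below executes the query from Listing 11.1 via ADO.NET; only the connection string (server and database name) is an assumption, the SQL itself is taken from the listing.

using System;
using Microsoft.Data.SqlClient;   // or System.Data.SqlClient on older stacks

const string sql = @"
    SELECT protID, protAC
    FROM ProteinTbl
    WHERE name LIKE '%Escherichia coli%'
      AND dbo.containSequence(id, 'secondary', 'h(5;15),c(3),?(6),c(1;5)') = 1";

// 'ProteinsDb' is a hypothetical database name used for illustration.
using var conn = new SqlConnection("Server=.;Database=ProteinsDb;Integrated Security=true");
conn.Open();
using var cmd = new SqlCommand(sql, conn);
using var reader = cmd.ExecuteReader();
while (reader.Read())
    Console.WriteLine($"{reader["protID"]}: {reader["protAC"]}");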
Table 11.2 Input arguments of PSS-SQL functions

Argument      Description
@ProteinId    Unique identifier of a protein in the database table that contains sequences of SSEs (e.g., the id field in the case of the ProteinTbl)
@SSEsField    Database field containing sequences of SSEs of proteins (e.g., secondary)
@Pattern      Query pattern represented by a set of segments, e.g., h(2;10),c(1;5),?(2;*)
@Predicate    Optional, simple or complex filtering criteria that limit the list of proteins processed during the search, e.g., length > 150
-- invoking sequenceMatch and CROSS APPLY
SELECT p.protAC AS AC, p.name, s.startPos, s.endPos,
       p.[primary], s.matchingSeq, p.secondary
FROM ProteinTbl AS p
CROSS APPLY dbo.sequenceMatch(p.id, 'secondary',
     'e(1;10),c(0;5),h(5;6),c(0;5),e(1;10),c(5)') AS s
WHERE p.name LIKE '%Staphylococcus aureus%' AND p.length > 150
ORDER BY AC, s.startPos

-- invoking sequencePosition and standard JOIN
SELECT p.protAC AS AC, p.name, s.startPos, s.endPos,
       p.[primary], s.matchingSeq, p.secondary
FROM ProteinTbl AS p
JOIN dbo.sequencePosition('secondary',
     'e(1;10),c(0;5),h(5;6),c(0;5),e(1;10),c(5)',
     'p.name LIKE ''%Staphylococcus aureus%'' AND p.length > 150') AS s
  ON p.id = s.proteinId
ORDER BY AC, s.startPos
Listing 11.2 Sample queries invoking the sequenceMatch and sequencePosition table functions and returning information on proteins from Staphylococcus aureus having the length greater than 150 residues and containing the given secondary structure pattern.
These sample queries return Accession Numbers (ACs) and names of proteins from Staphylococcus aureus having the length greater than 150 residues and a structural region containing a β-strand of the length from 1 to 10 elements, an optional loop of up to 5 elements, an α-helix of the length 5 to 6 elements, an optional loop of up to 5 elements, a β-strand of the length 1–10 elements, and a 5-element loop, i.e., the pattern e(1;10),c(0;5),h(5;6),c(0;5),e(1;10),c(5). Partial results of the query from Listing 11.2 are shown in Fig. 11.12. A detailed description of the output fields of the sequenceMatch and sequencePosition functions is given in Table 11.3. Results of PSS-SQL queries are originally returned in a tabular form. However, adding an extra FOR XML clause at the end of the SELECT statement, as in the example in Listing 11.3, produces results in the XML format, which can be easily transformed into an HTML Web page by using an appropriate XSLT transformation file and, finally, published on the Internet. Partial results of the query from Listing 11.3 are presented in Fig. 11.13.
Table 11.3 Output table of the sequenceMatch and sequencePosition functions

Field         Description
proteinId     Unique identifier of the protein that contains the specified pattern
startPos      Position where the pattern starts in the target protein from the database
endPos        Position where the pattern ends in the target protein from the database
length        Length of the segment that matches the given pattern
matchingSeq   Exact sequence of SSEs that matches the pattern defined in the query
Fig. 11.13 Partial results of the query from Listing 11.3
An additional function, superimpose, used in the presented query (Listing 11.3), visualizes the alignment of the matched sequence and the database sequence of SSEs.
SELECT p.protAC AS AC, p.name, s.startPos, s.endPos, s.matchingSeq, p.[primary],
       dbo.superimpose(s.matchingSeq, p.secondary) AS alignment
FROM ProteinTbl AS p
CROSS APPLY dbo.sequenceMatch(p.id, 'secondary',
     'e(1;10),c(0;5),h(5;6),c(0;5),e(1;10),c(5)') AS s
WHERE p.name LIKE '%Staphylococcus aureus%' AND p.length > 150
ORDER BY AC, s.startPos
FOR XML RAW('protein'), ROOT('proteins'), ELEMENTS
Listing 11.3 Sample query invoking sequenceMatch table function and returning results as an XML document by using the FOR XML clause.
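On the client side, the XML document produced by the FOR XML clause can be turned into an HTML page with the standard .NET XSLT machinery. The sketch below assumes the query result has been saved to proteins.xml and that a user-supplied stylesheet proteins.xslt maps the <protein> elements to HTML; both file names are illustrative.

using System.Xml;
using System.Xml.Xsl;

var xslt = new XslCompiledTransform();
xslt.Load("proteins.xslt");                           // stylesheet: <protein> -> HTML
using var reader = XmlReader.Create("proteins.xml");  // XML saved from the query result
using var writer = XmlWriter.Create("proteins.html", xslt.OutputSettings);
xslt.Transform(reader, writer);                       // writes the final HTML page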
11.4 Efficiency of the PSS-SQL

The efficiency of the PSS-SQL query language was examined in various experiments. Tests were performed on Microsoft SQL Server 2012 Enterprise Edition working on nodes of a virtualized cluster controlled by the Hyper-V hypervisor hosted on Microsoft Windows 2008 R2 Datacenter Edition 64-bit. The host server had the following parameters: 2x Intel Xeon CPU E5620 2.40 GHz, 92 GB RAM, 3x 1 TB HDD 7,200 RPM. Cluster nodes were configured to use 4 virtual CPU cores and 4 GB RAM per node and worked under the Microsoft Windows 2008 R2 Enterprise Edition 64-bit operating system. Most of the tests were performed on the database storing 6,360 protein structures. However, in order to compare our language to one of the competitive solutions, some tests were performed on a database storing 248,375 protein structures. During the experiments, we measured execution times for various query patterns. The query patterns were passed as a parameter of the sequencePosition function. Tests were performed for queries containing the following sample patterns:

• SSE1: e(4;20),c(3;10),e(4;20),c(3;10),e(15),c(3;10),e(1;10)
• SSE2: h(30;40),c(1;5),?(50;60),c(5;10),h(29),c(1;5),h(20;25)
• SSE3: h(10;20),c(1;10),h(243),c(1;10),h(5;10),c(1;10),h(10;15)
• SSE4: e(1;10),c(1;5),e(27),h(1;10),e(1;10),c(1;10),e(5;20)
• SSE5: e(5;20),h(2;5),c(2;40),?(1;30),e(5;*)
Pattern SSE1 represents a protein structure built only of β-strands connected by loops. Pattern SSE2 consists of several α-helices connected by loops and one undefined segment of SSEs (the '?' wildcard symbol). Patterns SSE3 and SSE4 have regions that are unique in the database, i.e., h(243) in pattern SSE3 and e(27) in pattern SSE4. Pattern SSE5 contains the wildcard symbol '*' for an undetermined length, which slows down the search process. In order to verify the influence of particular acceleration techniques on the execution times, the tests were carried out for the PSS-SQL in three variants:

• without multi-threading (–MT),
• with multi-threading, but without multiple scanning of the segment index (+MT–MSSI),
• with multi-threading and with multiple scanning of the segment index (+MT+MSSI).
11.4 Efficiency of the PSS-SQL 120 -MT +MT-MSSI +MT+MSSI
100 80
time (s)
Fig. 11.14 Execution time for various query patterns SSE1-SSE4 and for three variants of the PSS-SQL language: without multi-threading (–MT), with multi-threading, but without multiple scanning of the segment index (+MT–MSSI), with multi-threading and with multiple scanning of the segment index (+MT+MSSI)
305
60 40 20 0 SSE1
SSE2
SSE3
SSE4
Query 1,200 -MT +MT-MSSI +MT+MSSI
1,000 800
time (s)
Fig. 11.15 Execution time for query pattern SSE5 for three variants of the PSS-SQL language: without multi-threading (–MT), with multi-threading, but without multiple scanning of the segment index (+MT–MSSI), with multi-threading and with multiple scanning of the segment index (+MT+MSSI)
600 400 200 0
SSE5
For universal patterns, like SSE1 and SSE2, for which many fitting proteins or multiple alignments can be found, we observe longer execution times. In such cases, the parallelization and the multiple scanning of the segment index start playing a more significant role. In these cases, the length of the pattern influences the alignment time: for longer patterns we experienced longer response times. We have not observed any dependency between the type of the SSE and the response time. However, specifying wildcards in the query pattern increases the waiting period, which is visible for the pattern SSE5 (Fig. 11.15). In Fig. 11.15, we can also see how beneficial the use of the MSSI technique can be for the pattern SSE5. In this particular case, the execution time was reduced from 920 s in –MT (the original PSS-SQL) and 550 s in +MT–MSSI to 15 s in +MT+MSSI, which gives a 61.33-fold speedup over the –MT variant and a 36.67-fold speedup over the +MT–MSSI variant.
11.5 Discussion

The PSS-SQL language complements existing relational database management systems, which are not designed to process biological data such as protein secondary structures stored as sequences of secondary structure elements. By extending the standard SELECT, UPDATE, and DELETE statements of the SQL language, it provides a declarative method for retrieving, modifying, and deleting records. Records that satisfy the criteria given by a user can be returned in a table-like form or as an XML document, which is easy to display as a Web page. In this way, the PSS-SQL extension to an RDBMS provides a kind of domain-specific language for processing protein secondary structures. This is especially important for relational database designers, a wide group of biological data analysts, and bioinformaticians. The PSS-SQL language can be used for the fast classification of proteins based on their secondary structures. For example, systems such as SCOP [20] and CATH [21] make use of the secondary structure description of protein structures in order to classify proteins into classes and families. PSS-SQL can also support protein 3D structure prediction by homology modeling, where an appropriate structure profile can be found based on the primary and secondary structure, and the secondary structure can be superimposed on the protein of the unknown 3D structure before performing a free energy minimization. Comparing the PSS-SQL to the other languages presented in Sect. 11.1, we can notice that all variants of the PSS-SQL extend the syntax of the SQL. This makes the PSS-SQL similar to PiQL [26], rather than to ProteinQL [27]. ProteinQL was developed for an object-oriented database and relies on its own domain-specific database and a dedicated ProteinQL interpreter and translator. As opposed to ProteinQL, both PiQL and PSS-SQL extend the capabilities of a relational database management system (RDBMS). They extend the syntax of the SQL language by providing additional functions that can be nested in particular clauses of SQL commands. However, the form of the queries provided by users is different. PiQL accepts query patterns in a full form, like BLAST [1], a tool used for fast local matching of biomolecular sequences of DNA and proteins. Query patterns provided in PSS-SQL are similar to those presented by Hammel and Patel in [10]. The pattern defined in a query does not have to be specified strictly. Segments in the pattern can be specified as intervals, and they can have undefined lengths. Both languages allow specifying query patterns with undefined types of the SSE, or patterns where some SSE segments may occur optionally. Therefore, the search process has an approximate character, regarding various possible options for segment matching. The possibility of defining patterns that include optional segments allows users to specify gaps in a particular place. The described version of the PSS-SQL also uses the method of scanning the segment index in order to accelerate the search process. The method was adopted from the work of Hammel and Patel [10]. However, after multiple scans of the segment index, Hammel and Patel used sort-merge join operations in order to join segments from the same candidate proteins and to decide whether they meet the specified query conditions or not. The novelty of PSS-SQL is that it relies on the alignment of
the found segments. The alignment implemented in PSS-SQL gives the unique possibility of finding many matches for the same database protein and returning the k best matches, which in some particular cases can be separated by gaps. These are not the gaps defined by a user and specified by an optional segment, but gaps providing a better alignment of particular regions. This type of matching is typical for similarity searching between biomolecular sequences, such as DNA/RNA sequences or amino acid sequences. The presented approach extends the spectrum of searching and guarantees the optimality of the results according to the assumed scoring system. Despite the fact that PSS-SQL uses the alignment procedure, which is computationally complex, it achieves quite good performance. We have compared the efficiency of the PSS-SQL (the +MT+MSSI variant) and the language presented by Hammel and Patel for single-predicate exact match queries with various selectivity (between 0.3 and 6%), using the database storing 248,375 proteins (515 MB for the ProteinTbl, 254 MB for the segment table storing 11,986,962 segments). The PSS-SQL was on average 5.14 times faster than the Comm-Seg implementation and 3.28 times faster than the Comm-CSP implementation, both implemented on a commercial ORDBMS, and 1.84 times faster than the ISS-MISS(1) implementation on Periscope/SQ. This proves that PSS-SQL compensates for the efficiency loss caused by the alignment procedure by using the segment index. In this way, the PSS-SQL joins the wide capabilities of the alignment process (possible gaps, mismatches, and many solutions), provides optimality and quality of results, and guarantees efficiency of scanning databases of secondary structures.
11.6 Summary

Integrating methods of protein secondary structure similarity searching with database management systems provides an easy way to manipulate biological data without the necessity of using external data mining applications. The PSS-SQL extension presented in this chapter is a successful example of such an integration. PSS-SQL is certainly a good option for biological and biomedical data analysts who want to process their data on the server side. This has many advantages that are typical for such processing in the client-server architecture. The entire logic of data processing is performed on the database server, which reduces the load on the user's computer. Therefore, data exploration is performed while retrieving data from a database. Moreover, the amount of data returned to the user, and the network traffic between the server and the user application, are much reduced. The use of multi-threading allows the available computing power to be utilized more efficiently. The PSS-SQL adapts to the number of processing units possessed by the server hosting the database management system and to the number of cores used by the database system. This results in a better performance of the language while scanning huge databases of protein secondary structures. Parallelization of calculations in bioinformatics brings tangible benefits and reduces the execution time of many algorithms. In this chapter, we could see one of many examples of such parallelization. For the latest information on the PSS-SQL,
please visit the project home page: http://zti.polsl.pl/dmrozek/science/pss-sql.htm. For readers interested in other examples, I recommend the book Parallel Computing for Bioinformatics and Computational Biology by Albert Y. Zomaya [31] for further reading.
References

1. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic local alignment search tool. J. Mol. Biol. 215(3), 403–410 (1990). http://www.sciencedirect.com/science/article/pii/S0022283605803602
2. Anvik, J., MacDonald, S., Szafron, D., Schaeffer, J., Bromling, S., Tan, K.: Generating parallel programs from the wavefront design pattern. In: Proceedings 16th International Parallel and Distributed Processing Symposium, p. 8 (2002)
3. Apweiler, R., Bairoch, A., Wu, C.H., Barker, W.C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M., Martin, M.J., Natale, D.A., O'Donovan, C., Redaschi, N., Yeh, L.L.: UniProt: the universal protein knowledgebase. Nucleic Acids Res. 32(suppl-1), D115–D119 (2004). https://doi.org/10.1093/nar/gkh131
4. Berman, H., et al.: The Protein Data Bank. Nucleic Acids Res. 28, 235–242 (2000)
5. Boutet, E., Lieberherr, D., Tognolli, M., Schneider, M., Bansal, P., Bridge, A.J., Poux, S., Bougueleret, L., Xenarios, I.: UniProtKB/Swiss-Prot, the manually annotated section of the UniProt knowledgebase: how to use the entry view, 23–54 (2016)
6. Can, T., Wang, Y.F.: CTSS: a robust and efficient method for protein structure alignment based on local geometrical and biological features. In: Computational Systems Bioinformatics. CSB2003. Proceedings of the 2003 IEEE Bioinformatics Conference (CSB2003), pp. 169–179 (2003)
7. Date, C.: An Introduction to Database Systems, 8th edn. Addison-Wesley, USA (2003)
8. Frishman, D., Argos, P.: Incorporation of non-local interactions in protein secondary structure prediction from the amino acid sequence. Protein Eng. 9(2), 133–142 (1996)
9. Gibrat, J., Madej, T., Bryant, S.: Surprising similarities in structure comparison. Curr. Opin. Struct. Biol. 6(3), 377–385 (1996)
10. Hammel, L., Patel, J.M.: Searching on the secondary structure of protein sequences. In: Bernstein, P.A., Ioannidis, Y.E., Ramakrishnan, R., Papadias, D. (eds.) VLDB '02: Proceedings of the 28th International Conference on Very Large Databases, pp. 634–645. Morgan Kaufmann, San Francisco (2002)
11. Joosten, R.P., te Beek, T.A., Krieger, E., Hekkelman, M.L., Hooft, R.W., Schneider, R., Sander, C., Vriend, G.: A series of PDB related databases for everyday needs. Nucleic Acids Res. 39(suppl-1), D411–D419 (2011). https://doi.org/10.1093/nar/gkq1105
12. Kabsch, W., Sander, C.: Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22(12), 2577–2637 (1983)
13. Källberg, M., Wang, H., Wang, S., Peng, J., Wang, Z., Lu, H., Xu, J.: Template-based protein structure modeling using the RaptorX web server. Nat. Protoc. 7, 1511–1522 (2012)
14. Liu, W., Schmidt, B.: Parallel design pattern for computational biology and scientific computing applications. In: 2003 Proceedings of IEEE International Conference on Cluster Computing, pp. 456–459 (2003)
15. Małysiak-Mrozek, B., Kozielski, S., Mrozek, D.: Server-side query language for protein structure similarity searching, pp. 395–415. Springer, Berlin (2012). https://doi.org/10.1007/978-3-642-23172-8_26
16. Mrozek, D., Małysiak-Mrozek, B.: CASSERT: a two-phase alignment algorithm for matching 3D structures of proteins. In: Kwiecień, A., Gaj, P., Stera, P. (eds.) Computer Networks. Communications in Computer and Information Science, vol. 370, pp. 334–343. Springer International Publishing, Berlin (2013)
17. Mrozek, D., Wieczorek, D., Małysiak-Mrozek, B., Kozielski, S.: PSS-SQL: protein secondary structure - structured query language. In: 2010 Annual International Conference of the IEEE Engineering in Medicine and Biology, pp. 1073–1076 (2010)
18. Mrozek, D., Małysiak-Mrozek, B., Socha, B., Kozielski, S.: Selection of a consensus area size for multithreaded wavefront-based alignment procedure for compressed sequences of protein secondary structures. In: Kryszkiewicz, M., Bandyopadhyay, S., Rybinski, H., Pal, S.K. (eds.) Pattern Recognition and Machine Intelligence. Lecture Notes in Computer Science, vol. 9124, pp. 472–481. Springer International Publishing, Cham (2015)
19. Mrozek, D., Socha, B., Kozielski, S., Małysiak-Mrozek, B.: An efficient and flexible scanning of databases of protein secondary structures. J. Intell. Inf. Syst. 46(1), 213–233 (2016). https://doi.org/10.1007/s10844-014-0353-0
20. Murzin, A.G., Brenner, S.E., Hubbard, T., Chothia, C.: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247(4), 536–540 (1995). http://www.sciencedirect.com/science/article/pii/S0022283605801342
21. Orengo, C., Michie, A., Jones, S., Jones, D., Swindells, M., Thornton, J.: CATH: a hierarchic classification of protein domain structures. Structure 5(8), 1093–1109 (1997). http://www.sciencedirect.com/science/article/pii/S0969212697002608
22. Shapiro, J., Brutlag, D.: FoldMiner and LOCK2: protein structure comparison and motif discovery on the Web. Nucleic Acids Res. 32, 536–541 (2004)
23. Smith, T., Waterman, M.: Identification of common molecular subsequences. J. Mol. Biol. 147(1), 195–197 (1981). http://www.sciencedirect.com/science/article/pii/0022283681900875
24. Socha, B.: Multithreaded execution of the Smith-Waterman algorithm in the query language for protein secondary structures. Master's thesis, Institute of Informatics, Silesian University of Technology, Gliwice, Poland (2013)
25. Stephens, S.M., Chen, J.Y., Davidson, M.G., Thomas, S., Trute, B.M.: Oracle Database 10g: a platform for BLAST search and regular expression pattern matching in life sciences. Nucleic Acids Res. 33(suppl-1), D675–D679 (2005). https://doi.org/10.1093/nar/gki114
26. Tata, S., Friedman, J.S., Swaroop, A.: Declarative querying for biological sequences. In: 22nd International Conference on Data Engineering (ICDE'06), pp. 87–98 (2006)
27. Wang, Y., Sunderraman, R., Tian, H.: A domain specific data management architecture for protein structure data. In: 2006 International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 5751–5754 (2006)
28. Wieczorek, D., Małysiak-Mrozek, B., Kozielski, S., Mrozek, D.: A declarative query language for protein secondary structures. J. Med. Inform. Technol. 16, 139–148 (2010)
29. Wieczorek, D., Małysiak-Mrozek, B., Kozielski, S., Mrozek, D.: A method for matching sequences of protein secondary structures. J. Med. Inform. Technol. 16, 133–137 (2010)
30. Yang, Y., Faraggi, E., Zhao, H., Zhou, Y.: Improving protein fold recognition and template-based modeling by employing probabilistic-based matching between predicted one-dimensional structural properties of query and corresponding native properties of templates. Bioinformatics 27(15), 2076–2082 (2011). http://dx.doi.org/10.1093/bioinformatics/btr350
31. Zomaya, A.Y.: Parallel Computing for Bioinformatics and Computational Biology: Models, Enabling Technologies, and Case Studies, 1st edn. Wiley-Interscience, New York (2006)