269 92 35MB
English Pages XIII, 309 [313] Year 2021
LNCS 12582
Diganta Goswami Truong Anh Hoang (Eds.)
Distributed Computing and Internet Technology 17th International Conference, ICDCIT 2021 Bhubaneswar, India, January 7–10, 2021 Proceedings
Lecture Notes in Computer Science Founding Editors Gerhard Goos Karlsruhe Institute of Technology, Karlsruhe, Germany Juris Hartmanis Cornell University, Ithaca, NY, USA
Editorial Board Members Elisa Bertino Purdue University, West Lafayette, IN, USA Wen Gao Peking University, Beijing, China Bernhard Steffen TU Dortmund University, Dortmund, Germany Gerhard Woeginger RWTH Aachen, Aachen, Germany Moti Yung Columbia University, New York, NY, USA
12582
More information about this subseries at http://www.springer.com/series/7409
Diganta Goswami Truong Anh Hoang (Eds.) •
Distributed Computing and Internet Technology 17th International Conference, ICDCIT 2021 Bhubaneswar, India, January 7–10, 2021 Proceedings
123
Editors Diganta Goswami Indian Institute of Technology Guwahati Guwahati, India
Truong Anh Hoang University of Engineering and Technology Hanoi, Vietnam
ISSN 0302-9743 ISSN 1611-3349 (electronic) Lecture Notes in Computer Science ISBN 978-3-030-65620-1 ISBN 978-3-030-65621-8 (eBook) https://doi.org/10.1007/978-3-030-65621-8 LNCS Sublibrary: SL3 – Information Systems and Applications, incl. Internet/Web, and HCI © Springer Nature Switzerland AG 2021 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
This volume contains the papers selected for presentation at the 17th International Conference on Distributed Computing and Internet Technology (ICDCIT 2021) held during January 7–10, 2021, in Bhubaneswar, India. From its humble beginnings, the ICDCIT conference series has grown to a conference of international repute and has become a global platform for Computer Science researchers to exchange research results and ideas on the foundations and applications of Distributed Computing and Internet Technology. An additional goal of ICDCIT is to provide an opportunity for students and young researchers to get exposed to topical research directions of Distributed Computing and Internet Technology. This year we received 99 full paper submissions. Each submission considered for publication was reviewed by about three Program Committee (PC) members with the help from reviewers outside the PC. Based on the reviews, the PC decided to accept 17 papers – 13 regular papers and 4 short papers – for presentation at the conference, with an acceptance rate of 17%. We would like to express our gratitude to all the researchers who submitted their work to the conference. Our special thanks goes to all colleagues who served on the PC, as well as the external reviewers, who generously offered their expertise and time, which helped us select the papers and prepare the conference program. We were fortunate to have three invited speakers – Ajay D. Kshemkalyani from the University of Illinois at Chicago, USA, Martin Gogolla from the University of Bremen, Germany, and Nguyen Le Minh from Japan Advanced Institute of Science and Technology, Japan. Their talks provided us with the unique opportunity to hear the leaders of their field. The papers related to the talks were also included in this volume. A number of colleagues have worked very hard to make this conference a success. We wish to express our thanks to the Local Organizing Committee, organizers of the satellite events, and many student volunteers. The School of Computer Engineering, Kalinga Institute of Industrial Technology (KIIT), Bhubaneswar, the host of the conference, provided various support and facilities for organizing the conference and its associated events. Finally, we enjoyed institutional and financial support from KIIT Bhubaneswar. Particularly, we express our sincere thanks to Achyuta Samanta, the Founder of KIIT Bhubaneswar, for his continuous support to ICDCIT since its inception. We express our appreciation to all the Steering Committee members, and in particular Prof. Hrushikesha Mohanty and Prof. Raja Natarajan, whose counsel we frequently relied on. We are also thankful to D.N. Dwivedy for his active participation in ICDCIT. Our thanks are due to the faculty members of the School of Computer Engineering, KIIT Bhubaneswar, for their timely support.
vi
Preface
The conference program and proceedings were prepared with the help of EasyChair. We thank Springer for continuing to publish the conference proceedings. We hope you will find the papers in this collection stimulating and rewarding. January 2021
Diganta Goswami Truong Anh Hoang
Organization
General Chair Laxmi Parida
IBM, USA
Program Committee Chairs Diganta Goswami Hoang Anh Truong
IIT Guwahati, India VNU-UET, Vietnam
Conference Management Chair Manjusha Pandey
KIIT, India
Organizing Chair Jagannath Singh
KIIT, India
Finance Chair Chittaranjan Pradhan
KIIT, India
Publicity Chair Hrudaya Kumar Tripathy
KIIT, India
Registration Chairs Bindu Agarwalla Abhaya Kumar Sahoo
KIIT, India KIIT, India
Session Management Chairs Sital Dash Saurabh Bilgaiyan
KIIT, India KIIT, India
Publications Chair Junali Jasmine Jena
KIIT, India
viii
Organization
Student Symposium Chairs Manoj Kumar Mishra Amiya Ranjan Panda
KIIT, India KIIT, India
Industry Symposium Chairs Phani Premaraju Siddharth Swarup Rautaray
KIIT, India KIIT, India
Project Innovation Contest Chairs Ajay Kumar Jena Krishna Chakravarty
KIIT, India KIIT, India
Steering Committee Maurice Herlihy Gérard Huet Bud Mishra Hrushikesha Mohanty Raja Natarajan David Peleg R. K. Shyamasundar
Brown University, USA Inria, France Courant Institute, NYU, USA KIIT, India TIFR, India WIS, Israel IIT Bombay, India
Program Committee Arup Bhattacharjee Santosh Biswas Duc-Hanh Dang Ashok Kumar Das Meenakshi D’Souza Anh Nguyen Duc Christian Erfurth Antony Franklin Sasthi Charan Ghosh Diganta Goswami Arobinda Gupta Truong Anh Hoang Chittaranjan Hota Dang Van Hung Hemangee Kapoor Dac-Nhuong Le Ulrike Lechner Partha Sarathi Mandal Ganapathy Mani
NIT Silchar, India IIT Bhilai, India VNU-UET, Vietnam IIIT Hyderabad, India IIIT Bangalore, India University College of Southeast Norway, Norway University of Applied Sciences Jena, Germany IIT Hyderabad, India ISI Kolkata, India IIT Guwahati, India IIT Kharagpur, India VNU-UET, Vietnam BITS Pilani, India VNU-UET, Vietnam IIT Guwahati, India Haiphong University, Vietnam Bundeswehr University Munich, Germany IIT Guwahati, India Purdue University, USA
Organization
Hrushikesha Mohanty Arijit Mukherjee Krishnendu Mukhopadhyaya Dmitry Namiot Raja Natarajan Atul Negi Himadri Sekhar Paul Sathya Peri P. S. V. S. Sai Prasad Tho Quan Ramaswamy Ramanujam Shrisha Rao K. Sreenivas Rao Debasis Samanta Chayan Sarkar Nityananda Sarma Somnath Tripathy T. Venkatesh
KIIT, India TCS Research and Innovation, India ISI Kolkata, India Lomonosov Moscow State University, Russia Tata Institute of Fundamental Research, Mumbai, India SCIS, University of Hyderabad, India TCS Research and Innovation, India IIT Hyderabad, India University of Hyderabad, India Ho Chi Minh City University of Technology, Vietnam Institute of Mathematical Sciences Chennai, India IIIT Bangalore, India IIT Kharagpur, India IIT Kharagpur, India TCS Research and Innovation, India Tezpur University, India IIT Patna, India IIT Guwahati, India
Additional Reviewers Hrishav Bakul Barua Gaurav Bhardwaj Hieu Vo Dinh Barun Gorain Aayush Grover Aditya Gulati Abhay Jain Karishma Karishma Rakesh Matam Kaushik Mondal Minh Thuan Nguyen Nam Nguyen Pradip Pramanick
ix
Nemi Chandra Rathore Durga Bhavani S. Muktikanta Sa Dibakar Saha Ashok Sairam Ranbir Sanasam Sinchan Sengupta Roohani Sharma S. P. Suresh Ha Nguyen Tien Trung Tran Gopalakrishnan Venkatesh
Contents
Invited Talks The Bloom Clock for Causality Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . Anshuman Misra and Ajay D. Kshemkalyani
3
Model Development in the Tool USE: Explorative, Consolidating and Analytic Steps for UML and OCL Models . . . . . . . . . . . . . . . . . . . . . . Martin Gogolla
24
ReLink: Open Information Extraction by Linking Phrases and Its Applications. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xuan-Chien Tran and Le-Minh Nguyen
44
Cloud Computing and Networks Energy-Efficient Scheduling of Deadline-Sensitive and Budget-Constrained Workflows in the Cloud. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Anurina Tarafdar, Kamalesh Karmakar, Sunirmal Khatua, and Rajib K. Das An Efficient Renewable Energy-Based Scheduling Algorithm for Cloud Computing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sanjib Kumar Nayak, Sanjaya Kumar Panda, Satyabrata Das, and Sohan Kumar Pande A Revenue-Based Service Management Algorithm for Vehicular Cloud Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sohan Kumar Pande, Sanjaya Kumar Panda, and Satyabrata Das Interference Reduction in Directional Wireless Networks . . . . . . . . . . . . . . . Manjanna Basappa and Sudeepta Mishra
65
81
98 114
Distributed Algorithms, Concurrency and Parallelism Automated Deadlock Detection for Large Java Libraries . . . . . . . . . . . . . . . R. Rajesh Kumar, Vivek Shanbhag, and K. V. Dinesha DNet: An Efficient Privacy-Preserving Distributed Learning Framework for Healthcare Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Parth Parag Kulkarni, Harsh Kasyap, and Somanath Tripathy
129
145
xii
Contents
Memory Optimized Dynamic Matrix Chain Multiplication Using Shared Memory in GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Girish Biswas and Nandini Mukherjee
160
Graph Algorithms and Security Parameterized Complexity of Defensive and Offensive Alliances in Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ajinkya Gaikwad, Soumen Maity, and Shuvam Kant Tripathi A Reconstructive Model for Identifying the Global Spread in a Pandemic . . . Debasish Pattanayak, Dibakar Saha, Debarati Mitra, and Partha Sarathi Mandal Cost Effective Method for Ransomware Detection: An Ensemble Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Parthajit Borah, Dhruba K. Bhattacharyya, and J. K. Kalita
175 188
203
Social Networks and Machine Learning Exploring Alzheimer’s Disease Network Using Social Network Analysis . . . . Swati Katiyar, T. Sobha Rani, and S. Durga Bhavani
223
Stroke Prediction Using Machine Learning in a Distributed Environment . . . . Maihul Rajora, Mansi Rathod, and Nenavath Srinivas Naik
238
Automated Diagnosis of Breast Cancer with RoI Detection Using YOLO and Heuristics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ananya Bal, Meenakshi Das, Shashank Mouli Satapathy, Madhusmita Jena, and Subha Kanta Das
253
Short Papers An Efficient Approach for Event Prediction Using Collaborative Distance Score of Communities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B. S. A. S. Rajita, Bipin Sai Narwa, and Subhrakanta Panda
271
A Distributed System for Optimal Scale Feature Extraction and Semantic Classification of Large-Scale Airborne LiDAR Point Clouds . . . . . . . . . . . . Satendra Singh and Jaya Sreevalsan-Nair
280
Load Balancing Approach for a MapReduce Job Running on a Heterogeneous Hadoop Cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kamalakant Laxman Bawankule, Rupesh Kumar Dewang, and Anil Kumar Singh
289
Contents
xiii
Study the Significance of ML-ELM Using Combined PageRank and Content-Based Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Rajendra Kumar Roul and Jajati Keshari Sahoo
299
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
309
Invited Talks
The Bloom Clock for Causality Testing Anshuman Misra and Ajay D. Kshemkalyani(B) University of Illinois at Chicago, Chicago, IL 60607, USA {amisra7,ajay}@uic.edu Abstract. Testing for causality between events in distributed executions is a fundamental problem. Vector clocks solve this problem but do not scale well. The probabilistic Bloom clock can determine causality between events with lower space, time, and message-space overhead than vector clock; however, predictions suffer from false positives. We give the protocol for the Bloom clock based on Counting Bloom filters and study its properties including the probabilities of a positive outcome and a false positive. We show the results of extensive experiments to determine how these above probabilities vary as a function of the Bloom timestamps of the two events being tested, and to determine the accuracy, precision, and false positive rate of a slice of the execution containing events in the temporal proximity of each other. Based on these experiments, we make recommendations for the setting of the Bloom clock parameters. We postulate the causality spread hypothesis from the application’s perspective to indicate whether Bloom clocks will be suitable for correct predictions with high confidence. The Bloom clock design can serve as a viable space-, time-, and message-space-efficient alternative to vector clocks if false positives can be tolerated by an application. Keywords: Causality · Vector clock · Bloom clock · Bloom filter Partial order · Distributed system · False positive · Performance
1 1.1
·
Introduction Background and Motivation
Determining causality between pairs of events in a distributed execution is useful to many applications [9,17]. This problem can be solved using vector clocks [5,11]. However, vector clocks do not scale well. Several works attempted to reduce the size of vector clocks [6,12,18,20], but they had to make some compromises in accuracy or alter the system model, and in the worst-case, were as lengthy as vector clocks. A survey of such works is included in [8]. The Bloom filter, proposed in 1970, is a space-efficient probabilistic data structure that supports set membership queries [1]. The Bloom filter is widely used in computer science. Surveys of the variants of Bloom filters and their applications in networks and distributed systems are given in [2,19]. Bloom filters provide space savings, but suffer from false positives although there are no false negatives. The confidence in the prediction by a Bloom filter depends on the c Springer Nature Switzerland AG 2021 D. Goswami and T. A. Hoang (Eds.): ICDCIT 2021, LNCS 12582, pp. 3–23, 2021. https://doi.org/10.1007/978-3-030-65621-8_1
4
A. Misra and A. D. Kshemkalyani
size of the filter (m), the number of hash functions used in the filter (k), and the number of elements added to the set (q). The use of the Bloom filter as a Bloom clock to determine causality between events was suggested [16], where, like Bloom filters, the Bloom clock will inherit false positives. The Bloom clock and its protocol based on Counting Bloom filters, which can be significantly more space-, time-, and message-space-efficient than vector clocks, was given in [7]. The expressions for the probabilities of a positive outcome and of a false positive as a function of the corresponding vector clocks, as well as their estimates as a function of the Bloom clocks were then formulated [7]. Properties of the Bloom clock were also studied in [7], which then derived expressions to estimate the accuracy, precision, and the false positive rate for a slice of the execution using the events’ Bloom timestamps. 1.2
Contributions
In this paper, we first give the Bloom clock protocol and discuss its properties. We examine the expressions for the probability of a positive and of a false positive in detecting causality, and discuss their trends as the distance between the pair of events varies. We then show the results of our experiments to: 1. analyze in terms of Bloom timestamps how the probability of a positive and the probability of a false positive vary as the distance between a pair of events varies; 2. analyze the accuracy, precision, and the false positive rate for a slice of the execution that is representative of events that are close to each other. The parameters varied are: number of processes n, size of Bloom clock m, number of hash functions k, probability of a timestamped event being an internal event pri , and temporal proximity between the two events being tested for causality. Based on our experiments, we 1. analyze the nature of false positive predictions, 2. make recommendations for settings of m and k, 3. state conditions and analyze dependencies on the parameters (e.g., n, pri ) under which Bloom clocks make correct predictions with high confidence (high accuracy, precision, and low false positive rate), and 4. generalize the above results and state a general principle (the causality spread hypothesis) based on the degree of causality in the application execution, which indicates whether Bloom clocks can make correct predictions with high confidence. Thus our results and recommendations can be used by an application developer to decide whether and how the application can benefit from the use of Bloom clocks. Roadmap: Section 2 gives the system model. Section 3 details the Bloom clock protocol. Section 4 studies properties of the Bloom clock, discusses ways to
The Bloom Clock for Causality Testing
5
estimate the probabilities of a positive outcome and of a false positive, and predicts the trends of these probability functions as the temporal proximity between the events increases. Section 5 gives our experiments for the complete graph and analyzes the results. Section 6 gives our experiments for the star graph (client-server configuration) and analyzes the results. Section 7 summarizes the observations of the experiments and discusses the conditions under which Bloom clocks are advantageous to use. It also postulates the causality spread hypothesis and validates it. Section 8 concludes.
2
System Model
A distributed system is modeled as an undirected graph (N , L), where N is the set of processes and L is the set of links connecting them. Let n = |N |. Between any two processes, there may be at most one logical channel over which the two processes communicate asynchronously. A logical channel from Pi to Pj is formed by paths over links in L. We do not assume FIFO logical channels. The execution of process Pi produces a sequence of events Ei = e0i , e1i , e2i , · · ·, where eji is the j th event at process Pi . An event at a process can be an internal event, a message send event, or a message receive event. Let E = i∈N {e | e ∈ Ei } denote the set of events in a distributed execution. The causal precedence relation between events, defined by Lamport’s “happened before” relation [10], and denoted as →, induces an irreflexive partial order (E, →). Mattern [11] and Fidge [5] designed the vector clock which assigns a vector V to each event such that: e → f ⇐⇒ Ve < Vf . The vector clock is a fundamental tool to characterize causality in distributed executions [9,17]. Each process needs to maintain a vector V of size n to represent the local vector clock. Charron-Bost has shown that to capture the partial order (E, →), the size of the vector clock is the dimension of the partial order [3], which is bounded by the size of the system, n. Unfortunately, this does not scale well to large systems.
3
The Bloom Clock Protocol
The Bloom clock is based on the Counting Bloom filter. Each process Pi maintains a Bloom clock B(i) which is a vector B(i)[1, . . . , m] of integers, where m < n. The Bloom clock is operated as shown in Algorithm 1. To try to uniquely update B(i) on a tick for event exi , k random hash functions are used to hash (i, x), each of which maps to one of the m indices in B(i). Each of the k indices mapped to is incremented in B(i); this probabilistically tries to make the resulting B(i) unique. As m < n, this gives a space, time, and message-space savings over the vector clock. We would like to point out that the scalar clock [10] can be thought of as a Bloom clock with m = 1 and k = 1. The Bloom timestamp of an event e is denoted Be . Let V and B denote the sets of vector timestamps and Bloom timestamps of events. The standard vector comparison operators collect(p|p.status)->asSet()->size() [noSelfInvitation] not(Profile.allInstances-> exists(p | p.friendshipR.invitee->includes(p))) The first Integer term statusSize counts the number of occurring Friendship status values in the respective object model. The second Boolean term checks whether no self invitation is present. Figure 9 shows six different generated object models (in the order the MV generates them) and the result of evaluating the classifying terms. Each of the object models has a unique value pair for the two classifying terms. Classifying terms help analyzing the UML and OCL model by systematically constructing object models with given properties that are determined by the classifying terms.
6
Related Work
The transformation of UML and OCL into formal specifications for validation and verification is a widely considered topic. In [35], a translation from UML to UML-B is presented und used for the validation and verification of models, focusing on consistency and checking safety properties. The approach in [4] presents a translation of UML and OCL into first-order predicate logic to reason about models utilizing theorem provers. Similarly, OCL2MSFOL, a tool recently introduced in [12], can automatically reason about UML/OCL models through a mapping from UML/OCL to many-sorted first-order logic. The tool can check satisfiability of OCL invariants by applying SMT solvers. There are also other tools that validate model instances against UML and OCL constraints directly, like DresdenOCL [13]. Another similar tool is UML-RSDS [27], which allows for the validation of UML class diagrams. Several approaches rely on different technological cornerstones like logic programming and constraint solving [10], relational logic and Alloy [2], term rewriting with Maude [32] or graph transformations [14]. In contrast to the tool used in this work, which is based on the transformation of UML and OCL into relational logic [26], these approaches either do not support full OCL (e.g., higher-order associations [2] or recursive
Explorative, Consolidating and Analytic Development Steps in USE
39
operation definitions [10] are not supported) or do not facilitate full OCL syntax checks [32]. Also, the feature to automatically scroll through several valid object models from one verification task is not possible in all of the above approaches. (Semi)-automatic proving approaches for UML class properties have been put forward on the basis of description logics [31], on the basis of relational logic and pure Alloy [2] using a subset of OCL, and in [36] focusing on model inconsistencies by employing Kodkod. A classification of model checkers with respect to verification tasks can be found in [15]. Verification tools use such transformations to reason about models and verify test objectives. UMLtoCSP [9] is able to automatically check correctness properties for UML class diagrams enhanced with OCL constraints based on Constraint Logic Programming. The approach operates on a bounded search space similar to the model validator. In [2], UML2Alloy is presented. A transformation of UML and OCL into Alloy [25] is used to be able to automatically test models for consistency with the help of the Alloy Analizer. Another approach based on Alloy is presented in [28]. In particular, limitations of the previous transformation are eliminated by introducing new Alloy constructs to allow for a transformation of more UML features, e.g., multiple inheritance. In [38], OCL expressions are transformed into graph constraints and instance validation is performed by checking models against the graph constraints. Additionally, in [8], a transformation of OCL pre- and postconditions is presented for graph transformations. The work in [5] describes an approach for test generation based on a transformation of UML and OCL into higher-order logic (HOL). With the HOL-TestGen tool, test cases (model instances) are generated and validated. In [30], a transformation of UML and OCL into first-order logic is described and test methods for models are shown, e.g., class liveliness (consistency) and integrity of invariants (constraint independence). A different approach is presented in [11]. The authors suggest to use Alloy for the early modeling phase of development due to its better suitability for validation and verification. Additionally, FOML, an F-logic based language, is introduced in [3] as an approach for modeling, analyzing and reasoning about models. UML together with OCL have been successfully used for system modeling in numerous industrial and academic projects. Here, we refer to only three example projects trying to indicate the wide spectrum of application options. In our own early work [39], we have specified safety properties of a train system in the context of the well-known BART case study (Bay Area Rapid Transit, San Fransisco). In [1], central aspects of an industrial video conferencing system developed by Cisco have been studied. In [29], UML and OCL are employed for the specification of the UML itself by introducing the so-called UML metamodel in which fundamental well-formedness rules of UML are expressed as OCL constraints. Finally, the USE model validator is to a certain degree the successor of ASSL (A Snapshot Specification Language) [16]. ASSL allows the specification of generation procedures for objects and links of each class and association. ASSL searches for a valid system state by iterating through all combinations defined by the procedures. In comparison, the USE model validator translates
40
M. Gogolla
all model constraints into a SAT formula allowing for a more efficient generation of a system state, due to detecting bad combinations earlier. However, use cases like invariant independence have been discussed employing ASSL in earlier work [17]. But ASSL has, in comparison to the model validator, the advantage that Strings can be handled as composed entities and operations like substring() can be used, and not only simple checks for equality. The substring() operation is not supported in the more efficient model validator as Strings are atomic there.
7
Conclusion and Future Work
Our aim in this contribution was to demonstrate that the design tool USE for UML and OCL models can be employed in the software development process in a flexible manner for a number of tasks, namely explorative, consolidating, and analytic tasks. These tasks have to be combined in an adjustable way that meet stakeholder requirements, in particular for customers and developers. Although USE has reached some degree of maturity, more work remains to be done. Systematically keeping track of model development steps, model versioning and carrying over test scenarios from early to later development phases is needed. The support for transformation into executable models and programs must be improved. Enhancing validation and verification options in the model validator would offer more possibilities. Improving and harmonizing the user interface in various corners of USE, e.g., between GUI and CLI, is necessary. Last but not least, assistance for imperfect, partial models as well as for runtime models and temporal and deontic constraints, as has already been initiated with first steps, should be worked on. Acknowledgments. USE is not the result of the work of the author, but of numerous colleagues that have contributed significantly to its success (mentioned in order of appearance): Mark Richters, J¨ orn Bohling, Paul Ziemann, Mirco Kuhlmann, Lars Hamann, Duc-Hanh Dang, Frank Hilken, Khanh-Hoang Doan, Nisha Desai. Furthermore numerous students have added value through their final theses. Thank you very much. It was a pleasure for the author to work with you.
References 1. Ali, S., Iqbal, M.Z.Z., Arcuri, A., Briand, L.: A search-based OCL constraint solver for model-based test data generation. In: N´ un ˜ez, M., Hierons, R.M., Merayo, M.G. (eds.) Proceedings of 11th International Conference on Quality Software QSIC, pp. 41–50. IEEE (2011) 2. Anastasakis, K., Bordbar, B., Georg, G., Ray, I.: On challenges of model transformation from UML to Alloy. Softw. Syst. Model. 9(1), 69–86 (2010). https://doi. org/10.1007/s10270-008-0110-3 3. Balaban, M., Kifer, M.: Logic-based model-level software development with FOML. In: Whittle, J., Clark, T., K¨ uhne, T. (eds.) MODELS 2011. LNCS, vol. 6981, pp. 517–532. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3642-24485-8 38
Explorative, Consolidating and Analytic Development Steps in USE
41
4. Beckert, B., Keller, U., Schmitt, P.: Translating the object constraint language into first-order predicate logic. In: Proceedings of 2nd Verification WS: VERIFY, vol. 2, pp. 2–7 (2002) 5. Brucker, A.D., Krieger, M.P., Longuet, D., Wolff, B.: A specification-based test case generation method for UML/OCL. In: Dingel, J., Solberg, A. (eds.) MODELS 2010. LNCS, vol. 6627, pp. 334–348. Springer, Heidelberg (2011). https://doi.org/ 10.1007/978-3-642-21210-9 33 6. Br¨ uning, J., Gogolla, M., Hamann, L., Kuhlmann, M.: Evaluating and debugging OCL expressions in UML models. In: Brucker, A.D., Julliand, J. (eds.) TAP 2012. LNCS, vol. 7305, pp. 156–162. Springer, Heidelberg (2012). https://doi.org/10. 1007/978-3-642-30473-6 13 7. B¨ uttner, F., Gogolla, M.: On OCL-based imperative languages. J. Sci. Comput. Program. 92, 162–178 (2014) 8. Cabot, J., Claris´ o, R., Guerra, E., de Lara, J.: Synthesis of OCL pre-conditions for graph transformation rules. In: Tratt, L., Gogolla, M. (eds.) ICMT 2010. LNCS, vol. 6142, pp. 45–60. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3642-13688-7 4 9. Cabot, J., Claris´ o, R., Riera, D.: On the verification of UML/OCL class diagrams using constraint programming. J. Syst. Softw. 93, 1–23 (2014) 10. Cabot, J., Claris´ o, R., Riera, D.: UMLtoCSP: a tool for the formal verification of UML/OCL models using constraint programming. In: Proceedings of ASE 2007, pp. 547–548 (2007) 11. Cunha, A., Garis, A.G., Riesco, D.: Translating between alloy specifications and UML class diagrams annotated with OCL. SoSyM 14(1), 5–25 (2015). https://doi. org/10.1007/s10270-013-0353-5 12. Dania, C., Clavel, M.: OCL2MSFOL: a mapping to many-sorted first-order logic for efficiently checking the satisfiability of OCL constraints. In: Proceedings of ACM/IEEE 19th International Conference on Model Driven Engineering Languages and Systems, MODELS 2016, pp. 65–75. ACM (2016) 13. Demuth, B., Wilke, C.: Model and object verification by using Dresden OCL. In: Proceedings of Russian-German WS Innovation Information Technologies: Theory and Practice, pp. 687–690 (2009) 14. Ehrig, K., K¨ uster, J.M., Taentzer, G.: Generating instance models from meta models. Softw. Syst. Model. 8, 479–500 (2009). https://doi.org/10.1007/s10270-0080095-y 15. Gabmeyer, S., Brosch, P., Seidl, M.: A classification of model checking-based verification approaches for software models. In: Proceedings of the 1st VOLT Workshop (2013) 16. Gogolla, M., Bohling, J., Richters, M.: Validating UML and OCL models in USE by automatic snapshot generation. Softw. Syst. Model. 4(4), 386–398 (2005). https:// doi.org/10.1007/s10270-005-0089-y 17. Gogolla, M., Kuhlmann, M., Hamann, L.: Consistency, independence and consequences in UML and OCL models. In: Dubois, C. (ed.) TAP 2009. LNCS, vol. 5668, pp. 90–104. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-64202949-3 8 18. Gogolla, M., B¨ uttner, F., Richters, M.: USE: a UML-based specification environment for validating UML and OCL. J. Sci. Comput. Program. 69, 27–34 (2007)
42
M. Gogolla
19. Gogolla, M., Hamann, L., Hilken, F., Sedlmeier, M.: Modeling behavior with interaction diagrams in a UML and OCL tool. In: Roubtsova, E., McNeile, A., Kindler, E., Gerth, C. (eds.) Behavior Modeling – Foundations and Applications. LNCS, vol. 6368, pp. 31–58. Springer, Cham (2015). https://doi.org/10.1007/978-3-31921912-7 2 20. Gogolla, M., Hamann, L., Xu, J., Zhang, J.: Exploring (Meta-)model snapshots by combining visual and textual techniques. In: Gadducci, F., Mariani, L. (eds.) Proceedings of Workshop Graph Transformation and Visual Modeling Techniques (GTVMT 2011), ECEASST, Electronic Communications (2011). https://journal. ub.tu-berlin.de/eceasst/issue/view/95 21. Gogolla, M., Havakili, H., Schipke, C.: Advanced features for model visualization in the UML and OCL tool USE. In: Michael, J., et al. (eds.) Companion Proceedings Modellierung 2020, CEUR, vol. 2542, pp. 203–207. CEUR-WS.org (2020) 22. Gogolla, M., Hilken, F., Doan, K.H.: Achieving model quality through model validation, verification and exploration. J. Comput. Lang. Syst. Struct. 54, 474–511 (2018) 23. Hamann, L., Hofrichter, O., Gogolla, M.: Towards integrated structure and behavior modeling with OCL. In: France, R., Kazmeier, J., Breu, R., Atkinson, C. (eds.) Proceedings of 15th International Conference on Model Driven Engineering Languages and Systems (MoDELS 2012), LNCS 7590, pp. 235–251. Springer, Berlin (2012) 24. Hilken, F., Gogolla, M., Burgueno, L., Vallecillo, A.: Testing models and model transformations using classifying terms. J. Softw. Syst. Model. 17(3), 885–912 (2018). https://doi.org/10.1007/s10270-016-0568-3 25. Jackson, D.: Software Abstractions - Logic, Language, and Analysis. MIT Press, Cambridge (2006) 26. Kuhlmann, M., Gogolla, M.: From UML and OCL to relational logic and back. In: France, R.B., Kazmeier, J., Breu, R., Atkinson, C. (eds.) MODELS 2012. LNCS, vol. 7590, pp. 415–431. Springer, Heidelberg (2012). https://doi.org/10.1007/9783-642-33666-9 27 27. Lano, K., Kolahdouz-Rahimi, S.: Specification and verification of model transformations using UML-RSDS. In: M´ery, D., Merz, S. (eds.) IFM 2010. LNCS, vol. 6396, pp. 199–214. Springer, Heidelberg (2010). https://doi.org/10.1007/9783-642-16265-7 15 28. Maoz, S., Ringert, J.O., Rumpe, B.: CD2Alloy: Class diagrams analysis using alloy revisited. In: Whittle, J., Clark, T., K¨ uhne, T. (eds.) MODELS 2011. LNCS, vol. 6981, pp. 592–607. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3642-24485-8 44 29. OMG - Object Management Group: Unified Modeling Language Specification, Version 2.5, June 2015 30. Queralt, A., Teniente, E.: Reasoning on UML class diagrams with OCL constraints. In: Embley, D.W., Oliv´e, A., Ram, S. (eds.) ER 2006. LNCS, vol. 4215, pp. 497–512. Springer, Heidelberg (2006). https://doi.org/10.1007/11901181 37 31. Queralt, A., Artale, A., Calvanese, D., Teniente, E.: OCL-Lite: finite reasoning on UML/OCL conceptual schemas. Data Knowl. Eng. 73, 1–22 (2012) 32. Rold´ an, M., Dur´ an, F.: Dynamic validation of OCL constraints with mOdCL. ECEASST 44 (2011) 33. Rumbaugh, J., Jacobson, I., Booch, G.: The Unified Modeling Language Reference Manual, 2nd edn. Addison-Wesley, Boston (2004) 34. Selic, B.: The pragmatics of model-driven development. IEEE Softw. 20(5), 19–25 (2003)
Explorative, Consolidating and Analytic Development Steps in USE
43
35. Snook, C., Savicks, V., Butler, M.: Verification of UML models by translation to UML-B. In: Aichernig, B.K., de Boer, F.S., Bonsangue, M.M. (eds.) FMCO 2010. LNCS, vol. 6957, pp. 251–266. Springer, Heidelberg (2011). https://doi.org/10. 1007/978-3-642-25271-6 13 36. Straeten, R.V.D., Puissant, J.P., Mens, T.: Assessing the kodkod model finder for resolving model inconsistencies. In: ECMFA, pp. 69–84 (2011) 37. Warmer, J., Kleppe, A.: The Object Constraint Language: Getting Your Models Ready for MDA, 2nd edn. Addison-Wesley, Boston (2004) 38. Winkelmann, J., Taentzer, G., Ehrig, K., K¨ uster, J.M.: Translation of restricted OCL constraints into graph constraints for generating meta model instances by graph grammars. ENTCS 211, 159–170 (2008) 39. Ziemann, P., Gogolla, M.: Validating OCL specifications with the USE tool: an example based on the BART case study. ENTCS 80, 157–169 (2003)
ReLink: Open Information Extraction by Linking Phrases and Its Applications Xuan-Chien Tran and Le-Minh Nguyen(B) Japan Advanced Institute of Science and Technology, Nomi, Japan [email protected], [email protected] Abstract. Recently, many Open IE systems have been developed based on using deep linguistic features such as dependency-parse features to overcome the limitations presented in older Open IE systems which use only shallow information like part-of-speech or chunking. Even though these newer systems have some clear advantages in their extractions, they also possess several issues which do not exist in old systems. In this paper, we analyze the outputs from several popular Open IE systems to find out their strength and weaknesses. Then we introduce ReLink, a novel Open IE system for extracting binary relations from open-domain text. Its working model is based on identifying correct phrases and linking them in the most proper way to reflect their relationship in a sentence. After establishing connections, it can easily extract relations by using several pre-defined patterns. Despite using only shallow linguistic features for extraction, it does not have the same weakness that existed in older systems, and it can also avoid many similar issues arising in recent Open IE systems. Our experiments show that ReLink achieves larger Area Under Precision-Recall Curve compared with ReVerb and Ollie, two well-known Open IE systems.
Keywords: Open information extraction
1
· Relink
Introduction
Since the COVID-19 pandemic, the number of documents written about it has become explosive and overwhelming. The problem of extraction information about this issue becomes very important. Information extraction (IE) is a task to extract required information from unstructured data such as raw text. The extracted information can be events, facts, entities, or relationship between entities in the text. These extracted data enable computers to perform computation or logical inference on it, a difficult task if we just work with raw text. Traditional methods for IE are based on pre-defined target relations and they usually work on a specific domain [10,21]. Because of this, these IE methods do not scale very well on an open-domain and large corpora such as Web text. Open IE is proposed to solve this problem. It provides a different way of extraction in which all potential relations are extracted without requiring any target relations or human input [1]. Therefore it can work and scale really well c Springer Nature Switzerland AG 2021 D. Goswami and T. A. Hoang (Eds.): ICDCIT 2021, LNCS 12582, pp. 44–62, 2021. https://doi.org/10.1007/978-3-030-65621-8_3
ReLink: Open Information Extraction
45
with open-domain corpora, especially Web text. Open IE provides a simple way to transform unstructured data into a relational data which is used to support other tasks like question-answering system [7,15], text summarization [3,11,18], textual entailment [2,17], or some semantic tasks [16,19]. Many Open IE systems focus on extracting binary relations from the text. Those binary relations are in the form (argument 1 ; relation phrase; argument 2 ) in which argument 1 and argument 2 are two noun phrases and relation phrase represents the relationship between those noun phrases in the sentence. For example, given the sentence “Barrack Obama was born in the United States”, we want to extract the following relation: (Barrack Obama; was born in; the United States) In this relation, “Barrack Obama” and “the United States” are two entities in this sentence and “was born in” describes their relationship. A relation can also be called as a triple or tuple, and the relation phrase can be called predicate. Open IE systems can be categorized into two groups depending on what information they use for extraction. The first group includes systems like ReVerb [6] or TextRunner [1] in which they use only shallow linguistic features (POS tags or chunks) to extract relations. These features have flat structure and thus prevent the above systems from capturing long-range dependencies between words or phrases in a sentence. The second group includes systems like DepOE [9], Ollie [14] or StanfordOpenIE [8]. These systems utilize deep linguistic features of a sentence for extractions. Specifically, they use dependency-parse features to resolve the issues of long-range dependencies that existed in the first group’s system. As a result, these systems were reported to extract relations more accurately. Interestingly, we found that using deeper linguistic features also has a limitation. Systems which tightly depend on the deep parsing features can produce incorrect relations due to a small error on the parsing output. Table 1 gives an example of this situation. ReVerb uses only shallow linguistic information but it correctly identifies two relations meanwhile Ollie only discovers one and StanfordOpenIE yields no extractions. This raises the question of whether we can avoid using deep linguistic features but still manage to achieve long-range dependencies. In this paper, we introduce a novel system called ReLink to answer this question. In particular, our system uses shallow linguistic information but can deal with the long-range dependencies. This allows us to overcome several issues that existed in ReVerb and at the same time avoid the issues caused by bad dependency-parsing output. The rest of this paper is organized as follows. Section 2 presents some related work in this research topic. Section 3 introduces the method we use to deal with two important elements in extracting a relation: Verb Phrases and Noun Phrases. Section 4 describes in details ReLink, our proposed Open IE system. We present our experimental results in Sect. 5 and finally, we give a conclusion and some words about our future work in Sect. 6.
46
X.-C. Tran and L.-M. Nguyen
Table 1. Extractions of ReVerb, Ollie and StanfordOpenIE for the sentence “Einstein, who was born in Germany, is a scientist.” Einstein, who was born in Germany, is a scientist. ReVerb
(Einstein; was born in; Germany) (Einstein; is; a scientist)
Ollie
(Einstein; is; a scientist)
StanfordOpenIE No extractions
2
Related Work
The first Open IE system is TextRunner [1]. This system uses a self-trained classifier to decide when to extract the relationship between two noun phrases. Its features for the classifier are extracted from the POS tags and chunks. When comparing with another IE system called KnowItAll [5], TextRunner achieved competitive accuracy with lower error rate. Later, a new Open IE called WOE [22] is proposed which dramatically improved precision and recall comparing to TextRunner. Its idea is to use information from Wikipedia to train the extractor. It also supports using dependency-parse features to further improve the system performance. Fader et.al. [6] observed several issues presented in TextRunner and WOE, and proposed a system called ReVerb to solve these issues. The authors listed out two main types of incorrect relations extracted by previous systems: incoherent (a relation has no meaningful interpretation) and uninformative (a relation misses critical information). Hence, they defined two types of constraints to reduce these errors: syntactic constraint and lexical constraint. Syntactic constraint is basically a token-based regular expression used to capture the correct predicate appearing in the sentence. It requires the relation phrase to be a contiguous sequence or words, start with a verb and end with a preposition. Lexical constraint is built from a large dictionary, and it is used to filter out overspecific relations. The main problem of ReVerb is that it is unable to capture the long-distance relationship between predicate and arguments. Most of the incorrect extractions from ReVerb are due to its wrong argument identification. Its argument-finding heuristics often return the closest noun phrase on the left side of the predicate as Argument 1 and therefore unable to output a correct triple if arguments and predicate are far apart. Thus many researchers started using a richer linguistic information to overcome this limitation. Ollie [14], which is considered as the next version of ReVerb, resolves ReVerb’s issue by supporting a wider syntactic range via its open pattern templates. These templates are applied directly on the dependency-parsing output to find valid relations. Ollie also has the ability to provide context information for a relation and extract relations mediated by nouns or adjectives. Another Open IE system which uses dependency parser for extraction is DepOE [9]. This system is multilingual and achieves better performance than ReVerb in their evaluation. It relies on DepPattern, a multilingual dependency
ReLink: Open Information Extraction
47
parser1 , to parse a sentence and then apply a list of pre-defined rules on the parse output to extract relations. Some recent Open IE systems take another further step in using the dependency-parsing output for their extraction, that is they divide the sentences into multiple clauses before extraction. A system called ClauseIE [4] follows this approach. From dependency-parsing output, the system extract relations by identifying different types of useful clauses. Each clause can result in multiple relations. This system achieved higher precision than previous extractors in all of their experiment datasets. Lastly, a new clause-based Open IE system is also implemented as a part of Stanford CoreNLP [13], which leverages the linguistic structure of the sentence for extraction[8]. This system works by first extracting self-contained clauses in a sentence and then running a logic inference on these clauses to find the correct arguments for each triple. They reported a better performance than Ollie on the end-to-end TAC-KBP 2013 Slot Filling task [20]. Our work is partly inherited from ReVerb and inspired by newer systems. We adopt a similar method as ReVerb for identifying phrases in a sentence, but we use a novel linking mechanism to build the relationship between phrases to capture long-range dependencies. We also define a list of patterns, but they are used for extracting relations from the connected phrases, not for identifying clause type like ClauseIE.
3
Verb Phrases and Noun Phrases
In Open IE, one important step is to recognize correct Verb Phrases (VP) and Noun Phrases (NP) in the sentence. In this section, we describe how we adopt the mechanism used in ReVerb and extend it to use in our system. 3.1
Verb Groups
The information of a VP can determine the position of its subject and object in the sentence. Based on this idea, we treat VP differently depending on their POS information. In particular, we categorize VP into four groups: – – – –
A-VP : if their POS tags start with MD, VB, VBZ or VBD. P-VP : if their POS tags start with VBN. G-VP : if their POS tags start with VBG. T-VP : if their POS tags start with TO followed by a VB tag.
Table 2 shows some example sentences with their corresponding POS tags in each verb group. Notice that even though a VP can contain multiple words, we only look at the POS tag of its first word to categorize. The reason for categorizing VP into different groups is because they can affect the way we extract a relation from a sentence. For example, given the sentence: 1
https://gramatica.usc.es/pln/tools/deppattern.html.
48
X.-C. Tran and L.-M. Nguyen Table 2. Some examples of Verb Groups Verb Group Example sentence A-VP
He is walking on the street PRP VBZ VBG IN DT NN
P-VP
Mary likes the photo posted by Peter NNP VBZ DT NN VBN IN NNP
G-VP
People making this statement are rich NNS VBG DT NN VBP JJ
T-VP
I want him to stay here tonight PRP VBP PRP TO VB RB RB
A smart guy living in this house invented a new machine to do the task assigned by his boss. There are four VPs presented in this sentence: “living” is a G-VP, “invented” is an A-VP, “to do” is a T-VP and “assigned” is a P-VP. With G-VP and P-VP, it is likely that their subjects stay right before them in a sentence and therefore we can extract the following relations with high confidence: 0: (A smart guy; be living in; this house) 1: (the task; be assigned by; his boss) This, however, might not be true for an A-VP because “this house” should not be the subject of the verb “invented”, its correct argument should be “A smart guy” as in the following relation: 2: (A smart guy; invented; a new machine) T-VP, on the other hand, cannot take its preceding NP as its first argument and should be combined with other phrases to build a coherent relation like this: 3: (A smart guy; invented a new machine to do; the task) ReVerb is unable to extract any of above relations from the example sentence, its only extraction is an incoherent relation: (this house; invented; a new machine). Ollie can extract relation 1 and 2, but its third relation is a bit controversial: (A smart guy living in this house; invented; a new machine to do the task). In our opinion, arguments should not contain too much information which can be considered as an extra relation. Table 3. Regular expression for recognizing VP. Verb Group Expression A-VP
prefix [MD|VB|VBZ|VBD|VBP] suffix
P-VP
prefix VBN suffix
G-VP
prefix VBG suffix
T-VP
prefix TO prefix VB suffix
prefix
(adv|particle)*
suffix
(adv|particle|vp-chunk)*
ReLink: Open Information Extraction
49
For recognizing these VPs from their POS tags and Chunks, we adopt the token-based regular expression used in ReVerb. Here, each verb group has its own expression as shown in Table 3. As can be seen, these expressions are able to recognize other components such as adverbs or particles around the main verb. It is worth noting that we also utilize chunking information (I-VP tags) to recognize VP because this information is useful in the case some adverbs or verbs are mistagged by the POS tagger but correctly tagged by the chunker. If there are multiple adjacent VPs, we merge all of them into a single VP with verb group is the same as the left-most VP. Besides, our VP expansion also deals with two special cases: 1. If a VP is followed by a preposition and a G-VP, we should merge them into the VP. Some example phrases are “aim at hurting”, “look forward to seeing”. Both of these phrases are considered as a single VP. 2. If a VP is followed by multiple prepositions, we merge all prepositions except the last one into the VP. This is a small correction for the case when POS tagger incorrectly identifies a particle as a preposition. Example phrases are: “come out of ”, “look forward to”. 3.2
Noun Phrases
We can easily get all base NPs from the sentence based on chunking information, but we also want to group related NP together to form a bigger NP if necessary. For example, phrases like “a president of the United States” should be viewed as a single NP even though they are recognized as two NPs by the chunker. For this purpose, we define three cases for grouping multiple NPs into a single NP: 1. An NP followed by a possessive NP. 2. Two NPs separated by the lexical token “of ”. 3. Two NPs separated by the lexical token “and” and: (i) the second NP is not followed by a VP, or (ii) the first NP is right after an SBAR chunk. ReVerb also supports expanding NP by following the first two cases, but we think the third case is also necessary, especially when the chunker is unable to tag them properly as a single phrase. 3.3
Disputed Noun Phrases
On analyzing the output from ReVerb, we have realized that there are two common cases where an NP can be incorrectly identified as an argument of a relation phrase: (i) an NP that stays exactly between two A-VP phrases, or (ii) an NP that follows a PP and stays before an A-VP. We call those NPs as disputed NPs. In Table 4, we give two sentences containing disputed NP and corresponding extractions from ReVerb. In the first sentence, the NP “the man” is between two A-VPs “said” and “had”. ReVerb does not take this into account and blindly extract the first relation even though it is incoherent.
50
X.-C. Tran and L.-M. Nguyen Table 4. Extractions from ReVerb on sentences with disputed NP. He said the man was a criminal. (He; said; the man) (the man; was; a criminal) The students in the classroom are second language learners. (the classroom; are; second language learners)
In the second case, the phrase “the classroom” is a disputed NP because it is between the PP “in” and the A-VP “are”. With this sentence, ReVerb yields the relation “(the classroom; are; second language learners)”. This relation is clearly incoherent because the first argument should be “the students”. In ReLink, we explicitly handle these disputed NPs to let the extractor know which NP should be selected for a certain predicate and thus we can avoid incoherent relations.
4
ReLink
This section describes the algorithm of ReLink for extracting binary relations. First, it starts by chunking and identifying phrases in a sentence using the methods described in Sect. 3. Then it employs a new method for building the relationship between phrases by iterating through each phrase and create left and right connections. Finally, relations are extracted using several pre-defined patterns. Figure 1 shows the steps performed in ReLink to extract relations from an input sentence.
Fig. 1. Procedure of ReLink to extract relations from a sentence.
4.1
Phrase Identification
In this step, we use Apache OpenNLP2 for POS tagging and chunking the sentence. From the chunked sentence, we construct four groups of phrases: VP, NP, PP and O. VP and NP are identified using the methods described in Sect. 3, PP is directly captured from chunking information, and all other tokens are converted to O phrases. The process of phrase identification is illustrated in Fig. 2. As can be seen from the example, we successfully capture the long noun phrase 2
https://opennlp.apache.org/.
ReLink: Open Information Extraction
51
Fig. 2. An example of Phrase Identification for the sentence “Routing smoking is linked to airway inflammation and increased symptoms of chronic bronchitis.”
“airway inflammation and increased symptoms of chronic bronchitis” by grouping several NPs together. This is achieved thanks to the patterns we described in Sect. 3.2. 4.2
Phrase Linking
This is a crucial step to decide which phrases should go together to form a valid extraction. In ReLink, we go through each phrase from left to right and decide which phrases should be linked to reflect the relationship of phrases in the most proper way. Specifically, each phrase is assigned a number of available slots, i.e. the maximum number of connections a phrase can have with other phrases in the sentence. Whenever a connection is established between two phrases, their available slots is decreased 1. A phrase with no available slots can not connect with any other phrases. The value of available slots varies depending on the phrase type as follows: – NP: has 1 available slot by default. In our assumption, an NP is either a subject or an object of a clause but not both. Exceptional cases are handled separately. – VP: has 2 available slots, 1 slot for the connection on the left side and another slot for the connection on the right side. We give 2 slots to a VP based on an assumption that a VP always has 1 subject and 1 object. – PP: has 2 available slots similar to VP. PP here should play a role of connecting its preceding phrase and its succeeding phrase. – O: does not have any available slots. We do not want them to connect with any other phrases. Before iterating through each phrase, we perform two pre-linking tasks to deal with some special cases: 1. We add 1 additional slot to the NPs that are followed by a WH-modifier because they are likely the argument of many relations. 2. We identify and mark disputed NPs as described in Sect. 3.3. These NPs will be processed differently in our algorithm. After this preparation, we start processing each phrase one by one from left to right. NPs will play a passive role in our algorithm, they stay waiting for
52
X.-C. Tran and L.-M. Nguyen
connections from phrases of other types. Because of this, we only focus on how to create connections from VP and PP. Connections from VP. From each VP, we will try to create two connections with other phrases, one connection on its left side and the other connection on the right side. It is simple to create a connection on the right side, we only need to make sure the succeeding phrase is an NP or PP. This is because, in our observation, the phrase follows a VP is usually directly related to that VP. Of course this will be wrong if the succeeding phrase is a disputed NP. This problem will be handled by the VP staying right after the disputed NP when it tries to create a left connection. It is, however, more complicated to create a connection with a phrase on the left side. The detailed algorithm to look for proper phrase and create a left connection is shown in Algorithm 1. First of all, we deal with a special case where the word “and” stays in front of the VP. In that case, the current VP is likely to be an extension of another existing clause. Thus we look for the first VP of the same group and try to connect the left phrase of that VP to the current phrase (using the function getArg1OfVerbGroup). If the previous phrase is not “and”, we will look for candidate phrase depending on the group of the current VP. In case of A-VP, there are three cases as follows: – If its immediate preceding phrase is an available NP, we directly create a connection between it and the current VP. This is actually the normal case in simple sentences like “Donald Trump is the new president of the United States.”, there is no confusion to establish the connection between “is” and “Donald Trump”. – If its previous phrase is not available, we have to search until the beginning of the sentence to find an available NP (using the function getAvailableNP). – If no available NP is found and the previous phrase is a disputed NP, we remove current connections from this disputed NP and connect it with the current VP. The function getAvailableNP starts searching on phrases within the specified range from right to left. An available NP is returned if it has available slots > 0 and satisfies one of the following conditions: – It has a right-side connection with another VP. – It has no right-side connections with another VP but we have not seen any verbs on the way to reach this NP. These conditions are mainly used to prevent an irrelevant NP which is far away from current VP to be selected as a candidate for connection. Lastly, if current VP is not an A-VP and the previous phrase is an NP, we simply increase previous NP’s available slots by 1 and create a connection between this NP and the current VP.
ReLink: Open Information Extraction
53
Algorithm 1: Creating a left connection from a VP. idx is index of the current phrase. if prevPhrase.token == ‘and’ then leftNp = getArg1OfVerbGroup(0, idx-2, phrase.verbGroup) if leftNp != null then leftNp.increaseAvailableSlots() connect(leftNp, phrase) end else if phrase is A-VP then if prevPhrase.isAvailableNP then connect(prevPhrase, phrase) else leftNp = getAvailableNP(0, idx-2) if leftNp != null then connect(leftNp, phrase) else if prevPhrase.isDisputeNP then prevPhrase.removeLeftConnection() connect(prevPhrase, phrase) end end else if prevPhrase.isNP then prevPhrase.increaseAvailableSlots() connect(prevPhrase, phrase) end end
Connections from PP. For a PP node, we simply create connections with its left and right phrases if they are NP. We do not check if the previous phrase is VP because it has already been handled by that VP. Algorithm 2 shows how we create connections with its left and right phrases from a PP. Quote and Comma Characters. When linking phrases, we temporarily ignore O phrases which contain only quote characters. This mean that if immediate preceding phrase of the current phrase is a quote, we simply move one step further to the left to get the correct preceding phrase. The same thing applies for getting the succeeding phrase.
Algorithm 2: Creating left and right connection from a PP. if prevPhrase is NP then prevPhrase.increaseAvailableSlots() connect(prevPhrase, phrase) end if nextPhrase is NP then connect(phrase, nextPhrase) end
54
4.3
X.-C. Tran and L.-M. Nguyen
Relation Extraction
After setting up connections between phrases, the final step is to detect valid verb-based relations and extract their arguments accordingly. Unlike ReVerb in which the relation phrase is extracted before identifying two arguments, in ReLink we extract a relation in the following order: 1. Identify phrases belonging to A-VP, P-VP or G-VP group which have both left and right connection. 2. Extract Argument 1 from its left connection. 3. Extract Predicate and Argument 2 simultaneously from its right connection using predefined patterns. Extracting Argument 1. Because of the way we build connections between phrases, a connection on the left side of a VP (if any) is always an NP and this should be extracted as the Argument 1 of a relation. But we also consider the case when this NP has a connection to another NP through PP, if so then we follow the NP’s right connection to build a full argument. Extracting Predicate and Argument 2. We do not merely use the identified VP as Predicate and extract the closest NP to the right of the relation phrase as Argument 2. Instead, we obtain Predicate and Argument 2 simultaneously because they are mutually related in a relation. For example, in a sentence like “He bought a new car as a present”, we observe that there are two relations: 1: (He; bought; a new car) 2: (He; bought a new car as; a present) The phrase “a new car” can both be Argument 2 and a part of Predicate name in this case. Which parts to include in the Predicate will affect the decision to extract the Argument 2. Therefore we need to be able to recognize them in our connected phrases for extraction. We do this by defining several patterns for matching against the connected phrase list. Whenever we find a list of connected phrases that matches the pattern, we extract them out as a relation. The last NP will be selected as the Argument 2 and all other phrases are grouped to become the Predicate. We combine these with the Argument 1 already extracted previously to have a complete relation.
5
Experiments
We compare ReLink with two well-known systems in different categories: one system uses shallow linguistic feature and one system use deep linguistic feature. – ReVerb: This Open IE system is considered as state-of-the-art in terms of using only shallow linguistic features for extracting relations. Because our system also uses similar information, this is a good baseline for comparison.
ReLink: Open Information Extraction
55
– Ollie: This system extracts relations by using templates learned from seed training data. These templates work against the output of a dependency parser. In this research, we run Ollie using the default Malt parser shipped in its package. For testing, we collected 200 random sentences from news articles of CNN3 . These sentences are fed to each system to get a list of binary relations. Two human judges were asked to independently evaluate each relation as correct or incorrect. We got an agreement of 82.9% on the extractions, agreement score κ = 0.63. We compute precision-recall on the subset of extractions where evaluations from both judges are the same. Similar to the experiments performed in [6], we calculate the total of correct relations using relations marked as correct by both judges. Duplicate relations are treated as a single relation. In ReLink, we do not design a separate confidence score model but instead, we use the logistic regression model available in ReVerb for assigning confidence score. The precision-recall curve is drawn by varying the confidence score from 0 to 1. 5.1
Results
Figure 3 shows the Area Under Precision-Recall Curve (AUC) of each system. As we can see, ReLink gets the highest AUC (≈ 0.554), both Ollie and ReVerb get AUC about 30% smaller than ReLink, 0.375 and 0.324 respectively. It is interesting to see that Ollie does not work very well in our experiment even though it uses dependency parser to support the extraction process. Figure 4 shows Precision-Recall curves of three systems. ReLink achieves a stable precision of nearly 0.8 in almost all levels of recall. It also gets higher precision than ReVerb when recall increases to higher than around 0.08. Ollie curve is quite interesting. It gets the highest precision at the beginning of the curve but starts to decrease as recall increases. When recall reaches around 0.2, Ollie starts to get lower precision than ReLink; and at the end of the curve, Ollie precision drops to lower than ReVerb. An explanation for this is that Ollie depends entirely on the result from the dependency parser to extract the relations. Because of this, only a minor error in the dependency parsing can also confuse the extractor and make it produce uninformative or incoherent relations. ReVerb and ReLink, on the other hand, use only the shallow syntax of the sentence and therefore are not sensitively prone to errors. In English, shallow parsing is also considered more robust than deep parsing [12]. In our view, ReVerb gets low precision because its model is quite simple and cannot handle complex cases in a sentence. This prevents it from detecting valid relations as well as extracting correct arguments from the text. Ollie is able to extract more relations from text but many are incorrect because it relies completely on the result of dependency parsing, and just a small error can also lead to all incorrect relations. With ReLink, even though it uses similar information as ReVerb, its model allows it to deal with multiple sentence forms 3
https://edition.cnn.com.
56
X.-C. Tran and L.-M. Nguyen
Fig. 3. Area Under Precision-Recall Curve
Fig. 4. Precision-Recall curve.
ReLink: Open Information Extraction
57
and produce better relations. For example, given a sentence like “I took a flight from New York”, ReVerb can extract only one relation: (I; took a flight from; New York). However, we believe that the relation (I; took; a flight) also contains important information and should be extracted as well. This is, in fact, similar to the approach taken by newer Open IE systems like Ollie or StanfordOpenIE. In ReLink, we follow the same approach and extract both relations. That is another reason why the precision of ReLink is higher than ReVerb most of the time. Another contribution to high precision in ReLink is the way we handle T-VP in a sentence. As can be seen from our defined patterns, we consider TVP not as an independent relation phrase but as a part of a bigger relation phrase. Our approach led to results different from ReVerb and Ollie. 5.2
Evaluation on Other Datasets
For a fair comparison, we also conducted an experiment on the datasets used for evaluating ClauseIE4 . These datasets consist of: – 500 sentences from ReVerb data. – 200 sentences from Wikipedia. – 200 sentences from New York Times. Each dataset contains the original sentences, the extractions from three systems (ReVerb, Ollie, ClausIE), and the annotated labels. Normally, a common approach will be to re-annotate all these extractions, then compute the precision and recall like we did in the previous experiment. However, this process is time-consuming considering the number of extractions in these datasets. We, therefore, took another approach which utilizes the available annotations for comparison. To be more specific, we fed these sentences to ReLink to get a list of extractions and set the label of each ReLink extraction to be the same with the annotated extraction. For those extractions which do not exist in the original annotation, we ask two human judges from our side to annotate (similar to how we did on the custom dataset). By doing this, we significantly reduce the number of extractions needed to be annotated while still having a fair comparison. Two judges got an agreement of 71% with κ = 0.38. Table 5 shows the statistics of each system based on the original annotated data, and Table 6 shows the result of our annotation on out-of-db extractions. Correct extractions are those which marked as correct by both annotators. We calculate the Precision, Recall and F1 scores for each system with the new annotated data. The final set of correct extractions is the union of correct extractions in the ClausIE data and correct extractions in our out-of-db annotation. We present the results in Table 7 and visually plot the F1 score in Fig. 5. 4
https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/ software/clausie/.
58
X.-C. Tran and L.-M. Nguyen
Table 5. Statistics of extractions from each system on different dataset. #out-of-db column indicates the number of extractions which do not exist in the annotated data. #correct #incorrect #out-of-db ReVerb data ReVerb 384 Ollie
343
556
686
ClausIE 1692
1283
ReLink
231
465
310
Wiki data ReVerb 164
85
Ollie
235
330
ClausIE 597
404
ReLink
72
181
162
NYT data ReVerb 152
119
Ollie
216
281
ClausIE 685
618
ReLink
78
152
174
Table 6. Statistics of out-of-db extractions after annotating. #correct #incorrect ReVerb data 147
163
Wiki data
98
64
NYT data
95
79
As we can see from the results, our system achieves better scores than ReVerb and Ollie on all three datasets. This is consistent with the results we got on the custom data, and it shows the stable performance of ReLink. But ReLink gets lower F1 score than ClausIE even though it achieves higher Precision. This is mainly due to that fact that the number of extractions from ClausIE is significantly more than all other methods, making its Recall is quite high. This shows that the process of splitting the sentence into clauses before extracting the relations in ClausIE is effective in this case. Perhaps breaking the sentence into multiple clauses makes the dependency-parsing error less severe than treating the sentence as a whole. More investigation into this matter can be done in future work.
ReLink: Open Information Extraction
59
Table 7. Precision, Recall and F1 score calculated on the new annotated data. Precision Recall F1 ReVerb data ReVerb 0.53 Ollie
0.13
0.20
0.45
0.18
0.26
ClausIE 0.57
0.56
0.56
ReLink
0.23
0.33
0.16
0.26
0.62 Wiki data
ReVerb 0.66 Ollie
0.42
0.23
0.30
ClausIE 0.60
0.58
0.59
ReLink
0.32
0.44
ReVerb 0.56
0.14
0.22
Ollie
0.43
0.19
0.27
ClausIE 0.53
0.61
0.57
ReLink
0.26
0.37
0.71 NYT data
0.65
Fig. 5. Comparison of F1 score between four OpenIE systems on new annotated data.
5.3
Application of ReLink for COVID-19 Data
As stated at the beginning, when COVID-19 data increase rapidly, current machine learning models could not deal with the change of data. Therefore, unsupervised model as open information extraction is expected. Figure 6 shows the use of our system performing on COVID-19 data. As results, the relations we obtained are useful for users in terms of understanding about MERS-COV and H1N1. In our future work, we would like to exploit our tools for large-scale of COVID-19 data.
60
X.-C. Tran and L.-M. Nguyen arg1
rel
ar g 2
[MERSCoV]GGP
include
[fever]DISEASE , , [chills/rigors]DISEASE [headache]DISEASE , nonproductive [cough]DISEASE
[MERSCoV]GGP
is responsible for causing
lower [respiratory with infections]DISEASE [fever]DISEASE and [cough]DISEASE
Fig. 6. Example of relations presented in COVID-19 papers.
We extend our work to perform an information retrieval system to search either “entities” or “relation” when we extracted from the large-scale of data.
6
Conclusion and Future Work
This paper introduced ReLink - a novel Open IE system which uses only shallow linguistic feature including POS and Chunking for extractions. Our main contributions in this paper are: – We analyzed several notable Open IE systems and showed their drawbacks on extracting relations from free text. – We proposed a simple method to identify VP and NP more accurately, then we proposed a mechanism to extract relations by linking phrases in a sentence. Experimental results showed that our system performed better than ReVerb and Ollie in terms of AUC. The results were also the same on different datasets. – We implemented ReLink and published it on Github for the research community in this field5 . In our future work, we want to firstly explore the possibility of extending this model to extract other types of relations (N-ary relations, relational nouns, etc). In addition, we also consider applying a machine learning technique to deal with the case where some words are incorrectly tagged by Apache OpenNLP. Acknowledgments. This work was supported by JST CREST Grant Number JPMJCR1513 and in part by the Asian Office of Aerospace R&D (AOARD), Air Force Office of Scientific Research (Grant no. FA2386-19-1-4041).
5
https://github.com/linktorepository.
ReLink: Open Information Extraction
61
References 1. Banko, M., Cafarella, M.J., Soderland, S., Broadhead, M., Etzioni, O.: Open information extraction from the web. In: IJCAI, vol. 7, pp. 2670–2676 (2007) 2. Berant, J., Dagan, I., Goldberger, J.: Global learning of typed entailment rules. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 610–619. Association for Computational Linguistics (2011) 3. Christensen, J., Mausam, S.S., Soderland, S., Etzioni, O.: Towards coherent multidocument summarization. In: HLT-NAACL, pp. 1163–1173. Citeseer (2013) 4. Del Corro, L., Gemulla, R.: Clausie: Clause-based open information extraction. In: Proceedings of the 22nd International Conference on World Wide Web, pp. 355–366. ACM (2013) 5. Etzioni, O., et al.: Unsupervised named-entity extraction from the web: an experimental study. Artif. Intell. 165(1), 91–134 (2005) 6. Fader, A., Soderland, S., Etzioni, O.: Identifying relations for open information extraction. In: Proceedings of the Conference of Empirical Methods in Natural Language Processing (EMNLP 2011), Edinburgh, Scotland, UK, 27–31 July 2011 (2011) 7. Fader, A., Zettlemoyer, L., Etzioni, O.: Open question answering over curated and extracted knowledge bases. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1156–1165. ACM (2014) 8. Gabor Angeli, M.J.P., Manning, C.D.: Leveraging linguistic structure for open domain information extraction. In: Proceedings of the Association of Computational Linguistics (ACL) (2015) 9. Gamallo, P., Garcia, M., Fern´ andez-Lanza, S.: Dependency-based open information extraction. In: Proceedings of the Joint Workshop on Unsupervised and SemiSupervised Learning in NLP, pp. 10–18. Association for Computational Linguistics (2012) 10. Kim, J.T., Moldovan, D.I.: Acquisition of semantic patterns for information extraction from corpora. In: Proceedings of Ninth Conference on Artificial Intelligence for Applications, pp. 171–176. IEEE (1993) 11. Li, P., Cai, W., Huang, H.: Weakly supervised natural language processing framework for abstractive multi-document summarization: weakly supervised abstractive multi-document summarization. In: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pp. 1401–1410. ACM (2015) 12. Li, X., Roth, D.: Exploring evidence for shallow parsing. In: Proceedings of the 2001 Workshop on Computational Natural Language Learning, vol. 7, p. 6. Association for Computational Linguistics (2001) 13. Manning, C.D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S.J., McClosky, D.: The Stanford CoreNLP natural language processing toolkit. In: Association for Computational Linguistics (ACL) System Demonstrations, pp. 55–60 (2014). http://www.aclweb.org/anthology/P/P14/P14-5010 14. Mausam, Schmitz, M., Bart, R., Soderland, S., Etzioni, O.: Open language learning for information extraction. In: Proceedings of Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CONLL) (2012)
62
X.-C. Tran and L.-M. Nguyen
15. Reddy, S., Lapata, M., Steedman, M.: Large-scale semantic parsing without question-answer pairs. Trans. Assoc. Comput. Linguist. 2, 377–392 (2014) 16. Ruppert, E.: Unsupervised conceptualization and semantic text indexing for information extraction. In: Sack, H., Blomqvist, E., d’Aquin, M., Ghidini, C., Ponzetto, S.P., Lange, C. (eds.) ESWC 2016. LNCS, vol. 9678, pp. 853–862. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-34129-3 54 17. Schoenmackers, S., Etzioni, O., Weld, D.S., Davis, J.: Learning first-order horn clauses from web text. In: Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pp. 1088–1098. Association for Computational Linguistics (2010) 18. Soderland, J.C.S., Mausam, G.B.: Hierarchical summarization: scaling up multidocument summarization. In: Proceedings of the 52nd Annual Meeting of the Association for Computlational Linguistics, pp. 902–912 (2014) 19. Stanovsky, G., Mausam, I.D.: Open IE as an intermediate structure for semantic tasks (2015) 20. Surdeanu, M.: Overview of the TAC2013 knowledge base population evaluation: English slot filling and temporal slot filling. In: Sixth Text Analysis Conference (2013) 21. Vo, D.T., Bagheri, E.: Open information extraction. arXiv preprint arXiv:1607.02784 (2016) 22. Wu, F., Weld, D.S.: Open information extraction using Wikipedia. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 118–127. Association for Computational Linguistics (2010)
Cloud Computing and Networks
Energy-Efficient Scheduling of Deadline-Sensitive and Budget-Constrained Workflows in the Cloud Anurina Tarafdar(B) , Kamalesh Karmakar, Sunirmal Khatua, and Rajib K. Das Department of Computer Science and Engineering, University of Calcutta, Kolkata, India [email protected], [email protected], [email protected], [email protected]
Abstract. Due to the rapid advancement of Cloud computing, more and more users are running their scientific and business workflow applications in the Cloud. The energy consumption of these workflows is high, which negatively affects the environment and also increases the operational costs of the Cloud providers. Moreover, most of the workflows are associated with budget constraints and deadlines prescribed by Cloud users. Thus, one of the main challenges of workflow scheduling is to make it energy-efficient for Cloud providers. At the same time, it should prevent budget and deadline violations for Cloud users. To address these issues, we consider a heterogeneous Cloud environment and propose an energyefficient scheduling algorithm for deadline-sensitive workflows with budget constraints. Our algorithm ensures that the workflow is scheduled within the budget while reducing energy consumption and deadline violation. It utilizes Dynamic Voltage and Frequency Scaling (DVFS) to adjust the voltage and frequency of the virtual machines (VMs) executing tasks of the workflow. These adjustments help to achieve significant energy savings. Extensive simulation using real-world workflows and comparison with some state-of-art approaches validate the effectiveness of our proposed algorithm. Keywords: Workflow scheduling · Cloud computing Energy-efficiency · Budget constraints · Deadline
1
·
Introduction
A workflow comprises of a set of interdependent tasks executed in the specified order. Workflows are useful tools to model complex scientific and business applications that require large-scale computation and storage. Cloud computing is an in-demand technology in which Cloud Service Providers (CSPs) deliver resources c Springer Nature Switzerland AG 2021 D. Goswami and T. A. Hoang (Eds.): ICDCIT 2021, LNCS 12582, pp. 65–80, 2021. https://doi.org/10.1007/978-3-030-65621-8_4
66
A. Tarafdar et al.
and services to the Cloud users in a pay-per-use manner over the internet. Due to several advantages of Cloud computing, such as elasticity, cost-effectiveness, reliability, numerous workflow applications are being submitted by users for execution in the Cloud. The tasks of these workflows run in the virtual machines (VMs) deployed in the hosts or servers of a Cloud data center. The scheduling of workflow applications in the Cloud is quite challenging. Large-scale virtualized Cloud data centers consume a large amount of energy, and that harms the environment [1]. Moreover, the high energy costs of the data center affect the profit margin of the CSPs. Thus, it is necessary to develop an energy-efficient workflow scheduling approach to reduce the energy consumption of the Cloud. The CSPs offer several types of VMs having different resource capacities and prices, and charge the users in a pay-per-use policy. In a heterogeneous Cloud environment, creating a cost-effective schedule of the workflows is quite challenging. Moreover, many workflow applications are associated with budget constraints and deadlines mentioned by the users. In such a situation, it is critical to schedule the workflow within the prescribed budget and minimize deadline violation at the same time. Although in the recent past, a significant number of research works have investigated workflow scheduling in Cloud, many of them consider a homogeneous environment [6,7,14]. In some research works [2,11], the authors perform scheduling of budget-constrained workflows in a heterogeneous Cloud environment. However, they do not consider energy-efficiency as a scheduling objective. On the other hand, works like [4,13] aim to minimize the energy consumption of workflows having deadline constraints, but do not take into account the monetary cost due to the execution of the workflows. Workflow scheduling in the Cloud must be energy-efficient to promote green computing and reduce the operational costs of the CSPs. At the same time, the scheduling approach must generate a schedule that is within the budget and deadline prescribed by the Cloud user. Consideration of all these aspects simultaneously makes the scheduling problem very challenging. To address this issue, we consider a heterogeneous Cloud environment and propose an energy-efficient scheduling approach for deadlinesensitive workflows with budget constraints. The rest of the paper is organized as follows: Section 2 consists of a brief discussion on the related works, followed by the description of the system models in Sect. 3. Our proposed scheduling algorithm appears in Sect. 4. Section 5 presents the performance evaluation and Sect. 6 concludes the paper with some future directions.
2
Related Work
In the recent past, a large number of research works have focused on workflow scheduling in the Cloud environment. A workflow is scheduled by considering different objectives like- minimizing the makespan, increasing energy efficiency, reducing monetary cost, and so on. Some of the works in the literature have
Energy-Efficient Scheduling
67
scheduled workflow considering a single objective, whereas some others have proposed multi-objective workflow scheduling. In [9], the authors aim to generate a shorter makespan for the workflow through effective scheduling, while in [2], the authors have considered budget satisfaction as their principal objective. On the contrary, Rizvi et al. [11] have proposed a fair scheduling policy that aims to minimize the makespan of the schedule and satisfies the budget constraints at the same time. An evolutionary multi-objective optimization (EMO) technique presented in [15] optimizes cost as well as makespan. Tang et al. [13] have proposed an energy-aware workflow scheduling algorithm based on Dynamic Voltage and Frequency Scaling (DVFS) that tries to schedule the workflow within a given deadline. In [4], the authors have presented a workflow scheduling algorithm that aims to improve resource utilization, reduce energy consumption, and satisfy the given deadline. Li et al. [8] have presented a cost and energy-aware scheduling algorithm for scientific workflows. However, [4,13] do not consider the monetary cost of workflow execution. Works like [2,9,11,15], though consider budget and makespan, do not take into account the energy consumption of the workflows. Unlike the works mentioned above, we propose a scheduling heuristic that ensures the completion of the workflow within the given budget while reducing energy consumption and deadline violation.
3
System Models
In this section, we describe the system models: Cloud data center model, workflow model, and energy model. We also define the budget constraint of the workflow. 3.1
Cloud Data Center Model
We consider a Cloud data center with n heterogeneous VMs, V = {v1 , v2 , . . . , vn }. Each VM vk has some characteristics like processing speed pk represented in million instructions per scond (MIPS), and cost per unit time ck . We assume the data center offers x types of VM, Vtype = {τ1 , τ2 , . . . , τx }. Each VM in V belongs to a particular type in Vtype and VMs of same type have the same characteristics. 3.2
Workflow Model
A workflow application W is represented by a directed acyclic graph G(T, E) where T = {t1 , t2 , . . . , tm } is the set of tasks in the workflow, and E is the set of directed edges between the tasks. Each task ti has length li represented in million instructions(MI). The directed edges of G(T, E) indicate the dependencies between the tasks. If there exists an edge from task ti to task tj , then ti is considered to be an immediate predecessor of tj and tj as an immediate successor of ti . The sets pre(ti ) and suc(ti ) denote all immediate predecessors
68
A. Tarafdar et al.
and all immediate successors of task ti respectively. The data transmission time between two tasks ti and tj where ti is an immediate predecessor of tj is denoted by T Tij . It depends on the size of the data transmitted from ti to tj , and the internal network bandwidth of the data center. T Tij is considered to be zero if ti and tj are assigned to the same VM. Tentry denotes the set of entry tasks of the workflow having no predecessor, whereas Texit is the set of exit tasks of the workflow having no successor. The execution time of task ti when executed on a VM vk is denoted by ET (ti , vk ), and is defined as: ET (ti , vk ) =
li pk
(1)
where li is the length of ti in million instructions(MI) and pk is the processing speed of vk in million instructions per scond (MIPS). The earliest start time EST (ti ), and earliest finish time EF T (ti ) of task ti can be recursively calculated using the following equations where f (ti ) and f (tp ) indicate the index of the VMs on which tasks ti and tp are assigned respectively.
EST (ti ) =
⎧ ⎨0, ⎩ max {EST (tp ) + ET (tp , vf (tp ) ) + T Tpi }, tp ∈pre(ti )
if ti ∈ Tentry otherwise
EF T (ti ) = EST (ti ) + ET (ti , vf (ti ) )
(2)
(3)
The latest start time LST (ti ) and latest finish time LF T (ti ) are similarly e denotes the estimated calculated according to the following equations where Wm e makespan of the workflow and is calculated as: Wm = max {EF T (ti )}. ti ∈Texit
⎧ e ⎨Wm − ET (ti , vf (ti ) ), LST (ti ) = ⎩ min {LST (tj ) − T Tij − ET (ti , vf (ti ) )}, tj ∈suc(ti )
if ti ∈ Texit otherwise
LF T (ti ) = LST (ti ) + ET (ti , vf (ti ) )
(4)
(5)
e The Wm value obtained assuming all tasks are scheduled on VMs of the fastest type can be considered as the minimum makespan of the workflow min . Thus the deadline WD of the workflow can be defined using the following Wm equation: min , (6) W D = α · Wm
where α is the deadline factor specified by the Cloud user, and α ≥ 1.
Energy-Efficient Scheduling
3.3
69
Energy Model
For the energy model used in this paper, we assume that the physical machines (hosts) of the Cloud infrastructure support dynamic voltage and frequency scaling technique (DVFS). Also, each VM is assigned a virtual CPU (vCPU) that corresponds to a single core of a physical machine. Thus a VM vk ∈ V can operate at some distinct CPU frequencies in the range [fkmin , fkmax ] by varying the voltage levels in the range [Vkmin , Vkmax ]. The dynamic power consumption of VM vk is expressed as [13]: Pk = K · (Vkl )2 · fkl ,
(7)
where K is a constant parameter related to the dynamic power, Vkl is the supply voltage of VM vk at level l, and fkl is the CPU frequency of VM vk corresponding to voltage Vkl . The energy consumption E due to execution of a task ti on VM vk is determined as: E = Pk · ET (ti , vk ). Thus the total energy consumption of the workflow W comprising of m tasks is calculated as follows [10]: EW =
m
Pf (ti ) · ET (ti , vf (ti ) ),
(8)
i=1
where f (ti ) is the index of the VM on which task ti is assigned. The processing speed pk of VM vk depends on its voltage and frequency, , pmax ]. plk represents the processing speed of VM and lies in the range [pmin k k l l vk operating at (Vk , fk ) voltage-frequency combination. Though VMs generally operate at their maximum voltage level when they are busy [10], using DVFS, one can dynamically set the voltage and frequency, and thus improve energyefficiency. 3.4
Budget Constraint of the Workflow
The monetary cost of execution of task ti on VM vk is determined as: c(ti , vk ) = ET (ti , vk ) · ck ,
(9)
where ck is the cost per unit time of VM vk . The unit of time can be hour-based or second-based depending upon the CSP and the type of VM. The VM will be charged/billed for an integral unit of time, even if the task completes early. As all VMs are in the same data center, data transfer costs are ignorable. Thus, the total cost of scheduling the workflow is: CW =
m
c(ti , vf (ti ) ),
(10)
i=1
where f (ti ) is the index of the VM on which task ti is assigned. As discussed in Sect. 3.1, the data center offers x types of VM, Vtype = and pmax represent the cost per unit {τ1 , τ2 , . . . , τx }. We consider that cτq , pmin τq τq
70
A. Tarafdar et al.
time, minimum processing speed and maximum processing speed of a VM of type τq . VMs of different types may have different cost per unit time and processing speed ranges. However, all VMs of a particular type have identical cost per unit time and processing speed range. That is, for each VM vk of type τq , the cost , pmax ] is per unit time ck is equal to cτq and the processing speed range [pmin k k min max equal to [pτq , pτq ]. cτq For each VM type τq , we calculate the value pmax . The lowest cost of workflow τq
scheduling Cl is obtained by considering all tasks to be executed on VMs of the cτq . Likewise, the highest cost of workflow type τq that has the least value of pmax τq
scheduling Ch is obtained by selecting VMs of the type τq that has the highest cτq , for every task in the workflow. As in [10], the budget of the value of pmax τq
workflow is represented by Eq. (11) stated below: BuW = Cl + β · (Ch − Cl ),
(11)
where β is the budget factor such that β ∈ [0, 1). The Cloud user sets the budget of the workflow by specifying the budget factor. The budget constraint of the workflow implies that the cost of workflow scheduling must be within the given budget, i.e., CW ≤ BuW .
4
Workflow Scheduling Strategy
While scheduling the tasks of a workflow, our objective is to reduce energy consumption and deadline violation of the workflow and also ensure that the budget constraint is satisfied. Workflow scheduling is NP-hard in nature [10,13]. Thus, we propose a heuristic for effective scheduling of workflow in the Cloud environment. We assume that at a given time, only one task can be executed on a VM to prevent contention for resources. 4.1
Deadline, Priority and Budget of a Task
Before describing our proposed scheduling algorithm, we define the deadline, priority, and budget of a task in the workflow. Assuming each task is scheduled on VMs of the fastest type, we calculate the latest start time LST (ti ) and latest finish time LF T (ti ) of a task ti using Eq. (4) and Eq. (5) respectively. Thereafter, we determine the deadline D(ti ) of task ti by extending its LF T (ti ) proportionately as follows [8]: D(ti ) = α · LF T (ti ),
(12)
where α is the deadline factor specified by the Cloud user. It is logical that if each task is completed within its deadline, then the entire workflow will also be completed within the deadline WD .
Energy-Efficient Scheduling
71
In workflow scheduling, the tasks of a workflow have to be selected one at a time and allocated to suitable VMs. To perform task selection, we assign priority to each task [10]. The priority P r(ti ) of task ti is calculated as: P r(ti ) = ETavg (ti ) +
max {T Tij + P r(tj )},
tj ∈suc(ti )
(13)
where ETavg (ti ) is the average execution time of ti over all types of VMs in the data center. Priority P r(ti ) denotes the length of the longest path from task ti to an exit task. As in [10,11], we divide the budget BuW among all the tasks in the workflow. To determine the budget of each task, we consider the parameter Surplus Budget of Workflow (SBW ). Initially SBW is set to BuW − Cl where Cl denotes the lowest cost of workflow scheduling as discussed in Sect. 3.4. Let cmin (ti ) denote the minimum cost of executing ti . Then, for the first task ti which is scheduled, its budget Bu(ti ) is given by: Bu(ti ) = SBW + cmin (ti ),
(14)
A suitable VM vk has to be selected for task ti such that the cost of execution of ti on vk expressed as c(ti , vk ) lies within Bu(ti ). Once the task ti is scheduled, SBW is updated as: SBW = SBW − (c(ti , vk ) − cmin (ti ))
(15)
In this way, after scheduling every task, we update SBW , and determine the budget of the next task from the new SBW . 4.2
Proposed Approach
Our proposed workflow scheduling approach- Energy-efficient Scheduling of Deadline sensitive Workflow with Budget constraint (ESDWB), has been presented in Algorithm 1. In this Algorithm, each task ti of the workflow is selected priority wise, and assigned to a suitable VM. Since each task ti is scheduled within its budget Bu(ti ), the cost of the workflow remains within its budget BuW . Here AST (ti , vk ) and AF T (ti , vk ) denotes the actual start time and actual finish time of task ti on VM vk respectively. To determine AST (ti , vk ), we introduce a term P ST (ti ) indicating the possible start time of task ti . It is defined below: ⎧ ⎨0, if ti ∈ Tentry P ST (ti ) = ⎩ max {AST (tp , vf (tp ) ) + ET (tp , vf (tp ) ) + T Tpi }, otherwise tp ∈pre(ti )
(16) This implies that it is possible to start execution of task ti only after all its predecessor tasks have completed execution and have transferred necessary data
72
A. Tarafdar et al.
Algorithm 1: Energy-efficient Scheduling of Deadline-sensitive Workflow with Budget constraint (ESDWB)
1 2 3 4 5 6 7 8 9
10 11
Input: Workflow W , Deadline WD , Budget BuW Output: Schedule of the workflow SW Determine deadline D(ti ) of each task ti in task set T of W using Eq. (12); Calculate priority P r(ti ) of each task using Eq. (13); Sort the tasks in T in descending order of their priorities; SBW ← BuW − Cl ; SW ← φ; foreach ti ∈ T do vf (ti ) ← null; // vf (ti ) indicates the VM on which ti will be assigned Calculate Bu(ti ) using Eq. (14); if (pre(ti ) = φ) then Sort the tasks in pre(ti ) in descending order of the data size to be transferred to ti ; foreach tp ∈ pre(ti ) do vk ← vf (tp ) ; // vf (tp ) indicates the VM on which tp is assigned if ((AF T (ti , vk ) ≤ D(ti )) and (c(ti , vk ) ≤ Bu(ti ))) then vf (ti ) ← vk ; SW ← SW ∪ ti , vf (ti ) ; Update SBW using Eq. (15) ; break; end
12 13 14 15 16 17 18 19 20 21
end end if ((pre(ti ) = φ) or (vf (ti ) = null)) then Compute ps(ti ) using Eq. (21); Q ← {τq | τq ∈ Vtype ∧ (pmax ≥ ps(ti ))}; τq
Sort the VM types in set Q in ascending order of their pmax values; τq
22
foreach τq ∈ Q do Consider an idle or new VM vk of type τq ; if (c(ti , vk ) ≤ Bu(ti )) then vf (ti ) ← vk ; SW ← SW ∪ ti , vf (ti ) ; Update SBW using Eq. (15) ; break; end end if (vf (ti ) = null) then li G ← {τg | τg ∈ Vtype ∧ ( pmax · cτg ≤ Bu(ti ))};
23 24 25 26 27 28 29 30 31 32
τg
τb ← {τg | τg ∈ G ∧ (pmax ≥ pmax ∀τy ∈ G)}; τg τy
33
Consider an idle or new VM vk of type τb ; vf (ti ) ← vk ; SW ← SW ∪ ti , vf (ti ) ; Update SBW using Eq. (15);
34 35 36
end
37 38 39 40 41 42 43 44
end end a using Eq. (19); Calculate actual makespan of workflow Wm a ); //Algorithm 2 SW ← ERT(SW , Wm Calculate energy EW and cost CW using Eq. (8) and Eq. (10) respectively; Calculate Deadline violation DV using Eq. (20); return SW
Energy-Efficient Scheduling
73
to it. AST (ti , vk ) and AF T (ti , vk ) is defined in Eq. (17) and Eq. (18) respectively as follows: P ST (ti ), if Tki = φ AST (ti , vk ) = (17) max{P ST (ti ), AF T (tb , vk )}, if Tki = φ where Tki represents the set of tasks of the workflow scheduled on VM vk before task ti , and tb is the task scheduled on vk just before ti . AF T (ti , vk ) = AST (ti , vk ) + ET (ti , vk )
(18)
a and the percentage of deadThus the actual makespan of the workflow Wm line violation of the workflow DV can be calculated using Eq. (19) and Eq. (20) respectively as: a = max {AF T (ti , vf (ti ) )} (19) Wm ti ∈Texit
DV =
a (Wm −WD ) WD
0,
· 100%,
a if Wm > WD otherwise
(20)
Algorithm ESDWB: Algorithm 1 first tries to allocate a task ti to a VM executing one of its predecessor tasks (lines 8 to 18 of Algorithm 1) to avoid the data transmission time and thus reduce the actual finish time of ti . It would a and the percentage of help to reduce the actual makespan of the workflow Wm deadline violation of the workflow DV . If the task ti does not have a predecessor, or if it is not possible to allocate it to a VM executing one of its predecessor tasks due to deadline violation or budget constraint, then a VM of a suitable type is selected for it (lines 19 to 38 of Algorithm 1). For this, we compute the minimum processing speed ps(ti ) needed to complete ti before its deadline using the following equation: ps(ti ) =
li D(ti ) − P ST (ti )
(21)
where li is the length of ti in million instructions(MI). After calculating ps(ti ), the set Q is determined (line 21 of Algorithm 1) which contains the types of VM that can complete ti before its deadline. pmax τq represents the maximum processing speed of a VM of type τq . Algorithm 1 tries to allocate ti to a VM by choosing the VM types in Q in ascending order of their processing speeds. Lower processing speed corresponds to lower voltage and frequency and thus lower power consumption. If the VM types in Q do not permit the execution of the task within its budget, we need to pick a VM type that will violate the task’s deadline. But while doing so, we choose the fastest VM available within its budget (lines 31 to 37 of Algorithm 1) so that there is less chance of deadline violation of the workflow. In Algorithm 1, G represents the set of VM types that can schedule ti within its budget Bu(ti ), and τb is a member of G with the highest processing speed. In this way, all tasks are a is determined. scheduled and the actual makespan of the workflow Wm
74
A. Tarafdar et al.
Algorithm 2: Energy Reduction of Tasks (ERT)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
a Input: Schedule of the workflow SW , actual makespan of the workflow Wm Output: Schedule SW after updation foreach ti ∈ T do vk ← vf (ti ) ; // vf (ti ) indicates the VM on which task ti is assigned Calculate ExF T (ti ) using Eq. (22); if (ExF T (ti ) > AF T (ti , vk )) then Determine P ExF T (ti , vk ) using Eq. (23); Calculate pmin (ti , vk ) using Eq. (24); P S ← {plk | plk ∈ [pmin , pmax ] ∧ (plk ≥ pmin (ti , vk ))}; k k l l pk ← min(P S); fk ← frequency corresponding to plk if plk < pmax then k Update the frequency of VM vk from fkmax to fkl for task ti ; Update ET (ti , vk ) and AF T (ti , vk ) in the schedule SW ; end end end return SW
Algorithm 2 uses DVFS to reduce the energy consumption of the tasks by extending their finish times without affecting the actual makespan and budget of the workflow. Here ExF T (ti ) represents the extended finish time of a task ti in the workflow. It is defined below: ⎧ a ⎨Wm , if ti ∈ Texit (22) ExF T (ti ) = ⎩ min {AST (tj , vf (tj ) ) − T Tij }, otherwise tj ∈suc(ti )
The above equation implies that extension in finish time of a task ti must not affect the makespan of the workflow and the actual start time of the successor tasks. A task ti is said to have some slack time if ExF T (ti ) is greater than AF T (ti ). The finish time of such a task ti assigned to a VM vk can be extended to its possible extended finish time P ExF T (ti , vk ) formulated as: ⎧ li ⎨min{ExF T (ti ), AST (ta , vk ), {AST (ti , vk ) + max }}, case 1 pk P ExF T (ti , vk ) = l i ⎩min{ExF T (ti ), {AST (ti , vk ) + max }}, case 2 pk
(23) Two cases can occur while determining the possible extended finish time of task ti on VM vk . They are: – Case 1: This situation occurs if another task ta is scheduled for execution on VM vk after ti . It is possible to extend the finish time of ti till ExF T (ti ), or AST (ta , vk ), or till the end of the billing period, whichever is earliest. This is because extension of ti beyond AST (ta , vk ) would delay the start of task
Energy-Efficient Scheduling
75
li ta and extension of ti beyond {AST (ti , vk ) + pmax } would lead to increase k in cost of scheduling ti . – Case 2: This situation occurs if no other tasks are scheduled for execution on vk after ti . The minimum processing speed pmin (ti , vk ) required to complete task ti on VM vk within its possible extended finish time P ExF T (ti , vk ) is calculated as: pmin (ti , vk ) =
li P ExF T (ti , vk ) − AST (ti , vk )
(24)
where li is the length of ti in million instructions(MI). Algorithm 2 finds an allowable set of processing speeds P S of vk greater than or equal to pmin (ti , vk ). The processing speed plk = min(P S) will cause lowest energy consumption as explained below. The energy consumption E of VM vk due to execution of task ti is :
li l 2 l E = K · (Vk ) · fk · , (25) plk where Vkl , fkl , plk are the voltage, frequency and processing speed of VM vk , and li is length of task ti . As processing speed is proportional to the frequency, energy consumption E ∝ (Vkl )2 . From Table 1, it is observed that for a given VM type, lower voltage corresponds to lower frequency and thus lower processing speed. Hence we determine the minimum processing speed in the set P S and accordingly update the frequency of the VM (lines 8 to 12 of Algorithm 2). This helps to reduce the energy consumption.
5
Performance Evaluation
To evaluate the performance of our proposed approach ESDWB, we compare it with two existing workflow scheduling algorithms- FBCWS [11] and DEWTS [13]. FBCWS tries to minimize the makespan of the workflow while satisfying the user-defined budget. On the other hand, DEWTS tries to reduce the energy consumption of the workflow while keeping its makespan within the deadline. 5.1
Performance Metrics
We evaluate and compare our proposed algorithm with respect to the following performance metrics(i) Normalized Energy Consumption: The Normalized Energy Consumption (N E) of scheduling a workflow W by an algorithm is defined as follows: NE =
EW , min EW
(26)
76
A. Tarafdar et al.
where EW is the energy consumption of the workflow W obtained using Eq. (8), min is the minimum energy consumption value obtained among all the and EW algorithms under comparison. (ii) Normalized Makespan: The Normalized Makespan (N M ) of a workflow W is defined as follows: Wa (27) NM = m , WD a where Wm is the actual makespan of the workflow W obtained using Eq. (19), and WD is the deadline of the workflow. If the value of N M becomes greater than 1, then it indicates that the deadline has been violated. (iii) Normalized Cost: The Normalized Cost (N C) of scheduling workflow W is defined as follows: CW , (28) NC = BuW where CW is the monetary cost of scheduling the workflow W calculated using Eq. (10), and BuW is the budget of the workflow. If the value of N C becomes greater than 1, then it indicates that the cost of scheduling has exceeded the budget. 5.2
Experimental Setup
We have used WorkflowSim [5], a well-accepted and widely used workflow simulator, to simulate a Cloud data center. The VMs in the simulated data center are of three types- Type 1, Type 2, and Type 3. The VMs of Type 1, 2, and 3 are modeled to run on AMD Turion MT-34 processor, AMD Opteron 2218 processor, and Intel Xeon E5450 processor respectively. The voltage-frequency pairs of these real-world processors are shown in Table 1 [12,13]. The price of each VM Table 1. Voltage-frequency pairs of different processors. Level
AMD Turion MT-34
AMD Opteron 2218
Intel Xeon E5450
Voltage (V)
Frequency (GHz)
Voltage (V)
Frequency (GHz)
Voltage (V)
Frequency (GHz)
0
1.20
1.80
1.30
2.60
1.35
3.00
1
1.15
1.60
1.25
2.40
1.17
2.67
2
1.10
1.40
1.20
2.20
1.00
2.33
3
1.05
1.20
1.15
2.00
0.85
2.00
4
1.00
1.00
1.10
1.80
5
0.90
0.80
1.05
1.00
of Type 1, Type 2, and Type 3 are respectively $0.0058 per hour, $0.0116 per hour, and $0.023 per hour. These values correspond to the price of Amazon EC2 on-demand VM instances having one vCPU and Linux OS. Similar to CSPs like Google and Amazon, we consider a per-second billing model with a minimum billing period of 1 min. That is, if a VM runs for 75.2 s, the charge will be for 76 s, but if a VM runs for 35 s, the same will be for 1 min. We assume that the average bandwidth between the VMs is 1 Gbps.
Energy-Efficient Scheduling
77
We have experimented with four real scientific workflows- CyberShake, Epigenomics, SIPHT, and Montage. The structure and characteristics of these workflows appear in detail in [3]. We have conducted experiments for different combinations of α and β. It is noteworthy that the deadline factor α, and budget factor β significantly affects the deadline satisfaction of a workflow. For a particular value of β, a high α increases the chance of deadline satisfaction. Again, for a given α, a higher β leads to a greater chance of deadline satisfaction. Thus by varying the α and β parameters, the users can suitably adjust the deadline and budget depending upon the relative importance of the makespan and monetary cost of the workflow. After trying with different values of α and β, we have finally settled on α = 1.3, β = 0.6, and plotted the normalized energy, makespan, and cost. 5.3
Experimental Results and Analysis
Normalized Energy Consumption
Figures. 1, 2 and 3 show the simulation results. From the figures, it is clear that FBCWS keeps the scheduling cost within the budget and tries to reduce the makespan. However, it leads to high energy consumption. DEWTS reduces energy consumption and satisfies the deadline of the workflow, but significantly increases the cost of scheduling which exceeds the budget. Our proposed approach ESDWB effectively schedules the workflow within its budget and tries to lessen deadline violation and energy consumption at the same time. The makespan for our approach mostly lies within the deadline, and when it does not, it exceeds the deadline by a small amount keeping the cost within budget. In workflows like CyberShake and Epigenomics, where there is a significant amount of data transfer between tasks of the workflow, the makespan generated by our approach is lesser than that of FBCWS and DEWTS. Our approach
1.4 1.3 1.2 1.1 1.0 0.9
b Cy
h erS
e ak Ep
i
o gen
mi
cs
ESDWB
SIP
HT
FBCWS
e
Mo
g nta
DEWTS
Fig. 1. Energy consumption of scheduling different workflows.
A. Tarafdar et al.
Normalized Makespan
78
1.1 1.0 0.9 0.8 0.7
b Cy
h erS
e ak Ep
i
o gen
mi
cs
ESDWB
SIP
HT
FBCWS
g nta Mo
e
DEWTS
Normalized Cost
Fig. 2. Makespan of different workflows.
1.4 1.3 1.2 1.1 1.0 0.9
b Cy
h erS
e ak Ep
i
o gen
mi
cs
ESDWB
SIP
HT
FBCWS
e
g nta Mo
DEWTS
Fig. 3. Monetary cost of workflow scheduling.
gives a smaller makespan because it tries to allocate the predecessor and successor tasks of the workflow that have high data transfer between them, in the same VM, and thus reduces the data transmission time. Our algorithm ESDWB achieves significant energy savings by adjusting the voltage and frequency of the VMs using the DVFS technique. Although the energy consumption value generated by ESDWB is sometimes more than that of DEWTS, it is much lesser than that of FBCWS. Moreover, unlike DEWTS, the cost of scheduling in our approach ESDWB never exceeds the budget. Our proposed method outperforms the other algorithms under comparison, by considering all three scheduling parameters- energy, makespan, and cost.
Energy-Efficient Scheduling
6
79
Conclusion and Future Work
In this paper, we have proposed an energy-efficient scheduling algorithm for deadline-sensitive and budget-constrained workflows in the Cloud environment. Our approach ensures that the schedule is within the user-defined budget. It also reduces energy consumption by assigning tasks to VMs having comparatively lower processing speeds, without violating the deadline of the tasks. It further promotes energy efficiency by reducing the frequency of the VMs using the DVFS technique. Our proposed approach tries to schedule the workflow such that deadline violation is prevented as far as possible. In case the deadline violation cannot be avoided due to insufficient budget, the algorithm assigns tasks to VMs of the fastest type available within the budget to minimize the makespan. Experimental results validate the efficacy of our approach in comparison to other state-of-art workflow scheduling techniques. As future work, we would like to explore the workflow scheduling problem by considering the data transfer costs between the tasks of the workflow. We also intend to investigate workflow scheduling in the multi-cloud environment. Acknowledgment. We acknowledge the contribution of UGC-NET Junior Research Fellowship (UGC-Ref. No.: 3610/(NET-NOV 2017)) provided by the University Grants Commission, Government of India to the first author for research work. We would also like to thank the Visvesvaraya PhD Scheme of Ministry of Electronics & Information Technology, Government of India (Ref. No. MLA/MUM/GA/10(37)C) for their support.
References 1. How to stop data centres from gobbling up the world’s electricity (2018). https:// www.nature.com/articles/d41586-018-06610-y. Accessed 6 Jul 2020 2. Arabnejad, H., Barbosa, J.G.: A budget constrained scheduling algorithm for workflow applications. J. Grid Comput. 12(4), 665–679 (2014) 3. Bharathi, S., Chervenak, A., Deelman, E., Mehta, G., Su, M.H., Vahi, K.: Characterization of scientific workflows. In: 2008 3rd Workshop on Workflows in Support of Large-Scale Science, pp. 1–10. IEEE (2008) 4. Chen, H., Zhu, X., Qiu, D., Guo, H., Yang, L.T., Lu, P.: EONS: minimizing energy consumption for executing real-time workflows in virtualized cloud data centers. In: 2016 45th International Conference on Parallel Processing Workshops (ICPPW), pp. 385–392. IEEE (2016) 5. Chen, W., Deelman, E.: WorkflowSim: a toolkit for simulating scientific workflows in distributed environments. In: 2012 IEEE 8th International Conference on Escience, pp. 1–8. IEEE (2012) 6. Karmakar, K., Das, R.K., Khatua, S.: Resource scheduling of workflow tasks in cloud environment. In: 2019 IEEE International Conference on Advanced Networks and Telecommunications Systems (ANTS), pp. 1–6. IEEE (2019) 7. Karmakar, K., Das, R.K., Khatua, S.: Resource scheduling for tasks of a workflow in cloud environment. In: Hung, D.V., D’Souza, M. (eds.) ICDCIT 2020. LNCS, vol. 11969, pp. 214–226. Springer, Cham (2020). https://doi.org/10.1007/978-3030-36987-3 13
80
A. Tarafdar et al.
8. Li, Z., Ge, J., Hu, H., Song, W., Hu, H., Luo, B.: Cost and energy aware scheduling algorithm for scientific workflows with deadline constraint in clouds. IEEE Trans. Serv. Comput. 11(4), 713–726 (2015) 9. Mathew, T., Sekaran, K.C., Jose, J.: Study and analysis of various task scheduling algorithms in the cloud computing environment. In: 2014 International Conference on Advances in Computing, Communications and Informatics (ICACCI), pp. 658– 664. IEEE (2014) 10. Qin, Y., Wang, H., Yi, S., Li, X., Zhai, L.: An energy-aware scheduling algorithm for budget-constrained scientific workflows based on multi-objective reinforcement learning. J. Supercomput. 76(1), 455–480 (2019). https://doi.org/10.1007/s11227019-03033-y 11. Rizvi, N., Ramesh, D.: Fair budget constrained workflow scheduling approach for heterogeneous clouds. Clust. Comput. 23(4), 3185–3201 (2020). https://doi.org/ 10.1007/s10586-020-03079-1 12. Stavrinides, G.L., Karatza, H.D.: An energy-efficient, QoS-aware and cost-effective scheduling approach for real-time workflow applications in cloud computing systems utilizing DVFs and approximate computations. Fut. Gener. Comput. Syst. 96, 216–226 (2019) 13. Tang, Z., Qi, L., Cheng, Z., Li, K., Khan, S.U., Li, K.: An energy-efficient task scheduling algorithm in DVFs-enabled cloud environment. J. Grid Comput. 14(1), 55–74 (2016) 14. Wu, C.Q., Lin, X., Yu, D., Xu, W., Li, L.: End-to-end delay minimization for scientific workflows in clouds under budget constraint. IEEE Trans. Cloud Comput. 3(2), 169–181 (2014) 15. Zhu, Z., Zhang, G., Li, M., Liu, X.: Evolutionary multi-objective workflow scheduling in cloud. IEEE Trans. Parallel Distrib. Syst. 27(5), 1344–1357 (2015)
An Efficient Renewable Energy-Based Scheduling Algorithm for Cloud Computing Sanjib Kumar Nayak1 , Sanjaya Kumar Panda2(B) , Satyabrata Das1 , and Sohan Kumar Pande1 1 Veer Surendra Sai University of Technology, Burla 768018, Odisha, India [email protected], [email protected], [email protected] 2 National Institute of Technology, Warangal 506004, Telangana, India [email protected]
Abstract. The global growth of cloud computing services is witnessing a continuous surge, starting from storing data to sharing information with others. It makes cloud service providers (CSPs) efficiently utilize the existing resources of datacenters to increase adaptability and minimize the unexpected expansion of datacenters. These datacenters consume enormous amounts of energy generated using fossil fuels (i.e., non-renewable energy (NRE) sources), and emit a substantial amount of carbon footprint and heat. It drastically impacts the environment. As a result, CSPs are pledged to decarbonize the datacenters by adopting renewable energy (RE) sources, such as solar, wind, hydro and biomass. However, these CSPs have not completely ditched fossil fuels as RE sources are subjected to inconsistent atmospheric conditions. Recent studies have suggested using both NRE and RE sources by the CSPs to meet user requirements. However, these studies have not considered flexible duration, nodes and utilization of the user requests (URs) with respect to datacenters. Therefore, we consider these URs’ properties and propose a RE-based scheduling algorithm (RESA) to efficiently assign the URs to the datacenters. The proposed algorithm determines both the earliest completion time and energy cost, and takes their linear combination to decide a suitable datacenter for the URs. We conduct extensive simulations by taking 1000 to 16000 URs and 20 to 60 datacenters. Our simulation results are compared with other algorithms, namely roundrobin (RR) and random, which show that RESA is able to reduce the overall completion time (i.e., makespan (M )), energy consumption (EC), overall cost (OC) and the number of used RE (|U RE|) resources. Keywords: Cloud computing · Renewable energy · Non-renewable energy · Scheduling algorithm · Datacenters · Coronavirus disease
1
Introduction
Over the last few years, many companies, such as Netflix, Instagram, Apple and many more, have moved to the cloud [1]. These companies are looking for part or all of their information technology (IT) solutions, starting from storing c Springer Nature Switzerland AG 2021 D. Goswami and T. A. Hoang (Eds.): ICDCIT 2021, LNCS 12582, pp. 81–97, 2021. https://doi.org/10.1007/978-3-030-65621-8_5
82
S. K. Nayak et al.
data to sharing information with others [2–5]. The cloud-based services enable them to meet the spike in demand and depletion in activity without owning any infrastructure and upfront commitment. According to the MarketsandMarkets report [6], the market of global cloud computing is expected to increase at a compound annual growth rate of 17.5% (i.e., $832.1 billion by 2025 from $371.4 billion in 2020) and its adoption is drastically increasing due to Coronavirus Disease (COVID)-19 pandemic. The continuous surge of cloud computing creates enormous challenges for the CSPs. One such challenge is to efficiently utilize the existing resources of datacenters, so that new users’ applications can be deployed without increasing the physical infrastructure [7,8]. Moreover, efficient resource management and monitoring minimize the unexpected expansion of datacenters. The electricity use of global datacenters is rapidly increasing day by day. It is generated using fossil fuels (NRE resources), such as coal, petroleum, gas and oils. Due to the high demand for electricity, the cost of fossil fuels is also increasing. These fuels adversely affect the environment by emitting a huge amount of carbon dioxide (i.e., CO2 ) and heat. For instance, streaming a 30min Netflix video generates 0.028 to 0.057 kg CO2 , which is the same as 200 m driving as reported in International Energy Agency [9]. As a result, CSPs are pledged to decarbonize the datacenters by adopting RE sources, such as solar, wind, hydro, tidal, geothermal and biomass [10,11]. Many CSPs like Amazon, Google and Microsoft claim that datacenters’ resources are 100% powered by RE resources, as reported in an American magazine [12]. However, these CSPs have not completely ditched NRE sources as RE sources are reliant on the atmospheric conditions. Recent studies have suggested that the CSPs use both NRE and RE sources for addressing the user requirements [10,11,13–15]. These studies present the user requirements, such as start time, nodes and duration in the form of URs, and they are assigned to the datacenters based on the availability of RE resources, energy cost, specific ordering and randomly. However, these studies have not considered flexible duration, nodes and utilization of the URs on the datacenters. This phenomenon inspired us to think of the URs’ properties and introduce a new scheduling problem in the RE-based cloud computing environment. In this paper, we address the following scheduling problem. Given a set of n URs and a set of m datacenters with their resources, the problem is to assign the URs to the datacenters, so that M , EC and OC are minimized, and |U RE| resources is maximized. We propose a three-phase algorithm, called RESA to efficiently assign the URs to the datacenters. The proposed algorithm estimates the earliest completion time (CT ) and energy cost (CO) of the URs in all the datacenters, and takes their linear combination to select a suitable datacenter for the URs. The performance of the RESA is carried out through simulation runs by taking 1000 to 16000 URs and 20 to 60 datacenters. The results of simulation runs are compared with two algorithms, namely RR [10] and random [16] in terms of four performance metrics, namely M , EC, OC and |U RE| resources to show the effectiveness of the RESA. Note that existing scheduling algorithms do not model the URs in flexible duration, nodes and utilization; hence, those
An Efficient Renewable Energy-Based Scheduling Algorithm
83
algorithms are not directly comparable to the proposed algorithm. Therefore, we compare with RR and random as compared in [10,15–17]. The rest of this paper is organized as follows. In Sect. 2, we survey the task consolidation and scheduling algorithms with their pros and cons. In Sect. 3, we discuss the cloud, cost and energy models, and formulate the scheduling problem. The proposed algorithm is presented in Sect. 4 with an illustration. Simulation results of the proposed and existing algorithms are given in Sect. 5. The paper concludes in Sect. 6 with some possible future works.
2
Related Work
RE-based scheduling algorithms are commonly focused on minimizing the OC [10,11,13,15] and maximizing the green power utilization [10,11,14,15,18]. These scheduling algorithms have modeled the URs in the start time, nodes and duration without considering utilization. Moreover, these algorithms have assigned the URs to the resources of datacenters without considering heterogeneity. Beloglazov et al. [19] have proposed an energy-aware resource allocation algorithm by considering CPU utilization. They have shown a significant reduction of energy in the cloud datacenters using experimentation. However, they have considered mapping virtual machines (VMs) to hosts without looking into the URs. Consequently, Esfandiarpoor et al. [20] have improved the algorithm proposed in [19] by taking the structural advantage of datacenters. Like [19], the mapping of URs to VMs is not focused on their algorithm. Lee et al. [16] have suggested considering resource utilization in scheduling algorithm to improve the energy efficiency and proposed two energy-conscious task consolidation algorithms. But, the relationship between resource utilization and energy consumption is considered as a linear one. Later, Hsu et al. [21] have stated that the relationship is not linear. As a result, they have presented an energy-aware task consolidation by limiting the CPU usage to a specified threshold (i.e., 70%). The rationality behind this threshold is that energy consumption is drastically increased beyond this threshold. Panda and Jana [17] have addressed the demerits associated with task consolidation and scheduling, and proposed an energy-efficient task scheduling without considering nodes. In the above works, the RE-based sources are not used to reduce energy consumption. Rajeev and Ashok [18] have presented a dynamic load shifting program for cloud computing. This program locally computes the generation and demand in each time frame and determine the possibility of RE sources for the customer. However, they have not considered inconsistent atmospheric conditions in their program. Grange et al. [13] have proposed an attractiveness-based blind scheduling algorithm for scheduling the jobs to the machines. However, the cost associated with powering machines is not well-studied in their algorithm. Xu et al. [14] have presented a workload shifting algorithm to reduce NRE resources and carbon dioxide emissions. They have performed workload sharing among the datacenters to properly utilizing their RE resources. They have suggested that
84
S. K. Nayak et al.
the workload execution can be delayed, if allowed, to maximize RE resources usage. The above works have not considered resource utilization, which significantly reduces energy consumption. Toosi and Buyya [10] have proposed an algorithm for geographical load balancing to minimize cost and maximize RE utilization. They have considered future-aware best fit (FABEF), RR and highest available renewable first (HAREF) benchmark algorithms in their study. FABEF performs better in overall cost, whereas HAREF performs better in green power utilization. However, they have assumed that URs can be assigned to any datacenter with identical time and nodes. Nayak et al. [15] have presented a RE-based task consolidation algorithm by considering the resource utilization. They have also assigned the URs by taking identical time, nodes and utilization on the datacenters. The proposed algorithm is different in various aspects in comparison with other algorithms. (1) Unlike [10,13,15–17,21], the RESA considers different flexible duration, nodes and utilization of URs on the datacenters to make it a realistic one. (2) The RESA takes a linear combination of the earliest CT and CO to select a suitable datacenter in contrast to OC or |U RE| resources as used in [10,15]. (3) We evaluate the RESA using M and EC in addition to OC and |U RE| resources as used in [10,15].
3 3.1
Models and Problem Statement Cloud System Model
Consider a cloud system that consists of a set of geographically distributed datacenters. Each datacenter houses a set of servers/resources and is connected to both NRE and RE sources. Here, we assume that the resources are powered with RE and NRE sources based on their availability. On the other hand, there is a CSP portal’s user interface to submit the URs and keep track of the URs and the resources. These URs are placed in a global queue and served in the order of their arrival without any interruption. It is noteworthy to mention that multiple URs can be assigned to a single resource without any interference, as stated in [16,17,21]. The data transfer time between the portal and datacenters’ resources, including resource mapping, is assumed to be negligible. 3.2
Cost and Energy Model
The cost and energy model is based on the utilization of resources and duration of UR as used in [16,17]. As a resource can be powered with NRE or RE source, the relationship between resource utilization and power consumption depends on the energy source type. If the energy source is NRE, then the relationship is considered as linear one up to a certain utilization (U T ) (say, τ %) and non-linear one beyond that utilization as considered in [21]. Here, we consider that the CO is associated with utilization and it is calculated for NRE resource as follows. COSTjkl × 2 if U T ≥ τ % (1) CO = Otherwise COSTjkl
An Efficient Renewable Energy-Based Scheduling Algorithm
85
Table 1. pmax and pmin values pmax pmin
Resource
Utilization
NRE
0 to < τ % ≥ τ % to 100%
25 30
20 20
RE
0 to 100%
25
20
where COSTjkl is the cost of k th resource of j th datacenter at time instance l. Similarly, we calculate the EC for NRE resource as follows. EC = (pmax − pmin ) × U T + pmin
(2)
where pmax (100% utilization) and pmin (1% utilization) values are set as per Table 1, where 20, 25 and 30 values represent 200 W, 250 W and 300 W, respectively as suggested in [16]. On the other hand, the relationship between resource utilization and power consumption is a linear one irrespective of utilization in RE source. Therefore, the CO is calculated as follows. CO = COSTjkl
(3)
The EC for RE resource is calculated using Eq. (2), where pmax and pmin values are set as mentioned in Table 1. 3.3
Problem Statement
Consider a global queue Q, which keeps a set of n URs, U = {U1 , U2 , U3 ,. . ., Un }, in which each UR Ui , 1 ≤ i ≤ n, is represented in the form of 3-tuple, i.e., , where D is the flexible duration, N is the number of nodes and U T is the utilization. Here, flexible duration means that there is no specific start and end time. On the other hand, consider a set of m datacenters, DC = {DC1 , DC2 , DC3 ,. . ., DCm }, in which each datacenter DCj , 1 ≤ j ≤ m, consists of a set of p resources/nodes, Rj = {Rj1 , Rj2 , Rj3 ,. . ., Rjp }. Each resource Rjk ∈ DCj , 1 ≤ j ≤ m, 1 ≤ k ≤ p, consists of a set of o resource slots/time window, Sjk = {Sjk1 , Sjk2 , Sjk3 ,. . ., Sjko }. Each resource slot Sjkl ∈ Rjk , 1 ≤ j ≤ m, 1 ≤ k ≤ p, 1 ≤ l ≤ o, can be run using NRE (brown) or RE (green) energy source with a given time l. If a resource slot Sjkl is running using NRE source, then the cost is variable, identifiable and pre-determined up to o resource slots. On the contrary, if a resource slot Sjkl is running using RE source, then the cost is fixed, identifiable and pre-determined up to o resource slots as stated in [10,11]. Given a D matrix, N matrix and U T matrix between the URs and the datacenters, the problem is to find an efficient mapping function f between U and DC (i.e., f : U → DC), such that M , EC and OC are minimized and |U RE| resources is maximized.
86
S. K. Nayak et al.
DC1 ⎧ U1⎪ D11 ⎪ U2⎨ D21 D= . .. .. ⎪ ⎪ ⎩ . Un Dn1
DC2 D12 D22 .. . Dn2
· · · DCm ⎫ · · · D1m ⎪ ⎪ · · · D2m ⎬ (4) .. .. . . ⎪ ⎪ ⎭ · · · Dnm
DC1 ⎧ U1⎪ U T11 ⎪ U2⎨ U T21 UT = . .. .. ⎪ ⎪ ⎩ . Un U Tn1
DC1 ⎧ U1⎪ N11 ⎪ U2⎨ N21 N= . .. .. ⎪ ⎪ ⎩ . Un Nn1 DC2 U T12 U T22 .. . U Tn2
· · · DCm ⎫ · · · U T1m ⎪ ⎪ · · · U T2m ⎬ .. .. . . ⎪ ⎪ ⎭ · · · U Tnm
DC2 N12 N22 .. . Nn2
· · · DCm ⎫ · · · N1m ⎪ ⎪ · · · N2m ⎬ (5) .. .. . . ⎪ ⎪ ⎭ · · · Nnm
(6)
where Dij , Nij and U Tij , 1 ≤ i ≤ n, 1 ≤ j ≤ m, represent the flexible duration, nodes and utilization of UR Ui on datacenter DCj , respectively. The problem is restricted to the following constraints. 1. U
Tm[Sjkl ] + U Tij ≤ 100%, 1 ≤ i ≤ n, 1 ≤ j ≤ m, 1 ≤ k ≤ p, 1 ≤ l ≤ o 2. j=1 Iij = 1, ∀Ui ∈ U 1 if UR Ui is assigned to datacenter DCj where Iij = (7) 0 Otherwise
n 3. i=1 Gij ≥ 1, ∀DCj ∈ DC 1 if datacenter DCj is executing UR Ui where Gij = (8) 0 Otherwise
4
Proposed Algorithm
The proposed algorithm RESA is an on-line algorithm for cloud computing and the pseudo code for RESA is shown in Algorithm 1. It aims to minimize the M , EC and OC, and maximize the |U RE| resources. The RESA consists of three phases, namely estimation, selection, and assignment and execution. In the first phase, RESA finds the earliest CT and CO of the URs on the datacenters (Line 3 to Line 43 of Algorithm 1). In the second phase, it selects a datacenter by finding the fitness value of the URs on the datacenters followed by finding the minimum fitness value (Line 44, and Line 1 to Line 5 of Procedure 1). Finally, it assigns the URs to their selected datacenter and executes in the resources of the datacenters (Line 45, and Line 1 to Line 28 of Procedure 2).
An Efficient Renewable Energy-Based Scheduling Algorithm
4.1
87
Phase 1: Estimation
In this phase, RESA finds the earliest CT and CO of a UR on the datacenters. For this, it finds the available resource slots in the datacenter by taking the flexible duration, nodes and utilization of that UR. If the sum of the utilization of the resource slot by early arrived URs and utilization of that UR does not exceed the maximum utilization of the resource slot, i.e., 100% (Line 9 of Algorithm 1), then the resource slot can be selected for that UR. However, this process continues until the nodes of that UR is completely satisfied for the given duration (9 to Line 15). If the number of nodes is not sufficient (Line 17), then it skips the resource for that particular time. Otherwise, it finds the cost for that UR by checking whether that resource slot is powered with NRE source or not (Line 19 and Line 20). If it is an NRE source, then it finds the utilization of that resource slot (Line 21). Note that the utilization is zero iff there are no early arrived URs. It then checks whether the UR’s utilization is greater than or equal to the threshold τ % or not (Line 22). If it is so, then the cost is doubled by following the proposed cost and energy model (Line 23). Otherwise, the normal cost is taken for using that resource slot (Line 24 and Line 25). On the other hand, if the early arrived URs occupy the resource slot, then the sum of the utilization of early arrived URs and the current UR is determined. If it is greater than or equal to the threshold, then the normal cost is taken for using that resource slot (Line 28 to Line 30). On the contrary, the normal cost is taken for the resource slot powered with RE source irrespective of the value of utilization (Line 33 to Line 35). This process continues until the duration of that UR is satisfied (Line 38 to Line 41). Note that the CT is the end time of the duration of the UR. 4.2
Phase 2: Selection
In this phase, RESA selects a datacenter by finding the UR’s fitness value on the datacenters. For this, it normalizes the CT and CO of that UR on the datacenters (Line 1 of Procedure 1) and takes their linear combination (Line 3). Finally, it finds the minimum fitness value and the corresponding datacenter (Line 4 and Line 5). The rationality behind this is that the CT and CO need to be minimized. Therefore, its linear combination also needs to be minimized. Note that EC and |U RE| resources are associated with the cost; hence, these are not explicitly shown here. 4.3
Phase 3: Assignment and Execution
In this phase, RESA assigns a UR to the selected datacenter of phase 2 and executes it in the resources of that datacenter. For this, it first finds the start time of the UR from the CT (Line 1 of Procedure 2). Then it assigns the resource slots between the start time and the CT as per the nodes of that UR. If the sum of the resource slot utilization and utilization of that UR does not exceed 100%, then that resource slot is assigned to the UR (Line 4). Like phase 1, this process continues until the nodes of that UR is completely satisfied for the given
88
S. K. Nayak et al.
Algorithm 1. Pseudo code for RESA Input: 1-D matrices: Q, n, m, p, o, τ and λ ∈ (0, 1) 2-D/3-D matrices: D, N , U T , S and COST Output: M , EC, OC and |U RE| resources 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 22: 23: 24: 25: 26: 27: 28: 29: 30: 31: 32: 33: 34: 35: 36: 37: 38: 39: 40: 41: 42: 43: 44: 45: 46: 47: 48:
Set OC = 0 while Q = NULL do for i = 1, 2, 3,. . ., n do for j = 1, 2, 3,. . ., m do Set COij = 0, tempd = 0 for l = 1, 2, 3,. . ., o do Set ncount = 0, tempn = Nij for k = 1, 2, 3,. . ., p do if (U T [Sjkl ] + U Tij ) ≤ 100% then Constraint 1 ncount += 1 if ncount == tempn then tempd += 1 Break end if end if end for if ncount < tempn then Set tempd = 0 else if Sjkl == 1 then NRE Resource if U T [Sjkl ] == 0% then if U Tij ≥ τ % then COij += COSTjkl × 2 else COij += COSTjkl end if else if (U T [Sjkl ] + U Tij ) ≥ τ % then COij += COSTjkl end if end if else RE Resource if U T [Sjkl ] == 0% then COij += COSTjkl end if end if end if if Dij == tempd then Set CTij = l Break end if end for end for Procedure 1 SELECT -DAT ACEN T ER-U R(i, m, λ, CT , CO) Constraint 2 Procedure 2 ASSIGN -U R-DAT ACEN T ER(i, j , p, τ , D, N , U T , S, COST , CT , OC) Constraint 3 end for end while Calculate the M , EC, OC and |U RE| resources
Procedure 1. SELECT -DAT ACEN T ER-U R(i, m, λ, CT , CO) 1: 2: 3: 4: 5:
Find norm(COij ) and norm(CTij ), 1 ≤ j ≤ m norm is a function to perform normalization of all the values. for j = 1, 2, 3,. . ., m do Fij = λ × norm(CTij ) + (1 − λ) × norm(COij ) end for Find min(Fij ) and determine the datacenter j that holds min(Fij ) min is a function to find the minimum of all the values.
An Efficient Renewable Energy-Based Scheduling Algorithm
89
duration. Next, it finds the OC for that UR by checking whether that resource slot is powered with NRE or RE source and the utilization, as discussed in phase 1 (Line 6 to Line 22). This process continues from the start time to the CT of that UR. Procedure 2. ASSIGN -U R-DAT ACEN T ER(i, j , p, τ , D, N , U T , S, COST , CT , OC) 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 22: 23: 24: 25: 26: 27: 28:
4.4
for l = (CTij - Dij + 1), (CTij - Dij + 2), (CTij - Dij + 3),. . ., CTij do Set ncount = 0 for k = 1, 2, 3,. . ., p do if (U T [Sj kl ] + U Tij ) ≤ 100% then ncount += 1 if Sj kl == 1 then NRE Resource if U T [Sj kl ] == 0% then if U Tij ≥ τ % then OCij += COSTj kl × 2 else OCij += COSTj kl end if else if (U T [Sj kl ] + U Tij ) ≥ τ % then OCij += COSTj kl end if end if else RE Resource if U T [Sj kl ] == 0% then OCij += COSTj kl end if end if end if if ncount == Nij then Break end if end for end for
An Illustration
Let us illustrate the proposed algorithm using ten URs, U1 to U10 and three datacenters, DC1 to DC3 . The properties of the URs, such as D, N and U T , are shown in Table 2. Each datacenter contains three resources/nodes in which NRE (RE) resource slots are represented in gray (white) color, as shown in Table 3. The COST of using the NRE resource slots is represented in the first row of the respective datacenters and the COST of using the RE resource slots is fixed as 1 irrespective of time l. The COST and resource slots are identifiable and pre-determined up to a time window of 9 and the threshold is fixed at 70%. Initially, the Q contains ten URs in the order of U1 to U10 and there is no URs assigned to the datacenters as shown in Table 3. In the first phase, UR U1 is selected from Q, and its D, N and U T on three datacenters are 4, 5, 2; 3, 2, 1 and 59, 19, 50, respectively. If UR U1 is assigned to datacenter DC1 , then
S. K. Nayak et al. Table 2. A set of ten URs with their properties
U1 U2 U3 U4 U5 U6 U7 U8 U9 U10
D1 4 2 3 4 5 5 3 1 1 2
D2 5 2 5 2 5 2 1 2 4 3
D3 2 5 3 3 5 2 4 4 2 3
Nodes Datacenter D1 D2 D3 3 2 1 1 2 3 1 3 2 1 3 1 1 1 1 3 2 2 3 2 3 1 2 2 3 3 2 1 3 1
2
Utilization D1 59 65 17 65 48 15 26 43 68 68
D2 19 69 68 39 58 18 35 65 58 68
D3 50 12 61 66 51 56 55 33 49 20
2
3
2
4
3
2
1
3
DC1
Duration
l=1l=2l=3l=4l=5l=6l=7l=8l=9 3
2
2
4
3
1
2
3
2
DC2
User Request
Table 3. A set of three datacenters with their NRE and RE resources
l=1l=2l=3l=4l=5l=6l=7l=8l=9 2
2
3
2
3
2
4
3
2
DC3
90
l=1l=2l=3l=4l=5l=6l=7l=8l=9
CT and CO is 4 and 18 (i.e., 1 × 8 + 2 + 2 + 3 + 3), respectively. Similarly, if it is assigned to datacenter DC2 and DC3 , then CT is 5 and 2, and CO is 14 (i.e., 1 × 7 + 3 + 2 + 2) and 2, respectively. In the second phase, the CT s are normalized as 0.80, 1.00 and 0.40, and the COs are normalized as 1.00, 0.78 and 0.11 on three datacenters, respectively. Then the fitness value (F ) is calculated as 0.90, 0.89 and 0.26 on the datacenters by taking the λ value as 0.50. As the minimum F is achieved on datacenter DC3 , UR U1 is assigned to datacenter DC3 in the third phase. The OC of datacenter DC3 is updated to 2 and the CT of UR U1 is set to 2. Next, UR U2 is selected from Q, and its D, N and U T on three datacenters are 2, 2, 5; 1, 2, 3 and 65, 69, 12, respectively. If UR U2 is assigned to datacenter DC1 , then CT and CO is 2 and 2, respectively. Similarly, if it is assigned to datacenter DC2 and DC3 , then CT is 2 and 5, and CO is 7 and 29, respectively. In the second phase, the CT s are normalized as 0.40, 0.40 and 1.00, and the COs are normalized as 0.07, 0.24 and 1.00 on three datacenters, respectively. Then the F is calculated as 0.23, 0.32 and 1.00 on the datacenters. As the minimum F is achieved on datacenter DC1 , UR U2 is assigned to datacenter DC1 in the third phase. The OC of datacenter DC1 is updated to 2 and the CT of UR U2 is set to 2. In the similar fashion, URs U3 to U10 are assigned to datacenters DC1 , DC3 , DC2 , DC2 , DC2 , DC1 , DC3 and DC1 , respectively, as shown in Table 4. The summary of step by step process for RESA is shown in Table 5. We also produce the Gantt chart for RR and random algorithms by following [10,16] in Table 6 and Table 7, respectively. The comparison of the proposed and existing algorithms in terms of four performance metrics is shown in Table 8. The comparison results show the superiority of the proposed algorithm over the existing algorithms. Theorem 1. The overall time complexity of RESA (Algorithm 1) is O(Knmpo). Proof. The phase 1 of Algorithm 1 finds the CT and CO, which takes O(nmpo) time. Then it calls the Procedure 1 to select a suitable datacenter, which takes
An Efficient Renewable Energy-Based Scheduling Algorithm
2 2 3 2 68% (10) 43% (8) 68% (10) 65% (2)+ 65% (2)+ 17% (3) 17% (3) 17% (3) l=1 l=2 l=3 l=4
4
3
2
1
Table 5. Summary of step by step process for RESA algorithm
3
DC1 CT DC2 DC3 DC1 CO DC2 DC3 DC1 F DC2 DC3 min(F ) Datacenter
l=5 l=6l=7l=8l=9
3 2 2 4 3 1 2 3 2 35% (7) 18% (6)+ 18% (6) 35% (7) 58% (5)+ 58% (5)+ 58% (5) 58% (5) 58% (5) 18% (6) 18% (6) l=1 l=2 l=3 l=4 l=5 l=6l=7l=8l=9 2 3 2 49% (9) 66% (4) 50% (1)+ 66% (4) 49% (9) l=2 l=3 l=4
3
2
4
3
2
3 58% (5) 69% (2) 69% (2) l=1
2 58% (5) 69% (2) 69% (2) l=2
2
2
3 59% (1)+ 26% (7) 59% (1)+ 26% (7) 59% (1)+ 26% (7) l=3
2
4
3
1
59% (1) 68% (10) 68% (10)
l=4
2
61% (3) 61% (3) l=2
U4 4 2 3 3 12 5 0.63 0.75 0.58 0.58 DC3
U5 5 5 5 7 5 11 0.82 0.73 1.00 0.73 DC2
l=5
3
l=6
l=7
1
2
l=6
l=7
2
4
l=8 l=9 3
2
l=8 l=9 3
2 59% (1)
2 59% (1)
3 2 59% (1) 59% (1)
59% (1)
59% (1)
59% (1) 59% (1)
U6 5 2 4 27 5 9 1.00 0.29 0.57 0.29 DC2
U7 5 1 7 21 3 30 0.71 0.12 1.00 0.12 DC2
U8 1 3 4 1 8 10 0.18 0.78 1.00 0.18 DC1
U9 3 9 2 6 19 4 0.32 1.00 0.22 0.22 DC3
U10 2 8 3 3 14 4 0.23 1.00 0.33 0.23 DC1
4
3
2
1
3
48% (5)+ 48% (5) 43% (8)
59% (1)+ 59% (1)+ 59% (1)+ 59% (1) 65% (2) 65% (2) 48% (5) 48% (5) 48% (5) 17% (3) 17% (3) 17% (3) l=1 l=2 l=3 l=4 l=5 l=6 l=7 l=8 l=9 3 39% (4)+ 58% (9) 39% (4)+ 58% (9) 39% (4)+ 58% (9) l=1
2 2 4 39% (4)+ 58% (9) 58% (9) 58% (9) 39% (4)+ 58% (9) 58% (9) 58% (9) 39% (4)+ 58% (9) 58% (9) 58% (9) l=2 l=3 l=4
3
1
2
3
2
l=5
l=6
l=7
l=8
l=9
4
3
2
l=7
l=8
l=9
2 2
61% (3) 61% (3) l=1
U3 3 5 3 1 23 12 0.32 1.00 0.56 0.32 DC1
3
59% (1) 65% (4) 65% (4) 65% (4) 65% (4)
2 4 3 65% (8) 65% (8) 65% (8) 65% (8) 58% (5) 58% (5) 58% (5) l=3 l=4 l=5 3
2
59% (1) DC1
DC1 DC2 DC3
2 59% (1)+ 26% (7) 59% (1)+ 26% (7) 59% (1)+ 26% (7) l=2
U2 2 2 5 2 7 29 0.23 0.32 1.00 0.23 DC1
Table 7. Gantt chart for random algorithm
Table 6. Gantt chart for RR algorithm 2 59% (1)+ 26% (7) 59% (1)+ 26% (7) 59% (1)+ 26% (7) l=1
U1 4 5 2 18 14 2 0.90 0.89 0.26 0.26 DC3
l=5 l=6l=7l=8l=9
DC2
2 49% (9) 66% (4) 50% (1)+ 49% (9) l=1
61% (3) 56% (6) 56% (6) 49% (9) 49% (9) 61% (3) 56% (6) 56% (6) 49% (9) 49% (9) l=3 l=4 l=5 l=6 l=7 l=8 l=9
DC3
DC3
DC2
DC1
Table 4. Gantt chart for RESA algorithm
91
2
3 55% (7) 56% (6) 56% (6) 55% (7) 56% (6)+ 56% (6)+ 55% (7)+ 20% (10) 20% (10) 20% (10) l=1 l=2 l=3
2 3 2 55% (7) 55% (7) 55% (7) 55% (7) 55% (7) 55% (7) 55% (7) 55% (7) 55% (7) l=4
l=5
l=6
O(m) time. Next, it calls the Procedure 2 to assign and execute the UR on the selected datacenter, which takes O(po) time in the worst case. However, Algorithm 1 is invoked K times (say). Therefore, the overall time complexity is [K × O(nmpo)] + [K × n × O(m)] + [K × O(po)] = O(Knmpo). Phase 1
Phase 2
Phase 3
Lemma 1. The mapping function f : U → DC is surjective. Proof. RESA assigns a UR to a datacenter by checking that the sum of the resource slot’s utilization and the utilization of the UR does not exceed the maximum utilization. Here, utilization of the UR is between 1% to 100% and it varies with respect to the datacenters. As |U | >> |DC| or n >> m, every datacenter is assigned with some URs and these URs may share the same resource slot of the datacenter subjected to maximum utilization. As a result, there is no datacenter left out without URs. Therefore, the mapping function f : U → DC is surjective.
92
S. K. Nayak et al.
Table 8. Comparison of four performance metrics for RESA, RR and random algorithms Algorithm
M (in time EC (in energy OC (in cost |U RE| units) units) units)
RESA
5
6850
31
13
RR
8
17350
92
24
Random
9
18565
98
24
Lemma 2. The mapping between the URs to the resource slots of the datacenter is injective iff the utilization of the URs in any datacenter exceeds 50%. Proof. RESA can assign two URs, Ui and Ui , 1 ≤ i , i ≤ n, i = i , to the same resource slot of the datacenter DCj , 1 ≤ j ≤ m iff U Ti j + U Ti j ≤ 100%. It is given that U Ti j > 50% or U Ti j > 50%. Therefore, these URs cannot be assigned to the same resource slot of the datacenter. Alternatively, each resource slot cannot be accommodated with more than one URs. Therefore, the mapping between the URs to the resource slots of the datacenter is injective. Lemma 3. If a UR is estimated to complete in two different datacenters simultaneously, then the datacenter with the lowest cost is selected for that UR. Proof. Let us assume that UR Ui is completed in datacenter DCj at CTij and datacenter DCj at CTij . It is given that CTij = CTij . The fitness value of UR Ui on datacenter DCj is calculated as follows. Fij = λ × norm(CTij ) + (1 − λ) × norm(COij ) Similarly, the fitness value of UR Ui on datacenter DCj is calculated as follows. Fij = λ × norm(CTij ) + (1 − λ) × norm(COij ) If Fij > Fij , then UR Ui is assigned to datacenter DCj . Otherwise, it is assigned to datacenter DCj . It can be simplified as follows. Fij > Fij ⇒ λ × norm(CTij ) + (1 − λ) × norm(COij ) > λ × norm(CTij ) + (1 − λ) × norm(COij ) ∵ CTij = CTij ⇒ λ × norm(COij ) > λ × norm(COij ) ⇒ norm(COij ) > norm(COij ) ⇒ COij > COij Therefore, UR Ui is assigned to datacenter DCj , which results in the lowest cost. Lemma 4. If a UR is estimated to take same cost in two different datacenters, then the datacenter with the earliest CT is selected for that UR. Proof. The proof is the same as Lemma 3. Lemma 5. If a datacenter takes the earliest CT and CO for a UR, then that datacenter is selected irrespective of the value of λ.
An Efficient Renewable Energy-Based Scheduling Algorithm
93
Proof. Let datacenter DCj takes CTij and COij for UR Ui in comparison to another datacenter DCj , which takes CTij and COij . It is given that CTij < CTij and COij < COij ⇒ norm(CTij ) < norm(CTij ) and norm(COij ) < norm(COij ) ⇒ λ × norm(CTij ) < λ × norm(CTij ) and (1 − λ) × norm(COij ) < (1 − λ) × norm(COij ) ⇒ λ × norm(CTij ) + (1 − λ) × norm(COij ) < λ × norm(CTij ) + (1 − λ) × norm(COij ) ⇒ Fij < Fij Therefore, UR Ui is assigned to datacenter DCj .
5
Performance Metrics, Datasets and Simulation Results
This section presents four performance metrics, the generation of five different datasets and simulation results. 5.1
Performance Metrics
We use four performance metrics, namely M , EC, OC and |U RE| resources to compare the proposed and existing algorithms. These metrics are defined as follows. The M of a datacenter DCj (i.e., Mj ), 1 ≤ j ≤ m, is the maximum CT of the URs that are assigned to that datacenter. However, overall M is the maximum of all the CT s of the datacenters. Mathematically, M = max(Mj ), 1 ≤ j ≤ m. The EC is the total amount of energy required to execute all the URs in the datacenters’ resources and it is calculated as described in Sect. 3.2. The OC is the total cost gained by the CSPs for executing the URs and it is calculated as described in Sect. 3.2. Alternatively, it is the sum of the cost of using NRE and RE resource slots. The |U RE| resources is the total number of RE resource slots of the datacenters used to execute all the URs. 5.2
Datasets
We generated five datasets in which each dataset contains three different instances. These datasets are generated by the pre-defined function, randi, of MATLAB R2017a. This function returns a 2-D integer array using the discrete uniform distribution on the given interval. The rows and columns of the array represent URs and datacenters, respectively, and the array is represented by URs × datacenters. To select a wide variety of URs, datacenters and intervals, we follow the well-known Monte Carlo method [22] as follows.
94
S. K. Nayak et al.
Table 9. Comparison of M , EC, OC and |U RE| for RESA, RR and random algorithms M RESA RR Random
RESA
EC RR
Random
RESA
OC RR
Random
RESA
1533 1358 1499
74428905 73577120 74849105
444813335 443654830 446569175
439204225 453793255 460535005
773562 755997 754474
4228062 4168973 4236899
4158470 4281804 4360274
70241 70394 71167
1643 1608 1491
1674 1845 1612
126319995 891023145 126759150 870422385 124661385 882644790
216 216 211
2578 2295 2341
2317 2563 2416
224425155 1805839200 1781000940 2374636 17076916 16976451 197795 1847563 1826773 224685435 1760172230 1767916780 2368048 16775597 16775991 201938 1827420 1822893 220132700 1774122800 1758765700 2329875 16865780 16720189 197352 1841633 1808720
i1 i2 i3
299 294 292
3280 3479 3561
3274 3501 3538
408465360 3591663065 3593372040 4393962 34308194 34182230 355171 3534790 3517949 405928155 3588989840 3603123675 4356600 34231837 34198062 352187 3504015 3541726 405147270 3581061730 3632697640 4382475 34122022 34479219 351375 3526839 3600477
i1 i2 i3
450 446 450
5437 5513 5293
5445 5783 5293
767065705 7322619235 7370971110 8378616 70167828 70691777 648829 6902719 7007246 765533565 7386703120 7303789235 8284972 70633870 69974505 656224 6991771 6918273 767302105 7257734730 7410603900 8280241 69280800 70537691 653569 6838319 6995578
Dataset
Instance
1000 × 20
i1 i2 i3
181 162 178
1254 1217 1320
2000 × 30
i1 i2 i3
174 168 169
4000 × 40
i1 i2 i3
8000 × 50
16000 × 60
896693565 1333253 8548678 888788650 1327548 8302101 879682715 1305842 8379219
|U RE| RR Random 492015 489694 497864
480342 495545 500565
8539553 116501 941808 8386787 116358 935279 8395016 115401 927865
951647 962645 936648
Step 1: We choose the number of URs as 1000, 2000, 4000, 8000 and 16000, and the number of datacenters as 20, 30, 40, 50 and 60, respectively, for five different datasets. Similarly, we choose the interval of D, N and U T datasets as [10–100], [10–50] and [10–70], respectively. The maximum number of resources is set as 50 in which NRE and RE resources are generated randomly. The cost of using NRE and RE resource slots is set as [2–10] and 1, respectively. Step 2: We generate three instances of each dataset by taking the same intervals, as discussed in Step 1. Step 3: We apply the proposed algorithm on the generated instances of datasets. We consider τ = 70% for RESA, RR and random as adopted in [21]. Step 4: We find the results in terms of four performance metrics and average the results of three instances of the datasets. 5.3
Simulation Results
We apply the existing algorithms, RR and random, on the generated instances of datasets to obtain the results. Then we compare the results of the proposed and existing algorithms using four performance metrics, namely M , EC, OC and |U RE| resources, as shown in Table 9. The visual comparisons are also carried out separately in Fig. 1 to Fig. 4 for easy visualization of results. The results demonstrate that the proposed algorithm RESA outperforms the RR and random in terms of four performance metrics. The rationality behind this better performance is that the URs are assigned to datacenters’ resources by calculating the minimum F , which is the linear combination of CT and CO. As CO is associated with resource utilization and |U RE| resources, the RESA also takes care of these metrics while assigning the UR to the datacenters.
An Efficient Renewable Energy-Based Scheduling Algorithm 104
RR
Random
1010
RESA
RR
Random
EC
RESA
95
M
109
103
108
0 100
0 0 0 0 0 ×2 0×3 0×4 0×5 0×6 0 800 400 200 160 Datasets
60 30 50 20 40 0 × 000 × 000 × 000 × 000 × 2 8 4 100 16 Datasets
Fig. 1. Comparison of M . 108
RR
Random
OC
107
RESA
RR
Random
106 105
106
60 50 40 30 20 0 × 000 × 000 × 000 × 000 × 6 8 4 2 1 Datasets
100
Fig. 3. Comparison of OC.
6
107
|U RE|
RESA
Fig. 2. Comparison of EC.
60 50 40 30 20 0 × 000 × 000 × 000 × 000 × 6 8 4 2 1 Datasets
100
Fig. 4. Comparison of |U RE|.
Conclusion
In this paper, we have presented a scheduling algorithm, RESA, for cloud computing. This algorithm iterates through a three-phase process, and is shown to require O(Knmpo) time for n URs, m datacenters, p resources, o slots and K iterations. The simulation results of the proposed and existing algorithms have been carried out using five datasets and compared using four performance metrics, namely M , EC, OC and |U RE| resources. The comparison results demonstrate the efficacy of the proposed algorithm over the existing algorithms. However, we have considered an equal number of resources per datacenter in this scheduling problem. In our future work, we will extend this problem by taking a different number of resources, as seen in small, medium and large datacenters. Moreover, we have assumed that there are no outages during the scheduling process. In practice, the datacenter faces various outages, such as human error, cooling failure, hardware failure, software failure and many more. It can be addressed to make the proposed algorithm a more practical one.
References 1. Khayer, A., Talukder, M.S., Bao, Y., Hossain, M.N.: Cloud computing adoption and its impact on SMEs’ performance for cloud supported operations: a dual-stage analytical approach. Technol. Soc. 60, 101225 (2020) 2. Gill, S.S., et al.: Holistic resource management for sustainable and reliable cloud computing: an innovative solution to global challenge. J. Syst. Softw. 155, 104–129 (2019)
96
S. K. Nayak et al.
3. Gholipour, N., Arianyan, E., Buyya, R.: A novel energy-aware resource management technique using joint VM and container consolidation approach for green computing in cloud data centers. Simul. Model. Pract. Theor. 104, 102127 (2020) 4. Panda, S.K., Jana, P.K.: Normalization-based task scheduling algorithms for heterogeneous multi-cloud environment. Inf. Syst. Front. 20(2), 373–399 (2018) 5. Panda, S.K., Jana, P.K.: An efficient request-based virtual machine placement algorithm for cloud computing. In: Krishnan, P., Radha Krishna, P., Parida, L. (eds.) ICDCIT 2017. LNCS, vol. 10109, pp. 129–143. Springer, Cham (2017). https:// doi.org/10.1007/978-3-319-50472-8 11 6. MarketsandMarkets: Cloud computing market. https://www.marketsandmarkets. com/Market-Reports/cloud-computing-market-234.html. Accessed 6 Aug 2020 7. Qiu, C., Shen, H.: Dynamic demand prediction and allocation in cloud service brokerage. IEEE Trans. Cloud Comput. (2019) 8. Panda, S.K., Jana, P.K.: Efficient task scheduling algorithms for heterogeneous multi-cloud environment. J. Supercomput. 71(4), 1505–1533 (2015). https://doi. org/10.1007/s11227-014-1376-6 9. Kamiya, G.: The carbon footprint of streaming video. https://www.iea. org/commentaries/the-carbon-footprint-of-streaming-video-fact-checking-theheadlines. Accessed 6 Aug 2020 10. Toosi, A.N., Buyya, R.: A fuzzy logic-based controller for cost and energy efficient load balancing in geo-distributed data centers. In: Proceedings of the 8th International Conference on Utility and Cloud Computing, pp. 186–194. IEEE Press (2015) 11. Nayak, S.K., Panda, S.K., Das, S.: Renewable energy-based resource management in cloud computing: a review. In: Tripathy, A.K., Sarkar, M., Sahoo, J.P., Li, K.-C., Chinara, S. (eds.) Advances in Distributed Computing and Machine Learning. LNNS, vol. 127, pp. 45–56. Springer, Singapore (2021). https://doi.org/10. 1007/978-981-15-4218-3 5 12. Oberhaus, D.: Amazon, Google, Microsoft: Here’s who has the greenest cloud. https://www.wired.com/story/amazon-google-microsoft-green-clouds-andhyperscale-data-centers/. Accessed 11 Apr 2020 13. Grange, L., Da Costa, G., Stolf, P.: Green it scheduling for data center powered with renewable energy. Fut. Gener. Comput. Syst. 86, 99–120 (2018) 14. Minxian, X., Buyya, R.: Managing renewable energy and carbon footprint in multicloud computing environments. J. Parallel Distrib. Comput. 135, 191–202 (2020) 15. Nayak, S.K., Panda, S.K., Das, S, Pande, S.K.: A renewable energy-based task consolidation algorithm for cloud computing. In: Electric Power and Renewable Energy Conference, pp. 1–10. Springer (2020) 16. Lee, Y.C., Zomaya, A.Y.: Energy efficient utilization of resources in cloud computing systems. J. Supercomput. 60(2), 268–280 (2012) 17. Panda, S.K., Jana, P.K.: An energy-efficient task scheduling algorithm for heterogeneous cloud computing systems. Clust. Comput. 22(2), 509–527 (2018). https:// doi.org/10.1007/s10586-018-2858-8 18. Rajeev, T., Ashok, S.: Dynamic load-shifting program based on a cloud computing framework to support the integration of renewable energy sources. Appl. Energy 146, 141–149 (2015) 19. Beloglazov, A., Abawajy, J., Buyya, R.: Energy-aware resource allocation heuristics for efficient management of data centers for cloud computing. Future Gener. Comput. Syst. 28(5), 755–768 (2012)
An Efficient Renewable Energy-Based Scheduling Algorithm
97
20. Esfandiarpoor, S., Pahlavan, A., Goudarzi, M.: Structure-aware online virtual machine consolidation for datacenter energy improvement in cloud computing. Comput. Electr. Eng. 42, 74–89 (2015) 21. Hsu, C.-H., Slagter, K.D., Chen, S.-C., Chung, Y.-C.: Optimizing energy consumption with task consolidation in clouds. Inf. Sci. 258, 452–462 (2014) 22. Cunha, Jr., A., Nasser, R., Sampaio, R., Lopes, H., Breitman, K.: Uncertainty quantification through the Monte Carlo method in a cloud computing setting. Comput. Phys. Commun. 185(5), 1355–1363 (2014)
A Revenue-Based Service Management Algorithm for Vehicular Cloud Computing Sohan Kumar Pande1 , Sanjaya Kumar Panda2(B) , and Satyabrata Das1 1
Veer Surendra Sai University of Technology, Burla 768018, Odisha, India [email protected], [email protected] 2 National Institute of Technology, Warangal 506004, Telangana, India [email protected]
Abstract. Vehicular cloud computing (VCC) is an emerging research area among business and academic communities due to its dynamic computing capacity, on-road assistance, infotainment services, emergency and traffic services. It can mitigate the hindrance faced in the existing infrastructure, relying on the cellular networks, using roadside units (RSUs). Moreover, the existing cellular networks cannot provide better quality services due to the influx of vehicles. In the VCC, RSUs can prefetch the content and data required by the vehicles, and provide them in terms of services when vehicles are residing within the communication range of RSUs. However, RSUs suffer from their limited communication range and data rate. Therefore, it is quite challenging for the RSUs to select suitable vehicles to provide services, such that its revenue can be maximized. In this paper, we propose a revenue-based service management (RBSM) algorithm to tackle the above-discussed challenges. RBSM is a two-phase algorithm that finds data rate zones of the vehicles and selects a suitable vehicle at each time slot to maximize the total revenue of the RSUs, the total download by the vehicles and the total number of completed requests. We assess the performance of RBSM and compare it with an existing algorithm, namely RSU resource scheduling (RRS), by considering various traffic scenarios. The comparison results show that RBSM performs 87%, 90% and 170% better than RRS in terms of total revenue, download and number of completed requests. Keywords: Vehicular cloud computing · Revenue-based service management · Roadside unit · Resource scheduling · Cellular networks
1
Introduction
Innovation and research in technologies for smart cities urged the information technology (IT) industries, professionals and researchers to explore the features of VCC [1–5]. VCC is the integration of cloud computing, wireless networks, and smart vehicles to provide better on-demand solutions and expand the capabilities of existing infrastructure [6–8]. It offers numerous services in the form c Springer Nature Switzerland AG 2021 D. Goswami and T. A. Hoang (Eds.): ICDCIT 2021, LNCS 12582, pp. 98–113, 2021. https://doi.org/10.1007/978-3-030-65621-8_6
A Revenue-Based Service Management Algo. for Vehicular Cloud Computing
99
of Network as a Service (NaaS), STorage as a Service (STaaS), Computing as a Service (CaaS), COoperation as a Service (COaaS), INformation as a Service (INaaS), ENtertainment as a Service (ENaaS), Traffic Safety as a Service (TSaaS) and many more [3–5]. Nowadays, maximum smart vehicles are facilitated with the modern onboard unit (OBU), which comprises various resources, such as communication systems, storage devices, computing processors, cognitive radios, programmable sensors, global positioning systems and other advanced things [2,3,5]. These highly sophisticated resources remain idle for a considerable amount of time when the vehicles are at the parking lot, roadways or traffic. These underutilized resources can be used by the RSUs to deliver various services offered by the cloud service provider (CSP). It can be performed by creating an infrastructure in which services are provisioned by creating virtual machines (VMs) [4,9]. This infrastructure enables three types of communications, namely vehicle to vehicle (V2V), vehicle to infrastructure (V2I) and infrastructure to vehicle (I2V) [4,5]. Generally, vehicles on the move request various services, such as information, video, entertainment and many more, from the CSP through cellular networks [1,10]. Recent studies reveal that video content contributes to 70% of data traffic [11]. With the increasing demand for video content by smart vehicles, the existing cellular networks cannot provide better quality services [1,10,11]. Therefore, RSUs can augment the cellular networks by prefetching the content and data required by the vehicles [1] and deliver them when vehicles are residing within the communication range of RSUs. Here, RSUs can make a significant amount of revenue by providing better quality services [1,10]. However, the main challenging issues are the coverage and data rate of RSUs, which are maximum up 1 km and 28 MBPS, respectively [12]. On the other hand, RSUs cannot provide content and data to all the vehicles. Therefore, it is quite challenging for the RSUs to select suitable vehicles to provide content and data, such that its revenue can be maximized. To tackle the above-discussed challenges, we propose a novel algorithm, called RBSM, to maximize the total revenue of the RSUs. RBSM comprises of two phases. In the first phase, it finds the data rate zones of the vehicles. The second phase selects a suitable vehicle at each time slot to download the content from the RSU. For this, it considers the arrival time, velocity, revenue and requested content size of the vehicles. We compare RBSM with an existing algorithm, called RRS [1], using three performance metrics, namely total revenue (T R), total download (T D) by the vehicles and total number of completed requests (T CR). The comparison is performed by simulating both the algorithms in a virtual environment. The results show that RBSM outperforms RRS in all the performance metrics. The significant contributions of our work are as follows. – Development of a novel algorithm, RBSM, to maximize the revenue of the RSUs. – RBSM considers arrival time, velocity, revenue and requested content size of the vehicles for selecting a suitable vehicle, which results in the maximum revenue.
100
S. K. Pande et al.
– Extensive simulation is carried out to compare RBSM and RRS by considering various traffic scenarios with twenty-five instances of five datasets. The organization of this paper is as follows. Section 2 discusses scope of VCC and its related problems, followed by the VCC model and problem statement in Sect. 3. Section 4 presents the proposed algorithm and Sect. 5 discusses performance metrics used to evaluate the algorithms. Simulation results and concluding remarks are presented in Sect. 6 and Sect. 7, respectively.
2
Related Work
VCC has gained a remarkable attraction among IT industries, professionals, researchers in today’s technology-driven world due to its broad area of services [2,3,5]. It unfolds a set of problems and challenges along with the services [1,4]. In VCC, V2V, V2I and I2V, communications enable the smart vehicles to sublet their underutilized resources to the CSP through the RSUs [4]. RSUs make revenue by delivering various services, such as displaying advertisements, providing navigation facilities, producing video and game content, and many more [1,13–15]. Here, smart vehicles lease their resources through the RSUs for providing services to needy vehicles and gain enormous revenue. Therefore, many researchers have proposed different models and algorithms to maximize the total revenue by providing better quality services. Einziger et al. [13] have explored that RSUs can act as brokers to manage and send the advertisements to the vehicles intelligently. Moreover, RSUs can earn revenue from advertisers. Fux et al. [16] have suggested that the owners of the physical resources can sell their content to the vehicles. They have proposed a game model to improve the quality and stabilize the price war among the RSUs owners. In [14], the cooperation and competition between parked vehicles and RSUs are studied, and a game-theoretic approach is proposed to increase individual profit. Wang et al. [17] have proposed a distributed game theoretical framework to improve the performance and efficiency of the network infrastructure. Some researchers have focused on data and video delivery by the RSUs to earn incentives. Researchers in [1,11] have discussed that the percentage of video content requested by vehicles is approximately 70% and around 90% of the 5G data traffic by 2028. Han et al. [11] have discussed and identified the different types of obstacles and challenges present in the network that degrades the performance of content delivery. Some other models and strategies are proposed by Sun et al. [15] and Xu et al. [18] to improve the process of data delivery. Zhou et al. [19] have proposed an innovative data delivery scheme that detects optimal relay nodes to improve performance and security with the help of k-nearest neighbors-based machine learning system. Researchers have discussed that existing infrastructure cannot cope with the rapid demand for data and video services in real-time [1,10]. A minor delay of 2 s can cause high degradation in QoS and it can ultimately lead to customer dissatisfaction [11]. Zhao et al. [20] have discussed content prefetching to solve
A Revenue-Based Service Management Algo. for Vehicular Cloud Computing
101
the delay problems associated with the vehicular ad-hoc networks. They have proposed a module that predicts the vehicle mobility to select RSU and suitable time for prefetching the respective content. Yao et al. [21] have proposed a solution, which downloads popular content into vehicles that are going to visit different hot spot regions shortly. They have discussed that vehicles staying at the hot spot region can provide more services to others for a longer time. Al-Hilo et al. [1] have proposed a revenue-driven video delivery model to select a vehicle for delivering content, which gives the highest profit in each time slot. Most of the above-discussed works do not consider the vehicles’ arrival time and velocity to design an incentive model for the RSUs. Our proposed algorithm RBSM is somehow different from others, as it takes advantage of the heterogeneity of the content in terms of arrival time, velocity, revenue and requested content size of the vehicles. RBSM selects suitable vehicles for providing content, such that the revenue of RSUs is maximized.
3 3.1
Vehicular Cloud Model and Problem Statement Vehicular Cloud Model
We consider a heterogeneous vehicular cloud environment over a geographical area or grid. This area consists of cellular network, multi-way lanes, RSU, traffic junctions, parking lot and vehicles, as shown in Fig. 1. Vehicles spend a significant amount of time in the parking lot of airports, shopping malls and hospitals. These vehicles are equipped with different types of modern resources, which are generally underutilized and idle for a considerable amount of time. VMs can be formed using these underutilized resources of vehicles to expand the CSP capabilities and provide various services. Here, these vehicles are connected to the RSUs and the RSUs are connected to the CSP with high-speed Internet facilities. Besides the vehicles present at the grid, many vehicles also cross the grid at different times. Vehicles on the move or the grid request various services, such as traffic information, video content, live streaming, gaming services, infotainment services and many more. Previously, these services were delivered to the vehicles from the CSP using the cellular networks. With an increase in the number of vehicles and the size of requests, the cellular network cannot provide realtime services. Here, RSUs play a significant role by co-operating with cellular networks. Whenever a vehicle enters a grid, it broadcasts a beacon beam containing detailed information, such as arrival time, residing time, velocity, revenue and path, which are received and processed by the RSU. In this scenario, if the RSU gets the information from the CSP regarding the requested content by the respective vehicle, then it can download or pre-fetch the content before the arrival of the vehicle to the grid. After downloading the requested content, RSU may save it locally or by incorporating the vehicles present in the parking lot. Consequently, when the vehicle reaches the communication range of the RSU, it can download the content directly from the RSU. RSU can earn revenue from the
102
S. K. Pande et al.
Fig. 1. A vehicular cloud model.
vehicle user through the cellular network for providing services to that vehicle. It encourages the RSU owner for active involvement in delivering services. Consider another scenario in which some vehicles may get part of the content from the RSU and rest part from the cellular networks, as the RSU communication range is limited. Alternatively, the vehicle on the move can be out of the communication range of the RSU. The requested content of the vehicles is of different size and revenue, so it is essential for the RSU to select an appropriate vehicle to provide service in order to generate maximum revenue. Here, RSU uses information, such as arrival time, velocity and revenue of the vehicles to deliver the services. 3.2
Problem Statement
Consider a geographical area, which is covered with a set of r RSUs, RSU = {RSU1 , RSU2 , RSU3 ,. . ., RSUr }, and a set of multi-way lanes. The coverage of RSUo , 1 ≤ o ≤ r, is Xo . The downlink data rate from RSU to vehicle depends on the distance between them. Therefore, the coverage of each RSU is divided into a set of n data rate zones, R = {R0 , R1 , R2 , . . ., Rn }. Each data rate zone Rs , 1 ≤ s ≤ n, has a downlink data rate of ds , which is calculated as follows [22]. ds = β log2 (1 + SN Rs ), 1 ≤ s ≤ n
(1)
where β is the bandwidth of RSU and SN Rs is the signal to noise ratio in data rate zone Rs . Consider a set of m vehicles, V = {V1 , V2 , V3 ,. . ., Vm }, that crosses the RSUo , 1 ≤ o ≤ r. Each vehicle, Vk , 1 ≤ k ≤ m, is represented with
A Revenue-Based Service Management Algo. for Vehicular Cloud Computing
103
5-tuple, i.e., . Here, V ID, V AT and V S denote vehicle id, arrival time and velocity, respectively. The revenue per MB and requested content size are denoted by V RDP and V RDS. The problem is to select suitable vehicles for receiving various services, such that the RSUs can generate maximum revenue. This problem is divided into two sub-problems. The first sub-problem finds the position of the vehicles, Vk , 1 ≤ k ≤ m, in the data rate zones Rs , 1 ≤ s ≤ n. The second sub-problem selects suitable vehicles to get desired services from the RSUs. These sub-problems are subjected to the following constraints. 1. A vehicle Vk , 1 ≤ k ≤ m, can only download the content at time t from the data rate zone Rs , 1 ≤ s ≤ n, of RSUo , 1 ≤ o ≤ r, (i.e., F [k, t, s]), as follows. 1, if R[s, ST ART ] ≤ Xo × V tSk ≤ R[s, EN D] F [k, t, s] = (2) 0, otherwise where R[s, ST ART ] and R[s, EN D] define the boundary of Rs , 1 ≤ s ≤ n, Xo defines the coverage of RSUo , 1 ≤ o ≤ r, and Tk = VXSok . 2. The total download by the vehicle Vk , 1 ≤ k ≤ m, from RSUo is DD[k, o] = Tk t=1 F [k, t, s] × ds . Here, Tk is the time required by vehicle Vk to cover Xo . 3. The total download by the vehicle Vk , 1 ≤ k ≤ m, is always less than or equal to the requested content size. Mathematically, DD[k, o] ≤ V RDSk . 4. At time t, only one vehicle Vk , 1 ≤ k≤ m can download its content from the m RSUo , 1 ≤ o ≤ r. Mathematically, k=1 F [k, t, s] = 1, 1 ≤ t ≤ max(Tk ), 1 ≤ s ≤ n.
4
Proposed Algorithm
The proposed algorithm RBSM is an online algorithm in a vehicular cloud environment that selects different vehicles to receive content from the RSUs, such that the revenue is maximized. RBSM is divided into two phases, as follows. (1) Finding the data rate zone of a vehicle at a particular time (2) Selection of the vehicle to receive the content. The pseudocode for the proposed algorithm RBSM is shown in Algorithm 1, Procedures 1 and 2, respectively. Table 1 represents the mathematical notations and their definitions used in the pseudocode. Firstly, RBSM maintains all the upcoming vehicles in a global queue, Q, to provide content (Line 1, Algorithm 1). Then travel time and exit time are calculated for each vehicle (Lines 2–5). Let us explore the same with an example for clear understanding. We consider that three vehicles, V1 , V2 , V3 , have requested content to the RSU and their details are shown in Table 2. We also consider an RSU with seven data rate zones, namely R1 , R2 , R3 , R4 , R5 , R6 , R7 and their downlink data rate 2 MBPS, 3 MBPS, 4 MBPS, 5 MBPS, 4 MBPS, 3 MBPS, 2 MBPS, respectively. The covering range of the RSU is considered as 100 m.
104
S. K. Pande et al.
For vehicle V1 , travel time (T T IM E[1]) is calculated as 3 s (i.e., 100 33 = 3.03 ≈ 3) and exit time (V ET [1]) is calculated as 3 s (i.e., V AT [1] + T T IM E[1]−1 = 1+3−1 = 3). In the similar manner, T T IM E[2] and T T IM E[3] are calculated as 4 s and 7 s, respectively, and V ET [2] and V ET [3] are calculated as 5 s and 9 s, respectively.
Table 1. Mathematical notations and their definitions Notation
Definition
Q
Global queue
X
Covering range of RSU
m
Numbers of vehicles
n
Number of data rate zones
V
Set of vehicles
R[s]
Data rate zone s (in MBPS)
V ID[k]
ID of vehicle k
V AT [k]
Arrival time of vehicle k (in seconds)
V S[k]
Velocity/Speed of vehicle k (in meters/seconds)
V RDP [k]
Requested content revenue per MB of vehicle k
V RDS[k]
Requested content size of vehicle k
T T IM E[k] Travel time of vehicle k (in seconds) V ET [k]
Exit time of vehicle k (in seconds)
V P RZ[k, t] Data rate zone of vehicle k at time t ASSIGN [t] Selected vehicle to provide content at time t TR
Total revenue of the RSUs
TD
Total download by the vehicles
T CR
Total numbers of completed requests
Next, RBSM calls Procedure 1 to find the data rate zones of each vehicle (Line 6, Algorithm 1). Procedure 1 finds the duration of a vehicle in each data rate zone (Line 4, Procedure 1). Then the vehicles are assigned to the data rate zones according to time slots (Lines 5–7). This process is repeated for all the vehicles in Q (Lines 1–10). Let us explain this procedure with the earlier discussed example. For vehicle V1 , T T IM E[1] is 3 s. Therefore, temp2 is calculated as 0 (i.e., round( 37 × 1) = 0) for s = 1. As a result, there is no assignment in the data rate zone R[1]. For s = 2, temp2 is calculated as 1 (i.e., round( 37 × 2) = 1). Therefore, V P RZ[1, 1] is updated to R[2] = 3 MBPS at t = 1 s. In the similar manner, V P RZ[1, 2] and V P RZ[1, 3] are calculated as 5 MBPS and 3 MBPS, respectively. The same procedure is repeated for vehicles, V2 and V3 , and the assignment of data rate zones of three vehicles are shown in Table 3. Next, RBSM calls Procedure 2 to select suitable vehicles that can download their requested content (Line 7, Algorithm 1). In Procedure 2, all the vehicles, in
A Revenue-Based Service Management Algo. for Vehicular Cloud Computing
105
Algorithm 1. Pseudocode for RBSM Input: Q, X, R, n, m, V ID, V AT, V S, V RDP, V RDS Output: T R, T D, T CR 1: while Q = 0 do 2: for k = 1, 2, 3, . . . , m do X ) round is a function to return a rounded 3: T T IM E[k] = round( V S[k] number. 4: V ET [k] = V AT [k] + T T IM E[k] - 1 5: end for 6: Call F IN D-V EHICLE-DAT A RAT E ZON E(R, n, m, V AT, T T IM E, V ET ) 7: Call SELECT -V EHICLE(n, m, V AT, V RDP, V RDS, V ET, V P RZ) 8: end while
Procedure 1.
F IN D-V EHICLE-DAT A RAT E ZON E(R, n, m, V AT, T T IM E, V ET )
1: for k = 1, 2, 3, . . . , m do 2: temp1 = 1 3: for s = 1, 2, 3, . . . , n do E[k] × s) 4: temp2 = round( T T IM n 5: for t = temp1, temp1 + 1, temp1 + 2, . . . , temp2 do 6: V P RZ[k, V AT [k] + t − 1] = R[s] 7: end for 8: temp1 = temp2 + 1 9: end for 10: end for Table 2. Specification of three vehicles Vehicle AT
VS V RDP V RDS (in m/s) (in Rs.) (in MB)
V1
01
33
10
08
V2
02
25
30
10
V3
03
15
05
22
the Q, are sorted in the descending order of V RDP (Line 2 of Procedure 2). Then the vehicle with the highest value of V RDP is selected for further processing (Line 4). Now, the procedure assigns the time slots to the vehicle based on the exit time (Lines 5–25). However, before assigning the time slots, it checks whether the RSU is available to provide content or not (Line 7). This process continues until the vehicle downloads the entire content or there are no more available time slots (Lines 7–19). Finally, T R and T D are calculated (Lines 11–12 and Lines 14–16) followed by task completion status and T CR are updated (Line 21 and Line 22).
106
S. K. Pande et al.
Procedure 2.
SELECT -V EHICLE(n, m, V AT, V RDP, V RDS, V ET, V P RZ)
1: ASSIGN [] = 0, T ASK COM P [] = 0, T R = 0, T D = 0, T CR = 0 2: V EH SORT = sortdes(V RDP ) sortdes is a function to sort the elements in descending order. 3: for k = 1, 2, 3, . . . , m do 4: veh = V EH SORT [k] 5: rdw = V RDS[veh], temp3 = V ET [veh] 6: while rdw = 0 and temp3 ≥ V AT [veh] do 7: if ASSIGN [temp3] == 0 then 8: temp4 = V P RZ[veh, temp3] 9: if rdw < temp4 then 10: T R += (rdw × V RDP [veh]) 11: T D += rdw 12: rdw = 0 13: else 14: T R += (temp4 × V RDP [veh]) 15: T D += temp4 16: rdw −= temp4 17: end if 18: ASSIGN [temp4] = veh 19: end if if rdw == 0 then 20: 21: T ASK COM P [k] = 1 22: T CR += 1 23: end if 24: temp3 −= 1 25: end while 26: end for
Let us discuss Procedure 2 with the earlier discussed example. Here, vehicles, V1 , V2 and V3 , are sorted in the descending order of V RDP , which results vehicles, V2 , V1 , V3 , respectively. As vehicle V2 contains highest V RDP (i.e., V RDP [2] = 30), it is selected for further processing. The V ET of vehicle V2 is 5 s. As a result, the assignment of time slots starts from 5 s. At t = 5 s, the data rate zone of vehicle V2 is 2 MBPS. Therefore, vehicle V2 is able to download 2 MB of data from RSU. However, the required download (rdw) of vehicle V2 is 10 MB. As a result, rdw is updated to 8 MB. Then T R and T D are updated to Rs. 60 and 2 MB, respectively. At t = 4 s, the data rate zone of vehicle V2 is 4 MBPS. In this time slot, vehicle V2 is able to download 4 MB of data. Therefore, rdw, T R and T D are updated to 4 MB, Rs. 180 and 6 MB, respectively. Similarly, at t = 3 s, vehicle V2 is able to download 4 MB of data. Here, rdw of vehicle V2 is updated to 0. It indicates that vehicle V2 has downloaded its desired content from the RSU. In the similar fashion, other vehicles, V1 and V3 are processed. Table 4 shows the assignment of vehicles to RSU at different time slots. We compare the proposed algorithm, RBSM, with an existing algorithm, RRS [1]. RRS tries to maximize the revenue at each time slot. Therefore, it selects a vehicle from the group of vehicles present in the communication range
A Revenue-Based Service Management Algo. for Vehicular Cloud Computing
107
Table 3. Movement of vehicles in seven data rate zones Time 2 MBPS 3 MBPS 4 MBPS 5 MBPS 4 MBPS 3 MBPS 2 MBPS t=1
V1
t=2
V2
t=3
V3
V1 V2
t=4
V1
V3
V2
t=5
V3
V2
t=6
V3
t=7
V3
t=8
V3
t=9
V3
Table 4. Assignment of vehicles to RSU using RBSM Time Selected Vehicle
t = 1t = 2t = 3t = 4t = 5t = 6t = 7t = 8t = 9 V1
V1
V2
V2
V2
V3
V3
V3
Table 5. Assignment of vehicles to RSU using RRS Time t = 1 t = 2 t = 3 t = 4 t = 5 t = 6 t = 7 t = 8 t = 9 Selected V1 V2 V2 V2 V3 V3 V3 V3 V3 Vehicle
V3
of RSU and provides content that gives maximum revenue at each time slot. The assignment of vehicles to RSU by RRS is shown in Table 5. We compare RBSM and RRS in terms of three performance metrics, namely T R, T D and T CR, respectively, as shown in Table 6. The comparison results clearly show that RBSM performs better than RRS in terms of all the performance metrics. Table 6. Comparison of RBSM and RRS algorithms T R (in Rs.) T D (in MB) T CR (in Nos.) RBSM 450
32
2
RRS
31
1
420
Lemma 1. If two vehicles are entering a grid at different times with the same speed and their travel times in the grid overlapping, then each vehicle can get a chance to download its content irrespective of V RDP and V RDS. Proof. Let us assume that two vehicles, Vk and Vk , are moving at the same speed and their arrival times, V ATk and V ATk , respectively, in the grid. We know that V ATk = V ATk . There are two cases. Case 1: If V ATk > V ATk , then V RDP [k ] is compared with V RDP [k ]. If V RDP [k ] > V RDP [k ] (or V RDP [k ] < V RDP [k ]), then vehicle Vk (or Vk ) is selected first and the assignment of time slots start from V ET [k ] (or V ET [k ]). If V RDS[k ]
108
S. K. Pande et al.
(or V RDS[k ]) is huge in size by considering the worst case, then vehicle Vk (or Vk ) is assigned with all the time slots during its existence in the grid. In this scenario, vehicle Vk (or Vk ) can only download its content from V ATk (V ETk ) to V ATk (V ETk ). Case 2: If V ATk < V ATk , then V RDP [k ] is compared with V RDP [k ]. If V RDP [k ] > V RDP [k ] (or V RDP [k ] < V RDP [k ]), then vehicle Vk (or Vk ) is selected first and the assignment of time slots start from V ET [k ] (or V ET [k ]). If V RDS[k ] (or V RDS[k ]) is huge in size by considering the worst case, then vehicle Vk (or Vk ) is assigned with all the time slots during its existence in the grid. In this scenario, vehicle Vk (or Vk ) can only download its content from V ETk (V ATk ) to V ETk (V ATk ). Therefore, each vehicle can get a chance to download its content irrespective of V RDP and V RDS. Lemma 2. If two vehicles are entering a grid at the same time and speed, then the vehicle requested content with the least cost can get a chance to download its content at a higher data rate. Proof. Let us assume that vehicle Vk is requested content with a higher revenue than vehicle Vk , i.e., V RDP [k ] > V RDP [k ]. Then vehicle Vk is selected first and assigned time slots between V ET [k ] to (V ET [k ] − γ) (say). The downlink data rate is increasing from V ET [k ] to V ET [k ] − γ. If V RDS[k ] is smaller in size, then vehicle Vk downloads its content prior to V ET [k ] − γ (say, ζ). Therefore, vehicle Vk can download its content from ζ. As the data rate at ζ is higher than the data rate at V ET [k ], the vehicle Vk downloads its content at a higher data rate. Lemma 3. If two vehicles are entering a grid at the same time with different speeds, then the vehicle requested content with the least cost can get a chance to download its content. Proof. Let us assume that vehicle Vk is requested content with a higher revenue than vehicle Vk , i.e., V RDP [k ] > V RDP [k ] and V S[k ] > V S[k ]. Then vehicle Vk is selected first and assigned time slots from V ET [k ]. As V S[k ] > V S[k ], V ET [k ] > V ET [k ]. As a result, vehicle Vk is out of the grid (the range of RSU) after time slot V ET [k ]. Therefore, vehicle Vk downloads its content from V ET [k ] to V ET [k ].
5
Performance Metrics
In this section, we discuss three performance metrics to compare the proposed algorithm with the existing algorithm.
A Revenue-Based Service Management Algo. for Vehicular Cloud Computing
5.1
109
Total Revenue
The total revenue (T R) is defined as the total amount of revenue paid by the vehicles to the RSUs for downloading their respective contents. Mathematically, it can be written as follows. max(V ET )
TR =
V P RZ[ASSIGN [t], t] × V RDP [ASSIGN [t]]
(3)
t=1
5.2
Total Download
The total download (T D) is defined as the total amount of content downloaded by all the vehicles from the RSUs. Mathematically, it can be defined as follows. max(V ET )
TD =
V P RZ[ASSIGN [t], t]
(4)
t=1
5.3
Total Number of Completed Requests
The total number of completed requests (T CR) is defined as the number of requested content by the vehicles, which are downloaded completely from the RSUs. Mathematically, it is stated as follows. T CR =
m
COM [k]
(5)
k=1
where COM [k] =
6
V ET [k] 1 if V RDS[k] == t=V AT [k] V P RZ[k, t] and ASSIGN [t] == k 0 Otherwise
Simulation Results
We simulated the proposed algorithm, RBSM, and the existing algorithm, RRS, in a virtual environment created on a system with Intel(R) Core(TM) i5-4210U CPU @ 1.70 GHz 1.70 GHz, 8.00 GB of primary memory and Windows 10 64-bit operating system and installed with MATLAB R2017a. It is important to note that the simulation, evaluation and performance comparison of RBSM and RRS are independent of the virtual environment. The performance of RBSM and RRS are compared using the performance metrics discussed in Sect. 5. To perform the extensive simulation of RBSM and RRS, we used five different datasets. The traffic density is different in each dataset and represented as XX vehicles 1000 textm . Here, XX represents 10, 20, 30, 40 and 50. Each dataset contains five distinct instances, namely i1 to i5. We considered that vehicles’ arrival followed the Poisson distribution. We used the truncated normal distribution to calculate the vehicles’ velocity. For the datasets’ generation, we followed the Monte Carlo
110
S. K. Pande et al.
simulation process, as used in [23–25]. Here, we assume that only one vehicle can download its content from the RSU at a particular time. Table 7 shows the values of the parameters used for the simulation of RBSM and RRS. The proposed algorithm, RBSM and the existing algorithm RRS are simulated, and their results are shown in Table 8. For better comparison, we generated the bar chart diagrams by taking the average of five results of each dataset, as shown in Figs. 2, 3, and 4, respectively. From the comparison, it is clear that RBSM performs better than RRS. It is wise to mention that RBSM performs 87%, 90% and 170% better than RRS in terms of T R, T D, and T CR, respectively. The reason behind the better performance of RBSM is that we consider the vehicle’s arrival time and velocity in the selection process to deliver content from RSUs. Table 7. Parameters and their values Parameters
Values
X
1000 m
Number of data rate zones 7 Traffic density
(10, 20, 30, 40 or 50) vehicles/1000 m
V AT
10–50 s
VS
10–40 m/s
V RDP
Rs. 10–100
V RDS
20–100 MB
Table 8. Simulation results of RBSM and RRS algorithms Density (vehicles/meters) Instance RBSM RRS TR T D T CR T R
T D T CR
10
i1 i2 i3 i4 i5
14620 17965 18533 17986 19006
276 419 387 365 367
04 06 05 04 05
07179 16154 13567 14516 07752
102 331 273 294 131
00 06 03 04 02
20
i1 i2 i3 i4 i5
19101 17266 17730 16535 15717
394 335 351 282 332
11 11 09 07 10
09451 06586 07689 09418 09128
165 105 130 183 178
01 01 01 05 02
30
i1 i2 i3 i4 i5
17948 17607 16954 16702 16064
310 331 320 354 337
18 22 18 24 18
07580 07394 06152 12188 08587
132 05 110 05 091 01 261 12 150 06 (continued)
A Revenue-Based Service Management Algo. for Vehicular Cloud Computing
111
Table 8. (continued) Density (vehicles/meters) Instance RBSM RRS TR T D T CR T R
T D T CR
40
i1 i2 i3 i4 i5
15482 16547 13445 14948 14388
173 199 167 181 169
21 19 20 20 23
05097 10645 06297 08930 06510
068 147 081 118 085
05 06 04 10 04
50
i1 i2 i3 i4 i5
11597 12643 11294 11885 12592
148 161 140 156 158
27 29 29 27 28
05293 05131 10251 05157 05041
071 080 138 086 070
07 12 26 10 09
·104 RBSM
RRS
RBSM
400
RRS
300
1.5
TD
TR
2
200
1
100
0.5 10
20 30 40 Traffic density
50
10
Fig. 2. Pictorial comparison of T R for RBSM and RRS algorithms.
T CR
50
Fig. 3. Pictorial comparison of T D for RBSM and RRS algorithms. RBSM
30
20 30 40 Traffic density
RRS
20 10 0 10
20 30 40 Traffic density
50
Fig. 4. Pictorial comparison of T CR for RBSM and RRS algorithms.
112
7
S. K. Pande et al.
Conclusion
In this paper, we have proposed the RBSM algorithm for the vehicular cloud environment, which maximizes the total revenue of the RSUs by providing content to the vehicles. RBSM consists of two phases, namely finding data rate zones of the vehicles and selecting a suitable vehicle at each time slot. The comparison of RBSM and RRS algorithms has been performed by conducting simulation on different datasets with different traffic density. It has been observed that RBSM performs better than RRS and generates 87% more revenue than RRS. Moreover, in RBSM, vehicles download 90% more content and 170% more completed requests than RRS. However, we have considered that only one vehicle can download content at a particular time. In our future work, we will try to improve the algorithm, such that more numbers of vehicles can get content simultaneously.
References 1. Al-Hilo, A., Ebrahimi, D., Sharafeddine, S., Assi, C.: Revenue-driven video delivery in vehicular networks with optimal resource scheduling. Veh. Commun. 23, 100215 (2020) 2. Ashok, A., Steenkiste, P., Bai, F.: Vehicular cloud computing through dynamic computation offloading. Comput. Commun. 120, 125–137 (2018) 3. Boukerche, A., Robson, E.: Vehicular cloud computing: architectures, applications, and mobility. Comput. Netw. 135, 171–189 (2018) 4. Refaat, T.K., Kantarci, B., Mouftah, H.T.: Virtual machine migration and management for vehicular clouds. Veh. Commun. 4, 47–56 (2016) 5. Whaiduzzaman, Md, Sookhak, M., Gani, A., Buyya, R.: A survey on vehicular cloud computing. J. Netw. Comput. Appl. 40, 325–344 (2014) 6. Ridhawi, I.A., Aloqaily, M., Kantarci, B., Jararweh, Y., Mouftah, H.T.: A continuous diversified vehicular cloud service availability framework for smart cities. Comput. Netw. 145, 207–218 (2018) 7. Hagenauer, F., Higuchi, T., Altintas, O., Dressler, F.: Efficient data handling in vehicular micro clouds. Ad Hoc Netw. 91, 101871 (2019) 8. Midya, S., Roy, A., Majumder, K., Phadikar, S.: Multi-objective optimization technique for resource allocation and task scheduling in vehicular cloud architecture: a hybrid adaptive nature inspired approach. J. Netw. Comput. Appl 103, 58–84 (2018) 9. Al-Rashed, E., Al-Rousan, M., Al-Ibrahim, N.: Performance evaluation of widespread assignment schemes in a vehicular cloud. Veh. Commun. 9, 144–153 (2017) 10. Guo, H., Rui, L., Gao, Z.: A zone-based content pre-caching strategy in vehicular edge networks. Future Gener. Comput. Syst. 106, 22–33 (2020) 11. Han, T., Ansari, N., Mingquan, W., Heather, Y.: On accelerating content delivery in mobile networks. IEEE Commun. Surv. Tutor. 15(3), 1314–1333 (2012) 12. Teixeira, F.A., Silva, V.F., Leoni, J.L., Macedo, D.F., Nogueira, J.M.S.: Vehicular networks using the IEEE 802.11 p standard: an experimental analysis. Veh. Commun. 1(2), 91–96 (2014) 13. Einziger, G., Chiasserini, C.F., Malandrino, F.: Scheduling advertisement delivery in vehicular networks. IEEE Trans. Mob. Comput. 17(12), 2882–2897 (2018)
A Revenue-Based Service Management Algo. for Vehicular Cloud Computing
113
14. Zhou, S., Qichao, X., Hui, Y., Wen, M., Guo, S.: A game theoretic approach to parked vehicle assisted content delivery in vehicular Ad Hoc networks. IEEE Trans. Veh. Technol. 66(7), 6461–6474 (2016) 15. Sun, Y., Le, X., Tang, Y., Zhuang, W.: Traffic offloading for online video service in vehicular networks: a cooperative approach. IEEE Trans. Veh. Technol. 67(8), 7630–7642 (2018) 16. Fux, V., Maill´e, P., Cesana, M.: Price competition between road side units operators in vehicular networks. In: 2014 IFIP Networking Conference, pp. 1–9. IEEE (2014) 17. Wang, B., Han, Z., Liu, K.R.: Distributed relay selection and power control for multiuser cooperative communication networks using Stackelberg game. IEEE Trans. Mob. Comput. 8(7), 975–990 (2008) 18. Xu, C., Quan, W., Vasilakos, A.V., Zhang, H., Muntean, G.M.: Information-centric cost-efficient optimization for multimedia content delivery in mobile vehicular networks. Comput. Commun. 99, 93–106 (2017) 19. Zhou, Y., Li, H., Shi, C., Ning, L., Cheng, N.: A fuzzy-rule based data delivery scheme in VANETs with intelligent speed prediction and relay selection. Wirel. Commun. Mob. Comput. 2018 (2018) 20. Zhao, Z., Guardalben, L., Karimzadeh, M., Silva, J., Braun, T., Sargento, S.: Mobility prediction-assisted over-the-top edge prefetching for hierarchical VANETs. IEEE J. Sel. Areas Commun. 36(8), 1786–1801 (2018) 21. Yao, L., Chen, A., Deng, J., Wang, J., Guowei, W.: A cooperative caching scheme based on mobility prediction in vehicular content centric networks. IEEE Trans. Veh. Technol. 67(6), 5435–5444 (2017) 22. Shannon: Shannon channel capacity. Accessed 2 Aug 2020 23. Pande, S.K., Panda, S.K., Das, S.: Dynamic service migration and resource management for vehicular clouds. J. Ambient Intell. Humanized Comput., 1–21 (2020). https://doi.org/10.1007/s12652-020-02166-w 24. Pande, S.K., et al.: A smart cloud service management algorithm for vehicular clouds. IEEE Trans. Intell. Transp. Syst., 1–12 (2020) 25. Panda, S.K., Jana, P.K.: An efficient request-based virtual machine placement algorithm for cloud computing. In: Krishnan, P., Radha Krishna, P., Parida, L. (eds.) ICDCIT 2017. LNCS, vol. 10109, pp. 129–143. Springer, Cham (2017). https:// doi.org/10.1007/978-3-319-50472-8 11
Interference Reduction in Directional Wireless Networks Manjanna Basappa1(B) and Sudeepta Mishra2 1
2
Department of Computer Science and Information Systems, Birla Institute of Technology and Science Pilani, Hyderabad Campus, Shameerpet 500078, India [email protected] Department of Computer Science and Engineering, Indian Institute of Technology Ropar, Nangal Road, Rupnagar 140001, India [email protected]
Abstract. In a wireless network using directional transmitters, a typical problem is to schedule a set of directional links to cover all the receivers in a region, such that an adequate data rate and coverage are maintained while minimizing interference. We can model the coverage area of a directional transmitter as an unit triangle and the receiver as a point in the plane. Motivated by this, we first consider a minimum ply covering (MPC) problem. We propose a 2-approximation algorithm for the MPC problem in O((opt + n)m14opt+1 (log opt)) time, where m is the number of transmitters and n is the number of receivers given in the plane, and opt is the maximum number of triangles, in the optimal solution, covering a point in the plane. We also show that the MPC problem is NP-hard, and is not (2 − )-approximable for any > 0 unless P = NP. We also study channel allocation in directional wireless networks by posing it as a colorable covering problem, namely, 3-colorable unit triangle cover (3CUTC). We propose a simple 4-approximation algorithm in O(m30 n2 ) time, for this problem. Keywords: Unit triangles cover · Interference
1
· Approximation algorithm · Minimum ply
Introduction
Directional communication technologies are considered to enhance or mitigate problems experienced in omni-directional technologies. Directional communication links such as mmWave, free-space optics, and visible light communication are being exploited in various wireless networks such as vehicular ad-hoc networks and cellular networks. The capacity, coverage, and interference characteristics of these networks are very different than that of existing omni-directional wireless networks. This is because the slightest change in the beam orientation of the transmitter and/or the receiver results in a drastic change in the coverage and interference characteristics of these networks. Typically, a directional transmitter has high gain toward one direction, which helps reduce energy c Springer Nature Switzerland AG 2021 D. Goswami and T. A. Hoang (Eds.): ICDCIT 2021, LNCS 12582, pp. 114–126, 2021. https://doi.org/10.1007/978-3-030-65621-8_7
Interference Reduction in Directional Wireless Networks
115
consumption when the receiver actively receives or transmits packets. In addition, this will also reduce undesired interference when compared to using an omni-directional transmitter because the transmitted signal is directed toward its intended recipient. Furthermore, unlike omni-directional transmitters, the directional transmitter has a narrow coverage area directed toward its recipient. Due to this, a frequency channel the receiver uses will be available in the areas of the network not covered by the directional transmitter. In this paper, we model the narrow coverage areas of directional transmitters as unit triangles (all sides are of unit length) and receivers as points in the 2-dimensional plane. We then consider a combinatorial optimization problem, namely, minimum ply coverage (MPC) problem, as follows. Given a set P of n points and a set T of m y-parallel unit triangles in the plane, our objective is to choose a set T ∗ ⊆ T such that P ⊂ ∪t∈T ∗ t and every point in the plane (not only points of P ) is covered by a minimum number of triangles in T ∗ , where by y-parallel unit triangles we mean one of the three sides of every triangle t ∈ T is parallel to y-axis. Motivated by further reduction of interference and usage of minimum number of channels in directional wireless networks, we define a variant of the MPC problem, namely, 3-colorable unit triangle cover (3CUTC) problem as follows. Given a set P of n points and a set T of m unit triangles, we wish to select a set T ∗ ⊆ T of triangles, whose union covers all points in P such that T ∗ = T1 ∪ T2 ∪ T3 and triangles within each Ta are pairwise disjoint for a ∈ {1, 2, 3}, i.e., the triangles in T ∗ are 3-colorable while achieving coverage of all points in P . Achieving 3-colorable covering of receivers by transmitters is motivated by the fact that usage of three channels is a common practise in frequency channel assignment in wireless networks [2].
2
Related Work
In cellular communication systems, interference has been recognized as a major bottleneck in achieving good quality of service. Motivated by this important challenge of minimizing interference in cellular networks, Kuhn et al. [8] formulated a combinatorial optimization problem, namely, the minimum membership set cover (MMSC) problem. In the MMSC problem, we are given a set X of n elements, and a family F of subsets of X, the objective is to compute a set F ⊆ F such that X ⊆ ∪f ∈F f and the maximum membership of an element x ∈ X with respect to F , i.e., the number of sets in F that contains x, is minimized. Kuhn et al. [8] proved that the MMSC problem is NP-complete, and can not be approximated better than ln n factor unless NP⊂ T IM E(nO(log log n) ). They also proposed an O(ln n)-approximation algorithm for the MMSC problem. A geometric version of the MMSC problem is called the geometric minimum membership set cover (GMMSC) problem. In the GMMSC problem, we have a set of points P and geometric objects Q in the plane, we wish to choose a set Q∗ ⊆ Q such that (i) every point in P is covered by at least one object in Q∗ , and (ii) a maximum number of objects of Q∗ covering a point in P is minimized. Erlebach and van Leeuwen [7] proved that the GMMSC problem is NP-hard and
116
M. Basappa and S. Mishra
can not be approximated with a factor smaller than 2 unless P = NP, when the geometric objects are unit disks and unit squares. For the unit squares, they presented a 5-approximation algorithm in polynomial time if the cost of an optimal solution is constant. At times when minimizing interference in cellular networks, it is required to compute a cover of P with the goal of not only minimizing the membership of every point of P , but also of every point in the plane containing P . Biedl et al. [1] coined this as the minimum ply covering of points, and studied two variants of the problem, namely, the minimum ply covering of points by unit disks (MPCUD) and by unit squares (MPCUS). They proved that both the MPCUD and MPCUS problems are NP-hard and can not be approximated better than the factor 2, and presented 2-approximation algorithms for both of them if the cost of an optimal solution is constant. They also solved the problem in dimension one optimally in polynomial time, in which case unit disks or unit squares are just intervals on the real line. Directional transmitter such as mmWave transmitter mounted on a Unmanned aerial vehicle(UAV) is used for achieving coverage in UAV-enabled cellular network [9]. In [9], the authors have studied optimal resource allocation problem by formulating the problem as a mixed-integer non-convex programming problem. There are other works in the literature which consider the problem of increasing the lifetime of omni-directional networks by reducing interference (for example, [3–5]).
3
Minimum Ply Covering Problem
In this section we present an approximation algorithm for the minimum ply covering of points by y-parallel unit triangles (MPC) problem. 3.1
Preliminaries
Let a and b be two points on a straight line (see Fig. 1). We use [a, b] to denote the closed interval on , when is considered to be a real line. If is an integer line, i.e., when we are interested in only the integer points on , then [a, b] includes only the points that represent integers within the interval on . In the geometric context, we use [a, b] to denote either a vertical slab or a horizontal slab when a and b represent a pair of parallel lines. We refer to the y-axis aligned side of a triangle t as the y-side of t, i.e., a side of t that is parallel to y-axis. Assume that the x-coordinates of y-side and the vertex opposite to y-side of all triangles t ∈ T and all points p ∈ P are distinct. Similarly, assume that the y-coordinates of the vertex opposite to y-side of all triangles t ∈ T and all points b
a
Fig. 1. Illustrating closed interval on integer/real line
Interference Reduction in Directional Wireless Networks
117
ju Hj
L
p1 p3
p2
(a)
jl
(b)
Fig. 2. (a) The plies of p1 , p2 and p3 are 1, 2 and 3 respectively, (b) Illustrating the proof of Lemma 1
p ∈ P are distinct. Let mid(y-side(t)) be the midpoint of y-side of a triangle t. Let x(p) (resp. y(p)) be the x-coordinate (resp. y-coordinate) of any point p in the plane. Given any two arbitrary points p1 and p2 in the plane, let dist(p1 , p2 ) denote the Euclidean distance between p1 and p2 . Given an arbitrary straight line , the minimum distance between an arbitrary point p and , denoted by dist(p, ), is the distance from p to its closest point on . 3.2
Problem Description
Given a set T of m triangles in the plane, the ply of any point p in the plane with respect to the set T is defined to be the maximum number k of triangles in T such that the point p lies in the common intersection region between these k triangles, where k is an integer 0 ≤ k ≤ m (see Fig. 2(a)). Thus, we say that any set T ⊆ T has ply equal to the maximum number of triangles in T that have nonempty common intersection region. Formally, the MPC problem is defined as follows: MPC: Given a set P of n points and a set T of m y-parallel unit triangles in the plane, our objective is to compute a set T ∗ ⊆ T such that T ∗ covers all points of P and has minimum ply. 3.3
Algorithm
In order to solve MPC, we have taken ideas from [1] to develop an approximation algorithm which finds a cover T ⊆ T of points in P such that T has minimum ply. We describe the algorithm as follows. Firstly, we partition the plane into horizontal strips of two units height. Assume that no point of P lies on the boundary of any strip. Let us label these horizontal strips as H1 , H2 , . . ., Hr ,
118
M. Basappa and S. Mishra
from bottom to top, where Hr is the topmost strip containing at least one point from P . Let Pj be the set containing all points of P lying within the strip Hj , and Tj be the set of triangles from T intersecting with Hj , for every j = 1, 2, . . . , r. Let α be an upper bound on the minimum ply of a cover of the given instance of the MPC problem. As a standard practice in most of the geometric covering problems, in order to solve the MPC problem we first solve the MPC problem restricted to strip Hj separately for j = 1, 2, . . . , r. Now observe that if for some Hj , there is no subset Tj ⊂ Tj that covers all points Pj and has minimum ply at most α, then the original point set P can not have any covering with minimum ply at most α. Hence, we have a cover Tj ⊂ Tj with ply at most α, that covers all points Pj lying within the strip Hj , for every j = 1, 2, . . . , r. Then r Tj will cover all points P and have ply at most 2α since a triangle in T T = j=1
can participate in covering of points in two consecutive strips. We approximate the minimum value of α by using doubling technique and bisection method on the integer line Z+ . Now we describe an algorithm for computing a cover Tj with ply at most α for the points lying in the strip Hj . Given an integer α, points Pj , triangles Tj , and that there is a cover of points Pj with ply at most α, we do the following to compute Tj . Let 1 , 2 , . . ., k−1 be the vertical lines, passing through y-sides and vertices opposite to y-sides of triangles in Tj , in the increasing order of their x-coordinates. We can view these vertical lines as forming vertical strips [i , i+1 ] for i = 0, 1, . . . , k − 1, where 0 and k are two dummy vertical lines to be explained later. Since |Tj | ≤ m, k ≤ 2m. Lemma 1. If we have a solution Tj covering all points Pj and having ply at most α, then there are at most 7α unit triangles of Tj intersecting with any vertical strip [i , i+1 ] for i = 0, 1, . . . , k − 1. Proof. For the sake of contradiction assume that there are more than 7α yparallel unit triangles intersecting with some vertical slab [i , i+1 ]. Consider an arbitrarily placed vertical line L, but passing through this vertical slab [i , i+1 ]. Let the seven points p1 , p2 , . . ., p7 be marked on L ∩ Hj such that p4 is at a point equidistant from both upper and lower boundary lines ju and jl defining the strip Hj i.e., dist(p4 , ju ) = dist(p4 , jl ), the points p1 , p2 , and p3 are placed closed to the upper boundary line ju of Hj separated by very small distance from one another, and similarly, the points p5 , p6 , and p7 are placed closed to the lower boundary line jl of Hj separated by very small distance from one another (see Fig. 2(b)). Let us suppose that there are seven disjoint y-parallel unit triangles covering p1 , p2 , . . . , p7 respectively. Observe that no more y-parallel unit triangle disjoint from the above seven triangles can be placed covering a new point p8 on L ∩ Hj (in fact, lying anywhere on [i , i+1 ] ∩ Hj ). Now we can replace the above seven triangles with seven groups of almost coinciding α triangles respectively by still maintaining the above pairwise disjointness property between these groups, and covering their respective seven points. Hence, if there are more than 7α y-parallel unit triangles intersecting with the vertical slab [i , i+1 ], then one of
Interference Reduction in Directional Wireless Networks
119
the seven points p1 , p2 , . . ., p7 , is covered by more than α triangles. Therefore, the ply of Tj is more than α, a contradiction. Based on Lemma 1 we define a pair ([i , i+1 ], R) for i ∈ {1, 2, . . . , k − 1} as follows: for every vertical strip [i , i+1 ] a subset R ⊆ Tj is any setconsisting t covers of at most 7α unit triangles intersecting with [i , i+1 ] such that t∈R
all points in [i , i+1 ] ∩ Pj and the ply of R is at most α. Now given any pair ([i , i+1 ], R), a pair ([i+1 , i+2 ], R ) is said to be a successor of the ([i , i+1 ], R) t covers all points in [i+1 , i+2 ] ∩ Pj , the ply of if and only if |R | ≤ 7α, t∈R
R is at most α, and |RR | ≤ 1, where denotes symmetric difference of two sets. Observe that R and R differ by at most one triangle, y-side or vertex opposite to y-side of which lies on the line i+1 . Let us introduce two dummy triangles Tdummyl and Tdummyr to cover two dummy points in Hj , placed lying to the left and right of all triangles of Tj respectively. Let 0 and k be the vertical lines through the y-side or vertex opposite to y-side of Tdummyl and the y-side or vertex opposite to y-side of Tdummyr respectively. Based on the above definitions we construct a directed acyclic graph G = (V, E) as follows: for i = 1, 2, . . . , k − 1, consider all possible pairs ([i , i+1 ], R) and define a vertex v(R) corresponding to each such pair. Let the set Vi contain vertices corresponding to all possible k pairs ([i , i+1 ], R) for the vertical strip [i , i+1 ]. Let V = Vi . We add an i=1
edge e = (v, v ) to the set E, where e is a directed edge from a vertex v(R) ∈ Vi to a vertex v (R ) ∈ Vi+1 if and only if the pair ([i+1 , i+2 ], R ) is a successor of the pair ([i , i+1 ], R). Finally, we add the edges from the dummy vertex v({Tdummyl }) to all vertices in V1 and edges from all vertices in Vk−1 to the dummy vertex v({Tdummyr }), to the set E. We argue below that any path from v({Tdummyr }) to v({Tdummyr }) in G corresponds to a cover of Pj with ply at most α. Lemma 2. Given an instance (Tj , Pj ) of the MPC problem on the strip Hj , then there is a cover Tj ⊆ Tj of all points in Pj , with ply at most α, if and only if there is a path ρ, corresponding to Tj , from vertex v({Tdummyl }) to v({Tdummyr }) in G. Proof. First consider a solution Tj∗ ⊆ Tj with ply at most α, by Lemma 1 there can be at most 7α triangles in Tj∗ covering points in each vertical slab [i , i+1 ], i = 1, 2, . . . , k−1. Because the boundaries of vertical slabs [i , i+1 ] are formed by y-sides or vertex opposite to y-sides of triangles, we can associate a vertex with each such group of at most 7α triangles, and an edge between two such groups of at most 7α triangles provided they differ in at most one triangle. These vertices form a path ρ corresponding to Tj∗ , of length at most 2m + 1. Next, if we have a path ρ from v({Tdummyl }) to v({Tdummyr }) in G, then because of the way in which the graph G is constructed, the union of all collections of at most 7α triangles, corresponding to all vertices in ρ, will cover all points in Pj and have ply at most α. Thus the lemma follows.
120
M. Basappa and S. Mishra
Lemma 3. The algorithm for the MPC problem on the strip Hj runs in O((α + |Pj |)|Tj |7α+1 ) time. Proof. Recall that there is a set of at most 7α triangles corresponding to each vertex in G, and there are at most three directed edges from each vertex to its successor vertices in G. The time required to verify that triangles corresponding to each vertex cover all the points in the corresponding vertical slab and identifying its successors is O((α + |Pj |)|Tj |). Hence, the construction of G takes O((α + |Pj |)|Tj |7α+1 ) time. We can run a DFS algorithm starting from v({Tdummyl }) to find a path ρ in O(|Tj |7α ) time. Therefore, the overall running time of the algorithm is dominated by the time needed for the construction of G. In order to compute a cover of all the points P , we repeat the algorithm of r Tj . Since a triangle Lemma 3 for all strips Hj , j = 1, 2, . . . , r. Then set T = j=1
can participate in the covering of points in two consecutive strips and the ply of each Tj is at most α, the ply of T is at most 2α. To find the minimum value of α, we solve the MPC problem repeatedly for α = 2i , i = 0, 1, 2, . . . , until the above algorithm succeeds in finding Tj for every strip Hj , j = 1, 2, . . . , r, where Tj is a cover of points Pj and has ply at most α. Let that value of α be 2τ . Then we know that the algorithm has failed to compute Tj for some strip Hj when α = 2τ −1 . Hence the minimum ply opt lies in the interval [2τ −1 , 2τ ] for the given point set P and unit triangle set T . We can now do a standard binary search in the interval [2τ −1 , 2τ ], which would need only τ bisection steps, where τ ≤ log opt + 1 Finally, the ply of the cover T returned by the last invocation of the algorithm for the MPC problem is at most opt. Hence, we have the following theorem. Theorem 1. We have a 2-approximation algorithm for the MPC problem in O((opt + n)m14opt+1 (log opt)), where opt is the maximum number of triangles of T , in the optimal solution, covering a point in the plane. Proof. Given an upper bound α on the minimum ply, the algorithm of Lemma 3 computes a cover Tj of points Pj , with ply at most α, in the strip Hj . Since a triangle can participate in the covering of points in only two consecutive strips Hj and Hj+1 (j = 1, 2, . . . , r − 1) and the ply of each Tj is at most α, the r ply of T = Tj is at most 2α. In solving the MPC problem on each strip i=1
Hj for j = 1, 2, . . . , r, the algorithm of Lemma 3 is invoked at most log 2opt times and at most log opt + 1 times due to doubling technique and standard binary search respectively. During this bisection process, the algorithm may get called for a value of α that is almost twice as large as opt. Therefore, the overall 2 log opt+2 r ( (α + |Pj |)|Tj |7·α+1 ) ≤ running time for computing the cover T is 2 log opt+2 i=0
(
r
j=1
i=0
j=1
(2opt + |Pj |)|Tj |7·2opt+1 ) = O((opt + n)m14opt+1 (log opt)). Since the
Interference Reduction in Directional Wireless Networks
121
bisection process and the repeated calling of the algorithm ensure that the value of α is ultimately at most opt for each Hj , the overall cover T will have ply at most 2opt. Thus, the theorem follows.
Remark 1. In the approximation algorithm for the MPC problem, we have used only the properties that for each element t ∈ T , the length of y-side of t is unit and t is a triangle. Hence, the algorithms can be extended to work even when T consists of y-parallel isosceles triangles with unit length y-side. 3.4
NP-Hardness of MPC Problem
In this subsection we show that the MPC problem is NP-hard. Biedl et al., [1] proved that the minimum ply covering of points by unit disks (MPCUD) and the minimum ply covering of points by unit squares (MPCUS) are both NP-hard by reducing from the NP-complete planar graph 3-coloring problem. As with Biedl et al., [1] we prove that the minimum ply covering of points by unit triangles (MPC) is also NP-hard, in particular, by modifying gadgets in the reduction of Biedl et al. [1] and introducing gadgets as required. As it is argued in [1] that because of ply being an integer, the MPCUS and MPCUD problems can not be approximated with a factor smaller than 2, MPC is also not (2−)-approximable, for any > 0. In the decision version of planar graph 3-coloring problem, we are given a planar graph G(V, E), and our aim is to answer the question: is G 3-colorable? i.e., whether all vertices in V can be colored with three distinct colors such that adjacent vertices receive different colors, or not. Given a planar graph G(V, E), we describe the construction of an instance (PG , TG ) of the MPC problem as follows. The vertex gadget in Fig. 3(a) will be constructed corresponding to each vertex v ∈ V (G). Depending on which of the three colors is used to color the vertex v, one of the three triangles t0 , t1 , and t2 is selected to cover the point pv . The selected color can be carried over by the transport gadget as shown in Fig. 3(b). When a vertex v has degree d(v) more than 2, using as many copies of the duplicate gadget, as shown in Fig. 3(c), as the degree d(v), we can carry over the selected color/triangle of the point pv along all those edges incident on v. In the construction, whenever we want to change the orientation of triangles in the transport gadget, we can then insert the orientation change gadget there, as shown in Fig. 3(d). In between every pair of adjacent vertices u and v in G, we introduce the color-conflict avoidance gadget (Fig. 4) to avoid that these two vertices receive the same color, i.e., to avoid that the corresponding points pu and pv are covered by the same labeled triangles among the triples t0 , t1 , and t2 . Let PG and TG consist of all those points and triangles respectively, thus constructed. Theorem 2. A planar graph G(V, E) is 3-colorable if and only if the instance (PG , TG ) of the MPC problem has a solution with ply 1. Proof. Now, if all the vertices in V (G) can be colored with three distinct colors, then the corresponding triangles of TG to the chosen colors cover all the points
122
M. Basappa and S. Mishra t0 t1 t2 pv
(a)
(b)
(c)
(d)
Fig. 3. Hardness of MPC problem: (a) vertex gadget, (b) transport gadget, (c) duplicate gadget, and (d) orientation change gadget
x
selected triangle/color of pv
y
z
Fig. 4. Color conflict avoidance gadget
selected triangle/color of pu
Interference Reduction in Directional Wireless Networks
123
in PG , and no two triangles overlap anywhere. Hence, there exists a set TG∗ ⊆ TG covering all the points in PG and having ply 1. On the other hand, if we have a solution TG ⊆ TG , with ply 1, which covers all the points in PG , then by the construction of G, the colors corresponding to the triangles in TG result in 3-coloring of G. Thus, the theorem follows.
4
Channel Allocation in Directional Wireless Networks
In this section, we study the channel allocation in wireless networks with directional transmitters. In particular, we investigate the 3-colorable unit triangle cover (3CUTC) problem: given a set T of m y-parallel unit triangles (modelling transmission/interference regions of wireless directional transmitters) and a set P of n points (modelling receivers) in the plane, the objective is to compute a set T ∗ ⊆ T such that T ∗ ensures coverage of all points in P (i.e., P ⊂ ∪t∈T ∗ t) and at most three colors (i.e., channels) are sufficient to color all triangles in T ∗ (i.e, the set T ∗ can be split into T1∗ , T2∗ and T3∗ such that the triangles in Ta∗ are pairwise disjoint for each a ∈ {1, 2, 3}). In other words, we need to compute 3-colorable cover of all points in P with unit triangles in T . The usage of three colors here is justified due to the constraints on channel selection in Wi-Fi networks [2,6]. The algorithm we describe next for this problem is based on the following observation. Observation 1. If we have a 3-colorable cover TR of all points PR lying within √ a rectangle R of dimension 1 × 23 , then |TR | ≤ 30. √
Proof. ince the horizontal length of R is 23 , observe that at most ten sets of almost overlapping three triangles can cover the interior of R while all the triangles in these sets are 3-colorable (see Fig. 5). Therefore, in any 3-colorable cover of PR , there are at most 30 unit triangles from T participating. Thus the claim follows.
The algorithm for the above channel allocation problem proceeds as follows. √ First, place a rectangular grid of size 1 × 23 over the plane containing all points in P . For each rectangle R formed by this grid, for which P ∩ R = ∅, by exhaustively checking all subsets T ⊆ T of size at most 30, find that set TR of size at most 30 that covers all points in PR and is 3-colorable, where PR = P ∩ R. Let C1 , C2 , C3 and C4 be the four sets of three distinct colors. Triangles in any TR thus computed can overlap with at most 8 adjacent rectangles. Now, consider the rectangular grid shown in Fig. 6. Let us assign the color set C1 to the triangles in TR . Then the adjacent rectangles of R will get assigned the color sets C2 , C3 , and C4 as shown in Fig. 6. Let Rl , Rr , Rt , and Rb be the left, right, top, and bottom rectangles that share their edges with R, and whose assigned colors are C4 , C4 , C2 and C3 respectively. Now the right side adjacent rectangle of Rl will get the color set C1 . Similarly, the left side adjacent rectangle of Rr , top adjacent rectangle of Rt , and bottom adjacent rectangle of Rb will also get the color set C1 . The same pattern then as shown in Fig. 6 will repeat. For a
124
M. Basappa and S. Mishra
R
Fig. 5. (a) Illustrating observation 1
triangle t participating in the covering of points lying in more than one adjacent rectangles, we retain the color from the color set of the rectangle R for which the point mid(y-side(t)) lies in the interior of R. Theorem 3. We have a 4-approximation algorithm for the 3CUTC problem, that runs in O(m30 n2 ) time. Proof. The approximation factor of the algorithm follows from the facts that we use four pairwise disjoint color sets, each containing three distinct colors, and there is vertical distance of at least one unit and horizontal distance of at least √ 3 units between any two rectangles with the same assigned color set. Because 2 |P | = n, there are at most n nonempty rectangles R, i.e., for which R ∩ P = ∅. For each rectangle R, we invest O(m30 n) time to compute the cover TR which is 3-colorable (due to Observation 1). Therefore, the overall time of the algorithm
is O(m30 n2 ). Thus, the theorem follows. Remark 2. Since the triangles within each Ta for a ∈ {1, 2, 3} are pairwise disjoint, the ply of Ta is 1. Hence, by pigeon hole principle the cover T ∗ (= Ta ) has ply at most 3.
a
Interference Reduction in Directional Wireless Networks
C3
C2
125
C3 Rt
C4
C1 Rl
C2
C4 Rr
R
C3
C2 Rb
Fig. 6. Coloring triangles chosen to cover points in adjacent rectangles
5
Conclusion
In this paper we study some optimization issues pertained to directional wireless networks. We formulated these optimization issues related to coverage and interference as a geometric covering problem involving unit triangles with fixed orientations, and points, namely, the minimum ply covering (MPC) problem. The approximation algorithm that we have presented for the MPC problem in this paper has a running time that is polynomial in the input size, but unfortunately exponential in the size of optimal solution. However, in practice for a vehicular adhoc network utilizing directional transmitters, the number of candidate transmitters covering a single receiver is small. Furthermore, developing constant factor approximation algorithms for the above problems, that run in time polynomial in both input size and output size, is an important open problem.
References 1. Biedl, T., Biniaz, A., Lubiw, A.: Minimum ply covering of points with disks and squares. In: Proceedings of the 29th Canadian Conference on Computational Geometry, pp. 226–235 (2019) 2. Brass, A., Hurtado, F., Lafreniere, B.J., Lubiw, A.: A lower bound on the area of a 3-coloured disk packing. Int. J. Comput. Geom. Appl. 20(3), 341–360 (2010) 3. Carrabs, F., Cerulli, R., D’Ambrosio, C., Raiconi, A.: Exact and heuristic approaches for the maximum lifetime problem in sensor networks with coverage and connectivity constraints. RAIRO-Oper. Res. 51(3), 607–625 (2017)
126
M. Basappa and S. Mishra
4. Carrabs, F., Cerrone, C., D’Ambrosio, C., Raiconi, A.: Column generation embedding carousel greedy for the maximum network lifetime problem with interference constraints. In: Sforza, A., Sterle, C. (eds.) ODS 2017. SPMS, vol. 217, pp. 151–159. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-67308-0 16 5. Carrabs, F., Cerulli, R., D’Ambrosio, C., Raiconi, A.: Prolonging lifetime in wireless sensor networks with interference constraints. In: Au, M.H.A., Castiglione, A., Choo, K.-K.R., Palmieri, F., Li, K.-C. (eds.) GPC 2017. LNCS, vol. 10232, pp. 285–297. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-57186-7 22 6. Cisco Systems, Inc.: Channel Deployment Issues for 2.4-Ghz 802.11 WLANs. http://www.cisco.com/univercd/cc/td/doc/product/wireless/airo1200/accsspts/ techref/channel.pdf. Accessed 15 Apr 2007 7. Erlebach, T., Van Leeuwen, E.J.: Approximating geometric coverage problems. In: Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 20 January 2008, pp. 1267–1276. Society for Industrial and Applied Mathematics (2008) 8. Kuhn, F., von Rickenbach, P., Wattenhofer, R., Welzl, E., Zollinger, A.: Interference in cellular networks: the minimum membership set cover problem. In: Wang, L. (ed.) COCOON 2005. LNCS, vol. 3595, pp. 188–198. Springer, Heidelberg (2005). https:// doi.org/10.1007/11533719 21 9. Kumar, S., Suman, S., De, S.: Dynamic resource allocation in UAV-enabled mmWave communication networks. IEEE Internet Things J. (2020)
Distributed Algorithms, Concurrency and Parallelism
Automated Deadlock Detection for Large Java Libraries R. Rajesh Kumar(B) , Vivek Shanbhag, and K. V. Dinesha International Institute of Information Technology, Bangalore 560100, Karnataka, India [email protected] https://www.iiitb.ac.in Abstract. Locating deadlock opportunities in large Java libraries is a subject of much research as the Java Execution Environment (JVM /JRE) does not provide means to predict or prevent deadlocks. Researchers have used static and dynamic approaches to analyze the problem. Static approaches: With very large libraries, this analysis face typical accuracy/doability problem. If they employ a detailed modelling of the library, then the size of the analysis grows too large. Instead, if their model is coarse grained, then the results have too many false cases. Since they do not generate deadlocking test cases, manually creating deadlocking code based on the predictions is impractical for large libraries. Dynamic approaches: Such analysis produces concrete results in the form of actual test cases to demonstrate the reachability of the identified deadlock. Unfortunately, for large libraries, generating the seed test execution paths covering all possible classes, to trigger the dynamic analysis becomes impractical. In this work we combine a static approach (Stalemate) and a dynamic approach (Omen) to detect deadlocks in large Java libraries. We first run ‘Stalemate’ to generate a list of potential deadlocking classes. We feed this as input test case to Omen. In case of deadlock, details are logged for subsequent reproduction. This process is automated without the need for manual intervention. We subjected the entire JRE v1.7.0 79 libraries (rt.jar) to our implementation of the above approach and successfully detected 113 deadlocks. We reported a few of them to Oracle as defects. They were accepted as bugs. Keywords: Concurrency · Deadlock Dynamic analysis · Scalable
1
· Java · Static analysis ·
Introduction
The synchronised construct is provided in the Java Language to facilitate the development of code fragments that may run concurrently. When this happens in an uncoordinated manner it can give rise to Lock Order Violations. If such a lock c Springer Nature Switzerland AG 2021 D. Goswami and T. A. Hoang (Eds.): ICDCIT 2021, LNCS 12582, pp. 129–144, 2021. https://doi.org/10.1007/978-3-030-65621-8_8
130
R. Rajesh Kumar et al.
order violation is realised during program execution it causes the executing JVM to deadlock. This problem has been widely investigated, and various researchers have tried different approaches to address it. There have been numerous published research works for detecting deadlocks including both static and dynamic approaches [1,2,4–6,9,11–16] and some for preventing them [3,7,8]. Stalemate [1,2] is a static analysis approach that identifies lock order violations in large Java libraries, which have the possibility of being realised, during execution, in the form of a deadlock. However, its predictions include many false positives, since the analysis is at the type level while the contested locks are held on the actual objects. Consequently, one must develop a deadlocking test case by manual examination of the call cycles in the predictions. For a library as large as the entire JRE, it becomes impractical to apply this manual method to narrow down to realisable deadlocks. Other static approaches such as Jade [12,13] can scale well, produces notable results, however these methods also produce many false positives. There are many dynamic analysis approaches to identify deadlocks such as Omen [4], Needlepoint [5] and Sherlock [6]. Omen, produces realisable deadlocks along with reproducible test cases. Such dynamic analysis, initiated using seed test cases, is limited to the execution traces realisable during the call flows of the seed test cases. For large libraries, the set of seed test cases would become substantial, and looking for lock cycles realisable in their execution traces would be impractical. The implementation Sherlock is suitable for large programs but not for libraries. In this paper we address the problem of identifying reproducible deadlocks in large Java libraries without false positives, and produce deadlocking test cases for the detected deadlocks. Our approach combines a static approach Stalemate [1,2], as it can scale to analyze large libraries and a dynamic approach Omen [4] to detect real deadlocks along with deadlocking test cases. We start by subjecting the library to Stalemate and from the lock order violations it reports, interleaving calls that could lead to deadlocks are extracted. Reading off classes/methods involved in such calls, we use Randoop to generate test cases that exercise these classes/methods. These tests are then subjected to dynamic analysis (Omen) narrowing down to reproducible deadlocks. In this way we eliminate the numerous false cases identified by static analysis. At the same time the attention of the dynamic analysis is directed to only those components in the library where there is a possibility of finding a deadlock. We have fully automated the process so that the complete analysis for the entire library can be completed with no manual intervention. This paper is organized as follows: Sect. 2 states the problem we are attempting to address. Section 3 provides the solution overview and details the implementation, Sect. 4 summarizes the results, and Sect. 5 states the conclusions.
Automated Deadlock Detection for Large Java Libraries
2
131
Problem Statement
Design an automated method to analyze large Java libraries to detect deadlocks by combining static and dynamic analysis approaches. For each of the lock order violations identified by the static analysis (Stalemate), create an automated process to detect any real deadlock associated with that violation using dynamic analysis (Omen). The process should also generate the deadlocking test cases to reproduce the deadlocks. As an illustration, we have used the output of Stalemate [1] on the entire JRE v1.7.0 79 libraries (rt.jar) to generate a list of lock order violations (this number is more than 26,000) with many false positive cases. We have used the automated process described in this paper to generate a list of real deadlocks (number: 113) and test cases to reproduce the deadlocks.
3
Solution Details
The solution focuses on extracting the relevant details from the static analysis and targets the dynamic analysis only on those classes that have the potential to cause a deadlock. Given a lock order violation that is identified by the static analysis Stalemate, next, we want to develop single-threaded test programs that may individually realise each its thread stacks. For this purpose we use the Randoop tool, which synthesizes test cases for a given set of classes. The dynamic analysis tool Omen is then invoked by passing the generated test case as the seed test case. On completion of the analysis if deadlocks are found, Omen will synthesize a multi-threaded test case that will always deadlock. The task is to use these tools, namely, Stalemate, Randoop and Omen, and devise an automated method to analyze large Java libraries to detect deadlocks and produce deadlocking test scenarios. The flow is designed to handle any exceptions and intermittent failures so that it is suitable to analyze large libraries and results in a truly automated solution. 3.1
Solution Steps
The process flow is represented in Fig. 1. Key steps of the automated analysis is described below. 1. Static Analysis: The jar file containing the libraries that are to be analyzed are subjected to the static analysis. This step produces the static analysis results (Sn), which contains the list of potential deadlocking scenarios containing the call flow details with the lock order violations involved in the synchronized functions calls, as represented in Fig. 2.
132
R. Rajesh Kumar et al.
Fig. 1. Automated deadlock detection process flow
The result is interpreted as follows: the “Cycle-2” in its opening line means that is a lock order violation involving 2 locks. The class names of the locks are in the string that follows. This line is followed by a number of thread stacks enclosed within “Thread-i Option” descriptors. i ranging from 1 to n, the number of locks involved in the violation. The example in Fig. 2, where n = 2, we refer to the “Thread-1” stacks as forward stacks, which acquire locks in the forward order, and the “Thread-2” stacks as reverse stacks, as they acquire them in the reverse order. If in a multi-threaded program, one thread were to realise one of the forward stacks, and another thread were to realise one of the reverse stacks, and if they were to do this concurrently, and if the two locks they are trying to acquire were to be the same two locks, then they could result in a deadlock.
Automated Deadlock Detection for Large Java Libraries
133
java.util.logging.Logger.getAnonymousLogger:()Ljava.util.logging.Logger; java.util.logging.LogManager.getLogger:(Ljava.lang.String;)Ljava.util.logging.Logger;
java.util.logging.Logger.getAnonymousLogger:(Ljava.lang.String;)Ljava.util.logging.Logger; java.util.logging.LogManager.getLogger:(Ljava.lang.String;)Ljava.util.logging.Logger;
java.util.logging.Logger.getLogger:(Ljava.lang.String;Ljava.lang.String;)Ljava.util.logging.Logger; java.util.logging.LogManager.getLogger:(Ljava.lang.String;)Ljava.util.logging.Logger;
java.util.logging.Logger.getLogger:(Ljava.lang.String;Ljava.lang.String;)Ljava.util.logging.Logger; java.util.logging.LogManager.addLogger:(Ljava.util.logging.Logger;)Z
java.util.logging.Logger.getLogger:(Ljava.lang.String;)Ljava.util.logging.Logger; java.util.logging.LogManager.addLogger:(Ljava.util.logging.Logger;)Z
java.util.logging.Logger.getLogger:(Ljava.lang.String;)Ljava.util.logging.Logger; java.util.logging.LogManager.getLogger:(Ljava.lang.String;)Ljava.util.logging.Logger;
java.util.logging.LogManager.addLogger:(Ljava.util.logging.Logger;)Z java.util.logging.Logger.getLogger:(Ljava.lang.String;)Ljava.util.logging.Logger;
Fig. 2. Lock order violation output from static analysis
2. Prediction Filtering and Data Preparation: The results produced by Stalemate could be a very large data set. To narrow down the areas of focus, certain filters such as Cycles, Call Depth and Call Density are applied on the results (details of the filters are elaborated in Sect. 3.2). Each of the result nodes (S1 to Sn) are checked for the filter criteria and a subset Nk of the static analysis results (N k ⊂ Sn) is produced. There is a possibility that this step could filter certain real deadlock candidates, however it helps to direct the analysis to the desired focus area. For each of the shortlisted nodes, the classes are extracted along with the metadata to trace back to the static analysis and it is referred as Node Info. An extract of a log created after the filtering exercise is shown in Fig. 3(a) and Node Info is shown in Fig. 3(b). 3. Seed Test Case Generation: Using the Node Info details collected during the data preparation step the seed test cases (TCk) are generated for each of the shortlisted predictions using the utility Randoop [10]. The test cases target specifically the classes involved in a lock order violation as detected by the static analysis. Example of a test case generated by Randoop is shown in Fig. 3(c).
134
R. Rajesh Kumar et al.
Fig. 3. (a) Execution summary (b) node info (c) test case generated by Randoop
4. Dynamic Analysis for Deadlock Detection: Post generation of the seed test cases, for each of the test cases (TC1 to TCk), the Dynamic Deadlock Analysis is initiated with Omen. The Dynamic Analysis starts by executing the seed test case and by tracing the lock dependency relations of the classes during the execution and recording them along with the invocation contexts. The presence of cyclic chains in the lock dependency relations are identified and potential deadlocking scenarios are synthesized. Multi-threaded tests are then generated by using the invocation details from the seed test case execution and by creating additional conditions that could result in a deadlock. At the end of the execution if any deadlocks are detected (DLj) successfully, they are recorded along with the deadlocking test cases. The class to be analyzed, seed test case, and the deadlocking test case generated by Omen are illustrated in Fig. 4 with a simple example. The flow continues by picking the next shortlisted result to be analyzed. 5. Handling of Deadlocked Analysis: It was observed that the execution of the analysis was getting deadlocked either during the generation of the seed test cases or during the dynamic analysis. It was observed that these deadlocking scenarios were occurring while processing specific set of classes. The stack traces of the deadlocked JVM revealed that these are indeed deadlocking scenarios as predicted by the static analysis and they were consistently occurring while processing those specific classes. The test case generation by Randoop is a multi-threaded process as it executes each test in a separate thread to speed up the generation. When the lock order violations occur in
Automated Deadlock Detection for Large Java Libraries
135
the call flows of the classes for which the test cases are being generated, the possibility of deadlock occurs. Similarly, such scenarios occur during dynamic analysis as well. JVM Monitoring Routine was developed to watch the JVMs for such scenarios. Once a deadlocked JVM is detected then the stack traces were extracted and recorded along with the associated metadata so that it is traceable to the prediction. The hung JVM is then terminated by the routine so that the flow would continue to the next prediction to be analyzed. An extract of the JVM stack trace of such as scenario is shown in Fig. 5
Fig. 4. Omen dynamic analysis example: (a) class to be analyzed (b) sequential seed test case to be subjected to the analysis (c) synthesized multi-threaded deadlocking test case generated by Omen
136
R. Rajesh Kumar et al. ... "Finalizer" daemon prio=5 tid=0x00007f8944813800 nid=0x3303 in Object.wait() [0x0000700001e0e000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135) - locked (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151) at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:209) ... "main" prio=5 tid=0x00007f8944802800 nid=0x1b03 runnable [0x00007000016f8000] java.lang.Thread.State: RUNNABLE at java.util.logging.Logger.getEffectiveResourceBundleName(Logger.java:1703 ) at java.util.logging.Logger.doLog(Logger.java:636) at java.util.logging.Logger.log(Logger.java:664) at java.util.logging.Logger.info(Logger.java:1182) at RandoopTest0.test7(RandoopTest0.java:109) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ...
Fig. 5. Extract of JVM stack trace
6. Data Collection and Report Creation: Comprehensive reports of the analysis are generated based on the data collected when the execution concludes, resulting in a comprehensive Deadlock Detection Report.
3.2
Implementation
The implementation details of the solution is elaborated in this section. The complete source code for of this implementation is published in GitHub along with the test results of the executions. Execution Flow. The end-to-end execution flow design is represented in the Fig. 6. The flow of the automated deadlock analyzer is as follows (refer Fig. 6, steps 1 to 7): 1. The run cycles are specified by providing the constraints that needs to be applied on the static analysis prediction output. The constraints specifies the output files that needs to be selected for analysis and the filter criteria that needs to be applied for each of the output files. 2. For each run, the Prediction filter constraints file is generated, which is the primary input for the Static Analysis Prediction Scanner.
Automated Deadlock Detection for Large Java Libraries
137
Fig. 6. Automated deadlock analyzer implementation design
3. The Static Analysis Prediction Scanner takes the generated Prediction filter constraints and scans each of the static analysis output files and selects the specific prediction nodes (the xml node that contain the lock order violations) along with the relevant metadata. 4. Dynamic Deadlock Analyzer is then invoked for each of the shortlisted node after generating the seed test cases for the classes involved in the potential deadlocks identified by the static analysis. The detected deadlocks are recorded in such a way that the corresponding static analysis prediction and the interleaving call details can be traced back. 5. During the execution of the dynamic analysis the JVM Monitor is invoked to monitor the execution for any hung scenarios. When such a scenario is detected, the monitor will extract the details for analysis and terminate the hung program so that the execution can continue. All the logs related to the deadlocking scenario are organized so that they can be seen in the context of the specific deadlock prediction. 6. After the completion of the analysis, the Results Report Generator is run to consolidate the complete execution results. The summary of each execution
138
R. Rajesh Kumar et al.
is created in a master report file with the details of the constraints specified along with the results. With this information any detected deadlock can be consistently recreated. 7. The steps 2 through 6 are repeated for each of the run cycles specified in Step 1, thereby enabling the execution of a complete batch of automated analysis The details of some of the key components of the implementation are described below: Static Analysis Prediction Scanner. The static analysis prediction scanner is designed to shortlist the specific nodes from the output of the static analysis based on the constraints such as Cycles: Specifies the number of locks involved in a lock order violation, Call Depth: The number of function calls involved in a deadlock cycle that was predicted, Call Density: The number of classes involved in the prediction node resulting in a deadlocking cycle, and Package Exclusions: The list of packages that are to be excluded in the scan. All the necessary metadata required to target only the classes that could lead to a potential deadlock are collected at this stage. The Dynamic Deadlock Analysis is triggered after the completion of this scan. Dynamic Deadlock Analysis. The first step of the process is to generate the sequential test cases that will act as the seed test cases for the dynamic analysis. Our aim is to target only the classes that are involved in a lock order violation, as predicted by the static analysis. The test cases corresponding to each of the selected prediction node from the static analysis are generated by the Randoop tool. Omen, is then invoked with the seed test case as input to initiate the dynamic analysis. The test cases are executed by Omen and the execution traces are scanned for cycles to detect the deadlocking scenarios. The detected deadlocks are then consolidated into a comprehensive report. JVM Monitoring Routine. JVM Monitoring Routine was developed to handle the deadlocked analysis that happens either during Randoop execution or during Omen execution. JVM Monitor, once initiated is designed to run as long as there is an active JVM. As a first step, it fetches all the process identifiers (PIDs) of the java programs that are being executed in the JVM with a time delay. Then it checks if there are any identical processes between those time delays. If there are any identical processes, their respective stack traces are fetched. The stack traces are then compared to check the JNI global references held by the JVM. Analysis of the JNI global references is a predictable indicator to assess if the JVM is active or hung. If the JNI global references are identical across multiple snapshots with significant time delays, it predictably indicated a hung JVM. Once the hung state of the JVM is detected then logs are generated, and the details are collected in the reports. The processes that are hung are then terminated for the flow of the Dynamic Analyzer to progress ahead.
Automated Deadlock Detection for Large Java Libraries
139
Report Generation. Following are the key reports generated during the execution: 1. Test Runs Report: Contains the execution status for each of the batches, along with the filter criteria applied for the analysis. This report provides an overview of the batch executions and enables to plan further batches for analysis. 2. Execution Summary Report: This report provides the summary of the execution results and helps to navigate to the specific deadlocking scenario. The metadata captured by this report enables to locate the specific Deadlock Report file along with the other details that indicates the number of tests executed and number of dead locking scenarios detected. 3. Deadlock Report: The Deadlock Report contains the log related to each of the nodes that were processed and which of the processed node resulted in a deadlock. If the dead locks are detected as a result of the hung JVM detected during test case generation or dynamic analysis, they are marked.
4
Results
The entire Java Runtime Libraries JRE v1.7.0 79 libraries (rt.jar) were subjected to Stalemate [1], the static analysis tool. The output files from the static analysis method were used as a starting point for the analysis. Table 1 lists the key details of the execution. We have been able to uncover deadlock in Java libraries that have not been demonstrated by other methods. We identified such cases and reported some of them to Oracle as bugs. Table 2 lists the bugs reported to Oracle. Table 1. Summary of execution Platform
MacOS High Sierra 17.3.0 Darwin Kernel Version 17.3.0, xnu-4570.31.3 1/RELEASE X86 64 x86 6
Java environment
java version “1.7.0 79”, Java(TM) SE Runtime Environment (build 1.7.0 79-b15), Java HotSpot(TM) 64-bit server VM (build 24.79-b02, mixed mode)
Execution summary Total tests executed Total prediction nodes analyzed Duplicates eliminated Nodes filtered from static analysis Total deadlocking scenarios detected
1,563 26,728 3,001 1,563 113
140
R. Rajesh Kumar et al. Table 2. Bugs reported to Oracle Deadlocking Java Library classes
Oracle JDK Bug ID
java.util.logging.Logger java.util.logging.LogManager
8194918
ava.awt.EventQueue 8194407 sun.awt.AppContext javax.swing.plaf.basic.BasicDirectoryModel sun.awt.X11.XToolkit java.util.Vector java.awt.EventQueue sun.awt.PostEventQueue java.awt.SentEvent sun.awt.SunToolkit
8194635
ava.awt.Component javax.swing.JFileChooser javax.swing.SwingUtilities java.awt.dnd.DropTarget javax.swing.JComponent
8194862
java.io.ObjectInputStream java.awt.KeyboardFocusManager java.awt.Component java.awt.Window java.awt.Frame
8194920
java.util.TimeZone java.util.Properties javax.naming.spi.DirectoryManager java.lang.SecurityManager java.util.Hashtable
8194919
java.awt.EventQueue java.awt.EventDispatchThread sun.awt.PostEventQueue
8194962
ava.awt.EventQueue 8194983 sun.awt.AppContext javax.swing.plaf.basic.BasicDirectoryModel sun.awt.X11.XToolkit
Automated Deadlock Detection for Large Java Libraries Table 3. Examples: deadlocking call cycles Calls between: java.util.logging.Logger java.util.logging.LogManager Forward calls Logger.getAnonymousLogger()
calls
LogManager.getLogger(String) Logger.getLogger(String,String)
calls
LogManager.getLogger(String) Logger.getLogger(String)
calls
LogManager.addLogger(Logger) Reverse calls LogManager.addLogger(Logger)
calls
Logger.getLogger(String) Calls between: sun.awt.PostEventQueue java.awt.EventQueue Forward calls PostEventQueue.flush()
calls
EventQueue.postEventPrivate(AWTEvent) Reverse calls EventQueue.removeSourceEvents(Component)
calls
java.awt.SentEvent.dispose(AppContext,SentEvent)
calls
PostEventQueue.postEvent(SentEvent) EventQueue.push(EventQueue)
calls
EventQueue.getNextEvent()
calls
SunToolkit.flushPendingEvents()
calls
PostEventQueue.flush() Calls between: sun.rmi.server.Activation sun.rmi.server.Activation$GroupEntry Forward calls Activation.addLogRecord(Activation$LogRecord)
calls
Activation$ActivationSystemImpl.shutdown()
calls
Activation.checkShutdown()
calls
Activation$GroupEntry.restartServices()
calls
Activation$GroupEntry.getInstantiator(ActivationGroupID) Activation.addLogRecord(Activation$LogRecord)
calls
SystemImpl.shutdown()
calls
Activation$GroupEntry.unregisterGroup() Reverse calls Activation$GroupEntry.setActivationGroupDesc(ActivationGroupID) calls Activation.addLogRecord(Activation$LogRecord)
141
142
R. Rajesh Kumar et al.
The results identified by our approach are reproducible. The static analysis [1] alone detected many thousands of potential deadlocks where one has to analyze the predictions manually to construct the deadlocking test cases, whereas we produce the deadlocking scenarios as output. The dynamic analysis [4] results are limited by the test cases that are subjected to it, hence it is not a viable tool in itself to analyze large libraries. Combining both together we have demonstrated an automated solution that is scalable for large libraries. The Table 3 shows few examples of interleaving call details of the deadlocks detected by our method. For brevity the package names and return values of the methods are omitted while representing the call flows.
5
Conclusions
From the above results we assess that the method described was able to deal with the scale what it was intended for. It was successfully able to scan through tens of thousands of potential deadlocking scenarios and created over hundred reproducible deadlocking test cases without any false positives. The approach was able to overcome the inherent limitation of the static approach of producing numerous ‘potential’ dead locking cases containing lot of false positives. Dynamic analysis on the other hand is effective in deadlock detection for programs for applications but not for libraries. It is limited by the coverage provided by the seed test case that is subjected to the analysis. The presented approach provides an effective way to leverage both static and dynamic analysis methods to produce a viable automated way to detect deadlock in large java libraries. Acknowledgements. We sincerely thank Dr. Murali Krishna Ramanathan for the discussion in formulating the problem. We also thank Malavika Samak and Dr. Murali Krishna Ramanathan for permitting us to use the program developed by them for dynamic deadlock detection.
References 1. Shanbhag, V.K.: Locating lock order violations in Java libraries - a scalable static analysis. Ph.D. dissertation. IIIT - Bangalore, Bangalore, Karnataka, India (2015). Reference [2] is the preliminary work of this thesis. Contact: IIIT-B Library ([email protected]) or Author ([email protected]) 2. Shanbhag, V.K.: Deadlock-detection in Java-library using static-analysis. In: 2008 15th Asia-Pacific Software Engineering Conference, Beijing, 2008, pp. 361–368 (2008). https://doi.org/10.1109/APSEC.2008.68 3. Pandey, S., Bhat, S., Shanbhag, V.: Avoiding deadlocks using stalemate and Dimmunix. In: Companion Proceedings of the 36th International Conference on Software Engineering (ICSE Companion 2014), pp. 602–603. Association for Computing Machinery, New York (2014). https://doi.org/10.1145/2591062.2591136
Automated Deadlock Detection for Large Java Libraries
143
4. Samak, M., Ramanathan, M.K.: Multithreaded test synthesis for deadlock detection. In: Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages and Applications (OOPSLA 2014), pp. 473–489. Association for Computing Machinery, New York (2014). https://doi. org/10.1145/2660193.2660238 5. Nagarakatte, S., Burckhardt, S., Martin, M.M.K., Musuvathi, M.: Multicore acceleration of priority-based schedulers for concurrency bug detection. In: Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2012), pp. 543–554. Association for Computing Machinery, New York (2012). https://doi.org/10.1145/2254064.2254128 6. Eslamimehr, M., Palsberg, J.: Sherlock: scalable deadlock detection for concurrent programs. In: Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2014), pp. 353–365. Association for Computing Machinery, New York (2014). https://doi.org/10.1145/2635868. 2635918 7. Jula, H., Tralamazza, D., Zamfir, C., Candea, G.: Deadlock immunity: enabling systems to defend against deadlocks. In: Proceedings of the 8th USENIX conference on Operating systems design and implementation (OSDI 2008), pp. 295–308. USENIX Association, USA (2008) 8. Jula, H., T¨ oz¨ un, P., Candea, G.: Communix: a framework for collaborative deadlock immunity. In: 2011 IEEE/IFIP 41st International Conference on Dependable Systems and Networks (DSN), Hong Kong, pp. 181–188 (2011). https://doi.org/ 10.1109/DSN.2011.5958217 9. Pradel, M., Gross, T.R.: Fully automatic and precise detection of thread safety violations. In: Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2012), pp. 521–530. . Association for Computing Machinery, New York (2012). https://doi.org/10.1145/2254064. 2254126 10. Pacheco, C., Ernst, M.D.: Randoop: feedback-directed random testing for Java. In: Companion to the 22nd ACM SIGPLAN Conference on Object-Oriented Programming Systems and Applications Companion (OOPSLA 2007), pp. 815–816. Association for Computing Machinery, New York (2007). https://doi.org/10.1145/ 1297846.1297902 11. Choudhary, A., Lu, S., Pradel, M.: Efficient detection of thread safety violations via coverage-guided generation of concurrent tests. In: 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE), Buenos Aires, pp. 266–277 (2017). https://doi.org/10.1109/ICSE.2017.32 12. Naik, M., Park, C.-S., Sen, K., Gay, D.: Effective static deadlock detection. In: Proceedings of the 31st International Conference on Software Engineering (ICSE 2009), pp. 386–396. IEEE Computer Society, USA (2009). https://doi.org/10.1109/ ICSE.2009.5070538 13. Naik, M., Aiken, A., Whaley, J.: Effective static race detection for Java. In: Proceedings of the 27th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2006), pp. 308–319. Association for Computing Machinery, New York (2006). https://doi.org/10.1145/1133981.1134018 14. Joshi, P., Park, C.-S., Sen, K., Naik, M.: A randomized dynamic program analysis technique for detecting real deadlocks. In: Proceedings of the 30th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2009), pp. 110–120. Association for Computing Machinery, New York (2009). https://doi. org/10.1145/1542476.1542489
144
R. Rajesh Kumar et al.
15. Williams, A., Thies, W., Ernst, M.D.: Static deadlock detection for Java libraries. In: Black, A.P. (ed.) ECOOP 2005. LNCS, vol. 3586, pp. 602–629. Springer, Heidelberg (2005). https://doi.org/10.1007/11531142 26 16. Cai, Y.: A dynamic deadlock prediction, confirmation and fixing frame- work for multithreaded programs. In: Doctoral Symposium of the 26th European Conference on Object-Oriented Programming (ECOOP 2012, DS) (2012)
DNet: An Efficient Privacy-Preserving Distributed Learning Framework for Healthcare Systems Parth Parag Kulkarni(B) , Harsh Kasyap, and Somanath Tripathy Department of Computer Science and Engineering, Indian Institute of Technology Patna, Patna, India {kulkarni.cs16,harsh 1921cs01,som}@iitp.ac.in
Abstract. Medical data held in silos by institutions, makes it challenging to predict new trends and gain insights, as, sharing individual data leaks user privacy and is restricted by law. Meanwhile, the Federated Learning framework [11] would solve this problem by facilitating ondevice training while preserving privacy. However, the presence of a central server has its inherent problems, including a single point of failure and trust. Moreover, data may be prone to inference attacks. This paper presents a Distributed Net algorithm called DNet to address these issues posing its own set of challenges in terms of high communication latency, performance, and efficiency. Four different networks have been discussed and compared for computation, latency, and precision. Empirical analysis has been performed over Chest X-ray Images and COVID-19 dataset. The theoretical analysis proves our claim that the algorithm has a lower communication latency and provides an upper bound. Keywords: Distributed learning · Federated Learning Binary Tree Representation · Privacy
1
· Healthcare ·
Introduction
The science of medicine has reached its advanced stage of research. New methodologies are taking shape, and efficient drugs are being developed. This research has been given wings, by the analysis of medical data using machine learning techniques. New deep learning methodologies have become the tool for gaining new insights without putting much of the domain knowledge. These techniques rely on training on a huge scale of data. However, the question is from where does this data come from and who holds access to it? How much sensitive information does the data leak? All these questions make the study challenging and demand new innovative ways of study. In the age of sensors and devices spanning around us, data is continuously generated and has value to contribute to research. Three sources mainly possess healthcare data - the patients, hospitals and medical stores. A patient himself/herself owns his/her medical records. He/she keeps records prescribed from c Springer Nature Switzerland AG 2021 D. Goswami and T. A. Hoang (Eds.): ICDCIT 2021, LNCS 12582, pp. 145–159, 2021. https://doi.org/10.1007/978-3-030-65621-8_9
146
P. P. Kulkarni et al.
doctors, medical stores, and self-owned smart devices like fit-bits. Hospitals hold records of multiple patients and are significant repositories of data. Medical stores also own health/drug related data and statistics, which can infer many trends. However, sharing this data often reveals more than required. For example, if an institution is running cancer research; the data being studied should only tell how the tumor looks like rather revealing that a person has cancer. The institutions will come across patients with different symptoms and remedies working for them. As collaboration becomes indispensable, it becomes vital to exchange this information for developing the best cure. However, these sources owning data pose a high risk of losing sensitive data. There can be misuse of confidential personal information to gain benefits. Patients who own the data do not know what it has been used for. In 2016, Google came with the Federated Learning framework that promised data security. Federated Learning keeps the data in-place and performs local model training. The training is iterated for thousands and millions of times for improving the model. These local models are sent to a central server and aggregated to obtain the global model. The global model is sent back to the devices for training in the next iteration. Though, Federated Learning improvises security and asserts privacy by keeping the data with the real owner. However, it becomes vulnerable to inference attacks by reverse engineering on the transmitted gradients. The presence of a trusted server is also questionable. It can be biased for some participants and may post malicious updates to others. It faces scalability issues for executing multiple tasks. It needs to instantiate individually for the operation of separate tasks. It also poses the risk of a single point of failure. As the central server reports to be down or broken, the complete system comes to a halt. As discussed above, Federated Learning poses threats from a malicious central server and requires high communication overhead. It also violates being genuinely decentralized. However, health care institutions are demanding a privacypreserving collaborative learning system to facilitate research advances. They would trust more on a peer-to-peer network that need not rely on any third party and also bear fault-tolerance, liveliness, and availability. Keeping the above-discussed issues and requirements into consideration, we propose a communication-efficient decentralized variant of Federated Learning. The significant contributions are as follows. – This paper presents a Distributed algorithm for in-place model training without the presence of central curator (DNet). – Four variants of the algorithm are proposed, and analysis of their efficiency recommends Binary Tree Representation technique due to its optimal performance. – This paper does an empirical and theoretical analysis of the algorithm and its variants, which shows our claim of communication efficiency. – The proposed methods have been tested on real-world Pneumonia as well as recently released COVID-19 datasets.
DNet
147
The remainder of this paper is organized as follows. Section 2 discusses the background and related works. Section 3 summarizes some traditional frameworks. Section 4 discusses the DNet algorithm along with the proposed framework. Section 5 explains the training and experiment methodology. Section 6 lists the results and shows some empirical comparisons. Section 7 concludes and briefs the scope for future work.
2
Background and Related Work
This section briefs the concept of Federated Learning and its application in healthcare. It discusses the existing works in the same direction. 2.1
Federated Learning
Federated Learning is a collaborative learning paradigm by training across decentralized applications holding similar chunks of data. It involves multiple participants with a central server. The central server is responsible for delineating the computation rounds, selecting the devices for participation, and aggregating their respective model updates to build a global model. It is privacy-preserving in nature as it holds the data to the device itself. Previous approaches used to collect all these data to the central server and train, which induces a high data leak risk and violates various security principles and pacts. Federated Learning brings the model to the data while keeping the data to the device itself. It is an iterative process that involves significant communication overhead and a threat of a malicious server. FedSGD (a variant of stochastic gradient descent (SGD)) is a widely used optimization and aggregation algorithm for federated averaging. SGD samples a subset of summand functions and updates the weight. w := w − η∇Qi (w)
(1)
SGD is effective and scalable for large training sets coupled with complex gradient calculation formulae. In FedSGD, each participant trains the model inplace with some random samples chosen in every iteration and sends the delta change in the gradient to the central aggregator. k k∈St nk Δwt (2) wt+1 = wt + η k∈St nk The central aggregator sums up the weighted contributions of the delta updates received from all the participants and updates the global weight. Let the system has a total K number of users. In every iteration, a fraction of clients participates, some may drop out. The set comprising of participating clients be St and nk be the number of samples held by client k with the server having a learning rate of η. Let wt be the global weight of the previous iteration, server updates it, and evaluates wt+1 using distributed approximate Newton method [14].
148
P. P. Kulkarni et al.
Federated Learning can help the institutions and individuals owning data to come forward and share the inference over their information to build better predictive models and making research advances. It will also make way for the players involved in healthcare to collaborate. Sensors and devices in hospitals, laboratories, health metrics on smartwatches, and medical reports of individuals can help doctors, scientists, and institutions to learn and prepare better diagnosis. 2.2
Related Work
There have been independent studies citing the benefits of Federated Learning in healthcare and aggregator free learning. Researchers have discussed how multiinstitutional secure learning infrastructure can be set up with the help of collaborative learning. For addressing issues of malicious server, various approaches involving cryptographic, distributed and decentralized-distributed techniques have been proposed [8,10,12]. Chen et al. [4] proposed a framework called Fedhealth, claiming to be the first federated transfer learning framework for wearable healthcare. They use existing deep learning techniques and additive homomorphic encryption for classification, distribution, and aggregation. In the proposed framework, the cloud works as a trusted central server, and the participants are present at remote locations. They claim to achieve an improvement of five to six percent accuracy. This work mainly focuses on the data islanding problem and aims to make personalized models for individuals. It lacks formal security analysis and privacy guarantees. Stephen et al. [16] demonstrated classification of positive and negative pneumonia data from a collection of chest X-ray images, relying heavily on the transfer learning approach using a Convolutional Neural Network (CNN) based model for the classification of the image data. Brisimi et al. [3] proposed a decentralized computationally scalable methodology for large-scale machine learning problems in the healthcare domain. It aims to solve a binary supervised classification problem to predict hospitalizations for cardiac events using a distributed algorithm. They proposed an iterative cluster Primal-Dual Splitting (cPDS) algorithm to solve the problem in a decentralized fashion. They achieved higher convergence at the cost of expensive communication among the agents. Liu et al. [9] proposed a new method called Federated-Autonomous Deep Learning (FADL) that relies upon training in a distributed manner across largely imbalanced scattered data. This work tried to handle the imbalance by training the first half of the neural globally using data collected from all sources. The second half is trained locally like traditional Federated Learning. They claimed to achieve similar performance to the conventional centralized and Federated Learning methods. Lu et al. [10] took the communication problem of the distributed training into account and proposed an efficient framework with low latency over a fully
DNet
149
decentralized network over the graph. They did the empirical analysis over a fully decentralized non-convex stochastic algorithm, which involves only local updates and communication among the participating nodes. They emphasized the local updates and considered to repeat it for the improved local model in each update. However, they did not talk about the network structure and how to improve the latency over it. Xu et al. [18] discussed the challenges of incorporating Federated Learning in healthcare. They summarized solutions to system and statistical challenges as well as the privacy issues in Federated Learning. They also analyzed different frameworks123 for experiments and simulation over the heterogeneous data available. Shokri et al. [15] and Deist et al. [5] discussed about pre-federated distributed learning techniques. They proposed key technical innovations like selective sharing of model parameters during training and usage of Support Vector Machine (SVM), solved with Alternating Direction Method of Multipliers (ADMM). Ramaswamy et al. [13] shed light on working over large scale applications. Konevcny et al. [6], Wang et al. [17], Bonawitz et al. [2] and Agarwal et al. [1], made efforts for devising methods of distributed learning, for secure and communication efficient learning. They sketched different algorithms like PRLC, autotuned SecAgg, and cpSGD. Kuo et al. [7] proposed distributed learning using Blockchain technology, which assumes hierarchical network-of-networks.
3
Traditional Frameworks
Most discussed approaches of decentralized Federated Learning use Fully Connected networks. The other less discussed strategies are Pseudorandom, Random, and Cyclic. All of these approaches have their trade-offs like high latency, fault tolerance, and efficiency. We review all these variants in this section, and propose a Binary Tree Representation for the distributed network gaining high efficiency, in the next section, along with the DNet algorithm. The three traditionally discussed frameworks for distributed networks could be described as follows. In the Pseudorandom framework, one node of the network is chosen randomly as root (n1 in Fig. 1a). It establishes connections with all the other nodes. Other nodes make a connection in the network with a probability of 40%. Figure 1a illustrates the Pseudorandom architecture. Figure 1b illustrates the Cyclic framework, where all nodes are connected to the next node in a cyclic fashion, i.e., one node is connected to two other adjacent nodes, making a cycle. In the Fully Connected framework, each node establishes a connection to the remaining nodes in the network. Figure 1c illustrates Fully Connected architecture. 1 2 3
AI, W.: Federated ai technology enabler (2019), https://www.fedai.org/cn/. Google.: Tensorflow federated (2019), https://www.tensorflow.org/federate. OpenMined: Pysyft-tensorflow (2019), https://github.com/OpenMined/PySyftTensorFlow.
150
P. P. Kulkarni et al.
Pseudorandom
Cyclic
Fully connected
Fig. 1. Traditional distributed net structures (for 5 nodes)
4
Distributed Net
This section discusses the proposed algorithm, which is an aggregator-free Federated Learning approach. It presents an entirely distributed, decentralized system and addresses the existing communication and latency issues. The basic system setup for Distributed Net architecture with N nodes in the network is as follows. The problem is formulated based on any participant’s local data and model. The participant seeks collaboration for an improved global model (It is assumed that the participants are identified with the intersecting data and can contribute to the network). The datasets held by the participants are locally preprocessed, and the respective labels are one-hot encoded. For the purpose of running the experiments, out of Ttraining + Ttest samples of a dataset, each node is given Ttraining /N training samples and Ttest /N test samples.
DNet
151
Each node trained on (V/V + 1) * (Ttraining /N) samples and validated on (1/V + 1) * (Ttraining /N) samples, where V:1 is train-validation split. Different distributed nets are trained till convergence, and test results are predicted according to the distributed net training algorithm (DNet). 4.1
Training Algorithm (DNet)
Algorithm 1 explains the DNet training algorithm. This algorithm can be used to train any framework (including the ones discussed in previous sections), only by changing the aggregation and propagation algorithms in steps 10 and 11 depending on the DNet variant.
Algorithm 1: Distributed Net Training Algorithm (DNet) 1 2 3 4 5 6 7 8 9
10 11 12
Train with local data, get wj ini , ∀jn(nodes) while epochs(i) not complete do for every node(j) in the net do for every connection(c) from the node do Send wj i from nj to nc wc i = f ederated average(wj i , wc i ) Train nc to get wc i+1 end end /* Aggregation and propagation (Algorithm 2 for Binary Tree Representation). Varies as per the type of framework wroot i+1 = distributed average(wj i+1 ∀jn) wj i+1 = wroot i+1 ∀jn end
*/
First, all the nodes are trained locally once, to get the initial weights. After that, in every epoch, each node sends its weights to the nodes it is connected to. The receiver nodes take the federated average the incoming and local weights, and then train the model with those weights. Then all weights are aggregated, averaged and propagated, according to different strategies depending on the type of network (For an instance of our network, refer Algorithm 2). Thus, all the nodes have the trained weights at the end of each epoch. 4.2
Binary Tree Representation
This is the proposed architecture for the aggregator free and communicationefficient Federated Learning. Each node is connected to two other nodes like the Cyclic framework but rather form a binary tree instead of a cycle. It has minimum number of connections and ensures that each node is visited only once. Figure 2a and b illustrate the Binary Tree Representation and its tree view.
152
P. P. Kulkarni et al.
Binary Tree Representation
Binary Tree Representation (Binary Tree View)
Fig. 2. Binary Tree Representation structures
Algorithm 2 lists down the steps involved in the aggregation and propagation of the final model weights in every epoch. The aggregation part, recursively sends the weights of each node to its parent, and sums them up. Eventually, the sum of all the weights reaches the root. At the root, it is divided by the number of nodes to get the average. The propagation part, simply recursively propagates the final weights from every parent node to its children, starting from the root and eventually reaching all the nodes. We can note that, aggregation and propagation happens only on the last step of every epoch, following the same path of Binary Tree, in reverse direction (from leaf to root) for aggregation, and in forward direction (from root to leaf), for propagation. Hence, the extent of privacy preservation is unchanged.
Algorithm 2: Aggregation and Propagation in Binary Tree Representation (Distributed Net) 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Function aggregation(nroot , connection matrix, n): sum = wroot for all connections from nroot to nj do Send aggregation(nj , connection matrix, n) from nj to nroot sum = sum + aggregation(nj , connection matrix, n) end return sum wroot = sum/no. of nodes Function Propagation(nroot , connection matrix, n): for all connections from nroot to nj do Send wroot from nroot to nj wj = wroot Propagation(nj , connection matrix, n) end
DNet
4.3
153
Theoretical Analysis
This provides proof for the claim of gaining communication efficiency in the decentralized Federated Learning using Binary Tree Representation. The system has N nodes connected in a Binary Tree Representation network (Fig. 2a). In every epoch of training, all nodes perform total 3(N – 1) data transfers (First, weights are passed via the (N – 1) connections, which accounts for (N – 1) transfers. Aggregation of weights of all nodes to the root and propagation of average weights to all nodes, require (N – 1) transfers each. Adding all three, we get a total of 3(N – 1) transfers per epoch). Let data transferred per epoch, per node, be d units. Hence, total data transferred per epoch by all the nodes would be Nd. Total data transferred Now, total data transferred per transfer, would be Total number of transfers which Nd is 3(N −1) . The communication latency (L) can be seen as L = Ls + Lr + Ltrans where, Ls is latency at sender, Lr is latency at receiver and Ltrans is the transmission latency. We are interested in calculating the upper bound for the communication bandwidth. We will calculate the value of γ and bound it to prove our claim. γ= =
L Lavg Ls + Lr + Ltrans Ls,avg + Lr,avg + Ltrans,avg
Let port bandwidths for sender and receiver be Bs and Br , respectively. Thus, the time required to transfer data would be μs = Bds and μr = Bdr respectively. As per Queuing theory, latency at any node x can be represented as below, where λ is the arrival rate of updates at the device assuming Poisson process. 1 λ + μx 2μx (μx − dλ) d dλ = + Bx 2Bx (Bx − dλ)
Lx =
After substituting the values of latency at nodes (Lx ) assuming it as sender and receiver respectively, γ can be expressed as: γ=
d Bs Nd 3(N −1)
Bs
≤ γ≤
+
+
dλ 2Bs (Bs −dλ) Nd λ 3(N −1)
Nd 2Bs (Bs − 3(N λ) −1)
d d d Bs + Br + Btrans N d d d 3(N −1) [ Bs + Br + Btrans ]
+ + + +
d Br
+
Nd 3(N −1)
Br
dλ 2Br (Br −dλ)
+
+
d Btrans
Nd λ 3(N −1)
Nd 2Br (Br − 3(N λ) −1)
+
Nd 3(N −1)
Btrans
dλ dλ 2Bs (Bs −dλ) + 2Br (Br −dλ) N dλ dλ 3(N −1) [ 2Bs (Bs −dλ) + 2Br (Br −dλ) ]
3(N − 1) N
The above-stated proof states the upper bound as 3(NN− 1) for the Binary Tree Representation of DNet. For the other discussed networks, i.e., the Pseudorandom, Cyclic and Fully Connected, with a total number of transfers being
154
P. P. Kulkarni et al.
∼(0.4N + 2)(N – 1), 3N and (N + 2)(N – 1) respectively, the bounds turn out − 1) . It justifies our claim of proposing an to be (0.4N +N2)(N − 1) , 3 and (N + 2)(N N efficient communication framework for decentralized Federated Learning.
5
Experiments
We demonstrate experiments using Convolutional Neural Networks (CNNs) over two medical datasets with the proposed architecture. This section will discuss the dataset and the deep learning architecture used for training. 5.1
Datasets
Two datasets are primarily used for evaluating the proposed DNet algorithm with Binary Tree Representation and compared with other existing frameworks. Chest X-ray Images (Pneumonia) Dataset4 – Number of samples: 5,840 – Sample desc: X-ray images (varying dimensions) of anterior-posterior chests, carefully chosen from retrospective pediatric patients between 1–5 years old. – Number of classes: 2 (pneumonia (4,265), normal (1,575)) – Figure 3a and b plots some of the samples of this dataset
Pneumonia
Normal
COVID
Normal
Fig. 3. Examples from the datasets Pneumonia dataset: a and b | COVID-19 dataset: c and d
COVID-19 and Normal Posteroanterior (PA) X-rays Dataset5 – – – – 4 5
Number of samples: 280 Sample desc: Posteroanterior chest X-ray images (varying dimensions) Number of classes: 2 (covid (140), normal (140)) Figure 3c and d plots some of the samples of this dataset https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia. https://www.kaggle.com/tarandeep97/covid19-normal-posteroanteriorpa-xrays.
DNet
5.2
155
Training
This section describes the training over both the datasets discussed above using the DNet algorithm with different architectures, with the Convolutional Neural Network (CNN) deep learning configuration as mentioned in Table 1. Table 1. CNN configuration Layer (type) Output shape Conv2D
(None, 198, 198, 32)
MaxPooling2 (None, 99, 99, 32) Conv2D
(None, 97, 97, 64)
MaxPooling2 (None, 48, 48, 64) Conv2D
(None, 46, 46, 128)
MaxPooling2 (None, 23, 23, 128) Conv2D
(None, 21, 21, 128)
MaxPooling2 (None, 10, 10, 128) Flatten
(None, 12800)
Dropout
(None, 12800)
Dense
(None, 512)
Dense
(None, 2)
First, we carried the experiment over the Chest X-ray (Pneumonia) dataset. It is a comprehensive dataset with 5840 X-ray images of pneumonia patients and healthy people, thus being a binary classification candidate. The algorithm’s effectiveness was tested on a domain problem in this experiment. The dataset was split into a 75:25(3:1) train:test set, implying 4380 training images and 1460 test images. The base classifier was trained on the complete training set (with a 4:1 train:validation split) and tested on 1460 samples. It simulates the scenario where the data from all nodes is pooled, for ground values without federated setup. The complete training ran for 10 epochs. In the federated experiment, the training was conducted in an environment with 5 nodes. The number of local training epochs for each was set to be 10. The number of global iterations to converge the training was set to 2. We implemented all the code in Keras 2.2.4. The optimizer used was Adam and the loss function used for training, was binary crossentropy. The second experiment was conducted over the COVID-19 and Normal Posteroanterior (PA) X-rays dataset. We ran this experiment to verify our proposed method’s performance for a minimal amount of data. It ran with precisely the same configuration, parameters, and hyperparameters as of the first experiment.
156
6
P. P. Kulkarni et al.
Results and Discussion
This section discusses the empirical results of all the different experiments performed. It also shows the comparison among the different architectures used with DNet. The results for both the experiments carried out, i.e., over the Chest Xray (Pneumonia) dataset and over the small COVID 19 dataset, for five nodes with all the variants of the DNet algorithm are compiled in Table 2. We can infer that the proposed Binary Tree Representation architecture achieves similar performance in less than half the number of computations compared to the Fully Connected variant. The Binary Tree Representation outperforms the traditional variants with the least number of computations. Table 2. Results of the experiments DNet architecture
Dataset
n2
n3
n4
n5
Average Computations per epoch
Base classifier
Pneumonia COVID 19 -
n1
-
-
-
-
94.66 95.29
Pseudorandom
Pneumonia 97.6 COVID 19 100
95.21 96.58 94.18 95.55 95.824 92.86 100 100 100 98.57
17
Cyclic
Pneumonia 96.92 95.55 95.55 93.49 92.81 94.864 COVID 19 92.86 100 92.86 92.86 85.71 92.86
15
Fully Connected
Pneumonia 96.23 95.55 97.6 COVID 19 85.71 85.71 100
95.89 95.2 96.092 100 78.57 90
28
Binary Tree Representation Pneumonia 93.49 92.12 94.86 95.21 94.86 94.108 COVID 19 100 100 100 100 92.86 98.57
12
-
From the results of the experiments conducted, we can conclude the following about DNet: – DNet performs well on Image data, which is evident from the results of both the experiments – DNet handles a real-world healthcare task of image classification, very efficiently. This is made clear by the results of the main experiment. – DNet also passed the test of performing in the scenario of very less data to train. This is made clear by the results of the second experiment. All the variants of DNet are on par with the Base classifier in terms of accuracy, while being much more secure and privacy-preserving. The main experiment with the Chest X-ray data showed that this model is feasible to be applied in the healthcare domain and thus, the model established itself. The results show that the DNet was on par with the base classifier and hence, the experiment was a success. The second experiment was in a different setting, with very less amount of data. Empirically, the results show that the model does perform well in that scenario too. In the case of a very small number of data points, the results tend
DNet
157
to depend on particular data points, which means that the results can change drastically due to even one misclassification by the model. Thus some random erratic behavior, for e.g. lower accuracy in node n5, is observed. But overall, the performance of the model was satisfactory. Comparing the DNet frameworks with each other, we see that all four tend to perform similarly, as results show that all of them are on par with the base classifier, and show about equivalent accuracy measure, in all cases. The main difference comes in the efficiency part. As we can see from Table 2, the Cyclic and Binary Tree Representation frameworks achieve similar accuracy to the other nets, but in very few numbers of computations in comparison. The rate at which the number of computations increases with the number of nodes is as shown in Fig. 4. We see that Fully Connected framework has a very high rate of increase in the number of computations as the number of nodes increase. The Pseudorandom framework follows with a considerably high rate of increase. Meanwhile, Cyclic and Binary Tree Representation frameworks show a linear increase rate, thus being much more efficient than the other two. (Binary being slightly better, as a number of computations for N nodes, 3N for Cyclic and 3(N – 1) for Binary Tree Representation). In Fig. 4, due to scale, lines representing Cyclic and Binary Tree Representation networks are very close to each other, Cyclic being slightly above. As far as robustness is concerned, the proposed Binary Tree Representation framework outclasses the Cyclic one. The collapse of even one node in the Cyclic framework would completely shut down the training procedure. While, in case of Binary Tree Representation network, the part of the tree below the malfunctioned node would get cut off from the training, but this won’t stop the process altogether. Thus, considering factors of performance, efficiency, and robustness, the proposed Binary Tree Representation framework is the best option among the four DNet variants we have studied. There are also certain limitations, a far as the DNet and the Binary Tree Representation are concerned. In case of the DNet as a whole, the training procedure is slow, as compared to a traditional Federated Learning scheme. The absence of a central server, gives rise to more non-simultaneous transfers between nodes, thus increasing the training time, at the benefit of increase in privacy. The proposed Binary Tree Representation framework is very rigid, and thus cannot incorporate nodes entering and exiting while the training is in progress. The number of nodes needs to be fixed beforehand. Also, the framework lacks differential privacy. These are some problems which could be addressed in the future.
158
P. P. Kulkarni et al.
Fig. 4. Rate of increase in the number of computations
7
Conclusion and Future Work
This paper proposed the DNet algorithm with Binary Tree Representation and compared it with the existing frameworks. It aims to set up a decentralized Federated Learning system with low communication latency. It achieved slightly less accuracy by 2% compared to a Fully Connected framework by reducing communication by more than half. We theoretically proved that the communication latency of the Binary Tree Representation has the lower upper bound as compared to the Pseudorandom, Cyclic, and Fully Connected networks. The network is privacy-preserving in nature but prone to inference and poisoning attacks. Many works are already going in this direction and should be incorporated to make this model more secure. There is also scope for research to integrate Multimodality, Selective Parameter Sharing, Transfer Learning, Interoperability, and local compensation for making the network more robust.
References 1. Agarwal, N., Suresh, A.T., Yu, F.X.X., Kumar, S., McMahan, B.: cpSGD: communication-efficient and differentially-private distributed SGD. In: Advances in Neural Information Processing Systems, pp. 7564–7575 (2018) 2. Bonawitz, K., Salehi, F., Koneˇcn` y, J., McMahan, B., Gruteser, M.: Federated learning with autotuned communication-efficient secure aggregation. arXiv preprint arXiv:1912.00131 (2019) 3. Brisimi, T.S., Chen, R., Mela, T., Olshevsky, A., Paschalidis, I.C., Shi, W.: Federated learning of predictive models from federated electronic health records. Int. J. Med. Inform. 112, 59–67 (2018)
DNet
159
4. Chen, Y., Qin, X., Wang, J., Yu, C., Gao, W.: Fedhealth: a federated transfer learning framework for wearable healthcare. IEEE Intell. Syst. 35, 83–93 (2020) 5. Deist, T.M., et al.: Infrastructure and distributed learning methodology for privacypreserving multi-centric rapid learning health care: euroCAT. Clin. Transl. Radiat. Oncol. 4, 24–31 (2017) 6. Koneˇcn` y, J., McMahan, H.B., Yu, F.X., Richt´ arik, P., Suresh, A.T., Bacon, D.: Federated learning: strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492 (2016) 7. Kuo, T.T., Kim, J., Gabriel, R.A.: Privacy-preserving model learning on a blockchain network-of-networks. J. Am. Med. Inform. Assoc. 27, 343–354 (2020) 8. Lalitha, A., Kilinc, O.C., Javidi, T., Koushanfar, F.: Peer-to-peer federated learning on graphs (2019) 9. Liu, D., Miller, T., Sayeed, R., Mandl, K.D.: FADL: federated-autonomous deep learning for distributed electronic health record. arXiv preprint arXiv:1811.11400 (2018) 10. Lu, S., Zhang, Y., Wang, Y., Mack, C.: Learn electronic health records by fully decentralized federated learning (2019) 11. McMahan, H.B., Moore, E., Ramage, D., Arcas, B.A.: Federated learning of deep networks using model averaging. CoRR abs/1602.05629 (2016). http://arxiv.org/ abs/1602.05629 12. Ramanan, P., Nakayama, K.: BAFFLE: blockchain based aggregator free federated learning (2019) 13. Ramaswamy, S., Mathews, R., Rao, K., Beaufays, F.: Federated learning for emoji prediction in a mobile keyboard. arXiv preprint arXiv:1906.04329 (2019) 14. Shamir, O., Srebro, N., Zhang, T.: Communication efficient distributed optimization using an approximate newton-type method. CoRR abs/1312.7853 (2013). http://arxiv.org/abs/1312.7853 15. Shokri, R., Shmatikov, V.: Privacy-preserving deep learning. In: Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pp. 1310–1321 (2015) 16. Stephen, O., Sain, M., Maduh, U.J., Jeong, D.U.: An efficient deep learning approach to pneumonia classification in healthcare. J. Healthc. Eng. 2019 (2019) 17. Wang, H., Qu, Z., Guo, S., Gao, X., Li, R., Ye, B.: Intermittent pulling with local compensation for communication-efficient federated learning. arXiv preprint arXiv:2001.08277 (2020) 18. Xu, J., Wang, F.: Federated learning for healthcare informatics. arXiv preprint arXiv:1911.06270 (2019)
Memory Optimized Dynamic Matrix Chain Multiplication Using Shared Memory in GPU Girish Biswas(B) and Nandini Mukherjee Department of Computer Science and Engineering, Jadavpur University, Kolkata, IN, India [email protected], [email protected]
Abstract. Number of multiplications needed for Matrix Chain Multiplication of n matrices depends not only on the dimensions but also on the order to multiply the chain. The problem of multiplication. Dynamic pro is to find the optimalorder 3 2 gramming takes O n time, along with O n space in memory for solving this problem. Now-a-days, Graphics Processing Unit (GPU) is very useful to the developers for parallel programming using CUDA computing architecture. The main contribution of this paper is to recommend a new memory optimized technique to solve the Matrix Chain Multiplication problem in parallel using GPU, mapping diagonals of calculation tables m[][] and s[][] into a single combined calculation table of size O n2 for better memory coalescing in the device. Besides optimizing the memory requirement, a versatile technique of utilizing Shared Memory in Blocks of threads is suggested to minimize time for accessing dimensions of matrices in GPU. Our experiment shows best ever Speedup as compared to sequential CPU implementation, run on large problem size. Keywords: GPU · CUDA · Matrix chain · Memory mapping · Dynamic programming · Memory optimized technique
1 Introduction Graphics Processing Unit (GPU) is a common architecture in today’s machines that can provide high level of performance in graphical platform using many-core processors. Modern GPU offers the developers to use all cores of processors simultaneously to parallelize the general purpose computing. Many studies [2, 4, 6, 8, 9] have been carried out till date to implement parallel algorithms in CUDA for general computational problems. There are some Streaming Multiprocessors (SM) in a GPU device and each SM comprises of many cores (Fig. 1) which may be allocated to threads in parallel. The whole computation in the device is done over a Grid of some Blocks, where each Block is constituted of some number of threads. NVIDIA GPUs provide the parallel programming architecture, called CUDA (Compute Unified Device architecture) [5]. Using GPU architecture for solving the optimization problems with large number of combinations is challenging due to limited memory in the device and minimum dependency between different threads. The problem of Matrix Chain Multiplication arises in © Springer Nature Switzerland AG 2021 D. Goswami and T. A. Hoang (Eds.): ICDCIT 2021, LNCS 12582, pp. 160–172, 2021. https://doi.org/10.1007/978-3-030-65621-8_10
Memory Optimized Dynamic Matrix Chain Multiplication
161
many real time applications such as image processing, modern physics and modelling etc. This optimization problem needs to find the optimal order of multiplication of the matrices. Dynamic programming approach may be applied to find the optimal solution using GPU [2] with diagonal mapped matrices of calculation for assuring coalesced memory access. This approach uses two 2D calculation tables m[][] and s[][], each of n rows and n columns. So, O 2 · n2 size memory space is required to be allocated in GPU device, which is made of limited space. The main contribution of our paper is to an efficient way of using only a suggest single combined calculation table of size O n2 , to be allocated in the device, mapping diagonals of both of m[][] and s[][] into it (Fig. 3). Our memory optimized approach maintains memory coalescing and shows better results choosing proper Block-size in GPU (Sect. 4.B) for the varying number of elements in the diagonals of the tables. Also, a simple trick is taken in our study to use Shared Memory to store only the required dimensions of matrices (Fig. 4) in Blocks of threads executing in parallel in GPU to reduce the access time for accessing dimensions of matrices to enrich the Speedup even more, compared with sequential implementation run in CPU for large datasets. This paper presents the most effective technique with respect to both memory requirements and performance. The paper is organized as follows, Sect. 2 illustrates the idea of CUDA programming architecture in GPU device. Section 3 provides the Dynamic Programming Techniques to solve the Matrix Chain Multiplication Problem in GPU. Section 4 discusses about our Proposed Approach. Section 5 discusses about the results. Section 6 concludes the paper. 1.1 Previous Works As per our knowledge, very few studies have been made on Matrix Chain Multiplication optimization problem until now for parallelizing the problem especially through GPU. In 2011 Kazufumi Nishida, Yasuaki Ito and Koji Nakano [2] proposed an efficient technique of memory mapping to ensure coalesced memory access for accelerating the dynamic programming for the Matrix Chain Multiplication in GPU. The diagonals of m[][] and s[][] were mapped into rows of 2D arrays in a manner that all elements of a diagonal are consecutive in nature. They have passed all the diagonals of m and s tables to GPU one by one for computation of all elements of each diagonal by Blocks of threads in size of n matrices, this approach takes memory space of size But, for aproblem parallel. 2 · n2 , where O n2 size of memory is wasted. In addition, much time is wasted in accessing the array of dimensions from Global Memory of GPU by the threads of each Block which can be further accelerated with the help of Shared Memory. Mohsin Altaf Wani and S.M.K Quadri [4] presented an accelerated dynamic programming on GPU in 2013. They have not used any mapping of m table, but simply used single Block of threads for the computation of a single diagonal of m where each thread independently calculates some elements of that diagonal in parallel. This approach suffers from non-coalesced memory access and does not use multiple Blocks of threads also.
162
G. Biswas and N. Mukherjee
2 GPU and CUDA NVIDIA introduced CUDATM , a general purpose computing architecture in GPU in 2006. This massive parallel computing architecture can be applied to solve complex computational problems in highly efficient manner with respect to equivalent sequential solution implemented on CPU. Developers may use the high-level language, C in programming with CUDA [3]. 2.1 GPU Architecture GPU consists of several Streaming Multiprocessors (SM) with many cores and mainly two types of memory: Global Memory, Shared Memory (Fig. 1) [3]. Also, each SM has number of registers which are fastest and local to SM. Each SM has its own Shared Memory, which can be as fast as registers when bank conflict does not happen. Global Memory of higher memory capacity is potentially 150× slower than Shared Memory. 2.2 CUDA In programming architecture of CUDA [3], parallelism is achieved with bunch of threads combined into a Block and multiple Blocks combined into a Grid. Each Block is always assigned to a single SM while the threads in a Block are scheduled to compute as a warp of 32 threads at a time. All threads of a Block can be synchronized within that Block and can access the Shared Memory of that assigned SM only. CUDA permits programmers to write C function, called Kernel for parallel computations in GPU using Blocks of threads. 2.3 Coalesced Memory Access If access requests to Global Memory from multiple threads can be assembled into contiguous location accesses or same location access, this request can be performed at once which is known as coalesced memory access [3]. As a result of such memory coalescing, Global Memory works nearly as fast as register memory.
Fig. 1. GPU computing architecture in CUDA
Memory Optimized Dynamic Matrix Chain Multiplication
163
2.4 Shared Memory and Memory Banks Shared Memory [3, 10] is a collection of multiple banks of equal size which could be accessed simultaneously. Any memory accesses of n addresses from n distinct memory banks can effectively be serviced simultaneously. If multiple threads request to the same bank and to the same address, it is accessed only once and served to all those threads concurrently as multicast at once. However, multiple access requests to different addresses from same bank lead to Bank Conflict, which needs much time as requests are served serially [11]. For the GPU devices of compute capability ≥ 2.x, there are 32 banks with each bank of 32-bits long whereas the warp size is of 32 threads. If multiple threads of a warp try to access data from the same memory address with same bank, there happens no bank conflict also which is termed as multicast. When used with 1Byte/2Byte long data in Shared Memory, each bank contains more than one data. In this case also there is no bank conflict if threads access these data from single bank, as this is taken as multicast [12] in GPU device with compute capability ≥ 2.x. GPU devices of compute capability = 2.x have the default settings of 48 KB Shared Memory/16 KB L1 cache.
3 Matrix Chain Multiplication Problem Matrix Chain Multiplication or Matrix chain ordering problem requires to finding the best order for multiplying the given sequence of matrices so that the least number of multiplications are involved. This is merely an optimization problem using the associative property of matrix multiplication. Actually the solution is to provide the fully parenthesized chain of matrices through the optimal order of multiplication. Provided the Matrix Chain, containing n matrices {A1 , A2 , . . . ..An }, is to be multiplied where {d1 , d2 , d3 , . . . ..dn , dn+1 } is the set of all dimensions of these matrices, described as follows: A1 : d1 × d2
A2 : d2 × d3
. . . . . . . . . . . . . . . ..
An : dn × dn+1
Now, the problem is to find the order in which the computation of the product A1 × A2 × . . . .. × An needs the minimum number of multiplications and hence find the fully parenthesized chain denoting the optimal order of multiplication. 3.1 Solving Technique in Dynamic Programming Dynamic programming is useful for storing the solutions of small sub-problems and reusing them step by step to combine into greater problems and finally finding the solution of the given problem in time efficient approach. Dynamic programming technique [1] makes it easy to solve the above Matrix Chain Multiplication problem with n matrices using a m[][] table, where m i, j denotes the minimum number of multiplications needed
164
G. Biswas and N. Mukherjee
tocompute the sub-problem for 1 < i < j < n. Minimum cost m i, j is calculated using the following recursion: if i = j 0 (1) min {m[i, k] + m k + 1, j + d d d if i < j i j+1 k i>
1
2
3
(b) s table
4
5
43000
49500 2
2
35000
40000
44500 3 i 3
60000
30000
37500
22500 4
4
25000
43750
17500
14000
12000 5
5
0
0
0
0
diagonal-3
diagonal-1
0
j => 1
diagonal-4
diagonal-2
6
53000 1
diagonal-6 diagonal-5
> 6= al on ag di
2
> 5= al on ag di
1
j =>
1
> 4= al on ag di
> 3= al on ag di => l2
a on ag di
> 6= al on ag di > 5= al on ag di > 4= al on ag di > 3= al on ag di > 2= al on ag di > 1= al on ag di
j =>
165
(c) mapped m table for Coalesced Access
0
6
6
1
2
0
3
4
5
6
3
0 and a = α. The size of a minimum defensive alliance is the minimum a satisfying dpr (∅, n, a, a) = true and a > 0. 4.2
Offensive Alliance
We also obtain the following result: Theorem 5 (). Given an n-vertex graph G and its nice tree decomposition T of width at most k, the size of a minimum offensive alliance of G can be computed in O(2k n2k+6 ) time.
186
5
A. Gaikwad et al.
Conclusion
In this work we proved that Defensive Alliance and Offensive Alliance problems are FPT when parameterized by neighbourhood diversity; Offensive Alliance is FPT when parameterized by domino treewidth; Defensive Alliance and Offensive Alliance problems are solvable in polynomial time on graphs of bounded treewidth. The paramererized complexity of different kinds of alliances such as offensive alliance or powerful alliance remains unsettled when parameterized by clique-width, treewidth or pathwidth. Acknowledgement. We are grateful to the referees for thorough reading and comments that have made the paper better readable. We would like to thank Prof. Saket Saurabh, IMSc Chennai, for giving us the open problems on defensive and offensive alliances.
References 1. Bliem, B., Woltran, S.: Defensive alliances in graphs of bounded treewidth. Discret. Appl. Math. 251, 334–339 (2018) 2. Bodlaender, H.L., Engelfriet, J.: Domino treewidth. J. Algorithms 24(1), 94–123 (1997) 3. Chang, C.-W., Chia, M.-L., Hsu, C.-J., Kuo, D., Lai, L.-L., Wang, F.-H.: Global defensive alliances of trees and cartesian product of paths and cycles. Discret. Appl. Math. 160(4), 479–487 (2012) 4. Chellali, M., Haynes, T.W.: Global alliances and independence in trees. Discuss. Math. Graph Theory 27(1), 19–27 (2007) 5. Cygan, M., et al.: Parameterized Algorithms. Springer, Cham (2015). https://doi. org/10.1007/978-3-319-21275-3 6. Enciso, R.: Alliances in graphs: parameterized algorithms and on partitioning series-parallel graphs. Ph.D. thesis, USA (2009) 7. Fellows, M.R., Lokshtanov, D., Misra, N., Rosamond, F.A., Saurabh, S.: Graph layout problems parameterized by vertex cover. In: Hong, S.-H., Nagamochi, H., Fukunaga, T. (eds.) ISAAC 2008. LNCS, vol. 5369, pp. 294–305. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-92182-0 28 8. Fernau, H., Raible, D.: Alliances in graphs: a complexity-theoretic study. In: Proceeding Volume II of the 33rd International Conference on Current Trends in Theory and Practice of Computer Science (2007) 9. Fernau, H., Rodr´ıguez-Vel´ azquez, J.A., Sigarreta, J.M.: Global r-alliances and total domination. In: CTW (2008) 10. Fernau, H., Rodr´ıguez, J.A., Sigarreta, J.M.: Offensive r-alliances in graphs. Discret. Appl. Math. 157(1), 177–182 (2009) 11. Fricke, G., Lawson, L., Haynes, T., Hedetniemi, M., Hedetniemi, S.: A note on defensive alliances in graphs. Bull. Inst. Comb. Appl. 38, 37–41 (2003) 12. Jamieson, L.H.: Algorithms and complexity for alliances and weighted alliances of various types. Ph.D. thesis, USA (2007) 13. Jamieson, L.H., Hedetniemi, S.T., McRae, A.A.: The algorithmic complexity of alliances in graphs. J. Comb. Math. Comb. Comput. 68, 137–150 (2009) 14. Kannan, R.: Minkowski’s convex body theorem and integer programming. Math. Oper. Res. 12(3), 415–440 (1987)
Parameterized Complexity of Defensive and Offensive Alliances in Graphs
187
15. Kiyomi, M., Otachi, Y.: Alliances in graphs of bounded clique-width. Discret. Appl. Math. 223, 91–97 (2017) 16. Kristiansen, P., Hedetniemi, M., Hedetniemi, S.: Alliances in graphs. J. Comb. Math. Comb. Comput. 48, 157–177 (2004) 17. Lampis, M.: Algorithmic meta-theorems for restrictions of treewidth. Algorithmica 64, 19–37 (2012) 18. Lenstra, H.W.: Integer programming with a fixed number of variables. Math. Oper. Res. 8(4), 538–548 (1983) 19. Robertson, N., Seymour, P.: Graph minors. III. Planar tree-width. J. Comb. Theory Ser. B 36(1), 49–64 (1984) 20. Rodr´ıguez-Vel´ azquez, J., Sigarreta, J.: Global offensive alliances in graphs. Electron. Notes Discret. Math. 25, 157–164 (2006) 21. Sigarreta, J., Bermudo, S., Fernau, H.: On the complement graph and defensive k-alliances. Discret. Appl. Math. 157(8), 1687–1695 (2009) 22. Sigarreta, J., Rodr´ıguez, J.: On defensive alliances and line graphs. Appl. Math. Lett. 19(12), 1345–1350 (2006) 23. Sigarreta, J., Rodr´ıguez, J.: On the global offensive alliance number of a graph. Discret. Appl. Math. 157(2), 219–226 (2009) 24. Tedder, M., Corneil, D., Habib, M., Paul, C.: Simpler linear-time modular decomposition via recursive factorizing permutations. In: Aceto, L., Damg˚ ard, I., Goldberg, L.A., Halld´ orsson, M.M., Ing´ olfsd´ ottir, A., Walukiewicz, I. (eds.) ICALP 2008. LNCS, vol. 5125, pp. 634–645. Springer, Heidelberg (2008). https://doi.org/ 10.1007/978-3-540-70575-8 52
A Reconstructive Model for Identifying the Global Spread in a Pandemic Debasish Pattanayak1 , Dibakar Saha2(B) , Debarati Mitra3 , and Partha Sarathi Mandal2 1 2
Cryptology and Security Research Unit, Indian Statistical Institute, Kolkata, India [email protected] Department of Mathematics, Indian Institute of Technology Guwahati, Guwahati, Assam, India [email protected], [email protected] 3 Department of Chemistry, Royal Global University, Guwahati, Assam, India [email protected]
Abstract. With existing tracing mechanisms, we can quickly identify potentially infected people with a virus by choosing everyone who has come in contact with an infected person. In the presence of abundant resources, that is the most sure-fire way to contain the viral spread. In the case of a new virus, the methods for testing and resources may not be readily available in ample quantity. We propose a method to determine the highly susceptible persons such that under limited testing capacity, we can identify the spread of the virus in a community. We determine highly suspected persons (represented as nodes in a graph) by choosing paths between the infected nodes in an underlying contact graph (acquired from location data). We vary parameters such as the infection multiplier, false positive ratio, and false negative ratio. We show the relationship between the parameters with the test positivity ratio (the number of infected nodes to the number of suspected nodes). We observe that our algorithm is robust enough to handle different infection multipliers and false results while producing suspected nodes. We show that the suspected nodes identified by the algorithm result in a high test positivity ratio compared to the real world. Based on the availability of the test kits, we can run our algorithm several times to get more suspected nodes. We also show that our algorithm takes a finite number of iterations to determine all the suspected nodes. Keywords: Pandemic · Spread Location tracking · COVID-19
· Infection transmission tree ·
Dibakar Saha would like to acknowledge the Science and Engineering Research Board (SERB), Government of India, for financial support under the NPDF scheme (File Number: PDF/2018/000633). c Springer Nature Switzerland AG 2021 D. Goswami and T. A. Hoang (Eds.): ICDCIT 2021, LNCS 12582, pp. 188–202, 2021. https://doi.org/10.1007/978-3-030-65621-8_12
A Reconstructive Model for Identifying the Global Spread in a Pandemic
1
189
Introduction
In any pandemic, the infection grows fast worldwide. In the absence of a proper medicine or vaccine, to control the spreading; social distancing, self-quarantine, and wearing a face mask have been widely-used strategies for mitigation. In this context, mathematical models are required to estimate virus transmission, recovery, and deaths. A well known Susceptible-Infected-Removed (SIR) model provides a theoretical framework to investigate the spread of a virus in a pandemic. Cooper et al. [1] use the classical SIR model to study how the COVID-19 virus spreads within communities. The delivery of infection from asymptomatic carriers of COVID-19 in a familial cluster has been studied [2,3]. It is also the utmost requirement to find such asymptomatic carriers within a community or in a particular area such as a village, small town, to control the spreading of the disease. Hence, we propose a generalized model that can identify highly suspected persons and asymptomatic carriers efficiently. In this model, we use an estimation of the length of the infectiousness of COVID-19. It takes 5 to 6 days on an average for a COVID-19 infected person to show symptoms [4]. He et al. [5] show that around 9% of transmission would occur three days before the onset of symptoms. Hence it is a reasonable estimate that we consider a person infected with the virus can spread the virus to other people after three days of infection [5,6]. Another marker is the duration of infectivity [7]. A person spreads the virus throughout the symptomatic period and stops spreading the virus after recovery from symptoms and develops immunity to the virus. Furthermore, all infected persons do not transmit the virus [7]. In any pandemic, the greatest challenge is identifying the infected people in an initial stage so that the probability of spreading the infection gets minimized. COVID-19, caused by the dreaded coronavirus, is an example of such a pandemic. Some companies, including Google, and Apple, along with many countries like India, China, and South Korea, are engaged in tracking the infected people by using the location data of their cell phones via apps. This tracking data can be used to construct a graph where each node represents a person. Now, if the nodes have been in contact with each other, then there exists an edge between two nodes. From location data, it can be determined whether a person has come within 2 m of range of another person or not. The resulting graph is a contact graph. We may have a partial data of infected persons for k th and (k + i)th days, where i ≥ 1. With such data, we try to find the intermediate carriers responsible for spreading the infection. To find such intermediate nodes is very challenging where the infection proliferates over time. Let G = (V, E) is the contact graph constructed, as discussed above. Since the most treacherous fact is that any person can transmit the virus unknowingly being asymptomatic, it is essential to determine the links to cease the spread. Our Contributions. In this paper, we aim to identify the highly suspected persons infected with the virus in a pandemic so that those persons are to be tested or quarantined to prevent the spreading of the virus and make the following contributions. Our approach works by determining links that connect one infected
190
D. Pattanayak et al.
person to another in the underlying contact graph. We simulate our proposed method on some ground-truth datasets assumed as the contact graph, such as Dolphin, Zachary karate club, Football, and Facebook ego. The spread of the virus is modeled as an infection transmission tree over the contact graph. – We present an algorithm to determine a set of nodes that are highly suspected of having been infected. – We show the robustness of our algorithm even with changing infection parameters as well as false positive and false negative results included. – Our algorithm results in a high test positivity ratio (70%) compared to the real world (10%). – The algorithm can be run multiple times to generate suspected nodes by including the results of previously identified suspected nodes to get a new set of suspected nodes. This is useful when the availability of test kits is limited. – We show the number of times our algorithm needs to run before it terminates. Our algorithm determines more than 80% of the infected nodes on an average. Outline. The rest of the paper is organized as follows: In Sect. 2, we present the preliminaries and formulation of the problem. Section 3 describes the proposed algorithms. Section 5 shows the performance evaluation of the proposed method. Finally, Sect. 6 concludes the paper.
2
Preliminaries
In this section, we define the notations and describe the problem. 2.1
Notations
Given location data of persons, we construct a contact graph at time t, G = (V, E) such that V is the set of persons, and each person is represented as a node. There exists an edge between two nodes if the distance between any two nodes has been less than 2 m at some time t ≤ t. We define the set of symbols used in Table 1. Table 1. Notations Notation Description
Notation
G
The contact graph
V
Description Set of nodes
E
Set of edges
λ
Infection multiplier
h
Maximum hop distance
Ti
ith time interval
ITi
Infected node list at time Ti S
Set of suspected nodes
S+
Set of positive nodes
C
Set of carrier nodes
P
Set of paths
Δ
Maximum degree of the graph
Φ
Cutoff length of a path
Number of paths
|·|
Size of the set
test results Test result of the nodes
A Reconstructive Model for Identifying the Global Spread in a Pandemic
2.2
191
Problem Formulation
Let G(V, E) be an undirected unweighted graph, where V is a set of vertices or nodes, and E is a set of edges generated from the location data. Let G = (V , E ) be a subgraph of G, where V is a set of nodes infected with the virus, and E represents a set of connections among the nodes in V . G can be considered the infection transmission tree of the virus on the graph G. The structure of G is cycle-free because an infected node cannot be infected again, and since we consider the edge from the first node that infects another node. Hence, G maybe a forest. We may not have the complete information of all the nodes with the virus. With tests performed on people with symptoms, we can determine a subset of V with the virus by time Tj . Our objective is to determine the graph G , which represents the actual spread of the virus. This may not be possible due to the availability of partial information about the nodes carrying the virus. The available partial information is the positive test results that have been done in the past. We have two sets of nodes with confirmed positive test cases denoted as ITi and ITj , where Ti < Tj . So, we try to determine a set C ⊆ V , which is the set of suspected nodes. We have I = C ∪ ITi ∪ ITj , which should be very close to the set V . The symmetric difference of two sets measures the closeness of two sets. If two sets are identical, then their symmetric difference is null. We have the following problem definition. Problem Definition. Given a contact graph G, and two sets ITi and ITj , the target is to determine I ⊆ V that minimizes the symmetric difference of V and I, i.e., minimize((V ∪ I) \ (V ∩ I)).
3
Find Suspected Nodes from ITi and ITj
In this section, we describe Algorithm 1 (F ind Inf ected nodes). The algorithm consists of Procedure 1 (F ind suspected nodes) as a subroutine. We have a set I as an input that contains the day wise test results on T = {T0 , T1 , . . . , Tt }. We determine ITi as the set of nodes infected before Ti and I + Tj as the set of nodes infected between Ti and Tj . Given two sets of nodes ITi and ITj as input, Procedure 1 finds a set of nodes S as the output. To find the suspected node set S, we compute paths between nodes in ITi and ITj . To choose paths that are more likely to contain the infected nodes, we define infection parameters for the nodes. The infection parameter of a node depends on the hop distance from a known infected node, and the infection parameter of a path is the average of infection parameter of the nodes. Based on this, we determine the paths between nodes with high infection parameters. We choose the first paths in the decreasing order of the infection parameter. The suspected nodes S contains the intermediate nodes of these paths that are not present in ITi and ITj . Algorithm 1 runs with a feedback process. We can run the procedure once with ITi and ITj ; and then we get S. Tests can be conducted for the nodes in S, and we determine S + as the set of nodes that tested positive. With this added input S + , we can rerun Procedure 1 to find the next set of suspected nodes. We can continue this
192
D. Pattanayak et al.
process until the number of tests available is exhausted. If we have sufficient test kits for the entire population, that is the best way to determine the spread of the virus. When a limited number of test kits available, it is better to test only the highly suspected nodes. 3.1
Infection Criteria
The COVID-19 virus has spread rapidly. Hence, testing and identification of the infected nodes are essential to prevent the spread. COVID-19 can infect people from another infected person. The virus can be spread from one person to another through small droplets from the nose or mouth, released when a person with COVID-19 coughs or exhales. These droplets land on objects and surfaces around the person. Other people may get infected when they touch those objects or surfaces, touching their eyes, nose, or mouth. People can also catch COVID-19 if they breathe in droplets from a person with COVID-19 who coughs out or exhales droplets that can travel up to 2 m from an infected person with high concentration. Infection Parameter for a Node. The COVID-19 virus can infect people from another infected person. Thus, we consider that a node may get infected by its infected neighbors, and the chance of infection depends on the hop distance from an infected node. Let λi be the infection parameter at hop distance i. Let ki be the number of infected neighbors of a node u at a hop distance i. Here we define the infection parameter of a node u as P(u) =
h
λi ki ,
(1)
i=1
where h is the maximum hop distance considered. Infection Parameter for a Path. Let u and v be two infected nodes with causal relation as u has been infected before v. By causal relation, there should be a transmission path of the virus from u to v. We can determine multiple paths from u to v. Now, we need to determine the most probable path that the virus has followed. If a path contains multiple nodes with suspected infection, it is a more likely path of virus transmission. i as the sum We determine the parameter of transmission along ith path Puv i i of the infection parameter of the path Puv . Let Puv = (u, u1 , u2 , . . . , ul , v) be a path. The average infection parameter is i P(Puv )=
l
1 P(uj ) l j=1
(2)
A Reconstructive Model for Identifying the Global Spread in a Pandemic
3.2
193
Algorithm to Find Suspected Nodes
Initially, we are given a graph G(V, E), which is a Geo-location-based network. We also have day wise test results, i.e., a set of time intervals T = {T1 , T2 , . . . , Tt }, where for each Ti ∈ T , we have a list of infected nodes. We are given a graph G and two lists of infected nodes ITi and ITj corresponding to infected nodes before Ti and infected nodes between Ti and Tj , where Ti < Tj . We aim to propose an algorithm to find the intermediate nodes responsible for spreading the infection among the nodes in ITj . Thus, the algorithm can detect the carrier or infected nodes to be tested to prevent the infection’s spreading. This algorithm executes the following steps. Step-1: First, we compute the infection parameter P(u) of each node u ∈ G(V ). Step-2: Next, for each pair of nodes u ∈ ITi and v ∈ ITj , we explore all paths 1 2 m , Pu,v , · · · , Pu,v } of length less than or equal to Φ. Let Pu,v = {Pu,v th be the set of paths from u to v in the graph G, where the i path i = {u, u1 , u2 , . . . , ul , v}. Here, uj , j = 1, . . . , l, are the intermediate Pu,v i nodes in the path Pu,v ∈ Pu,v . We extract the first paths in the decreasing order of P(Pu,v ). Step-3: Once we extracted all the paths P = {Pu,v , Pu,v , . . .}, we discard those paths having at least one intermediate node has tested report negative (not infected) or all intermediate nodes are tested positive. Hence, for each path Pu,v ∈ P , we check each of the intermediate nodes test reports. If at least one node has a negative report, then we discard that path from P . Here, only the positive or not tested cases are considered. Step-4: Once we find all such paths, we report the updated P and all the intermediate nodes that are to be tested.
3.3
Example
In this example, we start with the Zachary Karate club network as the initial graph G(V, E), as shown in Fig. 1(a). The corresponding underlying is shown in Fig. 1(b). In the infection transmission tree, all the nodes are infected. – We first compute the infection parameter P(u), for each node u ∈ G. We get P = {1, 0.87, 0.87, 0.87, 1, 0.87, 0.87, 0.66, 0.66, 0.27, 1, 0.66, 0.66, 1, 0.18, 0.18, 0.27, 0.66, 0.18, 0.66, 0.18, 0.66, 0.18, 0.48, 0.69, 1, 0.18, 0.36, 0.57, 0.27, 0.27, 1, 0.57, 0.78} – Let ITi = {0, 10} and ITj = {4, 13, 25, 31} be the two lists of infected nodes reported by day Ti and day Tj , respectively. All the nodes in ITi and ITj are represented by the red color in Fig. 1(c). – Now, we compute all the paths of cutoff length Φ = 4 between each pair of nodes u, v, where u ∈ ITi and v ∈ ITj , and extracts the path having maximum infection parameter. – We find the suspected node list S = {3, 5, 6, 33}. Each node in S is represented by orange color in the Fig. 1(c).
194
D. Pattanayak et al.
Procedure 1: F ind suspected nodes(G, ITi , ITj , )
1 2
3 4 5 6
7 8 9 10 11 12
Input: G, ITi , ITj Output: P, S for each node v ∈ G do P(v) ← Compute the infection parameter by Eq. 1; // P is the set containing all node probabilities. // Computing highest probability paths between (u, v) where u ∈ ITi and v ∈ ITj Φ ← (Ti − Tj )/3; for each node u ∈ ITi do for each node v ∈ ITj do Compute Puv ; // find all the paths from u to v using Iteratively deepening Depth-first Search with maximum length Φ i for each path Puv ∈ Puv do Compute the infection parameter of the path using Eq. 2 if i has nodes from ITi and ITj then Puv discard the path i if Puv has a node with negative test result then discard the path
Take paths with highest infection parameter and add it to P
14
for each path p ∈ P do S = S ∪ intermediate nodes ∈ p
15
return P and S
13
Algorithm 1: F ind Inf ected nodes Input: G, T, IT , test results, Output: Updated C 1 ITj+1 = ITj ; 2 C = {}; 3 do 4 S = F ind suspected nodes(G, ITi , ITj+1 ∪ C, ); Procedure 1 5 Perform test for nodes in S and find positive nodes S + ; 6 Update test results; 7 C = C ∪ S+; 8 while |S| > 0; We get all paths are: P0,31 : {0, 13, 33, 31} P0,13 : {0, 31, 33, 13} P0,4 : {0, 5, 10, 4} P0,25 : {0, 13, 33, 31, 25}
P10,31 : {10, 4, 6, 0, 31} P10,13 : {10, 4, 0, 3, 13} P10,4 : {10, 0, 6, 4} P10,25 : {10, 5, 0, 31, 25}
// Execute
A Reconstructive Model for Identifying the Global Spread in a Pandemic
195
Fig. 1. A sample run of the algorithm on a graph generated from Zachary Karate Club network (Color figure online)
– Next, we test each node in u ∈ S, and if u has a negative test report, we remove u from S. – We find node 3 and 6 have negative test reports, and we update S + = {5, 33}. – Next, we add this S + in ITj . We have updated ITj = {4, 5, 13, 25, 31, 33}. – We recompute the infection parameter for each node in G. We get P = {1, 1.05, 1.05, 0, 1, 1, 0, 0.75, 1.05, 0.57, 1, 0.75, 0.75, 1, 0.48, 0.48, 0.57, 0.75, 0.48, 1.05, 0.48, 0.75, 0.48, 0.78, 0.78, 1, 0.48, 0.66, 0.87, 0.57, 0.57, 1, 0.87, 1}. – We again compute the final paths P : P0,31 : {0, 2, 8, 33, 31} P0,13 : {0, 19, 1, 2, 13} P0,4 : {0, 10, 5, 6, 4} P0,25 : {0, 19, 33, 31, 25} P0,33 : {0, 19, 33} P0,5 : {0, 10, 4, 6, 5}
P10,31 : {10, 0, 19, 33, 31} P10,13 : {10, 0, 19, 1, 13} P10,4 : {10, 0, 5, 6, 4} P10,25 : {10, 0, 31, 24, 25} P10,33 : {10, 0, 2, 8, 33} P10,5 : {10, 0, 4, 6, 5}
– We get the suspected node list S = {1, 2, 8, 19, 24}. – Next, we test each node in u ∈ S and find node 8 has a negative test report. Hence, we remove 8 from S . The updated S + is {1, 2, 19, 24}.
196
D. Pattanayak et al.
– Finally, we find the carrier nodes C = S + ∪S + = {1, 2, 5, 19, 24, 33}, as shown in Fig. 1(d). The nodes having negative test reports are shown by green color in Fig. 1(d).
4
Complexity Analysis
In this section, we analyze the complexity of Procedure 1. There are three main parts of the algorithm for which we estimate the running time separately. We have ITi and ITj nodes as inputs. First, we determine the infection parameter for a node. If we consider the neighbors up to h hop, then in the worst case, we need to find all the neighbors of a node at hop distance h. A node can have at most Δ neighbors. Thus, the total number of neighbors up to h is Δh . We repeat this process for all infected nodes in the graph and thus gain the complexity of O((|ITi | + |ITj |)Δh ). Second, we compute all simple paths between two nodes that are part of the input. The paths are computed using iterative deepening Depth First Search up to a cutoff Φ. Here, we compute all paths smaller or equal to length Φ for a pair of nodes u and v, where u ∈ ITi and v ∈ ITj . There can be at most ΔΦ nodes at a distance Φ from a node. Thus we can find all paths with an iterative deepening DFS with time complexity O(ΔΦ ) for a node u ∈ ITi . Iterating through all the paths, we can determine the paths that end at v ∈ ITj , and compute the corresponding infection parameter of the path. Hence, we arrive at a time complexity of O((|ITi | + |ITj |)Δh + ΔΦ |ITi |) time.
5
Performance Evaluation
Extensive simulation studies have been done to evaluate the performance of the proposed algorithm. In our simulation study, we use small real-world benchmark datasets as the contact graphs like, Zachary karate club, Dolphin, Football and Facebook ego networks having 34, 62, 115, 4039 nodes and 78, 159, 613, 88234 edges, respectively. The algorithm is implemented using python3. Initially, for each of the datasets, we generate a random tree treated as the actual infection transmission tree to show how the infection spreads over the network. Infection Transmission Tree Generation. We consider the persons as a node in the graph. We first randomly select a root node for the infection transmission tree. We assume that there is a single source of infection for the virus in a community. A node can spread the virus to its neighbors with a certain probability. We choose random neighbors of an infected node to select the next neighbor that is infected with a probability of 30%. We continue to grow the infection transmission tree subsequently. As an example, Fig. 1(b) shows the infection transmission tree of the Zachary Karate club network.
A Reconstructive Model for Identifying the Global Spread in a Pandemic
197
Fig. 2. Size of suspected nodes (|S|) and nodes tested positive among the suspected (|S + |) with respect to number of total infected nodes (|V |) and the remaining positive nodes (|V | − |ITi | − |ITj |) for different contact graphs based on Zachary karate club, Dolphin, Football and Facebook ego networks.
Input Data Generation. We take Ti and Tj as the two time instances. The infection transmission tree contains all the nodes that are infected. Now we choose the subset of nodes from the tree which are tested positive. We choose a subset of the nodes that are infected before Ti with higher probability (80% for simulation), which becomes the set ITi , since it is highly likely for nodes infected earlier to show symptoms and get tested. Next, we similarly choose nodes for ITj with moderate to low probability (30% for simulation), because the onset of the symptoms takes time as well as in most cases, the symptoms are very mild. Suspected Node Identification. Procedure 1 takes the two sets of nodes (ITi and ITj ) as input and returns a set of nodes (S) suspected of having been infected. We choose the first paths with highest infection parameter based on the computed infection parameter of a path with maximum length Φ = 4 between a node from ITi and a node from ITj . We show in Fig. 2 the ratio of the size of suspected nodes to the size of the infected nodes, i.e., |S|/(|V |). We also show the ratio of the size of positive suspected nodes to the size of remaining positive undetected nodes, i.e., |S + |/(|V | − |ITi | − |ITj |). We run our algorithm 100 times for different ITi and ITj for the same infection transmission tree and take the average. We plot it against the value of chosen from {1, 2, 3}. Observe that we can safely choose a higher value of to get more suspected nodes and correspondingly also get more positive nodes.
198
D. Pattanayak et al.
Fig. 3. Different IT1 and IT4 vs. Test positivity ratio on different networks.
We show the comparison between the test positivity ratio for the highest priority path, two high priority paths, and three high priority paths in Dolphin, Zachary karate club, Football, Facebook ego networks, as shown in Fig. 3(a–d). We define test positivity ratio as the ratio of the number of nodes that is tested positive to the number of suspected nodes, i.e., |S + |/|S|. We observe that our proposed method results in a better positivity ratio for the highest priority path. In contrast, if we consider two or three high priority paths for finding the suspected nodes, then the positivity ratio decreases a little. For example, in the Dolphin networks, considering the highest priority path, we get on an average 77% positivity ratio. When we consider the two and three high priority paths, we get the positivity ratio 70% and 72%. In the real world, as of 17 August 2020, we have around 3,09,41,264 samples tested in India as per the data from the Indian Council of Medical Research website [8] with 27,01,604 people tested positive [9,10]. It is interesting to observe that our proposed method results in a significant positivity ratio as compared to
A Reconstructive Model for Identifying the Global Spread in a Pandemic
199
(a) Dolphin Network
(b) Zachary Karate Club Network
(c) Football Network
(d) Facebook ego Network
Fig. 4. λ1 and λ2 vs. No. of Suspected Nodes and No. of Positive Nodes out of the suspected nodes between time interval T1 and T4 on different networks.
the real-world test positivity ratio, which is around 10% in India [10], whereas it was as low as 1% in some states at the initial stages of the pandemic. Variation of Infection Multiplier λ. We compute the infection parameter of a node by Eq. 1(a–d). For simulation, we have considered infection parameters up to two hops. Specifically, we vary the value of λ1 and λ2 for the same infection transmission tree, and input node set ITi and ITj . We present the results of the variation in Fig. 4. Here, our target is to find suitable values for λ1 and λ2 . We find that for the different values of (λ1 , λ2 ) such as (0.1, 0.01), (0.1, 0.09), (0.1, 0.25), (0.3, 0.01), (0.3, 0.09), (0.3, 0.25), (0.5, 0.01), (0.5, 0.09), (0.5, 0.25), the variation between the number of identified positive nodes and the total number of the suspected nodes is minimal. We choose (0.3, 0.09) as the default values of λ1 , λ2 for other simulations presented in this paper. Algorithmic Feedback. We can run our algorithm repeatedly with added input to arrive at more and more suspected nodes. If we get entirely correct feedback, i.e., all test results of suspected nodes are accurate (be it positive or negative), we can run it again to get more suspected nodes. As of now, some of the test kits used for detection of the virus is not very accurate. So, in the feedback
200
D. Pattanayak et al.
Fig. 5. (False Positive Ratio, False Negative Ratio) vs. No. of Suspected Nodes/No. of Positive Nodes out of the suspected between time interval T1 and T4 on different networks.
input, we vary the false positive and false negative percentages for the values {5%, 10%, 15%}. We plot the suspected node list and the corresponding positive node list in Fig. 5(a–d). In this study, Fig. 5 show how the number of suspected nodes varies with the False Positive Ratio, False Negative Ratio. We see that size of the suspected nodes increases as we increase the false positive ratio and the actual positive nodes decrease as we increase the false negative ratio. Notice that, even with high false positive and false negative ratios, our algorithm performs well, and the test positivity ratio remains high. Assuming that the test results are entirely correct, we determine the total number of runs of Algorithm 1 required before all the positive nodes are detected. We execute the algorithm 100 times with different tree sizes and different input data to show the number of runs Algorithm 1 requires before it stops in Fig. 6(a). We observe from the simulation that the algorithm may not find all the positive nodes. We stop the algorithm when Procedure 1 returns an empty set.
A Reconstructive Model for Identifying the Global Spread in a Pandemic
201
Fig. 6. (a) The number of runs of the algorithm, (b) total suspected, and (c) total positive nodes detected with respect to total infected nodes for different contact graphs based on Dolphin, Zachary karate club and Football networks
In Fig. 6(b), we plot the ratio of total suspected nodes that have been generated from the algorithm and the total infected nodes in the infection transmission tree. In Fig. 6(c), we plot the ratio of total positive nodes that have been found by testing the suspected nodes and the total infected nodes. Observe that, for the contact graph based on Dolphin network, we need on an average of 5 rounds to determine 83.84% of the infected nodes on an average, whereas the total suspected nodes generated by the algorithm is 109.88% of the total infected nodes on an average. Similarly for Zachary Karate Club and Football networks.
6
Conclusion
In this paper, we identify suspected persons infected with the virus so that they can be tested or quarantined to prevent the spreading of the virus. Our proposed approach works by determining a path connecting one infected person to another in the underlying contact graph. We simulate our proposed method on some benchmark datasets considered as the contact graph, such as Dolphin, Zachary karate club, Football, and Facebook ego. We show that our proposed method of finding suspected nodes results in a significantly higher test positivity ratio (around 70%) than the real world (around 10%). Our assumptions rely on the availability of location data as well as testing data of previous days. The real world test positivity ratio is small due to various factors, including the testing criteria and tracing mechanism. Depending on the availability of the test kits, we can run our algorithm multiple times to get more suspected nodes. Since the suspected node size is directly correlated with the input size, we get more suspected nodes as output as we include the test results of the previously suspected nodes. We show the robustness of our algorithm with different infection multiplier (λ). Even when false positive and
202
D. Pattanayak et al.
false negative test results are included, our algorithm maintains a high test positivity ratio. Finally, we also determine the number of runs our algorithm needs to terminate, which is very small. We show that our algorithm, when run to completion, finds more than 80% of the positive nodes on an average. This is also helpful in identifying asymptomatic carriers that may be infectious.
References 1. Cooper, I., Mondal, A., Antonopoulos, C.G.: A SIR model assumption for the spread of COVID-19 in different communities. Chaos, Solitons Fractals 139, 110057 (2020) 2. Ye, F., et al.: Delivery of infection from asymptomatic carriers of COVID-19 in a familial cluster. Int. J. Infect. Dis. 94, 133–138 (2020) 3. Bai, Y., et al.: Presumed asymptomatic carrier transmission of COVID-19. JAMA 323, 1406–1407 (2020) 4. Lauer, S.A., et al.: The incubation period of coronavirus disease 2019 (COVID-19) from publicly reported confirmed cases: estimation and application. Ann. Internal Med. 172(2020), 577–582 (2019) 5. He, X., et al.: Author correction: temporal dynamics in viral shedding and transmissibility of COVID-19. Nat. Med. 26, 1491–1493 (2020) 6. He, X., et al.: Temporal dynamics in viral shedding and transmissibility of COVID19. Nat. Med. 26, 672–675 (2020) 7. Widders, A., Broom, A., Broom, J.: SARS-CoV-2: the viral shedding vs infectivity dilemma. Infect. Dis. Health 25, 210–215 (2020) 8. Indian Council of Medical Research: Samples tested in India (2020). https://www. icmr.gov.in/. Accessed 17 Aug 2020 9. Ministry of Health and Family Welfare: COVID-19 Status in India (2020). https:// www.mohfw.gov.in/. Accessed 17 Aug 2020 10. covid19india.org: Coronavirus Outbreak in India (2020). https://www. covid19india.org/. Accessed 17 Aug 2020
Cost Effective Method for Ransomware Detection: An Ensemble Approach Parthajit Borah1(B) , Dhruba K. Bhattacharyya1 , and J. K. Kalita2 1
Department of Computer Science, Tezpur University, Tezpur, Assam, India [email protected] 2 Department of Computer Science, University of Colorado, Colorado Springs, CO 80918, USA
Abstract. In recent years, ransomware has emerged as a new malware epidemic that creates havoc on the Internet. It infiltrates a victim system or network and encrypts all personal files or the whole system using a variety of encryption techniques. Such techniques prevent users from accessing files or the system until the required amount of ransom is paid. In this paper, we introduce an optimal, yet effective classification scheme, called ERAND (Ensemble RANsomware Defense), to defend against ransomware. ERAND operates on an optimal feature space to yield the best possible accuracy for the ransomware class as a whole as well as for each variant of the family. Keywords: Malware Dynamic · Analysis
1
· Feature · Optimal · Classification · Static ·
Introduction
With rapid advances in technology, more than one third of the world’s population has now entered the digital world. The Internet provides the backbone to the digital world where people constantly make use of beneficial services and applications available on the Internet. The Internet is used for basic communication purposes as well as for numerous online transactions. The services available on the Internet can be exploited by people with destructive intentions. Malicious software or Malware is used to further increase the harmful intentions of such people. There are various types of malware available in the wild and each of them has been designed for specific purpose. Ransomware is a type of malware which has been emerged as one of the most sophisticated malware in the recent past. Locker and Crypto are primarily the two categories of ransomware. Both kinds of ransomware utilize the same infection vectors like drive by download, social engineering, phishing, spam emails, or removable media to get into the information devices and systems, including mobile and Internet of Things devices. However, the way of compromising the victim’s system is different for both types of ransomware. The locker ransomware is designed to lock the target c Springer Nature Switzerland AG 2021 D. Goswami and T. A. Hoang (Eds.): ICDCIT 2021, LNCS 12582, pp. 203–219, 2021. https://doi.org/10.1007/978-3-030-65621-8_13
204
P. Borah et al.
system and denies user access to the system without making any modification to the file system and then a certain amount of ransom is demanded from the victim. On the other hand, crypto ransomware, after getting into the victim’s system, encrypts all or selected files in the system using various encryption techniques like AES or RSA [1]. After encryption, a message with all the payment instructions is displayed to the users. To gain access to the system, a required ransom need to be paid to the attackers and in return a decryption key is obtained. The cyber-attacks carried out by ransomware is growing and becoming more sophisticated to defend against. This paper presents a fast, yet reliable ransomware defense solution, referred to as ERAND, powered by an optimal feature selection method to discriminate the ransomware class as a whole, as well as the eleven variants of the ransomware family from the goodware instances. The proposed solution is significant, considering the following clauses. 1. It is able to identify an optimal subset of Indicator of Compromises (IoCs) for the ransomware as a family as well as its individual variants to ensure better accuracy. 2. It is able to classify or discriminate instances of ransomware class (as a whole) from non ransomware as well as instances corresponding to individual variants from one another with high accuracy. The rest of the paper is organised as follows. Section 2 discusses background and related work. In Sect. 3, we present our defense solution, ERAND. Section 4 discusses about performance comparison of ERAND with other recent approaches. Section 5 contains implementation and discusses the results. Finally, in Sect. 6, we give the concluding remarks.
2
Background and Related Work
Cyber threats involving ransomware or ransom malware are growing at an exponential rate to extort money from individual users or organizations. In the recent past, several defense solutions have been proposed to combat ransomware. To combat malware, it is necessary to analyze the malware binaries properly to extract the IoCs to distinguish malware from goodware. As shown in Fig. 1, malware files are fed as input to the host machine of the analysis framework, which executes the binaries in an emulated environment and records their static and dynamic characteristics, and provides an output report to the user. These reports are finally used to extract the IoCs. Gazet et al. [14] made first attempt at analysing ransomware families. The authors carried out a technical review on quality of ransomware codes and its functionality. The authors concluded that the ransomware under analysis were designed for mass propagation but not for mass extortion. Scaife et al. [18] propose a ransomware detection system called Cryptodrop which promises to give early warnings in case of breach. It also combines three IoCs related to ransomware for rapid detection of ransomware.
Cost Effective Method for Ransomware Detection: An Ensemble Approach
205
Input file(PDF,EXE,WORD etc) Analysis and Report GeneraƟon Emulated Environment VMn VM1 VM2 Host Machine
.......
Output File(XML,JSON,HTML,PDF etc)
Database
User Interface
Fig. 1. A generic architecture of malware analysis framework
Homayoun et al. [16] propose a sequential pattern mining approach for combating ransomware. The approach helps to identify best features for classification of ransomware applications from benign apps as well as identifying a ransomware sample family. 2.1
Discussion
We observe the following, based on the conducted review. – Both supervised and semi-supervised learning approaches have been used for detection of known and unknown malware. – Among supervised approaches, there is scope for further enhancement of detection performance in terms of classification accuracy. Recently, several faster and effective classifiers as well as ensemble approaches have been proposed to address the issue. – Identification of an appropriate and optimal subset of IoCs can improve the performance of the detection method in terms of both accuracy and cost effectiveness. – Most learning approaches fail to perform well for all kinds of malware due to non-availability of adequate training instances. Active learning approaches could be a better alternative for detection of both known as well as unknown malware with a low rate of false alarms.
3
Proposed Framework
This section presents the proposed detection framework, referred to as ERAND, and discusses its components. Figure 2 illustrates the architecture of the proposed framework and its components. 3.1
Data Collection
We evaluate our method using two datasets. For the first, we collect 2288 ransomware samples from [7]. In addition to that, we also collect 933 goodware
206
P. Borah et al.
Original Feature Data Preprocessor
FSA1
FSA2
Core subset S1 idenficaon using intersecon
Recursive Eliminaon
Combinaon Funcon
Redundant(Zero variance) feature eliminaon
FSA3
…………. .
FSAN
Extended subset S2 idenficaon
S (S1 U S2) Select features with least relevance Eliminate feature with e breaking(if any) Evaluate the new subset S’
Opmal Feature/IoC subset, Sfinal
Fig. 2. ERAND Detection framework for Ransomware and its variants
samples from Google Play Store1 . After this, we analyse the collected samples for feature extraction. We extract two categories of features: a) Permission and b) API. A total of 241 features are extracted out of which 214 features belong to permission based and the remaining 27 belong to API category. The whole process of feature dataset generation is represented in Fig. 3. The final description of the dataset is given in Table 1.
F1
F2
F3
F4
Feature ExtracƟon Analysis Module
Data Preprocessing API
Permission
Android APKs
Database
Fig. 3. The ransomware feature dataset generation framework
For our experiment, we use another feature dataset from Sgandurra et al. [19]. The dataset contains 582 samples of ransomware with 11 variants and 942 samples of benign programs. The dataset has 30,962 attributes which represent all instances both goodware and ransomware present in the dataset. A detailed description of the dataset is given in the Table 2. 1
https://play.google.com/.
Cost Effective Method for Ransomware Detection: An Ensemble Approach
207
Table 1. Ransomware dataset with two classes No. of instances
No. of classes No. of features
Ransomware 2288 2 Goodware 933 Total instances: 3221
Permission-based 214 API-based 27 Total features: 241
Table 2. Ransomware dataset with Normal and 11 ransomware subclasses Sl no. Class
No. of samples
1
Goodware
942
2
Critroni
50
3
CryptLocker
107
4
CryptoWall
46
5
KOLLAH
25
6
Kovter
64
7
Locker
97
8
MATSNU
59
9
PGPCODER
4
10
Reveton
90
11
TeslaCrypt
6
12
Trojan-Ransom 34 Total samples: 1524 Total features: 30962
3.2
Data Preprocessing
Data preprocessing helps convert the input data into an appropriate format to make it suitable for further analysis. In our work, we preprocessed our original dataset before performing subsequent analysis. In the preprocessing phase, we removed all those attributes whose variance is zero. In addition to that, we have discarded all those malware binaries that are failed to execute during analysis phase. 3.3
Ensemble Feature Selection
Data mining or machine learning algorithms may face the curse of dimensionality issues while dealing with high dimensional data. Additionally, the learning models may overfit in the presence of a large number of features which may lead to performance degradation, in addition to increased memory requirements and computational cost. Therefore, it is necessary to remove irrelevant features. Since both the datasets have large number of features (241 for dataset 1 and 30,962 for dataset 2), we perform feature selection to find an optimal subset of features.
208
P. Borah et al.
Each variant of ransomware exhibits different characteristics and as such feature selection techniques are used for each of them present in our dataset. More specifically, we use an ensemble feature selection approach, where the results of the base feature selection algorithms are combined with a consensus to generate an optimal subset of features. Selection of the Base Feature Selection Algorithms. Ensemble feature selection eliminates the biases of individual participating feature selection methods to yield the best possible output using an appropriate consensus function. Similar to other ensemble approaches, there are two steps in ensemble feature selection. First, we need a set of well-performing feature selectors, each of which provides a subset of features found relevant. Second, based on the output received from each ranker or feature selection algorithm, we apply an appropriate consensus function to yield the best possible subset of relevant features. Guided by our experimental results, we use the following three feature selectors, namely, CMIM [12], MIFS [3], and ReliefF [17]. (a) Conditional Mutual Information Maximization (CMIM) [12]: For a given set of selected features, CMIM feature selector functions in an iterative manner by selecting features with maximum mutual information with the class labels. In other words, CMIM discards those features which are similar to the previously selected features even though its predictive power is strong concerning the class labels. For each unselected feature Xi , the feature score is calculated using Eq. 1. JCM IM (Xi ) = minXj S [I(Xi ; Y |Xj )]
(1)
(b) Mutual Information Feature Selection (MIFS) [3]: A good feature should be highly correlated with the class label, but it should not have high correlation with other features. Both feature relevance and feature redundancy are taken into consideration by MIFS during feature selection phase. The feature score is calculated for each unselected feature Xk using Eq. 2. JM IF S (XK ) = I(Xk ; Y ) − β
I(Xj ; Xk )
(2)
Xj S
(c) ReliefF [17]: ReliefF selects features to efficiently separate instances from different classes. Assume that l data instances are randomly selected among all n instances, then the feature score of any feature fi in ReliefF is estimated using Eq. 3.
Relief F score(fi ) = 1/c +
y=y j
l j=1
(−1/mj )
d(X(j, i) − X(r, i))
xr N H(j)
1/hjy p(y)/(1 − p(y))
xr N M (jy)
(3) d(X(j, i)X(r, i))
Cost Effective Method for Ransomware Detection: An Ensemble Approach
209
where N H(j) and N M (j, y) are the nearest instances of xj in the same class and in class y, respectively. Their sizes are mj and hjy , respectively. p(y) is the ratio of instances with class label y. Combination Function to Generate Initial Feature Subset. In this work, we introduce a 3-step consensus building process to identify an initial optimal subset of features for the subsequent recursive optimality test. C1 Consider the intersection of selected feature subsets given by each base feature selection algorithm to obtain a core subset of features, which we denote as S1 . C2 Consider the scores of all features given by all individual algorithms and select those features (other than those included in S1 ) for each algorithm with scores higher than a user defined threshold (say, α). We consider all those features whose score value is greater than 0. Finally, this step outputs a feature set S2 based on contributions from all the feature selection algorithms. C3 Obtain the initial optimal feature subset S by taking union of S1 and S2 for consideration of the recursive optimality test, described next, towards generation of the final optimal feature subset, i.e., Sf inal . Generation of Final Optimal Feature Subset Using Recursive Optimality Test. In this section, a recursive elimination method is applied to get the final optimal feature subset Sf inal based on S. There are three steps in this process, which are stated below. O1 Consider the feature subset, i.e., S as input and identifies one or more features with least relevance score. O2 Eliminate the feature(s) to obtain a new subset, S . In case of tie, based on relevance score estimated for an identified pair of candidate features, we choose feature for elimination given by an inconsistent performing ranker algorithm. O3 Evaluate S in terms of classification accuracy. It terminates the elimination process if and only if a significant performance degradation is observed in terms of accuracy due to the elimination of a feature, and considers the recent Si as Sf inal . Optimal Feature Subset for Each Class and for the Whole Family. Initially, the three feature selection algorithms namely: CMIM, MIFS and ReliefF generate 3 feature subsets for each variant of ransomware. The naming convention for each subset is Ran subij , where i goes from 1 to n and j goes from 1 to 3. Next, we apply a consensus function (described in Sect. 3.3) on these subset of features to generate a common subset for each class of ransomware. This generates n feature subsets for each variant of ransomware present in the dataset. Now, for each of the n feature subsets, an optimal feature subset is
210
P. Borah et al. Table 3. Number of optimal features for each ransomware dataset Dataset
No. of optimal features
Dataset 1
15
Dataset 2 (class 1)
23
Dataset 2 (class 2)
36
Dataset 2 (class 3)
19
Dataset 2 (class 4)
42
Dataset 2 (class 5)
20
Dataset 2 (class 6)
36
Dataset 2 (class 7)
31
Dataset 2 (class 8)
4
Dataset 2 (class 9)
32
Dataset 2 (class 10)
8
Dataset 2 (class 11) 25 Table 4. List of selected features for Table 5. List of top ranked feature catedataset 1 gories for dataset 2 Feature rank Feature name
Feature rank Feature category
1
SEN D SM S
2
RECEIV E BOOT COM P LET ED
1
RegistryKeysOperations
3
GET T ASKS
4
Ljava/net/U RL; − > openConnection
2
AP ICalls
5
V IBRAT E
3
Strings
6
W AKE LOCK
7
KILL BACKGROU N D P ROCESSES
4
F ilesoperations F ileextensions
8
SY ST EM ALERT W IN DOW
5
9
ACCESS W IF I ST AT E
6
Directoryoperations
10
DISABLE KEY GU ARD
11
Landroid/location/LocationM anager; − > getLastKnownLocation
7
Droppedf iles
12
READ P HON E ST AT E
13
RECEIV E SM S
14
CHAN GE W IF I ST AT E
15
W RIT E EXT ERN AL ST ORAGE
identified using the process described in the Sect. 3.3. The number of optimal features identified for each dataset with each variant of ransomware is given in the Table 3. Table 4 lists all the optimal features for ransomware dataset 1 and Table 5 lists top ranked feature categories for dataset 2. Correctness of Feature Selection Results. To establish the correctness of our feature selection results, we propose the following two lemmas.
Cost Effective Method for Ransomware Detection: An Ensemble Approach
211
Lemma 1. For discrimination of each ransomware variant, i.e., Ri from benign instances, selected feature subset, i.e. Ran subi is optimal. Proof: Suppose for the sake of contradiction, we assume that for a ransomware variant Ri , Ran subi is non optimal and |Ran subi | = k. Also, assume that |Ran subi |+1 gives us the best possible accuracy. However, as shown in Table 12 and also stated in Sect. 3.3, the highest possible accuracy to discriminate Ri from normal or goodware after recursive elimination is for the subset of features Ran subi . Any addition or deletion of features from Ran subi does not improve accuracy. Hence, the assumption is false and it contradicts the non-optimality assumption. Hence, Ran subi is optimal. Lemma 2. The accuracy given by Ran suball cannot be greater than the overall highest accuracy given by the individual optimal subset of features i.e., Ran subi . Proof: A feature subset Ran subi for each ransomware variant Ri is identified and its optimality as stated in the Lemma 1 is established. Ran suball includes Ran subi and some additional features. In other words, for discrimination of Ri , Ran suball includes some additional redundant features with reference to Ran subi . Such additional features cannot improve classifier performance for Ri (as substantiated by Lemma 1). Hence, the accuracy given by Ran suball ≯ Ran subi (where i varies from 1 to 11). 3.4
Classification of Ransomware Family and Its Variants
After obtaining the dataset using an optimal feature subset, it is necessary to establish the performance of the features to distinguish the ransomware from the goodware and supervised approach can be used to achieve the same. To avoid biases of classifiers, we consider an ensemble approach for unbiased evaluation of the optimality of the subset of IoCs given by the previous step. However, ensemble methods are also not totally free from drawbacks due to the limitations of the inherent combination/consensus function used. Therefore, in our study, we consider a number of consistent performing ensemble classifiers for optimal classification. The five ensemble classifiers used in our work are Random forest [5], Extra Tree [15], Adaboost [13], Gradient boosting [4], and XGBoost [8]. To avoid individual biases of these ensemble methods, we combine their outputs to yield best possible classification performance by minimizing false alarms. Balancing of Classes. Data balancing is a technique to make an approximately balanced number of instances in each class. An imbalanced data distribution may lead to unusual model performance. Our dataset is highly imbalanced as shown in Table 2 of class distribution and hence, it needs to be balanced. For data balancing, we need additional class specific instances which are difficult to collect because ransomware data are not readily available due to various other reasons in addition to security reasons. Therefore, we use a sampling technique to handle the class imbalance problem. In general, there are three sampling techniques: undersampling, oversampling, and hybrid sampling. In our case, we use an oversampling technique called SMOTE [6], where the minority class is
212
P. Borah et al.
oversampled by creating “synthetic” examples rather than by over-sampling with replacement. Finally, in the dataset all the class instance distributions become 50:50. In addition to that, we also validate our method’s performance in other data distributions also like 60:40, 70:30 and 80:20. Classification of Ransomware Variants. The above mentioned classifiers, i.e., Random forest, Extra tree, Adaboost, Gradient boosting, and XGboost are used to build a predictive model for each Ran subi dataset, where i goes from 1 to 11 and each dataset includes the instances of goodware and ransomware variants. Classification of Ransomware Family. The classifier models are also built using the Ran suball dataset, where all the instances of ransomware variants are included as a single ransomware class along with the goodware instances.
4
Comparison with Existing Methods
In this section, we compare our method with some of the existing ransomware detection methods. The similarity and dissimilarity of our method with the existing methods can be described in the following ways. Table 6 presents comparative study of the proposed method with some of the existing methods. – Like [2,21,22], our method also employ the various supervised approaches for classification of ransomware instances into their respective families. – Like [2,21,22], our work is also based on multi-class classification ransomware instances. But in [2,21,22], the proposed method is experimented against 8, 9, and 7 families of ransomware whereas our method uses 11 families of ransomware for validation. – Like [22], our dataset is also not balanced in terms of goodware and ransomware samples. But unlike [22], we apply data balancing technique to validate our method with different class distributions. – Unlike [2,22], our method uses the different platform specific ransomware for which both dynamic and static behaviors of ransomware are extracted as features to discriminate ransomware from goodware. – In [21], the stable performance is obtained for 131 features which is too high in comparisons to ours. The maximum feature dimension used by our method is 45 and minimum feature dimension is 4 for which we get the stable performance among the ransomware families. – In [22], the stable performance is obtained for 123 features which is too high in comparisons to ours. The maximum feature dimension used by our method is 45 and minimum feature dimension is 4 for which we get the stable performance among the ransomware families. – In [2], the highest accuracy is obtained with 8 features which 97.10 but our method achieves highest accuracy of 98.7 with only 4 features for Dataset 2 (class 8).
Cost Effective Method for Ransomware Detection: An Ensemble Approach
213
– Unlike [2,22], our method achieves better accuracy in terms of classification. The best possible accuracy obtained by the method [22] is 91.43% while our method obtained 98.7%.
5
Results
The experimental analysis is performed in a Python environment. The experiments are carried out on a workstation with 64 GB main memory, 2.26 Intel(R) Xeon processor and Ubuntu operating system. To build consensus based on decisions given by the individual classifiers, we use a weighted majority approach. The approach uses individual weights of the classifiers given by a multiobjective optimization technique for unbiased combinations of the individual decisions to achieve best possible accuracy. 5.1
Computation of Weights for Classifiers Using NSGA-II
We use multiobjective evolutionary method to compute best possible set of weighting factors for the participating classifiers based on their classification performances on each of the 11 variants of ransomware. Since none of the classifiers has been found to give winning performance consistently for all the variants of ransomware, we decided to exploit weighted majority based combination function to achieve best possible classification accuracy. A good number of multiobjective evolutionary algorithms are available in the literature and some of their comparisons are available at [9,23]. NSGA-II has already been established as a promising optimization method to handle multiobjective problem due to its elitist approach. An arithmetic crossover and Gaussian mutation operators generates offspring population from parent population. Offspring populations are then added to the current population. The NSGA-II algorithm uses (i) nondominated sorting method for ranking individuals into different non-domination levels and (ii) crowding distance method to sort individuals within the same level. An individual dominates another individual if it is strictly better in at least one objective and no worse in all the other objectives. In this work, we exploit the NSGA-II to compute an optimal set of weighting factor for the five classifiers (c1 to c5 ) based on their individual performances for 11 variants. We use their 5 × 11 = 55 performance values as input to the NSGA-II for computation of the optimal set of weights. The optimal weights obtained are given in Table 7.
214
P. Borah et al.
Table 6. Comparison of the proposed method Table 7. Weightage of the classiwith existing methods fiers given by NSGA-II Classifier, ci
Weightage
S
123
91.43
ExtraTree (c1 )
0.35
10
D
8
97.1
Gradient Boosting (c2 ) 0.28
6
D
23
97.03
256
2
D
8
87
AdaBoost (c3 )
0.15
755
8
D
131
98
XGBoost (c4 )
0.12
574
12
S/D
98.25
Random Forest (c5 )
0.10
Methods Ransomware samples
Total class Features (S/D) Total features Accuracy (%)
[22]
1787
10
[2]
210
[10]
500
[11] [21] [20]
Proposed 2288 (Dataset 1) 2 582 (Dataset 2) 12
5.2
S D
15 98.2 Min:4 Max:42 98.8
Weighted Majority Based Combination Function
To generate an unbiased classification output, ERAND uses ‘weighted majority voting‘ to build consensus among the outputs given by the individual classifiers. We initially carry out an exhaustive experimentation with these classifiers using the datasets described in Sect. 3.1 and consider their performance as the basis for subsequent processing to decide their weights while building the consensus. Our experimental study based on weighted majority voting uses Eq. 4 to decide the class label of a test instance. It computes anomaly score, Si for each test instance given by Eq. 4 to recognise either as ‘goodware’ or ‘malware’ with respect to a user defined threshold, β. Si = w1 c1 + w2 c2 + w3 c3 + w4 c4 + w5 c5
(4)
If the value of Si ≥ β ( a user-defined threshold), the instance is considered anomalous or belonging to a malware class. In our experimentation, we consider β = 0.63. Because if any two best performers (like (c1 ,c2 ) or (c1 ,c3 ) agree, then their total weights are used as threshold. The value of ci for a given class can be either 1 ( if belongs to the class) or 0 (if not). In case of tie, we give priority to the decision of the best performers. For example, if c1 and c4 are on one side and c2 , c3 and c5 are on the other side, we prefer the decision of (c1 , c4 ). 5.3
Classification of Ransomware Variants
We evaluate the performance of the selected subset of features to distinguish the instances of the ransomware family from the goodware, using five classifiers individually. The classification results of dataset 1 and dataset 2 are given in Tables 8 and 9. It can be observed from Tables 8 and 9 that ERAND consistently performs well like ExtraTree and Gradient Boosting in classifying each variant of the ransomware malware.
Cost Effective Method for Ransomware Detection: An Ensemble Approach
215
Table 8. Classification accu- Table 9. Classification accuracies of dataset 2 for racies of dataset 1 each variant Classifiers
Accuracy
Gradient Boosting 97.5
Ransomware family Classifiers Gradient Boosting Adaboost Random Forest Extra Tree XGBoost ERAND Critroni
98.7
98.7
98
98.1
98.7
98.8
Adaboost
96.47
CryptLocker
96.8
96.8
96.7
96.8
96.7
96.8
Random Forest
98
CryptoWall
96.3
96.17
96.23
96.44
96.33
96.42
Extra Tree
98.27
KOHLER
98.6
98.56
98.61
98.8
98.7
98.83
Kovter
98.76
97.88
97.8
98.1
98.23
98.78
XGBoost
97.2
Locker
96.39
96.34
96.50
96.92
96.39
96.55
ERAND
98.2
MATSNU
97.61
97.70
97.77
97.70
97.88
97.88
PGPCODER
98.12
98.12
98.43
98.7
98.43
98.7
Reveton
98.40
98.72
98.84
97.7
98.72
TeslaCrypt
98.87
98.62
98.77
98.6
97.88
98.86
Trojan-Ransom
98.43
97.78
98.32
98.61
97.86
98.67
5.4
98.79
Classification of Ransomware Family
To calculate the performance of the selected features to distinguish whole ransomware family from the goodware, we used five classifiers namely, Random forest, Gradient boosting, Adaboost, Extratree and XGboost. The classification results of whole ransomware family with respect to goodware are enlisted in Table 10. 5.5
Metrics and Cross Validation
For effective evaluation of our method, four machine learning performance metrics are used namely: Accuracy, Recall, Precision and F-Measure. The precision, recall and F1 Score values of each classifier are reported in Table 11. On the other hand, the classification accuracies of each classifier are reported in Table 9 for 50:50 class distribution. – Accuracy: The no of instances correctly detected by a classifier divided by the total of goodware and ransomware instances gives the accuracy. Accuracy =
TP + TN TP + TN + FP + FN
– Precision: It defines what proportion of predicted ransomware are actually correct. Thus, Precision of a model is calculated as follows: P recision =
TP TP + FP
Table 10. Classification accuracies of whole ransomware family Goodware Ransomware Classifiers ERAND Random Forest Extra Tree AdaBoost XGBoost Gradient Boosting 942
618
97.8
98.7
97.9
98.1
98.6
98.6
97
98.89
95.9
93.2
98.7
98.7
94.5
98.4
98.8
98.6
98.7
98.4
Dataset 2 98.9 (Class 1)
Dataset 2 95.7 (Class 2)
Dataset 2 94.3 (Class 3)
Dataset 2 98.7 (Class 4)
Dataset 2 98.5 (Class 5)
Dataset 2 95.4 (Class 6)
Dataset 2 98.6 (Class 7)
Dataset 2 97.7 (Class 8)
Dataset 2 98.1 (Class 9)
Dataset 2 98.4 (Class 10)
Dataset 2 98.5 (Class 11)
98.8
98.8
98.3
98.76
98.7
94.7
98.2
98.7
93.4
96.8
98.89
98.9
98.6
98.8
98.4
98.8
98.5
94.6
98.6
98.8
93.3
96.2
98.88
98.7
Gradient Adaboost Random Extra Boosting Forest Tree
Precision
Dataset 1 98.1
Datasets
F1-Score
98.4
97.3
98.2
98.88
98.6
94.3
98.6
98.8
93.1
95.8
98.89
98.4
98.4
98.7
98.2
97.8
98
98.1
98.3
98.6
98.7
98.8
98
97.9
98.3
98.6
98.5
97.8
98.8
98.3
98.6
98.6
98.6
98.8
98.3
97.5
98.3
98.5
98.6
97.7
98.2
98.6
98
98.6
98.5
97.5
98.3
98.1
98.2
97.8
98.9
98.7
98
98.6
98.7
98.3
98.4
97.6
98.5
98.6
98.88 97.5
98.7
98.8
98.7
98.2
98.5
98.6
98.8
98.7
98
98
98.5
98.6
98.6
97.8
98.3
96.7
98.9
98.7
96.5
97.2
98
98
98.4
98.7
98
97.9
98.1
96.5
98.6
98.7
96.2
97.4
98.1
97.22
98.5
98.8
98.89
97.8
98.2
96.6
98.7
98.7
96.4
96.7
97.9
98.5
97.9
98.3
96.4
98.8
98.7
96.3
97.1
98.4
97.8
98.6
97.8
98.5
98.4
98.89 98.7
98.7
98.1
96.8
98
98.7
96.5
97.2
98.7
98.6
XGBoost Gradient Adaboost Random Extra XGBoost Gradient Adaboost Random Extra XGBoost Boosting Forest Tree Boosting Forest Tree
Recall
Table 11. Precision, Recall, and F1 score of all the classifiers
216 P. Borah et al.
Cost Effective Method for Ransomware Detection: An Ensemble Approach
217
– Recall : It defines what proportion of all ransomware samples are correctly predicted. The recall of a model is calculated as follows: Recall =
TP TP + FN
– F1-score is calculated by taking the weighted average of precision and recall. F1-score is defined as follows: F 1 Score =
2 ∗ (Recall ∗ P recision) (Recall + P recision)
Where TP: True Positive, TN: True Negative, 1.65 cm FP: False Positive, FN: False Negative. In order to evaluate the performance of the proposed method, we used the K-Fold cross validation where we set the value of K = 10. Table 11 presents the Precision, Recall, and F1 score of all the classifiers.
6
Conclusion
In order to counter the ransomware, a cost effective method is proposed based on the dynamic features of ransomware. For identification of most discriminative features of ransomware families, an ensemble approach is introduced. Finally, a weighted majority based combination function is proposed to get highest possible classification accuracy in ransomware detection. ERAND performs consistently well on all the malware datasets. As a future work, we are developing an unsupervised approach to counter zero-day ransomware. Table 12. Result of Recursive Optimality test for ransomware variant 1 No. of Classification No. of Classification No. of Classification No. of Classification features accuracy features accuracy features accuracy features accuracy 1
0.883264278
10
0.954893617
19
0.977217245
28
0.982049272
2
0.892821948
11
0.954893617
20
0.977743561
29
0.982575588
3
0.911394177
12
0.954893617
21
0.978269877
30
0.982575588
4
0.907183651
13
0.954893617
22
0.978226204
31
0.982575588
5
0.926254199
14
0.954893617
23
0.982049272
32
0.982575588
6
0.935828667
15
0.954893617
24
0.982575588
33
0.982049272
7
0.953309071
16
0.954893617
25
0.982575588
34
0.982575588
8
0.953309071
17
0.954893617
26
0.982575588
35
0.982049272
9
0.954367301
18
0.96393617
27
0.982575588
218
P. Borah et al.
References 1. The Evolution of Ransomware (2008). https://www.symantec.com/content/en/ us/. Accessed 14 Feb 2019 2. Alhawi, O.M.K., Baldwin, J., Dehghantanha, A.: Leveraging machine learning techniques for windows ransomware network traffic detection. In: Dehghantanha, A., Conti, M., Dargahi, T. (eds.) Cyber Threat Intelligence. AIS, vol. 70, pp. 93– 106. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-73951-9 5 3. Battiti, R.: Using mutual information for selecting features in supervised neural net learning. IEEE Trans. Neural Netw. 5(4), 537–550 (1994) 4. Breiman, L.: Arcing the edge. Technical report (1997) 5. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001) 6. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: Smote: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002) 7. Chen, J., Wang, C., Zhao, Z., Chen, K., Du, R., Ahn, G.: Uncovering the face of android ransomware: characterization and real-time detection. IEEE Trans. Inf. Forensics Secur. 13(5), 1286–1300 (2018). https://doi.org/10.1109/TIFS.2017. 2787905 8. Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2016, pp. 785–794. ACM, New York (2016). https://doi.org/ 10.1145/2939672.2939785, http://doi.acm.org/10.1145/2939672.2939785 9. Coello, C.A.: An updated survey of GA-based multiobjective optimization techniques. ACM Comput. Surv. 32(2), 109–143 (2000). https://doi.org/10.1145/ 358923.358929 10. Cohen, A., Nissim, N.: Trusted detection of ransomware in a private cloud using machine learning methods leveraging meta-features from volatile memory. Exp. Syst. Appl. 102, 158–178 (2018). https://doi.org/10.1016/j.eswa.2018.02.039. http://www.sciencedirect.com/science/article/pii/S0957417418301283 11. Cusack, G., Michel, O., Keller, E.: Machine learning-based detection of ransomware using SDN. In: Proceedings of the 2018 ACM International Workshop on Security in Software Defined Networks & Network Function Virtualization, pp. 1–6. ACM (2018) 12. Fleuret, F.: Fast binary feature selection with conditional mutual information. J. Mach. Learn. Res. 5, 1531–1555 (2004) 13. Friedman, J., Hastie, T., Tibshirani, R., et al.: Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). Ann. Stat. 28(2), 337–407 (2000) 14. Gazet, A.: Comparative analysis of various ransomware virii. J. Comput. Virol. 6(1), 77–90 (2010). https://doi.org/10.1007/s11416-008-0092-2 15. Geurts, P., Ernst, D., Wehenkel, L.: Extremely randomized trees. Mach. Learn. 63(1), 3–42 (2006) 16. Homayoun, S., Dehghantanha, A., Ahmadzadeh, M., Hashemi, S., Khayami, R.: Know abnormal, find evil: frequent pattern mining for ransomware threat hunting and intelligence. IEEE Trans. Emerg. Top. Comput. 8, 341–351 (2017) ˇ 17. Robnik-Sikonja, M., Kononenko, I.: Theoretical and empirical analysis of relieff and rrelieff. Mach. Learn. 53(1–2), 23–69 (2003) 18. Scaife, N., Carter, H., Traynor, P., Butler, K.R.: Cryptolock (and drop it): stopping ransomware attacks on user data. In: 2016 IEEE 36th International Conference on Distributed Computing Systems (ICDCS), pp. 303–312. IEEE (2016)
Cost Effective Method for Ransomware Detection: An Ensemble Approach
219
19. Sgandurra, D., Mu˜ noz-Gonz´ alez, L., Mohsen, R., Lupu, E.C.: Automated dynamic analysis of ransomware: benefits, limitations and use for detection. CoRR abs/1609.03020 (2016). http://arxiv.org/abs/1609.03020 20. Shaukat, S.K., Ribeiro, V.J.: Ransomwall: a layered defense system against cryptographic ransomware attacks using machine learning. In: 2018 10th International Conference on Communication Systems Networks (COMSNETS), pp. 356–363 (January 2018). https://doi.org/10.1109/COMSNETS.2018.8328219 21. Vinayakumar, R., Soman, K., Velan, K.S., Ganorkar, S.: Evaluating shallow and deep networks for ransomware detection and classification. In: 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI), pp. 259–265. IEEE (2017) 22. Zhang, H., Xiao, X., Mercaldo, F., Ni, S., Martinelli, F., Sangaiah, A.K.: Classification of ransomware families with machine learning based on N-gram of opcodes. Future Gener. Comput. Syst. 90, 211–221 (2019). https://doi. org/10.1016/j.future.2018.07.052. http://www.sciencedirect.com/science/article/ pii/S0167739X18307325 23. Zitzler, E., Deb, K., Thiele, L.: Comparison of multiobjective evolutionary algorithms: empirical results. Evol. Comput. 8(2), 173–195 (2000)
Social Networks and Machine Learning
Exploring Alzheimer’s Disease Network Using Social Network Analysis Swati Katiyar , T. Sobha Rani(B) , and S. Durga Bhavani(B) School of Computer and Information Sciences, University of Hyderabad, Hyderabad, India [email protected], {sobharani,sdbcs}@uohyd.ac.in
Abstract. Alzheimer’s is a degenerative disease with changes occurring in different regions of the brain at different rates resulting in progressive deterioration. A lot of functional brain connectivity is altered, the process itself is insufficiently understood. In this work, an attempt is made to understand the progressive deterioration of the brain by locating the regions that show significant changes in the connectivity in the five lobes of the brain at different stages of the disease. Methods available in social network analysis like community and maximal clique analysis along with degree distributions, and centrality measures have been used to observe the network evolution of these regions at different stages. Networks of four diagnostic stages i.e., Normal, Early MCI (eMCI), Late MCI (lMCI), and Alzheimer’s Disease (AD), taken from ADNI (Alzheimer’s Disease Neuroimaging Initiative) database, are used for this study. Nine Regions of Interest (ROIs) from the five lobes are identified and a higher degree of change is observed in the connections of regions from the temporal and frontal lobes. There is a splurge of new connections in the eMCI stage for all regions except for those from the frontal lobe. We also observed more rearrangement among the left hemisphere nodes as compared to the right hemisphere nodes. There is an overall loss of edges between the normal and AD stages. This confirms that the study is able to identify the regions that are affected by the progression of the disease. Keywords: Disease progression discovery · Regions of the brain
1
· Differential degree · Community
Introduction
Alzheimer’s is a degenerative disease that destroys memory and other important mental functions. It is an irreversible process, which may ultimately lead to the inability to perform even simple tasks [12]. Main focus of this work is on understanding these functional changes caused by the Alzheimer’s Disease. An attempt to gain insights into the effects of this disease on the brain including the changing connectivity patterns as the disease progresses from one stage to the next is being attempted here. The concept of Network biology [3] is used for the purpose of visualising the brain as a network where the regions of interest c Springer Nature Switzerland AG 2021 D. Goswami and T. A. Hoang (Eds.): ICDCIT 2021, LNCS 12582, pp. 223–237, 2021. https://doi.org/10.1007/978-3-030-65621-8_14
224
S. Katiyar et al.
(ROI) of the brain become nodes wherein two nodes are connected if there is a pathway between these regions. 1.1
Motivation
More than 4 million people are estimated to be suffering from Alzheimer’s in India (the third highest caseload in the world). In most cases the patients’ lives are characterised by a total loss of independence in the last years [1,11]. Through this work we aim to identify changes which could be indicative of early signs of the disease onset in the brain network. 1.2
Objectives
The goal as mentioned earlier, is to understand the changes in the brain network as the Alzheimer’s disease progresses. For this purpose we have used methods available in social network analysis literature: community analysis, maximal clique, degree distributions and centrality measures. We also locate specific regions from the five lobes of the brain that show significant changes and visualise their connections through the stages.
2
Related Literature
Barabasi and their group of co-authors have pioneered the work on network biology [3]. A biological network is built using concerned entities as nodes and interactions between the respective molecular components as edges. Many a time, if the strength of the interaction can be captured then these are weighted networks. Network medicine is a branch of network biology that studies and analyzes human diseases using a network based approach. This enables a possibility of personalised therapies and treatment [2]. Disease modelling refers to the quantification of the features of any disease along with an understanding of the underlying patterns of the disease growth. The network analysis of cancer disease has been carried out in [7,13,15]. In [16], the authors model diseases and symptoms which are the phenotype manifestations of a disease in a network. They construct a symptom-based network of human diseases (HSDN) in which two diseases are linked if they exhibit similar symptoms. The authors study the complex network obtained by integrating HSDN with disease—gene interactions along with the corresponding protein-protein interaction (PPI) data to obtain interesting results of shared genes among diseases and disease clusters. In an interesting work [6], using network analysis, a disease is linked to a welldefined local neighbourhood of the human interactome, known as the disease module. It is shown that the modules extracted are distinct for each disease.
Exploring Alzheimer’s Disease Network Using Social Network Analysis
225
There is less work regarding Alzheimer’s disease with network analysis. Our work considers the data set extracted by [14] in which the authors construct the disease networks for Alzheimer’s disease in various stages. They apply the important problem of link prediction from social network analysis in order to discern the patterns of links lost and gained in various stages of the disease using measures like Adamic Adar, Preferential attachment etc. In a study to understand the changes in brain laterization in patients with MCI and Alzheimer’s disease conducted by Liu and Zhang [8] they have used the intrinsic laterality index approach to compute the functional laterality of the brain hemispheres. It is important to mention the key work of [5] who crafted Desikan-Killiany atlas. In an effort to aid clinical investigation, an automated labeling system was developed. This was used for the subdivision of the cerebral cortex into various gyri. 35 ROIs were manually identified in each hemisphere on a data set of 40 MRI scans. This was encoded as an atlas which could be used to label the cortical regions automatically. A comparison between the automated and the manual labelling of ROIs was done using both ICC (intra-class correlation coefficients) and mean distance measures to guage the percentage of mismatch between them. It turned out that the automated labelling was extremely accurate, with an average ICC of 0.835 and a mean distance error of less than 1 mm [5]. These metric values suggest that the system developed was both valid and reliable. Figures 1 shows the identified ROIs using the Desikan-Killiany atlas.
Fig. 1. Cortical Representations of the ROIs in one Hemisphere (L) and their Index (R) [5, 9].
226
3 3.1
S. Katiyar et al.
Data Set Initial Data Set
Data used in this work is taken from ADNI (Alzheimer’s Disease Neuroimaging Initiative) database [14]. The main goal of ADNI has been to assess the progression of early Alzheimer’s and MCI (Mild Cognitive Impairment) from PET (Positron Emission Tomography), MRI (Magnetic Resonance Imaging), and other biological markers. DWIs (diffusion-weighted images) of 202 participants are scanned for this purpose. Table 1 shows the demographic information for the 202 participants [14]. Here, N is the number of participants in each category, age is the age of the participants, MMSE (Mini Mental State Exam) scores. Table 1. Demographic information.
N Age (mean ± SD in years) MMSE (mean ± SD) Sex
3.2
Normal 50 72.6 ± 6.1 28.9 ± 1.4 22M/28F
eMCI 72 72.4 ± 7.9 28.1 ± 1.5 45M/27F
lMCI 38 72.6 ± 5.6 26.9 ± 2.1 25M/13F
AD 42 75.5 ± 8.9 23.3 ± 1.9 28M/14F
Total 202 73.1 ± 7.4 27.1 ± 2.7 120M/82F
Final Data Set
Applying whole-brain tractography using the Hough transform on every scan recovered 10000 fibers for each participant. 35 regions of interest(ROI) are extracted from each hemisphere from MRI scans using the Desikan-Killiany atlas. A connectivity matrix of size 70 × 70 (35 from each hemisphere) was obtained for each one of the 202 participants. Mean of matrices from each diagnostic group is obtained and are converted into boolean values. Four matrices of size 70 × 70, one for each diagnostic group: Normal, eMCI, lMCI, and AD are obtained. The corpus callosum region is eliminated by replacing rows 3 and 38 by 0s which represents the region in left and right hemispheres respectively. 3.3
Network Construction
Here, a method is described to convert the given data-set into a form useful for our experimentation. Also the nodes of our networks are mapped to the regions of interest of the brain. Four binary matrices of dimensions 70 × 70 are obtained, where every row and column index corresponds to a ROI of the brain, deduced from the DesikanKilliany atlas, and each matrix corresponds to a stage of the Alzheimer’s disease i.e., normal, eMCI, lMCI and AD.
Exploring Alzheimer’s Disease Network Using Social Network Analysis
227
These four matrices are taken and visualised as graphs containing 70 nodes, connected to each other by non-weighted edges. These were our four networks of the brain corresponding to the different stages of the Alzheimer’s disease. All experiments and observations are a result of analysis done on these four networks.
4
Experiments and Results
Our analysis focuses on three broad categories: Network Analysis, Node-wise analysis, Edge-wise analysis. Network Analysis. Number of Edges, Average Path Length, Average Clustering Coefficient Node-Wise Analysis. Degree Distribution, Eigenvector Centrality, Maximal Clique Analysis, Community Analysis - Girvan Newman (GN) algorithm and Louvain methods, Degree Differential Analysis Edge Distribution. Visualisation of connections for chosen ROIs from the five lobes based on the above analysis. 4.1
Global Network Analysis
These are dense networks with high clustering coefficients and small path lengths therefore satisfying the small-world property. This goes with our previous knowledge of brain networks being small-world networks. Table 2 shows the values obtained for each stage. Table 2. Observations of Global Network Properties. Stage
4.2
Normal eMCI lMCI AD
No. of edges
1490
1510
Avg. path length
1.346
1.337 1.397 1.402
1380 1364
Avg. clustering coefficient 0.753
0.751 0.738 0.736
Degree Distribution
Figure 2 shows degree distribution for the 4 stages. It is surprising to observe a N ormal distribution instead of power law distribution which is observed in many biological networks. There is a possible loss of the N ormal behaviour as the disease progresses. A number of edges are lost in the later stages. 4.3
Eigenvector Centrality
Figure 3 shows the plots for degree vs. eigen vector centrality. Wider range of values are observed for degree as compared to eigenvector centrality. High degree
228
S. Katiyar et al.
Fig. 2. Degree Distribution Plots for various stages.
nodes are connected to the other high degree nodes signifying the assortativity property. An observable drop in the highest degree can be noticed from normal to AD stages. eMCI and AD have more nodes with 0.08 eigenvector centrality. Entorhinal Cortex and Frontal Pole belong to this case. Eigenvector centrality for the Temporal Pole is 0.08 in normal which decreases to 0.06 in AD. Posterior Cingulate Cortex has eigenvector centrality 0.17 in normal which drops to 0.15 in AD. Eigenvector centrality is not varying much. Hence, it can be concluded that the shortest paths are not disturbed if a rewiring happens in normal to AD stages, still it is maintaining the connections in such a way to retain the shortest paths. 4.4
Maximal Clique Analysis
Maximal cliques are those cliques which cannot be extended any further by including another adjacent vertex. Then it becomes a maximal clique, meaning it is not a subset of a larger clique. Again a sizable loss of connections can signify the loss of functionality as the disease progresses. Total number of maximal cliques observed in each stage are: – – – –
Normal - 414 eMCI - 611 lMCI - 424 AD - 333
Exploring Alzheimer’s Disease Network Using Social Network Analysis
229
Fig. 3. Degree vs Eigenvector centrality of nodes.
Figure 4 shows the number of cliques and the size of the clique. Two size 1 cliques are nodes 3 and 38 (Corpus Callosum). The frequency of maximal cliques is higher for bigger sizes in normal and eMCI. In stages lMCI and AD more number of maximal cliques appear for comparatively smaller size values. 4.5
Community Analysis
The GN algorithm [10] is run until the highest modularity value was found. This resulted in the formation of two major communities i.e., the left and right hemispheres along with a few isolated node communities. On applying the Louvain method [4], the entire network was divided into left and right hemispheres as communities with higher modularity values. Table 3 shows the modularity values using GN and Louvain community discovery algorithms. Communities that are obtained when the modularity is highest are chosen as the best communities. Table 3. Modularity values for communities. Community discovery method Normal eMCI lMCI AD Girvan Newman
0.169
0.176 0.214 0.225
Louvain
0.207
0.198 0.232 0.242
230
S. Katiyar et al.
Fig. 4. Count vs. Size of Maximal Cliques.
4.6
Degree Differential Analysis
Degree differential analysis is taken up between each of the stages for all the 70 nodes. From the Fig. 5, it is evident that eMCI to lMCI and Normal to AD has similar structure where there is loss of edges almost for all nodes. Loss of connections between the nodes may signify the loss of functionality that is experienced in the progression of the disease.
5
List of Selected ROIs
In order to carry out a deeper analysis, nine ROIs (at least one ROI from each of the five lobes of the brain) are chosen based on the eigen vector centrality and degree differential analysis explained earlier. These are chosen on the basis of the change in their network properties and their functionalities. Each ROI is represented by a node pair (x, y) where x ∈ [0, 34] and y ∈ [35, 69]. They are listed in Fig. 6 which provides the change of connections for each node pair in all the 4 stages. That is, the node pair (22,57) has 64 and 59 connections in Normal stage, 63 and 60 in eMCI, 58 and 55 in lMCI, 53 and 52 in AD stages respectively. On visualising their connections we observed a significant amount of rearrangement between the normal state and the disease stages. We describe this rearrangement for three selected ROIs in the following sections. There is a certain degree of randomness in the way connections are changing in the disease stages.
Exploring Alzheimer’s Disease Network Using Social Network Analysis
231
Fig. 5. Degree difference plots.
5.1
Posterior Cingulate Cortex (22, 57)
The Posterior Cingulate Cortex participates in diverse functions and communicates with various regions in the brain. It was identified as one of the significant regions from Eigenvector centrality analysis. Refer Fig. 7. – 22 and 57 are connected to each other in Normal stage. 57 is connected to the entire map of the region’s connections in all the stages. – It is also connected to all the regions in the parietal lobe through all the four stages. – In AD, 22 lost its connection with the cuneus cortex which is responsible for visual processing.
232
S. Katiyar et al.
– 22 lost many connections after the eMCI stage with the temporal lobe. Only connects to parahippocampal gyrus responsible formemory encoding and retrieval, fusiform gyrus responsible for object recognition and banks of superior temporal sulcus which plays a role in (social perception remained in lMCI. It still connects to middle temporal gyrus and transverse temporal cortex, involved in processing of tone in AD. – This region has links to every region of the frontal lobe except for the left hemispheres of lateral orbital frontal gyrus, frontal pole and pars orbitalis in normal and eMCI. Links are lost with pars opercularis and pars triangularis, both having functions in language formation and semantic processing, in lMCI and to pars orbitalis language processing and rostral middle frontal gyrus attention switch in AD. – It is connected to all the regions within the cingulate cortex in the first three stages, lost it’s link to left hemisphere node of insular cortex in AD. Insular cortex is involved in consciousness. 5.2
Parahippocampal Gyrus (15, 50)
This surrounds the hippocampus and contains the entorhinal cortex hence plays an important role in the memory function. This region is selected because of the role it plays in memory related tasks. Refer Fig. 8. – 15 and 50 are connected to each other in all the four stages. – A lot of rearrangement occurred in the connections with Occipital lobe. In the eMCI stage, both 15 and 50 got linked to all its regions. They got divided by hemispheres again in lMCI except for lingual gyrus vision and dreaming; visual processing of the written word. – Connections with every region in the parietal lobe are maintained until AD stage. 15 got linked to the right hemisphere node of inferior parietal cortex (perception of facial stimuli) and 50 gained connections to the left hemispheres of supramarginal gyrus and superior parietal cortex (involved in spatial orientation) in eMCI which are lost again in lMCI. – It’s connection to the caudal anterior cingulate cortex is lost in lMCI stage. It gains a connection to the left hemisphere of rostral anterior cingulate cortex in AD. – With the frontal lobe, 50 lost its connections to the left hemisphere nodes of caudal middle frontal gyrus (reorients attention) and pars opercularis (language production) in eMCI. 15 lost its connections with the left hemispheres of frontal pole and rostral middle frontal gyrus whereas 50 gained a link to the right hemisphere of the frontal pole in lMCI. Frontal pole controls the ability to communicate. – Within the temporal lobe it was connected to every other region in the normal and lMCI states, divided by hemispheres. eMCI showed formation of links from 15 to right hemispheres of fusiform gyrus (facial recognition), middle temporal gyrus (semantic memory processing) and inferior temporal gyrus (Sensory integration). In AD, 15 gets linked to the right hemisphere of entorhinal cortex.
Exploring Alzheimer’s Disease Network Using Social Network Analysis
5.3
233
Caudal Middle Frontal Gyrus (2, 37)
This region from the frontal lobe is chosen since it displayed a high degree difference in the differential analysis of nodes. It acts as a circuit breaker and diverts attention towards external stimuli. Refer Fig. 9. – 2 and 37 are linked to each other till the last stage. – In eMCI, 37 lost its link to the right hemisphere of the precuneus cortex memory recollection but regained it in lMCI. 2 lost all its connections to the parietal lobe in lMCI. – In the occipital lobe it had connections to every region except for the cuneus cortex, divided by hemispheres, in the normal stage. 2 lost all it’s connections in eMCI and 37 gained a link to the right hemisphere of the cuneus cortex in AD. – Both 2 and 37 are completely connected to the cingulate cortex in the Normal stage. 2 lost it’s links to the right hemispheres of the insular cortex and isthmus cingulate cortex in eMCI. Both 2 and 37 lost their connections to the left hemispheres of insular cortex and isthmus cingulate cortex in lMCI. Insular cortex has many functions in self-awareness, empathy, cognition, etc.
Fig. 6. List of selected ROIs.
234
S. Katiyar et al.
Fig. 7. Connections of Posterior Cingulate Cortex to other lobes.
Fig. 8. Connections of Parahippocampal Gyrus to other lobes.
– In the temporal lobe it is connected to all regions except the temporal pole. 2 lost all its connections in the eMCI stage. Temporal Pole is the site for auditory perception and also has functions in semantic and personal memory. – Within the frontal lobe they kept their connections to every other region but the number of common nodes kept decreasing with every stage. 37 lost its connections with the left hemispheres of pars opercularis, pars orbitalis, pars triangularis the language processing regions and frontal pole. 2 lost its link to the right hemisphere of the paracentral lobule which has motor functions.
Exploring Alzheimer’s Disease Network Using Social Network Analysis
235
Fig. 9. Connections of Caudal Middle Frontal Gyrus to other lobes.
6 6.1
Conclusion Conclusions from the Node Wise Analysis and Network Analysis Experiments
A significant loss of edges can be observed from Normal to AD stage. Loss of Normal distribution through the disease stages can be observed from global network properties and degree distribution. High degree of rearrangement can be seen from maximal clique analysis. It is highly plausible that the rearrangement in the network occurs in a way that it gives the brain a false impression of working properly but in hindsight, some secondary or tertiary connections could be missing which causes some malfunction. 6.2
Conclusions from Edge Distribution and Visualisation
Rostral anterior cingulate cortex lost connections with regions responsible for motor functions. Entorhinal cortex and parahippocampal gyrus lost connections with major regions of the cingulate cortex and regions responsible for communication (among others) in the frontal lobe. Temporal lobe is a region of auditory perception which loses its links with the language processing unit. Pars triangularis loses links with the memory hub and other regions of the language processing unit. Regions from the temporal lobe and the frontal lobe had connections divided by hemispheres which are disturbed in the disease stages. The connections of a region with the right hemisphere nodes in Normal are replicated in lMCI. This is most prominent in the parietal lobe. An explosion of new connections happened in the eMCI stage for all regions except for those from the frontal lobe. High loss of edges in lMCI and AD stages can be seen in every lobe. Higher degree of rearrangement and loss of connections happens among the left hemisphere
236
S. Katiyar et al.
nodes as compared to the right hemisphere. In another work by Liu and Zhang [8], they found an abnormal rightward dominance in the patients with MCI and AD. According to them this could be indicative of such patients using additional brain resources to compensate for the loss of cognitive function. Disappearance of the leftward laterality was also observed in the patients with AD. This was associated with the damage in the left hemisphere. 6.3
Limitations
We were limited by the data set we used which only gave us binary information in the form matrices averaged over N subjects. For a work as this one it is possible to get better results if data is made available per subject hence, all of the results we produce are obtained from a generalised perspective.
References 1. Alzheimer’s Association: 10 early signs and symptoms of alzheimer’s (2020). https://www.alz.org/alzheimers-dementia/10 signs. Accessed 26 Jun 2020 2. Barab´ asi, A.L., Gulbahce, N., Loscalzo, J.: Network medicine: a network-based approach to human disease. Nat. Rev. Genet. 12(1), 56–68 (2011). https://doi. org/10.1038/nrg2918 3. Barab´ asi, A.L., Oltvai, Z.N.: Network biology: understanding the cell’s functional organization. Nat. Rev. Genet. 5(2), 101–113 (2004). https://doi.org/10.1038/ nrg1272 4. Blondel, V.D., Guillaume, J.L., Lambiotte, R., Lefebvre, E.: Community structure in social and biological networks. J. Stat. Mech. Theor. Exp. (2008). https://doi. org/10.1088/1742-5468/2008/10/P10008 5. Desikan, R.S., et al.: An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest. NeuroImage 31(3), 968–980 (2006). https://doi.org/10.1016/j.neuroimage.2006.01.021 6. Ghiassian, S.D., Menche, J., Barab´ asi, A.L.: A disease module detection (DIAMOnD) algorithm derived from a systematic analysis of connectivity patterns of disease proteins in the human interactome. PLOS Comput. Biol. 11(4), e1004120 (2015) 7. Hanahan, D., Weinberg, R.A.: The hallmarks of cancer. Cell 100(1), 57–70 (2000) 8. Hao, L., et al.: Changes in brain lateralization in patients with mild cognitive impairment and Alzheimer’s disease: a resting-state functional magnetic resonance study from Alzheimer’s disease neuroimaging initiative. Front. Neurol. 9, 3 (2018). https://doi.org/10.3389/fneur.2018.00003 9. Lisowska, A., Rekik, I.: Joint pairing and structured mapping of convolutional brain morphological multiplexes for early dementia diagnosis. Brain Connect. 9(1), 22–36 (2019). https://doi.org/10.1089/brain.2018.0578 10. Girvan, M., Newman, M.E.J.: Community structure in social and biological networks. Proc. Natl. Acad. Sci. 99, 7821–7826 (2002) 11. NIA: Alzheimer’s disease fact sheet (2020). https://www.nia.nih.gov/health/ alzheimers-disease-fact-sheet. Accessed 26 Jun 2020 12. NIA: What is Alzheimer’s disease? (2020). https://www.nia.nih.gov/health/whatalzheimers-disease. Accessed 26 Jun 2020
Exploring Alzheimer’s Disease Network Using Social Network Analysis
237
13. Sahoo, R., Rani, T., Bhavani, S.: Differentiating cancer from normal proteinprotein interactions through network analysis, Chap. 17. In: Arabnia, H., Tran, Q.N. (eds.) Emerging Trends in Applications and Infrastructures for Computational Biology, Bioinformatics, and Systems Biology (2017) 14. Sulaimany, S., Khansari, M., Zarrineh, P., Daianu, M., Thompson, N.J.P.M., Masoudi-Nejad, A.: Predicting brain network changes in Alzheimer’s disease with link prediction algorithms. Mol. BioSyst. 13(4), 725–735 (2017). https://doi.org/ 10.1039/c6mb00815a 15. Wu, G., Feng, X., Stein, L.: A human functional protein interaction network and its application to cancer data analysis. Genome Biol. 11, 1 (2010) 16. Zhou, X., Menche, J., Barab´ asi, A.L., Sharma, A.: Human symptoms-disease network. Nat. Commun. 5, 4212 (2014)
Stroke Prediction Using Machine Learning in a Distributed Environment Maihul Rajora1 , Mansi Rathod1(B) , and Nenavath Srinivas Naik2 1
Department of Electronics and Communication Engineering, IIIT Naya Raipur, Naya Raipur, India {maihul17101,mansi17101}@iiitnr.edu.in 2 Department of Computer Science and Engineering, IIIT Naya Raipur, Naya Raipur, India [email protected]
Abstract. As with our changing lifestyles, certain biological dimensions of human lives are changing, making people more vulnerable towards stroke problem. Stroke is a medical condition in which parts of the brain do not get blood supply and a person attains stroke condition which can be fatal at times. As these stroke cases are increasing at an alarming rate, there is a need to analyze about factors affecting the growth rate of these cases. There is a need to design an approach to predict whether a person will be affected by stroke or not. This paper analyse different machine learning algorithms for better prediction of stroke problem. The algorithms used for analysis include Naive Bayes, Logistic Regression, Decision Tree, Random Forest and Gradient Boosting. We use dataset, which consists of 11 features such as age, gender, BMI (body mass index), etc. The analysis of these features is done using univariate and multivariate plots to observe the correlation between these different features. The analysis also shows how some features such as age, gender, smoking status are important factors and some feature like residence are of less importance. The proposed work is implemented using Apache Spark, which is a distributed general-purpose cluster-computing framework. The Receiver Operating Curve (ROC) of each algorithm is compared and it shows that the Gradient Boosting algorithm gives the best results with the ROC area score of 0.90. After fine-tuning, certain parameters in Gradient Boosting algorithm like optimization of the learning rate, depth of the tree, the number of trees and minimum sample split. The obtained ROC area score is 0.94. Other performance parameters such as Accuracy, Precision, Recall and F1 score values before fine-tuning are 0.867, 0.8673, 0.866 and 0.8659 respectively and after fine-tuning the values are 0.9449, 0.9453, 0.9449 and 0.9448 respectively.
Keywords: Stroke Machine learning
· Distributed environment · Apache spark ·
c Springer Nature Switzerland AG 2021 D. Goswami and T. A. Hoang (Eds.): ICDCIT 2021, LNCS 12582, pp. 238–252, 2021. https://doi.org/10.1007/978-3-030-65621-8_15
Stroke Prediction Using Machine Learning in a Distributed Environment
1
239
Introduction
A stroke is a cerebrovascular disease in which arteries carrying oxygen and nutrients to the brain gets ruptured and there is no blood supply to the parts of the brain. This result in complete damage of blood cells in the brain [6]. A Fairly large number of people are losing life, especially in developing countries [5]. According to the reports of the American Heart Association [10], the mortality rate for 2017 was 37.6 in every 100,000 stroke cases. According to the World Health Organization [12], stroke has been classified under non-communicable disease. The reports of 2012 say that Stroke was the main cause of death due to non-communicable disease, causing 17.5 million deaths. Stroke is also the fourth major cause of death in India [16]. It is also the fifth major cause of death in the United States. Nearly 800,000 people have a stroke per year which equates to one person every 40 s. As this issue is increasing at an alarming rate, there is an emergent need to examine the health data and develop a system which could predict whether a person is likely to suffer a stroke or not. As with the advancement in maintaining medical data [4], it is easier to maintain Electronic Health records. As with growth in data, the importance of big data analytic [1] comes into play. The data used to make decisions in this system consists of various attributes like age, gender, BMI, smoking status, glucose level etc. The objective of the paper is to maximize the stroke detection rate of the patient i.e. correctly categorize the patients who are at risk of stroke and reduce the false alarms rate which will reduce the number of patients visiting hospitals if not necessary. The parameter SaveLife is the number of people who were saved due to correct prediction made by the system. MissLife is the number of people who had chances of stroke but were predicted of not having the stroke. FalseAlarm is the number of people who didn’t have the chances of stroke but were predicted as having the stroke. Hence, the two terms i.e detection rate and false alarm rate can be described as DetectionRate =
F alseAlarmRate =
Savelif e saveLif e + M issLif e F alseAlarm SaveLif e + F alseAlarm
To develop a robust system, stroke detection rate should maximize and false alarm rate should minimize. Hence this trade-off is shown in Fig. 1. As the data is growing at an alarming rate, the importance of big data is comprehended. With this, the need for frameworks to process this enormous data is rapidly increasing, especially in the healthcare field. Apache Hadoop [2,17] is the emerging big data technology framework for distributed data storage and parallel processing. Apache Spark [9,15] is one of the framework which is designed for fast processing using in-memory computation. In this paper, we use Apache spark as a data processing framework. Spark provides abstraction known as Resilient distributed dataset [18]. Spark executes operation 10 times faster on disk and 100 times faster in-memory than Hadoop.
240
M. Rajora et al.
Fig. 1. Trade-off between stroke detection and false alarm rate
The Spark stack currently consists of Spark core engine along with libraries i.e MLlib for executing machine learning task in spark, Cluster Management to acquire cluster resource for performing jobs and Spark SQL which combines relational processing with Spark functional API. The rest of the paper is organized as follows. In Sect. 2, the related work is described by comparing it with the proposed system performance. In Sect. 3, an elaborate description of the proposed solution is presented. It contains details about the data used and explanation of the system model. Section 4 contains the experimental analysis and results and also performance analysis of different models and Sect. 5 concludes the paper.
2
Related Work
In [13], an artificial neural network was used to predict the thrombo-embolic stroke disease. The dataset consisted of eight important features to be considered for prediction. Simple ANN model was built which evaluated the accuracy score of 0.89. It uses a relatively smaller dataset as it should use when working with deep learning models. In [14], Support vector machine algorithm with all its four kernels i.e linear, quadratic, polynomial and RBF were used. The dataset consisted of features like - age, sex, atrial fibrillation, walking symptoms, face deficit, arm deficit, Leg deficit, dyphasia, visuospatial disorder, hemianopia, infarct visible on CT and cerebellar signs. They worked with 300 training samples and 50 testing samples. Accuracy score of the four kernels are as follows: linear was 91%, quadratic was 81%, RBF was 59% and polynomial was 87.9%. In [8], authors used the dataset which consisted the attributes like Sex, Age, Province, Marital status, Education and occupation. They used three machine
Stroke Prediction Using Machine Learning in a Distributed Environment
241
learning algorithms i.e. naive bayes, decision tree and neural network with six input layer, one hidden layer and two output layer to predict the stroke. The accuracy score obtained from decision tree was 75%, by naive bayes was 72% and by neural network it was 74%. In [11], the data set consisted of 29074 records, of which 30% was used for testing and 70% training. The algorithm used was decision tree, random forest and neural network. The accuracy presented by decision tree was 74.31%, by random forest was 74.53% and 75.02% by neural network. The limitation of the paper [8] and [11] is their accuracy score. The result depicts the predictions on which person’s life depends and risking accuracy in medical domain costs high. Among the related works, the accuracy predicted by our model has better results. We compared six models to choose the best performing model and then tuned the parameter to reach the state of art. The main contribution of our work is implanting Gradient Boosting algorithm and then tuning various parameters to reach an accuracy of 0.9449.
3
Proposed Solution
The objective of the paper is to build the robust system which can accurately detect whether a person will suffer from stroke or not. For this purpose, machine learning classification algorithms are applied on processed data in Apache spark framework. The proposed model works in a pseudo distributed environment. A Detailed description is given in following sub-sections. 3.1
Data Description
The dataset consists of 43400 entries of patients and 12 attributes as shown in Table 1. These attributes can be majorly divided into three parts i.e lifestyles factors, medical risk factors and non-controllable factors. Lifestyle factors constitutes the habit and indulgence of a person in a certain activity due to his own will. They consist of smoking, drinking habits, eating habits and physical activities. Medical risk factors are those factors for which chances of stroke can be controlled like high blood pressure, atrial fibrillation, high cholesterol, diabetes and circulation problems. Non-controllable factors include age, gender and ethnicity. The data set contains various attributes like: gender, age, hypertension, heart disease, marital status, kind of occupation, residence area (rural or urban), average glucose level, Body Mass Index (BMI), smoking status and the stroke status. 3.2
System Architecture
The System architecture explains elaborately about each step as shown in Fig. 2. It is carried out from importing libraries to predicting whether patients will suffer from stroke or not. It explains the step by step process of the implementation
242
M. Rajora et al. Table 1. Data set description Attributes
Description
ID
Patients ID to avoid duplicity
Gender
Gender of patient
Age
Age of patient
Hypertension
No hypertension-0 Suffering hypertension-1
Heart disease
No heart disease-0 Suffering heartdisease-1
Marital status
Married or not
Work-type
Kind of Occupation person is involved
Residence area Lives in urban or rural area Avg-Glucose
Average glucose level of patient measured after meal
BMI
Body Mass Index of patient
Smoking-status Patient smoking status Stroke status
No Stroke-0 Stroke suffered-1
of stroke prediction system architecture. The very first step includes importing certain spark libraries. As we work with columnar data, we import Spark SQL libraries. The first point to use spark SQL is an object called Spark Session. It initializes the spark application and sets up the session where we can process data. After the session is created we load our data into the session. As the data is loaded into the session, we perform exploratory data analysis.
Fig. 2. System architecture
Exploratory Data Analysis. We analyse the uni-variate plots for six features such as Age, Avg-Glucose, BMI, heart disease, hypertension and stroke. From Fig. 3, we can observe that heart disease and hypertension happens to show correlation with each other. The plot of average glucose is slightly bi-modal in nature and not normal distribution as it seems to be. The standard deviation of average glucose is 43 with an average value of 140. The BMI plot is slightly
Stroke Prediction Using Machine Learning in a Distributed Environment
243
skewed to the right and age is not uniform distribution but comes out that stroke is common around the age of 42 and a spike for old age and infants. Next, we check the influence of Gender, BMI and Age on predicting the stroke percentage. It comes out that age is an important factor to be considered while evaluating the probability of stroke. In Fig. 4, it shows there is an abrupt increase in the stroke rate above the age of 40 years and diminishing rate below 20 years. The plots also help us to conclude that BMI is a relevant feature as it has less number of outliers and also high BMI value contributes to the greater probability of stroke than people with lower BMI values. Gender is also another factor, as per the data of 43400 people, 25665 are females and 17724 are males with females having 2.89% and males having 3.49% chances of getting a stroke.
Fig. 3. Univariate plots
Fig. 4. Influence of age, gender and BMI on stroke
244
M. Rajora et al.
Fig. 5. Influence of age, smoking status and BMI on stroke
The effect of Smoking status, BMI and age on stroke is also analysed in Fig. 5. The smoking status is divided into four parts i.e. unknown, never smoked, formally smoked and smokes. It reveals that smoking is a factor to be considered when looked upon smokers versus non-smokers. Present smokers have less BMI than people who never smoked or older in age. The risk of stroke is highest for former smokers having age above 40. Data Cleaning. After the analysis of data in the above section, there is a need to remove the outliers, duplicate values and also fill the missing values. In the dataset, it is observed that – Age values range from negatives to values above 100. – Patients-Ids which are duplicated, need to be resolved. Here we have 43400 Ids out of which 38713 unique Ids. – Also, we ponder over the missing data and found that 1461 fields of BMI feature are missing. Since only 3% of BMI data is missing, we fill these fields with the mean value. This marginally reduces the correlation with other features. – The missing data for a smoking status attribute is about 30% hence we fill the missing fields with new category which is unknown except with the intuitive knowledge we put “never smoked” for children of age below 10. Hence data cleaning is required.
Stroke Prediction Using Machine Learning in a Distributed Environment
245
Table 2. Statistical description of data. Id
Age
Hypertension Heart-disease Avg-glucose-level BMI
Stroke
Count 43400.00 43400.00 43400.00
43400.00
43400.00
41938.00 43400.00
Mean 36326.14 42.22
0.09
0.05
104.48
28.61
0.02
Std
21072.13 22.52
0.29
0.21
43.11
7.77
0.13
Min
1.00
0.00
0.00
55.00
10.10
0.00
25%
18038.50 24.00
0.00
0.00
77.54
23.20
0.00
50%
36351.50 44.00
0.00
0.00
91.58
27.70
0.00
75%
54514.25 60.00
0.00
0.00
112.07
32.90
0.00
Max
72943.00 82.00
1.00
1.00
291.05
97.60
1.00
0.08
Table 2 gives the statistical description of the dataset. It shows total count, mean, standard deviation, minimum value, maximum value and quartile division. These quartile are made on the values of features present in the dataset. Data Preprocessing. As we have cleaned the data, with no outliers and missing values, now we proceed to eradicate the issue of imbalance. If a class is imbalanced then the predicted output will always be biased. To address the issue, SMOTE technique has been used. Instead of directly oversampling the minority class, it first randomly selects a point from minority class and find the nearest neighbour. Now, among the neighbours, a point is chosen and a synthetic instance is created between these two. In this way, enumerable synthetic points can be generated [3] and the class imbalance can be re-balanced. This is shown in Figs. 6 and 7. Preparing Data for PySpark to Process. After the data is processed, PySpark does not process the data in this form for implementation of machine learning algorithms. Machine learning algorithms cannot work solely with categorical data, it has further requirements which are fulfilled through String Indexer, One-hot encoder and Vector assembler. These are the steps for machine learning pipeline [7] as shown in Fig. 8. Machine Learning Models. For this stroke Prediction Model, we used five ML models such as Naive Bayes, Logistic Regression, Decision Tree, Random Forest, Gradient Boosting algorithms.
246
M. Rajora et al.
Fig. 6. Before class rebalancing
Fig. 7. After class re-balancing using SMOTE
Evaluating Classifiers. To evaluate the classifiers, we use the concept of confusion matrix. We define its four terms as below TP (True Positive): Stroke is predicted, the person actually suffers from a stroke. We call it “Save Life”. TN (True Negative): No stroke is predicted, there is actually no stroke. We call it “Save Time”.
Fig. 8. Machine learning pipeline
Stroke Prediction Using Machine Learning in a Distributed Environment
247
FP (False Positive): Stroke is predicted but actually there is no Stroke. We call it “False Alarm”. FN (False Negative): No stroke is predicted but there is actually a stroke. We call it “Missed Life”. In order to evaluate the classifier and find out the most robust and efficient of all, we aim to maximize the Detection rate as DetectionRate(Recall) =
TP TP + FN
To maximize Detection Rate term, we need to minimize FN i.e missed life. Also, we aim to reduce the False alarm rate such as F alseAlarmRate =
FP TN + FP
To view the results, we use Receiver Operating Characteristic curve (ROC). Few parameters for performance criteria are defined as below Accuracy =
TP + TN TP + FP + FN + TN
TP FP + TP TP Recall = TP + FN 2 × Recall × P recision F 1M easure = Recall + P recision P recision =
Optimization of Algorithm. After evaluating various classifiers, an algorithm with the best result is chosen for further optimization. The ROC score of different machine learning algorithms is 0.66 for Naive Bayes, 0.83 for Decision Tree, 0.82 for Random Forest, 0.78 for Logistic Regression and 0.90 for gradient boosting as described in Fig. 13. The results show that the gradient boosting algorithm has the best ROC score of 0.90 and hence it is further optimized. The parameter optimized for gradient boosting algorithm are learning rate, tree depth, number of trees and minimum sample leaf. The optimal value for these parameters is chosen such that overfitting and underfitting are avoided. Figure 9 shows the performance of the model with varying learning rates. The value of the learning rate is varied from 0.0 to 1.6. The optimal value chosen is 0.9, for which model produces the best results. Figure 10 shows the performance of the model with a varying number of trees. Increasing the number of trees results in more learning. This increases the training time, hence an optimal value is chosen. The number of trees is varied from 1 to 900. 600 is chosen for which model produces the best results.
248
M. Rajora et al.
Fig. 9. Optimizing learning rate.
Fig. 10. Optimizing number of trees.
Figure 11 shows the performance of the model with varying the depth of the tree. Deeper the tree, better it learns. But increasing too much depth increases the overfitting. Hence the optimal value is 8 for this case. Figure 12 shows the performance of the model with varying the number of samples required to split an internal node. The value is varied from 1% to 30% and it is observed the increasing the value results in underfitting of data. Hence, the optimal value chosen is 0.04.
Stroke Prediction Using Machine Learning in a Distributed Environment
249
Fig. 11. Optimizing depth of the tree.
Fig. 12. Optimizing number of sample at node split.
4
Results and Analysis
In this section, we evaluate the ROC curve for each of the machine learning algorithms as shown in Fig. 13. We observe that Gradient Boosting algorithm has the highest accuracy of the value of 0.867. The ROC score obtained is 0.90. The precision, recall and F1 score are 0.8673, 0.866 and 0.8659 respectively. In order to predict more precisely, we tune the parameters of gradient boosting algorithm by optimizing the learning rate, tree depth, minimum number of samples required at the node split and the number of trees. After these tuning, the accuracy is 0.9449. The ROC score obtained is 0.94. The precision, recall and F1 scores are 0.9453, 0.9449, 0.9448. The ROC curve before and after fine tuning is as shown in Fig. 14.
250
M. Rajora et al.
Fig. 13. Receiver operating curve
Fig. 14. Receiver pperating curve before and after tuning
Table 3 describes the classification report of Gradient boosting algorithm before tuning and Table 4 describes the classification report after tuning. The class 0 and class 1 represents No stroke and Stroke respectively. Figure 14 gives the insight about ROC value of each algorithm with gradient boosting performing the best and naive-bayes performing the least as it is a simple classification technique and requires data with no least correlating feature. The ROC score of different machine learning algorithm is 0.66 for Naive Bayes, 0.83 for Decision Tree, 0.82 for Random Forest, 0.78 for Logistic Regression and 0.90
Stroke Prediction Using Machine Learning in a Distributed Environment
251
Table 3. Classification report before tuning Precision Recall F1 Class 0
0.89
0.83
0.86
Class 1
0.84
0.90
0.87
Avg/Total 0.87
0.87
0.87
Table 4. Classification report after tuning Precision Recall F1 Class 0
0.93
0.96
0.95
Class 1
0.96
0.93
0.94
Avg/Total 0.95
0.95
0.94
for gradient boosting as described in Fig. 14. Figure 14 gives the insight about the comparative performance before and after tuning parameters in gradient boosting algorithm.
5
Conclusion
As the stroke disease is ranked fourth major cause of death in the category of non-communicable diseases, it is the need of the hour to bring this stroke prediction system which can predict the chances of whether the person can suffer from a stroke or not. Based on the results and extensive analysis of the data, preventive measures can be advised to patients to avoid the chances of suffering from a stroke. The system is built in a distributed machine learning environment using the Apache Spark framework. Among various classifiers, the gradient boosting algorithm performed the best with ROC of 0.88 score before tuning and 0.9449 score after tuning. The novel work contributed is training gradient boosting algorithm and improvising the accuracy from 0.867 to 0.9449 and ROC score from 0.90 to 0.95 by optimizing various parameters of gradient boosting algorithms. The future work will be to attain the state of art results by training deep learning models for a larger dataset.
References 1. Bates, D.W., Saria, S., Ohno-Machado, L., Shah, A., Escobar, G.: Big data in health care: using analytics to identify and manage high-risk and high-cost patients. Health Aff. 33(7), 1123–1131 (2014) 2. Borthakur, D.: The Hadoop distributed file system: architecture and design. Hadoop Proj. Website 11(2007), 21 (2007) 3. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
252
M. Rajora et al.
4. Chen, M., Hao, Y., Hwang, K., Wang, L., Wang, L.: Disease prediction by machine learning over big data from healthcare communities. IEEE Access 5, 8869–8879 (2017) 5. Donaldson, M.S., Corrigan, J.M., Kohn, L.T., et al.: To Err is Human: Building a Safer Health System, vol. 6. National Academies Press, Washington, D.C. (2000) 6. Hafermehl, K.T.: High spatial resolution diffusion-weighted imaging (DWI) of ischemic stroke and transient ischemic attack (TIA) (2016) 7. Haihong, E., Zhou, K., Song, M.: Spark-based machine learning pipeline construction method. In: 2019 International Conference on Machine Learning and Data Engineering (iCMLDE), pp. 1–6. IEEE (2019) 8. Kansadub, T., Thammaboosadee, S., Kiattisin, S., Jalayondeja, C.: Stroke risk prediction model based on demographic data. In: 2015 8th Biomedical Engineering International Conference (BMEiCON), pp. 1–3. IEEE (2015) 9. Karau, H., Konwinski, A., Wendell, P., Zaharia, M.: Learning Spark: LightningFast Big Data Analysis. O’Reilly Media, Inc., Sebastopol (2015) 10. Roger, V.L., et al.: Heart disease and stroke statistics—2012 update: a report from the American heart association. Circulation 125(1), e2 (2012). Writing Group Members 11. Nwosu, C.S., Dev, S., Bhardwaj, P., Veeravalli, B., John, D.: Predicting stroke from electronic health records. In: 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 5704–5707. IEEE (2019) 12. World Health Organization, et al.: Global status report on noncommunicable diseases 2014. No. WHO/NMH/NVI/15.1. World Health Organization (2014) 13. Shanthi, D., Sahoo, G., Saravanan, N.: Designing an artificial neural network model for the prediction of thrombo-embolic stroke. Int. J. Biometric Bioinform. (IJBB) 3(1), 10–18 (2009) 14. Singh, M.S., Choudhary, P., Thongam, K.: A comparative analysis for various stroke prediction techniques. In: Nain, N., Vipparthi, S.K., Raman, B. (eds.) CVIP 2019. CCIS, vol. 1148, pp. 98–106. Springer, Singapore (2020). https://doi.org/10. 1007/978-981-15-4018-9 9 15. Apache Spark: Apache spark: lightning-fast cluster computing, pp. 2168–7161 (2016). http://spark.apache.org 16. Subha, P.P., Geethakumari, S.M.P., Athira, M., Nujum, Z.T.: Pattern and risk factors of stroke in the young among stroke patients admitted in medical college hospital, Thiruvananthapuram. Ann. Indian Acad. Neurol. 18(1), 20 (2015) 17. White, T.: Hadoop: The Definitive Guide. O’Reilly Media, Inc., Sebastopol (2012) 18. Zaharia, M., et al.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Presented as Part of the 9th {USENIX} Symposium on Networked Systems Design and Implementation ({NSDI} 2012), pp. 15–28 (2012)
Automated Diagnosis of Breast Cancer with RoI Detection Using YOLO and Heuristics Ananya Bal1 , Meenakshi Das2 , Shashank Mouli Satapathy1(B) , Madhusmita Jena3 , and Subha Kanta Das4 1
School of Computer Science and Engineering, Vellore Institute of Technology, Vellore, Tamil Nadu 632014, India [email protected] 2 Department of Computer Science and Engineering, Indraprastha Institute of Information Technology Delhi, New Delhi 110020, India 3 Department of Pathology, East Point College of Medical Sciences and Research Centre, Bengaluru, Karnataka 560049, India 4 Department of Pathology, MKCG Medical College, Brahmapur, Odisha 760004, India Abstract. Breast Cancer (specifically Ductal Carcinoma) is widely diagnosed by Fine Needle Aspiration Cytology (FNAC). Deep Learning techniques like Convolutional Neural Networks (CNNs) can automatically diagnose this condition by processing images captured from FNAC. However, CNNs are trained on manually sampled RoI (Region of Interest) patches or hand-crafted features. Using a Region Proposal Network (RPN) can automate RoI detection and save time and effort. In this study, we have proposed the use of the YOLOv3 network as an RPN and supplemented it with image-based heuristics for RoI patch detection and extraction from cytology images. The extracted patches were used to train 3 CNNs - VGG16, ResNet-50 and Inception-v3 for classification. YOLOv3 identified 164 RoIs in 26 out of 27 images and we achieved 96.6%, 98.8% and 98.9% classification accuracies with VGG16, ResNet50 and Inception-v3 respectively. Keywords: Breast cancer · Computer aided diagnosis learning · RoI extraction and classification · YOLO
1
· Deep
Introduction
The most common cancer among Indian women is breast cancer. In 2018, more than 1,60,000 new incidences of breast cancer were reported in India [5]. Ductal carcinoma is a cancer originating from the cells that line the milk ducts in breasts. More than 80% of all breast cancer cases are Ductal Carcinomas [4]. The National Cancer Registry Programme by the Indian Council of Medical Research estimates the rate of growth per decade of breast cancer in India lies within 15% and 20% [14]. Additionally, breast cancer prognosis in India is far worse than c Springer Nature Switzerland AG 2021 D. Goswami and T. A. Hoang (Eds.): ICDCIT 2021, LNCS 12582, pp. 253–267, 2021. https://doi.org/10.1007/978-3-030-65621-8_16
254
A. Bal et al.
the prognosis in developed countries with merely 50% women surviving [13]. The lack of awareness and screening are two major causes for this dismal statistic. India also has very burdened diagnostic systems. Therefore, creating intelligent CAD (Computer-Automated Diagnosis) systems is necessary to improve breast cancer diagnosis and treatment. Fine Needle Aspiration Cytology (FNAC) is the industry standard to diagnose breast cancer (Ductal Carcinoma). Pathologists use a narrow-gauge needle to collect a tissue sample from the breast for microscopic examination. The aspirated sample is fixed onto a slide and stained (common stains are Haematoxylin and Eosin (H&E), Pap, and Giemsa). A cytopathologist views stained slides under a microscope and makes a diagnosis. FNAC is preferred as it is minimally invasive and quick. While the process of FNAC remains manual, diagnosis from the microscopic images can be automated with the help of Computer Vision techniques. Deep Learning can be used to extract features for visual context and classify cytology images from FNAC lesions. Convolutional Neural Networks (CNN), a class of deep learning models, yield state-of-the-art results for image classification problems. They are increasingly being adopted for digital diagnosis. While CNNs can tackle medical image classification, they require a lot of data preparation and manually extracted RoIs which contain the objects that need to be classified. Most medical and cellular images have a lot of irrelevant data. Thus, extracting RoIs from these images is time-consuming and is mostly done manually. A Region Proposal Network (RPN) can automatically detect RoIs which can then be classified by CNNs. We aimed to build a pipeline that could automatically detect RoIs, extract them and classify them. We have achieved this by using YOLOv3, heuristics and CNNs. We present related literature in Sect. 2 and then outline our framework in detail in Sect. 3. This is followed by results and discussion in Sect. 4 and 5.
2
Prior Work
Given that mammograms usually have a single mass to be detected, YOLO has performed well in tumour detection in mammograms. Al-masni et al. [3] trained a YOLO-based CAD model that detected masses in mammograms and classified them as benign or malignant. The proposed system detected mass locations with 96.33% accuracy and distinguished between benign and malignant lesions with an accuracy of 85.52%. Mugahed A. et al. [2] proposed an integrated CAD system for lesion detection and classification from entire mammograms. A deep learning YOLO detector was used and evaluated for breast lesion detection. Then, three deep learning classifiers, a regular feedforward CNN, ResNet-50, and InceptionResNet-V2, were used for classification. Mugahed A. et al. [1] have utilized YOLO, a full-resolution convolutional network (FrCN) and Alexnet for RoI detection, segmentation and classification respectively. They achieved 98.96% mass detection accuracy, 92.97% segmentation accuracy and 95.64% classification accuracy.
Breast Cancer Diagnosis with RoI Detection Using YOLO and Heuristics
255
YOLO is especially effective in identifying RoIs in video frames. Gao et al. [7] proposed an approach to detect Squamous Cell Carcinomas from Oesophageal Endoscopic Videos using mask-RCNN and YOLOv3. They have used a NonMaxima Technique to ensure the detection of a single bounding box. The bounding box region is classified into three classes. YOLOv3 gave a detection accuracy of 85% and a classification accuracy of 74%. Ding et al. [6] proposed a novel approach for lesion localization in gastroscopic videos with a cloud-edge collaborative framework. They have detected upper gastrointestinal disease with a Tinier-YOLO algorithm and have improved the model performance by integrating lesion RoI segmentation into the YOLOv3 algorithm. Their results exhibit superior performance in mean Average Precision (mAP) and Intersection over Union (IOU) of lesion detection. Spanhol et al. [19] used deep learning on histopathological images from BreakHis public dataset. They concluded that CNN performed better than other machine learning models. They also tested the combination of different CNNs using simple fusion rules, obtaining some improvement in performance. Saikia et al. [17] used CNNs for the diagnosis of cell samples from FNAC images and tested VGG16 and other network architectures for a comparative study. The results showed that GoogLeNet-V3 achieved 96.25% accuracy. Vesal et al. [22] fine-tuned the Inception-v3 and ResNet-50 networks and used transfer learning with weights from the ImageNet competition to classify the BACH 2018 challenge data. They achieved 97.08% classification accuracy for Inception-v3 and 96.66% classification accuracy for ResNet-50.
3
Proposed Framework
The workflow of our proposed framework can be seen in Fig. 1. The framework can be divided into six modules which are as follows: 3.1
Data Collection
Pathologists view slides with stained tissue from FNAC under microscopes for diagnosis. The microscopic lesions, as viewed under the microscope, can be captured by a camera to produce digital images. For this study, FNAC images of breast lesions with Giemsa stain were captured in the pathology labs at East Point College of Medical Sciences and Research Centre, Bangalore, India and Maharaja Krushna Chandra Gajapati Medical College, Brahmapur, Odisha, India following all ethical protocols. Benign FNAC samples collected from 40 patients and malignant FNAC samples (Ductal Carcinoma NOS) collected from 30 patients were used. These samples were observed at 400x magnification by certified cytopathologists using a CH-20i Olympus Binocular Microscope and an LM-52–6000 Lawrence and Mayo Multi viewing Microscope. The diagnosis for each sample has been confirmed by histopathology. Image collection methods were consistent.
256
A. Bal et al.
On an average, 1 to 2 images were taken per patient’s sample. This produced 79 images of benign lesions and 56 images of malignant lesions. Roughly 80% of these images were used for training (64 benign images and 44 malignant images) the YOLOv3 network after annotation. The remaining images were used for testing (15 benign images and 12 malignant images). All test images belonged to different patients. For training the CNN, 4 to 8 RoI patches of size 256 × 256 pts were extracted from each image in the training set depending on the cellular content. This resulted in 389 benign patches and 300 malignant patches as presented in Table 1.
Fig. 1. Proposed framework
Breast Cancer Diagnosis with RoI Detection Using YOLO and Heuristics
257
Table 1. Data distribution Category
79
56
Images for training YOLO
64
44
Images for testing YOLO
15
12
RoI patches for training CNN
3.2
Benign Malignant
Slide Images
389
300
Augmented patches for training CNN 2334
1800
Image Augmentation for CNNs
Neural networks can overfit on small datasets, rendering them unable generalize to larger unseen test data. To avoid this and to obtain good classification accuracy, we have increased the cardinality of our training data for the CNN classifiers through data augmentation. We applied geometric transformations, i.e., flipping, mirroring and rotation. Microscopic cellular images are rotationally invariant i.e., a pathologist will be able to make a diagnosis irrespective of the angle of rotation of the image. Therefore, our image patches were rotated by 90◦ three times, resulting in 4 images. Similarly, flipped or mirrored images of cells can still be diagnosed by a pathologist. As a result of the augmentation, the cardinality of our training data grew by a factor of 6, resulting in 2334 benign training patches and 1800 malignant training patches. Nearly 75% of these RoI patches were placed under the training set and the remaining were placed under the validation set as provided in Table 1. 3.3
Annotation for YOLOv3
The training images need to be annotated to define ground truth for the RoI detection model. Annotation is the process of identifying an object in an image with a box and tagging the object with a label. This was done with the help of the Microsoft Visual Object Tagging Tool (VoTT) software. This process saves the x and y coordinates of the boxes we draw to enclose the RoIs. The process is seen in Fig. 2. Object annotation is largely subjective. There is no set method for annotating irregularly-shaped cell clusters. Many cell clusters are big and occupy a large part of the image. These clusters were not annotated with single large boxes as this would pick up more background information than is desirable. Instead, as seen in Fig. 2, multiple smaller and tight-fitting boxes were used to cover the entire cell cluster. This resulted in higher object to background ratio. 3.4
RoI Detection with YOLO
YOLO (You Only Look Once) is a unified CNN-based architecture that can identify and classify objects [15]. It is faster than other networks used for object
258
A. Bal et al.
Fig. 2. Image annotation with Micrsoft VoTT
detection and is therefore uniquely suited to object detection in video frames. As the name suggests, YOLO looks at the entire image to learns context instead of looking at smaller parts of the image. Object detection is formulated as a regression problem in the YOLO architecture. An image is divided into a grid and each grid cell detects bounding boxes for objects whose centres lie within the cell. The grid cells predict bounding boxes for objects by resizing a fixed number of boxes of different aspect ratios called anchors. After this, confidence values for the boxes are calculated (Eqs. 1 and 2). They indicate the model’s confidence in the accuracy of the box’s fit to an object. Duplicate boxes or boxes having a high degree of overlap with other boxes are discarded by Non-maximal Suppression (NMS) which looks at the Intersection Over Union (IOU) between proposed bounding boxes. In NMS, the proposed bounding box with the highest confidence is added to a list and other proposed bounding boxes which have a high IOU with the boxes already in the list are discarded on the grounds of duplicity. This process is done iteratively and an IOU threshold is set for the same. IOU =
Area in intersection Area in union
Conf idenceV alue (cv) = P (RoI) × IOU
(1) (2)
where, P (RoI) is the probability of the object being in the bounding box. A CNN is used to learn the parameters of bounding boxes - the coordinates of the centre of the box relative to the grid cell, height, width and the confidence
Breast Cancer Diagnosis with RoI Detection Using YOLO and Heuristics
259
value (x, y, h, w, cv). The class of the object can also be learned by the CNN. But in our framework, taking inspiration from studies [1,2], we test YOLO as a region proposal network only. Hence, we have used it for RoI detection solely and not to classify the suggested RoIs. Features are extracted in the first few layers while the fully-connected layers at the end predict bounding box coordinates as well as class probabilities, if classification is also to be done. The loss function of the CNN is a modified sum squared error function. This is similar to the squared error losses (Root Mean Squared error and Mean Squared Error) in regression. Since FNAC is the conclusive diagnostic test for breast cancer, testing the efficiency of YOLO as an RPN on cytology images is crucial. We have not found literature which has attempted this. There are certain challenges associated with cytological images because they contain cell clusters that do not have clearly demarcated boundaries. They vary a lot in size and shape. But the biggest challenge with these images is the presence of multiple RoIs in close proximity, unlike in the case of most radiological images, where there is only one RoI. Hence, multiple bounding-boxes are required to cover all and many of these overlap. To conclude, identifying RoIs in these images is a complex problem and required modifications in the YOLO network. YOLOv3 Network The YOLOv3 network is an improved version of YOLO. It has two parts - a Feature Extractor and a Detector. The architecture can be seen in Fig. 3. DarkNet53 is the feature extractor [16] and has 53 layers - 52 convolution layers and one fully-connected layer. There are 23 residual blocks which contain residual layers. Residual blocks use skip connections [8]. The convolution layers use the Leaky ReLU activation function (Eq. 3) [12] to avoid dying neurons which occur with the ReLU (Eq. 4) function when the input is lower than 0. The detector module has 53 layers consisting of 1 × 1 and 3 × 3 convolution layers and two upsampling layers. The feature extractor downsizes the input and the detector module applies 1 × 1 detection kernels on feature maps from the last layers of the last three residual blocks. The detection occurs at three different scales. The feature map from the first scale is upsampled (enlarged) by a factor of 2 for the second scale and the feature map from the second scale is upsampled by a factor of 2 for the third scale. Each grid cell predicts three bounding boxes using three anchors at each scale, making the total number of anchors used 9. The different scales help the network detect objects of all sizes, especially small ones. The confidence values are calculated for all detected boxes and the best fitting box is retained by NMS. The final output is a list of detected bounding boxes which enclose RoIs. We used transfer learning by leveraging the pre-trained DarkNet53 weights on ImageNet data. The training continued for 50 epochs with an initial learning rate of 0.0001 which was gradually reduced. We implemented the network with Keras and Tensorflow in Python.
260
A. Bal et al.
Fig. 3. YOLOv3 architecture
Leaky ReLU activation f unction : x if x > 0; f (x) = 0.1x otherwise
(3)
After fine-tuning hyperparameters, we established a batch size of 12 for the network and used the RMSprop optimizer [9] instead of Adam [11] which is the default optimizer in DarkNet53. 15% of the training data was used for validation. The key change in our implementation was the choice of the confidence value threshold. Traditionally, a value (0.5-0.95) is used to retain the best-fitting bounding boxes that have confidence scores above the threshold. However, given the complex nature of our images and RoIs, the threshold was lowered to 0.1 to retain a few poorly fit bounding boxes. This ensures that bounding boxes are retained for most RoIs and a sufficient number of RoIs are detected for diagnosis by classification. 3.5
Heuristic RoI Patch Extraction
The YOLOv3 network performed reasonably well in detecting RoIs in the test images. To use CNNs for classification, all RoI patches need to be of a uniform size. However, the bounding boxes and their respective RoIs, vary in size and shape. One solution to overcome this hurdle is to resize all RoIs to a single dimension. However, resizing alters the shape of cells especially if the aspect ratio of the RoI is not maintained. The RoIs vary in shape and so resizing them all to a single fixed size (256 × 256 in our case) would stretch the cells in the patches. Since the distinction in shape between benign and malignant cells is important in interpreting features for a diagnosis, it is not wise to resize the RoIs. The other solution is to take a smaller, fixed-size patch from each RoI. But this approach leads to loss of data. The cells which are in the RoI but not in the patch may be essential features. Therefore, our objective was to extract a minimum number of patches from any suggested RoI such that we maximized the features covered, while having minimum overlap among patches. We tackled this with heuristics. Upon examination of the dimensions of all bounding boxes, we identified 10 categories of RoIs based on variations in width (x) and height (y). The number
Breast Cancer Diagnosis with RoI Detection Using YOLO and Heuristics
261
of 256 × 256 patches that could be extracted from an RoI depended on the category into which the RoI fell. The categories are: – Category 1 (x < 256 and y 0.1 and an IOU with the ground truth >= 0.5. Our implementation of YOLOv3 generated an Average Precision of 29.3% with 560 bounding boxes, failing to predict any RoI with confidence score > 0.1 for one image in 27 images. This indicates the complexity of cytology images and that YOLOv3 was 96% effective in detecting RoIs in cytological images. Table 2. Results Metric
Formula
VGG16 ResNet-50 Inception-v3
Accuracy
(TP + TN) Total Cases
0.966
0.988
0.989
Precision
TP (TP + FP)
0.996
0.985
1
Recall (or Sensitivity)
TP (TP + FN)
0.944
0.994
0.982
Specificity
TN (FP + TN)
0.996
0.979
1
0.004
0.021
0
0.055
0.005
0.017
False Positive Rate False Negative Rate
FP (FP + TN) FN (FN + TP)
The final diagnosis was the majority class from the predicted labels for all RoIs in images from a patient’s sample. We tested the system for accuracy in diagnosing a patient’s sample. The framework gave the correct diagnosis for all but one patient’s test image, failing only when the YOLO detector could not find any RoIs.
Breast Cancer Diagnosis with RoI Detection Using YOLO and Heuristics
265
Fig. 7. Confusion matrices for VGG16, ResNet-50 and Inception-v3
Among the CNN classifiers, Inception-v3 performed the best for our data shown in Table 2. Sensitivity and specificity are especially important for assessing the performance of medical tests and while all three networks give abovesatisfactory values, Inception-v3 and ResNet-50 are the obvious choices, because their sensitivity and specificity values are both high. Inception-v3 has a slight edge over Resnet-50 as it does not falsely classify any benign patches as malignant (False positive rate is 0). While VGG16 has a high specificity value, its sensitivity value is low. This combined with its lower accuracy value makes VGG16 an inferior model. Refer Table 2. All metrics provided in Table 2 are calculated from the confusion matrices shown in Fig. 7.
5
Conclusion
Our implementation of YOLOv3 is 96% effective in suggesting RoIs. It fails for one image only but since the diagnosis is dependent on RoIs, having zero RoIs from an image will result in no diagnosis. This needs to be overcome with further improvements to the YOLOv3 detector network. The addition of more data to the pipeline will also certainly improve the bounding box detection. Overall, our framework performed very well in detecting and classifying RoIs in cytology images from breast FNAC lesions. The diagnostic accuracy of our framework lies between 96.6% and 98.9% depending on the choice of classifier network. High precision, sensitivity and specificity values make the framework suitable for industry applications. We conclude that the YOLOv3 network is an effective Region Proposal Network for cytological images. The combination of YOLOv3, heuristics and CNNs fully automates the diagnosis of Ductal Carcinoma with good classification accuracy and with further improvement, can be used in medical laboratories.
266
6
A. Bal et al.
Discussion and Future Work
While lowering the confidence threshold in YOLO can retain more bounding boxes, unless the RPN is improved in terms of training and/or architecture, there can be cases where no RoIs are detected and a diagnosis is not possible. The RPN can either be trained on datasets which are similar to ours or be trained on more data. Also, other RPN architectures such as SSD (Single Shot Detector) and Faster RCNN can be tested in the pipeline. In the future, we will work on integrating detection and classification into a single network. YOLOv3 detects more bounding boxes for the malignant test images even though they are lesser in number than the benign test images. Therefore, the number of malignant patches is also higher. This shows that the learning is skewed and that the network identifies malignant cells better than it identifies benign cells. Again, the addition of more data is expected to resolve this issue. Acknowledgment. We acknowledge the help and support of lab technicians and pathologists at East Point College of Medical Sciences and Research Centre and Hospital, Bengaluru, India, and Maharaja Krushna Chandra Gajapati Medical College, Brahmapur, Odisha, India. Their diligent efforts and coordination have made this study possible. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
References 1. Al-Antari, M.A., Al-Masni, M.A., Choi, M.T., Han, S.M., Kim, T.S.: A fully integrated computer-aided diagnosis system for digital x-ray mammograms via deep learning detection, segmentation, and classification. Int. J. Med. Inf. 117, 44–54 (2018) 2. Al-antari, M.A., Kim, T.S.: Evaluation of deep learning detection and classification towards computer-aided diagnosis of breast lesions in digital x-ray mammograms. Computer Methods and Programs in Biomedicine p. 105584 (2020) 3. Al-masni, M.A., et al.: Detection and classification of the breast abnormalities in digital mammograms via regional convolutional neural network. In: 2017 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). pp. 1230–1233. IEEE (2017) 4. Breastcancer.org: Invasive ductal carcinoma: Diagnosis, treatment, and more. https://www.breastcancer.org/symptoms/types/idc (2019) 5. Cancer Today: International Agency for research on Cancer: Iarc world cancer report 2020. https://www.iccp-portal.org/sites/default/files/resources/IARCWorld-Cancer-Report-2020.pdf (2018). Accessed: 20 Feb 2020 6. Ding, S., Li, L., Li, Z., Wang, H., Zhang, Y.: Smart electronic gastroscope system using a cloud-edge collaborative framework. Future Generation Comput. Syst. 100, 395–407 (2019) 7. Gao, X., Braden, B., Taylor, S., Pang, W.: Towards real-time detection of squamous pre-cancers from oesophageal endoscopic videos. In: 2019 18th IEEE International Conference on Machine Learning and Applications (ICMLA). pp. 1606–1612. IEEE (2019)
Breast Cancer Diagnosis with RoI Detection Using YOLO and Heuristics
267
8. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778 (2016) 9. Hinton, G., Srivastava, N., Swersky, K.: Coursera: Neural networks for machine learning: Lecture 6(a)–overview of mini-batch gradient descent. https://www.cs. toronto.edu/∼tijmen/csc321/slides/lecture slides lec6.pdf (2014) 10. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015) 11. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014) 12. Maas, A.L., Hannun, A.Y., Ng, A.Y.: Rectifier nonlinearities improve neural network acoustic models. In: Proceedings of ICML. vol. 30, p. 3 (2013) 13. National Cancer Institute (NCI-AIIMS: Cancer statistics — drupal. http:// nciindia.aiims.edu/en/cancer-statistics (2020) 14. National Centre for Disease Informatics and Research: NCPR three-year report of population based cancer registries 2012–2014. https://ncdirindia.org/NCRP/ ALL NCRP REPORTS/PBCR REPORT 2012 2014/ALL CONTENT/PDF Printed Version/Chapter10 Printed.pdf (2020) 15. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 779–788 (2016) 16. Redmon, J., Farhadi, A.: Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767 (2018) 17. Saikia, A.R., Bora, K., Mahanta, L.B., Das, A.K.: Comparative assessment of cnn architectures for classification of breast fnac images. Tissue Cell 57, 8–14 (2019) 18. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014) 19. Spanhol, F.A., Oliveira, L.S., Petitjean, C., Heutte, L.: Breast cancer histopathological image classification using convolutional neural networks. In: 2016 International Joint Conference on Neural Networks (IJCNN), pp. 2560–2567. IEEE (2016) 20. Szegedy, C., et al.: Going deeper with convolutions. In: Proceedings of the IEEE Conference On Computer Vision And Pattern Recognition. pp. 1–9 (2015) 21. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference On Computer Vision And Pattern Recognition, pp. 2818–2826 (2016) 22. Vesal, S., Ravikumar, N., Davari, A.A., Ellmann, S., Maier, A.: Classification of breast cancer histology images using transfer learning. In: Campilho, A., Karray, F., ter Haar Romeny, B. (eds.) ICIAR 2018. LNCS, vol. 10882, pp. 812–819. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-93000-8 92
Short Papers
An Efficient Approach for Event Prediction Using Collaborative Distance Score of Communities B. S. A. S. Rajita(B) , Bipin Sai Narwa, and Subhrakanta Panda CSIS, BITS-Pilani, Hyderabad Campus, Pilani, India {p20150409,f20170030,spanda}@hyderabad.bits-pilani.ac.in
Abstract. An effective technique for prediction of events can capture the evolution of communities and help understand the collaborative trends in massive dataset applications. One major challenge is to find the derived features that can improve the accuracy of ML models in efficient prediction of events in such evolutionary patterns. It is often observed that a group of researchers associate with another set of researchers having similar interests to pursue some common research goals. A study of such associations forms an essential basis to assess collaboration trends and to predict evolving topics of research. A hallmarked co-authorship dataset such as DBLP plays a vital role in identifying collaborative relationships among the researchers based on their academic interests. The association between researchers can be calculated by computing their collaborative distance. Refined Classical Collaborative Distance (RCCD) proposed in this paper is an extension of existing Classical Collaborative distance (CCD). This computed RCCD score is then considered as a derived feature along with other community features for effective prediction of events in DBLP dataset. The experimental results show that the existing CCD method predicts events with an accuracy of 81.47%, whereas the accuracy of our proposed RCCD method improved to 84.27%. Thus, RCCD has been instrumental in enhancing the accuracy of ML models in the effective prediction of events such as trending research topics. Keywords: Social networks · Community mining Collaboration distance · Machine learning
1
· Event prediction ·
Introduction
A massive data set is a collection of voluminous data that can be represented in the form of a social network (SN) to understand relations among the data such as associations, behavior, interactions, and collaborative trends etc. A SN is a graphical representation of the dataset consisting of nodes interconnected c Springer Nature Switzerland AG 2021 D. Goswami and T. A. Hoang (Eds.): ICDCIT 2021, LNCS 12582, pp. 271–279, 2021. https://doi.org/10.1007/978-3-030-65621-8_17
272
B. S. A. S. Rajita et al.
through edges. Such graphical representation of these SNs consists of nodes that correspond to individual entities, and edges correspond to the relationship between the nodes (entities) [8]. A community is defined to be a subset (subgraph) of the graphical representation of a social network. A community represents dense intra-connection among its nodes and sparse inter-connection with the nodes that are outside the community. The structure of these communities change according to the occurrence of different events, such as Form, Dissolve, Same, Split, and Merge [2]. Researchers often form different collaborations in order to discover new knowledge, improve individual skills or specialization, and also for pursuing inter-disciplinary research goals. Such association is an essential basis to assess collaborative trends and to identify trending (evolving) topics of research. Supposing a researcher X switches to another research domain (community) then it will certainly impact the structure of both parent and host communities. Such structural changes (transitions) also reflect a collaborative change that attributes to SN evolution [4,9]. Therefore, prediction of events is essential in many areas of SN applications related to mining the underlying structure and detection of the network’s structural properties. Thus prediction of the future state of the communities have recently drawn significant attention of the researchers [11]. This paper aims to improve the accuracy in the prediction of events that are likely to occur within the communities. But, one of the major challenges in such evolutionary patterns, in a massive dataset, is finding the derived features of communities that improve the accuracy of ML models for efficient prediction events [6]. This paper envisages to calculate the collaborative change by calculating the distance between community of researchers (Authors of research articles), who change their publishing pattern (area of interest). Thereby influences a collaborative change in the communities (occurrence of an event) of a co-authorship dataset. For this purpose, collaborative distance is proposed as an additional feature along with other community features to train the machine learning models for better accuracy in prediction. The organization of the rest of the paper is as follows: Sect. 2 gives the details of the background and the proposed methodology. The performance and comparative analysis of the experimental results of our proposed approach are available in Sect. 3. Section 4 concludes the work with some insights into our future work.
2
Background and Proposed Methodology
In this section, we give a basic understanding of the SNs and our methodology. A SN can be represented as a set of graphs over a set of timestamps, {1,2, . . . , T}. Thus, a SN can be represented in the form of {Gt }Tt=1 , where each Gt = (V t , E t ) represents the entire snapshot (graphical representation) at timestamp t, V t = {v1 , v2 , v3 , ...vm } is the set of m vertices and E t = {e1 , e2 , e3 , ...en } is the set of n edges. The timestamp considered in this paper to construct the graphs is one year.
Event Prediction Using Collaborative Distance
273
Fig. 1. A proposed framework for event prediction in a social network. Table 1. Comparative analysis of nine community detection algorithms Community Algorithm CC Louvain
M
0.83 0.96
Time(sec) 20
Multilevel
0.73 0.96
40
Fastgreedy
0.73 0.95
756
Label propagation
0.88 0.88
831
Infomap
0.67 0.87
3433
Walktrap
0.65 0.84
5996
Eigenvector
0.61 0.77
533
Spinglass
0.55 0.78 42083
Edgebetweeness
0.45 0.36 44592
A community, CG , where CG ⊆ G, is a sub-graph having densely connected internal nodes and these nodes have a sparse connection with other G − CG nodes. The density is computed as the average degree of vertices in CG , given N
(d+ (Vi )+d− (Vi ))
as DCG = i=1 CG N CG , to find the communities of G. So, a set of i communities of Graph G for year t is represented as CGt = {C1t , C2t , C3t , ...Cit }. Community mining refers to the detection of structural changes in a sequence of communities over a period of time [5,13]. Supposing CG1 to CGt are sequence of communities over the years, y1 to yt , 1 ≤ t ≤ T . Next we need a technique to identify the dependence of changes in these communities on the behavior of the nodes in the community [1,10]. This can be depicted by finding direct features (Community features) and Derived features. The proposed framework (shown in Fig. 1) is to detect the communities, extract features from the communities, compute RCCD values, and predict events using ML (Machine Learning) models. The first step in Fig. 1 is to convert XML format of DBLP data into a vector format. The second step is to graphically represent the temporal data of each year. In the third step, we detect the communities for each year of data using the Louvain approach. The results
274
B. S. A. S. Rajita et al.
proves that Louvain detects communities in less Time with high Clustering Coefficient(CC) and Modularity(M) as compared to the other eight methods (refer Table 1). In the fourth step, we applied the algorithm in [2] to mine (identify the events) detected communities in previous step. These identified events are labeled with their communities using LDA approach [7]. In the fifth step, we compute the community features (Sect. 2.1) and the RCCD score (Sect. 2.2) of the communities. These scores are then used as feature of the ML model for the prediction of events. 2.1
Community Features
The extraction of community features is significant in identifying the evolutionary patterns of the communities. In this work, we identified 13 categorical community features [12] for better internal connectivity. The identified community features are Year, Node Degree, Edge Degree, Number of Inter edges, Number of Inter edges,Degree, Conductance, Density, Connected components, Clustering Co-efficient,Activeness, Aging, and Events. Experiments were carried out to justify the relevance of these 13 community features using the Pearson correlation heatmap [12]. The plot of the Pearson correlation heatmap in Fig. 2(a) shows the correlation of the community features (independent variables) with the Events (output variable). Generally, the correlation coefficient ranges between -1 to 1, but Fig. 2(a) shows that the correlation coefficient values of community features range between -0.8 to 0.8. Therefore, it is inferred from the experimental results in Fig. 2(a) that our identified community features are beneficial for predicting events of the communities.
(a) Correlation between Community (b) Average value of RCCD Score at Features and Events each year
Fig. 2. Scores
This work focuses on calculating the relationships among authors (nodes) involved in the change of interactions and derive properties to predict the events of communities. The association between authors can be calculated by computing their collaborative distance. The proposed Refined Classical Collaborative Distance (RCCD) is an extension of existing Classical Collaborative distance. Collaborative distance between two nodes is the shortest edge length between
Event Prediction Using Collaborative Distance
275
them in a given community. It signifies how similar nodes influence a community to change its structure (number of edges necessary to reach a node from another node). Existing approaches are based on similarity measure within the community only. Refined Classical Collaboration Distance of a node (author) is calculated by using intersection between two authors based on paper publication on same n 1 . Where p(ai ) is the set topic in a given graph. RCCD = |p(a ) ∩ p(a i (i+1) )| i=0 of papers of author ai and | · | is the cardinality operator. RCCD score of the community is calculated by considering the centrality of the community : σ 2 = n (RCCDi − μ)2 i=1
n
.
Fig. 3. Probability distribution function between RCCD and events
2.2
RCCD Score
Observation. How to know the RCCD Score could be a good measure for detecting event changes? The above question is answered by the usage of two strategies called PDF and Poisson distribution. First Strategy: We added a new feature called RCCD to improve the accuracy of the ML model. Before that we needed to check whether the newly derived feature affects the target variable (Events). This effect can be identified by identifying relation between newly derived feature and the target variable. Relation can be identified by using two steps [3]. Step one calculates the average value of new feature at each time-interval. Figure 2(b) shows the outcome of the first step. The inference from Fig. 2(b) is that the average value of RCCD increased every year (this justifies the first step). In second step we identified impact of RCCD on each Event by calculating probability density function (PDF). The outcome of this step is represented in Fig. 3 and shows the PDF values of Merge (which is 1.6), Split (which is 1.6), Form (which is 2.5), Same (which is 0.012), and Dissolve (which is 0.0035) events. The average values of RCCD (from Fig. 2(b)) are matching with these PDF values. It is expected as RCCD score signifies the
276
B. S. A. S. Rajita et al. Table 2. Poisson multiple regression results Parameters
Beta value Std. err
RCCD score
0.0289
0.008
Conductance
0.4086
0.173
Connected components 0.0016
0.000
Degree
0.0155
0.002
Density
0.0138
0.004
Aging
0.2086
0.113
tendency of nodes to link with other nodes that have a similar degree distribution (authors tend to collaborate with other authors having similar paper publications). The inference from Fig. 3 is that RCCD score is considerably impacting Merge, Split and Form events and has less impact on Same and Dissolve events. Second Strategy: In this strategy, we observed the effect of independent variables on dependent variables by using Poisson regression model [5]. We start by assigning weights to the event changes. The higher weight associated with RCCD Score (0.0289) in Table 2 proves that it indeed is a good measure for detecting event changes. So, based on above justifications, we included RCCD Score as an additional feature along with community features. The pseudo-code of our proposed algorithm to compute the RCCD score of all the communities is given in Algorithm 1.
Algorithm 1: RCCD Algorithm Input : Features Data frame=df{} Output: RCCD Score 1 n = 0 2 auth ano = [] 3 list ind = int(df[year][index])-2001 4 for auth in df[names][index] n 1 5 RCCD = i=0 p(ai ) ∩ p(a(i+1) ) n 2 (RCCDi − µ) i=1 6 RCCD Score = n 7 end for 8 return RCCD Score
3
Performance and Comparative Analysis of Results
We conducted experiments on several ML multi-classification algorithms to find the best accuracy model for predicting the events of the communities. We used Decision Tree, Naive Bayes, Neural Networks, SVM , and Logistic Regression ML models for our experimentation. For all the ML models, all the classifiers are evaluated by 10-fold cross-validation approach.
Event Prediction Using Collaborative Distance
3.1
277
Performance Analysis
We included all sets of community and derived features in our experimentation. We compared the results obtained by considering only the community features (shown in Table 3) with the results obtained by considering RCCD scores as an additional feature along with community features. Accuracy gives the measure of the correctness of predicted events based on the training events. As can be inferred from Table 3, the addition of the RCCD score as a feature improved the performance of the Logistic Regression model from 81.47% to 84.27% (aprox. 3.43% improvement). Precision gives the measure of the positivity of the correctly predicted events based on the training events. Our results in Table 3 show that the precision of the Logistic Regression model improved from 79.68% to 83.62% (aprox. 4.94% improvement). The recall is the ratio of correctly predicted positive events to all the predicted events. It measures the completeness (how relevant the results are) of the ML models. Recall value of the Logistic Regression model improved from 76.37% to 83.53% (aprox. 9.37% improvement). F-measure or F1 score is the weighted average of Precision and Recall. Our results in Table 3 show that the F-score of the Logistic Regression model improved from 76.83% to 83.57% (aprox. 11.67% improvement). Table 3. Experimental comparison of ML models.
3.2
Classifiers
Accuracy
Approach
CF
Precision
RCCD CF
Recall
RCCD CF
F-measure RCCD CF
RCCD
Decision Tree 52.92 77.82
51.98 70.48
50.98 73.41
51.24 71.79
Naive Bayes
72.61 72.31
71.72 73.71
72.16 72.89
73.92 79.91
NN
81.84 79.34
72.14 78.63
75.53 77.24
74.25 77.94
SVM
81.17 82.17
79.92 81.57
71.82 81.59
70.85 81.58
LR
81.47 84.27
79.68 83.62
76.37 83.53
76.83 83.57
Comparative Analysis with Existing Work
Classical Collaborative Distance (CCD) method [8] finds common neighbors. And verified a correlation between the number of common neighbors of vi and vj at the time t, and the probability that they will collaborate in the future. So, it identified collaboration of only two nodes in the future. Table 4. Performance analysis of RCCD and CCD on Logistic Regression Model. Metrics
Accuracy
Precision
Recall
F-measure
Approach CCD RCCD CCD RCCD CCD RCCD CCD RCCD LR
82.54 84.27
81.12 83.62
82.67 83.53
81.45 83.57
278
B. S. A. S. Rajita et al.
Whereas, the novelty in our proposed RCCD approach is that it considers the similarity measure of both intra- and inter- communities and is Table 3 infers that the logistic regression (LR) model predicted events more accurately compared to the remaining four ML models. In Table 4, we compared logistic regression model with the performance of our proposed model. Our model predicts events approximately with 2.09% more accuracy than CCD approach. The precision of our prediction is approximately 3.08 % more compared to CCD approach. Our proposed method predicts events with 83.53% recall, which is approximately 1.04% improvement over CCD approach. Similarly, the F-measure for our model is 83.57%, which is approximately 2.60% better than CCD approach.
4
Conclusion and Future Work
This paper presented a scalable and parallel framework for modeling a coauthorship SN and demonstrated how the proposed new feature, called RCCD score, improved the performance of the ML models for the prediction of the events in social network communities. The performance and effectiveness of the proposed framework show that the addition of the RCCD score as a feature improved the accuracy of the ML model. Thus, the accuracy of the proposed approach in prediction of the events improved from 81.47% to 84.27%. In the future, we plan to design a strategy to measure similarity using GAN and stochastic gradient methods for further improvement in accuracy.
References 1. Alamuri, M., Surampudi, B.R., Negi, A.: A survey of distance/similarity measures for categorical data. In: 2014 IJCNN, pp. 1907–1914. IEEE (2014) 2. Bommakanti, S.R., Panda, S.: Events detection in temporally evolving social networks. In: 2018 ICBK, pp. 235–242. IEEE (2018) 3. Chakraborty, T., Srinivasan, S., Ganguly, N., Mukherjee, A., Bhowmick, S.: On the permanence of vertices in network communities. In: Proceedings of the 20th ACM SIGKDD, pp. 1396–1405. ACM (2014) 4. Kong, X., Shi, Y., Yu, S., Liu, J., Xia, F.: Academic social networks: modeling, analysis, mining and applications. J. Network Comput. Appl. 132(3), 86–103 (2019) 5. Leskovec, J., Lang, K.J., Dasgupta, A., Mahoney, M.W.: Statistical properties of community structure in large social and information networks. In: Proceedings of the 17th ICWWW, pp. 695–704. ACM (2008) 6. Newman, M.E.: Co-authorship networks and patterns of scientific collaboration. In: Proceedings of the National Academy of Sciences 101(1), pp. 5200–5205. National Acad Sciences (2004) 7. Papanikolaou, Y., Foulds, J.R., Rubin, T.N., Tsoumakas, G.: Dense distributions from sparse samples: improved gibbs sampling parameter estimators for LDA. J. Mach. Learn Res. 18(1), 2058–2115 (2017) 8. Pereira, F.S., de Amo, S., Gama, J.: Detecting events in evolving social networks through node centrality analysis. In: ECML, pp. 47–60 (2016)
Event Prediction Using Collaborative Distance
279
9. Pradhan, A.K., Mohanty, H., Lal, R.P.: Event detection and aspects in twitter: a bow approach. In: Fahrnberger, G., Gopinathan, S., Parida, L. (eds.) ICDCIT 2019. LNCS, vol. 11319, pp. 194–211. Springer, Cham (2019). https://doi.org/10. 1007/978-3-030-05366-6 16 10. Rajita, B.S.A.S., Kumari, D., Panda, S.: A comparative analysis of community detection methods in massive datasets. In: Goel, N., Hasan, S., Kalaichelvi, V. (eds.) MoSICom 2020. LNEE, vol. 659, pp. 174–183. Springer, Singapore (2020). https://doi.org/10.1007/978-981-15-4775-1 19 11. Rajita, B.S.A.S., Ranjan, Y., Umesh, C.T., Panda, S.: Spark-based parallel method for prediction of events. Arabian J. Sci. Eng. 45(4), 3437–3453 (2020). https://doi. org/10.1007/s13369-020-04437-2 12. Saganowski, S., Br´ odka, P., Koziarski, M., Kazienko, P.: Analysis of group evolution prediction in complex networks. PLoS ONE 14(10), e0224194 (2019) 13. Sharma, A., Bhavani, S.D.: A network formation model for collaboration networks. In: Fahrnberger, G., Gopinathan, S., Parida, L. (eds.) ICDCIT 2019. LNCS, vol. 11319, pp. 279–294. Springer, Cham (2019). https://doi.org/10.1007/978-3-03005366-6 24
A Distributed System for Optimal Scale Feature Extraction and Semantic Classification of Large-Scale Airborne LiDAR Point Clouds Satendra Singh and Jaya Sreevalsan-Nair(B) Graphics-Visualization-Computing Lab, International Institute of Information Technology, 26/C, Electronics City, Bangalore 560100, Karnataka, India [email protected] http://www.iiitb.ac.in/gvcl
Abstract. Airborne LiDAR (Light Detection and Ranging) or aerial laser scanning (ALS) technology can capture large-scale point cloud data, which represents the topography of large regions. The raw point clouds need to be managed and processed at scale for exploration and contextual understanding of the topographical data. One of the key processing steps is feature extraction from pointwise local geometric descriptors for object-based classification. The state of the art involves finding an optimal scale for computing the descriptors, determined using descriptors across multiple scales, which becomes computationally intensive in the case of big data. Hence, we propose the use of a widely used big data analytics framework integration of Apache Spark and Cassandra, for extracting features at optimal scale, semantic classification using a random forest classifier, and interactive visualization. The visualization involves real-time updates to the selection of regions of interest, and display of feature vectors upon a change in the computation of descriptors. We show the efficacy of our proposed application through our results in the DALES aerial LiDAR point cloud. Keywords: Big data framework · Apache Spark · Cassandra · LiDAR point cloud · Multiscale feature extraction · Semantic classification
1
Introduction
Three-dimensional (3D) topographical data for large expanses of region is captured effectively using airborne Light Detection and Ranging (LiDAR) technology. The data is procured in the format of point clouds, which are unstructured. Contextual understanding of such data is necessary to make sense of the environment and its constituents. Hence, semantic classification is a key processing Supported by Early Career Research Award, Science and Engineering Research Board, Government of India. c Springer Nature Switzerland AG 2021 D. Goswami and T. A. Hoang (Eds.): ICDCIT 2021, LNCS 12582, pp. 280–288, 2021. https://doi.org/10.1007/978-3-030-65621-8_18
A Distributed Optimal-Scale Analytics System for Large-Scale Point Clouds
281
Fig. 1. Our proposed 3-stage workflow for data management, optimal scale feature extraction, and visual analytics involving interactive visual exploration and semantic classification of large-scale airborne LiDAR point clouds using Apache Spark-Cassandra integrated framework, with Datastax Spark-Cassandra Connector. The processed data is managed using resilient distributed datasets (RDDs).
method applied on the point clouds. The state of the art in semantic classification of LiDAR point clouds is mostly supervised learning including ensemble learning techniques, such as random forest classifiers [12], and deep learning techniques [11]. The feature vector for the learning task is obtained using local geometric descriptors computed using local neighborhood, which is determined at multiple scales [1,12]. The combination of feature extraction at optimal scale of local neighborhood and a random forest classifier has been recommended as an appropriate framework for semantic classification in terms of both accuracy and computational efficiency [12]. Upon evaluating multiple scales, the optimal scale is determined at the argmin of Shannon entropy computed from eigenvalues of the covariance matrix representing the local neighborhood at the scale. However, the existing solutions for this compute-intensive combination do not directly scale for large-scale point clouds in data analytic applications, such as, interactive feature visualization and semantic classification. As an example, DALES (Dayton Annotated LiDAR Earth Scan) [11] is one of the largest publicly available annotated point cloud dataset acquired using aerial laser scanning, with ∼0.5 billion 3D points at considerably high point resolution of 50 ppm (points per meter). The dataset spans a region of 10 km2 in the city of Surrey in British Columbia, stored in 40 tiles, with each tile containing 12 million points. That said, existing big data tools and frameworks can be tapped into and re-purposed for addressing this gap. In our previous work, we have proposed an integrated framework of Apache Spark and Cassandra to perform semantic classification using feature extraction from local geometric descriptors using multiscale aggregation of saliency map [10]. Here, we extend the framework for optimal scale feature extraction, and interactive visualization (Fig. 1). Our main contributions
282
S. Singh and J. Sreevalsan-Nair
are in extending framework for: (a) feature extraction from large-scale point clouds at optimal scale, for semantic classification using conventional classifiers, such as random forest classifier; and (b) interactive visualization. Local neighbor search is one of the most compute-intensive steps in feature extraction for point cloud processing. The computational requirements multiply when performing feature extraction across multiple scales by determining an optimal scale based on Shannon entropy [1]. Parallel processing has been exploited for implementing semantic classification has been implemented on large-scale point clouds. The classification of Semantic3D has been done using random forest classifiers using OpenMP [3], and deep learning frameworks, such as Torch [2]. k-nearest neighborhood has been used widely used type of local neighborhood with deep learning methods to optimize the performance of these architectures. For instance, Adam optimizer has been used in RandLA-Net [5], which also performs down-sampling of the point cloud on the GPU. While parallelized and optimized machine learning frameworks can improve computational efficiency, the big data frameworks have been largely used for both dataset management and processing. Apache Spark has been used for extraction of tree crowns from LiDAR point cloud, using spherical neighborhood [7]. Background: Apache Spark is a unified data analytics engine for large-scale data using in-memory processing [13]. Integration of Spark with storage systems, such as key-value stores, e.g., Cassandra [6], provides persistent storage. Both Spark and Cassandra are horizontally scalable, as more distributed systems, or nodes, can be added to the cluster. Spark uses Resilient Distributed Data (RDD) distributed across multiple cluster nodes for loading data in logical partitions across many servers for parallel processing on the cluster nodes. Apache Spark-Cassandra Connector from Datastax1 is used to query Cassandra tables from RDDs, after which the query results are stored in Cassandra. The connector leverages data locality to mitigate the network latency. Apache Spark also integrates complex tools such as MLlib, for machine learning.
2
Methodology
We use an integrated Apache Spark-Cassandra framework for semantic classification of large-scale airborne LiDAR point clouds using optimal scale feature extraction, and interactive visualization. We design an appropriate 3-stage workflow for utilizing this framework effectively. Here, the choice of Apache Spark with Cassandra has been made for: (a) parallelizing and scaling with data as well as nodes, and (b) optimized performance in semantic classification using tools like Spark ML, and interactive visualization. The persistent storage using Cassandra serves two purposes in our case: (a) storage of processed data in compute-intensive interactive applications, e.g., visualization, (b) distributed data management, as, in a multi-partitioned node, only a single partition can be in-memory in Spark at any given time. A partition key is needed for partitioning 1
https://github.com/datastax/spark-cassandra-connector.
A Distributed Optimal-Scale Analytics System for Large-Scale Point Clouds
283
data across nodes, and is computed based on the user-defined strategy on Spark. A hash value of the partition key is needed for inserting and retrieving data and it is computed using a function Partitioner in Apache Spark during the readwrite operations on the cluster. We exploit the feature in Cassandra to store the data in distributed nodes based on the partition key and optimize the search using clustering columns. In our work, we partition 3D data in the x-y plane, assign the partitions a region-ID, use the region-ID as the partition key, and assign the x, y, z variables as the clustering columns. When the Spark executor and Cassandra nodes are deployed on the same machine, the integrated framework processes the region data without incurring any network traffic, owing to the use of Spark-Cassandra Connector. 2.1
Our Proposed Workflow
Our workflow implemented on the integrated framework comprises of the following three stages (Fig. 1): the partition assignment of large-scale point cloud on S1 ], spatial partitioning and feature extraction [S S2 ], and visual the framework [S S3 ]. Within S3 , the framework performs interactive visualization of analytics [S features in the point cloud, S3A , and semantic classification, S3B . Compared to our previous work [10], here, we compute more features, modify S2 to determine the optimal scale, and include S3A . Stage S1 : For initializing the framework, we load the 3D point cloud P into the Apache Spark as an RDD. We normalize all points in P to be contained inside a cube of size 2 units centered at (0,0,0), without altering its aspect ratio. We then partition only along one axis, referred to as the principal axis, to strategize the partition layout with fewer partition boundaries. The partition boundaries pose an overhead of inter-node communication, as, for the points close to the boundaries, their local neighborhoods are split across different partitions, and hence, across different nodes. Thus, fewer partition boundaries are used to reduce the inter-node communication for collating neighborhood information. We select the axis with maximum range as principal axis p, either x or y axes, here. Spatial partitions of P into N contiguous regions along the p axis, have partition boundaries at pi , for i = 0, 1, 2, . . . , N , where N is determined using the maximum scale value, lmax , range of data along p-axis in P (Δp = pmax − pmin ), Δp th and the number of available cluster nodes n. Thus, N = lmax .n , and the i partition boundary pi = pmin + i ∗ n ∗ lmax . A region-ID assigned to each point x in P, serves as the partition key in Spark, K, where the p-coordinate of the point satisfies the boundary condition, p(K−1) ≤ xp < p(K) for K = 1, 2, . . . , N . For each partition, a buffer region is introduced by extending the right and left boundaries to pi ± lmax , respectively. Buffer regions provide complete local neighborhood information for boundary points, and features are extracted for all points except those in buffer regions. We store the resultant RDD in the Cassandra cluster using partition key, K. Stage S2 : We create partitions with the assigned K in S1 using our custom partitioner in the RDD. The custom partitioner enforces our computed partitioning,
284
S. Singh and J. Sreevalsan-Nair
thus, overriding the default random one on Spark. The partition key is crucial for the spatial contiguity in P as it ensures that a partition is contained in a single node without being split across nodes. A single node can still load multiple partitions, and process them in parallel. The feature extraction algorithm consists of four sequential processes implemented for each point, namely, local neighborhood determination, descriptor computation, its eigenvalue decomposition, and feature vector computation. Point-wise processing makes the algorithm embarrassingly data-parallel. Here, we use the cubical neighborhood [8] over the conventional spherical or k-nearest, to reduce computations. Cubical neighborhood uses Chebyshev distance (infinity (L∞ ) or maximum norm) for neighbor-search, instead of Euclidean distance (L2 norm). A spherical neighborhood of radius r is approximated by the cubical neighborhood of l = 2r. Definition 1. l-cubical neighborhood Nl of a point x in P, such that P = {p ∈ Rd }, is a set of points which satisfy the criterion based on Chebyshev distance, Nl (x) = {y ∈ P | max (|xi − yi |)}. {0≤i ∗ < no.of themaximumcontainerpernode >)
(1)
Load Balancing Approach for a MapReduce Job Running
293
=0.95 * (12 * 2) = 22.8 = 1.75 ∗ (< no.of nodes > ∗ < no.of themaximumcontainerpernode >)
(2)
=1.75 * (12 * 2) = 42
Fig. 2. Heterogeneous Hadoop cluster
In the heterogeneous cluster, the thumb rule creates fewer required Reducers when multiplied by the factor 0.95, and creates more required Reducers when multiplied by factor 1.75 [17]. Nodes with higher computing capabilities complete their first round of reduces and launch a second stream of reduces. In contrast, nodes with lower computing capabilities will near-complete their first round of reduces and start a second stream of reduces when the first-round gets complete [8]. This uneven estimation of Reducers for a MapReduce job running on a heterogeneous Hadoop cluster motivates us to correctly estimate the specified number of Reducers for a MapReduce job in a heterogeneous Hadoop cluster [10].
4
Proposed Approach to Estimate Required Number of Reducers for a MapReduce Job Running on a Heterogeneous Hadoop Cluster
The existing rule does not estimate the correct number of required Reducers for a MapReduce job running on a heterogeneous Hadoop cluster. It creates the load imbalance by wasting the computing resources. To improve a MapReduce job’s performance in a heterogeneous Hadoop cluster, we have modified the existing rule to calculate the required number of Reducers. The existing rule uses either 0.95 or 1.75 multiplication factors to estimate the required number of Reducers. The rule either overload the cluster or underutilizes the cluster resources, leading to load imbalance in a heterogeneous Hadoop cluster. We have proposed a new rule by modifying and combining the existing rules. While changing the rule for a heterogeneous cluster, we have assumed that some nodes are faster, and other nodes are slower. The proposed rule uses both Eqs. 3
294
K. L. Bawankule et al.
and 4 with the above assumption to modify the existing rule for calculating the required number of Reducers. Finally, the values obtained from Eqs. 3 and 4 are added to estimate the required number of Reducers for a MapReduce job running on a heterogeneous Hadoop cluster. = 0.95 ∗ (< no.of f asternodes > ∗ < no.of themaximumcontainerpernode >) (3) = 1.75 ∗ (< no.of slowernodes > ∗ < no.of themaximumcontainerpernode >) (4) We will take the same above examples to understand the proposed rule. Suppose a heterogeneous cluster consists of 12 nodes, and each node has a maximum of 2 containers. Using the proposed rule, we assumed some nodes as faster and other nodes as slower. Example 1: We assume that out of 12 nodes, 4 nodes as faster nodes and 8 nodes as slower nodes. So the required number of Reducers for a MapReduce job will be: =0.95 * (4 * 2) = 7.6 =1.75 * (8 * 2) = 28 Number of required Reducers = 7.6 + 28 = 35.6 Example 2: We assume that out of 12 nodes, 6 nodes as faster nodes and 6 nodes as slower nodes. So the required number of Reducers for a MapReduce job will be: =0.95 * (6 * 2) = 11.472 =1.75 * (6 * 2) = 21 Number of required Reducers = 11.472 + 21 = 32.472 For a heterogeneous cluster, the proposed rule estimates an exact number of required Reducers when multiplied by both the factor 0.95 and 1.75. The proposed rule distribute more Reducers to the faster nodes and less number of Reducers to slower nodes. The above Example 1 creates a total of 36 Reducers in a heterogeneous cluster, out of which it will assign 28 Reducers to the faster nodes and 8 Reducers to slower nodes. The above Example 2 creates 32 Reducers in a heterogeneous cluster, out of which it will assign 21 Reducers to the faster nodes and 11 Reducers to slower nodes. Faster nodes complete their first round of reduces and launch a second wave of reduces, whereas all the Reducers scheduled on slower nodes start simultaneously. The new rule helps to complete the execution of all the Reducers at the same time. The proposed rule works well for a MapReduce job running on a heterogeneous Hadoop cluster and balances the MapReduce’s job load in the Reduce phase of MapReduce.
Load Balancing Approach for a MapReduce Job Running
5
295
Performance Evaluation
The section presents the information regarding the experimental environment, the benchmark programs, and the experimental results. 5.1
Experimental Environment
In our experiment, we have tested the performance of the proposed rules by simulating a heterogeneous environment. To simulate the heterogeneous environment, we have used four physical machines. Each physical device has 20 GB of memory, 4 CPUs, and 500 GB of disk. Table 1 provides the detailed specification of all the seven VMs (virtual machines) used to simulate a heterogeneous environment. Except for one physical node where the Master node is running, other physical nodes run two slave nodes. For varying each node’s computing capabilities, we used the Oracle VM VirtualBox 5.2 platform for simulation of a heterogeneous environment. We have used Ubuntu 16.04 LTS operating system VMs and installed a stable version of Hadoop-2.7.2. Table 1. Virtual machines specification SrNo Node
CPU Memory Disk
1
Master 6
10 GB
40 GB
2
Slave1
2
6 GB
30 GB
3
Slave2
4
8 GB
40 GB
4
Slave3
3
4 GB
30 GB
5
Slave4
6
8 GB
40 GB
6
Slave5
3
3 GB
30 GB
7
Slave6
5
10 GB
20 GB
We have evaluated the proposed rule’s performance on a micro benchmark workload and a websearch benchmark such as TeraSort and PageRank of HiBench benchmark suite [7]. These benchmark programs are realistic and widely used to test the performance of Hadoop. We have tested the proposed rule’s performance on data sizes 5 GB, 10 GB, and 20 GB. We have varied data sizes to analyze the proposed rule’s performance in the Reduce phase of MapReduce. In this section, the proposed rule’s performance is measured on the MapReduce jobs running on a heterogeneous Hadoop cluster by performing a series of experiments. For each set of tests, the Reduce tasks execution time is recorded by varying the data sizes. We evaluate the performance of the proposed rule by using the execution time metric:
296
K. L. Bawankule et al.
Execution Time: Total time required from submission of Reduce task until the end of its execution. It can be formalized as follows: ExecutionT ime(ret) = EndT imeof Reducetask(ert) − StartT imeof Reducetask(srt)
(5) To calculate the total time required to execute all the Reduce tasks, we have added the execution time of each Reduce task completed on heterogeneous Hadoop cluster. Total Execution Time: Sum of total turnaround time required to execute all the Reduce tasks of a MapReduce job in a heterogeneous cluster. T otalExecutionT ime(tet) = ret(rt1) + ret(rt2) + ... + ret(rtn) 5.2
(6)
Results
Fig. 3. Total execution time of a TeraSort application
We have performed a series of experiments by varying the number of Reducers and data sizes to measure the total execution time required to execute all the Reduce tasks. In the first iteration, the total execution time of TeraSort and PageRank is recorded for different data sizes. Figure 3 and 4 shows the total execution time required to execute all the Reduce tasks for TeraSort and PageRank application on 5 GB, 10 GB, and 20 GB data. In Fig. 3 and 4, we have compared the results of the existing rule with the proposed rule using total execution time. For the current rule of 0.95, it creates 23 Reducers, while using the 1.75 factor, it creates 42 Reducers. Subsequently, by using the proposed rule, it creates 29 Reducers. For the TeraSort application, we improved the execution time by nearly 18%, and for the PageRank application, we improved the execution time by nearly 28%. Based on the test results, regardless of the data sizes, the proposed rule generates the exact number of Reducers. It balances the heterogeneous Hadoop cluster’s workload and improves the MapReduce Reduce phase’s total execution time.
Load Balancing Approach for a MapReduce Job Running
297
Fig. 4. Total execution time of a PageRank application
6
Conclusion and Future Work
In a heterogeneous Hadoop cluster, load imbalance in the Reduce phase of MapReduce delays the job execution. The current rule is not proficient in calculating the required number of Reducers for a heterogeneous Hadoop cluster. We have proposed a new rule by modifying the existing rule to balance a load of a MapReduce program in a heterogeneous cluster. It calculates the exact number of required Reducers for a MapReduce program running on a heterogeneous Hadoop cluster. Experimental results prove that the proposed rule balances the load and improves the execution time of all the Reducers of a MapReduce job running on a heterogeneous Hadoop cluster compared to the existing rule on all data sizes and minimizes computing resources’ overutilization. Our future research will focus on scheduling more Reduce tasks on faster processing nodes and fewer Reduce tasks on the slower processing nodes in a heterogeneous cluster.
References 1. Ahmad, F., Chakradhar, S.T., Raghunathan, A., Vijaykumar, T.: Tarazu: optimizing mapreduce on heterogeneous clusters. In: ACM SIGARCH Computer Architecture News, vol. 40, pp. 61–74. ACM (2012) 2. Anjos, J.C., Carrera, I., Kolberg, W., Tibola, A.L., Arantes, L.B., Geyer, C.R.: Mra++: scheduling and data placement on mapreduce for heterogeneous environments. Future Gen. Comput. Syst. 42, 22–35 (2015) 3. Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008) 4. Gandhi, R., Xie, D., Hu, Y.C.: {PIKACHU}: how to rebalance load in optimizing mapreduce on heterogeneous clusters. In: 2013 {USENIX} Annual Technical Conference ({USENIX}{ATC} 13), pp. 61–66 (2013) 5. Ghemawat, S., Gobioff, H., Leung, S.T.: The google file system. In: Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, pp. 29–43 (2003) 6. Hou, X., Thomas, J.P., Varadharajan, V.: Dynamic workload balancing for Hadoop mapreduce. In: Proceedings of the 2014 IEEE Fourth International Conference on Big Data and Cloud Computing, pp. 56–62 (2014)
298
K. L. Bawankule et al.
7. Huang, S., Huang, J., Dai, J., Xie, T., Huang, B.: The hibench benchmark suite: characterization of the mapreduce-based data analysis. In: 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010), pp. 41–51. IEEE (2010) 8. Kwon, Y., Balazinska, M., Howe, B., Rolia, J.: Skewtune: mitigating skew in mapreduce applications. In: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pp. 25–36. ACM (2012) 9. Lee, C.W., Hsieh, K.Y., Hsieh, S.Y., Hsiao, H.C.: A dynamic data placement strategy for Hadoop in heterogeneous environments. Big Data Res. 1, 14–22 (2014) 10. Liu, Z., Liu, Y., Wang, B., Gong, Z.: A novel run-time load balancing method for mapreduce. In: 2015 4th International Conference on Computer Science and Network Technology (ICCSNT), vol. 1, pp. 150–154. IEEE (2015) 11. Lu, W., Chen, L., Yuan, H., Wang, L., Xing, W., Yang, Y.: Improving mapreduce performance by using a new partitioner in yarn. In: The 23rd International Conference on Distributed Multimedia Systems, Visual Languages and Sentient Systems, pp. 24–33 (2017) 12. Naik, N.S., Negi, A., BR, T.B., Anitha, R.: A data locality based scheduler to enhance mapreduce performance in heterogeneous environments. Future Gen. Comput. Syst. 90, 423–434 (2019) 13. Nghiem, P.P., Figueira, S.M.: Towards efficient resource provisioning in mapreduce. J. Parallel Distrib. Comput. 95, 29–41 (2016) 14. Paravastu, R., Scarlat, R., Chandrasekaran, B.: Adaptive load balancing in mapreduce using flubber. Duke University Project Report (2012) 15. Shvachko, K., Kuang, H., Radia, S., Chansler, R., et al.: The Hadoop distributed file system. MSST. 10, 1–10 (2010) 16. White, T.: Hadoop: The Definitive Guide. O’Reilly Media, Inc., Massachusetts (2012) 17. Yan, W., Li, C., Du, S., Mao, X.: An optimization algorithm for heterogeneous Hadoop clusters based on dynamic load balancing. In: 2016 17th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), pp. 250–255. IEEE (2016) 18. Zaharia, M., Konwinski, A., Joseph, A.D., Katz, R.H., Stoica, I.: Improving mapreduce performance in heterogeneous environments. In: Osdi, vol. 8, p. 7 (2008)
Study the Significance of ML-ELM Using Combined PageRank and Content-Based Feature Selection Rajendra Kumar Roul1(B)
and Jajati Keshari Sahoo2
1
Department of Computer Science and Engineering, Thapar Institute of Engineering and Technology, Patiala 147004, Punjab, India [email protected] 2 Department of Mathematics, BITS-Pilani, K.K.Birla Goa Campus, Zuarinagar 403726, Goa, India [email protected] Abstract. Scalable big data analysis frameworks are of paramount importance in the modern web society, which is characterized by a huge number of resources, including electronic text documents. Hence, choosing an adequate subset of features that provide a complete representation of the document while discarding the irrelevant one is of utmost importance. Aiming in this direction, this paper studies the suitability and importance of a deep learning classifier called Multilayer ELM (ML-ELM) by proposing a combined PageRank and content-based feature selection (CPRCFS) technique on all the terms present in a given corpus. Top k% terms are selected to generate a reduced feature vector which is then used to train different classifiers including ML-ELM. Experimental results show that the proposed feature selection technique is better or comparable with the baseline techniques and the performance of Multilayer ELM can outperform state-of-thearts machine and deep learning classifiers. Keywords: Classification · Deep learning · Feature selection · Machine learning · Multilayer ELM · PageRank
1 Introduction Modern web society is characterized by a huge number of resources, including electronic text documents, and a large number of end users as active participants. In order to personalize vast amounts of information to the needs of each individual user, systematic ways to organize data and retrieve useful information in real time are necessary. Document classification is a machine learning technique which can handle a huge volume of data very efficiently by categorizing the unseen test data into their respective groups. Different research on classification methods can be found in [1–4]. The problem with classification technique is that visualizing the data becomes more difficult when the dataset size increases, and it needs a large storage size. In order to handle such problem, feature selection technique is essential which selects the important features from a large corpus and removes the unnecessary features during the model construction [5–7]. A good feature selection technique is not enough to make the classification process faster. Along with the feature selection technique, choice of a good classifier also affects the c Springer Nature Switzerland AG 2021 D. Goswami and T. A. Hoang (Eds.): ICDCIT 2021, LNCS 12582, pp. 299–307, 2021. https://doi.org/10.1007/978-3-030-65621-8_20
300
R. K. Roul and J. K. Sahoo
entire classification process. There are many traditional machine and deep learning classifiers exist, but they face many limitations such as need a high amount of training time, require more parameters, unable to manage a huge volume of data etc. Multilayer ELM [8], a deep learning network architecture is one of the most popular classifiers among them because of its good characteristics such as being able to manage a huge volume of data, no backpropagation, faster learning speed, maximum level of data abstraction etc. Keeping all these things in mind, a novel feature selection technique named as CPRCFS along with the Multilayer ELM classifier is introduced for text classification, which is the main motivation of this paper. The novelty of this feature selection technique is that it is very simple to understand, easy to implement, and comparable to the conventional feature selection technique. Although much work has been done in text classification, the realm of Multilayer ELM provides a relatively unexplored pool of opportunities. The major contributions of the proposed approach are as follows: i. A novel feature selection technique (CPRCFS) is introduced by combining the PageRank mechanism with the content-based similarity measures to select the important features from the corpus which improves the performance of the classification technique. ii. The proposed CPRCFS technique is compared with the conventional feature selection technique to measure its efficacy. iii. The performance of Multilayer ELM using the proposed CPRCFS technique is compared with state-of-the-arts machine and deep learning classifiers to justify its competency and effectiveness. Rest of the paper is as follows: Sect. 2 highlights the methods and materials used for the proposed approach. The methodology of the proposed work is discussed in Sect. 3. Section 4 discusses the experimental work and finally, Sect. 5 concludes the work with some future enhancements.
2 Materials Used 2.1
Multilayer ELM
Multilayer ELM (ML-ELM) is an artificial neural network having multiple hidden layers [8] and it is shown in Fig. 1. Equations 1, 2 and 3 are used for computing β (output weight vector) in ELM Autoencoder where H represents the hidden layer, X is the input layer, n and L are number of nodes in the input and hidden layer respectively. i. n = L ii. n < L
β = H −1 X −1 I + HH T ) X C −1 I T +H H β= HT X C
β = HT (
iii. n > L
(1) (2)
(3)
Here, CI generalize the performance of ELM [9] and is known as regularization parameter.
Study the Significance of ML-ELM Using Combined PageRank
301
Fig. 1. Architecture of Multilayer ELM
Multilayer ELM uses Eq. 4 to map the features to higher dimensional space and thereby makes them linearly separable [9]. ⎤T ⎡ ⎤T g(w1 , b1 , x) h1 (x) ⎢ g(w2 , b2 , x) ⎥ ⎢ h2 (x) ⎥ ⎥ ⎥ ⎢ ⎢ ⎥ ⎢ ⎢ . ⎥ . ⎥ ⎥ ⎢ ⎢ h(x) = ⎢ ⎥ =⎢ ⎥ . . ⎥ ⎥ ⎢ ⎢ ⎦ ⎣ ⎣ . ⎦ . hL (x) g(wL , bL , x) ⎡
(4)
T
where, hi (x) = g(wi .x + bi ). h(x) = [h1 (x)...hi (x)...hL (x)] can directly use for feature mapping [10, 11]. wi is the weight vector between the input nodes and the hidden nodes and bi is the bias of the ith hidden node, and x is the input feature vector with g as the activation function. ML-ELM takes the benefit of the ELM feature mapping [9] extensively as shown in Eq. 5. lim ||y(x) − yL (x)|| = lim ||y(x) −
L→+∞
L→+∞
L
βi hi (x)|| = 0
(5)
i=1
Using Eq. 6, Multilayer ELM transfers the data between the hidden layers. Hi = g((βi )T Hi-1 )
(6)
302
R. K. Roul and J. K. Sahoo
3 Proposed Methodology The following steps are used to select the important features from a given corpus to generate the training feature vector. 1. Document Pre-processing: Consider a corpus P having a set of classes C = {c1 , c2 , · · · , cn }. At the beginning, the documents of all the classes of P are merged into a large document set called Dlarge = {d1 , d2 , · · · , db }. Then all documents of Dlarge of P are pre-processed using lexical-analysis, stop-word elimination, removal of HTML tags, stemming1 , and then index terms are extracted using Natural Language Toolkit2 . 2. Calculation of content-based similarities: The content-based similarities between every pair of terms (tp and tq ) of the corpus P are calculated using the following four conventional techniques. i. Dice Coefficient is calculated as follows: dice(tp , tq ) =
2(tp .tq ) ||tp ||2 + ||tq ||2
ii. Extended Jaccard Coefficient is computed as follows: e-jac(tp , tq ) =
tp .tq ||tp ||2 + ||tq ||2 − (tp .tq )
iii. Cosine Similarity is a similarity measure and computed as follows: cosine(tp , tq ) =
tp .tq ||tp || ∗ ||tq ||
iv. Euclidean Distance is computed as follows:
n eucl(tp , tq ) = (ap − bq )2 p=q=1
where tp = {a1 , a2 , · · · , an } and tq = {b1 , b2 , · · · , bn } are the co-ordinates of tp and tq respectively. The average similarity between every pair of terms tp and tq of P are calculated using the Eq. 7. Avgsim (tp , tq ) =
dice(tp , tq ) + e-jac(tp , tq ) + cosine(tp , tq ) + eucl(tp , tq ) 4 (7)
3. Graph Creation: An undirected graph is created where each term represents a node and the average similarity value (Avgsim (tp , tq )) is the edge between two terms tp and tq of the corpus P . Now all the term-pairs (i.e., edges) are arranged in the descending order of their average similarity values. Of this, the top n% similarity values are selected. If a term-pair (both inclusive) is not present in the top n% similarity values then there will be no edge between that term-pair. 1 2
https://pythonprogramming.net/lemmatizing-nltk-tutorial/. https://www.nltk.org/.
Study the Significance of ML-ELM Using Combined PageRank
303
4. PageRank calculation: PageRank [12] is an algorithm used to determine the relevance or importance of a web page and rank it in the search results. The rank value indicates the importance of a particular page. We have modified the existing PageRank algorithm so that it can be used on terms where each term is assumed to be a web page in the real PageRank algorithm. Assume that term T has incoming edges from terms T1 , T2 , T3 , · · · , Tn . The PageRank of a term T is given by Eq. 8 where ‘d’ represents the damping factor and its value lies between 0 and 1 and C(T ) represents the number of edges that leaving out from the term T . The PageRanks form a probability distribution over the terms, hence the sum of the PageRank of all the terms P R(T ) will be one. P R(T ) P R(Tn ) 1 + ··· + (8) P R(T ) = (1 − d) + d C(T1 ) C(Tn ) If there are a lot of terms that link to the term T then there is a common belief that the term T is an important term. The importance of the terms linking to T is also taken into account, and using PageRank terminology, it can be said that terms T1 , T2 , T3 , · · · , Tn transfer their importance to T , albeit in a weighted manner. Thus, it is possible to iteratively assign a rank to each term, based on the ranks of the terms that point to it. In the proposed approach, the PageRank algorithm is run on the graph created in the previous step, with damping factor d = 0.5. 5. Generating the training feature vector: All the terms of the corpus P are sorted in descending order based on their PageRank score and out of these, top k% terms are selected. The TF-IDF values of these terms will act as the values of the features for the documents. Finally, all these top k% terms are merged into a new list Lnew , which constitute the training feature vector for the classifiers. 6. Performance evaluation With the known class labels, the unknown documents are tested against each classifier to predict their performances.
4 Experimental Framework 4.1 Dataset Used To conduct the experiment, two benchmark datasets ( 20-Newsgroups3 and Classic44 ) are used. The details about these two datasets are shown in the Table 1. 4.2 Experimental Setup Details Parameter settings of different deep learning classifiers are depicted in the Table 2. Approximate sizes of feature vectors used for ELM and Multilayer ELM is shown in the Table 3, where ‘ILN’ represents input layer nodes and ‘HLN’ indicates hidden layer nodes. The values of the parameters for these classifiers are decided based on the best results that are obtained by experiment. 3 4
http://qwone.com/∼jason/20Newsgroups/. http://www.dataminingresearch.com/index.php/2010/09/classic3-classic4-datasets/.
304
R. K. Roul and J. K. Sahoo Table 1. Details of the datasets Dataset
No. of training docs No. of testing docs No. of terms used for training
20-NG
11293
7528
32270
4257
2838
15971
Classic4
Table 2. Parameters used for deep learning classifiers Classifiers No. of hidden layers
Activation function Dropout Optimizers Epochs
ANN
3
Sigmoid
CNN
3
Relu
0.2
SGD
500
RNN
3
Tanh
0.2
SGD
400
0.1
SGD
450
0.1
ML-ELM 20-NG = 4, Classic4 = 3 Sigmoid
SGD
500
Table 3. Approximate size of the feature vectors of ELM and ML-ELM Dataset Terms used for ILN training (Top 1%)
HLN (Top 1%)
ILN (Top 5%)
HLN (Top 5%)
ILN HLN (Top 10%) (Top 10%)
20-NG
32270
320
350
1610
1630
3230
3250
Classic4 15971
160
195
800
870
1600
1650
Table 5. Top 5% (20-NG)
Table 4. Top 1% (20-NG) IG
Chi-square BNS
CPRCFS
MI
IG
Chi-square BNS
CPRCFS
ML Classifier
MI
SVC (linear)
0.89122 0.87161 0.87362
0.88947 0.87515
0.93460 0.92817 0.92815
0.93015 0.93454
SVM (linear)
0.89234 0.88967 0.89151
0.89444 0.89428
0.94377 0.93461 0.93729
0.93375 0.95509
Gradient Boosting 0.85583 0.83784 0.84084
0.86063 0.85224
0.89954 0.88561 0.89489
0.89591 0.87908
Decision Trees
0.88227 0.86307 0.86992
0.87514 0.87782
0.93516 0.88816 0.90252
0.92878 0.93314
NB(Multinomial) 0.86303 0.83860 0.83796
0.85794 0.88662
0.93122 0.90103 0.91607
0.92511 0.92191
Adaboost
0.87322 0.88314 0.88381
0.88472 0.87335
0.89075 0.88366 0.86261
0.87112 0.87184
Random Forest
0.85993 0.85847 0.85902
0.84243 0.85584
0.85997 0.86254 0.85813
0.85763 0.85618
Extra Trees
0.89672 0.87606 0.88643
0.88235 0.88762
0.90426 0.88001 0.90223
0.89423 0.89717
ELM
0.89242 0.87041 0.87542
0.89575 0.88420
0.93551 0.92673 0.93012
0.93682 0.92331
MLELM
0.91703 0.91237 0.90511
0.91668 0.93802
0.93873 0.94882 0.94664
0.95584 0.96746
Study the Significance of ML-ELM Using Combined PageRank Table 7. Top 1% (Classic4)
Table 6. Top 10% (20-NG) IG
Chi-square BNS
305
CPRCFS
MI
IG
Chi-square BNS
CPRCFS
ML Classifier
MI
SVC (linear)
0.94742 0.93732 0.93648
0.94377 0.94921
0.91491 0.88286 0.89942
0.91784 0.90147
SVM (linear)
0.94286 0.94559 0.93647
0.94654 0.94536
0.92632 0.90663 0.92185
0.92799 0.91670
Gradient boosting 0.90147 0.89582 0.89861
0.90510 0.89582
0.90966 0.83336 0.88875
0.88188 0.88345
Decision trees
0.93996 0.91012 0.93353
0.93992 0.91133
0.83281 0.79881 0.87911
0.86601 0.83021
NB (Multinomial) 0.93822 0.92344 0.93272
0.93732 0.91937
0.84197 0.76852 0.80705
0.85317 0.88163
Adaboost
0.88266 0.86261 0.86341
0.87253 0.87686
0.89162 0.88094 0.88437
0.88986 0.88393
Random Forest
0.86372 0.84923 0.86605
0.85911 0.85641
0.86046 0.84001 0.85312
0.86365 0.85874
Extra Forest
0.89293 0.89242 0.89472
0.89273 0.90572
0.91773 0.89212 0.91534
0.91801 0.88714
ELM
0.93228 0.94051 0.93443
0.94672 0.92886
0.90286 0.88143 0.89944
0.92382 0.90182
MLELM
0.95636 0.96882 0.95567
0.95814 0.96928
0.94651 0.92222 0.91885
0.94563 0.95707
4.3 Comparison of the Performance ML-ELM Using CPRCFS with other machine learning classifiers The proposed CPRCFS is compared with the traditional feature selection techniques such as mutual information (MI), Information Gain (IG), Chi-square, and Bi-normal separation (BNS) (shown in Tables 4, 5, 6, 7, 8, 9) where a bold result indicates the maximum F-measure achieved by a feature selection technique. From the results, it can be concluded that Multilayer ELM can outperform the conventional machine learning classifiers. Table 8. Top 5% (Classic4) IG
Chi-square BNS
Table 9. Top 10% (Classic4)
ML Classifier
MI
SVC (linear)
0.94596 0.92333 0.94105
0.94302 0.95655
CPRCFS
MI
0.94547 0.92125 0.92338
IG
Chi-square BNS
0.96869 0.96551
CPRCFS
SVM (linear)
0.96511 0.93916 0.94052
0.94794 0.96021
0.94542 0.92652 0.92758
0.97287 0.96768
Gradient boosting 0.93412 0.92598 0.92952
0.93951 0.93278
0.94021 0.93594 0.93951
0.94238 0.93991
Decision trees
0.92632 0.90763 0.91492
0.90218 0.90987
0.92031 0.90688 0.91548
0.91054 0.89722
NB (Multinomial) 0.93814 0.92327 0.93095
0.94525 0.94667
0.95097 0.94561 0.94778
0.95352 0.96918
Adaboost
0.87345 0.88184 0.84973
0.85482 0.84585
0.85482 0.85482 0.85482
0.84587 0.84587
Random Forest
0.84925 0.84846 0.84887
0.85559 0.84849
0.85072 0.84735 0.85205
0.84573 0.85087
Extra Trees
0.91893 0.91891 0.91656
0.91921 0.89554
0.91525 0.92688 0.91851
0.91436 0.91675
ELM
0.94532 0.92522 0.94676
0.94172 0.94578
0.92333 0.94672 0.95634
0.96948 0.95633
MLELM
0.96668 0.94888 0.96253
0.96548 0.96541
0.97106 0.95877 0.96886
0.97944 0.98077
4.4 Comparison of the Performance ML-ELM Using CPRCFS with other deep learning classifiers The existing deep learning techniques have high computational cost, require massive training data and training time. The F-measure and accuracy comparisons of Multilayer ELM along with other deep learning classifiers such as Recurrent Neural Network
306
R. K. Roul and J. K. Sahoo
(RNN), Convolution Neural Network (CNN), and Artificial Neural Network (ANN) using CPRCFS technique on top 10% features are depicted in Figs. 2 and 3 respectively. Results show that the performance of Multi-layer ELM outperforms existing deep learning classifiers. Reason behind the good performance of ML-ELM is that, using the universal approximation [13] and classification capabilities [14] of ELM, it can able to map a huge volume of features from a low to high dimensional feature space and can separate those features linearly in higher dimensional space with less cost.
Fig. 2. Deep learning (F-measure)
Fig. 3. Deep learning (Accuracy)
5 Conclusion A novel feature selection technique named as CPRCFS is introduced in this paper which combined the ranking ability of PageRank with four similarity measures to select the relevant features from a corpus of documents. By using the similarity measures, top n% of term-pairs are selected. A graph is created based on this selected term-pair set, and the PageRank algorithm is run on it which gives the most relevant features. Performance of different machine and deep learning classifiers using this relevant features are measured. Empirical results indicate that the proposed CPRCFS is comparable with the conventional techniques and the performance of Multilayer ELM outperformed the state-of-the-arts machine and deep learning classifiers. This work can be extended by running the machine learning classifiers on the feature space of Multilayer ELM which can further enhance the classification results.
References 1. Du, J., Vong, C.-M., Chen, C. P.: Novel efficient RNN and LSTM-like architectures: Recurrent and gated broad learning systems and their applications for text classification. IEEE Trans. Cybern. (2020) 2. Sambasivan, R., Das, S.: Classification and regression using augmented trees. Int. J. Data Sci. Anal. 7(4), 259–276 (2019) 3. Joseph, S.I.T., Sasikala, J., Juliet, D.S.: A novel vessel detection and classification algorithm using a deep learning neural network model with morphological processing (m-dlnn). Soft Comput. 23(8), 2693–2700 (2019) 4. Roul, R.K., Asthana, S.R., Kumar, G.: Study on suitability and importance of multilayer extreme learning machine for classification of text data. Soft Comput. 21(15), 4239–4256 (2017)
Study the Significance of ML-ELM Using Combined PageRank
307
5. Sayed, G.I., Hassanien, A.E., Azar, A.T.: Feature selection via a novel chaotic crow search algorithm. Neural Comput. Appl. 31(1), 171–188 (2019) 6. Tsai, C.-J.: New feature selection and voting scheme to improve classification accuracy. Soft Comput. 23(15), 1–14 (2019) 7. Roul, R.K., Rai, P.: A new feature selection technique combined with elm feature space for text classification. In: Proceedings of the 13th International Conference on Natural Language Processing, pp. 285–292 (2016) 8. Kasun, L.L.C., Zhou, H., Huang, G.-B., Vong, C.M.: Representational learning with extreme learning machine for big data. IEEE Intell. Syst. 28(6), 31–34 (2013) 9. Huang, G.-B., Zhou, H., Ding, X., Zhang, R.: Extreme learning machine for regression and multiclass classification. IEEE Trans. Syst. Man Cybern. Part B (Cybern.) 42(2), 513–529 (2012) 10. Huang, G.-B., Chen, L., Siew, C.K., et al.: Universal approximation using incremental constructive feedforward networks with random hidden nodes. IEEE Trans. Neural Netw. 17(4), 879–892 (2006) 11. Roul, R.K.: Suitability and importance of deep learning feature space in the domain of text categorisation. Int. J. Comput. Intell. Stud. 8(1–2), 73–102 (2019) 12. Page, L., Brin, S., Motwani, R., Winograd, T.: The pagerank citation ranking: Bringing order to the web. Tech. Rep, Stanford InfoLab (1999) 13. Huang, G.-B., Zhou, H., Ding, X., Zhang, R.: ”Extreme learning machine for regression and multiclass classification. IEEE Trans. Syst. Man Part B Cybern.(Cybern.) 42(2), 513–529 (2011) 14. Huang, G.-B., Chen, Y.-Q., Babri, H.A.: Classification ability of single hidden layer feedforward neural networks. IEEE Trans. Neural Netw. 11(3), 799–801 (2000)
Author Index
Bal, Ananya 253 Basappa, Manjanna 114 Bawankule, Kamalakant Laxman Bhattacharyya, Dhruba K. 203 Bhavani, S. Durga 223 Biswas, Girish 160 Borah, Parthajit 203 Das, Meenakshi 253 Das, Rajib K. 65 Das, Satyabrata 81, 98 Das, Subha Kanta 253 Dewang, Rupesh Kumar Dinesha, K. V. 129
289 Naik, Nenavath Srinivas 238 Narwa, Bipin Sai 271 Nayak, Sanjib Kumar 81 Nguyen, Le-Minh 44 Panda, Sanjaya Kumar 81, 98 Panda, Subhrakanta 271 Pande, Sohan Kumar 81, 98 Pattanayak, Debasish 188
289
Gaikwad, Ajinkya 175 Gogolla, Martin 24 Jena, Madhusmita
Mitra, Debarati 188 Mukherjee, Nandini 160
253
Rajesh Kumar, R. 129 Rajita, B. S. A. S. 271 Rajora, Maihul 238 Rani, T. Sobha 223 Rathod, Mansi 238 Roul, Rajendra Kumar 299
Kalita, J. K. 203 Karmakar, Kamalesh 65 Kasyap, Harsh 145 Katiyar, Swati 223 Khatua, Sunirmal 65 Kshemkalyani, Ajay D. 3 Kulkarni, Parth Parag 145
Saha, Dibakar 188 Sahoo, Jajati Keshari 299 Satapathy, Shashank Mouli 253 Shanbhag, Vivek 129 Singh, Anil Kumar 289 Singh, Satendra 280 Sreevalsan-Nair, Jaya 280
Maity, Soumen 175 Mandal, Partha Sarathi 188 Mishra, Sudeepta 114 Misra, Anshuman 3
Tarafdar, Anurina 65 Tran, Xuan-Chien 44 Tripathi, Shuvam Kant 175 Tripathy, Somanath 145