Lecture Notes in Networks and Systems 309
Irfan Awan Salima Benbernou Muhammad Younas Markus Aleksy Editors
The International Conference on Deep Learning, Big Data and Blockchain (Deep-BDB 2021)
Lecture Notes in Networks and Systems Volume 309
Series Editor
Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland

Advisory Editors
Fernando Gomide, Department of Computer Engineering and Automation—DCA, School of Electrical and Computer Engineering—FEEC, University of Campinas—UNICAMP, São Paulo, Brazil
Okyay Kaynak, Department of Electrical and Electronic Engineering, Bogazici University, Istanbul, Turkey
Derong Liu, Department of Electrical and Computer Engineering, University of Illinois at Chicago, Chicago, USA; Institute of Automation, Chinese Academy of Sciences, Beijing, China
Witold Pedrycz, Department of Electrical and Computer Engineering, University of Alberta, Alberta, Canada; Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
Marios M. Polycarpou, Department of Electrical and Computer Engineering, KIOS Research Center for Intelligent Systems and Networks, University of Cyprus, Nicosia, Cyprus
Imre J. Rudas, Óbuda University, Budapest, Hungary
Jun Wang, Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong
The series “Lecture Notes in Networks and Systems” publishes the latest developments in Networks and Systems—quickly, informally and with high quality. Original research reported in proceedings and post-proceedings represents the core of LNNS. Volumes published in LNNS embrace all aspects and subfields of, as well as new challenges in, Networks and Systems. The series contains proceedings and edited volumes in systems and networks, spanning the areas of Cyber-Physical Systems, Autonomous Systems, Sensor Networks, Control Systems, Energy Systems, Automotive Systems, Biological Systems, Vehicular Networking and Connected Vehicles, Aerospace Systems, Automation, Manufacturing, Smart Grids, Nonlinear Systems, Power Systems, Robotics, Social Systems, Economic Systems and other. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution and exposure which enable both a wide and rapid dissemination of research output. The series covers the theory, applications, and perspectives on the state of the art and future developments relevant to systems and networks, decision making, control, complex processes and related areas, as embedded in the fields of interdisciplinary and applied sciences, engineering, computer science, physics, economics, social, and life sciences, as well as the paradigms and methodologies behind them. Indexed by SCOPUS, INSPEC, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science.
More information about this series at http://www.springer.com/series/15179
Irfan Awan · Salima Benbernou · Muhammad Younas · Markus Aleksy
Editors

The International Conference on Deep Learning, Big Data and Blockchain (Deep-BDB 2021)
Editors

Irfan Awan, Department of Computer Science, Faculty of Engineering and Informatics, University of Bradford, Bradford, UK
Salima Benbernou, Université de Paris, Paris, France
Muhammad Younas, School of Engineering, Computing and Mathematics, Oxford Brookes University, Oxford, UK
Markus Aleksy, Ludwigshafen, Rheinland-Pfalz, Germany
ISSN 2367-3370 ISSN 2367-3389 (electronic)
Lecture Notes in Networks and Systems
ISBN 978-3-030-84336-6 ISBN 978-3-030-84337-3 (eBook)
https://doi.org/10.1007/978-3-030-84337-3

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2022

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.
Preface
Welcome to the 2nd International Conference on Deep Learning, Big Data and Blockchain (Deep-BDB 2021), which is being held virtually (online). The conference was planned to be held on-site in Rome, Italy. Unfortunately, due to the ongoing pandemic, it was not possible to organize the conference on-site. Virtual conferences provide flexibility to participants, as they can participate online from anywhere. On-site (physical) conferences provide a conventional environment and face-to-face interaction among participants. Whether it is a virtual or an on-site (physical) conference, it demands a significant amount of time and commitment from the conference organizing and programme committees. The committees have worked very hard in the organization of the conference. We hope that participants will find the conference useful and will have an opportunity to share and exchange ideas on different topics related to the theme of the conference. The Deep-BDB conference includes innovative and timely topics in the areas of deep/machine learning, big data and blockchain, which provide new tools, methods and techniques that are used in various disciplines and application areas. For instance, deep learning and big data tools and techniques have been used by businesses in order to analyse large volumes of data and discover trends and patterns in the data for decision-making purposes. Similarly, blockchain technology has been employed in various applications in order to ensure transparency and security of data and (financial) transactions. The Deep-BDB conference programme committee has put enormous effort into creating an interesting technical programme. The conference aims to provide a forum where participants can present, discuss and provide constructive feedback on different aspects of deep learning, big data and blockchain. Though the current pandemic has affected the number of submissions, Deep-BDB still attracted a number of good-quality papers from different countries. The conference followed a rigorous review procedure. All submitted papers were reviewed by members of the technical programme committee. Based on the reviews, 13 papers were accepted for the conference. This gives an acceptance rate of 36% of the total submissions. The accepted papers present interesting
work on different topics, such as clustering of time series data, detection of anomalies in data, blockchain frameworks and technologies, security and scalability in blockchain, recommender systems, and so on. The papers also include promising work on practical applications such as industrial process plants, commercial websites and gas price prediction. We sincerely thank all the members of the programme committee who have spent their valuable time in reviewing papers and providing useful feedback to the authors. We are also thankful to all the authors for their contributions to the conference. We are grateful to the conference: General Chair, Dr. Markus Aleksy; Local Organizing Co-chairs, Dr. Flora Amoto and Dr. Francesco Piccialli; Workshop Coordinator, Dr. Filipe Portela; Publicity Chair, Dr. Mourad Ouziri; and Journal Special Issue Coordinator, Dr. Satish Narayana. We sincerely thank the Springer team for their time and support, which they provided throughout the production of the conference proceedings.

August 2021
Irfan Awan Salima Benbernou Muhammad Younas Markus Aleksy
Organization
Deep-BDB 2021 Organizing Committee

General Chair
Markus Aleksy (ABB, Germany)

Programme Co-chairs
Irfan Awan (University of Bradford, UK)
Salima Benbernou (Paris Descartes University, France)

Local Organizing Co-chairs
Flora Amoto (University of Naples "Federico II", Italy)
Francesco Piccialli (University of Naples "Federico II", Italy)

Publication Chair
Muhammad Younas (Oxford Brookes University, UK)

Journal Special Issue Coordinator
Satish Narayana (University of Tartu, Estonia)

Workshop Coordinator
Filipe Portela (University of Minho, Portugal)

Publicity Chair
Mourad Ouziri (Paris Descartes University, France)
Programme Committee
Abdelhakim Hafid (University of Montreal, Canada)
Abdelmounaam Rezgui (Illinois State University, USA)
Afonso Ferreira (CNRS, Toulouse, France)
Ahmad Javaid (The University of Toledo, USA)
Ahmed Zouinkhi (National Engineering School of Gabes, Tunisia)
Aida Kamisalic (University of Maribor, Slovenia)
Akiyo Nadamoto (Konan University, Japan)
Allaoua Chaoui (University Mentouri Constantine, Algeria)
Allel Hadjali (ENSMA, France)
Amin Beheshti (Macquarie University, Australia)
Armin Lawi (Hasanuddin University, Indonesia)
Angelika Kedzierska-Szczepaniak (University of Gdansk, Poland)
Antonio Dourado (University of Coimbra, Portugal)
Athena Vakali (Aristotle University of Thessaloniki, Greece)
Ayman Alahmar (Lakehead University, Canada)
Bruno Veloso (INESC Technology and Science, Porto, Portugal)
Daniele Apiletti (Polytechnic University of Turin, Italy)
Dhanya Jothimani (Ryerson University, Canada)
Dimka Karastoyanova (University of Groningen, The Netherlands)
Domenico Talia (University of Calabria, Italy)
Emad Mohammed (University of Calgary, Canada)
Fahimeh Farahnakian (University of Turku, Finland)
Fanny Klett (German Workforce ADL Partnership Laboratory, Germany)
George Pallis (University of Cyprus, Cyprus)
Giovanna Castellano (University of Bari, Italy)
Giuseppe Di Modica (University of Catania, Italy)
Hiroaki Higaki (Tokyo Denki University, Japan)
Imen Khamassi (University of Tunis, Tunisia)
Jims Marchang (Sheffield Hallam University, UK)
Jorge Bernardino (Polytechnic Institute of Coimbra - ISEC, Portugal)
Joshua Ellul (University of Malta, Malta)
Kazuaki Tanaka (Kyushu Institute of Technology, Japan)
Khouloud Boukadi (University of Sfax, Tunisia)
Lei Zhang (East China Normal University, China)
Mahardhika Pratama (Nanyang Technological University, Singapore)
Marc Jansen (University of Applied Sciences Ruhr West, Germany)
Mohamed Boukhebouze (CETIC Research Center, Belgium)
Mourad Khayati (University of Fribourg, Switzerland)
Muhammad Rizwan Asghar (The University of Auckland, New Zealand)
Mustapha Lebbah (Université Paris Nord, France)
Mu-Yen Chen (National Cheng Kung University, Taiwan)
Nizar Bouguila (Concordia University, Canada)
Orazio Tomarchio (University of Catania, Italy)
Oshani Seneviratne (Rensselaer Polytechnic Institute, USA)
Pino Caballero-Gil (DEIOC, University of La Laguna, Spain)
Rabiah Ahmad (Universiti Teknikal Malaysia, Malaysia)
Rachid Benlamri (Lakehead University, Canada)
Radwa El Shawi (University of Tartu, Estonia)
Rafael Santos (Brazilian National Institute for Space Research, Brazil)
Ridha Hamila (Qatar University, Qatar)
Santanu Pal (Universität des Saarlandes, Germany)
Saptarshi Sengupta (Murray State University, USA)
Sebastian Link (The University of Auckland, New Zealand)
Serap Sahin (Izmir Institute of Technology, Turkey)
Sotiris Kotsiantis (University of Patras, Greece)
Sung-Bae Cho (Yonsei University, Korea)
Stephane Bressan (National University of Singapore, Singapore)
Tomoyuki Uchida (Hiroshima City University, Japan)
Yacine Atif (Skövde University, Sweden)
Yuansong Qiao (Athlone Institute of Technology, Ireland)
Contents
Machine Learning and Time Series

Tiered Clustering for Time Series Data . . . 3
Ruizhe Ma and Rafal Angryk

A Three-Step Machine Learning Pipeline for Detecting and Explaining Anomalies in the Time Series of Industrial Process Plants . . . 15
Marcel Dix

Detecting Phishing Websites Using Neural Network and Bayes Classifier . . . 27
Ravinthiran Partheepan

Blockchain Technology and Applications

A Blockchain Framework for On-Demand Intermodal Interlining: Blocklining . . . 41
Mary Rose Everan, Michael McCann, and Gary Cullen

Intersection of AI and Blockchain Technology: Concerns and Prospects . . . 53
K. B. Vikhyath, R. K. Sanjana, and N. V. Vismitha

SAIaaS: A Blockchain-Based Solution for Secure Artificial Intelligence as-a-Service . . . 67
Nicolas Six, Andrea Perrichon-Chrétien, and Nicolas Herbaut

Blockchain and Security

Trade-Off Between Security and Scalability in Blockchain Design: A Dynamic Sharding Approach . . . 77
Kahina Khacef, Salima Benbernou, Mourad Ouziri, and Muhammad Younas

BC-HRM: A Blockchain-Based Human Resource Management System Utilizing Smart Contracts . . . 91
Heba Adel, Mostafa ElBakary, Kamal ElDahshan, and Dina Salah

Applicability of the Software Security Code Metrics for Ethereum Smart Contract . . . 106
Aboua Ange Kevin N'DA, Santiago Matalonga, and Keshav Dahal

Machine Learning, Blockchain and IoT

A Recommendation Model Based on Visitor Preferences on Commercial Websites Using the TKD-NM Algorithm . . . 123
Piyanuch Chaipornkaew and Thepparit Banditwattanawong

Reinforcement Learning: A Friendly Introduction . . . 134
Dema Daoun, Fabiha Ibnat, Zulfikar Alom, Zeyar Aung, and Mohammad Abdul Azim

Universal Multi-platform Interaction Approach for Distributed Internet of Things . . . 147
Maria Stepanova and Oleg Eremin

A Practical and Economical Bayesian Approach to Gas Price Prediction . . . 160
ChihYun Chuang and TingFang Lee

Author Index . . . 175
Machine Learning and Time Series
Tiered Clustering for Time Series Data

Ruizhe Ma (University of Massachusetts Lowell, Lowell, MA 01854, USA, ruizhe [email protected]) and Rafal Angryk (Georgia State University, Atlanta, GA 30302, USA)
Abstract. Clustering is an essential unsupervised learning method. While the clustering of discrete data is a reasonably solved problem, sequential data clustering, namely time series data, is still an ongoing problem. Sequential data such as time series is widely used due to its abundance of detailed information. Often, normalization is applied to amplify the similarity of time series data. However, by applying normalization, measurement values, which are an important aspect of similarity, are removed, impairing the veracity of comparison. In this paper, we introduce a tiered clustering method by adding the value characteristic to the clustering of normalized time series. As such, two clustering methods are implemented. First, the Distance Density Clustering algorithm is applied to normalized time series data. After obtaining the first-tier results, we apply a traditional hierarchical clustering of a summarized time series value to further partition clusters.

Keywords: Unsupervised learning · Cluster · Time series

1 Introduction
The majority of data used in traditional data analysis are discrete point data, either an instantaneous point value (i.e., a point in time) or a summarized point value (i.e., an average). While point data is efficient to store and process, the obvious drawback is the lack of rich details. On the other hand, sequential data contains much more detail on the process of a recorded event. Time series is a special type of sequential data: an ordered sequence of evenly spaced values. Time series are extensively applied in various real-world applications. Clustering is an important part of exploratory data mining; essentially, it is the partitioning of data to have high within-cluster similarity and low between-cluster similarity. A clustering process can be an independent procedure to gain insight into the distribution of a dataset or a pre-process or subroutine for other data mining tasks, such as rule discovery, indexing, summarization, anomaly detection, and classification [6]. The application of clustering is very diverse; it can be applied in fields such as pattern recognition, machine learning, bioinformatics, and more. Cluster analysis is a reasonably well-studied problem in the data mining community. The clustering of time series, however, is a relatively newer facet. Due
to the high dimensionality of time series data, the distribution can be difficult to comprehend. The predetermination of cluster parameter settings is already a difficult task with discrete data. The global parameter used for partitioning can be even more complicated to identify for time series data, as it is near impossible to visualize the respective position and correlation of time series. Therefore, cluster algorithms with minimal parameter settings are more beneficial, making the study of the effect of different hierarchical structures of a time series dataset an important aspect to consider. Previous work on solar flares has shown that profiles could be used to identify otherwise unnoticeable distinctions amongst time series data [12]. A key application of time series profiles is prediction. If distinct trend profiles can be identified prior to the occurrence of an event, then predictions can be made as new measurements come in, in near real-time. Another possible application of time series cluster profiles is identifying possible sub-classes within existing identified classes. If different profiles within an existing class can be found, this would suggest the existence of physical sub-classes within the current definitions. Both applications may be hard to achieve using discrete point values, whereas the adoption of time series data and shape-based analysis could set the stage in this direction. The issue with using normalized time series data to generate cluster profiles is that in the process of normalization, certain aspects of time series characteristics are lost. There are three aspects of time series similarity: range value similarity, duration similarity, and shape similarity [12]. Normalization amplifies the shape similarity while sacrificing value similarity. Therefore, even though the shape similarity is much more apparent and clustering algorithms can build clusters of more similar time series, accuracy did not improve when compared to clustering with data that is not normalized. In this paper, we extend the normalized cluster profiles by adding another layer of value-based clustering, in the hope of combining two types of similarity and generating better results. The rest of this paper is organized as follows: Sect. 2 presents the related work. Section 3 discusses the applied clustering methods and how they produce tiered cluster results. Section 4 briefly discusses the solar pre-flare time series data used in our experiments. Section 5 presents the results and analysis. Finally, Sect. 6 summarizes this paper.
2 Background

2.1 Distance Measure
Real-world events are complex and detailed; oftentimes, when we evaluate events on summarized values, we trade preciseness for efficiency. With improved storage and processing capabilities, sequential data has gained more popularity. Time series is a popular type of sequential data: a sequence of measurements that are equally spaced in time. Since real-world events are complex and often affected by a multitude of unforeseeable external factors, it is highly probable to observe differences in both duration and measurements for time series describing
the same class of events. Therefore the similarity determination for time series is not a trivial problem. There are two main types of similarity measure, lock-step and elastic. The traditional lock-step similarity measure Lp norm refers to the Minkowski distance raised to the power of p. Minkowski distance is most commonly used with Lp where p = 1 (Manhattan distance), and with Lp where p = 2 (Euclidean distance). Euclidean distance is the straight-line distance. When applied to time series, assuming we are working with equal-length time series, Euclidean distance will always be based on a one-to-one mapping where the i-th element in one sequence is always mapped to the i-th element in the compared sequence. Comparatively, elastic measures allow one-to-many as well as one-to-one mappings [9]. Originally used in the field of speech recognition, the Dynamic Time Warping (DTW) algorithm is one of the most widely used elastic similarity measures [1–3, 7]. DTW enables computers to find an optimal match between two given sequences under certain constraints, and it allows flexibility in sequential similarity comparisons. Euclidean and DTW distances [5] of given time series Q and C are shown in Eqs. 1 and 2, respectively, where time series Q = {q1, q2, ..., qi, ..., qn} and time series C = {c1, c2, ..., cj, ..., cm}.

$$\mathrm{Dist}(Euclidean) = \sqrt{\sum_{i=1}^{N} (q_i - c_i)^2} \quad (1)$$
$$\mathrm{Dist}(DTW) = \min\{W(Q, C)\} \quad (2)$$
When Euclidean distance is used for time series data, the total distance is the sum of distances between each one-to-one mapping of elements qi and ci. In the case of DTW, however, an n×m distance matrix is first constructed containing all possible distances for each qi and cj pairing. Then each optimum step is chosen to form the optimal path; among the numerous warping paths W = w1, w2, ..., wk, ..., wK, the path that minimizes the mapping between time series Q and C, represented as min{W}, is considered the optimal warping path. At each step of the DTW algorithm, several choices are presented, and the allowed possibilities are referred to as the step pattern. The ability to choose a minimal step translates to data point mapping, and this choice gives the ability and effectiveness in finding shape similarities in time series data. Equation 3 is considered one of the most basic and commonly used step patterns. Here the cumulative distance D(Qi, Cj) is the sum of the current distance d(qi, cj) and the minimum cumulative distance from the adjacent elements.

$$D(Q_i, C_j) = d(q_i, c_j) + \min\{D(Q_i, C_{j-1}),\; D(Q_{i-1}, C_{j-1}),\; D(Q_{i-1}, C_j)\} \quad (3)$$

For both clustering and cluster representation, an effective time series averaging technique is required. Here we use the time series averaging technique
DTW Barycenter Averaging (DBA) [8]. Instead of dividing the summation, as with traditional averaging, DBA considers shape by using DTW to minimize the Within Group Sum of Squares (WGSS). Simply put, given a time series set S = {S1, S2, ..., Sn}, the time series C = {c1, c2, ..., ct} is considered an average of S if it minimizes:

$$WGSS(C) = \sum_{k=1}^{n} dtw(C, S_k)^2 \quad (4)$$
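To make the preceding definitions concrete, the following is a minimal NumPy sketch of the Euclidean distance (Eq. 1) and the DTW cumulative-cost recurrence (Eq. 3); the function names and toy sequences are illustrative and not part of the original evaluation code.

```python
import numpy as np

def euclidean_dist(q, c):
    # Eq. 1: lock-step, one-to-one mapping of equal-length series
    q, c = np.asarray(q, float), np.asarray(c, float)
    return np.sqrt(np.sum((q - c) ** 2))

def dtw_dist(q, c):
    # Eqs. 2-3: fill the cumulative cost matrix D with the basic step pattern
    # (squared local distances, square-rooted at the end; one common convention)
    n, m = len(q), len(c)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (q[i - 1] - c[j - 1]) ** 2
            # Eq. 3: minimum over the three adjacent cumulative distances
            D[i, j] = d + min(D[i, j - 1], D[i - 1, j - 1], D[i - 1, j])
    return np.sqrt(D[n, m])

q = [0.0, 1.0, 2.0, 1.0, 0.0]
c = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]
print(dtw_dist(q, c))  # the elastic alignment handles the unequal lengths
```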
2.2 Time Series Similarity
In different applications, the similarity of time series can vary. The three key elements of time series similarity are range value similarity, duration similarity, and shape similarity [12]. As shown in Fig. 1, range value similarity is demonstrated by sub-figures (a) and (c); it refers to the absolute range value of the time series and signifies the vertical comparability of two given time series. Duration similarity is demonstrated by sub-figures (a) and (b); it refers to the time series measurement duration and reveals the horizontal comparability of two given time series. Shape similarity, demonstrated by sub-figures (b) and (c), focuses more on the contour of the given time series.
Fig. 1. Time series similarity: (a) and (c) demonstrate range value similarity; (a) and (b) demonstrate duration similarity; (b) and (c) demonstrate shape similarity.
The similarity of time series is highly contextual and has broad applicability. Therefore, certain aspects of similarity could be deemed more significant under certain circumstances. However, all three elements of similarity should be fulfilled for two time series to be considered truly similar. Therefore, for unsupervised learning, we have to consider which similarity feature the clusters are built on.

2.3 Normalization Methods
Normalization is often used to scale data so that the data will fall within a specified range. In addition, time series normalization can also be used to shift and scale data to eliminate the effect of gross value influences. Evidently, normalization is not suitable for all time series data; it is more useful when the values are on different ranges or when the value differences are substantial enough for certain details to be overlooked. The four most commonly applied normalization
techniques for time series data are Offset Translation, Amplitude Scaling, Trend Removal, and Smoothing. When a certain normalization is applied, the same normalization is applied to all the time series in the dataset.

Offset Translation. Offset translation is the vertical shift of time series; it was originally used in signal processing when sequences are similar in shape but lie within different ranges.

$$ts = ts - mean(ts) \quad (5)$$

Here the mean value is independently computed for each time series and is the average over all the values in that specific time series. The translation of the offset can be useful for similarity comparisons. However, an immediate drawback of this operation is that the range values are eliminated, since the value differences are removed. This, however, can be made up for in the second stage of our tiered clustering process.

Amplitude Scaling. Amplitude is another term from signal processing; it measures how far and in which direction a variable differs from a defined baseline. Scaling a signal's amplitude means changing the strength of the signal. With time series data, we remove the different amplitudes in the hope of finding similarity by excluding the strength of the physical parameters.
(6)
Shown in Eq. 6, amplitude scaling is achieved by first moving time series by its mean and then normalized by the standard deviation. Which means that offset translation is included in amplitude scaling. In fact, when std(ts) = 1, the two methods are identical. Trend Removal. Trend removal is mostly applied in prediction models. Trends represent long-term movements in sequences. Trends can be distracting when attempting to identify patterns in sequential data, and therefore, it is often justified to remove them for revealing possible oscillations. To this end, the regression line of the time series needs to be identified and then subtracted from the time series. Unlike offset translation and amplitude scaling, trend removal is not a straightforward operation. In practice, there could be various types of trends or even multiple trends. In our experiments, we only considered the simple linear trend and the logarithmic trend. Smoothing. Smoothing is performed with a moving window on the time series to obtain the average values of each data point with those of its neighbors. While it can eliminate some irregular movements, it can be sensitive to outliers and also invalidates data at the beginning and the end of any time series. In the solar flare dataset for our experiments, the time series are relatively short in length (i.e., 60 data points) and is also noisy in nature. For a smoothing
8
R. Ma and R. Angryk
window to be effective, the size is often relatively large. Therefore, an effective smoothing would excessively shorten the time series we are working with, rendering the result ineffective. For this reason, smoothing is not included in our experiments.
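For illustration, offset translation (Eq. 5), amplitude scaling (Eq. 6), and a simple linear detrending could be implemented along the following lines; this is a sketch with our own function names, and it covers only the linear-trend case discussed above.

```python
import numpy as np

def offset_translation(ts):
    # Eq. 5: shift each series by its own mean
    return ts - ts.mean()

def amplitude_scaling(ts):
    # Eq. 6: z-normalization; identical to offset translation when std(ts) == 1
    return (ts - ts.mean()) / ts.std()

def remove_linear_trend(ts):
    # fit and subtract the least-squares regression line
    t = np.arange(len(ts))
    slope, intercept = np.polyfit(t, ts, 1)
    return ts - (slope * t + intercept)
```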
3 Tiered Clustering

Time series data is very domain-specific, meaning that data from one area could be processed in an entirely different way than data from another field. Therefore, we use a tiered clustering method to encompass more dimensions of similarity. In this section, we present the clustering algorithms applied in our tiered clustering method, namely Distance Density Clustering (DDC) and Hierarchical Agglomerative Clustering (HAC).

3.1 Distance Density Clustering
The Distance Density Clustering (DDC) method [10] was specifically developed for time series clustering and has shown promising results. Here we use it to cluster normalized time series. DDC is divisive in structure, meaning that performance generally increases as more clusters are introduced. In the extreme case of each event forming its own cluster, the method degenerates to a k-Nearest Neighbors algorithm with k = 1 (i.e., 1NN), where each instance of testing data is compared to all the existing training data and assigned the label of its single closest neighbor. While setting k to 1 can drastically improve the classification accuracy, conceptually, 1NN is a memorization process and not a generalization process. Memorization processes are inherently less powerful in real-world applications, as a comparison against the entire historical archive is unrealistic in most circumstances. While many existing clustering algorithms can be applied to time series, either with data summarization or effective distance measures, the effect is often limited. DDC typically generates more intuitive results for time series clustering [10]; the main steps are shown in Algorithm 1. Initially, through majority voting, the furthest time series is identified and used as the initial cluster seed. The furthest time series is the one that is furthest from the most other time series. Then the distances between all instances and the cluster seed are computed and sorted. The most significant increase in the sorted distances is considered a virtual sparse region and is used to divide the dataset. Then new cluster seeds are identified, and the cluster assignment is re-balanced based on time series similarity. This process is iterated until no more clusters can be found, or until the process has reached a user-defined threshold, such as a certain number of clusters having been generated. Finally, all the identified cluster seeds and their respective cluster elements are obtained.
Algorithm 1. Distance Density Clustering
Require: E = {e1, ..., en}: the time series events to be clustered;
         C(k-1) = {c1, ..., c(k-1)}: the set of cluster seeds;
         k: number of seeds;
         Lk: the cluster set of events based on the number of groups
 1: L(k-1) ← Cluster(C(k-1))
 2: ar[1, 2, ..., k-1] = DistSort(L(k-1))
 3: value[i] ← max(ar[2] - ar[1], ..., ar[k-1] - ar[k-2])
 4: if ar[n] - ar[n-1] == max(value[i]) then
 5:     location[i] = n
 6: end if
 7: i ← max(value[1, ..., k-1])
 8: l(i1, i2) ← l(i), (c_i1, c_i2) ← c_i
 9: return Ln = {1, 2, ..., i1, i2, ..., n} ← Ck {(c1, c2, ..., c_i1, c_i2, ..., cn)}
10: for ei ∈ E do
11:     (c1, c2, ..., ck) ← DBA(c1, c2, ..., c_i1, c_i2, ..., c(k-1))
12:     UpdateClusterDBA(Ck)
13: end for
14: return Ck = {c1, ..., ck} as the set of cluster seeds
15: return Ln = {l(e) | e = 1, 2, ..., n} as the set of cluster labels of E
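The splitting step at the heart of Algorithm 1, locating the largest jump in the sorted seed-to-member distances and treating it as a virtual sparse region, can be sketched in a few lines of NumPy (illustrative only; the variable names are ours):

```python
import numpy as np

def split_at_largest_gap(distances):
    # distances: seed-to-member (e.g., DTW) distances within one cluster
    order = np.argsort(distances)
    sorted_d = np.asarray(distances)[order]
    gaps = np.diff(sorted_d)              # ar[i+1] - ar[i] from Algorithm 1
    cut = int(np.argmax(gaps)) + 1        # first index beyond the sparse region
    return order[:cut], order[cut:]       # member indices of the two sub-clusters
```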
3.2 Hierarchical Agglomerative Clustering
Hierarchical clustering algorithms separate data into different levels with a top-to-bottom ordering, which forms a corresponding tree structure. There are two types of hierarchical clustering: agglomerative, also known as Agglomerative Nesting (AGNES), and divisive, also known as Divisive Analysis (DIANA) [4]. AGNES is a bottom-up approach, where each event is assigned its own cluster and, based on a specific linking mechanism, the most similar clusters are joined to form a new cluster. This process is repeated until all events are joined together. DIANA is a top-down approach, where all events start as a single cluster and are then partitioned to form the two least similar clusters. This process is repeated until each event forms its own cluster. Both AGNES and DIANA are based on distance for measuring similarity. In an agglomerative structure, clusters are joined based on the similarity between elements or clusters. When comparing the similarity of clusters, various measures can be adopted. The cluster merging method is called linkage. The most commonly used linkage measures use the nearest, furthest, or average distance for cluster distance measurement, corresponding to single link, complete link, and average link.

Algorithm 2. Hierarchical Agglomerative Clustering
Require: set X of objects {x1, ..., xn}; similarity function dist(c1, c2)
1: for i = 1 to n do
2:     ci = {xi}
3: end for
4: C = {c1, ..., cn}; l = n + 1
5: while C.size > 1 do
6:     (cmin1, cmin2) = min dist(ci, cj) for all ci, cj in C
7:     remove cmin1 and cmin2 from C
8:     C ← C ∪ {cmin1 ∪ cmin2}; l = l + 1
9: end while

Both the DDC and the HAC are hierarchical in structure, but they form clusters based on different concepts. DDC takes advantage of a virtual sparsity split to form clusters, whereas HAC is purely based on a distance/similarity split. Furthermore, HAC is a greedy approach and DDC is not. In our proposed
clustering structure, we take advantage of both clustering algorithms to focus on different aspects of similarity. We use DDC to focus on the shape similarity before using HAC to partition the data based on range similarity.
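For the second tier, standard library routines suffice. A sketch using SciPy's agglomerative clustering on a per-series summary value (here assumed to be the mean of the unnormalized series) might look like:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def hac_on_values(series_list, n_clusters):
    # one summary value per (unnormalized) time series
    values = np.array([[s.mean()] for s in series_list])
    Z = linkage(values, method='average')  # average-link agglomeration
    return fcluster(Z, t=n_clusters, criterion='maxclust')
```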
4 SWAN-SF Dataset
In this study, we use the Space Weather ANalytics for Solar Flares (SWAN-SF) [11], a benchmark dataset of multivariate time series (MVTS) spanning a 9-year period (2010–2018). Essentially, the goal is to predict the most significant solar flare within the next 24 h from the 12 h of before-flare time series measurements for multiple parameters. For reference, the 9 most interesting parameters picked by domain experts are listed in Table 1. There are a total of 5 classes of solar flares, listed from quiet to the most powerful: FQ (flare-quiet), B class, C class, M class, and X class, and each time series is labeled with the most significant (largest) flare within the 24 h observation period. The most impactful flares are M- and X-class flares; therefore, in this paper, we are specifically focusing on the clustering of classes C, M, and X flares. This dataset can be considered an MVTS dataset with 3 class labels, and the meaning of specific parameters should not interfere with the presented method. The measurements of solar flares cannot be clustered in a straightforward manner. While the duration of each event is the same, the range value similarity and the shape similarity can be challenging to identify simultaneously. This is partly due to the vast variation in the strength of solar flare measurements. When the values of different events differ substantially, the shape details can become hard to distinguish. In a previous study, shape-intuitive clusters were generated by clustering the normalized time series [12]; however, the accuracy performance was not improved despite the shape emphasis. This side effect of normalization can be compensated for by the actual values of different events. In this study, we are considering both the effect of shape similarity as well as the measurement values.
Table 1. Nine parameters selected by domain experts, for which solar pre-flare time series are evaluated.

  Keyword    Description
1 MEANJZD    Mean vertical current density
2 MEANJZH    Mean current helicity
3 R VALUE    Sum of flux near polarity inversion line
4 SAVNCPP    Sum of the modulus of the net current per polarity
5 SHRGT45    Fraction of area with shear angle >45°
6 TOTFZ      Sum of z-component of Lorentz force
7 TOTUSJH    Total unsigned current helicity
8 TOTUSJZ    Total unsigned vertical current
9 USFLUX     Total unsigned flux

5 Experimental Results
In consideration of fairness and to eliminate performance randomness, a balanced 5-fold cross-validation on the curated dataset of a total of 300 C, M, and X class instances was implemented. Cross-validation is a statistical evaluation method used to evaluate machine learning models where data is limited. The testing data is never included in the training process to avoid bias, and the training and testing are repeated for each data fold to ensure stability.
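As a sketch, the balanced 5-fold protocol could be realized with scikit-learn's stratified splitter; the variable names below are illustrative and not the authors' code.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def five_fold(X, y, seed=0):
    # X: the 300 curated time series instances; y: their C/M/X class labels
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(X, y):
        # test data is never seen during training, to avoid bias
        yield X[train_idx], y[train_idx], X[test_idx], y[test_idx]
```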
[Figure 2 plots omitted: four panels of value versus time series, showing (a) a normalized cluster and three sub-clusters with class ratios CMX=1:4:4, CMX=0:1:1, and CMX=0:0:7.]
Fig. 2. (a) shows one of the clusters generated by the DDC algorithm; (b), (c), and (d) are the sub-clusters generated by HAC.
First, we show in detail the advantage of applying a tiered clustering of normalized time series data with DDC and HAC. After processing normalized time series data with DDC, we apply HAC on each DDC generated cluster. Starting from the bottom of the dendrogram, when the branch ratio first exceeds the third quartile, we cut the dendrogram and obtain the corresponding clusters. The importance of both the shape and the measurement value is shown in Fig. 2.
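The cutting rule just described (cut the dendrogram where the branch ratio first exceeds the third quartile) admits the following sketch; this is one plausible reading of the rule, with our own names, rather than the exact original code.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cut_by_branch_ratio(values, q=75):
    Z = linkage(values, method='average')
    h = Z[:, 2]                                  # merge heights, ascending
    ratios = h[1:] / np.maximum(h[:-1], 1e-12)   # bottom-up branch ratios
    jump = int(np.argmax(ratios > np.percentile(ratios, q)))
    t = 0.5 * (h[jump] + h[jump + 1])            # cut just below the first big jump
    return fcluster(Z, t=t, criterion='distance')
```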
[Figure 3 plots omitted: nine panels of accuracy versus number of clusters, (a) MEANJZD, (b) MEANJZH, (c) R VALUE, (d) SAVNCPP, (e) SHRGT45, (f) TOTFZ, (g) TOTUSJH, (h) TOTUSJZ, (i) USFLUX.]
Fig. 3. Performance of different clustering approaches demonstrated by 9 parameters of solar pre-flare data. The x-axis is the increasing number of clusters, and the y-axis is the corresponding accuracy value. Three cluster structures are shown: UN DDC (unnormalized DDC results), N DDC (normalized DDC results), and N DDC&HAC (normalized time series with DDC and HAC results).
A cluster generated by DDC is presented in Fig. 2(a), with the orange line being X-class flares, the yellow line being M-class flares, and blue being C-class flares; the time series average is the dark line. While this is not a pure cluster, the shape after normalization is actually quite similar for all three classes. Figure 2(b), (c), and (d) are the sub-clusters generated by HAC from the original cluster; here the actual value is taken into account. The ratio of classes C, M, and X is written above each sub-cluster. The third sub-cluster contains 7 X-class flares; the first and second are more mixed. However, considering both the shape similarity as well as the value similarity, it would be difficult even for
a human to distinguish the first and second sub-clusters by the time series alone. The overall performance of one fold is shown in Fig. 3; the other folds are comparable in performance but omitted for simplicity. Here the HAC cuts are performed generically with the dendrogram branch ratio; in practice, HAC can be fine-tuned for different data or different parameters. For each parameter, the progression of accuracy improvement for each clustering method is demonstrated in relation to the number of clusters in Fig. 3(a)–(i). Different normalizations are all included in the figures. The unnormalized time series DDC results are referred to as "UN DDC", normalized DDC results are referred to as "N DDC", and normalized tiered clustering results from both DDC and HAC are referred to as "N DDC&HAC". The UN DDC accuracy results overlap with the N DDC accuracy results. As concluded in previous work [12], although the clustering of normalized time series generated more intuitive clusters, it did not improve the accuracy performance. This was partly due to the information loss in the normalization process. Therefore, when both the shape and the value information are considered, we see a general improvement in the tiered clustering structure with DDC and HAC, especially when the number of clusters increases.
6 Conclusion
Normalization is effective in finding shape similarities when the value differences are significant. However, in the process of normalization, measurement value information is lost. In this paper, we extend the clustering of normalized time series by reintroducing value information using hierarchical clustering. This way, we can take into account both the shape information as well as the value information embedded in the original time series measurements. We would like to note that this tiered clustering is not suited to all time series data, but is an alternative method for time series data that may have extreme range value differences. This method could also be helpful in identifying new sub-classes within established data classes in the future.
References

1. Sakoe, H.: Dynamic-programming approach to continuous speech recognition. In: 1971 Proceedings of the International Congress of Acoustics, Budapest (1971)
2. Myers, C., Rabiner, L.: A level building dynamic time warping algorithm for connected word recognition. IEEE Trans. Acoust. Speech Signal Process. 29(2), 284–297 (1981)
3. Keogh, E., Ratanamahatana, C.A.: Exact indexing of dynamic time warping. Knowl. Inf. Syst. 7(3), 358–386 (2005)
4. Rokach, L., Maimon, O.: Clustering methods. In: Maimon, O., Rokach, L. (eds.) Data Mining and Knowledge Discovery Handbook, pp. 321–352. Springer, Boston (2005). https://doi.org/10.1007/0-387-25465-X_15
5. Müller, M.: Dynamic time warping. In: Müller, M. (ed.) Information Retrieval for Music and Motion, pp. 69–84. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-74048-3_4
6. Chiş, M., Banerjee, S., Hassanien, A.E.: Clustering time series data: an evolutionary approach. In: Abraham, A., Hassanien, A.E., de Leon F. de Carvalho, A.P., Snášel, V. (eds.) Foundations of Computational Intelligence Volume 6, vol. 206, pp. 193–207. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-01091-0_9
7. Jeong, Y.-S., Jeong, M.K., Omitaomu, O.A.: Weighted dynamic time warping for time series classification. Pattern Recogn. 44(9), 2231–2240 (2011)
8. Petitjean, F., Ketterlin, A., Gançarski, P.: A global averaging method for dynamic time warping, with applications to clustering. Pattern Recogn. 44(3), 678–693 (2011)
9. Wang, X., Mueen, A., Ding, H., Trajcevski, G., Scheuermann, P., Keogh, E.: Experimental comparison of representation methods and distance measures for time series data. Data Min. Knowl. Disc. 26(2), 275–309 (2013)
10. Ma, R., Angryk, R.: Distance and density clustering for time series data. In: 2017 IEEE International Conference on Data Mining Workshops (ICDMW), pp. 25–32. IEEE (2017)
11. Aydin, B., et al.: Multivariate time series dataset for space weather data analytics. Sci. Data (2019, manuscript submitted for publication)
12. Ma, R., Ahmadzadeh, A., Boubrahimi, S.F., Georgoulis, M.K., Angryk, R.: Solar pre-flare classification with time series profiling. In: 2019 IEEE International Conference on Big Data (Big Data). IEEE (2019)
A Three-Step Machine Learning Pipeline for Detecting and Explaining Anomalies in the Time Series of Industrial Process Plants

Marcel Dix (ABB Corporate Research Center, Wallstadter Str. 59, 68526 Ladenburg, Germany; [email protected])
Abstract. Anomaly detection offers an important type of machine learning analysis for industrial automation systems because it is one of the highest goals in the industrial domain to avoid production disturbances and to stay productive. If devices are no longer operating within normal bounds, or if the produced product quality is no longer within normal bounds, then the plant operator wants to be informed as early as possible. Anomaly detection can provide this type of information for the operator. This paper presents a solution pipeline for the detection and explanation of anomalies in the multivariate time series of industrial plant processes, and a possible implementation of this pipeline that is fully based on the use of neural network architectures. The pipeline consists of three consecutive steps that build on each other to successively explore in depth the anomalies and their underlying root-causes. The three steps of the pipeline are: 1) the detection of the anomaly itself, 2) pointing to the location where the anomaly comes from, and 3) determining the type of anomaly at this location. The evaluation of the pipeline is performed to detect a set of 16 simulated plant equipment failures that have been obtained from a real-world industrial process that is typically found in oil production fields, called the separator process.

Keywords: Anomaly detection · Explainable AI · Industrial process plants · Multivariate time series · Process simulation
1 Introduction

The development of process control systems for industrial process plants, such as a chemical plant or an oil production field, has enabled the monitoring and controlling of large production processes by only a few operators. The central workplace for operators has become the control room, where the data from hundreds of sensors in the field comes together and is displayed on monitoring screens. Through process control, the production can run largely automated today, while a main attention of operators is to observe that the plant is running normally and to take actions in case of unwanted process deviations. This is, however, not an easy task, but requires a lot of experience, because production processes can be large and complex, and there are numerous anomalous situations that can occur [1].
Here, the use of machine learning algorithms has gained attention to help operators in the detection of anomalies in the plant [2]. A specific use case for anomaly detection in process industries is to detect various plant equipment performance issues and failures. To illustrate and discuss this use case in this paper, the example of a separator process [3] is used. This is a type of process that is typically found in oil production fields. The core of this process is shown in Fig. 1. It consists of a separator vessel that separates the fluids coming from the ground well into three output components: oil, gas, and wastewater. The separation process is achieved by several valves, which control the flow of fluids going into the vessel and the three separated flows of oil, gas and wastewater going out.
Fig. 1. Separator vessel and valves in a separator process in oil production (source: [4])
The various valves are mechanical devices that can fail. For example, a valve may have a technical issue such that it is leaking, which means it is not closing fully as expected, or it may be completely blocked so that it does not open or close at all anymore. Such device issues can easily go unnoticed when they are not directly indicated to operators in the control system displays, e.g., as an alarm [5]. Here, the use of anomaly detection models can provide an interesting machine learning use case for industrial process plants, by trying to uncover such hidden issues to the operator. When an anomaly is detected by the machine learning algorithm and reported to the operator, it is necessary to better explain this anomalous situation, to help operators in finding and resolving the issue more easily. To address this need for more comprehensible machine learning outcomes, the research field "Explainable AI" (XAI) [6] has received increasing attention in the industrial domain recently [7]. The contribution of this paper is to present a possible machine learning approach for detecting and explaining anomalies in industrial process plants. The approach consists of a pipeline that includes three steps: 1) the detection of the anomaly itself, 2) pointing to the location where the anomaly comes from, and 3) determining the type of anomaly at this location. The evaluation is performed to detect 16 valve failure cases that have been obtained from a simulated separator process with the help of high-fidelity process simulation tools. The remainder of this paper is structured as follows: Sect. 2 briefly presents the related work. Section 3 introduces the proposed three-step solution pipeline. Section 4 discusses the simulated evaluation data that was used. Section 5 presents and discusses the evaluation results for this data, and Sect. 6 draws the final conclusions.
2 Related Work In process industries it is one of the highest goals to avoid production disturbances, because unplanned downtimes are expensive. Not surprisingly, there is a large body of literature on failure detection and anomaly detection in the industrial domain, as highlighted in a survey paper by Cook et al. [8]. Process control systems records various types of data, such as time series readings from plant sensors, alarms and events, laboratory samples, operator shift book notes, etc. Here, a very comprehensive type of data about the plant process and equipment, that could be used for data analysis, is the time series data, because it is recorded continuously and typically at very small sampling rates (e.g. seconds) [1]. A possible approach toward anomaly detection in multivariate time series data is with the help of artificial neural networks. Here, a feasible network architecture is an Autoencoder network [9]. There are specific types of Autoencoders, including Dense Autoencoders and LSTM Autoencoders [10]. Furthermore, recurrent neural networks (RNNs) are applicable for anomaly detection [11]: instead of predicting a future time window, one can re-predict the time window that was given as input data to the model, and then compute the reconstruction error similarly to Autoencoders. In addition to neural networks, statistical outlier detection methods can provide a solution to detecting anomalies in industrial time series. For example, Zhong et al. [12] apply isolation forest to detect anomalies in gas turbines. In the government-funded research project FEE [13], a density-based anomaly detection approach is applied to detect anomalies in chemical plant processes, based on Euclidian distances of the time series in combination with a k-nearest neighbor (kNN) approach [2]. A further type of data from industrial systems that can be useful for anomaly detection is the alarm and event data. For instance, Atzmüller et al. [14] present an approach that utilizes sequential alarm data for anomaly detection and analysis, based on first-order Markov chain models. In their paper by Siddharthan et al. [15] the concept of topic modeling is realized to describe the “topics” of large industrial event logs that could more easily uncover equipment issues than having to read all the events. Statistical outlier detection algorithms such as a OneClassSVM algorithm [16] can also provide a solution to detect unusual event patterns, e.g., in event fingerprints of assets. A possible limitation of anomaly detection models can be the lack of explainability of predicted anomalies, which is a reason why the research field Explainable AI (XAI) has gained attention in the industrial domain [7]. The explanation of model predictions is the scope of so-called post-hoc methods in XAI [6]. There are popular examples for open-source python libraries, such as SHAP [17] and LIME [18], that can be used for building explainers for classifiers of image data, tabular data, and textual data. However, in their paper by Kotriwala et al. [7] about XAI use cases in process industries, it is observed that little attention has been given to explain models based on time series data, which is, on the other hand, the most relevant type of data in the industrial domain. Where a needed XAI solution does not exist “out of the box”, an own solution can be developed and tailored to the customer-specific industrial needs. 
In his survey paper on XAI, Lipton [19] highlights four basic principles that have been commonly used for building custom XAI solutions, which are: (1) training a second model that tries to explain the first model, (2) making use of visualization to show to the user what the
18
M. Dix
model has learned, (3) making use of natural language to better explain to humans in their human terms, and (4) explaining by showing similar examples to the user. For instance, Atzmüller et al. in their paper [2] highlight a custom solution that was developed based on dynamic time warping (DTW), to find similar plant situations that have occurred in the past, that can be presented to operators to better understand the present plant situation. The contribution of this paper is the presentation of a solution pipeline for the detection and explanation of anomalies in the time series of industrial plant processes, and a possible implementation of this pipeline that is fully based on the use of neural network architectures. The pipeline consists of three consecutive steps, that build on each other, to successively explore in depth the anomalies and their underlying root-causes.
3 Solution Approach

The proposed solution, for detecting and explaining anomalies in the multivariate time series of industrial plant processes, consists of three steps that build on each other:

• Step 1: Detecting an anomaly in the given plant situation. (The use case for this paper is to detect anomalies caused by plant equipment failures.)
• Step 2: Tracing back the anomaly to the location where the anomaly comes from. (In the given use case, this means pointing to the valve that has failed.)
• Step 3: Determining the type of anomaly at this location. (E.g., that the given valve that was found in step 2 seems to be leaking.)

Figure 2 illustrates the pipeline at the example of detecting a valve leakage:
[Figure 2 diagram omitted: the signals of the given plant situation feed Step 1 (anomaly detection: the plant situation is anomalous), followed by the root-cause analysis in Step 2 (anomaly localization: the anomaly comes from the water valve) and Step 3 (anomaly type definition: the water valve is leaking).]
Fig. 2. Solution pipeline (at the example for detecting a valve leakage in the given plant situation)
As the first step of the pipeline, an anomaly must be detected. In this paper, this is realized with the help of a Dense Autoencoder. As highlighted by Sakurada and Yairi in [9], a feasible approach to anomaly detection is with the help of such Autoencoder neural network architectures. An Autoencoder is a type of neural network that is trained to learn how to efficiently compress a given type of data, and then reconstruct this data back to its original form. In our case, this data represents multivariate time series readings from the different devices (sensors and valves) in the separator process. When an unusual
data sample is given to the trained model, it will not be able to reconstruct this data very well, because it has not learned how to do so. The resulting reconstruction error will be relatively high. If this error is higher than a predefined acceptable threshold, then an anomaly is reported. This threshold is typically calculated by taking the average error on the normal data plus a standard deviation. The second step of the pipeline tries to trace back the detected anomaly to the location where the anomaly comes from. In the given use case, the aim here is to point to the valve in the separator process that has failed, which has caused the anomaly. In this paper, this is realized with the help of the reconstruction error from the Autoencoder. When a failure occurs on a device, it has been observed that this reconstruction error is primarily evident in the time series related to the failed valve, and that the Autoencoder does not carry over this error (significantly) to the other signals in the separator process. Here, by splitting up the reconstruction error into an error per signal, it is possible to identify the device which has contributed most to an anomaly. Section 5 analyzes to what extent this approach succeeds in pointing to the correct valve that had failed in the 16 given failure cases. Knowing the location where an anomaly comes from, a further helpful piece of information for operators can be to determine the type of issue at this location. This is the aim of the third step of the presented pipeline. In this paper, the type definition for failures is realized with the help of failure classification, using a Convolutional Neural Network (CNN) [20] architecture. The underlying idea for using a classifier approach is the observation that there are specific types of known failures that can typically happen on industrial valves, e.g., valve leakages, valve blockages, valve plugging failures, valve dead band failures, and valve position indicator failures. With the help of example failure data for different failure cases, a classifier model can be trained to learn, e.g., what a valve leakage typically "looks like" in this data. Then, for a given new anomaly, the classifier should try to recognize the valve leakage again in this new sample. However, failure classification has a limitation related to unknown failures, because the classification approach needs to know all possible failure classes beforehand, during the model training. Section 5 analyzes and discusses this limitation at the example of detecting known and unknown valve failure cases in the given separator process use case.
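As an illustration of steps 1 and 2, a Dense Autoencoder with an anomaly threshold of mean plus one standard deviation, and a per-signal attribution of the reconstruction error, could be prototyped as follows; the layer sizes, names, and training settings are our own illustrative choices, not those of the deployed model.

```python
import numpy as np
from tensorflow import keras

def build_autoencoder(n_features):
    # n_features = window_length * n_signals (windows are flattened)
    inp = keras.Input(shape=(n_features,))
    z = keras.layers.Dense(32, activation="relu")(inp)       # bottleneck
    out = keras.layers.Dense(n_features, activation="linear")(z)
    model = keras.Model(inp, out)
    model.compile(optimizer="adam", loss="mse")
    return model

def fit_and_threshold(X_normal):
    # X_normal: (samples, window, signals) of failure-free, scaled data
    s, w, k = X_normal.shape
    X = X_normal.reshape(s, w * k)
    ae = build_autoencoder(w * k)
    ae.fit(X, X, epochs=50, batch_size=32, verbose=0)
    err = np.mean((ae.predict(X) - X) ** 2, axis=1)          # mse per sample
    return ae, err.mean() + err.std()                        # step 1: threshold

def localize(ae, x_window):
    # step 2: per-signal reconstruction error of one anomalous window
    w, k = x_window.shape
    rec = ae.predict(x_window.reshape(1, w * k)).reshape(w, k)
    per_signal_mse = np.mean((rec - x_window) ** 2, axis=0)
    return int(np.argmax(per_signal_mse))                    # suspect signal index
```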
4 Evaluation Data

To train a reliable anomaly detection or classification model, historic plant data needs to be available, including known equipment failure cases. This can be a challenge in the industrial domain, because industrial process plants are very robust, so real failure cases are usually rare. Moreover, even if some failure cases did occur, they are often hard to find in the provided data sets, because failures have frequently not been labeled as such by the operator, or they were not noticed when they occurred [4]. Here, the use of industrial process simulators can help data scientists to simulate various normal and abnormal plant situations that could not be obtained from the real plant. In 2019, the two companies ABB and Corys decided to set up a simulation infrastructure for ABB’s control system 800xA that data scientists could use to generate simulation datasets for machine learning [4]. At the heart of this infrastructure are the simulation
tools of the two companies: Corys’ process simulator Indiss Plus [21], and ABB’s 800xA Simulator [22]. In this setup a separator process was configured as an experimental process, which is shown in Fig. 3. When the simulation is running, the process data is automatically captured and stored into a history database. The recorded data can then be used by data scientists for their research. This was also how the evaluation data was generated for this paper.
Fig. 3. Using the process simulators Indiss Plus [21] with 800xA Simulator [22] to simulate plant processes as well as plant equipment failures. Here, a valve leakage is simulated in the oil valve of the separator process.
A key advantage of process simulation is the ability to simulate various abnormal situations, such as plant equipment failures. In the example of Fig. 3, a valve leakage is simulated in the oil valve of the separator process. This way, simulation provides a possible solution to overcome the lack of failure data in the real plant. For the evaluation in this paper, data sets for four types of valve failures were generated:

1. A valve leakage failure: the valve does not close fully as expected,
2. A valve plugging failure: this causes the valve to obstruct the flow,
3. A sticking valve failure: this blocks the valve actuator in the current position, and
4. A valve dead band failure: the valve actuator gets a hot spot and consequently a dead band; within this dead band, the valve does not move.
Furthermore, these failures were performed on the three main valves of the separator process, namely the oil valve (in the following figures this valve is also identified by its tag name 20-LV-1031_Z_Y_Value), the water valve (20-LV-1034_Z_Y_Value), and the gas outlet valve (20-PV-1037_Z_Y_Value). In total, 16 failure cases were simulated and stored into 16 separate data sets.
5 Evaluation and Discussion

This section presents and discusses the evaluation results, showing to what extent the proposed machine learning pipeline can detect and explain the 16 simulated valve failures as anomalies in the given separator process use case.

In this paper, the first step of the pipeline, the detection of the anomaly, is realized with the help of a Dense Autoencoder. A model is first trained with normal plant data, i.e., data that represents normal plant operation without any failures. The trained model is then applied to detect the anomalies in the 16 failure cases. Figure 4 presents an example of detecting a leakage failure in the water valve. The left plot shows the data for this failure case as input data to the Autoencoder. The time series related to the signal of the failed water valve (20-LV-1034_Z_Y_Value) is highlighted as the bold black line. The failure case starts in normal plant operation without the failure. Then, at the time indicated by the first vertical orange line, the failure is induced in the simulation tool. The failure continues until the second vertical orange line (i.e., the dashed line), when the failure is removed again in the simulation. The plot in the middle shows the reconstructed data that is the output from the Autoencoder. There are clear signs of reconstruction error during the time of the failure, especially for the signal of the failed water valve. The right plot of Fig. 4 shows the reconstruction error metric as the mean-squared error (mse) between the input and the output data. When this error exceeds the anomaly threshold, which is the horizontal blue line, an anomaly is reported. In the example of Fig. 4, the given failure case was successfully detected as an anomaly by the Autoencoder.
Fig. 4. Example of detecting a valve failure as an anomaly. The bold black time series represents the signal of the failed valve. It shows clear signs of reconstruction error during the failure.
Figure 5 presents the overall performance for detecting the anomalies in all 16 failure cases. The different failure cases have a length of 150 to 600 min, depending on how long each failure experiment was running in the simulation. As the Autoencoder makes a prediction at every minute across the length of each failure case, this leads to the high number of predictions that are shown in the confusion matrix of Fig. 5.
Performance metrics:
  Accuracy:  0.82
  Precision: 0.77
  Recall:    0.80
  F1-Score:  0.79
Fig. 5. Overall performance of step 1 of the pipeline to detect the anomalies
There is a difficulty in the interpretation of the results of Fig. 5. Note in Fig. 4 that there is a time lag between the point in time when the failure was induced in the simulation tool (first vertical orange line) and when it is detected as an anomaly by the Autoencoder. Similarly, there is a time lag when the failure is removed again (second vertical orange line). This time lag can be explained by the fact that the process is slow and always needs some time to react to changes. However, the lag makes the interpretation of false predictions in the confusion matrix of Fig. 5 debatable. For example, the results would improve if failures had simply been simulated longer, because then the influence of the time lag would become relatively smaller compared to the failure time.

The second step of the pipeline is to point to the location where the anomaly comes from. In this paper, this step is realized with the help of the signal-wise reconstruction error analysis. Figure 6 presents an example of this analysis, using again the failure case in the water valve. The left plot shows the reconstruction error from Fig. 4, now split into the various signals. The error for the signal of the failed water valve is again highlighted as the bold black line. By summing up the error over the time of the failure, it is possible to determine the contribution of each signal to the anomaly. This is shown in the right plot of Fig. 6. In this example, the signal-wise reconstruction error analysis points to the correct device that had caused the anomaly, i.e., the water valve.
Fig. 6. Example for step 2 of the pipeline that points to the water valve where the anomaly appears to come from.
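Continuing the sketch from step 1, the signal-wise localization can be expressed in a few lines: the squared error is kept per signal instead of being averaged, summed over the anomalous window, and the signal with the largest contribution is reported. The function name and the data layout are illustrative assumptions; the tag names are those given in Sect. 4 (the full signal list of the plant would include all sensors, not just the three valves).

import numpy as np

def localize_anomaly(ae, window, signal_names):
    # window: (timesteps, n_signals) slice of data flagged as anomalous.
    recon = ae.predict(window, verbose=0)
    per_signal_error = ((window - recon) ** 2).sum(axis=0)  # sum over time
    return signal_names[int(np.argmax(per_signal_error))]

valve_tags = ["20-LV-1031_Z_Y_Value",   # oil valve
              "20-LV-1034_Z_Y_Value",   # water valve
              "20-PV-1037_Z_Y_Value"]   # gas outlet valve
# suspect = localize_anomaly(ae, anomalous_window, valve_tags)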
Figure 7 presents the overall performance in locating the failed valve in all 16 failure cases. Here, only the 1648 true positives from Fig. 5 are considered: where the Autoencoder did not predict any anomalies, there is nothing to explain. The 498 false predictions from Fig. 5 are also not considered, to avoid carrying this error over to the subsequent steps of our analysis. For the given 1648 correct predictions of anomalies, Fig. 7 shows that the signal-wise reconstruction error analysis points to the failed valve that had caused the anomaly in 97% of the predictions.
Performance metrics:
  Accuracy:  0.97
  Precision: 0.97
  Recall:    0.95
  F1-Score:  0.96
Fig. 7. Overall performance of step 2 of the pipeline to point to the valves which have caused the anomalies
The third step of the pipeline is to determine the type of failure at the location that was found to be anomalous. In this paper, this step is realized with the help of failure classification, using a CNN. To assess the issue of unknown failures, only two of the four failure types are used for model training, namely leakages and plugging failures. Sticking and dead band failures are left unknown to the CNN model.

Figure 8 presents two example outputs from the CNN model. The left plot shows a valve leakage, which is a known failure. Here, the CNN model correctly classifies this failure as a leakage. The right plot shows a valve dead band failure, which is an unknown failure. Here, the CNN model tries to fit the dead band failure to one of its known classes, leakage or plugging. This behavior, however, leads to misleading false predictions about the failure type being reported by the model.

To overcome the issue of unknown failures, a possible solution is explored. It is observed in Fig. 8 that, for unknown failures, the prediction from the CNN model often switches arbitrarily between classes. Based on this observation, the following concept for detecting unknown failures was tested: if the prediction does not stick to a class, but switches classes within a grace period of 30 min, then the failure is classified as “unknown”. The grace period of 30 min was chosen manually after testing the performance of different grace periods.

Figure 9 presents the overall performance in determining the failure type for all 16 failure cases with the help of the CNN classifier. The left confusion matrix shows the results without the additional postprocessing step that moves failures that switch classes into an “unknown” class. The right matrix shows the results with the “unknown” class.
Fig. 8. Example for step 3 to determine the type of failure. Left: a known failure type (leakage). Right: an unknown failure type (dead band failure) is falsely classified.
In the left matrix, all sticking and dead band failures (i.e., the unknown failure types) are falsely classified as leakages or plugging failures. In the right matrix, the “unknown” class helped to recognize most of these false predictions as unknown. However, several false predictions remain. This happens, e.g., when a dead band failure is continuously predicted to be a leakage for longer than the 30-min grace period; the failure is then ultimately “accepted” as a leakage. Depending on the types of failure, this effect may be even more pronounced in other cases than in the ones considered here. Waiting out a grace period to recognize these failures as unknown then also becomes less useful.
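The grace-period rule itself can be summarized in a short sketch. This is a hedged illustration of the postprocessing concept described above, not the paper's code; the function signature and the per-minute label format are assumptions.

def postprocess_failure_type(predictions, grace_minutes=30):
    # predictions: one class label per minute, e.g. ["leakage", "plugging", ...].
    head = predictions[:grace_minutes]
    if len(set(head)) > 1:
        # The prediction switched classes within the grace period,
        # so the failure is reported as unknown.
        return "unknown"
    # A prediction that sticks to one class for the whole grace period
    # is accepted as that class (which can still be wrong, as discussed).
    return head[0]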
Performance metrics to determine unknown failures:

              Without unknown class   With unknown class
  Accuracy    0.46                    0.88
  Precision   0.46                    0.81
  Recall      0.46                    0.82
  F1-Score    0.46                    0.81
Fig. 9. Overall performance of step 3 of the pipeline to determine the failure types. Left: without recognition of unknown failures. Right: with the additional “unknown” class.
6 Conclusion

In this paper, a three-step machine learning pipeline was presented for detecting and explaining anomalies in the multivariate time series data of industrial process plants. Using the example of a real-world process from the oil production industry, the separator process, a realistic use case was described: detecting 16 simulated equipment failures in
this process as anomalies. The presented pipeline consists of three consecutive steps: 1) the detection of the anomaly itself, 2) pointing to the location where the anomaly comes from, and 3) determining the type of anomaly at this location.

In our evaluation, the third step of the pipeline, determining the failure type, showed relatively high uncertainty. This is because the chosen classification approach needs to know all types of possible failures beforehand, which is unrealistic. A better handling of unknown failures is therefore a scope of our further research. On the other hand, it was observed that the second step of the pipeline, locating the anomaly, which was realized with the help of the signal-wise reconstruction error, could already narrow down the possible root causes very well, with an accuracy of 97% for the given failure cases. For the first step of the pipeline, a Dense Autoencoder was chosen, which was able to detect the anomalies in the given failure cases with an accuracy of 82%. However, since we are dealing with time series data, our ongoing research includes a comparison of how LSTM-based neural network architectures, such as LSTM Autoencoders and LSTMs, would perform in detecting the same failure cases from this paper.

A challenge for machine learning in the industrial domain is the lack of failure data, because industrial systems are very robust, so real failure cases are usually rare. Here, this paper highlighted the use of process simulators to simulate realistic device failures that could not be obtained from the real plant. A limitation of simulation, however, is the substantial effort required to build a high-fidelity digital twin for a given real plant. Nevertheless, such simulation models often exist already (e.g., to provide operators with a simulator of their plant for operator training), and when in place, these simulators can also be very useful for machine learning, as presented in this paper.

Acknowledgements. The ABB research project team would like to thank Jean-Christophe Blanchon from Corys France, as well as Elise Thorud and Rikard Hansson from ABB Energy Industries Norway, for providing us with a preconfigured setup of Corys’ process simulator Indiss Plus [21] and ABB’s 800xA Simulator [22], which we could use to generate high-fidelity simulation data for various industrial AI research studies such as the present study.
References 1. Klöpper, B., et al.: Defining software architectures for big data enabled operator support systems. In: IEEE 14th International Conference on Industrial Informatics (INDIN) (2016) 2. Atzmüller, M., et al.: Big data analytics for proactive industrial decision support – approaches and first experiences in the FEE project. atp mag. 58(09), 62–74 (2016) 3. Sayda, A.F., Taylor, J.H.: Modeling and control of three-phase gravity separators in oil production facilities. In: 2007 American Control Conference, New York, USA (2007) 4. Dix, M., Klöpper, B., Blanchon, J.-C., Thorud, E.: A formula for accelerating autonomous anomaly detection. ABB Review 02/2021, pp. 14–17 (2021) 5. Abele, L., Anic, M., Gutmann, T., Folmer, J., Kleinsteuber, M., Vogel-Heuser, B.: Combining knowledge modeling and machine learning for alarm root cause analysis. IFAC Proc. Vol. 46(9), 1843–1848 (2013)
6. Xu, F., Uszkoreit, H., Du, Y., Fan, W., Zhao, D., Zhu, J.: Explainable AI: a brief survey on history, research areas, approaches and challenges. In: Tang, J., Kan, M.-Y., Zhao, D., Li, S., Zan, H. (eds.) NLPCC 2019. LNCS (LNAI), vol. 11839, pp. 563–574. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32236-6_51 7. Kotriwala, A., Kloepper, B., Dix, M., Gopalakrishnan, G., Ziobro, D., Potschka, A.: XAI for operations in the process industry – applications, theses, and research directions. In: Proceedings of the Spring Symposium on Combining Machine Learning and Knowledge Engineering in Practice, AAAI-MAKE (2021) 8. Cook, A.A., Mısırlı, G., Fan, Z.: Anomaly detection for IoT time-series data: a survey. IEEE Internet Things J. 7(7), 6481–6494 (2019) 9. Sakurada, M., Yairi, T.: Anomaly detection using autoencoders with nonlinear dimensionality reduction. In: Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for Sensory Data Analysis (2014) 10. Ashraf, J., Bakhshi, A.D., Moustafa, N., Khurshid, H., Javed, A., Beheshti, A.: Novel deep learning-enabled LSTM autoencoder architecture for discovering anomalous events from intelligent transportation systems. IEEE Trans. Intell. Transp. Syst. (2020) 11. Su, Y., Zhao, Y., Niu, C., Liu, R., Sun, W., Pei, D.: Robust anomaly detection for multivariate time series through stochastic recurrent neural network. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2828–2837 (2019) 12. Zhong, S., Fu, S., Lin, L., Fu, X., Cui, Z., Wang, R.: A novel unsupervised anomaly detection for gas turbine using isolation forest. In: 2019 IEEE International Conference on Prognostics and Health Management (ICPHM), pp. 1–6 (2019) 13. Homepage of public-funded research project FEE. https://www.fee-projekt.de/index_en.html. Accessed 10 Feb 2021 14. Atzmüller, M., Arnu, D., Schmidt, A.: Anomaly detection and structural analysis in industrial production environments. In: Haber, P., Lampoltshammer, T., Mayr, M. (eds.) Data Science – Analytics and Applications, pp. 91–95. Springer, Wiesbaden (2017). https://doi.org/10.1007/978-3-658-19287-7_13 15. Siddharthan, S.P., Dix, M., Sprick, B., Klöpper, B.: Summarizing industrial log data with latent Dirichlet allocation. Arch. Data Sci. Ser. A 6(1), 14 (2020) 16. Scikit-learn Novelty and Outlier Detection Homepage. https://scikit-learn.org/stable/modules/outlier_detection.html. Accessed 14 Feb 2021 17. Lundberg, S., Lee, S.I.: A unified approach to interpreting model predictions. arXiv preprint arXiv:1705.07874 (2017) 18. Ribeiro, M.T., Singh, S., Guestrin, C.: Why should I trust you?: explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144 (2016) 19. Lipton, Z.C.: The mythos of model interpretability: in machine learning, the concept of interpretability is both important and slippery. Queue 16(3), 31–57 (2018) 20. Zhao, B., Lu, H., Chen, S., Liu, J., Wu, D.: Convolutional neural networks for time series classification. J. Syst. Eng. Electron. 28(1), 162–169 (2017) 21. CORYS Indiss Plus Homepage. https://www.corys.com/en/indiss-plusr. Accessed 15 May 2021 22. System 800xA Simulator Homepage. https://new.abb.com/control-systems/service/customer-support/800xA-services/800xA-training/800xa-simulator. Accessed 10 Feb 2021
Detecting Phishing Websites Using Neural Network and Bayes Classifier

Ravinthiran Partheepan(B)

Faculty of Informatics, Master Programme – Applied Informatics, Vytautas Magnus University, Kaunas, Lithuania
Abstract. Phishing is a social engineering cyberattack that targets naïve online users, tricking them into providing sensitive credentials such as their password, username, social security number, or debit card number. Phishing can be performed by masking a webpage as a legitimate page to pull the personal credentials of the user. Although many methodologies have been introduced to detect phishing websites, such as whitelist or blacklist approaches, visual similarity-based approaches, and meta-heuristic approaches, online users are still being scammed into revealing sensitive credentials on phishing websites. In this research paper, a novel hybrid methodology, the PB-CUP learner, is proposed; it integrates dimensional and neural learning over features pulled from the source code, the uniform resource locator, and the representational state transfer API, to overcome the drawbacks of the existing phishing detection techniques. The paper gives an accuracy analysis of the Naïve Bayes classifier, Genetic Algorithm, Multi-Layer Perceptron, Multiple Linear Regression, and PB-CUP neural learner, of which the Multi-Layer Perceptron performed best with an accuracy of 99.17%. The experiments were iteratively analyzed with different orthogonal algorithms to find the best classifier accuracy for phishing website detection.

Keywords: Anti-phishing · Neural network · Bayes classifier · Machine learning algorithms · PB-CUP neural learner · Phishing attacks
1 Introduction

Online services distributed over the internet, such as communication services, online shopping, e-banking, and other financial services, have proliferated, because the available resources let users access these different services conveniently and extract information. An attacker may use this situation to set a trap for extracting user credentials such as identity, money wallets, or the sensitive credentials needed to access online service-related websites. Phishing attacks can be conducted in various forms, such as websites, email, and malware. To perform phishing malware injection via email, the attackers design fake emails that look the same as original, authorized emails, which makes users trust that the email arrives from a trustworthy company. They probably send millions of phishing
emails in the hope that at least thousands of legitimate users will fall for them. Phishing can also be conducted through websites: the attackers build a website that looks like a replica of the original site and draw the user to the website either through social networks such as Twitter and Facebook, or through advertisements on other websites. Alternatively, the attacks can also manipulate security indicators such as the HTTP connection and the green padlock (Fig. 1).
Fig. 1. Email phishing attack ratio
In malware phishing attacks, the attackers inject malware dependencies into software and push malicious software, such as a Trojan horse, into a compromised legitimate site without the victim's knowledge. According to the Anti-Phishing resource center (2019) report, nearly 30 million new malware samples were captured in the first quarter of 2018. Much of this malware is comprehensively cross-functional, i.e., it extracts credentials, makes the victim's computing system auto-install further malicious software without user acknowledgment, and much more, like a botnet or bot application (Table 1 and Fig. 2).

Table 1. Phishing attacks overview

  Attack type   Year   Ratio
  Phish web     2019   92%
  Email         2019   56%
  Software      2019   2%
Fig. 2. Malicious website phishing ratio. As the figure demonstrates, there has been a sharp increase in the use of malware dependencies for performing phishing attacks throughout recent years.
2 Related Works

2.1 Classification Models for Classifying Phishing Attacks

The classification model comprises various features extracted from the website source and the CERT of the site. In the feature extraction, 16 features were extracted from the UCI phishing resource repository, and the feature extraction has been defined comprehensively. The feature classification is performed with a Naïve Bayes classifier and a Support Vector Machine. The accuracy achieved by the Support Vector Machine classification model was 95.34%, and the Naïve Bayes classifier achieved 96.87% (Table 2).

Table 2. Existing classifier accuracy

  Dataset        Classifier    Accuracy
  UCI phishing   Naïve Bayes   95.34%
  UCI phishing   SVM           96.87%
2.2 Detecting Phishing Websites Using Machine Learning Algorithms

A commonly used approach to detecting phishing websites is to compare machine learning algorithms such as the Generalized Linear Model, Decision Tree, Generalized Additive Model, and Gradient Boosting. To select the best algorithm for detecting phishing websites, the accuracy, precision, and recall of each algorithm were evaluated and compared. The website attributes contribute 37 features, extracted with the help of the libraries of each model. The Generalized Linear Model, Gradient Boosting, and Decision Tree were the most commonly used algorithms; the Generalized Linear Model has an accuracy of 96.89%, Gradient Boosting 97.81%, and the Decision Tree 98.93% (Table 3).
Table 3. Comparison of existing prediction models

  Dataset        Algorithm                  Accuracy
  Phish tank     Decision tree              98.93%
  UCI phishing   Gradient boosting          97.81%
  UCI phishing   Generalized linear model   96.89%
2.3 Detecting Phishing Websites with Chrome Support Using Machine Learning Algorithms

A Google Chrome extension with a machine learning library helps to detect phishing website content. The dataset used in this research was contributed by the UCI Machine Learning Repository, and 18 features were extracted for this dataset. The F1-score, recall, accuracy, and precision of each algorithm were compared to obtain the best result. The Chrome extension uses HTML, JavaScript, and CSS along with machine learning libraries. A major drawback of this extension is keeping pace with the malicious sites that are proliferating every day (Table 4).

Table 4. Comparison of phishing detector with Chrome extension using machine learning algorithms

  Dataset        Chrome extension   Machine learning algorithm   Accuracy
  UCI phishing   Phish detector     K-NN                         95.71%
  UCI phishing   Phish detector     Random tree                  97.51%
2.4 Detecting Phishing URLs Using the Random Forest Algorithm

Phishing attacks use URL identification for injecting phishing dependencies. The Random Forest algorithm has been used in this proposed model with three different stages: heuristic classification, performance evaluation, and dependency parsing. The dependency parsing is used to evaluate the feature set. The dataset was pulled from PhishTank with 16 features, of which only 10 were taken into account for dependency parsing. The Random Forest algorithm reaches an accuracy of 94.63% in predicting the phishing features.
3 Information About the Dataset

The phishing list was pulled from the Anti-Phishing resource center. The dataset used in this work contains a list of phishing URLs and phishing logs. The benign list has 30,000 attributes, as does the legitimate website list. These 30,000
attributes contain high-popularity websites and their relative branches. For all URL data, the landing URL has been considered as the actual target for crawling. In summary, this dataset comprises a magnified set of features such as WHOIS logs, images, HTML, screenshots, URLs, certificates, domain, country, IP, subnet mask, cipher, host identity, and branches of network dependencies.
4 Proposed Approach

A cluster of URLs relevant to the webpage is taken as input for the feature extraction, which extracts the most accurate features for phishing detection. The extracted features are used to train a Neural Network, a Bayes classifier, and a Random Forest, so that the models can classify website clusters as legitimate or phishing. The feature extraction crawls the source code of the website and detects the phishing logs. The feature extraction and classification are as follows:

• URL and hyperlink based features
• Classification with Neural Network and Naïve Bayes classifier

4.1 URL and Hyperlink Based Features

In this feature extraction process, the features are extracted from the hyperlinks embedded in the website source code. Hyperlinks usually connect one web resource to another and may take the form of an HTML document, an image, or an element in a document. Phishing pages contain hyperlinks that direct to a foreign domain, which exploits the user's trust in the behavior of trusted websites and their domains. For legitimate websites, the ratio of local links is usually greater than the ratio of foreign links. The features extracted from the hyperlinks are URLs, image source URLs, and cascading style sheet URLs, used to differentiate the phishing website from the legitimate website. The novel feature classifications are HF1, HF2, and HF5.

A URL is used to specify the address of a resource such as a document, image, HTML page, cascading style sheet, or video. The HTTP protocol is used to access the web resource given the name of the host machine (the hostname) and the address of the resource (the pathname). The overall framework is shown in Fig. 3.

HF1 – Frequency of domain in anchor tags or links. This feature examines the anchor links contained in the website source and compares the local domain of the website with its most frequent domain. If both the local and the most frequent domain are the same, the feature is set to 0; otherwise, if the two domains are not similar, the feature is set to 1, and the site can be classified as a phishing website:

Hf1 = { 0, if both the local and most frequent domains are equal;
        1, if the domains are not equal (phishing) }                  (1)

HF2 and HF5 – Frequency of domain in the cascading style sheet links, script links, null links, and image links.
Fig. 3. Architecture of the phishing detection framework (URL input → feature extraction → forward/backward pass neural network and Bayes classifier → classification as legitimate or phishing)
These features are closely related to HF1 but differ in that the most frequent domain is evaluated against the request URLs of the image and script files and of the website style sheets. If the website shares similar CSS links, image links, and scripts, it is said to be a legitimate website; otherwise, if the local domain and the most frequent domain of the website do not match the features, it is said to be a phishing website:

Hf2,5 = { 0, if both the local and most frequent domains are equal;
          1, if the domains are not equal (phishing) }                (2)
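As a hedged illustration of how HF1 and HF2/HF5 could be computed, the sketch below compares the page's local domain with the most frequent domain found in anchor, image, script, and style sheet links, returning 0 or 1 as in Eqs. (1) and (2). It uses BeautifulSoup; the helper names are ours, not from the paper, and relative links (which carry no domain) are treated as local.

from collections import Counter
from urllib.parse import urlparse
from bs4 import BeautifulSoup

def most_frequent_domain(urls):
    domains = [urlparse(u).netloc for u in urls]
    domains = [d for d in domains if d]   # drop relative links
    return Counter(domains).most_common(1)[0][0] if domains else None

def hf1(html, local_domain):
    # HF1: most frequent domain among anchor links vs. local domain, Eq. (1).
    soup = BeautifulSoup(html, "html.parser")
    anchors = [a.get("href", "") for a in soup.find_all("a")]
    freq = most_frequent_domain(anchors)
    return 0 if freq in (None, local_domain) else 1

def hf2_5(html, local_domain):
    # HF2/HF5: most frequent domain among CSS, script and image links, Eq. (2).
    soup = BeautifulSoup(html, "html.parser")
    urls = [t.get("src", "") for t in soup.find_all(["img", "script"])]
    urls += [l.get("href", "") for l in soup.find_all("link")]
    freq = most_frequent_domain(urls)
    return 0 if freq in (None, local_domain) else 1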
5 Detecting Phishing Using a Neural Network Algorithm

Neural networks are parametric classifiers and provide an alternative to statistical classifiers. A neural network can be trained with a set of training data and thus learn a decision logic. For detecting phishing, a five-layer neural network was implemented with feed-forward and backpropagation algorithms. The five layers are the input layer, hidden layer, activation layer, output layer, and target. The number of nodes in the output layer is equal to the number of classes, and the number of nodes in the input layer is equal to the number of features. During neural network learning, the weights are initialized to random values scaled to small magnitudes. The weights are propagated to the hidden layer along with the input layer. In the hidden layer, the rectilinear transfer function was used to scale the weights to any extent. As phishing detection is a nonlinearly separable case, the rectilinear transfer function is more accurate here than the tanh, linear, sigmoid, and softmax functions. The transfer functions can be classified as follows; a sketch of such a network is given after the list:
• Non-linear transfer function
• Linear transfer function
• Tanh transfer function
• Sigmoid transfer function
• Softmax transfer function
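The following is a minimal sketch of a feed-forward network of this kind in Keras, with rectified linear hidden layers and a sigmoid output for the binary phishing decision. The layer widths, optimizer, and feature count are illustrative assumptions, not the paper's configuration.

import tensorflow as tf
from tensorflow.keras import layers, models

n_features = 16   # e.g., one input per extracted URL/hyperlink feature

model = models.Sequential([
    layers.Input(shape=(n_features,)),
    layers.Dense(32, activation="relu"),    # hidden layer (rectilinear)
    layers.Dense(16, activation="relu"),    # activation layer
    layers.Dense(1, activation="sigmoid"),  # output: 1 = phishing, 0 = legitimate
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=20, validation_split=0.2)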
With the non-linear transfer function, the node outputs to the hidden layer are interpreted in terms of class membership (class 0 or 1); the outputs are continuous and depend on how close the network input is to a chosen value of the median (Fig. 4):

f(a) = (1 / (√(2π) σ)) · e^(−(a−μ)² / 2σ²)    (3)
Fig. 4. Non-linear transfer function structure (1 – phishing, 0 – legitimate)
The sigmoid transfer function consists of two functions, logistic and tangential. The tanh values can be scaled from −1 to 1 (Fig. 5):

f(a) = 1 / (1 + e^(−θa))    (4)
Fig. 5. Sigmoid transfer function structure
The node output values to the hidden layer are evaluated with the training data. The comparison between the non-linear and sigmoid transfer functions is as follows (Table 5).
Table 5. Transfer function evaluation in hidden layer

  Transfer function                   Legitimate data   Phishing data
  Non rect-linear transfer function   0.731             0.999
  Linear transfer function            0.987             0.637
The input vector is propagated in a forward pass to compute the weighted sum Sa and the activation Oa = f(Sa) for each node, where f is the activation function. The backpropagation algorithm starts at the actual output node and makes a backward pass through the output layer, activation layer, and hidden layer. Usually, the target value is set to 1. For the output node, the error term is

δa = (Target − Actual output) · f′(Sa)    (5)

and for a hidden node it is obtained from the back-propagated, weighted error terms:

δa = (ΣN WN,a δN) · f′(Sa)    (6)
Once the error terms have been computed in the backward pass from the output layer to the hidden layer, each weight is updated using the learning rate, where speed is the learning rate, x is the legitimate node and y is the phishing node (Table 6):

Wx,y = Wx,y + speed · δa · Oa    (7)
Table 6. Error weight evaluation after each epoch

  Epochs    Forward pass (actual output)   Backward pass   Speed
  Epoch 1   0.881                          0.735           0.2
  Epoch 2   0.789                          0.531           0.2
  Epoch 3   0.767                          0.437           0.2
  Epoch 4   0.683                          0.427           0.2
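A NumPy sketch of one training update following Eqs. (5)–(7) is given below: the output delta comes from the target error, hidden deltas are back-propagated through the weights, and each weight is then updated with the learning rate ("speed"). A sigmoid activation is used so that f′(S) = O(1 − O); the shapes and the choice of activation are illustrative assumptions, not the paper's exact setup.

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def backprop_step(x, target, W1, W2, speed=0.2):
    # Forward pass: weighted sums S and activations O = f(S).
    s_hidden = W1 @ x
    o_hidden = sigmoid(s_hidden)
    s_out = W2 @ o_hidden
    o_out = sigmoid(s_out)

    # Eq. (5): output delta uses the derivative f'(S) = O(1 - O).
    delta_out = (target - o_out) * o_out * (1 - o_out)
    # Eq. (6): hidden deltas from the weighted output deltas.
    delta_hidden = (W2.T @ delta_out) * o_hidden * (1 - o_hidden)

    # Eq. (7): W = W + speed * delta * O, applied per connection.
    W2 += speed * np.outer(delta_out, o_hidden)
    W1 += speed * np.outer(delta_hidden, x)
    return o_out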
6 Detecting Phishing Websites with the Bayes Classifier

The classifier evaluates the probability for each target and assigns each sample, based on the previously classified values, to the class with the maximum probability value. The probabilities for the class legitimate (Li) and the class phishing (Pi) can be calculated from the training set data:

P(C(Li)/N) = P(N/Li) · P(Li)    (8)
where P(Li/N) = ∏(i=1..a) P(Ni/Li) is the conditional probability for the class legitimate.
Similarly, for the phishing class:

P(C(Pi)/N) = P(N/Pi) · P(Pi)    (9)

where P(C(Pi)/N) = ∏(i=1..a) P(Ni/Pi) is the conditional probability for the class phishing (Table 7).

Table 7. Classification rule update at each iteration

  Iterations    Class – legitimate (classified)   Class – phishing (classified)   MSE     RMSE
  Iteration 1   0.67                              0.33                            0.034   0.164
  Iteration 2   0.51                              0.47                            0.031   0.153
  Iteration 3   0.47                              0.57                            0.027   0.137
  Iteration 4   0.50                              0.59                            0.015   0.110
From the table, the error values decrease drastically after each iteration, which allows the iteration to stop after four epochs.
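As a hedged sketch of how this classification step could be realized in practice, scikit-learn's Bernoulli Naive Bayes applies the rule of Eqs. (8)–(9) to binary feature vectors such as HF1 and HF2/HF5. The data arrays below are random placeholders, not the paper's dataset.

import numpy as np
from sklearn.naive_bayes import BernoulliNB

X = np.random.randint(0, 2, size=(200, 16))  # binary hyperlink features
y = np.random.randint(0, 2, size=200)        # 0 = legitimate, 1 = phishing

clf = BernoulliNB()   # P(class|N) proportional to P(N|class) * P(class)
clf.fit(X, y)
posteriors = clf.predict_proba(X[:5])  # class probabilities per sample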
7 Evaluation Metrics

To attain the average performance of the classification process, cross-validation is applied. The dataset with its set of features is randomly split into N subsets such that each subset has the same proportion of each class label. The performance of this framework was gauged using four different evaluation metrics: F-1 score, recall, accuracy, and precision.

Precision reflects the true positive rate, i.e., the portion of the samples predicted as positive by the classifier that are actually positive. A low precision rate indicates that the evaluation has produced a large number of false positives:

Precision = TP / (TP + FP)    (10)
The true positive rate, or recall, refers to the correctly predicted positive instances relative to the total number of positive values in the dataset. It can also be described as the number of true positives divided by the sum of true positives and false negatives. A low recall indicates a large number of errors (false negatives) made by the classifier:

Recall = TP / (TP + FN)    (11)
Accuracy gauges the performance of the classifier by counting the total number of correctly classified instances out of all instances of all classes:

Accuracy = (TP + TN) / (TP + FP + TN + FN)    (12)
The F-1 score is the harmonic mean of precision and recall, which captures the classifier's performance on both the positive and negative classes in a single number; it is therefore used as an additional evaluation criterion (Fig. 6 and Table 8):

F-1 Score = 2 · (Precision · Recall) / (Precision + Recall)    (13)
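For reference, the four metrics of Eqs. (10)–(13) can be computed directly with scikit-learn; the label arrays below are placeholders.

from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

y_true = [0, 0, 1, 1, 1, 0]   # placeholder ground truth
y_pred = [0, 1, 1, 1, 0, 0]   # placeholder classifier output

print("Precision:", precision_score(y_true, y_pred))  # Eq. (10)
print("Recall:   ", recall_score(y_true, y_pred))     # Eq. (11)
print("Accuracy: ", accuracy_score(y_true, y_pred))   # Eq. (12)
print("F1-score: ", f1_score(y_true, y_pred))         # Eq. (13)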
Fig. 6. Visualization of evaluation metrics for neural network
Table 8. Comparison of evaluation metrics for both classes (neural network – feed-forward and backpropagation)

  Metric      Class legitimate – 0   Class phishing – 1
  Precision   0.797                  0.930
  Recall      0.943                  0.760
  F-1 score   0.864                  0.837
  Accuracy    98.93%                 97.91%
  TP rate     0.945                  0.712
The evaluation results shown in Table 8 indicate impeccable accuracy in the prediction of both the legitimate and phishing classes (Fig. 7 and Table 9). The evaluation analysis for the Bayes classifier, shown in Table 9, indicates that the classifier has a high true positive rate in classifying the legitimate and phishing classes.
Fig. 7. Visualization of evaluation metrics for the Bayes classifier
Table 9. Comparison of evaluation metrics for both classes (Bayes classifier)

  Metric      Class legitimate – 0   Class phishing – 1
  Precision   0.971                  0.930
  Recall      0.931                  0.673
  F-1 score   0.7514                 0.431
  Accuracy    97.03%                 96.81%
  TP rate     0.527                  0.412
8 Conclusion

In this proposed model, a novel method has been presented for detecting phishing websites using a Neural Network and a Bayes classifier. In the prediction model, the neural network achieved a high true positive rate of 0.945 for the legitimate class and 0.712 for the phishing class. Overall, the neural network prediction has a precision of 0.797 for the legitimate class and 0.930 for the phishing class; the true positive rate of the prediction model is high, and the overall accuracy of the learning is 98.93%. In the classification model, the precision for the legitimate class was evaluated at 0.971 and at 0.930 for the phishing class. However, the true positive rate for the legitimate class is 0.527 and for the phishing class 0.412, so the classifier achieves an accuracy of 97.03% in classifying the classes.
Blockchain Technology and Applications
A Blockchain Framework for On-Demand Intermodal Interlining: Blocklining

Mary Rose Everan(B), Michael McCann, and Gary Cullen

Letterkenny Institute of Technology, Port Road, Letterkenny 92 FC93, County Donegal, Ireland
[email protected]
Abstract. International travel journeys, by their nature, incorporate elements provided by multiple service providers such as airlines, rail carriers, airports, and ground handlers. Data needs to be stored by and exchanged between these parties in the process of managing the journey. The fragmented nature of this shared management of mutual clients is a limiting factor in the development of a seamless, hassle-free, end-to-end, travel experience. Traditional interlining agreements attempt to facilitate many separate aspects of co-operation between service providers, typically between airlines and to some extent, intermodal travel operators, including schedules, fares, ticketing, through check-in and baggage handling. These arrangements rely on pre-agreement. The development of Virtual Interlining - that is, interlining facilitated by a third party (often but not always an airport) without formal pre-agreement by the airlines or rail carriers - demonstrates an underlying demand for a better quality end-to-end travel experience. Blockchain solutions are being explored in a number of industries and offer, at first sight, an immutable, single source of truth for this data, avoiding data conflicts and misinterpretation. Combined with Smart Contracts, they seemingly offer a more robust and dynamic platform for multi-stakeholder ventures, and even perhaps the ability to join and leave consortia dynamically. Applying blockchain to the intermodal interlining space – termed Blocklining in this paper - is complex and multi-faceted because of the many aspects of cooperation outlined above. An experimental approach to explore its potential is the basis for the author’s M.Sc. research, concentrating on one particular dimension, that of through baggage interlining to which this paper alludes. Keywords: Airport · Aviation · Baggage · Blockchain · Blocklining · Consortium · Data · Interlining · Intermodal · Privacy · Rail · Trust
1 Introduction

Intermodal transport is a topic that is rapidly gathering momentum for global transport organizations, industry bodies and research institutes, particularly in light of climate change considerations and the quest for smarter, greener mobility options. The current Covid-19 crisis and its impact on the aviation industry are also adding impetus to this subject. New innovation supported by effective regulatory frameworks and cross-border/industry collaboration will be required to progress this agenda. The European
Commission’s Flightpath 2050 [1] is an initiative which aims to enable 90% of door-to-door travel to any place in Europe within 4 h by 2050. SESAR’s Modus project [2] aims to research the role of air transport within an integrated intermodal transport system. Enabling rail organizations to interline with airlines will require the development of innovative platforms to secure data sharing between stakeholders that can overcome the challenges of interoperability between systems to facilitate booking, ticketing, border control, baggage transfer, settlement, and proration of tariffs. Traditional interline agreements and other commercial agreements have existed within the aviation sector for decades to facilitate these joint operations. However, new virtual interline models are now emerging that are threatening to supersede these legacy agreements. These virtual interlining models may well suit the new collaborative intermodal ventures between rail and air and the increasing interest in adoption of rail as airport feeder systems. However, many of these models still have not achieved ‘through baggage interline’. Only traditional aviation interline agreements with additional baggage agreements facilitate baggage being transferred from one carrier to another seamlessly without the passenger having to reclaim their bag and check it in again. The addition of a rail leg on a journey poses additional challenges to overcome, as rail operations differ from air operations in areas such as booking, ticketing, checking-in, carriage of baggage and security requirements. ‘Through baggage interline’ operations require inter-carrier collaboration on a number of levels to manage the complex logistics of this operation. This includes the integration of information flow and data sharing, along with contractual obligations. This paper offers one approach to resolve these challenges, using blockchain as the underlying framework in which to blend these complex legal and technical conundrums.
2 Related Work

Blockchain technology is currently being explored for many use cases within the aviation industry and is seen to be facilitating a more cooperative industry approach, enabling progressive disintermediation [19]. Collaborative alliances in blockchain development that are emerging within the aviation industry include Winding Tree (https://windingtree.com/), which seeks to develop an open source inventory distribution platform in partnership with airline, hotel, and IT organizations, and Sovrin (https://sovrin.org), an industry-wide initiative based on blockchain for self-sovereign identity across borders. However, for interlining, a more greenfield scenario exists.

Interlining, within the context of aviation, is the term used when two or more airlines cooperate to transport passengers on an itinerary that requires multiple flight legs. This type of itinerary is managed using a single Passenger Name Record (PNR) across all flight legs. A PNR is used to identify a passenger, or group of passengers, and their itinerary in Computer Reservations Systems (CRS). There are many different types of commercial agreements between airlines that facilitate interlining, including those that operate by the rules of the International Air Transport Association (IATA) Multilateral Interline Traffic Agreement (MITA) [3]. There are other individual interline and codeshare agreements between airlines, in addition to joint ventures and alliances, such as the Star Alliance or OneWorld Alliance [4]. Airline alliances and partnerships have existed in the air transport industry for decades. These alliances attempt to offer passengers a seamless service
by coordinating flight schedules, ensuring gate proximity, and blending frequent flyer programs. Alternatively, passengers can book multi-leg flights with different airlines that do not interline, a concept known as ‘self-connection’. Virtual Interlining is a new model in which a third party travel organization books this multi-leg flight for the passenger, where no interlining exists between the airlines. However, the virtual interlining company provides its own SuperPNR for the complete itinerary, incorporating the separate PNRs created on each airline CRS (Fig. 1).
Fig. 1. Traditional and virtual interlining
The third party travel organizations that are moving into this market space also offer protection for the passengers in case of flight disruption or cancellation of one of the legs. IATA, in cooperation with member airlines, is now proposing a new interline framework, the Standard Retailer and Supplier Interline Agreement (SRSIA) [5], to support all these different types of interline models. A more dynamic real-time approach to offer and order management for interline itineraries instead of the fixed, more restrictive, interline model is emerging in the industry.

2.1 Air and Rail Interlining

Airlines can also interline with rail carriers. Grimme [6] points out that cooperation between airlines and train operators has been rather difficult in the past due to different regulatory, operational and managerial constraints. However, industry bodies such as IATA and International Union of Railways (UIC) [7] are now collaborating to accelerate the integration of air and rail transportation. AccesRail is an example of an IATA Travel Partner enabling booking for air and rail on the same itinerary using its own IATA carrier code (9B) [8]. Other co-operative engagements between air and rail carriers that operate within the framework of an overall airport-feeder system are outlined by Rüger and Albl [9, p. 7]. These engagements cover luggage transport, check-in, ticketing, information, and security services. Some examples include Rail & Fly [10], AIRail [11], and Train + Air [12].
2.2 Virtual Interlining Models

Within the aviation industry, third party operators and Online Travel Agencies (OTAs) such as Kiwi.com [13] are starting to fill the gap in the market with virtual interlining business offerings. These guarantee to rebook passengers on the next available flight if they miss their connection or provide hotel accommodation or meal vouchers if the next available flight is not until the next day. Kiwi.com promote a virtual interlining model, where passengers will not have to exit to a check-in area or transfer terminal to re-check baggage for their next flight. Inter Airline Through Check-In (IATCI) [14] is an aviation industry initiative that provides rules, recommendations, and messaging standards to facilitate passenger handling procedures for interlining flights. It facilitates a single check-in transaction for multiple airline flights.

2.3 Airport Facilitated Interlining

Never and Suau-Sanchez [15] introduced the term ‘Airport Facilitated Inter-airline Network Connectivity Schemes’ (AFINCSs).
Fig. 2. The evolutionary path for self-connectivity [15]
Gatwick is one example of an airport which positioned itself as a connecting airport [16] using this model. The airport introduced a facilitated connections service, GatwickConnects [16], for passengers self-connecting between airlines that are not interlined. Airlines can opt to participate in the passenger or booking service provided by GatwickConnects. Passengers availing of the fee-paying service can connect quickly to their next flight by dropping their bags at the GatwickConnects desk, where the transfer is managed for them. They can then proceed to a Premium Security channel for boarding their next flight. Guy Stephenson, Chief Commercial Officer at Gatwick [16], admits that they are still a little way from trying to mimic ‘through-baggage interline’ and that ultimately, technology will overcome these obstacles. A dynamic or on-demand interlining solution could allow passengers the flexibility to have their baggage checked through from their airport feeder transport (rail or air) onto their ongoing connection via the airport hub on check-in. AFINCSs could become increasingly appealing for airlines such as Low Cost Carriers (LCCs), which are opening up to the idea of inter-airline network connectivity. The same applies to rail carriers with the advent of rail as an alternative airport feeder system. Never and Suau-Sanchez [15] argue that airport-led transfer schemes could “challenge the legacy partnership mechanisms of interlining and code sharing”. They
offer a solution to carriers that do not interline today because of technological or business model limitations.

2.4 Through Baggage Interlining

One important aspect of interlining, either between air carriers only or between air and rail carriers, is through baggage interlining, which is the focus of this paper. Traditional interline agreements usually include interline baggage agreements, where baggage is ideally transferred from one carrier to another seamlessly between flight legs. Therefore, passengers do not have to reclaim their bag and check it in again before each flight. This is not the case for virtual interlining. Also, transferring a bag across traditional interlining flights can sometimes be complex, especially if separate tickets are issued for each flight. Mishandling of bags during transfer between airlines accounts for the largest occurrence of delayed and lost baggage in the industry. This is highlighted in the SITA Baggage Data Insights survey [18]. From a baggage point of view, Verma [22] observed that blockchain technology could be used to store data of checked-in baggage to optimize the process of baggage management. Akmeemana [23], at the Blockchain Research Institute, also foresaw blockchain being used for baggage tracking, where baggage-handling software could write events to the blockchain and use smart contracts to automatically trigger compensation pay-outs. Gao, Sun, and Zheng [24] propose a consortium blockchain architecture model applied to a baggage tracking use case, as shown in Fig. 3, where the position of the bag in physical space is linked to the transmission of data in the data space.
Fig. 3. Data layer block structure diagram [24]
Linking the various threads of contractual obligation, security and baggage tracking using blockchain is the goal of the Blocklining [25] solution proposed by the author, which is the basis of a research M.Sc. thesis. This paper summarizes the Blocklining concept and extends it in an intermodal context.
3 Blocklining

Blocklining is a concept which provides a framework for on-demand interlining based on blockchain. This framework would facilitate the creation of an interline agreement or bond that is valid only for the duration of the passenger journey. Participants who opt into a consortium blockchain, such as air or rail carriers, airport/transport hubs, ground handlers and security agencies, could endorse the validity of the contract at various points on the passenger journey based on mutual consensus. Endorsement of the contract must also be based on passenger, baggage, and flight/rail data from data sources (oracles) agreed by the consortium. Smart Contracts on a blockchain can be used to define logic that validates the feasibility of the Blockline contract, based on the transfer of baggage between stakeholders according to real-time flight schedules and bag tracking events, and also to verify and calculate baggage allowances and fees. Other validations could also be performed to trigger automatic settlements in case of lost or damaged baggage at a certain point on the journey. Consortium blockchains are designed for use between a restricted group of stakeholders to leverage information and to improve workflows, accountability, and transparency [26]. This model would have the flexibility to enable stakeholders to change interline partners dynamically or on demand, to facilitate seasonal requirements, disruptions, global crises, or schedule changes.

The proposed Blocklining framework aims to complement and enrich aspects of the self-connect and airport-airline network connectivity models as outlined by Never and Suau-Sanchez [15] (see Fig. 2). Blocklining could also be offered as an upgrade to a passenger at booking or check-in. Blocklining would require the creation of a consortium based blockchain network to set up a trust framework, with rules which must be agreed between all stakeholders. Never and Suau-Sanchez’s model can also be extended into the intermodal realm. Rüger and Albl’s [9] findings show that baggage management strongly influences the choice of transport mode for passengers who are travelling to an airport. Check-in and luggage check-in on a train, at the departure station, are of major importance to passengers (Fig. 4).
Fig. 4. Importance of luggage check-in on the train and check-in on the train [9]
The International Air Rail Organization (IARO) Baggage Report 2001 [28] highlights the complexity of airport and rail station interchanges and the collaboration and
allocation of resources required to create an integrated baggage handling system with appropriate safety and security measures in place. However, successful progress is being made in this area by some operators, including the City Airport Train (CAT) in Vienna, which offers a city check-in. Passengers can check in baggage up to 75 min before departure and get their boarding pass, and CAT will bring their baggage right to their flight. Yeung and Marinov [29], in their paper on baggage transfer services in UK rail stations, refer to the Hong Kong Airport Express Line, part of the Mass Transit Railway (MTR) metro system, which connects Hong Kong International Airport with the city center. A free check-in is provided by the service in Hong Kong and Kowloon stations for major airlines. This allows passengers to travel baggage-free to the airport; passengers will only see their bags again upon arrival at their destination airport.
4 Data Exchange

Within the aviation industry alone, airlines, airports, ground handlers and security authorities each have their own systems for handling the transfer of information between each other, to ensure baggage is transferred securely between origin and destination airports and tracked at various points. Across the various airlines and airports in the global industry, these systems use a variety of different storage and messaging formats that will require interrogation to endorse blockchain transactions. Access to a reliable source of baggage data, in addition to flight scheduling information, is critical for a Blockline solution for through-baggage interlining. Figure 5 provides a simple view of the baggage messaging flow for a two-segment interline flight with through check-in of a bag [30].
Fig. 5. Baggage messaging for interlining [30]
A Departure Control System (DCS) automates the processing of check-in, baggage handling, and passenger boarding. The Baggage Transfer Message (BTM) contains bag information related to transfer passengers. Adding other modes of transport to an interlining model requires the integration of further sets of systems and messaging formats that need to interoperate. Even with airline interlining, there is no single source of passenger
information across a multileg itinerary, as this data can be stored across multiple reservation systems, DCSs, tickets, and Electronic Miscellaneous Documents (EMDs), documents that facilitate the fulfillment of optional airline services. Therefore, interlining stakeholders may not have access to a complete passenger itinerary. This can cause difficulties for participating carriers, virtual interlining operators, airports, baggage handlers and, ultimately, passengers, who may have to collect baggage between transfers. Passengers may also not know who to claim against if baggage goes missing on transfer. Blockchain can normalize the transfer of data across these systems and, with the help of a common ontology, the information could be streamlined across disparate transport systems. Korpela, Hallikas and Dahlberg [31], in their project on the transformation of an aviation-based digital supply chain toward blockchain integration, found that blockchain technology simplifies Business-to-Business (B2B) integration. IATA Resolution 753 came into effect in June 2018 [32]. Compliance with this resolution requires airlines to track baggage at four key points in the baggage journey, including the transfer of baggage between airlines when interlining. The chain of custody of the baggage between these tracking points must be proved between stakeholders, and tracking data must be shared with connecting airlines. For interline flights, the change of custody from the delivering carrier to the receiving carrier must be agreed between the connecting carriers [33]. It may be useful for IATA, in association with a rail body such as the UIC, to review this resolution and extend it to the tracking of baggage across a rail leg for intermodal operations. With intermodal travel, alignment between regulations across the different industries would be required and built into a Blockline solution. To achieve a trust framework that protects stakeholders within a Blockline consortium, undeniable legal evidence is required to verify the chain of transactions that comprise the life cycle of the Blockline contract, which is underpinned by the transfer of custody of a bag from one carrier to another at the various tracking points across a bag journey. This evidence is critical to levying a legally enforceable liability that is equitably pro-rated between stakeholders. Blockchain seems an ideal platform on which to build a basic framework for utilizing baggage data between interlining stakeholders to verify a Blockline contract, as it claims to:

• Provide immutable evidence of data exchanges by consensus.
• Prevent transactions from being amended or hacked.
• Provide a single source of truth of transactions and chains of custody.

These blockchain characteristics can be applied to a Blockline framework so that the chain of custody of a bag across an interline itinerary is recorded (required for compliance with IATA Resolution 753) and 'immutable evidence' of transactions can facilitate the fair prorating/proportional distribution of liability (required for compliance with IATA Resolution 754). Smart contracts are deployed within blockchains to verify conditions and validate the feasibility of transactions.
In the case of baggage transfer transactions across air and rail carriers and across borders, a smart contract could verify whether interline flight/rail schedules are on time rather than delayed or cancelled, and whether the Minimum Connecting Time (MCT) between interlining legs allows enough time for the passenger to get from one aircraft or train to another aircraft across the airport, which would be critical for more
dynamic, on-demand interlining. Further, checking whether a passenger has boarded an aircraft is important, as, according to ICAO Annex 17 [39], carriers cannot transport the baggage of persons who are not on board an aircraft unless the baggage is identified as unaccompanied.
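The kind of validation logic such a smart contract might encode can be sketched as follows. This is purely illustrative: the function and field names are our assumptions, and in a real deployment this logic would run as chaincode on the consortium platform rather than as standalone Python.

```python
from datetime import datetime, timedelta

def blockline_feasible(arrival, next_departure, mct_minutes,
                       inbound_status, outbound_status, passenger_boarded):
    """Return True if the interline transfer can be endorsed."""
    # Both legs must still be operating on time.
    if inbound_status != "ON_TIME" or outbound_status != "ON_TIME":
        return False
    # ICAO Annex 17: no baggage of persons not on board, unless flagged unaccompanied.
    if not passenger_boarded:
        return False
    # Is there enough connecting time between the two legs?
    return next_departure - arrival >= timedelta(minutes=mct_minutes)

# Example: a 55-minute connection checked against a 45-minute MCT.
ok = blockline_feasible(
    arrival=datetime(2021, 6, 1, 10, 5),
    next_departure=datetime(2021, 6, 1, 11, 0),
    mct_minutes=45,
    inbound_status="ON_TIME",
    outbound_status="ON_TIME",
    passenger_boarded=True,
)
print(ok)  # True
```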
5 Methodology

To explore the basic functionality of a Blockline solution, an experimental approach, a PoC, is currently in progress that will provide a basis to explore the reliability, security, interoperability, and optimum performance of a feasible real-time Blockline solution. This PoC is based on a consortium blockchain that forms a trust framework between a restricted group of stakeholders across intermodal environments. The PoC will also help to identify an optimal consensus model to enable dynamic decision making between nodes on the blockchain. Identifying the legal, data governance and GDPR considerations of the data and transactions in this blockchain use case is also envisaged as an outcome of the PoC, as is identifying the logistical and operational challenges of the physical baggage handling requirements that are fundamental to a Blockline framework. Potential revenue models can also be explored. The resulting ontology across intermodal environments could be proposed to standardize information management across the various organizations in the consortium, ensuring a common understanding of the structure of information and its purpose. This bootstrapping approach will be supplemented by an industry survey, also currently in progress, to evaluate potential interest in the concept. Hyperledger Fabric, in particular, is a convenient and well-established platform on which to practically test a Blockline solution, with its modular architecture, plug-and-play consensus and membership services, and performance and privacy capabilities. A Blockline framework needs to cover a number of airport/rail hubs, as flights and rail connections span departure and arrival airports. Hyperledger Fabric also offers the ability to create private data collections on a channel, which can give two or more organizations access to a subset of the data within a specific transaction; for example, if a rail carrier requests a through-baggage interline transaction with an air carrier, they may wish to share private commercial or passenger data between them that is not accessible to the other participants on the channel. A Blockline ecosystem could be facilitated or governed by an industry body such as Airports Council International (ACI) or an organization such as SITA (Société Internationale de Télécommunications Aéronautiques). For Blocklining, identifying a bag and its planned journey will be of utmost importance in order to track the chain of custody of the bag. Reusing existing industry identification tags, such as the bag license plate, is one approach. However, employing RFID (Radio Frequency Identification) technology for baggage identification is an alternative that would enable baggage identification during bag transfer between carriers with less dependency on rail and airline departure control systems, thus enabling a more airport/transport-hub-facilitated interlining business model. Additionally, RFID might be a more viable and flexible solution in an intermodal context. RFID uses a small chip in a tag which can be written to by a passenger app. The bag identification details are then picked up by RFID readers. IATA has worked closely with airlines, airports and
suppliers to develop a standard known as RP1740C [41], which will serve as the template for the use of RFID in the interline baggage environment [42, p. 155].
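As a small illustration of the identification problem, the sketch below composes and parses the 10-digit baggage "license plate" number commonly printed on bag tags (typically a leading digit, a 3-digit airline numeric code, and a 6-digit serial). The helper names are our own, and a production system would source these values from the DCS or an RFID read rather than construct them ad hoc.

```python
def make_license_plate(leading: int, airline_code: int, serial: int) -> str:
    """Compose a 10-digit bag tag number, e.g. 0 + 125 (airline) + 123456."""
    assert 0 <= leading <= 9 and 0 <= airline_code <= 999 and 0 <= serial <= 999999
    return f"{leading}{airline_code:03d}{serial:06d}"

def parse_license_plate(lpn: str):
    """Split a 10-digit license plate back into its three fields."""
    if len(lpn) != 10 or not lpn.isdigit():
        raise ValueError("license plate must be exactly 10 digits")
    return int(lpn[0]), int(lpn[1:4]), int(lpn[4:])

lpn = make_license_plate(0, 125, 123456)
print(lpn)                       # 0125123456
print(parse_license_plate(lpn))  # (0, 125, 123456)
```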
6 Conclusion

The current crisis has highlighted the fact that business and regulatory frameworks may be too rigid and unsuitable for current trends in passenger transport. Changes in passenger behavior, with passengers increasingly self-connecting and building their own itineraries, are driving new opportunities for collaborative business and operational initiatives and, consequently, the evolution of a 'Cooperative Competition', or 'Coopetition', model between airlines and other enterprises. Despite some challenges, blockchain does provide a tangible vehicle for exploring cooperative ways of working and can act as an "enabling engine" for partnerships to explore new business opportunities, as predicted by Liu and Brody [43] in their Harvard Business Review paper. Air, rail, and other transport organizations need to develop more dynamic models to adapt to global uncertainty, industry disruption and crises, in parallel with the requirements for new inter-industry regulations, standards and frameworks. Implementing a successful Blocklining framework for on-demand intermodal interlining will require business co-operation between transport providers, in addition to technical interoperability between various systems, including air and rail DCSs, reservations, airport operations, border control and ground handling. These systems, across the various entities, employ a variety of technologies and messaging formats that will require interaction with a blockchain. By proving the feasibility of a through-baggage interlining use case using Blocklining, other more complex passenger operations and transactions can be explored. With intermodal initiatives and virtual interlining models both at early evolutionary stages, it seems an opportune time to examine the potential of a Blocklining initiative.
References

1. Europäische Kommission (ed.): Flightpath 2050: Europe's Vision for Aviation; Maintaining Global Leadership and Serving Society's Needs; Report of the High-Level Group on Aviation Research. Publications Office of the European Union, Luxembourg (2011)
2. SESAR Joint Undertaking | Modus - Modelling and assessing the role of air transport in an integrated, intermodal transport system. https://www.sesarju.eu/projects/modus. Accessed 30 Sept 2020
3. IATA - IATA Multilateral Interline Traffic Agreements (MITA). https://www.iata.org/whatwedo/workgroups/Pages/mita.aspx. Accessed 02 Sept 2019
4. Coles, H.: The future of interline - a new model for seamless customer journeys. IATA White Paper, p. 20, October 2019
5. IATA: IATA - SRSIA. https://www.iata.org/en/programs/workgroups/passenger-standards-conference/srsia/. Accessed 26 Feb 2020
6. Grimme, W.: Experiences with advanced air-rail passenger intermodality – the case of Germany, p. 18
7. Albert, F.: UIC-IATA Passenger workshops held from 3–4 March 2020 in Geneva. UIC Communications, 18 July 2020. https://uic.org/com/enews/682/article/uic-iata-passenger-workshops-geneve-3-4-mars-2020. Accessed 18 July 2020
8. AccesRail | Your Gateway to Air-Rail. https://accesrail.com/. Accessed 18 July 2020
9. Rüger, B., Albl, C.: Airport trains - baggage drop off during train ride to the airport, p. 7
10. Rail & Fly - Travel by train and plane. https://www.flytap.com/en-ru/other-bookings/railfly?v=t. Accessed 03 Oct 2020
11. AIRail Expansion between Salzburg Central Railway Station and Vienna Airport. https://www.austrianairlines.ag/Press/PressReleases/Press/2017/08/042.aspx?sc_lang=en&mode=%7B30999B4B-42D0-45A6-B671-FE5E3CB68ED8%7D. Accessed 16 July 2020
12. TGV air. https://www.corsair.fr/flight/services/corsairs-services/Services/tgv-air. Accessed 03 Oct 2020
13. ConnectionReview: Virtual interlining in aviation - What it means exactly. ConnectionReview (2019). http://www.connectionreview.com/blog/virtual-interlining-in-aviation-what-it-means-exactly--30. Accessed 01 Sept 2019
14. IATCI. http://iatci.com/. Accessed 29 Dec 2019
15. Never, J., Suau-Sanchez, P.: Challenging the interline and codeshare legacy: drivers and barriers for airline adoption of airport facilitated inter-airline network connectivity schemes. Res. Transp. Econ. 100736 (2019). https://doi.org/10.1016/j.retrec.2019.100736
16. Otley, T.: Connecting low-cost airlines at Gatwick. Business Traveller, 31 Dec 2016. https://www.businesstraveller.com/features/connecting-low-cost-airlines-gatwick/
17. Flight Detective: Why is interline baggage check in so painful at outstations? TravelUpdate, 29 August 2017. https://travelupdate.boardingarea.com/outstations-interline-baggage/. Accessed 10 Aug 2019
18. SITA: Baggage IT Insights 2019 | SITA (2019). https://www.sita.aero/resources/type/surveys-reports/baggage-it-insights-2019. Accessed 28 Apr 2019
19. Rencher, R.J.: Progressive disintermediation of the commercial aviation industry ecosystem. SAE International, Warrendale, PA, SAE Technical Paper 2019-01-1330, March 2019. https://doi.org/10.4271/2019-01-1330
20. Cassar, R.: Distributed ledger technology in the airline industry: potential applications and potential implications. J. Air Law Commer. 83(3), 455 (2018)
21. Jugl, J., Linden, E.: The impact of blockchain on the aviation system. Presented at the ATRS 23rd World Conference, Amsterdam, July 2019. https://www.alexandria.unisg.ch/259153/. Accessed 05 Feb 2020
22. Verma, M.: Application of hybrid blockchain and cryptocurrency in aviation (2018)
23. Akmeemana, C.: Blockchain takes off: how distributed ledger technology will transform airlines, p. 33 (2017)
24. Gao, Q., Sun, L., Zheng, J.: Research on air passenger baggage tracking based on consortium chain. DEStech Trans. Comput. Sci. Eng. (ica) (2019). https://doi.org/10.12783/dtcse/ica2019/30771
25. Everan, M.R.: Exploration of a blockchain assisted data sharing model for on-demand virtual interlining: blocklining. Masters thesis, Letterkenny Institute of Technology, Donegal, Ireland (unpublished)
26. Yafimava, D.: What are consortium blockchains, and what purpose do they serve? OpenLedger Insights, 15 January 2019. https://openledger.info/insights/consortium-blockchains/. Accessed 14 Sept 2020
27. Dib, O., Brousmiche, K.-L., Durand, A., Thea, E., Hamida, E.B.: Consortium blockchains: overview, applications and challenges, p. 15 (2018)
28. International Air Rail Organisation (IARO): IARO Report 3.01 (2001). https://www.iaro.com/sitefiles/3.01%20baggagereport.pdf. Accessed 06 Sept 2020
29. Yeung, H.K., Marinov, M.: A systems design study introducing a collection point for baggage transfer services at a railway station in the UK. Urban Rail Transit 5(2), 80–103 (2019). https://doi.org/10.1007/s40864-019-0101-4
30. Hafner, S.: IATA type B bag messages and baggage messaging refresher | The JavaDude Weblog, 20 May 2017. https://javadude.wordpress.com/2017/05/20/iata-type-b-bag-messages-and-baggage-messaging-refresher/. Accessed 15 Apr 2019
31. Korpela, K., Hallikas, J., Dahlberg, T.: Digital supply chain transformation toward blockchain integration, January 2017. https://doi.org/10.24251/HICSS.2017.506
32. IATA: IATA Resolution 753, 14 Feb 2017. https://www.iata.org/en/programs/ops-infra/baggage/baggage-tracking/. Accessed 15 Apr 2019
33. IATA: IATA Resolution 753 - Implementation (V3.0). IATA, 13 Nov 2017. https://aci.aero/Media/75983060-74e2-4030-a4e9-2d2d08eaf63b/CDBMeg/About%20ACI/Priorities/IT%20-%20New/Initiativies/End%20to%20end%20baggage%20tracking/IATA%20Resolution%20753%20-%20Implementation%20(V3.0).pdf. Accessed 22 Aug 2020
34. Werbach, K.: The Blockchain and the New Architecture of Trust. MIT Press, Cambridge (2018)
35. Nakamoto, S.: Bitcoin: a peer-to-peer electronic cash system (2008). http://bitcoin.org/bitcoin.pdf
36. Casino, F., Politou, E., Alepis, E., Patsakis, C.: Immutability and decentralized storage: an analysis of emerging threats. IEEE Access 8, 4737–4744 (2020). https://doi.org/10.1109/ACCESS.2019.2962017
37. Scott, A.: "The" blockchain vs. "a" blockchain: setting the record straight. Bitcoin News, 02 October 2016. https://news.bitcoin.com/blockchain-setting-record-straight/. Accessed 04 Oct 2020
38. Baliga, A.: Understanding blockchain consensus models (2017)
39. ICAO Annex 17 - Security: Safeguarding International Civil Aviation Against Acts of Unlawful Interference - ICAO Online Store. https://store.icao.int/icao-annex-17. Accessed 16 June 2019
40. Hyperledger Foundation: Channels - hyperledger-fabricdocs master documentation (2019). https://hyperledger-fabric.readthedocs.io/en/release-1.4/channels.html. Accessed 23 Feb 2020
41. IATA: Resolution: RFID Baggage Tracking Set for Global Deployment. https://www.iata.org/pressroom/pr/pages/2019-06-02-05.aspx. Accessed 10 Aug 2019
42. DeVries, P.D.: The state of RFID for effective baggage tracking in the airline industry. Int. J. Mob. Commun. 6(2), 151 (2008). https://doi.org/10.1504/IJMC.2008.016574
43. Liu, J., Brody, P.: Is collaboration the new innovation? Harv. Bus. Rev. Anal. Serv. (2016). https://hbr.org/resources/pdfs/comm/ey/IsCollaborationTheNewInnovation.pdf. Accessed 26 Sept 2020
Intersection of AI and Blockchain Technology: Concerns and Prospects

K. B. Vikhyath, R. K. Sanjana, and N. V. Vismitha

Department of Computer Science and Engineering, JSS Academy of Technical Education, Bengaluru, Karnataka 560060, India
Abstract. Artificial Intelligence (AI) and blockchain are two major technologies that drive breakthrough innovations in various industries. Each technology has its own benefits, in spite of the technical complexity of building advanced business applications. The combination of AI and blockchain has led to the restructuring of architectures to meet present industrial demands in globalized financial markets, the Internet of Things, intelligent business data models and smart medical applications. The alliance of AI and blockchain technologies gives rise to decentralized AI, which enables machines to understand and take decisions on reliable and secured data independently, avoiding the involvement of any intermediaries in the course of action. In this paper, we review the intersection junctures, benefits and support provided by AI and blockchain. We also shed light on tools and the latest technologies that have emerged as a result of the intersection between these two technologies.

Keywords: Artificial intelligence · Decentralised AI · Blockchain · Machine learning · Data security · Data integrity
1 Introduction

Artificial intelligence and blockchain are two of the most rapidly emerging technologies: AI makes a system powerful by providing cognitive functions such as learning, inferring and adapting through the analysis of data samples, while blockchain technology is a decentralised, distributed ledger which stores data in encrypted form. AI and blockchain have gained recognition in today's world because AI can automate challenging assignments and model more difficult circumstances than human beings can, whereas blockchain renders stronger data privacy and security. Blockchain and AI jointly may look implausible; however, they amalgamate well, each dealing with the flaws and stabilizing the worst tendencies of the other. AI uses the datasets hoarded in blockchains and benefits from distributed computing built on blockchain transformations, whereas blockchain uses AI for monetizing user-controlled data, generating AI-based marketplaces and developing autonomous organizations. Blockchain internally uses a distributed ledger that facilitates the account reconciliation process using encryption techniques and message transmission protocols, and maintains bulk data utilizing a distributed architecture. It always ensures data security, along with
high-speed data processing and quick data sharing. Hence, blockchain is an outstanding technology which offers sustainability, spontaneous data exchange, compatibility, data protection and interlinking [1]. Blockchain techniques are used to build healthcare systems which help people understand their data and own the rights to its usage. They also support serverless connectivity and a decentralized web future wherein people can have control over their data, identity and fortune [2]. Almost all machine learning techniques using AI depend on a centralized architecture wherein clusters of servers run a given model against validation datasets [3]. Because of the centralized nature of AI, only companies with good computational power and access to superior-quality datasets can take substantial advantage of the AI prospect. Companies like Amazon, Microsoft and Google, because of their computing facilities, contribute most to monopolizing the power of AI [4]. This could result in bad, insecure and highly menacing AI decisions, which can have critical repercussions in the real world. To overcome these unfavorable outcomes, we need a distributed and secure way of storing, retrieving and accessing data. The field of decentralized AI is one of the leading trends looking to address this challenge [5]. The application of blockchain in the healthcare industry has been of great interest and is an advancing area for research and development. A patient's data is widely spread across various sources due to regular changes in healthcare providers. Easy access to previous data is often lost by the patient, whereas primary ownership is retained by the provider [6, 7]. Decentralized AI is utilized in the healthcare sector and biomedical research to rapidly analyze biomedical samples and to enhance healthcare assistance by deploying robots in elderly care. Decentralised AI is also applied in asset management, to eliminate corporate power concentration, in firms for comparative market analysis, and in the banking sector to handle a dynamic variety of transactions. It further applies to routing techniques for traffic control, which not only detects congestion but also prevents it [3]. The rest of this paper is organized as follows. Section 2 presents the literature survey. Section 3 highlights the technologies based on centralized AI. Section 4 presents the technologies based on decentralised AI. Section 5 discusses AI for blockchain. Section 6 discusses blockchain for AI. Section 7 presents the AI and blockchain intersection benefits. Section 8 highlights the issues and problems at the intersection of AI and blockchain. Finally, Sect. 9 concludes this paper and gives some future directions.
2 Literature Review

When AI is integrated with blockchain, the resulting technology provides users with a revolutionary data model, which helps in accomplishing better data authenticity. The authors in [8] talk about the ChainIntel P2P network, a set of active nodes that rely on distributed AI applications for work such as face, voice and picture recognition and smart homes. The work in [4] focuses on Ethereum and IPFS, which
control the data repository and resources by providing better safety and confidentiality for the records. The work in [9] spotlights partially homomorphic encryption, such as the Paillier and Goldwasser-Micali encryption schemes, which enable safer transactions and safeguard trading parties' confidential information. The work in [10] highlights two well-known scientists who examined the capability of blockchain to modernize and automate the IoT and energy; their work spotlights AI solutions to transfer energy resources in a distributed environment using encryption technology. The work in [11] focuses on a blockchain-based swarm robotic ecosystem. The concept is built on smart contracts, which are distributed in nature and are used to develop a structured swarm mechanism to get rid of Byzantine members. The work in [12] calls attention to the distributed EHR system MedRec. It motivated EHR stakeholders and medical associations to take part in proof of work (PoW) and, in return, allowed them to access data. This concept was implemented and tested with the help of a Harvard medical hospital. The work in [13] primarily focuses on another prototype, in contrast to MedRec; the model provides strong access control to the EHR system, leading to cloud storage acceptance and key transfer access for data encryption. The work in [14] underlines that a single diagnostic report for a patient could be generated using distributed AI. The work in [15] highlights that the Dutch land registry organizations used a decentralised AI model for the landed property sector; the aim was to predict results using AI and handle the huge amount of data using blockchain. The work in [16] underscores the usage of a P2P e-cash system to ensure decentralised trading and knowledge management. The work in [17] focuses on the significance of using decentralised AI for solving internet-related issues. The work in [18] highlights experiments done in Pisa, Italy, to forecast pollution levels. The work in [19] highlights that snipAIR, launched in 2019, was treated as a substitute for Siri to safeguard users' data; snipAIR makes sure that personal data remains within the connected home rather than being stored on the cloud. The work in [20] describes a platform called Longenesis, which delivers a platform for coining and allocating data such as medical records and health data. The work in [4] highlights the fact that machine learning internally employs AI and blockchain to automate procedures, minimizing manual work to improve efficiency. Hybrid learning models are the last but not the least among the common trends highlighted here; they use real-time data and the data source for decision making. The work in [21] talks about distributed AI and smart-contract problems.
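As an illustration of the partially homomorphic schemes mentioned above, the snippet below uses the third-party Python `phe` library (pip install phe), which implements Paillier encryption. Paillier is additively homomorphic: ciphertexts can be summed without decryption, which is the property exploited for private transaction processing.

```python
from phe import paillier

# Key generation for the Paillier cryptosystem.
public_key, private_key = paillier.generate_paillier_keypair()

a, b = 12, 30
enc_a = public_key.encrypt(a)
enc_b = public_key.encrypt(b)

# The sum is computed entirely on ciphertexts, e.g. by an untrusted party.
enc_sum = enc_a + enc_b
print(private_key.decrypt(enc_sum))  # 42
```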
3 Technologies Based on Centralized AI

3.1 Digital Twin

A digital twin is a software application that tries to bridge the gap between virtual worlds and physical systems. As an instance, General Electric (GE) is constructing an AI workforce that screens its locomotives, aircraft engines and gas turbines to forecast upcoming failures, with the assistance of software models of GE machines hosted on the cloud. Their digital twins are essentially strains of software code. Nevertheless, these
optimized versions are similar to 3D design drawings, consisting of interactive visual representations using charts, points and diagrams [22].

3.2 SenseTime

SenseTime leads the image recognition industry by developing face recognition technology that is used for fee verification and photo evaluation with bank cards. Since the aim of most companies today is to extract the worth of videos and images across the Internet using AI technology, this tool has been of utmost importance.

3.3 AIONE

AIONE facilitates software developers in constructing intelligent assistants. It is also referred to as a human inference intelligence system. The AIONE toolbox offers services such as APIs, a document library and agent building. The basic advantage of this tool is that it converts data into common sets of rules, which helps machine learning and AI structures.

3.4 Deep Learning for Java (Deeplearning4j)

Deeplearning4j is a prominent open-source library used on the Java virtual machine, designed to work with applications like Hadoop and Apache Spark [23]. It incorporates services such as stacked denoising autoencoders, word2vec, deep belief nets, Boltzmann machines, deep autoencoders and Doc2Vec.

3.5 TensorFlow

TensorFlow is a deep learning library which helps in developing and deploying models in browsers as well as on mobile devices. It has a wide range of tools available for the development of deep learning and machine learning applications. It is also used for numerical computation and graphical representation [24].

3.6 PyTorch

PyTorch is an open-source library primarily used for applications such as natural language processing, time-series forecasting, cloud support and computer vision. It is developed by Facebook's AI Research lab (FAIR). Multiple pieces of deep learning software are built with the help of PyTorch [24].

3.7 Keras

Keras is an open-source deep learning framework used for low-level computations; it acts as an interface for the TensorFlow library. It has the capability to implement arbitrary research ideas and offers various practices to reduce cognitive load and speed up experimentation cycles [25].
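To ground these framework descriptions, the following is a minimal tf.keras example, a tiny binary classifier trained on random data purely for illustration:

```python
import numpy as np
import tensorflow as tf

# Toy data: 100 samples with 8 features, binary labels.
x = np.random.rand(100, 8).astype("float32")
y = np.random.randint(0, 2, size=(100, 1))

# A small feed-forward network built with the Keras Sequential API.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x, y, epochs=5, batch_size=16, verbose=0)
print(model.predict(x[:3]))
```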
4 Technologies Based on Decentralized AI

The integration of blockchain and AI is addressed as decentralised AI. Due to the centralised nature of AI, there is a possibility of data alteration, which threatens the originality of data. The decentralised nature of blockchain, by contrast, offers reliable and immutable features to AI on intersection with blockchain. This contributes to the development of new innovative technologies in multiple fields. The highlights of decentralised AI technologies are shown in Table 1.

Table 1. Technologies based on decentralised AI

Technology | Objective | Application | Method | Category
snipAIR [19] | Protection of user's personal data | Google Home, Amazon Echo | Protects data within the parameters of connected homes, instead of storing it on the cloud | Machine Learning
KEEL [18] | To predict traffic levels and avoid bottlenecks | To resolve traffic congestion | By C4.5 classification technology | Machine Learning
ChainIntel [8] | Handling the security threats faced in the world of internet | Networking | By implementing AI models | Deep Learning
MedRec [12] | Assist practitioners in understanding the current evidence | In medical and research fields | Extraction of data by miners | Deep Learning
Nebula Genomics [26] | Provides a reliable and safe platform for sharing and monetizing life's data | Marketing | Bridging the gap between companies that want data and people who want their genome sequence analyzed | Deep Learning
Neuromation [27] | Help developers to better train neural networks | Medical instruments and industrial robots | By training models to empower distributed computational power and blockchain | Deep Learning
5 AI for Blockchain

Figure 1 illustrates the key properties of AI: it can easily integrate with other technologies, can adapt quickly, and can provide a platform on which to build autonomous systems for prediction, governance and resource management [27].
Fig. 1. Properties of AI.
The design and functionality of blockchain involve numerous parameters and trade-offs among decentralization, security, functionality, overall performance and many others. AI can ease these choices, and optimize and automate them for better governance and higher performance. Furthermore, as blockchain makes data publicly available on its platforms, AI performs a key function in offering customers privacy and confidentiality. On the flip side, blockchain could use AI for the monetization of user-controlled records, or for creating autonomous organizations.

5.1 Automatic Governance

Efficient and easy management of data can be achieved by AI, which is proving to be exponentially important for information governance. Blockchain, in turn, maintains an inflexible ledger that holds the records of all undertakings using independent peer networks. Multiple stakeholders are part of information governance, and it has become increasingly important to consider all types of information as records for efficient governance. This can be achieved by the smooth interplay of these two technologies [28].

5.2 Real-Time Flagging of Fraudulent Transactions

With AI accompanying blockchain at the time of an e-contract, the data owners have complete transparency of the proceedings, and with AI monitoring the process, it can instantaneously notify and flag a breach or fraud in the data inserted and processed [29].
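A toy sketch of this monitoring idea follows: a rule-based check flags suspicious transactions, and a hash of the audit record stands in for what would be written to the ledger. The threshold and field names are invented for illustration; a real system would use learned models and an actual blockchain client.

```python
import hashlib
import json

def flag_transaction(tx, history_avg, threshold=5.0):
    """Flag a transaction whose amount deviates strongly from the account's history."""
    return tx["amount"] > threshold * history_avg

def audit_record(tx, flagged):
    """Create a tamper-evident audit entry; the hash is what a ledger would store."""
    payload = json.dumps({"tx": tx, "flagged": flagged}, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

tx = {"account": "A-17", "amount": 9800.0}
flagged = flag_transaction(tx, history_avg=450.0)
print(flagged, audit_record(tx, flagged))
```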
5.3 Enhances Security

Security in blockchain applications includes mechanisms such as data encryption and cryptocontracts. For detecting threats in an application, IDS (Intrusion Detection System) and IPS (Intrusion Prevention System) are extremely important. One improvement to blockchain systems is the creation of sturdy ciphers that enhance device security. AI plays a great role in minimizing the weaknesses of blockchain implementations in the field of cryptanalysis, delivering powerful ciphers that boost blockchain defense procedures and resilience [30].

5.4 Efficient Creation of Digital Investment Assets

According to the literature survey, there is not yet sufficient data for AI to be applied to economic artifacts dealt over blockchain. An increase in the data volumes going through blockchain prompts the derivation of helpful observations from the data [31]. The blockchain-AI convergence process covers four phases, as shown in Fig. 2:

Phase I: Proof of concept on blockchain.
Phase II: Blockchain tokenization of assets.
Phase III: Digital investment assets on blockchain.
Phase IV: AI as a financial representative that powers digital investment assets.
Fig. 2. Depicting AI-based digital investment assets using blockchain.
5.5 Scalability

Scalability is a key requirement for blockchain-enabled distributed learning and many similar enhancements through AI, and it too can draw on the fusion of AI and blockchain technology. Obstacles include bootstrap time, cost per transaction and latency, characteristics that have a big impact on blockchain scalability. In a blockchain, as each block accommodates only some amount of business data, traditional mining is not successful. As some advanced AI algorithms show an ability to learn from different distributed data sources, they prove to be an excellent match for blockchain.
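As a toy illustration of learning from distributed data sources, the property just attributed to advanced AI algorithms, the sketch below averages locally trained linear models in the spirit of federated averaging. The setup (three nodes, synthetic shards) is entirely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_fit(X, y):
    """Least-squares fit on one node's private data shard."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

# Three nodes, each holding its own shard of data from y = 2*x1 - x2 + noise.
true_w = np.array([2.0, -1.0])
shards = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    y = X @ true_w + 0.01 * rng.normal(size=50)
    shards.append((X, y))

# Federated-averaging style aggregation: average the local models.
global_w = np.mean([local_fit(X, y) for X, y in shards], axis=0)
print(global_w)  # close to [2, -1], learned without pooling the raw data
```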
6 Blockchain for AI

Figure 3 portrays the special characteristics of blockchain that result in a more secure and reliable variant of AI.
Fig. 3. Strengths of blockchain.
6.1 Transparency in AI

With blockchain integrated, AI proceedings can be monitored in a controlled way: experts and data scientists can access the data being fed into and processed by AI without fear of interference or tampering with the data [27].

6.2 Improved Trust on Robotic Decisions

Decisions taken by AI must always be simple and understandable for the stakeholders to perceive and trust. Blockchain stores a record of the transactions in the form of a distributed ledger in a point-to-point manner, making it easier to receive and trust the conclusions derived from AI, as storing information in an encrypted manner assures tamper-proof records for the auditing process. AI on a blockchain system thereby gives more clarity on the decisions taken, which in turn gains the trust of users [32].

6.3 Decentralized Intelligence

For decisions with high stakes that involve numerous agents performing different micro-tasks with access to common data, individual cybersecurity AI agents could be combined to provide fully structured security to the underlying networks and to troubleshoot any scheduling anomalies [33].

6.4 Keeping Data Private

Blockchain eliminates identifying parameters or details, although there are methods to infer which data concerns whom. The processes of training algorithms, making predictions and performing analysis of data can be achieved with anonymized data, rather than relying on companies like Google or Amazon to collect data [30].
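A tiny sketch of the anonymization idea follows: direct identifiers are replaced by salted one-way hashes before the data is used for training, so records can still be linked consistently without exposing who they belong to. The field names and salt handling are invented for illustration.

```python
import hashlib

SALT = b"consortium-secret-salt"  # in practice, managed per deployment

def pseudonymize(identifier: str) -> str:
    """Replace a direct identifier with a salted one-way hash."""
    return hashlib.sha256(SALT + identifier.encode()).hexdigest()[:16]

record = {"patient_id": "P-8821", "age": 54, "diagnosis": "J45"}
anon = {**record, "patient_id": pseudonymize(record["patient_id"])}
print(anon)  # same structure, no direct identifier
```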
6.5 Data Distribution and Security

The primary and key aspects of the growth of AI are the ability to handle big datasets and to secure them. Currently, almost all AI models require storage on cloud or centralised servers, which makes the data more vulnerable to threats, as there is just one point of access to it [34]. Blockchain, being decentralised in nature, resolves this issue and removes the threat through distributed storage of data on numerous systems across the world. Though the data is diffused, access to it remains easy and doesn't require much effort. This acts as an entry point for varied datasets, which expands the training and learning of AI/ML algorithms. Securing data with respect to AI is particularly challenging in industry, especially in the business and healthcare sectors [30].
7 Intersection Between Blockchain and AI

Figure 4a depicts an assemblage of selected features from both AI and blockchain which produces decentralised AI. Figure 4b summarizes how one technology complements the other across various aspects of the integration.
Fig. 4. (a) Decentralised AI. (b) How one technology complements the other across various aspects of the intersection.
AI and blockchain are two different technologies. AI promotes centralized intelligence on organized data in a closed environment, while blockchain fosters decentralized applications on open data platforms [3]. This results in the convergence of AI and blockchain for the development of breakthrough technologies. The intersection of blockchain with AI happens at the following junctures:

a) Anti-hacking: It is conceivable for hackers to introduce malicious data into an AI model in order to introduce bias for financial gain or to generate chaos. For instance, it is possible to introduce a few errant pixels into an image of a giraffe to trick the AI into identifying it as a dog.

b) Privacy protection: Blockchain can even perform one-way homomorphic encryption (HE) to protect our privacy. This way, medical records can be made available to the general public to develop innovative AI solutions; for instance, in a recent study done in China, AI is now able to detect lung cancer with significantly higher accuracy than oncologists. However, privacy regulations hinder AI development in other developed countries. Blockchain can help open the door without loss of privacy [35].

c) Ownership: One of the biggest criticisms of Facebook and other media platforms is how they own our data. Blockchain is well suited to storing your data so that you get paid if you choose to share it with an institution. Taking this a step further, companies like SingularityNET offer full-stack AI services. A developer can make, for instance, a sales-forecasting AI service and share it via this platform. Blockchain is used in the background so that the developer always retains ownership of the algorithm and can be appropriately compensated.

d) Bias: For controlling bias, blockchain can be used to document the whole process of AI modeling: how each decision was taken concerning the layers in the neural network, which activation functions to apply, how the data would be collected and cleansed, and so on. Each of these decisions has a tiny but vital impact on any biases in the design.

e) Contracts: One of the big use cases for blockchain is to carry out smart contracts. These are immutable programs submitted to the blockchain for execution under specific circumstances. For instance, on the passing of person X, all assets kept under his name should be transferred to person Y. But these are all hard-coded steps. In the future, AI could use reinforcement learning to train a bot to act on our behalf in response to network events, rather than follow scripted rules. From there, the AI bot can evolve to protect itself from security threats and so on.
Table 2. AI and blockchain intersection benefits [4]

AI | Blockchain | Intersection benefits
Centralized | Decentralized | Improved data security
Probabilistic | Immutable | Informed and collective decision making
Changing | Deterministic | Enhanced trust on decisions with respect to robots
Volatile | Data integrity | Decentralised intelligence
Table 2 lists individual properties of AI and blockchain and gives a brief idea of the benefits of integrating the two technologies from different perspectives, and of how combining them counters each other's loopholes and helps produce better, stronger technologies.
8 Issues and Problems at the Intersection of AI and Blockchain

AI and blockchain are key resources for the innovation we are witnessing today. By driving great change in our day-to-day life, they are destined to fundamentally redefine the way we live, interact and work. As highly disruptive technologies, they are expected to contribute hundreds of billions to the global economy. Developments in this field could remove the need for third parties and thereby disrupt critical industries.

8.1 Governance

A number of stages and activities are involved in the deployment, structuring and management of a blockchain platform. This process involves people, either directly or indirectly: the stakeholders and active participants. Major open questions include which variant of blockchain to deploy; who troubleshoots, administers and resolves disputes in the blockchain; who authors the smart contracts; and where nodes in the blockchain are hosted. These issues arise even with private consortium blockchains [3].

8.2 Tedious Consolidation of Outputs

The nodes in a blockchain are heterogeneous and decentralised in nature. Information is stored in a decentralised ledger, which ensures secure storage and immutability using techniques such as hashing and cryptographic encryption. Blockchains are thus politically and architecturally decentralised. This results in blockchains being public and open-sourced, which makes it tedious for AI to consolidate the outputs from various sources into a single point, without which further derivations on the data cannot be made at all [36].

8.3 Magnitude of Efforts

The two technologies are poles apart: one fosters decentralised intelligence on open data platforms, whereas the other promotes centralised applications in a closed data environment. The concept of integrating the two is relatively new. As a result, an enormous amount of time and investment is necessary to explore these two technologies in depth and to find common ground to enable integration [37].
8.4 Higher Computational Needs

Efficient and greater computational power is required for an ecosystem that uses decentralised AI. As an instance, the Google search engine would require exponentially more time to ensure security and other advancements for search, making it difficult to keep up with the pace of the process. Considered alone, both artificial intelligence and blockchain have a bright future: by 2025, the AI market is projected to reach nearly $39.7 billion, whereas the global blockchain market was worth only $3.0 billion in 2020 [38].

8.5 Security

Security is one of the key challenges with respect to the integration of AI and blockchain. AI is supplied with reliable and abundant information by the blockchain platform, which is public, isolated and securely distributed. This relies on cryptographic algorithms, making data theft extremely difficult. However, for AI to make better predictions on data, it might need to process the protected data, which has to be decrypted first. This decryption of data by AI might open the door to data hacking [36].
9 Conclusion

In this paper we have highlighted the concerns and prospects at the intersection of blockchain and AI, and how these two technologies offset each other's loopholes to build a more powerful technology. We have depicted how the two technologies promise to complement each other in the best possible way, and have presented various tools and technologies in the fields of AI and blockchain that are playing a crucial role in the rise of sustainable development. The technical implications that arise at the intersection of AI and blockchain give rise to decentralised AI, which enables decision making on reliable and secured data independently, avoiding the involvement of any intermediaries in the course of action. The detailed implications can be addressed as a future enhancement.
References

1. Yu, S., Lv, K., Shao, Z., Guo, Y., Zou, J., Zhang, B.: A high performance blockchain platform for intelligent devices. In: 1st IEEE International Conference on Hot Information-Centric Networking, pp. 260–261. IEEE, Shenzhen, China (2018)
2. Dinh, T.N., Thai, M.T.: AI and blockchain: a disruptive integration. IEEE Comput. Soc. 51, 48–53 (2018)
3. Smriti, N., Dhir, S., Hooda, M.: Possibilities at the intersection of AI and blockchain technology. Int. J. Innov. Technol. Explor. Eng. 9, 135–144 (2019)
4. Salah, K., et al.: Blockchain for AI: review and open research challenges. IEEE Access 7, 10127–10149 (2018)
5. Brandenburger, M., Cachin, C., Kapitza, R., Sorniotti, A.: Blockchain and trusted computing: problems, pitfalls, and a solution for Hyperledger Fabric. IEEE (2018)
6. Wehbe, Y., Al Zaabi, M., Svetinovic, D.: Blockchain AI framework for healthcare records management: constrained goal model. In: 26th Telecommunication Forum Telfor, IEEE, Belgrade, Serbia (2018)
7. Vikhyath, K.B., Brahmanand, S.H.: Wireless sensor networks security issues and challenges: a survey. Int. J. Eng. Technol. 7(2.33), 89–94 (2018)
8. Homepage. https://blog.chainintel.com/distributed-decentralized-artificial-intelligence-framework-for-dapps-75fefdc554c5. Accessed 05 June 2021
9. Sharath, Y., Kajal, B., Neelima, B.: Privacy preserving in blockchain based on partial homomorphic encryption system for AI applications. In: 25th International Conference on High Performance Computing Workshops (HiPCW), pp. 81–85. IEEE (2018)
10. Mylrea, M., Gourisetti, S.N.G.: Blockchain for smart grid resilience: exchanging distributed energy at speed, scale and security. In: Proceedings of the Resilience Week (RWS), pp. 18–23. United States (2017)
11. Strobel, V., Ferrer, E.C., Dorigo, M.: Managing byzantine robots via blockchain technology in a swarm robotics collective decision making scenario. In: Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 541–549. Stockholm, Sweden (2018)
12. Ekblaw, A., Azaria, A., Halamka, J.D., Lippman, A.: A case study for blockchain in healthcare: MedRec prototype for electronic health records and medical research data. In: IEEE Open & Big Data Conference (2016)
13. Dubovitskaya, A., Xu, Z., Ryu, S., Schumacher, M., Wang, F.: Secure and trustable electronic medical records sharing using blockchain. arXiv preprint, arXiv:1709.06528 (2017)
14. Peterson, K., Deeduvanu, R., Kanjamala, P., Boles, K.: A blockchain-based approach to health information exchange networks. In: Proceedings of the NIST Workshop Blockchain Healthcare, pp. 1–10 (2016)
15. Homepage. https://bitnewsbot.com/dutch-land-registry-how-blockchain-and-ai-could-benefit-the-real-estate-industry/. Accessed 10 Aug 2020
16. Homepage. http://bitcoin.org/bitcoin.pdf. Accessed 05 June 2021
17. Homepage. http://medium.com/crypto-oracle/blockchain-rebalancing-amplifying-the-power-of-ai-and-machine-learning-ml-af95616e9ad9. Accessed 05 June 2021
18. Osaba, E., Onieva, E., Moreno, A., Lopez-Garcia, P., Perallos, A., Bringas, P.G.: Decentralised intelligent transport system with distributed intelligence based on classification techniques. IET Intel. Transp. Syst. 10, 674–682 (2016)
19. Homepage. http://www.forbes.com/sites/rachelwolfson/2018/09/14/blockchain-based-ai-voice-assistant-brings-privacy-to-smart-homes/#1f965b3b6b50. Accessed 10 Aug 2020
20. Gammon, K.: Experimenting with blockchain: can one technology boost both data integrity and patients' pocketbooks? Nat. Med. 24(4), 378–381 (2018)
21. Homepage. http://arXiv.org/abs/1802.04451. Accessed 05 June 2021
22. Schluse, M., Priggemeyer, M., Atorf, L., Rossmann, J.: Experimentable digital twins – streamlining simulation-based systems engineering for industry 4.0. IEEE Trans. Ind. Inform. 14, 1722–1731 (2018)
23. Homepage. https://onix-systems.com/blog/top-10-java-machine-learning-tools-and-libraries. Accessed 05 June 2021
24. Homepage. https://www.upgrad.com/blog/top-deep-learning-frameworks. Accessed 05 June 2021
25. Homepage. https://analyticsindiamag.com/deep-learning-frameworks. Accessed 05 June 2021
26. Homepage. https://neuromation.io. Accessed 05 June 2021
27. Shabbir, J., Anwer, T.: Artificial intelligence and its role in near future. J. Latex Class Files 14 (2015)
28. Chelvachandran, N., Trifuljesko, S., Drobotowicz, K., Kendzierskyj, S., Jahankhani, H., Shah, Y.: Considerations for the governance of AI and government legislative frameworks. In: Jahankhani, H., Kendzierskyj, S., Chelvachandran, N., Ibarra, J. (eds.) Cyber Defence in the Age of AI, Smart Societies and Augmented Humanity. Advanced Sciences and Technologies for Security Applications, pp. 57–72. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-35746-7_4
29. Hassan, M.M., Mirza, T.: Real-time detection of fraudulent transactions in retail banking using data mining techniques. Int. J. Comput. Sci. Eng. 10, 120–126 (2020)
30. Zhang, R., Xue, R., Liu, L.: Security and privacy on blockchain. ACM Comput. Surv. 52(3), 1–34 (2019). https://doi.org/10.1145/3316481
31. Homepage. www.oecd.org/finance/The-Tokenisation-of-Assets-and-Potential-Implications-for-Financial-Markets.htm. Accessed 05 June 2021
32. Siau, K., Wang, W.: Building trust in artificial intelligence, machine learning, and robotics. Cutter Bus. Technol. J. 31, 47–53 (2018)
33. Scholz, M., Zhang, X., Kreitlein, S., Franke, J.: Decentralized intelligence: the key for an energy efficient and sustainable intralogistics. Procedia Manuf. 2, 679–685 (2018)
34. Cao, T.-D., Pham, T.-V., Quang-Hieu, V., Truong, H.-L., Le, D.-H., Dustdar, S.: MARSA: a marketplace for realtime human sensing data. ACM Trans. Internet Technol. 16(3), 1–21 (2016). https://doi.org/10.1145/2883611
35. Dlamini, Z., Francies, F.Z., Hull, R., Marima, R.: Artificial intelligence (AI) and big data in cancer and precision oncology. Comput. Struct. Biotechnol. J. 18, 2300–2311 (2020). https://doi.org/10.1016/j.csbj.2020.08.019
36. Homepage. https://appinventiv.com/blog/what-happens-when-blockchain-and-ai-merge. Accessed 05 June 2021
37. Homepage. https://www.artificial-intelligence.blog/analysis-and-resources/artificial-intelligence-and-the-blockchain. Accessed 05 June 2021
38. Homepage. https://www.reportlinker.com/p04226790/Blockchain-Technology-Market-by-Provider-Application-Organization-Size-Vertical-and-Region-Global-Forecast-to.html?utm_source=PRN. Accessed 05 June 2021
SAIaaS: A Blockchain-Based Solution for Secure Artificial Intelligence as-a-Service

Nicolas Six, Andrea Perrichon-Chrétien, and Nicolas Herbaut

Université Paris 1 Panthéon-Sorbonne, Centre de Recherche en Informatique, 75013 Paris, France
[email protected]
Abstract. Artificial Intelligence models are crucial elements to support many sectors in the current global economy. Training those models requires 3 main assets: data, machine learning algorithms, and processing capabilities. Given the growing concerns regarding data privacy, algorithm intellectual property, and server security, combining all 3 resources to build a model is challenging. In this paper, we propose a solution allowing providers to share their data and run their algorithms in secured cloud training environments. To provide trust for both clients and asset providers in the system, a blockchain is introduced to support the negotiation, monitoring, and conclusion of model production. Through a preliminary evaluation, we validate the feasibility of the approach and present a road map to a more secure Artificial Intelligence as-a-service.
Keywords: Blockchain · Artificial intelligence · Security · Privacy · TEE

1 Introduction
With the increasing number of companies that have started to investigate AI solutions addressing business issues, the popularity of AIaaS (Artificial Intelligence as-a-service) has been rising throughout the years. It is expected that the market, valued at USD 2.68 billion in 2019, might reach USD 28.58 billion by 2025 (https://bit.ly/3wfEATY). Indeed, AIaaS facilitates access to machine learning algorithms and learning infrastructure, two of the three assets required to compute AI models, along with datasets. However, its growing adoption raises issues, notably the centralization of those services in the hands of a few actors, such as Google's Prediction API and Amazon ML. Also, even with infrastructure at one's disposal, getting high-quality datasets and innovative algorithms is a difficult task. Owners of sensitive or valuable datasets as well as state-of-the-art algorithms might be reluctant to share their assets. They expect the client to pay a premium for getting an asset, or guarantees of confidentiality that are difficult to provide
in regular cloud environments. Confidentiality of the assets can also be threatened during the learning phase if the computation service is compromised or the provider malicious. Finally, it is difficult to find such providers, as many datasets and algorithms exist. Those issues have been partially addressed by academic studies, whether by the construction of blockchain-enabled data marketplaces (e.g., [6]) or blockchain cloud monitoring for third-party computation [9]. However, there is still no end-to-end solution that allows clients to obtain a desired model in a decentralized way. Consider the following use case: a hospital manager possesses a trusted set of 100 images of its patients' diseases, and requests more data and an efficient algorithm. SAIaaS will allow the manager's model to be trained, after which the manager will be able to label a 101st image with the resulting model. The core of our proposal is two-fold: first, we use blockchain to design a transparent and tamper-proof marketplace facilitating auction-based pricing for immaterial (datasets and ML algorithms) and material assets (cloud computing resources). Then we propose using Trusted Execution Environments for ML tasks, to guarantee that code and data loaded inside the infrastructure are protected with respect to confidentiality and integrity. The rest of the paper is organized as follows. We first present our model for a Secure Artificial Intelligence as-a-Service marketplace in Sect. 2. We conduct a preliminary evaluation of the cost of such a platform in Sect. 3, then we mention some related work and position our proposal in Sect. 4. We finally conclude and discuss future work in Sect. 5.
2 A Secure Marketplace for AI
This paper proposes SAIaaS (Secure AI as-a-service), a model for a blockchain-based marketplace for collaborative and rewarded computation of models. In this section, we describe our model for SAIaaS and provide insights on how the 4 main steps of our proposal (Fig. 1) can be implemented.

2.1 Actors and High-Level Workflow
In SAIaaS, a client willing to obtain an AI model publishes an auction on a public blockchain for providers to bid on. Providers are classified into 3 categories: Data Providers (DP), who provide datasets; Algorithm Providers (AP), who provide innovative machine learning algorithms; and Infrastructure Providers (IP), who provide cloud resources for the learning phase. The auction contains a description of the client's needs and associated requirements (e.g., model accuracy). Each provider is allowed to bid in its own category. Thus, three winners will be selected, one per category, constituting a triplet of winners. They will have to collaborate to generate the expected model. First, the IP will have to set up a computing environment. Then, the DP and the AP will send their assets to the IP through a secured channel. Finally, the IP will compute the model and return it to the client. Providers will be rewarded according to their bids. The next sections provide more details on each step of the workflow. We discuss the implementation in Sect. 3.
Fig. 1. High-level architecture
2.2 Semantic Matchmaking Phase
The client's request contains the requested asset specifications for data, algorithms and infrastructure on the blockchain system. Since we aim to build a use-case-agnostic system, we must support a wide and dynamic range of application domains. To this end, we rely on an ontology-based resource retrieval and allocation system, which has already been proposed in the literature for cloud service provisioning [5] or dataset discovery [4]. With a common ontology used to describe both client requirements and providers' assets, matchmaking can be done on-chain, through the emission of a specific event targeted at providers that can fulfill the client's request, or off-chain, through the continuous monitoring of new clients' requests publicly available on the blockchain. Based on the asset ontology, each provider can analyze the asset specifications published by the client and know whether it owns a matching asset, in which case it will take part in the auction phase through asset bids.
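As a minimal illustration of the on-chain matchmaking variant, the Solidity sketch below shows how a request contract could emit an event carrying a commitment to the ontology-encoded asset specification, which providers monitor to detect requests they can fulfill. The contract and function names are ours, introduced only for illustration; this is not the authors' published code.

// Minimal sketch of on-chain matchmaking signaling (hypothetical names).
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

contract RequestBoard {
    // specURI points to the full ontology-based requirement file (e.g., on IPFS);
    // specHash commits to its content so providers can check integrity off-chain.
    event AssetRequested(address indexed client, bytes32 specHash, string specURI);

    function publishRequest(bytes32 specHash, string calldata specURI) external {
        emit AssetRequested(msg.sender, specHash, specURI);
    }
}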
2.3 Auction Phase
The auction phase is the sequence of actions after the client's request is made public. It involves providers interacting with the auction smart contract by placing asset bids. Along with its asset specifications, the client locks a cryptocurrency amount, the reserve price, that will be used to pay providers for their assets. The auction ends when the client adjudicates the auction contract or when a predetermined time elapses. Each provider is able to participate in the auction and propose a price for an asset that semantically matches the client's request. The first bid proposed by a provider determines the base price for a particular asset specification; it is updated whenever a new asset bid is placed at a lower price for the same asset request. To foster competition between providers, when the base price is updated, competing providers are able to reduce the initial price of their asset bid. At the end of the auction phase, the winning triplet comprised of the semantically matching
lowest-price bids is made public through an Auction Service Level Agreement (SLA) on the blockchain. Since the platform assures both data privacy and algorithm confidentiality, it is not possible to assess the quality of immaterial assets (datasets and algorithms) before actually performing the machine learning. This can slow down the adoption of the system, since clients are reluctant to pay without knowing whether the results will be satisfactory. To circumvent this issue, each asset bid from a provider can contain references to previous auctions with semantically matching bids from the same provider. For example, if a particular dataset was proposed to another client in a previous auction, the provider can reference the Auction instance address in its bid to provide evidence of its suitability. Clients can specify how many references each bid must have in their auction specifications. To allow new providers to enter the system, bids without a reference to a previous Auction SLA are expected to be significantly cheaper than referenced ones, compensating for the risk on the client side.
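The descending-price rule described above can be captured in a few lines of Solidity. The sketch below is a simplification under our own assumptions (a single asset category and illustrative names), not the paper's implementation:

// Minimal sketch of the descending bid logic for one asset category.
pragma solidity ^0.8.0;

contract DescendingAuction {
    uint public maxBid;        // ceiling fixed by the client (reserve price locked at creation)
    uint public basePrice;     // current base price for the asset specification
    address public bestBidder;
    uint public deadline;      // end of the auction phase (timestamp)

    constructor(uint _maxBid, uint _duration) payable {
        require(msg.value >= _maxBid, "reserve price not locked");
        maxBid = _maxBid;
        basePrice = _maxBid;   // the first bid must undercut the ceiling
        deadline = block.timestamp + _duration;
    }

    // A bid is accepted only while the auction runs and if it undercuts the
    // current base price, which fosters competition between providers.
    function bid(uint price) external {
        require(block.timestamp < deadline, "auction closed");
        require(price < basePrice, "must undercut the base price");
        basePrice = price;
        bestBidder = msg.sender;
    }
}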
2.4 Secure Learning Phase
Once all the providers are identified, the immaterial assets from the DP and the AP need to be transmitted onto the IP infrastructure for the machine learning computation. To prevent any data or intellectual property leak from a malicious or compromised infrastructure provider, a secured learning environment is required. For dataset security, trusted execution environments (TEE), such as Intel SGX enclaves, have been proposed to perform both machine learning and model creation in a secure way [3]. This technology brings trust to the learning process: the infrastructure provider cannot access the data stored in TEEs, the data provider receives an attestation proving that the environment is secured and up to date (see Intel SGX remote attestation, https://intel.ly/3ry4UoU), and a secure communication channel is created for data upload. For algorithm security, TEEs and Linux containers can also be leveraged to make sure that the intellectual property of the AP is not compromised. Through the previously mentioned secure channel, the AP uploads its algorithm from a secure registry. It can be executed as a Linux container [1], which has the additional benefit of preventing the algorithm from being compromised, through container image signatures, for example using Docker Content Trust (https://dockr.ly/3m3ESss).
2.5 Restitution Phase
Once the learning phase is over, the model is provided as an encrypted file through the TEE secure channel to the client. The client then assesses the result of the model off-chain and publishes an acknowledgment in the Auction SLA contract to close the process, unlock the payments to the providers and retrieve the potential unspent resources from its reserve price. The next section presents
a preliminary implementation of the marketplace, the secured learning environment being left for future work.
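A possible shape for the acknowledgment step just described is sketched below in Solidity: it pays each winning provider its bid and refunds the unspent reserve to the client. The names and the fixed three-slot layout are illustrative assumptions, not taken from the actual contracts.

// Minimal sketch of the restitution step (hypothetical names).
pragma solidity ^0.8.0;

contract AuctionSLA {
    address public client;
    address payable[3] public winners; // data, algorithm, infrastructure providers
    uint[3] public prices;             // winning bid per category

    constructor() payable { client = msg.sender; } // reserve price locked here

    // ... auction and adjudication logic elided; winners and prices are set there ...

    // Called by the client once the model has been assessed off-chain:
    // unlocks the payments and returns the unspent part of the reserve.
    function acknowledge() external {
        require(msg.sender == client, "client only");
        for (uint i = 0; i < 3; i++) {
            winners[i].transfer(prices[i]);
        }
        payable(client).transfer(address(this).balance); // unspent reserve
    }
}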
3 Proof of Concept
This section presents a proof of concept of the main component of our proposal, the blockchain-based marketplace. First, a Solidity implementation of the blockchain marketplace is proposed. Then, a cost analysis is performed to evaluate the overall cost of the solution.
3.1 Implementation
To evaluate the contribution, this paper introduces a concrete blockchain implementation of the marketplace on Ethereum. The code is available on GitHub (https://github.com/nicoSix/solidity-data-marketplace). The architecture of the marketplace is designed as shown in Fig. 2.
Fig. 2. Marketplace on-chain architecture.
This marketplace is based on two Solidity smart contracts: Factory and Auction. Factory is a contract dedicated to creating Auction instances. Once deployed on-chain, it can be called by an external party to create an auction, providing an adequate description of its requirements. Expected requirements are specified in an ontology file, stored off-chain on IPFS (Inter-Planetary File System), a decentralized storage platform (https://docs.ipfs.io/). The party willing to create an auction must provide its requirements following this file. When an Auction instance is created, its requirements, the client address, and the auction modalities are set as state variables. A reference to the Factory contract responsible for its instantiation is also kept. Thus, the Factory contract, as well as requirement metadata, can easily be updated without losing the link between old metadata and older Auction instances. The main auction modality is the auction duration. Before the execution of each contract method, a verification checks whether the defined duration has elapsed since the auction creation; if so, the contract automatically adjudicates the winners of the auction.
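A condensed view of the Factory/Auction relationship, with the duration check expressed as a modifier, is sketched below. This is our illustrative reconstruction of the described behavior; the actual contracts are available in the repository linked above.

// Sketch of the Factory pattern with an automatic-adjudication deadline check.
pragma solidity ^0.8.0;

contract Auction {
    address public factory;         // link back to the instantiating Factory
    address public client;
    string public requirementsCID;  // off-chain IPFS reference to the ontology file
    uint public endTime;
    bool public adjudicated;

    constructor(address _client, string memory _cid, uint _duration) {
        factory = msg.sender;
        client = _client;
        requirementsCID = _cid;
        endTime = block.timestamp + _duration;
    }

    // Run before every state-changing method: once the auction duration has
    // elapsed, the contract adjudicates the winners before anything else.
    modifier checkDeadline() {
        if (block.timestamp >= endTime && !adjudicated) {
            adjudicated = true;     // winner selection elided in this sketch
        }
        _;
    }
}

contract Factory {
    Auction[] public auctions;

    function createAuction(string calldata cid, uint duration) external {
        auctions.push(new Auction(msg.sender, cid, duration));
    }
}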
Providers can bid on the auction if they can effectively provide the desired assets and if their assets are semantically compliant with the client's initial request. Providers can submit as many bids as they want, as long as the auction phase is not complete and the new bid amount is below the previous one. A maximum bid value is also set in the contract, forbidding providers from bidding above this value. After the adjudication, a triplet of winners is determined. Each member of the triplet is the winning provider for a category (data, algorithm, or infrastructure).
3.2 Cost Estimation
As Ethereum has been selected for this implementation, each operation performed on contracts that alters their state (e.g., deployment, bid, ...) has a defined cost in gas. The price of performing an operation in Ether (the main cryptocurrency of Ethereum) is the product of the total cost in gas and the current blockchain gas price (Ether per gas unit). By extension, the cost in $USD can be deduced from the cost in Ether. To get an accurate estimation of those costs, a scenario guides the measurements. Scenario. A party wants to obtain a model but does not have any data, algorithm, or infrastructure. It decides to use the SAIaaS marketplace to find providers that could perform this task on its behalf. It creates an auction on-chain through the (already deployed) Factory instance that acts as a gateway, specifying its requirements and its maximal price. Six providers are willing to bid, two per type of asset provided (data, infrastructure, algorithms). They do not know and do not trust each other. First, one provider per asset type bids to provide its asset. The other providers then outbid the first three, who in turn bid again. Finally, the auction stops, and the first three providers are the winners of this auction. Thus, 9 bids are placed for the auction. Results. Table 1 lists all possible operations, the associated cost of each operation, and the sum of all the costs when running the scenario described above.
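As a sanity check on this conversion, take the Factory deployment row of Table 1:

price(Ether) = gas cost × gas price = 2,344,498 × 124 × 10^-9 ≈ 0.29072 Ether,
price($USD) = 0.29072 × 1,683 ≈ 489.28 $USD,

matching the reported values.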
3.3 Discussion
The results show an important gap between the cost of deploying contracts and the cost of bidding and adjudicating on the auction. Indeed, contract deployment implies storing large amounts of data and defining many state variables. This is considered a very expensive operation, as nodes will have to store contracts and their states forever on-chain. The cost associated with the scenario execution is also high, even if the Factory contract is considered already deployed. By extension, the cost in $USD is prohibitively high for the platform to be used in its current state. However, as gas costs are inherent to Ethereum, selecting another blockchain might substantially decrease costs.
Table 1. Cost per operation (in gas / in Ether / in $USD). When the experiment was conducted (28/03/2021), the gas price was 124 Gwei, and an Ether was valued at $1,683.

Operation                Gas cost    Price (Ether)  Price ($USD)
Factory deployment       2,344,498   0.2907178      489.28
Auction creation         1,760,105   0.2182530      367.32
First bid of contract      237,518   0.0294522       49.57
First bid of user          207,829   0.0257708       43.37
Modify existing bid         94,747   0.0117486       19.77
Adjudication               198,295   0.0245886       41.38
Total (scenario)         3,180,058   0.3943272      663.65
Relevant additional information is returned on top of the computed model. Consequently, the obtained results are only valid on the Ethereum mainnet; but this blockchain also comes with significant advantages, especially its degree of decentralization.
4 Related Work
In the literature, several works propose a blockchain-based marketplace for AI, in which the authors design secure data-oriented AI marketplaces to guarantee privacy between users (such as [8,10] and [7]). Our proposal goes further by also including ML algorithms and infrastructure as tradeable assets, while considering the security aspects of the learning process. Other work has proposed solutions leveraging Trusted Execution Environments [2]. Our proposal differs in the sense that it is more specialized and proposes a ready-to-use solution targeted at AI needs, with an auction-driven pricing scheme.
5 Conclusion and Future Works
This paper presents a blockchain-based marketplace to support secure artificial intelligence as-a-service (SAIaaS). From a requirements file shared on-chain, a client can request the computation of a specific model to solve an AI problem by creating an auction. Providers can then bid on the auction, meaning that they are willing to provide their assets (data, infrastructure, or algorithm) to compute the model. At the end, the best offers for each asset are retained, and the model is computed in the provided infrastructure using the provided data and algorithms. A secure environment can be set up in the infrastructure to avoid any leakage of valuable or confidential data, using enclaves. This paper paves the way for future progress on the proposed solution. First, by implementing a system capable of automatically bootstrapping the model computation on the Infrastructure provider's dedicated services, with the dataset
and the algorithm transferred from the other two providers. This includes the setup of a trusted learning environment, using enclaves. The designed system must also take into account the potential incompatibility between a provided dataset and an algorithm. Additional pre-processing steps should be specified during the user's requirement phase. Second, with the implementation of a monitoring system connected to the blockchain that ensures computations are performed following the client's requirements. This monitoring system could also help detect breaches of confidentiality on datasets provided by the Data provider. Third, by handling potential issues that could occur from the creation of an auction to the computation of the model. As it is meant to be deployed on a public blockchain, we want to prevent potential misuses of the decentralized app and conflicts between users by proposing a DAO (Decentralized Autonomous Organization) suited for this use case. Users would be entitled to submit bids by depositing collateral to deter misbehavior, with the required amount declining as their reputation builds over their use of SAIaaS. Conflicts could be litigated by highly reputed users in exchange for a part of the collateral.
References
1. Arnautov, S., et al.: SCONE: secure Linux containers with Intel SGX. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 689–703 (2016)
2. Fedak, G., Bendella, W., Alves, E.: iExec whitepaper: blockchain-based decentralized cloud computing (2017)
3. Hunt, T., Song, C., Shokri, R., Shmatikov, V., Witchel, E.: Chiron: privacy-preserving machine learning as a service. arXiv preprint arXiv:1803.05961 (2018)
4. Kushiro, N.: A method for generating ontologies in requirements domain for searching data sets in marketplace. In: 2013 IEEE 13th International Conference on Data Mining Workshops, pp. 688–693. IEEE (2013)
5. Ma, Y.B., Jang, S.H., Lee, J.S.: Ontology-based resource management for cloud computing. In: Nguyen, N.T., Kim, C.-G., Janiak, A. (eds.) ACIIDS 2011. LNCS (LNAI), vol. 6592, pp. 343–352. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-20042-7_35
6. Nardini, M., Helmer, S., El Ioini, N., Pahl, C.: A blockchain-based decentralized electronic marketplace for computing resources. SN Comput. Sci. 1(5), 1–24 (2020)
7. Özyilmaz, K.R., Doğan, M., Yurdakul, A.: IDMoB: IoT data marketplace on blockchain. In: 2018 Crypto Valley Conference on Blockchain Technology (CVCBT), pp. 11–19. IEEE (2018)
8. Sarpatwar, K., Sitaramagiridharganesh Ganapavarapu, V., Shanmugam, K., Rahman, A., Vaculin, R.: Blockchain enabled AI marketplace: the price you pay for trust. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (2019)
9. Taghavi, M., Bentahar, J., Otrok, H., Bakhtiyari, K.: A blockchain-based model for cloud service quality monitoring. IEEE Trans. Serv. Comput. 13(2), 276–288 (2019)
10. Travizano, M., Minnoni, M., Ajzenman, G., Sarraute, C., Della Penna, N.: Wibson: a decentralized marketplace empowering individuals to safely monetize their personal data. White paper (2018)
Blockchain and Security
Trade-Off Between Security and Scalability in Blockchain Design: A Dynamic Sharding Approach

Kahina Khacef1(B), Salima Benbernou2, Mourad Ouziri2, and Muhammad Younas3

1 LIP6, Sorbonne Université, Paris, France
[email protected]
2 LIPADE, Université de Paris, Paris, France
{salimabenbernou,mourad.ouziri}@u-paris.fr
3 Oxford Brookes University, Oxford, UK
[email protected]
Abstract. Security and scalability are considered two major issues that are most likely to influence the rapid deployment of blockchains in businesses. We believe that the ability to scale up a blockchain lies mainly in improving the underlying technology rather than deploying new hardware. Though recent research works have applied sharding techniques to enhance the scalability of blockchains, they do not cater for addressing the issue of both data security and scalability in blockchains. In this paper, we propose an approach that makes a trade-off between security and scalability when designing blockchain-based systems. We propose an efficient replication model, which creates dynamic sharding wherein blocks are stored on a varying number of nodes. The proposed approach shows that the replication of the blockchain over the peer-to-peer network is minimized as the blockchain's length evolves, according to a replication factor, so as to preserve security.

Keywords: Blockchain · Security · Scalability · Sharding
1 Introduction
Blockchain is a distributed data structure comprising a list of blocks that record transactions, maintained by nodes without a central authority. It has a decentralized peer-to-peer (P2P) architecture which is inherently resilient and open. In it, blocks are chained (or linked) securely by hash pointers, and consensus among different nodes is required to approve every transaction in a block. Blockchain was originally proposed in the white paper [13] by the mysterious Nakamoto in 2008, with a proof-of-work (PoW) consensus. It was based on a P2P network that is not subject to the control of a trusted third party. In the beginning, it was used for Bitcoin as a safer way to carry out financial transactions. It is worth noting that the
number of blockchains went from one (Bitcoin) to several blockchains in 2021. Blockchain-based systems have recently become appealing to several financial sectors and scientific communities. Currently, there exist various blockchains such as Ethereum [17], Hyperledger [1], and Tezos [6]. Each blockchain has its own operating mode and a different transaction validation consensus that makes it attractive to various applications. A user broadcasts a new transaction to the network to add it to the blockchain. A set of nodes then verify the new transaction to ensure that it is correctly signed and that it has not been previously spent (or recorded) in the ledger. Although the nodes of this decentralized network are equal, they can have different roles depending on the functions they support, e.g., routing, database, miner, and wallet nodes. A node with all these functions is called a full node: it maintains a complete copy of the blockchain and contributes to network security. Other nodes, supporting only a subset of functions, verify transactions using the simplified payment verification (SPV) method. Known as lightweight nodes, they can send and receive transactions without owning a full copy of the blockchain, but they only download block headers and the transactions that concern them, and thus depend on full nodes. Blockchain has won its spurs in data integrity, security, and immutability. However, security is achieved at the price of maintaining full nodes. Storage costs increase linearly with the number of transactions and may become one of the bottlenecks that limit blockchain's scalability. Traditional blockchain is based on full replication, where nodes keep all past transactions locally. They check the state to validate a new transaction and then store each transaction to maintain the system; the proof of correct state consists of all block headers starting from the genesis block. This processing is slow and requires a lot of storage capacity, and each node has to go through it. Supporting a large number of users and transactions therefore results in a serious scalability problem. On the one hand, the decentralization of the blockchain over a peer-to-peer network and its replication on multiple nodes provide extra security by making it more difficult for an attacker to compromise the system. On the other hand, these negatively affect the scalability of blockchain systems. It is believed that the ability to scale up a blockchain therefore lies mainly in improving the technology, and not in the deployment of new hardware. This paper presents a new blockchain design that reduces the storage volume on each node and allows dynamic sharding, with local states differing from node to node, but without compromising security properties. The main contributions of this paper are summarized as follows: – It proposes the design of a new blockchain-based system that makes a trade-off between security and scalability in a blockchain. The proposed system is named SecuSca. – It formulates the trade-off approach as a multi-objective optimization problem. The proposed approach allows dynamic sharding to save replications in the network while maintaining appropriate security of the blockchain system.
– It implements the proposed approach as a proof of concept and highlights its efficiency. The remainder of this paper is organized as follows. Section 2 presents a motivating example as well as an overview of the dynamic sharding approach. Section 3 discusses the related work. Section 4 provides background on blockchains. Section 5 describes the proposed design, which enhances the scalability of the blockchain while maintaining its security. Section 6 provides the sharding function to optimize the trade-off. Section 7 reports on the implementation and discusses the results, and Section 8 concludes the paper.
2 Overview of SecuSca Approach and Motivation
This section presents a running example in relation to the proposed SecuSca approach, which aims to overcome the scalability limitations of blockchain-based systems while maintaining their security. It also presents an overview of the SecuSca model.
2.1 Motivating Example
The blocks that make up the blockchain are replicated on all nodes. Nodes maintain local copies of all blocks (including the genesis block) for the following reasons:
(i) To verify a transaction, the nodes read the history of all past transactions locally.
(ii) To provide replication, as it enhances security against attacks and tampering, and also improves the availability of data.
(iii) To safeguard transactions: transactions are considered sufficiently safe from attacks when buried under enough blocks, and miners reach consensus by selecting the longest chain. In proof-of-work cryptocurrencies, the longest chain is deemed honest by the network, being regarded as the most invested chain [14].
Fig. 1. Full replication
We consider Alice, Bob and John as three participants among a hundred nodes of the blockchain network, each with a storage capacity of 50 GB, that hold the
shared ledger. As shown in Fig. 1, the blockchain is fully replicated on every node of the network: all nodes store whole blocks with all transactions, and the same block is replicated on all nodes. Suppose that the size of a block is 1 MB and that blocks are generated every 10 min; then 1 MB × 6 × 24 × 30 = 4,320 MB of new blocks accumulate per month. Since each block is replicated on all nodes, the three nodes can each store up to 50 × 10³ blocks. These nodes, with a capacity of 50 GB containing the blockchain, will be saturated in less than a year (50,000/4,320 ≈ 11.6 months). This shows that the current design of blockchain results in a major scalability issue.
Fig. 2. Traditional blockchain vs. the new replication model. In 2(a), each node maintains its chain, which contains all previous transactions: each given block is stored on the three nodes, Alice, Bob, and John. In 2(b), the global blockchain shown at the top is sharded across the three nodes. Colored blocks represent entire blocks containing transactions, and framed blocks only have the block header, so a block buried in the chain is held in full on only a few nodes.
Reducing replication reduces the storage load on nodes, so they can continue to receive new blocks; thus the blockchain can grow beyond 50,000 blocks. As shown in Fig. 2, a blockchain sharded over the network scales better in storage than full replication.
2.2 The SecuSca Approach in a Nutshell
SecuSca aims to reduce the storage load by reducing the replication of each block in the distributed ledger. It takes into account the fundamental characteristics of security and verification mentioned above. The blockchain is distributed over the nodes, and all blocks make up the overall state. To allow a node to verify the history of the blockchain, SecuSca removes transactions from blocks but keeps the block header. A newly produced block is cryptographically linked to the last block added to the longest chain. As blocks arrive in the system, they are stored on a high number of nodes; the replication of blocks secured by chaining then decreases as the blockchain becomes longer. No block is ever entirely removed from the blockchain. As a result, it is possible to store a larger number of transactions than in current blockchain systems using the same storage capacity, while reducing the memory required by each node individually. Also, as transactions increase, users are not forced to store a large amount of data.
3 Related Work
The most widely used blockchains, Bitcoin [13] and Ethereum [17], are based on a full replication system which cannot support a large number of transactions. With a growing number of transactions, they suffer from a system overload that does not allow them to scale; in effect, the efficiency of the blockchain decreases as more nodes join the network. In order to tackle this issue, many approaches have recently emerged that allow the blockchain to continue functioning even when the number of users increases. Sharding, generally used in databases, has been proposed for cryptocurrency ledgers. The network of N nodes is partitioned into K committees with a small number of nodes c, with replication c = N/K, i.e., yielding smaller full-replication systems. A node in each committee stores only the blocks validated inside its committee and does not manage the entire blockchain ledger. For example, [8,11,18] are sharding-based Proof-of-Work and Byzantine fault tolerance (BFT) protocols. Elastico [11] is the first sharding-based public blockchain, proposed in 2016, that tolerates byzantine adversaries. It partitions the network into shards and ensures probabilistic correctness by randomly assigning nodes to committees, wherein each shard is verified by a disjoint committee of nodes in parallel. It executes an expensive PoW to form committees, where nodes randomly join different committees and run PBFT (Practical Byzantine Fault Tolerance) for intra-committee consensus. In Elastico, all nodes maintain the blockchain ledger, but cross-shard transactions are not supported. In addition, running PBFT among hundreds of nodes decreases the protocol's performance, while reducing the number of nodes within each shard increases the failure probability. The network can only tolerate up to 25% of malicious nodes. OmniLedger [8] improved upon Elastico. It includes new methods to assign nodes to shards with a higher security guarantee than Elastico. It uses both PoW and BFT, together with an atomic protocol for cross-shard transactions (Atomix). The intra-shard consensus protocol of OmniLedger uses a variant of ByzCoin [7] and assumes partially synchronous channels to achieve faster transactions. The network tolerates up to 25% of faulty nodes and 33% of malicious nodes in each committee, as in [11]. Cross-shard handling in RapidChain [18] relies on an inter-committee routing scheme based on the routing algorithm of Kademlia [12]. It tolerates up to 33% total resiliency and 50% committee resiliency. RapidChain also supports cross-shard transactions using Byzantine consensus protocols, but requires strong synchronous communication among shards, which is hard to achieve. There exist other approaches in the literature which are based on private blockchains [2,3,5,15,16]. Even though sharding improves storage and throughput, K increases linearly with N at a low security level, thus leaving committees exposed to malicious-node failures. Vault [9] introduces fast bootstrapping to allow new participants to join the network without downloading the whole blockchain, by reducing the transmitted state. Vault is account-based, designed for Algorand [4], and does not require all nodes to store the whole blockchain state.
In [10], the authors propose a superlight client design that allows a light client to rely on full nodes to read the blockchain at a low read cost, in order to prove the (non-)existence of a transaction in the blockchain. Blockchains can therefore hold a large amount of data. However, each node still requires storage space; thus, the cost of storage and the required memory increase with the number of transactions.
4 Background
In this section, we present relevant background on blockchain systems in relation to the proposed approach.
4.1 Blockchain Systems
Blockchain Network. A blockchain network consists of nodes that record all data in an immutable way. The data is collected in blocks that contain sets of transactions, and the consensus of most network participants verifies these transactions. Each block is identified by a hash, a unique identifier. In our proposed approach, we assume that all nodes always have the same constant amount of resources with respect to CPU, storage, and network bandwidth. Data Model. Different blockchains use different models for their states. The main models are Unspent Transaction Output (UTXO) and account-based. Bitcoin, like many other cryptocurrencies, adopts the UTXO model, in which the state consists of the outputs of transactions that have not yet been consumed as inputs to another transaction. UTXO provides a higher level of privacy. More recent blockchain-based systems, such as Ethereum and Hyperledger, support general states that can be modified arbitrarily by smart contracts. They adopt an account-based data model, in which each account has its local states stored on the blockchain; it is similar to record-keeping in a bank. Our proposed approach focuses on UTXO, as it involves reading the history of transactions and verifying the states. Block Structure. A block stores transactions and global states. A block header contains the following fields: (1) Previous block: a reference to the previous block in the chain; (2) Nonce: related to the proof of work; (3) Merkle root: enables efficient verification of transactions and states; (4) a set of metadata related to the mining protocol: difficulty and timestamp. Consensus. Consensus is an agreement to validate the correctness of a blockchain, linked to the order and timing of blocks. Its goal is to achieve consistency among the nodes participating in the blockchain. All honest nodes accept the same order of transactions as long as they are confirmed in their local blockchain views. The blockchain is constantly updated as new blocks are received.
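The block header fields enumerated above map directly onto a simple data structure; the following Solidity sketch (ours, for illustration only) makes the layout explicit:

// Schematic layout of the block header fields listed above.
pragma solidity ^0.8.0;

contract HeaderLayout {
    struct BlockHeader {
        bytes32 previousBlock; // hash pointer to the previous block in the chain
        uint64  nonce;         // related to the proof of work
        bytes32 merkleRoot;    // enables efficient verification of transactions and states
        uint32  difficulty;    // mining-protocol metadata
        uint64  timestamp;     // mining-protocol metadata
    }
}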
4.2 States Management
States. The system ensures data availability and transparency for all participating members by maintaining all historical and current transactions at each node for blockchain verification. The main abstraction for reading the blockchain requires the user to maintain a full node, run the consensus protocol and keep a local replica of the blockchain. An increase in the number of participants makes the blockchain system complex and leads to network saturation. This results in substantial transaction costs to process data, together with increased storage space, which degrades network performance. Censorship-Resistance. Data stored in the blockchain cannot be tampered with during or after block generation. An adversary will fail to modify historical data stored on the blockchain because of the cryptographic techniques used in distributed blockchain storage: (1) asymmetric keys, which each node uses to sign transactions and verify their integrity, and (2) hash functions, mathematical algorithms that map arbitrary-size data to a unique fixed-length binary output. For a hash function (e.g., SHA-256), it is computationally infeasible to recover the input from the output hash. An attacker fails to tamper with a block buried in a blockchain of size t that is sufficiently secure according to [14], with t the number of blocks in the blockchain. Even if the adversary tries to cover up this tampering by breaking the hash of the previous block and so on, this attempt will ultimately fail when the genesis block is reached. It is thus complicated to modify data blocks across the distributed network. A small change in the original data makes the hash unrecognizably different, which secures the blockchain.
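The tamper-evidence argument reduces to a simple chain-integrity check: recompute each block's hash and compare it with the pointer stored in its successor. The Solidity sketch below illustrates the idea with hypothetical field names, using Solidity's native keccak256 in place of SHA-256:

// Chain-integrity check: any change to historical data breaks a hash pointer.
pragma solidity ^0.8.0;

library ChainCheck {
    struct BlockRecord { bytes32 prevHash; bytes data; }

    // Returns false as soon as one block's recomputed hash no longer matches
    // the pointer stored in the following block.
    function verify(BlockRecord[] memory chain) internal pure returns (bool) {
        for (uint i = 1; i < chain.length; i++) {
            bytes32 h = keccak256(abi.encodePacked(chain[i - 1].prevHash, chain[i - 1].data));
            if (chain[i].prevHash != h) return false;
        }
        return true;
    }
}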
5 The Proposed Dynamic Sharding Approach
In this section, we first discuss the overall architecture of the proposed approach, SecuSca. We then describe the dynamic sharding approach that preserves the security and the scalability of storage in the blockchain.
5.1 Architecture of SecuSca
The architecture of the proposed SecuSca approach is depicted in Fig. 3. Users trigger transactions, which are inserted into a block; the block is then added to the chain and replicated over the network. In order to preserve maximum security and scalability of the blockchain by decreasing its full replication, SecuSca creates a trade-off between security and scalability. The process comprises two steps that operate at the same time, and an optimization function is introduced to drive the sharding process: – Efficient replication: Replication provides both storage efficiency and security. This approach distributes/replicates the global state of the blockchain to some nodes of the network by dynamic sharding, without sacrificing security. The fundamental goal is to preserve the security and scalability of the blockchain
Fig. 3. An overview of the functional architecture of SecuSca
and to store data for future blockchain applications, which are varied; even a node with little memory or storage can participate in the protocol. – Efficient reduction: While an inserted block is being replicated, a reduction step operates in order to reduce the replication of some older blocks over the network, thereby scaling up the blockchain. This frees more storage in the blockchain network. In the next section we detail the SecuSca approach.
5.2 The Process of SecuSca Approach
Step 1: Replication Model. In SecuSca, we are interested in storing the transaction history. Unfortunately, for scalability reasons, all nodes cannot store the whole state. The blockchain is distributed over a number of nodes that store the blocks b_i (where i = {1, ..., B}); each node is denoted n_j (where j = {1, ..., N}). The blocks are propagated throughout the network and stored on n nodes, with n < N. The replication of each block is governed by a sharding function R of its depth d in the chain: a block keeps its initial replication while d ≤ γB, and its replication decreases once d > γB.
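Based on the behavior reported in Sect. 7 (replication constant at the upper bound αN while a block is shallow, then decreasing towards the floor α0), the sharding function can be pictured as a piecewise rule of the following form; the decay shape, written here as linear with a rate λ, is our illustrative assumption and not necessarily the paper's exact function:

R(d) = αN                          if d ≤ γB,
R(d) = max(α0, αN − λ(d − γB))     if d > γB.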
7 Implementation and Discussion
We simulate a blockchain in order to validate how our proposed SecuSca offers a trade-off between security and scalability.
7.1 Experimental Setup
We evaluate our approach and compare it with a traditional blockchain (Bitcoin) in two experiments, by varying the block replication and the block depth. We run multiple simulations with different values of the parameters α and γ in order to set the upper bound αN and the lower bound α0. We then give the size of the whole blockchain sharded over the network.
7.2 Experimental Studies
In the experiments, we studied two aspects: the efficiency in terms of block replication, and the size of the blockchain.
– Replication of blocks. When a miner produces a new block, it is broadcast over the network and added to the local disk of nodes. First, we simulate the function R(d) with 100 nodes of 200 GB storage capacity each; see Fig. 4(a) and (b). Each node applies the sharding optimization function R. At the start of the process, the new block has a depth of zero: no block is chained to it yet, and its replication is high. We fixed the parameters to (α = 0.5, γ = 0.5), and then to (α = 0.7, γ = 0.6). For the first simulation, the upper bound is 50, so the block is replicated on 50 nodes. This initial replication remains constant until the block reaches a depth of γB. A second simulation, with 200 nodes of 100 GB storage capacity each, is given in Fig. 4(c) and (d). The replication of each block depends on its depth in the blockchain and decreases as the block's depth increases. Our system reduces the replication of a block down to α0, the lowest number of replicas of a block that will be kept in the blockchain. For our experimental results, we consider that 15 replicas of a block are sufficient to maintain the state of the blockchain as its size increases (α0 = 15). – Size of the network. In the experiment that follows, we run the algorithm over 100 nodes that produce 300 blocks of 1 MB each. Figure 5 gives an overview of the evolution of the size of the blockchain, for both the traditional blockchain and SecuSca. The experiment reveals that the storage overhead becomes significant in the traditional blockchain, traced in red. The size of SecuSca is the sum of all blocks stored on each node, from the first block to the last; the sum of all transactions of all shards is traced in blue. From [0–200], every block is highly replicated, and the size grows linearly. After that, nodes start reducing replication down to the lower bound of our approach. The size t_n of the shard on each node n at different steps is given by:
b R(i, t) i=0
N
(1)
where N is the number of nodes, i is the position of block b_i in the chain, and t is the size of the blockchain. – Discussion. This section discusses the effectiveness of our approach and certain aspects that were not analyzed. This work aims to design an optimal function for blockchain scalability that maintains security. The above analytical and numerical results show how our proposed SecuSca approach promotes scalability and enables users to store more transactions by freeing up local disk space. Nevertheless, SecuSca needs further improvements. For instance, it should allow the blockchain to continue to function even when some nodes delete transactions from their local chain. At each transaction verification, if the needed data is not in the local chain
Fig. 4. The replication of blocks according to block depth
Fig. 5. Comparison of the size of blockchain between traditional blockchain and SecuSca
of each node, this node must retrieve it from another node, despite the state being shared between nodes. Inter-shard communication should be part of the consensus protocol.
8 Conclusion
In this paper, we have proposed a new design of blockchain that makes a trade-off between security and scalability, allowing more storage capacity in the whole blockchain. We developed a dynamic sharding approach, formulated as an optimization problem that preserves security while scaling up the blockchain. We have conducted various simulation-based experiments. These experiments have shown promising results and a significant improvement over traditional blockchain. As future work, we plan to investigate querying our new design of the blockchain to access the data.
References
1. Hyperledger Architecture, Volume 1: Introduction to Hyperledger business blockchain design philosophy and consensus. https://www.hyperledger.org/wp-content/uploads/2017/08/Hyperledger_Arch_WG_Paper_1_Consensus.pdf. Accessed 20 May 2021
2. Amiri, M.J., Agrawal, D., El Abbadi, A.: On sharding permissioned blockchains. In: IEEE International Conference on Blockchain (Blockchain 2019), Atlanta, GA, USA, 14–17 July 2019, pp. 282–285. IEEE (2019)
3. Amiri, M.J., Agrawal, D., El Abbadi, A.: SharPer: sharding permissioned blockchains over network clusters. CoRR, abs/1910.00765 (2019)
4. Chen, J., Micali, S.: Algorand: a secure and efficient distributed ledger. Theor. Comput. Sci. 777, 155–183 (2019)
5. Dang, H., Dinh, T.T.A., Loghin, D., Chang, E.-C., Lin, Q., Ooi, B.C.: Towards scaling blockchain systems via sharding. In: Proceedings of the 2019 International Conference on Management of Data (SIGMOD 2019), Amsterdam, The Netherlands, pp. 123–140. ACM (2019)
6. Goodman, L.M.: Tezos: a self-amending crypto-ledger. White paper (2017). https://tezos.com/whitepaper.pdf. Accessed 20 May 2021
7. Kokoris-Kogias, E., Jovanovic, P., Gailly, N., Khoffi, I., Gasser, L., Ford, B.: Enhancing Bitcoin security and performance with strong consistency via collective signing. CoRR, abs/1602.06997 (2016)
8. Kokoris-Kogias, E., Jovanovic, P., Gasser, L., Gailly, N., Syta, E., Ford, B.: OmniLedger: a secure, scale-out, decentralized ledger via sharding. In: 2018 IEEE Symposium on Security and Privacy (SP 2018), San Francisco, CA, USA, pp. 583–598. IEEE Computer Society (2018)
9. Leung, D., Suhl, A., Gilad, Y., Zeldovich, N.: Vault: fast bootstrapping for the Algorand cryptocurrency. In: 26th Annual Network and Distributed System Security Symposium (NDSS 2019), San Diego, CA, USA. The Internet Society (2019)
10. Lu, Y., Tang, Q., Wang, G.: Generic superlight client for permissionless blockchains. CoRR, abs/2003.06552 (2020)
11. Luu, L., Narayanan, V., Zheng, C., Baweja, K., Gilbert, S., Saxena, P.: A secure sharding protocol for open blockchains. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS 2016), Vienna, Austria, pp. 17–30. ACM (2016)
12. Maymounkov, P., Mazières, D.: Kademlia: a peer-to-peer information system based on the XOR metric. In: IPTPS 2002. LNCS, vol. 2429, pp. 53–65. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-45748-8_5
13. Nakamoto, S.: Bitcoin: a peer-to-peer electronic cash system. White paper (2008)
14. Ozisik, A.P., Bissias, G., Levine, B.N.: Estimation of miner hash rates and consensus on blockchains (draft). CoRR, abs/1707.00082 (2017)
15. Harmony team: Harmony. Technical white paper (2017). https://harmony.one/whitepaper.pdf. Accessed May 2021
16. ZILLIQA team and others: The Zilliqa. Technical white paper (2017)
17. Buterin, V.: Ethereum white paper (2017). https://ethereum.org/en/whitepaper/. Accessed 20 May 2021
18. Zamani, M., Movahedi, M., Raykova, M.: RapidChain: scaling blockchain via full sharding. In: Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security (CCS 2018), Toronto, ON, Canada, pp. 931–948. ACM (2018)
BC-HRM: A Blockchain-Based Human Resource Management System Utilizing Smart Contracts

Heba Adel1, Mostafa ElBakary1, Kamal ElDahshan2, and Dina Salah3,4(B)

1 Arab Academy for Science, Technology and Maritime Transport, Alexandria, Egypt
2 Al-Azhar University, Cairo, Egypt
3 Sadat Academy for Management Sciences, Cairo, Egypt
4 The American University in Cairo, Cairo, Egypt
[email protected]
Abstract. Blockchain technology offers several advantages to Human Resource Management (HRM): easier verification of job candidates' education and skills; efficient recording of job candidates' education, expertise, and training; easier payment of cross-border transfers, international expenses, and tax liabilities; elimination of fraud; and improved cybersecurity. An in-depth literature review was conducted, followed by an extensive analysis of several blockchain HRM systems, which resulted in the design of BC-HRM, a Blockchain-Based Human Resource Management system utilizing smart contracts. A test plan with 15 test cases was devised and executed by HRM employees to test both the functional and non-functional requirements of the proposed system. The evaluation results via the System Usability Scale (SUS) model showed an overall success rate of 85%.

Keywords: Blockchain · Human resource management · Smart contracts
1 Introduction
Blockchain technology has profoundly changed financial, governmental and industrial operations. Although blockchain advances have mostly dominated the financial segment over recent years, blockchain technology's distinctive properties of integrity and decentralization have driven its further exploitation in diverse domains [1]. Human Resources Management (HRM) follows a strategic approach rooted in an organizational and societal context for employee management and improvement [2]. HRM has four main functions: first, staffing, which focuses on recruitment, compensation, and retaining employees; second, performance, which concentrates on training, rewards, labor and union-management relations; third, change management, which addresses enhancements of employees' involvement
and conflict management; and fourth, administrative activities, which handle record keeping and legal compliance [3]. Up to the current time, there is a lack of a unified system for verifying the skills and competencies of employees [4]; recommendation letters mostly exhibit broadness, subjectivity or fraud, and the recruitment process consumes considerable time, effort and cost. Business endurance is highly dependent on the selection and sustainability of the workforce, and the usage of blockchain in the HRM domain is under-represented although its impact can be transformational for both organizations and current and prospective employees. HRM can benefit from blockchain in several aspects [5]: – blockchain can improve fraud detection and ease the recruiters' verification of employees' skills, education, training, expertise, achievements and certifications, thus reducing hiring costs. – blockchain can create reputation systems for employees via recording document-based evidence related to employees' education, expertise, training, and achievements in the workplace, thus providing authentic and trustworthy employee records. This will save companies the time, effort and money spent by HRM staff in verifying employees' credentials via background checks. – blockchain can reduce contracting costs via handling employees' payments more efficiently, for example, cross-border transfers, international expenses, and tax liabilities, especially those that involve exchange rates, through utilizing smart contracts and encoding the entire compensation control procedure on the blockchain, along with the associated taxation laws and codes. This paper presents BC-HRM, a BlockChain-based Human Resource Management system utilizing smart contracts, which achieved the following: – verifying the skills, education, training, expertise, achievements and performance of prospective employees to enable an optimal job allocation process. – creating a reputation system for employees that tracks and verifies the employees' skills, education, training, expertise, achievements and performance via the utilization of smart contracts to govern and orchestrate all the interactions and transactions between HR and employees. – developing a personal cabinet for employees to record their skills, education, training, expertise, and achievements. – sharing secure HRM data across organizations. – enabling organizations to open new associations outside their limits via reducing contracting costs. – tracking and verifying all tasks and transactions performed by current employees to identify best performers. The rest of this paper is structured as follows: Sect. 2 explores the utilization of smart contracts in the HRM domain; Sect. 3 summarizes the relevant literature; Sect. 4 highlights the method of research; Sect. 5 provides details on the analysis and design of BC-HRM; Sect. 6 reports on the evaluation of the proposed system; and finally, Sect. 7 presents the conclusion and future work.
2 Smart Contracts
Smart contracts are cryptographic boxes with a conditionally unlocked value [6]. They shift traditional contract enforcement from judicial dependency to code dependency, and offer deterministic, auditable, and verifiable contracts on the decentralized blockchain network [7].
2.1 Conventional Contracts Versus Smart Contracts
Smart contracts differ from conventional contracts in the following aspects: – rights and responsibilities: conventional contracts involve sets of rights and responsibilities that operate in accordance with each other, whereas smart contracts include a number of conditional statements where both the set of obligations and rights and the roles of the executing functions are structured to provide a commonly settled transaction [8]. – legality: conventional contracts are controlled by legal conditions, whereas smart contracts are self-executing computer programs [8]. – modification: conventional contracts are flexible and timely modifiable with the agreement of the involved parties, whereas people's desires cannot modify smart contracts [9]. – manipulation: conventional contracts can be easily manipulated, whereas it is not possible to tamper with smart contracts [9]. – interaction: conventional contracts need human intervention, while smart contracts limit human interference [9].
2.2 Smart Contracts and Ethereum
Ethereum offers an alternative protocol for constructing decentralized, efficient and secure applications [6]. Smart contracts can be built on top of the Ethereum platform with power greater than that of Bitcoin scripting; this power is attributed to the supplementary capabilities of blockchain-awareness, Turing-completeness and value-awareness.
3 Related Work: Blockchain Technology Role in Human Resources Management
Several practitioners and academics have investigated blockchain technology utilization in the HRM domain. Chen et al. [10] discussed applying blockchain technology to develop a cross-domain Talent Management System (TMS) to record data related to the online learning history, project outcomes, and certificates of completion of 698 interns. This offers profound advantages to research establishments, schools and organizations, who can efficiently access the interns' digital Curriculum Vitae (CV), thus reducing the cost of resume verification and employee recruitment [10].
Z. Zaccagni et al. [12] proposed a decentralized approach to match students' gained knowledge with employers' required knowledge. They used blockchain smart contracts to verify students' credentials and keep their records, applying the topics micro-accreditation from the CAE framework to courses and related tasks [12]. T.-H. Kim [13] suggested a blockchain-based HRM model that would reduce job fraud. A privacy-preserving framework was utilized for HR record management. Wallets were created with the corporation's ID and a public-private key pair, used along with privacy-parameter mapping with hashes. The keys were utilized to offer confidentiality, authentication and integrity [13]. The effect of blockchain technology on HRM was explored by X. Wang et al. [14], Chris et al. [15], Lukić et al. [16], and D. Salah et al. [17]. X. Wang et al. [14] created a model for merging conventional encryption technology with internet distributed technology. They constructed a model for HRM to solve the problem of discriminating the authenticity of HRM information and to promote the utilization efficiency and impact of HRM information. Bitcoin was used to validate the HRM data, and to link and document employees' information and documents [14]. The implications of blockchain for the future of HRM functions were discussed by Chris et al. [15]; they assembled an expert round table consisting of HR leaders from various industries as well as blockchain technology specialists, and reported the blockchain potential in areas such as recruiting, talent management and skills development, while also reporting how to minimize the risks that can occur with early adoption of blockchain technology. Advantages of blockchain technology were explored as well: it was reported that blockchain can eliminate the need for a back office, because blockchain settlement is instantaneous, with no need for reconciliation, receipts, purchase orders or the other traditional components of a transaction; it removes the need for a third party in transactions; and it provides immutable proof of transaction occurrence, in addition to utilizing smart contracts to enable the integration of business logic into a single transaction [15]. J. M. Lukić et al. [16] also highlighted how blockchain technology can transform the recruitment process, since recruiters can use blockchain technology to search for job candidates. Moreover, blockchain technology can aid in integrating job candidates' education, media and civil records, professional license verification, local criminal records, and motor vehicle records. Furthermore, blockchain allows for the standardization of prospective candidates' career profiles, which will result in leveraging resume quality and ease the recording and tracking of career development with Quick Response (QR) codes on a candidate's profile. This will offer recruiters transparency and easier validation of the resume content of higher-quality candidates [16]. Moreover, J. M. Lukić et al. pointed out how blockchain technology has led to the evolution of new technical and non-technical positions as well as changing recruitment processes. Examples of the key job positions that were suggested to be essential in organizations that use or plan to use blockchain technology
are: blockchain developer, project manager, designer, quality engineer, legal consultant, technical researcher/marketer, concept developer, analyst, research scientist, architect, back-end engineer, algorithm engineer, and staff engineer [16]. D. Salah et al. [17] conducted ten one-to-one semi-structured interviews with human resource management experts. This study revealed two themes: the first covered the possible application areas of blockchain in the HRM domain, and the second revealed the possible challenges of adoption. The first theme demonstrated that the possible utilization of blockchain in HRM activities includes its usage in the following: performance appraisal, verification of training centers' and trainers' credentials, salary surveys and payment of employees' salaries, verification of references, and verification of criminal and medical records. The second theme was associated with the possible challenges of adoption that could restrict the usage of blockchain in HRM activities. Those challenges were: lack of support, lack of international widespread usage of blockchain technology, lack of proficiencies, fear of layoff, security vulnerabilities, lack of funding, and the need for proof of success [17]. Although several researchers and practitioners have investigated blockchain technology usage in the domain of human resource management, numerous research areas are not fully explored yet. Examples of those areas include investigating blockchain's societal impact on the job market and on social behavior, in addition to blockchain utilization in reducing contracting costs via efficient handling of payments, for example, cross-border transfers, international expenses, and tax liabilities, especially those that involve exchange rates, via encoding the entire compensation control procedure on the blockchain. Furthermore, although there are plenty of theoretical papers that discuss the potential of blockchain utilization in the HRM domain, there is a scarcity of empirical studies that investigate its usage in industrial contexts. Moreover, there is an absence of an HRM system that acts as a proof of concept utilizing blockchain in tracking and checking employees' qualifications, skills and appraisals, and creating an employee personal cabinet in which users can add their skills in order to be assigned to the most convenient role and share this valuable data across organizations in case the employee applies to a new organization. In addition, the utilization of smart contract functions to track and validate HRM transactions and processes has not been explored up to the present moment.
4 BC-HRM Analysis, Design and Implementation
Our goal was to create BC-HRM, an HRM system based on the blockchain that utilizes smart contracts. BC-HRM achieves the following: – developing a personal cabinet for employees to add their skills, education, training, expertise, and achievements.
– creating an inter-organizational reputation system for employees that tracks and verifies their skills, education, training, expertise, achievements and performance via the utilization of smart contracts to orchestrate and govern all the interactions and transactions between HR and employees. – verifying the skills, education, training, expertise, achievements and performance of prospective employees to enable optimal job allocation. – sharing secure HRM data across organizations. – enabling organizations to open new associations outside their limits via reducing contracting costs. – tracking and verifying all transactions and tasks performed by current employees to identify high-caliber employees. The blockchain platform of choice for implementing the Blockchain-Based Human Resource Management System (BC-HRM) was Ethereum, since Ethereum promotes a peer-to-peer trust mechanism based on the consensus of the majority of nodes. Moreover, in the Ethereum network, operations occur in real time and blocks are written to the final chain in exchange for Ethers (the Ethereum currency). Ethers are offered to miners in return for their computation time and power. Ethereum smart contracts were utilized to eliminate third-party intermediaries, thus lowering operational costs and providing more secure and verified services. In order to achieve the BC-HRM system goals, it was designed and implemented according to a set of components illustrated in Fig. 1.
Fig. 1. BC-HRM system components
Fig. 2. BC-HRM secure HRM
Since BC-HRM deals with sensitive organizational and employee data, a secure process was implemented to achieve data privacy, as shown in Fig. 2. The consensus mechanism Proof-of-Work (PoW) was used for adding and verifying transactions. PoW is characterized by the following [18]:
– it ought to be difficult and time-consuming for any miner to produce a proof that meets certain requirements.
– it must be easy and fast for others to verify the correctness of that proof.
As illustrated in Fig. 3, miners compete against one another in solving computationally complex puzzles that are profoundly challenging to solve; nevertheless, once those puzzles are solved, their solutions can be rapidly verified (see the sketch after the feature list below). When a miner reaches a new block's solution, they broadcast it to the whole network. The block is then confirmed and the blockchain is updated.
Fig. 3. BC-HRM consensus mechanism
Figure 4 illustrates the variety of features offered by BC-HRM. The BC-HRM system offers the following features:
– data security.
– recruitment process support via identifying job vacancies and their requirements.
– contract signing via smart contracts to prohibit fraud attempts.
– tracking employees' education, training, experience, performance appraisal, etc.
– verification of employees' qualifications and experiences for job applicants and new hires.
– allowing job seekers to apply to the most suitable occupations.
– allowing current employees to check their appraisal records.
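To make the two PoW properties above concrete, the following minimal Python sketch mimics a hash-based PoW puzzle: finding a nonce whose hash starts with a given number of zero digits is slow, while checking a claimed solution takes a single hash. This is an illustrative simplification, not the actual Ethereum mining algorithm.

```python
import hashlib

def mine(block_data: str, difficulty: int) -> int:
    """Search for a nonce so that sha256(block_data + nonce) starts
    with `difficulty` zero hex digits (hard, time-consuming)."""
    target = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{block_data}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce
        nonce += 1

def verify(block_data: str, nonce: int, difficulty: int) -> bool:
    """Checking a claimed solution needs only one hash (easy, fast)."""
    digest = hashlib.sha256(f"{block_data}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)

if __name__ == "__main__":
    nonce = mine("employee-record-tx", difficulty=4)
    print(nonce, verify("employee-record-tx", nonce, 4))  # prints the nonce and True
```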
4.1
Smart Contract Execution
A smart contract is blockchain-executable code: a mechanism involving digital assets and two or more parties, which distributes those assets based on rules written into the contract. The peers of the Ethereum network execute smart contracts roughly every fifteen seconds, and at least two users must verify a contract for activation before its functions are executed [19]. BC-HRM contracts were written in the Solidity programming language to provide the following functionalities:
– sharing employees' records and the organization's data securely.
– helping employees and HRM negotiate a data sharing agreement and generate a custom smart contract.
– making a series of contract arrangements to stimulate and restrain the behavior of employees.
Fig. 4. BC-HRM features
– preventing breach, forgery, or imitation: in the event that the smart contract algorithm determines that specific conditions have been met, it automatically executes the terms stipulated in the contract. Moreover, it enables parties to recommit to the terms being executed without depending on a central authority.
– ensuring effective identification, renovation, and restriction of data access in the HRM system.
– checking an employee's identification when the employee desires to add records: if it is not valid, a message is sent using the smart contract; if it is valid, the record is added.
– handling HRM data with quality of service, i.e., trust and transparency.
– orchestrating and governing all interactions and transactions between HRM and employees.
– enabling remote employee monitoring.
Smart contracts were implemented using the Remix Ethereum Integrated Development Environment (IDE) for writing Solidity contracts, since it offers several useful features, including a GUI code editor, being open source, supporting testing of smart contracts and wallets, debugging and deploying smart contracts, and providing a private network. Messages were produced by smart contracts and were emitted via calling a smart contract function. Figure 5 shows the various smart contract functions implemented in the BC-HRM system.
Fig. 5. BC-HRM smart contract functions
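As an illustration of the record-validation rule described above, the following Python sketch models the control flow that such a smart contract function would encode. The actual BC-HRM contracts are written in Solidity; the names is_valid_employee and add_record, and the sample data, are hypothetical.

```python
# Hypothetical, language-agnostic model of the BC-HRM "add record" rule.
registry = {"0xA1": "Alice"}   # known employee addresses (illustrative)
records = {}                   # employee address -> list of records

def is_valid_employee(address: str) -> bool:
    return address in registry

def add_record(address: str, record: str) -> str:
    # Mirrors: "if it is not valid, a message is sent using the smart
    # contract; if it is valid, the record is added."
    if not is_valid_employee(address):
        return "Invalid employee identification"
    records.setdefault(address, []).append(record)
    return "Record added"

print(add_record("0xA1", "Completed Solidity training"))  # Record added
print(add_record("0xB2", "Some record"))                  # Invalid employee identification
```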
4.2
BC-HRM Application
The development tool utilized was Microsoft Visual Studio, since Solidity and smart contracts work with Ethereum blockchains that integrate with Visual Studio. The Azure Blockchain Development for Ethereum extension in Visual Studio Code was used to create and execute smart contracts. The BC-HRM application is composed of several forms, for example, Vacancy, Applicant, Job, Job Seeker, Organization, Project, and Skill. BC-HRM application users can be job seekers, employees, HRM employees, or managers. The job seeker user is able to view all the available occupations and apply to any job of interest via the applicant form. The employee user and the HRM employee user are able to log in to the system and manage their work according to their system privileges. Finally, the manager user is able to add a new organization, branch, department, project, skill, vacancy, or job, and can also view the existing organizations, branches, departments, etc., using the smart contract with the aid of calling the different functions. Figure 6 provides a scenario of how a job seeker approaches the BC-HRM system to find a job.
Fig. 6. BC-HRM jobseeker scenario
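As a sketch of how an application layer like BC-HRM can call smart contract functions, the snippet below uses the web3.py library. The contract address, the ABI, and the function names addVacancy and getVacancies are hypothetical placeholders, not the actual BC-HRM artifacts.

```python
from web3 import Web3

# Connect to an Ethereum node (a local development node is assumed here).
w3 = Web3(Web3.HTTPProvider("http://127.0.0.1:8545"))

CONTRACT_ADDRESS = "0x0000000000000000000000000000000000000000"  # placeholder
ABI = []  # paste the ABI produced by the Solidity compiler for the HRM contract

hrm = w3.eth.contract(address=CONTRACT_ADDRESS, abi=ABI)

# Manager adds a vacancy; the transaction is mined into the chain.
tx_hash = hrm.functions.addVacancy("Back-end engineer", "Dept-7").transact(
    {"from": w3.eth.accounts[0]}
)
w3.eth.wait_for_transaction_receipt(tx_hash)

# Job seeker reads the open vacancies (a free, read-only call).
print(hrm.functions.getVacancies().call())
```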
4.3
BC-HRM Database
The BC-HRM database was built using Microsoft SQL Server and consisted of 22 tables, as shown in Fig. 7. Different users can perform Data Manipulation Language (DML) operations according to their set privileges; for example, the HRM employee can perform DML operations on several tables, including the departments, applicants, contracts, employees, and job vacancies tables, whereas the manager user is able to execute DML operations on the organizations, branches, departments, projects, skills, and job vacancies tables (a sketch of such role-based DML privileges follows below).
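The following Python sketch illustrates how such role-based DML privileges can be enforced at the application layer. It uses the standard-library sqlite3 module for self-containedness (BC-HRM itself uses Microsoft SQL Server), and the role-to-table mapping shown is a simplified assumption based on the description above.

```python
import sqlite3

# Simplified privilege map derived from the description above (assumed).
PRIVILEGES = {
    "hrm_employee": {"departments", "applicants", "contracts",
                     "employees", "job_vacancies"},
    "manager": {"organizations", "branches", "departments",
                "projects", "skills", "job_vacancies"},
}

def execute_dml(conn, role, table, sql, params=()):
    """Run a DML statement only if `role` is privileged on `table`."""
    if table not in PRIVILEGES.get(role, set()):
        raise PermissionError(f"{role} may not modify {table}")
    conn.execute(sql, params)
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job_vacancies (title TEXT, dept TEXT)")
execute_dml(conn, "manager", "job_vacancies",
            "INSERT INTO job_vacancies VALUES (?, ?)", ("Analyst", "HR"))
print(conn.execute("SELECT * FROM job_vacancies").fetchall())
```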
5
BC-HRM Evaluation
BC-HRM was tested by following a set of steps that started with devising a test plan, which included determining the testing scope, the entry and exit criteria, the HRM functional and non-functional test cases and their respective success criteria, the testing schedule, and the testing task assignments. This was followed by considering several usability testing questionnaires; the final choice was made between the Post-Study System Usability Questionnaire (PSSUQ) and the System Usability Scale (SUS).
Post-Study System Usability Questionnaire (PSSUQ). PSSUQ is a standardized 16-item questionnaire that is widely used to measure users' satisfaction with software, websites, or products. It stemmed from IBM's System Usability Metrics
Fig. 7. BC-HRM database tables
(SUM) project in 1988. Continuous revision and improvement of SUM resulted in version 3 of PSSUQ.
System Usability Scale (SUS). SUS is a fast and trustworthy usability measurement tool created in 1986. It includes 10 questions with five answer alternatives each. It allows the assessment of software, hardware, websites, and mobile devices.
After carefully scrutinizing both PSSUQ and SUS in order to decide which one to use, SUS was chosen since its questions cover both functional and non-functional requirements. Moreover, it offers several advantages: it has an easy scale, can be used reliably on small sample sizes, and can effectively distinguish between unusable and usable systems. After choosing SUS as our testing questionnaire, system testing was conducted with several HRM staff members holding various HRM roles, including assistant HR manager, director of employment and recruiting, executive recruiter, HR payroll specialist, and senior HRM specialist. Participants were subjected to fifteen carefully designed test cases and were asked to fill in the SUS template after finishing the test cases.
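For reference, SUS responses are conventionally scored as follows: odd-numbered items contribute (response − 1), even-numbered items contribute (5 − response), and the sum is multiplied by 2.5 to yield a 0–100 score. A minimal Python sketch follows; the sample responses are invented for illustration only.

```python
def sus_score(responses):
    """Compute the SUS score from ten 1-5 Likert responses."""
    assert len(responses) == 10
    total = 0
    for i, r in enumerate(responses, start=1):
        total += (r - 1) if i % 2 == 1 else (5 - r)
    return total * 2.5

# One participant's (invented) answers to the ten SUS items:
print(sus_score([5, 2, 4, 1, 5, 2, 4, 1, 5, 3]))  # 85.0
```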
The evaluation results via the SUS model showed an overall success rate of 85%, as shown in Fig. 8. Thus, BC-HRM achieved its intended objectives.
Fig. 8. BC-HRM evaluation results
6
Limitations
The proposed system was tested on only 5 users, since we adopted a combination of discount usability engineering principles and Rapid Iterative Testing and Evaluation (RITE), focusing on testing with a limited number of users and then redesigning user interfaces quickly and cheaply according to the acquired feedback.
The General Data Protection Regulation (GDPR) regulates privacy and data protection in the European Union (EU) and addresses personal data transfers outside the EU. GDPR asserts that privacy should be integrated into information systems from the design stage. To satisfy this regulation's requirements and protect data subjects' rights, appropriate organizational and technical measures need to be effectively implemented and enforced by the controller [5]. Since blockchain technology eliminates the need for trusted third parties while maintaining data protection and trustworthiness, and involves a large number of individual processor agreements, for our proposed system to be widely used all organizations that utilize it need to abide by GDPR principles.
7
Conclusion and Future Work
This paper presented an HRM system that utilizes blockchain technology. The BC-HRM system has established its capability in: sharing HRM data; allowing organizations to hire the right employees and to match those employees to the right jobs; verifying the skills and performance of employees to enable them to be allocated to the most suitable roles; allowing employees to have a comprehensive, trustworthy blockchain-based record of their skills and performance to be used in their future work; providing an employees' personal cabinet in which users can add their skills; tracking and checking employees' qualifications, skills, and performance with the power of smart contracts; orchestrating and governing all interactions and transactions between HR and employees; and tracking and checking all transactions and processes for employees. The evaluation of the HRM system via the SUS model showed an overall success rate of 85%. Although these results are positive, future work involves further testing on a larger number of users. Several relevant HRM-blockchain research areas need to be further exploited, for example, blockchain's societal impact on the job market and on social behavior, and blockchain utilization in reducing contracting costs via efficient handling of payments, for example, cross-border transfers, international expenses, and tax liabilities.
References
1. De Angelis, S.: Assessing security and performances of consensus algorithms for permissioned blockchains. arXiv preprint arXiv:1805.03490, no. 9, May 2018
2. Bratton, J., Gold, J.: Human Resource Management: Theory and Practice, 6th edn. Palgrave Macmillan, London (2017)
3. Torrington, D., Hall, L., Taylor, S., Atkinson, C.: Fundamentals of Human Resource Management. Managing People at Work, 9th edn. Prentice Hall, London (2014)
4. Johnston, P.: "ukoln.ac.uk" Society of Archivists EAD/Data Exchange Group, Public Record Office, 21 May 2002. http://www.ukoln.ac.uk/. Accessed 15 Aug 2020
5. Coita, D.C., Abrudan, M.M., Matei, M.C.: Effects of the blockchain technology on human resources and marketing: an exploratory study. In: Kavoura, A., Kefallonitis, E., Giovanis, A. (eds.) Strategic Innovative Marketing and Tourism. SPBE, pp. 683–691. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-12453-3_79
6. Buterin, V.: Ethereum white paper (2014). https://ethereum.org/en/whitepaper/. Accessed 15 Aug 2020
7. Morgado de Carvalho, T.M.: Validation of Smart Contracts through Automated Tooling. Universidade de Coimbra, Portugal (2019)
8. Werbach, K., Cornell, N.: Contracts ex machina. Duke LJ 67, 313 (2017)
9. Kõlvart, M., Poola, M., Rull, A.: Smart contracts. In: Kerikmäe, T., Rull, A. (eds.) The Future of Law and eTechnologies, pp. 133–147. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-26896-5_7
10. Chen, Y.-C., Wu, H.-J., Wang, C.-P., Yeh, C.-H., Lew, L.-H., Tsai, I.-C.: Applying blockchain technology to develop cross-domain digital talent. In: 2019 IEEE 11th International Conference on Engineering Education (ICEED), pp. 113–117. IEEE (2019)
11. Cherkasov, A.: Platform for the Recruitment Industry on the Ethereum Blockchain, 12 March 2018
12. Zaccagni, Z., Paul, A., Dantu, R.: Micro-accreditation for matching employer e-hire needs. In: 2019 IEEE International Conference on Blockchain (Blockchain), pp. 347–352. IEEE (2019)
13. Kim, T.-H.: A privacy preserving distributed ledger framework for global human resource record management: the blockchain aspect. IEEE Access 8, 96455–96467 (2020)
14. Wang, X., Feng, L., Zhang, H., Lyu, C., Wang, L., You, Y.: Human resource information management model based on blockchain technology. In: IEEE Symposium on Service-Oriented System Engineering (SOSE), pp. 168–173 (2017)
15. Na, S., An, J., Yang, J., Park, Y.B.: Smart contract for rule management using semantic web rule language. In: ICAEIC, Mauritius (2018)
16. Lukić, J.M., Salkić, H., Ostojić, B.: New job positions and recruitment of employees shaped by blockchain technologies. In: Leadership & Management: Integrated Politics of Research and Innovations (2018)
17. Salah, D., Ahmed, M.H., ElDahshan, K.: Blockchain applications in human resources management: opportunities and challenges. In: Proceedings of the Evaluation and Assessment in Software Engineering, Trondheim, Norway, pp. 383–389 (2020)
18. Zhang, R., Xue, R., Liu, L.: Security and privacy on blockchain. ACM Comput. Surv. (CSUR) 52(3), 1–34 (2019)
19. Yavuz, E., Koç, A.K., Çabuk, U.C., Dalkılıç, G.: Towards secure e-voting using Ethereum blockchain. In: 2018 6th International Symposium on Digital Forensic and Security (ISDFS), pp. 1–7. IEEE (2018)
Applicability of the Software Security Code Metrics for Ethereum Smart Contract
Aboua Ange Kevin N'DA(B), Santiago Matalonga, and Keshav Dahal
School of Computing, Engineering and Physical Sciences, University of the West of Scotland, Scotland, UK [email protected]
Abstract. The Ethereum blockchain allows, through software called smart contracts, the automation of contract execution between multiple parties without requiring a trusted middle party. However, smart contracts are vulnerable to attacks. Tools and programming practices are available to support the development of secure smart contracts. These approaches are effective in mitigating smart contract vulnerabilities, but the unsophisticated ecosystem of the smart contract prevents them from being foolproof. Besides, blockchain immutability does not allow smart contracts deployed in the blockchain to be updated. Thus, businesses and developers would have to develop new contracts if vulnerabilities were detected in their smart contracts deployed in Ethereum, which would imply new costs for the business. To support developers and businesses in smart contract security decision-making, we investigate the applicability of security code metrics from the non-blockchain domain to the smart contract domain. We use the Goal Question Metric (GQM) approach to analyze the applicability of these metrics in the smart contract domain based on metric construct and measurement. As a result, we found 15 security code metrics that can be applied to smart contract development.
Keywords: Security · Metric · GQM · Smart contract · Ethereum · Blockchain
1 Introduction

Blockchain technology has started gaining more attention from businesses and governments after the introduction of Bitcoin [1]. Defined as a shared public ledger that stores transactions in a decentralized peer-to-peer network of computers, blockchain guarantees transaction execution without third-party intervention. Initially limited to peer-to-peer payment-based transactions [1], blockchain now, through platforms like Ethereum [2] and Hyperledger [3], can also be applied in many other domains [4, 5]. For this study, we focus on the Ethereum blockchain. The Ethereum blockchain enables, through computing protocols known as smart contracts, the verification and enforcement of contract negotiation on top of the blockchain. A smart contract can be built and deployed by anyone using Solidity, one of the Turing-complete languages provided by Ethereum. However, it is publicly accessible [2]. Therefore, the smart contract code is exposed to attackers, which can lead to attacks on contracts such as the attack on the
DAO contract [6]. This attack led to a loss of 3.6M Ether. The growing number of attacks on Ethereum has forced businesses and researchers to incorporate security into the development process of smart contracts. Verification tools [7–10] and programming practices and patterns [11–13] are currently available in the literature to support this process. However, these approaches are not effective enough to be foolproof: they do not fully cover the security of the smart contract. Thus, given blockchain immutability, identifying critical vulnerabilities in a deployed contract would lead the developer or business to create and deploy a new contract. Therefore, an indication of smart contract security is necessary to help developers and businesses in decision-making before deploying smart contracts. In the non-blockchain environment, software security metrics are described as good indicators of software security. To the best of our knowledge, there are no security metrics for the Ethereum smart contract domain. We investigate the applicability of existing security code metrics from the non-blockchain environment to the Ethereum smart contract domain for contracts built with Solidity. Using the Goal Question Metric (GQM) approach, we analyzed the applicability of the security code metrics from the non-blockchain domain to the Solidity smart contract domain based on metric construct and measurement. We found that 15 among the list of 20 security code metrics are applicable to the Ethereum smart contract with respect to both construct and measurement. These metrics can be used by developers or organizations during secure smart contract development.
2 Background

2.1 Ethereum Smart Contract and Solidity

Blockchain is an encrypted transaction-based ledger running on top of a decentralized peer-to-peer network of computers known as nodes [1]. Guaranteeing the transparency, immutability, and anonymity of transactions through Bitcoin (the first implementation of the blockchain), blockchain has since been implemented in new platforms such as Ethereum. Defined as a second generation of blockchain, Ethereum implements computing protocols known as smart contracts on top of the blockchain for transaction flexibility [2]. A smart contract is software running on top of the blockchain. It is executed in a virtual machine (EVM) provided by Ethereum and located in each node [14]. It aims to encode rules to reflect any kind of multi-party interaction. A smart contract is represented in the Ethereum environment by an account, which consists of a unique address and an account state with the amount of Ether (balance), storage, and code hashes pointing toward the storage memory in the EVM and the contract code in the blockchain ledger, respectively. Residing in the blockchain ledger, the smart contract code stays immutable, thus preventing a malicious entity from tampering with the code to gain control of the smart contract. However, the immutability of the contract code does not allow the smart contract to follow the standard software life cycle: maintenance of the smart contract is impossible, and the developer cannot update a contract program containing errors unless a new version of the contract is released. For contract development, the developer can use Solidity, a Turing-complete programming language provided by Ethereum. Solidity's syntax is quite similar to
the JavaScript and Java languages. Like Java and JavaScript, Solidity supports some object-oriented and procedural constructs [15]. It also supports coupling, cohesion, and inheritance of contracts. Solidity-based contracts are executed through the EVM bytecode generated by the Solidity compiler, which is interpreted by the EVM.

2.2 Smart Contract Security

Smart contracts are increasingly being adopted to implement applications in many sectors, such as finance [16], health [4], and IoT [5]. Smart contracts enable the development of potential new business models but also introduce computing infrastructures capable of defects, errors, and failures. Delmolino et al. [17], after analysing various smart contract source codes developed by students, identified that smart contract source codes fall victim to common pitfalls: contracts are prone to logical errors, fail to use cryptography, do not incentivize users towards the expected behaviour, and are not aware of Ethereum-specific bugs such as the call-stack bug. Similarly, Luu et al. [7] highlighted gaps in the understanding of the distributed semantics of the Ethereum platform, which favour smart contract attacks; they introduced new security vulnerabilities in smart contracts and showed possible attacks based on these vulnerabilities. Atzei et al. [18] provided a systematic classification of smart contract vulnerabilities according to the vulnerability source, showing that the Ethereum blockchain, the EVM, and Solidity are sources of smart contract vulnerability. Similarly, Chen et al. [19] showed that the Ethereum blockchain, Solidity, and the EVM are sources of vulnerability by providing a taxonomy of smart contract vulnerabilities after analysing 44 vulnerabilities, 19 of which are caused by the Solidity language and misunderstandings of programming with Solidity. They suggest the development of more secure programming languages and more secure supporting tools for smart contract security.
Security, which is becoming a critical aspect of smart contract development, remains the prerogative of stakeholders, who must verify the correctness and fairness of the smart contract. Thus, research works provide tools to analyse smart contract vulnerabilities. Luu et al. [7] provide OYENTE, a static analysis tool using symbolic execution for detecting bugs at the bytecode level of the smart contract. However, this tool is not sufficient to detect most vulnerabilities identified in a smart contract, such as integer overflow/underflow [10]. Tsankov et al. [8] developed a static analysis tool, called SECURIFY, using both symbolic execution and a filter of predefined compliance and violation patterns to explore all the contract behaviours (avoiding false negatives) and also to avoid false positives. However, this tool has limitations: for instance, SECURIFY is based on properties that do not capture all the violations that might be exploited by attackers, leaving certain vulnerabilities in smart contracts. Another approach for analysing the program is to apply formal verification, which incorporates mathematical models to make sure that the code is free of errors. Bhargavan et al. [20] conducted a study on smart contracts using this approach; they outline a framework able to parse Solidity source and EVM bytecode into the functional programming language F* in order to proceed to smart contract verification. Similarly, Amani et al.
[9] presented a verification tool for smart contracts based on the Isabelle/HOL proof
assistant. However, these formal verification tools do not take into account all the EVM semantics. For instance, properties based on inter-message calls of contracts are not supported by the tool provided by Amani et al. [9], leading the related vulnerabilities to remain in smart contracts.
Some research works have been conducted to help developers adopt security practices and patterns for securing smart contracts. Security practices and patterns refer to realistic and enforceable actions that address the most common software security issues. Wohrer and Zdun [11] identified six security design patterns for smart contracts with Solidity based on grounded theory techniques. Mavridou et al. [12] proposed two security patterns that can be implemented as plugins to facilitate their application. Regarding secure programming practices, N'Da et al. [13] conducted a study characterizing the cost of security programming practices in smart contracts; as a result, they identified a list of 30 security programming practices from Java and C++ that can be used for smart contract security.

2.3 Metrics for Measuring Software Security

In the non-blockchain environment, the use of metrics has received a lot of attention. Metrics provide information on the software which can be analysed and utilised in business decisions. A software metric is defined as a quantitative measure of a given software property, which differs from a measurement, a term used in some literature to denote a metric [21, 22]. Research studies have assessed the efficiency of software metrics in security vulnerability prediction. Nguyen and Tran [23] evaluated the semantic complexity of software in vulnerability detection through complexity metrics. They extracted semantic complexity metrics from a dependency graph of the Mozilla JSE software. Using AI classification techniques (Naïve Bayes, Random Forest, Neural Network, and Bayesian Network) for the evaluation of these complexity metrics, their results showed that the false negative (FN) rate can be reduced from 89% for nesting metrics to 40%. Shin et al. [24] investigated the usefulness of metrics such as complexity, code churn, and developer activity for vulnerability prediction. Using three prediction models, one for each metric group, and a model combining these metrics based on discriminant analysis and Bayesian networks, they showed these models were able to predict about 70% of vulnerable files from two open-source projects (Mozilla Firefox and Linux Kernel) with accuracy lower than 5%. Moshtari et al. [25] replicated the Shin et al. study by considering more complete vulnerabilities and cross-project vulnerabilities that were not considered by Shin et al., evaluating complexity metrics in the prediction of vulnerability. Their results showed that about 92% of the vulnerable files were detected in the Mozilla Firefox project with a false positive (FP) rate of 0.12%. In addition, complexity metrics could detect about 70% of vulnerable files with a tolerable false positive (FP) rate of 26% for cross-project vulnerability prediction on five open-source projects. Chowdhury and Zulkernine [38] used coupling, cohesion, and complexity metrics to predict vulnerability. They performed the experiment on 52 releases of Mozilla Firefox and, through the use of the C4.5 decision tree, Random Forest, logistic regression, and Naïve Bayes, the coupling and cohesion metrics were able to correctly predict 75% of the vulnerable files with a false-positive rate of less than 30%.
Besides prediction, security metrics have also been proposed to assess security at the early stages of software development. Sultan et al. [26] provided a catalog of metrics for assessing the security risks of software throughout the Software Development Life Cycle (SDLC); they defined metrics for the requirement, design, implementation, and maintenance phases using the GQM approach. Metrics are also defined to characterise the security vulnerabilities of software systems: based on Common Vulnerability Scoring System (CVSS) research, [27, 28] provided security metrics to assess software system vulnerability.

2.4 Metrics in the Smart Contract Domain

In this section, we provide an overview of the reported research based on metrics in the smart contract context. There are few research works using metrics in the smart contract domain. Hegedus [29] proposes a set of metrics derived from non-blockchain metrics to help developers during Solidity smart contract development, and also provides a tool called SolMet to collect them. The study provided an overview of the structure of smart contracts in terms of size, complexity, coupling, and inheritance properties; its results show that these metrics have lower values in the context of smart contracts than in non-blockchain programs. Similarly, Pierro et al. [30] propose a fully web-based tool called PASO, which is able to compute smart contract metrics. They discuss a number of software code metrics derived from OO metrics in the non-blockchain environment, as well as metrics specific to the Solidity language; some of their OO metrics had already been discussed by Hegedus [29]. Vandenbogaerde [31] proposes a graph-based framework for computing OO design metrics from the non-blockchain environment in the smart contract context. The framework allows analysing the design of Solidity smart contracts through simple queries which can extract the contract functions and design metrics from a generated graph-based semantic meta-model. After applying this framework to a list of contracts, the author mentions that most contracts in Solidity have some similarity to Java practice when creating classes, and also that contract coupling and inheritance are less utilized in smart contracts. Unlike Hegedus and Vandenbogaerde, whose studies were oriented only towards smart contract structure through metrics, Ajienka et al. [32] investigated OO metrics for smart contract structure as well as their correlation with gas consumption in smart contracts. The authors use SolMet and the Truffle suite [33] to extract, respectively, the OO metrics and the deployment cost of a list of contracts from a GitHub project. Using Spearman's rank correlation method, the authors show there is a statistically significant correlation between some of the OO metrics and the resources consumed on the Ethereum blockchain network when deploying the contract. In contrast to these papers, our study focuses on the security aspect of the smart contract by investigating the security code metrics from the non-blockchain environment for Solidity smart contracts.
3 Applicability of Software Security Metrics for Ethereum Smart Contracts

In this section, we present an investigation of the applicability of security metrics for Solidity smart contracts. To design our research we exploit the GQM approach [34]. This section describes the research method and results.

3.1 Method and Process

Our goal is to characterise the applicability of security metrics for Ethereum smart contracts written in Solidity. To achieve this, we evaluate the potential applicability of the non-blockchain security metrics to Ethereum smart contracts written in Solidity. Table 1 presents the GQM approach used to design this research [34].

Table 1. GQM model for the applicability of the security source code metrics in the smart contract context

Purpose: Analyze source code metrics for measuring security in non-blockchain software development
With the purpose of: Characterizing their applicability in smart contract development
With respect to: (a) the applicability of construct: identification of the concepts built around the metric and their potential interpretation in the blockchain domain; (b) the capability of measurement: obtaining the metric measures based on the measurement process in the targeted domain
Point of view: Researcher
In the context of: The software coding process of smart contracts in Solidity for the Ethereum blockchain platform. Subjects are smart contracts obtained from Ethereum and published peer-reviewed papers
We define the following two research questions and their derived metrics based on the research needs developed through the GQM.
Research Question 1 (Applicability). Is it possible to identify the constructs embedded in the security metrics and interpret them in the Solidity smart contract domain?
– Metric: Qualitative judgement and "Yes/No" dictum.
Research Question 2 (Capability). Is it possible to perform the measurement (defined in the metric) in the Solidity smart contract domain?
– Metric: Qualitative judgement, "Yes/No" dictum, and corresponding corroborating evidence.
Instrumentation. To answer Research Question 1 (related to the construct), we first collected the security code metrics available in the non-blockchain environment, following a search strategy for retrieving the relevant literature on security code metrics. Our inclusion criteria comprise references that focus on source code security metrics, are publicly refereed, and whose metrics are well defined and provide evidence that they measure security. For each identified metric, we extracted the following data (based on the characterisation provided in [35]):
• Metric name: the given name of the metric.
• Subject: the type and description of the entity being measured.
• Attribute: the property of the feature of the subject being measured.
• Measurement: the process describing how the subject attribute is measured.
• Validation approach: the validation process used to evaluate the reliability of the metric based on the attribute being measured.
As a result, we selected from the plethora of security metrics in the literature a list of 20 relevant security code metrics (see Table 2) as the candidates for our study. For each security code metric, we applied judgment and our experience to critically appraise its construct. Our judgment is based on the interpretation of the metric constructs in the smart contract environment; hence, we deem a metric applicable based on its construct when that construct can be interpreted in the Ethereum smart contract environment.
To answer Research Question 2 (related to measurement), we analyse the measurement process against the smart contract environment for each security metric used to answer RQ1. If the measurement of a security code metric can be performed in the context of the smart contract, then, to ensure the consistency of our judgment, we try to collect its measure from two testimonial contracts. The applicability of the metric is decided based on its measure. The two testimonial contracts used in this study are the E-voting and supply chain contracts. The E-voting contract is provided by the Ethereum platform [36]. The second is a module for the supply chain that consists of two smart contracts developed in another research study [37]. We claim that these two smart contracts provide a representative sample of smart contracts in Ethereum. Firstly, the E-voting development is independent of this research; it is a single smart contract with the simple structural design encouraged by Ethereum (compared to traditional software), thus representing most of the smart contracts in Ethereum. Secondly, the supply chain smart contract has been validated by the research community and (compared to the E-voting contract) provides attack surfaces that stem from the interaction between contracts.

3.2 Result

In this section, we present the results of the analysis of the applicability of the security code metrics from the non-blockchain environment to the smart contract domain with Solidity.
RQ1: Analysis of the Applicability of the Security Metrics Based on the Metric Construct. To determine the applicability of the security code metrics based on their constructs in the smart contract context, we reviewed the literature and identified 20 security code metrics, as shown in Table 2. The identified metrics can be grouped into OO metrics (WMC, DIT, CBO, NOC, RFC, LCOM), complexity metrics (McCabe complexity, Halstead volume, CER), and exception metrics (NCBC, Rserr, EHF). For each metric, we identified and compared its construct against the smart contract environment by considering Solidity language constructs, the Ethereum blockchain mechanism, and the theoretical basis. The construct of a metric is deemed applicable when the construct is represented (implemented) in the smart contract environment. As a result, we found 15 security code metrics, shown in Table 2, whose construct is applicable in the smart contract context.
Since space constraints limit our capacity to provide detail on each metric, this paragraph presents the type of discussion that led our judgment. For instance, the constructs of metrics such as DIT and CBO are based on the object-oriented concepts of coupling, inheritance, and cohesion [38, 39], which can be readily interpreted in the development of smart contracts with Solidity, given that those concepts are used in Solidity [15]. Therefore, we accept these metrics as applicable. In contrast, the construct of the VBW [22] metric (which stands for Vulnerability-Based Weakness) cannot be applied in Solidity. Indeed, the VBW construct relies on standard lists. The first one is CWE [40], a list of common weaknesses related to software. The second one is CVE [41], a list of common vulnerabilities related to software. These two lists are linked by the fact that a weakness can imply vulnerability(ies); therefore, for a specific weakness in CWE, the list of vulnerabilities from CVE can be identified. However, in the smart contract context, there is currently no standard common weakness list with which smart contract vulnerabilities are associated. There is a standard weakness list for smart contracts named SWC [42] (Smart Contract Weakness Classification), similar to CWE, but unlike CWE it does not associate vulnerabilities with weaknesses. Therefore, in the absence of CWE and CVE lists in the smart contract context, we claim that this metric construct is not applicable to smart contracts.
RQ2: Analysis of the Capability of Measurement of the Security Metrics. To answer this research question, we followed each metric definition to obtain a measurement of the smart contracts under study. We were able to obtain the measurement for 15 of the 20 metrics in Table 2. Therefore, we conclude that smart contracts in Solidity are capable of providing the elements to measure these metrics. As shown in Table 3, we were able to determine the value of the metrics for the E-voting and supply chain contracts by following the measurement process of these 15 metrics. As in the previous section, we cannot provide a detailed account of how each measurement is taken due to space concerns. As a representative example, the NCBC metric measurement consists of counting the number of exception handling statements (try-catch) in the program, which can be identified in the smart contract context by counting the exception handling statements used in Solidity.
These exception statements are revert(), require(), assert(), etc.
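As an illustration of such a measurement, the following Python sketch performs a simple lexical count of Solidity exception-handling statements in a contract source. This naive token count (which ignores comments and string literals) is our own simplification, not a tool from the paper.

```python
import re

EXCEPTION_TOKENS = ("require", "revert", "assert")

def count_exception_statements(solidity_source: str) -> dict:
    """Count occurrences of Solidity exception-handling calls."""
    counts = {}
    for token in EXCEPTION_TOKENS:
        # \b ensures we match the keyword itself, not a substring.
        counts[token] = len(re.findall(rf"\b{token}\s*\(", solidity_source))
    return counts

sample = """
function vote(uint proposal) public {
    require(!voters[msg.sender].voted, "Already voted.");
    if (proposal >= proposals.length) { revert("Unknown proposal"); }
    assert(totalVotes <= voters.length);
}
"""
print(count_exception_statements(sample))
# {'require': 1, 'revert': 1, 'assert': 1}
```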
Table 2. List of security code metrics in the traditional (non-blockchain) environment

Metric name                                                          | Applicability of construct? (M1) | Capability of measurement (M2)
VBW [22] (Vulnerability-Based Weakness)                              | NO  | NO
WMC [21, 39] (Weighted Methods per Class)                            | YES | YES
DIT [21, 39] (Depth of Inheritance)                                  | YES | YES
NOC [21, 39] (Number of Children)                                    | YES | YES
CBO [21, 39] (Coupling Between Object classes)                       | YES | YES
RFC [21, 39] (Response For the Class)                                | YES | YES
LCOM [21, 39] (Lack of Cohesion in Methods)                          | YES | YES
Stall Ratio [43]                                                     | YES | YES
CER [43] (Critical Element Ratio)                                    | YES | YES
Nerr [26] (Number of implementation errors)                          | NO  | NO
Nserr [26] (Number of implementation errors related to security)     | NO  | NO
Rserr [26] (Ratio of implementation errors that impact security)     | NO  | NO
Nex [26] (Number of exceptions implemented to handle failures related to security)          | YES | YES
Noex [26] (Number of omitted exceptions for handling execution failures related to security) | YES | YES
Roex [26] (Ratio of the number of omitted exceptions)                | YES | YES
McCabe complexity (CCM) [24, 44]                                     | YES | YES
Halstead's volume metric [45, 46]                                    | YES | YES
CCP [43] (Coupling Corruption Propagation)                           | YES | YES
NCBC [47] (Number of Catch Blocks per Class)                         | YES | YES
EHF [47] (Exception Handling Factor)                                 | NO  | NO
The reader will notice from Table 3 that some metric measures result in a value of 0. We highlight that this does not challenge the answer to our research question. The high-level explanation for these results is that the smart contracts used here do not exhibit the complexities (or vulnerabilities) that those metrics intend to count. We discuss this further in the next section.
Table 3. Measures of applicable security code metrics based on measurement from E-voting and supply chain contracts

Metric                    | E-voting | Supply chain (Bidding) | Supply chain (Tracking)
WMC                       | 12       | 11      | 11
DIT                       | 0        | 0       | 0
NOC                       | 0        | 0       | 0
CBO                       | 0        | 1       | 1
RFC                       | 5        | 18      | 23
LCOM                      | 0        | 0       | 0
Stall Ratio               | 0        | 0       | 0
CER                       | 1        | 0.78    | 0.9
Nex                       | 0        | 13      | 14
Noex                      | 7        | 5       | 8
Roex                      | 1        | 0.28    | 22
McCabe complexity         | 3        | 1.58    | 1.38
Halstead's volume metric  | 1850.15  | 3134.25 | 2589.4
CCP                       | 0        | 0       | 6
NCBC                      | 0        | 0.72    | 0.64
4 Discussion

In this section, we present some points that stem from our research results.
Applicability of the Security Code Metric Constructs to Smart Contracts. We analysed a list of 20 security code metrics from a review of the literature and found that 15 of these metric constructs are applicable to the smart contract domain. These applicable metrics are generally related to the complexity, coupling and cohesion, inheritance, and exception handling of the program code. These elements of the source code have been empirically shown to be suitable proxies for measuring source code security [21, 47–49]. Regarding the smart contract context, though there is little (or no) empirical evidence, it is reasonable to expect that the relationship between security and these aspects is preserved. For instance, a smart contract with high complexity in its code structure might lead to security vulnerabilities. With this information, the smart contract development team can take proactive steps to deal with the complexity of the smart contract code to keep the contract secure before it is deployed.
Measuring Capacity and Automation in Smart Contracts. The analysis of the measurement of the non-blockchain security code metrics reveals that the measurements of 15 metrics are applicable to smart contracts. Regarding automation, only WMC, DIT, CBO, and McCabe complexity can be collected automatically using the Solmet tool [29]. For the rest of the metrics in Table 2, to the best of our knowledge, there is not yet an automated tool to collect them in the smart contract context. They can currently be collected through a manual process, as shown in the two case-study repositories (footnotes 1 and 2). This is a laborious task for developers. For instance, applying the Halstead volume measurement manually to an Ethereum smart contract requires identifying all operators and operands of the program code (the volume formula is recalled at the end of this section). Thus, the larger the program, the harder the collection of the Halstead volume becomes. Automation of these security metrics is needed to help developers and businesses manage the security of the smart contract more effectively before deployment.
Consistency of the Metric Measurements. Table 3 shows that the measures of DIT, NOC, Stall ratio, and LCOM equal 0 for all the testimonial contracts. The reason the DIT and NOC measures equal 0 is that these metrics use inheritance as a proxy for security, and none of the testimonial contracts used inheritance; therefore, the measures resulted in 0. It is expected that, for a smart contract involving inheritance, applying this measurement will yield a non-zero measure. Similarly, Stall ratio and LCOM have the value 0 for both testimonial contracts. The Stall ratio looks at statements in loops which can delay the program (e.g., logging functions), but the code of the testimonial contracts does not contain loop statements which can enforce tardiness; hence the measure of this metric for the testimonial contracts equals 0.
1 https://github.com/ndaangekevin/security-code-metric-collection-for-smart-contract.-E-voting-casae-study
2 https://github.com/ndaangekevin/security-code-metric-collection-for-smart-contract.-Supply-chain-casae-study
Similarly, it can be expected that for smart contract code involving a statement leading to tardiness of contract execution, the measurement of this metric will be non-zero.
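For reference, the Halstead volume mentioned in the discussion above is defined, in standard notation, from the number of distinct operators $\eta_1$, distinct operands $\eta_2$, total operators $N_1$, and total operands $N_2$:

$$V = N \log_2 \eta, \qquad N = N_1 + N_2, \qquad \eta = \eta_1 + \eta_2$$

This makes the manual-collection burden explicit: every operator and operand occurrence in the contract source must be tallied before $V$ can be computed.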
5 Conclusion

We investigated the applicability of non-blockchain security code metrics to the Solidity smart contract domain to help Ethereum developers and organizations deal with the security decisions of smart contracts. We used the GQM approach to design our research case study to achieve this goal. Our study reveals that 15 security code metrics are applicable in the smart contract context with respect to both construct and measurement, and we provided evidence of their measurement. Therefore, we claim that these 15 metrics can be used by developers and organizations to manage smart contract security. Moreover, the study suggests that complexity, cohesion, coupling, and exception handling might impact the security of the smart contract. The use of the proposed metrics also presents a real challenge for developers: most of their measurement processes have to be performed manually. We therefore encourage the research community to provide automated tools to collect these security code metrics to help developers and organizations manage security issues during the development of smart contracts.
References
1. Nakamoto, S.: Bitcoin: A Peer-to-Peer Electronic Cash System
2. Buterin, V.: A next generation smart contract & decentralized application platform
3. Androulaki, E., et al.: Hyperledger fabric. In: Proceedings of the Thirteenth EuroSys Conference, pp. 1–15 (2018)
4. Kuo, T.T., Kim, H.E., Ohno-Machado, L.: Blockchain distributed ledger technologies for biomedical and health care applications. J. Am. Med. Inform. Assoc. 24(6), 1211–1220 (2017)
5. Christidis, K., Devetsikiotis, M.: Blockchains and smart contracts for the internet of things. IEEE Access 4, 2292–2303 (2016)
6. "The DAO Attacked: Code Issue Leads to $60 Million Ether Theft - CoinDesk". https://www.coindesk.com/dao-attacked-code-issue-leads-60-million-ether-theft. Accessed 21 Oct 2019
7. Luu, L., Chu, D.H., Olickel, H., Saxena, P., Hobor, A.: Making smart contracts smarter. In: Proceedings of the ACM Conference on Computer and Communications Security, 24–28 October, pp. 254–269 (2016)
8. Tsankov, P., Dan, A., Drachsler-Cohen, D., Gervais, A., Buenzli, F., Vechev, M.: Securify: practical security analysis of smart contracts. In: Proceedings of the ACM Conference on Computer and Communications Security, pp. 67–82 (2018)
9. Amani, S., Bégel, M., Bortin, M., Staples, M.: Towards verifying Ethereum smart contract bytecode in Isabelle/HOL. In: CPP 2018 - Proceedings of the 7th ACM SIGPLAN International Conference on Certified Programs and Proofs, co-located with POPL 2018, pp. 66–77 (2018)
10. Kalra, S., Goel, S., Dhawan, M., Sharma, S.: ZEUS: analyzing safety of smart contracts (2018)
11. Wohrer, M., Zdun, U.: Smart contracts: security patterns in the Ethereum ecosystem and Solidity. In: 2018 IEEE 1st International Workshop on Blockchain Oriented Software Engineering (IWBOSE) - Proceedings, pp. 2–8 (2018)
12. Mavridou, A., Laszka, A.: Designing secure Ethereum smart contracts: a finite state machine based approach. In: Meiklejohn, S., Sako, K. (eds.) Financial Cryptography and Data Security. FC 2018. LNCS, vol. 10957. Springer, Berlin, Heidelberg (2018). https://doi.org/10.1007/978-3-662-58387-6_28
13. N'Da, A.A.K., Matalonga, S., Dahal, K.: Characterizing the cost of introducing secure programming patterns and practices in Ethereum. In: Rocha, Á., Adeli, H., Reis, L.P., Costanzo, S., Orovic, I., Moreira, F. (eds.) WorldCIST 2020. AISC, vol. 1160, pp. 25–34. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-45691-7_3
14. Szabo, N.: Smart contracts: building blocks for digital free markets. Extropy J. Transhuman Thought 18(2) (1996)
15. Solidity Documentation, Release 0.5.0, Ethereum (2018)
16. Treleaven, P., Brown, R.G., Yang, D.: Blockchain technology in finance. Computer 50(9), 14–17 (2017)
17. Delmolino, K., Arnett, M., Kosba, A., Miller, A., Shi, E.: Step by step towards creating a safe smart contract: lessons and insights from a cryptocurrency lab. In: Clark, J., Meiklejohn, S., Ryan, P., Wallach, D., Brenner, M., Rohloff, K. (eds.) Financial Cryptography and Data Security. FC 2016. LNCS, vol. 9604. Springer, Berlin, Heidelberg (2016). https://doi.org/10.1007/978-3-662-53357-4_6
18. Atzei, N., Bartoletti, M., Cimoli, T.: A survey of attacks on Ethereum smart contracts (SoK). In: Maffei, M., Ryan, M. (eds.) POST 2017. LNCS, vol. 10204, pp. 164–186. Springer, Heidelberg (2017). https://doi.org/10.1007/978-3-662-54455-6_8
19. Chen, H., Pendleton, M., Njilla, L., Xu, S.: A survey on Ethereum systems security: vulnerabilities, attacks, and defenses. ACM Comput. Surv. 53(3), 1–43 (2020)
20. Bhargavan, K., et al.: Formal verification of smart contracts: short paper. In: PLAS 2016 - Proceedings of the 2016 ACM Workshop on Programming Languages and Analysis for Security, co-located with CCS 2016, pp. 91–96 (2016)
21. Chowdhury, I., Zulkernine, M.: Can complexity, coupling, and cohesion metrics be used as early indicators of vulnerabilities? In: Proceedings of the 2010 ACM Symposium on Applied Computing, pp. 1963–1969 (2010)
22. Wang, J.A., Wang, H., Guo, M., Xia, M.: Security metrics for software systems. In: Proceedings of the 47th Annual Southeast Regional Conference, ACM-SE 47 (2009)
23. Nguyen, V.H., Tran, L.M.S.: Predicting vulnerable software components with dependency graphs. In: Proceedings of the 6th International Workshop on Security Measurements and Metrics, pp. 1–8 (2010)
24. Shin, Y., Williams, L.: An empirical model to predict security vulnerabilities using code complexity metrics. In: ESEM 2008 - Proceedings of the 2008 ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, pp. 315–317 (2008)
25. Moshtari, S., Sami, A., Azimi, M.: Using complexity metrics to improve software security. Comput. Fraud Secur. 2013(5), 8–17 (2013)
26. Sultan, K., En-Nouaary, A., Hamou-Lhadj, A.: Catalog of metrics for assessing security risks of software throughout the software development life cycle. In: Proceedings of the 2nd International Conference on Information Security and Assurance, ISA 2008, pp. 461–465 (2008)
27. Wang, J.A., Zhang, F., Xia, M.: Temporal metrics for software vulnerabilities. In: CSIIRW 2008 - 4th Annual Workshop on Cyber Security and Information Intelligence Research, pp. 1–3 (2008)
28. Wang, A.J.A., Xia, M., Zhang, F.: Metrics for information security vulnerabilities. J. Appl. Glob. Res. 1(1), 48–58 (2008)
29. Hegedűs, P.: Towards analyzing the complexity landscape of Solidity based Ethereum smart contracts. Technologies 7(1), 6 (2019)
30. Pierro, G.A., Tonelli, R.: PASO: a web-based parser for Solidity language analysis. In: IWBOSE 2020 - Proceedings of the 2020 IEEE 3rd International Workshop on Blockchain Oriented Software Engineering (IWBOSE), pp. 16–21 (2020)
31. Vandenbogaerde, B.: A graph-based framework for analysing the design of smart contracts. In: ESEC/FSE 2019 - Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 1220–1222 (2019)
32. Ajienka, N., Vangorp, P., Capiluppi, A.: An empirical analysis of source code metrics and smart contract resource consumption. J. Softw. Evol. Process 32(10), e2267 (2020). https://doi.org/10.1002/smr.2267
33. Documentation | Truffle Suite. https://www.trufflesuite.com/docs. Accessed 16 Nov 2020
34. Basili, V.R., Caldiera, G., Rombach, H.D.: The goal question metric approach. Encycl. Softw. Eng. 2, 528–532 (1994)
35. National Institute of Standards and Technology | NIST. https://www.nist.gov/. Accessed 24 March 2021
36. Contracts — Solidity 0.5.3 documentation. https://solidity.readthedocs.io/en/v0.5.3/contracts.html. Accessed 11 Nov 2019
37. Koirala, R.C., Dahal, K., Matalonga, S.: Supply chain using smart contract: a blockchain enabled model with traceability and ownership management. In: Proceedings of the 9th International Conference on Cloud Computing, Data Science & Engineering (Confluence 2019), pp. 538–544 (2019)
38. Chowdhury, I., Zulkernine, M.: Using complexity, coupling, and cohesion metrics as early indicators of vulnerabilities. J. Syst. Archit. 57(3), 294–313 (2011)
39. Chidamber, S., Kemerer, C.: A metrics suite for object oriented design. IEEE Trans. Softw. Eng. 20(6), 476–493 (1994)
40. CWE - Common Weakness Enumeration. https://cwe.mitre.org/. Accessed 17 Nov 2020
41. CVE - Common Vulnerabilities and Exposures (CVE). https://cve.mitre.org/. Accessed 17 Nov 2020
42. SmartContractSecurity/SWC-registry: Smart Contract Weakness Classification and Test Cases. https://github.com/SmartContractSecurity/SWC-registry. Accessed 14 Nov 2020
43. Chowdhury, I., Chan, B., Zulkernine, M.: Security metrics for source code structures. In: Proceedings of the International Conference on Software Engineering, pp. 57–64 (2008)
44. McCabe, T.J.: A complexity measure. IEEE Trans. Softw. Eng. SE-2(4), 308–320 (1976)
45. Hariprasad, T., Vidhyagaran, G., Seenu, K., Thirumalai, C.: Software complexity analysis using Halstead metrics. In: Proceedings of the International Conference on Trends in Electronics and Informatics (ICEI) 2017, pp. 1109–1113 (2018)
46. Davari, M., Zulkernine, M.: Analysing vulnerability reproducibility for Firefox browser. In: 2016 14th Annual Conference on Privacy, Security and Trust (PST), pp. 674–681 (2016)
47. Aggarwal, K.K., Singh, Y., Kaur, A., Malhotra, R.: Software design metrics for object-oriented software. J. Object Technol. 6(1), 121–138 (2007)
48. Shin, Y., Williams, L.: Is complexity really the enemy of software security? In: Proceedings of the ACM Conference on Computer and Communications Security, pp. 47–50 (2008)
49. Maxion, R.A.: Eliminating exception handling errors with dependability cases: a comparative, empirical study. IEEE Trans. Softw. Eng. 26(9), 888–906 (2000)
Machine Learning, Blockchain and IoT
A Recommendation Model Based on Visitor Preferences on Commercial Websites Using the TKD-NM Algorithm
Piyanuch Chaipornkaew(B) and Thepparit Banditwattanawong
Department of Computer Science, Kasetsart University, Bangkok, Thailand
{piyanuch.chai,thepparit.b}@ku.th
Abstract. In recent years, recommendation models have been widely implemented in various areas. To construct recommendation models, much research makes use of machine learning techniques. This research also proposes a recommendation model using a novel machine learning technique called the "TKD-NM" algorithm, a combination of TF-IDF, KMeans, and Decision Tree incorporated with the Nelder-Mead algorithm. TF-IDF was applied to form word vectorizations from webpage headings. KMeans was utilized for clustering webpage headings, while the Decision Tree algorithm was employed to investigate the performance of KMeans. Nelder-Mead was applied to find the optimum values of the word vectorization. The dataset analyzed in the research was collected from a specific commercial website; visitor preferences on the website constituted the dataset. The recommendation lists were retrieved from webpages in the same cluster. The prediction accuracy of the TKD-NM algorithm was approximately 97.31%, while the prediction accuracy of the baseline model was only 88.87%.
Keywords: Recommendation model · Webpage heading · TKD-NM · TF-IDF · KMeans · Decision Tree · Nelder-Mead
1 Introduction

Steve Jobs said "You can't ask customers what they want and then try to give that to them. By the time you get it built, they'll want something new." [1]. In order to provide something new that meets customer needs, data related to customer preferences should be considered and analyzed. In the case of commercial websites, customer preferences could be extracted from the content of their visited webpages. Once customer needs are known, it is possible to suggest new content which relates to customer interests. Such information can be offered via a website which is hereafter referred to as "an intelligent website": a website that can offer what customers need at an early stage without any request from them. In order to implement an intelligent website, machine learning techniques are required.
In the case of commercial websites, machine learning techniques are applied in many ways. Data obtained from back-end websites are analyzed using machine learning and
turned into useful information. Data visualization of the results from machine learning can help executive-level managers in decision-making. Machine learning is also applied to construct recommendation models that can assist in the planning of business campaigns. A recommendation system is a system used to find new items or services related to users' interests [2]. Such systems can help administrators drive businesses more effectively. Recommendation systems are applied with several goals, such as maximizing profit, minimizing cost, and minimizing risk. There are many widely used machine learning techniques for implementing a recommendation system; examples are KMeans, LDA, deep neural networks, Doc2Vec, ensembles, matrix factorization, decision trees, SVM, and TF-IDF [3–5]. The top five application domains of recommendation systems are the movie, social, academic, news, and e-commerce domains [6]. This research falls under the last domain: a recommendation system for e-commerce. This paper proposes a recommendation model based on visitor preferences on commercial websites using a novel algorithm, the "TKD-NM Algorithm". This algorithm is a combination of TF-IDF, KMeans, and Decision Tree incorporated with the Nelder-Mead algorithm. The proposed model is divided into three phases: data pre-processing, clustering optimization, and webpage suggestion. The result from the proposed model is a recommendation list of webpages in the same cluster. The remainder of the paper is organized as follows. Section 2 is the literature review. Section 3 presents the methodology. Section 4 presents the experimental results. Section 5 provides the conclusion and suggestions, and the last section presents the acknowledgements.
2 Literature Review

Recommendation models are generated from various machine learning techniques, such as clustering, classification, and association. Erra et al. conducted research on topic modelling using the TF-IDF algorithm. The paper proposed an approximate TF-IDF to extract topics from a massive message stream using GPUs [7]. There were two main contributions of the research. Firstly, an approximate TF-IDF measure was implemented. Secondly, parallel implementations of the calculation of the approximate TF-IDF based on GPUs were introduced. The research concluded that the parallel GPU architecture met the fast response requirements and overcame storage constraints when processing a continuous data stream in real time. The experiment also revealed that the GPU implementation was stable and performed well even with limited memory. Furthermore, the time to compute the approximate TF-IDF measure on the GPU did not vary depending on the data source.

Another research on clustering was entitled "An evaluation of document clustering and topic modelling in two online social networks: Twitter and Reddit" [8]. The two main contributions were document clustering and topic modelling for social media text data. The research presented document and word embedding representations of online social network data. The combination of doc2vec and TF-IDF-weighted mean word embedding representations delivered better results than simple averages of word embedding vectors in document clustering tasks. The results also revealed that k-means clustering provided
the best performance with doc2vec embeddings. The top term analysis was conducted based on a combination of TF-IDF scores and word vector similarities. This method provided a representative set of keywords for a topic cluster. The doc2vec embedding with k-means clustering could successfully recover the latent hashtag structure in Twitter data.

The research of Zhang et al. proposed a novel method to extract topics from bibliometric data [9]. The two innovations presented in the research were a word embedding technique and a polynomial kernel function integrated into a cosine similarity-based k-means clustering algorithm. The word embedding technique was applied in data pre-processing to extract a small set of key features. The polynomial kernel function incorporated with the cosine similarity-based k-means clustering algorithm was implemented to enhance the performance of the topic extraction. The experimental results, comparing the proposed method with k-means, fuzzy c-means, principal component analysis, and topic models, demonstrated its effectiveness across both a relatively broad range of disciplines and a given domain. A qualitative evaluation was made based on expert knowledge. Further, several insights for stakeholders were revealed during the qualitative investigation of the similarities and differences between the JASIST, JOI, and SCIM journals.

K-means clustering and the decision tree algorithm were integrated in the research of Wang et al. [10]. The research proposed a novel algorithm named "BDTKS", a linear multivariate decision tree classifier (Binary Decision Tree based on K-means Splitting). The BDTKS algorithm introduced k-means clustering as a node splitting method and proposed a non-split condition; therefore, the proposed algorithm provides good generalization ability and enhanced classification performance. Furthermore, the k-means centroid-based BDTKS model was converted into a hyperplane-based decision tree, which made the classification process faster. The experimental results demonstrated that the proposed BDTKS matched or outperformed previous decision trees.

Chen et al. [11] researched optimal replenishment policies with allowable shortages for a product life cycle. The Nelder-Mead algorithm was employed to find the optimal solution. The purpose of the paper was to determine the optimum number of inventory replenishments, the inventory replenishment time points, and the beginning time points of shortages within the product life cycle by minimizing the total relevant costs of the inventory replenishment system. The proposed problem was mathematically formulated as a mixed-integer nonlinear programming model. Several numerical examples and corresponding sensitivity analyses were presented in the paper. These examples and analyses were utilized to illustrate the features of the model by employing the search procedure developed in the paper.
3 Methodology

The research proposes a recommendation model based on visitor preferences on commercial websites using a novel algorithm, the "TKD-NM algorithm". The aim of this study is to extend our previous research [12]. The proposed model is divided into three phases as shown in Fig. 1. The first phase is data pre-processing. The second phase
is clustering optimization, and the third phase is webpage suggestion.

There were two main tasks to complete in the first phase. The first task was to exclude invalid data, missing data, and dependent attributes. Whenever data could not be described, such data was defined as invalid; instances and features that included invalid data were therefore ignored. In the dataset, some attributes depended on other attributes, so it was necessary to remove dependent attributes from the analysis to reduce computational time. The second task in the first phase was to integrate all related data tables to form a suitable dataset and provide convenience for the next phase.

The second phase of the proposed model is clustering optimization. The objective of this phase is to classify webpages based on their headings. The first step, feature representation, aims to represent each word with a vector. All webpage headings were separated into words using the word_tokenize method from the pythainlp.tokenize library. The TF-IDF algorithm was then applied to generate vectors for these words. The next step was to perform clustering using KMeans, and the model was then evaluated by employing the Decision Tree algorithm. There are two approaches in the research. The first approach is the baseline model, which runs TF-IDF, KMeans, and Decision Tree only once and returns the results. In contrast, the second approach is the novel TKD-NM model, which repeats the processes of the three algorithms: TF-IDF, KMeans, and Decision Tree. The repetition terminates when the highest prediction accuracy and the optimal values of the feature vector matrix are obtained. This optimization uses a combination of the Nelder-Mead, TF-IDF, KMeans, and Decision Tree algorithms. When the optimization algorithm completes, it yields the optimal word vectors, the target classes of the webpages, and the accuracy of the model; a sketch of this loop is given below.

The third phase is webpage suggestion, which provides the recommendation list of webpages according to visitor preferences. There are two input datasets in this phase. The first input comes from the output of phase 2: the relationship between the ID of each webpage and its cluster label. The second input is the relationship between the ID of each visitor and his/her favored cluster labels. The aim of this phase is to aggregate both datasets: the IDs of the webpages in each cluster, the associated cluster labels, and the IDs of visitors and their favored cluster labels. The final results are recommendation lists, which are the top ten most relevant webpages in the same cluster. The most recently published webpage appears first in the list, and so on.
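To make the loop concrete, the following is a minimal sketch of the clustering-optimization phase in Python, assuming scikit-learn and SciPy; the toy headings, the choice of two clusters, and all function and variable names are illustrative assumptions, not the paper's actual code or data. In practice the feature matrix is far larger, and running Nelder-Mead directly over it as shown would be slow; the sketch only makes the repeat-until-best-accuracy idea explicit.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from scipy.optimize import minimize

# Toy stand-ins for tokenized webpage headings (assumption).
headings = ["sale shoes discount", "new shoes arrival", "travel tips bangkok",
            "bangkok travel guide", "discount handbags sale", "guide tips travel"]
X0 = TfidfVectorizer().fit_transform(headings).toarray()  # phase-1 word vectors

def neg_accuracy(flat):
    # Cluster the candidate feature matrix, then score the clusters with a
    # decision tree; Nelder-Mead minimizes the negative prediction accuracy.
    X = flat.reshape(X0.shape)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)  # paper uses 6
    Xtr, Xte, ytr, yte = train_test_split(X, labels, test_size=0.5, random_state=0)
    return -DecisionTreeClassifier(random_state=0).fit(Xtr, ytr).score(Xte, yte)

res = minimize(neg_accuracy, X0.ravel(), method="Nelder-Mead",
               options={"maxiter": 300})
best_X = res.x.reshape(X0.shape)  # optimized feature vector matrix
```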
Fig. 1. TKD-NM Algorithm
The algorithm listing sketches out the process of forming the TKD-NM algorithm. The input of the algorithm is the feature vector matrix of webpage headings, while the output is the optimal feature vector matrix from the TKD-NM algorithm. Lines 3–7 represent the webpage headings with feature vectors. In this experiment, the number of clusters is 6. X is the set of data points, V is the set of centroids, and S is the set of data points in each cluster. Line 17 calculates the distance between each data point and its nearest centroid; ci is the number of data points in the ith cluster. Lines 18–22 assign each data point x to the closest centroid. Lines 23–27 compute the centroids of the clusters by taking the average of all data points that belong to each cluster. Line 34 calculates the entropy of each attribute. Line 35 calculates the information gain of each attribute. Line 37 selects the attribute that has the smallest entropy value and the largest information gain value.
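For reference, below is a hedged sketch of the entropy and information-gain computations referenced for lines 34–35, using the standard definitions; the function names are illustrative, not from the paper's code.

```python
import numpy as np

def entropy(labels):
    # H(S) = -sum_i p_i * log2(p_i) over the class proportions in S
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(labels, attribute):
    # Gain(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v)
    labels, attribute = np.asarray(labels), np.asarray(attribute)
    remainder = sum((attribute == v).mean() * entropy(labels[attribute == v])
                    for v in np.unique(attribute))
    return entropy(labels) - remainder

# The splitting attribute is the one with the largest information gain
# (equivalently, the smallest conditional entropy).
```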
4 Experimental Results

Three datasets were collected from back-end websites. The first dataset contained the headings of webpages and their IDs; it had 103,860 instances. The second dataset comprised nine months of visit data, such as the ID of the visitor, the ID of the action, and the visitor location; it had 53,054 instances with 25 features. The third dataset comprised the ID of the action and the ID of the webpage heading, with 4,997 instances. The number of instances was reduced to 14,840 after missing values, invalid data, and dependent attributes were removed. The novel algorithm, TKD-NM, was applied to yield recommendation lists of webpages. The elbow method was adopted to obtain the optimal k, as shown in Fig. 2. As Fig. 2 shows, the clustering experiment considered values of k between 3 and 13.
Fig. 2. The elbow method for optimal k between 1 and 29
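A minimal sketch of how such an elbow curve can be produced, assuming scikit-learn and Matplotlib; here X stands for the TF-IDF feature matrix from the pre-processing phase, and the scanned range mirrors Fig. 2.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# X is assumed to be the TF-IDF feature matrix from the pre-processing phase.
ks = range(1, 30)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("within-cluster sum of squares")
plt.title("Elbow method")
plt.show()
```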
Three other machine learning techniques, namely KNN, Decision Tree, and Multi-Layer Perceptron (MLP), were utilized to determine a suitable number of clusters. The KNN experiment was conducted with varying k, and the optimal value of k was found to be 300. When Decision Tree was applied, the variation of depth was evaluated, and it was revealed that the optimal depth was 27. MLP was conducted with various
numbers of hidden layers and hidden nodes. The optimal number of hidden layers was 2, with 50 hidden neurons. The prediction accuracy of each machine learning technique is plotted in Fig. 3. As Fig. 3 shows, Decision Tree yielded high performance when k was between 3 and 7. Therefore, the research chose to perform clustering with six groups and employed the Decision Tree algorithm to evaluate both the baseline model and the novel TKD-NM model.
Fig. 3. The prediction accuracy of KNN, Decision Tree, and MLP
This research was constructed based on two approaches. The first approach was a baseline method, which was performed using the TF-IDF, KMeans, and Decision Tree algorithms. TF-IDF was adopted for feature vectorization of webpage headings. KMeans was applied for webpage heading clustering, while Decision Tree was employed to evaluate the prediction model. These three algorithms were run once and returned the output as soon as they completed. The second approach was the novel algorithm, TKD-NM. This proposed method also utilized the TF-IDF, KMeans, and Decision Tree algorithms, but it required an extra algorithm, Nelder-Mead. The objective of the Nelder-Mead algorithm was to obtain the optimal values of the feature vector matrix; when these were reached, the highest accuracy of the prediction model was generated. An example of webpage headings and their separated words is presented in Fig. 4. An example of a feature vector matrix is shown in Fig. 5. The experimental results of the baseline model and the TKD-NM model are shown in Table 1 and reveal that the TKD-NM model yields higher prediction accuracy than the baseline model. The prediction accuracy of the TKD-NM model is approximately 97.31%, while the baseline model offers 88.87% prediction accuracy. Although the TKD-NM model outperformed the baseline model, its processing time was longer because of its iterative optimization. An example of a recommendation list for one visitor is shown in Table 2. The top five webpages were retrieved from webpages in the same cluster and were sorted by date. The first item in the recommendation list is the most recently uploaded webpage.
Fig. 4. Webpage headings and separated words
Fig. 5. Feature vector matrix of each word
Table 1. Experimental results

Algorithm       | Prediction accuracy (%) | No. of clusters
Baseline model  | 88.87                   | 6
TKD-NM model    | 97.31                   | 6
Table 2. Recommendation lists for one visitor
5 Conclusion and Suggestions

The research proposed a recommendation model based on visitor behaviors on commercial websites using the TKD-NM algorithm, which is the combination of the TF-IDF, KMeans, Decision Tree, and Nelder-Mead algorithms. The elbow method was adopted to investigate the optimal value of k. Decision Tree, KNN, and MLP were also applied to evaluate the algorithm. The experimental results revealed that the optimal number of clusters was between 3 and 7. The research used six clusters of webpages, labeled 0 to 5. The Decision Tree algorithm was applied to evaluate both the baseline model and the TKD-NM model. The prediction accuracy of the proposed model was higher than that of the baseline method; the TKD-NM algorithm yielded 97.31%, while the baseline algorithm provided 88.87% accuracy. However, other adaptations and experiments could be employed in future research. Since the proposed model might be time-consuming when processing a large amount of data, future work could involve deeper analysis of particular mechanisms to improve the speed of the optimization process.

Acknowledgments. This research is financially supported by the Department of Computer Science, Faculty of Science, Kasetsart University. Moreover, the authors would like to express their gratitude to an anonymous company, which cannot be named because of confidentiality. The business data provided by this company is sensitive and will not be disclosed or used for any purpose other than this research.
References
1. Steve Jobs Quote: What Customers Can Tell You. https://www.entrepreneurshipinabox.com/quotes/steve-jobs-quote-what-customers-can-tell-you/. Accessed 13 Feb 2021
2. Adomavicius, G., Tuzhilin, A.: Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. IEEE Trans. Knowl. Data Eng. 17, 734–739 (2005)
3. Kim, D., Seo, D., Cho, S., Kang, P.: Multi-co-training for document classification using various document representations: TF-IDF, LDA, and Doc2Vec. Inf. Sci. 477, 15–29 (2019)
4. Padurariu, C., Breaban, M.: Dealing with data imbalance in text classification. ScienceDirect 159, 736–745 (2019)
5. Song, Y., Wu, S.: Slope one recommendation algorithm based on user clustering and scoring preferences. ScienceDirect 166, 539–545 (2020)
6. Portugal, I., Alencar, P., Cowan, D.: The use of machine learning algorithms in recommender systems: a systematic review. Expert Syst. Appl. 97, 205–227 (2018)
7. Erra, U., Senatore, S., Minnella, F., Caggianese, G.: Approximate TF-IDF based on topic extraction from massive message stream using the GPU. Inf. Sci. 292, 143–161 (2015)
8. Curiskis, S., Drake, B., Osborn, T., Kennedy, P.: An evaluation of document clustering and topic modelling in two online social networks: Twitter and Reddit. Inf. Process. Manag. 57, 102034 (2020)
9. Zhang, Y., et al.: Does deep learning help topic extraction? A kernel k-means clustering method with word embedding. J. Informetr. 12, 1099–1117 (2018)
10. Wang, F., Wang, Q., Nie, F., Li, Z., Yu, W., Ren, F.: A linear multivariate binary decision tree classifier based on K-means splitting. Pattern Recogn. 107, 107–120 (2020)
11. Chen, C., Hung, T., Weng, T.: Optimal replenishment policies with allowable shortages for a product life cycle. Comput. Math. Appl. 53, 1582–1594 (2007)
12. Chaipornkaew, P., Banditwattanawong, T.: A recommendation model based on user behaviors on commercial websites using TF-IDF, KMeans, and Apriori algorithms. In: Meesad, P., Sodsee, D.S., Jitsakul, W., Tangwannawit, S. (eds.) Recent Advances in Information and Communication Technology 2021. IC2IT 2021, pp. 55–65. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-79757-7_6
Reinforcement Learning: A Friendly Introduction

Dema Daoun1, Fabiha Ibnat1, Zulfikar Alom1, Zeyar Aung2, and Mohammad Abdul Azim2(B)

1 Department of Computer Science, Asian University for Women, Chittogram, Bangladesh
{dema.daoun,fabiha.ibnat,zulfikar.alom}@auw.edu.bd
2 Department of Electrical Engineering and Computer Science, Khalifa University, Abu Dhabi, United Arab Emirates
[email protected], [email protected]
Abstract. Reinforcement Learning (RL) is a branch of machine learning (ML) that is used to train artificial intelligence (AI) systems and find optimal solutions to problems. This tutorial paper aims to present an introductory overview of RL. Furthermore, we discuss the most popular algorithms used in RL and the use of the Markov decision process (MDP) in the RL environment. Moreover, RL applications and achievements that shine in the world of AI are covered.

Keywords: Artificial intelligence · Reinforcement learning · Markov decision process · Bellman optimality

1 Introduction
Decades ago, science fiction books and movies introduced the public to a concept known as artificial intelligence (AI), where people are replaced by robots or machines that perform human-like work more effectively and efficiently. Eventually, these imaginations are becoming true: almost wherever we go, we see devices working in different fields and performing various tasks. This phenomenon has made people afraid of losing their jobs, being replaced, and being dominated by machines. The fear raises the question, "Will robots exceed human intelligence and control everything on this planet?" To answer this question, we need to understand the meaning of AI and machine learning (ML), focusing on reinforcement learning (RL). ML is an automatic way of analyzing data without human involvement through learning from the data, identifying patterns, and then taking action. There are four popular types of learning approaches: (i) supervised learning, (ii) unsupervised learning, (iii) semi-supervised learning, and (iv) reinforcement learning [32].

Khalifa University, UAE, partially supported this research.
Supervised learning is a widely studied branch of machine learning. In this approach, training is accomplished by utilizing labeled data with a supervisor who determines the correct action that should be taken. The algorithm learns from sets of inputs together with their corresponding correct outputs; it then finds the errors and modifies the model accordingly. Supervised learning can be used to predict future events from historical data [32].

Unsupervised learning is another branch of machine learning, in which the algorithms have to figure out the correct action by themselves, depending only on the inputs. Unsupervised learning is based on exploring the data and finding intrinsic patterns or groups instead of learning from labeled data. This branch is widely used on transactional data [32].

Semi-supervised learning is a branch of machine learning that uses both labeled and unlabeled data for training. It uses a small amount of labeled data and a considerably larger amount of unlabeled data, and it is suitable when the cost of labeling is high [32].

RL is a distinct branch of machine learning that differs from the ones mentioned earlier. In RL, an agent interacts with the environment to maximize its rewards by taking actions and learning from their consequences. To take an action, the agent uses two methods: (i) exploitation, where the agent depends on its experience (trials and errors), and (ii) exploration, where the agent chooses to take a new action not related to its previous experience [40]. RL is similar to self-learning, where no one and nothing can help in knowing what the right action is. Therefore, the agent should keep trying and gathering information to reach the right action and reward. RL emphasizes the core issue of AI, i.e., the representation of information [27].

Nowadays, the RL technique is quite popular among organizations that frequently deal with broad and complex problems. This paper aims to discuss some RL aspects and highlight a subset of remarkable RL applications. In the following sections, RL's main components, algorithms, and achievements are discussed, along with the application of the Markov Decision Process (MDP) and the Bellman optimality equation in RL. The RL challenges and future directions are mentioned after highlighting some of the advantages and disadvantages of RL. Section 2 provides a background of RL. Section 3 describes the RL taxonomy. Section 4 provides an RL overview consisting of the components of RL agents, the MDP, and the Bellman optimality equation. Section 5 presents a simple example of RL. Section 6 presents recent developments. The paper finally discusses the advantages and disadvantages of RL systems in Sect. 7 and the conclusions in Sect. 8.
2 Background

RL techniques have been improving year by year, which has expanded their applications in real life. RL has achieved remarkable results over the last three decades and has been used in different fields and domains.
2.1 RL Achievements
Recently, RL has accomplished remarkable achievements in the AI world, most of them in the gaming sector. In 1993, IBM developed a program called "TD-Gammon". This program uses RL to play backgammon at expert levels and challenge the best players in the world [39]. In 2006, RL was used in autonomous helicopter flight. Two approaches were used to attain it: first, a pilot was required to help find the helicopter dynamics model and a reward function; second, RL algorithms were used to find an optimized controller for the model [1]. In 2013, DeepMind introduced the first deep learning model to play Atari 2600 games at professional levels using a combination of deep learning and RL. They trained the model using a Q-learning algorithm where the inputs are raw pixels and the outputs are value functions [25]. In 2016, DeepMind developed a program called "AlphaGo" using RL. This program plays the challenging classical game "Go", and it defeated the best Go player 5–0 [34]. In 2017, DeepMind introduced a program called "Alpha Zero" which can play chess, shogi, and Go. This program reached expert levels within 24 hours, knowing only the rules of each game. It does not have specific hyperparameters for each game; it uses the same hyperparameters for all [35].
2.2 Real-Life Applications
RL is widely used in many fields like robotics, games, finance, transportation, etc. It is used for solving problems and building systems that can improve themselves with experience. Game playing is the most popular RL application in the world of AI. Many researchers have used RL in the environment of a very general class of games. One of the RL game applications is Samuel's checkers-playing system, which defeated America's best checkers players. This artificial player was trained by playing games against itself, given the rules of the game, an incomplete list of parameters, and a sense of direction [31]. The temporal difference algorithm was used in backgammon games by Tesauro, who used a three-layer neural network for training [39]. Robotics and machine control are other applications of RL. RL was used to build a two-armed robot that learns from experience; this robot handles complex nonlinear control tasks with dynamic programming, in an application known as linear-quadratic-regulator design. RL is also used in robots that push boxes, using the Q-learning algorithm; this application is considered difficult because of its vast number of uncertain results. In the food industry, machines fill up food containers with a variable number of non-identical products. These machines operate according to many set-points, and the chosen set-points depend on changes in the product's characteristics and task constraints [19].
RL is used in intelligent transportation systems. Multi-agent reinforcement learning reduces traffic congestion through adaptive traffic signal control, where each agent controls the traffic lights around a single junction. The reinforcement learning approach faced some challenges in this application because of the instability of the single-agent case. Van der Pol and Oliehoek devised a scalable system for the problem using Deep Q-Learning algorithms and a coordination algorithm; the success was due to removing simplified assumptions and including a new reward function [29]. In resource management for computer clusters, RL is used to build systems that learn from their experience. This application helps minimize job slowdown by learning to allocate and schedule computer resources. Here, the optimal policy is found using the REINFORCE algorithm with a baseline value [24]. RL is also applied in online web system auto-configuration [7], chemical reaction optimization [43], online personalized news recommendation [42], real-time advertising [18], and natural language processing [33].

2.3 Current Challenges and/or Opportunities
RL faces some challenges, and finding solutions to them will expand the areas where RL can be applied.

1. System Delay: Natural RL systems face issues regarding delays in state sensation, actuators, or reward feedback. RL systems take time to improve agents' learning, especially in recommender systems, which might need about a week to determine the reward [23]. Some researchers applied a learning algorithm that can itself learn the effects of delays [15]. Also, in some RL problems, like in Atari, re-distributing rewards throughout time is applied to find a return-equivalent MDP that gives almost the same optimal policy as the original MDP [4].

2. Non-Stationarity: In many RL systems, it is challenging to observe the state of physical items, like wear and tear or the amount of buildup, or to observe users' mental state. This issue has been solved in two ways: first, by combining history and the agent's observations using the DQN approach; second, by using recurrent networks that help track and recover hidden states [12].

Beyond this, RL aims to figure out how to use multi-task learning to allow the agent to do multiple tasks instead of specializing [8]. Another challenge is that model-free RL needs many trials to learn, which is often unacceptable in the real world because it can be dangerous and sometimes threatens people's lives.
3 RL Taxonomy

Reinforcement learning aims to find the best solution for a problem by using algorithms. Numerous algorithms have been applied to reach the goal of optimization. These algorithms can be divided into three distinct groups: (i) value-based, (ii) policy-based, and (iii) model-based.
3.1 Value-Based Methods
In value-based methods, the agent aims to maximize the value function to get a long-run return under a specific policy [14]. The value-based group includes Q-Learning and SARSA (State-Action-Reward-State-Action). Q-Learning is an off-policy and model-free algorithm based on an action derived from another policy. This method follows several steps to update the value: (i) initialize the Q-table, (ii) choose the action, (iii) perform the action and measure the reward and the new state, (iv) update the Q-table, and (v) repeat this process until the terminal state is reached. SARSA is an on-policy algorithm based on the agent's current action and policy. It follows one approach, in which the behavior and target policies are the same. The difference between Q-Learning and SARSA is that SARSA selects rewards based on the same policy and the original action, in contrast to Q-Learning. In general, on-policy algorithms perform better than off-policy ones but are worse at finding the target policy [37].
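To make the update steps concrete, here is a minimal tabular Q-learning sketch in Python; the environment step function is a placeholder assumption, and the hyperparameter values are illustrative. The comment marks the single line where SARSA would differ.

```python
import numpy as np

n_states, n_actions = 6, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1  # illustrative hyperparameters

def env_step(state, action):
    # Placeholder environment (assumption): returns (next_state, reward).
    return (state + 1) % n_states, float(action == 0)

state = 0
for _ in range(10_000):
    # Epsilon-greedy behavior policy: explore with probability epsilon.
    if np.random.rand() < epsilon:
        action = np.random.randint(n_actions)
    else:
        action = int(np.argmax(Q[state]))
    next_state, reward = env_step(state, action)
    # Off-policy target: max over next actions (Q-learning). SARSA would
    # instead bootstrap from the action its own policy actually takes next.
    td_target = reward + gamma * Q[next_state].max()
    Q[state, action] += alpha * (td_target - Q[state, action])
    state = next_state
```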
3.2 Policy-Based Methods
In policy-based methods, the agents aim to find the optimal policy that returns the maximum reward without a value function [14]. In these methods, neural networks are used to realize the optimal policy in two ways: gradient-based and gradient-free [38]. An example of a policy-based method is the REINFORCE algorithm, which is a policy gradient method. To understand the REINFORCE algorithm, it is necessary to understand policy gradient methods, i.e., on-policy methods that optimize parameterized policies with respect to long-term rewards. The REINFORCE algorithm uses Monte Carlo methods (used to solve a large number of computational problems) [11] to estimate the returns because it is based on the entire trajectory. Another example of a policy-based algorithm is the actor-critic method. This is a temporal difference method that implements its policy independently of the value function. It is called actor-critic because the actor selects actions and the critic evaluates them [20].
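A hedged sketch of a REINFORCE-style update for a softmax policy over three actions follows; the single-step return G is a placeholder for a full Monte Carlo trajectory return, and all names and values are illustrative.

```python
import numpy as np

n_actions, alpha = 3, 0.05
theta = np.zeros(n_actions)  # action preferences of the softmax policy

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for episode in range(2_000):
    probs = softmax(theta)
    action = np.random.choice(n_actions, p=probs)
    # Placeholder Monte Carlo return (assumption): a full implementation
    # would roll out an entire trajectory and sum its discounted rewards.
    G = 1.0 if action == 2 else 0.0
    # grad log pi(a | theta) for a softmax policy: one_hot(a) - probs
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0
    theta += alpha * G * grad_log_pi  # REINFORCE ascent step
```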
3.3 Model-Based Methods
In model-based methods, the agents aim to perform better in the environment by relying on a model of it. Some popular algorithms like Q-learning, SAC, and DDPG are model-free and are characterized by stability and optimality. However, model-based algorithms are more effective, and they use artificial neural networks to predict the environment transitions and the reward function [26]. For instance, Dyna-Q is an algorithm that builds a policy using both actual and simulated experiences. These experiences improve both the model and the approach through model learning, direct reinforcement learning, and planning.
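Below is a minimal Dyna-Q sketch under the assumption of a deterministic environment model; the function names, data structures, and hyperparameters are illustrative. The function would be called once per real environment transition.

```python
import random

alpha, gamma, n_planning = 0.1, 0.95, 10
Q, model = {}, {}  # tabular action values and a deterministic learned model

def q(s, a):
    return Q.get((s, a), 0.0)

def td_update(s, a, r, s2, actions):
    best_next = max(q(s2, a2) for a2 in actions)
    Q[(s, a)] = q(s, a) + alpha * (r + gamma * best_next - q(s, a))

def dyna_q_step(s, a, r, s2, actions):
    td_update(s, a, r, s2, actions)   # direct RL from the real transition
    model[(s, a)] = (r, s2)           # model learning
    for _ in range(n_planning):       # planning: replay simulated experience
        (ps, pa), (pr, ps2) = random.choice(list(model.items()))
        td_update(ps, pa, pr, ps2, actions)
```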
4 RL: An Overview
To understand RL and know how it can be used, it is essential to first get familiar with a few concepts: RL components, the MDP, and the Bellman optimality equation.

4.1 Components of RL Agent
The RL system has four main components: (i) policy, (ii) reward function, (iii) value function, and (iv) the model of the environment.

Policy: Here, the policy (π) is the core goal of RL. The policy is the agent's behavior function that determines the way of behaving under certain circumstances [37]. In other words, the policy is the strategy used to achieve the agent's goal when conducting some task. The policy may be a straightforward function or a very complex one, and it can be stochastic. Mathematically, the policy is defined using the MDP, a tuple with four sets (s, a, P, R). The MDP suggests actions for each possible state. Here, the set s comprises the agent's internal statuses. The set a includes the agent's actions. The set P is a matrix of the transition probabilities from one state to another that modify the agent's status. The set R contains the agents' rewards.

Reward Function: The reward R of an action is evaluated using the reward function. The function takes the agent's status as input and returns a real-valued reward as output [10]. Note that R is the feedback from the environment and the main factor for updating the policy. It accurately describes the agent's state and defines the goal in the RL problem. The reward signal distinguishes between bad and good events so that the agent can reach the maximum utility level of total received rewards in the long run. The reward depends on the action and the environment of the agent. Thus, there are two possible ways to change the reward: (i) directly through the action and (ii) indirectly through the environment. When the action is changed, it directly affects the reward; it also affects the environment, which, in turn, affects the reward "indirectly" [37]. In general, the agent's reward differs when different actions are applied.

Value Function: The value function V(s) determines what is good in the long run by measuring how good it is to be in a specific situation. The value function predicts the total amount of reward that an agent can expect in the future [37]. The value function aims to find the optimal policy that maximizes the expected rewards by evaluating the states and selecting the best action. Optimality may be achieved by three methods: (i) value iteration, (ii) policy evaluation, and (iii) policy iteration [16]. Moreover, the value function has four types of operations:
– The on-policy value function is where the agent starts in a state and acts according to the policy; the function then gives the expected return.
– The on-policy action-value function is where the agent starts in a state, takes an arbitrary action (not necessarily from the policy), and acts according to the policy afterwards; the function then gives the expected return.
– The optimal value function is where the agent starts in a state and acts according to the optimal policy; the function then gives the expected return.
– The optimal action-value function is where the agent starts in a state, takes an arbitrary action (not necessarily from the policy), and acts according to the optimal policy afterwards; the function then gives the expected return [2].

Model of Environment: The model of the environment describes the place where the agents operate. The model allows deducing environmental behavior after observing the state and action. It is used to build a view of future situations and to decide which course of action is better to take [37]. The environment gives two signals to the agent, one on its next state and another on the reward [41].

4.2 Markov Decision Process
The Markov Decision Process (MDP) is a sequence of decisions that describes the environment for RL. It has four components: a set of states (s), a set of actions (a), transition probabilities (P), and a real-valued reward function on states R(s). The main property of the MDP concerns its transition probabilities: the probability of the next state depends only on the current state and current action, without considering the previous states and actions. The probability of the next state is formulated in the following equation:

P(s_{t+1} | s_t, a_t) = P(s_{t+1} | s_0, ..., s_t, a_0, ..., a_t)    (1)

According to this equation, the probability of the following state depends only on the current state s_t and the current action a_t, regardless of the previous states and actions s_0, ..., s_{t-1} and a_0, ..., a_{t-1}.

4.3 Bellman Optimality Equation
Richard Bellman found that using dynamic programming to solve MDP problems reduces the computational burden [5]. The MDP problem is, in fact, an optimality problem because the agent, as discussed before, aims to maximize its rewards [6]. It can be mathematically expressed as follows:

V*(s) = max_a q*(s, a)    (2)

In this equation, the maximization over actions expresses the goal of maximizing rewards, and the right-hand side shows how the value function is related to itself [36].
This equation is nonlinear and has no closed-form solution. However, some methods provide iterative solutions, like Q-learning [30]. Note that this function introduces the concept of q, which stands for quality. Quality, in this case, represents how valuable a given action is in gaining some future reward.
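As an illustration of the iterative-solution idea, here is a hedged value-iteration sketch that repeatedly applies the Bellman optimality backup V(s) <- max_a [R(s, a) + gamma * sum_s' P(s'|s, a) V(s')] on a randomly generated toy MDP; the MDP itself is an assumption made only for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.random((n_states, n_actions))                             # R[s, a]

V = np.zeros(n_states)
for _ in range(1_000):
    Qsa = R + gamma * (P @ V)   # Q(s, a) = R(s, a) + gamma * E[V(s')]
    V_new = Qsa.max(axis=1)     # Bellman optimality: V*(s) = max_a q*(s, a)
    if np.abs(V_new - V).max() < 1e-9:
        break
    V = V_new
```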
Fig. 1. RL example (cat catching fish).
5 Example of RL
Here, an example of reinforcement learning is presented that further clarifies how reinforcement learning works. In this example, RL is used to create a cat-and-fish game where the cat has to find the fish [13]. The cat keeps looking for the fish along different possible paths. Upon successfully finding the fish, it gets 100 points. Hence, the cat constantly learns to find fish and becomes a skilled fish-hunting cat over time. Figure 1 depicts various RL aspects of this example: Fig. 1a provides the big functional picture of RL, Fig. 1b presents the agents in the environment, Fig. 1c shows the reward policy implementation in the RL system, and Fig. 1d depicts the derived reward matrix in the system. To make the cat learn from its experience, the first step is to create an environment where the fish can be found, without guiding the cat to the route for finding the fish.
Here, six nodes (0 to 5) are defined, and these nodes are connected in different ways to create paths. The agent, i.e., the cat, searches the paths to find the fish. The connections between the nodes are defined as edges. There are several paths in this environment in the figure, and the cat is located at node 0. As per the RL algorithm, the reward matrix determines how much the cat will get (as a reward) if it reaches the target fish and how much it will get if it does not. The many −1 entries in the reward matrix indicate that there is no direct connection between the corresponding nodes, while the 0 entries indicate that there is a direct connection. For example, there is only one direct connection from node 0, which is to node 4. That can be expressed as the row [−1, −1, −1, −1, 0, −1], where (0, 0) = −1, (0, 1) = −1, (0, 2) = −1, (0, 3) = −1, (0, 4) = 0, (0, 5) = −1. After that, a function is created to determine the cat's next step from its current state. This function also finds the optimal policy that gives maximum rewards. The policy is constantly updated, and these values form the Q-matrix. In other words, the more times the cat goes through the paths, the more information about the environment is accumulated in the Q-matrix. In the last stage, the agent learns over time as the cat searches for the fish. Each time, the cat chooses a different path starting from node 0. The Q-matrix is constantly updated by the information received from the cat's movements. Thus the cat becomes more efficient over time.
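A hedged sketch of this example follows. Row 0 of the reward matrix matches the text (node 0 connects only to node 4); the remaining rows and the 100-point rewards on edges into the fish node are illustrative assumptions about the rest of the graph.

```python
import numpy as np

# Reward matrix: -1 = no edge, 0 = edge, 100 = edge into the fish node.
R = np.array([
    [-1, -1, -1, -1,  0,  -1],   # node 0 connects only to node 4, as in the text
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
])
Q = np.zeros_like(R, dtype=float)
gamma = 0.8

for _ in range(5_000):
    s = np.random.randint(R.shape[0])                # random starting node
    a = np.random.choice(np.flatnonzero(R[s] >= 0))  # any directly connected node
    # The Q-matrix accumulates environment knowledge with every move:
    # immediate reward plus the discounted best value of the next node.
    Q[s, a] = R[s, a] + gamma * Q[a].max()

print((Q / Q.max() * 100).round())  # normalized learned values per edge
```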
6 Recent Developments
This section presents some recent research in RL that has made significant improvements.

6.1 Graph Convolutional RL
Graph convolution is applied in RL to optimize the policy. This technique provides a multi-agent model on graphs, because agents are not stable as they keep moving; thus, agents face difficulties in learning abstract representations of their interactions. Compared to other cooperative methods, the graph convolutional RL technique improves agents' cooperation through regularizing the temporal relation [17].

6.2 Measuring the Reliability of RL Algorithms
Chan and other researchers noted the lack of reliability in RL and devised a set of metrics to measure the reliability of RL algorithms quantitatively from different aspects. To analyze the performance of RL algorithms, evaluation must be done during training and after learning. These metrics also measure reliability across various dimensions, such as reproducibility and stability [9].
6.3 Behavior Suite for RL
Researchers introduced a collection of designed experiments that look into the core of an agent's capabilities, called the "Behaviour Suite." This set of experiments captures the main issues in designing learning algorithms that can be developed. It also studies an agent's behavior through its performance [28].

6.4 The Ingredients of Real-World Robotics RL
A group of researchers discussed the elements of a robotic learning system that does not require human intervention to ensure the learning process. This automatic learning system improves independently using data collected in the real world, and learning here is feasible using only onboard perception, without a manually designed reset or reward function [44].

6.5 Network Randomisation
This simple technique for generalisation in deep reinforcement learning improves deep RL agents' generalization ability by applying a randomized neural network that perturbs input observations. The approach allows agents to adapt to new domains because the agent learns robust features from the different environments [21].
7 RL: Pros and Cons
Like any technique, RL has both advantages and disadvantages, and knowing them helps determine the best methods to apply to a specific problem.

7.1 Advantages
RL made a great revolution in the world of machine learning. Thus, it has many advantages compared to other machine learning types or other AI methods. Here are some of the advantages:

– It achieves rewards regardless of the environment's size because it depends on observing the environment to take the action that gives a reward [22].
– It sustains change for a long time.
– In RL, errors have less chance of happening again because the agent learns from its experience and corrects its errors through training.
– It can be better than humans at completing tasks; even in games, RL applications have defeated the best players in the world.
7.2 Disadvantages
Even with all these advantages, RL has some disadvantages that make it unsuitable in some specific situations.

– It takes a long time to find the optimal solution for big real-world problems [3].
– It is difficult to directly train some systems that require fixed logs of the system's behavior to be learned.
– Most real-world systems are fragile and expensive.
– A physical system can damage itself and its environment [12].
– If too much reinforcement is applied, it leads to an overload that diminishes the results.
– It creates serious problems when used in some real-world applications, like self-driving cars.
8 Conclusion

RL is a fascinating and vital topic with respect to its wide usage, applications, and achievements. RL depends on self-experience and training to find the optimal policy and get maximum rewards. This tutorial paper described RL's main components and popular algorithms, and explained the MDP, which is widely used in RL. The paper summarized major RL applications and achievements in the real world and included its advantages, disadvantages, current challenges, and opportunities.
References
1. Abbeel, P., Coates, A., Quigley, M., Ng, A.Y.: An application of reinforcement learning to aerobatic helicopter flight. In: Advances in Neural Information Processing Systems, pp. 1–8 (2007)
2. Achiam, J.: Introduction to RL (2018). OpenAI. https://spinningup.openai.com/en/latest/spinningup/rl_intro.html
3. Arabnejad, H., Pahl, C., Jamshidi, P., Estrada, G.: A comparison of reinforcement learning techniques for fuzzy cloud auto-scaling. In: Proceedings of the 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp. 64–73 (2017)
4. Arjona-Medina, J.A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., Hochreiter, S.: RUDDER: return decomposition for delayed rewards. arXiv preprint arXiv:1806.07857 (2018)
5. Bellman, R.: On the theory of dynamic programming. Proc. Natl. Acad. Sci. U.S.A. 38(8), 716 (1952)
6. Bellman, R.E., Dreyfus, S.E.: Applied Dynamic Programming. Princeton University Press, Princeton (2015)
7. Bu, X., Rao, J., Xu, C.Z.: A reinforcement learning approach to online web systems auto-configuration. In: Proceedings of the 2009 29th IEEE International Conference on Distributed Computing Systems, pp. 2–11 (2009)
8. Caruana, R.: Multitask learning. Mach. Learn. 28(1), 41–75 (1997)
9. Chan, S.C., Fishman, S., Canny, J., Korattikara, A., Guadarrama, S.: Measuring the reliability of reinforcement learning algorithms. arXiv preprint arXiv:1912.05663 (2019)
10. De Luca, G.: What is a policy in reinforcement learning? (2020). Baeldung. https://www.baeldung.com/cs/ml-policy-reinforcement-learning
11. Dimov, I.T., Tonev, O.I.: Monte Carlo algorithms: performance analysis for some computer architectures. J. Comput. Appl. Math. 48(3), 253–277 (1993)
12. Dulac-Arnold, G., Mankowitz, D., Hester, T.: Challenges of real-world reinforcement learning. arXiv preprint arXiv:1904.12901 (2019)
13. Fazly, R.: Data science book (2020). GitHub. https://github.com/FazlyRabbiBD/Data-Science-Book/blob/master/8-ReinforcementLearning.ipynb
14. Guru99: Reinforcement learning: what is, algorithms, applications, example (2020). Guru99. https://www.guru99.com/reinforcement-learning-tutorial.html
15. Hester, T., Stone, P.: TEXPLORE: real-time sample-efficient reinforcement learning for robots. Mach. Learn. 90(3), 385–429 (2013)
16. Hui, J.: RL – value learning (2018). Medium. https://jonathan-hui.medium.com/rl-value-learning-24f52b49c36d
17. Jiang, J., Dun, C., Huang, T., Lu, Z.: Graph convolutional reinforcement learning. arXiv preprint arXiv:1810.09202 (2018)
18. Jin, J., Song, C., Li, H., Gai, K., Wang, J., Zhang, W.: Real-time bidding with multi-agent reinforcement learning in display advertising. In: Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 2193–2201 (2018)
19. Kaelbling, L.P., Littman, M.L., Moore, A.W.: Reinforcement learning: a survey. J. Artif. Intell. Res. 4, 237–285 (1996)
20. Konda, V.R., Tsitsiklis, J.N.: On actor-critic algorithms. SIAM J. Control Optim. 42(4), 1143–1166 (2003)
21. Lee, K., Lee, K., Shin, J., Lee, H.: Network randomization: a simple technique for generalization in deep reinforcement learning. arXiv preprint arXiv:1910.05396 (2019)
22. Manju, S., Punithavalli, M.: An analysis of Q-learning algorithms with strategies of reward function. Int. J. Comput. Sci. Eng. 3(2), 814–820 (2011)
23. Mann, T.A., et al.: Learning from delayed outcomes via proxies with applications to recommender systems. In: International Conference on Machine Learning, pp. 4324–4332. PMLR (2019)
24. Mao, H., Alizadeh, M., Menache, I., Kandula, S.: Resource management with deep reinforcement learning. In: Proceedings of the 15th ACM Workshop on Hot Topics in Networks, pp. 50–56 (2016)
25. Mnih, V., et al.: Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013)
26. Moazami, S., Doerschuk, P.: Modeling survival in model-based reinforcement learning. arXiv preprint arXiv:2004.08648 (2020)
27. Mondal, A.K., Jamali, N.: A survey of reinforcement learning techniques: strategies, recent development, and future directions. arXiv preprint arXiv:2001.06921 (2020)
28. Osband, I., et al.: Behaviour suite for reinforcement learning. arXiv preprint arXiv:1908.03568 (2019)
29. Van der Pol, E., Oliehoek, F.A.: Coordinated deep reinforcement learners for traffic light control. In: Proceedings of the NIPS 2016 Workshop on Learning, Inference and Control of Multi-Agent Systems, pp. 1–8 (2016)
30. Rummery, G.A., Niranjan, M.: On-line Q-learning using connectionist systems. Technical report, Cambridge University Engineering Department, UK (1994)
31. Samuel, A.L.: Some studies in machine learning using the game of checkers. IBM J. Res. Dev. 3(3), 210–229 (1959)
32. SAS: Machine learning: what it is and why it matters (2020). SAS. https://www.sas.com/en_us/insights/analytics/machine-learning.html
33. Sharma, A.R., Kaushik, P.: Literature survey of statistical, deep and reinforcement learning in natural language processing. In: Proceedings of the 2017 IEEE International Conference on Computing, Communication and Automation, pp. 350–354 (2017)
34. Silver, D., et al.: Mastering the game of Go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016)
35. Silver, D., et al.: Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815 (2017)
36. Singh, A.: Reinforcement learning: Bellman equation and optimality (Part 2). Towards Data Science (2019). https://towardsdatascience.com/reinforcement-learning-markov-decision-process-part-2-96837c936ec3
37. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (2018)
38. Taylor, G., Burmeister, R., Xu, Z., Singh, B., Patel, A., Goldstein, T.: Training neural networks without gradients: a scalable ADMM approach. In: Proceedings of the 33rd International Conference on Machine Learning, pp. 2722–2731 (2016)
39. Tesauro, G.: TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Comput. 6(2), 215–219 (1994)
40. Woergoetter, F., Porr, B.: Reinforcement learning. Scholarpedia 3, 1448 (2008)
41. Zhang, J.: Reinforcement learning – model based planning methods. Towards Data Science (2020). https://towardsdatascience.com/reinforcement-learning-model-based-planning-methods-5e99cae0abb8
42. Zheng, G., et al.: DRN: a deep reinforcement learning framework for news recommendation. In: Proceedings of the 2018 World Wide Web Conference, pp. 167–176 (2018)
43. Zhou, Z., Li, X., Zare, R.N.: Optimizing chemical reactions with deep reinforcement learning. ACS Cent. Sci. 3(12), 1337–1344 (2017)
44. Zhu, H.: The ingredients of real-world robotic reinforcement learning. arXiv preprint arXiv:2004.12570 (2020)
Universal Multi-platform Interaction Approach for Distributed Internet of Things

Maria Stepanova(B) and Oleg Eremin

Bauman Moscow State Technical University, ul. Baumanskaya 2-ya, 5/1, Moscow 105005, Russia
{stepanova,ereminou}@bmstu.ru
Abstract. This paper describes the issue of communication between devices in a distributed Internet of Things (IoT) computing system named DCS-IoT. The system was designed to address the challenge of high computational load on the central node, and on nodes in general, in IoT-based solutions through a data and task distribution approach. Initially, the approach is tested on complex math tasks; however, it is expected to be scaled to different task types and implemented in a variety of IoT-based solutions regardless of the application sphere. The DCS-IoT is heterogeneous and based on a multitude of devices (Raspberry Pi, Odroid, Arduino, etc.) that are mutually connected by wired and/or wireless communication channels. Since the devices have different processor architectures (ARM, Intel, ATmega), a universal approach is required to ensure their interaction. The multi-platform JVM (Java Virtual Machine) technology is considered as such an approach. Solutions based on TCP sockets, RMI (Remote Method Invocation) technology, and CORBA (Common Object Request Broker Architecture) technology are considered as communication technologies. The advantages and disadvantages of each approach are considered, and recommendations for use in the DCS-IoT are offered.

Keywords: Internet of Things · Cloud computing · Distributed systems · Fog computing
1 Introduction

The IoT (Internet of Things) is a rapidly developing sphere in Computer Science. The main idea of IoT is to endow environmental objects with intelligence. The electronics industry develops specialized equipment to embed into physical objects, such as tiny microcontrollers and microprocessors, high-capacity batteries and accumulators, universal sensors, etc. Embedded systems allow gathering and analyzing vast amounts of data about the environment and about the states and behavior of such systems themselves. Data Mining and Big Data methods are used to process and analyze IoT data and to discover patterns and correlations that bring new value and insights.
Embedding microprocessors into IoT objects allows organizing interaction among the objects themselves and between the objects and larger components of computational systems, such as cloud infrastructure, computational clusters, etc. That type of integration helps gain new knowledge about the functioning of the computational systems. A great deal of research and many contributions have advanced IoT development since the first definition appeared; however, recent research indicates open challenges [1] and the necessity to review the IoT concept and its implementation as the IoT itself and its technologies change [1]. Some of the critical issues for the IoT are the creation of self-intelligent interaction, the integration of multiple communication technologies, the management of increasing data and task flow volumes, and the realization of intelligent distribution systems [2].

The paper [2] describes an adaptive strategy and method for assigning and distributing tasks from an IoT node in the developed distributed IoT platform, in order to solve the issues of managing increasing data and task flow volumes and to provide intelligent distribution mechanisms in the IoT. Initial approaches to IoT object interaction transfer data from the sensor modules embedded in real-world objects to a centralized computing node, which provides data storage and processing, device coordination, and communication. Such an architecture has a set of significant limitations [2]. To overcome these limitations, an approach was proposed that considers the IoT platform as a distributed computational system and unloads the centralized IoT computing node by assigning and distributing tasks to the remaining IoT nodes of the platform [2]. This solution differs from the Edge Computing concept and architectures: even with the rapidly developing Edge Computing concept, particular methods must still be defined for each challenge in each case. Moreover, critical issues of Edge Computing were noted in [3, 4], especially regarding latency, management, reliability, selective data, and task implementation from users.

This work continues that research and the described approach; it focuses on the issues and aspects of communication technologies inside the distributed computational system on the IoT, providing mechanisms for task and data exchange in the process of task assignment and distribution [2]. The rest of this work is organized as follows. Section 2 discusses the details of distributed computational systems on the IoT and the interaction principles of IoT objects in them. In Sect. 3, the overall system design and the IoT object interaction implementation are described. Section 4 presents the comparative analysis of the implementation experiments. Finally, in Sect. 5, we draw our conclusions and outline further work.
2 Distributed IoT Computing System and IoT Object Interaction

The considered distributed IoT computing system (DCS-IoT) consists of a great number of heterogeneous IoT objects with various technical constraints. The IoT objects, i.e., nodes, interact with each other over wired or wireless communication channels.
The system is used for the computation of complex mathematical tasks, including those of [5, 6], in a parallel mode [7]. It consists of a central node, which acts as an orchestrator, and computational nodes implemented on IoT objects [8, 9]. The overall interaction in the developed distributed IoT computing system is shown in Fig. 1.
Fig. 1. Distributed IoT computing system
The central node is given here in a generalized form covering a variety of configurations, as the approach [2] applies whether the central node is implemented by a cloud, a gateway, or any centralized server. The central node decomposes a source computational task into multiple simple sub-tasks, which are then sent to groups of computational nodes for processing. Accordingly, providing appropriate interaction between the nodes of a distributed computing system based on an IoT platform becomes an actual issue. A detailed description of the approach, strategy, method, and results of the task assignment and distribution procedure is published in our work [2]. Thus, this work concentrates on implementing the communication technology for exchanging data, statuses, information, tasks, etc. among the nodes of the distributed IoT computing system. For this implementation, it should be noted that:
• the communication technology should be lightweight;
• the communication should be organized with a single technology.
These requirements are critical because IoT devices are constrained in resources, power, computational ability, etc. IoT devices are made by numerous manufacturers that mostly provide them as closed ecosystems, and numerous end IoT solutions are closed to development and user manipulation. However, more open IoT devices and solutions are expected to spread.
IoT objects are developed using standardized digital platforms (based on ARM, ATmega, and Intel microprocessors) built from standard components: microprocessors, memory, peripherals, and I/O ports. There are plenty of component manufacturers, and the components may have minor differences in each case. This means that programs implemented in a low-level assembly language are not portable between platforms. High-level languages, for example C, are used to avoid such collisions; however, the specifics of each platform remain noticeable [1, 10], and the GNU/Linux operating system is applied to reduce their impact. In the 1990s, the Java programming language was developed with the initial intention of providing a unified universal platform, one that could be deployed as an embedded system on different hardware: microwaves, home media centers, TVs, smart home components, etc. The Java platform makes it possible to avoid the problems caused by differing electronic components and digital platforms. In this work, Java was chosen to organize interaction in our distributed IoT computing system. The choice follows from the purpose Java was originally designed for, the implementation of a unified universal platform. Java allows abstracting from hardware specifics and neutralizing the effects of different electronic components, which admits generic, portable program code. In this way, Java enables IoT devices to interact regardless of device and hardware manufacturer. In this work, IoT devices implemented on ARM, ATmega, and Intel microprocessors are used, and the JVM (Java Virtual Machine) is installed on all nodes of the distributed IoT computing system, which allows using common program code that is compiled into portable bytecode and executed on each object.
3 IoT Interaction System Design

The interaction of the central node and the devices of the distributed IoT computing system can be considered in terms of a client–server architecture. Within this approach, there are two possible types of interaction, depending on how the server and client roles are assigned:
• the server role is assigned to a computing node (IoT device), and the client role to the distributing node (central node);
• the distributing node is defined as the server, and a computing node as the client.
This paper describes the situation where the computing node is the server and the distributing node is the client (see Fig. 2). The distributing node sends a task (message) to a computing node, and the computing node sends a result (message) back to the distributing node (Fig. 2). As a server, the computing node serves the client with its computing primitives. At the same time, the distributing node makes requests to get back the results of the small computing tasks obtained by decomposing the initial task. The distributing node can also send management commands to computing nodes, for example, to switch them to a different functioning mode. The server acts according to the received tasks: performs computations, changes its mode, etc.
Fig. 2. Computing node – distributing node interaction
According to the client–server architecture, the application Node_1 requests from Node_2 the execution of a computing primitive described as a function m(). The client and the server are located separately and interact through a wired or wireless network connection. In this case, the m() function deployed on the server has to be called by the client, which is impossible without special tools. Three communication mechanisms were considered for the interaction: Socket API, RMI (Remote Method Invocation), and CORBA (Common Object Request Broker Architecture), since Java empowers physically remote objects or applications to interact through these technologies. Java and these communication mechanisms were considered because they are standards-based, stable, and mature, which is hard to guarantee for any recently appeared framework; moreover, frameworks typically require more resources and supportive attention.

3.1 Socket API

In general, the Socket API interaction mechanism has significant advantages for the implementation: it offers a set of predefined methods covering the main features of communication. The socket-based communication mechanism neutralizes the heterogeneity of devices and networks and provides their interaction. It has also received wide distribution, due to its simplicity and high performance, in numerous IoT projects and studies such as:
• smart environments [11];
• smart homes [12];
• monitoring systems [13].
The common aspect of socket-based communication in IoT solutions is to provide a channel and tools for data exchange:
• devices send raw values that are forwarded for processing;
• data flows between applications and devices.
For the data and task distribution from the central node, a connection between the client and the server was established with sockets, as shown in Fig. 3.
Fig. 3. Socket API interaction
As described above, the client cannot call the server's functions directly, which means the server should have a stub method: when a certain message type is received by the server from the client, the stub method is executed, and the stub method in turn calls a function that implements the computing primitive [14]. The client listing is shown below:

import java.io.*;
import java.net.Socket;
. . .
// Connect to the computing node and request primitive "m"
Socket so = new Socket("192.168.5.12", 9001);
BufferedWriter out = new BufferedWriter(
    new OutputStreamWriter(so.getOutputStream()));
String message = "m";
out.write(message, 0, message.length());
out.newLine();
out.flush();
out.close();
so.close();
. . .
In this listing, the client opens a socket connection to the server and sends a message (message = "m") that identifies the computing primitive to be executed and its parameters.
The server listing is shown below:

import java.io.*;
import java.net.*;
. . .
private void m() { . . . } // computing primitive
. . .
// Wait for a client connection and dispatch the received message
ServerSocket s = new ServerSocket(9001);
Socket so = s.accept();
BufferedReader in = new BufferedReader(
    new InputStreamReader(so.getInputStream()));
String message = in.readLine();
if (message.equals("m"))
    m();
else
    System.out.println("No such action");
so.close();
s.close();
. . .
The server waits for a socket connection (server socket). When it is established, the server reads the received message from the input stream associated with the socket, analyzes it, and dispatches it to the appropriate computing primitive with its parameters [15]. When messages are sent from the computing node to the distributing node, the computing node starts functioning as a server and the distributing node as a client. During the experiments, this communication mechanism proved applicable for computing mathematical tasks in our DCS-IoT in the following cases:
• When nodes exchange raw, transitional, or result data, as well as service or alarm data. In this case, the implementation of the mechanism is optimal. Its simplicity, however, imposes limitations at the same time, since the needs of data and task distribution are not fully covered for the DCS-IoT.
• When the program code is divided into pieces and represented in a data format suitable for transmission. In this case, additional functions have to be implemented both on the server and on the client, at least: a function for program code decomposition, a function for interpreting the code pieces into the data format, and a function for recovering the pieces into the initial program code. That leads to additional costs.

3.2 RMI

Remote procedure calls are implemented in Java's object-oriented approach as RMI, which provides RMI-based architectures with mechanisms for distributed object-oriented computing [16, 17]. The RMI implementation is widespread and has proven itself on mobile devices [18, 19]. In general, such implementations are mostly user-centric and oriented toward mobile devices: the end solutions provide interaction of services or applications with users, of users with users, or of users with smart IoT devices. Therefore, RMI in the IoT is offered mostly for user–application interaction rather than for the interaction of devices or infrastructure elements.
The DCS-IoT, in contrast, is infrastructure-centric and concerned with load distribution among nodes for data and task processing in the IoT infrastructure. RMI allows physically remote objects in the DCS-IoT to interact with each other, as shown in Fig. 4.
Fig. 4. Client – server interaction via RMI
To implement the RMI technology in the DCS-IoT, the IServer interface needs to be created. The interface declares all computing primitives that are stored on the server and will be called by the client. The IServer interface listing is shown below:

import java.rmi.*;

public interface IServer extends Remote {
    public void m() throws RemoteException; // computing primitive
}
The interface should be accessible both to the server and to the client, since it is implemented by the Server class on the server side and by a Proxy on the client side. The server application listing is shown below:

import java.rmi.*;
import java.rmi.registry.*;
import java.rmi.server.UnicastRemoteObject;
. . .
public class Server extends UnicastRemoteObject implements IServer {
    public Server() throws RemoteException {}

    public void m() throws RemoteException {
        . . . // computing primitive implementation
    }

    public static void main(String[] args) throws RemoteException {
        . . .
        Server s = new Server();
        Registry re = LocateRegistry.createRegistry(1099);
        re.rebind("server", s);
        // or the two last lines could be replaced by:
        // Naming.rebind("server", s);
        . . .
    }
}
The Server class is responsible for all server-side functions of the computing system that relate to computing primitives. In the executable code, an instance of the Server class is created and made remotely accessible by registering the object in the Registry. The registration procedure helps the IoT to maintain and manage a variety of IoT devices and to call them by the name defined by a manufacturer or a developer. The client application listing is shown below:

import java.rmi.*;
import java.rmi.registry.*;
. . .
Registry re = LocateRegistry.getRegistry();
IServer s = (IServer) re.lookup("server");
s.m(); // remote function - computing primitive
// or, with the Naming service:
// IServer s = (IServer) Naming.lookup("rmi://localhost/server");
. . .
An object of the Server type is created on the client side as a proxy; through the RMI pipeline, this object locates the remote component using the Registry. The RMI architecture principles are suitable for DCS-IoT task distribution, especially when the task is a piece of program code.

3.3 CORBA

CORBA is a standard defined by the Object Management Group (OMG) that allows the integration of distributed and heterogeneous applications built on different programming languages and hardware bases. CORBA follows the same principle as RMI, a pipeline of objects [20]. This type of interaction uses IDL (Interface Definition Language), a special platform-independent language for describing interfaces. The CORBA functioning approach is similar to that of RMI; the distinction is that CORBA allows creating multi-platform systems.
The technology has proven itself in IoT cases mostly as a middleware for industrial applications [21]; however, a lack of extensibility has been noted [22]. For the DCS-IoT, it required implementing an additional pipeline of objects.
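For illustration, a minimal CORBA client in Java might look as follows. This is a sketch only, under assumptions: the IServerHelper class is assumed to have been generated from the IDL interface by the idlj compiler, the name "server" and a running name service are hypothetical, and the org.omg.* API shown here was removed from the JDK as of Java 11:

import org.omg.CORBA.ORB;
import org.omg.CosNaming.NamingContextExt;
import org.omg.CosNaming.NamingContextExtHelper;

public class CorbaClient {
    public static void main(String[] args) throws Exception {
        // Initialize the ORB and resolve the CORBA name service
        ORB orb = ORB.init(args, null);
        NamingContextExt nc = NamingContextExtHelper.narrow(
                orb.resolve_initial_references("NameService"));
        // Look up the remote object and call the computing primitive;
        // IServerHelper is IDL-generated (hypothetical here)
        IServer s = IServerHelper.narrow(nc.resolve_str("server"));
        s.m();
    }
}

The structure mirrors the RMI client above, with the CORBA name service playing the role of the RMI Registry.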
4 Technology Comparison for the Distributed IoT Computing System Communication Implementation

For the considered technologies, a comparative analysis can be summarized for the DCS-IoT as follows. The main criteria for selecting and implementing a communication technology for the DCS-IoT are simplicity and low cost of development and implementation, efficient consumption of device resources by the communication channel in the task distribution process, extensibility, and light weight.

Socket API advantages:
• Implementation simplicity.
• Methods for interaction with applications developed in different languages.
• No additional libraries are required beyond those in Java ME.

Socket API disadvantages:
• The necessity to develop processing methods for received messages.
• A lack of distribution mechanisms.
• The necessity to design a naming or addressing system to provide interaction between a variety of devices in the IoT.
• The inability to connect to the same port when interacting with several devices.

RMI advantages:
• Distribution mechanisms.
• Relative implementation simplicity.
• Easy implementation of remote components.
• Automatic launch of the pipeline of objects.
• Support for extensibility.

RMI disadvantages:
• The inability to interact with applications developed in languages other than Java.
• A lack of interaction with applications on other platforms.

CORBA advantages:
• Distribution mechanisms.
• Support for multiple platforms.
• Support for different languages for developing various application components.
• OMG (Object Management Group) standards.

CORBA disadvantages:
• Implementation complexity.
• The necessity to develop a pipeline of objects for each type of platform.
• Deployment complexity.
• No support for extensibility.
5 Conclusion

In this paper, a distributed computing system, the DCS-IoT, is implemented on an IoT platform, with the interaction of the IoT devices based on the Java programming language. Java was chosen for its versatility in implementing applications for any type of hardware: the presence of a Java virtual machine on the majority of platforms makes the full power of Java available to IoT devices. The Socket API, RMI, and CORBA technologies were each implemented and tested in order to propose the optimal solution, and the experiments revealed the use cases and features of each. The Socket API solution requires implementing additional functions to process received messages and designing a naming or addressing system for numerous remote devices; this limits the scaling of the distributed computing system and makes the technology inappropriate here. The CORBA technology is quite universal, supports complex client–server communications, and operates in a multi-platform mode; however, its implementation is more complicated than RMI's, although the end result is equivalent. The RMI technology is part of Java, which means interaction with other platform types is impossible; nevertheless, since both the client and the server sides are implemented in Java in this paper, RMI is the preferred option. Furthermore, RMI is simple to implement and improves regularly as Java itself improves. Thus, native RMI technology is preferable for implementing interaction among the distributed parts of the IoT. Further work will concentrate on DCS-IoT scalability and on testing the distribution of a variety of tasks to nodes for execution.
References
1. Zaidan, A.A., et al.: A survey on communication components for IoT-based technologies in smart homes. Telecommun. Syst. 69(1), 1–25 (2018). https://doi.org/10.1007/s11235-018-0430-8
2. Eremin, O., Stepanova, M.: A reinforcement learning approach for task assignment in IoT distributed platform. In: Kravets, A.G., Bolshakov, A.A., Shcherbakov, M.V. (eds.) Cyber-Physical Systems. SSDC, vol. 350, pp. 385–394. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-67892-0_31
3. Parikh, S., Dave, D., Patel, R., Doshi, N.: Security and privacy issues in cloud, fog and edge computing. Procedia Comput. Sci. 160, 734–739 (2019). https://doi.org/10.1016/j.procs.2019.11.018
4. Mostafavi, S., Dawlatnazar, M., Paydar, F.: Edge computing for IoT: challenges and solutions. J. Commun. Technol. Electron. Comput. Sci. 25(26), 5–8 (2019)
5. Verma, C., Illés, Z., Stoffová, V.: Study level prediction of Indian and Hungarian students towards ICT and mobile technology for the real-time. In: 2020 International Conference on Computation, Automation and Knowledge Management (ICCAKM), pp. 215–219 (2020). https://doi.org/10.1109/ICCAKM46823.2020.9051551
6. Kumar, D., Verma, C., Singh, P.K., Raboaca, M.S., Felseghi, R.-A., Ghafoor, K.Z.: Computational statistics and machine learning techniques for effective decision making on student's employment for real-time. Mathematics 9, 1166 (2021). https://doi.org/10.3390/math9111166
7. Eremin, O.Y., Stepanova, M.V.: Applying reinforcement learning in distribution computational system – Internet of Things. Dyn. Complex Syst. 14(2), 84–92 (2020). https://doi.org/10.18127/j19997493-202002-10
8. Celic, L., Magjarevic, R.: Seamless connectivity architecture and methods for IoT and wearable devices. Automatika 61(1), 21–34 (2020). https://doi.org/10.1080/00051144.2019.1660036
9. Mocnej, J., Seah, W.K.G., Pekar, A., Zolotova, I.: Decentralised IoT architecture for efficient resources utilisation. IFAC-PapersOnLine 51, 168–173 (2018)
10. Eleftherakis, G., Pappas, D., Lagkas, T., Rousis, K., Paunovski, O.: Architecting the IoT paradigm: a middleware for autonomous distributed sensor networks. Int. J. Distrib. Sens. Netw. 2015 (2015). https://doi.org/10.1155/2015/139735
11. Santos, D., Ferreira, J.C.: IoT power monitoring system for smart environments. Sustainability 11, 5355 (2019). https://doi.org/10.3390/su11195355
12. Phan, L.-A., Kim, T.: Breaking down the compatibility problem in smart homes: a dynamically updatable gateway platform. Sensors 20, 2783 (2020). https://doi.org/10.3390/s20102783
13. Ma, K., Sun, R.: Introducing websocket-based real-time monitoring system for remote intelligent buildings. Int. J. Distrib. Sens. Netw. 9, 867693 (2013). https://doi.org/10.1155/2013/867693
14. Lesson: All About Sockets (The Java Tutorials). https://docs.oracle.com/javase/tutorial/networking/sockets/index.html. Accessed 03 Oct 2020
15. Maata, R.L.R., Cordova, R., Sudramurthy, B., Halibas, A.: Design and implementation of client-server based application using socket programming in a distributed computing environment. In: 2017 IEEE International Conference on Computational Intelligence and Computing Research (ICCIC), Coimbatore, pp. 1–4 (2017). https://doi.org/10.1109/ICCIC.2017.8524573
16. Trail: RMI (The Java Tutorials). https://docs.oracle.com/javase/tutorial/rmi/index.html. Accessed 03 Oct 2020
17. Basanta-Val, P., García-Valls, M.: Resource management policies for real-time Java remote invocations. J. Parallel Distrib. Comput. 14, 1930–1944 (2014). https://doi.org/10.1016/j.jpdc.2013.08.001
18. Le, M., Clyde, S., Kwon, Y.-W.: Enabling multi-hop remote method invocation in device-to-device networks. HCIS 9(1), 1–22 (2019). https://doi.org/10.1186/s13673-019-0182-9
19. Kang, H., Jeong, K., Lee, K., Park, S., Kim, Y.: Android RMI: a user-level remote method invocation mechanism between Android devices. J. Supercomput. 72(7), 2471–2487 (2015). https://doi.org/10.1007/s11227-015-1471-3
20. Java IDL and RMI-IIOP Tools and Commands (Java Platform, Standard Edition Tools Reference). https://docs.oracle.com/javase/10/tools/java-idl-and-rmi-iiop-tools-and-commands.htm. Accessed 03 Oct 2020
21. Gaitan, N., et al.: An IoT middleware framework for industrial applications. Int. J. Adv. Comput. Sci. Appl. 7, 31–41 (2016)
22. Tightiz, L., Yang, H.: A comprehensive review on IoT protocols' features in smart grid communication. Energies 13, 2762 (2020). https://doi.org/10.3390/en13112762
A Practical and Economical Bayesian Approach to Gas Price Prediction

ChihYun Chuang and TingFang Lee

AMIS, Taipei, Taiwan
[email protected]
College of Pharmacy, University of Rhode Island, Kingston, RI, USA
[email protected]
Abstract. On the Ethereum network, it is challenging to determine a gas price that ensures a transaction will be included in a block within a user's required timeline without overpaying. One way of addressing this problem is through gas price oracles that utilize historical block data to recommend gas prices. However, when transaction volumes increase rapidly, these oracles often underestimate or overestimate the price. In this paper, we demonstrate how Gaussian process models can predict the distribution of the minimum price in an upcoming block when transaction volumes are increasing. This is effective because these processes account for time correlations between blocks. We performed an empirical analysis using the Gaussian process model on historical block data and compared its performance with the GasStation-Express and Geth gas price oracles. The results suggest that when transaction volumes fluctuate greatly, the Gaussian process model offers a better estimation. Further, we demonstrate that GasStation-Express and Geth can be improved upon by using a smaller, properly pre-processed training sample. Based on the empirical analysis, we recommend a gas price oracle made up of a hybrid model consisting of both the Gaussian process and GasStation-Express. This oracle provides efficiency, accuracy, and better cost.
Keywords: Ethereum · Gas price oracle · Blockchain · Gaussian process · Bayesian

1 Introduction
On the Ethereum blockchain [1], gas refers to the fuel required to conduct a transaction or execute a smart contract [2]. Since each block on the chain has an upper bound on the amount of gas that can be included in it (the average block gas limit was around 12,500,000 units of gas as of March 4, 2021), miners maximize profit by prioritizing transactions offering higher gas prices. As with all markets exhibiting supply and demand dynamics, it is advantageous for users to be able
to predict and offer the minimum gas price that ensures their transactions will be included in a block within a pre-determined timeline. Developing such a gas price oracle is complicated by the high variability in transaction volume. Under these conditions, some existing gas price oracles either underestimate the prices, so that transactions have to wait a long time to be included, or overestimate them, so that users overpay for their transactions. To date, one of the primary methods for gas price prediction is to analyze the pricing structure of pending transactions in large mempools [3]. This method is resource-intensive, as it requires access to large mempools to obtain enough pending-transaction data for analysis; further, it can only predict gas prices accurately under the assumption that the mempool data are correct, something that is difficult for users to verify. Another method is to utilize recent transactions that were included by miners to recommend a price. Algorithms based on this concept have been used to develop gas price oracles including Geth, EthGasStation, GasStation-Express (abbrev. GS-Express), and the work of Werner et al. [4–8]. One such oracle, GS-Express, uses the set of minimum gas prices in the most recent 200 blocks; the model suggests that the probability that the αth percentile of this set is greater than the minimum gas price of the transactions in the next block is α%. In other words, the αth percentile of the set has an α% probability of being included in the next block. Another oracle, Geth, uses the set of minimum gas prices in the most recent 100 blocks and takes the 60th percentile of the set as the recommended gas price. These models provide an efficient way to recommend gas prices when the quantity of pending transactions is relatively small; however, when there is a surge of pending transactions, they underestimate the prices.

In this paper, we focus on a novel predictive model that uses recently mined blocks to estimate the lowest price a user should offer to obtain a specified probability that the transaction will be processed. The proposed methodology uses Gaussian process (abbrev. GP) models to predict the distribution of the minimum price in the upcoming block. Stochastic processes, including GPs, are often used to study numerous stock-market micro-structure questions, including price discovery, competition among related markets, strategic behavior of market participants, and modeling of real-time market dynamics. These processes provide potentially efficient estimators and predictors for volatility, time-varying correlation structures, trading volume, bid-ask spreads, depth, trading costs, and liquidity risks. The market forces acting within Ethereum are very similar, making GPs an appropriate method to capture the dynamics of Ethereum gas prices. Another attractive feature of stochastic processes is the covariance function, which allows the model to estimate the time correlation between blocks; that is, it can capture the stronger correlation between closer blocks. Our method provides stable price prediction even when there is a surge in transaction volume. Over the long term, the gas
prices recommended by the model are more economical and practical (the GP model takes about 0.7 s on average to predict a new price) compared with existing methods.

Our contributions include the following:
1) We introduce a novel application of Gaussian processes to evaluate pricing models and then use it to study the advantages and disadvantages of GS-Express, Geth, and GP. Our findings indicate that GS-Express and Geth over- or under-estimated the price when the transaction volume fluctuated greatly, while GP maintained reasonable accuracy. Additionally, GP is time-efficient in model training and prediction.
2) A sensitivity analysis was conducted to study the impact of the training data size on the performance of the GS-Express model. It showed that reducing the training data size to 50 or 30 can effectively improve prediction; however, this requires a reliable data pre-processing procedure.
3) We propose a practical and economical gas price oracle in Algorithm 1 which retains the advantages of both GP and GS-Express and avoids the disadvantages of each. Our method is superior to the existing methods in achieving the targeted short-term and long-term success rates among the considered blocks. Remarkably, except for P50 (see Tables 7 and 8), the average cost of our method is still less than the others'.

The outline of this paper is as follows. In Sect. 2, we introduce the operation of transactions in the Ethereum network and Gaussian processes. In Sect. 3, we establish our methodology, including data pre-processing and modeling. In Sect. 4, we present the results for the GP predictive models and compare them with the GS-Express and Geth models. Finally, in Sect. 5, we propose a gas price oracle that is a hybrid of GP and GS-Express and utilizes the advantages of each.
2 Background

In this section, we provide a brief overview of transactions in the Ethereum network and of Gaussian processes.

2.1 Ethereum and Gas
Like other permissionless blockchains and cryptocurrencies, Ethereum reaches consensus using a protocol called "Proof-of-Work". In such protocols, an actor called a "miner" groups transactions into a block and appends it to the end of the blockchain. This work is resource-intensive, and thus operations on Ethereum require a fee, which the miner receives in exchange for performing the work. Based on the gas price, miners determine which transactions should be included in a block. A transaction fee is calculated in gas, using the units wei = 10^-18 ETH and Gwei = 10^-9 ETH, where ETH is the currency of Ethereum. The cost of execution is equal to:

gas cost × gas price.
Here:
– The gas cost is bounded below by 21,000 and above by the gas limit, which represents the maximum amount of gas a user is willing to spend on an operation. The precise gas cost depends on the complexity of performing "smart contracts", which define a set of rules using a Turing-complete programming language. After the transaction is completed, all unused gas is returned to the user's account. If the gas limit is less than the gas cost, the transaction is viewed as invalid and is rejected; the gas spent on performed calculations is not returned to the account.
– The gas price is also determined by the user and represents the price per unit of gas the user is offering to pay. Since a miner's reward is largely determined by the gas price, a higher gas price results in a greater probability of a transaction being selected by miners and grouped into a block.
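For instance, with a hypothetical gas price chosen purely for illustration: a plain transfer consuming the minimum of 21,000 units of gas at a gas price of 100 Gwei costs
$$21{,}000 \times 100\ \text{Gwei} = 2{,}100{,}000\ \text{Gwei} = 0.0021\ \text{ETH}.$$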
2.2 Gaussian Process
A Gaussian process is a stochastic process that provides a powerful tool for probabilistic inference on distributions over functions. It offers a flexible nonparametric Bayesian framework for estimating latent functions from data. Briefly speaking, a Gaussian process makes predictions with uncertainty. For instance, it may predict that a stock price one minute from now is $100, with a standard deviation of $30. Knowing the prediction uncertainty is important for pricing strategies. The rest of this section follows [9] to describe GP regression.

Definition 1. A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.

A GP is specified by its mean function and covariance function, which determine the function's smoothness and variability. Given input vectors $x$ and $x'$, we define the mean function $m(x)$ and the covariance function $k(x, x')$ of a real process $f(x)$ as
$$m(x) = \mathbb{E}[f(x)], \qquad k(x, x') = \mathbb{E}[(f(x) - m(x))(f(x') - m(x'))],$$
and write the Gaussian process as $f(x) \sim \mathcal{GP}(m(x), k(x, x'))$.

Given a training dataset $\mathcal{D} = \{(x_i, y_i) \mid i = 1, 2, \cdots, n\}$, where $x$ denotes the input vector and $y$ the target variable, one can consider the Gaussian noise model $y_i = f(x_i) + \mathcal{N}(0, \sigma_n^2)$. The squared exponential covariance function with hyperparameters $\theta = \{\sigma_f, l\}$,
$$k(x, x') = \sigma_f^2 \exp\left(\frac{-|x - x'|^2}{2l^2}\right),$$
is considered the most widely used covariance function. This covariance function is also appropriate for our model, since recent gas prices have stronger correlation. The joint distribution of the observed target values and the function value at a new input $x_*$ is
$$\begin{bmatrix} \mathbf{y} \\ f(x_*) \end{bmatrix} \sim \mathcal{N}\left(0, \begin{bmatrix} K + \sigma_n^2 I & K_*^T \\ K_* & K_{**} \end{bmatrix}\right), \tag{1}$$
where $\mathbf{y} = \begin{bmatrix} y_1 & \cdots & y_n \end{bmatrix}^T$, the notation $T$ denotes matrix transposition, $K = [k(x_i, x_j)]_{i,j=1,\cdots,n}$ is the covariance matrix, $K_* = [k(x_*, x_i)]_{i=1,\cdots,n}$, and $K_{**} = k(x_*, x_*)$. Therefore, the posterior predictive distribution is
$$f(x_*) \mid \mathcal{D}, x_*, \theta \sim \mathcal{N}\left(K_* (K + \sigma_n^2 I)^{-1} \mathbf{y},\; K_{**} - K_* (K + \sigma_n^2 I)^{-1} K_*^T\right). \tag{2}$$
The accuracy of the GP regression model depends on how well the covariance function is selected. In particular, estimating the hyperparameters of the covariance function is critical to prediction performance. The Laplace approximation framework is often utilized to approximate the posterior predictive distribution; it is constructed from the second-order Taylor expansion of $\log p(f \mid \mathcal{D}, \theta)$ around the maximum of the posterior and has been shown to provide more precise estimates in much shorter time [10].
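To make the prediction step concrete, the following is a minimal, self-contained Java sketch of Eq. (2) with the squared exponential kernel. It is an illustration only: the class and method names are ours, the hyperparameters are assumed to be already estimated, and this is not the Mathematica implementation used in Sect. 4.

public class GpPrediction {

    static double kernel(double x, double y, double sigmaF, double l) {
        double d = x - y;
        return sigmaF * sigmaF * Math.exp(-d * d / (2.0 * l * l));
    }

    /** Returns {posterior mean, posterior std dev} of f(xStar), Eq. (2). */
    static double[] predict(double[] xs, double[] ys, double xStar,
                            double sigmaF, double l, double sigmaN) {
        int n = xs.length;
        double[][] a = new double[n][n];            // K + sigma_n^2 I
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++)
                a[i][j] = kernel(xs[i], xs[j], sigmaF, l);
            a[i][i] += sigmaN * sigmaN;
        }
        double[][] chol = cholesky(a);
        double[] alpha = solve(chol, ys);           // (K + sigma_n^2 I)^{-1} y
        double[] kStar = new double[n];
        for (int i = 0; i < n; i++)
            kStar[i] = kernel(xStar, xs[i], sigmaF, l);
        double mean = 0.0;                          // K_* (K + sigma_n^2 I)^{-1} y
        for (int i = 0; i < n; i++) mean += kStar[i] * alpha[i];
        double[] v = solve(chol, kStar);            // (K + sigma_n^2 I)^{-1} K_*^T
        double var = kernel(xStar, xStar, sigmaF, l);
        for (int i = 0; i < n; i++) var -= kStar[i] * v[i];
        return new double[] { mean, Math.sqrt(Math.max(var, 0.0)) };
    }

    /** Cholesky factorization a = L L^T for symmetric positive definite a. */
    static double[][] cholesky(double[][] a) {
        int n = a.length;
        double[][] L = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j <= i; j++) {
                double s = a[i][j];
                for (int k = 0; k < j; k++) s -= L[i][k] * L[j][k];
                L[i][j] = (i == j) ? Math.sqrt(s) : s / L[j][j];
            }
        return L;
    }

    /** Solves (L L^T) x = b by forward and back substitution. */
    static double[] solve(double[][] L, double[] b) {
        int n = b.length;
        double[] y = new double[n];
        for (int i = 0; i < n; i++) {               // forward: L y = b
            double s = b[i];
            for (int k = 0; k < i; k++) s -= L[i][k] * y[k];
            y[i] = s / L[i][i];
        }
        double[] x = new double[n];
        for (int i = n - 1; i >= 0; i--) {          // backward: L^T x = y
            double s = y[i];
            for (int k = i + 1; k < n; k++) s -= L[k][i] * x[k];
            x[i] = s / L[i][i];
        }
        return x;
    }
}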
3 Methodology

In this section, we explain the steps of data pre-processing. The processed data are then fitted with the GP regression model, Geth, and GS-Express. The method to evaluate the performance of each model is introduced in Sect. 3.3.

3.1 Pre-processing
In order to maintain statistical significance, we removed blocks with fewer than 7 transactions. Further, some blocks contain uncommonly low-cost transactions; for instance, there are three zero-fee transactions in block 11763787. Such transactions are rare and yet create noise in the models. Therefore, we excluded them by removing all transactions whose fees were lower than the 2.5th percentile of all gas prices in the block. The processing steps are as follows (a sketch of these steps in code is given after the list):
1. Take blocks with more than six transactions.
2. Calculate the 2.5th percentile of each block, called P2.5.
3. Remove the transactions whose fees are lower than P2.5.
4. Obtain the minimum gas price in each block, called y.
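The following Java sketch illustrates steps 1–4 under assumptions of ours: each block is represented simply by the list of its transactions' gas prices, and the percentile rule is the nearest-rank choice (the paper does not specify the interpolation scheme).

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PreProcessing {

    /** Nearest-rank percentile of the values (assumed convention). */
    static double percentile(List<Double> values, double p) {
        List<Double> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.size());
        return sorted.get(Math.max(rank - 1, 0));
    }

    /** Steps 1-4: returns the per-block minima y_1, ..., y_n. */
    static List<Double> minimumGasPrices(List<List<Double>> blocks) {
        List<Double> y = new ArrayList<>();
        for (List<Double> gasPrices : blocks) {
            if (gasPrices.size() <= 6) continue;        // step 1
            double p25 = percentile(gasPrices, 2.5);    // step 2
            double min = Double.MAX_VALUE;              // steps 3-4
            for (double price : gasPrices)
                if (price >= p25) min = Math.min(min, price);
            y.add(min);
        }
        return y;
    }
}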
3.2 The Model
Take $n$ consecutive blocks $b_1, b_2, \cdots, b_n$ and let the training dataset be $\mathcal{D} = \{(i, y_i) \mid i = 1, 2, \cdots, n\}$, where $y_i := \min\{\text{gas prices in block } b_i\}$. The goal is to use the GP regression model to predict $y_{n+1}$ in $b_{n+1}$. We consider the Gaussian noise model $y_i = f(i) + \mathcal{N}(0, \sigma_n^2)$. The squared exponential covariance function is used to estimate the covariance matrix in the joint distribution (1). The posterior predictive distribution (2) is used to predict the mean $\hat{y}_{n+1}$ and the standard deviation $\hat{s}_{n+1}$ of the minimum gas price in the $(n+1)$-th block; that is, $f(n+1) \sim \mathcal{N}(\hat{y}_{n+1}, \hat{s}_{n+1}^2)$. More specifically, this estimation means that the probability that $\hat{y}_{n+1}$ is greater than the minimum gas price $y_{n+1}$ of the $(n+1)$-th block is 50%. We then define $\hat{y}_{n+1}$ to be $P_{50}$ of the $(n+1)$-th block. Similarly, $\hat{y}_{n+1} + 0.675 \cdot \hat{s}_{n+1}$ is $P_{75}$, $\hat{y}_{n+1} + \hat{s}_{n+1}$ is $P_{84}$ (more precisely, $P_{84.13}$), and $\hat{y}_{n+1} + 1.645 \cdot \hat{s}_{n+1}$ is $P_{95}$ of the $(n+1)$-th block.
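For instance, with hypothetical values $\hat{y}_{n+1} = 100$ Gwei and $\hat{s}_{n+1} = 20$ Gwei, the model recommends $P_{75} = 100 + 0.675 \cdot 20 = 113.5$ Gwei and $P_{95} = 100 + 1.645 \cdot 20 = 132.9$ Gwei.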
3.3 Model Evaluation
In this section, we introduce a model comparison criterion, the inverse probability weight (abbrev. IPW), to compare the models' performance in terms of accuracy and efficiency. Our procedure to compare the GP regression model, GS-Express, and Geth contains the following steps:

I. Given $n$ consecutive blocks $b_1, b_2, \cdots, b_n$ and $0 < \alpha < 1$, we use the GP regression model, GS-Express, and Geth to predict $P_\alpha$. We then compare the predicted $P_\alpha$ with the actual $y_{n+1}$. If $P_\alpha \ge y_{n+1}$, the transaction is viewed as successfully included in block $b_{n+1}$. Define
$$T_{n+1}(\alpha) = \begin{cases} 1, & \text{if } P_\alpha \ge y_{n+1}, \\ 0, & \text{otherwise.} \end{cases}$$

II. Iterating the model fitting yields $T_{n+1}, T_{n+2}, \cdots$, and so on. Define the success rate among blocks $b_s, b_{s+1}, \cdots, b_{s+t-1}$ as
$$R_{s,t,n}(\alpha) = \frac{\sum_{j=0}^{t-1} T_{s+j}(\alpha)}{t}.$$
Here $n$ is the number of observations in the training data. Note that the function $R_{s,t,n}(\alpha)$ is increasing for $0 < \alpha < 1$. We use $R_{s,t,n}$ to observe the short-term ($t \le 3n$) and long-term ($t \ge 10n$) success rates while training on $n$ observations.

It is notable that $T_{n+1}(\alpha) \sim \mathrm{Bernoulli}(\alpha)$
given an ideal gas price oracle. Therefore, in the long run, the success rate of all three methods (the GP model, GS-Express, and Geth) using $P_\alpha$ should be approximately $\alpha\%$ (i.e., $R_{s,t,n}$ is a consistent estimator of $\alpha$). That is,
$$\lim_{m \to \infty} R_{s,m,n}(\alpha) = \alpha\%.$$
When $m$ is not large, $R_{s,m,n}$ reflects the short-term predictive performance of each method. An inefficient gas price oracle can result in a transaction pending for, e.g., 50 min. Although users can resign pending transactions from the Ethereum network (by reusing the same nonce with an increased gas price), a new price is still required from the oracle; therefore, a gas price oracle needs to perform reasonably within a short period of time. The ultimate goal is to predict the lowest price such that the transaction can be included in a block. Higher prices can, of course, result in higher success rates. Therefore, we introduce a measurement, the inverse probability weight, which reflects the better pricing strategy:
$$\mathrm{IPW}_{s,t,n}(\alpha) = \frac{\text{average cost}}{R_{s,t,n}(\alpha)}.$$
In words, a small $\mathrm{IPW}_{s,t,n}(\alpha)$ represents low cost with a high success rate; conversely, high cost with a low success rate gives a large $\mathrm{IPW}_{s,t,n}(\alpha)$. We will use this measurement to evaluate model performance in the next section.
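For instance, with hypothetical numbers: a method paying 150 Gwei on average with a success rate of 0.75 has IPW = 150/0.75 = 200, whereas one paying 120 Gwei with a success rate of 0.50 has IPW = 240; the former therefore offers the better price-to-success trade-off despite its higher average cost.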
4 Empirical Analysis

We use a laptop with an Intel Xeon E3-1505M v5 2.80 GHz CPU and 32 GB RAM to perform all analyses. Mathematica 12 was utilized for Gaussian process regression model fitting, with the squared exponential covariance function specified. SageMath was used to operate the model fitting and prediction for GS-Express and Geth. All used data can be found in [11].

4.1 Observations
The historical block data, blocks 11753792 to 11823790 (Jan 29, 2021 to Feb 9, 2021), were used in the analysis. There were 1450 blocks with no transactions and 4 blocks with fewer than 7 transactions. We first removed these blocks and followed steps 2, 3, and 4 in Sect. 3.1 to process the remaining blocks. After pre-processing, we have 68,545 blocks, denoted by $b_1, b_2, \cdots, b_{68545}$; e.g., the block number of $b_1$ is 11753792, and that of $b_{201}$ is 11753994. Taking 200 consecutive blocks as training data, we fit the GP model and GS-Express to the training data; Geth uses only 100 training observations. Each model is then used to predict the minimum gas price of the next block. In other words, we fit each of the three models to the 1st to 200th blocks and predict $P_{50}$, $P_{75}$, $P_{84}$, and $P_{95}$ of the minimum gas
Method
Success Rate P75 P84
P50
GP 0.358 GS-Express 0.502 Geth 0.500 Method GP GS-Express Geth
0.744 0.696 0.712
P50 359.22 282.07 286.20
0.862 0.784 0.798
P95 0.972 0.914 0.922
Average cost (Gwei) P50 P75 P84 P95 128.6 141.6 143.1
168.3 168.0 164.7
Inverse Probability Weight P75 P84 226.22 241.38 231.32
217.40 227.30 218.17
187.4 178.2 174.1
225.3 197.1 193.3
P95 231.79 215.65 209.65
Note: The average cost is the average of predicted price 10−9 .
We also calculated the minimum short term success rate min{Rs,m,200 (α) | s = 1, 2, · · · , 68345}, and reported the minimum success in Table 2. The success rate of consecutive m = 25, 50, 100 blocks are considered to be fast, average, and slow, respectively, when grouped by miners. From Table 1 and 2, we observe that GP performs better when using P75 , P84 , and P95 in a long term and also maintain better success rate in short terms m = 25, 50, 100. We now focus on the success rate of P75 of each method. GP gave 0.744 success rate while GS-Express and Geth had 0.696 and 0.712. We derived an α
168
C. Chuang and T. Lee
Table 2. The minimum success rate Rs,m,200 (α) of GP and GS-Express and Rs,m,100 (α) of Geth with α = 50, 75, 84, 95 of each method. We consider m = 25, 50, 100 to represent that the transaction is fast, average, and slow to be included in a block.
m
P50
P75
GP P84
P95
P50
GS-Express P75 P84
P95
P50
Geth P75 P84
P95
25 50 100
0 0.04 0.09
0.12 0.20 0.33
0.24 0.32 0.42
0.36 0.54 0.71
0 0.02 0.07
0 0.06 0.12
0.12 0.18 0.32
0 0.04 0.10
0 0.06 0.19
0.24 0.40 0.50
0 0.06 0.16
0 0.08 0.23
such that Pα from GP provided a comparable level of success rate, but lower cost than GS-Express and Geth, see Table 3. Therefore, in the long term, GP has the advantage in cost and success rate. However, GS-Express can have better performance when using less training data. Table 3. The minimum short term success rate Rs,m,200 , long term success rate R201,68545,200 , and the average cost using GP predicted P72.24 . Short term success rate Long term m = 25
0.12
Success rate
m = 50
0.18
Average cost 164.16 Gwei
m = 100 0.30
4.2
IPW
0.712 230.56
More Observations
The gas prices showed large fluctuations during block 11903793 to 119176946 . We also provided analogous results for these blocks transaction data. According to the previous observations, we reduced the training data points for GS-Express to 50 or 30. After pre-processing these recent block data, we have 13627 blocks denoted by b1 , b2 , · · · , b13627 , e.g. the block number of b1 is 11903793, and b201 is 11903999. The long term success rate R201,13627,200 of GP and R201,13627,50 and R201,13627,30 of GS-Express, and the average costs are reported in Table 4. The minimum short term success rates (m = 25, 50, 100) of each model can be found in Table 5. The results are consistent with the observations in Sect. 4.1.
6
Feb-22-2021 to Feb-24-2021.
Gaussian Process for Gas Price
169
Table 4. The long term success rate R201,13627,200 (α) of GP prediction, R201,13427,30 (α) and R201,13427,50 (α) using GS-Express. And the average cost with α = 50, 75, 84, 95 of each method. The corresponding inverse probability weights IPW201,68545,200 (α) are also reported. Success Rate P75 P84
P50
Method
GP 0.375 GS-Express (30) 0.516 GS-Express (50) 0.506
0.79 0.70 0.70
0.911 0.786 0.786
GP GS-Express (30) GS-Express (50)
0.982 0.882 0.893
Average cost (Gwei) P50 P75 P84 P95 248.5 270.9 270.2
314.5 297.0 301.1
346.3 312.4 317.5
Inverse Probability Weight P75 P84
P50
Method
P95
662.67 525.00 533.99
398.10 424.29 430.14
409.4 331.8 341.0
P95
380.13 397.46 403.94
416.60 376.19 381.86
Table 5. The minimum success rate Rs,m (α) with α = 50, 75, 84, 95 of each method. We consider m = 25, 50, 100 to represent that the transaction is fast, average, and slow to be included in block. GP m
P50
P75
P84
P95
P50
25 50 100
0 0.04 0.09
0.08 0.24 0.30
0.24 0.38 0.42
0.32 0.52 0.63
0 0.08 0.12
GS-Express(30) P75 P84 P95 0 0.10 0.16
0 0.14 0.21
0.28 0.42 0.52
P50 0 0.06 0.12
GS-Express(50) P75 P84 P95 0 0.08 0.12
0 0.10 0.16
0.16 0.28 0.39
Similarly, we find that the prediction P69.67 from GP gave a comparable level of success rate with the other methods, but lower cost, see Table 6.
5
Discussion
The goal of our study is to develop a gas price oracle which ensures that transactions will be included in a block within a user required timeline without overpaying. The proposed the GP regression provided an efficient prediction of gas prices, especially when the transaction volumes increase rapidly. When using various amounts of data, e.g., 200, 100, 50, or 30, the GS-Express model performed poorly from block 11799554 to 11799759 and 11907262 to 11907569. Figure 1 compared the prediction of P75 from each method to the true minimum gas price in each block. When gas prices increased, GS-Express often underestimated the price and resulted in a pending transaction until the price stabilized or dropped. Furthermore, GS-Express overestimated the price when the price just dropped from a peak, which results in the user overpaying. When the pending transaction volume remains stable, GS-Express performs better if we reduce the training data observations (from 200 down to 50 or
170
C. Chuang and T. Lee
Table 6. The minimum short term success rate Rs,m,200 , long term success rate R, and the average cost using GP predicted P69.67 . Short term success rate Long term m = 25
0.08
m = 50
0.20
m = 100 0.26
(a) block number:11799554-11800236
Success rate
0.701
Average cost 298.85 Gwei IPW
426.32
(b) block number:11907262-11907569
Fig. 1. The comparison of the true minimum gas prices in each block and the prediction of P75 from each method. Block number 11799554 to 11800236 (top) and block number 11907262 to 11907569 (bottom) are shown here.
30). Reducing the training data can be a risk due to abnormal transactions such as zero fee transactions which may create more noise for the models. Preprocessing the data can effectively reduce the impact of these abnormal data points. Therefore, our observations from the empirical analysis are as follows: 1. GP maintains a better success rate with little overpayment when transaction volumes are increasing rapidly. 2. The prediction of Geth and GS-Express can be improved by reducing training data when the transaction volume fluctuate greatly. However, abnormal transactions can interfere with the models. Therefore, pre-processing data is an important step. 3. Long term success rates of all 3 methods are comparable. However, Geth and GS-Express often underestimate the gas price when gas price rise rapidly.
Gaussian Process for Gas Price
171
In addition to using GP only, we propose the following gas price oracle, Algorithm 1, which consists of GP and GS-Express and depends on the change of instant success rates. Instant success rate R can be used to monitor the bias of the GP and GS-Express estimators. When the gas prices are stable, GS-Express with a small training sample size performs well. When gas prices increases rapidly, the success rate R is smaller than the expected value α, and users can switch to GP to maintain the expected α. If R is higher than α, the value α should be adjusted to a lower level. This oracle offers efficiency, success rate, and better cost. Algorithm 1. Gas Price Oracle Input: the desired success rate α, nGS (resp. nGP ) the size of training data of GS-Express(resp. GP), and an allowed error e. Output: a prediction of the block with number s + nGS . 1: Perform pre-process: i. Take blocks with more than six transactions. ii. Calculate the 2.5th percentile of each block, called P2.5 iii. Remove the transactions in which the fees are lower than P2.5 . iv. Construct a set M of the minimum gas price in each block. 2: Let the, Pα , prediction of GP(resp. GS-Express) be PGP,α (resp. PGS,α ). With s advancing, the success rate Rs,nGS ,nGS (α) of GS-Express keeps updating, do: a. the case Rs,nGS ,nGS (α) < α − e: The output is max{PGP,α , PGS,α }. b. the case α − e ≤ Rs,nGS ,nGS (α) ≤ α + e: The output is PGS,α . c. the case Rs,nGS ,nGS (α) > α + e: Use intermediate value theorem to find α such that α − e ≤ Rs,nGS ,nGS (α ) ≤ α + e. If such α does not exist, one takes α = α. Then the output is PGS,α .
To evaluate the proposed gas price oracle, we use the gas prices in block 11799554 to 11800236 and 11903999 to 11917694. The gas prices in block 11799554 to 11800236 changed gradually and in 11903999 to 11917694 changed rapidly. We take nGS = 30, nGP = 200, and e = 0.1 in our oracle. Our gas price oracle has smaller inverse probability weighting under most circumstances. The results suggested that our oracle has lower costs with higher long term success rates while also preserving the desired short term success rates, see Table 7 and 8.
172
C. Chuang and T. Lee
Table 7. The long term success rate, the average cost, and the inverse probability weights with α = 50, 75, 84, 95 of our method, GS-Express, and Geth using data from block 11753792 to 11823790 (top two tables). The bottom two tables report the long term success rate, the average cost, and the inverse probability weights using data from block 11903793 to 11917694.
Method
P50
Success Rate P75 P84
P95
0.73 0.70 0.71
0.92 0.91 0.92
Our method 0.52 GS-Express 0.50 Geth 0.50
Our method GS-Express Geth Method
275.00 283.20 286.20 P50
Method Our method GS-Express Geth
216.99 240.00 231.97
Success Rate P75 P84
P95
0.75 0.70 0.70
0.92 0.90 0.91
Our method 0.52 GS-Express 0.52 Geth 0.50
143.0 141.6 143.1
158.4 168.0 164.7
Inverse Probability Weight P75 P84
P50
Method
0.81 0.78 0.80
Average cost (Gwei) P50 P75 P84 P95
P50 524.42 509.23 537.60
0.842 0.79 0.78
207.53 228.46 217.63
189.7 197.1 193.3
P95 206.20 216.59 210.11
Average cost (Gwei) P50 P75 P84 P95 272.7 264.8 268.8
303.6 316.2 309.4
Inverse Probability Weight P75 P84 404.80 451.71 442.00
168.1 178.2 174.1
384.20 424.30 416.03
323.5 335.2 324.5
352.3 367.8 358.7
P95 382.93 408.67 394.18
Note: The average cost is the average of predicted price · 10−9 .
Potential future work includes practicing the proposed gas price prediction procedure in real-time and verifying the efficiency and accuracy of this procedure. This will also enhance the study of short term success rate and real waiting times. Additionally, different covariance functions of the GP regression models should also be studied closely to determine which covariance function would be the best suited for predicting gas prices.
Gaussian Process for Gas Price
173
Table 8. The short term success rate of each method with m = 25, 50, 100 from the data of blocks 11753792 to 11823790 and 11903793 to 11917694.
m P50 P75 P84 P95
m P50 P75 P84 P95
block 11753792 to 11823790 Our method GS-Express 25 50 100 25 50 100 25 0.04 0.12 0.20 0.68
0.08 0.22 0.38 0.72
0.18 0.34 0.50 0.81
0 0 0 0.12
0.02 0.06 0.06 0.18
0.07 0.12 0.16 0.32
0 0 0 0.28
block 11903793 to 11917694 Our method GS-Express 25 50 100 25 50 100 25 0 0.12 0.32 0.52
0.08 0.28 0.46 0.64
0.17 0.35 0.50 0.73
0 0 0 0
0.04 0.04 0.06 0.08
0.06 0.10 0.12 0.13
0 0 0 0.08
Geth 50
100
0.04 0.06 0.08 0.40
0.10 0.19 0.23 0.50
Geth 50
100
0.04 0.06 0.08 0.14
0.08 0.12 0.12 0.22
Acknowledgment. We would like to thank all of the AMIS data management teams and the participants who contributed to this project. We also thank Yu-Te Lin and Gavino Puggioni for their useful comments and feedback on earlier drafts, and all the reviewers for their helpful comments and suggestions.
References
1. Ethereum.org. https://ethereum.org/en/. Accessed 17 Feb 2021
2. Wood, G.: Ethereum: a secure decentralised generalised transaction ledger, Petersburg version 41c1837. Ethereum Yellow Paper, February 2021
3. Gas platform. https://www.blocknative.com/gas. Accessed 17 Feb 2021
4. GitHub: official Go implementation of the Ethereum protocol. https://github.com/ethereum/go-ethereum/. Accessed 17 Feb 2021
5. EthGasStation. https://ethgasstation.info. Accessed 17 Feb 2021
6. GitHub: gasstation-express. https://github.com/ethgasstation/gasstation-express-oracle. Accessed 17 Feb 2021
7. Antonio Pierro, G., Rocha, H., Tonelli, R., Ducasse, S.: Are the gas prices oracle reliable? A case study using the EthGasStation. In: 2020 IEEE International Workshop on Blockchain Oriented Software Engineering (IWBOSE), pp. 1–8 (2020)
8. Werner, S.M., Pritz, P.J., Perez, D.: Step on the gas? A better approach for recommending the Ethereum gas price. In: Pardalos, P., Kotsireas, I., Guo, Y., Knottenbelt, W. (eds.) Mathematical Research for Blockchain Economy. SPBE, pp. 161–177. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-53356-4_10
9. Rasmussen, C., Williams, C.: Gaussian Processes for Machine Learning, January 2005
10. Rue, H., Martino, S., Chopin, N.: Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. J. Roy. Stat. Soc. Ser. B 71, 319–392 (2009)
11. Bayesian gas price oracle. https://github.com/bayesian-gas-price-oracle/A-Practical-and-Economical-Bayesian-Approach-to-Gas-Price-Prediction. Accessed 19 Mar 2021
Author Index
A: Adel, Heba, 91 · Alom, Zulfikar, 134 · Angryk, Rafal, 3 · Aung, Zeyar, 134 · Azim, Mohammad Abdul, 134
B: Banditwattanawong, Thepparit, 123 · Benbernou, Salima, 77
C: Chaipornkaew, Piyanuch, 123 · Chuang, ChihYun, 160 · Cullen, Gary, 41
D: Dahal, Keshav, 106 · Daoun, Dema, 134 · Dix, Marcel, 15
E: ElBakary, Mostafa, 91 · ElDahshan, Kamal, 91 · Eremin, Oleg, 147 · Everan, Mary Rose, 41
H: Herbaut, Nicolas, 67
I: Ibnat, Fabiha, 134
K: Kevin N'DA, Aboua Ange, 106 · Khacef, Kahina, 77
L: Lee, TingFang, 160
M: Ma, Ruizhe, 3 · Matalonga, Santiago, 106 · McCann, Michael, 41
O: Ouziri, Mourad, 77
P: Partheepan, Ravinthiran, 27 · Perrichon-Chrétien, Andrea, 67
S: Salah, Dina, 91 · Sanjana, R. K., 53 · Six, Nicolas, 67 · Stepanova, Maria, 147
V: Vikhyath, K. B., 53 · Vismitha, N. V., 53
Y: Younas, Muhammad, 77