Paolo Mori · Steven Furnell · Olivier Camp (Eds.)
Communications in Computer and Information Science
Information Systems Security and Privacy Third International Conference, ICISSP 2017 Porto, Portugal, February 19–21, 2017 Revised Selected Papers
Communications in Computer and Information Science 867

Commenced Publication in 2007

Founding and Former Series Editors: Alfredo Cuzzocrea, Xiaoyong Du, Orhun Kara, Ting Liu, Dominik Ślęzak, and Xiaokang Yang
Editorial Board

Simone Diniz Junqueira Barbosa, Pontifical Catholic University of Rio de Janeiro (PUC-Rio), Rio de Janeiro, Brazil
Phoebe Chen, La Trobe University, Melbourne, Australia
Joaquim Filipe, Polytechnic Institute of Setúbal, Setúbal, Portugal
Igor Kotenko, St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, St. Petersburg, Russia
Krishna M. Sivalingam, Indian Institute of Technology Madras, Chennai, India
Takashi Washio, Osaka University, Osaka, Japan
Junsong Yuan, University at Buffalo, The State University of New York, Buffalo, USA
Lizhu Zhou, Tsinghua University, Beijing, China
More information about this series at http://www.springer.com/series/7899
Editors

Paolo Mori, Consiglio Nazionale delle Ricerche, Pisa, Italy
Steven Furnell, Plymouth University, Plymouth, UK
Olivier Camp, MODESTE/ESEO, Angers, France
ISSN 1865-0929  ISSN 1865-0937 (electronic)
Communications in Computer and Information Science
ISBN 978-3-319-93353-5  ISBN 978-3-319-93354-2 (eBook)
https://doi.org/10.1007/978-3-319-93354-2
Library of Congress Control Number: 2018946820

© Springer International Publishing AG, part of Springer Nature 2018

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Printed on acid-free paper

This Springer imprint is published by the registered company Springer International Publishing AG, part of Springer Nature. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
The present book includes extended and revised versions of a set of selected papers from the Third International Conference on Information Systems Security and Privacy (ICISSP 2017), held in Porto, Portugal, during February 19–21, 2017.

The International Conference on Information Systems Security and Privacy provides a meeting point for researchers and practitioners, addressing the security and privacy challenges facing organizations from both technological and social perspectives. The conference welcomes papers offering both practical and theoretical contributions and presenting research or applications across all aspects of security and privacy for organizations and individuals.

ICISSP 2017 received 100 paper submissions from 35 countries, of which 13 are included in this book. The selected papers were chosen by the event chairs, based upon review feedback provided by the Program Committee members, the session chairs' assessment of the presentations at the event, and the program chairs' global view of all papers included in the technical program. The authors of selected papers were then invited to submit a revised and extended version of their work, having at least 30% innovative material. A final review and revision process was performed on the extended version of each article by the program co-chairs.

The papers selected for inclusion in this book contribute to the understanding of relevant trends of current research on information systems security and privacy, including: vulnerability analysis and countermeasures, attack pattern discovery and intrusion detection, malware classification and detection, cryptography applications, data privacy and anonymization, security policy analysis, enhanced access control, and socio-technical aspects of security.

We would like to thank all the authors for their contributions and also the reviewers who helped ensure the quality of this publication.

February 2017
Paolo Mori Steven Furnell Olivier Camp
Organization
Conference Chair

Olivier Camp, MODESTE/ESEO, France

Program Co-chairs

Paolo Mori, Istituto di Informatica e Telematica, Consiglio Nazionale delle Ricerche, Italy
Steven Furnell, University of Plymouth, UK
Program Committee

Magnus Almgren, Chalmers University of Technology, Sweden
Ja'far Alqatawna, University of Jordan, Jordan
Mario Alvim, Federal University of Minas Gerais, Brazil
Morteza Amini, Sharif University of Technology, Iran
Thibaud Antignac, CEA/DRT/LIST, France
Gilles Van Assche, STMicroelectronics, Belgium
Man Ho Au, The Hong Kong Polytechnic University, SAR China
Alessandro Barenghi, Polytechnic University of Milan, Italy
Catalin V. Birjoveanu, Al.I. Cuza University of Iasi, Romania
Christos Bouras, University of Patras and CTI&P Diophantus, Greece
Francesco Buccafurri, University of Reggio Calabria, Italy
Olivier Camp, MODESTE/ESEO, France
Luigi Catuogno, University of Salerno, Italy
Hervé Chabanne, Morpho and Télécom ParisTech, France
Rui Chen, Samsung Research America, USA
Thomas Chen, City University London, UK
Feng Cheng, Hasso Plattner Institute, University of Potsdam, Germany
Hung-Yu Chien, National Chi Nan University, Taiwan
Stelvio Cimato, University of Milano, Crema, Italy
Mauro Conti, University of Padua, Italy
Gianpiero Costantino, Consiglio Nazionale delle Ricerche, Italy
Mathieu Cunche, INSA-Lyon/Inria, France
Ashok Kumar Das, International Institute of Information Technology, India
Hervé Debar, Télécom SudParis, France
Andreas Dewald, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Josep Domingo-Ferrer, Universitat Rovira i Virgili, Spain
Isao Echizen, National Institute of Informatics, Japan
David Eyers, University of Otago, New Zealand
Oriol Farras, Universitat Rovira i Virgili, Spain
Mathias Fischer, University Hamburg, Germany
Apostolos Fournaris, University of Patras, Greece
Benjamin Fung, McGill University, Canada
Steven Furnell, University of Plymouth, UK
Alban Gabillon, Laboratoire GePaSud, Université de la Polynésie Française, French Polynesia
Clemente Galdi, University of Napoli Federico II, Italy
Debin Gao, Singapore Management University, Singapore
Bok-Min Goi, Universiti Tunku Abdul Rahman, Malaysia
Mario Goldenbaum, Princeton University, USA
Dieter Gollmann, TU Hamburg, Germany
Ana I. González-Tablas, University Carlos III of Madrid, Spain
Gilles Guette, University of Rennes, France
R. C. Hansdah, Indian Institute of Science, Bangalore, India
Martin Hell, Lund University, Sweden
Guy Hembroff, Michigan Technological University, USA
Anthony Ho, University of Surrey, UK
Fu-Hau Hsu, National Central University, Taiwan
Dieter Hutter, German Research Centre for Artificial Intelligence, Germany
Rafiqul Islam, Charles Sturt University, Australia
Mariusz Jakubowski, Microsoft Research, USA
Jens Jensen, STFC Rutherford Appleton Laboratory, UK
Rafael Timoteo de Sousa Junior, University of Brasilia, Brazil
Mark G. Karpovsky, Boston University, USA
Anne Kayem, University of Cape Town, South Africa
Elisavet Konstantinou, University of the Aegean, Greece
Hristo Koshutanski, Safe Society Labs, Spain
Thomas Lagkas, International Faculty of the University of Sheffield, CITY College, Greece
Gianluca Lax, University of Reggio Calabria, Italy
Gabriele Lenzini, University of Luxembourg, Luxembourg
Shujun Li, University of Kent, UK
Flaminia Luccio, University Ca' Foscari Venezia, Italy
Ilaria Matteucci, Consiglio Nazionale delle Ricerche, Italy
Vashek Matyas, Masaryk University, Czech Republic
Catherine Meadows, US Naval Research Laboratory, USA
Florian Mendel, TU Graz, Austria
Nele Mentens, Katholieke Universiteit Leuven, Belgium
Ali Miri, Ryerson University, Canada
Mattia Monga, University of Milano, Italy
Paolo Mori, Consiglio Nazionale delle Ricerche, Italy
Charles Morisset, Newcastle University, UK
Kirill Morozov, Tokyo Institute of Technology, Japan
Ravi Mukkamala, Old Dominion University, USA
Paliath Narendran, State University of New York at Albany, USA
Antonino Nocera, Mediterranea University of Reggio Calabria, Italy
Carles Padro, Universitat Politecnica de Catalunya, Spain
Yin Pan, Rochester Institute of Technology, USA
Mauricio Papa, University of Tulsa, USA
Günther Pernul, University of Regensburg, Germany
Andreas Peter, University of Twente, The Netherlands
Makan Pourzandi, Ericsson Research, Canada
Bart Preneel, KU Leuven, Belgium
Kenneth Radke, Information Security Institute, Queensland University of Technology, Australia
Wolfgang Reif, University of Augsburg, Germany
Karen Renaud, Abertay University, UK
Eike Ritter, University of Birmingham, UK
Jean-Marc Robert, ETS Montreal, Canada
Neil Rowe, Naval Postgraduate School, USA
Antonio Ruiz-Martínez, University of Murcia, Spain
Michaël Rusinowitch, Laboratoire Lorrain de Recherche en Informatique et Ses Applications, France
Nader Sohrabi Safa, University of Warwick, UK
David Sanchez, Universitat Rovira i Virgili, Spain
Andrea Saracino, Consiglio Nazionale delle Ricerche, Italy
Michael Scott, Certivox Ltd., Ireland
Kent Seamons, Brigham Young University, USA
Qi Shi, Liverpool John Moores University, UK
Abdulhadi Shoufan, Khalifa University of Science, UAE
Jordan Shropshire, University of South Alabama, USA
Boris Skoric, Eindhoven University of Technology, The Netherlands
Angelo Spognardi, Sapienza University of Roma, Italy
Paul Stankovski, Lund University, Sweden
Rainer Steinwandt, Florida Atlantic University, USA
Hung-Min Sun, National Tsing Hua University, Taiwan
Cihangir Tezcan, Middle East Technical University, Turkey
Ciza Thomas, College of Engineering Trivandrum, India
Pierre Ugo Tournoux, Université de la Réunion, Reunion Island
Raylin Tso, National Chengchi University, Taiwan
Yasuyuki Tsukada, Kanto Gakuin University, Japan
Udaya Tupakula, Macquarie University, Australia
Shambhu J. Upadhyaya, University at Buffalo, USA
Adriano Valenzano, Consiglio Nazionale delle Ricerche, Italy
Rakesh M. Verma, University of Houston, USA
Artemios Voyiatzis, SBA Research, Austria
Adrian Waller, THALES Research and Technology, UK
Yong Wang, Dakota State University, USA
Edgar Weippl, SBA and FHSTP, Austria
Bing Wu, Fayetteville State University, USA
Ching-Nung Yang, National Dong Hwa University, Taiwan
Ping Yang, Binghamton University, USA
Alec Yasinsac, University of South Alabama, USA
Meng Yu, University of Texas at San Antonio, USA
Wenbing Zhao, Cleveland State University, USA
Tianqing Zhu, Deakin University, Australia
Additional Reviewers

Tooska Dargahi, University of Rome Tor Vergata, Italy
Hossein Fereidooni, University of Padua, Italy
Houssem Maghrebi, Safran Identity and Security, France
Serena Nicolazzo, University of Reggio Calabria, Italy
Partha Sarathi Roy, Kyushu University, Japan
Sameer Wagh, Princeton University, USA
Rui Xu, KDDI Research, Inc., Japan
Invited Speakers

Elisa Bertino, Purdue University, USA
Nancy Cam-Winget, Cisco Systems, USA
Bart Preneel, KU Leuven, Belgium
Contents
Application Marketplace Malware Detection by User Feedback Analysis . . . 1
Tal Hadad, Rami Puzis, Bronislav Sidik, Nir Ofek, and Lior Rokach

A System for Detecting Targeted Cyber-Attacks Using Attack Patterns . . . 20
Ian Herwono and Fadi Ali El-Moussa

A Better Understanding of Machine Learning Malware Misclassification . . . 35
Nada Alruhaily, Tom Chothia, and Behzad Bordbar

Situation-Aware Access Control for Industrie 4.0 . . . 59
Marc Hüffmeyer, Pascal Hirmer, Bernhard Mitschang, Ulf Schreier, and Matthias Wieland

How to Quantify Graph De-anonymization Risks . . . 84
Wei-Han Lee, Changchang Liu, Shouling Ji, Prateek Mittal, and Ruby B. Lee

A Security Pattern Classification Based on Data Integration . . . 105
Sébastien Salva and Loukmen Regainia

Forensic Analysis of Android Runtime (ART) Application Heap Objects in Emulated and Real Devices . . . 130
Alberto Magno Muniz Soares and Rafael Timoteo de Sousa Junior

Efficient Detection of Conflicts in Data Sharing Agreements . . . 148
Gianpiero Costantino, Fabio Martinelli, Ilaria Matteucci, and Marinella Petrocchi

On Using Obligations for Usage Control in Joining of Datasets . . . 173
Mortaza S. Bargh, Marco Vink, and Sunil Choenni

Directional Distance-Bounding Identification . . . 197
Ahmad Ahmadi and Reihaneh Safavi-Naini

An Information Security Management for Socio-Technical Analysis of System Security . . . 222
Jean-Louis Huynen and Gabriele Lenzini

An Exploration of Some Security Issues Within the BACnet Protocol . . . 252
Matthew Peacock, Michael N. Johnstone, and Craig Valli

Not So Greedy: Enhanced Subset Exploration for Nonrandomness Detectors . . . 273
Linus Karlsson, Martin Hell, and Paul Stankovski

Author Index . . . 295
Application Marketplace Malware Detection by User Feedback Analysis

Tal Hadad, Rami Puzis, Bronislav Sidik, Nir Ofek, and Lior Rokach

Department of Software and Information Systems Engineering, Ben-Gurion University of the Negev, Beersheba, Israel
{tah,sidik,nirofek}@post.bgu.ac.il, {puzis,liorrk}@bgu.ac.il
Abstract. Smartphones are becoming increasingly ubiquitous. Like recommended best practices for personal computers, users are encouraged to install antivirus and intrusion detection software on their mobile devices. However, even with such software these devices are far from fully protected. Given that application stores are the source of most applications, malware detection on these platforms is an important issue. Based on our intuition that an application's suspicious behavior will be noticed by some users and influence their feedback, we present an approach for analyzing user reviews in mobile application stores for the purpose of detecting malicious apps. The proposed method transforms an application's text reviews into numerical features in two main steps: (1) extract domain-phrases based on an external domain-specific textual corpus on computer and network security, and (2) compute three statistical features based on domain-phrase occurrences. We evaluated the proposed method on 2,506 applications along with their 128,863 reviews collected from the Amazon AppStore. The results show that the proposed method yields an AUC of 86% in the detection of malicious applications.

Keywords: Mobile malware · Malware detection · User feedback analysis · Text mining · Review mining
1 Introduction
The use of mobile devices such as smartphones, tablets, and smartwatches is constantly increasing. According to Statista, more than 1.49 billion smartphones were sold to end users worldwide in 2016, a 5% increase compared to the previous year, which set a new record for the highest number of smartphones sold in a year [1]. Moreover, in 2016 there were more than 5.6 million applications available in leading application stores (Google Play, iOS AppStore, and Amazon Appstore), a 14% increase compared to the previous year [2–4].

Unfortunately, due to the ubiquitous nature of mobile devices, the number of security threats targeting them has increased as well. In fact, malicious users, hackers, and even the manufacturers of mobile devices and applications themselves take advantage of the growing capabilities of mobile devices, careless and
unaware users, and vulnerabilities in the design of standard security mechanisms, in order to develop mobile-specific malware. Malicious applications aim to exploit system and application software for purposes such as exposing personal information, launching unwanted pop-ups, initiating browser redirects to download malicious files, and encrypting a victim's personal information in order to demand money (ransom) in exchange for a decryption key. For example, 'Viking Horde' and 'DressCode' malware which, according to Check Point [5,6], can be used to infiltrate internal networks, were detected in Google Play in May and August 2016, respectively.

Unfortunately, many malicious applications provide attackers with a large window of time to strike before they are removed from the marketplace. Meanwhile users install the applications, use them, and leave feedback in the respective marketplaces. Despite their effectiveness, antivirus (AV) engines and website scanners occasionally provide different conclusions regarding the same suspected file. Aware users often comment about the reports of different AV tools or about an app's unwanted or strange behavior. This feedback can serve as input data for crowdsourcing techniques that can determine whether an app is malicious or benign better than a single AV.

In recent years, many academic studies have focused on detecting malicious applications using static and dynamic analysis methods. Static analysis usually involves the inspection of an application's code in order to identify sensitive capabilities or potentially harmful instructions. Dynamic analysis monitors system activity (behavior) and classifies it as normal or abnormal. This classification is based on heuristics or rules and attempts to detect irregular behavior [7].

Major mobile application distributors and designated stores and marketplaces inspect uploaded applications with state-of-the-art malware detection tools and remove applications that are found to be malicious. Occasionally a user's review sparks the interest of malware laboratories, causing them to inspect a particular application and thus accelerating its removal from the marketplace. We explore an approach that can be used by the application marketplaces themselves, to alert them to suspected new malware that might require further investigation. The major strength of the proposed approach is its use of available information - the feedback users provide is in the possession of the app stores and marketplaces, so why not use it to detect malicious applications? Moreover, the proposed approach is robust to code transformations and varied mobile environments and does not require root permissions.

In this study we propose a new approach that analyzes user generated content (UGC), such as customer reviews, using text mining and machine learning techniques. User generated content can be found on designated review sites (e.g., TripAdvisor and Yelp) and purchase/review sites (e.g., Amazon and Travelocity). These sites provide fertile ground for the analysis of users' feedback, which can be very useful for decision making [8]. Extracting and aggregating user generated content across opinion-rich resources are mainly used for the following reasons. First, they allow a close look at online communities without having to
directly survey the population, which is a time-consuming and expensive task [9]. Second, they provide the ability to collect subjective information about a product or service in order to obtain collective information ("the wisdom of the crowd," a well-known phenomenon today).

The present paper is an extended and revised version of our paper presented at the International Conference on Information Systems Security and Privacy (ICISSP) 2017 [10]. In [10], we presented two numerical features per application, based on domain-phrase occurrences in low-rated reviews, in order to classify applications as malicious or benign. This approach is improved in this paper by replacing the former two features with three numerical features that are extracted from all of an application's reviews, rather than only the low-rated ones. Moreover, we present additional evaluation results for different settings, in order to emphasize the contribution of textual domain knowledge (represented as a domain lexicon).

The main contributions of this paper are as follows:

– We introduce a novel approach for malware detection that utilizes users' textual feedback as an indicator of malicious behaviour.
– The proposed method is capable of identifying suspicious applications which can then be further analyzed by static and dynamic approaches; thus the method can reduce the time and space consumed when executing more complex methods on large datasets (e.g., application stores).
2 Related Work
Initial research in this area, performed by the authors [10], classified applications as malicious or benign based on two features representing domain-specific phrase occurrence counts in low-rated application reviews, and resulted in a low true positive rate (TPR) of 23%. The low TPR is partially due to applications with no low-rated reviews, which are classified as benign. In this study we extend our previous work by representing an application using reviews of all ratings and by introducing an additional feature, as described in Sect. 3. Moreover, the evaluation focuses on investigating the accuracy improvement obtained when using external domain-specific phrases vs. internal review phrases.

To the best of our knowledge, there has been no other in-depth work focusing on malware detection based on text mining techniques applied to user feedback. Thus, the discussion of related work is divided into two sections: related work concerning malware detection and related work in the area of text classification.

2.1 Malware Detection
In recent years a great deal of academic research has been published proposing a wide range of methods for malware detection. In this section we mention the academic research that, in our opinion, has provided the most significant contribution to this field.
Behavior-Based Analysis. In [11], the authors proposed a behavior-based malware detection system (pBMDS) that correlates the user's input and output with system calls, in order to detect anomalous activities such as unsolicited SMS/MMS and email messages. Like other research, the authors rely on kernel calls that require root privileges on devices. [12] presented Crowdroid, a machine learning based framework for dynamic behavior analysis that recognizes Trojan-like malware on Android smartphones. This framework analyzes the number of times each system call has been issued by an application during the execution of an action that requires user interaction. In [13], the authors presented a new behavior-based anomaly detection system for detecting meaningful deviations in a mobile application's network behavior. The main goal of the system is to protect mobile device users and cellular infrastructure companies from malicious applications by: (1) the identification of malicious attacks or masquerading applications installed on a mobile device, and (2) the identification of republished popular applications injected with a malicious code (i.e., repackaging).

Research proposing application behavior analysis may suffer from one of the following limitations. First, many of the studies take place in a secure environment, which is problematic, since their conclusions may be misleading in real-life settings, particularly considering the wide range of environments that exist. Second, many studies explore malicious behavior at the kernel level (e.g., system calls, real-time permission monitoring using the Android intent system). In such solutions, which require root permission, the security application itself can become a threat to the security of the mobile device.

Permission-Based Analysis. In [14], the authors proposed a framework based on machine learning methods for the detection of malicious applications on Android. This framework is designed to detect malicious applications and enhance the security and privacy of smartphone users. In [15], the authors presented a new concept based on user behavior that involves learning the permissions used by applications. They also developed the VetDroid framework, which is a dynamic analysis platform for reconstructing sensitive behaviors in Android applications from the permission perspective.

Static Code Based Analysis. [16] proposed code transformation procedures for the Dalvik virtual machine (VM) and ARM (two engines for virtual machines on Android). The assumption behind this line of research is that malware writers rely on similar code transformation procedures to evade static signatures. Therefore, evaluating malware detectors against mutants is a good estimation of their robustness to future malware. In [17], the authors presented a static analysis tool for privacy leaks in Android applications. This tool analyzes the intention of privacy leaks and can distinguish between intended and unintended data transmission.
Zheng et al. [18] proposed the DroidAnalytics framework for static analysis, which is designed to detect obfuscated malware on the Android platform. The proposed framework generates signatures in order to detect obfuscated code and repackaged malware on three levels: (1) the method level, (2) the class level, and (3) the application level.

Most static malware detection techniques suffer from an inability to detect exploits introduced only at run-time. In addition, attackers have developed various techniques that are particularly effective against static analysis [19].

2.2 Text Classification
Methods for text classification can be based on feature-focused algorithms that propose new features, and model-focused algorithms that propose new classification models. Like our work, the bulk of the research in text classification focuses on feature-focused algorithms and concentrates on creating new features that enable new perspectives on the analyzed data. Since raw text is a special type of data that: (1) should be pre-processed (e.g., tokenizing sentences into words or normalizing words) [20], and (2) can be represented in many ways given the versatility of human language [21], various approaches for feature engineering have been proposed. These features are then fed to commonly used classifiers in order to determine the class of the text. Research based on model-focused algorithms is aimed at proposing new algorithms and classification models.

Approaches for feature engineering presented in the literature are diverse, proposing both text-based features and the integration of information from multiple sources. A number of works generate a lexicon of terms to represent each class, which then can be used to determine the class of a test instance, for example, by counting terms [22] or computing their joint probability (e.g., utilizing the Naive Bayes approach) [23]. Other feature-focused approaches attempt to bridge the gap between lexicon-based and learning-based approaches by dynamically setting weights to sets of predefined terms. Works in this area include [24], which uses machine learning with voting, and [25], which emphasizes the text surrounding special entities in the analyzed text. A simple approach involves representing a fragment of text as a term frequency (TF) vector and feeding this into a classifier [26]. Such a representation can be augmented with additional information about specific phrases [27]. Since this approach is highly effective [28], it has been used as a baseline in our work, along with latent Dirichlet allocation [29].
3 The Proposed Approach

3.1 Architectural Overview
Figure 1 presents an overview of the proposed malware detection method's architecture. The method's inputs are application reviews and a malware-related textual corpus, as described in Sects. 2 and 3.2, respectively.
Fig. 1. Flowchart of the presented malware detection system’s architectural overview [10].
The output is a probabilistic classifier which can classify previously unseen applications into 'malicious' and 'benign' classes, based on application reviews. The method generates a classifier in three main steps: (1) the generation of a domain-specific lexicon; (2) extraction of application features based on users' feedback and the domain lexicon; and (3) generation of a classification model using supervised learning. The steps corresponding to the numbered units are shown in Fig. 1.

3.2 Textual Corpora
In this paper, we deal with two different types of textual corpora: (1) a domain-specific corpus, and (2) an application review corpus provided by general users. These corpora are presented in natural language form, which imposes additional processing difficulties as described in [30]. Therefore, text normalization is required in order to reduce language diversity, including transformation to canonical form for further processing. We perform text normalization by applying the following steps: (1) removing numbers, punctuation, and stop words to remove noise, as described in [31]; (2) character replacement, including: character continuity (i.e., a character which repeats itself more than three times will be reduced to two times, e.g., "goooood" will be replaced by "good"), slang words as described in [32], predetermined spelling mistakes and expressions (e.g., "helpfull" will be replaced by "helpful"), predetermined missing apostrophes (e.g., "dont" will be replaced with "don't"), and predetermined apostrophe expansion (e.g., "don't" will be replaced with "do not"); and (3) stemming each word to its root, as described in [33].
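A minimal sketch of this normalization pipeline, assuming lower-casing plus NLTK's English stop-word list and Porter stemmer as stand-ins for the resources cited above (the replacement table is an illustrative placeholder, not the authors' actual slang and spelling lists):

```python
import re

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOPWORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()
# Illustrative placeholder for the predetermined replacement tables.
REPLACEMENTS = {"helpfull": "helpful", "dont": "do not"}

def normalize(text):
    text = text.lower()
    text = re.sub(r"\d+", " ", text)                  # remove numbers
    text = re.sub(r"(.)\1{3,}", r"\1\1", text)        # "goooood" -> "good"
    words = re.findall(r"[a-z']+", text)              # drop punctuation
    words = " ".join(REPLACEMENTS.get(w, w) for w in words).split()
    words = [w for w in words if w not in STOPWORDS]  # remove stop words
    return [STEMMER.stem(w) for w in words]           # stem each word to its root

print(normalize("This app is goooood, dont install it 2 times!"))
```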
Domain-Specific Corpus. The domain-specific corpus is used to generate a domain-specific lexicon, which will be used for natural language processing (NLP). In this work we focus on the cyber security domain, and the extraction of relevant phrases has been applied to computer and network security books such as [34,35]. The corpus consists of 53 manually selected books and includes both academic and professional books, such as [36–38]. We chose to use books as a resource for the cyber security domain, since they contain reliable and conventional terminology.

We use the domain-specific corpus to generate the domain-specific lexicon, referred to as the "Domain Lexicon" (DL). First, text normalization is applied to the domain-specific corpus, as described earlier. Then, unigrams and bigrams (phrases) are extracted along with their occurrence frequencies. Finally, the top 1% of the most frequent phrases is selected for inclusion in the DL. We used unigrams and bigrams rather than higher-order n-gram models to represent the text, based on the following studies: (1) [39], which shows that unigrams beat other higher-order n-grams, and (2) [40], which concludes that the use of trigrams and higher failed to show significant improvement.

In order to measure the contribution of using cyber security domain phrases, we experimented with two different corpora as DomainCorpus: (1) security books, and (2) reviews, as described in Subsect. 4.2. This enables comparing the results obtained when using only internal review phrases (i.e., the top p% most frequent phrases in the reviews) with those achieved when using external security phrases (i.e., the top p% most frequent phrases in the security books).

Algorithm 1. Creating Domain-Specific Lexicon.
Input: DomainCorpus, p
1: NormText ← perform textual normalization on DomainCorpus
2: Phrases ← extract unigrams and bigrams, along with their frequencies, from NormText
3: DomainLexicon ← select the top p% most frequent phrases from Phrases
4: return DomainLexicon
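A rough Python rendering of Algorithm 1 (the function and variable names are ours, and the corpus is assumed to be pre-normalized into token lists):

```python
from collections import Counter

def build_domain_lexicon(domain_corpus, p):
    """domain_corpus: iterable of normalized token lists, one per document."""
    phrases = Counter()
    for tokens in domain_corpus:
        phrases.update(tokens)                    # unigram frequencies
        phrases.update(zip(tokens, tokens[1:]))   # bigram frequencies
    top_k = max(1, int(len(phrases) * p / 100))   # keep the top p% of phrases
    return {phrase for phrase, _ in phrases.most_common(top_k)}
```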
Application Review Corpus. The application review corpus is required for training a malicious application detector (i.e., classifier). Each review consists of the following information: (1) textual content, (2) the author's ID, and (3) the review rating (1–5 stars). Also collected for each application are the Android application package files (APKs), which constitute the binary representation of an application for the Android platform. In this paper, only application reviews have been used to extract features (no other additional information has been used). As a first step we crawled and harvested application reviews from the Amazon application store. In the next step, each review is segmented into separate sentences [41]. Finally, text normalization is applied as previously described.
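A possible preprocessing step for the review corpus, here using NLTK's sentence tokenizer as a stand-in for the segmentation method cited above, followed by the normalization sketched earlier:

```python
from nltk.tokenize import sent_tokenize

def preprocess_review(review_text):
    # One normalized token list per sentence of the review.
    return [normalize(sentence) for sentence in sent_tokenize(review_text)]
```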
3.3 Model Generation
The proposed method uses natural language processing techniques for the identification of linguistic phrases that correspond to the basic phrases of a malicious application's domain. The model generation unit's inputs are the DL (as described in Sect. 3.2) and application reviews (as described in Subsect. 3.2). Once these two inputs have been obtained, we create a dataset that reflects the relationships between the DL and an application's reviews. Each instance in the dataset represents an application, based on the three features described below. Based on the created dataset, a classification model is generated using supervised learning. The model is used to classify malicious applications based on application reviews. The model generation unit is composed of three main processes: (1) feature generation; (2) feature extraction; and (3) supervised learning.

Feature Generation Process. The feature generation process detects and counts the occurrences of each phrase $DL_i \in DL$ in an application's review set, as described in Algorithm 2. Each application is therefore associated with a list of phrases $DL_i \in DL$ and their occurrence counts in the application's reviews. This information is required for the feature extraction process that follows.

Feature Extraction Process. The feature extraction process creates three numerical features that will be used to generate a statistical model (i.e., classifier). As a first step, we define the weight $w(r)$ of a review $r$ as the sum of the occurrences in $r$ of each phrase $DL_i \in DL$, as described by Eq. 1 [10]:

$$ w(r) \leftarrow \sum_{DL_i \in DL} \text{occurrences of } DL_i \text{ in } r \qquad (1) $$
This weighting function is used to generate the features RDL, UDL, and TDL.

RDL represents the average number of DL occurrences found in an application's reviews. The computation of this feature is performed by summing $w(r)$ over all reviews and dividing by the number of reviews (REV), as described by Eq. 2:

$$ RDL \leftarrow \frac{\sum_{r \in REV} w(r)}{|REV|} \qquad (2) $$

UDL represents the average number of DL occurrences per unique review author. The computation of this feature is performed similarly to the preceding one, except that the sum is divided by the number of unique review authors (USR), as described by Eq. 3:

$$ UDL \leftarrow \frac{\sum_{r \in REV} w(r)}{|USR|} \qquad (3) $$
TDL represents the average number of DL occurrences over a period of time. Using the date stamp available in each review, we can extract the earliest written review. The number of months since the earliest review can serve as an approximation of the number of months since the application's release date. The computation of this feature is performed by summing $w(r)$ over all reviews and dividing by the approximate number of months since release (RLS), as described by Eq. 4:

$$ TDL \leftarrow \frac{\sum_{r \in REV} w(r)}{|RLS|} \qquad (4) $$
Algorithm 2. Feature Extraction.
Input: App_Reviews, DL
1: DCS ← 0
2: foreach r ∈ App_Reviews do
3:   DCS ← DCS + w(r)
4: end
5: REV ← getReviewsNum(App_Reviews)
6: ATR ← getAuthorsNum(App_Reviews)
7: RLS ← getApproxReleaseDate(App_Reviews)
8: RDL ← DCS / REV
9: UDL ← DCS / ATR
10: TDL ← DCS / RLS
11: return RDL, UDL, TDL
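A sketch of the weighting function and of Algorithm 2 in Python, assuming each review record carries its normalized tokens, author ID, and date (the record layout and the month arithmetic are our assumptions):

```python
def w(tokens, domain_lexicon):
    """Eq. 1: count occurrences of lexicon unigrams and bigrams in one review."""
    hits = sum(1 for t in tokens if t in domain_lexicon)
    hits += sum(1 for b in zip(tokens, tokens[1:]) if b in domain_lexicon)
    return hits

def extract_features(app_reviews, domain_lexicon):
    """app_reviews: list of dicts with 'tokens', 'author' and 'date' keys."""
    dcs = sum(w(r["tokens"], domain_lexicon) for r in app_reviews)
    rev = len(app_reviews)                           # number of reviews
    atr = len({r["author"] for r in app_reviews})    # unique authors
    # Months since the earliest review, as a proxy for the release date.
    earliest = min(r["date"] for r in app_reviews)
    latest = max(r["date"] for r in app_reviews)
    rls = max(1, (latest.year - earliest.year) * 12
              + latest.month - earliest.month)
    return dcs / rev, dcs / atr, dcs / rls           # RDL, UDL, TDL
```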
3.4 Data Labeling
In order to generate the model we use supervised learning algorithms. There are several approaches for supervised classification; however, all of them require known labels (classes) for the training, testing, and validation stages. Therefore, each of the instances in the dataset must be assigned a label.

In [42], researchers tested state-of-the-art static and dynamic methods for malware detection. Their results show that static analysis methods can provide accuracy rates between 92% and 99%, while the accuracy of dynamic analysis methods ranges from 91% to 96%. For this reason, we select the static approach to serve as a gold standard for the proposed method, and label applications as malicious or benign using VirusTotal [43]. For each application's APK we obtain labels from several antivirus vendors, which provide information such as whether the given application has label B = {benign application} or label M = {malicious application}. In label M cases, most AVs provide additional information regarding the specific type of malicious application.
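Anticipating the two-AV threshold adopted below, the lookup can be sketched against VirusTotal's public v2 file-report endpoint (a hedged example; the authors used the researcher mass API, and the API key is a placeholder):

```python
import requests

VT_URL = "https://www.virustotal.com/vtapi/v2/file/report"

def label_apk(apk_sha256, api_key, threshold=2):
    """Return 'M' if at least `threshold` AV engines flag the APK, else 'B'."""
    report = requests.get(
        VT_URL, params={"apikey": api_key, "resource": apk_sha256}).json()
    return "M" if report.get("positives", 0) >= threshold else "B"
```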
Due to the non-zero false positive rate of some AVs, their results cannot be trusted when only one AV reports that an APK is malicious. Similar to related works [10,44,45], we label an application as malicious if two or more AVs report that its APK is malicious.

Supervised Learning. Given a labeled training set, we can now train a classification model (classifier). We evaluated the following induction algorithms: (1) the C4.5 decision tree learner [46]; (2) random forest [47]; and (3) logistic regression [48]. A brief explanation of these three algorithms follows.

1. C4.5 Decision Tree Learner: The algorithm for the induction of decision trees uses a greedy search technique to induce decision trees for classification.
2. Random Forest: An ensemble of 100 unpruned classification trees, induced from bootstrap samples of the training data, using random feature selection in the tree induction process. The prediction is made by aggregating (majority vote) the predictions of the ensemble.
3. Logistic Regression: This algorithm allows prediction of a discrete outcome, such as group membership, from a set of variables that may be continuous, discrete, dichotomous, or a mix of any of these. In our case, we used logistic regression with a set of continuous variables. Since logistic regression calculates the probability of class M over the probability of class B, the results of the analysis are in the form of an odds ratio.
4 Experimental Evaluation
In this section, we describe our evaluation, which includes the dataset, results, and accuracy. We start by describing the dataset used for the evaluation, including the malicious family type distribution. We also describe the motivation behind our decision to use unigrams and bigrams, rather than higher-order n-gram models, to generate the DL. Finally, we evaluate the proposed method's accuracy based on multiple AV reports, and compare the results to two baseline methods: (1) bag-of-words [49], and (2) latent Dirichlet allocation [29], which are popular textual classification methods.

4.1 Datasets
To the best of our knowledge, there is no publicly available corpus (or dataset) of malicious applications together with their end user reviews. Malicious applications whose official reviews appear in news reports are immediately removed from application stores and markets, and therefore we cannot obtain their end user reviews for in-depth analysis. In order to collect applications that are available (free), a crawler was applied to Amazon's Android application store [50] for a two-month period (October through November 2015).
Fig. 2. Histogram of 2,506 VirusTotal applications scan results [10].
Fig. 3. Malicious family distribution of 510 malicious applications [10].
Due to the large number of applications available, we randomly selected a subset of applications in order to generate the classifiers. In this paper, a single crawling session was performed, extracting a single version of each application. In total, we collected 2,506 applications' APKs along with their 128,863 user reviews, as shown in Table 1. Each application's APK was scanned in January 2016 by VirusTotal, which aggregates different antivirus products that provide online scan engines and presents a comprehensive report indicating whether a given APK is malicious, and specifying the malicious threat. In our case, the VirusTotal mass API, which is available for researchers to perform malicious file detection, was used.
Table 1. Dataset volume for evaluating the classification task [10].

Total applications   Total reviews   Malicious applications   Benign applications
2,506                128,863         336                      2,170
We performed a scan of 2,506 applications and summarized the scan reports in the histogram presented in Fig. 2. In this figure, positive reports aggregate the number of AVs that labeled each APK as malicious. The majority (1,996 applications) were labeled as non-malicious by all of the AVs. As can be seen in Fig. 2, different threshold values can be used for evaluation. Similar to other works [10,44,45], we classify an application as malicious if two or more AVs report that its APK is malicious. Note that, due to the diversity and accuracy of the AVs, scanning by different AVs can result in a situation in which a single APK is associated with different malicious families. Thus, each application can belong to several different malicious families, as shown in Fig. 3. As seen in the Venn diagram in Fig. 3, our dataset includes the following types of malicious threats: Trojan, adware, virus, spyware, riskware, and other less well-known malicious threats.

4.2 Settings
The proposed method computes three numerical features for each application: RDL, UDL, and TDL. These feature values are highly influenced by the preset phrases we detect in an application's user feedback. The preset phrases (DL), created by Algorithm 1, can be extracted from different corpora (DomainCorpus) with different thresholds (p). We experimented with two corpora as DomainCorpus: (1) security books, and (2) reviews. Selecting security books as DomainCorpus results in a DL which mainly contains cyber security domain phrases, whereas selecting reviews as DomainCorpus results in a DL which contains user-written phrases. The two settings help measure the contribution of using external cyber security domain phrases over internal user-written phrases. In addition, we experimented with different p values, where p is the percentage of most frequent phrases to be included in the DL, as mentioned in Sect. 1. The evaluation ranges from p = 1 up to p = 10 (p ∈ {1, 2, 3, 5, 10}), i.e., the top p% of the most frequent phrases in DomainCorpus. This range was selected based on our previous study [10], which showed that representing the domain with p = 10 resulted in the best performance over p ∈ {10, 20, 30, 40, 50}. The evaluation was performed for different values of DomainCorpus and p simultaneously; the results are presented in Subsect. 4.4.
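The resulting evaluation grid can be expressed as a nested loop over the two corpora and the five thresholds, reusing the hypothetical helper functions sketched in Sect. 3:

```python
from itertools import product

def run_grid(corpora, apps, p_values=(1, 2, 3, 5, 10)):
    """corpora: dict mapping a name to a token-list corpus;
    apps: list of per-application review sets."""
    features = {}
    for (name, corpus), p in product(corpora.items(), p_values):
        lexicon = build_domain_lexicon(corpus, p)
        features[(name, p)] = [extract_features(reviews, lexicon)
                               for reviews in apps]
    return features  # one feature matrix per (DomainCorpus, p) setting
```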
4.3 Classifiers
We used supervised machine learning methods for the classification of applications into malicious and benign classes. For this purpose, we used the Waikato Environment for Knowledge Analysis (WEKA) [51]. As indicated above, the evaluation was performed on different classification models based on the following approaches: (1) decision tree, (2) random forest, and (3) logistic regression.
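Since the experiments were run in WEKA, the following scikit-learn analogues are only an approximation (DecisionTreeClassifier stands in for C4.5, which WEKA implements as J48):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

CLASSIFIERS = {
    "C4.5 (approx.)": DecisionTreeClassifier(),       # WEKA J48 analogue
    "Random forest": RandomForestClassifier(n_estimators=100),
    "Logistic regression": LogisticRegression(),
}
```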
4.4 Results
To evaluate the performance of machine learning classifiers, k-fold cross-validation is usually used [52]. Therefore, for each classifier we apply k-fold cross-validation [53] with k = 10, i.e., the dataset is partitioned ten times into ten different sets; each time we use 90% of the data for training and 10% for testing. Our domain has an imbalanced class distribution, namely there are many more genuine applications than malicious ones, due to the nature of application stores, which contain a small percentage of malicious applications. Thus, in order to evaluate the performance of each classifier, we measured the true positive rate (TPR):

$$ TPR = \frac{TP}{TP + FN} \qquad (5) $$

where TP is the number of malicious applications correctly classified (true positives), and FN is the number of malicious applications misclassified as benign (false negatives). In addition, we measured the false positive rate (FPR):

$$ FPR = \frac{FP}{FP + TN} \qquad (6) $$

where FP is the number of benign applications incorrectly detected as malicious, and TN is the number of benign applications correctly classified. We also measured the accuracy (the number of successfully classified instances divided by the total number of instances in the dataset):

$$ Accuracy = \frac{TP + TN}{TP + FP + FN + TN} \qquad (7) $$
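Under 10-fold cross-validation, these measures can be computed as in the following sketch (scikit-learn; hard predictions feed TPR, FPR, and accuracy, while probability scores feed the AUC defined below):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import StratifiedKFold

def evaluate(clf, X, y, folds=10):
    """y uses 1 for malicious and 0 for benign."""
    X, y = np.asarray(X), np.asarray(y)
    tn = fp = fn = tp = 0
    scores, truth = [], []
    for train, test in StratifiedKFold(n_splits=folds).split(X, y):
        clf.fit(X[train], y[train])
        pred = clf.predict(X[test])
        a, b, c, d = confusion_matrix(y[test], pred, labels=[0, 1]).ravel()
        tn, fp, fn, tp = tn + a, fp + b, fn + c, tp + d
        scores.extend(clf.predict_proba(X[test])[:, 1])
        truth.extend(y[test])
    tpr = tp / (tp + fn)                              # Eq. 5
    fpr = fp / (fp + tn)                              # Eq. 6
    acc = (tp + tn) / (tp + fp + fn + tn)             # Eq. 7
    return tpr, fpr, acc, roc_auc_score(truth, scores)
```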
Moreover, we measure the area under the ROC curve (AUC), which establishes the relation between false negatives and false positives [54]. The ROC curve is obtained by plotting the TPR against the FPR. The AUC measure is independent of the prior probabilities of the class distributions, and therefore is not overwhelmed by the majority class instances. [55] indicates that the AUC is more robust than other measures and is not influenced by class imbalance or sampling bias.

Table 2 summarizes the results of the evaluation, performed over three parameter settings: (1) Domain Corpus, where we create the domain lexicon from two corpora: cyber security books and reviews; (2) p, different size thresholds of the lexicon; and (3) Classifier, the three classification algorithms C4.5, Random Forest, and Logistic Regression, as mentioned in Subsect. 4.3.

Table 2. Evaluation results of different classifiers.

Domain corpus          p   Classifier            TPR    FPR    Accuracy  AUC
Cyber security books   1   C4.5                  0.116  0.012  0.87      0.776
                           Random forest         0.348  0.059  0.861     0.782
                           Logistic regression   0.173  0.024  0.868     0.834
                       2   C4.5                  0.074  0.006  0.87      0.718
                           Random forest         0.247  0.057  0.849     0.726
                           Logistic regression   0.119  0.016  0.868     0.78
                       3   C4.5                  0.024  0.007  0.862     0.648
                           Random forest         0.208  0.060  0.841     0.69
                           Logistic regression   0.101  0.012  0.869     0.757
                       5   C4.5                  0      0      0.867     0.496
                           Random forest         0.125  0.064  0.826     0.666
                           Logistic regression   0.048  0.009  0.864     0.729
                       10  C4.5                  0      0      0.867     0.496
                           Random forest         0.11   0.060  0.828     0.615
                           Logistic regression   0.03   0.005  0.865     0.687
Reviews                1   C4.5                  0      0      0.867     0.496
                           Random forest         0.107  0.045  0.84      0.577
                           Logistic regression   0      0      0.867     0.496
                       2   C4.5                  0      0      0.867     0.496
                           Random forest         0.104  0.040  0.844     0.597
                           Logistic regression   0      0      0.867     0.496
                       3   C4.5                  0      0      0.867     0.496
                           Random forest         0.098  0.037  0.846     0.593
                           Logistic regression   0      0      0.867     0.496
                       5   C4.5                  0      0      0.867     0.496
                           Random forest         0.083  0.038  0.844     0.538
                           Logistic regression   0      0      0.867     0.496
                       10  C4.5                  0      0      0.867     0.496
                           Random forest         0.074  0.041  0.84      0.541
                           Logistic regression   0      0      0.867     0.496

The AUC of our method drops when using the most frequent review phrases as the lexicon (Domain Corpus = Reviews), and improves significantly when the cyber security domain corpus is used (Domain Corpus = Cyber Security Books).
Moreover, representing the lexicon with the 1% most frequent phrases (p = 1) shows better results than higher percentages of phrases (p ∈ {2, 3, 5, 10}). As can be seen from Table 2, the most effective settings are: (1) Domain Corpus = Cyber Security Books, (2) p = 1, and (3) Classifier = Random Forest. Using these settings, the method achieved a TPR of 34.8%, an FPR of 5.9%, an accuracy of 86.1%, and an AUC of 78.2%. The TPR results show that it is not feasible to use our method as a sole detection method, and further research should be performed in this respect. However, a TPR of 34.8% shows that domain-specific phrases can be useful for classifying applications based on user feedback.

4.5 Baselines
Popular textual classification methods such as bag-of-words [49] and latent Dirichlet allocation (LDA) [29] were used to evaluate the proposed method. Table 3 presents the results.

The BOW method's hypothesis is that the frequency of words in a document tends to indicate the relevance of the document to other documents. If documents have similar column vectors in a term-document matrix, then they tend to have similar meanings. The hypothesis expresses the belief that a column vector in a term-document matrix captures (to some degree) an aspect of the meaning of the corresponding document, i.e., what the document is about. This method generates a mathematical model of all words (perhaps with the exception of a list of high-frequency noisy words), without any additional knowledge or interpretation of linguistic patterns and properties.

LDA can also be cast as a language-modeling technique. The basic idea is to infer language models that correspond to unobserved factors in the data, with the hope that the factors that are learned represent topics. However, as [56] shows, this method does not work well on classifying short text documents such as tweets or customers' reviews.

Table 3. Results of different baseline methods.

Method     TPR   FPR   Accuracy  AUC
Proposed   0.34  0.05  0.86      0.78
BOW        0.26  0.06  0.84      0.61
LDA        0.15  0.03  0.81      0.62
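For comparison, the two baselines can be reproduced in outline with scikit-learn's CountVectorizer (bag-of-words term frequencies) and LatentDirichletAllocation (per-application topic proportions); the topic count and other hyperparameters here are our guesses, not the paper's:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def baseline_features(review_texts_per_app, method="bow", n_topics=50):
    # One document per application: the concatenation of all its reviews.
    docs = [" ".join(texts) for texts in review_texts_per_app]
    counts = CountVectorizer(stop_words="english").fit_transform(docs)
    if method == "bow":
        return counts                                # term-frequency vectors
    lda = LatentDirichletAllocation(n_components=n_topics)
    return lda.fit_transform(counts)                 # topic distribution per app
```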
5 Conclusions
This paper presents a method for generating effective classifiers for the detection of malicious applications in application marketplaces. Classification is based on a set of features that are extracted from customers’ textual reviews. We achieved this by defining a set of significant features and extracting them from a real application store dataset.
The proposed method uses a static approach as a gold standard, by labeling applications as malicious or benign using VirusTotal, as mentioned in Subsect. 3.4. In terms of space complexity, traditional methods analyze application files (a few dozen megabytes), while the proposed method analyzes text files (a few kilobytes). In our dataset, an average application file size is 18 MB, while an average application review text file size is 7.8 KB, a 99.9% decrease in space resources. The time complexity for classifying a single application in the proposed method is O(w), where w is the number of words in all of the application's reviews.

The contribution of this paper is that we introduced an approach for malware detection that uses text analysis to analyze feedback generated by users. The proposed method is capable of identifying suspicious applications which can be further analyzed by static and dynamic approaches; thus the method can play a role in executing more complex methods on large datasets (e.g., application stores) by reducing time and space consumption. The major strength of the proposed approach is its use of information that is available from application stores. Moreover, the proposed approach is robust to code transformations and varied mobile environments and does not require root permissions. However, the detection is based solely on users' feedback; thus a limitation of the proposed approach is that it cannot detect a malicious app that has no feedback.

The generated model provides better results than several baseline methods for textual classification such as BOW and LDA. Such results demonstrate the proposed method's ability to detect malicious applications when using cyber security domain information, such as the domain lexicon, rather than general linguistic information such as word frequency. Evaluation of the classifier was performed with several machine learning algorithms. The evaluation demonstrates that our model performs well in terms of accuracy and AUC measures, as presented in Sect. 4. The best results were obtained using the logistic regression model, which achieved accuracy rates of 86% and an AUC of 83%. However, the best TPR results (TPR = 34%) were achieved by the random forest classifier.

Our research considers malicious application detection by analyzing end users' textual reviews. Thus, a possible future research direction includes the analysis of other types of user-related features such as user reputation, diversity of reviews, etc. Examples of other feature types to consider are user reliability and professional domain knowledge. Another future direction is the detection of unknown malware based on the reported behavior of an application from user reviews.
References

1. Statista: Number of smartphones sold to end users worldwide from 2007 to 2016 (2017). https://www.statista.com/statistics/263437/global-smartphone-sales-to-end-users-since-2007/. Accessed Aug 2017
2. Statista: Number of available apps in the Apple App Store from July 2008 to January 2017 (2017). https://www.statista.com/statistics/263795/number-of-available-apps-in-the-apple-app-store/. Accessed Aug 2017
3. Statista: Number of available applications in the Google Play Store from December 2009 to June 2017 (2017). https://www.statista.com/statistics/266210/number-of-available-applications-in-the-google-play-store/. Accessed Aug 2017
4. Statista: Number of available apps in the Amazon Appstore from March 2011 to April 2016 (2017). https://www.statista.com/statistics/307330/number-of-available-apps-in-the-amazon-appstore/. Accessed Aug 2017
5. Check Point: Viking horde: a new type of android malware on Google Play (2017). https://blog.checkpoint.com/2016/05/09/viking-horde-a-new-type-of-android-malware-on-google-play/. Accessed Aug 2017
6. Check Point: Dresscode android malware discovered on Google Play (2017). https://blog.checkpoint.com/2016/08/31/dresscode-android-malware-discovered-on-google-play/. Accessed Aug 2017
7. Wang, K., Stolfo, S.J.: Anomalous payload-based network intrusion detection. In: Jonsson, E., Valdes, A., Almgren, M. (eds.) RAID 2004. LNCS, vol. 3224, pp. 203–222. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-30143-1_11
8. Blair-Goldensohn, S., Hannan, K., McDonald, R., Neylon, T., Reis, G.A., Reynar, J.: Building a sentiment summarizer for local service reviews. In: WWW Workshop on NLP in the Information Explosion Era, vol. 14, pp. 339–348 (2008)
9. Portier, K., Greer, G.E., Rokach, L., Ofek, N., Wang, Y., Biyani, P., Yu, M., Banerjee, S., Zhao, K., Mitra, P., et al.: Understanding topics and sentiment in an online cancer survivor community. JNCI Monogr. 47, 195–198 (2013)
10. Hadad, T., Sidik, B., Ofek, N., Puzis, R., Rokach, L.: User feedback analysis for mobile malware detection. In: ICISSP, pp. 83–94 (2017)
11. Xie, L., Zhang, X., Seifert, J.P., Zhu, S.: pBMDS: a behavior-based malware detection system for cellphone devices. In: Proceedings of the Third ACM Conference on Wireless Network Security, pp. 37–48. ACM (2010)
12. Burguera, I., Zurutuza, U., Nadjm-Tehrani, S.: Crowdroid: behavior-based malware detection system for Android. In: Proceedings of the 1st ACM Workshop on Security and Privacy in Smartphones and Mobile Devices, pp. 15–26. ACM (2011)
13. Shabtai, A., Tenenboim-Chekina, L., Mimran, D., Rokach, L., Shapira, B., Elovici, Y.: Mobile malware detection through analysis of deviations in application network behavior. Comput. Secur. 43, 1–18 (2014)
14. Aung, Z., Zaw, W.: Permission-based Android malware detection. Int. J. Sci. Technol. Res. 2, 228–234 (2013)
15. Zhang, Y., Yang, M., Xu, B., Yang, Z., Gu, G., Ning, P., Wang, X.S., Zang, B.: Vetting undesirable behaviors in Android apps with permission use analysis. In: Proceedings of the 2013 ACM SIGSAC Conference on Computer and Communications Security, pp. 611–622. ACM (2013)
16. Rastogi, V., Chen, Y., Jiang, X.: Droidchameleon: evaluating android anti-malware against transformation attacks. In: Proceedings of the 8th ACM SIGSAC Symposium on Information, Computer and Communications Security, pp. 329–334. ACM (2013)
17. Yang, Z., Yang, M., Zhang, Y., Gu, G., Ning, P., Wang, X.S.: Appintent: analyzing sensitive data transmission in Android for privacy leakage detection. In: Proceedings of the 2013 ACM SIGSAC Conference on Computer and Communications Security, pp. 1043–1054. ACM (2013)
18. Zheng, M., Sun, M., Lui, J.C.: Droid analytics: a signature based analytic system to collect, extract, analyze and associate Android malware. In: 2013 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, pp. 163–171. IEEE (2013)
18
T. Hadad et al.
19. Moser, A., Kruegel, C., Kirda, E.: Limits of static analysis for malware detection. In: 2007 Twenty-Third Annual Computer Security Applications Conference, ACSAC 2007, pp. 421–430. IEEE (2007) 20. Ofek, N., Rokach, L., Mitra, P.: Methodology for connecting nouns to their modifying adjectives. In: Gelbukh, A. (ed.) CICLing 2014. LNCS, vol. 8403, pp. 271–284. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-642-54906-9 22 21. Katz, G., Ofek, N., Shapira, B.: Consent: context-based sentiment analysis. Knowl.Based Syst. 84, 162–178 (2015) 22. Hu, M., Liu, B.: Mining and summarizing customer reviews. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 168–177. ACM (2004) 23. Ofek, N., Shabtai, A.: Dynamic latent expertise mining in social networks. IEEE Internet Comput. 18, 20–27 (2014) 24. Choi, Y., Cardie, C.: Learning with compositional semantics as structural inference for subsentential sentiment analysis. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 793–801. Association for Computational Linguistics (2008) 25. Balahur, A., Steinberger, R., Kabadjov, M., Zavarella, V., Van Der Goot, E., Halkia, M., Pouliquen, B., Belyaeva, J.: Sentiment analysis in the news. arXiv preprint arXiv:1309.6202 (2013) 26. Ye, Q., Zhang, Z., Law, R.: Sentiment classification of online reviews to travel destinations by supervised machine learning approaches. Expert Syst. Appl. 36, 6527–6535 (2009) 27. Mullen, T., Collier, N.: Sentiment analysis using support vector machines with diverse information sources. In: EMNLP, vol. 4, pp. 412–418 (2004) 28. Ofek, N., Katz, G., Shapira, B., Bar-Zev, Y.: Sentiment analysis in transcribed utterances. In: Cao, T., Lim, E.-P., Zhou, Z.-H., Ho, T.-B., Cheung, D., Motoda, H. (eds.) PAKDD 2015. LNCS (LNAI), vol. 9078, pp. 27–38. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-18032-8 3 29. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003) 30. Baron, N.S.: Language of the internet. In: The Stanford Handbook for Language Engineers, pp. 59–127 (2003) 31. Onix: Onix text retrieval toolkit (2016). http://www.lextek.com/manuals/onix/ stopwords1.html. Accessed Apr 2016 32. Twitter: Twitter dictionary: a guide to understanding twitter lingo (2016). http:// www.webopedia.com/quick ref/Twitter Dictionary Guide.asp. Accessed Apr 2016 33. Porter, M.F.: An algorithm for suffix stripping. Program 14, 130–137 (1980) 34. Dunham, K.: Mobile Malware Attacks and Defense. Syngress, Maryland Heights (2008) 35. Syngress, E.O., O’Farrell, N.: Hackproofing Your Wireless Network (2002) 36. Shostack, A.: Threat Modeling: Designing for Ssecurity. Wiley, Hoboken (2014) 37. Nayak, U., Rao, U.H.: The InfoSec Handbook: An Introduction to Information Security. Apress, New York City (2014) 38. Bosworth, S., Kabay, M.E.: Computer Security Handbook. Wiley, Hoboken (2002) 39. Pang, B., Lee, L., Vaithyanathan, S.: Thumbs up? Sentiment classification using machine learning techniques. In: Proceedings of the ACL-2002 Conference on Empirical Methods in Natural Language Processing, vol. 10, pp. 79–86. Association for Computational Linguistics (2002)
Application Marketplace Malware Detection by User Feedback Analysis
19
40. Dave, K., Lawrence, S., Pennock, D.M.: Mining the peanut gallery: opinion extraction and semantic classification of product reviews. In: Proceedings of the 12th International Conference on World Wide Web, pp. 519–528. ACM (2003) 41. De Marneffe, M.C., MacCartney, B., Manning, C.D., et al.: Generating typed dependency parses from phrase structure parses. In: Proceedings of LREC, vol. 6, pp. 449–454 (2006) 42. Ranveer, S., Hiray, S.: Comparative analysis of feature extraction methods of malware detection. Int. J. Comput. Appl. 120, 1–7 (2015) 43. VirusTotal: Virustotal, a free online service that analyzes files and urls enabling the identification of viruses, worms, trojans and other kinds of malicious content. https://www.virustotal.com/. Accessed Aug 2017 ˇ 44. Srndic, N., Laskov, P.: Detection of malicious pdf files based on hierarchical document structure. In: Proceedings of the 20th Annual Network & Distributed System Security Symposium (2013) 45. Nissim, N., Cohen, A., Moskovitch, R., Shabtai, A., Edry, M., Bar-Ad, O., Elovici, Y.: ALPD: Active learning framework for enhancing the detection of malicious pdf files. In: 2014 IEEE Joint Conference on Intelligence and Security Informatics Conference (JISIC), pp. 91–98. IEEE (2014) 46. Quinlan, J.R.: C4. 5: Programming for Machine Learning, p. 38. Morgan Kauffmann, Burlington (1993) 47. Ho, T.K.: The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. Mach. Intell. 20, 832–844 (1998) 48. Walker, S.H., Duncan, D.B.: Estimation of the probability of an event as a function of several independent variables. Biometrika 54, 167–179 (1967) 49. Harris, Z.S.: Distributional structure. Word 10, 146–162 (1954) 50. Amazon: Amazon appstore (2017). http://www.amazon.com/mobile-apps/b? node=2350149011. Accessed Aug 2017 51. WEKA. http://www.cs.waikato.ac.nz/ml/weka/. Accessed Aug 2017 52. Bishop, C.M.: Pattern recognition. Mach. Learn. 128, 1–58 (2006) 53. Kohavi, R., et al.: A study of cross-validation and bootstrap for accuracy estimation and model selection. In: IJCAI, vol. 14, pp. 1137–1145 (1995) 54. Singh, Y., Kaur, A., Malhotra, R.: Comparative analysis of regression and machine learning methods for predicting fault proneness models. Int. J. Comput. Appl. Technol. 35, 183–193 (2009) 55. Oommen, T., Baise, L.G., Vogel, R.M.: Sampling bias and class imbalance in maximum-likelihood logistic regression. Math. Geosci. 43, 99–120 (2011) 56. Hong, L., Davison, B.D.: Empirical study of topic modeling in Twitter. In: Proceedings of the First Workshop on Social Media Analytics, pp. 80–88. ACM (2010)
A System for Detecting Targeted Cyber-Attacks Using Attack Patterns

Ian Herwono and Fadi Ali El-Moussa

Security Futures Practice, Research and Innovation, BT, Ipswich IP3 5RE, UK
{ian.herwono,fadiali.el-moussa}@bt.com
Abstract. Detecting multi-stage cyber-attacks remains a challenge for any security analyst working in large corporate environments. Conventional security solutions such as intrusion detection systems tend to report huge numbers of alerts that still need to be examined and cross-checked with other available data in order to eliminate false positives and identify any legitimate attacks. Attack patterns can be used as a means to describe causal relationships between the events detected at different stages of an attack. In this paper, we introduce an agent-based system that collects relevant event data from various sources in the network, and then correlates the events according to predefined attack patterns. The system allows security analysts to formulate the attack patterns based on their own knowledge and experience, and to test them on available datasets. We present an example attack pattern for discovering suspicious activities in the network following a potential brute force attack on one of the servers. We discuss the results produced by our prototype implementation and show how a security analyst can drill down further into the data to identify the victim and obtain information about the attack methods.

Keywords: Cyber security · Attack patterns · Knowledge sharing · Visualization
1 Introduction

Cyber-attacks have become more sophisticated and difficult to detect, opponents more persistent and determined, and these trends will continue. As a result, organizations must shift their posture from relying on preventive defense to one of early detection of attacks in progress and timely initiation of an appropriate response to mitigate the impact. Although attacks vary considerably at a detailed level, experienced security analysts are able to discern commonly repeated sequences of activity or events associated with different attack families. These regularities can be used to characterize the type of attack and predict what is likely to happen next. Such regularities may arise from habit and preference, but in many cases they are dictated by causal relationships between the steps or stages in the attack – A needs to be done to enable B to happen, and so on – and the resources and exploit mechanisms available to the attacker. The ability to cross-correlate alerts or events from various sources in the network is key to detecting sophisticated multi-stage attacks at an early stage, which would allow a reasonable time to stop the attack or mitigate the damage.
In this paper we propose to make use of attack patterns to embody the knowledge of skilled security analysts in identifying regularities, causal relationships and correlated cyber events. General patterns such as the Cyber Kill Chain [11] are often followed, but there can be significant variation, leading to uncertainties in both recognition and prediction. Each security analyst may encounter specific types of attacks or suspicious activities, such as port scanning, in many different ways. Different pieces of attack evidence may be collected from various sources in a corporate network environment, e.g. firewall, web proxy, name server, network switches, etc. Sharing such knowledge and experience with other analysts thus becomes ever more important, as cyber criminals will always try to develop new methods and strategies to break through any cyber security defenses. Collective intelligence and collaborative analytics of security data will be crucial for any organization or enterprise to remain one step ahead of the cyber attackers. However, there are still legal and technical barriers to achieving this fully. Sharing cyber threat information between organizations without revealing sensitive and private information remains challenging and involves tedious, resource-intensive processes such as data encryption, anonymization, etc. The C3ISP project [6] is currently addressing this issue by defining a framework for collaborative and confidential cyber threat information sharing and analysis. Our goal is to use such a collaboration framework to allow security analysts to share attack patterns within the same enterprise or between multiple enterprises, and to use those patterns to check whether attacks have happened without having to reveal any confidential data. We designed a pattern recognition system to lay the groundwork for such a collaborative analytics capability. This is an extended and revised version of [12] which provides more details about the system's analytics engine as well as a discussion of test results.

The remainder of this paper is organized as follows: Sect. 2 presents related work. In Sect. 3, we introduce the architecture of our agent-based pattern recognition system. We outline the process for specifying attack patterns in Sect. 4. In Sect. 5, we show how the system's analytics engine works using an attack pattern, followed by a discussion of its test results. We conclude and discuss future work in Sect. 6.
2 Related Work

Multi-stage cyber-attacks are conducted in multiple steps or stages, using several attack paths, to achieve their ultimate objective, such as denial of service or exfiltration of sensitive and private data. [10] discussed different types of Internet-based attacks and considered the role of network-level attribution in preventing those attacks, as well as potential solutions to the multi-stage attack problem. [2] examined statistical modelling techniques for alert clustering and correlation with the objective of identifying attack steps and predicting the expected sequence of events. The authors proposed a correlation framework to achieve high quality multi-stage attack recognition in real time. Over the years, the term Advanced Persistent Threat (APT) has been commonly associated with such multi-stage attacks. [14] considered APTs to be a subset of targeted attacks which might require a greater degree of sophistication against a high-value target. [5] developed a framework that models multi-stage attacks in a way that describes the
attack methods and the anticipated effects of attacks. The authors adopted the Intrusion Kill Chain attack model [11] as the foundation of their framework. Other related works fall into the category of knowledge-based attack modelling, where intrusion steps are formally specified by experts and added to scenarios to build the model and to discover the logical connections between alerts. [9] developed the Correlated Attack Modelling Language (CAML) using a modular approach; each module represents an inference step and can be linked to others to detect multi-step scenarios. [4] introduced the concept of attack patterns as a mechanism to capture and communicate the attacker's perspective. The author described a typical process for how attack patterns could actually be generated and used during different phases of software development. The work was related to the Common Attack Pattern Enumeration and Classification (CAPEC) initiative of the U.S. Department of Homeland Security (DHS), in their effort to provide a publicly available catalogue of attack patterns, along with a comprehensive schema and classification taxonomy, created to assist in the building of secure software [7]. [13] developed a data mining framework that employs text mining techniques to dynamically relate the information between security-related events and CAPEC attack patterns. It aims to reduce analysis time, increase the quality of attack reports, and automatically build correlated scenarios. [1] proposed a combination of statistical and knowledge-based models to achieve a higher detection rate with minimal false positives. The authors used a knowledge-based model with vulnerability and extensional conditions to create manageable and meaningful attack graphs. Their system was evaluated on detecting Botnet multi-stage attacks. [15] also used attack graphs for correlating intrusion alerts; empirical results showed that their method could complete correlation tasks faster than an Intrusion Detection System (IDS) would be able to report alerts, making it usable for monitoring the progress of intrusions in real time.

Our system mainly employs the knowledge-based model approach. Its knowledge base consists of a repository of attack patterns captured from skilled security analysts, along with relevant datasets such as malware alerts or web proxy logs. We consider an attack pattern to be a structure for correlating security and network events, which may have been logged at different times and locations, to detect the early stages of advanced attacks. Such an attack pattern should not be mistaken for an attack graph, which is a structure representing all possible sequences of exploits that an intruder can carry out to attack computer networks [3]. The method to formally describe the attack pattern is specific to our prototype implementation, but the CAPEC schema may well be used in the future. One of our main design principles is to allow easy integration of multiple data sources and to provide the necessary data source adapters. As the target users of the system, security analysts can thus focus their efforts on defining useful attack patterns and validating them against available datasets, without having to deal with the technicalities of querying and examining the data from different systems and databases. Statistical modelling techniques for event correlation are currently not employed in our system but are a subject for further study.
3 System Architecture

Our pattern recognition system is a knowledge-driven software solution with a knowledge base consisting of predefined attack patterns. Its system architecture is depicted in Fig. 1, and the associated components are described in the following sections.
Fig. 1. System architecture. (The diagram shows the main components: Authentication & Authorization, Attack Pattern Manager, Assets Manager, System Database, Data Sources, Data Source Agent, Agent Manager, Analytics Engine, and Visualization.)
3.1 Authentication and Authorization
The authentication and authorization component ensures that only authorized users can access the system, and that certain sets of data (e.g. attack patterns, alert logs, etc.) or software features are available only to particular groups of users, e.g. administrators, analysts, etc. The user authentication is currently implemented using a simple username and password method; a single sign-on service such as the Central Authentication Service (CAS, https://wiki.jasig.org/display/CAS/Home) may be used for integrating the system with an existing enterprise platform.

3.2 Attack Pattern Manager
The attack pattern manager is responsible for the entire lifecycle of each attack pattern, from its creation up to its removal. It provides a graphical user interface for modelling multi-stage attacks as flow diagrams or directed (acyclic) graphs, thus giving the user a better view of how the attack stages are connected with each other and which different paths they may take. It also provides control over the usage of existing attack patterns, either for monitoring live data feeds or for analyzing historical events, e.g. to search for evidence during forensic analysis.

3.3 Assets Manager
The assets manager provides simple management of (critical) network assets, such as public web servers or customer database servers, which can later be associated with any of the attack patterns to focus their search scope. It provides grouping of assets that can be identified by their IP address or network sub-domain using Classless Inter-Domain Routing (CIDR) notation.
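As an illustration of how such CIDR-based grouping can be expressed, the following minimal sketch uses Python's standard ipaddress module; the group names and address ranges are invented for illustration and are not taken from the actual system.

import ipaddress

# Illustrative asset groups: entries are single IPs or CIDR sub-domains.
ASSET_GROUPS = {
    "public-web-servers": ["203.0.113.10", "203.0.113.11"],
    "customer-db-servers": ["10.20.0.0/24"],
}

def asset_group_of(ip):
    """Return the name of the first asset group containing `ip`, or None."""
    addr = ipaddress.ip_address(ip)
    for name, entries in ASSET_GROUPS.items():
        for entry in entries:
            # A single IP such as "203.0.113.10" parses as a /32 network.
            if addr in ipaddress.ip_network(entry, strict=False):
                return name
    return None

print(asset_group_of("10.20.0.42"))  # -> customer-db-servers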
3.4 System Database
The system database is usually a relational database that is mainly used to persistently store the attack patterns, along with other application data such as details of available external data sources, authentication credentials, and critical assets.

3.5 Data Sources
Data sources are external data stores that collate structured events or log data generated by various systems in the network, e.g. firewall, IDS, name server, web proxy, etc. We assume that any necessary pre-processing and enrichment of the raw data, such as parsing of event attributes or IP address lookup, has already been carried out prior to storing the data. Two types of data sources are currently supported:

• conventional SQL database management systems such as Oracle, MySQL, or PostgreSQL, and
• Elasticsearch (https://www.elastic.co/products/elasticsearch) systems.

Setting details for a new data source, such as the database server/cluster details and credentials, table names and attribute mappings (e.g. to identify which table columns contain the source and destination IP information), can be added at runtime via the admin interface. An Elasticsearch data source may typically contain different types of event or log data, as it is document-oriented and does not use a fixed schema, as opposed to a conventional SQL database.

3.6 Data Source Agents
Data source agents are centrally-managed software agents that query and fetch data from external data sources to determine the quantifiable measures defined for specific cyber events, e.g. the number of failed login attempts within a five-minute time frame, or the number of detected malware in the last 24 h. The relevant query and filter parameters are specified by the user during the attack modelling exercise. In the case of an Elasticsearch data source, much of the required filtering and aggregation work is taken over by the Elasticsearch analytics engine. Data source agents are implemented in a modular fashion, such that they can be adapted and optimized for different types of data source, and the integration of new data source types, such as NoSQL databases, can be done without major changes to the existing software code.

3.7 Agent Manager
The agent manager is the single point of contact within the system for instantiating and parameterizing different types of data source agents. Its service is consumed by the analytics engine. The instantiated software agents will later perform their tasks independently of each other and report the results back to the corresponding process entity of the analytics engine.

3.8 Analytics Engine
The analytics engine is responsible for correlating the cyber events, which have been gathered by the data source agents from the same or different data sources over short or long periods of time, in order to determine whether there are any (suspicious) activities in the network that match particular attack patterns. The events can be observed either from live data feeds or from historical datasets. Historical data is normally used to test and validate attack patterns, or to carry out forensic analysis. Such validation allows security analysts to refine the patterns and readjust their measures or metrics in order to increase the detection accuracy. We believe that there are a number of ways to implement the analytics engine, depending on various factors such as scalability (e.g. the volume of data available and the size of the network) and the underlying mathematical models (e.g. Bayesian belief networks). Our current implementation is based on a simple comparison of the measurement values reported by the data source agents with the threshold values specified for the corresponding stage in the attack pattern. The engine will progress to monitor the subsequent attack stage once the threshold condition is satisfied. More details about the analytics engine are provided in Sect. 5.
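To make this decision logic concrete, the sketch below shows one way the threshold comparison and stage progression could be expressed; all names are ours, and the code is an illustration of the described behaviour, not the prototype's actual implementation.

from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Stage:
    name: str
    threshold: float
    successors: List["Stage"] = field(default_factory=list)

def on_agent_report(stage: Stage, measurement: float, context: dict,
                    spawn_child: Callable[[Stage, dict], None]) -> None:
    """Handle a data source agent's report for one attack stage.

    `context` carries event attributes used for correlation (e.g. the list
    of destination IPs); `spawn_child` creates a child process entity that
    starts observing a subsequent stage.
    """
    if measurement <= stage.threshold:
        return  # condition not met; the agent keeps observing periodically
    for nxt in stage.successors:
        spawn_child(nxt, context)  # one child per possible attack path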
3.9 Visualization
The visualization component offers graphical views of the attack patterns, their monitoring status, and results. An external visual analytics tool can be loosely integrated into the system’s user interface in order to support security analysts in their further investigation of potential threats.
4 Attack Modelling

4.1 Attack Pattern
Each attack pattern is essentially a plan template embodying an attack strategy, typically consisting of multiple steps or stages. Associated with each stage, and also with the overall pattern, are observable events and conditions that would normally occur during, before and after execution of the plan. The system's attack pattern repository will be populated based on security analysts' knowledge of historic attack cycles. For example, a Distributed Denial of Service (DDoS) campaign often follows a pattern: due diligence, intelligence gathering, vulnerability scanning, defacement and DDoS. If the events associated with due diligence, intelligence gathering, and vulnerability scanning have been observed, then we can predict that defacement and DDoS will probably follow after intervals similar to those observed previously. Knowledge and insights about common attack techniques and strategies may also be obtained from publicly available catalogues such as the Common Attack Pattern Enumeration and Classification (CAPEC) [7].

Figure 2 illustrates an attack pattern derived from the CAPEC catalogue, i.e. Exploitation of Trusted Credentials (CAPEC-21), for detecting unauthorized system access following exploitation of web session variables and other trusted credentials. It consists of the following five (detection) stages:

1. Web pages spidering: Indications that available web pages of an organization are being spidered and session IDs sampled. The average page request frequency can be monitored to detect such activity.
2. Increased invalid sessions: Unusual amounts of invalid sessions are logged by the web server due to many anonymous connections.
3. Increased invalid connections: Unusual amounts of invalid connections or requests from unauthorized hosts are logged by the web server.
4. Traffic from unexpected sources: A high number of users or systems connecting from unexpected sources is detected by the network IDS.
5. Requests for unexpected actions: A high number of users or systems requesting or performing unexpected actions is logged by the web server.
Fig. 2. Attack pattern for detecting unauthorized system access. (The diagram depicts the five stages: web page spidering, increased invalid sessions, increased invalid connections, traffic from unexpected sources, and requests for unexpected actions.)
4.2 Quantifiable Measure
When formulating an attack pattern, one crucial task is to specify a quantifiable measure or indicator for the relevant events or activities at each stage. The user needs to select the data source from which such a measure could be derived, as well as the parameters to be used by the data source agent for querying and aggregating the event data. Figure 3 shows an example setup of a data source agent for detecting possible brute force attack activities through observation of the number of failed login attempts on a File Transfer Protocol (FTP) server. The indicated "Snort MACCDC2012" data source represents an Elasticsearch data source that contains Snort (https://www.snort.org) IDS alert logs generated from the 2012 Mid-Atlantic Collegiate Cyber Defense Competition (MACCDC 2012) dataset [8]. Essentially, the data source agent needs to count the total number of alerts with the signature "INFO FTP Bad login" that were reported within five-minute time windows for each destination IP address. The user should then assign a threshold value which will later be examined by the analytics engine prior to making the decision whether or not to trigger a transition to the subsequent stage.

Fig. 3. Quantifiable measure for an Elasticsearch data source agent [12].

Dependency between the events of successive stages can be specified to indicate whether events at a particular stage should only be examined after some event attributes from one of the preceding stages, e.g. the IP addresses of affected hosts, have been passed on by its data source agent. Events will be observed either periodically (e.g. until their threshold value is exceeded) or only once (e.g. to check whether certain events have occurred in the past). Figure 4 shows the complete setup for detecting the Brute force attack stage using the above-mentioned measure. Since it is the first stage in the example attack pattern (see Fig. 5), it has no dependency on other stages. The threshold is set to 20, i.e. twenty failed login attempts within a five-minute time window, which should be examined periodically.

Fig. 4. Setup for detecting Brute force attack stage [12].

Fig. 5. Example attack pattern with two possible attack paths (Brute force attack, followed by either Suspicious activity for privilege gain or Suspicious activity for information leak attempt).

The Brute force attack stage is followed by two subsequent stages, i.e. Suspicious activity for privilege gain and Suspicious activity for information leak attempt, which indicate the two different paths such an attack may take next. At both stages we look for suspicious activities that involve any destination hosts that were identified at the preceding stage after the threshold condition was met. The stage's dependency should thus be set to "Dependent of the data from previous stage" and the associated input data (i.e. Destination IP) should be selected accordingly (see Fig. 6).

Fig. 6. Setup for detecting Suspicious activity for privilege gain stage.
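As an illustration of such a measure, the following sketch shows the shape of an Elasticsearch aggregation that a data source agent could use to count "INFO FTP Bad login" alerts per destination IP address in five-minute windows. The index name and the field names (signature, dest_ip, timestamp) are assumptions made for this example, and the exact aggregation parameter names vary between Elasticsearch versions; this is a sketch of the query shape, not the system's actual agent code.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

THRESHOLD = 20  # twenty failed logins within a five-minute window

query = {
    "size": 0,
    "query": {"match_phrase": {"signature": "INFO FTP Bad login"}},
    "aggs": {
        "per_window": {
            "date_histogram": {"field": "timestamp", "fixed_interval": "5m"},
            "aggs": {"per_dest_ip": {"terms": {"field": "dest_ip"}}},
        }
    },
}

resp = es.search(index="snort-maccdc2012", body=query)
for window in resp["aggregations"]["per_window"]["buckets"]:
    for dest in window["per_dest_ip"]["buckets"]:
        if dest["doc_count"] > THRESHOLD:  # threshold exceeded: trigger next stage
            print(window["key_as_string"], dest["key"], dest["doc_count"])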
5 Analytics Engine

The analytics engine owns the task of matching attack patterns against a series of detected events, from a live feed or from historical datasets, in order to determine whether a potential attack campaign is under way. The engine consists of independent process entities whose workflows are dictated by the attack patterns. Each active attack pattern is assigned a single parent process, which will create as many child processes as necessary over time.

5.1 Parent Process
Once an attack pattern is activated (e.g. against historical cyber events), a parent process is started and the following tasks are performed.

Pattern Data Retrieval. The process control entity retrieves the pattern data from the system database. It extracts the information about which of the attack stages should be monitored from the start; these are referred to as start stages.

Agent Parameterization. The data source agent associated with each of those start stages is created. The parameters for its quantifiable measure, such as the type of measurement, time frame, threshold value, etc., are passed on to the agent.

Agent Scheduling. The time interval at which each data source agent executes its task is configured in the scheduler. The agent may then carry out its assigned task periodically.

Result Examination. The result produced by each agent, i.e. the measurement value, is reported back to the process control entity. The agent will have indicated whether or not the threshold condition has been satisfied.

Child Process Creation. Each time the threshold at a particular start stage is exceeded, the control entity creates a new child process for observing the subsequent stage. The reporting agent (of the parent process) may hold characteristic information about the relevant events, e.g. hosts' IP addresses, which will be used for correlating events at the subsequent stage. Each child process represents a path that needs to be followed throughout the attack cycle separately.

5.2 Child Process
Throughout its lifecycle, a child process performs similar tasks to its parent, i.e. parameterizing and scheduling data source agents, examining agent measurement values, and creating further child processes for subsequent stages. Any child process will eventually terminate in one of the following ways:

• End of Attack Cycle: This means that the final attack stage has been reached, the associated data source agent has indicated that the threshold has been met, and no more subjects (e.g. hosts) remain to be monitored at that stage.
• Timeout: This means that the threshold condition has not been satisfied after a certain observation period (e.g. 6 h); from the security point of view this may suggest a false alarm, or that a potential attack campaign has not progressed any further.

5.3 Message Flow
Figure 7 depicts the flow diagram for the example attack pattern described earlier in Sect. 4.2 (see also Fig. 5). The related events for all the stages are retrieved from the same Elasticsearch data source containing records of Snort IDS alerts. The following communications and data exchanges happen between the process entities once the attack pattern is activated (i.e. replayed against historical alert data):

1. The Parent Process Control entity (PPC) retrieves the pattern data from the system database and determines its start stage, i.e. Brute force attack. It then sends a request to the Agent Manager (AM) for instantiating an Elasticsearch data source agent (ESA-0) with the corresponding measurement parameters (e.g. time window) and a threshold value (i.e. 20 alerts). Eventually PPC instructs the scheduler (SCH) to trigger the data source agent at a specific time interval.
2. Each time the trigger fires, ESA-0 queries the relevant data from the Elasticsearch data source (EDS), calculates the measurement value (i.e. the total number of failed logins within five-minute time windows), and applies the threshold. It then reports the result back to PPC; the result contains the computed measurement value, a flag indicating whether the threshold has been exceeded, and other information (e.g. the destination IP address list).
3. PPC logs the result and checks the threshold flag. If the threshold condition has not been satisfied, no further action is taken. ESA-0 will keep periodically querying new data and reporting the results to PPC.
4. Once the threshold is exceeded, PPC extracts the measurement parameters of each of the subsequent stages (i.e. Suspicious activity for privilege gain and Suspicious activity for information leak attempt) and proceeds with the creation of two new Child Process Control (CPC) entities for these stages, i.e. CPC-1 and CPC-2. PPC extracts the list of destination IP addresses from the received agent's (ESA-0) result and passes it on to CPC-1 and CPC-2. The system will use temporal information and the IP address list to correlate the detected events/alerts.
5. CPC-1 sends a request to AM for instantiating a new Elasticsearch data source agent (ESA-1) with the task of observing the number of IDS alerts within a one-hour time window that are related to any suspicious activities for gaining system privilege involving devices with the specified IP addresses. CPC-1 then instructs the scheduler (SCH) to trigger the agent (ESA-1) at a specific time interval. In a similar fashion, CPC-2 instructs its data source agent (ESA-2) to look for alerts related to any suspicious activities for information leak attempts.
6. Whenever ESA-1/2 receives a trigger signal, it retrieves the relevant data from the Elasticsearch data source, calculates the measurement value, applies the threshold, and sends the result back to CPC-1/2. CPC-1/2 logs the results and checks the threshold flag. If the threshold has not been exceeded, no further action is taken. ESA-1/2 may keep observing new relevant events until it times out.
Fig. 7. Flow diagram for the example attack pattern. (Message sequence chart between the entities PPC: Parent Process Control, CPC: Child Process Control, AM: Agent Manager, SCH: Scheduler, ESA: Elasticsearch Data Source Agent, and EDS: Elasticsearch Data Source.)
5.4 Test Result
We tested the attack pattern using the datasets stored in our Elasticsearch server, which contains one day of Snort IDS alert logs (16 March 2012) with over two million records. The measurement results logged by the analytics engine's parent and child process entities can be consumed by the system's visualization component and presented to the users.
Figure 8 shows the visualization of the results for one branch or path of the attack pattern, with Suspicious activity for privilege gain as the second stage. The vertical lines (or bars) indicate the occurrence of events observed by the data source agents at each stage in the attack pattern, and the line length represents the maximum examined value, e.g. the maximum number of detected alerts per destination IP address. The test result shows that the threshold for the start stage (i.e. Brute force attack) was exceeded on several occasions, which led to the activation of new data source agents for observing the events related to the second stage. Those agents then reported related suspicious-activity events on three occasions, but only one satisfied the threshold condition (i.e. the events at 10:00 am). It is now interesting to learn more about the events or IDS alerts that matched the attack pattern parameters and how they were correlated with each other.
Fig. 8. Visualization of the measurement results.
First we take a look at the events observed for the second stage at 10:00 am. As shown in Fig. 8 (i.e. the pop-up window), the events are grouped by host IP address, and 15 events were detected for the IP address "192.168.202.102" (which exceeded the threshold of 10). Drilling down further into this particular IP address reveals that it was reported as the source address for a number of suspicious activities on multiple destination hosts, and all activities have the same alert signature "ET WEB_SERVER Suspicious Chmod Usage in URI" (see Fig. 9, left). We then trace the suspicious IP address back to the preceding stage (i.e. the first stage), where it turns out that the IP address appears to have been assigned to an FTP server which was targeted many times by a brute force attack (see Fig. 9, right). Analyzing the server's access log could thus be an appropriate next step for a security analyst in order to check whether the brute force attack was successful and the server compromised.
Fig. 9. Alert details for second stage (left) and first stage (right).
6 Conclusions

In this paper, we have described our approach of using predefined attack patterns as a means for detecting multi-stage cyber-attacks through correlation of relevant network and host events that have previously been logged by different systems and devices in the network. We presented the architecture of our pattern recognition system and outlined the attack modelling process, where users can specify which events should be observed at each stage of the attack, and how those events should be examined and connected with each other. We anticipate that security analysts of large organizations or corporations would be the main users and beneficiaries of our system; they are the ones who would be able to define the attack patterns based on their domain expertise and daily job experience, and eventually use the attack patterns to help them detect potential cyber threats as early as possible, in order to allow enough time to stop the attack or mitigate its impact. Attack patterns should thus be comprehensive and, in particular, up-to-date, so that they can be effective at recognizing attacks that may be conducted in many different ways. It would be very useful for security analysts to have a 'hacker mindset' in order to be able to detect not only cyber-attacks with historically known patterns, but also ones using new methods and strategies never seen before. We also believe that sharing such powerful attack patterns in a trusted and confidential way with other security analysts outside the organization would be essential for building a proactive network defense against targeted cyber-attacks; we are currently looking into this possibility in the framework of a collaborative research project [6].

Furthermore, in our prototype we implemented centrally controlled software agents to independently query the event data from various sources and compute the measurement values. This proved quite challenging in terms of resource usage and scalability when replaying multiple attack patterns against historical datasets simultaneously. Better resource allocation and caching strategies, as well as the use of big data technologies, are thus planned in future system implementations to overcome this issue.

Acknowledgments. This work was partially supported by the H2020 EU-funded project Collaborative and Confidential Information Sharing and Analysis for Cyber Protection, C3ISP [GA #700294]. The views expressed in this paper are solely those of the authors and do not necessarily represent the views of their employers, the C3ISP project, or the Commission of the European Union.
References

1. Alnas, M., Hanashi, A.M., Laias, E.M.: Detection of Botnet multi-stage attack by using alert correlation model. Int. J. Eng. Sci. (IJES) 2(10), 24–34 (2013)
2. Alserhani, F., Akhlaq, M., Awan, I.U., Cullen, A.J., Mirchandani, P.: MARS: multi-stage attack recognition system. In: Proceedings of the 24th IEEE International Conference on Advanced Information Networking and Applications, Perth, WA (2010)
3. Ammann, P., Wijesekera, D., Kaushik, S.: Scalable, graph-based network vulnerability analysis. In: Proceedings of the 9th ACM Conference on Computer and Communications Security, Washington, DC (2002)
4. Barnum, S.: An introduction to attack patterns as a software assurance knowledge resource. In: OMG Software Assurance Workshop, Fairfax, VA (2007)
5. Bhatt, P., Yano, E.T., Gustavsson, P.M.: Towards a framework to detect multi-stage advanced persistent threats attacks. In: Proceedings of the IEEE 8th International Symposium on Service Oriented System Engineering, Oxford, UK (2014)
6. C3ISP – Collaborative and Confidential Information Sharing and Analysis for Cyber Protection Project Homepage. http://c3isp.eu. Accessed 17 Aug 2017
7. CAPEC – Common Attack Pattern Enumeration and Classification Homepage. http://capec.mitre.org. Accessed 17 Aug 2017
8. Capture files from Mid-Atlantic CCDC (Collegiate Cyber Defense Competition) MACCDC 2012. https://www.netresec.com/?page=MACCDC. Accessed 07 Aug 2017
9. Cheung, S., Lindqvist, U., Fong, M.W.: Modelling multistep cyber attacks for scenario recognition. In: Proceedings of the 3rd DARPA Information Survivability Conference and Exposition, DISCEX III, Washington, DC, vol. 1 (2003)
10. Clark, D.D., Landau, S.: The problem isn't attribution; it's multi-stage attacks. In: Proceedings of the Re-Architecting the Internet Workshop, Philadelphia, US. ACM (2010)
11. Hutchins, E., Cloppert, M., Amin, R.: Intelligence-driven computer network defense informed by analysis of adversary campaigns and intrusion kill chains. In: Proceedings of the 6th International Conference on Information Warfare and Security, Washington, DC (2011)
12. Herwono, I., El-Moussa, F.: A collaborative tool for modelling multi-stage attacks. In: Camp, O., Mori, P., Furnell, S. (eds.) Proceedings of the 3rd International Conference on Information Systems Security and Privacy, pp. 312–317 (2017)
13. Scarabeo, N., Fung, B.C.M., Khokhar, R.H.: Mining known attack patterns from security-related events. PeerJ Comput. Sci. 1, e25 (2015)
14. Sood, A.K., Enbody, R.J.: Targeted cyber attacks: a superset of advanced persistent threats. IEEE Secur. Priv. 11(1), 54–61 (2013)
15. Wang, L., Liu, A., Jajodia, S.: Using attack graphs for correlating, hypothesizing, and predicting intrusion alerts. Comput. Commun. 29(15), 2917–2933 (2006)
A Better Understanding of Machine Learning Malware Misclassification

Nada Alruhaily, Tom Chothia, and Behzad Bordbar

School of Computer Science, University of Birmingham, Edgbaston B15 2TT, UK
{N.M.Alruhaily,T.P.Chothia,bxb}@cs.bham.ac.uk
Abstract. Machine learning-based malware detection systems have been widely suggested and used as a replacement for signature-based detection methods. Such systems have shown that they can provide a high detection rate when recognising previously unseen malware samples. However, when classifying malware based on their behavioural features, some new malware can go undetected, resulting in misclassification. Our aim is to gain more understanding of the underlying causes of malware misclassification; this will help to develop more robust malware detection systems. Towards this objective, several questions have been addressed in this paper: Does misclassification increase over a period of time? Do changes that affect classification occur in malware at the level of families, where all instances that belong to certain families are hard to detect? Alternatively, can such changes be traced back to certain malware variants instead of families? Also, does misclassification increase when removing distinct API functions that have been used only by malware, given that this technique could be used by malware writers to evade detection? Our experiments showed that changes in malware behaviour are mostly due to behavioural changes at the level of variants across malware families, where individual variants did not behave as expected. They also showed that machine learning-based systems can maintain a high detection rate even in the case of malware trying to evade detection by not using distinct API functions that are uniquely used by malware.
Keywords: Malware · Classification · Machine learning · Behavioural analysis

1 Introduction
Malware is a major concern in most computer sectors. Past research has shown that machine learning based detection systems can detect new malware using the knowledge gained from training a classifier on previously discovered and labeled malware samples (e.g. [3,14,15,35]). However, due to the fact that malware are evolving and their behaviour can change, as in the case of exploiting a new vulnerability [52] or of an attempt by malware writers to avoid detection, malware can remain undetected and, therefore, be classified incorrectly as benign.
In this paper we investigated the reasons behind the misclassification of malware by using a knowledge-base of malware and benign samples that we collected from a range of sources. We tracked the changes adopted by the misclassified malware instances, and we investigated whether there was a recognisable pattern across these misclassified samples. In our first experiment, we grouped malware by year and classified them in order to check whether there was any relation between the passage of time and the malware detection rate. In our second experiment, we checked whether the changes that caused the misclassification occur in malware at the level of families, where all instances that belong to specific new families are misclassified, or whether these changes can be traced back to individual variants. In addition, we noticed that there were some API functions that have been used only by malware; we therefore checked whether removing such a set of APIs would increase the overall misclassification rate, as this technique could be used by malware writers to evade detection.

To summarise, this paper is an extended version of the one presented at the International Conference on Information Systems Security and Privacy (ICISSP 2017) [2]. The conference paper investigated the following research questions: (i) does misclassification increase over a period of time? (ii) does misclassification occur in malware at the level of families, where all instances that belong to specific new malware families are misclassified? (iii) alternatively, does misclassification occur at the level of variants, without being related to malware families? (iv) when misclassification does occur, can we find the reason for it? In addition to those research questions, we investigate in this extended version the following: (i) does generating the classification model based on a sliding window, when checking malware behavioural changes over time, lead to a different conclusion compared to generating the model based on all previously seen data? (ii) are there any API functions that are uniquely used by malware, and would avoiding such functions increase the overall misclassification rate?

In order to answer the research questions described, we used 5410 malware samples, from approximately 400 variants drawn from 200 malware families. Although anti-malware vendors may have different names for the same sample, they usually share a similar naming scheme, where malware can be grouped into families based on some similarities (code similarity, as an example), and each family can then have a number of variants, where each variant represents a new strain that is slightly modified. Figure 1 shows an example of the naming scheme followed by the company Symantec; it consists of a prefix, the family name and a suffix [42]. Throughout this paper, Symantec's Listing of Threats and Risks [45] has been used as the source of malware discovery dates, types, and malware family and variant names.
Fig. 1. Symantec malware naming scheme, adopted from [2].
We first build a classifier based on our malware and benign samples. We have an imbalanced dataset, as the malware class is the majority, which is an issue that proposed malware detection systems usually encounter [28,47]. We showed that malware detection systems based on the Bagging [6] classifier could be affected by the imbalance problem, as the classifier becomes more biased towards the majority class; we therefore propose to adopt Exactly Balanced Bagging (EBBag) [8] instead, which is a version of Bagging designed for dealing with an imbalanced dataset.

As described in [2], we conducted multiple experiments: in the first experiment we trained the classifier on all data seen before a specific year, and we tested the remaining malware grouped into their year of discovery. In the second experiment we grouped each new malware family into their available variants. We recorded the detection rate resulting from each group, in each experiment, in order to answer the research questions mentioned above. We used two different classification algorithms, Support Vector Machines (SVM) and Decision Trees (DT), as a base for our classifier, to make sure that the results produced are not dependent on a specific classification algorithm.

When looking at how the classifier performed on malware released after the training data, we saw a small fall in the detection rate after the first tested year; however, we found no significant long-term correlation between the time since a classifier is built and the misclassification rate of new malware, i.e., after a small initial drop-off, the behavioural changes in the new malware did not reduce the accuracy of the classifier. It can be seen, however, that when having access to more training data the classifier tends to perform slightly better overall. Our research also showed that most of the behavioural changes that affect the classification rate of malware can be traced back to particular malware variants, and in most cases these changes are not replicated across malware families. That is, we would see a single variant that was hard to detect in a family of malware in which the other variants could be reliably detected. We also found that the misclassifications were mostly due to the adoption of anti-virtualisation techniques, to the fact that particular variants were looking for a specific argument to run, or to the fact that some variants were actually corrupted files. While removing all examples of misclassified corrupted malware from our dataset would have been possible, we note that no other work on malware classification does this, so removing these samples would not reflect other work.

In addition, we conducted two experiments in this extended version of the paper: in the first, we ran the implemented classifier on malware grouped into their year of discovery, but using a different training approach, where a sliding window model was used instead of one based on all previously seen data. We found, as before, that the classifiers can continue to give a high detection rate even after a period of time. In the second experiment we ran the classifiers on the malware data after removing the set of API functions that has been uniquely called by malware. We noticed that removing those APIs from the feature set did not increase the overall misclassification rate; this might be due to the fact that there are other features which are considered better indicators for distinguishing malware from benign samples.
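For illustration, the following minimal sketch shows the core idea of Exactly Balanced Bagging: every bag keeps the entire minority (benign) class and draws an equally sized random subsample of the majority (malware) class, so each base learner is trained on balanced data. A decision tree is used here as the base learner, matching one of the two algorithms used in the paper; the function names and parameters are ours, not the paper's implementation.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_ebbag(X, y, n_bags=10, minority_label=0, seed=0):
    """Sketch of Exactly Balanced Bagging; assumes numpy arrays and 0/1
    labels, with the benign (minority) class labelled 0."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == minority_label)
    maj_idx = np.flatnonzero(y != minority_label)
    models = []
    for _ in range(n_bags):
        # Subsample the majority class down to the minority class size.
        sample = rng.choice(maj_idx, size=min_idx.size, replace=False)
        idx = np.concatenate([min_idx, sample])
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def predict_ebbag(models, X):
    # Majority vote over the ensemble of balanced base learners.
    votes = np.stack([m.predict(X) for m in models])
    return (votes.mean(axis=0) >= 0.5).astype(int)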
We hope that our results will help the reader to interpret other papers that present the detection rate of a machine learning classification system as their main result, and that our work sheds some light on how these systems will perform over time and why some malware avoids detection. The paper is organised as follows: Sect. 2 describes related work in this field. Section 3 gives an overview of the methodology followed. Section 4 presents the sample preparation and data collection procedure. The main contribution to this field of research is described in Sects. 5 and 6, along with the experiments' results. The analysis of the misclassified instances is presented in Sect. 7, while Sect. 8 analyses the use of unique API functions and investigates the effect of removing those distinct APIs on the misclassification rate. Section 9 discusses the research outcomes and outlines the conclusion.
2 Related Work
In the following section we review the literature on malware classification and detection systems that is related to our work, and we indicate how our work differs. A large number of studies have introduced or evaluated different ML-based detection systems; the authors in [53,54,58] based their detection systems on API call sequences extracted using static analysis. However, static analysis is known to be questionable when dealing with obfuscation techniques, where the sequence of the instructions might be modified or data locations can be hidden [30]. Motivated by this fact, other researchers have based their detection systems on API call sequences extracted using behavioural analysis [12,47,61]. Most of the proposed detection systems have been tested on either a limited or a current set of malware. Therefore, there was a need to examine the effect of the passage of time on such malware detection systems and to explore whether this can affect the systems' detection rate.

Islam et al. [18] showed that it is possible for current ML-based malware detection systems to maintain a high detection rate even when classifying new malware. They considered two sets of malware in order to verify this idea, one collected between 2002 and 2007 and the other collected from 2009 to 2010. In their experiment they dated malware based on the collection date, and they used static features, behavioural features, and a combination of both. Singh et al. [41] used three approaches to detect concept drift (which refers to changes in the behaviour and underlying distribution of the data): relative similarity, meta-features, and retraining a classifier. They focused in their experiments on static malware features only and on three malware families. Although their malware sample size was limited, their work also provided evidence in favour of negligible drift.

Shabtai et al. [40], on the other hand, used OpCode n-gram patterns as features. They addressed the question: "How often should a classifier be trained with recent malicious files in order to improve the detection accuracy of new
malicious files?". Their research showed that classifiers can maintain a reliable level of accuracy. However, when testing malware created in 2007, a significant decrease in accuracy was observed, which might indicate that new types of malware were released during that year.

Our work differs from the research mentioned above in the following respects:

– In addition to looking at malware behaviour over time, we checked whether using a sliding window approach to generate the classification model, instead of generating the model based on all previously seen data, leads to a different conclusion with regard to malware behavioural changes over time.
– We tracked the misclassification and investigated whether it can be traced back to malware families, or even to variants.
– We investigated the possible reasons that led to this misclassification and analysed the results.
– We also determined the set of APIs uniquely used by malware and investigated whether avoiding such distinct APIs will increase the overall misclassification rate.
3 Sketch of the Approach

Our work in this paper can be divided into five stages:

1. Collecting malware; described in detail in Sect. 4.
2. Building a typical ML-based detection system, represented by a classifier. This task includes:
– Extracting and determining the features that will be used when classifying the data.
– Addressing the class imbalance problem in our data, where malware samples outnumber the benign samples.
– Assessing the performance of the implemented classifier.
The feature extraction step is described in Subsect. 5.1, while the system architecture is illustrated in Subsect. 5.3.
3. Classifying large grouped datasets: following the previous task, we classified large datasets which have been grouped into:
– Years: in order to check whether there is a notable change in malware behaviour over time, which might result in a change in the classification rate in one of the years. We chose one-year intervals based on the amount of data we have, as it was the minimum period able to produce stable results. The classification models are generated based on two training approaches, i.e. based on all previously seen data, and based on a sliding window of the latest three years (a sketch of the two regimes is given after this list).
– Variants: in order to check whether the changes which affect the detection rate can be traced back to specific malware families, where all of their variants are hard to detect, or just to particular variants, without family membership being a factor.
The details of these two experiments are described in Sect. 6.

4. Analysing the misclassified instances: which includes analysis of the misclassification that occurred and identification of its reasons. The analysis is discussed in Sect. 7.
5. Investigating the use of unique API functions: which includes determining the set of API functions that have been uniquely used by malware, and investigating whether trying to evade detection, by avoiding the use of such functions, would increase the overall misclassification rate; this is discussed in Sect. 8.
4 The Knowledge-Base
In order to conduct our experiments we built a knowledge-base, which consisted of information and analysis reports related to all our collected malware and benign samples. The following section provides an overview of the procedure that was followed to collect the data. Initially, the Python-based tools mwcrawler and maltrieve [26,27] were used to collect malware samples from a number of sources by parsing multiple malware websites and retrieving the latest malicious samples that had been uploaded to them. These websites include: Malc0de, Malware Black List, Malware Domain List, Malware URLs, VX Vault, URLquery, CleanMX and ZeusTracker. To make sure our dataset reflected the most common types of malware, we developed a Python-based script that collects the 10 and 20 most common malware families recorded by Symantec and Microsoft in their Internet Security Threat Report [44] and Security Intelligence Report [29], respectively. The script works by pulling all samples resulting from the search request for each malware family, including all the available variants, from the open malware database "Open Malware" [32]. In addition, a large number of samples were also downloaded from the VirusTotal website [50] through their intelligence service. We note that this method succeeded in getting more samples from the most common malware families. As a result of the previous step, 5410 malware samples were collected. Our malware samples span approximately 400 malware variants, with dates of discovery ranging from 08.12.1997 (Infostealer) to 22.10.2015 (W32.Xpiro.I). We also parsed all the information found on Symantec's Threats, Risks & Vulnerabilities pages [45] in order to collect some valuable information, such as malware types, the discovery date of each malware family, and the risk level, in addition to the systems affected. Our benign executables (956 samples) were collected from fresh installations of Windows XP SP3 and several official websites such as Windows and Adobe. Windows XP SP3 was selected to be installed on the virtual machines during the analysis, as all the collected malware samples from 08.12.1997 to 22.10.2015 could run on this system according to Symantec's systems-affected information for each malware family. In addition, Windows XP SP3 is widely used for the purpose of analysing malware [7,34] for multiple reasons; one of them is the fact that it "consumes less memory and CPU
power" [34]. All the executables were sent to VirusTotal in order to retrieve the scanning results from multiple vendors, such as Symantec, Kaspersky, Avira, ClamAV and others, and to ensure the integrity of the benign applications. To analyse the malware and benign samples we used Cuckoo Sandbox version 0.2 [10], an automated malware analysis system. The sandbox executes samples in a controlled environment (virtual machines) for 120 s and generates reports with API calls and system traces, in addition to network activities in a .pcap file. In order to decrease the analysis time, we configured Cuckoo Sandbox to use three virtual machines instead of one, all running simultaneously on a host machine running Ubuntu 14.04. After the analysis was completed, the behavioural analysis reports were parsed so that the extracted API calls and the other information included in the reports could be added to our knowledge-base, along with the information collected from Symantec's Threats, Risks & Vulnerabilities pages.
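To make this parsing step concrete, the following is a minimal sketch of how the API-call sequence could be pulled out of a Cuckoo behavioural report. The JSON layout (behavior → processes → calls → api) follows common Cuckoo report structures, but field names vary between Cuckoo versions, so treat both the layout and the function name as assumptions rather than the actual implementation used in this work.

```python
import json

def extract_api_calls(report_path):
    """Return the flat API-call sequence recorded in one Cuckoo JSON report.

    Assumes the usual Cuckoo layout: behavior -> processes -> calls -> api.
    """
    with open(report_path) as f:
        report = json.load(f)

    calls = []
    for process in report.get("behavior", {}).get("processes", []):
        for call in process.get("calls", []):
            calls.append(call["api"])  # e.g. "NtTerminateProcess"
    return calls

# One sequence per analysed sample, ready for feature extraction:
# api_sequences = [extract_api_calls(path) for path in report_paths]
```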
5 Building a Classifier
In this section, we discuss the process of extracting the features that will be used, in addition to the classification procedure. This section also includes the metrics that are used to measure performance throughout this paper.

5.1 Feature Extraction
In this paper, we used API calls as our dynamic features, as they have been widely used as behavioural features for such systems and have been shown to offer a good representation of malware behaviour [11,12,15,35,47]. Additionally, it is also worth noting that relying on behavioural analysis helps to avoid problems when dealing with some cases such as malware obfuscation. Based on preliminary tests on 1282 malware samples from our data, we found that the frequency of the APIs did not improve the classification any further, which is similar to the conclusion reached by [47], namely that only information related to the presence of an API is important and not its frequency. Thus we used the binary vector representation of the extracted APIs, where 0 refers to the absence of an API call and 1 denotes its presence. In addition, in our experiments we used an n-gram based method to represent the API call sequence [1], where unigram, bigram, a combination of unigram and bigram, and trigram have been evaluated. Using unigrams means that the sequence of API words is unimportant; we simply check if a word is present or not (a.k.a. the 'bag of words' representation). By using bigrams and trigrams, the feature vector will not only contain a single API word but will also preserve the sequence of two and three words, respectively, which can be considered as taking a snapshot of the malware behaviour. We also believe that using a hybrid analysis (both static and dynamic features) might boost the detection rate further. However, as we intend to use the classifier for checking malware behaviour in different scenarios, such as over time and with malware grouped into variants, static analysis was beyond the scope of this paper.
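As an illustration of this representation, the following sketch builds the binary unigram+bigram vectors with scikit-learn; `api_sequences` (one list of call names per sample, as produced by the report-parsing sketch above) is an assumed input, and the exact preprocessing used in the paper may differ.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Each sample is its API-call sequence joined into one space-separated string,
# e.g. "ldrloaddll ntallocatevirtualmemory ntterminateprocess".
docs = [" ".join(seq) for seq in api_sequences]

# binary=True records only presence (1) or absence (0) of each n-gram,
# matching the observation that API frequency adds nothing here.
vectorizer = CountVectorizer(
    ngram_range=(1, 2),        # unigram+bigram, the best SVM setting
    binary=True,
    token_pattern=r"[^ ]+",    # every whitespace-separated call is one token
)
X = vectorizer.fit_transform(docs)  # sparse matrix: samples x API n-grams
```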
5.2 Evaluation Metrics
The following metrics have been used for evaluation purposes throughout this paper:

– True positive (TP): Number of samples correctly identified as malware.
– True negative (TN): Number of samples correctly identified as benign.
– False positive (FP): Number of samples incorrectly identified as malware.
– False negative (FN): Number of samples incorrectly identified as benign.
These terms are used to define four performance metrics which we use throughout this paper:

1. Sensitivity (recall): measures the proportion of true positives: Sensitivity = TP / (TP + FN)
2. Specificity: measures the proportion of true negatives: Specificity = TN / (TN + FP)
3. Geometric mean (G-mean): also known as the macro-averaged accuracy [13]; it is the geometric mean of recall over all classes, and it is known in the field of machine learning as a more accurate measure of performance than plain accuracy in an unbalanced classification scenario, as it considers the accuracy of both classes, the majority and the minority [19,24]. The G-mean can be calculated as follows: G-mean = √(Sensitivity · Specificity)
4. Area under the ROC curve (AUCROC) [17]: the ROC curve is a graphical plot that illustrates the performance of a binary classifier. The curve is created by plotting the true positive rate (sensitivity) against the false positive rate (1 − specificity) at various thresholds. The value of AUCROC can vary between 1.0 and 0, where 1 indicates a perfect classifier with an ideal separation of the two classes, and an AUCROC of 0.5 represents a worthless classifier. AUCROC is insensitive to class imbalance; if the majority of the labels are positive or negative, a classifier which always outputs 1 or 0, respectively, will have a score of 0.5 although it will achieve a very high accuracy. We calculated the AUCROC score based on the functions provided by the Scikit-learn library, which offers a wide range of machine learning algorithms and tools [39].
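All four metrics can be computed in a few lines; the sketch below uses scikit-learn's confusion-matrix and AUC helpers, with the convention (an assumption on our part) that label 1 denotes malicious and `y_score` is a continuous classifier score such as an SVM decision-function output.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def evaluate(y_true, y_pred, y_score):
    """Sensitivity, specificity, G-mean and AUC for a binary malware classifier."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn)           # TP / (TP + FN)
    specificity = tn / (tn + fp)           # TN / (TN + FP)
    g_mean = np.sqrt(sensitivity * specificity)
    auc = roc_auc_score(y_true, y_score)   # threshold-independent
    return sensitivity, specificity, g_mean, auc
```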
Both metrics, G-mean and AUCROC, are commonly used to evaluate classification performance on imbalanced data. However, as AUCROC considers all possible thresholds, it is usually used to assess performance when choosing the best model to be used during the classification procedure. Therefore we use both G-mean and AUCROC to assess and choose the best model in Sect. 5.3, while we use the G-mean, which represents the balanced accuracy, as our main metric when classifying malware based on their year of discovery in Sect. 6.1. In Sect. 6.2 we use sensitivity as our main metric, as it conforms with the goal of our second experiment, which focuses on the classifier's ability to correctly identify the malicious instances without adding any noisy data related to the benign samples.
5.3 Classifier Design and Addressing the Imbalance Problem
For our classifiers, we used Support Vector Machine (SVM) and Decision Tree (DT), as they are widely known machine learning algorithms [9,22]. In addition, they have shown state-of-the-art results in a number of classification problems, including classifying and recognising new malware [1,23,25,56,59,60]. The SVM and DT classifiers are based on the implementation of the Scikit-learn library. As mentioned previously, we also evaluated their performance with different sizes of n-grams: unigram, bigram, unigram+bigram and trigram, as shown in Table 1, and we chose the best settings for each classifier. In the case of SVM, the bigram and the unigram+bigram settings both gave the best results; however, as our aim is to test malware discovered in the following years, where a match of an exact sequence might not be found, we preferred to use unigram+bigram, which results in more generalisation as it tests the occurrence of a single API word in addition to a sequence of two APIs together. Class imbalance is a common problem in the area of machine learning in general [55], and in malware detection in particular [31,57]. The imbalance problem occurs when the numbers of instances in the classes vary. In malware detection systems this is due to the fact that malware can be downloaded in large numbers from open-access databases such as Open Malware [32], VX Heaven [51] and VirusShare [49], whereas it is more difficult to gather benign samples. The imbalance problem can affect the classification process as the classifier becomes more biased towards the majority class, which is usually the malicious class, as mentioned in [28]. To avoid the effect of this problem, we tested a well-known approach in the area of machine learning referred to as Exactly Balanced Bagging (EBBag) [8,19,20].
Fig. 2. The EBBag model.
Table 1. Classifiers performance on different n-gram sizes.

| Feature set    | SVM G-mean | SVM AUCROC | DT G-mean | DT AUCROC |
|----------------|------------|------------|-----------|-----------|
| API Uni        | 0.92519    | 0.97163    | 0.92796   | 0.97203   |
| API Bigram     | 0.93965    | 0.97681    | 0.91791   | 0.96815   |
| API Uni+Bigram | 0.94041    | 0.97692    | 0.91965   | 0.96865   |
| API Trigram    | 0.89926    | 0.95985    | 0.92465   | 0.97079   |
Fig. 3. Classification rate with different ratios of malware samples to benign, adopted from [2].
This approach is based on classifying balanced subsets and is a modified version of Bagging [6], a method that has been used extensively in malware detection systems with different base classifiers and gives a considerably higher detection rate than normal classifiers [56], even with an imbalanced dataset [33]. To the best of our knowledge, this paper is the first to use EBBag for malware detection. Figure 2 depicts our classification approach. Bagging, in general, is a classifier ensemble technique based on randomly drawn subsets of the training data. Each time a subset is drawn, a classifier is constructed to classify the newly generated subset. The classification procedure is repeated a number of times (we used 100 iterations) and a majority vote over all predictions is calculated as the final prediction to ensure the robustness of the results. Our framework implements the EBBag approach, which is based on Bagging but with a minor modification to the way the subsets are drawn. In EBBag, the entire minority class is used for each classifier along with randomly generated
subsets of the majority class, which are the same size as the minority class, thus balancing the data. The procedure of generating smaller samples is known as "downsampling". We ran five tests to compare Bagging with EBBag, calculating the classification rate of randomly chosen malware and benign samples, with a malware-to-benign ratio varying between 2:1 and 10:1. We performed 10-fold cross validation and compared our adopted approach, EBBag, to the Bagging approach with the same base classifier (SVM) and the best n-gram size. The results of these tests are shown in Fig. 3. The figure shows the true positive rate (sensitivity), true negative rate (specificity) and AUCROC recorded for each approach. We note that Bagging becomes increasingly inaccurate as the data becomes more imbalanced. The figure therefore indicates that imbalanced data will be part of the cause of the misclassification rate in papers that use Bagging alone to classify malware. By using EBBag, it can be seen that sensitivity and specificity, which represent the accuracy of the malicious and benign classes, respectively, are not affected by the imbalance problem. Also, the false positive rate (1 − specificity) is significantly decreased, from 0.29 to 0.07; a false positive occurs when a classifier flags a benign file as malicious by mistake, and this is usually costly as it consumes a considerable amount of resources and time. Therefore, this analysis shows that EBBag outperforms Bagging, and thus we use EBBag for the rest of the work.
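The EBBag procedure itself is short. The following is a minimal sketch under the conventions above (benign = 0 as the minority class, malware = 1 as the majority, SVM as base classifier); it is an illustration of the technique, not the authors' exact implementation.

```python
import numpy as np
from sklearn.svm import SVC

def ebbag_predict(X_train, y_train, X_test, n_rounds=100, seed=0):
    """Exactly Balanced Bagging: every round trains on the whole minority
    class plus an equally sized random subset of the majority class; the
    final label is a majority vote over all rounds."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y_train == 0)   # benign samples (minority)
    majority = np.flatnonzero(y_train == 1)   # malware samples (majority)

    votes = np.zeros(X_test.shape[0])
    for _ in range(n_rounds):
        sub = rng.choice(majority, size=minority.size, replace=False)
        idx = np.concatenate([minority, sub])  # exactly balanced subset
        clf = SVC(kernel="linear")
        clf.fit(X_train[idx], y_train[idx])
        votes += clf.predict(X_test)

    return (votes > n_rounds / 2).astype(int)  # majority vote
```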
6 Classifying a Large Grouped Dataset
As mentioned previously, in order to answer the main research questions of this paper we followed two methods of testing the data:

1. Classifying all malware based on the year of discovery.
2. Classifying malware based on malware variants.

Sections 6.1 and 6.2 describe in detail the process used in these two methods.

6.1 Classifying Malware Based on the Year of Discovery
In our first experiment, we tested the classifier on data grouped into years in order to check whether there was a notable change in malware behaviour over time, which would be represented as a change in the classification rate. We carried out two experiments, in each of which we divided the entire test set into years from 2007 to 2014, based on the discovery date recorded by Symantec; we thus ended up with 8 testing sets. We chose one-year intervals based on the amount of data we have, as this was the minimum period able to produce stable results. The initial training set included samples up to the year 2006. Each training set was evaluated separately on each of the following years, so the initial training set was evaluated on malware discovered from 2007 to 2014. We repeated the experiment by extending the training set to include the following year and testing on the remaining years.
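For illustration, the evaluation loop could look like the sketch below, simplified to score only the year immediately following the training window (the setting later plotted in Fig. 5). Here `by_year`, a mapping from discovery year to that year's sparse feature matrix and label vector, is an assumed structure, and `ebbag_predict` refers to the EBBag sketch above; passing `window=3` gives the sliding-window variant discussed later in this section.

```python
import numpy as np
from scipy.sparse import vstack

def yearly_gmeans(by_year, first=2007, last=2014, window=None):
    """Train on all years before each test year (or only on the latest
    `window` years) and record the G-mean on that test year."""
    results = {}
    for test_year in range(first, last + 1):
        train_years = sorted(y for y in by_year if y < test_year)
        if window is not None:
            train_years = train_years[-window:]  # e.g. latest three years
        X_tr = vstack([by_year[y][0] for y in train_years])
        y_tr = np.concatenate([by_year[y][1] for y in train_years])
        X_te, y_te = by_year[test_year]
        y_pred = ebbag_predict(X_tr, y_tr, X_te)
        tp = np.sum((y_te == 1) & (y_pred == 1))
        fn = np.sum((y_te == 1) & (y_pred == 0))
        tn = np.sum((y_te == 0) & (y_pred == 0))
        fp = np.sum((y_te == 0) & (y_pred == 1))
        results[test_year] = np.sqrt(tp / (tp + fn) * tn / (tn + fp))
    return results
```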
Fig. 4. Malware tested yearly: (a) using SVM; (b) using DT. Adopted from [2].
Figure 4 shows the averaged accuracy (G-mean) recorded by our framework on data trained on real malware samples. Most of the results showed that the classifier can maintain a score above 80%. It can also be seen that the more data is included in the training phase, the higher the G-mean score for the following year will be, except for some cases where a new family was introduced, such as in 2009 and 2010, or where a new variant with a slightly different behaviour was introduced, such as in 2012 with W32.Ramnit.B!gen2. In the case of this family, although some variants of the malware family W32.Ramnit had been introduced before 2012 and the system had been trained on some of them, it seems that the W32.Ramnit.B!gen2 variant, which was introduced in 2012, implements some anti-sandboxing techniques whereby the malware tried to unhook Windows functions monitored by the sandbox, causing the Cuckoo monitor to throw 1025 exceptions. However, overall, from Fig. 4 we can conclude that the detection rates are not consistently affected by the passage of time; instead, the classifiers can generally continue giving a good detection rate. To analyse the results further, we carried out another experiment, which is explained in the remainder of this section. The experiment aimed to check whether generating the classification model based on a sliding window of the latest three years would lead to different results.

Using a Sliding Window to Generate the Classification Model. In Fig. 4, the classification model was generated from all previously seen examples; this is based on the assumption that all data are important and there are no outdated data or features which need to be discarded. To make sure that the conclusions drawn were not affected by such an assumption, we utilised an additional training approach where, in addition to training on all the previously seen data, we train the classifier based on malware data seen within a fixed window (we chose the last three years only as a training window). These two training approaches are widely known and have been used extensively to detect any distribution drift on data whose nature or distribution changes over time [5,21].
Fig. 5. Malware tested yearly based on a model generated using a sliding window.
Figure 5 shows the G-mean results from training each classifier based on the two different training approaches; the graph is generated based on testing the following year only, each time a classification model is constructed. It seems that adopting the sliding window approach, where old data are continuously discarded, does support the conclusion made previously, namely that detection rates are not consistently affected by the passage of time. Thus, to analyse the results further, we carried out another experiment, which is explained in the next section. The experiment aimed to check whether the misclassification caused by the changes in malware behaviour can be traced back to a number of malware families or even to sub-families (a.k.a. variants).

6.2 Classifying Malware Based on Malware Variants
In this experiment we tested seven malware families (broken down into variants) in order to check whether the misclassified malware instances were a result of undetected behavioural changes at the malware family level, or whether they were due to other changes at the level of variants. The experiment was conducted for seven families, namely: W32.Changeup, W32.Pilleuz, W32.Imaut, W32.Sality, Trojan.FakeAV, Trojan.Zbot and Trojan.Gen. The remaining families were not used due to an insufficient number of samples for each particular variant in the available data set. These seven families correspond to a total of 2737 malware samples. For testing each of these families we trained the classifier on malware data prior to 2007, as after this date we have a reasonable number of malware families grouped into variants with which to carry out the testing process.
We used sensitivity as our main metric in this experiment (referred to in the table as Sens) to measure the detection rate, as it conforms with our goal, which is to focus on the classifier's ability to correctly identify the malicious instances without adding any noisy data related to the benign samples. Table 2 shows the results of classifying each of the malware families, including their available variants. It appears that, mostly, there is no absolute pattern between the misclassification and the malware family, nor between the misclassification and the discovery year of each of the variants. Therefore, it can be concluded that most of the misclassification can be traced back to several malware variants; it cannot be traced back to the discovery date of each of the tested variants, nor to changes at the level of the malware family (which would affect all of that family's future variants).
7 Reasons for Misclassification
Our aim in this section is to analyse the classification results. This also includes outlining the differences between the correctly classified and the misclassified variants and explaining the reasons that may have led to the misclassification. Generally, from Table 2 we can identify three misclassification cases, although most of the misclassifications occurred at the level of variants. We can summarise the different misclassification cases as follows:

– Variants misclassified by both classifiers.
– Variants misclassified by only one classifier.
– Misclassification which occurred at the family level instead of the variant level.

7.1 Variants Misclassified by Both Classifiers
Table 2 shows the malware variants misclassified by both classifiers, SVM and DT. These variants are: W32.Pilleuz!gen21, W32.Sality.AB, W32.Sality!dam, W32.Imaut!gen1 and Trojan.FakeAV!gen119. In the case of the W32.Pilleuz!gen21, W32.Sality!dam and W32.Sality.AB variants, it seems that these variants did not perform any behavioural action when being analysed. This can happen because the samples implemented some anti-virtualisation techniques, because they were looking for a specific argument, or because they are corrupted files. All three mentioned variants terminated the process by calling NtTerminateProcess. In the case of W32.Sality.AB and W32.Sality!dam, they also adopted a stealthiness technique whereby they disabled error messages by calling the SetErrorMode API with the arguments SEM_NOGPFAULTERRORBOX | SEM_NOOPENFILEERRORBOX. After looking at the W32.Sality!dam page on Symantec [43], it seems that this variant is considered a corrupted file that can no longer be executed or infect other files. As said previously, while removing all examples of misclassified corrupted malware from our dataset would have been possible, we note that no other work on malware classification does this, so removing these samples would not reflect other work.
Table 2. Tested by classifiers trained before 2007, adopted from [2].

| Family   | Variant              | #   | Date     | SVM-Sens | DT-Sens |
|----------|----------------------|-----|----------|----------|---------|
| Changeup | W32.Changeup         | 41  | 18/08/09 | 0.98     | 0.98    |
|          | W32.Changeup!gen9    | 48  | 02/09/10 | 1.0      | 1.0     |
|          | W32.Changeup!gen10   | 47  | 02/02/11 | 1.0      | 1.0     |
|          | W32.Changeup!gen12   | 47  | 11/08/11 | 1.0      | 1.0     |
|          | W32.Changeup!gen16   | 51  | 20/07/12 | 1.0      | 1.0     |
|          | W32.Changeup!gen44   | 76  | 01/08/13 | 0.99     | 1.0     |
|          | W32.Changeup!gen46   | 67  | 28/05/14 | 1.0      | 1.0     |
|          | W32.Changeup!gen49   | 64  | 20/08/14 | 0.94     | 1.0     |
| Pilleuz  | W32.Pilleuz!gen1     | 47  | 19/01/10 | 0.77     | 0.68    |
|          | W32.Pilleuz!gen6     | 138 | 29/09/10 | 0.88     | 0.86    |
|          | W32.Pilleuz!gen19    | 70  | 17/01/11 | 0.97     | 0.79    |
|          | W32.Pilleuz!gen21    | 64  | 29/03/11 | 0.28     | 0.03    |
|          | W32.Pilleuz!gen30    | 46  | 01/02/12 | 1.0      | 1.0     |
|          | W32.Pilleuz!gen36    | 60  | 07/02/13 | 1.0      | 1.0     |
|          | W32.Pilleuz!gen40    | 98  | 22/08/13 | 1.0      | 1.0     |
| Imaut    | W32.Imaut.AA         | 68  | 07/06/07 | 0.97     | 0.80    |
|          | W32.Imaut.AS         | 45  | 01/08/07 | 0.84     | 0.84    |
|          | W32.Imaut.CN         | 73  | 20/02/08 | 0.85     | 0.92    |
|          | W32.Imaut.E          | 64  | 23/12/08 | 0.88     | 0.83    |
|          | W32.Imaut!gen1       | 46  | 20/09/10 | 0.24     | 0.04    |
| Sality   | W32.Sality.X         | 46  | 12/01/07 | 0.96     | 0.93    |
|          | W32.Sality.Y!inf     | 91  | 16/03/07 | 0.98     | 0.98    |
|          | W32.Sality.AB        | 55  | 11/01/08 | 0.02     | 0.02    |
|          | W32.Sality.AE        | 71  | 20/04/08 | 0.93     | 0.87    |
|          | W32.Sality.AM        | 51  | 18/04/09 | 0.80     | 0.75    |
|          | W32.Sality!dr        | 71  | 31/08/10 | 0.80     | 0.80    |
|          | W32.Sality!dam       | 54  | 30/04/13 | 0.15     | 0.15    |
|          | W32.Sality.AF        | 93  | 02/01/14 | 0.90     | 0.77    |
| FakeAV   | Trojan.FakeAV        | 41  | 10/10/07 | 0.68     | 0.85    |
|          | Trojan.FakeAV!gen29  | 70  | 07/05/10 | 0.99     | 0.93    |
|          | Trojan.FakeAV!gen99  | 38  | 08/03/13 | 1.0      | 1.0     |
|          | Trojan.FakeAV!gen119 | 42  | 01/04/14 | 0.29     | 0.12    |
| Zbot     | Trojan.Zbot          | 40  | 10/01/10 | 0.98     | 0.28    |
|          | Trojan.Zbot!gen9     | 48  | 16/08/10 | 1.0      | 0.94    |
|          | Trojan.Zbot!gen43    | 48  | 26/05/13 | 0.85     | 0.88    |
|          | Trojan.Zbot!gen71    | 44  | 23/12/13 | 1.0      | 0.11    |
|          | Trojan.Zbot!gen75    | 32  | 05/06/14 | 0.69     | 0.97    |
| Gen      | Trojan.Gen           | 202 | 19/02/10 | 0.48     | 0.77    |
|          | Trojan.Gen.2         | 137 | 20/08/10 | 0.45     | 0.45    |
|          | Trojan.Gen.X         | 52  | 12/01/12 | 0.42     | 0.42    |
|          | Trojan.Gen.SMH       | 52  | 26/10/12 | 0.62     | 0.40    |
|          | Trojan.Gen.3         | 99  | 06/08/13 | 0.61     | 0.60    |
Table 3. Performance on the un-stemmed (286 APIs) and stemmed (~230 APIs) feature sets.

|     | Un-stemmed G-mean | Un-stemmed AUC | Stemmed G-mean | Stemmed AUC |
|-----|-------------------|----------------|----------------|-------------|
| SVM | 0.94747           | 0.94747        | 0.94681        | 0.94682     |
| DT  | 0.91688           | 0.91701        | 0.92381        | 0.92393     |
The W32.Imaut!gen1 worm, on the other hand, did not terminate the process; however, it did not perform any network activity, which might have led to it being misclassified. In fact, only 2 samples out of the 46 carried out some network activities, and both of them were classified correctly. In the case of Trojan.FakeAV!gen119, the malware variant used uncommon APIs (compared to others in our database) to connect to the Internet: InternetOpenW and InternetOpenUrlW, which are the Unicode-based APIs of the high-level Internet API, Windows Internet (WinINet). The calls used by this variant take arguments in Unicode format, while the older variants of this malware family used the ASCII-based API calls instead: InternetOpenA and InternetOpenUrlA. This raises the question: would normalising the APIs by removing the appended characters such as A, W, ExW, and ExA when processing the features, so as to have only InternetOpen in the feature set instead of multiple entries, increase the overall accuracy of API-based classifiers? To answer this question, we carried out another experiment using all of our data and normalising the Win32 APIs by removing the appended characters, such as A, W, ExA and ExW. We then performed 10-fold cross validation to assess the performance of the classifier using the stemmed and un-stemmed feature sets. The results are shown in Table 3. It can be seen from the table that removing the appended letters did not have a considerable impact on the classification rate. However, by removing the appended characters we ended up with around 230 features instead of 286 without significantly affecting the detection rate. Such an option can be considered to improve the efficiency of the classifier and minimise the time needed for classification. We note that other papers [36,37,48] use feature selection methods to reduce the number of features in their datasets; however, they do not use API stemming, which Table 3 suggests might be a helpful addition.
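The normalisation itself is a simple suffix-stripping rule. The sketch below reads the text literally and removes the suffixes A, W, ExA and ExW; note that an alternative reading would map ExA/ExW to Ex instead, which avoids conflating a FunctionEx with its plain Function counterpart, so this is one possible interpretation rather than the exact rule used.

```python
def stem_api(name):
    """Normalise a Win32 API name by stripping charset suffixes,
    e.g. InternetOpenA / InternetOpenW -> InternetOpen."""
    for suffix in ("ExA", "ExW", "A", "W"):
        if name.endswith(suffix) and len(name) > len(suffix):
            return name[: -len(suffix)]
    return name

assert stem_api("InternetOpenUrlW") == "InternetOpenUrl"
assert stem_api("InternetOpenUrlA") == "InternetOpenUrl"
assert stem_api("SetErrorMode") == "SetErrorMode"  # no suffix, unchanged
```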
7.2 Variants Misclassified by Only One Classifier

Another case which we investigated can be seen in Trojan.Zbot and Trojan.Zbot!gen71, where most of their instances were correctly classified by the SVM classifier while DT failed to classify these samples correctly. We analysed random trees constructed by the classifier to be able to determine the reasons behind the misclassification (Fig. 6 shows an example
of a simplified version of a tree constructed by the DT classifier). In the case of Trojan.Zbot, it seems that the absence of the call SetWindowsHookExA was the reason for the misclassification of almost all the variant's samples. In the case of the Trojan.Zbot!gen71 variant, the correctly classified malware called the NtProtectVirtualMemory API to allocate read-write-execute memory. This API is usually called by malware binaries during the unpacking process, and the absence of this call might indicate that the malware sample was not packed, which, in our case, led to the misclassification of the incorrectly classified instances.
Fig. 6. A simplified version of the tree constructed by DT classifier based on one run of the EBBag classifier.
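This kind of inspection can be reproduced with scikit-learn's tree utilities; the sketch below is illustrative, with `X_train`, `y_train`, `X_test`, `vectorizer` and `misclassified_idx` assumed from the earlier sketches, and `export_text`/`get_feature_names_out` available in recent scikit-learn versions.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit one tree and dump its decision rules, so the split features on the
# path taken by a misclassified sample (e.g. setwindowshookexa or
# ntprotectvirtualmemory) can be read off directly.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(export_text(tree, feature_names=list(vectorizer.get_feature_names_out())))

# decision_path() yields the exact node sequence a given sample traverses.
node_path = tree.decision_path(X_test[misclassified_idx])
```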
7.3 Misclassification on Malware Family Level
Although most of the misclassification that we have seen occurred at the variant level, there is a single case where the misclassification can be linked to the family instead, as can be seen in Table 2 for the Trojan.Gen family. The reason that this family is different from the others seems to be the fact that this family is actually a category "for many individual but varied Trojans for which specific definitions have not been created", as stated by Symantec on the Trojan.Gen family page [46]. By checking the misclassified samples and the paths taken by samples belonging to this family, it can be seen that although the samples may share some general characteristics, they adopt different techniques; thus the samples can behave in various ways and took different paths in trees generated by the DT classifier (approximately 15 different behavioural paths), unlike other families, whose behaviour was very uniform (2 or 3 paths). As we have said, the behavioural profiles and definitions resulting from this family were varied, and thus we only give examples for some of the
misclassification cases, as identifying all the reasons for the misclassification for this family would not be possible. Many of the misclassified instances did not connect to the Internet, either because the malware was applying some anti-virtualisation techniques (an example of this case is Trojan.Gen.X), or because they terminated the process when a specific argument had not been found, as in Trojan.Gen.3. In the case of Trojan.Gen.X, nearly half of the misclassified samples belonging to this variant were monitoring the user window by calling GetForegroundWindow and checking the mouse movement by calling the GetCursorPos API. They also followed these calls by calling GetKeyState to constantly monitor the following keys: the mouse buttons and the Alt, Ctrl, and Shift keys. Execution was then delayed by entering a loop while monitoring these actions, and NtDelayExecution was also called. These techniques have been noticed when analysing recent malware, as reported by malware researchers [38], in order to evade sandbox detection, and this could be the reason why all the variants that used this technique were misclassified.
8 The Use of Unique API Functions
When analysing the data, we also noticed the use of unique API functions. In the case of benign samples, there was only one unique API that was not used by malware, i.e., NtDeleteKey. In malware, on the other hand, there were 21 unique APIs that were called only by malware. Table 4 shows those APIs which were called only by malware samples, along with the number of malware binaries calling them. We checked whether avoiding the use of such unique APIs would increase the misclassification rate. We performed 10-fold cross validation to assess the performance of the classifier before and after removing those unique calls from the malware feature set.

Table 4. Unique API calls made only by malware.

| API call             | # malware | API call              | # malware |
|----------------------|-----------|-----------------------|-----------|
| NtCreateThread       | 38        | CreateDirectoryExW    | 1         |
| RtlCreateUserThread  | 35        | CertOpenSystemStoreW  | 23        |
| CryptGenKey          | 5         | NtCreateProcess       | 1         |
| DnsQuery_A           | 27        | RtlDecompressBuffer   | 107       |
| GetBestInterfaceEx   | 1         | Thread32First         | 233       |
| RtlCompressBuffer    | 1         | InternetWriteFile     | 14        |
| NtLoadDriver         | 1         | NtSaveKey             | 11        |
| GetKeyboardState     | 107       | FindFirstFileExA      | 27        |
| Thread32Next         | 234       | NtMakeTemporaryObject | 7         |
| NtDeleteFile         | 3         | WSASendTo             | 6         |
| GetAddrInfoW         | 3         |                       |           |
Table 5. Performance before and after removing the unique calls.

|     | Before G-mean | Before AUC | After G-mean | After AUC |
|-----|---------------|------------|--------------|-----------|
| SVM | 0.947546      | 0.949475   | 0.946762     | 0.946764  |
| DT  | 0.924745      | 0.924827   | 0.921878     | 0.921982  |
Fig. 7. Best features determined by one run of the EBBag SVM classifier (Table 6 lists the features denoted by their IDs).
Our experiment showed that removing the unique APIs from the malware feature set did not have a considerable impact on the overall classification rate (the results are shown in Table 5). This might indicate that there are other features which have been used as better indicators for distinguishing malware from benign samples. To investigate this further, we retrieved the top 20 best features for classifying malware and benign samples using the SVM classifier. Table 6 lists these features denoted by their ID numbers, while Fig. 7 shows the weight assigned to each of the features; this weight is based on the features' coefficients, and the list was retrieved by making use of functions provided by the Scikit-learn library [39]. It can be seen that one of the best features for distinguishing benign executables is the NtTerminateProcess API function; this is because benign executables usually call such an API when the unpacking process and the installation are complete, while malware samples usually start their malicious activities after the unpacking process is complete without calling such an API function (at least not to terminate the current process [16]).
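For a linear-kernel SVM, such a ranking can be obtained directly from the learned coefficients; the sketch below shows one way to do it, with `X_train`, `y_train` and `vectorizer` assumed from the earlier sketches (most negative weights point towards the benign class, most positive towards the malicious class).

```python
import numpy as np
from sklearn.svm import SVC

clf = SVC(kernel="linear").fit(X_train, y_train)  # linear kernel exposes coef_
coef = clf.coef_                                  # sparse on sparse input
weights = coef.toarray().ravel() if hasattr(coef, "toarray") else coef.ravel()
names = vectorizer.get_feature_names_out()

order = np.argsort(weights)
top_benign = [(names[i], weights[i]) for i in order[:20]]      # most negative
top_malicious = [(names[i], weights[i]) for i in order[-20:]]  # most positive
```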
Table 6. Best 20 features determined by one run of the EBBag SVM classifier.

Negative class (Benign) — feature IDs: 7754, 2902, 3960, 8096, 5079, 7117, 5058, 2912, 7449, 7755, 4572, 2947, 6219, 7796, 8479, 8148, 8141, 4124, 6670, 8907. API-call features (unigrams and bigrams): system:ldrgetdllhandle, process:ntterminateprocess, misc:writeconsolea, ole:coinitializeex, system:ntclose, process:ntterminateprocess, system:ldrgetprocedureaddress, synchronisation:getsystemtimeasfiletime, process:ntterminateprocess, registry:ntopenkey, misc:writeconsolea, process:ntallocatevirtualmemory, system:getkeystate, system:ldrgetdllhandle, process:ntunmapviewofsection, process:ntfreevirtualmemory, misc:writeconsolew, misc:writeconsolew, process:ntterminateprocess, registry:regopenkeyexa, ui:loadstringw, system:ldrgetprocedureaddress, file:getfiletype, system:setwindowshookexa, registry:regqueryvalueexw, system:ntclose, registry:ntopenkey, system:ntclose, process:ntfreevirtualmemory, ole:oleinitialize, resource:findresourceexa, ui:loadstringw

Positive class (Malicious) — feature IDs: 6213, 7920, 5709, 97, 2273, 8024, 7917, 812, 6501, 1934, 4171, 5726, 7605, 3723, 1676, 8146, 8045, 7820, 5180, 5576. API-call features (unigrams and bigrams): registry:regopenkeyexa, system:ldrunloaddll, system:ldrloaddll, system:ldrgetprocedureaddress, registry:regclosekey, system:isdebuggerpresent, file:copyfilea, misc:getcursorpos, misc:enumwindows, system:ldrunloaddll, system:ldrgetdllhandle, system:ldrloaddll, system:getsysteminfo, file:getsystemdirectorya, registry:regsetvalueexa, file:searchpathw, ui:loadstringa, process:createprocessinternalw, registry:regcreatekeyexa, system:getsysteminfo, system:ntquerysysteminformation, network:wsastartup, file:ntreadfile, system:ntclose, system:ntclose, process:ntunmapviewofsection, system:lookupprivilegevaluew, file:getvolumenameforvolumemountpointw, system:ldrgetprocedureaddress, process:ntallocatevirtualmemory, process:ntunmapviewofsection, system:ldrunloaddll, registry:ntqueryvaluekey, process:ntfreevirtualmemory
9 Conclusion
This paper provided an in-depth analysis of the underlying causes of malware misclassification when using machine learning-based malware detectors. Such causes need to be determined so that the success rate achieved can be interpreted and the right mitigation adopted. The analysis carried out shows that misclassification is mostly due to changes in several malware variants, without the family membership or the year of discovery being a factor. It also shows that most of the misclassifications were due to samples applying anti-virtualisation techniques, looking for a specific argument to run, or being, in effect, bad data. As machine learning techniques are still considered new techniques for detecting malware, compared to other techniques (e.g., signature matching), only some malware variants were trying to evade detection. However, the situation is expected to change in the future when such techniques become more common. In this paper we also checked whether removing the set of distinct API functions uniquely used by malware from the feature set would increase the misclassification rate. The results show that the overall detection rate of the classifier was not significantly affected by such a technique, which could be used by malware writers to evade detection. Regarding future work, throughout this paper we have only considered the API function calls dynamically extracted from the analysed samples as features, because they offer a good representation of malware behaviour. However, other features, extracted either dynamically or statically, could also be considered. Acknowledgments. We would like to express our sincere gratitude to Professor Peter Tino, Chair of Complex and Adaptive Systems at the University of Birmingham, for his valuable advice and suggestions. We would also like to thank VirusTotal for providing us with access to their intelligence service in addition to a private API.
References

1. Alazab, M., Layton, R., Venkataraman, S., Watters, P.: Malware detection based on structural and behavioural features of API calls (2010)
2. Alruhaily, N., Bordbar, B., Chothia, T.: Towards an understanding of the misclassification rates of machine learning-based malware detection systems. In: Proceedings of the 3rd International Conference on Information Systems Security and Privacy - Volume 1: ICISSP, pp. 101–112 (2017)
3. Bailey, M., Oberheide, J., Andersen, J., Mao, Z.M., Jahanian, F., Nazario, J.: Automated classification and analysis of internet malware. In: Kruegel, C., Lippmann, R., Clark, A. (eds.) RAID 2007. LNCS, vol. 4637, pp. 178–197. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-74320-0_10
4. Bengio, Y., Courville, A., Vincent, P.: Representation learning: a review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1798–1828 (2013)
5. Bifet, A., Gavalda, R.: Learning from time-changing data with adaptive windowing. In: Proceedings of the 2007 SIAM International Conference on Data Mining, pp. 443–448. SIAM (2007)
6. Breiman, L.: Bagging predictors. Mach. Learn. 24(2), 123–140 (1996)
7. Ceron, J.M., Margi, C.B., Granville, L.Z.: MARS: an SDN-based malware analysis solution. In: 2016 IEEE Symposium on Computers and Communication (ISCC), pp. 525–530. IEEE (2016)
8. Chang, E.Y., Li, B., Wu, G., Goh, K.: Statistical learning for effective visual information retrieval. In: ICIP, vol. 3, pp. 609–612. Citeseer (2003)
9. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
10. Cuckoo Sandbox: Automated malware analysis - cuckoo sandbox (2015). http://www.cuckoosandbox.org/
11. Fan, C.I., Hsiao, H.W., Chou, C.H., Tseng, Y.F.: Malware detection systems based on API log data mining. In: 2015 IEEE 39th Annual Computer Software and Applications Conference (COMPSAC), vol. 3, pp. 255–260. IEEE (2015)
12. Faruki, P., Laxmi, V., Gaur, M.S., Vinod, P.: Behavioural detection with API call-grams to identify malicious PE files. In: Proceedings of the First International Conference on Security of Internet of Things, pp. 85–91. ACM (2012)
13. Ferri, C., Hernández-Orallo, J., Modroiu, R.: An experimental comparison of performance measures for classification. Pattern Recogn. Lett. 30(1), 27–38 (2009)
14. Firdausi, I., Lim, C., Erwin, A., Nugroho, A.S.: Analysis of machine learning techniques used in behavior-based malware detection. In: 2010 Second International Conference on Advances in Computing, Control and Telecommunication Technologies (ACT), pp. 201–203. IEEE (2010)
15. Hansen, S.S., Larsen, T.M.T., Stevanovic, M., Pedersen, J.M.: An approach for detection and family classification of malware based on behavioral analysis. In: 2016 International Conference on Computing, Networking and Communications (ICNC), pp. 1–5. IEEE (2016)
16. Hsu, F.H., Wu, M.H., Tso, C.K., Hsu, C.H., Chen, C.W.: Antivirus software shield against antivirus terminators. IEEE Trans. Inf. Forensics Secur. 7(5), 1439–1447 (2012)
17. Huang, J., Ling, C.X.: Using AUC and accuracy in evaluating learning algorithms. IEEE Trans. Knowl. Data Eng. 17(3), 299–310 (2005)
18. Islam, R., Tian, R., Moonsamy, V., Batten, L.: A comparison of the classification of disparate malware collected in different time periods. J. Netw. 7(6), 946–955 (2012)
19. Kang, P., Cho, S.: EUS SVMs: ensemble of under-sampled SVMs for data imbalance problems. In: King, I., Wang, J., Chan, L.-W., Wang, D.L. (eds.) ICONIP 2006. LNCS, vol. 4232, pp. 837–846. Springer, Heidelberg (2006). https://doi.org/10.1007/11893028_93
20. Khoshgoftaar, T.M., Van Hulse, J., Napolitano, A.: Comparing boosting and bagging techniques with noisy and imbalanced data. IEEE Trans. Syst. Man Cybern. Part A: Syst. Hum. 41(3), 552–568 (2011)
21. Klinkenberg, R., Renz, I.: Adaptive information filtering: learning in the presence of concept drifts. In: Learning for Text Categorization, pp. 33–40 (1998)
22. Kotsiantis, S.B., Zaharakis, I., Pintelas, P.: Supervised machine learning: a review of classification techniques (2007)
23. Kruczkowski, M., Szynkiewicz, E.N.: Support vector machine for malware analysis and classification. In: Proceedings of the 2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT) - Volume 02, pp. 415–420. IEEE Computer Society (2014)
24. Lin, W.J., Chen, J.J.: Class-imbalanced classifiers for high-dimensional data. Briefings in Bioinformatics, bbs006 (2012)
25. Lu, Y.B., Din, S.C., Zheng, C.F., Gao, B.J.: Using multi-feature and classifier ensembles to improve malware detection. J. CCIT 39(2), 57–72 (2010)
26. Maxwell, K.: Mwcrawler (2012). https://github.com/0day1day/mwcrawler
27. Maxwell, K.: Maltrieve (2015). https://github.com/technoskald/maltrieve
28. Miao, Q., Liu, J., Cao, Y., Song, J.: Malware detection using bilayer behavior abstraction and improved one-class support vector machines. Int. J. Inf. Secur. 15, 1–19 (2015)
29. Microsoft: Microsoft security intelligence report (SIR) (2015). http://www.microsoft.com/security/sir/default.aspx
30. Moser, A., Kruegel, C., Kirda, E.: Limits of static analysis for malware detection. In: Twenty-Third Annual Computer Security Applications Conference, ACSAC 2007, pp. 421–430. IEEE (2007)
31. Moskovitch, R., Feher, C., Elovici, Y.: Unknown malcode detection—a chronological evaluation. In: IEEE International Conference on Intelligence and Security Informatics, ISI 2008, pp. 267–268. IEEE (2008)
32. Offensivecomputing: Open malware (2015). http://www.offensivecomputing.net
33. Peiravian, N., Zhu, X.: Machine learning for android malware detection using permission and API calls. In: 2013 IEEE 25th International Conference on Tools with Artificial Intelligence, pp. 300–305. IEEE (2013)
34. Pektaş, A., Acarman, T., Falcone, Y., Fernandez, J.C.: Runtime-behavior based malware classification using online machine learning. In: 2015 World Congress on Internet Security (WorldCIS), pp. 166–171. IEEE (2015)
35. Pirscoveanu, R.S., Hansen, S.S., Larsen, T.M., Stevanovic, M., Pedersen, J.M., Czech, A.: Analysis of malware behavior: type classification using machine learning. In: 2015 International Conference on Cyber Situational Awareness, Data Analytics and Assessment (CyberSA), pp. 1–7. IEEE (2015)
36. Salehi, Z., Sami, A., Ghiasi, M.: Using feature generation from API calls for malware detection. Comput. Fraud Secur. 2014(9), 9–18 (2014)
37. Sami, A., Yadegari, B., Rahimi, H., Peiravian, N., Hashemi, S., Hamze, A.: Malware detection based on mining API calls. In: Proceedings of the 2010 ACM Symposium on Applied Computing, pp. 1020–1025. ACM (2010)
38. Schick, S.: Security intelligence: Tinba malware watches mouse movements, screen activity to avoid sandbox detection (2016). https://securityintelligence.com/news/tinba-malware-watches-mouse-movements-screen-activity-to-avoid-sandbox-detection/
39. Scikit-learn: Scikit-learn: machine learning in Python, 17 June 2013. http://scikit-learn.org/stable/
40. Shabtai, A., Moskovitch, R., Feher, C., Dolev, S., Elovici, Y.: Detecting unknown malicious code by applying classification techniques on opcode patterns. Secur. Inform. 1(1), 1–22 (2012)
41. Singh, A., Walenstein, A., Lakhotia, A.: Tracking concept drift in malware families. In: Proceedings of the 5th ACM Workshop on Security and Artificial Intelligence, pp. 81–92. ACM (2012)
42. Symantec: Symantec security response - virus naming conventions, 17 June 2013. https://www.symantec.com/security_response/virusnaming.jsp
43. Symantec: W32.Sality!dam, 17 June 2013. https://www.symantec.com/security_response/writeup.jsp?docid=2013-043010-4816-99
44. Symantec: Internet security threat report (2015). http://www.symantec.com/security_response/publications/threatreport.jsp
45. Symantec: A-Z listing of threats & risks (2016). https://www.symantec.com/security_response/landing/azlisting.jsp
46. Symantec: Trojan.gen (2016). https://www.symantec.com/security_response/writeup.jsp?docid=2010-022501-5526-99
47. Tian, R., Islam, R., Batten, L., Versteeg, S.: Differentiating malware from cleanware using behavioural analysis. In: 2010 5th International Conference on Malicious and Unwanted Software (MALWARE), pp. 23–30. IEEE (2010)
48. Veeramani, R., Rai, N.: Windows API based malware detection and framework analysis. In: International Conference on Networks and Cyber Security, vol. 25 (2012)
49. Virusshare: Virusshare.com (2016). https://virusshare.com
50. VirusTotal: Virustotal - free online virus, malware and URL scanner (2015). https://www.virustotal.com/
51. VX Heaven: Vxheaven.org (2016). http://vxheaven.org
52. Walenstein, A., Lakhotia, A.: The software similarity problem in malware analysis. In: Dagstuhl Seminar Proceedings. Schloss Dagstuhl-Leibniz-Zentrum für Informatik (2007)
53. Wang, C., Pang, J., Zhao, R., Liu, X.: Using API sequence and Bayes algorithm to detect suspicious behavior. In: International Conference on Communication Software and Networks, ICCSN 2009, pp. 544–548. IEEE (2009)
54. Xu, J.Y., Sung, A.H., Chavez, P., Mukkamala, S.: Polymorphic malicious executable scanner by API sequence analysis. In: Fourth International Conference on Hybrid Intelligent Systems, HIS 2004, pp. 378–383. IEEE (2004)
55. Yap, B.W., Rani, K.A., Rahman, H.A.A., Fong, S., Khairudin, Z., Abdullah, N.N.: An application of oversampling, undersampling, bagging and boosting in handling imbalanced datasets. In: Herawan, T., Deris, M.M., Abawajy, J. (eds.) Proceedings of the First International Conference on Advanced Data and Information Engineering (DaEng-2013). LNEE, vol. 285, pp. 13–22. Springer, Singapore (2014). https://doi.org/10.1007/978-981-4585-18-7_2
56. Ye, Y., Chen, L., Wang, D., Li, T., Jiang, Q., Zhao, M.: SBMDS: an interpretable string based malware detection system using SVM ensemble with bagging. J. Comput. Virol. 5(4), 283–293 (2009)
57. Ye, Y., Li, T., Huang, K., Jiang, Q., Chen, Y.: Hierarchical associative classifier (HAC) for malware detection from the large and imbalanced gray list. J. Intell. Inf. Syst. 35(1), 1–20 (2010)
58. Ye, Y., Wang, D., Li, T., Ye, D., Jiang, Q.: An intelligent PE-malware detection system based on association mining. J. Comput. Virol. 4(4), 323–334 (2008)
59. Zhang, B.Y., Yin, J.P., Hao, J.B., Zhang, D.X.: Using support vector machine to detect unknown computer viruses. Int. J. Comput. Intell. Res. 2(1), 100–104 (2006)
60. Zhang, B., Yin, J., Tang, W., Hao, J., Zhang, D.: Unknown malicious codes detection based on rough set theory and support vector machine. In: The 2006 IEEE International Joint Conference on Neural Network Proceedings, pp. 2583–2587. IEEE (2006)
61. Zhao, H., Xu, M., Zheng, N., Yao, J., Ho, Q.: Malicious executables classification based on behavioral factor analysis. In: International Conference on e-Education, e-Business, e-Management, and e-Learning, IC4E 2010, pp. 502–506. IEEE (2010)
Situation-Aware Access Control for Industrie 4.0

Marc Hüffmeyer1(B), Pascal Hirmer2, Bernhard Mitschang2, Ulf Schreier1, and Matthias Wieland2

1 Hochschule Furtwangen University, Robert-Gerwig-Platz 1, Furtwangen im Schwarzwald, Germany {marc.hueffmeyer,ulf.schreier}@hs-furtwangen.de
2 Universität Stuttgart, Universitätsstraße 38, Stuttgart, Germany {hirmer,mitschang,wieland}@ipvs.uni-stuttgart.de
Abstract. In recent years, the Internet of Things has emerged as a new paradigm that enables new applications, such as Smart Factories, Smart Homes, and Smart Cities. In these applications, privacy and security are important issues, especially regarding access to sensors and actuators. Sometimes, this access should only be permitted if a certain situation occurs, e.g., access to a camera should only be allowed in an exceptional situation. In this paper, we enable situation-based access control for sensitive components in the Internet of Things, focusing on Industrie 4.0. To realize this, we combine an attribute-based access control system with a situation recognition system to create a highly flexible, well performing, and situation-aware access control system. This access control system is capable of automatically granting or prohibiting access depending on situation occurrences and other dynamic or static security attributes.

Keywords: Authorization · Attribute-based access control · Situation-awareness · REST · Internet of Things
1 Introduction
The emerging Internet of Things (IoT) enables many benefits through the introduction of smart applications (e.g., Smart Factories [1]). However, with these benefits, security risks increase. In particular, control of sensors and actuators should only be possible for authorized entities (persons, software, etc.). Whether an entity is authorized could depend on its current context or situation, respectively. To realize situation-based access control to sensors and actuators, this article describes a situation-aware access control system to protect various kinds of RESTful services. The system consists of two components: a situation recognition system and an access control mechanism. The situation recognition system is equipped with multiple sensors and actuators, and based on the sensors' values, situations can be recognized. For example, the situation "production machine overheated" can be detected once temperature sensors reach a certain threshold. SitOPT [2–4] is an example of such a situation recognition system.
Attribute Based Access Control (ABAC) is an access control model that enables specifying rich variations of access rules. The main idea of ABAC is that any property of an entity can be used to determine access. In [5], we introduced RestACL, an access control language based on the ABAC model, which enables access control for RESTful services. The challenge in building a situation-aware access control system is the integration of these heterogeneous systems into one efficient and stable system. We present the described concepts in the context of a case scenario from the Smart Factory domain, also referred to as Industrie 4.0. This article is based on an earlier paper [6]. We evaluated the system in an Industrie 4.0 scenario, as announced as part of the future work in the earlier paper. Furthermore, we have included a detailed description of the services of the situation-aware system and of how these services can be employed to secure access to critical services and infrastructure. The remainder of this paper is organized as follows: Sect. 2 introduces a motivating scenario in the domain of Industrie 4.0. Related work is discussed in Sect. 3. In Sect. 4, we describe the fundamentals of the situation recognition system (SitOPT) and the access control system (RestACL). In Sect. 5, an architecture, its components, and an API are introduced that enable situation-aware access control based on these systems. Situation registration and access procedures are discussed in detail in Sect. 6. Evaluation results are presented in Sect. 7. A summary and an outlook on our future work are given in Sect. 8.
2 Motivating Scenario in Industrie 4.0
This section introduces a motivating scenario in the domain of Industrie 4.0 (based on [7]). In fast, autonomous production environments, not only are real-time requirements essential, but security issues also need to be coped with. Only authorized persons or software components should be allowed to access typical resources, such as sensors, actuators, production process information, and so on. Unauthorized access could lead to the loss of company secrets or, even worse, sabotage of the production processes. Usually, whether an entity is allowed to access a specific resource depends on its current context, i.e., the situation of the entity. For example, a maintenance engineer who remotely maintains the production environment should only be allowed to access resources if a problem occurs. Other entities should be denied access at all times. In our scenario, a production robot of a car manufacturer uses a laser to cut metal parts of the car shell into the form required for further production steps. To realize this, the metal parts are picked up from a pile, inserted into a frame, cut by the robot, and then put onto a conveyor belt, which transports them to the next production step. When processing a metal part, multiple situations could occur: (1) the metal part could be wrongly placed into the frame, which leads to it being cut wrongly, (2) the laser could be overheated and unable to properly cut the part, or (3) the part is stuck in the frame and cannot be removed after cutting. These situations can be automatically recognized using situation recognition systems such as SitRS.
Fig. 1. Scenario for situation-aware access control [6].
In order to verify a correct recognition, a remote maintenance engineer responsible for this machine should get access to different resources to recognize the specific error and initiate actions to solve it. These resources comprise a camera, which can be used to see whether the part is correctly placed in the frame or whether it is stuck, and temperature sensors, which indicate whether the laser is overheated. Once the error is found, responsible persons can be contacted or the error can even be solved in an automated manner. Furthermore, the chief engineer is allowed to access the resources remotely at all times, independent of the current situation. Figure 1 depicts this scenario. Access to highly sensitive resources should only be granted if one of the error situations occurs. Especially access to the camera is highly critical in production environments. There are three different cases that target remote access to this camera: (1) the chief engineer should have access to the camera at all times, independent of the situation, (2) the remote maintenance engineer should only have access when an error situation occurs, and (3) in contrast to the first two cases, unauthorized persons must be permanently locked out. Imagine an unauthorized person having access to the video camera: this could lead to the loss of company secrets. Therefore, it is crucial to restrict access to the data produced by the camera. This scenario is used as the basis to explain the approach for situation-aware access control introduced in this paper.
3 Related Work
Situation-aware access control is a topic that has only rarely been discussed. However, there are a few related publications.
A specialized situation-aware access control model is described in [8,9]. In this work, a health-care oriented model is presented. The authors describe a situation schema that is used to compute an access decision. The schema contains, for example, Patients, Data-Requesters or Health Records. Access is granted depending on several contextual factors rather than only depending on the role of the Data-Requester. The authors describe a specialized solution for medical environments, which extends the role-based access control model to include a medical context. Hence, this work presents a specialized solution rather than a generic approach for situation-aware access control. Situation-aware access control for autonomous decentralized systems is described in [10]. In this approach, a role-based access control model is used to enforce access control in dynamic and large-scale information systems. The authors extend the hierarchical role-based model described in [11]. They introduce situation constraints that are applied to the relation between Users and Roles as well as the relations between Roles and Permissions. However, role-based access control requires careful role engineering and might lead to role over-engineering if a great diversity of access conditions must be expressed [12,13]. The eXtensible Access Control Markup Language (XACML) is an OASIS standard that describes one way to apply ABAC. XACML policies are arranged in tree structures, and XACML engines have to traverse multiple branches of that tree for every access request. XACML can be seen as a compositional approach that computes the union of several access rules that can be applied to a request [14]. Glombiewski et al. [15] present an approach similar to the situation recognition system of SitOPT by integrating context from a wide range of sources for situation recognition. However, they do not provide any abstraction for situation modeling, i.e., situations need to be defined using complex CEP queries. This proves difficult, especially for domain experts who do not have extensive knowledge in that area. Furthermore, the authors of [16] propose using CEP along with a dynamic enrichment of sensor data in order to realize situation-awareness. In this approach, the situations of interest are directly defined in the CEP engine, i.e., the user formulates the situations of interest using CEP query languages [16]. A dynamic enrichment component processes and enriches the sensor data before the CEP engine evaluates them against the situations of interest. In SitOPT, we provide an abstraction through Situation Templates [17] and a graphical interface [2] for situation modeling. Many similar approaches exist that use ontologies for situation recognition [18]. However, these approaches are either focused on specific use case scenarios [19] or cannot provide the efficiency required by real-time critical scenarios [18,20], e.g., in the Industrie 4.0 scenario, where situations need to be recognized in a timely manner. These limitations regarding efficiency occur in machine learning approaches [21], too. In contrast, the SitOPT approach offers high efficiency by recognizing situations in milliseconds [2] instead of seconds or even minutes, as reported in [18,20]. This enables applicability in time-critical real-world scenarios in which recognition times are of vital importance.
4 Foundations
This section provides an overview of the SitOPT and RestACL approaches.
4.1 SitOPT
As depicted in Fig. 2, the SitOPT approach [3,4] offers several components for situation recognition. The two main components are the Situation Recognition Service (SitRS) and the Resource Management Platform (RMP). The SitRS is capable of detecting situations based on so-called Situation Templates – a model for defining the conditions under which situations occur. More precisely, Situation Templates connect sensor values with conditions, which are then aggregated using logical operations such as AND, OR, or XOR. The root node of a Situation Template is the actual situation to be defined. The RMP serves as a gateway to the sensors: it offers several adapters to bind different kinds of sensors and, furthermore, provides a uniform RESTful interface to access sensor data. The binding of devices equipped with sensors is conducted as described in [22,23].
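To make this more concrete, the following listing sketches how a Situation Template for the robot error situation from Sect. 2 might be rendered. Note that this JSON rendering and its field names are purely illustrative – the actual template format is defined in [17]:

{
  "situation" : "robotError",
  "operation" : "OR",
  "conditions" : [
    { "sensor" : "thermometer", "condition" : "greaterThan", "value" : 90 },
    { "sensor" : "pressure1", "condition" : "outsideRange", "value" : [10, 20] }
  ]
}

The root element corresponds to the situation to be detected, while the conditions reference the sensors bound via the RMP.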
Fig. 2. Architecture of SitOPT [6].
A registered device can have one or more owners. Initially, only the owners have access to these devices. Once devices are bound, applications or users can register for situations to get notified of their occurrence. To realize this, first, a Situation Template has to be modeled that defines the situation's conditions. After that, it is used as input for the Situation Recognition Service, which transforms it into an executable representation (e.g., a CEP query) and executes it in a corresponding engine. On execution, situations are continuously monitored, and registered applications or users are notified as soon as they occur. SitOPT offers two interfaces to upper-level applications. The Situation API allows registration for situations to get notified of their occurrence. The Device API allows registration of new devices to be monitored by the SitRS. In case a situation occurs, an alert is raised to all callbacks and the situation is written into a Situation Database. The Situation Database (SitDB) is a long-term storage for situation occurrences.
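As an illustration, a registration at the Situation API might look as follows. This listing is only a sketch – the payload fields shown here are assumed for illustration and are not prescribed by SitOPT:

POST /situations HTTP/1.1
{
  "situationTemplate" : "robotError",
  "monitoredDevices" : ["/devices/1234"],
  "callbacks" : ["https://maintenance.example.org/alerts"]
}

Every subsequent occurrence of the situation then triggers a notification to the given callback and an entry in the SitDB.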
4.2 RestACL
Access control commonly answers the question of which subjects may perform what actions on what objects [24]. ABAC replaces directly referenced entities like subjects or objects by attributes. ABAC mechanisms use so-called categories to map attributes to entity types. For example, an attribute name can be mapped to a subject type.

Definition (Attribute). We define an attribute as a triple $a := (c, d, v)$ consisting of a category $c$, a designator $d$ and a value $v$. Category, designator and value all have types like integers or character sequences. Further, we define $A$ as the set of all possible attributes.

Note that attributes are related to dedicated entities (e.g., an attribute name with the value bob is related to a human (the entity) with the name bob). Therefore, an entity is a set of attributes.

Definition (Entity). We define an entity $e := \{a_1, \ldots, a_n\}$ as the set of all attributes $\{a_1, \ldots, a_n\}$ with $n \in \mathbb{N}$ belonging to the same category.

For example, a policy might require that "the subject name must be equal to bob". An attribute $a_{bob} = (subject, name, bob)$ can then be used in an attribute condition to check whether any attribute $a_x$ of the set of attributes of an entity matches the attribute condition $ac(equal, a_{bob})$.

Definition (Attribute Condition). We define an attribute condition as a function $ac: B \times A^n \rightarrow \{true, false\}$ with $B$ being the set of boolean functions.

A policy can be interpreted as a logical concatenation of attribute conditions. For example, a policy might declare two attribute conditions and create a logical conjunction between them: "if the attribute conditions $ac_1$ and $ac_2$ are fulfilled, access is granted". During the evaluation of an access request, a set of attributes from the request is compared against this logical concatenation of attribute conditions.

Definition (Policy). We define a policy $p: \mathcal{P}(AC) \rightarrow E$ as a function from the power set of attribute conditions $\mathcal{P}(AC)$ to the set of effects $E$ with $E := \{Permit, Deny, NotApplicable\}$.

RestACL is an ABAC mechanism that targets authorization for RESTful services. Figure 3 shows the components of a RestACL access control system. Access logic is determined in policies located in a Policy Repository. The identification of policies is done using so-called Domains that describe mappings between the resources $R$ of a RESTful service and policies.

Definition (Domain). We define a domain $d: R \rightarrow \mathcal{P}(P)$ as a function from the set of resources $R$ to the power set of policies $\mathcal{P}(P)$ with $P$ being the set of all policies.

An Evaluation Engine computes access decisions based on access requests derived from resource requests. The attributes that are required to perform the decision computation are delivered by one or more Attribute Providers. There are different types of Attribute Providers. For example, one Attribute Provider may deliver all attributes of an entity if a unique identifier is provided.
Fig. 3. Architecture of RestACL [6].
Another Attribute Provider may deliver actual context data like the current time. Once the attributes are given, the Evaluation Engine uses two functions to compute an access decision:

$d: R \rightarrow \mathcal{P}(P)$   (1)

$f: \mathcal{P}(A) \times \mathcal{P}(P) \rightarrow E$   (2)

$d$ is the domain function; it identifies the policies that need to be evaluated and takes the address of the requested resource as input. $f$ computes the access decision based on a set of attributes and a set of policies (identified using $d$). Policies have priorities, and the applicable policy with the highest priority determines the final access decision. Further details about RestACL can be found in [5]. In [25], it is shown that the separation of policy identification (1) from the actual access control logic (2) helps to create a much more scalable access control system.
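To illustrate this two-step evaluation with the policies defined later in Sects. 6.2 and 6.3 (this concrete mapping is only an assumed example, not part of the RestACL specification): a request for a camera resource would first be resolved by the domain function, e.g. $d(/camera) = \{P1, PError\}$, and the evaluation function would then combine the attributes of the request with these policies, e.g. $f(\{a_1, \ldots, a_5\}, \{P1, PError\}) = Permit$, where the applicable policy with the highest priority determines the result.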
5 Situation-Aware Access Control
To ensure a fast and appropriate reaction to dedicated situations, subjects – such as the chief engineer from the example scenario – can register situations based on Situation Templates and get notified of their occurrence. On registration, those subjects have to provide callback information that defines which subject is automatically informed once a situation occurs. As described in Sect. 2, a robot in the car production process might be equipped with several sensors like a thermometer or pressure sensors. In terms of the situation recognition system, the robot is a device, while the thermometer is a sensor. Through the aggregation of sensor data, it can be detected whether an error situation has occurred. On detection, the situation recognition system raises an alert to all registered subjects. For example, the robot might be equipped with multiple pressure sensors that check whether the metal parts are cut correctly and placed in the correct position. If one or more of the sensors reports values outside a dedicated range, the metal part was not cut or placed correctly. The robot might then automatically try to position
the metal part correctly, but if that fails, an error situation has occurred while cutting or placing the part. In rare situations, the pressure sensor might have a defect and report faulty values that lead to an alert even though the error situation did not actually occur. This can be classified as a false-positive situation detection. In order to double-check whether an error situation occurred, the maintenance engineer requires remote access to all the sensor values and the camera mentioned in the scenario (cf. Fig. 1). If the camera shows that the metal parts are cut correctly without any fraying, the robot works fine and there might only be a problem with the pressure sensor. The maintenance engineer can then decide whether the robot can keep on producing the metal parts or whether a shutdown is required. To enable this, dedicated services or sensor values, such as remotely accessible cameras, must be accessible to a limited group of people, depending on the current situation.
5.1 Requirements
An access control system for such a scenario must be very efficient and flexible. That means the system must be capable of determining access decisions in a short time to ensure that a service request provides up-to-date data. This must be guaranteed even for large numbers of services like cameras or any other web services. Furthermore, the system must be flexible in terms of adding and removing services, which means that the system must easily support the creation, change, and removal of services as well as of the policies that restrict access to these services and the produced data. Moreover, the access control system must be capable of expressing rich variations of access policies in order to embody multiple access control requirements. We identified four requirements a situation-aware access control system for RESTful services must fulfill. If these requirements are met, the benefits of REST are not violated and situation-aware access control can be implemented in an efficient manner.

(R1) Request-Based Authorization: If authorization is not done on a per-request basis, for example, if access is granted or prohibited using tokens or along sessions, this might lead to access decisions that do not conform to the actual security policy. Imagine a resource changes its state between two access requests from the same subject in such a manner that the second request is denied while the first access was granted. Especially in a situation-aware context, the access rights might change if a dedicated situation occurs. If the second request is not evaluated independently of the first one, a (temporarily) unauthorized access might be the result, or access might be prohibited even though the actual context would grant access. Therefore, authorization has to be done on a per-request basis from the client perspective.

(R2) Quick Reaction: The states of devices, sensors and other resources, including their access rights, can change frequently depending on the contextual environment (the situation). Hence, the access control system must be capable of reacting quickly to those changes. Imagine a device or a sensor changes its state (e.g., a sensor value changes) and an access request is performed immediately
after the state change. The system must respond with an access decision that corresponds to the new state of the resource. Therefore, the system must support a tight integration of the situational context. In addition, a fast application of changes to the security policy and the given attributes must be guaranteed.

(R3) Expressive Strength: A situation-aware access control system must enable high flexibility and fine-grained access control. As we can see from the Industrie 4.0 example, situation-dependent and situation-independent policies might coexist in the same context. Both types of policies must be supported by the access control system. Therefore, the access control system must support the application of rich variations of access rules based on various attributes of situation, subject, resource, environment or similar entities.

(R4) Integration and Administration: The situation-aware access control system must support a tight integration of new devices, sensors, and Situation Templates as well as flexible administration of existing ones. Therefore, a carefully chosen set of initial policies and attributes must be created during the registration process. In addition, a highly structured and well-designed administration interface is required that enables access to management actions for humans as well as for automated systems.
5.2 The Situation Category
Because ABAC is an ideal candidate for a flexible access control model, a situation-aware access control system can rely on the ABAC model. In situation-aware access control, a situation becomes its own category, similar to subjects or resources. Entities of that category have dedicated attributes. For example, the SitOPT system notifies the access control system in case a situation has occurred and also in case a situation is no longer occurring. Therefore, a situation has occurred and time attributes that indicate whether the situation is currently occurring and the time at which the last change to the occurrence happened. These attributes are delivered from the SitRS to the RestACL system. The RestACL system passes the attributes to its Attribute Providers to store them. Because the SitAC system must be capable of granting access for a dedicated period of time after the occurrence of a situation, a situation also requires an accessInterval attribute. For example, during the situation registration process (cf. Sect. 6.3), a subject might create a policy that grants access if a problem situation has occurred within the previous 20 min. The accessInterval attribute can be interpreted as a sort of counter that expires after a dedicated time period (measured from the moment the situation occurred). Once this timer has expired, access is no longer granted or, respectively, prohibited. The accessInterval attribute is initially created during the registration process for a Situation Template but can be manipulated by the subject that registered the Situation Template at any time. Using the occurred attribute makes it possible to revoke access rights before the accessInterval expires. For example, if the SitOPT system sends a notification that the emergency situation is no longer given, the
occurred attribute is set to false and the situation-dependent access is revoked. That means the decision whether a maintenance engineer can access the video camera depends, among others, on the attributes occurred, time, and accessInterval of the situation. For example, the following attributes are given (Table 1).

Table 1. Situation entity example [6].

Att. | Category  | Designator     | Value
a1   | Situation | Id             | 123
a2   | Situation | Occurred       | True
a3   | Situation | Time           | 12:00:00 01.01.2017
a4   | Situation | AccessInterval | 1200000 ms
Note that a situation entity must always have the three attributes occurred, time and accessInterval; otherwise, a situation-aware computation of access decisions is not possible. In addition, the time at which the access request arrives is required, too. The actual time is an attribute of the environment context. For example, if a request arrives at 12:10:00 01.01.2017, the environment time attribute is as follows (Table 2):
Table 2. Environment attribute example [6].

Att. | Category    | Designator | Value
a5   | Environment | Time       | 12:10:00 01.01.2017

Fig. 4. Architecture for situation-aware access control [6].
Note that access not only depends on the presence of dedicated attributes; attribute values can also be related to each other. For example, the environment time attribute must provide a value that is between the situation time and the situation time plus the accessInterval. Access is granted right at the moment the situation occurred (indicated by the situation's time attribute). For example, one can say that

$v_{a_2} = true$   (3)

and

$v_{a_3} < v_{a_5} < v_{a_3} + v_{a_4}$   (4)
must be fulfilled in order to grant access [6]. In Sect. 6.3, a detailed example is given of how these attributes and relations are expressed within a policy.
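As a quick sanity check with the example values from Tables 1 and 2: the situation has occurred ($v_{a_2} = true$), and with $v_{a_3} =$ 12:00:00, $v_{a_4} =$ 1200000 ms (20 min) and $v_{a_5} =$ 12:10:00, condition (4) evaluates to 12:00:00 < 12:10:00 < 12:20:00, which holds, so access is granted. Had the same request arrived at 12:25:00, condition (4) would be violated and access would be denied.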
5.3 Architecture
Figure 4 depicts the components that are required to combine the access control and situation recognition systems into a situation-aware access control system. The architecture is divided into three layers that are built on top of physical and environmental objects in the real world: a service layer, a security layer, and a client layer. Every layer provides RESTful services to the upper layer or, respectively, to the public. The service layer covers the services that need to be protected. For example, in the Industrie 4.0 scenario, the camera service is located in the service layer. In general, the service layer carries any type of RESTful service that requires permanent or situation-dependent protection. Note that the SitRS provides various services to manage situation recognition. For example, these services include sensor and Situation Template registration as well as access to the RESTful services provided by the RMP to access devices. These are located on the service layer as well. As described previously, situations can be registered including contact information (callbacks) that describes which subjects need to be informed in case of a situation occurrence. The SitRS periodically monitors whether the situations have occurred. The security layer is responsible for guaranteeing that only privileged subjects have access to the general services and the SitOPT system. Therefore, Enforcement points inspect every incoming request to the general services as well as to the SitOPT system and consult the RestACL system as to whether the request can be permitted. To do so, an Enforcement point formulates an access request and passes it to the Access API of the RestACL system. The access request contains at least the resource and subject identifiers and optionally a situation identifier. The RestACL system then identifies the policies that need to be evaluated.
During the evaluation process, the RestACL system uses the identifiers from the access request to load a list of attributes for each entity from its Attribute Providers. The actual access decision is computed based on the identified policies and the given attributes. This decision is returned to the Enforcement points. That means the Enforcement points provide the same service interface as their related web services; they only add the execution of access control logic to these services. The Enforcement point either rejects or forwards a request. If a client registers devices, sensors or Situation Templates using the Access/Situation Administration API, the security layer checks whether the client is allowed to perform this action (e.g., register a sensor for a dedicated device). If the client has the permission to perform the action, the RestACL system forwards the request to the SitOPT system. In order to provide situation-aware access decisions, the RestACL system registers itself as a situation callback for every situation. That means the SitRS informs not only dedicated clients about all changes in the occurrence of a situation, but also the RestACL system. The RestACL system then updates the related situation entities stored in its Attribute Providers and, once an access request arrives, the RestACL system can perform a situation-aware policy evaluation.

As depicted in Fig. 4, the situation-aware access control system works as follows: (1) A client registers devices (including sensors) at the Access/Situation Administration API. The RestACL system creates initial attributes for these devices and sensors (cf. Sect. 6). (2) The RestACL system forwards the creation request to the SitOPT system. (3) A client registers a Situation Template at the Access/Situation Administration API. The RestACL system creates a situation entity for this Situation Template. (4) The RestACL system forwards the Situation Template creation request to the SitOPT system and registers itself as a callback. (5) The SitOPT system monitors the devices (respectively their sensors). (6) If the situation occurs, the SitRS informs all callbacks (clients as well as the RestACL system) about the occurrence. (7) A client tries to access a device or another resource like the camera. (8) The related Enforcement point creates an access request and sends it to the RestACL system. (9) Depending on the result, the Enforcement point either forwards or rejects the client's request.

As part of the integration of the two systems, an administration component is required that enables various types of clients to control policies, attributes and their assignments to devices and sensors. For example, human users might want to register devices and manage their access rights through a web application, while other machines might want to register or deregister devices in an automated fashion using basic REST calls. The Access and Situation Administration component creates initial policies, attributes and resource assignments during the registration procedure for devices, sensors or Situation Templates. In addition, this component offers an API for policy, attribute and situation administration. Note that this component is not part of the generic ABAC system but is rather tailored to the situation-aware system. Details about the registration procedure are described in Sect. 6.
5.4 SitAC Services
Since the situation-aware access control system offers multiple RESTful services, a client can perform several actions in such an environment. Firstly, the client can perform situation recognition related operations like the registration of devices, sensors, or situations. Of course, the client may only perform those actions for which the subject is privileged. Secondly, besides these situation recognition related operations, the client can also perform a set of access control related actions. Finally, the client can access the web services, provided that the subject is privileged. The following tables list the different types of operations that can be performed by a client.

Table 3. SitAC situation services: Situation API and Device API for registration.

Action               | API     | Description
Register situation   | Situat. | Clients want to describe situations in which they are called back
Deregister situation | Situat. | Clients want to deregister situations in which they are called back
Register device      | Situat. | Clients want to register devices
Deregister device    | Situat. | Clients want to deregister devices
Register sensor      | Situat. | Clients want to register sensors associated with a device
Deregister sensor    | Situat. | Clients want to deregister dedicated sensors
Access sensor values | Device  | Clients want to access sensors resp. a snapshot of the actual values produced by the sensor
Table 3 lists the services of the Situation type that are offered by the SitOPT system. The security layer intercepts requests to those services and adds the execution of access control logic. If a client wants to perform one of those operations and the access control system permits the execution, the client's request is forwarded to the SitOPT system. Note that the RMP offers additional operations to update sensor values. Those operations are only accessible to the sensors, but not to the public. Therefore, the access control system does not mirror these operations to the public. Since one resource is directly mapped to one sensor and must only be updated by this sensor, there is no need for a fine-grained access control mechanism there. Table 4 lists additional services provided by the security layer. These services are access control related, and requests to them are handled by the RestACL system. Because the system is based on the RestACL language, which is designed to protect RESTful services, the access control system can protect not only the SitOPT system and the web services, but also its own Service API.
Table 4. SitAC access services: Access/Situation Admin API.

Action                    | API    | Description
Register services         | Access | Clients want to register services for which situation-aware access should be granted
Create attributes         | Access | Clients want to create attributes for themselves as well as attributes for the devices, sensors and services that they own
Update, delete attributes | Access | Clients want to update or delete attributes that are used to determine access to devices and sensors
Create policies           | Access | Clients want to create policies that are used to determine access to devices and sensors. The policies contain the actual access logic for dedicated sensors
Update, delete policies   | Access | Clients want to update or delete policies that are used to determine access to devices and sensors
Assign policies           | Access | Clients want to change the mapping between policies and devices or sensors. Note that policies can be assigned to many devices, sensors or services, while attributes are related to exactly one entity
Table 5. SitAC application services: e.g., Camera API and Device API for access.

Action          | API       | Description
Access services | Applicat. | Clients want to access (read, update or delete) services like the camera for which situation-aware access is granted
Finally, Table 5 lists actions of the Application type: application-related operations that enable users to register services/resources that require situation-dependent access protection. Like the SitOPT system, the services of the application are also located at the service layer.
5.5 SitAC Service API
The SitAC services are established as RESTful services. There are three main resources (the situation, service and device lists), including several subresources that implement the services. The situation list is addressed with the path /situations. An HTTP GET access to this resource returns a complete list of the registered Situation Templates, including their ids. A new situation can be registered by sending an HTTP POST request to the situation list, including a new Situation Template. Note that this request also includes a list of devices that are queried
in order to check whether the situation has occurred. The RestACL system checks whether the subject that tries to register the Situation Template is allowed to access the devices and their sensors; only such subjects may register new Situation Templates. If the subject is not allowed to access one of the sensors, the Situation Template is not registered and an HTTP error message is returned. If the request is permitted, a situation entity is created at the Attribute Provider (with several attributes as described in Sect. 5.2). These attributes must be overwritable by the SitOPT system. Therefore, RestACL provides a subresource of the situation that is addressed with the path /situations/{id}/attributes. If the situation occurs, the SitOPT system sends an HTTP PUT request to this resource, which causes the RestACL system to update the attributes. Note that the /situations/{id}/attributes subresource also requires protection against unauthorized access. Therefore, the RestACL system creates a policy that only allows the SitOPT system to access the resources /situations/{id} and /situations/{id}/attributes. The /situations/{id}/access subresource is used to update access policies for the Situation Template. The RestACL system creates a policy that allows only the subject that registered the situation to change this subresource. Therefore, only the registering subject may change who can access the situation resource itself (/situations/{id}), the attributes of the situation (/situations/{id}/attributes), and the access policy for the situation (/situations/{id}/access) (Table 6).

Table 6. SitAC situation API.

Resource                    | Methods     | Description
/situations                 | GET, POST   | Read the situation list or register a new Situation Template
/situations/{id}            | GET, DELETE | Read the actual situation resource. Remove the Situation Template
/situations/{id}/access     | PUT         | Update the access policies for a situation resource
/situations/{id}/attributes | PUT         | Update the list of attributes for a dedicated situation
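For illustration, the callback that reports an occurrence of the situation with id 123 might look as follows. The attribute names mirror those introduced in Sect. 5.2, but their concrete JSON rendering is assumed here for illustration:

PUT /situations/123/attributes HTTP/1.1
{
  "occurred" : "true",
  "time" : "12:00:00 01.01.2017"
}

On receipt, the RestACL system forwards the updated values to its Attribute Provider, so that subsequent access requests are evaluated against the new situation state.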
The service list is addressed with the path /services. Similar to the situation resources, services can be created and removed. Note that each service resource also has /attributes and /access subresources to implement attribute-based access control for the service. As with the situation resources, service resources can initially be accessed only by the registering subject (Table 7). The device list is addressed with the path /devices. Like the situation and service resources, devices can be created and removed, provided the client knows the device id and has the permission to update the corresponding resources. Access to the device can be changed by updating the /access subresource of the device, and its attributes can be updated using the /attributes subresource (Table 8).
Table 7. SitAC application API.

Resource                  | Methods     | Description
/services                 | GET, POST   | Read the service list or register a new service
/services/{id}            | GET, DELETE | Read the actual service data. Remove the service
/services/{id}/access     | PUT         | Update the access policies for a service
/services/{id}/attributes | PUT         | Update the list of attributes for a dedicated service
The device resource differs from the situation and service resources in that it has an additional set of subresources. Each device might have multiple sensors. Sensors can be registered with a device using the sensor list of the device. That means an HTTP POST request must be sent to the /sensors subresource of that device. Note that sensors are identified by a name rather than an id. Like the device itself, the sensor has /access and /attributes subresources that can be used to update access policies and sensor attributes like the actual value or, for example, the location of the sensor (Table 9).

Table 8. SitAC device API.

Resource                 | Methods     | Description
/devices                 | GET, POST   | Read the device list or register a new device
/devices/{id}            | GET, DELETE | Read access to a dedicated device. Remove a dedicated device
/devices/{id}/access     | PUT         | Update the access policies for a dedicated device
/devices/{id}/attributes | PUT         | Update the list of attributes for a dedicated device
Table 9. SitAC sensor API.

Resource                                | Methods     | Description
/devices/{id}/sensors                   | GET, POST   | Read the sensor list or create a new sensor associated with that device
/devices/{id}/sensors/{name}            | GET, DELETE | Read a sensor value or remove the sensor from the sensor list of that device
/devices/{id}/sensors/{name}/access     | PUT         | Update the access policies for a dedicated sensor
/devices/{id}/sensors/{name}/attributes | PUT         | Update the list of attributes for a dedicated sensor
6 Situation-Aware Access Control for the Industrie 4.0 Example
In a resource-oriented environment, there are at least two involved entities: the requested resource and the requesting subject. While the resource can be identified using its URI, the requesting subject must authenticate itself to the access control system.
6.1 Authentication
Authentication can be done using standardized HTTP authentication methods like basic or digest authentication, or any other authentication method. If the subject does not authenticate itself, the RestACL system returns an error message. Note that user name and password do not need special treatment in the access control system; they are regular attributes assigned to a dedicated subject. Once a subject has authenticated itself, the access control system executes the access logic to determine whether the requested operation may be performed. Therefore, the Enforcement point creates an access request containing the address of the requested resource (e.g., /devices/1/sensors/thermometer), the access method (e.g., GET), and a subject identifier (e.g., /users/1). The access control system uses these properties to load various attributes from the Attribute Provider. For example, a subject might have an attribute responsibility with the value maintenance, or a sensor (a resource) might have an attribute type. Given the identifier of the sensor and the identifier of the subject, both can be uniquely identified, and attributes like the previously mentioned ones can be loaded from an Attribute Provider. Note that the access control system can employ multiple Attribute Providers for one request. Even external attribute sources like an external identity management system are possible.
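As an illustration, the access request that the Enforcement point passes to the Access API for the example above might look as follows. The concrete request format is not prescribed by this paper, so the listing is only a sketch:

{
  "resource" : "/devices/1/sensors/thermometer",
  "method" : "GET",
  "subject" : "/users/1"
}

Given these identifiers, the Evaluation Engine loads the attributes of the involved entities from the Attribute Providers and evaluates the policies identified via the Domain.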
6.2 Registration Procedure
If a subject decides to create a new device composition that should be monitored for dedicated situations, the subject must first register all devices and their sensors (cf. Fig. 4 – Message 1). To do so, the subject sends exactly one registration request for each device to the access control system. The request must contain a unique device identifier, a list of device owners and, optionally, a device description. The RestACL system creates a new policy and limits access to the subjects on the owner list. After that, the system creates several new entries for the device within the Domain. Besides an entry for the device itself, the same policy is used to ensure that only an owner can register new sensors or new attributes with that device. Finally, another entry ensures that only the owners can change the access rights for that device. In addition to the Domain entries, an owner list attribute is assigned to the device. Note that a subject can register multiple owners for one device. In such a case, all owners have access to the device and can change the access policies. All these operations are executed
inside the RestACL system, and once the device is created, the system forwards the creation request to the SitOPT system (cf. Fig. 4 – Message 2). For example, the chief engineer from the Industrie 4.0 scenario might want to register the robot at the SitOPT system. The engineer might be identified with the id 1 and therefore addressed using the path /users/1. The registration application sends a POST request to the RestACL system containing a unique device id and an owner list in the payload of the request (cf. Fig. 4 – Message 1). Our implementation expects JSON data as input.

POST /devices HTTP/1.1
{
  "deviceId" : "1234",
  "deviceDescription" : "A robot that cuts metal parts",
  "deviceOwners" : ["/users/1"]
}
If such a request arrives, the RestACL system creates a new access policy (as indicated in the following listing) that grants access to the subjects from the owner list. This policy is stored in the Policy Repository.

{
  "id": "P1",
  "effect": "Permit",
  "priority": "1",
  "condition": {
    "function": "equal",
    "arguments": [
      {"category": "subject", "designator": "uri"},
      {"value": "/users/1"}
    ]
  }
}
In a second step, this policy is assigned to the new device (the robot). Therefore, the RestACL system creates a new Domain entry (as indicated in the following listing) and stores the entry in its Domain database.

{
  "path": "/devices/1234",
  "access": [
    {"methods": ["GET"], "policies": ["P1"]}
  ]
}
As a third step, the same policy is assigned to the /sensors subresource path of the device, too (as indicated in the following listing). The sensors subresource (e.g., /devices/1234/sensors) must be used to register sensors, like the thermometer or the pressure sensors, with the robot.
{ "path": "/devices/1234/sensors", "access": [ {"methods": ["POST"],"policies": ["P1"]} ] }
In a fourth step, the same policy is assigned to the /attributes subresource of the new device (e.g., /devices/1234/attributes). This subresource enables clients to store new attributes or to update attribute values at the Attribute Provider. In case only external Attribute Providers are used, the binding between the attribute registration path and the policy is not required. But since the reference implementation uses its own Attribute Provider, clients need an API to update their own attributes as well as their devices' and sensors' attributes. Access to this API must be limited, too. Therefore, the binding is required. In a fifth step, the policy is assigned to the /access subresource of the new device (e.g., /devices/1234/access). This subresource enables clients to assign new access policies to the device. For example, if the chief engineer wants to grant access to the maintenance engineer in error situations, a new policy for the device must be set using this URI. Lastly, the new device gets an owner attribute. This attribute must be stored together with all other attributes of the device at the Attribute Provider.

[
  {
    "category" : "resource",
    "designator" : "id",
    "value" : "/devices/1234"
  },
  {
    "category" : "resource",
    "designator" : "deviceOwners",
    "value" : ["/users/1"]
  }
]
The service registration and the sensor registration procedures are very similar to the device registration procedure and are therefore not discussed in detail. However, there is one big difference that should be noted: sensors are associated with devices and can therefore only be registered if the access policies of the associated device are fulfilled. Sensor registration is done using a POST request to the sensor list of a device (e.g., /devices/1234/sensors). The RestACL system checks whether access to the sensor list is permitted (checking the policies that have been assigned as described above). Note that during the device registration, an initial assignment is created (step 3) that restricts access to that sensor list to the device owners. Therefore, the owners determine who may register new sensors for that device. Besides the execution of access logic, the
registration process for devices and sensors is the same. That means that for sensors, too, a new policy is created that initially restricts access to the device owners, and this policy is assigned to the sensor resource itself as well as to the /attributes and /access subresources of the sensor.
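For example, registering a thermometer with the robot from above might look as follows. Analogous to the device registration, JSON data is expected, although the field names shown here are illustrative:

POST /devices/1234/sensors HTTP/1.1
{
  "sensorName" : "thermometer",
  "sensorDescription" : "Measures the temperature of the laser"
}

This request is only forwarded to the SitOPT system if the requesting subject fulfills the policies assigned to the sensor list of device 1234.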
6.3 Situation-Aware Access Policy
If a subject wants to restrict access to a dedicated service in a situation-aware fashion, the subject must register a Situation Template. To do so, the subject must have the permission to read all of the sensor values defined in the template; otherwise, the situation cannot be registered. During the registration process, the RestACL system creates a situation entity (and stores it at its Attribute Provider). This entity is used to store situation-aware access information. For example, the entity stores the information whether the situation has occurred and the point in time at which it occurred. This information can be taken into account during the evaluation of an access request. The following RestACL policy is used in the Industrie 4.0 scenario to protect access to the camera service.

{
  "id": "PError",
  "description": "A policy that grants access in case of an error situation. The policy can be applied to the camera resource.",
  "effect": "Permit",
  "priority": "2",
  "compositeCondition": {
    "operation": "AND",
    "conditions": [
      {
        "function": "equal",
        "arguments": [
          {"category": "subject", "designator": "responsibility"},
          {"value": "maintenance"}
        ]
      },
      {
        "function": "equal",
        "arguments": [
          {"category": "situation", "designator": "occurred"},
          {"value": "true"}
        ]
      },
      {
        "function": "between",
        "arguments": [
          {"category": "situation", "designator": "time"},
          {"category": "environment", "designator": "time"},
          {
            "function": "add",
            "arguments": [
              {"category": "situation", "designator": "time"},
              {"category": "situation", "designator": "accessInterval"}
            ]
          }
        ]
      }
    ]
  }
}
The policy grants access in case three conditions are fulfilled: (1) The requesting subject must have a responsibility attribute with the value maintenance, indicating that the subject is a maintenance engineer. Note that reliable attribute sources are required; otherwise, a subject might assign any attribute to itself or other subjects. However, this is an issue with ABAC in general, and the use of trusted Attribute Providers guarantees that attributes are not manipulated. (2) The situation must have occurred. That means the situation recognition service has informed the access control system that the actual sensor value composition can be interpreted as the error situation. (3) The time the request arrived (indicated by the environment time attribute) must be between the time at which the situation occurred (indicated by the situation time attribute) and that time plus the situation's accessInterval attribute value. The second and third conditions require that situation attributes be up-to-date. Therefore, it is crucial that the SitRS informs the RestACL system in case the occurrence of a situation has changed (either the situation occurred attribute value switches from false to true or vice versa). The SitRS is capable of informing one or more callback receivers in case the situation occurs. In order to provide up-to-date situation attributes, the RestACL system registers itself as a callback for each situation. If a situation occurs, the SitRS informs the RestACL system, which forwards the updated situation entity information to the Attribute Provider.
6.4 Situational Service Access
In the scenario described above, standardized REST clients can be used to request service data from regular web services. Those web services are protected by an upstream Enforcement Point. That means the client performs a resource request (e.g., HTTP GET) on the camera service and the Enforcement Point intercepts it (cf. Fig. 4 – Message 7). After a successful authentication procedure, the Enforcement Point executes the access logic by sending an access request to the RestACL system and enforcing the access decision (cf. Fig. 4 – Message 8). The access request is derived from the resource request and enriched with attributes stored at the Attribute Provider. The enforcement is done by either forwarding the resource request to the web service (cf. Fig. 4 – Message 9) or rejecting the initial resource request (e.g., returning an HTTP 403 response). Forwarding or rejection depends on the actual access decision of the RestACL system. This procedure allows every request to be treated individually
in terms of access control. This ensures that requirement (R1) is fulfilled and proper access decisions can be guaranteed in a situation-aware fashion. Note that once registration and configuration are done and a client simply tries to access a resource (like the camera from the Industrie 4.0 example), the client only has to authenticate itself to the access control system. All further steps are performed by the access control system internally, without any client interaction.
7 Evaluation
We used two methods to evaluate our approach: (1) we evaluated whether our system meets the requirements described in Sect. 5.1, and (2) we performed a non-formal verification that the system produces the expected results. To this end, the system has been used to evaluate the Industrie 4.0 scenario. The interested reader is referred to [25] for further performance evaluations of the RestACL system for increasing numbers of resources.

The situation-aware access control system is capable of handling each service/resource request individually. Every single remote request is forwarded to the access control system to ensure that the system cannot be bypassed. Because the system does not employ concepts like tokens, we can ensure that every request is evaluated individually by the access control system, as required in (R1). This suits the stateless communication principles of REST and of the situation-aware context very well.

For validation purposes, both the situation recognition system and the attribute-based access control system were set up on the same computer. Running both systems on the same computer offers the benefit that network runtimes can be eliminated. That means there is no additional delay between the time a situation occurs and the time access is granted. Granting and revoking access permissions can thus be reduced to setting attribute values, which can be performed in less than 1 ms, because it is a very basic operation. This ensures that proper access rights are assigned immediately after a situation occurrence and before a situation-related access request can arrive. Therefore, we consider (R2) fulfilled.

We created a solution that allows the coexistence of situation-dependent and situation-independent access policies. Neither type of policy is limited to a dedicated area, because both can include any type of attribute. This enables a very flexible access control system that can be used in various fields like Industrie 4.0, smart cities, ambient assisted living, or any event-driven environment. Attributes and their categories can be freely selected and logically combined to express rich variations of access rights, as required in (R3).

Section 6 describes the registration procedure in detail. Using this procedure guarantees that devices and sensors are never accessible without any access control inspection, as required by (R4).

In Sect. 1, we mentioned that there are different types of access cases. Access decisions are either permanent or temporary, and they either grant or prohibit access.
Table 10. Access types [6].

Access type        | s1     | s2     | s3     | s4
Permanently grant  | Permit | Permit | Permit | Permit
Permanently forbid | Deny   | Deny   | Deny   | Deny
Temporary grant    | Deny   | Permit | Deny   | Deny
Temporary forbid   | Permit | Deny   | Permit | Permit
To verify this behavior, we conducted several tests. For each access type (permanently granted, permanently forbidden, temporarily granted, temporarily forbidden), it must be guaranteed that the system computes the right access decision before a situation has occurred (s1), after the situation has occurred (s2), after the access interval has expired (s3), and after the situation occurrence has switched back (s4). Table 10 lists the expected access decision for each test case. We successfully verified that the system behaves properly in these test cases.
8 Conclusions and Future Work
This contribution presents a situation-aware access control system for sensitive data in the Internet of Things. The access control system relies on a situation recognition service that analyzes sensor data and checks whether the data match Situation Templates. Once a situation occurs, the situation recognition service informs an attribute-based access control system that stores the actual situation data. The situation data can be evaluated to grant or restrict access to any kind of RESTful service. An architecture is presented that shows how the situation recognition service and the access control system can be integrated. The designed and implemented solution provides situation-aware access control, including request-based authorization, the application of frequent policy and situation changes, and the expressive strength of an attribute-based access control system. Because of its generic design based on the ABAC model, the system can be used in different scenarios. We targeted situation-aware access control in the context of Industrie 4.0, since it is an emerging market that brings new security requirements. But the application context of situation-aware access control is much broader; other interesting areas are Smart Cities, ambient assisted living, or any other event-driven environment. In our future work, we want to ease the registration and administration processes. As we have previously mentioned in the example scenario, experts are required to perform the registration of devices and sensors. This is acceptable in a Smart Factory scenario, where experts configure and operate such a system. In other scenarios, experts cannot be presupposed, and users without computer science knowledge should also be able to use the system. Therefore, a simplification of the configuration is required. This includes the task of policy management, to enable clients to easily manage the access policies of their resources.
Acknowledgments. This work is partially funded by the BMWi project IC4F (01MA17008G).
References
1. Kassner, L., Gröger, C., Königsberger, J., Hoos, E., Kiefer, C., Weber, C., Silcher, S., Mitschang, B.: The Stuttgart IT architecture for manufacturing. In: Hammoudi, S., Maciaszek, L.A., Missikoff, M.M., Camp, O., Cordeiro, J. (eds.) ICEIS 2016. LNBIP, vol. 291, pp. 53–80. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-62386-3_3
2. Franco da Silva, A.C., Hirmer, P., Wieland, M., Mitschang, B.: SitRS XT – towards near real time situation recognition. J. Inf. Data Manag. (2016)
3. Hirmer, P., Wieland, M., Schwarz, H., Mitschang, B., Breitenbücher, U., Leymann, F.: SitRS – a situation recognition service based on modeling and executing situation templates. In: Barzen, J., Khalaf, R., Leymann, F., Mitschang, B. (eds.) Proceedings of the 9th Symposium and Summer School On Service-Oriented Computing. Volume RC25564 of Technical Paper, IBM Research Report (2015)
4. Wieland, M., Schwarz, H., Breitenbücher, U., Leymann, F.: Towards situation-aware adaptive workflows. In: Proceedings of the 13th Annual IEEE International Conference on Pervasive Computing and Communications Workshops: 11th Workshop on Context and Activity Modeling and Recognition. IEEE (2015)
5. Hüffmeyer, M., Schreier, U.: RestACL – an attribute based access control language for RESTful services. In: ABAC 2016 – Proceedings of the 1st Workshop on Attribute Based Access Control (2016)
6. Hüffmeyer, M., Hirmer, P., Mitschang, B., Schreier, U., Wieland, M.: SitAC – a system for situation-aware access control – controlling access to sensor data. In: Mori, P., Furnell, S., Camp, O. (eds.) Proceedings of the 3rd International Conference on Information Systems Security and Privacy, vol. 1, pp. 113–125. SciTePress, Porto (2017)
7. Hoos, E., Hirmer, P., Mitschang, B.: Improving problem resolving on the shop floor by context-aware decision information packages. In: Franch, X., Ralyté, J. (eds.) Proceedings of the CAiSE 2017 Forum, Essen, CEUR Workshop Proceedings, pp. 121–128 (2017)
8. Beimel, D., Peleg, M.: Using OWL and SWRL to represent and reason with situation-based access control policies. Data Knowl. Eng. 70(6), 596–615 (2011)
9. Peleg, M., Beimel, D., Dori, D., Denekamp, Y.: Situation-based access control: privacy management via modeling of patient data access scenarios. J. Biomed. Inform. 41(6), 1028–1040 (2008)
10. Yau, S.S., Yao, Y., Banga, V.: Situation-aware access control for service-oriented autonomous decentralized systems. In: Proceedings of the 2005 International Symposium on Autonomous Decentralized Systems, ISADS 2005 (2005)
11. Ahn, G.J., Sandhu, R.: Role-based authorization constraints specification. ACM Trans. Inf. Syst. Secur. 3(4), 207–226 (2000)
12. Jin, X., Krishnan, R., Sandhu, R.: A unified attribute-based access control model covering DAC, MAC and RBAC. In: Cuppens-Boulahia, N., Cuppens, F., Garcia-Alfaro, J. (eds.) DBSec 2012. LNCS, vol. 7371, pp. 41–55. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-31540-4_4
13. Yuan, E., Tong, J.: Attribute based access control (ABAC) for web services. In: ICWS 2005 – International Conference on Web Services (2005)
14. Kencana Ramli, C.D.P., Nielson, H.R., Nielson, F.: The logic of XACML. In: Arbab, F., Ölveczky, P.C. (eds.) FACS 2011. LNCS, vol. 7253, pp. 205–222. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-35743-5_13
15. Glombiewski, N., Hoßbach, B., Morgen, A., Ritter, F., Seeger, B.: Event processing on your own database. In: BTW Workshops, pp. 33–42 (2013)
16. Hasan, S., Curry, E., Banduk, M., O'Riain, S.: Toward situation awareness for the semantic sensor web: complex event processing with dynamic linked data enrichment. SSN 839, 69–81 (2011)
17. Häussermann, K., Hubig, C., Levi, P., Leymann, F., Simoneit, O., Wieland, M., Zweigle, O.: Understanding and designing situation-aware mobile and ubiquitous computing systems. In: Proceedings of International Conference on Mobile Ubiquitous and Pervasive Computing, pp. 329–339 (2010)
18. Wang, X., Zhang, D.Q., Gu, T., Pung, H.: Ontology based context modeling and reasoning using OWL. In: Proceedings of the Second IEEE Annual Conference on Pervasive Computing and Communications Workshops. IEEE Computer Society (2004)
19. Brumitt, B., Meyers, B., Krumm, J., Kern, A., Shafer, S.: EasyLiving: technologies for intelligent environments. In: Thomas, P., Gellersen, H.-W. (eds.) HUC 2000. LNCS, vol. 1927, pp. 12–29. Springer, Heidelberg (2000). https://doi.org/10.1007/3-540-39959-3_2
20. Dargie, W., Mendez, J., Mobius, C., Rybina, K., Thost, V., Turhan, A.Y., et al.: Situation recognition for service management systems using OWL 2 reasoners. In: Proceedings of the 10th IEEE Workshop on Context Modeling and Reasoning 2013, pp. 31–36. IEEE Computer Society (2013)
21. Attard, J., Scerri, S., Rivera, I., Handschuh, S.: Ontology-based situation recognition for context-aware systems. In: Proceedings of the 9th International Conference on Semantic Systems, pp. 113–120. ACM (2013)
22. Hirmer, P., Wieland, M., Breitenbücher, U., Mitschang, B.: Automated sensor registration, binding and sensor data provisioning. In: Proceedings of the CAiSE 2016 Forum, at the 28th International Conference on Advanced Information Systems Engineering (CAiSE 2016), CEUR Workshop Proceedings, vol. 1612. CEUR-WS.org (2016)
23. Hirmer, P., Wieland, M., Breitenbücher, U., Mitschang, B.: Dynamic ontology-based sensor binding. In: Pokorný, J., Ivanović, M., Thalheim, B., Šaloun, P. (eds.) ADBIS 2016. LNCS, vol. 9809, pp. 323–337. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-44039-2_22
24. Ferraiolo, D., Kuhn, R., Hu, V.: Attribute-based access control. In: Computer, vol. 48. IEEE Computer Society (2015)
25. Hüffmeyer, M., Schreier, U.: Analysis of an access control system for RESTful services. In: Bozzon, A., Cudre-Maroux, P., Pautasso, C. (eds.) ICWE 2016. LNCS, vol. 9671, pp. 373–380. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-38791-8_22
How to Quantify Graph De-anonymization Risks

Wei-Han Lee1(B), Changchang Liu1, Shouling Ji2,3, Prateek Mittal1, and Ruby B. Lee1

1 Princeton University, Princeton, USA
{weihanl,cl12,pmittal,rblee}@princeton.edu
2 Zhejiang University, Hangzhou, China
[email protected]
3 Georgia Tech, Atlanta, USA

Abstract. An increasing amount of data is becoming publicly available over the Internet. These data are released after applying some anonymization techniques. Recently, researchers have paid significant attention to analyzing the risks of publishing privacy-sensitive data. Even if data anonymization techniques were applied to protect privacy-sensitive data, several de-anonymization attacks have been proposed to break their privacy. However, there exists no theoretical quantification that relates the vulnerability of data to de-anonymization attacks and the data utility preserved by the anonymization techniques. In this paper, we first address several fundamental open problems in structure-based de-anonymization research by establishing a formal model for privacy breaches on anonymized data and quantifying the conditions for successful de-anonymization under a general graph model. To the best of our knowledge, this is the first work on quantifying the relationship between anonymized utility and de-anonymization capability. Our quantification works under very general assumptions about the distribution from which the data are drawn, thus providing a theoretical guide for practical de-anonymization/anonymization techniques. Furthermore, we use multiple real-world datasets, including a Facebook dataset, a Collaboration dataset, and two Twitter datasets, to show the limitations of the state-of-the-art de-anonymization attacks. From these experimental results, we demonstrate the ineffectiveness of previous de-anonymization attacks and the potential of more powerful de-anonymization attacks in the future, by comparing the theoretical de-anonymization capability proposed by us with the practical experimental results of the state-of-the-art de-anonymization methods.

Keywords: Structure-based de-anonymization attacks · Anonymization utility · De-anonymization capability · Theoretical bounds
1 Introduction
This paper is an extension of [1], presented at ICISSP 2017. In this paper, we provide complete proofs of our theorems and use a broad range of real-world datasets
to show the limitations of the existing de-anonymization attacks and the gap between theory and experiments. Individual users' data such as social relationships, medical records and mobility traces are becoming increasingly important for application developers and data-mining researchers. These data usually contain sensitive and private information about users. Therefore, several data anonymization techniques have been proposed to protect users' privacy [2–4]. Privacy-sensitive data that are closely related to individual behavior usually contain rich graph-structural characteristics. For instance, social network data can be modeled as graphs in a straightforward manner. Mobility traces can also be modeled as graph topologies according to [5]. Many people nowadays have accounts on various social networks such as Facebook, Twitter, Google+, Myspace and Flickr. Therefore, even equipped with advanced anonymization techniques, the privacy of structural data still suffers from de-anonymization attacks, assuming that the adversaries have access to rich auxiliary information from other channels [5–11]. For instance, Narayanan et al. [8] effectively de-anonymized a Twitter dataset by utilizing a Flickr dataset as auxiliary information, based on the inherent cross-site correlations. Nilizadeh et al. [10] exploited the community structure of graphs to de-anonymize social networks. Furthermore, Srivatsa et al. [5] proposed to de-anonymize a set of location traces based on a social network. However, to the best of our knowledge, there is no work on theoretically quantifying how well data anonymization techniques defend against de-anonymization attacks. In this paper, we aim to theoretically analyze de-anonymization attacks in order to provide effective guidelines for evaluating the threat of future de-anonymization attacks. We aim to rigorously evaluate the vulnerabilities of existing anonymization techniques. For an anonymization approach, not only should the users' sensitive information be protected, but the anonymized data should also remain useful for applications, i.e., the anonymized utility should be guaranteed. Under what range of anonymized utility, then, is it possible for the privacy of an individual to be broken? We quantify the vulnerabilities of existing anonymization techniques and establish the inherent relationships between the application-specific anonymized utility and the de-anonymization capability. Our quantification not only provides theoretical foundations for existing de-anonymization attacks, but can also serve as a guide for designing new de-anonymization and anonymization schemes. For example, the comparison between the theoretical de-anonymization capability and the practical experimental results of current de-anonymization attacks demonstrates the ineffectiveness of existing de-anonymization attacks. Overall, we make the following contributions:
– We theoretically analyze the performance of structure-based de-anonymization attacks by formally quantifying the vulnerabilities of anonymization techniques. Furthermore, we rigorously quantify the relationships between the de-anonymization capability and the utility of anonymized data, which is the first such attempt to the best of our knowledge. Our quantification provides
theoretical foundations for existing structure-based de-anonymization attacks, and can also serve as a guideline for evaluating the effectiveness of new de-anonymization and anonymization schemes through comparing their corresponding de-anonymization performance with our derived theoretical bounds.

– To demonstrate the ineffectiveness of existing de-anonymization attacks, we implemented these attacks on multiple real-world datasets, including the Facebook, Collaboration, and Twitter datasets. Experimental results show that previous methods are not robust to data perturbations and that there is a significant gap between their de-anonymization performance and our derived theoretically achievable de-anonymization capability. This analysis further demonstrates the potential of developing more powerful de-anonymization attacks in the future.
2 Related Work

2.1 Challenges for Anonymization Techniques
Privacy preservation on structural data has been studied extensively. The naive method is to remove users' personal identities (e.g., names, social security numbers), which, unfortunately, is rather vulnerable to structure-based de-anonymization attacks [3,5–10,12–15]. An advanced mechanism, k-anonymity, was proposed in [12], which obfuscates the attributes of users so that each user is indistinguishable from at least k − 1 other users. Although k-anonymity has been well adopted, it still suffers from severe privacy problems due to the lack of diversity with respect to the sensitive attributes, as stated in [16]. Differential privacy [17,18] is a popular privacy metric that statistically minimizes the privacy leakage. Sala et al. in [19] proposed to share a graph in a differentially private manner. However, to enable the applicability of such an anonymized graph, the differential privacy parameter should not be large, which would thus make their method ineffective in defending against structure-based de-anonymization attacks [9]. Hay et al. in [2] proposed a perturbation algorithm that applies a sequence of r random edge deletions followed by r other random edge insertions. However, their method also suffers from structure-based de-anonymization attacks, as shown in [10]. In summary, existing anonymization techniques are subject to two intrinsic limitations: (1) they are not scalable and thus would fail on high-dimensional datasets; (2) they are susceptible to adversaries that leverage the rich amount of auxiliary information to achieve structure-based de-anonymization attacks.
2.2 De-anonymization Techniques
Structure-based de-anonymization was first introduced in [6], where both active and passive attacks were discussed. However, the limitation of scalability reduces the effectiveness of both attacks.
Narayanan et al. in [7] utilized the Internet movie database as the source of background knowledge to successfully identify users' Netflix records, uncovering their political preferences and other potentially sensitive information. In [8], Narayanan et al. further de-anonymized a Twitter dataset using a Flickr dataset as auxiliary information. They proposed the popular seed identification and mapping propagation process for de-anonymization. In order to obtain the seeds, they assume that the attacker has access to a small number of members of the target network and can determine if these members are also present in the auxiliary network (e.g., by matching user names and other contextual information). Srivatsa et al. in [5] captured WiFi hotspot traces and constructed a contact graph by connecting users who are likely to utilize the same WiFi hotspot for a long time. Based on the fact that friends (or people with other social relationships) are likely to appear in the same location, they showed how mobility traces can be de-anonymized using an auxiliary social network. However, their de-anonymization approach is rather time-consuming and may be computationally infeasible for real-world applications. In [13,14], Sharad et al. studied de-anonymization attacks on ego graphs with a graph radius of one or two, and they only studied the linkage of nodes with degree greater than five. As shown in previous work [9], nodes with degree less than five cannot be ignored, since they form a large portion of the original real-world data. Recently, Nilizadeh et al. [10] proposed a community-enhanced de-anonymization scheme for social networks. The community-level de-anonymization is first implemented to find more seed information, which is then leveraged to improve the overall de-anonymization performance. Their method may, however, suffer from the serious inconsistency problem of community detection algorithms [20].

Most de-anonymization attacks are based on the seed-identification scheme, which relies either on the adversary's prior knowledge or on a seed mapping process. Limited work has been proposed that requires no prior seed knowledge by the adversary [9,21]. In [21], Pedarsani et al. proposed a Bayesian-inference approach for de-anonymization. However, their method is limited to de-anonymizing sparse graphs. Ji et al. in [9] proposed a cold-start optimization-based de-anonymization attack. However, they only utilized very limited structural information (degree, neighborhood, top-K reference distance and sampling closeness centrality) of the graph topologies. Ji et al. further made a detailed comparison of the performance of existing de-anonymization techniques in [22].
2.3 Theoretical Work for De-anonymization
Beyond these empirical de-anonymization methods, limited research has provided theoretical analysis for such attacks. Pedarsani et al. in [4] conducted a preliminary analysis for quantifying the privacy of an anonymized graph according to the ER graph model [23]. However, their network model (the ER model) may not be realistic, since the degree distribution of the ER model (which follows a Poisson distribution) is quite different from the degree distributions of most observed real-world structural data [24,25].
Ji et al. in [9] further considered a configuration model to quantify perfect de-anonymization and (1 − ε)-perfect de-anonymization. However, their configuration model is also not general enough for many real-world data structures. Furthermore, their assumption that the anonymized and the auxiliary graphs are sampled from a conceptual graph is not practical, since only edge deletions from the conceptual graph have been considered. In reality, edge insertions should also be taken into consideration. Besides, neither [4] nor [9] formally analyzed the relationships between the de-anonymization capability and the anonymization performance (e.g., the utility performance of the anonymization schemes). Note that our theoretical analysis in Sect. 4 takes the application-specific utility definition into consideration. Such non-linear utility analysis makes the incorporation of edge insertions into our quantification rather nontrivial. Furthermore, our theoretical quantification does not make any restrictive assumptions about the graph model. Therefore, our theoretical analysis provides an important guide for relating the de-anonymization capability and the application-specific anonymizing utility.

Further studies on de-anonymization attacks can be found in [26–28]. These papers provide theoretically guaranteed performance bounds for their de-anonymization algorithms. However, their derived performance bounds can only be guaranteed under restricted assumptions on the random graph, such as the ER model and the power-law model. We will show the advantage of our analysis over these approaches: unlike them, our analysis requires no assumptions or constraints on the graph model.
3 System Model
We model the structural data (e.g., social networks, mobility traces, etc.) as a graph, where the nodes represent users who are connected by certain relationships (social relationships, mobility contacts, etc.). The anonymized graph can be modeled as Ga = (Va, Ea), where Va = {i | i is an anonymized node} is the set of users and Ea = {ea(i, j) | ea(i, j) is the relationship between i ∈ Va and j ∈ Va} is the set of relationships. Here, ea(i, j) = 1 represents the existence of a connecting edge between i and j in Ga, and ea(i, j) = 0 represents the non-existence of such an edge. The neighborhood of node i ∈ Va is Na(i) = {j | ea(i, j) = 1} and the degree is defined as |Na(i)|. Similarly, the auxiliary structural data can also be modeled as a graph Gu = (Vu, Eu), where Vu is the set of labeled (known) users and Eu is the set of relationships between these users. Note that the auxiliary (background) data can be easily obtained through various channels, e.g., academic data mining, online crawling, advertising and third-party applications [4,5,8,29].

A de-anonymization process is a mapping σ : Va → Vu. ∀i ∈ Va, its mapping under σ is σ(i) ∈ Vu ∪ {⊥}, where ⊥ indicates a non-existent (null) node. Similarly, ∀ea(i, j) ∈ Ea, σ(ea(i, j)) = eu(σ(i), σ(j)) ∈ Eu ∪ {⊥}. Under σ, a successful de-anonymization on i ∈ Va is defined as σ(i) = i if i ∈ Vu, or σ(i) = ⊥ otherwise. In all other cases, the de-anonymization on i fails.
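For illustration, the minimal Python sketch below (all names and the toy graph are our own, not part of the formal model) encodes this success criterion: a node is successfully de-anonymized when the mapping returns the node itself if it appears in the auxiliary graph, and the null node ⊥ otherwise.

```python
# Sketch of the success criterion for a de-anonymization mapping sigma.
NULL = None  # plays the role of the non-existent node (⊥)

def successful(sigma, i, Vu):
    """sigma succeeds on i iff it recovers i itself (when i is in the
    auxiliary user set Vu) or maps i to the null node (when it is not)."""
    return sigma.get(i, NULL) == (i if i in Vu else NULL)

# Toy example: anonymized nodes 1..4; the auxiliary graph only knows 1..3.
Va, Vu = {1, 2, 3, 4}, {1, 2, 3}
sigma = {1: 1, 2: 3, 3: 2, 4: NULL}  # nodes 2 and 3 are swapped
print([successful(sigma, i, Vu) for i in sorted(Va)])  # [True, False, False, True]
```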
3.1 Attack Model
We assume that the adversary has access to Ga = (Va, Ea) and Gu = (Vu, Eu). Ga = (Va, Ea) is the anonymized graph, and the adversary can only get access to the structural information of Ga. Gu = (Vu, Eu) is the auxiliary graph, and the adversary already knows all the identities of the nodes in Gu. In addition, we do not assume that the adversary has other prior information (e.g., seed information). These assumptions are more reasonable than those in most of the state-of-the-art research [5,8,10].
4 Theoretical Analysis
In this section, we provide a theoretical analysis of structure-based de-anonymization attacks. Under any anonymization technique, the users' sensitive information should be protected without significantly affecting the utility of the anonymized data for real-world systems or research applications. We aim to quantify the trade-off between preserving users' privacy and the utility of anonymized data. Under what range of anonymized utility is it possible for the privacy of an individual to be broken (i.e., for de-anonymization attacks to succeed)? To answer this, we quantify the limitations of existing anonymization schemes and establish an inherent relationship between the anonymized utility and the de-anonymization capability. Our theoretical analysis incorporates an application-specific utility metric for the anonymized graph, which further makes our rigorous quantification useful for real-world scenarios. Our theoretical analysis can serve as an effective guideline for evaluating the performance of practical de-anonymization/anonymization schemes (this will be discussed in Sect. 6).

First, we assume that there exists a conceptually underlying graph G = (V, E) with V = Va ∪ Vu, where E is a set of relationships among users in V, e(i, j) = 1 ∈ E represents the existence of a connecting edge between i and j, and e(i, j) = 0 ∈ E represents the non-existence of such an edge. Consequently, Ga and Gu can be viewed as observable versions of G, obtained by applying edge insertions or deletions on G according to proper relationships, such as the 'co-occurrence' relationships in the Gowalla dataset [29]. In comparison, previous work [4,9] only considers edge deletions, which is an unrealistic assumption. For edge insertions from $G$ to $G_a$, the process is: $\forall e(i,j) = 0 \in E$, $e_a(i,j) = 1$ appears in $E_a$ with probability $p_a^{add}$, i.e., $\Pr(e_a(i,j) = 1 \mid e(i,j) = 0) = p_a^{add}$. The probability of edge deletion from $G$ to $G_a$ is $p_a^{del}$, i.e., $\Pr(e_a(i,j) = 0 \mid e(i,j) = 1) = p_a^{del}$. Similarly, the insertions and deletions from $G$ to $G_u$ can be characterized by the probabilities $p_u^{add}$ and $p_u^{del}$. Furthermore, we assume that the insertion/deletion of each edge is independent of every other edge. This model is intuitively reasonable since the three graphs G, Ga, Gu are related with each other. In addition, our model is more reasonable than the existing models in [4,9] because we take both edge deletions and insertions into consideration. Note that the incorporation of edge insertion is non-trivial in our quantification of non-linear application-specific utility analysis.
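The following Python sketch illustrates this sampling model (the parameter and function names are ours); each node pair of the underlying graph is perturbed independently with the insertion/deletion probabilities defined above:

```python
import itertools, random

def observe(nodes, edges, p_add, p_del, rng):
    """Draw a noisy view of the underlying graph G: every existing edge is
    deleted with probability p_del; every absent edge is inserted with p_add."""
    out = set()
    for pair in itertools.combinations(sorted(nodes), 2):
        if pair in edges:
            if rng.random() >= p_del:   # the edge survives deletion
                out.add(pair)
        elif rng.random() < p_add:      # a spurious edge is inserted
            out.add(pair)
    return out

rng = random.Random(0)
V = range(6)
E = {(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 5)}
Ga = observe(V, E, p_add=0.05, p_del=0.10, rng=rng)  # anonymized view of G
Gu = observe(V, E, p_add=0.05, p_del=0.10, rng=rng)  # auxiliary view of G
print(len(E), len(Ga), len(Gu))
```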
Our quantification analysis would therefore contribute to relating the real-world application-specific anonymizing utility and the de-anonymization capability.

The adjacency matrix and the transition probability matrix are two important descriptions of a graph, and the graph utility is also closely related to these matrices. The adjacency matrix is a means of representing which nodes of a graph are adjacent to which other nodes. We denote the adjacency matrix by $A$ (resp. $A_a$ and $A_u$) for graph $G$ (resp. $G_a$ and $G_u$), where the element $A(i,j) = e(i,j)$ (resp. $A_a(i,j) = e_a(i,j)$ and $A_u(i,j) = e_u(i,j)$). Furthermore, the transition probability matrix consists of the one-step transition probabilities, i.e., the probability of transitioning from one node to another in a single step. We denote the transition probability matrix by $T$ (resp. $T_a$ and $T_u$) for graph $G$ (resp. $G_a$ and $G_u$), where the element $T(i,j) = e(i,j)/\deg(i)$ (resp. $T_a(i,j) = e_a(i,j)/\deg_a(i)$ and $T_u(i,j) = e_u(i,j)/\deg_u(i)$), and $\deg(i)$, $\deg_a(i)$, $\deg_u(i)$ represent the degree of node $i$ in $G$, $G_a$, $G_u$, respectively.

We now define the smallest ($l$) and largest ($h$) probabilities of an edge existing between two nodes in the graph $G$, and the graph density (denoted by $R$). For graph $G$, we denote $|V| = N$ and $|E| = M$. Let $p(i,j)$ be the probability of an edge existing between $i, j \in V$ and define $l = \min\{p(i,j) \mid i,j \in V, i \neq j\}$, $h = \max\{p(i,j) \mid i,j \in V, i \neq j\}$, the expected number of edges $P_T = \sum_{i,j \in V} p(i,j)$ and the graph density $R = P_T/\binom{N}{2}$.

Then, we start our formal quantification from the simplest scenario where the anonymized data and the auxiliary data correspond to the same group of users, i.e., $V_a = V_u$ as in [4,5,8]. This assumption does not limit our theoretical analysis since we can either (a) apply it to the overlapped users between $V_a$ and $V_u$ or (b) extend the set of users to $V_a^{new} = V_a \cup (V_u \setminus V_a)$ and $V_u^{new} = V_u \cup (V_a \setminus V_u)$, and apply the analysis to $G_a = (V_a^{new}, E_a)$ and $G_u = (V_u^{new}, E_u)$. Therefore, in order to prevent any confusion and without loss of generality, we assume $V_a = V_u$ in our theoretical analysis.

We define $\sigma_k$ as a mapping between $G_a$ and $G_u$ that contains $k$ incorrectly-mapped pairs. Given a mapping $\sigma : V_a \to V_u$, we define the Difference of Common Neighbors (DCN) on a node $i$'s mapping $\sigma(i)$ as $\phi_{i,\sigma(i)} = |N_a(i) \setminus N_u(\sigma(i))| + |N_u(\sigma(i)) \setminus N_a(i)|$, which measures the neighborhoods' difference between node $i$ in $G_a$ and node $\sigma(i)$ in $G_u$ under the mapping $\sigma$. Then, we define the overall DCN for all the nodes under the mapping $\sigma$ as $\Phi_\sigma = \sum_{(i,\sigma(i)) \in \sigma} \phi_{i,\sigma(i)}$.

Next, we not only explain why structure-based de-anonymization attacks work but also quantify the trade-off between the anonymized utility and the de-anonymization capability. We first quantify the relationship between a straightforward utility metric, named local neighborhood utility, and the de-anonymization capability. Then we carefully analyze a more general utility metric, named global structure utility, to accommodate a broad class of real-world applications.
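A small sketch of the DCN computation follows (our own helper names; it relies on the shared user universe Va = Vu assumed above, so neighborhoods can be compared by node labels):

```python
def neighbors(edges, i):
    """N(i): nodes adjacent to i in an edge set of 2-tuples."""
    return {j for e in edges for j in e if i in e and j != i}

def dcn(Ga_edges, Gu_edges, sigma):
    """Overall DCN Phi_sigma of a mapping sigma: {node in Va -> node in Vu}."""
    total = 0
    for i, si in sigma.items():
        Na, Nu = neighbors(Ga_edges, i), neighbors(Gu_edges, si)
        total += len(Na - Nu) + len(Nu - Na)  # phi_{i, sigma(i)}
    return total

Ga = {(1, 2), (2, 3), (3, 4)}
Gu = {(1, 2), (2, 3), (3, 4)}
print(dcn(Ga, Gu, {1: 1, 2: 2, 3: 3, 4: 4}))  # 0: the correct mapping
print(dcn(Ga, Gu, {1: 1, 2: 3, 3: 2, 4: 4}))  # 8: swapping 2 and 3 raises the DCN
```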
Fig. 1. Visualization of utility region (green shaded) for successful de-anonymization under different scenarios. To guarantee the applicability of the anonymized data, the anonymized utility should be preserved by the anonymization techniques. We theoretically demonstrate that successful de-anonymization can be achieved if the anonymized utility locates within these shaded regions [1]. (Color figure online)
4.1 Relation Between the Local Neighborhood Utility and De-anonymization Capability
At the beginning, we explore a straightforward utility metric, the local neighborhood utility, which evaluates the distortion of the anonymized graph $G_a$ from the conceptually underlying graph $G$.

Definition 1. The local neighborhood utility for the anonymized graph is
$$U_a = 1 - \frac{E[D(G_a, G)]}{N(N-1)} = 1 - \frac{\|A_a - A\|_1}{N(N-1)}$$
(the denominator is a normalizing factor to guarantee $U_a \in [0,1]$), where $D(\cdot,\cdot)$ is the Hamming distance [30] of edges between two graphs, i.e., $D(e_a(i,j), e(i,j)) = 1$ if $e_a(i,j) \neq e(i,j)$, and $E[D(G_a, G)]$ is the distortion between $G_a$ and $G$, with $E[D(G_a, G)] = E[\sum_{i,j} D(e_a(i,j), e(i,j))] = \sum_{i,j} (p(i,j)\, p_a^{del} + (1 - p(i,j))\, p_a^{add})$.

Thus, we further have
$$U_a = 1 - \frac{\sum_{i,j} (p(i,j)\, p_a^{del} + (1 - p(i,j))\, p_a^{add})}{\binom{N}{2}} = 1 - (R\, p_a^{del} + (1 - R)\, p_a^{add}) \quad [1] \quad (1)$$
Similarly, the local neighborhood utility for the auxiliary graph is
$$U_u = 1 - (R\, p_u^{del} + (1 - R)\, p_u^{add}) \quad [1] \quad (2)$$
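For concreteness, the sketch below (assumed helper names) estimates this utility from observed adjacency matrices, replacing the expectation in Definition 1 with the realized Hamming distance:

```python
import numpy as np

def local_neighborhood_utility(A_anon, A_orig):
    """U_a with E[D(Ga, G)] replaced by the observed ||Aa - A||_1."""
    N = A_orig.shape[0]
    return 1.0 - np.abs(A_anon - A_orig).sum() / (N * (N - 1))

A  = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]])
Aa = A.copy()
Aa[0, 1] = Aa[1, 0] = 0  # delete edge (0,1)
Aa[0, 3] = Aa[3, 0] = 1  # insert edge (0,3)
print(local_neighborhood_utility(Aa, A))  # 1 - 4/12 ≈ 0.667
```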
Though the utility metric for structural data is application-dependent, our utility metric can provide a comprehensive understanding of utility performance by considering both the edge insertions and deletions, and by incorporating the distance between the anonymized (auxiliary) graph and the conceptually underlying
graph. Although our utility is one of the most straightforward definitions, to the best of our knowledge, this is still the first scientific work that theoretically analyzes the relationship between de-anonymization performance and the utility of the anonymized data. Furthermore, we will provide more analysis by considering a general utility metric that can be applied to a broad coverage of applications.

Based on the local neighborhood utility in Definition 1, we theoretically analyze the de-anonymization capability of structure-based attacks and quantify the anonymized utility for successful de-anonymization. Theorem 1 implies that as the number of nodes in the graphs $G_a$ and $G_u$ increases, the probability of successful de-anonymization approaches 1 when the four conditions (in Eqs. 3, 4, 5 and 6) regarding the graph density $R$ and the smallest and largest probabilities of the edges between nodes hold.

Theorem 1. For any $\sigma_k \neq \sigma_0$, where $k$ is the number of incorrectly-mapped nodes between $G_a$ and $G_u$, $\lim_{n \to \infty} \Pr(\Phi_{\sigma_k} \geq \Phi_{\sigma_0}) = 1$ when the following conditions are satisfied.
$$U_u + 2l\, U_a > 1 + 2l - Rl \quad [1] \quad (3)$$
$$U_u + 2(1-h)\, U_a > 1 + 2(1-h) - (1-h)(1-R) \quad [1] \quad (4)$$
$$U_u + 2l\,\frac{1-R}{R}\, U_a > 1 + 2l\,\frac{1-R}{R} - l(1-R) \quad [1] \quad (5)$$
$$U_u + 2(1-h)\,\frac{R}{1-R}\, U_a > 1 + 2(1-h)\,\frac{R}{1-R} - R(1-h) \quad [1] \quad (6)$$
From Theorem 1, we know that when the local neighborhood utility for the anonymized graph and the auxiliary graph satisfies the four conditions in Eqs. 3, 4, 5 and 6, we can achieve successful de-anonymization from a statistical perspective. The reason is that the attacker can discover the correct mapping with high probability by choosing the mapping with the minimal Difference of Common Neighbors (DCN) out of all the possible mappings between the anonymized graph and the auxiliary graph. To the best of our knowledge, this is the first work to quantify the relationship between anonymized utility and de-anonymization capability. It also essentially explains why structure-based de-anonymization attacks work.

The four conditions in Theorem 1 can be reduced to one or two conditions under four regimes of graph density. Figure 1(a) is the triangular utility region for $R < \min\{0.5, \frac{1-h}{1-h+l}\}$ (where the graph density $R$ is smaller than both 0.5 and $\frac{1-h}{1-h+l}$), which is only bounded by Eq. 3. Figure 1(b) is the quadrilateral utility region for $\min\{0.5, \frac{1-h}{1-h+l}\} \leq R < 0.5$ (where the graph density $R$ is larger than $\frac{1-h}{1-h+l}$ and smaller than 0.5), which is bounded by Eqs. 3 and 4. Similarly, Fig. 1(c) is the triangular utility region for $0.5 \leq R < \max\{0.5, \frac{l}{1-h+l}\}$ (where the graph density $R$ is larger than 0.5 and smaller than $\frac{l}{1-h+l}$), which is only bounded by Eq. 6. Figure 1(d) is the quadrilateral utility region for $R \geq \max\{0.5, \frac{l}{1-h+l}\}$ (where the graph density $R$ is larger than both 0.5 and $\frac{l}{1-h+l}$), which is bounded by Eqs. 5 and 6. Therefore, we not only analytically explain why the structure-based de-anonymization
works, but also theoretically provide the bound of anonymized utility for successful de-anonymization. When the anonymized utility satisfies the conditions in Theorem 1 (or lies within the green shaded utility regions shown in Fig. 1), successful de-anonymization is theoretically achievable.
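As a quick illustration, the function below (our own sketch) checks the four conditions of Theorem 1 for given parameters and so decides whether a utility pair lies in the shaded region of Fig. 1:

```python
def theorem1_conditions(Ua, Uu, l, h, R):
    """True iff (Ua, Uu) satisfies Eqs. 3-6 for edge-probability bounds
    l, h and graph density R (0 < R < 1)."""
    c3 = Uu + 2*l*Ua > 1 + 2*l - R*l
    c4 = Uu + 2*(1-h)*Ua > 1 + 2*(1-h) - (1-h)*(1-R)
    c5 = Uu + 2*l*(1-R)/R * Ua > 1 + 2*l*(1-R)/R - l*(1-R)
    c6 = Uu + 2*(1-h)*R/(1-R) * Ua > 1 + 2*(1-h)*R/(1-R) - R*(1-h)
    return c3 and c4 and c5 and c6

print(theorem1_conditions(Ua=0.99, Uu=0.99, l=0.1, h=0.9, R=0.3))  # True
print(theorem1_conditions(Ua=0.60, Uu=0.60, l=0.1, h=0.9, R=0.3))  # False
```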
5 Proof for Theorem 1
We first prove the following two lemmas based on the four conditions in Eqs. 3, 4, 5 and 6, before obtaining the final result $\lim_{n \to \infty} \Pr(\Phi_{\sigma_k, E_k^u \setminus E_\tau^u} \geq \Phi_{\sigma_0, E_k^u \setminus E_\tau^u}) = 1$ of Theorem 1. The two lemmas provide important properties of $p_{add}^u, p_{del}^u$ and $p_{add}^{ua}, p_{del}^{ua}$, respectively, where $p_{add}^u, p_{del}^u$ represent the edge insertion and deletion probabilities from the conceptually underlying graph to the auxiliary graph, and $p_{add}^{ua}, p_{del}^{ua}$ represent the edge insertion and deletion probabilities from the auxiliary graph to the anonymized graph.

5.1 Properties for $p_{add}^u$, $p_{del}^u$
Lemma 1. Consider $p_{add}^u$, which represents the edge insertion probability from $G$ to $G_u$, and $p_{del}^u$, which represents the deletion probability from $G$ to $G_u$. We have $\max\{p_{add}^u, p_{del}^u\} < \frac{1}{2}$.

Proof. From Eqs. 3, 4, 5 and 6, we have $U_u > \max\{1 - lR,\ 1 - (1-h)R,\ 1 - l(1-R),\ 1 - (1-h)(1-R)\}$. Based on that, we further have $R\, p_{del}^u + (1-R)\, p_{add}^u = 1 - U_u < \min\{lR,\ (1-h)R,\ l(1-R),\ (1-h)(1-R)\} \Rightarrow \frac{1}{2} > \max\{p_{add}^u, p_{del}^u\}$. Similarly, for the anonymized graph, we can prove $\max\{p_{add}^a, p_{del}^a\} < \frac{1}{2}$.

5.2 Properties for $p_{add}^{ua}(i,j)$, $p_{del}^{ua}(i,j)$
Lemma 2. Consider $p_{add}^{ua}(i,j)$, which represents the edge insertion probability from $G_u$ to $G_a$, and $p_{del}^{ua}(i,j)$, which represents the deletion probability from $G_u$ to $G_a$. We have $\max\{p_{add}^{ua}(i,j), p_{del}^{ua}(i,j)\} < \frac{1}{2}$.

Proof. Writing $p_{add}^{ua}(i,j)$ and $p_{del}^{ua}(i,j)$ via Bayes' rule over the underlying edge probability $p(i,j)$, the inequality $p_{add}^{ua}(i,j) < \frac{1}{2}$ reduces to Eq. 11 and $p_{del}^{ua}(i,j) < \frac{1}{2}$ reduces to Eq. 12. It therefore suffices to prove that
$$p_{del}^u (1 - 2p_{del}^a)\, p(i,j) < (1 - p_{add}^u)(1 - 2p_{add}^a)(1 - p(i,j)) \quad (11)$$
$$p_{add}^u (1 - 2p_{add}^a)(1 - p(i,j)) < (1 - p_{del}^u)(1 - 2p_{del}^a)\, p(i,j) \quad (12)$$

We prove Eqs. 11 and 12 under the following four situations: (a) $p_{del}^u \geq p_{add}^u$ and $p_{del}^a \geq p_{add}^a$; (b) $p_{del}^u \geq p_{add}^u$ and $p_{del}^a \leq p_{add}^a$; (c) $p_{del}^u \leq p_{add}^u$ and $p_{del}^a \geq p_{add}^a$; and (d) $p_{del}^u \leq p_{add}^u$ and $p_{del}^a \leq p_{add}^a$, respectively.
Situation (a): under situation (a), we only need to consider Eqs. 3 and 6. Equation 6 is equivalent to $\frac{1-U_u}{R} < (1-h)\left(1 - 2\frac{1-U_a}{1-R}\right) \Rightarrow \frac{1-U_u}{R} < 1-h \Rightarrow \frac{R\, p_{del}^u + (1-R)\, p_{add}^u}{R} < 1-h \Rightarrow p_{del}^u < 1-h$. Based on that, we have $p_{del}^u < 1 - p(i,j)$ and $p(i,j) < 1 - p_{del}^u \leq 1 - p_{add}^u$. In addition, we have $1 - 2p_{del}^a \leq 1 - 2p_{add}^a$. Therefore, Eq. 11 is satisfied.

Eq. 3 is equivalent to $\frac{1-U_u}{R} < l - 2l\,\frac{1-U_a}{R} \Rightarrow \frac{R\, p_{del}^u + (1-R)\, p_{add}^u}{R} < l - 2l\,\frac{R\, p_{del}^a + (1-R)\, p_{add}^a}{R} \Rightarrow p_{del}^u < l(1 - 2p_{del}^a) \Rightarrow p_{del}^u < l\,\frac{1 - 2p_{del}^a}{1 - 2p_{add}^a}$. Therefore, we have
$$\frac{1}{p_{del}^u} - 1 > \frac{1}{l\,\frac{1-2p_{del}^a}{1-2p_{add}^a}} - 1 \geq \frac{1}{p(i,j)\,\frac{1-2p_{del}^a}{1-2p_{add}^a}} - 1 \geq \frac{1-2p_{add}^a}{1-2p_{del}^a}\left(\frac{1}{p(i,j)} - 1\right) \quad (13)$$

Furthermore, we can prove that Eq. 13 is equivalent to
$$(1 - p_{del}^u)(1 - 2p_{del}^a)\, p(i,j) > p_{del}^u (1 - 2p_{add}^a)(1 - p(i,j)) \quad (14)$$

Considering that $p_{del}^u \geq p_{add}^u$, Eq. 12 thus holds. Therefore, we have proved that $p_{add}^{ua}(i,j) < \frac{1}{2}$ and $p_{del}^{ua}(i,j) < \frac{1}{2}$ under situation (a).

Situation (b): under situation (b), we only need to consider Eqs. 3 and 6. Equation 3 is equivalent to $\frac{1-U_u}{R} < l\left(1 - 2\frac{1-U_a}{R}\right) \Rightarrow \frac{1-U_u}{R} < l \Rightarrow \frac{R\, p_{del}^u + (1-R)\, p_{add}^u}{R} < l \Rightarrow p_{del}^u < l$. Based on that, we have $p_{add}^u \leq p_{del}^u < p(i,j)$ and $1 - p(i,j) < 1 - p_{del}^u$; in addition, $1 - 2p_{add}^a \leq 1 - 2p_{del}^a$, so Eq. 12 is satisfied.

Equation 6 is equivalent to $\frac{1-U_u}{R} < (1-h)\left(1 - 2\frac{1-U_a}{1-R}\right) \Rightarrow p_{del}^u < (1-h)(1 - 2p_{add}^a) \Rightarrow p_{del}^u < (1-h)\,\frac{1 - 2p_{add}^a}{1 - 2p_{del}^a}$. Therefore, we have
$$\frac{1}{p_{del}^u} - 1 > \frac{1}{(1-h)\,\frac{1-2p_{add}^a}{1-2p_{del}^a}} - 1 \geq \frac{1}{(1 - p(i,j))\,\frac{1-2p_{add}^a}{1-2p_{del}^a}} - 1 \geq \frac{1-2p_{del}^a}{1-2p_{add}^a}\left(\frac{1}{1-p(i,j)} - 1\right) \quad (15)$$

Furthermore, we can prove that Eq. 15 is equivalent to
$$(1 - p_{del}^u)(1 - 2p_{add}^a)(1 - p(i,j)) > p_{del}^u (1 - 2p_{del}^a)\, p(i,j) \quad (16)$$

In addition, we have $p_{del}^u \geq p_{add}^u$, and Eq. 11 is thus satisfied. Therefore, we have proved that $p_{add}^{ua}(i,j) < \frac{1}{2}$ and $p_{del}^{ua}(i,j) < \frac{1}{2}$ under situation (b).

Situation (c): under situation (c), we only need to consider Eqs. 4 and 5. Equation 4 is equivalent to $\frac{1-U_u}{1-R} < (1-h)\left(1 - 2\frac{1-U_a}{1-R}\right) \Rightarrow \frac{1-U_u}{1-R} < 1-h \Rightarrow \frac{R\, p_{del}^u + (1-R)\, p_{add}^u}{1-R} < 1-h \Rightarrow p_{del}^u \leq p_{add}^u < 1-h$. Therefore, we can obtain $p_{add}^u < 1 - p(i,j)$ and $p(i,j) < 1 - p_{add}^u$. Besides, $1 - 2p_{del}^a \leq 1 - 2p_{add}^a$; therefore Eq. 11 holds.

Equation 5 is equivalent to $\frac{1-U_u}{1-R} < l - 2l\,\frac{1-U_a}{R} \Rightarrow \frac{(1-R)\, p_{add}^u}{1-R} < l - 2l\,\frac{R\, p_{del}^a + (1-R)\, p_{add}^a}{R} \Rightarrow p_{add}^u < l(1 - 2p_{del}^a) \Rightarrow p_{add}^u < l\,\frac{1 - 2p_{del}^a}{1 - 2p_{add}^a}$. Thus,
$$\frac{1}{p_{add}^u} - 1 > \frac{1}{l\,\frac{1-2p_{del}^a}{1-2p_{add}^a}} - 1 \geq \frac{1}{p(i,j)\,\frac{1-2p_{del}^a}{1-2p_{add}^a}} - 1 \geq \frac{1-2p_{add}^a}{1-2p_{del}^a}\left(\frac{1}{p(i,j)} - 1\right) \quad (17)$$

Equation 17 is equivalent to
$$(1 - p_{add}^u)(1 - 2p_{del}^a)\, p(i,j) > p_{add}^u (1 - 2p_{add}^a)(1 - p(i,j)) \quad (18)$$

Combining this with $p_{del}^u \leq p_{add}^u$, we have Eq. 12 satisfied. Therefore, we have proved that $p_{add}^{ua}(i,j) < \frac{1}{2}$ and $p_{del}^{ua}(i,j) < \frac{1}{2}$ under situation (c).
Situation (d): under situation (d), we only need to consider Eqs. 4 and 5. Equation 5 is equivalent to $\frac{1-U_u}{1-R} < l\left(1 - 2\frac{1-U_a}{R}\right) \Rightarrow \frac{1-U_u}{1-R} < l \Rightarrow \frac{R\, p_{del}^u + (1-R)\, p_{add}^u}{1-R} < l \Rightarrow p_{add}^u < l$. Therefore, we can obtain $p_{add}^u < p(i,j)$ and $1 - p(i,j) < 1 - p_{add}^u \leq 1 - p_{del}^u$. In addition, we have $1 - 2p_{del}^a \geq 1 - 2p_{add}^a$. Thus, Eq. 12 holds.

Equation 4 is equivalent to $\frac{1-U_u}{1-R} < (1-h)\left(1 - 2\frac{1-U_a}{1-R}\right) \Rightarrow \frac{R\, p_{del}^u + (1-R)\, p_{add}^u}{1-R} < (1-h)\left(1 - 2\frac{R\, p_{del}^a + (1-R)\, p_{add}^a}{1-R}\right) \Rightarrow p_{add}^u < (1-h)(1 - 2p_{add}^a) \Rightarrow p_{add}^u < (1-h)\,\frac{1 - 2p_{add}^a}{1 - 2p_{del}^a}$. Therefore, we have
$$\frac{1}{p_{add}^u} - 1 > \frac{1}{(1-h)\,\frac{1-2p_{add}^a}{1-2p_{del}^a}} - 1 \geq \frac{1}{(1 - p(i,j))\,\frac{1-2p_{add}^a}{1-2p_{del}^a}} - 1 \geq \frac{1-2p_{del}^a}{1-2p_{add}^a}\left(\frac{1}{1-p(i,j)} - 1\right) \quad (19)$$

Equation 19 is equivalent to
$$(1 - p_{add}^u)(1 - 2p_{add}^a)(1 - p(i,j)) > p_{add}^u (1 - 2p_{del}^a)\, p(i,j) \quad (20)$$

Besides, $p_{del}^u \leq p_{add}^u$; therefore Eq. 11 is satisfied. Finally, we have proved that $p_{add}^{ua}(i,j) < \frac{1}{2}$ and $p_{del}^{ua}(i,j) < \frac{1}{2}$ under situation (d).

5.3 Achieving Successful De-anonymization
Since $k$ is the number of incorrect mappings in $\sigma_k \neq \sigma_0$, $2 \leq k \leq n$ is satisfied. With $\sigma_k$, we consider $V_k^u \subseteq V^u$ as the set of incorrectly de-anonymized nodes, $E_k^u = \{e_{i,j}^u \mid i \in V_k^u \text{ or } j \in V_k^u\}$ as the set of all the possible edges adjacent to at least one user in $V_k^u$, $E_\tau^u = \{e_{i,j}^u \mid i, j \in V_k^u,\ \sigma_k(i) = j \text{ and } \sigma_k(j) = i\}$ as the set of all the possible edges corresponding to transposition mappings in $\sigma_k$, and $E^u = \{e_{i,j}^u \mid 1 \leq i \neq j \leq n\}$ as the set of all the possible edges on $V$. Furthermore, define $m_k = |E_k^u|$ and $m_\tau = |E_\tau^u|$. We have $|V_k^u| = k$, $m_k = \binom{k}{2} + k(n-k)$, and $m_\tau \leq \lfloor k/2 \rfloor$ since there are at most $\lfloor k/2 \rfloor$ transposition mappings in $\sigma_k$; $|E^u| = \binom{n}{2}$.

Next, we quantify $\Phi_{\sigma_0}$ from a stochastic perspective. To quantify $\Phi_{\sigma_0}$, we consider the DCN caused by the projection of each link, including both the existing links and the non-existing links in the conceptually underlying graph, i.e., $\forall e_{i,j}^u \in E^u$. We further define $\Phi_{\sigma_k, E}$ as the DCN caused by the edges in the set $E$ under the mapping $\sigma_k$. If a link exists in $G_u$ but not in $G_a$, according to the definition of DCN, it will cause a DCN of 2, and vice versa. Therefore, $\Phi_{\sigma_k} = \Phi_{\sigma_k, E^u \setminus E_k^u} + \Phi_{\sigma_k, E_k^u \setminus E_\tau^u} + \Phi_{\sigma_k, E_\tau^u}$ and $\Phi_{\sigma_0} = \Phi_{\sigma_0, E^u \setminus E_k^u} + \Phi_{\sigma_0, E_k^u \setminus E_\tau^u} + \Phi_{\sigma_0, E_\tau^u}$. Since $\Phi_{\sigma_k, E^u \setminus E_k^u} = \Phi_{\sigma_0, E^u \setminus E_k^u}$ and $\Phi_{\sigma_k, E_\tau^u} = \Phi_{\sigma_0, E_\tau^u}$, we can obtain $\Pr(\Phi_{\sigma_k} \geq \Phi_{\sigma_0}) = \Pr(\Phi_{\sigma_k, E_k^u \setminus E_\tau^u} \geq \Phi_{\sigma_0, E_k^u \setminus E_\tau^u})$.
Then, $\forall e_{i,j}^u \in E_k^u \setminus E_\tau^u$ under $\sigma_k$, it will be mapped to some other possible edge $\sigma_k(e_{i,j}^u) = e_{\sigma_k(i), \sigma_k(j)}^u \in E^u$, since $e_{i,j}^u \notin E_\tau^u$ and at least one of $i$ and $j$ is incorrectly de-anonymized under $\sigma_k$. Therefore, in this case, the DCN/2 caused by $e_{i,j}^u$ during the projection process satisfies the Bernoulli distribution
$$\Phi_{\sigma_k, e_{i,j}^u} \sim B\Big(1,\ p(i,j)^u \big(p(\sigma_k(i), \sigma_k(j))^u\, p_{del}^{ua}(i,j) + (1 - p(\sigma_k(i), \sigma_k(j))^u)(1 - p_{add}^{ua}(i,j))\big) + (1 - p(i,j)^u)\big(p(\sigma_k(i), \sigma_k(j))^u (1 - p_{del}^{ua}(i,j)) + (1 - p(\sigma_k(i), \sigma_k(j))^u)\, p_{add}^{ua}(i,j)\big)\Big).$$

For $\sigma_0$, $\forall e_{i,j}^u \in E_k^u \setminus E_\tau^u$, the DCN/2 caused by $e_{i,j}^u$ after the projection process satisfies the Bernoulli distribution $\Phi_{\sigma_0, e_{i,j}^u} \sim B(1,\ p(i,j)^u p_{del}^{ua} + (1 - p(i,j)^u) p_{add}^{ua})$. Let $\lambda_{\sigma_0, e_{i,j}^u}$ and $\lambda_{\sigma_k, e_{i,j}^u}$ be the means of $\Phi_{\sigma_0, e_{i,j}^u}$ and $\Phi_{\sigma_k, e_{i,j}^u}$, respectively. Since $p_{add}^{ua}(i,j) < \frac{1}{2}$ and $p_{del}^{ua}(i,j) < \frac{1}{2}$, we have $p_{del}^{ua}(i,j) < 1 - p_{add}^{ua}(i,j)$ and $p_{add}^{ua}(i,j) < 1 - p_{del}^{ua}(i,j)$. Furthermore, we can prove
$$\lambda_{\sigma_k, e_{i,j}^u} > p(i,j)^u p_{del}^{ua} + (1 - p(i,j)^u) p_{add}^{ua} = \lambda_{\sigma_0, e_{i,j}^u} \quad (21)$$
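The strict inequality in Eq. 21 can also be checked numerically; the sketch below (our own parameter grid and function names) evaluates both Bernoulli means and confirms the gap whenever the induced probabilities lie below 1/2:

```python
import itertools

def lam_sigma_k(p_u, q_u, p_del, p_add):
    """Mean DCN/2 of a mis-mapped edge; p_u = p(i,j)^u and
    q_u = p(sigma_k(i), sigma_k(j))^u."""
    return (p_u * (q_u*p_del + (1-q_u)*(1-p_add))
            + (1-p_u) * (q_u*(1-p_del) + (1-q_u)*p_add))

def lam_sigma_0(p_u, p_del, p_add):
    """Mean DCN/2 of a correctly mapped edge."""
    return p_u*p_del + (1-p_u)*p_add

probs = [0.1, 0.3, 0.5, 0.7, 0.9]   # edge probabilities in (0, 1)
rates = [0.05, 0.2, 0.35, 0.45]     # p_del^ua, p_add^ua below 1/2
print(all(lam_sigma_k(p, q, d, a) > lam_sigma_0(p, d, a)
          for p, q, d, a in itertools.product(probs, probs, rates, rates)))  # True
```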
Lemma 3 [9]. (i) Let $X \sim B(n_1, p)$ and $Y \sim B(n_2, p)$ be independent binomial variables. Then $X + Y$ is also a binomial variable and $X + Y \sim B(n_1 + n_2, p)$; (ii) let $X$ and $Y$ be two binomial random variables with means $\lambda_x$ and $\lambda_y$, respectively. Then, when $\lambda_x > \lambda_y$, $\Pr(X - Y \leq 0) \leq 2\exp\left(\frac{-(\lambda_x - \lambda_y)^2}{8(\lambda_x + \lambda_y)}\right)$.

By applying Lemma 3, we thus have
$$\Pr(\Phi_{\sigma_k, e_{i,j}^u} > \Phi_{\sigma_0, e_{i,j}^u}) > 1 - 2\exp\left(-\frac{(\lambda_{\sigma_k, e_{i,j}^u} - \lambda_{\sigma_0, e_{i,j}^u})^2}{8(\lambda_{\sigma_k, e_{i,j}^u} + \lambda_{\sigma_0, e_{i,j}^u})}\right) = 1 - 2\exp\big(-f(p(i,j)^u, p(\sigma_k(i), \sigma_k(j))^u)\, m_2\big) \quad (22)$$

where $f(\cdot, \cdot)$ is a function of $p(i,j)^u$ and $p(\sigma_k(i), \sigma_k(j))^u$. Similarly, we have $\Pr(\Phi_{\sigma_k, E_k^u \setminus E_\tau^u} \geq \Phi_{\sigma_0, E_k^u \setminus E_\tau^u}) = \Pr\big(\sum_{e_{i,j} \in E_k^u \setminus E_\tau^u} \Phi_{\sigma_k, e_{i,j}} \geq \sum_{e_{i,j} \in E_k^u \setminus E_\tau^u} \Phi_{\sigma_0, e_{i,j}}\big) \geq \prod_{e_{i,j} \in E_k^u \setminus E_\tau^u} \Pr(\Phi_{\sigma_k, e_{i,j}} \geq \Phi_{\sigma_0, e_{i,j}})$. Considering $f_{min} = \min f(p(i,j), p(\sigma_k(i), \sigma_k(j)))$, we can obtain $\Pr(\Phi_{\sigma_k, E_k^u \setminus E_\tau^u} \geq \Phi_{\sigma_0, E_k^u \setminus E_\tau^u}) \geq \prod_{e_{i,j} \in E_k^u \setminus E_\tau^u} (1 - 2\exp(-f_{min}\, m_2)) = (1 - 2\exp(-f_{min}\, m_2))^{m_k}$.
Lemma 4.
$$\lim_{x \to \infty} \big(1 - 2\exp(-(ax + b))\big)^{cx + d} = 1 \quad (23)$$
where $a, b, c, d$ are positive real numbers.

Proof. Taking the logarithm of Eq. 23, it is equivalent to $\lim_{x \to \infty} (cx + d)\ln\big(1 - 2\exp(-(ax + b))\big) = 0$. Then, we can prove that
$$\lim_{x \to \infty} (cx + d)\ln\big(1 - 2\exp(-(ax + b))\big) = \lim_{x \to \infty} \frac{\ln\big(1 - 2\exp(-(ax + b))\big)}{\frac{1}{cx + d}} = \lim_{x \to \infty} \frac{\frac{2a\exp(-(ax+b))}{1 - 2\exp(-(ax+b))}}{\frac{-c}{(cx+d)^2}} \ \text{(L'Hospital's rule)} = \lim_{x \to \infty} -\frac{2a(cx+d)^2 \exp(-(ax+b))}{c\big(1 - 2\exp(-(ax+b))\big)} = 0 \quad (24)$$

since the exponential term decays faster than $(cx+d)^2$ grows. Applying Lemma 4, we obtain the final result of Theorem 1 as
$$\lim_{n \to \infty} \Pr(\Phi_{\sigma_k, E_k^u \setminus E_\tau^u} \geq \Phi_{\sigma_0, E_k^u \setminus E_\tau^u}) = 1 \quad (25)$$
5.4 Relation Between the Global Structure Utility and De-anonymization Capability
In Definition 1, we consider a straightforward local neighborhood utility metric, which evaluates the distortion between the adjacency matrices of the two graphs, i.e., $\|A_a - A\|_1$. However, the real-world data utility is application-oriented, so we need to consider a more general utility metric that incorporates more aggregate information of the graph instead of just the adjacency matrix. Motivated by the general utility distance in [31,32], we utilize the $w$-th power of the transition probability matrix, $T^w$, which is induced by the $w$-hop random walk on graph $G$, to define the global structure utility as follows:

Definition 2. The global structure utility for the anonymized graph $G_a$ is defined as
$$U_{a(w)} = 1 - \frac{\|T_a^w - T^w\|_1}{2N} \quad [1] \quad (26)$$
where $T_a^w, T^w$ are the $w$-th powers of the transition probability matrices $T_a, T$, respectively. The denominator in Eq. 26 is a normalization factor to guarantee $U_{a(w)} \in [0, 1]$. Similarly, the global structure utility for the auxiliary graph is
$$U_{u(w)} = 1 - \frac{\|T_u^w - T^w\|_1}{2N} \quad [1] \quad (27)$$
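A direct computation of this metric (our own sketch, using row-normalized adjacency matrices) looks as follows:

```python
import numpy as np

def transition_matrix(A):
    deg = A.sum(axis=1, keepdims=True)
    return A / np.where(deg == 0, 1, deg)  # rows of isolated nodes stay zero

def global_structure_utility(A_anon, A_orig, w):
    """U_(w): compare the w-step transition matrices of the two graphs."""
    Tw_a = np.linalg.matrix_power(transition_matrix(A_anon), w)
    Tw   = np.linalg.matrix_power(transition_matrix(A_orig), w)
    return 1.0 - np.abs(Tw_a - Tw).sum() / (2 * A_orig.shape[0])

A  = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
Aa = A.copy(); Aa[0, 1] = Aa[1, 0] = 0  # one deleted edge
print(global_structure_utility(Aa, A, w=2))  # slightly below 1.0
```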
Our metric of global structure utility in Definition 2 is intuitively reasonable for a broad class of real-world applications, and captures the w-hop random walks between the conceptually underlying graph G and the anonymized graph Ga . We note that random walks are closely linked to structural properties of real-world data. For example, a lot of high-level social network based applications such as recommendation systems [33], Sybil defenses [34] and anonymity systems [35] directly perform random walks in their protocols. The parameter w is application specific; for applications that require access to fine-grained community structure, such as recommendation systems [33], the value of w should be small (typically 2 or 3). For
other applications that utilize coarse and macro community structure of the data, such as Sybil defense mechanisms [34], $w$ can be set to a larger value (typically around 10). Therefore, our global structure utility metric can quantify the utility performance of a perturbed graph for various real-world applications in a general and universal manner.

Based on this general utility metric, we further theoretically analyze the de-anonymization capability of structure-based attacks and quantify the anonymized utility for successful de-anonymization.

Theorem 2. For any $\sigma_k \neq \sigma_0$, where $k$ is the number of incorrectly-mapped nodes between $G_a$ and $G_u$, $\lim_{n \to \infty} \Pr(\Phi_{\sigma_k} \geq \Phi_{\sigma_0}) = 1$ when the following conditions are satisfied:
$$U_{u(w)} + 2l\, U_{a(w)} > 1 + 2l - \frac{wRl(N-1)}{2} \quad [1] \quad (28)$$
$$U_{u(w)} + 2(1-h)\, U_{a(w)} > 1 + 2(1-h) - \frac{w(N-1)(1-h)(1-R)}{2} \quad [1] \quad (29)$$
$$U_{u(w)} + 2l\,\frac{1-R}{R}\, U_{a(w)} > 1 + 2l\,\frac{1-R}{R} - \frac{wl(1-R)(N-1)}{2} \quad [1] \quad (30)$$
$$U_{u(w)} + 2(1-h)\,\frac{R}{1-R}\, U_{a(w)} > 1 + 2(1-h)\,\frac{R}{1-R} - \frac{wR(1-h)(N-1)}{2} \quad [1] \quad (31)$$
Proof. We first relate the adjacency matrix $A$ to the transition probability matrix $T$ as $A = \Lambda T$, where $\Lambda$ is a diagonal matrix with $\Lambda(i,i) = \deg(i)$. Then we analyze the global utility distance for the anonymized graph. When $w = 1$, we can prove $\|A_a - A\|_1 = \|\Lambda_a T_a - \Lambda T\|_1 = \|\Lambda_a T_a - \Lambda_a T + \Lambda_a T - \Lambda T\|_1 \geq \|\Lambda_a\|_1 \|T_a - T\|_1$. Since the elements in the diagonal of $\Lambda_a$ are greater than 1, we have $\|A_a - A\|_1 \geq \|T_a - T\|_1$. Therefore, we can obtain $U_a \leq U_{a(w)}$. Similarly, we have $U_u \leq U_{u(w)}$. Incorporating these two inequalities into Eqs. 3, 4, 5 and 6, we have Theorem 2 satisfied under $w = 1$. Next, we consider $w \geq 1$. It is easy to prove that $\|T_a^w - T^w\|_1 \leq w\|T_a - T\|_1$, so $\|T_a^w - T^w\|_1 \leq w\|A_a - A\|_1$. Therefore, we have $U_a \leq wU_{a(w)} + 1 - w$. Similarly, we also have $U_u \leq wU_{u(w)} + 1 - w$ for the auxiliary graph. Incorporating these two inequalities into Eqs. 3, 4, 5 and 6, we thus have Theorem 2 proved.

Similar to Theorem 1, when the global structure utility for the anonymized graph and the auxiliary graph satisfies all four conditions in Theorem 2, we can achieve successful de-anonymization from a statistical perspective. With rather high probability, the attacker can find the correct mapping between the anonymized graph and the auxiliary graph by choosing the mapping with the minimal DCN out of all the potential mappings.

Furthermore, both Theorems 1 and 2 give meaningful guidelines for future designs of de-anonymization and anonymization schemes: (1) since successful de-anonymization is theoretically achievable when the anonymized utility satisfies the conditions in Theorem 1 (for the local neighborhood utility) and Theorem 2 (for the global structure utility), the gap between the practical de-anonymization accuracy and the theoretically achievable performance can be utilized to evaluate the effectiveness of a real-world de-anonymization attack; (2) we can also leverage Theorems 1 and 2 for designing future secure data publishing to defend against de-anonymization attacks. For instance, a secure data publishing scheme should provide anonymized utility that lies outside the theoretical bound (green shaded region) in Fig. 1 while still enabling real-world applications. We will provide a practical analysis of such privacy and utility trade-offs in Sect. 6.

Table 1. De-anonymization accuracy of state-of-the-art de-anonymization methods.
accuracy and the theoretically achievable performance can be utilized to evaluate the effectiveness of a real-world de-anonymization attack; (2) we can also leverage Theorems 1 and 2 for designing future secure data publishing to defend against de-anonymization attacks. For instance, a secure data publishing scheme should provide anonymized utility that locates out of the theoretical bound (green shaded region) in Fig. 1 while enabling real-world applications. We will provide a practical analysis for such privacy and utility tradeoffs in Sect. 6. Table 1. De-anonymization accuracy of state-of-the-art de-anonymization methods. Datasets
Method
N oise = 0.05 N oise = 0.15 N oise = 0.25
Facebook
Ji et al. [9]
0.95
0.81
0.73
Facebook
Nilizadeh et al. [10] 0.83
0.74
0.68
0.27
0.09
0.05
Collaboration Nilizadeh et al. [10] 0.58
0.23
0.11
Twitter: small Ji et al. [9]
Collaboration Ji et al. [9]
6
0.55
0.39
0.21
Twitter: small Nilizadeh et al. [10] 0.92
0.79
0.69
Twitter: large Ji et al. [9]
0.48
0.21
0.12
Twitter: large Nilizadeh et al. [10] 0.91
0.66
0.21
Practical Privacy and Utility Trade-Off
In this section, we show how the theoretical analysis in Sect. 4 can be utilized to evaluate the privacy risks of practical data publishing and the performance of practical de-anonymization attacks. To enable real-world applications without compromising the privacy of users, a secure data anonymization scheme should provide anonymized utility which does not lie within the utility region for perfect de-anonymization, shown as the green shaded regions in Fig. 1. From a data publisher's point of view, we consider the worst-case attacker who has access to perfect auxiliary information, i.e., $noise_u = 0$. Based on Theorem 1, we aim to quantify the amount of noise on the anonymized data under which successful de-anonymization remains achievable. Note that our derivation is from a statistical point of view instead of from the perspective of a concrete graph.

Theorem 3. When the noise of the anonymized graph is less than 0.25, successful de-anonymization can be theoretically achieved.

Proof. For the anonymization method of Hay et al. in [2], we have $p_a^{del} = k_a/M_a$ and $p_a^{add} = k_a/(\binom{N}{2} - M_a)$. Similarly, we have $p_u^{del} = k_u/M_u$ and $p_u^{add} = k_u/(\binom{N}{2} - M_u)$. Based on our utility metric in Definition 1, we have $U_u = 1 - 2R \times noise_u$ and $U_a = 1 - 2R \times noise_a$. Considering the sparsity property of most real-world structural graphs [7], the utility condition for achieving successful de-anonymization is restricted by Eq. 3, which can be represented as $noise_u + 2l \times noise_a < \frac{l}{2}$. Consider the worst-case attacker who has access to perfect auxiliary information, i.e., $noise_u = 0$. Therefore, we have $noise_a < 0.25$.
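The noise threshold of Theorem 3 can be read off directly from Eq. 3 under this noise model; a minimal sketch (assuming the $U = 1 - 2R \cdot noise$ relation derived above) follows:

```python
def de_anonymizable(noise_a, noise_u, l):
    """Eq. 3 rewritten in terms of noise levels: noise_u + 2*l*noise_a < l/2."""
    return noise_u + 2 * l * noise_a < l / 2

l = 0.01  # smallest edge probability; the 0.25 threshold does not depend on it
print(de_anonymizable(noise_a=0.24, noise_u=0.0, l=l))  # True: breach possible
print(de_anonymizable(noise_a=0.26, noise_u=0.0, l=l))  # False: outside Eq. 3
```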
Therefore, when the noise added to the anonymized graph is less than 0.25, there would be a serious privacy breach, since successful de-anonymization is theoretically achievable. Note that such a utility bound only conservatively provides the minimum noise that should be added to the anonymized data. In practice, we suggest that a real-world data publisher add more noise to protect the privacy of the data. Furthermore, such a privacy-utility trade-off can be leveraged as a guide for designing new anonymization schemes.

In addition, our derived theoretical analysis can also be utilized to evaluate the performance of existing de-anonymization attacks. We implement our experiments on the Facebook dataset [37], the Collaboration dataset [36] and the Twitter dataset [10]. The Facebook dataset [37] contains 46,952 nodes (i.e., users) connected by 876,993 edges (i.e., social relationships). The Collaboration dataset [36] is a network of coauthorships between scientists who have posted preprints on the Condensed Matter E-Print Archive, which consists of 36,458 users and 171,735 edges. The Twitter dataset [10] captures the connections between users who mentioned each other at least once between March 24th, 2012 and April 25th, 2012, and contains two different graphs named Twitter (small), with 9,745 users and 50,164 edges, and Twitter (large), with 90,331 users and 358,422 edges.

To evaluate the performance of existing de-anonymization attacks, we consider a popular perturbation method of Hay et al. in [2], which applies a sequence of $r$ random edge deletions followed by $r$ random edge insertions. A similar perturbation process has been utilized for the de-anonymization attacks in [10]. Candidates for edge deletion are sampled uniformly at random from the space of the existing edges in graph $G$, while candidates for edge insertion are sampled uniformly at random from the space of edges that do not exist in $G$. Here, we define noise (perturbations) as the extent of edge modification, i.e., the ratio of altered edges $r$ to the total number of edges, i.e., $noise = \frac{r}{M}$. Note that we add the same amount of noise to the original graph to obtain the anonymized graph and the auxiliary graph, respectively. Then, we apply the state-of-the-art de-anonymization attacks in [9,10] to de-anonymize the anonymized graph by leveraging the auxiliary graph.

We utilize Accuracy as an effective evaluation metric to measure the de-anonymization performance. Accuracy is the ratio of the correctly de-anonymized nodes out of all the overlapped nodes between the anonymized graph and the auxiliary graph:
$$Accuracy = \frac{N_{cor}}{|V_a \cap V_u|} \quad [1] \quad (32)$$
where Ncor is the number of correctly de-anonymized nodes. The Accuracy of these de-anonymization attacks corresponding to different levels of noise is shown in Table 1. From Table 1, we can see that the state-of-the-art de-anonymization attacks can only achieve less than 75% de-anonymization accuracy when the noise is 0.25, which demonstrates the ineffectiveness of previous work and the potential of developing more powerful de-anonymization attacks in the future.
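A compact sketch of this experimental pipeline (our own simplified code, not the original implementation) combines the perturbation process with the Accuracy metric of Eq. 32:

```python
import itertools, random

def perturb(nodes, edges, noise, rng):
    """r = noise*|E| random edge deletions followed by r random insertions."""
    edges = set(edges)
    r = int(round(noise * len(edges)))
    edges -= set(rng.sample(sorted(edges), r))
    absent = [p for p in itertools.combinations(sorted(nodes), 2) if p not in edges]
    edges |= set(rng.sample(absent, r))
    return edges

def accuracy(mapping, Va, Vu):
    """Eq. 32: fraction of overlapped nodes that are correctly re-identified."""
    overlap = Va & Vu
    return sum(1 for i in overlap if mapping.get(i) == i) / len(overlap)

rng = random.Random(7)
V = set(range(10))
E = {tuple(sorted((i, (i + k) % 10))) for i in V for k in (1, 2)}
Ga, Gu = perturb(V, E, 0.15, rng), perturb(V, E, 0.15, rng)
guess = {i: i for i in V}        # placeholder for a real attack's output
print(accuracy(guess, V, V))     # 1.0 for the identity mapping
```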
7 Discussion
There Is a Clear Trade-Off Between Utility and Privacy for Data Publishing. In this work, we analytically quantify the relationships between the utility of anonymized data and the de-anonymization capability. Our quantification results show that privacy could be breached if the utility of the anonymized data is high. Hence, striking the balance between utility and privacy for data publishing is important yet difficult: providing high utility for real-world applications decreases the data's resistance to de-anonymization attacks.

Suggestions for Secure Data Publishing. Secure data publishing (sharing) is important for companies (e.g., online social network providers), governments and researchers. Here, we give several general guidelines: (i) Data owners should carefully evaluate the potential vulnerabilities of the data before publishing. For example, our quantification result in Sect. 4 can be utilized to evaluate the vulnerabilities of structural data. (ii) Data owners should develop proper policies on data collection to defend against adversaries who aim to leverage auxiliary information to launch de-anonymization attacks. To mitigate such privacy threats, online social network providers, such as Facebook, Twitter, and Google+, should reasonably limit the access to users' social relationships.
8 Conclusion
In this paper, we theoretically analyze de-anonymization attacks and provide conditions on the utility of the anonymized data (denoted by anonymized utility) for achieving successful de-anonymization under a general graph model. Our analysis provides a theoretical foundation for structure-based de-anonymization attacks, and can serve as a guide for designing new de-anonymization/anonymization systems in practice. By comparing experimental results with the theoretically achievable de-anonymization capability derived in our analysis, the effectiveness of practical attacks can be assessed; future work includes studying our utility-versus-privacy trade-offs for more datasets, and designing more powerful anonymization/de-anonymization approaches.
References

1. Lee, W.-H., Liu, C., Ji, S., Mittal, P., Lee, R.B.: Quantification of de-anonymization risks in social networks. In: Information Systems Security and Privacy (ICISSP), SCITEPRESS (2017)
2. Hay, M., Miklau, G., Jensen, D., Weis, P., Srivastava, S.: Anonymizing social networks. Computer Science Department Faculty Publication Series (2007)
3. Liu, K., Terzi, E.: Towards identity anonymization on graphs. In: SIGMOD (2008)
4. Pedarsani, P., Grossglauser, M.: On the privacy of anonymized networks. In: SIGKDD (2011)
5. Srivatsa, M., Hicks, M.: Deanonymizing mobility traces: using social network as a side-channel. In: CCS (2012)
6. Backstrom, L., Dwork, C., Kleinberg, J.: Wherefore art thou r3579x?: anonymized social networks, hidden patterns, and structural steganography. In: WWW (2007)
7. Narayanan, A., Shmatikov, V.: Robust de-anonymization of large sparse datasets. In: IEEE S&P (2008)
8. Narayanan, A.: De-anonymizing social networks. In: IEEE S&P (2009)
9. Ji, S., Li, W., Srivatsa, M., Beyah, R.: Structural data de-anonymization: quantification, practice, and implications. In: CCS (2014)
10. Nilizadeh, S., Kapadia, A., Ahn, Y.-Y.: Community-enhanced de-anonymization of online social networks. In: CCS (2014)
11. Lee, W.-H., Liu, C., Ji, S., Mittal, P., Lee, R.: Blind de-anonymization attacks using social networks. In: Proceedings of the 16th Workshop on Privacy in the Electronic Society. ACM (2017)
12. Hay, M., Miklau, G., Jensen, D., Towsley, D., Weis, P.: Resisting structural reidentification in anonymized social networks. VLDB Endowment 1, 102–114 (2008)
13. Sharad, K., Danezis, G.: An automated social graph de-anonymization technique. In: Proceedings of the 13th Workshop on Privacy in the Electronic Society. ACM (2014)
14. Sharad, K., Danezis, G.: De-anonymizing D4D datasets. In: Workshop on Hot Topics in Privacy Enhancing Technologies (2013)
15. Buccafurri, F., Lax, G., Nocera, A., Ursino, D.: Discovering missing me edges across social networks. Inf. Sci. 319, 18–37 (2015)
16. Machanavajjhala, A., Kifer, D., Gehrke, J., Venkitasubramaniam, M.: l-diversity: privacy beyond k-anonymity. In: ACM Transactions on Knowledge Discovery from Data (2007)
17. Dwork, C.: Differential privacy. In: Bugliesi, M., Preneel, B., Sassone, V., Wegener, I. (eds.) ICALP 2006. LNCS, vol. 4052, pp. 1–12. Springer, Heidelberg (2006). https://doi.org/10.1007/11787006_1
18. Liu, C., Chakraborty, S., Mittal, P.: Dependence makes you vulnerable: differential privacy under dependent tuples. In: NDSS (2016)
19. Sala, A., Zhao, X., Wilson, C., Zheng, H., Zhao, B.Y.: Sharing graphs using differentially private graph models. In: IMC (2011)
20. Xie, J., Kelley, S., Szymanski, B.K.: Overlapping community detection in networks: the state-of-the-art and comparative study. ACM Comput. Surv. (CSUR) 45(4), 43 (2013)
21. Pedarsani, P., Figueiredo, D.R., Grossglauser, M.: A Bayesian method for matching two similar graphs without seeds. In: Allerton (2013)
22. Ji, S., Li, W., Mittal, P., Hu, X., Beyah, R.: SecGraph: a uniform and open-source evaluation system for graph data anonymization and de-anonymization. In: USENIX Security Symposium (2015)
23. Erdős, P., Rényi, A.: On the evolution of random graphs. In: Selected Papers of Alfréd Rényi (1976)
24. Newman, M.: Networks: An Introduction. Oxford University Press, Oxford (2010)
25. Newman, M.E.: The structure and function of complex networks. SIAM Rev. 45, 167–256 (2003)
26. Fabiana, C., Garetto, M., Leonardi, E.: De-anonymizing scale-free social networks by percolation graph matching. In: INFOCOM (2015)
27. Ji, S., Li, W., Gong, N.Z., Mittal, P., Beyah, R.: On your social network de-anonymizablity: quantification and large scale evaluation with seed knowledge. In: NDSS (2015)
28. Korula, N., Lattanzi, S.: An efficient reconciliation algorithm for social networks. Proc. VLDB Endowment 7, 377–388 (2014)
29. Pham, H., Shahabi, C., Liu, Y.: EBM: an entropy-based model to infer social strength from spatiotemporal data. In: SIGMOD (2013)
30. Hamming, R.W.: Error detecting and error correcting codes. Bell Syst. Tech. J. 29, 147–160 (1950)
31. Mittal, P., Papamanthou, C., Song, D.: Preserving link privacy in social network based systems. In: NDSS (2013)
32. Liu, C., Mittal, P.: LinkMirage: enabling privacy-preserving analytics on social relationships. In: NDSS (2016)
33. Andersen, R., Borgs, C., Chayes, J., Feige, U., Flaxman, A., Kalai, A., Mirrokni, V., Tennenholtz, M.: Trust-based recommendation systems: an axiomatic approach. In: WWW (2008)
34. Yu, H., Gibbons, P.B., Kaminsky, M., Xiao, F.: SybilLimit: a near-optimal social network defense against sybil attacks. In: IEEE S&P (2008)
35. Mittal, P., Wright, M., Borisov, N.: Pisces: anonymous communication using social networks. In: NDSS (2013)
36. Newman, M.E.: The structure of scientific collaboration networks. Proc. Nat. Acad. Sci. 98, 404–409 (2001)
37. Viswanath, B., Mislove, A., Cha, M., Gummadi, K.P.: On the evolution of user interaction in Facebook. In: ACM Workshop on Online Social Networks (2009)
A Security Pattern Classification Based on Data Integration

Sébastien Salva and Loukmen Regainia

LIMOS CNRS UMR 6158, Clermont Auvergne University, Clermont-Ferrand, France
{sebastien.salva,loukmen.regainia}@uca.fr
Abstract. Security patterns are design patterns specialised to provide reusable and general solutions to recurring security problems. These patterns, which capture the strengths of different security approaches, are intended to make the design of maintainable and secure applications easier. The pattern community is continuously providing new security patterns (180 patterns are available at the moment). For a given problem, this growing pattern set, along with the abstract presentations of the patterns, makes the security pattern choice tedious, even for experts in software design. We contribute to this issue by presenting a method of security pattern classification based upon data extraction and integration. The pattern classification is semi-automatically inferred by means of a data-store integrating disparate publicly available security data. This classification exposes relationships among software attacks, weaknesses, security principles and security patterns. It expresses the pattern combinations that can counter a given attack. Besides the pattern classification, we show that the data-store can be used to generate Attack Defense Trees. In our context, these illustrate, for a given attack, its sub-attacks and the related defenses given in the form of security pattern combinations. Such trees make the pattern classification more readable, even for beginners in security patterns. Finally, we evaluate on 25 human subjects the benefits of using Attack Defense Trees and a classification established for Web applications, which covers 215 attacks, 136 software weaknesses, 66 security principles and 26 security patterns.

Keywords: Security patterns · Classification · Data integration · CAPEC attacks · CWE weaknesses · Attack-defense trees
1 Introduction
Design patterns are recurring solutions to software design problems proposed and used by skilled application or system designers. They are more and more considered in the industry, since they may accelerate the design stage of the software life cycle and help in code readability and maintenance. As the interest in software security has continuously grown for a few years, specialised patterns have also emerged to help design secure applications. These, called security patterns,
are defined as reusable elements to design secure applications, which will enable software architects and designers to produce a system that meets their security requirements and that is maintainable and extensible from the smallest to the largest systems [1]. Schumacher also postulates that security patterns relate countermeasures to threats and attacks in a given context [2]. Security patterns are often presented at a high level of abstraction with texts, and sometimes with UML diagrams, so as to be reusable in different kinds of contexts. Since 1997, the number of security patterns has been continuously growing. The repository given in [3] lists around 180 security patterns. Because of the abstract nature of security patterns, and because the documents are not structured in the same manner, the choice of the most appropriate pattern to solve a security problem is difficult with regard to a given context, and somewhat perilous for novice designers [4], as a wrong choice may imply the use of useless countermeasures or the addition of new vulnerabilities in the design and code of the application. As designers cannot be experts in all the software engineering fields, security pattern classifications have been published in the literature to help them find the most appropriate patterns according to the application requirements. Several classifications were proposed to arrange security patterns into different categories, e.g., by security principles [5,6], by application domains [7] (software, network, user, etc.), by vulnerabilities [4,8] or by attacks [4,9]. Despite the improvements in the pattern choice, several issues still remain open. Among them, we noticed that these classifications are manually devised by directly comparing the textual descriptions of security concepts (vulnerabilities, attacks, patterns, etc.). As these descriptions are generic and have different purposes, the categorisation of a pattern can be done only when there is an evident relation between it and other security concepts. Besides, as these classifications are not deterministic (no strict definition of the classification process [6]), it often becomes delicate to upgrade them.

These observations lead to the purpose of this paper, which is to establish a strict security pattern classification method, composed of several successive steps. These lead to a pattern classification that organises the security patterns that can be used to counter an attack. This paper extends the work we initiated in [10] and includes the following contributions:

– we present a data-store architecture and an integration method that extracts data from various Web and publicly accessible sources and stores relationships among attacks, security principles and security patterns. The data-store integrates data coming from the CAPEC base [11], several papers dealing with security principles [12–16] and from the pattern catalogue given in [17]. It also integrates the studies about inter-pattern relations [5,18]. All these steps provide the detailed justifications of the resulting classification;
– we automatically derive a security pattern classification from the data-store, providing the pattern combinations that can be used to counter an attack. The inter-pattern relations integrated into the data-store offer the advantage of making apparent the dependencies among patterns as well as the conflicting or alternative patterns. We have generated a classification specialised to Web applications, which includes 215 CAPEC attacks, 136 CWE weaknesses,
and 26 security patterns covering varied security aspects. To the best of our knowledge, this is the largest classification for this application domain. It is stored in a database available in [19];
– we automatically generate Attack-Defence Trees (shortened ADTrees [20]), which aim at supplementing the classification with illustrations depicting, for a given attack, its sub-attacks along with defenses, expressed here with security patterns. Such ADTrees improve the understanding of the previous classification.

We employed the classification to evaluate, on 25 human subjects, the benefits of using our pattern classification and ADTrees in terms of Comprehensibility, Effectiveness and Accuracy.

Paper Organisation: Sect. 2 presents the related work and the motivations of the approach. In addition, we introduce some security notions and data used throughout the paper. The extraction and integration of security data into the data-store are given in Sect. 3. The next section shows how we automatically extract the pattern classification and ADTrees from the data-store. Then, we evaluate it in Sect. 6. We discuss the advantages and limitations of our classification in Sect. 7. We traditionally conclude and give some perspectives for future work in Sect. 8.
2 Background

2.1 Related Work and Motivations
The growing number of security patterns available in the literature makes the choice of the most appropriate ones very difficult for overcoming a security problem. In order to ease this task, several pattern catalogues and classifications were proposed [3,4,8,9,17,21–23]. An overview of these documents is given by Bunke et al., who reviewed the papers dealing with security patterns between 1997 and 2012 [7]. They listed a set of classification criteria and established a comparison between design patterns and security pattern classifications. They finally proposed their own classification based upon the application domains of patterns (software, network, user, etc.).

Vulnerabilities are taken into consideration for pattern classification in [4,8]. Compared to the previous paper, these give another point of view, helping designers in the choice of patterns to fix software vulnerabilities. The classifications exposed in [4,9,22,23] expose pattern categories by focusing on the attacker side and attacks. This choice of categorisation seems quite interesting and meaningful, as attacks are more and more known and examined by designers. Wiesauer et al. initially introduced in [9] a short taxonomy of security design patterns made from links found in the textual descriptions of attacks and security patterns. They claimed that 40 security patterns can be connected to attacks, but only a few examples are given. These examples are associated with one or two patterns only. Tondel et al. presented in [22] the combination of three formalisms of security
modelling (misuse cases, attack trees and security activity models) in order to give a more complete security modelling approach. In their method for building attack trees, they linked some activities of attack trees with CAPEC attacks; they also connected some activities of SAGs (security activity diagrams) with security patterns. The relationships among security activities and security patterns are manually extracted from documentation and are not explained. Shortly after, Alvi and Zulkernine presented a classification scheme for security patterns putting together CAPEC attacks and security patterns for the implementation phase of the software life cycle [4]. They analysed some security pattern templates available in the literature and proposed a new template associated with software lifecycle phases. They considered around 20 attacks and linked them to 5 patterns. They also manually augmented the CAPEC attack documentation with a section named "Relevant security patterns" composed of some patterns [4]. After inspecting the CAPEC base, we observed that this section is seldom available, which limits its use and interest. Finally, Uzunov et al. introduced in [23] a classification of security threats and patterns specialised for distributed systems. They proposed a library of threats and their relationships with security patterns in order to reduce the expertise level required for the design of secure applications. They considered that their threat patterns are abstract enough to encompass security problems related to the context of distributed systems [23].

Open Issues and Contributions. Alvi and Zulkernine outlined 24 pattern catalogues and classifications in [6] and established a comparative study to point out their positive and negative aspects. They chose 29 classification attributes (purpose, abstraction levels, lifecycle, etc.) and compared the classifications against a set of nine desirable quality criteria (Navigability, Completeness, Usefulness, etc.). They observed that several classifications were built in reference to a unique classification attribute, which appears to be insufficient; they concluded that the use of multiple attributes enables pattern selection in a faster and more accurate manner. Yskout et al. also reported that security pattern adoption is limited, possibly due to the sub-optimal quality of the documentation [17]. We indeed believe that security pattern classifications lack Navigability and Comprehensibility, which are quality criteria respectively defined as: the ability to direct a software designer among collaborative and related patterns; and the ease with which patterns are understood by both novice and expert developers.

We also observed that the main issue of the above works lies in the lack of a precise method to build the classification. All of them are based upon the interpretation of different documents, which are converted to abstract relationships. The first consequence of these interpretations is the difficulty to extend these classifications. In addition, it is sometimes tricky to understand the reasons for the relationships established between attacks and patterns. In [24], we introduced a first semi-automatic classification method and the classification itself, which exposes relationships among 185 software weaknesses of the CWE base [25], security principles and 26 security patterns. The
classification groups the patterns that partially mitigate a given weakness with respect to the security principles that have to be addressed to fix the weakness. In [10], we introduced another classification method to categorise the security patterns that can be used to counter attacks. We extend this work by describing the meta-model of the data-store used to automatically infer the pattern classification. The classification process, which is built on data acquisition, is composed of six manual and automatic steps. These offer the advantage of justifying the pattern classification and of reducing the effort required to add new patterns or attacks to the classification. Finally, we complete the classification with ADTrees illustrating attacks, sub-attacks and security patterns as defenses. These are generated after the choice of an attack in the classification and remain up-to-date.

2.2 Publicly Accessible Resources Used for the Data Integration
Security Patterns. A security pattern is a generic solution to a recurring security problem, which is characterised by a set of structural and behavioural properties. A pattern is described with textual sections called intents, forces and consequences. These sections point out the features of a pattern, called strong points [26]. For a security pattern, strong points characterise the forces and the consequences brought by the use of the pattern against a security problem. In addition, a security pattern can be documented to express its relationships with other patterns. Such annotations noticeably help combine patterns soundly and avoid devising unsound composite patterns. Yskout et al. defined the following annotations between two patterns p1 and p2 [5]:

– "depend" means that the implementation of p1 requires the implementation of p2;
– "benefit" models that implementing p2 completes p1 with extra security functionalities or decreases the development time;
– "alternative" expresses that p2 is a different pattern fulfilling the same functionality as p1;
– "impair" means that the functioning of p1 can be obstructed by the implementation of p2, but both may be used together;
– "conflict" encodes the fact that if both p1 and p2 are implemented together, then inconsistencies shall result.

For example, Fig. 1 portrays the UML class diagram of the pattern "Application Firewall", whose purpose is to filter requests and responses to and from an application, based on access control policies. This security pattern structures an application in such a way that the input filtering logic is centralised and decoupled from the functional logic of the application. This is a strong point of this pattern. "Application Firewall" is related to two other security patterns [5]: it is an alternative to the patterns "Input Guard" and "Output Guard" since it is able to filter input calls and output responses from the application.
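To fix ideas, the following minimal sketch shows one possible way of encoding these annotations programmatically. It is only an illustration: the enumeration mirrors the five annotations of Yskout et al. [5], while the helper function and the sample relation table are our own hypothetical constructs.

```python
from enum import Enum
from typing import Optional


class Annotation(Enum):
    """The five inter-pattern annotations defined by Yskout et al. [5]."""
    DEPEND = "depend"            # p1 requires the implementation of p2
    BENEFIT = "benefit"          # p2 completes p1 with extra functionality
    ALTERNATIVE = "alternative"  # p2 fulfils the same functionality as p1
    IMPAIR = "impair"            # p2 obstructs p1, but both may coexist
    CONFLICT = "conflict"        # implementing both yields inconsistencies


# Illustrative relation table for the "Application Firewall" example.
RELATIONS = {
    ("Application Firewall", "Input Guard"): Annotation.ALTERNATIVE,
    ("Application Firewall", "Output Guard"): Annotation.ALTERNATIVE,
}


def annotation_of(p1: str, p2: str) -> Optional[Annotation]:
    """Return the annotation linking p1 to p2, if any."""
    return RELATIONS.get((p1, p2))


print(annotation_of("Application Firewall", "Input Guard"))
# Annotation.ALTERNATIVE
```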
Fig. 1. Security pattern "Application Firewall", reprinted from security pattern catalog, URL: https://people.cs.kuleuven.be/~koen.yskout/icse15/catalog.pdf, 2017.
CWE Weaknesses. The Common Weakness Enumeration (CWE) base [25] provides an open catalogue of software weaknesses, which are software or design mistakes that can lead to vulnerabilities. At the moment, this database includes around 1000 software weaknesses, and this number is still growing. A weakness is documented with a panoply of information, including a full description, its causes, detection methods, and relations with CAPEC attacks or vulnerabilities. In addition, a set of potential mitigations is often proposed.

CAPEC Attacks. The Common Attack Pattern Enumeration and Classification (CAPEC) is an open database offering a catalogue of attacks in a comprehensive schema [11]. Attack patterns are descriptions of common approaches that attackers take to target weaknesses of software or systems. An attack pattern, which we refer to here as documentation (to avoid confusion with security patterns), consists of several textual sections, e.g., Attack Execution Flow, Severity, etc. In our context, three sections are particularly interesting for starting a classification. The section Related Attack Patterns shows interdependences among attacks, which have different levels of abstraction: the first two levels (Category and Meta pattern) give attack mechanisms, while the last two levels, called "Standard pattern" and "Detailed attack pattern", gather the most concrete attacks. These interdependences provide a hierarchical organisation of attacks. Another section, called Related Weaknesses, lists the CWE weaknesses targeted by the attack. The section "Related security principles" aligns some principles, defined as desirable properties targeted by the attacks. At the moment, this section is often incomplete though.
Security Principles. A security principle is a desirable property, structure or behaviour of software that aims at reducing the impact and the likelihood of a threat realisation [13]. Principles capture the nature of closely related security tasks independently of their contexts. Numerous works have focused on security principles over the last four decades. Saltzer and Schroeder first established a set of eight best practices for system security [12], which were widely expanded to form security principles [13–16]. Most of these papers reflect the fact that a security principle has a level of abstraction: it may be the realisation of other security principles, or a subordinate principle of another one.
3 Data-Store Architectures
As stated previously, our classification aims to make the design of secure applications easier by providing the set of security patterns that can be used as countermeasures against a given attack (in reference to the security pattern definition of Schumacher [2]) and the relations among these patterns. Finding direct relations among attacks and security patterns by reading documentation is a hard problem, since the documents are presented quite differently and with different levels of abstraction. Instead, in order to later infer a precise classification, we chose to anatomise the security concepts available in documentation into more detailed properties that can be interconnected in an explicit manner.

The literature and some attack bases [11,25,27] have confirmed to us the importance of the following associations: an attack can be documented with more concrete attacks, which can themselves be segmented into steps; these steps can be performed with techniques and can be prevented with countermeasures. These properties and associations are modelled with the meta-model of Fig. 2. Besides, an attack also exploits a weakness, which may be composed of several more concrete weaknesses, and mitigations can be applied to treat them. These other associations are illustrated in Fig. 3. As for security patterns, they can be characterised with strong points, which are pattern features extractable from pattern descriptions. In addition, a security pattern can have relations with other patterns. Figures 2 and 3 depict these properties and relations with entities in the same way.

Countermeasures, mitigations and strong points refer to the notion of attack prevention. But directly finding relations among them is still an obscure task, as these properties initially have different purposes. To solve this issue, we chose to focus on security principles as mediators. As introduced by Wassermann and Cheng, security patterns are classifiable w.r.t. security principles, like most security concepts [28]. Here again, we consider that security principles are organised into a hierarchy, which shows the materialisation of a principle with more concrete ones. Countermeasures and mitigations are often detailed security properties. It turns out that gathering them into groups (clusters) often reduces the effort required to find connections with security principles without adding ambiguity. The choice of the cluster granularity, i.e., the size of the groups,
Fig. 2. Metamodel 1 of the data-store.
Fig. 3. Metamodel 2 of the data-store.
along with the principle organisation, offers a lot of flexibility to reach about the same abstraction level among strong points, principles, countermeasures and mitigations. In other words, these techniques help associate clusters, principles and strong points. These last security properties and associations are identically modelled in Figs. 2 and 3.

Both meta-models of Figs. 2 and 3 can be used to structure our data-store. A third possible meta-model could be achieved by blending the two previous ones. At the moment, we prefer avoiding this solution, as the countermeasures of an attack step and the mitigations of a weakness have different purposes: we believe that gathering them might bring confusing associations among security principles and clusters, and finally false relations among attacks and security patterns. After inspecting the available security data resources, e.g., [11,25,27], we observed that few documents provide the countermeasures of a given attack step. For instance, some countermeasures are provided in the CAPEC base, but not for all attack steps. In contrast, many countermeasures are listed for a weakness in the CWE base. In essence, the more security data we collect, the more precise the pattern classification will be. This is why we prefer using the meta-model of Fig. 3 for designing our data-store. The next section shows how the data integration is performed with this data-store.
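To make the meta-model of Fig. 3 more concrete, the following sketch materialises it as a small relational schema with SQLite. All table and column names are our own assumptions for illustration; they are not the actual schema of the data-store.

```python
import sqlite3

conn = sqlite3.connect("datastore.db")
conn.executescript("""
-- Entities of the meta-model (Fig. 3); names are illustrative.
CREATE TABLE attack     (id TEXT PRIMARY KEY, name TEXT,
                         parent_id TEXT REFERENCES attack(id));
CREATE TABLE weakness   (id TEXT PRIMARY KEY, name TEXT,
                         parent_id TEXT REFERENCES weakness(id));
CREATE TABLE cluster    (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE mitigation (id INTEGER PRIMARY KEY, text TEXT,
                         cluster_id INTEGER REFERENCES cluster(id));
CREATE TABLE principle  (id INTEGER PRIMARY KEY, name TEXT,
                         parent_id INTEGER REFERENCES principle(id));
CREATE TABLE pattern      (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE strong_point (id INTEGER PRIMARY KEY, text TEXT);

-- Many-to-many associations of the meta-model.
CREATE TABLE attack_weakness        (attack_id TEXT, weakness_id TEXT);
CREATE TABLE weakness_mitigation    (weakness_id TEXT, mitigation_id INTEGER);
CREATE TABLE cluster_principle      (cluster_id INTEGER, principle_id INTEGER);
CREATE TABLE pattern_strong_point   (pattern_id INTEGER, strong_point_id INTEGER);
CREATE TABLE strong_point_principle (strong_point_id INTEGER, principle_id INTEGER);
CREATE TABLE pattern_relation       (p1 INTEGER, p2 INTEGER, annotation TEXT);
""")
conn.commit()
```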
4 Data Integration
We present, in this section, the six steps required to integrate security data into the data-store. These aim at collecting security data and establishing the different relations illustrated in the meta-model of Fig. 3. Steps 1 to 5 produce databases, and Step 6 consolidates them so that every entity of the meta-model is related to the other ones as expected. Steps 1, 2 and 6 are automatically done with tools. Together, these steps offer the strong advantage of semi-automatically achieving a data-store that can be updated: for instance, if one wants to add a new attack, Steps 1 and 2 have to be followed; likewise, if a new security pattern is available in the literature, Steps 3 and 5 have to be applied. We have implemented these steps with scripts mostly based upon the tool Talend (https://talend.com/), an ETL (Extract, Transform, Load) tool that allows automated processing of data independently of the type of its source or destination.

We applied these steps to attacks, patterns and principles related to the Web application context and to data coming from different sources: the CAPEC and CWE bases, several papers dealing with security principles [12–16] and the pattern catalogue given in [17]. We provide some quantitative results related to this context with each step, but other kinds of systems can be considered as long as documentation is available. We also illustrate these steps with the pattern "Application Firewall" and with the attack "CAPEC-39: Manipulating Opaque Client-based Data Tokens", which corresponds to a threat on applications using tokens, e.g., cookies, holding personal data.

4.1 Step 1: CAPEC Attack Extraction and Organisation
We chose to focus on the CAPEC base to extract information about security attacks because it appears to be the most complete base, composed of the largest number of attacks explained in detail (steps, techniques, risks, security controls, etc.). We extracted attacks from the CAPEC base and organised them into a single tree that describes a hierarchy of attacks from the most abstract to the most concrete ones, so that we can get all the sub-attacks of a given attack. To reach that purpose, we rely on the relationships among attack descriptions found in the CAPEC section called Related Attack Patterns. By scrutinising all the CAPEC documents, it becomes possible to develop a hierarchical tree whose root node is unlabelled and connected to the most abstract attacks of the type "Category". These nodes are parents of attacks that belong to the type "Meta Attack pattern", and so on. The leaves are the most concrete attacks of the type "Detailed attack pattern". The relations among attacks ("parent of", "child of") are provided in the CAPEC base. Figure 4 shows the related attacks for the attack CAPEC-39. The abstraction level of each attack is expressed in the column "Type" (M stands for Meta pattern, C for Category, D for Detailed pattern); the links with other attacks are listed in the column "Nature". Figure 4 shows that CAPEC-39 has one sub-attack, "CAPEC-31: Accessing/Intercepting/Modifying HTTP Cookies".
Fig. 4. Hierarchical organisation of attacks for the attack CAPEC 39, adapted from the CAPEC base, URL: https://capec.mitre.org/, 2017.
This data extraction is automatically performed with a script, which yields a database DB1. From the CAPEC base Version 2.8, we collected 215 attacks for the Web application context.
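As an illustration of this step, the sketch below rebuilds the hierarchical tree from parent/child pairs, assuming those pairs have already been parsed out of the CAPEC section Related Attack Patterns. The sample relations are truncated to the CAPEC-39/CAPEC-31 example, with None standing for the unlabelled root; they are not the full hierarchy.

```python
from collections import defaultdict

# (parent, child) pairs taken from "Related Attack Patterns";
# truncated, illustrative sample (CAPEC-39 is not really top-level).
RELATIONS = [
    (None, "CAPEC-39"),
    ("CAPEC-39", "CAPEC-31"),
]


def build_tree(relations):
    """Build the hierarchy of attacks, from abstract to concrete ones."""
    children = defaultdict(list)
    for parent, child in relations:
        children[parent].append(child)
    return children


def sub_attacks(children, attack):
    """Return every sub-attack of a given attack (depth-first)."""
    result = []
    for child in children.get(attack, []):
        result.append(child)
        result.extend(sub_attacks(children, child))
    return result


tree = build_tree(RELATIONS)
print(sub_attacks(tree, "CAPEC-39"))  # ['CAPEC-31']
```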
4.2 Step 2: CWE Weakness and Mitigation Extraction
Given an attack of the database DB1, we automatically extracted the CWE weaknesses targeted by the attack. These can be found in a textual section of the CAPEC documents called Related Weaknesses. Weaknesses are grouped there into two categories, named Targeted and Secondary, which rank the impact degree of the attack on a weakness. We focused on the type Targeted (even though it could also be relevant to consider both types). These weaknesses are also described in the CWE base, which arranges them into a hierarchy of four abstraction levels. From the CWE base, we automatically gathered the more concrete weaknesses of every previous weakness and their respective mitigations, found in a textual section called CWE mitigations.

As depicted in Fig. 3, we later associate security principles with mitigations by grouping the latter into clusters. It turns out that the section "CWE mitigations" often groups mitigations by categories called Strategies. After a meticulous study of these groups, we observed that they can be associated with security principles without ambiguity. As a consequence, we have directly integrated them as mitigation clusters into the data-store. The outcome of this systematic extraction is stored in a database DB2, which encodes relations among 215 attacks, 136 CWE weaknesses, 130 mitigations and 15 clusters. Unsurprisingly, we observed that the attacks having the highest level of abstraction are seldom related to CWE weaknesses, whereas concrete attacks are connected to several mitigation clusters. The attack CAPEC-39 and its sub-attack CAPEC-31, taken as examples, target 18 CWE weaknesses, which illustrates that attacks map onto more concrete security flaws. Among them, we have "Improper Input Validation" or "External Control of Critical State Data". These weaknesses can be fixed by 17 mitigations, grouped into 8 clusters.
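Assuming the illustrative schema sketched in Sect. 3, the traversal performed by this step, from an attack to its targeted weaknesses and then to the mitigation clusters, could be expressed as follows. This is not the authors' Talend script, only an equivalent query sketch over hypothetical table names.

```python
import sqlite3


def clusters_for_attack(conn: sqlite3.Connection, attack_id: str):
    """Collect the mitigation clusters reachable from a given attack
    through its targeted CWE weaknesses (schema names are illustrative)."""
    rows = conn.execute("""
        SELECT DISTINCT c.name
        FROM attack_weakness aw
        JOIN weakness_mitigation wm ON wm.weakness_id = aw.weakness_id
        JOIN mitigation m           ON m.id = wm.mitigation_id
        JOIN cluster c              ON c.id = m.cluster_id
        WHERE aw.attack_id = ?
    """, (attack_id,))
    return [name for (name,) in rows]

# e.g., clusters_for_attack(conn, "CAPEC-39"), plus the same call for its
# sub-attack CAPEC-31, should yield the 8 clusters mentioned above.
```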
4.3 Step 3: Security Pattern and Strong Point Integration
We manually collected security patterns and their strong points from the catalogue given in [17]. Finding strong points can be a difficult task, as they are seldom explicitly provided; they often have to be deduced from the sections referring to the forces and intents of the patterns. Afterwards, we manually established two relations among patterns and strong points:

1. the first one is a many-to-many relation between security patterns and strong points, each pattern being characterised by a set of strong points that can be shared with other patterns. For example, the patterns "Authorization enforcer" and "Container managed security" share the strong point "Providing the application with authorization mechanism";
2. the second relation captures the inter-pattern relationships [5]. With P a set of patterns, we define a mapping from P × P to the annotation set {depend, benefit, impair, alternative}.

These data and relations, which provide connections among security patterns and strong points, are encoded into the database DB3. For the domain of Web applications, we gathered 26 security patterns and 36 strong points. For instance, the security pattern "Application Firewall" can be characterised with 8 strong points, e.g., "Providing the application with a perimeter security mechanism". "Application Firewall" is associated with two alternative patterns, "Output Guard" and "Input Guard".
4.4 Step 4: Security Principle Integration
We chose to organise security principles into a hierarchy, from the most abstract to the most concrete principles. This organisation gives a complete hierarchical view of security mechanisms, which are both required to cure weaknesses and provided by security patterns. As principles are hierarchically organised, we can link a strong point and a mitigation cluster even if they do not have exactly the same level of abstraction. For instance, consider a strong point and a cluster that are linked to two principles at different levels of the hierarchy: if one principle is a child of the other, then the strong point and the cluster will later be related in the classification. We collected 66 security principles related to Web applications from the papers [12–16] and manually established dependencies in accordance with the nature of each security principle, often described with text. The resulting hierarchy is certainly not exhaustive but covers the security patterns considered in the catalogue given in [17]. Figure 5 depicts the security principle hierarchy, which is stored in the database DB4. There are four levels, the first one being composed of elements labelled by the most abstract principles, e.g., "Access control".
Fig. 5. Security principles organisation.
4.5 Step 5: Association Among Strong Points, Security Principles and Mitigation Clusters
In this step, we incorporated into the data-store the many-to-many relations between strong points and security principles. We performed this step manually because strong points and principles are mostly presented with different keywords. We observed that the abstraction levels of the strong points better fit the most concrete security principles, which label the lowest-level nodes of the hierarchical organisation depicted in Fig. 5. But if a strong point is related to a principle sp that is not at the lowest level, then we also link the strong point with all the children of sp. Returning to the example of the security pattern "Application Firewall", its strong point "Providing the application with a perimeter security mechanism" can easily be associated with the principle "Perimeter security". As the latter has 3 children in the hierarchy of Fig. 5, the strong point is also related to them.
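The propagation rule just described (linking a strong point to a principle and to all of that principle's descendants) can be sketched as follows. The hierarchy fragment, in particular the children of "Perimeter security", reflects our reading of Fig. 5 and should be taken as illustrative.

```python
def descendants(children, principle):
    """All principles below a given one in the hierarchy of Fig. 5."""
    result = []
    for child in children.get(principle, []):
        result.append(child)
        result.extend(descendants(children, child))
    return result


def link_strong_point(links, children, strong_point, principle):
    """Link a strong point to a principle and to all its sub-principles."""
    links.add((strong_point, principle))
    for sub in descendants(children, principle):
        links.add((strong_point, sub))


# Illustrative fragment of the principle hierarchy.
children = {"Perimeter security": ["Firewalling", "Zones definition",
                                   "Intrusion detection and prevention"]}
links = set()
link_strong_point(
    links, children,
    "Providing the application with a perimeter security mechanism",
    "Perimeter security")
print(len(links))  # 4: the principle itself plus its 3 children
```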
In the same way, we established the many-to-many relations between mitigation clusters and security principles. In Step 3, the clusters group mitigations based upon the same security aspects, e.g., validating user inputs. Once these aspects are deduced, linking clusters and security principles becomes straightforward. For instance, the need for validating user inputs corresponds to the principle "Input validation", which belongs to the principle "Complete mediation" in the security principle hierarchy. These relations are materialised with the database DB5, which gathers 15 clusters, 36 strong points and 66 principles.
4.6 Step 6: Data Consolidation
This automatic step merges the previous databases DB1 to DB5 into a single one. On the one hand, DB1, DB2, DB4 and DB5 store the relations among attacks, weaknesses, mitigations and security principles. On the other hand, DB3, DB4 and DB5 store the relations among security patterns, strong points and security principles. The security principle hierarchy thus becomes the central point that helps match attacks with security patterns. This step is automatically performed with a script by means of the meta-model given in Fig. 3. The step produces the final database DBf, which is available in [19].
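A minimal sketch of this consolidation is given below, under the assumption that the intermediate databases are SQLite files; it copies every table of DB1 to DB5 into DBf with SQLite's ATTACH mechanism. The actual step relies on Talend scripts, so this is only an illustration of the principle.

```python
import sqlite3


def consolidate(final_path, sources):
    """Merge the intermediate databases DB1..DB5 into DBf by copying
    every table into the final database (illustrative sketch)."""
    conn = sqlite3.connect(final_path)
    for i, src in enumerate(sources):
        alias = f"db{i + 1}"
        conn.execute(f"ATTACH DATABASE ? AS {alias}", (src,))
        tables = conn.execute(
            f"SELECT name FROM {alias}.sqlite_master WHERE type = 'table'")
        for (table,) in tables.fetchall():
            # Create an empty copy of the table schema, then fill it.
            conn.execute(f"CREATE TABLE IF NOT EXISTS {table} "
                         f"AS SELECT * FROM {alias}.{table} WHERE 0")
            conn.execute(f"INSERT INTO {table} SELECT * FROM {alias}.{table}")
        conn.execute(f"DETACH DATABASE {alias}")
    conn.commit()
    return conn

# conn = consolidate("DBf.db",
#                    ["DB1.db", "DB2.db", "DB3.db", "DB4.db", "DB5.db"])
```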
5 Security Pattern Classification and ADTree Generation
The database DBf now holds enough information to organise security patterns and build ADTrees. This section explains how to automatically generate them.

5.1 Security Pattern Classification
We have chosen to catalogue the combinations of security patterns that could be used to counter an attack stored in DBf. More precisely, for a given attack Att, we extract:

– the information about the attack (name, identifier, description, etc.);
– the tree T(Att) of attacks, whose root is Att, if Att is not a leaf of the attack tree derived in Step 1;
– for each attack A of T(Att), the hierarchy of security principles Sp(A), by means of the successive relations established among A, weaknesses, clusters and security principles. Sp(A) represents the complete hierarchy of security principles related to an attack, i.e., if a principle sp of Sp(A) is not a leaf of the hierarchical organisation depicted in Fig. 5, then we also extract the principle sub-tree whose root is sp;
– for each principle sp in Sp(A), the set of security patterns Psp, and the set of patterns P2sp, not in Psp, that have relations with any pattern of Psp. We also extract the inter-pattern relationships defined for couples of patterns by the relations depend, benefit, impair, alternative and conflict (a query sketch of this traversal is given after the list).
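With the illustrative schema of Sect. 3, the core of this extraction (from an attack to the patterns countering it, through weaknesses, clusters, principles and strong points) boils down to a chain of joins such as the sketch below. The real extraction is performed with Talend and also collects T(Att), Sp(A) and the inter-pattern relations; for brevity, this sketch also ignores the propagation along the principle hierarchy.

```python
def patterns_for_attack(conn, attack_id):
    """Security patterns countering an attack, obtained by traversing the
    data-store relations through the security principles (illustrative)."""
    rows = conn.execute("""
        SELECT DISTINCT p.name
        FROM attack_weakness aw
        JOIN weakness_mitigation wm     ON wm.weakness_id = aw.weakness_id
        JOIN mitigation m               ON m.id = wm.mitigation_id
        JOIN cluster_principle cp       ON cp.cluster_id = m.cluster_id
        JOIN strong_point_principle spp ON spp.principle_id = cp.principle_id
        JOIN pattern_strong_point psp   ON psp.strong_point_id = spp.strong_point_id
        JOIN pattern p                  ON p.id = psp.pattern_id
        WHERE aw.attack_id = ?
    """, (attack_id,))
    return sorted(name for (name,) in rows)
```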
Fig. 6. Data extraction for the attack CAPEC-39.
Figure 6 depicts an extraction example for the attack CAPEC-39. The first column gives the attack identifier. The next column gives the security patterns that counter the attack. Columns 3 and 4 provide the inter-pattern relations, e.g., "Application Firewall" is an alternative to "Input Guard". The attack CAPEC-39 has one sub-attack, CAPEC-31, whose identifier is provided in Column 5. The three last columns give the security patterns that overcome the attack CAPEC-31 and their relations with other patterns.

The data extraction is automatically performed with a tool based upon Talend. Once the tool has covered all the attacks stored in the database DBf, we obtain the security pattern classification. This tool can be re-executed every time the data-store is updated, so the classification remains up-to-date accordingly. Unfortunately, we think that Comprehensibility, which refers to the ability of both experts and novices to use the classification, is not yet totally satisfied at this stage. Indeed, the classification is given in a tabular form, which does not appear to be the most user-friendly way to represent a classification. This is why we also propose to improve its readability with ADTrees.

5.2 Attack-Defense Tree Generation
ADTrees are graphical representations of possible measures an attacker might take in order to attack a system and the defenses that a defender can employ to protect the system [20]. ADTrees have two different kinds of nodes: attack nodes (red circles) and defense nodes (green squares). A node can be refined with child nodes and can have one child of the opposite type (linked with a dashed line). Node refinements are either disjunctive or conjunctive: the former is recognisable by edges going from a node to its children; the latter is graphically distinguishable by an arc connecting these edges.

We generate ADTrees having the general form illustrated in Fig. 7(a). The root of this tree is labelled by an attack. If the latter has sub-attacks, these are given in the tree as children linked with a disjunctive refinement, and so forth. Furthermore, the tree points out how to prevent attacks with defenses given in the form of security pattern combinations. A defense node is linked to an attack node with a dashed line. This defense node is either labelled by a security pattern, or is the root of a sub-tree showing patterns and their relations. We generate ADTrees with the following steps:
(a) Pattern classification representation with ADTrees
(b) Conflicting pattern representation with ADTree
Fig. 7. ADTree examples, reprinted from [10]. (Color figure online)
1. every attack found in DBf has its own ADTree, whose root node is labelled by its identifier. This root node is linked to other attack nodes with a disjunctive refinement if the attack has sub-attacks; this step is repeated for every sub-attack. In other words, we generate a sub-tree of the original hierarchical tree extracted in Step 1, whose root is an attack;
2. for every attack node A, we collect the set P of security patterns that counter the attack. The inter-pattern relationships are illustrated in the ADTree with new nodes and logic operations. Given a couple of patterns (p1, p2) ∈ P, if we have:
– (p1 R p2) with R a relation in {depend, benefit}, we build three defense nodes: one parent node labelled by R and two nodes labelled by p1 and p2, combined with this parent defense node by a conjunctive refinement;
– (p1 alternative p2), we build three defense nodes: one parent node labelled by alternative and two nodes labelled by p1 and p2, which are linked by a disjunctive refinement to the parent node;
– (p1 R p2) with R a relation in {impair, conflict}: in this particular case, we would want to use the xor operation, since both patterns can be used but the implementation of p2 decreases the efficiency of, or conflicts with, p1. Unfortunately, this operation is not available with this tree model. Therefore, we replace the operator by means of the classical formula (A xor B) → ((A or B) and not (A and B)). The not operation is here replaced by an attack node, meaning that two conflicting security patterns used together constitute a kind of attack. The corresponding sub-tree is depicted in Fig. 7(b);
– p1 having no relation with any pattern p2 in P, we add one parent defense node labelled with p1.

The parent defense nodes resulting from the above steps are combined, with a conjunctive refinement, into a defense node labelled by "Pattern Composition". This last defense node is linked to the attack node A.

When an attack is linked to several security patterns, the second step can yield a large defense sub-tree, but this sub-tree can often be simplified. In short, if we replace the relations depend and benefit by the operation AND, the relation alternative by OR and the relations impair and conflict by XOR, we obtain logical expressions. These expressions can be reduced with tools, e.g., BExpRed (https://sourceforge.net/projects/bexpred/). A simplified defense tree can be derived from the reduced expression. For instance, with the three patterns p1, p2 and p3 having the relations (p1 benefit p2), (p1 alternative p3) and (p2 alternative p3), we obtain (p1 AND p2) AND (p2 OR p3) AND (p1 OR p3), which can be reduced to (p1 AND p2). This expression gives a defense node that is conjunctively refined with two nodes labelled by p1 and p2.
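The same reduction can be reproduced with any boolean simplifier; for instance, with the SymPy library instead of BExpRed:

```python
from sympy import symbols
from sympy.logic.boolalg import And, Or, simplify_logic

p1, p2, p3 = symbols("p1 p2 p3")

# (p1 benefit p2) -> AND, (p1 alternative p3) -> OR,
# (p2 alternative p3) -> OR
expr = And(And(p1, p2), Or(p2, p3), Or(p1, p3))

print(simplify_logic(expr))  # p1 & p2
```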
Fig. 8. ADTree of the attack CAPEC-39 reprinted from [10].
We implemented the ADTree generation with a tool, which takes an attack identifier as input and yields an ADTree stored in an XML file. These files can be used with the ADTree editing tool ADTool presented in [29]. As a consequence, ADTrees can be modified or updated as the designer wishes.

If we take back the attack CAPEC-39, we obtain the ADTree of Fig. 8. This tree firstly shows that the attack CAPEC-39 has the sub-attack CAPEC-31. Both attacks can be countered by several security pattern combinations. For instance, the attack CAPEC-39 can be countered by two pattern combinations: the pattern "Canonicalization" must be used either with "Application Firewall" or with "Input Guard", since both are alternative patterns. The number of security patterns related to the attacks CAPEC-39 and CAPEC-31 is explained here by the diversity of the targeted weaknesses: 18 weaknesses can be exploited (6 for the attack CAPEC-39 and 12 for CAPEC-31). We assume for the classification generation that all of them have to be mitigated. As they cover different security issues, e.g., input validation problems, privilege management or encryption problems, several patterns are required to fix the weaknesses and hence block the attacks. This example illustrates that a designer can follow the concrete materialisations of an attack in an ADTree. He or she can choose the most appropriate attack with respect to the context of the application being designed; the ADTree then provides the different security pattern combinations that have to be used to prevent this attack.
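To give an idea of the serialisation stage, the sketch below writes a small attack tree as nested XML nodes with Python's standard library. The element and attribute names follow the general shape of ADTool files (nested node elements carrying a refinement attribute and a label child), but this layout is an assumption on our part; the exact schema expected by ADTool should be checked against [29].

```python
import xml.etree.ElementTree as ET


def to_xml(node, parent=None):
    """Serialise a (label, refinement, children) tuple into nested <node>
    elements; the layout is an assumption, not the exact ADTool schema."""
    label, refinement, children = node
    elem = (ET.SubElement(parent, "node") if parent is not None
            else ET.Element("node"))
    elem.set("refinement", refinement)  # "disjunctive" or "conjunctive"
    ET.SubElement(elem, "label").text = label
    for child in children:
        to_xml(child, elem)
    return elem


tree = ("CAPEC-39", "disjunctive", [("CAPEC-31", "disjunctive", [])])
root = ET.Element("adtree")
to_xml(tree, root)
ET.ElementTree(root).write("capec39.xml", encoding="utf-8")
```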
6 Empirical Evaluation
In order to assess whether designers can profit from our classification and ADTrees, we empirically studied two scenarios in which 25 participants were given the task of choosing security pattern combinations to prevent two attacks, CAPEC-244: Cross-Site Scripting via Encoded URI Schemes and CAPEC-66: SQL Injection, on two vulnerable Web applications, Ropeytasks (https://github.com/continuumsecurity/RopeyTasks) and Bodgeit (https://github.com/psiinon/bodgeit). The participants are third to fourth year computer science undergraduate students; most of them have good skills in the design, development and testing of Web applications. They have some knowledge about classical attacks and are used to handling design patterns, but not security patterns. The duration of each scenario was limited to one hour.

In the first scenario, denoted Part 1, we supplied these documents to the students: the CAPEC base, two concrete examples showing how to perform each attack, the catalogue of security patterns given in [17] and the pattern classification proposed in [4]. For simplicity, we refer to these documents as basic pattern documents in the remainder of the evaluation. In the second scenario, denoted Part 2, we supplied additional documents for the two attacks, i.e., our classification in tabular form and two ADTrees generated from the data-store. At the end of each scenario, the students were invited to fill in a form listing these questions:
– Q1: Was it difficult to choose security patterns?
– Q2: Was it difficult to use the CAPEC documentation (in Part 1)/our classification and ADTrees (in Part 2)?
– Q3: Was it difficult to use the basic pattern documents (in Part 1)/our classification and ADTrees (in Part 2)?
– Q4: How much time did you spend choosing security patterns?
– Q5: How confident are you in your pattern choice?
– Q6: Which patterns did you choose?

This form was actually devised to evaluate the following criteria:

– C1: Comprehensibility: does our classification make the pattern choice less difficult?
– C2: Efficiency: does our classification help reduce the time needed to choose patterns?
– C3: Accuracy: are the chosen patterns correct?
6.1 Experiment Results
From the forms returned by the participants (available in [19]), we extracted the following results. Firstly, Fig. 9 illustrates the percentages of answers to the questions Q1 to Q3. For these, we used a four-valued scale: easy, fairly easy, difficult, very difficult. From Question Q4, we collected the time spent by the participants choosing patterns (in Parts 1 and 2 of the experimentation). In summary, response times varied between 15 and 50 min for Part 1, and between 5 and 30 min for Part 2. The bar charts of Fig. 10 depict the levels of confidence of the participants in their security pattern choices (Question Q5). The possible answers were, for both scenarios: very sure, sure, fairly sure, not sure.
Fig. 9. Response rates for Q1 to Q3.
Fig. 10. Confidence rates (Q5).
We finally analysed the security pattern combinations provided by the participants in Question Q6. We organised these responses into four categories (ordered from the most to the least accurate):

– Correct: several pattern combinations were accurate; when a participant gave one of these combinations, the response belongs to this category;
– Correct+Additional: this category includes the responses composed of a correct pattern combination, completed with some other patterns;
– Missing: we gather in this category the incomplete pattern combinations without additional patterns;
– Missing+Additional: this category holds the worst responses, composed of unexpected patterns possibly accompanied by some expected ones.

With these categories, we obtained the bar charts of Fig. 11, which give the number of responses per category and per experiment scenario.
Fig. 11. Accuracy measurement (Q6).
6.2 Result Interpretation
C1 Comprehensibility: Fig. 9 shows that 33% of the participants estimated that the pattern choice was easy with our classification and ADTrees (Q1). In contrast, no participant found the choice easy when using only the basic pattern documents. The rate of "Easy" and "Fairly Easy" answers increased by 70.8% between Part 1 and Part 2. With Question Q2, 41.7% of the participants found the use of the CAPEC base "Fairly easy", whereas 87.5% deemed our documents (ADTrees) "Easy" or "Fairly easy" to use. Similarly, only 37.5% of the participants found the reading of the basic pattern documents "Easy" or "Fairly easy"; this rate reaches 87.5% with our classification. Consequently, Fig. 9 shows that our classification and ADTrees make the pattern choice easier and that they are simpler to interpret than the basic pattern documents. Figure 10 shows that the confidence of the participants in their responses increased by 20.8%.

C2 Efficiency: The average time spent by the participants choosing patterns is equal to 32 min in the first scenario (Part 1). This average decreases to 15 min when the participants employed our classification and ADTrees. Furthermore, no participant went over 30 min for choosing patterns in Part 2 (in contrast with 50 min in Part 1). Hence, our documents make the participants more efficient.

C3 Accuracy: Fig. 11 reveals how complicated it is to read the basic pattern documents. Indeed, no participant gave a correct pattern combination in Part 1. In contrast, when they used our classification and ADTrees, the number of correct responses rises to 15 out of 25 (60%). Furthermore, the category of responses "Missing+Additional" (worst responses) is strongly reduced (from 60% in Part 1 to 8% in Part 2). Consequently, the pattern choice is significantly more accurate with our classification and ADTrees. Nonetheless, even with our documents, the number of participants who gave incomplete pattern combinations remains in the same range (9 in Part 1, 7 in Part 2). More effort seems required to prevent participants from overlooking patterns in ADTrees.
7 Classification Discussion
Our current classification is built on a non-exhaustive set of 215 CAPEC attacks, 26 security patterns and 136 CWE weaknesses related to Web applications. Presented in a tabular form, it enables multi-attribute based decisions insofar as patterns can be classified according to security principles, weaknesses and attacks. The classification complies with seven of the nine quality criteria defined in [6]:

– Navigability: our classification, accompanied by ADTrees, satisfies this criterion as it exposes the hierarchical refinements of an attack and the combinations of patterns that should be integrated in the application model. In addition, the classification provides the relationships among security patterns, which help choose the most appropriate pattern combination. For instance, if two conflicting patterns are listed, the classification points out this conflict to avoid using them together;
– Determinism: the classification is clearly defined by means of the integration steps. These justify the soundness of the classification;
– Unambiguity/Comprehensibility: as patterns are classified w.r.t. attacks and security principles, we provide a clear category structure. This organisation, which is supplemented and illustrated by means of ADTrees, makes our classification readable and comprehensible even for novices in security patterns;
– Usefulness: we believe the classification can be used in practice since it is based upon the security pattern catalogue given in [17] and the CAPEC and CWE bases. Furthermore, the attack tree formalism is one of the most prominent security formalisms for analysing threats. The ADTree model is supported by several tools, in particular ADTool [29]; our ADTree generator actually produces XML files taken as inputs by ADTool;
– Acceptability: an acceptable classification schema should be structured in a way that provides help in partitioning the security pattern landscape and becomes generally approved [6]. Our classification partitions security patterns with regard to attacks, weaknesses and security principles. Furthermore, our evaluation shows that the classification makes participants more efficient and confident in their pattern choices without imposing new constraints;
– Repeatability: the classification is generic and can be reused. Furthermore, the data-store and the classification can be updated.
In our classification, a security pattern can be related to several attacks and security principles. As a consequence, the classification is not Mutually exclusive (patterns should be categorised into at most one category). Even though this is not a primary goal of our classification, we could fix this issue by grouping attacks into contexts in a mutually exclusive way, as in [7]. To do so, the meta-model of Fig. 3 should be updated with a new entity called Context, linked to the entity Attack. Like most pattern classifications, the Completeness criterion is not met, as we do not yet consider all the available security patterns.

We compared our classification with the two papers providing relations between security patterns and attacks [4,9]. In these works, the security pattern intents are manually compared to the summaries of the attacks. As these textual sections are abstract, few relations were found. The largest contribution is provided by Alvi and Zulkernine, who considered around 20 attacks and manually linked them to 5 patterns. In contrast to these works, our classification is more complete: we provide 26 security patterns as solutions against 215 attacks of the CAPEC base. Our classification exposes more pattern combinations per attack, although more choice is not always better. After inspection, we observed that more than one or two patterns are generally required to counter attacks. A last important point is that the classifications exposed in [4,9] do not contradict our relations between attacks and patterns. For instance, the attack "CAPEC-66: SQL Injection" is related to the security patterns "Intercepting Validator" and "Input validation" in [9]. The attacks "CAPEC-244: Cross-Site Scripting via Encoded URI Schemes" and CAPEC-66 are only associated with the pattern "Intercepting Validator" in [4]. For these attacks, our method generates
two ADTrees, which provide 4 combinations of 7 patterns for CAPEC-244 and 8 combinations of 9 patterns for CAPEC-66. As in [4,9], the ADTrees exhibit the pattern "Input Guard", which can be implemented by "Intercepting Validator". But they also list other patterns. For CAPEC-244, some of these patterns are alternatives to "Input Guard", e.g., "Application Firewall". Other patterns, e.g., "Authentication Enforcer" or "Controlled Object Monitor", are related to specific weaknesses targeted by the attack CAPEC-244. We believe these patterns, which are not given in the previous classifications, are required to counter the attack with regard to the application context.

Some statistical information can be automatically extracted from our classification, e.g., the ratio of weaknesses to attacks, or of patterns to attacks. For instance, Fig. 12 shows the number of attacks at least partially countered per pattern. Keeping in mind that the current set of patterns is not exhaustive, we observe that 2 patterns seem to emerge for partly fixing a large part of the 215 attacks covered by the classification: "Input Guard" and "Application Firewall" can overcome 113 and 109 attacks respectively. This kind of information can guide designers towards security analysis and good practices. For instance, with the above chart and ADTrees, a designer can deduce that the patterns "Input Guard" and "Application Firewall" are alternative security patterns and that one of them should be used in the design of Web applications, as they partially block numerous attacks. If we complete the data-store with more data, e.g., attack risks, such charts could be further refined and adapted to developers' needs.
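Charts such as Fig. 12 can be produced directly from the final database with a grouped count. Using our illustrative schema from Sect. 3 (and again ignoring the principle hierarchy propagation for brevity):

```python
def attacks_countered_per_pattern(conn):
    """Number of attacks at least partially countered by each pattern."""
    return conn.execute("""
        SELECT p.name, COUNT(DISTINCT aw.attack_id) AS n_attacks
        FROM pattern p
        JOIN pattern_strong_point psp   ON psp.pattern_id = p.id
        JOIN strong_point_principle spp ON spp.strong_point_id = psp.strong_point_id
        JOIN cluster_principle cp       ON cp.principle_id = spp.principle_id
        JOIN mitigation m               ON m.cluster_id = cp.cluster_id
        JOIN weakness_mitigation wm     ON wm.mitigation_id = m.id
        JOIN attack_weakness aw         ON aw.weakness_id = wm.weakness_id
        GROUP BY p.name
        ORDER BY n_attacks DESC
    """).fetchall()
```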
Fig. 12. Number of fixed attacks per pattern, reprinted from [10].
Limitations. Our classification and method present some limitations, which could lead to future research work:

– we did not envisage the notion of attack combination. Such a combination could be seen as several attacks or as one particular attack. If an attack combination can be identified and documented with its sub-attacks, then it can be integrated in our data-store;
– the ADTree size is not limited by our ADTree generator. When an attack has a high level of abstraction, we observed that the resulting ADTree can become large because it includes a set of sub-attacks, themselves linked to several patterns. This is a strong limitation since large trees are usually unreadable, which contradicts the method's purposes;
– the classification is not exhaustive: it includes 215 attacks out of 569 (for any kind of application), 210 CWE weaknesses out of around 1000 and 26 security patterns out of around 176. It can be completed with new attacks automatically, but it is worth mentioning that the completion of the data-store with new security patterns or weaknesses requires some manual steps. It could be relevant to investigate whether some text mining techniques would help partially automate these manual steps. The classification exhaustiveness also depends on the available security data. In the ADTree of Fig. 8, all the attack nodes are linked to defense nodes; sometimes, with other attacks, no defenses are provided. This can generally be explained by three main reasons: (1) the attack is too abstract to be associated with weaknesses, although such an attack should be linked to sub-attacks; (2) security databases or pattern catalogues are incomplete (lack of mitigations, weaknesses, etc.), so more data are required during the data integration process; (3) the attack is relatively new and cannot yet be overcome by security patterns;
– several steps require manual interventions, which are prone to errors. These manual steps may lead to associations among security data that are bound to be controversial. We compared our results with other papers, but this is insufficient to ensure that all the associations are correct or that no security data, e.g., a strong point, is missing. Validating every relation is a hard problem. It could be partially solved by the use of verification methods, but the writing of formal expressions for modelling the entities and associations of our meta-model is also a long and error-prone task.
8 Conclusion
The generic nature of security patterns and their growing number make their choice difficult for overcoming a security problem. This is why we have presented a security pattern classification method putting together CAPEC attacks, CWE weaknesses and security patterns to guide designers in their pattern choices. This method provides a meta-model and the data integration steps to generate a pattern classification showing the patterns that can be used to counter an attack. Inter-pattern relationships are also given to increase Navigability and Comprehensibility. The method automatically generates ADTrees, which ease the classification readability. These ADTrees could be taken as a first step of other security processes, e.g., threat modelling.

In future research, we firstly intend to focus on the automation of some of the data integration steps. We will investigate whether some text mining techniques would help partially automate the extraction and integration of security data without bringing ambiguity. Our method does not take into consideration the size of the ADTrees. ADTree reduction could be a first solution to this problem, but the literature does not yet provide a generic method for this kind of reduction. Reducing such trees remains a hard problem, as the node meaning must be taken into account in the node aggregation process. We intend to investigate this issue in future work.
References

1. Rodriguez, E.: Security Design Patterns, vol. 49 (2003)
2. Schumacher, M., Roedig, U.: Security engineering with patterns. Engineering 2754, 1–208 (2001)
3. Slavin, R., Niu, J.: Security patterns repository (2016)
4. Alvi, A.K., Zulkernine, M.: A natural classification scheme for software security patterns. In: 2011 IEEE Ninth International Conference on Dependable, Autonomic and Secure Computing, pp. 113–120 (2011)
5. Yskout, K., Heyman, T., Scandariato, R., Joosen, W.: A system of security patterns (2006)
6. Alvi, A.K., Zulkernine, M.: A comparative study of software security pattern classifications. In: 2012 Seventh International Conference on Availability, Reliability and Security, pp. 582–589 (2012)
7. Bunke, M., Koschke, R., Sohr, K.: Organizing security patterns related to security and pattern recognition requirements. Int. J. Adv. Secur. 5(1), 46–67 (2012)
8. Anand, P., Ryoo, J., Kazman, R.: Vulnerability-based security pattern categorization in search of missing patterns. In: 2014 Ninth International Conference on Availability, Reliability and Security, pp. 476–483 (2014)
9. Wiesauer, A., Sametinger, J.: A security design pattern taxonomy based on attack patterns. In: International Joint Conference on e-Business and Telecommunications, pp. 387–394 (2009)
10. Regainia, L., Salva, S.: A methodology of security pattern classification and of attack-defense tree generation. In: Camp, O., Furnell, S., Mori, P. (eds.) Proceedings of the 3rd International Conference on Information Systems Security and Privacy, ICISSP 2017, Porto, Portugal. SciTePress (2017)
11. MITRE Corporation: Common attack pattern enumeration and classification (2017)
12. Saltzer, J.H., Schroeder, M.D.: The protection of information in computer systems. Proc. IEEE 63, 1278–1308 (1975)
13. Viega, J., McGraw, G.: Building Secure Software: How to Avoid Security Problems the Right Way. Pearson Education, New York (2001)
14. Meier, J., Mackman, A., Dunner, M., Vasireddy, S., Escamilla, R., Murukan, A.: Improving web application security: threats and countermeasures. Microsoft Corporation 3 (2003)
15. Dialani, V., Miles, S., Moreau, L., De Roure, D., Luck, M.: Transparent fault tolerance for web services based architectures. In: Monien, B., Feldmann, R. (eds.) Euro-Par 2002. LNCS, vol. 2400, pp. 889–898. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-45706-2_126
16. Meier, J.: Web application security engineering. IEEE Secur. Priv. 4, 16–24 (2006)
17. Yskout, K., Scandariato, R., Joosen, W.: Do security patterns really help designers? In: Proceedings of the 37th International Conference on Software Engineering, ICSE 2015, vol. 1, pp. 292–302. IEEE Press, Piscataway (2015)
18. Fernandez, E.B., Washizaki, H., Yoshioka, N., Kubo, A., Fukazawa, Y.: Classifying security patterns. In: Zhang, Y., Yu, G., Bertino, E., Xu, G. (eds.) APWeb 2008. LNCS, vol. 4976, pp. 342–347. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-78849-2_35
19. Regainia, L., Salva, S.: Security pattern classification, companion site (2018). http://regainia.com/research/companion.html. Accessed 2018
20. Kordy, B., Mauw, S., Radomirović, S., Schweitzer, P.: Attack-defense trees. J. Logic Comput. 24(1), 55–87 (2012)
21. Munawar, H.: Security pattern catalog (2013). http://www.munawarhafiz.com/securitypatterncatalog/. Accessed 2018
22. Tøndel, I.A., Jensen, J., Røstad, L.: Combining misuse cases with attack trees and security activity models. In: International Conference on Availability, Reliability, and Security, ARES 2010, pp. 438–445. IEEE (2010)
23. Uzunov, A.V., Fernandez, E.B.: An extensible pattern-based library and taxonomy of security threats for distributed systems. Comput. Stand. Interfaces 36, 734–747 (2014)
24. Regainia, L., Salva, S., Bouhours, C.: A classification methodology for security patterns to help fix software weaknesses. In: Proceedings of the 13th ACS/IEEE International Conference on Computer Systems and Applications, AICCSA (2016)
25. MITRE Corporation: Common weakness enumeration (2017)
26. Harb, D., Bouhours, C., Leblanc, H.: Using an ontology to suggest software design patterns integration. In: Chaudron, M.R.V. (ed.) MODELS 2008. LNCS, vol. 5421, pp. 318–331. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-01648-6_34
27. OWASP: The open web application security project (OWASP) (2017). http://www.owasp.org
28. Wassermann, R., Cheng, B.H.: Security patterns. In: PLoP Conference. Michigan State University, Citeseer (2003)
29. Kordy, B., Kordy, P., Mauw, S., Schweitzer, P.: ADTool: security analysis with attack–defense trees. In: Joshi, K., Siegle, M., Stoelinga, M., D'Argenio, P.R. (eds.) QEST 2013. LNCS, vol. 8054, pp. 173–176. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40196-1_15
Forensic Analysis of Android Runtime (ART) Application Heap Objects in Emulated and Real Devices

Alberto Magno Muniz Soares and Rafael Timoteo de Sousa Junior

Cyber Security INCT Unit 6, Electrical Engineering Department, University of Brasília, Brasília, DF 70910-900, Brazil
[email protected], [email protected]
Abstract. Each new release of a mobile device operating system represents a renewed challenge for the forensics analyst. Even a small modification or fault correction of such basic software requires the revision of forensic tools and methods, frequently leading to the development of new investigation tools and the consequent adaptation of methods. Forensic analysts then need to preserve each tool set and related methods and associate these sets with the specific mobile operating system release. This paper describes a case of transition consequent to the Android Runtime (ART) operating system release. The introduction of this system in the market required the development of a new forensic technique for analyzing ART memory objects using a volatile memory data extraction. Considering the Android Open Source Project (AOSP) source code, a method and associated software tools were developed allowing the location, extraction and interpretation of arbitrary ART memory instances with the respective object classes and their data properties. The proposed technique and tools were validated both for emulated and real devices, illustrating the difficulties related to forensic analysis of the target system due to its particular implementations by multiple manufacturers of mobile devices.

Keywords: Mobile device forensics · Android · Memory forensics · Memory analysis
1 Introduction This paper is an extended version of a paper presented in the 3rd International Conference on Information Systems Security and Privacy – ICISSP 2017 [1]. With respect to the previous publication, the present paper presents a new contribution regarding the forensics analysis of real devices running under the Android operating system. This introduction was completely rewritten, now presenting data regarding the Android user base as well as data collected by the authors from the Civil Police of the Brazilian Federal District showing the ever-growing interest of Android forensics in real world investigations. Consequently, an entirely new Sect. (6.3) was included with the validation scenario for our proposed methods and tools applied to a real Android device. Based on the new results, this paper has sections that were reviewed and partially © Springer International Publishing AG, part of Springer Nature 2018 P. Mori et al. (Eds.): ICISSP 2017, CCIS 867, pp. 130–147, 2018. https://doi.org/10.1007/978-3-319-93354-2_7
rewritten, particularly its conclusions, which were completed and rewritten. Also, the bibliographic references were reviewed and completed with new references.
The Android operating system has become predominant in the smartphone market, representing roughly 85% of the worldwide smartphone volume [2]. Consequently, Android has also become an interesting item in crime scene investigations. As personal mobile devices can be used for various purposes, their RAM may contain digital evidence for potential investigations, with both the cell phones used by criminals and those of the victims of crimes being of interest. Data from the Civil Police of the Brazilian Federal District (Table 1) illustrate that the number of devices controlled by Android represents an important challenge for the forensics team.
In a previous work [3], we characterized an Android phone as a large and rich data repository. The amount of memory, the existing access protections, the memory organization, the binding of application and operating system processes to memory object locations, and other related aspects together represent a combined volume-variety question that must be rapidly treated by the forensic expert. This constitutes an important big data problem that requires appropriate solutions to be promptly tackled, given the needs of the Justice system.

Table 1. Mobile devices seized for forensic analysis by the Civil Police of the Brazilian Federal District from August 16, 2016 to August 16, 2017 (Source: the authors).

Manufacturers                                                  Quantity
Not specified                                                  388
Samsung                                                        345
Motorola                                                       96
Apple                                                          86
LG                                                             68
Nokia                                                          40
Sony                                                           18
Alcatel                                                        15
Blu                                                            9
Lenovo                                                         8
Asus                                                           5
ZTE                                                            4
Lenoxx, Microsoft, Multilaser, Positivo                        2 devices of each manufacturer
Blackberry, CCE, Fashion Q9, Geotel, HTC, Huawei, Itel, Kis    1 device of each manufacturer
Moreover, each new release of a mobile device operating system represents a renewed challenge for the forensic analyst. Even a small modification or fault correction of such basic software requires the revision of forensic tools and methods, frequently leading to the development of new investigation tools and the consequent adaptation of methods. Forensic analysts then need to preserve each tool set and the related methods, and associate these sets with the specific mobile operating system release.
These reasons motivate this paper's proposal of methods and tools aimed at supporting the semi-automatic forensic analysis of Android Runtime (ART) application heap objects, both in emulated and real devices.
Mobile device forensics has traditionally been aimed at the acquisition and analysis of data present in non-volatile storage media. Depending on the purpose of the investigation, or as a consequence of the difficulty posed by the ephemeral nature of the data, volatile memory exams were usually not performed. But the increasing use of encryption and the presence of ever more sophisticated malicious software raised the need to conduct investigations on the volatile memory contents of mobile devices. As discussed in [4], the forensic community now recognizes that capturing data in memory is required in order to cope with the volatility of digital evidence, since some information about the system environment is never kept statically in secondary storage media. Thus, techniques to analyze data from a volatile memory extraction have become imperative to extend traditional techniques.
The Android operating system is based on the Linux kernel and is designed especially for mobile devices. Currently this system is packaged in versions for 32-bit and 64-bit processors, complying with the x86, MIPS and, especially, the ARM architectures. Departing from the basic Linux distribution, Android presents specific features that, for the purpose of forensic investigations, lead to the need for specific techniques for memory extraction and analysis, which require a detailed understanding of the runtime environment.
According to the Android Open Source Project (AOSP), the Android version 5.0 OS contains a new runtime environment for operating in most available devices. Called Android Runtime – ART, this environment replaces the former interpretation mechanism, the Dalvik Virtual Machine (DVM). In place of using this interpretation engine, ART requires the compilation of every application during its installation, a process called Ahead-Of-Time (AOT) compilation. Also, it is noteworthy that new memory management mechanisms are implemented in ART.
As observed in [5], a general digital forensics process includes the acquisition of data from a source, the analysis of the data and the extraction of evidence, with the preservation and presentation of the evidence. Although there are several RAM data acquisition techniques for Android, a forensic technique specific to the memory analysis and extraction of Java objects in the ART runtime environment is yet to be established. Thus, the central contribution of this paper is the proposal and validation of a technique that allows the location and extraction of object data of a running application, using volatile memory content acquired from Android version 5.0 devices. This memory analysis technique is based on the source code available from AOSP. Another contribution of this paper is the development of software tools that support the proposed forensic technique.
The remainder of this paper is organized as follows. Section 2 describes related work. Section 3 is an overview of the Android architecture, while Sect. 4 is devoted to ART. In Sect. 5, the proposed forensic technique is described with its supporting tools. Section 6 discusses results from the experimental evaluation of the proposed technique and the developed support tools, for RAM acquisitions from an emulated device as well as real ones. Section 7 presents conclusions and possible future works.
2 Related Work

Considering the unique characteristics of the Android platform, particularly in its early versions, and the different scenarios a forensic analyst may come across, a general data acquisition method is proposed in [3] with its respective workflow. But regarding the specific procedures that are required to detail a general method, acquisition techniques for forensic purposes present limitations related to intrinsic features implemented by manufacturers, such as hard security mechanisms that prevent access to data, as discussed in [6]. Nevertheless, there are different known techniques for RAM acquisition in Android. Linux Memory Extractor – LiME [7] is a known technique which extracts raw data from a device's volatile memory, ensuring a high integrity rate in its results.
A study presented in [8] is dedicated to the recovery of credentials from Android applications by means of available volatile memory extraction techniques. This study shows that, even without the analysis of application objects, the referred credentials are accessible by direct inspection of the extracted data. The analysis of data extracted from real devices showed results similar to those from emulated systems.
An alternative to bypass hard security barriers is presented in [9], based on data extraction from real devices running Android versions below 4.4. Using the technique called Forensic Recovery of Scrambled Telephones – FROST, this paper holds that, even in case of rebooting and unrecoverable data erasure in non-volatile memory (which occurs in some devices when they are reset to factory state, a situation caused by the bootloader unlocking process), it is still possible to analyze the remaining data in RAM, including Java objects maintained by the old Dalvik runtime. In this work, the process is performed using plugins of the Volatility Framework (http://www.volatilityfoundation.org).
In [10], the ART compilation process and instrumentation solutions for applications within ART are presented, highlighting the innovations introduced in the ART compilation process, including significant internal operation details that are useful in understanding the difference between ART and the earlier Android runtime versions.
It is noteworthy that, after a careful search of publications, we verified that forensic studies on ART for Android version 5 or greater are still rare. Hence, the analysis of the AOSP open source code and its constant updates is an important source of information [11].
3 Overview of the Android Architecture

The Android software stack comprises three main layers: the application layer, the framework for Java objects together with the Runtime environment (RT), and a native Linux kernel layer containing hardware abstraction libraries [12]. Regarding the memory management used by the RT, as described in [13], the Android system does not offer a memory swapping area; instead, it uses paging mechanisms and file mapping.
Regarding the paging mechanism, page sharing is used between processes. Each process is instantiated by forking a pre-existing process called Zygote. This original parent process starts during the system initialization phase (boot) and loads the code and features that are part of the Android framework. This allows many pages, allocated to the code and resources of the framework, to be shared by all other application processes. With the mapping mechanism, most of the static data (byte-code, resources and possibly native code libraries) of an application are mapped into the memory address space of the application process. Besides allowing data sharing between processes, this feature allows the concerned memory pages to be discarded as needed. Memory sharing between applications works through an asynchronous sharing mechanism called Anonymous Shared Memory (Ashmem). Ashmem is an additional modification of the Android Linux kernel that allows the automatic adjustment of the size of memory caches and the recovery of areas when the total available memory is low [12]. Also, as a memory snapshot may reveal, the virtual memory area of an application may contain unused mapped pages.
In the boot process, in addition to the preparation of the Zygote by the RT process, a service starts that keeps (for each boot) a memory mapping of key-value pairs related to the system configuration, comprising data from properties files and other sources of the operating system. Many components of the operating system and the Android framework, including the execution environment, use these values, including those related to the configuration of the execution environment (for instance, the size of the memory space for the Java object heap and parameters of the Garbage Collection – GC).
With respect to security aspects within AOSP, after installation, each application is activated in its own virtual memory area, implementing the principle of least privilege. Android version 5.0 includes security mechanisms requiring that all dynamic code linking must be of relative type (Position-Independent Code – PIC), reinforcing the existing mechanism of Address Space Layout Randomization (ASLR).
Despite operating on a Linux kernel, these peculiar characteristics of the Android architecture with respect to the security mechanisms, memory management, and application runtime environment impose the use of specific techniques for procedures dedicated to RAM extraction and analysis.
4 Android Runtime (ART)

The runtime module is responsible for managing Android applications designed to operate on the Android framework layer. One of its responsibilities is to provide memory management for application execution and access to other system services, such as Virtual Machine (VM) byte-code compilation and loading (in DEX files). This VM is similar to a Java Virtual Machine (JVM) and runs as an application that, in ART, keeps the name and uses the same byte-code as Dalvik, despite the replacement of the corresponding legacy runtime module.
Prior to running applications, ART initializes a set of classes during the first boot (or after system modifications), generating a file that contains an executable image with extension "art" with all the loaded classes and objects that are part of the
Android framework. This file, called boot.art, is mapped into memory during the Zygote boot process, and basically contains natively compiled objects that hold address references (pointers) with absolute addresses within the image itself, as well as references to methods in the code contained in the framework files (inside the framework file there are absolute pointers to the image as well). The overall data structure related to the compilation and execution in the ART environment is then described in the image header, including a field that stores the respective offset from the beginning of the file. This value changes at every boot, so that the image is not loaded at the same address (in AOSP version 5.0, the base address for the displacement of the ASLR was set to 0x70000000).
After the initial preparation, the byte-code of each installed application is compiled to native code before its first run. The products of this compilation, comprising each application's byte-code and the libraries that make up the Android framework, are files in the Executable and Linking Format (ELF), called OAT files (specifically boot.oat for the framework). These files, compiled to boot the Android framework and to install applications, contain three dynamic symbol tables, called oatdata, oatexec and oatlastword, that respectively contain the OAT header and DEX files, the compiled native code for each method, and the last 4 bytes of generated native code functioning as a final section marker.
For memory management, ART divides the virtual memory as follows: a main space for the application's Java objects (heap), a space for the image objects and classes of the Android framework, a space for Zygote's shared objects, and a space for large objects (Large Object Space – LOS). The first three are arranged in a continuous address space, while there is a collection of discontinuous addresses for the LOS. In addition to these spaces, there are data structures related to garbage collection, whose types depend on the GC and on the Java object heap allocation, and which can be active depending on the GC plan in effect. The GC plan is usually set by the manufacturer according to the device's intrinsic characteristics and to the plan established by the memory allocator. For devices such as common-use smartphones, without strong memory constraints, there is generally a defined plan whose operating mode works with the allocator called Runs-Of-Slots-Allocator (RosAlloc) for mutable objects and with Dlmalloc for immutable objects.
RosAlloc came up with the ART runtime environment and is the main allocator responsible for the heap memory space for Java objects. It organizes this memory space in runs of slots of the same size. These runs are clustered as pages within brackets. The first page of a bracket contains a header that determines the number of pages this bracket contains and the slots' allocation bitmap. The number of slots per page is set according to the size of the bracket, the header length and the byte alignment (which depends on the target device architecture). Figure 1 illustrates an example of a heap structure and mapping schema. Each slot stores data for one object, and the first bytes of a slot store the object's parent class address. The slot is classified according to the size of the object as a means to reduce fragmentation and allow parallel GC. Objects with big data (≥ 12 KiB) are spread through LOS allocation areas, allowing the kernel to conveniently manage the address spaces to store this data.
The allocator maintains an allocation map for the bracket pages (each page 4 KiB in size), setting in this map the type of each page in the allocation space. This map is stored in a file mapped in RAM (the rosalloc page map). For the allocation of the heap space, the start address is set near the lowest virtual address of the process, from 0x12c00000 (300 MiB).

Fig. 1. Example of a heap structure and mapping within ART RosAlloc [1].

Considering this memory layout information, drawn from our analysis of the AOSP source code, it is possible to establish a strategy for locating objects by scanning the brackets' slots inside the mapped heap file and decoding the data set for each allocated object. This is also possible for a recoverable object in a deallocated slot. While these are subjects of the present paper, as approached in the next section, the analysis of data stored in structures related to large objects or allocated by native libraries, which have specific allocation mechanisms, is left for future work.
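To make this scanning strategy concrete, the sketch below walks a simplified page map and decodes allocated slots. It is a minimal illustration, not the actual tool: the page-map tag value, the header length and the single 32-bit bitmap are simplifying assumptions, and the authoritative structures are those in the AOSP rosalloc sources of the target release.

import struct

PAGE_SIZE = 4096
PM_RUN_HEADER = 2  # assumed tag marking the first page of a bracket (run)

def scan_heap(page_map, heap, slot_size, pages_per_bracket):
    """Yield (offset in heap dump, class pointer) for each allocated slot."""
    for page_idx, tag in enumerate(page_map):
        if tag != PM_RUN_HEADER:
            continue                      # not the header page of a run
        base = page_idx * PAGE_SIZE
        # assumed run header: 8 bytes of metadata, then a 32-bit bitmap
        (bitmap,) = struct.unpack_from('<I', heap, base + 8)
        slots_base = base + 16            # assumed header length (aligned)
        usable = pages_per_bracket * PAGE_SIZE - 16
        for slot in range(min(32, usable // slot_size)):
            if not bitmap & (1 << slot):
                continue                  # free slot (could still be carved)
            addr = slots_base + slot * slot_size
            # the first word of a slot stores the object's class pointer
            (klass,) = struct.unpack_from('<I', heap, addr)
            yield addr, klass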
5 Analysis of Android ART Application Objects

As exposed above, in an application's runtime environment there are files mapped in RAM containing: information about system properties, the Android framework, the Java object heap, the mapping of objects used by the memory allocator, as well as class
definitions and executables compiled from the application's DEX files contained in OAT files. Starting from a whole RAM extraction, the technique proposed in this paper, as illustrated in Fig. 2, is aimed at recovering Java objects for data analysis from the heap space. This is performed by inspecting the mapping maintained by the memory allocator, based on the premise that from a volatile memory extraction it is possible to recover data pages from those files. For Java object data, according to the type of the page (guided by the mapping maintained in the allocator file) and the respective page header data, it is possible to recover the slots and, with the appropriate description of the target object class, decode the data. Object data decoding can be performed directly or by traversing the references throughout the class hierarchy (similar to a recursive programming process), using memory layout information obtained by decompilation of the application byte-code or by understanding the upper classes' information. In Fig. 3, a generic sequential process for recovering an arbitrary string field of Object X is illustrated. From the Object X slot (bottom-left in the figure), it is possible to walk through the parent class references, in this way decoding object data using the layout of known Android framework classes.
The Volatility Framework (in version 2.4) [14] provides tools and data structure mappings with support for the Linux platform on the ARM 32-bit architecture, allowing the retrieval of information such as the process table and memory mappings, among others.
Fig. 2. The proposed technique for analyzing heap objects within ART RosAlloc [1].
Fig. 3. Iterative recovery process for an object field [1].
In this paper, the process of data analysis is supported by a set of tools that were jointly developed within the Volatility Framework, based on the Android AOSP source code for ART version 5.0.1_r1 (https://android.googlesource.com/platform/art/+/android-5.0.1_r1) and on ART-related information described in [15, 16]. These extensions built for the Volatility Framework allow the retrieval of information on the execution environment and the recovery of allocated Java objects. For the recovery of the runtime data structures, we have created mappings for the interpretation of data from ART files, OAT and DEX files, Java framework classes, heap page structures and system properties. Then, for the extraction and analysis process, we have built tools for the recovery of the runtime properties, the location of OAT files, data decoding from DEX files, the extraction of Java objects from the heap, and the decoding of object data from the heap and from the Android framework image. The architecture of the Volatility Framework and the design of these tools allow updating and adding new mappings, which facilitates adaptation to other architectures or to changes in future versions of Android.
The list of references to heap objects used in data extraction is constructed by inspecting and decoding the slots of the heap pages described in the mapping file maintained by the RosAlloc allocator. This list contains data objects with the location of each object (address, page, bracket, and slot), the parent class, class identifiers in DEX, and raw or textual data (of type String or char array). This technique enables an in-depth analysis of the extracted data, overcoming the traditional techniques of carving and searching for text or other artefacts, which lack an understanding of the storage structures in memory.
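As a structural illustration of how such extensions hook into Volatility 2.x, a plugin skeleton follows; the class name, the option and the two helper methods are illustrative placeholders, not the actual plugin sources of this work.

import volatility.plugins.linux.common as linux_common

class linux_art_heap_objects(linux_common.AbstractLinuxCommand):
    """Lists ART heap objects of one process (illustrative skeleton)."""

    def __init__(self, config, *args, **kwargs):
        linux_common.AbstractLinuxCommand.__init__(self, config, *args, **kwargs)
        self._config.add_option('PID', short_option='p', default=None,
                                help='Target process PID', action='store',
                                type='int')

    def calculate(self):
        # Task enumeration and heap walking are sketched as hypothetical
        # helpers here; a real plugin would reuse linux_pslist and the
        # RosAlloc decoding mappings discussed in this section.
        for task in self.enumerate_tasks():                # hypothetical helper
            if task.pid == self._config.PID:
                for addr, klass in self.walk_heap(task):   # hypothetical helper
                    yield addr, klass

    def render_text(self, outfd, data):
        for addr, klass in data:
            outfd.write('{0:#010x}  class {1:#010x}\n'.format(addr, klass))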
6 Experimental Evaluation

The experimental evaluation of the proposed technique was done for an emulated device and real ones, representative of a common ART environment under Android 5.0. A complete RAM dump from each device was acquired using the technique described in [7], while the devices were running with active applications, including a chat application (WhatsApp v.2.12.510). For acquiring memory dumps from both devices, it was necessary to use privileged user access (root) and to replace the kernel with a newly built compilation configured to accept loading kernel modules without validation. The privileged user is available by default in the emulator, while for the real device it was obtained using the rooting tool Kingo (http://www.kingoapp.com). The source code of the kernels was compiled according to the guidelines on the AOSP site. The workstation used for the cross-compilation and analysis process runs Santoku Linux version 0.4, with the installation of the Android NDK (Release 8e) and the Volatility Framework (version 2.4), as described on their project sites. The configuration was based on the experimental setting procedure used in [17]. For each memory acquisition, the RAM data was transferred by TCP directly to the analysis workstation.
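For reference, a LiME acquisition over TCP typically follows the pattern sketched below; the module path, the port and the output file name are placeholders, and the lime.ko module must be built against the exact target kernel (the insmod is run in a root shell on the device):

>adb push lime.ko /sdcard/lime.ko
>adb forward tcp:4444 tcp:4444
>adb shell insmod /sdcard/lime.ko "path=tcp:4444 format=lime"

and then, on the analysis workstation:

>nc localhost 4444 > memdumpWhatsAppChat.lime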
6.1 Evaluation with an Emulated Device
The emulated device is an Android Virtual Device (AVD) configured with the parameters CPU/ABI: ARM (armeabi-v7a), 768 MB RAM, Target: Android 5.0.1 (API level 21), build number "sdk_phone_armv7-eng 5.0.2 LSY64 1772600 test-keys", hw.device.name Nexus 5 and vm.heapSize 64 MB.
Environment Set-Up. The target device used for memory acquisition is the Android emulator, available in the development tools package Android SDK Tools Revision 23.0.2. The source code of the kernel version (3.4.67) available for the emulator (goldfish) was obtained from the AOSP site.
Evaluation. In analyses of the memory extraction according to the proposed technique, it is possible to successfully recover common forensically interesting data from ART objects, for instance the user contacts maintained by the com.android.contacts application. For a deeper evaluation example, we describe hereafter in detail how to discover and characterize the objects of a running chat application (com.whatsapp v.2.12.510) involving messages exchanged with another user on a real device. Initially, the extension to the Linux Volatility framework that allows retrieving the table of running processes is used. Thus, it is possible to locate the target process for the analysis, which in this case is identified by PID 1206. With the developed tool for
recovery of system properties, environmental data is extracted, including the size of the heap for Java objects, a value that is subsequently used as a parameter in the recovery of target objects:

>python vol.py --profile=LinuxLinuxGoldfish_3_4_67ARM
 -f memdumpWhatsAppChat.lime art_extract_properties_data -p 1206
...
[dalvik.vm.heapsize]= [64m]
...
Then, it is possible to retrieve data about the application execution environment, such as the addresses related to the Android framework mapping, using the tool built for this activity and the target process handle as a parameter:

>python vol.py --profile=LinuxLinuxGoldfish_3_4_67ARM
 -f memdumpWhatsAppChat.lime art_extract_image_data -p 1206 com.whatsapp
ART image Header
----------------------
image_begin:0x700c7000
oat_checksum:0xbd5a21c9L
oat_file_begin:0x70be8000
oat_data_begin:0x70be9000
image_roots:0x70bb8840
kClassRoots:0x70bb8948
0x1 LJava/lang/Class; 0x700c7220L
0x2 LJava/lang/Object; 0x700f7240L
0x5 LJava/lang/String; 0x700df8f0L
0x6 LJava/lang/DexCache; 0x700c74f0L
0x8 LJava/lang/reflect/ArtField; 0x700f7640L
0xc [LJava/lang/reflect/ArtField; 0x700f74a0L
0x1d [C 0x700f6fd8L
The recovered information present in the image header, including the memory offset for the location of the framework mapping (0x700c7000), serves as the basis for recovering the addresses of various classes, such as the java.lang.String class. With these
data and the map maintained by the RosAlloc allocator, the list of heap objects containing references to object data and references to other objects is constructed, also using a developed tool. The address of the allocation map (0xb1d70000) is recovered by searching for the name of the respective file in the process mappings. With this gathered information, and by means of another developed localization tool, it is possible to recover the OAT files used by the target process, including the address of each location in the process address space:

>python vol.py --profile=LinuxLinuxGoldfish_3_4_67ARM
 -f memdumpWhatsAppChat.lime art_find_oat -p 1206
Oat                                       offset
-------------------------------------
[email protected]@classes.dex    0xa06dc000L
[email protected]@classes.dex    0xa5a74000L
With the OAT address, it is possible to recover data that enables a static analysis of some components, including class identifier indexes and the application's byte-code. After analyzing the decompiled OAT code of the file located at 0xa5a74000L, the identifier (DEX_CLASSDEF_IDX = 0x1394) is selected for the class of objects (com.whatsapp.protocol.l) that indicates the storage of the target application's message text data. Searching the list of heap object references for references to the definition of the requested class, it is possible to identify the parent java.lang.Class object (described in the Android framework image at 0x700c7220L):
>python vol.py --profile=LinuxLinuxGoldfish_3_4_67ARM
 -f memdumpWhatsAppChat.lime -p 1206 -b 0x700c7000
 art_dump_rosalloc_heap_objects -e 0x12c00000 -m 0xb1d70000 -s 0x4000000
address       page  bracket  slot  obj class
------------  ----  -------  ----  --------------------
0x1384d2c0L   3149  13       2     *(FOUND)* 0x12c19020
0x1384e0c0L   3149  13       18    *(FOUND)* 0x12c19020
0x1384f240L   3149  13       38    *(FOUND)* 0x12c19020
0x13850820L   3149  13       63    *(FOUND)* 0x12c19020
Then, using a developed tool for object data recovery, it is possible to examine the data of each specific object of this class; i.e., data recovery is performed for the object located at 0x1384d2c0:
>python vol.py --profile=LinuxLinuxGoldfish_3_4_67ARM
 -f memdumpWhatsAppChat.lime -p 1206 -b 0x700c7000
 art_extract_object_data -o 0x1384d2c0
Object Address: 0x1384d2c0
Class Address: 0x12C19020 Loaded: 0x700c7220L LJava/lang/Class;
classLoader 0x12c02b20L
componentType 0x0L
dexCache 0x12c01610L LJava/lang/DexCache;
directMethods 0x133ff980L [LJava/lang/reflect/ArtMethod;
iFields 0x12c04900L [LJava/lang/reflect/ArtField;
..
sFields 0x13407500L [LJava/lang/reflect/ArtField;
dexClassDefIndex 0x1394L
dexTypeIndex 0x1810L
Among the recovered data, the address referencing the array of properties java.lang.reflect.ArtField[] (at 0x12c04900L) is found. With a new search at this address and for this type of class, data from the conversation, including the message text, is recovered. By tracking through the references and properties of the recovered objects of this class, other attributes are identified: text, date, peer ID, and other data. Figure 4 illustrates the links between some of the addresses visited for the retrieval of data objects related to the target object. It is noteworthy that the developed tools also support the reverse process which, given a specific object property (e.g., a message text), reveals the references of objects related to the concerned chat.
6.2 Evaluation with a Samsung Device
The first real device evaluated was a Samsung Galaxy S4 (GT-I9500 non-LTE) with CPU Exynos 5410, 2 GB RAM, original Android 5.0.1 (API level 21), build number LRX22C.I9500UBUHOL1, and vm.heapSize 64 MB.
Environment Set-Up. The cross-compiled kernel source code (version 3.4.5) was obtained from the manufacturer's open source release site (http://opensource.samsung.com).
Evaluation. Initially, the procedure to locate the target process (com.whatsapp) is executed, and data about the application execution environment is retrieved, such as the addresses related to the Android framework mapping. Then, as shown in Fig. 5, it is interesting to find that the Android framework image header in this device is different from that in the emulated device, although this real system presents the same ART header version identification (009).
In the real device, the header field for the image address does not point to a valid absolute address in the image segment. This difference suggests that this manufacturer's Android OS does not correspond to the AOSP source code. Consequently, the technique proposed in this paper cannot be fully used in this case, since the unknown header demands reverse engineering of the ART image present in this real device. This evaluation result shows a common limitation of forensic procedures designed for the extraction of objects from ever-evolving operating systems in mobile devices. Moreover, the consequent requirement to adapt the proposed technique to this new situation comes up against an important obstacle, since there is no publicly available ART runtime source code provided by the concerned manufacturer.
6.3 Evaluation with a Motorola Device
The second real device evaluation concerned a Motorola Moto E (XT1021) with CPU Qualcomm Snapdragon 200, 1 GB RAM, original Android upgraded to 5.1 (API level 22), build number BLURVERSION.23.201.3.CONDORRETBR.RETBR.EN.BR, and vm.heapSize 128 MB.
Environment Set-Up. The cross-compiled kernel source code (version 3.4.42) was obtained from the manufacturer's open source release site (https://github.com/MotorolaMobilityLLC/motorola-kernel). Before the chat application installation and the flashing of the modified kernel, it was necessary to unlock this device's bootloader using the official procedures described on the manufacturer's site (https://motorola-global-portal.custhelp.com/app/standalone/bootloader/unlock-your-device-a).
Evaluation. Initially, the procedure to locate the target process (com.whatsapp) is executed, and data about the application execution environment is retrieved, such as the addresses related to the Android framework mapping. Then, it is verified that the Android framework image header in this device presents a different ART header version identification (012) than the previous analyses, thus requiring an adaptation of the developed tools to the header layout according to the source code of the Android 5 branch (lollipop-release) from AOSP (https://android.googlesource.com/platform/art/+/lollipop-release/runtime/image.h).
After that, using the developed tools, all memory extraction attempts failed to locate data for the class root address pointed to in the ART header, because the exact page that contained the class root address had been swapped out of RAM, which may be related to the ratio between the RAM size and the vm.heapsize configured by the manufacturer. Then, a different strategy for data recovery was defined for the device under evaluation. Knowing that the image file is patched in every boot process, adjusting the framework image address due to the ASLR, the developed analysis tools were adapted to work directly with the ART image file, which can be extracted from the secondary memory associated with the RAM dump. In that way, it was possible to recover data from the swapped-out page (the page which was not found before our tool adaptations) by a direct read from the original image file patched and mapped in RAM.
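A minimal sketch of this rebasing idea follows, assuming the extracted boot image file and the mapped base address from the ART image header are available; the file name and addresses are illustrative (the addresses reuse the emulator values shown in Sect. 6.1):

# Sketch of reading object bytes directly from the extracted boot image
# file when the corresponding page is missing from the RAM dump. Since the
# image file is re-patched at boot with its ASLR'd base, a virtual address
# maps back to a plain file offset. File name and addresses are examples.
def virt_to_file_offset(vaddr, image_begin):
    # image_begin is the mapped base recovered from the ART image header
    return vaddr - image_begin

with open('boot.art', 'rb') as f:
    image = f.read()

vaddr = 0x700c7220                      # e.g., java.lang.Class root address
off = virt_to_file_offset(vaddr, 0x700c7000)
chunk = image[off:off + 64]             # raw object bytes for further decoding
print(chunk.hex())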
Fig. 4. Memory references to objects found when recovering a text message [1].
Fig. 5. Examples of image_roots addresses: a valid address (top) and an invalid one (bottom), the latter found in a RAM dump from a Samsung device (Source: the authors).
With these adjustments, the technique proposed in this paper could be successfully used, and Java objects could be recovered from the heap. Figure 6 shows messages recovered from the WhatsApp applications, exchanged between the evaluated real devices. This evaluation result shows that the proposed technique can be effective on real devices.
Fig. 6. Chat messages recovered from Motorola RAM extraction (Source: the authors).
7 Conclusions and Future Work

This paper proposes methods and tools to tackle a common challenge for the forensic analyst related to new releases of a mobile device operating system. The described case regards the Android ART runtime environment, since the release of this system requires the revision of forensic tools and methods, leading to the development of new investigation tools and the consequent adaptation of methods.
Given that Android ART poses new requirements for the forensic analyst, this paper presents a technique for object data analysis in RAM acquisitions from devices compliant with the ARM 32-bit architecture. The work includes the study of concepts and structures of the ART runtime environment, present in the Android operating system version 5.0 from AOSP. The experimental evaluation of the proposed technique is performed using software tools developed within the Volatility Framework.
The contribution of the proposed technique comes from its ability to extract and analyze Java objects in ART while revealing the involved memory structures, thus going beyond the analysis of earlier system releases and other traditional techniques based on detecting patterns intrinsic to the artefact components. Another contribution is the set of supporting tools developed as Volatility plugins, which can also be useful as reverse-engineering tools.
The validation of the proposed method and tools illustrates the challenge that new releases of mobile operating systems pose to the forensic professional. For instance, there is the case of a manufacturer whose Android OS does not correspond to the AOSP source code. Consequently, the technique proposed in this paper cannot be fully used in this case, since the concerned system demands the reverse engineering of the ART image present in a real device. Our evaluation result shows a common limitation of forensic procedures designed for the extraction of objects from ever-evolving operating systems in mobile devices. Moreover, the consequent requirement to adapt the proposed technique to such new situations comes up against an important obstacle, since there is no publicly available ART runtime source code provided by the concerned manufacturer. Another interesting observation concerned the Android framework image header in a device that presents a different ART header version identification than the AOSP source code, thus requiring an adaptation of the developed tools to this particular header.
It is noteworthy that the proposed technique and the constructed support tools have the flexibility to be adapted to other computer architectures (including 64-bit), to devices with different hardware limitations, and to comply with ART modifications already identified in the AOSP source code of the latest versions of Android. It is relevant that the technique is successful in analyzing heap objects from ART in an emulated device and some real ones, though our evaluation identified an implementation of ART in a real device that differs from the AOSP version tested in an emulated device. As future work, the authors intend to carry out the experimental validation of the technique with data retrieved from other real devices and architectures, and to associate the technique with similar ones for the detection and analysis of malware.
Acknowledgements. This research work has the support of the Brazilian Research, Development and Innovation Agencies CAPES – Coordination for the Improvement of Higher Education Personnel (Grant 23038.007604/2014-69 FORTE – Tempestive Forensics Project), FINEP – Funding Authority for Studies and Projects (Grant 01.12.0555.00 RENASIC/PROTO – Secure Protocols Laboratory of the National Information Security and Cryptography Network), and CNPq – National Council for Scientific and Technological Development (Grant 465741/2014-2 Science and Technology National Institute – INCT on Cybersecurity), as well as the Brazilian Federal Police (Contract 36/10 DITEC/DPF/MJ-FUB) and the Civil Police of the Brazilian Federal District (IC/PCDF).
References

1. Soares, A.M.M., de Sousa Jr., R.T.: A technique for extraction and analysis of application heap objects within Android Runtime (ART). In: Proceedings of the 3rd International Conference on Information Systems Security and Privacy (ICISSP 2017), pp. 147–156. SciTePress (2017)
2. IDC: Smartphone OS Market Share, 2017 Q1. http://www.idc.com/promo/smartphone-market-share/os. Accessed 04 Sept 2017
3. Simão, A.M.L., Sícoli, F.C., Melo, L.P., Deus, F.E., de Sousa Jr., R.T.: Acquisition and analysis of digital evidence in Android smartphones. Int. J. Forensic Comput. Sci. 1, 28–43 (2011). https://doi.org/10.5769/J201101002
4. Brezinski, D., Killalea, T.: Guidelines for evidence collection and archiving. RFC 3227, IETF (2002)
5. Carrier, B.D.: Defining digital forensic examination and analysis tools using abstraction layers. IJDE 1(4), 1–12 (2003)
6. Wächter, P., Gruhn, M.: Practicability study of Android volatile memory forensic research. In: 2015 IEEE International Workshop on Information Forensics and Security (WIFS), pp. 1–6. IEEE (2015)
7. Sylve, J., Case, A., Marziale, L., Richard, G.G.: Acquisition and analysis of volatile memory from Android devices. Digit. Invest. 8(3), 175–184 (2012)
8. Apostolopoulos, D., Marinakis, G., Ntantogian, C., Xenakis, C.: Discovering authentication credentials in volatile memory of Android mobile devices. In: Douligeris, C., Polemi, N., Karantjias, A., Lamersdorf, W. (eds.) I3E 2013. IAICT, vol. 399, pp. 178–185. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-37437-1_15
9. Hilgers, C., Macht, H., Müller, T., Spreitzenbarth, M.: Post-mortem memory analysis of cold-booted Android devices. In: Eighth International Conference on IT Security Incident Management & IT Forensics (IMF), pp. 62–75. IEEE (2014)
10. Backes, M., Bugiel, S., Schranz, O., von Styp-Rekowsky, P., Weisgerber, S.: ARTist: the Android runtime instrumentation and security toolkit. Cornell University Library. arXiv:1607.06619 (2016)
11. Google: Android Open Source Project – AOSP. http://source.android.com. Accessed 04 Sept 2017
12. Yaghmour, K.: Embedded Android: Porting, Extending, and Customizing. O'Reilly Media Inc., Newton (2011)
13. Drake, J.J., Lanier, Z., Mulliner, C., Fora, P.O., Ridley, S.A., Wicherski, G.: Android Hacker's Handbook. Wiley, Hoboken (2014)
14. Ligh, M.H., Case, A., Levy, J., Walters, A.: The Art of Memory Forensics: Detecting Malware and Threats in Windows, Linux, and Mac Memory. Wiley, Hoboken (2014)
15. Sabanal, P.: State of the ART. Exploring the new Android KitKat runtime (2014). https://conference.hitb.org/hitbsecconf2014ams/materials/D1T2-State-of-the-Art-Exploring-the-New-Android-KitKat-Runtime.pdf. Accessed 20 Oct 2016
16. Sabanal, P.: Hiding behind ART (2015). https://www.blackhat.com/docs/asia-15/materials/asia-15-Sabanal-Hiding-Behind-ART-wp.pdf. Accessed 20 Oct 2016
17. Høgset, E.S.: Investigating the security issues surrounding usage of ephemeral data within Android environments. Master's thesis, UiT The Arctic University of Norway (2015)
Efficient Detection of Conflicts in Data Sharing Agreements

Gianpiero Costantino, Fabio Martinelli, Ilaria Matteucci, and Marinella Petrocchi

Istituto di Informatica e Telematica, Consiglio Nazionale delle Ricerche, Pisa, Italy
{gianpiero.costantino,fabio.martinelli,ilaria.matteucci,marinella.petrocchi}@iit.cnr.it
Abstract. This paper considers Data Sharing Agreements and their management as a key aspect for a secure, private and controlled access to, and usage of, data. After describing formats and languages for the agreements, we focus on the design, development, and performance evaluation of an analysis tool that spots potential conflicts within the data privacy policies constituting the agreement. The promising results achieved in terms of execution time, by varying the number of rules in the agreements and the number of terms in the rules vocabulary, pave the way for the employment of the analyser in a real-use context.
Keywords: Data sharing rules · Policy analysis and conflict detection · Controlled data sharing · DSA management · Formal analysis · Performance evaluation · Data security · Data privacy
1 Introduction
With the support of highly-connected IT systems, sharing data among individuals and associations is becoming easier and easier. As an example, businesses, public administrations, and health-care organisations are increasingly choosing to use cloud infrastructures to store, share, and collaboratively operate on data. Online data management and sharing, however, poses several problems, including the uncontrolled propagation of information and the misuse of data. For years, researchers have successfully shown how a technical approach based on the definition and enforcement of data sharing policies represents a valid support for automatising and assuring a secure and private way of exchanging, storing, and managing data; see, e.g., the series of work in [1–4], ranging over an interval of more than two decades. This paper concentrates on the same field, considering a framework that permits the exchange of information by enforcing privacy policies to access and use data in a controlled way. Our proposal
is supported by the concept of Data Sharing Agreement (DSA).¹ DSA specify policies that are applied for accessing the data to which they are linked. In particular, they represent the machine-readable transposition of a traditional paper contract regulating the access, usage, and sharing of data. A DSA conveys different pieces of information, like the parties stipulating the contract, the purpose of the data sharing, the kind of data, and a set of rules stating which actions are authorised, prohibited, and obliged on such data. Possibly edited by different actors from various perspectives (such as the legal and the business ones), a DSA could quite naturally include conflicting data sharing rules: the access to, and usage of, some data could be permitted according to some rules and denied according to others.
Here, we offer a panoramic view of DSA, from the languages and formats chosen for suitably expressing the agreement, to the infrastructure needed to handle the life-cycle of a DSA. The infrastructure includes a component for creating and modifying the DSA, an analyser for checking the consistency of the DSA rules, and a mapper to translate the rules into a language amenable to policy enforcement. The paper specifically focuses on the design and implementation of the analysis tool, the DSA-Analyser, which detects potential conflicts among rules in the DSA before the actual enforcement of the agreement. Conveniently guiding the editor to the kind of conflicts and the reasons which may cause them, the DSA-Analyser is available as a web service application and exposes its features through APIs. We also present a performance evaluation of the analyser, in terms of execution time, varying the number of rules in the DSA and the number of terms in the underlying vocabulary. The result of the evaluation is promising, leading to the completion of the analysis within a few seconds for real DSA consisting of around 250 rules.
Contributions with Respect to the Paper "Analysis of Data Sharing Agreements", Appeared in the Proceedings of ICISSP 2017. This manuscript revises and extends the previous version by:
– describing the controlled natural language adopted as the specification language for Data Sharing Agreements (Sect. 2.2);
– introducing the DSA management infrastructure that integrates, among other components, the DSA analyser here proposed (Sect. 2.4);
– specifying the kind of conflicts we are able to deal with (Sect. 2.3), as well as showing examples of conflict detection (Codes 1.10, 1.11, and 1.12 in Sect. 3.1, along with Figs. 3, 4, and 5);
– adopting a new reference DSA as the running example in this manuscript (Codes 1.1 and 1.2 in Sect. 3);
– enriching the vocabularies of the case studies, as the natural consequence of having adopted the new reference DSA;
¹ The original design and development of DSA-based frameworks, as well as recent innovation updates, have been carried out within past and ongoing EU projects. The interested reader can consult: http://www.consequence-project.eu/, http://www.coco-cloud.eu/, http://c3isp.eu/. (All URLs in this paper accessed on August 3, 2017.)
– as a non-negligible effect of such changes, we performed new simulations to measure the analyser's performance; thus, Sect. 3.2, about the analyser performance results, has been totally renewed in its outcome;
– finally, the Title, Introduction, Related Work, and Conclusions have been renewed, rephrased, and modified.
The Paper is Structured as Follows. The next section introduces (i) basic notions on the DSA structure and language, (ii) the kind of analysed conflicts, and (iii) a proposal for an overall framework for DSA management. Section 3 describes the design, development, and performance evaluation of the DSA analyser. Section 4 highlights how to proceed towards rule enforcement, once the DSA has been analysed. Section 5 discusses related work and Sect. 6 concludes the paper.
2 Data Sharing Agreements
This section introduces Data Sharing Agreements, the language adopted for their specification [5], and the kind of conflicts we consider in the current work.
2.1 DSA Definition
DSA are electronic contracts regulating data sharing. They consist of:
– the title, a label which can be used to identify the DSA (DSA ID);
– the parties, either natural or legal persons, specified by means of their names, roles and responsibilities. Borrowing the terminology from the personal data protection regulations, the roles of the parties are the Data Controller, the Data Processor, and the Data Subject.² Responsibilities are legal duties of the parties about gathering, sharing, and storing the data that is the subject of the agreement, expressed in pure natural language;
– the validity period, stating the DSA start and end dates;
– the vocabulary, which provides the terminology to edit the DSA data sharing rules. The vocabulary is defined by an ontology, i.e., a formal explicit description of a domain of interest (like, for example, a medical, a business, or a public administration domain);
– the data classification, describing the nature of the data covered by the DSA, such as personal data (e.g., contact details, medical data, judicial data) and even non-personal data (e.g., business data, such as corporate strategy development analyses, customer data, product development plans);
– the purpose of the DSA, which is linked with the data classification. Examples of purposes are the provision of health-care services (e.g., for diagnoses), administrative purposes (e.g., for booking and payments), marketing (e.g., for proposals of commercial services), and the fulfilment of law obligations (e.g., to access data when needed by public authorities).
² Terminology adopted in the European Parliament Directive 95/46/EC and in the new General Data Protection Regulation (GDPR, actionable from 2018).
Finally, a DSA contains the rules regulating data sharing:
– the authorisations section contains rules on permitted operations;
– the prohibitions section contains rules on operations which are not allowed;
– the obligations section contains rules on mandatory operations.
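To fix ideas, the following is our own simplified, illustrative in-memory view of the DSA structure just described; real DSA are XML documents (cf. Code 1.1), and all field names and values here are invented examples:

# Illustrative, simplified view of a DSA; field names and values are
# invented examples, not the actual XML schema used by the tools.
dsa = {
    'id': 'DSA-0042',                                   # title / DSA ID
    'parties': [{'name': 'Hospital A', 'role': 'Data Controller'},
                {'name': 'Clinic B', 'role': 'Data Processor'}],
    'validity': {'start': '2017-01-01', 'end': '2018-01-01'},
    'vocabulary': 'http://vocabulary.example.org/healthcare#',
    'data_classification': 'personal data: medical data',
    'purpose': 'provision of health-care services',
    'authorisations': ['doctors can read medical reports during office hours'],
    'prohibitions': ['doctors cannot read x-ray reports'],
    'obligations': ['each access to the data must be recorded'],
}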
2.2 A Controlled Natural Language for Data Sharing Agreements
For specifying the data sharing rules in a user-friendly manner, the user adopts so-called controlled natural languages, which loosely constrain sentences and terms within fixed formal constructs. Here, we briefly recall the Controlled Natural Language for Data Sharing Agreements (CNL4DSA), introduced in [5].
The core of CNL4DSA is the notion of fragment, a tuple f = ⟨s, a, o⟩ where s is the subject, a is the action, and o is the object. The fragment expresses that "the subject s performs the action a on the object o", e.g., "the doctor reads the medical report". It is possible to express authorisations, obligations, and prohibitions by adding the can/must/cannot constructs to the basic fragment.
Fragments are evaluated within a specific context. In CNL4DSA, a context is a predicate c that usually characterises factors such as users' roles, data categories, time, and geographical location. Contexts are predicates that evaluate either to true or false. To describe complex policies, contexts must be combined. Hence, we use the Boolean connectors and, or, and not for describing a composite context C, which is defined inductively as follows:

C := c | C and C | C or C | not c

The syntax of a composite authorisation fragment, F_A, is as follows:

F_A := nil | can f | F_A ; F_A | if C then F_A | after f then F_A | (F_A)

with the following meaning:
– nil can do nothing.
– can f is the atomic authorisation fragment expressing that f is allowed, where f = ⟨s, a, o⟩. Its informal meaning is: the subject s can perform the action a on the object o.
– F_A ; F_A is a list of composite authorisation fragments.
– if C then F_A expresses the logical implication between a context C and a composite authorisation fragment: if C holds, then F_A is permitted.
– after f then F_A is a temporal sequence of fragments. Informally, after f has happened, the composite authorisation fragment F_A is permitted.
The list of authorisations comprises all the composite authorisation fragments that define the access rights on the data. Also, CNL4DSA has a specific syntax expressing composite obligation and prohibition fragments. Similar to the authorisations, the obligation fragment indicates that the subject s must perform the action a on the object o, while, for the prohibition, the subject s cannot perform the action a on the object o.
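As an illustration only (this encoding is ours and not part of the CNL4DSA tooling), fragments and composite contexts can be modelled as plain data and predicates:

# Our own illustrative Python encoding of CNL4DSA fragments and
# composite contexts; not part of the CNL4DSA tooling.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass(frozen=True)
class Fragment:
    subject: str
    action: str
    obj: str

Context = Callable[[Dict[str, str]], bool]   # evaluates to True or False

def atom(attr: str, value: str) -> Context:
    return lambda env: env.get(attr) == value

def AND(c1: Context, c2: Context) -> Context:
    return lambda env: c1(env) and c2(env)

def OR(c1: Context, c2: Context) -> Context:
    return lambda env: c1(env) or c2(env)

def NOT(c: Context) -> Context:
    return lambda env: not c(env)

# "if hasRole(doctor) and isOfficeHours then can <doctor, read, report>"
ctx = AND(atom('role', 'doctor'), atom('time', 'office-hours'))
f = Fragment('doctor', 'read', 'medical report')
env = {'role': 'doctor', 'time': 'office-hours'}
print(ctx(env))   # True: the authorisation fragment f is permitted here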
One of the advantages of adopting CNL4DSA is the possibility of associating a formal semantics with composite fragments, described through Modal Transition Systems (MTS) [6]. This makes the language suitable for formal analysis, even exploiting existing tools, such as Maude [7], as presented in Sect. 3.
2.3 Conflicting Rules in a DSA
At the time of DSA composition, different rules can be inserted, at different levels and by different authors. Consider, for example, a legal expert composing legal authorisation rules that strictly descend from legislation and regulations (e.g., the fact that some stored data must be maintained for a specific period of time). Then, a policy expert at a particular organisation may want to add specific data sharing rules to the DSA, i.e., rules that are internal to the organisation itself (e.g., the fact that access to data must be recorded). Even the fact that data are stored in a specific country may imply restrictions on their transfer. Thus, it may happen that two, or more, rules composing the DSA could allow and deny the data access and usage under the same contextual conditions. Such conditions are a collection of attributes referring to the subject, object, and environment of the data sharing rules, constraining the scenario in which an action is authorised, prohibited, or obliged. As an example, consider the following authorisation: Doctors can read medical reports during office hours. When a subject tries to access a certain datum, according to that authorisation the access will be granted if the subject is a doctor, the datum is a medical report, and the time at which the access request is being made is within office hours.
In this work, we consider conflicts between authorisations and prohibitions and between obligations and prohibitions. Also, the analysis tool proposed in the following deals with three kinds of conflicting rules (nomenclature inherited from [8]):
Contradictions. Two rules are contradictory if one allows and the other denies the right to perform the same action by the same subject on the same data under the same contextual conditions. In practice, the rules are exactly the same, except for their effect (deny/permit the access). As a simple example: an authorisation stating if the data is stored in the European Union and the user is located in Europe, then the user can modify the data, and a prohibition stating exactly the opposite, if the data is stored in the European Union and the user is located in Europe, then the user cannot modify the data.
Exceptions. One rule is an exception to another one if they have different effects (one permits and the other denies) on the same action, but the subject (and/or the data, and/or the contextual conditions) of one rule belongs to a subset of the subject (and/or the data, and/or the conditions) of the other one. As an example: an authorisation stating doctors can read medical data and a prohibition stating doctors cannot read x-ray reports, where quite obviously x-ray reports are a subset of medical data.
Correlations. Finally, two rules are correlated if they have different effects (permit and deny) and the sets of conditions of the two rules intersect one another. As an example: doctors can modify x-ray reports and doctors located outside the hospital in which the x-ray reports are stored cannot modify those x-ray reports. The two rules raise a conflict when a doctor tries to modify an x-ray report while she is not in the hospital.
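The three categories can be illustrated by comparing the sets of contextual conditions under which a permit rule and a deny rule apply; the sketch below is our own simplified illustration of the distinction (the DSA-Analyser itself performs the detection via Maude, as described in Sect. 3):

# Simplified illustration of the three conflict categories: each rule's
# applicability is modelled as a set of (attribute, value) conditions.
# The DSA-Analyser itself performs the detection via Maude (Sect. 3).
def classify(permit_conds, deny_conds):
    if permit_conds == deny_conds:
        return 'contradiction'   # same conditions, opposite effects
    if permit_conds < deny_conds or deny_conds < permit_conds:
        return 'exception'       # one rule's scope is inside the other's
    if permit_conds & deny_conds:
        return 'correlation'     # the scopes only partially overlap
    return 'no conflict'

can_read = {('role', 'doctor'), ('action', 'read'), ('data', 'medical data')}
cannot_read = {('role', 'doctor'), ('action', 'read'),
               ('data', 'medical data'), ('data', 'x-ray reports')}
print(classify(can_read, cannot_read))   # exception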
2.4 Data Sharing Agreements Infrastructure
A DSA life-cycle is automatically managed by a DSA management framework, originally proposed in [9], comprising a DSA Authoring Tool, a DSA Analyser, and a DSA Mapper Tool, glued together by the DSA Lifecycle Manager (see Fig. 1):
Fig. 1. DSA management framework.
– With the DSA Authoring Tool, the author can edit the DSA. The data sharing rules are constrained by the CNL4DSA constructs [5], and the terms in the rules come from specific vocabularies (ontologies);
– the DSA Analyser analyses the rules in a DSA, detecting potential conflicts among them. In case a conflict is detected, a conflict solver strategy based on the prioritisation of rules is put in place to correctly enforce the DSA;
– the DSA Mapper translates the CNL4DSA rules into an enforceable language. The mapping process takes as input the analysed DSA rules, translates them into an XACML-like language, and combines all the rules in line with predefined conflict solver strategies. The outcome of this tool is an enforceable policy. Such a policy will be evaluated at each request to access and/or use the target data;
– finally, the DSA Lifecycle Manager orchestrates all the previous components. When a user logs into the DSA Lifecycle Manager, the latter enacts her specific
functions, according to the user's role (e.g., end-user, policy maker, legal user, as defined in [10]). Thus, users interact with the DSA tools via the DSA Lifecycle Manager.
3 The DSA Analyser
In our scenario, we imagine a policy expert at an organisation (such as a hospital, a public administration, or a private company): she aims at analysing the rules in a DSA, which may have been composed by different actors, like the policy expert herself and a legal expert who knows the legal constraints applicable to the data whose sharing is controlled by that DSA. The DSA-Analyser has been developed using RESTful technology (see footnote 3). This allows the tool to be reachable through a simple HTTP call, while the core component can be implemented in a different programming language, such as Java. This way, the interaction with the analyser is quite versatile, since it can be done directly from a generic web browser as well as from client software developed to interact with the tool. We develop the core of the DSA-Analyser, as well as its features, using Java v8. The DSA-Analyser then runs as a web application on an Apache Tomcat v7.0.70 server. To call the DSA-Analyser, a simple web-client application specifies the server URI. The client specifies the call type - POST for the DSA-Analyser - and the DSA ID, which is sent as payload in the call. The inner analysis process is hidden from the user and is performed by Maude [7] (see footnote 4), a popular executable programming language that models distributed systems and the actions within those systems. We let Maude group data sharing rules into authorisations, prohibitions, and obligations. Each set of rules is seen as a process, describing the sequence of authorised, denied, and obliged actions, respectively. Maude is executable and comes with built-in commands allowing, e.g., searching for allowed sequences of actions within a set of rules. We simulate all the possible access requests, given the application domain of the rules (in Sect. 3.1 we will detail how the access requests are built). When Maude finds at least two sequences, in different sets, which are equal (e.g., subject with role doctor reads object of type x-ray report), a potential conflict is detected. When Maude finishes the computation, the analysis outcome is shown to the user through a graphical interface. The DSA-Analyser takes as input a DSA ID from an external database with the available DSA for a specific organisation (Fig. 2). Then, it grabs the DSA through its ID and starts processing it. As shown in Codes 1.1 and 1.2, the DSA content follows an XML structure. The analyser executes the following steps to check conflicts:
Step 1. It checks that the XML file is well formed and can be properly parsed.
Step 2. It reads the root elements of the DSA, such as the purpose, id, parties, roles, title (Code 1.1, line 2).
3 http://www.ibm.com/developerworks/library/ws-restful/
4 maude.cs.illinois.edu
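As an illustration of this interaction, a minimal Java client could issue the POST call as follows. The endpoint URI and the DSA ID are hypothetical placeholders, not the tool's documented interface, and we use Java 11's HttpClient for brevity even though the server side runs on Java 8:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DsaAnalyserClient {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint: the actual server URI is deployment-specific.
        URI endpoint = URI.create("http://localhost:8080/dsa-analyser/analyse");
        HttpRequest request = HttpRequest.newBuilder(endpoint)
                .header("Content-Type", "text/plain")
                // The DSA ID travels as the POST payload, as described in the text.
                .POST(HttpRequest.BodyPublishers.ofString("DSA-12345"))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // The analyser answers with a JSON report of the detected conflicts.
        System.out.println(response.body());
    }
}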
Code 1.1. Example header in a DSA source code.
All rules written in Maude are part of a bigger template that is given as input to Maude to evaluate the rules. The DSA-Analyser uses a single template, filled in with the rules parsed from the DSA. The STATEMENTS_HERE placeholder in Code 1.6 is the part of the template where the DSA-Analyser inserts the DSA rules, converted as in the previous steps.

Code 1.6. Excerpt of a template.
...
mod EXAMPLE is
  inc CNL4DSA .
  eq dsa-auth = STATEMENTS_HERE
endm
...
STATEMENTS_HERE is updated as in Code 1.7 (for the sake of simplicity, we show only one authorisation rule).
Code 1.7. Authorisation rules in the Maude template.
...
mod EXAMPLE is
  inc CNL4DSA .
  eq dsa-auth = ( 'Statement0 =def
    ((isrelateto(data,medical)) and (hasrole(subject,patient)))
    < subject, 'write, data > . 0 )
endm
...
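Operationally, the step from Code 1.6 to Code 1.7 is a plain placeholder substitution. A minimal sketch (our own, assuming the parsed rules are already available as Maude-syntax strings) could be:

import java.util.List;

final class MaudeTemplateFiller {
    // Textual substitution of the placeholder, as between Codes 1.6 and 1.7;
    // maudeRules are the CNL4DSA rules already converted to Maude syntax.
    static String fill(String template, List<String> maudeRules) {
        return template.replace("STATEMENTS_HERE", String.join("\n", maudeRules));
    }
}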
3.1 Algorithm for Conflict Detection
The data sharing rules are evaluated in Maude under a set of contextual conditions (hereafter called contexts), which mimic the properties holding at the time of the access request. Thus, contexts instantiate properties of the subject, the data, and the external environment: for a generic access request, when a subject requests access to some data, such properties will be checked against the DSA rules, to evaluate the right of that subject to access those data. Contexts may specify, e.g., the role of the subject making the request, the category of the requested data, the time of day at which the request is being made, the geographical location (of both data and subjects), and so on. An example of a context is in Code 1.8 (see footnote 5).

Code 1.8. Example of contexts.
Data is related to subject = true;
Subject has role patient = true;
A peculiar feature of the DSA-Analyser is its ability to simulate all possible contexts, given the DSA vocabulary and the properties specified in the DSA rules. The algorithm for context generation is detailed in the following. Algorithm for Context Generation. A DSA vocabulary is made up of properties p and terms t. A single DSA document may contain a subset of the properties. Examples of properties are hasRole, hasCategory, hasID. Properties have a domain and a co-domain. Examples of domains are Subject and Data. Examples of co-domains are Doctor for hasRole and Medical for hasCategory. To automatise the process of generating all the possible combinations of properties, ranged over all the different terms, we first consider the array P[pq], 0 ≤ q ≤ n, shown below. The array P lists all the properties specified in the rules that appear in a DSA.
5 For the sake of readability, we write contexts in a semi-natural language format.
Algorithm 1. Combination Matrix Algorithm [11].
Require: an array with all properties P = [P1, P2, ..., PN] and N terms arrays, where each position P[q] is associated with a terms array Tq = [t1, t2, ..., tK].

function CombinationMatrix(P)
  numRows = 1
  for i = 1 to P.length do
    T = P[i]
    if T.length > 0 then
      numRows = numRows * T.length        // number of rows the matrix will have
    end if
  end for
  // initialisation of variables
  numColumns = P.length                   // number of columns of the matrix
  combinationMatrix = new [numRows, numColumns]
  counter = 0                             // the index to write into the matrix
  previousArrayLength = 1                 // length of the terms array processed before the current one
  iterationNumber = 0                     // counts the outer loops done so far
  innerIterationNumber = 0                // counts the loops done in the inner for
  arraySize = 0                           // the size of the array being processed
  for i = (numColumns - 1) downto 0 do
    if iterationNumber == 0 then
      previousArrayLength = 1
    else
      T = P[i + 1]                        // P[i + 1] because the loop decrements
      previousArrayLength = previousArrayLength * T.length
    end if
    T = P[i]
    arraySize = T.length
    for j = 0 to numRows - 1 do
      if iterationNumber == 0 then
        combinationMatrix[j][i] = counter
        if counter < (arraySize - 1) then // (arraySize - 1) because indexing starts from zero
          counter++
        else
          counter = 0
        end if
      else
        combinationMatrix[j][i] = counter // write the value of counter into the matrix
        // repeat the counter value until the inner iterations reach the length of the
        // previously processed array; stop at (previousArrayLength - 1) because
        // counting starts from zero
        if innerIterationNumber == (previousArrayLength - 1) then
          counter++
          innerIterationNumber = 0
          if counter == arraySize then    // reinitialise the counter once it reaches arraySize
            counter = 0
          end if
        else
          innerIterationNumber++
        end if
      end if
    end for
    // reinitialise everything for the next loop
    iterationNumber++
    innerIterationNumber = 0
    counter = 0
  end for
end function
P[p0] -> T0[t0_0, t0_1, t0_2, ..., t0_i]
P[p1] -> T1[t1_0, t1_1, t1_2, ..., t1_j]
...
P[pn] -> Tn[tn_0, tn_1, tn_2, ..., tn_k]
For each position of the array P, another array Tq contains the list of terms representing the co-domain for the specific property pq . The DSA-Analyser grabs from the DSA vocabulary the properties and associated terms to form P and T, by filtering out those properties that do not belong to the rules of the specific DSA under investigation. Instantiating the above structure with an example, we have: [hasRole] -> [Role1 , Role2 ] [hasDataCategory] -> [Category1 ] [hasID] -> [id1 , id2 ]
We then create a matrix M whose values are pointers to a property with an associated term. The number of rows in the matrix is equal to the number of all possible combinations of properties and terms:

Mrows = |hasRole| * |hasDataCategory| * |hasID|

The number of columns is equal to the number of properties:

Mcolumns = |Properties|

In our example, we have Mrows = 2 * 1 * 2 = 4 and Mcolumns = 3. We start filling the content of the matrix from the last column on the right. This last column will contain pointers to the last item of P, i.e., P[pn] (and to the corresponding values in Tn). We initialise a counter, which starts at zero and stops increasing at [hasID].length - 1 (in the example, 2 - 1 = 1, thus leading to only two possible values, 0 and 1). Once the counter reaches [hasID].length - 1, it starts again, until all the rows are filled in. Thus, we get a partially filled matrix, as follows:

M = [ X X 0
      X X 1
      X X 0
      X X 1 ]

Then, the algorithm starts processing the second array from the bottom, P[pn-1] (in our example, the element [hasDataCategory]). The counter stops at [hasDataCategory].length - 1 = 1 - 1 = 0. This means that, for all rows of the corresponding column in the matrix, we put 0:
M = [ X 0 0
      X 0 1
      X 0 0
      X 0 1 ]
As the last step in the example, we proceed with the element [hasRole], which also ranges over two terms: even in this case, the counter can take two possible values, 0 and 1. Differently from the way the right-hand column was filled, the algorithm fills this column by repeating each counter value a number of times equal to:

[hasID].length * [hasDataCategory].length = (2 * 1) = 2

Concluding, the algorithm generates the combination matrix:

M = [ 0 0 0
      0 0 1
      1 0 0
      1 0 1 ]

The DSA-Analyser uses the matrix and works at row level to generate all the possible contexts. At the first iteration, the DSA-Analyser generates the context in Code 1.9:
At the second iteration, the context produced by the DSA-Analyser is: #2 Subject hasRole Role1 = true; Data hasDataCategory Category1 = true; Subject hasID id2 = true;
Finally, third and forth iterations are: #3 Subject hasRole Role2 = true; Data hasDataCategory Category1 = true; Subject hasID id1 = true; #4 Subject hasRole Role2 = true; Data hasDataCategory Category1 = true; Subject hasID id2 = true;
Algorithm 1 shows the pseudo-code for the generation of the combination matrix M .
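For readers who prefer compact code over pseudo-code, the following Java sketch (our own condensed rendering of Algorithm 1, not the tool's actual implementation) produces the same matrix by mixed-radix enumeration:

final class CombinationMatrix {
    // termCounts[q] = number of terms in the co-domain of property q.
    // Row r, column q holds the index of the term chosen for property q,
    // so each row encodes exactly one context.
    static int[][] build(int[] termCounts) {
        int numColumns = termCounts.length;
        int numRows = 1;
        for (int count : termCounts) {
            if (count > 0) numRows *= count;
        }
        int[][] m = new int[numRows][numColumns];
        int blockSize = 1; // number of consecutive rows sharing a term index in this column
        for (int col = numColumns - 1; col >= 0; col--) {
            for (int row = 0; row < numRows; row++) {
                m[row][col] = (row / blockSize) % termCounts[col];
            }
            blockSize *= termCounts[col];
        }
        return m;
    }
}

Calling build(new int[]{2, 1, 2}) returns the rows (0,0,0), (0,0,1), (1,0,0), (1,0,1), i.e., exactly the matrix M of the running example.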
Contradictions, Exceptions, and Correlations. We generate three DSA about a health-care scenario, to test the DSA-Analyser on the three kinds of conflicts described in Sect. 2.3. Contradictions. We consider a DSA in which an authorisation and a prohibition are in contradiction with each other. Code 1.10. Example of Contradiction.
.... IF a Data isStoredIn EU AND a Subject hasLocation Area THEN that Subject CAN Modify that Data if isStoredIn(?X_4,?X_5) and hasLocation(?X_6,?X_7) then can [?X_6, Modify, ?X_4]
......
.......
... IF a Data isStoredIn EU AND a Subject hasLocation Area THEN that Subject CAN Modify that Data ... if isStoredIn(?X_23,?X_24) and hasLocation(?X_25,?X_26) then can [?X_25, Modify, ?X_23]
The DSA-Analyser detects the conflict and, as a result, it also points to the set of conditions for which the conflict holds (Fig. 3). Exceptions. We consider a DSA in which one authorisation and one prohibition generate an exception. The prohibition denies the access to data of type radiological. This type is a specific kind of data included in the super-type medical data. The authorisation allows the access to medical data.
Fig. 3. JSON answer - example of contradiction detection.
Code 1.11. Example of Exception.
....
.... IF a Data isRelatedTo a Subject AND that Subject hasRole Doctor THEN that Subject CAN Modify that Data if isRelatedTo(?X_9,?X_10) and hasRole(?X_10,?X_50) then can [?X_10, Modify, ?X_9]
....
.....
.... IF a Data hasTypes Radiological AND isRelatedTo a Subject AND that Subject hasRole Doctor THEN that Subject CAN Modify that Data ... if hasTypes(?X_23,?X_18) and isRelatedTo(?X_23,?X_10) and hasRole(?X_10,?X_50) then can [?X_10, Write, ?X_23]
The DSA-Analyser detects the conflict and it also points to the conditions for which the conflict exists (Fig. 4).
Fig. 4. JSON answer - example of exception detection.
Correlations. We consider a DSA in which an authorisation and a prohibition do not involve the same set of attributes, but there exists at least one access request that correlates the two rules (Code 1.12). Code 1.12. Example of Correlation.
.... IF a Data isStoredIn EU AND a Subject hasLocation Area THEN that Subject CAN Modify that Data if isStoredIn(?X_4,?X_5) and hasLocation(?X_6,?X_7) then can [?X_6, Modify, ?X_4]
.... IF a Data hasTypes Radiological AND isRelatedTo a Subject AND that Subject hasRole Doctor THEN that Subject CAN Modify that Data
if hasTypes(?X_23,?X_18) and isRelatedTo(?X_23,?X_10) and hasRole(?X_10,?X_50) then can [?X_10, Modify, ?X_23]
Also in this case, the DSA-Analyser detects the conflict and provides the access request that generates it (Fig. 5). It is worth noting that the conflict arises when the union of the attributes regulating the authorisation and the prohibition evaluates to true.
Fig. 5. JSON answer - example of correlation detection.
3.2 Performance
Performance is evaluated by varying (i) the number of rules in a DSA and (ii) the dimension of the DSA vocabulary. We test the tool on three real DSA referring to three scenarios: (i) data related to health-care organisations, (ii) business data shared on mobile devices, and (iii) administrative data of public administrations. Each scenario is associated with a vocabulary defined in the Web Ontology Language (OWL), which defines the structure of knowledge for the domain, its terms and its properties. The three vocabularies in our tests have different numbers of terms and properties. We performed a series of experiments on DSA containing different numbers of rules (from 5 to 1000 rules, see Table 1), for a total of 30 DSA (10 DSA per each of the 3 vocabularies). The DSA-Analyser analyses each DSA by evaluating - separately - the authorisation, prohibition, and obligation rules, with respect to all the possible producible contexts. The health-care vocabulary (HV) leads to 191 iterations per set of rules, the public administration vocabulary (PAV) leads to 108 iterations, and the mobile vocabulary (MV) leads to 107 iterations. We remind the reader that the test data are practically relevant, since the rules and vocabularies come from real use cases. Furthermore, the DSA-Analyser execution time resulting from our experiments is independent of the number of conflicts actually occurring over the tested rules (meaning that the analysis could, e.g., reveal no conflicts, or even one conflict per each pair of rules, but the execution time remains the same). Tests were run on a 1.3 GHz Intel Core m7 with 8 GB of RAM and SSD storage. Figures 6 and 7 report the execution time varying the number of rules in the DSA and the vocabulary. Total analysis refers to the whole analysis over the DSA, while single analysis considers the average execution time of the analysis of a DSA evaluated with respect to a single context. Overall, the Maude engine execution
time is stable - and reasonably small - until it processes nearly one hundred rules. In particular, from the graphical representation in Fig. 6, we observe that the execution time starts growing polynomially when the number of rules exceeds 120. This is particularly relevant when considering the health-care scenario: the difference in this scenario is the number of terms in the vocabulary, with respect to the ones in PAV and MV. However, the polynomial growth in the number of rules does not significantly affect those scenarios where the number of terms in the vocabularies is lower (see Table 1, Total Analysis Time, 250 and 500 rules, PAV and MV columns, compared with the HV column). In our experience with DSA, and also according to some previous work, e.g., the one in [13], dozens of rules represent a good estimate of the number of rules in a real DSA. This paves the way for the employment of the analyser in a real-use context.

Table 1. Results per number of rules.

Number of rules | Total analysis time (s) | Single analysis time (ms)
                | HV     PAV    MV        | HV        PAV        MV
   5            | 105    34     85        | 182       213.19     264.59
  10            | 112    63     72        | 193.9     195.79     222.93
  20            | 109    71     84        | 189       220.25     260.78
  40            | 132.4  94     79        | 229.7     290.40     246.21
  80            | 138    77     78        | 239.36    238.63     240.10
 100            | 161    86     77        | 275.35    266.02     238.49
 250            | 207    197    123       | 354.48    330.62     380.77
 500            | 561    296    261       | 901.65    909.09     800.66
 750            | 1126   517    417       | 1644.27   1584.96    1279.37
1000            | 2464   1101   735       | 2980.56   3380.94    2250.23
Notes on Complexity. To estimate the complexity of the DSA-Analyser, we consider the steps described at the beginning of this section. The time-consuming steps are mainly Step 6 and Step 7, while the other steps have a constant cost that does not depend on the number of rules. Step 6 and Step 7 consist of three main functions:
1. the generation of the context matrix (Algorithm 1);
2. the evaluation of the sets of rules by Maude;
3. the pairwise comparison of the Maude results (between each pair of authorisation and prohibition, and of prohibition and obligation).
The generation of the context matrix is described in Sect. 3.1 and Algorithm 1.
Fig. 6. Total analysis time: time (s) versus number of rules, for the Healthcare, PA, and Mobile vocabularies.
Fig. 7. Single analysis time, per set of rules: time (ms) versus number of rules, for the Healthcare, PA, and Mobile vocabularies.
The cost of the first function can be overestimated by considering the combinations of all the vocabulary properties, without repetitions: O(num_terms^num_prop), where num_prop is the number of properties and num_terms is the number of terms in the considered vocabulary. The cost of the second function depends on the Maude engine. To evaluate the rules of a DSA, we exploit the Maude built-in command red: the tool performs as many red calls as there are rules in the DSA. Hence, the computational cost of this function depends on the number of rules, hereafter denoted by n, and on the computational cost of the red function in Maude. The computational cost of the third function is of order O(n^2), being the pairwise comparison of the evaluations of the rules. Thus, the DSA-Analyser complexity is strictly related to the complexity of the Maude engine (second function), plus an additional factor that depends on the third function (O(n^2)), all multiplied by the number of iterations of the first function (O(num_terms^num_prop)).
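As a quick illustrative instantiation of this estimate (the numbers are our own, chosen to match the running example rather than taken from the experiments): with num_prop = 3 properties, each ranging over at most num_terms = 2 terms, the first function yields at most 2^3 = 8 contexts (4 after restricting each property to its actual co-domain); for n = 100 rules, each context then costs 100 red calls plus at most 100^2 = 10,000 pairwise comparisons, so the O(n^2) comparison term dominates as the number of rules grows.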
4 Notes on Rules Prioritisation
The DSA-Analyser outputs either the confirmation that no conflicts arise among the evaluated rules or the complete list of conflicts, each of them associated with the related context, as seen above in Figs. 3, 4, and 5. The JSON messages highlight
potential conflicts that may arise at DSA enforcement time whenever there is an access request that should be authorised according to an authorisation and denied according to a prohibition, as for instance in one of the three examples in Sect. 2.3. The access request is evaluated under the context specified in the figures (i.e., in Fig. 3, the data has category medical and type Ecg, and it is related to a subject with ID SubjectIdentifier, located in a certain area; the data is also stored within the European area, and so on). To fix a conflicting situation, it is possible to re-edit the rules which may lead to conflicts, and to re-run the analysis phase once again. However, it is also possible to leave the rules as they are, and to define ad hoc resolution strategies that act at enforcement time. Indeed, well-known standard policy languages, such as XACML [14], introduce combining algorithms, which solve conflicts by prioritising the application of the rules according to some strategy. Standard and well-known strategies are Deny-Overrides, Permit-Overrides, First-Applicable, and Only-One-Applicable. As an example, if the Deny-Overrides algorithm is chosen to solve the conflicts that could arise among the rules of the same policy, the result of the evaluation of the policy is Deny if the evaluation of at least one of the rules returns Deny. Instead, if Permit-Overrides is chosen, the result of the evaluation of a policy is Permit if the evaluation of at least one of the rules in the policy returns Permit. However, standard combining algorithms are coarse-grained, mainly because they only take into account the result of the rules' evaluation (e.g., Deny-Overrides and Permit-Overrides) or their order (e.g., First-Applicable). We could envisage other aspects for rule prioritisation, such as (i) the issuer of the rules, (ii) the data category, and (iii) the purpose of the data sharing, which is classified according to national and international regulations. Hence, we are able to provide a finer combining algorithm that takes into account not only the evaluation result of the rule itself but also the evaluation of properties of the rules. As a simple example, let the reader consider a medical datum which has to be shared between a hospital and a patient, with the purpose of giving diagnoses. We can imagine that the DSA referring to those data features a potential conflict between an authorisation rule - set by the patient to whom the data refer - and a prohibition rule, set by a policy expert working at the hospital. For instance, the conflict could arise when the patient tries to access the data from outside the hospital in which the data have been produced. The purpose of data sharing being related to giving diagnoses, and those medical data being related to the patient, one possible rule prioritisation strategy could be to apply the rule defined by the patient, thus granting the access to data.
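To illustrate what such a finer-grained strategy could look like, the following Java sketch implements an issuer-aware combining algorithm; it is our own illustrative design, not part of XACML or of the DSA framework, and the Decision and EvaluatedRule types are hypothetical:

import java.util.Comparator;
import java.util.List;
import java.util.Map;

enum Decision { PERMIT, DENY, NOT_APPLICABLE }

final class EvaluatedRule {
    final Decision effect;
    final String issuer; // e.g., "patient", "policy-expert", "legal-expert"
    EvaluatedRule(Decision effect, String issuer) {
        this.effect = effect;
        this.issuer = issuer;
    }
}

final class IssuerPriorityCombiner {
    private final Map<String, Integer> issuerRank; // lower rank = higher priority
    IssuerPriorityCombiner(Map<String, Integer> issuerRank) {
        this.issuerRank = issuerRank;
    }
    // Returns the effect of the highest-priority applicable rule, so a
    // patient-issued Permit can override a hospital-issued Deny.
    Decision combine(List<EvaluatedRule> rules) {
        return rules.stream()
                .filter(r -> r.effect != Decision.NOT_APPLICABLE)
                .min(Comparator.comparingInt(
                        (EvaluatedRule r) -> issuerRank.getOrDefault(r.issuer, Integer.MAX_VALUE)))
                .map(r -> r.effect)
                .orElse(Decision.NOT_APPLICABLE);
    }
}

With a ranking such as {patient: 0, policy-expert: 1}, a Permit issued by the patient outranks a Deny issued by the hospital's policy expert, matching the prioritisation strategy sketched above.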
5 Related Work
For decades, approaches based on policy definition and enforcement have been successfully proposed for solving the issue of private, secure, and controlled data management. Thus, researchers have investigated several aspects, like usage control – the monitoring and enforcement of security and privacy policies, not only at
time of accessing some resource, but also once that resource has been accessed; see the seminal work in [4,15], followed by specific implementations, as in [16] for the enforcement of usage control policies on Android mobile devices. Recently, research carried out within the European projects Consequence and Coco Cloud showed how a secure and private way of data exchange, storage, and management can be appropriately supported by the concept of Data Sharing Agreements. Born as the electronic transposition of traditional legal contracts regulating data sharing, DSA overcome burdens on users and usability issues through end-to-end automation of contract definition and enforcement, even cleverly mixing the legal and technical aspects of data sharing [2,9,10,17]. Adopted by large departments of public administrations and private companies (just to cite two simple examples), infrastructures handling data sharing rules need specific components to check the presence of potential conflicts among a priori wide sets of policies. Work in [12,18,19] presents a preliminary analysis tool for conflict detection among DSA clauses. Differently from the analysis framework described in this paper, that analysis worked by considering one single context at a time, which was manually edited by the user. In [20], the authors propose the analysis of obligation rules expressed in the Event-B language [21], while work in [13] presents a conflict-detection tool based on first-order logic, whose performance is compared to that of [22], where the authors use a coloured Petri nets process for policy analysis. Our performance results are competitive with respect to these. Finally, a popular and general approach for solving conflicts among privacy rules is the one adopted by the eXtensible Access Control Markup Language (XACML) and its associated policy management framework [14]. XACML policies (or policy sets) include a combining algorithm that defines the procedure to combine the individual results obtained by the evaluation of the rules of the policy (or of the policies in the policy set). Work in [23,24] is an example of how standard XACML combining algorithms can be extended, e.g., evaluating - through well-known techniques for multi-criteria decision making [25] - how specific the attributes in a policy are in identifying the subject, the object, and the environment of the policy.
6 Conclusions
The paper focused on the private, secure, and controlled management of data through the support of electronic contracts regulating data sharing, so-called Data Sharing Agreements. The agreements being made of several data sharing rules, possibly edited by more than one actor, this work contributed to highlighting potential conflicts among the rules, by designing and developing an analysis tool that evaluates sets of rules with different effects (access granted/access denied) under all the contextual conditions which may arise from the vocabulary and properties associated with the DSA. The evaluation of the analyser's performance in terms of execution time indicates the capability of the tool to deal with up to
hundreds of rules and up to dozens of terms in the vocabulary. These represent realistic numbers for DSA-based practical applications in the field of, e.g., healthcare, public administration, and business scenarios. Acknowledgements. Partially supported by the FP7 EU project Coco Cloud [grant no. 610853] and the H2020 EU project C3ISP [grant no. 700294].
References
1. Damianou, N., Dulay, N., Lupu, E., Sloman, M.: The Ponder policy specification language. In: Sloman, M., Lupu, E.C., Lobo, J. (eds.) POLICY 2001. LNCS, vol. 1995, pp. 18–38. Springer, Heidelberg (2001). https://doi.org/10.1007/3-540-44569-2_2
2. Casassa Mont, M., Matteucci, I., Petrocchi, M., Sbodio, M.L.: Towards safer information sharing in the cloud. Int. J. Inf. Sec. 14, 319–334 (2015)
3. Ferraiolo, D., Kuhn, R.: Role-based access control. In: NIST-NCSC National Computer Security Conference, pp. 554–563 (1992)
4. Park, J., Sandhu, R.: The UCON-ABC usage control model. ACM Trans. Inf. Syst. Secur. 7, 128–174 (2004)
5. Matteucci, I., Petrocchi, M., Sbodio, M.L.: CNL4DSA: a controlled natural language for data sharing agreements. In: Symposium on Applied Computing, pp. 616–620 (2010)
6. Larsen, K.G., Thomsen, B.: A modal process logic. In: LICS, pp. 203–210 (1988)
7. Clavel, M., Durán, F., Eker, S., Lincoln, P., Martí-Oliet, N., Meseguer, J., Talcott, C. (eds.): All About Maude - A High-Performance Logical Framework. LNCS, vol. 4350. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-71999-1
8. Jin, J., Ahn, G.J., Hu, H., Covington, M.J., Zhang, X.: Patient-centric authorization framework for electronic healthcare services. Comput. Secur. 30, 116–127 (2011)
9. Ruiz, J.F., Petrocchi, M., Matteucci, I., Costantino, G., Gambardella, C., Manea, M., Ozdeniz, A.: A lifecycle for data sharing agreements: how it works out. In: Schiffner, S., Serna, J., Ikonomou, D., Rannenberg, K. (eds.) APF 2016. LNCS, vol. 9857, pp. 3–20. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-44760-5_1
10. Caimi, C., Gambardella, C., Manea, M., Petrocchi, M., Stella, D.: Legal and technical perspectives in data sharing agreements definition. In: Berendt, B., Engel, T., Ikonomou, D., Le Métayer, D., Schiffner, S. (eds.) APF 2015. LNCS, vol. 9484, pp. 178–192. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-31456-3_10
11. Costantino, G., Martinelli, F., Matteucci, I., Petrocchi, M.: Analysis of data sharing agreements. In: Information Systems Security and Privacy, ICISSP 2017, Porto, Portugal, 19–21 February 2017, pp. 167–178 (2017)
12. Matteucci, I., Petrocchi, M., Sbodio, M.L., Wiegand, L.: A design phase for data sharing agreements. In: Garcia-Alfaro, J., Navarro-Arribas, G., Cuppens-Boulahia, N., de Capitani di Vimercati, S. (eds.) DPM/SETOP 2011. LNCS, vol. 7122, pp. 25–41. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-28879-1_3
13. Liang, X., Lv, L., Xia, C., Luo, Y., Li, Y.: A conflict-related rules detection tool for access control policy. In: Su, J., Zhao, B., Sun, Z., Wang, X., Wang, F., Xu, K. (eds.) Frontiers in Internet Technologies. CCIS, vol. 401, pp. 158–169. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-53959-6_15
14. OASIS: eXtensible Access Control Markup Language (XACML) Version 3.0 (2010)
15. Pretschner, A., Hilty, M., Basin, D.: Distributed usage control. Commun. ACM 49, 39–44 (2006)
16. Lazouski, A., Martinelli, F., Mori, P., Saracino, A.: Stateful usage control for Android mobile devices. In: Mauw, S., Jensen, C.D. (eds.) STM 2014. LNCS, vol. 8743, pp. 97–112. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11851-2_7
17. Gambardella, C., Matteucci, I., Petrocchi, M.: Data sharing agreements: how to glue definition, analysis and mapping together. ERCIM News 2016 (2016)
18. Matteucci, I., Mori, P., Petrocchi, M., Wiegand, L.: Controlled data sharing in E-health. In: Socio-Technical Aspects in Security and Trust, pp. 17–23 (2011)
19. Martinelli, F., Matteucci, I., Petrocchi, M., Wiegand, L.: A formal support for collaborative data sharing. In: Quirchmayr, G., Basl, J., You, I., Xu, L., Weippl, E. (eds.) CD-ARES 2012. LNCS, vol. 7465, pp. 547–561. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-32498-7_42
20. Arenas, A.E., Aziz, B., Bicarregui, J., Wilson, M.D.: An Event-B approach to data sharing agreements. In: Méry, D., Merz, S. (eds.) IFM 2010. LNCS, vol. 6396, pp. 28–42. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-16265-7_4
21. Bicarregui, J., Arenas, A., Aziz, B., Massonet, P., Ponsard, C.: Towards modelling obligations in Event-B. In: Börger, E., Butler, M., Bowen, J.P., Boca, P. (eds.) ABZ 2008. LNCS, vol. 5238, pp. 181–194. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-87603-8_15
22. Huang, H., Kirchner, H.: Formal specification and verification of modular security policy based on colored Petri nets. IEEE Trans. Dependable Secur. Comput. 8, 852–865 (2011)
23. Lunardelli, A., Matteucci, I., Mori, P., Petrocchi, M.: A prototype for solving conflicts in XACML-based e-Health policies. In: 26th IEEE Symposium on Computer-Based Medical Systems, pp. 449–452 (2013)
24. Matteucci, I., Mori, P., Petrocchi, M.: Prioritized execution of privacy policies. In: Di Pietro, R., Herranz, J., Damiani, E., State, R. (eds.) DPM/SETOP 2012. LNCS, vol. 7731, pp. 133–145. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-35890-6_10
25. Saaty, T.L.: How to make a decision: the analytic hierarchy process. Eur. J. Oper. Res. 48, 9–26 (1990)
On Using Obligations for Usage Control in Joining of Datasets

Mortaza S. Bargh1(B), Marco Vink1, and Sunil Choenni1,2
1 Research and Documentation Centre, Ministry of Security and Justice, The Hague, The Netherlands
{m.shoae.bargh,m.e.vink,r.choenni}@minvenj.nl, [email protected]
2 Creating 010, Rotterdam University of Applied Sciences, Rotterdam, The Netherlands
Abstract. Legitimately collected and accessed data must also be used appropriately according to laws, guidelines, policies or the (current) preferences of data subjects. For example, inconsistency between the data collection purpose and the data usage purpose may conflict with some privacy principles. In this contribution we motivate adopting the usage control model when joining vertically-separated relational datasets and characterize it as obligations within the Usage Control (UCON) model. Such obligations are defined by the state of the object (i.e., a dataset) in the UCON model with respect to the state of another object/dataset. In case of the join operation, dependency on two UCON objects (i.e., two datasets) results in a new type of UCON obligations. We describe also a number of mechanisms to realize the identified concept in database management systems. To this end, we also provide some example methods for determining whether two given datasets can be joined.

Keywords: Access control · Join operation · Obligations · Privacy · Usage control

1 Introduction
Data are currently created at an explosive rate due to the surge of new (sensory) devices, applications and services. Digitalization and e-administration, e-services, Big Data, Open Data, and the Internet of Things are example phenomena that contribute to this data outpouring and overflow. Consequently, it has become common practice in (business) data analytics and data-intensive applications to integrate data from different sources, of various types, of large volumes, and/or of high rates. These applications and services aim at easing our daily lives, providing insight into societal phenomena, or creating added value for businesses. Delivering these benefits, however, must not violate or compromise, for example, the privacy, commercial, and intellectual rights of individuals and parties who contribute their data to the data integration process. For a long time, access control mechanisms have been used to protect data. An access control mechanism controls the access to the data by granting or
rejecting an access request. Although in this way the input datasets for a data integration process may be acquired or accessed legitimately, it is crucial for the output dataset of the data integration process to be legitimate and acceptable for all parties who provided the input datasets. For example, the privacy and business sensitivity requirements of these parties must be preserved. Today, personal devices produce more and more personal data. Big Data analytics makes it possible to combine these data, resulting in (new) personal data that may expose the private lives of people in considerable detail. Such data combinations may result in unexpected and harmful impacts on individuals. Therefore, access control is insufficient in the current era of data explosion. Given that access to data is obtained legitimately, one needs to control how the data are actually used. Suppose that a tax officer needs to know the name, the annual income, the spouse's name, and the number of children of a person in order to carry out his/her tasks. It is not, however, the business of the tax officer to find out how many spouses or children per spouse a certain person (like a celebrity) has had. The system, therefore, should detect such illegitimate use of attribute values and exclude them from the tax officer's access (possibly when combining various datasets). Therefore, a query like "find all spouses of singer-X and for each spouse the name of the children" is an improper use of the attribute values and should not be executed. Determining the (privacy) policies that govern such data integrations becomes steadily more unforeseeable due to the availability of vast amounts of background information to data receivers and adversaries. For example, one cannot predetermine the datasets that will be encountered and integrated with a given dataset in the future. This makes it difficult to assess beforehand the potential risks of combining the released data with all other possible datasets (i.e., with the background information). This uncertainty relates to the extrinsic characteristics of data, e.g., the (privacy) issues of a given dataset in relation to other datasets. The other datasets exist in the outside world due to, for example, sequential data release, multiple data release, continuous data release, collaborative data release, social networks, Big Data, and Open Data. One may conclude that it is unwise to share data anymore. This policy appears to be too restrictive and unrealistic nowadays. Another solution direction is to devise and realize mechanisms that control compliance with data privacy policies after sharing the data with others, i.e., during the data usage lifecycle. This solution, which can be realized in controllable environments like an organization's Database Management System (DBMS), requires a flexible and adaptive framework that decides on a data integration policy and enforces the decision at runtime. Hereby it becomes possible to deal with the issue of authorized-access and unauthorized-use of datasets [6]. To this end, for example, the Usage Control (UCON) model [24] is one of the promising models. Our research objective is to control the usage of relational datasets in volatile and dynamic settings, i.e., when data analysts gradually and unforeseeably gain access to datasets and want to link/integrate a subset of these datasets. We limit our scope to relational databases and those structured datasets that are vertically separated.
By vertically separated datasets we mean vertically distributed
datasets, as illustrated in [15], which are not necessarily at different locations (i.e., they can be colocated, as in the case of typical data warehouse environments). We consider the usage control for the join operation among these vertically separated datasets. Inspired by the UCON model, we specifically investigate how the join operation can be framed in such a data usage control framework. This investigation results in a new insight into UCON obligation constructs. As our first contribution, we distinguish a new type of obligations where the state of the object (e.g., a dataset) is determined with respect to the existence of another dataset. This type of dependency, to the best of our knowledge, has not been identified so far within the context of the UCON model. As our second contribution, we present two mechanisms to realize the identified obligation in a DBMS. As our third contribution, we present an example realization to illustrate how the proposed mechanism can be implemented, and we analyze the results. Furthermore, we enlist a number of methods for determining whether two given datasets can be joined. Preliminary results of our work on usage control for joining datasets are presented in [3]. In this contribution we extend that work by improving the problem description (Sect. 4), sketching the algorithmic mapping of identifiers (Subsect. 7.2), and adapting the overall paper structure. The paper starts by providing some background information and the motivation behind the work in Sect. 2, the related work in Sect. 3, and the problem statement in Sect. 4. Subsequently, we present the design principles of our proposed approach in Sect. 5, followed by mechanisms for decision making in Sect. 6 and decision enforcement in Sect. 7. Finally, we discuss the issues and limitations in Subsect. 7.3 and draw conclusions in Sect. 8.
2 Background
This section briefly introduces the theoretical background on access and usage control, and outlines some motivations behind usage control.

2.1 Access Control
In information systems, authorized entities should be able to access resources like services, documents, computer system functionalities, or computer processing times. Access control is defined as the ability to permit or deny access to a particular resource by a particular entity [19]. The entity that seeks access to the resource and the resource that is sought by the entity are referred to as subject and object, respectively. The access to an object can be in specific modes called rights, which the subject is allowed to carry out on the object. For digital objects, for example, these access rights include being able to read, write and delete those objects. A particular access right is not predefined and it exists at the time of the authorization [19]. Figure 1 illustrates a traditional access control model, where the reference monitor is a control program that monitors and prohibits actions [13].
Fig. 1. A traditional access control model, copied from [3].
Three common access control models are Discretionary Access Control (DAC), Mandatory Access Control (MAC), and Role Based Access Control (RBAC). In the DAC model there is an Access Control List per object that specifies which subjects have which permissions/rights to the object. The MAC model assigns security labels to objects and subjects, and grants a subject access to an object if their labels match. Both DAC and MAC entail overheads for determining the access rights; therefore, the RBAC model was proposed, introducing roles as a link between subjects and objects. In this model subjects are authorized for roles, and roles are authorized for objects to hold certain permissions or rights. There are other access control methods in the literature that we do not mention for brevity of the presentation (the interested reader is referred to the survey paper [20]). Traditional access control models are mostly suitable for closed organizational environments, where the subjects and objects are well known and where the sensitivity and trustworthiness of the objects and subjects are well defined and rather static.
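As a toy illustration of the RBAC idea (our own sketch; real RBAC systems are considerably richer), permissions attach to roles rather than directly to subjects:

import java.util.Map;
import java.util.Set;

final class RbacCheck {
    // subjectRoles: subject -> authorised roles;
    // rolePermissions: role -> set of "object:right" permissions.
    static boolean isAllowed(String subject, String object, String right,
                             Map<String, Set<String>> subjectRoles,
                             Map<String, Set<String>> rolePermissions) {
        for (String role : subjectRoles.getOrDefault(subject, Set.of())) {
            if (rolePermissions.getOrDefault(role, Set.of())
                               .contains(object + ":" + right)) {
                return true; // a role held by the subject grants the right
            }
        }
        return false;
    }
}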
2.2 Motivations for Usage Control
Organizations and businesses collect various types of data via different channels for their business and administrative purposes. For example, in the context of business operations and public administration, data may be collected for registration purposes, as is done in hospitals, municipalities or judicial offices; for improving service operations, as is done by websites to improve the web-browsing experience of users; or for scientific research, as is done in academia to, for instance, study the impacts of crime victimization. Furthermore, these organizations and businesses gain access to external data sources via crowdsourcing campaigns, through open data initiatives, or due to mergers of services, businesses and organizations. For example, Google has merged various services like Gmail, Google+, and Google Drive; and Facebook has acquired Instagram and WhatsApp. Such strategic mergers require integration of information systems, with datasets that are generally collected for different purposes and within various contexts. There are also Open Data initiatives to release public sector data
to citizens as a means of, among others, government transparency, innovation stimulation, economic growth, and public participation in government [9,10]. Crowdsourcing is another means of collecting relevant data in an affordable way, where the resulting datasets often contain some personal data from participants such as profile data (including their names, email addresses and phone numbers), activity data (indicating their sporting, sleeping, and eating habits), and situational data (revealing their visited locations, adjacency to other users/objects, and conversation buddies). Nowadays there are compelling incentives for these businesses and organizations to analyze all the collected data with data analytics tools and to apply the findings of these tools in practice, for example to improve their services and processes. Consequently, the collected data may be used in another context – like for another purpose, in another jurisdiction or region, and in another domain (e.g., public, healthcare, judicial, insurance and trade) – than the one they were collected for. For example, crowdsourcing datasets may contain personal data and as such they must be accessible only to authorized entities (like system administrators and specific services/systems) and be used in an authorized way (like for the specified purpose). It is foreseeable that some authorized users with ill intentions (e.g., the insider intruders or employees with questionable ethics mentioned in [1]) misuse personal information that they have access to for illegitimate purposes like personal satisfaction, financial gain, and political benefit. Revealing personal information makes data subjects (i.e., those individuals and organizations that the data are about) vulnerable to cyber attacks such as identity theft, phishing and spam, and to privacy breaches. Therefore, the crowd may become fearful and unwilling to participate in the data collection process due to being subjected to such threats and becoming victims of such attacks. Even when users voluntarily participate in crowdsourcing, they sometimes desire their personal information not to be processed when, for instance, they are in certain situations, like during evenings, in the weekends, and during holidays. Even highly sensitive data attributes may be disclosed or inferred by means of easily accessible data and data linkage. Kosinski et al. [17] show that easily accessible digital records of behavior, e.g., Facebook Likes, can be used to automatically and accurately predict a range of highly sensitive personal attributes (such as sexual orientation, ethnicity, religious and political views, personality traits, intelligence, happiness, use of addictive substances, parental separation, age, and gender). de Montjoye et al. [21] analyzed a dataset of fifteen months of human mobility data for 1.5 million individuals. According to [21], human mobility traces are highly unique. For example, when the location of an individual is specified hourly at the precision level of mobile network cells, it is possible to uniquely identify 95 percent of the individuals based on four spatiotemporal points. They also found that even rather highly aggregated datasets provide little anonymity. Consequently, even when such datasets are collected and accessed legitimately, they should still be used appropriately according to policies, guidelines, rules, laws, or the (current) preferences of data subjects. Any inconsistency
between the data collection/access context and the data usage context may result in conflicts with, for instance, many privacy principles like the transparency principle, the no-secondary-use principle, or the intended-purpose-usage principle [2,11].

2.3 Usage Control
In modern applications, where for example social networks, Big Data, and cross-organization information (sharing) systems should be dealt with, one needs to cope with rather dynamic environments to authorize (previously unknown) entities who want to access and use objects with dynamic sensitivity, within varying contextual situations, across organizational boundaries, and with a commitment to unprecedented conditions. To cope with the shortcomings of the traditional access control models, usage control models (like the UCON model [24,25,30]) have been introduced. The UCON model extends the traditional models to control the access decisions also during the object's usage interval (so-called access decision continuity) and to allow adapting the access criteria before, during and after the object's usage interval (so-called attribute mutability). The continuity of decisions and mutability of attributes in UCON allow adapting to changes of subject, object and environmental attributes before, during or after the data usage period. For example, the number of subjects that may concurrently access the object can change depending on the consumption intensity. As illustrated in Fig. 2, the reference monitor in the UCON model uses three types of decision-making factors. The first type is called authorizations. These authorizations include those predicates that put constraints on subject and object attributes. The attributes of the subject (e.g., the name, age, role, and nationality) in UCON are similar to the Capability List in DAC and the Clearance Label in MAC. The attributes of the object (e.g., document type, content sensitivity, and data ownership) in UCON are similar to the Access Control List in DAC and the Classification Label in MAC [24]. The other two types of decision-making factors in the UCON model are conditions and obligations, which are not uniquely defined in the literature [8]. For conditions, the authors in [24] consider the environmental or system-oriented constraints that should hold before or during the object's usage interval. Example conditions are those related to the time of day, room temperature, and disastrous situations. As such, conditions do not depend directly on the subject or the object (i.e., the data). We shall elaborate upon obligations in Subsect. 3.1 (to compare our approach with those that have already adopted UCON obligations within the context of relational databases) and Sect. 5 (to lay down the theoretical basis for our adoption of UCON obligations). The reference monitor in UCON controls the access to and usage of the object (e.g., data items) by the subject. Similarly to [13], we regard the UCON reference monitor rather liberally, as describing control programs that can not only monitor and prohibit actions, but can also trigger corrective actions such as the application of penalties.
Fig. 2. An illustration of the UCON model, copied from [3] (originally adapted from [24,30]).
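The notions of decision continuity and attribute mutability can be made concrete with a small sketch (our own illustration of the UCON idea, not a normative implementation): mutable attributes are re-read while the usage lasts, and the session is revoked as soon as the policy predicate fails.

import java.util.function.BooleanSupplier;
import java.util.function.IntSupplier;

final class UconReferenceMonitor {
    // Ongoing control: while the usage session lasts, mutable attributes are
    // re-read and the session is revoked once the policy predicate fails.
    static void supervise(BooleanSupplier usageActive,
                          IntSupplier concurrentUsers,
                          int maxConcurrent,
                          Runnable revoke) throws InterruptedException {
        while (usageActive.getAsBoolean()) {
            if (concurrentUsers.getAsInt() > maxConcurrent) {
                revoke.run();      // decision continuity: revoke during usage
                return;
            }
            Thread.sleep(1_000);   // attribute mutability: poll fresh values
        }
    }
}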
3 Related Work
This section provides a review of some related works on obligations and on controlling the join operation in relational databases.

3.1 Obligations
Obligations are considered an important means of realizing privacy- and security-aware systems; nevertheless, as mentioned before, there is still no consensus on the precise meaning of the term obligation [8]. One misalignment in the literature relates to the concepts of pre-obligations, on-going obligations and post-obligations. In the UCON-ABC model of [24], pre-obligations and on-going obligations are recognized. The concept of post-obligation was added to the UCON model in [16]. Note that, outside the context of the UCON model, others (e.g. [4,12,13]) had already considered obligations as requirements that must be fulfilled after data access has occurred. Ni et al. [23] introduce pre- and post-obligations to Role Based Access Control (P-RBAC). In [4] pre-obligations are characterized as provisions. In [13] obligations are further classified along two dimensions: being observational or not, and being temporally-bounded or temporally-unbounded. The observational aspect characterizes whether the reference monitor can observe the fulfillment of the obligation or not. The temporal bound-ability characterizes whether obligations should be fulfilled within a certain time period or not (i.e., should be checked forever). These criteria define four obligation types:
– Bounded future and observable (e.g., pay a fee within a fixed number of days, data item may not be accessed for n days, the reference monitor must notify the data owner about the access within n days),
– Bounded future and non-observable (e.g., data item must be deleted within n days, data item must not be redistributed in the next n days),
– Unbounded future and observable (e.g., re-access the data at least every n days to maintain freshness of data as demanded by some data protection regulations), and
– Unbounded future and non-observable (e.g., data should be used only for statistical analysis, data should not be distributed further, each usage of the data must be reported immediately, or data must be protected with protection level L until declassified by the owner).

In our case we shall consider joining two vertically separated datasets A and B, accessible to projects A and B, respectively. The obligation for the join operation is temporally unbounded, i.e., it holds for as long as there is a possibility of joining any pair of datasets A and B. The obligation for the join operation is also unobservable (i.e., in project A one cannot foresee that project B is going to link dataset A with its dataset B, and vice versa). By introducing the reference monitor we ensure that the join operation is observable to the central reference monitor and, eventually, that those non-observable data protection requirements are adhered to. This strategy, which is also mentioned in [13], enforces an unobservable obligation by transforming it into a set of provisions and observable obligations that prevent unwanted executions. One can think not only of this strict sense of enforcement (i.e., the prevention of unwanted executions of a system through system monitoring and denying actions that would violate the policy) but also of additional corrective or compensative actions (e.g., penalties) in case the execution violates the policy [13]. Unlike in our case, obligations in [13] are those conditions that must be imposed in the future (i.e., after an access is authorized), and [13] uses provisions instead of obligations to refer to those conditions that must be imposed by/at the time of an access being authorized. In our case, furthermore, we shall show that it is possible for obligations to be of types pre-obligation/on-going-obligation and post-obligation at the same time.
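The two classification dimensions can be captured in a few lines of Java (an illustrative model of the taxonomy of [13], not code from the cited work); the join obligation of our setting then instantiates the last of the four types:

enum Observability { OBSERVABLE, NON_OBSERVABLE }
enum TemporalBound { BOUNDED, UNBOUNDED }

final class Obligation {
    final String description;
    final Observability observability;
    final TemporalBound bound;
    Obligation(String description, Observability observability, TemporalBound bound) {
        this.description = description;
        this.observability = observability;
        this.bound = bound;
    }
}

final class Examples {
    // The join obligation discussed above: unbounded (it holds as long as a
    // joinable pair of datasets may appear) and non-observable to each project.
    static final Obligation JOIN_OBLIGATION = new Obligation(
            "datasets A and B may only be joined via the reference monitor",
            Observability.NON_OBSERVABLE, TemporalBound.UNBOUNDED);
}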
3.2 Relational Databases
As the work presented in this contribution relates to both usage control for relational databases and privacy protection for the join operation in relational DBMS, we review some related works on these topics in the following. In [8] the authors consider enforcing obligations, which are derived from privacy policies, on relational database management operations. While Colombo and Ferrari [8] consider SQL operations in general, we focus on the join operation in particular and zoom in on its peculiarities from the viewpoints of the parties (i.e., projects) involved in the operation. Similarly to our work, [8] considers obligations as constraints "on the [expected] state of the data [(i.e., the object)] stored in the
database at the time in which the execution of an action (i.e., SQL code) is invoked (like the account balance after withdrawing must be positive)". We go one step further and also take into account the state of each of the two datasets of the join operation with respect to the other dataset. Secure Multi-Party Computing (SMPC) methods aim at computing a function F on vertically or horizontally distributed datasets for data mining or data processing purposes, without requiring the raw datasets to be shared with a central entity (a Trusted Third Party, TTP) or with the peers. In this way every party learns only the result of function F and its own dataset. SMPC methods are applied in combination with the SQL join operation in multi-party settings in [18] for horizontally distributed datasets. As mentioned above, the objective of SMPC is to compute a specific function F on the joined dataset in a privacy-preserving way (i.e., without sharing the raw datasets with a TTP or the peers). For example, the function F in [18] delivers the number of rows in the join table (for which the join predicate holds). In our setting, however, the aim is to authorize the join operation or not, regardless of which function the data analyst intends to apply to the resulting dataset in the future. As such, our approach acts as a dynamic, ongoing and adaptive access control (thus a usage control) mechanism rather than a privacy-preserving data mining or data processing mechanism.
4 Problem Statement
In this contribution we shall focus on dealing with the issue of authorized-access and unauthorized-use of vertically separated datasets, as described below.
4.1 Problem Context
For scientific studies, our research center maintains a data warehouse that contains various datasets from several organizations involved in the Dutch justice system. These organizations include the Police, the Public Prosecution Office, the courts, the Central Fine Collection Agency, the agency of correctional institutions (i.e., prisons) and the Probation Service. In some projects the data of more than one organization can be used to measure the performance of the chain of Dutch justice organizations by combining the necessary datasets from these organizations. For combining these datasets, a case number is used to uniquely identify a judicial case across all these organizations. Although our data analysts have access to all datasets, they may not combine all datasets, due to privacy and other reasons (for example, as the number and the contents of datasets grow over time, one may combine old and new data only under certain conditions).
4.2 Usage Scenarios
In the following we describe two scenarios where usage control can be meaningful or perhaps necessary to warrant privacy protection, data analytics validity or other quality preservation during a join operation. Inspired by [1], we particularly focus on the (privacy) policy violation issues that may arise when linking datasets in relational databases during their usage time.
Fig. 3. An illustration of the join of dataset B with dataset A within project A (i.e., the result of the join operation is used for project A).
Fig. 4. An illustration of the join of dataset A with dataset B within project B (i.e., the result of the join operation is used for project B).
Sequential Release Data Publishing. Assume a data analyst X, who works for project A, obtains access to dataset A at time t_A; see the process illustration in Figs. 3 and 4. At a later time t_B > t_A the data analyst X, who now also works for project B, gets access to dataset B. Therefore, we shall alternatively denote datasets A and B by A(t_A) and B(t_B), respectively, whenever appropriate. In this scenario we assume that projects A and B are executed in two (logically, legally, operationally) separated environments. Nevertheless, there might be a need and urge to join dataset B with dataset A within project A (as shown in Fig. 3) or to join dataset A with dataset B within project B (as shown in Fig. 4). In the former, dataset B acts as the background knowledge for dataset A, and in the latter, dataset A acts as the background knowledge for dataset B. Note that background knowledge is a concept used in the privacy protection literature on statistical data publishing. The join operation can take place at any time t ≥ t_B, as illustrated in Figs. 3 and 4. This scenario can be characterized as sequential release data publishing, see [11], where some attributes of (all) data records are sequentially published to projects A and B according to the needs of projects A and B. Each release is a projection of the original data (in vertically separated form) and serves a corresponding purpose.
Continuous Release Data Publishing. In longitudinal studies, datasets A and B can be delivered to the same project at different moments t_A and t_B > t_A. Although the purpose of data use remains unchanged for both datasets, other changes might occur in the course of time that should be considered when joining datasets A and B. For example, assume that datasets A and B include the demographic data of a city's population, and assume that a number of surrounding districts are added to the city's jurisdiction between t_A and t_B; therefore, dataset B encompasses the population data of a wider region than that of dataset A. Consequently, it is desirable that the usage control system support data analysts with protection measures against any naive join of such datasets. Figure 5 illustrates this scenario, which can be characterized as continuous release data publishing, see [11], according to which the data are released continuously as new information becomes available. This is a typical case of vertically separated datasets, where the new dataset includes the same data records, but with some new attribute values (related to the events that occurred in the new time interval).
Fig. 5. An illustration of the join of datasets B and A for a longitudinal study project.
4.3 Vertically Separated Datasets
Datasets A and B contain data records or tuples represented by a = (a_1, a_2, ..., a_{M_A}) and b = (b_1, b_2, ..., b_{M_B}), respectively. Every tuple a in dataset A is defined over single-valued attributes from the attribute set ATT^A = {att^A_1, att^A_2, ..., att^A_{M_A}}. In other words, dataset A is a subset of the Cartesian product dom(att^A_1) × dom(att^A_2) × ... × dom(att^A_{M_A}), in which dom(att^A_m) is the set of the values that can be assumed by attribute att^A_m, where m = 1, ..., M_A. Tuple a is an ordered list of attribute values to which a unique identifier is attached. Without loss of generality, we assume that this identifier corresponds to the first attribute att^A_1, drawn from dom(att^A_1). Similarly, one can formalise tuples b in dataset B, with M_B attributes and N_B records. We also assume that the identifier of tuple b is represented by the first attribute att^B_1, drawn from dom(att^B_1). Note that the identifier attributes att^A_1 and att^B_1 in datasets A and B are drawn from the same population, i.e., dom(att^B_1) = dom(att^A_1).
We assume that datasets A and B are partially vertically separated, i.e., ATT^A ≠ ATT^B and ATT^A ∩ ATT^B ≠ ∅. The first condition separates datasets A and B vertically, ensuring that the attributes that appear in the two datasets are not exactly the same (i.e., there is at least one attribute difference). Thus the attributes in datasets A and B do not provide the very same information. The second condition is necessary to link (or join) datasets A and B on those common attributes. Without loss of generality, we assume that (at least) the first attributes of datasets A and B are the same, i.e., att^A_1 and att^B_1 are drawn from the same domain, in other words dom(att^B_1) = dom(att^A_1), as mentioned above. Thus attributes att^A_1 and att^B_1 act as the primary key in the join process. Note that, in terms of data records a and b, vertical separation does not necessarily mean that the records in datasets A and B refer to exactly the same group of individuals. In other words, we assume vertical separation to be the case of having some records a ∈ A and b ∈ B for which a_1 = b_1, where a_1 ∈ dom(att^A_1) and b_1 ∈ dom(att^B_1), i.e., records a and b refer to the same entity. The concept of vertically separated datasets is illustrated in Fig. 6.
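As a small illustration of these two conditions, the following Python sketch (ours, with illustrative attribute names; not part of the original system) checks whether two attribute sets qualify as partially vertically separated:

def is_partially_vertically_separated(att_a: set, att_b: set) -> bool:
    """Check the two conditions on the attribute sets ATT^A and ATT^B:
    (1) the sets are not identical (vertical separation), and
    (2) they share at least one attribute to link on."""
    differs = att_a != att_b        # condition 1: ATT^A ≠ ATT^B
    overlaps = bool(att_a & att_b)  # condition 2: ATT^A ∩ ATT^B ≠ ∅
    return differs and overlaps

# Example: both datasets carry the identifier attribute 'case_id'.
ATT_A = {"case_id", "offence", "verdict"}
ATT_B = {"case_id", "fine", "payment_date"}
assert is_partially_vertically_separated(ATT_A, ATT_B)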
4.4 Join Operation
In this section we specify the join operation in the context of the first scenario (i.e., sequential release data publishing) because it is more generic than the second one. From time t_B onwards, data analyst X has access to both datasets A and B and can join them for two different purposes: for project A execution and for project B execution. It is foreseeable that combining/joining datasets A and B may not be allowed from the viewpoint of project A, of project B, or of both. This lack of permission for joining datasets A and B can be due to, for example, privacy or information sensitivity reasons (to be elaborated upon in the following sections). Inspired by [8], we define a Join Operation (JO) as a tuple:

JO ≝ ⟨A(t_A), B(t_B), P_A(t), P_B(t), JC, t⟩,    (1)

where the parameters are as follows:
– A(t_A) and B(t_B) are datasets A and B, obtained by data analyst X at times t_A and t_B, respectively,
– P_A(t) and P_B(t) are the (privacy) policies of projects A and B, associated with datasets A and B at time t,
– JC represents the condition (or predicate) for the join operation, and
– t, with t ≥ t_B > t_A, represents the time of executing the join operation.
Fig. 6. An illustration of the vertically separated datasets with respect to the join operation.
Note that policies P_A(t) and P_B(t) are generally obtained at moments t_A and t_B, respectively. But they can (automatically) be adapted during the lifecycle of the corresponding datasets. Therefore, we consider them functions of t. In order to allow the join operation to be carried out at t, two requirements should be satisfied, namely:
– The resulting dataset should not violate the (privacy) policy of project A. This is denoted by requirement

R_A(B(t_B), t | A(t_A), P_A(t)),    (2)

where the notation should be read as: the requirement for project A in regard to dataset B(t_B) being considered for the join operation at time t, given project A's own dataset A(t_A) acquired at time t_A and its own policy P_A(t) as of time t.
– The resulting dataset should not violate the (privacy) policy of project B. Similarly to requirement (2), this is denoted by requirement

R_B(A(t_A), t | B(t_B), P_B(t)).    (3)
Research Questions. In this contribution we shall address the following research questions:
– How can the UCON model be characterized for the join operation of relational datasets?
– How can we determine when the join is (dis)allowed?
– How is it possible to realize the resulting restricted join functionality?
In the following, we adapt the UCON model to the problem at hand. First, in Sect. 5, we frame the data integration scenario as UCON obligations. Subsequently, in Sect. 6, we describe the decision-making component of the reference monitor that determines whether datasets A and B are joinable at a given time. In Sect. 7 we describe the decision-enforcement component of the proposed UCON model for the join operation, i.e., the reference monitor, which is partly realized as the proof of concept of this work.
5 Obligations as the Design Framework
Obligations are currently an active area of research. In particular, the enforcement of those obligations that are concerned with fulfilling some tasks and actions during or after the usage of the object (i.e., data) is an open research issue [19]. Obligations mandate those actions that someone should execute before, during or after an object's usage interval [19]. For example, the credit card owner must be informed within 30 days after the credit card is used, a license agreement must be signed before data usage, an ad must be watched for 20 s, or the document must be downloaded just one time. When the actions are executed appropriately, the subject can access or can continue to use the object. Note that the entity that fulfills the obligation, i.e., carries out the action(s), might be the subject or someone else, depending on the usage scenario. Similarly, the entity on which an obligation activity is carried out might be the object or something else. In [8] the authors consider the enforcement of obligations, which are derived from privacy policies, on relational database management operations. They regard obligations as “the constraints that refer to the (expected) state of the data [object] stored in the database” at the time at which the object is accessed or used (i.e., when SQL code is invoked). For example, the bank account balance must be positive after withdrawing. In summary, one can regard obligations as constraints (a) on the state of the object (i.e., the data in the database) or (b) on specific actions being executed by someone. Fulfillment of both constraint types can be required before, during or after an object's usage interval. Our usage control on the join operation in this contribution can be categorized as an obligation because the authorization of the right (i.e., the join of datasets A and B) is constrained by the state of the objects (e.g., datasets A and B for projects A and B, respectively). Our first contribution hereto is that we distinguish a new type of obligation where the state of an object (i.e., dataset A or dataset B) is determined with respect to another object. This type of dependency, to the best of our knowledge, has not been identified so far in the UCON literature. In distributed usage control, where information is disseminated in open networks, post-usage obligations are widely applicable [19]. We observe that this is also the case in our centralized usage control, when an operation on a data object (like dataset A in our scenario) is dependent on another (upcoming) data object (like dataset B in our scenario). More specifically, from requirements R_A(B(t_B), t | A(t_A), P_A(t)) for project A and R_B(A(t_A), t | B(t_B), P_B(t))
for project B, where t ≥ t_B > t_A, one can define the data integration obligations for the data analyst X as the UCON subject. These obligations can be of type:
– post-obligation for project A (because t ≥ t_B > t_A),
– post-obligation for project B when t > t_B, and
– pre-obligation/on-going-obligation¹ for project B when t = t_B.
So when t = t_B the constraint on datasets A and B can be of type pre-obligation (for project B) and post-obligation (for project A) simultaneously, see Fig. 7. This duality is another new insight, to the best of our knowledge, provided in this contribution.
Fig. 7. An illustration of pre-obligation and post-obligation aspects of the join operation at t = tB .
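The timing rules above can be condensed into a few lines of code. The sketch below is our own summary of this classification (assuming, as in the scenario, that t_A < t_B ≤ t):

def obligation_types(t, t_a, t_b):
    """Classify the data-integration obligation for projects A and B,
    given the join time t and the dataset acquisition times t_a < t_b <= t."""
    assert t_a < t_b <= t
    return {
        "project A": "post-obligation",  # always, since t >= t_B > t_A
        "project B": "post-obligation" if t > t_b else "pre-obligation",
    }

print(obligation_types(t=10, t_a=1, t_b=10))  # duality at t = t_B
print(obligation_types(t=12, t_a=1, t_b=10))  # both post-obligations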
6 Decision Making
The reference monitor in the UCON architecture is responsible for deciding whether two datasets A and B can be joined or not, based on the model specified by Relation (1). The join operation basically depends on (a) the basic join predicates and (b) the obligation requirements R_A(·|·) and R_B(·|·). We defined the join operation JO in Relation (1) for datasets A(t_A) = {a_i | i = 1, ..., N_A} and B(t_B) = {b_j | j = 1, ..., N_B}. The join operation JO can alternatively be specified as those record members of the Cartesian product of datasets A and B, i.e., {a_i | i = 1, ..., N_A} × {b_j | j = 1, ..., N_B}, for which the predicate JC holds for those tuples with a_{i,1} = b_{j,1}, and both obligation requirements R_A and R_B hold. In other words, alternatively to Relation (1), we can define

JO ≝ {a_i × b_j | i = 1, ..., N_A; j = 1, ..., N_B; F_{i,j} = True},    (4)

¹ From this point on we shall use the term pre-obligation to refer to this situation, as the term on-going obligation is not very meaningful for a join operation.
where F_{i,j} is a Boolean function defined as the conjunction of the following Boolean operands:
– Basic operand O_1: JC(a_i, b_j) = True ∧ (a_{i,1} = b_{j,1} holds),
– Obligation operand O_2: R_A(B(t_B), t | A(t_A), P_A(t)) holds,
– Obligation operand O_3: R_B(A(t_A), t | B(t_B), P_B(t)) holds.
The basic operand O_1 is common for a join operation; nevertheless, there is a twist in our case, as we are going to advocate using pseudo identifiers instead of personal identifiers. Therefore, we shall elaborate on that twist in the following subsection. The obligation operands O_2 and O_3 depend on the momentary policies of projects A and B (i.e., P_A(t) and P_B(t)) as well as on the datasets A(t_A) and B(t_B). The dependency on the datasets often comes down to the attributes of these datasets, as we will elaborate upon in Subsect. 6.2. A schematic rendering of this decision logic is given below.
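The sketch below renders this decision logic in Python (our illustration; the callables holds_RA and holds_RB stand in for the mechanisms of Subsect. 6.2):

def F(a_i, b_j, JC, holds_RA, holds_RB) -> bool:
    """Boolean function F_{i,j} of Relation (4): the record pair (a_i, b_j)
    enters the join result only if all three operands hold."""
    o1 = JC(a_i, b_j) and a_i[0] == b_j[0]  # basic operand: join predicate plus matching identifiers
    o2 = holds_RA()  # obligation operand O2: R_A(B(t_B), t | A(t_A), P_A(t))
    o3 = holds_RB()  # obligation operand O3: R_B(A(t_A), t | B(t_B), P_B(t))
    return o1 and o2 and o3

def restricted_join(A, B, JC, holds_RA, holds_RB):
    """The restricted join JO over the Cartesian product of datasets A and B."""
    return [a + b for a in A for b in B if F(a, b, JC, holds_RA, holds_RB)]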
6.1 Basic Operand
In practice, attributes att^A_1 and att^B_1 in datasets A and B often correspond to personal identifiers, which should be anonymized for privacy protection reasons. For example, let us assume that a_{i,1} and b_{j,1}, for all possible values of i and j, are pseudonymized in datasets A and B by appropriate hash functions H_A(·) and H_B(·), thus resulting in H_A(a_{i,1}) and H_B(b_{j,1}), respectively. In this case, operand O_1 can be written as:

O_1: (JC(a_i, b_j) = True) ∧ (H_A(a_{i,1}) =^H H_B(b_{j,1}) holds),    (5)

where H_A(a_{i,1}) =^H H_B(b_{j,1}) holds if a_{i,1} = b_{j,1}. Note that
– the operator =^H represents equivalency rather than equality, i.e., a one-to-one mapping between H_A(a_{i,1}) and H_B(b_{j,1}) is possible in a practical sense (see the next note), and it does not necessarily imply equality; and
– the probability that a_{i,1} ≠ b_{j,1} while H_A(a_{i,1}) = H_B(b_{j,1}) is (extremely) negligible in practice. This is a sound assumption for collision-resistant hash functions.
In Sect. 7 we shall present two mechanisms to realize the operator =^H in joining relational datasets.
6.2 Obligation Operands
The reference monitor should decide whether two datasets A and B can be joined or not based on requirements R_A and R_B that, in turn, depend on projects A and B's momentary policies P_A(t) and P_B(t), as well as on their datasets A(t_A) and B(t_B), respectively. Joining two datasets may extend the attribute sets ATT^A and ATT^B to the set ATT^{A∪B} = ATT^A ∪ ATT^B = {att^{A∪B}_1, ..., att^{A∪B}_{M_{A∪B}}}. In the following we shall sketch a number of mechanisms to decide upon joining two datasets A and B within the framework of UCON obligations.
Based on Expert Opinion. For deciding on the join of two datasets, one could check whether the resulting combination of attributes is allowed or not. A domain expert can control this based on existing laws, regulations, and policies. For example, Sweeney [27] showed, based on 1990 U.S. census data, that one could uniquely identify 87% of the U.S. population by knowing the three attributes gender, date of birth, and ZIP code. Learning from this, the domain expert could prevent the join of two datasets A and B if the set ATT^{A∪B} includes these three attributes.
Based on Data Collection Purpose. Alternatively, similarly to [5] and based on the purposes for which datasets A and B are collected, the reference monitor can check whether the privacy policies of project A and project B allow their data objects to be part of the resulting table or not. This can be done by checking for any inconsistency between the data collection purpose and the usage purpose. For example, if datasets A and B are collected for commercial and system administration purposes, respectively, then the join should not proceed if it appears that those commercial and administrative purposes are inconsistent.
Based on Amount of Information Leakage. Another way to decide on allowing the join operation is to control whether there would be undesired information leakage due to the join operation or not. To explain this, let us assume that every attribute in the attribute sets ATT^A and ATT^B (and thus in the resulting attribute set ATT^{A∪B}) can be represented by a random variable. For example, random variable ATT^A_m corresponds to attribute att^A_m ∈ ATT^A. (NB: In the rest of this section we reuse the notations of sets ATT^A, ATT^B, and ATT^{A∪B} and assume they represent the sets of attributes as well as the sets of the corresponding random variables.) Further, let random variable set S ⊂ ATT^{A∪B} be the set of those random variables after the join operation that are (privacy) sensitive. In our setting, S includes at least one member, i.e., we have ATT^A_1 ∈ S, which is the same as ATT^B_1 ∈ S as we assumed above. Let random variable set X = ATT^{A∪B} \ S be the set of those (privacy) non-sensitive random variables after the join operation. Thus, random variable sets S and X represent those attributes that cannot and can be, respectively, revealed to the data analyst in our scenario according to requirements R_A and R_B. In order to determine the information leakage in the dataset resulting from the join operation JO, one may use the mutual information function [26,29]. The amount of information leaked about the random variables in S due to the random variables in X can be determined by the mutual information function defined as
I(S; X) = ∑_s ∑_x p(S = s, X = x) log( p(S = s, X = x) / (p(S = s) p(X = x)) ),    (6)
where p(S = s, X = x) is the joint probability distribution of multivariate random variables S and X with marginal distributions p(S = s) and p(X = x).
For example, knowing whether a patient smokes or not may leak much information about whether (s)he has a lung disease or not. Therefore, the mutual information between the random variables smoking and having lung disease becomes large. In practice, the mentioned probability distributions are estimated by the corresponding empirical probability distributions that, in turn, are obtained from the available datasets. The amount of mutual information in Relation (6) should ideally be zero to have no information leakage. If this value reaches an unacceptably high level due to the join operation, then the join could be disallowed. One can also consider the information leakage for any subset S' ⊂ S and examine whether I(S'; X) reaches an unacceptably high level due to the join operation. The thresholds for unacceptable mutual information values can be determined and extracted from policies P_A(t) and P_B(t) at or up to runtime t.
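For concreteness, the sketch below estimates I(S; X) from the empirical joint distribution of a (toy) joined table and applies a policy threshold; the attribute values and the threshold are our assumptions for illustration only:

import math
from collections import Counter

def mutual_information(pairs):
    """Estimate I(S; X) of Relation (6) from observed (s, x) pairs,
    using the empirical joint and marginal distributions."""
    n = len(pairs)
    joint = Counter(pairs)
    p_s = Counter(s for s, _ in pairs)
    p_x = Counter(x for _, x in pairs)
    return sum(
        (c / n) * math.log((c / n) / ((p_s[s] / n) * (p_x[x] / n)))
        for (s, x), c in joint.items()
    )

# Toy joined table: S = lung disease status (sensitive), X = smoking status.
rows = ([("ill", "smoker")] * 40 + [("healthy", "non-smoker")] * 45
        + [("ill", "non-smoker")] * 5 + [("healthy", "smoker")] * 10)
THRESHOLD = 0.1  # assumed policy-derived bound on acceptable leakage
if mutual_information(rows) > THRESHOLD:
    print("join disallowed: information leakage is unacceptably high")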
7 Decision Enforcement
In this section we first present (Subsect. 7.1) a basic mechanism for realizing the reference monitor component of the proposed UCON model. This basic mechanism, which is implemented as the validation part of this contribution, includes a lookup table to map pseudo identifiers. Subsequently, we briefly propose a second mechanism that uses polymorphic encryption to map these pseudo identifiers algorithmically (Subsect. 7.2). Finally, we discuss the characteristics and limitations of these mechanisms (Subsect. 7.3).
7.1 Basic Mechanism
To describe the basic mechanism we use an illustrative example, which is also realized as the proof of concept for this contribution. The example comprises three tables A, B, and C, each containing two attributes: an identification number ID and a data attribute ATTR, as shown in Tables 1, 2 and 3. Data analyst X has access to all three tables separately. We assume that analyst X is authorized to join tables A and B, but (s)he is not authorized to join tables A and C. In practice, one can apply the described mechanism to large tables (like those that we have in our data warehouse).

Table 1. Table A [3].
ID | ATTR
1  | Source value A1
2  | Source value A2
3  | Source value A3

Table 2. Table B [3].
ID | ATTR
1  | Source value B1
3  | Source value B3

Table 3. Table C [3].
ID | ATTR
2  | Source value C2
3  | Source value C3
Using the unique identifiers, i.e., attribute ID, as the primary key, the data analyst can now easily combine the data from tables A and C:

> Select * from a join c on c.id = a.id
As the first step of our implementation, we replace the original unique identifiers, which are unique per criminal case, by a new set of global identifiers. A global identifier means that the identifier is unique among all tables in our data warehouse. Consequently those tuples from different tables, which correspond to the same case/entity, will no longer have the same identifiers in the new dataset. The resulting tables in our data warehouse are shown in Tables 4, 5 and 6, where the records of these tables that refer to the same entity are positioned on the same row.

Table 4. Table DWHA [3].
ID | ATTR
N1 | Source value A1
N2 | Source value A2
N3 | Source value A3

Table 5. Table DWHB [3].
ID | ATTR
N4 | Source value B1
N5 | Source value B3

Table 6. Table DWHC [3].
ID | ATTR
N6 | Source value C2
N7 | Source value C3
One could use hash functions, as mentioned in Subsect. 6.1, to create such global identifiers. Note in this case that the output of such a hash function should also be a function of the table name in order to result in global identifiers as defined above. The one-to-one mapping between the new identifiers of an entity, via the old identifier of the entity, is safely stored in a separate repository called the identifier repository or IDREP table, as shown in Table 7. The identifier repository realizes the operation =^H in practice (see Subsect. 6.1). On the other hand, making decisions based on the obligation operands (see Subsect. 6.2) results in an outcome stating whether two datasets can be joined or not. These outcomes are (dynamically) maintained in a so-called usage right or USAGERIGHT table. For the example mentioned in this section, Table 8 shows such a usage rights table, which indicates that dataset A can be joined with dataset B, and vice versa.

Table 7. Identifier repository (IDREP table) [3].
IDSRC | IDDWH | SRCDATASET
1     | N1    | A
2     | N2    | A
3     | N3    | A
1     | N4    | B
3     | N5    | B
2     | N6    | C
3     | N7    | C

Table 8. Usage rights (USAGERIGHT) table [3].
DATASET1 | DATASET2
A        | B
B        | A
Realization. In the proposed approach, joins are made through the identifier repository table, where the usage rights table is used to check whether the tables can be combined. The next query shows how the join can be performed through the identifier repository with a check on the usage rights table.
> select dwh_a.*, dwh_b.*
> from dwh_a
> join id_rep rep1 on rep1.id_dwh = dwh_a.id
> join id_rep rep2 on rep2.id_src = rep1.id_src
> join dwh_b on dwh_b.id = rep2.id_dwh
> join usage_right on (usage_right.dataset1 = rep1.src_dataset
>                  and usage_right.dataset2 = rep2.src_dataset)
For tables DWHA and DWHB this will result in a dataset with the combined data with new global identifiers, as shown in Table 9, or with the existing identifiers (i.e., those known within the project from which the join operation was carried out), as shown in Table 10. Whether to provide the result of the join operation with new or existing identifiers depends on the requirements; see also Subsect. 7.3 for further discussion. When a similar query is run for tables DWHA and DWHC, there will be an empty result set because the join is not allowed according to the USAGERIGHT table. In this setting a direct/SQL join (like the one mentioned in the beginning of this section) will also return an empty result set because the identifiers in the tables of the data warehouse (e.g., Tables 4, 5 and 6) do not match.

Table 9. Result of the join operation with new identifiers.
ID | ATTR           | ATTR
N8 | Source val. A1 | Source val. B1
N9 | Source val. A3 | Source val. B3

Table 10. Result of the join operation with old identifiers.
ID | ATTR           | ATTR
N1 | Source val. A1 | Source val. B1
N3 | Source val. A3 | Source val. B3
The final step in our implementation is to set access control on the tables in the example. Data analyst X is not allowed to see the identifier repository or to change the usage rights. Joins are carried out through a stored procedure, which has access to the identifier repository.
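To make the mechanics concrete, the lookup-based join can be emulated with in-memory tables. The following Python sketch is our simplified rendering of the stored-procedure logic over Tables 4, 5, 7 and 8; it is not the production code:

DWH_A = {"N1": "Source value A1", "N2": "Source value A2", "N3": "Source value A3"}
DWH_B = {"N4": "Source value B1", "N5": "Source value B3"}
ID_REP = [  # (IDSRC, IDDWH, SRCDATASET), as in Table 7
    (1, "N1", "A"), (2, "N2", "A"), (3, "N3", "A"),
    (1, "N4", "B"), (3, "N5", "B"), (2, "N6", "C"), (3, "N7", "C"),
]
USAGE_RIGHT = {("A", "B"), ("B", "A")}  # Table 8

def controlled_join(left, right, ds_left, ds_right):
    """Join two warehouse tables via the identifier repository, but only
    if the USAGERIGHT table permits combining the two datasets."""
    if (ds_left, ds_right) not in USAGE_RIGHT:
        return []  # not joinable, e.g., DWH_A with DWH_C
    # Map source identifier -> warehouse identifier of the right-hand dataset.
    right_ids = {src: dwh for src, dwh, ds in ID_REP if ds == ds_right}
    return [
        (dwh, left[dwh], right[right_ids[src]])
        for src, dwh, ds in ID_REP
        if ds == ds_left and dwh in left
        and src in right_ids and right_ids[src] in right
    ]

print(controlled_join(DWH_A, DWH_B, "A", "B"))
# Yields the rows of Table 10 (result keyed by the existing identifiers).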
7.2 Algorithmic Mapping of Pseudo Identifiers
The basic mechanism presented in Subsect. 7.1 considers the reference monitor as the party who maintains the identifier repository table (see Table 7 for an example) and the usage rights table (see Table 8 for an example). There are two issues with the reference monitor maintaining the identifier repository table, namely:
– the length of the identifier repository table grows as the number of (entries of) the tables in the data warehouse grows, and
– the reference monitor learns the identifiers and their relationships in the pseudonymized tables.
The second issue requires one to assume that the reference monitor is a fully trusted (third) party. To deal with these two issues, we briefly sketch another solution that is based on Polymorphic Encryption for Pseudonymization [28] (and [14]).
This envisioned mechanism hides the identifiers and pseudo identifiers from the reference monitor and, therefore, relaxes the trust requirements on the monitor. Data custodians (e.g., those in charge of datasets A and B) encrypt identifiers a_1 and b_1 with a global public key and send them to the reference monitor, who re-encrypts (by re-keying and re-shuffling [28]) them for projects A and B. The re-keying by the reference monitor makes it possible for those projects to decrypt the encrypted identifiers of only their own projects. Re-shuffling (by factors S_A and S_B) by the reference monitor creates two customized pseudo identifiers S_A · a_1 and S_B · b_1 for projects A and B, respectively. Note that the reference monitor works only with encrypted identifiers, so it cannot see the original/plain identifiers. Further, only the reference monitor knows the factors S_A and S_B, so project A or project B cannot individually deduce S_A · a_1 =^H S_B · b_1 when a_1 = b_1 (thus protecting privacy). In order to deduce S_A · a_1 =^H S_B · b_1 when a_1 = b_1, a transaction between project A and project B via the reference monitor has to be carried out. Hereby the mapping from S_A · a_1 to S_B · b_1, or vice versa, is done without the reference monitor learning the mapping and without the projects realizing that pseudo identifiers S_A · a_1 and S_B · b_1 are equivalent. Describing the details of this mapping is beyond our scope in this paper.

7.3 Discussion and Limitations
The database in an administrative domain (like the one corresponding to a project) may consist of different datasets. Within such a database (or project) the tables can be joined as usual because they share the same pseudo identifiers, so the performance will be the same as for standard joins. When the tables of different databases/projects are joined, however, there is a performance penalty since the usage rights table has to be checked and the identifiers have to be mapped. For the latter, the basic mechanism needs to look up the identifier repository table and the algorithmic mapping mechanism needs to carry out a transaction via the reference monitor. These lookups and mappings inflict some processing/communication burdens. In the case of the realized queries, for example, we need to perform three extra joins. Obviously this is less efficient than a single join. However, modern database engines are very efficient in doing joins and can be optimized by the use of techniques like indexing. Therefore, the cost associated with generating such new datasets is usually not a huge problem in a data warehouse setting.
The USAGERIGHT table is a good location to hook in extra usage control decision factors like UCON authorizations, conditions or obligations. For example, only tuples for adults may be combined, the datasets may only be joined during a fixed period, or the requester has to sign a privacy statement; see, for example, Table 11.

Table 11. Extended table USAGERIGHT [3].
DataSet1 | DataSet2 | AUTH       | COND               | OBLIG
A        | D        | D.age ≥ 18 | Sys.date < 15 July | Agree on privacy policy
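One hypothetical way to evaluate such an extended entry at join time is sketched below; the field names and their encoding (e.g., the concrete deadline year) are our assumptions for illustration:

from datetime import date

def join_permitted(entry, tuple_age, signed_privacy_policy):
    """Evaluate the extra decision factors of an extended USAGERIGHT entry
    (cf. Table 11): an authorization, a condition, and an obligation."""
    auth = tuple_age >= entry["min_age"]      # AUTH:  D.age >= 18
    cond = date.today() < entry["deadline"]   # COND:  Sys.date < 15 July
    oblig = signed_privacy_policy             # OBLIG: agree on privacy policy
    return auth and cond and oblig

entry = {"dataset1": "A", "dataset2": "D",
         "min_age": 18, "deadline": date(2018, 7, 15)}
print(join_permitted(entry, tuple_age=21, signed_privacy_policy=True))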
For the basic mechanism, it is possible to present the result with a set of new and independent identifiers (i.e., the result does not contain the old or new identifiers from the identifier repository), as shown in the example of Table 9. In this way, the identifiers in the new dataset cannot easily and readily be traced back to the original datasets, nor will it be easily possible to link the identifiers of the joined dataset to other (possibly future) datasets. In practice, however, it is statistically possible for a data analyst with ill intentions and enough resources to infer which identifiers in table A (see Table 4) and table B (see Table 5) also appear in the resulting table of the join operation (e.g., in Table 9). The data analyst can infer the mappings in the identity repository for these records by using/matching the values of the other attributes in table A, table B and the table resulting from their join operation, using, for example, the inference techniques described in [7,22]. This can be seen as an attack on the pseudonymization part of the proposed approach. Our proposed approach attempts to impede those attackers (i.e., data analysts, the employee with questionable ethics as mentioned in [1]) who want to directly trace the identifiers in the new dataset back to the original datasets. This inference of the new pseudo identifiers is not an issue in the case of producing the result of the join operation with the old pseudo identifiers, i.e., those known at the project that initiates the join operation. Producing the join result with old identifiers is the case shown in Table 10 and in the algorithmic mapping of pseudo identifiers sketched in Subsect. 7.2.
The traditional join of two datasets A and B is commutative with respect to datasets A and B. In our usage control scenario, this might not be the case if, for example, the policy of project A allows the join but that of project B does not. This asymmetry can easily be realized, for example, by eliminating one row in the usage rights table (Table 8).
8 Conclusion
To deal with the issue of authorized-access and unauthorized-use of datasets, there is a need for a flexible and adaptive framework to decide on and enforce the data integration policy at runtime. We motivated this need for the join operation in vertically separated relational datasets, where one cannot predetermine which datasets would be encountered and integrated with a given dataset. We characterized the usage control model of the join operation by the obligations of the UCON model. Here the authorization of the right (i.e., the join of datasets A and B) is constrained by the state of the object. In this study we distinguished a new type of obligation where the state of the object (i.e., dataset A or dataset B) is determined with respect to another dataset. These obligations can be of both pre-obligation and post-obligation types simultaneously, depending on the timing of the join operation with respect to the moments at which datasets A and B become available. This duality is another new insight provided in this contribution. We proposed a few methods for deciding whether two datasets A and B can be joined or not. The decision can be based on, for example, whether the resulting combination of attributes is allowed or not using domain knowledge,
comparing the data collection and data usage purposes of datasets A and B, or the information leakage about the sensitive attributes due to the join operation. Finally we proposed a mechanism to enforce the obligations and realized it in an example implementation. The reference monitor of the proposed usage control is realized as a stored procedure that maps the pseudo identifiers from the identifier repository to the original identifiers, checks the usage rights to determine if a join is allowed, and joins the data if the join is allowed. Our scheme uses different pseudo identifiers for the input and output datasets of the join operation and relies on a secure lookup table to map among these pseudo identifiers during the realized join functionality. This solution creates a first barrier against the threat of inferring pseudo identifiers. Searching for a more scalable and secure solution, we sketched an algorithmic mechanism based on Polymorphic Encryption for Pseudonymization [28]. It is left for our future work to realize and evaluate the sketched algorithmic mapping, and also to research the decision-making mechanisms that are relevant and efficient for our organisation to protect privacy and to improve data quality.
References
1. Agrawal, R., et al.: Hippocratic databases. In: Proceedings of the 28th International Conference on Very Large Data Bases, vol. 4, no. 1890, pp. 143–154 (2002)
2. Bargh, M.S., Choenni, S.: On preserving privacy whilst integrating data in connected information systems. In: Proceedings of the International Conference on Cloud Security Management (ICCSM 2013), Guimarães, Portugal (2013)
3. Bargh, M.S., Vink, M.E., Choenni, S.: On usage control in relational database management systems: obligations and their enforcement in joining datasets. In: Proceedings of the 3rd International Conference on Information Systems Security and Privacy (ICISSP), Porto, Portugal, 19–21 February 2017
4. Bettini, C., et al.: Provisions and obligations in policy rule management. J. Netw. Syst. Manag. 11(3), 351–372 (2003)
5. Byun, J., Li, N.: Purpose based access control for privacy protection in relational database systems. VLDB J. 17, 603–619 (2008)
6. Choenni, S., Bargh, M.S., Roepan, C., Meijer, R.F.: Privacy and security in smart data collection by citizens. In: Gil-Garcia, J.R., Pardo, T.A., Nam, T. (eds.) Smarter as the New Urban Agenda. PAIT, vol. 11, pp. 349–366. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-17620-8_19
7. Choenni, S., van Dijk, J., Leeuw, F.: Preserving privacy whilst integrating data: applied to criminal justice. Inf. Polity 15(1–2), 125–138 (2010)
8. Colombo, P., Ferrari, E.: Enforcing obligations within relational database management systems. IEEE Trans. Depend. Secur. Comput. 11, 1–14 (2014)
9. Dawes, S.S.: Information policy meta-principles: stewardship and usefulness. In: Sprague Jr., R.H. (ed.) Proceedings of the 43rd Hawaii International Conference on System Sciences (HICSS), pp. 1–10 (2010)
10. Dawes, S.S.: Stewardship and usefulness: policy principles for information-based transparency. Gov. Inf. Q. 27(4), 377–383 (2010)
11. Fung, B.C.M., et al.: Privacy-preserving data publishing. ACM Comput. Surv. 42(4), 1–53 (2010)
12. Gama, P., Ribeiro, C., Ferreira, P.: Heimdhal: a history-based policy engine for grids. In: Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID) (2006)
13. Hilty, M., Basin, D., Pretschner, A.: On obligations. In: di Vimercati, S.C., Syverson, P., Gollmann, D. (eds.) ESORICS 2005. LNCS, vol. 3679, pp. 98–117. Springer, Heidelberg (2005). https://doi.org/10.1007/11555827_7
14. Jacobs, B., et al.: Polymorphic Encryption and Pseudonymization (PEP) for Privacy-Friendly Personalised Medicine. Presentations, ICIS Digital Security, Radboud University, 16 September 2016
15. Karr, A.F., et al.: Secure, privacy-preserving analysis of distributed databases. Technometrics 49(3), 335–345 (2007)
16. Katt, B., et al.: A general obligation model and continuity: enhanced policy enforcement engine for usage control. In: Proceedings of the 13th ACM Symposium on Access Control Models and Technologies (SACMAT 2008), pp. 123–132 (2008)
17. Kosinski, M., Stillwell, D., Graepel, T.: Private traits and attributes are predictable from digital records of human behavior. Proc. Natl. Acad. Sci. U.S.A. 110(15), 5802–5805 (2013)
18. Laur, S., Talviste, R., Willemson, J.: From oblivious AES to efficient and secure database join in the multiparty setting. In: Jacobson, M., Locasto, M., Mohassel, P., Safavi-Naini, R. (eds.) ACNS 2013. LNCS, vol. 7954, pp. 84–101. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-38980-1_6
19. Lazouski, A., Martinelli, F., Mori, P.: Usage control in computer security: a survey. Comput. Sci. Rev. 4(2), 81–99 (2010)
20. Lopez, J., Oppliger, R., Pernul, G.: Authentication and authorization infrastructures (AAIs): a comparative survey. Comput. Secur. 23(7), 578–590 (2004)
21. de Montjoye, Y.-A., et al.: Unique in the crowd: the privacy bounds of human mobility. Sci. Rep. 3, 1376 (2013)
22. Narayanan, A., Shmatikov, V.: Robust de-anonymization of large sparse datasets. In: IEEE Symposium on Security and Privacy (SP 2008), pp. 111–125 (2008)
23. Ni, Q., Bertino, E., Lobo, J.: An obligation model bridging access control policies and privacy policies. In: Proceedings of the 13th ACM Symposium on Access Control Models and Technologies - SACMAT 2008, p. 133 (2008)
24. Park, J., Sandhu, R.: The UCON_ABC usage control model. ACM Trans. Inf. Syst. Secur. 7(1), 128–174 (2004)
25. Sandhu, R., Park, J.: Usage control: a vision for next generation access control. In: Gorodetsky, V., Popyack, L., Skormin, V. (eds.) MMM-ACNS 2003. LNCS, vol. 2776, pp. 17–31. Springer, Heidelberg (2003). https://doi.org/10.1007/978-3-540-45215-7_2
26. Sankar, L., Rajagopalan, S., Poor, H.: Utility-privacy tradeoff in databases: an information-theoretic approach. IEEE Trans. Inf. Forensics Secur. 8, 838–852 (2013)
27. Sweeney, L.: Uniqueness of simple demographics in the U.S. population. Carnegie Mellon University, Laboratory for International Data Privacy, Pittsburgh, PA (2000)
28. Verheul, E., et al.: Polymorphic Encryption and Pseudonymisation for Personalised Healthcare (2016). https://www.semanticscholar.org/paper/Polymorphic-Encryption-and-Pseudonymisation-for-Verheul-Jacobs/7dfce578644bc101ae4ffcd0184d2227c6d07809
29. Wang, W., Ying, L., Zhang, J.: On the relation between identifiability, differential privacy and mutual-information privacy. In: 52nd IEEE Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 1086–1092 (2014)
30. Zhang, X., et al.: Formal model and policy specification of usage control. ACM Trans. Inf. Syst. Secur. 8(4), 351–387 (2005)
Directional Distance-Bounding Identification

Ahmad Ahmadi and Reihaneh Safavi-Naini

University of Calgary, Calgary, Canada
{ahmadi,rei}@ucalgary.ca
Abstract. Distance bounding (DB) protocols allow a prover to convince a verifier that they are within a distance bound. A public key distance bounding protocol relies on the public keys of the users to prove their identity and proximity claim. There have been a number of approaches in the literature to formalize the security of public key distance bounding protocols. In this paper we extend an earlier work that formalizes the security of public key DB protocols using an approach that is inspired by the security definition of identification protocols and is referred to as distance-bounding identification (DBID). We first show that if protocol participants have access to a directional antenna, many existing protocols that have been proven secure will become insecure, and then show how to revise the previous model to include this new capability of the users. The DBID approach provides a natural way of modelling man-in-the-middle attacks in line with identification protocols, as well as other attacks that are commonly considered in distance bounding protocols. We compare the existing public key DB models, and prove the security of the scheme known as ProProx in our model.
Keywords: Distance-bounding · Identification · Public-key · MiM · Directional antenna
1 Introduction
Distance upper bounding (DB) protocols were first proposed in [11] to provide security against man-in-the-middle (MiM) attacks in authentication protocols. They have found a wide diversity of applications in location and proximity based services [5,9,13,18,22]. Most DB protocols are symmetric key protocols where the prover and the verifier share a secret key. More recently, public key DB protocols have been proposed where the prover is only known through their public key, while their secret key remains private to them [2,3,25]. In these models, the verifier only has access to the system public parameters as well as the public keys of the participants. In a DB setting there are three types of participants: provers who are registered in the system and have secret keys, a verifier who is honest and has access to the correct public keys of provers, and actors who are not registered in
the system, but want to be accepted and may collude with a dishonest prover. The distance between the prover and the verifier is measured by using a “fast challenge-response phase” during which a sequence of one-bit challenges is sent by the verifier to the prover, and the corresponding responses of the prover are recorded and used for distance estimation. A challenge-response table includes the responses that are required for all possible challenges and is calculated by the prover before the fast challenge-response rounds start. The challenge-response table is constructed using the prover's secret key and some nonces that are communicated during the slow phase of the protocol. In the symmetric key setting, the challenge-response table can also be constructed by the verifier and used for the verification of responses. In the public key setting, however, the verifier only knows the prover's public key and cannot calculate the challenge-response table. In this case, the verifier verifies the correctness of the prover's responses using their relation with the prover's public key. For a DB protocol with distance bound D, we refer to participants whose distance to the verifier is less than D as close-by participants (set S) and those who are farther away than D as far-away participants (set F). Important attacks against DB protocols are:
(A1) Distance-Fraud [6]: a dishonest far-away prover tries to be accepted in the protocol. Distance-Hijacking [9] is a special case of this attack, where a far-away prover takes advantage of the communication of honest close-by provers to succeed in the protocol.
(A2) Mafia-Fraud (MF) [11]: a close-by actor tries to use the communications of a far-away honest prover to succeed in the protocol.
(A3) Strong-Impersonation [2]: a close-by actor learns from past executions of the protocol by a close-by honest prover and tries to impersonate the prover in a new execution when the prover is either inactive or is not close-by anymore.
(A4) Terrorist-Fraud (TF) [11]: a dishonest far-away prover colludes with a close-by actor to succeed in the protocol. In the original TF, it is assumed that the prover does not leak their secret key to the actor. In the recent TF [24] this restriction is removed, but it is required that non-negligible success of the TF attack results in a non-negligible improvement in future impersonation attacks by the actor.
To prove security of the existing public key DB protocols, such as [2,3,6,7,14,15,25], PoPoK [25] proposed a formal security model that uses a cryptographic PoK system and considers the distance bound as an additional property of the system. In DBID [2] an alternative approach was proposed that follows the security formalization of identification protocols (using Σ-protocols), and includes the distance bound as an extra property. The ProProx protocol [25] was first proven secure in the former model [25], and later in the latter model [2].
Our Work. This paper is an expanded and revised version of our paper [2]. We consider provers that have access to directional antennas. Such antennas allow
point-to-point communication with minimal interception by eavesdroppers who are outside the main transmission direction [1]. Advances in beamforming techniques and smart antennas in recent years [1] have made these antennas readily accessible to users. Distance bounding protocols rely, during the fast challenge-response phase, on physical layer communication, and so it is important to consider this extra attacking capability of protocol participants. We will show that a directional antenna indeed affects the security evaluation of DB protocols and, in particular, effectively allows a malicious prover to launch a successful TF attack against protocols that had provable security against this attack. In Sect. 3 we show how this extra capability can be used by a malicious prover who is aided by a helper to break the security of the VSSDB scheme [14]. Directional antennas had previously been considered for actors during an MF attack. In this paper we consider a dishonest prover with access to this type of antenna. For distance fraud, a directional antenna does not appear to affect security. In TF, however, the dishonest prover is aided by a helper and a directional antenna, and this affects the security definition. We extend the DBID formal security model [2] to include this new attacker capability. The directional TF attack is captured in the revised TF-resistance (Property 4). We prove that the existing ProProx scheme is indeed secure in this new model.
Organization. Section 2 is preliminaries. Section 3 shows a directional TF attack on a public key DB protocol. Section 4 presents our model, Sect. 5 describes the construction of ProProx and gives security theorems and proofs. Section 6 gives a summary of related works, and Sect. 7 concludes the paper.
2 Preliminaries
In this section we introduce a primitive that will later be used in our model. A Σ-protocol is a 3-round cryptographic protocol between a prover P and a verifier V, in which the two parties interact and, at the end of the protocol, V is convinced about the validity of P's statement. P has a private input x that satisfies the relation R(x, y), where y is a public value that is also known to V. A Σ-protocol is used in cryptographic systems such as proof-of-knowledge schemes [10,16,17,21,23]. Here we define a more general form of Σ-protocols, called Σ*-protocols, in which the verifier consecutively sends multiple challenges, each (except the first one) after receiving the response to the previous challenges.
Definition 1 (Σ*-protocol). A prover P and a verifier V run the following protocol. Let C, H and R denote three sets defined as follows: C is the set of possible inputs chosen by the prover; H is the set of possible challenges chosen by the verifier; and R is the set of possible responses of the prover. The steps of the protocol are as follows:
1. P randomly chooses a ∈ C, computes the commitment A = Commit(a), and sends A to V.
2. Challenge and response messages are defined as follows:
(a) V randomly chooses a challenge c ∈ H and sends it to P;
(b) P computes r = Response(x, a, c, ¬c) ∈ R, where ¬c is the list of previous challenges before c, and sends it to V.
Steps 2-(a) and 2-(b) may be repeated a number of times.
3. V calculates ret = Check(y, [c], [r], A), where ret ∈ {accept, reject} and [c] and [r] are the lists of all challenges and responses, respectively.
At the end of the protocol, V outputs Out_V = 1 if ret = accept, and Out_V = 0 otherwise. In a basic Σ-protocol, there is only one instance of steps 2-(a) and 2-(b).
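As a concrete instance of the Commit/Response/Check interface, the sketch below implements the classic Schnorr identification protocol, a basic Σ-protocol with a single challenge-response round (the group parameters are toy values chosen by us and are far too small for real security):

import secrets

# Toy Schnorr group: q divides p - 1 and g has order q (insecure demo sizes).
p, q, g = 2267, 103, 354
x = secrets.randbelow(q)   # prover's private input
y = pow(g, x, p)           # public value; relation R(x, y): y = g^x mod p

# Step 1: the prover commits.
a = secrets.randbelow(q)   # a ∈ C
A = pow(g, a, p)           # A = Commit(a)

# Step 2: the verifier challenges, the prover responds.
c = secrets.randbelow(q)   # c ∈ H
r = (a + c * x) % q        # r = Response(x, a, c) ∈ R

# Step 3: the verifier checks that g^r = A · y^c (mod p).
assert pow(g, r, p) == (A * pow(y, c, p)) % p  # ret = accept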
3 Directional Attacks on Public-Key DB Protocols
Directional attacks assume that participants have access to directional antennas that allow them to direct messages to specific participants and prevent other participants from receiving them. Figure 1 shows how such an antenna can be exploited by a malicious prover in a TF attack. The helper does not receive the slow-phase messages that are sent by the prover, as the prover uses a directional antenna (orange ribbon in Fig. 1) for communication in this phase. Before the start of the fast phase, the prover sends all fast-phase responses (e.g., the fast challenge-response table) to the helper, making the helper in charge of responding to the fast-phase challenges. This means that the adversary is able to separate the slow-phase messages of the protocol from the fast-phase messages. In a vulnerable protocol, the prover may succeed in a TF attack without leaking their long-term key to the helper, using this separation technique. Therefore, the attacker's success in TF will not imply success in future impersonation.
Fig. 1. Directional TF: prover P∗, at distance greater than D from verifier V, (1) sends the slow-phase messages directionally to V, (2) sends the fast-phase responses to the close-by helper H, who (3) answers the fast-phase challenges. (Color figure online)
In the following we describe how this setting helps a malicious prover to succeed in a terrorist-fraud attack against the VSSDB scheme [14]. This protocol was proven secure in the model of [14]. Following Definition 2 for a DB scheme, Fig. 2 presents the Π protocol of the VSSDB scheme. This is a protocol between the prover and the verifier
where the prover has access to the public key of the verifier and their own secret key, and the verifier has access to their private key and the public key of the prover.
Lemma 1. In the Π protocol of the VSSDB scheme, the fast challenge-response table does not leak information about the secret value sk_P of the prover, assuming that sk_P and x are chosen independently.
Proof. The elements of the fast challenge-response table are calculated as r_j = f_j(c_j) ∈ {0, 1} for j ∈ {1, ..., λ}. Therefore, by knowing the table, one can at most extract the values of e_j, k_j, l_j for j ∈ {1, ..., λ}. By finding these values, one can extract the value of x using the equation x_j = e_j ⊕ k_j ⊕ l_j for j ∈ {1, ..., λ}. Since k_j and l_j are chosen randomly, this table only contains information about the randomly chosen values k and l, and the value of x, which are independent of the secret value sk_P.
Attack. In this attack, the prover sends the messages of the slow phase to the verifier using the directional antenna. The prover then sends the fast challenge-response table (i.e., for all j ∈ {1, ..., λ}: either (e_j, k_j) or (k_j ⊕ l_j, e_j ⊕ l_j)) to the helper before running the fast phase. Note that the fast challenge-response table does not leak the prover's long-term secret sk_P, according to Lemma 1. This allows the helper to respond to the verifier's challenges during the fast phase. The collusion of the prover and the helper will make the verifier accept (i.e., Out_V = 1), and this happens without the prover sending the helper any information that depends on the secret key sk_P. The secret sk_P is required to generate a valid signature σ in the message π. This means that the helper's success chance in a future impersonation attack will not improve. This completes a successful TF.
4 Model
First we define the settings of our system. This includes the entities, their communication, their views, and the adversarial capabilities. Then we define the distance-bounding identification scheme (DBID) and describe the DBID experiment, which simulates an instance of a DBID scheme. Finally we formalize four properties of distance-bounding identification schemes (completeness, soundness, DF-resistance, and TF-resistance) using a game-based approach: each property is described as a DBID experiment where the adversary is active, and the game is between the adversary and a challenger that sets up the system, taking into account the adversary's input.
Entities. We consider a set U of users. A user u ∈ U can have multiple provers that are denoted by the set P. This captures the scenario that a single user has multiple devices.
Fig. 2. Π protocol of the VSSDB scheme. (Commit, COpen) is a commitment scheme. (Enc, Dec) is a secure public key encryption scheme. (Sign, SVerify) is a signature scheme. (Prove, PVerify) is a proof-of-knowledge scheme. H is a secure hash function with pseudo-random output. x is the private key of the prover, with random distribution. υ = {υ_j}_{j=1}^λ, where υ_j = H_j(x). com_j = Commit(x_j, υ_j). The fast challenge-response table has λ columns, with the j-th column defined by the Boolean function f_j(·).
A trusted group manager generates the public parameters of the system, registers users, and issues a key pair to each user. A user u is identifiable by its private key. The private key, which must be kept secret, forms the secret input of the user in providing an authentication proof. The private key of a user u is shared by all their provers P. The corresponding public key of the user is published by the group manager. There is a single verifier in the system which, for uniformity of notation, we refer to as a set V that has a single member. The verifier only has access to the public parameters of the system. There is a set of actors (C) that only have access to the public parameters of the system. In this paper we refer to the members of the sets P, V and C as participants. Each participant has a location loc = (x, y) ∈ ℝ × ℝ, which is an element of a metric space equipped with the Euclidean distance, and is fixed during the protocol. The distance function d(loc_1, loc_2) returns the distance between two locations. The message travel time between locations loc_1 and loc_2 is d(loc_1, loc_2)/L, where L is the speed of light. A bit sent over the channel may flip with probability p_noise (0 ≤ p_noise ≤ 1). Participants that are located within a predefined distance bound D from the verifier, excluding the verifier, are called close-by participants (set S), and those who are outside the distance bound from the verifier are called far-away participants (set F).
Communication Structure. All participants have access to directional antennas: a participant A at loc_A can send a message to participant B at loc_B such that others who are not on the straight line connecting loc_A and loc_B cannot intercept it. Using an omni-directional antenna, however, allows a message to be seen and modified by other participants. A participant may have multiple antennas that can be either directional or omni-directional. We allow a participant to send multiple messages to multiple parties at the same time, each from a separate antenna. Multiple messages that are received at the same time on the same antenna are combined and received as a single message.
View. The view of an entity at a point of a protocol consists of all the inputs of the entity (including random coin tosses) and the set of messages that it has received up to that point in the protocol. Receiving a message is called an event. View^Γ_x(e) is a random variable that denotes the view of an entity (or a set of entities) x right after the event e in protocol Γ. The short notation View^Γ_x is used to indicate the view of x at the end of the protocol Γ, i.e., View^Γ_x = View^Γ_x(e_last), where e_last is the last event in the protocol Γ.
Adversary. An adversary can corrupt a subset of participants X* ⊂ P ∪ V ∪ C. As we will see later in this section, for each security property X* has certain restrictions: in Completeness X* = ∅, in Soundness X* ⊆ C, in DF-resistance X* ⊆ P, and in TF-resistance X* ⊆ P ∪ C.
When a prover of a user u is compromised, the user u's private key is compromised and the adversary can choose devices with that key at locations of their choice. In other words, all the provers in P become compromised, because all the provers of a user share the same private key. We refer to them as corrupted provers, who are controlled by the adversary and may be activated simultaneously. We assume, however, that non-corrupted provers follow the protocol and that a user only uses one of its devices at a time (i.e., the execution times of the provers in P do not overlap), because an honest user does not use multiple devices simultaneously.

Definition 2 (Distance-Bounding Identification Scheme). For a security parameter λ, a distance-bounding identification scheme (DBID) is defined by a tuple (X, Y, S, P, Init, KeyGen, Π, Revoke, D, pnoise), where

(I) X and Y are the sets of possible master keys and public keys of the system, respectively, chosen based on the security parameter λ. The system master key msk ∈ X and group public key gpk ∈ Y are generated using the (msk, gpk) ← Init(1^λ) algorithm;

(II) S and P are the sets of possible private keys and public keys of the users, respectively, chosen according to the security parameter λ. The user private key sk ∈ S and public key pk ∈ P are generated using either the (sk, pk) ← KeyGen(1^λ, msk, gpk) algorithm or the KeyGen{U(1^λ, gpk) ↔ GM(1^λ, msk)} protocol. The KeyGen algorithm is run by the group manager, and the output is a user key pair and an updated group public key. The user key pair is securely sent to the user, and the public key is published by the group manager, i.e., gpk := gpk ∪ {pk}. The KeyGen protocol, instead, is run between the group manager GM(1^λ, msk) and a user U(1^λ, gpk). The user outputs a key pair (sk, pk), and the group manager outputs the updated group public key gpk := gpk ∪ {pk}.

(III) D ∈ R is a real number that indicates the upper bound on the prover's distance to the verifier;

(IV) Π is a Σ∗-protocol between a prover P(sk, pk, gpk) and the verifier V(pk, gpk), in which V verifies whether the prover is authentic and is located within the distance bound D of the verifier. The Π protocol has multiple rounds of challenge and response. The round-trip time of some of the challenge and response rounds is used to estimate the distance between the prover and the verifier. The time-measured rounds are the fast challenge-response rounds and form the fast phase of the protocol; the other steps of the protocol Π are part of the slow phase. The fast phase of the protocol is run on a physical channel and the transmitted bits can be affected by noise, where pnoise ∈ [0, 1] is the probability of a bit flip on transmission in a fast challenge-response round;

(V) (gpk′) ← Revoke(msk, gpk, i) is an algorithm that takes the master secret key, the group public key and the index of a user. The algorithm removes the corresponding user ui from the system and updates the group public key
accordingly, i.e., gpk → gpk′. The Revoke operation is optional in a DBID scheme.

Below we describe the execution of an instance of the DBID scheme, which we call a DBID experiment.

Definition 3 (DBID Experiment). A DBID experiment is defined by a tuple (DBID; U; P; V; C), where

(i) DBID is a distance-bounding identification scheme as defined in Definition 2.
(ii) U is the set of users that are members of the group; each user uj ∈ U has three attributes:
– uj.Key, a secret key generated by the group manager,
– uj.RT, the registration time of the user, which can be any time, and
– uj.Rev, a flag that shows whether the user is revoked.
(iii) P is the set of provers; each prover has access to the secret key of a single user.
(iv) V is the set of verifiers, which have access to the public parameters of the DBID system. We consider the case where V has a single member.
(v) C is the set of actors; each actor has access to the public parameters of the DBID system.

Members of the set X = P ∪ V ∪ C are called participants of the system. Each participant x ∈ X has the following attributes:

a1. x.Loc, the location of the participant,
a2. x.Code, the code to be run by the participant,
a3. x.St, the start time of the x.Code execution, and
a4. x.Corr, a flag indicating whether the participant is corrupted.
In addition to these attributes, each prover p ∈ P has one extra attribute:

a5. p.Key, the secret key of the corresponding user, i.e., p.Key = uj.Key for a user uj ∈ U.

The start time of all provers is after the registration time of the user, i.e., ∀u ∈ U, ∀p ∈ P : p.St > u.RT. The provers of a user are either all honest or all dishonest, i.e., ∀p ∈ P : p.Corr = flag, where flag ∈ {true, false}. Because users' keys are independently chosen, we can consider a single user and, for simplicity, omit the other users. Honest provers p ∈ P follow the Π protocol (i.e., p.Code = DBID.Π.P(.)) and there is no overlap in the execution times of the honest provers. If the verifier is honest, then it follows the Π protocol (i.e., v.Code = DBID.Π.V(.) for v ∈ V).

The experiment is run by a simulator that sets the attributes of the participants and interacts with the group manager to assign keys to the provers of a user. If there is an adversary in the system, the simulator interacts with the adversary and follows their requested operations, which will influence the experiment. The experiment, without an adversary, proceeds as follows:
1. Setup.
(a) Initiate: The group manager runs the (msk, gpk) ← DBID.Init(1^λ) algorithm to generate the master secret key and the group public key.
(b) Generate Players: The simulator forms the sets (U, V, P, C) and sets their attributes. The simulator interacts with the group manager to obtain and assign the keys of the provers.
2. Run: The simulator starts the execution of x.Code for all participants x ∈ X = P ∪ V ∪ C at time x.St.

The simulation uses a clock. The start and finish times of a protocol Γ are indicated as stTime(Γ) and fshTime(Γ) respectively, which form the execution time exTime(Γ) = (stTime(Γ), fshTime(Γ)) as a range of time, and the execution time period exLen(Γ) = fshTime(Γ) − stTime(Γ). Different provers have different execution time periods (i.e., they participate in a protocol from a time t1 to a time t2), and possibly different locations.

In the following, we define the security properties of a DBID scheme using a game between a challenger and an adversary. This game is a DBID experiment that is run by the challenger, who interacts with an adversary. In this game we only consider one user, i.e., |U| = 1. The challenger plays both roles of the simulator and the group manager in the DBID experiment (Definition 3). The adversary's capabilities are modelled as access to a query that it presents to the challenger.

Definition 4 (DBID Game). A DBID game between a challenger and an adversary is a DBID experiment that is defined by a tuple (DBID; U; P; V; C; CorruptParties) where

– DBID is a distance-bounding identification scheme as defined in Definition 2.
– U, P, V, C are the sets of users, provers, verifiers and actors as defined in Definition 3, which are determined through the interaction of the challenger and the adversary.
– CorruptParties(Q) is a query that allows the adversary to plan (program) their attack. Q is a set of participants, which may exist in the system or be introduced by the adversary.

The game setup phase is run by the challenger while playing the roles of the simulator and the group manager, and interacting with the adversary. In more detail:

1. Setup.
(a) Initiate: The challenger runs (msk, gpk) ← DBID.Init(1^λ) and publishes gpk. Note that the execution codes of an honest prover and verifier are known by the challenger and the adversary at this point, and are referred to as DBID.Π.P and DBID.Π.V, respectively.
(b) Generate Players: The sets (U, V, P, C) are formed through the interaction of the challenger and the adversary as follows:
i. The challenger creates the sets (U, V, P, C) as follows:
– Chooses a verifier V = {v}, with the following attributes: v.Loc = loc0, v.Code = DBID.Π.V, v.St = 0, and v.Corr = false.
– Runs (sk, pk) ← DBID.KeyGen(1^λ, msk, gpk) once and forms the set U = {u}. The user key is set as u.Key = sk, the registration time of the user is set as u.RT = 0 and the revocation flag is set as u.Rev = false. The group public key is updated as gpk := gpk ∪ {pk}.
– Creates a prover set P and, for each member p of P, assigns their attributes as follows: p.Loc is set arbitrarily, p.Code = DBID.Π.P, p.St is set arbitrarily such that there is no overlap in the execution times of the provers (i.e., there are no p1, p2 ∈ P with p1.St < p2.St ∧ p1.St + exLen(DBID.Π) > p2.St), p.Corr = false, and the secret key is p.Key = u.Key.
– C = ∅
ii. The challenger sends the attributes (x.Loc, x.Code, x.St) for all x ∈ P ∪ V ∪ C to the adversary. The size of the set X is n.
iii. The adversary uses the published values to form their corruption query CorruptParties(Q), which is sent to the challenger. The secret information of the corrupted participants in Q is given to the adversary, and the behaviour (Code) of the corresponding participants is assigned according to the adversary's instructions. More specifically, the parameter of this query is Q = {q1, . . . , qn}. Each qi consists of the type, location, execution start time and execution code of a participant, i.e., qi = (type, location, code, time), where type ∈ {verifier, prover, actor, user} indicates the type of the participant, location ∈ R × R indicates the location of the participant, code ∈ {0, 1}∗ indicates the execution code of the participant, and time ∈ N indicates the execution start time of the participant. If qi ∈ X = P ∪ V ∪ C ∪ U, it determines the settings of an existing participant, and if qi ∉ X, it determines the settings of a new participant.
iv. Upon receiving CorruptParties(Q), where Q = {q1, . . . , qn}, the challenger runs:
– For a qi with qi.type = verifier, sets v.Code = qi.code and v.Corr = true for v ∈ V.
– For each qi with qi.type = user, sets the user's revocation flag as u.Rev = true where u ∈ U, runs (gpk′) ← Revoke(msk, gpk, 1), and then updates the group public key gpk ← gpk′. This applies only if the DBID scheme provides user revocation.
– If there is a qi with qi.type = prover, then for each member p of the set P, sets the corruption flag p.Corr = true. If qi does not correspond to an existing prover, then creates a new prover p and adds it to the prover set P, setting the attributes of the participant p as follows: location p.Loc = qi.location, execution code p.Code = qi.code, start time p.St = qi.time, corruption flag p.Corr = true, and secret key p.Key = u.Key.
– For each qi with qi.type = actor, adds a new actor x to the set C and assigns its attributes as follows: location x.Loc = qi.location,
execution code x.Code = qi.code, start time x.St = qi.time, and corruption flag x.Corr = true.
v. The challenger sends the keys of the corrupted provers and the key of the revoked user to the adversary, i.e., p.Key for all p ∈ P such that p.Corr = true, and u.Key for all u ∈ U such that u.Rev = true.
2. Run: The challenger activates all participants x ∈ X = P ∪ V ∪ C at time x.St for execution of x.Code. The game ends when the last participant's code completes its execution.

Using the above game, we define four distinct properties of distance-bounding identification schemes. The winning condition of the above game varies for each property.

Property 1 (DBID Completeness). Consider a DBID scheme and a DBID game where Q = ∅ in the CorruptParties(Q) query and the set P is not empty. The DBID scheme is (τ, δ)-complete for 0 ≤ τ, δ ≤ 1, if the verifier returns OutV = 1 with probability at least 1 − δ, under the following assumptions:

– the fast challenge-response rounds are independently affected by noise and at least a τ fraction of them are noiseless, and
– τ > 1 − pnoise − ε for some constant ε > 0.

A complete scheme must have negligible δ to be able to function in the presence of communication noise.

Property 2 (DBID Soundness). Consider a DBID scheme and a DBID game with the following restrictions:

– P is nonempty and ∀p ∈ P, v ∈ V : d(p.Loc, v.Loc) > DBID.D, and
– in the CorruptParties(Q) query, qi.type ∈ {actor, user} for all qi ∈ Q.

In this game the verifier and provers are honest, while the adversary A corrupts a set of actors, sets their locations and, if applicable, revokes some users. The corrupted actors are controlled by the adversary, and can simultaneously communicate with multiple provers and the verifier. They can receive a message m from a prover and send m′ to the verifier, and vice versa. The certificates of the revoked users are sent to the adversary. The DBID scheme is γ-sound if the probability of the verifier outputting OutV = 1 is at most γ.

This general definition captures the following attacks by considering special values for the parameters of the game:

– relay attack [6], where the MiM attacker only relays the messages between the honest verifier and a far-away honest prover. The MiM attacker tries to convince the verifier that the prover is located close to the verifier. This attack is achieved by adding the following extra restriction on the adversary of Property 2:
• ∀qi ∈ Q we have qi.code = "relay messages".
– mafia-fraud [11], where there is an honest verifier, an honest far-away prover, and a close-by MiM attacker who tries to convince the verifier that the prover is located close to the verifier. As a learning phase, the attacker listens to the legitimate communications for a while before running the attack. This attack corresponds to adding the following extra restrictions on the adversary of Property 2:
• P is nonempty, and
• ∀qi ∈ Q we have d(qi.location, v.Loc) ≤ DBID.D for v ∈ V.
– impersonation attack [4], which happens when there is an honest verifier and a single close-by attacker who tries to convince the verifier that the prover is located close to the verifier. The attacker can have a learning phase before running the attack. We can achieve this attack by adding the following extra restrictions on the adversary of Property 2:
• P is nonempty, and
• ∀qi ∈ Q we have d(qi.location, v.Loc) ≤ DBID.D for v ∈ V, and
• among all the successful DBID.Π protocols (the set Π^succ) during the game, ∃π ∈ Π^succ, ∀p ∈ P : t = fshTime(π), t ∉ [p.St, p.St + exLen(p.Code)].
– strong-impersonation [2], which happens when either mafia-fraud or impersonation happens. We can achieve this attack by adding the following extra restrictions on the adversary of Property 2:
• P is nonempty, and
• ∀qi ∈ Q we have d(qi.location, v.Loc) ≤ DBID.D for v ∈ V, and
• among all the successful DBID.Π protocols (the set Π^succ) during the game, at least one of the following conditions holds:
(i) ∃π ∈ Π^succ, ∀p ∈ P : t = fshTime(π), t ∉ [p.St, p.St + exLen(p.Code)]
(ii) ∃p ∈ P, ∃π ∈ Π^succ, v ∈ V : t = fshTime(π), t ∈ [p.St, p.St + exLen(p.Code)] ∧ d(p.Loc, v.Loc) > DBID.D.

We consider two types of attacks by a dishonest prover: far-away dishonest provers (Property 3), and far-away dishonest provers with a close-by helper (Property 4).

Property 3 (DBID Distance-Fraud). Consider a DBID scheme and a DBID game with the following restrictions:

– P is nonempty and ∀p ∈ P, v ∈ V : d(p.Loc, v.Loc) > DBID.D, and
– in the CorruptParties(Q) query, qi.type = prover and d(qi.location, v.Loc) > DBID.D for all qi ∈ Q and v ∈ V.

The DBID scheme is α-DF-resistant if, for any DBID.Π protocol in such a game, we have Pr[OutV = 1] ≤ α.

In the following we define the TF-resistance of DBID protocols.

Property 4 (DBID Terrorist-Fraud). Consider a DBID scheme and a DBID game with the following restrictions:
– P is nonempty and ∀p ∈ P, v ∈ V : d(p.Loc, v.Loc) > DBID.D, and
– in the CorruptParties(Q) query, qi.type ∈ {prover, actor}, and d(qi.location, v.Loc) > DBID.D for all qi ∈ Q with qi.type = prover and v ∈ V.

The DBID scheme is μ-TF-resistant if the following holds about the above game:

• If the verifier returns OutV = 1 in the Π protocol of a game Γ with non-negligible probability κ, then there is an impersonation attack, as a DBID game Γ′ with an honest verifier, no prover and one close-by actor, that takes the view of the close-by participants (View_S^Γ) as input and makes the verifier return OutV = 1 with probability at least κ − μ in the Π protocol of the game Γ′.

Note that this is a formal definition of terrorist-fraud resistance (A4) that is based on recent definitions (such as [24]) and is different from the definition of TF in [2], which is the original version of this work. This change in the definition of TF is necessary because here we consider directional antennas. With this new capability, a malicious prover can use directional communication with the verifier and the helper such that, although the TF succeeds, the leaked information does not allow a response generator to be constructed. Using the original approach, and removing the contribution of the verifier's view, allows us to define TF security.

In Lemma 2 we show that if a DBID scheme is TF-resistant (Property 4), using a directional antenna (as in Fig. 1) will not affect its security. We only provide an informal proof, because a formal proof needs to formalize the properties of directional antennas.

Lemma 2. If a DBID scheme is TF-resistant (Property 4), it is directional TF-resistant.

Proof. The main observation is that in a TF attack (Property 4), all close-by participants, except the verifier, are controlled by the adversary. So, using a directional antenna to communicate with close-by participants such that the verifier is excluded adds the transmitted message to the view of the adversary, and replacing the directional antenna with an omni-directional one does not change this view. The messages that are sent to the verifier using a directional antenna will not be included in the impersonation adversary's view, i.e., View_S^Γ. Using Property 4, if there is a successful TF attack against a DBID scheme, the TF-resistance property guarantees the existence of an impersonation attacker with non-negligible success probability that takes View_S^Γ as input. Since the view of the actors in a directional TF attack will include this view, in a TF-resistant DBID scheme a successful directional TF attack implies a future impersonation attack.
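As an aside, the timing check that underlies every fast challenge-response round can be sketched as follows. This is our illustrative Python code, not an algorithm from the paper, and the numeric round-trip times are made-up examples; it shows why a measured round-trip time upper-bounds the responder's distance from the verifier.

```python
SPEED_OF_LIGHT = 299_792_458.0  # signals travel at speed L

def distance_upper_bound(rtt_seconds, processing_time=0.0):
    """A responder cannot be farther than (rtt - processing) * L / 2."""
    return (rtt_seconds - processing_time) * SPEED_OF_LIGHT / 2.0

def accept_round(rtt_seconds, bound_D):
    """Verifier's timing check: accept only if the RTT is consistent with D."""
    return distance_upper_bound(rtt_seconds) <= bound_D

# With D = 100 m, any RTT above ~667 ns must be rejected.
print(accept_round(600e-9, 100.0))   # True  (~90 m upper bound)
print(accept_round(1000e-9, 100.0))  # False (~150 m upper bound)
```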
5 ProProx Scheme [25]
The ProProx scheme is a public key DB protocol [25] that fits our DBID model; we also prove the security of the protocol in this model. The details of the operations of ProProx are given below. Let λ and n be the security parameters, which are linearly related.

(msk, gpk) ← Init(1^λ). The group manager initializes a Goldwasser-Micali cryptosystem with λ-bit security: it chooses N = p·q and chooses θ, a quadratic non-residue modulo N. It also chooses b ∈ {0, 1}^n with Hamming weight n/2. The group master key is msk = (p, q); the group public key is gpk = (N, b, θ, Ξ), where Ξ = ∅.

(sk, pk) ← KeyGen(msk, gpk). Assume l − 1 users have joined the group and their public keys are in the set Ξ = {pk1, . . . , pk_{l−1}} that is published by the group manager. For the l-th user, the group manager generates a key pair (sk, pk), such that sk ∈R {0, 1}^λ and pk is the output of a homomorphic and deterministic commitment scheme ComH() on sk = (sk1 . . . skλ); that is, pk = ComH(sk) = (Com(sk1; H(sk, 1)), . . . , Com(skλ; H(sk, λ))), where Com(u; v) is the Goldwasser-Micali encryption (= θ^u v² (mod N)) and H is a one-way hash function. The group manager securely sends the key pair to the new user and adds the public key pk to the set Ξ.

accept/reject ← Π{P(sk, pk) ↔ V(pk)}. When a prover (Pl) of a registered user wants to run the DBID.Π protocol with the verifier, they follow the protocol described in Fig. 3. In the verification phase, the prover and the verifier agree on a list I = (I1, . . . , Iλ), where each Ij consists of τ·n indices from 1 to n. Both parties believe that ∀j ∈ {1 . . . λ}, i ∈ Ij : c′i,j = ci,j and r′i,j = ri,j. The verifier then checks whether the responses are within the required time interval. The prover and the verifier then run an interactive zero-knowledge proof (ZKP) to show that the responses ri,j, j ∈ {1 . . . λ}, i ∈ Ij, are consistent with the corresponding Ai,j's and yj's. If the verification fails, the verifier aborts and outputs OutV = 0; otherwise, it outputs OutV = 1.

(gpk′) ← Revoke(msk, gpk, i). The group manager removes the i-th public key from the set Ξ, i.e., Ξ := Ξ \ {pki}.
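A toy Python sketch of the Goldwasser-Micali commitment Com(u; v) = θ^u v² (mod N) may help here; the tiny modulus and the particular θ are our own illustrative (and insecure) choices, and the snippet only checks the homomorphic property relied upon later in the analysis.

```python
# Toy parameters, NOT secure: N = p*q with msk = (p, q) = (7, 11).
N = 7 * 11
THETA = 6  # assumed quadratic non-residue mod 77 with Jacobi symbol +1

def com(b, rho):
    """Goldwasser-Micali commitment/encryption Com(b; rho) = theta^b * rho^2 mod N."""
    return (pow(THETA, b, N) * pow(rho, 2, N)) % N

# Homomorphic property (Definition 6): Com(b;r) * Com(b';r') = Com(b + b'; r*r').
b1, r1, b2, r2 = 1, 5, 1, 9
lhs = (com(b1, r1) * com(b2, r2)) % N
rhs = com(b1 + b2, (r1 * r2) % N)
print(lhs == rhs)  # True
```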
5.1 Security Analysis
To prove that the above protocol satisfies our security definition, we first note that the Π protocol of the ProProx scheme (i.e., Fig. 3) can be seen as a Σ∗-protocol (Definition 1). This is true because, assuming the agreement step (on the value of I) and the ZKP step can be written as Σ∗-protocols, their concatenation is also a Σ∗-protocol: one can consider all the first-message commitments of the protocols as a single commitment phase, and all the verification functions stay at the end. The remaining challenge and response messages are concatenated and form the challenges and responses of the combined protocol.
Π protocol of ProProx between P (secret: sk; public: gpk) and V (public: pk, gpk):

Commitment (slow phase). For j ∈ 1 . . . λ in parallel, P picks ai,j ∈R Z2 and ρi,j ∈R Z*N for i = 1, . . . , n, computes Ai,j = Com(ai,j; ρi,j), and sends A1,j, . . . , An,j to V.

Challenge/Response (fast phase). For j = 1 · · · λ and i = 1 · · · n: V picks ci,j ∈R Z2, starts timeri,j and sends ci,j; P receives c′i,j and replies with r′i,j = ai,j + c′i,j bi + c′i,j skj; V receives ri,j and stops timeri,j.

Verification (slow phase). The parties agree on I = (I1, . . . , Iλ), where ∀j ∈ {1...λ} : Ij ⊂ {1, ..., n} and |Ij| = τ·n. V checks |Ij| = τ·n and timeri,j ≤ 2B for j = 1, . . . , λ. P computes vj = H(sk, j) and αi,j = ρi,j vj^{ci,j}; for all j ∈ {1, . . . , λ}, i ∈ Ij, both parties compute zi,j = Ai,j (θ^{bi} yj)^{ci,j} θ^{−ri,j} and run ZKP(αi,j : zi,j = αi,j²). V outputs OutV.
Fig. 3. Π protocol of the ProProx scheme. Com is the Goldwasser-Micali encryption. τ is the minimum threshold ratio of noiseless fast rounds. ZKP is an interactive zero-knowledge proof. The number of fast rounds is n·λ. In each fast round, the verifier sends a one-bit challenge and receives the corresponding response.
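The arithmetic of a single fast round can be sketched as follows; this is our illustrative Python with hypothetical inputs, and it only mirrors the response rule r′i,j = ai,j + c′i,j bi + c′i,j skj from the figure (computed over the integers), not the commitments, the timing check, or the ZKP.

```python
import secrets

def prover_response(a, b_i, sk_j, c):
    # r' = a + c*b_i + c*sk_j, computed over the integers (values in 0..3)
    return a + c * b_i + c * sk_j

def run_round(b_i, sk_j):
    a = secrets.randbelow(2)   # a_{i,j} in Z_2, committed in the slow phase
    c = secrets.randbelow(2)   # verifier's random one-bit challenge
    r = prover_response(a, b_i, sk_j, c)
    return a, c, r             # V later checks r via the ZKP and timer <= 2B

# Hypothetical public bit b_i = 1 and key bit sk_j = 0.
print(run_round(b_i=1, sk_j=0))
```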
Theorem 1. Assuming Com(u; v) is a perfectly binding and computationally hiding homomorphic bit commitment scheme (Definition 6), ComH() is a one-way function (Definition 7), and ZKP is a κ-sound (Definition 5) and ζ-zero-knowledge (Definition 8) authentication protocol for negligible values of κ and ζ, ProProx is a (τ, δ)-complete, μ-TF-resistant, γ-sound and α-DF-resistant DBID scheme for negligible values of δ, μ, γ and α, when n is linear in the security parameter λ and n·(1 − pnoise − ε) > n·τ ≥ n − (1/2 − ε/2)·(n/2) for some constant ε > 0.

ProProx is proven to be complete, DF-resistant and zero-knowledge (Definition 8) in [25]. Our definitions of these properties remain unchanged, so we only need to prove the TF-resistance and soundness properties of the ProProx scheme. We prove the TF-resistance of ProProx in Lemma 4, which uses the following lemma.
Lemma 3 (Extractor). Consider a DBID game Γ with a TF attack (Property 4) against the ProProx scheme. If there is a Π protocol in the game Γ in which the verifier returns OutV = 1 with non-negligible probability p, then there is a PPT extractor E that takes the view of all close-by participants except the verifier (View_S^Γ) as input and outputs sk′ = sk with probability p − μ, for a negligible value of μ. This holds assuming Com(u; v) is a perfectly binding, computationally hiding homomorphic bit commitment scheme (Definition 6) and ZKP is a κ-sound authentication protocol (Definition 5).

Note that the extractor of this lemma has a critical difference from the extractor considered in the security analysis of the ProProx scheme in the original paper [25]: the extractor of the original paper takes as input the view of all close-by participants including the verifier, while the input of the above extractor is the view of all close-by participants excluding the verifier. By excluding the view of the verifier of a TF attack from the view of the extractor, the close-by participants can extract the secret key of the prover even when the prover is using a directional antenna to communicate directly with the verifier.

In the security analysis of both of these extractors (i.e., our extractor and that of the original paper), it is assumed that a correct response ri,j is sent solely from a single close-by participant. However, there might be a case where the received message ri,j is the combination of a message sent from a far-away source and a message sent from a close-by source. The study of this case is left as future work.

Proof (Extractor). Let us assume there is a TF adversary A that succeeds in the Π protocol with non-negligible probability p, i.e., generates a transcript ξ = (A, [c], [r]) that is accepted by the verifier with probability p. We construct a PPT extractor algorithm E for the secret key. In a protocol Π from the game Γ, the sequence of all challenges [c] (slow and fast) is chosen randomly and broadcast by the honest verifier. We define [r] = [r]^fast || [r]^slow and [c] = [c]^fast || [c]^slow, where the superscripts show the type of the phase of the challenges.

Because of the perfect binding of the commitment (Definition 6), the value of the public key pk uniquely determines sk = Com^{−1}(pk), and the value of Ai,j uniquely determines ai,j = Com^{−1}(Ai,j). We emphasize that these values are not being calculated; we just mathematically define them based on the view of the verifier.

Let S be the event that for all j and i ∈ Ij, the verifier's checks ri,j = ai,j + ci,j bi + ci,j skj hold true. This can be verified by checking the success of ZKP for all the corresponding j and i ∈ Ij. In other words, when ZKP succeeds for all j and i ∈ Ij, we have zi,j as a commitment to ai,j + ci,j bi + ci,j skj − ri,j, which implies the occurrence of S. Since ZKP is κ-sound, we conclude that Pr[succ ZKP | ¬S] ≤ κ and then Pr[succ ZKP, ¬S] ≤ κ, where ¬S indicates the negation of S. Since a valid transcript requires ZKP to succeed, we have Pr[valid ξ, ¬S] ≤ κ.
During a fast-phase round of a successful Π protocol, the responses ([r]^fast) cannot depend on information received from a far-away prover during that round, as such information would be independent of the challenge. Moreover, for all j and i ∈ Ij, because of the computational hiding property of Com(ai,j; ρi,j) and Com(skj; H(sk, j)), the view of the verifier before receiving ri,j is computationally independent of ri,j and skj, and so the verifier has no information about ai,j and skj. Thus the view of the close-by participants right before sending the response ri,j, denoted View_S^Γ(¬ri,j), is the only information available for constructing a valid fast-phase response ri,j within the time bound. We conclude that there is an algorithm J_fast that takes View_S^Γ(¬ri,j) and generates a correct response for any challenge ci,j. Note that View_S^Γ(¬ri,j) includes the challenge ci,j.

We consider the view of the close-by participants before sending the response ri,j, i.e., View_S^Γ(¬ri,j), relative to the view of the close-by participants before seeing the challenge ci,j, i.e., View_S^Γ(¬ci,j). In the time period between receiving the challenge ci,j and sending ri,j, the close-by participants can receive messages from two different sources: the verifier and the far-away participants. The only message from the verifier in this period is ci,j, and we indicate the messages from the far-away participants as Msg_{F→S}(¬ci,j), which are independent of ci,j. So we have View_S^Γ(¬ri,j) = View_S^Γ(¬ci,j) || ci,j || Msg_{F→S}(¬ci,j).

Since there is an algorithm J_fast such that resp_{i,j} = J_fast(View_S^Γ(¬ci,j) || ci,j || Msg_{F→S}(¬ci,j)), we construct an algorithm J that calls J_fast twice with the following inputs: resp0_{i,j} = J_fast(View_S^Γ(¬ci,j) || 0 || Msg_{F→S}(¬ci,j)) and resp1_{i,j} = J_fast(View_S^Γ(¬ci,j) || 1 || Msg_{F→S}(¬ci,j)). As a result, J returns a pair (resp0_{i,j}, resp1_{i,j}) that is the answer for the two possible cases of the challenge bit ci,j. From the final view of the close-by participants, we can find their partial view at the time of sending the response ri,j for all j = 1 . . . λ and i = 1 . . . n, and then call the algorithm J. So we can calculate (resp0_{i,j}, resp1_{i,j}) for all j = 1 . . . λ and i = 1 . . . n from the final view of the close-by participants (View_S^Γ).

We build the extractor E as follows: E runs (resp0_{i,j}, resp1_{i,j}) ← J(View_S^Γ) for all j = 1 . . . λ and i = 1 . . . n. Then it guesses the bits of the secret as sk′j = majority(δ1,j . . . δn,j) for j ∈ {1 . . . λ}, where δi,j = resp1_{i,j} − resp0_{i,j} − bi for i ∈ {1 . . . n}. In the following we calculate the success probability of this extractor.

A response respd_{i,j} for d ∈ {0, 1} is correct if respd_{i,j} = ai,j + d·bi + d·skj. We define Rj = [R1,j . . . Rn,j], where Ri,j is the number of challenge bits d ∈ {0, 1} for which the response respd_{i,j} is correct. If Ri,j = 2 then δi,j = skj, but if Ri,j = 1 then we might have δi,j ≠ skj. Let R be the set of all vectors X = [X1 . . . Xn] ∈ {0, 1, 2}^n that have at least n/2 + 1 entries equal to 2. If Rj ∈ R, then we have a majority of i's with δi,j = skj, which implies sk′j = skj. Consider R = (R1, . . . , Rλ) to be the vectors where Ri,j is defined as above and calculated by comparing (resp0_{i,j}, resp1_{i,j}) with the correct responses for all j = 1 . . . λ and i = 1 . . . n.
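The core of the extractor E—taking both answers per round and majority-voting δi,j = resp1_{i,j} − resp0_{i,j} − bi—can be sketched in a few lines of Python; the toy values below are our own and assume all responses are correct.

```python
from statistics import mode  # majority over a list of candidate bits

def extract_key_bit(resp_pairs, b):
    """Lemma 3's guess: sk'_j = majority(delta_1 ... delta_n),
    with delta_i = resp1_i - resp0_i - b_i."""
    deltas = [r1 - r0 - b_i for (r0, r1), b_i in zip(resp_pairs, b)]
    return mode(deltas)

# Toy run: honest responses r = a + c*b_i + c*sk_j with sk_j = 1.
sk_j, b = 1, [1, 0, 1, 0, 1]
a = [0, 1, 1, 0, 0]
pairs = [(ai, ai + bi + sk_j) for ai, bi in zip(a, b)]  # (resp0, resp1) per round
print(extract_key_bit(pairs, b))  # 1
```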
Since the verifier selects the challenges randomly, knowing Ri,j allows us to find the probability that the response ri,j = resp_{i,j} is correct: if Ri,j = 2 then this probability is 1, otherwise it is at most 1/2 (a response can always be guessed randomly). If W is the random variable giving the number of Rj's such that Rj ∉ R, we have Pr[S | W = w] ≤ pB^w, where pB = Tail(n/2, τ·n − n/2, 1/2), as defined in Lemma 6. So Pr[S, W = w] ≤ pB^w Pr[W = w] and then Pr[S, W ≥ w] ≤ pB^w. As a result, we have:

Pr[W ≥ w, valid ξ] ≤ Pr[¬S, valid ξ] + Pr[S, W ≥ w] ≤ κ + pB^w

Each index j where sk′j ≠ skj corresponds to Rj ∉ R. Therefore, the event that the verifier outputs OutV = 1 (i.e., ξ is valid) while the extractor makes at least one error occurs with probability bounded by μ = κ + pB. This implies that we can build a key extractor from View_S^Γ that matches the success chance of the TF attack (i.e., p), except with probability μ, which is negligible because of the Chernoff bound (Lemma 6).

Lemma 4 (TF-resistance). ProProx is a μ-TF-resistant DBID scheme (Property 4) for a negligible value of μ, assuming Com is a perfectly binding, computationally hiding homomorphic bit commitment (Definition 6) and ZKP is a κ-sound authentication protocol (Definition 5).

Proof. According to the TF-resistance definition (Property 4), we need to show that if there is a game Γ with a Π protocol in which the verifier returns OutV = 1 with non-negligible probability κ, then there exists a close-by actor R that, for any challenge sequence [c], can create a valid transcript with probability at least κ − μ for a negligible μ, using the view of all close-by participants excluding the verifier (View_S^Γ).

Based on Lemma 3, the existence of a TF attacker with non-negligible success probability κ implies the existence of a key extractor sk′ ← E(View_S^Γ) with a success chance of at least κ − μ′ for a negligible μ′. So we build R as follows: after a successful TF attack, R calls the above extractor E to find sk′. Then R runs the ProProx.Π.P(sk′, gpk) interactive algorithm in order to generate a valid transcript with correct timing for any challenge that is generated by the verifier. Since ProProx is (τ, δ)-complete, the verifier outputs OutV = 1, except with negligible probability δ. Therefore, the success chance of R is at least κ − μ for a negligible μ = μ′ + δ.

Lemma 5 (Soundness). ProProx is a γ-sound DBID scheme (Property 2) for γ = negl(λ), if the following hold: τ·n ≥ n − (1/2 − ε/2)·(n/2) for some constant ε; ComH is one-way; Com is a homomorphic bit commitment with all the properties of Definition 6; and ZKP is a κ-sound (Definition 5) and ζ-zero-knowledge (Definition 8) authentication protocol for negligible κ and ζ.
are far-away from the verifier, (ii) there is no active prover during the Π protocol (i.e., there might be close-by provers but they are not active). In the following we show that the success probability of the adversary in both cases is negligible. In other words, the success probability of generating a valid transcript ξ = (A, [c], [r]) when the challenge sequence [c] is generated by the verifier, is negligible. In the first case, the adversary cannot simply relay the messages because of the extra delay and the fact that the responses are from out of bound locations. In this case the verifier will reject the instance. If there is a PPT adversary A that can guess at least τ.n out of n responses for each key bit with nonnegligible probability (i.e. guessing all bits of ∀j ∈ {1, . . . , λ}Ij ⊂ {1, . . . , n} such that |Ij | ≥ τ.n), then they can find the response table for at least τ.n elements for each j ∈ {1, . . . , λ} with the same probability. So for τ.n out of r −r ¯ −(ci,j −ci,j ¯ )bi with probability n values of i they can find correct skj = i,j i,j ci,j −ci,j ¯ ≥ poly(λ). Therefore by taking the majority, they can find the correct key bit with probability ≥ 1 − (1 − poly(λ))τ.n . Thus if the adversary succeeds in the first case with non-negligible probability, then they can find the secret key with considerably higher probability than random guessing and this contradicts the zero-knowledge property of ProProx. Therefore, the adversary’s success chance will be negligible in this case. In the second case, the adversary succeeds in the protocol by providing the correct response to V for at least τ.n correct queries out of n fast rounds for all key bits. We noted that the learning phase of the adversary cannot provide information about the secret key ({skj }λj=1 ) or the committed values ({ai,j }j={1...λ},i={1...n} ) as otherwise the zero-knowledge property of the protocol, or the commitment scheme will be violated, respectively. In order to succeed in the protocol with non-negligible probability, the adversary must succeed in ZKP, for at least τ.n values of i, so they need to find at least τ.n valid tuples πi = (X, Y, Z) for random challenge bits such that Z 2 = X(θbi yj )ci,j θ−Y without having information about skj . For π = [πi ] with size at least τ.n and [c], Pr[π is valid|[c] is random] = γ i=1 Pr[πi is valid|[c] is random]. So if Pr[π is valid |[c] is random] ≥ negl, then there is a value of i that Pr[πi is valid |[c] is random] ≥ 12 + poly(λ). Since X is sent to the verifier before seeing ci,j , therefore we have Pr[valid (X, Y, Z)|ci,j = 0] ≥ 12 + poly(λ) and also Pr[valid (X, Y , Z )|ci,j = 1] ≥ 12 + poly(λ). Since both tuples are valid, then we have Z 2 = Xθ−Y and Z 2 = X(θbi yj )θ−Y . Therefore we have the following for pkj = θskj (vj ); (
Z 2 ) = pkj θbi −Y +Y = θskj +bi −Y +Y (vj )2 Z
Therefore, the adversary can conclude skj +bi −Y +Y ∈ / {1, 3} for the known bits bi , Y and Y . So they gain some information about skj , which is in contradiction with zero-knowledge property of ProProx.
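The key-leakage step used in both cases can be illustrated with a minimal Python sketch (ours): two correct responses for the two challenge values of the same round, on the same committed bit a, reveal the key bit, since r1 − r0 = b_i + sk_j.

```python
def recover_key_bit(r0, r1, b_i):
    """sk_j = (r1 - r0 - (1 - 0)*b_i) / (1 - 0), the formula from the proof
    specialized to challenges c = 1 and c_bar = 0."""
    return r1 - r0 - b_i

a, b_i, sk_j = 1, 0, 1
r0 = a                   # honest response to challenge c = 0
r1 = a + b_i + sk_j      # honest response to challenge c = 1
print(recover_key_bit(r0, r1, b_i))  # 1
```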
6 Related Works
The main models and constructions of public key DB protocols are in [3,14,19,25]. In the following, we discuss and contrast the security models of these works in order to put our new work in context.

[19] presented an informal model for Distance-Fraud, Mafia-Fraud and Impersonation attacks and provided a protocol that is secure according to the model. [3] formally defined Distance-Fraud, Mafia-Fraud, Impersonation, Terrorist-Fraud and Distance-Hijacking attacks. Their Distance-Fraud adversary has a learning phase before the attack session and is therefore stronger than the definition in A2; during the learning phase, the adversary has access to the communications of the honest provers that are close by. The security proofs of the proposed protocol have been deferred to the full version, which is not available yet.

[14] uses an informal model that captures Distance-Fraud, Mafia-Fraud, Impersonation, Terrorist-Fraud, Distance-Hijacking and a special type of attack called Slow-Impersonation [12]. In their model, the definition of Terrorist-Fraud is slightly different from A4: a TF attack is successful if it allows the adversary to succeed in future Mafia-Fraud attacks.

For the first time in the distance-bounding literature, [12] considered the normal MiM attack scenario where both the honest prover and the adversary are close to the verifier. The adversary interacts with the prover in order to succeed in a separate protocol session with the verifier, and has to change some of the received messages in the slow phases of the protocol in order to be considered successful. The attack is called Slow-Impersonation and is inspired by the basic MiM attack on authentication protocols. In Slow-Impersonation, a close-by MiM actor that communicates with both the verifier and a close-by prover tries to succeed in the protocol: during the slow phase, the actor modifies the messages received from one party and then sends them to the other party. Although the basic MiM attack is appropriate for DB models, success may not be strictly definable within one phase of the protocol, as the adversary's actions in one phase could influence or be influenced by other phases. A MiM adversary may, during the learning phase, only relay the slow-phase messages but, by manipulating the messages of the fast phase, learn key information and later succeed in impersonation. According to the definitions in [12,14], such a protocol is secure against Slow-Impersonation; however, it is not secure against Strong-Impersonation (A3). This scenario shows that Slow-Impersonation does not necessarily capture Impersonation attacks in general. Moreover, it is hard to distinguish success in the slow phases of a protocol without considering the fast phase, as those phases have mutual influence on each other.

As an alternative definition, [2] proposed Strong-Impersonation (A3), in which the MiM adversary has an active learning phase that allows them to change the messages. Strong-Impersonation captures the MiM attack without the need to define success in the slow rounds. One of the incentives for Strong-Impersonation is capturing the case when the prover is close to the verifier but is not participating in any instance of the protocol. In this case, any acceptance by
the verifier means that the adversary has succeeded in impersonating an inactive prover.

In [25] an elegant formal model for public key distance-bounding protocols, in terms of proofs of proximity of knowledge, has been proposed. The model captures Distance-Fraud, Distance-Hijacking, Mafia-Fraud, Impersonation and Terrorist-Fraud. In this approach, a public key DB protocol is a special type of proof of knowledge (of proximity of knowledge): a protocol is considered sound if the acceptance of the verifier implies the existence of an extractor algorithm that takes the view of all close-by participants and returns the prover's private key. This captures security against Terrorist-Fraud, where a dishonest far-away prover must succeed without sharing their key with the close-by helper. According to the soundness definition in [25], however, if the adversary succeeds while there is an inactive close-by prover, the protocol is sound because the verifier accepts and there is an extractor for the key, simply because there is an inactive close-by prover and their secret key is part of the extractor's view. The existence of an extractor is a demanding requirement for the success of attacks against authentication: obviously, an adversary who can extract the key will succeed in the protocol, but it is possible to have an adversary who succeeds without extracting the key, by providing the required responses to the verifier. Our goal in introducing an identification-based model is to capture this weaker requirement of success in authentication, while providing a model that includes realistic attacks against DB protocols.
7 Concluding Remarks
This paper is a revised and extended version of [2], which proposed a new formal model (DBID) for distance-bounding protocols, in line with cryptographic identification protocols, that captures and strengthens the main attacks on public key distance-bounding protocols. This approach effectively includes physical distance as an additional attribute of the prover in an identification protocol. In this paper we assume a stronger adversary that has access to a directional antenna, and we showed that this additional capability can break the security of protocols that had been proven secure. To include this capability of the adversary, we needed to revise the definition of TF in [2], which resulted in a new security proof for the ProProx protocol; other parts of the model and the security definitions remain unchanged. Our future work includes designing more efficient DBID protocols, and extending the model to include the anonymity of the prover against the verifier.
Appendix

Definition 5 (Authentication). An authentication protocol is an interactive pair of protocols (P(ζ), V(z)) of PPT algorithms operating on a language L and a relation R = {(z, ζ) : z ∈ L, ζ ∈ W(z)}, where W(z) is the set of all witnesses for z that should be accepted in authentication. This protocol has the following properties:
– complete: ∀(z, ζ) ∈ R, we have Pr[OutV = 1 : P(ζ) ↔ V(z)] = 1.
– κ-sound: Pr[OutV = 1 : P∗ ↔ V(z)] ≤ κ in any of the following two cases: (i) z ∉ L, (ii) z ∈ L while the algorithm P∗ is independent of any ζ ∈ W(z). In particular, for any pair of PPT algorithms (A1, A2), where A1 first interacts with P(ζ) and A2 then runs on A1's view, Pr[OutV = 1 : A2(View_{A1}) ↔ V(z)] ≤ negl.

Definition 6 (Homomorphic Bit Commitment). A homomorphic bit commitment function is a PPT algorithm Com operating on a multiplicative group G with parameter λ, which takes b ∈ Z2 and ρ ∈ G as input and returns Com(b; ρ) ∈ G. This function has the following properties:

– homomorphic: ∀b, b′ ∈ Z2 and ∀ρ, ρ′ ∈ G, we have Com(b; ρ)·Com(b′; ρ′) = Com(b + b′; ρρ′).
– perfect binding: ∀b, b′ ∈ Z2 and ∀ρ, ρ′ ∈ G, the equality Com(b; ρ) = Com(b′; ρ′) implies b = b′.
– computational hiding: for a random ρ ∈R G, the distributions Com(0, ρ) and Com(1, ρ) are computationally indistinguishable.

Definition 7 (One-way Function). With λ as the security parameter, an efficiently computable function OUT ← FUNC(IN) is one-way if there is no PPT algorithm that takes OUT as input and returns IN with non-negligible probability in terms of λ.

Definition 8 (Zero-Knowledge Protocol). A pair of protocols (P(α), V(z)) is ζ-zero-knowledge for P(α) if, for any PPT interactive machine V∗(z, aux), there is a PPT simulator S(z, aux) such that for any PPT distinguisher, any (α : z) ∈ L, and any aux ∈ {0, 1}∗, the distinguishing advantage between the final view of V∗ in the interaction P(α) ↔ V∗(z, aux) and the output of the simulator S(z, aux) is bounded by ζ.

Lemma 6 (Chernoff-Hoeffding Bound, [8,20]). For any (ε, n, τ, q), we have the following inequalities about the function Tail(n, τ, ρ) = Σ_{i=τ}^{n} C(n, i) ρ^i (1 − ρ)^{n−i}:
– if τ/n < q − ε, then Tail(n, τ, q) > 1 − e^{−2ε²n};
– if τ/n > q + ε, then Tail(n, τ, q) < e^{−2ε²n}.
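For concreteness, a direct Python implementation of the Tail function, together with a numerical check of the second inequality, is sketched below; the parameters n, q, ε are arbitrary illustrative choices.

```python
from math import comb, exp

def tail(n, tau, rho):
    """Tail(n, tau, rho) = sum_{i=tau}^{n} C(n, i) rho^i (1 - rho)^(n - i)."""
    return sum(comb(n, i) * rho**i * (1 - rho)**(n - i) for i in range(tau, n + 1))

# Chernoff-Hoeffding check: with tau/n > q + eps, Tail(n, tau, q) < e^{-2 eps^2 n}.
n, q, eps = 100, 0.5, 0.2
tau = int(n * (q + eps)) + 1   # ensures tau/n > q + eps
print(tail(n, tau, q) < exp(-2 * eps**2 * n))  # True
```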
References

1. Agiwal, M., Roy, A., Saxena, N.: Next generation 5G wireless networks: a comprehensive survey. IEEE Commun. Surv. Tutor. 18(3), 1617–1655 (2016)
2. Ahmadi, A., Safavi-Naini, R.: Distance-bounding identification. In: Proceedings of the 3rd International Conference on Information Systems Security and Privacy, ICISSP, INSTICC, vol. 1, pp. 202–212. SciTePress (2017)
3. Ahmadi, A., Safavi-Naini, R.: Privacy-preserving distance-bounding proof-of-knowledge. In: Hui, L.C.K., Qing, S.H., Shi, E., Yiu, S.M. (eds.) ICICS 2014. LNCS, vol. 8958, pp. 74–88. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-21966-0_6
4. Avoine, G., Bingöl, M.A., Kardaş, S., Lauradoux, C., Martin, B.: A framework for analyzing RFID distance bounding protocols. J. Comput. Secur. 19(2), 289–317 (2011)
5. Boureanu, I., Mitrokotsa, A., Vaudenay, S.: Secure and lightweight distance-bounding. In: Avoine, G., Kara, O. (eds.) LightSec 2013. LNCS, vol. 8162, pp. 97–113. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40392-7_8
6. Brands, S., Chaum, D.: Distance-bounding protocols. In: Helleseth, T. (ed.) EUROCRYPT 1993. LNCS, vol. 765, pp. 344–359. Springer, Heidelberg (1994). https://doi.org/10.1007/3-540-48285-7_30
7. Bussard, L., Bagga, W.: Distance-bounding proof of knowledge protocols to avoid terrorist fraud attacks. Technical report, Institut Eurecom, France (2004)
8. Chernoff, H.: A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Ann. Math. Stat. 23, 493–507 (1952)
9. Cremers, C., Rasmussen, K.B., Schmidt, B., Capkun, S.: Distance hijacking attacks on distance bounding protocols. In: Security and Privacy, pp. 113–127 (2012)
10. Damgård, I.: On Σ-protocols. Lecture Notes, University of Aarhus, Department for Computer Science (2002)
11. Desmedt, Y.: Major security problems with the "unforgeable" (Feige-)Fiat-Shamir proofs of identity and how to overcome them. In: Congress on Computer and Communication Security and Protection, Securicom 1988, pp. 147–159 (1988)
12. Dürholz, U., Fischlin, M., Kasper, M., Onete, C.: A formal approach to distance-bounding RFID protocols. In: Lai, X., Zhou, J., Li, H. (eds.) ISC 2011. LNCS, vol. 7001, pp. 47–62. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-24861-0_4
13. Francillon, A., Danev, B., Capkun, S.: Relay attacks on passive keyless entry and start systems in modern cars. In: NDSS (2011)
14. Gambs, S., Killijian, M.O., Lauradoux, C., Onete, C., Roy, M., Traoré, M.: VSSDB: a verifiable secret-sharing and distance-bounding protocol. In: International Conference on Cryptography and Information Security (BalkanCryptSec 2014) (2014)
15. Gambs, S., Onete, C., Robert, J.M.: Prover anonymous and deniable distance-bounding authentication. In: Proceedings of the 9th ACM Symposium on Information, Computer and Communications Security, pp. 501–506 (2014)
16. Gennaro, R.: Multi-trapdoor commitments and their applications to proofs of knowledge secure under concurrent man-in-the-middle attacks. In: Franklin, M. (ed.) CRYPTO 2004. LNCS, vol. 3152, pp. 220–236. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-28628-8_14
17. Guillou, L.C., Quisquater, J.-J.: A practical zero-knowledge protocol fitted to security microprocessor minimizing both transmission and memory. In: Barstow, D., et al. (eds.) EUROCRYPT 1988. LNCS, vol. 330, pp. 123–128. Springer, Heidelberg (1988). https://doi.org/10.1007/3-540-45961-8_11
18. Hermans, J., Pashalidis, A., Vercauteren, F., Preneel, B.: A new RFID privacy model. In: Atluri, V., Diaz, C. (eds.) ESORICS 2011. LNCS, vol. 6879, pp. 568–587. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-23822-2_31
19. Hermans, J., Peeters, R., Onete, C.: Efficient, secure, private distance bounding without key updates. In: Proceedings of the Sixth ACM Conference on Security and Privacy in Wireless and Mobile Networks, pp. 207–218. ACM (2013)
20. Hoeffding, W.: Probability inequalities for sums of bounded random variables. J. Am. Stat. Assoc. 58, 13–30 (1963)
21. Kurosawa, K., Heng, S.-H.: The power of identification schemes. In: Yung, M., Dodis, Y., Kiayias, A., Malkin, T. (eds.) PKC 2006. LNCS, vol. 3958, pp. 364–377. Springer, Heidelberg (2006). https://doi.org/10.1007/11745853_24
22. Rasmussen, K.B., Capkun, S.: Realization of RF distance bounding. In: USENIX Security Symposium, pp. 389–402 (2010)
23. Schnorr, C.P.: Efficient signature generation by smart cards. J. Cryptol. 4(3), 161–174 (1991)
24. Vaudenay, S.: On modeling terrorist frauds. In: Susilo, W., Reyhanitabar, R. (eds.) ProvSec 2013. LNCS, vol. 8209, pp. 1–20. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-41227-1_1
25. Vaudenay, S.: Proof of proximity of knowledge. IACR Eprint 695 (2014)
An Information Security Management for Socio-Technical Analysis of System Security

Jean-Louis Huynen(B) and Gabriele Lenzini

Interdisciplinary Centre for Security, Reliability and Trust (SnT), University of Luxembourg, Esch-sur-Alzette, Luxembourg
{jean-louis.huynen,gabriele.lenzini}@uni.lu

This research received the support of the Fond National de la Recherche (FNR), project Socio-Technical Analysis of Security and Trust (STAST) I2R-APS-PFN11STAS, and of the pEp Security SA/SnT partnership project "Protocols for Privacy Security Analysis".
Abstract. Concerned about the technical and social aspects at the root causes of security incidents, and about how they can hide security vulnerabilities, we propose a methodology compatible with the Information Security Management life-cycle. Retrospectively, it supports analysts in reasoning about the socio-technical causes of observed incidents; prospectively, it helps designers account for human factors and remove potential socio-technical vulnerabilities from a system's design. The methodology, called S·CREAM, stems from practices in safety, but because of key differences between the two disciplines, migrating concepts, techniques, and tools from safety to security requires a complete re-thinking. S·CREAM is supported by a tool, which we implemented; when available online, it will assist security analysts and designers in their tasks. Using S·CREAM, we discuss potential socio-technical issues in the Yubikey two-factor authentication device.
Keywords: Socio-technical security · Information Security Management · Reasoning · Root Cause Analysis
1 Introduction
It is nowadays recognised that the role of humans is paramount in security and that failing to understand the holistic complexity of human interactions with security technology leads to the development of systems that are exposed to hacking. There is no rhetoric in this warning. White papers like the Verizon Data Breach Investigation Report, which since 2009 studies security breaches and provides analytical insight about their nature and origins, wrote in 2015 that in the about 80,000 reported incidents, "people is the common denominator across the top four patterns that accounts for nearly 90% of all incidents" [32].
Fig. 1. The PDCA process in the ISO 27001:2005 with the OODA loop for incident response proposed by Schneier (reprinted from [19]).
In the most recent report, dated 2017 [33], the same organization found that in all the documented data security incidents "social attacks were utilized in 43% of all breaches" and that "phishing is the most common social tactic in our dataset (93% of social incidents)". Further, they say that "81% of hacking-related breaches leveraged either stolen and/or weak passwords", which clearly indicates a problem in how passwords are protected and so, at least in part, in how people choose their passwords and keep them safe.

Too often, human errors in security are seen as inevitable, the end of the causative chain. In the Annual Incident Reports 2015 by ENISA, published in September 2016, human errors are pointed out to be the 'root cause category involving most users affected, around 2.6 million user connections on average per incident' [11]. The conclusion that humans are at the root of all such incidents is worrisome but not helpful: it does not suggest how to improve security without removing the humans. In the absence of a comprehensive understanding of the reasons why systems and processes allow humans to be induced into security-critical errors through which the system is being attacked, the problem of reducing human-caused insecurity incidents remains.

We are not claiming that organizations are unaware of, or that they underestimate, what we call socio-technical security risks. Companies invest in Information Security Management to keep attacks under control. They assess risks regularly, and implement measures that they deem appropriate. But in contexts where human behavior is significant for security, predicting the effect of mitigations merely by reasoning on technical grounds may result in a false sense of security. For instance, a policy that regularly forces users to change passwords, supposedly to protect the system from password theft, may instead nudge users to start using easy-to-remember pass-phrases [2] which are also easy to guess. Thus, the policy may end up strengthening the threat, since now adversaries can also hack into the system by making good password-guessing or dictionary attacks. And even when a policy does not increase the threat and looks successful, in the absence of a holistic and socio-technical approach to security it remains unclear whether it is the policy that worked or whether it was instead the result of 'shadow security' [23], a phenomenon where security increases thanks to the spontaneous behavior of people acting in opposition to an otherwise ineffective policy.
Taking a holistic socio-technical approach to security requires a change in the way we look at systems and at the reasons for their security or insecurity. It requires answering questions like the following: "What reasoning about the interplay between humans and technology can one resort to in order to unveil a system's socio-technical weak points?", and "What remediation, acting on the social and on the technical sides, can one apply to effectively strengthen security?" To address these questions one has to understand human users and untangle the factors that an attacker who has hacked into the system can manipulate to exert influence on people's behaviours. Ultimately, these factors should be included in security best practices and mitigation strategies.

In other disciplines this seems not to be such a revolutionary thought. In safety it is common to take a "socio-technical" approach in incident analysis (although it is not called so) while searching for the events at the root of an incident. Here, analysts adopt a waterfall view of the universe [3]: they follow top-down approaches to find out the causes of unwanted events, including those that involve human errors, or bottom-up approaches to foresee the negative consequences of a design choice on people, and they set priorities and optimise the effort of defining and applying remedies based on the identification of the different factors that might have fostered adverse outcomes.

So, why cannot security take inspiration from safety and re-use those well-experimented practices and methods? In principle the migration seems possible and with little effort but, as we explain in Sect. 4, applying safety's methods requires a complete re-thinking of the goals, the notion of incident, the place and the role of the adversary (an element which is not present in system safety), and the way in which knowledge can be reused across different situations to avoid further incidents.

Suspending for a moment the discussion about the character and the difficulties of this adaptation, we believe that such a migration is possible, and this paper discusses and implements a way to do it. We refocus the study of security by centering on the impact that users have on a system, and we propose our methodology as part of the Information Security Management life cycle.

Previous Work. This paper leverages two works of ours. It re-elaborates, extends, and discusses, in the context of Information Security Management, the ideas on how to retrospectively analyse the socio-technical root causes of a security incident that we presented in [12]. It also extends [19] by providing a full overview of the implementation of the methodology, detailing what in the original paper we omitted for reasons of space.
2 Information Security Management
In order to structure an Information Security Management effort, assess information security risks, and formulate security policies, ISO 27001:2005 [1] recommends that organizations follow a four-step cycle called Plan, Do, Check, Act (PDCA) [20]:
– Plan: list assets, define threats, derive the risks posed by the threats on the assets, and define the appropriate controls and security policies needed to mitigate the risks;
– Do: conduct business operations while implementing the security controls and enforcing the security policies defined in 'Plan';
– Check: check that the Information Security Management system is effectively mitigating the risks without impeding the business operations;
– Act: correct the defects identified in 'Check'.

In the PDCA cycle, it is in the 'Do' phase that the organization operates. This is also the phase where attacks can occur and, subsequently, where Incident Response fires up. Schneier, in [28], argues that Incident Response can itself be seen as a loop, called Observe, Orient, Decide, Act (OODA) [8], in which organizations engage in four steps: (i) Observe: observe what happens on the corporate networks; this is a data-collection phase in which, for instance, Security Information and Event Management (SIEM) systems are used; (ii) Orient: make sense of the observed data within the context of operations, also by processing the pieces of information gathered from previous attacks and from threat intelligence feeds; (iii) Decide: decide which action should be carried out to defend against an attack, at the best of the organization's current knowledge; (iv) Act: perform the action decided earlier to thwart the attack. Figure 1 shows the PDCA cycle with the OODA loop in its 'Do' step.

Organizations engaged in such a process implement two types of analyses: a prospective analysis in the 'Plan' step, to predict how their security measures are expected to protect their assets; and, usually after an attack, a retrospective analysis in the 'Check' step, to identify the reasons why their Information Security Management has failed.

Shortcomings of Current Information Security Management Processes. The existence of methodologies and tools to assess risks (e.g., OCTAVE Allegro [9]) helps companies implement a PDCA process as described, but there is no method that helps them foresee what impact planned mitigations will have on the system's actual security. Indeed, introducing a security control to contain a risk (e.g., in the 'Act' step) may introduce new socio-technical vulnerabilities, because of unforeseen interactions between the newly introduced controls and the organization's employees, business processes, and existing security measures. Additionally, in the case of a security breach, there is a lack of methods (in the 'Check' step) encompassing the human-related aspects of security that could be used to identify the root cause of an Information Security Management system's failure.

The operational phase (i.e., the 'Do' step) is also not tuned to consider socio-technical aspects of security. Indeed, while organizations use Security Information and Event Management systems in their Security Operations Centers to monitor their networks, they do not really consider the human-related aspects of their operations. For instance, when a company is hit by a 'fake president' scam (a fraud consisting of convincing an employee of a company to make an emergency bank transfer to a third party in order to obey an alleged order of a leader, i.e., the president, under a false pretext), the organization can defend itself by
focusing on the technical aspects of the attack (e.g., by blocking connections to a range of IPs). However, the social aspects of the attack (i.e., that people do not recognise the phishing and fall for it) remain, while the attacker can easily adapt to the additional security controls and persist in using the same social engineering tricks, i.e., phishing people, but in a different kind of scam. Security practitioners mainly rely on user awareness campaigns to cover the social side of this family of attacks, but we claim that identifying the reasons why employees fall for the scams would be more effective. Indeed, it could be that the phishing campaign exploits loopholes in the interactions between the organization's security policies and the organization's business processes (hence the employees' primary goals). Clearly identifying these reasons could help propose additional remedies that focus not on the technical but on the social aspects of an attack. These new remedies ought to cause the attacker the hassle of adapting his behaviour and his attack, instead of only fiddling with technical details. We believe that this strategy could prove beneficial to the defense, as it forces the attacker to climb up the 'pyramid of pain' [5].

Information Security Management also suffers from the lack of ways to learn from past failures when applying remediation strategies. Indeed, without a proper way to determine the root causes of past failures, organizations are doomed to repeat ill-considered decisions and to lose energy correcting them. Thus, we believe that security practitioners should have access to additional methods that help them ponder the consequences of their security choices, the technical and the social alike, to prevent the recurrence of security failures.
3
Related Work
Among the numerous works on Root Cause Analysis (RCA) in safety incidents, we comment only on those related to the information security domain. Those we have found in the literature use RCA in specific situations: the process they follow is not flexible enough to work in other, more general contexts. For instance, Coroneo et al. [10] implement a strategy for the automated identification of the root causes of security alerts, but what they offer is a fast algorithm for timely response to intrusions and attacks, which is not applicable to the analysis of general security incidents. Another example of a highly specific application is a work that customises the search for root causes in software failures [22]: it supports software testing with automated debugging that provides developers with an explanation of a failure that occurred in production but, still, this is not a solution that can be generalised to other security incidents. A completely different category of works, which, again, we represent here by picking a single work in the class, i.e., [29], is the one proposing more efficient RCA reasoning algorithms. Works in this class do not relate to understanding socio-technical causes; rather, they focus on improving the performance of existing methodologies, not on extending them to work in security. Our search for related work thus seems more insightful not in what it has found but rather in what it has not found, that is, in the lack of works
attempting to migrate well-established RCA methodologies into security with the aim of helping security practitioners as widely as possible. At the time of writing (October 2016), we have found no significant related article that addresses this problem, nor have we found RCA applied in security when humans are involved. Of course, there is a plethora of research stressing the need to keep the human in the security loop (see e.g., [4,26]), and there are plenty of works in the fields of socio-technical security, usable security, human factors in security, and similar topics. For reasons of space, this list is too long to be considered here but, to the best of our understanding, we were not able to find in those works methodologies that could help security practitioners in an RCA of socio-technical security incidents.
4
RCA in Safety and Security
In safety, Root Cause Analysis (RCA) provides insights into an incident and helps identify the sequence of events that caused it. It was introduced in safety as a response to disasters, e.g., nuclear accidents, whose cost in human lives and environmental damage was so high that avoiding their repetition was mandatory, justifying a thorough investigation of all the possible causes, including those concerning 'human errors'. RCA usually follows a four-step process: (i) Data collection and investigation, where factual information about an incident is collected and investigated; the outcome of this phase is a description of the incident and should be objective, independent of any analyst's bias. (ii) Retrospective Analysis, where the causes of the incident are determined; depending on the RCA method used, this phase can depend more or less on the analyst's knowledge and experience; the outcome of this phase is a set of root causes without which the incident could not have occurred. (iii) Recommendations generation, where a set of recommendations to prevent the analysed event from recurring is produced; whatever the method, this phase is analyst-dependent, because he needs to understand the outcome of the analysis and generate sound remediations about what should be done. (iv) Implementation of recommendations, where the recommendations are implemented to prevent additional occurrences of the incident; the outcome is a system freed of its identified flaws.

Reason [27] proposes an accident causation model and shows that 'human errors' are active failures that, when combined with latent failures, can transform a simple hazard into an accident. Reason's model metaphorically describes a system as a Swiss cheese: each slice or layer is one of the system's defences against failure, and the holes in the slices are the flaws that, when 'aligned', create a hazard. A person responsible for an accident is not to be blamed alone; it is the system as a whole that needs to be investigated, because it hosts the fertile ground for the 'human error' that caused the accident. Consequently, when searching for the root causes of an accident, one must seek all the contributing factors and
consider 'human errors' as manifestations of additional factors that one must also identify. Although several factors concur to a disastrous event, it would not be justified to prefer one over another; rather, the knowledge gained from an RCA enables the analyst to write guidelines to control the factors and avoid the re-occurrence of the incident.

In security, where experts are in search of policies meant for people, a similar RCA approach can be applied, but the search for the causes of an incident that involves users should not stop and point at the human as the root of the incident. Instead, we believe, it should look further for the triggers of human behaviour that the system could have left under the control of an adversary.

The retrospective analysis of RCA in safety is not the only methodology with a potential application in security. The safety field also makes use of techniques to predict the performance of systems that are going to be operated by humans. Predicting how an event can unfold is highly dependent on the description of the context, tasks, and failure modes. The potential paths that an actual event can follow are usually represented in binary trees called event trees (see THERP, for instance [31]), where branches represent what may happen when an event (a leaf) succeeds or fails. Eventually, probabilities are computed for each outcome and recommendations are produced to enhance the reliability of the system. This prospective approach is used in Human Reliability Analysis. It relies 'heavily on process expertise to identify problems and its methods themselves or expert estimation for quantification' [6]. The overall process for a prospective analysis follows the same steps introduced earlier for an RCA, except for step (ii), which is called Prospective Analysis.

From Safety to Security. Two neat differences emerge between the RCA currently used in safety and a (retrospective) RCA as it should be used in security: both are about defining when to stop the search for causes. First, the search for a root cause should not point at the user. Instead, the analysis should aim at the triggers of human behaviour that have been left under the control of the adversary. An example is when users click on poisoned links. Security policies state that clicking on links is a bad habit, but users are stormed daily by trusted and untrusted emails, most of them carrying links, and they can hardly discern foes from friends. The cause is perhaps in the system's failure to authenticate an email's source, or in the reasons why people fail to recognise friends from intruders that 'pretext' to be friends. Second, the search for the root cause should not point at the adversary either, who, in security, is actually the real root of all evil.

In addition, other differences emerge preponderantly and make migrating the existing retrospective and prospective techniques from safety to security far from a straightforward task. Such differences emerge in the four steps of the RCA (Table 1) and bring up five major challenges that need to be addressed and resolved: (C1) Addressing the lack of knowledge and structured data: the challenge is to compile and format factual information about the investigated attack to allow the RCA to be performed, and to describe what the attacker does and what the attack's effects are on the user and on the system's security. Furthermore, the RCA should provide precise information regarding the data
to collect. (C2) Investigating Attacks: the RCA for security must output a set of contributors and human-related factors that are likely to explain the success of attacks, or of potential attacks. The new analysis should safeguard against one inherent shortcoming of RCA: the possible lack of objectivity. (C3) Creating reusable knowledge: to integrate with existing computer security techniques, the RCA technique should provide direct links between the attacker's capabilities and their effects on a system's security. The challenge is to be able to augment a given threat model with the capabilities that an attacker can gain by performing user-mediated attacks allowed by that threat model. (C4) Matching patterns of known attacks: the RCA, in addition to the retrospective analysis of past attacks, needs to provide a socio-technical security analysis where, from a system's description, socio-technical vulnerabilities are listed along with their contributors. (C5) Being flexible: the new method should be flexible enough to adapt to new threats, attacks, and technologies.
5
S·CREAM: Process and Concepts
Our methodology, called Cognitive Reliability and Error Analysis Method for Socio-Technical Security (S·CREAM), has four steps. They address the five challenges C1-C5 described in the previous section. As in the RCA process, the steps are backward-dependent: each step builds on and depends on the successful completion of the preceding one. For each step, we describe its concepts (what it is about), input (what it needs to operate), output (what it produces), and operation (how it operates). We supplement the exposition by referring to the following toy example of an attack: an attacker aims at having an action A1 executed on a system S1. Imitating an identity I1, he sends a message to a user requesting him to run A1 on S1. The user confuses the attacker for I1 and obeys.

Step 1: Data Collection and Investigations. This step addresses C1. It is about structuring the data related to the attack under scrutiny.

Concepts. Attack Description Scheme - the template that guides the collection of information. The scheme contains the list of information needed for the subsequent analysis to be performed successfully. Pre-Requisites - the flags associated with the different fields contained in the Attack Description Scheme. They represent the capabilities that an attacker is required to possess in order to perform the attack.

Inputs. The factual information gathered about the attack - obtained, e.g., from Digital Forensics, Incident Response, live monitoring, and in-person investigation;
Table 1. Key differences in safety and security.

Data Collection and Investigation:
– Safety: there is an established process to collect structured evidence for the root cause analysis. Security: this process is not well established; data are often unstructured and information is often scattered across multiple actors.

Analysis:
– Safety: there are no malicious actors; incidents happen because of general malfunctioning. Security: incidents are caused by attackers whose skills and capabilities may be subtle and even unknown.
– Safety: accidents to be investigated usually take place in well-known and well-defined settings. Security: we face much more heterogeneous contexts; furthermore, the incidents can still be unfolding at the time of analysis.
– Safety: RCA techniques are widely used and the human component is a central part of practices. Security: the use of RCA methods is often advocated but lacks human-related insights.
– Safety: a 'human error' is a well-defined concept that can be the starting point of an analysis. Security: the 'human error' is considered a system's failure mode that does not call for investigations.
– Safety: there is always some root cause that can be isolated for an incident. Security: the root cause of the success of an attack is always the attacker; therefore, we are interested in all the factors that contribute to the success of attacks.
– Safety: the analysis begins from the terminal point of failure: the observable incident. Security: an attack/incident can be an intermediate step leading to other attacks/incidents; therefore, we might not be able to observe the factual consequences of an attack/incident on a system.

Recommendation Generation:
– Safety: removing the root cause prevents the incident from reoccurring. Security: since the root cause is the attacker, technical controls can be applied to reduce the attacker's capabilities; socio-technical controls can be applied on the human contributors.

Implementation:
– Safety: an adverse event, being coincidental, may never reoccur on similar systems. Security: an attack incident will re-occur because attackers actively probe similar systems to recreate it; the sharing of recommendations is thus critical.
– Safety: people involved in incidents are mostly trained professionals (e.g., pilots, air traffic controllers, power plant operators). Security: people are much more diverse with regard to their relevant skills and knowledge (e.g., children, bank employees, elderly people, medical doctors); furthermore, they can have motives and concerns unrelated to security.
– Safety: root causes are identified and controlled. Security: it may be impossible to control all the identified contributors, which will be actively manipulated by the attacker.
An Attack Description Scheme - it defines the structure of the information that this and the other steps will process, and the Pre-Requisites that can be applied to this information. This scheme structures the way the analyst will later generalise attacks and test other systems for socio-technical vulnerabilities. For instance, expressing attacks' Pre-Requisites in terms of capabilities on protocols is different from expressing these same Pre-Requisites in terms of capabilities on security ceremonies. We take these decisions in our implementation of the Data Collection and Investigations step, described in Sect. 6.1. Furthermore, information about the attacker, the effect of the attack on the system, the system itself, and the context will be used in the Retrospective Analysis. This information constrains the universe of Contributors that can be discovered: for instance, Contributors related to the user's state of mind cannot be yielded by the Retrospective Analysis if they are not collected and stored in the attack description.

Output. A factual, structured description of the attack with its Pre-Requisites.

Operation. The analyst uses the gathered information to fill in the scheme and seeks further information about the attack when needed. Information can be gathered through different means of investigation. The security analyst decides which fields to define as required by setting the Pre-Requisites on the attack's description (by ticking check boxes, for instance).

Example 1. The scheme asks for a textual description of the attacker's actions, their effect on the system's security, the user's behaviour, the system, and the context. The output produced by the Data Collection and Investigations step, using the example scheme and the toy attack as inputs, is the following:

– Attacker's actions: which pieces of information the attacker gathered to perform the attack, what the form and content of the sent messages are, their time of appearance, et cetera. In our example, the attacker sends a message imitating the graphical identity of I1 with a request to perform A1 on S1.
– Effect on the system's security: the consequences of the success of the attack on the system, i.e., what the attacker achieved. In our example, the unauthorised attacker performed the action A1 via the user.
– User's behaviour: the user's behaviours that allowed the attack to succeed, and what the user should have done to avoid the attack. In our example, the user performs A1 on S1.
– System's description: information about the interactions between the user and the system, as well as about the interfaces mediating these interactions. In our example, I1 is authorised to issue requests to the user.
– Context's description: information about the context in which the attack occurred and the context in which the user-system interactions usually occur.
– Pre-Requisites to perform the attack: what the attacker needs to control and which capabilities he needs to perform the attack. In our example, the attacker needs to be able to 'send a message' and to 'imitate the graphical identity'.
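As an illustration only, a minimal sketch of how such a scheme could be encoded follows; the field names and the PreReq wrapper are our own hypothetical choices, not part of S·CREAM's specification, and the instance encodes the toy attack of Example 1.

```typescript
// A field of the scheme, optionally flagged as a Pre-Requisite:
// 'required' marks a capability the attacker must possess.
type PreReq<T> = { value: T; required: boolean };

// Hypothetical encoding of the Attack Description Scheme.
interface AttackDescription {
  attackerActions: PreReq<string>[]; // what the attacker does
  effectOnSecurity: string;          // what the attacker achieved
  userBehaviour: string;             // the critical user actions
  system: string;                    // interfaces and interactions
  context: string;                   // usual context of interaction
}

// The toy attack of Example 1, encoded under these assumptions.
const toyAttack: AttackDescription = {
  attackerActions: [
    { value: "send a message", required: true },
    { value: "imitate the graphical identity of I1", required: true },
  ],
  effectOnSecurity: "unauthorised action A1 performed on S1 via the user",
  userBehaviour: "the user confuses the attacker for I1 and performs A1 on S1",
  system: "I1 is authorised to issue requests to the user",
  context: "the user interacts with S1 and receives requests from I1",
};
```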
In Sect. 6 we explain how to implement this step.

Step 2: Retrospective Analysis. It addresses challenge C2. It provides the analyst with a method to extract the Contributors of an attack.

Concepts. Error Mode - the analogue of a failure mode as used in technological failure analysis. An Error Mode (EM) describes the observable user's erroneous behaviour in space and time. Defining the EM of an observed erroneous action clarifies, for instance, whether the user performed an action at the wrong time, or an action on the wrong object. Contributor - a characteristic, pertaining to any component of the system where the attack has occurred, that has facilitated the attack's success.

Inputs. A description of the attacker's actions, the user's behaviours, the system, and the context, from the Data Collection and Investigations step.

Outputs. A list of Error Modes (EMs) - the errors corresponding to the user's observed behaviour that permitted the success of the attack. A list of Contributors - the elements responsible for the occurrence of the said observed behaviour.

Operation. This step's operation is based on cause-consequent links that the analyst follows until he identifies likely Contributors to the attack. The analyst first identifies which EMs drove the user's insecure behaviour, then he follows the causal relationships until he finds a satisfactory list of Contributors. To implement this step we need to refer to established and reliable tables that realise those cause-consequent relations.

Example 2. In our toy example, the observed Error Mode is that the user misidentifies a message as emanating from I1. We reckon that the Retrospective Analysis would identify as a Contributor the fact that the received message mimicked the graphical identity of this entity. With more information, additional Contributors can be identified. For instance, the malicious message could have been received at a time when a genuine message from the spoofed entity was expected, and the 'Habits and expectations' Contributor could have been selected.

The identified Contributors are used in the next step, the Generalisation step. It is worth mentioning here that if one wanted only to study one attack on a particular system, one could follow the generic RCA process described earlier and produce recommendations to thwart this attack on this particular system. Such recommendations are not directly provided: it is the analyst's duty to produce, from the list of Contributors yielded by the Retrospective Analysis, the recommendations about the controls that should be applied to prevent the recurrence of the attack on the system.

In Sect. 6 we explain how to implement this step. Our implementation is inspired by the same step as implemented by the Cognitive Reliability and Error Analysis Method [17], an existing methodology. It is integrated into our tool, described in Sect. 7.
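Purely as an illustration of this step's output, the result of Example 2 could be recorded as follows; the type names are our own hypothetical choices, and the labels are taken from the example.

```typescript
// Hypothetical encoding of a Retrospective Analysis result.
interface RetrospectiveResult {
  errorModes: string[];    // observable erroneous user behaviours
  contributors: string[];  // factors that facilitated the attack
}

// Outcome of Example 2 for the toy attack.
const toyAnalysis: RetrospectiveResult = {
  errorModes: ["message misidentified as emanating from I1"],
  contributors: [
    "message mimicked the graphical identity of I1",
    "habits and expectations (a genuine message was expected)",
  ],
};
```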
Step 3: Generalisation. It addresses C3. Unlike the other steps, the Generalisation requires not one but several successful runs of the preceding steps (Retrospective Analysis) to operate. The Generalisation step is meant to create reusable links between the attacker's capabilities and security incidents. These links then enable an analyst to probe a system for socio-technical vulnerabilities under a Threat Model (see the Security Analysis step).

Concepts. Socio-Technical Capability (STC) - the capability to produce an effect harmful to a system's security by performing a user-mediated attack. This effect can be directly or indirectly harmful to the system's security; this means that producing an effect on another component that will ultimately harm the system's security is also an STC. Attack Mode (AM) - a link between an STC, one of its Contributors, and some Pre-Requisites. Socio-Technical Vulnerability - the presence of an uncontrolled AM in a system. Catalogue of Attack Modes - a repository of AMs.

Inputs. A set of the outputs obtained from the Data Collection and Investigations and Retrospective Analysis steps performed on previously studied attacks.

Output. Attack Modes compiled into a catalogue.

Operation. The analyst builds the catalogue of AMs by grouping together the outputs from several corresponding Data Collection and Investigations and Retrospective Analysis steps. The analyst chooses which STCs to create depending on the attacks he analysed before the Generalisation step. For instance, if he studied several attacks related to pretexting, he may have gained knowledge about the Contributors that facilitate spoofing identities. From this list of Contributors and the descriptions of the attacks, he can decide to create an STC called "spoof". Indeed, generalising a set of attacks into a catalogue of AMs starts with a simple question: 'What are the STCs gained through these attacks?'. Once this question is answered, the description of the AMs consists of appending, for each AM, the Pre-Requisites defined in the Data Collection and Investigations step and one Contributor identified during the Retrospective Analysis step of an attack to the STC which the analyst identified as being gained during this attack. The catalogue of AMs is built by repeating this process for all the identified Contributors that correspond to the attacks that the analyst wants to generalise.

Example 3. In our toy example, we can identify the action A1 as being an STC; however, we can also identify Identity Spoofing as an STC, because spoofing I1's identity is an effect on the user that ultimately leads to performing A1. If we represent an AM with three constituents (Pre-Requisites (PR), Contributor (Co), Socio-Technical Capability), an AM can be built from the attack as (PR, Co, A1) where PR = {Send message, Imitate graphical ID}, Co = {Habits, Expectancies}, and A1 is the action of our reference attack example.
This AM is not generalised, as the action A1 can be specific to the system S1 on which it is performed. If we want our AMs to be reusable and helpful in preventing an attacker from performing a similar attack launched on a different system and targeted at a specific action of this second system, we need to set the AM's STC to an effect, common to these attacks, that can harm the system's security. We can thus choose to build an AM for each identified Contributor with the STC 'Identity Spoofing', in the form (PR, Co, Identity spoofing). These AMs now reflect what has been observed in past attacks and provide links between an attacker's capabilities, effects that have had security consequences in the past, and the Contributors one should control to prevent this kind of attack from recurring. Furthermore, Contributors can be used as indicators of a Socio-Technical Attack (STA), because their manipulation can betray an attack. Our implementation of the Generalisation step is described in Sect. 6.3.

Step 4: Security Analysis. This step is the Prospective Analysis. The information gained in the Retrospective Analysis step and generalised in the Generalisation step is here reused by the analyst to address C4. This step operates in two modalities: a mandatory semi-automated Security Analysis and an optional expert-driven Security Analysis.

Semi-automatic Security Analysis. In this modality, the analyst makes use of a catalogue of AMs previously built by generalising STAs. The analysis identifies socio-technical vulnerabilities in a system by filtering the catalogue of AMs by the Threat Model that applies to this system.

Concepts. Threat Model - determines the attacker's capabilities and, optionally, the attacker's goal in terms of the STC in question; Contributor - a characteristic pertaining to any component of the system that can contribute to an attack (because it already did in the past, on another system, in another attack).

Inputs. A catalogue of AMs and the system to be analysed.

Output. A list of socio-technical vulnerabilities, that is to say, a list of (STC, Contributor) couples.

Operation. The analyst performs several operations in turn. First, the analyst describes the Threat Model that applies to the system, in the same scheme used to describe the attacks employed to build the catalogue of AMs. Then, the analyst filters the catalogue of AMs in order to list the AMs whose Pre-Requisites fit into the system's Threat Model.

Example 4. Consider a Threat Model in which the attacker can send messages with spoofed identities. From the catalogue built, we select the AMs whose Pre-Requisites are compatible with this Threat Model, obtaining a list of potential socio-technical vulnerabilities of the system under the Threat Model, e.g., (Co, Identity spoofing).
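To make the catalogue and the filtering concrete, here is a small sketch under our own hypothetical encoding; S·CREAM itself does not prescribe a data format, and the type and function names are ours.

```typescript
// Hypothetical encoding of an Attack Mode: (Pre-Requisites, Contributor, STC).
interface AttackMode {
  preRequisites: string[]; // capabilities the attacker must possess
  contributor: string;     // factor that facilitated past attacks
  stc: string;             // Socio-Technical Capability gained
  specific?: boolean;      // flag to hide AMs too specific to one system
}

// A Threat Model, reduced here to the set of attacker capabilities.
type ThreatModel = Set<string>;

// Semi-automated Security Analysis: keep the AMs whose Pre-Requisites
// all fit within the Threat Model; each hit is a potential
// socio-technical vulnerability, an (STC, Contributor) couple.
function findVulnerabilities(catalogue: AttackMode[], tm: ThreatModel) {
  return catalogue
    .filter((am) => !am.specific)
    .filter((am) => am.preRequisites.every((pr) => tm.has(pr)))
    .map((am) => ({ stc: am.stc, contributor: am.contributor }));
}

// Example 4: an attacker that can send messages with spoofed identities.
const tm: ThreatModel = new Set(["send message", "imitate graphical ID"]);
const catalogue: AttackMode[] = [
  {
    preRequisites: ["send message", "imitate graphical ID"],
    contributor: "habits and expectations",
    stc: "Identity spoofing",
  },
];
console.log(findVulnerabilities(catalogue, tm));
// -> [ { stc: 'Identity spoofing', contributor: 'habits and expectations' } ]
```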
Analyst-driven Security Analysis. In this modality, the analyst can input additional insights into the results.

Concepts. Potential Attack - an attack that the analyst reckons is possible against the system but for which no factual information exists. It is a plausible 'what if' scenario.

Inputs. The knowledge the analyst has of the potential attack.

Output. A list of Contributors to the potential attack.

Operation. The analyst performs several operations in turn. First, the analyst identifies a potential attack against the system under scrutiny. Then, he proceeds to investigate this attack by performing a Retrospective Analysis step (preceded by a Data Collection and Investigations step), which yields the attack's Contributors.

We discuss how to implement the two modalities of the Security Analysis step in Sect. 6.4.
5.1 Controlling Vulnerabilities
The methodology's outputs have been designed to allow an analyst to thwart potential user-mediated attacks by: (i) applying controls on the Contributors of identified socio-technical vulnerabilities, (ii) leveraging computer security methods by listing the STCs attainable by an attacker given a Threat Model, and (iii) providing indicators of STAs. Thus, depending on the presence of a Contributor in the system and on the success of the different methods, there are several paths that an analyst can follow (sketched in code after the list):

– The Contributor is not found: in this case, the socio-technical vulnerability does not exist in the system.
– The Contributor is found and reliable controls can be applied: in this case, the analyst applies recommendations as he would have done in a generic RCA process. The socio-technical vulnerability is controlled and the system's security is safe.
– The Contributor is found but no reliable control can be applied: in this case, the system has a gaping socio-technical vulnerability. The analyst can then turn to computer security methods to prove that the system is secure against an extended Threat Model that incorporates the newly discovered socio-technical vulnerability. If it cannot be proven secure, then the analyst can attempt to redesign the system to make it secure against the extended Threat Model.
– If everything else fails, S·CREAM's outputs can be used to create sophisticated indicators of STAs that can be used, for instance, to monitor the system and respond quickly to a security incident.
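A hypothetical helper mirroring the four paths above; the booleans stand for the analyst's findings at each stage, and the returned strings are merely illustrative labels.

```typescript
// Sketch of the decision paths for controlling a vulnerability.
function handleContributor(
  found: boolean,        // is the Contributor present in the system?
  controllable: boolean, // can reliable controls be applied to it?
  provable: boolean,     // provably secure under the extended Threat Model?
  redesigned: boolean    // could the system be redesigned if not?
): string {
  if (!found) return "no socio-technical vulnerability";
  if (controllable) return "apply controls: vulnerability under control";
  if (provable) return "proven secure under the extended Threat Model";
  if (redesigned) return "system redesigned for the extended Threat Model";
  return "monitor via indicators of socio-technical attacks";
}
```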
6
Implementation
We selected the Cognitive Reliability and Error Analysis Method (CREAM) [17] as our preferred RCA. It is a second-generation Human Reliability Analysis (HRA) method that considers the cognitive causes of errors, bringing a great deal of detail into the analysis of an accident. Besides, CREAM offers both retrospective and prospective analyses, providing us with bi-directional links between causes and effects. This allows us to build a catalogue of AMs that can be used both for detecting attacks (starting from observed effects) and for predicting attacks (starting from a threat model).

Building the Retrospective Analysis at the Heart of the Root Cause Analysis Method. CREAM relies on the two following pillars: (i) a classification of erroneous actions (represented in tables linked together by causal relationships) and (ii) a method that describes how to follow these links back to the human as well as to the contextual and technological factors at the origin of an 'event'. An event is caused by the manifestation of an 'erroneous action' and is called the phenotype [17]. The confluence of underlying factors that made the erroneous action arise is called its genotype. CREAM's tables of causal relationships between antecedents (causes of errors) and consequents (effects of errors) link a phenotype with its genotype [17]. Following these causal relationships, it is possible to find what caused an erroneous action and the root cause(s) of an event. CREAM is the building block of our method; however, it needs to be customised for security. We call the result S·CREAM, which stands for 'Security CREAM'.

Choosing a Source to Bootstrap S·CREAM. We bootstrap S·CREAM's catalogue of AMs with a library of known attack patterns drawn from the Common Attack Pattern Enumeration and Classification (CAPEC) [25]. This library contains attacks 'generated from in-depth analysis of specific real-world exploit examples'1. It is maintained by the MITRE Corporation, and it compiles and documents a wide range of attacks centered on the user. We use CAPEC's repository to extract and select those Attack Patterns whose success relies on critical actions of the user. The CAPEC taxonomy contains descriptions of social-engineering Attack Patterns, together with their Pre-Requisites, mechanisms, and possible mitigations.
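The antecedent-consequent tables just described can be pictured as a simple adjacency structure; the following is a hypothetical sketch with a toy fragment, not CREAM's actual tables.

```typescript
// Hypothetical shape of CREAM-style causal tables: each consequent
// maps to the antecedents that may explain it. Generic antecedents
// can be expanded further (they appear in turn as consequents);
// specific antecedents are terminal.
interface Antecedent {
  name: string;
  kind: "generic" | "specific";
}
type CausalTables = Map<string, Antecedent[]>;

// Toy fragment for illustration only.
const tables: CausalTables = new Map([
  ["Wrong identification", [
    { name: "Faulty diagnosis", kind: "generic" },
    { name: "Habits and expectations", kind: "specific" },
  ]],
  ["Faulty diagnosis", [
    { name: "Insufficient knowledge", kind: "specific" },
  ]],
]);
```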
6.1 Implementing the Data Collection and Investigations Step of S·CREAM
There are two aspects of importance when implementing the Data Collection and Investigations step described in Sect. 5: (i) the implementation should enable the analyst to perform a subsequent Retrospective Analysis step to objectively choose the paths to follow through antecedent-consequent links, and (ii) the
1 https://capec.mitre.org/.
implementation should structure the information about the attacker's capabilities required to perform the attack in a way that allows the analyst to filter the attacks by these capabilities. It is the attack description scheme that defines which data are to be collected and how the data should be organised, i.e., which attack properties should be defined. The scheme is customisable. We choose to describe the effects the attack has on a system's security and the attacker's actions.

Describing the Effects. The effects and consequences of an attack on the user and on the system are described in natural language. This choice is motivated by the fact that a precise description is the key to a successful Retrospective Analysis: there are so many possible applications, user interactions, decision processes, and consecutive actions that text is the best way to convey a fine-grained description of an attack. Thus, this implementation deviates from the Data Collection and Investigations step described in Sect. 5 by lacking structure to describe the effect an attack produces on a system.

Describing the Attacker's Actions. In contrast to the effects, we describe the attacker's actions as a set of messages flowing between the attacker and the victim prior to the manifestation of the critical action. With the notable distinction that we do not use UML diagrams to describe all the messages flowing between the user and the attacker, we focus on the attacker's messages, and we extract from them a set of common properties. The event initiating the attack is described through common properties shared by the messages sent from the attacker to the user. These properties are as follows: (a) a source, which is the principal that the user believes he is interacting with; (b) an identity, which is further split into a declared identity (i.e., who the attacker says he is, like the From field of an email) and an imitated identity (i.e., who the attacker imitates, by stealing a logo for instance); (c) a command for the user to execute; (d) a description of the subsequent action, stating whether the action is booby-trapped or spoofed; (e) a sequence that describes the temporal situation of the message; (f) a medium (web, phone, or paper).

Describing the Pre-Requisites. A pre-requisite flag attached to each property contained in the scheme describes the attacker's capabilities needed to perform the attack.
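A sketch of how the message properties (a)-(f) and their pre-requisite flags might be encoded, under the same caveat as before: the names are our own hypothetical choices.

```typescript
// Each property carries a flag marking it as a Pre-Requisite,
// i.e., a capability the attacker needs in order to control it.
type Flagged<T> = { value: T; preRequisite: boolean };

// Hypothetical encoding of the attacker's message.
interface AttackerMessage {
  source: Flagged<string>;           // (a) who the user believes he talks to
  declaredIdentity: Flagged<string>; // (b) e.g., the From field of an email
  imitatedIdentity: Flagged<string>; // (b) e.g., a stolen logo
  command: Flagged<string>;          // (c) what the user is asked to do
  action: Flagged<"booby-trapped" | "spoofed" | "genuine">; // (d)
  sequence: Flagged<string>;         // (e) temporal situation of the message
  medium: Flagged<"web" | "phone" | "paper">; // (f)
}
```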
6.2 Implementing the Retrospective Analysis of S·CREAM
S·CREAM's Retrospective Analysis draws from the retrospective analysis of CREAM and includes adaptations required by our computer security focus. We sketch CREAM's original analysis and S·CREAM's analysis side by side in Fig. 2. In CREAM's retrospective analysis, one first defines common performance conditions (see the left-hand side of Fig. 2) to describe the analysed event, followed by the Error Modes to investigate. This investigation is a process in which the analyst searches for the antecedents of each Error Mode. This process is recursive, i.e.,
Fig. 2. CREAM process (left) and its adaptation to security (right) (reprinted from [19]). (Color figure online)
each antecedent that an analyst finds can be investigated in turn. Antecedents justified by other antecedents are called 'generic', whereas those which are 'sufficient in themselves' are called 'specific'. To avoid following 'generic antecedents' endlessly, one must stop the investigation of the current branch when a 'specific antecedent' is found to be the most likely cause of the event (in Fig. 2, see the yes branch on the right-hand side of the CREAM block that leads to the end state for the current branch). As discussed in Sect. 4, the computer security context in which we intend to use CREAM's retrospective analysis calls for a different procedure. Two adaptations of CREAM's retrospective analysis method are therefore needed. First, we customise the phase preceding the investigation: instead of formalising the context in common performance conditions, S·CREAM implements a Data Collection and Investigations step as described in the previous section (in Fig. 2, the description step, depicted as a green box, replaces the first activity of CREAM).
Second, S·CREAM uses a less restrictive stop rule to yield Contributors. By doing so, we avoid pointing invariably at the attacker's action, which allows us to investigate additional contributing antecedents. Hence, where CREAM stops as soon as a specific antecedent is found to be a likely cause of the event, S·CREAM lists all the likely specific antecedents for the event, in addition to the specific antecedents that are contained in sibling generic antecedents; it then stops the investigation of the current branch. As S·CREAM's Retrospective Analysis follows a Data Collection and Investigations step (see also the red box in Fig. 2), the analyst uses the description of the attack to define the critical actions carried out by the victim (i.e., those with an effect on the system's security) and the associated Error Modes. For the Retrospective Analysis to be possible, the analyst has to identify at least one Error Mode for an attack. Additional Error Modes may have to be analysed in the course of the events that lead to the critical action, for instance, if the victim first encounters the attacker and misidentifies him/her as being trustworthy. Considering each antecedent with the attack's description in hand, the analyst follows the stop rule to build the list of Contributors of the attack under scrutiny.
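A sketch of the adapted traversal, reusing the hypothetical table encoding from above and simplifying the stop rule: all likely specific antecedents of a node are collected as Contributors, while likely generic antecedents are expanded recursively. The 'isLikely' predicate stands for the analyst's judgement, guided by the attack description.

```typescript
interface Antecedent { name: string; kind: "generic" | "specific" }
type CausalTables = Map<string, Antecedent[]>;

// Walk the causal tables from an Error Mode under S·CREAM's relaxed
// stop rule (a simplification of the rule described in the text).
function collectContributors(
  tables: CausalTables,
  node: string,
  isLikely: (a: Antecedent) => boolean,
  found: Set<string> = new Set()
): Set<string> {
  const antecedents = (tables.get(node) ?? []).filter(isLikely);
  for (const a of antecedents) {
    if (a.kind === "specific") {
      found.add(a.name); // a Contributor: terminal, record it
    } else {
      collectContributors(tables, a.name, isLikely, found); // expand
    }
  }
  return found;
}
```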
6.3 Implementing the Generalisation Step of S·CREAM
Building Socio-Technical Capabilities. The Generalisation step's description in Sect. 5 states that this step builds AMs, i.e., links between an attacker's capabilities and the effects he can potentially produce on a system's security. For instance, sending a message that nudges a user into clicking on a malicious link may allow the attacker to execute some code on the system. An observable instance of such an attack may exist, but it is not reasonable to create an STC called 'Remote Code Execution' (and, with it, all the AMs with the Contributors that make it possible) linking an attacker that can send a message with this STC. Indeed, there are so many differences in the usage and the consequences of this action of clicking that it makes no sense to create STCs based on its consequences; creating an STC 'Push the user to click on an object' would instead make more sense. For the initial development of S·CREAM's catalogue of AMs, we are interested in STCs that are intermediate goals of the attacker and are peripheral and decoupled from the system on which they are exploited: links between what the attacker can do and the consequences on the user, not on the system. Here is a list of STCs that we foresee could be reachable by the use of S·CREAM (given that attacks making use of these capabilities exist in the corpus of attacks used to build a catalogue) and actually useful for computer security methods:

– spoof: the attacker is able to usurp another entity's identity. This STC is, for instance, used in phishing to impersonate an entity that is likely to send the request contained in the phishing email.
– block: the attacker is able to block messages from reaching the user, by distracting him for instance.
– alter: some AMs may provide the attacker with the capability of changing the perception the user has of a genuine message.
S·CREAM Allows Hiding Some Attack Modes. As with STCs, Contributors can be highly specific to a system or to a context of attack. Thus, S·CREAM provides a way (a flag) to remove AMs that use a Contributor that is highly unlikely to be found in systems other than the one in which it was observed (sight parallax-related Contributors, for instance). This flag can be set at any time, thus buying time for the analyst to judge the usefulness of an AM while performing Socio-Technical Security Analyses before deciding on its filtering. The Generalisation step can be run numerous times on diverse sets of attacks to enrich the resulting catalogue of AMs (and nothing forbids the creation of different, specialised catalogues).

Bootstrapping S·CREAM's Catalogue of Attack Modes. To implement this step in our tool, we needed to bootstrap the tool with an initial catalogue of AMs. Instead of waiting for a sufficient amount of real attacks to be observed, described, and then analysed for their root causes, we resorted to looking into fifteen Attack Patterns among those described in the CAPEC library [25]; we refer here to the library as it was in January 2016. We then ran steps (i) and (ii) on them using the tool. From the CAPEC library we selected the attacks where the user is at the source of the success of the attack; after processing them, we populated the AM catalogue with 29 Contributors to the two socio-technical capabilities that we identified. These capabilities are intermediate goals of the attacker, peripheral and decoupled from the system on which they are exploited: Identity spoofing, which is the capability to usurp an identity, and Action spoofing, which is the capability to deceive the user into thinking that an action he performs will behave as he expects, whereas another action, which is harmful for the system, is executed in its place. Some of these AMs appear in Fig. 3.
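A sketch of the Generalisation step under the same hypothetical encoding: given the analyses of several attacks that the analyst judges to yield the same STC, one AM is emitted per (attack, Contributor) pair, as the text describes.

```typescript
interface AttackMode { preRequisites: string[]; contributor: string; stc: string }

// One analysed attack: its Pre-Requisites (from Data Collection and
// Investigations) and its Contributors (from Retrospective Analysis).
interface AnalysedAttack { preRequisites: string[]; contributors: string[] }

// Generalisation: append each attack's Pre-Requisites and each of its
// Contributors to the STC judged to be gained by the attack.
function generalise(attacks: AnalysedAttack[], stc: string): AttackMode[] {
  return attacks.flatMap((a) =>
    a.contributors.map((co) => ({
      preRequisites: a.preRequisites,
      contributor: co,
      stc,
    }))
  );
}
```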
6.4 Implementing the Security Analysis of S·CREAM
Once the catalogue of AMs is bootstrapped, the implementation of this step consists of filtering the AMs and displaying the Contributors of linked potential attacks. To perform the semi-automated Security Analysis, the analyst describes the attacker's capabilities on the system under scrutiny (the Threat Model), and the S·CREAM Assistant then filters and displays the corresponding AMs. To add his insights about potential attacks while performing an analyst-driven Security Analysis, the analyst attaches attacks to the system being analysed, as he would attach attacks to an STC. To ensure that the potential attacks are indeed possible on the system, S·CREAM Assistant filters out the attacks that exceed the attacker's capabilities described during the semi-automated Security Analysis. Albeit being potential attacks, these attacks are no different from regular attacks to the S·CREAM Assistant and are investigated in the same way (a Data Collection and Investigations step, then a Retrospective Analysis step). The Contributors corresponding to the linked potential attacks are displayed along with the AMs resulting from the semi-automated Security Analysis. The difference is that the AMs instruct the analyst about STCs, whereas the Contributors
yielded from the analyst-driven Security Analysis provide information on how an attacker could achieve the potential attack’s adverse effect on the system.
7
S·CREAM Assistant: The Toolkit
We developed a companion tool for S·CREAM, the S·CREAM Assistant, which is available at [18]. S·CREAM Assistant greatly improves the analysis experience by providing guidance, checking for the completion of the analysis, and providing an interface for navigating and filtering the resulting catalogue of AMs. Indeed, given the number of AMs and the possible filters, it is potentially unbearable to perform some analyses without the support of a dedicated tool.

Note. The Cognitive Reliability and Error Analysis Method (CREAM), by which our methodology is in part inspired, uses tables that the analyst browses through by following causal relationships. This task can be cumbersome and requires the analyst to take several decisions that can undermine the analysis' validity. For instance, the analyst can miss a link to a table or overlook an antecedent that could have been a major Contributor to the success of an attack. To partially address this obstacle, in [12] we used Serwy's software implementation of CREAM [30]. While being perfectly suitable for performing CREAM analyses, this software does not implement any of the customisations described in Sect. 6; we thus had to run and document these parts using pencil and paper. The quality assurance of the results, the low costs of operation, and the usability of the method are the main rationale behind the inception of this tool.

S·CREAM Assistant's main functions are mapped onto the steps described in Sect. 5 and meet the specifications of S·CREAM described in Sect. 6.

Description. In S·CREAM Assistant, the Data Collection and Investigations step is accessed from the 'Retrospective Analysis' tab. S·CREAM Assistant implements the Data Collection and Investigations step as a form where the analyst fills in the information related to the attack under scrutiny for the structured description.

Retrospective Analysis. S·CREAM Assistant implements the Retrospective Analysis step (see Fig. 3). The analyst determines the list of EMs observed in the attack under scrutiny from CREAM's taxonomy of EMs. This is done via a dialog box triggered by the 'Manage Error Modes' button. Then S·CREAM Assistant enables the analyst to investigate each EM with the help of an interactive tree. By clicking on the 'Analyse' button, S·CREAM Assistant pushes the corresponding EM from the list into the tree view for analysis. For an EM under scrutiny, S·CREAM Assistant displays the possible antecedents that the analyst has to consider as possible Contributors, in the form of children of the EM root node. S·CREAM Assistant takes care of finding the children of generic antecedents in CREAM's tables and of implementing S·CREAM's custom stop rule. When a specific antecedent is selected or unselected, the corresponding branch is automatically checked to enable the stop rule on the right specific antecedent (if applicable), and to open or close the generic antecedents that
Fig. 3. Screenshot of the retrospective analysis as performed by using our tool. 'GA' denotes a generic antecedent, 'SA' a specific antecedent. Red antecedents cannot be further expanded; they denote a stop condition in the search for Contributors (reprinted from [19]). (Color figure online)
Fig. 4. Managing an STC's attacks (S·CREAM Assistant screenshot).
fall under its realm, that is to say, the generic antecedents that are siblings, or children of siblings, of the specific antecedent that carries the stop rule. In addition to this, S·CREAM Assistant automatically verifies whether the tree has reached a 'completed' state, meaning that at least one Contributor to the EM under investigation has been found. At any time during this process, the analyst can consult the results compiled in a dedicated view.

Generalisation. To build the catalogue of AMs, S·CREAM Assistant enables the analysts to store a list of STCs that they can link to the analysed attacks. As shown in Fig. 4, the tool displays a list of available attacks on the left and a list of linked attacks on the right. An attack can only be linked to one STC at a time (Fig. 5). Once the attacks have been linked to an STC, the analyst can compile a list of AMs for this STC. Each AM displays a Contributor, the exploited EM, a justification (a comment that the analyst can use to justify his choice), and a check box. The check box allows the analyst to discard an AM from further Security Analysis steps by enabling the 'specific' flag (as explained in Sect. 6.3). The view responsible for displaying the AMs hides the details of the description of each AM for the sake of space, while the link to its description exists and is used in the Security Analysis step (see below).

Security Analysis. S·CREAM Assistant supports the Security Analysis step by allowing the analyst to filter the catalogue of AMs to view only the STCs that fit a given Threat Model. This means that the analyst builds the list of STCs available to an attacker that possesses the Threat Model's capabilities. These capabilities are described in the same way as the attacks, with each property being attached to a pre-requisite flag. The list of STCs is built by comparing, for each STC, the Pre-Requisites attached to its AMs with the Threat Model's Pre-Requisites. An STC is displayed to the analyst only if at least one of its AMs fits the system's Threat Model. Finally, the analyst browses the STCs, along with their usable AMs, corresponding to the system he described. In addition to listing the STCs along with their AMs corresponding to a system, S·CREAM Assistant provides the analyst with the possibility to attach
attacks to a system (see Fig. 5). This is meant to investigate potential attacks on a system and to get a more precise understanding of its socio-technical vulnerabilities. To add attacks to a system, the analyst uses the same attack manager as described earlier (see Fig. 4). The analyst can only select attacks that fit the Threat Model described for the system. Once the Threat Model is described and potential attacks are linked, S·CREAM Assistant displays the list of corresponding STCs with their AMs, along with the Contributors corresponding to the linked attacks.

Technical Implementation. We wanted the application to be multi-platform, portable, and stand-alone in the first iterations, while still being able to transpose it to a client-server model, or even to a desktop application, should we later decide so. Consequently, we chose to implement S·CREAM Assistant in JavaScript, storing all data in the web browser's Local Storage. S·CREAM Assistant uses several frameworks: AngularJS [14] manages the Views and the Controllers, js-data [21] handles the Models and provides an Object-Relational Mapping, and D3.js [7] displays the interactive tree used to visualise S·CREAM's Retrospective Analysis step. S·CREAM Assistant allows the analyst to export and import the data stored locally. Furthermore, to address C5 introduced in Sect. 4 (the need for flexibility of the method) and to make S·CREAM Assistant ready to follow further developments of S·CREAM, XSLT style sheets are applied to the CREAM tables' XML representation at runtime (see Fig. 5). This allows the analyst to add, alter, or remove antecedent-consequent links from S·CREAM in case he performs domain-specific analyses. While performing a Retrospective Analysis, the analyst can choose one of these style sheets, which we refer to as S·CREAM flavors, to apply to the original CREAM tables.
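For illustration, persisting and exporting a catalogue through the browser's Local Storage can be as simple as the following sketch; the storage key and the record shape are our own assumptions, not the tool's actual schema.

```typescript
// Minimal Local-Storage persistence sketch, in a browser context,
// assuming a catalogue shaped like the AttackMode records sketched
// earlier. The key 'scream-catalogue' is a hypothetical choice.
interface AttackMode { preRequisites: string[]; contributor: string; stc: string }

function saveCatalogue(catalogue: AttackMode[]): void {
  localStorage.setItem("scream-catalogue", JSON.stringify(catalogue));
}

function loadCatalogue(): AttackMode[] {
  const raw = localStorage.getItem("scream-catalogue");
  return raw ? (JSON.parse(raw) as AttackMode[]) : [];
}

// Export for backup or sharing: a JSON string that the analyst can
// download here and re-import elsewhere.
function exportCatalogue(): string {
  return JSON.stringify(loadCatalogue(), null, 2);
}
```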
8
Use Case: YubiKey
We put ourselves in the shoes of a security practitioner who, after having assessed the risks posed to one of his organization's applications, decides to implement One-Time Passwords (OTPs) to authenticate users on this application. He chooses the YubiKey nano USB security token. A YubiKey is a multi-purpose security token in the form of a USB dongle. It is versatile: to a computer (or, via NFC, to a smart phone), it presents itself as either a keyboard or a two-factor authentication device; it can be used to generate and store a 64-character password, to generate OTPs, or to play different challenge-response protocols [35]. Its user interface consists of only one button and a LED. YubiKeys have peculiar user interfaces and user interactions, as they are not endowed with a screen. The absence of a screen shifts the duty of providing feedback to the user onto a LED light. The LED's behaviors (e.g., flashing rapidly, being on or off, et cetera) have different meanings, which are explained in the user's manual [35].
Fig. 5. S·CREAM Assistant data model. Tables in blue are implemented using js-data and stored in the web browser's Local Storage. (Color figure online)
One aspect of YubiKeys is that they support two configuration slots on one device. To use these configurations, the user touches the button of the device for different periods of time. These slots can be configured to generate OTPs or a static password. Quoting from Yubico's YubiKey security evaluation document [34], this functionality carries a few security implications: "The YubiKey 2.0 introduces a mechanism where the user can use two separate credentials. We call the storage for these credentials 'slot 1' and 'slot 2'. To generate a credential from slot 1, the user touches the button for a short period of time (e.g., well below 2 s). To generate a credential from slot 2, the user touches the button for a long period of time (e.g., well above 3 s). With proper user education, we believe this does not add any additional security problems so we continue to evaluate the YubiKey configured with just the slot 1 credential." It is worth noting that YubiKeys (now in version 4) come in two forms. One is the standard YubiKey, which is 18 × 45 × 3 mm; it has a round-shaped button with a LED at the top. The other is the YubiKey 'nano': much smaller, it completely disappears into a USB port when plugged in, and it has its button and LED on its edge. Without more details about the organization's context, we only investigate the consequences of configuration and functional choices related to the YubiKey itself: how can the different configuration settings impact the token's operations and the security provided when the token is used by the organization's employees? We perform our analysis on the basic operation of a YubiKey with the 'Dual configuration' functionality enabled. We set a YubiKey nano to yield an OTP on slot 1 and a static password on slot 2, the security practitioner's rationale being that it would be a waste not to propose to improve employees' passwords with a long random string of characters while the YubiKey provides this feature.
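As a structural illustration of the dual-slot behaviour quoted above: the timing thresholds come from the security evaluation document, while the function itself is our own sketch, not Yubico's firmware logic.

```typescript
// Sketch of dual-slot credential selection by touch duration.
type Credential = "OTP (slot 1)" | "static password (slot 2)" | "none";

function credentialForTouch(seconds: number): Credential {
  if (seconds < 2) return "OTP (slot 1)";          // short touch
  if (seconds > 3) return "static password (slot 2)"; // long touch
  return "none"; // ambiguous duration: assumed to emit nothing
}

console.log(credentialForTouch(0.5)); // -> OTP (slot 1)
console.log(credentialForTouch(4));   // -> static password (slot 2)
```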
8.1 Security Analysis
Threat Model. The main assumptions for this system are that the attacker can read and write on the Internet. This Threat Model implies that the attacker is free to send messages on the web medium to the user before and after the user operates the YubiKey by touching its button. More specifically, we consider that the user is visiting a website under the control of the attacker.
Semi-automatic Security Analysis. We consider that this system's Threat Model allows the attacker to control the source, the declared identity, the imitated identity, and the command, and that he can write on the web medium. As the attacker has no control over the YubiKey, he cannot spoof the action the user is about to perform. The attacker does, however, control the sequence of communication with the user. Consequently, by using S·CREAM, we find that the reachable STC is Identity spoofing.
Analyst-Driven Security Analysis. To find likely potential attacks on this system, our strategy is to formulate hypotheses about the consequences of the user's actions in view of the attacker's extended capabilities. There are two actions that a user has to carry out when using a YubiKey on a computer: plugging the YubiKey into a USB port, and operating the YubiKey by touching its button according to the authentication scheme of the application. On the YubiKey nano, both actions are critical from a security point of view.
– Plugging the YubiKey nano into a computer can accidentally produce an OTP because the button is located at the edge of the device. Plugging or unplugging a YubiKey nano can therefore lead to a loss of confidentiality of the OTP code located in the first slot. As the YubiKey operates after the touch event has finished, we consider that the Error Mode to investigate is ‘Sequence-Wrong action’, and that the user appends an irrelevant action to the sequence of actions.
– Operating the YubiKey nano has two important dimensions: the action's duration (i.e., less than 2 s or more than 3 s) and the action's location (i.e., which user interface element has the focus at the time of the action). The user needs to touch the device for the right amount of time while being in communication with the correct entity; otherwise, there can be a loss of confidentiality. As location-based attacks are already covered by Identity spoofing (i.e., the user mistakes the attacker for another entity), we focus on the duration. In particular, we investigate the EM ‘Duration-Too long’.
Table 2 sums up the results of this investigation.
Table 2. Contributors yielded by the Security Analysis (reprinted from [19]).

STC: Identity spoofing — GA-Faulty diagnosis, GA-Inadequate quality control, GA-Inattention, GA-Insufficient knowledge, GA-Mislabelling, GA-Missing information, GA-Wrong reasoning, SA-Ambiguous label, SA-Ambiguous signals, SA-Ambiguous symbol set, SA-Competing task, SA-Erroneous information, SA-Error in mental model, SA-Habit, expectancy, SA-Hidden information, SA-Inadequate training, SA-Incorrect label, SA-Mislearning, SA-Model error, SA-Overlook side consequent, SA-Presentation failure, SA-Too short planning horizon.

Attack: Foster ‘Sequence-Wrong action’ EM on plugging — GA: Sound, SA: Competing task, SA: Noise.

Attack: Foster ‘Duration-Too long’ EM on operating — GA: Adverse ambient conditions, SA: Confusing symptoms, SA: Design, SA: Inadequate training, SA: Information overload, SA: Mislearning, SA: Multiple signals, SA: New situation, SA: Noise, SA: Overlook side consequent, SA: Too short planning horizon, SA: Trapping error.
Discussion on the Results and the Possible Remediations. Regarding Identity spoofing, the attacker has many options when it comes to impersonating another entity (see the Identity spoofing column in Table 2). A prominent example of such an attack is the Man-in-the-Browser attack: the attacker, in control of the web browser, redirects the user to a website he controls when the user attempts to go to his bank's website. The attacker then asks for the credentials (including two-factor authentication credentials such as the one provided by a YubiKey) and logs into the bank's website in place of the user. The key result of this analysis is that there is little that can be done to thwart the attack, given the number of Contributors. The results of this analysis reach the same conclusion as the security evaluation made by Yubico [34], which states: ‘We conclude that the system does not provide good defence against a real-time man-in-the-middle or phishing attack.’ Regarding potential attacks on the ‘Dual configuration’ functionality, Table 2 shows that there are three Contributors that an attacker can manipulate to foster the occurrence of the ‘Sequence-Wrong action’ EM during the plugging critical action. The attacker, in control of the webpage, can emit sounds or noises to put pressure on the user, and he can also create a competing task. We see little practical application of this attack. Finally, we turn to the case of the operating critical action. Investigating this critical action with S·CREAM yielded more Contributors than the plugging critical action; it therefore appears more likely that potential attacks will exploit the operating action rather than the plugging action. Table 2 lists the Contributors that we reckon can be used to trigger the ‘Duration-Too long’ EM. For instance, we select ‘SA-Confusing symptoms’ because the attacker can attempt an attack in the same fashion as social-engineering attacks in which the
attacker sends a ‘bad authentication’ message as sole feedback after each login attempt, nudging users to give away every password they know while trying to authenticate. The difference is that, in our use case, the user would try every possible action on the YubiKey instead of entering passwords. This kind of attack is quite possible given that the YubiKey provides little feedback when a slot is yielded and no feedback about which slot is yielded. Furthermore, the user might be unsure how he configured his YubiKey (and someone may have configured it for him). In light of this analysis, we consider that the YubiKey is not a bad choice for implementing OTPs, but that security practitioners should refrain from using the second slot, as it poses some socio-technical security issues.
9 Discussion
The S·CREAM methodology has been conceived to be used by analysts and security practitioners who need to consider a system's security socio-technically. In fact, whether they wish to assess potential vulnerabilities of a system in use or, when confronted with a security attack, are challenged to establish and remove its causes, it is now clear that both technical and social factors must be considered. The toolkit that we have developed can help in this mission, and it is meant to be used at different points of the Information Security Management cycle. It allows its users to prospectively (as in the use case presented in Sect. 8) and retrospectively identify socio-technical factors that could lead, or could have led, to a security breach. We believe that, if used with discernment, the insights that our methodology, supported by the toolkit, produces can be transformed into requirements or best practices able to increase a system's socio-technical security. However, to get the best out of what we propose here, one needs to be aware of the limitations and peculiarities of S·CREAM. First, our methodology is young: it has to be further developed, and it has not yet been properly validated. The catalogue that we bootstrapped from the AM library is, for the needs of S·CREAM, still rudimentary and has to be refined and extended. Inevitably, the antecedent-consequent tables that S·CREAM Assistant implements are not yet as dedicated to security as we wished them to be, and consequently the Contributors yielded by S·CREAM's Security Analysis are currently more generic than we think they should be. Because of this generality, the S·CREAM Assistant's suggestions are sometimes not easy to interpret. We expect the toolkit's relevance to grow over time through its use. Therefore, we consider the current state of the toolkit a proof of concept that, as future work, we will challenge through the analysis of different security incidents and systems, and that we will tune to improve its accuracy. A second issue to be considered when applying the methodology is that it is not a silver bullet for all security problems; rather, it is
meant to provide guidance in the analysis, which is complete only with the analyst's expertise. For instance, if the eventual conclusion seems to indicate that warning the user could prevent a socio-technical attack in some form, the analyst should consider that warnings can have a negative impact on users' decisions if not implemented properly. Indeed, S·CREAM's results are potential: the toolkit produces a list of potential factors that an attacker may exploit to perform an attack on a system, or that may have caused one, but it is the analyst's duty to ponder whether these factors should be controlled on the system or not. To avoid introducing ill-designed controls, S·CREAM's tables should be improved by integrating findings from the literature on Contributors, to further guide the analyst in his quest for socio-technical security.
10 Future Work
In this paper we detailed the methodology as much as possible, but there is still work to do before S·CREAM can be used extensively. Some of this work regards the methodology itself. It needs to be validated, and this task has high priority: in particular, we intend to assess whether it yields sound results for its security analysis as often as possible. The imported CREAM tables are assumed to work for security, but we still intend to identify the Contributors that are most often discarded by analysts, and for this we plan to challenge their relevance experimentally. The S·CREAM Assistant needs improvements too. We intend to design helpers, such as check-lists, to offer guidance to the analyst who runs the tool. The antecedent-consequent tables inherited from CREAM should be specialised for security. We intend to provide up-to-date tables of antecedents that reflect the current state of research on the factors that influence security-related behaviour. We are planning to add the possibility to share schemes, catalogues of AMs, and sets of attacks, because we believe that the tool's efficacy will improve if analysts of different companies, or Incident Response Centers, can cooperate, share, and compare their knowledge. However, there is a main obstacle to overcome before reaching these milestones: we need to find a sufficient number of documented attacks to analyse. These attacks and the corresponding attacker's traces are in the hands of security practitioners. Therefore, we are currently in the process of interfacing S·CREAM with a cyber incident analysis tool (i.e., the Hive project [13]) and a threat intelligence sharing tool (i.e., the MISP project [24]) to foster the use of S·CREAM. This first interaction with security practitioners and their tools has led us to further acknowledge the need for analyst guidance when performing analyses of security incidents. Indeed, S·CREAM adds to the noise that surrounds security incidents by creating new hypotheses to be evaluated. Therefore, future work will also be geared towards helping analysts sort out hypotheses regarding the causes of the success of attacks from facts (e.g., observables extracted from TheHive) and findings (e.g., the results of other intermediate
analyses gathered from MISP) in conjunction with S·CREAM. To this end, we plan to make use of intelligence analysis techniques called Structured Analytic Techniques (SATs) (see [16]), and in particular the Analysis of Competing Hypotheses (ACH) [15], a process that systematically evaluates the consistency of facts with competing hypotheses. Inspired by science, ACH proceeds by rejecting rather than confirming hypotheses. The consistency of each fact with each hypothesis is evaluated by the analyst until there is no more information to assess; the hypothesis with the fewest credible facts against it is then the most likely. ACH analyses can be run several times by several analysts and their conclusions challenged against one another, which makes them a good fit for our toolkit. We believe that the shortcomings we identified can be fixed, and that by improving S·CREAM's tables, by maintaining our toolset's knowledge of security and human-related factors, and by fostering its use, its interoperability, and the sharing of experiences, our toolkit can be a useful addition to a security practitioner's toolbox; however, we still need to fully implement these improvements.
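As a minimal sketch of how ACH-style scoring could be automated, the following Python fragment rates facts against hypotheses as consistent ('C'), inconsistent ('I'), or neutral ('N'), and keeps the hypotheses with the fewest inconsistencies; all names and ratings are invented for illustration.

# Ratings of each fact against each hypothesis (illustrative data only).
RATINGS = {
    "H1: user mis-operated the token": {"f1": "C", "f2": "I", "f3": "N"},
    "H2: attacker spoofed the website": {"f1": "C", "f2": "C", "f3": "C"},
}

def least_refuted(ratings):
    # ACH rejects rather than confirms: count inconsistent facts per
    # hypothesis and keep those with the fewest strikes against them.
    scores = {h: sum(r == "I" for r in facts.values())
              for h, facts in ratings.items()}
    best = min(scores.values())
    return [h for h, s in scores.items() if s == best]

print(least_refuted(RATINGS))  # ['H2: attacker spoofed the website']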
References
1. ISO/IEC 27001:2005 - Information technology - Security techniques - Information security management systems - Requirements. Technical report, International Organization for Standardization, Geneva (2005)
2. Adams, A., Sasse, M.A.: Users are not the enemy. Commun. ACM 42(12), 40–46 (1999)
3. Anderson, R.J.: Security Engineering: A Guide to Building Dependable Distributed Systems, 2nd edn. Wiley, New York (2002)
4. Beautement, A., Becker, I., Parkin, S., Krol, K., Sasse, M.A.: Productive security: a scalable methodology for analysing employee security behaviours. In: Twelfth Symposium on Usable Privacy and Security, SOUPS 2016, Denver, CO, USA, 22–24 June 2016, pp. 253–270 (2016)
5. Bianco, D.: The pyramid of pain (2014). http://detect-respond.blogspot.lu/2013/03/the-pyramid-of-pain.html
6. Boring, R.L.: Fifty years of THERP and human reliability analysis. Technical report, Idaho National Laboratory (INL) (2012)
7. Bostock, M., Ogievetsky, V., Heer, J.: D3 data-driven documents. IEEE Trans. Vis. Comput. Graph. 17(12), 2301–2309 (2011)
8. Boyd, J.: The essence of winning and losing (1995). https://www.danford.net/boyd/essence.htm
9. Caralli, R., Stevens, J., Young, L., Wilson, W.: Introducing OCTAVE Allegro: improving the information security risk assessment process. Technical report, CMU/SEI-2007-TR-012, Software Engineering Institute, Carnegie Mellon University, Pittsburgh (2007)
10. Cotroneo, D., Paudice, A., Pecchia, A.: Automated root cause identification of security alerts: evaluation in a SaaS cloud. Future Gener. Comput. Syst. 56, 375–387 (2016)
11. ENISA: Annual Incident Reports 2015. Technical report, ENISA - European Union Agency for Network and Information Security (2016)
12. Ferreira, A., Huynen, J.-L., Koenig, V., Lenzini, G.: In cyber-space no one can hear you S·CREAM. In: Foresti, S. (ed.) STM 2015. LNCS, vol. 9331, pp. 255–264. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24858-5_16
13. Franco, T., Kadhi, S., Leonard, J.: TheHive project
14. Google: AngularJS (2016)
15. Heuer, R.J.: Psychology of Intelligence Analysis. Washington (1999)
16. Heuer, R.J., Pherson, R.H.: Structured Analytic Techniques for Intelligence Analysis. CQ Press, Washington (2014)
17. Hollnagel, E.: Cognitive Reliability and Error Analysis Method CREAM. Elsevier, Oxford (1998)
18. Huynen, J.: S·CREAM Assistant (2016). https://github.com/gallypette/SCREAMAssistant
19. Huynen, J., Lenzini, G.: From situation awareness to action: an information security management toolkit for socio-technical security retrospective and prospective analysis. In: Proceedings of the 3rd International Conference on Information Systems Security and Privacy, ICISSP 2017, Porto, Portugal, 19–21 February 2017, pp. 213–224 (2017)
20. Ishikawa, K.: What is Total Quality Control? The Japanese Way. Prentice Hall, Upper Saddle River (1988)
21. Js-data Development Team: Js-data (2016)
22. Kasikci, B., Schubert, B., Pereira, C., Pokam, G., Candea, G.: Failure sketching: a technique for automated root cause diagnosis of in-production failures. In: Proceedings of the 25th Symposium on Operating Systems Principles, SOSP 2015, pp. 344–360. ACM, New York (2015)
23. Kirlappos, I., Parkin, S., Sasse, M.A.: “Shadow security” as a tool for the learning organization. SIGCAS Comput. Soc. 45(1), 29–37 (2015)
24. MISP Development Team: Malware Information Sharing Platform (2015)
25. MITRE: CAPEC - Common Attack Pattern Enumeration and Classification (2014)
26. Noureddine, M., Keefe, K., Sanders, W.H., Bashir, M.: Quantitative security metrics with human in the loop. In: Proceedings of the 2015 Symposium and Bootcamp on the Science of Security, HotSoS 2015, pp. 21:1–21:2. ACM, New York (2015)
27. Reason, J.: Human Error. Cambridge University Press, Cambridge (1990)
28. Schneier, B.: The future of incident response. IEEE Secur. Priv. 12(5), 96 (2014)
29. Schoenfisch, J., von Stülpnagel, J., Ortmann, J., Meilicke, C., Stuckenschmidt, H.: Using abduction in Markov logic networks for root cause analysis. CoRR (2015)
30. Serwy, R.D., Rantanen, E.M.: Evaluation of a software implementation of the cognitive reliability and error analysis method (CREAM). Proc. Hum. Factors Ergon. Soc. Ann. Meet. 51(18), 1249–1253 (2007)
31. Swain, A.D., Guttmann, H.E.: Handbook of human-reliability analysis with emphasis on nuclear power plant applications. Final report, NUREG/CR, U.S. Nuclear Regulatory Commission (1983)
32. Verizon RISK Team: 2015 Data Breach Investigations Report. Technical report, Verizon (2015)
33. Verizon RISK Team: 2017 Data Breach Investigations Report. Technical report, Verizon (2017)
34. Yubico, A.B.: YubiKey security evaluation: discussion of security properties and best practices (2012)
35. Yubico, A.B.: The YubiKey manual: usage, configuration and introduction of basic concepts (2015)
An Exploration of Some Security Issues Within the BACnet Protocol

Matthew Peacock, Michael N. Johnstone, and Craig Valli

Security Research Institute, Edith Cowan University, Perth, Western Australia, Australia
{m.peacock,m.johnstone,c.valli}@ecu.edu.au
Abstract. Building automation systems control a range of services, commonly heating, ventilation and air-conditioning. BACnet is a leading protocol used to transmit data across building automation system networks, for the purpose of reporting and control. Security is an issue in BACnet due to its initial design brief, which appears to be centred around a centralised monolithic command and control architecture. With the advent of the Internet of Things, systems that were isolated are now interconnected. This interconnectivity is problematic because, whilst security is included in the BACnet standard, it is not implemented by vendors of building automation systems. The lack of focus on security can lead to vulnerabilities in the protocol being exploited, with the result that the systems and the buildings they control are open to attack. This paper describes two proof-of-concept protocol attacks on a BACnet system, proves one attack using experimentation and the other attack through simulation. The paper contextualises a range of identified attacks using a threat model based on the STRIDE threat taxonomy.

Keywords: Building automation · State modelling · Security · Heating ventilation and air conditioning

1 Introduction
Building automation systems, henceforth referred to as BAS, are responsible for monitoring and control of key facilities management functions such as lighting, security, heating, ventilation and air-conditioning. As such, BAS are defined as Critical Infrastructure by the United States National Institute of Standards and Technology [1]. BAS rely on communication between devices, with attendant logging (for audit purposes), to assure safe operation of services internal to a building. Given that BAS functions have expanded considerably, partly due to the rise of the “Industrial Internet”, managing the expanded attack surface in a complex environment is, from a cyber-security viewpoint, a prime priority for ensuring the integrity of building operations. BACnet is one of several popular BAS protocols. Whilst the BACnet security model is valid, there are vulnerabilities present in implementations of the protocol. For example, the change
of value (CoV) reporting method can be exploited to cause network bandwidth saturation, effectively resulting in a Denial of Service attack [2]. One solution, where the use of CoV is limited by a buffer, moves the problem from the network to a security issue, but this can prevent key devices from receiving CoV notifications, which effectively nullifies the utility of using CoV reporting [3]. Further, the nature of BAS means that their operation bridges the cyber-physical divide; it is therefore essential to define priority levels for commands to resolve conflicts, as critical sub-systems must have guaranteed priority. Unfortunately, the description of priority within the current BACnet standard with respect to cyber-physical properties lacks precision. For example, system behaviour is undefined when multiple devices write to the same property with the same priority level [4]. Therefore, different vendors could have devices that behave differently when presented with the same data. This is a valid scenario in a large, multivendor environment. A potential vulnerability is thereby exposed, as malicious writes could be presented as legitimate value changes, with the ensuing events having serious consequences for the building controlled by the vulnerable BAS. This work expands upon the work presented in [2], which discussed the key elements of BACnet and security issues related to data value handling in BACnet. Two potential issues were raised, viz. CoV and bounded priority arrays, with a theoretical attack pertaining to the former discussed. Subsequently, an experimental framework was described and implemented, which tested the presented CoV vulnerability. This expanded work provides a simulation of an attack based on the bounded priority array issue discussed in [2]. Finally, threat modelling of a BACnet controller scenario is undertaken using the STRIDE threat taxonomy.
2 Background
Security issues in BAS are an emergent threat, with security analyses undertaken by [5–7] and later [8,9]. [5,6] identify Denial of Service, eavesdropping and buffer overflows in core functions of the protocol, while [7] primarily discuss the limitations of secure communication exchange in BAS. [5,6] outline the increased connectivity between previously isolated BAS networks, enterprise networks, and the Internet for remote management, an increasing trend discussed in [8]. Further attacks, including covert channels and Denial of Service, are defined in [10,11] respectively. A number of these stated vulnerabilities in BACnet have had mitigations suggested, such as the BACnet firewall [12]; however, [8] identifies the potential of legitimate-yet-malicious commands operating inside BACnetworks, which would not pass traffic through a border firewall. A potential solution is a BAS-specific intrusion detection system (IDS), the focus of works by [10,11,13].
3 BACnet
BACnet is an object-oriented peer-based protocol for managing building automation systems. Released in 1995 as an ASHRAE/ANSI standard, BACnet gained ISO standardisation (ISO 16484-5) in 2003. BACnet is actively maintained, with reviews of the protocol occurring every four years until 2008, thence changing to biennial reviews [14]. BACnet focuses on the network layer and above, with the goal of being operable on any data link and physical medium. The BACnet standard defines data structures to represent the communications and devices on a BACnetwork. The core data structure is an object, of which there are 54 standard types. When combined, objects can represent a device, with each object containing properties which further define the features of the object. Similarly, communications on the network are defined in the BACnet standard as services, of which there are 38 types [4]. A core process of BACnet is communicating values across the network between devices to allow automatic actions to occur.

3.1 Data Collection in BACnet
Devices require correct, timely information in order to act appropriately to events occurring in a building. In BACnet, reportable information is defined as either value-based or event-based, both of which are time-stamped. As BACnet is a peer-based network, any device in a BACnetwork may request values from other devices, or be notified of events occurring. To accommodate peer-based communications, BACnet devices are by default passive servers which listen for requests and service received requests. Each data sharing transaction is represented as a client/server request, where the device requesting data is a client, and the targeted device providing the data is the server. A server in this case might be a temperature sensor, whilst a client might be a HVAC controller. Depending on the data type, either value or event, the retrieval process differs. Value data is shared using one or more of polling, CoV reporting, or triggered collection, while event data uses event logs to collect notifications.
The three methods mentioned for value-based reporting are implemented using a Trend Log object, which can monitor one object or property on a device. When a value changes, a notification is stored in an internal buffer in the Trend Log object, as a Trend Log record. Each record holds the timestamp of when the change occurred, and the changed value of the monitored property.
Polling uses the default passive model previously discussed, where data retrieval queries are made to the server device at defined time intervals. A problem with polling is the potential for state and value changes between polling intervals: value changes will not be captured unless they are still in effect when a query is sent. A careful balance is required, as increasing the frequency of polling can impact the throughput of the network significantly [3]. An alternative is an active data collection method called CoV reporting. CoV reporting uses a subscription-based method, where a client may request a subscription to a particular object or property value, shown in Fig. 1. The subscription details the property to monitor, the criteria for issuance of notifications, and a lifetime for the subscription. The CoV reporting method may be Confirmed, with acknowledgement of notifications, or Unconfirmed, without acknowledgement. A third method is triggered collection, which defines a boolean property in a device; when the property is true, data is retrieved from the device. External network writes and internal processes such as alarms or other events can
Fig. 1. Confirmed CoV transaction between a client device and a server device (CoV subscription request, value change, write to buffer, wait for ACK, CoV ACK), taken from [2, p. 548].
cause the trigger to become true, causing an immediate acquisition of values to occur. Any property may be monitored using a Trend Log object, and the property information stored in the buffer can be of any data type, making the buffer an unrestricted data-type storage area and a potential vulnerability.
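The subscription fields described above can be summarised in a small Python sketch; the field names below are illustrative rather than the identifiers used in the BACnet standard.

from dataclasses import dataclass

@dataclass
class CovSubscription:
    monitored_property: str   # e.g. "analog-input,1 present-value"
    cov_increment: float      # notify when the value changes by this much
    lifetime_s: int           # subscription expires after this many seconds
    confirmed: bool           # True: each notification must be acknowledged

sub = CovSubscription("analog-input,1 present-value", 0.5, 3600, True)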
3.2 Conflict Resolution
As mentioned previously, all BACnet devices are peers, meaning any device connected to the network may write to any writable property on any other device. As some property values directly cause cyber-physical actions to occur, conflict resolution in the form of a priority system is implemented. BACnet accounts for the potential of conflicting commands through a conflict resolution process where properties are split into commandable and writable types. Commandable properties are defined as those whose value change causes physical actions, while all other properties are defined as writable. Conflict resolution is only applied to commandable properties, whereas writable properties have no priority mechanism, meaning the last write to the property overwrites the previous value. The present value property of most objects in BACnet is classed as a commandable property, for example in Analog Output and Binary Output objects. BACnet devices interact with commandable properties using the WriteProperty or WritePropertyMultiple service requests. The request primitive for both services contains three parameters: Property Identifier, Property Value and Priority. These contain the commandable property's ID, the desired value for the property, and the priority value, respectively. The priority value is a number between 1 and 16 for that WriteProperty service request, the lowest number having the highest priority. The BACnet standard defines consistent representations for the applications of command priority levels, with 5 defined applications and 11 flexible open applications which can be implementation specific, shown in Table 1. When a client device no longer requires access to the commandable property in a service provider, a relinquish command is sent to the provider, using WriteProperty or WritePropertyMultiple service requests. The relinquish request uses the same parameters as a normal write request, with the property value parameter set to NULL. When a device is notified that no service request exists at that priority level, the next priority array element is checked for a service request. When all elements in the priority array are NULL, the property value is set to the value defined in the Relinquish Default property of the object.
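A minimal Python sketch of the 16-slot priority array behaviour described above follows; it is an illustrative model, not vendor code, and it also exposes the same-priority overwrite discussed in Sect. 4.2.

class PriorityArray:
    def __init__(self, relinquish_default):
        self.slots = [None] * 16            # index 0 = priority 1 (highest)
        self.relinquish_default = relinquish_default

    def write(self, value, priority):
        # No source authentication: a write at an occupied priority level
        # silently overwrites it, so the last writer wins.
        self.slots[priority - 1] = value

    def relinquish(self, priority):
        self.slots[priority - 1] = None     # a NULL write

    def present_value(self):
        # Effective value is the highest-priority non-NULL slot, falling
        # back to the Relinquish Default when the whole array is NULL.
        for value in self.slots:
            if value is not None:
                return value
        return self.relinquish_default

pa = PriorityArray(relinquish_default=21.0)
pa.write(18.0, priority=8)   # manual operator
pa.write(16.0, priority=8)   # another device, same priority: overwrites
print(pa.present_value())    # 16.0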
Table 1. BACnet Priority array applications, taken from [2, p. 548].

Priority level   Application
1                Manual Life Safety
2                Automatic Life Safety
3                Available
4                Available
5                Critical equipment control
6                Minimum on/off
7                Available
8                Manual operator
9                Available
10               Available
11               Available
12               Available
13               Available
14               Available
15               Available
16               Available

3.3 BACnet Security
BACnet was not designed with security as a primary requirement, as early systems operated in isolated networks without any external connection. With the increasing connection of BACnetworks to external-facing networks, such as the Internet or enterprise networks for maintenance, the attack surface has increased. Holmberg [5] documented a number of the threats to BACnet, with further work being undertaken by [6] to increase the security features of BACnet. In 2009, the suggestions from [5] were ratified as BACnet security services (BSS), in Addendum g to the BACnet standard. Despite being an official addendum to the standard, BSS is not widely implemented, with Michael Newman, the original chairman of BACnet, stating that “no company has yet implemented it in a commercially available product” [15, p. 44]. Additionally, Newman [15] suggests that BACnet/IP could be secured using IP security solutions, such as IPsec, TLS, or Kerberos. In contrast, given that a number of BACnet vulnerabilities are not solved by encryption (as provided by the aforementioned protocols) but instead require authentication, the coverage of the attack surface is incomplete and therefore not all BACnet vulnerabilities are mitigated.
4 Security Issues in Value Changes
One way in which objects on a BACnetwork can interchange data is by a client-server relationship. In contrast to the conventional client-server model,
the “server” is the Service Provider or transmitting device (such as a thermostat) and the “client” is a controlling device, such as an air conditioning control unit. The BACnet client-server model exhibits the following behaviours:
1. One or more clients can subscribe to the Service Provider.
2. Requests can be queued.
3. The queue is stored on the Service Provider.
The client-server model is often implemented by the Observer design pattern [16]. It is valuable to examine the requirements of the Observer pattern, as this informs our discussion of a specific BACnet vulnerability with respect to the bounded priority array. The Observer pattern describes a one-to-many connection between objects such that when an object undergoes a state change, its connected objects are notified. Conventionally, the object that is the source of notification events is called the subject, and the observer objects register with the subject in order to be notified of changes in the subject. As the subject is the sole owner of the data, the observers are reliant on the subject to update them when state changes occur. Consistent with the loose coupling exhibited by many design patterns, observers can be added to or removed from a list maintained by the subject at any time. The Observer pattern is common in user interface toolkits and is found in the model-view-controller architecture. A concrete example would be the different views of the same object held by spreadsheet programs, such that when a data cell changes, a chart is re-drawn automatically as the latter is notified of the change.
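The sketch below is a generic Python rendering of the Observer pattern just described, with illustrative names rather than BACnet identifiers.

class Subject:
    # The sole owner of the data; maintains the observer list.
    def __init__(self):
        self._observers = []
        self._value = None

    def register(self, observer):
        self._observers.append(observer)

    def set_value(self, value):
        self._value = value
        for observer in self._observers:   # notify on every state change
            observer.update(value)

class Observer:
    def __init__(self, name):
        self.name = name

    def update(self, value):
        print(f"{self.name} notified of change: {value}")

thermostat = Subject()                     # e.g. a temperature source
thermostat.register(Observer("controller"))
thermostat.set_value(21.5)                 # controller notified of change: 21.5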
4.1 BACnet Change of Value
In relation to the observer model described, the Service Provider acts as the subject and maintains a list of subscribers (observers): the Active CoV subscriptions list holds the CoV subscriptions on the Service Provider. As mentioned, CoV subscriptions may be either Confirmed or Unconfirmed, a data service type reminiscent of TCP or UDP in operation. Unconfirmed CoV reporting sends a notification to the subscriber when a value changes within the CoV threshold of the subscription. Confirmed CoV reporting incorporates acknowledgement of change, with an ACK packet to be sent from the subscriber to the device serving the data. Due to the many media on which BACnet operates, some devices take longer to acknowledge a Confirmed CoV than others. Two device object properties, APDU Timeout and Number Of APDU Retries, set on the client device, determine how long to wait for an acknowledgement and how many times to retry waiting. The values for these properties are vendor specific: the BACnet standard suggests between 6,000 ms and 10,000 ms [15], while the de facto standard set by the vendors is a 3,000 ms wait time with 3 retries (see Table 2). Some vendor guides suggest a 20,000 ms or even 60,000 ms wait time, dependent on the capability of older devices; one guide suggests all APDU timeouts should be set to the highest
value in the system. If a subscriber is offline when a Confirmed CoV notification is sent, the CoV server device will wait the length of the APDU Timeout of the client device, and then retry the CoV notification the specified number of times before processing the next CoV notification. Given the length of some APDU timeout values and the number of retries, significant network delays can occur [3]. Additionally, when a subscriber goes offline, CoV messages are not stored or queued; therefore, if a subscriber returns to the network, data synchronisation can be lost. If the subscription is Unconfirmed, there is no feasible way to determine whether the subscriber has received the CoV notification, or to tell whether the subscribing device is offline. A combination of polling and CoV is suggested to counteract devices power cycling; however, the solution is not complete, as the logging device for the system could suffer from the same issue, and the log of a CoV would never be recorded [3]. Subscriptions to devices are not persistent between power cycles, meaning that if a device is reset for any reason, the subscriptions to other devices will not be preserved and must be re-established.

Table 2. BACnet vendor APDU-timeout default values, taken from [2, p. 548].

Vendor                  APDU timeout value (ms)   APDU retries
ScadaEngine             500                       5
Kepware                 1,000                     3
Siemens                 3,000                     3
Contemporary Controls   3,000                     3
Obvius                  6,000                     3
Tridium                 20,000                    3
Viking Controls         3,000/60,000              3
A solution exists for automatic re-subscriptions, where a duration property exists in the Trend log object, which triggers a CoV subscription to occur. However, the integrity of the BACnetwork is then dependent on client devices resubscribing to keep a synchronised and robust system. Due to the capacity of BACnetworks, oversubscription of CoV notifications is plausible. Robust testing of oversubscription is often not carried out, due to the risk of damage to devices [3,15]. An implementation guideline solution suggests limiting the number of subscriptions a device is capable of holding, to reduce the potential traffic density. A device on the BACnetwork may subscribe to the same object multiple times, as the unique identifier for each subscription is self-assigned. Each implementation of a device may have a limit applied to the quantity of subscriptions that each device can initiate. This bottleneck creates a network security issue, where critical devices will not receive notifications due to the subscription limit being reached (malicious or not).
4.2 BACnet Bounded Priority Arrays
A separate yet related issue with the array implementation of BACnet is that each priority value in the array may have only one command stored at a time. As described, upon relinquishing an array position, a device writes a NULL value to the array position, which causes “unknown behaviour” for any queued command in the same array position [4]. The priority array attempts to create a queue of values to be entered into a commandable property, but source authentication is not specified for write commands entering the priority table. An issue similar to that of writable properties is created, where a write-last-wins scenario occurs when same-priority commands are issued, negating the purpose of the array. Additionally, as there is no verification of priorities on commands, any device may change the value of a commandable property at any priority. Similar to Confirmed CoV notifications, these commandable property writes are legitimate traffic which is expected to occur on the network. [13] successfully assessed the possibility of detecting multiple high-priority writes using time deltas between identical packets. Potentially, the same type of detection could be applied to the priority of identical packets from the same source device to determine malicious behaviour.
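The following Python fragment is a hypothetical illustration of such a time-delta heuristic, in the spirit of [13]: same-priority writes from one source arriving unusually close together are flagged. The timestamps and threshold are invented for the example.

# Arrival times (s) of writes with identical source and priority.
write_times = [10.00, 10.05, 10.09, 10.14]
THRESHOLD_S = 0.5  # assumed minimum plausible spacing for benign writes

deltas = [b - a for a, b in zip(write_times, write_times[1:])]
suspicious = [d for d in deltas if d < THRESHOLD_S]
if suspicious:
    print(f"{len(suspicious)} same-priority writes within {THRESHOLD_S} s")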
4.3 Theoretical Attack Scenario
A potential scenario could manifest as a malicious device sending Confirmed, low-threshold CoV subscriptions to every supported device on the BACnetwork, and then disconnecting from the network. Whenever a value on any device changes, the malicious device will be notified; but as the malicious device is offline, each legitimate device will then wait for the timeout to expire before sending the next notification. A model of the attack is detailed in Fig. 2. As the APDU timeout property is defined on the client of the transaction, the malicious device may set the length of time to wait and the number of retry attempts [15]. Detection of the attack may be achievable given the vendor timeout values detailed in Table 2, as values which differ drastically from the de facto standards may be easily identified. However, as a device may subscribe multiple times to another device, a proposed attack chain involves subscribing within the normal time values, up to 20 s, with three retries and multiple subscriptions, thus increasing the Denial of Service on the device with each subsequent subscription. The impact of manipulating the timeout values depends on the building services provided and the criticality of the building's contents. Preventing values from being reported on the network is of serious concern, with a clear reduction in availability of the cyber-physical systems controlled on the network. Given that the commands used to cause the attack are legitimate, and can appear to be normal traffic on a BACnetwork, a different type of detection method, with contextual analysis of the network, is required to identify these situations.
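A toy sequential model in Python (not the BACnet stack) can use the vendor values from Table 2 to estimate how long an offline Confirmed-CoV subscriber stalls a server that delivers notifications one at a time; device behaviour is simplified for illustration.

APDU_TIMEOUT_S = 20.0  # client-declared timeout (e.g. the Tridium default)
APDU_RETRIES = 3

def delivery_time(subscriber_online: bool) -> float:
    # Seconds spent delivering one Confirmed CoV notification.
    if subscriber_online:
        return 0.1  # assumed notification + ACK round trip
    # No ACK ever arrives: the initial wait plus each retry times out.
    return APDU_TIMEOUT_S * (1 + APDU_RETRIES)

value_changes = 5
blocked = value_changes * delivery_time(False)
print(f"{value_changes} value changes held up for {blocked:.0f} s")  # 400 s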
Fig. 2. Malicious confirmed CoV transaction, taken from [2, p. 551] (CoV subscription request, client disconnect, value change, write to buffer, wait for ACK, CoV ACK).
5 Attack Implementation
Experimentation is required to examine the impact of this attack, and the potential for identification and classification. As described, robust CoV testing is often not undertaken due to fear of network and physical degradation. As such, an experimental setup was developed for generation of network traffic and manipulation of CoV values; the materials used are noted in Table 3.

Table 3. Experimental setup materials, taken from [2, p. 550].

Software/Hardware                     Details
Raspbian Jessie                       R2016-09-23, V4.4
CBMS Studio                           V1.3.7.1204
CBMS engineering configuration tool   V1.3.1221.7
BACnet open stack                     V0.82
Windows 7                             SP1
Wireshark                             V1.12.1
3x Raspberry Pi 2                     1 GB RAM, 16 GB SD card
Cisco 3560 switch                     SPAN configured
Windows 7 desktop                     16 GB RAM, 128 GB SSD
Fig. 3. Experimental simulation setup, adapted from [2, p. 551]: the RpiThermostat, RpiController, RpiLogger, and Windows7HMI devices connected via a Cisco 3560 switch.
Table 4. Initial experimentation results, taken from [2, p. 552].

No.    Time        Source        Destination   Protocol  Length  Info
80040  19:58:32.5  192.168.1.12  192.168.1.5   BACnet    85      Confirmed-REQ confirmedCOVNotification [67] device,9999 analog-output,1 present-value
80041  19:58:32.5  192.168.1.5   192.168.1.12  BACnet    60      Reject unrecognized-service [67]
80327  19:58:48.4  192.168.1.12  192.168.1.5   BACnet    85      Confirmed-REQ confirmedCOVNotification [69] device,9999 analog-output,1 present-value
80510  19:58:58.4  192.168.1.12  192.168.1.5   BACnet    85      Confirmed-REQ confirmedCOVNotification [69] device,9999 analog-output,1 present-value
80656  19:59:08.4  192.168.1.12  192.168.1.5   BACnet    85      Confirmed-REQ confirmedCOVNotification [69] device,9999 analog-output,1 present-value
81438  19:59:48.5  192.168.1.12  192.168.1.5   BACnet    85      Confirmed-REQ confirmedCOVNotification [70] device,9999 analog-output,1 present-value
81439  19:59:48.5  192.168.1.5   192.168.1.12  BACnet    60      Reject unrecognized-service [70]
The experimental setup resembles a subset of a BACnet-controlled HVAC service. A BACnet controller acts as a client, subscribing to the thermostat's present value property; when a value change occurs, the controller is notified. A CBMS BACnet server instance runs on the thermostat Raspberry Pi, emulating a thermostat.
The controller's subscription commands are sent using the BACnet protocol stack version 0.81. Configuration of the thermostat is undertaken using the CBMS engineering tool running on the Windows 7 HMI. All traffic which passes over the network is port-mirrored to the Raspberry Pi logging device, using Wireshark for packet capture and analysis. A graphical depiction of the experimental setup is shown in Fig. 3. Initial experimental results, shown in Table 4, are encouraging, with the Raspberry Pi thermostat server device failing when a maliciously controlled client disconnected from the network. Packets 80040 and 80041 show a normal CoV notification and ACK details (shown as a reject due to a limited server implementation). Packets 80327, 80510 and 80656 show three attempted notifications, 10 s apart, after the device has disconnected. During the 30 s window, the tracked server value was changed multiple times; none of these changes were reported as the server device waited for the acknowledgement packet from the disconnected client. Entries 81438 and 81439 show a value change and subsequent return to a normal notification and ACK, shortly after the client device was reconnected. No CoV notifications were generated by the server device while it waited for the ACK from the offline client. When the client reconnected, multiple changes in value that would normally trigger alerts were never disseminated to the client. From this initial experiment, the extent of the attack is unclear. Further experimentation is required to incorporate additional subscribing devices, to determine if the attack causes the server device to hang. If the server device hangs, it may prevent other subscribed devices from receiving change of value alerts, as described in the BACnet specification. Modelling the legitimate-yet-malicious scenarios may reveal further information, such as a network pattern or activity which could indicate a malicious event taking place from legitimate commands. Such a pattern or activity could be suitable for a detection rule.
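For reference, the capture side of such an experiment can also be scripted; the sketch below uses scapy to sniff BACnet/IP traffic on its well-known UDP port 47808 (0xBAC0). The interface name is an assumption, and the paper itself used Wireshark rather than this script.

from scapy.all import sniff  # requires scapy and capture privileges

def show(packet):
    print(packet.summary())

# BACnet/IP rides on UDP port 47808; capture ten datagrams and stop.
sniff(filter="udp port 47808", prn=show, iface="eth0", count=10)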
6 Modelling Bounded Priority Arrays
We used AnyLogic simulation software to model the priority-level security vulnerability described previously in Sect. 4.2. We developed a conceptual model (see Fig. 4) of the problem space, and a dynamic agent-based simulation model sufficient to illustrate the problem with objects and their associated properties in a BACnetwork. A typical simulation run is shown in Fig. 5. An object can be connected to the priority list at any priority level (as per the BACnet specification [4]). The problem arises because of a lack of source authentication and integrity checking on property writes (as seen in Tables 5 and 6). As identified, there exists a potential Denial of Service method in the BACnet priority array implementation. Modelling the priority array has confirmed that the issue exists, namely, that two devices can write to the same priority level in the array, with the second device overwriting the first device's value. Investigating this phenomenon on a BACnet implementation, such as the BACnet open stack, could reveal further details of the effect.
Fig. 4. DFD of bounded array priority process.

Fig. 5. Sample simulation run of BACnet bounded priority array.
Table 5. Array after the device writes the property value at priority 1.

Priority level   Value   Identifier
1                11      Device
2                NULL
3                NULL
4                NULL
5                NULL
6                NULL
...
16               NULL

Table 6. Array after the other device overwrites the property value at priority 1.

Priority level   Value   Identifier
1                68      Other device
2                NULL
3                NULL
4                NULL
5                NULL
6                NULL
...
16               NULL

7 Threat Modelling of BACnet Controllers
As discussed in [2], threat models are often used to identify threats and systematically reveal vulnerabilities in a system. The core requirements for a threat model are the identification of adversaries and their motivations, a description of the system, and the identification of threats against the system. There are various taxonomies related to adversary identification and classification identified in [17]. Generic adversaries and their motivations can be found in [18], and are summarised in Table 7. Traditionally, in relation to BAS, the insider adversary posed the highest threat, given the requirement of physical access to interact with the BAS network and the required knowledge of system behaviour. However, with increased connectivity to the Internet and enterprise networks, the adversarial landscape has shifted. The ability to issue legitimate in-bounds commands that cause malicious actions provides adversaries with a remote avenue of attack, either against the BAS directly, or as a pivot point into the enterprise network. Given the modus operandi defined in Table 7, BAS could be targeted by each adversary to achieve their stated motivations, dependent on the targeted building system.
Table 7. Summary of cyber adversaries.

Adversary                    Modus operandi                                                                                    Motivation
Novice                       Denial of Service; pre-written scripts                                                           Attention seeking
Hacktivist                   Denial of Service; defacement; information disclosure; sabotage                                  Prestige; political cause
Insider                      Information disclosure; internal knowledge                                                       Revenge
Coder                        Develop scripts; avoid exposure                                                                  Power; prestige
Organised crime (Blackhat)   Highly skilled; well resourced; targeted attacks                                                 Money; greed
Cyber terrorist              Destabilise, disrupt or destroy cyber or physical assets; highly skilled; well resourced; state sanctioned   Ideology
Nation-state                 Destabilise, disrupt or destroy cyber or physical assets; highly skilled; well resourced; information theft   National interests
An example BACnet managed HVAC environment is presented as a data flow diagram (DFD) in Fig. 6. The DFD identifies core data handling processes between a controller and other devices in the BACnetwork. Sensors and actuators directly related to the controller, in addition to other subnetwork devices, are generalised as the external “Any device”. The HMI entity represents a management interface from which manual commands are executed. The logger entity is an external device which logs data collected by the controller from other devices. Various authors have identified and classified threats against BACnet. Holmberg [5], in their 2003 threat assessment, identified a range of attacks against BACnet, with classification split into two distinct parts: IT based, which represents generic Internet Protocol based attacks, and BACnet protocol specific attacks. Within the BACnet attack class, there are five categories: snooping, application service attacks, network layer attacks, network layer Denial of Service, and application layer Denial of Service. Similarly, Kaur [10] identified three classes of attack against BACnet: adapted from IT (equivalent to Holmberg's IT based), non-conformance, and protocol vulnerability. Further, Caselli [11] defined
Fig. 6. DFD of controller in a BACnet managed HVAC scenario.
snooping, Denial of Service, and process control subversion. From the three identified sources, each classification approach reveals several distinct BACnet specific attacks, using various BACnet services. To consolidate these approaches, the known attack types described in [2,5,10,11,13] against BACnet have been collated with a BACnet specific view, excluding IP based attacks. The specific commands and properties which are used in each attack are identified and classified using the STRIDE threat matrix [19], presented as Table 8. The example scenario was analysed by first reducing the BACnet elements identified in Fig. 6 to unique elements. The STRIDE threat classes against each element were defined, and then correlated with the threat-classified attacks in Table 8. The total threat each attack poses to the scenario's elements was generated, with a subset presented as Table 9. In total, the threats posed by all attacks against the scenario numbered 652. Of the 652 threat instances, the largest threat class identified against the scenario was Denial of Service, with 306 occurrences. Given that a core objective of building control systems is availability, Denial of Service attacks are particularly damaging. The specific attacks which have the highest impact on the scenario were derived by collating the threats of each attack against each element in the scenario; for example, A2 has associated threats of S, T and D, which apply to multiple elements in the scenario, resulting in a total of 39 threat counts against the scenario. Six attacks, including the discussed CoV attack, were identified as equal highest, with 39 threat counts each; full detail is shown in Table 10.
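The tallying just described can be sketched in Python on a fragment of Table 9 (part of the A2 column); the letters S, T, R, I, D and E are counted per element and summed, which reproduces the per-attack totals of Table 10 when run over the full matrix. The element subset below is illustrative.

# Threat letters per element for attack A2 (fragment of Table 9).
A2_THREATS = {
    "Alarm alert": "T,D", "Alarm values": "T,D", "Any device": "S",
    "HMI": "S", "Logger": "S", "Poll data": "T,D",  # ...remaining elements
}

def tally(column):
    counts = {}
    for cell in column.values():
        for letter in cell.split(","):
            counts[letter] = counts.get(letter, 0) + 1
    return counts, sum(counts.values())

print(tally(A2_THREATS))  # ({'T': 3, 'D': 3, 'S': 3}, 9) for this fragment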
Table 8. Known attacks classified against the STRIDE matrix.

Attack type            Specific command/property
Denial of Service      Subscribe-CoV
Denial of Service      Router-busy-to-network
Denial of Service      I-am-router-to-network
Denial of Service      Router-available-to-network
Denial of Service      Disconnect-connection-to-network
Denial of Service      DeleteObject
Flooding               Who-is-router-to-network
Flooding               Reinitialize-device
Flooding               I-am
Flooding               Any malformed packet
Malformed broadcast    Reject-message-to-network
Malformed broadcast    CreateObject
NetworkLoop            InitializeRoutingTable
Reconnaissance         NPDU probes
Reconnaissance         Read property
Reconnaissance         Whois
Reconnaissance         Whoami
Reconnaissance         Who-is-router-to-network
Reconnaissance         WhoHas
Routing table attack   Initializeroutingtable
Shutdown/Reboot        Reinitialize-device
Smurf attack           Source address manipulation
Spoof device           I-am
Traffic redirection    I-am-router-to-network
Traffic redirection    Router-available-to-network
Write attack           Write property
Through this classification, it can be determined where mitigation development and protocol hardening should be focused. In addition, the classification highlights which legitimate commands have a higher potential for malicious use. Further investigation of the uses of each identified command can be used to derive normal operational patterns, in the context of the specific implementation, to build rulesets for intrusion detection systems.
8 Discussion
We have described, simulated and tested several vulnerabilities in BACnet that could have serious consequences for Critical Infrastructure.
Table 9. Identified BACnet specific attacks, classified by the STRIDE matrix.

Element                    A1   A2   A6   A7  A11  A13  A14  A20  A21  A22  A23  A24  A26
Alarm alert                T,D  T,D  T,D  D   D    D    I    D    D    D    T,I  I    T,D
Alarm values               T,D  T,D  T,D  D   D    D    I    D    D    D    T,I  I    T,D
Any device                 S    S    S    S   S    S    -    -    -    S    S    S    -
Check threshold values     T,D  T,D  T,D  D   D    D    I    D    D    S,D  T,I  I    T,D
Commence polling command   T,D  T,D  T,D  D   D    D    I    D    D    D    T,I  I    T,D
HMI                        S    S    S    S   S    S    -    -    -    S    S    S    -
Logger                     S    S    S    S   S    S    -    -    -    S    S    S    -
Manual data read           T,D  T,D  T,D  D   D    D    I    D    D    S,D  T,I  I    T,D
Move to logger             T,D  T,D  T,D  D   D    D    I    D    D    S,D  T,I  I    T,D
Poll data                  T,D  T,D  T,D  D   D    D    I    D    D    S,D  T,I  I    T,D
Polled data                T,D  T,D  T,D  D   D    D    I    D    D    D    T,I  I    T,D
Property value(s)          T,D  T,D  T,D  D   D    D    I    D    D    D    T,I  I    T,D
Raise alarm                T,D  T,D  T,D  D   D    D    I    D    D    S,D  T,I  I    T,D
Read command               T,D  T,D  T,D  D   D    D    I    D    D    D    T,I  I    T,D
Subscribe to device        T,D  T,D  T,D  D   D    D    I    D    D    S,D  T,I  I    T,D
Subscription               T,D  T,D  T,D  D   D    D    I    D    D    D    T,I  I    T,D
Subscription command       T,D  T,D  T,D  D   D    D    I    D    D    D    T,I  I    T,D
TrendLog                   T,D  T,D  T,D  D   D    D    I    D    D    D    T,I  I    T,D
TrendLog values            T,D  T,D  T,D  D   D    D    I    D    D    D    T,I  I    T,D
Write command              T,D  T,D  T,D  D   D    D    I    D    D    D    T,I  I    T,D
Write value                T,D  T,D  T,D  D   D    D    I    D    D    S,D  T,I  I    T,D
These vulnerabilities in themselves are not the problem, as we foresee a significant second-order effect arising from the realisation of the threats. For example, a potential second-order attack could be crafted to take advantage of one vulnerability by delaying the response to a temperature controller, adversely affecting a bank's datacentre.
Table 10. Total threat counts based on known attacks against scenario.

       A1   A2   A3   A4   A5   A6   A7   A8   A9   A10  A11  A12  A13
S      3    3    3    3    3    3    3    3    3    3    3    3    3
T      18   18   18   18   0    18   0    0    0    0    0    0    0
R      0    0    0    0    0    0    0    0    0    0    0    0    0
I      0    0    0    0    0    0    0    0    0    0    0    0    0
D      18   18   18   18   18   18   18   18   18   18   18   18   18
E      0    0    0    0    0    0    0    0    0    0    0    0    0
Total  39   39   39   39   21   39   21   21   21   21   21   21   21

       A14  A15  A16  A17  A18  A19  A20  A21  A22  A23  A24  A25  A26  Total
S      0    0    0    0    0    0    0    0    10   3    3    3    0    58
T      0    0    0    0    0    0    0    0    0    18   0    0    18   126
R      0    0    0    0    0    0    0    0    0    0    0    0    0    0
I      18   18   18   18   18   18   0    0    0    18   18   18   0    162
D      0    0    0    0    0    0    18   18   18   0    0    0    18   306
E      0    0    0    0    0    0    0    0    0    0    0    0    0    0
Total  18   18   18   18   18   18   18   18   28   39   21   21   36   652
The loss of banking data, or even the time delay whilst such data are recovered from a backup, are significant events for any financial institution and its clients. The ability to detect these legitimate events may be linked to the time between events; therefore, monitoring the timeout durations on devices waiting for acknowledgements is the likely way forward. The peer-based nature of the protocol, coupled with multiple subscriptions and a lack of source authentication, means that there is the potential to have this duration value increased by an adversary. As the duration value is unbounded, an adversary can execute a Denial of Service attack through normal operation of the system. Due to the time-dependent nature of cyber-physical systems, models that can express time are useful for verifying certain properties of the system. Clearly, the CoV communication process of the BACnet protocol, where time is a factor for successful completion of a command, is one such area. It is likely that any modelling notation/technique that can express a queue and has some notion of state could represent this type of problem and be able to verify aspects of the system. Z [20], Communicating Sequential Processes [21], or even possibly the Object Constraint Language [22] could be potential candidates. Each of these formal notations could model the BACnet change of value reporting. The modelling of the bounded priority array has formalised the issues identified from the protocol specification. The optional provisions of the specification mean that vendor devices can be compliant with the specification, but still be exposed to the threat identified and tested via simulation in this research.
The use of a threat model has contextualised a range of known attacks against BACnet through the STRIDE threat classification matrix. Correlating the known attacks against the typical controller scenario has revealed that a large portion of the known attacks are direct threats to availability, a core requirement of control systems. From the viewpoint investigated, authentication attacks are discounted, given that the BACnet protocol relies on additional protocols for authentication. Non-repudiation attacks are likewise not a concern: given the lack of native source authentication and the trusting nature of the protocol, there is no requirement to deny that communication occurred.
9 Conclusion
This paper described a proof-of-concept attack on any building automation system that uses the BACnet protocol CoV reporting function as part of its communication and control, and suggested that a formal proof of the attack would be valuable. We found that while BACnet has a security addendum to the standard which defines source authentication, this is not implemented in practice. We deployed an experimental configuration to investigate the phenomena derived from the BACnet standard, with promising initial results. We defined a situation where multiple subscriptions and a lack of source authentication can cause a failure of critical infrastructure, leading to a more serious second-order effect, using banking systems as an example.

The issue of bounded priority arrays was discussed, and a model was developed and tested. The model confirmed the identified issue in the context of the behaviour of the priority array when objects write to commandable properties. As noted, when two devices write at the same priority level, the array is overwritten by the last object to write. Clearly, with a lack of source authentication, the problem the priority system attempts to resolve still exists. This is concerning, given that commandable properties directly cause cyber-physical actions to occur. Further, known attacks against BACnet were collated and classified according to the STRIDE threat matrix. A threat model against a BACnet controller was developed from a presented scenario; it classified known attacks and identified adversaries and their potential motivations.

In future work, we intend to expand our experiments to incorporate a larger network consisting of more subscribed devices, along with additional BACnet services. Further experiments will be undertaken to test the priority array relinquish issue in a BACnet context, with the aim of confirming the models presented, given network traces and memory allocation on the device. Additionally, the CoV experiments will be expanded, with investigation into the use of alternative stack implementations to verify whether the behaviour can be generalised to a range of devices. Finally, we will further explore modelling the protocol to derive identification patterns for known out-of-bounds attacks, and for in-bounds legitimate, yet malicious, commands.
Acknowledgement. The authors would like to thank Marcelo Macedo for his assistance in implementing the simulation environment. This research was supported by an Australian Government Research Training Program Scholarship.
References

1. Stouffer, K., Pillitteri, V., Lightman, S., Abrams, M., Hahn, A.: NIST Special Publication 800-82: Guide to Industrial Control Systems (ICS) Security. Special Publication, NIST, London (2015)
2. Peacock, M., Johnstone, M.N., Valli, C.: Security issues with BACnet value handling. In: Camp, O., Mori, P., Furnell, S. (eds.) Proceedings of the 3rd International Conference on Information Systems Security and Privacy - Volume 1: ICISSP, INSTICC, pp. 546–552. SciTePress (2017)
3. Chipkin, P.: BACnet for field technicians. Technical report, Chipkin Automation Systems (2009)
4. SSPC-135: BACnet: a data communication protocol for building automation and control networks (2012)
5. Holmberg, D.G.: BACnet wide area network security threat assessment. Technical report, NIST (2003)
6. Kastner, W., Neugschwandtner, G., Soucek, S., Newman, H.: Communication systems for building automation and control. Proc. IEEE 93, 1178–1203 (2005)
7. Granzer, W., Kastner, W.: Communication services for secure building automation networks. In: 2010 IEEE International Symposium on Industrial Electronics (ISIE), pp. 3380–3385 (2010)
8. Peacock, M., Johnstone, M.N.: An analysis of security issues in building automation systems. In: Proceedings of the 12th Australian Information Security Management Conference, pp. 100–104 (2014)
9. Valli, C., Johnstone, M.N., Peacock, M., Jones, A.: BACnet - bridging the cyber physical divide one HVAC at a time. In: Proceedings of the 9th IEEE-GCC Conference and Exhibition, pp. 289–294. IEEE (2017)
10. Kaur, J., Tonejc, J., Wendzel, S., Meier, M.: Securing BACnet's pitfalls. In: Federrath, H., Gollmann, D. (eds.) SEC 2015. IAICT, vol. 455, pp. 616–629. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-18467-8_41
11. Caselli, M.: Intrusion detection in networked control systems: from system knowledge to network security. Ph.D. thesis, University of Twente, Enschede (2016)
12. Holmberg, D.G., Bender, J.J., Galler, M.A.: Using the BACnet firewall router. ASHRAE Am. Soc. Heat. Refrig. Air Cond. J. 48, 10–14 (2006)
13. Johnstone, M.N., Peacock, M., den Hartog, J.: Timing attack detection on BACnet via a machine learning approach. In: Proceedings of the 13th Australian Information Security Management Conference, pp. 57–64 (2015)
14. SSPC-135: BACnet addenda and companion standards (2014)
15. Newman, H.M.: BACnet: The Global Standard for Building Automation and Control Networks. Momentum Press LLC, New York (2013)
16. Gamma, E., Helm, R., Johnson, R., Vlissides, J.: Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley Longman Publishing Co., Inc., Boston (1995)
17. Magar, A.: State-of-the-art in cyber threat models and methodologies. Report, Defence Research and Development Canada (2016)
18. Bernier, M.: Military activities and cyber effects (MACE) taxonomy. Taxonomy, Defence Research and Development Canada, Centre for Operational Research and Analysis (2013)
19. Howard, M., Lipner, S.: The Security Development Lifecycle. Microsoft Press, Redmond (2006)
20. Spivey, J.M.: The Z Notation: A Reference Manual. Prentice-Hall Inc., Upper Saddle River (1989)
21. Hoare, C.A.R.: Communicating sequential processes. Commun. ACM 21, 666–677 (1978)
22. Object Management Group (OMG): Object Constraint Language (OCL), Version 2.4 (2014)
Not So Greedy: Enhanced Subset Exploration for Nonrandomness Detectors

Linus Karlsson, Martin Hell, and Paul Stankovski

Department of Electrical and Information Technology, Lund University, P.O. Box 118, 22100 Lund, Sweden
{linus.karlsson,martin.hell,paul.stankovski}@eit.lth.se
Abstract. Distinguishers and nonrandomness detectors are used to distinguish ciphertext from random data. In this paper, we focus on the construction of such devices using the maximum degree monomial test. This requires the selection of certain subsets of key and IV bits of the cipher, and since this selection to a great extent affects the final outcome, it is important to make a good selection. We present a new, generic and tunable algorithm to find such subsets. Our algorithm works on any stream cipher, and can easily be tuned to the desired computational complexity. We test our algorithm with both different input parameters and different ciphers, namely Grain-128a, Kreyvium and Grain-128. Compared to a previous greedy approach, our algorithm consistently provides better results.

Keywords: Maximum degree monomial · Distinguisher · Nonrandomness detector · Grain-128a · Grain-128 · Kreyvium
1 Introduction
Stream ciphers are symmetric cryptographic primitives which generate a pseudorandom sequence of digits, called the keystream, which is then combined with the plaintext message to produce a ciphertext. To generate the keystream, a public initialization vector (IV) and a secret key are used. It is important that an attacker cannot use the public IV to deduce information about the keystream, since this would put the encrypted message at risk. To prevent such an attack, the key and IV are mixed during an initialization phase, before the stream cipher produces actual keystream. This initialization phase consists of a set of initialization rounds, during which the output of the cipher is suppressed. A cipher needs an adequate number of initialization rounds: too many, and the cipher will have poor initialization performance; too few, and an attacker may be able to perform an attack, e.g., a chosen-IV attack.

In this paper we will look into the design of distinguishers and nonrandomness detectors to perform cryptanalysis of different ciphers. The goal of such devices is to look at data and then determine whether the data is random data, or data
from a specific cipher. Recall that for a good cipher, the keystream should be pseudo-random, so it should be hard to construct such a detection device, since data from a cipher should appear random to an outside observer. Distinguishers and nonrandomness detectors differ in what degree of control an attacker has over the input. In a distinguisher, the key is fixed and unknown to the attacker; only the IV can be modified. In a nonrandomness detector, an attacker has more control, and can modify both key and IV bits.

The design of distinguishers and nonrandomness detectors has previously been discussed in the literature. Previous work such as [1] has considered the design of such devices by using a test called the Maximum Degree Monomial (MDM) test, which looks at statistical properties of a cipher to find weaknesses. This test requires the selection of a subset of the cipher's key and IV bits, which can be selected using, for example, a greedy algorithm, as described in [2]. We build upon this previous work and propose an improved, generalized algorithm which outperforms the greedy algorithm in finding suitable subsets. We also implement and test our algorithm, and present new results on the stream ciphers Grain-128, Grain-128a and Kreyvium.

This paper is an extended and revised version of [3]. The major novelties are the analysis of one more cipher (Kreyvium), a new test which investigates the effect of optimal starting subsets, and a more detailed description of our algorithm.

The paper is organized as follows. In Sect. 2 we present some required background, which is then used when describing our new algorithm in Sect. 3. Results are presented in Sect. 4, which is then followed by a discussion of related work in Sect. 5. Section 6 concludes the paper.
2 Background
In this paper we will mainly focus on the analysis of the two stream ciphers Grain-128a and Kreyvium. This selection of ciphers has been made since they share some interesting properties. They are both based on ciphers from the final eSTREAM portfolio (Grain v1 and Trivium, respectively), but modified to have 128-bit keys. Both ciphers also update their internal state relatively slowly: a small fraction of the internal state is modified in each clock cycle. This requires both ciphers to have many initialization rounds. For completeness, we start with a brief description of these two ciphers. After this, in the rest of this chapter, we discuss the Maximum Degree Monomial test in more detail.
2.1 Grain-128a
The Grain family of ciphers consists of a progression of ciphers, starting with Grain v1 [4], which is included in the final eSTREAM portfolio of ciphers. This was extended into a 128-bit key version as Grain-128 [5], and finally to the current version, Grain-128a [6].
Grain-128a is a stream cipher with a 128-bit key and a 96-bit IV. It supports two modes of operation, with or without authentication. For brevity, the following description will focus on the non-authenticated mode; refer to the original paper for an extended description. The cipher is constructed from three major parts: one LFSR of size 128, one NFSR of size 128, and one pre-output function h combining values from the LFSR and the NFSR. An overview of the cipher can be seen in Fig. 1.
Fig. 1. Overview of Grain-128a.
The functions $f(x)$ and $g(x)$ are the feedback functions for the LFSR and the NFSR respectively. They are defined as follows:

$$f(x) = 1 + x^{32} + x^{47} + x^{58} + x^{90} + x^{121} + x^{128}$$

and

$$\begin{aligned} g(x) = {} & 1 + x^{32} + x^{37} + x^{72} + x^{102} + x^{128} + x^{44}x^{60} + x^{61}x^{125} + x^{63}x^{67} \\ & + x^{69}x^{101} + x^{80}x^{88} + x^{110}x^{111} + x^{115}x^{117} + x^{46}x^{50}x^{58} \\ & + x^{103}x^{104}x^{106} + x^{33}x^{35}x^{36}x^{40} \end{aligned}$$

The function $h(x)$ is defined as follows, where $s_i$ and $b_i$ correspond to the $i$th state variable of the LFSR and the NFSR respectively:

$$h = b_{12}s_8 + s_{13}s_{20} + b_{95}s_{42} + s_{60}s_{79} + b_{12}b_{95}s_{94}$$

Finally, the output $z$ from the cipher is constructed as:

$$z = h + s_{93} + b_2 + b_{15} + b_{36} + b_{45} + b_{64} + b_{73} + b_{89}$$

The initialization of the cipher is as follows. At first the NFSR is filled with the 128 key bits, and then the LFSR is filled with the 96 IV bits. The remaining 32 bits of the LFSR are filled with ones, except the final bit, which is set to zero. After this, the cipher is clocked 256 times, during which the output is suppressed and instead fed back and XORed with the input to both the NFSR and the LFSR. After this, the cipher is ready and starts to produce keystream.
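The $h$ and $z$ expressions above translate directly into code, since addition and multiplication over GF(2) are XOR and AND. The following is a minimal sketch of the pre-output computation only (our own illustration, not the designers' reference implementation); the feedback and initialization logic is omitted.

```python
def grain128a_preoutput(s, b):
    """Pre-output bit z of Grain-128a, transcribed from the h and z
    expressions above. s and b are 128-element 0/1 lists holding the
    LFSR and NFSR state, with s[i] = s_i and b[i] = b_i."""
    h = (b[12] & s[8]) ^ (s[13] & s[20]) ^ (b[95] & s[42]) \
        ^ (s[60] & s[79]) ^ (b[12] & b[95] & s[94])
    return h ^ s[93] ^ b[2] ^ b[15] ^ b[36] ^ b[45] ^ b[64] ^ b[73] ^ b[89]
```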
2.2 Kreyvium
Kreyvium [7] is based on another eSTREAM finalist, namely Trivium [8]. Trivium is notable for its simplistic design. It has an 80-bit key and an 80-bit IV. The authors of Kreyvium modify the construction by increasing this to 128 bits for both the key and the IV. Kreyvium's internal state consists of five different registers, of sizes 93, 84, 111, 128, and 128 bits. In the following brief description, we will call them $a$, $b$, $c$, $IV^*$, and $K^*$, respectively. The first three registers are the same as in Trivium, while the latter two are added in Kreyvium. An overview of the cipher can be found in Fig. 2.
Fig. 2. Overview of Kreyvium.
Following the notation of the original paper, the registers $a$, $b$, and $c$ are numbered $s_1, \dots, s_{93}$, followed by $s_{94}, \dots, s_{177}$, and finally $s_{178}, \dots, s_{288}$, respectively. The output $z$ can then be expressed as:

$$z = s_{66} + s_{93} + s_{162} + s_{177} + s_{243} + s_{288} + K_0^*$$

For every clock, each register is shifted one step, and the following values are shifted in:

$$\begin{aligned} s_1 &= s_{243} + s_{288} + K_0^* + s_{286}s_{287} + s_{69} \\ s_{94} &= s_{66} + s_{93} + s_{91}s_{92} + s_{171} + IV_0^* \\ s_{178} &= s_{162} + s_{177} + s_{175}s_{176} + s_{264} \\ K_{127}^* &= K_0^* \\ IV_{127}^* &= IV_0^* \end{aligned}$$
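For illustration, the output and feedback equations above can be transcribed into a single clock function. This is our own sketch, not the reference implementation: state bits are held in 0-indexed lists, so the paper's $s_i$ corresponds to s[i-1], and the $K^*$ and $IV^*$ registers are rotated by moving the bit leaving position 0 back to position 127.

```python
def kreyvium_clock(s, kstar, ivstar):
    """One clock of Kreyvium, transcribed from the equations above.
    s: list of 288 state bits with s[i] = s_{i+1}; kstar, ivstar:
    128-bit lists with kstar[0] = K*_0 and ivstar[0] = IV*_0.
    Returns (z, new_s, new_kstar, new_ivstar)."""
    z = s[65] ^ s[92] ^ s[161] ^ s[176] ^ s[242] ^ s[287] ^ kstar[0]
    t_a = s[242] ^ s[287] ^ kstar[0] ^ (s[285] & s[286]) ^ s[68]   # new s_1
    t_b = s[65] ^ s[92] ^ (s[90] & s[91]) ^ s[170] ^ ivstar[0]     # new s_94
    t_c = s[161] ^ s[176] ^ (s[174] & s[175]) ^ s[263]             # new s_178
    # registers a (93 bits), b (84 bits), c (111 bits) each shift one step
    new_s = [t_a] + s[0:92] + [t_b] + s[93:176] + [t_c] + s[177:287]
    # K* and IV* rotate: the bit leaving position 0 re-enters at 127
    return z, new_s, kstar[1:] + kstar[:1], ivstar[1:] + ivstar[:1]
```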
The initialization of the cipher is as follows. The $a$ register is initialized with the first 93 key bits. The $b$ register is initialized with the first 84 IV bits. The $c$ register is initialized with the remaining IV bits, followed by all ones, except the final bit, which is a zero. The $K^*$ register is filled with the key, and the $IV^*$ register is filled with the IV. After this, the cipher is clocked 1152 times, during which the output is suppressed. It then starts generating keystream.
2.3 Maximum Degree Monomial Test
The maximum degree monomial test was first presented in [1] and describes a clean way to detect nonrandomness by looking at the cipher output. Considering an arbitrary stream cipher, we can view it as a black box with two inputs and one output. The inputs are the key $K$ and the initialization vector (IV) $V$, while the output is the generated keystream. We consider the concatenation of the key $K$ and the IV $V$ as a Boolean space $B$ of dimension $b = |K| + |V|$. Any Boolean function $g$ over a Boolean space $B$ can be described by its Algebraic Normal Form (ANF)

$$g(x_1, x_2, \dots, x_b) = c_0 + c_1 x_1 + c_2 x_2 + \dots + c_m x_1 x_2 \cdots x_b$$

where the coefficients $c_i$ are either 0 or 1, thus describing whether the term is included in the ANF or not. For the function $g$ above, the last term, with coefficient $c_m$, describes the maximum degree monomial. If $c_m$ is zero, we say that the maximum degree monomial does not exist, while if $c_m$ is 1, we say it does exist. We note that for a randomly chosen Boolean function $g$, we would expect the maximum degree monomial to appear with a probability of $\frac{1}{2}$.

We are interested in finding out whether or not the maximum degree monomial exists in the ANF of the Boolean function of the first keystream bit. The rationale behind this is that intuitively, the maximum degree monomial tells us something about the mixing of the input to the cipher. Since the maximum degree monomial is the product of all inputs of the Boolean function, we expect to see it only if all inputs have been mixed. It is well known that, according to the Reed-Muller transform, the coefficient of the maximum degree monomial can be found simply by XORing all the entries in the truth table of a Boolean function:

$$c_m = \bigoplus_{x \in \{0,1\}^b} g(x) \qquad (1)$$
where $g(x)$ is a Boolean function. Thus all possible values for the input set are generated, and for each input value the function is evaluated. We will use this test to analyze the required number of initialization rounds of a stream cipher. The designers of a stream cipher need to select how many initialization rounds to perform: too few, and it may be possible to attack the cipher; too many, and the performance hit will be large.
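As a concrete sketch of Eq. (1), the coefficient can be computed by exhaustively XORing a Boolean function over its truth table. This is our own generic illustration (the function g is a stand-in callback, not any particular cipher), and its $2^b$ evaluations are exactly why the test is only feasible for a small number of free bits.

```python
from itertools import product

def mdm_coefficient(g, b):
    """Maximum degree monomial coefficient of g, via Eq. (1):
    XOR of g(x) over all 2**b inputs. g maps a tuple of b bits
    to 0 or 1; feasible only for small b."""
    c = 0
    for x in product((0, 1), repeat=b):
        c ^= g(x)
    return c
```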
If we consider the first bit of keystream from a stream cipher as a Boolean function, we can choose to sum over this function in Eq. (1) above. The input $x$ would then correspond to the input set of key and IV bits. Instead of only looking at the first bit of real keystream, the idea can be extended such that a modified version of the cipher is considered. In the modified version, we also look at the cipher's output during its initialization rounds, output which is normally suppressed. Assuming a cipher with $l$ initialization rounds, we denote the $i$th initialization round output function as $f_i(x)$, thus giving us a vector of $l$ functions

$$(f_1(x), f_2(x), \dots, f_l(x)).$$
Thus, instead of only looking at the ANF and finding the maximum degree monomial of a single function ($z_0$ before), we now look at $l$ different Boolean functions, and for each of the functions we find the coefficient of the maximum degree monomial. Such a sequence of $l$ coefficients would have a format like

$$01100101\ldots101$$
where each individual bit is the maximum degree monomial coefficient for its corresponding function $f_i$. We call this sequence of coefficients the maximum degree monomial signature, or MDM signature, following the terminology in [2]. Since the keystream is a pseudo-random sequence of digits, the keystream produced by an ideal stream cipher should, to an outside observer, be indistinguishable from a random stream of bits. This means that if we look at each output bit function $f_i(x)$, it should appear to be a random function $f_i: B \to \{0, 1\}$. As noted earlier, for a random Boolean function, we expect the maximum degree monomial to exist with probability $\frac{1}{2}$. Therefore, we expect the coefficients 0 and 1 to appear with equal probability, and for an ideal cipher we expect to see a random-looking MDM signature. However, if the input space $B$ is large, clearly the construction of an MDM signature will require too many initializations of the cipher to be feasible. Therefore, we can only consider a subset $S$ of the input space $B$. The remaining part, $B \setminus S$, is set to some constant value; in this paper we selected it to be zero.
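In the same sketch style, the MDM signature and the length of its initial run of zeros (the quantity maximized throughout this paper) can be written as follows. Here f is a hypothetical oracle for the modified cipher: f(i, assignment) returns the round-i output bit with the bits of the chosen subset set according to assignment and all other key/IV bits fixed to zero, as described above.

```python
from itertools import product

def mdm_signature(f, subset, l):
    """MDM signature over l initialization rounds, restricted to the
    key/IV positions in `subset`. f(i, assignment) is a stand-in
    oracle for the i-th initialization-round output bit."""
    sig = []
    for i in range(l):
        c = 0
        for bits in product((0, 1), repeat=len(subset)):
            c ^= f(i, dict(zip(subset, bits)))
        sig.append(c)
    return sig

def initial_zeros(sig):
    """Length of the leading run of zeros in an MDM signature."""
    for pos, bit in enumerate(sig):
        if bit:
            return pos
    return len(sig)
```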
2.4 Finding the Subset S
The selection of the subset $S$ turns out to be a crucial part of the MDM test. We will soon see that depending on the choice of $S$, the resulting MDM signature will vary greatly. Consider a subset $S$ of key and IV bits for the stream cipher Grain-128a [6]. Choosing $S$ as key bit 23, and IV bits 47, 53, 58, and 64, we get the following MDM signature:

$$\underbrace{000 \ldots 000}_{187\ \text{zeros}} 111 \ldots$$
Looking at the initial sequence of 187 adjacent zeros, our first conclusion is that this does not appear to be a random-looking sequence. After this, however, we will start to see ones and zeros in a more mixed fashion. From this we can intuitively say that it appears as if 187 initialization rounds are not enough. However, Grain-128a is designed with 256 initialization rounds in a non-modified implementation, and thus it appears as if the designers have chosen a sufficiently high number of initialization rounds. To more concisely describe the result above, we state that we find nonrandomness in 187 out of 256 initialization rounds. We will use this terminology throughout the paper. Worth noting is also that this is a nonrandomness result, since we have included both key and IV bits as part of the subset $S$.

From the description above, it should not come as a surprise that our goal now is to maximize the length of the initial sequence of zeros we can find in the MDM signature. The ultimate goal is of course to find nonrandomness in all initialization rounds, at which point it may be interesting to look for it in the actual keystream of an unmodified cipher.

The selection of which bits to include from $B$ in the subset $S$ is important. The composition of $S$ will greatly influence the resulting MDM signature. Four examples can be found in Table 1.

Table 1. The number of initial zeros in the MDM signature for four different subsets S for Grain-128a.

K                IV                    Rounds out of 256
{}               {1, 2, 3, 4, 5}       107
{}               {91, 92, 93, 94, 95}  124
{23}             {47, 53, 58, 64}      187
{1, 2, 3, 4, 5}  {}                    109
From the table above, we can clearly see that the choice of $S$ is crucial. For these examples, we have selected a subset size of five, i.e. $|S| = 5$, and included both key and/or IV bits in $S$. The third row, where we find 187 consecutive zeros, is actually the optimal result for a subset of size 5. Calculating the optimal result is, however, not feasible as the subset grows larger. For the general case, where the input space is $B$ and the subset is $S$, we would have to test $\binom{|B|}{|S|}$ combinations. Again using Grain-128a as an example, that would correspond to $\binom{224}{|S|}$ combinations, since Grain-128a has 96 IV bits and 128 key bits; already for $|S| = 5$ this is roughly $4.5 \times 10^9$ subsets.
2.5 Greedy Approach
Since the selection of the subset $S$ is important, we now turn our attention to algorithms used to construct such a subset. Previous work, such as [2], has proposed using a greedy algorithm to find such subsets.
Fig. 3. One step of our improved algorithm [3].
The greedy approach can, in short, be described through the following steps, which result in a subset of a desired size:

1. Find an optimal starting subset of a small size (possibly empty, making this step optional).
2. Add the n bits which together produce the highest number of zero rounds to the current subset.
3. Repeat step 2 until a subset of the wanted size m is found.

To make the algorithm even clearer, consider the following example where we start with the optimal subset of size five described earlier in Table 1. A few steps of the greedy algorithm, with n = 1, would then look like this:
i0: K = {23}               IV = {47, 53, 58, 64}
i1: K = {23}               IV = {47, 53, 58, 64, 12}
i2: K = {23, 72}           IV = {47, 53, 58, 64, 12}
i3: K = {23, 72, 31}       IV = {47, 53, 58, 64, 12}
i4: K = {23, 72, 31, 107}  IV = {47, 53, 58, 64, 12}
The algorithm, in iteration i0, starts with the optimal subset of size 5. In iteration i1 all possible remaining bits are tried, and the best bit, i.e. the one giving the longest initial sequence of zeros, is selected and included in the subset, in this case IV bit 12. The algorithm then repeats the same step for all remaining iterations until a subset of the desired size is found, in this example |S| = 9.
This greedy algorithm has the same drawback as greedy algorithms in general: it may not find the global optimum, but rather get stuck in a local optimum, resulting in a poor selection of S.
3 Improved Algorithm
Considering the possible issues of the greedy algorithm presented in the previous section, we propose a more general solution which can achieve better results. The main idea is to extend the naïve greedy algorithm to examine more possible paths. Rather than only considering the single best candidate in each iteration, our improved algorithm will store and explore a multitude of possible paths. The rationale behind this approach is that the second best candidate in one iteration may be better in the following iteration of the algorithm, when more bits are to be added. Increasing the number of explored candidates in each step of the algorithm will of course increase its computational complexity. We will, however, later derive an expression for calculating the total computational effort required for certain parameters. In this way, we can easily estimate the computation time required.

The algorithm can briefly be described as follows. The algorithm starts with either an optimal set of candidates, or an empty set. Each member of the set is called a candidate, and every candidate is in itself a subset of key and IV bits. For each candidate, the algorithm now tries to find the best bits to add, to maximize the initial sequence of zeros in the resulting MDM signature. This is done for each of the original candidates, which means that this generates several new sets of candidates. If this is repeated, the number of candidates will clearly grow to unmanageable numbers. Therefore, the algorithm limits the resulting set of new candidates by some factor.

A more formal and detailed description of the algorithm follows. A description in pseudo-code can be found in Algorithms 1 and 2. The algorithm is parametrized by three different parameter vectors: α, k, and n. We also provide a graphical presentation of one iteration of the algorithm in Fig. 3, which we will refer to in the more detailed, textual, description below:

1. Consider a set of candidates from a previous iteration, or from an optimal starting set. If this is the first iteration, it is also possible to start with a completely empty subset of key and IV bits. In that case the algorithm starts with a single candidate, where the MDM signature is calculated with all key and IV bits set to zero.
2. For each candidate in the list, the algorithm adds the $k_i$ best $n_i$ new bits and stores them in a new list. Note that there now exists one such new list for each candidate in the original list.
3. Merge all lists, sorting by the number of zeros in the MDM signature. This gives a list of $k_0\alpha_0 \cdots k_{i-1}\alpha_{i-1} k_i$ items, since there were $k_0\alpha_0 \cdots k_{i-1}\alpha_{i-1}$
candidates in the beginning of this iteration, and each one has now resulted in $k_i$ new candidates.
4. Finally, reduce the size of this merged list by the factor $\alpha_i$ ($0 < \alpha_i \le 1.0$), limiting the size of the combined list to $k_0\alpha_0 \cdots k_{i-1}\alpha_{i-1} k_i \alpha_i$ items. If this step is omitted, or if $\alpha_i$ is set to 1.0, the number of candidates will grow exponentially.
5. Repeat from step 1 until a subset $S$ of the wanted size has been found.

We earlier stated that this improved algorithm is a more general approach compared to the naïve greedy algorithm described in Subsect. 2.5. Using our new, improved algorithm and its input parameters k, n, and α, we can express the previous greedy algorithm's behavior as a specific set of input parameters, namely α = [1.0, 1.0, ...], k = [1, 1, ...], and n = [n, n, ...]. Thus our improved algorithm is a generalization of the previous algorithm, with many more degrees of freedom.

Algorithm 1. SlightlyGreedy [3].
Input: key K, IV V, bit space B, maximum subset size m, vector k, vector n, vector α
Output: subset S of size m.

  S0 = {∅}  /* the set S0 contains a single empty subset */
  for (each i ∈ {0, ..., m − 1}) {
    for (each c ∈ Si) {
      Lc = FindBest(K, V, B, c, ki, ni);
    }
    Si+1 = concatenate(all Lc from above);
    sort Si+1 by the number of consecutive zeros in the MDM signature;
    reduce the number of elements in Si+1 by a factor αi;
  }
  return Sm;
Algorithm 2. FindBest [3].
Input: key K, IV V, bit space B, current subset c, number of best subsets to retain k, bits to add n
Output: k subsets, each of size |c| + n.

  /* let C(S, k) denote the set of all k-combinations of a set S */
  S = ∅;
  for (each n-tuple {b1, ..., bn} ∈ C(B \ c, n)) {
    z = number of initial zeros using subset c ∪ {b1, ..., bn};
    if (z is among the k highest values) {
      add c ∪ {b1, ..., bn} to S;
      reduce S to k elements by removing the element with lowest z;
    }
  }
  return S;
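As a compact illustration, the two algorithms above can be sketched in Python. This is our own rendering of the pseudo-code, not the authors' implementation: zeros_fn(subset) is a stand-in oracle returning the number of initial zeros in the MDM signature for a subset of bit positions, and all cipher-specific details are hidden behind it.

```python
import heapq
from itertools import combinations

def slightly_greedy(bit_space, zeros_fn, k, n, alpha):
    """Sketch of Algorithms 1 and 2. bit_space is a set of key/IV bit
    positions; zeros_fn(subset) returns the number of initial zeros in
    the MDM signature for that subset; k, n, alpha are the parameter
    vectors, one entry per iteration."""
    candidates = [frozenset()]  # S0: a single empty subset
    for k_i, n_i, a_i in zip(k, n, alpha):
        merged = []
        for c in candidates:
            # FindBest: the k_i best n_i-bit extensions of candidate c
            extensions = ((zeros_fn(c | set(t)), c | set(t))
                          for t in combinations(bit_space - c, n_i))
            merged.extend(heapq.nlargest(k_i, extensions, key=lambda e: e[0]))
        # merge all lists, sort by zeros, and reduce by the factor a_i
        merged.sort(key=lambda e: e[0], reverse=True)
        keep = max(1, int(len(merged) * a_i))
        candidates = [s for _, s in merged[:keep]]
    return candidates
```

With k = [1, 1, ...], alpha = [1.0, 1.0, ...] and a constant n, the sketch reduces to the previous greedy algorithm, matching the special case noted above.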
3.1 Computational Cost
The improved algorithm may have a greater computational cost than the previous greedy algorithm, because it considers more candidates. The
computational cost will depend on the input parameter vectors, since they affect the number of candidates explored. The total computational cost $C$ is expressed as the number of initializations required. The cost is given by the following function, from [3], where $c$ is the number of iterations required ($c = |k| = |n| = |\alpha|$), and $b$ is the bit space size $b = |B|$:

$$C(b, c, \mathbf{k}, \mathbf{n}, \boldsymbol{\alpha}) = \sum_{i=0}^{c-1} \left[ 2^{\sum_{j=0}^{i} n_j} \binom{b - \sum_{j=0}^{i-1} n_j}{n_i} \prod_{j=0}^{i-1} k_j \alpha_j \right] \qquad (2)$$

The expression can be derived using combinatorics. In the expression, the power of two is related to the size of the different subsets $S$: a large subset requires more initializations of the cipher. The binomial coefficient is the number of possible subsets we can form given the current iteration's $n_i$. Finally, the last product is needed because the algorithm reduces the number of candidates in each iteration using the factors in $\alpha$. Clearly, in practice, the actual running time also depends on other factors, such as the cipher we are running the algorithm on.

As a special case of the expression in Eq. 2, an expression for the previous greedy algorithm can be derived. Recall that this algorithm had a constant $n$, and since it only considered the best candidate in each iteration, both $\mathbf{k}$ and $\boldsymbol{\alpha}$ are all ones. Under these constraints, the expression can more concisely be given as [3]:

$$C(b, c, n) = \sum_{i=0}^{c-1} 2^{n(i+1)} \binom{b - n \cdot i}{n} \qquad (3)$$
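Eq. (2) translates directly into a few lines of Python (a sketch with our own variable names; math.comb is the binomial coefficient):

```python
from math import comb

def total_cost(b, k, n, alpha):
    """Number of cipher initializations according to Eq. (2).
    b is the bit-space size |B|; k, n, alpha are the parameter
    vectors. May return a float, since the alpha_j are fractions."""
    cost = 0
    for i in range(len(n)):
        inits = 2 ** sum(n[:i + 1])           # 2^(n_0 + ... + n_i)
        choices = comb(b - sum(n[:i]), n[i])  # ways to pick n_i new bits
        survivors = 1                         # candidates entering round i
        for j in range(i):
            survivors *= k[j] * alpha[j]
        cost += inits * choices * survivors
    return cost
```

For the greedy special case (all k_j = 1, all α_j = 1, constant n), this reduces to Eq. (3).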
4 Results
Before we can present results from our proposed algorithm, the choice of parameters must first be discussed. The algorithm is parametrized by the parameter vectors k, n, and α. In this section we explore and investigate how the choice of parameters affects the final result of our algorithm. These new results will be compared to the previous greedy algorithm as a baseline. The greedy algorithm only had one degree of freedom, n, while the improved algorithm has many more. We have performed a significant number of simulations, on several different ciphers, to be able to present results on how the choice of parameters affects the results of the algorithm. The tests have been performed on the stream ciphers Grain-128a [6], Kreyvium [7], and to some extent Grain-128 [5]. For reference, the exact parameters used for each result presented below are available in the Appendix of this paper.
4.1 Tuning the Greediness
To get a feeling for how the different parameters affect the result, we start by varying the two parameter vectors k and α, while keeping n fixed and equal to an all-one vector. While k and α give almost unlimited possibilities, we have opted for the following simulation setup. For every test case, a given iteration i will have the same number of candidates, which makes the computational complexity identical between the different test cases. The input vectors k and α will of course be different for the different test cases, which in turn means that even if the number of candidates is the same, the actual candidates will vary between the tests. By designing the test this way, we wish to investigate how this difference in candidate selection affects the final result of the algorithm.

Recall that $k_i$ governs how many new candidates we generate from a previous iteration's subsets. A high $k_i$ and low $\alpha_i$ means that we may end up with several candidates that have the same "stem", i.e. they have the same origin list. If we lower $k_i$ and instead increase $\alpha_i$, we will get a greater mix of different stems, while still maintaining the same number of candidates for the iteration; in a sense, the greediness of the algorithm is reduced. Thus, we want to test some different tradeoffs between the two parameters.

In the results below, we name the different test cases as a percentage value of the total number of candidates for each round. As an example, if the total number of candidates in a given round is 1000, we could select a $k_i$ of 200, and a corresponding $\alpha_i$ of 0.005, which gives us 1000 candidates for the next round as well. We call this particular case 20%-k, since $k_i$ is 20% of the candidates for the round.

As mentioned earlier, the simulations have been performed on different ciphers, in this case Grain-128a and Kreyvium. We have tried several combinations of k and α, as can be seen in Fig. 4, which includes one plot for each cipher. Note that Grain-128a has 256 initialization rounds, while Kreyvium has 1152 initialization rounds. The greedy algorithm is also included as a reference. Note that the greedy algorithm will, due to its simplistic nature, have a lower computational complexity, since it only keeps one candidate in each iteration. To be able to compare the results based on computational complexity, we have also plotted the graph against logarithmic complexity rather than subset size. The complexity is calculated using Eq. 2, and the natural logarithm is then applied to this value, so that a reasonably scaled plot is produced. This graph can be seen in Fig. 5, for the same ciphers as above. The maximum values for each case are also available in Table 2.

From the results we note that a too low k seems to lower the efficiency of the algorithm. The reason for this is probably that a too low k forces the algorithm to choose candidates from lists with lower value. These candidates are then poor choices for the upcoming iterations. We also note that our improved algorithm consistently gives better results than the previous greedy algorithm.
Table 2. Maximum length of initial sequence of zeros in MDM signature when varying k and α, expressed as actual count, and percentage of total initialization rounds.

         Grain-128a          Kreyvium
         Count  Percentage   Count  Percentage
Greedy   187    73.0         862    74.8
20%-k    203    79.3         896    77.8
0.5%-k   198    77.3         876    76.0
0.2%-k   192    75.0         877    76.1
min-k    190    74.2         866    75.2
[Plots of rounds versus bit set size for (a) Grain-128a and (b) Kreyvium; curves for greedy, improved (20%-k), 0.5%-k, 0.2%-k, and min-k.]

Fig. 4. Varying k and α, with n_i = 1. Thick dotted black line is the greedy baseline.
4.2 Varying the Number of Bits Added in Each Iteration
In the previous section, a fixed n was used throughout all tests. In this section, we will instead focus on the input parameter vector n and see how different vectors affect the result of the algorithm. Recall that this vector decides how many bits are added to the subset in each iteration. Intuitively, we expect a higher value of a single $n_i$ to yield better results, since this also reduces the risk of getting stuck in a local optimum. However, having a large, constant $n_i$ in all iterations, as explored in [2], means that later iterations will require very heavy computations. We therefore explore three different variants, where the vector n contains decreasing values of $n_i$. These results are then compared to the previous greedy approach, where a constant n of different values was used throughout the whole algorithm.

For these tests, the computational complexity will vary between the different tests. This is different from the previous section, where the tests were designed to have the same computational complexity. Therefore the results are once again presented in two ways: first as plots where the x-axis is the subset size, as seen in Fig. 6. The other plots present the results plotted by their computational complexity. As in the last section, the complexity is calculated using Eq. 2, and the plot uses a logarithmic scale on the x-axis. This can be seen in Fig. 7. The results for each test case are also available in tabular form in Table 3.
[Plots of rounds versus log(complexity) for (a) Grain-128a and (b) Kreyvium; curves for greedy, improved (20%-k), 0.5%-k, 0.2%-k, and min-k.]
Fig. 5. Varying k and α, with n_i = 1. Thick dotted black line is the greedy baseline. The x-axis scaled according to logarithmic computational complexity.

Table 3. Maximum length of initial sequence of zeros in MDM signature when varying n, expressed as actual count, and percentage of total initialization rounds.

                      Grain-128a          Kreyvium
                      Count  Percentage   Count  Percentage
Greedy 1-bit          187    73.1         862    74.8
Greedy 2-bit          187    73.1         864    75.0
Greedy 3-bit          187    73.1         851    73.9
2-2-2-2-2-2-2-2-1-... 203    79.3         868    75.4
2-2-2-2-1-...         199    77.7         872    75.7
1-...                 195    76.2         869    75.4
[Plots of rounds versus bit set size for (a) Grain-128a and (b) Kreyvium; curves for greedy 1-add, greedy 2-add, greedy 3-add, 2-2-2-2-2-2-2-2-1-..., 2-2-2-2-1-..., and 1-...]
Fig. 6. Varying n. Thick black lines are the greedy baselines for n equal to 1, 2, and 3.
From the results we note that regardless of our choice of n, our algorithm outperforms the greedy variants. For Grain-128a, we also see that a higher $n_i$ in the initial iterations seems to lead to better results, which remain as the algorithm proceeds towards larger subsets. The results for Kreyvium are not as clear, and it seems like the size of the resulting subset is the most important property.
[Plots of rounds versus log(complexity) for (a) Grain-128a and (b) Kreyvium, with the same test cases as in Fig. 6.]
Fig. 7. Varying n. Thick black lines are the greedy baselines for n equal to 1, 2, and 3. The x-axis scaled according to logarithmic computational complexity.
4.3 Results for Different Starting Points
In the previous tests, optimal subsets of size 5 have been used as a starting point for the simulations. In this section, we compare the use of such an optimal start to starting from an empty subset. A simple approach has been chosen: we reuse two test cases from Subsect. 4.1, namely the test case named 20%-k for both Grain-128a and Kreyvium. These test cases start with optimal subsets of size 5. The two new additional test cases start with an empty subset, and then sequentially add one bit during each of the first five iterations. The remaining iterations' parameters are kept the same between all test cases, so that the difference in the initial start is isolated. In this way we can investigate whether this optimal starting set is important or not.

The result of this experiment can be found in Fig. 8, again with one subfigure for Grain-128a and one for Kreyvium. The results are summarized in Table 4. In summary, the differences are very small, and for Kreyvium non-existent, which means that the choice of initial starting point may not be the most important decision to make when selecting parameters for the algorithm.
Table 4. Maximum length of initial sequence of zeros in MDM signature with different starting subsets, expressed as actual count, and percentage of total initialization rounds.

                     Grain-128a          Kreyvium
                     Count  Percentage   Count  Percentage
5-bit optimal start  203    79.3         896    77.8
Empty subset start   201    78.5         896    77.8
[Plots titled "Different starting subsets": rounds versus bit set size for (a) Grain-128a and (b) Kreyvium; curves for optimal and empty starting subsets.]
Fig. 8. Different starting sets and how they affect the results.
4.4 Results on Grain-128
Apart from the new results on Grain-128a and Kreyvium, tests were also performed on Grain-128, a predecessor of Grain-128a which has been analyzed in other works. In [2], a full-round (256 out of 256 initialization rounds) result was presented using a subset of size 40, using only IV bits, with an optimal starting subset of size 6. This was found using a constant n = 2, which corresponds to a parameter set of α = [1.0, 1.0, 1.0, ...], k = [1, 1, 1, ...], and n = [6, 2, 2, ...] in our improved algorithm. It would clearly be possible to find the exact same subset using our improved algorithm, but we are also interested in seeing whether or not we can find other subsets resulting in full-round results.

A new set of parameters for our improved algorithm is constructed as follows. The possibility to keep multiple candidates in each step is utilized, especially in the beginning, where the subsets are still small. Using the improved algorithm, a smaller subset of size 25 is found, which still gives us a full-round result of 256 out of 256 initialization rounds. Using the complexity expression in Eq. 2, the computational complexities of the two results can be compared. We find that our improved algorithm has a complexity which is a factor of about $2^{12}$ lower than the earlier result, while still finding an equal number of zeros in the MDM signature.
5 Related Work
Related work can be divided into two main categories: work related to the maximum degree monomial test, and work related to general cryptanalysis of the discussed ciphers.

In [9], Saarinen described the d-Monomial test, and how it can be applied in chosen-IV attacks against stream ciphers. In contrast to our work, and the work done by Stankovski [2], Saarinen considers monomials of various degrees, namely monomials up to degree d, hence the name d-Monomial test. In addition to this difference, the choice of input subset bits is different: Saarinen only considers consecutive bits either in the beginning or in the end of the IV. This is in contrast to our work, where the subset is chosen freely as any subset of IV and/or key bits.

Related to the work of Saarinen, the Maximum Degree Monomial (MDM) test was introduced by Englund et al. in [1]. Rather than looking at several different degrees of monomials, the MDM test only focuses on the maximum degree monomial. The motivation behind this choice is that the maximum degree monomial is likely to occur only if all IV bits have been properly mixed. In addition to this, the existence of the maximum degree monomial is easy to find: the coefficient of the monomial can be found by simply XORing all entries in the truth table.

In the previously mentioned work, a subset of the IV space was used in the tests. In [2], a greedy heuristic to find these subsets was discussed. The greedy algorithm started with an optimal, precalculated subset of a small size, and then added n bits in each step in a greedy fashion. In addition, both IV and key bits were suggested for getting distinguisher and nonrandomness results, respectively. Several different ciphers were analyzed, among them Grain-128 and Trivium.

Other work related to distinguishers for Trivium is [10], where the authors concentrate on small cubes, and instead look at unions of these cubes. Another difference is that they look at sub-maximal degree monomial tests.

Also partly based on Stankovski's work is the work in [11], where the authors propose two new, alternative heuristics which do not simply maximize the initial sequence of zeros in the MDM signature. In the first heuristic, called "maximum last zero", the authors not only maximize the initial sequence of zeros, but also ensure that the position of the current iteration in the MDM signature is a zero as well. In their second heuristic, called "maximum frequency of zero", they instead look at the total number of zeros in the MDM signature. Their heuristics are applied to the ciphers Trivium [8] and Trivia-SC [12]. Similar to our paper, they also mention the use of a non-constant n, i.e. an n-vector, although the authors do not discuss the reasons for this extension.

In [13] an attack called AIDA on a modified version of Trivium was presented. In this case Trivium was modified so that it only had half of the original number of initialization rounds. Related to this attack are the cube attacks [14], and especially the dynamic cube attack [15], which was used to attack Grain-128.
Attacks on the newer Grain-128a can be found in the literature as well. In [16] the authors present a related-key attack requiring more than $2^{32}$ related keys and more than $2^{64}$ chosen IVs, while in [17] the authors present a differential fault attack against all three ciphers in the Grain family. There is very limited work regarding the analysis of Kreyvium, possibly because the original Kreyvium paper is relatively recent; however, in [18] the authors discuss conditional differential cryptanalysis of Kreyvium.
6 Conclusions
This paper has described the design and motivation of the maximum degree monomial test when designing nonrandomness detectors. The MDM test requires a subset of key and IV bits, and in this paper we have designed and proposed a new algorithm to find such subsets. Our algorithm is based on a greedy approach, but rather than using a naïve greedy algorithm, we propose an algorithm which is less likely to get stuck in local optima, and therefore yields better final results. The algorithm is highly flexible, and parameters can be chosen and adapted to get a both reasonable and predictable computational complexity.

To validate our algorithm, we have performed a significant number of simulations to find good input parameters. Simulations have been performed mainly on the ciphers Grain-128a and Kreyvium, and the results show that our new algorithm outperforms previously proposed naïve greedy algorithms.

Acknowledgments. This paper is an extended and revised version of the paper "Improved Greedy Nonrandomness Detectors for Stream Ciphers" previously presented at ICISSP 2017 [3]. The computations were performed on resources provided by the Swedish National Infrastructure for Computing (SNIC) at Lunarc.
Appendix

This appendix contains the exact vectors used for the different results discussed in Sect. 4. The vectors used for the results for varying k and α are given in Table 5. In the same fashion, the vectors used for the results for varying n are presented in Table 6. Finally, the vectors for the results on Grain-128 are given in Table 7.
Table 5. Varying k and α [3].
Greedy k { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 } n { 5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 } α { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 } Improved (20 %-k) k { 1000, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 100, 60, 60, 20, 20, 20, 20, 20, 20, 12, 6, 3, 2, 2, 2, 2, 2, 1, 1, 1, 1 } n { 5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 } α { 1.0, 0.005, 0.005, 0.005, 0.005, 0.005, 0.005, 0.005, 0.005, 0.005, 0.005, 0.005, 0.005, 1 1 1 2 0.01, 60 , 60 , 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 12 , 15 , 0.375, 0.5, 0.5, 0.5, 0.5, 29 , 1.0, 0.5, 1.0 } 0.5 %-k k { 1000, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 3, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 } n { 5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 } α { 1.0, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 16 , 0.3, 0.5, 13 , 1.0, 1.0, 1.0, 1.0, 1.0, 0.6, 0.5, 0.4, 0.75, 1.0, 1.0, 1.0, 1.0, 29 , 1.0, 0.5, 1.0 } 0.2 %-k k { 1000, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 } n { 5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 } α { 1.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.6, 1.0, 13 , 1.0, 1.0, 1.0, 1.0, 1.0, 0.6, 0.5, 0.4, 0.75, 1.0, 1.0, 1.0, 1.0, 29 , 1.0, 0.5, 1.0 } min-k k { 1000, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 } n { 5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 } α { 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.5, 0.6, 1.0, 13 , 1.0, 1.0, 1.0, 1.0, 1.0, 0.6, 0.5, 0.4, 0.75, 1.0, 1.0, 1.0, 1.0, 29 , 1.0, 0.5, 1.0 }
Table 6. Varying n [3].
Greedy 1-add k { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 } n { 5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 } α { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 } Greedy 2-add k { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 } n { 5, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2 } α { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 } Greedy 3-add k { 1, 1, 1, 1, 1, 1, 1 } n { 5, 3, 3, 3, 3, 3, 3 } α { 1, 1, 1, 1, 1, 1, 1 } 2-2-2-2-2-2-2-2-1-... k { 1000, 200, 200, 200, 200, 150, 50, 50, 50, 30, 15, 6, 5, 5, 5, 5, 5, 1, 1, 1, 1, 1, 1, 1, 1, 1 } n { 5, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 } 1 1 1 4 α { 1.0, 0.005, 0.005, 0.0005, 0.005, 150 , 0.02, 0.01, 0.02, 0.02, 15 , 15 , 0.15, 0.2, 0.2, 45 , 0.1, 0.5, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0 } 2-2-2-2-1-... k { 1000, 200, 200, 200, 200, 150, 50, 50, 50, 30, 15, 6, 5, 5, 5, 5, 5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 } n { 5, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 } 1 1 1 4 α { 1.0, 0.005, 0.005, 0.0005, 0.005, 150 , 0.02, 0.01, 0.02, 0.02, 15 , 15 , 0.15, 0.2, 0.2, 45 , 0.1, 0.5, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0 } 1-... k { 1000, 200, 200, 200, 200, 150, 50, 50, 50, 30, 15, 6, 5, 5, 5, 5, 5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 } n { 5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 } 1 1 1 4 , 0.02, 0.01, 0.02, 0.02, 15 , 15 , 0.15, 0.2, 0.2, 45 , α { 1.0, 0.005, 0.005, 0.0005, 0.005, 150 0.1, 0.5, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0 }
Table 7. Results on Grain-128 [3].
Greedy 1-add k { 1000, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 60, 60, 20, 20, 20, 20, 20 } n { 6, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 } α { 1.0, 0.0125, 0.0125, 0.0125, 0.0125, 0.0125, 0.0125, 0.0125, 0.0125, 0.0125, 0.0125, 1 1 , 60 , 0.05, 0.05, 0.05, 0.05 } 0.0125, 0.00625, 0.01, 60
References

1. Englund, H., Johansson, T., Sönmez Turan, M.: A framework for chosen IV statistical analysis of stream ciphers. In: Srinathan, K., Rangan, C.P., Yung, M. (eds.) INDOCRYPT 2007. LNCS, vol. 4859, pp. 268–281. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-77026-8_20
2. Stankovski, P.: Greedy distinguishers and nonrandomness detectors. In: Gong, G., Gupta, K.C. (eds.) INDOCRYPT 2010. LNCS, vol. 6498, pp. 210–226. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-17401-8_16
3. Karlsson, L., Hell, M., Stankovski, P.: Improved greedy nonrandomness detectors for stream ciphers. In: Proceedings of the 3rd International Conference on Information Systems Security and Privacy, pp. 225–232. SciTePress (2017)
4. Hell, M., Johansson, T., Meier, W.: Grain - a stream cipher for constrained environments. Int. J. Wirel. Mob. Comput. 2, 86–93 (2006). Special Issue on Security of Computer Network and Mobile Systems
5. Hell, M., Johansson, T., Maximov, A., Meier, W.: A stream cipher proposal: Grain-128. In: 2006 IEEE International Symposium on Information Theory, pp. 1614–1618 (2006)
6. Ågren, M., Hell, M., Johansson, T., Meier, W.: Grain-128a: a new version of Grain-128 with optional authentication. Int. J. Wirel. Mob. Comput. 5, 48–59 (2011)
7. Canteaut, A., Carpov, S., Fontaine, C., Lepoint, T., Naya-Plasencia, M., Paillier, P., Sirdey, R.: Stream ciphers: a practical solution for efficient homomorphic-ciphertext compression. In: Peyrin, T. (ed.) FSE 2016. LNCS, vol. 9783, pp. 313–333. Springer, Heidelberg (2016). https://doi.org/10.1007/978-3-662-52993-5_16
8. Cannière, C.: Trivium: a stream cipher construction inspired by block cipher design principles. In: Katsikas, S.K., López, J., Backes, M., Gritzalis, S., Preneel, B. (eds.) ISC 2006. LNCS, vol. 4176, pp. 171–186. Springer, Heidelberg (2006). https://doi.org/10.1007/11836810_13
9. Saarinen, M.J.O.: Chosen-IV statistical attacks on eSTREAM stream ciphers (2006). http://www.ecrypt.eu.org/stream/papersdir/2006/013.pdf
10. Liu, M., Lin, D., Wang, W.: Searching cubes for testing Boolean functions and its application to Trivium. In: 2015 IEEE International Symposium on Information Theory (ISIT), pp. 496–500 (2015)
11. Sarkar, S., Maitra, S., Baksi, A.: Observing biases in the state: case studies with Trivium and Trivia-SC. Des. Codes Crypt. 82, 351–375 (2016)
12. Chakraborti, A., Chattopadhyay, A., Hassan, M., Nandi, M.: TriviA: a fast and secure authenticated encryption scheme. In: Güneysu, T., Handschuh, H. (eds.) CHES 2015. LNCS, vol. 9293, pp. 330–353. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-662-48324-4_17
13. Vielhaber, M.: Breaking ONE.FIVIUM by AIDA an algebraic IV differential attack. Cryptology ePrint Archive, Report 2007/413 (2007). http://eprint.iacr.org/2007/413
14. Dinur, I., Shamir, A.: Cube attacks on tweakable black box polynomials. In: Joux, A. (ed.) EUROCRYPT 2009. LNCS, vol. 5479, pp. 278–299. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-01001-9_16
15. Dinur, I., Shamir, A.: Breaking Grain-128 with dynamic cube attacks. In: Joux, A. (ed.) FSE 2011. LNCS, vol. 6733, pp. 167–187. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-21702-9_10
16. Banik, S., Maitra, S., Sarkar, S., Sönmez Turan, M.: A chosen IV related key attack on Grain-128a. In: Boyd, C., Simpson, L. (eds.) ACISP 2013. LNCS, vol. 7959, pp. 13–26. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-39059-3_2
17. Sarkar, S., Banik, S., Maitra, S.: Differential fault attack against Grain family with very few faults and minimal assumptions. IEEE Trans. Comput. 64, 1647–1657 (2015)
18. Watanabe, Y., Isobe, T., Morii, M.: Conditional differential cryptanalysis for Kreyvium. In: Pieprzyk, J., Suriadi, S. (eds.) ACISP 2017. LNCS, vol. 10342, pp. 421–434. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-60055-0_22
Author Index

Ahmadi, Ahmad 197
Alruhaily, Nada 35
Bargh, Mortaza S. 173
Bordbar, Behzad 35
Choenni, Sunil 173
Chothia, Tom 35
Costantino, Gianpiero 148
de Sousa Junior, Rafael Timoteo 130
El-Moussa, Fadi Ali 20
Hadad, Tal 1
Hell, Martin 273
Herwono, Ian 20
Hirmer, Pascal 59
Hüffmeyer, Marc 59
Huynen, Jean-Louis 222
Ji, Shouling 84
Johnstone, Michael N. 252
Karlsson, Linus 273
Lee, Ruby B. 84
Lee, Wei-Han 84
Lenzini, Gabriele 222
Liu, Changchang 84
Martinelli, Fabio 148
Matteucci, Ilaria 148
Mitschang, Bernhard 59
Mittal, Prateek 84
Muniz Soares, Alberto Magno 130
Ofek, Nir 1
Peacock, Matthew 252
Petrocchi, Marinella 148
Puzis, Rami 1
Regainia, Loukmen 105
Rokach, Lior 1
Safavi-Naini, Reihaneh 197
Salva, Sébastien 105
Schreier, Ulf 59
Sidik, Bronislav 1
Stankovski, Paul 273
Valli, Craig 252
Vink, Marco 173
Wieland, Matthias 59