
Lecture Notes in Computer Science
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, University of Dortmund, Germany
Madhu Sudan, Massachusetts Institute of Technology, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max-Planck Institute of Computer Science, Saarbruecken, Germany

5578

Corina S. Păsăreanu (Ed.)

Model Checking Software
16th International SPIN Workshop
Grenoble, France, June 26-28, 2009
Proceedings


Volume Editor
Corina S. Păsăreanu
NASA Ames Research Center, Space Science Division
Mail Stop 269-2, Moffett Field, CA 94035, USA
E-mail: [email protected]

Library of Congress Control Number: 2009928779
CR Subject Classification (1998): F.3, D.2.4, D.3.1, D.2
LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues

ISSN 0302-9743
ISBN-10 3-642-02651-6 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-02651-5 Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

springer.com

© Springer-Verlag Berlin Heidelberg 2009
Printed in Germany

Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
SPIN: 12703296 06/3180 543210

Preface

This volume contains the proceedings of the 16th International SPIN Workshop on Model Checking of Software (SPIN 2009), which was held at the Grenoble World Trade Center in Grenoble, France, June 26-28, 2009. The workshop was co-located with the 21st International Conference on Computer-Aided Verification (CAV 2009).

The SPIN workshop is a forum for practitioners and researchers interested in the model checking-based analysis of software systems. The focus of the workshop is on theoretical advances and empirical evaluations related to state-space and path exploration techniques, as implemented in the SPIN model checker and other software verification tools. The workshop aims to encourage interactions and exchanges of ideas with all related areas in software engineering.

SPIN 2009 was the 16th event in the workshop series, which started in 1995. This year, we received 41 submissions (34 technical papers and 7 tool papers), out of which 18 papers were accepted (15 technical papers and 3 tool papers). Each submission was reviewed by three Program Committee members.

In addition to the refereed papers, the workshop featured four invited talks: Patrice Godefroid, from Microsoft Research, USA, on "Software Model Checking Improving Security of a Billion Computers"; Marta Kwiatkowska, from Oxford University, UK, on "On Quantitative Software Verification"; Joseph Sifakis (recipient of the 2007 Turing Award), from VERIMAG, France, on "The Quest for Correctness - Beyond a Posteriori Verification"; and Willem Visser, from the University of Stellenbosch, South Africa, on "Who Really Cares If the Program Crashes?"

We would like to thank the authors of submitted papers, the invited speakers, the Program Committee members, the external reviewers, and the Steering Committee for their help in composing a strong program. Special thanks go to Stefan Leue for his guidance throughout the SPIN 2009 organization, to Saddek Bensalem for the SPIN 2009 local organization, and to last year's organizers, Klaus Havelund and Rupak Majumdar, for their advice and help with advertising the event. We also thank Springer for agreeing to publish these proceedings as a volume of Lecture Notes in Computer Science. The EasyChair system was used for the submission and reviewing of the papers and also for the preparation of the proceedings.

April 2009

Corina S. Păsăreanu

Organization

Program Chair
Corina S. Păsăreanu

Program Committee
Christel Baier, University of Bonn, Germany
Dragan Bošnački, Eindhoven University, The Netherlands
Patricia Bouyer, ENS de Cachan, France
Lubos Brim, Masaryk University, Czech Republic
Marsha Chechik, University of Toronto, Canada
Matthew Dwyer, University of Nebraska, USA
Stefan Edelkamp, T.U. Dortmund, Germany
Jaco Geldenhuys, University of Stellenbosch, South Africa
Susanne Graf, VERIMAG, France
Klaus Havelund, JPL, USA
Gerard Holzmann, JPL, USA
Radu Iosif, VERIMAG, France
Michael Jones, Brigham Young University, USA
Sarfraz Khurshid, University of Texas, Austin, USA
Orna Kupferman, Hebrew University, Israel
Stefan Leue, University of Konstanz, Germany
Rupak Majumdar, University of California, Los Angeles, USA
Madan Musuvathi, Microsoft Research, USA
Koushik Sen, University of California, Berkeley, USA
Scott Stoller, Stony Brook University, USA
Farn Wang, National Taiwan University, Taiwan
Pierre Wolper, University of Liege, Belgium

Steering Committee
Dragan Bošnački, Eindhoven University, The Netherlands
Stefan Edelkamp, T.U. Dortmund, Germany
Susanne Graf, VERIMAG, France
Klaus Havelund, JPL, USA
Stefan Leue (Chair), University of Konstanz, Germany
Rupak Majumdar, University of California, Los Angeles, USA
Pierre Wolper, University of Liege, Belgium


Local Organization
Saddek Bensalem (CAV 2009 Local Organization, VERIMAG/UJF, France)

External Reviewers
Husain Aljazzar, Adam Antonik, Bahareh Badban, Jiri Barnat, Tobias Blechmann, Jacob Burnim, Franck Cassez, Ivana Cerna, Joel Galenson, Pallavi Joshi, Sudeep Juvekar, Shadi Abdul Khalek, Peter Kissmann, Joachim Klein, Filip Konecny, Laura Kovacs, Matthias Kuntz, Roman Manevich, Nicolas Markey, Eric Mercer, Chang-Seo Park, Polyvios Pratikakis, Neha Rungta, Roopsha Samanta, Christoph Scheben, Jiri Simacek, Jocelyn Simmonds, Christos Stergiou, Damian Sulewski, Faraz Torchizi, Wei Wei, Tim Willemse

Table of Contents

Invited Contributions

Software Model Checking Improving Security of a Billion Computers
    Patrice Godefroid ... 1

On Quantitative Software Verification
    Marta Kwiatkowska ... 2

The Quest for Correctness - Beyond a Posteriori Verification
    Joseph Sifakis ... 4

Who Really Cares If the Program Crashes?
    Willem Visser ... 5

Regular Papers

Tool Presentation: Teaching Concurrency and Model Checking
    Mordechai (Moti) Ben-Ari ... 6

Fast, All-Purpose State Storage
    Peter C. Dillinger and Panagiotis (Pete) Manolios ... 12

Efficient Probabilistic Model Checking on General Purpose Graphics Processors
    Dragan Bošnački, Stefan Edelkamp, and Damian Sulewski ... 32

Improving Non-Progress Cycle Checks
    David Faragó and Peter H. Schmitt ... 50

Reduction of Verification Conditions for Concurrent System Using Mutually Atomic Transactions
    Malay K. Ganai and Sudipta Kundu ... 68

Probabilistic Reachability for Parametric Markov Models
    Ernst Moritz Hahn, Holger Hermanns, and Lijun Zhang ... 88

Extrapolation-Based Path Invariants for Abstraction Refinement of Fifo Systems
    Alexander Heußner, Tristan Le Gall, and Grégoire Sutre ... 107

A Decision Procedure for Detecting Atomicity Violations for Communicating Processes with Locks
    Nicholas Kidd, Peter Lammich, Tayssir Touili, and Thomas Reps ... 125

Eclipse Plug-In for Spin and st2msc Tools - Tool Presentation
    Tim Kovše, Boštjan Vlaovič, Aleksander Vreže, and Zmago Brezočnik ... 143

Symbolic Analysis via Semantic Reinterpretation
    Junghee Lim, Akash Lal, and Thomas Reps ... 148

EMMA: Explicit Model Checking Manager (Tool Presentation)
    Radek Pelánek and Václav Rosecký ... 169

Efficient Testing of Concurrent Programs with Abstraction-Guided Symbolic Execution
    Neha Rungta, Eric G. Mercer, and Willem Visser ... 174

Subsumer-First: Steering Symbolic Reachability Analysis
    Andrey Rybalchenko and Rishabh Singh ... 192

Identifying Modeling Errors in Signatures by Model Checking
    Sebastian Schmerl, Michael Vogel, and Hartmut König ... 205

Towards Verifying Correctness of Wireless Sensor Network Applications Using Insense and Spin
    Oliver Sharma, Jonathan Lewis, Alice Miller, Al Dearle, Dharini Balasubramaniam, Ron Morrison, and Joe Sventek ... 223

Verification of GALS Systems by Combining Synchronous Languages and Process Calculi
    Hubert Garavel and Damien Thivolle ... 241

Experience with Model Checking Linearizability
    Martin Vechev, Eran Yahav, and Greta Yorsh ... 261

Automatic Discovery of Transition Symmetry in Multithreaded Programs Using Dynamic Analysis
    Yu Yang, Xiaofang Chen, Ganesh Gopalakrishnan, and Chao Wang ... 279

Author Index ... 297

Software Model Checking Improving Security of a Billion Computers

Patrice Godefroid
Microsoft Research
[email protected]

Abstract. I will present a form of software model checking that has improved the security of a billion computers (and has saved Microsoft millions of dollars). This form of software model checking is dubbed whitebox fuzz testing, and builds upon recent advances in systematic dynamic test generation (also known as DART) and constraint solving. Starting with a well-formed input, whitebox fuzzing symbolically executes the sequential program under test dynamically, and gathers constraints on inputs from conditional statements encountered along the execution. The collected constraints are negated systematically one-byone and solved with a constraint solver, yielding new inputs that exercise different execution paths in the program. This process is repeated using novel state-space exploration techniques that attempt to sweep through all (in practice, many) feasible execution paths of the program while checking simultaneously many properties. This approach thus combines program analysis, testing, model checking and automated theorem proving (constraint solving). Whitebox fuzzing has been implemented in the tool SAGE, which is optimized for long symbolic executions at the x86 binary level. Over the past 18 months, SAGE has been running on hundreds of machines and has discovered many new expensive security-critical bugs in large shipped Windows applications, including image processors, media players and file decoders, that are deployed on more than a billion computers worldwide. SAGE is so effective in finding bugs missed by other techniques like static analysis or blackbox random fuzzing that it is now used daily in various Microsoft groups. This is joint work with Michael Levin (Microsoft CSE) and other contributors.


On Quantitative Software Verification

Marta Kwiatkowska
Oxford University Computing Laboratory, Parks Road, Oxford, OX1 3QD

Software verification has made great progress in recent years, resulting in several tools capable of working directly from source code, for example, SLAM and Astrée. Typical properties that can be verified are expressed as Boolean assertions or temporal logic properties, and include whether the program eventually terminates, or whether the executions never violate a safety property. The underlying techniques crucially rely on the ability to extract from programs, using compiler tools and predicate abstraction, finite-state abstract models, which are then iteratively refined to either demonstrate the violation of a safety property (e.g., a buffer overflow) or guarantee the absence of such faults. An established method to achieve this automatically executes an abstraction-refinement loop guided by counterexample traces [1].

The vast majority of software verification research to date has concentrated on methods for analysing qualitative properties of system models. Many programs, however, contain randomisation, real-time delays and resource information. Examples include anonymity protocols and random back-off schemes in, e.g., Zigbee and Bluetooth. Quantitative verification [2] is a technique for establishing quantitative properties of a system model, such as the probability of battery power dropping below minimum, the expected time for message delivery and the expected number of messages lost before protocol termination. Models are typically variants of Markov chains, annotated with reward structures that describe resources and their usage during execution. Properties are expressed in temporal logic extended with probabilistic and reward operators. Tools such as the probabilistic model checker PRISM are widely used to analyse system models in several application domains, including security and network protocols. However, at present the models are formulated in the modelling notations specific to the model checker.

The key difficulty in transferring quantitative verification techniques to real software lies in the need to generalise the abstraction-refinement loop to the quantitative setting. Progress has been recently achieved using the idea of strongest evidence for counterexamples [3] and stochastic game abstractions [4].

In this lecture, we present a quantitative software verification method for ANSI-C programs extended with random assignment. The goal is to focus on system software that exhibits probabilistic behaviour, for example through communication failures or randomisation, and quantitative properties of software such as "the maximum probability of file-transfer failure" or "the maximum expected number of function calls during program execution". We use a framework based on SAT-based predicate abstraction, in which probabilistic programs are represented as Markov decision processes, and their abstractions as stochastic two-player games [5]. The abstraction-refinement loop proceeds in a quantitative fashion, yielding lower and upper bounds on the probability/expectation values for the computed abstractions. The bounds provide a quantitative measure of the precision of the abstraction, and are used to guide the refinement process, which proceeds automatically, iteratively refining the abstraction until the interval between the bounds is sufficiently small. In contrast to conventional approaches, our quantitative abstraction-refinement method does not produce counterexample traces. The above techniques have been implemented using components from GOTO-CC, SATABS and PRISM and successfully used to verify actual networking software.

The lecture will give an overview of current research directions in quantitative software verification, concentrating on the potential of the method and outlining future challenges.
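To make the setting concrete, the following is a minimal C sketch of the kind of program with random assignment that such a method targets. The rand()-based branch and the failure property are our own illustration, not code from PRISM, SATABS or GOTO-CC.

    #include <stdlib.h>

    /* Hypothetical lossy send: each attempt fails with probability 1/10.
       A quantitative verifier treats this branch as a probabilistic
       choice in a Markov decision process, not as plain nondeterminism. */
    static int send_chunk(void) {
        return rand() % 10 != 0;
    }

    int main(void) {
        for (int retries = 0; retries < 3; retries++) {
            if (send_chunk())
                return 0;          /* transfer succeeded */
        }
        /* Property of interest: the maximum probability of reaching
           this point, i.e., of file-transfer failure (here 0.1^3). */
        return 1;
    }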

Acknowledgements. Supported in part by EPSRC grants EP/D07956X, EP/D076625 and EP/F001096, and FP7 project CONNECT-IP.

References

1. Clarke, E., Grumberg, O., Jha, S., Lu, Y., Veith, H.: Counterexample-guided abstraction refinement. In: Emerson, E.A., Sistla, A.P. (eds.) CAV 2000. LNCS, vol. 1855, pp. 154–169. Springer, Heidelberg (2000)
2. Kwiatkowska, M.: Quantitative verification: Models, techniques and tools. In: Proc. 6th ESEC/FSE, pp. 449–458. ACM Press, New York (2007)
3. Hermanns, H., Wachter, B., Zhang, L.: Probabilistic CEGAR. In: Gupta, A., Malik, S. (eds.) CAV 2008. LNCS, vol. 5123, pp. 162–175. Springer, Heidelberg (2008)
4. Kwiatkowska, M., Norman, G., Parker, D.: Game-based abstraction for Markov decision processes. In: Proc. QEST 2006. IEEE, Los Alamitos (2006)
5. Kattenbelt, M., Kwiatkowska, M., Norman, G., Parker, D.: Abstraction refinement for probabilistic software. In: Jones, N., Muller-Olm, M. (eds.) VMCAI 2009. LNCS, vol. 5403, pp. 182–197. Springer, Heidelberg (2009)

The Quest for Correctness - Beyond a Posteriori Verification

Joseph Sifakis
Verimag Laboratory

Abstract. In this presentation, I discuss the main achievements in the area of formal verification, in particular regarding their impact thus far on the development of Computer Science as a discipline and on future research directions. The presentation starts with a short overview of formal verification techniques and their main characteristics, followed by an analysis of their current status with respect to: 1) requirements specification; 2) faithfulness of modeling; 3) scalability of verification methods. Compositional modeling and verification is the main challenge to tackling complexity. System verification should be tightly integrated into the design process, making use of knowledge about the system's structure and its properties. I identify two complementary research directions for overcoming some of the current difficulties in compositional techniques: 1) moving away from low-level automata-based composition to component-based composition, by developing frameworks encompassing heterogeneous components; 2) using such frameworks to study compositionality techniques for particular architectures and/or specific properties. I illustrate these ideas through the BIP (Behavior, Interaction, Priority) component framework, which encompasses high-level composition of heterogeneous components. BIP supports a design methodology for building systems in a three-dimensional design space by using property-preserving transformations. This allows efficient compositional verification techniques for proving invariants, and deadlock-freedom in particular.


Who Really Cares If the Program Crashes?

Willem Visser
Computer Science Division, Department of Mathematical Sciences
University of Stellenbosch, South Africa

Abstract. After spending eight years at NASA doing research in model checking and testing, I decided it would be a nice change of scene to see how software is being developed in a fast-paced technology start-up (SEVEN Networks). Of course I was secretly hoping to solve all their testing problems with the cool research techniques from the verification and testing community.

At NASA, software is written once, for the most part run once, and if it fails there are serious (even life-threatening) consequences. Clearly this is a fruitful hunting ground for advanced verification and testing technology. At SEVEN, on the other hand, code is maintained and adapted for years and the same programs execute thousands of times a second on various platforms. Failures are plentiful, but they only become important once they start to impact service level agreements with the paying customers, i.e., when they start to have a negative impact on the bottom line. Failures are not necessarily crashes either; it is much more likely to be a performance bottleneck that eventually causes a system-wide failure. What does the verification and testing community have to offer in this arena, bearing in mind there are very few "NASA"s and very many "SEVEN"s in the world?

This talk is about what I learned in the past two years at SEVEN and how it is influencing my current research. In particular, I will explain why I ran a model checker on SEVEN code just once, used a static analysis tool only once as well, the reasons why model-based testing is no longer used at SEVEN, why I am no longer certain deadlocks are so important (but races are), why SQL is a useful debugging aid and why performance analysis is important. I will also highlight some of the more interesting errors I encountered at SEVEN and why our current tools cannot find most of these.


Tool Presentation: Teaching Concurrency and Model Checking

Mordechai (Moti) Ben-Ari
Department of Science Teaching
Weizmann Institute of Science, Rehovot 76100, Israel
[email protected]
http://stwww.weizmann.ac.il/g-cs/benari/

Abstract. This paper describes software tools for teaching concurrency and model checking. jSpin is a development environment for Spin that formats and filters the output of a simulation according to the user's specification. SpinSpider uses debugging output from Spin to generate a diagram of the state space of a Promela model; the diagram can be incrementally displayed using iDot. VN supports teaching nondeterministic finite automata. The Erigone model checker is a partial reimplementation of Spin designed to be easy to use, well structured and well documented. It produces a full trace of the execution of the model checker in a format that is both readable and amenable to postprocessing.

1 Introduction

Concurrency is a notoriously difficult subject to learn because the indeterminate behavior of programs poses challenges for students used to debugging programs by trial and error. They must learn new concepts of specification and correctness, as well as formal methods such as state transition diagrams and temporal logic. Nevertheless, efforts are being made to teach this fundamental topic to beginning undergraduates and even to high school students, hopefully, before a sequential mindset takes hold [1,2]. We have found that students at this level are fully capable of understanding basic concepts of concurrency such as race condition, atomicity, interleaving, mutual exclusion, deadlock and starvation.

Special tools are needed to teach concurrency, because the student must be able to examine and construct scenarios in full detail, and this is not possible without fine-grained control of the interleaving. The traditional educational tool is a concurrency simulator for (a subset of) a language like Pascal or C augmented with processes, semaphores and monitors [3,4]. Concurrency is now being taught in Java because of its near-universal use in teaching introductory programming. I believe that this is far from optimal for several reasons: concurrency is inextricably bound with objects, there is no fine-grained control of the interpreter, the language-defined monitor-like construct is too high-level and not well designed, and the java.util.concurrent library is too complex.


Several years ago, I became convinced that the Spin model checker is appropriate for teaching concurrency even to beginning students: (a) The simplicity of the Promela language does not obscure the basic concepts; (b) There is direct support for fundamental constructs like shared variables, semaphores (using atomic) and channels; (c) The execution can be displayed in full detail and the interleaving of individual statements controlled. For teaching at introductory levels, I feel that Spin is a better tool than Java PathFinder [5], because Promela is much simpler than Java, and Spin is a simpler tool to use. GOAL [6] can be used for studying concurrency through infinite automata, but these students have yet to study even finite automata.

Learning Spin led me to write a new edition of my concurrency textbook [7] and an introductory textbook on Spin [8]. During this period, I developed software tools for teaching: jSpin, a development environment; SpinSpider, which automatically generates the state space diagram of a Promela program that can be viewed incrementally with iDot; and VN, a tool to support learning nondeterministic finite automata (NDFA). The Erigone model checker is a partial reimplementation of Spin designed to simplify model checking when teaching concurrency and to facilitate learning model checking itself. (Screenshots of the tools can be found on the website given in Section 8.)

2 jSpin: A Development Environment for Spin

The downside of using Spin for teaching is that it requires a C compiler, which can be difficult for beginning students to install. Furthermore, Spin is a command-line tool with a daunting list of arguments, while the XSpin interface is oriented at professional users and requires Tcl/Tk, which students do not use. jSpin is a simple GUI that allows the student to edit Promela programs and to execute Spin in its various modes with a single mouse click or keypress. jSpin formats the output of a simulation (random, interactive or guided) as a scenario in a tabular representation. You can specify that certain variables and statements will be excluded from the table, so that less important information like incrementing indices does not clutter the display.

An example of the output of jSpin is shown below. It is for a program that repeatedly increments a global variable concurrently in two processes by loading its value into local variables and then storing the new value. Surprisingly, there is a scenario in which the final value of the global variable can be two! The final part of a trail is shown below; the barrier variable finished and the loop indices have been excluded, as have the statements that increment the indices.

    Process   Statement            P(0):temp  Q(1):temp  n
    0 P       9  temp = n          8          1          9
    0 P       10 n = (temp+1)      9          1          9
    1 Q       22 n = (temp+1)      9          1          10
    1 Q       25 finished = (fi    9          1          2
    0 P       13 finished = (fi    9          1          2
    2 Finis   29 finished==2       9          1          2
    spin: text of failed assertion: assert((n>2))


Each line shows the statement that is to be executed and the current values of all the variables; the result of executing the statement is reflected in the following line. Spin only displays the values of variables that have changed, while this tabular format facilitates understanding executability and execution of expressions and statements that can involve several variables.
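The race behind this trail is easy to reproduce outside Spin as well. The following C program is our own illustration of the same load/store interleaving (the paper's Promela model is not reproduced here); with unlucky scheduling, updates are lost exactly as in the scenario above.

    #include <pthread.h>
    #include <stdio.h>

    volatile int n = 0;

    /* Each thread increments n ten times, but through a local temp,
       so a load can be separated from its store by the other thread.
       Interleavings exist in which almost every update is lost. */
    static void *inc(void *arg) {
        (void)arg;
        for (int i = 0; i < 10; i++) {
            int temp = n;      /* "temp = n" in the trail  */
            n = temp + 1;      /* "n = (temp+1)"           */
        }
        return NULL;
    }

    int main(void) {
        pthread_t p, q;
        pthread_create(&p, NULL, inc, NULL);
        pthread_create(&q, NULL, inc, NULL);
        pthread_join(p, NULL);
        pthread_join(q, NULL);
        printf("n = %d\n", n);  /* Spin proves that n == 2 is reachable */
        return 0;
    }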

3 SpinSpider: Visualizing the State Space

SpinSpider uses data available from Spin to generate a graphical representation of the complete state space of a Promela program; it is written out in the dot language and laid out by the dot tool of Graphviz [9]. The trail of a counterexample can be displayed as a separate graph or it can be emphasized in the graph of the state space. SpinSpider is integrated into the jSpin environment, although it can be run as a standalone application. iDot displays the graphics files generated by SpinSpider interactively and incrementally.

4 VN: Visualization of Nondeterminism

Nondeterminism is a related, but important, concept that is difficult for students to understand [10]. In particular, it is difficult to understand the definition of acceptance by an NDFA. VN visually demonstrates nondeterminism by leveraging simulation and verification in Spin together with the graph layout capabilities of dot. Its input is an XML representation of an NDFA generated interactively by JFLAP [11]. For any NDFA and input string, a Promela program is generated with embedded printf statements that create a file describing the path in the NDFA taken by an execution. VN runs the program in random simulation mode to show that arbitrary scenarios will be generated for each execution. In interactive simulation mode, the user resolves the nondeterminism like an oracle. Verification is used to show the existence of an accepting computation.

5 The Erigone Model Checker

My experience—both teaching concurrency and developing the tools described above—led me to develop the Erigone model checker, a simplified reimplementation of Spin. The rationale and design principles were as follows:

Installation and execution. The installation of a C compiler is a potential source of error for inexperienced students, so Erigone is designed as a single executable file. The size of the state vector is static, but this is not a problem for the small programs taught in a course. The constants defining the size of the state vector are declared with limited scope so that a future version could—like Spin—generate a program-specific verifier with minimal recompilation.


Tracing. The implementation of the above tools was difficult because there is no uniform and well-documented output from Spin. Erigone uses a single format—named association—that is both easy to read (albeit verbose) and easy to postprocess. Here are a few lines of the output of a step of a simulation of the "Second Attempt" to solve the critical section problem [7, Section 3.6]:

    next state=,p=3,q=8,wantp=1,wantq=0,critical=0,
    all transitions=2,
    process=p,source=3,target=4,...,statement={critical++},...,
    process=q,source=8,target=2,...,statement={!wantp},...,
    executable transitions=1,
    process=p,source=3,target=4,...,statement={critical++},...,
    chosen transition=,
    process=p,source=3,target=4,...,statement={critical++},...,
    next state=,p=4,q=8,wantp=1,wantq=0,critical=1,

Four data options are displayed here: all the transitions from a state, the executable transitions in that state, the chosen transition and the states of the simulation. Fifteen arguments independently specify which data will be output: (a) the symbol table and transitions that result from the compilation; (b) for an LTL to BA translation, the nodes of the tableau and the transitions of the BA; (c) for a simulation, the options shown above; (d) for a verification, the sets of all and executable transitions, and the operations on the stacks and on the hash table; (e) runtime information. The display of the first step of a verification is (with elisions):

    push state=0,P=1,Q=1,R=1,n=0,finish=0,P.temp=0,Q.temp=0,
    all transitions=3,
    ...
    executable transitions=2,
    process=P,source=1,target=2,...,line=5,statement={temp=n},
    process=Q,source=1,target=2,...,line=12,statement={temp=n},
    push transition=0,process=1,transition=0,...,visited=false,last=true,
    push transition=1,process=0,transition=0,...,visited=false,last=false,
    top state=0,P=1,Q=1,R=1,n=0,finish=0,P.temp=0,Q.temp=0,
    top transition=1,process=0,transition=0,...,visited=false,last=false,
    inserted=true,P=2,Q=1,R=1,n=0,finish=0,P.temp=0,Q.temp=0,

Postprocessing. A postprocessor, Trace, was written; it implements a filtering algorithm like the one in jSpin that formats a scenario as a table, excluding variables and statements as specified by the user. Because of the uniform output of Erigone, it could be implemented in a few hours.

Well structured and well documented. Learning model checking is difficult because there is no intermediate description between the high-level pseudocode in research papers and books, and the low-level C code of Spin. This also has implications for research into model checking, because graduate students who would like to modify Spin's algorithms have to learn the C code. During the development of Erigone, continuous effort was invested in refactoring and documentation to ensure the readability and maintainability of the software.
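The claim that the named-association format is easy to postprocess is simple to check. The following C fragment is a minimal sketch of such a postprocessor (our own illustration, not the code of Trace): every record is just a comma-separated list of name=value pairs.

    #include <stdio.h>
    #include <string.h>

    /* Split one line of name=value,name=value,... output into pairs.
       A pair with an empty value (e.g. "next state=") names the record. */
    static void parse_line(char *line) {
        for (char *pair = strtok(line, ","); pair != NULL;
             pair = strtok(NULL, ",")) {
            char *eq = strchr(pair, '=');
            if (eq != NULL) {
                *eq = '\0';
                printf("  name=[%s] value=[%s]\n", pair, eq + 1);
            }
        }
    }

    int main(void) {
        char line[] = "next state=,p=3,q=8,wantp=1,wantq=0,critical=0";
        parse_line(line);
        return 0;
    }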

6 The Implementation of Erigone

Erigone is a single program consisting of several subsystems: (1) a top-down compiler that translates Promela into transitions with byte code for the statements and expressions; (2) a model checker that implements the algorithms as described in [12], except that an explicit stack for pending transitions from the states is used instead of recursion; (3) a translator of LTL to BA using the algorithm in [13]. The compiler and the LTL-to-BA translator can be run independently of the model checker.

Erigone is implemented in Ada 2005. This language was chosen because of its superb facilities for structuring programs and its support for reliable software. An additional advantage of Ada is that the source code can be read as easily as pseudocode. A researcher who wishes to modify Erigone will have to learn Ada, but I believe that that is not a difficult task and that the reliability aspects of the language will repay the effort many times over.
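To illustrate point (2), here is a sketch in C of depth-first search driven by an explicit stack of pending transitions rather than recursion. The state and transition helpers are invented placeholders (Erigone itself is written in Ada), so this shows only the control structure.

    /* Helpers assumed to exist elsewhere; illustrative prototypes only. */
    extern int successor_count(int state);
    extern int apply(int state, int transition);   /* returns successor  */
    extern int lookup_or_insert(int state);        /* nonzero if visited */

    typedef struct { int state; int next_transition; } Frame;

    #define MAX_DEPTH 100000
    static Frame stack[MAX_DEPTH];   /* no bounds check: sketch only */

    void dfs(int initial) {
        int top = 0;
        stack[top] = (Frame){ initial, 0 };
        lookup_or_insert(initial);
        while (top >= 0) {
            Frame *f = &stack[top];
            if (f->next_transition == successor_count(f->state)) {
                top--;                     /* all pending transitions done */
                continue;
            }
            int succ = apply(f->state, f->next_transition++);
            if (!lookup_or_insert(succ))   /* unvisited: descend */
                stack[++top] = (Frame){ succ, 0 };
        }
    }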

7 Current Status of Erigone and Future Plans

Erigone implements enough of Promela to study the basic concepts and algorithms found in textbooks like [7], in particular, the safety and liveness of Dijkstra’s “four attempts” and Dekker’s algorithm. Weak fairness is implemented and weak semaphores can be defined using atomic. Correctness specifications are given using assert and LTL formulas. Arrays can be used for solving nondeterministic algorithms [14], [8, Chapter 11] and for simulating NDFAs. Channels will be implemented in the near future. Versions of jSpin and VN for Erigone are under development. Future plans include: (a) Develop a regression test suite; (b) Develop interactive visualizations (unlike SpinSpider which uses postprocessing); (c) Visualize the LTL to BA translation, perhaps by integrating Erigone with GOAL[6]; (d) Measure the performance and improve the efficiency to the extent that it can be done without sacrificing clarity; (e) Implement more efficient algorithms for LTL to BA translation, fairness and state compression, but as optional additions rather than as replacements for the existing algorithms, so that students can begin by learning the simpler algorithms; (f) Use the excellent concurrency constructs in Ada to implement a parallel model checker [15].

8 Availability of the Tools

All these tools are freely available under the GNU General Public License and can be downloaded from Google Code; see the links at: http://stwww.weizmann.ac.il/g-cs/benari/home/software.html The GNAT compiler from AdaCore was used; it is freely available under the GNU GPL for Windows and Linux, the primary platforms used by students. jSpin is implemented in Java, as are SpinSpider, iDot and VN.


Acknowledgements

Mikko Vinni of the University of Joensuu developed iDot, and Trishank Karthik Kuppusamy of New York University wrote the Promela compiler under the supervision of Edmond Schonberg. Michal Armoni helped design VN. I am deeply indebted to Gerard Holzmann for his unflagging assistance throughout the development of these tools. I would also like to thank the many victims of my emails asking for help with the model-checking algorithms.

References

1. Arnow, D., Bishop, J., Hailperin, M., Lund, C., Stein, L.A.: Concurrency the first year: Experience reports. ACM SIGCSE Bulletin 32(1), 407–408 (2000)
2. Ben-David Kolikant, Y.: Understanding Concurrency: The Process and the Product. PhD thesis, Weizmann Institute of Science (2003)
3. Ben-Ari, M.: Principles of Concurrent Programming. Prentice-Hall International, Hemel Hempstead (1982)
4. Bynum, B., Camp, T.: After you, Alfonse: A mutual exclusion toolkit. ACM SIGCSE Bulletin 28(1), 170–174 (1996)
5. Visser, W., Havelund, K., Brat, G., Park, S., Lerda, F.: Model checking programs. Automated Software Engineering 10(2), 203–232 (2003)
6. Tsay, Y.K., Chen, Y.F., Tsai, M.H., Wu, K.N., Chan, W.C., Luo, C.J., Chang, J.S.: Tool support for learning Büchi automata and linear temporal logic. Formal Aspects of Computing 21(3), 259–275 (2009)
7. Ben-Ari, M.: Principles of Concurrent and Distributed Programming, 2nd edn. Addison-Wesley, Harlow (2006)
8. Ben-Ari, M.: Principles of the Spin Model Checker. Springer, London (2008)
9. Gansner, E.R., North, S.C.: An open graph visualization system and its applications to software engineering. Software Practice & Experience 30(11), 1203–1233 (2000)
10. Armoni, M., Ben-Ari, M.: The concept of nondeterminism: Its development and implications for education. Science & Education (2009) (in press), http://dx.doi.org/10.1007/s11191-008-9147-5
11. Rodger, S.H., Finley, T.W.: JFLAP: An Interactive Formal Languages and Automata Package. Jones & Bartlett, Sudbury (2006)
12. Holzmann, G.J.: The Spin Model Checker: Primer and Reference Manual. Addison-Wesley, Boston (2004)
13. Gerth, R., Peled, D., Vardi, M.Y., Wolper, P.: Simple on-the-fly automatic verification of linear temporal logic. In: Fifteenth IFIP WG6.1 International Symposium on Protocol Specification, Testing and Verification XV, pp. 3–18 (1996)
14. Floyd, R.W.: Nondeterministic algorithms. Journal of the ACM 14(4), 636–644 (1967)
15. Holzmann, G.J., Joshi, R., Groce, A.: Tackling large verification problems with the Swarm tool. In: Havelund, K., Majumdar, R., Palsberg, J. (eds.) SPIN 2008. LNCS, vol. 5156, pp. 134–143. Springer, Heidelberg (2008)

Fast, All-Purpose State Storage

Peter C. Dillinger and Panagiotis (Pete) Manolios
College of Computer and Information Science, Northeastern University
360 Huntington Ave., Boston MA 02115, USA
{pcd,pete}@ccs.neu.edu

Abstract. Existing techniques for approximate storage of visited states in a model checker are too special-purpose and too DRAM-intensive. Bitstate hashing, based on Bloom filters, is good for exploring most of very large state spaces, and hash compaction is good for high-assurance verification of more tractable problems. We describe a scheme that is good at both, because it adapts at run time to the number of states visited. It does this within a fixed memory space and with remarkable speed and accuracy. In many cases, it is faster than existing techniques, because it only ever requires one random access to main memory per operation; existing techniques require several to have good accuracy. Adapting to accommodate more states happens in place using streaming access to memory; traditional rehashing would require extra space, random memory accesses, and hash computation. The structure can also incorporate search stack matching for partial-order reductions, saving the need for extra resources dedicated to an additional structure. Our scheme is well-suited for a future in which random accesses to memory are more of a limiting factor than the size of memory.

1 Introduction

An efficient explicit-state model checker such as Spin can easily fill main memory with visited states in minutes if storing them exactly [1]. This is a hindrance to automatically proving properties of large asynchronous programs by explicit state enumeration. In most cases, a level of uncertainty in the completeness of the verification is acceptable, either because of other uncertainties in the process or because one is simply looking for errors. Over-approximating the set of visited states can allow orders of magnitude more states to be explored quickly using the same amount of memory.

Bitstate hashing [2], which uses Bloom filters [1,3,4], is the pinnacle of exploring as many states as possible when available memory per state is very small—say, less than 8 bits per state. The configuration that tends to cover the largest proportion of the state space in those conditions (setting three bits per state) covers even more when there is more memory per state, but it does not utilize the extra memory well.


Using different configurations of the same or a different structure makes better use of the extra memory and comes much closer to full coverage of the state space—or achieves it. At around 36 bits per state, hash compaction has a good probability of full coverage [5], while the standard bitstate configuration omits states even with 300 bits per state.

The difficulty in making better use of more memory is that special knowledge is needed for known schemes to offer a likely advantage. In particular, one needs to know approximately how many states will be visited in order to tune the data structures to be at their best. In previous work, we described how to use a first run with the standard bitstate approach to inform how best to configure subsequent runs on the same or a related model [4]. This can help one to achieve a desired level of certainty more quickly. We also implemented and described automatic tool support for this methodology [6]. The shortcoming of this methodology is that the initial bitstate run can omit many more states than theoretically necessary, if there are tens of bits of memory per state.

Ideally, no guidance would be required for a model checker to come close to the best possible accuracy for available memory in all cases. Such a structure would be good at both demonstrating absence of errors in smaller state spaces and achieving high coverage of larger state spaces. If it were competitively fast, such a structure would be close to the best choice conceivable when the state space size is unknown.

This paper describes a scheme that is closer to this ideal than any known. The underlying structure is a compact hash table by John G. Cleary [7], which we make even more compact by eliminating redundancy in metadata. Our most important contribution, however, is a fast, in-place algorithm for increasing the number of cells by reducing the size of each cell. The same structure can similarly be converted in place to a Bloom filter similar to the standard bitstate configuration. These algorithms allow the structure to adapt to the number of states encountered at run time, well-utilizing the fixed memory in every case.

Do not mistake this scheme for an application of well-known, classical hash table technology. First of all, the Cleary structure represents sets using near minimum space [8], unlike any classical structure, such as that used by the original "hashcompact" scheme [9]. The essence of this compactness is that part of the data of each element is encoded in its position within the structure, and it does this without any pointers. Second, our algorithm for increasing the number of cells is superior to classical rehashing algorithms. Our adaptation algorithm requires no hash function computation, no random memory accesses, and O(1) auxiliary memory.

Our scheme is also competitively fast, especially when multiple processor cores are contending for main memory. Because it relies on bidirectional linear probing, only one random access to main memory is needed per operation. Main memory size is decreasingly a limiting factor in verification, and our scheme is usually faster than the standard bitstate approach until memory is scarce enough to shrink cells for the first time. Even then, each adaptation operation to shrink the cells only adds less than two percent to the accumulated running time.


Execution time can also be trimmed by integrating into the structure the matching of states on the search stack or search queue, used for partial-order reductions [10,11]. This entails dedicating a bit of each cell to indicating whether that element is on the stack. In many cases, such integration reduces the memory required for such matching and improves its accuracy as well.

In Section 2 we overview Cleary's compact hash tables and describe a noticeable improvement. Section 3 describes our fast adaptation algorithms. Section 4 describes how to incorporate search stack/queue matching to support partial-order reduction. Section 5 tests the performance of our scheme. In Section 6, we conclude and describe avenues for future work.

2 Cleary Tables

John G. Cleary describes an exact representation for sets in which only part of the descriptor of an element and a constant amount of metadata needs to be stored in each cell [7]. The rest of the descriptor of each element is given by its preferred location in the structure; the metadata link elements in their actual location with their preferred location. We describe one version of Cleary's structure and describe how redundancy in the metadata enables one of three metadata bits to be eliminated.

A Cleary table is a single array of cells. For now, assume that the number of cells is 2^a, where a is the number of address bits. If each element to be added is b bits long, then the first a bits are used to determine the home address (preferred location) and the remaining b − a bits are the entry value stored in the cell. With three metadata bits, each cell is therefore b − a + 3 bits.

At this point, we notice some limitations to the Cleary structure. First, each element added has to be of the same fixed size, b bits. It might also be clear that operations could easily degrade to linear search unless the elements are uniformly distributed. This can be rectified by using a randomization function (1:1 hash), but in this paper, we will use the Cleary table exclusively to store hashes, which are already uniformly distributed. We are using an exact structure to implement an inexact one, a fine and well-understood approach [8,12].
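As a concrete illustration of this split (our own, not code from the paper), the following C fragment carves a b-bit hash into an a-bit home address and a (b − a)-bit entry, with b = 58 and a = 28 as in the worked example of Section 2.4:

    #include <stdint.h>
    #include <stdio.h>

    enum { B = 58, A = 28 };   /* hash bits; address bits */

    int main(void) {
        /* Some 58-bit hash of a state descriptor (value is arbitrary). */
        uint64_t hash  = 0x02F3C9A1B4D5E6AULL & ((1ULL << B) - 1);
        uint64_t home  = hash >> (B - A);                /* 28 high bits */
        uint64_t entry = hash & ((1ULL << (B - A)) - 1); /* 30 low bits  */
        printf("home=0x%07llX entry=0x%08llX\n",
               (unsigned long long)home, (unsigned long long)entry);
        return 0;
    }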

2.1 Representation

We can’t expect each element added to be stored in the cell at its home address; this is where bi-directional linear probing and the metadata come in. Entries with the same home address will be placed in immediate succession in the array, forming chains. The change metadata bit marks that the entry in that cell is the beginning of a chain. The mapped bit at an address marks that there is a chain somewhere in the table with entries with that home address. The occupied bit simply indicates whether an entry is stored in that cell. The nth change bit that is set to 1 begins the chain of entries whose home address is where the nth mapped bit that is set to 1 is located. To ensure that every occupied entry belongs to a chain with a home address, a Cleary table maintains the following invariant:


Invariant 1. In a Cleary table, the number of mapped bits set is the same as the number of change bits set. Furthermore, the first occupied cell (if there is one) has its change bit set.

The chains do not have to be near their homes for the representation to work, but the order of the chains corresponds to the order of the set mapped bits. Conceptually, the occupied and change bits relate to what is in the cell, and the mapped bit relates to the home address whose preferred location is that cell. One could implement a variant in which the number of cells and home locations is different, but we shall keep them the same.

2.2 Random Access

For the structure to be fast, chains should be near their preferred location, but it is not always possible for each chain to overlap with its preferred location. Nevertheless, this next invariant makes fast access the likely case when a portion of cells are left unoccupied:

Invariant 2. In a Cleary table, all cells from where an element is stored through its preferred location (based on its home address) must be occupied.

This basically says that chains of elements must not be interrupted by empty cells, and that there must not be any empty cells between a chain and its preferred location. Consequently, when we go to add an element, if its preferred location is free/unoccupied, we store it in that cell and set its change bit and mapped bit. We know by Invariant 2 that if the preferred location of an element is unoccupied, then no elements with that home address have been added. Consequently, the mapped bit is not set and there is no chain associated with that home address. (See the first two additions in Figure 1.)

If we're trying to add an element whose preferred location is already occupied, we must find the chain for that home address—or where it must go—in order to complete the operation. Recall that the element stored in the corresponding preferred location may or may not be in the chain for the corresponding home address. To match up chains with home addresses—to match up change bits with mapped bits—we need a "synchronization point." Without Invariant 2, the only synchronization points were the beginning and end of the array of cells. With Invariant 2, however, unoccupied cells are synchronization points. Thus, to find the nearest synchronization point, we perform a bidirectional search for an unoccupied cell from the preferred location. From an unoccupied, unmapped cell, we can backtrack, matching up set mapped bits with set change bits, until we reach the chain corresponding to the home address we are interested in—or the point where the new chain must be inserted to maintain the proper matching. If we are adding, we have already found a nearby empty cell and simply shift all entries (and their occupied and change bits) toward and into the empty cell, opening up a space in the correct chain—or where the new chain must be added.


Fig. 1. This diagram depicts adding eight elements to a Cleary table with eight cells. In this example, the elements are seven bits long, the home addresses are three bits long, and the cell data entries are, therefore, four bits long. Each cell is shown with three metadata bits (mapped, change, and occupied). The lines connecting various metadata bits depict how the metadata bits put the entries into chains associated with home addresses.


The remaining details of adding are a trivial matter of keeping the change and mapped bits straight. See Figure 1 for examples, but Cleary's paper can be consulted for full algorithms [7].

Because each add or query operation requires finding an empty cell and the structure uses linear probing, the average search length can grow very large if the structure is allowed to grow too full. Others have analyzed this problem in detail [13], but in practice, more than 90% occupancy is likely to be too slow. We default to allowing no more than 85% occupancy. Note, however, that the searching involves linear access to memory, which is typically much less costly than random access. For large structures, random accesses are likely to miss processor cache, active DRAM pages, and even TLB cache.

2.3 Eliminating occupied Bits

The last invariant was originally intended to speed up lookups, but we show how it can be used to eliminate the occupied bits:

Invariant 3. In a Cleary table, all entries in a chain are put in low-to-high unsigned numerical order.

This makes adding slightly more complicated, but probably does not affect much the time per visited-list operation, in which all negative queries become adds. With Invariant 3, the occupied bits are redundant because if we fill all unoccupied entries with all zeros and make sure their change bit is unset, then entries with all zeros are occupied iff their change bit is set. This is because entries with all zeros will always be first in their chain, so they will always have their change bit set. Eliminating this bit allows another entry bit to be added in the same amount of memory, which will cut expected omissions in half.

This optimization does seem to preclude the encoding of multisets by repeating entries, because only one zero entry per chain is allowed under the optimized encoding. It seems this is not a problem for visited lists, but it will complicate the combination of adaptation and stack matching.
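In code, the test becomes a two-condition predicate. The cell layout below is a hedged sketch of ours (the paper does not fix a struct); only the logic of the test comes from the text:

    #include <stdint.h>

    /* Two metadata bits remain after the occupied bit is eliminated.
       Free cells are kept all-zero with change unset, so a cell is
       occupied iff its entry is nonzero, or it is a genuine zero entry,
       which is always first in its chain and so carries the change bit. */
    typedef struct {
        unsigned mapped : 1;
        unsigned change : 1;
        uint32_t entry;          /* the b - a entry bits of the hash */
    } Cell;

    static int is_occupied(const Cell *c) {
        return c->entry != 0 || c->change;
    }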

2.4 Analysis

As briefly mentioned, we will use this structure to store exactly inexact hashes of the visited states. This makes the probabilistic behavior relatively easy to analyze. If the hashes are b bits long with a bits used for the home address, the structure has 2^a cells of b − a + 2 bits each. Let n be the number of cells occupied, which is also the number of states recognized as new. The probability of the next new state appearing to have already been visited is f = n/2^b. The expected number of new states falsely considered visited until the next one correctly recognized as new is f/(1 − f). (For example, when f = 0.5, we expect 0.5/(1 − 0.5) = 1 new state falsely considered visited per new state correctly recognized as new.) Thus, the expected hash omissions for a Cleary table after recognizing n states as new is

\hat{o}_{CT}(n, b) = \sum_{i=0}^{n-1} \frac{i/2^b}{1 - i/2^b} \approx 2^b \int_0^{n/2^b} \frac{f}{1 - f} \, df = -n - 2^b \ln\!\left(1 - \frac{n}{2^b}\right)   (1)


Note that floating-point arithmetic, which is not good at representing numbers very close to 1, is likely to give inaccurate results for the last formula. Here are simpler bounds, which are also approximations when n ≪ 2^b:

\frac{n-1}{2} \cdot \frac{n/2^b}{1} = \frac{n(n-1)}{2^{b+1}} \le \hat{o}_{CT}(n, b) \le \frac{n-1}{2} \cdot \frac{n/2^b}{1 - n/2^b} = \frac{n(n-1)}{2^{b+1} - 2n}   (2)

For example, consider storing states as b = 58-bit hashes in 2^28 cells. There are then a = 28 address bits and 58 − 28 + 2 = 32 bits per cell. That is correct: a Cleary table can store any 2^28 58-bit values using only 32 bits for each one. If we visit n = 2 × 10^8 states, we expect 0.06939 hash omissions (all approximations agree, given precise enough arithmetic). Consequently, the probability of any omissions is less than 7%. The structure is approximately 75% full, which means the structure is using roughly 43 bits per visited state.

Non-powers of 2. The problem of creating a Cleary table of hash values with a number of cells (and home addresses) that is not a power of two can be reduced to the problem of generating hash values over a non-power-of-two range. Take how many cells (home addresses) are desired, multiply by 2^(b−a) (the number of possible entry values), and that should be the number of possible hash values. In that case, stripping off the address bits—the highest-order bits of the hash value—results in addresses in the proper range.
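The floating-point caveat above has a standard remedy: compute ln(1 − x) with log1p, which stays accurate for the tiny x = n/2^b that arise here. The following C sketch (ours, not from the paper) reproduces the worked example:

    #include <math.h>
    #include <stdio.h>

    /* Equation (1): o(n, b) = -n - 2^b * ln(1 - n/2^b).
       log1p(-x) evaluates ln(1 - x) without cancellation for small x. */
    static double expected_omissions(double n, int b) {
        double values = exp2(b);            /* 2^b possible hash values */
        return -n - values * log1p(-n / values);
    }

    int main(void) {
        /* b = 58, n = 2e8, as in the example: prints about 0.06939. */
        printf("%.5f\n", expected_omissions(2e8, 58));
        return 0;
    }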

3 Fast Adaptation

Here we describe how to adapt a Cleary table of hash values to accommodate more values in the same space, by forgetting parts of the stored hash values. The basis for the algorithms is a useful "closer-first" traversal of the table entries. In this paper, we use this traversal in doubling the number of cells, by cutting the size of each in half, and in converting the table into certain Bloom filters. Both of these can be done in place, making only local modifications throughout the structure (no random accesses).

3.1 Twice as Many, Half the Size

Consider the difference between a Cleary table with cells of size 2^j = b − a + 2 bits and one with cells of half that size, 2^(j−1) = b′ − a′ + 2 bits. If they are the same size overall, then the second has twice as many addresses—one more address bit: a′ = a + 1. If the elements added to these are prefixes of the same set of hash values, they will have similar structure. Each home address in the first corresponds to two home addresses in the second, so each mapped home address in the first will have one or both of the corresponding home addresses in the second mapped. In fact, the left-most (highest) bit of an entry in the first determines which of the two corresponding home addresses it has in the second. Thus, the entries in the second have one less bit on the left, which is now part of the home address, and 2^(j−1) − 1 fewer on the right. Only the 2^(j−1) − 1 bits on the right are truly missing in the second structure; that is half of the 2^j − 2 bits per entry in the first structure. For example, if j = 5 and a = 20, the first structure has 2^5 = 32 bits per cell, 32 − 2 = 30 bits per stored entry, and each element is 30 + 20 = 50 bits. The second structure has 16 bits per cell, 16 − 2 = 14 per stored entry, 20 + 1 = 21 address bits, and 14 + 21 = 35 bits per element. The second effectively represents 30/2 = 15 fewer bits per element. Both structures are 32 × 2^20 bits, or 4 megabytes, overall.

Converting from the first to the second requires careful attention to two things: (1) making sure the new mapped and change bits are updated properly, as each one set in the old structure could entail setting one or two in the new, and (2) making sure to shift elements toward their preferred location, displacing any empty cells in between. These correspond to preserving Invariant 1 and Invariant 2, respectively. Preserving Invariant 3 is just a matter of keeping elements in the same order. In fact, if a chain becomes two in the new structure, Invariant 3 guarantees that all elements in the new first chain already come before the elements of the new second chain, because elements with highest-order bit 0 come before those with highest-order bit 1!

A naive approach that converts the data entries in order from "left" to "right" or "right" to "left" fails. As we iterate through the chains, we need to update the new mapped bits according to the presence of new chains, and those mapped bits might be off to either side of the chains. The problem is that half of the new mapped bits are where old data entries are/were, and we cannot update mapped bits on the side we have not yet processed. We could queue up those new mapped bits and update them when we get there, but the queue size is technically only bounded by the length of the whole structure. A preferable solution should only process data entries whose new mapped bit lies either in the cell being processed or in cells that have already been processed. We can solve this problem with a traversal that processes entries in an order in which all entries between an entry and its preferred location are processed before that entry is processed.

3.2 Closer-First Traversal

It turns out that virtually any in-place, fast adaptation we want to do on a Cleary table can be done by elaborating a traversal that processes the entries between any given entry and its preferred location before processing that entry. This traversal can be done in linear time and constant space. All accesses to the table are either linear/streaming or expected to be cached from a recent linear access. The following theorem forms the basis for our traversal algorithm:

Theorem 1. In a Cleary table, the maximal sequences of adjacent, occupied cells can be divided uniquely into subsequences, all with the following structure:

– Zero or more "right-leaning" entries, each with its preferred location higher than its actual location, followed by
– One "pivot" entry at its preferred location, followed by
– Zero or more "left-leaning" entries, each with its preferred location lower than its actual location.

Proof Idea. The proof is by induction on the number of pivots in a maximal sequence, using these lemmas:

– An entry in the first cell of the structure is not left-leaning, and an entry in the last cell is not right-leaning. (This would violate Invariant 1 or how change bits are implicitly connected to mapped bits.)
– An entry adjacent to an unoccupied cell cannot be leaning in the direction of the unoccupied cell. (This would violate Invariant 2.)
– A left-leaning entry cannot immediately follow a right-leaning entry. (Such entries would be in different chains and violate how change bits are implicitly connected to mapped bits.) ⊓⊔

This theorem solves the bootstrapping problem of where to start. We start from each pivot and work our way outward. To minimize the traversing necessary to process entries in an acceptable order, we adopt this order within each subsequence: process the pivot, process the right-leaning entries in reverse (right-to-left) order, and then process the left-leaning entries in (left-to-right) order.

[Figure 2: an eight-cell Cleary table (home addresses 000-111, data entries 1010 1011 1101 0011 0101 1110 1100 0110, with mapped, change, and occupied bits), each entry categorized per Theorem 1 with pivots marked "P", and the closer-first processing order 1-8.]

Fig. 2. This diagram takes the final Cleary table from Figure 1, shows how the entries are categorized according to Theorem 1, and lists the order of processing by the closer-first traversal

A natural implementation of the overall traversal, which finds subsequences "on the fly," goes like this: (1) Remember the starting location. (2) Scan to a pivot and remember its location. (3) Process the pivot and then all entries back through the starting location (reverse order). (4) Process entries after the pivot until one that is not left-leaning, and go back to (1). To determine the direction of each entry's home, the home addresses for the current and saved locations are tracked and updated incrementally as those locations advance. Proper tracking eliminates the need to search for synchronization points. All accesses are adjacent to the previous one, except between steps (3) and (4). Since the pivot was processed recently, it will still be cached with high probability. A corollary of Theorem 1 is that processing entries in this order guarantees that processing the (old or new) home location of an entry will never come after the processing of that entry.
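The loop structure can be sketched in C as follows; occupied, home (the preferred cell of the entry in cell i), and process are hypothetical helpers standing in for details not shown here, and a real implementation would track home addresses incrementally rather than recomputing them.

#include <stddef.h>

/* Hypothetical helpers -- see the note above. */
int occupied(size_t i);
size_t home(size_t i);
void process(size_t i);

/* Closer-first traversal sketch: within each subsequence, process
   the pivot, then the right-leaning entries in reverse, then the
   left-leaning entries in order. */
void closer_first_traversal(size_t ncells)
{
    size_t i = 0;
    while (i < ncells) {
        if (!occupied(i)) { i++; continue; }
        size_t start = i;                /* (1) remember the start */
        while (home(i) > i)              /* (2) scan to the pivot  */
            i++;
        size_t pivot = i;
        for (size_t j = pivot + 1; j > start; j--)
            process(j - 1);              /* (3) pivot, then right-leaners, reversed */
        i = pivot + 1;                   /* (4) left-leaners until one that is not  */
        while (i < ncells && occupied(i) && home(i) < i)
            process(i++);
    }
}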

3.3 Building on the Traversal

The closer-first traversal allows us to compact cells to half their size and update mapped and change bits accordingly without overwriting unprocessed cells. (That handles Invariant 1.) Storing entries near their preferred location (Invariant 2) can also be handled in the same elaborated traversal. We store the pivot entry in the new cell that is its preferred location. For processing the right-leaning entries, we keep track of the "next" new location, which is the one just before (left of) the new cell we most recently put an entry into. Each right-leaning entry will be placed at the minimum (left-most) of that "next" location and the entry's preferred location. We do the symmetric thing for left-leaning entries. This procedure guarantees Invariant 2 because it places each entry either at its preferred location, or adjacent to an entry (which is adjacent to an entry . . . ) which is at that preferred location.

3.4 Our Design

Cell sizes should be powers of two to allow for repeated cutting in half. Our current design starts with 64 bits per cell. If there are, say, 2^28 cells (2 GB), states are stored as 62 + 28 = 90-bit hash values. The probability of any omissions is then theoretically one in billions. Jenkins' hash functions are fast and behave according to expectation [14]. Thus, the only tangible benefit of allowing larger values is psychological, and it might require more hash computation.

Recall that the structure becomes unacceptably slow beyond 90% occupancy. Thus, when an occupancy threshold is reached (default 85%), we convert from 64 to 32 bits per cell, from 32 to 16, and from 16 to 8. We do not go from 8 to 4. Consider what a Cleary table with 4 bits per cell would be like. Two bits are metadata and two bits are left for the entry, so each cell contains only one of four possible entries. But each cell is four bits long. This means we could represent sets of values with the same number of bits using the same amount of memory just by using them as bit indexes into a bit vector, and that would allow us to add any number of such values. That would be the same as a k = 1 Bloom filter. You could also think of it as a Cleary table with just mapped bits; entries are 0 bits, so there is no need for change bits. In other words, a Cleary table with 4 bits per cell is no more accurate than a k = 1 Bloom filter, cannot accommodate as many elements, and might be noticeably slower. Thus, we choose to convert from an 8-bit-per-cell Cleary table into a Bloom filter. We actually convert into a special k = 2 Bloom filter, but let us first examine how to convert an 8-bit-per-cell Cleary table into a single-bit (k = 1) Bloom filter.

3.5 Adapting to Bloom Filter

Adapting a Cleary table using 8 bits per cell into a single-bit (k = 1) Bloom filter is incredibly easy using the traversal. To turn an old entry into a Bloom filter index, we concatenate the byte address bits with the highest three data bits, from the six stored with each old entry. This means that setting the Bloom filter bit for each old entry will set one of the eight bits at that entry's preferred location. In other words, only bytes that had their mapped bits set will have bits set in the resulting Bloom filter. Using the same "closer-first" traversal guarantees that entries are not overwritten before being processed.

Unfortunately, single-bit Bloom filters omit states so rapidly that they often starve the search before they have a high proportion of their bits set. Holzmann finds that setting three bits per state (k = 3) is likely to strike the right balance between not saturating the Bloom filter and not prematurely starving the search. Unfortunately, we cannot convert a Cleary table using 8 bits per cell into any structure we want. First of all, the target needs to have locality in the conversion process, so that we can do the conversion in place. Second, it can only use as many hash value bits as are available in the 8-bit-per-cell Cleary table.

We believe the best choice is a special k = 2 Bloom filter that has locality. It is well known that forcing Bloom filter indices to be close to one another significantly harms accuracy, but we do not have much choice. The first index uses three of the six old entry bits to determine which bit to set. That leaves only three more bits to determine the second index, which can certainly depend on the first index. Running some simulations has indicated that it does not really matter how those three bits are used to determine the second index from the first; all that matters is that all three are used and that the second index is always different from the first. We have decided that the easiest scheme to implement uses those three bits as a bit index into the next byte. Thus, the same address bits that determined the home addresses for the 8-bit Cleary table determine the byte for the first Bloom filter index. The first three entry bits determine the index within that byte, and the next three determine the index within the next byte.

Altering the conversion/adaptation algorithm to generate this structure is not too tricky. Recognizing that we cannot order our traversal to avoid the second indices overwriting unprocessed entries, we must keep track of the new byte value that should follow the right-most processed entry and not write it until the entry has been processed. That is the basic idea, anyway. Our implementation caches up to three bytes yet to be written: one left over from the previous subsequence, one to come after the pivot, and one more to work with while iterating down the "leaning" sides.

We can analyze the expected accuracy of this scheme by building on results from previous work [4]. This is a k = 2 fingerprinting Bloom filter whose fingerprint size is 3 bits more than one index (log_2 s = 3 + log_2 m; s = 8m; s is the number of possible fingerprints and m is the number of bits of memory for the Bloom filter). Previous work tells us that a simple over-approximation of the expected hash omissions from a fingerprinting Bloom filter is the sum of the expected hash omissions due to fingerprinting, which we can compute using Equation 1 or 2 (b = 3 + log_2 m), and the expected hash omissions due to the

underlying Bloom filter, which is roughly \sum_{i=0}^{n-1} \left(1 - e^{-2i/m}\right)^2 \approx n\left(1 - e^{-2n/m}\right)^2 / 2. Thus, our rough estimate is

\hat{o}_{BF} \approx \frac{n(n-1)}{2(8m - n)} + \frac{n}{2}\left(1 - e^{-2n/m}\right)^2 \qquad (3)

That formula and Equations 1 and 2 give the expected omissions assuming we had been using the given configuration since the beginning of the search. That is easily corrected by subtracting the expected hash omissions for getting to the starting conditions of each new configuration—had that configuration been used from the beginning. For each Cleary table configuration, we subtract the expected hash omissions for the starting number of occupied cells from the expected hash omissions for the ending number of occupied cells. Note that collapses can make the next starting number smaller than the previous ending number. We cannot easily be so exact with the Bloom filter, so we can just use the previous ending number as the starting number.
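To make the bit placement concrete, here is a small C sketch of the index computation described in this subsection (the names home_byte and entry6 are ours, and we read "the first three entry bits" as the highest-order three); it illustrates the scheme rather than reproducing our implementation.

#include <stddef.h>

/* Indices for the special k = 2 Bloom filter: the high three of the
   six remaining entry bits pick a bit within the home byte, and the
   low three pick a bit within the following byte, so the two indices
   always differ.  For the k = 1 conversion, only idx1 is used. */
static void k2_indices(size_t home_byte, unsigned entry6,
                       size_t *idx1, size_t *idx2)
{
    unsigned hi3 = (entry6 >> 3) & 0x7;
    unsigned lo3 = entry6 & 0x7;
    *idx1 = home_byte * 8 + hi3;
    *idx2 = (home_byte + 1) * 8 + lo3;
}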

4 Search Stack Matching

Partial-order reductions play a central role in making contemporary explicit-state verifiers effective [15], by reducing the size of the state space that needs to be searched for errors. In many cases, the reduction is dramatic, such as making previously intractable problems tractable. A typical implementation requires runtime support in the form of a "cycle proviso," which needs to know whether a state is on the DFS stack [10] or BFS queue [11] (depending on whether depth-first or breadth-first search is being used). We will refer to this as checking whether the state is active.

Combining with visited list. A visited list based on cells, such as the Cleary table, can include a bit with each cell that indicates whether the state stored in that cell is active. This can be a compact option since no other random-access structure with this data is required. However, the relative size of the stack or queue can be small enough that the vast majority of active bits would be zero and a separate structure would be more compact. Speed should favor the unified structure, because a separate lookup is not required to check whether a state is active. Marking the state as active can also "piggy-back" on the initial look-up by saving the cell location where the new entry was just placed.

Accuracy is also a strong point of the unified structure. Specifically, stack/queue matching is 100% accurate whenever the visited list is accurate. Using a separate structure that is independently inaccurate could lead to imprecise reductions and error omission even if the visited list causes no omissions.

Complications. Despite the fact that multiple states can map to the same value at the same address in a Cleary table (or other "hash compaction" scheme), there is not traditionally a need to account for multiple stack entries per stored value, because only one such state would ever be recognized as new and, therefore, only one can ever be active. But our adaptive Cleary tables can have more than one state on the stack that map to the same table entry. When shrinking cells, some pairs of entries will no longer be distinguishable and are collapsed into one. (One of each such pair would have been a hash omission if the new cell size had been used from the beginning!) If both states are on the stack, however, we prefer to be able to say that there is more than one state on the stack matching a certain entry.

Cleary's table representation allows duplicate entries, so in these rare cases, perhaps we could duplicate the entry for each matching state on the stack. However, our optimization that allowed only two metadata bits per state assumed that an entry of all zeros would be at the beginning of a chain, and if we allow more than one entry of all zeros in a chain, this is no longer the case. However, the only case in which we want to have duplicate entries is when each of those entries needs to have its active bit set. As long as the active bit is set, therefore, it would be possible to distinguish the entry from an unoccupied cell. When the state gets removed from the stack/queue, however, we would need to do something other than clearing the active bit and turning the duplicate all-zeros entry into an unoccupied cell (which could violate Invariant 2). Since deletion from a Cleary table is possible, we just delete the duplicate entry when its active bit is cleared. Our implementation deletes all such duplicates, not just all-zero entries, to (1) make room for more additions and (2) maintain the invariant that no state that is active has a matching entry in the Cleary table without its active bit set.

A final complication comes when the structure becomes a Bloom filter, which is not based on cells. For a single-bit Bloom filter, we could have an active bit for each bit, but that would be a waste of space considering what a small proportion of visited states are typically on the stack. There is also the problem that a Cleary table with 8 bits per cell and an active bit in each cell only has five entry data bits per cell. Ideally, we want a two-bit Bloom filter that uses all five bits and takes up more space than the accompanying active information. Here is a design: use two bits per byte as a counter of the number of active states whose home is/was this byte. As in a counting Bloom filter [16], the counter can overflow and underflow, introducing the possibility of false negatives—in addition to the false positives due to sharing of counters. In the context of a search that is already rather lossy, these are not big issues. Six bits of each byte remain for the Bloom filter of visited states. If we set two bits per state, one in the home byte and one in the following byte, that is 36 possibilities for states whose home is this byte. The 5 bits of data left allow us to cover 32 of those 36 possibilities. Mapping those 5-bit values to pairs of indices 0..5 is efficient with a small look-up table. This makes it easy to spread indices somewhat evenly over that range, but all six bit indices cannot have the exact same weight/likelihood.
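A sketch of this per-byte layout and of one possible look-up table, under our own naming and an arbitrary fixed choice of the 32 covered pairs:

/* Per-byte layout: the top two bits count active states whose home
   is/was this byte; the low six are Bloom filter bits. */
#define ACTIVE_COUNT(byte)  ((unsigned)(byte) >> 6)
#define BLOOM_BITS(byte)    ((byte) & 0x3F)

/* Map a 5-bit entry value to a pair of bit indices in 0..5, one for
   the home byte and one for the following byte.  This enumeration
   covers 32 of the 36 possible pairs; as noted in the text, no such
   map can give all six indices exactly equal weight. */
static void pair_indices(unsigned entry5,
                         unsigned *home_idx, unsigned *next_idx)
{
    *home_idx = entry5 / 6;  /* 0..5 */
    *next_idx = entry5 % 6;  /* 0..5 */
}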

5 Validation

We have implemented the schemes described in a modified version of Spin 5.1.7 and use that for all of the experimental results. It can be downloaded from [17].


Our implementation also outputs the expected hash omission numbers based on the formulas given. Timing results were taken on a 64-bit Mac Pro with two 2.8 GHz quad-core Intel Xeon processors and 8GB of main memory.

5.1 Accuracy

In this section we demonstrate the accuracy advantages of our adaptive Cleary+Bloom structure as compared to the standard k = 3 bitstate approach, and we validate the predictive value of our formulas.

Setup. The main accuracy test of Figure 3 has an artificial aspect to it, and we explain that here. A typical protocol is subject to what we call the transitive omission problem [4]. States omitted from a lossy search can be put in two categories: hash omissions, those states that were falsely considered not new by the visited list, and transitive omissions, those states that were never reached because other omissions made them unreachable. Clearly, if there are zero hash omissions, there are zero transitive omissions. But when there are hash omissions, we do not reliably know how many transitive omissions resulted. In well-behaved protocols, there tends to be a linear relationship, such as two transitive omissions per hash omission.

Despite the transitive omission problem—or perhaps because of it—minimizing expected hash omissions is key to maximizing the accuracy of a search. This approach also optimizes two other metrics: the probability of any omissions and the expected coverage. Note that when much smaller than 1, the expected hash omissions approximate the probability of any omissions. However, the probability of any omissions is not good at comparing searches that are expected to be lossy, and coverage is hard to predict in absolute terms in the presence of transitive omissions. Thus, we focus on minimizing hash omissions.

To measure hash omissions and compare those results against our mathematical predictions, we generated a synthetic example that is highly connected and, therefore, should have almost no transitive omissions. The model consists of a number starting at zero, to which we non-deterministically add 1 through 10 until a maximum is reached. Choosing that maximum allows us to manipulate the size of the state space, as we have done to get the results in Figure 3.

[Figure 3: average omissions (log scale) vs. state space size (log scale) for the Adaptive and Bitstate schemes under the ChainTable, Integrated, and CountingBF stack-matching configurations, observed and predicted.]

Fig. 3. This graph compares the predicted and observed accuracy of searches using our adaptive scheme and the standard k = 3 bitstate scheme, using different DFS stack-matching schemes. The model is described in Section 5.1; it exhibits virtually no transitive omissions and allows the state space size to be manipulated. About 1MB total is available for visited and active state matching (see Section 5.1). Observation data points represent the average of 10-50 trials. Not enough trials were performed to observe any omissions from searches expecting full coverage with high probability. To avoid clutter, observations for many configurations are omitted but are analogously close to prediction. Predicted values are based on equations given in this paper.

Adaptive vs. Bitstate. A quick examination of Figure 3 confirms that when memory is not highly constrained (left side), our visited-list scheme ("Adaptive") is more accurate than (below) the standard bitstate approach ("Bitstate"). (For now, only consider the results using a chaining hash table for stack matching, "ChainTable.") For example, at around 200 000 states, our scheme has less than a 1 in 10 000 chance of any omissions while the bitstate scheme expects to have omitted about 10 states. When memory is highly constrained (right side), the two yield similar accuracy.

If we look at the "Adaptive, ChainTable" accuracies in more detail, we can see where the adaptations occur. When the expected omissions are near 10^−14, it is still using 64 bits per cell. When the expected omissions jump to 10^−5, it has changed over to 32 bits per cell. Near one omission, it is using 16 bits per cell. When it moves to 8 bits per cell, its accuracy is similar to the standard k = 3 bitstate approach. The structure converts to the special k = 2 Bloom filter around 10^6 states. The mathematical prediction overestimates a little at first because of the roughness of the approximation for fingerprinting Bloom filters, but it gets closer later. No observations show up for the 64 and 32 bits per cell cases because no omissions were encountered in the tens of trials run for each state space size. At least a million trials would have been required to get good results for 32 bits per cell.

Figure 3 generalizes very easily and simply. These results are for 1 megabyte. To get the results for c megabytes, simply multiply the X- and Y-axis values by c. It is that simple. In other words, the proportion of states that are expected to be hash omissions under these schemes depends only on the ratio between states and memory, not on their magnitude. Unlike many classical structures, these schemes scale perfectly. Also, the Jenkins hash functions [14] used by Spin are good enough that, for all non-cryptographic purposes, the relationships among reachable state descriptors in a model are irrelevant. The Jenkins hashes are effectively random.


Stack Matching. The "ChainTable" results of Figure 3 assume that a chaining hash table is used to match active states and that its memory size is negligible compared to the memory size of the visited list (see memory usage in Figure 4). This is the case for problems whose maximum depth is orders of magnitude smaller than the state space size. Because this approach is faster and usually more compact than what is currently available in Spin (version 5.1.7)—but an application of classical techniques—we consider it state-of-the-art for this case. (More information is on the Web [17].)

The "CountingBF" results assume that a counting Bloom filter [16] occupying half of available memory is used. This is the allocation usually used by Spin's CNTRSTACK method, and we consider it state-of-the-art for unbounded active lists that are dynamically swapped out to disk, as in Spin's SC ("stack cycling") feature. Note that a counting Bloom filter cannot overflow a fixed memory space as a chaining hash table can. The important difference for Figure 3 is that only half as much memory is available to the visited list for the "CountingBF" results as for the "ChainTable" results—to accommodate the large counting Bloom filter.

Thus, the counting Bloom filter approach is clearly detrimental to accuracy if the search stack is small enough to keep in a small area of memory (Figure 3). The (k = 2) counting Bloom filter is also relatively slow, because it always requires two random lookups into a large memory space; thus it is DRAM-intensive (see Figure 4). But if the search stack is large enough to warrant swapping out, the counting Bloom filter is likely to be a better choice (not graphed, but see memory usage in Figure 5).

Using our adaptive Cleary+Bloom structure allows a third option: integrating active state matching with visited state matching ("Integrated"), which is always preferable to the counting Bloom filter. Making room for the counting Bloom filter as a separate structure requires cutting the bits per entry by about half. Making room for one active bit in each cell only takes away one entry bit. The result is a doubling in the expected omissions, which is tiny compared to the impact of cutting cell sizes in half ("Adaptive, Integrated" vs. "Adaptive, CountingBF" in Figure 3).

5.2 Speed

Figure 4 confirms that the Cleary table is very fast when the structure is big and never gets beyond about half full. Plugging in a 64-bit Cleary table in place of k = 3 bitstate increases speed by about 2.8% when running by itself. Using the Cleary table also for search stack matching increases that to 5.5%; against standard bitstate with the counting Bloom filter, the difference grows to 13.5% in favor of the integrated Cleary structure. The Cleary table with integrated matching of active states is also the least affected by running in parallel with other model checker instances. Running six instances simultaneously slows each by about 9%, but running six instances of Spin's bitstate implementation slows them by about 20%.

This can easily be explained by the fact that the Cleary table with integrated stack matching only needs one random access to main memory to check/add a new state and/or check/add it to the search stack. The k = 3 bitstate method with the counting Bloom filter stack requires five random accesses.

[Figure 4: new states per second per process vs. number of processes (1-6) running in parallel, for Cleary/Integrated, Cleary/ChainTable, Bitstate/ChainTable, and Bitstate/CountingBF.]

Fig. 4. This plots the verification speed, in new states per second per process, of various visited and stack matching methods for one to six instances running in parallel on a multicore system. In this test, Cleary tables never became full enough to trigger adaptation algorithms. The model is PFTP with LOSS=1,DUPS=1,QSZ=5. State-vector is 168 bytes. Visited state storage is 1024MB, except for CountingBF, which is given 512MB each for the visited set and the counting Bloom filter stack. A depth limit of 1.5 million was beyond the maximum depth and required a relatively small amount of additional storage. 67.8M states results in about 127 bits per state.

Figure 5 shows how the user can optimize more for speed or more for accuracy by setting the maximum occupancy before adaptation. The omissions (not graphed) decrease with higher maximum occupancy, most dramatically between 60% and 65% (in this case) because 60% and lower ended with a smaller cell size. The omissions for 60% were about twenty times higher than for 65%, a much larger difference than between 50% and 60% or 70% and 90%. Adaptation itself does not cause omissions, but after adaptation, the significantly higher rate of omission causes the number of omissions to jump, as in Figure 3. Typically, a lower maximum occupancy means faster operation, but 60% was actually slower than 65% because the 60% run doubled its number of cells right before it finished. It should be possible to avoid such scenarios with heuristics that predict how close to completion the process is, allowing higher occupancy near completion. Nevertheless, even after doubling its number of cells several times, our adaptive storage scheme is faster than standard bitstate in some cases.

[Figure 5: new states per second (overall) vs. maximum allowed occupancy (50-100%) before each adaptation, for Adaptive/Integrated(SC), Adaptive/ChainTable, Bitstate/ChainTable, and Bitstate/CountingBF(SC).]

Fig. 5. This plots the verification speed, in new states per second overall, of our adaptive Cleary tables, allowed to fill to various occupancies before each doubling of the number of cells. The straight lines show k = 3 bitstate performance with the specified stack-matching scheme. The visited set is given 512MB, except for CountingBF(SC), which splits that equally between visited and active state matching. "Stack cycling" (SC) was used when supported, to reduce memory requirements. Otherwise, a depth limit of 3 million was needed to avoid truncation. The model is PFTP with LOSS=1,DUPS=1,QSZ=6. State-vector is 192 bytes. 170M states results in about 25 bits per state. All Cleary tables converted from 64 to 32 and 32 to 16 bits per cell. Those limiting to less than 65% occupancy also converted from 16 to 8 and had at least an order of magnitude more omissions. These results are for a single process running alone; results with four processes are slightly steeper and also cross the bottom line at around 90%.

Another experiment (not graphed) exhibits how little time is needed for adaptation. We ran a 370 million state instance of PFTP(1,1,7) using 320 megabyte instances of our adaptive structure, causing adaptation all the way down to a k = 2 Bloom filter. Adaptation operations required 1.3-2.0% of the running time so far, with the total time spent on adaptation never exceeding 3.3% of the running time. This includes maximum allowed occupancies of 75% and 90%, with and without integrated active state matching, and 1 to 4 processes running simultaneously. Despite all the adaptations, "Adaptive, Integrated(SC)" was 7-18% faster and explored more states than "Bitstate, CountingBF(SC)" on the same problem given the same amount of memory. (Non-power-of-2 memory size and significant time spent as a localized k = 2 Bloom filter conferred advantages to our adaptive structure not seen in previous results.)

In other experiments not shown, we have noticed that the Cleary structure gains speed relative to competing approaches as the memory dedicated to the structure grows very large. We suspect that this relates to higher latency per main-memory access because of more TLB cache misses in accessing huge structures. If one compares the approaches using relatively small amounts of memory, findings are likely to be skewed against the Cleary table.

The biggest sensitivity to the particular model used is in how long it takes to compute successors and their hashes, which is closely tied to the state-vector size. More computation there will tend to hide any differences in time required by different state storage techniques. Less will tend to inflate differences. There is little way for a different model to result in different speed rankings.


The PFTP model used here for timings has a state-vector size (about 200 bytes) that is on the low end of what might be encountered in large problems, such as those listed in Tables V and VI of [18].

6 Conclusion and Future Work

We have described a novel scheme for state storage that we believe has a future in explicit-state model checkers such as Spin. It has the flexibility to provide high-assurance verification when memory is not highly constrained, and good coverage when memory is highly constrained. In that sense, it is an "all-purpose" structure that requires no tuning to make good use of available memory.

In many cases, our scheme is noticeably faster than Spin's standard bitstate scheme. We believe this is due to its favorable access pattern to main memory: only one random look-up per operation. For example, when multiple processor cores are contending for main memory, our scheme is consistently faster. When supporting unbounded search stack/queue sizes, our scheme is consistently faster. Otherwise, the standard bitstate scheme is only a little faster once the number of states reaches about 1/100th the number of memory bits. At that point, bitstate has already omitted states while our scheme can visit twice that many with no omissions. Cleary's compact hash tables and our fast, in-place adaptation algorithm offer speed, accuracy, compactness, and dynamic flexibility that previous schemes fall well short of in at least one category.

We plan to extend this technique further. It should be possible to start a search storing full states and then adapt quickly in place to storing just hash values. This would provide the psychological benefit of exact storage for as long as possible. It should also be possible to make an intermediate adaptation step in between splitting cells in half. This is much trickier, but would make even better use of available memory. In fact, we hope to demonstrate how the scheme is "never far from optimal."

References

1. Holzmann, G.J.: The Spin Model Checker: Primer and Reference Manual. Addison-Wesley, Boston (2003)
2. Holzmann, G.J.: An analysis of bitstate hashing. In: Proc. 15th Int. Conf. on Protocol Specification, Testing, and Verification, INWG/IFIP, Warsaw, Poland, pp. 301–314. Chapman & Hall, Boca Raton (1995)
3. Bloom, B.H.: Space/time trade-offs in hash coding with allowable errors. Communications of the ACM 13(7), 422–426 (1970)
4. Dillinger, P.C., Manolios, P.: Bloom filters in probabilistic verification. In: Hu, A.J., Martin, A.K. (eds.) FMCAD 2004. LNCS, vol. 3312, pp. 367–381. Springer, Heidelberg (2004)
5. Stern, U., Dill, D.L.: A new scheme for memory-efficient probabilistic verification. In: IFIP TC6/WG6.1 Joint Int'l. Conference on Formal Description Techniques for Distributed Systems and Communication Protocols, and Protocol Specification, Testing, and Verification, pp. 333–348 (1996)


6. Dillinger, P.C., Manolios, P.: Enhanced probabilistic verification with 3Spin and 3Murphi. In: Godefroid, P. (ed.) SPIN 2005. LNCS, vol. 3639, pp. 272–276. Springer, Heidelberg (2005)
7. Cleary, J.G.: Compact hash tables using bidirectional linear probing. IEEE Trans. Computers 33(9), 828–834 (1984)
8. Pagh, A., Pagh, R., Rao, S.S.: An optimal Bloom filter replacement. In: Proceedings of the 16th ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 823–829. SIAM, Philadelphia (2005)
9. Wolper, P., Leroy, D.: Reliable hashing without collision detection. In: 5th International Conference on Computer Aided Verification, pp. 59–70 (1993)
10. Holzmann, G.J., Peled, D.: Partial order reduction of the state space. In: First SPIN Workshop, Montréal, Quebec (1995)
11. Bosnacki, D., Holzmann, G.J.: Improving Spin's partial-order reduction for breadth-first search. In: Godefroid, P. (ed.) SPIN 2005. LNCS, vol. 3639, pp. 91–105. Springer, Heidelberg (2005)
12. Carter, L., Floyd, R., Gill, J., Markowsky, G., Wegman, M.: Exact and approximate membership testers. In: Proceedings of the 10th ACM Symposium on Theory of Computing (STOC), pp. 59–65. ACM, New York (1978)
13. Pagh, A., Pagh, R., Ruzic, M.: Linear probing with constant independence. In: Proceedings of the 39th ACM Symposium on Theory of Computing (STOC), New York, NY, USA, pp. 318–327. ACM, New York (2007)
14. Jenkins, B.: http://burtleburtle.net/bob/hash/index.html (2007)
15. Clarke, E.M., Grumberg, O., Peled, D.: Model Checking. MIT Press, Cambridge (1999)
16. Fan, L., Cao, P., Almeida, J., Broder, A.Z.: Summary cache: a scalable wide-area Web cache sharing protocol. IEEE/ACM Transactions on Networking 8(3), 281–293 (2000)
17. Dillinger, P.C., Manolios, P.: 3Spin home page, http://3spin.peterd.org/
18. Holzmann, G.J., Bosnacki, D.: The design of a multicore extension of the Spin model checker. IEEE Trans. Softw. Eng. 33(10), 659–674 (2007)

Efficient Probabilistic Model Checking on General Purpose Graphics Processors

Dragan Bošnački^1, Stefan Edelkamp^2, and Damian Sulewski^3

^1 Eindhoven University of Technology, The Netherlands
^2 TZI, Universität Bremen, Germany
^3 Technische Universität Dortmund, Germany

Abstract. We present algorithms for parallel probabilistic model checking on general purpose graphics processing units (GPGPUs). For this purpose we exploit the fact that some of the basic algorithms for probabilistic model checking rely on matrix-vector multiplication. Since such linear algebraic operations are implemented very efficiently on GPGPUs, the new parallel algorithms can achieve considerable runtime improvements compared to their counterparts on standard architectures. We implemented our parallel algorithms on top of the probabilistic model checker PRISM. The prototype implementation was evaluated on several case studies in which we observed significant speedup over the standard CPU implementation of the tool.

1 Introduction

Probabilistic Model Checking. Traditional model checking deals with the notion of absolute correctness or failure of a given property. On the other hand, probabilistic^1 model checking is motivated by the fact that probabilities are often an unavoidable ingredient of the systems we analyze. Therefore, the satisfaction of properties is quantified with some probability. This makes probabilistic model checking a powerful framework for modeling various systems, ranging from randomized algorithms via performance analysis to biological networks.

From an algorithmic point of view, probabilistic model checking overlaps with the conventional one, since it too requires computing reachability of the underlying transition systems. Still, there are also important differences, because numerical methods are used to compute the transition probabilities. It is those numerical components that we target in this paper; we show how they can be sped up by employing the power of the new graphics processor technology.

^1 In the literature, probabilistic and stochastic model checking are often used interchangeably. Usually a clearer distinction is made by relating the adjectives probabilistic and stochastic to the underlying model: discrete- and continuous-time Markov chains, respectively. For the sake of simplicity, in this paper our focus is on discrete-time Markov chains, so we opted for consistently using the qualification "probabilistic". Nevertheless, as we also emphasize in the paper, the concepts and algorithms that we present here can be applied as well to continuous-time Markov chains.


Parallel Model Checking. According to [31], less than two years ago a clear majority of the 500 most powerful computers in the world (www.top500.org) were characterized as clusters of computers/processors that work in parallel. Unfortunately, this has not had a major impact on the popularity of parallel computing, either in industry or in academia. With the emergence of the new parallel hardware technologies, like multi-core processors and general purpose graphics processing units, this situation is changing drastically. This "parallelism for the masses" certainly offers great opportunities for model checking. Yet, ironically enough, model checking, which was mainly introduced for the verification of highly parallel systems, has in the past mostly relied on sequential algorithms.

Parallel model checking algorithms have been designed before (e.g., [33,30,8]) and, with few exceptions [27,26], all of them targeted clusters of CPUs. However, this did not have any major impact in practice: besides some recent case studies on a big cluster (DAS-3) [4], none of the widely used model checking tools has a cluster version that preserves its full range of capabilities. In the last several years, things have started to change. In [24,25] the concept of multi-core model checking was introduced, followed by [5]. In the context of large-scale verification, different disk-based algorithms for solving the model checking problem have been published [16,9,7]. In [16], the authors avoid nested depth-first search for accepting-cycle detection by reducing the liveness problem to a safety problem. This I/O-efficient solution was further improved by running directed search and exploiting parallelism. Another disk-based algorithm for LTL model checking [7] avoids the increase in space, but does not operate on-the-fly. The algorithm given in [9] is both on-the-fly and linear in its space requirements wrt. the size of the state space, but its worst-case time complexity is large. Other advances in large-scale LTL model checking exploit Flash media [18,19].

GPGPU Programming. In recent years, (general purpose) graphics processing units ((GP)GPUs) have become powerful massively parallel systems, and they have outgrown their initial application niches in accelerating computer graphics. This has been facilitated by the introduction of new application programming interfaces (APIs) for general computation on GPUs, like CUDA from NVIDIA, Stream SDK from AMD, and OpenCL. Applications that exploit GPUs in different domains, like fluid dynamics, protein folding prediction in bioinformatics, Fast Fourier Transforms, and many others, have been developed in the last several years [32].

In model checking, however, GPUs have not had any impact. To the best of our knowledge, the only attempt to use GPUs in model checking was by the authors of this paper [15]. They improved large-scale disk-based model checking by shifting complex numerical operations to the graphics card. As delayed elimination of duplicates is the performance bottleneck, the authors performed parallel processing on the GPU to improve the sorting speed significantly. Since existing GPU sorting solutions like Bitonic Sort and Quicksort do not achieve any speed-up on state vectors, they propose a refined GPU-based Bucket Sort algorithm. Additionally, they study sorting a compressed state vector and obtain speed-ups for delayed duplicate detection of more than one order of magnitude with an 8800-GTX GPU.


Contribution. Traditionally, the main bottleneck in practical applications of model checking has been the infamous state space explosion [35] and, as a direct consequence, large requirements in time and space. With the emergence of the new 64-bit processors, there is no practical limit to the amount of shared memory that can be addressed. As a result, the goals shift towards improving the runtime of the model checking algorithms [25]. In this paper we show that significant runtime gains can be achieved by exploiting the power of GPUs in probabilistic model checking. This is because basic algorithms for probabilistic model checking are based on matrix-vector multiplication. These operations lend themselves to very efficient implementation on GPUs. Because of the massive parallelism – a standard commercial video card comprises hundreds of fragment processors – impressive speedups over the sequential counterparts of the algorithms are common.

We present an algorithm that is a parallel adaptation of Jacobi's method for the matrix-vector product. Jacobi was chosen over other methods that usually outperform it on sequential platforms because of its lower memory requirements and its greater potential for parallelization due to fewer data dependencies. The algorithm features sparse matrix-vector multiplication. It requires a minimal number of copy operations from RAM to GPU and back. We implemented the algorithm on top of the probabilistic model checker PRISM [28]. The prototype implementation was evaluated on several case studies, and remarkable speedups (up to a factor of 18) were achieved compared to the sequential version of the tool.

Related Work. In [11] a distributed algorithm for model checking of Markov chains is presented. The paper focuses on continuous-time Markov chain models and Computational Stochastic Logic. They too use a parallel version of Jacobi's method, which is different from the one presented in this paper. This is reflected in the different memory management (the GPU's hierarchical shared memory model vs. the distributed memory model) and in the fact that their algorithm stores part of the state space on external memory (disks). Also, [11] is much more oriented towards increasing the state spaces of the stochastic models than towards improving algorithm runtimes, which is our main goal. Maximizing the state space sizes of stochastic models by joining the storage of the individual workstations of a cluster is also the goal pursued in [12]. A significant part of that paper is on implicit representations of the state spaces, with the conclusion that, although they can further increase the state space sizes, the runtime remains a bottleneck because of the lack of efficient solutions for the numerical operations.

In [1] a shared memory algorithm is introduced for CTMC construction and numerical steady-state solution. The CTMCs are constructed from generalized stochastic Petri nets. The algorithm for computing the steady-state probability distribution is an iterative one. Compared to this work, our algorithm is more general, as it can also be used to compute transient probabilities of CTMCs. Another shared memory approach is described in [6]. It targets Markov decision processes, which we do not consider in this paper. As such it differs from our work significantly, since the quantitative numerical component of the algorithm reduces to solving systems of linear inequalities, i.e., using linear program solvers. In contrast, large-scale solutions support multiple scans over the search space on disks [17,13].

Layout. The paper is structured as follows. Section 2 briefly introduces probabilistic model checking. Section 3 describes the architecture, execution model, and some challenges of GPU programming. Section 4 presents the algorithm for matrix-vector multiplication as used in the Jacobi iteration method and its port to the GPU. Section 5 evaluates our approach by verifying examples shipped with the PRISM source, showing significant speedups compared to the current CPU solution. The last section concludes the paper and discusses the results.

2 Probabilistic Model Checking

In this section we briefly recall, along the lines of [29], the basics of probabilistic model checking for discrete-time Markov chains (DTMCs). More details can be found in, e.g., [29,2].

Discrete Time Markov Chains. Given a fixed finite set of atomic propositions AP we define a DTMC as follows:

Definition 1. A (labeled) DTMC D is a tuple (S, ŝ, P, L) where

– S is a finite set of states;
– ŝ ∈ S is the initial state;
– P : S × S → [0, 1] is the transition probability matrix, where Σ_{s′∈S} P(s, s′) = 1 for all s ∈ S;
– L : S → 2^AP is a labeling function which assigns to each state s ∈ S the set L(s) of atomic propositions that are valid in the state.

Each P(s, s′) gives the probability of a transition from s to s′. For each state, the sum of the probabilities of the outgoing transitions must be 1. Thus, end states, i.e., states which would normally not have outgoing transitions, are modeled by adding self-loops with probability 1.

Probabilistic Computational Tree Logic. Properties of DTMCs can be specified using Probabilistic Computation Tree Logic (PCTL) [20], which is a probabilistic extension of CTL.

Definition 2. PCTL has the following syntax:

Φ ::= true | a | ¬Φ | Φ ∧ Φ | P∼p [φ]
φ ::= X Φ | Φ U^{≤k} Φ

where a ∈ AP, ∼ ∈ {<, ≤, ≥, >}, p ∈ [0, 1], and k ∈ N ∪ {∞}.

For the sake of presentation, in the above definition we use both state formulae Φ and path formulae φ, which are interpreted on states and paths, respectively, of a given DTMC D. However, the properties are specified exclusively as state formulae. Path formulae have only an auxiliary role; they occur as a parameter in state formulae of the form P∼p [φ]. Intuitively, P∼p [φ] is satisfied in some state s of D if the probability of choosing a path that begins in s and satisfies φ is within the range given by ∼ p. To formally define the satisfaction of the path formulae one defines a probability measure, which is beyond the scope of this paper. (For example, see [29] for more detailed information.) Informally, this measure captures the probability of taking a given finite path in the DTMC, which is calculated as the product of the probabilities of the individual transitions of this path.

The intuitive meaning of the path operators is analogous to the ones in standard temporal logics. The formula X Φ is true if Φ is satisfied in the next state of the path. The bounded until formula Φ U^{≤k} Ψ is satisfied if Ψ is satisfied in one of the next k steps and Φ holds until this happens. For k = ∞ one obtains the unbounded until; in this case we omit the superscript and write Φ U Ψ. The interpretation of unbounded until is the standard strong until.

Algorithms for Model Checking PCTL. Given a labeled DTMC D = (S, ŝ, P, L) and a PCTL formula Φ, we are usually interested in whether the initial state of D satisfies Φ. Nevertheless, the algorithm works by checking the satisfaction of Φ for each state in S. The output of the algorithm is Sat(Φ), the set of all states that satisfy Φ. The algorithm starts by first constructing the parse tree of the PCTL formula Φ. The root of the tree is labeled with Φ and each other node is labeled by a subformula of Φ. The leaves are labeled with true or an atomic proposition. Starting with the leaves, in a recursive bottom-up manner, for each node n of the tree the set of states is computed that satisfies the subformula that labels n. When we arrive at the root we can determine Sat(Φ).

Except for the path formulae, the model checking of PCTL formulae is actually the same as for their counterparts in CTL, and as such quite straightforward to implement. In this paper we concentrate on the path formulae. They are the most computationally demanding part of the model checking algorithm, and as such they are the targets of our improvement via GPU algorithms. To give a general flavor of the path formulae, we briefly consider the algorithm for formulae of the form P∼p [Φ U^{≤k} Ψ] with k = ∞. This case boils down to finding the least solution of the linear equation system

W(s, \Phi \, U \, \Psi) =
\begin{cases}
1 & \text{if } s \in Sat(\Psi) \\
0 & \text{if } s \in Sat(\neg\Phi \wedge \neg\Psi) \\
\sum_{s' \in S} P(s, s') \cdot W(s', \Phi \, U \, \Psi) & \text{otherwise}
\end{cases}

where W(Φ U Ψ) is the resulting vector of probabilities, indexed by the states in S. The states in which the formula is satisfied with probabilities 1 and 0 are singled out. For each other state, the probabilities are computed via the corresponding probabilities of the neighboring states. Before solving the system, the algorithm employs some optimizations by precomputing the states that satisfy the formula with probability 0 or 1. The (simplified) system of linear equations can be solved using iterative methods that comprise matrix-vector multiplication. One such method is Jacobi's, which is also one of the methods that PRISM uses and which we describe in more detail in Section 4. We chose Jacobi's method over methods that usually perform better on sequential architectures because Jacobi has certain advantages in the parallel programming context. For instance, it has lower memory consumption than the Krylov subspace methods and fewer data dependencies than the Gauss-Seidel method, which makes it easier to parallelize [11]. The algorithms for the next operator and bounded until boil down to a single matrix-vector product and a sequence of such products, respectively. Therefore they too can be handled using Jacobi's method.

PCTL can be extended with various rewards (costs) operators, which we do not give here. The algorithms for those operators can also be reduced to matrix-vector multiplication [29].

Model checking a PCTL formula Φ on a DTMC D is linear in |Φ|, the size of the formula, and polynomial in |S|, the number of states of the DTMC. The most expensive operators are unbounded until and the rewards operators, which too boil down to solving systems of linear equations of size at most |S|. The complexity is also linear in k_max, the maximal value of the bounds k in the bounded until formulae (which also occur in some of the costs operators). However, usually this value is much smaller than |S|. Thus, the main runtime bottleneck of the probabilistic model checking algorithms remains the linear algebraic operations. Their share of the total runtime of the algorithms increases with |S|. So, for real-world problems, which tend to have large state spaces, this dependence is even more critical. In the sequel we show how, by using parallel versions of the algorithms on a GPU, one can obtain substantial speedups of more than one order of magnitude compared to the original sequential algorithms.

Beyond Discrete Time Markov Chains. The matrix-vector product is also at the core of model checking continuous-time Markov chains, i.e., the corresponding Computational Stochastic Logic (CSL) [29,3,11]. For instance, the next operator of CSL can be checked in the same way as its PCTL counterpart. The algorithms for both steady-state and transient probabilities boil down to matrix-vector multiplication. Various extensions of CSL with costs also hinge on this operation. Thus, the parallel version of the Jacobi algorithm that we present in the sequel can be used for stochastic models as well, i.e., models based on CTMCs.
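For concreteness, here is a small C sketch (ours, not PRISM's code) of the sparse matrix-vector product at the heart of one Jacobi-style sweep for the system above, with the transition matrix in compressed-row (CSR) form; all array names are assumptions of the sketch.

/* One sweep W' = P * W over a CSR matrix (row_start, col, val).
   States whose probability is already fixed at 0 or 1 by the
   precomputation keep their values. */
void jacobi_sweep(int nstates, const int *row_start, const int *col,
                  const double *val, const char *fixed,
                  const double *w, double *w_new)
{
    for (int s = 0; s < nstates; s++) {
        if (fixed[s]) { w_new[s] = w[s]; continue; }
        double sum = 0.0;
        for (int k = row_start[s]; k < row_start[s + 1]; k++)
            sum += val[k] * w[col[k]];
        w_new[s] = sum;
    }
}

A driver would repeat this sweep, swapping w and w_new, until the maximum change falls below a convergence threshold.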

3 GPU Programming

A considerable part of the challenges that arise in model checking algorithms for GPUs is due to the specific architectural differences between GPUs and CPUs and the restrictions on the programs that can run on GPUs. Therefore, before describing our approach in more detail, we give an overview of the GPU architecture and the Compute Unified Device Architecture (CUDA) programming model by the manufacturer NVIDIA [14] along the lines of [10].


Modern GPUs are no longer dedicated only to graphics applications. Instead, a GPU can be seen as a general purpose multi-threaded, massively data parallel co-processor. Harnessing the power of GPUs is facilitated by the new APIs for general computation on GPUs. CUDA is an interface by NVIDIA which is used to program GPUs. CUDA programs are basically extended C programs. To this end CUDA features extensions like: special declarations to explicitly place variables in some of the memories (e.g., shared, global, local), predefined keywords (variables) containing the block and thread IDs, synchronization statements for cooperation between threads, a runtime API for memory management (allocation, deallocation), and statements to launch functions on the GPU.

CUDA Programming Model. A CUDA program consists of a host program which runs on the CPU and a set of CUDA kernels. The kernels, which are the parallel parts of the program, are launched on the GPU device from the host program, which comprises the sequential parts. A CUDA kernel is executed on a set of threads, each of which executes the same code. Threads of a kernel are grouped in blocks that form a grid. Each thread block of the grid is uniquely identified by its block ID, and analogously each thread is uniquely identified by its thread ID within its block. The dimensions of the threads and the thread blocks are specified at the time of launching the kernel. The grid can be one- or two-dimensional and the blocks are at most three-dimensional.

CUDA Memory Model. Threads have access to different kinds of memories. Each thread has its own on-chip registers and off-chip local memory, which is quite slow. Threads within a block cooperate via shared memory, which is on-chip and very fast. If more than one block is executed in parallel, the shared memory is split equally between them. All blocks have access to the device memory, which is large (up to 4 GB) but slow since, like the local memory, it is not cached. The host has read and write access to the global memory (video RAM, or VRAM), but cannot access the other memories (registers, local, shared). Thus the global memory is used for communication between the host and the kernel. Besides memory communication, threads within a block can cooperate via light-weight synchronization barriers.

GPU Architecture. The architecture of a GPU features a set of multiprocessor units called streaming multiprocessors (SMs). Each of those contains a set of processor cores called streaming processors (SPs). The NVIDIA GeForce GTX 280 has 30 SMs, each consisting of 8 SPs, which gives 240 SPs in total.

CUDA Execution Model. Each block is mapped to one multiprocessor, whereas each multiprocessor can execute several blocks. The logical kernel architecture gives the GPU the flexibility to schedule the blocks of the kernel depending on the concrete hardware architecture, in an optimal and, for the user, completely transparent way. Each multiprocessor performs computations in SIMT (Single Instruction


Multiple Threads) manner, which means that each thread is executed independently with its own instruction address and local state (registers and local memory). Threads are executed by the SPs, and thread blocks are executed on the SMs. Each block is assigned to the same processor throughout the execution, i.e., it does not migrate. The number of blocks that can be physically executed in parallel on the same multiprocessor is limited by the number of registers and the amount of shared memory. Only one kernel at a time is executed per GPU.

GPU Programming Challenges. To fully exploit the computational power of GPUs, some significant challenges have to be addressed. The main performance bottleneck is usually the relatively slow communication (compared to the enormous peak computational power) with the off-chip device memory. To fully exploit the capacity of GPU parallelism, this memory latency must be minimized. Further, it is recommended to avoid synchronization between thread blocks: inter-thread communication within a block is cheap via the fast shared memory, but accesses to the global and local memories are more than a hundred times slower. Another way to maximize parallelism is to optimize the thread mapping. Unlike CPU threads, GPU threads are very light-weight, with negligible creation and switching overhead. This allows GPUs to use thousands of threads, whereas multi-core CPUs use only a few. Usually more threads and blocks are created than there are SPs and SMs, respectively, which allows the GPU to use its capacity maximally via smart scheduling: while some threads/blocks are waiting for data, others which have their data ready are assigned for execution. Thread mapping is coupled with memory optimization in the sense that threads that access physically close memory locations should be grouped together.
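To make these conventions concrete, the following minimal CUDA sketch is ours and not from the paper; the kernel, the block size of 256, and all names are illustrative assumptions. It shows a grid/block/thread launch, the global-index computation, and a host-device round trip through global memory.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

/* Illustrative kernel: each thread scales one vector element. */
__global__ void scale(float *v, float a, int n) {
    /* Global index from block ID and thread ID. */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)        /* guard: the last block may contain surplus threads */
        v[i] *= a;    /* neighboring threads touch neighboring addresses,
                         so the accesses coalesce */
}

int main(void) {
    const int n = 1 << 20;
    float *host = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) host[i] = 1.0f;

    float *dev;
    cudaMalloc(&dev, n * sizeof(float));   /* global (device) memory */
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    int blockSize = 256;                            /* threads per block */
    int blocks = (n + blockSize - 1) / blockSize;   /* blocks in the grid */
    scale<<<blocks, blockSize>>>(dev, 2.0f, n);     /* kernel launch */

    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("%f\n", host[0]);                        /* prints 2.000000 */
    cudaFree(dev);
    free(host);
    return 0;
}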

4

Matrix-Vector Multiplication on GPU

To speed up the algorithms we replace the sequential matrix-vector multiplication algorithm with a parallel one, which is adapted to run on the GPU. In this section we describe our parallel algorithms, which are derived from the Jacobi algorithm for matrix-vector multiplication. This algorithm was used for both bounded and unbounded until, i.e., also for solving systems of linear equations.

Jacobi Iterations. As mentioned in Section 2, for model checking DTMCs the Jacobi iteration method is one option to solve the set of linear equations we have derived for until (U). Each iteration in the Jacobi algorithm involves a matrix-vector multiplication. Let n be the size of the state space, which determines the dimension n × n of the matrix to be iterated. The formula of Jacobi for solving Ax = b iteratively for an n × n matrix A = (a_{ij})_{0≤i,j≤n−1} and a current vector x^k is

x_i^{k+1} = \frac{1}{a_{ii}} \Big( b_i - \sum_{j \neq i} a_{ij} x_j^k \Big), \quad \text{for } i \in \{0, \dots, n-1\}.


For better readability (and faster implementation), we may extract the diagonal elements and invert them prior to applying the formula. Setting D_i = 1/a_{ii}, i ∈ {0, ..., n−1}, then yields

x_i^{k+1} = D_i \Big( b_i - \sum_{j \neq i} a_{ij} x_j^k \Big), \quad \text{for } i \in \{0, \dots, n-1\}.

The sufficient condition for the Jacobi iteration to converge is that the magnitude of the largest eigenvalue (the spectral radius) of the matrix D^{-1}(A−D) is bounded by 1. Fortunately, the Perron-Frobenius theorem asserts that the largest eigenvalue of a (strictly positive) stochastic matrix is equal to 1 and all other eigenvalues are smaller than 1, so that lim_{k→∞} A^k exists. In the worst case, the number of iterations can be exponential in the size of the state space, but in practice the number of iterations k until convergence according to a termination criterion like max_i |x_i^k − x_i^{k+1}| < ε, for some sufficiently small ε, is often moderate [34].

Sparse Matrix Representation. The size of the matrix is Θ(n²), but for the sparse models that usually appear in practice it can be compressed. Such matrix compaction is a standard technique used for probabilistic model checking, and special structures are used to this end. In the algorithms that we present in the sequel we assume the so-called modified compressed sparse row/column format [11]. We illustrate this format on the sparse transition probability matrix P given below:

row      | 0   | 0   | 0   | 1    | 1    | 2   | 2    | 2    | 3   | 4   | 4
col      | 1   | 2   | 4   | 2    | 3    | 0   | 3    | 4    | 0   | 0   | 2
non-zero | 0.2 | 0.7 | 0.1 | 0.01 | 0.99 | 0.3 | 0.58 | 0.12 | 1.0 | 0.5 | 0.5

The above matrix contains only the non-zero elements of P. The arrays labeled row, col, and non-zero contain the row indices, the column indices, and the values of the non-zero elements, respectively. More formally, for all r of the index range of the arrays, non-zero_r = P(row_r, col_r). Obviously, this is already an optimized format compared to the standard full matrix representation. Still, one can save even more space, as shown in the table below, which is, in fact, the above-mentioned modified compressed sparse row/column format:

rsize    | 3   | 2   | 3   | 1    | 2
col      | 1   | 2   | 4   | 2    | 3    | 0   | 3    | 4    | 0   | 0   | 2
non-zero | 0.2 | 0.7 | 0.1 | 0.01 | 0.99 | 0.3 | 0.58 | 0.12 | 1.0 | 0.5 | 0.5

The difference from the previous representation is only in the top array rsize. Instead of the row indices, this array contains the row sizes, i.e., rsize_i contains


the number of non-zero elements in row i of P. To extract row i of the original matrix P, we take the elements

non-zero_{rstart_i}, non-zero_{rstart_i + 1}, ..., non-zero_{rstart_i + rsize_i − 1},

where rstart_i = \sum_{k=0}^{i-1} rsize_k.
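As a concrete illustration (our own sketch, not the paper's code; all names are ours), the format and the row extraction can be written in C as follows:

#include <stdio.h>

/* Modified compressed sparse row format of the example matrix P. */
static const int    rsize[]    = { 3, 2, 3, 1, 2 };
static const int    col[]      = { 1, 2, 4, 2, 3, 0, 3, 4, 0, 0, 2 };
static const double non_zero[] = { 0.2, 0.7, 0.1, 0.01, 0.99,
                                   0.3, 0.58, 0.12, 1.0, 0.5, 0.5 };

int main(void) {
    int n = 5;
    /* rstart[i] = sum of rsize[0..i-1]; rstart[n] marks the end. */
    int rstart[6] = { 0 };
    for (int i = 1; i <= n; i++)
        rstart[i] = rstart[i - 1] + rsize[i - 1];

    /* Print row 2 of P: entries non_zero[rstart[2] .. rstart[3]-1]. */
    for (int j = rstart[2]; j < rstart[2] + rsize[2]; j++)
        printf("P(2,%d) = %g\n", col[j], non_zero[j]);
    return 0;
}

Running this prints the three non-zero entries of row 2, namely P(2,0) = 0.3, P(2,3) = 0.58 and P(2,4) = 0.12.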

Algorithm Implementation. The pseudo code of the sequential Jacobi algorithm that implements the aforementioned recurrent formula and uses the compression given above is shown in Algorithm 1.

Algorithm 1. Jacobi iteration with row compression, as implemented in PRISM
 1: k := 0
 2: Terminate := false
 3: while (not Terminate and k < max_k) do
 4:   h := 0;
 5:   for all i := 0 ... n−1 do
 6:     d := b_i;
 7:     l := h;
 8:     h := l + rsize_i;
 9:     for all j = l ... h−1 do
10:       d := d − non-zero_j · x^k_{col_j};
11:     d := d · D_i;
12:     x^{k+1}_i := d;
13:   Terminate := true
14:   for all i := 0 ... n−1 do
15:     if |x^{k+1}_i − x^k_i| > ε then
16:       Terminate := false
17:   k := k + 1;

The iterations are repeated until a satisfactory precision is achieved or the maximal number of iterations max_k is exceeded. In lines 6–8, (an element of) vector b is copied into the auxiliary variable d, and the lower and upper bounds for the indices of the elements in array non-zero that correspond to row i are computed. In the inner for loop, the product of row i and the result of the previous iteration, vector x^k, is computed. The new result is recorded in x^{k+1}. Note that, since we are not interested in the intermediate results, only two vectors are needed: one, x, to store x^k, and another, x′, that corresponds to x^{k+1}, the result of the current iteration. After each iteration the contents of x and x′ are swapped, to reflect the status of x′, which becomes the result of the previous iteration. We will use this observation to save space in the parallel implementation of the algorithm given below. In lines 13–16 the termination condition is computed, i.e., it is checked whether sufficient precision has been achieved. We assume that vector x is initialized appropriately before the algorithm is started.
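For cross-reference, a plain C sketch of Algorithm 1 could look as follows; this is our reading of the pseudo code, not PRISM's actual implementation, and it assumes that D already holds the inverted diagonal entries:

#include <math.h>
#include <stdbool.h>

/* One Jacobi solve over the modified compressed sparse row format.
   non_zero/col hold the off-diagonal entries row by row, rsize the
   row lengths, D the inverted diagonal 1/a_ii, b the right-hand side.
   x and x_new must each hold n doubles; returns the iteration count. */
int jacobi(int n, const int *rsize, const int *col, const double *non_zero,
           const double *D, const double *b,
           double *x, double *x_new, double eps, int max_k)
{
    int k = 0;
    bool terminate = false;
    while (!terminate && k < max_k) {
        int h = 0;
        for (int i = 0; i < n; i++) {
            double d = b[i];
            int l = h;
            h = l + rsize[i];
            for (int j = l; j < h; j++)
                d -= non_zero[j] * x[col[j]];   /* row i times x^k */
            x_new[i] = d * D[i];
        }
        terminate = true;
        for (int i = 0; i < n; i++)
            if (fabs(x_new[i] - x[i]) > eps) { terminate = false; break; }
        /* Swap the vectors (pointers only), as in the paper. Note that in
           this sketch the final iterate ends up behind the local pointer x;
           a real implementation would return it or copy it back. */
        double *tmp = x; x = x_new; x_new = tmp;
        k++;
    }
    return k;
}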


Since the iterations themselves have to be performed sequentially, the matrix-vector multiplication is the part to be distributed. As a feature of the algorithm (the one that contributed most to the speedup), the comparison of the two solution vectors, x and x′ in this case, is also done in parallel. The GPU version of the Jacobi algorithm is given in Algorithms 2 and 3.

Algorithm 2. JacobiCPU: CPU part of the Jacobi iteration, for unbounded until computation
 1: allocate global memory for x′
 2: allocate global memory for col, non-zero, b, x, ε, n and copy them
 3: allocate global memory for TerminateGPU to be shared between blocks
 4: rstart_0 := 0;
 5: for i = 1 ... |rsize| + 1 do
 6:   rstart_i := rstart_{i−1} + rsize_{i−1};
 7: allocate global memory for rstartGPU and copy rstart to rstartGPU
 8: k := 0
 9: Terminate := false
10: while (not Terminate and k < max_k) do
11:   JacobiKernel<<<n/BlockSize + 1, BlockSize>>>();
12:   copy TerminateGPU to Terminate;
13:   Swap(x, x′)
14:   k := k + 1;
15: copy x′ to RAM;

Algorithm 2, running on the CPU, copies the vectors non-zero and col from the matrix representation, together with the vectors x and b and the constants ε and n, to the global memory (VRAM), and allocates space for the vector x′. Having done this, space for the Terminate variable is allocated in the VRAM. Variable rstart defines the starting point of a row in the matrix array. The conversion from rsize to rstart is needed to let each thread find the starting point of a row immediately. (In fact, we implicitly use a new matrix representation where rsize is replaced with rstart.) Array rstart is copied to the global memory variable rstartGPU.

To specify the number of blocks and the size of a block, CUDA supports additional launch parameters in front of the kernel arguments (the <<<...>>> syntax). Here the grid is defined with n/BlockSize + 1 blocks² and a fixed BlockSize. After the multiplication and comparison step on the GPU, the Terminate variable is copied back and checked. This copy statement also serves as a synchronization barrier, since the CPU program waits until all threads of the GPU kernel have terminated before copying the variable from the GPU global memory. If another iteration is needed, x and x′ are swapped³. After all iterations the result is copied back from global memory to RAM.

² If BlockSize is a divisor of n, threads in the last block execute only the first line of the kernel.
³ Since C operates on pointers, only these are swapped in this step.
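In CUDA C, the host part could be sketched as below (ours, not the authors' code; error handling is omitted and all names are assumptions). Unlike Algorithm 3, the termination flag is reset on the host before each launch rather than by the first thread; the matching kernel sketch follows Algorithm 3 below.

/* Host part of the Jacobi iteration: allocate device memory, copy the
   matrix once, then alternate kernel launches with a termination check. */
__global__ void jacobi_kernel(int n, const int *rstart, const int *col,
                              const double *nz, const double *D,
                              const double *b, const double *x,
                              double *x_new, double eps, bool *term);

void jacobi_gpu(int n, int nnz, const int *rstart, const int *col,
                const double *nz, const double *D, const double *b,
                double *x_host, double eps, int max_k)
{
    int *d_rstart, *d_col; double *d_nz, *d_D, *d_b, *d_x, *d_xn; bool *d_term;
    cudaMalloc(&d_rstart, (n + 1) * sizeof(int));
    cudaMalloc(&d_col, nnz * sizeof(int));
    cudaMalloc(&d_nz, nnz * sizeof(double));
    cudaMalloc(&d_D, n * sizeof(double));
    cudaMalloc(&d_b, n * sizeof(double));
    cudaMalloc(&d_x, n * sizeof(double));
    cudaMalloc(&d_xn, n * sizeof(double));
    cudaMalloc(&d_term, sizeof(bool));
    cudaMemcpy(d_rstart, rstart, (n + 1) * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_col, col, nnz * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_nz, nz, nnz * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_D, D, n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_x, x_host, n * sizeof(double), cudaMemcpyHostToDevice);

    const int BlockSize = 256;
    bool term = false;
    for (int k = 0; !term && k < max_k; k++) {
        cudaMemset(d_term, 1, sizeof(bool));          /* optimistic: true  */
        jacobi_kernel<<<n / BlockSize + 1, BlockSize>>>(n, d_rstart, d_col,
            d_nz, d_D, d_b, d_x, d_xn, eps, d_term);
        /* The copy below also acts as a synchronization barrier. */
        cudaMemcpy(&term, d_term, sizeof(bool), cudaMemcpyDeviceToHost);
        double *tmp = d_x; d_x = d_xn; d_xn = tmp;    /* swap pointers only */
    }
    cudaMemcpy(x_host, d_x, n * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_rstart); cudaFree(d_col); cudaFree(d_nz); cudaFree(d_D);
    cudaFree(d_b); cudaFree(d_x); cudaFree(d_xn); cudaFree(d_term);
}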


Algorithm 3. JacobiKernel: Jacobi iteration with row compression on the GPU
 1: i := BlockId · BlockSize + ThreadId;
 2: if (i = 0) then
 3:   TerminateGPU := true;
 4: if (i < n) then
 5:   d := b_i;
 6:   l := rstartGPU_i;
 7:   h := rstartGPU_{i+1} − 1;
 8:   for all j = l ... h do
 9:     d := d − non-zero_j · x_{col_j};
10:   d := d · D_i;
11:   x′_i := d;
12:   if |x_i − x′_i| > ε then
13:     TerminateGPU := false

JacobiKernel, shown in Algorithm 3, is the so-called kernel that operates on the GPU. The local variables d, l, h, i and j are located in registers and are not shared between threads. The other variables reside in the global memory. The result is first computed in d (locally in each thread) and then written to the global memory (line 11). This approach minimizes the threads' accesses to the global memory.

At invocation time, each thread computes the row i of the matrix that it will handle. This is feasible because each thread knows its ThreadId and the BlockId of its block. Note that the size of the block (BlockSize) is also available to each thread. Based on the value of i, only one thread (the first one in the first block) sets the variable TerminateGPU to true. Recall that this variable is located in the global memory and is shared between all threads in all blocks. Now, each thread reads three values from the global memory (lines 5 to 7); here we profit from coalescing done by the GPU memory controller, which is able to detect neighboring VRAM accesses and combine them. This means that if thread i accesses 2 bytes at b_i and thread i + 1 accesses 2 bytes at b_{i+1}, the controller fetches 4 bytes at b_i and divides the data to serve each thread its chunk. In each iteration of the for loop an elementary multiplication is done. Due to the compressed matrix representation, a doubly indirect access is needed here. As in the original algorithm, the result is multiplied with the diagonal value D_i and stored in the new solution vector x′. Finally, each thread checks whether another iteration is needed and consequently sets the variable TerminateGPU to false. Concurrent writes are resolved by the GPU memory controller.

The implementation in Algorithm 2 matches the one for bounded until (U≤k), except that bounded until has a fixed upper bound on the number of iterations, while for until a termination criterion applies.
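A CUDA rendering of this kernel, fitting the host loop sketched above, might look as follows (again ours, not the authors' exact code):

/* One thread per matrix row: compute row i of A·x^k, apply the inverted
   diagonal, and vote on termination via the shared flag in global memory. */
__global__ void jacobi_kernel(int n, const int *rstart, const int *col,
                              const double *nz, const double *D,
                              const double *b, const double *x,
                              double *x_new, double eps, bool *term)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* row of this thread  */
    if (i < n) {
        double d = b[i];
        int l = rstart[i];
        int h = rstart[i + 1];                      /* one past last entry */
        for (int j = l; j < h; j++)
            d -= nz[j] * x[col[j]];                 /* double indirection  */
        d *= D[i];
        x_new[i] = d;                               /* single global write */
        if (fabs(d - x[i]) > eps)
            *term = false;                          /* concurrent writes of
                                                       'false' are harmless */
    }
}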

5

Experiments

Table 1. Results for the herman protocol

instance    |       n | iterations | seq. time | par. time | speedup
herman15.pm |  32,768 |        245 |    22.430 |    21.495 |    1.04
herman17.pm | 131,072 |        308 |   304.108 |   206.174 |    1.48

All experiments were done on a PC with an AMD Athlon(tm) 64 X2 Dual Core Processor 3800+ with 4 GB of RAM; the operating system is 64-bit SUSE 11 with

CUDA 2.1 SDK and the NVIDIA driver version 177.13. This system includes an MSI N280GTX T20G graphics card with 1 GB global memory, plugged into an ExpressPCI slot. The GTX200 chip on this card contains 10 texture processing clusters (TPCs). Each TPC consists of 3 streaming multiprocessors (SMs), and each SM includes 8 streaming processors (SPs) and 1 double precision unit. In total, it has 240 SPs executing the threads in parallel. The maximum block size for this GPU is 512. Given a grid, the TPCs divide the blocks among their SMs, and each SM controls at most 1024 threads, which are run on the 8 SPs.

We verified three protocols, herman, cluster, and tandem, shipped with the source of PRISM. The protocols were chosen for their scalability and the possibility to verify their properties by solving a system of linear equations with the Jacobi method. Different protocols show different speedups achieved by the GPU, because the Jacobi iterations are only a part of the model checking algorithms, while the results show the time for the complete run. In all tables of this section, n denotes the number of rows (columns) of the matrix, "iterations" denotes the number of iterations of the Jacobi method, and "seq. time" and "par. time" denote the runtimes of the standard (sequential) version of PRISM and of our parallel implementation extension of the tool, respectively. All times are given in seconds. The speedup is computed as the quotient of the sequential and parallel runtimes.

The first protocol, called herman, is Herman's self-stabilizing algorithm [22]. The protocol operates synchronously on an oriented ring topology, i.e., the communication is unidirectional. The number in the file name denotes the number of processes in the ring, which must be odd. The underlying model is a DTMC. We verified PCTL property 3 from the property file herman.pctl (R=? [ F "stable" {"k_tokens"}{max} ]). Table 1 shows the results of the verification. Even though the number of iterations is rather small compared to the other models, the GPU achieves a speedup factor of approx. 1.5. Since everything beyond the multiplication of the matrix and vector is done on the CPU, we had not expected a larger speedup. Unfortunately, it is not possible to scale up this model, as the memory consumption becomes too high; the next possible instance (herman19.pm) consumes more than 1 GB.

The second case study, cluster [21], models communication within a cluster of workstations. The system comprises two sub-clusters with N workstations in each of them, connected in a star topology. The switches connecting each sub-cluster are joined by a central backbone. All components can break down, and there is a single repair unit to service all components. The underlying model is a CTMC, and the checked CSL property is property 1 from the corresponding property file (S=? [ "premium" ]). Fig. 1 shows that the GPU performs significantly better; Table 2 contains exact numbers for chosen instances.

Fig. 1. Verification times for several instances of the cluster protocol (plot omitted; y-axes: seconds to complete the model checking process and speedup (CPU time / GPU time); x-axis: the chosen constant N; curves: CPU, GPU, speedup). Speedup is computed as described in the text as the quotient between the runtime of standard PRISM and the runtime of our GPU extension of the tool.

Table 2. Results for the cluster protocol. Parameter N is used to scale the protocol. The global memory usage (denoted as GPU mem) is in MB.

  N  |          n | iterations |  seq. time | par. time | GPU mem | speedup
 122 |    542,676 |      1,077 |     31.469 |     8.855 |      21 |    3.55
 230 |  1,917,300 |      2,724 |    260.440 |    54.817 |      76 |    4.75
 320 |  3,704,340 |      5,107 |    931.515 |   165.179 |     146 |    5.63
 410 |  6,074,580 |     11,488 |  3,339.307 |   445.297 |     240 |    7.49
 446 |  7,185,972 |     18,907 |  6,440.959 |   767.835 |     284 |    8.38
 464 |  7,776,660 |     23,932 |  8,739.750 |   952.817 |     308 |    9.17
 500 |  9,028,020 |     28,123 | 11,516.716 | 1,458.609 |     694 |    7.89
 572 | 11,810,676 |     28,437 | 15,576.977 | 1,976.576 |     908 |    7.88

The largest speedup reaches a factor of more than 9. Even for smaller instances, the GPU exceeds a factor of 3. In this case study a sparser matrix was generated, which in turn needed more iterations to converge than for the herman protocol. In the largest instance (N = 572) checked by the GPU, PRISM generates a matrix with 11,810,676 lines and iterates this matrix 28,437 times. It was even necessary to increase the maximum number of iterations, set by default to 10,000, to obtain a solution. In this protocol, as well as in the next one, we observed for large matrices a slight deterioration of the performance of the GPU implementation, for which, for the time being, we could not find a clear explanation. One plausible hypothesis is that beyond some threshold number of threads the GPU can no longer profit from smart scheduling to hide the memory latencies.

Table 3. Results from the verification of the tandem protocol. The constant c is used to scale the protocol. Global memory usage, shown as GPU mem, is given in MB (o.o.m. denotes out of global memory).

    c  |          n | iterations |  seq. time | par. time | GPU mem | speedup
   255 |    130,816 |      4,212 |     26.994 |     3.639 |       4 |     7.4
   511 |    523,776 |      8,498 |    190.266 |    17.807 |      17 |    10.7
 1,023 |  2,096,128 |     16,326 |  1,360.588 |   103.154 |      71 |    13.2
 2,047 |  8,386,560 |     24,141 |  9,672.194 |   516.334 |     287 |    18.7
 3,070 | 18,859,011 |     31,209 | 25,960.397 | 1,502.856 |     647 |    17.3
 3,588 | 25,758,253 |     34,638 | 33,820.212 | 2,435.415 |     884 |    13.9
 4,095 | 33,550,336 |     37,931 | 76,311.598 |    o.o.m. |         |

Fig. 2. Time per iteration (in milliseconds, log scale) in the tandem protocol, plotted against the constant c for the double precision CPU, single precision GPU, and double precision GPU runs (plot omitted). The CPU is significantly slower than the GPU operating in single or double precision. Reducing the precision has nearly no effect on the speed.

The third case study, tandem, is based on a simple tandem queueing network [23]. The model is represented as a CTMC which consists of an M/Cox(2)/1 queue sequentially composed with an M/M/1 queue. We use c to denote the capacity of the queues. We verified property 1 from the corresponding CSL property file


(R=? [ S ]). Constant T was set to 1 for all experiments, and parameter c was scaled as shown in Table 3. In this protocol the best speedup was recorded. For the best instance (c = 2047) PRISM generates a matrix with 8,386,560 rows, which is iterated 24,141 times. For this operation standard PRISM needs 9,672 seconds while our parallel implementation needs only 516 seconds, scoring a maximal speedup factor of 18.7.

As mentioned above, 8 SPs share one double precision unit, but each SP has its own single precision unit. Hence, our hypothesis was that reducing the precision from double to single should bring a significant speedup. The code of PRISM was modified to support single precision floats in order to examine this effect. As can be seen in Fig. 2, the hypothesis was wrong: the time per iteration in double precision mode is nearly the same as in single precision mode. The graph clearly shows that the GPU is able to hide the latency which occurs when a thread is waiting for the double precision unit by letting the SPs work on other threads. Nevertheless, it is important to note that the GPU with single precision arithmetic was able to verify larger instances of the protocol, given that the floating point numbers consume less memory.

It should be noted that in all case studies we also tried the MTBDD and hybrid representations of the models, which are an option in PRISM, but in all cases the runtimes were consistently slower than the ones with the sparse matrix representation, which are shown in the tables.

6

Conclusions

In this paper we introduced GPU probabilistic/stochastic model checking as a novel concept. To this end we described a parallel version of Jacobi's method for sparse matrix-vector multiplication, which is at the core of the algorithms for model checking discrete- and continuous-time Markov chains, i.e., the corresponding logics PCTL and CSL. The algorithm was implemented on top of the probabilistic model checker PRISM. Its efficiency and the advantages of GPU probabilistic model checking in general were illustrated on several case studies. Speedups of up to 18 times compared to the sequential implementation of PRISM were achieved.

We believe that our work opens a very promising research avenue on GPU model checking in general. To stay relevant for industry, the area has to keep pace with the new technological trends. "Model checking for the masses" gets tremendous opportunities because of the "parallelism for the masses". To this end, model checking algorithms that are designed for the verification of parallel systems and exploit the full power of the new parallel hardware will be needed. In the future we intend to experiment with other matrix-vector algorithms for GPUs, as well as with combinations of multi-core and/or multi-GPU systems. What is needed for analyzing the time complexity of GPU algorithms is a fine-grained theoretical model of their operation.


References

1. Allmaier, S.C., Kowarschik, M., Horton, G.: State Space Construction and Steady-state Solution of GSPNs on a Shared-Memory Multiprocessor. In: Proc. 7th Int. Workshop on Petri Nets and Performance Models (PNPM 1997), pp. 112–121. IEEE Comp. Soc. Press, Los Alamitos (1997)
2. Baier, C., Katoen, J.-P.: Principles of Model Checking, p. 950. MIT Press, Cambridge (2008)
3. Baier, C., Katoen, J.-P., Hermanns, H., Haverkort, B.: Model-Checking Algorithms for Continuous-Time Markov Chains. IEEE Transactions on Software Engineering 29(6), 524–541 (2003)
4. Bal, H., Barnat, J., Brim, L., Verstoep, K.: Efficient Large-Scale Model Checking. In: IEEE International Parallel & Distributed Processing Symposium (IPDPS) (to appear, 2009)
5. Barnat, J., Brim, L., Ročkai, P.: Scalable Multi-core Model-Checking. In: Bošnački, D., Edelkamp, S. (eds.) SPIN 2007. LNCS, vol. 4595, pp. 187–203. Springer, Heidelberg (2007)
6. Barnat, J., Brim, L., Černá, I., Češka, M., Tumova, J.: ProbDiVinE-MC: Multi-core LTL Model Checker for Probabilistic Systems. In: International Conference on the Quantitative Evaluation of Systems (QEST 2008), pp. 77–78. IEEE Computer Society Press, Los Alamitos (2008)
7. Barnat, J., Brim, L., Šimeček, P.: I/O Efficient Accepting Cycle Detection. In: Damm, W., Hermanns, H. (eds.) CAV 2007. LNCS, vol. 4590, pp. 281–293. Springer, Heidelberg (2007)
8. Barnat, J., Brim, L., Stříbrná, J.: Distributed LTL Model Checking in SPIN. In: Dwyer, M.B. (ed.) SPIN 2001. LNCS, vol. 2057, pp. 200–216. Springer, Heidelberg (2001)
9. Barnat, J., Brim, L., Šimeček, P., Weber, M.: Revisiting Resistance Speeds Up I/O-Efficient LTL Model Checking. In: Ramakrishnan, C.R., Rehof, J. (eds.) TACAS 2008. LNCS, vol. 4963, pp. 48–62. Springer, Heidelberg (2008)
10. Baskaran, M.M., Bordawekar, R.: Optimizing Sparse Matrix-Vector Multiplication on GPUs Using Compile-time and Run-time Strategies. IBM Research Report RC24704, W0812-047 (2008)
11. Bell, A., Haverkort, B.R.: Distributed Disk-based Algorithms for Model Checking Very Large Markov Chains. Formal Methods in System Design 29, 177–196 (2006)
12. Ciardo, G.: Distributed and Structured Analysis Approaches to Study Large and Complex Systems. European Educational Forum: School on Formal Methods and Performance Analysis 2000, 344–374 (2000)
13. Dai, P., Mausam, Weld, D.S.: External Memory Value Iteration. In: Proc. of the Twenty-Third AAAI Conf. on Artificial Intelligence (AAAI), pp. 898–904 (2008)
14. http://www.nvidia.com/object/cuda_home.html#
15. Edelkamp, S., Sulewski, D.: Model Checking via Delayed Duplicate Detection on the GPU. Technical Report 821, Universität Dortmund, Fachbereich Informatik, ISSN 0933-6192 (2008)
16. Edelkamp, S., Jabbar, S.: Large-Scale Directed Model Checking LTL. In: Valmari, A. (ed.) SPIN 2006. LNCS, vol. 3925, pp. 1–18. Springer, Heidelberg (2006)
17. Edelkamp, S., Jabbar, S., Bonet, B.: External Memory Value Iteration. In: Proc. 17th Int. Conf. on Automated Planning and Scheduling, pp. 128–135. AAAI Press, Menlo Park (2007)


18. Edelkamp, S., Sanders, P., Šimeček, P.: Semi-External LTL Model Checking. In: Gupta, A., Malik, S. (eds.) CAV 2008. LNCS, vol. 5123, pp. 530–542. Springer, Heidelberg (2008)
19. Edelkamp, S., Sulewski, D.: Flash-Efficient LTL Model Checking with Minimal Counterexamples. In: Software Engineering and Formal Methods, pp. 73–82 (2008)
20. Hansson, H., Jonsson, B.: A Logic for Reasoning about Time and Reliability. Formal Aspects of Computing 6(5), 512–535 (1994)
21. Haverkort, B., Hermanns, H., Katoen, J.-P.: On the Use of Model Checking Techniques for Dependability Evaluation. In: Proc. 19th IEEE Symposium on Reliable Distributed Systems (SRDS 2000), pp. 228–237 (2000)
22. Herman, T.: Probabilistic Self-stabilization. Information Processing Letters 35(2), 63–67 (1990)
23. Hermanns, H., Meyer-Kayser, J., Siegle, M.: Multi Terminal Binary Decision Diagrams to Represent and Analyse Continuous Time Markov Chains. In: Proc. 3rd International Workshop on Numerical Solution of Markov Chains (NSMC 1999), pp. 188–207 (1999)
24. Holzmann, G.J., Bošnački, D.: The Design of a Multi-core Extension of the SPIN Model Checker. IEEE Trans. on Software Engineering 33(10), 659–674 (2007); first presented at: Formal Methods in Computer Aided Design (FMCAD), San Jose (November 2006)
25. Holzmann, G.J., Bošnački, D.: Multi-core Model Checking with SPIN. In: Proc. Parallel and Distributed Processing Symposium (IPDPS 2007), IEEE International, pp. 1–8 (2007)
26. Inggs, C.P., Barringer, H.: CTL* Model Checking on a Shared Memory Architecture. Electronic Notes in Theoretical Computer Science 128(4), 107–123 (2005)
27. Inggs, C.P., Barringer, H.: Effective State Exploration for Model Checking on a Shared Memory Architecture. Electronic Notes in Theoretical Computer Science 68(4) (2002)
28. Kwiatkowska, M.Z., Norman, G., Parker, D.: PRISM: Probabilistic Symbolic Model Checker. In: Field, T., Harrison, P.G., Bradley, J., Harder, U. (eds.) TOOLS 2002. LNCS, vol. 2324, pp. 200–204. Springer, Heidelberg (2002)
29. Kwiatkowska, M., Norman, G., Parker, D.: Stochastic Model Checking. In: Bernardo, M., Hillston, J. (eds.) SFM 2007. LNCS, vol. 4486, pp. 220–270. Springer, Heidelberg (2007)
30. Lerda, F., Sisto, R.: Distributed Model Checking in SPIN. In: Dams, D.R., Gerth, R., Leue, S., Massink, M. (eds.) SPIN 1999. LNCS, vol. 1680, pp. 22–39. Springer, Heidelberg (1999)
31. Marowka, A.: Parallel Computing on Any Desktop. Comm. of the ACM 50(9), 75–78 (2007)
32. Phillips, J.C., Braun, R., Wang, W., Gumbart, J., Tajkhorshid, E., Villa, E., Chipot, C., Skeel, R.D., Kalé, L., Schulten, K.: Scalable Molecular Dynamics with NAMD. Journal of Computational Chemistry 26, 1781–1802 (2005)
33. Stern, U., Dill, D.: Parallelizing the Murφ Verifier. In: Grumberg, O. (ed.) CAV 1997. LNCS, vol. 1254, pp. 256–278. Springer, Heidelberg (1997)
34. Stewart, W.J.: Introduction to the Numerical Solution of Markov Chains. Princeton University Press, Princeton (1994)
35. Valmari, A.: The State Explosion Problem. In: Reisig, W., Rozenberg, G. (eds.) APN 1998. LNCS, vol. 1491, pp. 429–528. Springer, Heidelberg (1998)

Improving Non-Progress Cycle Checks

David Faragó⋆ and Peter H. Schmitt

Universität Karlsruhe (TH), Institut für Theoretische Informatik, Logik und Formale Methoden
{farago,pschmitt}@ira.uka.de

Abstract. This paper introduces a new model checking algorithm that searches for non-progress cycles, used mainly to check for livelocks. The algorithm performs an incremental depth-first search, i.e., it searches through the graph incrementally deeper, and it simultaneously constructs the state space and searches for non-progress cycles. The algorithm is expected to be more efficient than the method the model checker SPIN currently uses, and it finds shortest (w.r.t. progress) counterexamples. Its only downside is the need for a subsequent reachability depth-first search (which is not the bottleneck) to construct a full counterexample. The new algorithm is better combinable with partial order reduction than SPIN's method.

Keywords: Model Checking, SPIN, non-progress cycles, livelocks, depth-first search, partial order reduction.

1

Introduction

In Section 1.1, we describe what non-progress cycles (NPCs) are and how SPIN currently searches for them. Section 1.2 presents SPIN's method in more detail and reveals its redundant operation. Hence we apply a new idea (see Section 2.1) to design two new algorithms, the incremental DFS and DFSFIFO (see Section 2.2). We prove the correctness of DFSFIFO in Section 2.3. Section 2.4 shows that DFSFIFO has several advantages over SPIN's method; the section ends by depicting the high relevance of partial order reduction. After describing how this reduction works (see Section 3.1), we show that its usage by DFSFIFO is correct (see Section 3.2) and yields many further advantages (see Section 3.3). The paper closes with a conclusion and future work. The main contributions of this paper are DFSFIFO and the theoretical basis for its upcoming implementation.

1.1

Non-Progress Cycle Checks by SPIN

NPC checks are mainly used to detect livelocks in the system being modeled, i.e., execution cycles that never make effective progress. NPC checks are often performed in formal verifications of protocols, where livelocks frequently occur.

⋆ This research received financial support by the Concept for the Future of KIT within the framework of the German Excellence Initiative from DFG.



Using SPIN, livelocks were found, for instance, in the i-protocol from UUCP (see [3]) and GIOP from CORBA (see [10]), whereas DHCP was proved to be free of livelocks (see [9]).

To be able to check for NPCs, desired activities of the system are marked in PROMELA by labeling the corresponding location in the process specification with a progress label: "statementi; progress: statementj;". This sets the local state between statementi and statementj to a local progress state (cf. Figure 7). A (global) progress state is a global system state in which at least one of the processes is in a local progress state. SPIN marks global progress states by setting the global variable np to false. Section 2.4 presents progress transitions as an alternative for modeling progress.

If no cycle without any progress label exists, then the system definitely makes progress eventually (it never gets stuck in a livelock). A non-progress cycle check detects (and returns a path to) a reachable non-progress cycle, i.e., a reachable cycle with no progress states, iff there exists one in the state transition system (S, T) (with S being the set of states and T ⊆ S×S the set of transitions). SPIN puts the check into effect with the Büchi automaton for the LTL formula <>[] np, which translates into the never claim of Listing 1 (cf. [6]).

never { /* <>[] np */
  do /* nondeterministically delay or swap to NPC search mode */
  :: np -> break
  :: true /* nondeterministic delay mode */
  od;
accept: /* NPC search mode */
  do
  :: np
  od
}

Listing 1. Never claim for NPC checks

The LTL formula is verified with SPIN's standard acceptance cycle check, the nested depth-first search (NDFS) algorithm (see [8,6]): before the basic depth-first search (DFS) backtracks from an accepting state s and removes it from the stack, a second, nested DFS is started to check whether s can reach itself, thus resulting in an acceptance cycle. Pseudo-code for the nested DFS is given in Listing 2.

1.2

Motivation for Improving Non-Progress Cycle Checks

proc DFS(state s)
  if error(s) then report error fi;
  add {s,0} to hash table;
  push s onto stack;
  for each successor t of s do
    if {t,0} ∉ hash table then DFS(t) fi
  od;
  if accepting(s) then NDFS(s) fi;
  pop s from stack;
end

proc NDFS(state s) /* the nested search */
  add {s,1} to hash table;
  for each successor t of s do
    if {t,1} ∉ hash table then NDFS(t)
    else if t ∈ stack then report cycle fi
    fi
  od;
end

Listing 2. Nested DFS

The following walkthrough depicts a detailed NPC check in SPIN (cf. Figure 1):

1. When traversal starts at init, the never claim immediately swaps to its NPC search mode because the never claim process firstly chooses np -> break in the first do-loop (if the order in this do-loop were swapped, the NPC check would descend the graph as deeply as possible in the nondeterministic

delay mode). Hence a DFS is performed in which all states are marked as acceptance states by the never claim and progress states are omitted, i.e., truncated (see Listing 1).

2. Just before backtracking from each state being traversed in the NPC search mode DFS, the NDFS (i.e., the nested search) starts an acceptance cycle search (since all traversed states were marked as acceptance states). For these acceptance cycle searches, the non-progress states are traversed again.

3. If an acceptance cycle is found, it is also an NPC, since only non-progress states are traversed. If no acceptance cycle is found, the NDFS backtracks from the state s where the NDFS was initiated, but immediately starts a new NDFS before the NPC search mode DFS backtracks from the predecessor of s. Fortunately, states that have already been visited by an NDFS are not revisited. But the NDFS is repeatedly started many times, and at least one transition has to be considered each time. Eventually, when the NDFS has been performed for all states of the NPC search mode DFS, the NPC search mode DFS backtracks to init.

4. Now the nondeterministic delay mode DFS constructs the state space once more. During this, after each forward step, all previous procedures are repeated. Since most of the time the states have already been visited, these procedures are immediately aborted. During this nondeterministic delay mode DFS, progress states are also traversed.

On the whole, the original state space (i.e., without the never claim) is traversed three times: in the NPC search mode DFS, in the NDFS, and in the nondeterministic delay mode DFS.


Fig. 1. Walkthrough of SPIN's NPC check (the NPC search mode with its repetitive NDFSs, repeated in the nondeterministic delay mode for each state s)

The state space construction for reaching an NPC and the NPC search are performed in separate steps.

2

Better NPC Checks

In this section, we firstly introduce our new approach and then our new algorithms: the incremental DFS and its improvement DFSFIFO. Thereafter, the correctness of DFSFIFO is proved. Finally, it is compared to SPIN's method of NPC checks.

2.1

Approach

The detailed walkthrough in Section 1.2 has shown that SPIN's NPC check unnecessarily often initializes procedures, touches transitions, and traverses the state space. The cause for this inefficiency is the general approach: acceptance cycle checks in combination with never claims are very powerful and cover more eventualities and options than necessary for NPC checks. So we are looking for a more specific algorithm for NPC checks that performs less redundantly and combines the construction and the NPC search phase.

But with only one traversal of the state space, we have to cope with the following problem: simply checking for each cycle found in a basic DFS whether it makes progress is an incomplete NPC search, since the DFS aborts traversal in states which have already been visited. Hence not all cycles are traversed. Figure 2 shows an example of an NPC that is not traversed and therefore not found by the DFS: from s1, the DFS first traverses path 1, which contains a progress state s2. After backtracking from path 1 to s1, the DFS traverses path 2, but aborts it at s3 before closing the (red, thick) NPC. Hence if an NPC has states that have already been visited, the cycle will not be found by the basic DFS.

The idea for our alternative NPC checks is to guarantee that:

(p-stop) After reaching an NPC for the first time, traversal of progress states is postponed as long as possible.


Fig. 2. Not traversed NPC (paths 1 and 2 starting from s1; the progress state s2 on path 1 is marked +; path 2 is aborted at the already visited state s3)

This constraint enforces that NPCs are traversed before states of the NPC are visited through some progress cycle, which would break the NPC traversal. The following section introduces two new algorithms and checks (p-stop) for them.

2.2

The Incremental Depth-First Search Algorithms

Incremental Depth-First Search. This new algorithm searches for NPCs using a depth-first iterative deepening search with incrementally larger thresholds for the number of progress states that may be traversed. For this, the incremental DFS algorithm, described in Listings 3 and 4, repeatedly builds subgraphs G_L in which paths (starting from init) with maximally L progress states are traversed, for L = 0, 1, ..., with a basic DFS algorithm. It terminates either with an error path (a counterexample to the absence of NPCs) when an NPC is found, or without an error when L becomes big enough for the incremental DFS to build the complete graph G, i.e., G_L = G. So in each state s we might prune some of the outgoing transitions by omitting those which exceed the current progress limit L, and only consider the remaining transitions.

We can implement DFS_starting_over by using a progress counter for the number of progress states on the current path. The progress counter is saved for every state on the stack, but is ignored when states are being compared. This causes an insignificant increase in memory (maximally log(L_max) × depth(G) bits). With this concept, we can firstly update the progress counter when backtracking, secondly abort traversal when the progress counter exceeds its current limit L, and thirdly quickly check for progress whenever a cycle is found. To complete this implementation, we still have to determine the unsettled functions of DFS_prune,NPC, underlined in Listing 4:

- pruned(s, t) returns true iff progress counter = L and t is a progress state.
- pruning_action(t) sets DFS_pruned to true.
- np_cycle(t) returns true iff progress counter = (counter on stack for t).
- The error message can print out the stack, which corresponds to the path from init to the NPC, inclusively.

pruned(s, t) returns true iff progress counter = L and t is a progress state. pruning action(t) sets DFS pruned to true. np cycle(t) returns true iff progress counter = (counter on stack for t). The error message can print out the stack, which corresponds to the path

from init to the NPC, inclusively.


proc DFS_starting_over(state s)
  L := 0;
  repeat
    DFS_pruned := false;
    DFS_prune,NPC(s);
    L++;
  until (!DFS_pruned);
end;

proc main()
  DFS_starting_over(init);
  printf("LTS does not contain NPCs");
end;

Listing 3. Incremental depth-first search

Unfortunately, this incremental DFS has several deficiencies:

- The upper part of the graph (of the graph's computation tree) is traversed repeatedly. But since we usually have several transitions leaving a state and relatively few progress states, this makes the incremental DFS require maximally twice the time of one complete basic DFS (i.e., of a safety check).
- Depending on the traversal order of the DFS, the progress counter limit might have to become unnecessarily large until an NPC is found, cf. Figure 3.
- As its main disadvantage, the approach of the incremental DFS is not sufficient to fulfill the condition (p-stop): it can happen that a state s0 on an NPC is reached for the first time with progress counter limit L0 (via path 3), but with progress counter(s0) < L0. For this to be the case, path 3 was aborted for L < L0. Hence for L < L0, a state s1 on path 3 was already visited from another path (path 2) with more progress, see Figure 3. For L0, s0 was reached via path 3, and thus path 2 was pruned. Therefore a state s2 on path 2 had already been visited via another path (path 1) for L0, but not for L < L0. This situation is depicted in Figure 3, with the traversal order equal to the path number, L0 = 3, progress counter(s0) = 2, and + marking progress.

Hence we modify the incremental DFS algorithm in the following section.

Incremental Depth-First Search with FIFO. Instead of repeatedly increasing the progress counter limit L and re-traversing the upper part of the graph's computation tree, we save the pruned progress states in order to jump back to them later on and continue traversal. Roughly speaking, we perform a breadth-first search with respect to the progress states, and in-between progress states we perform DFSs (cf. Listing 5). To reuse the subgraph already built, we have to save some extra information to know which transitions have been pruned. One way to track these pruned transitions is by using a FIFO (or by allowing to push elements under the stack, not only on top). Hence we name the algorithm incremental DFS with FIFO (DFSFIFO).


proc DFS_prune,NPC(state s)
  add s to hash table;
  push s onto stack;
  for each successor t of s do
    if (t ∉ hash table) then
      if (!pruned(s,t)) then DFS_prune,NPC(t)
      else pruning_action(t) fi
    else
      if (t ∈ stack && np_cycle(t)) then
        halt with error message
      fi
    fi
  od;
  pop s from stack;
end;

Listing 4. Generic DFS with pruning and NPC check

Fig. 3. (p-stop) is not met for the incremental DFS (paths 1, 2 and 3 from init, with + marking progress; traversal is aborted at the already visited states s2 and s1, with s0 on path 3)

Listing 5 shows that we do not repeatedly construct the graph from scratch, but rather use the graph already built, gradually pick the progress states out of the FIFO, and expand the graph further by continuing the basic DFS. When a new progress state is reached, traversal is postponed by putting the state into the FIFO. When the basic DFS is finished and the FIFO is empty, the complete graph G is built. The unsettled functions of DFS_prune,NPC are defined for DFSFIFO as follows:

- pruned(s, t) returns true iff t is a progress state.
- pruning_action(t) puts t into the FIFO.
- np_cycle(t) returns (t != first element of stack), since the first element is a progress state. (Using progress transitions (see Section 2.4), this exception becomes unnecessary and the constant true is returned.)
- The error message can print out the stack, which now corresponds to the path of the NPC found, but no longer contains the path from init to the cycle.

Note. This algorithm does not know which G_L is currently constructed. If we want to clearly separate the different runs, as before, we can use two FIFOs, one for reading and one for writing. When the FIFO that is read from is empty, the current run is finished and we swap the FIFOs for the next run.


proc DFSFIFO(state s)
  put s in FIFO;
  repeat
    pick first s out of FIFO;
    DFS_prune,NPC(s)
  until (FIFO is empty);
end;

proc main()
  DFSFIFO(init);
  printf("LTS does not contain NPCs");
end;

Listing 5. Incremental depth-first search with a FIFO
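To make Listings 4 and 5 concrete, the following self-contained C sketch is ours, not the authors' implementation: it assumes progress transitions (cf. Section 2.4), a tiny hard-coded graph, and plain arrays in place of SPIN's hash table and stack.

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX 64

typedef struct { int target; bool progress; } Edge;

/* Tiny example graph: 0 --progress--> 1 -> 2 -> 1 is a non-progress cycle. */
static const Edge succ[MAX][2] = { {{1, true}}, {{2, false}}, {{1, false}} };
static const int  deg[MAX]     = { 1, 1, 1 };

static bool visited[MAX], on_stack[MAX];
static int  fifo[MAX], head, tail;        /* postponed progress states */

static void dfs(int s) {                  /* DFS_prune,NPC with progress transitions */
    visited[s] = true;
    on_stack[s] = true;
    for (int k = 0; k < deg[s]; k++) {
        Edge e = succ[s][k];
        if (!visited[e.target]) {
            if (e.progress)               /* pruned: postpone the progress target */
                fifo[tail++] = e.target;  /* (duplicates possible in this sketch) */
            else
                dfs(e.target);
        } else if (on_stack[e.target] && !e.progress) {
            printf("non-progress cycle through state %d\n", e.target);
            exit(1);                      /* halt with error message */
        }
    }
    on_stack[s] = false;                  /* backtrack; the stack is the recursion */
}

int main(void) {
    fifo[tail++] = 0;                     /* put init into the FIFO */
    while (head < tail) {
        int s = fifo[head++];
        if (!visited[s]) dfs(s);          /* one run in-between progress states */
    }
    printf("LTS does not contain NPCs\n");
    return 0;
}

Note how the truncation of the stack between runs shows up as the fresh on_stack flags at each FIFO pick: a non-progress edge closing back into a state from an earlier run finds on_stack false and is, correctly, not reported as an NPC.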

With this technique, the deficiencies of the original incremental DFS are avoided: (p-stop) is fulfilled since progress state traversal is postponed as long as possible, the progress counter limit L does not become unnecessarily large, and we avoid the redundancy of the original incremental DFS by reusing the part of the graph already built previously. The consequent postponing guarantees a constraint stronger than (p-stop): each state is visited through a path with the fewest possible progress states. So now G_L is the maximal subgraph of G such that all paths in G_L without cycles have at most L progress states.

On the whole, DFSFIFO does not require more memory by using a FIFO compared to a basic DFS, because progress states are stored only temporarily in the FIFO until they are stored in the hash table (cf. Listing 4). The time complexity is also about the same as for the basic DFS. DFSFIFO erases a large part of the stack: everything behind progress states, i.e., all of the stack between init and the last progress state, is lost. But for detecting NPCs, this is a feature and not a bug: exactly the NPCs are detected. The cycles that go back to states from previous runs are progress cycles and stay undetected. Thus we no longer need to save a progress counter on the truncated stack, saving even more memory. A further benefit will arise in combination with partial order reduction (see Section 3).

If an NPC is detected, the stack from the current run supplies the NPC, but an additional basic DFS for reaching the NPC is required to obtain a complete error path as counterexample. The shortest (w.r.t. progress) counterexample can be found quickly, for instance with the following method, which generally requires only little additional memory: instead of storing only the last progress states in the FIFO, all progress states on the postponed paths are saved, e.g., in the form of a tree. The shortest counterexample can then easily be reconstructed using the progress states on the error path as guidance for the additional basic DFS.

2.3

Correctness of DFSFIFO

Constructively proving that DFSFIFO finds a certain NPC is difficult: we would need to consider various complex situations and the technical details of the algorithm, e.g., the order in which the transitions are traversed (cf. Figure 6). Hence we prefer a pure existence proof.


Fig. 4. s^1_{h1} cannot be twice on the current path at time t_end (the path from init through s to s^1_{h1}, with the aborted traversal indicated)

Fig. 5. Constructing π^2 from π^1 (the paths from init through s, with the aborted traversals and the states s^1_{h1} and s^1_{h1+1} indicated)

Theorem 1. DFSFIFO finds an NPC if one exists and otherwise outputs that no NPC exists. An NPC is found at the smallest depth w.r.t. progress, i.e., after the smallest number (L0) of progress states that have to be traversed.

Proof. DFSFIFO only postpones transitions, but does not generate new ones. It checks for NPCs by searching through the stack (except the first state), i.e., it only considers non-progress states. Thus it is sound, i.e., it does not output false negatives.

To prove completeness, i.e., that NPCs are found if they exist, let there be an NPC. As long as DFSFIFO constructs G_L for L < L0, all paths leading to an NPC are pruned. Let L = L0, let s be the first state reached in the DFS which is on an NPC, let t_begin be the time the DFS reaches s (the first time), and let π^1 = <s^1_1 = s, s^1_2, ..., s^1_{n1} = s> be an NPC containing s. Because of (p-stop), the DFS stops traversing progress states while all non-progress states reachable from s are being traversed. We assume no NPC is found in G_{L0}. Hence the traversal of π^1 must be aborted because a state s^1_{h1} ≠ s, for some h1 ∈ {2, ..., n1 − 1}, is revisited (i.e., visited when already in the hash table) before π^1 is closed, i.e., before s could be reached the second time. Let t_middle1 be the time when s^1_{h1} is visited the first time, and


Fig. 6. The found NPC does not contain s (transitions t1 and t2 leave init; the found NPC is reached via t1, the firstly visited NPC through s via t2)

t_end the time when s^1_{h1} is revisited and π^1 is aborted. s^1_{h1} cannot be twice on the current path (once at the end and once earlier) at time t_end: the first occurrence cannot be above s (i.e., closer to init), because s is the first visited state of π^1, and not below s, since then an NPC <s^1_{h1}, ..., s^1_{h1}> would be found, see Figure 4. So our algorithm first visits s at t_begin, then visits s^1_{h1} at t_middle1, then backtracks from s^1_{h1} and finally revisits s^1_{h1} at t_end while traversing π^1.

Informally, since our algorithm backtracks from s^1_{h1} without having found an NPC, the traversal of the path from s^1_{h1} to s was aborted because some state on it was revisited, i.e., the state had already been visited before, but after t_begin. With this argument, we come successively closer to completing some NPC, which is a contradiction. Formally: let π^2 = <s^2_1 = s, s^2_2, ..., s^2_{n2} = s> be the path from s at time t_begin to s^1_{h1} at time t_middle1, concatenated with <s^1_{h1+1}, s^1_{h1+2}, ..., s^1_{n1} = s>, i.e., π^2 is also an NPC containing s, see Figure 5. Therefore we can apply the argumentation from above to π^2 instead of π^1 to obtain a state s^2_k (k ∈ {1, ..., n2}) on π^2 that is revisited before π^2 is closed. Let t_middle2 be the time when s^2_k is visited the first time. Since on π^2 the state s^1_{h1} is visited the first time (at t_middle1), the DFS also reaches (maybe as a revisit) s^1_{h1+1} on π^2 (at some time after t_middle1). So s^2_k = s^1_{h2} for some h2 ∈ {h1 + 1, ..., n1 − 1}. Let π^3 = <s^3_1 = s, s^3_2, ..., s^3_{n3} = s> be the NPC from s at time t_begin to s^1_{h2} at time t_middle2, concatenated with <s^1_{h2+1}, s^1_{h2+2}, ..., s^1_{n1} = s>.

Applying this argumentation iteratively, we get a strictly monotonically increasing sequence (h_i)_{i∈N} with all h_i ∈ {2, ..., n1 − 1}. Because of this contradiction, the assumption that no NPC is found in G_{L0} is wrong. If all cycles in the LTS make progress, L will eventually be big enough to contain the complete graph. After traversing it, the algorithm terminates with the correct output that no NPC exists. Thus our algorithm is correct.

Note. The proof shows that DFSFIFO finds an NPC before backtracking from s. But the NPC does not have to contain s: Figure 6 shows an example if t1 is traversed ahead of t2. Since our pure existence proof assumed that no NPCs are found, it also covers this case (cf. Figure 4).


Fig. 7. Progress transition with atomic: for "statementi; progress: statementj;" the state s between the two statements is a progress state, whereas for "statementi; atomic{skip; progress: statementj};" the progress moves onto the composite transition.

2.4

Comparison

We firstly compare SPIN's NPC checks with DFSFIFO by relating their performances to that of the basic DFS (which corresponds to a safety check): the runtime of a basic DFS, denoted t_safety, is linear in the number of reachable states and transitions; the memory requirement s_safety is linear in the number of reachable states. For SPIN's NPC checks, the memory required in the worst case is about 2 × s_safety because of the never claim. The runtime is about 3 × t_safety because of the nested search and the doubled state space. For DFSFIFO, both time and memory requirements are about the same as for the basic DFS. To construct a full counterexample, maximally t_safety is required, but usually far less.

But this asymptotic analysis only gives rough complexities. For a more precise comparison, we look at the steps in detail and see that the inefficiencies of the NDFS algorithm are eliminated for DFSFIFO: the redundancy is completely avoided, as all states are traversed only once by a simultaneous construction and NPC search. Furthermore, only paths with minimal progress are traversed. Since many livelocks in practice occur after very little progress (e.g., for the i-protocol (cf. [3]) after 2 sends and 1 acknowledge), DFSFIFO comprises an efficient search heuristic. Additionally, shortest (w.r.t. progress) counterexamples are easier to understand and often reveal more relevant errors.

Finally, we can also model progress in a better way, using progress transitions instead of progress states. SPIN's NPC check needs to mark states as having progress because never claims are used: the never claim process is executed in lockstep with the remaining automaton and thus only sees the states of the remaining automaton, not its transitions. Since our DFSFIFO does not require never claims, we can mark transitions (e.g., those switching np from false to true in the original semantics) as having progress. The most fundamental implementation of progress transitions is to change the semantics of PROMELA so that a progress label marks the following statement as a progress transition. If we do not want to change the PROMELA semantics, we can use the construct "statementi; atomic {skip; progress: statementj}" instead of "statementi; progress: statementj;". Figure 7 shows the difference in the automata: the progress moves from state s to the following composite transition. "atomic{...}" guarantees that the progress state is left immediately after it was entered.

Fig. 8. Faked progress (the local automata of process 1, with states s1, s2 and transition x, and of process 2, with states t1, t2, and the resulting global automaton over the states (si, tj); since s2 is a local progress state, (s2,t1) and (s2,t2) are global progress states)

Table 1. Big difference between safety and NPC checks

Problem |       safety checks      |   NPC checks via NDFS
Size    | time | depth | states    | time | depth  | states
3       |   5" |    33 |     66    |   5" |    387 |    1400
4       |   5" |    40 |    103    |   5" |   2185 |   18716
5       |   5" |    47 |    148    |   6" |  30615 |  276779
6       |   5" |    54 |    201    |  70" | 335635 | 4.3e+06
7       |   5" |    61 |    262    | memory overflow (> 1 GB)
254     | 100" |  1790 | 260353    | memory overflow (> 1 GB)

Unfortunately, SPIN does not interleave atomic sequences with the never claim process, so this technique cannot be used for SPIN's NPC checks. Trying nevertheless, SPIN sometimes claims to find an NPC, but returns a trace which has a progress cycle; at other times, SPIN gives the warning that a "progress label inside atomic - is invisible". SPIN's inconsistent warning suggests that it does not always detect progress labels inside atomic.

Using progress transitions, we can model more faithfully, since in reality actions, not states, make progress. For example, in Figure 8, if the action corresponding to the transition from a state s2 to a state s1 causes progress, PROMELA models s2 as a progress state. So the cycle between (s2, t1) and (s2, t2) in the global automaton is considered a progress cycle, although the system does not perform any progress within the cycle. The other case, of a path with several different local progress states visited simultaneously or directly after one another, cannot be distinguished from one persistent local progress state as in Figure 8. Using progress transitions, all these cases can be differentiated and are simpler: the number of progresses on a path π is simply its number of progress transitions, denoted |π|_p. The biggest advantages of using progress transitions emerge in combination with partial order reduction (see Section 3).

The performance comparison from this section has to be considered with caution, though, as the effectiveness of a verification usually stands and falls with the strength of the additional optimization techniques involved, especially partial order reduction (cf. Section 3). The reduction strength can significantly decrease when changing from safety to liveness checks because the traversal algorithm


changes and the visibility constraint C2 (cf. Section 3) becomes stricter. For instance, in a case study that verified leader election protocols (cf. [4]), the safety checks with partial order reduction were performed in quadratic time and memory (i.e., easily up to SPIN’s default limit of 255 processes), whereas the NPC checks could only be performed up to the problem size 6 (see Table 1). So a very critical aspect of NPC checks is how strongly partial order reduction can reduce the state space. In the next section, we show that DFSFIFO is compatible with partial order reduction and that the elimination of redundancy in DFSFIFO – as well as its further advantages (see Section 3.3) – enhance the strength of partial order reduction.

3 Compatibility with Partial Order Reduction

SPIN's various reduction methods contribute strongly to its power and success. Many of them are on a technical level and easily combinable with our NPC checks, for instance Bitstate Hashing, Hash-compact and collapse compression. For Bitstate Hashing and Hash-compact, the reduction might be weakened by DFSFIFO because the FIFO temporarily stores complete states. One of the most powerful reduction methods is partial order reduction (POR). In this section, we first introduce SPIN's POR, which uses the technique of ample sets (see [1,2], for technical details [7]). Thereafter, we prove that DFSFIFO can be correctly combined with POR. Finally, we again compare SPIN's NPC checks with DFSFIFO, this time also considering POR.

3.1 Introduction

One of the main reasons for state space explosion is the interleaving technique of model checking, used to cover all possible executions of the asynchronous product of the system's component automata. These combined executions usually cause an exponential blowup of the number of transitions and intermediate states. But often statements of concurrent processes are independent:

α, β ∈ S are independent iff for all s ∈ S with α, β ∈ enabled(s): α ∈ enabled(β(s)) and β ∈ enabled(α(s)) (enabledness), and α(β(s)) = β(α(s)) (commutativity);
α, β ∈ S are dependent iff α, β are not independent;

with enabled : S → P(S) and S being the set of all statements (we regard a statement as the subset of those global transitions T in which a specific local transition is taken). So the different combinations of their interleaving have the same effect. POR tries to select only a few of the interleavings having the same result. This is done by choosing in each state s a subset ample(s) ⊆ enabled(s), called the ample set of s in [11]. The choice of ample(s) must meet the conditions C0 to C3 listed in Table 2. C3' is a sufficient condition for C3 and can be checked locally in the current state. Since SPIN is an on-the-fly model checker, it uses C3'.
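To make the independence check concrete, here is a minimal Python sketch (our own illustration, assuming an explicitly given state space; states, enabled and apply_stmt are hypothetical interfaces, not part of the paper):

    def independent(alpha, beta, states, enabled, apply_stmt):
        """Two statements are independent iff, in every state where both
        are enabled, neither disables the other (enabledness) and both
        execution orders lead to the same state (commutativity)."""
        for s in states:
            if alpha in enabled(s) and beta in enabled(s):
                # enabledness: executing one must not disable the other
                if alpha not in enabled(apply_stmt(beta, s)):
                    return False
                if beta not in enabled(apply_stmt(alpha, s)):
                    return False
                # commutativity: both orders reach the same state
                if apply_stmt(alpha, apply_stmt(beta, s)) != \
                        apply_stmt(beta, apply_stmt(alpha, s)):
                    return False
        return True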


Table 2. Constraints on ample(s)

C0 (Emptiness): ample(s) = ∅ ⇔ enabled(s) = ∅.
C1 (Ample decomposition): No statement α ∈ S\ample(s) that is dependent on some statement from ample(s) can be executed in the original, complete graph after reaching the state s and before some statement in ample(s) is executed.
C2 (Invisibility): ample(s) ≠ enabled(s) =⇒ ∀α ∈ ample(s): α is invisible, which means that α is not a progress transition, or, when progress states are being used, that α does not change np.
C3 (Cycle closing condition): If a cycle contains a state s s.t. α ∈ enabled(s) for some statement α, it also contains a state s′ s.t. α ∈ ample(s′).
C3′ (NotInStack): α ∈ ample(s) and α(s) is on the stack ⇒ ample(s) = enabled(s).

If these conditions are met, then the original graph G and the partial order reduced graph G′ are stuttering equivalent: for each path π ∈ G there exists a path π′ ∈ G′ (and vice versa) such that π and π′ are stuttering equivalent (cf. [11] and [2]). In our special case of NPCs, two paths π and π′ are stuttering equivalent (π ∼st π′) iff they make the same amount of progress.

3.2 Correctness of DFSFIFO in Combination with POR

To prove the correctness of DFSFIFO with POR activated, we have to look at the conditions for POR first. C3' no longer implies C3 if DFSFIFO is used: since a large part of the stack gets lost by postponing the traversal at progresses (progress states or progress transitions), progress cycles are not detected. To guarantee C3 for the progress cycles being traversed, we traverse all pending transitions when we are about to destroy the stack by making progress. So for each state s we fulfill the condition: (∃α ∈ ample(s): α is visible) =⇒ (ample(s) = enabled(s)). This is equivalent to C2.

Note. When progress states are being used, C2 is not sufficient to guarantee C3 in the special case of cycles that solely contain progress states (e.g., as in Figure 8). Several solutions are possible: firstly, we can alter C2 to C2': (∃α ∈ ample(s): α(s) is a progress state) =⇒ (ample(s) = enabled(s)). Secondly, these cycles might be avoidable by weak fairness (which is combinable with our algorithm) if they are caused by one process remaining in its progress state throughout the cycle. Thirdly, we can guarantee by hand that these long sequences of progress states never occur, e.g., by forcing a quick exit from progress states (similarly to Figure 7). But we favor using progress transitions anyway, which is once more the simplest and most efficient solution.

If DFSFIFO detects a cycle on the stack, it has already found an NPC and is finished. Hence we no longer need C3'; C2 suffices to fulfill C3. This fact helps in the following proof.


Theorem 2. DFSFIFO in combination with POR finds an NPC if one exists and otherwise outputs that no NPC exists. An NPC is found at the smallest depth w.r.t. progress, i.e., after the smallest number (L0) of progresses that have to be traversed.

Proof. Partial order reducing the graph G does not create new NPCs. Hence DFSFIFO still does not output false negatives. To prove completeness, let there be an NPC in G. If L < L0, all paths leading to an NPC are pruned before the NPC is reached. Let L = L0 and π be a counterexample in GL0. C3' is unnecessary for DFSFIFO. C0, C1 and C2 are independent of the path leading to s; thus ample(s) can be determined independently of the path leading to s. So all conditions C0, C1, C2 and C3 are met, and ample(s) is not influenced by the differing traversal order. Hence stuttering equivalence is preserved. Thus the reduced graph from GL0 that DFSFIFO with POR constructs also has an infinite path with exactly L0 progresses like π, i.e., an NPC. Theorem 1 proves that after L0 progresses an NPC is found by DFSFIFO in combination with POR.

3.3 Comparison

Now we can do an overall comparison between SPIN's NPC checks and DFSFIFO, both with POR activated (pros for DFSFIFO are marked with +):

+ DFSFIFO avoids all redundancies and therefore saves time and memory and enables stronger POR.
+ The use of progress transitions instead of progress states is possible, spawning several advantages:
  • The faithful modeling not only simplifies the basic algorithms, but also the application of POR: the visible transitions are exactly the progress transitions, and π ∼st π′ iff |π|p = |π′|p. That is why progress transitions are the easiest solution to get by with C2 instead of C2′.
  • Only one of the originally two local transitions is now visible, i.e., we definitely have fewer visible global transitions.
  • In certain situations, this difference in the number of visible global transitions can get very large: Figure 9 depicts that the global automaton can have far more visible transitions when progress states are used instead of progress transitions. Consequently, the ample sets also differ strongly in size. The ample sets are marked with circles, the visible transitions with - (if np is switched to true) and +.
+ The constraint C3′ becomes unnecessary.
– To obtain an error path from the initial state to the NPC, an additional basic DFS is necessary, but this requires fewer resources than the main check.
+ A shortest (w.r.t. progress) error path can be given, which is often easier to understand and more revealing than longer paths.
+ By avoiding progress as much as possible, DFSFIFO exhibits an efficient NPC search heuristic: in practice, NPCs often occur after only few progresses. Additionally, by avoiding progress as much as possible, its visibility weakens POR just as much as necessary. Since the time and memory requirements of DFSFIFO and the basic DFS are about the same, the performance of our NPC check is roughly the same as for a safety check if POR stays about as strong as for safety checks.

[Figure 9: two panels comparing visible transitions and ample sets, "SPIN's POR with progress states" vs. "DFSFIFO's POR with progress transitions".]

Fig. 9. Ample sets (red circles) are smaller for progress transitions

+ Our new NPC check is a more direct method. This is in line with SPIN's paradigm of choosing the most efficient and direct approach, and it eases modifications, such as improvements, additional options and extensions.
+ It might be possible to improve POR: for finding NPCs, we only need to distinguish |π|p = ∞ from |π|p < ∞ for an infinite path π. Therefore a stronger reduction that does not guarantee stuttering equivalence is sufficient, as long as at least one NPC is preserved.

Note. We can also compare our DFSFIFO with SPIN's former NPC check. The old check used the NDFS directly (see [5]). [8] explains that this algorithm is not compatible with POR because of condition C3. The authors of the paper "do not know how to modify the algorithm for compatibility with POR" and suggest the alternative SPIN is now using (cf. Section 1.1). But DFSFIFO can be regarded as such a modification of SPIN's old NPC check: the state space creation and the search for an NPC are combined, and C3 is reduced to C2.

4 Closure

4.1 Conclusion

Instead of separately constructing the state space and searching for NPCs, as SPIN does, DFSFIFO performs both at the same time. To be able to avoid a


nested search, DFSFIFO postpones traversing progress for the (L + 1)-th time until the combined state space creation and NPC check for the subgraph GL is finished. Then DFSFIFO retrieves the postponed progresses and continues with GL+1 \ GL. When an NPC is found or the complete graph is built, DFSFIFO terminates.

DFSFIFO is a more direct NPC check than SPIN's method, with no redundancy, and it enables an efficient search heuristic, better counterexamples, the use of progress transitions, stronger POR and the facilitation of improvements. With these enhancements, the verification by NPC checks becomes more efficient. As a trade-off, DFSFIFO no longer delivers an error path from the initial state to the NPC, only the NPC itself. For a complete error path, an additional basic DFS is required, whose cost is, however, negligible.
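The following Python fragment is a minimal sketch of this postponing scheme as we read it from the description above; the successor interface is our own assumption (successors(s) yields pairs of a successor state and a flag telling whether the transition makes progress), and POR and counterexample reconstruction are omitted:

    from collections import deque

    def dfs_fifo(init, successors):
        """Combined state space construction and NPC check: progress
        transitions are postponed in a FIFO, so the subgraph G_L is
        finished before G_{L+1} is entered; a cycle that closes on the
        DFS stack contains no progress and is therefore an NPC."""
        visited = {init}
        fifo = deque([init])
        while fifo:
            root = fifo.popleft()          # retrieve a postponed progress
            stack = [(root, iter(successors(root)))]
            on_stack = {root}
            while stack:
                state, it = stack[-1]
                step = next(it, None)
                if step is None:
                    stack.pop()
                    on_stack.discard(state)
                    continue
                succ, progress = step
                if progress:               # postpone traversal of progress
                    if succ not in visited:
                        visited.add(succ)
                        fifo.append(succ)
                elif succ in on_stack:
                    return "NPC found"     # non-progress cycle closes
                elif succ not in visited:
                    visited.add(succ)
                    stack.append((succ, iter(successors(succ))))
                    on_stack.add(succ)
        return "no NPC"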

4.2 Future Work and Open Questions

Having proved that DFSFIFO is correct and combinable with POR, we can now move from these important theoretical questions to the next step of implementing the algorithm. Thereafter, we will analyze DFSFIFO ’s performance. Because of the mentioned advantages, we are optimistic that DFSFIFO will strongly improve NPC checks in practice. Section 3.3 posed the open question whether POR can be further strengthened for our NPC checks by weakening stuttering equivalence to a constraint that solely preserves NPC existence. Solving this question might further speed up our NPC check.

References

1. Clarke, E.M., Grumberg, O., Minea, M., Peled, D.: State space reduction using partial order techniques. International Journal on Software Tools for Technology Transfer (STTT) 2, 279–287 (1999)
2. Clarke Jr., E.M., Grumberg, O., Peled, D.A.: Model Checking. The MIT Press, Cambridge (1999); third printing, 2001 edition
3. Dong, Y., Du, X., Ramakrishna, Y.S., Ramakrishnan, C.R., Ramakrishnan, I.V., Smolka, S.A., Sokolsky, O., Stark, E.W., Warren, D.S.: Fighting livelock in the i-protocol: a comparative study of verification tools. In: Cleaveland, W.R. (ed.) TACAS 1999. LNCS, vol. 1579, pp. 74–88. Springer, Heidelberg (1999)
4. Faragó, D.: Model checking of randomized leader election algorithms. Master's thesis, Universität Karlsruhe (2007)
5. Holzmann, G.J.: Design and Validation of Computer Protocols. Prentice Hall Software Series (1992)
6. Holzmann, G.J.: The SPIN Model Checker: primer and reference manual, 1st edn. Addison Wesley, Reading (2004)
7. Holzmann, G.J., Peled, D.: An improvement in formal verification. In: Proceedings of the Formal Description Techniques 1994, Bern, Switzerland, pp. 197–211. Chapman & Hall, Boca Raton (1994)
8. Holzmann, G.J., Peled, D., Yannakakis, M.: On nested depth-first search. In: Proceedings of the Second SPIN Workshop, Rutgers Univ., New Brunswick, NJ, August 1996, pp. 23–32. American Mathematical Society. DIMACS/32 (1996)


9. Islam, S.M.S., Sqalli, M.H., Khan, S.: Modeling and formal verification of DHCP using SPIN. IJCSA 3(2), 145–159 (2006)
10. Kamel, M., Leue, S.: Formalization and validation of the general inter-orb protocol (GIOP) using PROMELA and SPIN. In: Software Tools for Technology Transfer, pp. 394–409. Springer, Heidelberg (2000)
11. Peled, D.: Combining partial order reductions with on-the-fly model-checking. In: 6th International Conference on Computer Aided Verification, Stanford, California (1994)

Reduction of Verification Conditions for Concurrent System Using Mutually Atomic Transactions

Malay K. Ganai (1) and Sudipta Kundu (2)

(1) NEC Labs America, Princeton, NJ, USA
(2) University of California, San Diego, CA, USA

Abstract. We present a new symbolic method based on partial order reduction to reduce the verification problem size and state space of a multi-threaded concurrent system with shared variables and locks. We combine our method with a previous token-based approach that generates verification conditions directly, without a scheduler. For a bounded unrolling of threads, the previous approach adds concurrency constraints between all pairs of global accesses. We introduce the notion of Mutually Atomic Transactions (MAT): two transactions are mutually atomic when there exists exactly one conflicting shared-access pair between them. We propose to reduce the verification conditions by adding concurrency constraints only between MATs. Such an approach removes all redundant interleavings and thereby achieves state reduction as well. We guarantee that our MAT-based reduction is both adequate (preserves all the necessary interleavings) and optimal (no redundant interleaving), for a bounded depth analysis. Our experimental results show the efficacy of our approach in reducing the state space and the verification problem sizes by orders of magnitude, thereby improving the overall performance, compared with state-of-the-art approaches.

1 Introduction

Verification of multi-threaded programs is hard due to the complex and unexpected interleavings between the threads [1]. In practice, verification efforts often use incomplete methods, imprecise models, or sometimes both, to address the scalability of the problem. The verification model is typically obtained by composing individual thread models using interleaving semantics, and model checkers are applied to systematically explore the global state space. To combat the state explosion problem, most methods employ partial-order reduction techniques to restrict the state traversal to only a representative subset of all interleavings, thereby avoiding the exploration of redundant interleavings among independent transitions [2, 3, 4]. Explicit model checkers [5, 6, 7, 8, 9] explore the states and transitions of a concurrent system by explicit enumeration, while symbolic model checkers [10, 11, 12, 13, 14, 15, 16, 17] use symbolic methods. We focus on symbolic approaches based on SMT (Satisfiability Modulo Theories) to generate efficient verification conditions. Based on how verification models are built, symbolic approaches can be broadly classified into: synchronous (i.e., with scheduler) and asynchronous (i.e., without scheduler) modeling.

Synchronous modeling: In this category of symbolic approaches [10, 11, 12], a synchronous model of concurrent programs is constructed with a scheduler. The scheduler is then constrained, by adding guard strengthening, to explore only a subset of interleavings.


To guarantee correctness (i.e., cover all necessary interleavings), the scheduler must allow context switches between accesses that are conflicting (i.e., dependent). One determines statically (i.e., conservatively) which pair-wise locations require context switches, using persistent [4]/ample [18] set computations. One can further use lock-set and/or lock-acquisition history analysis [19, 20, 21, 11] and conditional dependency [22, 16] to reduce the set of interleavings that need to be explored (i.e., to remove redundant interleavings). Even with these state reduction methods, the scalability problem remains. To overcome that, researchers have employed sound abstraction [7] with a bounded number of context switches [23] (i.e., under-approximation), while some others have used finite-state model abstractions [13], combined with a proof-guided method to discover the context switches [14].

Asynchronous modeling: In this category, symbolic approaches such as TCBMC [15] and token-based [17] generate verification conditions directly, without constructing a synchronous model of concurrent programs, i.e., without using a scheduler. These verification conditions are then solved by satisfiability solvers. To our knowledge, state reduction based on partial order has so far hardly been exploited in the asynchronous modeling approaches [15, 17]. We will focus primarily on that direction.

Our approach: We present a new SMT-based method, combining partial-order reduction with the previous token-based approach [17], to reduce the verification problem size and state space for a multi-threaded concurrent system with shared variables and locks. For a bounded unrolling of threads, the previous approach adds concurrency constraints between all pairs of global accesses, thereby allowing redundant interleavings. Our goal is to reduce the verification conditions by removing all redundant interleavings (i.e., guarantee optimality) but keeping the necessary ones (i.e., guarantee adequacy). We first introduce the notion of Mutually Atomic Transactions (MAT), i.e., two transactions are mutually atomic when there exists exactly one conflicting shared-access pair between them. We then propose an algorithm to identify an optimal and adequate set of MATs. For each MAT in the set, we add concurrency constraints only between the first and last accesses of the transactions, and not in-between. Our MAT-based approach achieves reduction both in state space as well as in the size of the verification conditions. We guarantee that our MAT-based reduction is both adequate (preserves all the necessary interleavings) and optimal (no redundant interleaving), for a bounded depth analysis. We implemented our approach in an SMT-based prototype framework, and demonstrated its efficacy against the state-of-the-art SMT-based approaches based on asynchronous modeling [17] and synchronous modeling [16], respectively.

Outline: We provide an informal overview of our MAT-based reduction approach in Section 2, followed by formal definitions and notations in Section 3. In Section 4, we present a flow diagram of our new SMT-based method. We give an algorithm for identifying an adequate and optimal set of MATs in Section 5, followed by a presentation of the adequacy and optimality theorems in Section 6. We present our experimental results in Section 7, and conclusions in Section 8.


2 An Overview

We motivate our readers with the following example, which we use to guide the rest of our discussion. Consider a two-threaded concurrent system comprising threads M1 and M2 with local variables ai and bi, respectively, and shared (global) variables x, y, z. This is shown in Figure 1(a) as a concurrent control flow graph (CCFG) with a fork-join structure. Each shared statement associated with a node is atomic, i.e., it cannot be interrupted. Further, each node is associated with at most one shared access. A node with a shared write/read access of variable x is identified as W(x)/R(x). We use the notation ? to denote a non-deterministic input to a variable.

Given such a concurrent system, the goal of the token-based approach [17] is to generate verification conditions that capture the necessary interleavings for some bounded unrolling of the threads, aimed at detecting reachability properties such as data races and assertion violations. These verification conditions, together with the property constraints, are encoded and solved by an SMT solver. A satisfiable result is typically accompanied by a trace, comprising data input valuations and a total-ordered thread interleaving, that is a witness to the reachability property. On the other hand, an unsatisfiable result is followed by these steps (a)-(c): (a) increase the unroll depths of the threads, (b) generate verification conditions for the increased depths, and (c) invoke the SMT solver on these conditions. Typically, the search process (i.e., to find witnesses) is terminated when a resource, such as time, memory or bound depth, reaches its limit. For effective implementation, these verification constraints are added on-the-fly, lazily and incrementally at each unrolled depth.

Though the approach captures all necessary interleavings, it does not prevent redundant interleavings. In this work, our goal is to remove all the redundant interleavings but keep the necessary ones for a given unroll bound. We focus on reducing the verification conditions as generated in the token-passing modeling approach [17]. To understand how we remove redundancy, we first present a brief overview of such a modeling approach.

[Figure 1: (a) the CCFG with fork/join; thread M1 executes 1a: y=0 (W(y)), 2a: a1=x+3 (R(x)), 3a: a2=z-1 (R(z)), 4a: y=a1-a2 (W(y)); thread M2 executes 1b: b1=x-2 (R(x)), 2b: z=b1-1 (W(z)), 3b: b2=x+1 (R(x)), 4b: y=b1+b2 (W(y)); with assumptions -7 < x and y < z+5 and assertion y > 0. (b) The token-passing model, with read sync (rs) and write sync (ws) blocks around each shared access and 4*4*2 = 32 pair-wise constraints.]

Fig. 1. (a) Concurrent system, shown as thread CFGs, with threads M1, M2 and local variables ai, bi respectively, communicating with shared variables x, y, z, and (b) token-passing model [17]


2.1 Token-Passing Model

The main idea of the token-passing model (TPM) is to introduce a single Boolean token tk and a clock vector ctk in a model, and then manipulate the passing of the token to capture all necessary interleavings in the given system. The clock vector records the number of times the token tk is passed, and is synchronized when the token is passed. Unlike a synchronous model, TPM does not have a scheduler in the model. The verification model is obtained in two phases.

In the first phase, the goal is to obtain abstract and decoupled thread models. Each thread is decoupled from the other threads by localizing all the shared variables. For the example shown in Figure 1(a), M1 and M2 are decoupled by renaming (i.e., localizing) each shared variable such as x to x1 and x2, respectively. Each model is then abstracted by allowing the renamed (i.e., localized) variables to take non-deterministic values at every shared access. To achieve that, each shared access node (in every thread) is instrumented with two control states as follows: (a) an atomic pre-access control state, referred to as read sync block, is inserted before each shared access, and (b) an atomic post-access control state, referred to as write sync block, is inserted after each shared access. In the read sync block, all localized shared variables obtain non-deterministic values. As an example, we show the token-passing model in Figure 1(b). For clarity of presentation, we did not show the renaming of the shared variables, but for all our purposes we consider them to be local to the thread, i.e., x of thread Mi and x of Mj are not the same variable. In such a model, atomic control states rs and ws are inserted before and after the shared accesses of the decoupled model, respectively. As highlighted for the control state 3b, we add the statements x=?, y=?, z=?, tk=?, ctk=? in the corresponding rs node. Similarly, we add tk=? in the ws node. (? denotes non-deterministic values.) Note, the transition (update) relation for each localized shared variable depends only on other local variables, thereby making the model independent (i.e., decoupled). However, due to the non-deterministic read values, the model has additional behaviors; hence, it is an abstract model.

In the second phase, the goal is to remove the imprecision caused by the abstraction. In this phase, constraints are added to restrict the introduced non-determinism and to capture the necessary interleavings. More specifically, for each pair of shared access states (in different threads), token-passing constraints are added from the write sync node of one shared access to the read sync node of the other shared access. Intuitively, these token-passing constraints allow passing of the token from one thread to another, giving a total order on the shared accesses. Furthermore, these constraints allow synchronizing the values of the localized shared variables from one thread to another. Together, the token-passing constraints capture all and only the necessary interleavings that are sequentially consistent [24], as stated in the following theorem.

Theorem 1 (Ganai, 2008 [17]). The token-based model is both complete, i.e., it allows only sequentially consistent traces, and sound, i.e., it captures all necessary interleavings, for a bounded unrolling of threads. Further, the size of the pair-wise constraints added grows quadratically (in the worst case) with the unrolling depth.
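To see where the quadratic growth comes from, here is a minimal Python sketch of the all-pairs scheme (our own illustration; the lists of shared-access control states per thread are assumed inputs):

    def all_pairwise_token_constraints(accesses_i, accesses_j):
        """All-pairs scheme: add a token-passing pair from the write sync
        block of every access of one thread to the read sync block of
        every access of the other, in both directions."""
        tp = set()
        for a in accesses_i:
            for b in accesses_j:
                tp.add((("ws", a), ("rs", b)))
                tp.add((("ws", b), ("rs", a)))
        return tp

    # 4 accesses per thread, as in Figure 1: 4 * 4 * 2 = 32 constraints
    print(len(all_pairwise_token_constraints(["1a", "2a", "3a", "4a"],
                                             ["1b", "2b", "3b", "4b"])))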


In Figure 1(b), we show a token-passing constraint as a directed edge from a write sync ws node of one thread to a read sync rs node of another. Note, these constraints are added for all pairs of ws and rs nodes. A synchronization constraint from M1 to M2 will include x2 = x1 ∧ y2 = y1 ∧ z2 = z1 ∧ tk2 = 1 ∧ tk1 = 0 ∧ ctk2 = ctk1, where the token-passing is enforced by assertion/de-assertion of the corresponding token variable. (Recall, vi is the localized variable in Mi corresponding to shared variable v.) As shown, one adds 4 * 4 * 2 = 32 such token-passing constraints for this example.

Improvement Scope: Though the above approach captures all and only the necessary interleavings, it also allows interleavings that may be redundant (i.e., equivalent). For example, the interleavings σ1 ≡ 1b · 2b · 1a · 3b · 4b · 2a · 3a · 4a and σ2 ≡ 1a · 2a · 1b · 2b · 3a · 3b · 4b · 4a are equivalent, as in these interleavings the conflicting pairs (2b, 3a), (1a, 4b), (4b, 4a) are in the same happens-before order, besides the thread program order pairs. (Note, "·" denotes concatenation.) The previous approach [17] will explore both interleavings. In the following sections, we build our approach on such a token-passing model to identify pair-wise constraints that can be safely removed, without affecting soundness and completeness, and guaranteeing optimality by removing all redundant interleavings. For the example in Figure 1, our approach removes 24 such pair-wise constraints (as shown in Figure 4), and yet covers all the necessary interleavings with no redundancy. To illustrate, our approach allows σ1, and not any other interleaving equivalent to σ1 such as σ2. Note, the choice of a representative interleaving will depend on a given thread prioritization, as discussed later.

2.2 Mutually Atomic Transactions

Our partial-order reduction approach is based on the concept of mutually atomic transactions, MAT for short. Intuitively, let a transaction be a sequence of statements in a thread; then we say two transactions tri and trj of threads Mi and Mj, respectively, are mutually atomic transactions if and only if there exists exactly one conflicting shared-access pair between them, and the statements containing the shared-access pair are the last ones in each of the transactions. (We present a more formal definition later.)

Now we illustrate the concept of MAT using an example, as shown in Figure 2. From the control state pair (1a, 1b), there are two reachable control state pairs with conflicting accesses, i.e., (3a, 2b) and (1a, 4b). Corresponding to these we have two MATs m = (tr1 = 1a · · · 3a, tr2 = 1b · · · 2b) (Figure 2(a)) and m′ = (tr1′ = 1a, tr2′ = 1b · · · 4b) (Figure 2(b)), respectively. Similarly, from (1a, 2b) we have m′′ = (tr1′′ = 1a, tr2′′ = 2b · · · 4b) (Figure 2(c)). In general, there could be multiple possible MATs for our example. In a more general setting with conditional branching, we identify MATs by exploring beyond conditional branches, as illustrated in Figure 2(d), with a conditional branch denoted as a diamond node and control states Ai, Bi, Ci denoted as dark ovals. Starting from (A1, A2), we have the following control path segments: tr11 = A1 · · · B1, tr12 = A1 · · · C1, tr21 = A2 · · · B2, and tr22 = A2 · · · C2 (shown as ovals). For each of the four combinations tr1i, tr2j, we define a MAT separately.


[Figure 2: thread M1 (control states 1a-4a) and thread M2 (control states 1b-4b) with their shared accesses as in Figure 1; panels (a)-(c) mark the transactions of the MATs m, m', m''; panel (d) shows MAT identification across a conditional branch, with control states A1, A2, B1, B2, C1, C2 and path segments tr11, tr12, tr21, tr22.]

Fig. 2. (a) m = (tr1, tr2), (b) m′ = (tr1′, tr2′), (c) m′′ = (tr1′′, tr2′′), (d) MATs for branches

Given a MAT (tri, trj), we can have only two equivalence classes of interleavings [25]: one represented by tri · trj, i.e., tri executing before trj, and the other by trj · tri, i.e., trj executing before tri. (Note, "·" represents concatenation.) For the MAT m = (tr1, tr2) shown in Figure 2(a), the interleavings σ1 ≡ 1a · 2a · 3a · 1b · 2b and σ2 ≡ 1b · 2b · 1a · 2a · 3a represent the two equivalence classes, respectively. In other words, given a MAT, the associated transactions can be considered atomic pair-wise, and one can avoid interleaving them in-between. In general, transactions associated with different MATs may not be atomic. For example, tr1 is not atomic with tr2′′ (Figure 2(a), (c)).

Intuitively, it would be desirable to have a set of MATs such that, by adding token-passing constraints only between MATs, we will not miss any necessary interleaving and will also remove all the redundant interleavings. In Section 5, we describe such an algorithm, GenMAT, to compute an optimal and adequate set of MATs. For our example, one such set is {(1a · · · 3a, 1b · · · 2b), (4a, 1b · · · 4b), (1a, 3b · · · 4b), (4a, 3b · · · 4b), (2a · · · 4a, 3b · · · 4b)}. Based on this set, we add only 8 token-passing constraints (Figure 4), compared to 32 (Figure 1(b)).

At this point we would like to highlight the salient features of our approach vis-a-vis previous works. A previous approach [9] on partial-order reduction used in an explicit model checking framework does not guarantee optimality. Though such a guarantee is provided in a recent symbolic approach (using synchronous modeling) [16], our approach goes further in reducing problem sizes, besides an optimal reduction in the state space. Our approach obtains state space reduction by removing constraints (i.e., adding fewer token-passing constraints), while the approach [16] obtains it by adding more constraints (i.e., constraining the scheduler). In our experiments, we observed that our approach is orders of magnitude more memory efficient compared to the approaches [16, 17]. Our approach is orthogonal to the approaches that exploit transaction-based reductions [19, 20, 11]. Nevertheless, we can exploit those to identify unreachable conflicting pairs, and further reduce the necessary token-passing constraints.

Contributions Highlights:
– We are the first to exploit partial order reduction techniques in SMT-based bounded model checking using the token-passing modeling approach. We developed a novel approach, based on MAT, to reduce verification conditions, both in size and state space, for concurrent systems.


– We prove that our MAT-based reduction is both adequate (preserves all the necessary interleavings) and optimal (no redundant interleaving, as determined statically), for a bounded depth analysis.
– Our approach outperforms other approaches [17, 16] by orders of magnitude, both in performance and in the size of the verification problems.

3 Formal Definitions

With the brief informal overview, we present our approach in a more formal setting. We consider a multi-threaded system CS comprising a finite number of deterministic bounded-stack threads communicating with shared variables, some of which are used as synchronization objects such as locks. Let Mi (1 ≤ i ≤ N) be a thread model represented by the control and data flow graph of the sequential program it executes. Let Ti represent the set of 4-tuple transitions (c, g, u, c′) of thread Mi, where c, c′ represent control states, g is a Boolean-valued enabling condition (or guard) on program variables, and u is an update function on program variables. Let T = ∪i Ti be the set of all transitions. Let Vi be the set of local variables in Ti and V the set of (global) shared variables. Let S be the set of global states of the system; a state s ∈ S is a valuation of all local and global variables of the system. A global transition system for CS is an interleaved composition of the individual thread models Mi. Each transition consists of the global firing of a local transition ti = (ai, gi, ui, bi) ∈ T. If the enabling predicate gi evaluates to true in s, we say that ti is enabled in s.

3.1 Notation

We define the notion of a run of a multi-threaded program as an observation of events such as global accesses, thread creations and thread terminations. If the events are ordered, we call it a total order run. We define a set Ai of shared accesses corresponding to a read Ri(x) and a write Wi(x) of a thread Mi, where x ∈ V. For ai ∈ Ai, we use var(ai) to denote the accessed shared variable. We use ⊢i to denote the beginning and ⊣i the termination of thread Mi, respectively. The alphabet of events of thread Mi is the set Σi = Ai ∪ {⊢i, ⊣i}. We use Σ = ∪i Σi to denote the set of all events. A word σ defined over the alphabet set Σ, i.e., σ ∈ Σ*, is a string of alphabets from Σ, with σ[i] denoting the ith access in σ, and σ[i, j] denoting the access substring from the ith to the jth position, i.e., σ[i] · · · σ[j] (· denotes concatenation). |σ| denotes the length of the word σ. We use π(σ) to denote a permutation of the alphabets in the word σ. We use σ|i to denote the projection of σ on thread Mi, i.e., the inclusion of the actions of Mi only.

Transaction: A transaction is a word tri ∈ Σi* that may be atomic (i.e., uninterrupted by other threads) with respect to some other transactions. If it is atomic with respect to all other thread transactions, we refer to it as an independent transaction.

Schedule: Informally, we define a schedule as a total order run of a multi-threaded program where the accesses of the threads are interleaved. Formally, a schedule is a word σ ∈ Σ* such that σ|i is a prefix of the word ⊢i · Ai* · ⊣i.

Happens-before Relation (≺, ⪯): Given a schedule σ, we say e happens-before e′, denoted as e ≺σ e′, if i < j where σ[i] = e and σ[j] = e′. We drop the subscript if it is obvious from the context.

75

it is obvious from the context. Also, if the relation is not strict, we use the notation . If e, e′ ∈ Σi and e precedes e′ in σ, we say that they are in a thread program order, denoted as e ≺po e′ . Sequentially consistent: A schedule σ is sequentially consistent [24] iff (a) σ |i is in thread program order, (b) each shared read access gets the last data written at the same address location in the total order, and (c) synchronization semantics is maintained, i.e., the same locks are not acquired in the run without a corresponding release in between. We only consider schedules (and their permutations) that are sequentially consistent. Conflicting Access: We define a pair ai ∈ Ai , aj ∈ Aj , i = j conflicting, if they are accesses on the same shared variable (i.e., var(ai ) = var(aj )) and one of them is write access. We use Cij to denote the set of tuples (ai , aj ) of such conflicting accesses. We use Shij to denote a set of shared variables—between Mi and Mj threads—with at least one conflicting access, i.e., Shij = {var(ai )|(ai , aj ) ∈ Cij }. We define Shi =  i=j Shij , i.e., a set of variables shared between Mi and Mk , k = i with at least one conflicting access. In general, Shij ⊆ (Shi ∩ Shj ). Dependency Relation (D): A relation D ⊆ Σ × Σ is a dependency relation iff for all (e, e′ ) ∈ D, one of the following holds: (1) e, e′ ∈ Σi and e ≺po e′ , (2) (e, e′ ) ∈ Cij , (3) e =⊣i , e′ =⊣j for i = j. Note, the last condition is required when the order of thread termination is important. If (e, e′ ) ∈ D, we say the events e, e′ are independent. The dependency relation in general, is hard to obtain; however, one can obtain such relation conservatively using static analysis [4], which may result in a larger dependency set than required. For our reduction analysis, we assume such a relation is provided and base our optimality and adequacy results on accuracy of such a relation. Equivalency Relation (≃): We say two schedules σ1 = w · e · e′ · v and σ2 = w · e′ · e · v are equivalent (Mazurkiewicz’s trace theory [25]), denoted as σ1 ≃ σ2 , if (e, e′ ) ∈ D. An equivalent class of schedules can be obtained by iteratively swapping the consecutive independent events in a given schedule. Final values of both local and shared variables remains unchanged when two equivalent schedules are executed. A partial order is a relation R ⊆ Σ × Σ on a set Σ, that is reflexive, antisymmetric, and transitive. A partial order is also a total order if, for all e, e′ ∈ Σ, either (e, e′ ) ∈ R, or (e′ , e) ∈ R. Partial order-based reduction (POR) methods [4] avoid exploring all possible interleavings of shared access events. Note, if (e, e′ ) ∈ D, all equivalent schedules agree on either e ≺ e′ or e′ ≺ e, but not both. Definition 1 (MAT). We say two transactions tri and trj of threads Mi and Mj , respectively, are mutually atomic iff except for the last pair, all other event pairs in the corresponding transactions are independent. Formally, a Mutually Atomic Transactions (MAT) is a pair of transactions, i.e., (tri , trj ), i = j iff ∀k 1 ≤ k ≤ |tri |, ∀h 1 ≤ h ≤ |trj |, (tri [k], trj [h]) ∈ D (k = |tri | and h = |trj |), and tri [|tri |], trj [|trj |]) ∈ D. Given a MAT (tri , trj ), an interesting observation (as noted earlier) is that a word w = tri · trj is equivalent to any word π(w) obtained by swapping any consecutive events tri [k] and trj [h] such that k = |tri | and h = |trj |. Similarly, the word w′ = trj · tri is equivalent to any word π(w′ ) obtained as above. Note, w ≃ w′ . 
Therefore, for a given MAT, there are only two equivalent classes, represented by w and w′ . In other words, given a MAT, the associated transactions are atomic pair-wise.

76

M.K. Ganai and S. Kundu

4 Token-Passing Model Using MAT We exploit the pair-wise atomicity of MATs in a token-based model as follows: Let c(e) represent the control state of the thread where the corresponding event e occurs. For the given MAT (tri = fi · · · li , trj = fj · · · lj ), we only add token-passing constraints from c(lj ) to c(fi ), and c(li ) to c(fj ), respectively. Recall, such constraints are added between the corresponding pre and post- access blocks as discussed in Section 2.1. 1 n · · · w1n · wN , wik ∈ Σi∗ , 1 ≤ Adequacy of MATs. Given a schedule σ = w11 · · · wN k ≤ n, 1 ≤ i ≤ N . We define a set of ordered pairs CSP as follows: CSP (σ) = ′ {(lik , fik′ )|1 ≤ i, i′ ≤ N, 1 ≤ k, k ′ ≤ n} where fik and lik denote the first and last ′ accesses of wik ; and wik′ is a non-empty word adjacent right of wik . Note, CSP (σ) captures the necessary interleaving pairs to obtain the schedule, i.e., if we add token passing constraints between every pair of control states (a, b) ∈ CSP (σ), we allow the schedule σ. For a given MAT α = (fi · · · li , fj · · · lj ), we define a set of interleaving ordered pairs, T P (α) = {(li , fj )), (lj , fi ))}. Given a set of MAT ij , we define T P (MAT ij ) = α∈MAT ij T P (α), and denote it as T Pij . We say a tokenpassing pairs set T P is adequate iff for every schedule σ in the multi-threaded system, CSP (σ) ⊆ T P . A set MAT is adequate iff T P is adequate. Note, the size of T P is upper bounded by quadratic number of pair-wise accesses.

We use procedure GenM AT (ref. Section 5) to obtain a set of MAT ij . If Shij  Shi ∪ Shj , we use procedure GenExtraT P (ref. Section 6) to generate an extra token-passing pairs set eT Pij from MAT ij . We then construct the adequate set T P as   ( i=j T Pij ) ∪ ( i=j eT Pij ). We give an overview of using MATs in a token-passing model to selectively add token-passing constraints as shown in Figure 3.

C

Unrolled Thread CFGs M1…Mn

1. GenMAT: Given ij for a thread pair (Mi,Mj) find a set ij , 2. TPij = {(fi,lj),(fj,li) | (fi li, fj lj)∈ ) ij 3. GenExtraTP( ij ): find set eTPij 4. TP = (∪i≠jTPij ) ∪ (∪i≠jeTPij )

MAT ⇒ ⇒ MAT MAT

For each thread pair (Mi,Mj) identify a set ij of pair thread locations with conflicting shared accesses

C C

Update ij ⇐ ij\c where c is a conflicting pair location that is simultaneously unreachable

5

NEW

C

C

3

C

i={xi | (xi,xk)∈ ik k≠i}, TPij = {(xi,xj)(xj,xi) | (xi,xj) ∈ i × TP = ∪i≠jTPij

OLD

C C} j

4

Token-passing Model

Independent (decoupled) thread model For each (a,b) ∈ TP, add token passing constraint 6

OLD

1

2

NEW

Step 1,2: Given a set of unrolled threads M1 · · · MN , we obtain a set of conflicting pair of control locations Cij for each thread pair Mi , Mj . Step 3: From the set Cij , we remove the pairs that are unreachable simultaneously due to i) happens-before relation such as before and after fork/join, ii) mutual exclusion, iii) lock acquisition pattern [11].

Add bound constraints on number of token exchanges 7 Generate verification conditions and give to SMT/SAT solver 8

Fig. 3. Reducing verification conditions in a token-passing model using MAT

Reduction of Verification Conditions for Concurrent System Using MAT

77

Step 4: (Corresponds to previous scheme [17], denoted as OLD). An ordered set of token-passing pairs TP is obtained by considering every pair of control states in Ci × Cj , where Ci and Cj consist of control states of thread Mi and Mj that have some conflicting access, respectively. Step 5: (Corresponds to our proposed scheme, denoted as NEW). For each thread pairs Mi and Mj , and corresponding set Cij , we identify a set MAT ij using GenM AT . We obtain the set T Pij = T P (MAT ij ). Given a set MAT ij , we identify a set  eT Pij using GenExtraT P . We construct T P = ( i=j T Pij ) ∪ ( i=j eT Pij ). Step 6: We now build token-passing model by first generating decoupled (unrolled) thread models. For each ordered pair (a, b) ∈ T P , we add token passing constraints between (a, b), denoting token may be passed from a to b. Step 7: Optionally, we add constraints CBil ≤ ctk ≤ CBiu to bound the number of times a token could be passed to a specific thread model Mi , with CBil and CBiu corresponding to user-provided lower and upper context-bounds, respectively. Step 8: We generate verification conditions (discussed in Section 2.1) comprising transition relation of each thread model, token-passing constraints, context-bounding constraints (optionally), and environmental assumptions and negated property constraints. These constraints are expressed in a quantifier-free formula and passed to a SMT/SAT solver for a satisfiability check.

5 Generating MATs Notation Shortcuts: Before we get into details, we make some notation abuse for ease of readability. When there is no ambiguity, we use ei to also indicate c(ei ), the control state of thread Mi where the access event ei belongs. Further, we use +ei to denote the event immediately after ei in program order, i.e., c(+ei ) = next(c(ei )). Similarly, we use −ei to denote event immediately preceding ei , i.e., c(ei ) = next(c(−ei )). We sometimes refer to tuple (a, b) as a pair. We provide a simple procedure, GenM AT (Algorithm 1) for generating MAT ij , given a pair of unrolled threads Mi and Mj and dependency relation D. For ease of explanation, we assume the threads are unrolled for some bounded depth, and there is no conditional branching. (For a further discussion with conditional branching, please refer [26]). We first initialize a queue Q with control state pair (⊢i , ⊢j ) representing the beginning of the threads, respectively. For any pair (fi , fj ) in the Q, representing the current control pair locations, we can obtain a MAT m′ = (tri′ , trj′ ) as follows: we start tri′ and trj′ from fi and fj respectively, and end in li′ and lj′ respectively, such that (li′ , lj′ ) ∈ D, and there is no other conflicting pair in-between. There may be many MATcandidates m′ . Let Mc denote a set of such choices. The algorithm selects m ∈ Mc uniquely by assigning thread priorities and using the following selection rule. If a thread Mj is given higher priority over Mi , the algorithm prefers m = (tri = fi · · · li , trj = fj · · · lj ) over m′ = (tri′ = fi · · · li′ , trj′ = fi · · · lj′ ) if lj ≺po lj′ . Note, the choice of Mj over Mi is arbitrary, but is required for the optimality result. We presented MAT selection (lines 7–9) in a declarative style for better understanding. However, algorithm finds the unique MAT using the selection rule, without constructing the set Mc . We show later that GenM AT can always find such a unique MAT with the chosen priority (lines 7—9).

78

M.K. Ganai and S. Kundu #I 1

p∈Q\Q’ (1a,1b)

2

(4a,1b)

3

(1a,3b)

4

(4a,3b)

5

(2a,3b)

MAT

12



(1a 3a,1b

⇒2b)

⇒ ⇒ (4a,3b⇒4b) (2a⇒4a,3b ⇒4b)

I

Q\Q’

Fork Constraints

(1a,1b)

W(y)

1a

1b

R(x)

(4a,1b)(1a,3b) (4a,3b)

R(x)

2a

2b

W(z)

R(z)

3a

3b

R(x)

W(y)

4a

4b

W(y)

(4a,1b 4b)

(1a,3b)(4a,3b)

(1a,3b 4b)

(4a,3b)(2a,3b) (2a,3b)

MAT

Token Passing pair-set (TP( 12))= {(2b,1a)(3a,1b)(4a,1b)(4b,4a),(1a,3b)(4b,1a) (4a,3b)(4b,2a)}

E

Join Constraints

#pair-wise constraints = 8

Fig. 4. Run of GenM AT on example in Figure 1(a)

We update MAT ij with m. If (li =⊣i ) and (lj =⊣j ), we update Q with three pairs, i.e., (+li , +lj ), (+li , fj ), (fi , +li ); otherwise, we insert selectively as shown in the algorithm (lines 11—15). Example: We present a run of GenM AT in Figure 4 for the example in Figure 1(a). We gave M2 higher priority over M1 . The table columns provide each iteration step (#I), the pair p ∈ Q\Q′ selected, the chosen MAT 12 , and the new pairs added in Q\Q′ (shown in bold). We add token-passing constraints (shown as directed edges) in the figure (on the right) between every ordered pair in the set T P (MAT 12 ). Total number of pair-wise constraints we add is 8, much less compared with all pair-wise constraints (in Figure 1). The fork/join constraints, shown as dotted edges, provide happens-before ordering between the accesses. In the first iteration of the run, out of the two MAT candidates m = (1a · · · 3a, 1b · · · 2b) and m′ = (1a, 1b · · · 4b) (also shown in Figure 2(a)-(b)) GenM AT selects m, as M2 is given higher priority over M1 and 2b ≺po 4b. In the following section, we show the adequacy and optimality of the pair-wise constraints so obtained. Theorem 1. The algorithm GenM AT terminates. Proof. For bounded depth, number of pair-wise accesses are bounded. As each control state pair is picked only once (line 6), the procedure terminates. ⊓ ⊔

6 MAT-Based Reduction: Optimality and Adequacy For ease of understanding, we first present optimality and adequacy results for a twothreaded system i.e., Mi and Mj with i, j ∈ {1, 2}. For two-threaded system, Shij = (Shi ∪ Shj ), and as noted earlier, eT Pij = ∅. We ignore it for now; we discuss the general case later as the proof arguments are similar. Theorem 2 (Two-threaded Optimality). For two-threaded system with bounded unrolling, the set T P = T P (MAT ij ) is optimal i.e., it does not allow two equivalent schedules.

Reduction of Verification Conditions for Concurrent System Using MAT

79

Algorithm 1. GenM AT : Obtain a set of MATs 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17:

input: Unrolled Thread Models: Mi , Mj ; Dependency Relation D output: MAT ij . MAT ij := ∅; Q := {(⊢i , ⊢j )}; Q′ := ∅ {Initialize Queue}; while Q = Q′ do Select (fi , fj ) ∈ Q\Q′ Q := Q\{(fi , fj )}; Q′ := Q′ ∪ {(fi , fj )} MAT-candidates set, Mc = {m′ | m′ = (tri′ = fi · · · li′ , trj′ = fj · · · lj′ ) is M AT }, Select a MAT m = (tri = fi · · · li , tri = fj · · · lj ) ∈ Mc such that ∀m′ ∈Mc ,m′ =m lj ≺po lj′ , (i.e., Mj has higher priority). MAT ij := MAT ij ∪ {m} if (li =⊣i ∧lj =⊣j ) then continue; elseif (li =⊣i ) then q := {(fi , +lj )}; elseif lj =⊣j ) then q := {(+li , fj )}; else q := {(+li , +lj ), (+li , fj ), (fi , +lj )}; Q := Q ∪ q; end while return MAT ij

Lemma 1. If (ai , aj ) ∈ T P (MAT ij ), then ∃m = (a′i · · · ai , aj · · · a′j ) ∈ MAT ij where ⊢i po a′i po ai and aj po a′j po ⊣j . Lemma 2. From a given pair (fi , fj ) ∈ Q, given possible MAT candidates m1 = (fi · · · ei , fj · · · ej ) or m2 = (fi · · · e′i , fj · · · e′j ), GenM AT selects only one of them, i.e., either m1 ∈ MAT ij or m2 ∈ MAT ij , but not both. Further, if the thread Mi is given higher priority than Mj , m1 is selected if (ei ≺po e′i ), otherwise m2 is selected. Optimality Proof. We show the optimality by arguing the contrapositive holds, i.e., if two schedules allowed by T P (MAT ij ) are equivalent, then they are same. We explain our proof steps using the Figure 5(a). Consider two equivalent schedules, i.e., σ1 ≃ σ2 . We assume that the necessary interleaving pairs for the two schedules be captured by the MAT set, i.e., CSP (σ1 ) ⊆ T P (MAT ij ), and CSP (σ2 ) ⊆ T P (MAT ij ). We show σ1 = σ2 by contradiction. Assume σ1 = σ2 , i.e., CSP (σ1 ) = CSP (σ2 ). Wlog, let σ1 = wi1 · wj1 · · · wik · wjk · · · win ·wjn and σ2 = vi1 ·vj1 · · · vik ·vjk · · · vin ·vjn , a sequence of words, wik , vik ∈ Σi∗ , wjk , vjk ∈ Σj∗ , 1 ≤ k ≤ n. (Note, if the words do not align, we pick the schedule with fewer words, say σ1 , and prefix it with empty words corresponding to each thread.) Starting from the end, let the difference first show up at the k th word, i.e., wjk = vjk , and ∀t k < t ≤ n, wit = vit , wjt = vjt . ′ Let wjk = fjk · · · ljk and vjk = fjk · · · ljk . Note, both words end with the same access ′ event because the interleaving pairs matches till that point. Wlog, we assume fjk ≺po ′ fjk . Similarly, we have wik = fik · · · lik , and vik = fik · · · lik . Note, lik immediately precedes (in program order) wik+1 , i.e., lik = −wik+1 [1] (Recall, w[1] denotes the first event in word w). If k = 1, we get a trivial contradiction as wi2 = vi2 implies wi1 = vi1 . Therefore, we only need to consider k > 1. Further, as wjk+1 = vjk+1 , we have |vjk | = 0 implies

80

M.K. Ganai and S. Kundu

fik

Mi

Mj fik lj(k-1)’ fjk’ f ljk-1 e fjk bik pjk

≤po e ≤po bik ≤po lk 1. Let δ(s) ∈ Act(s) be an arbitrary selected action. Then, for each a ∈ Act(s) and a = δ(s), we introduce a binary variable vs,a , denoted by enc(s, a), to encode the transition with respect to a from s. The transition with respect to δ(s) is encoded via enc(s, δ(s)) := 1 −  b∈Act(s),b=δ(s) vs,b . In the following, we fix δ as defined above and let Var δ denote the set of these variables, all of which have domain {0, 1}. Intuitively, vs,a = 1 indicates that the transition labelled with a is taken from s, whereas vs,a = 0 for all vs,a indicates that δ(s) is taken. Now we define the encoding of M with respect to Var δ . Definition 12. Let M = (S, s0 , Act, P, V ) be a PMDP. The encoding PMC ˙ with respect to Var δ is defined as enc(M) = (S, s0 , Pδ , V ∪Var δ ) where Pδ (s, s′ ) =



a∈Act

P(s, a, s′ ) · enc(s, a).

98

E.M. Hahn, H. Hermanns, and L. Zhang

To avoid confusion, we use v : Var δ → {0, 1} to denote a total evaluation function for Var δ . We say v is stationary, if for each s with |Act(s)| > 1, there exists at most one a ∈ Act(s) \ {δ(s)} with v(vs,a ) = 1. We let SE X denote the set of stationary evaluations v with domain Dom(v) = X, and let SE := SE Var δ . Observe that if v(vs,a ) = 0 for all a ∈ Act(s) \ {δ(s)}, the transition labelled with δ(s) is selected. We can apply Algorithm 1. on the encoding PMC to compute the parametric reachability probability. In the following we discuss how to transform the result achieved this way back to the maximum reachability probability for the original PMDPs. The following lemma states that each stationary scheduler corresponds to a stationary evaluation with respect to Var δ : Lemma 3. Let M = (S, s0 , Act, P, V ) be a PMDP. Then for each stationary scheduler A there is a stationary evaluation v ∈ SE such that MA = (enc(M))v . Moreover, for each stationary evaluation v ∈ SE there exists a stationary scheduler A such that (enc(M))v = MA . Due to the fact that stationary schedulers are sufficient for maximum reachability probabilities, the above lemma suggests that for a strictly well-defined evaluation u of M, it holds that max

A∈MD(M)

Pr Mu ,A (s0 , B) = max Pr (enc(Mu ))v (s0 , B). v∈SE

Together with Lemma 1, the following lemma discusses the computation of this maximum: Lemma 4. Let M = (S, s0 , Act, P, V ) be a PMDP and let f be the function obtained by applying Algorithm 1. on enc(M). Let Var f denote the set of variables occurring in f . Then for each strictly well-defined evaluation u of M, it holds that: max

A∈MD(M)

Pr Mu ,A (s0 , B) =

max

v∈SE Var ∩Var δ f

f [Var δ /v][V /u].

In worst case, we have SE Var δ ∩Var f = SE . The size |SE | = exponential in the number of states s with |Act(s) > 1|. 3.4



s∈S

|Act(s)| grows

Bisimulation Minimisation for Parametric Models

We discuss how to apply bisimulation strategy to reduce the state space before our main algorithm. For PMCs, both strong and weak bisimulation can be applied, while for PMRMs only strong bisimulation is used. The most interesting part is for PMDPs, for which we minimise the encoded PMC instead of the original one. The following lemma shows that strong (weak) bisimilar states in D are also strong (weak) bisimilar in Du for each maximal well-defined evaluation: Lemma 5. Let D = (S, s0 , P, V ) be a PMC with s1 , s2 ∈ S. Let B be a set of target states. Then, for all maximal well-defined evaluation u, s1 ∼D s2 implies that s1 ∼Du s2 , and s1 ≈D s2 implies that s1 ≈Du s2 .

Probabilistic Reachability for Parametric Markov Models vs0 ,a

s1

1 − vs0 ,a

s2

1

s0

1

s3 s2

1

99

s0

s1 1

1 1 − vs3 ,a

s3

vs3 ,a

s4

Fig. 2. Bisimulation for PMDPs

Both strong and weak bisimulation preserve the reachability probability for ordinary MCs [14,3]. By the above lemma, for PMCs, both strong and weak bisimulation preserve reachability probability for all maximal well-defined evaluations. A similar result holds for PMRMs: if two states s1 , s2 of R = (D, r) are strong bisimilar, i.e. s1 ∼R s2 , then for all maximal well-defined evaluations u, we have s1 ∼Ru s2 . As a consequence, strong bisimulation preserves expected accumulated rewards for all well-defined evaluations for MRMs. Now we discuss how to minimise PMDPs. Instead of computing the bisimulation quotient of the original PMDPs M, we apply the bisimulation minimisation algorithms on the encoded PMCs enc(M). Since both strong and weak bisimulation preserve reachability for PMCs, by Lemma 3 and Lemma 4, bisimulation minimisation on the encoded PMC enc(M) also preserves the maximum reachability probability on M with respect to strictly well-defined evaluations. Thus, we can apply the efficient strong and weak bisimulation algorithm for the encoding PMC directly. The following example illustrates the use of strong and weak simulations for PMDPs. Example 2. Consider the encoding PMC on the left of Figure 2. States s1 , s2 are obviously strong bisimilar. Moreover, in the quotient, we have that the probability of going to the equivalence class {s1 , s2 } from s0 is 1. Because of this, the variable vs,a disappears in the quotient. Now consider the right part. In this encoding PMC, states s1 , s2 , s3 are weak bisimilar. Remark: For the right part of Figure 2, we explain below why our results do not hold to obtain minimum reachability probabilities. Using the state elimination algorithm, we obtain that the probability of reaching s4 from s0 is 1, independently of the variable vs,a . However, the minimum reachability probability is actually 0 instead. Moreover, states s0 , s1 , s2 and s3 are bisimilar, thus in the quotient we have the probability 1 of reaching the target state directly. Thus the information about minimum reachability probability is also lost during the state elimination and the weak bisimulation lumping of the encoding PMC. 3.5

3.5  Complexity

Since our algorithm deals with rational functions, we first briefly discuss the complexity of arithmetic for polynomials and rational functions. For more details we refer to [10]. For a polynomial f, let mon(f) denote its number of monomials. Addition and subtraction of two polynomials f and g are performed by adding or subtracting the coefficients of like monomials, which takes time mon(f) + mon(g).


Multiplication is performed by cross-multiplying each pair of monomials, which takes O(mon(f) · mon(g)). Division of two polynomials results in a rational function, which is then simplified by cancelling the greatest common divisor (GCD); the GCD can be computed efficiently using a variant of Euclid's algorithm. Arithmetic for rational functions reduces to manipulation of polynomials, for example f1/f2 + g1/g2 = (f1·g2 + f2·g1)/(f2·g2). Checking whether two rational functions f1/f2 and g1/g2 are equal is equivalent to checking whether f1·g2 − f2·g1 is the zero polynomial. We now discuss the complexity of our algorithms. In each elimination step, we have to update the transition functions (or rewards for PMRMs), which takes O(n²) polynomial operations in the worst case. Thus, altogether O(n³) operations are needed to obtain the final function, which matches the state elimination algorithm [6]. The complexity of the polynomial arithmetic depends on the degrees involved; the size of the final rational function is in the worst case n^O(log n). For PMDPs, we first encode the non-deterministic choices via new binary variables. Then, the encoded PMC is submitted to the dedicated algorithm for parametric MCs. The final function can thus contain both variables from the input model and variables encoding the non-determinism. As shown in Lemma 4, the evaluation is of exponential size in the number of variables encoding the non-determinism that occur in the final rational function. We also briefly discuss the complexity of the bisimulation minimisation algorithms. For ordinary MCs, strong bisimulation can be computed [9] in O(m log n), where n and m denote the number of states and transitions respectively. The complexity of deciding weak bisimulation [3] is O(mn). These algorithms extend directly to PMCs, with the support of operations on rational functions. The complexity is then O(m log n) and O(mn) operations on rational functions for strong and weak bisimulation respectively.
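The following minimal sketch (ours, using sympy rather than the CoCoALib library used by Param) performs exactly the operations described above: addition via the cross-multiplication formula, simplification by GCD cancellation, and the zero-polynomial equality test.

    # Illustrative only: rational-function arithmetic as described above,
    # with sympy in place of CoCoALib.
    import sympy

    x, y = sympy.symbols('x y')

    # f1/f2 + g1/g2 = (f1*g2 + f2*g1) / (f2*g2), then cancel the GCD.
    def add_rat(f1, f2, g1, g2):
        num = sympy.expand(f1 * g2 + f2 * g1)
        den = sympy.expand(f2 * g2)
        return sympy.cancel(num / den)  # GCD cancellation (Euclid-style)

    # f1/f2 == g1/g2 iff f1*g2 - f2*g1 is the zero polynomial.
    def eq_rat(f1, f2, g1, g2):
        return sympy.expand(f1 * g2 - f2 * g1) == 0

    print(add_rat(x, 1 - x, x, 1 + x))            # x/(1-x) + x/(1+x)
    print(eq_rat(x**2 - y**2, x - y, x + y, 1))   # -> True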

4  Case Studies

We have built the tool Param, which implements our new algorithms, including both the state elimination algorithm and the bisimulation minimisation algorithms. Param accepts a guarded-commands based input language supporting MCs, MRMs and MDPs. The language extends that of Prism [15] with unknown parameters. Properties are specified by PCTL formulae without nesting. Sparse matrices are constructed from the model, and the set of target states B is extracted from the formula. Then, bisimulation minimisation can be applied to reduce the state space. For MCs, both strong and weak bisimulation apply; for MRMs, currently only strong bisimulation is supported. For PMDPs, bisimulation is run on the encoded PMC. We use the computer algebra library CoCoALib [1] for the arithmetic of rational functions, that is, the basic arithmetic operations, comparisons and simplification. We consider a selection of case studies to illustrate the practicality of our approach. All models are extended from the corresponding Prism models. All experiments were run on a Linux machine with an AMD Athlon(tm) XP 2600+ processor at 2 GHz equipped with 2 GB of RAM.

Fig. 3. Left: Crowds Protocol. Right: Zeroconf.

Crowds Protocol. The intention of the Crowds protocol [22] is to protect the anonymity of Internet users. The protocol hides each user's communications via random routing. Assume that we have N honest Crowd members and M dishonest members, and that there are R different path reformulations. The model is a PMC with two parameters: (i) B = M/(M+N) is the probability that a Crowd member is untrustworthy, and (ii) P is the probability that a member forwards the package to a randomly selected receiver; with probability 1 − P it delivers the message to the receiver directly. We consider the probability that the actual sender was observed more often than any other member by the untrustworthy members. For various N and R values, the following table summarises the time needed for computing the function representing this probability, with and without the weak bisimulation optimisation. In the last column we evaluate the probability for M = N/5 (thus B = 1/6) and P = 0.8. An interesting observation is that the weak bisimulation quotient has the same size for the same R but different probabilities. The reason is that the other parameter N only affects the transition probabilities of the quotient and not its underlying graph.

  N  R |       no bisimulation             |       weak bisimulation         | Result
       | States   Trans.  Time(s) Mem(MB) | States  Trans. Time(s) Mem(MB)  |
  5  3 |   1192     2031        6       6 |     33      62       3       6  | 0.3129
  5  5 |   8617    14916       73      22 |    127     257      22      21  | 0.3840
  5  7 |  37169    64888     1784      84 |    353     732     234      84  | 0.4627
 10  3 |   6552    15131       80      18 |     33      62      16      17  | 0.2540
 10  5 | 111098   261247     1869     245 |    127     257     504     245  | 0.3159
 15  3 |  19192    55911      508      47 |     33      62      51      47  | 0.2352

In Figure 3 we give the plot for N = 5, R = 7. Observe that this probability increases with the number of dishonest members M, which is due to the fact that the dishonest members share their local information. On the contrary, the probability decreases with P: each router forwards the message randomly with probability P, so with increasing P the probability that the untrustworthy members can identify the real sender decreases.


[Figure: sketch of the Zeroconf model, a chain of states s0, s1, …, sn with success state sok and error state serr; transitions are labelled with probability/reward pairs such as q/1, p/1 and 1 − p/1.]

Zeroconf. Zeroconf allows the installation and operation of a network in a very simple way. When a new host joins the network, it randomly selects an address among the K = 65024 possible ones. With m hosts in the network, the collision probability is q = m/K. The host asks the other hosts whether they are using this address. If a collision occurs, the host tries to detect this by waiting for an answer. The probability that the host gets no answer in case of a collision is p, in which case it repeats the question. If after n tries the host has got no answer, it will erroneously consider the chosen address as valid. A sketch of the model is depicted in the figure above. We consider the expected number of tries until either the IP address is selected correctly or erroneously, that is, B = {sok, serr}. For n = 140, the plot of this function is depicted on the right part of Figure 3. The expected number of tests until termination increases both with the collision probability and with the probability that a collision is not detected. Bisimulation optimisation was of no use here, as the quotient equals the original model. For n = 140, the analysis took 64 seconds and 50 MB of memory.

Cyclic Polling Server. The cyclic polling server [17] consists of N stations which are handled by a polling server. Station i is allowed to send a job to the server if it owns the token, which circulates around the stations in a round-robin manner. This model is a parametric continuous-time Markov chain, but we can apply our algorithm to the embedded discrete-time PMC, which has the same reachability probability. We have two parameters: the service rate µ and the rate γ with which the token moves; both are assumed to be exponentially distributed. Each station generates a new request with rate λ = µ/N. Initially the token is at station 1. We consider the probability p that station 1 will be served before any other one. The following table summarises performance for different N. The last column corresponds to the evaluation µ = 1, γ = 200.

 N |       no bisimulation             |       weak bisimulation           | Result
   | States  Trans.  Time(s) Mem(MB)  | States  Trans.  Time(s) Mem(MB)   |
 4 |     89     216        1       3  |     22      55        1       3   |  0.25
 5 |    225     624        3       3  |     32      86        1       3   |  0.20
 6 |    545    1696       10       4  |     44     124        3       4   |  0.17
 7 |   1281    4416       32       5  |     58     169        7       5   |  0.14
 8 |   2945   11136      180       7  |     74     221       19       8   |  0.12

On the left of Figure 4 a plot for N = 8 is given. We make several interesting observations. If µ is greater than approximately 1.5, p first decreases and then increases with γ. The mean time the token stays at station 1 is 1/γ.

Fig. 4. Left: Cyclic Polling Server. Right: Randomised Mutual Exclusion.

With increasing γ, it becomes more probable that the token passes to the next station before station 1 sends its request. At some point however (at approximately γ = 6), p increases again as the token moves faster around the stations. For small µ the probability p is always increasing. The reason is that the arrival rate λ = µ/N is then very small, which means that the token moves relatively fast. Now we fix γ to be greater than 6. Then p decreases with µ, as increasing µ also implies a larger λ, which means that all other stations become more competitive. However, for small γ we observe that p later increases again with µ: in this case station 1 has a higher probability of catching the token, which is initially at this station.

Randomised Mutual Exclusion. In the randomised mutual exclusion protocol [21], several processes try to enter a critical section. We consider the protocol with two processes i = 1, 2. Process i tries to enter the critical section with probability pi, and with probability 1 − pi it waits until the next possibility to enter and tries again. The model is a PMRM with parameters p1, p2. A reward of 1 is assigned to each transition corresponding to the probabilistic branching between pi and 1 − pi. We consider the expected number of coin tosses until one of the processes enters its critical section for the first time. A plot of this expected number is given on the right part of Figure 4. The number decreases with both p1 and p2, because both processes then have a better chance of entering their critical sections. The computation took 98 seconds, and 5 MB of memory was used. The model consists of 77 states and 201 non-zero transitions; the quotient has 71 states and 155 non-zero transitions.

Bounded Retransmission Protocol. In the bounded retransmission protocol, a file to be sent is divided into N chunks. For each of them, the number of retransmissions allowed is bounded by MAX. There are two lossy channels K and L for sending data and acknowledgements respectively. The model is a PMDP with two parameters pK, pL denoting the reliability of channels K and L respectively. We consider the property "the maximum reachability probability that eventually the sender does not report a successful transmission". In the following table we give statistics for several instantiations of N and MAX. The column "Nd.Vars" gives the number of variables additionally introduced to encode the non-deterministic choices. We only give running times for the case where the optimisation is used; otherwise, the algorithm does not terminate within one hour. The last column gives the probability for pK = 0.98 and pL = 0.99, as in the Prism model.


We observe that for all instances of N and MAX, the probability that the sender does not finally report a successful transmission decreases as the reliability of channel K increases.

   N  MAX |         model             |       weak bisimulation           | Result
          | States   Trans.  Nd.Vars  | States  Trans.  Time(s) Mem(MB)   |
  64   4  |   8551    11569      137  |    643    1282       23      16   | 1.50E-06
  64   5  |  10253    13922      138  |    771    1538       28      19   | 4.48E-08
 256   4  |  33511    45361      521  |   2563    5122      229      63   | 6.02E-06
 256   5  |  40205    54626      522  |   3075    6146      371      69   | 1.79E-07

Notably, we encode the non-deterministic choices via additional variables, and apply the algorithm for parametric MCs to the result. This approach may suffer from an exponential enumeration in the number of these additional variables occurring in the final rational function. In this case study, however, the method works quite well. This is partly owed to the fact that, after strong and weak bisimulation on the encoded PMC, the additional variables vanish, as illustrated in Example 2. We are well aware, however, that much work remains to be done to handle general non-deterministic models.

5  Comparison with Daws' Method

Our algorithm is based on the state elimination approach, inspired by Daws [8], who treats the concrete probabilities as an alphabet and converts the MC into a finite automaton. A regular expression is then computed and evaluated into functions afterwards (albeit lacking any implementation). The length of the resulting regular expression, however, has size n^Θ(log n) [11], where n denotes the number of states of the automaton. Our method instead intertwines the steps of state elimination and evaluation. The size of the resulting function is in the worst case still in n^O(log n), so there is no theoretical gain, pessimistically speaking. The differences between our method and Daws' are thus on the practical side, where they indeed have dramatic implications. Our method simplifies the rational functions in each intermediate step. The worst case for our algorithm can occur only when no rational function can be simplified during the entire process. In essence, this is the case for models where each edge of the input model carries a distinguished parameter. We consider this a pathological construction. In all of the interesting models we have seen, only very few parameters appear in the input model, and it seems natural that a model designer does not deal with more than a handful of model parameters at once. For such models, the intermediate rational functions can be simplified, leading to a space (and time) advantage. This is the reason why our method does not suffer from a blow-up in the case studies considered in Section 4. To shed light on the differences between the two methods, we return to the cyclic polling server example:


 Number of workstations                        |    4    5     6     7      8
 Length of regular expression (Daws' method)   |  191  645  2294  8463  32011
 Number of terms (our method)                  |    7    9    11    13     15
 Total degree (our method)                     |    6    8    10    12     14

In the table above, we compare the two methods in terms of the observed size requirements. For each number of workstations from 4 to 8, we give the length of the regular expression arising in Daws' method. In addition, we give the number of terms and the total degree of the numerator and denominator polynomials of the rational function resulting from our method. The numbers for Daws' method are obtained by eliminating the states in the same order as for our method, namely by removing states with a smaller distance to the target set first. For the length of regular expressions, we count each occurrence of a probability as having length 1, as well as each occurrence of the choice operator ("+") and the Kleene star ("*"); braces and concatenation ("·") count as length zero. As can be seen from the table, the size of the regular expression grows very quickly, thus materializing the theoretical complexity. This makes the nice idea of [8] infeasible in a direct implementation. For our method, both the number of terms and the total degree grow only linearly with the number of workstations.
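The intertwining of elimination and simplification can be sketched in a few lines. The following is our illustration (not the Param implementation, and with sympy standing in for CoCoALib): eliminating a state s replaces every path u → s → v by a direct edge with probability P(u,s)·P(s,v)/(1 − P(s,s)), and sympy.cancel keeps every intermediate rational function in lowest terms.

    # Sketch of parametric state elimination (Section 5), sympy in place of
    # CoCoALib. A parametric MC is a dict {(u, v): probability-expression}.
    import sympy

    def eliminate(trans, s):
        """Remove state s, redirecting mass through it to its successors."""
        loop = trans.pop((s, s), 0)
        ins  = {u: p for (u, v), p in trans.items() if v == s}
        outs = {v: p for (u, v), p in trans.items() if u == s}
        for u in ins:
            del trans[(u, s)]
        for v in outs:
            del trans[(s, v)]
        for u, pu in ins.items():
            for v, pv in outs.items():
                # Shortcut u -> s -> v, accounting for the self-loop at s,
                # and simplify immediately to keep intermediate terms small.
                add = pu * pv / (1 - loop)
                trans[(u, v)] = sympy.cancel(trans.get((u, v), 0) + add)
        return trans

    # Tiny PMC with parameter x: 0 -x-> 1, 0 -(1-x)-> 2, 1 -x-> 0, 1 -(1-x)-> 2.
    x = sympy.Symbol('x')
    t = {(0, 1): x, (0, 2): 1 - x, (1, 0): x, (1, 2): 1 - x}
    eliminate(t, 1)
    prob = sympy.cancel(t[(0, 2)] / (1 - t.get((0, 0), 0)))
    print(prob)  # -> 1: state 2 is reached almost surely for any valid x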

6  Conclusion

We have presented algorithms for analysing parametric Markov models, possibly extended with rewards or non-determinism. As future work, we are investigating general improvements of the implementation with respect to memory usage and speed, especially for the setting with non-determinism. We also plan to look into continuous-time models – with clocks – and PMDPs with rewards. Other possible directions include the use of symbolic model representations, such as MTBDD-based techniques, symbolic bisimulation minimisation [25], and a symbolic variant of the state elimination algorithm. We would also like to explore whether our algorithm can be used for model checking interval Markov chains [23].

Acknowledgements. We are grateful to Björn Wachter (Saarland University) for insightful discussions and for providing us with the parser of PASS.

References

1. Abbott, J.: The design of CoCoALib. In: Iglesias, A., Takayama, N. (eds.) ICMS 2006. LNCS, vol. 4151, pp. 205–215. Springer, Heidelberg (2006)
2. Baier, C., Ciesinski, F., Größer, M.: ProbMela and verification of Markov decision processes. SIGMETRICS Performance Evaluation Review 32(4), 22–27 (2005)


3. Baier, C., Hermanns, H.: Weak bisimulation for fully probabilistic processes. In: Grumberg, O. (ed.) CAV 1997. LNCS, vol. 1254, pp. 119–130. Springer, Heidelberg (1997)
4. Baier, C., Katoen, J.-P., Hermanns, H., Wolf, V.: Comparative branching-time semantics for Markov chains. Inf. Comput. 200(2), 149–214 (2005)
5. Bianco, A., de Alfaro, L.: Model checking of probabilistic and nondeterministic systems. In: FSTTCS, vol. 15 (1995)
6. Brzozowski, J.A., McCluskey, E.: Signal flow graph techniques for sequential circuit state diagrams. IEEE Trans. on Electronic Computers EC-12, 67–76 (1963)
7. Damman, B., Han, T., Katoen, J.-P.: Regular expressions for PCTL counterexamples. In: QEST (2008) (to appear)
8. Daws, C.: Symbolic and parametric model checking of discrete-time Markov chains. In: Liu, Z., Araki, K. (eds.) ICTAC 2004. LNCS, vol. 3407, pp. 280–294. Springer, Heidelberg (2005)
9. Derisavi, S., Hermanns, H., Sanders, W.: Optimal state-space lumping in Markov chains. Inf. Process. Lett. 87(6), 309–315 (2003)
10. Geddes, K.O., Czapor, S.R., Labahn, G.: Algorithms for Computer Algebra. Kluwer Academic Publishers, Dordrecht (1992)
11. Gruber, H., Johannsen, J.: Optimal lower bounds on regular expression size using communication complexity. In: Amadio, R. (ed.) FOSSACS 2008. LNCS, vol. 4962, pp. 273–286. Springer, Heidelberg (2008)
12. Hahn, E.M., Hermanns, H., Zhang, L.: Probabilistic reachability for parametric Markov models. Reports of SFB/TR 14 AVACS 50, SFB/TR 14 AVACS (2009)
13. Han, T., Katoen, J.-P., Mereacre, A.: Approximate parameter synthesis for probabilistic time-bounded reachability. In: RTSS, pp. 173–182 (2008)
14. Hansson, H., Jonsson, B.: A logic for reasoning about time and reliability. FAC 6(5), 512–535 (1994)
15. Hinton, A., Kwiatkowska, M.Z., Norman, G., Parker, D.: PRISM: A tool for automatic verification of probabilistic systems. In: Hermanns, H., Palsberg, J. (eds.) TACAS 2006. LNCS, vol. 3920, pp. 441–444. Springer, Heidelberg (2006)
16. Hopcroft, J.E., Motwani, R., Ullman, J.D.: Introduction to Automata Theory, Languages, and Computation, 2nd edn. SIGACT News 32(1), 60–65 (2001)
17. Ibe, O., Trivedi, K.: Stochastic Petri net models of polling systems. IEEE Journal on Selected Areas in Communications 8(9), 1649–1657 (1990)
18. Jonsson, B., Larsen, K.G.: Specification and refinement of probabilistic processes. In: LICS, pp. 266–277. IEEE Computer Society Press, Los Alamitos (1991)
19. Kwiatkowska, M.Z., Norman, G., Parker, D.: Stochastic model checking. In: Bernardo, M., Hillston, J. (eds.) SFM 2007. LNCS, vol. 4486, pp. 220–270. Springer, Heidelberg (2007)
20. Lanotte, R., Maggiolo-Schettini, A., Troina, A.: Parametric probabilistic transition systems for system design and analysis. FAC 19(1), 93–109 (2007)
21. Pnueli, A., Zuck, L.: Verification of multiprocess probabilistic protocols. Distrib. Comput. 1(1), 53–72 (1986)
22. Reiter, M.K., Rubin, A.D.: Crowds: anonymity for Web transactions. ACM Trans. Inf. Syst. Secur. 1(1), 66–92 (1998)
23. Sen, K., Viswanathan, M., Agha, G.: Model-checking Markov chains in the presence of uncertainties. In: Hermanns, H., Palsberg, J. (eds.) TACAS 2006. LNCS, vol. 3920, pp. 394–410. Springer, Heidelberg (2006)
24. Stewart, W.J.: Introduction to the Numerical Solution of Markov Chains. Princeton University Press, Princeton (1994)
25. Wimmer, R., Derisavi, S., Hermanns, H.: Symbolic partition refinement with dynamic balancing of time and space. In: QEST, pp. 65–74 (2008)

Extrapolation-Based Path Invariants for Abstraction Refinement of Fifo Systems

Alexander Heußner¹, Tristan Le Gall², and Grégoire Sutre¹

¹ LaBRI, Université Bordeaux, CNRS · {heussner, sutre}@labri.fr
² Université Libre de Bruxelles (ULB) · [email protected]

Abstract. The technique of counterexample-guided abstraction refinement (Cegar) has been successfully applied in the areas of software and hardware verification. Automatic abstraction refinement is also desirable for the safety verification of complex infinite-state models. This paper investigates Cegar in the context of formal models of network protocols, in our case, the verification of fifo systems. Our main contribution is the introduction of extrapolation-based path invariants for abstraction refinement. We develop a range of algorithms that are based on this novel theoretical notion, and which are parametrized by different extrapolation operators. These are utilized as subroutines in the refinement step of our Cegar semi-algorithm that is based on recognizable partition abstractions. We give sufficient conditions for the termination of Cegar by constraining the extrapolation operator. Our empirical evaluation confirms the benefit of extrapolation-based path invariants.

1  Introduction

Distributed processes that communicate over a network of reliable and unbounded fifo channels are an important model for the automatic verification of client-server architectures and network protocols. We focus on communicating fifo systems that consist of a set of finite automata modelling the processes, and a set of reliable, unbounded fifo queues modelling the communication channels. This class of infinite-state systems is, unfortunately, Turing-complete even in the case of one fifo queue [BZ83]. In general, two approaches for the automatic verification of Turing-complete infinite-state models have been considered in the literature: (a) exact semi-algorithms that compute forward or backward reachability sets (e.g., [BG99, BH99, FIS03] for fifo systems) but may not terminate, and (b) algorithms that always terminate but only compute an over-approximation of these reachability sets (e.g., [LGJJ06, YBCI08] for fifo systems).

CEGAR. In the last decade, counterexample-guided abstraction refinement [CGJ+03] has emerged as a powerful technique that bridges the gap between these two approaches. Cegar plays a prominent role in the automatic, iterative approximation and refinement of abstractions and has been applied successfully in the areas of software [BR01, HJMS02] and hardware verification [CGJ+03].


Briefly, the Cegar approach to the verification of a safety property utilizes an abstract–check–refine loop that searches for a counterexample in a conservative over-approximation of the original model, and, in the case of finding a false negative, refines the over-approximation to eliminate this spurious counterexample. Our Contribution. We present a Cegar semi-algorithm for safety verification of fifo systems based on finite partition abstractions where equivalence classes are recognizable languages of queue contents, or, equivalently, Qdds [BG99]. The crucial part in Cegar-based verification is refinement, which must find a new partition that is both (1) precise enough to rule out the spurious counterexample and (2) computationally “simple”. In most techniques, refinement is based on the generation of path invariants; these are invariants along the spurious counterexample that prove its unfeasibility (in our case, given by a series of recognizable languages). We follow this approach, and present several generic algorithms to obtain path invariants based on parametrized extrapolation operators for queue contents. Our path invariant generation procedures are fully generic with respect to the extrapolation. Refining the partition consists in splitting abstract states that occur on the counterexample with the generated path invariant. We formally present the resulting Cegar semi-algorithm and give partial termination results that, in contrast to the classical Cegar literature, do not rely on an “a priori finiteness condition” on the set of all possible abstractions. Actually, our results depend mainly on our generic extrapolation-based path invariant generation. In particular we show that our semi-algorithm always terminates if (at least) one of these two conditions is satisfied: (1) the fifo system under verification is unsafe, or (2) it has a finite reachability set and the parametrized extrapolation has a finite image for each value of the parameter. We have implemented our approach in the tool Mcscm [Mcs] that performs Cegar-based safety verification of fifo systems. Experimental results on a suite of (small to medium size) network protocols allow for a first discussion of our approach’s advantages. Related Work. Exact semi-algorithms for reachability set computations of fifo systems usually apply acceleration techniques [BG99, BH99, FIS03] that, intuitively, compute the effect of iterating a given “control flow” loop. The tools Lash [Las] (for counter/fifo systems) and Trex [Tre] (for lossy fifo systems) implement these techniques. However, recognizable languages equipped with Presburger formulas (Cqdds [BH99]) are required to represent (and compute) the effect of counting loops [BG99, FIS03]. Moreover such tools may only terminate when the fifo system can be flattened into an equivalent system without nested loops. Our experiments show that our approach can cope with both counting loops and nested loops that cannot be flattened. The closest approach to ours is abstract regular model checking [BHV04], an extension of the generic regular model-checking framework based on the abstract–check–refine paradigm. As in classical regular model-checking, a system is modeled as follows: configurations are words over a finite alphabet and the transition relation is given by a finite-state transducer. The analysis consists in an over-approximated forward exploration (by Kleene iteration), followed, in


case of a non-empty intersection with the bad states, by an exact backward computation along the reached sets. Two parametrized automata abstraction schemes are provided in [BHV04], both based on state merging. These schemes fit in our definition of extrapolation, and therefore can also be used in our framework. Notice that in Armc, abstraction is performed on the data structures that are used to represent sets of configurations, whereas in our case the system itself is abstracted. After each refinement step, Armc restarts (from scratch) the approximated forward exploration from the refined reached set, whereas our refinement is local to the spurious counterexample path. Moreover, the precision of the abstraction is global in Armc, and may only increase (for the entire system) at each refinement step. In contrast, our path invariant generation procedures only use the precision required for each spurious counterexample. Preliminary benchmarks demonstrate the benefit of our local and adaptive approach for the larger examples, where a “highly” precise abstraction is required only for a few control loops. Last, our approach is not tied to words and automata. In this work we only focus on fifo systems, but our framework is fully generic and could be applied to other infinite-state systems (e.g., hybrid systems), provided that suitable parametrized extrapolations are designed (e.g., on polyhedra). Outline. We recapitulate fifo systems in Section 2 and define their partition abstractions in Section 3. Refinement and extrapolation-based generation of path invariants are developed in Section 4. In Section 5, we present the general Cegar semi-algorithm, and analyze its correctness and termination. Section 6 provides an overview of the extrapolation used in our implementation. Experimental results are presented in Section 7, along with some perspectives. Due to space limitations, all proofs were omitted in this paper. A long version with detailed proofs and additional material can be obtained from the authors.

2  Fifo Systems

This section presents basic definitions and notations for fifo systems that will be used throughout the paper. For any set S we write ℘(S) for the set of all subsets of S, and S^n for the set of n-tuples over S (when n ≥ 1). For any i ∈ {1, …, n}, we denote by s(i) the i-th component of an n-tuple s. Given s ∈ S^n, i ∈ {1, …, n} and u ∈ S, we write s[i ← u] for the n-tuple s′ ∈ S^n defined by s′(i) = u and s′(j) = s(j) for all j ∈ {1, …, n} with j ≠ i. Let Σ denote an alphabet (i.e., a non-empty set of letters). We write Σ∗ for the set of all finite words (words for short) over Σ, and we let ε denote the empty word. For any two words w, w′ ∈ Σ∗, we write w · w′ for their concatenation. A language is any subset of Σ∗. For any language L, we denote by L∗ its Kleene closure and we write L+ = L · L∗. The alphabet of L, written alph(L), is the least subset A of Σ such that L ⊆ A∗. For any word w ∈ Σ∗, the singleton language {w} will be written simply as the word w when no confusion is possible.

Safety Verification of Labeled Transition Systems. We will use labeled transition systems to formally define the behavioral semantics of fifo systems. A labeled

Fig. 1. The Connection/Disconnection Protocol [JR86]

transition system is any triple LTS = ⟨C, Σ, →⟩ where C is a set of configurations, Σ is a finite set of actions and → ⊆ C × Σ × C is a (labeled) transition relation. We say that LTS is finite when C is finite. For simplicity, we will often write c −l→ c′ in place of (c, l, c′) ∈ →. A finite path (path for short) in LTS is any pair π = (c, u) where c ∈ C, and u is either the empty sequence, or a non-empty finite sequence of transitions (c0, l0, c′0), …, (ch−1, lh−1, c′h−1) such that c0 = c and c′i−1 = ci for every 0 < i < h. We simply write π as c0 −l0→ · · · −lh−1→ ch. The natural number h is called the length of π. We say that π is a simple path if ci ≠ cj for all 0 ≤ i < j ≤ h. For any two sets Init ⊆ C and Bad ⊆ C of configurations, a path from Init to Bad is any path c0 −l0→ · · · −lh−1→ ch such that c0 ∈ Init and ch ∈ Bad. Observe that if c ∈ Init ∩ Bad then c is a path (of zero length) from Init to Bad. The reachability set of LTS from Init is the set of configurations c such that there is a path from Init to {c}. In this paper, we focus on the verification of safety properties of fifo systems. A safety property is in general specified as a set of "bad" configurations that should not be reachable from the initial configurations. Formally, a safety condition for a labeled transition system LTS = ⟨C, Σ, →⟩ is a pair (Init, Bad) of subsets of C. We say that LTS is (Init, Bad)-unsafe if there is a path from Init to Bad in LTS; such a path is called a counterexample. We say that LTS is (Init, Bad)-safe when it is not (Init, Bad)-unsafe.

Fifo Systems. The asynchronous communication of distributed systems is usually modeled as a set of local processes together with a network topology given by channels between processes. Each process can be modeled by a finite-state machine that sends and receives messages on the channels to which it is connected. Let us consider a classical example, which will be used in the remainder of this paper to illustrate our approach.

Example 2.1. The connection/disconnection protocol [JR86] – abbreviated as c/d protocol – between two hosts is depicted in Figure 1. This model is composed of two processes, a client and a server, as well as two unidirectional channels. To simplify the presentation, we restrict our attention to the case of one finite-state control process. The general case of multiple processes can be reduced to this simpler form by taking the asynchronous product of all processes. For the connection/disconnection protocol, the asynchronous product of the two processes is depicted in Figure 2.

Fig. 2. Fifo System Representing the Connection/Disconnection Protocol

Definition 2.2. A fifo system A is a 4-tuple ⟨Q, M, n, ∆⟩ where:
– Q is a finite set of control states,
– M is a finite alphabet of messages,
– n ≥ 1 is the number of fifo queues,
– ∆ ⊆ Q × Σ × Q is a set of transition rules, where Σ = {1, …, n} × {!, ?} × M is the set of fifo actions over n queues.

Simplifying notation, fifo actions in Σ will be written i!m and i?m instead of (i, !, m) and (i, ?, m). The intended meaning of fifo actions is the following: i!m means "emission of message m on queue i" and i?m means "reception of message m from queue i". The operational semantics of a fifo system A is formally given by its associated labeled transition system ⟦A⟧ defined below.

Definition 2.3. The operational semantics of a fifo system A = ⟨Q, M, n, ∆⟩ is the labeled transition system ⟦A⟧ = ⟨C, Σ, →⟩ defined as follows:
– C = Q × (M∗)^n is the set of configurations,
– Σ = {1, …, n} × {!, ?} × M is the set of actions,
– the transition relation → ⊆ C × Σ × C is the set of triples ((q, w), l, (q′, w′)) such that (q, l, q′) ∈ ∆ and that satisfy the two following conditions:
  • if l = i!m then w′(i) = w(i) · m and w′(j) = w(j) for all j ≠ i,
  • if l = i?m then w(i) = m · w′(i) and w′(j) = w(j) for all j ≠ i.

Example 2.4. The fifo system A = ⟨{00, 01, 10, 11}, {o, c, d}, 2, ∆⟩ that corresponds to the c/d protocol is displayed in Figure 2. The set of initial configurations is Init = {(00, ε, ε)}. A set of bad configurations for this protocol is Bad = {00, 10} × (c · M∗ × M∗). This set contains configurations where the server is in control state 0 but the first message in the first queue is close. This is the classical case of an undefined reception, which results in a (local) deadlock for the server. Setting the initial configuration to c0 = (00, ε, ε), a counterexample to the safety condition ({c0}, Bad) is the path (00, ε, ε) −1!o→ (10, o, ε) −1?o→ (11, ε, ε) −2!d→ (10, ε, d) −1!c→ (00, c, d) in ⟦A⟧.
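As an illustration of Definition 2.3, the following minimal sketch (ours; the two rules below are only an illustrative fragment of Figure 2, not the protocol's complete rule set) computes the one-step successors of a configuration.

    # Minimal sketch of Definition 2.3: one-step successors of a
    # configuration (q, w) of a fifo system. DELTA is only a fragment
    # of the c/d protocol's rules, for illustration.
    DELTA = [('00', (1, '!', 'o'), '10'),   # emission of o on queue 1
             ('10', (1, '?', 'o'), '11')]   # reception of o from queue 1

    def successors(q, w, delta=DELTA):
        for (src, (i, op, m), dst) in delta:
            if src != q:
                continue
            k = i - 1                        # queues are 1-indexed in the paper
            if op == '!':                    # i!m: append m to queue i
                yield dst, w[:k] + (w[k] + m,) + w[k+1:]
            elif w[k].startswith(m):         # i?m: m must head queue i
                yield dst, w[:k] + (w[k][1:],) + w[k+1:]

    print(list(successors('00', ('', ''))))   # -> [('10', ('o', ''))]
    print(list(successors('10', ('o', ''))))  # -> [('11', ('', ''))]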

3  Partition Abstraction for Fifo Systems

In the context of Cegar-based safety verification, automatic abstraction techniques are usually based on predicates [GS97] or partitions [CGJ+03]. In this work, we focus on partition-based abstraction and refinement techniques for fifo systems. A partition of a set S is any set P of non-empty pairwise disjoint subsets of S such that S = ⋃_{p∈P} p. Elements p of a partition P are called classes. For any element s in S, we denote by [s]P the class in P containing s. At the labeled transition system level, partition abstraction consists of merging configurations that are equivalent with respect to a given equivalence relation, or a given partition. In practice, it is often desirable to maintain different partitions for different control states, to keep partition sizes relatively small. We follow this approach in our definition of partition abstraction for fifo systems, by associating a partition of (M∗)^n with each control state. To ease notation, we write L‾ = (M∗)^n \ L for the complement of any subset L of (M∗)^n. To effectively compute partition abstractions for fifo systems, we need a family of finitely representable subsets of (M∗)^n. A natural candidate is the class of recognizable subsets of (M∗)^n or, equivalently, of Qdd-definable subsets of (M∗)^n [BGWW97], since this class is effectively closed under Boolean operations. Recall that a subset L of (M∗)^n is recognizable if (and only if) it is a finite union of subsets of the form L1 × · · · × Ln where each Li is a regular language over M [Ber79]. We extend recognizability in the natural way to subsets of the set C = Q × (M∗)^n of configurations: a subset C ⊆ C is recognizable if {w | (q, w) ∈ C} is recognizable for every q ∈ Q. We denote by Rec((M∗)^n) the set of recognizable subsets of (M∗)^n, and write P((M∗)^n) for the set of all finite partitions of (M∗)^n whose classes are recognizable subsets of (M∗)^n.

Definition 3.1. Consider a fifo system A = ⟨Q, M, n, ∆⟩ and a partition map P : Q → P((M∗)^n). The partition abstraction of A induced by P is the finite labeled transition system ⟦A⟧♯P = ⟨C♯P, Σ, →♯P⟩ defined as follows:
– C♯P = {(q, p) | q ∈ Q and p ∈ P(q)} is the set of abstract configurations,
– Σ = {1, …, n} × {!, ?} × M is the set of actions,
– the abstract transition relation →♯P ⊆ C♯P × Σ × C♯P is the set of triples ((q, p), l, (q′, p′)) such that (q, w) −l→ (q′, w′) for some w ∈ p and w′ ∈ p′.

To relate concrete and abstract configurations, we define the abstraction function αP : C → C♯P, and its extension to ℘(C) → ℘(C♯P), as well as the concretization function γP : C♯P → ℘(C), extended to ℘(C♯P) → ℘(C), as expected:

    αP((q, w)) = (q, [w]P(q))        αP(C) = {αP(c) | c ∈ C}
    γP((q, p)) = {q} × p             γP(C♯) = ⋃ {γP(c♯) | c♯ ∈ C♯}

To simplify notation, we shall drop the P subscript when the partition map can easily be derived from the context. Intuitively, an abstract configuration (q, p) of ⟦A⟧♯ represents the set {q} × p of (concrete) configurations of ⟦A⟧. The


abstract transition relation →♯ is the existential lift of the concrete transition relation → to abstract configurations. The following forward and backward language transformers will be used to capture the effect of fifo actions. The functions post : Σ × ℘((M∗)^n) → ℘((M∗)^n) and pre : Σ × ℘((M∗)^n) → ℘((M∗)^n) are defined by:

    post(i!m, L) = {w[i ← u] | w ∈ L, u ∈ M∗ and w(i) · m = u}
    post(i?m, L) = {w[i ← u] | w ∈ L, u ∈ M∗ and w(i) = m · u}
    pre(i!m, L)  = {w[i ← u] | w ∈ L, u ∈ M∗ and w(i) = u · m}
    pre(i?m, L)  = {w[i ← u] | w ∈ L, u ∈ M∗ and m · w(i) = u}

we have (q, p) − →♯ (q ′ , p′ ) iff post (l, p) ∩ p′ = ∅ iff p ∩ pre(l, p′ ) = ∅. ♯

Lemma 3.2. For any fifo system A and partition map P : Q → P ((M ∗ )n ), A is effectively computable. For any recognizable subset C ⊆ C, α(C) is effectively computable. lh−1

l

l

0 ♯ 0 → · · · −−−→ ch ) = α(c0 ) −→ We extend α to paths in the obvious way: α(c0 −

lh−1 ♯



· · · −−−→ α(ch ). Observe that α(π) is an abstract path in A for any concrete path π in A. We therefore obtain the following safety preservation property. Proposition 3.3. Consider a fifo system A and a safety condition (Init, Bad) for A. For any partition abstraction A♯ of A, if A♯ is (α(Init), α(Bad))safe then A is (Init, Bad)-safe.

Init (00, ε × ε) 1?c

1!o (00, ε × ε)

2!d

1!o 1?o 1!o 2?d

1!c 2?d



(10, o × ε) 1?c

1!c

` ´ 10, o∗ × ε 1?c

1?c (01, M ∗ × M ∗ )

Bad

2!d

1!c

1?o 1!o 1?o (11, M ∗ × M ∗ )

2?d

Fig. 3. Example Partition Abstraction of the C/D Protocol (Example 3.5)

114

A. Heußner, T. Le Gall, and G. Sutre

The converse to this proposition does not hold generally. An abstract counterexample π ♯ is called feasible if there exists a concrete counterexample π such that π ♯ = α(π), and π ♯ is called spurious otherwise. Lemma 3.4. For any fifo system A, any partition map P : Q → P ((M ∗ )n ), and any safety condition (Init, Bad) for A, feasibility of abstract counterexamples is effectively decidable. Example 3.5. Continuing the discussion of the c/d protocol, we consider the partition abstraction induced by the following partition map: q∈Q

00

10 ∗

01 ∗

11

P (q) ε × ε, ε × ε o × ε, × ε M × M M × M∗ The set of initial abstract configurations is α(Init) = {(00, ε × ε)}, and the set of bad abstract configurations is α(Bad) = {(00, ε × ε), (10, o∗ × ε)}. The resulting partition abstraction is the finite labeled transition system depicted in Figure 3. A simple graph search reveals several abstract counterexamples, for 1!o 1!c instance π ♯ = (00, ε × ε) −−→♯ (10, o∗ × ε) −−→♯ (00, ε × ε). This counterexample is spurious since the only concrete path that corresponds to π ♯ (i.e., whose image o∗

1!o





1!c

under α is π ♯ ) is π = (00, ε, ε) −−→ (10, o, ε) −−→ (00, oc, ε) ∈ / Bad.

4

Counterexample-Based Partition Refinement

The abstraction-based verification of safety properties relies on refinement techniques that gradually increase the precision of abstractions in order to rule out spurious abstract counterexamples. Refinement for partition abstractions simply consists in splitting some classes into a sub-partition. Given two partitions P and P of a set S, we say that P refines P when each class p ∈ P is contained in some class p ∈ P . Moreover we then write [ p ]P for the class p ∈ P containing p.  Let us fix, for the remainder of this section, a fifo system A = Q, M, n, ∆ and a safety condition (Init, Bad) for A. Given two partition maps P, P : Q → P ((M ∗ )n ), we say that P refines P if P (q) refines P (q) for every control state l

lh−1 ♯

0 ♯ → · · · −−−→ (qh , p h ) q ∈ Q. If P refines P , then for any abstract path (q0 , p 0 ) −

lh−1 ♯

l



0 ♯ → · · · −−−→ (qh , [ p h ]P (qh ) ) is an abstract in AP , it holds that (q0 , [ p 0 ]P (q0 ) ) − 



path in AP . This fact shows that, informally, refining a partition abstraction does not introduce any new spurious counterexample. When a spurious counterexample is found in the abstraction, the partition map must be refined so as to rule out this counterexample. We formalize this l

lh−1 ♯

0 ♯ · · · −−−→ (qh , ph ) in A♯P from concept for an abstract path πP♯ = (q0 , p0 ) −→ αP (Init) to αP (Bad) as follows: a refinement P of P is said to rule out the

l

lh−1 ♯

0 ♯ → · · · −−−→ abstract counterexample πP♯ if there exists no path πP♯ = (q0 , p 0 ) − ♯ (qh , p h ) from αP (Init) to αP (Bad) in AP satisfying p i ⊆ pi for all 0 ≤ i ≤ h. 





Extrapolation-Based Path Invariants for Abstraction Refinement

115

Note that if πP♯ is a feasible counterexample, then no refinement of P can rule it out. Conversely, if P is a refinement of P that rules out πP♯ then any refinement of P also rules out πP♯ . The main challenge in Cegar is the discovery of “suitable” refinements, that are computationally “simple” but “precise enough” to rule out spurious counterexamples. In this work, we focus on counterexample-guided refinements based on path invariants. Definition 4.1. Consider a partition map P and a spurious counterexample lh−1 ♯

l

0 ♯ · · · −−−→ (qh , ph ) in A♯P . A path invariant for π ♯ is any π ♯ = (q0 , p0 ) −→ sequence L0 , . . . , Lh of recognizable subsets of (M ∗ )n such that:

(i) we have ({q0 } × p0 ) ∩ Init ⊆ {q0 } × L0 , and (ii) we have post (li , pi ∩ Li ) ⊆ Li+1 for every 0 ≤ i < h, and (iii) we have ({qh } × Lh ) ∩ Bad = ∅. Observe that condition (ii) is more general than post (li , Li ) ⊆ Li+1 which is classically required for inductive invariants. With this relaxed condition, path invariants are tailored to the given spurious counterexample, and therefore can be simpler (e.g., be coarser or have more empty Li ). Proposition 4.2. Consider a partition map P and a simple spurious counterexl

lh−1 ♯

0 ♯ ample π ♯ = (q0 , p0 ) −→ · · · −−−→ (qh , ph ). Given a path invariant L0 , . . . , Lh ♯ for π , the partition map P defined below is a refinement of P that rules out π ♯ :



P (q) = (P (q) \ {pi | i ∈ I(q)}) ∪





pi ∩ Li , pi ∩ Li \ {∅}

i∈I(q)

where I(q) = {i | 0 ≤ i ≤ h, qi = q} for each control state q ∈ Q. We propose a generic approach to obtain path invariants by utilizing a parametrized approximation operator for queue contents. The parameter (the k in the definition below) is used to adjust the precision of the approximation. Definition 4.3. A (parametrized) extrapolation is any function ∇ from N to Rec ((M ∗ )n ) → Rec ((M ∗ )n ) that satisfies, for any L ∈ Rec ((M ∗ )n ), the two following conditions (with ∇(k) written as ∇k ): (i) we have L ⊆ ∇k (L) for every k ∈ N, (ii) there exists kL ∈ N such that L = ∇k (L) for every k ≥ kL . Our definition of extrapolation is quite general, in particular, it does not require monotonicity in k or in L, but it is adequate for the design of path invariant generation procedures. The most simple extrapolation is the identity extrapolation that maps each k ∈ N to the identity on Rec ((M ∗ )n ). The parametrized automata approximations of [BHV04] and [LGJJ06] also satisfy the requirements of Definition 4.3. The choice of an appropriate extrapolation with respect to the underlying domain of fifo systems is crucial for the implementation of Cegar’s refinement step, and will be discussed in Section 6.


UPInv (∇, Init, Bad, π♯P)
Input: extrapolation ∇, recognizable subsets Init, Bad of Q × (M∗)^n, spurious counterexample π♯P = (q0, p0) −l0→♯ · · · −lh−1→♯ (qh, ph)
 1  k ← 0
 2  do
 3      L0 ← ∇k(p0 ∩ {w | (q0, w) ∈ Init})
 4      for i from 1 upto h
 5          Fi ← post(li−1, pi−1 ∩ Li−1)
 6          if pi ∩ Fi = ∅
 7              Li ← ∅
 8          else
 9              Li ← ∇k(Fi)
10      k ← k + 1
11  while ({qh} × Lh) ∩ Bad ≠ ∅
12  return (L0, …, Lh)

Split (∇, L0, L1)
Input: extrapolation ∇, disjoint recognizable subsets L0, L1 of (M∗)^n
 1  k ← 0
 2  while ∇k(L0) ∩ L1 ≠ ∅
 3      k ← k + 1
 4  return ∇k(L0)

APInv (∇, Init, Bad, π♯P)
Input: extrapolation ∇, recognizable subsets Init, Bad of Q × (M∗)^n, spurious counterexample π♯P = (q0, p0) −l0→♯ · · · −lh−1→♯ (qh, ph)
 1  Bh ← ph ∩ {w | (qh, w) ∈ Bad}
 2  i ← h
 3  while Bi ≠ ∅ and i > 0
 4      i ← i − 1
 5      Bi ← pi ∩ pre(li, Bi+1)
 6  if i = 0
 7      I ← p0 ∩ {w | (q0, w) ∈ Init}
 8      L0 ← Split(∇, I, B0)
 9  else
10      (L0, …, Li) ← ((M∗)^n, …, (M∗)^n)
11  for j from i upto h − 1
12      Lj+1 ← Split(∇, post(lj, pj ∩ Lj), Bj+1)
13  return (L0, …, Lh)

Fig. 4. Extrapolation-based Path Invariant Generation Algorithms

Extrapolation-Based Path Invariants for Abstraction Refinement

117

Remark 4.4. Extrapolations are closed under various operations, such as functional composition, union and intersection, as well as round-robin combination. We now present two extrapolation-based path invariant generation procedures (Figure 4). Recall that the parameter k of an extrapolation intuitively indicates the desired precision of the approximation. The first algorithm, UPInv, performs an approximated post computation along the spurious counterexample, and iteratively increases the precision k of the approximation until a path invariant is obtained. The applied precision in UPInv is uniform along the counterexample. Due to its simplicity, the termination analysis of Cegar in the following section will refer to UPInv. The second algorithm, APInv, first performs an exact pre computation along the spurious counterexample to identify the “bad” coreachable subsets Bi . The path invariant is then computed with a forward traversal that uses the Split subroutine to simplify each post image while remaining disjoint from the Bi . The precision used in Split is therefore tailored to each post image, which may lead to simpler path invariants. Naturally, both algorithms may be “reversed” to generate path invariants backwards. Observe that if the extrapolation ∇ is effectively computable, then all steps in the algorithms UPInv, Split and APInv are effectively computable. We now prove correctness and termination of these algorithms. Let us fix, for the remainder of this section, an extrapolation ∇ and a partition map P : Q → P ((M ∗ )n ), and assume that Init and Bad are recognizable. Proposition 4.5. For any spurious abstract counterexample πP♯ , the execution of UPInv (∇, Init, Bad, πP♯ ) terminates and returns a path invariant for πP♯ . Lemma 4.6. For any two recognizable subsets L0 , L1 of (M ∗ )n , if L0 ∩ L1 = ∅ then Split (∇, L0 , L1 ) terminates and returns a recognizable subset L of (M ∗ )n that satisfies L0 ⊆ L ⊆ L1 . Proposition 4.7. For any spurious abstract counterexample πP♯ , the execution of APInv (∇, Init, Bad, πP♯ ) terminates and returns a path invariant for πP♯ . Example 4.8. Consider again the c/d protocol, and assume an extrapolation ∗ ∇ satisfying ∇0 (L × ε) = (alph(L)) × ε for all L ⊆ M ∗ , and ∇1 (u × ε) = u × ε for each u ∈ {ε, o, oc}, e.g., the extrapolation ρ′′ presented in Remark 6.1. 1!o

The UPInv algorithm, applied to the spurious counterexample (00, ε × ε) −−→♯ 1!c (10, o∗ × ε) −−→♯ (00, ε × ε) of Example 3.5, would perform two iterations of the while-loop and produce the path invariant (ε × ε, o × ε, oc × ε). These iterations are detailed in the table below. The mark  or  indicates whether the condition at line 11 is satisfied. L0

L1 ∗

L2 ∗

Line 11

k = 0 ε × ε o × ε {o, c} × ε



k =1 ε×ε o×ε



oc × ε

118

A. Heußner, T. Le Gall, and G. Sutre

Following Proposition 4.2, the partition map would be refined to:

    q ∈ Q | 00                              | 10                                    | 01, 11
    P′(q) | ε × ε, oc × ε, ((ε ∪ oc) × ε)‾  | o × ε, (ε ∪ (o · o+)) × ε, (o∗ × ε)‾  | M∗ × M∗

This refined partition map clearly rules out the spurious counterexample.

5  Safety Cegar Semi-algorithm for Fifo Systems

We are now equipped with the key ingredients to present our Cegar semi-algorithm for fifo systems. The semi-algorithm takes as input a fifo system A, a recognizable safety condition (Init, Bad), an initial partition map P0, and a path invariant generation procedure PathInv. The initial partition map may be the trivial one, mapping each control state to (M∗)^n. We may use any path invariant generation procedure, such as the ones presented in the previous section. The semi-algorithm iteratively refines the partition abstraction until either the abstraction is precise enough to prove that A is (Init, Bad)-safe (line 10), or a feasible counterexample is found (line 4). If the abstract counterexample picked at line 2 is spurious, a path invariant is generated and used to refine the partition. The new partition map obtained after the foreach loop (lines 8–9) is precisely the partition map P′ from Proposition 4.2, and hence it rules out this abstract counterexample. Recall that Lemmata 3.2 and 3.4 ensure that the steps at lines 1 and 3 are effectively computable. The correctness of the CEGAR semi-algorithm is expressed by the following proposition, which follows directly from Proposition 3.3 and from the definition of feasible abstract counterexamples.

Proposition 5.1. For any terminating execution of CEGAR(A, Init, Bad, P0, PathInv), if the execution returns Safe (resp. Unsafe) then ⟦A⟧ is (Init, Bad)-safe (resp. (Init, Bad)-unsafe).

A detailed example execution of CEGAR on the c/d protocol is provided in the long version. Termination of the CEGAR semi-algorithm cannot be ensured in general, as it would otherwise solve the reachability problem, which is undecidable for fifo systems [BZ83]. However, (Init, Bad)-unsafety is semi-decidable for fifo systems by forward or backward symbolic exploration when Init and Bad are recognizable [BG99]. Moreover, this problem becomes decidable for fifo systems having a finite reachability set from Init. We investigate in this section the termination of the CEGAR semi-algorithm when ⟦A⟧ is (Init, Bad)-unsafe or has a finite reachability set from Init. In contrast to other approaches where abstractions are refined globally (e.g., predicate abstraction [GS97]), partition abstractions [CGJ+03] are refined locally by splitting abstract configurations along the abstract counterexample (viz. lines 8–9 of the CEGAR semi-algorithm). The abstract transition relation only needs to be refined locally around the abstract configurations that have been split, and hence its refinement can be computed efficiently. However, this local nature of refinement complicates the analysis of the algorithm. We fix an extrapolation ∇ and focus on the path invariant generation procedure UPInv presented in Section 4.


CEGAR (A, Init, Bad, P0, PathInv)
Input: fifo system A = ⟨Q, M, n, ∆⟩, recognizable subsets Init, Bad of Q × (M∗)^n, partition map P0 : Q → P((M∗)^n), procedure PathInv
 1  while ⟦A⟧♯P is (αP(Init), αP(Bad))-unsafe
 2      pick a simple abstract counterexample π♯ in ⟦A⟧♯P
 3      if π♯ is a feasible abstract counterexample
 4          return Unsafe
 5      else
 6          write π♯ as the abstract path (q0, p0) −l0→♯ · · · −lh−1→♯ (qh, ph)
 7          (L0, …, Lh) ← PathInv(Init, Bad, π♯)
 8          foreach i ∈ {0, …, h}
 9              P(qi) ← (P(qi) \ {pi}) ∪ ({pi ∩ Li, pi ∩ Li‾} \ {∅})
10  return Safe

Proposition 5.2. For any breadth-first execution of CEGAR(A, Init, Bad, P0, UPInv(∇)), if the execution does not terminate then the sequence (hθ)θ∈N of lengths of the counterexamples picked at line 2 is nondecreasing and diverges.

Corollary 5.3. If ⟦A⟧ is (Init, Bad)-unsafe then any breadth-first execution of CEGAR(A, Init, Bad, P0, UPInv(∇)) terminates.

It would also be desirable to obtain termination of the CEGAR semi-algorithm when ⟦A⟧ has a finite reachability set from Init. However, as demonstrated in the long version, this condition is not sufficient to guarantee that CEGAR(A, Init, Bad, P0, UPInv(∇)) has a terminating execution. It turns out that termination can be guaranteed for fifo systems with a finite reachability set when ∇k has a finite image for every k ∈ N. This apparently strong requirement, formally specified in Definition 5.4, is satisfied by the extrapolations presented in [BHV04] and [LGJJ06], which are based on state equivalences up to a certain depth.

Definition 5.4. An extrapolation ∇ is restricted if for every k ∈ N, the set {∇k(L) | L ∈ Rec((M∗)^n)} is finite.

Remark that if ∇ is restricted then, for any execution of CEGAR(A, Init, Bad, P0, UPInv(∇)), the execution terminates if and only if the number of iterations of the while-loop of the algorithm UPInv is bounded¹. As shown by the following proposition, if moreover ⟦A⟧ has a finite reachability set from Init then the execution necessarily terminates.

Proposition 5.5. Assume that ∇ is restricted. If ⟦A⟧ has a finite reachability set from Init, then any execution of CEGAR(A, Init, Bad, P0, UPInv(∇)) terminates.

¹ Remark that this bound is not a bound on the length of abstract counterexamples.

6  Overview of the (Colored) Bisimulation Extrapolation

This section briefly introduces the bisimulation-based extrapolation underlying the widening operator introduced in [LGJJ06]. This extrapolation assumes an automata representation of recognizable subsets of (M∗)^n, and relies on bounded-depth bisimulation over the states of the automata. For simplicity, we focus in this section on fifo systems with a single queue, i.e., n = 1. In this simpler case, recognizable subsets of (M∗)^n are regular languages contained in M∗, which can directly be represented by finite automata over M. The general case of n ≥ 2, which is discussed in detail in the long version, requires the use of Qdds, that is, finite automata accepting recognizable subsets of (M∗)^n via an encoding of n-tuples in (M∗)^n by words over an extended alphabet. Still, the main ingredients remain the same. In the remainder of this section, we lift our discussion from regular languages in M∗ to finite automata over M.

Consider a finite automaton over M with a set Q of states. As in abstract regular model checking [BHV04], we use quotienting under equivalence relations on Q to obtain over-approximations of the automaton. However, we follow the approach of [LGJJ06] and focus on bounded-depth bisimulation equivalence (other equivalence relations were used in [BHV04]). Given a priori an equivalence relation col on Q, also called a "coloring", and a bound k ∈ N, the (colored) bisimulation equivalence of depth k is the equivalence relation ∼col_k on Q defined as expected: ∼col_0 = col, and two states are equivalent for ∼col_{k+1} if (1) they are ∼col_k-equivalent and (2) they have ∼col_k-equivalent m-successors for each letter m ∈ M. The ultimately stationary sequence ∼col_0 ⊇ ∼col_1 ⊇ · · · ⊇ ∼col_k ⊇ ∼col_{k+1} ⊇ · · · of equivalence relations on Q leads to the colored bisimulation-based extrapolation. We define a coloring std, called the standard coloring, by (q1, q2) ∈ std if either q1 and q2 are both final states or q1 and q2 are both non-final states. The bisimulation extrapolation is the function ρ from N to Rec(M∗) → Rec(M∗) defined by ρk(L) = L/∼std_k, where L is identified with a finite automaton accepting it. Notice that ρ is a restricted extrapolation.

Remark 6.1. We could also choose other colorings, or define the sequence of equivalences in a different way. For instance, better results are sometimes obtained in practice with the extrapolation ρ′ that first (for k = 0) applies a quotienting with respect to the equivalence relation Q × Q (i.e., all states are merged), and then behaves as ρk−1 (for k ≥ 1). Analogously, the extrapolation ρ′′ defined by ρ′′_0 = ρ′_0 and ρ′′_k = ρk for k ≥ 1 was used in Example 4.8.

Example 6.2. Consider the regular language L = {aac, baaa} over the alphabet M = {a, b, c, d, e}, represented by the automaton FAL of Figure 5a. The extrapolation ρ defined above applies to L as follows: ρ0 splits the states of FAL according to std, hence ρ0(L) = {a, b}∗ · {a, c} (viz. Figure 5c). Then ρ1 merges the states that are bisimulation equivalent up to depth 1 (Figure 5d). As all states of FAL are non-equivalent for ∼std_k with k ≥ 2, we have ρk(L) = L (again Figure 5a). The variants ρ′ and ρ′′ mentioned previously would lead to ρ′_0(L) = ρ′′_0(L) = (alph(L))∗ = {a, b, c}∗ (viz. Figure 5b).

[Fig. 5 shows four finite automata over {a, b, c}, labeled (a)–(d): (a) the automaton FA_L recognizing L; (b) the one-state automaton for {a, b, c}*; (c) the automaton for ρ_0(L) = {a, b}*·{a, c}; and (d) the automaton for ρ_1(L).]

Fig. 5. Finite Automata Representations for Extrapolating L (Example 6.2)

The benefits of the bisimulation extrapolation for the abstraction of fifo systems were already discussed in [LGJJ06]. The following example shows that this extrapolation can, in some common cases, discover exact repetitions of message sequences in queues, without the need for acceleration techniques.

Example 6.3. Let us continue the running example of the c/d protocol, considered here as having a single queue by restricting it to operations on the first queue. The application of acceleration techniques on the path (00, ε) −1!o→ −1!c→ (00, oc) −1!o→ −1!c→ (00, ococ) ··· would produce the set of queue contents (oc)+. The bisimulation extrapolation ρ applied to the singleton language {ococ}, represented by the obvious automaton, produces the following results for the first two parameters: ρ_0({ococ}) = {o, c}*·c and ρ_1({ococ}) = (oc)+.

7 Experimental Evaluation

Our prototype tool McScM, which implements the previous algorithms, is written in OCaml and relies on a library by Le Gall and Jeannet [Scm] for the classical finite automata and Qdd operations, the fifo post/pre symbolic computations, and the colored bisimulation-based extrapolation. The standard coloring with final and non-final states is used by default in our tool, but several other variants are also available. Our implementation includes the two path-invariant generation algorithms UPInv and APInv of Section 4. We actually implemented a "single split" backward variant of APInv, reminiscent of the classical Cegar implementation [CGJ+03] (analogous to APInv but applying the split solely to the "failure" abstract configuration). Therefore our implemented variant APInv' leads to more Cegar loops than would be obtained with APInv, which partly explains why UPInv globally outperforms APInv' on the larger examples. Several pluggable subroutines can be used to search for counterexamples (depth-first or breadth-first exploration).

We tested the prototype on a suite of protocols that includes the classical alternating bit protocol Abp [AJ96], a simplified version of Tcp – also in the setting of one server with two clients that interfere on their shared channels – a sliding window protocol, as well as protocols for leader election due to Peterson and for token passing in a ring topology. Further, we provide certain touchstones for our approach: an enhancement of the c/d protocol with nested loops for the exchange of data, and a protocol with a non-recognizable reachability set.

Table 1. Benchmark results of McScM on a suite of protocols

protocol               states/trans.  refmnt.  time [s]  mem [MiB]  loops  states♯/trans♯
Abp                    16/64          APInv'   0.30      1.09       72     87/505
                                      UPInv    2.13      1.58       208    274/1443
c/d protocol           5/17           APInv'   0.02      0.61       8      12/51
                                      UPInv    0.01      0.61       6      11/32
nested c/d protocol    6/17           APInv'   0.68      1.09       80     85/314
                                      UPInv    1.15      1.09       93     100/339
non-regular protocol   9/18           APInv'   0.02      0.61       13     21/47
                                      UPInv    0.06      0.61       14     25/39
Peterson               10648/56628    APInv'   7.05      32.58      299    10946/58536
                                      UPInv    2.14      32.09      51     10709/56939
(simplified) Tcp       196/588        APInv'   2.19      3.03       526    721/3013
                                      UPInv    1.38      2.06       183    431/1439
server with 2 clients  255/2160       APInv'   (> 1h)    —          —      —
                                      UPInv    9.61      4.97       442    731/7383
token ring             625/4500       APInv'   85.15     19.50      1720   2344/19596
                                      UPInv    4.57      6.42       319    1004/6956
sliding window         225/2010       APInv'   16.43     9.54       1577   1801/15274
                                      UPInv    0.93      2.55       148    388/2367

A detailed presentation of the protocols is provided in the long version. Except for the c/d protocol, which is unsafe, all other examples are safe. Table 1 gives a summary of the results obtained by McScM on an off-the-shelf computer (2.4 GHz Intel Core 2 Duo). Breadth-first exploration was applied in all examples to search for abstract counterexamples. The bisimulation extrapolation ρ presented in Section 6 was used except for the server with 2 clients, where we applied the variant ρ′ of ρ presented in Remark 6.1, as the analysis did not terminate after one hour with ρ. All examples are analyzed with UPInv in a few seconds, and memory is not a limitation.

We compared McScM with Trex [Tre], which is, to the best of our knowledge, the sole publicly available and directly usable model checker for the verification of unbounded fifo systems. Note, however, that the comparison is biased, as Trex focuses on lossy channels. We applied Trex to the first six examples of Table 1. Trex has an efficient implementation based on simple regular expressions (and not general Qdds as we use), and in most cases needs less than 1 second to build the reachability set (the latter allows the reachability of bad configurations to be decided by a simple additional look-up). Further, Trex implements communicating timed and counter automata that are – at this stage – beyond the focus of our tool. Nonetheless, Trex assumes a lossy fifo semantics and is therefore not able to verify all reliable fifo protocols correctly (e.g., when omitting the disconnect messages in the c/d protocol, Trex is still able to reach Bad due to the possible loss of messages, although the protocol is safe). Moreover, Trex suffers (as would a symbolic model checker based on the Lash library [Las]) from the main drawback of acceleration techniques, which in general cannot cope with nested loops, whereas nested loops seem to have no adverse effect on our tool (viz. the nested c/d protocol, on which Trex did not finish after one hour). McScM can also handle a simple non-regular protocol (with a counting loop) that is beyond the Qdd-based approaches [BG99], as the representation of the reachability set would require recognizable languages equipped with Presburger formulas (Cqdds [BH99]).

To obtain a finer evaluation of our approach, we prototypically implemented the global abstraction refinement scheme of [BHV04] in our tool. While this Armc implementation seems to be advantageous for some small protocols, the larger examples confirm that the local and adaptive refinement approach developed in this paper outperforms a global one on protocols that demand a "highly" precise abstraction only for a few control loops (e.g., Peterson's leader election and token ring). Further, our Armc implementation was not able to handle the non-regular protocol nor the server with 2 clients.

8 Conclusion and Perspectives

Our prototype implementation confirms our expectation that the proposed Cegar framework with extrapolation-based path invariants is a promising alternative approach to the automatic verification of fifo systems. Our approach relies on partition abstractions where equivalence classes are recognizable languages of queue contents. Our main contribution is the design of generic path-invariant generation algorithms based on parametrized extrapolation operators for queue contents. Because of the latter, our CEGAR semi-algorithm enjoys additional partial termination properties.

The framework developed in this paper is not specific to fifo systems, and we intend to investigate its practical relevance to other families of infinite-state models. Future work also includes the safety verification of more complex fifo systems that would allow the exchange of unbounded numerical data over the queues, or include parametrization (e.g., over the number of clients). Several decidable classes of fifo systems have emerged in the literature (in particular lossy fifo systems), and we intend to investigate termination of our CEGAR semi-algorithm (when equipped with the path invariant generation algorithms developed in this paper) for these classes.

Acknowledgements. We thank the anonymous reviewers for supporting and guiding the genesis of this publication, and we are especially grateful for the fruitful and challenging discussions with Jérôme Leroux and Anca Muscholl.

References

[AJ96] Abdulla, P.A., Jonsson, B.: Verifying Programs with Unreliable Channels. Information and Computation 127(2), 91–101 (1996)
[Ber79] Berstel, J.: Transductions and Context-Free Languages. Teubner (1979)
[BG99] Boigelot, B., Godefroid, P.: Symbolic Verification of Communication Protocols with Infinite State Spaces using QDDs. Formal Methods in System Design 14(3), 237–255 (1999)
[BGWW97] Boigelot, B., Godefroid, P., Willems, B., Wolper, P.: The Power of QDDs. In: Van Hentenryck, P. (ed.) SAS 1997. LNCS, vol. 1302, pp. 172–186. Springer, Heidelberg (1997)
[BH99] Bouajjani, A., Habermehl, P.: Symbolic Reachability Analysis of FIFO-Channel Systems with Nonregular Sets of Configurations. Theoretical Computer Science 221(1-2), 211–250 (1999)
[BHV04] Bouajjani, A., Habermehl, P., Vojnar, T.: Abstract Regular Model Checking. In: Alur, R., Peled, D.A. (eds.) CAV 2004. LNCS, vol. 3114, pp. 372–386. Springer, Heidelberg (2004)
[BR01] Ball, T., Rajamani, S.K.: Automatically Validating Temporal Safety Properties of Interfaces. In: Dwyer, M.B. (ed.) SPIN 2001. LNCS, vol. 2057, pp. 103–122. Springer, Heidelberg (2001)
[BZ83] Brand, D., Zafiropulo, P.: On Communicating Finite-State Machines. Journal of the ACM 30(2), 323–342 (1983)
[CGJ+03] Clarke, E., Grumberg, O., Jha, S., Lu, Y., Veith, H.: Counterexample-guided Abstraction Refinement for Symbolic Model Checking. Journal of the ACM 50(5), 752–794 (2003)
[FIS03] Finkel, A., Iyer, S.P., Sutre, G.: Well-Abstracted Transition Systems: Application to FIFO Automata. Information and Computation 181(1), 1–31 (2003)
[GS97] Graf, S., Saïdi, H.: Construction of Abstract State Graphs with PVS. In: Grumberg, O. (ed.) CAV 1997. LNCS, vol. 1254, pp. 72–83. Springer, Heidelberg (1997)
[HJMS02] Henzinger, T.A., Jhala, R., Majumdar, R., Sutre, G.: Lazy Abstraction. In: Proc. Symposium on Principles of Programming Languages 2002, pp. 58–70. ACM Press, New York (2002)
[JR86] Jard, C., Raynal, M.: De la Nécessité de Spécifier des Propriétés pour la Vérification des Algorithmes Distribués. Rapports de Recherche 590, IRISA Rennes (December 1986)
[Las] Liège Automata-based Symbolic Handler (Lash). Tool Homepage, http://www.montefiore.ulg.ac.be/~boigelot/research/lash/
[LGJJ06] Le Gall, T., Jeannet, B., Jéron, T.: Verification of Communication Protocols using Abstract Interpretation of FIFO Queues. In: Johnson, M., Vene, V. (eds.) AMAST 2006. LNCS, vol. 4019, pp. 204–219. Springer, Heidelberg (2006)
[Mcs] Model Checker for Systems of Communicating Fifo Machines (McScM). Tool Homepage, http://www.labri.fr/~heussner/mcscm/
[Scm] Tools and Libraries for Static Analysis and Verification. Tool Homepage, http://gforge.inria.fr/projects/bjeannet/
[Tre] Tool for Reachability Analysis of CompleX Systems (Trex). Tool Homepage, http://www.liafa.jussieu.fr/~sighirea/trex/
[YBCI08] Yu, F., Bultan, T., Cova, M., Ibarra, O.H.: Symbolic String Verification: An Automata-Based Approach. In: Havelund, K., Majumdar, R., Palsberg, J. (eds.) SPIN 2008. LNCS, vol. 5156, pp. 306–324. Springer, Heidelberg (2008)

A Decision Procedure for Detecting Atomicity Violations for Communicating Processes with Locks⋆

Nicholas Kidd¹, Peter Lammich², Tayssir Touili³, and Thomas Reps¹,⁴

¹ University of Wisconsin, {kidd,reps}@cs.wisc.edu
² Westfälische Wilhelms-Universität Münster, [email protected]
³ LIAFA, CNRS & Université Paris Diderot, [email protected]
⁴ GrammaTech, Inc.

Abstract. We present a new decision procedure for detecting property violations in pushdown models for concurrent programs that use lock-based synchronization, where each thread's lock operations are properly nested (à la synchronized methods in Java). The technique detects violations expressed as indexed phase automata (PAs)—a class of non-deterministic finite automata in which the only loops are self-loops. Our interest in PAs stems from their ability to capture atomic-set-serializability violations. (Atomic-set serializability is a relaxation of atomicity to only a user-specified set of memory locations.) We implemented the decision procedure and applied it to detecting atomic-set-serializability violations in models of concurrent Java programs. Compared with a prior method based on a semi-decision procedure, not only was the decision procedure 7.5X faster overall, but the semi-decision procedure timed out on about 68% of the queries versus 4% for the decision procedure.

1 Introduction

Pushdown systems (PDSs) are a formalism for modeling the interprocedural control flow of recursive programs. Likewise, multi-PDSs have been used to model the set of all interleaved executions of a concurrent program with a finite number of threads [1,2,3,4,5,6,7]. This paper presents a decision procedure for multi-PDS model checking with respect to properties expressed as indexed phase automata (PAs)—a class of non-deterministic finite automata in which the only loops are self-loops. The decision procedure handles (i) reentrant locks, (ii) an unbounded number of context switches, and (iii) an unbounded number of lock acquisitions and releases by each PDS. The decision procedure is compositional: each PDS is analyzed independently with respect to the PA, and then a single compatibility check is performed that ties together the results obtained from the different PDSs.

⋆ Supported by NSF under grants CCF-0540955, CCF-0524051, and CCF-0810053, by AFRL under contract FA8750-06-C-0249, and by ONR under grant N00014-09-1-0510.

Our interest in PAs stems from their ability to capture atomic-set serializability (AS-serializability) violations. AS-serializability was proposed by Vaziri et al. [8] as a relaxation of the atomicity property [9] to only a user-specified set of fields of an object. (A detailed example is given in §2.) In previous work by some of the authors [10], we developed techniques for verifying AS-serializability for concurrent Java programs. Our tool first abstracts a concurrent Java program into EML, a modeling language based on multi-PDSs and a finite number of reentrant locks. The drawback of the approach that we have used to date is that an EML program is compiled into a communicating pushdown system (CPDS) [4,5], for which the required model-checking problem is undecidable. (A semi-decision procedure is used in [10].)

Kahlon and Gupta [7] explored the boundary between decidability and undecidability for model checking multi-PDSs that synchronize via nested locks. One of their results is an algorithm to decide if a multi-PDS satisfies an (indexed) LTL formula that makes use of only atomic propositions, the "next" operator X, and the "eventually" operator F. In the case of a 2-PDS, the algorithm uses an automaton-pair M = (A, B) to represent a set of configurations of a 2-PDS, where an automaton encodes the configurations of a single PDS in the usual way [11,12]. For a given logical formula, the Kahlon-Gupta algorithm is defined inductively: from an automaton-pair that satisfies a subformula, they define an algorithm that computes a new automaton-pair for a larger formula that has one additional (outermost) temporal operator.

We observed that PAs can be compiled into an LTL formula that uses only the X and F operators. (An algorithm to perform the encoding is given in [13, App. A].) Furthermore, [14] presents a sound and precise technique that uses only non-reentrant locks to model EML's reentrant locks. Thus, combining previous work [10,14] with the Kahlon-Gupta algorithm provides a decision procedure for verifying AS-serializability of concurrent Java programs! (Briefly, the technique for replacing reentrant locks with non-reentrant locks pushes a special marker onto the stack the first time a lock is acquired, and records the acquisition in a PDS's state space. All subsequent lock acquires and their matching releases do not change the state of the lock or the PDS. Only when the special marker is seen again is the lock then released. This technique requires that lock acquisitions and releases be properly scoped, which is satisfied by Java's synchronized blocks. Consequently, we consider only non-reentrant locks in the remainder of the paper.)

Unfortunately, [7] erroneously claims that the disjunction operation distributes across two automaton-pairs. That is, for automaton-pairs M1 = (A1, B1) and M2 = (A2, B2), they claim that the following holds: M1 ∨ M2 = (A1 ∨ A2, B1 ∨ B2). This is invalid because cross-terms arise when attempting to distribute the disjunct. For example, if B1 ∩ B2 = ∅, then there can be configurations of the form ⟨a1, b2⟩, with a1 ∈ A1 and b2 ∈ B2, that would be accepted by (A1 ∨ A2, B1 ∨ B2) but should not be in M1 ∨ M2.


To handle this issue properly, a corrected algorithm must use a set of automaton-pairs instead of a single automaton-pair to represent a set of configurations of a 2-PDS.¹ Because the size of the set is exponential in the number of locks, in the worst case their algorithm may perform an exponential number of individual reachability queries to handle one temporal operator. Furthermore, once reachability from one automaton-pair has been performed, the resulting automaton-pair must be split into a set so that incompatible configurations are eliminated. Thus, it is not immediately clear whether the (corrected) Kahlon-Gupta algorithm is amenable to an implementation that would be usable in practice.

This paper presents a new decision procedure for checking properties specified as PAs on multi-PDSs that synchronize via nested locks.² Unlike the (corrected) Kahlon-Gupta algorithm, our decision procedure uses only one reachability query per PDS. The key is to use tuples of lock histories (§5): moving from the lock histories used by Kahlon and Gupta to tuples of lock histories introduces a mechanism to maintain the correlations between the intermediate configurations. Hence, our decision procedure is able to make use of only a single compatibility check over the tuples of lock histories that our analysis obtains for each PDS. The benefit of this approach is shown in the following table, where Procs denotes the number of processes, L denotes the number of locks, and |PA| denotes the number of states in property automaton PA:

                              PDS State Space   Queries
Kahlon-Gupta [7] (corrected)  O(2^L)            O(|PA| · Procs · 2^L)
This paper (§6)               O(|PA| · 2^L)     Procs

Because our algorithm isolates the exponential cost in the PDS state space, that cost can often be side-stepped using symbolic techniques, such as BDDs, as explained in §7.

This paper makes the following contributions:

– We define a decision procedure for multi-PDS model checking for PAs. The decision procedure handles (i) reentrant locks, (ii) an unbounded number of context switches, (iii) an unbounded number of lock acquisitions and releases by each PDS, and (iv) a bounded number of phase transitions.
– The decision procedure is compositional: each PDS is analyzed independently with respect to the PA, and then a single compatibility check is performed that ties together the results obtained from the different PDSs.
– We leverage the special form of PAs to give a symbolic implementation that is more space-efficient than standard BDD-based techniques for PDSs [16].
– We used the decision procedure to detect AS-serializability violations in automatically-generated models of four programs from the ConTest benchmark suite [17], and obtained substantially better performance than a prior method based on a semi-decision procedure [10].

¹ We confirmed these observations in e-mail with Kahlon and Gupta [15].
² We do not consider multi-PDSs that use wait-notify synchronization because reachability analysis of multi-PDSs with wait-notify is undecidable [7].


The rest of the paper is organized as follows: §2 provides motivation. §3 defines multi-PDSs and PAs. §4 reviews Kahlon and Gupta’s decomposition result. §5 presents lock histories. §6 presents the decision procedure. §7 describes a symbolic implementation. §8 presents experimental results. §9 describes related work.

2 Motivating Example

Fig. 1 shows a simple Java implementation of a stack. Class Client is a test harness that performs concurrent accesses on a single stack. Client.get() uses the keyword "synchronized" to protect against concurrent calls on the same Client object. The annotation "@atomic" on Client.get() specifies that the programmer intends for Client.get() to be executed atomically. The program's synchronization actions do not ensure this, however. The root cause is that the wrong object is used for synchronization: parameter "Stack s" of Client.get() should have been used, instead of Client.get()'s implicit this parameter. This mistake permits the interleaved execution shown at the bottom of Fig. 1, which would result in an exception being thrown.

    class Stack {
      Object[] storage = new Object[10];
      int item = -1;
      public static Stack makeStack(){
        return new Stack();
      }
      public synchronized Object pop(){
        Object res = storage[item];
        storage[item--] = null;
        return res;
      }
      public synchronized void push(Object o){
        storage[++item] = o;
      }
      public synchronized boolean empty(){
        return (item == -1);
      }
    }

    class Client {
      // @atomic
      public synchronized Object get(Stack s){
        if(!s.empty()) { return s.pop(); }
        else return null;
      }
      public static Client makeClient(){
        return new Client();
      }
      public static void main(String[] args){
        Stack stack = Stack.makeStack();
        stack.push(new Integer(1));
        Client client1 = makeClient();
        Client client2 = makeClient();
        new Thread("1") { client1.get(stack); }   // pseudocode for thread creation, as in the figure
        new Thread("2") { client2.get(stack); }
      }
    }

    1: Abeg1 (s R1(i) )s ...................... (s R1(i) R1(s) R1(i) W1(s) W1(i) )s Aend1
    2: .................. Abeg2 (s R2(i) )s (s R2(i) R2(s) R2(i) W2(s) W2(i) )s Aend2 ......

(Within each call to get(), the first "(s ... )s" block is the call to empty() and the second is the call to pop().)

Fig. 1. Example program and problematic interleaving that violates atomic-set serializability. R and W denote a read and write access, respectively. i and s denote fields item and storage, respectively. Abeg and Aend denote the beginning and end, respectively, of an atomic code block. The subscripts "1" and "2" are thread ids. "(s" and ")s" denote the acquire and release operations, respectively, of the lock of Stack stack.


This is an example of an atomic-set serializability (AS-serializability) violation [8]—a relaxation of atomicity [9] to only a specified set of shared-memory locations—with respect to s.item and s.storage. AS-serializability violations can be completely characterized by a set of fourteen problematic access patterns [8].³ Each problematic pattern is a finite sequence of reads and writes by two threads to one or two shared memory locations. For the program in Fig. 1 and problematic pattern "Abeg1; R1(i); W2(s); W2(i); R1(s)", the accesses that match the pattern are underlined in the interleaving shown at the bottom of Fig. 1.

The fourteen problematic access patterns can be encoded as an indexed phase automaton (PA). The PA that captures the problematic accesses of Fig. 1 is shown in Fig. 2. Its states—which represent the phases that the automaton passes through to accept a string—are chained together by phase transitions; each state has a self-loop for symbols that cause the automaton to not change state. ("Indexed" refers to the fact that the index of the thread performing an action is included in the label of each transition.)

    [Fig. 2: states q1, ..., q6 chained by the phase transitions
     q1 -Abeg1-> q2 -R1i-> q3 -W2s-> q4 -W2i-> q5 -R1s-> q6,
     with a self-loop labeled Σ on q1 and self-loops labeled Λ on q2-q5.]

Fig. 2. The PA that accepts the problematic access pattern in the program from Fig. 1. Σ is the set of all actions, and Λ is Σ \ {Aend1}.

The PA in Fig. 2 "guesses" when a violation occurs. That is, when it observes that thread 1 enters an atomic code block, such as get(), the atomic-code-block-begin action Abeg1 causes it either to transition to state q2 (i.e., to start the next phase), or to follow the self-loop and remain in q1. This process continues until it reaches the accepting state. Note that the only transition that allows thread 1 to exit an atomic code block (Aend1) is the self-loop on the initial state. Thus, incorrect guesses cause the PA in Fig. 2 to become "stuck" in one of the states q1 . . . q5 and not reach final state q6.

3 Program Model and Property Specifications

Definition 1. A (labeled) pushdown system (PDS) is a tuple P = (P, Act, Γ, ∆, c0), where P is a finite set of control states, Act is a finite set of actions, Γ is a finite stack alphabet, and ∆ ⊆ (P × Γ) × Act × (P × Γ*) is a finite set of rules. A rule r ∈ ∆ is denoted by ⟨p, γ⟩ ↪−a→ ⟨p′, u′⟩. A PDS configuration ⟨p, u⟩ is a control state along with a stack, where p ∈ P and u ∈ Γ*, and c0 = ⟨p0, γ0⟩ is the initial configuration. ∆ defines a transition system over the set of all configurations. From c = ⟨p, γu⟩, P can make a transition to c′ = ⟨p′, u′u⟩ on action a, denoted by c −a→ c′, if there exists a rule ⟨p, γ⟩ ↪−a→ ⟨p′, u′⟩ ∈ ∆. For w ∈ Act*, c −w→ c′ is defined in the usual way. For a rule r = ⟨p, γ⟩ ↪−a→ ⟨p′, u′⟩, act(r) denotes r's action label a.

³ This result relies on an assumption that programs do not always satisfy: an atomic code section that writes to one member of a set of correlated locations writes to all locations in that set (e.g., item and storage of Stack s).
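Purely as an illustration of Defn. 1 (a sketch with hypothetical names, not tied to any implementation), the one-step transition relation can be coded directly:

    import java.util.*;

    // Illustrative sketch of a labeled PDS (Defn. 1).
    class PDS {
        record Rule(String p, char gamma, String action, String p2, String u2) {}
        record Config(String p, String stack) {}
        final List<Rule> delta = new ArrayList<>();

        // One-step successors of c = <p, gamma u> on action a:
        // c -a-> <p', u' u> for each rule <p, gamma> -a-> <p', u'>.
        List<Config> step(Config c, String a) {
            List<Config> succs = new ArrayList<>();
            if (c.stack().isEmpty()) return succs;   // no top-of-stack symbol
            char gamma = c.stack().charAt(0);
            String rest = c.stack().substring(1);
            for (Rule r : delta)
                if (r.p().equals(c.p()) && r.gamma() == gamma && r.action().equals(a))
                    succs.add(new Config(r.p2(), r.u2() + rest));
            return succs;
        }
    }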


A multi-PDS consists of a finite number of PDSs P1, . . . , Pn that synchronize via a finite set of locks Locks = {l1, . . . , lL} (i.e., L = |Locks|). The actions Act of each PDS consist of lock-acquires ("(i") and lock-releases (")i") for 1 ≤ i ≤ L, plus symbols from Σ, a finite alphabet of non-parenthesis symbols. The intention is that each PDS models a thread, and lock-acquire and lock-release actions serve as synchronization primitives that constrain the behavior of the multi-PDS. We assume that locks are acquired and released in a well-nested fashion; i.e., locks are released in the opposite order in which they are acquired. The choice of what actions appear in Σ depends on the intended application. For verifying AS-serializability (see §2 and §7), Σ consists of actions to read and write a shared-memory location m (denoted by R(m) and W(m), respectively), and to enter and exit an atomic code section (Abeg and Aend, respectively).

Formally, a program model is a tuple Π = (P1, . . . , Pn, Locks, Σ). A global configuration g = (c1, . . . , cn, o1, . . . , oL) is a tuple consisting of a local configuration ci for each PDS Pi and a valuation that indicates the owner of each lock: for each 1 ≤ i ≤ L, oi ∈ {⊥, 1, . . . , n} indicates the identity of the PDS that holds lock li. The value ⊥ signifies that a lock is currently not held by any PDS. The initial global configuration is g0 = (c10, . . . , cn0, ⊥, . . . , ⊥).

A global configuration g = (c1, c2, . . . , cn, o1, . . . , oL) can make a transition to another global configuration g′ = (c′1, c2, . . . , cn, o′1, . . . , o′L) under the following conditions:

– If c1 −a→ c′1 and a ∉ {(i, )i}, then g′ = (c′1, c2, . . . , cn, o1, . . . , oL).
– If c1 −(i→ c′1 and g = (c1, c2, . . . , cn, o1, . . . , oi−1, ⊥, oi+1, . . . , oL), then g′ = (c′1, c2, . . . , cn, o1, . . . , oi−1, 1, oi+1, . . . , oL).
– If c1 −)i→ c′1 and g = (c1, c2, . . . , cn, o1, . . . , oi−1, 1, oi+1, . . . , oL), then g′ = (c′1, c2, . . . , cn, o1, . . . , oi−1, ⊥, oi+1, . . . , oL).

For 1 < j ≤ n, a global configuration (c1, . . . , cj, . . . , cn, o1, . . . , oL) can make a transition to (c1, . . . , c′j, . . . , cn, o′1, . . . , o′L) in a similar fashion.

A program property is specified as an indexed phase automaton.

Definition 2. An indexed phase automaton (PA) is a tuple (Q, Id, Σ, δ), where Q is a finite, totally ordered set of states {q1, . . . , q|Q|}, Id is a finite set of thread identifiers, Σ is a finite alphabet, and δ ⊆ Q × Id × Σ × Q is a transition relation. The transition relation δ is restricted to respect the order on states: for each transition (qx, i, a, qy) ∈ δ, either y = x or y = x + 1. We call a transition of the form (qx, i, a, qx+1) a phase transition. The initial state is q1, and the final state is q|Q|.

The restriction on δ in Defn. 2 ensures that the only loops in a PA are "self-loops" on states. We assume that for every x, 1 ≤ x < |Q|, there is only one phase transition of the form (qx, i, a, qx+1) ∈ δ. (A PA that has multiple such transitions can be factored into a set of PAs, each of which satisfies this property.) Finally, we only consider PAs that recognize a non-empty language, which means that a PA must have exactly (|Q| − 1) phase transitions.

For the rest of this paper we consider 2-PDSs, and fix Π = (P1, P2, Locks, Σ) and A = (Q, Id, Σ, δ); however, the techniques easily generalize to multi-PDSs (see [13, App. B]), and our implementation is for the generic case. Given Π and A, the model-checking problem of interest is to determine if there is an execution that begins at the initial global configuration g0 that drives A to its final state.
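The following sketch (again with hypothetical names, and only under the assumptions stated in the comments) shows one direct encoding of Defn. 2, including enforcement of the self-loop/phase-transition restriction on δ and the nondeterministic step relation that underlies the "guessing" of §2:

    import java.util.*;

    // Illustrative sketch of an indexed phase automaton (Defn. 2).
    class PhaseAutomaton {
        record Trans(int from, int tid, String action, int to) {}
        final int numStates;                   // states q_1 .. q_|Q|
        final List<Trans> delta = new ArrayList<>();

        PhaseAutomaton(int numStates) { this.numStates = numStates; }

        void add(int from, int tid, String action, int to) {
            if (to != from && to != from + 1)  // restriction on delta
                throw new IllegalArgumentException("only self-loops or phase transitions");
            delta.add(new Trans(from, tid, action, to));
        }

        // One nondeterministic step: all states reachable from 'current'
        // when thread 'tid' performs 'action'.
        Set<Integer> step(Set<Integer> current, int tid, String action) {
            Set<Integer> next = new HashSet<>();
            for (Trans t : delta)
                if (current.contains(t.from()) && t.tid() == tid && t.action().equals(action))
                    next.add(t.to());
            return next;
        }

        boolean accepts(Set<Integer> states) { return states.contains(numStates); }
    }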

4 Path Incompatibility

The decision procedure analyzes the PDSs of Π independently, and then checks if there exists a run from each PDS that can be performed in interleaved parallel fashion under the lock-constrained transitions of Π. To do this, it makes use of a decomposition result, due to Kahlon and Gupta [7, Thm. 1], which we now review.

Suppose that PDS Pk, for k ∈ {1, 2}, when started in (single-PDS) configuration ck and executed alone, is able to reach configuration c′k using the rule sequence ρk. Let LocksHeld(Pk, (b1, b2, o1, . . . , oL)) denote {li | oi = k}; i.e., the set of locks held by PDS Pk at global configuration (b1, b2, o1, . . . , oL). Along a rule sequence ρk, and for an initially-held lock li and finally-held lock lf, we say that the initial release of li is the first release of li, and that the final acquisition of lf is the last acquisition of lf. Note that for execution to proceed along ρk, Pk must hold an initial set of locks at ck that is a superset of the set of initial releases along ρk; i.e., not all initially-held locks need be released. Similarly, Pk's final set of locks at c′k must be a superset of the set of final acquisitions along ρk.

Theorem 1. (Decomposition Theorem [7].) Suppose that PDS Pk, when started in configuration ck and executed alone, is able to reach configuration c′k using the rule sequence ρk. For Π = (P1, P2, Locks, Σ), there does not exist an interleaving of paths ρ1 and ρ2 from global configuration (c1, c2, o1, . . . , oL) to global configuration (c′1, c′2, o′1, . . . , o′L) iff one or more of the following five conditions hold:

1. LocksHeld(P1, (c1, c2, o1, . . . , oL)) ∩ LocksHeld(P2, (c1, c2, o1, . . . , oL)) ≠ ∅
2. LocksHeld(P1, (c′1, c′2, o′1, . . . , o′L)) ∩ LocksHeld(P2, (c′1, c′2, o′1, . . . , o′L)) ≠ ∅
3. In ρ1, P1 releases lock li before it initially releases lock lj, and in ρ2, P2 releases lj before it initially releases lock li.
4. In ρ1, P1 acquires lock li after its final acquisition of lock lj, and in ρ2, P2 acquires lock lj after its final acquisition of lock li.
5. (a) In ρ1, P1 acquires or uses a lock that is held by P2 throughout ρ2, or (b) in ρ2, P2 acquires or uses a lock that is held by P1 throughout ρ1.

Intuitively, items 3 and 4 capture cycles in the dependence graph of lock operations: a cycle is a proof that there does not exist any interleaving of rule sequences ρ1 and ρ2 that adheres to the lock-constrained semantics of Π. If there is a cycle, then ρ1 (ρ2) can complete execution but not ρ2 (ρ1), or neither can complete because of a deadlock. The remaining items model standard lock semantics: only one thread may hold a lock at a given time.

5 Extracting Information from PDS Rule Sequences

To employ Thm. 1, we now develop methods to extract relevant information from a rule sequence ρk for PDS Pk. As in many program-analysis problems that involve matched operations [18]—in our case, lock-acquire and lock-release—it is useful to consider semi-Dyck languages [19]: languages of matched parentheses in which each parenthesis symbol is one-sided. That is, the symbols "(" and ")" match in the string "()", but do not match in ")(".⁴ Let Σ be a finite alphabet of non-parenthesis symbols. The semi-Dyck language of well-balanced parentheses over Σ ∪ {(i, )i | 1 ≤ i ≤ L} can be defined by the following context-free grammar, where e denotes a member of Σ:

    matched → ε | e matched | (i matched )i matched    [for 1 ≤ i ≤ L]

Because we are interested in paths that can begin and end while holding a set of locks, we define the following partially-matched parenthesis languages:

    unbalR → ε | unbalR matched )i
    unbalL → ε | (i matched unbalL

The language of words that are possibly unbalanced on each end is defined by

    suffixPrefix → unbalR matched unbalL

Example 1. Consider the following suffixPrefix string, in which the positions between symbols are marked A–W (A before the first symbol, W after the last):

    )1 (2 )2 )3 (2 (4 (5 )5 )4 (6 )6 )2 )7 (6 )6 (4 (2 )2 (2 (7 )7 (8

Its unbalR, matched, and unbalL components are the substrings A–N, N–P, and P–W, respectively; i.e., unbalR = )1 (2 )2 )3 (2 (4 (5 )5 )4 (6 )6 )2 )7, matched = (6 )6, and unbalL = (4 (2 )2 (2 (7 )7 (8.

Let wk ∈ L(suffixPrefix) be the word formed by concatenating the action symbols of the rule sequence ρk. One can see that to use Thm. 1, we merely need to extract the relevant information from wk. That is, items 3 and 4 require extracting (or recording) information from the unbalR and unbalL portions of wk, respectively; item 5 requires extracting information from the matched portion of wk; and items 1 and 2 require extracting information from the initial and final parse configurations of wk. The information is obtained using acquisition histories (AH) and release histories (RH) for locks, as well as ρk's release set (R), use set (U), acquisition set (A), and held-throughout set (HT).

– The acquisition history (AH) [7] for a finally-held lock li is the union of the set {li} with the set of locks that are acquired (or acquired and released) after the final acquisition of li.⁵
– The release history (RH) [7] of an initially-held lock li is the union of the set {li} with the set of locks that are released (or acquired and released) before the initial release of li.
– The release set (R) is the set of initially-released locks.

– The use set (U) is the set of locks that form the matched part of wk.
– The acquisition set (A) is the set of finally-acquired locks.
– The held-throughout set (HT) is the set of initially-held locks that are not released.

⁴ The language of interest is in fact regular because the locks are non-reentrant. However, the semi-Dyck formulation provides insight into how one extracts the relevant information from a rule sequence.
⁵ This is a slight variation from [7]; we include li in the acquisition history of lock li.

A lock history is a six-tuple (R, RH⃗, U, AH⃗, A, HT), where R, U, A, and HT are the sets defined above, and AH⃗ (RH⃗) is a tuple of L acquisition (release) histories, one for each lock li, 1 ≤ i ≤ L. Let ρ = [r1, . . . , rn] be a rule sequence that drives a PDS from some starting configuration to an ending configuration, and let I be the set of locks held at the beginning of ρ. We define an abstraction function η(ρ, I) from rule sequences and initially-held locks to lock histories; η(ρ, I) uses an auxiliary function, post, which tracks R, RH⃗, U, AH⃗, A, and HT for each successively longer prefix:

    η([], I) = (∅, ∅^L, ∅, ∅^L, ∅, I)
    η([r1, . . . , rn], I) = post(η([r1, . . . , rn−1], I), act(rn)),

where post((R, RH⃗, U, AH⃗, A, HT), a) is defined by cases on a:

– If a ∉ {(i, )i}: post returns (R, RH⃗, U, AH⃗, A, HT) unchanged.
– If a = (i: post returns (R, RH⃗, U, AH⃗′, A ∪ {li}, HT), where
      AH⃗′[j] = {li} if j = i;  AH⃗[j] ∪ {li} if j ≠ i and lj ∈ A;  ∅ if j ≠ i and lj ∉ A.
– If a = )i and li ∈ A: post returns (R, RH⃗, U ∪ {li}, AH⃗′, A \ {li}, HT \ {li}), where
      AH⃗′[j] = ∅ if j = i;  AH⃗[j] otherwise.
– If a = )i and li ∉ A: post returns (R ∪ {li}, RH⃗′, U, AH⃗, A, HT \ {li}), where
      RH⃗′[j] = {li} ∪ U ∪ R if j = i;  RH⃗[j] otherwise.

Example 2. Suppose that ρ is a rule sequence whose labels spell out the string from Example 1, and I = {1, 3, 7, 9}. Then η(ρ, I) returns the following lock history (only lock indices are written):

    ( {1, 3, 7},
      ⟨{1}, ∅, {1, 2, 3}, ∅, ∅, ∅, {1, 2, 3, 4, 5, 6, 7}, ∅, ∅⟩,
      {6},
      ⟨∅, {2, 7, 8}, ∅, {2, 4, 7, 8}, ∅, ∅, ∅, {8}, ∅⟩,
      {2, 4, 8},
      {9} ).
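For illustration, the post update of η translates directly into set operations; the sketch below uses an explicit-set representation and hypothetical names (the symbolic encoding actually used is described in §7):

    import java.util.*;

    // Illustrative sketch of a lock history (R, RH, U, AH, A, HT); locks are 1..L.
    class LockHistory {
        Set<Integer> R = new HashSet<>(), U = new HashSet<>(), A = new HashSet<>(), HT;
        Map<Integer,Set<Integer>> RH = new HashMap<>(), AH = new HashMap<>();

        LockHistory(Set<Integer> initiallyHeld) { HT = new HashSet<>(initiallyHeld); }

        void acquire(int i) {                     // a = (_i
            for (int j : A) AH.get(j).add(i);     // i comes after the final acq. of each l_j in A
            AH.put(i, new HashSet<>(Set.of(i)));  // AH[i] = {l_i}
            A.add(i);
        }
        void release(int i) {                     // a = )_i
            if (A.contains(i)) {                  // matched within the path
                A.remove(i); HT.remove(i);
                U.add(i);
                AH.remove(i);                     // AH[i] := empty
            } else {                              // initial release of l_i
                Set<Integer> h = new HashSet<>(U);
                h.addAll(R); h.add(i);
                RH.put(i, h);                     // RH[i] = {l_i} u U u R
                R.add(i); HT.remove(i);
            }
        }
    }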

Note: R and A are included above only for clarity; they can be recovered from RH⃗ and AH⃗ as follows: R = {i | RH⃗[i] ≠ ∅} and A = {i | AH⃗[i] ≠ ∅}. In addition, from LH = (R, RH⃗, U, AH⃗, A, HT), it is easy to see that the set I of initially-held locks is equal to (R ∪ HT), and the set of finally-held locks is equal to (A ∪ HT).

Definition 3. Lock histories LH1 = (R1, RH⃗1, U1, AH⃗1, A1, HT1) and LH2 = (R2, RH⃗2, U2, AH⃗2, A2, HT2) are compatible, denoted by Compatible(LH1, LH2), iff all of the following five conditions hold:

1. (R1 ∪ HT1) ∩ (R2 ∪ HT2) = ∅
2. (A1 ∪ HT1) ∩ (A2 ∪ HT2) = ∅
3. ∄ i, j . lj ∈ AH⃗1[i] ∧ li ∈ AH⃗2[j]
4. ∄ i, j . lj ∈ RH⃗1[i] ∧ li ∈ RH⃗2[j]
5. (A1 ∪ U1) ∩ HT2 = ∅ ∧ (A2 ∪ U2) ∩ HT1 = ∅

[Fig. 3 appears here: three copies of the PA from Fig. 2, labeled Π, 1, and 2, each showing only the phase transitions q1 -Abeg1-> q2 -R1i-> q3 -W2s-> q4 -W2i-> q5 -R1s-> q6; in rows 1 and 2, dashed boxes mark the phase transitions attributed to the other thread's guessed actions.]

Fig. 3. Π: bad interleaving of Fig. 2, showing only the actions that cause a phase transition. 1: the same interleaving from Thread 1's point of view. The dashed boxes show where Thread 1 guesses that Thread 2 causes a phase transition. 2: the same but from Thread 2's point of view and with the appropriate guesses.

Each conjunct verifies the absence of the corresponding incompatibility condition from Thm. 1: conditions 1 and 2 verify that the initially-held and finally-held locks of ρ1 and ρ2 are disjoint, respectively; conditions 3 and 4 verify the absence of cycles in the acquisition and release histories, respectively; and condition 5 verifies that ρ1 does not use a lock that is held throughout in ρ2 , and vice versa.
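Continuing the illustrative LockHistory sketch from §5 (same caveats: explicit sets, hypothetical names; the actual implementation is symbolic), Defn. 3 becomes a handful of disjointness and cycle checks:

    import java.util.*;

    // Illustrative sketch of Compatible (Defn. 3).
    class Compat {
        static boolean compatible(LockHistory h1, LockHistory h2) {
            Set<Integer> init1 = union(h1.R, h1.HT), init2 = union(h2.R, h2.HT);
            Set<Integer> fin1  = union(h1.A, h1.HT), fin2  = union(h2.A, h2.HT);
            if (!Collections.disjoint(init1, init2)) return false;   // condition 1
            if (!Collections.disjoint(fin1, fin2))   return false;   // condition 2
            if (cycle(h1.AH, h2.AH)) return false;                   // condition 3
            if (cycle(h1.RH, h2.RH)) return false;                   // condition 4
            return Collections.disjoint(union(h1.A, h1.U), h2.HT)    // condition 5
                && Collections.disjoint(union(h2.A, h2.U), h1.HT);
        }
        // Is there a pair (i, j) with l_j in H1[i] and l_i in H2[j]?
        static boolean cycle(Map<Integer,Set<Integer>> H1, Map<Integer,Set<Integer>> H2) {
            for (Map.Entry<Integer,Set<Integer>> e : H1.entrySet())
                for (int j : e.getValue()) {
                    Set<Integer> h2j = H2.get(j);
                    if (h2j != null && h2j.contains(e.getKey())) return true;
                }
            return false;
        }
        static Set<Integer> union(Set<Integer> x, Set<Integer> y) {
            Set<Integer> u = new HashSet<>(x); u.addAll(y); return u;
        }
    }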

6 The Decision Procedure

As noted in §4, the decision procedure analyzes the PDSs independently. This decoupling of the PDSs has two consequences. First, when P1 and A are considered together, independently of P2, they cannot directly "observe" the actions of P2 that cause A to take certain phase transitions. Thus, P1 must guess when P2 causes a phase transition, and vice versa for P2. An example of the guessing is shown in Fig. 3. The interleaving labeled "Π" is the bad interleaving from Fig. 2, but focuses on only the PDS actions that cause phase transitions. The interleaving labeled "1" shows, via the dashed boxes, where P1 guesses that P2 caused a phase transition. Similarly, the interleaving labeled "2" shows the guesses that P2 must make. Second, a post-processing step must be performed to ensure that only those behaviors that are consistent with the lock-constrained behaviors of Π are considered. For example, if P1 guesses that P2 performs the W2(s) action to make the PA transition from state q3 to state q4 (the dashed box for interleaving "1" in Fig. 3) while it is still executing the empty() method (see Fig. 2), the behavior is inconsistent with the semantics of Π. This is because both threads would hold the lock associated with the shared "Stack s" object. The post-processing step ensures that such behaviors are not allowed.

6.1 Combining a PDS with a PA

To define a compositional algorithm, we must be able to analyze P1 and A independently of P2, and likewise for P2 and A. Our approach is to combine A and P1 to define a new PDS P1^A using a cross-product-like construction. The main difference is that lock histories and lock-history updates are incorporated in the construction.

Recall that the goal is to determine if there exists an execution of Π that drives A to its final state. Any such execution must make |Q| − 1 phase transitions. Hence, a valid interleaved execution must be able to reach |Q| global configurations, one for each of the |Q| phases. Lock histories encode the constraints that a PDS path places on the set of possible interleaved executions of Π. A desired path of an individual PDS must also make |Q| − 1 phase transitions, and hence our algorithm keeps track of |Q| lock histories, one for each phase. This is accomplished by encoding into the state space of P1^A a tuple of |Q| lock histories. A tuple maintains the sequence of lock histories for one or more paths taken through a sequence of phases. In addition, a tuple maintains the correlation between the lock histories of each phase, which is necessary to ensure that only valid executions are considered. The rules of P1^A are then defined to update the lock-history tuple accordingly. The lock-history tuples are used later to check whether some scheduling of an execution of Π can actually perform all of the required phase transitions.

Let LH denote the set of all lock histories, and let LH⃗ = LH^|Q| denote the set of all tuples of lock histories of length |Q|. We denote a typical lock history by LH, and a typical tuple of lock histories by LH⃗; LH⃗[i] denotes the i-th component of LH⃗. Our construction makes use of the phase-transition function on LHs defined as follows: ptrans((R, RH⃗, U, AH⃗, A, HT)) = (∅, ∅^L, ∅, ∅^L, ∅, A ∪ HT). This function is used to encode the start of a new phase: the set of initially-held locks is the set of locks held at the end of the previous phase.

Let Pi = (Pi, Acti, Γi, ∆i, p0, γ0) be a PDS, Locks be a set of locks of size L, A = (Q, Id, Σ, δ) be a PA, and LH⃗ be a tuple of lock histories of length |Q|. We define the PDS Pi^A = (Pi^A, ∅, Γi, ∆i^A, p0^A, γ0), where Pi^A ⊆ Pi × Q × LH^|Q|. The initial control state is p0^A = (p0, q1, LH⃗∅), where LH⃗∅ is the lock-history tuple (∅, ∅^L, ∅, ∅^L, ∅, ∅)^|Q|. Each rule r ∈ ∆i^A performs only a single update to the tuple LH⃗, at an index x determined by a transition in δ. The update is denoted by LH⃗[x → e], where e evaluates to an LH. Two kinds of rules are introduced to account for whether a transition in δ is a phase transition or not:

1. Non-phase Transitions: LH⃗′ = LH⃗[x → post(LH⃗[x], a)].
   (a) For each rule ⟨p, γ⟩ ↪−a→ ⟨p′, u⟩ ∈ ∆i and transition (qx, i, a, qx) ∈ δ, there is a rule r = ⟨(p, qx, LH⃗), γ⟩ ↪→ ⟨(p′, qx, LH⃗′), u⟩ ∈ ∆i^A.
   (b) For each rule ⟨p, γ⟩ ↪−a→ ⟨p′, u⟩ ∈ ∆i, a ∈ {(k, )k}, and each qx ∈ Q, there is a rule r = ⟨(p, qx, LH⃗), γ⟩ ↪→ ⟨(p′, qx, LH⃗′), u⟩ ∈ ∆i^A.
2. Phase Transitions: LH⃗′ = LH⃗[(x + 1) → ptrans(LH⃗[x])].
   (a) For each rule ⟨p, γ⟩ ↪−a→ ⟨p′, u⟩ ∈ ∆i and transition (qx, i, a, qx+1) ∈ δ, there is a rule r = ⟨(p, qx, LH⃗), γ⟩ ↪→ ⟨(p′, qx+1, LH⃗′), u⟩ ∈ ∆i^A.

   (b) For each transition (qx, j, a, qx+1) ∈ δ, j ≠ i, and for each p ∈ Pi and γ ∈ Γi, there is a rule r = ⟨(p, qx, LH⃗), γ⟩ ↪→ ⟨(p, qx+1, LH⃗′), γ⟩ ∈ ∆i^A.

Rules defined by item 1(a) make sure that Pi^A is constrained to follow the self-loops on PA state qx. Rules defined by item 1(b) allow Pi^A to perform lock acquires and releases. Recall that the language of a PA is only over the non-parenthesis alphabet Σ, and does not constrain the locking behavior. Consequently, a phase transition cannot occur when Pi^A is acquiring or releasing a lock. Rules defined by item 2(a) handle phase transitions caused by Pi^A. Finally, rules defined by item 2(b) implement Pi^A's guessing that another PDS Pj^A, j ≠ i, causes a phase transition, in which case Pi^A has to move to the next phase as well.

6.2 Checking Path Compatibility

Checking Path Compatibility

For a generated PDS PkA , we are interested in the set of paths that begin in the initial configuration pA 0 , γ0  and drive A to its final state q|Q| . Each such  k ), u, where u ∈ Γ ∗ . Let ρ1 and path ends in some configuration (pk , q|Q| , LH A A ρ2 be such paths from P1 and P2 , respectively. To determine if there exists a compatible scheduling for ρ1 and ρ2 , we use Thm. 1 on each component of the  2 from the ending configurations of ρ1 and ρ2 :  1 and LH lock-history tuples LH  2 ) ⇐⇒  1 , LH Compatible(LH

|Q|

i=1

 1 [i], LH  2 [i]). Compatible(LH
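In the explicit-set sketch introduced earlier (same caveats), this tuple-wise check is simply a conjunction over the |Q| phases:

    // Illustrative sketch: Compatible lifted to tuples of lock histories,
    // one entry per phase, reusing the hypothetical Compat class above.
    static boolean compatibleTuples(LockHistory[] lh1, LockHistory[] lh2) {
        for (int i = 0; i < lh1.length; i++)        // i ranges over the |Q| phases
            if (!Compat.compatible(lh1[i], lh2[i])) return false;
        return true;
    }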

Due to recursion, P1^A and P2^A could each have an infinite number of such paths. However, each path is abstracted as a tuple of lock histories LH⃗, and there are only a finite number of tuples in LH^|Q|; thus, we only have to check a finite number of (LH⃗1, LH⃗2) pairs. For each PDS P^A = (P^A, Act, Γ, ∆, c0^A), we can identify the set of relevant LH⃗ tuples by computing the set of all configurations that are reachable starting from the initial configuration, post*_{P^A}(c0^A), using standard automata-based PDS techniques [11,12]. (Because the initial configuration is defined by the PDS P^A, henceforth we merely write post*_{P^A}.) That is, because the construction of P^A removed all labels, we can create a P-(multi)-automaton [11] A_post* that accepts exactly the set of configurations post*_{P^A}.

Alg. 1 gives the algorithm to check whether Π can drive A to its final state.

Algorithm 1: The decision procedure.
  input : A 2-PDS Π = (P1, P2, Locks, Σ) and a PA A.
  output: true if Π can drive A to its final state.
  1  let A1_post* ← post*_{P1^A}; let A2_post* ← post*_{P2^A};
  2  foreach p1 ∈ P1, LH⃗1 s.t. ∃u1 ∈ Γ1*: ⟨(p1, q|Q|, LH⃗1), u1⟩ ∈ L(A1_post*) do
  3      foreach p2 ∈ P2, LH⃗2 s.t. ∃u2 ∈ Γ2*: ⟨(p2, q|Q|, LH⃗2), u2⟩ ∈ L(A2_post*) do
  4          if Compatible(LH⃗1, LH⃗2) then
  5              return true;
  6  return false;

The two tests of the form "∃uk ∈ Γk*: ⟨(pk, q|Q|, LH⃗k), uk⟩ ∈ L(Ak_post*)" can be performed by finding any path in Ak_post* from state (pk, q|Q|, LH⃗k) to the final state.

Theorem 2. For 2-PDS Π = (P1, P2, Locks, Σ) and PA A, there exists an execution of Π that drives A to its final state iff Alg. 1 returns true.

Proof. See [13, App. D.1]. □

7 A Symbolic Implementation

Alg. 1 solves the multi-PDS model-checking problem for PAs. However, an implementation based on symbolic techniques is required, because it would be infeasible to perform the final explicit enumeration step specified in Alg. 1, lines 2–5. One possibility is to use Schwoon's BDD-based PDS techniques [16]; these represent the transitions of a PDS's control-state from one configuration to another as a relation, using BDDs. This approach would work with relations over Q × LH, which requires using |Q|²|LH|² BDD variables, where |LH| = 2L + 2L². This section describes a more economical encoding that needs only (|Q| + 1)|LH| BDD variables. Our approach leverages the fact that when a property is specified with a phase automaton, once a PDS makes a phase transition from qx to qx+1, the first x entries in LH⃗ tuples are no longer subject to change. In this situation, Schwoon's encoding contains redundant information; our technique eliminates this redundancy. We explain the more economical approach by defining a suitable weight domain for use with a weighted PDS (WPDS) [4,20].

A WPDS W = (P, S, f) is a PDS P = (P, Act, Γ, ∆, c0) augmented with a bounded idempotent semiring S = (D, ⊗, ⊕, 1, 0) (see [13, App. C]), and a function f : ∆ → D that assigns a semiring element d ∈ D to each rule r ∈ ∆. When working with WPDSs, the result of a post* computation is a weighted automaton. For the purposes of this paper, we view the weighted automaton A_post* = post*_W as a function from a regular set of configurations C to the sum-over-all-paths from c0 to all c ∈ C; i.e., A_post*(C) = ⊕{v | ∃c ∈ C : c0 −r1...rn→ c, v = f(r1) ⊗ . . . ⊗ f(rn)}, where r1 . . . rn is a sequence of rules that transforms c0 into c. For efficient algorithms for computing both A_post* and A_post*(C), see [4,20].

Definition 4. Let S be a finite set; let A ⊆ S^{m+1} and B ⊆ S^{p+1} be relations of arity m + 1 and p + 1, respectively. The generalized relational composition of A and B, denoted by "A ; B", is the following subset of S^{m+p}:

    A ; B = {⟨a1, . . . , am, b2, . . . , b_{p+1}⟩ | ⟨a1, . . . , am, x⟩ ∈ A ∧ ⟨x, b2, . . . , b_{p+1}⟩ ∈ B}.

Definition 5. Let S be a finite set, and θ be the maximum number of phases of interest. The set of all θ-term formal power series over z, with relation-valued coefficients of different arities, is

    RFPS[S, θ] = { Σ_{i=0}^{θ−1} ci z^i | ci ⊆ S^{i+2} }.
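A naive explicit-set reading of Defn. 4 (our names; a real implementation would represent these relations with BDDs, as the tool does via BuDDy) is:

    import java.util.*;

    // Illustrative sketch of generalized relational composition (Defn. 4):
    // A subset of S^{m+1} composed with B subset of S^{p+1} yields A;B subset of S^{m+p}.
    class RelComp {
        static Set<List<Integer>> compose(Set<List<Integer>> A, Set<List<Integer>> B) {
            Set<List<Integer>> result = new HashSet<>();
            for (List<Integer> a : A)
                for (List<Integer> b : B)
                    // join a's last component with b's first component
                    if (a.get(a.size() - 1).equals(b.get(0))) {
                        List<Integer> t = new ArrayList<>(a.subList(0, a.size() - 1));
                        t.addAll(b.subList(1, b.size()));
                        result.add(t);
                    }
            return result;
        }
    }

In Defn. 5 below, this composition is what multiplies coefficients: for monomials, (a z^i) × (b z^j) = (a ; b) z^{i+j}, and any term of degree ≥ θ is dropped.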

A monomial is written as ci z^i (all other coefficients are understood to be ∅); a monomial c0 z^0 denotes a constant. The multi-arity relational weight domain over S and θ is defined by (RFPS[S, θ], ×, +, Id, ∅), where × is polynomial multiplication in which generalized relational composition and ∪ are used to multiply and add coefficients, respectively, and terms cj z^j for j ≥ θ are dropped; + is polynomial addition using ∪ to add coefficients; Id is the constant {⟨s, s⟩ | s ∈ S} z^0; and ∅ is the constant ∅ z^0.

We now define the WPDS Wi = (Pi^W, S, f) that results from taking the product of PDS Pi = (Pi, Acti, Γi, ∆i, p0, γ0) and phase automaton A = (Q, Id, Σ, δ). The construction is similar to that in §6.1; i.e., a cross product is performed that pairs the control states of Pi with the state space of A. The difference is that the lock-history tuples are removed from the control state, and instead are modeled by S, the multi-arity relational weight domain over the finite set LH and θ = |Q|. We define Pi^W = (Pi × Q, ∅, Γi, ∆i^W, (p0, q1), γ0), where ∆i^W and f are defined as follows:

1. Non-phase Transitions: f(r) = {⟨LH1, LH2⟩ | LH2 = post(LH1, a)} z^0.
   (a) For each rule ⟨p, γ⟩ ↪−a→ ⟨p′, u⟩ ∈ ∆i and transition (qx, i, a, qx) ∈ δ, there is a rule r = ⟨(p, qx), γ⟩ ↪→ ⟨(p′, qx), u⟩ ∈ ∆i^W.
   (b) For each rule ⟨p, γ⟩ ↪−a→ ⟨p′, u⟩ ∈ ∆i, a ∈ {(k, )k}, and for each qx ∈ Q, there is a rule r = ⟨(p, qx), γ⟩ ↪→ ⟨(p′, qx), u⟩ ∈ ∆i^W.
2. Phase Transitions: f(r) = {⟨LH, LH, ptrans(LH)⟩ | LH ∈ LH} z^1.
   (a) For each rule ⟨p, γ⟩ ↪−a→ ⟨p′, u⟩ ∈ ∆i and transition (qx, i, a, qx+1) ∈ δ, there is a rule r = ⟨(p, qx), γ⟩ ↪→ ⟨(p′, qx+1), u⟩ ∈ ∆i^W.
   (b) For each transition (qx, j, a, qx+1) ∈ δ, j ≠ i, and for each p ∈ Pi and γ ∈ Γi, there is a rule r = ⟨(p, qx), γ⟩ ↪→ ⟨(p, qx+1), γ⟩ ∈ ∆i^W.

A multi-arity relational weight domain is parameterized by the quantity θ—the maximum number of phases of interest—which we have picked to be |Q|. We must argue that weight operations performed during model checking do not cause this threshold to be exceeded. For configuration ⟨(p, qx), u⟩ to be reachable from the initial configuration ⟨(p0, q1), γ0⟩ of some WPDS Wi, PA A must make a sequence of transitions from states q1 to qx, which means that A goes through exactly x − 1 phase transitions. Each phase transition multiplies by a weight of the form c1 z^1; hence, the weight returned by A_post*({⟨(p, qx), u⟩}) is a monomial of the form c_{x−1} z^{x−1}. The maximum number of phases in a PA is |Q|, and thus the highest-power monomial that arises is of the form c_{|Q|−1} z^{|Q|−1}. (Moreover, during post*_{Wk} as computed by the algorithm from [20], only monomial-valued weights ever arise.)

Alg. 2 states the algorithm for solving the multi-PDS model-checking problem for PAs. Note that the final step of Alg. 2 can be performed with a single BDD operation.

Algorithm 2: The symbolic decision procedure.
  input : A 2-PDS (P1, P2, Locks, Σ) and a PA A.
  output: true if there is an execution that drives A to the accepting state.
  1  let A1_post* ← post*_{W1}; let A2_post* ← post*_{W2};
  2  let c1_{|Q|−1} z^{|Q|−1} = A1_post*({⟨(p1, q|Q|), u⟩ | p1 ∈ P1 ∧ u ∈ Γ1*});
  3  let c2_{|Q|−1} z^{|Q|−1} = A2_post*({⟨(p2, q|Q|), u⟩ | p2 ∈ P2 ∧ u ∈ Γ2*});
  4  return ∃⟨LH0, LH⃗1⟩ ∈ c1_{|Q|−1}, ⟨LH0, LH⃗2⟩ ∈ c2_{|Q|−1} : Compatible(LH⃗1, LH⃗2);

Theorem 3. For 2-PDS Π = (P1, P2, Locks, Σ) and PA A, there exists an execution of Π that drives A to the accepting state iff Alg. 2 returns true.

Proof. See [13, App. D.2]. □
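Since only monomial-valued weights arise during post*, the weight operations of Defn. 5 reduce to the following sketch (explicit sets and hypothetical names again, reusing the RelComp sketch above; a real implementation keeps the coefficients as BDDs):

    import java.util.*;

    // Illustrative sketch of x and + from Defn. 5, restricted to monomials.
    class Rfps {
        record Monomial(Set<List<Integer>> coeff, int degree) {}
        static final Monomial ZERO = new Monomial(Set.of(), 0);  // the constant (empty) z^0

        // (a z^i) x (b z^j) = (a ; b) z^(i+j); terms of degree >= theta are dropped.
        static Monomial times(Monomial a, Monomial b, int theta) {
            int d = a.degree() + b.degree();
            if (d >= theta) return ZERO;
            return new Monomial(RelComp.compose(a.coeff(), b.coeff()), d);
        }

        // + adds coefficients of equal degree with set union.
        static Monomial plus(Monomial a, Monomial b) {
            if (a.degree() != b.degree())
                throw new IllegalArgumentException("not monomial-summable in this sketch");
            Set<List<Integer>> u = new HashSet<>(a.coeff());
            u.addAll(b.coeff());
            return new Monomial(u, a.degree());
        }
    }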

8 Experiments

Our experiment concerned detecting AS-serializability violations (or proving their absence) in models of concurrent Java programs. The experiment was designed to compare the performance of Alg. 2 against that of the communicating-pushdown-system (CPDS) semi-decision procedure from [10]. Alg. 2 was implemented using the Wali WPDS library [21] (the multi-arity relational weight domain is included in the Wali release 3.0). The weight domain uses the BuDDy BDD library [22]. All experiments were run on a dual-core 3 GHz Pentium Xeon processor with 4 GB of memory.

We analyzed four Java programs from the ConTest benchmark suite [17]. Our tool requires that the allocation site of interest be annotated in the source program. We annotated eleven of the twenty-seven programs that the ConTest documentation identifies as having "non-atomic" bugs. Our front-end currently handles eight of the eleven (the AST rewriting of [10] currently does not support certain Java constructs). Finally, after abstraction, four of the eight EML models did not use locks, so we did not analyze them further. The four that we used in our study are SoftwareVerificationHW, BugTester, BuggyProgram, and shop.

For each program, the front-end of the Empire tool [10] was used to create an EML program. An EML program has a set of shared-memory locations S_Mem, a set of locks S_Locks, and a set of EML processes S_Procs. Five of the fourteen PAs used for detecting AS-serializability violations check behaviors that involve a single shared-memory location; the other nine check behaviors that involve a pair of shared-memory locations. For each of the five PAs that involve a single shared location, we ran one query for each m ∈ S_Mem. For each of the nine PAs that involve a pair of shared locations, we ran one query for each (m1, m2) ∈ S_Mem × S_Mem. In total, each tool ran 2,147 queries.

Fig. 4 shows log-log scatter-plots of the execution times, classified into the 43 queries for which Alg. 2 reported an AS-serializability violation (left-hand graph), and the 2,095 queries for which Alg. 2 verified correctness (right-hand graph). Although the CPDS-based method is a semi-decision procedure, it is capable of both (i) verifying correctness, and (ii) finding AS-serializability violations [10].

[Fig. 4 appears here: two log-log scatter-plots, "AS-Violation" (left) and "No AS-Violation" (right), plotting Alg. 2 time in seconds (y-axis, 2^−4 to 2^8) against CPDS time in seconds (x-axis, 2^−4 to 2^8).]

Fig. 4. Log-log scatter-plots of the execution times of Alg. 2 (y-axis) versus the CPDS semi-decision procedure [10] (x-axis). The dashed lines denote equal running times; points below and to the right of the dashed lines are runs for which Alg. 2 was faster. The timeout threshold was 200 seconds; the minimal reported time is .25 seconds. The vertical bands near the right-hand axes represent queries for which the CPDS semidecision procedure timed out. (The horizontal banding is due to the fact that, for a given program, Alg. 2 often has similar performance for many queries.)

(The third possibility is that it times out.) Comparing the total time to run all queries, Alg. 2 ran 7.5X faster (136,235 seconds versus 17,728 seconds). The CPDS-based method ran faster than Alg. 2 on some queries, although never more than about 8X faster; in contrast, Alg. 2 was more than two orders of magnitude faster on some queries.

Moreover, the CPDS-based method timed out on about 68% of the queries—both for the ones for which Alg. 2 reported an AS-serializability violation (29 timeouts out of 43 queries), as well as the ones for which Alg. 2 verified correctness (1,425 timeouts out of 2,095 queries). Alg. 2 exceeded the 200-second timeout threshold on nine queries. The CPDS-based method also timed out on those queries. When rerun with no timeout threshold, Alg. 2 solved each of the nine queries in 205–231 seconds.

                Query Category
Impl.    CPDS succeeded,     CPDS timed out,
         Alg. 2 succeeded    Alg. 2 succeeded
         (685 of 2,147)      (1,453 of 2,147)
CPDS          6,006              130,229
Alg. 2        2,428               15,310

Fig. 5. Total time (in seconds) for examples classified according to whether CPDS succeeded or timed out

Fig. 5 partitions the examples according to whether CPDS succeeded or timed out. The 1,453 examples on which CPDS timed out (col. 3 of Fig. 5) might be said to represent "harder" examples. Alg. 2 required 15,310 seconds for these, which is about 3X more than the 1,453/685 × 2,428 = 5,150 seconds expected if the queries in the two categories were of equal difficulty for Alg. 2. Roughly speaking, therefore, the data supports the conclusion that what is harder for CPDS is also harder for Alg. 2.

9

Related Work

The present paper introduces a different technique than that used by Kahlon and Gupta [7]. To decide the model-checking problem for PAs (as well as certain generalizations not discussed here), one needs to check pairwise reachability of multiple global configurations in succession. Our algorithm uses WPDS weights that are sets of lock-history tuples, whereas Kahlon and Gupta use sets of pairs of configuration automata. There are similarities between the kind of splitting step needed by Qadeer and Rehof to enumerate states at a context switch [1] in context-bounded model checking and the splitting step on sets of automaton-pairs needed in the algorithm of Kahlon and Gupta [7] to enumerate compatible configuration pairs [15]. Kahlon and Gupta’s algorithm performs a succession of pre∗ queries; after each one, it splits the resulting set of automaton-pairs to enforce the invariant that succeeding queries are only applied to compatible configuration pairs. In contrast, our algorithm (i) analyzes each PDS independently using one post∗ query per PDS, and then (ii) ties together the answers obtained from the different PDSs by performing a single compatibility check on the sets of lock-history tuples that result. Because our algorithm does not need a splitting step on intermediate results, it avoids enumerating compatible configuration pairs, thereby enabling BDD-based symbolic representations to be used throughout. The Kahlon-Gupta decision procedure has not been implemented [15], so a direct performance comparison was not possible. It is left for future work to determine whether our approach can be applied to the decidable sub-logics of LTL identified in [7]. Our approach of using sets of tuples is similar in spirit to the use of matrix [2] and tuple [3] representations to address context-bounded model checking [1]. In this paper, we bound the number of phases, but permit an unbounded number of context switches and an unbounded number of lock acquisitions and releases by each PDS. The decision procedure is able to explore the entire state space of the model; thus, our algorithm is able to verify properties of multi-PDSs instead of just performing bug detection. Dynamic pushdown networks (DPNs) [23] extend parallel PDSs with the ability to create threads dynamically. Lammich et al. [24] present a generalization of acquisition histories to DPNs with well-nested locks. Their algorithm uses chained pre∗ queries, an explicit encoding of acquisition histories in the state space, and is not implemented.

References 1. Qadeer, S., Rehof, J.: Context-bounded model checking of concurrent software. In: Halbwachs, N., Zuck, L.D. (eds.) TACAS 2005. LNCS, vol. 3440, pp. 93–107. Springer, Heidelberg (2005)

142

N. Kidd et al.

2. Lal, A., Touili, T., Kidd, N., Reps, T.: Interprocedural analysis of concurrent programs under a context bound. In: Ramakrishnan, C.R., Rehof, J. (eds.) TACAS 2008. LNCS, vol. 4963, pp. 282–298. Springer, Heidelberg (2008) 3. Lal, A., Reps, T.: Reducing concurrent analysis under a context bound to sequential analysis. In: Gupta, A., Malik, S. (eds.) CAV 2008. LNCS, vol. 5123, pp. 37–51. Springer, Heidelberg (2008) 4. Bouajjani, A., Esparza, J., Touili, T.: A generic approach to the static analysis of concurrent programs with procedures. In: POPL (2003) 5. Chaki, S., Clarke, E., Kidd, N., Reps, T., Touili, T.: Verifying concurrent messagepassing C programs with recursive calls. In: Hermanns, H., Palsberg, J. (eds.) TACAS 2006. LNCS, vol. 3920, pp. 334–349. Springer, Heidelberg (2006) 6. Kahlon, V., Ivancic, F., Gupta, A.: Reasoning about threads communicating via locks. In: Etessami, K., Rajamani, S.K. (eds.) CAV 2005. LNCS, vol. 3576, pp. 505–518. Springer, Heidelberg (2005) 7. Kahlon, V., Gupta, A.: On the analysis of interacting pushdown systems. In: POPL (2007) 8. Vaziri, M., Tip, F., Dolby, J.: Associating synchronization constraints with data in an object-oriented language. In: POPL (2006) 9. Flanagan, C., Qadeer, S.: A type and effect system for atomicity. In: PLDI (2003) 10. Kidd, N., Reps, T., Dolby, J., Vaziri, M.: Finding concurrency-related bugs using random isolation. In: Jones, N.D., M¨ uller-Olm, M. (eds.) VMCAI 2009. LNCS, vol. 5403, pp. 198–213. Springer, Heidelberg (2009) 11. Bouajjani, A., Esparza, J., Maler, O.: Reachability analysis of pushdown automata: Application to model checking. In: Mazurkiewicz, A., Winkowski, J. (eds.) CONCUR 1997. LNCS, vol. 1243, pp. 135–150. Springer, Heidelberg (1997) 12. Finkel, A., Willems, B.: A direct symbolic approach to model checking pushdown systems. Elec. Notes in Theor. Comp. Sci., vol. 9 (1997) 13. Kidd, N., Lammich, P., Touili, T., Reps, T.: A decision procedure for detecting atomicity violations for communicating processes with locks. Technical Report 1649r, Univ. of Wisconsin (April 2009), http://www.cs.wisc.edu/wpis/abstracts/tr1649.abs.html 14. Kidd, N., Lal, A., Reps, T.: Language strength reduction. In: Alpuente, M., Vidal, G. (eds.) SAS 2008. LNCS, vol. 5079, pp. 283–298. Springer, Heidelberg (2008) 15. Kahlon, V., Gupta, A.: Personal communication (January 2009) 16. Schwoon, S.: Model-Checking Pushdown Systems. PhD thesis, TUM (2002) 17. Eytani, Y., Havelund, K., Stoller, S.D., Ur, S.: Towards a framework and a benchmark for testing tools for multi-threaded programs. Conc. and Comp. Prac. and Exp. 19(3) (2007) 18. Reps, T.: Program analysis via graph reachability. Inf. and Softw. Tech. 40 (1998) 19. Harrison, M.: Introduction to Formal Language Theory. Addison-Wesley, Reading (1978) 20. Reps, T., Schwoon, S., Jha, S., Melski, D.: Weighted pushdown systems and their application to interprocedural dataflow analysis. SCP 58 (2005) 21. Kidd, N., Lal, A., Reps, T.: WALi: The Weighted Automaton Library (February 2009), http://www.cs.wisc.edu/wpis/wpds/download.php 22. BuDDy: A BDD package (July 2004), http://buddy.wiki.sourceforge.net/ 23. Bouajjani, A., M¨ uller-Olm, M., Touili, T.: Regular symbolic analysis of dynamic networks of pushdown systems. In: Abadi, M., de Alfaro, L. (eds.) CONCUR 2005. LNCS, vol. 3653, pp. 473–487. Springer, Heidelberg (2005) 24. Lammich, P., M¨ uller-Olm, M., Wenner, A.: Predecessor sets of dynamic pushdown networks with tree-regular constraints. In: CAV (2009) (to appear)

Eclipse Plug-In for Spin and st2msc Tools-Tool Presentation Tim Kovˇse, Boˇstjan Vlaoviˇc, Aleksander Vreˇze, and Zmago Brezoˇcnik Faculty of Electrical Engineering and Computer Science, University of Maribor, Maribor, Slovenia {tim.kovse,bostjan.vlaovic,aleksander.vreze,brezocnik}@uni-mb.si

Abstract. In this article we present an Eclipse plug-in for Spin and st2msc tools. The plug-in can be used to edit a Promela model, run the formal verification of the model, and generate optimized MSC of the Spin trail by st2msc. It simplifies handling with extensive Promela models to a great extent.

1

Introduction

In an ideal development environment, each change in the specification of a product would be formally checked immediately against the requirement specifications. This is a difficult task. To verify a real-life system it must usually be converted into a simpler “verifiable” format—the model of the (sub)system. Next, each requirement should be formally described in such a way that developers fully understand its true meaning. If we use one of the temporal logics, this goal is not always easy to achieve. Finally, we check if the model satisfies the requirements. Several formal verification techniques and tools are available. We use the Simple Promela Interpreter model checker (Spin) [1], which suits our needs. The final research goal is to build a framework for the systematic use of model checking in the software development process of our industrial partners. We focus on Specification and Description Language (SDL) specifications that describe implementation details and are used to build production systems. Such specifications use SDL constructs and additional extensions that enable developers to include operators that are implemented in other programming languages. Our industrial partners use the C programming language for low-level or processor-intensive operations. Therefore, the framework should support such SDL extensions. A model can be generated manually or mechanically. Manual preparation of a model is error-prone, and a very time consuming task. Therefore, our research focuses on automating model generation from SDL specifications, and a manageable presentation of the model and its verification results. In [2,3,4] we proposed a new approach to automated model generation from SDL to Promela. The generation of a model is based on formally specified algorithms. We validate these algorithms in practice with the use of an SDL to C.S. P˘ as˘ areanu (Ed.): SPIN 2009, LNCS 5578, pp. 143–147, 2009. c Springer-Verlag Berlin Heidelberg 2009 

144

T. Kovˇse et al.

Promela (sdl2pml) tool, which implements most of our research results [5]. The applicability of our approach was tested on the implementation of an ISDN User Adaptation (IUA) protocol, which is part of the SI3000 softswitch. The specification was developed by the largest Slovenian telecommunication equipment company, Iskratel d.o.o. It consists of 97, 388 lines of SDL’93 code without comments and blank lines. An abstracted version of these blocks was prepared for the automated generation of a model consisting of 28, 563 lines of SDL code. The generated Promela model comprises 79, 281 lines of code. We chose IUA protocol because it includes all SDL constructs that are used in the specification of softswitch. Additionally, it includes many external operators written in C with the use of extended Abstract Data Type (ADT) definitions. After semi-automatic generation of a model, we successfully ran the simulation and managed to discover an unknown invalid execution path. When real-life telecommunication systems are studied, Spin trail can be very demanding. To ease the search for an error, we have developed a Spin Trail to Message Sequence Chart (st2msc) tool [6]. It provides automated generation of a Message Sequence Chart (MSC) diagram from a Spin execution trail. Spin trail of the simple call with the use of IUA protocol consisting of 55.371 lines of text. It contains 21 processes that interact with the use of 261 messages. During the study of the IUA protocol, we learned that the Promela editor with code folding capabilities and version control would improve an engineer’s user experience and efficiency. Additionally, the need for a common development environment for the Spin, sdl2pml, and st2msc emerged. In this paper, we focus only on Eclipse plug-in for Spin and st2msc1 .

2

Eclipse Plug-In for Spin and st2msc Tool

Large Promela models are usually difficult to examine in common text editors. Therefore, we have started the development of an Eclipse environment that would offer a better overview of a prepared model and simplify the use of st2msc and sdl2pml. We have chosen the widely-used Eclipse2 platform for software development. This open source tool is capable of handling Integrated Development Environments (IDEs) for different programming languages, including Java, C/C++, and PHP [7]. During our search for existing solutions, we found the Eclipse plug-in for Spin [8]. Unfortunately, it does not provide code folding that initiated our research. Therefore, we decided to start a new development project with the aim of simplifying engineers work with extensive Promela models, and provide integration for existing tools developed by our group. The developed plug-in includes: – Promela file wizard, – Spin perspective, – Promela editor, 1 2

http://lms.uni-mb.si/ep4ss/ http://www.eclipse.org/

Eclipse Plug-In for Spin and st2msc Tools-Tool Presentation

145

– preference pages, – launch shortcuts, and – update site. Preparation of a new Promela model in Eclipse IDE requires the creation of a new file. The developed file wizard simplifies the creation of a file that is used as a container for a model. It automatically inserts the init function into the model. Existing models can be included with Eclipse’s import manager. The Eclipse perspective is a visual container for a set of views, and editors. Our plug-in implements Spin perspective. It contains the following views: – – – –

St2msc—conversion of a Spin trail file to MSC diagram, Package Explorer—display of files associated with the model, Console—output of the Spin tool, and Problems—syntax problems in the model.

Spin perspective additionally includes the Promela editor that can be used for viewing or editing a model. The current version includes several features, e.g., syntax highlighting, code folding, problem markers, and content assistance. If syntax highlighting is enabled, all defined keyword groups are displayed in the selected colours. Promela language reference defines the following keyword groups: Meta Terms, Declarators Rule, Control Flow, Basic Statements, Predefined, Embedded C Code, Omissions, and Comments. By default, all Promela keywords are included, but they can be extended by the user. Code folding temporarily hides sections of the Promela code enclosed by brackets or between reserved pairs of words: do—od and if—fi. This functionality is essential if real-life specifications are studied. The Promela editor utilizes Eclipse’s problem markers that provide a powerful error reporting mechanism. Syntax errors are also shown in the Problems view. Syntax error position is parsed from the output of a Spin’s syntax check. Additional help is provided by the Content assistant that can be used for the completition of the reserved keywords. Settings for Promela editor, simulation, and verification are set-out in the preference pages. Additionally, paths to external tools (Spin, XSpin, C compiler, and st2msc) are defined. Customizations of the editor include user-defined keywords for the syntax highlighting and colour selection for keyword groups. Here, highlighting and code folding can be enabled or disabled. At the moment, random and guided simulations are supported. Simulation parameters “skipped simulation steps” and “seed value” can be set. For guided simulation a pan in.trail file or user-selected trail file can be used. Simulation output is shown in the Console view. For interactive simulation XSpin tool can be launched. In Fig. 1 Spin perspective is shown. The left side of the figure presents Package Explorer view with the iua project. It includes Promela model file iua model.pml. At the forefront of the figure, the options for Spin verification are shown. The verification preference page is divided into two parts. In the upper part, the user can import, export, or reload verification parameters. An

146

T. Kovˇse et al.

Fig. 1. Screen capture of Eclipse plug-in for Spin and st2msc tools

engineer can load and store verification parameters in XML format. If loaded parameters are changed during the study of the model, primary settings can be easily reloaded. Verification options are included in the lower part of the verification preference page. The Promela model of the system is shown behind the preferences page. An inline definition of send expire timers(queue) is collapsed, while set(tmr,value) is expanded. Promela editor’s syntax highlighting is shown on comments and following keywords: inline, typedef, hidden, and do. The result of a syntax check is shown in Console view at the bottom of the Fig. 1. After successful study and preparation of the model, the user can run the simulation or formal verification of the model. The Spin tool tracks the execution of the simulation in the trail file. Additionally, the trail file describes the counter-example of the verification run. To ease the search for an error, we developed st2msc tool for automated generation of ITU-T Z.106[9] and Z.120[10] standardized MSC diagrams. Its graphical interface is implemented as a view in Eclipse plug-in. An engineer can focus on specific parts of the trail with selection of messages that should be included in the MSC diagram. Additionally, a user can merge processes into virtual processes [6]. Generated diagrams can be studied with the use of any MSC tool, e.g., ObjectGEODE. The st2msc tool is implemented in Java and consist of 1.216 lines of code. The plug-in is in constant development. Therefore, we have implemented a feature for automatic installation and future updates from the development server. The Eclipse update manager can be used for the installation of the plug-in.

Eclipse Plug-In for Spin and st2msc Tools-Tool Presentation

3

147

Conclusion

Study of the IUA protocol showed, that Eclipse plug-in for Spin and st2msc tool can be very helpful during the development of Promela models. With code folding and syntax highlighting functionalities in Promela editor, the user gains a better overview of a modelled system. In future work, one of our major goals is implementation of a guided simulation in our plug-in. Another important goal is implementation of an MSC editor integrated into the Eclipse IDE. With its implementation, users will not need to install third-part tools for graphical presentation of MSC diagrams. Additionally integration of the sdl2pml tool into the plug-in is also important.

References 1. Holzmann, G.J.: The Spin Model Checker, Primer and Reference Manual. AddisonWesley, Reading (2004) 2. Vlaoviˇc, B.: Automatic Generation of Models with Probes from the SDL System Specification: Ph.D dissertation (in Slovene). University of Maribor, Faculty of EE&CS, Maribor, Slovenia (2004) 3. Vreˇze, A.: Extending automatic modeling of SDL specifications in Promela with embedded C code and a new model of discrete time: Ph.D dissertation (in Slovene). University of Maribor, Faculty of EE&CS, Maribor, Slovenia (2006) 4. Vlaoviˇc, B., Vreˇze, A., Brezoˇcnik, Z., Kapus, T.: Automated Generation of Promela Model from SDL Specification. Computer Standards & Interfaces 29(4), 449–461 (2007) 5. Vreˇze, A., Vlaoviˇc, B., Brezoˇcnik, Z.: Sdl2pml — Tool for automated generation of Promela model from SDL specification. Computer Standards & Interfaces (2008) 6. Kovˇse, T., Vlaoviˇc, B., Vreˇze, A., Brezoˇcnik, Z.: Spin Trail to Message Sequence Chart Conversion Tool. In: The 10th International Conference on Telecomunications, Zagreb, Croatia, 7. Clay, E., Rubel, D.: Eclipse: Building Commercial-Quality Plug-Ins. AddisonWesley, Reading (2006) 8. Rothmaier, G., Kneiphoff, T., Krumm, H.: Using SPIN and Eclipse for Optimized High-Level Modeling and Analysis of Computer Network Attack Models. In: Godefroid, P. (ed.) SPIN 2005. LNCS, vol. 3639, pp. 236–250. Springer, Heidelberg (2005) 9. International Telecommunication Union: Common Interchange Format for SDL. Recommendation Z.106, Telecommunication Standardization Sector of ITU, Geneva, Switzerland (1996) 10. International Telecommunication Union: Message Sequence Chart (MSC). Recommendation Z.120, Telecommunication Standardization Sector of ITU, Geneva, Switzerland (1999)

Symbolic Analysis via Semantic Reinterpretation⋆ Junghee Lim1,⋆⋆ , Akash Lal1,⋆ ⋆ ⋆ , and Thomas Reps1,2 1

University of Wisconsin, Madison, WI, USA {junghee,akash,reps}@cs.wisc.edu 2 GrammaTech, Inc., Ithaca, NY, USA

Abstract. The paper presents a novel technique to create implementations of the basic primitives used in symbolic program analysis: forward symbolic evaluation, weakest liberal precondition, and symbolic composition. We used the technique to create a system in which, for the cost of writing just one specification—an interpreter for the programming language of interest—one obtains automatically-generated, mutuallyconsistent implementations of all three symbolic-analysis primitives. This can be carried out even for languages with pointers and address arithmetic. Our implementation has been used to generate symbolic-analysis primitives for x86 and PowerPC.

1

Introduction

The use of symbolic-reasoning primitives for forward symbolic evaluation, weakest liberal precondition (WLP), and symbolic composition has experienced a resurgence in program-analysis tools because of the power that they provide when exploring a program’s state space. – Model-checking tools, such as SLAM [1], as well as hybrid concrete/symbolic program-exploration tools, such as DART [6], CUTE [13], SAGE [7], BITSCOPE [3], and DASH [2] use forward symbolic evaluation, WLP, or both. Symbolic evaluation is used to create path formulas. To determine whether a path π is executable, an SMT solver is used to determine whether π’s path formula is satisfiable, and if so, to generate an input that drives the program down π. WLP is used to identify new predicates that split part of a program’s state space [1,2]. – Bug-finding tools, such as ARCHER [15] and SATURN [14], use symbolic composition. Formulas are used to summarize a portion of the behavior of a procedure. Suppose that procedure P calls Q at call-site c, and that r is the site in P to which control returns after the call at c. When c is encountered ⋆

⋆⋆ ⋆⋆⋆

Supported by NSF under grants CCF-{0540955, 0524051, 0810053}, by AFRL under contract FA8750-06-C-0249, and by ONR under grant N00014-09-1-0510. Supported by a Symantec Research Labs Graduate Fellowship. Supported by a Microsoft Research Fellowship.

C.S. P˘ as˘ areanu (Ed.): SPIN 2009, LNCS 5578, pp. 148–168, 2009. c Springer-Verlag Berlin Heidelberg 2009 

Symbolic Analysis via Semantic Reinterpretation

149

during the exploration of P , such tools perform the symbolic composition of the formula that expresses the behavior along the path [entryP , . . . , c] explored in P with the formula that captures the behavior of Q to obtain a formula that expresses the behavior along the path [entryP , . . . , r]. The semantics of the basic symbolic-reasoning primitives are easy to state; for instance, if τ (σ, σ ′ ) is a 2-state formula that represents the semantics of an instruction, then WLP(τ, ϕ) can be expressed as ∀σ ′ .(τ (σ, σ ′ ) ⇒ ϕ(σ ′ )). However, this formula uses quantification over states—i.e., second-order quantification— whereas SMT solvers, such as Yices and Z3, support only quantifier-free firstorder logic. Hence, such a formula cannot be used directly. For a simple language that has only int-valued variables, it is easy to recast matters in first-order logic. For instance, the WLP of postcondition ϕ with respect to an assignment statement var = rhs; can be obtained by substituting rhs for all (free) occurrences of var in ϕ: ϕ[var ← rhs]. For real-world programming languages, however, the situation is more complicated. For instance, for languages with pointers, Morris’s rule of substitution [11] requires taking into account all possible aliasing combinations. The standard approach to implementing each of the symbolic-analysis primitives for a programming language of interest (which we call the subject language) is to create hand-written translation procedures—one per symbolicanalysis primitive—that convert subject-language commands into appropriate formulas. With this approach, a system can contain subtle inconsistency bugs if the different translation procedures adopt different “views” of the semantics. The consistency problem is compounded by the issue of aliasing: most subject languages permit memory states to have complicated aliasing patterns, but usually it is not obvious that aliasing is treated consistently across implementations of symbolic evaluation, WLP, and symbolic composition. One manifestation of an inconsistency bug would be that if one performs symbolic execution of a path π starting from a state that satisfies ψ = WLP(π, ϕ), the resulting symbolic state does not entail ϕ. Such bugs undermine the soundness of an analysis tool. Our own interest is in analyzing machine code, such as x86 and PowerPC. Unfortunately, machine-code instruction sets have hundreds of instructions, as well as other complicating factors, such as the use of separate instructions to set flags (based on the condition that is tested) and to branch according to the flag values, the ability to perform address arithmetic and dereference computed addresses, etc. To appreciate the need for tool support for creating symbolic-analysis primitives for real machine-code languages, consult Section 3.2 of the Intel manual (http://download.intel.com/design/processor/manuals/253666.pdf), and imagine writing three separate encodings of each instruction’s semantics to implement symbolic evaluation, WLP, and symbolic composition. Some tools (e.g., [7,3]) need an instruction-set emulator, in which case a fourth encoding of the semantics is also required. To address these issues, this paper presents a way to automatically obtain mutually-consistent, correct-by-construction implementations of symbolic primitives, by generating them from a specification of the subject language’s

150

J. Lim, A. Lal, and T. Reps

concrete semantics. More precisely, we present a method to obtain quantifierfree, first-order-logic formulas for (a) symbolic evaluation of a single command, (b) WLP with respect to a single command, and (c) symbolic composition for a class of formulas that express state transformations. The generated implementations are guaranteed to be mutually consistent, and also to be consistent with an instruction-set emulator (for concrete execution) that is generated from the same specification of the subject language’s concrete semantics. Primitives (a) and (b) immediately extend to compound operations over a given program path for use in forward and backwards symbolic evaluation, respectively; see §6. (The design of client algorithms that use such primitives to perform state-space exploration is an orthogonal issue that is outside the scope of this paper.) Semantic Reinterpretation. Our approach is based on factoring the concrete semantics of a language into two parts: (i) a client specification, and (ii) a semantic core. The interface to the core consists of certain base types, function types, and operators, and the client is expressed in terms of this interface. Such an organization permits the core to be reinterpreted to produce an alternative semantics for the subject language. The idea of exploiting such a factoring comes from the field of abstract interpretation [4], where semantic reinterpretation has been proposed as a convenient tool for formulating abstract interpretations [12,10] (see §2). Achievements and Contributions. We used the approach described in the paper to create a “Yacc-like” tool for generating mutually-consistent, correctby-construction implementations of symbolic-analysis primitives for instruction sets (§7). The input is a specification of an instruction set’s concrete semantics; the output is a triple of C++ functions that implement the three symbolicanalysis primitives. The tool has been used to generate such primitives for x86 and PowerPC. To accomplish this, we leveraged an existing tool, TSL [9], as the implementation platform for defining the necessary reinterpretations. However, we wish to stress that the ideas presented in the paper are not TSL-specific; other ways of implementing the necessary reinterpretations are possible (see §2). The contributions of this paper lie in the insights that went into defining the specific reinterpretations that we use to obtain mutually-consistent, correctby-construction implementations of the symbolic-analysis primitives, and the discovery that WLP could be obtained by using two different reinterpretations working in tandem. The paper’s other contributions are summarized as follows: – We present a new application for semantic reinterpretation, namely, to create implementations of the basic primitives for symbolic reasoning (§4 and §5). In particular, two key insights allowed us to obtain the primitives for WLP and symbolic composition. The first insight was that we could apply semantic reinterpretation in a new context, namely, to the interpretation function of a logic (§4). The second insight was to define a particular form of statetransformation formula—called a structure-update expression (see §3.1)—to be a first-class notion in the logic, which allows such formulas (i) to serve as a

Symbolic Analysis via Semantic Reinterpretation

151

replacement domain in various reinterpretations, and (ii) to be reinterpreted themselves (§4). – We show how reinterpretation can automatically create a WLP primitive that implements Morris’s rule of substitution [11] (§4). – We conducted an experiment on real x86 code using the generated primitives (§7). For expository purposes, simplified languages are used throughout. Our discussion of machine code (§3.3 and §5) is based on a greatly simplified fragment of the x86 instruction set; however, our implementation (§7) works on code from real x86 programs compiled from C++ source code, including C++ STL, using Visual Studio. Organization. §2 presents the basic principles of semantic reinterpretation by means of an example in which reinterpretation is used to create abstract transformers for abstract interpretation. §3 defines the logic that we use, as well a simple source-code language (PL) and an idealized machine-code language (MC). §4 discusses how to use reinterpretation to obtain the three symbolicanalysis primitives for PL. §5 addresses reinterpretation for MC. §6 explains how other language constructs beyond those found in PL and MC can be handled. §7 describes our implementation and the experiment carried out with it. §8 discusses related work.

2

Semantic Reinterpretation for Abstract Interpretation

This section presents the basic principles of semantic reinterpretation in the context of abstract interpretation. We use a simple language of assignments, and define the concrete semantics and an abstract sign-analysis semantics via semantic reinterpretation. Example 1. [Adapted from [10].] Consider the following fragment of a denotational semantics, which defines the meaning of assignment statements over variables that hold signed 32-bit int values (where ⊕ denotes exclusive-or): I ∈ Id S ∈ Stmt ::= I = E;

E ∈ Expr ::= I | E1 ⊕ E2 | . . . σ ∈ State = Id → Int32

E : Expr → State → Int32 E Iσ = σI E E1 ⊕ E2 σ = E E1 σ ⊕ E E2 σ

I : Stmt → State → State II = E;σ = σ[I → EEσ]

By “σ[I → v],” we mean the function that acts like σ except that argument I is mapped to v. The specification given above can be factored into client and core specifications by introducing a domain Val, as well as operators xor, lookup, and store. The client specification is defined by

152

J. Lim, A. Lal, and T. Reps

s1 : x = x ⊕ y; (a) s2 : y = x ⊕ y; s3 : x = x ⊕ y;

Before 0:

t1 : ∗px = ∗px ⊕ ∗py; (b) t2 : ∗py = ∗px ⊕ ∗py; t3 : ∗px = ∗px ⊕ ∗py;

After

v

px:

&py

py:

&py

v

0: px:

&py

py:

v

[1] [2] [3] [4] [5] [6] [7] [8] [9]

mov xor mov mov xor mov mov xor mov

(c)

eax, [ebp−10] eax, [ebp−14] [ebp−10], eax eax, [ebp−10] eax, [ebp−14] [ebp−14], eax eax, [ebp−10] eax, [ebp−14] [ebp−10], eax (d)

Fig. 1. (a) Code fragment that swaps two ints; (b) code fragment that swaps two ints using pointers; (c) possible before and after configurations for code fragment (b): the swap is unsuccessful due to aliasing; (d) x86 machine code corresponding to (a) xor : Val → Val → Val

lookup : State → Id → Val

E : Expr → State → Val E Iσ = lookup σ I E E1 ⊕ E2 σ = E E1 σ xor E E2 σ

store : State → Id → Val → State

I : Stmt → State → State II = E;σ = store σ I E Eσ

For the concrete (or “standard”) semantics, the semantic core is defined by v ∈ Valstd = Int32 Statestd = Id → Val

lookupstd = λσ.λI.σI storestd = λσ.λI.λv.σ[I → v]

xorstd = λv1 .λv2 .v1 ⊕ v2

Different abstract interpretations can be defined by using the same client semantics, but giving a different interpretation of the base types, function types, and operators of the core. For example, for sign analysis, assuming that Int32 values are represented in two’s complement, the semantic core is reinterpreted as follows:1 v ∈ Valabs Stateabs lookupabs storeabs

= = = =

{neg, zero, pos}⊤ Id → Valabs λσ.λI.σI λσ.λI.λv.σ[I → v]

v2 neg zero pos ⊤

xorabs = λv1 .λv2 .

neg v1 zero pos ⊤

⊤ neg neg ⊤

neg zero pos ⊤

neg pos ⊤ ⊤

⊤ ⊤ ⊤ ⊤

For the code fragment shown in Fig. 1(a), which swaps two ints, sign-analysis reinterpretation creates abstract transformers that, given the initial abstract state σ0 = {x → neg, y → pos}, produce the following abstract states: σ0 := {x → neg, y → pos}  neg, y → pos} σ1 := Is1 : x = x ⊕ y;σ0 = storeabs σ0 x (neg xorabs pos) = {x → σ2 := Is2 : y = x ⊕ y;σ1 = storeabs σ1 y (neg xorabs pos) = {x →  neg, y → neg} σ3 := Is3 : x = x ⊕ y;σ2 = storeabs σ2 x (neg xorabs neg) = {x → ⊤, y → neg}. 1

For the two’s-complement representation, pos xorabs neg = neg xorabs pos = neg because, for all combinations of values represented by pos and neg, the high-order bit of the result is set, which means that the result is always negative. However, pos xorabs pos = neg xorabs neg = ⊤ because the concrete result could be either 0 or positive, and zero ⊔ pos = ⊤.

Symbolic Analysis via Semantic Reinterpretation

153

Semantic Reinterpretation Versus Standard Abstract Interpretation. Semantic reinterpretation [12,10] is a form of abstract interpretation [4], but differs from the way abstract interpretation is normally applied: in standard abstract interpretation, one reinterprets the constructs of each subject language; in contrast, with semantic reinterpretation one reinterprets the constructs of the meta-language. Standard abstract interpretation helps in creating semantically sound tools; semantic reinterpretation helps in creating semantically sound tool generators. In particular, if you have N subject languages and M analyses, with semantic reinterpretation you obtain N × M analyzers by writing just N + M specifications: concrete semantics for N subject languages and M reinterpretations. With the standard approach, one must write N × M abstract semantics. Semantic Reinterpretation Versus Translation to a Common Intermediate Representation. The mapping of a client specification to the operations of the semantic core that one defines in a semantic reinterpretation resembles a translation to a common intermediate representation (CIR) data structure. Thus, another approach to obtaining “systematic” reinterpretations that are similar to semantic reinterpretations—in that they apply to multiple subject languages—is to translate subject-language programs to a CIR, and then create various interpreters that implement different abstract interpretations of the node types of the CIR data structure. Each interpreter can be applied to (the translation of) programs in any subject language L for which one has defined an L-to-CIR translator. Compared with interpreting objects of a CIR data type, the advantages of semantic reinterpretation (i.e., reinterpreting the constructs of the meta-language) are 1. The presentation of our ideas is simpler because one does not have to introduce an additional language of trees for representing CIR objects. 2. With semantic reinterpretation, there is no explicit CIR data structure to be interpreted. In essence, semantic reinterpretation removes a level of interpretation, and hence generated analyzers should run faster. To some extent, however, the decision to explain our ideas in terms of semantic reinterpretation is just a matter of presentational style. The goal of the paper is not to argue the merits of semantic reinterpretation per se; on the contrary, the goal is to present particular interpretations that yield three desirable symbolic-analysis primitives for use in program-analysis tools. Semantic reinterpretation is used because it allows us to present our ideas in a concise manner. The ideas introduced in §4 and §5 can be implemented using semantic reinterpretation—as we did (see §7); alternatively, they can be implemented by defining a suitable CIR datatype and creating appropriate interpretations of the CIR’s node types—again using ideas similar to those presented in §4 and §5.

154

J. Lim, A. Lal, and T. Reps

3

A Logic and Two Programming Languages

3.1

L: A Quantifier-Free Bit-Vector Logic with Finite Functions

The logic L is quantifier-free first-order bit-vector logic over a vocabulary of constant symbols (I ∈ Id) and function symbols (F ∈ FuncId). Strictly speaking, we work with various instantiations of L, denoted by L[PL] and L[MC], in which the vocabularies of function symbols are chosen to describe aspects of the values used by, and computations performed by, the programming languages PL and MC, respectively. We distinguish the syntactic symbols of L from their counterparts in PL (§2 and §3.2) by using boxes around L’s symbols. c ∈ CInt32 = {0, 1, . . .} bopL ∈ BoolOpL = { && , || , . . .}

op2L ∈ BinOpL = { + , - , ⊕ , . . .} ropL ∈ RelOpL = { = , = , < , > , . . .}

The rest of the syntax of L[·] is defined as follows: I ∈ Id, T ∈ Term, ϕ ∈ Formula, F ∈ FuncId, FE ∈ FuncExpr, U ∈ StructUpdate T ::= c | I | T1 op2L T2 | ite(ϕ, T1 , T2 ) | FE(T ) ϕ ::= T | F | T1 ropL T2 | ¬ ϕ1 | ϕ1 bopL ϕ2

FE ::= F | FE1 [T1 → T2 ] U ::= ({Ii ←֓ Ti }, {Fj ←֓ FEj })

A Term of the form ite(ϕ, T1 , T2 ) represents an if-then-else expression. A FuncExpr of the form FE1 [T1 → T2 ] denotes a function-update expression. A StructUpdate of the form ({Ii ←֓ Ti }, {Fj ←֓ FEj }) is called a structure-update expression. The subscripts i and j implicitly range over certain index sets, which will be omitted to reduce clutter. To emphasize that Ii and Fj refer to nextstate quantities, we sometimes write structure-update expressions with primes: ({Ii′ ←֓ Ti }, {Fj′ ←֓ FEj }). {Ii′ ←֓ Ti } specifies the updates to the interpretations of the constant symbols and {Fj′ ←֓ FEj } specifies the updates to the interpretations of the function symbols (see below). Thus, a structure-update expression ({Ii′ ←֓ Ti }, {Fj′ ←֓ FEj })  of as a kind of restricted  can be thought 2-vocabulary (i.e., 2-state) formula i (Ii′ = Ti ) ∧ j (Fj′ = FEj ). We define Uid to be ({I ′ ←֓ I | I ∈ Id}, {F ′ ←֓ F | F ∈ FuncId}). Semantics of L. The semantics of L[·] is defined in terms of a logical structure, which gives meaning to the Id and FuncId symbols of the logic’s vocabulary. ι ∈ LogicalStruct = (Id → Val) × (FuncId → (Val → Val))

(ι↑1) assigns meanings to constant symbols, and (ι↑2) assigns meanings to function symbols. ((p↑1) and (p↑2) denote the 1st and 2nd components, respectively, of a pair p.) The factored semantics of L is presented in Fig. 2. Motivated by the needs of later sections, we retain the convention from §2 of working with the domain Val rather than Int32. Similarly, we also use BVal rather than Bool. The standard interpretations of binopL , relopL , and boolopL are as one would expect, e.g., v1 binopL ( ⊕ ) v2 = v1 xor v2 , etc. The standard interpretations for lookupIdstd

Symbolic Analysis via Semantic Reinterpretation const condL lookupId binopL relopL boolopL lookupFuncId access update

: : : : : : : : :

155

CInt32 → Val BVal → Val → Val → Val LogicalStruct → Id → Val BinOpL → (Val × Val → Val) RelOpL → (Val × Val → BVal) BoolOpL → (BVal × BVal → BVal) LogicalStruct → FuncId → (Val → Val) (Val → Val) × Val) → Val ((Val → Val) × Val × Val) → (Val → Val)

F : Formula → LogicalStruct → BVal T : Term → LogicalStruct → Val F  T ι = T T cι = const(c) T Iι = lookupId ι I F  F ι = F T T1 op2L T2 ι = T T1 ι binopL (op2L ) T T2 ι F T1 ropL T2 ι = T T1 ι relopL (ropL ) T T2 ι T ite(ϕ, T1 , T2 )ι = condL (F ϕι, T T1 ι, T T2 ι) F  ¬ ϕ1 ι = ¬Fϕ1 ι T FE(T1 )ι = access(F EFEι, T T1 ι) F ϕ1 bopL ϕ2 ι = F ϕ1 ι boolopL (bopL ) F ϕ2 ι F E : FuncExpr → LogicalStruct → (Val → Val) F EF ι = lookupFuncId ι F F EFE1 [T1 → T2 ]ι = update(F EFE1 ι, T T1 ι, T T2 ι) U : StructUpdate → LogicalStruct → LogicalStruct U({Ii ←֓ Ti }, {Fj ←֓ FEj })ι = ((ι↑1)[Ii → T Ti ι], (ι↑2)[Fj → F EFEj ι])

Fig. 2. The factored semantics of L

and lookupFuncIdstd select from the first and second components, respectively, of a LogicalStruct: lookupIdstd ι I = (ι↑1)(I) and lookupFuncIdstd ι F = (ι↑2)(F ). The standard interpretations for access and update select from, and store to, a map, respectively. Let U = ({Ii ←֓ Ti }, {Fj ←֓ FEj }). Because UU ι retains from ι the value of each constant I and function F for which an update is not defined explicitly in U (i.e., I ∈ (Id − {Ii }) and F ∈ (FuncId − {Fj })), as a notational convenience we sometimes treat U as if it contains an identity update for each such symbol; that is, we say that (U ↑1)I = I for I ∈ (Id − {Ii }), and (U ↑2)F = F for F ∈ (FuncId − {Fj }). 3.2

PL : A Simple Source-Level Language

PL is the language from §2, extended with some additional kinds of int-valued expressions, an address-generation expression, a dereferencing expression, and an indirect-assignment statement. Note that arithmetic operations can also occur inside a dereference expression; i.e., PL allows arithmetic to be performed on addresses (including bitwise operations on addresses: see Ex. 2). S ∈ Stmt, E ∈ Expr, BE ∈ BoolExpr, I ∈ Id, c ∈ CInt32 E ::= c | I | &I | ∗E | E1 op2 E2 | BE ? E1 : E2 c ::= 0 | 1 | ... S ::= I = E; | ∗I = E; | S1 S2 BE ::= T | F | E1 rop E2 | ¬BE1 | BE1 bop BE2

Semantics of PL. The factored semantics of PL is presented in Fig. 3. The semantic domain Loc stands for locations (or memory addresses). We identify

156

J. Lim, A. Lal, and T. Reps

v ∈ Val l ∈ Loc = Val σ ∈ State = Store × Env

const cond lookupState lookupEnv lookupStore updateStore

: : : : : :

E : Expr → State → Val Ecσ = const(c) EIσ = lookupState σ I E&Iσ = lookupEnv σ I E∗Eσ = lookupStore σ (EEσ) EE1 op2 E2 σ = EE1 σ binop(op2) EE2 σ EBE ? E1 : E2 σ = cond(BBEσ, EE1 σ, EE2 σ)

B : BoolExpr → State → BVal CInt32 → Val BTσ = T BVal → Val → Val → Val BFσ = F State → Id → Val BE1 rop E2 σ = EE1 σ relop(rop) EE2 σ State → Id → Loc B¬BE1 σ = ¬BBE1 σ State → Loc → Val State → Loc → Val → State BBE1 bop BE2 σ = BBE1 σ boolop(bop) BBE2 σ I : Stmt → State → State II = E;σ = updateStore σ (lookupEnv σ I) (EEσ) I∗I = E;σ = updateStore σ (EIσ) (EEσ) IS1 S2 σ = IS2 (IS1 σ)

Fig. 3. The factored semantics of PL

const : CInt32 → Val cond : BVal → Val → Val → Val storereg : State → register → Val → State storemem : State → Val → Val → State lookupreg : State → register → Val lookupmem : State → Val → Val storeeip : State → State storeflag : State → flagName → BVal → State lookupflag : State → flagName → BVal storeeip = λσ.storereg(σ, EIP, REIPσ binop(+) 4) R : reg → State → Val Rrσ = lookupreg (σ, r) K : flagName → State → BVal KZFσ = lookupflag (σ, ZF)

O : src operand → State → Val OIndirect(r, c)σ = lookupmem (σ, Rrσ binop(+) const(c)) ODirectReg(r)σ = Rrσ OImmediate(c)σ = const(c)

I : instruction → State → State IMOV(Indirect(r, c), so)σ = storeeip (storemem (σ, Rrσ binop(+) const(c), Osoσ)) IMOV(DirectReg(r), so)σ = storeeip (storereg (σ, r, Osoσ)) ICMP(do, so)σ = storeeip (storeflag (σ, ZF, Odoσ binop(−) Osoσ relop(=) 0)) IXOR(do:Indirect(r, c), so)σ = storeeip (storemem (σ, Rrσ binop(+) const(c), Odoσ binop(⊕) Osoσ)) IXOR(do:DirectReg(r), so)σ = storeeip (storereg (σ, r, Odoσ binop(⊕) Osoσ)) IJZ(do)σ = storereg(σ, EIP, cond(KZFσ, REIPσ binop(+) 4, Odoσ))

Fig. 4. The factored semantics of MC

Loc with the set Val of values. A state σ ∈ State is a pair (η, ρ), where, in the standard semantics, environment η ∈ Env = Id → Loc maps identifiers to their associated locations and store ρ ∈ Store = Loc → Val maps each location to the value that it holds. The standard interpretations of the operators used in the PL semantics are BValstd = BVal Valstd = Int32 Locstd = Int32 η ∈ Envstd = Id → Locstd ρ ∈ Storestd = Locstd → Valstd

condstd = λb.λv1 .λv2 . (b ? v1 : v2 ) lookupStatestd = λ(η, ρ).λI.ρ(η(I)) lookupEnvstd = λ(η, ρ).λI.η(I) lookupStorestd = λ(η, ρ).λl.ρ(l) updateStore std = λ(η, ρ).λl.λv.(η, ρ[l → v])

Symbolic Analysis via Semantic Reinterpretation

3.3

157

MC: A Simple Machine-Code Language

MC is based on the x86 instruction set, but greatly simplified to have just four registers, one flag, and four instructions. r ∈ register, do ∈ dst operand, so ∈ src operand, i ∈ instruction r ::= EAX | EBX | EBP | EIP do ::= Indirect(r, Val) | DirectReg(r) flagName ::= ZF so ::= do ∪ Immediate(Val) instruction ::= MOV(do, so) | CMP(do, so) | XOR(do, so) | JZ(do)

Semantics of MC. The factored semantics of MC is presented in Fig. 4. It is similar to the semantics of PL, although MC exhibits two features not part of PL: there is an explicit program counter (EIP), and MC includes the typical feature of machine-code languages that a branch is split across two instructions (CMP . . . JZ). An MC state σ ∈ State is a triple (mem, reg, flag), where mem is a map Val → Val, reg is a map register → Val, and flag is a map flagName → BVal. We assume that each instruction is 4 bytes long; hence, the execution of a MOV, CMP or XOR increments the program-counter register EIP by 4. CMP sets the value of ZF according to the difference of the values of the two operands; JZ updates EIP depending on the value of flag ZF.

4

Symbolic Analysis for PL via Reinterpretation

A PL state (η, ρ) can be modeled in L[PL] by using a function symbol Fρ for store ρ, and a constant symbol cx ∈ Id for each PL identifier x. (To reduce clutter, we will use x for such constants instead of cx .) Given ι ∈ LogicalStruct, the constant symbols and their interpretations in ι correspond to environment η, and the interpretation of Fρ in ι corresponds to store ρ. Symbolic Evaluation. A primitive for forward symbolic-evaluation must solve the following problem: Given the semantic definition of a programming language, together with a specific statement s, create a logical formula that captures the semantics of s. The following table illustrates how the semantics of PL statements can be expressed as L[PL] structure-update expressions: PL L[PL] x = 17; (∅, {Fρ′ ←֓ Fρ [x → 17]}) x = y; (∅, {Fρ′ ←֓ Fρ [x → Fρ (y)]}) x = ∗q; (∅, {Fρ′ ←֓ Fρ [x → Fρ (Fρ (q))]})

To create such expressions automatically using semantic reinterpretation, we use formulas of logic L[PL] as a reinterpretation domain for the semantic core of PL. The base types and the state type of the semantic core are reinterpreted as follows (our convention is to mark each reinterpreted base type, function type, and operator with an overbar): Val = Term, BVal = Formula, and State = StructUpdate. The operators used in PL’s meaning functions E, B, and I are reinterpreted over these domains as follows:

158

J. Lim, A. Lal, and T. Reps

U1 = (∅, Fρ′ I∗px = ∗px ⊕ ∗py;U1 = (∅, Fρ′ = (∅, Fρ′ = (∅, Fρ′ I∗py = ∗px ⊕ ∗py;U2 = (∅, Fρ′ = (∅, Fρ′ = (∅, Fρ′ I∗px = ∗px ⊕ ∗py;U3 = (∅, Fρ′ = (∅, Fρ′ = (∅, Fρ′

←֓ Fρ [0 → v][px → py][py → py]) ←֓ Fρ [0 → v][px → py][py → (E∗pxU1 ⊕ E∗pyU1 )]) ←֓ Fρ [0 → v][px → py][py → (py ⊕ py)]) ←֓ Fρ [0 → v][px → py][py → 0]) = U2 ←֓ Fρ [0 → (E∗pxU2 ⊕ E∗pyU2 )][px → py][py → 0]) ←֓ Fρ [0 → (0 ⊕ v)][px → py][py → 0]) ←֓ Fρ [0 → v][px → py][py → 0]) = U3 ←֓ Fρ [0 → v][px → py][py → (E∗pxU3 ⊕ E∗pyU3 )]) ←֓ Fρ [0 → v][px → py][py → (0 ⊕ v)]) ←֓ Fρ [0 → v][px → py][py → v]) = U4

Fig. 5. Symbolic evaluation of Fig. 1(b) via semantic reinterpretation, starting with a StructUpdate that corresponds to the “Before” column of Fig. 1(c)

– The arithmetic, bitwise, relational, and logical operators are interpreted as syntactic constructors of L[PL] Terms and Formulas, e.g., binop(⊕) = λT1 .λT2 .T1 ⊕ T2 . Straightforward simplifications are also performed; e.g., 0 ⊕ a simplifies to a, etc. Other simplifications that we perform are similar to ones used by others, such as the preprocessing steps used in decision procedures (e.g., the ite-lifting and read-over-write transformations for operations on functions [5]). – cond residuates an ite(·, ·, ·) Term when the result cannot be simplified to a single branch. The other operations used in the PL semantics are reinterpreted as follows: lookupState : StructUpdate → Id → Term lookupState = λU.λI.((U ↑2)Fρ )((U ↑1)I) lookupEnv : StructUpdate → Id → Term lookupEnv = λU.λI.(U ↑1)I lookupStore : StructUpdate → Term → Term lookupStore = λU.λT.((U ↑2)Fρ )(T ) updateStore : StructUpdate → Term → Term → StructUpdate updateStore = λU.λT1 .λT2 .((U ↑1), (U ↑2)[Fρ → ((U ↑2)Fρ )[T1 → T2 ]])

By extension, this produces functions E, B, and I with the following types: Standard E: Expr → State → Val B: BoolExpr → State → BVal I: Stmt → State → State

Reinterpreted E : Expr → StructUpdate → Term B: BoolExpr → StructUpdate → Formula I: Stmt → StructUpdate → StructUpdate

Function I translates a statement s of PL to a phrase in logic L[PL]. Example 2. The steps of symbolic evaluation of Fig. 1(b) via semantic reinterpretation, starting with a StructUpdate that corresponds to Fig. 1(c), are shown in Fig. 5. The StructUpdate U4 can be considered to be the 2-vocabulary formula Fρ′ = Fρ [0 → v][px → py][py → v], which expresses a state change that does not usually perform a successful swap. WLP. WLP(s, ϕ) characterizes the set of states σ such that the execution of s starting in σ either fails to terminate or results in a state σ ′ such that

Symbolic Analysis via Semantic Reinterpretation

159

ϕ(σ ′ ) holds. For a language that only has int-valued variables, the WLP of a postcondition (specified by formula ϕ) with respect to an assignment statement var = rhs; can be expressed as the formula obtained by substituting rhs for all (free) occurrences of var in ϕ: ϕ[var ← rhs]. For a language with pointer variables, such as PL, syntactic substitution is not adequate for finding WLP formulas. For instance, suppose that we are interested in finding a formula for the WLP of postcondition x = 5 with respect to ∗p = e;. It is not correct merely to perform the substitution (x = 5)[∗p ← e]. That substitution yields x = 5, whereas the WLP depends on the execution context in which ∗p = e; is evaluated: – If p points to x, then the WLP formula should be e = 5. – If p does not point to x, then the WLP formula should be x = 5. The desired formula can be expressed informally as ((p = &x) ? e : x) = 5. For a program fragment that involves multiple pointer variables, the WLP formula may have to take into account all possible aliasing combinations. This is the essence of Morris’s rule of substitution [11]. One of the most important features of our approach is its ability to create correct implementations of Morris’s rule of substitution automatically—and basically for free. Example 3. In L[PL], such a formula would be expressed as shown below on the right. (This formula will be created using semantic reinterpretation in Ex. 4.) Query Result

Informal

L[PL]

WLP(∗p = e, x = 5) ((p = &x) ? e : x) = 5

WLP(∗p = e, Fρ (x) = 5) ite(Fρ (p) = x, Fρ (e), Fρ (x)) = 5

To create primitives for WLP and symbolic composition via semantic reinterpretation, we again use L[PL] as a reinterpretation domain; however, there is a trick: in contrast with what is done to generate symbolic-evaluation primitives, we use the StructUpdate type of L[PL] to reinterpret the meaning functions U, F E, F , and T of L[PL] itself! By this means, the “alternative meaning” of a Term/Formula/FuncExpr/StructUpdate is a (usually different) Term/Formula/FuncExpr/StructUpdate in which some substitution and/or simplification has taken place. The general scheme is outlined in the following table: Meaning function(s) I, E , B F, T U, FE , F, T

Type reinterpreted State LogicalStruct LogicalStruct

Replacement type Function created StructUpdate Symbolic evaluation StructUpdate WLP StructUpdate Symbolic composition

In §3.1, we defined the semantics of L[·] in a form that would make it amenable to semantic reinterpretation. However, one small point needs adjustment: in §3.1, the type signatures of LogicalStruct, lookupFuncId, access, update, and F E include occurrences of Val → Val. This was done to make the types more intuitive; however, for reinterpretation to work, an additional level of factoring is necessary. In particular, the occurrences of Val → Val need to be replaced by FVal. The

160

J. Lim, A. Lal, and T. Reps

standard semantics of FVal is Val → Val; however, for creating symbolic-analysis primitives, FVal is reinterpreted as FuncExpr. The reinterpretation used for U, F E, F , and T is similar to what was used for symbolic evaluation of PL programs: – Val = Term, BVal = Formula, FVal = FuncExpr, and LogicalStruct = StructUpdate. – The arithmetic, bitwise, relational, and logical operators are interpreted as syntactic Term and Formula constructors of L (e.g., binopL ( ⊕ ) = λT1 .λT2 .T1 ⊕ T2 ) although straightforward simplifications are also performed. – condL residuates an ite(·, ·, ·) Term when the result cannot be simplified to a single branch. – lookupId and lookupFuncId are resolved immediately, rather than residuated: • lookupId ({Ii ←֓ Ti }, {Fj ←֓ FEj }) Ik = Tk • lookupFuncId ({Ii ←֓ Ti }, {Fj ←֓ FEj }) Fk = FEk . – access and update are discussed below. By extension, this produces reinterpreted meaning functions U, F E, F, and T . Somewhat surprisingly, we do not need to introduce an explicit operation of substitution for our logic because a substitution operation is produced as a byproduct of reinterpretation. In particular, in the standard semantics for L, the return types of meaning function T and helper function lookupId of the semantic core are both Val. However, in the reinterpreted semantics, a Val is a Term— i.e., something symbolic—which is used in subsequent computations. Thus, when ι ∈ LogicalStruct is reinterpreted as U ∈ StructUpdate, the reinterpretation of formula ϕ via F ϕU substitutes Terms found in U into ϕ: F ϕU calls T T U , which may call lookupId U I; the latter would return a Term fetched from U , which would be a subterm of the answer returned by T T U , which in turn would be a subterm of the answer returned by F ϕU . To create a formula for WLP via semantic reinterpretation, we make use of both F , the reinterpreted logic semantics, and I, the reinterpreted programminglanguage semantics. The WLP formula for ϕ with respect to statement s is obtained by performing the following computation: WLP(s, ϕ) = F ϕ(IsUid ).

To understand how pointers are handled during the WLP operation, the key reinterpretations to concentrate on are the ones for the operations of the semantic core of L[PL] that manipulate FVals (i.e., arguments of type Val → Val)—in particular, access and update. We want access and update to enjoy the following semantic properties: T access(FE0 , T0 )ι = (FE FE0 ι)(T T0 ι) FEupdate(FE0 , T0 , T1 )ι = (FE FE0 ι)[T T0 ι → T T1 ι]

Note that these properties require evaluating the results of access and update with respect to an arbitrary ι ∈ LogicalStruct. As mentioned earlier, it is desirable for reinterpreted base-type operations to perform simplifications whenever possible, when they construct Terms, Formulas, FuncExprs, and StructUpdates. However,

Symbolic Analysis via Semantic Reinterpretation

161

because the value of ι is unknown, access and update operate in an uncertain environment. To use semantic reinterpretation to create a WLP primitive that implements Morris’s rule, simplifications are performed by access and update according to the . definitions given below, where ≡, =, and = denote equality-as-terms, definitedisequality, and possible-equality, respectively. access(F, k1 ) = ⎧ F (k1 ) if (k1 ≡ k2 ) ⎨ d2 if (k1 = k2 ) access(FE[k2 → d2 ]), k1 ) = access(FE, k1 ) . ⎩ ite(k1 = k2 , d2 , access(FE, k1 )) if (k1 = k2 ) update(F, k1 , d1 ) = ⎧ F [k1 → d1 ] if (k1 ≡ k2 ) ⎨ FE[k1 → d1 ] update(FE[k2 → d2 ], k1 , d1 ) = update(FE, k1 , d1 )[k2 → d2 ] if (k1 = k2 ) ⎩ . if (k1 = k2 ) FE[k2 → d2 ][k1 → d1 ]

. (The possible-equality tests, “k1 = k2 ”, are really “otherwise” cases of threepronged comparisons.) The possible-equality case for access introduces ite terms. As illustrated in Ex. 4, it is these ite terms that cause the reinterpreted operations to account for possible aliasing combinations, and thus are the reason that the semantic-reinterpretation method automatically carries out the actions of Morris’s rule of substitution [11]. Example 4. We now demonstrate how semantic reinterpretation produces the L[PL] formula for WLP(∗p = e, x = 5) claimed in Ex. 3. U := = = = =

I∗p = eUId updateStore(UId , EpUId , EeUId ) updateStore(UId , lookupState(UId , p), lookupState(UId , e) updateStore(UId , Fρ (p), Fρ (e)) ((UId ↑1), Fρ ←֓ Fρ [Fρ (p) → Fρ (e)])

WLP(∗p = e, Fρ (x) = 5) = = = = = = =

FFρ (x) = 5U (T Fρ (x)U ) = (T 5U ) (access(FEFρ U, T xU )) = 5 (access(lookupFuncId(U, Fρ ), lookupId(U, x))) = 5 (access(Fρ [Fρ (p) → Fρ (e)], x)) = 5 ite(Fρ (p) = x, Fρ (e), access(Fρ , x)) = 5 ite(Fρ (p) = x, Fρ (e), Fρ (x)) = 5

Note how the case for access that involves a possible-equality comparison causes an ite term to arise that tests "Fρ(p) = x". The test determines whether the value of p is the address of x, which is the only aliasing condition that matters for this example.

Symbolic Composition. The goal of symbolic composition is to have a method that, given two symbolic representations of state changes, computes a symbolic representation of their composed state change. In our approach, each state change is represented in logic L[PL] by a StructUpdate, and the method


computes a new StructUpdate that represents their composition. To accomplish this, L[PL] is used as a reinterpretation domain, exactly as for WLP. Moreover, U⟦·⟧ turns out to be exactly the symbolic-composition function that we seek. In particular, U⟦·⟧ works as follows:

U⟦({Ii ← Ti}, {Fj ← FEj})⟧U = ((U↑1)[Ii ↦ T⟦Ti⟧U], (U↑2)[Fj ↦ FE⟦FEj⟧U])

Example 5. For the swap-code fragment from Fig. 1(a), we can demonstrate the ability of U⟦·⟧ to perform symbolic composition by showing that I⟦s1; s2; s3⟧U_id = U⟦I⟦s3⟧U_id⟧(I⟦s1; s2⟧U_id).

First, consider the left-hand side. It is not hard to show that I⟦s1; s2; s3⟧U_id = ({x′ ← y, y′ ← x}, ∅). Now consider the right-hand side. Let U_{1,2} and U_3 be

U_{1,2} = I⟦s1; s2⟧U_id = ({x′ ← x ⊕ y, y′ ← x}, ∅)
U_3     = I⟦s3⟧U_id     = ({x′ ← x ⊕ y, y′ ← y}, ∅).

We want to compute

U⟦U_3⟧U_{1,2}
  = U⟦({x′ ← x ⊕ y, y′ ← y}, ∅)⟧U_{1,2}
  = ((U_{1,2}↑1)[x ↦ T⟦x ⊕ y⟧U_{1,2}, y ↦ T⟦y⟧U_{1,2}], ∅)
  = ((U_{1,2}↑1)[x ↦ ((x ⊕ y) ⊕ x), y ↦ x], ∅)
  = ((U_{1,2}↑1)[x ↦ y, y ↦ x], ∅)
  = ({x′ ← y, y′ ← x}, ∅)

Therefore, I⟦s1; s2; s3⟧U_id = U⟦U_3⟧U_{1,2}.
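As a small illustration of composition, the following Java sketch (ours; StructUpdates are modeled as plain maps from variables to term strings, and the toy tokenizer assumes single-letter variables) rewrites every right-hand side of U2 under U1's bindings, which is the effect of U⟦U2⟧U1 above, minus the term simplifications.

import java.util.*;

class ComposeSketch {
    // compose(u2, u1) corresponds to U[u2]u1: start from u1's bindings and
    // overwrite with u2's entries, each evaluated over u1.
    static Map<String,String> compose(Map<String,String> u2,
                                      Map<String,String> u1) {
        Map<String,String> out = new LinkedHashMap<>(u1);
        for (Map.Entry<String,String> e : u2.entrySet())
            out.put(e.getKey(), substitute(e.getValue(), u1));
        return out;
    }

    // Substitute u1's bindings for the (single-letter) variables in 'term'.
    static String substitute(String term, Map<String,String> u1) {
        StringBuilder sb = new StringBuilder();
        for (char c : term.toCharArray()) {
            String v = String.valueOf(c);
            sb.append(u1.getOrDefault(v, v));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String,String> u1 = Map.of("x", "(x+y)");  // after s1: x := x+y
        Map<String,String> u2 = Map.of("y", "(x-y)");  // after s2: y := x-y
        System.out.println(compose(u2, u1));
        // {x=(x+y), y=((x+y)-y)} -- one transformer for s1;s2
    }
}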

5 Symbolic Analysis for MC via Reinterpretation

To obtain the three symbolic-analysis primitives for MC, we use a reinterpretation of MC's semantics that is essentially identical to the reinterpretation for PL, modulo the fact that the semantics of PL is written in terms of the combinators lookupEnv, lookupStore, and updateStore, whereas the semantics of MC is written in terms of lookup_reg, store_reg, lookup_flag, store_flag, lookup_mem, and store_mem.

Symbolic Evaluation. The base types are redefined as BVal = Formula, Val = Term, State = StructUpdate, where the vocabulary for LogicalStructs is ({ZF, EAX, EBX, EBP, EIP}, {F_mem}). Lookup and store operations for MC, such as lookup_mem and store_mem, are handled the same way that lookupStore and updateStore are handled for PL.

Example 6. Fig. 1(d) shows the MC code that corresponds to the swap code in Fig. 1(a): lines 1–3, lines 4–6, and lines 7–9 correspond to lines 1, 2, and 3 of Fig. 1(a), respectively. For the MC code in Fig. 1(d), I_MC⟦swap⟧U_id, which denotes the symbolic execution of swap, produces the StructUpdate

({EAX′ ← F_mem(EBP − 14)},
 {F′_mem ← F_mem[EBP − 10 ↦ F_mem(EBP − 14)][EBP − 14 ↦ F_mem(EBP − 10)]})

(a)
[1] void foo(int e, int x, int* p) {
[2]   ...
[3]   *p = e;
[4]   if(x == 5)
[5]     goto ERROR;
[6] }

(b)
[1] mov eax, p;
[2] mov ebx, e;
[3] mov [eax], ebx;
[4] cmp x, 5;
[5] jz ERROR;
[6] ...
[7] ERROR: ...

Fig. 6. (a) A simple source-code fragment written in PL; (b) the MC code for (a)

Fig. 1(d) illustrates why it is essential to be able to handle address arithmetic: an access on a source-level variable is compiled into machine code that dereferences an address in the stack frame, computed from the frame pointer (EBP) and an offset. This example shows that I_MC⟦·⟧ is able to handle address arithmetic correctly.

WLP. To create a formula for the WLP of ϕ with respect to instruction i via semantic reinterpretation, we use the reinterpreted MC semantics I_MC⟦·⟧, together with the reinterpreted L[MC] meaning function F_MC⟦·⟧, where F_MC⟦·⟧ is created via the same approach used in §4 to reinterpret L[PL]. WLP(i, ϕ) is obtained by performing F_MC⟦ϕ⟧(I_MC⟦i⟧U_id).

Example 7. Fig. 6(a) shows a source-code fragment; Fig. 6(b) shows the corresponding MC code. (To simplify the MC code, source-level variable names are used.) In Fig. 6(a), the largest set of states just before line [3] that causes the branch to ERROR to be taken at line [4] is described by WLP(∗p = e, x = 5). In Fig. 6(b), an expression that characterizes whether the branch to ERROR is taken is WLP(s_[1]-[5], (EIP = c_[7])), where s_[1]-[5] denotes instructions [1]–[5] of Fig. 6(b), and c_[7] is the address of ERROR. Using semantic reinterpretation, F_MC⟦EIP = c_[7]⟧(I_MC⟦s_[1]-[5]⟧U_id) produces the formula

(ite((F_mem(p) = x), F_mem(e), F_mem(x)) − 5) = 0,

which, transliterated to informal source-level notation, is (((p = &x) ? e : x) − 5) = 0. Even though the branch is split across two instructions, WLP can be used to recover the branch condition. WLP(cmp x,5; jz ERROR, (EIP = c_[7])) returns the formula ite(((F_mem(x) − 5) = 0), c_[7], c_[6]) = c_[7] as follows:

I_MC⟦cmp x,5⟧U_id = ({ZF′ ← (F_mem(x) − 5) = 0}, ∅) = U_1
I_MC⟦jz ERROR⟧U_1 = ({EIP′ ← ite(((F_mem(x) − 5) = 0), c_[7], c_[6])}, ∅) = U_2
F_MC⟦EIP = c_[7]⟧U_2 = (ite(((F_mem(x) − 5) = 0), c_[7], c_[6]) = c_[7])

Because c_[7] ≠ c_[6], this simplifies to (F_mem(x) − 5) = 0, i.e., in source-level terms, (x − 5) = 0.

Symbolic Composition. For MC, symbolic composition can be performed using U_MC⟦·⟧.

6 Other Language Constructs

Branching. Ex. 7 illustrated a WLP computation across a branch. We now illustrate forward symbolic evaluation across a branch. Suppose that an if-statement is represented by IfStmt(BE, Int32, Int32), where BE is the condition and the two Int32s are the addresses of the true-branch and the false-branch, respectively. Its factored semantics would specify how the value of the program counter PC changes:

I⟦IfStmt(BE, c_T, c_F)⟧σ = updateStore σ PC cond(B⟦BE⟧σ, const(c_T), const(c_F)).

In the reinterpretation for symbolic evaluation, the StructUpdate U obtained by I⟦IfStmt(BE, c_T, c_F)⟧U_id would be ({PC′ ← ite(ϕ_BE, c_T, c_F)}, ∅), where ϕ_BE is the Formula obtained for BE under the reinterpreted semantics. To obtain the branch condition for a specific branch, say the true-branch, we evaluate F⟦PC = c_T⟧U. The result is (ite(ϕ_BE, c_T, c_F) = c_T), which (assuming that c_T ≠ c_F) simplifies to ϕ_BE. (A similar formula simplification was performed in Ex. 7 on the result of the WLP formula.)

Loops. One kind of intended client of our approach to creating symbolic-analysis primitives is hybrid concrete/symbolic state-space exploration [6,13,7,3]. Such tools use a combination of concrete and symbolic evaluation to generate inputs that increase coverage. In such tools, a program-level loop is executed concretely a specific number of times as some path π is followed. The symbolic-evaluation primitive for a single instruction is applied to each instruction of π to obtain symbolic states at each point of π. A path-constraint formula that characterizes which initial states must follow π can be obtained by collecting the branch formula ϕ_BE obtained at each branch condition by the technique described above; the algorithm is shown in Fig. 7.

X86 String Instructions. X86 string instructions can involve actions that perform an a priori unbounded amount of work (e.g., the amount performed is determined by the value held in register ECX at the start of the instruction).

Formula ObtainPathConstraintFormula(Path π) {
  Formula ϕ = T;                    // Initial path-constraint formula
  StructUpdate U = U_id;            // Initial symbolic state-transformer
  let [PC_1: i_1, PC_2: i_2, ..., PC_n: i_n, PC_{n+1}: skip] = π in
  for (k = 1; k ≤ n; k++) {
    U = I⟦i_k⟧U;                    // Symbolically execute i_k
    if (i_k is a branch instruction)
      ϕ = ϕ && F⟦PC = PC_{k+1}⟧U;   // Conjoin the branch condition for i_k
  }
  return ϕ;
}

Fig. 7. An algorithm to obtain a path-constraint formula that characterizes which initial states must follow path π


This can be reduced to the loop case discussed above by giving a semantics in which the instruction itself is one of its two successors. In essence, the "microcode loop" is converted into an explicit loop.

Procedures. A call statement's semantics (i.e., how the state is changed by the call action) would be specified with some collection of operations. Again, the reinterpretation of the state transformer is induced by the reinterpretation of each operation:

– For a call statement in a high-level language, there would be an operation that creates a new activation record. The reinterpretation of this would generate a fresh logical constant to represent the location of the new activation record.
– For a call instruction in a machine-code language, register operations would change the stack pointer and frame pointer, and memory operations would initialize fields of the new activation record. These are reinterpreted in exactly the same way that register and memory operations are reinterpreted for other constructs.

Dynamic Allocation. Two approaches are possible:

– The allocation package is implemented as a library. One can apply our techniques to the machine code from the library.
– If a formula is desired that is based on a high-level semantics, a call statement that calls malloc or new can be reinterpreted using the kind of approach used in other systems: a fresh logical constant denoting a new location can be generated, as sketched below.
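A hedged illustration of the fresh-constant idea (all names are ours; this is not the paper's code): each reinterpreted allocation call simply mints a new symbol.

class AllocReinterp {
    private int counter = 0;

    // A concrete interpretation of the same operation would return a machine
    // address; the symbolic reinterpretation returns a fresh logical constant
    // standing for "the location of the new activation record / heap object".
    String alloc(String site) {                  // site: e.g. "malloc" or "new"
        return "loc_" + site + "_" + counter++;
    }

    public static void main(String[] args) {
        AllocReinterp a = new AllocReinterp();
        System.out.println(a.alloc("malloc"));   // loc_malloc_0
        System.out.println(a.alloc("new"));      // loc_new_1
    }
}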

7 Implementation and Evaluation

Implementation. Our implementation uses the TSL system [9]. (TSL stands for "Transformer Specification Language".) The TSL language is a strongly typed, first-order functional language with a datatype-definition mechanism for defining recursive datatypes, plus deconstruction by means of pattern matching. Writing a TSL specification for an instruction set is similar to writing an interpreter in first-order ML. For instance, the meaning function I⟦·⟧ of §3.3 is written as a TSL function

state interpInstr(instruction I, state S) {...};

where instruction and state are user-defined datatypes that represent the syntactic objects (in this case, instructions) and the semantic states, respectively. We used TSL to (1) define the syntax of L[·] as a user-defined datatype; (2) create a reinterpretation based on L[·] formulas; (3) define the semantics of L[·] by writing functions that correspond to T⟦·⟧, F⟦·⟧, etc.; and (4) apply reinterpretation (2) to the meaning functions of L[·] itself. (We already had TSL specifications of x86 and PowerPC.) TSL's meta-language provides a fixed set of base-types; a fixed set of arithmetic, bitwise, relational, and logical operators; and a facility for defining


map-types. Each TSL reinterpretation is defined over the meta-language constructs, by reinterpreting the TSL base-types, base-type operators, map-types, and map-type operators (i.e., access and update). When semantic reinterpretation is performed in this way, it is independent of any given subject language. Consequently, now that we have carried out steps (1)–(4), all three symbolic-analysis primitives can be generated automatically for a new instruction set IS merely by writing a TSL specification of IS and then applying the TSL compiler. In essence, TSL acts as a "Yacc-like" tool for generating symbolic-analysis primitives from a semantic description of an instruction set.

To illustrate the leverage gained by using the approach presented in this paper, the following table lists the number of (non-blank) lines of C++ that are generated from the TSL specifications of the x86 and PowerPC instruction sets, alongside the number of (non-blank) lines of TSL from which they are generated.

            TSL Specifications                          Generated C++ Templates
            I⟦·⟧     F⟦·⟧ ∪ T⟦·⟧ ∪ FE⟦·⟧ ∪ U⟦·⟧         I⟦·⟧      F⟦·⟧ ∪ T⟦·⟧ ∪ FE⟦·⟧ ∪ U⟦·⟧
x86         3,524    1,510                              23,109    15,632
PowerPC     1,546    (already written)                  12,153    15,632

The C++ code is emitted as a template, which can be instantiated with different interpretations. For instance, instantiations that create C++ implementations of I_x86⟦·⟧ and I_PowerPC⟦·⟧ (i.e., emulators for x86 and PowerPC, respectively) can be obtained trivially. Thus, for a hybrid concrete/symbolic tool for x86, our tool essentially furnishes 23,109 lines of C++ for the concrete-execution component and 23,109 lines of C++ for the symbolic-evaluation component. Note that the 1,510 lines of TSL that define F⟦·⟧, T⟦·⟧, FE⟦·⟧, and U⟦·⟧ need to be written only once. In addition to the components for concrete and symbolic evaluation, one also obtains an implementation of WLP, via the method described in §4, by calling the C++ implementations of F⟦·⟧ and I⟦·⟧: WLP(s, ϕ) = F⟦ϕ⟧(I⟦s⟧U_id). WLP is guaranteed to be consistent with the components for concrete and symbolic evaluation (modulo bugs in the implementation of TSL).


Table 1. Experimental results. We report the number of tests executed, the average length of the trace obtained from the tests, and the average number of branches in the traces. For the faithful version, we report the average time taken for concrete execution (CE) and symbolic evaluation (SE). In the approximate version, these were done in lock step, and their total time is reported as (C+SE). (All times are in seconds.) For each version, we also report the average time taken by the SMT solver (Yices), the average number of constraints found (|ϕ|), and the divergence rate. For the approximate version, we also show the average distance (in % of the total length of the trace) before a diverging test diverged.

                          |Trace|              Faithful                             Approximate
STL name        # Tests   #instrs   # branch   CE     SE      SMT    |ϕ|     Div.   C+SE    SMT    |ϕ|    Div.   Dist.
search          18        770       28         0.26   8.68    0.26   10.5    0%     9.13    0.10   4.8    61%    55%
random shuffle  48        1831      51         0.59   21.6    0.17   27.3    0%     21.9    0.03   1.0    95%    93%
copy            5         1987      57         0.69   55.0    0.15   5.4     0%     55.8    0.03   1.0    60%    57%
partition       13        2155      76         0.72   26.4    0.43   35.2    0%     27.4    0.02   1.0    92%    58%
max element     101       2870      224        0.94   17.0    3.59   153.0   0%     18.0    2.90   78.4   83%    6%
transform       11        10880     476        4.22   720.8   1.12   220.6   0%     713.6   0.03   1.0    82%    89%

However, if we eventually hope to model check x86 machine code, implementations of faithful symbolic techniques will be required. Using faithful symbolic techniques could raise the cost of performing symbolic operations, because faithful path-constraint formulas could end up being a great deal more complex than unfaithful ones. Thus, our experiment was designed to answer the question "What is the cost of using exact symbolic-evaluation primitives instead of unfaithful ones?" It would have been an error-prone task to implement a faithful symbolic-evaluation primitive for x86 machine code manually. Using TSL, however, we were able to generate a faithful symbolic-evaluation primitive from an existing, well-tested TSL specification of the semantics of x86 instructions. We also generated an unfaithful symbolic-evaluation primitive that adopts SAGE's approximate approach. We used these to create two symbolic-evaluation tools that perform state-space exploration: one that uses the faithful primitive, and one that uses the unfaithful primitive. Although the presentation in earlier sections was couched in terms of simplified core languages, the implemented tools work with real x86 programs. Our experiments used six C++ programs, each exercising a single algorithm from the C++ STL, compiled under Visual Studio 2005. We compared the two tools' divergence rates and running times (see Tab. 1). On average, the approximate version had 5.2X fewer constraints in ϕ, had a 79% divergence rate, and was about 2X faster than the faithful version; the faithful version reported no divergences.

8 Related Work

Symbolic analysis is used in many recent systems for testing and verification:

– Hybrid concrete/symbolic tools [6,13,7,3] use a combination of concrete and symbolic evaluation to generate inputs that increase coverage.
– WLP can be used to create new predicates that split part of a program's abstract state space [1,2].


– Symbolic composition is useful when a tool has access to a formula that summarizes a called procedure's behavior [14]; re-exploration of the procedure is avoided by symbolically composing a path formula with the procedure-summary formula.

However, compared with the way such symbolic-analysis primitives are implemented in existing program-analysis tools, our work has one key advantage: it creates the core concrete-execution and symbolic-analysis components in a way that ensures by construction that they are mutually consistent. We are not aware of existing tools in which the concrete-execution and symbolic-analysis primitives are implemented in a way that guarantees such a consistency property. For instance, in the source code for B2 [8] (the next-generation Blast), one finds symbolic evaluation (post) and WLP implemented with different pieces of code, and hence mutual consistency is not guaranteed. WLP is implemented via substitution, with special-case code for handling pointers.

References

1. Ball, T., Majumdar, R., Millstein, T., Rajamani, S.: Automatic predicate abstraction of C programs. In: PLDI (2001)
2. Beckman, N., Nori, A., Rajamani, S., Simmons, R.: Proofs from tests. In: ISSTA (2008)
3. Brumley, D., Hartwig, C., Liang, Z., Newsome, J., Poosankam, P., Song, D., Yin, H.: Automatically identifying trigger-based behavior in malware. In: Botnet Analysis and Defense. Springer, Heidelberg (2008)
4. Cousot, P., Cousot, R.: Abstract interpretation. In: POPL (1977)
5. Ganesh, V., Dill, D.: A decision procedure for bit-vectors and arrays. In: Damm, W., Hermanns, H. (eds.) CAV 2007. LNCS, vol. 4590, pp. 519–531. Springer, Heidelberg (2007)
6. Godefroid, P., Klarlund, N., Sen, K.: DART: Directed automated random testing. In: PLDI (2005)
7. Godefroid, P., Levin, M., Molnar, D.: Automated whitebox fuzz testing. In: NDSS (2008)
8. Jhala, R., Majumdar, R.: B2: Software model checking for C (2009), http://www.cs.ucla.edu/~rupak/b2/
9. Lim, J., Reps, T.: A system for generating static analyzers for machine instructions. In: Hendren, L. (ed.) CC 2008. LNCS, vol. 4959, pp. 36–52. Springer, Heidelberg (2008)
10. Malmkjær, K.: Abstract Interpretation of Partial-Evaluation Algorithms. PhD thesis, Dept. of Comp. and Inf. Sci., Kansas State Univ. (1993)
11. Morris, J.: A general axiom of assignment. In: Broy, M., Schmidt, G. (eds.) Theor. Found. of Program. Methodology. Reidel, Dordrecht (1982)
12. Mycroft, A., Jones, N.: A relational framework for abstract interpretation. In: PADO (1985)
13. Sen, K., Marinov, D., Agha, G.: CUTE: A concolic unit testing engine for C. In: FSE (2005)
14. Xie, Y., Aiken, A.: Saturn: A scalable framework for error detection using Boolean satisfiability. TOPLAS 29(3) (2007)
15. Xie, Y., Chou, A., Engler, D.: ARCHER: Using symbolic, path-sensitive analysis to detect memory access errors. In: FSE (2003)

EMMA: Explicit Model Checking Manager (Tool Presentation)⋆

Radek Pelánek and Václav Rosecký

Department of Information Technology, Faculty of Informatics, Masaryk University Brno, Czech Republic

Abstract. Although model checking is usually described as an automatic technique, the verification process with the use of a model checker is far from fully automatic. In this paper we elaborate on the concept of a verification manager, which contributes to the automation of the verification process by enabling an efficient parallel combination of different verification techniques. We introduce the tool EMMA (Explicit Model checking MAnager), which is a practical realization of the concept, and discuss practical experience with the tool.

1 Introduction

Although model checking algorithms are automatic, the process of using a model checker can be quite elaborate and far from automatic. In order to successfully verify a model, it is often necessary to select appropriate techniques and parameter values. The selection is difficult because there is a very large number of different heuristics and optimization techniques – our review of techniques [5] identified more than 100 papers just in the area of explicit model checking. These techniques are often complementary, and there are non-trivial trade-offs which are hard to understand. In general, there is no best technique. Some techniques are more suited for verification; other techniques are better for the detection of errors. Some techniques bring good improvement in a narrow domain of applicability, whereas in other cases they can worsen the performance [5]. The user needs significant experience to choose good techniques. Moreover, models are usually parametrized, and there are several properties to be checked. Thus the process of verification requires not just experience, but also a laborious effort, which is itself error prone. Another motivation for automating the verification process comes from trends in the development of hardware. Until recently, the performance of model checkers was continually improved by increasing processor speed. In recent years, however, the improvement in processor speed has slowed down, and processor designers have shifted their efforts towards parallelism [2]. This trend poses a challenge for further improvement of model checkers. A classic approach to the application of parallelism in model checking is based on the distribution of a state space among several workstations (processors). This approach, however, involves

⋆ Partially supported by GA ČR grant no. 201/07/P035.



large communication overhead. Given the large number of techniques and hard-to-understand trade-offs, there is another way to employ parallelism: to run independent verification runs on individual workstations (processors) [2,5,8]. This approach, however, cannot be performed efficiently by hand. We need to automate the verification process. With the aim of automating the verification process, we elaborate on a general concept of a verification manager [6] and provide its concrete realization for the domain of explicit model checking. We also describe experience with the tool and discuss problematic issues concerning fair evaluation. The most closely related work is by Holzmann et al.: a tool for the automated execution of verification runs for several model parameters and correctness properties using one fixed verification technique [3], and "swarm verification" based on the parallel execution of many different techniques [2]; their approach, however, does not allow any communication among techniques, and they do not discuss the selection of the techniques that are used for the verification (the verification strategy). This paper describes the main ideas of our approach and our tool EMMA. More details are given in the technical report [7] (including a more detailed discussion of related work).

2 Concept and Implementation

The verification manager is a tool which automates the verification process (see Fig. 1). As input it takes a (parametrized) model and a list of properties. Then it employs the available resources (hardware, verification techniques) to perform verification: the manager distributes the work among individual workstations, collects results, and informs the user about the progress and the final results. Decisions of the manager (e.g., which technique should be started) are governed by a "verification strategy". The verification strategy needs to be written by an expert user, but since it is generic, it can be used on many different models. In this way even a layman user can exploit the experience of expert users. A long-term log is used to store all input problems and verification results. It can be used for the evaluation of strategies and for their improvement. As a proof of concept we introduce a prototype of the verification manager for the domain of explicit model checking: the Explicit Model checking MAnager (EMMA). The tool is publicly available on the web page: http://anna.fi.muni.cz/~xrosecky/emma_web. EMMA is based on the Distributed Verification Environment (DiVinE) [1]. All of the verification techniques used are implemented in C++ with the use of the DiVinE library. At the moment, we use the following techniques: breadth-first search, depth-first search, random walk, directed search, bitstate hashing (with refinement), and under-approximation based on partial-order reduction. Other techniques available in DiVinE can be easily incorporated. The manager itself is implemented in Java. Currently, the manager supports as the underlying hardware a network of workstations connected by Ethernet. Communication is based on SSH and stream sockets.


Fig. 1. Verification manager — context

We can view the manager as a tool for performing a search in a "meta state space" of verification techniques and their parameters [6]. To perform this meta-search we need some heuristic – that is our verification strategy. There are several possible approaches to the realization of a strategy (see [7]). We use the following one: we fix a basic skeleton of the strategy and implement support for this skeleton in the manager. The specifics of the strategy (e.g., the order of techniques, values of parameters) are specified separately in a simple format; this specification of the strategy can be easily and quickly (re)written by an expert user. In the implementation, the strategy description is given in an XML format. For the first evaluation we use simple priority-based strategies: for each technique we specify a priority, a timeout, and parameter values, and techniques are executed according to their priorities. EMMA provides visualizations of executions (Fig. 2). These visualizations can be used for a better understanding of the tool's functionality and for the improvement of strategies.
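The following Java sketch (ours, not EMMA's code; the technique names come from the list above, but the parameter strings and timeouts are invented placeholders) illustrates the priority-based skeleton: each technique carries a priority, a timeout, and parameter values, and idle workstations receive techniques in priority order. In EMMA itself, this table is read from the XML strategy description.

import java.util.*;
import java.util.concurrent.*;

class StrategySketch {
    record Technique(String name, int priority, long timeoutSec, String params) {}

    public static void main(String[] args) throws InterruptedException {
        List<Technique> strategy = new ArrayList<>(List.of(
            new Technique("bitstate-hashing", 1, 600,  "-hash 28"),
            new Technique("random-walk",      2, 300,  "-walks 100"),
            new Technique("depth-first",      3, 1800, "")));
        strategy.sort(Comparator.comparingInt(Technique::priority));

        ExecutorService workstations = Executors.newFixedThreadPool(4);
        for (Technique t : strategy)
            workstations.submit(() -> run(t));   // in EMMA: dispatch over SSH
        workstations.shutdown();
        workstations.awaitTermination(1, TimeUnit.HOURS);
    }

    static void run(Technique t) {
        // Placeholder for launching a DiVinE-based verifier with t.params()
        // and killing it after t.timeoutSec() seconds.
        System.out.printf("running %s (%s), timeout %ds%n",
                          t.name(), t.params(), t.timeoutSec());
    }
}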

3 Experiences

The first experience is that the manager significantly simplifies the use of a model checker for parametrized models, even for an experienced user; this contribution is not easily measurable, but it is very important for practical applications of model checkers. We also performed a comparison of different strategies by running EMMA over models from BEEM [4] (probably the largest collection of models for explicit model checkers). We found that the results depend very much on the selection of input problems and that it is very difficult to give a fair evaluation. When we use mainly models without errors, strategies which focus on verification are more successful than strategies tuned for finding errors (Fig. 2, model Szymanski). When we use models with easy-to-find errors, there are negligible differences

[Fig. 2 shows four execution charts: Strategy A: Firewire (58/60); Strategy B: Firewire (45/60); Strategy A: Szymanski (21/24); Strategy B: Szymanski (23/24).]

Fig. 2. Illustration of EMMA executions on 4 workstations for two models and two strategies. Each line corresponds to one workstation; numbers in boxes are identifications of model instances. The ratio X/Y means the number of decided properties (X) to the number of all properties to be verified (Y). For a color version see [7] or the tool web page.

among strategies, and we could be tempted to conclude that the choice of strategy does not matter. When we use models with hard-to-find errors, there are significant differences among strategies (Fig. 2, model Firewire); the success of individual strategies is, however, very much dependent on the choice of particular models and errors. By a suitable selection of input problems we could "demonstrate" (even using a quite large set of inputs) both that "the verification manager brings significant improvement" and that "the verification manager is rather useless". So what are the "correct" input problems? The ideal case, in our opinion, is to use a large number of realistic case studies from an application domain of interest; moreover, these case studies should be used not just in their final correct versions, but also in developmental versions with errors. However, this ideal is not realizable at the moment: although there is already a large number of available case studies in the domain of explicit model checking, developmental versions of these case studies are not publicly available. The employment of a verification manager could help to overcome this problem. The long-term log can be used to archive all models and properties for which


verification was performed (with the user's consent). Data collected in this way can later be used for evaluation. Due to the above-described bias caused by the selection of models, we do not provide a numerical evaluation, but only general observations:

– For models with many errors, it is better to use a strategy which employs several different (incomplete) techniques.
– For models which satisfy most of the properties, it is better to use a strategy which calls just one simple state space traversal technique with a large timeout.
– If two strategies are comprised of the same techniques (with just different priorities and timeouts), there can be a noticeable difference between them, but this difference is usually less than an order of magnitude. Note that differences among individual verification techniques are often larger than an order of magnitude [8].

Thus even with the use of a manager, we do not have a single universal approach. A suitable verification strategy depends on the application domain and also on the "phase of verification": different strategies are suitable for the early debugging of a model and for the final verification. Nevertheless, the usage of a model checker becomes much simpler, since it suffices to use (and understand) just a few strategies, which can be constructed by an expert user specifically for a given application domain of interest.

References

1. Barnat, J., Brim, L., Černá, I., Moravec, P., Ročkai, P., Šimeček, P.: DiVinE – a tool for distributed verification. In: Ball, T., Jones, R.B. (eds.) CAV 2006. LNCS, vol. 4144, pp. 278–281. Springer, Heidelberg (2006), http://anna.fi.muni.cz/divine
2. Holzmann, G.J., Joshi, R., Groce, A.: Tackling large verification problems with the swarm tool. In: Havelund, K., Majumdar, R., Palsberg, J. (eds.) SPIN 2008. LNCS, vol. 5156, pp. 134–143. Springer, Heidelberg (2008)
3. Holzmann, G.J., Smith, M.H.: Automating software feature verification. Bell Labs Technical Journal 5(2), 72–87 (2000)
4. Pelánek, R.: BEEM: Benchmarks for explicit model checkers. In: Bošnački, D., Edelkamp, S. (eds.) SPIN 2007. LNCS, vol. 4595, pp. 263–267. Springer, Heidelberg (2007)
5. Pelánek, R.: Fighting state space explosion: Review and evaluation. In: Proc. of Formal Methods for Industrial Critical Systems, FMICS 2008 (2008) (to appear)
6. Pelánek, R.: Model classifications and automated verification. In: Leue, S., Merino, P. (eds.) FMICS 2007. LNCS, vol. 4916, pp. 149–163. Springer, Heidelberg (2008)
7. Pelánek, R., Rosecký, V.: Verification manager: Automating the verification process. Technical Report FIMU-RS-2009-02, Masaryk University Brno (2009)
8. Pelánek, R., Rosecký, V., Moravec, P.: Complementarity of error detection techniques. In: Proc. of Parallel and Distributed Methods in verifiCation, PDMC (2008)

Efficient Testing of Concurrent Programs with Abstraction-Guided Symbolic Execution

Neha Rungta¹, Eric G. Mercer¹, and Willem Visser²

¹ Dept. of Computer Science, Brigham Young University, Provo, UT 84602, USA
² Division of Computer Science, University of Stellenbosch, South Africa

Abstract. In this work we present an abstraction-guided symbolic execution technique that quickly detects errors in concurrent programs. The input to the technique is a set of target locations that represent a possible error in the program. We generate an abstract system from a backward slice for each target location. The backward slice contains program locations relevant to testing the reachability of the target locations. The backward slice only considers sequential execution and does not capture any inter-thread dependencies. A combination of heuristics is used to guide a symbolic execution along locations in the abstract system in an effort to generate a corresponding feasible execution trace to the target locations. When the symbolic execution is unable to make progress, we refine the abstraction by adding locations to handle inter-thread dependencies. We demonstrate empirically that abstraction-guided symbolic execution generates feasible execution paths in the actual system to find concurrency errors in a few seconds where exhaustive symbolic execution fails to find the same errors in an hour.

1 Introduction

The current trend of multi-core and multi-processor computing is causing a paradigm shift from inherently sequential to highly concurrent and parallel applications. Certain thread interleavings, data input values, or combinations of both often cause errors in a system. Systematic verification techniques such as explicit state model checking and symbolic execution are extensively used to detect errors in such systems [9,25,7,12,17]. Explicit state model checking enumerates all possible thread schedules and input data values of a program in order to check for errors [9,25]. To partially mitigate the state space explosion from data input values, symbolic execution techniques substitute data input values with symbolic values [12,24,17]. Explicit state model checking and symbolic execution techniques used in conjunction with exhaustive search techniques such as depth-first search are unable to detect errors in medium to large-sized concurrent programs because the number of behaviors caused by data and thread non-determinism is extremely large. In this work we present an abstraction-guided symbolic execution technique that efficiently detects errors caused by a combination of thread schedules and


data values in concurrent programs. The technique generates a set of key program locations relevant to testing the reachability of the target locations. The symbolic execution is then guided along these locations in an attempt to generate a feasible execution path to the error state. This allows the execution to focus on the parts of the behavior space more likely to contain an error. A set of target locations that represent a possible error in the program is provided as input to generate an abstract system. The input target locations are either generated from static analysis warnings, imprecise dynamic analysis techniques, or user-specified reachability properties. The abstract system is constructed with the program locations contained in a static interprocedural backward slice for each target location, together with synchronization locations that lie along control paths to the target locations [10]. The static backward slice contains call sites, conditional branch statements, and data definitions that determine the reachability of a target location. The backward slice only considers sequential control flow execution and does not contain data values or inter-thread dependencies. We systematically guide the symbolic execution toward locations in the abstract system in order to reach the target locations. A combination of heuristics is used to automatically pick thread identifiers and input data values at points of thread and data non-determinism, respectively. We use the abstract system to guide the symbolic execution; we do not verify or search the abstract system like most other abstraction-refinement techniques [3,8]. At points when the program execution is unable to move further along a sequence of locations (e.g., due to the value of a global variable at a particular conditional statement), we refine the abstract system by adding program statements that re-define the global variables. The refinement step adds inter-thread dependence information to the abstract system on an as-needed basis. The contributions of this work are as follows:

1. An abstraction technique that uses static backward slicing along a sequential control flow execution of the program to generate locations relevant to checking the reachability of certain target locations.
2. A guided symbolic execution technique that generates a feasible execution trace corresponding to a sequence of locations in the abstract system.
3. A novel heuristic that uses the information in the abstract system to rank data non-determinism in symbolic execution.
4. A refinement heuristic that adds inter-thread dependence information to the abstract system when the program execution is unable to make progress.

We demonstrate in an empirical analysis on benchmarked multi-threaded Java programs and the JDK 1.4 concurrent libraries that locations in the abstract system can be used to generate feasible execution paths to the target locations. We show that the abstraction-guided technique can find errors in multi-threaded Java programs in a few seconds where exhaustive symbolic execution is unable to find the errors within a time bound of an hour.

[Fig. 1 diagram: the target locations Lt are fed to control- and data-dependence analyses, which build the abstract system (in the example, definitions def(a) at l0 and l1, the branch if(a) at l2, and its true-branch target l3); the abstract system then drives the symbolic execution by ranking thread schedules (t0, t1, ..., tn) and input data values at symbolic branches (if(a_sym), true/false); the refinement step feeds inter-thread dependence information back into the abstract system.]

Fig. 1. Overview of the abstraction-guided symbolic execution technique

2 Overview

A high-level overview of the technique is shown in Fig. 1.

Input: The input to the technique is a set of target locations, Lt, that represent a possible error in the program. The target locations can either be generated using a static analysis tool or from a user-specified reachability property. The lockset analysis, for example, reports program locations where lock acquisitions by unique threads may lead to a deadlock [5]. The lock acquisition locations generated by the lockset analysis are the input target locations for the technique.

Abstract System: An abstraction of the program is generated from backward slices of the input target locations and synchronization locations that lie along control paths to the target locations. Standard control and data dependence analyses are used to generate the backward slices. Location l3 is a single target location in Fig. 1. The possible execution of location l3 is control dependent on the true branch of the conditional statement l2. Two definitions of a global variable a, at locations l0 and l1, reach the conditional statement l2; hence, locations l0, l1, and l2 are part of the abstract system. These locations are directly relevant to testing the reachability of l3.

Abstraction-Guided Symbolic Execution: The symbolic execution is guided along a sequence of locations (an abstract trace: ⟨l0, l2, l3⟩) in the abstract system. The program execution is guided using heuristics to intelligently rank the successor states generated at points of thread and data non-determinism. The guidance strategy uses the information that l3 is control dependent on the true branch of location l2, and in the ranking scheme it prefers the successor representing the true branch of the conditional statement.

Refinement: When the symbolic execution cannot reach the desired target of a conditional branch statement containing a global variable, we refine the abstract system by adding inter-thread dependence information. Suppose we cannot generate the successor state for the true branch of the conditional statement while

(a)
1:  Thread A{
2:    ...
3:    public void run(Element elem){
4:      lock(elem)
5:      check(elem)
6:      unlock(elem)
7:    }
8:    public void check(Element elem)
9:      if elem.e > 9
10:       Throw Exception
11: }}

(b)
1:  Thread B {
2:    ...
3:    public void run(Element elem){
4:      int x /* Input Variable */
5:      if x > 18
6:        lock(elem)
7:        elem.reset()
8:        unlock(elem)
9:  }}

(c)
1:  Object Element{
2:    int e
3:    ...
4:    public Element(){
5:      e := 1
6:    }
7:    public void reset(){
8:      e := 11
9:  }}

Fig. 2. An example of a multi-threaded program with two threads: A and B

guiding along ⟨l0, l2, l3⟩ in Fig. 1; then the refinement automatically adds another definition of a to the abstract trace, resulting in ⟨l1, l0, l2, l3⟩. The new abstract trace implicitly states that two different threads need to define the variable a, at locations l1 and l0. Note that there is no single control flow path that passes through both l1 and l0.

Output: When the guided symbolic execution technique discovers a feasible execution path, we output the trace. The technique, however, cannot detect infeasible errors. In such cases it outputs a "Don't know" response.

3 Program Model and Semantics

To simplify the presentation of the guided symbolic execution we describe a simple programming model for multi-threaded and object-oriented systems. The restrictions, however, do not apply to the techniques presented in this work, and the empirical analysis is conducted on Java programs. Our programs contain conditional branch statements, procedures, basic data types, complex data types supporting polymorphism, threads, exceptions, assertion statements, and an explicit locking mechanism. The threads are separate entities. The programs contain a finite number of threads with no dynamic thread creation. The threads communicate with each other through shared variables and use explicit locks to perform synchronization operations. The program can also seek input for data values from the environment. In Fig. 2 we present an example of such a multi-threaded program with two threads, A and B, that communicate with each other through a shared variable, elem, of type Element. Thread A essentially checks the value of elem.e at line 9 in Fig. 2(a), while thread B resets the value of elem.e in Fig. 2(b) at line 7 by invoking the reset function shown in Fig. 2(c). We use the simple example in Fig. 2 throughout the rest of the paper to demonstrate how the guided symbolic execution technique works.


A multi-threaded program, M, is a tuple ⟨{T0, T1, ..., T_{u−1}}, Vc, Dsym⟩ where each Ti is a thread with a unique identifier id ∈ {0, 1, ..., u−1} and a set of local variables; Vc is a finite set of concrete variables; and Dsym is a finite set of all input data variables in the system. An input data variable is essentially any variable that seeks a response from the environment. A runtime environment implements an interleaving semantics over the threads in the program. The runtime environment operates on a program state s that contains: (1) valuations of the variables in Vc, (2) for each thread Ti, the values of its local variables, its runtime stack, and its current program location, (3) the symbolic representations and values of the variables in Dsym, and (4) a path constraint, φ (a set of constraints), over the variables in Dsym. The runtime environment provides a set of functions to access certain information in a program state s:

– getCurrentLoc(s) returns the current program location of the most recently executed thread in state s.
– getLoc(s, i) returns the current program location of the thread with identifier i in state s.
– getEnabledThreads(s) returns the set of identifiers of the threads enabled in s. A thread is enabled if it is not blocked (not waiting to acquire a lock).

Given a program state s, the runtime environment generates a set of successor states {s0, s1, ..., sn} based on the following rules, for all i ∈ getEnabledThreads(s) with l := getLoc(s, i):

1. If l is a conditional branch with symbolic primitive data types in the branch predicate P, the runtime environment can generate at most two possible successor states: it can assign values to variables in Dsym to satisfy the path constraint φ ∧ P for the target of the true branch, or to satisfy its negation φ ∧ ¬P for the target of the false branch (see the sketch below).
2. If l accesses an uninitialized symbolic complex data structure o_sym of type T, then the runtime environment generates multiple possible successor states where o_sym is initialized to: (a) null, (b) references to new objects of type T and all its subtypes, and (c) existing references to objects of type T and all its subtypes [11].
3. If neither rule 1 nor rule 2 is satisfied, then the runtime environment generates a single successor state obtained by executing l in thread Ti.

In the initial program state, s0, the current program location of each thread is initialized to its corresponding start location, while the variables in Dsym are assigned a symbolic value v⊥ that represents an uninitialized value. A state sn is reachable from the initial state s0 if, using the runtime environment, we can find a non-zero-length sequence of states ⟨s0, s1, ..., sn⟩ leading from s0 to sn such that si+1 is a successor of si for all 0 ≤ i ≤ n−1. Such a sequence of program states represents a feasible execution path through the system. The sequence of program states provides a set of concrete data values and a valid path constraint over the symbolic values. The reachable state space S can be generated using the runtime environment, where S := {s | ∃⟨s0, ..., s⟩}.
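The following Java sketch (ours; the solver is stubbed, whereas a real implementation would discharge φ ∧ P and φ ∧ ¬P with a decision procedure) illustrates rule 1: at a branch over symbolic data, at most two successors are produced, each extending the path constraint.

import java.util.*;

class BranchRule {
    record State(List<String> pathConstraint, int location) {}

    interface Solver { boolean satisfiable(List<String> constraints); }

    static List<State> branchSuccessors(State s, String pred,
                                        int trueLoc, int falseLoc, Solver smt) {
        List<State> succs = new ArrayList<>();
        List<String> phiAndP = extend(s.pathConstraint(), pred);
        if (smt.satisfiable(phiAndP))                      // phi && P
            succs.add(new State(phiAndP, trueLoc));
        List<String> phiAndNotP = extend(s.pathConstraint(), "!(" + pred + ")");
        if (smt.satisfiable(phiAndNotP))                   // phi && !P
            succs.add(new State(phiAndNotP, falseLoc));
        return succs;
    }

    static List<String> extend(List<String> phi, String c) {
        List<String> out = new ArrayList<>(phi);
        out.add(c);
        return out;
    }

    public static void main(String[] args) {
        Solver optimist = cs -> true;                      // stub: everything SAT
        State atBranch = new State(new ArrayList<>(), 5);
        // the branch "if x > 18" at line 5 of Thread B in Fig. 2(b),
        // with true target line 6 and false target line 9:
        System.out.println(branchSuccessors(atBranch, "x > 18", 6, 9, optimist));
    }
}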

4 Abstraction

In this work we create an abstract system that contains the program locations relevant to checking the reachability of the target locations. We then use the locations in the abstract system to guide the symbolic execution. The abstract system is constructed with the program locations contained in a static interprocedural backward slice for each target location. The abstract system also contains synchronization locations that lie along control paths to the target locations. A backward slice of a program with respect to a program location l and a set of program variables V consists of all statements and predicates in the program that may affect the value of the variables in V at l and the reachability of l.

4.1 Background Definitions

Definition 1. A control flow graph (CFG) of a procedure in a system is a directed graph G := ⟨L, E⟩ where L is a set of uniquely labeled program locations in the procedure, while E ⊆ L × L is the set of edges that represents the possible flow of execution between the program locations. Each CFG has a start location lstart ∈ L and an end location lend ∈ L.

Definition 2. An interprocedural control flow graph (ICFG) for a system with p procedures is ⟨L, E⟩ where L := ⋃_{0≤i≤p} Li and E := ⋃_{0≤i≤p} Ei. Additional edges from each call site to the start location of the callee, and from the end location of a procedure back to its caller, are also added in the ICFG.

Definition 3. icfgPath(l, l′) describes a path in the ICFG and returns true iff there exists a sequence q := ⟨l, ..., l′⟩ such that (q[i], q[i+1]) ∈ E for 0 ≤ i < length(q) − 1.

Definition 4. postDom(l, l′) returns true iff on each path q := ⟨l, ..., lend⟩ in a CFG between l and an end location lend, there exists an i such that q[i] = l′, where 1 ≤ i ≤ length(q) − 1.

4.2 Abstract System

The abstract system is a directed graph A := ⟨Lα, Eα⟩ where Lα ⊆ L is the set of program locations, while Eα ⊆ Lα × Lα is the set of edges. The abstract system contains the target locations; the call sites, conditional branch statements, and data definitions in the backward slice of each target location; and all possible start locations of the program. It also contains synchronization operations that lie along control paths from the start of the program to the target locations. To compute an interprocedural backward slice, a backwards reachability analysis can be performed on a system dependence graph [10]. Note that the backward slice only considers sequential execution and ignores all inter-thread dependencies. Intuitively, a backward slice contains: (1) call sites and the start locations of the corresponding callees such that invoking the sequence of calls leads to a

[Fig. 3 diagram: (a) the initial abstract system, with start locations ⟨A.run lstart⟩ (l0) and ⟨B.run lstart⟩ (α0); in A.run, 4: lock(elem) (l1) and 5: check(elem) (l2) lead to ⟨check lstart⟩ (l3), 9: if elem.e > 9 (l4), 10: Exception (l5), and 6: unlock(elem) (l6); (b) the additions after refinement: in B.run, 5: if x > 18 (α1), 6: lock(elem) (α2), 7: elem.reset() (α3), and 8: unlock(elem) (α6), together with ⟨Element.reset lstart⟩ (α4) and 8: e := 11 (α5).]

Fig. 3. The abstract system for Fig. 2: (a) Initial Abstract System. (b) Additions to the abstract system after refinement.

procedure containing the target locations, (2) conditional statements that affect the reachability of the target locations, determined using control dependence analyses, (3) data definitions that affect the variables at the target locations, determined from data dependence analyses, and (4) all locations generated from the transitive closures of the control and data dependences.

In order to add the synchronization locations we define the auxiliary functions acqLock(l), which returns true iff l acquires a lock, and relLock(l, l′), which returns true iff l releases a lock that is acquired at l′. For each lα ∈ Lα we update Lα := Lα ∪ {l} if Eq. (1) is satisfied for l:

[icfgPath(l, lα) ∧ acqLock(l)] ∨ [icfgPath(lα, l) ∧ relLock(l, lα)]    (1)

After the addition of the synchronization locations and the locations from the backward slices, we connect the different locations. Edges between the different locations in the abstract system are added based on the control flow of the program as defined by the ICFG. To map the execution order of the program locations in the abstract system to the execution order in the ICFG, we check the post-dominance relationship between the locations while adding the edges. An edge between any two locations lα and lα′ in Lα is added to Eα if Eq. (2) evaluates to true:

∀ lα′′ ∈ Lα : ¬postDom(lα, lα′′) ∨ ¬postDom(lα′′, lα′)    (2)

The abstract system for the example in Fig. 2, where the target location is line 10 in the check method in Fig. 2(a), is shown in Fig. 3(a). Locations l0 and α0 in Fig. 3(a) are the two start locations of the program. The target location, l5, represents line 10 in Fig. 2(a). Location l2 is a call site that invokes start location l3, which reaches target location l5. The target location is control dependent on the conditional statement at line 9 in Fig. 2(a); hence, l4 is part of the abstract system in Fig. 3(a). The locations l1 and l6 are the lock and unlock operations. The abstract system shows that thread B is not currently relevant to testing the reachability of location l5.
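A skeletal Java rendering of the two tests (ours; the graph queries are left as an interface that a real CFG/ICFG analysis would implement, and we read Eq. (2) as quantifying over third locations distinct from the endpoints) is:

class AbstractSystemRules {
    interface Queries {
        boolean icfgPath(int from, int to);
        boolean postDom(int l, int lPrime);
        boolean acqLock(int l);
        boolean relLock(int l, int lockSite);
    }

    // Eq. (1): l joins L_alpha if it acquires a lock on some ICFG path to an
    // abstract location, or releases a lock that was acquired at one.
    static boolean addSyncLocation(int l, int lAlpha, Queries q) {
        return (q.icfgPath(l, lAlpha) && q.acqLock(l))
            || (q.icfgPath(lAlpha, l) && q.relLock(l, lAlpha));
    }

    // Eq. (2): the edge (la, lb) is added only if no third abstract location
    // sits between la and lb in the post-dominance order.
    static boolean addEdge(int la, int lb, int[] lAlphaSet, Queries q) {
        for (int lc : lAlphaSet)
            if (lc != la && lc != lb && q.postDom(la, lc) && q.postDom(lc, lb))
                return false;
        return true;
    }

    public static void main(String[] args) {
        Queries stub = new Queries() {   // toy oracle for a compile-and-run demo
            public boolean icfgPath(int a, int b) { return true; }
            public boolean postDom(int l, int p)  { return false; }
            public boolean acqLock(int l)         { return true; }
            public boolean relLock(int l, int s)  { return true; }
        };
        System.out.println(addSyncLocation(1, 4, stub));             // true
        System.out.println(addEdge(0, 5, new int[]{0, 2, 5}, stub)); // true
    }
}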

4.3 Abstract Trace Set

The input to the guided symbolic execution is an abstract trace set. The abstract trace set contains sequences of locations generated on the abstract system, A, from the start of the program to the various target locations in Lt. We refer to the sequences generated on the abstract system as abstract traces, to distinguish them from the sequences generated on the CFGs. To construct the abstract trace set, we first generate intermediate abstract trace sets, {P0, P1, ..., P_{t−1}}, that contain abstract traces between the start locations of the program (Ls) and the input target locations (Lt); hence, Pi := {π | π satisfies Eq. (3) and Eq. (4)}. We use array indexing notation to reference elements in π; hence, π[i] refers to the ith element of π.

∃ l0 ∈ Ls, lt ∈ Lt such that π[0] == l0 ∧ π[length(π) − 1] == lt    (3)

(π[i], π[i+1]) ∈ Eα ∧ (i ≠ j ⟹ π[i] ≠ π[j]) for 0 ≤ i, j ≤ length(π) − 1    (4)

Eq. (4) generates traces of finite length in the presence of cycles in the abstract system caused by loops, recursion, or cyclic dependencies in the program. Eq. (4) ensures that each abstract trace generated does not contain any duplicate locations, by not considering any back edges arising from cycles in the abstract system. We rely on the guidance strategy to drive the program execution through the cyclic dependencies toward the next interesting location in the abstract trace; hence, the cyclic dependencies are not encoded in the abstract traces that are generated from the abstract system. Each intermediate abstract trace set, Pi, contains several abstract traces from the start of the program to a single target location li ∈ Lt. We generate a set of final abstract trace sets as:

ΠA := {{π0, ..., π_{t−1}} | π0 ∈ P0, ..., π_{t−1} ∈ P_{t−1}}

Each Πα ∈ ΠA contains a set of abstract traces, Πα := {πα0, πα1, ..., πα_{t−1}}, where each παi ∈ Πα is an abstract trace leading from the start of the program to a unique li ∈ Lt. Since there exists an abstract trace in Πα for each target location in Lt, |Πα| == |Lt|. The input to the guided symbolic execution technique is a Πα ∈ ΠA. The different abstract trace sets in ΠA allow us to easily distribute checking the feasibility of individual abstract trace sets over a large number of computation nodes. Each execution is completely independent of the others, and as soon as we find a feasible execution path to the target locations we can simply terminate the other trials. In the abstract system shown in Fig. 3(a) there is only a single target location: line 10 in the check procedure shown in Fig. 2(a). Furthermore, the abstract system only contains one abstract trace leading from the start of the program to the target location. The abstract trace set Πα is a singleton set containing ⟨l0, l1, l2, l3, l4, l5⟩.
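Enumerating an intermediate trace set amounts to a duplicate-free path search over the abstract system; the following Java sketch (ours, with locations as plain integers) returns the single trace of Fig. 3(a).

import java.util.*;

class AbstractTraces {
    // All duplicate-free paths from 'start' to 'target' over the
    // abstract-system edges, i.e. the traces admitted by Eqs. (3) and (4).
    static List<List<Integer>> traces(Map<Integer,List<Integer>> edges,
                                      int start, int target) {
        List<List<Integer>> result = new ArrayList<>();
        dfs(edges, start, target, new ArrayList<>(List.of(start)), result);
        return result;
    }

    static void dfs(Map<Integer,List<Integer>> edges, int cur, int target,
                    List<Integer> path, List<List<Integer>> result) {
        if (cur == target) { result.add(new ArrayList<>(path)); return; }
        for (int next : edges.getOrDefault(cur, List.of())) {
            if (path.contains(next)) continue;   // Eq. (4): cut cycles
            path.add(next);
            dfs(edges, next, target, path, result);
            path.remove(path.size() - 1);
        }
    }

    public static void main(String[] args) {
        // The single trace of Fig. 3(a): l0 -> l1 -> l2 -> l3 -> l4 -> l5.
        Map<Integer,List<Integer>> e = Map.of(
            0, List.of(1), 1, List.of(2), 2, List.of(3),
            3, List.of(4), 4, List.of(5));
        System.out.println(traces(e, 0, 5));     // [[0, 1, 2, 3, 4, 5]]
    }
}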


1:  /* backtrack := ∅, Aα := Πα, s := s0, trace := ⟨s0⟩ */
    procedure main()
2:    while ⟨s, Πα, trace⟩ ≠ null do
3:      ⟨s, Πα, trace⟩ := guided symbolic execution(s, Πα, trace)

4:  procedure guided symbolic execution(s, Πα, trace)
5:    while ¬(end state(s) or depth bound(s) or time bound()) do
6:      if goal state(s) then
7:        print trace; exit
8:      ⟨s′, Ss⟩ := get ranked successors(s, Πα)
9:      for each s_other ∈ Ss do
10:       backtrack := backtrack ∪ {⟨s_other, Πα, trace ◦ s_other⟩}
11:     if ∃ πi ∈ Πα, head(πi) == getCurrentLoc(s) then
12:       lα := head(πi)            /* First element in the trace */
13:       l′α := head(tail(πi))     /* Second element in the trace */
14:       if branch(lα) ∧ (l′α ≠ getCurrentLoc(s′)) then
15:         return ⟨s0, Aα := refine trace(Aα, πi), ⟨s0⟩⟩
16:       remove(πi, lα)            /* This updates the πi reference in Πα */
17:     s := s′; trace := trace ◦ s′
18:   return ⟨s′, Πα, trace⟩ ∈ backtrack

Fig. 4. Guided symbolic execution pseudocode

5 Guided Symbolic Execution

We guide a symbolic program execution along an abstract trace set, Πα := {π0, π1, ..., π_{t−1}}, to construct a corresponding feasible execution path, Πs := ⟨s0, s1, ..., sn⟩. For an abstract trace set, the guided symbolic execution tries to generate a feasible execution path that contains program states where the program location of the most recently executed thread in the state matches a location in the abstract trace. The total number of locations in the abstract trace set is m := Σ_{πi ∈ Πα} length(πi), where the length function returns the number of locations in the abstract trace πi. In our experience, the value of m is a lot smaller than n (m ≪ n).

Subsumer-First: Steering Symbolic Reachability Analysis


… (> 1500m) or RESOURCE-ERROR due to memory requirements.

Table 1. (continued)
(Each of time, # queries, and # states is shown as: without / with -sub-first (reduction in %); the headings of the three unlabeled columns appear with the first part of the table.)

Benchmark       time               # queries              # states                                 speedup
timing          29m / 29m  (0%)    0.4M  / 0.39M  (2.5%)  3425 / 3378  (1.4%)    47   99093  4954  1.0
gasburner       17m / 9m   (47%)   3.5M  / 1.7M   (51.4%) 3309 / 1791  (45.9%)   19    3124   152  1.89
odometryls1lb   12m / 4m   (66.7%) 0.8M  / 0.3M   (62.5%) 1439 /  632  (56%)     16    6127   214  3.0
rtalltcs         4m / 2m   (50%)   2.5M  / 1M     (60%)   1789 /  796  (55.5%)   20   18757   122  2.0
odometrys1lb     2m / 1m   (50%)   0.2M  / 0.1M   (50%)    681 /  425  (37.6%)   15    3337   150  2.0
triple2          2m / 2m   (0%)    0.77M / 0.70M  (9%)     610 /  520  (14.8%)    3       8     3  1.0
odometry         1m / 40s  (33.3%) 0.14M / 0.09M  (35.7%)  246 /  193  (21.5%)   15     437    28  1.5
bakery3         11s / 10s  (9%)    0.25M / 0.19M  (24%)   1311 /  986  (24.8%)    9      31     3  1.1
model-test01    32s / 29s  (9.4%)  0.18M / 0.18M  (0%)    1578 / 1565  (0.8%)    16     110    36  1.1
model-test07    39s / 37s  (5.1%)  0.25M / 0.24M  (4%)    1998 / 1902  (4.8%)    16     124    40  1.05
model-test13     2m / 2m   (0%)    0.9M  / 0.75M  (16.7%) 5766 / 4791  (16.9%)   16     110    36  1.0
model-test19     3m / 2m   (33.3%) 0.9M  / 0.6M   (33.3%) 5256 / 3499  (33.4%)   16     124    40  1.5


Table 2. Experiments with full ARMC abstraction-refinement iterations
(Each of time, # queries, # iter, # preds, and # states is shown as: without / with -sub-first (reduction in %).)

Benchmark       time                 # queries               # iter   # preds    # states               speedup
odometry        109m / 8m   (92.7%)  9.3M  / 1.6M   (82.8%)  65 / 37  218 / 153   680 / 295   (56.6%)   13.6
odometryls1lb    60m / 29m  (51.7%)  7.1M  / 3M     (57.7%)  32 / 29   97 / 102  1439 / 539   (62.5%)   2.07
triple2          13m / 2m   (84.6%)  6.5M  / 2.1M   (67.7%)  65 / 45  254 / 219   519 / 248   (52.2%)   6.50
odometrys1lb      9m / 9m   (0%)     1.1M  / 1.0M   (9%)     20 / 22   72 / 83    681 / 345   (49.3%)   1.0
odometrys1ub    195m / 329m (-68.7%) 14.4M / 11.4M  (20.8%)  37 / 33  157 / 257  2073 / 2379  (-14.8%)  0.59
gasburner       175m / 93m  (46.9%)  48.9M / 17.3M  (64.6%)  64 / 61  198 / 220  3309 / 1604  (51.5%)   1.88
timing           51m / 49m  (3.9%)   1M    / 1M     (0%)     14 / 14   14 / 14   3425 / 3378  (1.4%)    1.04
rtalltcs         38m / 37m  (2.6%)   27M   / 25.3M  (6.3%)   30 / 40   56 / 74   1789 / 1258  (29.7%)   1.03
bakery3           2m / 32s  (73.3%)  2.6M  / 0.9M   (65.4%)  34 / 36   67 / 58   1419 / 885   (37.6%)   3.75
model-test01      4m / 2m   (50%)    1.7M  / 1.5M   (11.8%)  58 / 54  115 / 100  1207 / 1565  (-29.7%)  2.0
model-test07      5m / 3m   (40%)    2.4M  / 2.2M   (8.3%)   58 / 56  115 / 104  1372 / 1902  (-38.6%)  1.67
model-test13     17m / 9m   (47%)    6.6M  / 5.2M   (21.2%)  63 / 61  140 / 136  4708 / 4791  (-1.8%)   1.89
model-test19     19m / 23m  (-21%)   7.7M  / 9.2M   (-19.5%) 62 / 59  137 / 135  5256 / 8135  (-54.8%)  0.83

Identifying Modeling Errors in Signatures by Model Checking

Sebastian Schmerl, Michael Vogel, and Hartmut König

Brandenburg University of Technology Cottbus, Computer Science Department,
P.O. Box 10 13 44, 03013 Cottbus, Germany
{sbs,mv,koenig}@informatik.tu-cottbus.de

Abstract. Most intrusion detection systems deployed today apply misuse detection as their analysis method. Misuse detection searches for attack traces in the recorded audit data using predefined patterns; the matching rules are called signatures. The definition of signatures is up to now an empirical process based on expert knowledge and experience. The success of the analysis, and accordingly the acceptance of intrusion detection systems in general, depends essentially on how up to date the deployed signatures are. Methods for a systematic development of signatures have scarcely been reported yet, so the modeling of a new signature is a time-consuming, cumbersome, and error-prone process. The modeled signatures have to be validated and corrected to improve their quality. So far only signature testing is applied for this purpose. Signature testing is still a rather empirical and time-consuming process for detecting modeling errors. In this paper we present the first approach for verifying signature specifications using the SPIN model checker. The signatures are modeled in the specification language EDL, which is based on colored Petri nets. We show how a signature specification is transformed into a PROMELA model and how characteristic specification errors can be found by SPIN.

Keywords: Computer Security, Intrusion Detection, Misuse Detection, Attack Signatures, Signature Verification, PROMELA, SPIN model checker.

1 Motivation

The increasing dependence of human society on information technology (IT) systems requires appropriate measures to cope with their misuse. The growing technological complexity of IT systems increases the range of threats endangering them. Besides traditional preventive security measures, such as encryption, authentication, and access control mechanisms, reactive approaches are more and more applied to counter these threats. Reactive approaches allow responses and counter-measures to security violations to prevent further damage. Intrusion detection systems (IDSs) have proved to be one of the most important means for protecting IT systems. A wide range of commercial intrusion detection products is available, especially for misuse detection. Intrusion detection is based on the monitoring of IT systems to detect security violations. The decision which activities are considered as security violations in a given


context is defined by the security policy in use. Two main complementary approaches are applied: anomaly and misuse detection. Anomaly detection aims at the exposure of abnormal user behavior. It requires a comprehensive set of data describing the normal user behavior. Although much research has been done in this area, it still has limited practical importance because it is difficult to provide appropriate profile data. Misuse detection focuses on the (automated) detection of known attacks described by patterns which are used to identify an attack in an audit data stream. The matching rules are called signatures. Misuse detection is applied by the majority of the systems used in practice.

The detection power of misuse detection, though, is still limited. First of all, many intrusion detection systems are dedicated to the detection of simply structured network attacks, often still in a post-mortem mode. These are simple single-step attacks, and the detection process is mainly a pattern matching process. Sophisticated multi-step or even distributed attacks, which are applied to intrude into dedicated computer systems, are not covered. Such attacks are becoming increasingly important, especially in host-based intrusion detection.

The crucial factors for high detection rates in misuse detection are the accuracy and the topicality of the signatures used in the analysis process. Imprecise signatures strongly confine the detection capability and cause false positives or false negatives. The former trigger undesired false alarms, while the latter represent undetected attacks. This lack of reliability, together with high false alarm rates, has called into question the efficiency of intrusion detection systems in practice [8]. The reasons for this detection inaccuracy lie in the signature derivation process itself rather than in the quality of the monitored audit data. Signatures are derived from an exploit, i.e., the program that executes the attack. The latter represents a sequence of actions that exploit security vulnerabilities in an application, an operating system, or a network. The signatures, in contrast, describe rules for how traces of these actions can be found in an audit or network data stream. In practice, signatures are derived empirically based on expert knowledge and experience. Methods for a systematic derivation have scarcely been reported yet, and the known approaches for signature generation, e.g. [13] or [12], are limited to very specific preconditions, programs, operating systems, vulnerabilities, or attacks. Automated approaches for reusing design and modeling decisions of available signatures do not exist yet. Therefore, new signatures are still derived manually. The modeling process is time consuming, and the resulting signatures often contain errors. In order to identify these errors the newly modeled signatures have to be validated, corrected, and improved iteratively. This process can take several months until the corrected signature is found. As long as this process is not finished, the affected systems are vulnerable to the related attack because the intrusion detection system cannot protect them. This period is therefore also called the vulnerability window.

Although signatures are not derived systematically, they are usually described in a formal way, e.g. as finite state machines. Examples of such signature description languages are STATL [5], [3], Bro [6], IDIOT [7], and EDL [2], which define a strict semantics for the signatures. These languages, though, are mostly tied to a concrete intrusion detection system. Astonishingly, they have not been used for the verification of signatures. The main validation method for signatures in practice is testing, which proves with the help of an intrusion detection system whether the derived signature is capable of exactly detecting the related attack in an audit trail. For


this, the signatures are applied to various audit trails. To test the different features of a signature, test cases are derived which modify the signature to reveal detection deficits. Signature testing is a heuristic process; there exists no methodology as in protocol testing, and only some first approaches have been reported [4]. Signature testing is a time-consuming and costly process which requires manual steps to derive test cases and to evaluate the test outcome. Testing is, however, not the right process to identify errors in signature modeling. Many of these errors may already be found by verifying the modeled signature. The objective of a signature verification stage should be to prove whether the modeled signature is actually capable of detecting the attack in an audit trail, and to ensure that the signature is not in conflict with itself. Typical errors in signature modeling are mutually exclusive constraints, tautologies, or constraints which will never be checked.

In this paper we present the first approach for the verification of signatures. It aims at the verification of multi-step signatures which are described in EDL [2] to demonstrate the principle. EDL supports the specification of complex multi-step attacks, possesses a high expressiveness [1], and nevertheless allows efficient analysis. For the verification, we use the model checker SPIN [11]. We chose SPIN because it supports large state spaces, provides good tool performance, and is well documented. The transformation of EDL signature specifications into PROMELA is the kernel of this approach. We provide rules for how this transformation has to be performed. The verification proves the absence of typical specification errors. These are formulated as linear temporal logic (LTL) conditions which are generated depending on the concrete signature under verification.

The remainder of the paper is structured as follows. In Section 2 we consider the signature derivation process. We shortly introduce the signature modeling language EDL and outline the reasons for specification errors when modeling signatures. Section 3 describes the semantically equivalent transformation of EDL into PROMELA. We further show that the translation into PROMELA, which has a well defined semantics, is another way to give a formal semantics to a signature model. In Section 4 we present the verification procedure and show how typical specification errors can be detected. Thereafter, in Section 5, we give an overview of a concrete evaluation example. Some final remarks conclude the paper.

2 On the Modeling of Complex Signatures

An attack consists of a sequence of related security-relevant actions or events which must be executed in the attacked system. This may be, for example, a sequence of system calls or network packets. These sequences usually consist of several events which form complex patterns. A signature of an attack describes criteria (patterns) which must be fulfilled to identify the manifestation of an attack in an audit trail. All the relations and constraints between the attack events must be modeled in the signature. A signature description which correlates several events can readily possess more than 30 constraints. This leads to very complex signatures. In addition, it is possible that several attacks of the same type are executed simultaneously and proceed independently, so that different instances of an attack have to be distinguished as well. This fact raises the complexity of the analysis.


To our knowledge there have been no approaches to identify specification errors in signatures by verification. This is remarkable due to the fact that most signature languages have an underlying strict semantic model (e.g. STATL [5], Bro [6], EDL [2]). The approach demonstrated here uses the signature description language EDL (Event Description Language) as an example of a signature modeling language. EDL is based on colored Petri nets and supports a more detailed modeling of signatures compared to other modeling languages. In particular, it allows a detailed modeling of the constraints which must be fulfilled in transitions for attack progress. The definition of EDL is given in [2]; the semantic model is described in [1]. Before describing the transformation of EDL signatures into PROMELA models, we outline the essential features of EDL.

2.1 Modeling Signatures with EDL

The description of signatures in EDL consists of places and transitions which are connected by directed edges. Places represent states of the system which the related attack has to traverse. Transitions represent state changes which are triggered by events, e.g. security-relevant actions. These events are contained in the audit data stream recorded during attack execution. The progress of an attack, which corresponds to a signature execution, is represented by a token which flows from state to state. A token can be labeled with features as in colored Petri nets. The values of these features are assigned when the token passes the places. Several tokens can exist simultaneously. They represent different signature instances.

Fig. 1. Places and features

Places describe the relevant system states of an attack. They are characterized by a set of features and a place type. Features specify the properties located in a place and their types. The values of these properties are assigned to the token. The information contained in a token can change from place to place. EDL distinguishes four place types: initial, interior, escape, and exit places. Initial places are the starting places of a signature (and thus of the attack). They are marked with an initial token at the beginning of the analysis. Each signature has exactly one exit place, which describes the final place of the signature. If a token reaches this place the signature has identified a manifestation of an attack in the audit data stream, i.e., the attack has been performed successfully. Escape places stop the analysis of an attack instance because events have occurred which make the completion of the attack impossible, i.e., the observed behavior represents normal, allowed behavior but not an attack. Tokens which reach


these places are discarded. All other places are interior places. Figure 1 shows a simple signature with four places P1 to P4 for illustration.

Transitions represent events which trigger state changes of signature instances. A transition is characterized by input places, output places, an event type, conditions, feature mappings, a consumption mode, and actions. Input places of a transition T are places with an edge leading to T. They describe the required state of the system before the transition can fire. Output places of a transition T are places with an incoming edge from T. They characterize the system state after the transition has fired. A change between two system states requires a security-relevant event. Therefore each transition is associated with an event type. The firing of a transition can further depend on additional conditions which specify relations over certain features of the event (e.g. user name) and their assigned values (e.g. root). Conditions can require distinct relationships between events and token features on input places (e.g. same values). If a transition fires, tokens are created on the output places of the transition. They describe the new system state. To bind values to the features of the new tokens, the transitions contain feature mappings. These are bindings which can be parameterized with constants, references to event features, or references to input place features. The consumption mode (cf. [1]) of a transition controls whether tokens that activated the transition remain on the input places after the transition has fired. This mode can be defined individually for each input place and can be considered a property of the connecting edge between input place and transition (indicated by "–" or "+"). Only in the consuming case are the tokens which activated the transition deleted from the input places. Figure 2 illustrates the properties of a transition. The transition T1 contains two conditions. The first condition requires that feature Type of event E contains the value FileCreate. The second condition compares feature UserID of input place P1, referenced by "P1.UserID", and feature EUserID of event type E, referenced by "EUserID". This condition demands that the value of feature UserID of tokens on input place P1 is equal to the value of event feature EUserID. Transition T1 contains two feature mappings. The first one binds the feature UserID of the new token on the output place P2 to the value of the homonymous feature of the transition-activating token on place P1. The second one maps the feature Name of the new token on place P2 to the event feature EName of the transition-triggering event of type E.

Firing Rule. In contrast to Petri nets, in EDL all transitions are triggered in a deterministic, conflict-free manner. First, all transitions are evaluated to determine the active transitions, for which all conditions for firing are fulfilled. The active transitions are

Fig. 2. Transition properties. Conditions of T1: Type == FileCreate; P1.UserID == EUserID. Feature mappings: P2.UserID := P1.UserID; P2.Name := EName.


Fig. 3. Conflict situations: for two example signatures, the markings before and after the occurrence of an event of type E

triggered at the same time, so there is no token conflict. Figure 3 illustrates the triggering rule with two example signatures. The left side of the figure shows the marking before an event of type E occurs, and the right side shows the signature state after firing. None of the depicted transitions has an additional transition condition.

Even though EDL is a very intuitive signature language, newly modeled signatures frequently contain specification errors, such as unreachable places or transition conditions which can never be fulfilled (mutually exclusive conditions). Other possible errors are wrong or missing token flows from the initial place to the exit place, or escape places that should stop an attack tracking but are unreachable due to faulty feature mappings. Such specification errors lead to inaccurate signatures causing false positives or false negatives, respectively. Many of these inaccuracies can hardly or not at all be identified by static signature analyses. Therefore we decided to apply model checking to detect these types of errors.

3 Transformation of EDL Signatures into PROMELA Specifications

Before verifying EDL signatures with SPIN, we have to transform them into semantically equivalent PROMELA specifications. The transformation rules needed for this are described in this section. The challenge of this transformation consists in an appropriate abstraction of the EDL signatures to achieve a finite state space. A one-to-one transformation would result in an infinite state space, since the number of tokens in EDL is not limited, because each ongoing attack has to be pursued; in addition, the value ranges of features are unbounded as well. Therefore the EDL signatures have to be abstracted, but the essential semantics of the signatures must be preserved. Only then is it possible to detect errors in their specification while keeping the state space finite.

3.1 Overview

The basic elements of PROMELA are processes, message channels, and variables. Processes specify concurrent instances in the PROMELA language. The execution of a process can be interrupted by another process at any time, unless the process execution mode is marked as atomic. Processes allow the specification of sequentially executed statements, the iterative execution of a block of instructions (do-loop), and the conditional execution of instructions. Processes communicate with each other via global variables or message channels. Messages can be read from and written to channels. A channel operates like a queue. A message comprises a tuple of basic PROMELA data types.
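As a minimal, self-contained illustration of these three elements (the channel, process, and message names below are our own and not part of the transformation described in the following sections):

    chan ch = [2] of { int, bool };     /* buffered channel of (int, bool) tuples */

    active proctype producer() {
        ch!42, true                     /* append a message to the channel queue */
    }

    active proctype consumer() {
        int v; bool b;
        ch?v, b;                        /* blocks until a message is available */
        printf("received %d\n", v)
    }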


The transformation of EDL signatures into PROMELA models is transition driven. All EDL transitions are mapped to PROMELA processes which model the semantic behaviour of the transitions. EDL places are mapped to message channels, and tokens to messages correspondingly. A process P of a transition T evaluates messages from the message channels that the input places of T are mapped onto. If all required conditions for firing the transition T are fulfilled, process P generates new messages and writes them into the message channels that the output places of T are mapped onto. In the following, the transformation of EDL signatures to PROMELA models is described in detail: first the transformation of places and transitions, then the realisation of the triggering rule in PROMELA.

3.2 Transformation of EDL Places

The conversion to a PROMELA model starts with the definition of channels for all EDL places. Each channel stores the tokens of the corresponding EDL place as messages. In EDL the tokens on a place describe the state of an attack on the observed system. Therefore, a place defines a set of features that describe the system state. The definition of a feature consists of (a) the feature's type (bool, string, number, float, …), i.e., the range of values tokens may take at this place, and (b) the feature identifier used to refer to the feature in transition conditions. Thus the set of feature definitions of a place can be written as a tuple. This tuple can be directly adopted in the PROMELA model by defining the message type of the corresponding channel. Only slight changes to the feature type definitions are needed, except for EDL strings, which are mapped to fixed-size byte arrays. The size of these arrays (MAXSTRING) and the maximum message capacity of a channel (CHANNELSIZE) are set according to the EDL signature (see Section 4.1).

Table 1. Transformation of EDL places into PROMELA channels

EDL:

    link_no_prefix {
        TYPE INTERIOR
        FEATURES STRING mLinkName, STRING mScriptName, INT mScriptOwner
    }
    exit_place {
        TYPE EXIT
        FEATURES STRING mScriptName, INT mExecutorID
    }

PROMELA:

    typedef F_LinkNoPrefix {
        byte mLinkName[MAXSTRING];
        byte mScriptName[MAXSTRING];
        int mScriptOwner;
    };
    chan LinkNoPrefix = [CHANNELSIZE] of { F_LinkNoPrefix };

    typedef F_ExitPlace {
        byte mScriptName[MAXSTRING];
        int mExecutorID;
    };
    chan ExitPlace = [CHANNELSIZE] of { F_ExitPlace };

As an example, Table 1 shows the conversion of the two EDL places "link_no_prefix" and "exit_place" to PROMELA. The different place types (initial, interior, escape, and exit places) remain unconsidered during this initial transformation because the different semantics of the place types are accounted for by the implementation of the firing rule.
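The two size constants, together with the auxiliary …Insert/…Delete channels used by the firing rule (Sections 3.3 and 3.4), might then be declared as follows for the typedef from Table 1; the concrete constant values here are illustrative assumptions, not values prescribed by the transformation:

    #define MAXSTRING   16   /* longest string over all event representatives */
    #define CHANNELSIZE 1    /* maximum number of tokens per place, cf. Section 4.1 */

    /* a place channel together with its two auxiliary channels */
    chan LinkNoPrefix       = [CHANNELSIZE] of { F_LinkNoPrefix };
    chan LinkNoPrefixInsert = [CHANNELSIZE] of { F_LinkNoPrefix };
    chan LinkNoPrefixDelete = [CHANNELSIZE] of { F_LinkNoPrefix };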


3.3 Transformation of EDL Transitions

The topology of the transitions and places in an EDL signature defines the temporal order of the occurrence of events during an attack. A single transition specifies the type of the event that triggers the transition, additional conditions on this event, and conditions on the tokens (values of token features) on the input places of the transition (see Table 2a). The evaluation of a transition T begins with checking whether the type of an incoming event X corresponds to the event type associated with the transition. Furthermore, the transition conditions have to be evaluated in relation to event X for all combinations of tokens from the input places of the transition. Here, a token combination is an n-tuple, where n is the number of input places of T and each element ti of the tuple represents a token on place Pi. If all transition conditions are fulfilled for a token combination, the transition fires, creating new tokens with new feature values on the output places.

Table 2a/b. EDL transition and the related PROMELA process

a) EDL syntax

    01 LinkWithPrefix(-) ExitPlace
    02 {
    03   TYPE SolarisAuditEvent
    04   CONDITIONS
    05     (eventnr==7) OR (eventnr==23),
    06     LinkWithPrefix.mLinkName==RNAME,
    07     euid != audit_id
    08
    09   MAPPINGS
    10     [ExitPlace].mScriptName = LinkWithPrefix.mScriptName,
    11     [ExitPlace].mExcecutorID = euid
    12   ACTIONS
    13     ...
    14 }

b) PROMELA syntax

    01 proctype LinkWithPrefix_ExitPlace(){
    02  atomic{
    03   if
    04   :: (gEventType==SolarisAuditEvent) ->
    05      F_LinkWithPrefix lF_LWP;
    06      int lNrOfToken = len(LinkWithPrefix);
    07
    08
    09      do
    10      :: (lNrOfToken > 0) ->
    11
    12         LinkWithPrefix?lF_LWP;
    13         /* checking Conditions */
    14         if
    15         :: ((gEvent.eventnr == 7) ||
    16             (gEvent.eventnr == 23)) &&
    17            (STRCMP(lF_LWP.mLinkName, gEvent.rname)) &&
    18            (gEvent.euid != gEvent.audit_id) ->
    19            /* Create a new Message */
    20            F_ExitPlace lF_EP;
    21            /* setting values of the new message */
    22            STRCPY(lF_EP.mScriptName, lF_LWP.mScriptName);
    23            lF_EP.mExcecutorID = gEvent.euid;
    24            /* Transition is Consuming, */
    25            /* so mark the message for delete */
    26
    27            LinkWithPrefixDelete!lF_LWP;
    28            /* save new message to insert it later */
    29            ExitPlaceInsert!lF_EP;
    30            lNrOfToken--;
    31         :: else /* do nothing */
    32         fi;
    33
    34         /* put back current message to Channel */
    35
    36         LinkWithPrefix!lF_LWP;
    37      :: else -> break;
    38      od;
    39
    40   :: else -> skip;
    41   fi;
    42  }
    43 }

The transformation of EDL transitions into the PROMELA model starts with the definition of separate process types for all transitions. A process of a specific type assumes that the incoming event is stored in the global variable gEvent and its type in the variable gEventType. The process interprets the messages on channels representing the input places of the transition (in the following denoted as input place


channels) as tokens. The process starts the event evaluation by checking the type of the current event. In the example of Table 2a the EDL transition "LinkWithPrefix(-) ExitPlace" requires the occurrence of a "SolarisAuditEvent" (line 3). Therefore the respective PROMELA process "LinkWithPrefix_ExitPlace" in Table 2b checks the event type in the if-condition in line 4. If the condition is fulfilled the process iterates over all messages on all its input place channels. In the example this is implemented by the do-loop in lines 9-39. Within it, one message from the input place channel is read out (line 12), evaluated (lines 14-33), and finally written back to the channel (line 36). The channels implement a FIFO buffer behavior. If messages from more than one input place channel have to be evaluated, the iterations over all message combinations are performed by additional nested do-loops. The evaluation of the transition conditions is mapped directly onto PROMELA if-conditions (lines 14-18). Here, the currently considered combination of messages from the input place channels (only lF_LWP in the example, because there is a single input place channel) is checked in relation to the current event (gEvent) to determine whether all conditions are fulfilled. If they are, and if the EDL transition is consuming, the messages which fulfill the conditions are placed (line 27) for later removal in an auxiliary channel (LinkWithPrefixDelete) corresponding to the input place channel. Further, new messages are created for all output place channels (line 20), the feature values of these messages are set (lines 22-23), and the messages are written (line 29) into an additional auxiliary channel (ExitPlaceInsert) for later insertion into the output place channel. The process terminates after all messages have been evaluated in relation to the current event (gEvent). Afterwards, the auxiliary channels (…Insert) of the transition's output places hold the newly created messages, and the auxiliary channels (…Delete) of its input places hold the messages to be removed.

3.4 Implementation of the Deterministic Triggering Rule of EDL

EDL applies a deterministic firing rule which is conflict free, as described in Section 2.1. The implementation of this rule is given implicitly in the PROMELA model. To guarantee a conflict-free event evaluation, every event is evaluated by applying the following four steps. (1) The current event is read out; depending on the event, its values are written to the variable structure gEvent and the event type is stored in gEventType. After that, (2) a process instance is created for every EDL transition, corresponding to the process type of the respective transition as described in Section 3.3. These processes sequentially check the conditions of the EDL transitions in relation to the messages in the input place channels and the current event. The execution order of the processes is non-deterministic, but the processes are executed atomically (see the atomic statement in line 2 of Table 2b), i.e., the processes cannot interrupt each other. This fact, together with the implementation principle of first storing newly created messages and messages to be deleted in auxiliary channels, ensures a conflict-free firing rule. Thus every process leaves the input data (messages) unchanged after analyzing them. When all transition processes have terminated, (3) all messages to be removed are deleted from the channels, and finally (4) all new messages are inserted. This is done by iterating over the messages in the auxiliary channels (…_Delete and …_Insert).
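Steps (3) and (4) can be sketched for a single place channel as follows; the sketch uses plain int messages and our own names (p, p_delete, p_insert, flush_place) rather than the generated ones:

    chan p        = [2] of { int };
    chan p_delete = [2] of { int };
    chan p_insert = [2] of { int };

    inline flush_place() {
        int m;
        do                          /* step 3: apply all pending deletions */
        :: p_delete?m ->
             if
             :: p??eval(m) -> skip  /* remove one message matching m from p */
             :: else -> skip
             fi
        :: empty(p_delete) -> break
        od;
        do                          /* step 4: append the newly created messages */
        :: p_insert?m -> p!m
        :: empty(p_insert) -> break
        od
    }

In the full model such a flush is performed for every place channel after all transition processes of the current evaluation cycle have terminated.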


Fig. 4. Transformation of an EDL signature into PROMELA: (a) EDL – initial state; (b) PROMELA – after step 2; (c) PROMELA – after step 4

Figure 4(a) shows an example of an EDL signature fragment. None of the depicted transitions possesses additional transition conditions. The depicted marking triggers the transitions T1, T2 and T3 when the associated event E occurs. Figure 4(b) shows the corresponding PROMELA model after all processes have evaluated event E (completed step 2). The arrows indicate message readings and writings. For simplicity we assume that process T1 is evaluated first; it reads and evaluates all messages from input channel p1 message by message. A read message is always written back to the channel (dotted lines) to leave the set of messages in the channel unchanged for the processes T2 and T3. Process T1 inserts a new message into the p2_insert channel (solid lines) for each message from p1 and, because of its consuming mode, a copy of the read message from p1 into p1_delete (solid lines). After all other processes have been evaluated in the same manner, the messages in the delete channels (p1_delete, p2_delete, p3_delete) are removed from the channels p1, p2 and p3 in step 3. Finally, in step 4, the new messages from the channels p1_insert, p2_insert and p3_insert are transferred to the respective channels p1, p2, and p3. Figure 4(c) shows the situation after evaluating event E. The use of the auxiliary insert channels prevents T2, in this example, from evaluating the new messages generated by T1 for the same event E. The …_delete channels ensure the conflict-free message processing of T1 and T3.

4 Signature Verification

In this section we show how SPIN is used to identify typical specification errors in signatures. First we describe how we deal with the events triggering transitions: the decomposition of events into equivalence classes, the generation of new events, and the determination of the smallest possible channel size and string length to minimize the state space. Next we explain a number of typical specification errors and formulate LTL conditions to prove their absence in the EDL model.

4.1 Setting Up the Verification

To verify signature properties we have to analyze the behavior of the signature in relation to the occurring events, because they affect the firing of the transitions, the


creation of new tokens, and their deletion. An exhaustive verification of a signature specification considering all possible events is not feasible because of the huge, or even infinite, number of potential events. Typically events possess about 15 different features, such as timestamp, user ID, group ID, process ID, resource names, and so forth. In order to limit the state space, all possible events are divided into equivalence classes. The partition is determined by the transition conditions of the signature. We divide all transition conditions into atomic conditions. An equivalence class is set up for each combination of fulfilled and unfulfilled atomic conditions. For each class a representative is chosen which is used for the verification of the signature properties. We can limit the verification to the representatives without loss of generality, since all events of an equivalence class influence the signature in exactly the same way.

The determination of the equivalence classes is accomplished by splitting complex transition conditions into atomic conditions at their AND- and OR-expressions. Then we use a constraint solver to determine a concrete event for each class. For this, each atomic condition is entered into the constraint solver in negated and non-negated form; the solver then evaluates the constraints and calculates concrete values for all features of an event class. Thus an automatic generation of all representatives is feasible. If there is no solution for a combination of negated and non-negated conditions, then the conditions mutually exclude each other, and such a class is excluded from verification. For practical reasons we do not use a Boolean abstraction of the representative values: even though a Boolean abstraction would be more efficient to verify, a signature engineer can pinpoint errors or problems in the signature more easily using verification counter-examples with representative values.

In order to verify the demanded signature properties we analyze the signature with all generated event representatives. For this, (1) an equivalence class is selected nondeterministically and its representative becomes the new occurring event. Next, (2) the corresponding PROMELA process is started for each EDL transition. These processes analyze the currently selected representative event in relation to the messages on the input channels; if needed, new messages are created for later insertion or deletion. After that, (3) the message channels, the insert channels, and the delete channels are updated in the manner described in Section 3.4. SPIN generates the full state space by successively applying these three steps. The state space contains, e.g., all produced messages, all executed instructions, and all variable settings, and thus the complete behavior of the signature. Based on this state space we can verify signature properties and identify signature specification errors.

The size of the state space is the crucial factor for the usability of the described approach. If the state space is too large, the model checker needs too many resources (computing time, memory) for verification. The size of the state space is not only determined by the number of equivalence classes, but also by the number of messages in the channels. This is why CHANNELSIZE (the maximum number of messages in a channel, cf. Section 3.2) should be minimized.
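Step (1) maps naturally onto a nondeterministic if statement with one option per generated representative. The following sketch illustrates the idea; the event fields and the concrete values are invented for illustration and are not the representatives generated for the example in Section 5:

    typedef Event {
        int eventnr;
        int euid;
        int audit_id
    };
    Event gEvent;
    mtype = { SolarisAuditEvent };
    mtype gEventType;

    inline select_event() {
        gEventType = SolarisAuditEvent;
        if  /* one option per equivalence class representative */
        :: gEvent.eventnr = 7;  gEvent.euid = 100; gEvent.audit_id = 100
        :: gEvent.eventnr = 7;  gEvent.euid = 100; gEvent.audit_id = 0
        :: gEvent.eventnr = 23; gEvent.euid = 100; gEvent.audit_id = 0
        :: gEvent.eventnr = 1;  gEvent.euid = 0;   gEvent.audit_id = 0
        fi
    }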
Without loss of generality, the unbounded number of tokens on a place P can be limited to the maximum number n of incoming edges of a transition T from the same place P. In most cases n is one, unless the signature has multiple edges (also called parallel edges) between an input place P and a transition T. Only in such cases does the transition correlate several tokens on a single place. However, such topologies are very unusual for signatures. In most cases a


transition process correlates only one message per channel. Since the complete state space generated by SPIN already covers all possible token combinations, we can limit the number of messages in the channels in this way. More messages per channel would only lead to states which represent combinations of existing states.

If strings are used in an EDL signature, then the length MAXSTRING (cf. Section 3.2) of the corresponding PROMELA byte arrays must be specified. PROMELA does not allow dynamic memory allocation; therefore we must estimate the required array length beforehand. The defined byte array length does not affect the number of states calculated by SPIN, but it does influence the size of the state vectors in the state space. A state vector contains information on the global variables, the contents of each channel, and the process counter and local variables of each process. Consequently, the string size (MAXSTRING) should be specified as low as possible. It is first and foremost determined by the largest string (max_str_event) of all event class representatives. If the signature does not apply string concatenation, we can automatically estimate MAXSTRING by max_str_event as an upper bound. If string concatenation is used in an EDL signature without cycles, then we can limit MAXSTRING to max_str_event*2. In the rare case of a string concatenation in a cycle, MAXSTRING must be defined sufficiently large by estimating the number of expected cycles.

4.2 Signature Properties

Now we consider the properties which have to be fulfilled by each signature. If one of these properties is violated, a specification error is indicated. The properties to be fulfilled are specified as LTL formulas and verified by means of the model checker SPIN.

Tracking new attack instances: The signature must always be able to track newly starting attack instances. An attack instance denotes an independent and distinct attack. New attack instances can start with any new event and have to be tracked simultaneously. With each newly occurring event, a token must be located on each initial place of the signature to ensure simultaneous attack tracking. If a channel CI represents an initial place I of a signature, then CI must contain at least one message each time the processes Ti representing the transitions are started. This behavior can be expressed in LTL as follows: ◊p ⇒ (a U p), with a = (len(CI) > 0), where len(CI) is the number of messages in channel CI, and p ≡ true iff a process Ti is running.

Unreachable system states: The places in a signature model represent relevant system states during an attack. If a token is never located on a certain place, then this system state is never reached. Accordingly, the signature possesses a faultily linked topology of places and transitions, or the modeled place is redundant. We specify the property that each place (P1, …, Pn) should contain a token at least once as an LTL condition over the corresponding channels (cP1, …, cPn): ◊tCP1 ∧ ◊tCP2 ∧ … ∧ ◊tCPn, with tCPi = (len(cPi) > 0), where cPi represents the place Pi.

Dead system state changes: Just as system states are modeled by places, changes of system states are specified by transitions. If a transition never fires, the system state change from the input to the output places of the transition is never accomplished. The reasons for never-firing transitions are either wrongly
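In current SPIN versions (6 and later) such properties can be written directly as named ltl blocks over propositions defined as macros. The following is only a sketch; the channel names and the auxiliary counter nrRunning are our own assumptions about the generated model:

    chan cinit_place_1     = [1] of { int };  /* stand-ins for generated place channels */
    chan cinit_place_2     = [1] of { int };
    chan clink_with_prefix = [1] of { int };
    int  nrRunning;                           /* assumed counter of running transition
                                                 processes */

    #define a    (len(cinit_place_1) > 0 && len(cinit_place_2) > 0)
    #define p    (nrRunning > 0)
    #define tCP3 (len(clink_with_prefix) > 0)

    ltl tracking { <>p -> (a U p) }   /* tracking new attack instances */
    ltl reach_P3 { <>tCP3 }           /* the third place is reached at least once */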


specified transition conditions or the lack of tokens on the input places. Lacking tokens can be identified by the "unreachable system states" property. Transitions which never fire because of wrongly specified transition conditions can be identified via unreachable code in the PROMELA process of the transition. If the state space is exhaustively generated and the statements for creating and mapping new messages (e.g. line 20 in Table 2b) in a transition process are never reached, then this transition never fires. The determination of unreachable code is a standard verification option in SPIN.

Twice triggering token event combinations: If two transitions T1 and T2 have the same input place P, and T1 and T2 are triggered by the same event as well as the same token t on input place P, then the signature allows twice triggering token event combinations. This behavior means that a single action/event can transfer a single signature instance into two different system states. The reason for this is either that the transition conditions of T1 and/or T2 are underspecified, or that T1 and T2 model an implicit fork transition. If this behavior is intended, an explicit fork transition TF should be used instead of the two transitions T1 and T2 (cf. Figure 5). Otherwise the transition conditions of T1 and T2 should be refined in such a way that T1 and T2 do not trigger for the same event and the same token t. The usage of implicit fork transitions should be avoided for the following two reasons: (1) the fork behavior cannot be seen directly in the signature topology of places and transitions, and (2) implicit fork transitions need additional conditions for correct behavior. Both issues raise the signature complexity and increase its error-proneness. The behavior of such an implicit fork transition with CHANNELSIZE=1 can be described for a pair of transitions T1, T2 with the same input place P by the following LTL formula: ◊p, with p ≡ true iff T1 and T2 fire in the same event evaluation cycle. In a PROMELA model with more than one message per channel (CHANNELSIZE>1), the processes corresponding to the EDL transitions must be extended so that each message from the input channels which fulfills all process conditions is copied to a further auxiliary channel. If this auxiliary channel contains a message twice after the termination of all processes, then the signature possesses a twice triggering token event combination. The auxiliary channels must be erased before a new event occurs.

Non-completion of signature instances: A token on a place from which it can be transferred neither to an exit nor to an escape place denotes a non-completable signature instance. This corresponds to an attack instance whose abortion or successful completion is not recognizable. Accordingly, we check whether the PROMELA model of a signature contains messages which cannot be transferred to an exit place or escape place channel. This requires a modification of the PROMELA model in such a way that messages reaching the exit or escape place channels are deleted immediately.

Fig. 5. Implicit and explicit fork transitions


A situation fulfills the non-completion of signature instances property when the PROMELA model reaches a state from which the initial state (only initial place channels contain a message) cannot be reached again. In this case, the PROMELA model is referred to as non-reversible. Here reversibility is defined by the LTL formula: □¬q ∨ ◊(q ∧ ◊p), with (1) p ≡ true if all channels representing an initial place contain a single message and all remaining channels are empty, and (2) q ≡ true if an arbitrary channel not representing an initial place contains a message. The search for non-completable instances can be refined if the transfer of all messages to escape channels and the transfer of all messages to exit places are examined separately. The transfer of a signature instance (token) to exit and escape places always has to be possible, because an attack can be aborted in each system state or continued until success. In order to check the transfer to escape places, all exit place channels and all processes representing their incoming transitions have to be removed, whereas for verifying the transfer to exit place channels, the escape place channels and the processes representing their incoming transitions have to be removed. Both cases can be verified with the aforementioned LTL formula. Note that all described LTL properties can be adapted in such a way that an unambiguous identification of a faulty transition or of problematic place/transition structures in the EDL signature is possible. For the sake of brevity, we cannot describe this in detail here.
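The last two properties can likewise be sketched in SPIN's ltl notation; the flags and channel names below are our own illustrative assumptions (the fired flags would be set by the transition processes and reset before each new event):

    bit firedT1, firedT2;                    /* set by the processes for T1 and T2 */
    chan cinit_place_1     = [1] of { int };
    chan clink_with_prefix = [1] of { int };

    #define q (len(clink_with_prefix) > 0)  /* some non-initial place is marked */
    #define p (len(cinit_place_1) == 1 && len(clink_with_prefix) == 0)

    ltl twice      { <>(firedT1 && firedT2) }   /* twice triggering combination */
    ltl reversible { ([] !q) || <>(q && <>p) }  /* non-completion check */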

5 Example

In order to demonstrate the suitability of our verification approach we implemented the transformation rules introduced in Section 3 in a prototype transformer. This transformer reads an arbitrary EDL signature and generates the semantically equivalent PROMELA model. Besides, the transformer determines the equivalence classes for the occurring events by splitting the complex transition conditions into atomic conditions. After that, it generates the representatives for each equivalence class. For this, we use the finite domain solver [9] from the GNU PROLOG package [10]. We handle equality comparisons of two event features f1, f2 in an equivalence class by replacing each usage of f2 (resp. f1) with f1 (resp. f2). All other conditions are mapped to constraints between the features of the representative. Thereby the transformer automatically recognizes classes with mutually exclusive conditions. Only the EDL regular string comparison condition must be handled manually. After determining the representatives, the prototype generates the LTL formulas for the signature properties to be checked, adapting the properties described in Section 4.2 to concrete channel and process names. Finally, SPIN is automatically started with the generated model and the LTL conditions to be verified.

In the following we give an example of our verification approach using a typical signature for detecting a shell-link attack in a Unix system. The shell-link attack exploits a special shell feature and the SUID (Set-User-ID) mechanism. If a link to a shell script is created and the link name starts with "-", then it is possible to create an interactive shell by calling the link. In old shell versions regular users could create an appropriate link which points to a SUID shell script and produce an interactive shell

which runs with the privileges of the shell-script owner (maybe the root user).

Fig. 6. Simplified EDL signature of the shell-link attack

Figure 6 depicts the respective EDL signature, consisting of 15 transitions with 3-6 conditions per transition. The full textual specification of the signature consists of 356 lines. Our transformer identified 11 different atomic conditions for the shell-link signature; some of them are mutually exclusive. 1920 representatives were generated for the equivalence classes. Further, our tool automatically adapted the signature properties from Section 4.2 to the shell-link-attack signature and generated a set of LTL formulas that the signature should hold (see Table 3).

Table 3. LTL formulas for verifying the shell-link signature

Tracking new attack instances:
    <>p -> (a U p);
    a = (len(cinit_place_1)>0 && len(cinit_place_2)>0);
    p = (isRunning(T1) || isRunning(T2) || ... || isRunning(T14))

Unreachable system states:
    <>tCP1 && <>tCP2 && ... && <>tCPn;
    tCP1 = (len(cinit_place_1)>0); tCP2 = (len(cinit_place_2)>0);
    tCP3 = (len(clink_with_prefix)>0); ... tCPn = (len(cexit_place)>0)

Dead system state changes:
    verified by unreached code

Twice triggering token event combinations:
    <>p;
    p = ((wasTriggered(T1) && wasTriggered(T2)) ||
         (wasTriggered(T3) && wasTriggered(T7)) ||
         (wasTriggered(T3) && wasTriggered(T5)) ||
         (wasTriggered(T7) && wasTriggered(T5)) || ...
         (wasTriggered(T15) && wasTriggered(T11)) ||
         (wasTriggered(T15) && wasTriggered(T12)) ||
         (wasTriggered(T15) && wasTriggered(T13)) ||
         (wasTriggered(T15) && wasTriggered(T3)) ||
         (wasTriggered(T15) && wasTriggered(T5)))

Non-completion of signature instances:
    ([] !q) || <>(q && <>p);
    p = ((len(cinit_place_1)==1) && (len(cinit_place_2)==1) && ...
         (len(clink_with_prefix)==0) && (len(cexit_place)==0));
    q = ((len(cscript_created)>0) || ... (len(clink_with_prefix)>0) || (len(cexit_place)>0))


Fig. 7. Detailed section of the shell-link-attack signature (place SUID_script with features Int UserID, Bool FileDescIsSet, Int FileDescriptor, String FileName; transitions T10 "chmod script" and T13 "delete SUID-script", both associated with event type F, which protocols file actions).

Conditions of T13:
    (EType == FileDelete) AND (PSUID_script.FileName == Efilename)

Conditions of T13 after correction:
    (EType == FileDelete) AND
    ((NOT(PSUID_script.FileDescIsSet) AND PSUID_script.FileName == Efilename) OR
     (PSUID_script.FileDescIsSet AND PSUID_script.FileDescriptor == Efiledesc))

The verification showed that all properties are fulfilled by the shell-link signature except "non-completion of signature instances". This property does not hold for the place "SUID_script" in Figure 6. This place models a system state where the attacker has created a SUID script. In the PROMELA model there are messages in the corresponding channel that cannot be transferred to an escape channel. Consequently, the signature does not model every possible way in which the attacker can cancel the attack after script generation (T9) and script mode change (T10). This can be done, for instance, by deleting the created SUID script (T13). A closer look at transition T13 reveals that the transition does not distinguish how the script mode was changed to a SUID script on transition T10: either by a chmod syscall or by an fchmod syscall. In the first case, T13 must identify the related tokens for each occurring deletion event by comparing file names, in the second case by comparing file descriptors. But the condition to distinguish the two cases and the condition for the second case are missing in T13; therefore delete events based on file descriptors are not correctly handled by T13. This issue is depicted in Figure 7, which shows the relevant section around transition T13 of the shell-link-attack signature. Transition T10 sets the feature "FileDescIsSet" on place "SUID_script" to false and fills feature "FileName" with the observed file name if T10 was triggered by an event for a chmod syscall. But if T10 is triggered by an fchmod syscall, then "FileDescIsSet" is set to true and the logged file descriptor is mapped to "FileDescriptor". The problem is that the second condition on transition T13 only correlates the feature "FileName" of place "SUID_script" with the feature "Efilename" of the occurring event F, but the case of matching file descriptors is not considered. To correct this error the signature developer has to add the distinction between the two cases and the missing equality condition for file descriptors, as shown in Figure 7 in the section "Conditions of T13 after correction". Such errors are typical specification errors made by signature programmers. Further errors, such as mutually exclusive conditions, wrong transition mappings, missing cases, or unreachable places, can also readily be detected by our verification approach.

Resource Requirements: In order to estimate the run-time and memory requirements of the SPIN tool we captured some performance figures. The following data refer to the verification of the generated PROMELA model of the shell-link-attack signature above. SPIN generated the complete state space for the PROMELA model on an AMD X2-64 (2 GHz) in 15 minutes and required nearly 900 MB for this. We used the SPIN


options partial order reduction, bit-state hashing, and state vector compression. In this configuration the complete state space contained 476,851 states with 2.2e+08 state changes. Our tool, which performs the transformation from an EDL signature to the corresponding PROMELA model and generates the representatives of the event classes, required 25 seconds for the most complex signature. Apart from these run-time characteristics, a further advantage of our approach is that unfulfilled LTL formulas, i.e., violated signature properties, can easily be mapped onto concrete signature elements. Thus, fault detection and correction can be carried out easily.

6 Final Remarks

The derivation of signatures from new exploits is still a tedious process which requires much experience. Systematic approaches are still rare. Newly derived signatures often possess a significant detection inaccuracy which strongly limits the detection power of misuse detection systems as well as their acceptance in practice. A longer validation and correction phase is needed to derive high-quality and accurate signatures. This implies a larger vulnerability window for the affected systems, which is unacceptable from the security point of view. Verification methods can help to accelerate the signature development process and to reduce the vulnerability window.

In this paper we presented the first approach for identifying specification errors in signatures by verification. We applied the SPIN model checker to detect common signature specification errors. The approach was implemented as a tool for a concrete representative of a multi-step signature language, namely EDL. The tool maps a given EDL signature onto the corresponding PROMELA model and generates the signature properties, which are then checked with the SPIN model checker. In addition, we developed an automated method for deriving a finite set of representative events required for the verification. We have demonstrated and evaluated the approach by example.

We are currently working on the identification of further properties which each signature should hold. Furthermore, we intend to include a feature in our approach that suggests possible solutions to the signature modeler for correcting found specification errors. Another direction of work is the verification of single-step signatures as used in intrusion detection systems like Snort.

References

[1] Meier, M.: A Model for the Semantics of Attack Signatures in Misuse Detection Systems. In: Zhang, K., Zheng, Y. (eds.) ISC 2004. LNCS, vol. 3225, pp. 158–169. Springer, Heidelberg (2004)
[2] Meier, M., Schmerl, S.: Improving the Efficiency of Misuse Detection. In: Julisch, K., Krügel, C. (eds.) DIMVA 2005. LNCS, vol. 3548, pp. 188–205. Springer, Heidelberg (2005)


[3] Vigna, G., Eckmann, S.T., Kemmerer, R.A.: The STAT Tool Suite. In: Proceedings of the DARPA Information Survivability Conference and Exposition (DISCEX) 2000, vol. 2, pp. 46–55. IEEE Computer Society Press, Hilton Head (2000)
[4] Schmerl, S., König, H.: Towards Systematic Signature Testing. In: Petrenko, A., Veanes, M., Tretmans, J., Grieskamp, W. (eds.) TestCom/FATES 2007. LNCS, vol. 4581, pp. 276–291. Springer, Heidelberg (2007)
[5] Eckmann, S.T., Vigna, G., Kemmerer, R.A.: STATL: An Attack Language for State-based Intrusion Detection. Journal of Computer Security 10(1/2), 71–104 (2002)
[6] Paxson, V.: Bro: A System for Detecting Network Intruders in Real-Time. Computer Networks 31, 23–24 (1999)
[7] Kumar, S.: Classification and Detection of Computer Intrusions. PhD Thesis, Department of Computer Science, Purdue University, West Lafayette, IN, USA (August 1995)
[8] Ranum, M.J.: Challenges for the Future of Intrusion Detection. Invited talk at the 5th International Symposium on Recent Advances in Intrusion Detection (RAID), Zürich (2002)
[9] Finite domain solver: http://www.gprolog.org/manual/html_node/index.html
[10] GNU PROLOG: http://www.gprolog.org/manual/html_node/index.html
[11] Holzmann, G.J.: The SPIN Model Checker: Primer and Reference Manual. Addison-Wesley Professional, Reading (2003)
[12] Nanda, S., Chiueh, T.: Execution Trace-Driven Automated Attack Signature Generation. In: Proceedings of the 24th Annual Computer Security Applications Conference (ACSAC), Anaheim, CA, USA, pp. 195–204. IEEE Computer Society, Los Alamitos (2008)
[13] Liang, Z., Sekar, R.: Fast and Automated Generation of Attack Signatures: A Basis for Building Self-Protecting Servers. In: Proceedings of the 12th ACM Conference on Computer and Communications Security (CCS), Alexandria, VA (November 2005)

Towards Verifying Correctness of Wireless Sensor Network Applications Using Insense and Spin

Oliver Sharma¹, Jonathan Lewis², Alice Miller¹, Al Dearle², Dharini Balasubramaniam², Ron Morrison², and Joe Sventek¹

¹ Department of Computing Science, University of Glasgow, Scotland
² School of Computer Science, University of St. Andrews, Scotland

Abstract. The design and implementation of wireless sensor network applications often require domain experts, who may lack expertise in software engineering, to produce resource-constrained, concurrent, real-time software without the support of high-level software engineering facilities. The Insense language aims to address this mismatch by allowing the complexities of synchronisation, memory management and event-driven programming to be borne by the language implementation rather than by the programmer. The main contribution of this paper is an initial step towards verifying the correctness of WSN applications with a focus on concurrency. We model part of the synchronisation mechanism of the Insense language implementation using Promela constructs and verify its correctness using SPIN. We demonstrate how a previously published version of the mechanism is shown to be incorrect by SPIN, and give complete verification results for the revised mechanism. Keywords: Concurrency; Distributed systems; Formal Modelling; Wireless Sensor Networks.

1 Introduction

The coupling between software and hardware in the design and implementation of wireless sensor network (WSN) applications, driven by time, power and space constraints, often results in ad-hoc, platform specific software. Domain experts are expected to produce complex, concurrent, real-time and resource-constrained applications without the support of high-level software engineering facilities. To address this mismatch, the Insense language [3,10] abstracts over the complexities of memory management, concurrency control and synchronisation and decouples the application software from the operating system and the hardware. An Insense application is modelled as a composition of active components that communicate via typed, directional, synchronous channels. Components are single threaded and stateful but do not share state, thereby avoiding race conditions. Thus, the complexity of concurrent programming in Insense is borne by the language implementation rather than by the programmer. Verifying the

correctness of Insense applications requires that the language implementation be proved correct with respect to its defined semantics. The main contribution of this paper is an initial step towards verifying the correctness of WSN applications by modelling the semantics of Insense using Promela constructs. We focus here on concurrent programming and in particular on the correctness of the Insense channel implementation. The Insense channels and some of their associated algorithms are modelled in Promela. SPIN is then used to verify a set of sufficient conditions under which the Insense channel semantics are satisfied for a small number of sender and receiver components. The remainder of this paper is structured as follows. Section 2 provides background information on WSNs, Insense, and model checking. We then present the Insense channel model and its implementation in sections 3 and 4 respectively. Section 5 details the translation of the Insense channel implementation to Promela, develops a set of properties to verify the correctness of the implementation and demonstrates how a previously published version of the channel algorithms is shown to be incorrect by SPIN. Section 6 presents complete verification results for a revised set of algorithms and for previously unpublished connect and disconnect algorithms. Section 7 includes conclusions and some thoughts and directions on future work.

2 Background

2.1 Wireless Sensor Networks

WSNs, in general, and wireless environmental sensor networks, in particular, are receiving substantial research focus due to their potential importance to society [1]. By composing inexpensive, battery-powered, resource-constrained computation platforms equipped with short range radios, one can assemble networks of sensors targeted at a variety of tasks – e.g. monitoring air or water pollution [15], tracking movement of autonomous entities (automobiles [20], wild animals [22]), and attentiveness to potentially disastrous natural situations (magma flows indicative of imminent volcanic eruptions [23]). A wireless sensor node is an example of a traditional embedded system, in that it is programmed for a single, particular purpose, and is tightly integrated with the environment in which it is placed. As with all embedded computer systems, it is essential that appropriate design and construction tools and methodologies be used to eliminate application errors in deployed systems. Additionally, a wireless sensor node is usually constrained in a number of important operating dimensions: a) it is usually battery-powered and placed in a relatively inaccessible location; thus there is a need to maximize the useful lifetime of each node to minimize visits to the node in situ to replace batteries; b) the processing power and memory available to each node are severely constrained, therefore forcing the use of cycle-efficient and memory-efficient programming techniques; and c) the range of a node’s radio is limited, thus potentially forcing each node to act as a forwarding agent for packets from neighbouring nodes.


A typical application operating on a WSN system consists of code to: take measurements (either at regular intervals or when an application-specific event occurs), forward these measurements to one or more sink nodes, and subsequently communicate these measurements from the sink node(s) to a data centre. In order to design such an application, a variant of the following methodology is used:

– A domain expert (e.g. a hydrologist), using information obtained from a site visit and topological maps, determines the exact locations at which sensors should be placed (e.g. at the bends of a stream).
– A communications expert, using information obtained from a site visit, determines the exact location(s) at which the sink node(s) should be placed (e.g. with sufficient cellular telephony data signal strength to enable transport of the data back to a data centre).
– A communications expert, using information obtained from a site visit, topological maps, and knowledge of radio wave propagation characteristics, then determines the number and placement of additional forwarding nodes in order to achieve the required connectivity and redundancy.
– The system operation is then simulated using realistic data flow scenarios to determine whether the design meets the connectivity, redundancy, and reliability requirements. If not, the design is iterated until the simulations indicate that the requirements are met.

Implementation of such a design takes many forms. The most common are:

– a component-based framework, such as using the nesC extension to C under TinyOS [13] to construct the application;
– a more traditional OS-kernel-based approach, such as using Protothreads to construct the application in C under Contiki [12].

As these examples show, and as is normal for embedded systems, the application code is usually produced using a variant of the C programming language.

2.2 Insense

A fundamental design principle of Insense is that the complexity of concurrent programming is borne by the language implementation rather than by the programmer. Thus, the language does not include low-level constructs such as processes, threads and semaphores. Instead, the unit of concurrent computation is a language construct called the component. Components are stateful and provide strong syntactic encapsulation whilst preventing sharing, thereby avoiding accidental race conditions. In Insense an application is modelled as a composition of components that communicate via channels. Channels are typed, directional and synchronous, promoting the ability to reason about programs. Components are the basic building blocks of applications and thus provide strong cohesion between the architectural description of a system and its implementation. Components can


create instances of other components and may be arranged into a Fractal pattern [6], enabling complex programs to be constructed. We envisage the future development of high-level software engineering tools which permit components to be chosen and assembled into distributed applications executing on collections of nodes.

The locus of control of an Insense component is by design akin to a single thread that never leaves the syntactic unit in which it is defined. As components and threads are defined by the same syntactic entity, each component may be safely replaced without affecting the correct execution of others with respect to threading. By contrast, in conventional thread-based approaches, threads weave calls through multiple objects, often making it difficult (or at least expensive) to determine if a component can be replaced in a running program.

The topology of Insense applications may be dynamically changed by connecting and disconnecting channels. Furthermore, new component instances may be dynamically created, and executing component instances may be stopped. These mechanisms permit arbitrary components to be safely rewired and replaced at runtime.

In order to decouple the application software from the operating system and hardware, Insense programs do not make operating system calls or set specific registers to read from a device. Instead, parts of the hardware are modelled as Insense components with the appropriate channels to allow the desired interaction and are provided as part of an Insense library. The Insense compiler is written in Java and generates C source code, which is compiled using gcc and linked with the Insense library for the appropriate host operating system. The current Insense library implementation is written for the Contiki operating system [12].

2.3 Model Checking

Errors in system design are often not detected until the final testing stage when they are expensive to correct. Model checking [7,8,9] is a popular method that helps to find errors quickly by building small logical models of a system which can be automatically checked. Verification of a concurrent system design by temporal logic model checking involves first specifying the behaviour of the system at an appropriate level of abstraction. The specification P is described using a high level formalism (often similar to a programming language), from which an associated finite state model, M(P), representing the system is derived. A requirement of the system is specified as a temporal logic property, φ. A software tool called a model checker then exhaustively searches the finite state model M(P), checking whether φ is true for the model. In Linear Time Temporal Logic (LTL) model checking, this involves checking that φ holds for all paths of the model. If φ does not hold for some path, an error trace or counterexample is reported. Manual examination of this counter-example by the system designer can reveal that P does not adequately specify the behaviour of the system, that φ does not accurately describe the given requirement, or that there


is an error in the design. In this case, either P, φ, or the system design (and thus also P and possibly φ) must be modified and re-checked. This process is repeated until the model checker reports that φ holds in every initial state of M(P), in which case we say M(P) satisfies φ, written M(P) |= φ. Assuming that the specification and temporal properties have been constructed with care, successful verification by model checking increases confidence in the system design, which can then be refined towards an implementation.

The model checker SPIN [14] allows one to reason about specifications written in the model specification language Promela. Promela is an imperative-style specification language designed for the description of network protocols. In general, a Promela specification consists of a series of global variables, channel declarations and proctype (process template) declarations. Individual processes can be defined as instances of parameterised proctypes, in which case they are initiated via a defined init process. Properties are specified either using assert statements embedded in the body of a proctype (to check for unexpected reception, for example), using an additional monitor process (to check global invariance properties), or via LTL properties.

WSNs are inherently concurrent and involve complex communication mechanisms. Many aspects of their design would therefore benefit from the use of model checking techniques. Previous applications of model checking in this domain include the quantitative evaluation of WSN protocols [4,17,21] and the co-verification of WSN hardware and software using COSPAN [24]. SPIN has been used throughout the development of the WSN language Insense. In this paper we concentrate on the channel implementation. We show how even fairly simple analysis using SPIN revealed errors in the early design, and allowed for the development of robust code that we are confident is error-free.
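As a minimal illustration of these Promela constructs (a sketch invented for this overview, not part of the Insense model; all names are ours), a specification combining a global variable, a channel declaration, parameterised proctypes, an embedded assertion and an init process might look as follows:

   byte counter;                    // global variable
   chan link = [0] of { byte };     // rendezvous channel declaration

   proctype Worker (byte id) {      // parameterised process template
      link ! id;                    // send own identifier
      counter++;
      assert(counter <= 2)          // embedded safety check
   }

   proctype Collector () {
      byte v;
      link ? v                      // receive a value
   }

   init {                           // processes initiated via init
      atomic { run Worker(1); run Worker(2); run Collector(); run Collector() }
   }

Running SPIN on such a model explores all interleavings of the Worker and Collector processes and reports any assertion violation or deadlock.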

3 Insense Channel Model

Insense channels are typed and directional and are the only means for inter-component communication and synchronisation. A channel type consists of the direction of communication (in or out) and the type of messages that can be communicated via the channel. All values in the language may be sent over channels of the appropriate type, including channels themselves. Inter-component communication is established by connecting an outgoing channel in one component to an incoming channel of the same message type in another component using the connect operator. Similarly, the language supports a disconnect operation that permits components to be unwired. Insense supports three communication operations over channels: send, receive, and a non-deterministic select. In this paper we concentrate on the send and receive operations. Communication over channels is synchronous; the send operation blocks until the message is received and the receive operation blocks until a message is sent. These two operations also block if the channel is not connected. Multiple incoming and outgoing channels may be connected together, enabling the specification of complex communication topologies. This facility introduces non-determinism into the send and receive operations.

Fig. 1. Connection Topologies: (a) one-to-one; (b) one-to-many; (c) many-to-one; (d) a hybrid of (b) and (c)

The semantics of send and receive can be explained in more detail by considering Fig. 1, which depicts four connection topologies. Fig. 1(a) depicts a one-to-one connection between a sender component labelled S1 and a receiver component labelled R1. The semantics of send and receive over a one-to-one connection are akin to sending data down a traditional pipe, in that all values sent by S1 are received by R1 in the order they were sent. The topology in Fig. 1(b) represents a one-to-many connection pattern between a sender component S1 and two receiver components R1 and R2. Each value sent by S1 is non-deterministically received by either R1 or R2, but not by both. A usage scenario for the one-to-many connection pattern is that a sender component wishes to request a service from an arbitrary component in a server farm; from the perspective of the sender it is irrelevant which component receives its request. The connection topology shown in Fig. 1(c) represents a many-to-one connection pattern in which a number of output channels from potentially numerous components may be connected to an input channel associated with another component. For the depicted topology, R1 non-deterministically receives values from either S1 or S2 on a single incoming channel. In this pattern, the receiving component cannot determine the identity of the sending component or the output channel that was used to send the message, and the arrival order of messages is determined by scheduling. The pattern is useful as a multiplexer, in which R1 can multiplex data sent from S1 and S2 and could forward the data to a fourth component. The multiplexer pattern is used to allow multiple components to connect to a shared standard output channel. Each of the three basic patterns of connectivity depicted in Fig. 1(a)-(c) may be hybridized to create further variations. An example variation combining the patterns from Fig. 1(b) and Fig. 1(c) is depicted in Fig. 1(d).

4 Insense Channel Implementation

Insense channels are used for concurrency control and to provide inter-component communication via arbitrary connection topologies. Furthermore, the language is intended to permit components to be rewired and even replaced at runtime.

Fig. 2. Original Send and Receive Algorithms: (a) the send algorithm; (b) the receive algorithm

The combination of component and channel abstractions reduces the complexity faced by the Insense programmer at the cost of increased complexity in the channel implementation. Each Insense channel is represented by a half-channel object in the implementation. Each half-channel contains six fields:

1. a buffer for storing a single datum of the corresponding message type;
2. a field called ready, which indicates whether its owner is ready to send or receive data;
3. a field called nd_received, which is used to identify the receiving channel during a select operation;
4. a list of pointers, called connections, to the channels to which the channel is connected;
5. a binary semaphore called mutex, which serialises access to the channel; and
6. a binary semaphore called blocked, upon which the components may block.

When a channel is declared in the language, a corresponding half-channel is created in the implementation. Whenever a connection is made between an outgoing and an incoming channel in Insense, each half-channel is locked in turn using the mutex. Next, a pointer to the corresponding half-channel is added to each of the connections lists and the mutex is released. Disconnection is similar, with the connections list being traversed and the bi-directional linkage between the half-channels dissolved.

The implementations of the send and receive operations, shown in Fig. 2, were published in [10]. The numbers on the left-hand side of the descriptions should be ignored for now; they are used for reasoning in Section 5.2. The send and receive operations are almost symmetric. Both operations attempt to find a waiting component in the list of connections, with the receiver


looking for a waiting sender and vice-versa. If no such match is found the sender or receiver block on the blocked semaphore until they are re-awakened by the signal(match.blocked) statement in the corresponding receive or send operation respectively.
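Anticipating the Promela idioms introduced in Section 5.1, this block/re-awaken handshake might be sketched as follows. This is our own illustrative reading of the prose only, not the paper's code; we assume each blocked semaphore starts in the LOCKED state so that the first wait suspends:

   // sender side, no waiting receiver found:
   atomic {
      hctab[me].blocked != LOCKED;     // wait(me.blocked): suspend ...
      hctab[me].blocked = LOCKED       // ... until a partner signals
   }

   // receiver side, having found the waiting sender `match`:
   hctab[match].blocked = UNLOCKED     // signal(match.blocked): wake the sender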

5 Verification of the Send and Receive Operations

In this section we describe the Promela implementation of the half-channels and of the send and receive operations described in Section 4. We show how simple initial verification with SPIN using assert statements revealed a subtle error in the channel implementation. We then provide the corrected algorithms, which have been developed with the help of model checking. A list of properties is given, specifying the semantics of the send and receive operations.

5.1 Send and Receive in Promela

Communication between Insense components over a channel is achieved by a send operation in one component and a corresponding receive operation in the other. We therefore model the operations in Promela using a Sender and a Receiver proctype (see Section 2.3). We can then verify the behaviour of the send/receive operations to/from given sets of components by initiating the appropriate Sender/Receiver processes within an init process (see Section 2.3). Both proctypes have an associated myChan parameter, which is a byte identifying a process's half-channel. In addition, the Sender proctype has a data parameter indicating the item of data to be sent. After initialisation we are not interested in the actual data sent, so a single value for each Sender process suffices.

Half-channels. Half-channels are implemented as C structs in the Insense implementation. They contain a buffer for storing an item of the channel type, semaphores and flags, and a list of other half-channels that this half-channel is connected to (see Section 4). In Promela, we implement half-channels using variations of the following typedef definition:

   typedef halfchan {
      // Binary semaphores
      bit mutex;                      // locks access to channel
      bit blocked;                    // indicates channel is blocked
      // Boolean flags
      bit ready;                      // TRUE if ready to send/recv
      // Buffer
      byte buffer;
      // List of connections to other half-channels
      bit connections[NUMHALFCHANS];
   }
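As a sketch of how such half-channels might be instantiated (our own example, not quoted from the paper; we assume NUMHALFCHANS is defined as 2 before the typedef, and that hctab is the globally accessible table described below), the set-up for one sender connected to one receiver could read:

   halfchan hctab[NUMHALFCHANS];      // globally accessible half-channel table

   init {
      atomic {
         hctab[0].connections[1] = 1; // wire sender half-channel 0 ...
         hctab[1].connections[0] = 1; // ... to receiver half-channel 1
         run Sender(0, 5);            // the sender always sends the value 5
         run Receiver(1)
      }
   }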


Every sender and receiver is the owner of exactly one half-channel. In our Promela specification all half-channels are stored in a globally accessible array hctab. Note that we could have modelled the fields of each half-channel as a set of channels (associated with a particular Sender or Receiver process). However, we have used the halfchan construct to stay as true as possible to the original C implementation.

Connections and Semaphores. Each half-channel contains a list of other half-channels to which it is connected. The connections list is an array of bits, where a value of 1 at index i indicates that the half-channel is connected to half-channel i in the hctab array. The send and receive algorithms use binary semaphores to synchronize. For example, if LOCKED and UNLOCKED are constants denoting the locked and unlocked status of a semaphore and me is the half-channel parameter, then the wait operation (line (1) in Fig. 2(a)) is represented by the following Promela code in the Sender proctype:

   atomic{
      hctab[me].mutex != LOCKED;  // wait for mutex
      hctab[me].mutex = LOCKED    // lock mutex
   }

The lock can only be obtained if it is currently not in use (that is, it is currently set to UNLOCKED). If the lock is being used, the atomic sequence blocks until the lock can be obtained. The use of an atomic statement here ensures that race conditions do not occur.

Data transfer. In addition to the data item being transferred from sender to receiver, global flags are set to note the type of transfer. In the case of a single sender and receiver, there are two types of data transfer: either the sender pushes the data item to the receiver, or the receiver pulls the data item from the sender. Two bit flags, push and pull, used for verification purposes only, are set accordingly within appropriate atomic steps. Note that for our purposes it is sufficient for constant data values to be sent/received. In Section 5.4, counters are used to ensure that duplicate data is not sent/received during a single run.
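The matching signal operation is not spelled out at this point; under the same LOCKED/UNLOCKED encoding it can be assumed to be a single assignment:

   hctab[me].mutex = UNLOCKED   // signal(me.mutex): lets a blocked
                                // test-and-set on this semaphore proceed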

5.2 Error in the Original Version of the Send Algorithm

The send and receive algorithms were modelled in Promela as described above. Our initial model only involved one Sender and one Receiver process, where each process could only execute a single cycle (i.e. the processes terminated when the statement corresponding to return had been reached). The model was sufficient to reveal a previously unobserved flaw in the send operation. This error was detected using an assert statement embedded in the Receiver proctype. After data has been pulled by the receiver, it should have the same value as that sent by the sender. Assuming that the sender always sends data with value 5,


the assert statement is assert(hctab[me].buffer==5). A safety check showed that there was an assertion violation. Close examination of the output generated by a guided simulation provided the error execution sequence for our model. The corresponding sequence in the send and receive operations is illustrated in the algorithms given in Figs. 2(a) and 2(b), following the numbered statements from (1) to (10). Both processes obtain their own (half-channel's) mutex lock, set their ready flag and release the lock. The receiver then checks that the sender is ready for data transfer (by checking its ready flag), then commences to pull data from the sender's buffer. This is where the error occurs: the data item is copied even though it has not yet been initialized at this stage. Inspection of the send algorithm shows that the sender's buffer is not set until the penultimate line of code is reached. Possible fixes for this bug are either to set the sender's buffer before setting the ready flag, or to not set the ready flag until the buffer is initialized. To maximize parallelism, the former fix was implemented. The corrected algorithms are shown in Fig. 3. Note that, in addition to the fix, a conns semaphore, used when dynamically connecting and disconnecting channels, is introduced to the half-channel data structures and to both algorithms.

Fig. 3. Corrected Send and Receive Algorithms: (a) the send algorithm; (b) the receive algorithm
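In Promela terms, the beginning of the corrected send operation therefore orders its statements as sketched below. This illustrates the fix only; the remainder of the algorithm (scanning the connections list, pushing, or blocking, plus the new conns semaphore) is elided, and the full version is in Fig. 3(a):

   atomic { hctab[me].mutex != LOCKED; hctab[me].mutex = LOCKED };  // wait(me.mutex)
   hctab[me].buffer = data;    // the fix: initialise the buffer first ...
   hctab[me].ready = true;     // ... and only then announce readiness
   hctab[me].mutex = UNLOCKED; // signal(me.mutex)
   // ... scan the connections list for a ready receiver; push, or block ...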

5.3 Extending the Model for Multiple Processes

After adapting our Sender proctype to reflect the corrected version of the send operation, verification runs were performed to ensure that a model with single Sender and Receiver processes behaved as expected. They were then extended to run indefinitely, via additional goto statements and (start and return) labels.
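The resulting process shape can be sketched as follows; only the start and RET_S1 labels, which also appear in Table 1, are taken from the paper, and the body is elided:

   proctype Sender (byte me; byte data) {
   start:
      skip;          // ... body of the send operation (Fig. 3(a)) elided ...
   RET_S1:
      goto start     // run indefinitely instead of terminating
   }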


The current Promela implementation allows for multiple Sender and Receiver processes. Extra receivers require the global variable NUMHALFCHANS to be incremented, thereby adding an additional element to global data structures such as the half-channel table and the half-channels' connection lists. Each receiver's half-channel must be initialized in the init proctype and each sender and receiver process instantiated. With multiple sender/receiver processes, the variables used for verification must be adapted. In particular, rather than using a single bit to indicate a sender push or receiver pull, bit arrays of length NUMHALFCHANS are used. As with the global half-channel table, each element in these arrays is associated with a single sender or receiver process. Note that some of the properties described in Section 5.4 apply only when multiple sender or receiver processes are present. In particular, Property 6, which is concerned with duplication of data, applies only to versions where a sender is connected to multiple receivers.
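Concretely, the single push and pull bits become arrays indexed by receiver, along the following lines (a sketch consistent with the description above):

   bit push[NUMHALFCHANS];   // push[i] is set when a sender pushes to receiver i
   bit pull[NUMHALFCHANS];   // pull[i] is set when receiver i pulls from a sender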

5.4 Properties

The following list contains the high-level requirements of the channel implementation provided by the Insense designers. This list was developed over a period of time during discussions between the designers and the modellers, which helped to clarify the design specification.

– Property 1: In a connected system, send and receive operations are free from deadlock.
– Property 2: Finite progress – in a connected system, data always flows from senders to receivers.
– Property 3: For any connection between a sender and a receiver, either the sender can push or the receiver can pull, but not both.
– Property 4: The send operation does not return until data has been written to a receiver's buffer (either by sender-push or receiver-pull).
– Property 5: The receive operation does not return until data has been written into its buffer (either by sender-push or receiver-pull).
– Property 6: Data passed to the send operation is written to exactly one receiver's buffer, i.e. data is not duplicated during a single send operation.
– Property 7: The receiver's buffer is only written to once during a single operation, i.e. data is never overwritten (lost) before the receive operation returns.

Before we can verify that the properties hold at every possible system state, they must first be expressed in LTL. Property 1 can be checked by performing a no-invalid-endstates verification with SPIN, so no LTL property is required in this case. (This check would also reveal any assertion violations, like that exposing the bug in Section 5.2.) In Table 1 we define the propositions used in our LTL properties together with their meaning in Promela. The index i ranges from 1 to 3 and is used to access array elements associated with the ith sender or ith receiver process respectively.

Table 1. Propositions used in LTL properties

   Proposition        Definition                Proposition        Definition
   Push_i             push[i] == TRUE           Pull_i             pull[i] == TRUE
   SenderStart_i      Sender[spid_i]@start      SenderReturn_i     Sender[spid_i]@RET_S1
   ReceiverStart_i    Receiver[rpid_i]@start    ReceiverReturn_i   Receiver[rpid_i]@RET_R1
   Scountmax_i        scount[i] == 1            Rcountmax_i        rcount[i] == 1

Here spid_i and rpid_i are variables storing the process identifiers of the ith sender and receiver process respectively, and are used to remotely reference labels within a given sender/receiver process. Note that scount[i] and rcount[i] are array elements recording the number of push/pull operations executed. Variable scount[i] is incremented when the ith sender is involved in a push or a pull, and decremented when the sender reaches its return label (similarly for rcount[i]). Note that both senders and receivers can increment these variables, but the scount[i]/rcount[i] variables are only decremented by the corresponding sender/receiver. The ith elements of the push and pull arrays record whether a push or pull has occurred to or from the ith receiver.

We use the usual !, ||, && and → for negation, disjunction, conjunction and implication. In addition, [], <> and U denote the standard temporal operators "always", "eventually" and "(strong) until" respectively. As shorthand we use W for "(weak) until", where pWq denotes ([]p || (pUq)). In addition, for 1 ≤ j ≤ 3 we use the notation [PushOrPull]_j and [PushAndPull]_j to represent (Push_1 || Pull_1 || ... || Push_j || Pull_j) and ((Push_1 && Pull_1) || ... || (Push_j && Pull_j)) respectively. Here R denotes the number of receivers. The properties are the same for any number of senders greater than zero.

– Property 2, 1 ≤ R ≤ 3:  [] [PushOrPull]_R
– Property 3, 1 ≤ R ≤ 3:  [] ![PushAndPull]_R
– Property 4, 1 ≤ R ≤ 3:  [] (SenderStart_1 → ((!SenderReturn_1) W [PushOrPull]_R))
– Property 5, 1 ≤ R ≤ 3:  [] (ReceiverStart_1 → ((!ReceiverReturn_1) W (Push_1 || Pull_1)))
– Property 6, R = 1: not applicable; R > 1:  [] (SenderReturn_1 → Scountmax_1)
– Property 7, 1 ≤ R ≤ 3:  [] (ReceiverReturn_1 → Rcountmax_1)

Note that, since every process starts at the start label, Properties 4 and 5 are not vacuously true. For Properties 6 and 7, however, it is possible that in some paths the relevant Sender or Receiver never reaches the return label. This is acceptable – we are only interested here in whether duplication is possible.
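To give a feel for how such a property is checked in practice, Property 7 for a single receiver might be phrased for SPIN's LTL converter as follows. This is a hedged sketch: the propositions follow Table 1, but the macro names, file names and exact indexing are ours:

   #define receiverreturn1 (Receiver[rpid1]@RET_R1)   // as in Table 1
   #define rcountmax1      (rcount[1] == 1)
   /* generate the never claim and run the verification:
        spin -f '[] (receiverreturn1 -> rcountmax1)' > prop7.ltl
        spin -a -N prop7.ltl model.pml                              */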

6 Experimental Results

The experiments were conducted on a 2.4 GHz Intel Xeon processor with 3 GB of available memory, running Linux (kernel 2.4.21) and SPIN 5.1.7.

6.1 Verification of the Corrected Send and Receive Operations

To provide consistency, a template model was used from which a unique model was generated for each configuration and property being tested. This allowed us to control the state space by only including variables that were relevant to the property being tested. Promela code for our template and some example configurations, together with claim files (one per property) and full verification output for all configurations and properties, can be found in an appendix at http://www.dcs.gla.ac.uk/dias/appendices.htm.

In Table 2 we give results for scenarios in which S sender processes are connected to R receiver processes, where R + S ≤ 4. Here Property is the property number as given in Section 5.4; Time is the actual verification time (user + system) in seconds; Depth is the maximum search depth; States is the total number of stored states; and Memory is the memory used for state storage in megabytes. Compression was used throughout, and in all cases full verification was possible (with no errors). Note that there is no result for Property 6 with a single receiver, as this property applies to multiple receivers only.
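A typical verification run matching this set-up might look as follows (standard SPIN 5.x usage, assumed rather than quoted from the paper; model.pml and prop2.ltl are illustrative file names):

   spin -a model.pml                 # generate the pan verifier from the model
   gcc -O2 -DCOLLAPSE -o pan pan.c   # COLLAPSE enables state compression
   ./pan                             # safety check, including invalid end states
   spin -a -N prop2.ltl model.pml    # alternatively, include a claim file first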

6.2 Verification of the Connect/Disconnect Operations

The Insense designers worked closely with the model checking experts to develop previously unpublished algorithms for dynamic connection and disconnection of components. Using SPIN, deadlocks were shown to exist in previous versions of the algorithms. The final, verified algorithms are given in Fig. 4. We note the following:

– The connect and disconnect algorithms make use of: an additional Boolean is_input field in the half-channel data structures (set to true for incoming half-channels) to prevent deadlocks by imposing a common order on mutex locking for the send, receive, connect, and disconnect operations; and a conn_op mutex to prevent race conditions when executing multiple connect and disconnect operations concurrently. The use of a global lock here is not an ideal solution; however, its removal resulted in models with an intractable state space (exceeding a 32 GB available space limit) for more than 2 Sender or Receiver processes. Since our systems (and hence our Promela models) are inherently symmetric, progress could be achieved here with the application of symmetry reduction (see Section 7).

Table 2. Results for sender and receiver verifications

   S:R   Property   Time    Depth      States     Memory
   1:1   1          0.5     474        1488       0.3
   1:1   2          0.6     972        2982       0.4
   1:1   3          0.5     972        1518       0.3
   1:1   4          0.5     1014       2414       0.3
   1:1   5          0.5     971        2425       0.3
   1:1   7          0.5     969        1552       0.3
   1:2   1          1.3     1.9×10^4   1.3×10^5   4.0
   1:2   2          4.3     4.1×10^4   2.7×10^5   9.7
   1:2   3          1.5     4.1×10^4   1.4×10^5   5.3
   1:2   4          2.0     4.2×10^4   2.0×10^5   7.3
   1:2   5          2.1     4.1×10^4   2.1×10^5   7.9
   1:2   6          1.5     4.0×10^4   1.4×10^5   5.3
   1:2   7          1.6     4.3×10^4   1.5×10^5   5.7
   1:3   1          115.7   1.0×10^6   1.1×10^7   351.6
   1:3   2          446.7   2.2×10^6   2.2×10^7   832.0
   1:3   3          148.4   2.2×10^6   1.2×10^7   439.3
   1:3   4          188.7   2.1×10^6   1.5×10^7   576.8
   1:3   5          229.4   2.2×10^6   1.8×10^7   678.0
   1:3   6          147.4   2.1×10^6   1.1×10^7   437.1
   1:3   7          162.3   2.4×10^6   1.3×10^7   486.2
   2:1   1          1.0     1.5×10^4   8.0×10^4   2.5
   2:1   2          2.7     3.2×10^4   1.6×10^5   5.9
   2:1   3          1.1     3.2×10^4   8.3×10^4   3.3
   2:1   4          1.5     3.3×10^4   1.3×10^5   5.1
   2:1   5          1.4     3.2×10^4   1.2×10^5   4.7
   2:1   7          1.1     3.2×10^4   8.4×10^4   3.3
   2:2   1          158.2   2.3×10^6   1.5×10^7   460.1
   2:2   2          562.2   5.0×10^6   2.8×10^7   1052.0
   2:2   3          205.5   5.0×10^6   1.6×10^7   573.2
   2:2   4          227.3   5.1×10^6   2.2×10^7   828.8
   2:2   5          283.3   5.0×10^6   2.2×10^7   826.7
   2:2   6          204.1   5.0×10^6   1.6×10^7   586.1
   2:2   7          209.8   5.1×10^6   1.6×10^7   590.8
   3:1   1          42.5    5.9×10^5   4.0×10^6   127.8
   3:1   2          163.7   1.2×10^6   7.9×10^6   282.1
   3:1   3          51.7    1.2×10^6   4.2×10^6   148.3
   3:1   4          83.6    1.4×10^6   6.8×10^6   244.7
   3:1   5          72.1    1.2×10^6   5.9×10^6   214.5
   3:1   7          51.9    1.2×10^6   4.2×10^6   149.1

Fig. 4. Connect and Disconnect: (a) the connect algorithm; (b) the disconnect algorithm

Table 3. Results for sender and receiver verifications, with additional Connect and Disconnect processes

   S:R   Property   Time    Depth      States     Memory
   1:1   1          0.6     1.1×10^3   1.5×10^4   0.7
   1:1   3          0.6     2.1×10^3   1.5×10^4   0.9
   1:1   4          0.7     2.7×10^3   2.4×10^4   1.1
   1:1   5          0.7     2.1×10^3   2.4×10^4   1.1
   1:1   7          0.6     3.3×10^3   1.6×10^4   0.9
   1:2   1          68.4    3.0×10^5   5.1×10^6   205.1
   1:2   3          80.5    6.3×10^5   5.3×10^6   234.1
   1:2   4          111.2   6.2×10^5   7.6×10^6   328.5
   1:2   5          127.2   6.3×10^5   8.2×10^6   356.6
   1:2   6          78.3    6.4×10^5   5.3×10^6   232.4
   1:2   7          93.5    7.6×10^5   6.2×10^6   271.3
   2:1   1          48.1    3.7×10^5   3.7×10^6   142.7
   2:1   3          54.4    7.1×10^5   3.8×10^6   160.8
   2:1   4          98.3    8.6×10^5   6.4×10^6   276.8
   2:1   5          84.7    7.1×10^5   5.7×10^6   247.8
   2:1   7          57.1    7.1×10^5   3.8×10^6   162.0


– In our Promela model, R × S Connect processes and R + S Disconnect processes are used to simulate connection and disconnection (one Connect process per Sender-Receiver pair, and one Disconnect process per Sender or Receiver); an instantiation sketch is given after this list. The executions of these processes interleave with those of the S Sender and R Receiver processes.
– As Property 2 of Section 5.4 refers to a connected system, it is not relevant in this context.
– All other relevant properties have been shown to hold for the cases R + S ≤ 3; see Table 3.
– A (further) template model was used to generate the models. This template and an example model are contained in the online appendix.
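For example, for S = 1 and R = 2 the instantiation might look as follows (a sketch: the process counts follow the description above, but the Connect and Disconnect parameter conventions are our assumption, not taken from the paper):

   init {
      atomic {
         run Sender(0, 5); run Receiver(1); run Receiver(2);      // S = 1, R = 2
         run Connect(0, 1); run Connect(0, 2);                    // R x S = 2 Connect processes
         run Disconnect(0); run Disconnect(1); run Disconnect(2)  // R + S = 3 Disconnect processes
      }
   }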

7 Conclusions and Further Work

This paper outlines an initial step towards verifying the correctness of WSN applications with a focus on concurrency. The general approach taken here is to verify the implementation of the inter-component synchronisation mechanism of the Insense language using SPIN. Specifically, the Insense channels and their associated send, receive, connect, and disconnect operations are first modelled using Promela constructs, and SPIN is then used to verify a set of LTL properties under which the channel semantics are satisfied for a small number of senders and receivers. The SPIN model checker is used to reveal errors in a previously published version of the Insense channel implementation and to aid the development of revised algorithms that are correct with respect to their defined semantics.

There are three avenues of further work in this area. First, the verification of the Insense language implementation is to be completed by modelling the non-deterministic select operation in Promela and using SPIN to check the relevant LTL properties. Second, we would like to show that the send and receive operations are safe for any number S of senders and any number R of receivers. This is an example of the parameterised model checking problem (PMCP), which is not, in general, decidable [2]. One approach that has proved successful for verifying some parameterised systems involves the construction of a network invariant (e.g. [16]). The network invariant I represents an arbitrary member of a family of processes. The problem here is especially hard, as we have two parameters, S and R. By fixing S to be equal to 1, however, we have applied an invariant-like approach (from [18]) to at least show that a system with one sender process connected to any number (greater than zero) of receivers does not deadlock. (Details are omitted here for space reasons.) In future work we intend to extend this to the case where S > 1.

Our systems are inherently symmetric, and could benefit from the use of symmetry reduction [19]. Existing symmetry reduction tools for SPIN are not currently applicable. SymmSpin [5] requires all processes to be permutable and for symmetry information to be provided by the user, and TopSpin [11] does


not exploit symmetry between global variables. We plan to extend TopSpin to allow us to verify more complex systems and to remove the global lock from the Connect and Disconnect processes. Finally, an important aspect of further work is to extend our methodology from verifying the Insense language implementation to verifying programs. Our intention is to model WSN applications written in Insense using Promela constructs and to verify correctness of these programs using SPIN.

Acknowledgements. This work is supported by the EPSRC grant entitled DIAS-MC (Design, Implementation and Adaptation of Sensor Networks through Multi-dimensional Co-design), EP/C014782/1.

References

1. Akyildiz, I., Su, W., Sankarasubramaniam, Y., Cayirci, E.: Wireless sensor networks: A survey. Computer Networks 38(4), 393–422 (2002)
2. Apt, K.R., Kozen, D.C.: Limits for automatic verification of finite-state concurrent systems. Information Processing Letters 22, 307–309 (1986)
3. Balasubramaniam, D., Dearle, A., Morrison, R.: A composition-based approach to the construction and dynamic reconfiguration of wireless sensor network applications. In: Pautasso, C., Tanter, É. (eds.) SC 2008. LNCS, vol. 4954, pp. 206–214. Springer, Heidelberg (2008)
4. Ballarini, P., Miller, A.: Model checking medium access control for sensor networks. In: Proc. of the 2nd Int'l. Symp. on Leveraging Applications of Formal Methods, pp. 255–262. IEEE, Los Alamitos (2006)
5. Bosnacki, D., Dams, D., Holenderski, L.: Symmetric Spin. International Journal on Software Tools for Technology Transfer 4(1), 65–80 (2002)
6. Bruneton, É., Coupaye, T., Leclercq, M., Quéma, V., Stefani, J.-B.: The Fractal component model and its support in Java. Software Practice and Experience 36(11–12), 1257–1284 (2006)
7. Clarke, E., Emerson, E.: Synthesis of synchronization skeletons for branching time temporal logic. In: Kozen, D. (ed.) Logic of Programs 1981. LNCS, vol. 131. Springer, Heidelberg (1981)
8. Clarke, E., Emerson, E., Sistla, A.P.: Automatic verification of finite-state concurrent systems using temporal logic specifications. ACM Transactions on Programming Languages and Systems 8(2), 244–263 (1986)
9. Clarke, E., Grumberg, O., Peled, D.: Model Checking. The MIT Press, Cambridge (1999)
10. Dearle, A., Balasubramaniam, D., Lewis, J., Morrison, R.: A component-based model and language for wireless sensor network applications. In: Proc. of the 32nd Int'l Computer Software and Applications Conference (COMPSAC 2008), pp. 1303–1308. IEEE Computer Society Press, Los Alamitos (2008)
11. Donaldson, A.F., Miller, A.: A computational group theoretic symmetry reduction package for the SPIN model checker. In: Johnson, M., Vene, V. (eds.) AMAST 2006. LNCS, vol. 4019, pp. 374–380. Springer, Heidelberg (2006)
12. Dunkels, A., Grönvall, B., Voigt, T.: Contiki – a lightweight and flexible operating system for tiny networked sensors. In: Proc. 1st Workshop on Embedded Networked Sensors (EmNets-I). IEEE, Los Alamitos (2004)
13. Gay, D., Levis, P., Culler, D.: Software design patterns for TinyOS. Transactions on Embedded Computing Systems 6(4), 22 (2007)
14. Holzmann, G.: The SPIN Model Checker: Primer and Reference Manual. Addison-Wesley, Boston (2003)
15. Khan, A., Jenkins, L.: Undersea wireless sensor network for ocean pollution prevention. In: Proc. 3rd Int'l. Conference on Communication Systems Software and Middleware (COMSWARE 2008), pp. 2–8. IEEE, Los Alamitos (2008)
16. Kurshan, R.P., McMillan, K.L.: A structural induction theorem for processes. In: Proceedings of the Eighth Annual ACM Symposium on Principles of Distributed Computing, pp. 239–247. ACM Press, New York (1989)
17. Kwiatkowska, M., Norman, G., Sproston, J.: Probabilistic model checking of the IEEE 802.11 wireless local area network protocol. In: Hermanns, H., Segala, R. (eds.) PAPM-PROBMIV 2002. LNCS, vol. 2399, pp. 169–187. Springer, Heidelberg (2002)
18. Miller, A., Calder, M., Donaldson, A.F.: A template-based approach for the generation of abstractable and reducible models of featured networks. Computer Networks 51(2), 439–455 (2007)
19. Miller, A., Donaldson, A., Calder, M.: Symmetry in temporal logic model checking. Computing Surveys 36(3) (2006)
20. Skordylis, A., Guitton, A., Trigoni, N.: Correlation-based data dissemination in traffic monitoring sensor networks. In: Proc. 2nd Int'l. Conference on Emerging Networking Experiments and Technologies (CoNEXT 2006), p. 42 (2006)
21. Tobarra, L., Cazorla, D., Cuartero, F., Diaz, G., Cambronero, E.: Model checking wireless sensor network security protocols: TinySec + LEAP. In: Wireless Sensor and Actor Networks. IFIP International Federation for Information Processing, vol. 248, pp. 95–106. Springer, Heidelberg (2007)
22. Venkatraman, S., Long, J., Pister, K., Carmena, J.: Wireless inertial sensors for monitoring animal behaviour. In: Proc. 29th Int'l. Conference on Engineering in Medicine and Biology (EMBS 2007), pp. 378–381. IEEE, Los Alamitos (2007)
23. Werner-Allen, G., Lorincz, K., Welsh, M., Marcillo, O., Johnson, J., Ruiz, M., Lees, J.: Deploying a wireless sensor network on an active volcano. IEEE Internet Computing 10(2), 18–25 (2006)
24. Xie, F., Song, X., Chung, H., Nandi, R.: Translation-based co-verification. In: Proceedings of the 3rd International Conference on Formal Methods and Models for Codesign, Verona, Italy, pp. 111–120. IEEE Computer Society (2005)

Verification of GALS Systems by Combining Synchronous Languages and Process Calculi

Hubert Garavel¹ and Damien Thivolle¹,²

¹ INRIA Grenoble - Rhône-Alpes, 655 avenue de l'Europe, 38330 Montbonnot Saint Martin, France
{Hubert.Garavel,Damien.Thivolle}@inria.fr
² Polytechnic University of Bucharest, Splaiul Independentei 313, 060042 Bucharest, Romania

Abstract. A Gals (Globally Asynchronous Locally Synchronous) system typically consists of a collection of sequential, deterministic components that execute concurrently and communicate using slow or unreliable channels. This paper proposes a general approach for modelling and verifying Gals systems using a combination of synchronous languages (for the sequential components) and process calculi (for communication channels and asynchronous concurrency). This approach is illustrated with an industrial case-study provided by Airbus: a Tftp/Udp communication protocol between a plane and the ground, which is modelled using the Eclipse/Topcased workbench for model-driven engineering and then analysed formally using the Cadp verification and performance evaluation toolbox.

1 Introduction

In computer hardware, the design of synchronous circuits (i.e., circuits the logic of which is governed by a central clock) has long been the prevalent approach. In the world of software, synchronous languages [17] are based on similar concepts. Whatever their concrete syntaxes (textual or graphical) and their programming styles (data flow or automata-based), these languages share a common paradigm: a synchronous program consists of components that evolve by discrete steps, and there is a central clock ensuring that all components evolve simultaneously. Each component is usually deterministic, as is the composition of all components; this assumption greatly simplifies the simulation, testing and verification of synchronous systems. During the last two decades, synchronous languages have gained industrial acceptance and are being used for programming critical embedded real-time systems, such as avionics, nuclear, and transportation systems. They have also found applications in circuit design. Examples of synchronous languages are Argos [24], Esterel [3], Lustre/Scade [16], and Signal/Sildex [1].

However, embedded systems do not always satisfy the assumptions underlying the semantics of synchronous languages. Recent approaches in embedded systems

(modular avionics, X-by-wire, etc.) introduce a growing amount of asynchronism and nondeterminism. This situation has long been known in the world of hardware, where the term Gals (Globally Asynchronous, Locally Synchronous) was coined to characterise circuits consisting of a set of components, each governed by its own local clock, that evolve asynchronously. Clearly, these evolutions challenge the established positions of synchronous languages in industry.

There have been several attempts at pushing the limits of synchronous languages to model Gals systems. Following Milner's conclusion [28] that asynchronism can be encoded in a synchronous process calculus, there have been approaches [18,23,29,19] suggesting ways to describe Gals systems using synchronous languages; for instance, nondeterminism is expressed by adding auxiliary input variables (oracles), the value of which is undefined. A main limitation of these approaches is that asynchronism and nondeterminism are not recognised as first-class concepts, so verification tools often lack optimisations specific to asynchronous concurrency (e.g. partial orders, compositional minimisation, etc.). Other approaches extend synchronous languages to allow a certain degree of asynchrony, as in Crp [2], Crsm [31], or multiclock Esterel [4], but, to our knowledge, such extensions are not (yet) used in industry. Finally, we can mention approaches [15,30] in which synchronous programs are compiled and distributed automatically over a set of processors running asynchronously; although these approaches allow the generation of Gals implementations, they do not address the issue of modelling and verifying Gals systems.

A totally different approach would be to ignore synchronous languages and adopt languages specifically designed to model asynchrony and nondeterminism, and equipped with powerful verification tools, namely process calculi such as Csp [6], Lotos [21], or Promela [20]. Such a radical migration, however, would not be so easy for companies that invested massively in synchronous languages and whose products have very long life-cycles calling for stability in programming languages and development environments.

In this paper, we propose an intermediate approach that combines synchronous languages and process calculi for modelling, verifying, and evaluating the performance of Gals systems. Our approach tries to retain the best of both worlds:

– We continue using synchronous languages to specify the components of Gals systems, and possibly sets of components, running together in synchronous parallelism.
– We introduce process calculi to: (1) encapsulate those synchronous components or sets of components; (2) model additional components whose behaviour is nondeterministic, a typical example being unreliable communication channels that can lose, duplicate and/or reorder messages; (3) interconnect all parts of a Gals system that execute together according to asynchronous concurrency.

The resulting specification is asynchronous and can be analysed using the tools available for the process calculus being considered.

Regarding related work, we can mention [32], which translates Crsm [31] into Promela and then uses the Spin model checker to verify properties expressed


as a set of distributed observers; our approach is different in the sense that it can use synchronous languages just as they are, instead of introducing a new synchronous/asynchronous language such as Crsm. Closer to our approach is [9], which uses the Signal compiler to generate C code from synchronous components written in Signal, embeds this C code into Promela processes, abstracts hardware communication buses as Promela finite Fifo channels, and finally uses Spin to verify temporal logic formulas. A key difference between their approach and ours is the way locally synchronous components are integrated into a globally asynchronous system. The approach of [9] is stateful in the sense that the C code generated for a synchronous Signal component is a transition system with an internal state that does not appear at the Promela level; thus, they must rely upon the "atomic" statement of Promela to enforce the synchronous paradigm by merging each pair of input and output events into one single event. To the contrary, our approach is stateless in the sense that each synchronous component is translated into a Mealy function without internal state; this allows a smoother integration within any asynchronous process calculus that has types and functions, even if it does not possess an "atomic" statement, which is the case for most process calculi.

We illustrate our approach with an industrial case study provided by Airbus in the context of the Topcased project (www.topcased.org): a ground-plane communication protocol consisting of two Tftp (Trivial File Transfer Protocol) entities that execute asynchronously and communicate using unreliable Udp (User Datagram Protocol) channels. For the synchronous language, we will consider Sam [8], a simple synchronous language (similar to Argos [24]) that was designed by Airbus and that is being used within this company. Software tools for Sam are available within the Topcased open-source platform based on Eclipse. For the process calculus, we will consider Lotos NT [7], a simplified version of the international standard E-Lotos [22]. A translator exists that transforms Lotos NT specifications into Lotos specifications, thus enabling the use of the Cadp toolbox [13] for verification and performance evaluation of the generated Lotos specifications.

This paper is organised as follows. Section 2 presents the main ideas of our approach for analysing systems combining synchrony and asynchrony. Section 3 introduces the Tftp industrial case study. Section 4 gives insights into the formal modelling of Tftp using our approach. Section 5 reports on state space exploration and model checking verification of Tftp models. Section 6 addresses performance evaluation of Tftp models by means of simulation. Finally, Section 7 gives concluding remarks and discusses future work.

2 Proposed Methodology

This section explains how to make the connection between synchronous languages and process calculi. It takes the Sam and Lotos NT languages as particular examples, but the principles of our approach are more general.

2.1 Synchronous Programs Seen as Mealy Functions

A synchronous program is the synchronous parallel composition of one or several synchronous components. A synchronous component performs a sequence of discrete steps and maintains an internal state s. At each step, it receives a set of m input values i1, ..., im from its environment, computes (in zero time) a reaction, sends a set of n output values o1, ..., on to its environment, and moves to its new state s′. That is to say, it can be represented by a (usually deterministic) Mealy machine [27], i.e., a 5-tuple (S, s0, I, O, f) where:

– S is a finite set of states,
– s0 is the initial state,
– I is a finite input alphabet,
– O is a finite output alphabet,
– f ∈ S × I → S × O is a transition function (also called a Mealy function) mapping the current state and the input alphabet to the next state and the output alphabet: f(s, i1...im) = (s′, o1...on).

When a synchronous program has several components, these components can communicate with each other by connecting the outputs of some components to the inputs of some other components. By definition of synchronous parallelism, at each step, all the components react simultaneously. Consequently, the composition of several components can also be modelled by a Mealy machine. For the synchronous languages Esterel and Lustre, a common format named OC (Object Code) has been proposed to represent those Mealy machines.

2.2 The SAM Language

To illustrate our approach, we consider the case of the synchronous language Sam designed by Airbus, a formal description of which is given in [8]. A synchronous component in Sam is an automaton that has a set of input and output ports, each port corresponding to a boolean variable. A Sam component is very similar to a Mealy machine. The main difference lies in the fact that a transition in Sam is a 5-tuple (s1, s2, F, G, P), where:

– s1 is the source state of the transition,
– s2 is the destination state of the transition,
– F is a boolean condition on the input variables (the transition can be fired only when F evaluates to true),
– G is a set of output variables (when the transition is fired, the variables of G are set to true and the other output variables are set to false), and
– P is a priority integer value.

The priority values from transitions going out of the same state must be pairwise distinct. If a set of input values enables more than one outgoing transition from the current state, the transition with the lowest priority value is chosen, thus ensuring a deterministic execution. Priority values are notational conveniences that can be eliminated as follows: each transition (s1, s2, F, G, P) can be replaced


by (s1 , s2 , F ′ , G) where F ′ = F ∧ ¬(F1 ∨ . . . ∨ Fn ) such that F1 , . . . , Fn are the conditions attached to the outgoing transitions of state s1 with priority values strictly lower than P . Each state has an implicit loop transition on itself that sets all the output ports to false and is fired if no other transition is enabled (its priority value is +∞). Fig. 1 gives an example of a Sam automaton. An interrogation mark precedes the condition F of each transition while an exclamation mark precedes its output variables list G. Priority values are attached to the source of the transitions. Sam supports the synchronous composition of components. A global system in Sam has input and output ports. It is composed of one or several Sam components. Communication between these components is expressed by drawing connections between input and output ports, with the following rules: – inputs of the system can connect to outputs of the system or inputs of automata, – outputs of automata can connect to inputs of other automata or outputs of the system, – cyclic dependencies are forbidden. 2.3

2.3 Translating SAM into LOTOS NT

In this section, we illustrate how a Sam automaton can be represented by its Mealy function encoded in Lotos NT. For instance, the Sam automaton of Fig. 1 can be encoded in Lotos NT as follows:

   type State is
      S0, S1, S2   -- this is an enumerated type
   end type

   function Transition (in CurrentState:State, in A:Bool, in B:Bool,
                        out NextState:State, out C:Bool, out D:Bool) is
      NextState := CurrentState;
      C := false;
      D := false;
      case CurrentState in
         S0 -> if A then
                  NextState := S1; D := true
               end if
      |  S1 -> if A and B then
                  NextState := S0; C := true; D := true
               elsif B then
                  NextState := S2; C := true
               end if
      |  S2 -> if A and not (B) then
                  NextState := S0; C := true
               elsif B then
                  NextState := S0; D := true
               end if
      end case
   end function

We chose Lotos NT rather than Lotos because Lotos NT functions are easier to use than Lotos equations for describing Mealy functions and manipulating data in general. The imperative style of Lotos NT makes this straightforward. Using Lotos algebraic data types would have been more difficult, given that Lotos functions do not have "out" parameters. In this respect, Lotos NT is clearly superior to Lotos and other "traditional" value-passing process algebras; this contributes to the originality and elegance of our translation. Also, the fact that Lotos NT functions execute atomically (i.e., they do not create "small step" transitions) perfectly matches the assumption that a synchronous program reacts in zero time.

A Sam system consisting of several Sam automata can also be translated to Lotos NT easily. Because cyclic dependencies are forbidden, one can find a topological order for the dependencies between automata. Thus, a Sam system can be encoded in Lotos NT as a sequential composition of the Mealy functions of its individual Sam automata (a sketch is given at the end of this section).

An alternative approach to translating a synchronous language L into Lotos NT, if there exists a code generator from L to the C language, would be to invoke the Mealy function (presumably generated in C code) directly from a Lotos NT program as an external function (a feature that is supported by Lotos NT). This way, our approach could even allow mixing of components written in different synchronous languages.
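To make this sequential composition concrete, here is a minimal Lotos NT sketch; the component names M1 and M2, their state types, and the connection of an output X of M1 to an input of M2 are hypothetical, introduced only for illustration:

   function SystemTransition (in S1:StateM1, in S2:StateM2, in I:Bool,
                              out N1:StateM1, out N2:StateM2, out O:Bool) is
      var X:Bool in
         -- call the Mealy functions following the topological order:
         TransitionM1 (S1, I, ?N1, ?X);   -- first component
         TransitionM2 (S2, X, ?N2, ?O)    -- consumes the output X of M1
      end var
   end function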

2.4 Wrapping Mealy Functions into LOTOS NT Processes

In contrast with synchronous programs, components of asynchronous programs run concurrently, at their own pace, and synchronise with each other through communications using gates or channels.

Fig. 2. A wrapper process in detail (values V1...Vn extracted from the input message are processed into the inputs I1...In of the Mealy function; its outputs O1...On are processed into values V'1...V'n used to assemble the output message; the next state, together with values saved to be reused at the next iteration, is fed back as the current state)

Our approach to modelling Gals systems in asynchronous languages consists in encoding a synchronous program as a set of native types and functions in a given process calculus. But the Mealy function of a synchronous program alone cannot interact directly with an asynchronous environment. It needs to be wrapped (or encapsulated ) in a process that handles the communications with the environment. This wrapper transforms the Mealy function of a synchronous component into an Lts (Labelled Transition System). In our case, the Mealy function is a Lotos NT function and the wrapper is a Lotos NT process. The amount of processing a wrapper can do depends on the Gals system being modelled. Fig. 2 shows the basic processing usually done within a wrapper: extraction of the inputs, aggregation of the outputs, and storage of values for the next iteration. In certain cases, the wrapper can also implement extra behaviours not actually described by the Mealy function itself. Once encapsulated in a wrapper process, the Mealy function corresponding to a synchronous program can be made to synchronise and communicate with other asynchronous processes using the parallel composition operator of Lotos NT.
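As a minimal sketch (assuming the Transition function of Section 2.3 and a simplistic message format where the boolean inputs and outputs are carried directly on the gates; real wrappers, such as those of Section 4, perform more elaborate extraction, assembling, and storage):

   process WRAPPER [RECEIVE:any, SEND:any] is
      var S:State, A:Bool, B:Bool, C:Bool, D:Bool in
         S := S0;                              -- initial state of the Mealy machine
         loop
            RECEIVE (?A, ?B);                  -- input message from the environment
            Transition (S, A, B, ?S, ?C, ?D);  -- one atomic synchronous reaction
            SEND (C, D)                        -- output message to the environment
         end loop
      end var
   end process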

3 The TFTP Case Study

This case study was provided by Airbus to the participants of the Topcased project as a typical example of avionics embedded software. We first present a summary of the principles of the standard Tftp protocol, then we describe the adaptation of Tftp made by Airbus for plane/ground communications.

3.1 The Standard TFTP Protocol

Tftp [33] stands for Trivial File Transfer Protocol. It is a client/server protocol in which several clients can send (resp. receive) a file to (resp. from) one server. As it is designed to run over the Udp (User Datagram Protocol) protocol, the Tftp protocol implements its own flow control mechanism. In order for the server to differentiate between clients, each file transfer is served on a different Udp port.

In a typical session, a client initiates a transfer by sending a request to the server: RRQ (Read ReQuest) for reading a file or WRQ (Write ReQuest) for writing (i.e., sending) a file. The files are divided into data fragments of equal size (except the last fragment, whose size may be smaller), which are transferred sequentially. The server replies to an RRQ by sending in sequence the various data fragments (DATA) of the file, and to a WRQ by sending an acknowledgement (ACK). When the client receives this acknowledgement, it starts sending the data fragments of the file. Each data fragment contains an order index, which is used to check whether all data fragments are received consecutively. Each acknowledgement also carries the order index of the data fragment it acknowledges, or zero if it acknowledges a WRQ. A transfer ends when the acknowledgement of the last data fragment is received.

The protocol is designed to be robust. Any lost message (RRQ, WRQ, DATA, ACK) can be retransmitted after a timeout. Duplicate acknowledgements (resent because of a timeout) are discarded upon receipt to avoid the Sorcerer's Apprentice bug [5]. The Tftp standard suggests the use of dallying, i.e., waiting for a while after sending the final acknowledgement in case this acknowledgement is lost before reaching the other side (which will eventually resend its final data fragment after a timeout). If an error (memory shortage, fatal error, etc.) occurs, the client or the server sends an error message (ERROR) to abort the transfer.
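As an illustration, a typical write session could unfold as follows (a hypothetical file of two fragments):

   client → server : WRQ
   server → client : ACK 0
   client → server : DATA 1 (first fragment)
   server → client : ACK 1
   client → server : DATA 2 (last fragment)
   server → client : ACK 2 (final acknowledgement; the server then dallies in case this ACK is lost)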

3.2 The Airbus Variant of the TFTP Protocol

When a plane reaches its final parking position, it is connected to the airport using an Ethernet network. The ground/plane communication protocol currently in use is very simple and certified to be correct. Airbus asked us to study a more complex protocol, a variant of the Tftp, which might be of interest for future generations of planes. The main differences with the standard Tftp are the following:

– In the protocol stack considered by Airbus, this Tftp variant still runs above the Udp layer but below an avionic communication protocol layer (e.g., Arinc 615a). The files carried by the Tftp variant are frames of the upper layer protocol.
– Each side of the Tftp variant has the ability to be both a client and a server, depending on the upper layer requests.
– Each server communicates with one single client, because there is a unique Tftp instance reserved for each plane that lands in the airport. This removes the need for modelling the fact that a server can serve many different clients on as many different Udp ports.


In the rest of this paper, we will use the name Tftp to refer to this protocol variant studied by Airbus. The behaviour of a Tftp protocol entity was specified by Airbus as a Sam system consisting of one Sam automaton with 7 states, 39 transitions, 15 inputs and 11 outputs. Airbus was interested in knowing how this Tftp variant would behave in an unreliable environment, in which messages sent over the Udp layer could be lost, duplicated, or reordered.

4 Formal Specification of the Case Study

We have modelled a specification consisting of two Tftp protocol entities connected by two Udp media. As shown in Fig. 3, the Tftp protocol entities are two instances of the same Lotos NT process, whose behaviour is governed by the Mealy function of the Sam Tftp automaton. We manually translated this function into 215 lines of Lotos NT code (including the enumerated type encoding the states of the Sam automaton). The media are also two instances of the same Lotos NT process that models the behaviour of Udp.

We have defined two versions of the Lotos NT wrapper process encapsulating the Tftp Mealy function. The basic Tftp process is the simplest one; it is modelled after Airbus recommendations to connect two Tftp Sam automata head-to-tail in an asynchronous environment. The accurate Tftp process is more involved: it is closer to the standard Tftp protocol and copes with limitations that we detected in the basic Tftp process.

4.1 Modelling the Basic TFTP Entities

The basic Tftp process, as shown by Fig. 4, is a simple wrapper (260 lines of Lotos NT) around the Mealy function and does no processing on its own. The idea behind this wrapper is to asynchronously connect output ports of one Tftp automaton to corresponding input ports of the other side. Inputs of the Mealy function that can neither be deduced from the input message nor from values stored at the previous iteration are assigned a random boolean value.

Fig. 3. Asynchronous connection of two TFTP processes via UDP media (TFTP wrapper instances A and B, each embedding the TFTP transition function, are connected through two UDP medium instances acting as asynchronous communication channels, via the gates SEND_A, RECEIVE_B and SEND_B, RECEIVE_A)


Fig. 4. Basic TFTP process (each incoming message of type RRQ, WRQ, DATA, ACK, or ERROR is decoded into inputs of the TFTP transition function such as receive_RRQ, receive_WRQ, receive_ERROR, internal_error, eof, apply_WRQ, max_retries_reached, and timeout, some of which are drawn from a RANDOM source; outputs such as send_ACK, resend_ACK, arm_timer, and stop_timer are assembled into the outgoing message, and next_state is fed back as current_state)

A key issue with this design is how to determine if two successive data fragments are different, or if they are the same fragment sent twice. For this purpose, the Sam automaton has different input ports (receive_DATA and receive_old_DATA) and different output ports (send_DATA and resend_DATA). However, the basic Tftp wrapper is just too simple to interface with these ports in a satisfactory manner. For this reason, we had to refine this wrapper as explained in the next section.

4.2 Modelling the Accurate TFTP Entities

We developed a more accurate Tftp wrapper process (670 lines of Lotos NT) that receives and sends "real" Tftp frames (as defined in the Tftp standard). In our model, we assume the existence of a finite set of files (each represented by its file name, which we encode as an integer value) in which each Tftp process can pick files to write to or read from the other side. Each RRQ and WRQ frame carries the name of the requested file. The contents of each file are modelled as a sequence of fragments, each fragment being represented as a character. Each DATA frame carries three values: a file fragment, an order index for the fragment, and a boolean value indicating whether this is the last fragment of the file. Each ACK frame carries the order index of the DATA frame it acknowledges, or zero if it acknowledges a WRQ.

In order to fight state explosion in the later phases, we restrict nondeterminism by constraining each Tftp process to select only those files belonging to a "read list" and a "write list". Whenever there is no active transfer, a process can randomly choose to send an RRQ request for the first file in its read list or a WRQ request for the first file in its write list. Besides the state of the automaton, additional values must be kept in memory between two subsequent calls to the Mealy function, for instance the name of the file being transferred, the index value of the last data fragment or acknowledgement received or sent, a boolean indicating whether the data fragment received is the last one, etc.
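A minimal sketch of how such frames could be declared in Lotos NT (the constructor and field names are ours and may differ from those of the actual 670-line wrapper):

   type Frame is
      RRQ (file: Nat)                              -- read request, carrying a file name
    | WRQ (file: Nat)                              -- write request
    | DATA (frag: Char, index: Nat, last: Bool)    -- fragment, order index, last-fragment flag
    | ACK (index: Nat)                             -- index of the DATA acknowledged, or 0 for a WRQ
    | ERROR
   end type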

4.3 Modelling the UDP Media

The two Lotos NT processes describing the Udp media are not derived from a Sam specification: they have been written by hand. These processes should reproduce accurately the behaviour of a Udp layer over an Ethernet cable connecting the plane and the ground. As Udp is a connection-less protocol without error recovery mechanisms, any error that is not detected and corrected by the lower networking layers will be propagated to the upper layers (i.e., Tftp in our case). These errors are: message losses, message reordering, and message duplications. Message losses are always possible, due to communication failures. Reordering of messages should be limited in practice (as modern routers use load-balancing policies that usually send all related packets through the same route), but we cannot totally exclude this possibility. Message duplications may only occur if the implementation of the lower networking layers is erroneous, so we can discard this possibility.

We chose to model the medium in two different ways, using two different Lotos NT processes. Both processes allow messages to be lost and have a buffer of fixed size in which the messages are stored upon reception, waiting for delivery. The first process models the case where message reordering does not happen: it uses a Fifo as a buffer, so messages are delivered in the same order as they are received. The second process models the case where message reordering can happen: it uses a bag as a buffer. We denote by FIFO(n) (resp. BAG(n)) a medium with a Fifo (resp. bag) buffer of size n. The Lotos NT processes for the Fifo medium and the bag medium are respectively 24 and 27 lines long.
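The following Lotos NT sketch gives the idea of the Fifo version; it is our own reconstruction, not the actual 24-line process, and it assumes a type FrameList declared as a list of frames equipped with the usual nil, head, tail, length, and append operations, with the buffer size fixed to 2:

   process MEDIUM [INPUT:any, OUTPUT:any] is
      var B:FrameList, F:Frame in
         B := nil;
         loop
            select
               -- accept a new message when the buffer is not full...
               only if length (B) < 2 then
                  INPUT (?F);
                  select
                     B := append (F, B)      -- ...and either store it,
                  []
                     null                    -- ...or lose it
                  end select
               end if
            []
               -- deliver the oldest message stored in the buffer
               only if B != nil then
                  OUTPUT (head (B));
                  B := tail (B)
               end if
            end select
         end loop
      end var
   end process

The bag version differs only in that an arbitrary buffered message, not necessarily the oldest one, is chosen for delivery.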

4.4 Interconnecting TFTP Entities and UDP Media

To compose the Tftp protocol entities and the Udp media asynchronously as illustrated in Fig. 3, we use the parallel operator of Lotos NT:

   par
      RECEIVE_A, SEND_A -> TFTP_WRAPPER [RECEIVE_A, SEND_A]
   || RECEIVE_B, SEND_B -> TFTP_WRAPPER [RECEIVE_B, SEND_B]
   || SEND_A, RECEIVE_B -> UDP_MEDIUM [SEND_A, RECEIVE_B]
   || SEND_B, RECEIVE_A -> UDP_MEDIUM [SEND_B, RECEIVE_A]
   end par

As we have two different Tftp processes and two different medium processes, we obtain four specifications: basic Tftp specification with bag media, basic Tftp specification with Fifo media, accurate Tftp specification with bag media, and accurate Tftp specification with Fifo media.

5 Functional Verification by Model Checking

In this section, we detail how to generate the state spaces for the specifications and how to define correctness properties characterising the proper behaviour of these specifications. Then, we discuss the model checking results obtained using Cadp.


5.1 State Space Generation

Lotos NT specifications are automatically translated into Lotos specifications (using the Lpp/Lnt2Lotos [7] compilers), which are, in turn, compiled into Ltss (Labelled Transition Systems) using the Cæsar.adt [14] and Cæsar [10] compilers of Cadp.

One important issue in model checking is the state space explosion problem. Because of this, we restrict the buffer size n of the Udp media processes to small values (e.g., n = 1, 2, 3, ...). In the case of the accurate Tftp, we also limit the size of each file to two fragments (this is enough to exercise all the transitions of the Sam automaton) and we constrain the number of files exchanged between the two Tftp protocol entities by bounding the lengths of the read and write lists. To cover all the possibilities, we consider four scenarios:

– Scenario 1: Tftp entity A writes one file;
– Scenario 2: Tftp entities A and B both write one file;
– Scenario 3: Tftp entity A writes one file and B reads one;
– Scenario 4: Tftp entities A and B both read one file.

Additionally, we make use of the compositional verification tools available in Cadp to fight state explosion. Compositional verification is a divide-and-conquer approach that allows significant reductions in time, memory, and state space size. Applied to the Tftp case study, this approach consists in generating the Ltss for all four processes (two Tftp processes and two media processes), minimising these Ltss according to strong bisimulation (using the Bcg Min tool of Cadp), and composing them progressively in parallel (using the Exp.Open and Generator tools of Cadp) by adding one Lts at a time. For instance, on the example of the basic Tftp specification with two BAG(2) media, it took 7 minutes and 56 seconds on a 32-bit machine (2.17 GHz Intel Core 2 Duo processor running Linux with 3 GB of RAM) to directly generate the corresponding Lts, which has 2,731,505 states and 11,495,662 transitions. Using compositional verification instead, it only takes 13.9 seconds to generate, on the same machine, a strongly equivalent but smaller Lts with only 542,078 states and 2,543,930 transitions.

Practically, compositional verification is made simple by the Svl [12] script language of Cadp. Svl lets the user write compositional verification scenarios at a high level of abstraction and takes care of all low-level tasks, such as invoking the Cadp tools with appropriate command-line options, managing all temporary files, etc.

Tables 1 and 2 illustrate the influence of the buffer size on the state spaces of the basic and accurate Tftp specifications, respectively. In these tables, the hyphen symbol ("−") indicates the occurrence of state space explosion.

5.2 Temporal Logic Properties

After a careful analysis of the standard Tftp protocol and discussions with Airbus engineers, we specified a collection of properties that the Tftp specification should verify.


Table 1. Lts generation for the basic Tftp

             Minimised Medium Lts     Entire Specification         Generation
   Medium    States   Transitions     States        Transitions    Time (s)
   BAG(1)        13            60         20,166         86,248        10.49
   BAG(2)        70           294        542,078      2,543,930        13.90
   BAG(3)       252         1,008      6,698,999     32,868,774        54.89
   BAG(4)       714         2,772              −              −            −
   FIFO(1)       13            60         20,166         86,248         9.95
   FIFO(2)       85           384        846,888      3,717,754        15.13
   FIFO(3)      517         2,328     31,201,792    137,500,212       200.32
   FIFO(4)    3,109        13,992              −              −            −

Table 2. Lts generation for the accurate Tftp (scenario 1)

             Minimised Medium Lts     Entire Specification         Generation
   Medium    States   Transitions     States        Transitions    Time (s)
   BAG(1)        31           260         71,974        319,232        20.04
   BAG(2)       231         1,695        985,714      4,683,197        27.44
   BAG(3)     1,166         7,810      6,334,954     31,272,413        78.28
   BAG(4)     4,576        28,655              −              −            −
   FIFO(1)       31           260         71,974        319,232        20.29
   FIFO(2)      321         2,640      1,195,646      5,373,528        29.26
   FIFO(3)    3,221        26,440     18,885,756     85,256,824       174.15
   FIFO(4)   32,221       264,440              −              −            −

These properties were first expressed in natural language and then translated into temporal logic formulas.

For the basic TFTP specification, we wrote a first collection of 12 properties using the modal µ-calculus (extended with regular expressions as proposed in [25]). These properties were evaluated using the Evaluator 3.5 model checker of Cadp. We illustrate two of them here:

– The Tftp automaton has two output ports arm_timer and stop_timer that respectively start and stop the timer used to decide when an incoming message should be considered as lost. The following property ensures that between two consecutive stop_timer actions, there must be an arm_timer action. It states that there exists no sequence of transitions containing two stop_timer actions with no arm_timer action in between. The suffix "_A" at the end of transition labels indicates that this formula holds for Tftp protocol entity A; there is a similar formula for entity B.

   [ true* . "STOP_TIMER_A" . not ("ARM_TIMER_A")* . "STOP_TIMER_A" ] false

– When a Tftp protocol entity receives an error, it must abort the current transfer. The following property ensures that receiving an error cannot be followed by sending an error. It states that there exists no sequence of transitions featuring the reception of an error directly followed by the sending of an error:

   [ true* . "RECEIVE_A !ERROR" . "SEND_A !ERROR" ] false

For the accurate TFTP specification, the collection of 12 properties we wrote for the basic Tftp specification can be reused without any modification, still using Evaluator 3.5 to evaluate them. We also wrote a second collection of 17 new properties that manipulate data in order to capture the messages exchanged between the Tftp protocol entities. These properties could have been written using the standard µ-calculus, but they would have been too verbose. Instead, we used the Mcl language [26], which extends the modal µ-calculus with data manipulation constructs. Properties written in the Mcl language can be evaluated using the Evaluator 4.0 [26] model checker of Cadp. We illustrate two of these new properties below:

– Data fragments must be sent in proper order. We chose to ensure this by showing that any data fragment numbered x cannot be followed by a data fragment numbered y, where y < x, unless there has been a re-initialisation (transfer succeeded or aborted) in between. This property is encoded as follows:

   [ true* . {SEND_A !"DATA" ?X:Nat ...} .
     not (REINIT_A)* .
     {SEND_A !"DATA" ?Y:Nat ... where Y < X} ] false

– Resent write requests must be replied to, within the limits set by the value of the maximum number of retries. The following formula states that for every write request received and accepted, it is possible to send the acknowledgement more than once, each time (within the limit of MAX_RETRIES_A) the write request is received, where the r {p} notation means that the regular formula r must be repeated p times:

   [ not {RECEIVE_A !"WRQ" ...}* . {RECEIVE_A !"WRQ" ?n:Nat} .
     i . {SEND_A !"ACK" !0 of Nat} ]
   forall p:Nat among {1 ... MAX_RETRIES_A ()} .
     < ( not (REINIT_A or {RECEIVE_A !"WRQ" !n})* .
         {RECEIVE_A !"WRQ" !n} .
         {SEND_A !"ACK" !0 of Nat} ) {p} > true

5.3 Model Checking Results

Using the Evaluator 3.5 model checker, we evaluated all properties of the first collection on all the Ltss generated for the basic and accurate Tftp specifications. Using the Evaluator 4.0 model checker, we did the same for all properties in the second collection on all the Ltss generated for the accurate Tftp specifications.

Several of the first collection of 12 properties did not hold on either the basic or the accurate Tftp specifications. This enabled us to find 11 errors in the Tftp automaton. Of the two properties presented in Section 5.2 for the first collection, the first held while the second did not. The verification of the second collection of 17 properties specially written for the accurate Tftp specifications led to the discovery of 8 additional errors. Of the two properties presented in Section 5.2 for the second collection, the first held while the second did not. For both the basic and accurate Tftp specifications, we observed that the truth values of all these formulas did not depend on the sizes of bags or Fifos.

Notice that, because Evaluator 3.5 and 4.0 can work on the fly, we could have applied them directly to the Lotos specifications generated for the Tftp instead of generating the Ltss first. Although this might have enabled us to handle larger state spaces, we did not choose this approach, as we felt that further increasing the bag and Fifo sizes would not lead to different results. Regarding the amount of time needed to evaluate formulas, we observed that it takes on average 35 seconds per formula on an Lts having 3.4 million states and 19.2 million transitions (basic Tftp specification) and 6.5 minutes per formula on an Lts having 18.2 million states and 88 million transitions (accurate Tftp specification).

In total, we found 19 errors, which were reported to Airbus and acknowledged as actual errors in the Tftp variant. We also suggested changes in the Tftp automaton to correct them. As stated in Section 3.2, it is worth noticing that these errors only concern a prototype variant of Tftp, and not the communication protocols actually embedded in planes and airports. While some of these errors could have been found by a human after a careful study of the automaton, some others are more subtle and would have been hard to detect just by looking at the Tftp automaton: for instance, the fact that if both Tftp entities send a request (RRQ or WRQ) at the same time, both requests are simply ignored.

6 Performance Evaluation by Simulation

In spite of the errors we detected, the Tftp automaton can always recover with timeouts, i.e., by waiting long enough that the timer expires. However, these extra timeouts and additional messages cause a performance degradation that needed to be quantified. There are several approaches to performance evaluation, namely queueing theory, Markov chains (the Cadp toolbox provides tools for Interactive Markov Chains [11]), and simulation methods. For the Tftp case study, we chose the latter approach.

6.1 Simulation Methodology with CADP

To quantify the performance loss caused by the errors, an "optimal" model was needed to serve as a reference. For this purpose, we wrote a Tftp Mealy function in which all the errors have been corrected. We also produced, for each error e, a Tftp Mealy function in which all the errors but e had been corrected, so as to measure the individual impact of e on the global performance.

State space explosion does not occur with simulation. This allowed us to increase the complexity of our models:

– The number of files exchanged was set to 10,000. Before each simulation, these files are randomly distributed in the read and write lists of the Tftp.
– The file size was increased to be between 4 and 10 fragments. File fragments are assumed to be 32 kB each. File contents are randomly generated before each simulation. A simulation stops when all the files in the read and write lists have been transferred.
– We used bag Udp media with a buffer size of 6.

We considered two simulation scenarios:

1. One Tftp protocol entity acts as a server and initiates no transfer. The other acts as a client that reads files from and writes files to the server. This is a realistic model of actual ground/plane communications.
2. Both Tftp protocol entities can read and write files. This is a worst-case scenario in which the Tftp protocol entities compete to start file transfers. This can happen under heavy load, and Airbus engineers recognised it ought to be considered.

To perform the simulations, we adapted the Executor tool of Cadp, which can explore random traces in LOTOS specifications on the fly. By default, in Executor, all transitions going out of the current state have the same probability of being fired. To obtain more realistic simulation traces, we modified Executor (whose source code is available in Cadp) to assign different probabilities to certain transitions. Namely, we gave to timeouts and message losses (resp. to internal errors) a probability that is 100 (resp. 10,000) times smaller than the probability of all other transitions. In the bag Udp media, older messages waiting in the buffers were given a higher chance than newer messages to be chosen for delivery.

To each transition, we also associated an estimated execution time, computed as follows:

– The Udp media are assumed to have a speed of 1 MB/s and a latency of 8 ms.
– Receiving or sending an RRQ, a WRQ, or an ACK takes 2 ms (one fourth of the latency).
– Receiving or sending a DATA takes 18 ms: 2 ms from the medium latency plus half the time required to send 32 kB at 1 MB/s (sending 32 kB at 1 MB/s takes 32 ms, half of which is 16 ms).
– For the timeout values, we tried 20 different values in total, ranging from 50 ms to 1 second, varying by steps of 50 ms.
– All other transitions have an estimated execution time of 0 ms.

For each error e, for both simulation scenarios, and for each timeout value, we ran ten simulations on the TFTP specification in which all errors but e had been corrected. We then analysed each trace produced by these simulations to compute:

– its execution time, i.e., the sum of the estimated execution times of all the transitions present in the trace,
– the number of bytes transferred during the simulation, which is obtained by multiplying the fragment size (32 kB) by the number of file fragments initially present in the read and write lists.

Dividing the latter by the former gives a transfer speed, the mean value of which can be computed over the set of simulations.

6.2 Simulation Results

For the simulation scenario 1, we observed (see Fig. 5) that the Tftp specification in which all the errors have been corrected performs 10% faster than the original Tftp specification containing all the 19 errors.

Fig. 5. Simulation results for scenario 1 (transfer speed, in kB/s, as a function of the timeout value, for two configurations: all errors corrected and no error corrected)

For the simulation scenario 2, the original Tftp specification has a transfer speed close to zero, whatever the timeout value chosen. This confirms our initial intuition that the errors we detected prevent the Tftp prototype from performing correctly under heavy load (this intuition was at the source of our performance evaluation study for the Tftp). After all errors have been corrected, the numerical results obtained for scenario 2 are the same as for the simulation scenario 1.

We observed that certain errors play a major role in degrading the transfer speed. For instance (see Fig. 6), this is the case with errors 13a (resp. 13c), which are characterised by the fact that the Tftp automaton, after sending the last acknowledgement and entering the dallying phase, ignores incoming read (resp. write) requests, whereas it should either accept or reject them explicitly.

Fig. 6. Simulation results for scenario 2 (transfer speed, in kB/s, as a function of the timeout value, for four configurations: all errors corrected, no error corrected, all errors corrected except error 13a, and all errors corrected except error 13d)

7 Conclusion

In this paper, we have proposed a simple and elegant approach for modelling and analysing systems consisting of synchronous components interacting asynchronously, commonly referred to as Gals (Globally Asynchronous Locally Synchronous) in the hardware design community. Contrary to other approaches that stretch or extend the synchronous paradigm to model asynchrony, our approach preserves the genuine semantics of synchronous languages, as well as the well-known semantics of asynchronous process calculi. It allows us to reuse without any modification the existing compilers for synchronous languages, together with the existing compilers and verification tools for process calculi. We demonstrated the feasibility of our approach on an industrial case study, the Tftp/Udp protocol for which we successfully performed model checking verification and performance evaluation using the Topcased and Cadp software tools. Although this case study was based on the Sam synchronous language and the Lotos/Lotos NT process calculi, we believe that our approach is general enough to be applicable to any synchronous language whose compiler can translate (sets of) synchronous components into Mealy machines — which is almost always the case — and to any process calculus equipped with asynchronous concurrency and user-defined functions. Regarding future work, we received strong support from Airbus. Work has already been undertaken to automate the translation from Sam to Lotos NT and to verify another avionics embedded software system. We would also like to compare our simulation results against results from “traditional” simulation tools and to apply our approach to other synchronous languages than Sam.

Acknowledgements

We are grateful to Patrick Farail and Pierre Gaufillet (Airbus) for their continuing support, and to Claude Helmstetter (INRIA/Vasy), Pascal Raymond (CNRS/Verimag), and Robert de Simone (INRIA/Aoste), as well as the anonymous referees, for their insightful comments about this work.


References

1. Benveniste, A., Le Guernic, P., Jacquemot, C.: Synchronous Programming with Events and Relations: The SIGNAL Language and Its Semantics. Sci. Comput. Program. 16(2), 103–149 (1991)
2. Berry, G., Ramesh, S., Shyamasundar, R.K.: Communicating Reactive Processes. In: POPL'93, pp. 85–98. ACM, New York (1993)
3. Berry, G., Gonthier, G.: The Esterel Synchronous Programming Language: Design, Semantics, Implementation. Science of Computer Programming 19(2), 87–152 (1992)
4. Berry, G., Sentovich, E.: Multiclock Esterel. In: Margaria, T., Melham, T.F. (eds.) CHARME 2001. LNCS, vol. 2144, pp. 110–125. Springer, Heidelberg (2001)
5. Braden, R.: Requirements for Internet Hosts - Application and Support. RFC 1123, Internet Engineering Task Force (October 1989)
6. Brookes, S.D., Hoare, C.A.R., Roscoe, A.W.: A Theory of Communicating Sequential Processes. Journal of the ACM 31(3), 560–599 (1984)
7. Champelovier, D., Clerc, X., Garavel, H.: Reference Manual of the LOTOS NT to LOTOS Translator, Version 4G. Internal Report, INRIA/VASY (January 2009)
8. Clerc, X., Garavel, H., Thivolle, D.: Présentation du langage SAM d'Airbus. Internal Report, INRIA/VASY (2008), TOPCASED forge: http://gforge.enseeiht.fr/docman/view.php/33/2745/SAM.pdf
9. Doucet, F., Menarini, M., Krüger, I.H., Gupta, R.K., Talpin, J.-P.: A Verification Approach for GALS Integration of Synchronous Components. Electr. Notes Theor. Comput. Sci. 146(2), 105–131 (2006)
10. Garavel, H.: Compilation et vérification de programmes LOTOS. Thèse de Doctorat, Université Joseph Fourier, Grenoble (November 1989)
11. Garavel, H., Hermanns, H.: On Combining Functional Verification and Performance Evaluation using CADP. In: Eriksson, L.-H., Lindsay, P.A. (eds.) FME 2002. LNCS, vol. 2391, pp. 410–429. Springer, Heidelberg (2002)
12. Garavel, H., Lang, F.: SVL: a Scripting Language for Compositional Verification. In: Kim, M., Chin, B., Kang, S., Lee, D. (eds.) Proceedings of the 21st IFIP WG 6.1 International Conference on Formal Techniques for Networked and Distributed Systems FORTE'2001, Cheju Island, Korea, pp. 377–392. IFIP, Kluwer Academic Publishers, Dordrecht (2001); full version available as INRIA Research Report RR-4223
13. Garavel, H., Lang, F., Mateescu, R., Serwe, W.: CADP 2006: A Toolbox for the Construction and Analysis of Distributed Processes. In: Damm, W., Hermanns, H. (eds.) CAV 2007. LNCS, vol. 4590, pp. 158–163. Springer, Heidelberg (2007)
14. Garavel, H., Turlier, P.: CÆSAR.ADT: un compilateur pour les types abstraits algébriques du langage LOTOS. In: Dssouli, R., Bochmann, G.v. (eds.) Actes du Colloque Francophone pour l'Ingénierie des Protocoles CFIP 1993, Montréal, Canada (1993)
15. Girault, A., Ménier, C.: Automatic Production of Globally Asynchronous Locally Synchronous Systems. In: Sangiovanni-Vincentelli, A.L., Sifakis, J. (eds.) EMSOFT 2002. LNCS, vol. 2491, pp. 266–281. Springer, Heidelberg (2002)
16. Halbwachs, N., Caspi, P., Raymond, P., Pilaud, D.: The Synchronous Dataflow Programming Language LUSTRE. Proceedings of the IEEE 79(9), 1305–1320 (1991)
17. Halbwachs, N.: Synchronous Programming of Reactive Systems. Kluwer Academic, Dordrecht (1993)


18. Halbwachs, N., Baghdadi, S.: Synchronous Modelling of Asynchronous Systems. In: Sangiovanni-Vincentelli, A.L., Sifakis, J. (eds.) EMSOFT 2002. LNCS, vol. 2491, pp. 240–251. Springer, Heidelberg (2002)
19. Halbwachs, N., Mandel, L.: Simulation and Verification of Asynchronous Systems by Means of a Synchronous Model. In: ACSD '06, pp. 3–14. IEEE Computer Society, Washington (2006)
20. Holzmann, G.J.: The Spin Model Checker - Primer and Reference Manual. Addison-Wesley, Reading (2004)
21. ISO/IEC: LOTOS — A Formal Description Technique Based on the Temporal Ordering of Observational Behaviour. International Standard 8807, International Organization for Standardization — Information Processing Systems — Open Systems Interconnection, Genève (September 1989)
22. ISO/IEC: Enhancements to LOTOS (E-LOTOS). International Standard 15437:2001, International Organization for Standardization — Information Technology, Genève (September 2001)
23. Le Guernic, P., Talpin, J.-P., Le Lann, J.-C.: Polychrony for System Design. Journal of Circuits, Systems and Computers, World Scientific, 12 (2003)
24. Maraninchi, F., Rémond, Y.: Argos: an Automaton-Based Synchronous Language. Computer Languages 27(1–3), 61–92 (2001)
25. Mateescu, R., Sighireanu, M.: Efficient On-the-Fly Model-Checking for Regular Alternation-Free Mu-Calculus. Science of Computer Programming 46(3), 255–281 (2003)
26. Mateescu, R., Thivolle, D.: A Model Checking Language for Concurrent Value-Passing Systems. In: Cuellar, J., Maibaum, T., Sere, K. (eds.) FM 2008. LNCS, vol. 5014, pp. 148–164. Springer, Heidelberg (2008)
27. Mealy, G.H.: A Method for Synthesizing Sequential Circuits. Bell System Technical Journal 34(5), 1045–1079 (1955)
28. Milner, R.: Calculi for Synchrony and Asynchrony. Theoretical Computer Science 25, 267–310 (1983)
29. Mousavi, M.R., Le Guernic, P., Talpin, J.-P., Shukla, S.K., Basten, T.: Modeling and Validating Globally Asynchronous Design in Synchronous Frameworks. In: DATE '04, p. 10384. IEEE Computer Society Press, Washington (2004)
30. Potop-Butucaru, D., Caillaud, B.: Correct-by-Construction Asynchronous Implementation of Modular Synchronous Specifications. Fundam. Inf. 78(1), 131–159 (2007)
31. Ramesh, S.: Communicating Reactive State Machines: Design, Model and Implementation. In: IFAC Workshop on Distributed Computer Control Systems (1998)
32. Ramesh, S., Sonalkar, S., D'Silva, V., Chandra, N., Vijayalakshmi, B.: A Toolset for Modelling and Verification of GALS Systems. In: Alur, R., Peled, D.A. (eds.) CAV 2004. LNCS, vol. 3114, pp. 506–509. Springer, Heidelberg (2004)
33. Sollins, K.: The TFTP Protocol (Revision 2). RFC 1350, Internet Engineering Task Force (July 1992)

Experience with Model Checking Linearizability

Martin Vechev, Eran Yahav, and Greta Yorsh

IBM T.J. Watson Research Center

Abstract. Non-blocking concurrent algorithms offer significant performance advantages, but are very difficult to construct and verify. In this paper, we describe our experience in using SPIN to check linearizability of non-blocking concurrent data-structure algorithms that manipulate dynamically allocated memory. In particular, this is the first work that describes a method for checking linearizability with non-fixed linearization points.

1 Introduction

Concurrent data-structure algorithms are becoming increasingly popular as they provide an unequaled mechanism for achieving high performance on multi-core hardware. Typically, to achieve high performance, these algorithms use fine-grained synchronization techniques. This leads to complex interaction between processes that concurrently execute operations on the data structure. Such interaction presents serious challenges both for the construction of an algorithm and for its verification.

Linearizability [11] is a widely accepted correctness criterion for implementations of concurrent data-structures. It guarantees that a concurrent data structure appears to the programmer as a sequential data structure. Intuitively, linearizability provides the illusion that any operation performed on a concurrent data structure takes effect instantaneously at some point between its invocation and its response. Such points are commonly referred to as linearization points.

Automatic verification and checking of linearizability (e.g., [7,8,1,23,21,2]) and of related correctness conditions (e.g., [5,4]) is an active area of research. Most of these methods rely on the user to specify linearization points, which typically requires an insight on how the algorithm operates.

Our study of checking linearizability is motivated by our work on systematic construction of concurrent algorithms, and in particular our work on the PARAGLIDER tool. The goal of the PARAGLIDER tool, described in [22], is to assist the programmer in systematic derivation of linearizable fine-grained concurrent data-structure algorithms. PARAGLIDER explores a (huge) space of algorithms derived from a schema that the programmer provides, and checks each of these algorithms for linearizability. Since PARAGLIDER automatically explores a space of thousands of algorithms, the user cannot specify the linearization points for each of the explored algorithms. Further, some of the explored algorithms might not have fixed linearization points (see Section 4). This motivated us to study approaches for checking the algorithms also in the cases when linearization points are not specified, and when linearization points are not fixed. We also consider checking of the algorithms using alternative correctness criteria such as sequential consistency.


While [22] has focused on the derivation process and on the algorithms, this paper focuses on our experience with checking linearizability of the algorithms, and the lessons we have learned from this experience.

1.1 Highly-Concurrent Data-Structure Algorithms

Using PARAGLIDER, we checked a variety of highly-concurrent data-structure algorithms based on linked lists, ranging (with increasing complexity) from lock-free concurrent stacks [20], through concurrent queues and concurrent work-stealing queues [18], to concurrent sets [22]. In this paper, we will focus on concurrent set algorithms, which are the most complex algorithms that we have considered so far. Intuitively, a set implementation requires searching through the underlying structure (for example, correctly inserting an item into a sorted linked list), while queues and stacks only operate on the endpoints of the underlying structure. For example, in a stack implemented as a linked list, push and pop operations involve only the head of the list; in a queue implemented as a linked list, enqueue and dequeue involve only the head and the tail of the list. We believe that our experience with concurrent sets will be useful to anyone trying to check properties of even more complex concurrent algorithms, such as concurrent trees or concurrent hash tables [16], which actually use concurrent sets in their implementation.

1.2 Linearizability and Other Correctness Criteria

The linearizability of a concurrent object (data-structure) is checked with respect to a specification of the desired behavior of the object in a sequential setting. This sequential specification defines a set of permitted sequential executions. Informally, a concurrent object is linearizable if each concurrent execution of operations on the object is equivalent to some permitted sequential execution, in which the real-time order between non-overlapping operations is preserved. The equivalence is based on comparing the arguments of operation invocations, and the results of operations (responses).

Other correctness criteria in the literature, such as sequential consistency [15], also require that a concurrent execution be equivalent to some sequential execution. However, these criteria differ on the requirements on ordering of operations. Sequential consistency requires that operations in the sequential execution appear in an order that is consistent with the order seen at individual threads. Compared to these correctness criteria, linearizability is more intuitive, as it preserves the real-time ordering of non-overlapping operations. In this paper, we focus on checking linearizability, as it is the appropriate condition for the domain of concurrent objects [11]. Our tool can also check operation-level serializability, sequential consistency, and commit-atomicity [8]. In addition, we also checked data-structure invariants (e.g., the list is acyclic and sorted) and other safety properties (e.g., absence of null dereferences and memory leaks).

Checking linearizability is challenging because it requires correlating every concurrent execution with a corresponding permitted sequential execution (linearization). Note that even though there could be many possible linearizations of a concurrent execution, finding a single linearization is enough to declare the concurrent execution correct.


There are two alternative ways to check linearizability: (i) automatic linearization: explore all permutations of a concurrent execution to find a permitted linearization; (ii) linearization points: the linearization point of each operation is a program statement at which the operation appears to take place. When the linearization points of a concurrent object are known, they induce an order between overlapping operations of a concurrent execution. This obviates the need to enumerate all possible permutations for finding a linearization. For simpler algorithms, the linearization point of an operation is usually a statement in the code of the operation. For more complex fine-grained algorithms, such as the running example used in this paper, a linearization point may reside in method(s) other than the executing operation and may depend on the actual concurrent execution. We classify linearization points as either fixed or non-fixed, respectively. This work is the first to describe in detail the challenges and choices that arise when checking linearizability of algorithms with non-fixed linearization points. We use program instrumentation, as explained in Section 4.

1.3 Overview of PARAGLIDER

Fig. 1 shows the high-level structure of PARAGLIDER. Given a sequential specification of the algorithm and a schema, the generator explores all concurrent algorithms represented by the schema. For each algorithm, it invokes the SPIN model checker to check linearizability. The generator performs domain-specific exploration that leverages the relationship between algorithms in the space defined by the schema to reduce the number of algorithms that have to be checked by the model checker.

The Promela model, described in detail in Section 3, consists of the algorithm and a client that non-deterministically invokes the operations of the algorithm. The model records the entire history of the concurrent execution as part of each state. SPIN explores the state space of the algorithm and uses the linearization checker, described in Section 4, to check if the history is linearizable. Essentially, it enumerates all possible linearizations of the history and checks each one against the sequential specification. This method is entirely automatic, requires no user annotations, and is the key for the success of the systematic exploration process. The main shortcoming of this method is that it records the entire history as part of the state, which means no two states are equivalent. Therefore, we limit the length of the history by placing a bound on the number of operations the client can invoke.

PARAGLIDER supports both automatic checking, described above, and checking with linearization points. The latter requires algorithm-specific annotations to be provided by the user, but allows the model checker to explore a larger state space of the algorithm than the first approach. The generator produces a small set of candidate algorithms, which pass the automatic linearizability checker. This process is shown in the top half of Fig. 1. The user can perform a more thorough checking of individual candidate algorithms by providing PARAGLIDER with linearization points for each of them. The linearization is built and checked on-the-fly in SPIN using linearization points. Thus, we no longer need to record the history as part of the state. This process is shown in the bottom half of Fig. 1.


Fig. 1. Overview of PARAGLIDER tool (top half: the generator derives candidate programs from a schema; each program is instrumented with a history into a Promela model, and SPIN together with the lin checker compares concurrent executions against the sequential specification, answering "linearizable? yes/no"; bottom half: a candidate program plus user-supplied linearization points is instrumented for on-the-fly checking and passed to SPIN)

In both methods, the client is a program that invokes non-deterministically selected operations on the concurrent object. However, with the linearization point method, we can check algorithms with each thread executing this client, without any bound on the maximum number of operations (the automatic method requires such a bound).
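A minimal Promela sketch of such a client (our own illustration: the key range is arbitrary, and the inline stubs stand for the operations of the algorithm being checked):

/* stubs standing for the operations of the concurrent set algorithm */
inline add(k)      { skip }
inline remove(k)   { skip }
inline contains(k) { skip }

/* two threads repeatedly invoking non-deterministically chosen operations */
active [2] proctype client() {
  byte k;
  do
  :: if               /* pick a key non-deterministically */
     :: k = 1
     :: k = 2
     :: k = 3
     fi;
     if               /* pick an operation non-deterministically */
     :: add(k)
     :: remove(k)
     :: contains(k)
     fi
  od
}

In the automatic method, an additional counter (not shown) would bound the number of operations each client may invoke.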

1.4 Main Contributions

Here we summarize our experience and insights. We elaborate on them in the rest of the paper.

Garbage collection. Garbage Collection (GC) support in a verification tool is crucial for verifying an increasing number of important concurrent algorithms. Because SPIN does not provide GC support, we implemented GC as part of the input Promela model. We discuss the challenges and choices that we made in this process.

Non-fixed linearization points. For many advanced concurrent algorithms, the linearization point of an operation is not in the code of that operation. Checking such algorithms introduces another set of challenges not present in simpler algorithms such as queues or stacks, typically considered in the literature. We discuss the underlying issues as well as our solution to checking algorithms with non-fixed linearization points.

Choosing Bounds. We discuss how we chose the bounds on the size of the heap in states explored by SPIN, and how this choice is related to the optimistic algorithms we are checking. We discuss different methods for checking linearizability and how each method inherently affects the size of the state space the model checker can explore.

Data structure invariants vs. Linearizability. We discuss our experience in finding algorithms that are linearizable, but do not satisfy structural invariants. This motivates further work on simplifying formal proofs of linearizable algorithms.

Sequential consistency vs. Linearizability. We discuss our experience in finding concurrent data structure algorithms which are sequentially consistent, but not linearizable.


2 Running Example

To illustrate the challenges which arise when checking linearizability of highly-concurrent algorithms, we use the concurrent set algorithm shown in Fig. 2. This algorithm is based on a singly linked list with sentinel nodes Head and Tail. Each node in the list contains three fields: an integer variable key, a pointer variable next, and a boolean variable marked. The list is intended to be maintained in a sorted manner using the key field. The Head node always contains the minimal possible key, and the Tail node always contains the maximal possible key. The keys in these two sentinel nodes are never modified, but are only read and used for comparison. Initially, the set is empty, that is, in the linked list, the next field of Head points to Tail and the next field of Tail points to null. The marked fields of both sentinel nodes are initialized to false.

This algorithm consists of three methods: add, remove and contains. To keep the list sorted, the add method first optimistically searches over the list until it finds the position where the key should be inserted. This search traversal (shown in the LOCATE macro) is performed optimistically, without any locking. If the key is already in the set, then the method returns false. Otherwise, the thread tries to insert the key. However, in between the optimistic traversal and the insertion, the shared invariants may be violated, i.e., the key may have been removed, or the predecessor which should point to the new key may have been removed. In either of these two cases, the algorithm does not perform the insertion and restarts its operation to traverse again from the beginning of the list. Otherwise, the key is inserted and the method returns true.

The operation of the remove method is similar. It iterates over the list, and if it does not find the key it is looking for, it returns false. Otherwise, it checks whether the shared invariants are violated and if they are, it restarts. If they are not violated, it physically removes the node and sets its marked field to true. The marked field and setting it to true are important because they constitute a communication mechanism to tell other threads that this node has been removed, in case they end up with it after the optimistic traversal.

The last method is contains. It simply iterates over the heap without any kind of synchronization, and if it finds the key it is looking for, it returns true. Otherwise, it returns false.

Fig. 2. A set algorithm using a marked bit to mark deleted nodes. A variation of [9] that uses a weaker validation condition.


It is important to note that when add or remove return false, they do not use any kind of synchronization; similarly for the contains method. That is, these methods complete successfully without using any synchronization, even though, as they iterate, the list can be modified significantly by add and remove operations executed by other threads. It is exactly this kind of iteration over the linked list without any synchronization that distinguishes the concurrent set algorithms from concurrent stacks and queues, and makes verification of concurrent sets significantly more involved.

Memory Management. This algorithm requires the presence of a garbage collector (GC). That is, the memory (the nodes of the linked list) is only managed by the garbage collector and not via manual memory management. To understand why this particular algorithm requires a garbage collector, consider the execution of the remove method, right after the node is disconnected from the list (see line 31). It would be incorrect to free the removed node immediately at this point, because another thread may have a reference to this node. For example, a contains method may be iterating over the list optimistically, and just when it is about to read the next field of a node, that node is freed. In this situation, contains would dereference a freed node, a memory error which might cause a system crash.

There are various ways to add manual memory management to concurrent algorithms, such as hazard pointers [17]. However, these techniques complicate the algorithm design even further. Practically, garbage collection has gained wide proliferation via managed languages such as Java, X10, and C#. In addition, as part of their user-level libraries, these languages provide a myriad of concurrent data-structure algorithms relying on GC. Hence, developing techniques to ensure the correctness of highly concurrent algorithms relying on automatic memory management has become increasingly important.

3 Modeling of Algorithms

We construct a Promela model that is sound with respect to the algorithm up to the bound we explore, i.e., for every execution of the algorithm which respects the bound, in any legal environment, there is also an execution in the model. The goal is to construct an accurate Promela model which is as faithful as possible to the algorithm and its environment (e.g., the assumption of a garbage collector). In this section, we explain the main issues we faced when modeling the algorithms.

3.1 Modeling the Heap

The first issue that arises is that our algorithms manipulate dynamically allocated heap memory and linked data structures of an unbounded size. However, the Promela language used by SPIN does not support dynamically allocated memory (e.g., creating new objects, pointer dereference). Our desire was to stay with the latest versions of the SPIN tool, as they are likely to be the most stable and include the latest optimizations, such as partial order reductions. Therefore, we decided not to use variants of SPIN such as dSPIN [6], which support dynamically allocated memory but are not actively maintained. Hence, in order to model dynamically allocated memory, we pre-allocate a global array in the Promela model. Each element of the array is a Promela structure that models a node of a linked data-structure. Thus, pointers are modeled as indices into the array.
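The following minimal Promela fragment illustrates this modeling style; the names and the bound are ours, chosen for illustration, and the rc field anticipates the reference counting collector of Section 3.2:

#define NODES 10   /* hypothetical bound on the number of pre-allocated nodes */
#define NIL   255  /* index value representing the null pointer */

typedef Node {
  byte key;     /* integer key stored in the node */
  byte next;    /* "pointer": an index into memory[], or NIL */
  bool marked;  /* marked bit of the running example */
  byte rc       /* reference count, maintained by the collector */
}

Node memory[NODES];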


3.2 Garbage Collection

As mentioned in Section 2, the algorithms we consider, like many other highly concurrent optimistic algorithms (e.g., [10]), assume garbage collection. Without garbage collection, the algorithms may leak an unbounded amount of memory, while manual memory management is tricky and requires external mechanisms, such as hazard pointers [17]. Unfortunately, SPIN does not provide garbage collection support (in [13], dSPIN was extended with garbage collection, but this extension has not been adopted in SPIN). Hence, we define a garbage collector as part of the Promela model.

Naturally, our first intuitive choice was a simple sequential mark-and-sweep collector that would run as a separate thread and collect memory whenever it is invoked. This approach raises the following issues:

• The collector needs to read the pointers held in the local variables of all other threads. Unfortunately, at the Promela language level, there is no mechanism for one thread to inspect the local variables of another thread. To address this, we could make these variables shared instead of local.
• The collector needs to know the types of these variables, that is, whether they hold pointers or pure integer values (e.g., does a variable denote the key of a node, or a pointer, which is also modeled as an integer?). To address this, we could make the type of each shared variable explicitly known to the collector.
• When should garbage collection run? Making all thread-local variables globally visible, so that the collector process can find them, is not an ideal solution, as it may perturb partial order optimizations. Further, if the collector does not run immediately when an object becomes unreachable, the search may explore a large number of meaningless distinct states: two states may differ only in their unreachable, not-yet-collected objects. This hypothesis was confirmed by our experiments with the algorithm in Fig. 2, where even on machines with 16GB of memory the exploration did not terminate (we tried a variety of SPIN optimization settings).

To address this issue, we concluded that garbage collection should run on every pointer update, effectively leading us to implement a reference counting algorithm. Each node contains an additional field, RC, which is modified on pointer updates. Once the field reaches zero, the object is collected. The collector runs atomically. Once an object is collected, it is important to clear all of its fields, in order to avoid creating distinct states that differ only in the fields of garbage objects. Our reference counting collector does not handle unreachable cycles; this suffices because, in the algorithms we consider (based on singly linked lists), the heap remains acyclic, and acyclicity is checked as part of the structural invariants. Although the size of a single state increases with a reference counting collector, the total number of states became manageable.
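Building on the array sketch above, a pointer-field update with eager reference counting might look as follows. The inline names (set_next, release) and the global scratch variables are illustrative; a real model would keep the scratch variables per process:

    byte old, succ;   /* scratch variables (globals here only for brevity) */

    /* Decrement the count of p; if it drops to zero, clear the node's
       fields (so garbage never distinguishes otherwise-equal states) and
       cascade the decrement down the list. */
    inline release(p) {
      do
      :: p == NULL_PTR -> break
      :: p != NULL_PTR ->
           nodes[p].rc--;
           if
           :: nodes[p].rc == 0 ->
                succ = nodes[p].next;
                nodes[p].key  = 0;
                nodes[p].next = NULL_PTR;
                p = succ                 /* continue with the successor */
           :: else -> break
           fi
      od
    }

    /* Every pointer update in the model is augmented in this style:
       the assignment and all count adjustments happen in one atomic step. */
    inline set_next(src, dst) {
      atomic {
        old = nodes[src].next;
        nodes[src].next = dst;
        if
        :: dst != NULL_PTR -> nodes[dst].rc++
        :: else -> skip
        fi;
        release(old)
      }
    }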


To address the issue of increasing state size, we experimented with various optimization tricks (such as bit-packing all of the object fields). However, in the end we decided against such optimizations, as they made the resulting models quite difficult to debug and, even worse, obscured our understanding of them.

To use this reference counting approach, our models are augmented with the relevant operations on every pointer update statement. This requires careful additional work on behalf of the programmer. It would certainly have saved significant time had the SPIN runtime provided support for dynamically allocated memory and garbage collection. Further, our approach could also benefit from enhancing SPIN with heap and thread symmetry reductions, e.g., [12,19,3], to enable additional partial order reductions. Symmetry reductions are very useful in our setting, as all threads execute the same operations.

3.3 Choosing Bounds on Heap Size

Because our algorithms may use an unbounded amount of memory, we need to build a finite state space in order to apply SPIN. Our models are parameterized on the maximum number of keys in the set, rather than on the maximum number of objects. The maximum number of keys (and threads) determines the number of objects. The reason is that it is difficult to say a priori how many objects the algorithm will need: due to the high concurrency of the algorithm, situations can arise where, for example, two keys require ten objects. The reason for this is not completely intuitive, as the following example shows.

Example 1. Consider a thread executing LOCATE of some operation from Fig. 2, on a set that consists of two keys, 3 and 5. Suppose that the thread gets preempted while it is holding pointers to the two objects with keys 3 and 5, via its thread-local variables pred and curr, respectively. A second thread then successfully executes remove(3) followed by remove(5), removing from the list the objects that the first thread is holding pointers to. Of course, these objects cannot be collected yet, because the first thread still points to them. The second thread then executes add(3) and add(5), successfully inserting new objects with the same keys, while the removed objects are still reachable from the first thread. Thus, with only two keys and two threads, we created a heap with four reachable objects.

Via similar scenarios, one can end up with a surprisingly high number of reachable objects for a very small number of keys. In fact, we were initially surprised and had to debug the model to understand such situations. Moreover, the maximum number of objects can vary across algorithmic variations. Of course, we do not want to pre-allocate more objects than required, as this would increase the memory needed for model checking. Hence, we experimentally determined the maximum number of objects required for a given number of keys: we start with a number K of pre-allocated objects; if the algorithm tries to allocate more than K objects, we trigger an error and stop; we then increase K and repeat the process.
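A possible shape of the top-level harness under these bounds is sketched below. Here add and remove stand for the modeled set operations (not shown), and all names and constants are illustrative assumptions rather than the exact harness text:

    #define NUM_THREADS    2
    #define OPS_PER_THREAD 2

    proctype Client(byte tid) {
      byte k, n = 0;
      do
      :: n < OPS_PER_THREAD ->
           if                   /* nondeterministic key choice */
           :: k = 3
           :: k = 5
           fi;
           if                   /* nondeterministic operation choice */
           :: add(tid, k)
           :: remove(tid, k)
           fi;
           n++
      :: else -> break
      od
    }

    init {
      byte t = 1;
      atomic {
        do
        :: t <= NUM_THREADS -> run Client(t); t++
        :: else -> break
        od
      }
    }

With such a harness, the key range, the thread count, and the operations per thread jointly bound the state space, while the object pool size MAX_NODES is tuned experimentally as described above.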


4 Checking Linearizability

There are two alternative ways to check linearizability: (i) automatic linearization, which explores all permutations of a concurrent execution to find a valid linearization; and (ii) linearization points, which builds a linearization on-the-fly during a concurrent execution, using linearization points provided by the user. While the second approach requires user-provided annotations, it can check much deeper state spaces. In this section, we first review the definition of linearizability, and then describe how both of these approaches are realized in our models.

4.1 Background: Linearizability

Linearizability [11] is defined with respect to a sequential specification (pre/post conditions). A concurrent object is linearizable if each execution of its operations is equivalent to a permitted sequential execution in which the order between non-overlapping operations is preserved. Formally, an operation op is a pair of an invocation event and a response event. An invocation event is a triple (tid, op, args), where tid is the thread identifier, op is the operation identifier, and args are the arguments. Similarly, a response event is a triple (tid, op, val), where tid and op are as defined earlier, and val is the value returned from the operation. For an operation op, we denote its invocation by inv(op) and its response by res(op). A history is a sequence of invocation and response events. A sequential history is one in which each invocation is immediately followed by a matching response. A thread subhistory h|tid is the subsequence of all events in h that have thread identifier tid. Two histories h1 and h2 are equivalent when, for every tid, h1|tid = h2|tid. An operation op1 precedes op2 in h, written op1 <h op2, if res(op1) appears in h before inv(op2).
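One way such a concurrent history can be recorded in a Promela model, for the automatic linearization check, is sketched below: every operation logs an invocation event on entry and a response event on exit, and a checker later searches the recorded history for a valid linearization. All names here (Event, hist, log_event, and the constants) are illustrative:

    #define MAX_EVENTS 16
    #define EV_INV 0
    #define EV_RES 1

    typedef Event {
      byte kind;   /* EV_INV or EV_RES */
      byte tid;    /* thread identifier */
      byte op;     /* operation identifier (e.g., add, remove) */
      byte val     /* argument on invocation, return value on response */
    }

    Event hist[MAX_EVENTS];
    byte hlen = 0;

    /* Append one event to the global history in a single deterministic step. */
    inline log_event(k, t, o, v) {
      d_step {
        assert(hlen < MAX_EVENTS);
        hist[hlen].kind = k;
        hist[hlen].tid  = t;
        hist[hlen].op   = o;
        hist[hlen].val  = v;
        hlen++
      }
    }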

Table 1. Time (sec), executions, and transitions explored by DPOR and SymDpor on each benchmark


Table 2. Analysis of the overhead of dynamic analysis

Benchmark     Threads  Executions        Time (sec)                Dynamic Analysis
                                   Total   Probing  Residual+Bijection  Analysis  Success
pfscan-buggy     3           71        1      0.05        0.01               103        29
pfscan-buggy     4         3148       50      0.35        0.04              2613       230
pfscan           3          136     3.18      0.04        0.02               207        51
pfscan           4         7111    155.3      1.92        0.48             11326      3275
pfscan           5       322695     5147     81.41       20.03           1685733    544816

The results show that probing the local states of threads, computing the residual code of threads, and constructing bijections among local variables together cost only a small fraction (< 2%) of the total checking time. Most of the 15%-40% slowdown per execution comes from the code instrumented into the program to support dynamic analysis.

7 Related Work

There has been a lot of research on automatic symmetry discovery. In solving boolean satisfiability, a typical approach is to convert the problem into a graph and employ a graph symmetry tool to uncover symmetry [18]. Another approach for discovering symmetry is boolean matching [19], which converts the boolean constraints into a canonical form to reveal symmetries. In domains such as microprocessor verification, the graph often has a large number of vertices, but the average number of neighbors of a vertex is usually small. Several algorithms that exploit this fact [20,21] have been proposed to handle such graphs efficiently. A more recent effort [22] significantly reduces discovery time by exploiting sparsity in both the input and the output of the system.

In explicit state model checking, adaptive symmetry reduction [23] has been proposed to dynamically discover symmetry in a concurrent system on the fly. This is close in spirit to our work. [23] introduces the notion of subsumption, under which a state subsumes another if its orbit contains that of the other; subsumption induces a quotient structure with an equivalent set of reachable states. However, [23] does not address the practical problems of discovering symmetries in multithreaded programs to improve the efficiency of dynamic verification. Our algorithm can reveal symmetries in realistic multithreaded programs, and we have demonstrated this with an efficient practical implementation.

In software model checking, state canonicalization has been the primary method for revealing symmetry. Efficient canonicalization functions [24,25,26,27] have been proposed to handle heap symmetry in Java programs, which create objects in a dynamic area. As these algorithms assume that the model checker is capable of capturing the states of concurrent programs, we cannot use them in dynamic verification to reveal symmetries.

In dynamic model checking of concurrent programs, transition symmetry [10] has been the main method for exploiting symmetry at the whole-process level. However, in [10] the user is required to provide a permutation function, which the algorithm then uses to check whether two transitions are symmetric. In practice, it is often difficult to specify such a permutation function manually. By employing dynamic analysis, our approach automates symmetry discovery.


To the best of our knowledge, our algorithm is the first effort to automate symmetry discovery for dynamic model checking.

8 Conclusion and Future Work

We propose a new algorithm that uses dynamic program analysis to discover symmetry in multithreaded programs. The new algorithm can easily be combined with partial order reduction algorithms and can significantly reduce the runtime of dynamic model checking. In future work, we would like to further improve the symmetry discovery algorithm with a more semantics-aware dynamic analysis. Since dynamic analysis can be a helpful technique for testing and verification in many contexts, we are investigating several possibilities in this direction.

References

1. Godefroid, P.: Model Checking for Programming Languages using VeriSoft. In: POPL, pp. 174–186 (1997)
2. Musuvathi, M., Qadeer, S.: Iterative context bounding for systematic testing of multithreaded programs. In: Ferrante, J., McKinley, K.S. (eds.) PLDI, pp. 446–455. ACM, New York (2007)
3. Yang, Y., Chen, X., Gopalakrishnan, G.: Inspect: A Runtime Model Checker for Multithreaded C Programs. Technical Report UUCS-08-004, University of Utah (2008)
4. Flanagan, C., Godefroid, P.: Dynamic Partial-order Reduction for Model Checking Software. In: Palsberg, J., Abadi, M. (eds.) POPL, pp. 110–121. ACM, New York (2005)
5. Yang, Y., Chen, X., Gopalakrishnan, G., Kirby, R.M.: Efficient stateful dynamic partial order reduction. In: Havelund, K., Majumdar, R., Palsberg, J. (eds.) SPIN 2008. LNCS, vol. 5156, pp. 288–305. Springer, Heidelberg (2008)
6. Wang, C., Yang, Y., Gupta, A., Gopalakrishnan, G.: Dynamic model checking with property driven pruning to detect race conditions. In: Cha, S., Choi, J.-Y., Kim, M., Lee, I., Viswanathan, M. (eds.) ATVA 2008. LNCS, vol. 5311, pp. 126–140. Springer, Heidelberg (2008)
7. Clarke, E.M., Enders, R., Filkorn, T., Jha, S.: Exploiting symmetry in temporal logic model checking. Formal Methods in System Design 9(1–2), 77–104 (1996)
8. Emerson, E.A., Sistla, A.P.: Symmetry and model checking. Formal Methods in System Design 9(1–2), 105–131 (1996)
9. Ip, C.N., Dill, D.L.: Better verification through symmetry. Formal Methods in System Design 9(1–2), 41–75 (1996)
10. Godefroid, P.: Exploiting symmetry when model-checking software. In: FORTE. IFIP Conference Proceedings, vol. 156, pp. 257–275. Kluwer, Dordrecht (1999)
11. Havelund, K., Pressburger, T.: Model Checking Java Programs using Java PathFinder. STTT 2(4), 366–381 (2000)
12. Zaks, A., Joshi, R.: Verifying multi-threaded C programs with SPIN. In: Havelund, K., Majumdar, R., Palsberg, J. (eds.) SPIN 2008. LNCS, vol. 5156, pp. 325–342. Springer, Heidelberg (2008)
13. http://www.cs.utah.edu/~yuyang/inspect/
14. Godefroid, P.: Partial-Order Methods for the Verification of Concurrent Systems: An Approach to the State-Explosion Problem. Springer, Heidelberg (1996)


15. Necula, G.C., McPeak, S., Rahul, S.P., Weimer, W.: CIL: Intermediate Language and Tools for Analysis and Transformation of C Programs. In: Horspool, R.N. (ed.) CC 2002. LNCS, vol. 2304, pp. 213–228. Springer, Heidelberg (2002)
16. http://freshmeat.net/projects/aget/
17. http://freshmeat.net/projects/pfscan
18. Aloul, F.A., Ramani, A., Markov, I.L., Sakallah, K.A.: Solving difficult SAT instances in the presence of symmetry. In: DAC, pp. 731–736. ACM, New York (2002)
19. Chai, D., Kuehlmann, A.: Building a better boolean matcher and symmetry detector. In: DATE, pp. 1079–1084 (2006)
20. Darga, P.T., Liffiton, M.H., Sakallah, K.A., Markov, I.L.: Exploiting structure in symmetry detection for CNF. In: DAC, pp. 530–534. ACM, New York (2004)
21. Junttila, T., Kaski, P.: Engineering an efficient canonical labeling tool for large and sparse graphs. In: SIAM Workshop on Algorithm Engineering and Experiments (2007)
22. Darga, P.T., Sakallah, K.A., Markov, I.L.: Faster symmetry discovery using sparsity of symmetries. In: DAC, pp. 149–154. ACM, New York (2008)
23. Wahl, T.: Adaptive symmetry reduction. In: Damm, W., Hermanns, H. (eds.) CAV 2007. LNCS, vol. 4590, pp. 393–405. Springer, Heidelberg (2007)
24. Lerda, F., Visser, W.: Addressing dynamic issues of program model checking. In: Dwyer, M.B. (ed.) SPIN 2001. LNCS, vol. 2057, pp. 80–102. Springer, Heidelberg (2001)
25. Iosif, R.: Exploiting heap symmetries in explicit-state model checking of software. In: 16th IEEE International Conference on Automated Software Engineering (ASE 2001), Coronado Island, San Diego, CA, USA, November 26–29, 2001, pp. 254–261. IEEE Computer Society, Los Alamitos (2001)
26. Iosif, R.: Symmetry reductions for model checking of concurrent dynamic software. STTT 6(4), 302–319 (2004)
27. Visser, W., Păsăreanu, C.S., Pelánek, R.: Test input generation for Java containers using state matching. In: Pollock, L.L., Pezzè, M. (eds.) Proceedings of the ACM/SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2006, Portland, Maine, USA, July 17–20, 2006, pp. 37–48. ACM, New York (2006)

Author Index

Balasubramaniam, Dharini 223
Ben-Ari, Mordechai (Moti) 6
Bošnački, Dragan 32
Brezočnik, Zmago 143
Chen, Xiaofang 279
Dearle, Al 223
Dillinger, Peter C. 12
Edelkamp, Stefan 32
Faragó, David 50
Ganai, Malay K. 68
Garavel, Hubert 241
Godefroid, Patrice 1
Gopalakrishnan, Ganesh 279
Hahn, Ernst Moritz 88
Hermanns, Holger 88
Heußner, Alexander 107
Kidd, Nicholas 125
König, Hartmut 205
Kovše, Tim 143
Kundu, Sudipta 68
Kwiatkowska, Marta 2
Lal, Akash 148
Lammich, Peter 125
Le Gall, Tristan 107
Lewis, Jonathan 223
Lim, Junghee 148
Manolios, Panagiotis (Pete) 12
Mercer, Eric G. 174
Miller, Alice 223
Morrison, Ron 223
Pelánek, Radek 169
Reps, Thomas 125, 148
Rosecký, Václav 169
Rungta, Neha 174
Rybalchenko, Andrey 192
Schmerl, Sebastian 205
Schmitt, Peter H. 50
Sharma, Oliver 223
Sifakis, Joseph 4
Singh, Rishabh 192
Sulewski, Damian 32
Sutre, Grégoire 107
Sventek, Joe 223
Thivolle, Damien 241
Touili, Tayssir 125
Vechev, Martin 261
Visser, Willem 5, 174
Vlaovič, Boštjan 143
Vogel, Michael 205
Vreže, Aleksander 143
Wang, Chao 279
Yahav, Eran 261
Yang, Yu 279
Yorsh, Greta 261
Zhang, Lijun 88