Challenges of Software Verification (Intelligent Systems Reference Library, 238) [1st ed. 2023] 9811996008, 9789811996009

This book provides an overview of the open challenges in software verification. Software verification is a branch of software engineering aiming at guaranteeing that software applications satisfy some requirements of interest.


English Pages 279 [275] Year 2023


Table of contents:
Preface
Contents
Editors and Contributors
About the Editors
Contributors
1 Abstract Interpretation: From 0, 1, to ∞
1.1 Introduction
1.2 Abstract Interpretation for the Untaught
1.3 Abstract Interpretation for the Savant
1.3.1 Software Engineering
1.3.2 Education
1.3.3 Scope of Abstract Interpretation
1.3.4 More Complex Properties
1.3.5 Properties of More Complex Data Structures
1.3.6 Properties of More Complex Control Structures
1.3.7 Computation Spaces
1.3.8 Choosing Precise and Efficient Abstractions
1.3.9 Induction Abstraction
1.3.10 Calculational Design of Abstract Interpreters
1.3.11 Language-Independent Abstract Interpretation
1.3.12 New Computation Models and Challenges
1.4 Conclusion
References
2 LiSA: A Generic Framework for Multilanguage Static Analysis
2.1 Introduction
2.1.1 An Illustrative Example
2.1.2 Contribution and Paper Structure
2.2 LiSA's Overall Architecture
2.3 The Internal Language
2.3.1 Control Flow Graphs
2.3.2 Symbolic Expressions
2.4 The Analysis State
2.5 Interprocedural Analysis
2.6 Frontends
2.7 Multilanguage Analysis
2.8 Conclusion
2.8.1 Future Directions
2.8.2 Related Work
References
3 How to Make Taint Analysis Precise
3.1 Introduction
3.2 Concrete Semantics
3.2.1 Influenced Concrete States
3.2.2 Semantics of Basic Instructions
3.2.3 Partial Traces Semantics
3.2.4 Reachable States Semantics
3.3 Application to Security
3.3.1 Sources
3.3.2 Sinks
3.3.3 Sanitizers
3.4 Data Flow Analyses for Security
3.5 Reachable States-Based Taint Analysis
3.6 Trace-Based Taint Analysis
3.6.1 Trace Influence Algebra
3.6.2 Influence Semantics with Features
3.6.3 Inter-Procedural Analysis
3.6.4 Features for Analysis Approximations
3.7 Experience
3.8 Conclusions
References
4 "Fixing" the Specification of Widenings
4.1 Introduction
4.2 Background
4.3 On the Specification of Widening Operators
4.3.1 Classifying Abstract Domain Implementations
4.3.2 Classifying AI Engine Implementations
4.4 Combinations of Abstract Domains and AI Engines
4.4.1 Some Thoughts on the Unsafe Combinations
4.4.2 Comparing the Safe Combinations
4.5 Lesson Learned and Recommendation
4.5.1 Safe Widenings for Convex Polyhedra
4.5.2 A Note on the Unusual Widening Specifications
4.6 Conclusion
References
5 Static Analysis for Data Scientists
5.1 Introduction
5.1.1 Example
5.1.2 Data Expectation Static Analyses
5.2 Input Data-Aware Concrete Semantics
5.2.1 Input Data
5.2.2 Dataframe-Manipulating Language
5.2.3 Input-Aware Semantics
5.3 Expectations Abstract Domains
5.3.1 Column Expectations Abstract Domain
5.3.2 Other Expectations Abstract Domains
5.4 Implementation
5.5 Conclusion
References
6 Completeness in Static Analysis by Abstract Interpretation: A Personal Point of View
6.1 Introduction
6.2 Completeness of the Abstraction: the Case of LRU Caches
6.3 Completeness or Incompleteness of the Analysis Method
6.3.1 Widening Operators
6.3.2 Exact Solving
6.3.3 Imprecise Abstract Transfer Functions
6.4 Undecidability of an Abstraction
6.4.1 Polyhedral Abstraction
6.4.2 Richer Domains
6.5 Perspectives and Conclusion
References
7 Lifting String Analysis Domains
7.1 Introduction
7.1.1 Paper Contribution
7.1.2 Paper Structure
7.2 Background
7.2.1 Mathematical Notation
7.2.2 Abstract Interpretation
7.2.3 Reduced Product
7.2.4 Granger Product
7.2.5 String Operators
7.3 Related Work
7.3.1 Enhancing Operators
7.3.2 Combinations of String Analyses
7.3.3 String Analysis: Applications
7.4 Concrete Domain and Semantics
7.4.1 Concrete Domain
7.4.2 Concrete Semantics
7.4.3 Example
7.5 String Abstract Domains
7.5.1 String Length
7.5.2 Character Inclusion
7.5.3 Prefix and Suffix
7.6 Segmentation Abstract Domain
7.6.1 Strings Concrete Representation
7.6.2 Abstract Domain
7.6.3 Abstract Semantics
7.7 Refined String Abstract Domains
7.7.1 Meaning of Refinement
7.7.2 Combining Segmentation and String Length Domains
7.7.3 Combining Segmentation and Character Inclusion Domains
7.7.4 Combining Segmentation and Prefix Domains
7.8 Conclusion
References
8 Local Completeness in Abstract Interpretation
8.1 Completeness, Fallacy, and Approximation
8.2 Proving Completeness
8.3 LCL: Local Completeness Logic
8.4 Concluding Remarks
References
9 The Top-Down Solver—An Exercise in A²I
9.1 Introduction
9.2 Getting Started
9.3 Adding Fixpoint Iteration
9.4 The Top-Down Solver TD
9.5 The Top-Down Solver with Tabulation
9.6 Introducing Widening and Narrowing
9.7 Conclusion
References
10 Regular Matching with Constraint Programming
10.1 Introduction
10.2 Preliminaries
10.2.1 Strings and Regular Languages
10.2.2 Constraint Programming and String Solving
10.3 Matching Regular Expressions
10.3.1 Match
10.3.2 Generalization to replace
10.4 Conclusions
References
11 Floating-Point Round-off Error Analysis of Safety-Critical Avionics Software
11.1 Introduction
11.2 Formal Verification of the ADS-B CPR Algorithm
11.3 Automatizing the Verification with PRECiSA
11.4 Case Study: Point-in-Polygon Algorithm
11.5 Related Work
11.6 Conclusion and Future Challenges
References
12 Risk Estimation in IoT Systems
12.1 Introduction
12.2 Indoor Environmental Monitoring Scenario
12.3 Technical Background
12.3.1 Overview of IoT-LySa
12.3.2 Control Flow Analysis
12.4 Using the CFA Results for Analysing Critical Decisions
12.4.1 Taint Analysis
12.4.2 What if Reasoning
12.4.3 Estimation of Risks
12.5 Concluding Remarks
References
13 Verification of Reaction Systems Processes
13.1 Introduction
13.2 Reaction Systems
13.3 SOS Rules for Reaction Systems
13.4 Bio-simulation
13.4.1 Assertion Language
13.4.2 Bio-similarity and Biological Equivalence
13.4.3 A Case Study: Metabolic Pathways in Mammalian Epithelial Cells
13.4.4 Dynamic Slicing of RS Processes
13.5 Quantitative Extensions of RSs
13.6 Implementation and Experimentation
13.7 Related Work
13.8 Conclusion and Future Work
References

Intelligent Systems Reference Library 238

Vincenzo Arceri Agostino Cortesi Pietro Ferrara Martina Olliaro   Editors

Challenges of Software Verification

Intelligent Systems Reference Library Volume 238

Series Editors Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland Lakhmi C. Jain, KES International, Shoreham-by-Sea, UK

The aim of this series is to publish a Reference Library, including novel advances and developments in all aspects of Intelligent Systems in an easily accessible and well structured form. The series includes reference works, handbooks, compendia, textbooks, well-structured monographs, dictionaries, and encyclopedias. It contains well integrated knowledge and current information in the field of Intelligent Systems. The series covers the theory, applications, and design methods of Intelligent Systems. Virtually all disciplines such as engineering, computer science, avionics, business, e-commerce, environment, healthcare, physics and life science are included. The list of topics spans all the areas of modern intelligent systems such as: Ambient intelligence, Computational intelligence, Social intelligence, Computational neuroscience, Artificial life, Virtual society, Cognitive systems, DNA and immunity-based systems, e-Learning and teaching, Human-centred computing and Machine ethics, Intelligent control, Intelligent data analysis, Knowledge-based paradigms, Knowledge management, Intelligent agents, Intelligent decision making, Intelligent network security, Interactive entertainment, Learning paradigms, Recommender systems, Robotics and Mechatronics including human-machine teaming, Self-organizing and adaptive systems, Soft computing including Neural systems, Fuzzy systems, Evolutionary computing and the Fusion of these paradigms, Perception and Vision, Web intelligence and Multimedia. Indexed by SCOPUS, DBLP, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science.

Vincenzo Arceri · Agostino Cortesi · Pietro Ferrara · Martina Olliaro Editors

Challenges of Software Verification

Editors Vincenzo Arceri Department of Mathematical, Physical, and Computer Sciences University of Parma Parma, Italy

Agostino Cortesi Department of Environmental Sciences, Informatics and Statistics Ca’ Foscari University of Venice Venice, Italy

Pietro Ferrara Department of Environmental Sciences, Informatics and Statistics Ca’ Foscari University of Venice Venice, Italy

Martina Olliaro Department of Environmental Sciences, Informatics and Statistics Ca’ Foscari University of Venice Venice, Italy

ISSN 1868-4394 ISSN 1868-4408 (electronic) Intelligent Systems Reference Library ISBN 978-981-19-9600-9 ISBN 978-981-19-9601-6 (eBook) https://doi.org/10.1007/978-981-19-9601-6 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Preface

Nowadays, software systems have a huge impact on people's lives, as most of our activities require interaction with some kind of software. Individual, collective, and corporate life increasingly depends on the safety, quality, and reliability of these systems. In such a context, software verification plays a crucial role in automatically guaranteeing these properties. This volume collects the contributions of the invited speakers to the first Challenges of Software Verification workshop (CSV),1 which was held on May 20, 2022, at the Ca' Foscari University of Venice, Italy, on the occasion of the award ceremony of the Doctorate Honoris Causa in Computer Science to Professor Patrick Cousot. The contributions collected in this volume provide an overview of the challenges in software verification. Software verification is a branch of software engineering aiming at guaranteeing that software applications satisfy some requirements of interest. Over the years, the software verification community has proposed and considered several techniques: abstract interpretation, data-flow analysis, type systems, and model checking are just a few examples. Theoretical advances have always been motivated by practical challenges, which have led both sides of software verification to evolve together. Indeed, several verification tools have been proposed by the research community, and any software application, in order to guarantee that certain security, safety, reliability, or availability requirements are met, needs to integrate a verification phase into its lifecycle, independently of the application context or the software size. This research field is more active than ever, as evidenced by the long history of the conferences in the area and the solid, robust community spread all over the world that supports them. This volume is addressed to both researchers and practitioners in software verification, and collects contributions discussing recent advances in facing open challenges in software verification, relying on a broad spectrum of verification techniques. In particular, the volume discusses open challenges motivated by the application context, without neglecting theoretical open problems.

1. https://ssv.dais.unive.it/events/challenges-of-software-verification-workshop/

We would like to thank all the authors of this volume for their valuable scientific contributions. We are also grateful to Dr. Aninda Bose and the Springer Nature staff for the very professional editorial support.

November 2022

Vincenzo Arceri (Parma, Italy)
Agostino Cortesi (Venice, Italy)
Pietro Ferrara (Venice, Italy)
Martina Olliaro (Venice, Italy)


Editors and Contributors

About the Editors

Vincenzo Arceri is a non-tenure-track assistant professor at the Department of Mathematical, Physical, and Computer Sciences, University of Parma. His research focuses on static software analysis and verification; he has 7 years of experience in this research field, during which he has published in international conferences and journals related to software analysis, formal methods for software security, programming languages, and software engineering (including ACM TOPS, Information and Computation, VMCAI, and ACM SIGAPP SAC). His main research interests include static program analysis, string analysis and verification (in particular for dynamic languages), abstract interpretation and, more generally, formal methods for program security.

Agostino Cortesi is a full professor at Ca' Foscari University of Venice. He has over 25 years of experience in the area of software verification, having published over 150 articles in high-level international journals and international conference proceedings. He has been a member of numerous program committees for international conferences (e.g., SAS, VMCAI) and editorial committees of scientific journals (Computer Languages, Journal of Universal Computer Science). He is currently the head of the Ph.D. program in Computer Science at Ca' Foscari. His main research interests concern programming language theory and static analysis techniques, with a particular focus on security applications. He is the coordinator of the EU Horizon 2020 "Families_Share" project and has held the position of head of unit of the H2020 project EQUAL-IST and of the COST project "Eutypes". He also directs the project MAE Italy-India 2017–19 "Formal Specification for Secured Software System" and the FIRST Covid-19 F2F project.

Pietro Ferrara is an assistant professor at Ca' Foscari University of Venice. He is an expert on static analysis based on abstract interpretation, with a focus on the detection of security vulnerabilities in object-oriented programs. He joined the University of Venice in November 2019 as a tenure-track assistant professor. Previously, from 2013 to 2019, he worked in industry, gaining experience in delivering prototypes and commercial tools to customers and in filling the gap between scientific research and the development and delivery of software products, as well as in technical and commercial presentations to customers, evaluation activities, and the preparation of commercial and technical documentation.

Martina Olliaro is a postdoc researcher at Ca' Foscari University of Venice. She received her Ph.D. in Computer Science at Ca' Foscari University of Venice (Italy) and Masaryk University of Brno (Czech Republic), under the supervision of Professors Agostino Cortesi and Vashek Matyas. Her main research interest concerns string static analysis by means of abstract interpretation theory, with a focus on string-related security issues. She is also interested in techniques for watermarking relational databases and in the study of their semantics preservation.

Contributors

Roberto Amadini, University of Bologna, Bologna, Italy
Vincenzo Arceri, University of Parma, Parma, Italy
Chiara Bodei, Pisa University, Pisa, Italy
Linda Brodo, Università di Sassari, Sassari, Italy
Roberto Bruni, University of Pisa, Pisa, Italy
Agostino Cortesi, Ca' Foscari University of Venice, Venice, Italy
Patrick Cousot, New York University, New York, USA
Pierpaolo Degano, Pisa University, Pisa, Italy; IMT School for Advanced Studies Lucca, Lucca, Italy
Aaron Dutle, NASA Langley Research Center, Hampton, USA
Moreno Falaschi, Università di Siena, Siena, Italy
Marco A. Feliú, National Institute of Aerospace, Hampton, USA
Pietro Ferrara, Ca' Foscari University of Venice, Venice, Italy
Gian-Luigi Ferrari, Pisa University, Pisa, Italy
Maurizio Gabbrielli, University of Bologna, Bologna, Italy
Letterio Galletta, IMT School for Advanced Studies Lucca, Lucca, Italy
Roberto Giacobazzi, University of Verona, Verona, Italy
Roberta Gori, University of Pisa, Pisa, Italy


Francesco Logozzo, Meta, USA
Ibrahim Mohamed, Meta, USA
David Monniaux, Université Grenoble Alpes, CNRS, Grenoble INP, Grenoble, France
Mariano Moscato, National Institute of Aerospace, Hampton, USA
César Muñoz, AWS (NASA at the time of contribution), Hampton, USA
Luca Negrini, Ca' Foscari University of Venice, Venice, Italy; Corvallis SRL, Padova, Italy
Martina Olliaro, Ca' Foscari University of Venice, Venice, Italy
Francesco Ranzato, University of Padova, Padova, Italy
Michael Schwarz, TU München, Garching, Germany
Helmut Seidl, TU München, Garching, Germany
Yannick Stade, TU München, Garching, Germany
Sarah Tilscher, TU München, Garching, Germany
Laura Titolo, National Institute of Aerospace, Hampton, USA
Caterina Urban, Inria and ENS, PSL, Paris, France
Ralf Vogler, TU München, Garching, Germany
Enea Zaffanella, University of Parma, Parma, Italy

Chapter 1

Abstract Interpretation: From 0, 1, to ∞ Patrick Cousot

Abstract This paper starts from zero knowledge about abstract interpretation and provides one rapid introduction for the untaught, goes rapidly over remarkable achievements, and widens to infinitely hard problems to be solved by the savant.

1.1 Introduction

Abstract interpretation has a long history [61]. It appeared in the seventies to prove the correctness of program static analysis algorithms [16, 33, 35, 37]. This justification with respect to a reachability semantics, and the use of extrapolators (widenings and their duals) and interpolators (narrowings and their duals) to accelerate fixpoint convergence in infinite abstract domains not satisfying the ascending chain condition, was a breakthrough with respect to existing dataflow analysis methods (some of which are even syntactically correct but semantically wrong [29]). Then it appeared that the concepts of abstract interpretation also apply to the design of programming language semantics at various levels of abstraction [22, 43]. Then further abstractions led to program verification methods that are sound (as justified with respect to the semantics) and complete (true abstract properties of the program semantics can always be proved) [3, 46]. Finally, static analysis algorithms are computable abstractions of proof methods (such as invariance/reachability in [35]), hence sound but incomplete by undecidability. This paper is an introduction to abstract interpretation for those with a minimal background in mathematics and no knowledge of abstract interpretation (based on a simplification of [31, Chap. 3]). At the other extreme, it proposes research challenges for those who do not need this introduction.

P. Cousot (B) New York University, New York, USA e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 V. Arceri et al. (eds.), Challenges of Software Verification, Intelligent Systems Reference Library 238, https://doi.org/10.1007/978-981-19-9601-6_1



1.2 Abstract Interpretation for the Untaught

The main idea of abstract interpretation is that, to prove properties of a program semantics, it is sufficient to restrict oneself to the program properties necessary for the proof. Wassily Kandinsky has a painting dated 1925 entitled "Abstract Interpretation" on view at the Yale University Art Gallery (artgallery.yale.edu/collections/objects/43818). The abstraction keeps only shapes and colors, maybe reorganized to form a composition. The very first mathematical example (I know of) of abstract interpretation is by the Indian mathematician Brahmagupta (c. 598–c. 665 CE), who invented the zero and postulated the rule of signs for +, −, ×, and / [87].

… 18.32. A negative minus zero is negative, a positive [minus zero] positive; zero [minus zero] is zero. When a positive is to be subtracted from a negative or a negative from a positive, then it is to be added. …

His only error was 0/0 = 0, nowadays undefined. A modern formulation in static program analysis is the table of Fig. 1.1 for the minus operation, where ⊤± denotes the unknown sign (that is, either ≤0 or ≥0) and ⊥± means no sign, e.g. nothing is known before evaluating the sign of the expression. Given a mathematical definition of the integers Z, the modern understanding of a sign is as a property of the integers Z (or rationals Q or reals R) represented in extension. For example, the meaning of "≥0 −± ≤0 = ≥0" is "take any number n in {0, 1, 2, ...} and any number m in {..., −2, −1, 0}; then their difference n − m is in {0, 1, 2, ...}", where {0, 1, 2, ...} is an extensional definition of the concrete positive numbers (enumerating all of them) and {..., −2, −1, 0} is an extensional definition of the concrete negative numbers (enumerating all of them). Therefore,

• ≥0 is an abstraction of the positives {0, 1, 2, ...};
• ≤0 is an abstraction of the negatives {..., −2, −1, 0};
• =0 is an abstraction of {0} (to be equal to zero);
• ⊤± is an abstraction of all numbers {..., −2, −1, 0, 1, 2, ...};
• ⊥± is an abstraction of the empty set ∅ = {}.

The sign rule "≥0 −± ≤0 = ≥0" means that all possibilities {n − m | n ∈ {0, 1, 2, ...} and m ∈ {..., −2, −1, 0}} are included in {0, 1, 2, ...}. The concretization function provides the meaning of signs as sets of integers:

γ±(⊥±) = ∅ (that is {})        γ±(<0) = {..., −2, −1}
γ±(=0) = {0}                   γ±(>0) = {1, 2, 3, ...}
γ±(≤0) = {..., −2, −1, 0}      γ±(≠0) = {..., −2, −1, 1, 2, ...}
γ±(≥0) = {0, 1, 2, ...}        γ±(⊤±) = {..., −2, −1, 0, 1, 2, ...}      (1.1)

The abstraction function provides the sign abstraction of a property of integers (that is, of a set S of integers):

α±(S) = if S = ∅ then ⊥±
        else if S ⊆ {..., −2, −1} then <0
        else if S ⊆ {0} then =0
        else if S ⊆ {1, 2, ...} then >0
        else if S ⊆ {..., −2, −1, 0} then ≤0
        else if S ⊆ {..., −2, −1, 1, 2, ...} then ≠0
        else if S ⊆ {0, 1, 2, ...} then ≥0
        else ⊤±                                        (1.2)

α±(S) is the best we can say about the sign. For example, if the absolute value is one, then

α±({−1, 1}) = ≠0                                       (1.3)

and if the absolute value is at most one, then

α±({−1, 0, 1}) = ⊤±                                    (1.4)

Here, "best" means "the most precise", that is, "included in". For example, {1} is the property "to be one", which implies "to be a positive integer", which implies "to be a positive or zero integer", which implies "to be an integer", that is, {1} ⊆ {1, 2, 3, ...} ⊆ {0, 1, 2, 3, ...} ⊆ {..., −3, −2, −1, 0, 1, 2, 3, ...}. The idea of the rule of signs is to take the meaning of any two signs, consider all possibilities for the minus, and abstract the result:

s1 −± s2 = α±({n − m | n ∈ γ±(s1) and m ∈ γ±(s2)})     (1.5)

This formula (1.5) means that, taking any number n ∈ γ±(s1) with sign s1 and any number m ∈ γ±(s2) with sign s2, the sign of their difference n − m must be s1 −± s2. The formula (1.5) uses sets to express that the rule of signs is valid for any number of a given sign, by considering the sets γ±(s1) and γ±(s2) of all of them and then considering the best sign abstraction α±({n − m | n ∈ γ±(s1) and m ∈ γ±(s2)}) of the set {n − m | n ∈ γ±(s1) and m ∈ γ±(s2)} of all their possible differences. Figure 1.1 is designed by calculating s1 −± s2 for all possibilities s1, s2 ∈ {⊥±, <0, =0, >0, ≤0, ≠0, ≥0, ⊤±}.

[Fig. 1.1 Rule of signs: the table of s1 −± s2 for all pairs of signs]

A simple example is ⊥± −± ⊤± = ⊥±, calculated as follows:

⊥± −± ⊤± = α±({n − m | n ∈ γ±(⊥±) and m ∈ γ±(⊤±)})                  by (1.5)
         = α±({n − m | n ∈ ∅ and m ∈ {..., −2, −1, 0, 1, 2, ...}})   by (1.1)
         = α±({n − m | false})          since n ∈ ∅ is false in the conjunction
         = α±(∅)                        definition of the empty set ∅
         = ⊥±                           definition (1.2) of α±
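The rule of signs above lends itself to a direct implementation. The following is a minimal OCaml sketch (the language and all identifiers are my choice for illustration, not the chapter's code): the characteristic function of the concretization γ± of (1.1), the abstract minus of Fig. 1.1 obtained from an abstract negation and an abstract addition, and a soundness spot-check of (1.5) on a sample of integers.

```ocaml
(* Illustrative sign domain: Bot = ⊥±, Lt0 = <0, Eq0 = =0, Gt0 = >0,
   Le0 = ≤0, Ne0 = ≠0, Ge0 = ≥0, Top = ⊤±. *)
type sign = Bot | Lt0 | Eq0 | Gt0 | Le0 | Ne0 | Ge0 | Top

(* Characteristic function of the concretization γ± of (1.1). *)
let mem s n = match s with
  | Bot -> false      | Lt0 -> n < 0
  | Eq0 -> n = 0      | Gt0 -> n > 0
  | Le0 -> n <= 0     | Ne0 -> n <> 0
  | Ge0 -> n >= 0     | Top -> true

(* Abstract negation: the sign of -n given the sign of n. *)
let neg = function
  | Lt0 -> Gt0 | Gt0 -> Lt0 | Le0 -> Ge0 | Ge0 -> Le0 | s -> s

(* Abstract addition (rule of signs for +); every pair of signs not
   listed below can produce any integer, hence Top. *)
let plus s1 s2 = match s1, s2 with
  | Bot, _ | _, Bot -> Bot
  | Eq0, s | s, Eq0 -> s
  | Gt0, Gt0 | Gt0, Ge0 | Ge0, Gt0 -> Gt0
  | Ge0, Ge0 -> Ge0
  | Lt0, Lt0 | Lt0, Le0 | Le0, Lt0 -> Lt0
  | Le0, Le0 -> Le0
  | _ -> Top

(* Abstract subtraction as in Fig. 1.1: s1 -± s2 = s1 +± (-± s2). *)
let minus s1 s2 = plus s1 (neg s2)

(* Soundness spot-check of (1.5): whenever n has sign s1 and m has
   sign s2, the difference n - m must have sign s1 -± s2. *)
let () =
  let signs = [Bot; Lt0; Eq0; Gt0; Le0; Ne0; Ge0; Top] in
  let samples = [-3; -2; -1; 0; 1; 2; 3] in
  List.iter (fun s1 -> List.iter (fun s2 ->
      List.iter (fun n -> List.iter (fun m ->
          if mem s1 n && mem s2 m then
            assert (mem (minus s1 s2) (n - m)))
          samples) samples)
      signs) signs;
  print_endline "rule of signs: sound on all samples"
```

Note that the check only samples finitely many integers, so it is a sanity test rather than a proof; the proof is precisely the kind of calculation carried out above.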

Such calculations are the foundations of the calculational design of abstract interpreters [20, 30]. The idea of "best abstraction" is that among the possible abstractions of sets of integers by signs (considered as sets), there is always the smallest one (for inclusion ⊆). This "best" or "most precise" abstraction is given by α±. This "best abstraction" follows from the fact that the pair (α±, γ±) of functions is a Galois connection, a generalization by the mathematician Øystein Ore [84] of an algebraic concept introduced by Évariste Galois. This means that ∀S, s: α±(S) ⊑± s ⟺ S ⊆ γ±(s). The partial order ⊑± is defined by the following Hasse diagram (each sign is ⊑± the signs linked to it on the row above: <0 and =0 are below ≤0, =0 and >0 are below ≥0, and <0 and >0 are below ≠0):

          ⊤±
    ≤0    ≠0    ≥0
    <0    =0    >0
          ⊥±

Thus, for example, ⊥± ⊑± >0 ⊑± ≥0 ⊑± ⊤±, while <0 and >0 are not comparable by ⊑±. We have s ⊑± s′ ⟺ γ±(s) ⊆ γ±(s′), so that ⊑± in the abstract is set inclusion ⊆ in the concrete, that is, logical implication. For example, {1, 42} ⊆ γ±(>0) = {1, 2, 3, ..., 42, ...}. This is called soundness: when abstracting to signs, we consider over-approximations, so no concrete case is ever omitted. Any sign s such that S ⊆ γ±(s) is an over-approximation of S. For example, {1, 42} ⊆ γ±(>0) ⊆ γ±(≥0) ⊆ γ±(⊤±) and {1, 42} ⊆ γ±(≠0). But α±(S), that is >0 in our example, is the most precise, because if S ⊆ γ±(s) then α±(S) ⊑± s, by definition of a Galois connection. So α±(S) is the best (or most precise) abstraction of S as a sign. Such connections were invented by Évariste Galois to provide a connection between field theory (originally the field of rationals) and group theory (the group of permutations of the polynomial roots), allowing certain problems in field theory to be reduced to group theory, which makes them simpler and easier to understand. Galois characterized in this way the polynomial equations that are solvable by radicals in terms of properties of the permutation group of their roots. An equation is solvable by radicals if its roots are a function of the polynomial coefficients using only integers, +, −, ×, /, and the n-th root operation. This implies the Abel-Ruffini theorem asserting that a general polynomial of degree at least five cannot be solved by radicals. Galois theory has been used to solve classic problems, including showing that two problems of antiquity cannot be solved as they were stated (doubling the cube and trisecting the angle), and characterizing the regular polygons that are constructible (this characterization was previously given by Gauss, but all known proofs that this characterization is complete require Galois theory). Other elementary examples of Galois connections are the casting out of nines (invented by Carl Friedrich Gauss to check a multiplication), Paul Bachmann and Edmund Landau's big-O notation (e.g. log n + 10n² + n³ = O(n³) when n → ∞), the mean (expected value), etc. The main idea of abstract interpretation is to use abstraction to simplify hard problems so as to provide (partial) solutions. The abstraction may be both simple and precise (e.g. the rule of signs for multiplication). The solution may also be partial, meaning that the abstraction may be too imprecise to give a precise answer (e.g. >0 −± >0 = ⊤±). But the abstraction can always be refined to get a more precise answer, as already understood by Brahmagupta [87]:

… 18.31. [If] a smaller [positive] is to be subtracted from a larger positive, [the result] is positive, ….

However, the refined abstraction might not be computable by a machine, so finding efficiently computable and precise abstractions is a very difficult art! Besides applications to the design of semantics of programming languages and program proof methods, the main application of abstract interpretation is program analysis.
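Continuing the OCaml sketch above (still an illustrative assumption, not the chapter's code), the order ⊑± and the best abstraction α± of (1.2), restricted to finite sets, can be written down, and the Galois connection property α±(S) ⊑± s ⟺ S ⊆ γ±(s) can be checked on examples:

```ocaml
(* The partial order ⊑± of the Hasse diagram, hard-coded. *)
let leq s1 s2 =
  s1 = s2 || s1 = Bot || s2 = Top ||
  (match s1, s2 with
   | (Lt0, Le0) | (Lt0, Ne0) | (Eq0, Le0) | (Eq0, Ge0)
   | (Gt0, Ge0) | (Gt0, Ne0) -> true
   | _ -> false)

(* Best abstraction of a finite set, following (1.2): scan the signs
   from most precise to least precise and return the first one whose
   concretization contains every element of the set. *)
let alpha (set : int list) : sign =
  List.find (fun a -> List.for_all (mem a) set)
    [Bot; Lt0; Eq0; Gt0; Le0; Ne0; Ge0; Top]

(* Galois connection check: alpha set ⊑± s iff set ⊆ γ±(s). *)
let () =
  let subset set a = List.for_all (mem a) set in
  List.iter (fun set ->
      List.iter (fun a ->
          assert (leq (alpha set) a = subset set a))
        [Bot; Lt0; Eq0; Gt0; Le0; Ne0; Ge0; Top])
    [[]; [0]; [1; 42]; [-1; 1]; [-2; 0]; [-3; 0; 3]];
  print_endline "Galois connection: holds on the examples"
```

For instance, alpha [1; 42] evaluates to Gt0, the most precise over-approximation of {1, 42}, exactly as in the discussion above.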


An example is octagonal analysis, invented by Antoine Miné [78], where a set of points is enclosed within an octagon.
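As a reminder (my paraphrase of the domain of [78], not a quotation from the chapter), an octagon is described by a conjunction of constraints with unit coefficients over pairs of program variables:

```latex
% Octagonal constraints over program variables x, y:
\pm x \pm y \le c \qquad \text{(including the unary constraints } \pm x \le c\text{)}
```

whence the eight-sided shape when two variables are plotted in the plane.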

Given the hypothesis n ≥ 0 on the input value n of the program parameter n, an octagonal analysis of the following program

{n ≥ 0}  (hypothesis on input parameter n)
i = 0;
while (i < n) {
    {i ≥ 0}  {0 ≤ i ≤ n-1}
    i = (i + 1);
}
{i = n}

will automatically discover the invariants { ... }, which hold for the values of the variables each time execution reaches the program point where they are written. This abstract domain, in combination with dozens of other ones, is used in the Astrée static analyzer [48], which scales to over 10 000 000 lines of C/C++ code. Astrée is developed and commercialized by AbsInt (www.absint.com/astree/index.html) and used by thousands of engineers in industry.
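At a much smaller scale, the following OCaml sketch (an illustration of the general technique, and in no way the Astrée implementation) shows how such loop invariants are inferred by fixpoint iteration with widening and narrowing, for the simplified loop i = 0; while (i < 100) i = i + 1 (a constant bound replaces n, since plain intervals, unlike octagons, are non-relational):

```ocaml
(* Interval analysis with widening/narrowing for
     i = 0; while (i < 100) i = i + 1;
   Bounds are floats so that -infinity/+infinity are available. *)
type itv = Bot | Itv of float * float          (* [lo, hi], lo <= hi *)

let join a b = match a, b with
  | Bot, x | x, Bot -> x
  | Itv (l1, h1), Itv (l2, h2) -> Itv (min l1 l2, max h1 h2)

(* Widening: extrapolate any unstable bound to infinity, which
   guarantees termination of the increasing iterates. *)
let widen a b = match a, b with
  | Bot, x | x, Bot -> x
  | Itv (l1, h1), Itv (l2, h2) ->
      Itv ((if l2 < l1 then neg_infinity else l1),
           (if h2 > h1 then infinity else h1))

(* Narrowing: refine only the bounds previously extrapolated. *)
let narrow a b = match a, b with
  | Bot, _ | _, Bot -> Bot
  | Itv (l1, h1), Itv (l2, h2) ->
      Itv ((if l1 = neg_infinity then l2 else l1),
           (if h1 = infinity then h2 else h1))

(* Abstract transformers (i is integer-valued, so i < 100 means i <= 99). *)
let assume_lt = function
  | Bot -> Bot | Itv (l, h) -> if l >= 100. then Bot else Itv (l, min h 99.)
let assume_ge = function
  | Bot -> Bot | Itv (l, h) -> if h < 100. then Bot else Itv (max l 100., h)
let incr = function Bot -> Bot | Itv (l, h) -> Itv (l +. 1., h +. 1.)

(* One abstract iteration: the interval for i at the loop head. *)
let f x = join (Itv (0., 0.)) (incr (assume_lt x))

let rec up x =                     (* increasing iterates with widening *)
  let x' = widen x (f x) in
  if x' = x then x else up x'

let rec down x =                   (* decreasing iterates with narrowing *)
  let x' = narrow x (f x) in
  if x' = x then x else down x'

let () =
  let head = down (up Bot) in      (* i in [0, 100] at the loop head *)
  let exit = assume_ge head in     (* i in [100, 100] at loop exit *)
  match head, exit with
  | Itv (a, b), Itv (c, d) ->
      Printf.printf "head: [%g, %g]  exit: [%g, %g]\n" a b c d
  | _ -> assert false
```

Widening alone stabilizes the loop-head interval at [0, +∞); one narrowing pass recovers [0, 100], whence i = 100 at the loop exit, mirroring the i = n invariant of the program above.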

1.3 Abstract Interpretation for the Savant

1.3.1 Software Engineering

Most computer scientists start by learning a programming language through debugging rather than by relying on typing, static analysis, and verification tools. Advancing in their studies, they will be lucky to have one or two classes on static analysis, mainly in compilation courses, using standard examples from dataflow analysis. No one cares about soundness/correctness in compilation classes, and teaching a semantically incorrect dataflow algorithm (see, for example, [29]) is not a problem (with very few exceptions, of course [73]).

When becoming professionals, programmers are encouraged to produce rapidly rather than to deliver products of high quality. No one feels the need to use static analysis tools, since this might slow down the delivery of the final product, which anyway will be debugged by the end users, supposedly at no cost to the programmer. There are several reasons for this situation beyond that of education. The first is that many static analyzers are academic tools designed for competitions on small programs. Most do not scale up and are not usable for production-quality software. The second is that, contrary to other industries, programmers are not responsible for their bugs, and so feel no need to use tools. The third is that if, exceptionally, there is an obligation to use a static analysis tool, it is always possible to find a cheap, poor tool offering no guarantee. There is no obligation for tools to be qualified. For example, in 2020, the US National Institute of Standards and Technology determined Astrée and Frama-C, both abstract interpretation based, to be the only two tools that satisfy their criteria for sound static code analysis [7]. The tools not satisfying the criteria are not cited, but claim1 "Don't overestimate the limited value of standard test suites such as Juliet.2 These suites often exercise language features that are not appropriate for safety-critical code. Historically, the overlap between findings of different tools that were run over the same Juliet test suite has been surprisingly small." (It does not seem to be understood that unsound static analysis tools may produce different sets of alarms, whose intersection may even be empty.) Fortunately, the above pessimistic picture of the state of the art of static analysis in research and industrial applications is counter-balanced by outstanding successes. Among these stands Astrée, used in the aeronautic, automotive, and nuclear industries to check for runtime errors and standards (such as DO-178B/C, ISO 26262, IEC 61508, EN-50128, Misra) in safety-critical C or C++ software. Zoncolan is used at Facebook to detect and prevent security issues in Hack code [74]. It finds 70% of the privacy bugs found every day in the ever-changing 100 million lines of the code base. For the field to make significant progress, there are many research and education subjects to which to contribute. We discuss a few possible research objectives; other ones are considered in [24, 47]. Some are hard unresolved issues pending for years [21], for which progress has been much needed, continuous, but slow. There are also emerging subjects and applications. It seems that there is no limit, since abstract interpretation has accompanied the evolution of computer science over several decades: as originally shown by Galois, the idea of understanding a complex structure using a more abstract and simple one looks to be universally applicable.

1. https://www.synopsys.com/content/dam/synopsys/sig-assets/whitepapers/coverity-risk-mitigation-for-do178c.pdf
2. Juliet Test Suites are available at https://samate.nist.gov/SRD/testsuite.php.


1.3.2 Education

Successful projects in static analysis have mostly been developed by researchers who know abstract interpretation well, some with very deep knowledge and experience. But it is not easy to acquire the necessary knowledge and experience starting from zero. There are numerous introductions ([17, 18, 20, 27, 42, 44, 45, 77] among others) and videos,3 but classes on abstract interpretation are necessary to go more in depth. Classes going deep into the intricacies of abstract interpretation are proposed in a few universities, sometimes online.4 Recently, books on abstract interpretation [31, 89] have been published, which can also be very helpful.

1.3.3 Scope of Abstract Interpretation

Abstract interpretation started in the mid-seventies with interval analysis of flowchart programs [34] and, although some imagine that it is still there, it has evolved into a general theory for approximating the possible semantics (i.e., formal specifications of the runtime computations) of computer programs, with a sound construction methodology. Abstract interpretation can be used to design static analyzers, type systems, proof methods, and semantics. It can be operational (by reasoning on program steps), structural (by reasoning on the denotation of program components), and compositional (by composing diverse abstractions). Many other methods of reasoning on program semantics, such as type theory, symbolic execution, bisimulation, and security analysis, have their own way of thinking about programs and expressing soundness. These are also abstract interpretations, although this is mostly not understood. Although generality by abstraction is the main aim in pure mathematics, where merging several theories into a single formalization is considered progress, this is not the case in computer science, where the multiplication of concepts, specialized by communities, does not encourage cross-fertilization. Such unification would make formal methods more easily teachable, applicable, and composable.

1.3.4 More Complex Properties

Most static analyzers aim at discovering semantic properties quantifying over one execution trace at a time, mainly safety properties such as reachability. Liveness properties, such as termination, which require sophisticated abstract domains and widenings [98, 100], have also been considered [46].

3. https://www.youtube.com/watch?v=IBlfJerAcRw
4. Such as the slightly outdated https://web.mit.edu/16.399/www/


More subtle properties than safety and liveness involve quantification over two execution traces at a time, like dependency analysis [26]. When abstracted to a property of one execution trace (like taint analysis for dependency in [96]), there is a loss of precision. And this is even more difficult with properties such as responsibility, which involve the comparison of any execution of a program with all its other possible executions [54]. Much more foundational work is needed to abstract higher-order program properties (including so-called hyperproperties) precisely and efficiently.
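For instance, non-interference, the archetypal dependency property, cannot be phrased as a predicate on single traces. In a standard formulation (my notation, not the chapter's; =_L denotes agreement on the public, low-security, part of the state):

```latex
% Non-interference for a program P (semantics written with \llbracket/\rrbracket
% from the stmaryrd package):
\forall \sigma_1, \sigma_2 .\;\; \sigma_1 =_L \sigma_2
  \;\Longrightarrow\; \llbracket P \rrbracket\,\sigma_1 =_L \llbracket P \rrbracket\,\sigma_2
```

that is, any two executions of P agreeing on public inputs must agree on public outputs: a quantification over pairs of traces, not over single traces.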

1.3.5 Properties of More Complex Data Structures

Most static analyzers have been designed for languages where taking into account simple properties of the data structures manipulated by programs, such as integers, floats, pointers, and arrays, is sufficient to report few and significant alarms. There are even libraries of abstract domains that can be reused for analyzing these properties, mainly numerical ones (such as Apron [68] and Elina [92, 93]). Beyond linearity (i.e., intervals in one dimension and affine subspaces in Euclidean spaces), abstraction by non-linear (closed) convex sets and functions [66] is very challenging but necessary in the analysis of control/command software (beyond specialized domains such as filters [58]), especially to take the interaction of the controller and the controlled system into account [23]. Despite recent progress in the analysis of complex data structures (see, for example, [67, 72, 83]), general-purpose reusable libraries do not exist for more complex symbolic data structures like sets of graphs, etc., commonly found in complex computer applications like social network databases, where capturing succinctly the evolution of unbounded graphs over time is essential. Another example is sets of functions. Of course, there are known methods to analyze procedures (by copy as in Astrée [8], by functional abstraction [36], and by compilation [90]), but they are not really applicable to higher-order first-class functional languages, where functions take functions as parameters and return functions.

1.3.6 Properties of More Complex Control Structures

Very few static analyzers go beyond sequential programs; notable exceptions are Astrée and Zoncolan. Parallelism is difficult because it can be formalized in many different ways, from data parallelism and shared memory models [39, 76], including weak memory models [3, 95], to distributed systems with synchronous [38] or asynchronous communications, with various hypotheses on the achievability of communications, static or dynamic process creation, etc. We lack good semantic models of general applicability and corresponding expressive abstractions.


1.3.7 Computation Spaces

Originally, abstract interpretation dealt with reachability properties, that is, sets of states, but it had to move to more complex ones, such as sets of traces (e.g. for the soundness of dataflow analysis ([37, example 7.2.0.6]; [29])), and to sets of sets of traces. This fits well for discrete computations. For continuous or hybrid ones, more complex models are needed [32, 57] and, mainly, abstractions that go beyond time-limited/bounded ones would be most helpful.

1.3.8 Choosing Precise and Efficient Abstractions

The choice of abstractions is completely empirical and experimental. At least, we have a lattice of abstractions allowing for combinations of abstractions, like the reduced product [37, Sect. 10]. This allows for the refinement of static analyzers by adding and combining various abstractions. The automation of this process during the analysis (e.g. [50] and many others) has not been very successful. This is because refinement with respect to the exact collecting semantics is too costly, while refinement with respect to an abstract domain (e.g. when using SMT solvers, necessarily restricted to a given combination of abstract domains [49]) cannot help when the needed properties are not exactly expressible in this abstract domain. Moreover, because the terminating extrapolation operators cannot be increasing (monotone), a more precise domain does not necessarily result in a more precise analysis. Finally, the lattice of abstract domains provides no information to compare the precision of incomparable domains in this lattice, meaning that the properties expressible by one are not a subset of the properties expressible by the other. So a rigorous formalization of the relative precision of abstract domains and abstract interpretations would be welcome. Casso et al. [13] is an innovative and interesting step in this direction.
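To make the idea of combining abstractions concrete, here is a minimal, self-contained Java sketch (not taken from any analyzer; all names are illustrative) of a reduction step in the style of the reduced product, between an interval component and a parity component for a single integer value: each component refines the other, so that, for example, an even value known to lie in [1, 3] is pinned down to 2.

    import java.util.Optional;

    // A toy reduced product of intervals and parities for one integer value.
    final class IntervalParity {
        final long lo, hi;      // interval component [lo, hi]
        final String parity;    // parity component: "even", "odd", or "top"

        IntervalParity(long lo, long hi, String parity) {
            this.lo = lo; this.hi = hi; this.parity = parity;
        }

        // Reduction: propagate information between the two components.
        Optional<IntervalParity> reduce() {
            long l = lo, h = hi;
            // Shrink bounds that contradict the parity.
            if (parity.equals("even")) {
                if (l % 2 != 0) l++;
                if (h % 2 != 0) h--;
            } else if (parity.equals("odd")) {
                if (l % 2 == 0) l++;
                if (h % 2 == 0) h--;
            }
            if (l > h) return Optional.empty(); // contradictory components: bottom
            // A singleton interval determines the parity exactly.
            String p = (l == h) ? (l % 2 == 0 ? "even" : "odd") : parity;
            return Optional.of(new IntervalParity(l, h, p));
        }

        public String toString() { return "[" + lo + ", " + hi + "] and " + parity; }

        public static void main(String[] args) {
            // An even value known to lie in [1, 3]: reduction pins it to [2, 2].
            System.out.println(new IntervalParity(1, 3, "even").reduce().get());
        }
    }

In a full analyzer such a reduction would typically be iterated across all components of the product until the abstract element stabilizes.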

1.3.9 Induction Abstraction

Most reasoning on program executions requires some form of mathematical induction, such as recurrence on the number of loop iterations or of recursive program calls [28]. In abstract interpretation, this is approximated by an abstract domain and by extrapolation (widening, dual widening) and interpolation (narrowing, dual narrowing) operators [25, 41], which are weak forms of induction that guarantee the soundness of the approximation and the termination of the induction (in the same way that a proof by recurrence allows for the finite presentation of an otherwise infinite proof for each of the naturals).
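As an illustration of such a terminating extrapolation, the following self-contained Java sketch (illustrative names, not taken from any particular analyzer) implements the classic interval widening: a bound that is still growing between two iterates is extrapolated to infinity, so the ascending iteration stabilizes after finitely many steps.

    // A minimal sketch of the classic interval widening: unstable bounds are
    // extrapolated to -oo / +oo so that fixpoint iteration terminates.
    final class Interval {
        final long lo, hi; // Long.MIN_VALUE / MAX_VALUE stand for -oo / +oo

        Interval(long lo, long hi) { this.lo = lo; this.hi = hi; }

        // old.widen(next): keep stable bounds, extrapolate growing ones.
        Interval widen(Interval next) {
            long l = next.lo < lo ? Long.MIN_VALUE : lo;
            long h = next.hi > hi ? Long.MAX_VALUE : hi;
            return new Interval(l, h);
        }

        public String toString() {
            String a = lo == Long.MIN_VALUE ? "-oo" : Long.toString(lo);
            String b = hi == Long.MAX_VALUE ? "+oo" : Long.toString(hi);
            return "[" + a + ", " + b + "]";
        }

        public static void main(String[] args) {
            // Iterates of i in: i = 0; while (...) i = i + 1;
            Interval x = new Interval(0, 0);
            Interval nextIter = new Interval(0, 1); // after one loop iteration
            System.out.println(x.widen(nextIter));  // [0, +oo]: stable in one step
        }
    }

A subsequent narrowing pass can then typically recover some of the precision lost by the extrapolation, for instance by re-intersecting with the loop exit condition.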


This is certainly the most difficult part of program verification. It is evaded in deductive methods (by asking the user to provide the inductive argument) and in model checking (by requiring finiteness, or else bounded model checking, a trivial widening), and it makes analysis harder than combinatorial enumeration and verification [51]. More sophisticated extrapolation and interpolation methods are needed to boost the precision of static analysis.

1.3.10 Calculational Design of Abstract Interpreters

Given a semantics and an abstraction, an abstract interpreter can be designed by calculus, mostly by folding and unfolding definitions, by simplifications, and, at crucial points in the proof, by well-chosen approximations. While such proofs can be checked a posteriori [4, 59, 70], we lack tools that would automate the symbolic computations, except maybe at a few approximation points, although progress in program synthesis would certainly be applicable to such formal computations.

1.3.11 Language-Independent Abstract Interpretation

While the theory of abstract interpretation has aimed at being independent of any specific programming language, the practice has been very different. Type inference, verification, static analysis, etc. are almost always tightly tied to a specific model of programs, if not to a specific programming language. Designing a multi-language abstract interpreter is a challenge. If it is nowadays possible to design static analyzers that can handle both C and Python [71], it is much more difficult to consider, say, C++, Java, Lisa [2], Lustre [63, 64], Parallel Prolog [55], and OCaml. Even something looking simpler, like extending Astrée to analyze inlined assembly code, is a very complex task [14]. The most common practical approach is to compile to an intermediate language and then analyze this “universal” intermediate language. The approach is manual, and the difficulty is to connect the source and object languages so that the analysis at the object level can take into account the specificities of the source and report its findings with respect to the source. An example of the difficulty is Verasco [69], which analyzes C code using one of the intermediate languages of CompCert but reports with respect to this intermediate language, so that messages are not related to the source and are therefore hardly understandable. One of the big difficulties in achieving language independence is that the semantics of these languages, whether operational, denotational, or axiomatic, are completely language-dependent, and so are their abstractions. Transition systems, as used in [15] and later in model checking, are certainly language-independent and can be given any of the known semantics by abstraction [22], but are of very limited expressivity. An example is [40], attempting to describe


the abstract syntax and transitional semantics, including parallelism, in a language-independent way. It is applied to the soundness proof of a variant of Hoare logic that can be further abstracted using abstract domains. Another attempt is [19], which consists in considering a meta-language to define the (collecting) semantics of the language, accompanied by abstractions of this meta-language's constructs and data, which yield abstractions of the defined language. Instances of this approach are [79] and the “skeletal semantics” of [9]. Again, it is very difficult to be flexible in the representation of computations used in the meta-semantics (and not make an arbitrary choice, such as execution traces or sets of reachable states for the collecting semantics) and to incorporate language-dependent abstractions in the meta-language (e.g. for pointer analysis). But this avenue has not been much explored, and progress is certainly possible beyond the first steps taken by [71].

1.3.12 New Computation Models and Challenges

Being of general scope, abstract interpretation has been shown to be broadly applicable in computer science and is certainly the only verification method that has scaled up to huge industrial software over the past decades (think, e.g., of Astrée and Zoncolan). Innovation in computer science evolves rapidly, which opens new needs for sound verification, hence for abstract interpretation. Let us just mention a few examples.

Probabilistic Programming and Analyses The static analysis of probabilistic properties or probabilistic programs has a long history [1, 10, 52, 80, 81], but precise abstraction and inference of probabilistic properties (like sets of distributions) are not well-developed enough.

Smart Contracts Smart contracts are commonly associated with cryptocurrencies and decentralized finance applications that take place on a blockchain or distributed ledger. A smart contract can be any kind of computer program written in a Turing-complete language, and is therefore subject to penalizing errors. The need for verification and analysis is obvious, and abstract interpretation is directly applicable [5, 65, 86].

Data Sciences Machine learning has made huge progress in the last decade, in particular thanks to breakthroughs in neural networks, with a large variety of highly publicized applications. However, it also has a number of weaknesses that are not so much advertised. This paves the way for formal methods and static analysis [97, 99] of required properties such as robustness [60, 82, 91], fairness [75], absence of data leakages [94], and sound behavior of neural-network-controlled physical systems [62].


Quantum Computing Efforts toward building a physical quantum computer may succeed in the near future. Since quantum computers obey the Church-Turing thesis, verification of quantum programs will be undecidable, and so abstract interpretation comes in [85, 101].

Static Analysis of Biological Networks A biological network is used to represent complex sets of binary interactions or relations between various biological entities and their evolution over time. Inevitably, the behavior of such dynamic systems must be approximated to cope with their enormous complexity, which paves the way for the use of abstract interpretation [6, 11, 12, 53, 56].

Economy (Discrete) dynamical systems, including games, have applications in a wide variety of fields, including economy, and their evolution over time requires sound approximation, the domain of predilection of abstract interpretation [88].

1.4 Conclusion

The application of abstract interpretation to semantics, verification, and static analysis needs more powerful semantic models, reusable abstractions via libraries (of complex data and control structures), a significant deepening of abstract inference, and the computer-assisted calculational design of sound extensible abstract interpreters. New application domains appear over time (we have only cited a few, to follow the hype) and require more abstractions of formal semantic domains of computation. If the mentioned open problems do not look difficult enough, you can combine them to get an impossible challenge, like using machine learning to automatically design a sound-by-design static analyzer for your favorite language (or, better, any language) running on a quantum computer.

References

1. Adjé, A., Bouissou, O., Goubault-Larrecq, J., Goubault, E., Putot, S.: Static analysis of programs with imprecise probabilistic inputs. In: VSTTE, Lecture Notes in Computer Science, vol. 8164, pp. 22–47. Springer (2013) 2. Alglave, J., Cousot, P.: Syntax and analytic semantics of LISA (2016). arxiv:abs/1608.06583 3. Alglave, J., Cousot, P.: Ogre and pythia: an invariance proof method for weak consistency models. In: POPL, pp. 3–18. ACM (2017) 4. Barthe, G., Blazy, S., Laporte, V., Pichardie, D., Trieu, A.: Verified translation validation of static analyses. In: CSF, pp. 405–419. IEEE Computer Society (2017) 5. Bau, G., Miné, A., Botbol, V., Bouaziz, M.: Abstract interpretation of Michelson smart contracts. In: SOAP@PLDI, pp. 36–43. ACM (2022)


6. Beica, A., Feret, J., Petrov, T.: Tropical abstraction of biochemical reaction networks with guarantees. In: SASB, Electronic Notes in Theoretical Computer Science, vol. 350, pp. 3–32. Elsevier (2020) 7. Black, P.E., Walia, K.S.: SATE VI Ockham Sound Analysis Criteria. NIST, IR 8304 (2000). https://nvlpubs.nist.gov/nistpubs/ir/2020/NIST.IR.8304.pdf 8. Blanchet, B., Cousot, P., Cousot, R., Feret, J., Mauborgne, L., Miné, A., Monniaux, D., Rival, X.: A static analyzer for large safety-critical software. In: PLDI, pp. 196–207. ACM (2003) 9. Bodin, M., Gardner, P., Jensen, T.P., Schmitt, A.: Skeletal semantics and their interpretations. Proc. ACM Program. Lang. 3(POPL), 44:1–44:31 (2019) 10. Bouissou, O., Goubault, E., Putot, S., Chakarov, A., Sankaranarayanan, S.: Uncertainty propagation using probabilistic affine forms and concentration of measure inequalities. In: TACAS, Lecture Notes in Computer Science, vol. 9636, pp. 225–243. Springer (2016) 11. Boutillier, P., Camporesi, F., Coquet, J., Feret, J., Lý, K.Q., Théret, N., Vignet, P.: Kasa: A static analyzer for kappa. In: CMSB, Lecture Notes in Computer Science, vol. 11095, pp. 285–291. Springer (2018) 12. Boutillier, P., Cristescu, I., Feret, J.: Counters in kappa: semantics, simulation, and static analysis. In: ESOP, Lecture Notes in Computer Science, vol. 11423, pp. 176–204. Springer (2019) 13. Casso, I., Morales, J.F., López-García, P., Giacobazzi, R., Hermenegildo, M.V.: Computing abstract distances in logic programs. In: LOPSTR, Lecture Notes in Computer Science, vol. 12042, pp. 57–72. Springer (2019) 14. Chevalier, M., Feret, J.: Sharing ghost variables in a collection of abstract domains. In: VMCAI, Lecture Notes in Computer Science, vol. 11990, pp. 158–179. Springer (2020) 15. Cousot, P.: Méthodes itératives de construction et d’approximation de points fixes d’opérateurs monotones sur un treillis, analyse sémantique de programmes (in French). Thèse d’État ès sciences mathématiques, Université Joseph Fourier, Grenoble, France (1978) 16. Cousot, P.: Méthodes itératives de construction et d’approximation de points fixes d’opérateurs monotones sur un treillis, analyse sémantique des programmes. In: University of Grenoble (1978) 17. Cousot, P.: Abstract interpretation. ACM Comput. Surv. 28(2), 324–328 (1996). 18. Cousot, P.: Program analysis: the abstract interpretation perspective. ACM Comput. Surv. 28(4es), 165 (1996) 19. Cousot, P.: Abstract interpretation based static analysis parameterized by semantics. In: SAS, Lecture Notes in Computer Science, vol. 1302, pp. 388–394. Springer (1997) 20. Cousot, P.: The calculational design of a generic abstract interpreter. In: M. Broy, R. Steinbrüggen (eds.) Calculational System Design. NATO ASI Series F. IOS Press, Amsterdam (1999). 21. Cousot, P.: Directions for research in approximate system analysis. ACM Comput. Surv. 31(3es), 6 (1999) 22. Cousot, P.: Constructive design of a hierarchy of semantics of a transition system by abstract interpretation. Theor. Comput. Sci. 277(1–2), 47–103 (2002). 23. Cousot, P.: Integrating physical systems in the static analysis of embedded control software. In: APLAS, Lecture Notes in Computer Science, vol. 3780, pp. 135–138. Springer (2005) 24. Cousot, P.: The verification grand challenge and abstract interpretation. In: VSTTE, Lecture Notes in Computer Science, vol. 4171, pp. 189–201. Springer (2005) 25. Cousot, P.: Abstracting induction by extrapolation and interpolation. In: VMCAI, Lecture Notes in Computer Science, vol. 8931, pp. 19–42. 
Springer (2015) 26. Cousot, P.: Abstract semantic dependency. In: SAS, Lecture Notes in Computer Science, vol. 11822, pp. 389–410. Springer (2019) 27. Cousot, P.: A formal introduction to abstract interpretation. In: Pretschner, A., Müller, P., Stöckle, P. (eds.) Calculational System Design. NATO SPS, Series D, vol. 53. IOS Press, Amsterdam (2019) 28. Cousot, P.: On fixpoint/iteration/variant induction principles for proving total correctness of programs with denotational semantics. In: LOPSTR, Lecture Notes in Computer Science, vol. 12042, pp. 3–18. Springer (2019)


29. Cousot, P.: Syntactic and semantic soundness of structural dataflow analysis. In: SAS, Lecture Notes in Computer Science, vol. 11822, pp. 96–117. Springer (2019) 30. Cousot, P.: Calculational design of a regular model checker by abstract interpretation. Theor. Comput. Sci. 869, 62–84 (2021). 31. Cousot, P.: Principles of Abstract Interpretation, 1 edn. MIT Press (2021) 32. Cousot, P.: Asynchronous correspondences between hybrid trajectory semantics. In: Tom Henzinger Festschrift, Lecture Notes in Computer Science, vol. 13660. Springer (2022). To appear 33. Cousot, P., Cousot, R.: Static determination of dynamic properties of programs. In: Proceedings of the Second International Symposium on Programming, pp. 106–130. Dunod (1976) 34. Cousot, P., Cousot, R.: Static determination of dynamic properties of programs. In: Proceedings of the Second International Symposium on Programming, pp. 106–130. Dunod, Paris, France (1976) 35. Cousot, P., Cousot, R.: Abstract interpretation: a unified lattice model for static analysis of programs by construction or approximation of fixpoints. In: POPL, pp. 238–252. ACM (1977) 36. Cousot, P., Cousot, R.: Static determination of dynamic properties of recursive procedures. In: Formal Description of Programming Concepts, pp. 237–278. North-Holland (1977) 37. Cousot, P., Cousot, R.: Systematic design of program analysis frameworks. In: POPL, pp. 269–282. ACM Press (1979) 38. Cousot, P., Cousot, R.: Semantic analysis of communicating sequential processes (shortened version). In: ICALP, Lecture Notes in Computer Science, vol. 85, pp. 119–133. Springer (1980) 39. Cousot, P., Cousot, R.: Invariance proof methods and analysis techniques for parallel programs. In: Biermann, A., Guiho, G., Kodratoff, Y. (eds.) Automatic Program Construction Techniques, chap. 12, pp. 243–271. Macmillan, New York, New York, USA (1984) 40. Cousot, P., Cousot, R.: A language independent proof of the soundness and completeness of generalized hoare logic. Inf. Comput. 80(2), 165–191 (1989). 41. Cousot, P., Cousot, R.: Inductive definitions, semantics and abstract interpretation. In: POPL, pp. 83–94. ACM Press (1992) 42. Cousot, P., Cousot, R.: Basic concepts of abstract interpretation. In: IFIP Congress Topical Sessions, IFIP, vol. 156, pp. 359–366. Kluwer/Springer (2004) 43. Cousot, P., Cousot, R.: Bi-inductive structural semantics. Inf. Comput. 207(2), 258–283 (2009). 44. Cousot, P., Cousot, R.: A gentle introduction to formal verification of computer systems by abstract interpretation. In: Esparza, J., Grumberg, O., Broy, M. (eds.) Logics and Languages for Reliability and Security, NATO Science Series III: Computer and Systems Sciences, pp. 1–29. IOS Press (2010) 45. Cousot, P., Cousot, R.: A gentle introduction to formal verification of computer systems by abstract interpretation. In: Logics and Languages for Reliability and Security, NATO Science for Peace and Security Series—D: Information and Communication Security, vol. 25, pp. 1–29. IOS Press (2010) 46. Cousot, P., Cousot, R.: An abstract interpretation framework for termination. In: POPL, pp. 245–258. ACM (2012) 47. Cousot, P., Cousot, R.: Abstract interpretation: past, present and future. In: CSL-LICS, pp. 2:1–2:10. ACM (2014) 48. Cousot, P., Cousot, R., Feret, J., Mauborgne, L., Miné, A., Monniaux, D., Rival, X.: The astreé analyzer. In: ESOP, Lecture Notes in Computer Science, vol. 3444, pp. 21–30. Springer (2005) 49. 
Cousot, P., Cousot, R., Mauborgne, L.: The reduced product of abstract domains and the combination of decision procedures. In: FoSSaCS, Lecture Notes in Computer Science, vol. 6604, pp. 456–472. Springer (2011) 50. Cousot, P., Ganty, P., Raskin, J.: Fixpoint-guided abstraction refinements. In: SAS, Lecture Notes in Computer Science, vol. 4634, pp. 333–348. Springer (2007) 51. Cousot, P., Giacobazzi, R., Ranzato, F.: Program analysis is harder than verification: a computability perspective. In: CAV (2), Lecture Notes in Computer Science, vol. 10982, pp. 75–95. Springer (2018)


52. Cousot, P., Monerau, M.: Probabilistic abstract interpretation. In: ESOP, Lecture Notes in Computer Science, vol. 7211, pp. 169–193. Springer (2012) 53. Danos, V., Feret, J., Fontana, W., Krivine, J.: Abstract interpretation of cellular signalling networks. In: VMCAI, Lecture Notes in Computer Science, vol. 4905, pp. 83–97. Springer (2008) 54. Deng, C., Cousot, P.: The systematic design of responsibility analysis by abstract interpretation. ACM Trans. Program. Lang. Syst. 44(1), 3:1–3:90 (2022) 55. Dovier, A., Formisano, A., Gupta, G., Hermenegildo, M.V., Pontelli, E., Rocha, R.: Parallel logic programming: a sequel (2021). arxiv:abs/2111.11218 56. Fages, F., Soliman, S.: Abstract interpretation and types for systems biology. Theor. Comput. Sci. 403(1), 52–70 (2008). 57. Farjudian, A., Moggi, E.: Robustness, scott continuity, and computability (2022). 10.48550/ARXIV.2208.12347. arxiv:abs/2208.12347 58. Feret, J.: Static analysis of digital filters. In: ESOP, Lecture Notes in Computer Science, vol. 2986, pp. 33–48. Springer (2004) 59. Franceschino, L., Pichardie, D., Talpin, J.: Verified functional programming of an abstract interpreter. In: SAS, Lecture Notes in Computer Science, vol. 12913, pp. 124–143. Springer (2021) 60. Gehr, T., Mirman, M., Drachsler-Cohen, D., Tsankov, P., Chaudhuri, S., Vechev, M.T.: AI2: safety and robustness certification of neural networks with abstract interpretation. In: IEEE Symposium on Security and Privacy, pp. 3–18. IEEE Computer Society (2018) 61. Giacobazzi, R., Ranzato, F.: History of abstract interpretation. IEEE Ann. Hist. Comput. 44(2), 33–43 (2022). 62. Goubault, E., Putot, S.: RINO: robust inner and outer approximated reachability of neural networks controlled systems. In: CAV (1), Lecture Notes in Computer Science, vol. 13371, pp. 511–523. Springer (2022) 63. Halbwachs, N.: About synchronous programming and abstract interpretation. Sci. Comput. Program. 31(1), 75–89 (1998). 64. Halbwachs, N., Proy, Y., Roumanoff, P.: Verification of real-time systems using linear relation analysis. Formal Methods Syst. Des. 11(2), 157–185 (1997). 65. Henglein, F., Larsen, C.K., Murawska, A.: A formally verified static analysis framework for compositional contracts. In: Financial Cryptography Workshops, Lecture Notes in Computer Science, vol. 12063, pp. 599–619. Springer (2020) 66. Hiriart-Urruty, J.B., Lemaréchal, C.: Fundamentals of convex analysis, 2nd edn. Springer (2004) 67. Illous, H., Lemerre, M., Rival, X.: A relational shape abstract domain. Formal Methods Syst. Des. 57(3), 343–400 (2021). 68. Jeannet, B., Miné, A.: Apron: a library of numerical abstract domains for static analysis. In: CAV, Lecture Notes in Computer Science, vol. 5643, pp. 661–667. Springer (2009) 69. Jourdan, J.: Verasco: a formally verified C static analyzer. (verasco: un analyseur statique pour C formellement vérifié). Ph.D. thesis, Paris Diderot University, France (2016) 70. Jourdan, J., Laporte, V., Blazy, S., Leroy, X., Pichardie, D.: A formally-verified C static analyzer. In: POPL, pp. 247–259. ACM (2015) 71. Journault, M., Miné, A., Monat, R., Ouadjaout, A.: Combinations of reusable abstract domains for a multilingual static analyzer. In: VSTTE, Lecture Notes in Computer Science, vol. 12031, pp. 1–18. Springer (2019) 72. Ko, Y., Rival, X., Ryu, S.: Weakly sensitive analysis for javascript object-manipulating programs. Softw. Pract. Exp. 49(5), 840–884 (2019). 73. Leroy, X.: Formally verifying a compiler: What does it mean, exactly? In: ICALP, LIPIcs, vol. 55, pp. 2:1–2:1. 
Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2016) 74. Logozzo, F., Fahndrich, M., Mosaad, I., Hooimeijer, P.: Zoncolan: How Facebook uses static analysis to detect and prevent security issues. Engineering at Meta (2019). https://engineering. fb.com/2019/08/15/security/zoncolan/


75. Mazzucato, D., Urban, C.: Reduced products of abstract domains for fairness certification of neural networks. In: SAS, Lecture Notes in Computer Science, vol. 12913, pp. 308–322. Springer (2021) 76. Miné, A.: Relational thread-modular static value analysis by abstract interpretation. In: VMCAI, Lecture Notes in Computer Science, vol. 8318, pp. 39–58. Springer (2014) 77. Miné, A.: Tutorial on static inference of numeric invariants by abstract interpretation. Found. Trends Program. Lang. 4(3–4), 120–372 (2017). 78. Miné, A.: The octagon abstract domain. Higher-Order and Symbolic Computation 19(1), 31–100 (2006) 79. Mirliaz, S., Pichardie, D.: A flow-insensitive-complete program representation. In: VMCAI, Lecture Notes in Computer Science, vol. 13182, pp. 197–218. Springer (2022) 80. Monniaux, D.: Abstract interpretation of probabilistic semantics. In: SAS, Lecture Notes in Computer Science, vol. 1824, pp. 322–339. Springer (2000) 81. Monniaux, D.: Backwards abstract interpretation of probabilistic programs. In: ESOP, Lecture Notes in Computer Science, vol. 2028, pp. 367–382. Springer (2001) 82. Munakata, S., Urban, C., Yokoyama, H., Yamamoto, K., Munakata, K.: Verifying attention robustness of deep neural networks against semantic perturbations (2022). arxiv:abs/2207.05902 83. Nicole, O., Lemerre, M., Rival, X.: Lightweight shape analysis based on physical types. In: VMCAI, Lecture Notes in Computer Science, vol. 13182, pp. 219–241. Springer (2022) 84. Ore, O.: Galois connexions. Trans. Amer. Math. Soc. 55(3), 493–513 (1944) 85. Perdrix, S.: Quantum entanglement analysis based on abstract interpretation. In: SAS, Lecture Notes in Computer Science, vol. 5079, pp. 270–282. Springer (2008) 86. Perez-Carrasco, V., Klemen, M., López-García, P., Morales, J.F., Hermenegildo, M.V.: Cost analysis of smart contracts via parametric resource analysis. In: SAS, Lecture Notes in Computer Science, vol. 12389, pp. 7–31. Springer (2020) 87. Plofker, K.: Mathematics in India. Princeton University Press (2007) 88. Ranzato, F.: Abstract interpretation of supermodular games. In: SAS, Lecture Notes in Computer Science, vol. 9837, pp. 403–423. Springer (2016) 89. Rival, X., Yi, K.: Introduction to Static Analysis. MIT Press (2020) 90. Sharir, M., Pnueli, A.: Two approaches to interprocedural data flow analysis. In: Muchnick, S., Jones, N. (eds.) Program Flow Analysis: Theory and Applications, chap. 7, pp. 189–342. Prentice–Hall (1981) 91. Singh, G., Gehr, T., Püschel, M., Vechev, M.T.: An abstract domain for certifying neural networks. Proc. ACM Program. Lang. 3(POPL), 41:1–41:30 (2019) 92. Singh, G., Püschel, M., Vechev, M.T.: Making numerical program analysis fast. In: PLDI, pp. 303–313. ACM (2015) 93. Singh, G., Püschel, M., Vechev, M.T.: Fast polyhedra abstract domain. In: POPL, pp. 46–59. ACM (2017) 94. Subotic, P., Bojanic, U., Stojic, M.: Statically detecting data leakages in data science code. In: SOAP@PLDI, pp. 16–22. ACM (2022) 95. Suzanne, T., Miné, A.: From array domains to abstract interpretation under store-bufferbased memory models. In: SAS, Lecture Notes in Computer Science, vol. 9837, pp. 469–488. Springer (2016) 96. Tripp, O., Pistoia, M., Cousot, P., Cousot, R., Guarnieri, S.: Andromeda: accurate and scalable security analysis of web applications. In: FASE, Lecture Notes in Computer Science, vol. 7793, pp. 210–225. Springer (2013) 97. Urban, C.: Static analysis of data science software. In: SAS, Lecture Notes in Computer Science, vol. 11822, pp. 17–23. Springer (2019) 98. 
Urban, C., Miné, A.: Inference of ranking functions for proving temporal properties by abstract interpretation. Comput. Lang. Syst. Struct. 47, 77–103 (2017).


99. Urban, C., Miné, A.: A review of formal methods applied to machine learning (2021). arxiv:abs/2104.02466 100. Urban, C., Ueltschi, S., Müller, P.: Abstract interpretation of CTL properties. In: SAS, Lecture Notes in Computer Science, vol. 11002, pp. 402–422. Springer (2018) 101. Yu, N., Palsberg, J.: Quantum abstract interpretation. In: PLDI, pp. 542–558. ACM (2021)

Chapter 2

LiSA: A Generic Framework for Multilanguage Static Analysis Luca Negrini, Pietro Ferrara, Vincenzo Arceri, and Agostino Cortesi

Abstract Modern software engineering revolves around distributed applications. From IoT networks to client-server infrastructures, application code is increasingly being divided into separate sub-programs interacting with each other. As these are completely independent of each other, each such program is likely to be developed in a separate programming language, choosing the best fit for the task at hand. From a static program analysis perspective, taking on a mixture of languages is challenging. This paper defines a generic framework where modular multilanguage static analyses can be defined through the abstract interpretation theory. The framework has been implemented in LiSA (Library for Static Analysis), an open-source Java library that provides the complete infrastructure necessary for developing static analyzers. LiSA strives to be modular, ensuring that all components taking part in the analysis are both easy to develop and highly interchangeable. LiSA also ensures that components are parametric in all language-specific features: semantics, execution model, and memory model are not directly encoded within the components themselves. A proof-of-concept instantiation is provided, demonstrating LiSA's capability to analyze multiple languages in a single analysis through the discovery of an IoT vulnerability spanning C++ and Java code.

L. Negrini (B) · P. Ferrara · A. Cortesi Ca’ Foscari University of Venice, Venice, Italy e-mail: [email protected] P. Ferrara e-mail: [email protected] A. Cortesi e-mail: [email protected] L. Negrini Corvallis SRL, Padova, Italy V. Arceri University of Parma, Parma, Italy e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 V. Arceri et al. (eds.), Challenges of Software Verification, Intelligent Systems Reference Library 238, https://doi.org/10.1007/978-981-19-9601-6_2


2.1 Introduction

Software governs most aspects of everyday life. Almost every human action, regardless of whether it is for work or leisure, involves at least one device that is running a program. Proving these programs correct is as important as ever, as they can collect all sorts of sensitive information (for instance, the contents of medical records) or govern critical processes (like driving a car).

Software architecture has dramatically evolved in the last decades. The classical client-server architecture, once characteristic of web applications, has recently seen broader adoption with the advent of mobile applications. Moreover, the commercial drive toward Software as a Service (SaaS) [36], where vendors only distribute simple clients to customers while keeping all of the application logic remote, has led to a huge increase in cloud computing solutions. Since clients and servers have very different purposes, the programming languages used to implement them are typically different. The backend is also becoming less and less monolithic. Recent years have seen the rise of microservices infrastructures [10], where the atomic entity that was the server is split into smaller independent components that communicate with each other through APIs. Backend logic has also started being implemented through serverless applications, that is, code that runs in the cloud with (close to) no knowledge about the environment it runs in. Partitioning the server code into isolated entities also loosens the requirement of having those entities written in the same language, as different tasks might exploit different languages' peculiarities. One more possible segmentation of the backend comes with blockchain-oriented applications that interact with code present on a blockchain [42]. Smart contracts are usually written in specific DSLs (e.g. Solidity) dedicated to a particular blockchain in order to exploit its capabilities. Only recently has a stream of blockchains adopted general-purpose languages for writing smart contracts [3, 7, 32, 33].

Besides the transformation of client-server architectures, the Internet of Things (IoT) has also risen in popularity. Devices running embedded software can interact with various backends or other devices. IoT networks are becoming widely adopted in several areas [52]: healthcare, smart homes, and manufacturing are just a few of the scenarios where they are applied. Once more, different programming languages can (and likely will) be involved in the realization of the system.

2.1.1 An Illustrative Example

Consider for instance the following minimal example. The code reported in Fig. 2.1 (available at https://github.com/amitmandalnitdgp/IOTJoyCar) has been used to prove the usefulness of static analysis for discovering IoT vulnerabilities in [24]. The code implements a system composed of a joystick and a robotic car that interact through a gateway. The Java fragment in Fig. 2.1a, running on the gateway, initializes the whole system and then repeatedly queries the joystick for steer and throttle. The C++ fragment in Fig. 2.1b instead, implementing the remaining two components, interacts with the joystick's sensors and the car's motor. The two fragments communicate through the Java Native Interface (JNI).

Fig. 2.1 Excerpt of the JoyCar source code

Here, the authors are interested in detecting the IoT Injection attack that can happen if the sensors' outputs, which could be tampered with, flow into the motor's input without being sanitized, exposing the car to attacks that could damage it. The authors resort to Taint analysis [21, 51] for the task, but they require more than one analyzer: since the flow might span the two codebases, analyses for both Java and C++ are needed. Julia and CodeSonar were selected for the task, as both were equipped with configurable Taint analysis engines able to receive a specification of sources and sinks from the user. The authors' overall analysis proceeds as follows:

1. the value returned by function readAnalog was marked as a source of tainted information for CodeSonar;
2. the second parameter of function softPwmWrite was marked as a sink for tainted information for CodeSonar;
3. to detect tainted values flowing from C++ to Java code, the value returned by Java_JoyCar_readUpDown was marked as a sink for CodeSonar;
4. to detect tainted values flowing from Java to C++ code, the first parameter of JoyCar.runMotor was marked as a sink for Julia;
5. a first round of Taint analysis was run with both analyzers: Julia did not find any vulnerability (as no sources were present on the Java side), but CodeSonar did find a flow of tainted data going into Java_JoyCar_readUpDown;
6. a second round was run after marking JoyCar.readUpDown's return value as a source for the Java analysis, and this time Julia detected a flow of tainted data going into the first parameter of JoyCar.runMotor;
7. a third and final round was run after marking Java_JoyCar_runMotor's third parameter as a source for CodeSonar, which was now able to detect the IoT Injection vulnerability with softPwmWrite as sink.

Despite the successful discovery of a vulnerability that spanned multiple languages, the limits of this approach are quite evident: since the tools need to exchange information at each iteration, multiple rounds of analysis are needed to reach a fixpoint over the shared information, each comprising one analysis per tool involved. Moreover, tool communication is hard, all the more so if the tools come from different vendors, as they might not agree on how information is exported and imported. Furthermore, communicating the information itself might not be an easy task. In this example, the authors focused on a “binary” property: a value is either tainted or not. However, with more complex structures (e.g. Polyhedra [17]), finding the appropriate format to exchange information between analyses might not be trivial.
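The “binary” property mentioned above corresponds to the simplest possible abstract domain: a two-point lattice. The following self-contained Java sketch (illustrative only; not the domain implemented by Julia or CodeSonar, whose internals are not described here) shows it, together with the least upper bound used when control-flow paths join.

    // A minimal two-point taint lattice: a value is either definitely clean
    // or possibly tainted. Illustrative sketch, names invented for this text.
    enum Taint {
        CLEAN, TAINTED;

        // Least upper bound at control-flow joins: tainted wins.
        Taint lub(Taint other) {
            return (this == TAINTED || other == TAINTED) ? TAINTED : CLEAN;
        }

        // Transfer function for a binary operation: taintedness propagates.
        static Taint binop(Taint left, Taint right) {
            return left.lub(right);
        }

        public static void main(String[] args) {
            Taint source = TAINTED;  // e.g. a value read from a sensor
            Taint constant = CLEAN;  // e.g. a numeric literal
            System.out.println(binop(source, constant)); // TAINTED
        }
    }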

2.1.2 Contribution and Paper Structure

This paper formalizes the structure of LiSA, a Library for Static Analysis aimed at achieving multilanguage analysis, that can be used to create static analyzers by abstract interpretation [14, 15]. Roughly, LiSA provides the full infrastructure of a static analyzer: starting from an intermediate representation in the form of extensible control flow graphs (CFGs), LiSA lets users define analysis components, and then takes care of orchestrating the analysis using a unique fixpoint algorithm over CFGs. Moreover, parsing logic is left to the user, who will define frontends translating source code into CFGs (modeling the syntax of the input program), while also providing rewriting rules from each CFG node into symbolic expressions, an internal extensible language representing atomic semantic operations (thus modeling the semantics of each instruction of the input program). We then provide a proof-of-concept multilanguage analysis using LiSA on the example reported in Sect. 2.1.1. LiSA comes as a Java library available on GitHub (https://github.com/lisa-analyzer/lisa).

The remainder of this paper is structured as follows. A high-level overview is first introduced in Sect. 2.2, depicting the role of all the analysis components and how they cooperate to perform analyses. CFGs and symbolic expressions are discussed in Sect. 2.3, defining the languages that LiSA uses for syntax and semantics, respectively. Section 2.4 describes the modular Analysis State used to represent pre- and post-states. The Interprocedural Analysis that orchestrates LiSA's fixpoint is presented in Sect. 2.5. In Sect. 2.6, we define the role of frontends in compiling source programs into LiSA's CFGs. We conclude the paper with a proof-of-concept multilanguage analysis on the IoT network of Sect. 2.1.1 in Sect. 2.7. An in-depth discussion of each component is available in [39].

2.2 LiSA’s Overall Architecture We begin by providing a high-level overview of the analysis pipeline, that is shown in Fig. 2.2. The analysis starts by logically partitioning the input application P into programs Pi , each written in a single programming language Li . Li -to-CFG compilers, called frontends (top-left corner of Fig. 2.2), are invoked on such programs to obtain a uniform representation of all the code to analyze in the form of a LiSA program PiL . Frontends are more than compilers, as they also provide logic to the analysis, such as language-specific semantics of the instructions present in CFGs, semantic models for library code, and language-specific algorithms for common language features (e.g. call resolution and inheritance rules). The final version P L of the translated program to analyze is the union of all PiL . At this point, LiSA can be invoked on P L with a configuration of the analysis features and the implementations of the various components that are to be executed, namely: • the Interprocedural Analysis, responsible for the computation of the overall program fixpoint and for computing the results of function calls; • the Call Graph, used by the Interprocedural Analysis to find call targets; • the Abstract State, that computes the abstract values during the analysis; • the set of Checks, that produce warnings based on the result of the analysis. P L is fed to the Interprocedural Analysis (left-most block within LiSA in Fig. 2.2), that will compute a fixpoint over it. When the Interprocedural Analysis needs to analyze an individual CFG, it will invoke a unique fixpoint algorithm defined directly on CFGs (central portion of LiSA in Fig. 2.2). As the language-specific semantics of instructions is embedded in CFG nodes, called Statements, the fixpoint algorithm will use such semantics as transfer function. If the Statement performs calls as part of its semantics, it will interact back with the Interprocedural Analysis to determine the returned values, as how those are evaluated depends directly on how the overall fixpoint is computed. If the call’s targets are unknown (for instance, if the call happens in a language with dynamic method dispatching), the Interprocedural Analysis can delegate targets resolution to the Call Graph (inner component of Interprocedural Analysis in Fig. 2.2), that will use type information together with language-specific execution model to determine all possible targets. Alternatively, a non-calling Statement’s semantics can also rewrite the node into a sequence of symbolic expressions, that is, atomic instructions with precise semantics, that can be passed to the Analysis State for evaluation. LiSA’s Analysis State (right-most block of LiSA in Fig. 2.2) is composed by an Abstract State modeling the memory state at

24

L. Negrini et al.

Fig. 2.2 LiSA’s architecture

a given program point, together with a collection of symbolic expressions that are left on the stack after evaluating it. An Abstract State is a flexible entity whose main duty is to make the Value Domain, responsible for tracking values of program variables, communicate with the Heap Domain, that instead tracks how the dynamic memory of the program evolves at runtime. Whenever an expression is to be evaluated by an Abstract State, the latter first passes it to the Heap Domain, that will record all of its effects on the memory, such as the allocation of new regions or the access to some object’s fields. Then, the Heap Domain will rewrite all portions of the original expression that deal with the memory. According to the implementation-specific logic, one or more instrumented variables will be used to replace such portions, modeling their resolution to memory addresses that can be treated as regular variables. After the rewriting has been performed, the resulting expression will be passed to the Value Domain, that will track values and properties regarding the variables appearing in it. Note that, with this architecture, each component simplifies the program for the rest of the analysis pipeline. Interprocedural Analysis abstracts away calls from the program to analyze, leaving the Analysis State and its sub-components with noncalling programs. Successively, the Heap Domain removes every expression that deals with dynamic memory, substituting it with synthetic variables. At this point, the Value Domain has to deal with programs containing only variables, constants, and operators between them. When an overall fixpoint is reached, the computed pre- and post-states for each Statement, together with the Call Graph that has been built up, are passed to the Checks (top-right corner within LiSA in Fig. 2.2) that have been provided to the analysis. These are simply visitors of the program’s syntax, that can use the information computed by the analysis to issue warnings for the user. Since these are a standard component of static analyzers, they will be omitted by this work.
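The pipeline just described can be summarized by the following sketch. This is a hypothetical wiring written for this overview: all type and method names (Frontend, AnalysisDriver, fixpoint, and so on) are invented for illustration and mirror only the roles of Fig. 2.2, not LiSA's concrete API.

    // Hypothetical wiring of the components described above; names are
    // illustrative and do not reflect LiSA's actual configuration API.
    interface Frontend { Program parse(String sourcePath); }
    interface InterproceduralAnalysis { Results fixpoint(Program p); }
    interface Check { void visit(Program p, Results r); }

    final class Program { /* union of the per-language CFG programs */ }
    final class Results { /* pre- and post-states for every statement */ }

    final class AnalysisDriver {
        // Sketch of the pipeline: compile each sub-program with its frontend,
        // merge into one program, compute the fixpoint, then run the checks.
        static void run(java.util.Map<Frontend, String> sources,
                        InterproceduralAnalysis interproc,
                        java.util.List<Check> checks) {
            Program merged = new Program();
            for (java.util.Map.Entry<Frontend, String> e : sources.entrySet())
                merge(merged, e.getKey().parse(e.getValue()));
            Results results = interproc.fixpoint(merged);
            for (Check c : checks)
                c.visit(merged, results);
        }
        static void merge(Program into, Program other) { /* union of the CFGs */ }
    }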

Fig. 2.3 Running example for LiSA's architecture overview

Example. Consider the example Java code from Fig. 2.3. To analyze it, a Java frontend will first parse the code and produce three different CFGs, one for each method. Supposing that the Interprocedural Analysis applies context-sensitive [46] reasoning, the analysis could follow call-chains starting from the main CFG, analyzing CFGs as they are invoked. Thus, the first fixpoint algorithm to be invoked would be the one of main. Here, when the call to foo(10) at line 5 is encountered, an entry state for the targets of the call would be set up by assigning 10 to w. How the call is resolved to its targets depends once again on the configuration. Supposing that the Call Graph relies on the runtime types inferred for the receiver [48], the call will be resolved to B.foo, which can then be analyzed (that is, whose fixpoint can be executed) using the prepared state. The code of B.foo deals with heap structures. The assignment at line 15 could be rewritten as l_0 = l_1 * 2 if the Heap Domain is precise enough to distinguish between different fields of the same object, or as l_0 = l_0 * 2 if it is not. Nonetheless, the resulting expression will not contain any reference to memory structures, and can then be processed by the Value Domain (e.g. intervals [14]) agnostically with respect to whether and how a rewriting happened.

2.3 The Internal Language

Before detailing the separate components of LiSA, we introduce and discuss (i) the CFG structure that LiSA uses for representing programs, and (ii) the symbolic expression language used as intermediate representation for the analysis. As programming languages come with wildly different syntaxes, it is important to find a common ground to model their semantics, so that analyses are not required to handle constructs from all languages. In fact, this is a common practice among static analyzers, even ones targeting a single language: moving to a uniform and more convenient intermediate representation (IR), usually enriched with additional information (e.g. dynamic typing), can make writing analyses much easier. Different syntactic constructs with the same (or similar) meaning can be represented by the analyzer as a unique IR construct, and complex ones can be decomposed into a sequence of them. Analyses can then attribute semantic meanings to such IR constructs with no knowledge of the original syntactic ones they represent. Rewriting toward the IR can typically be achieved at parse time, after ingesting the target application, or at analysis time, before passing the code to the abstract domains. LiSA implements a hybrid rewriting: first, source code is compiled to control flow graphs (CFGs) by frontends, then each CFG node is rewritten into one or more symbolic expressions during the analysis. CFGs thus embed syntactic structures and language-specific constructs within them (i.e. + is a syntactic construct that might represent numeric addition, string concatenation, …), while each symbolic expression has a unique semantic meaning. LiSA's programs are thus composed of CFGs, which can be logically grouped in CompilationUnits, a generalization of the concept of classes in object-oriented software. As both the structure and meaning of CompilationUnits mirror those of classes (with some additional parametrization for language-specific features like multiple inheritance), and do not directly influence the infrastructure of LiSA, their definition is omitted in this work. Thus, we will refer to LiSA programs as collections of CFGs.

2.3.1 Control Flow Graphs

Control flow graphs [1] (CFGs) are directed graphs that express the control flow of functions. In a CFG, nodes contain the instructions of the function, and edges express how the execution flows from one node to another. This means that all the syntactic constructs that form loops, branches, and arbitrary jumps are directly encoded in the CFG structure, simplifying the code to analyze. LiSA's CFGs are extensible: Statement and Edge are the base definitions of what nodes and edges are, respectively, while concrete instances are defined in frontends.

A Statement (the base class for CFG nodes) represents an instruction appearing in a function, and thus corresponds to a syntactic construct that does not modify the control flow (that is, it is not a loop, a branch, or an arbitrary jump). When the evaluation of a Statement leaves a value on the operand stack, it is called an Expression, whose type is that of the generated value. Examples of Statements are return and assert, while an Expression can be a reference to a local variable by its name, an assignment, or a sum. The Call expression, together with its descendants, plays a central role in LiSA and will be further discussed in Sect. 2.5. Note that Statements do not have a predefined semantics: in fact, the class defines a semantics method where implementers can define language-specific reasoning and interact with entryState (an instance of Analysis State representing the pre-state) and interproc (an instance of Interprocedural Analysis offering interprocedural reasoning) to compute the post-state for the Statement.

The Edge class is the base class for CFG edges. The traverse method defined by this class expresses how the post-state of its source node, in the form of an Analysis State instance, is transformed when the edge is traversed. LiSA comes with three default implementations for edges: SequentialEdge, TrueEdge and FalseEdge. SequentialEdge represents an unconditional flow of execution from source to destination, with no modification to the initial Analysis State (i.e. its traverse implementation returns its parameter unaltered). TrueEdge instead models a flow of execution conditional on the evaluation of the expression at its source: the execution proceeds by reaching the edge's destination only if it evaluates to true. Conversely, FalseEdge implements a conditional flow of execution that reaches the edge's destination only if the expression at its source evaluates to false.

Finally, LiSA offers native CFGs, that is, CFGs with a single Statement and no Edges, as a means to model individual library functions. Whenever a call to one of these CFGs is found, the call's result can be evaluated by rewriting it into the only statement contained in the native CFG, and then executing its semantics method. Modeling complex or frequently used library functions through native CFGs can drastically reduce the complexity of the analysis, as less code needs to be analyzed, while still providing all the necessary information about the modeled functions.
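The contract of the three default edges can be illustrated with the following sketch. Here the AnalysisState interface and its assume operation are placeholders invented for this illustration (LiSA's actual signatures differ); what matters is that traverse transforms the source's post-state, unconditionally for SequentialEdge and by filtering on the guard for TrueEdge and FalseEdge.

    // Illustrative sketch of the three default edge kinds; AnalysisState and
    // its assume(...) operation are placeholders, not LiSA's API.
    interface AnalysisState {
        // Restrict this state to executions where expr evaluates to the given
        // boolean value (an abstract "filter" on the source's post-state).
        AnalysisState assume(Object expr, boolean value);
    }

    abstract class Edge {
        final Object sourceExpression; // expression computed at the source node
        Edge(Object sourceExpression) { this.sourceExpression = sourceExpression; }
        abstract AnalysisState traverse(AnalysisState sourcePostState);
    }

    final class SequentialEdge extends Edge {
        SequentialEdge() { super(null); }
        // Unconditional flow: the state passes through unchanged.
        AnalysisState traverse(AnalysisState s) { return s; }
    }

    final class TrueEdge extends Edge {
        TrueEdge(Object guard) { super(guard); }
        // Conditional flow: keep only executions where the guard holds.
        AnalysisState traverse(AnalysisState s) {
            return s.assume(sourceExpression, true);
        }
    }

    final class FalseEdge extends Edge {
        FalseEdge(Object guard) { super(guard); }
        // Conditional flow: keep only executions where the guard is false.
        AnalysisState traverse(AnalysisState s) {
            return s.assume(sourceExpression, false);
        }
    }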

2.3.2 Symbolic Expressions

A static analyzer's main duty is to compute program properties by taking into account the semantic meaning of its instructions. In this context, an extensible set of syntactic constructs such as the one provided by LiSA through CFGs comes with an intrinsic problem: instructions (i.e. Statements) do not have a well-defined semantics, as that is parametric to the source language. To recover well-definedness, LiSA adopts a two-phase rewriting: not only is the source program compiled to CFGs, but each of their Statements gets rewritten into symbolic expressions during the analysis. The SymbolicExpression class is the base type for the expressions that LiSA's analysis components understand and analyze. Note that there is a clear distinction between expressions dealing with values of variables (i.e. ValueExpressions) and ones dealing with memory structures (i.e. HeapExpressions). This is a direct consequence of the architecture, introduced in Sect. 2.2 and discussed in Sect. 2.4, that separates domains dealing with the two worlds, decoupling their implementations. ValueExpressions model what can be handled entirely by the Value Domain: constants, variables, and operators between them. On the other side, HeapExpressions represent operations that change or navigate the structure of the dynamic memory of the program. As for CFGs, Statements and Edges, symbolic expressions are also extensible. Note that no symbolic expression is defined for calls, as those are abstracted away by the Interprocedural Analysis (Sect. 2.5).

To better explain how the second rewriting is carried out, let us consider the expression new B() at line 5 of Fig. 2.3. In Java, object instantiation consists of four operations: (i) allocation of a memory region, (ii) creation of a pointer to the region, (iii) invocation of the desired constructor using the fresh pointer as receiver, and (iv) storage of the pointer on the operand stack. Such behavior could be mimicked by a Statement instance with the following (simplified) semantics function:

1   AnalysisState semantics(AnalysisState entryState, InterproceduralAnalysis interproc) {
2     // create a synthetic receiver
3     VariableRef rec = new VariableRef("$receiver");
4     AnalysisState recState = rec.semantics(entryState, interproc);
5     // assign the fresh memory region to the receiver
6     HeapReference ref = new HeapReference(new HeapAllocation());
7     AnalysisState callState = entryState.bottom();
8     for (SymbolicExpression v : recState.getComputedExpressions())
9       callState = callState.lub(recState.assign(v, ref));
10    // call the constructor
11    String name = createdType.toString();
12    Expression[] params = ArrayUtils.insert(0, expressions, rec);
13    UnresolvedCall call = new UnresolvedCall(name, name, params);
14    AnalysisState sem = call.semantics(callState, interproc);
15    // leave a reference on the stack
16    return sem.smallStepSemantics(ref);
17  }

While the methods exploited in this snippet are the subject of the following sections, we can nonetheless capture the intuition behind them. First, a VariableRef (an instance of Statement modeling the syntactic reference to a program's variable by its name) is created at line 3, mimicking the creation of the constructor call's receiver. Then, its semantics is computed at line 4 starting from the pre-state entryState, obtaining a new instance of Analysis State (Sect. 2.4) that will contain the Variable (an instance of SymbolicExpression) corresponding to it. The following five lines are responsible for making such a variable point to a newly allocated memory region: line 9 assigns a pointer (created at line 6) to an in-place allocated region to the Variable corresponding to the receiver (line 8). Next, the call to the constructor is performed at line 14 by (i) extracting the name of the type created by the expression (line 11, where createdType is a field containing the Type being instantiated) as both the class and method name, and (ii) adding the instrumented receiver to the original parameters of the constructor call (line 12, where expressions is a field containing the original parameters and ArrayUtils.insert is a method that clones an array and adds a new element to it). The semantics of the UnresolvedCall, invoked at line 14, will defer the resolution and evaluation to the Interprocedural Analysis. The post-state of the call is then used at line 16 to reprocess the reference to the memory region through smallStepSemantics, returning the result as the final post-state of the whole instruction.


2.4 The Analysis State The state of LiSA’s analyses is modularly built bottom-up, ensuring that each component does not have visibility of its parents and siblings. Modularity entails that no additional knowledge is needed to implement a component other than what is strictly required by it. As presented in [26], this is crucial for picking up LiSA quickly. Each instance of Analysis State represent both elements of an ordered structure (thus exposing methods for partial ordering, least upper bound, …), and abstract transformers able to produce new elements (offering methods such as assign and smallStepSemantics that evaluate the semantics of a symbolic expression). We briefly discuss each internal component of the Analysis State, introducing them bottom-up. A full discussion of each can be found in [39]. LiSA’s Value Domain is the analysis component responsible for tracking properties about program variables, and it is the last component taking action to compute a symbolic expression’s semantics. Examples Value Domain implementations are Interval [14], Polyhedra [17], and Tarsis [40], or their combinations (e.g. products [11] and smashed sums [4]). LiSA provides a Value Domain implementation named ValueEnvironment defining the point-wise logic for Cartesian (i.e. non-relational) abstractions. Domains such as Interval can thus be implemented by (i) providing lattice operations for single intervals, and (ii) defining the logic for expression evaluation. LiSA then takes care of wrapping the domain inside a ValueEnvironment, providing a unique functional lifting to all Cartesian abstractions. An example implementation is presented in [26]. LiSA’s Heap Domain is the analysis component responsible for tracking how the dynamic memory of the program evolves during its execution. As the sole component having full knowledge on how expressions are resolved to memory locations, the Heap Domain operates before the Value Domain as it must simplify memory-dealing expressions that the latter cannot handle. Examples Heap Domain implementations are Andersen’s Pointer Analysis [2] and Shape analysis [45]. Moreover, as for Value Domains, combinations of Heap Domains are still Heap Domains. LiSA’s Abstract State wraps both Value and Heap Domains, coordinating their communication. It is designed after the framework presented in [22], where the two communicate by means of expression rewriting and variable renaming. Roughly, the semantics of such framework lets the two domains compute properties independently whenever an expression only deals with values or memory. When one instead requires knowledge about both worlds, the expression is first evaluated by the Heap Domain that tracks its effects on the memory. Then, abstract locations called heap identifiers are used to replace memory-dealing sub-expressions, and the Value Domain can process this rewritten expression to track properties about those identifiers. Furthermore, as the semantics of the Heap Domain might materialize or merge heap identifiers, a substitution can be applied to the Value Domain when necessary, before computing its semantics. A substitution can be described as a sequence of multivariable replacements between heap identifiers. The concrete implementation of the communication between the Heap Domain and Value Domain is provided by the

30

L. Negrini et al.

Fig. 2.4 Sequence diagram Analysis State’s assign

SimpleAbstractState class. Abstract State is left modular and extensible as further layers of abstraction can be applied on the entire state. For instance, Trace Partitioning [43] must be applied on the state as a whole, and can be defined as an Abstract State implementing a function from execution traces to Abstract States. The Analysis State is the outer-most component, and it is thus the one explicitly visible to the rest of the analysis. A direct implication of this is that other components are agnostic w.r.t. how LiSA abstracts memory structures and values of the program. Its main duty is to wrap the Abstract State together with additional semantic information over which the analysis must reach a fixpoint. Here, we identify as mandatory information only the set of symbolic expressions that are left on the stack after evaluating an arbitrary expression, but more can be added. To grasp the intuition of how the Analysis State operates, consider the sequence diagram of Fig. 2.4, depicting how the assign method behaves. Note that the pattern shown here is also valid for the other semantic operations, as they all follow the overall communication scheme defined in [22]. When the assign method of the Analysis State is invoked, the call is immediately forwarded to the Abstract State. The latter will first compute the effects of the assignment on the dynamic memory through the Heap Domain’s own assign method. Then, since such operation might have caused materialization or merge of heap identifiers, the Abstract State retrieves a substitution from the Heap Domain, and uses it to update the Value Domain. Then, Heap Domain’s rewrite replaces portions of the assigned expression dealing with dynamic memory with heap identifiers, rendering the right-hand of the assignment memory-free. The updated Value Domain instance is then used to evaluate the effects of the assignment on program variables, using Value Domain’s assign method on the rewritten expressions. The new Heap and Value Domain instances are then wrapped into a fresh Abstract State object, that is returned to the caller as part of the


Note that, as Abstract State, Heap Domain and Value Domain are defined modularly, each computational step might hide additional complexity (for instance, the Value Domain could be a Cartesian product of several instances, whose assign methods are recursively invoked). Moreover, the Value Domain can optionally rely on an inner Non-Relational Domain instance to compute the semantics of an expression, exploiting its eval method. A schematic rendition of this orchestration is sketched below.
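The following minimal sketch condenses the assign round-trip between the two domains. All types and signatures here are simplified, illustrative stand-ins; LiSA's actual interfaces carry more information (program points, the expression stack, error handling):

    import java.util.List;

    // Illustrative stand-ins for LiSA's actual types.
    interface Identifier {}
    interface SymbolicExpression {}
    interface Replacement {}

    interface HeapDomain {
        HeapDomain assign(Identifier id, SymbolicExpression e); // memory effects
        List<Replacement> substitution();    // materializations/merges of heap ids
        SymbolicExpression rewrite(SymbolicExpression e); // memory-free rewriting
    }

    interface ValueDomain {
        ValueDomain applySubstitution(List<Replacement> subst); // rename heap ids
        ValueDomain assign(Identifier id, SymbolicExpression memoryFreeExpr);
    }

    final class AbstractStateSketch {
        private final HeapDomain heap;
        private final ValueDomain value;

        AbstractStateSketch(HeapDomain heap, ValueDomain value) {
            this.heap = heap;
            this.value = value;
        }

        AbstractStateSketch assign(Identifier id, SymbolicExpression e) {
            HeapDomain h = heap.assign(id, e);                 // 1. memory effects first
            ValueDomain v = value.applySubstitution(h.substitution()); // 2. apply substitution
            SymbolicExpression rewritten = h.rewrite(e);       // 3. drop memory sub-expressions
            v = v.assign(id, rewritten);                       // 4. value effects
            return new AbstractStateSketch(h, v);              // 5. fresh state for the caller
        }
    }

The same pattern applies to smallStepSemantics and the other abstract transformers.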

2.5 Interprocedural Analysis

LiSA's Interprocedural Analysis is responsible for computing both a program-wide fixpoint and the result of calls, as the two features are strictly related. In fact, the computation of the overall program fixpoint is a direct consequence of call evaluation. If a pre-computed result is to be returned when a call is encountered, call-chains should be analyzed bottom-up, starting from the target of the last call, thus ensuring that results are already available when needed. Instead, if results are to be freshly generated, call-chains should be analyzed top-down, starting from the CFG containing the first call. LiSA's semantic analysis begins by querying the Interprocedural Analysis for the program's fixpoint, which will implement a strategy for traversing call-chains and reaching a fixpoint on each of its CFGs. Individual CFG fixpoints are evaluated using the classical worklist-based fixpoint over graphs, uniquely implemented in LiSA. Such an algorithm exploits the language-specific semantics functions defined by Statements. As discussed in Sect. 2.3.2, non-calling Statements will rewrite themselves as a sequence of symbolic expressions. When a call must be evaluated, however, the rewriting process is not enough.

Four concrete Call instances are included in LiSA: NativeCall, CFGCall, OpenCall and UnresolvedCall. NativeCalls only target native CFGs: the semantics of this call exploits the latter's rewriting functionality to transform them into a Statement instance, and then delegates the computation to its semantics. CFGCalls instead invoke CFGs, and thus have to access their targets' fixpoint results through the Interprocedural Analysis. OpenCalls are evaluated as an abstraction of an unknown result (for instance, one could conservatively assume that open calls can natively manipulate memory, thus always returning ⊤ as post-state). Lastly, UnresolvedCalls only carry signature information, and are not yet bound to their targets. Target resolution is performed by the Call Graph, but a further degree of modularity is added by having the call interact with the Interprocedural Analysis instead, as some implementations (i.e. intraprocedural ones) might return fixed overapproximations when calls are evaluated, bypassing the Call Graph invocation that otherwise happens here. Regardless, every UnresolvedCall is converted to one of the other Call instances,3 and its semantics can be normally applied.

3 UnresolvedCalls might resolve to both CFGs and native CFGs: here, LiSA instantiates a MultiCall, whose semantics yields the lub of the internal CFGCall and NativeCalls.


Interprocedural Analysis implementations can be context-sensitive [31, 46], following call-chains top-down and evaluating each CFG fixpoint as they are called, or modular [16] (also called summary-based), where call-chains are analyzed bottom-up, accessing pre-computed results.

The Call Graph is tasked with resolving UnresolvedCalls to their targets. As such calls only come with signature information (i.e. the name of the target CFG) and their parameters, the whole program needs to be searched for possible targets. The search is a complex operation that relies on several features of programming languages, and it can be logically split into two phases: scanning for possible targets along the program, and matching the actual parameters to each candidate's signature. Candidate scanning depends on the call type. If the call is known to have a receiver (that is, if it is an instance call), only the receiver's type hierarchy is to be searched for targets. Hierarchy traversal is language-specific, as it is influenced by how the language enables inheritance (e.g. it might be single or multiple, or it might provide explicit and implicit interface implementations). Instead, choosing the starting point for hierarchy traversal depends on the implementation of the Call Graph: for instance, a graph implementing Class Hierarchy analysis [18] would consider all possible subtypes of the receiver's static type, while one implementing Rapid Type analysis [5] would restrict such set to the subtypes instantiated by the program. Regardless, the hierarchy of candidate types must then be searched for CFGs with a matching name. If the call instead does not have a receiver (i.e. if it is static), the whole program needs to be searched for CFGs with a matching name.

Once candidates have been selected, the actual parameters of the call must be matched with the formal ones defined by each target. Once more, this feature is language-specific: capabilities like optional and named parameters, as well as how the types of each parameter are evaluated, complicate the matching process to a point where no unique algorithm can be applied. To make resolution parametric, LiSA delegates language-specific call resolution features to each UnresolvedCall instance: when created, users need to specify both a strategy for traversing a type hierarchy given a starting point, and a strategy for matching a list of Expressions to an arbitrary signature (a sketch of such strategies follows). Note that, once a call has been resolved, an entry state for the targets has to be prepared by assigning actual parameters to formal ones. As this process also follows the parameter matching, the Interprocedural Analysis will defer this preparation to the same algorithm.
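As an illustration, the two strategies could be shaped as follows. This is a sketch under simplified signatures (the names below are not LiSA's actual API); a positional, assignability-based matcher is roughly what a Java-like language needs, while languages with named or optional parameters would provide different implementations:

    import java.util.List;

    interface Type { boolean canBeAssignedTo(Type other); }
    interface Expression { Type staticType(); }
    record Formal(String name, Type type) {}

    // Language-specific traversal of a type hierarchy from a starting point.
    interface HierarchyTraversalStrategy {
        Iterable<Type> traverse(Type receiverType);
    }

    // Language-specific matching of actual parameters against a candidate signature.
    interface ParameterMatchingStrategy {
        boolean matches(List<Expression> actuals, List<Formal> formals);
    }

    final class PositionalMatching implements ParameterMatchingStrategy {
        @Override
        public boolean matches(List<Expression> actuals, List<Formal> formals) {
            if (actuals.size() != formals.size())
                return false;
            for (int i = 0; i < actuals.size(); i++)
                if (!actuals.get(i).staticType().canBeAssignedTo(formals.get(i).type()))
                    return false;
            return true;
        }
    }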

2.6 Frontends

Frontends are tasked with performing the first rewriting phase, translating a (possibly partial) source program into one made of CFGs that can be analyzed by LiSA. As mentioned at the beginning of Sect. 2.3, a component performing such a translation is included in most static analyzers, as moving to a more convenient representation makes writing analyses simpler. LiSA's frontends, however, are more than just raw compilers: as the sole components with deep knowledge about the language they target, they must define Statements with their semantics, types, and language-specific algorithms that implement the execution model of the language.


Even if they might seem less relevant to the whole analysis process, writing and maintaining a complete frontend for a language is no easy endeavor. In fact, mature and widespread programming languages have very complex semantics to model, with features that might be ambiguous or not formally defined.4 Moreover, each language has its own evolution, leading to different versions needing support. This not only translates to a higher number of instructions to model, but also to a variety of runtime environments (containing different libraries and software frameworks) whose semantics has to be taken into account for precise static analysis.

Writing a frontend usually begins with the code parsing logic. Whenever it is not possible to use official tools (e.g. by plugging into the compilation process), parser generators such as ANTLR5 can be used to create custom abstract syntax tree visitors. Statement instances for each instruction must be included as part of the frontend (potentially reusing common implementations provided by LiSA), each one providing its own semantics and bringing the language-specific algorithms that are exploited during the analysis. Type inference is optional, as the one run by LiSA during the analysis can be exploited inside semantics functions. This means that constructs such as +, which in most languages have different semantics depending on the types of their operands, can be modeled by a single Statement instance, as sketched below. Modeling runtimes and libraries is achieved by means of native CFGs or SARL [25].
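For instance, a frontend could model + with a single Statement whose semantics function branches on the types inferred at analysis time. The sketch below uses hypothetical names and simplified signatures rather than LiSA's exact API:

    interface Expr {
        static Expr concat(Expr l, Expr r) { return new Expr() {}; }     // string concatenation
        static Expr numericAdd(Expr l, Expr r) { return new Expr() {}; } // numeric addition
    }
    interface TypeInfo { boolean isString(); }
    interface State {
        TypeInfo typeOf(Expr e);
        State smallStepSemantics(Expr e); // evaluate a symbolic expression
    }

    final class Addition {
        // A single Statement for '+': its semantics consults the types inferred
        // during the analysis instead of requiring distinct AST nodes per language.
        State semantics(State entry, Expr left, Expr right) {
            boolean stringy = entry.typeOf(left).isString() || entry.typeOf(right).isString();
            Expr rewritten = stringy ? Expr.concat(left, right) : Expr.numericAdd(left, right);
            return entry.smallStepSemantics(rewritten);
        }
    }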

2.7 Multilanguage Analysis

We now instantiate LiSA and its components to showcase how multilanguage analyses can be easily defined. We demonstrate the effectiveness of our approach on the JoyCar IoT system introduced in Sect. 2.1.1. Code snippets reported in this section are available on GitHub,6 where the full implementation of this analysis is published. As the codebase is composed of two languages, a frontend for each has to be built. These have been developed using ANTLR for parser generation, and mostly exploit Statements and Edges provided out-of-the-box by LiSA. The key aspect w.r.t. multilanguage analysis is the handling of constructs that enable inter-language communication, offered here by the Java Native Interface (JNI). At runtime, the Java VM tries to resolve calls to native methods using the name-mangling scheme reported in the JNI specification.7

4 For instance, Python does not have a formal specification of its semantics, while C admits syntactic constructs whose behavior is undefined.
5 https://www.antlr.org/, with several well-tested grammars available at https://github.com/antlr/grammars-v4.
6 https://github.com/lisa-analyzer/lisa-joycar-example.


We thus proceeded by providing an implementation for native methods found in Java code using the following (simplified) snippet:

    void parseAsNative(CFG cfg, String className, String name,
            Parameter[] formals, Type returnType) {
        // mangled name of the C++ function implementing the native method
        String mangled = nameMangling(className, name, formals);
        // actuals: the formals, preceded by the pointer to the JNIEnv instance
        Expression[] args = buildArguments(formals);
        UnresolvedCall call = new UnresolvedCall(mangled, args);
        if (!returnType.isVoidType())
            // the value returned by the call is returned by the native method
            cfg.addNode(new Return(call));
        else {
            Ret ret = new Ret();
            cfg.addNode(call);
            cfg.addNode(ret);
            cfg.addEdge(new SequentialEdge(call, ret));
        }
    }


The code above bridges the two codebases by creating an UnresolvedCall, where (i) the target's name is built with the mangling scheme from the specification, (ii) the arguments for the call correspond to the ones passed to the native method, preceded by the pointer to an instance of JNIEnv (an object required by JNI to hold pointers to native functions), and (iii) the value returned by the call is also returned by the native method, if any. With this setup, not only can the C++ code be parsed regularly, but the analysis components are also agnostic to the presence of JNI, as the call to the native method is treated exactly as any other call. Note that, while this specific example did not require it, the generated call can be preceded by arbitrary instrumentations (e.g. the state conversion typical of boundary functions).

The next step is to select the analysis components. We mostly rely on analyses natively provided by LiSA:

• the Interprocedural Analysis is set to a context-sensitive implementation that follows call-chains top-down, thus starting from the main method and traversing them until a recursion is encountered (enabling LiSA to follow every call in our target application);
• the Call Graph uses inferred runtime types of variables and expressions;
• the Abstract State used is SimpleAbstractState;
• as the program properties do not rely on dynamic memory, we use a fast but imprecise Heap Domain called MonolithicHeap, that abstracts each memory location to a unique synthetic one.

For the Value Domain, we implemented a Taint analysis [21, 51] as a Non-Relational Domain, whose simplified source code is reported in Fig. 2.5. The domain is based on the poset ⟨{⊥, clean, tainted}, {(⊥, clean), (clean, tainted)}⟩, that forms a finite (and thus complete) lattice using trivial ⊔ and ⊓ operators that, given a pair of elements, return the greater and the smaller of the two, respectively.

7 https://docs.oracle.com/javase/8/docs/technotes/guides/jni/spec/design.html, paragraph "Resolving Native Method Names".


Fig. 2.5 A simple Taint Analysis implementation

Implementation-wise, the superclass BaseNonRelationalValueDomain handles the base cases of lattice operations, that is, when one of the operands involved is either ⊤ (tainted) or ⊥, or when the two operands are the same element. Hence, no additional logic needs to be implemented for ⊔, ∇ and ⊑. Recursive expression evaluation is also provided out-of-the-box by BaseNonRelationalValueDomain, and the concrete implementation only has to provide the evaluation of individual expressions. Specifically, we consider all constants as clean, while the evaluation of unary, binary and ternary expressions is carried out by computing the ⊔ of their arguments. Tainted values are generated only when evaluating variables, relying on their annotations: as LiSA assigns the result of Calls to temporary variables, transferring all annotations from the call's targets, this enables uniform identification of both sources (i.e. annotated with TAINT_ANNOT) and sanitizers (i.e. annotated with CLEAN_ANNOT). Variables are thus considered always tainted or always clean relying on such annotations, defaulting otherwise to the abstraction stored inside the environment.
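The behavior just described can be condensed as follows; this is a sketch with simplified signatures, not the exact code of Fig. 2.5:

    // The three-point chain BOTTOM < CLEAN < TAINTED described above.
    enum Taint {
        BOTTOM, CLEAN, TAINTED;

        Taint lub(Taint other) { // on a chain, lub is a simple max
            return ordinal() >= other.ordinal() ? this : other;
        }

        static Taint evalConstant() {
            return CLEAN; // all constants are clean
        }

        static Taint evalBinary(Taint left, Taint right) {
            return left.lub(right); // n-ary expressions: lub of the arguments
        }

        static Taint evalVariable(boolean hasTaintAnnot, boolean hasCleanAnnot, Taint env) {
            if (hasTaintAnnot) return TAINTED; // source: annotated with TAINT_ANNOT
            if (hasCleanAnnot) return CLEAN;   // sanitizer: annotated with CLEAN_ANNOT
            return env;                        // default: abstraction in the environment
        }
    }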


To exploit our analysis' results, we defined a Check instance that iterates over the application under analysis to scan for method parameters that are annotated with @lisa.taint.Sink, a third kind of annotation that identifies places where tainted information should not flow. When one such parameter is found, the check inspects all call sites where the corresponding method is invoked, and checks if the post-state of the Expression passed for the annotated parameter contains a tainted expression on the stack, according to our Taint domain. If it does, a warning is issued. We then proceed by annotating as source (i.e. with @lisa.taint.Tainted) the value returned by readAnalog, and as sink (i.e. with @lisa.taint.Sink) the second parameter of softPwmWrite. The analysis can then be executed, obtaining the following warning on the softPwmWrite call:

The value passed for the 2nd parameter of this call is tainted, and it reaches the sink at parameter 'value' of softPwmWrite [at JoyCar.cpp:124:55]

thus showcasing that cross-language vulnerabilities can be discovered in a single analysis run. Also note that, with the same setup, domains computing complex structures (e.g. automata) can still operate cross-language without incurring the expensive serializations and deserializations needed to communicate information across analyzers.
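The Check itself is conceptually simple; a condensed sketch, with illustrative stand-ins for LiSA's syntactic and results APIs, is:

    interface Param { boolean hasAnnotation(String annotation); }
    interface Method { Iterable<Param> parameters(); }
    interface CallSite { Object actualFor(Param formal); }
    interface Application {
        Iterable<Method> methods();
        Iterable<CallSite> callSitesOf(Method m);
    }
    interface Results { boolean taintedOnStack(Object expression); } // post-state query

    final class SinkCheck {
        void run(Application app, Results results) {
            for (Method m : app.methods())
                for (Param p : m.parameters())
                    if (p.hasAnnotation("lisa.taint.Sink"))
                        for (CallSite call : app.callSitesOf(m))
                            if (results.taintedOnStack(call.actualFor(p)))
                                System.out.println("warning: tainted value may reach sink " + p);
        }
    }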

2.8 Conclusion

In this chapter, we thoroughly described LiSA, a modular framework for multilanguage static analysis with an open-source Java implementation. LiSA operates by analyzing an extensible language of CFGs, whose nodes contain user-defined language-specific semantics that translate them into symbolic expressions. These are atomic constructs with precise semantics that abstract domains can analyze. LiSA's infrastructure modularly decomposes semantics evaluation into separate tasks, each carried out by a different analysis component. Each such component performs agnostically w.r.t. the concrete implementations of the other ones, and is responsible for abstracting specific program features. The Interprocedural Analysis, in cooperation with the Call Graph, abstracts calls and call-chains, leaving the rest of the analysis with call-free programs. Then, the modular Analysis State orchestrates memory and value abstraction. The former is performed by the Heap Domain, which abstracts all heap operations by rewriting them with synthetic variables representing the heap locations they resolve to, leaving the Value Domain with call- and memory-free programs. Individual library functions can be modeled as native CFGs, that is, as special CFGs with a unique node that expresses the semantics of the whole function. We also reported our teaching experience with LiSA, which emphasizes how the modular structure makes experimenting with LiSA achievable by Master-level students. Furthermore, we demonstrated LiSA's capability of analyzing software written in multiple programming languages, identifying a vulnerability that spanned Java and C++ in a proof-of-concept case study.


2.8.1 Future Directions

As a multilanguage analyzer, one of our objectives is to target as many programming languages as possible. This not only means having frontends for each of them, but also ensuring that the internal LiSA program model is flexible and parametric enough to represent their syntactic structures, semantics, execution models, inheritance, and all of their other peculiarities. Currently, frontends for Go, Python, Java, Java bytecode, Rust and Michelson bytecode are in development, but this is undoubtedly a long-term effort. One additional vision for LiSA's future is to not only provide users with an easy-to-use tool where new analyses can be implemented quickly and tested on several languages, but where they can also easily compare different implementations with their own. As such, we plan on extending LiSA to ship with several well-known component implementations, from numerical and string domains (e.g. Octagons [37] and Bricks [12]), to property domains (e.g. Information Flow [19, 44]), to interfaces with widely accepted static analysis libraries (e.g. Apron [29] and PPL [6]), also widening the analysis spectrum to backward analyses.

2.8.2 Related Work

As the field grew and matured over decades, a vast literature about static analysis and abstract interpretation is available (the most notable results are referenced in [13]), reporting a wide spectrum of techniques and their applications to prove software correct. Here, we focus on work strictly related to multilanguage analysis.

The initial focus of multilanguage analysis targeted combinations of similar languages. Julia [47] analyzes Java bytecode, and has been extended to also analyze the CIL bytecode resulting from the compilation of C# code by means of a semantic translation into Java bytecode [23]. Infer [20] analyzes Java, C, C++ and Objective-C programs by statically translating them into a proprietary intermediate representation, called SIL, composed of only four instructions. However, these approaches are intrinsically limited by the expressiveness of the intermediate representation (Java bytecode and SIL in the case of Julia and Infer): since the set of constructs in those representations is predefined, they might not be enough to represent features of some languages. For instance, Java bytecode cannot express pointer arithmetic, while SIL is not suited for dynamic typing.

Another stream of work instead considers a scenario where one central portion of the application, written in a single programming language, interacts with native code, which in this context can be considered as a collection of functions written in different programming languages. Wei et al. [53] perform a summary-based analysis of Android applications as a whole (that is, Java code and JNI-exposed native code), while [49] compiles C code into an extended Java bytecode that can be analyzed by existing Java analyzers. Furr and Foster [27, 28] perform type inference on so-called foreign function interfaces, that is, inter-language communication


frameworks like JNI, discovering type errors otherwise visible only at runtime. Li and Tan [35] instead detect mishandling of exceptions and errors raised in native code and then propagated back to Java. The work presented in [34] computes semantic summaries from guest programs, to be used during the analysis of a host program. All of these approaches share the same underlying idea: to compute summaries for a family of "secondary" programs and use them to analyze the main one (here, programs are defined as coherent modules written in the same language). While modular summary-based analyses are powerful and scalable, not all properties can be proven bottom-up, and many often require precise context-sensitive analyses. Moreover, not all programs can be described as a single processing entity exploiting auxiliary codebases: for instance, mobile apps contain logic on both the app and the backend, with back-and-forth communication between the two.

More recently, solutions for multilanguage analyses have been proposed. References [8, 9] provide an algebraic framework for multilanguage abstract interpretations by combining existing language-specific ones through boundary functions that perform state conversion when switching context between languages. Teixeira et al. [50] provide LARA [41] source-to-source compilation to transform Java, C, C++ and JavaScript programs toward a common syntax, over which static analyses can be run. The authors however focus on source-code metrics (that is, syntactic properties), with no reasoning on runtime behaviors (that is, semantic properties).

The major alternative to the approach described in this paper is undoubtedly Mopsa [30] (Modular Open Platform for Static Analysis), a static analyzer based on the abstract interpretation theory written in OCaml. Mopsa is designed to compute fixpoints by induction on a program's syntax. A program is an extensible abstract syntax tree (AST) that initially contains the original source code, but that can be syntactically and semantically rewritten during the analysis. Abstract domains share a common interface, and are thus easy to compose and extend. Moreover, the domains are responsible for dynamically rewriting fragments of the AST exploiting semantic information, avoiding a static translation toward an internal language. Depending on both the target programming languages and the properties of interest, Mopsa's analyses need to be configured by composing a chain of abstract domains that will dynamically rewrite portions of the AST until a common syntax is reached, over which the remaining domains can operate independently from the source language. Mopsa has been successfully used to analyze C and Python programs [30], showcasing its ability to target completely different languages, including dynamic ones. Moreover, analyses on a combination of the two have been performed [38].

Acknowledgements Work partially supported by SPIN-2021 projects 'Ressa-Rob' and 'Static Analysis for Data Scientists', and Fondi di primo insediamento 2019 project 'Static Taint Analysis for IoT Software' funded by Ca' Foscari University, and by iNEST-Interconnected NordEst Innovation Ecosystem, funded by PNRR (Mission 4.2, Investment 1.5), NextGenerationEU, Project ID: ECS 00000043.


References 1. Allen, F.E.: Control flow analysis. In: Proceedings of a Symposium on Compiler Optimization, pp. 1–19. Association for Computing Machinery, New York, NY, USA (1970). https://doi.org/ 10.1145/800028.808479 2. Andersen, L.O.: Program analysis and specialization for the c programming language. Ph.D. thesis, DIKU, University of Copenhagen (1994). https://citeseerx.ist.psu.edu/viewdoc/ download?doi=10.1.1.109.6502&rep=rep1&type=pdf 3. Androulaki, E., Barger, A., Bortnikov, V., Cachin, C., Christidis, K., Caro, A.D., Enyeart, D., Ferris, C., Laventman, G., Manevich, Y., Muralidharan, S., Murthy, C., Nguyen, B., Sethi, M., Singh, G., Smith, K., Sorniotti, A., Stathakopoulou, C., Vukolic, M., Cocco, S.W., Yellick, J.: Hyperledger fabric: a distributed operating system for permissioned blockchains. In: Oliveira, R., Felber, P., Hu, Y.C. (eds.) Proceedings of the Thirteenth EuroSys Conference, EuroSys 2018, Porto, Portugal, 23–26 Apr. 2018, pp. 30:1–30:15. ACM (2018). https://doi.org/10. 1145/3190508.3190538 4. Arceri, V., Maffeis, S.: Abstract domains for type juggling. Electron. Notes Theor. Comput. Sci. 331, 41–55 (2017). DOI: 10.1016/j.entcs.2017.02.003. 5. Bacon, D.F., Sweeney, P.F.: Fast static analysis of c++ virtual function calls. In: Proceedings of the 11th ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA ’96, pp. 324–341. Association for Computing Machinery, New York, NY, USA (1996). https://doi.org/10.1145/236337.236371 6. Bagnara, R., Hill, P.M., Zaffanella, E.: The parma polyhedra library: toward a complete set of numerical abstractions for the analysis and verification of hardware and software systems. Sci. Comput. Program. 72(1), 3–21 (2008). https://doi.org/10.1016/j.scico.2007.08.001. https://www.sciencedirect.com/science/article/pii/S0167642308000415. Special Issue on Second issue of experimental software and toolkits (EST) 7. Buchman, E.: Tendermint: Byzantine fault tolerance in the age of blockchains. Ph.D. thesis, University of Guelph (2016). https://atrium.lib.uoguelph.ca/xmlui/handle/10214/9769 8. Buro, S., Crole, R.L., Mastroeni, I.: On multi-language abstraction. In: Pichardie, D., Sighireanu, M. (eds.) Static Analysis, pp. 310–332. Springer International Publishing, Cham (2020). https://doi.org/10.1007/978-3-030-65474-0_14 9. Buro, S., Mastroeni, I.: On the multi-language construction. In: Caires, L. (ed.) Programming Languages and Systems, pp. 293–321. Springer International Publishing, Cham (2019). https:// doi.org/10.1007/978-3-030-17184-1_11 10. Chen, L.: Microservices: Architecting for continuous delivery and devops. In: 2018 IEEE International Conference on Software Architecture (ICSA), pp. 39–397 (2018). https://doi. org/10.1109/ICSA.2018.00013 11. Cortesi, A., Costantini, G., Ferrara, P.: A survey on product operators in abstract interpretation. Electronic Proceedings in Theoretical Computer Science 129, 325–336 (2013). https://doi.org/ 10.4204/eptcs.129.19 12. Costantini, G., Ferrara, P., Cortesi, A.: A suite of abstract domains for static analysis of string values. Softw.: Pract. Exp. 45(2), 245–287 (2015). https://doi.org/10.1002/spe.2218. https:// onlinelibrary.wiley.com/doi/abs/10.1002/spe.2218 13. Cousot, P.: Principles of Abstract Interpretation. MIT Press (2021) 14. Cousot, P., Cousot, R.: Abstract interpretation: a unified lattice model for static analysis of programs by construction or approximation of fixpoints. In: Graham, R.M., Harrison, M.A., Sethi, R. (eds.) 
Conference Record of the Fourth ACM Symposium on Principles of Programming Languages, Los Angeles, CA, USA, Jan. 1977, pp. 238–252. ACM (1977). https://doi.org/10. 1145/512950.512973 15. Cousot, P., Cousot, R.: Systematic design of program analysis frameworks. In: Aho, A.V., Zilles, S.N., Rosen, B.K. (eds.) Conference Record of the Sixth Annual ACM Symposium on Principles of Programming Languages, San Antonio, Texas, USA, Jan. 1979, pp. 269–282. ACM Press (1979). https://doi.org/10.1145/567752.567778


16. Cousot, P., Cousot, R.: Modular static program analysis. In: R.N. Horspool (ed.) Compiler Construction, pp. 159–179. Springer, Berlin, Heidelberg (2002). https://doi.org/10.1007/3540-45937-5_13 17. Cousot, P., Halbwachs, N.: Automatic discovery of linear restraints among variables of a program. In: Aho, A.V., Zilles, S.N., Szymanski, T.G. (eds.) Conference Record of the Fifth Annual ACM Symposium on Principles of Programming Languages, Tucson, Arizona, USA, Jan. 1978, pp. 84–96. ACM Press (1978). https://doi.org/10.1145/512760.512770 18. Dean, J., Grove, D., Chambers, C.: Optimization of object-oriented programs using static class hierarchy analysis. In: Tokoro, M., Pareschi, R. (eds.) ECOOP’95—Object-Oriented Programming, 9th European Conference, Åarhus, Denmark, 7–11 Aug. 1995, pp. 77–101. Springer, Berlin, Heidelberg (1995). https://doi.org/10.1007/3-540-49538-X_5 19. Denning, D.E.: A lattice model of secure information flow. Commun. ACM 19(5), 236–243 (1976). https://doi.org/10.1145/360051.360056 20. Distefano, D., Fähndrich, M., Logozzo, F., O’Hearn, P.W.: Scaling static analyses at facebook. Commun. the ACM 62(8), 62–70 (2019). https://doi.org/10.1145/3338112 21. Ernst, M.D., Lovato, A., Macedonio, D., Spiridon, C., Spoto, F.: Boolean formulas for the static identification of injection attacks in java. In: Davis, A. Fehnker, A. McIver, A. Voronkov (eds.) Logic for Programming, Artificial Intelligence, and Reasoning, pp. 130–145. Springer, Berlin, Heidelberg (2015). 978-3-662-48899-7_10 22. Ferrara, P.: A generic framework for heap and value analyses of object-oriented programming languages. Theor. Comput. Sci. 631, 43–72 (2016). https://doi.org/10.1016/j.tcs.2016.04.001. www.sciencedirect.com/science/article/pii/S0304397516300299 23. Ferrara, P., Cortesi, A., Spoto, F.: Cil to java-bytecode translation for static analysis leveraging. In: Proceedings of the 6th Conference on Formal Methods in Software Engineering, FormaliSE ’18, pp. 40–49. Association for Computing Machinery, New York, NY, USA (2018). https:// doi.org/10.1145/3193992.3193994 24. Ferrara, P., Mandal, A.K., Cortesi, A., Spoto, F.: Cross-programming language taint analysis for the iot ecosystem. Electron. Commun. EASST 77 (2019). 10.14279/tuj.eceasst.77.1104 25. Ferrara, P., Negrini, L.: Sarl: Oo framework specification for static analysis. In: Christakis, M., Polikarpova, N., Duggirala, P.S., Schrammel, P. (eds.) Software Verification, pp. 3–20. Springer International Publishing, Cham (2020). 10.1007/978-3-030-63618-0_1 26. Ferrara, P., Negrini, L., Arceri, V., Cortesi, A.: Static analysis for dummies: experiencing lisa. In: Proceedings of the 10th ACM SIGPLAN International Workshop on the State Of the Art in Program Analysis, SOAP 2021, pp. 1–6. Association for Computing Machinery, New York, NY, USA (2021). https://doi.org/10.1145/3460946.3464316 27. Furr, M., Foster, J.S.: Polymorphic type inference for the JNI. In: Sestoft, P. (ed.) Programming Languages and Systems, pp. 309–324. Springer, Berlin, Heidelberg (2006). https://doi.org/10. 1007/11693024_21 28. Furr, M., Foster, J.S.: Checking type safety of foreign function calls. ACM Trans. Program. Lang. Syst. 30(4) (2008). https://doi.org/10.1145/1377492.1377493 29. Jeannet, B., Miné, A.: Apron: A library of numerical abstract domains for static analysis. In: Bouajjani, A., Maler, O. (eds.) Computer Aided Verification, pp. 661–667. Springer, Berlin, Heidelberg (2009). https://doi.org/10.1007/978-3-642-02658-4_52 30. 
Journault, M., Miné, A., Monat, R., Ouadjaout, A.: Combinations of reusable abstract domains for a multilingual static analyzer. In: Chakraborty, S., Navas, J.A. (eds.) Verified Software. Theories, Tools, and Experiments, pp. 1–18. Springer International Publishing, Cham (2020). https://doi.org/10.1007/978-3-030-41600-3_1 31. Khedker, U.P., Karkare, B.: Efficiency, precision, simplicity, and generality in interprocedural data flow analysis: Resurrecting the classical call strings method. In: Hendren, L. (ed.) Compiler Construction, pp. 213–228. Springer, Berlin, Heidelberg (2008). https://doi.org/10.1007/9783-540-78791-4_15 32. Kwon, J.: Tendermint: Consensus Without Mining (2014). https://tendermint.com/static/docs/ tendermint.pdf


33. Kwon, J., Buchman, E.: Cosmos Whitepaper (2019). https://v1.cosmos.network/resources/whitepaper 34. Lee, S., Lee, H., Ryu, S.: Broadening horizons of multilingual static analysis: semantic summary extraction from c code for JNI program analysis. In: Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, ASE '20, pp. 127–137. Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3324884.3416558 35. Li, S., Tan, G.: Finding bugs in exceptional situations of JNI programs. In: Proceedings of the 16th ACM Conference on Computer and Communications Security, CCS'09, pp. 442–452. Association for Computing Machinery, New York, NY, USA (2009). https://doi.org/10.1145/1653662.1653716 36. Mell, P., Grance, T., et al.: The NIST definition of cloud computing. Natl. Inst. Sci. Technol. Spec. Publ. 800(2011), 145 (2011) 37. Miné, A.: The octagon abstract domain. Higher-Order and Symbolic Computation 19(1), 31–100 (2006). https://doi.org/10.1007/s10990-006-8609-1 38. Monat, R., Ouadjaout, A., Miné, A.: A multilanguage static analysis of python programs with native c extensions. In: Drăgoi, C., Mukherjee, S., Namjoshi, K. (eds.) Static Analysis, pp. 323–345. Springer International Publishing, Cham (2021). https://doi.org/10.1007/978-3-030-88806-0_16 39. Negrini, L.: A generic framework for multilanguage analysis. Ph.D. thesis, Università Ca' Foscari Venezia (2023) 40. Negrini, L., Arceri, V., Ferrara, P., Cortesi, A.: Twinning automata and regular expressions for string static analysis. In: Verification, Model Checking, and Abstract Interpretation: 22nd International Conference, VMCAI 2021, Copenhagen, Denmark, 17–19 Jan. 2021, Proceedings, pp. 267–290. Springer, Berlin, Heidelberg (2021). https://doi.org/10.1007/978-3-030-67067-2_13 41. Pinto, P., Carvalho, T., Bispo, J., Ramalho, M.A., Cardoso, J.M.P.: Aspect composition for multiple target languages using lara. Comput. Lang. Syst. Struct. 53, 1–26 (2018). https://doi.org/10.1016/j.cl.2017.12.003. www.sciencedirect.com/science/article/pii/S147784241730115X 42. Porru, S., Pinna, A., Marchesi, M., Tonelli, R.: Blockchain-oriented software engineering: challenges and new directions. In: 2017 IEEE/ACM 39th International Conference on Software Engineering Companion (ICSE-C), pp. 169–171 (2017). https://doi.org/10.1109/ICSEC.2017.142 43. Rival, X., Mauborgne, L.: The trace partitioning abstract domain. ACM Trans. Program. Lang. Syst. (TOPLAS) 29(5), 26–es (2007). https://doi.org/10.1145/1275497.1275501 44. Sabelfeld, A., Myers, A.: Language-based information-flow security. IEEE Journal on Selected Areas in Communications 21(1), 5–19 (2003). https://doi.org/10.1109/JSAC.2002.806121 45. Sagiv, M., Reps, T., Wilhelm, R.: Parametric shape analysis via 3-valued logic. ACM Trans. Program. Lang. Syst. 24(3), 217–298 (2002). https://doi.org/10.1145/514188.514190 46. Sharir, M., Pnueli, A., et al.: Two approaches to interprocedural data flow analysis. New York University. Courant Institute of Mathematical Sciences (1978) 47. Spoto, F.: The julia static analyzer for java. In: Rival, X. (ed.) Static Analysis, pp. 39–57. Springer, Berlin, Heidelberg (2016). https://doi.org/10.1007/978-3-662-53413-7_3 48. Sundaresan, V., Hendren, L., Razafimahefa, C., Vallée-Rai, R., Lam, P., Gagnon, E., Godin, C.: Practical virtual method call resolution for java. In: Proceedings of the 15th ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA '00, pp. 264–280.
Association for Computing Machinery, New York, NY, USA (2000). https://doi.org/10.1145/353171.353189 49. Tan, G., Morrisett, G.: Ilea: Inter-language analysis across java and c. In: Proceedings of the 22nd Annual ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages and Applications, OOPSLA ’07, pp. 39–56. Association for Computing Machinery, New York, NY, USA (2007). https://doi.org/10.1145/1297027.1297031


50. Teixeira, G., Bispo, J., Correia, F.F.: Multi-language static code analysis on the lara framework. In: Proceedings of the 10th ACM SIGPLAN International Workshop on the State Of the Art in Program Analysis, SOAP 2021, pp. 31–36. Association for Computing Machinery, New York, NY, USA (2021). https://doi.org/10.1145/3460946.3464317 51. Tripp, O., Pistoia, M., Fink, S.J., Sridharan, M., Weisman, O.: Taj: effective taint analysis of web applications. In: Hind, M., Diwan, A. (eds.) Proceedings of the 2009 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2009, Dublin, Ireland, 15–21 June 2009, pp. 87–97. ACM (2009). https://doi.org/10.1145/1542476.1542486 52. Vongsingthong, S., Smanchat, S.: Internet of things: a review of applications and technologies. Suranaree J. Sci. Technol. 21(4), 359–374 (2014) 53. Wei, F., Lin, X., Ou, X., Chen, T., Zhang, X.: Jn-saf: Precise and efficient NDK/JNI-aware inter-language static analysis framework for security vetting of android applications with native code. In: Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, CCS '18, pp. 1137–1150. Association for Computing Machinery, New York, NY, USA (2018). https://doi.org/10.1145/3243734.3243835

Chapter 3

How to Make Taint Analysis Precise

Francesco Logozzo and Ibrahim Mohamed

Abstract During a lunch at the ceremony for his honorary degree from the University of Venice, Patrick Cousot asked us the following question: "Why everyone in security is using taint analysis, if it is so imprecise?". We answered that yes, taint analysis per se can be very imprecise, but it can be made extremely accurate when (i) it is refined to capture (the abstractions of) the execution flow and the data transformations; and (ii) it is based on the source language and not on some intermediate representation. We explained that at Meta more than 50% of the security bugs are automatically detected using refined static taint analyses based on those principles. This short note expands and details the answer we gave Patrick.

3.1 Introduction

Most software security problems can be formulated in the form of: an attacker controls some data that, after a sequence of transformations, reaches a sensitive part of the code base. Taint analysis has been proposed as a solution to this kind of problem [14, 18]. Taint analysis tracks the flow of transformed data from some attacker-controlled inputs (called sources) to privileged APIs (called sinks). Taint analysis can be either dynamic or static. A major problem in accurate taint analysis is the design of the taint policy, that is, how to propagate the attacker-controlled data through the basic statements of the program. A too conservative taint policy may miss flows and hence lead to false negatives, while reporting very few false positives. A too aggressive propagation of taint will surely reduce the number of false negatives, but it will also dramatically increase the number of false positives. Hence, Patrick Cousot's question during a nice lunch on the occasion of the honorary degree from the University of Venice: "Why everyone in security is using taint analysis, if it is so inaccurate?".


In this short note we address Patrick Cousot's question, expanding on the answer we gave in Venice. We propose a concrete, parametric, instrumented influence semantics to capture the intuitive notion of "tainted data". We instantiate it to the real-world cases of: data flow analysis, state/naive taint flow analysis, and trace-aware taint flow analysis. Our formalization also lets us discuss the trade-offs between dynamic and static taint analyses.

3.2 Concrete Semantics

3.2.1 Influenced Concrete States

We consider concrete states that are parameterized by an influence domain. Inspired by [14], we let concrete states σ ∈ Σ consist of two components:

• a standard program state s ∈ S, e.g., mapping variables to locations and locations to values;
• an influence state i ∈ I that tracks how memory locations are influenced.

We assume the influence state to be an algebra ⟨I, ⊥, ⊑, ⊔, ⊓, expr, entry, return⟩ such that:

• ⊥ ∈ I is the bottom value with the intuitive meaning of "not influenced";
• ⊑ is a partial order among influences, satisfying ∀i. ⊥ ⊑ i;
• ⊔ is a binary operator that gathers the influences of its arguments, i.e., a ⊔ b is influenced by a or b;
• ⊓ is a binary operator that approximates the intersection of the influences, i.e., a ⊓ b is influenced by a and b;
• expr, entry, return capture how influences are propagated via expression evaluation, function calls, and return points.

For now, we leave I as a parameter of our formalization, and we will describe in the following how different choices for I determine different security program analyses. With an abuse of notation, we denote with I both the influence of a location and the influence state, that is, the lifting of the influence to a map from locations to influences. We refer to the operations expr, entry, return as the taint policy of I.
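As a programmatic rendering of this algebra (purely illustrative, not tied to any concrete tool), one could write:

    // I is the carrier of influences; Env maps variables/locations to influences.
    interface Expr {}
    interface Env<I> { I get(String variable); }

    interface InfluenceAlgebra<I> {
        I bottom();                        // ⊥: "not influenced"
        boolean leq(I a, I b);             // partial order, with bottom() below everything
        I join(I a, I b);                  // a ⊔ b: influenced by a or b
        I meet(I a, I b);                  // a ⊓ b: influenced by a and b
        // The taint policy ('ret' since 'return' is reserved in Java):
        I expr(Expr e, Env<I> state);      // propagation through expression evaluation
        I entry(String function, I arg);   // caller -> callee propagation
        I ret(String function, I result);  // callee -> caller propagation
    }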

3.2.2 Semantics of Basic Instructions

We assume a given standard semantics B⟦s⟧ ∈ S → S for the computation of concrete values. We refine it with an influence semantics I⟦s⟧ ∈ I → I that describes how the influence propagates through basic statements. We define I⟦s⟧ in the following, parameterized by the taint policy of a given algebra I.


3.2.2.1 Assignment

For simplicity, we assume assignments to be side effect-free, so we only consider the three cases where the right-hand side is a literal, a variable, or a pure operation over expressions. Literals are trivially not attacker-influenced, therefore we let:

I⟦x = k⟧i ≜ i[x → ⊥].

The assignment from a basic variable copies the influence of the variable to the left-hand value. We formalize it by updating the influence state:

I⟦x = y⟧i ≜ i[x → i(y)].

For non-atomic expressions E = E₁ + E₂ | E₁ ∗ E₂ | …, the propagation of the influence is a function of its variables and the operations applied to them, as determined by the taint policy:

I⟦x = E⟧i ≜ i[x → expr(E, i)].

3.2.2.2 Function Call

Without loss of generality, we assume functions to have only one argument, passed by value, and one return value, that is, the function definitions match the template:

f ≜ λy.{S; return z;}.

The influence semantics of invoking f first evaluates the actual argument to set up the initial state with the opportune influence value for y. Then it evaluates the body of the function and updates the return value z. The primitives entry and return determine, respectively, how the influence flows from the caller to the callee, and from the callee to the caller. Formally:

I⟦x = call f(E)⟧i ≜ let i₀ = [y → entry(f, expr(E, i))] in
                    let i_f = I⟦S; return z;⟧(i₀) in
                    i[x → return(f, i_f(z))].

3.2.3 Partial Traces Semantics

The concrete semantics of a basic statement s is λ⟨s, i⟩. ⟨B⟦s⟧(s), I⟦s⟧(i)⟩. From the basic semantics, we can derive the transition relation τ ∈ ℘(Σ × Σ) (e.g., as done in [13]).


Finally, we define the influence partial traces semantics from an initial set of states I₀ ∈ ℘(Σ) as the least fixpoint:

T^I ≜ lfp⊆∅ λT. I₀ ∪ {σ₀…σₙσₙ₊₁ | σ₀…σₙ ∈ T ∧ ⟨σₙ, σₙ₊₁⟩ ∈ τ}.

Intuitively, the semantics T^I captures the precise evolution of the attacker-controlled input. Unfortunately, it is not computable, so in practice some abstraction must be performed. The abstraction may under-approximate the set T^I, as for instance in logging or dynamic taint flow analysis, or over-approximate it, as in static taint flow analysis.

3.2.4 Reachable States Semantics

The reachable states semantics abstracts away the execution order and focuses only on the states that may reach a given program point pp, not on how the information has been computed. It can be formalized by means of the following abstraction [10]:

α(T) ≜ λpp. {σᵢ | ∃σ₀…σᵢ…σₙ ∈ T ∧ σᵢ(PP) = pp}.

The reachable states semantics is commonly used as the concrete semantics for numerical [8] or non-nullness [16] static analyses. We will see that using it as the concrete semantics for taint analysis is the root cause of its imprecision.

3.3 Application to Security

Many security (and privacy) problems can be reduced to modeling an un-vetted flow from some attacker-controlled data to some sensitive part of the code base. The attacker-controlled inputs are the sources, the sensitive APIs are the sinks, and the APIs that make sure the data is innocuous are called sanitizers.

3.3.1 Sources

We let Sources denote the set of functions whose return value is attacker-controlled. We model it by requiring that it satisfies:

∀f ∈ Sources. ∀a ∈ I. return(f, a) ≠ ⊥,


that is, a source function can never return a not-influenced value. Please note that in general different functions may return different influences, to model the fact that there is more than one attack vector, e.g., from the WEB or from an untrusted file.

3.3.2 Sinks

We let Sinks denote a map from function names to I that intuitively describes which influence is dangerous for the given sink—essentially, a necessary precondition [12]. For instance, for a SQL injection sink, an attacker-controlled string is dangerous, whereas an attacker-controlled integer value is not—there is no way an attacker can trigger a SQL injection with just a simple integer. Formally, an influence φ ∈ I is a security violation for a function f if

ρ ∈ Sinks ∧ ρ(f) ⊓ φ ≠ ⊥.

3.3.3 Sanitizers

We let Sanitizers denote the set of functions that vet the input, making sure that there is no potentially dangerous influence in the result. We distinguish two kinds of sanitizers: entry and exit sanitizers. An entry sanitizer for an influence φ ∈ I \ {⊥} satisfies

∀f ∈ Sanitizers. ∀a ∈ I. entry(f, a) ⊓ φ = ⊥,

that is, all non-trivial influences compatible with φ are removed from the analysis of the callee. Similarly, an exit sanitizer satisfies

∀f ∈ Sanitizers. ∀a ∈ I. return(f, a) ⊓ φ = ⊥,

that is, at the exit point all the non-trivial influence compatible with φ has been removed from the influence returned to the caller.

3.4 Data Flow Analyses for Security

Program analyses based on data flow focus on determining whether attacker-controlled data reaches a sink without any transformation. Their main applications in practice are security analysis via logging, where the system monitors untrusted data that flows into some loggers (the sinks), and commercial static analysis tools that aim to report few false positives, at the cost of increasing false negatives.


Fig. 3.1 A SQL injection: the attacker can inject an arbitrary string to the SQL command

Fig. 3.2 Even if the attacker controls the id, it cannot inject a SQL command because the string is converted to an integer. A taint flow analysis that does not take the type conversion into account will raise a false alarm

Example: Data flow
A data flow analysis will detect the flow of un-sanitized input from the attacker to the sink in Fig. 3.1. The conversion from string to integer in Fig. 3.2 will stop the data flow analysis, which correctly reports that there is no bug. Nevertheless, a data flow analysis will fail to report the bug in Fig. 3.3, as size is the result of a transformation.

In data flow program analysis, transformed data is considered safe. We can formalize data flow using an opportune influence D. Formally:

∀a ∈ D. expr(x, a) ≜ a(x)
        expr(E, a) ≜ ⊥   if E is not a variable.

Function entry and exit simply copy the data flow influence value:

∀f ∉ Sanitizers. ∀d ∈ D. entry(f, d) = return(f, d) = d.

A very popular example of dynamic data flow applied to security is program logging. In this case the sources are the requests, and the arguments of the sinks are logged. Security engineers use it to determine whether some private data made it into a less-privileged location (logs may be visible to more people than the original data was), or to check whether an intruder has infiltrated the system or some malicious query has been executed. In general, dynamic data flow under-approximates (samples) the reachable states, and not the trace semantics, simply because it is too expensive to record the full trace at runtime1:

DynamicDataFlow ⊆̇ α(T^D).

1 As customary, we denote with ⊆̇ the functional lifting of ⊆.


Fig. 3.3 A memory allocation controlled by the attacker. The attacker may cause the computation at line 2 to overflow, so the buffer allocated at line 3 is small, causing an out-of-bounds access at line 4. A taint analysis will determine that the memory allocation is user controlled

Static data flow is relatively popular in commercial static analysis tools for security [1, 3]. Those tools propose it as the default, initial configuration because it has very few false positives. Unfortunately, it can also lead to many false negatives, essentially discarding attacker-controlled data every time it is transformed, leaving many attacks undetected. In general, static data flow over-approximates the set of reachable states using the taint policy described in this section:

α(T^D) ⊆̇ StaticDataFlow.
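In code, the data flow policy is the instance of the algebra of Sect. 3.2.1 where any transformation drops the influence. A sketch, reusing the illustrative InfluenceAlgebra, Expr and Env types from the earlier sketch (Boolean influence: true means "attacker data, untransformed"):

    record Var(String name) implements Expr {} // plain variable read

    final class DataFlowPolicy implements InfluenceAlgebra<Boolean> {
        public Boolean bottom() { return false; }
        public boolean leq(Boolean a, Boolean b) { return !a || b; }
        public Boolean join(Boolean a, Boolean b) { return a || b; }
        public Boolean meet(Boolean a, Boolean b) { return a && b; }

        public Boolean expr(Expr e, Env<Boolean> state) {
            // only a bare variable read propagates; any other expression is "safe"
            return (e instanceof Var v) ? state.get(v.name()) : false;
        }

        // outside sanitizers, entry and return copy the influence unchanged
        public Boolean entry(String f, Boolean d) { return isSanitizer(f) ? false : d; }
        public Boolean ret(String f, Boolean d) { return isSanitizer(f) ? false : d; }

        private boolean isSanitizer(String f) { return false; } // stub: membership in Sanitizers
    }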

3.5 Reachable States-Based Taint Analysis

Taint flow analysis refines the data flow analysis by taking into account the transformation of data. In the most common definition of the influence policy [14], the result of an expression is influenced by the influences of all its sub-expressions. Formally:

expr ≜ λE, i. ⊔_{y ∈ FV(E)} i(y),

where FV(E) are the free variables appearing in the expression E.

Taint analysis can be dynamic or static. However, we are not aware of any dynamic taint analysis deployed at an industrial scale. The reason is that tracking tainted data has a performance cost (every operation should be lifted to propagate taint) and a memory cost (to record whether a location is tainted or not). For instance, the dynamic taint analysis in Clang [2] limits the number of taint flows that can be tracked at the same time to 8 and limits the height of the stack trace, making the analysis not useful in practice.

Static taint analysis, on the other hand, is very popular and has seen many applications in security. The common approach to static taint analysis is to consider only two influences ("tainted", "not tainted") and to associate them with memory locations [9, Chap. 47.11.8]. Differently stated, most published papers assume (implicitly) α(T) as the concrete semantics. This approach has the drawback of losing the information on how the taint has been computed, and hence is the cause of the imprecision that Patrick Cousot pointed out.


Fig. 3.4 Upcasting length will prevent the integer overflow, and hence the attacker cannot trigger the memory corruption. In order to avoid reporting a false positive, the taint analysis will need to propagate the information that, while size is tainted, it is not a result of an overflowing operation

Example: Reachable States-Based Taint Analysis
Taint analysis, with the influence policy defined in this section, correctly reports the bug in Fig. 3.3: as length is tainted, so is every value derived from it, namely size. Unfortunately, the same holds for the code snippet in Fig. 3.4, so a naive taint analysis will report a false alarm in that case.

In practice, this kind of taint analysis will report most of the program as influenced by the attacker. The canonical way to address the problem is the use of sanitizers. Unfortunately, this is not enough. Consider, for instance, the example in Fig. 3.2. One would be tempted to define a "string cast to int" as a sanitizer, but this would have the side effect of masking a possible privacy bug—the attacker can select which user to see.

3.6 Trace-Based Taint Analysis

The key to increasing the precision of taint analysis and overcoming the issues we outlined in the previous section is to design a static analysis based on the trace semantics T. In our experience designing and implementing static analyzers at Meta, we found that there are two orthogonal aspects of T that we need to capture to make the analysis more precise:

• Features, which provide (an abstract) explanation of "how" the influence may reach a sink. Security engineers write more precise rules based on them.
• The flow of influences through function calls. Security engineers use it to validate inter-procedural flows.


3.6.1 Trace Influence Algebra

We define the influence algebra Z, building on top of the underlying concrete one I: Z ≜ I × F × G, where:

• I is a given influence state.
• F is a finite set of features.
• ⟨G, ≺, ⊔, ⊕⟩ is an algebra abstracting the call graph. It has an order ≺, an upper bound ⊔, and an operation ⊕ to add elements.

In general, by using a set for F we abstract some information, e.g., where the cast happened, or the order among features, but we gain in efficiency. We can relax the hypotheses on F, for instance, of being a finite set, or a set at all—in fact, in our implementation we use a hybrid abstraction where some features are abstracted with a set, and others with sequences. We will not detail G in this paper, but it is worth noting that designing the opportune abstraction is key to scaling the analysis to hundreds of millions of lines of code while taking into account recursive inter-procedural calls.
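A direct rendering of Z as a product (again illustrative; features as a plain string set, and G left as an opaque type parameter) might be:

    import java.util.HashSet;
    import java.util.Set;

    // Z = I × F × G: underlying influence, set of features, call-graph abstraction.
    record TraceInfluence<I, G>(I taint, Set<String> features, G callGraph) {
        TraceInfluence<I, G> withFeature(String feature) {
            Set<String> extended = new HashSet<>(features);
            extended.add(feature); // e.g. "via:cast:int" or "via:overflow"
            return new TraceInfluence<>(taint, extended, callGraph);
        }
    }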

3.6.2 Influence Semantics with Features

The result of a cast expression propagates the influence of the expression being cast, while adding a new feature to record that the result has been influenced by a cast. We model it by adding a feature via:cast:int when this happens:

expr((int)e, i) ≜ let ⟨t, f, g⟩ = expr(e, i) in ⟨t, f ∪ {via:cast:int}, g⟩.

Example: Trace-based Taint Analysis with syntactic features
In the examples of Figs. 3.1 and 3.2, a taint analysis based on Z will be able to differentiate between the two flows, as in the second case the value that reaches the sink is influenced by a via:cast:int feature. Thus it can report an alarm for a SQL injection in the first case, while in the second it reports no SQL injection alarm, but a warning that the attacker may control the id in a SQL query, which may or may not be a security or privacy problem, depending on the context.

The result of a binary expression propagates the influences of its sub-expressions, while recording the information about the operation with an opportune feature.


This information can be a simple tag for the operation, whether the operation may overflow or not, etc. For instance, expr for a binary expression may be defined as

expr(e₁ op e₂, i) ≜ let ⟨t₁, f₁, g₁⟩ = expr(e₁, i) in
                    let ⟨t₂, f₂, g₂⟩ = expr(e₂, i) in
                    let overflow = if overflows(e₁ op e₂) then {via:overflow} else ∅ in
                    ⟨t₁ ⊔ t₂, f₁ ∪ f₂ ∪ {op} ∪ overflow, g₁ ⊔ g₂⟩.

Features can also be used to communicate information between different abstract domains, and hence achieve a form of reduced product [11]. In particular, they are very useful when combined with numerical domains, so the overflows function above may use information from a numerical abstract domain to provide more precise information.

Example: Trace-based Taint Analysis with semantic features
In the example of Fig. 3.3, an analysis based on Z will detect that the value that reaches the memory allocation is influenced by the attacker and may overflow—the analysis will raise an alarm. On the other hand, in the example of Fig. 3.4, there will be no overflow feature attached to the value reaching the sink—the analysis will not raise an alarm.

3.6.3 Inter-Procedural Analysis

On entry and exit of functions, the analysis records the fact in the abstraction of the call graph:

entry(p, ⟨t, f, g⟩) ≜ ⟨t, f, g ⊕ in(p)⟩
return(p, ⟨t, f, g⟩) ≜ ⟨t, f, g ⊕ out(p)⟩.

In certain cases, we would also like to attach features to functions, and we do so by refining the definitions above to include the features on entry, return, or both.

3.6.4 Features for Analysis Approximations

Features can also be used as witnesses that the analysis has performed some abstraction, and hence that the result is approximated. For instance, when abstracting the heap with trees of access paths [15], at a certain point the analysis may decide to collapse some sub-tree, losing the precise taint information on individual fields and causing over-tainting.


The analysis can add a feature to capture that there may be over-tainting caused by the collapse of the tree. In general, features inserted by the analysis to report known over-approximations can be used to triage or prioritize the results, e.g., by first showing the traces that have any of those features.

3.7 Experience

At Meta, we developed static analyzers based on the trace taint analysis of Sect. 3.6 that automatically find more than 50% of the security vulnerabilities in our products (Facebook, Messenger, Instagram, WhatsApp …) [4]. The static analysis tools were designed and developed in collaboration with the security engineers of the Meta App Security team (ProdSec). They cover hundreds of millions of lines of code, spanning the major Meta codebases, that is, the Hack [7], Android [5], Python [6], and native code bases. The tools share the abstract domain design and philosophy, but they differ in the way they handle the peculiarities of the particular languages and AST nodes. This design is in contrast with the common approach of using a common intermediate language (IR) to compile different high-level languages. In fact, the compilation step introduces an obfuscation of the code, which causes a loss of precision in the analysis that must then be recovered using a more expressive abstract domain, or decompilation [17].

Our tools take as input: (i) a denotational description of the sources, sinks, and sanitizers; (ii) a set of rules describing which pairs of sources and sinks cause a security risk. The output is a set of traces describing taint flows from the sources to the sinks, annotated with features. Security engineers tweak the signal-to-noise ratio by writing filters on the traces, e.g., including or excluding traces that contain a given feature. They also provide feedback to the tool developers on how to improve the analysis—false positives can be reduced by having a more precise abstract semantics for a given statement or by adding a new feature. Overall, the design of filters by domain experts and the continuous feedback on the abstract semantics guarantee a high signal (few false positives, few false negatives).

3.8 Conclusions

A taint analysis based on the reachable states semantics is inherently imprecise. To make the analysis precise, one should use the trace semantics as concrete semantics, and build abstractions that are coarse enough to scale up to hundreds of millions of lines of code and precise enough to expose all the information security engineers need to configure very precise filters. In our experience building static analysis tools for security, we achieved a very high precision (few false positives, few false negatives) by (i) basing our analysis on the target language and not on an intermediate, low-level representation; and (ii) emitting opportune features that describe the relevant transformations the data goes through and the approximations the analysis performs.

References

1. CodeQL. https://codeql.github.com/. Accessed 2022-09-14
2. DataFlowSanitizer design document. https://clang.llvm.org/docs/DataFlowSanitizerDesign.html. Accessed 2022-08-12
3. GrammaTech CodeSonar. https://resources.grammatech.com/youtube-all-videos/tainteddata-analysis-in-codesonar. Accessed 2022-09-14
4. How Meta and the security industry collaborate to secure the internet. https://engineering.fb.com/2022/07/20/security/how-meta-and-the-security-industry-collaborate-to-secure-the-internet/. Accessed 2022-09-15
5. Open-sourcing Mariana Trench: Analyzing Android and Java app security in depth. https://engineering.fb.com/2021/09/29/security/mariana-trench/. Accessed 2022-09-15
6. Pysa: An open source static analysis tool to detect and prevent security issues in Python code. https://engineering.fb.com/2020/08/07/security/pysa/. Accessed 2022-09-15
7. Zoncolan: How Facebook uses static analysis to detect and prevent security issues. https://engineering.fb.com/2019/08/15/security/zoncolan/. Accessed 2022-09-15
8. Cousot, P.: The calculational design of a generic abstract interpreter. In: Broy, M., Steinbrüggen, R. (eds.) Calculational System Design. NATO ASI Series F. IOS Press, Amsterdam (1999)
9. Cousot, P.: Principles of Abstract Interpretation. MIT Press (2021)
10. Cousot, P., Cousot, R.: Abstract interpretation: A unified lattice model for static analysis of programs by construction or approximation of fixpoints. In: Conference Record of the Fourth ACM Symposium on Principles of Programming Languages, Los Angeles, California, USA, January 1977, pp. 238–252. ACM (1977)
11. Cousot, P., Cousot, R.: Systematic design of program analysis frameworks. In: Conference Record of the Sixth Annual ACM Symposium on Principles of Programming Languages, San Antonio, Texas, USA, January 1979, pp. 269–282. ACM Press (1979)
12. Cousot, P., Cousot, R., Fähndrich, M., Logozzo, F.: Automatic inference of necessary preconditions. In: Verification, Model Checking, and Abstract Interpretation, 14th International Conference, VMCAI 2013, Rome, Italy, January 20-22, 2013, Proceedings, Lecture Notes in Computer Science, vol. 7737, pp. 128–148. Springer (2013)
13. Deng, C., Cousot, P.: The systematic design of responsibility analysis by abstract interpretation. ACM Trans. Program. Lang. Syst. 44(1), 3:1–3:90 (2022)
14. Denning, D.E.: A lattice model of secure information flow. Commun. ACM 19(5), 236–243 (1976)
15. Deutsch, A.: Interprocedural may-alias analysis for pointers: Beyond k-limiting. In: Proceedings of the ACM SIGPLAN'94 Conference on Programming Language Design and Implementation (PLDI), Orlando, Florida, USA, June 20-24, 1994, pp. 230–241. ACM (1994)
16. Hubert, L., Jensen, T., Pichardie, D.: Semantic foundations and inference of non-null annotations. In: Formal Methods for Open Object-Based Distributed Systems, 10th IFIP WG 6.1 International Conference, FMOODS 2008, Oslo, Norway, June 4-6, 2008, Proceedings, Lecture Notes in Computer Science, vol. 5051, pp. 132–149. Springer (2008)
17. Logozzo, F., Fähndrich, M.: On the relative completeness of bytecode analysis versus source code analysis. In: Compiler Construction, 17th International Conference, CC 2008, Budapest, Hungary, March 29 - April 6, 2008, Proceedings, Lecture Notes in Computer Science, vol. 4959, pp. 197–212. Springer (2008)
18. Xie, Y., Aiken, A.: Static detection of security vulnerabilities in scripting languages. In: Proceedings of the 15th USENIX Security Symposium, Vancouver, BC, Canada, July 31 - August 4, 2006. USENIX Association (2006)

Chapter 4

“Fixing” the Specification of Widenings

Enea Zaffanella and Vincenzo Arceri

Abstract The development of parametric analysis tools based on Abstract Interpretation relies on a clean separation between a generic fixpoint approximation engine and its main parameter, the abstract domain: a safe integration requires that the engine uses each domain operator according to its specification. Widening operators are special, among other reasons, in that they lack a single, universally adopted specification. In this paper, we review the specification and usage of widenings in several open-source implementations of abstract domains and analysis tools. While doing this, we witness a mismatch that potentially affects the correctness of the analysis, thereby suggesting that the specification of widenings should be “fixed”. We also provide some evidence that a fixed widening specification matching the classical one allows for precision and efficiency improvements.

4.1 Introduction

Abstract Interpretation (AI) [17–19] is a mature research field, with many years of research and development that led to solid theoretical results and strong practical results. Among the many maturity indicators, the most important ones are probably the following:

• the development of AI tools (analyzers, verifiers, etc.) is firmly based on the theoretical results; while some of these tools happen to take shortcuts for practical reasons, these are usually identifiable and quite often well-documented;
• some AI tools have been industrialized and commercialized; we only mention here the remarkable case of Astrée [10];



• there is growing support for collaborative work, including open-source libraries of abstract domains and AI frameworks; hence, it is relatively easy to extend, combine and compare AI tools;
• in recent years, the synergy between theory and practice has been fostered by the rather common usage of pairing research papers with corresponding artifacts, allowing for repeatability evaluations and even competitions between AI tools.

As a rough attempt at summarizing all the positive consequences of working in a mature research field, one may state the following: the knowledgeable members of the AI community (who are not necessarily experts) know what is needed in order to have things go right; also, they can tell when things are going wrong. In this paper, we will question the truth of the last part of the sentence above: we will provide some evidence that, in the recent past, things may have gone wrong without people noticing (even the experts).

We start by informally introducing the problem. The implementation of an AI-based static analysis tool is greatly simplified by pursuing a clean separation of its different components [35]: in particular, the development of a generic AI engine (i.e., the fixpoint approximation component) is kept distinct from the development of its main parameter, the abstract domain. Quite often, these two components are designed and implemented by different people, possibly working for distinct organizations. A safe integration requires that the AI engine implements its functionalities by using a subset of the semantic operators provided by the abstract domain and, in particular, that each of these operators is used according to its formal specification. In order to enable and foster collaborative work, it is therefore essential that the AI research community converges toward a well-defined, uniform interface for these operators. While this goal can be considered achieved for most semantic operators, it has been surprisingly missed for widening operators: as we will see, widenings happen to be implemented and used according to slightly different and incompatible specifications, thereby becoming a potential source of interface mismatches that can affect the correctness of AI tools.

In this paper, we will review the specification and usage of widenings in several open-source implementations of abstract domains and analysis tools, witnessing the above-mentioned mismatch. We will also propose a solution that, in our humble opinion, represents a further little step toward the maturity of the research field: namely, we suggest to “fix” the specification of widenings by choosing, once and for all, the one that more likely allows for preserving, besides correctness, also efficiency and precision.

The paper is organized as follows. In Sect. 4.2, we briefly recall AI-based static analysis and the classical specification for widening operators. In Sect. 4.3, after introducing an alternative specification for widenings, we classify open-source abstract domain and AI engine implementations according to the adopted widening specification. Building on this classification, in Sect. 4.4 we describe and compare the possible kinds of abstract domain and AI engine combinations, discussing the safety problems potentially caused by specification mismatches. In Sect. 4.5, we propose to fix the specification of widenings, so as to solve the identified problems and also simplify the development of correct abstract domains and AI engines, while avoiding inefficiencies and precision losses.
We conclude in Sect. 4.6.


4.2 Background

In this section, we briefly recall some basic concepts of Abstract Interpretation (AI) [16–20]. Interested readers are also referred to [41], which provides an excellent tutorial focused on numerical properties. For exposition purposes, we will describe here the classical AI framework based on Galois connections; however, as discussed at length in [19], it is not difficult to preserve the overall correctness of the analysis even when adopting weaker algebraic properties.

The semantics of a program is specified as the least fixpoint of a continuous operator f_C : C → C defined on the concrete domain C, often formalized as a complete lattice ⟨C, ⊑_C, ⊔_C, ⊓_C, ⊥_C, ⊤_C⟩. In most cases, the partial order relation ⊑_C models both the computational ordering (used in the fixpoint computation) and the approximation ordering (establishing the relative precision of the concrete properties). The least fixpoint can be obtained as the limit lfp f_C = ⊔_C { c_i | i ∈ ℕ } of the increasing chain c_0 ⊑_C … ⊑_C c_{i+1} ⊑_C … of Kleene's iterates, defined by c_0 ≝ ⊥_C and c_{i+1} ≝ f_C(c_i), for i ∈ ℕ.

The goal of AI-based static analysis is to soundly approximate the concrete semantics using a simpler abstract domain A, which is usually (but not always) formalized as a bounded join semi-lattice ⟨A, ⊑_A, ⊔_A, ⊥_A, ⊤_A⟩. Intuitively, the relative precision of the abstract elements is encoded by the abstract partial order ⊑_A, which should mirror the concrete one. Formally, the concrete and abstract domains are related by a Galois connection, which is a pair of monotone functions α : C → A and γ : A → C satisfying

    ∀c ∈ C, ∀a ∈ A : α(c) ⊑_A a ⇔ c ⊑_C γ(a).

When C and A are related by a Galois connection, an abstract function f_A : A → A is a correct approximation of the concrete function f_C : C → C if and only if α(f_C(c)) ⊑_A f_A(α(c)) for all c ∈ C or, equivalently, if f_C(γ(a)) ⊑_C γ(f_A(a)) for all a ∈ A. Note that correct approximations are preserved by function composition; hence the problem of approximating the concrete function f_C is usually simplified by decomposing it into simpler functions (depending on the semantics of the considered programming language), which are then approximated individually. As a matter of fact, software libraries implementing abstract domains usually provide several abstract operators approximating the most common concrete semantic operators (addition/removal of program variables, assignments, conditional tests, merging of different control flow paths, etc.).


By exploiting the correctness condition above, traditional AI-based static analysis tools approximate the concrete semantics by computing abstract Kleene iterates. In principle, they will compute the sequence a_0, …, a_{i+1}, … defined by a_0 ≝ ⊥_A and a_{i+1} ≝ f_A(a_i), for i ∈ ℕ: by induction, each abstract iterate is a correct approximation of the corresponding concrete iterate, i.e., c_i ⊑_C γ(a_i); moreover, if the abstract sequence converges to an abstract element a ≝ ⊔_A { a_i | i ∈ ℕ } ∈ A, then the correctness relation also holds for the least fixpoint, i.e., lfp f_C ⊑_C γ(a). In general, however, the abstract sequence may not converge at all (since the abstract function f_A is not required to be monotone nor extensive, and neither is the abstract domain required to be a complete lattice) or it may fail to converge in a finite number of steps. The classical method to obtain a finite convergence guarantee is the use of widening operators; we recall here the classical specification for widening operators as provided in [16, 17, 20] (see also [41]).

Definition 4.1 (Classical widening) A widening ∇ : A × A → A is an operator such that

1. ∀a_1, a_2 ∈ A : (a_1 ⊑_A a_1 ∇ a_2) ∧ (a_2 ⊑_A a_1 ∇ a_2);
2. for all ascending sequences a_0 ⊑_A … ⊑_A a_{i+1} ⊑_A …, the ascending sequence x_0 ⊑_A … ⊑_A x_{i+1} ⊑_A … defined by x_0 ≝ a_0 and x_{i+1} ≝ x_i ∇ a_{i+1} is not strictly increasing.

The abstract increasing sequence with widening is x_0 ⊑_A … ⊑_A x_{i+1} ⊑_A …, where x_0 ≝ ⊥_A and, for each i ∈ ℕ,

    x_{i+1} ≝ x_i ∇ f_A(x_i).    (4.1)

Note that Condition 1 in Definition 4.1 states that the widening is an upper bound operator for the abstract domain; this makes sure that the computed abstract sequence is indeed increasing, thereby enabling the application of Condition 2 and enforcing convergence after a finite number k ∈ ℕ of iterations, obtaining lfp f_C ⊑_C γ(x_k). Roughly speaking, since the classical widening is an upper bound operator, the AI engine can directly use it in Eq. (4.1) as a replacement of the join operator. It is well known that the approximation computed by the widening during the ascending sequence may be rather coarse; it can be improved by coupling it with a corresponding decreasing sequence using narrowing operators [17, 19, 20], and it is also possible to intertwine the ascending and descending phases [4]. Other techniques (e.g., widening up-to [32], lookahead widening [27], stratified widening [43]) are meant to directly improve the precision of the ascending sequence. All of these will not be discussed further, as they are completely orthogonal to the goal of this paper.
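The iteration scheme of Eq. (4.1) is easy to express in code; here is a minimal Python sketch of a classical AI engine, where bottom, f (the abstract transfer function), widen and leq stand for the abstract domain's operators (illustrative names):

    def lfp_with_widening(bottom, f, widen, leq):
        x = bottom
        while True:
            nxt = widen(x, f(x))   # x_{i+1} = x_i ∇ f_A(x_i)
            if leq(nxt, x):        # stabilized: x is a post-fixpoint
                return x
            x = nxt

Since the classical widening is an upper bound operator, x ⊑_A nxt always holds, so the test leq(nxt, x) detects stabilization.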

4.3 On the Specification of Widening Operators

In the previous section, we have recalled the classical specification for the widening operator. However, a quick review of the literature on Abstract Interpretation shows that such a formalization is not uniformly adopted. Quoting from [20, footnote 6]:

    Numerous variants are possible. For example, […]


An incomplete list of the possible variants includes:

• using a different widening for each iterate [14];
• using an n-ary or set-based widening [13, 19], depending on many (possibly all) previous iterates;
• using a widening satisfying the minimalistic specification proposed in [42];
• using a widening for the case when the computational and approximation orderings are different [19, 49].

Some of these variants may be regarded as having (only) theoretical interest, as their application in practical contexts has been quite limited. Here below we describe in more detail the very first variant mentioned in [20, footnote 6], which has been quite successful from a practical point of view, being adopted by a relatively high number of open-source abstract domain libraries and AI tools.

Definition 4.2 (Alternative widening) A widening ∇ : A × A → A is an operator such that

1. ∀a_1, a_2 ∈ A : a_1 ⊑_A a_2 ⟹ a_2 ⊑_A a_1 ∇ a_2;
2. for all ascending sequences a_0 ⊑_A … ⊑_A a_{i+1} ⊑_A …, the ascending sequence x_0 ⊑_A … ⊑_A x_{i+1} ⊑_A … defined by x_0 ≝ a_0 and x_{i+1} ≝ x_i ∇ a_{i+1} is not strictly increasing.

When this alternative specification for the widening is adopted, the (alternative) abstract increasing sequence with widening x_0 ⊑_A … ⊑_A x_{i+1} ⊑_A … is defined by x_0 ≝ ⊥_A and, for each i ∈ ℕ,

    x_{i+1} ≝ x_i ∇ (x_i ⊔_A f_A(x_i)).    (4.2)

When comparing the alternative and classical specifications, we see that:

• in Condition 1 of Definition 4.2, it is required that a_1 ⊑_A a_2, so that this widening is an upper bound operator only when this precondition is satisfied;
• in order for this assumption to hold, in Eq. (4.2) the AI engine uses the join x_i ⊔_A f_A(x_i) as the second argument for the widening.

In other words, when adopting this alternative specification, the widening is meant to be used in addition to the join operator, rather than as a replacement of the join, as was the case for the classical specification.

The option of choosing between the classical and alternative widening specifications has also been recalled more recently in [15]: in the main body of that paper, it is assumed that the abstract semantic function f_A is extensive, thereby fulfilling the precondition of Definition 4.2; in [15, footnote 5], it is observed that the extensivity requirement can be satisfied by using Eq. (4.2); the classical widening specification is instead described in [15, footnote 6].

When considering parametric AI-based tools, the minor differences highlighted in the comparison above are nonetheless very important from a practical point of view, since they are directly affecting the interface boundary between the AI engine component and the abstract domain component, which are quite often developed independently.


It is therefore essential to classify these components depending on the widening specification they are adopting. In the following, we classify a subset of the available abstract domains and AI tools: we will focus on numerical abstract domains, but the reasoning can be extended to other abstract domains (e.g., the domain of finite state automata [22] used for the analysis of string values [5]); also, we only consider open-source implementations, since our classification is merely based on source code inspection.

4.3.1 Classifying Abstract Domain Implementations

The classification of an abstract domain implementation is based on checking whether the corresponding widening operator is modeled according to Definition 4.1 or 4.2, i.e., whether or not it requires the precondition a_1 ⊑_A a_2. We consider most of the domains provided by the open-source libraries APRON (Analyse de PROgrammes Numériques) [37], ELINA (ETH LIbrary for Numerical Analysis) [23], PPL (Parma Polyhedra Library) [7], PPLite [8] and VPL (Verified Polyhedron Library) [11], as well as a few abstract domains that are embedded in specific AI tools, such as Frama-C [38], IKOS [12] and Jandom [1]. The results of our classification are summarized in Table 4.1. We discuss in some detail the case of the domain of intervals (which generalizes to multi-dimensional boxes).

Definition 4.3 (Interval abstract domain) The lattice of intervals on I ∈ {ℤ, ℚ} has carrier

    Itv = {⊥} ∪ { [ℓ, u] | ℓ ∈ I ∪ {−∞}, u ∈ I ∪ {+∞}, ℓ ≤ u },

partially ordered by ⊥ ⊑ x, for all x ∈ Itv, and [ℓ_1, u_1] ⊑ [ℓ_2, u_2] ⇔ ℓ_1 ≥ ℓ_2 ∧ u_1 ≤ u_2.

Table 4.1 Classifying implementations of abstract domains based on widening specification

  abstract domain            | classical widening: Definition 4.1 | alternative widening: Definition 4.2
                             | (no precondition)                  | (requires a_1 ⊑_A a_2)
  Intervals/boxes [16, 17]   | APRON*, IKOS, Jandom               | PPL
  Zones [40]                 | ELINA*, IKOS                       | PPL
  Octagons [40]              | APRON*, ELINA*, IKOS, Jandom       | PPL
  Polyhedra [21, 30]         |                                    | APRON, ELINA, PPL, PPLite, VPL
  Parallelotopes [3]         | Jandom                             |
  Interval congruences [39]  | Frama-C                            |


Almost all implementations of this domain adopt the definition of interval widening provided in [16, 17], matching the classical specification of Definition 4.1.

Definition 4.4 (Classical widening on intervals) The interval widening ∇ : Itv × Itv → Itv is defined by x ∇ ⊥ ≝ ⊥ ∇ x ≝ x, for all x ∈ Itv, and

    [ℓ_1, u_1] ∇ [ℓ_2, u_2] ≝ [(ℓ_2 < ℓ_1 ? −∞ : ℓ_1), (u_2 > u_1 ? +∞ : u_1)].

In contrast, the implementation provided by the PPL [7] is based on the alternative widening specification of Definition 4.2.

Definition 4.5 (Alternative widening on intervals) The interval widening ∇ : Itv × Itv → Itv is defined by ⊥ ∇ x ≝ x, for all x ∈ Itv, and

    [ℓ_1, u_1] ∇ [ℓ_2, u_2] ≝ [(ℓ_2 ≠ ℓ_1 ? −∞ : ℓ_1), (u_2 ≠ u_1 ? +∞ : u_1)].
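The two definitions differ only in the comparisons used on the bounds; the following sketch (non-empty intervals only, with infinite bounds encoded as floats and ⊥ omitted for brevity) mirrors them directly:

    INF = float("inf")

    def widen_classical(i1, i2):
        # Definition 4.4: no precondition on i1, i2.
        (l1, u1), (l2, u2) = i1, i2
        return (-INF if l2 < l1 else l1, +INF if u2 > u1 else u1)

    def widen_alternative(i1, i2):
        # Definition 4.5: assumes i1 ⊑ i2, i.e., l2 <= l1 and u1 <= u2, so
        # the strict comparisons can be replaced by cheaper disequality tests.
        (l1, u1), (l2, u2) = i1, i2
        return (-INF if l2 != l1 else l1, +INF if u2 != u1 else u1)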

One may wonder what Definition 4.5 is going to gain with respect to Definition 4.4: technically speaking, the main motivation for adding a precondition to a procedure is to enable optimizations; for instance, when considering intervals with arbitrary precision rational bounds, testing r_1 = r_2 is more efficient than testing r_1 < r_2.¹

Weakly relational domains, such as zones and octagons [40], are characterized by widening operators that are almost identical to those defined on intervals (these three domains can be seen as instances of the template polyhedra construction [46]). We can thus repeat the observations made above to conclude that almost all implementations adopt the classical widening specification, while the PPL implementation keeps adopting the alternative one.

The case of library APRON deserves a rather technical, pedantic note. All the abstract domains provided by this library are meant to implement a common interface, whose C language documentation states the following:²

    ap_abstract1_t ap_abstract1_widening(ap_manager_t* man,
                                         ap_abstract1_t* a1, ap_abstract1_t* a2)
    Widening of a1 with a2. a1 is supposed to be included in a2.

That is, all the domains in APRON, including boxes and octagons, are meant to implement the alternative widening specification; hence, even though the actual source code implementing the widenings for boxes has always been based on Definition 4.4 (up to version 0.9.13), in principle any future release of the library may decide to change its inner implementation details and switch to Definition 4.5. However, since this is an unlikely scenario, in Table 4.1 we keep these implementations in the column of the classical widening (marking them with a star).

¹ Arbitrary precision rationals are usually represented as canonicalized fractions; hence, testing for equality is relatively cheap, whereas testing n_1/d_1 < n_2/d_2 may require two arbitrary precision products.
² Similar sentences are found in the documentation of the C++ and OCaml interfaces.

A similar observation applies to the domains in the ELINA library since, as far as we can tell, its programming interface is meant to be identical to that of APRON; in this case, however, there is no clear statement in the documentation.

Things are rather different when considering the domain of convex polyhedra: even though the standard widening operator defined in [21, 30] is modeled according to the classical specification, it turns out that all the available implementations adopt the alternative specification of Definition 4.2 (no matter if considering the standard widening of [21, 30] or the more precise widenings defined in [6, 8]).

The expert reader might note that Table 4.1 provides a rather incomplete summary, lacking entries for many numerical domains. We stress that the only purpose of this table is to witness a non-uniform situation; this is further confirmed by the last two lines of the table, where we have considered a couple of less popular abstract domain implementations (parallelotopes [3] and interval congruences [39]).

It is also instructive to extend our review to consider the case of an abstract domain satisfying the Ascending Chain Condition (ACC), meaning that all its ascending chains are finite. We still require a widening operator for the following reasons:

1. to obtain a uniform abstract domain interface, which simplifies the generic implementation of AI engines;
2. to accelerate (rather than ensure) the convergence of the abstract iteration sequence, when the abstract domain contains very long increasing chains.

When the last reason above does not apply, an obvious option is to define a dummy widening operator, intuitively based on the abstract join operator ⊔_A, so as to avoid precision losses as much as possible. Once again, we are faced with the choice of adopting the classical specification of Definition 4.1, leading to

    a_1 ∇ a_2 ≝ a_1 ⊔_A a_2,    (4.3)

or the alternative specification of Definition 4.2, exploiting precondition a_1 ⊑_A a_2 to improve efficiency, and hence obtaining

    a_1 ∇ a_2 ≝ a_2.    (4.4)

As an example, the sign domain [17] found in the Frama-C framework implements its dummy widening according to Eq. (4.3):

    sign_domain.ml: let widen = join

whereas the same framework implements the widening for the generic lattice set domain according to Eq. (4.4):³

    abstract_interp.ml: let widen _wh _t1 t2 = t2

This is somehow surprising, since one would expect that all the abstract domains implemented in a given framework are developed according to the same specification.

³ Parameter _wh, which is ignored, is meant to provide widening hints [10].


Table 4.2 Classifying AI engines based on widening specification

  classical engine: Eq. (4.1) (only widening):
    CPAchecker [9], DIZY [45], GoLiSA [44], IKOS [12], Jandom [1], LiSA [24], SeaHorn/Crab [29]
  alternative engine: Eq. (4.2) (join + widening):
    (CPAchecker + APRON polyhedra), Frama-C [38], Goblint [50], Interproc [36], (Jandom + PPL domains), PAGAI [34], (SeaHorn/Crab + Boxes)

4.3.2 Classifying AI Engine Implementations

We now turn our attention to the other side of the interface, checking how widening operators are used in different AI engines, that is, whether they use widenings to replace joins, as in Eq. (4.1), or in addition to joins, as in Eq. (4.2). The results of the second classification are summarized in Table 4.2. As before, we are focusing on a rather limited subset of the available open-source AI engines, since our only goal is to show that, again, we have a non-uniform situation.

It is worth stressing that, in this case, the classification process is more difficult, because there are AI tools that try to avoid the widening specification mismatch by adapting their own AI engine depending on the abstract domain chosen for the analysis. For instance, CPAchecker, Jandom and SeaHorn/Crab are all classified to follow the classical specification, as they usually only apply the widening operator. However, they switch to the alternative specification (computing joins before widenings) when using, respectively, the polyhedra domain provided by APRON, the abstract domains provided by PPL and the abstract domain of Boxes [28]; these special cases are reported in Table 4.2 in parentheses.⁴

⁴ Our classification is based on human code review, hence it is error-prone: we may have missed other specific combinations treated as special cases.

4.4 Combinations of Abstract Domains and AI Engines

In the previous section, we have seen that widening implementation in abstract domains and widening usage in AI engines can be classified according to the specification they are meant to follow. Here, we consider all possible ways to combine the two components inside an AI tool. Even though we will often refer to combinations that are actually found in some of the available tools, the discussion is meant to be more general: that is, we also target potential combinations of engines and domains that have never been implemented in the considered AI tools; the idea is that these combinations are anyway possible when assuming a collaborative context.⁵

The possible combinations are classified in Table 4.3, where we highlight in italic blue the safe portions of the specification (i.e., the parts making sure that we will obtain a correct analysis result), while highlighting in boldface red the risky portions of the specification (whose correctness depends on assumptions that should be satisfied by the other interfaced component). Also note that, in the table, for exposition purposes, we use IKOS (resp., PAGAI) to represent any AI engine adopting the classical (resp., alternative) engine specification; similarly, we use APRON's octagons (resp., polyhedra) to represent any abstract domain whose widening is implemented according to the classical (resp., alternative) widening specification.

Table 4.3 Classifying possible combinations of AI engines and widenings

  AI engine \ widening                      | classical: Definition 4.1 | alternative: Definition 4.2
                                            | (no precondition)         | (requires a_1 ⊑_A a_2)
  classical: Eq. (4.1) (only widening)      | IKOS + octagons           | IKOS + polyhedra
  alternative: Eq. (4.2) (join + widening)  | PAGAI + octagons          | PAGAI + polyhedra

We now briefly describe the four kinds of combinations.

IKOS + octagons. This combination matches the classical specification in [16, 17, 20], whose safety is ensured by Definition 4.1.

PAGAI + polyhedra. This combination matches the alternative widening specification of [20, footnote 6], whose safety is ensured by Eq. (4.2). The combination Frama-C + generic lattice set mentioned previously can be seen to be another instance of this kind, where the abstract domain satisfies the ACC.

PAGAI + octagons. This combination, which mixes the classical and alternative specifications, is probably not explicitly described in the literature; it can thus be interpreted as the result of a harmless specification mismatch. The safety of the resulting analysis is clearly ensured (by adopting a belt and suspenders approach). The combination Frama-C + signs mentioned previously can be seen to be another instance of this hybrid approach, where the abstract domain satisfies the ACC.

IKOS + polyhedra. This combination is once again mixing the classical and alternative specifications, but in this case both the AI engine and the widening implementation take a risky approach, assuming that the other component will do what is required to obtain safety. Since neither assumption is satisfied, safety is (potentially) compromised.

⁵ Clearly, such an ideal scenario is sometimes limited by licensing issues and/or architectural and programming language implementation barriers.


In summary, of the four possible combination kinds, the first three are safe and the last one is (potentially) unsafe.

4.4.1 Some Thoughts on the Unsafe Combinations

Technically speaking, we are witnessing the violation of a programming contract: depending on the implementation details, the outcome could be either the raising of a well-defined error or an undefined behavior. In the first (unlikely) case, the implementers are left with two options: either they simply report the problem to the user, interrupting the static analysis process, or they work around the issue by arbitrarily returning a safe approximation of the actual result (e.g., the uninformative top element ⊤_A ∈ A). In the more likely case of undefined behavior, the outcome is obviously unpredictable. For instance, a few experiments on the unsafe combination PAGAI + PPLite polyhedra⁶ resulted in observing any of the following:

• identical results;
• different safe results (which usually were precision losses);
• unsafe results;
• non-termination of the analysis;
• segmentation faults.

⁶ As shown in Table 4.2, PAGAI is a safe AI engine; the variant used in these experiments was obtained by removing the computation of joins before widenings, i.e., replacing on purpose Eq. (4.2) with Eq. (4.1).

It is worth stressing that this issue is well known to AI experts. For instance, in the case of convex polyhedra, a full blown example showing the risk of missing the inclusion requirement was provided in [6, Example 5]; the problem has also been recalled more recently in [31]:

    […] widening operators are generally designed under the assumption that their first operand is smaller than the second one.

As we already observed, the precondition a_1 ⊑_A a_2 is often mentioned in the documentation of software libraries (e.g., APRON, PPL and VPL) and in some of them the implementers have added assertions to check it; however, these assertions are executed only in debugging mode, which is often avoided for efficiency reasons.

The case of VPL [11] deserves a longer note. As its name suggests, the Verified Polyhedron Library is meant to support verified computations on the domain of convex polyhedra: that is, each library operation, besides providing the result, also yields a correctness certificate; the result and the certificate can then be supplied to a verification procedure, which is formally proved correct in Coq. Therefore, the widening specification mismatch under investigation would have been a perfect fit for the VPL setting. Unfortunately, the library developers decided that widening is the one and only operator on polyhedra not requiring a certificate. Quoting from VPL's documentation:

    Note that [widen p1 p2] relies on [p1] being included in [p2]. The result includes the two operands, although no certificate is created.

Hence, VPL cannot promptly detect this kind of error. In [25], it is argued that widening operators need not be certified because the framework is anyway going to formally check the computed post-fixpoint for inductiveness, so that any unsafe result will be anyway identified (later on). However, as discussed above, unsafety is just one of the possible undesired behaviors.

To summarize, under this non-uniform state of things, it is rather difficult to guarantee that AI engines and abstract domains possibly developed by different research groups are always combined safely. Indeed, a partial review of the literature shows that a few unsafe combinations have been used, in recent years, in at least four artifacts accepted in top-level conferences:

1. a SAS 2013 artifact [45], combining DIZY with APRON polyhedra;
2. a POPL 2017 artifact [47], combining SeaHorn/Crab with (APRON, ELINA and PPL) polyhedra;
3. a CAV 2018 artifact [48], combining SeaHorn/Crab with ELINA polyhedra;
4. a PLDI 2020 artifact [33], combining SeaHorn/Crab with ELINA polyhedra.

The unsafety of the last three artifacts, which are based on the SeaHorn/Crab engine, has been confirmed by one of their authors.⁷

At this point, it might seem natural to question whether or not these AI tools are actually computing unsafe results. Note that the answer is far from being obvious, because these analyses might be often returning a correct result anyway: namely, they might be losing safety in one iteration of the abstract sequence just to regain it at a later iteration, possibly due to the usual over-approximations incurred by the abstract semantic operators; this effect might even be amplified by the non-monotonicity of widening operators. We actually conjecture that all these experimental evaluations are almost never affected by the potential safety issue, so that they keep playing their role in the context of the considered papers. More importantly, it is our opinion that answering the question above is not really interesting: as soon as a potential problem is identified, one should focus on finding an adequate solution for the future, rather than looking for a specific example witnessing an actual error in the past. This will be the goal of Sect. 4.5.

4.4.2 Comparing the Safe Combinations

We now compare the safe combinations, trying to understand whether there indeed are good reasons to have three slightly different options.

⁷ G. Singh, personal communication.


We start by observing that any combination such as PAGAI + octagons is going to lose some efficiency because the alternative AI engine, by following Eq. (4.2), is going to compute joins that are completely useless. Hence, we are left with a comparison of the two combinations corresponding to the classical (IKOS + octagons) and alternative (PAGAI + polyhedra) specifications.

Firstly, we should get rid of a false impression related to the efficiency of widenings. When adopting the point of view of the abstract domain implementer, one may think that the alternative specification is going to be less costly because it makes it possible to exploit the inclusion precondition without having to compute the expensive join (recall the observation we made just after Definition 4.5). However, when considering the AI tool as a whole, we can easily note that the computation of the join is not really avoided: rather, it is just delegated to the AI engine component, so that any corresponding overhead is still there.

Secondly, we will argue that the classical widening specification, by merging join and widening into a single operator, can sometimes trigger improvements in precision and/or efficiency. A potential precision improvement can be obtained when considering abstract domains that are not semi-lattices [26], such as the domain of parallelotopes [2, 3]. These domains are not provided with a least upper bound operator; rather, when computing a join, they necessarily have to choose among alternative, incomparable upper bound operators: a poor choice typically leads to precision losses. Hence, as observed in [2, 3], forcing a separation between the join and the widening (as required in the alternative widening specification) is going to complicate the choice of an appropriate upper bound; in contrast, an implementation based on the classical widening will be able to exploit more context information and preserve precision. By following the same reasoning, one could argue that an implementation based on the classical widening specification may be able to apply specific efficiency optimizations that would be prevented when separating the join from the widening; we will show a concrete example in this respect in Sect. 4.5.1.

To summarize, among the safe combinations, those based on the classical widening specification seem to have a few advantages and no disadvantages at all.

4.5 Lesson Learned and Recommendation

The discussion in the previous section has shown that having different, not fully compatible specifications for the widening operators implemented in abstract domains and used by AI engines can cause issues that potentially affect the overall correctness of the analysis. Knowledgeable software engineers implementing AI tools may be easily confused and even AI experts may fail to detect these specification mismatches, since their consequences are quite often hidden. In our humble opinion, this state of things should be corrected: when implementing parametric static analysis tools following the traditional AI approach, we should “fix” (i.e., stick to) a single and universally adopted specification for the widening operators.


For all we said in the previous section, our recommendation is to choose the classical widening of [16, 17]. Namely,

• the abstract domain designers should follow Definition 4.1 and make sure that their widening is an upper bound operator (with no precondition), thereby solving all the correctness issues; and
• the AI engine designers should always adopt Eq. (4.1), which slightly improves efficiency by avoiding useless joins and can also preserve precision when the analysis is using a non-lattice abstract domain.

It is worth stressing that, from a technical point of view, following the two recommendations above is quite simple: there is no new functionality that needs to be implemented; rather, the AI engine developers and the abstract domain implementers only need to agree on a clearer separation of concerns. For instance, an abstract domain implementing the alternative widening specification will have to wrap its implementation using an upper bound operator, as follows.

Proposition 1 Let ∇_risky : A × A → A be a widening operator satisfying Definition 4.2 and let ⊔̃_A : A × A → A be an upper bound operator on A (not necessarily the least one); then, the operator ∇_safe : A × A → A defined by

    ∀a_1, a_2 ∈ A : a_1 ∇_safe a_2 ≝ a_1 ∇_risky (a_1 ⊔̃_A a_2)

is a widening satisfying Definition 4.1.

Note that the widening wrapping operation is often (already) implemented inside classical AI engines that correctly interface an abstract domain implementing the alternative widening specification. Hence, we are just suggesting to move this wrapping operation (once and for all) into the abstract domain's widening implementation, thereby simplifying the parametric AI engine. In specific cases, the wrapped widening of Proposition 1 can be replaced by an ad hoc implementation meant to improve its precision and/or its efficiency.
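Proposition 1 amounts to a one-line wrapper; the following Python sketch shows it (join stands for any upper bound operator, not necessarily the least one):

    def make_safe_widening(widen_risky, join):
        # Wrap a Definition 4.2 widening (precondition a1 ⊑ a2) into a
        # Definition 4.1 widening (no precondition).
        def widen_safe(a1, a2):
            return widen_risky(a1, join(a1, a2))  # a1 ∇safe a2 = a1 ∇risky (a1 ⊔̃ a2)
        return widen_safe

For intervals, instantiating the wrapper with the interval join (componentwise min/max of the bounds) turns the operator of Definition 4.5 from the earlier sketch into exactly the operator of Definition 4.4.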

4.5.1 Safe Widenings for Convex Polyhedra

As an interesting example, we discuss in more detail the case of the abstract domain of topologically closed convex polyhedra, describing the several implementation options that allow to transform a risky widening implementation (as found in any one of the software libraries we considered) into a safe widening. For space reasons, we will provide abridged definitions of the domain and its operators, referring the interested reader to the well-known literature [6, 21, 30, 46].

Let ⟨CP_n, ⊆, ⊎, ∅, ℝⁿ⟩ be the semi-lattice of n-dimensional topologically closed convex polyhedra, where the least upper bound operator ⊎ denotes the convex polyhedral hull; let also ∇_std be an implementation of the standard widening operator assuming the precondition P_1 ⊆ P_2. Clearly, the simplest option is to instantiate Proposition 1 using ⊎, thereby obtaining

    P_1 ∇_a P_2 ≝ P_1 ∇_std (P_1 ⊎ P_2).    (4.5)

However, as observed in [46], this could result in some waste of computational resources. Roughly speaking, the standard widening P_1 ∇_std P_2 selects the constraints of P_1 that are satisfied by P_2. Hence, there is no real need to compute the least upper bound of the two polyhedra, which is also going to compute new constraint slopes. Rather, we can instantiate Proposition 1 by using the weak join operator ⊔_w,⁸ which does not compute new constraint slopes:

    P_1 ∇_b P_2 ≝ P_1 ∇_std (P_1 ⊔_w P_2).    (4.6)

The third option is to provide an ad hoc definition for the widening of polyhedra, which simply avoids the explicit computation of least/weak upper bounds.

Definition 4.6 (Ad hoc classical widening on CP_n) Let P_1, P_2 ∈ CP_n be represented by the constraint system C_1 and the generator system G_2, respectively, where C_1 = Eqs_1 ∪ Ineqs_1 is in minimal form. Consider the polyhedron P represented by C ≝ C_1 \ { c_1 ∈ Ineqs_1 | some g_2 ∈ G_2 violates c_1 }. Then,

    P_1 ∇_c P_2 ≝  P_1 ⊎ P_2,  if (P_1 = ∅) or (P_2 = ∅) or (some g_2 ∈ G_2 violates some c_1 ∈ Eqs_1);
                   P,          otherwise.

Note that this is a well-defined (i.e., semantic) widening operator, meaning that it does not depend on the constraint representation. This property holds because:

1. it considers a constraint system C_1 in minimal form;
2. it returns P_1 ⊎ P_2 whenever a generator in G_2 violates an equation in C_1.

Hence, the polyhedron P is computed only when P_1 and P_1 ⊎ P_2 have the same affine dimension (see [6, Proposition 6]). Also note that, in case 2 above, the computed result can be more precise than the one obtained using Eq. (4.5) or Eq. (4.6).

In Table 4.4, we summarize the results of an experimental evaluation (using a modified version of the PPLite library) where we compare the efficiency of the operators ∇_a, ∇_b and ∇_c; note that the use of operator ∇_a faithfully describes an implementation satisfying the alternative specification of Definition 4.2. The reader is warned that we are describing a synthetic efficiency comparison, having little statistical value: usually, a rather small fraction of the overall execution time of static analysis tools is spent computing widenings. In our synthetic test, we consider 70 pairs of randomly generated, fully dimensional closed polyhedra in a vector space of dimension 5 (each polyhedron is obtained by adding 5 random rays to a randomly generated bounded box).

⁸ This is the join used in the domain of template polyhedra [46], also called constraint hull.


Table 4.4 Efficiency comparison for variations of the standard widening on CP_n

  Time/operations      | P_1 ∇_a P_2 | P_1 ∇_b P_2 | P_1 ∇_c P_2 | ∇_a/∇_b | ∇_a/∇_c
  Cumulative time      | 163 ms      | 287 ms      | 19 ms       | 0.57    | 8.58
  Scalar products      | 808078      | 1153298     | 131482      | 0.70    | 6.15
  Linear combinations  | 42527       | 70162       | 4456        | 0.61    | 9.54
  Bitset operations    | 5881982     | 8556088     | 106031      | 0.69    | 5.55

For each pair, we compute the widenings according to the three implementations, keeping track of the overall elapsed time; note that the three widenings always compute identical results. We also repeat the test with a modified implementation that tracks the number of low-level operations (scalar products, linear combinations and bitwise operations on saturation vectors) executed during the computations of the widenings. Table 4.4 shows that, on these benchmarks, the ad hoc classical widening ∇_c is able to significantly improve the efficiency of ∇_a. In contrast, for ∇_b (which is based on the weak join operator) we obtain a slowdown; this is due to the fact that, after efficiently computing the constraint system for P_1 ⊔_w P_2, the conversion procedure is implicitly invoked to obtain the corresponding generator system (which is required by the current implementation of ∇_std).

4.5.2 A Note on the Unusual Widening Specifications

Our recommendation is meant for tools following the mainstream approach to Abstract Interpretation. In particular, we are assuming that the computational ordering is identical to the approximation ordering. The rather unusual case of a computational ordering ⊑_A distinct from the approximation ordering ≼_A was first considered in [19, Proposition 6.20]: in this case, non-standard hypotheses on the widening operator make sure that the approximation relation α(c_i) ≼_A a_i holds for each corresponding pair of concrete and abstract iterates (these properties can then be transferred to the fixpoint). The decision tree abstract domain proposed in [49], which can be used to prove conditional termination of programs, is an interesting specific example of a domain distinguishing the computational and approximation orderings. The corresponding widening operator, however, matches none of the specifications above: neither Definition 4.1, nor Definition 4.2, nor [19, Proposition 6.20]. As discussed in [49], when using this widening operator, the approximation relation α(c_i) ≼_A a_i on the intermediate concrete and abstract iterates does not generally hold: the analysis can only ensure the correct over-approximation of the concrete fixpoint.


4.6 Conclusion

We have shown that widely adopted open-source abstract domain libraries and AI engines assume slightly different specifications for widening operators, possibly leading to AI tool crashes or more subtle, hidden safety issues. This problem can be solved by systematically adopting a fixed and hence uniform widening specification. Based on our investigation, the best option is to stick to the classical widening specification of [16, 17], as it enjoys the following properties:

• it is the default one taught in Abstract Interpretation courses and tutorials;
• it avoids all safety issues;
• it can be more precise than the alternative one (for non-lattice domains);
• its implementations can be as efficient as (or even more efficient than) those based on the alternative one.

Acknowledgements The authors would like to express their gratitude to the developers and maintainers of open-source abstract domain libraries and static analysis tools. Special thanks are also due to Gianluca Amato, David Monniaux, Jorge A. Navas, Francesca Scozzari, Gagandeep Singh and Caterina Urban for their helpful feedback on the subject of this paper.

References

1. Amato, G., Di Nardo Di Maio, S., Scozzari, F.: Numerical static analysis with Soot. In: P. Lam, E. Sherman (eds.) Proceedings of the 2nd ACM SIGPLAN International Workshop on State Of the Art in Java Program analysis, SOAP 2013, Seattle, WA, USA, June 20, 2013, pp. 25–30. ACM (2013). https://doi.org/10.1145/2487568.2487571
2. Amato, G., Rubino, M., Scozzari, F.: Inferring linear invariants with parallelotopes. Sci. Comput. Program. 148, 161–188 (2017). https://doi.org/10.1016/j.scico.2017.05.011
3. Amato, G., Scozzari, F.: The abstract domain of parallelotopes. Electron. Notes Theor. Comput. Sci. 287, 17–28 (2012). https://doi.org/10.1016/j.entcs.2012.09.003
4. Amato, G., Scozzari, F., Seidl, H., Apinis, K., Vojdani, V.: Efficiently intertwining widening and narrowing. Sci. Comput. Program. 120, 1–24 (2016). https://doi.org/10.1016/j.scico.2015.12.005
5. Arceri, V., Mastroeni, I.: Analyzing dynamic code: A sound abstract interpreter for Evil eval. ACM Trans. Priv. Secur. 24(2), 10:1–10:38 (2021). https://doi.org/10.1145/3426470
6. Bagnara, R., Hill, P., Ricci, E., Zaffanella, E.: Precise widening operators for convex polyhedra. Sci. Comput. Program. 58(1–2), 28–56 (2005). https://doi.org/10.1016/j.scico.2005.02.003
7. Bagnara, R., Hill, P.M., Zaffanella, E.: The Parma Polyhedra Library: Toward a complete set of numerical abstractions for the analysis and verification of hardware and software systems. Sci. Comput. Program. 72(1–2), 3–21 (2008). https://doi.org/10.1016/j.scico.2007.08.001
8. Becchi, A., Zaffanella, E.: PPLite: zero-overhead encoding of NNC polyhedra. Inf. Comput. 275, 104620 (2020). https://doi.org/10.1016/j.ic.2020.104620
9. Beyer, D., Keremoglu, M.E.: CPAchecker: A tool for configurable software verification. In: G. Gopalakrishnan, S. Qadeer (eds.) Computer Aided Verification - 23rd International Conference, CAV 2011, Snowbird, UT, USA, July 14-20, 2011. Proceedings, Lecture Notes in Computer Science, vol. 6806, pp. 184–190. Springer (2011). https://doi.org/10.1007/978-3-642-22110-1_16


10. Blanchet, B., Cousot, P., Cousot, R., Feret, J., Mauborgne, L., Miné, A., Monniaux, D., Rival, X.: A static analyzer for large safety-critical software. In: R. Cytron, R. Gupta (eds.) Proceedings of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation 2003, San Diego, California, USA, June 9-11, 2003, pp. 196–207. ACM (2003). https://doi.org/10.1145/781131.781153
11. Boulmé, S., Maréchal, A., Monniaux, D., Périn, M., Yu, H.: The Verified Polyhedron Library: an overview. In: 20th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, SYNASC 2018, Timisoara, Romania, September 20-23, 2018, pp. 9–17. IEEE (2018). https://doi.org/10.1109/SYNASC.2018.00014
12. Brat, G., Navas, J., Shi, N., Venet, A.: IKOS: A framework for static analysis based on abstract interpretation. In: D. Giannakopoulou, G. Salaün (eds.) Software Engineering and Formal Methods - 12th International Conference, SEFM 2014, Grenoble, France, September 1-5, 2014. Proceedings, Lecture Notes in Computer Science, vol. 8702, pp. 271–277. Springer (2014). https://doi.org/10.1007/978-3-319-10431-7_20
13. Cortesi, A., Zanioli, M.: Widening and narrowing operators for abstract interpretation. Comput. Lang. Syst. Struct. 37(1), 24–42 (2011). https://doi.org/10.1016/j.cl.2010.09.001
14. Cousot, P.: Semantic foundations of program analysis. In: S.S. Muchnick, N.D. Jones (eds.) Program Flow Analysis: Theory and Applications, chap. 10, pp. 303–342. Prentice Hall, Englewood Cliffs, NJ, USA (1981)
15. Cousot, P.: Abstracting induction by extrapolation and interpolation. In: D. D'Souza, A. Lal, K.G. Larsen (eds.) Verification, Model Checking, and Abstract Interpretation - 16th International Conference, VMCAI 2015, Mumbai, India, January 12-14, 2015. Proceedings, Lecture Notes in Computer Science, vol. 8931, pp. 19–42. Springer (2015). https://doi.org/10.1007/978-3-662-46081-8_2
16. Cousot, P., Cousot, R.: Static determination of dynamic properties of programs. In: Proceedings of the Second International Symposium on Programming, pp. 106–130. Dunod, Paris, France (1976)
17. Cousot, P., Cousot, R.: Abstract interpretation: A unified lattice model for static analysis of programs by construction or approximation of fixpoints. In: Conference Record of the Fourth ACM Symposium on Principles of Programming Languages, Los Angeles, California, USA, January 1977, pp. 238–252 (1977)
18. Cousot, P., Cousot, R.: Systematic design of program analysis frameworks. In: Conference Record of the Sixth Annual ACM Symposium on Principles of Programming Languages, San Antonio, Texas, USA, January 1979, pp. 269–282 (1979)
19. Cousot, P., Cousot, R.: Abstract interpretation frameworks. J. Log. Comput. 2(4), 511–547 (1992). https://doi.org/10.1093/logcom/2.4.511
20. Cousot, P., Cousot, R.: Comparing the Galois connection and widening/narrowing approaches to abstract interpretation. In: M. Bruynooghe, M. Wirsing (eds.) Programming Language Implementation and Logic Programming, 4th International Symposium, PLILP'92, Leuven, Belgium, August 26-28, 1992, Proceedings, Lecture Notes in Computer Science, vol. 631, pp. 269–295. Springer (1992). https://doi.org/10.1007/3-540-55844-6_142
21. Cousot, P., Halbwachs, N.: Automatic discovery of linear restraints among variables of a program. In: A. Aho, S. Zilles, T. Szymanski (eds.) Conference Record of the Fifth Annual ACM Symposium on Principles of Programming Languages, Tucson, Arizona, USA, January 1978, pp. 84–96. ACM Press (1978). https://doi.org/10.1145/512760.512770
22. D'Silva, V.: Widening for automata. Diploma thesis, Institut für Informatik, Universität Zürich, Switzerland (2006)
23. ETH Zurich SRI Lab: ELINA: ETH Library for Numerical Analysis. http://elina.ethz.ch
24. Ferrara, P., Negrini, L., Arceri, V., Cortesi, A.: Static analysis for dummies: experiencing LiSA. In: L. Nguyen Quang Do, C. Urban (eds.) SOAP@PLDI 2021: Proceedings of the 10th ACM SIGPLAN International Workshop on the State Of the Art in Program Analysis, Virtual Event, Canada, 22 June, 2021, pp. 1–6. ACM (2021). https://doi.org/10.1145/3460946.3464316
25. Fouilhé, A.: Revisiting the abstract domain of polyhedra: constraints-only representation and formal proof (Le domaine abstrait des polyèdres revisité: représentation par contraintes et preuve formelle). Ph.D. thesis, Grenoble Alpes University, France (2015)


26. Gange, G., Navas, J.A., Schachte, P., Søndergaard, H., Stuckey, P.J.: Abstract interpretation over non-lattice abstract domains. In: F. Logozzo, M. Fähndrich (eds.) Static Analysis - 20th International Symposium, SAS 2013, Seattle, WA, USA, June 20-22, 2013. Proceedings, Lecture Notes in Computer Science, vol. 7935, pp. 6–24. Springer (2013). https://doi.org/10.1007/978-3-642-38856-9_3
27. Gopan, D., Reps, T.: Lookahead widening. In: T. Ball, R. Jones (eds.) Computer Aided Verification, 18th International Conference, CAV 2006, Seattle, WA, USA, August 17-20, 2006, Proceedings, Lecture Notes in Computer Science, vol. 4144, pp. 452–466. Springer (2006). https://doi.org/10.1007/11817963_41
28. Gurfinkel, A., Chaki, S.: Boxes: A symbolic abstract domain of boxes. In: R. Cousot, M. Martel (eds.) Static Analysis - 17th International Symposium, SAS 2010, Perpignan, France, September 14-16, 2010. Proceedings, Lecture Notes in Computer Science, vol. 6337, pp. 287–303. Springer (2010). https://doi.org/10.1007/978-3-642-15769-1_18
29. Gurfinkel, A., Navas, J.A.: Abstract interpretation of LLVM with a region-based memory model. In: R. Bloem, R. Dimitrova, C. Fan, N. Sharygina (eds.) Software Verification - 13th International Conference, VSTTE 2021, New Haven, CT, USA, October 18-19, 2021, and 14th International Workshop, NSV 2021, Los Angeles, CA, USA, July 18-19, 2021, Revised Selected Papers, Lecture Notes in Computer Science, vol. 13124, pp. 122–144. Springer (2021). https://doi.org/10.1007/978-3-030-95561-8_8
30. Halbwachs, N.: Détermination automatique de relations linéaires vérifiées par les variables d'un programme. Ph.D. thesis, Grenoble Institute of Technology, France (1979)
31. Halbwachs, N., Henry, J.: When the decreasing sequence fails. In: A. Miné, D. Schmidt (eds.) Static Analysis - 19th International Symposium, SAS 2012, Deauville, France, September 11-13, 2012. Proceedings, Lecture Notes in Computer Science, vol. 7460, pp. 198–213. Springer (2012). https://doi.org/10.1007/978-3-642-33125-1_15
32. Halbwachs, N., Proy, Y., Raymond, P.: Verification of linear hybrid systems by means of convex approximations. In: B.L. Charlier (ed.) Static Analysis, First International Static Analysis Symposium, SAS'94, Namur, Belgium, September 28-30, 1994, Proceedings, Lecture Notes in Computer Science, vol. 864, pp. 223–237. Springer (1994). https://doi.org/10.1007/3-540-58485-4_43
33. He, J., Singh, G., Püschel, M., Vechev, M.T.: Learning fast and precise numerical analysis. In: A.F. Donaldson, E. Torlak (eds.) Proceedings of the 41st ACM SIGPLAN International Conference on Programming Language Design and Implementation, PLDI 2020, London, UK, June 15-20, 2020, pp. 1112–1127. ACM (2020). https://doi.org/10.1145/3385412.3386016
34. Henry, J., Monniaux, D., Moy, M.: PAGAI: A path sensitive static analyser. Electron. Notes Theor. Comput. Sci. 289, 15–25 (2012). https://doi.org/10.1016/j.entcs.2012.11.003
35. Jeannet, B.: Some experience on the software engineering of abstract interpretation tools. Electron. Notes Theor. Comput. Sci. 267(2), 29–42 (2010). https://doi.org/10.1016/j.entcs.2010.09.016
36. Jeannet, B., Argoud, M., Lalire, G.: The INTERPROC interprocedural analyzer. http://pop-art.inrialpes.fr/interproc/interprocweb.cgi
37. Jeannet, B., Miné, A.: Apron: A library of numerical abstract domains for static analysis. In: A. Bouajjani, O. Maler (eds.) Computer Aided Verification, 21st International Conference, CAV 2009, Grenoble, France, June 26 - July 2, 2009. Proceedings, Lecture Notes in Computer Science, vol. 5643, pp. 661–667. Springer (2009). https://doi.org/10.1007/978-3-642-02658-4_52
38. Kirchner, F., Kosmatov, N., Prevosto, V., Signoles, J., Yakobowski, B.: Frama-C: A software analysis perspective. Formal Aspects Comput. 27(3), 573–609 (2015). https://doi.org/10.1007/s00165-014-0326-7
39. Masdupuy, F.: Semantic analysis of interval congruences. In: D. Bjørner, M. Broy, I.V. Pottosin (eds.) Formal Methods in Programming and Their Applications, International Conference, Akademgorodok, Novosibirsk, Russia, June 28 - July 2, 1993, Proceedings, Lecture Notes in Computer Science, vol. 735, pp. 142–155. Springer (1993). https://doi.org/10.1007/BFb0039705

76

E. Zaffanella and V. Arceri

40. Miné, A.: Weakly relational numerical abstract domains. Ph.D. thesis, École Polytechnique, Palaiseau, France (2004) 41. Miné, A.: Tutorial on static inference of numeric invariants by abstract interpretation. Found. Trends Program. Lang. 4(3–4), 120–372 (2017). https://doi.org/10.1561/2500000034. 42. Monniaux, D.: A minimalistic look at widening operators. High. Order Symb. Comput. 22(2), 145–154 (2009). https://doi.org/10.1007/s10990-009-9046-8. 43. Monniaux, D., Guen, J.L.: Stratified static analysis based on variable dependencies. Electron. Notes Theor. Comput. Sci. 288, 61–74 (2012). https://doi.org/10.1016/j.entcs.2012.10.008. 44. Olivieri, L., Tagliaferro, F., Arceri, V., Ruaro, M., Negrini, L., Cortesi, A., Ferrara, P., Spoto, F., Talin, E.: Ensuring determinism in blockchain software with GoLiSA: an industrial experience report. In: L. Gonnord, L. Titolo (eds.) SOAP ’22: 11th ACM SIGPLAN International Workshop on the State Of the Art in Program Analysis, San Diego, CA, USA, 14 June 2022, pp. 23–29. ACM (2022). https://doi.org/10.1145/3520313.3534658. 45. Partush, N., Yahav, E.: Abstract semantic differencing for numerical programs. In: F. Logozzo, M. Fähndrich (eds.) Static Analysis - 20th International Symposium, SAS 2013, Seattle, WA, USA, June 20-22, 2013. Proceedings, Lecture Notes in Computer Science, vol. 7935, pp. 238– 258. Springer (2013). https://doi.org/10.1007/978-3-642-38856-9_14. 46. Sankaranarayanan, S., Colón, M., Sipma, H.B., Manna, Z.: Efficient strongly relational polyhedral analysis. In: E.A. Emerson, K.S. Namjoshi (eds.) Verification, Model Checking, and Abstract Interpretation, 7th International Conference, VMCAI 2006, Charleston, SC, USA, January 8-10, 2006, Proceedings, Lecture Notes in Computer Science, vol. 3855, pp. 111–125. Springer (2006). https://doi.org/10.1007/11609773_8. 47. Singh, G., Püschel, M., Vechev, M.: Fast polyhedra abstract domain. In: G. Castagna, A. Gordon (eds.) Proceedings of the 44th ACM SIGPLAN Symposium on Principles of Programming Languages, POPL 2017, Paris, France, January 18-20, 2017, pp. 46–59. ACM (2017). https:// doi.org/10.1145/3009837.3009885. 48. Singh, G., Püschel, M., Vechev, M.T.: Fast numerical program analysis with reinforcement learning. In: H. Chockler, G. Weissenbacher (eds.) Computer Aided Verification - 30th International Conference, CAV 2018, Held as Part of the Federated Logic Conference, FloC 2018, Oxford, UK, July 14-17, 2018, Proceedings, Part I, Lecture Notes in Computer Science, vol. 10981, pp. 211–229. Springer (2018). https://doi.org/10.1007/978-3-319-96145-3_12. 49. Urban, C., Miné, A.: A decision tree abstract domain for proving conditional termination. In: M. Müller-Olm, H. Seidl (eds.) Static Analysis - 21st International Symposium, SAS 2014, Munich, Germany, September 11-13, 2014. Proceedings, Lecture Notes in Computer Science, vol. 8723, pp. 302–318. Springer (2014). https://doi.org/10.1007/978-3-319-10936-7_19. 50. Vojdani, V., Apinis, K., Rõtov, V., Seidl, H., Vene, V., Vogler, R.: Static race detection for device drivers: the Goblint approach. In: D. Lo, S. Apel, S. Khurshid (eds.) Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, ASE 2016, Singapore, September 3-7, 2016, pp. 391–402. ACM (2016). https://doi.org/10.1145/2970276. 297033.

Chapter 5

Static Analysis for Data Scientists

Caterina Urban

Abstract Big data analytics has revolutionized the world of software development in the past decade. Every day, data scientists develop computer programs to gather, triage, and process data, in order to ultimately help us make data-driven decisions. As we rely more and more on such data-manipulating software, we become increasingly vulnerable to poor choices, wrong assumptions, or other (programming or technical) mistakes made during software development. Mistakes that do not cause software failures can have serious consequences, since they give no indication that something went wrong along the way. In safety-critical applications, such mistakes can be deadly. In this chapter, we will present ongoing work to develop an abstract interpretation-based static analysis framework for data scientists. In particular, we will focus on issues arising from unexpected data and describe the challenges involved in designing and developing a practical static analysis that infers necessary expectations on the data read and manipulated using Jupyter notebooks, an increasingly popular development environment among data scientists.

5.1 Introduction

The advent of big data—the manipulation and analysis of massive quantities of data [15]—has revolutionized the world of software development in the past decade. Every day, data scientists [7] develop software programs to gather, triage, and pre-process data, which is varied, often unstructured, and generally ‘dirty’ (i.e., inaccurate or even incorrect, incomplete, inconsistent, etc.). Data scientists have a mixed background in computer science and IT, and mathematics and statistics, as well as domain-specific knowledge pertaining to the type of data they work with (e.g., finance, biology, medicine, etc.). They are not professional software developers but, nonetheless, they spend most of their time writing software programs.


As we rely more and more on these data-manipulating programs for making decisions even in high stakes and sensitive applications (e.g., finance and medicine, but also hiring [16], credit scoring [10], prison sentencing [2], etc.), we become increasingly vulnerable to poor choices, wrong assumptions, or other mistakes (e.g., programming or technical errors) made during software development. Mistakes that do not cause software failures can have serious consequences, since a plausible result gives no indication that something went wrong along the way. Just to cite a recent case in a medical application: a simple technical mistake made during data processing caused nearly 16 000 cases of COVID-19 between September 25th and October 2nd, 2020, to go unreported from official figures in the UK. As a consequence, Public Health England was unable to send out the relevant contact-tracing alerts [11]. Mistakes in medical applications can be deadly.

Jupyter notebooks are an increasingly popular development environment among data scientists [14]. They offer a read-eval-print loop (REPL) environment in which developers can quickly prototype code while interleaving textual descriptions and data visualizations (e.g., tables, charts, plots, etc.). Code cells in Jupyter notebooks can be (re-)executed in any desired order by the user, regardless of the order in which they are written. For this reason, the behavior of Jupyter notebooks is notoriously hard to predict and reproduce [20]. This makes prototyping and exploratory development the most fragile phase of the data science development pipeline. Uncaught fallacies at this phase can easily transfer to deployed code. It is also a recurrent phase of development, even after deployment, used for designing software customizations and updates to respond to discrete situational needs as new data becomes available.

The literature offers little work that aims at providing guarantees on the correctness of such data-manipulating software. A few static analysis approaches have been proposed to detect accidentally unused input data [19] and data leakages [18]. We focus here on issues arising from unexpected data, i.e., missing data, extra or duplicate data, data with a different format, etc.

5.1.1 Example

Let us consider the Jupyter notebook shown in Fig. 5.1, which implements a simple course gradebook. Quiz grades are read from a CSV file in cell [2]. This yields a dataframe looking, for instance, as follows:

    [2]:    ID   Name Q1 Q2 Q3
    0     2394  Alice  A  A  A
    1     4583    Bob  F  B  B
    2     3956  Carol  F  A  C

In cell [3], letter grades (stored in the dataframe columns starting with the letter ‘Q’) are converted to a 4.0 GPA scale. Next, in cell [4], the average is computed and stored in a column ‘Grade’. Student emails are read from a CSV file in cell [5] and matched with the student grades in cell [6]. Finally, cell [7] retrieves student emails and their grade (to be used to send email notifications). In our example, this yields the following dataframe:

    [7]:               Email  Grade
    0    [email protected]    4.0
    1      [email protected]    2.0
    2    [email protected]    2.0

Fig. 5.1 Jupyter notebook implementing a simple course gradebook
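Since Fig. 5.1 appears here only as an image, the following is a hedged sketch of what its code cells plausibly contain; the file names ‘Grades.csv’ and ‘Emails.csv’ and the join key ‘ID’ are assumptions, while grade2gpa and quiz match the cell [3] shown later in this section:

    import pandas as pd

    # [2] read the quiz grades
    df = pd.read_csv('Grades.csv')

    # [3] convert letter grades (columns starting with 'Q') to a 4.0 GPA scale
    grade2gpa = {'A': 4.0, 'B': 3.0, 'C': 2.0, 'D': 1.0, 'F': 0.0}
    quiz = df.columns.str.startswith('Q')
    df.iloc[:, quiz] = df.iloc[:, quiz].applymap(grade2gpa.get)

    # [4] average the quiz grades into a 'Grade' column
    df['Grade'] = df.iloc[:, quiz].mean(axis=1)

    # [5] read the student emails and [6] match them with the student grades
    es = pd.read_csv('Emails.csv')
    un = df.merge(es, on='ID')

    # [7] retrieve student emails and their grade, for email notifications
    res = un[['Email', 'Grade']]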

This notebook implicitly contains several expectations on the data it reads and manipulates. We show below two possible ways in which violating these expectations produces a wrong but plausible result.

Violation 1: Data with Missing Values. Imagine the ‘Grades.csv’ file contains a missing value because a student was not present on the day of the quiz:

    [2]:    ID   Name   Q1 Q2 Q3
    0     2394  Alice    A  A  A
    1     4583    Bob    F  B  B
    2     3956  Carol  NaN  A  C

The gradebook in Fig. 5.1 does not take this case into account, i.e., it expects that no quiz grades are missing, and thus only computes the average over the quiz grades that are not missing:

    [7]:               Email  Grade
    0    [email protected]    4.0
    1      [email protected]    2.0
    2    [email protected]    3.0


Instead, it should probably replace missing values with 0.0 before computing the course grade in cell [4]:

    [4]: df.iloc[:, quiz] = df.iloc[:, quiz].fillna(0.0)
         df['Grade'] = df.iloc[:, quiz].mean(axis=1)

Violation 2: Data with a Different Format. As another example, let us imagine that ‘Grades.csv’ also contains grades in a slightly different format:

    [2]:    ID   Name Q1  Q2 Q3
    0     2394  Alice  A   A  A
    1     4583    Bob  F  B+  B
    2     3956  Carol  F   A  C

The gradebook in Fig. 5.1 does not take this case into account either, i.e., it assumes that letter grades can only be ‘A’, ‘B’, ‘C’, ‘D’, or ‘F’. Thus, the ‘B+’ grade is treated as a missing value:

    [7]:               Email  Grade
    0    [email protected]    4.0
    1      [email protected]    1.5
    2    [email protected]    2.0

Instead, a possible solution is to simplify grades by removing any ‘+’ or ‘−’ symbol before converting letter grades:

    [3]: grade2gpa = {'A': 4.0, 'B': 3.0, 'C': 2.0, 'D': 1.0, 'F': 0.0}
         quiz = df.columns.str.startswith('Q')
         simplify = lambda x: x.strip('+-')
         df.iloc[:, quiz] = df.iloc[:, quiz].applymap(simplify)
         df.iloc[:, quiz] = df.iloc[:, quiz].applymap(grade2gpa.get)

In both of these cases, the Jupyter notebook runs just fine, without raising any error. There is no indication that something went wrong along the way due to a mismatch between the data expectations implicit in the code and the actual data.

5.1.2 Data Expectation Static Analyses

The most widespread (and in many cases the only) method for ensuring software correctness is testing. However, thoroughly testing for data expectations is hard as it requires the test developer to be aware of them in the first place. Moreover, incentives for testing are low when Jupyter notebooks are disregarded as single-use non-critical code written as a means to an end.


In this chapter, we advocate for the need for lightweight and practical static analyses to automatically infer data expectations implicit in data-manipulation code in Jupyter notebooks. These static analyses should be directly usable by data scientists, without requiring any background in static analysis. Moreover, these analyses should be interactive to assist data scientists at all times while they develop and interact with their Jupyter notebooks.

We rely on the well-established framework of abstract interpretation [4] to (a) define a concrete semantics specifically tailored to indirectly reason about input data rather than only about program variables (cf. Sect. 5.2); (b) design abstract domains, i.e., abstractions and algorithms to manipulate them, to correctly over-approximate the concrete semantics in a computable way (cf. Sect. 5.3); (c) guide the practical implementation of these abstract semantics into usable static analyses for data scientists (cf. Sect. 5.4). We will not present a fully fledged solution but sketch ongoing work and focus the discussion on the challenges and opportunities that each of these steps brings along.

5.2 Input Data-Aware Concrete Semantics

5.2.1 Input Data

We consider tabular data stored, e.g., in CSV files. Let S be a set of string values. Furthermore, let S_num ⊆ S be the set of string values that can be interpreted as numerical values. We formalize a data file value or dataframe as a possibly empty (r × c)-matrix of string values, where r ∈ ℕ and c ∈ ℕ denote the number of matrix rows and columns, respectively. We write ε to denote an empty dataframe. We assume that non-empty CSV files always have a header, so the first row of a non-empty dataframe contains the labels of the dataframe columns. Let

$$\mathbb{D} \stackrel{\text{def}}{=} \bigcup_{r \in \mathbb{N}} \bigcup_{c \in \mathbb{N}} \mathbb{S}^{r \times c} \qquad (5.1)$$

be the set of all possible data file values. Let D ∈ D \ {ε} be a non-empty dataframe. In the following, we write hdr(D) for the set of labels of the columns of D. Given a set of labels C ⊆ S, we write D[C] for the (sub)dataframe only containing the columns of D with labels in C. When C is a singleton {c}, with c ∈ S, we simplify our notation and write D[c] instead of D[{c}]. Given a dataframe with a single column V, we write D[c] ∼ V, where ∼ ∈ {<, ≤, =, ≠, >, ≥}, for the (sub)dataframe only containing the rows of D that, in the column with label c, satisfy the value comparison with the corresponding rows of V. When all rows in V contain the same value v, we simplify our notation and write D[c] ∼ v instead of D[c] ∼ V. We write D|V for the dataframe resulting from adding column V to D. Finally, given two (non-empty) dataframes D₁ and D₂ with one or more matching columns, we write D₁ ⋈ D₂ for the dataframe resulting from (inner) joining D₁ and D₂.
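To make these definitions concrete, here is a minimal executable sketch; the encoding of a dataframe as a list of rows whose first row is the header is my own, not the chapter’s:

    def hdr(D):
        # The set of column labels of a non-empty dataframe.
        return set(D[0])

    def select_cols(D, C):
        # D[C]: keep only the columns with labels in C.
        idx = [i for i, c in enumerate(D[0]) if c in C]
        return [[row[i] for i in idx] for row in D]

    def inner_join(D1, D2):
        # D1 ⋈ D2: inner join on all matching column labels.
        common = sorted(hdr(D1) & hdr(D2))
        i1 = [D1[0].index(c) for c in common]
        i2 = [D2[0].index(c) for c in common]
        extra = [j for j, c in enumerate(D2[0]) if c not in common]
        out = [D1[0] + [D2[0][j] for j in extra]]
        for r1 in D1[1:]:
            for r2 in D2[1:]:
                if [r1[i] for i in i1] == [r2[i] for i in i2]:
                    out.append(r1 + [r2[j] for j in extra])
        return out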

Fig. 5.2 Syntax of a toy language for data(frame) manipulation

5.2.2 Dataframe-Manipulating Language

We consider a toy programming language for data(frame) manipulation, which we use for illustration throughout the rest of the chapter. Let X be a finite set of dataframe variables. The syntax of the language is defined inductively in Fig. 5.2. A program P consists of an instruction S followed by a unique label ℓ ∈ L. Another unique label appears before each instruction. In the following, given an instruction S, we write lbl(S) to denote the label of S.

The language contains instructions for reading an input CSV file (X := input()), and for selecting parts of a dataframe: the instruction X₁ := X₂[C] only keeps dataframe columns with labels in a set C ⊆ S, while the instruction X₁ := X₂[c] ∼ E only keeps dataframe rows with values (in the column with label c) that satisfy the condition. Conditions are value comparisons with NaN, with a given string value s ∈ S, or with values in other dataframe columns (X[c]) or resulting from operations on them (A₁ ⋄ A₂). They can also be conditional themselves (A₁ if B else A₂). Finally, the language contains instructions for replacing all values in a dataframe column (X[c] := E), for combining two dataframes by (inner) join (X₁ := X₂ ⋈ X₃), and for concatenating two instructions together (S₁; S₂). The language can be trivially extended to consider other ways to combine dataframes (e.g., concatenations, and left, right, or outer joins) as well as more complex instructions mimicking dataframe-manipulating operations of popular data science libraries. We also intentionally omitted the possibility of aliasing between dataframes, to keep things as simple as possible.

Gradebook Example

Here is the gradebook example in Fig. 5.1 written in our toy language (simplified by assuming that the input CSV file only contains two quiz grades):

    ℓ1  df := input()
    ℓ2  df[Q1] := 4.0 if df[Q1] == A else (3.0 if df[Q1] == B else (2.0 if df[Q1] == C else (1.0 if df[Q1] == D else (0.0 if df[Q1] == F else NaN))))
    ℓ3  df[Q2] := 4.0 if df[Q2] == A else (3.0 if df[Q2] == B else (2.0 if df[Q2] == C else (1.0 if df[Q2] == D else (0.0 if df[Q2] == F else NaN))))
    ℓ4  df[Grade] := (df[Q1] + df[Q2]) ÷ 2
    ℓ5  es := input()
    ℓ6  un := df ⋈ es
    ℓ7  res := un[{Email, Grade}]
    ℓ8
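A hedged sketch of this toy language’s abstract syntax as Python dataclasses; the constructor and field names are mine, mirroring Fig. 5.2:

    from dataclasses import dataclass
    from typing import FrozenSet, Tuple, Union

    # Expressions (NaN, string constants, column accesses, binary operations,
    # conditionals) are encoded as nested tuples for brevity.
    Expr = Tuple

    @dataclass(frozen=True)
    class Input:          # X := input()
        var: str

    @dataclass(frozen=True)
    class SelectCols:     # X1 := X2[C]
        lhs: str
        rhs: str
        cols: FrozenSet[str]

    @dataclass(frozen=True)
    class FilterRows:     # X1 := X2[c] ~ E
        lhs: str
        rhs: str
        col: str
        op: str
        expr: Expr

    @dataclass(frozen=True)
    class AssignCol:      # X[c] := E
        var: str
        col: str
        expr: Expr

    @dataclass(frozen=True)
    class Join:           # X1 := X2 ⋈ X3
        lhs: str
        left: str
        right: str

    Stmt = Union[Input, SelectCols, FilterRows, AssignCol, Join]
    Program = Tuple[Stmt, ...]  # S1; ...; Sn, with implicit labels 1..n+1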

5.2.3 Input-Aware Semantics

We can now define the concrete semantics of data(frame)-manipulating programs.


! Challenge

This semantics differs from the usual concrete semantics in that it must be input data-aware, that is, it must perform a step of indirection to explicitly reason about data files read by programs, in addition to reasoning about program variables.


An environment ρ : X → D maps each dataframe variable X ∈ X to its value ρ(X) ∈ D. Let E denote the set of all environments. In addition, let δ : X → P(L) map dataframe variables to their data source, that is, the set of labels where the dataframe value of the variable originates from. The data source of a dataframe variable can be a single label ℓ ∈ L, when the dataframe value of the variable originates from a dataframe read by the instruction with label ℓ, or a set of labels, when the dataframe value originates from joining dataframes read at different instruction labels. Let Δ be the set of all possible data source maps. Finally, let φ : L → P(D) map each instruction label to a possible data file value read at that label, and let Φ be the set of all possible such maps. Environments keep track of dataframe variables, while data source and possible file value maps are what is needed to explicitly keep track of and reason about read data files.

The semantics of an expression E is a function ⟦E⟧ : E → D mapping an environment to the dataframe (column) value of the expression in the given environment:

$$\begin{aligned}
\llbracket \texttt{NaN} \rrbracket\rho &\stackrel{\text{def}}{=} \texttt{NaN} \\
\llbracket s \rrbracket\rho &\stackrel{\text{def}}{=} s \\
\llbracket X[c] \rrbracket\rho &\stackrel{\text{def}}{=} \rho(X)[c] \\
\llbracket A_1 \diamond A_2 \rrbracket\rho &\stackrel{\text{def}}{=} \llbracket A_1 \rrbracket\rho \diamond \llbracket A_2 \rrbracket\rho \\
\llbracket A_1 \sim A_2 \rrbracket\rho &\stackrel{\text{def}}{=} \llbracket A_1 \rrbracket\rho \sim \llbracket A_2 \rrbracket\rho \\
\llbracket B_1 \lor B_2 \rrbracket\rho &\stackrel{\text{def}}{=} \llbracket B_1 \rrbracket\rho \lor \llbracket B_2 \rrbracket\rho \\
\llbracket B_1 \land B_2 \rrbracket\rho &\stackrel{\text{def}}{=} \llbracket B_1 \rrbracket\rho \land \llbracket B_2 \rrbracket\rho \\
\llbracket A_1 \text{ if } B \text{ else } A_2 \rrbracket\rho &\stackrel{\text{def}}{=} \begin{cases} \llbracket A_1 \rrbracket\rho & \llbracket B \rrbracket\rho \\ \llbracket A_2 \rrbracket\rho & \text{otherwise} \end{cases}
\end{aligned}$$

A single value NaN or s ∈ S represents a dataframe column in which all rows contain that same value. All operations between dataframe columns (arithmetic, comparisons, boolean, etc.) are performed independently for each row.

The semantics of programs ⟦P⟧ : L → P(E × Δ × Φ) maps each instruction label to the set of all triples of possible environments, possible data sources, and possible data file values read up to the point when the program execution is at that label. We define this semantics forwards, starting from the first instruction label, where all environments in E are possible but no data files have yet been read:

$$\llbracket P \rrbracket = \llbracket S^{\,\ell} \rrbracket \stackrel{\text{def}}{=} \llbracket S \rrbracket \left( \lambda p.\ \begin{cases} \mathcal{E} \times \dot{\varnothing} \times \dot{\varnothing} & p = \mathrm{lbl}(S) \\ \text{undefined} & \text{otherwise} \end{cases} \right)$$
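A minimal executable sketch of this expression semantics, using pandas Series as dataframe columns (the tuple encoding of expressions is my own assumption):

    import pandas as pd

    def eval_expr(expr, env, nrows):
        # Evaluate an expression to a dataframe column (a pandas Series);
        # constants denote columns where every row holds the same value, and
        # all operations are applied row-wise, mirroring the definitions above.
        kind = expr[0]
        if kind == 'const':                     # NaN or a string value s
            return pd.Series([expr[1]] * nrows)
        if kind == 'col':                       # X[c]
            return env[expr[1]][expr[2]]
        if kind == 'binop':                     # A1 <op> A2, row-wise
            op, a1, a2 = expr[1], expr[2], expr[3]
            return op(eval_expr(a1, env, nrows), eval_expr(a2, env, nrows))
        if kind == 'ite':                       # A1 if B else A2, row-wise
            b = eval_expr(expr[1], env, nrows)
            return eval_expr(expr[2], env, nrows).where(b, eval_expr(expr[3], env, nrows))
        raise ValueError(kind)

    env = {'df': pd.DataFrame({'Q1': ['A', 'F', 'F']})}
    eq = lambda u, v: u == v
    gpa = ('ite', ('binop', eq, ('col', 'df', 'Q1'), ('const', 'A')),
           ('const', 4.0), ('const', 0.0))
    print(eval_expr(gpa, env, 3).tolist())      # [4.0, 0.0, 0.0]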


Fig. 5.3 Input-aware concrete semantics of instructions

In Fig. 5.3, we define the semantics ⟦S⟧ : (L → P(E × Δ × Φ)) → (L → P(E × Δ × Φ)) of each instruction pointwise within P(E × Δ × Φ): each function takes as input a set W of triples of environments, data sources, and data file values, and outputs the triples of possible environments, possible data sources, and possible data file values read up to the point when the program has executed S. Note that, when the instruction reads a CSV file (X := input()), all data file values are possible after executing the instruction. Instead, instructions that select part of a dataframe (e.g., X₁ := X₂[C]) impose expectations on dataframe values (e.g., C ⊆ hdr(ρ(X₂))) and thus restrict the set of possible data file values after executing the instruction. The (backward) refinement function ←⟦E⟧ : P(E × Δ × Φ) → P(E × Δ × Φ) refines a set W based on expression E:


$$\begin{aligned}
\overleftarrow{\llbracket \texttt{NaN} \rrbracket}W &\stackrel{\text{def}}{=} \overleftarrow{\llbracket s \rrbracket}W \stackrel{\text{def}}{=} W \\
\overleftarrow{\llbracket X[c] \rrbracket}W &\stackrel{\text{def}}{=} \left\{ (\rho, \delta, \phi) \;\middle|\; \begin{array}{l} (\rho, \delta, \phi) \in W, \\ c \in \mathrm{hdr}(\rho(X)), \\ c \in H \cup (\mathrm{hdr}(\rho(X)) \setminus H) \end{array} \right\} \quad \text{where } H \stackrel{\text{def}}{=} \bigcup_{\ell \in \delta(X)} \mathrm{hdr}(\phi(\ell)) \\
\overleftarrow{\llbracket E_1 \diamond E_2 \rrbracket}W &\stackrel{\text{def}}{=} \overleftarrow{\llbracket E_2 \rrbracket} \circ \overleftarrow{\llbracket E_1 \rrbracket}W \qquad \diamond \in \{\diamond, \sim, \lor, \land\} \\
\overleftarrow{\llbracket A_1 \text{ if } B \text{ else } A_2 \rrbracket}W &\stackrel{\text{def}}{=} \overleftarrow{\llbracket A_1 \rrbracket} \circ \overleftarrow{\llbracket B \rrbracket}W \;\cap\; \overleftarrow{\llbracket A_2 \rrbracket} \circ \overleftarrow{\llbracket B \rrbracket}W
\end{aligned}$$

Thus, ⟦S⟧(ℓ) characterizes all possible (expected) data files read by program S.

Gradebook Example (Continued)

The concrete semantics of our toy gradebook example is the following:

$$\begin{aligned}
\ell_1 &\mapsto \mathcal{E} \times \dot{\varnothing} \times \dot{\varnothing} \\
\ell_2 &\mapsto \{ (\rho[\mathrm{df} \mapsto D],\ \delta[\mathrm{df} \mapsto \{\ell_1\}],\ \phi[\ell_1 \mapsto D]) \mid (\rho, \delta, \phi) \in \llbracket \ell_1 \rrbracket,\ D \in \mathbb{D} \} \\
\ell_3 &\mapsto \{ (\rho[\mathrm{df} \mapsto \rho(\mathrm{df})[\mathrm{Q1} \mapsto \llbracket 4.0 \text{ if } \ldots \text{ else } \texttt{NaN} \rrbracket\rho]],\ \delta,\ \phi) \mid (\rho, \delta, \phi) \in \llbracket \ell_2 \rrbracket,\ \mathrm{Q1} \in \mathrm{hdr}(\rho(\mathrm{df})),\ \mathrm{Q1} \in \mathrm{hdr}(\phi(\ell_1)) \} \\
\ell_4 &\mapsto \{ (\rho[\mathrm{df} \mapsto \rho(\mathrm{df})[\mathrm{Q2} \mapsto \llbracket 4.0 \text{ if } \ldots \text{ else } \texttt{NaN} \rrbracket\rho]],\ \delta,\ \phi) \mid (\rho, \delta, \phi) \in \llbracket \ell_3 \rrbracket,\ \mathrm{Q2} \in \mathrm{hdr}(\rho(\mathrm{df})),\ \mathrm{Q2} \in \mathrm{hdr}(\phi(\ell_1)) \} \\
\ell_5 &\mapsto \{ (\rho[\mathrm{df} \mapsto \rho(\mathrm{df})\,|\,\llbracket (\mathrm{df}[\mathrm{Q1}] + \mathrm{df}[\mathrm{Q2}]) \div 2 \rrbracket\rho \text{ as } \mathrm{Grade}],\ \delta,\ \phi) \mid (\rho, \delta, \phi) \in \llbracket \ell_4 \rrbracket \} \\
\ell_6 &\mapsto \{ (\rho[\mathrm{es} \mapsto D],\ \delta[\mathrm{es} \mapsto \{\ell_5\}],\ \phi[\ell_5 \mapsto D]) \mid (\rho, \delta, \phi) \in \llbracket \ell_5 \rrbracket,\ D \in \mathbb{D} \} \\
\ell_7 &\mapsto \{ (\rho[\mathrm{un} \mapsto \rho(\mathrm{df}) \bowtie \rho(\mathrm{es})],\ \delta[\mathrm{un} \mapsto \{\ell_1, \ell_5\}],\ \phi) \mid (\rho, \delta, \phi) \in \llbracket \ell_6 \rrbracket \} \\
\ell_8 &\mapsto \{ (\rho[\mathrm{res} \mapsto \rho(\mathrm{un})[\{\mathrm{Email}, \mathrm{Grade}\}]],\ \delta,\ \phi) \mid (\rho, \delta, \phi) \in \llbracket \ell_7 \rrbracket,\ \{\mathrm{Email}, \mathrm{Grade}\} \subseteq \mathrm{hdr}(\rho(\mathrm{un})),\ \mathrm{Email} \in \mathrm{hdr}(\phi(\ell_1)) \cup \mathrm{hdr}(\phi(\ell_5)) \}
\end{aligned}$$

Note that, for simplicity, we only considered the case in which the column ‘Grade’ was not already present in the data file read at instruction label ℓ₁.

5.3 Expectations Abstract Domains

We now design a decidable abstraction of ⟦P⟧ which over-approximates the concrete semantics of P at each instruction label ℓ ∈ L. As a consequence, this abstraction yields necessary expectations on dataframe values for a program to execute successfully and correctly. In particular, if a data file value is not in the abstraction, the program will definitely eventually run into an error or compute a wrong result if it tries to read data from it. On the other hand, if a data file value is in the abstraction, there is no guarantee that the program will execute successfully and correctly when reading data from it. This choice is intentional, so as to provide immediately actionable results to data scientists rather than overwhelm them with possible false negatives: a mismatch between a data file value and (the abstraction of) a program indicates something that must be corrected (either in the program or the data file).

The abstraction ⟦P⟧♮ : L → W associates to each instruction label ℓ ∈ L an element W♮ ∈ W of an abstract domain W, which over-approximates the possible environments, possible data sources, and possible data file values read up to the point when the program execution has reached the instruction with label ℓ.


! Challenge

The main challenge in designing such an abstraction is that it must reason about multi-dimensional data structures such as dataframes, rather than simpler values.

5.3.1 Column Expectations Abstract Domain

As a simple example, we sketch an abstraction that infers expectations about column labels that a data file value must have. The elements of the column expectations abstract domain C belong to a set of triples. The first element of each of these triples is an abstract environment ρ♮ : X → P(S) × P(S) mapping variables to the sets of column labels that its dataframe value may not have and must have, respectively. Notably, we will use the ‘may not’ column labels set to track columns that were potentially added by a program and thus were not necessarily present in the original data source. Given an abstract environment ρ♮ ∈ E♮ and a variable X ∈ X, we write ρ♮ₘ(X) for the ‘may not’ set associated with X in ρ♮, and ρ♮_M(X) for the ‘must’ set. The second element of these triples is an abstract data source map δ♮ : X → P(L) mapping dataframe variables to their potential data sources. Finally, the third element is an abstract data file value map φ♮ : L → P(S) mapping each instruction label to the set of columns that the data file read at that label must have. The concretization function γ_C : C → P(E × Δ × Φ) is defined as follows:

$$\gamma_{\mathbb{C}}((\rho^\natural, \delta^\natural, \phi^\natural)) \stackrel{\text{def}}{=} \left\{ (\rho, \delta, \phi) \in \mathcal{E} \times \Delta \times \Phi \;\middle|\; \begin{array}{l} \forall X \in \mathcal{X} : \rho^\natural_M(X) \subseteq \mathrm{hdr}(\rho(X)), \\ \forall X \in \mathcal{X} : \delta(X) = \delta^\natural(X), \\ \forall \ell \in \mathbb{L} : \phi^\natural(\ell) \subseteq \mathrm{hdr}(\phi(\ell)) \end{array} \right\}$$

To define the abstract semantics of programs ⟦P⟧♮ : L → C, we first define the function ←⟦E⟧♮ : C → C refining an abstract element C ∈ C based on the expression E (and abstracting the concrete refinement function ←⟦E⟧):


$$\begin{aligned}
\overleftarrow{\llbracket \texttt{NaN} \rrbracket}^\natural C &\stackrel{\text{def}}{=} \overleftarrow{\llbracket s \rrbracket}^\natural C \stackrel{\text{def}}{=} C \\
\overleftarrow{\llbracket X[c] \rrbracket}^\natural (\rho^\natural, \delta^\natural, \phi^\natural) &\stackrel{\text{def}}{=} (\rho^\natural[X \mapsto (\rho^\natural_m(X),\ \rho^\natural_M(X) \cup \{c\})],\ \delta^\natural,\ \phi^{\natural\prime}) \\
&\qquad \text{where } \phi^{\natural\prime} = \begin{cases} \phi^\natural[\ell \mapsto \phi^\natural(\ell) \cup (\{c\} \setminus \rho^\natural_m(X))] & \delta^\natural(X) = \{\ell\} \\ \phi^\natural & \text{otherwise} \end{cases} \\
\overleftarrow{\llbracket E_1 \diamond E_2 \rrbracket}^\natural C &\stackrel{\text{def}}{=} \overleftarrow{\llbracket E_2 \rrbracket}^\natural \circ \overleftarrow{\llbracket E_1 \rrbracket}^\natural C \qquad \diamond \in \{\diamond, \sim, \lor, \land\} \\
\overleftarrow{\llbracket A_1 \text{ if } B \text{ else } A_2 \rrbracket}^\natural C &\stackrel{\text{def}}{=} \overleftarrow{\llbracket A_1 \rrbracket}^\natural \circ \overleftarrow{\llbracket B \rrbracket}^\natural C \;\oplus\; \overleftarrow{\llbracket A_2 \rrbracket}^\natural \circ \overleftarrow{\llbracket B \rrbracket}^\natural C
\end{aligned}$$

A column selection expression X[c] adds the column c to the ‘must’ set of column labels for X in the abstract environment; column c is also added to the data source values of X, if this can be traced back to a single data source and column c was not potentially added by the program. If X cannot be traced back to a single data source, there is a potential loss of precision in tracking column labels. In the case of conditional expressions, the abstract triples refined by the two conditional branches are merged together (⊕) taking the intersection of corresponding sets of labels. (Note that ‘may not’ sets in abstract environments and abstract data source maps are never modified by this refining function. The only refinements happen in the ‘must’ sets of labels in abstract environments and abstract data file value maps.)

Thus, the abstract semantics of programs ⟦P⟧♮ : L → C maps each instruction label to an abstract domain element:

$$\llbracket P \rrbracket^\natural = \llbracket S^{\,\ell} \rrbracket^\natural \stackrel{\text{def}}{=} \llbracket S \rrbracket^\natural \left( \lambda p.\ \begin{cases} (\lambda X \in \mathcal{X} : (\varnothing, \varnothing),\ \dot{\varnothing},\ \dot{\varnothing}) & p = \mathrm{lbl}(S) \\ \bot_{\mathbb{C}} & \text{otherwise} \end{cases} \right)$$

where the abstract semantics ⟦S⟧♮ : (L → C) → (L → C) of each instruction S is defined pointwise within C in Fig. 5.4.

Gradebook Example (Continued)

The abstract semantics of our toy gradebook example is the following:

$$\begin{aligned}
\ell_1 &\mapsto (\lambda X \in \mathcal{X} : (\varnothing, \varnothing),\ \dot{\varnothing},\ \dot{\varnothing}) \\
\ell_2 &\mapsto (\rho^\natural[\mathrm{df} \mapsto (\varnothing, \varnothing)],\ \delta^\natural[\mathrm{df} \mapsto \{\ell_1\}],\ \phi^\natural[\ell_1 \mapsto \varnothing]) \quad \text{where } (\rho^\natural, \delta^\natural, \phi^\natural) = \llbracket \ell_1 \rrbracket^\natural \\
\ell_3 &\mapsto (\rho^\natural[\mathrm{df} \mapsto (\varnothing, \{\mathrm{Q1}\})],\ \delta^\natural,\ \phi^\natural[\ell_1 \mapsto \{\mathrm{Q1}\}]) \quad \text{where } (\rho^\natural, \delta^\natural, \phi^\natural) = \llbracket \ell_2 \rrbracket^\natural \\
\ell_4 &\mapsto (\rho^\natural[\mathrm{df} \mapsto (\varnothing, \{\mathrm{Q1}, \mathrm{Q2}\})],\ \delta^\natural,\ \phi^\natural[\ell_1 \mapsto \{\mathrm{Q1}, \mathrm{Q2}\}]) \quad \text{where } (\rho^\natural, \delta^\natural, \phi^\natural) = \llbracket \ell_3 \rrbracket^\natural \\
\ell_5 &\mapsto (\rho^\natural[\mathrm{df} \mapsto (\{\mathrm{Grade}\}, \{\mathrm{Q1}, \mathrm{Q2}, \mathrm{Grade}\})],\ \delta^\natural,\ \phi^\natural) \quad \text{where } (\rho^\natural, \delta^\natural, \phi^\natural) = \llbracket \ell_4 \rrbracket^\natural \\
\ell_6 &\mapsto (\rho^\natural[\mathrm{es} \mapsto (\varnothing, \varnothing)],\ \delta^\natural[\mathrm{es} \mapsto \{\ell_5\}],\ \phi^\natural[\ell_5 \mapsto \varnothing]) \quad \text{where } (\rho^\natural, \delta^\natural, \phi^\natural) = \llbracket \ell_5 \rrbracket^\natural \\
\ell_7 &\mapsto (\rho^\natural[\mathrm{un} \mapsto (\{\mathrm{Grade}\}, \{\mathrm{Q1}, \mathrm{Q2}, \mathrm{Grade}\})],\ \delta^\natural[\mathrm{un} \mapsto \{\ell_1, \ell_5\}],\ \phi^\natural) \quad \text{where } (\rho^\natural, \delta^\natural, \phi^\natural) = \llbracket \ell_6 \rrbracket^\natural \\
\ell_8 &\mapsto (\rho^\natural[\mathrm{res} \mapsto (\{\mathrm{Grade}\}, \{\mathrm{Email}, \mathrm{Grade}\})],\ \delta^\natural[\mathrm{res} \mapsto \{\ell_1, \ell_5\}],\ \phi^\natural) \quad \text{where } (\rho^\natural, \delta^\natural, \phi^\natural) = \llbracket \ell_7 \rrbracket^\natural
\end{aligned}$$

Note the loss of precision in tracking column labels after the dataframe join.
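A minimal sketch of this column expectations domain in Python (the class and method names are my own; ‘may not’ sets track columns possibly added by the program):

    from dataclasses import dataclass, field

    @dataclass
    class ColumnExpectations:
        may_not: dict = field(default_factory=dict)    # variable -> set of columns
        must: dict = field(default_factory=dict)       # variable -> set of columns
        source: dict = field(default_factory=dict)     # variable -> set of read labels
        file_must: dict = field(default_factory=dict)  # read label -> set of columns

        def read_input(self, var, label):
            # X := input() at the given label
            self.may_not[var], self.must[var] = set(), set()
            self.source[var] = {label}
            self.file_must[label] = set()

        def use_column(self, var, col):
            # Refinement for an expression X[c]: c joins the 'must' set, and
            # also the file expectations when X has a single source and c was
            # not potentially added by the program.
            self.must[var] = self.must[var] | {col}
            if len(self.source[var]) == 1 and col not in self.may_not[var]:
                (label,) = self.source[var]
                self.file_must[label] |= {col}

    state = ColumnExpectations()
    state.read_input('df', 'l1')
    state.use_column('df', 'Q1')
    state.use_column('df', 'Q2')
    print(state.file_must)   # {'l1': {'Q1', 'Q2'}}: Grades.csv must have Q1, Q2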


Fig. 5.4 Column expectations abstract semantics of instructions

5.3.2 Other Expectations Abstract Domains

Several other such abstract domains can be defined to track expectations about, e.g., data types or data values. In our toy gradebook example, we could infer that values in columns ‘Q1’ and ‘Q2’ are expected to be strings in {‘A’, ‘B’, ‘C’, ‘D’, ‘F’}. Existing numerical domains [6, 12, 13, etc.], string domains [1, 3, etc.], and abstract domains for data structures [5, 9, etc.] can be more or less easily adapted to work in this setting. By building upon relational abstract domains, one can even track relationships between data columns or values. Of course, the more sophisticated the data expectations one wants to infer, the more complex the abstract domain definition will be.

5.4 Implementation

Our implementation of these data expectation static analyses is ongoing, targeting Jupyter notebooks. We want these to be practically useful and directly usable by data scientists, without requiring them to have any background in static analysis. For the moment, we are developing our static analyses for Jupyter notebooks using Python. In the long term, we want to support other languages used for data science, such as R, as well as the not uncommon practice of using multiple programming languages in the same notebook.


! Challenge

The challenge with such a long-term goal is not only developing static analyses for dynamic languages such as Python or R (with their complex data science libraries), but also their combination, taking into account their underlying practical semantic differences (e.g., array indexing starting at 0 in Python but at 1 in R, missing values automatically ignored in Python but not in R, etc.).

We are also studying combinations of dynamic and static analyses to effectively deal with the most dynamic features of these languages (e.g., eval() expressions in Python). In particular, we are looking into using dynamic executions to guide the static analysis in a principled way, akin to abstract conflict-driven learning [8]. Finally, to maximize usability, we aim to integrate our static analyses into the Jupyter notebook environment, either as extensions or by integrating them into the most used integrated development environments for Jupyter notebooks, such as Visual Studio Code and PyCharm.


! Challenge

The main challenge to develop really useful static analyses for data scientists is to render them interactive [17] to adapt them to the way data scientists write and use their Jupyter notebooks. This also means that the analyses should be sufficiently lightweight to be able to compute results quickly, but at the same time precise enough for the results to remain useful in practice.

5.5 Conclusion

In this chapter, we have argued for the need to develop new static analyses tailored for Jupyter notebooks and directly usable by data scientists without a static analysis background. We have sketched a simple static analysis framework to infer expectations about the data manipulated by a Jupyter notebook and highlighted the challenges that come with making such static analyses a reality in the near future. More generally, ours is a long-term effort to democratize static analysis, apply it to a wider range of software, and render it more accessible to a broader audience. We hope that others will join us in this endeavor.


Acknowledgements Work partly supported by DAIS, Ca’ Foscari University of Venice, within the IRIDE project ‘Static Analysis of Notebook Python’ directed by A. Cortesi.

References

1. V. Arceri, M. Olliaro, A. Cortesi, and P. Ferrara. Relational String Abstract Domains. In VMCAI, pages 20–42, 2022.
2. A. Chouldechova. Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments. Big Data, 5(2):153–163, 2017.
3. G. Costantini, P. Ferrara, and A. Cortesi. A suite of abstract domains for static analysis of string values. Software - Practice and Experience, 45(2):245–287, 2015.
4. P. Cousot and R. Cousot. Abstract Interpretation: A Unified Lattice Model for Static Analysis of Programs by Construction or Approximation of Fixpoints. In POPL, pages 238–252, 1977.
5. P. Cousot, R. Cousot, and F. Logozzo. A parametric segmentation functor for fully automatic and scalable array content analysis. In POPL, pages 105–118, 2011.
6. P. Cousot and N. Halbwachs. Automatic Discovery of Linear Restraints Among Variables of a Program. In POPL, pages 84–96, 1978.
7. T. H. Davenport and D. J. Patil. Data Scientist: The Sexiest Job of the 21st Century. Harvard Business Review, 90(10):70–76, October 2012.
8. V. V. D’Silva, L. Haller, and D. Kroening. Abstract Conflict Driven Learning. In POPL, pages 143–154, 2013.
9. J. Fulara. Generic Abstraction of Dictionaries and Arrays. Electronic Notes in Theoretical Computer Science, 287:53–64, 2012.
10. A. E. Khandani, A. J. Kim, and A. W. Lo. Consumer Credit-Risk Models via Machine-Learning Algorithms. Journal of Banking & Finance, 34(11):2767–2787, 2010.
11. E. Mahase. Covid-19: Only Half of 16 000 Patients Missed from England’s Official Figures Have Been Contacted. BMJ, 371, 2020.
12. A. Miné. Weakly Relational Numerical Abstract Domains. PhD thesis, École Polytechnique, Palaiseau, France, 2004.
13. A. Miné. The octagon abstract domain. Higher-Order and Symbolic Computation, 19(1):31–100, 2006.
14. J. M. Perkel. Why Jupyter is Data Scientists’ Computational Notebook of Choice. Nature, 563(7729):145–146, November 2018.
15. S. Sagiroglu and D. Sinanc. Big Data: A Review. In CTS, pages 42–47, 2013.
16. C. Schumann, J. S. Foster, N. Mattei, and J. P. Dickerson. We Need Fairness and Explainability in Algorithmic Hiring. In AAMAS, pages 1716–1720, 2020.
17. B. Stein, B. E. Chang, and M. Sridharan. Demanded abstract interpretation. In PLDI, pages 282–295, 2021.
18. P. Subotic, U. Bojanic, and M. Stojic. Statically Detecting Data Leakages in Data Science Code. In SOAP, pages 16–22, 2022.
19. C. Urban and P. Müller. An Abstract Interpretation Framework for Input Data Usage. In A. Ahmed, editor, ESOP, pages 683–710, 2018.
20. J. Wang, L. Li, and A. Zeller. Better Code, Better Sharing: On the Need of Analyzing Jupyter Notebooks. In G. Rothermel and D. Bae, editors, ICSE-NIER, pages 53–56, 2020.

Chapter 6

Completeness in Static Analysis by Abstract Interpretation: A Personal Point of View

David Monniaux

Abstract Static analysis by abstract interpretation is generally designed to be “sound”, that is, it should not claim to establish properties that do not hold; in other words, it should not provide “false negatives” about possible bugs. A rarer requirement is that it should be “complete”, meaning that it should be able to infer certain properties if they hold. This paper describes a number of practical issues and questions related to completeness that I have come across over the years.

6.1 Introduction

The concept of completeness, under several definitions, permeates mathematical logic. A proof system is deemed complete with respect to a semantics if it can prove all properties that are true in this semantics. For instance, Gödel’s completeness theorem states that first-order logic (that is, any reasonable proof system for it) can prove any property true in all models. A decision procedure for a logic (that is, a procedure that answers “true” or “false” for any formula in the logic F) is expected to be sound (it does not declare to be true formulas that are not true) and complete (it declares to be true all formulas that are true).

The concepts of completeness (a) for a proof system with respect to a class of properties and (b) for a procedure that searches for proofs of such properties within such a proof system are related, but distinct. Obviously, if a proof system is incomplete, then so is any procedure that searches for proofs within that system. However, it is possible to have a complete proof system and an incomplete procedure for searching within it, for instance for practical efficiency reasons.

In abstract interpretation, completeness, at least at the global level (proving properties of whole programs), is often forgone straight from the start when designing the abstraction (for instance, interval arithmetic is used even though it is obvious that it cannot prove safety in general, even if the property to prove is itself an interval), and there thus may be little reluctance to adding further incompleteness in the solving procedure by using approaches such as widening operators [5, 6]. Yet, it is interesting to control this further incompleteness, both from theoretical and practical points of view. Completeness in the solving method ensures some predictability in the analysis outcome, while the brittleness of some incomplete approaches (the approach succeeds or fails depending on seemingly inconsequential aspects of the program and the property) surprises end users: for instance, providing more precise information on the precondition of a program may prevent the analysis procedure from proving a property that it could prove with a less precise precondition.

In this paper, I discuss various instances of completeness and incompleteness that I have come across, without any pretense of thorough theoretical discussion and, ironically, no pretense of completeness. For more theoretical discussion, Giacobazzi et al. [14, Sect. 7] surveyed the literature extensively. For a study of completeness with respect to the relationship between the semantics and the abstract domain, and constructions to make the domain complete, see [13].

6.2 Completeness of the Abstraction: the Case of LRU Caches

It is well-known that it is equivalent to (a) evaluate an integer multivariate polynomial over integer inputs, then take the final result modulo N, or (b) do the same computation but reduce modulo N at every step.¹ Due to a strong mathematical property (a ring morphism), it is possible to replace an expensive computation, possibly involving large numbers, with a simpler one that abstracts the expensive one by operating on small abstractions of the data. This example is ideal, and one could doubt the existence of complete abstractions that significantly simplify computations for nontrivial program analysis problems; yet we came across one such example, where we perform an exact analysis defined by a composition of exact abstractions and an exact solving procedure.

All current processors use some form of cache memory: frequently accessed code or data is retained in fast memory close to the processor core, avoiding much slower accesses to main memory. Cache analysis consists in determining, given a program and a description of a processor cache subsystem, which memory accesses are always “cache hits” (data already in cache), which are always “cache misses” (data not in cache, thus access to external memory), or more complex properties (“the first time this statement is executed, the access is a cache miss, then subsequent accesses are hits”). Such an analysis is for instance used as a prerequisite for worst-case execution time analysis.

¹ For N = 9, this is the basis for the method of “casting out nines”, named so because reduction modulo N can be implemented over numbers in decimal form by summing the digits of the number, with 9 being replaced by 0 (repeat the process until the result consists in one single digit). Schoolchildren used to apply this method to check their hand computations: the final result of an error-prone integer computation, taken modulo 9, should be identical to the result obtained with reduction at every step. This is of course not a sound check: if results differ, one is sure that there has been some error somewhere (no false positives), but they can coincide by chance.
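To make the ring-morphism idea concrete, here is a minimal sketch (my own illustration, not the chapter’s): evaluating a polynomial exactly and then reducing, versus reducing at every step, agree modulo N.

    def poly(x, y):
        return 3 * x**4 * y + 7 * x * y**2 + 11

    def poly_mod(x, y, n):
        # Same computation, reducing modulo n at every step: the abstraction
        # operates on small residues instead of possibly large integers.
        acc = (3 * pow(x, 4, n) * (y % n)) % n
        acc = (acc + 7 * (x % n) * pow(y, 2, n)) % n
        return (acc + 11) % n

    assert poly(12345, 6789) % 97 == poly_mod(12345, 6789, 97)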


A possible approach to solving the cache analysis problem would be to apply a model-checker to the combination of the program under analysis (or some suitably simplified model thereof) and of the cache system. The first and obvious abstraction is, for the cache system, to retain only the addresses of the memory blocks currently stored within the system, but not their contents. This abstraction is exact: the cache system finds data, or decides to evict data to make room for new contents, according to addresses only, never according to contents. The resulting model of the cache system, for a fixed processor model, is finite: the cache consists of a fixed number of cache lines, buffers, etc., labeled with addresses taken within a fixed range. However, such a model is in practice intractable in all but the simplest examples.

The naive vision of cache memory is that of a fully associative cache: any cache line may be used to store any block from the main memory, regardless of its address. In reality, a cache is usually split into several cache sets, each of which is able to retain blocks present only at a given class of addresses in the main memory. The classes of addresses for the different cache sets form a partition of the possible addresses in the main memory. In almost all cases,² these cache sets operate completely separately from each other. It is thus possible to perform cache analysis separately for each cache set; this is again an exact abstraction. However, this still results in intractable models.

² The exception is the “pseudo round robin” cache replacement policy, which involves a counter global to all cache sets. Also, dependencies between cache sets arise if one considers an integrated model of the processor pipeline, pre-fetching units and caches, because whether or not some data is in some cache set influences indirectly whether some other data will be fetched.

Each cache set is composed of a fixed number of cache lines; the number N of lines is also known as the number of ways or the associativity of the cache. Each cache line may be empty or nonempty; a nonempty cache line stores a block from the main memory: its address and its contents (we have already seen that the contents are not relevant for analysis). For simplicity, we shall consider here the case of single-level, read-only caches. A read from a memory location not currently in the cache triggers a load from the main memory into the cache. Except in the rare case when the relevant cache set still has empty lines, this involves evicting a block from the cache set. The selection of the block to be evicted is made according to a cache replacement policy. The most obvious replacement policy is to evict the least recently used (LRU) block.³

³ Some processors implement this LRU policy. It is however more common to implement so-called pseudo-LRU policies, meant to provide the same practical performance at a fraction of the hardware cost [15, 19]. Unfortunately, these pseudo-LRU policies are very different from the point of view of cache analysis [19, 23, 25].

A cache set is therefore modeled as a sequence of at most N block addresses, in increasing age: the youngest block is the one accessed most recently, and the oldest is the least recently accessed. The positions of blocks within the set change as data is accessed. Let us take an example with N = 4. Assume a, b, c, … denote distinct addresses in memory, and assume the cache set initially contains abcd, meaning that a is the youngest block and d the oldest. If the processor requires block d, then it is rejuvenated and the cache set then contains dabc. If then the processor requires block e, then the oldest block (c) is evicted and the cache set contains edab.

Let us now see our approach to analyzing programs over LRU caches [27]. Consider the case where we want to know whether a certain fixed block a is within the cache at certain program locations. Then, the blocks older than a in the cache are unimportant, and so is the ordering of the blocks younger than a. The idea is that if a block is younger than a, then accessing it again will just change the ordering of blocks younger than a, but not the age of a; and if it is older, then accessing it will increment the age of a. A cache set may thus be described, with respect to a block a, by an abstract configuration: either “a is absent” or “a is here and the set of blocks younger than a is {…}”. Again, this is an exact abstraction.

Cache analysis using that abstraction specialized for accesses to a thus amounts to collecting the sets (for all program locations) of reachable abstract configurations in the cache set where a is to be stored. This is still very costly. A collection of abstract configurations for that cache set consists of a Boolean “is it possible that a is absent” and a set of sets of sizes less than N of blocks distinct from a. A crucial observation is that, in this set of sets, only the minimal and maximal elements (with respect to the inclusion ordering) matter: if it is possible that a is preceded by {b}, {b, c}, or {b, c, d}, then, for checking if it is possible that a (at some later point) is out of the cache, it is only necessary to retain {b, c, d}, which subsumes the other two cases; and for checking if it is possible that a (at some later point) is in the cache, it is only necessary to retain {b}, which again subsumes the other two cases. The reason for this subsumption is that if a sequence of steps leads from a state where a is preceded in the cache by a set P of blocks to the next access to a, and that access is a miss, then the same sequence of steps, starting with any cache state where the set of blocks younger than a is a superset of P, also leads to a miss (mutatis mutandis for subsets and hits). Retaining only the extremal elements is thus, again, an exact abstraction from the point of view of cache analysis.

To summarize, we first gradually simplified the cache state by abstracting away parts and aspects that are irrelevant to the analysis under way, then we simplified the collection of abstract cache states by using a subsumption relation. The resulting abstraction is exact with respect to the properties under analysis. At this point, what remains is how to implement this abstraction. What is needed is an efficient way to represent sets of incomparable sets of blocks (antichains); we opted for zero-suppressed binary decision diagrams (ZDD). For better efficiency, this analysis, still somewhat costly, is applied to memory accesses only after some cheap approximate analyses [26] have failed to classify these accesses.

An advantage of an exact approach is that, since the result is uniquely defined, it is possible to test implementations of various exact abstractions against each other; the results should be identical. This way we discovered a subtle ZDD bug by comparison with a (less scalable) model-checking approach.

Granted, this approach is “exact” or “complete” only because a simplified model of the program (a control-flow graph, not taking into account the data being processed) is used.
This model introduces approximation: for instance, in the following program


if (flag) access(a); else access(b);
if (flag) access(a); else access(b);

the only feasible sequences of accesses are a, a and b, b (and thus in either case the second access is a hit), but an analysis based on control-flow only will not track the value of the flag and will consider that the sequences a, b and b, a are also feasible, and conclude that the second access may be a miss. However, tracking the value of data precisely makes the model undecidable. An intermediate solution that retains decidability is to consider control-flow with precise modeling of procedure calls [23].

On a final note, on many examples [27], worst-case execution time (WCET) analysis ended up being overall faster with the help of this apparently expensive analysis. The reason is that the WCET analysis in use involved enumerating all reachable pipeline states. If the cache analysis is imprecise and cannot conclude about some access, where the exact analysis would conclude to an “always miss”, this does not normally change the bound on the WCET, but it increases the number of pipeline states to consider.

In short:

• incomplete analyses are first applied in order to answer most subproblems exactly; only the subproblems for which the answer is definitely unknown are passed to the more expensive complete analysis;
• completeness is ensured by a composition of abstractions that do not lose any precision with respect to the properties we are interested in, but which greatly simplify the model to analyze;
• the resulting model is analyzed exactly;
• complete analyses have an exactly defined result, thus testing for bugs is easy by comparing results;
• for the same reason, it is possible to study the complexity of the problem [23, 25];
• the extra cost of the complete analysis may be offset by savings in client analyses, because a more precise result means fewer case analyses down the road.
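As a minimal sketch of the antichain-based abstract configurations described above (the actual implementation uses ZDDs; this plain-set encoding and the function names are mine):

    def minimal_elements(configs):
        # Keep inclusion-minimal sets of blocks younger than `a`:
        # enough to decide whether `a` may still be in the cache.
        return {s for s in configs if not any(t < s for t in configs)}

    def maximal_elements(configs):
        # Keep inclusion-maximal sets: enough to decide whether `a` may be absent.
        return {s for s in configs if not any(t > s for t in configs)}

    def access_other(configs, block, n_ways):
        # An access to some `block` != a ages `a`; a set reaching size N
        # means `a` has been evicted from the N-way cache set.
        return {s | {block} for s in configs if len(s | {block}) < n_ways}

    configs = {frozenset({'b'}), frozenset({'b', 'c'}), frozenset({'b', 'c', 'd'})}
    print(minimal_elements(configs))   # {frozenset({'b'})}
    print(maximal_elements(configs))   # {frozenset({'b', 'c', 'd'})}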

6.3 Completeness or Incompleteness of the Analysis Method

A classical question of completeness in abstract interpretation is whether the abstract domain in use is sufficient to prove the properties being sought. Yet, even if the abstract domain is sufficient, it may happen that the abstract interpretation method being used cannot find the necessary invariants within the domain.

6.3.1 Widening Operators

Let us see an example: some simplified version of a dataflow program where i is an index into some circular array of length 42.


i = 0;
while (true) {
  if (trigger()) {
    i = i + 1;
    if (i > 42) i = 0;
  }
  assert(i < 1000);
}

Textbook forward static analysis by intervals [4, 5] will compute, for variable i, a sequence of intervals [0, 0], [0, 1], [0, 2], …, and widen it to [0, ∞). If narrowing iterations are used, in their simplest form by starting from the [0, ∞) invariant, running the loop body once more and adding the initialization states, one obtains [0, 999]. The assertion cannot be proved using that invariant. Yet the assertion would have been proved if the analysis had guessed the least possible interval, [0, 42]. Clearly, the problem is with the inference method (iterations with widening), not the abstract domain (intervals). Iterations with widening are an incomplete method with respect to the abstract domain: they may fail to discover suitable inductive invariants when some exist in the domain.

Note also a surprising characteristic of this incomplete approach: if instead of the precondition i = 0 we had analyzed the loop with the precondition 0 ≤ i ≤ 42, we would be able to prove the assertion correct. This non-monotonic behavior (a less precise precondition, or a coarser abstraction of some program semantics, leads to more precise analysis results) may surprise end users.

Another surprising characteristic of widening operators for relational domains is that they do not commute, in general, with projection, meaning that adding some extra variables may change the result of the analysis on the original variables even if the extra variables have no influence on the original ones (for instance, if these variables are mere “observers”). Consider for instance the program:

i = 0; j = 0;
while (true) {
  if (*) i = 0;
  else { i = 1; j = 0; }
  if (i == 0) j = j + 1;
}

In this program, j observes the number of last iterations during which i was 0. The set of reachable states of this program for (i, j) is {(1, 0)} ∪ {(0, n) | n ∈ ℕ}; in particular the set of reachable values for i is {0, 1}, and it can be computed exactly with interval analysis by postponing widening by one step, or by applying one step of narrowing. Now consider the sequence of polyhedra (that is, polygons) computed by polyhedral analysis with widening [8]: at step n, the polygon is defined by vertices (0, 0), (1, 0), (0, n). The lines passing through vertices (0, 0) and (1, 0) (constraint j ≥ 0) and the one passing through vertices (0, 0) and (0, n) (constraint i ≥ 0) are stable; the line passing through vertices (1, 0) and (0, n) is unstable and will be discarded by widening.


The resulting system of constraints is i ≥ 0 ∧ j ≥ 0: we have lost the i ≤ 1 constraint obtained by interval analysis. In some cases, a stratified approach, where successive analyses take increasing subsets of variables into consideration, may be able to recoup precision [24].
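To make the widening/narrowing behaviour on the first loop above concrete, here is a minimal sketch (my own illustration; the encoding of the loop body as an interval transformer is an assumption):

    INF = float('inf')

    def join(a, b):
        if a is None: return b
        if b is None: return a
        return (min(a[0], b[0]), max(a[1], b[1]))

    def body(iv):
        # if (trigger()) { i = i + 1; if (i > 42) i = 0; }  assert(i < 1000);
        lo, hi = iv
        inc = (lo + 1, hi + 1)
        kept = (inc[0], min(inc[1], 42)) if inc[0] <= 42 else None  # i + 1 <= 42
        reset = (0, 0) if inc[1] > 42 else None                     # i + 1 > 42
        after_if = join(join(kept, reset), iv)       # trigger() may be false
        return (after_if[0], min(after_if[1], 999))  # assertion acts as a filter

    init = (0, 0)
    x = init
    while True:                                      # iterate with widening
        nxt = join(init, body(x))
        w = (x[0] if nxt[0] >= x[0] else -INF,       # drop unstable bounds
             x[1] if nxt[1] <= x[1] else INF)
        if w == x: break
        x = w
    print(x)                  # (0, inf): widening overshoots
    x = join(init, body(x))   # one narrowing iteration
    print(x)                  # (0, 999): not the least invariant (0, 42)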

6.3.2 Exact Solving A safety property is established using forward abstract interpretation by inferring some inductive invariants in the abstract domain and checking that these invariants imply the property. A complete abstract interpretation method is one that would compute always such invariants if they exist. Clearly, any method that computes the least inductive invariants in the abstract domain is complete: if there exist invariants in the domain that can prove the safety property, then, a fortiori, the least inductive invariants in the domain can prove that property. One such method is ascending iterations without widening within a domain that satisfies the ascending chain condition (no infinite strictly ascending sequences), which ensures termination. This includes domains of finite height (there is a bound on the length of strictly ascending chains), and in particular finite domains, such as powersets of finite sets. If f : L → L is a monotone function, and ⊥ is the infimum of L, then the sequence f n (⊥) is ascending and, in all the above cases, becomes stationary. When f n (⊥) = f n+1 (⊥), then f n (⊥) is the least solution of x = f (x). The algorithms used in practice are algorithmic improvements over this idea.4 An example of an infinite domain of finite height is that of solution sets of systems of linear equations over a fixed number of variables [20]. When a solution set S is strictly included in a solution set S , the dimension of S is strictly greater than that of S; this dimension cannot be more than n, which thus bounds the height of the lattice. The domain of intervals has none of these characteristics; but it is however possible to compute the least inductive invariant within that domain of programs such as the one above, by specifying this least inductive invariant as a least fixed point, writing a system of numeric equations that this fixed point should satisfy, and solving this system for the least solution. Indeed, for the above example, let us write down the equations that h should satisfy for [0, h] to be a fixed point for the loop: h = min(max(min(42, h + 1), h), 999).

(6.1)

Let us solve this equation. In any solution, the outermost “minimum” operator must be equal to one of its arguments. Assume it is the second argument, 999. One can indeed verify that 999 is a solution to this equation. Now consider the case where In many cases, L = L m b where L b is a base lattice. An element x of L is then decomposed into x1 , . . . , xm , and f is decomposed into its m projections f 1 , . . . , f m . The problem is then to find a solution of a system of equations x1 = f 1 (x1 , . . . , xm ), . . . , xm = f m (x1 , . . . , xm ), typically using a system of working set of variables being updated, with some judiciously chosen ordering for choosing the next update to process.

4

100

D. Monniaux

Now consider the case where it evaluates to its left argument. The equation then becomes h = max(min(42, h + 1), h). Similarly, the "maximum" operator evaluates to one of its operands. Assume it evaluates to its right argument. The system then simplifies to h = h, under the conditions h ≤ 999 and h ≥ min(42, h + 1). When h + 1 > 42, that is, h ≥ 42, this minimum equals 42; thus all 42 ≤ h ≤ 999 yield valid solutions. When h + 1 ≤ 42, this minimum equals h + 1; but then the condition becomes h ≥ h + 1, which would imply h = ∞, contradicting h ≤ 999. Now assume the "maximum" operator evaluates to its left argument. The system then simplifies to h = min(42, h + 1), whose only solution is h = 42. We conclude that h = 42 is the least solution.

The above reasoning by case analysis over the minimum and maximum operators can be automated through exhaustive case analysis or, more subtly, by a satisfiability modulo theories (SMT) solver. Many such solvers can also optimize within the solution space, and thus directly produce the least fixpoint. Alternatively, for proving safety properties, it is sufficient to query the solver for any solution that proves the safety property. A more principled approach to the case analysis over the maximum operators is ascending policy iteration [11], which considers an ascending sequence of fixpoints of systems obtained by picking arguments to the "maximum" operators. All the above approaches generalize to domains other than intervals, for instance to zones [12], at the expense of some algorithmic complications.

Let us go further. Assume the abstract domain consists of the sets defined by a parameterized predicate I(p, x), where p is the vector of parameters and x is the program state. Note that this encompasses intervals, octagons, template polyhedra domains, and even polyhedra with a fixed number of faces. Also note that I may contain disjunctions and express non-convex relations. The inductiveness condition for a transition τᵢ,ⱼ from program location i to program location j may thus be written as the Horn clause

    ∀x, x′: I(pᵢ, x) ∧ τᵢ,ⱼ(x, x′) ⇒ I(pⱼ, x′).    (6.2)

Safety conditions may be expressed as ∀x: I(pᵢ, x) ⇒ Cᵢ(x), and program startup may be specified as ∀x: Sᵢ(x) ⇒ I(pᵢ, x). Inductive invariants within the abstract domain suitable for proving the safety properties are thus expressed as the solutions, in the parameters pᵢ, of a system of Horn clauses. If I, the transitions τᵢ,ⱼ, the safety conditions Cᵢ, and the start conditions Sᵢ are all expressed within a theory admitting algorithmic quantifier elimination, such as Presburger arithmetic or the theory of real closed fields, then the existence of such invariants is decided by quantifier elimination. For instance, if we consider a program operating over real numbers, and invariants of the form A(p)x ≤ B(p) where A(p) and B(p) have a fixed number r of rows (a polyhedron with at most r faces), then the existence of a suitable invariant is established by quantifier elimination in the theory of real closed fields. It is even possible to specify that one wants the least inductive invariant and obtain it by quantifier elimination [21].
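To make the SMT-based route concrete, here is a minimal sketch (assuming the Z3 Python bindings, package z3-solver; the encoding of min and max via If is our own illustration, not taken from the chapter) that feeds equation (6.1) to an optimizing solver and asks directly for the least solution:

    # Solving h = min(max(min(42, h + 1), h), 999) for its least solution
    # with an optimizing SMT solver, as discussed above.
    from z3 import Int, Optimize, If, sat

    def zmin(a, b):
        return If(a <= b, a, b)

    def zmax(a, b):
        return If(a >= b, a, b)

    h = Int('h')
    opt = Optimize()
    opt.add(h == zmin(zmax(zmin(42, h + 1), h), 999))
    opt.minimize(h)               # directly request the least fixpoint
    assert opt.check() == sat
    print(opt.model()[h])         # prints 42, matching the manual case analysis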


6.3.3 Imprecise Abstract Transfer Functions

Widening operators are the best-known source of non-monotonicity and incompleteness in finding suitable invariants in abstract interpretation, but they are not the only ones. There are, and this may be surprising, cases of non-monotonicity and incompleteness due to transfer functions, particularly in combinations of abstractions. It is well known that even if each individual step of abstract interpretation computes the least possible post-condition within the abstract domain compatible with the precondition, this property does not extend to the composition of steps; in other words, optimality does not compose. Let us see a simple example:

    y = x;
    z = x − y;

If x is known to lie in [0, 1], then so does y, and this is optimal (non-optimal but sound answers would be intervals containing [0, 1]). Then interval arithmetic deduces that z is in [−1, 1], which is sound, but not optimal: the optimal interval is [0, 0]. However, in order to reach that conclusion, one must know that x = y at the intermediate control point, that is, a relation between x and y, which is not possible in a non-relational abstraction such as intervals. One workaround to this weakness is to propagate a system of relations along with the interval analysis. For instance, had we propagated y → x, then by substituting y → x into x − y and simplifying, we would have obtained z = 0. An approach to implementing this workaround [7], applied for instance in the Astrée static analyzer, is to propagate a terminating rewriting system (for instance, populated in chronological order) along with the interval information, perform interval analysis on both the original expressions and the expressions obtained by rewriting and simplification, and then take the intersection. Here, the rewriting system would contain y → x, and the rewritten and simplified expression would yield z = 0. This approach however creates non-monotonicity (on some industrial programs, it resulted in "false alarms", that is, warnings about nonexistent problems). Consider this example [7, ex. 4]:

    Code                Symbolic computation               Less precise symbolic
    int i, j=i+1;       j → i+1                            (nothing)
    int k=j+1;          j → i+1, k → j+1 → i+2             k → j+1
    if (j > 0) l=k;     j → i+1, k → i+2, l → i+2          k → j+1, l → j+1

Assume we don't know anything about the values of the variables initially, and consider the interval obtained for l at the end of the "then" branch of the test. Simple interval propagation yields no information about l. The rewriting system yields l → i + 2 and, since no information is known about i, no information is known about l. Yet, if we forget that j → i + 1 and apply the resulting, less precise rewriting system, we obtain l → j + 1 and thus, since the guard gives j > 0, l > 1.
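The loss and recovery of precision on the first example can be replayed in a few lines. The following is a toy sketch (our own simplification, not Astrée's actual domain combination) contrasting plain interval evaluation of z = x − y with the rewriting-based workaround:

    # Interval subtraction: [a0, a1] - [b0, b1] = [a0 - b1, a1 - b0].
    def isub(a, b):
        return (a[0] - b[1], a[1] - b[0])

    # Interval intersection (meet).
    def meet(a, b):
        return (max(a[0], b[0]), min(a[1], b[1]))

    env = {'x': (0, 1)}
    env['y'] = env['x']                 # y = x: optimal in intervals

    # Plain interval evaluation of z = x - y: sound but not optimal.
    z_plain = isub(env['x'], env['y'])  # (-1, 1)

    # The rewriting system collected so far contains y -> x, so x - y is
    # rewritten to x - x, which simplifies symbolically to the constant 0.
    z_rewritten = (0, 0)

    # The analysis keeps the intersection of both evaluations.
    env['z'] = meet(z_plain, z_rewritten)
    print(env['z'])                     # (0, 0): the optimal interval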


Granted, some improvements could be made by more clever propagation. (More generally, advanced tools such as Astrée implement complex combinations of domains and iteration strategies, which make many examples of non-monotone behavior and false alarms on simple analyses actually succeed.) For instance, since j → i + 1, the guard could be rewritten into i + 1 > 0, and this could itself be rewritten into i > −1. This amounts to "inverting" j → i + 1 into i → j − 1. More generally, one would need to consider the rewriting system not as a directed system, but as a system of equations. The combination of a system of linear equations and intervals forms the basis of the simplex algorithm for linear programming, and obtaining optimal interval bounds from a combination of equalities and interval bounds amounts to linear programming. In fact, any convex polyhedron can be expressed as the projection of a set defined by the conjunction of linear equalities and interval bounds: consider the representation of the polyhedron as a system of inequalities lᵢ(x1, …, xn) ≤ Bᵢ, create a new variable yᵢ for any linear combination lᵢ for which there does not already exist a variable, and keep each such inequality as yᵢ ≤ Bᵢ together with yᵢ = lᵢ(x1, …, xn); the set defined by the original system is then the projection of the set defined by the transformed system onto the x variables. The appropriate domain for dealing with such constraints precisely would thus be that of convex polyhedra [8, 18].

6.4 Undecidability of an Abstraction

The argument from Sect. 6.3.2 for establishing the decidability of the existence of an inductive invariant within the domain of convex polyhedra with at most r faces does not extend to the domain of convex polyhedra with any number of faces. In fact, there is no known algorithm for deciding the existence of inductive invariants within that domain for any nontrivial class of programs. Invariant inference approaches for that domain, starting from the early proposals of Cousot and Halbwachs [8, 18], are typically based on iterations with widening. An intriguing question is whether it is inevitable to resort to such heuristics.

6.4.1 Polyhedral Abstraction

I have attempted to prove that there is no algorithm deciding the existence of polyhedral invariants for linear transition systems (real or integer), and have so far failed. Because of my efforts in raising attention to this issue, some colleagues named this the "Monniaux problem". Several distinguished researchers, including some at the 2022 CSV workshop in Venice, whence this volume comes, have also reported trying to solve it and failing.

6.4.1.1 Undecidability with Quadratic Guards

Progress has however been made on variants of the so-called "Monniaux problem" [10]. I was able to prove that this problem becomes undecidable if one is allowed to use a quadratic transition guard: it is then possible to encode a deterministic counter machine reachability problem into the invariant inference problem. Let us recall how (proof sketch) [22]. Let the counter machine M operate over variables z1, …, zn, to which we add two variables x and y, initialized to 0. The transitions of the counter machine are synchronously combined with steps (x, y) → (x + 1, y + x). The combined machine thus simulates the counter machine on the variables zᵢ together with the parabola point (n, ½n(n − 1)) on the variables (x, y), where n is the number of steps taken so far. Now modify the resulting machine by conjoining to all transitions the guard y = ½x(x − 1); clearly, this does not modify the behavior of the machine. Add a special control state σ_b, meant to be unreachable, and transitions from any state where y < ½x(x − 1) to that state. Call the resulting machine M′.

Assume M terminates in N steps; then so does M′. Consider the convex hull H of the finite family of points such that x = k, y = ½k(k − 1), and z1, …, zn is the state reached after k steps of the execution of M, for 0 ≤ k ≤ N. The only points of H that lie on the parabola y = ½x(x − 1) project to the reachable states of M. In general, H contains extra states (integer points) with y > ½x(x − 1), but due to the nonlinear guards, these states cannot transition to other states. Thus, H is an inductive invariant for M′ that proves the unreachability of σ_b.

Now assume M does not terminate; then neither does M′. Any inductive invariant for M′ must contain the infinite family of points such that x = k, y = ½k(k − 1), and z1, …, zn is the state reached after k steps, for k ≥ 0. Any convex polyhedron that contains this infinite family of points must contain a point where y < ½x(x − 1): assume there is none, and consider the vertex V* with maximal x-coordinate x*; this vertex must lie on the parabola; for x ≥ x* there is at least one unbounded face of the form y ≥ L(x, z1, …, zn); but then inevitably there are points of that face lying below the parabola (Fig. 6.1). However, any convex polyhedron that contains a point where y < ½x(x − 1) is not an inductive invariant capable of proving that σ_b is unreachable. It follows that there is no convex polyhedron that is an inductive invariant for M′ capable of showing the unreachability of σ_b.

Thus, M′ has an inductive polyhedral invariant suitable for proving the unreachability of σ_b if and only if M terminates, and this holds for arbitrary M in a class of machines with undecidable termination. The same reasoning applies whether the state space is considered over the integers or over the reals.
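The key arithmetic fact, that the synchronous steps trace exactly the parabola and therefore never trip the conjoined guard, can be checked mechanically; the following few lines are an illustrative sketch of ours, not part of the proof:

    # The pair (x, y), initialized to (0, 0) and updated by
    # (x, y) -> (x + 1, y + x), always satisfies y = x(x - 1)/2,
    # so the guard y = (1/2)x(x - 1) never blocks the simulation.
    x, y = 0, 0
    for _ in range(100):
        assert y == x * (x - 1) // 2
        x, y = x + 1, y + x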


Fig. 6.1 When the process is non-terminating, any polyhedron containing the reachable points inevitably goes below the parabola

6.4.1.2 Perspectives and Dead Ends

The role of the parabola y = ½x(x − 1) in the above proof could be played by other sets, such as the circle x² + y² = 1: the process that enumerates points on the parabola can be replaced by one that enumerates points on the circle. Initialize (x, y) = (1, 0) and, as the next step, apply the rotation matrix defined by a Pythagorean triple, i.e., integers (a, b, c) such that a² + b² = c², for instance (3, 4, 5):

    ( a/c  −b/c )
    ( b/c   a/c )

This however leads to more complex proof arguments, and does not gain anything: we still need a nonlinear guard. Essentially, what we used in both cases is a strictly convex set S (y ≥ ½x(x − 1)), with a boundary definable by a guard (y = ½x(x − 1)), with an exterior definable by a guard (y < ½x(x − 1)), and with a way to enumerate points on the boundary while still remaining in the class of deterministic transitions under consideration. Strict convexity is used so that taking the convex hull of points in S, as when using polyhedral invariants, does not add "parasitic" points on the boundary. This ensures that the guard that keeps only points on the boundary removes the "parasitic" points.

Could we achieve similar goals using only linear constraints? Suppose we could express such a guard with a formula, possibly containing disjunctions. Then, by distributivity, this formula expresses a finite union of polyhedra P1 ∪ ⋯ ∪ Pn. There is then an i such that the enumeration process eventually picks two points (in fact, infinitely many) in Pᵢ, and thus the guard cannot exclude the "parasitic" points between these two, at least over the reals. A possible course of action could be to set the problem over the integers and take advantage of the implicit guard that non-integer points are discarded. This, however, was also a dead end.
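To make the circle-based enumeration above concrete, here is a small sketch of ours showing that the rotation associated with the Pythagorean triple (3, 4, 5) enumerates rational points that stay exactly on the unit circle when computed with exact fractions:

    # Enumerate points on x^2 + y^2 = 1 using the rotation matrix
    # ((a/c, -b/c), (b/c, a/c)) for the Pythagorean triple (3, 4, 5).
    from fractions import Fraction

    a, b, c = 3, 4, 5
    x, y = Fraction(1), Fraction(0)
    for _ in range(10):
        assert x * x + y * y == 1   # every enumerated point lies on the circle
        x, y = (a * x - b * y) / c, (b * x + a * y) / c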


6.4.2 Richer Domains

If the abstract domain (class of invariants) under consideration is rich enough to express exactly the set of reachable states of the programs to be analyzed, then obviously the problem is undecidable: a suitable inductive invariant (the set of reachable states) exists in the domain if and only if the safety property is true. This happens, for instance, if the domain includes the Σ⁰₁ formulas of Peano arithmetic, that is, formulas of the form ∃v1 … ∃vn F where F contains only bounded quantifiers and the usual arithmetic operations and comparisons. By a form of Gödel encoding, for any integer program with m variables built using the usual arithmetic operators, it is possible to build a Σ⁰₁ formula F(k, s1, …, sm) satisfied if and only if (s1, …, sm) is the state of the program after k steps of execution [28, Chap. 7], [3]. Such domains are obviously too rich. In fact, with the domain of Σ⁰₁ formulas described above, it is not even possible in general to check for inductiveness, because the formulas belong to an undecidable class. However, even semilinear sets (sets defined by Boolean combinations of linear inequalities with algebraic coefficients, thus disjunctions of convex polyhedra) are sufficient to obtain undecidability, even for a very restricted class of programs (a single loop control state, nondeterministic choice between two linear transformations, no guards) [10, Theorem 9].

6.5 Perspectives and Conclusion

The question of the completeness of the domain with respect to the properties to prove, and to the class of programs under consideration, is distinct from the problem of algorithmically finding suitable invariants within that domain, or from the related problem of computing the least inductive invariant within the domain.

The traditional approach for abstract interpretation in domains with infinite ascending chains is iteration with widening. This approach has known weaknesses: the analysis may miss the invariants necessary for the proof, and the analysis behavior is non-monotone (increasing knowledge about preconditions or transitions may decrease the precision of the analysis), which is a source of "brittleness" (a minor change in the program causes the analysis to fail to prove the property). Many tricks are thus used to improve the invariants obtained with widening: narrowing iterations [4, 6], lookahead widening [16], and guided static analysis [17]. Though they help in practice, none of them guarantees that the analysis will succeed.

Over time, completely different approaches were designed to compute suitable arguments for proving safety properties. While widening is a form of extrapolation (look at the sets of reachable states in 0, 1, 2, … steps and try to extrapolate to an arbitrary number of steps), these methods are based on interpolation: given some under-approximation of the set of reachable states, and the property to prove, find a simple separation between the two (a Craig interpolant) and hope that the interpolant,


or components thereof, or formulas constructed from components of successive interpolants, becomes inductive. One such approach is property-directed reachability [1, 9], later enriched with a number of extensions and improvements (dealing with arithmetic theories, and with non-linear Horn clauses rather than just transition systems). Variants of this approach are in particular implemented in the popular Z3 solver (https://github.com/Z3Prover/z3). Unfortunately, this approach, too, is brittle [2]: minor differences in the problem to be solved (names of variables, etc.) may result in dramatic changes in outcome (finding invariants or timing out).

Approaches that are guaranteed to find the least inductive invariant in the chosen abstract domain (or, more generally, an inductive invariant suitable for proving a given property, if one exists) are more resilient; I have listed several of them in Sect. 6.3.2. Some of these methods, however, do not scale up well (those based on quantifier elimination); policy iteration seems to be the most scalable.

In our view, the case for the continued use of widening operators, despite their known weaknesses, or of other non-monotonic and/or brittle methods, would be strengthened by a proof that the existence of a suitable invariant in the domain is undecidable, or at least of high complexity. Unfortunately, such undecidability proofs are hard. In contrast, complete methods have many desirable properties. We have illustrated this with the example of LRU cache analysis (Sect. 6.2). Since the end result has a unique definition, there is no brittleness, and several algorithms or implementations can be used to compute the same result, which allows performance comparisons (the meaning of performance differences between methods producing non-comparable results is unclear) and validation by testing that the results are identical.

References

1. Bradley, A.R.: SAT-based model checking without unrolling. In: Jhala, R., Schmidt, D.A. (eds.) Verification, Model Checking, and Abstract Interpretation, 12th International Conference, VMCAI 2011, Austin, TX, USA, January 23-25, 2011, Proceedings. Lecture Notes in Computer Science, vol. 6538, pp. 70–87. Springer (2011). https://doi.org/10.1007/978-3-642-18275-4_7
2. Braine, J., Gonnord, L., Monniaux, D.: Verifying Programs with Arrays and Lists. Internship report, École normale supérieure de Lyon (2016). https://hal.archives-ouvertes.fr/hal-01337140
3. Cook, S.A.: Soundness and completeness of an axiom system for program verification. SIAM Journal on Computing 7(1), 70–90 (1978). https://doi.org/10.1137/0207005
4. Cousot, P.: Méthodes itératives de construction et d'approximation de points fixes d'opérateurs monotones sur un treillis, analyse sémantique de programmes. Thèse d'état ès sciences mathématiques, Université scientifique et médicale de Grenoble, Grenoble, France (1978)
5. Cousot, P., Cousot, R.: Abstract interpretation: a unified lattice model for static analysis of programs by construction or approximation of fixpoints. In: Conference Record of the 4th ACM Symposium on Principles of Programming Languages, pp. 238–252. Los Angeles, CA (1977)
6. Cousot, P., Cousot, R.: Abstract interpretation frameworks. J. Log. Comput. 2(4), 511–547 (1992). https://doi.org/10.1093/logcom/2.4.511



7. Cousot, P., Cousot, R., Feret, J., Mauborgne, L., Miné, A., Monniaux, D., Rival, X.: Combination of abstractions in the Astrée static analyzer. In: Advances in Computer Science - ASIAN 2006. Secure Software and Related Issues. Lecture Notes in Computer Science, vol. 4435, pp. 272–300 (2008). https://doi.org/10.1007/978-3-540-77505-8_23
8. Cousot, P., Halbwachs, N.: Automatic discovery of linear restraints among variables of a program. In: Proceedings of the Fifth Conference on Principles of Programming Languages. ACM Press (1978)
9. Eén, N., Mishchenko, A., Brayton, R.K.: Efficient implementation of property directed reachability. In: Bjesse, P., Slobodová, A. (eds.) International Conference on Formal Methods in Computer-Aided Design, FMCAD '11, Austin, TX, USA, October 30 - November 02, 2011, pp. 125–134. FMCAD Inc. (2011)
10. Fijalkow, N., Lefaucheux, E., Ohlmann, P., Ouaknine, J., Pouly, A., Worrell, J.: On the Monniaux problem in abstract interpretation. In: SAS. Lecture Notes in Computer Science, vol. 11822, pp. 162–180. Springer (2019)
11. Gawlitza, T., Seidl, H.: Precise fixpoint computation through strategy iteration. In: De Nicola, R. (ed.) Programming Languages and Systems, 16th European Symposium on Programming, ESOP 2007, Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2007, Braga, Portugal, March 24 - April 1, 2007, Proceedings. Lecture Notes in Computer Science, vol. 4421, pp. 300–315. Springer (2007). https://doi.org/10.1007/978-3-540-71316-6_21
12. Gawlitza, T.M., Seidl, H.: Precise program analysis through strategy iteration and optimization. In: Software Safety and Security. NATO Science for Peace and Security Series - D: Information and Communication Security, vol. 33, pp. 348–384. IOS Press (2012)
13. Giacobazzi, R., Ranzato, F.: Completeness in abstract interpretation: a domain perspective. In: AMAST. Lecture Notes in Computer Science, vol. 1349, pp. 231–245. Springer (1997)
14. Giacobazzi, R., Ranzato, F., Scozzari, F.: Making abstract interpretations complete. J. ACM 47(2), 361–416 (2000)
15. Gille, D.: Study of different cache line replacement algorithms in embedded systems. Master's thesis, KTH (2007). https://people.kth.se/~ingo/MasterThesis/ThesisDamienGille2007.pdf
16. Gopan, D., Reps, T.W.: Lookahead widening. In: Ball, T., Jones, R.B. (eds.) Computer Aided Verification, 18th International Conference, CAV 2006, Seattle, WA, USA, August 17-20, 2006, Proceedings. Lecture Notes in Computer Science, vol. 4144, pp. 452–466. Springer (2006). https://doi.org/10.1007/11817963_41
17. Gopan, D., Reps, T.W.: Guided static analysis. In: Nielson, H.R., Filé, G. (eds.) Static Analysis, 14th International Symposium, SAS 2007, Kongens Lyngby, Denmark, August 22-24, 2007, Proceedings. Lecture Notes in Computer Science, vol. 4634, pp. 349–365. Springer (2007). https://doi.org/10.1007/978-3-540-74061-2_22
18. Halbwachs, N.: Détermination automatique de relations linéaires vérifiées par les variables d'un programme. Thèse, Institut National Polytechnique de Grenoble - INPG; Université Joseph-Fourier - Grenoble I (1979). https://tel.archives-ouvertes.fr/tel-00288805
19. Heckmann, R., Langenbach, M., Thesing, S., Wilhelm, R.: The influence of processor architecture on the design and the results of WCET tools. Proceedings of the IEEE 91(7), 1038–1054 (2003). https://doi.org/10.1109/JPROC.2003.814618
20. Karr, M.: Affine relationships among variables of a program. Acta Informatica 6, 133–151 (1976)
21. Monniaux, D.: Automatic modular abstractions for linear constraints. In: POPL (Principles of Programming Languages), pp. 140–151. ACM (2009). https://doi.org/10.1145/1480881.1480899
22. Monniaux, D.: On the decidability of the existence of polyhedral invariants in transition systems. Acta Informatica 56(4), 385–389 (2019). https://doi.org/10.1007/s00236-018-0324-y
23. Monniaux, D.: The complexity gap in the static analysis of cache accesses grows if procedure calls are added. Formal Methods in System Design (2022). https://doi.org/10.1007/s10703-022-00392-w
24. Monniaux, D., Guen, J.L.: Stratified static analysis based on variable dependencies. Electron. Notes Theor. Comput. Sci. 288, 61–74 (2012)


25. Monniaux, D., Touzeau, V.: On the complexity of cache analysis for different replacement policies. Journal of the ACM 66(6), 41:1–41:22 (2019). https://doi.org/10.1145/3366018
26. Touzeau, V., Maïza, C., Monniaux, D., Reineke, J.: Ascertaining uncertainty for efficient exact cache analysis. pp. 22–17. Cham (2017). https://doi.org/10.1007/3-540-63141-0_10
27. Touzeau, V., Maiza, C., Monniaux, D., Reineke, J.: Fast and exact analysis for LRU caches. Proc. ACM Program. Lang. 3, 54:1–54:29 (2019). https://doi.org/10.1145/3290367
28. Winskel, G.: The Formal Semantics of Programming Languages: An Introduction. Foundations of Computing Series. MIT Press (1993)

Chapter 7

Lifting String Analysis Domains

Martina Olliaro, Vincenzo Arceri, Agostino Cortesi, and Pietro Ferrara

Abstract Strings are characterized by properties that refer either to their content (e.g., the set of contained characters) or to their shape (e.g., the position of characters and the length of the strings). In this paper, we present a general framework providing a systematic lifting of string domains through a segmentation abstraction, yielding a more accurate representation of strings without major impact on the efficiency of the analysis. The proposed operator exploits, in the string scenario, the abstraction functor introduced by Cousot, Cousot and Logozzo for a fully parametric array representation.

7.1 Introduction

Static program analysis aims at inferring behavioral properties of a software system statically, i.e., at compilation time. In this paper we refer, in particular, to Abstract Interpretation [14, 15], a general theory for static analyses that, starting from the concrete semantics of a programming language, allows approximating the semantics of a program in a computable and sound way. Abstract Interpretation has been extensively and successfully applied to formalize static analyses for different programming languages and in many application scenarios, ranging from avionics [17] to automotive [44], from mobile apps [33] to databases [27] and neural networks [36], to name a few.


One of the main components of an Abstract Interpretation-based static analysis is how values are approximated. Several numerical domains, like Octagons [35] and Polyhedra [19], have been introduced and applied over the last decades to prove different properties (e.g., buffer overruns) with various levels of precision and efficiency. More recently, a significant research effort has focused on string values too. String analysis aims at abstracting the sequence of literals possibly held by a string variable at each program point. The need for an accurate string analysis is related, in particular, to the verification of software systems that manage sensitive data against the exploitation of string vulnerabilities, as in the case of SQL injection attacks.

Several abstract domains that approximate string values have already been formalized and implemented in the literature [13, 47]. Some of them focus on specific programming languages such as JavaScript [3, 28, 32] and C [11, 12]. Others target specific properties and rely on different approaches, like automata [46], regular expressions [38], and grammatical summaries [29]. The String Length domain [1] approximates the length of a string through an interval (e.g., the interval [2, 5] represents all the possible strings whose length ranges from 2 to 5). The Character Inclusion domain [13] approximates strings through a pair of sets highlighting the characters definitely and possibly contained (e.g., the pair of sets ({a, b}, {a, b, c}) represents all the possible strings definitely containing the characters 'a' and 'b' and possibly containing the character 'c'). Notice that while the String Length domain captures information about the string shape, the Character Inclusion domain observes content information only. These two dimensions (shape and content) are both captured by the Prefix domain [13]. For example, the prefix ab∗ represents all the possible strings sharing the starting sequence "ab", whose minimal length is 2. Finally, the Segmentation domain [18] is a sophisticated abstract domain originally proposed to abstract array content by bounded, consecutive, non-overlapping, and possibly empty segments. It allows both the abstraction of array element values and relational abstractions between array content and indexes.

Different abstract domains can be composed in various ways (such as the Cartesian and the reduced product) within the Abstract Interpretation framework [6, 10, 15]. Each abstract domain represents a trade-off between the precision and the efficiency of the analysis, since usually the more precise the representation is, the more steps are necessary for the fixpoint algorithm to converge.

7.1.1 Paper Contribution

In this paper, we discuss the combination of a suite of string abstract domains (such as the String Length [1], the Character Inclusion and the Prefix [13] domains) with the Segmentation domain proposed by Cousot et al. in [18], showing the advantages in terms of precision of the analysis. On the one hand, string domains track information about the content of string values. On the other hand, the segmentation domain behaves like a functor that allows lifting basic domains through structural


information. Since it represents string values by distinct segments, it allows string domains to keep track of information also on single portions of a string.

The main contribution of the paper can be summarized as follows:

• We highlight the limits of the existing basic string abstract domains, which deal with string shape and content properties separately or only partially integrated, and the need to overcome these limits by providing a systematic construction of domains where string shape information is twinned with string content information.
• We specialize the Segmentation domain (originally designed for array representation) for string analysis. A string segmentation element over-approximates strings by looking at the intervals that possibly share a similar feature. For instance, if the feature we want to capture is just related to character occurrences, we may consider string segmentations like 'a'[2, 4] 'b'[0, 2]. This element contains two consecutive segments (namely, 'a'[2, 4] and 'b'[0, 2]), each of which is composed of a character representation followed by a numerical interval. The latter indicates how many times that character might be repeated. Thus, the abstract element 'a'[2, 4] 'b'[0, 2] represents all the possible strings having the character 'a' repeated from 2 to 4 times followed by the character 'b' repeated from 0 to 2 times, like "aa", "aab", "aaaabb".
• The combination of basic string abstract domains with the segmentation domain, where the tracked string information is homogeneous, is refined by exploiting the notions of Granger methodology and refinement operators [25], and we show that the lifting is in fact a reduced product that can be over-approximated in a computable way.
• We show how inconsistency conditions can be associated with abstract values, enabling the analysis to detect potential vulnerabilities within the source code.

To better understand the motivation and the results of this work, consider the example below.

Example Let SL, CI, PR and S be the String Length, Character Inclusion, Prefix and Segmentation string abstract domains, respectively. Table 7.1 depicts possible elements of these abstract domains. Notice that all the abstractions in Table 7.1 approximate a set of string values that contains, for example, the literal "aaddd00xx88". Consider in particular the following abstract values:

(i)  aaddd∗                                                            (PR)
(ii) [a, c][1, 2] [d, f][1, 4] [0, 0][2, 2] [x, y][2, 3] [7, 10][2, 2]  (S)

Prefix (i) approximates all the strings starting with the sequence of characters "aaddd", while segmentation (ii) approximates all the strings starting with a character in the interval from 'a' to 'c' repeated at most two times, concatenated to a character in the interval from 'd' to 'f' repeated at most four times, and so on. Note that we quote the segment character representations to distinguish them from the segment bounds.

Table 7.1 Introductory example

String domain | Abstract elements
CI  | ({a,d,0,x,8}, {a,d,0,x,y,8,4})   ({a,d}, {a...z,0...9})   (∅, {a,d,x,y,z,0...9})   ...
PR  | a∗   aad∗   aaddd∗   ...
SL  | [11, 11]   [5, 16]   [0, 20]   ...
S   | 'a'[2, 2] 'd'[3, 3] '0'[2, 2] 'x'[2, 2] '8'[2, 2]   literal[0, 2] literal[3, 3] cypher[2, 2] literal[0, 4] cypher[2, 2]   [a, c][1, 2] [d, f][1, 4] [0, 0][2, 2] [x, y][2, 3] [7, 10][2, 2]

The reduced product between these two values leads to an improvement in the precision of the information tracked by each of the abstract elements. In particular, we can safely add to the end of the prefix aaddd∗ the sequence of ciphers 00, as the segmentation also approximates strings starting with the sequence of characters "aaddd" and where the integer 0 occurs at positions 5 and 6, obtaining aaddd00∗. In turn, the prefix abstracts strings starting with a sequence of two 'a' followed by a sequence of three 'd', so the first two segments of the segmentation value can be refined accordingly, yielding:

aaddd00∗,  [a, a][2, 2] [d, d][3, 3] [0, 0][2, 2] [x, y][2, 3] [7, 10][2, 2]  (PR ⊗ S)

Observe that in this way, both abstract components have been lifted to a more accurate representation in the resulting reduced product.

7.1.2 Paper Structure

The rest of the paper is structured as follows. Section 7.2 gives the basics of set theory and Abstract Interpretation. Section 7.3 discusses related work. Section 7.4 presents the concrete domain and semantics. Section 7.5 introduces the basic abstract domains for string analysis. Section 7.6 recalls the segmentation domain and specializes it to the approximation of strings. Section 7.7 defines our refined string abstract domains. Section 7.8 concludes.


7.2 Background

In this section, we briefly recall the most relevant mathematical background used in the rest of the paper.

7.2.1 Mathematical Notation

Given a set L, its cardinality is denoted by |L|. If L has an infinite number of elements, |L| = +∞. A partially ordered set (L, ⊑) is a set L equipped with a partial order relation ⊑, that is, a reflexive, antisymmetric, and transitive relation. A partially ordered set is a lattice (L, ⊑, ⊥, ⊤, ⊓, ⊔) if it contains a least element ⊥ and a greatest element ⊤, and for every pair of elements l, l′ ∈ L, their greatest lower bound l ⊓ l′ and their least upper bound l ⊔ l′ exist and belong to L. A lattice is complete when, for each subset of L, i.e., ∀X ⊆ L, the greatest lower bound ⊓X and the least upper bound ⊔X exist and belong to L. An operator is a function f : L → L whose domain and codomain are the same set. Given an operator f, a fixpoint of f is any value l such that f(l) = l. An upper closure operator on a partially ordered set (L, ⊑) is a function υ : L → L which is, for all l, l′ ∈ L, (i) monotone: l ⊑ l′ ⇒ υ(l) ⊑ υ(l′), (ii) idempotent: υ(υ(l)) = υ(l), and (iii) extensive: l ⊑ υ(l). The set of all upper closure operators on (L, ⊑) is denoted by uco(L). L satisfies the Ascending Chain Condition (ACC) if every strictly ascending sequence of elements eventually converges. We denote by ∃n the existential quantifier with cardinality n, which is read as "there exist exactly n objects".

7.2.2 Abstract Interpretation

Abstract Interpretation [14, 15] is a well-known mathematical theory that formalizes the semantics of a program P as a fixpoint over all its execution traces, and it allows defining and proving the soundness of different degrees of abstraction. Formally, the concrete and abstract semantics are defined on a concrete set D and an abstract set D̄, respectively. The concrete and the abstract domains are complete lattices (formally, (D, ⊑_D, ⊥_D, ⊤_D, ⊓_D, ⊔_D) and (D̄, ⊑_D̄, ⊥_D̄, ⊤_D̄, ⊓_D̄, ⊔_D̄)). The concrete and the abstract domains are related by a pair of monotone functions: the concretization function γ_D : D̄ → D and the abstraction function α_D : D → D̄. In order to obtain a sound analysis, α_D and γ_D have to form a Galois connection [14] (a condition that can be relaxed [15]). (α_D, γ_D) is a Galois connection if and only if, for every d ∈ D and d̄ ∈ D̄, we have that d ⊑_D γ_D(d̄) ⇔ α_D(d) ⊑_D̄ d̄. Note that each function uniquely identifies the other. Consequently, we can infer a Galois connection by proving that γ_D is a complete meet morphism (or, respectively, that α_D is a complete join morphism) (see Proposition 7 of [16]).


Abstract domains with infinite height that do not respect the ascending chain condition need to be equipped with a widening operator ∇ in order to make the analysis convergent [7]. Let D1 and D2 be two abstract domains that are sound approximations of the same concrete domain D, and let C = D1 × D2 be their Cartesian product. The abstract elements of C are pairs (d1, d2) such that d1 ∈ D1 and d2 ∈ D2. The partial order, upper and lower bound, and widening operators are defined as the component-wise application of the corresponding operators of the two domains. The Cartesian product is a lattice. The abstraction function on C maps a concrete element d ∈ D to the pair (α_D1(d), α_D2(d)), while the concretization function on C maps an abstract element (d1, d2) to γ_D1(d1) ⊓_D γ_D2(d2). Then, the Cartesian product forms a Galois connection with the concrete domain. An abstract domain functor is a function from parameter abstract domains D1, D2, ..., Dn to a new abstract domain D(D1, D2, ..., Dn). The abstract domain functor D(D1, D2, ..., Dn) composes abstract properties of the parameter abstract domains to build a new class of abstract properties and operations [18].

7.2.3 Reduced Product

The reduced product [15] combines abstract domains by mutually refining them. Informally speaking, it improves the precision of the information tracked by one domain by exploiting the information tracked by the other, and vice versa [10]. Let (d1, d2) ∈ C be an element of the Cartesian product (cf. Sect. 7.2.2) of D1 and D2. The reduced product is a Cartesian product equipped with a reduction operator. In particular, the reduced product searches for the smallest element (d1*, d2*) such that: γ_D1(d1*) ⊆ γ_D1(d1), γ_D2(d2*) ⊆ γ_D2(d2), and γ_D1(d1*) ∩ γ_D2(d2*) = γ_D1(d1) ∩ γ_D2(d2). Formally, the reduction operator ρ : C → C is defined as follows: let c denote the pair (d1, d2) and c* denote a pair (d1*, d2*); then ρ(c) = ⊓_C {c* ∈ C : γ_C(c) ⊆ γ_C(c*)}, where (d1, d2) ⊑_C (d1*, d2*) ⇔ d1 ⊑_D1 d1* and d2 ⊑_D2 d2*. In the following we will refer to the reduced product between two abstract elements d1 and d2 as d1 ⊗ d2.

7.2.4 Granger Product

The Granger product [25] is an approximation of the reduction operator. It is based on two refinement operators, iterated until a fixpoint is reached (in other words, until the smallest reduction is obtained). Each of these operators takes advantage of the information of one of the two domains involved in the product and improves the information of the other [10]. Let D1 and D2 be two abstract domains, d1 and d2 be abstract elements belonging to D1 and D2 respectively, and C be their Cartesian product. The Granger operators are defined as ρ1 : C → D1 and ρ2 : C → D2. In order to obtain a sound over-approximation of the reduction operator, ρ1 and ρ2 have to satisfy the following conditions: let c denote the pair (d1, d2) ∈ C; then ρ1(c) ⊑_D1 d1 ∧ γ_C((ρ1(c), d2)) = γ_C(c), and ρ2(c) ⊑_D2 d2 ∧ γ_C((d1, ρ2(c))) = γ_C(c).
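The iteration scheme behind the Granger product can be illustrated with a toy pair of numerical domains; the following sketch (our own illustration, using intervals and a parity domain rather than the string domains of this paper) applies the two refinement operators in turn until a fixpoint is reached:

    # Toy Granger-style iteration: Domain 1 is integer intervals (lo, hi),
    # Domain 2 is a parity abstraction in {'even', 'odd', 'top'}.

    def refine_interval(iv, parity):
        """rho_1: shrink the interval bounds using the parity information."""
        lo, hi = iv
        if parity == 'even':
            lo, hi = lo + lo % 2, hi - hi % 2
        elif parity == 'odd':
            lo, hi = lo + (1 - lo % 2), hi - (1 - hi % 2)
        return (lo, hi)

    def refine_parity(iv, parity):
        """rho_2: sharpen the parity when the interval is a singleton."""
        lo, hi = iv
        if parity == 'top' and lo == hi:
            return 'even' if lo % 2 == 0 else 'odd'
        return parity

    def granger(iv, parity):
        while True:
            iv2 = refine_interval(iv, parity)
            parity2 = refine_parity(iv2, parity)
            if (iv2, parity2) == (iv, parity):
                return iv, parity
            iv, parity = iv2, parity2

    print(granger((1, 2), 'even'))   # ((2, 2), 'even')
    print(granger((3, 3), 'top'))    # ((3, 3), 'odd')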

7.2.5 String Operators

We briefly recall the string operators defined in [13]. While mainstream programming languages support a more comprehensive set of operators, we will focus our discussion and approach on this minimal set, since it supports the most important computations over string values. In particular, let str be a sequence of characters: newString(str) is the operator that generates a new constant string. Let s1 and s2 be two strings: concat(s1, s2) concatenates s1 and s2. Let s be a string and let b and e be two integer values: substring_b^e(s) returns the substring of s from index b to index e. Finally, let s be a string and let c be a character: contains_c(s) returns true if and only if the character c appears in s.

7.3 Related Work

7.3.1 Enhancing Operators

Enhancing operators are those that refine the information tracked by abstract domains. They can be formalized as lower closure operators on the set of all Abstract Interpretations of a concrete domain D (i.e., uco(D), cf. Sect. 7.2.1) [10, 20, 22]. The best-known enhancement operators in Abstract Interpretation theory include: reduced product, disjunctive completion, reduced cardinal power [15], tensor product [37], open product, pattern completion [8, 9], functional dependencies [21], complete shell [23], and logical product [26]. The disjunctive completion [15] enhances an abstract domain by adding denotations for concrete disjunctions of its values [20]. In contrast, the reduced cardinal power [15] captures disjunctive information over abstract states [10], being suitable for relational analysis. A further Cartesian product refinement is the open product, presented by Cortesi et al. in [8, 9]; it works on open Abstract Interpretations, which include queries and open operations. In [23], Giacobazzi et al. presented a constructive way to obtain the so-called complete shell of an Abstract Interpretation, i.e., a domain transformer which adds the minimal number of abstract points to a certain abstract domain D to make it complete with respect to a certain operation of interest. The benefits of working with complete string abstract domains have been proven in [5]. Finally, the logical product [26] is more precise than the reduced product, and it combines lattices defined over convex, stably infinite, and disjoint theories.


7.3.2 Combinations of String Analyses

While the operators above have been extensively used to improve the precision of numerical analyses and implemented in various tools [17, 31], when it comes to string analysis their usages and instantiations are scarce. In particular, Yu et al. [47] analyzed strings in Web applications to verify properties relevant for security by combining (using the Cartesian product) the relation and alphabet abstractions; elements in the product lattice are the so-called abstraction classes, which are chosen heuristically. Amadini et al. [2] designed a framework that allows a flexible combination of several string abstract domains for JavaScript analysis. Such a combination was implemented as an extension of the SAFE tool [30]. Since this approach relied on the Cartesian product, the resulting analysis exposed more precise results than the single abstract domains; still, such information could have been reconstructed from the individual execution of the analyses. A further improvement of this approach [1] introduced a general and modular framework to combine several string abstract domains through the reduced product, by introducing the concept of reference domain. Let D be a concrete domain, and D1, ..., Dn be abstract domains. A reference domain soundly approximates D and captures any information expressible in each of the abstract domains D1, ..., Dn. The reference domain can be applied as a medium for systematically transferring information among all these abstract domains. The information is exchanged through a strengthening function, i.e., a closure operator on D1, ..., Dn. Depending on the considered domains, the strengthening function may be computationally expensive; therefore, the authors introduced a weak strengthening function that is less precise but more efficient.

7.3.3 String Analysis: Applications

String analysis has been applied in several areas, but mostly focused on Web applications. For instance, Minamide [34] proposed using a context-free grammar to approximate strings, in order to validate and guarantee the security of web pages dynamically generated by a server-side program. Wassermann and Su extended this technique to find Structured Query Language (SQL) command injection [42] and Cross Site Scripting (XSS) vulnerabilities [43]. In [46], Yu et al. applied string analysis, by means of automaton abstraction, to check the correctness of sanitization operations, and later to automatically repair faulty string manipulation code [45]. Detection and verification of sanitizers have been carried out also by Tateishi et al. [40], where constraints over program variables and string operations are represented in Monadic Second-order Logic. In [4], Arceri and Mastroeni defined a new automaton-based semantics for string analysis to handle dynamic-language string features such as dynamic typing and implicit type conversion. Samimi et al. [39] repaired HTML generation errors in PHP programs by solving a system of string constraints. Finally, string analysis by means of Abstract Interpretation has been used by Tripp et al. [41]


to detect JavaScript security vulnerabilities on the client side of web applications, and by Cortesi et al. [11] to verify the correctness of string operations in C programs.

7.4 Concrete Domain and Semantics

7.4.1 Concrete Domain

Formally, given a finite set of characters Σ, i.e., an alphabet, and the set Σ* of all possible finite sequences of characters, the concrete domain is the complete lattice ⟨℘(Σ*), ⊆, ∅, Σ*, ∩, ∪⟩ where: ℘(Σ*) denotes the powerset of Σ* (that is, the set of all string values), ⊆ denotes set inclusion (i.e., the partial order between elements of ℘(Σ*)), the empty set ∅ is the bottom element of the lattice ℘(Σ*), Σ* is the top element of the lattice ℘(Σ*), set intersection ∩ is the greatest lower bound operator in ℘(Σ*), and set union ∪ is the least upper bound operator in ℘(Σ*). Let σ ∈ Σ*: we denote by |σ| the length of the string σ, and by char(σ) the set of characters that occur in the string σ. The empty string is denoted by ε. A non-empty string σ may also be denoted by σ0...σn−1 in order to refer to its characters.

7.4.2 Concrete Semantics

We now define the concrete semantics S : Stm × (Σ* ∪ [℘(Σ*)]^k) → ℘(Σ*) and B : Stm × ℘(Σ*) → {true, false, ⊤_B}, where Stm denotes a generic string operator. In particular, the semantics S applies to newString, concat and substring; note that, since we deal with unary and binary operations, k may only be 1 or 2. The semantics B applies to contains. The concrete semantics (cf. Table 7.2) is defined on the string operators introduced in Sect. 7.2.5 as follows: (i) the semantics S, applied to newString(σ), returns the singleton {σ}; (ii) the semantics S, applied to concat(S1, S2), returns the set of all possible concatenations of a string in S1 with a string in S2, where S1, S2 ∈ ℘(Σ*); (iii) the semantics S, applied to substring_b^e(S), returns the set of all substrings from the b-th to the e-th character (σb and σe respectively) of the strings in S, where S ∈ ℘(Σ*); if a string is too short, the resulting set contains no element related to it; (iv) the semantics B, applied to contains_c(S), returns true if all the strings in S contain the character c, false if c is contained in none of the strings in S, and ⊤_B otherwise.

Table 7.2 Concrete semantics

S[[newString(σ)]]() = {σ}
S[[concat]](S1, S2) = {σ1 + σ2 | σ1 ∈ S1 ∧ σ2 ∈ S2}
S[[substring_b^e]](S) = {σb...σe | σ0...σn−1 ∈ S ∧ 1 ≤ b ≤ e ≤ n}, where S ∈ ℘(Σ*)
B[[contains_c]](S) = true if ∀σ ∈ S : c ∈ char(σ); false if ∀σ ∈ S : c ∉ char(σ); ⊤_B otherwise
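The concrete collecting semantics of Table 7.2 can be read directly as executable set operations; the following is a small illustrative sketch of ours over finite sets of strings (the "unknown" truth value ⊤_B is rendered as the string 'TOP'):

    # Executable reading of Table 7.2 over finite sets of strings.
    def new_string(s):
        return {s}

    def concat(S1, S2):
        return {s1 + s2 for s1 in S1 for s2 in S2}

    def substring(S, b, e):
        # Keep only the strings long enough to contain positions b..e.
        return {s[b:e + 1] for s in S if e < len(s)}

    def contains(S, c):
        if all(c in s for s in S):
            return True
        if all(c not in s for s in S):
            return False
        return 'TOP'   # the "unknown" truth value, written ⊤_B in the text

    print(concat({'aa', 'ab'}, {'c'}))       # {'aac', 'abc'}
    print(substring({'bunny', 'no'}, 0, 2))  # {'bun'}
    print(contains({'abc', 'bcd'}, 'a'))     # 'TOP'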

7.4.3 Example

1   ResultSet getPerishablePrices(String lowerBound) {
2     String query = "SELECT '$' || (RETAIL/100) FROM INVENTORY WHERE ";
3     if (lowerBound != null)
4       query += "WHOLESALE > " + lowerBound + " AND ";
5
6     query += "TYPE IN (" + getPerishableTypeCode() + ");";
7     return statement.executeQuery(query);
8   }
9   String getPerishableTypeCode() {
10    return "SELECT TYPECODE, TYPEDESC FROM TYPES WHERE NAME = 'fish'" +
11           " OR NAME = 'meat'";
12  }

Listing 7.1 Java code building up and executing an SQL query

Listing 7.1 reports the source code of the example that will be used to explain how the analysis with the basic string abstract domains works. This example is taken from Gould et al. [24]. Method getPerishablePrices aims at executing an SQL query that selects, given a string representing a lowerBound, all the items from an inventory that are either fish or meat and whose WHOLESALE is greater than the lowerBound, if such parameter is not null. This SQL query contains several errors:

1. '$' || (RETAIL/100) concatenates the character $ with the numeric expression RETAIL/100;
2. lowerBound is an arbitrary string that might contain non-numeric characters, and therefore the comparison between WHOLESALE and its value might cause a runtime error; and
3. the subquery returned by getPerishableTypeCode returns two columns instead of one, and this could cause an error.

For the sake of readability, we assign a shortcut to each string constant appearing in Listing 7.1 (cf. Table 7.3).

7.5 String Abstract Domains

In this section, we provide an overview of a suite of basic string abstract domains [1, 13]. In particular, we highlight and compare how these string abstract domains track content and shape information of strings.


Table 7.3 Shortcuts of string constants in Listing 7.1

σ̃1 = "SELECT '$' || (RETAIL/100) FROM INVENTORY WHERE "
σ̃2 = "WHOLESALE > "
σ̃3 = " AND "
σ̃4 = "TYPE IN ("
σ̃5 = "SELECT TYPECODE, TYPEDESC FROM TYPES WHERE NAME = 'fish' OR NAME = 'meat'"
σ̃6 = ");"

7.5.1 String Length

The first domain we consider is the String Length abstract domain SL, as presented in [1, 32]. This domain tracks, through a numerical interval [m, M] (with m ∈ ℕ and M ∈ ℕ ∪ {∞}), the minimum (i.e., m) and the maximum (i.e., M) length of the concrete strings it represents. The String Length abstract domain detects information about the shape of the concrete strings it represents, i.e., their lengths. In particular, it precisely approximates only the empty string, with the interval [0, 0]. In all the other cases, the abstraction totally loses the information about the content of the strings (e.g., the knowledge about character order, repetitions, etc.). Any abstract element different from the top element of the SL lattice, i.e., ⊤_SL = [0, ∞], leads to a finite set of concrete strings whose cardinality depends on the cardinality of the alphabet Σ and on the width of the abstract interval. SL can be implemented in a simple and efficient way, as operations on it run in constant time. SL is an infinite lattice, and it does not respect the ascending chain condition, which is why it has been equipped with a widening operator.

Example The result of the analysis of the program in Listing 7.1, using SL, is reported in Table 7.4. At pp.2, the variable query is associated to a state containing the abstraction of σ̃1, which is approximated by the length interval [|σ̃1|, |σ̃1|]. The input variable lowerBound appears at pp.3 and, as it is unknown, it is abstracted by ⊤_SL. At pp.4, the variable query is associated to a state containing the concatenation of the abstractions of σ̃1, σ̃2, lowerBound and σ̃3, i.e., [|σ̃1| + |σ̃2| + 0 + |σ̃3|, |σ̃1| + |σ̃2| + ∞ + |σ̃3|] = [|σ̃1| + |σ̃2| + |σ̃3|, ∞]. Then, at pp.5, the least upper bound (⊔_SL) between the abstract values of query after pp.2 and after pp.4 is computed, i.e., [|σ̃1|, |σ̃1|] ⊔_SL [|σ̃1| + |σ̃2| + |σ̃3|, ∞] = [min(|σ̃1|, |σ̃1| + |σ̃2| + |σ̃3|), max(|σ̃1|, ∞)] = [|σ̃1|, ∞]. Finally, at pp.6, query is associated to a state containing the concatenation of the abstractions of itself after pp.5 and of the strings σ̃4, σ̃5 and σ̃6. Thus, at the end, query will have a length between |σ̃1| + |σ̃4| + |σ̃5| + |σ̃6| and ∞.

Table 7.4 Program analysis with SL

Program point | Variable   | SL
pp.2          | query      | [|σ̃1|, |σ̃1|]
pp.3          | lowerBound | [0, ∞]
pp.4          | query      | [|σ̃1| + |σ̃2| + |σ̃3|, ∞]
pp.5          | query      | [|σ̃1|, ∞]
pp.6          | query      | [|σ̃1| + |σ̃4| + |σ̃5| + |σ̃6|, ∞]
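The interval computations of Table 7.4 are mechanical; the following small sketch of ours replays the abstract concatenations and the join, computing the constant lengths from the strings of Table 7.3:

    # String Length domain: abstract concatenation adds the bounds,
    # the least upper bound takes min of minima and max of maxima.
    INF = float('inf')

    def sl_concat(a, b):
        return (a[0] + b[0], a[1] + b[1])

    def sl_lub(a, b):
        return (min(a[0], b[0]), max(a[1], b[1]))

    s1 = len("SELECT '$' || (RETAIL/100) FROM INVENTORY WHERE ")
    s2 = len("WHOLESALE > ")
    s3 = len(" AND ")
    lower_bound = (0, INF)          # unknown input: top of the SL lattice
    q2 = (s1, s1)
    q4 = sl_concat(sl_concat((s1, s1), (s2, s2)),
                   sl_concat(lower_bound, (s3, s3)))
    q5 = sl_lub(q2, q4)
    print(q4, q5)                   # (65, inf) (48, inf)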

7.5.2 Character Inclusion

The Character Inclusion abstract domain CI, as defined in [13], uses a pair of sets (C, MC) to track the characters that are certainly contained (i.e., C) and those that might be contained (i.e., MC) in the concrete strings it represents. The Character Inclusion abstract domain detects information about the content of the concrete strings it represents. Like the String Length abstract domain, CI can precisely approximate the empty string only, with the abstract element (∅, ∅), and it makes it possible to infer the minimum length of the strings it abstracts, which corresponds to the cardinality of the set C. In all the other cases, the abstraction loses the information about the shape of the concrete strings. Indeed, similarly to SL, CI does not preserve the information about the order of appearance of the characters, character repetitions, and other relevant string properties. Any abstract element different from the bottom element (⊥_CI) and from (∅, ∅) represents an infinite set of concrete strings. CI has finite height; consequently, the termination of the analysis is guaranteed by its least upper bound (used as widening operator).

Example The result of the analysis of the program in Listing 7.1, using CI, is reported in Table 7.5. Note that α_CI(σ̃ᵢ) = (Cᵢ, MCᵢ). At pp.2, the variable query is associated with a state containing the abstraction of σ̃1, which is approximated by a pair of sets of characters, i.e., (C1, MC1) with C1 = MC1. The input variable lowerBound is abstracted by the top element of the CI lattice, i.e., ⊤_CI = (∅, Σ). At pp.4, the variable query is associated with a state containing the concatenation of the abstractions of σ̃1, σ̃2, lowerBound and σ̃3, i.e., (C1 ∪ C2 ∪ ∅ ∪ C3, MC1 ∪ MC2 ∪ Σ ∪ MC3) = (C1 ∪ C2 ∪ C3, Σ). Then, at pp.5, the least upper bound (⊔_CI) between the abstract values of query after pp.2 and after pp.4 is computed, i.e., (C1, MC1) ⊔_CI (C1 ∪ C2 ∪ C3, Σ) = (C1 ∩ (C1 ∪ C2 ∪ C3), MC1 ∪ Σ) = (C1, Σ). Finally, at pp.6, query is associated with a state containing the concatenation of the abstractions of itself after pp.5 and of the strings σ̃4, σ̃5 and σ̃6. Thus, at the end, query will surely contain the characters in σ̃1, σ̃4, σ̃5 and σ̃6, and it will possibly contain any character.

Table 7.5 Program analysis with CI

Program point | Variable   | CI
pp.2          | query      | (C1, MC1)
pp.3          | lowerBound | (∅, Σ)
pp.4          | query      | (C1 ∪ C2 ∪ C3, Σ)
pp.5          | query      | (C1, Σ)
pp.6          | query      | (C1 ∪ C4 ∪ C5 ∪ C6, Σ)
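A minimal executable sketch of the CI operations used above (ours; the top element is omitted for brevity, so all values are finite pairs of sets):

    # Character Inclusion domain: values are pairs (C, MC) of surely- and
    # maybe-contained characters; concatenation unions both components,
    # the join intersects the sure part and unions the maybe part.
    def ci_abstract(s):
        return (set(s), set(s))

    def ci_concat(a, b):
        return (a[0] | b[0], a[1] | b[1])

    def ci_lub(a, b):
        return (a[0] & b[0], a[1] | b[1])

    a = ci_abstract("ab")
    b = ci_concat(a, ci_abstract("bc"))
    print(ci_lub(a, b))   # ({'a', 'b'}, {'a', 'b', 'c'}), cf. Sect. 7.1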

7.5.3 Prefix and Suffix

The Prefix abstract domain PR, presented in [13], approximates a set of concrete strings through a sequence of characters whose last element is ∗, which denotes any possible suffix string (the empty string ε included). Similarly, the Suffix abstract domain SU for string values mirrors the Prefix domain, and its notation and all its operators are dual to those of PR: the Suffix domain abstracts strings through their suffix preceded by ∗, which denotes any possible prefix string, ε included. Moreover, Amadini et al. [1] discussed the Prefix-Suffix abstract domain PS, which approximates string values by their prefix and suffix simultaneously. Again, the notation and the operators of this domain can easily be induced from those of the Prefix and the Suffix domains: the Prefix-Suffix domain abstracts strings through a pair of strings (p, s), which concretizes to the set of all possible strings having p as prefix and s as suffix (note that here ∗ is not included in the notation). The domains discussed above partially detect both the content and the shape of the concrete strings they represent. Indeed, PR, SU, and PS can track part of the string structure, such as the initial part, the ending one, or both of them, the minimum string length, the characters surely contained, etc. Any abstract element different from the bottom element (⊥_PR, ⊥_SU and ⊥_PS respectively) represents an infinite set of strings. Even though operations on these domains can be computed in linear time, the domains have infinite height: e.g., given any prefix, we can always add a character at its end, obtaining a new prefix. However, the domains respect the ACC, and the termination of the analysis is thus ensured.


Table 7.6 Program analysis with PR

Program point | Variable   | PR
pp.2          | query      | σ̃1∗
pp.3          | lowerBound | ∗
pp.4          | query      | σ̃1∗
pp.5          | query      | σ̃1∗
pp.6          | query      | σ̃1∗

Example The result of the analysis of the program in Listing 7.1, using PR, is reported in Table 7.6. At pp.2, the variable query is associated with a state containing the abstraction of σ̃1, which is approximated by the prefix σ̃1∗. The input variable lowerBound is abstracted by the top element of the PR lattice, i.e., ⊤_PR = ∗. At pp.4, the variable query is associated with a state containing the concatenation of the abstractions of σ̃1, σ̃2, lowerBound and σ̃3, i.e., σ̃1∗ +_PR σ̃2∗ +_PR ∗ +_PR σ̃3∗ = σ̃1∗. Then, at pp.5, the least upper bound (⊔_PR) between the abstract values of query after pp.2 and after pp.4 is computed, i.e., σ̃1∗ ⊔_PR σ̃1∗ = σ̃1∗. Finally, at pp.6, query is associated with a state containing the concatenation of the abstractions of itself after pp.5 and of the strings σ̃4, σ̃5 and σ̃6. Thus, at the end, query will certainly begin with σ̃1, followed by any possible suffix string ∗.
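The two PR operations used in the example admit a very compact sketch (ours; ⊤_PR is rendered as the empty prefix ""):

    # Prefix domain: a value is the known prefix (the trailing * is implicit).
    from os.path import commonprefix

    def pr_concat(p, q):
        # The left operand stands for strings with arbitrary suffixes, so the
        # right operand cannot be soundly appended: keep only p.
        return p

    def pr_lub(p, q):
        return commonprefix([p, q])   # longest common prefix

    s1 = "SELECT '$' || (RETAIL/100) FROM INVENTORY WHERE "
    q4 = pr_concat(pr_concat(pr_concat(s1, "WHOLESALE > "), ""), " AND ")
    print(pr_lub(s1, q4) == s1)       # True: query still begins with s1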

7.6 Segmentation Abstract Domain Cousot et al. [18] presented FunArray, an array segmentation abstract domain functor that abstracts the content of array by the so-called segmentation abstract predicates, i.e., consecutive, non-overlapping segments covering all array elements.2 The domain is parametric to the abstraction of the array elements values and it is suitable also for relational abstractions. Since the order of characters in strings is fundamental to track precise information on these values, we instantiate the FunArray abstract domain for string analysis. In the following, we will slightly modify the original notation defined in [18] to highlight the fact that we are instantiating FunArray over strings.

² We refer to this form of a segmentation as its normal form.


7.6.1 Strings Concrete Representation

Let ς ∈ Rs = S → Q be a concrete string environment mapping string variables s ∈ S to their instrumented values ς(s) ∈ Q = Rv × E × E × (Z → (Z × Σ)). Thus, a string variable s is represented by a quadruple ς(s) = (ρ, ℓ, h, idx) ∈ Q, such that:
• ρ ∈ Rv = X → V is a concrete scalar variable environment mapping variables x ∈ X to their values [[x]]ρ ∈ V.
• E is the expressions domain, built from constants and scalar variables through mathematical unary and binary operators, and ℓ, h ∈ E. The values of ℓ and h ([[ℓ]]ρ and [[h]]ρ) denote, respectively, the lower and the upper limit of the string variable s.
• idx is a function mapping an index i ∈ [[[ℓ]]ρ, [[h]]ρ) to the pair ⟨i, c⟩ of the index i and the corresponding string character c ∈ Σ.

Example Let s be a string variable initialized to the value "bunny". The concrete value of s is given by the tuple ς(s) = (ρ, 0, 5, idx), where the lower and the upper limit of s are inferred from the context, and the function idx maps an index i ∈ [0, 5) to the pair (index, indexed character value). Thus, the range of idx is the set {(0, 'b'), (1, 'u'), (2, 'n'), (3, 'n'), (4, 'y')}.
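As a sanity check of this representation, the instrumented value of the example can be written down directly; the following is a sketch using hypothetical Python names of our own choosing:

# Instrumented value of s = "bunny": the quadruple (rho, lo, hi, idx).
rho = {}                                   # scalar environment (unused here)
lo, hi = 0, 5                              # lower and upper limits of s
idx = {i: (i, "bunny"[i]) for i in range(lo, hi)}
# The range of idx is {(0,'b'), (1,'u'), (2,'n'), (3,'n'), (4,'y')}.
assert set(idx.values()) == {(0, 'b'), (1, 'u'), (2, 'n'), (3, 'n'), (4, 'y')}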

7.6.2 Abstract Domain

The Segmentation abstract domain functor S for strings is a function of the parameter abstract domains B, C and R, where B and R are, in turn, abstract domain functors. The segment bound abstract domain functor B is a function of the expression abstract domain E(X), which in turn depends on the variable abstract domain X, leading to the instantiated segment bound abstract domain B(E(X)). C is the string element abstract domain. Finally, R is the variable environment abstract domain functor, which also depends on X, leading to the variable environment abstract domain R(X). Precisely:
• The variable abstract domain X encodes program variables, thus X = X ∪ {v0}, with v0 a special variable whose value is always 0.
• The variable environment abstract domain functor R depends on X, leading to the variable environment abstract domain R(X). Elements in R (shorthand for R(X)) are abstract variable environments ρ♯ ∈ R = X → V♯, where the value abstract domain V♯ approximates properties of values in V. R approximates sets of concrete variable environments. Formally, the concretization is γR : R → ℘(Rv), where Rv = X → V (cf. Sect. 7.6.1).


• The expression abstract domain functor E depends on X, leading to the expression abstract domain E(X). Elements in E (shorthand for E(X)) are symbolic expressions e ∈ E(X), restricted to a canonical normal form, which depend on variables in X (notice that ⊥E, ⊤E ∈ E). The choice of the expression canonical form is left free. In the following examples we use the normal form x + k, where k ∈ Z and x ∈ X, and we omit v0. E approximates program expressions. Formally, the concretization is γE : E → R → ℘(V). Moreover, E is equipped with sum (⊕E), subtraction (⊖E) and comparison (≤E) operations. These operations are performed using the abstract information ρ♯ ∈ R. For instance, let R(X) be a constant propagation analysis for integers, and consider the expressions e1 = 5 and e2 = x + 3. Their comparison e1 ≤E e2 is equal to: (i) true if ρ♯(x) ≥ 2, (ii) false if ρ♯(x) < 2, (iii) ⊤B if ρ♯(x) = ⊤R(X), and (iv) undefined if ρ♯(x) = ⊥R(X). Two expressions are said to be comparable if and only if their comparison returns a truth value.
• The string element abstract domain C approximates sets of pairs (index, indexed string element), where c ∈ C. Formally, the abstraction is αC : ℘(Z × Σ) → ℘(Σ) → C, i.e., elements in ℘(Z × Σ) may first be abstracted to ℘(Σ) so as to perform a non-relational analysis. The concretization is γC : C → ℘(Z × Σ). The choice of C is left free.
• The segment bound abstract domain functor B is a function of the expression abstract domain E, leading to the instantiated segment bound abstract domain B(E). Elements in B (shorthand for B(E)) are symbolic intervals b ∈ B of the form b = [e, e′], where e, e′ ∈ E \ {⊥E, ⊤E} and e ≤E e′. Formally, the concretization is γB : B → ℘(V).

Elements of S belong to the set {(C × B)^k | k ≥ 1} ∪ {⊥S, ⊤S}, where ⊥S and ⊤S are special elements denoting the bottom and the top element of S. In particular, elements of S have the form s = c1b1 … cnbn, where a segment cibi (with bi = [ei, e′i]) abstracts a sequence of characters sharing the same property, whose length ranges from ei to e′i.

Example Consider the string variable of Example 7.6.1. Its value in the Segmentation abstract domain S is 'b'[1, 1] 'u'[1, 1] 'n'[2, 2] 'y'[1, 1], where C is the constant propagation domain for characters.
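The segmentation of a constant string groups maximal runs of equal characters, with coinciding lower and upper bounds; the following sketch, with C instantiated to the constant-propagation domain for characters and an encoding of our own, reproduces the example above (it also matches the newString semantics formalized later in Table 7.7):

def new_string(sigma):
    """Segmentation of a constant string: one segment per maximal run
    of equal characters, with equal lower and upper bounds."""
    segs = []
    for ch in sigma:
        if segs and segs[-1][0] == ch:
            segs[-1][1] += 1
            segs[-1][2] += 1
        else:
            segs.append([ch, 1, 1])
    return [tuple(seg) for seg in segs]

assert new_string("bunny") == [('b', 1, 1), ('u', 1, 1), ('n', 2, 2), ('y', 1, 1)]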

Before defining the join and the meet operators of the Segmentation domain, we present the following helper procedures: align (cf. Algorithm 7.1) and fold (cf. Algorithm 7.2). Algorithm 7.1 aligns two segmentations, i.e., makes them coincide with respect to the number of segments, provided they are both different from the bottom element of S and comparable. Comparability is actually restricted to the segment bound upper limits of each pair of corresponding segments under consideration during the alignment


procedure. Let s1, s2 ∈ S be different from ⊥S (line 4); the alignment procedure starts analysing the two segmentations from their leftmost segments, i.e., s1[1] and s2[1], and continues along their numbers of segments (line 5), i.e., numSeg(s1) and numSeg(s2), which may change during the procedure, as long as all the corresponding segment bound upper limits under consideration are comparable. Note that, given the i-th segment of an abstract element s, i.e., s[i], we refer to its segment bound upper limit as e′.b.s[i]. Similarly, we denote by c.s[i] the abstract character contained in the i-th segment of s.

Algorithm 7.1 align procedure.
1: function align(s1, s2)
2:   i ← 1
3:   k1 ← 0; k2 ← 0
4:   if s1 ≠ ⊥S ∧ s2 ≠ ⊥S then
5:     while i ≤ numSeg(s1) + k1 ∧ i ≤ numSeg(s2) + k2 do
6:       if e′.b.s1[i] =E e′.b.s2[i] then
7:         i++
8:       else if e′.b.s2[i] ≤E e′.b.s1[i] then
9:         e′.b.s1[i] ← e′.b.s2[i]
10:        s1[i].append(c.s1[i][0, e′.b.s1[i] ⊖E e′.b.s2[i]])
11:        k1++, i++
12:      else if e′.b.s1[i] ≤E e′.b.s2[i] then
13:        e′.b.s2[i] ← e′.b.s1[i]
14:        s2[i].append(c.s2[i][0, e′.b.s2[i] ⊖E e′.b.s1[i]])
15:        k2++, i++
16:      else
17:        return error (s1 and s2 cannot be aligned)
18:  else
19:    return error (s1 and s2 cannot be aligned)
20: return ⟨s1, s2⟩

If this is the case and the segment bound upper limits are equal, then the procedure moves to the next segments; otherwise, if one of the two segment bound upper limits is greater than or equal to the other, the former is updated to the latter segment bound upper limit (line 9 or line 13), and the "remaining part" is appended right after the modified segment as a new, possibly empty segment (line 10 or line 14). If one or both of the input segmentations are equal to ⊥S, or if during the alignment procedure two corresponding segment bound upper limits are not comparable, the algorithm stops and returns an error message.


Example Consider the segmentations s1 = 'a'[0, 4] 'b'[1, 1] 'c'[0, x] and s2 = 'a'[2, 5 + x], where the segment predicate abstract domain C is the constant propagation domain for characters, R(X) is the constant propagation domain for integers, and x is an integer variable whose value is greater than or equal to 0. Algorithm 7.1 on s1 and s2 proceeds as follows: starting from i equal to 1, we enter the loop. The first segment bound upper limit of s1 is strictly smaller than the first segment bound upper limit of s2 (for any value of x), i.e., e′.b.s1[1] ≤E e′.b.s2[1]; hence the upper limit of the first segment of s2 is lowered to e′.b.s1[1] and the remaining part 'a'[0, (5 + x) ⊖E 4] is appended right after it (lines 13–15). The procedure then continues on the next segments, until the two segmentations consist of the same number of segments.

• ⊓S represents the meet operator between two string segmentations. The meet between two segmentations s1 and s2 is computed on their alignment if align(s1, s2) does not raise an error; otherwise it is ⊥S. Formally, let s′1 and s′2 be the results of align(s1, s2) (cf. Algorithm 7.1); then:

s1 ⊓S s2 =
  fold(s′1 ⊓S s′2)   if align(s1, s2) = ⟨s′1, s′2⟩ and, whenever numSeg(s′j) > numSeg(s′k) with {j, k} = {1, 2}, ∀i ∈ [numSeg(s′k) + 1, numSeg(s′j)] : e.b.s′j[i] = 0E
  ⊥S                 otherwise

The meet between two aligned segmentations is performed so that the string character abstract domain meet (⊓C) and the bound abstract domain meet (⊓B) are applied segment-wise. Note that if one of the segmentations involved has more segments than the other and the exceeding ones are possibly empty, then they are not preserved by ⊓S. Finally, we compute the fold of the meet between s′1 and s′2.

• ⊔S represents the join operator between two string segmentations. The join between two segmentations s1 and s2 is computed on their alignment if align(s1, s2) does not raise an error; if only one of the two segmentations is equal to ⊥S then their join returns the one which is different from the bottom element; if both s1 and s2 are the bottom element then their join returns ⊥S; otherwise ⊤S is returned. Formally, let s′1 and s′2 be the results of align(s1, s2) (cf. Algorithm 7.1); then:

s1 ⊔S s2 =
  fold(s′1 ⊔S s′2)   if align(s1, s2) = ⟨s′1, s′2⟩
  s1                 if s2 = ⊥S ∧ s1 ≠ ⊥S
  s2                 if s1 = ⊥S ∧ s2 ≠ ⊥S
  ⊥S                 if s1 = ⊥S ∧ s2 = ⊥S
  ⊤S                 otherwise

The join between two aligned segmentations is performed so that the string character abstract domain join (⊔C) and the bound abstract domain join (⊔B) are applied segment-wise. Note that if one of the segmentations involved has more segments than the other, then the exceeding ones are preserved by ⊔S, but their segment bound lower limit is set to 0.

• Let s1 and s2 be two abstract values in the Segmentation domain. The partial order on S is defined as follows: ∀s ∈ S : ⊥S ⊑S s ∧ s ⊑S ⊤S; otherwise, if both s1 and s2 are different from ⊥S and ⊤S, then s1 ⊑S s2 ⇔ s1 ⊔S s2 = s2.
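As a concrete illustration of the segment-wise step shared by ⊓S and ⊔S, the following Python sketch computes the join of two already aligned segmentations, under simplifying assumptions of our own: constant integer bounds and the constant-propagation character domain.

TOP_C = "⊤"  # top of the constant-propagation character domain

def join_char(c1, c2):
    return c1 if c1 == c2 else TOP_C

def join_aligned(s1, s2):
    """Segment-wise join: characters joined in C, bounds joined in B
    (interval hull). Assumes align(s1, s2) has already been applied;
    exceeding segments of a longer operand would be kept with their
    lower bound set to 0, as prescribed above."""
    assert len(s1) == len(s2)
    return [(join_char(c1, c2), min(l1, l2), max(h1, h2))
            for (c1, l1, h1), (c2, l2, h2) in zip(s1, s2)]

s1 = [('a', 1, 1), ('b', 2, 2)]
s2 = [('a', 1, 1), ('c', 1, 2)]
assert join_aligned(s1, s2) == [('a', 1, 1), (TOP_C, 1, 2)]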

Concretization
The concretization function of the segmentation abstract domain γS : S → R → ℘(Q) maps an abstract element to a set of strings as follows: γS(⊥S) = ∅ and γS(⊤S) = Q, while any other abstract element is mapped to the set of all possible (instrumented) sequences of characters derivable from the segmentation abstract predicate. The formalization below follows the one defined in Sect. 11.3 of [18]. Let γR : R → ℘(Rv) be the concretization function of the variable abstract domain, and let γC : C → ℘(Z × Σ) be the concretization function of the string element abstract domain (cf. Sect. 7.6.2). Then γ̇S denotes the concretization of a generic segment cb with b = [e, e′], formally:

γ̇S(cb)ρ♯ = {(ρ, ℓ, h, idx) | ρ ∈ γR(ρ♯) ∧ [[ℓ]]ρ = 0 ∧ ∃e′′ ∈ [[[e]]ρ, [[e′]]ρ] : [[h]]ρ = e′′ ∧ ∀i ∈ [0, e′′ − 1) : idx(i) ∈ γC(c)}

where ρ♯ ∈ R. Then, the concretization function of a string segmentation is as follows:

γS(c1b1 … cnbn)ρ♯ = {(ρ, ℓ, h, idx) ∈ +Q_{i=1..n} γ̇S(cibi)ρ♯ | [[ℓ]]ρ = 0 ∧ [[h]]ρ = Σ_{i=1..n} e′′i}

and γS(⊥S)ρ♯ = ∅, where +Q denotes the concatenation of string concrete values, such that ς(σ1) +Q ς(σ2) = ς(σ1 + σ2) with σ1, σ2 ∈ Σ∗. Note that a segmentation abstract predicate is valid if every segment whose abstract element is the bottom element of the abstract domain C is possibly empty; otherwise, the string segmentation is invalid. The concretization function maps an invalid segmentation to the empty set.

Theorem 1 Let X ⊆ S be such that all elements in X can be aligned and their meet does not result in an invalid segmentation. Then, it holds that:

γS(⊓S_{s∈X} s)ρ♯ = ⋂_{s∈X} γS(s)ρ♯

Proof The following inference chain holds:

γS(⊓S_{s∈X} s)ρ♯
= γS(s̄)ρ♯   [by definition of ⊓S, where s̄ = c̄1b̄1 … c̄nb̄n denotes the result of the meet of the segmentations s in X]
= {(ρ, ℓ, h, idx) ∈ +Q_{i=1..n} γ̇S(c̄ib̄i)ρ♯ | [[ℓ]]ρ = 0 ∧ [[h]]ρ = Σ_{i=1..n} e′′i}   [by definition of γS]
= ⋂_{s∈X} {(ρ, ℓ, h, idx) ∈ +Q_{i=1..n} γ̇S(cibi)ρ♯ | [[ℓ]]ρ = 0 ∧ [[h]]ρ = Σ_{i=1..n} e′′i}
= ⋂_{s∈X} γS(s)ρ♯   [by definition of γS]

Observe that if the hypotheses of Theorem 1 are not satisfied, i.e., if either the abstract elements in X cannot be aligned or their meet leads to an invalid segmentation, then γS(⊓S_{s∈X} s)ρ♯ = γS(⊥S)ρ♯ = ∅, and ⋂_{s∈X} γS(s)ρ♯ = ∅.

Abstraction
Let X ∈ ℘(Q) be a set of concrete string values. The abstraction function αS of the segmentation abstract domain maps X to ⊥S when X is the empty set, and otherwise to the segmentation that over-approximates the values in X.


Table 7.7 Segmentation abstract semantics

S♯S[[newString(σ)]]() = s

S♯S[[concat]](s1, s2) = s1 +S s2

S♯S[[substring_b^e]](s) =
  ck[max(0, ek ⊖E b), e′k] ck+1[max(0, ek+1 ⊖E b), e′k+1] … cj[max(0, ej ⊖E b), e′j]
      if ∃⟨ck[ek, e′k], [i, i′]k⟩, ⟨cj[ej, e′j], [i, i′]j⟩ ∈ index(s) :
         k = min{i : ⟨cibi, [i, i′]i⟩ ∈ index(s) ∧ b ∈ [i, i′]i} ∧
         j = max{i : ⟨cibi, [i, i′]i⟩ ∈ index(s) ∧ e ∈ [i, i′]i}
  ⊤S  otherwise

B♯S[[contains_c]](s) =
  true   if ∃ci[ei, e′i] ∈ s : charC(ci) = {c} ∧ 0 ∉ γE(ei)
  false  if ∄ci[ei, e′i] ∈ s : c ∈ charC(ci)
  ⊤B     otherwise

where charC(c) = {c′ : ⟨i, c′⟩ ∈ γC(c)}

To summarize, S abstracts strings by a sequence of segments in an abstract domain C, each with a segment bound in an abstract domain B. The segments are computed according to how the string content is manipulated. The value of a string segmentation abstract predicate depends on three parameters: the abstract domain representing symbolic segment bound expressions, the domain abstracting the pairs (index, character), and the abstract domain assigning values to segment bound expressions [18]. Thus, S can be instantiated with different abstract domains, achieving different levels of precision and cost of the analysis. Note that the segmentation abstract domain has infinite height, as a string may have infinitely many symbolic segments, and a segment might take successive, strictly increasing abstract values. Therefore, to guarantee the convergence of the analysis, we define a widening similar to the one presented by Cousot et al. in [18]. Informally, if the number of segments exceeds a certain threshold during the alignment procedure between two segmentations, the widening turns those segmentations into ⊤S. More precise widening operators may be defined, possibly together with a narrowing operator, to improve the accuracy of the analysis.

7.6.3 Abstract Semantics

In Sect. 7.2.5 we recalled the syntax of some operations of interest (i.e., newString, concat, substring, and contains) presented in [13]. Moreover, in Sect. 7.4.2 we reported their concrete semantics. In Table 7.7, we formally define their


approximation in the string segmentation abstract domain S. The semantics operators S♯S and B♯S are the abstract counterparts of S and B, respectively (cf. Sect. 7.4.2), such that:
• The semantics S♯S, when applied to newString(σ), returns the segmentation s of the string constant σ (implicitly converted to its instrumented value ς(σ)). Thus, s abstracts, in an abstract domain C, the sequences of equal characters in σ. Note that each segment bound of s has equal lower and upper limits in an abstract domain E.
• The semantics S♯S, applied to concat(s1, s2), returns the concatenation of the two input segmentations. Note that if the last segment of s1 and the first segment of s2 share the same abstract character, then these segments are unified. More precisely, given c1,1b1,1 … c1,nb1,n and c2,1b2,1 … c2,mb2,m, if c1,n =C c2,1 then their concatenation is equal to c1,1b1,1 … c[e1,n ⊕E e2,1, e′1,n ⊕E e′2,1] c2,2b2,2 … c2,mb2,m, where c represents the character contained in c1,n and c2,1.
• The semantics S♯S, applied to substring_b^e(s), selects the subsegmentation of s from the segment whose associated interval of indexes contains b to the segment whose associated interval of indexes contains e, if they exist. Then, the lower bound of each subsegment is decreased by b when the resulting quantity is positive; otherwise, it is set to 0. In the other cases, the semantics returns ⊤S. Note that index(s) is the function which associates each segment of s with the interval of the index abstract values that may refer to it, i.e., index(s) = {⟨cibi, [i, i′]i⟩ | i ∈ [1, numSeg(s)]}. For instance, consider the segmentation s = 'a'[0, 2] 'b'[4, 4] 'c'[0, 2]: index(s) = {⟨'a'[0, 2], [0, 1]⟩, ⟨'b'[4, 4], [0, 5]⟩, ⟨'c'[0, 2], [4, 7]⟩}, and the subsegmentation of s from index 2 to index 5 is 'b'[4, 4] 'c'[0, 2].
• The semantics B♯S, applied to contains_c(s), returns: (i) true if there exists a segment abstract predicate in s which approximates only the character c and whose segment bound lower limit is different from zero, (ii) false if no segment abstract predicate in s may approximate the character c, and otherwise (iii) ⊤B.
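The three-valued semantics of contains_c can be rendered in a few lines of Python; this is a sketch under our own simplifications (constant integer bounds, and char_set as a hypothetical stand-in for charC):

def char_set(ci):
    """Hypothetical charC: the characters a segment predicate may denote."""
    return ci if isinstance(ci, set) else {ci}

def contains(s, c):
    """B#S[[contains_c]](s): returns True, False, or '⊤B' (maybe)."""
    must = any(char_set(ci) == {c} and lo >= 1 for ci, lo, hi in s)
    may = any(c in char_set(ci) for ci, lo, hi in s)
    if must:
        return True
    if not may:
        return False
    return "⊤B"

s = [('a', 0, 2), ('b', 4, 4), ('c', 0, 2)]
assert contains(s, 'b') is True      # a surely non-empty 'b' segment exists
assert contains(s, 'z') is False     # no segment may contain 'z'
assert contains(s, 'c') == "⊤B"      # the 'c' segment may be empty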

7.7 Refined String Abstract Domains

To obtain our refined string abstract domains, we exploit the notion of Granger product [25]. The Granger product is a binary operator between abstract domains based on two refinement operators (cf. Sect. 7.2.4). Let D1 and D2 be two abstract domains. The Granger operators iteratively refine the information of D1 using the information of D2, and vice versa, until the smallest reduction is obtained. We will show that the reductions we provide in the next sections can be achieved without iterating the refinement operators. Formally, our Granger operators are as follows:


• refined2 : D1 → D1
• refined1 : D2 → D2

Given two abstract elements d1 and d2, belonging to D1 and D2 respectively, a precondition to compute the reduced product between d1 and d2 (i.e., d1 ⊗ d2) is that they are not inconsistent (Inc), where inconsistency means γ1(d1) ∩ γ2(d2) = ∅. In case of inconsistency, the reduced product yields the pair of bottom elements of the respective abstract domains. Formally:

d1 ⊗ d2 =
  (⊥D1, ⊥D2)                      if Inc(d1, d2)
  (refined2(d1), refined1(d2))    otherwise
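Schematically, the reduced product above is a one-pass combinator parameterized by the two refinement operators and the inconsistency test. The following sketch makes this explicit; every argument is supplied by the specific pair of domains:

def reduced_product(d1, d2, refine_d2, refine_d1, inconsistent, bot1, bot2):
    """d1 ⊗ d2: the pair of bottom elements on inconsistency, otherwise
    one pass of mutual refinement (no iteration is needed for the
    reductions presented in the following sections)."""
    if inconsistent(d1, d2):
        return bot1, bot2
    return refine_d2(d1), refine_d1(d2)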

7.7.1 Meaning of Refinement

Before presenting our refined string abstract domains, we intuitively sketch what refinement means for each domain involved in the reductions. Improving the precision of an abstract element in (i) the String Length domain means decreasing its range; (ii) the Character Inclusion domain means increasing the cardinality of the set of certainly contained characters and/or decreasing the cardinality of the set of possibly contained characters; (iii) the Prefix domain means increasing the length of its sequence of characters, concatenating one or more characters to its end; and (iv) the Segmentation domain means increasing the precision of its segment abstract predicates and/or decreasing the range of its segment bounds.

7.7.2 Combining Segmentation and String Length Domains

The first combination involves the abstract elements s and n, which belong to the Segmentation and the String Length abstract domains, respectively, where n denotes the interval [m, M] introduced in Sect. 7.5.1. As mentioned above, we define the reduced product between s and n in terms of two refinement operators, improving the approximation reached by one domain through the other and vice versa. Thus, below, we introduce the notion of inconsistency between two abstract elements in the Segmentation and String Length domains, and the definitions of both refinement operators involved in the reduction of these abstract elements. We follow the same outline for the reductions presented later in this section. Let s be a string segmentation different from the bottom element of its lattice, i.e., s ∈ S \ {⊥S}. In the following, we denote by minlen(s) the minimal length of a segmentation abstract predicate s, i.e., the sum of all its segment lower bounds, and by maxlen(s) its maximal length, i.e., the sum of all its segment upper bounds.

Definition 1 (Inconsistency between Segmentation and String Length) Let s and n be two abstract elements in the Segmentation and in the String Length domains,


respectively. Moreover, let γS and γSL be the concretization functions of the domains just mentioned. There is inconsistency between s and n when: γS(s) ∩ γSL(n) = ∅.

Lemma 1 (Inconsistency between Segmentation and String Length) Let s and n be two abstract elements in the Segmentation and in the String Length domains, respectively. If one of the following conditions holds, then s and n are inconsistent.
1. If both s and n are different from the bottom elements of their lattices and the intersection between the interval from the minimal to the maximal length of s and n is empty, then s and n are inconsistent. Formally:
[minlen(s), maxlen(s)] ∩ [m, M] = ∅ ⇒ Inc(s, n)
2. If one of the two abstract elements is the bottom element of its lattice, then s and n are inconsistent. Formally:
s = ⊥S or n = ⊥SL ⇒ Inc(s, n)

Definition 2 (Segmentation refinement through String Length) Let s and n be two abstract elements in the Segmentation and in the String Length domains, respectively. We define refinen(s) as the refinement operator describing how the information tracked by n improves the one from s. Formally, refinen(s) = s∗ where:
a. if s contains only one segment and its segment bound lower limit is smaller than the one of n, then the refinement means substituting e1 with m. Formally:

s = c1[e1, e′1] ∧ e1 <E m ⇒ s∗ = c1[m, e′1]

a. if a segment of s approximates exactly one character c that belongs to the set of possibly contained characters of r but not to its set of certainly contained characters, and its segment bound lower limit is strictly greater than 0E, then c is added to the set of certainly contained characters of r. Formally:
∃cibi ∈ s : charC(ci) = {c} ∧ c ∈ MC \ C ∧ ei >E 0E ⇒ r∗ = r[C/C ∪ {c}]
b. if the set of characters represented by s is a strict subset of MC, i.e., of the set of possibly contained characters of r, then the set of possibly contained characters of r can be refined considering just the set of characters represented by s. Formally:
charS(s) ⊊ MC ⇒ r∗ = r[MC/charS(s)], where charS(s) = ⋃_{c∈s} charC(c)
If none of the conditions above apply, then r∗ = r.

If none of the conditions above apply, then r ∗ = r. Example Consider s =  a [2, 2]  b [3, 5]  c [4, 4]  d [0, 2] and r = ({a, b}, {a, b, c, d, e, f}). The refinement of r through s is as follows: refines (r) = r[C/(C ∪ (charC (c3 ) ∩ (MC\C)), MC/charC (s)] = ({a, b} ∪ ({c} ∩ {c, d, e, f}), {a, b, c, d}) = ({a, b} ∪ {c}, {a, b, c, d}) = ({a, b, c}, {a, b, c, d}) Indeed, the third segment abstract predicate of s has a segment bound lower limit strictly greater than 0 and it approximates exactly the character ’c’ which belongs to the set of possibly contained character of r, but that does not belong to the set of certainly contained characters of r. So we add the character ’c’ to the set of certainly contained characters of r, as by case a. of Definition 7. Moreover, we modify the set

7 Lifting String Analysis Domains

137

of possibly contained character of r considering just the characters occurring in the set of characters approximated by s, as by case b. of Definition 7, thus refining it.

Definition 8 (Reduced product between Segmentation and Character Inclusion) Let s and r be two abstract elements in the Segmentation and in the Character Inclusion abstract domains, respectively. Moreover, consider the refinement operators presented in Definitions 6 and 7. The reduced product of s and r is obtained by applying refiner and refines as follows: s ⊗ r = (refiner(s), refines(r)).
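Under the same simplified segment encoding used earlier, the two cases of Definition 7 amount to the following sketch, which reproduces the worked example above (char_set stands for charC; r is the pair (C, MC)):

def char_set(ci):
    return ci if isinstance(ci, set) else {ci}

def refine_ci_by_seg(r, s):
    """Cases a. and b. of Definition 7 on r = (C, MC)."""
    C, MC = set(r[0]), set(r[1])
    chars_s = set().union(*(char_set(ci) for ci, lo, hi in s))
    for ci, lo, hi in s:
        cs = char_set(ci)
        if len(cs) == 1 and lo >= 1:   # case a: surely non-empty singleton
            C |= cs & (MC - C)
    if chars_s < MC:                   # case b: strict subset of MC
        MC = chars_s
    return C, MC

s = [('a', 2, 2), ('b', 3, 5), ('c', 4, 4), ('d', 0, 2)]
r = ({'a', 'b'}, {'a', 'b', 'c', 'd', 'e', 'f'})
assert refine_ci_by_seg(r, s) == ({'a', 'b', 'c'}, {'a', 'b', 'c', 'd'})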

7.7.4 Combining Segmentation and Prefix Domains

The last combination we present involves the abstract elements s and p, belonging to the Segmentation and the Prefix abstract domains, respectively. The reductions between the Segmentation domain and both the Suffix and the Prefix-Suffix abstract domains can be naturally induced by the Granger product between s and p. Notice that the refinement of s is defined according to the length of the sequence of characters of p.

Definition 9 (Inconsistency between Segmentation and Prefix) Let s and p be two abstract elements in the Segmentation and in the Prefix domains, respectively. Moreover, let γS and γPR be the concretization functions of the domains just mentioned. There is inconsistency between s and p when: γS(s) ∩ γPR(p) = ∅.

Lemma 3 (Inconsistency between Segmentation and Prefix) Let s and p be two abstract elements in the Segmentation and in the Prefix domains, respectively. If one of the following conditions holds, then s and p are inconsistent.
1. If p is not a prefix of any of the strings approximated by s, then s and p are inconsistent. Formally:
{σ ∈ ς⁻¹(γS(s)) : σ0 … σlen(p)−1 = p0 … plen(p)−1} = ∅ ⇒ Inc(s, p)
2. If one of the two abstract elements is the bottom element of its lattice, then s and p are inconsistent. Formally:
s = ⊥S or p = ⊥PR ⇒ Inc(s, p)

Definition 10 (Segmentation refinement through Prefix) Let s and p be two abstract elements in the Segmentation and in the Prefix domains, respectively. We define


refinep(s) as the refinement operator describing how the information tracked by p improves the one from s. Let unFold(s) = s′.³ Formally, refinep(s) = s∗ where:
a. if the abstraction of a character of p, in an abstract domain C, strictly precedes the segment abstract predicate of its matching segment in s′, then the refinement substitutes that segment abstract predicate with the abstract value of the corresponding character of p. Formally:
∀i ∈ s′ ∀k ∈ [0, lenPR(p) − 1] : cibi is the corresponding segment of p[k] ∧ αC(p[k]) ⊏C ci ⇒ s∗ = fold(s′[ci/αC(p[k])])
b. if the matching segment in s′ of a character of p is preceded by one or more possibly empty segments, none of which is the matching segment of the preceding character of p, then the refinement removes those segments from s′. Formally:
∀i ∈ s′ ∀k ∈ [0, lenPR(p) − 1] : cibi is the corresponding segment of p[k] ∧ ∃(cjbj), j = i − 1, …, n with n ∈ [1, i − 1], in s′ : cjbj is not the corresponding segment of p[k − 1] ∧ p[k] ∉ charC(cj) ∧ bj ∈ {[0, 0], [0, 1], [0, x]} ⇒ s∗ = fold(s′ −S (cjbj), j = i − 1, …, n)
The conditions above can occur at the same time; in this case, both refinements apply. Observe that the order of application is not relevant. If none of the conditions above apply, then s∗ = s.

Example Consider s = 'a'[2, 4] 'b'[1, 1] 'c'[0, 2] and p = aab∗. Let s′ denote unFold(s) = 'a'[1, 1] 'a'[1, 1] 'a'[0, 1] 'a'[0, 1] 'b'[1, 1] 'c'[0, 1] 'c'[0, 1]. The refinement of s through p is as follows:

refinep(s) = fold(s′ −S (ckbk), k = 4, 3)
           = fold('a'[1, 1] 'a'[1, 1] 'b'[1, 1] 'c'[0, 1] 'c'[0, 1])
           = 'a'[2, 2] 'b'[1, 1] 'c'[0, 2]

Indeed, the fifth segment of s′ is the matching segment for p[2], and it is preceded by two possibly empty segments, neither of which is the matching segment of p[1]. So we can remove them from s′, as per case b. of Definition 10, thus refining it.

³ The unFold procedure cannot be defined in general, as it depends on the normal form of the expressions in E; it is, however, reasonable to assume that an unfolded segmentation can have a limited set of segment bound types, namely [0, 0], [0, 1], [1, 1], [0, x], [x, x], with x ∈ X (cf. Sect. 7.6), if expressions contain variables.


Definition 11 (Prefix refinement through Segmentation) Let s and p be two abstract elements in the Segmentation and in the Prefix domains, respectively. We define refines(p) as the refinement operator describing how the information tracked by s improves the one from p. Let unFold(s) = s′. Formally, refines(p) = p∗ where: if there are one or more segments in s′ after the one matching the last character of p, such that each of their segment abstract elements approximates exactly one character and each of their bounds is equal to the interval [1, 1], then the refinement concatenates those characters to p. Formally:
∃i ∈ s′ : cibi is the corresponding segment of p[lenPR(p) − 1] ∧ ∃(cjbj), j = i + 1, …, n with n ∈ [i + 1, numSeg(s′)], in s′ : charC(cj) = {c′j} ∧ bj = [1, 1] ⇒ p∗ = p + c′i+1 … c′n, where c′j is the character such that charC(cj) = {c′j} for j ∈ [i + 1, n]
If the condition above does not apply, then p∗ = p.

Example Consider s = 'a'[2, 3] 'b'[2, 2] and p = aab∗. Let s′ denote unFold(s) = 'a'[1, 1] 'a'[1, 1] 'a'[0, 1] 'b'[1, 1] 'b'[1, 1]. The refinement of p through s is as follows:

refines(p) = aab + b = aabb∗

Indeed, the segment of s′ corresponding to the last character of p is followed by a segment whose character abstract element approximates exactly the character 'b' and whose segment bound is the interval [1, 1]. So we can concatenate the character approximated by c5 to the sequence of characters of p, as per Definition 11, thus refining it.
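On an unfolded segmentation, Definition 11 reads as a simple scan. The sketch below uses our own encoding; i0 is the index, assumed computed elsewhere, of the segment matching the last character of p:

def char_set(ci):
    return ci if isinstance(ci, set) else {ci}

def refine_prefix(p, unfolded, i0):
    """Extend p with the characters of the consecutive singleton [1, 1]
    segments that follow segment i0 in the unfolded segmentation."""
    for ci, lo, hi in unfolded[i0 + 1:]:
        cs = char_set(ci)
        if len(cs) == 1 and (lo, hi) == (1, 1):
            p += next(iter(cs))
        else:
            break
    return p

# unFold(s) for s = 'a'[2,3] 'b'[2,2], with p = "aab" (i.e. aab*):
unfolded = [('a', 1, 1), ('a', 1, 1), ('a', 0, 1), ('b', 1, 1), ('b', 1, 1)]
assert refine_prefix("aab", unfolded, 3) == "aabb"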

Definition 12 (Reduced product between Segmentation and Prefix) Let s and p be two abstract elements in the Segmentation and in the Prefix abstract domains, respectively. Moreover, consider the refinement operators presented in Definitions 10 and 11. The reduced product of s and p is obtained by applying refinep and refines as follows: s ⊗ p = (refinep(s), refines(p)).


7.8 Conclusion

Strings are characterized by their content and their shape. Existing string abstract domains capture information about the content or the shape of strings separately, or track both only partially, leading to a substantial loss of precision during the analysis. The Segmentation domain specialised for string analysis discussed in this paper tracks information about the shape and the content of strings in their entirety. Its precision depends on the abstract domains for characters, segment bounds, and variables with which it is instantiated. More precise parameter domains inevitably decrease the efficiency of the segmentation analysis; less precise ones increase its efficiency at the expense of losing important information, e.g., information useful to prevent undesired behaviours of the program under analysis. By lifting basic string domains through the segmentation abstraction, we obtain more sophisticated domains in which structural and character-based information are combined, yielding a more accurate representation.

Acknowledgements Work partially supported by SPIN-2021 projects "Ressa-Rob" and "Static Analysis for Data Scientists", and Fondi di primo insediamento 2019 project "Static Taint Analysis for IoT Software" funded by Ca' Foscari University, and by iNEST-Interconnected NordEst Innovation Ecosystem, funded by PNRR (Mission 4.2, Investment 1.5), NextGeneration EU - Project ID: ECS 00000043.

References

1. Amadini, R., Gange, G., Gauthier, F., Jordan, A., Schachte, P., Søndergaard, H., Stuckey, P.J., Zhang, C.: Reference abstract domains and applications to string analysis. Fundam. Informaticae 158(4), 297–326 (2018). https://doi.org/10.3233/FI-2018-1650
2. Amadini, R., Jordan, A., Gange, G., Gauthier, F., Schachte, P., Søndergaard, H., Stuckey, P.J., Zhang, C.: Combining string abstract domains for javascript analysis: An evaluation. In: A. Legay, T. Margaria (eds.) Tools and Algorithms for the Construction and Analysis of Systems - 23rd International Conference, TACAS 2017, Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2017, Uppsala, Sweden, April 22-29, 2017, Proceedings, Part I, Lecture Notes in Computer Science, vol. 10205, pp. 41–57 (2017). https://doi.org/10.1007/978-3-662-54577-5_3
3. Arceri, V., Mastroeni, I.: Static program analysis for string manipulation languages. In: A. Lisitsa, A. Nemytykh (eds.) Proceedings Seventh International Workshop on Verification and Program Transformation, Genova, Italy, 2nd April 2019, Electronic Proceedings in Theoretical Computer Science, vol. 299, pp. 19–33. Open Publishing Association (2019)
4. Arceri, V., Mastroeni, I.: An automata-based abstract semantics for string manipulation languages. In: A. Lisitsa, A.P. Nemytykh (eds.) Proceedings Seventh International Workshop on Verification and Program Transformation, VPT@Programming 2019, Genova, Italy, 2nd April 2019, EPTCS, vol. 299, pp. 19–33 (2019). https://doi.org/10.4204/EPTCS.299.5
5. Arceri, V., Olliaro, M., Cortesi, A., Mastroeni, I.: Completeness of abstract domains for string analysis of javascript programs. In: R.M. Hierons, M. Mosbah (eds.) Theoretical Aspects of Computing - ICTAC 2019 - 16th International Colloquium, Hammamet, Tunisia, October 31 - November 4, 2019, Proceedings, Lecture Notes in Computer Science, vol. 11884, pp. 255–272. Springer (2019). https://doi.org/10.1007/978-3-030-32505-3_15


6. Codish, M., Mulkers, A., Bruynooghe, M., de la Banda, M.J.G., Hermenegildo, M.V.: Improving abstract interpretations by combining domains. ACM Trans. Program. Lang. Syst. 17(1), 28–44 (1995). https://doi.org/10.1145/200994.200998
7. Cortesi, A.: Widening operators for abstract interpretation. In: A. Cerone, S. Gruner (eds.) Sixth IEEE International Conference on Software Engineering and Formal Methods, SEFM 2008, Cape Town, South Africa, 10-14 November 2008, pp. 31–40. IEEE Computer Society (2008). https://doi.org/10.1109/SEFM.2008.20
8. Cortesi, A., Charlier, B.L., Hentenryck, P.V.: Combinations of abstract domains for logic programming. In: H. Boehm, B. Lang, D.M. Yellin (eds.) Conference Record of POPL'94: 21st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, Portland, Oregon, USA, January 17-21, 1994, pp. 227–239. ACM Press (1994). https://doi.org/10.1145/174675.177880
9. Cortesi, A., Charlier, B.L., Hentenryck, P.V.: Combinations of abstract domains for logic programming: open product and generic pattern construction. Sci. Comput. Program. 38(1–3), 27–71 (2000). https://doi.org/10.1016/S0167-6423(99)00045-3
10. Cortesi, A., Costantini, G., Ferrara, P.: A survey on product operators in abstract interpretation. In: A. Banerjee, O. Danvy, K. Doh, J. Hatcliff (eds.) Semantics, Abstract Interpretation, and Reasoning about Programs: Essays Dedicated to David A. Schmidt on the Occasion of his Sixtieth Birthday, Manhattan, Kansas, USA, 19-20th September 2013, EPTCS, vol. 129, pp. 325–336 (2013). https://doi.org/10.4204/EPTCS.129.19
11. Cortesi, A., Lauko, H., Olliaro, M., Rockai, P.: String abstraction for model checking of C programs. In: F. Biondi, T. Given-Wilson, A. Legay (eds.) Model Checking Software - 26th International Symposium, SPIN 2019, Beijing, China, July 15-16, 2019, Proceedings, Lecture Notes in Computer Science, vol. 11636, pp. 74–93. Springer (2019). https://doi.org/10.1007/978-3-030-30923-7_5
12. Cortesi, A., Olliaro, M.: M-string segmentation: A refined abstract domain for string analysis in C programs. In: J. Pang, C. Zhang, J. He, J. Weng (eds.) 2018 International Symposium on Theoretical Aspects of Software Engineering, TASE 2018, Guangzhou, China, August 29-31, 2018, pp. 1–8. IEEE Computer Society (2018). https://doi.org/10.1109/TASE.2018.00009
13. Costantini, G., Ferrara, P., Cortesi, A.: A suite of abstract domains for static analysis of string values. Softw. Pract. Exp. 45(2), 245–287 (2015). https://doi.org/10.1002/spe.2218
14. Cousot, P., Cousot, R.: Abstract interpretation: A unified lattice model for static analysis of programs by construction or approximation of fixpoints. In: R.M. Graham, M.A. Harrison, R. Sethi (eds.) Conference Record of the Fourth ACM Symposium on Principles of Programming Languages, Los Angeles, California, USA, January 1977, pp. 238–252. ACM (1977). https://doi.org/10.1145/512950.512973
15. Cousot, P., Cousot, R.: Systematic design of program analysis frameworks. In: A.V. Aho, S.N. Zilles, B.K. Rosen (eds.) Conference Record of the Sixth Annual ACM Symposium on Principles of Programming Languages, San Antonio, Texas, USA, January 1979, pp. 269–282. ACM Press (1979). https://doi.org/10.1145/567752.567778
16. Cousot, P., Cousot, R.: Abstract interpretation and application to logic programs. J. Log. Program. 13(2&3), 103–179 (1992). https://doi.org/10.1016/0743-1066(92)90030-7
17. Cousot, P., Cousot, R., Feret, J., Mauborgne, L., Miné, A., Monniaux, D., Rival, X.: The Astrée analyzer. In: Programming Languages and Systems, 14th European Symposium on Programming, ESOP 2005, Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2005, Edinburgh, UK, April 4-8, 2005, Proceedings, Lecture Notes in Computer Science, vol. 3444, pp. 21–30. Springer (2005). https://doi.org/10.1007/978-3-540-31987-0_3
18. Cousot, P., Cousot, R., Logozzo, F.: A parametric segmentation functor for fully automatic and scalable array content analysis. In: T. Ball, M. Sagiv (eds.) Proceedings of the 38th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2011, Austin, TX, USA, January 26-28, 2011, pp. 105–118. ACM (2011). https://doi.org/10.1145/1926385.1926399


19. Cousot, P., Halbwachs, N.: Automatic discovery of linear restraints among variables of a program. In: A.V. Aho, S.N. Zilles, T.G. Szymanski (eds.) Conference Record of the Fifth Annual ACM Symposium on Principles of Programming Languages, Tucson, Arizona, USA, January 1978, pp. 84–96. ACM Press (1978). https://doi.org/10.1145/512760.512770
20. Filé, G., Giacobazzi, R., Ranzato, F.: A unifying view of abstract domain design. ACM Comput. Surv. 28(2), 333–336 (1996). https://doi.org/10.1145/234528.234742
21. Giacobazzi, R., Ranzato, F.: Functional dependencies and Moore-set completions of abstract interpretations and semantics. In: J.W. Lloyd (ed.) Logic Programming, Proceedings of the 1995 International Symposium, Portland, Oregon, USA, December 4-7, 1995, pp. 321–335. MIT Press (1995)
22. Giacobazzi, R., Ranzato, F.: Refining and compressing abstract domains. In: P. Degano, R. Gorrieri, A. Marchetti-Spaccamela (eds.) Automata, Languages and Programming, 24th International Colloquium, ICALP'97, Bologna, Italy, 7-11 July 1997, Proceedings, Lecture Notes in Computer Science, vol. 1256, pp. 771–781. Springer (1997). https://doi.org/10.1007/3-540-63165-8_230
23. Giacobazzi, R., Ranzato, F., Scozzari, F.: Making abstract interpretations complete. J. ACM 47(2), 361–416 (2000). https://doi.org/10.1145/333979.333989
24. Gould, C., Su, Z., Devanbu, P.T.: JDBC checker: A static analysis tool for SQL/JDBC applications. In: A. Finkelstein, J. Estublier, D.S. Rosenblum (eds.) 26th International Conference on Software Engineering (ICSE 2004), 23-28 May 2004, Edinburgh, United Kingdom, pp. 697–698. IEEE Computer Society (2004). https://doi.org/10.1109/ICSE.2004.1317494
25. Granger, P.: Improving the results of static analyses of programs by local decreasing iterations. In: R. Shyamasundar (ed.) Foundations of Software Technology and Theoretical Computer Science, FSTTCS 1992, Lecture Notes in Computer Science, vol. 652. Springer, Berlin, Heidelberg (1992)
26. Gulwani, S., Tiwari, A.: Combining abstract interpreters. In: M.I. Schwartzbach, T. Ball (eds.) Proceedings of the ACM SIGPLAN 2006 Conference on Programming Language Design and Implementation, Ottawa, Ontario, Canada, June 11-14, 2006, pp. 376–386. ACM (2006). https://doi.org/10.1145/1133981.1134026
27. Jana, A., Halder, R., Kalahasti, A., Ganni, S., Cortesi, A.: Extending abstract interpretation to dependency analysis of database applications. IEEE Trans. Software Eng. 46(5), 463–494 (2020). https://doi.org/10.1109/TSE.2018.2861707
28. Jensen, S.H., Møller, A., Thiemann, P.: Type analysis for javascript. In: J. Palsberg, Z. Su (eds.) Static Analysis, 16th International Symposium, SAS 2009, Los Angeles, CA, USA, August 9-11, 2009, Proceedings, Lecture Notes in Computer Science, vol. 5673, pp. 238–255. Springer (2009). https://doi.org/10.1007/978-3-642-03237-0_17
29. Kim, S., Chin, W., Park, J., Kim, J., Ryu, S.: Inferring grammatical summaries of string values. In: J. Garrigue (ed.) Programming Languages and Systems - 12th Asian Symposium, APLAS 2014, Singapore, November 17-19, 2014, Proceedings, Lecture Notes in Computer Science, vol. 8858, pp. 372–391. Springer (2014). https://doi.org/10.1007/978-3-319-12736-1_20
30. Lee, H., Won, S., Jin, J., Cho, J., Ryu, S.: SAFE: Formal specification and implementation of a scalable analysis framework for ECMAScript. In: Proceedings of the 19th International Workshop on Foundations of Object-Oriented Languages (FOOL'12) (2012)
31. Logozzo, F., Fähndrich, M.: Static contract checking with abstract interpretation. In: B. Beckert, C. Marché (eds.) Formal Verification of Object-Oriented Software - International Conference, FoVeOOS 2010, Paris, France, June 28-30, 2010, Revised Selected Papers, Lecture Notes in Computer Science, vol. 6528, pp. 10–30. Springer (2010). https://doi.org/10.1007/978-3-642-18070-5_2
32. Madsen, M., Andreasen, E.: String analysis for dynamic field access. In: A. Cohen (ed.) Compiler Construction - 23rd International Conference, CC 2014, Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2014, Grenoble, France, April 5-13, 2014, Proceedings, Lecture Notes in Computer Science, vol. 8409, pp. 197–217. Springer (2014). https://doi.org/10.1007/978-3-642-54807-9_12


33. Mandal, A.K., Panarotto, F., Cortesi, A., Ferrara, P., Spoto, F.: Static analysis of android auto infotainment and on-board diagnostics II apps. Softw. Pract. Exp. 49(7), 1131–1161 (2019). https://doi.org/10.1002/spe.2698
34. Minamide, Y.: Static approximation of dynamically generated web pages. In: A. Ellis, T. Hagino (eds.) Proceedings of the 14th International Conference on World Wide Web, WWW 2005, Chiba, Japan, May 10-14, 2005, pp. 432–441. ACM (2005). https://doi.org/10.1145/1060745.1060809
35. Miné, A.: The octagon abstract domain. High. Order Symb. Comput. 19(1), 31–100 (2006). https://doi.org/10.1007/s10990-006-8609-1
36. Narodytska, N.: Formal verification of deep neural networks. In: N. Bjørner, A. Gurfinkel (eds.) 2018 Formal Methods in Computer Aided Design, FMCAD 2018, Austin, TX, USA, October 30 - November 2, 2018, p. 1. IEEE (2018). https://doi.org/10.23919/FMCAD.2018.8603017
37. Nielson, F.: Tensor products generalize the relational data flow analysis method. In: Proc. Fourth Hungarian Computer Science Conference, pp. 211–225 (1985)
38. Park, C., Im, H., Ryu, S.: Precise and scalable static analysis of jquery using a regular expression domain. In: R. Ierusalimschy (ed.) Proceedings of the 12th Symposium on Dynamic Languages, DLS 2016, Amsterdam, The Netherlands, November 1, 2016, pp. 25–36. ACM (2016). https://doi.org/10.1145/2989225.2989228
39. Samimi, H., Schäfer, M., Artzi, S., Millstein, T.D., Tip, F., Hendren, L.J.: Automated repair of HTML generation errors in PHP applications using string constraint solving. In: M. Glinz, G.C. Murphy, M. Pezzè (eds.) 34th International Conference on Software Engineering, ICSE 2012, June 2-9, 2012, Zurich, Switzerland, pp. 277–287. IEEE Computer Society (2012). https://doi.org/10.1109/ICSE.2012.6227186
40. Tateishi, T., Pistoia, M., Tripp, O.: Path- and index-sensitive string analysis based on monadic second-order logic. ACM Trans. Softw. Eng. Methodol. 22(4), 33:1–33:33 (2013). https://doi.org/10.1145/2522920.2522926
41. Tripp, O., Ferrara, P., Pistoia, M.: Hybrid security analysis of web javascript code via dynamic partial evaluation. In: C.S. Pasareanu, D. Marinov (eds.) International Symposium on Software Testing and Analysis, ISSTA '14, San Jose, CA, USA, July 21-26, 2014, pp. 49–59. ACM (2014). https://doi.org/10.1145/2610384.2610385
42. Wassermann, G., Su, Z.: Sound and precise analysis of web applications for injection vulnerabilities. In: J. Ferrante, K.S. McKinley (eds.) Proceedings of the ACM SIGPLAN 2007 Conference on Programming Language Design and Implementation, San Diego, California, USA, June 10-13, 2007, pp. 32–41. ACM (2007). https://doi.org/10.1145/1250734.1250739
43. Wassermann, G., Su, Z.: Static detection of cross-site scripting vulnerabilities. In: W. Schäfer, M.B. Dwyer, V. Gruhn (eds.) 30th International Conference on Software Engineering (ICSE 2008), Leipzig, Germany, May 10-18, 2008, pp. 171–180. ACM (2008). https://doi.org/10.1145/1368088.1368112
44. Yamaguchi, T., Brain, M., Ryder, C., Imai, Y., Kawamura, Y.: Application of abstract interpretation to the automotive electronic control system. In: Verification, Model Checking, and Abstract Interpretation - 20th International Conference, VMCAI 2019, Lecture Notes in Computer Science, vol. 11388, pp. 425–445. Springer (2019). https://doi.org/10.1007/978-3-030-11245-5_20
45. Yu, F., Alkhalaf, M., Bultan, T.: Patching vulnerabilities with sanitization synthesis. In: R.N. Taylor, H.C. Gall, N. Medvidovic (eds.) Proceedings of the 33rd International Conference on Software Engineering, ICSE 2011, Waikiki, Honolulu, HI, USA, May 21-28, 2011, pp. 251–260. ACM (2011). https://doi.org/10.1145/1985793.1985828
46. Yu, F., Bultan, T., Cova, M., Ibarra, O.H.: Symbolic string verification: An automata-based approach. In: K. Havelund, R. Majumdar, J. Palsberg (eds.) Model Checking Software, 15th International SPIN Workshop, Los Angeles, CA, USA, August 10-12, 2008, Proceedings, Lecture Notes in Computer Science, vol. 5156, pp. 306–324. Springer (2008). https://doi.org/10.1007/978-3-540-85114-1_21
47. Yu, F., Bultan, T., Hardekopf, B.: String abstractions for string verification. In: A. Groce, M. Musuvathi (eds.) Model Checking Software - 18th International SPIN Workshop, Snowbird, UT, USA, July 14-15, 2011, Proceedings, Lecture Notes in Computer Science, vol. 6823, pp. 20–37. Springer (2011). https://doi.org/10.1007/978-3-642-22306-8_3

Chapter 8

Local Completeness in Abstract Interpretation

Roberto Bruni, Roberto Giacobazzi, Roberta Gori, and Francesco Ranzato

Abstract Completeness of an abstract interpretation is an ideal situation in which the abstract interpreter is guaranteed to be compositional and to produce no false alarms when used for verifying program correctness. Completeness for all possible programs and inputs is a very rare condition, met only by straightforward abstractions. In this paper we make a journey through the different forms of completeness in abstract interpretation that emerged in recent years. In particular, we consider the case of local completeness, requiring precision only on some specific, rather than all, program inputs. By leveraging this notion of local completeness, a logical proof system parameterized by an abstraction A, called LCL_A, for Local Completeness Logic on A, has been put forward to prove or disprove program correctness. In this program logic a provable triple [p] c [q] not only ensures that all alarms raised for the postcondition q are true ones, but also that if q does not raise alarms then the program c cannot go wrong with the precondition p.

Keywords Abstract interpretation · Program logic · Completeness · Incorrectness

R. Bruni · R. Gori
University of Pisa, Pisa, Italy
e-mail: [email protected]
R. Gori
e-mail: [email protected]
R. Giacobazzi
University of Verona, Verona, Italy
e-mail: [email protected]
F. Ranzato (B)
University of Padova, Padova, Italy
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
V. Arceri et al. (eds.), Challenges of Software Verification, Intelligent Systems Reference Library 238, https://doi.org/10.1007/978-981-19-9601-6_8


8.1 Completeness, Fallacy, and Approximation

Formal methods are fundamental to have rigorous methods and correct algorithms to reason about programs, e.g., to prove their correctness or even their incorrectness. These include program logics, model checking, and abstract interpretation as the most prominent examples. Abstract interpretation [8, 23] and most well-known program logics, notably Hoare logic [28], have a lot in common. On the one hand, abstract interpretation can be seen as a fixpoint strategy for implementing a Hoare logic-based verifier. On the other hand, Hoare logic can be seen as a logical, i.e. rule-based, presentation of an abstract interpreter, deprived of a fixpoint extrapolation strategy. In this sense, abstract interpretation solves a harder problem than Hoare logic and, more generally, program logics: abstract interpretation provides an effective algorithmic construction of a witness invariant which can be used both in program analysis and in program verification [10]. It is in this perspective that abstract interpretation provides the most general framework to reason about program semantics at different levels of abstraction, including program analysis and program verification as special cases. It is precisely in the ambition of inferring a program invariant that the need for approximation lies: approximation is inevitable to make tractable (e.g., decidable) problems that are typically intractable. The well-known and inherent undecidability of all non-straightforward extensional properties of programs provides an intrinsic limitation on the use of approximated and decidable formal methods for proving program properties [39]. This is particularly clear in program analysis, where the required decidability of the analysis may introduce false positives/negatives. The soundness of a program analyser, which is guaranteed by construction in abstract interpretation, means that all true alarms (also called true positives) are caught, but it is often the case that false alarms (also called false positives) are reported. Of course, as with any alarm system, program analysis is credible when few false alarms are reported, ideally none. Completeness holds when the abstract semantics precisely describes, in the abstract domain of approximate properties, the property of the concrete semantics, namely when no false alarm can be raised.

To substantiate the role of completeness, let us introduce some formal notation. We consider Galois insertion-based abstract interpretations [7, 8], where the concrete domain C and the abstract domain A are complete lattices related by a pair of monotonic functions α : C → A and γ : A → C, called, resp., abstraction and concretization maps, forming a Galois insertion. We let Abs(C) denote the class of abstract domains of C, where the notation Aα,γ ∈ Abs(C) makes explicit the abstraction and concretization maps. To simplify notation, in the following we often use A as a function in place of γα : C → C. When the concrete domain is a lattice of program properties, a property P ∈ C is expressible in the abstract domain A if A(P) = P. Given Aα,γ ∈ Abs(C) and a predicate transformer f : C → C, an abstract function f♯ : A → A is a correct (or sound) approximation of f if αf ≤ f♯α holds. The best correct approximation (bca) of f in A is defined as fA ≜ αfγ, and any correct approximation f♯ of f in A satisfies fA ≤ f♯. The function f♯ is a complete approximation of f (or just complete) if Af = γf♯α holds. The


abstract domain A is called a complete abstraction for f if there exists a complete approximation f♯ : A → A of f. Completeness of f♯ intuitively encodes the greatest achievable precision for a sound abstract semantics defined in A. In this case, the loss of precision is only due to the abstract domain and not to the abstract function f♯ itself, which must necessarily be the bca of f: namely, if f♯ is complete in A then f♯ = fA and Af = AfA hold [26]. Although desirable, completeness is extremely rare and hard to achieve in practice. It is indeed worth noting that the existence of false positives is not just a consequence of the required decidability of program analysis. In [16], the authors proved that for any non-straightforward abstraction A there always exists a program c for which any sound abstract interpretation of c on A yields one or more false positives. Later, in [1] the authors showed that any program equivalence induced by an abstract interpreter built over a non-straightforward abstraction violates program extensionality. Here, non-straightforward abstractions correspond to those abstract domains A that are able to distinguish at least two programs, i.e., if [[·]]♯A is a sound abstract semantics defined on A of a concrete semantics [[·]], then there exist two programs c1 and c2 such that [[c1]]♯A ≠ [[c2]]♯A, and [[·]]♯A does not coincide with the identical abstraction, i.e., [[·]]♯A ≠ [[·]]. Making abstract interpretation complete is the holy grail in program analysis and verification by approximate methods [23]. Since the very first non-constructive solution to the problem of making abstract interpretations complete, given in [21], a constructive optimal abstraction refinement has been introduced in [25] and finalized in [26], with several applications in program analysis [11, 24, 27], model checking [20, 22, 34, 36–38], language-based security [17, 32], code protection [14, 15, 18], and program semantics [19]. The interested reader can refer to [35] to realize how completeness arises everywhere in programming languages. The key result in [26] is the notion of most abstract completeness refinement, called complete shell, of an abstract domain A. This operation, which belongs to the family of abstract domain refinements [12], always exists for any Scott-continuous predicate transformer, and therefore for all computable transfer functions, and it can be constructively defined as the solution of a recursive abstract domain equation. Although extremely powerful, this notion has an intrinsic global flavour: the complete shell of an abstract domain with respect to a transfer function f makes the abstract domain complete for f on all possible inputs. As a result, this complete shell yields an abstract domain that is often way too fine-grained to work for a program or a set of programs, possibly blowing up to the whole concrete domain in many cases.

On the side of program logics, several over-approximation techniques have been known since the pioneering works by Floyd and Hoare [13, 29]. In classical Hoare correctness logic, a triple {p} c {q} asserts that the postcondition q over-approximates the states reachable from states satisfying the precondition p when executing the command c. Letting [[c]] denote the collecting semantics of c, this means that [[c]]p ≤ q holds. In an inductive setting, like forward program analysis, over-approximations are useful for proving correctness w.r.t. a specification spec: in fact, q ≤ spec implies [[c]]p ≤ spec. Like abstract interpretation, Hoare triples may exhibit false positives, meaning that from q ≰ spec we cannot conclude that the elements of q violating spec are real violations of the correctness specification spec. In a dual setting, a theory of program incorrectness has recently been investigated by O'Hearn [33]. A program specification [p] c [q] in O'Hearn incorrectness logic states that the postcondition q is an under-approximation of the states reachable from some states satisfying p, i.e. q ≤ [[c]]p must hold, therefore exposing only real program bugs: if q ≰ spec, then [[c]]p ≰ spec. Dually to Hoare correctness logic, O'Hearn incorrectness logic cannot be used to prove program correctness because it may exhibit false negatives, in the sense that from q ≤ spec we cannot conclude that [[c]]p ≤ spec holds.
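Before turning to proof systems, note that the defining equation Af = AfA can be tested directly on finite inputs. The sketch below, with an encoding of our own rather than one from the works cited above, does so for the interval abstraction, represented as the closure γα on finite sets of integers:

def interval_closure(X):
    """gamma(alpha(X)) for the interval abstraction over finite sets."""
    return set(range(min(X), max(X) + 1)) if X else set()

def lift(f):
    return lambda X: {f(x) for x in X}

def is_complete_on(f, A, inputs):
    """Checks A(f(X)) == A(f(A(X))) on every given input X."""
    return all(A(f(X)) == A(f(A(X))) for X in inputs)

square = lift(lambda x: x * x)
# Incomplete: A({-2, 2}) = {-2..2} contains 1, so squaring the closure
# reaches 1, while squaring the original set does not.
assert not is_complete_on(square, interval_closure, [{-2, 2}])

negate = lift(lambda x: -x)
assert is_complete_on(negate, interval_closure, [{-2, 2}, {0, 5}])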

8.2 Proving Completeness

The key fact that complete abstract functions compose [9] makes it possible to use structural induction to check whether an abstract interpretation is complete for a program. This idea was originally applied in [16] to define a very first sound proof system for checking whether an abstract domain is complete for a given program. Given an arbitrary abstract domain A ∈ Abs(C), the logical proof system ⊢A in Fig. 8.1 is such that if ⊢A c can be proved, then the abstract interpreter on the domain A is complete for the program c. Here c is a program in a simple imperative language and a proof obligation CA(f) is set as the base inductive case, meaning that the abstract domain A is complete for the predicate transformer f. This proof system ⊢A is very simple and basically postpones the checking of completeness for a given program c, no matter how complex c can be, to the checking of completeness of the basic predicate transformers occurring in c, namely those associated with boolean guards, i.e. CA(b) and CA(¬b), and with variable assignments, i.e. CA(x := a). All the aforementioned methods to make an abstract interpreter complete or to check its completeness, as in the case of the proof system in Fig. 8.1, consider completeness as a property that has to hold for all possible inputs. This, as we remarked above, is extremely hard to achieve. Some recent works put forward the idea of weakening the notion of completeness with the goal of making it more easily attainable. The basic idea is not to require completeness of an abstract interpreter for all possible input properties, hence on the whole abstract domain, but rather to demand

Fig. 8.1 The core proof system ⊢A defined in [16]


completeness along only the sequence of store properties computed by the program on a specific computation path. This approach makes completeness a “local” notion, hence the name of local completeness introduced in [2]. An abstract domain A ∈ Abs(C) is locally complete for a predicate transformer f on a precondition p ∈ C if the following condition holds: 

C_A^p(f) ⟺ A(f(p)) = A(f(A(p)))

An abstract interpreter is therefore locally complete relative to a given precondition p when no false positives for the verification of a postcondition are produced by running the abstract interpreter with p as input. The soundness of the abstraction guarantees that ⟦c⟧p ≤ A(⟦c⟧p) ≤ A(⟦c⟧A(p)) ≤ ⟦c⟧_A A(p) holds, where A(⟦c⟧A(p)) is the best correct approximating semantics of the program c on the input precondition p. The novel objective here is to define a program logic where an available under-approximation q of the postcondition ⟦c⟧p is tied to the over-approximation of q in the abstraction A. Hence, the aim is to design a program logic where a provable triple ⊢_A [p] c [q] guarantees that

q ≤ ⟦c⟧p ≤ A(⟦c⟧p) = A(q)    (8.1)

holds. To provide these guarantees, this proof system requires that any computational step of the abstract interpreter is locally complete on the approximated input. As an illustrative example, consider the program for computing the absolute value of integer variables:

Abs(x) ≜ if x < 0 then x := −x

The ubiquitous interval abstraction Int [8, 23] approximates any property s ∈ ℘(Z) of the integer values that the variable x may assume by the least interval Int(s) = [a, b] over-approximating s, i.e., such that s ⊆ [a, b], where a ≤ b, a ∈ Z ∪ {−∞} and b ∈ Z ∪ {+∞}. Let us assume that the possible inputs for Abs(x) range just in the set i = {x | x is odd}. While the interval approximation of the outputs Abs(i) is Int(Abs(i)) = [1, +∞], showing that 0 is not a possible result, it turns out that the best correct approximation in Int of the concrete semantics is less precise, because it also includes 0: in fact, Int(Abs(Int(i))) = [0, +∞]. Technically, this means that Int is incomplete for Abs on input i. This can spawn a problem in program verification: for instance, if the result is used as a divisor in an integer division, the abstract interval analysis would raise a "division-by-0" false alarm. However, for different sets of input, we can derive more precise results. For example, if we consider j = {x < 0 | x is odd}, we have that Int(Abs(Int(j))) = [1, +∞] = Int(Abs(j)) holds. This entails that the abstract domain Int is locally complete for Abs(x) on input j but not on input i. Proving local completeness means, e.g., proving the following two triples:

⊢_Int [{−7, −5, −3, −1}] Abs(x) [{1, 7}]    and    ⊢_Int [{−1, 0, 7}] Abs(x) [{0, 7}]


but not the triple ⊢_Int [{−3, −1, 5, 7}] Abs(x) [{1, 7}], because some local completeness requirements are not met in this latter case, even if the property {1, 7} ⊆ ⟦Abs(x)⟧{−3, −1, 5, 7} ⊆ Int({1, 7}) is indeed valid. Notably, the triple ⊢_Int [{−7, −5, −3, −1}] Abs(x) [{1, 7}] can be used to prove correctness w.r.t. the specification x > 0, while the triple ⊢_Int [{−1, 0, 7}] Abs(x) [{0, 7}] exhibits a true counterexample for the same property. Proving local completeness raises two main challenges. The first is obvious: the proof obligations of the basic program components (Boolean guards and assignments in a simple imperative language) depend upon the input preconditions. The second is more interesting and sheds deeper insight into the way approximation and computation interplay: the locality assumption closely ties completeness to the proof system, which is inductively defined on the program's syntax, thus going well beyond the basic program logic in Fig. 8.1 for "global" completeness [16]. In particular, a logic for locally complete abstract interpretations has to combine the standard over-approximation of abstract interpretation with the under-approximations used in incorrectness logic, to encompass over- and under-approximating program reasoning in a unified program logic.
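To make the interval computations of this example concrete, the following OCaml sketch (our illustration, not part of the chapter; it models the interval of a finite set of integers as a (min, max) pair) checks the local completeness condition Int(⟦Abs⟧p) = Int(⟦Abs⟧Int(p)) on finite preconditions:

  (* Interval of a nonempty list of integers, as a (min, max) pair. *)
  let int_of l = (List.fold_left min max_int l, List.fold_left max min_int l)

  (* Exact image of an interval under the absolute value function. *)
  let abs_int (a, b) =
    if a >= 0 then (a, b)
    else if b <= 0 then (-b, -a)
    else (0, max (-a) b)

  (* Local completeness of Int for Abs on the finite precondition p:
     Int(Abs(p)) = Int(Abs(Int(p))). *)
  let locally_complete p = int_of (List.map abs p) = abs_int (int_of p)

  let () =
    assert (locally_complete [-7; -5; -3; -1]);    (* negative odds: complete *)
    assert (not (locally_complete [-3; -1; 5; 7])) (* mixed signs: incomplete *)

Running this confirms that local completeness holds on the all-negative precondition but fails on the mixed-sign one, matching the triples above.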

8.3 LCL: Local Completeness Logic

We consider a simple language Reg of regular commands that covers imperative languages as well as other programming paradigms [30, 31, 41]:

Reg ∋ r ::= e | r; r | r ⊕ r | r∗

This language is parametric on the syntax of basic transfer expressions e ∈ Exp, which define the basic commands and can be instantiated, e.g., with (deterministic, nondeterministic or parallel) assignments, Boolean guards, error generation primitives, etc. For simplicity, we consider integer variables x ∈ Var and let a and b range over, resp., arithmetic and Boolean expressions, so that:

Exp ∋ e ::= skip | x := a | b?

The term r1; r2 represents sequential composition, r1 ⊕ r2 a choice that can behave as either r1 or r2, and r∗ is the Kleene iteration, where r can be executed 0 or any bounded number of times. Moreover, regular commands represent in a compact way the structure of control-flow graphs (CFGs) of imperative programs, and standard while-based languages, such as Imp in [41], can be retrieved by the following standard encodings (cf. [30, Sect. 2.2]):


if (b) then c1 else c2 ≜ (b?; c1) ⊕ (¬b?; c2)
while (b) do c ≜ (b?; c)∗; ¬b?

The concrete semantics ⟦·⟧ : Reg → C → C is inductively defined for any c ranging in a concrete domain C by assuming that the basic commands have a semantics ⦃·⦄ : Exp → C → C defined on C such that ⦃e⦄ is an additive function for any e ∈ Exp:

⟦e⟧c ≜ ⦃e⦄c        ⟦r1; r2⟧c ≜ ⟦r2⟧(⟦r1⟧c)
⟦r1 ⊕ r2⟧c ≜ ⟦r1⟧c ∨_C ⟦r2⟧c        ⟦r∗⟧c ≜ ∨_C {⟦r⟧ⁿc | n ∈ ℕ}

A program store σ : V → Z is a total function from a finite set of variables of interest V ⊆ Var to values, and Σ ≜ V → Z denotes the set of stores. The concrete domain is C ≜ ℘(Σ), ordered by inclusion. When V = {x}, we let s ∈ ℘(Z) denote the set {σ ∈ Σ | σ(x) ∈ s} ∈ C. Store update σ[x ↦ v] is defined as usual. The semantics ⦃e⦄ : C → C of basic commands is standard: for any s ∈ C,

⦃skip⦄s ≜ s
⦃x := a⦄s ≜ {σ[x ↦ {|a|}σ] | σ ∈ s}
⦃b?⦄s ≜ {σ ∈ s | {|b|}σ = true}

where {|a|} : Σ → Z and {|b|} : Σ → {true, false} are defined as expected. The abstract semantics ⟦·⟧_A : Reg → A → A on an abstraction A_{α,γ} ∈ Abs(C) is defined similarly, by structural induction, as follows: for any a ∈ A,

⟦e⟧_A a ≜ ⦃e⦄^A a = A(⦃e⦄a)        ⟦r1; r2⟧_A a ≜ ⟦r2⟧_A(⟦r1⟧_A a)
⟦r1 ⊕ r2⟧_A a ≜ ⟦r1⟧_A a ∨_A ⟦r2⟧_A a        ⟦r∗⟧_A a ≜ ∨_A {(⟦r⟧_A)ⁿ a | n ∈ ℕ}    (8.2)
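The definitions above can be rendered concretely. The following OCaml sketch (our illustration, assuming a single integer variable so that a concrete state is just a set of integers) implements the Reg syntax and its collecting semantics:

  module ZSet = Set.Make (Int)

  type exp = Skip | Assign of (int -> int) | Guard of (int -> bool)

  type reg =
    | E of exp             (* basic transfer expression *)
    | Seq of reg * reg     (* r1; r2 *)
    | Choice of reg * reg  (* r1 ⊕ r2 *)
    | Star of reg          (* r* *)

  (* Additive semantics of basic commands on sets of values. *)
  let sem_exp e s =
    match e with
    | Skip -> s
    | Assign f -> ZSet.map f s
    | Guard b -> ZSet.filter b s

  (* Collecting semantics of regular commands; the Kleene iteration is the
     union of all iterates, computed as a fixpoint that terminates only
     when the set of reachable values stabilizes. *)
  let rec sem r s =
    match r with
    | E e -> sem_exp e s
    | Seq (r1, r2) -> sem r2 (sem r1 s)
    | Choice (r1, r2) -> ZSet.union (sem r1 s) (sem r2 s)
    | Star r1 ->
        let rec lfp acc cur =
          let acc' = ZSet.union acc cur in
          if ZSet.equal acc' acc then acc else lfp acc' (sem r1 cur)
        in
        lfp ZSet.empty s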

It turns out that the above abstract semantics is monotonic and sound, i.e., A ∘ ⟦r⟧ ≤ ⟦r⟧_A ∘ A holds. Note that the abstract semantics of a basic expression e is its bca ⦃e⦄^A on A. This definition (8.2) agrees with the standard compositional definition by structural induction of abstract semantics used in abstract interpretation [7, 40]. Best correct abstractions are preserved by choice commands, i.e., ⟦r1 ⊕ r2⟧_A a = ⟦r1⟧_A a ∨_A ⟦r2⟧_A a, but generally not by sequential composition and Kleene iteration: for example, ⟦r2⟧_A ∘ ⟦r1⟧_A is not guaranteed to be the bca of r1; r2. Local completeness enjoys an "abstract convexity" property, that is, local completeness on a precondition p implies local completeness on any assertion s in between p and its abstraction A(p). It is this key observation that led us to the design of LCL_A, the Local Completeness Logic on A. LCL_A combines over- and under-approximation of program properties: in LCL_A we can prove triples ⊢_A [p] r [q] ensuring that: (i) q is an under-approximation of the concrete semantics ⟦r⟧p, i.e., q ≤ ⟦r⟧p; (ii) q and ⟦r⟧p have the same over-approximation in A, i.e., A(⟦r⟧p) = A(q); (iii) A is locally complete for r on input p, i.e., ⟦r⟧_A A(p) = A(q).


Points (i)–(iii) guarantee that, given a specification spec expressible in A, any provable triple ⊢_A [p] r [q] either proves correctness of r with respect to spec or exposes an alert in q ∖ spec. This alert, in turn, must correspond to a true alert because q is an under-approximation of the concrete semantics ⟦r⟧p. The full proof system is omitted here and can be found in [2]. Here, we focus on the two most relevant rules: (relax), which allows us to generalize a proof, and (transfer), which checks for local completeness of basic transfer functions. The main idea of a provable triple ⊢_A [p] r [q] in LCL_A is to constrain the under-approximation q as postcondition in such a way that it has the same abstraction as the concrete semantics ⟦r⟧p. The rule (relax) allows us to weaken the premises and strengthen the conclusions of the deductions in this proof system:

d ≤ p ≤ A(d)    ⊢_A [d] r [s]    q ≤ s ≤ A(q)
-------------------------------------------- (relax)
⊢_A [p] r [q]

Since the proof system infers an s that has the same abstraction as the concrete semantics ⟦r⟧p, by the abstract convexity property mentioned above, we have that local completeness of r on the under-approximation d is enough to prove local completeness on p. The conclusion s can then be strengthened to any under-approximation q preserving the abstraction (we have A(s) = A(q)). All local completeness proof obligations are introduced by the rule (transfer), in correspondence of each basic transfer function e; it is nothing else than the local completeness version of the proof obligations C_A(b), C_A(¬b) and C_A(x := a) in the proof system for global completeness in Fig. 8.1:

C_A^p(e)
---------------- (transfer)
⊢_A [p] e [⦃e⦄p]

The main consequence of this construction is that, given a specification spec expressible in the abstract domain A, a provable triple ⊢_A [p] r [q] can determine both correctness and incorrectness of the program r, that is,

⟦r⟧p ≤ spec ⟺ q ≤ spec    (8.3)

holds. In equivalent terms:

• If q ≤ spec, then we also have ⟦r⟧p ≤ spec, so that the program is correct with respect to spec.
• If q ≰ spec, then ⟦r⟧p ≰ spec also holds, thus meaning that ⟦r⟧p ∖ spec is not empty and therefore includes a true alarm for the program. Moreover, because q ≤ ⟦r⟧p, we have that q ∖ spec ≤ ⟦r⟧p ∖ spec. This means that already q is able to pinpoint some issues.

To illustrate the approach, we consider the following example (discussed in [2], where the reader can find all the details of the derivation). Let us consider the command r ≜ (r1 ⊕ r2)∗ where


r1 ≜ (0 < x?; x := x − 1)
r2 ≜ (x < 1000?; x := x + 1)

The concrete domain is C = ℘(Z) and the abstract domain is A = Int. Using the precondition p ≜ {1, 999}, we can derive the triple ⊢_Int [p] r [{0, 2, 1000}]. This allows us to prove, for instance, that spec1 = (x ≤ 1000) is met, and exhibits two true violations of spec2 = (100 ≤ x), namely 0 and 2.
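Reusing the semantics sketch from the beginning of this section, the command r and its concrete output can be reproduced as follows (again our illustration); the resulting set is {0, 1, ..., 1000}, whose interval abstraction [0, 1000] coincides with Int({0, 2, 1000}):

  let r1 = Seq (E (Guard (fun x -> 0 < x)), E (Assign (fun x -> x - 1)))
  let r2 = Seq (E (Guard (fun x -> x < 1000)), E (Assign (fun x -> x + 1)))
  let r = Star (Choice (r1, r2))

  (* sem r {1, 999} evaluates to the set {0, 1, ..., 1000}. *)
  let out = sem r (ZSet.of_list [1; 999])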

8.4 Concluding Remarks

In this extended abstract, we presented the genesis and the main ideas behind the local completeness logic LCL_A that we introduced in [2]. Local completeness represents a notable weakening of the notion of completeness originally introduced in [9] and later studied in [26]. The key point in using LCL_A is that the proof obligations ensuring local completeness for the basic expressions occurring in a program have to be guaranteed. Of course, this is an issue when the abstract domain A is not expressive enough to entail these proof obligations. This problem has been settled in [3], where a strategy has been proposed to repair the abstract domain when a local completeness proof obligation fails. The goal here is to refine the abstract domain by adding a new abstract element, which must be as abstract as possible, such that the proof obligation is satisfied in the new refined domain. This strategy is called forward repair since it repairs the domain A along a derivation attempt as soon as a failing local completeness proof obligation is encountered. After one such repair step, a new derivation must be started in the refined domain A′, so that in general the process is iterative and is not guaranteed to terminate, similarly to what happens in program verification based on counterexample-guided abstraction refinement (CEGAR) [5, 6]. In fact, the repair A′ for a given proof obligation may compromise the satisfaction of previously encountered proof obligations that were valid in A but maybe not in A′. A backward repair strategy has therefore been designed to overcome this limitation. Because of the analogy with partition refinement, in a sentence we may argue that abstract interpretation repair is for abstract interpretation what CEGAR is for abstract model checking. The general goal is to make LCL_A an effective method for program analysis by securing a good trade-off between precision and efficiency. In particular, the overall objective is to reduce/minimize the presence of false positives/negatives. In our proof system, in order to guarantee that property (8.1) holds, the stronger requirement

q ≤ ⟦c⟧p ≤ A(⟦c⟧p) = A(q) = ⟦c⟧_A A(p)

is enforced, which corresponds to asking for the actual abstract interpreter ⟦c⟧_A to be locally complete. This might not be strictly necessary, as the condition can be weakened to require that the best correct abstraction is locally complete. Unfortunately, computing


the best correct abstraction of ⟦c⟧ on A is not always possible, but we are investigating sufficient conditions to assure local completeness of best correct abstractions. The idea here is to exploit different refinements of the abstract domain for different parts of the derivations, so that precision can be improved whenever needed, as far as the result can be transferred to the original domain. Furthermore, in the vein of the so-called core domain simplification for global completeness introduced in [26], we plan to investigate the chance to simplify the domain, rather than refining it, for ensuring that local completeness holds on a given input. Finally, let us mention that different types of weakening of the notion of completeness are possible. A notion of partial completeness has been introduced in [4], meaning that (local) completeness holds up to some measurable error ε ≥ 0. The work [4] studied a quantitative proof system which allows us to measure the imprecision of an abstract interpretation and can be used to estimate an upper bound on the error accumulated by the abstract interpreter during a program analysis. This quantitative framework is general enough to be instantiated to most known metrics for abstract domains.

Acknowledgements This research has been funded by the Italian MIUR, under the PRIN2017 project no. 201784YSZ5 "AnalysiS of PRogram Analyses (ASPRA)", and by a Meta Research unrestricted gift.

References

1. Bruni, R., Giacobazzi, R., Gori, R., Garcia-Contreras, I., Pavlovic, D.: Abstract extensionality: on the properties of incomplete abstract interpretations. Proc. ACM Program. Lang. 4(POPL), 28:1–28:28 (2020). https://doi.org/10.1145/3371096
2. Bruni, R., Giacobazzi, R., Gori, R., Ranzato, F.: A logic for locally complete abstract interpretations. In: Proceedings of LICS 2021, 36th Annual ACM/IEEE Symposium on Logic in Computer Science, pp. 1–13. IEEE (2021). Distinguished paper
3. Bruni, R., Giacobazzi, R., Gori, R., Ranzato, F.: Abstract interpretation repair. In: Jhala, R., Dillig, I. (eds.) PLDI '22: 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation, San Diego, CA, USA, June 13–17, 2022, pp. 426–441. ACM (2022). https://doi.org/10.1145/3519939.3523453
4. Campion, M., Preda, M.D., Giacobazzi, R.: Partial (in)completeness in abstract interpretation: limiting the imprecision in program analysis. Proc. ACM Program. Lang. 6(POPL), 1–31 (2022). https://doi.org/10.1145/3498721
5. Clarke, E.M., Grumberg, O., Jha, S., Lu, Y., Veith, H.: Counterexample-guided abstraction refinement. In: Proceedings of CAV 2000, 12th International Conference on Computer Aided Verification, Lecture Notes in Computer Science, vol. 1855, pp. 154–169. Springer-Verlag (2000). https://doi.org/10.1007/10722167_15
6. Clarke, E.M., Grumberg, O., Jha, S., Lu, Y., Veith, H.: Counterexample-guided abstraction refinement for symbolic model checking. J. ACM 50(5), 752–794 (2003). https://doi.org/10.1145/876638.876643
7. Cousot, P.: Principles of Abstract Interpretation. MIT Press (2021)
8. Cousot, P., Cousot, R.: Abstract interpretation: a unified lattice model for static analysis of programs by construction or approximation of fixpoints. In: Proceedings of ACM POPL'77, pp. 238–252. ACM (1977). https://doi.org/10.1145/512950.512973
9. Cousot, P., Cousot, R.: Systematic design of program analysis frameworks. In: Proceedings of ACM POPL'79, pp. 269–282. ACM (1979). https://doi.org/10.1145/567752.567778
10. Cousot, P., Giacobazzi, R., Ranzato, F.: Program analysis is harder than verification: a computability perspective. In: Chockler, H., Weissenbacher, G. (eds.) Computer Aided Verification—30th International Conference, CAV 2018, Held as Part of the Federated Logic Conference, FloC 2018, Oxford, UK, July 14–17, 2018, Proceedings, Part II, Lecture Notes in Computer Science, vol. 10982, pp. 75–95. Springer (2018). https://doi.org/10.1007/978-3-319-96142-2_8
11. Dalla Preda, M., Giacobazzi, R., Mastroeni, I.: Completeness in approximated transductions. In: Static Analysis, 23rd International Symposium, SAS 2016, LNCS, vol. 9837, pp. 126–146 (2016)
12. Filé, G., Giacobazzi, R., Ranzato, F.: A unifying view of abstract domain design. ACM Comput. Surv. 28(2), 333–336 (1996). https://doi.org/10.1145/234528.234742
13. Floyd, R.W.: Assigning meanings to programs. Proceedings of Symposium on Applied Mathematics 19, 19–32 (1967)
14. Giacobazzi, R.: Hiding information in completeness holes—new perspectives in code obfuscation and watermarking. In: Proc. of the 6th IEEE Int. Conferences on Software Engineering and Formal Methods (SEFM '08), pp. 7–20. IEEE Press (2008)
15. Giacobazzi, R., Jones, N.D., Mastroeni, I.: Obfuscation by partial evaluation of distorted interpreters. In: Proc. of the ACM SIGPLAN Symp. on Partial Evaluation and Semantics-Based Program Manipulation (PEPM'12), pp. 63–72. ACM Press (2012)
16. Giacobazzi, R., Logozzo, F., Ranzato, F.: Analyzing program analyses. In: Proceedings of POPL 2015, 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pp. 261–273. ACM (2015). https://doi.org/10.1145/2676726.2676987
17. Giacobazzi, R., Mastroeni, I.: Adjoining classified and unclassified information by abstract interpretation. Journal of Computer Security 18(5), 751–797 (2010)
18. Giacobazzi, R., Mastroeni, I.: Making abstract interpretation incomplete: modeling the potency of obfuscation. In: Miné, A., Schmidt, D. (eds.) Static Analysis—19th International Symposium, SAS 2012, Deauville, France, September 11–13, 2012. Proceedings, Lecture Notes in Computer Science, vol. 7460, pp. 129–145. Springer (2012). https://doi.org/10.1007/978-3-642-33125-1_11
19. Giacobazzi, R., Mastroeni, I.: Making abstract models complete. Mathematical Structures in Computer Science 26(4), 658–701 (2016). https://doi.org/10.1017/S0960129514000358
20. Giacobazzi, R., Quintarelli, E.: Incompleteness, counterexamples and refinements in abstract model-checking. In: Proceedings of SAS 2001, 8th International Static Analysis Symposium, Lecture Notes in Computer Science, vol. 2126, pp. 356–373. Springer (2001). https://doi.org/10.1007/3-540-47764-0_20
21. Giacobazzi, R., Ranzato, F.: Completeness in abstract interpretation: a domain perspective. In: Johnson, M. (ed.) Proc. of the 6th Internat. Conf. on Algebraic Methodology and Software Technology (AMAST '97), Lecture Notes in Computer Science, vol. 1349, pp. 231–245. Springer-Verlag (1997)
22. Giacobazzi, R., Ranzato, F.: Incompleteness of states w.r.t. traces in model checking. Inf. Comput. 204(3), 376–407 (2006). https://doi.org/10.1016/j.ic.2006.01.001
23. Giacobazzi, R., Ranzato, F.: History of abstract interpretation. IEEE Ann. Hist. Comput. 44(2), 33–43 (2022)
24. Giacobazzi, R., Ranzato, F., Scozzari, F.: Building complete abstract interpretations in a linear logic-based setting. In: Levi, G. (ed.) Static Analysis, Proceedings of the Fifth International Static Analysis Symposium SAS 98, Lecture Notes in Computer Science, vol. 1503, pp. 215–229. Springer-Verlag (1998)
25. Giacobazzi, R., Ranzato, F., Scozzari, F.: Complete abstract interpretations made constructive. In: Brim, L., Gruska, J., Zlatuška, J. (eds.) Proc. of the 23rd Internat. Symp. on Mathematical Foundations of Computer Science (MFCS '98), Lecture Notes in Computer Science, vol. 1450, pp. 366–377. Springer-Verlag (1998)
26. Giacobazzi, R., Ranzato, F., Scozzari, F.: Making abstract interpretation complete. Journal of the ACM 47(2), 361–416 (2000). https://doi.org/10.1145/333979.333989
27. Giacobazzi, R., Ranzato, F., Scozzari, F.: Making abstract domains condensing. ACM Transactions on Computational Logic 6(1), 33–60 (2005). https://doi.org/10.1145/1042038.1042040
28. Hoare, C.: An axiomatic basis for computer programming. Comm. of The ACM 12(10), 576–580 (1969)
29. Hoare, C.A.R.: An axiomatic basis for computer programming. Commun. ACM 12(10), 576–580 (1969)
30. Kozen, D.: Kleene algebra with tests. ACM Trans. Program. Lang. Syst. 19(3), 427–443 (1997)
31. Kozen, D.: On Hoare logic and Kleene algebra with tests. ACM Trans. Comput. Logic 1(1), 60–76 (2000)
32. Mastroeni, I., Banerjee, A.: Modelling declassification policies using abstract domain completeness. Mathematical Structures in Computer Science 21(6), 1253–1299 (2011). https://doi.org/10.1017/S096012951100020X
33. O'Hearn, P.W.: Incorrectness logic. Proc. ACM Program. Lang. 4(POPL), 10:1–10:32 (2020). https://doi.org/10.1145/3371078
34. Ranzato, F.: On the completeness of model checking. In: Sands, D. (ed.) Proc. of the European Symp. on Programming (ESOP'01), Lecture Notes in Computer Science, vol. 2028, pp. 137–154. Springer-Verlag (2001)
35. Ranzato, F.: Complete abstractions everywhere. In: Proceedings of the 14th International Conference on Verification, Model Checking, and Abstract Interpretation, VMCAI 2013, Lecture Notes in Computer Science, vol. 7737, pp. 15–26. Springer (2013)
36. Ranzato, F., Tapparo, F.: Strong preservation as completeness in abstract interpretation. In: Proceedings of ESOP 2004, 13th European Symposium on Programming, Lecture Notes in Computer Science, vol. 2986, pp. 18–32. Springer (2004). https://doi.org/10.1007/978-3-540-24725-8_3
37. Ranzato, F., Tapparo, F.: An abstract interpretation-based refinement algorithm for strong preservation. In: Halbwachs, N., Zuck, L. (eds.) Proceedings of TACAS 2005, Tools and Algorithms for the Construction and Analysis of Systems, Lecture Notes in Computer Science, vol. 3440, pp. 140–156. Springer-Verlag (2005)
38. Ranzato, F., Tapparo, F.: Generalized strong preservation by abstract interpretation. J. Log. Comput. 17(1), 157–197 (2007). https://doi.org/10.1093/logcom/exl035
39. Rice, H.: Classes of recursively enumerable sets and their decision problems. Trans. Amer. Math. Soc. 74, 358–366 (1953)
40. Rival, X., Yi, K.: Introduction to Static Analysis—An Abstract Interpretation Perspective. MIT Press (2020)
41. Winskel, G.: The Formal Semantics of Programming Languages: an Introduction. MIT Press (1993)

Chapter 9

The Top-Down Solver—An Exercise in A²I

Sarah Tilscher, Yannick Stade, Michael Schwarz, Ralf Vogler, and Helmut Seidl

Abstract The top-down solver TD is a convenient local generic fixpoint engine which is at the heart of static analysis frameworks such as Ciao and Goblint. Here, we show how Patrick Cousot's idea of applying analysis to the analyzer itself allows us to derive advanced versions of TD from a recursive descent fixpoint algorithm. A run of that fixpoint algorithm provides us with a trace whose dynamic analysis allows us not only to identify semantic dependencies between unknowns on-the-fly, but also to choose appropriate widening/narrowing points. It is thus not only the sequence of iterates for individual unknowns which is taken into account, but the global trace of the fixpoint algorithm itself.

9.1 Introduction

In a series of papers, Muthukumar and Hermenegildo [20, 21, 26] and later also Charlier and Van Hentenryck [6] introduce and develop the local generic solver TD. The original development, while originally targeted at Prolog, later turned out to be useful for imperative programs as well [16] and was also applied to other languages via translation to CLP programs [23, 24]. A variant of TD is still at the heart of the program analysis system Ciao [17, 18] and is also used by Goblint for analyzing multi-threaded C code [27, 29]. Inspired by the top-down evaluation strategy for query-answering of Datalog and Prolog programs, the fixpoint algorithm TD allows for a demand-driven exploration of large, if not infinite, systems of equations. Such systems arise, e.g., for the context-sensitive static analysis not only of Prolog programs, but also of procedural languages such as C [9]. In the latter case, summaries for predicates or procedures are represented by partial tabulation of abstract effects for the abstract calling contexts encountered during fixpoint iteration. In the formulation of [15], the top-down solver is further extended to conveniently support widening and narrowing [2] and also side-effects [28]. Technically, side-effects in the system of equations correspond to assert predicates in Prolog. While evaluating the right-hand side of an equation x = f_x, not only the contribution to the queried unknown x of the left-hand side is calculated, but also extra contributions to other unknowns of the system [1]. This facility allows us to conveniently combine context-sensitive analyses with context- and flow-insensitive analyses (see [28] for a more extensive discussion). The solver TD is not only practically efficient, but also generic, i.e., it operates on systems of equations x = f_x, x ∈ X, where each right-hand side function f_x is considered a black-box. Consequently, preprocessing of the system is neither necessary nor possible since, e.g., the unknowns queried during the evaluation of f_x are only revealed when evaluating f_x for a given assignment of unknowns to their (current) abstract values. Therefore, TD comes with a mechanism to track queried unknowns and thus detect dependencies between unknowns on-the-fly. In [28], this kind of self-monitoring was further employed to dynamically detect widening/narrowing points. These additions and extensions, though, blur the conceptual simplicity of the solver and make it difficult to understand and argue about its correctness. At this point, the idea of Abstract² Interpretation (A²I), i.e., applying abstract interpretation to the abstract interpreter itself, comes in handy. In [13], Cousot et al. present two variants of this idea, an offline pre-analysis and an online monitoring version. Online monitoring can, e.g., be used to conveniently realize flexible widening and narrowing strategies [7]. Here, we follow the online monitoring approach: starting from an inefficient, yet simple solver, we introduce the notion of a solver trace and show how a vanilla version of TD can be understood as an optimization of the simple solver that uses information obtained from a homomorphic abstraction of the reaching left-context of the solver trace. We also use this idea to further improve the efficiency by additionally keeping track of access-result pairs to replace re-evaluations of right-hand sides with look-ups. These two optimizations neither increase the applicability of the solver nor affect the analysis result. Finally, we add widening and narrowing iterations to the solver. Now, runtime monitoring is additionally used to automatically detect those unknowns where widening and narrowing should be employed (W/N points). We also indicate how this automatic detection (together with subsequent dynamic removal) improves the precision of the solver compared to an iteration with a fixed set of W/N points.


9.2 Getting Started

Systems of equations as they (conceptually) arise for the abstract interpretation of programs or systems are often given by some syntactical representation such as control-flow graphs [3, 4, 14, 29] or syntax trees [12, 25]. Here, we take a more abstract point of view and just assume that an equation system eq is a function that provides for each unknown x ∈ X a right-hand side that captures the relation of x to all other unknowns. Hence, the right-hand side eq x is a function of type (X → D) → D, where D is the set of possible abstract values for unknowns from X. A solution of the system eq is a mapping σ : X → D such that

σ x = eq x σ    (9.1)

holds for all x ∈ X. Not every system eq will have a solution. Assuming that there is a solution σ, we might want to query the solution for the value of a particular unknown x. The most naive attempt to do so is via the local recursive function query:
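Since the corresponding listing is not reproduced in this text, here is a minimal sketch consistent with the description, assuming eq : X → (X → D) → D as above:

  (* Naive demand-driven solver: no memoization, no termination checks. *)
  let solve eq x =
    let rec query y = eq y query in
    query x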

The function query recursively explores the system of equations to determine the values of right-hand sides. Accordingly, the call solve eq x will return the value σ x—whenever the call terminates. Termination, though, is a major issue here. For convenience, let us assume that the function eq terminates for all arguments. A necessary condition for solve eq x to terminate is that, whenever a call query y is in progress, no further call query y is encountered. System of Equations For X = D = Z, consider the system of equations

representing the Fibonacci numbers. Then solve eq n queries at most |n| + 1 unknowns and will always terminate. For the system of equations represented by the call solve eq x will never terminate.
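The two systems can be sketched as follows (our reconstruction; the original definitions may differ in details):

  (* Fibonacci-like system: solve eq_fib n terminates and queries
     at most |n| + 1 unknowns. *)
  let eq_fib n query = if n <= 1 then n else query (n - 1) + query (n - 2)

  (* Self-referential system: solve eq_loop x never terminates, since the
     right-hand side of x queries x itself. *)
  let eq_loop x query = query x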

To check that no unknown is recursively queried, and thus auto-detect non-termination, we may equip the recursive local function query with an extra argument called that keeps track of the set of unknowns whose evaluation has been started and not yet completed. Indeed, this information can be considered as an abstraction of the call-stack of the solver or, more generally, the solver state. Solver states can be maintained explicitly, e.g., by a state-transformer monad M_S(D) = S → S × D, where S is an arbitrary set of states. For extensions of solvers with arbitrary sets of states to work for a given system of equations eq, we demand eq to be ignorant of any kind of solver state, i.e., to be pure [19, 22]. This means that eq should be of type

eq : X → (X → M(D)) → M(D)    (9.2)

for any monad M. Intuitively, changes to the solver state during the evaluation of eq x are only due to encountered look-ups of unknowns. Under this assumption, the function eq x can be represented as a computation tree, namely, a value of a type (X, D) ct as sketched below. Here, the tree A d represents the result value d, while Q (x, f) represents the query of an unknown x whose value is then passed to the continuation function f [19, 22]. Each computation tree can be evaluated by a function eval : (X, D) ct → (X → M(D)) → M(D).
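A plausible OCaml rendering of this type (the original declaration is not reproduced here):

  type ('x, 'd) ct =
    | A of 'd                        (* answer: return the value d *)
    | Q of 'x * ('d -> ('x, 'd) ct)  (* query x, pass its value to f *)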

For the particular case of a state-transformer monad M = S → S × D, eval can be defined as sketched below.
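The following sketch (our reconstruction) assumes that get : X → S → S × D answers queries while threading the state:

  (* eval t get : S -> S * D interprets the computation tree t,
     threading the solver state through every query. *)
  let rec eval t get s =
    match t with
    | A d -> (s, d)
    | Q (x, f) ->
        let (s', d) = get x s in
        eval (f d) get s'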

Computation Trees. The computation trees for the two equation systems from Sect. 9.2 are given by ct1 n and ct2 x, respectively, sketched below.
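Plausible renderings of the two trees, reconstructed from the systems of Sect. 9.2:

  let ct1 n =
    if n <= 1 then A n
    else Q (n - 1, fun a -> Q (n - 2, fun b -> A (a + b)))

  let ct2 x = Q (x, fun d -> A d)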

Evaluating these trees with eval recovers exactly the behavior of the corresponding right-hand-side functions from Sect. 9.2.


To let the function query track the set of unknowns that have already been called, we set S = X Set.t and define the function solve as in Listing 1. This function improves on the original formulation in that it detects when non-termination occurs due to recursively querying the same unknown. In that case, an exception is raised. Non-termination of the algorithm may still occur when an infinite number of unknowns need to be queried.
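As Listing 1 is not reproduced here, the following sketch matches the description; the set called is threaded as an explicit state of type S = X Set.t, with X = int only for illustration:

  module XSet = Set.Make (Int)

  exception Cycle

  (* eq : X -> (X -> S -> S * D) -> S -> S * D in this monadic formulation. *)
  let solve eq x =
    let rec query y called =
      if XSet.mem y called then raise Cycle
      else
        let (called', d) = eq y query (XSet.add y called) in
        (XSet.remove y called', d)
    in
    snd (query x XSet.empty)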

Writing solver algorithms using state monads to pass solver states around, though, is rather verbose and hard to read—in particular when the state is more complicated. Since only one instance of the state needs to be passed through the program, we prefer an OCaml-like implementation with one mutable state. In this formulation, the OCaml type of eq is again X → (X → D) → D, while it is implicitly assumed that eq must be ignorant of the solver state. The resulting function solve is shown in Listing 2.
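Again, since the listing itself is not reproduced, a sketch consistent with the description is:

  module XSet = Set.Make (Int)

  exception Cycle

  (* One mutable set replaces the threaded state;
     eq has type X -> (X -> D) -> D again. *)
  let solve eq x =
    let called = ref XSet.empty in
    let rec query y =
      if XSet.mem y !called then raise Cycle;
      called := XSet.add y !called;
      let d = eq y query in
      called := XSet.remove y !called;
      d
    in
    query x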


9.3 Adding Fixpoint Iteration

Solving systems of equations arising from abstract interpretation typically requires (some form of) fixpoint iteration. Therefore, we extend the solver from Listing 2 in two ways. First, we add a local function iterate to repeatedly evaluate right-hand sides until a fixpoint is reached. Second, we extend the solver state to additionally keep track of the current values of unknowns. Thus, the solver state is an element of X Set.t × (X, D) Map.t, where the function solve is defined as in Listing 3.
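A sketch consistent with this description (our reconstruction, with X = int, a default value bot for ⊥, and structural equality on D; the original listing may differ):

  module XSet = Set.Make (Int)
  module XMap = Map.Make (Int)

  let solve eq bot x0 =
    let called = ref XSet.empty in
    let sigma = ref XMap.empty in
    let find y = try XMap.find y !sigma with Not_found -> bot in
    let rec iterate y =
      let d = eq y query in
      if d <> find y then begin
        sigma := XMap.add y d !sigma;
        iterate y
      end
    and query y =
      if not (XSet.mem y !called) then begin
        called := XSet.add y !called;
        iterate y;
        called := XSet.remove y !called
      end;
      find y
    in
    ignore (query x0);
    !sigma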

Values from (X, D) Map.t represent partial mappings from X to D. Here, we assume that there is a dedicated default value ⊥ ∈ D so that the function find_default ⊥ returns ⊥ for every unknown x for which no value is recorded. Also, instead of returning the value for the initial query, the solver returns a partial map σ wherein the values of all queried unknowns have been collected. This mapping σ, when restricted to the set of unknowns currently contributing to answering the initial query x, represents a partial solution to the system. This set of contributing unknowns can be determined after solving by recursively walking through the equation system eq starting from x and collecting all unknowns that are accessed when evaluating the right-hand sides with the found solution σ, as sketched below.
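This post-processing step can be sketched as follows (our illustration, reusing the XSet/XMap modules and the default value bot of the previous sketch):

  (* Collect all unknowns contributing to x by re-running right-hand
     sides under the final map sigma and recording every access. *)
  let contributing eq bot sigma x =
    let seen = ref XSet.empty in
    let rec visit y =
      if not (XSet.mem y !seen) then begin
        seen := XSet.add y !seen;
        ignore
          (eq y (fun z ->
               visit z;
               try XMap.find z sigma with Not_found -> bot))
      end
    in
    visit x;
    !seen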


The solver from Listing 3 has multiple inefficiencies—but it provides us with the notion of a fixpoint iteration trace, which will be the basis for applying A²I. Subsequent improvements can be considered to execute the same trace, but avoid certain unnecessary re-evaluations. A trace of the solver from Listing 3 can be represented as an element of the type (X, D, (X, D) Map.t) solve, sketched below.
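A plausible shape for these trace types (the exact constructor arguments of the original declaration are not reproduced here):

  (* A Query node records the queried unknown, the sequence of iterations
     performed for it, and the returned value; an Iterate node records the
     queries encountered while evaluating eq y plus the value of eq y. *)
  type ('x, 'd) query = Query of 'x * ('x, 'd) iterate list * 'd
  and ('x, 'd) iterate = Iterate of ('x, 'd) query list * 'd

  (* The root records the initially queried unknown and the final map. *)
  type ('x, 'd, 'm) solve = Solve of 'x * ('x, 'd) query * 'm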

Nodes Iterate and Query of such a trace correspond to calls to the functions iterate and query, respectively, while actual parameters, as well as global variables, recursive calls and the returned value, are represented by child nodes.¹ Since iterate is a tail-recursive function, the corresponding nodes inside the trace for a query are collected into a list. Likewise, the evaluation of eq y for some unknown y is considered to tail-recursively explore the corresponding computation tree. Therefore, the sequence of encountered queries is collected into a list as well. Although iterate does not return a value, we include the value returned by the call to eq y as an extra child of the node Iterate. The solver state attained when reaching a particular node of the trace can be considered as an attribute of that node that is computed on-the-fly by a (conceptual) top-down left-to-right traversal over the trace. This state only depends on the left-context of the node. Left-contexts collect complete sub-trees in reverse order during the traversal, where appropriate constructors indicate descents in the trace. Technically, they can be described by values of the following type:
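A matching sketch of the left-context type (again our reconstruction, using the trace types sketched above):

  type ('x, 'd) context =
    | S_context of 'x
    | I_context of 'x * ('x, 'd) iterate list * ('x, 'd) context
    | Q_context of 'x * ('x, 'd) query list * ('x, 'd) context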

Only well-formed values may occur as left-contexts. These consist of an S_context value possibly inside a nested sequence of alternating I_context and Q_context constructors starting with an I_context.

¹ In graphs, we restrict children to recursive calls only.


Trace of a Solver. Consider, e.g., a trace of the solver together with its corresponding tree representation (trace and tree figures omitted). The left-context (gray) of the second Iterate node (orange) inside the query for unknown z then consists of the already traversed sub-trees together with the context elements for the enclosing Query and Iterate nodes (figure omitted).

When descending into an Iterate node, a new Q_context element is pushed onto the reaching left-context. Likewise, an I_context is pushed when descending into a Query node. When proceeding in the trace along a sequence of iterates or queries, the traversed sub-trees are pushed into the list inside the topmost context element.

The left-context kept in the variable state reaching a particular program point inside the solver can be computed with a modification of the basic solver (Listing 4). The auxiliary predicates pred_Q and pred_I define the exit conditions of the functions query and iterate, respectively, while the auxiliary functions ret_Q and result determine the return values for query and the final result returned by the solver, respectively. In order for this algorithm to calculate left-contexts for each point of the program, it requires a basic function init for creating the initial context together with a function up_S for returning from the initial call to solve. For iterate, we introduce down_I, next_I and up_I, which are called when iterate is entered, when iterate proceeds to the next tail-recursive call to iterate, and when the last call to iterate for a particular unknown returns, respectively. The functions down_Q, up_Q and skip_Q update the left-context whenever the solver enters, exits, or immediately returns from the function query. The auxiliary functions are specified as transformers on left-contexts, pushing and popping context elements as described above.

We omitted the cases for input contexts that should not occur during solving. These, therefore, raise exceptions. Let us call the resulting algorithm a context solver. In practical applications, the program state does not consist of the left-context itself—but of some abstraction thereof. For a set S of states, this abstraction is given by functions

init♯ : X → S                      up♯_S : X → S → S
down♯_I : X → S → S                down♯_Q : X → S → S
next♯_I : X → D → S → S            up♯_Q : X → X → D → S → S
up♯_I : X → D → S → S              skip♯_Q : X → X → D → S → S    (9.3)


These functions encode how the reaching state is calculated. This state may then be inspected at any time. Let us call the solver which is obtained from a context solver by using the functions from (9.3) an abstract context solver.

Instantiation of a Context Solver. Consider the solver algorithm from Listing 3. It represents an abstract context solver where the functions (9.3) are given by corresponding transformers on the state consisting of the set called and the map σ maintained by that solver.


The state for a left-context is obtained by the function h, using the auxiliary functions h_S, h_Q, h_I, h_I_list as state transformers for (lists of) sub-trees of traces:

h_S : S → (X, D) solve → S
h_Q : S → (X, D) query → S
h_I : (X → D → S → S) → S → (X, D) iterate → S
h_I_list : S → (X, D) iterate list → S
h : (X, D) context → S

These are defined by structural recursion over (lists of) sub-trees and left-contexts.

Proposition 9.1 Let c be a well-formed left-context such that h c is defined.

1. If t is a trace of type (X, D) query, and c = Q_context (y, qq, c′), then
   h_Q (h c) t = h (Q_context (y, t :: qq, c′))
2. If t is a trace of type (X, D) iterate, and c = I_context (y, ii, c′), then
   h_I next♯_I (h c) t = h (I_context (y, t :: ii, c′))
3. If the application of the respective function on the right-hand side is defined, then

   init♯ x = h (init x)                     up♯_S x (h c) = h (up_S x c)
   down♯_I y (h c) = h (down_I y c)         next♯_I y d (h c) = h (next_I y d c)
   up♯_I y d (h c) = h (up_I y d c)         down♯_Q z (h c) = h (down_Q z c)
   up♯_Q y z d (h c) = h (up_Q y z d c)     skip♯_Q y z d (h c) = h (skip_Q y z d c)

holds for all unknowns x, y, z ∈ X and d ∈ D.

In light of Proposition 9.1, we obtain the main theorem for abstract context solvers:

Theorem 9.1
1. During its execution, the context solver will only construct well-formed left-contexts and never raise an exception.
2. Assume that for every left-context c where h c is defined,

ret♯_Q x (h c) = ret_Q x c
result♯ x (h c) = result x c
pred♯_I (h c) = pred_I c
pred♯_Q (h c) = pred_Q c    (9.4)

holds. Then the abstract context solver terminates if and only if the context solver terminates. In this case, both return the same value.

In light of property (9.4) in Theorem 9.1, we call the abstraction h of the context, as computed by the abstract context solver, a homomorphic abstraction. It turns out that not only the simple solver from Listing 3 but also the more elaborate solvers from the subsequent sections are optimized versions of abstract context solvers.

9.4 The Top-Down Solver TD

By exploiting the potential of monitoring the behavior of the solver, we add optimizations that allow us to prune re-computations which would consume time without providing different results. This is the case for queries to an unknown y which has already been iterated on until stabilization, while none of the unknowns possibly contributing to the value of y have changed their value. A key issue for this optimization is to equip the solving routines to track dependencies between unknowns. Therefore, the solver state is extended with data structures stable : X Set.t and infl : (X, X Set.t) Map.t. The set stable should contain those unknowns whose value need not be recomputed but can directly be found in the map σ. For each unknown z, the map infl keeps track of the set of unknowns y in stable which may be directly influenced by z. To keep these data structures consistent, an auxiliary function destab is provided which is called for z whenever the value of z in σ changed. The function destab recursively walks through the map infl and removes all encountered unknowns from stable while resetting their infl sets to the empty set. The code of the resulting solver is shown in Listing 4, where the code sections that differ from the solver in Listing 3 are highlighted. For the map infl we assume Set.empty as the default value. Influences from z to y are recorded by the function query for the unknown z inside an iteration on y, before returning the value for z. It is worth noting that influences are only recorded after the iteration on z has terminated, implying that the intermediate changes to the value of z need not result in a destabilization of the surrounding unknown y. The function iterate now immediately terminates when the argument unknown y is contained in stable and returns the value stored in σ for y. If y is not contained in stable, y is optimistically added to the set stable. This means that the right-hand side of y need not be re-evaluated whenever the value of y does not recursively depend on y itself: only in the latter case, the call eq y may remove y from stable. In terms of runtime monitoring according to A²I, this solver is best understood as being obtained in two steps. First, an abstract context solver is constructed where the set of states is given as

S = X Set.t × X Set.t × (X, X Set.t) Map.t × (X, D) Map.t

corresponding to the set called, the set stable, the map infl, and the map σ, respectively, together with corresponding functions (9.3) on this state.
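A sketch consistent with this description (our reconstruction in the spirit of Listing 4, with X = int, a default value bot, and structural equality on D; the original listing may differ):

  module XSet = Set.Make (Int)
  module XMap = Map.Make (Int)

  let solve eq bot x0 =
    let called = ref XSet.empty and stable = ref XSet.empty in
    let infl = ref XMap.empty and sigma = ref XMap.empty in
    let find y = try XMap.find y !sigma with Not_found -> bot in
    let find_infl y = try XMap.find y !infl with Not_found -> XSet.empty in
    let rec destab y =
      let ys = find_infl y in
      infl := XMap.add y XSet.empty !infl;
      XSet.iter
        (fun z ->
           if XSet.mem z !stable then begin
             stable := XSet.remove z !stable;
             destab z
           end)
        ys
    in
    let rec iterate y =
      if XSet.mem y !stable then find y
      else begin
        stable := XSet.add y !stable;     (* optimistically assume stable *)
        let d = eq y (query y) in
        if d <> find y then begin
          sigma := XMap.add y d !sigma;
          destab y                        (* invalidate influenced unknowns *)
        end;
        iterate y
      end
    and query y z =
      if not (XSet.mem z !called) then begin
        called := XSet.add z !called;
        ignore (iterate z);
        called := XSet.remove z !called
      end;
      (* record the influence z -> y only after iteration on z finished *)
      infl := XMap.add z (XSet.add y (find_infl z)) !infl;
      find z
    in
    called := XSet.add x0 !called;
    ignore (iterate x0);
    !sigma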


The auxiliary function destab′ is like the function destab, only that it computes on pairs of sets and maps directly, instead of using mutable data structures, as sketched below.
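A sketch of destab′ (our reconstruction, on a pair of the set stable and the map infl):

  (* Purely functional destabilization on a pair (stable, infl). *)
  let rec destab' y (stable, infl) =
    let ys = try XMap.find y infl with Not_found -> XSet.empty in
    let start = (stable, XMap.add y XSet.empty infl) in
    XSet.fold
      (fun z (stable, infl) ->
         if XSet.mem z stable then destab' z (XSet.remove z stable, infl)
         else (stable, infl))
      ys start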

In the second phase, the information provided by the state is used to prune further iterations on unknowns y which are found to still be in stable.

9.5 The Top-Down Solver with Tabulation

The solver from Listing 4 tries to avoid iteration on unknowns y when none of the unknowns directly or indirectly queried during the evaluation of the right-hand side for y have changed their value, i.e., when y is in stable. Destabilization, however, is done recursively. Therefore, y might have been removed from stable due to a change to another unknown z, even though that change has not propagated to any of the unknowns directly accessed in the right-hand side of y. Thus, y would still be amenable to re-iteration—even when a re-evaluation of its right-hand side will definitely return the same value. To avoid this inefficiency, the re-evaluation of the right-hand side of y is replaced with a look-up if the values of all accessed unknowns are still the same as during the previous evaluation. To enable this look-up, the implementation records for each unknown y the pair (args, d) consisting of the accesses args, i.e., the sequence of unknowns accessed during the evaluation of the right-hand side together with their values, and the returned result d. Clearly, recording this information introduces overhead which only pays off if calculations on abstract values performed during the evaluation of right-hand sides are costly. The corresponding addition to the top-down solver from Listing 4 is shown in Listing 5. Essentially, we introduce an additional data structure dep for recording access-result pairs, and instrument eq so that the pair (z, a) is added to the list of accesses in dep y whenever query z is called during the evaluation of the right-hand side of y with return value a. Furthermore, we provide a wrapper function for the function representing the equation system, while the rest of the solver stays the same. The wrapper function, when called for y, first checks whether the right-hand side for y has already been evaluated. If not, a first evaluation is initiated, and the pair of the list args of encountered accesses together with the resulting value d is stored in dep for y. If, on the other hand, a pair (args, d) is found in dep for y, it is checked (from left to right) for each pair (z, a) in args whether the query to z still returns a. If this is the case, the value d is returned, and the previous value of dep for y is restored. If this is not the case, then a fresh evaluation of the right-hand side for y is initiated. We remark that this results in potentially re-considering the prefix of unknowns that have already been considered. Their query, though, will find each of them already in stable, implying that their values are looked up without further iteration.
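The wrapper can be sketched as follows (our reconstruction in the spirit of Listing 5; dep maps an unknown to its recorded list of (z, a) accesses together with the result d, reusing XMap from the earlier sketches):

  let dep = ref XMap.empty  (* y -> (list of (z, a) accesses, result d) *)

  let eq_tab eq y query =
    let recorded = try Some (XMap.find y !dep) with Not_found -> None in
    match recorded with
    | Some (args, d) when List.for_all (fun (z, a) -> query z = a) args ->
        d  (* every recorded access is unchanged: reuse the old result *)
    | _ ->
        (* (re-)evaluate, recording the accesses from left to right *)
        let args = ref [] in
        let query' z =
          let a = query z in
          args := (z, a) :: !args;
          a
        in
        let d = eq y query' in
        dep := XMap.add y (List.rev !args, d) !dep;
        d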

9.6 Introducing Widening and Narrowing

While the solvers from Listings 3, 4, and 5 define the same function, we now extend the solvers with widening and narrowing. These operators have been introduced by Patrick and Radhia Cousot to accelerate fixpoint iteration in domains of values with infinite ascending and/or descending chains [8, 10, 11]. Assume that we are given a domain D which is a complete lattice. In the original setting, a global widening iteration is proposed to arrive at an assignment σ so that

σ x ⊒ eq x σ    (9.5)

holds for all x ∈ X, which subsequently is improved by a global narrowing iteration which, when all right-hand sides eq x are monotonic, is guaranteed to result in an assignment still satisfying property (9.5). Widening and narrowing have later been generalized to semantic widening and narrowing where, after termination, soundness is guaranteed although property (9.5) need no longer hold [10].


For simplicity, we stick to the original notion of widening and narrowing with monotonic right-hand sides. Our goal, however, is not to compute a total assignment, but a partial mapping σ. This should be defined for the initially queried unknown x as well as for all unknowns contributing to the value of x, so that property (9.5) is satisfied for all these unknowns y. This has been called a partial post-solution of the system of equations (see [28] for an extensive discussion). The code of the resulting solver is shown in Listing 7, where the code sections different from the vanilla solver in Listing 5 are highlighted. Compared to the vanilla solver from Listing 5, the solver state has been extended with the set point of unknowns where widening and narrowing are going to be applied. These unknowns are detected by the function query, which inserts its argument y into point when y is already contained in called. Otherwise, query remains unchanged, only that now the call to iterate receives the extra argument W—indicating that iteration should start with the widening phase. The function destab remains unchanged.


The main change concerns the function iterate, which must take into account whether its argument y is contained in point or not. If y is not contained in point, iterate proceeds as the iterate of the solver from Listing 4. Otherwise, the reached phase of iteration is taken into account. Assume that wn equals W, indicating that the iteration for y is in the widening phase. If y is found to be in stable, iteration, therefore, is not yet complete. Instead, y is removed from stable for iterate to be called with arguments N and y, to indicate that iteration for y should now proceed with the narrowing phase. If y is not contained in stable, then y is put into stable, eq y query is evaluated, and the old value for y, as stored in σ, is combined with the new value by means of the widening operator ∇. The result is compared with the old value for y from σ. If they are the same, y is removed from stable, and iteration on y proceeds with the narrowing phase by calling iterate N y. Otherwise, the value for y in σ is updated, destab is called for y, and the widening iteration proceeds with the call iterate W y. Assume now that wn equals N, i.e., the iteration for y has reached the narrowing phase. If y is found to be in stable, y is removed from point, iteration terminates, and the latest value for y is returned. If y is not contained in stable, then y is put into stable, and eq y query is evaluated, as in the widening phase. The old value for y, as stored in σ, however, is now combined with the new value by means of the narrowing operator Δ. The result is compared with the old value for y from σ. If they are the same, y is removed from point, and iteration terminates with the latest value for y. Otherwise, the value for y in σ is updated, destab is called for y, and the narrowing iteration proceeds with the call iterate N y.
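The described case analysis can be sketched as the following fragment (our reconstruction, not the original Listing 7: iterate_plain stands for the iterate of Sect. 9.4, while stable, sigma, find, destab, eq and query are as in the earlier sketches, and widen/narrow stand for ∇ and Δ):

  type phase = W | N

  let point = ref XSet.empty

  let rec iterate wn y =
    if not (XSet.mem y !point) then iterate_plain y
    else if XSet.mem y !stable then begin
      match wn with
      | W ->
          (* widening not yet complete: switch to the narrowing phase *)
          stable := XSet.remove y !stable;
          iterate N y
      | N ->
          (* narrowing stabilized: y is no longer a W/N point *)
          point := XSet.remove y !point;
          find y
    end
    else begin
      stable := XSet.add y !stable;
      let old = find y in
      let d = eq y (query y) in
      let d' = match wn with W -> widen old d | N -> narrow old d in
      if d' = old then begin
        match wn with
        | W -> stable := XSet.remove y !stable; iterate N y
        | N -> point := XSet.remove y !point; find y
      end
      else begin
        sigma := XMap.add y d' !sigma;
        destab y;
        iterate wn y
      end
    end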


We remark that a similar extension of the top-down solver with widening and narrowing has already been reported in [28]—with two remarkable differences. First, we now determine whether y is in point right at the beginning of the body of iterate (instead of after the evaluation of eq y query), and second, we remove y from point once the iteration is completed. These two seemingly marginal modifications, though, have a significant impact on the precision of the resulting solver. Consider the following C program, which consists of two nested loops:
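The program itself is not reproduced in this text; the following reconstruction is inferred from the bounds discussed below and tabulated in Table 9.1 (outer bound 42 at loop head h1, inner bound 17 at loop head h2):

  int main() {
    int i = 0;
    while (i < 42) {     /* outer loop head h1 */
      int j = 0;
      while (j < 17) {   /* inner loop head h2 */
        j++;
      }
      i++;
    }
    return 0;
  }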

Assume that we perform an intra-procedural interval analysis of this program, where the unknowns are the program points of the body of main. Assume further that we start the analysis by asking for the value of the endpoint of main. The resulting sequences of values determined for the outermost and innermost loop heads h1 and h2, respectively, are tabulated in Table 9.1. During the first call of iterate for h1, the old value of h1 is still ⊥—implying that ⊥ is also computed for all program points inside the body of the outer loop. For the second call of iterate W h1, a sub-iteration on h2 occurs where the value of i is the interval [0, 0] and eventually [0, 17] is computed for j. Since h1 has been identified as a W/N point during the first call to iterate, the new value for i is given by [0, 0] ∇ [0, 1] = [0, ∞]. This value is used for a re-evaluation of h2. At that point, though, h2 is not yet included in point. Therefore, the old value for h2 in σ is over-written with the new value

(i → [0, 41], j → [0, 0]) ⊔ (i → [0, 0], j → [1, 17]) = (i → [0, 41], j → [0, 17])

Subsequent iteration on h2 will not change this value anymore. What follows is a narrowing iteration on h1—which will improve the upper bound of i to 42. However, when unknowns are never removed from point, h2 would still be in point when starting the third outer iteration. Thus, the new value for h2 would be

(i → [0, 0], j → [0, 17]) ∇ (i → [0, 41], j → [0, 17]) = (i → [0, ∞], j → [0, 17])

implying that the upper bound of i, also for h1, is lost. The same loss of information occurs when W/N points are determined statically, e.g., as in [5].


Table 9.1 The values attained for the loop heads h1, h2

Iteration h1 | h1           | Iteration h2 | h2
1            | ⊥            |              |
2            | i → [0, 0]   | 1            | i → [0, 0], j → [0, 0]
             |              | 2            | i → [0, 0], j → [0, ∞]
             |              | 3            | i → [0, 0], j → [0, ∞]
             |              | 4            | i → [0, 0], j → [0, 17]
3            | i → [0, ∞]   | 1            | i → [0, 41], j → [0, 17]
             |              | 2            | i → [0, 41], j → [0, 17]
4            | i → [0, ∞]   | 1            | i → [0, 41], j → [0, 17]
5            | i → [0, 42]  | 1            | i → [0, 41], j → [0, 17]
6            | i → [0, 42]  |              |

9.7 Conclusion

We have considered a sequence of successively more intricate local solver algorithms. The solvers from Sects. 9.3, 9.4 and 9.5 were all equivalent, but differed in their practical efficiency. In order to understand the more intricate variants, we introduced the concept of a trace for the base solver, where nodes correspond to encountered function calls. We indicated how the solver state attained at a function call can be obtained through a homomorphic abstraction of the left-context of the corresponding node in the trace. We then explained what kind of analysis is required to trigger the optimizations corresponding to the more efficient solver variants. For the vanilla top-down solver TD, we recorded the set stable of unknowns whose query would return identical values and also recorded the encountered influences between unknowns. That allowed us to prune the iteration on stable unknowns. For the top-down solver with tabulation, we additionally kept track of access-result pairs for unknowns to avoid re-evaluation of the corresponding right-hand sides. Finally, in Sect. 9.6, we provided an extension of TD with widening and narrowing. The key issue there was to use the solver state to auto-detect W/N points on-the-fly. It remains future work to formalize the given approach within a theorem prover to arrive at machine-checked solver algorithms. In order to capture the solvers as used in the Goblint system, it would also be necessary to deal with side-effecting equation systems as considered in [1, 28].


Acknowledgements This work was supported in part by Deutsche Forschungsgemeinschaft (DFG)—378803395/2428 ConVeY.

References

1. Apinis, K., Seidl, H., Vojdani, V.: Side-effecting constraint systems: a swiss army knife for program analysis. In: Jhala, R., Igarashi, A. (eds.) Programming Languages and Systems—10th Asian Symposium, APLAS 2012, Proceedings, LNCS, vol. 7705, pp. 157–172. Springer (2012)
2. Apinis, K., Seidl, H., Vojdani, V.: Enhancing top-down solving with widening and narrowing. In: Probst, C.W., Hankin, C., Hansen, R.R. (eds.) Semantics, Logics, and Calculi—Essays Dedicated to Hanne Riis Nielson and Flemming Nielson on the Occasion of Their 60th Birthdays, LNCS, vol. 9560, pp. 272–288. Springer (2016)
3. Baudin, P., Bobot, F., Bühler, D., Correnson, L., Kirchner, F., Kosmatov, N., Maroneze, A., Perrelle, V., Prevosto, V., Signoles, J., Williams, N.: The dogged pursuit of bug-free C programs: the frama-c software analysis platform. Commun. ACM 64(8), 56–68 (2021)
4. Blazy, S., Bühler, D., Yakobowski, B.: Structuring abstract interpreters through state and value abstractions. In: Bouajjani, A., Monniaux, D. (eds.) Verification, Model Checking, and Abstract Interpretation—18th International Conference, VMCAI 2017, Proceedings, LNCS, vol. 10145, pp. 112–130. Springer (2017)
5. Bourdoncle, F.: Efficient chaotic iteration strategies with widenings. In: Bjørner, D., Broy, M., Pottosin, I.V. (eds.) Formal Methods in Programming and Their Applications, International Conference, 1993, Proceedings, LNCS, vol. 735, pp. 128–141. Springer (1993)
6. Charlier, B.L., Van Hentenryck, P.: A universal top-down fixpoint algorithm. Tech. rep., Providence, RI, USA (1992)
7. Cousot, P.: Abstracting induction by extrapolation and interpolation. In: D'Souza, D., Lal, A., Larsen, K.G. (eds.) Verification, Model Checking, and Abstract Interpretation—16th International Conference, VMCAI 2015, Proceedings, LNCS, vol. 8931, pp. 19–42. Springer (2015)
8. Cousot, P., Cousot, R.: Abstract interpretation: a unified lattice model for static analysis of programs by construction or approximation of fixpoints. In: Graham, R.M., Harrison, M.A., Sethi, R. (eds.) Conference Record of the Fourth ACM Symposium on Principles of Programming Languages, 1977, pp. 238–252. ACM (1977)
9. Cousot, P., Cousot, R.: Static determination of dynamic properties of recursive procedures. In: Neuhold, E.J. (ed.) Formal Description of Programming Concepts: Proceedings of the IFIP Working Conference on Formal Description of Programming Concepts, 1977, pp. 237–278. North-Holland (1977)
10. Cousot, P., Cousot, R.: Systematic design of program analysis frameworks. In: Aho, A.V., Zilles, S.N., Rosen, B.K. (eds.) Conference Record of the Sixth Annual ACM Symposium on Principles of Programming Languages, 1979, pp. 269–282. ACM Press (1979)
11. Cousot, P., Cousot, R.: Comparing the galois connection and widening/narrowing approaches to abstract interpretation. In: Bruynooghe, M., Wirsing, M. (eds.) Programming Language Implementation and Logic Programming, 4th International Symposium, PLILP'92, Proceedings, LNCS, vol. 631, pp. 269–295. Springer (1992)
12. Cousot, P., Cousot, R., Feret, J., Mauborgne, L., Miné, A., Monniaux, D., Rival, X.: The astreé analyzer. In: Sagiv, S. (ed.) Programming Languages and Systems, 14th European Symposium on Programming, ESOP 2005, Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2005, Proceedings, LNCS, vol. 3444, pp. 21–30. Springer (2005)
13. Cousot, P., Giacobazzi, R., Ranzato, F.: A²I: Abstract² interpretation. Proc. ACM Program. Lang. 3(POPL), 42:1–42:31 (2019)
14. Distefano, D., Fähndrich, M., Logozzo, F., O'Hearn, P.W.: Scaling static analyses at facebook. Commun. ACM 62(8), 62–70 (2019)
15. Fecht, C., Seidl, H.: A faster solver for general systems of equations. Sci. Comput. Program. 35(2), 137–161 (1999)
16. Hermenegildo, M.: Parallelizing irregular and pointer-based computations automatically: perspectives from logic and constraint programming. Parallel Computing (13–14), 1685–1708 (2000)
17. Hermenegildo, M.V., Bueno, F., Carro, M., López-García, P., Mera, E., Morales, J.F., Puebla, G.: An overview of Ciao and its design philosophy. Theory Pract. Log. Program. 12(1–2), 219–252 (2012)
18. Hermenegildo, M.V., Puebla, G., Bueno, F., López-García, P.: Integrated program debugging, verification, and optimization using abstract interpretation (and the Ciao system preprocessor). Science of Computer Programming 58(1–2), 115–140 (2005)
19. Hofmann, M., Karbyshev, A., Seidl, H.: What is a pure functional? In: Abramsky, S., Gavoille, C., Kirchner, C., Meyer auf der Heide, F., Spirakis, P.G. (eds.) Automata, Languages and Programming, 37th International Colloquium, ICALP 2010, Proceedings, Part II, LNCS, vol. 6199, pp. 199–210. Springer (2010)
20. Muthukumar, K., Hermenegildo, M.: Determination of variable dependence information at compile-time through abstract interpretation. In: North American Conference on Logic Programming, pp. 166–189. MIT Press (1989)
21. Muthukumar, K., Hermenegildo, M.: Compile-time derivation of variable dependency using abstract interpretation. Journal of Logic Programming 13(2/3), 315–347 (1992)
22. Karbyshev, A.: Monadic parametricity of second-order functionals. Ph.D. thesis, Technical University Munich (2013). https://nbn-resolving.org/urn:nbn:de:bvb:91-diss-201309231144371-0-6
23. Henriksen, K.S., Gallagher, J.: Abstract interpretation of PIC programs through logic programming. In: SCAM, pp. 184–196. IEEE Computer Society (2006)
24. Mendez-Lojo, M., Navas, J., Hermenegildo, M.: A flexible (C)LP-based approach to the analysis of object-oriented programs. In: LOPSTR, pp. 154–168. LNCS 4915, Springer (2007)
25. Monat, R., Ouadjaout, A., Miné, A.: A multilanguage static analysis of python programs with native C extensions. In: Dragoi, C., Mukherjee, S., Namjoshi, K.S. (eds.) Static Analysis—28th International Symposium, SAS 2021, Proceedings, Lecture Notes in Computer Science, vol. 12913, pp. 323–345. Springer (2021)
26. Muthukumar, K., Hermenegildo, M.: Deriving a fixpoint computation algorithm for top-down abstract interpretation of logic programs. Tech. Rep. ACT-DC-153-90, Microelectronics and Computer Technology Corporation (MCC), Austin, TX 78759 (1990)
27. Schwarz, M., Saan, S., Seidl, H., Apinis, K., Erhard, J., Vojdani, V.: Improving thread-modular abstract interpretation. In: Dragoi, C., Mukherjee, S., Namjoshi, K.S. (eds.) Static Analysis—28th International Symposium, SAS 2021, Proceedings, LNCS, vol. 12913, pp. 359–383. Springer (2021)
28. Seidl, H., Vogler, R.: Three improvements to the top-down solver. Math. Struct. Comput. Sci. 31(9), 1090–1134 (2021)
29. Vojdani, V., Apinis, K., Rõtov, V., Seidl, H., Vene, V., Vogler, R.: Static race detection for device drivers: the goblint approach. In: Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, ASE 2016, pp. 391–402. ACM (2016)

Chapter 10

Regular Matching with Constraint Programming

Roberto Amadini and Maurizio Gabbrielli

Abstract Regular expressions and related formalisms are nowadays adopted by the most common programming languages, and they are frequently used as patterns for matching strings. These constructs are useful and powerful but, on the downside, they can pose security risks, especially for web applications. The analysis of modern programs therefore cannot be separated from adequate handling of regular patterns. In this paper, we focus on how to tackle string constraints involving regular pattern matching using the constraint programming paradigm, which currently offers limited support for these constraints.

10.1 Introduction

Regular expressions are a powerful, flexible and useful tool for text processing, and more generally for string processing. Nowadays, they are a native feature of established programming languages such as Python, Java and JavaScript, and they are commonly used for data parsing, scraping, wrangling and validation. For example, Table 10.1 shows some JavaScript methods, provided by the RegExp and String objects, for pattern matching and string manipulation via regular expressions. The widespread use of regular expressions arguably coincides with the increasing development of web applications, which make heavy use of string processing both explicitly (i.e., by directly invoking string functions) and implicitly (i.e., by silently converting objects of different types into strings; this is the case, e.g., of JavaScript properties). Unsurprisingly, this led to the development of a number of both static and dynamic string analysis frameworks.



Table 10.1 JavaScript methods for regular expressions. Table from [1]

Method      Description
exec        Executes a search for a match in a string. It returns an array of information or null on a mismatch
test        Tests for a match in a string. It returns true or false
match       Returns an array containing all of the matches, including capturing groups, or null if no match is found
matchAll    Returns an iterator containing all of the matches, including capturing groups
search      Tests for a match in a string. It returns the index of the match, or -1 if the search fails
replace     Executes a search for a match in a string, and replaces the matched substring with a replacement substring
replaceAll  Executes a search for all matches in a string, and replaces the matched substrings with a replacement substring
split       Uses a regular expression or a fixed string to break a string into an array of substrings

Static string analysis concerns the over-approximation of the possible values that a string variable can take at different program points, for any possible input of a target program. This task, in general undecidable, can be handled with different techniques, e.g., automata-based approaches [27] or abstract interpretation [14, 21]. On the other hand, dynamic string analysis under-approximates the set of possible string values occurring at runtime. Therefore, dynamic analyses cannot guarantee to cover all the possible execution traces. However, symbolic expressions can be used to enhance the exploration of feasible program paths [17]. This often requires constraint solving capabilities to reason about the collected path conditions, i.e., the conditions under which the program execution will branch one way or another. If path conditions involve string variables, then proper string solving [2] techniques are required.

The analysis of regular expressions, from both the static and the dynamic points of view, is especially driven by cybersecurity issues. For example, Listing 1 shows a snippet of PHP code from [13], representing a simplified version of a program taken from MyEasyMarket [8]. In line 4, the function preg_replace is used to replace all the substrings of $www matching the pattern defined by the regular expression [^A-Za-z0-9 .-@:\/] with the empty string. That is, $www is sanitized by removing all the substrings matching the specified regular expression.
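The effect of this sanitization is easy to probe with a rough Python analogue (the function name is illustrative; the character class is copied verbatim from the pattern above):

```python
import re

# Inside a character class, ".-@" is a range: it spans every ASCII code
# from '.' (0x2E) to '@' (0x40), hence it also admits '<', '=' and '>'.
ALLOWED = re.compile(r'[^A-Za-z0-9 .-@:\/]')

def sanitize(www: str) -> str:
    # Mimics preg_replace: delete every character NOT in the allowed class.
    return ALLOWED.sub('', www)

print(sanitize('<script>alert(1)</script>'))  # -> '<script>alert1</script>'
```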


The code of Listing 1 contains a potential cross-site scripting (XSS) vulnerability. Indeed, character '<' is not removed by the sanitization: inside the character class, .-@ is a range covering every ASCII code between '.' and '@', so '<' and '>' are allowed characters and a malicious <script> tag survives the replacement.

A dashed string is a concatenation of k > 0 sets S_1^{l_1,u_1} · · · S_k^{l_k,u_k}, called blocks, where S_i ⊆ Σ and 0 ≤ l_i ≤ u_i ≤ λ for i = 1, ..., k, and l_1 + · · · + l_k ≤ λ. Each block S^{l,u} denotes the language γ(S^{l,u}) = {w ∈ S∗ | l ≤ |w| ≤ u}, and each dashed string X = S_1^{l_1,u_1} · · · S_k^{l_k,u_k} denotes γ(X) = {w ∈ γ(S_1^{l_1,u_1}) · · · γ(S_k^{l_k,u_k}) | |w| ≤ λ}. The null block ∅^{0,0} denotes γ(∅^{0,0}) = {ε}. (We use the notation γ(X) instead of L(X) to be consistent with the definition provided in [7] and related papers; note the similarity with Abstract Interpretation: γ maps an "abstract" string X to a set of concrete strings in Σ∗.)

The goal of dashed strings is, on the one hand, to overcome the efficiency issues of automata operations, which often result in state-space explosions, and, on the other hand, to provide a compact representation that avoids a too eager unfolding of a string variable into λ integer variables. Note that eager unfolding is not only problematic when λ is big, but also because it can be less precise w.r.t. the dashed string representation. For example, let us assume Σ = {a,b,c} and λ > 4, and consider the dashed string X = {b,c}^{1,1} {a}^{0,1} {c}^{1,2}, denoting |γ(X)| = 8 strings: {bc, bcc, bac, bacc, cc, ccc, cac, cacc}. Eagerly unfolding X means having integer variables X_1, X_2, X_3, X_4, ..., X_λ, where the domains of X_1, X_2, X_3, X_4 are respectively {b,c}, {a,c}, {ε,a,c}, {ε,a,c}, and X_i = ε for i = 5, ..., λ. But this denotes 4 + 8 + 16 = 28 concrete strings: {ba, bc, ca, cc, baa, bac, caa, cac, cca, ccc, baaa, baac, baca, bacc, ..., cccc}.

To find a feasible solution, suitable dashed string propagators have been defined, with the goal of refining the domain of each string variable into a dashed string representing a single string of Σ∗, if a solution exists. These propagators are implemented in the G-Strings solver [7].
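The denotation of a dashed string can be brute-forced for small examples; the following sketch (illustrative and exponential by nature, so for tiny instances only) reproduces the eight strings of the example above:

```python
from itertools import product

def gamma(blocks, lam):
    """Concrete strings denoted by the dashed string S1^{l1,u1} ... Sk^{lk,uk}."""
    def block_lang(S, l, u):
        # All words over S with length between l and u.
        return [''.join(t) for n in range(l, u + 1)
                for t in product(sorted(S), repeat=n)]
    return {''.join(parts)
            for parts in product(*(block_lang(S, l, u) for (S, l, u) in blocks))
            if len(''.join(parts)) <= lam}

X = [({'b', 'c'}, 1, 1), ({'a'}, 0, 1), ({'c'}, 1, 2)]
assert gamma(X, 5) == {'bc', 'bcc', 'bac', 'bacc', 'cc', 'ccc', 'cac', 'cacc'}
```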

10.3 Matching Regular Expressions

Matching a regular pattern ρ against a target string x is a common problem when x is a fixed string. Less obvious is the case where x is a string variable whose domain is a set of two or more concrete strings. Even harder is the case where ρ is also a variable, i.e., the pattern is not fully specified (this interesting problem is, however, outside the scope of this paper). If we look at the string operations of Table 10.1, the tools currently provided by the CP paradigm are only able to properly propagate the test method. In this case, to test whether a pattern ρ matches x, we only need the reified version of regular: test(x, ρ) ⇔ regular(x, DFA(ρ)). For the other operations, we need specialized propagators to avoid incurring awkward and inefficient mappings into basic constraints. In the rest of this section, we shall provide some ideas for defining such propagators.

10.3.1 Match

Let us suppose that we are interested in capturing the semantics of the search method of Table 10.1, that is, in finding the leftmost position i where a pattern ρ matches a string variable x. We define a constraint i = match(x, ρ), assuming i = 0 if x does not contain any string of L(ρ). (Here we differ from the JavaScript semantics, where string characters are indexed starting from 0, so search returns -1 if no match is found.) In practice, with the notion of matching indexes given in Def. 1, we have that i = min(MI(x, ρ)).

Using only the regular [5, 22] and find [4] constraints (find(x, y) returns the index of the first occurrence of y in x) is not enough to implement min(MI(x, ρ)), because the set {i ∈ 1..|w| | ∃ j. w[i..j] ∈ L(ρ)} of Def. 1 is not of fixed size. At present, MiniZinc does not offer built-in primitives for variable-length arrays and the min operation over set variables, because, to the best of our knowledge, no state-of-the-art CP solver currently supports such constructs.
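On concrete strings, the intended semantics of match can be pinned down with Python's re module, whose search also returns the leftmost matching position (names are illustrative):

```python
import re

def match_index(x: str, rho: str) -> int:
    """i = match(x, rho): the leftmost 1-based position of a match of rho
    in x, or 0 when no substring of x belongs to L(rho)."""
    m = re.search(rho, x)
    return 0 if m is None else m.start() + 1

assert match_index("aab", "ba*") == 3  # leftmost match of ba* starts at 3
assert match_index("aab", "c") == 0    # no match: i = 0
```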

It is important to note that, because we assume ε to occur at each position of a string, we have ε ∈ L(ρ) ⇒ match(x, ρ) = 1 for any string variable x. Hence, before propagating match we should first check whether ε ∈ L(ρ). If so, we can simply rewrite i = match(x, ρ) into i = 1. Because we want to focus on the more interesting cases, in the following we shall only consider constraints of the form match(x, ρ) with ε ∉ L(ρ). Assuming that we have a propagator for the reified version of regular (i.e., for the constraint b ⇔ regular(x, R) where b is a Boolean variable), a sound way of decomposing i = match(x, ρ), with l ≤ |x| ≤ u, into basic constraints is to follow a case-splitting approach:

    i = 1        if regular(x, R)
        2        else if regular(x[2..u], R)
        ⋮
        u − l    else if regular(x[u − l..u], R)
        0        otherwise                                  (10.1)

where R = DFA(ρ). This decomposition is correct, but heavily depends on the length of x. If the bounds of |x| are unknown, it introduces O(λ) reifications, where λ is the maximum allowed string length. This clearly has performance implications. In a more compact way, we can impose 0 ≤ i ≤ |x| and 

    ¬regular(x, R)                                       if i = 0
    ¬regular(x[1..i − 1], R) ∧ regular(x[i..|x|], R)     otherwise      (10.2)

Figures 10.1 and 10.2 show, respectively, the implementation of (10.1) and (10.2) in the MiniZinc syntax of [3]. Here, the pattern is given as a regular expression, encoded with a string parameter regexp. Unsurprisingly, (10.2) tends to be more efficient. However, as we shall see in the next example, there are cases where even this decomposition does not propagate well.

Example Let us consider i = match(x, ρ), where the domain of x is denoted by the dashed string X = {c,d}^{3,5} {a,b}^{1,1} {a,d}^{1,1} {a,b}^{0,4} and the pattern is ρ = dab(ab|ba)∗ (Fig. 10.3 shows DFA(ρ)). In this case, an optimal match propagator would narrow the domain of i to {0, 5, 6, 7}, meaning that either there is no match (i = 0) or the first occurrence of a match must be between x[5] and x[7]. Indeed, each string of L(ρ) has prefix dab, and the earliest possible match of ρ in X occurs at block {a,d}^{1,1}, which must be preceded by at least 4 mandatory characters (3 c's or d's, followed by 1 a or b). On the other hand, the latest possible match (i.e., the latest occurrence of character d in x) cannot occur later than position 7. With the dashed string solving approach defined in [7], both decompositions (10.1) and (10.2) can only narrow the domain of i to 0..11 (because 11 is the upper bound of |x|).


Fig. 10.1 Decomposition (10.1) of match. The lower/upper bounds l and u of |x| are given as parameters

Fig. 10.2 Decomposition (10.2) of match

Fig. 10.3 Minimal DFA for regular expression dab(ab|ba)∗

If we impose that i > 0, i.e., that there must be at least one match, then both approaches reduce the domain of i to 1..9, because the shortest string of L(R), namely dab, cannot start after position 9, being |x| ≤ 11. Note that in this case an optimal propagator for match would not only narrow the domain of i to 5..7 but also refine the domain of x into X′ = {c,d}^{3,5} {a,b}^{1,1} {d}^{1,1} {a}^{1,1} {b}^{1,1} {a,b}^{0,2}. ∎

In general, a CP propagator for i = match(x, ρ) should follow the structure described with pseudo-code in Fig. 10.4. Lines 2–5 cover the simplest cases, by rewriting match into other constraints. If the domain of i contains the value 0, then we can try to refine i, but we cannot refine x, because we do not know whether any string of L(ρ) actually matches x (lines 6–7). In this case, the refined domain of i will still contain the value 0, but it will possibly be pruned of other spurious values. For example, if i = match(aab, ba∗) with D(i) = 0..5, then a good propagator should refine D(i) into {0, 3}. If instead we know that there is at least one match, then we can refine both x and i (lines 8–9).


Fig. 10.4 High-level view of match propagator
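As a runnable illustration of this structure, the following Python toy mirrors Fig. 10.4 under a strong simplifying assumption: the domain of x is given extensionally as a finite set of concrete strings (a real propagator works on dashed strings, and all helper names are illustrative):

```python
import re

def match_idx(s: str, rho: str) -> int:
    """i = match(s, rho) on a concrete string (1-based, 0 if no match)."""
    m = re.search(rho, s)
    return 0 if m is None else m.start() + 1

def propagate_match(D_i: set, D_x: set, rho: str):
    if re.fullmatch(rho, ''):          # epsilon in L(rho): rewrite i = 1
        return D_i & {1}, D_x
    feasible = {match_idx(s, rho) for s in D_x}
    if 0 in D_i:
        # A match is not guaranteed: refine i only, keeping the value 0.
        return D_i & (feasible | {0}), D_x
    # At least one match is required: refine both i and x.
    D_i = D_i & feasible
    return D_i, {s for s in D_x if match_idx(s, rho) in D_i}

# The example from the text: match(aab, ba*) with D(i) = 0..5 -> {0, 3}.
print(propagate_match(set(range(6)), {'aab'}, 'ba*'))  # ({0, 3}, {'aab'})
```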

Obviously, the crucial point is how to properly implement the Refine functions of Fig. 10.4. For example, if i is fixed then we might simply use decomposition (10.2). Otherwise, the refinement of D(i) is not trivial. For the upper bound, we can compute the length ℓ⁻ of the shortest string accepted by R = DFA(ρ) with a breadth-first traversal, and then intersect D(i) with 0..u_x − ℓ⁻ + 1, where u_x is the upper bound of |x|. Refining the lower bound is trickier, because computing the length ℓ⁺ of the longest string accepted by R is not helpful. In this case, we may test whether (i) surely there must be a substring of x belonging to L(R), in which case we intersect D(i) with 1..u_x, or (ii) surely no substring of x belongs to L(R), in which case D(i) is intersected with {0}.

How to define these procedures is a compromise between efficiency (propagators are typically incomplete but fast) and efficacy (how many inconsistent values are pruned). Surely, a precise refinement of D(i) can significantly speed up the search for a solution. A possible idea might be to approximate ρ with a simplified pattern ρ′. For example, dab(ab|ba)∗ can be over-approximated by the dashed string {d}^{1,1} {a}^{1,1} {b}^{1,1} {a,b}^{0,λ−3}.

For the refinement of x, we should probably take inspiration from the propagation of regular described in [5] and [22], depending on the abstraction adopted to represent the domain of string variables. In a nutshell, these approaches work in two steps: a forward pass and a backward pass. In the first step, the set of reachable states is computed starting from the initial state, according to the domain of x. In the second step, from the feasible final states we traverse R backwards to possibly refine the domain of x. However, instead of R we have to consider its wrapping R̄. For example, Fig. 10.5 shows the wrapping of the DFA of Fig. 10.3; note that the wrappings of dab(ab|ba)∗ and of dab accept the same language.

Fig. 10.5 Wrapping of the minimal DFA for regular expression dab(ab|ba)∗. Label a1, ..., ak denotes all characters of Σ \ {a1, ..., ak}, while ∗ denotes all characters of Σ

Assuming the domain of x is defined by the dashed string X = {c,d}^{3,5} {a,b}^{1,1} {a,d}^{1,1} {a,b}^{0,4}, as in Example 10.3.1, running the forward step iteratively builds a set Q_i of reachable states after consuming the i-th block. In this way, Q_1 = {q0, q1}, because if we consume 3 to 5 c's or d's, then we end up in either state q0 or q1. From Q_1 we get Q_2 = {q0, q1, q2}, because we are in q0 or q1 and we consume either a or b. Following the same reasoning, we get Q_3 = Q_2 and Q_4 = {q0, ..., q3}. Because Q_4 contains at least one final state, we can proceed with the backward pass from q3 to q0, considering X in reverse order to possibly refine its blocks.

This procedure, whose complexity depends on the length of x and the size of the transition function δ, can also be applied the other way round, by reversing the blocks of X and the automaton R̄. Indeed, this approach works with minor modifications also if R is a non-deterministic automaton [5].
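The computation of ℓ⁻ mentioned above is a plain breadth-first search on the DFA's transition graph. In the Python sketch below, the DFA encoding (successor lists, with missing transitions leading to an implicit dead state) and the state numbering are illustrative assumptions:

```python
from collections import deque

def shortest_accepted_length(delta, q0, finals):
    """Length l- of the shortest string accepted by a DFA, via BFS.
    delta: dict mapping a state to its successor states (transition
    labels are irrelevant when only the length matters)."""
    if q0 in finals:
        return 0
    seen, frontier = {q0}, deque([(q0, 0)])
    while frontier:
        q, d = frontier.popleft()
        for q2 in delta.get(q, ()):
            if q2 in finals:
                return d + 1
            if q2 not in seen:
                seen.add(q2)
                frontier.append((q2, d + 1))
    return None  # empty language: no accepted string

# A DFA for dab(ab|ba)* (illustrative numbering): shortest accepted
# string is "dab", of length 3.
delta = {0: [1], 1: [2], 2: [3], 3: [4, 5], 4: [3], 5: [3]}
assert shortest_accepted_length(delta, 0, {3}) == 3
```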

10.3.2 Generalization to replace

A natural extension of the match constraint is its generalization to replace, as done in [4] for find and replace with dashed string variables. However, that paper considers the constraint replace(x, y, z), returning the string obtained from x by replacing its first occurrence of y with z; if y does not occur in x, then x = replace(x, y, z). For example, replace(aabbb, bb, c) = aacb and replace(aabbb, bc, c) = aabbb. In [4], the constraint x′ = replace(x, y, z) is rewritten into:

    ∃ i, p_x, s_x . ( i = find(x, y)
                    ∧ |p_x| = max(0, i − 1)
                    ∧ x = p_x · y^[[i>0]] · s_x
                    ∧ x′ = p_x · z^[[i>0]] · s_x
                    ∧ find(p_x, y) = [[|y| = 0]] )        (10.3)

where [[C]] ∈ 0..1 is the integer reification of the constraint C, such that [[C]] = 1 if and only if C holds. In practice, find(x, y) is used to track the index i of the first occurrence of y in x. If i > 0, then [[i > 0]] = 1, so (10.3) implies x = p_x · y · s_x and x′ = p_x · z · s_x, with find(p_x, y) = [[|y| = 0]] ensuring that y does not occur in the prefix p_x unless y = ε. If instead i = 0, then we have y⁰ = z⁰ = p_x = ε and thus x = x′ = s_x.

The generalization of replace where we look for a regular pattern ρ instead of an individual string y is a function replace(x, ρ, z) returning the string obtained by replacing the leftmost and shortest successful match of ρ in x with the string z, if any (this definition is taken from the SMT-LIB 2.6 specifications [11]). For example, replace(aaabc, b∗(ab|c), ed) = aaedc, because the leftmost match occurs at position 3, and the strings of L(b∗(ab|c)) matching aaabc from position 3 are {ab, abc}, so the shortest string ab is replaced with ed.
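For concreteness, the fixed-string semantics of replace used in [4] coincides with first-occurrence textual replacement, as a quick Python check confirms (an illustrative sketch):

```python
def replace_first(x: str, y: str, z: str) -> str:
    """replace(x, y, z): replace the first occurrence of y in x with z;
    if y does not occur in x, return x unchanged."""
    return x.replace(y, z, 1)

assert replace_first("aabbb", "bb", "c") == "aacb"
assert replace_first("aabbb", "bc", "c") == "aabbb"  # no occurrence
```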

If instead the pattern were ρ = b∗(ab|c)∗, then ε ∈ L(ρ), hence replace(aaabc, b∗(ab|c)∗, ed) = edaaabc, because ε is the leftmost and shortest match.

A first attempt to handle this constraint is to generalize (10.3) as:

    ∃ i, j, p_x, s_x . ( i = match(x, ρ)
                       ∧ i ≤ j ≤ |x|
                       ∧ i = 0 ⟺ j = 0
                       ∧ i > 0 ⟹ regular(x[i..j], DFA(ρ))
                       ∧ j > i ⟹ match(x[i..j − 1], ρ) = 0
                       ∧ |p_x| = max(0, i − 1)
                       ∧ x = p_x · x[i..j]^[[i>0]] · s_x
                       ∧ x′ = p_x · z^[[i>0]] · s_x
                       ∧ match(p_x, ρ) = [[p_x = ε]] )        (10.4)

In practice, if i > 0 we extract the shortest, non-empty substring x[i..j] matching ρ, if any. It is non-empty because i ≤ j ≤ |x|, and it is of minimal length because, if j > i, then match(x[i..j − 1], ρ) = 0 enforces that ρ does not match any proper substring of x[i..j]. If i = 0, then x = p_x · s_x = s_x = x′. For example, with ρ = b∗(ab|c) we have that replace(aaabc, ρ, ed) rewrites into i = match(x, ρ) = 3, so |p_x| = 2, p_x = aa and 3 ≤ j ≤ 5. However, there is not enough information to infer that j = 4 and consequently rewrite into x = p_x · ab · s_x, x′ = p_x · ed · s_x and s_x = c. Therefore, rewriting via (10.4) is feasible but not very appealing. Defining a specialized propagator, similar to what was done in [6] to overcome the limitations of decomposition (10.3), would surely be a better choice.

Finally, a few words about replacing all the matches. Handling this constraint with CP is tricky for a number of reasons. First of all, we probably want to exclude overlapping matches. In this case, Def. 1 should be refined because, e.g., with the current definition MI(abbbb, bbb∗) = {2, 3, 4}, but it should be {2, 4} if we exclude overlaps. Second, the behavior of replaceAll(x, ρ, z) is not standard: according to the SMT-LIB 2.6 specifications [11], ε is not considered a valid match, e.g., replaceAll(abc, b∗, d) = adc. In [6], instead, ε matches at each character of the target string, so it would be replaceAll(abc, b∗, d) = dadbdcd (this choice follows the JavaScript semantics of the replaceAll method). Apart from these technicalities, decomposing replaceAll(x, ρ, z) into basic constraints is too inefficient, and therefore a dedicated propagator is certainly needed.

To be more general, and also to model string operations like split (see Table 10.1), it would be useful to have a constraint like I = findAll(x, ρ) returning all the non-overlapping matches of ρ in x. The problem here is the type of I: should it be a set variable, an array variable or a string variable encoding the matching positions? In any case, the support offered by current CP technology for these sorts of variables is quite limited.
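These divergent conventions are easy to observe concretely with Python's re module, whose own semantics (leftmost match, alternatives tried in order, and a distinct empty-match rule for substitution) is yet another variant; a small probe on the running examples:

```python
import re

# The leftmost match of b*(ab|c) in "aaabc" starts at 1-based position 3
# and is "ab", so a single SMT-LIB-style replacement yields "aaedc".
assert re.sub(r'b*(ab|c)', 'ed', 'aaabc', count=1) == 'aaedc'

# Replacing all matches of b* shows how non-standard empty-match handling
# is: SMT-LIB 2.6 prescribes "adc", the semantics of [6] gives "dadbdcd",
# while CPython (>= 3.7) produces yet another result.
print(re.sub(r'b*', 'd', 'abc'))  # -> 'daddcd' on CPython >= 3.7
```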


10.4 Conclusions

In this paper, we explored the fields of constraint programming (CP) and string solving to see how to possibly model the matching between regular patterns, given in the form of regular expressions or finite-state machines, and string variables. We have highlighted the constraints for which the CP paradigm has not yet developed suitable propagators, and provided some suggestions on how to possibly implement them. Future directions are manifold. First of all, we should develop a propagator for the match constraint, returning the leftmost position where a pattern matches a string variable, and compare its performance w.r.t. the decomposition approach and other state-of-the-art string solving technologies, especially SMT solvers like Z3 [19] and CVC5 [9]. Then, we can focus on propagating the replace constraint where the string to be replaced is defined by a pattern. Finally, we can move to harder constraints like, e.g., finding and replacing all the occurrences or considering non-regular patterns such as back-references.

Acknowledgements We are thankful to Prof. Peter J. Stuckey for his constructive feedback.

References

1. JavaScript Regular Expressions. Available at https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions
2. Amadini, R.: A survey on string constraint solving. ACM Comput. Surv. 55(1) (2021). https://doi.org/10.1145/3484198
3. Amadini, R., Flener, P., Pearson, J., Scott, J.D., Stuckey, P.J., Tack, G.: MiniZinc with strings. In: Hermenegildo, M., López-García, P. (eds.) LOPSTR 2016: Revised Selected Papers. Lecture Notes in Computer Science, vol. 10184, pp. 59–75. Springer (2017)
4. Amadini, R., Gange, G., Stuckey, P.J.: Propagating lex, find and replace with dashed strings. In: van Hoeve, W.J. (ed.) Integration of Constraint Programming, Artificial Intelligence, and Operations Research, 15th International Conference, CPAIOR 2018, Delft, The Netherlands, June 26–29, 2018, Proceedings. Lecture Notes in Computer Science, vol. 10848, pp. 18–34. Springer (2018)
5. Amadini, R., Gange, G., Stuckey, P.J.: Propagating regular membership with dashed strings. In: Hooker, J. (ed.) Proceedings of the 24th International Conference on Principles and Practice of Constraint Programming. Lecture Notes in Computer Science, vol. 11008, pp. 13–29. Springer (2018)
6. Amadini, R., Gange, G., Stuckey, P.J.: Dashed strings and the replace(-all) constraint. In: Simonis, H. (ed.) Principles and Practice of Constraint Programming, 26th International Conference, CP 2020, Louvain-la-Neuve, Belgium, September 7–11, 2020, Proceedings. Lecture Notes in Computer Science, vol. 12333, pp. 3–20. Springer (2020)
7. Amadini, R., Gange, G., Stuckey, P.J.: Dashed strings for string constraint solving. Artif. Intell. 289, 103368 (2020)
8. Balzarotti, D., Cova, M., Felmetsger, V., Jovanovic, N., Kirda, E., Kruegel, C., Vigna, G.: Saner: composing static and dynamic analysis to validate sanitization in web applications. In: 2008 IEEE Symposium on Security and Privacy (S&P 2008), 18–21 May 2008, Oakland, California, USA, pp. 387–401. IEEE Computer Society (2008)
9. Barbosa, H., Barrett, C.W., Brain, M., Kremer, G., Lachnitt, H., Mann, M., Mohamed, A., Mohamed, M., Niemetz, A., Nötzli, A., Ozdemir, A., Preiner, M., Reynolds, A., Sheng, Y., Tinelli, C., Zohar, Y.: cvc5: a versatile and industrial-strength SMT solver. In: Fisman, D., Rosu, G. (eds.) Tools and Algorithms for the Construction and Analysis of Systems, 28th International Conference, TACAS 2022, Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2022, Munich, Germany, April 2–7, 2022, Proceedings, Part I. Lecture Notes in Computer Science, vol. 13243, pp. 415–442. Springer (2022)
10. Barrett, C., Conway, C.L., Deters, M., Hadarean, L., Jovanović, D., King, T., Reynolds, A., Tinelli, C.: CVC4. In: Gopalakrishnan, G., Qadeer, S. (eds.) Computer Aided Verification, pp. 171–177. Springer, Berlin, Heidelberg (2011)
11. Barrett, C., Fontaine, P., Tinelli, C.: The Satisfiability Modulo Theories Library (SMT-LIB) (2016). Available at https://www.SMT-LIB.org
12. Berzish, M., Ganesh, V., Zheng, Y.: Z3str3: a string solver with theory-aware heuristics. In: Stewart, D., Weissenbacher, G. (eds.) Proceedings of the 17th Conference on Formal Methods in Computer-Aided Design, pp. 55–59. FMCAD Inc (2017)
13. Bultan, T., Yu, F., Alkhalaf, M., Aydin, A.: String Analysis for Software Verification and Security. Springer (2017). https://doi.org/10.1007/978-3-319-68670-7
14. Costantini, G., Ferrara, P., Cortesi, A.: A suite of abstract domains for static analysis of string values. Softw. Pract. Exp. 45(2), 245–287 (2015)
15. Hooimeijer, P., Weimer, W.: StrSolve: solving string constraints lazily. Automated Software Engineering 19(4), 531–559 (2012)
16. Kieżun, A., Ganesh, V., Artzi, S., Guo, P.J., Hooimeijer, P., Ernst, M.D.: HAMPI: a solver for word equations over strings, regular expressions, and context-free grammars. ACM Trans. Softw. Eng. Methodol. 21(4), article 25 (2012)
17. King, J.C.: Symbolic execution and program testing. Commun. ACM 19(7), 385–394 (1976)
18. Li, G., Ghosh, I.: PASS: string solving with parameterized array and interval automaton. In: Bertacco, V., Legay, A. (eds.) Proceedings of the 9th International Haifa Verification Conference. Lecture Notes in Computer Science, vol. 8244, pp. 15–31. Springer (2013)
19. de Moura, L.M., Bjørner, N.: Z3: an efficient SMT solver. In: Tools and Algorithms for the Construction and Analysis of Systems, 14th International Conference, TACAS 2008, Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2008, Budapest, Hungary, March 29–April 6, 2008, Proceedings, pp. 337–340 (2008)
20. Nethercote, N., Stuckey, P.J., Becket, R., Brand, S., Duck, G.J., Tack, G.: MiniZinc: towards a standard CP modelling language. In: Proceedings of the 13th International Conference on Principles and Practice of Constraint Programming. Lecture Notes in Computer Science, vol. 4741, pp. 529–543. Springer (2007)
21. Parolini, F., Miné, A.: Sound static analysis of regular expressions for vulnerabilities to denial of service attacks. In: Ameur, Y.A., Craciun, F. (eds.) Theoretical Aspects of Software Engineering, 16th International Symposium, TASE 2022, Cluj-Napoca, Romania, July 8–10, 2022, Proceedings. Lecture Notes in Computer Science, vol. 13299, pp. 73–91. Springer (2022)
22. Pesant, G.: A regular language membership constraint for finite sequences of variables. In: Wallace, M. (ed.) Proceedings of the 10th International Conference on Principles and Practice of Constraint Programming. Lecture Notes in Computer Science, vol. 3258, pp. 482–495. Springer (2004)
23. Rossi, F., van Beek, P., Walsh, T. (eds.): Handbook of Constraint Programming. Elsevier (2006)
24. Saxena, P., Akhawe, D., Hanna, S., Mao, F., McCamant, S., Song, D.: A symbolic execution framework for JavaScript. In: Proceedings of the 2010 IEEE Symposium on Security and Privacy, pp. 513–528. IEEE Computer Society (2010)
25. Scott, J.D., Flener, P., Pearson, J., Schulte, C.: Design and implementation of bounded-length sequence variables. In: Lombardi, M., Salvagnin, D. (eds.) Proceedings of the 14th International Conference on Integration of Artificial Intelligence and Operations Research Techniques in Constraint Programming. Lecture Notes in Computer Science, vol. 10335, pp. 51–67. Springer (2017)
26. Tateishi, T., Pistoia, M., Tripp, O.: Path- and index-sensitive string analysis based on monadic second-order logic. ACM Trans. Softw. Eng. Methodol. 22(4), article 33 (2013)
27. Yu, F., Alkhalaf, M., Bultan, T., Ibarra, O.H.: Automata-based symbolic string analysis for vulnerability detection. Formal Methods Syst. Des. 44(1), 44–70 (2014)

Chapter 11

Floating-Point Round-off Error Analysis of Safety-Critical Avionics Software

Laura Titolo, Mariano Moscato, Marco A. Feliú, Aaron Dutle, and César Muñoz

Abstract The presence of round-off errors in floating-point programs may provoke a significant divergence between the actual result of the computation and the one that would ideally be obtained using exact real-number arithmetic. These rounding errors are particularly problematic in the context of safety-critical systems such as aerospace applications. In fact, in this context, even a small rounding error can lead to catastrophic consequences when not appropriately accounted for. This paper shows how different formal methods tools can be combined to perform rigorous round-off error analysis of avionics software, and it outlines the challenges and the open problems in this field. Three case studies taken from real-world avionics applications are presented: the ADS-B Compact Position Reporting algorithm, the winding number point-in-polygon algorithm used for geofencing of unmanned vehicles, and a function from DAIDALUS, NASA's suite of detect-and-avoid solutions.


11.1 Introduction

Writing floating-point software is challenging. The developer has to take into consideration both runtime exceptions, such as division by zero and overflows, and round-off errors, which originate from the difference between real numbers and their finite precision representation. While runtime exceptions are visible to the developer when the program is executed, rounding errors are more subtle. In fact, the program still produces a numerical output, but it can be very different from the result that would be obtained using exact real-number arithmetic.

Figure 11.1 shows a possible computation that leads to a rounding error of 99 using double precision floating-point numbers in the Glasgow Haskell Compiler interpreter (GHCi), but it could be reproduced in any modern interpreter or compiler. The expression (4/3 − 1) ∗ 3 − 1 evaluates to 0 in real-number arithmetic. However, when double precision floating-point arithmetic is used, the same expression evaluates approximately to −2.22e−16. This may look like a negligible error; however, if the floor operator is applied, the floating-point evaluation produces −1 instead of the expected result, which is 0. Therefore, the accumulated round-off error amounts to 1. When the conditional checks whether the expression is below 0, it can be noticed that the real and floating-point execution flows diverge. Thus, the overall error is the difference between the two branches, which is 99. This phenomenon is called an unstable guard (in the literature [28, 55], unstable guards are often referred to as unstable tests), and it occurs when the guard of a conditional statement contains a floating-point expression whose round-off error makes the actual Boolean value of the guard differ from the value that would be obtained assuming real arithmetic.

As is clear from the previous example, besides the accumulation of the rounding error in arithmetic expressions, the presence of unstable guards can significantly increase the divergence between real and floating-point results. This divergence may lead to catastrophic consequences in safety-critical applications such as avionics software. Understanding how round-off errors and unstable guards affect the result and execution flow of floating-point programs is essential to guarantee the correctness of a program. Therefore, rigorous formal methods tools are needed to provide guarantees on the accumulated rounding error and to help reduce the impact of such errors. In recent years, several techniques have been proposed to analyze and improve the quality of floating-point software (see [12] for a survey).

Fig. 11.1 Accumulation of round-off errors in GHCi computation
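The computation of Fig. 11.1 can be replayed in any language with IEEE-754 binary64 arithmetic; here is a Python transcription (the branch values 0 and 99 mirror the divergence of 99 discussed above):

```python
from math import floor

e = (4/3 - 1) * 3 - 1
print(e)         # -2.220446049250313e-16 instead of the exact 0
print(floor(e))  # -1, while floor of the exact value is 0: error of 1

# Unstable guard: over the reals the condition 0 < 0 is false, but the
# floating-point value of the expression is negative.
result = 0 if e < 0 else 99
print(result)    # 0, whereas exact arithmetic yields 99: error of 99
```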


This paper presents an overview of the research effort carried out at NASA on the analysis of numerical properties of floating-point programs. In particular, it shows how different formal methods tools can be combined to perform rigorous round-off error analysis of avionics software, and it outlines the challenges and the open problems in this field. First, the case study of the verification of the ADS-B Compact Position Reporting algorithm is shown (Sect. 11.2). This verification technique [54] uses a combination of existing formal methods tools: the static analyzer Frama-C [30], the interactive theorem prover PVS [42], and Gappa [23], a tool to formally verify properties on numerical programs. While successful, this verification approach is not fully automatic and requires a certain level of expertise: a background in floating-point arithmetic and a deep understanding of the features of each tool are essential for completing the verification.

In order to automatize this approach, the PRECiSA [35, 52] tool was developed (Sect. 11.3); its distribution is available at https://github.com/nasa/PRECiSA. PRECiSA produces a formally verified C implementation using floating-point arithmetic from a PVS real-number specification. It bounds the round-off error that can occur in the generated program and provides an externally checkable PVS certificate ensuring the soundness of the bound. PRECiSA also instruments the code to emit a warning when the floating-point flow may diverge w.r.t. the real-number specification due to unstable conditionals, and it automatically generates ACSL annotations stating the relationship between the floating-point C implementation and its real-number counterpart. PRECiSA is integrated with the Frama-C analyzer and the PVS theorem prover to provide a fully automatic toolchain to generate and verify floating-point C code from a PVS real-number specification. In particular, the code generated by PRECiSA is input to Frama-C, which generates a set of verification conditions in the language of PVS. While PVS is an interactive theorem prover, these verification conditions are automatically proved by ad-hoc strategies. Therefore, neither expertise in theorem proving nor in floating-point arithmetic is required from the user to verify the correctness of the generated C program. This verification toolchain is used to generate C implementations from two formal developments by NASA: the winding number point-in-polygon algorithm used for geofencing applications (Sect. 11.4), and a fragment of the DAIDALUS [38] software library, which is the reference implementation of detect-and-avoid for unmanned aircraft systems in the RTCA DO-365 [45] standard (Sect. 11.3).

11.2 Formal Verification of the ADS-B CPR Algorithm

The Automatic Dependent Surveillance-Broadcast (ADS-B) protocol [44] is a fundamental component of the next generation of air transportation systems, which is intended to provide direct exchange of precise aircraft state information. Aircraft equipped with ADS-B services broadcast information related to their current state to other traffic aircraft and to ground stations.

The use of ADS-B transponders has been mandatory since 2020 for most commercial aircraft in the US [14] and Europe [25]. Currently, about 100,000 general aviation aircraft are equipped with ADS-B (see https://generalaviationnews.com/2020/06/13/ads-b-installations-continue-even-in-midst-of-pandemic/).

The ADS-B broadcast message is defined to be 112 bits long, of which 56 bits are available for data, the others being used for aircraft identification and error correction. In an ADS-B position message, 21 bits are allocated to altitude, leaving 35 bits for latitude and longitude. If raw latitude and longitude data were expressed as numbers of 17 bits each, the resulting position accuracy would be worse than 300 m, which is inadequate for safe navigation. For this reason, the ADS-B protocol uses an algorithm called Compact Position Reporting (CPR) to encode/decode the aircraft position in 35 bits in a way that, for airborne applications, is intended to guarantee a position accuracy of approximately 5 m.

CPR uses a special coordinate system where each direction (latitude and longitude) is divided into equally sized zones of approximately 360 nautical miles. There are two different subdivisions of the space into zones, called even and odd. Each zone is itself divided into 2^17 parts, called bins, which measure approximately 5 m. Figure 11.2 shows how the latitude is divided into 60 zones (for the even subdivision) or into 59 zones (for the odd subdivision) and how each zone is then divided into 2^17 bins indexed from 0 to 2^17 − 1. Therefore, a CPR coordinate is uniquely determined by a zone index and a bin index. The CPR encoding translates latitude and longitude coordinates into a pair of bin indices. The 35 bits composing the CPR message are grouped into three parts: one bit determines the kind of subdivision (0 for even and 1 for odd), 17 bits are devoted to the bin number for the latitude, and the other 17 bits to the bin number for the longitude. The decoding procedure recovers the position of the aircraft from the CPR coordinates. It returns a coordinate which corresponds to the centerline of the bin where the target is located (see Fig. 11.2).

Fig. 11.2 CPR latitude coordinate system

Fig. 11.3 CPR issue reported by Airservices Australia
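The latitude bin computation just described can be sketched in a few lines of Python. The rounding formula below follows the standard's airborne latitude encoding as commonly stated, but it is an illustrative sketch, not the formally verified implementation discussed in this section:

```python
from math import floor

NB = 2**17  # bins per zone

def cpr_lat_bin(lat_deg: float, odd: bool) -> int:
    """Illustrative sketch of the CPR latitude encoding (not the verified
    code): the index of the bin occupied by lat_deg within its zone, for
    the even (60-zone) or odd (59-zone) subdivision."""
    nz = 59 if odd else 60              # number of latitude zones
    dlat = 360.0 / nz                   # zone size, in degrees
    # Python's % returns a value in [0, dlat), matching the standard's mod;
    # the bin index is rounded and truncated to 17 bits.
    return floor(NB * ((lat_deg % dlat) / dlat) + 0.5) % NB
```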

A CPR message coordinate contains only the number corresponding to the bin where the target is located. However, to unambiguously retrieve the position, the zone number is also necessary. The correct zone number can be recovered from either a previously known position (for local decoding) or from a matched pair of even and odd messages (for global decoding).

The CPR local decoding uses a reference position that is known to be near the broadcast one. Observe that a one-zone-wide interval centered around a given reference position does not contain more than one occurrence of the same bin number. Therefore, as long as the target is close enough to the reference position (slightly less than half a zone), decoding can be performed correctly. Global decoding is used when a valid reference position is unknown. To determine the correct zone in which the encoded position lies, the global decoding uses a pair of messages, one even and one odd. The algorithm computes the number of zone offsets (the difference between an odd zone length and an even zone length) from the origin (either the equator or the prime meridian) to the encoded position. This can be used to establish the zone for either message type, and hence to decode the position.

Unfortunately, pilots and manufacturers have reported errors in the positions obtained by encoding and decoding with the CPR algorithm. For instance, Fig. 11.3 depicts a CPR malfunction reported in 2007 by Airservices Australia. It can be noticed that the position decoded by CPR was more than 220 nautical miles away from the actual position of the aircraft. In [24], it was formally proven that the original operational requirements of the CPR algorithm were not enough to guarantee the intended precision, even when computations are assumed to be performed using exact arithmetic. Additionally, the ideal real-number implementation of CPR has been formally proven correct for a slightly tightened set of requirements. Nevertheless, even assuming these more restrictive requirements, a straightforward floating-point implementation of the CPR algorithm may still be unsound and produce incorrect results due to round-off errors. For instance, using a standard single-precision floating-point implementation of CPR on a position whose latitude is −77.368° and longitude is 180°, the recovered position differs from the original one by approximately 1500 nautical miles.

Fig. 11.4 Verification approach for CPR

An alternative implementation of the CPR algorithm is presented in [54]. This version includes simplifications that decrease the numerical complexity of the expressions w.r.t. the original version presented in the ADS-B standard. In this way, the accumulated round-off error is reduced. In [54], the double-precision floating-point implementation of this improved version of the CPR algorithm is proved correct with a combination of different formal methods tools. Figure 11.4 depicts the proposed verification approach, which is composed of:
• Frama-C [30], a collaborative tool suite for the analysis of C code,
• the Prototype Verification System (PVS) [42], an interactive theorem prover, and
• Gappa [23], a tool to formally prove properties on numerical programs dealing with floating-point arithmetic.

The floating-point C implementation of the CPR algorithm is manually annotated with contracts relating real and floating-point expressions. As an example, Fig. 11.5 shows the annotated code for the latitude encoding function. On lines 2–10, the ideal real-number counterpart of the C implementation is modeled. On lines 13–14, the ranges for the input variables are defined. The index i denotes whether the encoding is even (i = 0) or odd (i = 1). The input coordinates for this CPR algorithm are assumed to be given in a format called 32 bit angular weighted binary (AWB), a standard format for expressing geographical positions used by GPS manufacturers and many others. An AWB coordinate is a 32 bit integer in the interval [0, 2^32 − 1], where the value x corresponds to 360x/2^32 degrees. Lines 15–16 ensure that i and the AWB latitude are integers. Finally, at line 17, it is required that the result of the floating-point function yzf is the same as its real-valued counterpart. The main body of the function (lines 19–44) is enriched with intermediate assertions relating real and floating-point variables.

This code is input to Frama-C, a tool suite that collects several static analyzers for the C language. The Frama-C WP plug-in implements a weakest precondition calculus for ACSL annotations through C programs. For each ACSL annotation, this plug-in generates a set of verification conditions (VCs) that can be discharged by external provers.
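For reference, the AWB-to-degrees conversion described above is a one-liner (an illustrative helper, not part of the verified code):

```python
def awb_to_degrees(x: int) -> float:
    """32-bit angular weighted binary: x in [0, 2**32 - 1] maps to
    360 * x / 2**32 degrees."""
    assert 0 <= x < 2**32
    return 360.0 * x / 2**32
```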

Fig. 11.5 ACSL annotations and code for the latitude encoding function

Fig. 11.6 Gappa verification condition generated by Frama-C

In this case, the automatic solver Alt-Ergo [15] and the Gappa [23] tool are used. Gappa is a tool able to formally verify properties on floating-point computations and to bound the associated round-off error. Gappa is very efficient at checking enclosures for floating-point rounding errors, but it is not always suited to tackle other types of verification conditions. For this reason, the SMT solver Alt-Ergo is used in combination with Gappa. Figure 11.6 shows a Gappa verification condition generated by Frama-C for the encoding function in Fig. 11.5. Each variable has a real and a floating-point version. The rounding of an expression is denoted with rnd. The main property states that the difference between the real and floating-point encoding is 0. Note that two additional hypotheses are needed to automatically discharge this verification condition. These two hypotheses are modeled and proved separately in Gappa.

Gappa models the propagation of the round-off error by using interval arithmetic and a battery of theorems on real and floating-point numbers. The main drawback of interval arithmetic is that it does not keep track of the correlation between expressions sharing subterms, which may lead to imprecise over-approximations.
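This dependency problem is easy to reproduce with a three-line interval arithmetic (a minimal sketch, unrelated to Gappa's actual machinery):

```python
class Interval:
    def __init__(self, lo: float, hi: float):
        self.lo, self.hi = lo, hi
    def __sub__(self, other):
        # Standard interval subtraction: worst case over both operands.
        return Interval(self.lo - other.hi, self.hi - other.lo)
    def __repr__(self):
        return f"[{self.lo}, {self.hi}]"

x = Interval(1.0, 2.0)
print(x - x)  # [-1.0, 1.0], although x - x is exactly 0 for every x
```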

To improve precision, Gappa accepts hints from the user. Such hints are provided in the verification of other CPR functions to perform a bisection on the domain of an expression and to define rewriting rules, which appear as hypotheses in the generated formal proof and are proved separately in Gappa.

The real-valued counterpart of each C function (lines 2–9 in Fig. 11.5) is manually translated into the PVS language. PVS is used to formally prove that the real counterpart of the alternative CPR algorithm is mathematically equivalent to the one defined in the standard [44]. It follows that the correctness results presented in [24] also hold for the alternative simplified version of CPR. The formally verified double precision floating-point version of CPR is available at https://github.com/nasa/cpr. In [54], it is proven that in this version the encoding does not introduce any rounding error, and that the original and recovered positions differ by at most half the size of a bin; therefore, the intended CPR granularity is ensured.

11.3 Automatizing the Verification with PRECiSA

The approach presented in the previous section was successful for verifying the double precision floating-point implementation of CPR presented in [54]. However, the applied verification approach requires some level of expertise. A background in floating-point arithmetic is needed to express the properties to be verified and to properly annotate the code for the weakest precondition deductive reasoning. A deep understanding of the features of each tool is essential for the analysis. In fact, a careful choice of types in the C implementation leads to fewer and simpler verification conditions. Gappa requires user input to identify critical subexpressions and to perform bisection of the domain when needed. Additionally, Gappa's ability to verify complex expressions requires adding intermediate ACSL assertions and providing hints through annotations, which may be unfeasible to generate automatically.

Ideally, one would like to write a program as if it were going to be evaluated using exact arithmetic, and use automatic and rigorous tools to obtain a floating-point implementation with guarantees on the rounding errors that may occur. To this aim, the tool PRECiSA [35, 52] was developed. PRECiSA is a static analyzer that computes sound and accurate round-off error estimations, and it provides support for a large variety of mathematical operators and programming language constructs. Additionally, PRECiSA generates proof certificates ensuring the correctness of the computed round-off error estimations, which can be automatically checked in the PVS theorem prover [42].

As opposed to Gappa, which uses interval arithmetic to model the propagation of the error, PRECiSA computes a symbolic error expression and then numerically evaluates it with an optimization algorithm. The rounding error is composed of two parts: (a) the error introduced by the application of the floating-point operator versus its real-number counterpart, and (b) the propagation of the errors carried by the arguments.
In the case of arithmetic operators, the IEEE-754 [29] standard states that every basic operation is correctly rounded, meaning it should be performed as if it were calculated with infinite precision and then rounded according to the desired rounding modality. In PRECiSA, the to-the-nearest rounding modality is supported. In this modality, given a real number r and its floating-point representation f, it holds that |r − f| ≤ (1/2) ulp(r), where the function ulp (unit in the last place) computes the difference between the two closest consecutive floating-point numbers enclosing r. For instance, consider the expression x ∗˜ y, where x and y are floating-point variables and ∗˜ is the floating-point multiplication. The symbolic error expression for x ∗˜ y is

    r_x · e_y + r_y · e_x + e_x · e_y + (1/2) ulp((|r_x| + e_x)(|r_y| + e_y)),

where e_x (respectively e_y) is the error associated with the variable x (respectively y), and r_x (respectively r_y) is the real-number counterpart of x (respectively y). Note that in PRECiSA the symbolic error expressions depend on the real values of the variables or constants and their error estimations. The expression (|r_x| + e_x)(|r_y| + e_y) is an over-approximation of x ∗ y and, since the ulp function is monotonic, a sound over-estimation of the rounding error is obtained. In the following, floating-point operations, such as ∗˜, are denoted with a tilde on top.

Given the initial ranges for the input variables, it is possible to compute numerical bounds for this symbolic error expression. To this aim, PRECiSA uses the optimizer Kodiak [49], which is based on the formally verified branch-and-bound algorithm presented in [40]. This branch-and-bound algorithm relies on enclosure functions for arithmetic operators. These enclosure functions compute provably correct over-approximations of the symbolic error expressions using either interval arithmetic or Bernstein polynomials. The algorithm recursively splits the domain of the function into smaller subdomains and computes an enclosure of the original expression in these subdomains. The recursion stops when the size of the enclosure is smaller than a given parameter or when a given maximum recursion depth is reached. The output of the algorithm is a numerical enclosure for the symbolic error expression. This approach has three main benefits. First, symbolic error expressions allow for a compositional analysis: the error of a subprogram can be reused in the analysis of a larger program that contains it. Second, the same symbolic error expression can be reused when just the input ranges change, improving the efficiency of the analysis, since it is not necessary to compute the error expression again. Finally, the evaluation of the expression using optimization algorithms such as Kodiak provides a much tighter rounding error bound compared to using interval arithmetic.

Besides bounding the rounding error of arithmetic expressions, PRECiSA correctly handles the error associated with unstable guards. To this aim, it collects information on both real and floating-point path conditions. When these conditions do not coincide, the error is computed as the difference in the outcome of the two alternative branches. For instance, consider the conditional expression if (x /˜ y −˜ z) ∗˜ y −˜ z < 0 then a else b. Each branch has a round-off error that corresponds to the error of a (then branch) or the error of b (else branch), but to guarantee correctness it is essential to also take into account the error due to the instability of the guard. In fact, from Fig. 11.1 it can be noticed that if x = 4, y = 3, and z = 1, the real and floating-point computational flows diverge.


In this case, in addition to the round-off error of the specific branch, the whole conditional expression also carries an error of |a − b|.

PRECiSA generates formal certificates ensuring that these bounds are correct with respect to the IEEE-754 floating-point standard. Having an externally checkable certificate increases the level of trustworthiness of the tool. The certificates produced are mathematical theorems written in the language supported by the PVS [42] theorem prover. They rely on a formalization of floating-point arithmetic originally presented in [9] and extended in [35]. Proof scripts have been developed in [35] to automatically discharge the generated lemmas.

In [53], PRECiSA is extended to generate a floating-point C implementation from a PVS real-number specification. The generated C code is automatically annotated with ACSL assertions stating the relationship between real and floating-point expressions, as manually done in the approach presented in Sect. 11.2. In addition, the code is instrumented to emit a warning when an unstable condition may occur, using the program transformation presented in [55]. Given a real-valued program P and the desired floating-point format (single or double precision), PRECiSA replaces each real arithmetic operator with its floating-point counterpart. Then it applies the program transformation: numerically unstable guards are replaced with more restrictive ones that preserve the control flow of the real-valued original specification. These new guards take into account the round-off error that may occur when the expressions of the original program are evaluated in floating-point arithmetic. In addition, the transformation instruments the program to emit a warning when the floating-point flow may diverge w.r.t. the original real-number specification.

Example 1 Consider the program tcoa, which is part of the DAIDALUS NASA library for detect-and-avoid. This function is used to compute the time to co-altitude of two aircraft: an ownship and an intruder. Variable s_z models the relative position of ownship and intruder, and v_z their relative velocity.

    tcoa(s_z, v_z) = if s_z ∗ v_z < 0 then −s_z / v_z else 0

If the values of such variables are assumed to lie in the range [1, 1000] and double precision floating-point arithmetic is used, PRECiSA computes the round-off error estimation ε = 1.71 × 10⁻¹⁶ × 10⁶ ≈ 1.71 × 10⁻¹⁰ for the expression s_z ∗ v_z in the guard of the conditional, i.e., ε = 1.71 × 10⁻¹⁰. The following program is obtained by using the transformation of [55].

    t̃coa(s_z, v_z) = if s_z ∗˜ v_z < −ε then −˜s_z /˜ v_z elsif s_z ∗˜ v_z ≥ ε then 0 else warning

The transformed program detects these unstable cases and returns a warning when s_z ∗˜ v_z is in [−ε, ε).
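A Python rendering of the transformed tcoa makes the effect of the shifted guards explicit (None stands in for the warning value; this is a sketch, not PRECiSA's generated C code):

```python
EPS = 1.71e-10  # PRECiSA's bound for s_z * v_z when both lie in [1, 1000]

def tcoa_transformed(s_z: float, v_z: float):
    p = s_z * v_z
    if p < -EPS:
        return -s_z / v_z  # the real-valued guard surely holds as well
    elif p >= EPS:
        return 0.0         # the real-valued guard surely fails as well
    else:
        return None        # possible instability: emit a warning
```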

208

L. Titolo et al.

Fig. 11.7 ACSL annotations and code for the tcoa function

The transformed program is then converted into C syntax, and ANSI/ISO C Specification Language (ACSL) annotations are generated. Figure 11.7 shows the annotated C code for the tcoa function as generated by PRECiSA. To model the fact that the output can be either a value or a warning, an ad-hoc struct maybe(T) is defined, which is parametric with respect to a basic type T ∈ {int, bool, float, double}. The struct has two fields: a boolean isValid, indicating whether the result is a valid output or a warning, and a value. A warning output is modeled as none(T) = {isValid = false, value = 0}, while a valid output is modeled as some(T)(val : T) = {isValid = true, value = val}. The function in the original specification is expressed as a logic axiomatic definition in ACSL syntax (lines 2–3). This definition can be seen as a predicate modeling the real-valued expected behavior of the function.
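To make the shape of the generated code concrete, the following is a minimal hand-written C sketch of the maybe struct and of the guard-stable tcoa function described above; identifiers such as maybe_double, none_double and some_double are illustrative names of ours, and the actual PRECiSA output (Fig. 11.7) additionally carries the ACSL annotations discussed next.

#include <stdbool.h>

/* maybe(double): either a valid value or a warning */
typedef struct {
  bool isValid;   /* false when the result is a warning */
  double value;   /* meaningful only if isValid is true */
} maybe_double;

static maybe_double none_double(void) {
  return (maybe_double){ .isValid = false, .value = 0.0 };
}

static maybe_double some_double(double val) {
  return (maybe_double){ .isValid = true, .value = val };
}

/* Guard-stable tcoa: eps is a verified bound on the round-off
   error of s_z * v_z (Example 1 uses eps = 1.71e-10). */
static maybe_double tcoa(double s_z, double v_z, double eps) {
  double prod = s_z * v_z;
  if (prod < -eps)      return some_double(-s_z / v_z);  /* surely then-branch */
  else if (prod >= eps) return some_double(0.0);         /* surely else-branch */
  else                  return none_double();            /* guard may be unstable */
}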

Fig. 11.8 Toolchain to automatically generate and verify guard-stable C code

The floating-point version is also expressed as an ACSL definition (lines 5–8). A predicate called tcoa_stable_paths is introduced to model under which conditions real and floating-point flows coincide (lines 10–14). A new variable E_0_double is introduced to model the error occurring in the conditional guard; this variable is used in the new guards (lines 25 and 27). Line 18 states that, if the result is not a warning, then the result of the C implementation is the same as that of the axiomatic floating-point function (lines 5–8). A post-condition is introduced stating that, when the instrumented function does not emit a warning and provided E_0_double is a correct over-estimation of the error for the expression sz ∗ vz, the predicate tcoa_stable_paths holds.

Given ranges for the input variables, it is possible to instantiate the error variable E_0_double in the instrumented program with numerical values representing a provably correct round-off error over-estimation. For instance, assuming 1 ≤ sz ≤ 1000 and 1 ≤ vz ≤ 1000, the function tcoa_num simply calls the main function tcoa, instantiating the error variable to the rounding error ε computed by PRECiSA for sz ∗˜ vz (see Example 1). A contract is generated stating that, assuming these initial input ranges, if s_z_double and v_z_double are correct roundings of the real-valued variables sz and vz, respectively, then the difference between the C function and its real-number specification is at most the round-off error computed by PRECiSA.

In summary, the fully automatic toolchain presented in [53] combines PRECiSA with the static analyzer Frama-C and the interactive prover PVS. The input to the toolchain is a real-valued program expressed in the PVS specification language, the desired floating-point precision (single and double precision are supported), and initial ranges for the input variables. The output is a formally verified annotated C program, guaranteed to emit a warning when real and floating-point paths diverge in the original program, together with PVS certificates that ensure its correctness. An overview of the approach is depicted in Fig. 11.8. As already mentioned, PRECiSA generates the annotated C code from the PVS specification. The tool suite Frama-C [30] is used to compute a set of verification conditions (VCs) stating the relationship between the transformed floating-point program and the original real-valued specification. The WP plug-in has been customized to support the PVS certificates generated by PRECiSA in the proof of correctness of the C program. This extension relates the proof obligations generated by Frama-C with the certificates emitted by PRECiSA.
These certificates are automatically discharged in PVS by proof strategies that recursively inspect the round-off error expression and apply the corresponding lemmas of the PVS floating-point round-off error formalization [9]. Therefore, no expertise in floating-point arithmetic or in PVS is required to verify the correctness of the generated C code.
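As a concrete reading of the instantiation step described above, the wrapper below sketches, with the same illustrative names as the earlier tcoa snippet, how the error variable is fixed to the numerical bound computed by PRECiSA for inputs in [1, 1000]; the real generated function additionally carries the ACSL contract relating it to the real-valued specification.

/* Hypothetical counterpart of the tcoa_num wrapper described above:
   it instantiates the error variable with the PRECiSA bound that is
   valid for 1 <= s_z, v_z <= 1000 (reuses tcoa from the earlier sketch). */
static maybe_double tcoa_num(double s_z, double v_z) {
  const double eps = 1.71e-10;   /* round-off bound from Example 1 */
  return tcoa(s_z, v_z, eps);
}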

11.4 Case Study: Point-in-Polygon Algorithm

PolyCARP (Algorithms for Computations with Polygons) [39, 41] is a NASA-developed open-source software library for geo-containment applications based on polygons (https://shemesh.larc.nasa.gov/fm/PolyCARP). One of the main applications of PolyCARP is to provide geofencing capabilities to unmanned aerial vehicles (UAVs), i.e., detecting whether a UAV is inside or outside a given geographical region, which is modeled using a 2D polygon with a minimum and a maximum altitude. Another application is detecting whether an aircraft's trajectory encounters weather cells, which are modeled as moving polygons.

PolyCARP implements point-in-polygon methods, i.e., methods for checking whether a point lies inside a polygon, based on the winding number computation. The winding number of a point s with respect to a polygon P is defined as the number of times the perimeter of P travels counterclockwise around s. For simple polygons, i.e., those without intersecting edges, this function can be used to determine whether s is inside or outside P. In [39], the winding number of s w.r.t. P is computed by applying a geometric translation that sets s as the origin of coordinates. For each edge e of P, the algorithm counts how many axes e intersects. This contribution can be positive or negative, depending on the direction of the edge e. If the sum of the contributions of all edges is 0, then s is outside the perimeter of P; otherwise, it is inside. Figure 11.9 shows the edge contributions in the computation of the winding number for two different polygons.

The mathematical functions that define the winding number algorithm are presented in Fig. 11.10. Given a point v = (vx, vy), the function Quadrant returns the quadrant in which v is located. Given the endpoints of an edge e, v = (vx, vy) and v′ = (v′x, v′y), and the point under test s = (sx, sy), the function EdgeContrib(vx, vy, v′x, v′y, sx, sy) computes the number of axes e intersects in the coordinate system centered in s. This function checks in which quadrants v and v′ are located and counts how many axes are crossed by the edge e. If v and v′ belong to the same quadrant, the contribution of the edge to the winding number is 0, since no axis is crossed. If v and v′ lie in adjacent quadrants, the contribution is 1 (respectively −1) if moving from v to v′ along the edge is in counterclockwise (respectively clockwise) direction. In the case where v and v′ are in opposite quadrants, the determinant is computed to check the direction of the edge: if it is counterclockwise, the contribution is 2; otherwise, it is −2. The function WindingNumber takes as input a point s = (sx, sy) and a polygon P of size n, which is represented as a pair of arrays ⟨Px, Py⟩ modeling the coordinates of its vertices (Px(0), Py(0)), . . . , (Px(n − 1), Py(n − 1)).
Fig. 11.9 Winding number edge contributions: (a) the sum of the contributions is 0 and the point is outside; (b) the sum of the contributions is 4 and the point is inside

Fig. 11.10 Winding number algorithm

The size of a polygon is defined as the number of its vertices. The winding number of s w.r.t. the polygon P is obtained as the sum of the contributions of all the edges in P. The result of the winding number is 0 if and only if the polygon P does not wind around the point s, hence s lies outside P.
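The following C sketch renders the three functions just described. It is one plausible concretization of Fig. 11.10 under assumptions of ours: the counterclockwise quadrant numbering, the adjacency test and the determinant orientation are illustrative choices, not the verified PolyCARP code.

/* Quadrant numbering assumed counterclockwise: 1 -> 2 -> 3 -> 4 -> 1. */
static int quadrant(double x, double y) {
  if (x >= 0 && y >= 0) return 1;
  if (x < 0 && y >= 0)  return 2;
  if (x < 0)            return 3;
  return 4;
}

static int edge_contrib(double vx, double vy, double wx, double wy,
                        double sx, double sy) {
  int q1 = quadrant(vx - sx, vy - sy);   /* translate so s is the origin */
  int q2 = quadrant(wx - sx, wy - sy);
  if (q1 == q2) return 0;                       /* same quadrant: no axis crossed */
  if ((q1 % 4) + 1 == q2) return 1;             /* adjacent, counterclockwise */
  if ((q2 % 4) + 1 == q1) return -1;            /* adjacent, clockwise */
  /* opposite quadrants: the determinant decides the direction */
  double det = (wx - vx) * (sy - vy) - (sx - vx) * (wy - vy);
  return det > 0 ? 2 : -2;
}

static int winding_number(int n, const double px[], const double py[],
                          double sx, double sy) {
  int sum = 0;
  for (int i = 0; i < n; i++) {
    int j = (i + 1) % n;                        /* next vertex, wrapping around */
    sum += edge_contrib(px[i], py[i], px[j], py[j], sx, sy);
  }
  return sum;                                   /* 0 iff s lies outside P */
}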


Fig. 11.11 Points that cause instability in EdgeContrib and WindingNumber

Unfortunately, a straightforward floating-point implementation of the winding number can lead to incorrect results due to round-off errors. In particular, unstable conditions may cause the control flow of the program to diverge with respect to the ideal real-number algorithm; this divergence potentially results in an incorrect point-in-polygon determination.

As an example, consider the edge (v, v′), where v = (1, 1) and v′ = (3, 2), in the polygon depicted in Fig. 11.11. The red lines represent a guaranteed over-approximation of the values for sx and sy that may cause instability in the function EdgeContrib w.r.t. the considered edge. The black aircraft denotes a case in which the contribution of the edge (v, v′) has a different value in real and floating-point arithmetic: when sx = 4 and sy ≈ 1.0000000000000001, the real function EdgeContrib returns −1, indicating that v and v′ are located in adjacent quadrants, while its floating-point counterpart returns 0, meaning that the vertices are located in the same quadrant. The red aircraft represents the point sx ≈ 2.0000000000000002, sy = 1.5, for which the main function WindingNumber returns 0, i.e., the point is outside, when evaluated with real arithmetic, and returns 4, i.e., the point is inside, when evaluated in floating-point arithmetic. This figure suggests that simply considering a buffer around the edge is not enough to guarantee the correct behavior of the EdgeContrib function, since errors in the contribution can occur even when the point is far from the boundaries. It has been conjectured that, for this algorithm, when the checked point is far from the edges of the polygon, the error occurring at one edge is compensated by the error at another edge in the computation of the winding number. To the authors' knowledge, no formal proof of this statement exists.

In [36], the approach presented in Sect. 11.3 is applied to generate a formally verified floating-point C implementation of the winding number algorithm. Figure 11.12 shows the code generated by PRECiSA for the Quadrant function. Two error variables are introduced: E_0_double is the error associated to the variable X_double, while E_1_double is the error associated to the variable Y_double. The conditional guards are replaced with stronger ones that take into consideration the rounding error, as explained in Sect. 11.3. A warning is generated if the input relative position lies so close to one of the Cartesian axes that a rounding error could change the quadrant determination.
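A minimal sketch of the strengthened quadrant test just described, with hypothetical names (the generated code in Fig. 11.12 differs in detail): the inequalities are shifted by the error bounds so that an answer, when given, agrees with the real-valued computation, and a warning is returned in the uncertain band around the axes.

#include <stdbool.h>

typedef struct { bool isValid; int value; } maybe_int;

static maybe_int none_int(void)  { return (maybe_int){ false, 0 }; }
static maybe_int some_int(int q) { return (maybe_int){ true, q }; }

/* e_x, e_y: verified bounds on the rounding errors of x and y. */
static maybe_int quadrant_stable(double x, double y, double e_x, double e_y) {
  if (x >= e_x && y >= e_y)  return some_int(1);  /* surely x >= 0, y >= 0 */
  if (x < -e_x && y >= e_y)  return some_int(2);
  if (x < -e_x && y < -e_y)  return some_int(3);
  if (x >= e_x && y < -e_y)  return some_int(4);
  return none_int();  /* within error of an axis: quadrant may flip */
}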


Fig. 11.12 ACSL annotations and code for the Quadrant function

Figure 11.13 shows the generated code for the EdgeContrib function. Five error variables are introduced, one for each coordinate of the points THIS and NEXT and one for the determinant. The results of the two calls to Quadrant are stored in a maybe structure of type Int. A warning is emitted either when one of these calls outputs a warning itself, or when the computation of the determinant produces a rounding error that taints the evaluation of the conditional guard; in both cases the warning is propagated to the calling function.

Figure 11.14 shows the code generated by PRECiSA for the WindingNumber function. Since this function contains a for-loop, a recursive real-valued version is generated as a logic axiomatic function in ACSL (lines 1–8). The loop invariant is computed in order to relate the result of each iteration of the for-loop with the corresponding call of the recursive real-valued function (lines 19–20). In particular, the invariant states that no rounding error is introduced in any floating-point iteration. If either one of the calls to EdgeContrib emits a warning, it is propagated to the calling function. It follows that, if the floating-point implementation of the winding number does not return a warning, then its result is equal to the result obtained using real-number arithmetic.
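The warning-propagation scheme can be sketched as follows, reusing maybe_int and quadrant_stable from the previous sketch. For brevity this version uses a single coordinate-error bound e_c, whereas the generated code of Fig. 11.13 carries one error variable per coordinate plus one, e_det, for the determinant; all names here are illustrative.

/* Guard-stable edge contribution with warning propagation. */
static maybe_int edge_contrib_stable(double vx, double vy,
                                     double wx, double wy,
                                     double sx, double sy,
                                     double e_c, double e_det) {
  maybe_int q1 = quadrant_stable(vx - sx, vy - sy, e_c, e_c);
  if (!q1.isValid) return q1;                    /* propagate the warning */
  maybe_int q2 = quadrant_stable(wx - sx, wy - sy, e_c, e_c);
  if (!q2.isValid) return q2;
  if (q1.value == q2.value) return some_int(0);  /* same quadrant */
  if ((q1.value % 4) + 1 == q2.value) return some_int(1);   /* ccw adjacent */
  if ((q2.value % 4) + 1 == q1.value) return some_int(-1);  /* cw adjacent */
  double det = (wx - vx) * (sy - vy) - (sx - vx) * (wy - vy);
  if (det >= -e_det && det < e_det) return none_int();      /* unstable guard */
  return some_int(det > 0 ? 2 : -2);
}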


Fig. 11.13 ACSL annotations and code for the EdgeContrib function

11.5 Related Work

In recent years, numerous techniques have been proposed to improve the quality of floating-point code [12]. Worst-case rounding error analysis tools such as PRECiSA provide a sound enclosure of the round-off error that may occur in a program; examples of such tools are FPTaylor [50], Fluctuat [27], Daisy [19], Real2Float [31], Gappa [23], and Seesaw [21]. Precision allocation (or tuning) tools [1, 13, 20, 46] select the lowest floating-point precision for the program variables that is enough to achieve the desired accuracy; this is essential for maximizing performance while keeping the overall round-off error within the desired bound.


Fig. 11.14 ACSL annotations and code for the WindingNumber function

Program optimization tools [17, 43, 51, 56] improve the accuracy of floating-point programs by rewriting arithmetic expressions in equivalent forms with a lower round-off error.

The verification approach used in this work is similar to the analysis of numerical programs described in [7], where a chain of tools composed of Frama-C, the Jessie plug-in [33], and Why is used. The verification conditions obtained from the ACSL-annotated C programs are checked by several external provers, including Coq, Gappa, Z3 [37], CVC3 [2], and Alt-Ergo. Recently, much work has been done on the verification of numerical properties of industrial and safety-critical C code, including aerospace software. The approach presented in [7] was applied to the formal verification of wave propagation differential equations [6] and to the verification of numerical properties of a pairwise state-based conflict detection algorithm [26]. A similar verification approach was employed to verify numerical properties of industrial software related to inertial navigation [32].


The interactive theorem prover Coq [4] can also be used to prove verification conditions on floating-point numbers, thanks to the formalization defined in [8]. Nevertheless, Coq tactics need to be implemented to automate the verification process. The static analyzer Astrée [16] detects the presence of run-time exceptions, such as divisions by zero and underflows and overflows, by means of sound floating-point abstract domains [11, 34]. Astrée has been successfully applied to automatically check for the absence of run-time errors associated with floating-point computations in aerospace control software [5]. More specifically, in [22] the fly-by-wire primary software of commercial airplanes is verified. Additionally, Astrée and Fluctuat were combined to analyze on-board software acting in the Monitoring and Safing Unit of the ATV space vehicle [10].

11.6 Conclusion and Future Challenges

Formally verifying that finite-precision implementations behave as expected is essential in safety-critical applications such as aerospace software. Rounding errors and unstable conditionals, which occur when rounding errors affect the evaluation of conditional statements, are hard to detect without the expert use of specialized tools. This paper presented how different formal methods tools can be combined to provide formal guarantees on the rounding error of floating-point programs, and showed different applications to avionics software.

The formal technique used to verify the floating-point implementation of the CPR algorithm, which combines Frama-C, Gappa, and PVS, while successful, requires a high level of expertise in floating-point arithmetic, theorem proving, and deductive reasoning. For this reason, a fully automatic toolchain composed of Frama-C, PRECiSA, and PVS is proposed in [53]. Given a PVS program specification written in real arithmetic and the desired precision, PRECiSA automatically generates a guard-stable floating-point version in C syntax, enriched with ACSL annotations stating the rounding error introduced by each function in the program. PVS proof certificates are automatically generated to ensure the correctness of the round-off error over-estimations used in the program transformation. The Frama-C/WP plug-in is used to generate verification conditions in PVS. The customization developed in [53] enables a seamless integration between the proof obligations generated by Frama-C and the PVS certificates generated by PRECiSA. Having externally checkable certificates increases the level of confidence in the approach. In addition, no theorem proving expertise is required from the user, since proof strategies, implemented as part of this work, automatically discharge the verification conditions generated by Frama-C.

This verification approach has been applied to generate code for the PolyCARP geofencing NASA library and for a fragment of NASA's DAIDALUS software library [38], which serves as a reference implementation of the minimum operational performance standards of detect-and-avoid for unmanned aircraft systems in RTCA's DO-365 [45].


While much work has been done in the last decade to automatically reason on the rounding error of finite-precision computations, several challenges remain open. As described in Sect. 11.5, a variety of tools have been proposed to address different aspects of numerical reliability. However, little effort has been devoted to combining these techniques, which are often complementary. A first step in this direction has been taken by the FPBench project [18], which provides a common specification language for floating-point analysis tools. In [3], two tools have been combined: Herbie, which performs accuracy optimization, and Daisy, which performs accuracy verification. Pherbie [47] combines precision tuning and rewriting to improve both the accuracy and the speed of floating-point expressions. An interesting future direction is the integration of the program instrumentation and code generation of [53] with numerical optimization tools such as Salsa [17] and Herbie [43]. This integration would improve the accuracy of the mathematical expressions used inside a program and, at the same time, prevent unstable guards that may cause unexpected behaviors. The proposed approach could also be combined with precision tuning techniques [13, 20]. Since the program transformation lowers the overall round-off error, this would likely increase the chance of finding a precision allocation meeting the target accuracy.

Unstable guards are particularly challenging to detect and correct. The technique proposed in [53] instruments the code to emit a warning when a rounding error may occur. This technique is based on a sound over-approximation of the conditional guards, so false warnings may arise. Instead, it would be useful to check exactly whether conditional instability can occur and to compute the set of inputs that may provoke it. To this aim, techniques for solving mixed real and floating-point formulas are needed. FPRoCK [48] is a prototype tool for solving mixed real and floating-point formulas: it transforms a mixed formula into an equisatisfiable one over the reals, which is then solved using an off-the-shelf SMT solver. The current version of FPRoCK has some limitations in terms of expressivity and efficiency; thus, the development of more advanced techniques, and direct support for mixed formulas in an SMT solver, are needed to address this challenge.

Finally, the pervasive use of machine learning in several applications, including aerospace, calls for formal methods to ensure functional correctness when safety is imperative. This includes verifying that numerical errors do not compromise the training and behavior of neural networks, which, to the best of the authors' knowledge, is a still unexplored research area.

Acknowledgements Research by the first three authors was supported by the National Aeronautics and Space Administration under NASA/NIA Cooperative Agreement NNL09AA00A.


References

1. Adjé, A., Ben Khalifa, D., Martel, M.: Fast and efficient bit-level precision tuning. In: Proceedings of the 28th International Symposium on Static Analysis (SAS 2021), Lecture Notes in Computer Science, vol. 12913, pp. 1–24. Springer (2021). https://doi.org/10.1007/978-3-030-88806-0_1
2. Barrett, C., Tinelli, C.: CVC3. In: Proceedings of the 19th International Conference on Computer Aided Verification, CAV 2007, pp. 298–302 (2007)
3. Becker, H., Panchekha, P., Darulova, E., Tatlock, Z.: Combining tools for optimization and analysis of floating-point computations. In: Proceedings of the 22nd International Symposium on Formal Methods (FM 2018), Lecture Notes in Computer Science, vol. 10951, pp. 355–363. Springer (2018). https://doi.org/10.1007/978-3-319-95582-7_21
4. Bertot, Y., Castéran, P.: Interactive Theorem Proving and Program Development – Coq'Art: The Calculus of Inductive Constructions. Texts in Theoretical Computer Science. An EATCS Series. Springer (2004)
5. Bertrane, J., Cousot, P., Cousot, R., Feret, J., Mauborgne, L., Miné, A., Rival, X.: Static analysis and verification of aerospace software by abstract interpretation. Found. Trends Program. Lang. 2(2–3), 71–190 (2015)
6. Boldo, S., Clément, F., Filliâtre, J.C., Mayero, M., Melquiond, G., Weis, P.: Wave equation numerical resolution: a comprehensive mechanized proof of a C program. J. Autom. Reason. 50(4), 423–456 (2013)
7. Boldo, S., Marché, C.: Formal verification of numerical programs: from C annotated programs to mechanical proofs. Math. Comput. Sci. 5(4), 377–393 (2011)
8. Boldo, S., Melquiond, G.: Flocq: a unified library for proving floating-point algorithms in Coq. In: 20th IEEE Symposium on Computer Arithmetic, ARITH 2011, pp. 243–252. IEEE Computer Society (2011)
9. Boldo, S., Muñoz, C.: A high-level formalization of floating-point numbers in PVS. Tech. Rep. CR-2006-214298, NASA (2006)
10. Bouissou, O., Conquet, E., Cousot, P., Cousot, R., Feret, J., Goubault, E., Ghorbal, K., Lesens, D., Mauborgne, L., Miné, A., Putot, S., Rival, X., Turin, M.: Space software validation using abstract interpretation. In: Proceedings of the International Space System Engineering Conference, Data Systems in Aerospace, DASIA 2009, pp. 1–7. ESA Publications (2009)
11. Chen, L., Miné, A., Cousot, P.: A sound floating-point polyhedra abstract domain. In: Proceedings of the 6th Asian Symposium on Programming Languages and Systems, APLAS 2008, Lecture Notes in Computer Science, vol. 5356, pp. 3–18. Springer (2008)
12. Cherubin, S., Agosta, G.: Tools for reduced precision computation: a survey. ACM Comput. Surv. 53(2), 33:1–33:35 (2020). https://doi.org/10.1145/3381039
13. Chiang, W., Baranowski, M., Briggs, I., Solovyev, A., Gopalakrishnan, G., Rakamarić, Z.: Rigorous floating-point mixed-precision tuning. In: Proceedings of the 44th ACM SIGPLAN Symposium on Principles of Programming Languages, POPL 2017, pp. 300–315. ACM (2017)
14. Code of Federal Regulations: Automatic dependent surveillance-broadcast (ADS-B) Out, 91 C.F.R., Section 225 (2015)
15. Conchon, S., Contejean, E., Kanig, J., Lescuyer, S.: CC(X): semantic combination of congruence closure with solvable theories. Electron. Notes Theor. Comput. Sci. 198(2), 51–69 (2008)
16. Cousot, P., Cousot, R., Feret, J., Mauborgne, L., Miné, A., Monniaux, D., Rival, X.: The ASTRÉE analyzer. In: Proceedings of the 14th European Symposium on Programming (ESOP 2005), Lecture Notes in Computer Science, vol. 3444, pp. 21–30. Springer (2005)
17. Damouche, N., Martel, M.: Salsa: an automatic tool to improve the numerical accuracy of programs. In: 6th Workshop on Automated Formal Methods, AFM 2017 (2017)
18. Damouche, N., Martel, M., Panchekha, P., Qiu, C., Sanchez-Stern, A., Tatlock, Z.: Toward a standard benchmark format and suite for floating-point analysis. In: Proceedings of the 9th International Workshop on Numerical Software Verification (NSV), pp. 63–77. Springer (2016)


19. Darulova, E., Izycheva, A., Nasir, F., Ritter, F., Becker, H., Bastian, R.: Daisy – framework for analysis and optimization of numerical programs (tool paper). In: 24th International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS 2018), Lecture Notes in Computer Science, vol. 10805, pp. 270–287. Springer (2018)
20. Darulova, E., Kuncak, V.: Sound compilation of reals. In: Proceedings of the 41st Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL), pp. 235–248. ACM (2014)
21. Das, A., Tirpankar, T., Gopalakrishnan, G., Krishnamoorthy, S.: Robustness analysis of loop-free floating-point programs via symbolic automatic differentiation. In: IEEE International Conference on Cluster Computing (CLUSTER 2021), pp. 481–491. IEEE (2021). https://doi.org/10.1109/Cluster48925.2021.00055
22. Delmas, D., Souyris, J.: Astrée: from research to industry. In: Proceedings of the 14th International Symposium on Static Analysis, SAS 2007, pp. 437–451 (2007)
23. de Dinechin, F., Lauter, C., Melquiond, G.: Certifying the floating-point implementation of an elementary function using Gappa. IEEE Trans. Comput. 60(2), 242–253 (2011)
24. Dutle, A., Moscato, M., Titolo, L., Muñoz, C.: A formal analysis of the compact position reporting algorithm. In: 9th Working Conference on Verified Software: Theories, Tools, and Experiments, VSTTE 2017, Revised Selected Papers, vol. 10712, pp. 19–34 (2017)
25. European Commission: Commission Implementing Regulation (EU) 2017/386 of 6 March 2017 amending Implementing Regulation (EU) No 1207/2011, C/2017/1426 (2017)
26. Goodloe, A., Muñoz, C., Kirchner, F., Correnson, L.: Verification of numerical programs: from real numbers to floating point numbers. In: Proceedings of NFM 2013, Lecture Notes in Computer Science, vol. 7871, pp. 441–446. Springer (2013)
27. Goubault, E., Putot, S.: Static analysis of numerical algorithms. In: Proceedings of SAS 2006, Lecture Notes in Computer Science, vol. 4134, pp. 18–34. Springer (2006)
28. Goubault, E., Putot, S.: Robustness analysis of finite precision implementations. In: Proceedings of APLAS 2013, Lecture Notes in Computer Science, vol. 8301, pp. 50–57. Springer (2013)
29. IEEE: IEEE standard for binary floating-point arithmetic. Tech. Rep., Institute of Electrical and Electronics Engineers (2008)
30. Kirchner, F., Kosmatov, N., Prevosto, V., Signoles, J., Yakobowski, B.: Frama-C: a software analysis perspective. Form. Asp. Comput. 27(3), 573–609 (2015)
31. Magron, V., Constantinides, G., Donaldson, A.: Certified roundoff error bounds using semidefinite programming. ACM Trans. Math. Softw. 43(4), 34:1–34:31 (2017)
32. Marché, C.: Verification of the functional behavior of a floating-point program: an industrial case study. Sci. Comput. Program. 96, 279–296 (2014)
33. Marché, C., Moy, Y.: The Jessie plugin for deductive verification in Frama-C (2017)
34. Miné, A.: Relational abstract domains for the detection of floating-point run-time errors. In: Proceedings of the 13th European Symposium on Programming Languages and Systems, ESOP 2004, Lecture Notes in Computer Science, vol. 2986, pp. 3–17. Springer (2004)
35. Moscato, M., Titolo, L., Dutle, A., Muñoz, C.: Automatic estimation of verified floating-point round-off errors via static analysis. In: Proceedings of the 36th International Conference on Computer Safety, Reliability, and Security, SAFECOMP 2017. Springer (2017)
36. Moscato, M., Titolo, L., Feliú, M., Muñoz, C.: Provably correct floating-point implementation of a point-in-polygon algorithm. In: Proceedings of the 23rd International Symposium on Formal Methods (FM 2019) (2019)
37. de Moura, L., Bjørner, N.: Z3: an efficient SMT solver. In: Proceedings of the 14th International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS), pp. 337–340. Springer (2008)
38. Muñoz, C., Narkawicz, A., Hagen, G., Upchurch, J., Dutle, A., Consiglio, M.: DAIDALUS: detect and avoid alerting logic for unmanned systems. In: Proceedings of the 34th Digital Avionics Systems Conference (DASC 2015), Prague, Czech Republic (2015)
39. Narkawicz, A., Hagen, G.: Algorithms for collision detection between a point and a moving polygon, with applications to aircraft weather avoidance. In: Proceedings of the AIAA Aviation Conference (2016)


40. Narkawicz, A., Muñoz, C.: A formally verified generic branching algorithm for global optimization. In: Proceedings of the 5th International Conference on Verified Software: Theories, Tools, Experiments (VSTTE), pp. 326–343. Springer (2013)
41. Narkawicz, A., Muñoz, C., Dutle, A.: The MINERVA software development process. In: 6th Workshop on Automated Formal Methods, AFM 2017 (2017)
42. Owre, S., Rushby, J., Shankar, N.: PVS: a prototype verification system. In: Proceedings of the 11th International Conference on Automated Deduction (CADE), pp. 748–752. Springer (1992)
43. Panchekha, P., Sanchez-Stern, A., Wilcox, J.R., Tatlock, Z.: Automatically improving accuracy for floating point expressions. In: Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2015, pp. 1–11. ACM (2015)
44. RTCA SC-186: Minimum operational performance standards for 1090 MHz extended squitter automatic dependent surveillance-broadcast (ADS-B) and traffic information services-broadcast (TIS-B) (2009)
45. RTCA SC-228: DO-365, Minimum operational performance standards for detect and avoid (DAA) systems (2017)
46. Rubio-González, C., Nguyen, C., Nguyen, H., Demmel, J., Kahan, W., Sen, K., Bailey, D., Iancu, C., Hough, D.: Precimonious: tuning assistant for floating-point precision. In: International Conference for High Performance Computing, Networking, Storage and Analysis, SC'13, p. 27. ACM (2013)
47. Saiki, B., Flatt, O., Nandi, C., Panchekha, P., Tatlock, Z.: Combining precision tuning and rewriting. In: 28th IEEE Symposium on Computer Arithmetic (ARITH 2021), pp. 1–8. IEEE (2021). https://doi.org/10.1109/ARITH51176.2021.00013
48. Salvia, R., Titolo, L., Feliú, M., Moscato, M., Muñoz, C., Rakamarić, Z.: A mixed real and floating-point solver. In: Proceedings of the 11th NASA Formal Methods International Symposium (NFM 2019), Lecture Notes in Computer Science, vol. 11460, pp. 363–370. Springer (2019). https://doi.org/10.1007/978-3-030-20652-9_25
49. Smith, A.P., Muñoz, C., Narkawicz, A.J., Markevicius, M.: A rigorous generic branch and bound solver for nonlinear problems. In: Proceedings of the 17th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, SYNASC 2015, pp. 71–78 (2015)
50. Solovyev, A., Jacobsen, C., Rakamarić, Z., Gopalakrishnan, G.: Rigorous estimation of floating-point round-off errors with symbolic Taylor expansions. In: Proceedings of the 20th International Symposium on Formal Methods (FM), pp. 532–550. Springer (2015)
51. Thévenoux, L., Langlois, P., Martel, M.: Automatic source-to-source error compensation of floating-point programs. In: 18th IEEE International Conference on Computational Science and Engineering, CSE 2015, pp. 9–16. IEEE Computer Society (2015)
52. Titolo, L., Feliú, M., Moscato, M., Muñoz, C.: An abstract interpretation framework for the round-off error analysis of floating-point programs. In: Proceedings of the 19th International Conference on Verification, Model Checking, and Abstract Interpretation (VMCAI), pp. 516–537. Springer (2018)
53. Titolo, L., Moscato, M., Feliú, M., Muñoz, C.: Automatic generation of guard-stable floating-point code. In: Proceedings of the 16th International Conference on Integrated Formal Methods (IFM 2020), Lecture Notes in Computer Science, vol. 12546, pp. 141–159. Springer (2020). https://doi.org/10.1007/978-3-030-63461-2_8
54. Titolo, L., Moscato, M., Muñoz, C., Dutle, A., Bobot, F.: A formally verified floating-point implementation of the compact position reporting algorithm. In: Proceedings of the 22nd International Symposium on Formal Methods (FM 2018), Lecture Notes in Computer Science, vol. 10951, pp. 364–381. Springer (2018)
55. Titolo, L., Muñoz, C., Feliú, M., Moscato, M.: Eliminating unstable tests in floating-point programs. In: Proceedings of the 28th International Symposium on Logic-Based Program Synthesis and Transformation (LOPSTR 2018), pp. 169–183. Springer (2018)
56. Yi, X., Chen, L., Mao, X., Ji, T.: Efficient automated repair of high floating-point errors in numerical libraries. Proc. ACM Program. Lang. 3(POPL), 56:1–56:29 (2019)

Chapter 12

Risk Estimation in IoT Systems

Chiara Bodei, Gian-Luigi Ferrari, Letterio Galletta, and Pierpaolo Degano

Abstract In the era of the Internet of Things, it is essential to ensure that the data collected by sensors and smart devices are reliable and that they are aggregated and transmitted securely to the computational components. This has significant effects on the software that manages critical decisions and actuations of IoT systems, with possibly serious consequences when linked to essential services. The development of IoT applications requires suitable techniques to understand and evaluate the complexity of the design process. Here we adopt a software engineering approach in which IoT applications are formally specified (specifically, in the IoT-LySa process calculus) and a Control Flow Analysis (CFA) is exploited to statically predict how data will flow through the system. The CFA thus reconstructs a kind of supply chain describing how data are subsequently aggregated and used. Based on the analysis prediction, we propose a risk analysis that captures the dependencies between collected and aggregated data and critical decisions.

12.1 Introduction

Patrick Cousot taught us that the world is divided into people who consciously use Abstract Interpretation and people who do not, but use it anyway. The authors, "to err on the safe side", are aware that Control Flow Analysis is built around classical approaches to static analysis, including Abstract Interpretation, with which it shares the same well-founded theoretical basis.



Recently, more and more physical devices have been transformed into intelligent entities and connected to the so-called Internet of Things (IoT), resulting in intelligent ecosystems used in various fields, such as environmental monitoring, agriculture, industry, health care, smart cars and smart homes. The complex processing of sensor data, due to aggregation and logical functions applied at different points of the network, heavily impacts the physical part of these systems. Actuator regulation and anomaly management in fact depend on these data, whose reliability is therefore crucial. Sensors can be compromised, due to accidental failure or intentional malicious attack, and thus fail to yield reliable data. We focus here on how unreliable data coming from attacked or frail sensors can affect the evaluation of critical conditions, possibly leading to faulty behaviour of systems.

For this purpose, we exploit a language-based methodology introduced in [1], where the modelling language IoT-LySa is first used to describe the structure and interactions of an IoT system. Afterwards, IoT-LySa specifications are analysed with a Control Flow Analysis (CFA), adapted from [2–4], that provides us with an abstract and approximate model of the system's behaviour. Our analysis offers over-approximations of all possible data flows, abstracting from their concrete values. CFA abstract values record the type of data, their origin (not their numerical consistency), the way they are exchanged, and the functions used to aggregate and process them. In this regard, the abstract values that correspond to critical conditions "symbolically" encode the logical structure and the supply chain of the data used for decision-making. We analyse the CFA abstract values related to critical decisions in terms of both the security and robustness of the sensors and the resulting reliability of their values. Note that, from a security point of view, an explicit attacker model is not necessary: we can simply decide which subsets of sensors can be compromised and partition the abstract sensor data accordingly.

In more detail, we start from the CFA results and adopt an ex post taint analysis, partitioning the abstract sensor data (our sources) into tainted/frail (the ones coming from possibly compromised sensors) and untainted/unfrail. We then monitor the propagation of the data and the impact on the decisions of interest. This provides a single formal framework for several parametric investigations, depending on the chosen abstract data partition and the chosen propagation policy. Investigations are carried out downstream of the CFA and applied directly to the abstract values obtained from it; changing partitions and policies does not require the CFA to be re-computed. This approach emphasises the strength of Control Flow Analysis, which merely infers the needed information and leaves the examination of its predictions of system behaviour to a separate stage. This static technique offers a general methodology that can be applied to many different contexts and that can help in the design process of IoT systems. By simply changing partitions, we can check whether a subset of sensors is able to influence the decision under analysis, following a sort of "what if" reasoning (see the sketch below). The hierarchical structure of the abstract value of interest helps determine which subsets of sensor data are sufficient, if unreliable, to alter the whole value.
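To fix intuitions, here is a toy C rendering of this ex post check under the strict propagation policy discussed below: abstract values are trees with sensors at the leaves and aggregation functions at inner nodes, and a chosen partition marks some sensors as tainted. None of this code comes from the paper; types and names are ours.

#include <stdbool.h>
#include <stddef.h>

/* A sensor leaf has sensor_id >= 0; an inner node f(v1, ..., vn) has -1. */
typedef struct AbsVal {
  int sensor_id;
  const struct AbsVal *const *args;  /* children, for aggregated values */
  size_t n_args;
} AbsVal;

/* Strict policy: a single tainted leaf taints the whole abstract value.
   The tainted predicate encodes the chosen partition of sensor data. */
static bool strict_taint(const AbsVal *v, bool (*tainted)(int sensor_id)) {
  if (v->sensor_id >= 0) return tainted(v->sensor_id);
  for (size_t i = 0; i < v->n_args; i++)
    if (strict_taint(v->args[i], tainted)) return true;
  return false;
}

Changing the partition amounts to swapping the tainted predicate, and changing the policy amounts to replacing the propagation logic in strict_taint, in both cases without recomputing the CFA.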


Concerning propagation, we can adopt strict policies in which a single taint value contaminates the entire function applied to it. Alternatively, we can choose policies that do not propagate taint labels when only some of the data are unreliable. In this way, we can address resilience issues and consider an acceptable level of decision reliability even in the presence of unreliable data.

Finally, in the style of Hankin et al. [5, 6], we associate abstract sensor data with numerical scores. For security, scores are proportional to the attacker's effort to compromise the corresponding physical devices. As a consequence, we can estimate and compare the risks of compromise of each subset of sensors in the analysed network, and determine where it is worth taking countermeasures. For robustness, the scores represent the reliability of the sensors, e.g. due to battery problems, and help designers determine which of them are most suitable to reinforce.

In this paper, the findings from [3, 4] are extended by proposing a unifying framework for reasoning about critical decisions in IoT systems. The framework exploits a taint analysis that is directly applied to the abstract values derived from the CFA.

The paper is organised as follows. In Sect. 12.2, we briefly present an IoT monitoring system as running example. The technical background is included in Sect. 12.3: Subsect. 12.3.1 briefly introduces the syntax and semantics of our process calculus IoT-LySa, while Subsect. 12.3.2 intuitively presents the adopted CFA. In Sect. 12.4, the results of the analysis are used to reason on critical decisions and to estimate the costs of possible security attacks or accidental failures. Conclusions are given in Sect. 12.5.

12.2 Indoor Environmental Monitoring Scenario

We illustrate our approach through a simple smart building scenario, in which an IoT network provides an indoor environmental monitoring system N, described in Fig. 12.1. The system is mainly used to sense room conditions. Each sensor has its own measurement parameters, with a range of values for standard behaviour and thresholds for anomalous situations. The detection of unwanted fluctuations or imbalances in the measured values triggers appropriate warnings and actuations that regulate the uneven values so as to restore the right conditions.

Here, we provide a high-level overview of the smart building system. In each room, there are environmental sensors to measure temperature, humidity and light and to detect fire and smoke, as well as the water level in the pump used to extinguish fires. Data are gathered locally at many points in the room by devices called Environmental nodes. The collected data are provided to the node monitoring the room, called Room controller, which aggregates them and decides which regulations are in order, e.g. which actions are needed to increase or decrease the temperature. Furthermore, the Room controller sends appropriate signals to take specific actuations when selected measurements are out of bounds and cause an alarm.


Fig. 12.1 Indoor environmental monitor network N

A number of actuators are provided to control the different parameters sampled by the sensors: temperature is controlled by thermostats, humidity by dehumidifiers, and light intensity by lamps, while fire extinguishers are activated in the event of a fire alarm. The aggregated data computed by each Room controller are then sent to the control room of the entire building, called Building monitor, which stores them in a database and, in case of alarm, reacts by taking suitable actions, e.g. triggering an automatic call to the Fire Station in case of fire.

12.3 Technical Background

12.3.1 Overview of IoT-LySa

We first briefly review IoT-LySa [1, 7], a specification language proposed for IoT systems. A formal presentation of the process calculus and of the static analysis can be found in [3]; only the relevant aspects are introduced here, exemplified on the indoor environmental monitoring scenario outlined in Sect. 12.2.

Syntax. Systems in IoT-LySa are pools of parallel (the operator is |) nodes N (things), each of which is uniquely identified by a label ℓ. Each node hosts a parallel composition (with the operator ∥) of internal components: a shared local store Σ for internal communication, sensors S, actuators A, and a finite number of control processes P. Systems of nodes are defined by the syntax in Table 12.1. We introduce the syntax and the semantics of the calculus using the following IoT-LySa specification of the network N of our smart building scenario.


Table 12.1 Syntax

N ∈ N ::= systems of nodes
  0                                     empty system
  ℓ : [B]                               single node (ℓ ∈ L)
  N1 | N2                               parallel composition

B ∈ B ::= node components
  Σ                                     node store
  P                                     process
  S                                     sensor (label i ∈ I)
  A                                     actuator (label j ∈ J)
  B ∥ B                                 parallel composition

S ∈ S ::= sensors
  0                                     inactive sensor
  τ.S                                   internal action
  i := v. S                             store of v ∈ V by the i-th sensor
  h                                     iteration variable
  μh.S                                  tail iteration

A ∈ A ::= actuators
  0                                     inactive actuator
  τ.A                                   internal action
  (|j, Γ|). A                           command for actuator j
  γ.A                                   triggered action (γ ∈ Γ)
  h                                     iteration variable
  μh.A                                  tail iteration

E ∈ E ::= annotated terms
  M^a                                   annotated term (a ∈ A)

M ∈ M ::= terms
  v                                     value (v ∈ V)
  i                                     sensor location (i ∈ I)
  x                                     variable (x ∈ X)
  f(E1, . . . , Er)                     function on data

P ::= control processes
  0                                     inactive process
  ⟨⟨E1, . . . , Er⟩⟩ ▷ L. P              asynchronous multi-output (L ⊆ L)
  (E1, . . . , Ej; xj+1, . . . , xr). P  input (with matching and tag)
  E ? P : Q                             conditional statement
  h                                     iteration variable
  μh. P                                 tail iteration
  x^a := E. P                           assignment to x ∈ X
  ⟨E, j, γ⟩. P                          output of action γ to actuator j on condition E

N  = M | C | E1 | · · · | E6 | C′ | E′1 | · · ·
M  = ℓM : [ΣM ∥ PM ∥ BM]
C  = ℓC : [ΣC ∥ PC ∥ Swater ∥ Apump ∥ BC]
E1 = ℓT : [ΣT ∥ PT ∥ T1 ∥ · · · ∥ T4 ∥ Athermostat ∥ BT]
· · ·
E6 = ℓF : [ΣF ∥ PF ∥ F1 ∥ · · · ∥ F4 ∥ Afire−suppression ∥ BF]

Our network system consists of the Building Monitor node M put in parallel, through the operator |, with the nodes located in the n rooms of the building. In each room, there are a Room Controller node and six Environmental nodes. For the sake of simplicity, we restrict ourselves to the nodes in the first room, where the Room Controller is C and the Environmental nodes are Ej, with j ∈ [1, 6].


Sensors are active entities that obtain data from the physical environment and put them in reserved locations of the local store of the node, while actuators are passive entities: they only wait for a command to operate on the environment. Sensors and actuators have, as unique identifiers, natural numbers in the sets I and J, respectively. A sensor can perform an internal activity τ, or it can put the value v, sampled from the environment, into its store location i with the assignment action i := v.

In the IoT-LySa specification of our example, the node C consists of a processing unit PC, a sensor Swater, an actuator Apump and some further immaterial components abstracted by BC. Both the sensor and the actuator repeat their behaviour, as specified by the tail-recursive construct μh.P, with h as iteration variable. The sensor samples the water level in the pump from the environment and stores it in its dedicated variable 1water, while the actuator waits for regulation commands from the control process:

Swater = μh. 1water := swater. τ. h
Apump = μh. (|Apump, {PumpReg(x)}|). τ. h

The specifications of the sensors and actuators in the environmental nodes are similar:

Ti = μh. iti := sti. τ. h
Athermostat = μh. (|Athermostat, {Goth(x)}|). τ. h
Fi = μh. ifi := sfi. τ. h
Afire−suppression = μh. (|Afire−suppression, {Gofs(x)}|). τ. h

Processes specify the logic of the system, by coordinating sensors and actuators, communicating with the other network nodes, and managing the data gathered from sensors and exchanged with other nodes. Data are represented by terms (values, sensor locations and functions on data), which carry annotations a, a′, ai, ... ∈ A that are used in the CFA to identify their occurrences. For simplicity, we omit the annotations in the specification of our running example, but we assume that each term E is annotated by lE.

In more detail, a process specification consists of a sequence of prefixes (sequentialised by the operator "."), representing communication actions, conditional statements, iteration, assignments and actuation commands for actuators. In our example, the control process PC initially stores the water level data collected by the sensor (by means of the assignment prefix), checks whether the level falls within the wanted range [twater, Twater] using the predicate rng, assigning the outcome to the variable checkwater, and computes the possible actuator regulation parameter vpump through the function reg. Finally, it produces a suitable trigger command to regulate the actuator. Syntactically, ⟨checkwater, Pump, PumpReg(vpump)⟩ represents the trigger command of action PumpReg for the actuator Pump, with parameter vpump, to be executed when the condition checkwater is true.

PC = μh. wwater := 1water.
     checkwater := rng(wwater, twater, Twater). vpump := reg(wwater).
     ⟨checkwater, Pump, PumpReg(vpump)⟩. P′C


After that, the continuation P′C proceeds with the reception of data from the environmental nodes, the checks, the computation of possible regulation instructions and the detection of anomalies. Before examining P′C, we illustrate the specification of the control processes PT and PF of the environmental nodes E1 and E6, respectively. These processes communicate with each other through input and output actions, expressed by the corresponding prefixes. In general, the prefix ⟨⟨E1, . . . , Er⟩⟩ ▷ L represents a simple form of multi-party communication: the tuple obtained by evaluating E1, . . . , Er is asynchronously sent to the nodes with labels in the set L that lie in the same transmission range. The input prefix (E′1, . . . , E′j; xj+1, . . . , xr) receives an r-tuple, provided that its first j elements match the corresponding input elements, i.e. E1 = E′1, . . . , Ej = E′j, and then assigns the variables (after ";") the received values; otherwise, the r-tuple is not accepted. In the PT specification, the output prefix is ⟨⟨E1, zt1, zt2, zt3, zt4⟩⟩ ▷ {ℓC}: the receiver is the Room Controller C, whose label is ℓC. The corresponding input prefix in PC is (E1; wt1, . . . , wt4), where the first element E1 matches and each variable wti is assigned the corresponding sent value zti (with i ∈ [1, 4]).

PT = μh. zt1 := 1t1. · · · . zt4 := 4t4.
     ⟨⟨E1, zt1, zt2, zt3, zt4⟩⟩ ▷ {ℓC}.
     (C; zcheckT, vthermostat). ⟨zcheckT, AThermostat, Goth(vthermostat)⟩. h

PF = μh. zf1 := 1f1. · · · . zf4 := 4f4.
     ⟨⟨E6, zf1, zf2, zf3, zf4⟩⟩ ▷ {ℓC}.
     (C; zcheckAlarm, zalarm). ⟨zalarm, Afire−suppression, Gofs(zalarm)⟩. h

The control process PT stores the temperature data collected by the four sensors of the node in the locations zti, sends all of them to the Room Controller C, waits for its actuation decision for the appropriate parameter and, therefore, produces the suitable trigger command. Similarly, the control process PF stores the fire detection data collected by the four sensors of the node in the locations zfi, sends all of them to the Room Controller C, and waits for the trigger command for the fire suppression system, in case of alarm.

P′C = (E1; wt1, . . . , wt4). · · · . (E6; wf1, . . . , wf4).
      avgT := avg(wt1, ..., wt4).
      checkT := rng(avgT, tmin, Tmax). setT := reg(avgT). · · ·
      ⟨⟨C, checkT, setT⟩⟩ ▷ {ℓT}.
      ... [management of data coming from E2, ..., E5]
      P′′C

P′′C = checkTA := gt(avgT, Tanomaly).
      dt1 := dt(wf1). ... dt4 := dt(wf4).
      checkFA := or(dt1, . . . , dt4).
      checkalarm := and(checkwater, and(checkTA, checkFA)).
      ⟨checkalarm, Pump, PumpReg(alarm)⟩.
      ⟨⟨C, checkalarm, alarm⟩⟩ ▷ {ℓF, ℓM}. h


The control process P′C receives and checks the data sensed by all environmental nodes, and makes the suitable actuation decisions. In particular, for temperature management, it computes the average of the temperature data, checks whether the average is in the right range [tmin, Tmax] and computes the parameter of the actuator regulation action for PT. The process P′′C checks for temperature anomalies and monitors the data detected by the fire detection sensors. The variable checkalarm detects a fire alarm when the water level is not right, the average temperature in the room is over the threshold TF, and at least one sensor detects fire. In case of a fire alarm, the process sends E6 and M the actuation decision to deal with the alarm.

Finally, the control process PM receives messages from all room controllers and, if at least one of them sends an alarm, the bell is triggered.

PM = μh. (C; xcheckalarm1, xalarm1). · · · . (C′; xcheckalarmn, xalarmn).
     siren := or(xcheckalarm1, . . . , xcheckalarmn). ⟨siren, Bell, Ring⟩. h
AM = μh. (|Bell, {Ring}|). τ. h

Semantic rules for communication. The operational semantics of IoT-LySa is a transition system based on a two-level reduction relation → defined by a set of induction rules. We provide four semantic rules; the first two drive asynchronous IoT-LySa multi-communications, where we use tuples of size 2 for the sake of simplicity. The other two rules, (A-com1) and (A-com2), are used by control processes to communicate with actuators.

(Ev-out)
    v1 = [[E1]]_Σ ∧ v2 = [[E2]]_Σ
    ─────────────────────────────────────────────────────────
    Σ ∥ ⟨⟨E1, E2⟩⟩ ▷ L. P ∥ B → Σ ∥ ⟨⟨v1, v2⟩⟩ ▷ L. 0 ∥ P ∥ B

(Multi-com)
    ℓ2 ∈ L ∧ Comp(ℓ1, ℓ2) ∧ v1 = [[E1]]_Σ2
    ─────────────────────────────────────────────────────────
    ℓ1 : [⟨⟨v1, v2⟩⟩ ▷ L. 0 ∥ B1] | ℓ2 : [Σ2 ∥ (E1; x2^a2). Q ∥ B2] →
    ℓ1 : [⟨⟨v1, v2⟩⟩ ▷ L \ {ℓ2}. 0 ∥ B1] | ℓ2 : [Σ2{v2/x2} ∥ Q ∥ B2]

In the (Ev-out) rule, to send a message ⟨⟨v1, v2⟩⟩ obtained by evaluating E1, E2, a node with label ℓ generates a new process, running in parallel with the continuation P; this new process offers the evaluated tuple to all receivers with labels in L. In the (Multi-com) rule, the message coming from ℓ1 is received by a node labelled ℓ2, provided that: (i) ℓ2 belongs to the set L of possible receivers, (ii) the two nodes satisfy a compatibility predicate Comp (e.g. they are in the same transmission range), and (iii) the first j values (before ";") match the evaluations of the first j terms in the input. When this match succeeds, the variable x2 after the semicolon is assigned the corresponding value of the received tuple (updating the store Σ2 results in Σ2{v2/x2}). Moreover, the label ℓ2 is removed from the set of receivers L of the tuple. The generated process terminates when all the receivers have received the message (L = ∅). According to these rules, in our example, we deduce the following


transition, where P′C, P′T are the continuations after the communication prefixes in PC, PT, respectively (note that the local store ΣC is updated with the received values), and ⟨⟨E1, zt1, zt2, zt3, zt4⟩⟩ ▷ {ℓC} \ {ℓC} is structurally equivalent to 0.

ℓT : [ΣT ∥ ⟨⟨E1, zt1, zt2, zt3, zt4⟩⟩ ▷ {ℓC}. P′T ∥ T1 ∥ · · · ∥ T4 ∥ Athermostat ∥ BT]
| ℓC : [ΣC ∥ (E1; wt1, . . . , wt4). P′C ∥ Swater ∥ Apump ∥ BC]
→
ℓT : [ΣT ∥ ⟨⟨E1, zt1, zt2, zt3, zt4⟩⟩ ▷ {ℓC} \ {ℓC}. 0 ∥ P′T ∥ T1 ∥ · · · ∥ T4 ∥ Athermostat ∥ BT]
| ℓC : [ΣC{zt1/wt1, ..., zt4/wt4} ∥ P′C ∥ Swater ∥ Apump ∥ BC]

(A-com1)
    γ ∈ Γ ∧ [[E]]_Σ = true
    ─────────────────────────────────────────────────────────
    Σ ∥ ⟨E, j, γ(v)⟩. P ∥ (|j, Γ|). A ∥ B → Σ ∥ P ∥ γ(v). A ∥ B

(A-com2)
    γ ∈ Γ ∧ [[E]]_Σ = false
    ─────────────────────────────────────────────────────────
    Σ ∥ ⟨E, j, γ(v)⟩. P ∥ (|j, Γ|). A ∥ B → Σ ∥ P ∥ (|j, Γ|). A ∥ B

In rules (A-com1) and (A-com2), a process with prefix ⟨E, j, γ(v)⟩ commands the actuator j to perform the action γ, with parameter v, if γ is one of the actions the actuator is able to perform and the boolean expression E in the trigger command is true, according to the standard denotational interpretation [[E]]_Σ. In our example, if PumpReg is among the possible actions and [[checkwater]]_Σ is true, then we have

⟨checkwater, Pump, PumpReg(vpump)⟩. P′C ∥ (|Apump, {PumpReg(x)}|). τ → P′C ∥ PumpReg(vpump). τ

12.3.2 Control Flow Analysis

Control Flow Analysis is a static analysis technique, based on Flow Logic [8], that provides safe and computable approximations of the sets of values that programs may assume during their executions. Once a system of nodes has been specified, a designer can use our CFA to statically predict its behaviour. This analysis estimates the "shape" of the data that nodes may process and exchange across the network, in the form of composed abstract values. In particular, the CFA components approximate, for each datum, which sensors it can be derived from and which manipulations it may be subject to and where. Intuitively, these abstract values "symbolically" encode where data can be generated, the nodes that they can pass through, and how they can be processed. In addition, the analysis records the conditions that may trigger actuators. We intuitively introduce the CFA through our running example. Further details can be found in [1–3].


Our CFA over-approximates the behaviour of a system of nodes by statically computing "abstract" values that represent data sampled from sensors and resulting from aggregation, comparison and other functions. More precisely, abstract values are pairs (v̂, a) (concisely written as v̂^a), where the first component is an abstract value and the second records the annotation of the expression where the concrete value may be originated or computed by a function in the dynamic evolution. They are defined as follows:

v̂ ∈ V̂ ::= abstract terms
  (⊤, a)                     value denoting cut
  (i, a)                     sensor abstract value (i ∈ I)
  (v, a)                     value for clear data
  (f(v̂1, . . . , v̂n), a)     value for aggregated data

Since at run-time, nodes can generate composed terms with an arbitrarily nesting level, the distinguished abstract terms ( , a) represent those with a depth greater than a given threshold d.  , κ, , α) when analysing Formally, the result or estimate of our CFA is a tuple (  , ) when analysing terms. The components of an estimate networks and a pair ( are the following abstract domains:   ˆ  is the disjoint union of functions  ˆ  : X ∪ I → 2V  = ∈L  • abstract store  mapping a variable in X or a sensor in I to a  set of abstract values; k  i returns the set of the mesV • abstract network environment κ : L → L × i=1 sages that may be received by the node with address ;  • abstract data collection  : L → A → 2V yields the set of the abstract values that may be computed by each labelled term M a in a given node ;  • abstract decision components α : L →  → 2V returns the set of the abstract values that may affect the condition triggering an action γ ∈ . Tuples and pairs satisfy the judgements defined by the CFA rules. We just illustrate some of them, by using communication tuples of size 2, for simplicity.  , ) |= M a1 ∧ (  , ) |= M a2 ∧ (  , κ, , α) |= P ∧ ( 2 2 1 ∀vˆ1 , vˆ2 : i=1 vˆi ∈ ()(ai ) ⇒ ∀ ∈ L : (, vˆ1 , vˆ2 ) ∈ κ( )  , κ, , α) |= M a1 , M a2   L . P ( 1

(Σ̂, Θ) |=_ℓ M1^{a1} ∧
∀(ℓ′, v̂1, v̂2) ∈ κ(ℓ) : Comp(ℓ′, ℓ) ⇒ (v̂2 ∈ Σ̂_ℓ(x2) ∧ (Σ̂, κ, Θ, α) |=_ℓ P)
──────────────────────────────────────────────────────────────
(Σ̂, κ, Θ, α) |=_ℓ (M1^{a1}; x2^{a2}). P

Θ(ℓ)(a) ⊆ α(ℓ)(γ) ∧ (Σ̂, Θ) |=_ℓ M^a ∧ (Σ̂, κ, Θ, α) |=_ℓ P
──────────────────────────────────────────────────────────────
(Σ̂, κ, Θ, α) |=_ℓ ⟨⟨M^a, j, γ⟩⟩. P

An estimate is valid for multi-output if it is valid for the continuation P and the set of messages communicated by the node ℓ to each node ℓ′ in L includes all messages obtained by the evaluation of the tuple ⟨⟨M1^{a1}, M2^{a2}⟩⟩.


More precisely, the rule (i) finds the sets Θ(ℓ)(ai) for each term Mi^{ai}, and (ii) for all pairs of values (v̂1, v̂2) in Θ(ℓ)(a1) × Θ(ℓ)(a2) it checks whether they belong to κ(ℓ′) for each ℓ′ ∈ L.
Symmetrically, the rule for input considers the values inside messages. Those values that may be sent to the node ℓ and pass the pattern matching are included in the estimates of the variable x2. In more detail, the rule analyses each term Mi^{ai}, and requires that for any message that the node with label ℓ can receive, i.e. (ℓ′, v̂1, v̂2) in κ(ℓ), the abstract value v̂2 is included in the estimates of x2, provided that the two nodes are in the communication range (formally, Comp(ℓ′, ℓ)).
Moreover, the rule for actuator trigger predicts in the component α that a process at node ℓ may trigger the action γ for the actuator j, based on the abstract values included in the analysis of the term M^a.
Intuitively, the analysis “mimics” the evolution of the system, using abstract values in place of concrete ones and modelling the consequences of each possible action. Formally, our CFA is correct w.r.t. the dynamic semantics, in that its valid estimates are preserved under semantic reduction steps.
CFA predictions help to assess whether the specifications conform to the intended design. At development time, one can check whether the activation conditions are implemented correctly, by looking at the α component. The Θ and κ components allow us to verify whether the necessary computations are distributed as expected among the nodes, and whether the communications are designed as intended.
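To make the shape of these abstract values concrete, the following Python sketch (ours, not part of the cited works; all names are illustrative) encodes annotated abstract terms and the cut at depth d that introduces the distinguished terms (⊤, a):

```python
# A hedged sketch of annotated abstract values: terms are either sensor
# values, clear data, or function applications, each paired with the
# annotation a of the expression where they may originate.
from dataclasses import dataclass

@dataclass(frozen=True)
class Abs:
    term: object   # 'TOP', ('sensor', i), ('clear', v), or ('f', name, args)
    ann: str       # annotation a of the originating expression

def cut(v: Abs, d: int) -> Abs:
    """Replace sub-terms nested deeper than the threshold d with (TOP, a)."""
    if d == 0:
        return Abs('TOP', v.ann)
    if isinstance(v.term, tuple) and v.term[0] == 'f':
        _, name, args = v.term
        return Abs(('f', name, tuple(cut(x, d - 1) for x in args)), v.ann)
    return v       # sensor and clear values have depth 0

# e.g. the abstract value avg(1^t1, ..., 4^t4) originated at annotation l_avg:
temps = tuple(Abs(('sensor', i), f't{i}') for i in range(1, 5))
print(cut(Abs(('f', 'avg', temps), 'l_avg'), 1))   # arguments cut to TOP
```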

CFA at work on the running example. For the sake of presentation, we just consider the part of the analysis in Table 12.2, concerning the flow of data for temperature and fire detection. Any valid analysis estimate must satisfy the constraints given in the table. Intuitively, we discuss the entries going downwards from the left-hand column. The first constraint (1) says that the abstract values i^{ti} are assigned to the variables z_{ti} (whose labels are l_{zti} in (2)) of the local store Σ̂_{ET} of the node E1. Constraint (3) says that these values are sent to the node C.

Table 12.2 The estimate, where v̂fire is a shorthand for the abstract value
and(rng(1^{water}, twater, Twater), and(gt(avg(1^{t1}, ..., 4^{t4}), Tanomaly), or(dt(1^{f1}), ..., dt(4^{f4}))))

(1) Σ̂_{ET}(z_{ti}) ⊇ {i^{ti}}
(2) Θ(ET)(l_{zti}) ⊇ {i^{ti}}
(3) κ(C) ⊇ {(ET, E1, 1^{t1}, ..., 4^{t4})}
(4) Θ(C)(l_{wwater}) ⊇ {1^{water}}
(5) Θ(C)(l_{checkwater}) ⊇ {rng(1^{water}, twater, Twater)}
(6) Σ̂_C(vpump) ⊇ {reg(1^{water})}
(7) Θ(C)(l_{wti}) ⊇ {i^{ti}}
(8) Θ(C)(l_{checkTA}) ⊇ {gt(avg(1^{t1}, ..., 4^{t4}), Tanomaly)}
(9) Θ(C)(l_{checkFA}) ⊇ {or(dt(1^{f1}), ..., dt(4^{f4}))}
(10) Θ(C)(l_{checkalarm}) ⊇ {v̂fire}
(11) κ(M), κ(F) ⊇ {(C, C, v̂fire, alarm)}
(12) Θ(M)(l_{xcheckalarm1}) ⊇ {v̂fire}
(13) Θ(F)(l_{zcheckalarm}) ⊇ {v̂fire}
(14) α(C)(PumpReg(alarm)), α(F)(Gofs(alarm)) ⊇ {v̂fire}


We briefly illustrate why this constraint has to be part of a correct estimate, in that it statically approximates the run-time behaviour. Since by (2) we have that Θ(ET)(l_{zti}) ⊇ {i^{ti}}, we are predicting that the variables z_{ti} will assume the values represented by i^{ti}, fulfilling the condition (Σ̂, Θ) |=_{ET} z_{ti}. In turn, this condition, together with the constraint (3) and the validity for PF, permits to fulfil the following formula:

(Σ̂, κ, Θ, α) |=_{ET} ⟨⟨E1, z_{t1}, z_{t2}, z_{t3}, z_{t4}⟩⟩ ▹ {C}. PF

Similar to (2), constraint (4) concerns the abstract value 1^{water} of the water level. Constraints (5) and (6) predict instead that the system may apply the functions rng and reg to check whether the value of 1^{water} is in the expected range and to compute the needed regulation value, respectively. Similarly, constraints (8) and (9) record the possible applications of the functions gt and avg. Note that the abstract value gt(avg(1^{t1}, ..., 4^{t4}), Tanomaly) denotes the result of checking whether the average of the temperatures i^{ti} sampled by the sensors of the environmental nodes is greater than a given threshold Tanomaly. Similarly, the abstract value or(dt(1^{f1}), ..., dt(4^{f4})) denotes the check on the fire detection values coming from the fire detection sensors Fi. The Room controller C may receive the temperatures sensed by the environmental node E1, according to (7). In the following constraints, the shorthand v̂fire stands for the composite abstract value (in red in the pdf)

and(rng(1^{water}, twater, Twater), and(gt(avg(1^{t1}, ..., 4^{t4}), Tanomaly), or(dt(1^{f1}), ..., dt(4^{f4}))))

that is assigned to the relevant variable checkalarm of the Room Controller C (10), whose corresponding assignment in the specification is checkalarm := and(checkwater, and(checkTA, checkFA)). The value v̂fire is then sent to the environmental node E6 and the Building Monitor M (stored in the variables xcheckalarm1 and zcheckalarm, respectively), according to (11)–(13). Finally, according to (14), the abstract value v̂fire is included in the ones that may trigger the actions Gofs and PumpReg.
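To illustrate how such an estimate can be consumed mechanically, the following hedged sketch stores the Θ and κ components as Python dictionaries and checks two of the inclusion constraints of Table 12.2; the string encodings of nodes, labels and abstract values are hypothetical stand-ins for the formal objects:

```python
# Theta maps (node, label) to sets of abstract values; kappa maps a node to
# the set of messages it may receive. Encodings are illustrative only.
Theta = {('E_T', 'l_zt1'): {'1^t1'},
         ('C', 'l_checkTA'): {'gt(avg(1^t1,...,4^t4),T_anomaly)'}}
kappa = {'C': {('E_T', 'E_1', '1^t1', '2^t2', '3^t3', '4^t4')}}

def holds_theta(node, label, required):
    """A constraint of the form Theta(node)(label) includes `required`."""
    return required <= Theta.get((node, label), set())

def holds_kappa(node, msgs):
    """A constraint of the form kappa(node) includes `msgs`."""
    return msgs <= kappa.get(node, set())

assert holds_theta('E_T', 'l_zt1', {'1^t1'})       # constraint (2)
assert holds_kappa('C', {('E_T', 'E_1', '1^t1', '2^t2', '3^t3', '4^t4')})  # (3)
```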

12.4 Using the CFA Results for Analysing Critical Decisions

We present our methodology, based on the CFA introduced in Sect. 12.3.2, and exemplify it on the indoor monitoring system described in Sect. 12.2. It is a general technique that can be applied to many different contexts in which run-time choices depend on the values data can take on.
We focus on the critical decisions that monitoring systems make to physically protect the corresponding supervised environments. Decisions are usually based on the analysis of the data collected from sensors monitoring environmental parameters,


like imbalances in temperature, humidity and so on. These data are aggregated and compared to determine how to regulate actuators.
Control Flow Analysis supports us in reasoning about critical decisions by considering the abstract values associated with decision-related conditions, such as the triggering conditions used for actuators. By making explicit their basic constituents and the functions used to process them, abstract values show in fact what the evaluation of conditions depends on. In our running example, the abstract value v̂fire

and(rng(1^{water}, twater, Twater), and(gt(avg(1^{t1}, ..., 4^{t4}), Tanomaly), or(dt(1^{f1}), ..., dt(4^{f4}))))

related to the condition used to establish the fire alarm, is composed of data from the Swater sensor detecting the pump water level, the temperature sensors Ti and the fire detection sensors Fi. These values are processed by means of logical operators (and, or, gt, dt) and aggregation functions (rng, avg). By looking at the Θ component of the CFA, we also have information on where the single functions are applied, e.g. gt is applied to the pair (avg(1^{t1}, ..., 4^{t4}), Tanomaly) in node C (see (8) in Table 12.2).
We are interested here in the possible physical vulnerabilities of sensors, the compromise of which, whether intentional or accidental, may impact the reliability of the data on which critical decisions are made. Unreliable data can indeed lead to incorrect and possibly dangerous behaviour. In our running example, this kind of impairment may prevent the activation of the fire suppression in case of fire, or it may instead force it when there is no fire, and cause rooms to flood. We do not deal with data alteration due to attacks on communications among nodes, which can be addressed with the approach developed for LySa [9].

12.4.1 Taint Analysis

Taint analysis is used in information security to identify how untrustworthy inputs flow through a system and to track how they affect the rest of the data and the system’s behaviour. The corresponding technique consists of adding taint labels to the data coming from the unreliable sources and of monitoring the propagation of tainted data.
We adopt a taint analysis that we obtain by suitably labelling the abstract sensor data returned by the CFA. We assume that designers provide a set T of the tamperable sources (sensors and variables). This allows us to partition abstract sensor data (our sources) into tainted (those coming from the tamperable sensors, in blue in the pdf) and untainted (all the rest, in red in the pdf). Formally, we define the set of taint labels B (ranged over by b, b1, ...) whose elements (the colours in the pdf should help the intuition) are as follows:

♦ untainted        ◆ tainted


Table 12.3 The operator ⊗, version 1 (for taintedness, left) and version 2 (for frailty, right)

⊗  ♦  ◆        ⊗  ♦  ◆
♦  ♦  ◆        ♦  ♦  ◆
◆  ◆  ◆        ◆  ◆  ◆

Hence, we obtain enriched abstract values, called coloured abstract values, as pairs (v̂, b) ∈ Ṽ, with v̂ ∈ V̂ and b ∈ B. For simplicity, hereafter we write them as v̂^b, and indicate with ↓i the projection function on the i-th component of the pair. We naturally extend the projection to sets, i.e. Ṽ′↓i = {ṽ↓i | ṽ ∈ Ṽ′}, where Ṽ′ ⊆ Ṽ.
To propagate the taint information, we resort to the taint combination operator ⊗ : B × B → B, defined in the left-hand part of Table 12.3. This operator naturally extends to abstract values, b ⊗ v̂^{b′} = v̂^{b⊗b′}, and to sets of abstract values, b ⊗ Ṽ′ = {b ⊗ ṽ | ṽ ∈ Ṽ′ ⊆ Ṽ}. It implements the simplest form of propagation of taint information in functions, where a tainted operand renders the overall combination tainted.
Given the combination operator ⊗ : B × B → B, the taint propagation function Fτ, which returns the taint resulting from the application of a function to coloured abstract values, is defined as follows: Fτ(f, ṽ1, ..., ṽr) = ⊗(ṽ1↓2, ..., ṽr↓2).
The taint assignment function τ, given a set T of tamperable sensor labels, and the function Fτ extend abstract values into coloured abstract terms, as defined below:

τ(v̂) = v̂^◆   if v̂ ∈ Iℓ and ℓ ∈ T
τ(v̂) = v̂^♦   if v̂ ∈ Iℓ and ℓ ∉ T
τ(v̂) = f(τ(v̂1), ..., τ(v̂n), a)^{Fτ(f, τ(v̂1), ..., τ(v̂n))}   if v̂ = f(v̂1, ..., v̂n, a)

For simplicity and readability, in the following we do not include the component a in the abstract values.
Back to our running example, if we suppose that only the sensor T1 in the node E1 may be tampered with, i.e. t1 ∈ T, then the abstract value 1^{t1} is classified as tainted in its coloured version 1t1^◆, while the others are classified as untainted. As a consequence, we can apply τ to the function avg and obtain avg(1t1^◆, 2t2^♦, 3t3^♦, 4t4^♦)^◆, as a result of

τ(avg(1t1, 2t2, 3t3, 4t4)) = avg(τ(1t1), τ(2t2), τ(3t3), τ(4t4))^{Fτ(avg, τ(1t1), τ(2t2), τ(3t3), τ(4t4))}

where (τ(1t1), τ(2t2), τ(3t3), τ(4t4)) = (1t1^◆, 2t2^♦, 3t3^♦, 4t4^♦) and Fτ(avg, 1t1^◆, 2t2^♦, 3t3^♦, 4t4^♦) = ⊗(◆, ♦, ♦, ♦) = ◆.
Similarly, since Fτ(gt, avg(1t1^◆, 2t2^♦, 3t3^♦, 4t4^♦)^◆, Tanomaly^♦) = ⊗(◆, ♦) = ◆, we obtain gt(avg(1t1^◆, 2t2^♦, 3t3^♦, 4t4^♦)^◆, Tanomaly^♦)^◆.


As expected, propagation leads us to have the coloured version of the abstract value v̂fire as v̂fire^◆, which signals that the overall condition checkalarm can be affected by possibly tampered data coming from T1.
To mitigate the situation, we can intervene with targeted sanitisation to compensate for any compromises. For instance, if the data from one of the temperature sensors Ti may be compromised, we can compare the average of the values collected in the node C with a historical record of values and make the appropriate correction. This amounts to adopting a sanitisation function: a more robust version AVG of the average function compensates for the presence of one unreliable datum and changes the final taint label. Typically, when applied to the above tuple, such a more robust function results in the following coloured abstract value AVG(1t1^◆, 2t2^♦, 3t3^♦, 4t4^♦)^♦, where the propagation of taint to the other functions is stopped. For this purpose, it is necessary to add a further case to the taint propagation function: Fτ(f, ṽ1, ..., ṽr) = ♦ if f is a sanitisation function.
Robustness can be addressed in the same way. Once the less robust sensors have been identified, we label accordingly the corresponding abstract data as frail ◆ (in cyan in the pdf) and not frail ♦ (in black in the pdf), thus obtaining a different version of coloured abstract values. The combination operator is defined on the right-hand side of Table 12.3. The propagation and assignment functions work in the same way as the corresponding functions for tainted values. If we suppose that only the sensor T1 of our running example is frail, then the abstract value 1^{t1} is classified as frail in its coloured version 1t1^◆. As a consequence, the application of τ to the function avg results in avg(1t1^◆, 2t2^♦, 3t3^♦, 4t4^♦)^◆.
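A minimal Python sketch of this machinery follows, under an illustrative encoding of abstract values as nested tuples; the names UNTAINTED, TAINTED and SANITISERS are ours, and the extra case for sanitisation functions is the one just described:

```python
UNTAINTED, TAINTED = '♦', '◆'
SANITISERS = {'AVG'}      # robust variants that stop taint propagation

def otimes(*labels):
    """The combination operator: one tainted operand taints the result."""
    return TAINTED if TAINTED in labels else UNTAINTED

def tau(v, tamperable):
    """Colour an abstract value; v is ('sensor', name) or ('f', f, args)."""
    if v[0] == 'sensor':
        return (v, TAINTED if v[1] in tamperable else UNTAINTED)
    _, f, args = v
    coloured = [tau(x, tamperable) for x in args]
    b = UNTAINTED if f in SANITISERS else otimes(*(c[1] for c in coloured))
    return (('f', f, tuple(coloured)), b)

# Only T1 tamperable: avg(...) gets tainted, the sanitiser AVG stays clean.
sensors = tuple(('sensor', f't{i}') for i in range(1, 5))
print(tau(('f', 'avg', sensors), {'t1'})[1])   # '◆'
print(tau(('f', 'AVG', sensors), {'t1'})[1])   # '♦'
```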

Resilience Analysis à la Quality Calculus. Our methodology can be further pushed in order to consider resilience issues, by answering questions like “Given an aggregation function, how many values must be correct to give a reliable answer?”. This results in ensuring a certain level of reliability of actuations even in the presence of unreliable data. For instance, consider checking the temperature average used to trigger the fire suppression system and assume that the temperature sensors in the room are now eight. A possible detailed question here could be: “What if 1^{t1} and 6^{t6} give wrong results?” (expressed below as possibly tainted values)

avg(1t1^◆, 2t2^♦, 3t3^♦, 4t4^♦, 5t5^♦, 6t6^◆, 7t7^♦, 8t8^♦)

To handle these issues, we follow [10] and embed quality predicates inside our function abstract terms: &∃(ṽ1, ..., ṽr) and &∀(ṽ1, ..., ṽr), where &∃ means that it suffices to have at least one correct value, and &∀ means that all the values are necessary.


Also, quality predicates can be applied directly to the resulting abstract values. It is sufficient to extend the taint propagation function to quality predicates with two new cases:

Fτ(&∃, ṽ1, ..., ṽr) = ♦ if ∃i : ṽi↓2 = ♦, and ◆ otherwise
Fτ(&∀, ṽ1, ..., ṽr) = ◆ if ∃i : ṽi↓2 = ◆, and ♦ otherwise

Note that we do not need to modify the taint assignment function.
Predicates can be combined in more complex nested logical structures. For example, the following formula expresses that the reliable data are 1^{t1}, 2^{t2}, 5^{t5}, 6^{t6}, one between 3^{t3} and 4^{t4}, and one between 7^{t7} and 8^{t8}:

avg(&∀(ṽ1, ṽ2, &∃(ṽ3, ṽ4), ṽ5, ṽ6, &∃(ṽ7, ṽ8)))        (12.1)

According to this formula, we obtain an untainted label ♦ for the corresponding abstract value when, e.g., only 3^{t3} and 8^{t8} are tainted:

avg(&∀(1t1^♦, 2t2^♦, &∃(3t3^◆, 4t4^♦), 5t5^♦, 6t6^♦, &∃(7t7^♦, 8t8^◆)))^♦

We can adopt similar resilience logic constructs when addressing robustness. For example, if we assume that the third and the eighth temperature sensors are less robust than the others in (12.1), we obtain that the overall label is not frail ♦, and the corresponding result avg(&∀(1t1^♦, 2t2^♦, &∃(3t3^◆, 4t4^♦), 5t5^♦, 6t6^♦, &∃(7t7^♦, 8t8^◆)))^♦ is reliable.
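A sketch of the two quality predicates as cases of the propagation function, under the same illustrative encoding as before, is given below; the check reproduces the situation of formula (12.1) with only 3^{t3} and 8^{t8} tainted:

```python
UNTAINTED, TAINTED = '♦', '◆'

def f_tau(op, labels):
    if op == 'exists':    # &∃: at least one correct value suffices
        return UNTAINTED if UNTAINTED in labels else TAINTED
    if op == 'forall':    # &∀: all values are necessary
        return TAINTED if TAINTED in labels else UNTAINTED
    return TAINTED if TAINTED in labels else UNTAINTED   # plain functions

def label_of(term):
    """term is a leaf label or a pair (op, subterms)."""
    if isinstance(term, str):
        return term
    op, subs = term
    return f_tau(op, [label_of(s) for s in subs])

# The body &∀(v1, v2, &∃(v3, v4), v5, v6, &∃(v7, v8)) of formula (12.1),
# with the third and the eighth values tainted:
phi = ('forall', ['♦', '♦', ('exists', ['◆', '♦']),
                  '♦', '♦', ('exists', ['♦', '◆'])])
print(label_of(phi))   # '♦': the overall value is still reliable
```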

12.4.2 What if Reasoning

By choosing different abstract data partitions, we can analyse how different subsets of unreliable sensors impact critical decisions. Therefore, we answer “what if” questions, such as “What if a certain subset of sensors is compromised: is the condition of interest altered?”. In our framework, this amounts to asking “If a certain subset of sensor data is tainted, is the abstract value of interest tainted as well?”. The hierarchical structure of the abstract values corresponding to the condition of interest helps us assess which are the different subsets of data that, if altered, are sufficient to render the condition unreliable.
To this aim, we exploit trees that pictorially represent abstract values, and that emphasise the structural dependencies between these and their sub-terms. Their use recalls attack or fault trees.¹ In our trees, the root corresponds to the whole abstract value, the internal nodes correspond to the functions on simpler abstract values, and the leaves represent the data collected by sensors. As an example, consider the tree for the abstract value v̂fire given in Fig. 12.2.

¹ Note that in our trees the nodes represent the components of a system to attack and their dependencies, while attack trees represent the steps of an attack.


Fig. 12.2 Tree of the abstract value vˆfire

Following the approach of Hankin et al. [5], we identify the possible subsets of sensors whose compromise is sufficient to alter a critical decision, based on the trigger conditions associated with the abstract value v̂fire:

and(rng(1^{water}, twater, Twater), and(gt(avg(1^{t1}, ..., 4^{t4}), Tanomaly), or(dt(1^{f1}), ..., dt(4^{f4}))))

Triggering the fire alarm depends on the evaluation of the logical conjunction of the results of the two branches that we explain below. In particular, it depends on the evaluation of the function rng applied to the data coming from the sensor Swater, and on the result of the logical conjunction of the outputs of the two branches gt(avg(1^{t1}, ..., 4^{t4}), Tanomaly) and or(dt(1^{f1}), ..., dt(4^{f4})). The output of the second depends on two further branches. The first branch consists of the application of the functions avg ◦ gt to the data coming from the sensors Ti. The second one consists of the logical disjunction of the values, locally evaluated with the function dt, coming from the sensors Fi.
To maliciously force the trigger condition associated with v̂fire, it suffices to force a single and branch to false. These are the possible ways of tampering with sensors:

1. tampering with the sensor Swater that measures the water level in the node C, in order to force a false on the left branch; or
2. altering the result of the second and, by tampering with
   a. enough temperature sensors among T1, ..., T4 in order to force the temperature average avg(1^{t1}, ..., 4^{t4}) not to be greater than Tanomaly; or


   b. the fire detector sensors F1, ..., F4 in order to alter the values dt(i^{fi}), so making false the result of the or disjunction.

Each one of the above alternatives is enough to impair the evaluation of the condition. For instance, in case (2a), the ◆ label of 1^{t1} propagates to avg(1t1^◆, 2t2^♦, 3t3^♦, 4t4^♦)^◆ and, in turn, to gt(avg(1t1^◆, 2t2^♦, 3t3^♦, 4t4^♦)^◆, Tanomaly)^◆, up to the overall v̂fire.
Similarly, the evaluation of the condition can be impaired if sensed values are unreliable because of a local failure and not an attack. For instance, the local failure of the first temperature sensor T1 may alter the temperature average avg(1^{t1}, ..., 4^{t4}). In our framework, this amounts to having a ◆ label for the abstract sensor value 1^{t1}, indicating that T1 is frail and its data are unreliable.
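The enumeration of the subsets discussed above can be sketched directly on the and/or tree; in this hedged rendering, the temperature branch is collapsed into a single node standing for “enough temperature sensors”, since how many are needed depends on the actual sampled values:

```python
def attack_sets(cond):
    """Sets of leaves sufficient to force the condition to false: for an
    and-node one falsified branch suffices; for an or-node every branch
    must be falsified; a leaf requires tampering with its own sensor."""
    op, subs = cond
    if op == 'leaf':
        return [{subs}]
    if op == 'and':
        return [s for child in subs for s in attack_sets(child)]
    acc = [set()]                      # 'or': falsify all branches
    for child in subs:
        acc = [a | s for a in acc for s in attack_sets(child)]
    return acc

v_fire = ('and', [('leaf', 'Swater'),
                  ('and', [('leaf', 'T-average'),   # "enough of" T1..T4
                           ('or', [('leaf', f'F{i}') for i in range(1, 5)])])])
for s in attack_sets(v_fire):
    print(sorted(s))
# ['Swater'], ['T-average'], ['F1', 'F2', 'F3', 'F4']: cases (1), (2a), (2b)
```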

12.4.3 Estimation of Risks

We now enrich our abstract values with an additional component, in order to guide designers’ choices on how to reduce the risk of compromising certain critical decisions. As in [5], we provide a numerical score that measures the effort required by an attacker to compromise a given sensor, to be associated with its abstract sensor data. This metric is exploited to determine the subsets of sensors that the attacker can most easily compromise and for which it is worth taking security countermeasures.
Formally, we enrich the taint labels of sensor data with a component for numerical scores: (b, n) with b ∈ B, n ∈ ℕ. We adjust the propagation policy accordingly. Formally, the new values, called coloured abstract values with scores, are triples

(v̂, b, n) ∈ Ṽ, with v̂ ∈ V̂, b ∈ B and n ∈ ℕ, where n has a numerical value different from 0 only in case of possibly tainted data. The extended propagation function computes the overall score as well, and it is defined as follows:

Fτ(f, ṽ1, ..., ṽr) = (⊗(ṽ1↓2, ..., ṽr↓2), ṽ1↓3 + ... + ṽr↓3)

while the extended taint assignment function is

τ(v̂) = v̂^{◆,n}   if v̂ ∈ Iℓ, ℓ ∈ T and φS(v̂) = n
τ(v̂) = v̂^{♦,0}   if v̂ ∈ Iℓ and ℓ ∉ T
τ(v̂) = f(τ(v̂1), ..., τ(v̂n), a)^{b,m}   if v̂ = f(v̂1, ..., v̂n, a)

where (b, m) = Fτ(f, τ(v̂1), ..., τ(v̂n)).
In our example, suppose that tampering with temperature sensors is easier than tampering with fire detectors which, in turn, requires less effort than attacking the pump water level sensor. This may be because the latter sensor is usually better protected and more difficult to reach. In this case, the security scores φS of abstract sensor data are as follows:

φS(1^{water}) = 3        φS(i^{ti}) = 0.1        φS(i^{fi}) = 0.3


As a result, attacking either the temperature sensors (case (i)) or the fire detectors (case (ii)) in the environmental sensor nodes is more advantageous than attacking the water level sensor of the Room controller (case (iii)). In case (i), the overall score amounts to a minimum of 0.1 and to a maximum of 0.4, depending on how many temperature sensors are tampered with. If only the first one is tampered with, the propagation function gives:

Fτ(avg, (1t1^{◆,0.1}, 2t2^{♦,0}, 3t3^{♦,0}, 4t4^{♦,0})) = (⊗(◆, ♦, ♦, ♦), 0.1 + 0 + 0 + 0) = (◆, 0.1)

In case (ii), in the worst case, i.e. when the attacker must tamper with all the fire detectors, the overall cost is 1.2:

Fτ(or, (dt(1^{f1})^{◆,0.3}, ..., dt(4^{f4})^{◆,0.3})) = (◆, 1.2)

In both cases, the cost is less than that of tampering with Swater (case (iii)), which is 3:

Fτ(rng, (1water^{◆,3}, twater^{♦,0}, Twater^{♦,0})) = (◆, 3)
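A small sketch of the scored propagation, with the φS values of the running example (the encoding of leaves as (taint, score) pairs is illustrative), follows:

```python
UNTAINTED, TAINTED = '♦', '◆'
phi_S = {'water': 3.0, 't1': 0.1, 't2': 0.1, 't3': 0.1, 't4': 0.1,
         'f1': 0.3, 'f2': 0.3, 'f3': 0.3, 'f4': 0.3}

def colour(sensor, tampered):
    """Leaf label (b, n): a non-zero score only for tampered sensors."""
    return (TAINTED, phi_S[sensor]) if sensor in tampered else (UNTAINTED, 0.0)

def f_tau(labelled):
    """Combine (taint, score) pairs: any taint wins, scores add up."""
    return (TAINTED if any(b == TAINTED for b, _ in labelled) else UNTAINTED,
            sum(n for _, n in labelled))

print(f_tau([colour(s, {'t1'}) for s in ('t1', 't2', 't3', 't4')]))
# ('◆', 0.1): case (i) with only T1 tampered
print(f_tau([colour(s, {'f1', 'f2', 'f3', 'f4'})
             for s in ('f1', 'f2', 'f3', 'f4')]))
# ('◆', 1.2): worst case (ii), up to floating-point rounding
```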

An alternative attack consists of activating the fire suppression system in the absence of fire. To do this, the attacker simply forces a single fire detector to signal the presence of flames and tampers with the temperature sensors to raise the average temperature above the threshold.
Similarly, we can address robustness issues, associating scores to the reliability of the sensors, e.g. due to battery problems. In this case, our framework should help designers to determine which of them should be strengthened. Back to our running example, assume the following robustness scores φR:

φR(1^{water}) = 7        φR(i^{ti}) = 1        φR(i^{fi}) = 1.5

according to which the least robust sensors are the temperature ones, followed by the fire detectors, while the water level sensor is more robust. These scores indicate that the environmental sensors can suffer from battery issues, while the Room controller sensor has a better charge. As a consequence, empowering the environmental sensors is a more cautious choice than empowering the water level sensor in C.

12.5 Concluding Remarks

IoT systems can make wrong and possibly dangerous decisions when the data used to decide are altered and unreliable. This is especially risky for conditions that trigger actuations in the presence of critical anomalies in the systems’ behaviour. We addressed this issue by presenting a semantic-based methodology that focusses on the way physical components of the systems affect decisions. In fact, some sensors (and thus their data) can be compromised, due to an accidental failure or to an intentional malicious attack.


We specified IoT systems with the IoT-LySa calculus. Specifications are subjected to a Control Flow Analysis that predicts the behaviour of systems and, in particular, the usage and the flow of data that are gathered by sensors and propagated into the network as raw data or processed through suitable aggregation and logical functions. We obtain abstract values with a CFA and then analyse them to study various aspects of the system to be designed/used. In particular, our CFA provides abstract values corresponding to critical conditions that “symbolically” encode the logical structure and the supply chain of data used for decisions. Abstract values give in fact information on the data, the nodes traversed and the functions used to aggregate and combine them. The obtained CFA results are used to investigate security and robustness issues by analysing the abstract values used for critical decisions. Control Flow Analysis is performed once and for all: its results need not be re-computed to support the different investigations presented.
We studied the impact of compromising subsets of sensors, with an ex post taint analysis performed downstream of the CFA, where different data partitions and propagation policies can be chosen. The hierarchical structure of abstract values helped us determine which subsets are sufficient to alter the conditions of interest and to partition abstract data accordingly. The presence of tainted data can contaminate the functions where they are used or can be tolerated within certain limits, depending on the chosen propagation policy. Finally, we associated a numerical score with sensor data to guide designers in identifying which sensors are worth upgrading in order to keep costs down. From a security point of view, scores represent the different efforts required by an attacker to tamper with sensors, and allow us to compare the costs of attacking different subsets of sensors. From a robustness point of view, scores represent the different levels of frailty of sensors and allow us to determine which subsets are the most frail.
The present paper extends the results of [3, 4], by offering a unifying framework for reasoning on critical decisions in IoT systems, in terms of different subsets of possibly compromised sensors and of different propagation policies. The adopted taint analysis is directly applied to the abstract values resulting from the CFA. In [11], two of the authors presented a static taint analysis of IoT-LySa systems with a similar treatment of data partitions and propagation policies; there, abstract values include taint labels from the very beginning, propagated in all analysis results, while our current analysis is performed ex post, downstream of the CFA.
Other formal and language-based approaches have been introduced for modeling and analyzing IoT or cyber-physical systems (CPS), such as [12–14]. In particular, the latter two propose a semantic theory based on bisimulation, which is well-suited for compositional systems. Works close to ours can also be found in Lanotte et al. [15, 16], where the authors introduce hybrid process calculi to model both cyber-physical systems and potential attacks to physical devices; in [17], where a probabilistic version of these models is presented; and in [18], where a weak bisimulation metric is used to estimate the impact of attacking sensors of IoT systems.


These proposals heavily rely on the formal dynamic semantics of the introduced calculi. Our approach differs from those mentioned above in that it provides a design framework that includes static semantics to support verification techniques and tools for checking properties of IoT applications. Moreover, Akella et al. [19] use a process algebraic approach to model the interactions between the cyber and physical components of a system; information flow properties are then verified using symbolic model checking.
The idea of offering quantifiable measurements of attacks by taking into account dependencies between components is the basis of a recent line of research on the security of CPSs (see, e.g., [6, 20, 21]). Specifically, in [20], Hankin proposes a game theory-based approach to support decision-making against attacks on industrial control systems. Nicolaou et al. introduce instead, in [6], a methodology to estimate the number of cyber-physical elements that must be attacked in order to compromise an entire target system. In [21], the complex dependencies among CPS components, in the form of logical combinations of AND/OR connectives, are captured by a suitable model; in addition, the authors propose a minimal weighted vertex cut in AND/OR graphs and a security metric that measures the efforts required to successfully attack individual components. While the papers discussed above focus on the physical interdependencies between components, to assess the vulnerability of conditions related to critical decisions, we concentrate on the functional interdependencies between data. A further proposal for the identification of critical cyber-physical components that also considers functional dependencies is in Deng et al. [22], who analyse vulnerabilities in critical infrastructure systems without a fixed topological structure.

References

1. Bodei, C., Degano, P., Ferrari, G.L., Galletta, L.: Tracing where IoT data are collected and aggregated. Log. Methods Comput. Sci. 13(3) (2017)
2. Bodei, C., Galletta, L.: Analysing the provenance of IoT data. In: Mori, P., Furnell, S., Camp, O. (eds.) Information Systems Security and Privacy - ICISSP 2019, Revised Selected Papers. Communications in Computer and Information Science, vol. 1221, pp. 358–381 (2019)
3. Bodei, C., Degano, P., Ferrari, G.-L., Galletta, L.: Security metrics at work on the things in IoT systems. In: From Lambda Calculus to Cybersecurity Through Program Analysis. LNCS, vol. 12065, pp. 233–255. Springer (2020)
4. Bodei, C., Degano, P., Ferrari, G.L., Galletta, L.: Modelling and analysing IoT systems. J. Parallel Distrib. Comput. 157, 233–242 (2021). https://doi.org/10.1016/j.jpdc.2021.07.004
5. Barrère, M., Hankin, C., Nicolaou, N., Eliades, D.G., Parisini, T.: Identifying security-critical cyber-physical components in industrial control systems. CoRR abs/1905.04796. http://arxiv.org/abs/1905.04796
6. Nicolaou, N., Eliades, D.G., Panayiotou, C.G., Polycarpou, M.M.: Reducing vulnerability to cyber-physical attacks in water distribution networks. In: International Workshop on Cyber-physical Systems for Smart Water Networks, CySWater@CPSWeek 2018, pp. 16–19. IEEE Computer Society (2018)
7. Bodei, C., Degano, P., Ferrari, G.-L., Galletta, L.: Where do your IoT ingredients come from? In: Proceedings of Coordination 2016. LNCS, vol. 9686, pp. 35–50. Springer (2016)


8. Nielson, H.R., Nielson, F.: Flow logic: a multi-paradigmatic approach to static analysis. In: The Essence of Computation, Complexity, Analysis, Transformation. LNCS, vol. 2566, pp. 223–244. Springer (2002)
9. Bodei, C., Buchholtz, M., Degano, P., Nielson, F., Nielson, H.R.: Static validation of security protocols. J. Comput. Secur. 13(3), 347–390 (2005)
10. Nielson, H.R., Nielson, F., Vigo, R.: A calculus of quality for robustness against unreliable communication. J. Log. Algebr. Meth. Program. 84(5), 611–639 (2015)
11. Bodei, C., Galletta, L.: Tracking sensitive and untrustworthy data in IoT. In: Proceedings of the First Italian Conference on Cybersecurity (ITASEC 2017), CEUR Vol-1816, pp. 38–52 (2017)
12. Lanese, I., Bedogni, L., Felice, M.D.: Internet of Things: a process calculus approach. In: Proceedings of the 28th Annual ACM Symposium on Applied Computing, SAC ’13, pp. 1339–1346. ACM (2013)
13. Lanotte, R., Merro, M.: A semantic theory of the Internet of Things. In: Proceedings of Coordination 2016. LNCS, vol. 9686, pp. 157–174. Springer (2016)
14. Lanotte, R., Merro, M.: A semantic theory of the Internet of Things. Inf. Comput. 259(1), 72–101 (2018)
15. Lanotte, R., Merro, M., Munteanu, A., Viganò, L.: A formal approach to physics-based attacks in cyber-physical systems. ACM Trans. Priv. Secur. 23(1), 3:1–3:41 (2020)
16. Lanotte, R., Merro, M., Tini, S.: A probabilistic calculus of cyber-physical systems. Inf. Comput., 104618
17. Lanotte, R., Merro, M., Munteanu, A., Tini, S.: Formal impact metrics for cyber-physical attacks. In: 34th IEEE Computer Security Foundations Symposium, CSF 2021, pp. 1–16. IEEE (2021)
18. Lanotte, R., Merro, M., Tini, S.: Towards a formal notion of impact metric for cyber-physical attacks. In: Furia, C.A., Winter, K. (eds.) Integrated Formal Methods - IFM 2018. LNCS, vol. 11023, pp. 296–315. Springer (2018)
19. Akella, R., Tang, H., McMillin, B.M.: Analysis of information flow security in cyber-physical systems. Int. J. Crit. Infrastruct. Prot. 3(3), 157–173 (2010)
20. Hankin, C.: Game theory and industrial control systems. In: Probst, C.W., Hankin, C., Hansen, R.R. (eds.) Semantics, Logics, and Calculi - Essays Dedicated to Hanne Riis Nielson and Flemming Nielson on the Occasion of Their 60th Birthdays. LNCS, vol. 9560, pp. 178–190. Springer (2016)
21. Barrère, M., Hankin, C., Nicolaou, N., Eliades, D.G., Parisini, T.: Measuring cyber-physical security in industrial control systems via minimum-effort attack strategies. J. Inf. Secur. Appl. 52, 102471 (2020)
22. Deng, Y., Song, L., Zhou, Z., Liu, P.: Complexity and vulnerability analysis of critical infrastructures: a methodological approach. Mathematical Problems in Engineering (2017)

Chapter 13

Verification of Reaction Systems Processes

Linda Brodo, Roberto Bruni, and Moreno Falaschi

Abstract Reaction Systems (RSs) are a computational framework inspired by biological systems. A RS is formed by a set of entities together with a set of reactions over them. Entities can enable or inhibit each reaction and are produced by reactions. The interaction of a RS with the environment can be modelled by means of an external context sequence. RSs can model several computer science frameworks as well as biological systems. However, the basic computational mechanism of RSs abstracts from several features of biochemical systems, which somewhat reduces their expressivity. In some previous works, we have defined semantics of RSs based on process algebras and SOS rules. This allowed us to introduce a flexible framework for extending the basic features of RSs. We implemented our framework in Prolog, as our declarative interpreter makes it easier to implement the modifications necessary to accommodate new extensions. In this paper, we give an overview of the basic framework, of some extensions, and of a methodology for verifying properties of (extended) RSs, and we briefly present our implementation. We discuss the open problems of verification of RSs and the corresponding challenges, with the final aim of realizing an expressive integrated framework easily accessible to computer scientists and biologists.

Keywords Bioinformatics · SOS rules · Reaction systems · Logic programming

L. Brodo Università di Sassari, Sassari, Italy e-mail: [email protected] R. Bruni Università di Pisa, Pisa, Italy e-mail: [email protected] M. Falaschi (B) Università di Siena, Siena, Italy e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 V. Arceri et al. (eds.), Challenges of Software Verification, Intelligent Systems Reference Library 238, https://doi.org/10.1007/978-981-19-9601-6_13


13.1 Introduction

This paper surveys our recent work in the area of verification techniques for computational biology. In particular, we focus on a theoretical and practical infrastructure for Reaction Systems (RSs) [9], a computational framework inspired by systems of living cells. Their constituents are a finite set of entities and a finite set of reactions that can be enabled or inhibited by entities and produce entities. Given a discrete sequence of entities provided by the context at each step, the dynamics of RSs is uniquely determined and can be represented as a finite, deterministic and unlabelled sequence of steps from one state to the next. RSs are a quite general computational model whose application ranges from the modelling of biological phenomena [1, 3, 4, 17] and molecular chemistry [32] to theoretical foundations of computing [21, 22].
When a biological system is modelled as a RS, in silico experiments are conducted by synthesizing a specific context sequence to represent the external stimuli and then observing the resulting state sequence. This makes it difficult to collect the outcomes of different experiments within a single semantic object, where the consequences of some variation in the context sequence would be easier to compare and analyze. Moreover, RSs abstract away from several features of biochemical systems, which reduces somewhat their expressivity.
Our approach relies on reformulating the theory of RSs within the so-called Structural Operational Semantics (SOS) approach [35] that has been particularly successful in the area of process algebras [27, 30, 34]. Using SOS rules, we equip RSs with a Labelled Transition System (LTS) semantics that allows us to model their behaviour as interacting processes. We will recall the semantics of RSs based on process algebras and SOS rules defined in [12]. The SOS approach has several advantages, for example, compositionality and modularity, but relies on very detailed and informative labels whose induced behavioural equivalences are too fine-grained and would distinguish too much. A first contribution is to exploit a language of assertions over labels that allows focusing on some entities and disregarding others, making the definition of behavioural and logical equivalences parametric w.r.t. such assertions. In [12], we have proven the correspondence between a coinductive definition in terms of bisimilarity and its logical counterpart à la Hennessy–Milner. The SOS approach also allowed us to introduce a flexible framework for extending the basic features of RSs. Within our framework, we can express the logical properties of our models and verify them by a notion of bio-simulation which is suitable for RSs. We will also recall several extensions which improve the expressivity of RSs [15]. Then, we will discuss some open problems of verification of RSs. Our final aim is to realize an expressive integrated framework easily accessible to computer scientists and biologists. We have implemented a prototype interpreter of our semantic framework and its extensions in SWI-Prolog.
This paper is dedicated to Patrick Cousot, whom Falaschi met in Padova in 1993 at the third international Symposium on Static Analysis. We are all indebted to him for having invented the framework of Abstract Interpretation [18–20], which allows us to develop formal static analyses and approximated verification for all programming languages.


He is not only a great computer scientist, but also a wonderful and kind person who has positively and deeply influenced generations of researchers, showing that Abstract Interpretation can be integrated into many apparently very different research areas.
Structure of the paper. In Sect. 13.2, we recall the basics of RSs. In Sect. 13.3, we detail our process algebra for RSs. Section 13.4 shows the correspondence between assertion-driven coinductive equivalences and their logical counterpart. Some extensions of RSs that build on our theory are summarized in Sect. 13.5. A prototype implementation in Prolog of our semantic framework is briefly described in Sect. 13.6. Section 13.7 discusses the relation to other works. Concluding remarks and challenges for future work are illustrated in Sect. 13.8.

13.2 Reaction Systems

The theory of Reaction Systems (RSs) [9] was born in the field of Natural Computing to model the behaviour of biochemical reactions in living cells. The two key mechanisms that regulate the functioning of a living cell are facilitation and inhibition: they are based on the presence/absence of entities. Moreover, the theory of RSs is based on three assumptions: no permanency, any entity vanishes unless it is sustained by a reaction (a living cell would die for lack of energy, without chemical reactions); no counting, the quantity of entities that are present in a cell is not taken into account, to keep RSs theory very abstract and qualitative; threshold nature of resources, any entity is either available to sustain all reactions, or it is not available at all.

Definition 1 (Reaction) Let S be a (finite) set of entities. A reaction in S is a triple a = (R, I, P), where R, I, P ⊆ S are finite, non-empty sets and the two sets R and I are disjoint, written I # R. The sets R, I, P are the sets of reactants, inhibitors, and products, respectively.

All reactants are needed for the reaction to take place. Any inhibitor blocks the reaction. Products are the outcome of the reaction. Since R and I are not empty, all products are produced from at least one reactant and every reaction can be inhibited. Sometimes artificial entities can be introduced to guarantee that R and I are never empty. We let rac(S) be the set of all reactions in S.

Definition 2 (Reaction System) A Reaction System (RS) is a pair A = (S, A) s.t. S is a finite set, and A ⊆ rac(S) is a finite set of reactions in S.

The passage of state according to the currently available entities W and the reactions in a RS is defined as the reaction result.

Definition 3 (Reaction Result) Given a (finite) set of entities S, and a subset W ⊆ S, we define the following, where a = (R, I, P) ∈ rac(S) is a reaction in S and A ⊆ rac(S) is a finite set of reactions.


1. The enabling predicate en_a(W) is defined by letting en_a(W) ≜ R ⊆ W ∧ I # W.
2. The result of a on W, written res_a(W), and the result of A on W, written res_A(W), are defined by:

res_a(W) ≜ P if en_a(W), and res_a(W) ≜ ∅ otherwise        res_A(W) ≜ ⋃_{a∈A} res_a(W)

Living cells are seen as open systems that react with the external environment. The behaviour of a RS is formalized in terms of interactive processes.

Definition 4 (Interactive Process) Let A = (S, A) be a RS and let n ≥ 0. An n-steps interactive process in A is a pair π = (γ, δ) s.t. γ = {Ci}i∈[0,n] is the context sequence and δ = {Di}i∈[0,n] is the result sequence, where Ci, Di ⊆ S for any i ∈ [0, n], D0 = ∅, and Di+1 = res_A(Di ∪ Ci) for any i ∈ [0, n − 1]. We call τ = W0, ..., Wn, with Wi ≜ Ci ∪ Di for any i ∈ [0, n], the state sequence.

The context sequence γ represents the environment. The result sequence δ is entirely determined by γ and A. Each state Wi in τ is the union of two sets: the context Ci at step i and the result set Di = res_A(Wi−1) from the previous step.
Given a context sequence γ, we denote by γ^k the shift of γ starting at the k-th step. The shift notation will come in handy to draw a tight correspondence between the classic semantics of RSs and their SOS version.

Definition 5 (Sequence shift) Let γ = {Ci}i∈[0,n] be a context sequence. Given a positive integer k ≤ n, we let γ^k = {Ci+k}i∈[0,n−k].

We conclude this section with a simple example of RS.

Example 1 Let A1 = (S, A) be a toy RS, where S = {a, b, c} contains three entities, and A = {a} contains the reaction a = ({a, b}, {c}, {b}), written as (ab, c, b). Then, a 4-steps interactive process π = (γ, δ), with γ = {C0, C1, C2, C3}, where C0 = {a, b}, C1 = {a}, C2 = {c}, and C3 = {c}; and δ = {D0, D1, D2, D3}, where D0 = ∅, D1 = {b}, D2 = {b}, and D3 = ∅, results in the state sequence τ = W0, W1, W2, W3 = {a, b}, {a, b}, {b, c}, {c}. It is easy to check that, e.g., W0 = C0, D1 = res_A(W0) = res_A({a, b}) = {b} because en_a(W0) holds, and W1 = C1 ∪ D1 = {a} ∪ {b} = {a, b}.

A graphical representation of the evolution in Example 1 is given in Fig. 13.1: the yellow rectangles hold the reactions; the white ovals hold the evolution of the context; and the blue circles hold the available entities. The arrows show the applied reactions.


Fig. 13.1 Graphical representation of the evolution of the Example 1
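Although the prototype interpreter described later in the paper is written in Prolog, a direct Python sketch of Definitions 1–4 may help fix the intuition; function and variable names are ours:

```python
def enabled(reaction, W):
    R, I, P = reaction
    return R <= W and not (I & W)       # en_a(W): R ⊆ W and I # W

def res(reactions, W):
    """res_A(W): union of the products of all enabled reactions."""
    out = set()
    for r in reactions:
        if enabled(r, W):
            out |= r[2]
    return out

def state_sequence(reactions, contexts):
    """W_i = C_i ∪ D_i with D_0 = ∅ and D_{i+1} = res_A(W_i)."""
    D, states = set(), []
    for C in contexts:
        W = C | D
        states.append(W)
        D = res(reactions, W)
    return states

# Example 1: A = {(ab, c, b)} and gamma = {a,b}, {a}, {c}, {c}.
A = [(frozenset('ab'), frozenset('c'), frozenset('b'))]
gamma = [set('ab'), set('a'), set('c'), set('c')]
print([''.join(sorted(W)) for W in state_sequence(A, gamma)])
# ['ab', 'ab', 'bc', 'c'], the state sequence of Example 1
```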

13.3 SOS Rules for Reaction Systems

Inspired by classic process algebras, such as CCS [30], we introduce a syntax for RSs that resembles their original presentation and then equip each operator with some SOS inference rules defining its operational semantics. The result is an LTS semantics: the states are terms, each transition corresponds to a step of the RS, and transition labels retain some information needed for compositionality.

Definition 6 (RS processes) Let S be a set of entities. An RS process P is defined by the following grammar:

P ::= [M]
M ::= (R, I, P) | D | K | M|M
K ::= 0 | X | C.K | K + K | rec X. K

where R, I, P ⊆ S are non-empty sets of entities such that I # R; C, D ⊆ S are possibly empty sets of entities; and X is a context process variable.

An RS process P embeds a mixture process M obtained as the parallel composition of some reactions (R, I, P), some set of currently present entities D (possibly the empty set ∅), and some context process K. We write ∏_{i∈I} Mi for the parallel composition of all Mi with i ∈ I. For example, ∏_{i∈{1,2}} Mi = M1 | M2.
A process context K is possibly nondeterministic and recursive: 0 stops the computation; the prefixed context C.K makes entities in C immediately available to the reactions, and K is the context offered at the next step; with the nondeterministic choice K1 + K2 the context can behave either as K1 or as K2; X is a process variable, and rec X. K is the usual recursive operator of process algebras.
We say that P and P′ are structurally equivalent, written P ≡ P′, when they denote the same term up to the laws of commutative monoids for parallel composition ·|·, with ∅ as the unit, and the laws of idempotent and commutative monoids for choice ·+·, with 0 as the unit. We also assume D1|D2 ≡ D1 ∪ D2 for any D1, D2 ⊆ S.

Definition 7 (RSs as RS processes) Let A = (S, A) be a RS, and π = (γ, δ) an n-step interactive process in A, with γ = {Ci}i∈[0,n] and δ = {Di}i∈[0,n]. For any step i ∈ [0, n], the corresponding RS process ⟦A, π⟧_i is defined as follows:


⟦A, π⟧_i ≜ [ (∏_{a∈A} a) | Di | K_{γ^i} ]

where the context process K_{γ^i} ≜ Ci.Ci+1. ··· .Cn.0 is the sequentialization of the entities offered by γ^i. We write ⟦A, π⟧ as a shorthand for ⟦A, π⟧_0.

Example 2 The encoding of the interactive process in Example 1 is:

P = ⟦A1, π⟧ = ⟦({a, b, c}, {(ab, c, b)}), π⟧ = [(ab, c, b) | ∅ | Kγ] ≡ [(ab, c, b) | Kγ]

where Kγ = {a, b}.{a}.{c}.{c}.0, written more concisely as ab.a.c.c.0. Note that ∅ is inessential and can be discarded thanks to structural congruence.

As shown in Example 3, our syntax allows for more general kinds of contexts than the sequential one in Definition 7. Nondeterministic contexts can be used to collect several experiments, while recursion can be exploited to extract some regularity in the long-term behaviour of a Reaction System.

Example 3 Let us consider our running example. Suppose we want a context that nondeterministically can behave as K1 or as K2, where K1 = ab.a.c.c.0 (as in Example 2), and K2 = rec X. ab.a.X (which recursively allows the reaction to be always enabled); then our context is K = K1 + K2, and we simply define P ≡ [(ab, c, b) | K].

Definition 8 (Label) A label is a tuple ⟨W ▷ R, I, P⟩ with W, R, I, P ⊆ S such that I and W ∪ R are disjoint, written I #(W ∪ R) for short. The set of transition labels is ranged over by υ.

In a transition label ⟨W ▷ R, I, P⟩, we record the set W of entities currently in the system (produced in the previous step or provided by the context), the set R of entities whose presence is assumed (either as reactants or as inhibitors); the set I of entities whose absence is assumed (either as inhibitors or as reactants); the set P of products of all the applied reactions.
As a convenient notation, we write υ1 ⊗ υ2 for the component-wise union of labels. We also define a noninterference predicate over labels, written υ1 ⌣ υ2, that will be used to guarantee that there is no conflict between reactants and inhibitors of the reactions that take place in two separate parts of the system. Formally, we let:

⟨W1 ▷ R1, I1, P1⟩ ⊗ ⟨W2 ▷ R2, I2, P2⟩ ≜ ⟨W1 ∪ W2 ▷ R1 ∪ R2, I1 ∪ I2, P1 ∪ P2⟩
⟨W1 ▷ R1, I1, P1⟩ ⌣ ⟨W2 ▷ R2, I2, P2⟩ ≜ (I1 ∪ I2) # (W1 ∪ W2 ∪ R1 ∪ R2)

Definition 9 (Operational semantics) The operational semantics of processes is defined by the set of SOS inference rules in Fig. 13.2.


Fig. 13.2 SOS semantics of the reaction system processes

The rule (Ent) makes available the entities in the (possibly empty) set D, then reduces to ∅. As a special instance of (Ent), we derive the transition ∅ —⟨∅▷∅,∅,∅⟩→ ∅. The rule (Cxt) says that a prefixed context process C.K makes available the entities in the set C and then reduces to K; the context K = 0 has no transition. The rule (Rec) is the classical rule for recursion; for example, rec X. (a.b.X) —⟨a▷∅,∅,∅⟩→ b.rec X. (a.b.X). The rules (Suml) and (Sumr) select a move of either the left or the right component, resp., discarding the other process. The rule (Pro) executes the reaction (R, I, P) (its reactants, inhibitors, and products are recorded on the label), which remains available at the next step together with P. The rule (Inh) applies when the reaction (R, I, P) should not be executed; it records in the label the possible causes for which the reaction is disabled: possibly some inhibiting entities (J ⊆ I) are present or some reactants (Q ⊆ R) are missing, with J ∪ Q ≠ ∅. The rule (Par) puts two processes in parallel by pooling their labels and joining all the set components of the labels; a sanity check is required to guarantee that there is no conflict between reactants and inhibitors of the applied reactions. Finally, the rule (Sys) requires that all the processes of the systems have been considered, and also checks that all the needed reactants are actually available in the system (R ⊆ W). In fact, this constraint can only be met on top of all processes. The check that inhibitors are absent (I ∩ W = ∅) is not necessary, as it is embedded in the premise υ1 ⌣ υ2 of rule (Par).

Example 4 Let us consider the RS process P0 ≜ [(ab, c, b) | ab.a.c.c.0] from Example 2. The process P0, and its next state P1, have a unique outgoing transition:

[(ab, c, b) | ab.a.c.c.0] —⟨ab▷ab,c,b⟩→ [(ab, c, b) | b | a.c.c.0] —⟨ab▷ab,c,b⟩→ [(ab, c, b) | b | c.c.0]


The process P2 = [(ab, c, b) | b | c.c.0] has three outgoing transitions, sharing the same target process P3 ≜ [(ab, c, b) | c.0], each providing a different justification why reaction (ab, c, b) is not enabled:

1. P2 —⟨bc▷c,∅,∅⟩→ P3, where the presence of c inhibited the reaction;
2. P2 —⟨bc▷∅,a,∅⟩→ P3, where the absence of a inhibited the reaction;
3. P2 —⟨bc▷c,a,∅⟩→ P3, where both the presence of c and the absence of a inhibited the reaction; this label is thus more informative than the previous two (see [12]).

Example 4 shows that we can have redundant transitions because of rule (Inh). In [12], we have introduced the notion of dominance, by relying on an order relation over pairs of sets of entities, to guarantee that any instance of the rule (Inh) is applied in a way that maximizes the sets J and Q (given the overall available entities W). In the example above, this is realized by the third transition P2 —⟨bc▷c,a,∅⟩→ P3.

The main theorem (see [12]) shows that the rewrite steps of a RS exactly match the transitions of its corresponding RS process.

Theorem 1 Let A = (S, A) be a RS, and π = (γ, δ) an n-step interactive process in A with γ = {Ci}i∈[0,n], δ = {Di}i∈[0,n], and let Wi ≜ Ci ∪ Di and Pi ≜ ⟦A, π⟧_i for any i ∈ [0, n]. Then:

1. ∀i ∈ [0, n − 1], Pi —⟨W▷R,I,P⟩→ P′ implies W = Wi, P = Di+1 and P′ ≡ Pi+1;
2. ∀i ∈ [0, n − 1], there exist R, I ⊆ S such that Pi —⟨Wi▷R,I,Di+1⟩→ Pi+1.
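A hedged sketch of how a single step assembles its label ⟨W ▷ R, I, P⟩, pooling the (Pro) contribution of enabled reactions and the maximized (Inh) justification of disabled ones, follows; the encoding is ours:

```python
def step_label(reactions, W):
    """Return (W, R, I, P) for one maximized step from the entities W."""
    R_all, I_all, P_all = set(), set(), set()
    for (R, I, P) in reactions:
        if R <= W and not (I & W):      # (Pro): the reaction fires
            R_all |= R; I_all |= I; P_all |= P
        else:                           # (Inh), maximized: J = I ∩ W, Q = R \ W
            R_all |= I & W              # present inhibitors, assumed present
            I_all |= R - W              # missing reactants, assumed absent
    return (frozenset(W), frozenset(R_all), frozenset(I_all), frozenset(P_all))

A = [(frozenset('ab'), frozenset('c'), frozenset('b'))]
print(step_label(A, set('bc')))   # the label bc ▷ c, a, ∅ of P2 in Example 4
```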

13.4 Bio-simulation

Bisimulation equivalences [37] play a central role in process algebras. They can be defined in terms of coinductive games, of fixpoint theory and of logics. In the case of biological systems, transition labels must convey a lot of information, and the classical notion of bisimulation can be too concrete. In fact, in a biological soup, a high number of interactions occur at every time instant and, generally speaking, biologists are only interested in analysing a small subset of phenomena. Depending on the application, only a suitable abstraction over the labels can be of interest. For this reason, following the approach introduced in Brodo et al. [13], we exploit an alternative notion of bisimulation, called bio-simulation, that compares two biological systems by observing only a limited set of events that are of particular interest. With respect to the work in Brodo et al. [13], here the labels are easier to manage and simpler to parse, because the underlying process algebra is tailored to RSs (whereas in [13], RSs were encoded in a fragment of a much more general process algebra).
In a way, at each step of the bisimulation game, we want to query our labels about some partial information. To this goal, we define an assertion language to express detailed and partial queries about what happened in a single transition.


For instance, we may wonder whether a given entity is produced or used by some reactions. We remark on the importance of dealing with nondeterministic contexts, as bisimulation takes into account the branching structure of system dynamics. The bio-simulation approach works as follows. First, we introduce an assertion language to abstract away some information from the labels; then we define a bisimilarity equivalence that is parametric to a given assertion, called bio-similarity; finally, we give a logical characterization of bio-similarity, called biological equivalence, by tailoring the classical Hennessy–Milner Logic (HML) to the given assertion.

13.4.1 Assertion Language

An assertion is a formula that predicates on the labels of our LTS. The assertion language that we propose is very basic, but can be extended if necessary.

Definition 10 (Assertion Language) Given a set of entities S, assertions F on S are built from the following syntax, where E ⊆ S and Pos ∈ {W, R, I, P}:

F ::= E ⊆ Pos | ? ∈ Pos | F ∨ F | F ∧ F | ¬F

Roughly, Pos distinguishes different positions in the labels: W stands for entities provided by the current state, R stands for reactants, I stands for inhibitors and P stands for products. An assertion F is either the membership of a subset of entities E in a given position Pos, E ⊆ Pos; the test of Pos for non-emptiness, ? ∈ Pos; the disjunction of two assertions F1 ∨ F2; their conjunction F1 ∧ F2; or the negation of an assertion ¬F.
To improve readability, we assume that negation binds stronger than conjunction and disjunction, so that ¬F1 ∧ F2 stands for (¬F1) ∧ F2. Of course, all the remaining usual logical operators can be derived as expected, e.g., we write the exclusive or F1 ⊕ F2 as a shorthand for the assertion (F1 ∧ ¬F2) ∨ (¬F1 ∧ F2) and the implication F1 → F2 as a shorthand for ¬F1 ∨ F2.

υ |= F1 ∨ F2 iff υ |= F1 ∨ υ |= F2 υ |= F1 ∧ F2 iff υ |= F1 ∧ υ |= F2 υ |= ¬F iff υ |= F W if Pos = W, I if Pos = I,

R if Pos = R P if Pos = P

Given two transition labels v, w we write v ≡F w if v |= F ⇔ w |= F, i.e. if both v, w satisfy F or they both do not.
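Definition 11 is essentially executable; a minimal Python rendering, with assertions encoded as tagged tuples (our encoding, not the Prolog prototype), is:

```python
def select(label, pos):
    W, R, I, P = label
    return {'W': W, 'R': R, 'I': I, 'P': P}[pos]

def satisfies(label, F):
    tag = F[0]
    if tag == 'subset':                 # E ⊆ Pos
        return F[1] <= select(label, F[2])
    if tag == 'nonempty':               # ? ∈ Pos
        return bool(select(label, F[1]))
    if tag == 'or':
        return satisfies(label, F[1]) or satisfies(label, F[2])
    if tag == 'and':
        return satisfies(label, F[1]) and satisfies(label, F[2])
    if tag == 'not':
        return not satisfies(label, F[1])
    raise ValueError(tag)

u = (set('ab'), set('ab'), set('c'), set('b'))   # the label ab ▷ ab, c, b
print(satisfies(u, ('subset', set('a'), 'R')))   # F1 of Example 5 below: True
print(satisfies(u, ('subset', set('ab'), 'P')))  # F2 of Example 5 below: False
```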


Example 5 We show some assertions and explain their meaning:

1. F1 ≜ a ⊆ R allows to check if the presence of the entity a has been exploited by some reaction;
2. F2 ≜ ab ⊆ P queries if the entities a and b have been produced by some reaction;
3. F3 ≜ a ⊆ W ∨ c ⊆ W checks if a or c were present in the source state;
4. F4 ≜ ab ⊆ R ∧ c ⊆ I checks if the reaction (ab, c, b) has been applied, while F5 ≜ a ⊆ I ∨ b ⊆ I ∨ c ⊆ R checks the opposite case. Alternatively, we can set F5 ≜ ¬F4.

If we take the label υ = ⟨ab ▷ ab, c, b⟩, it is immediate to check that

υ |= F1        υ ⊭ F2        υ |= F3        υ |= F4        υ ⊭ F5

13.4.2 Bio-similarity and Biological Equivalence

The notion of bio-simulation builds on the above language of assertions to parameterize the induced equivalence on the property of interest. Please recall that we have defined the behaviour of the context in a nondeterministic way; thus at each step different possible sets of entities can be provided to the system and different sets of reactions can be enabled/disabled. Bio-simulation can thus be used to compare the behaviour of different systems that share some of the reactions or entities, or also to compare the behaviour of the same set of reaction rules when different contexts are provided. Maximized dominant transitions, identified by double arrows, are those such that the sets J and Q are the largest. For a formal definition see [12].

Definition 12 (Bio-similarity ∼F [13]) Given an assertion F, a bio-simulation RF that respects F is a binary relation over RS processes s.t., if P RF Q then:

• ∀υ, P′ s.t. P =υ⇒ P′, ∃w, Q′ s.t. Q =w⇒ Q′ with υ ≡F w and P′ RF Q′;
• ∀w, Q′ s.t. Q =w⇒ Q′, ∃υ, P′ s.t. P =υ⇒ P′ with υ ≡F w and P′ RF Q′.

We let ∼F denote the largest bio-simulation, and we say that P is bio-similar to Q with respect to F if P ∼F Q.

It can be easily shown that the identity relation is a bio-simulation and that bio-simulations are closed under (relational) inverse, composition and union, and that, as a consequence, bio-similarity is an equivalence relation.

Example 6 Let us consider some variants of our working example. When considering maximized transitions only, denoted by double arrows, the behaviour of P0 ≜ [(ab, c, b) | ab.a.ac.0] is deterministic, and its unique trace of labels is:

P0 =⟨ab▷ab,c,b⟩⇒ P1 =⟨ab▷ab,c,b⟩⇒ P2 =⟨abc▷c,∅,∅⟩⇒ [(ab, c, b) | 0]


Instead, the behaviour of P′0 ≜ [(ab, c, b) | (ab.a.ac.0 + ab.a.a.0)] is nondeterministic. Now there are two possible traces of labels: the first trace is equal to the one above, while the other one follows:

P′0 =⟨ab⊳ab,c,b⟩⇒ P1 =⟨ab⊳ab,c,b⟩⇒ P2 =⟨abc⊳c,∅,∅⟩⇒ [(ab, c, b)|0]
P′0 =⟨ab⊳ab,c,b⟩⇒ P′1 =⟨ab⊳ab,c,b⟩⇒ P′2 =⟨ab⊳ab,c,b⟩⇒ [(ab, c, b)|b|0]

Now, it is easy to check that the two processes P0, P′0 are not bio-similar w.r.t. the assertion F1 ≜ c ∈ W, requiring that entity c is present in the state configuration, while they are bio-similar w.r.t. the assertion F2 ≜ (a ∈ R) ⊕ (c ∈ R), requiring that either a or c is used as a reactant.

Now, we introduce a slightly modified version of the Hennessy–Milner Logic [26], called bioHML; for the reasons explained above, we do not want to look at the complete transition labels, thus we rely on our simple assertion language to make the logic parametric to the assertion F of interest:

Definition 13 (BioHML [13]) Let F be an assertion; then the set of bioHML formulas G that respect F are built by the following syntax, where χ ∈ {F, ¬F}:

G, H ::= t | f | G ∧ G | G ∨ G | ⟨χ⟩G | [χ]G

The semantics of a bioHML formula is the set of processes that satisfy it.

Definition 14 (Semantics of BioHML) Let P denote the set of all RS processes over S. For a bioHML formula G, we define ⟦G⟧ ⊆ P inductively on G:

⟦t⟧ ≜ P                  ⟦f⟧ ≜ ∅
⟦G ∧ H⟧ ≜ ⟦G⟧ ∩ ⟦H⟧      ⟦G ∨ H⟧ ≜ ⟦G⟧ ∪ ⟦H⟧
⟦⟨χ⟩G⟧ ≜ {P ∈ P : ∃υ, P′. P =υ⇒ P′ with υ |= χ and P′ ∈ ⟦G⟧}
⟦[χ]G⟧ ≜ {P ∈ P : ∀υ, P′. P =υ⇒ P′ implies υ |= χ and P′ ∈ ⟦G⟧}

We write P |= G (P satisfies G) if P ∈ ⟦G⟧. Negation is not included in the syntax, but the converse G̅ of a bioHML formula G can be easily defined inductively in the same way as for HML. We let LF be the set of all bioHML formulas that respect F.

Definition 15 (Biological equivalence) We say that P, Q are biologically equivalent w.r.t. F, written P ≡LF Q, when P and Q satisfy exactly the same bioHML formulas in LF, i.e., when for any G ∈ LF we have P |= G ⇔ Q |= G.

Finally, we extend the classical result establishing the correspondence between the logical equivalence induced by HML and bisimilarity, proving that bio-similarity coincides with biological equivalence.


Theorem 2 (Correspondence [13]) ∼F = ≡LF

Example 7 We continue with our running example from Example 6. There is already evidence that the two processes P0 ≜ [(ab, c, b) | ab.a.ac.0] and P′0 ≜ [(ab, c, b) | (ab.a.ac.0 + ab.a.a.0)] are not bio-similar w.r.t. the assertion F1 ≜ c ∈ W. Here, we give a bioHML formula that distinguishes P0 and P′0: G ≜ ⟨¬F1⟩[¬F1]⟨¬F1⟩t. In fact, G is not satisfied by P0, written P0 ̸|= G, because, along the unique possible path, the labels of the first two transitions satisfy ¬F1 but P2 cannot perform any transition whose label satisfies ¬F1. Differently, P′0 |= G: P′0 can move to P′1 with a transition whose label satisfies ¬F1, then P′1 has a unique transition to P′2 whose label satisfies ¬F1, and finally the target state P′2 can perform a transition whose label satisfies ¬F1.
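The following sketch, under the same assumptions and encodings as before (a finite LTS stored as a dict from processes to lists of (label, successor) pairs, with sat and equiv from the sketch in Sect. 13.4.1), illustrates both Definition 12, computed as a greatest fixed point, and the bioHML semantics of Definition 14. It is our own illustration, not the authors' implementation.

def bio_similar(p, q, lts, f):
    """Decide p ~F q: start from the full relation and discard pairs
    violating the two clauses of Definition 12 (greatest fixed point)."""
    rel = {(x, y) for x in lts for y in lts}

    def matches(x, y):
        # every move of x is matched by some F-equivalent move of y
        return all(any(equiv(v, w, f) and (x2, y2) in rel
                       for (w, y2) in lts[y])
                   for (v, x2) in lts[x])

    changed = True
    while changed:
        changed = False
        for (x, y) in list(rel):
            if not (matches(x, y) and matches(y, x)):
                rel.discard((x, y))
                changed = True
    return (p, q) in rel

def holds(p, g, lts):
    """Decide p |= g for bioHML formulas encoded as nested tuples:
    ('t',) | ('f',) | ('and', G, H) | ('or', G, H) |
    ('dia', chi, G) for <chi>G | ('box', chi, G) for [chi]G."""
    tag = g[0]
    if tag == 't':
        return True
    if tag == 'f':
        return False
    if tag == 'and':
        return holds(p, g[1], lts) and holds(p, g[2], lts)
    if tag == 'or':
        return holds(p, g[1], lts) or holds(p, g[2], lts)
    if tag == 'dia':   # some transition satisfies chi and reaches [[G]]
        return any(sat(v, g[1]) and holds(p2, g[2], lts)
                   for (v, p2) in lts[p])
    if tag == 'box':   # every transition satisfies chi and reaches [[G]]
        return all(sat(v, g[1]) and holds(p2, g[2], lts)
                   for (v, p2) in lts[p])
    raise ValueError(tag)

# Example 7 in this encoding: with notF1 = ('not', ('sub', frozenset('c'), 'W')),
# G = ('dia', notF1, ('box', notF1, ('dia', notF1, ('t',)))). Once the two
# LTSs are built, holds(P0, G, lts) is False while holds(P0_primed, G, lts)
# is True, in accordance with Theorem 2.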

13.4.3 A Case Study: Metabolic Pathways in Mammalian Epithelial Cells

Example 8 In this example [36], we consider a fragment of the molecular signalling network in mammalian epithelial cells. The focus is on the regulation of the cell cycle, leaving out, for simplicity, the complex signalling network related to programmed cell death. As emerges from the diagram in Fig. 13.3, the essential signals for the activation of the pathways that lead to the growth of the cell (and therefore to protein synthesis), to the subsequent duplication of DNA and, therefore, to mitosis, are molecules called growth factors. Growth factors are perceived by intracellular receptors which, following conformational changes caused by the interaction with ligands, activate other proteins involved in signalling. The reactions we defined to represent this system try to stay as close as possible to the biochemical processes described. In particular, for cyclin B we have represented its synthesis by using several reactions, modelling its progressive concentration increase until it gets to the activation of the Mitosis Promoting Factor. In Fig. 13.3, we show an excerpt of the reactions; the full RS is available online.¹

In Fig. 13.4, we see a representation of two computations with two different context sequences in our interpreter for the RS in Example 8, where the colour of the states depends on the entities provided by the context. The computations start from the same initial context and then use different contexts: in the first computation, at each step a growth factor entity is introduced in the system by the context, while in the second computation an antigrowth factor entity is introduced. The corresponding computations show two different evolutions of the system.

¹ https://www3.diism.unisi.it/~falaschi/mammalianRSandContexts.txt


Fig. 13.3 Mammalian cell growth diagram

By considering a simple formula such as {cycB16, proteins} ⊆ W, we can prove that the two computations are not bio-similar: the initial fragments of the two computations, reported in Table 13.1, show that after 5 steps a state W is reached such that the formula is true on the first fragment, while it is false on the second.

13.4.4 Dynamic Slicing of RS Processes

Sometimes it can be difficult to express the properties of an RS by means of a logical formula expressing an invariant of the computation, in particular for researchers coming from biomedical areas. In such cases, the methodology of dynamic slicing may prove easier to use and helpful for simplifying the debugging of a model. The idea behind slicing is to select a portion of a computation which requires more attention or may contain a bug. Slicing was introduced by Weiser [39]. It was originally defined as a static technique for imperative programs, independent of any particular input of the program. Later, the technique was extended by introducing so-called dynamic program slicing [29]. In [14], we have defined a framework for dynamic (backward) slicing of RSs. In practice, we analyse partial computation traces: in the last state of a computation trace, we consider a marking that identifies the important entities, and we automatically compute a simplified computation which contains only the information necessary to derive the marked entities. The marking can be performed by the user or in an automated manner, by expressing a specification of the system in a logic of assertions similar to the one presented in this paper and automatically detecting the states and entities which do not satisfy the specification. In [14], we dealt with a slicing process for RS processes that goes backwards, meaning that the simplification (of states and contexts) proceeds from the last state to the first one. A minimal illustrative sketch of the backward pass follows.
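The sketch below conveys only the flavour of the backward pass (the actual algorithm of [14] also slices the contexts and drives the marking with assertions and monitors); the trace representation and all names are our own simplifications, and inhibitors are ignored for brevity.

def backward_slice(trace, marked):
    """trace[i] = (reactions applied at step i, resulting state), with
    reactions given as (R, I, P) triples of sets; `marked` is the set of
    entities of interest in the last state. Returns the sliced trace."""
    sliced = []
    for reactions, state in reversed(trace):
        # keep only the reactions that produce some marked entity
        relevant = [r for r in reactions if r[2] & marked]
        sliced.append((relevant, state & marked))
        # the reactants of the relevant reactions must be explained
        # one step earlier (inhibitors are ignored in this toy version)
        marked = set().union(*(r[0] for r in relevant)) if relevant else set()
    sliced.reverse()
    return sliced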


Fig. 13.4 Computations driven by two different contexts


Table 13.1 Initial segments of the two computations in Fig. 13.4

Several open issues remain that would be relevant for improving and extending dynamic slicing of RSs. Defining a forward strategy, starting from a selection of the initial state, and combining it with the backward strategy would certainly improve the efficacy of the tool. It would also be quite important to adapt the methodology to more advanced extensions of RSs, such as processes with speeds, delays and linear processes, as discussed in the next section. A slicer would also benefit from integration with static analysers able to check properties such as the dynamic causalities between molecules [4].

13.5 Quantitative Extensions of RSs

Delays, Durations and Timed Processes: In biology, it is well known that reactions occur with different frequencies. For example, since enzymes catalyse reactions, many reactions are more frequent when some enzymes are present and less frequent when such enzymes are absent. Moreover, reactions describing complex transformations may require time before releasing their products. To capture these dynamical aspects in our framework while preserving the discrete and abstract nature of RSs, in [15] we have proposed a discretisation of the delay between two occurrences of a reaction by using a scale of natural numbers, from 0 (smallest delay, highest frequency) up to n (increasing delay, lower frequency). Intuitively, the notation D^n stands for making the entities D available after n time units, and we use the shorthand D for D^0, meaning that the entities are immediately available. Similarly, we can associate a delay value with the product of each reaction by writing (R, I, P)^n when the product of the reaction will be available after n time units, and we write (R, I, P) for (R, I, P)^0.


Fig. 13.5 SOS semantics with delays and durations

Fig. 13.6 Two transition sequences of timed processes P1 and P2 (see Example 9)

The syntax for mixture processes is thus extended as below, and the operational semantics is changed accordingly:

M ::= (R, I, P)^n | D^n | K | M|M

Figure 13.5 only reports the rules that are new and those that override the ones in Fig. 13.2 (e.g., the semantics of context processes is unchanged). Rule (Tick) represents the passing of one time unit, while rule (Ent) makes available those entities whose delay has expired. Rule (Pro) delays the product of the reaction as specified by the reaction itself, while rule (Inh) is used when the reaction is not enabled.

Example 9 Let us consider two RSs sharing the same entity set S = {a, b, c, d} and the same reactions a1 = (a, b, b), a2 = (b, a, a), a3 = (ac, b, d), a4 = (d, a, c), but working with different reaction speeds. For simplicity, we assume that only two speed levels are distinguished: 0, the fastest, and 1, the slowest. The reaction system A1 provides the speed assignment {a1^1, a2, a3, a4^1} to its reactions, while A2 provides the speed assignment {a1, a2^1, a3^1, a4}. We assume that the context process for both reaction systems is just K ≜ ac.∅.0. The LTSs of their corresponding timed processes are in Fig. 13.6, where, for brevity, we let:

M1 ≜ a1^1 | a2 | a3 | a4^1        M2 ≜ a1 | a2^1 | a3^1 | a4

In Table 13.2, we show the graphical representation of the timed processes in the above example; we depict in grey those reactions and entities with a one-step delay.

Inspired by [10], we can also provide entities with a duration, i.e., entities that last a finite number of steps. To this aim, we use the syntax D^[n,m] to represent the availability of D for m > 0 time units starting after n time units from the current time. In [15], we have shown that durations can be encoded using just delays. A minimal sketch of the delay mechanism is given after Table 13.2.


Table 13.2 The evolution of timed processes M1 and M2
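The following Python sketch gives one possible functional reading of the rules in Fig. 13.5; the data layout is our own and the treatment of rule (Tick) is deliberately simplified, so it should be read as an illustration of the delay mechanism rather than as the SOS semantics itself.

def timed_step(state, pending, reactions, ctx):
    """One step of a timed process (sketch). state: set of entities
    available now; pending: dict entity -> time units still to wait;
    reactions: (R, I, P, d) with product delay d; ctx: entities that the
    context provides for the next state."""
    produced = {}
    for (R, I, P, d) in reactions:
        if R <= state and not (I & state):       # enabled as in plain RSs
            for e in P:                          # rule (Pro): delay products
                produced[e] = min(d, produced.get(e, d))
    merged = dict(pending)
    for e, n in produced.items():
        merged[e] = min(n, merged.get(e, n))     # keep the soonest arrival
    released = {e for e, n in merged.items() if n == 0}       # rule (Ent)
    waiting = {e: n - 1 for e, n in merged.items() if n > 0}  # rule (Tick)
    return set(ctx) | released, waiting

# For instance, reaction a1 = (a, b, b) with delay 1 (as in A1 of Example 9):
# timed_step({'a'}, {}, [({'a'}, {'b'}, {'b'}, 1)], set())
# returns (set(), {'b': 0}): b becomes available one step later.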

We use the name timed processes for processes with delays and durations. Notably, our extension is conservative, i.e., it does not change the semantics of processes without delays and durations. Therefore, the encoding of standard RSs described in Definition 7 still applies.

Concentration Levels and Linear Processes: Quantitative modelling of chemical reactions requires taking molecule concentrations into account. An abstract representation of concentrations considered in many formalisms is based on concentration levels: rather than representing such quantities as real numbers, a finite classification is considered (e.g., low/medium/high) with a granularity that reflects the number of concentration levels at which significant changes in the behaviour of the molecule are observed. In classical RSs, the modelling of concentration levels would require using different entities for the same molecule (e.g., a_l, a_m, and a_h for low, medium and high concentration of a, respectively). This may introduce some additional complexity due to the need to guarantee that only one of these entities is present at any time for the state to be consistent. Moreover, consistency would be put at risk whenever entities representing different levels of the same molecule (e.g., a_l and a_h) could be provided by the context. In [15], we have enhanced RS processes by adding some quantitative information to each entity of each reaction, so that levels are just natural numbers and the concentration levels of the products depend on the concentration levels of the reactants. The idea is to associate linear expressions, such as e = m·x + n (where m ∈ N and n ∈ N⁺ are two constants and x stands for a variable ranging over natural numbers),² with the reactants and products of each reaction. In the following, we write s(e) to state that expression e is associated with entity s. Expressions associated with reactants are used as patterns to match the current levels of the entities involved in the reaction. Pattern matching allows us to find the largest value for the variable x (the same for all reactants) that is consistent with the current concentration levels. Then, the linear expressions associated with products (which can contain, again, the variable x) can be evaluated to compute the concentration levels of those entities.

² To ease the presentation, we require n ∈ N⁺ to guarantee that e evaluates to a positive number, even when x = 0. Alternative choices are possible to relax this constraint.


Fig. 13.7 The evolution of a RS process equipped with abstract quantities

Expressions can also be associated with reaction inhibitors in order to let such entities inhibit the reaction only when their concentration level is above a given threshold. However, we require inhibitor expressions to be ground, namely they cannot contain the variable x and simply correspond to a positive natural number n. The state of the system also has to take the concentration levels into account; consequently, in the definition of states we exploit ground expressions again: each entity in the state is paired with a natural number representing its current concentration level.

Example 10 Assume that we want to write a reaction that produces c with a concentration level corresponding to the current concentration level of a (where at least one occurrence of a must be present), and that requires b not to be present at a concentration level higher than 1. Such a reaction would be r1 = (R, I, P) where R = a(x + 1), I = b(2) and P = c(x + 1). In the state {a(3), b(1)}, reaction r1 is enabled by taking x = 2 (the maximum value for x that satisfies a(x + 1) ≤ a(3)). Since b(1) < b(2), entity c will be produced with concentration level c(x + 1) = c(2 + 1) = c(3). On the contrary, in the state {a(2), b(2)}, reaction r1 is not enabled because the concentration of the inhibitor is too high (b(2) ̸< b(2)). Figure 13.7 depicts the evolution of the linear process: each entity in the sets R and P is equipped with a linear expression, and each entity in the state is associated with a positive number.
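The pattern-matching step just described can be illustrated by the following sketch, which reproduces the two scenarios of Example 10; the dictionary-based encoding of reactions and states is our own and is not taken from [15].

def apply_linear_reaction(state, R, I, P):
    """state maps entities to levels; R and P map entities to pairs (m, n)
    encoding m*x + n; I maps entities to ground thresholds. Returns the
    produced levels, or None when the reaction is not enabled."""
    best = None                      # largest x consistent with all reactants
    for e, (m, n) in R.items():
        lvl = state.get(e, 0)
        if lvl < n:                  # not enough of e even for x = 0
            return None
        x_e = (lvl - n) // m if m > 0 else float('inf')
        best = x_e if best is None else min(best, x_e)
    if any(state.get(e, 0) >= t for e, t in I.items()):
        return None                  # an inhibitor is above its threshold
    x = 0 if best is None or best == float('inf') else best
    return {e: m * x + n for e, (m, n) in P.items()}

# Example 10: r1 has R = a(x+1), I = b(2), P = c(x+1)
r1 = ({'a': (1, 1)}, {'b': 2}, {'c': (1, 1)})
print(apply_linear_reaction({'a': 3, 'b': 1}, *r1))   # {'c': 3}, taking x = 2
print(apply_linear_reaction({'a': 2, 'b': 2}, *r1))   # None: b(2) inhibits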

13.6 Implementation and Experimentation

In Falaschi and Palma [24], we presented a preliminary implementation of the RS formalism in Prolog. Recently, we have developed a much more advanced and efficient implementation [12], which includes the more general notion of contexts and exploits transition labels to derive the corresponding LTSs. We have also added predicates for formulating expressions of our assertion language that act on the transition labels. On top of this assertion language, we have implemented a slightly modified version of the Hennessy–Milner Logic, made parametric on the specific assertion specified by the user. We have also added the implementation of delays and durations, speeds, and linear processes. Finally, we added a dynamic slicer [14], which helps in debugging the models realized as RSs. Our interpreter is available for download,³ together with a template for writing RS specifications and usage instructions.

³ http://pages.di.unipi.it/bruni/LTSRS/


The interpreter has been developed and tested under SWI-Prolog⁴ and makes use of a few library predicates for handling association lists and ordered sets efficiently. DCG grammar rules are used to ease the writing of custom RSs. Our implementation introduces several novel features not covered in the literature and has been designed as a tool for verification, as well as for rapid prototyping of extensions of RSs, not just for their simulation. For ordinary interactive RSs, there are already several more performant simulators, such as brsim, written in Haskell [2]. HERESY is a GPU-based simulator of RSs, written using CUDA [31]. Ferretti et al. [25] presented the tool cl-rs,⁵ an optimized Common Lisp simulator for RSs. The performances of all these tools have been compared in [25]. Shang et al. [38] presented the first attempt to implement RSs in hardware: they describe algorithms for translating RSs into synchronous digital circuits that keep the same behaviour.

⁴ https://www.swi-prolog.org/
⁵ Available at https://github.com/mnzluca/cl-rs

13.7 Related Work

Kleijn et al. [28] present an LTS for RSs whose states are the subsets in 2^S, where S is the set of entities. As a major difference, we give a process-algebra-style definition of RSs, where the SOS rules produce informative transition labels, including the context specification, allowing new and different kinds of analysis. Reaction algebras based on an SOS approach have been proposed by Pardini et al. [33]: they defined an operator for hiding entities, and the resulting LTS was essentially deterministic, so that the usual notion of bisimulation coincided with trace equivalence. In this paper, we define recursive and nondeterministic environments, which allow us to model more complex and interesting in silico experiments. There are some previous works applying bisimulation to models of biological systems [5, 16], in completely different computing frameworks. In Brodo et al. [11, 13], we derived an LTS similar to the one in this paper by encoding RSs into cCNA, a more general multi-party process algebra (a variant of the link-calculus [7, 8]). In comparison with the encoding of RSs in cCNA, here we give an SOS semantics tailored to RSs, without relying on an ad hoc translation.

13.8 Conclusion and Future Work

We have given a survey of our recent work in the area of verification techniques for computational biology. We presented an SOS semantics for an extension of RSs, considering nondeterministic and recursive contexts, that generates a labelled transition system. We have recast RSs as processes, formulating a set of ad hoc inference rules, and we have defined a flexible framework that allows one to add new operators in a natural way. The transition labels play an interesting role, not only because they reflect the important aspects of the computation, but also because they add expressiveness, allowing for additional analyses, as we showed in Sect. 13.4. In Sect. 13.6, we described a prototype implementation in logic programming. The tool can verify the bio-similarity of two adversarial RS processes; moreover, the structure of an RS process can be shown graphically, thus helping the user to analyse the behaviour and evolution of the modelled system. The SOS semantics paves the way for the systematic integration of other operators for combining Reaction Systems, as we have already shown with the quantitative extensions recalled in Sect. 13.5, and of many others available from the process algebra literature [6] (e.g., hiding, interleaving, external choice). Analogously, we plan to investigate de-synchronized versions of Reaction Systems, where some of the enabled reactions, but not necessarily all of them, take place at each computation step. We also plan to apply our technique to define SOS semantics for other synchronous rewrite-rule systems (where all the rules are applied synchronously), so as to obtain a uniform computational framework. Finally, we want to study the relation to analysis techniques based on Abstract Interpretation [18–20, 23] and to continue our investigation on slicing techniques [14].

Acknowledgements Research supported by University of Pisa PRA_2020_26 Metodi Informatici Integrati per la Biomedica, by MIUR PRIN Project 201784YSZ5 ASPRA–Analysis of Program Analyses, and by University of Sassari Fondo di Ateneo per la ricerca 2020.

References

1. Azimi, S.: Steady states of constrained reaction systems. Theor. Comput. Sci. 701(C), 20–26 (2017)
2. Azimi, S., Gratie, C., Ivanov, S., Petre, I.: Dependency graphs and mass conservation in reaction systems. Theor. Comput. Sci. 598, 23–39 (2015)
3. Azimi, S., Iancu, B., Petre, I.: Reaction system models for the heat shock response. Fundam. Informaticae 131(3–4), 299–312 (2014)
4. Barbuti, R., Gori, R., Levi, F., Milazzo, P.: Investigating dynamic causalities in reaction systems. Theor. Comput. Sci. 623, 114–145 (2016)
5. Barbuti, R., Maggiolo-Schettini, A., Milazzo, P., Troina, A.: Bisimulations in calculi modelling membranes. Form. Asp. Comput. 20(4), 351–377 (2008)
6. Bernini, A., Brodo, L., Degano, P., Falaschi, M., Hermith, D.: Process calculi for biological processes. Natural Computing 17(2), 345–373 (2018)
7. Bodei, C., Brodo, L., Bruni, R.: A formal approach to open multiparty interactions. Theor. Comput. Sci. 763, 38–65 (2019)
8. Bodei, C., Brodo, L., Bruni, R.: The link-calculus for open multiparty interactions. Inf. Comput. 275 (2020)
9. Brijder, R., Ehrenfeucht, A., Main, M., Rozenberg, G.: A tour of reaction systems. Int. J. Found. Comput. Sci. 22(7), 1499–1517 (2011)


10. Brijder, R., Ehrenfeucht, A., Rozenberg, G.: Reaction systems with duration. In: Computation, Cooperation, and Life: Essays Dedicated to Gheorghe Păun on the Occasion of His 60th Birthday, pp. 191–202. Springer, Berlin, Heidelberg (2011)
11. Brodo, L., Bruni, R., Falaschi, M.: Enhancing reaction systems: a process algebraic approach. In: Alvim, M., Chatzikokolakis, K., Olarte, C., Valencia, F. (eds.) The Art of Modelling Computational Systems. LNCS, vol. 11760, pp. 68–85. Springer, Berlin (2019)
12. Brodo, L., Bruni, R., Falaschi, M.: A logical and graphical framework for reaction systems. Theor. Comput. Sci. 875, 1–27 (2021)
13. Brodo, L., Bruni, R., Falaschi, M.: A process algebraic approach to reaction systems. Theor. Comput. Sci. 881, 62–82 (2021)
14. Brodo, L., Bruni, R., Falaschi, M.: Dynamic slicing of reaction systems based on assertions and monitors. Technical Report DIISM CS-52, Dept. of Information Engineering and Mathematics, University of Siena (2022). Submitted for publication
15. Brodo, L., Bruni, R., Falaschi, M., Gori, R., Levi, F., Milazzo, P.: Exploiting modularity of SOS semantics to define quantitative extensions of reaction systems. In: Aranha, C., Martín-Vide, C., Vega-Rodríguez, M.A. (eds.) Proceedings of TPNC 2021. LNCS, vol. 13082, pp. 15–32. Springer, Cham (2021)
16. Cardelli, L., Tribastone, M., Tschaikowski, M., Vandin, A.: Forward and backward bisimulations for chemical reaction networks. In: Proc. of CONCUR 2015, vol. 42, pp. 226–239. Schloss Dagstuhl Publ. (2015)
17. Corolli, L., Maj, C., Marini, F., Besozzi, D., Mauri, G.: An excursion in reaction systems: from computer science to biology. Theor. Comput. Sci. 454, 95–108 (2012)
18. Cousot, P.: Principles of Abstract Interpretation. MIT Press, Cambridge, MA, USA (2021)
19. Cousot, P., Cousot, R.: Abstract interpretation: a unified lattice model for static analysis of programs by construction or approximation of fixpoints. In: Proceedings of the 4th ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages, pp. 238–252. ACM Press, New York (1977)
20. Cousot, P., Cousot, R.: Systematic design of program analysis frameworks. In: Proceedings of the 6th ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages, pp. 269–282. ACM Press, New York (1979)
21. Ehrenfeucht, A., Main, M.G., Rozenberg, G.: Combinatorics of life and death for reaction systems. Int. J. Found. Comput. Sci. 21(3), 345–356 (2010)
22. Ehrenfeucht, A., Main, M.G., Rozenberg, G.: Functions defined by reaction systems. Int. J. Found. Comput. Sci. 22(1), 167–178 (2011)
23. Falaschi, M., Olarte, C., Palamidessi, C.: Abstract interpretation of temporal concurrent constraint programs. Theory and Practice of Logic Programming 15(3), 312–357 (2015)
24. Falaschi, M., Palma, G.: A logic programming approach to reaction systems. In: DIP'20. OASIcs, vol. 86, pp. 6:1–6:15. Schloss Dagstuhl–Leibniz-Zentrum für Informatik (2020)
25. Ferretti, C., Leporati, A., Manzoni, L., Porreca, A.E.: The many roads to the simulation of reaction systems. Fundam. Informaticae 171(1–4), 175–188 (2020)
26. Hennessy, M., Milner, R.: On observing nondeterminism and concurrency. In: ICALP'80. LNCS, vol. 85, pp. 299–309. Springer (1980)
27. Hillston, J.: A compositional approach to performance modelling. PhD thesis, University of Edinburgh, UK (1994)
28. Kleijn, J., Koutny, M., Mikulski, Ł., Rozenberg, G.: Reaction systems, transition systems, and equivalences. In: Böckenhauer, H., Komm, D., Unger, W. (eds.) Adventures Between Lower Bounds and Higher Altitudes: Essays Dedicated to Juraj Hromkovič on the Occasion of His 60th Birthday. LNCS, vol. 11011, pp. 63–84. Springer (2018)
29. Korel, B., Laski, J.: Dynamic program slicing. Inf. Process. Lett. 29(3), 155–163 (1988)
30. Milner, R.: A Calculus of Communicating Systems. LNCS, vol. 92. Springer (1980)
31. Nobile, M.S., Porreca, A.E., Spolaor, S., Manzoni, L., Cazzaniga, P., Mauri, G., Besozzi, D.: Efficient simulation of reaction systems on graphics processing units. Fundam. Informaticae 154(1–4), 307–321 (2017)


32. Okubo, F., Yokomori, T.: The computational capability of chemical reaction automata. Natural Computing 15(2), 215–224 (2016)
33. Pardini, G., Barbuti, R., Maggiolo-Schettini, A., Milazzo, P., Tini, S.: Compositional semantics and behavioural equivalences for reaction systems with restriction. Theor. Comput. Sci. 551, 1–21 (2014)
34. Plotkin, G.D.: An operational semantics for CSP. In: Bjørner, D. (ed.) Proceedings of the IFIP Working Conf. on Formal Description of Programming Concepts II, pp. 199–226. North-Holland (1982)
35. Plotkin, G.D.: A structural approach to operational semantics. J. Log. Algebraic Methods Program. 60–61, 17–139 (2004)
36. Salustri, M.: Modellare sistemi biologici attraverso i reaction systems [Modelling biological systems by means of reaction systems]. Master thesis, Dept. of Information Engineering and Mathematics, University of Siena (2022)
37. Sangiorgi, D.: Introduction to Bisimulation and Coinduction. Cambridge University Press (2011)
38. Shang, Z., Verlan, S., Petre, I., Zhang, G.: Reaction systems and synchronous digital circuits. Molecules 24(10), 1961, 1–13 (2019)
39. Weiser, M.: Program slicing. IEEE Trans. on Soft. Eng. 10(4), 352–357 (1984)