Lecture Notes in Computer Science 6011

Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison, UK
Takeo Kanade, USA
Josef Kittler, UK
Jon M. Kleinberg, USA
Alfred Kobsa, USA
Friedemann Mattern, Switzerland
John C. Mitchell, USA
Moni Naor, Israel
Oscar Nierstrasz, Switzerland
C. Pandu Rangan, India
Bernhard Steffen, Germany
Madhu Sudan, USA
Demetri Terzopoulos, USA
Doug Tygar, USA
Gerhard Weikum, Germany
Advanced Research in Computing and Software Science
Subline of Lecture Notes in Computer Science

Subline Series Editors
Giorgio Ausiello, University of Rome ‘La Sapienza’, Italy
Vladimiro Sassone, University of Southampton, UK
Subline Advisory Board
Susanne Albers, University of Freiburg, Germany
Benjamin C. Pierce, University of Pennsylvania, USA
Bernhard Steffen, University of Dortmund, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Deng Xiaotie, City University of Hong Kong
Jeannette M. Wing, Carnegie Mellon University, Pittsburgh, PA, USA
Rajiv Gupta (Ed.)
Compiler Construction
19th International Conference, CC 2010
Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2010
Paphos, Cyprus, March 20-28, 2010
Proceedings
Volume Editor
Rajiv Gupta
University of California Riverside
Department of Computer Science and Engineering
Riverside, CA 92521, USA
E-mail: [email protected]
Library of Congress Control Number: 2010922288
CR Subject Classification (1998): D.2, D.3, D.2.4, C.2, D.4, D.1
LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues
ISSN 0302-9743
ISBN-10 3-642-11969-7 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-11969-9 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2010 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper 06/3180
Foreword
ETAPS 2010 was the 13th instance of the European Joint Conferences on Theory and Practice of Software. ETAPS is an annual federated conference that was established in 1998 by combining a number of existing and new conferences. This year it comprised the usual five sister conferences (CC, ESOP, FASE, FOSSACS, TACAS), 19 satellite workshops (ACCAT, ARSPA-WITS, Bytecode, CMCS, COCV, DCC, DICE, FBTC, FESCA, FOSS-AMA, GaLoP, GT-VMT, LDTA, MBT, PLACES, QAPL, SafeCert, WGT, and WRLA) and seven invited lectures (excluding those that were specific to the satellite events). The five main conferences this year received 497 submissions (including 31 tool demonstration papers), 130 of which were accepted (10 tool demos), giving an overall acceptance rate of 26%, with most of the conferences at around 24%. Congratulations therefore to all the authors who made it to the final programme! I hope that most of the other authors will still have found a way of participating in this exciting event, and that you will all continue submitting to ETAPS and contributing to make it the best conference on software science and engineering. The events that comprise ETAPS address various aspects of the system development process, including specification, design, implementation, analysis and improvement. The languages, methodologies and tools which support these activities are all well within its scope. Different blends of theory and practice are represented, with an inclination toward theory with a practical motivation on the one hand and soundly based practice on the other. Many of the issues involved in software design apply to systems in general, including hardware systems, and the emphasis on software is not intended to be exclusive. ETAPS is a confederation in which each event retains its own identity, with a separate Programme Committee and proceedings. Its format is open-ended, allowing it to grow and evolve as time goes by. Contributed talks and system demonstrations are in synchronised parallel sessions, with invited lectures in plenary sessions. Two of the invited lectures are reserved for ‘unifying’ talks on topics of interest to the whole range of ETAPS attendees. The aim of cramming all this activity into a single one-week meeting is to create a strong magnet for academic and industrial researchers working on topics within its scope, giving them the opportunity to learn about research in related areas, and thereby to foster new and existing links between work in areas that were formerly addressed in separate meetings. ETAPS 2010 was organised by the University of Cyprus in cooperation with:
⊲ European Association for Theoretical Computer Science (EATCS)
⊲ European Association for Programming Languages and Systems (EAPLS)
⊲ European Association of Software Science and Technology (EASST)
and with support from the Cyprus Tourism Organisation.
The organising team comprised:
General Chairs: Tiziana Margaria and Anna Philippou
Local Chair: George Papadopoulos
Secretariat: Maria Kittira
Administration: Petros Stratis
Satellite Events: Anna Philippou
Website: Konstantinos Kakousis
Overall planning for ETAPS conferences is the responsibility of its Steering Committee, whose current membership is: Vladimiro Sassone (Southampton, Chair), Parosh Abdulla (Uppsala), Luca de Alfaro (Santa Cruz), Gilles Barthe (IMDEA-Software), Giuseppe Castagna (CNRS Paris), Marsha Chechik (Toronto), Sophia Drossopoulou (Imperial College London), Javier Esparza (TU Munich), Dimitra Giannakopoulou (CMU/NASA Ames), Andrew D. Gordon (MSR Cambridge), Rajiv Gupta (UC Riverside), Chris Hankin (Imperial College London), Holger Hermanns (Saarbrücken), Mike Hinchey (Lero, the Irish Software Engineering Research Centre), Martin Hofmann (LM Munich), Joost-Pieter Katoen (Aachen), Paul Klint (Amsterdam), Jens Knoop (Vienna), Shriram Krishnamurthi (Brown), Kim Larsen (Aalborg), Rustan Leino (MSR Redmond), Gerald Luettgen (Bamberg), Rupak Majumdar (Los Angeles), Tiziana Margaria (Potsdam), Ugo Montanari (Pisa), Oege de Moor (Oxford), Luke Ong (Oxford), Fernando Orejas (Barcelona), Catuscia Palamidessi (INRIA Paris), George Papadopoulos (Cyprus), David Rosenblum (UCL), Don Sannella (Edinburgh), João Saraiva (Minho), Michael Schwartzbach (Aarhus), Perdita Stevens (Edinburgh), Gabriele Taentzer (Marburg), and Martin Wirsing (LM Munich).
I would like to express my sincere gratitude to all of these people and organisations, the Programme Committee Chairs and members of the ETAPS conferences, the organisers of the satellite events, the speakers themselves, the many reviewers, all the participants, and Springer for agreeing to publish the ETAPS proceedings in the ARCoSS subline. Finally, I would like to thank the Organising Chair of ETAPS 2010, George Papadopoulos, for arranging for us to have ETAPS in the most beautiful surroundings of Paphos.
January 2010
Vladimiro Sassone
Preface
The CC 2010 Programme Committee is pleased to present the proceedings of the 19th International Conference on Compiler Construction (CC 2010), which was held during March 25–26 in Paphos, Cyprus, as part of the Joint European Conferences on Theory and Practice of Software (ETAPS 2010). As in the last few years, papers were solicited on a wide range of areas including traditional compiler construction, compiler analyses, runtime systems and tools, programming tools, techniques for specific domains, and the design and implementation of novel language constructs. We received submissions from a wide variety of areas and the papers in this volume reflect this variety. The Programme Committee received 56 submissions. From these, 16 research papers were selected, giving an overall acceptance rate of 28%. The Programme Committee carried out the reviewing and paper selection completely electronically, in two rounds. In the first round at least three Programme Committee members reviewed each paper, and through discussion among the reviewers those papers which were definite “accepts” and those which needed further discussion were identified. Our second round concentrated on the papers needing further discussion, and we added an additional review to help us decide which papers to finally accept. Many people contributed to the success of this conference. First of all, we would like to thank the authors for all the care they put into their submissions. Our gratitude also goes to the Programme Committee members and external reviewers for their substantive and insightful reviews. Also, thanks go to the developers and supporters of the EasyChair conference management system for providing a reliable, sophisticated and free service. CC 2010 was made possible by the ETAPS Steering Committee and the local Organizing Committee. Finally, we are grateful to Jim Larus for giving the CC 2010 invited talk.

January 2010
Rajiv Gupta
Conference Organization
Programme Chair
Rajiv Gupta, UC Riverside, USA
Programme Committee
Jack Davidson, University of Virginia, USA
Paul Feautrier, École Normale Supérieure de Lyon, France
Guang Gao, University of Delaware, USA
Antonio Gonzalez, Intel Barcelona Research Center, Spain
Laurie Hendren, McGill University, Canada
Robert Hundt, Google, USA
Suresh Jagannathan, Purdue University, USA
Chandra Krintz, UC Santa Barbara, USA
Julia Lawall, DIKU, Denmark
Madan Musuvathi, Microsoft Research, USA
Michael O’Boyle, University of Edinburgh, UK
Yunheung Paek, Seoul National University, Republic of Korea
Santosh Pande, Georgia Institute of Technology, USA
Christoph von Praun, Georg-Simon-Ohm Hochschule Nürnberg, Germany
Vivek Sarkar, Rice University, USA
Bernhard Scholz, The University of Sydney, Australia
Bjorn De Sutter, Ghent University, Belgium
Andreas Zeller, Saarland University, Germany
External Reviewers
Alex Aleta, Rajkishore Barik, Indu Bhagat, Zoran Budimlic, Bernd Burgstaller, Qiong Cai, Romain Cledat, Josep M. Codina, Jesse Doherty, S. M. Farhad, Enric Gibert, Christian Grothoff, Lang Hames, Surinder Kumar Jain, Kyoungwon Kim, Yongjoo Kim, Tushar Kumar, Akash Lal, Nurudeen Lameed, Jongwon Lee, David Li, Pedro Lopez, Marc Lupon, Carlos Madriles, Nagy Mostafa, Sarang Ozarde, Gregory Prokopski, Easwaran Raman, August Schwerdfeger, Tianwei Sheng, Jun Shirako, Jaswanth Sreeram, Neil Vachharajani, Xavier Vera, Eran Yahav, Seungjun Yang, Jonghee Youn, Jisheng Zhao
Table of Contents

Invited Talk

Programming Clouds (James Larus) 1

Optimization Techniques

Mining Opportunities for Code Improvement in a Just-In-Time Compiler (Adam Jocksch, Marcel Mitran, Joran Siu, Nikola Grcevski, and José Nelson Amaral) 10
Unrestricted Code Motion: A Program Representation and Transformation Algorithms Based on Future Values (Shuhan Ding and Soner Önder) 26
Optimizing Matlab through Just-In-Time Specialization (Maxime Chevalier-Boisvert, Laurie Hendren, and Clark Verbrugge) 46
RATA: Rapid Atomic Type Analysis by Abstract Interpretation – Application to JavaScript Optimization (Francesco Logozzo and Herman Venter) 66

Program Transformations

JReq: Database Queries in Imperative Languages (Ming-Yee Iu, Emmanuel Cecchet, and Willy Zwaenepoel) 84
Verifying Local Transformations on Relaxed Memory Models (Sebastian Burckhardt, Madanlal Musuvathi, and Vasu Singh) 104

Program Analysis

Practical Extensions to the IFDS Algorithm (Nomair A. Naeem, Ondřej Lhoták, and Jonathan Rodriguez) 124
Using Ownership to Reason about Inherent Parallelism in Object-Oriented Programs (Andrew Craik and Wayne Kelly) 145

Register Allocation

Punctual Coalescing (Fernando Magno Quintão Pereira and Jens Palsberg) 165
Strategies for Predicate-Aware Register Allocation (Gerolf F. Hoflehner) 185
Preference-Guided Register Assignment (Matthias Braun, Christoph Mallon, and Sebastian Hack) 205
Validating Register Allocation and Spilling (Silvain Rideau and Xavier Leroy) 224

High-Performance Systems

Automatic C-to-CUDA Code Generation for Affine Programs (Muthu Manikandan Baskaran, J. Ramanujam, and P. Sadayappan) 244
Is Reuse Distance Applicable to Data Locality Analysis on Chip Multiprocessors? (Yunlian Jiang, Eddy Z. Zhang, Kai Tian, and Xipeng Shen) 264
The Polyhedral Model Is More Widely Applicable Than You Think (Mohamed-Walid Benabderrahmane, Louis-Noël Pouchet, Albert Cohen, and Cédric Bastoul) 283
The Hot Path SSA Form: Extending the Static Single Assignment Form for Speculative Optimizations (Subhajit Roy and Y.N. Srikant) 304

Author Index 325
Programming Clouds

James Larus
Microsoft Research
One Microsoft Way
Redmond, WA 98052
[email protected]
Abstract. Cloud computing provides a platform for new software applications that run across a large collection of physically separate computers and free computation from the computer in front of a user. Distributed computing is not new, but the commodification of its hardware platform—along with ubiquitous networking; powerful mobile devices; and inexpensive, embeddable, networkable computers—heralds a revolution comparable to the PC. Software development for the cloud offers many new (and some old) challenges that are central to research in programming models, languages, and tools. The language and tools community should embrace this new world as a fertile source of new challenges and opportunities to advance the state of the art.

Keywords: cloud computing, programming languages, software tools, optimization, concurrency, parallelism, distributed systems.
1 Introduction

As I write this paper, cloud computing is a hot new trend in computing. By the time you read it, the bloom may be off this rose, and with a sense of disillusionment at yet another overhyped fad, popular enthusiasm may have moved on to the next great idea. Nevertheless, it is worth taking a close look at cloud computing, as it represents a fundamental break in software development that poses enormous challenges for programming languages and tools. Cloud computing extends far beyond the utility computing services offered by Amazon’s AWS, Microsoft’s Azure, or Google’s AppEngine. These services provide a foundation for cloud computing by supplying on-demand, internet computing resources on a vast scale and at low cost. Far more significant, however, is the software model this hardware platform enables: one in which software applications are executed across a large collection of physically separate computers and computation is no longer limited to the computer in front of you. Distributed computing is not new, but the commodification of its hardware platform—along with ubiquitous networking; powerful mobile devices; and inexpensive, embeddable, networkable computers—may bring about a revolution comparable to the PC. Programming the cloud is not easy. The underlying hardware platform of clusters of networked parallel computers is familiar, but not well supported by programming models, languages, or tools. In particular, concurrency, parallelism, distribution, and
availability are long-established research areas in which progress and consensus have been slow and painful. As cloud computing becomes prevalent, it is increasingly imperative to refine existing programming solutions and investigate new approaches to constructing robust, reliable software. The languages and tools community has a central role to play in the success of cloud computing. Below is a brief and partial list of areas that could benefit from further research and development. The discussion is full of broad generalizations, so if I malign or ignore your favorite language or your research, excuse me in advance.

1. Concurrency. Cloud computing is an inherently concurrent and asynchronous computation, in which autonomous processes interact by exchanging messages. This architecture gives rise to two forms of concurrency within a process:
• The first, similar to an operating system, provides control flow to respond to inherently unordered events.
• The second, similar to a web server, supports processing of independent streams of requests.
Neither use of concurrency is well supported by programming models or languages. There is a long-standing debate between proponents of threads and event handling [1-3] as to which model best supports concurrency. Threads are close to a familiar, sequential programming model, but concurrency still necessitates synchronization to avoid unexpected state changes in the midst of an apparently sequential computation. Moreover, the high overhead of a thread and the cost of context switching limit concurrency and constrain system architectures. Event handlers, on the other hand, offer low overhead and feel more closely tied to the underlying events. However, handlers provide little program structure and scale poorly to large systems. They also require developers to explicitly manage program state. Other models, such as state machines or Actors, have not yet emerged in a general-purpose programming language.

2. Parallelism. Cloud computing runs on parallel computers, both on the client and server. Parallelism currently is the dominant approach to increasing processor performance without exceeding power dissipation limitations [4]. Future processors are likely to become more heterogeneous, as specialized functional units greatly increase performance or reduce power consumption for specific tasks. Parallelism, unfortunately, is a long-standing challenge for computer science. Despite four decades of experience with parallel computers, we have not yet reached consensus on the underlying models and semantics or provided adequate programming languages and tools. For most developers, shared-memory parallel programs are still written in the assembly language of threads and explicit synchronization. Not surprisingly, parallel programming is difficult, slow, and error-prone, and it will be a major impediment in developing high-performance cloud applications. The past few years have seen promising research on new, higher-level parallel programming models, such as transactional memory and deterministic execution [5, 6]. Neither is a panacea, but both abstractions could hide some complexities of parallelism.

3. Message passing. The alternative to shared-memory parallel programming is message passing, ubiquitous on the large clusters used in scientific and technical
computing. Because of its intrinsic advantages, message passing will be the primary parallel programming model for cloud computing as well. It scales across very large numbers of machines and is suited for distributed systems with long communication latencies. Equally important, message passing is a better programming model than shared memory, as it provides inherent performance and correctness isolation with clearly identified points of interaction. Both aspects contribute to more secure and robust software systems [7]. Message passing can be more difficult to program than shared memory, in large measure because it is not directly supported by many programming languages. Message-passing libraries offer an inadequate interface between the asynchronous world of messages and the synchronous control flow of procedure calls and returns. A few languages, such as Erlang, integrate messages into existing language constructs such as pattern matching [8], but full support for messages requires communications contracts, such as Sing# [9], and tighter integration with the type system and memory model.

4. Distribution. Distributed systems are a well-studied area with proven solutions for difficult problems such as replication, consistency, and quorum. This field has focused considerable effort on understanding the fundamental problems and on formulating efficient solutions. One challenge is integrating these techniques into a mainstream programming model. Should they reside in libraries, where developers need to invoke operations at appropriate points, or can they be better integrated into a language, so developers can state properties of their code and the run-time system can ensure correct execution?

5. High availability. The cloud end of cloud computing provides services potentially used by millions of clients, and these services must be highly available. Failures of systems used by millions of people are noteworthy events widely reported by the media. And, as these services become integrated into the fabric of everyday life, they become part of the infrastructure that people depend on for their businesses, activities, and safety. High availability is not the same as high reliability, the focus of much research on detecting and eliminating software bugs. A reliable system that runs slowly under heavy load may fail to provide a necessary level of service. Conversely, components of a highly available system can fail frequently, but a properly architected system will continue to provide adequate levels of service [10]. Availability starts at the architecture level of the system, but programming languages have an important role to play in the implementation. Existing languages provide little support for systematically handling unexpected and erroneous conditions beyond exceptions, which are notoriously difficult to use properly [11]. Error handling is complex and delicate code that runs when program invariants are violated, but it is often written as an afterthought and rarely thoroughly tested. Better language support, for example lightweight, non-isolated transactions, could help developers handle and recover from errors [12].

6. Performance. Performance is primarily a system-level concern in cloud computing. Many performance problems involve shared resources running across large numbers of computers and complex networks. Few techniques exist to analyze a design or
system in advance, to understand bottlenecks or predict performance. As a consequence, current practice is to build, overprovision, measure, tweak, and pray. One pervasive concern is detecting and understanding performance problems. Amazon’s Dynamo system uses service-level agreements (SLAs) among system components to quickly identify performance problems [13]. These SLAs are the performance equivalents of pre- and post-conditions. Making performance into a first-class programming abstraction, with full language and tool support, would help with the construction of complex, distributed systems.

7. Application partitioning. Current practice is to statically partition functionality between a client and service by defining an interface and writing both endpoints independently. This approach leads to inflexible architectures that forego opportunities to migrate computations to where they could run most efficiently. In particular, battery-powered clients such as phones are limited in memory or processing capability. Migrating a running computation from a phone to a server might enable it to complete faster (or at all) or to better utilize limited network bandwidth by moving computation to data rather than the reverse [14]. Even within a data center, code mobility is valuable. It permits server workloads to be balanced to improve performance or consolidated to reduce power consumption. Currently, virtual machines move an entire image, from the operating system up, between computers. Finer-grain support for moving computations could lower the cost of migration and provide mechanisms useful in a wider range of circumstances. Statically partitioned systems could also benefit from better language support. Microsoft’s prototype Volta tool offered a single-source programming model for writing client-server applications [15]. The developer writes a single application, with annotations as to which methods run on the client or server. The Volta compiler partitions the program into two executables, a C# one for the server and a JavaScript one for the client. Similar programming models could simplify the development of cloud applications by providing developers with a higher-level abstraction of their computation.

8. Defect detection. Software defect detection has made considerable progress over the past decade in finding low-level bugs in software. The tools resulting from this effort are valuable to cloud computing, but are far from sufficient. Few tools have looked for bugs in complex systems built from autonomous, asynchronous components. Although this domain appears similar to reactive systems, the complexity of cloud services presents considerable challenges in applying techniques from this area.

9. High-level abstractions. Google’s Map-Reduce and Microsoft’s Dryad are two higher-level programming models that hide much of the complexity of writing a server-side analytic application [16, 17]. A simple programming model hides much of the complexity of data distribution, failure detection and notification, communication, and scheduling. It also opens opportunities for optimizations such as speculative execution. These two abstractions are intended for code that analyzes large amounts of data. There is a pressing need for similarly abstract models for writing distributed client-server applications and web services.
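To make the appeal of such abstractions concrete, here is a minimal word-count sketch in Java (our illustration, not the API of Map-Reduce or Dryad). The programmer supplies only a map step and a reduce step; a framework of this kind applies the same two steps to partitions of the input spread over thousands of machines and merges the partial results, keeping data distribution, failure handling, and scheduling behind this narrow interface.

    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class WordCount {
        // Map step: split each document into words.
        // Reduce step: sum the occurrences of each word.
        static Map<String, Long> count(List<String> documents) {
            return documents.stream()
                .flatMap(doc -> Arrays.stream(doc.split("\\s+")))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
        }

        public static void main(String[] args) {
            // Locally this runs on one machine; a map-reduce framework would
            // shard 'documents' across workers and merge the partial maps.
            System.out.println(count(List.of("the cloud", "the grid and the cloud")));
        }
    }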
Programming Clouds
5
This list of open problems is not exhaustive, but instead is a starting point for research directly applicable to problems facing developers of cloud computing applications.
2 Orleans

Orleans is a project under development in the Cloud Computing Futures (CCF) group in Microsoft Research. Its goal is to achieve significant improvements in the productivity of building cloud computing applications. Orleans specifically addresses the challenges of building, deploying, and operating very large cloud applications that encompass thousands of machines in multiple datacenters, constantly evolving software, and large teams to construct, maintain, and administer these properties. At a coarse level, Orleans consists of three interdependent components:
• Programming model
• Programming language and tools
• Runtime
Software for a cloud application, both the portion that runs on servers in a data center and the part that runs on clients, will be written in DC#, an extended version of C# that provides explicit support for the Orleans programming model. Orleans tools help a developer build reliable code by providing static and dynamic defect detection and test tools. Application code runs on the Orleans run-time system, which provides robust, tested implementations of the abstractions needed for these systems. These abstractions in turn execute on Azure, Microsoft’s data center operating system.

2.1 Design Philosophy

Orleans is frankly a prescriptive system: it strongly encourages the use of software architectures and design patterns that have proven themselves in practice. Because Orleans targets large-scale cloud computing, the key criterion for adopting a principle is that it results in a scalable, resilient, reliable system. Cloud software is scalable if the system can grow to accommodate a steadily increasing number of clients without requiring major rewrites, even when the increase in volume spans multiple orders of magnitude. The common practice today is to plan on several complete rewrites of a system as an internet property grows in popularity, even though there are multiple examples of scalable internet properties whose design principles are widely known. Today’s general-purpose programming languages and tools provide little or no support for these principles, so the burden of scalability is shifted to developers; consequently, most new enterprises choose short-term expediency to get their websites up quickly. A system is resilient if it can tolerate failures in its components: the computers, communication network, other services on which it relies, and even the data center in which it runs. Tolerating a failure requires the system to detect it, respond to it in a manner that minimizes the effect of the failure on unrelated components and clients, restore service when possible by using other resources, and resume execution when the failure is corrected.
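As a toy illustration of this definition (our sketch in Java, not Orleans code), the wrapper below tolerates the failure of a primary store: it detects the failure, responds by routing requests away from the faulty component, restores service from a replica, and resumes use of the primary once a later probe succeeds.

    import java.util.function.Supplier;

    // Minimal sketch of failure toleration: detect, respond, restore, resume.
    class ResilientLookup<T> {
        private final Supplier<T> primary, replica;
        private boolean primarySuspected = false;
        private int calls = 0;

        ResilientLookup(Supplier<T> primary, Supplier<T> replica) {
            this.primary = primary;
            this.replica = replica;
        }

        T get() {
            // Occasionally probe a suspected primary so normal service can resume.
            if (!primarySuspected || ++calls % 100 == 0) {
                try {
                    T value = primary.get();
                    primarySuspected = false;   // resume: the primary answered
                    return value;
                } catch (RuntimeException e) {  // detect the failure
                    primarySuspected = true;    // respond: stop routing to it
                }
            }
            return replica.get();               // restore service from a replica
        }
    }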
The distributed systems community has studied techniques for building scalable, resilient software systems for many years. A small number of abstractions have proven their value in building these systems: asynchronous communications and software architecture; data partitioning; data replication; consensus; and consistent, systematic design policies. Orleans will build these ideas into its programming and data model and provide first-class support for them in the DC# language and tools. These abstractions by no means guarantee a well-written program or successful system; it remains true that it is possible to write a bad program in any language. However, these abstractions have proven their value in many systems and are well studied and understood, and they provide a solid basis for building resilient systems.

2.2 Centrality of Failure

In ordinary software, error-handling code is home to a disproportionate share of defects. This code is difficult to write because invariants and preconditions often are invalid after an error, and paths through this code are less well tested because they are uncommon. Distributed systems complicate error handling by introducing new failure modes, such as asynchronous communications and partial failure, which are challenging to reason about and difficult to handle correctly. Much of the difficulty of building a reliable internet property is attributable to asynchrony and failure. Distributed systems research offers some techniques for masking failures and asynchrony (e.g., Paxos), but they have significant drawbacks and are unsuitable for masking all failures in a responsive service. Paxos and other replication strategies increase the quantity of resources dedicated to a computation task by a significant (3–5x) amount. In addition, these techniques increase the time to perform an operation. Because of increased cost and latency, replication strategies must be used sparingly in scalable services. Other techniques, such as checkpoint and restart, are more successful for non-reactive computations (e.g., large-scale analytic computations implemented with map-reduce or Dryad) in which it is possible to capture the input to a portion of a computation and in which a large recovery cost is less than the far-more-expensive alternative of rerunning the entire computation. Another advantage is that it is possible to automate the failure detection and error recovery process. Programming models also have a significant influence on the correctness and resiliency of code. For example, every client making a remote procedure call (RPC) has to deal with three possibilities: the call succeeds and the client knows it; the call fails and the client knows it; the call times out and the client does not know whether it succeeded or failed. In more sophisticated models that allow simultaneous RPC calls, complexity further increases when calls complete in arbitrary orders. Complicating this reasoning is the syntactic similarity of an RPC call and a conventional call, which encourages a developer to conflate the two, despite their vast difference in cost and semantics. For these reasons, undisciplined use of RPC has proven to be a bad abstraction for building distributed systems.

2.3 Orleans Programming Model

The Orleans programming model is inherently more resilient. An application is composed of loosely coupled components, each of which executes in its own failure
container. In Orleans, these components are called grains. A grain consists of a single-threaded computation with its local state. It can fail and be restarted without directly affecting the execution of any other grain—though it may indirectly affect a dependent grain that cannot respond appropriately to its failure. All communication between grains occurs over channels: higher-order (i.e., a channel can be sent over a channel), strongly typed paths for sending messages between grains. The code within a grain is inherently asynchronous, to deal with the unpredictable arrival of messages across multiple channels or the unpredictable ordering of messages between asynchronous services. This model exposes the reality of a distributed system (communication via messages that arrive at unpredictable times) but constrains it to single-threaded, isolated containers to simplify reasoning about and analyzing code. Grains are not distributed objects. The differences between the two models are fundamental. Orleans does not provide a pointer or reference to a grain, nor do grains reside in a global address space. A computation communicates with a grain through a channel, which is a capability, not a reference. A common channel allows two grains to communicate according to the channel’s protocol. However, the channel does not uniquely identify either grain, since channels can be passed around. Nor does a channel identify the location of a grain, which can migrate between machines while the channel is active. Moreover, interactions between grains are asynchronous, not RPC. One grain can request that another grain perform an operation by sending a message (which could be wrapped in syntactic sugar to look like a method invocation). The receiving grain has the freedom to process this request in any order with respect to its ongoing computations and other requests. When the operation completes, the grain can send back its result. In general, the first grain will not block waiting for this value, as it would for a method call, but instead will process other, concurrent operations. An important property of a grain is that it can migrate between computers. Migration allows Orleans to adaptively execute a system: to reduce communication latency by moving a computation closer to a client or data resource, to increase fault tolerance by moving a computation to a less tightly coupled system, and to balance the load among servers.
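The preceding description can be pulled together in a small sketch. The Java below uses our own names and shapes (Orleans applications are written in DC#, and this is not its syntax): a grain is a single-threaded loop over a channel-like mailbox, its state is touched by no other thread, sends return immediately, and replies arrive later through a future rather than through an RPC return value.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.LinkedBlockingQueue;

    // Sketch of a grain: single-threaded state plus an asynchronous mailbox.
    class CounterGrain implements Runnable {
        record Increment(int amount, CompletableFuture<Integer> reply) {}

        private final BlockingQueue<Increment> mailbox = new LinkedBlockingQueue<>();
        private int count = 0;               // local state, touched only by this grain

        CompletableFuture<Integer> send(int amount) {  // channel-style send
            CompletableFuture<Integer> reply = new CompletableFuture<>();
            mailbox.add(new Increment(amount, reply));
            return reply;                    // caller continues; result arrives later
        }

        public void run() {                  // one message at a time: no data races
            try {
                while (true) {
                    Increment msg = mailbox.take();
                    count += msg.amount();
                    msg.reply().complete(count);  // asynchronous response
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

A sender attaches a continuation to the returned future instead of blocking on it, matching the point above that a grain may process requests in any order while its callers keep working.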
Grains encourage an SPMD (single program, multiple data) style of programming. The same computation (code) runs in all grains of a particular type, each grain’s computation executes independently of other grains, and the computations are initiated at different times. However, it is also possible to use grains to implement a dataflow programming model. In this case, a grain is a unit of computation that accepts input and sends results across channels. Dataflow is appropriate for streaming computation and can achieve the scalability of asynchronous data parallelism by replicating dataflow graphs and computations. What is the appropriate size for a grain? In today’s scalable services, it is necessary to partition the data manipulated by the service at a fine granularity, to allow for rebalancing in the face of load and usage skew. For example, code acting on behalf of a Messenger user does not assume it is co-located with another Messenger user, and it must expect the location of a user’s data to change when a server is added or removed. Similar properties hold for Hotmail users’ address books, subscriptions in Live Mesh’s pub-sub service, ongoing meetings in Office Communications Server, rows in Google’s BigTable, keys in Amazon’s Dynamo, etc. With this fundamental
assumption, a system can spread a large and varying collection of data items (e.g., a user’s IM presence) across a large number of servers, even across multiple data centers. Though partitioning by user is illustrative, grains can represent many other entities. For example, a user’s mailbox may contain grains corresponding to mail messages.

2.4 Orleans Data Model

Data in cloud computing applications exists in a richer, more complex environment than in non-distributed applications. This environment has a number of orthogonal dimensions. Unlike the local case, a single model does not meet all needs. Different grains will require different guarantees, and the developer must assume responsibility for selecting the properties that match the importance of the data, the semantics of operations, and the performance constraints on the system. Orleans will implement a variety of different types of grains that support the different models for the data they contain, so an application developer can declare the properties of a particular grain and expect the system to implement its functionality. Data can be persistent, permitting it to survive a machine crash. Changes to the data are written to durable storage, and Orleans can keep independent copies of the data on distinct computers (or data centers) to increase availability in the face of resource failures. Replicating the data among machines introduces the issue of consistency among the replicas. Strong consistency requires the replicas to change simultaneously, while weaker models tolerate divergence among the copies. Within a grain, Orleans supports a simple, local concurrency model. Data local to the grain is only modified by code executing in the grain, and execution is single-threaded, so from the perspective of this code the execution model is mostly sequential. However, when code for an operation ends and yields control back to the grain, other operations can execute and modify the grain’s local state, so a developer cannot make assumptions across turns in a grain.
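One consequence of this turn-based model is worth illustrating (again in our Java, not DC# syntax): state read in one turn may be stale by the time a later turn runs, so code must not carry assumptions across an asynchronous boundary.

    import java.util.concurrent.CompletableFuture;

    // Sketch: a grain operation split into two turns by an asynchronous call.
    class MailboxGrain {
        private int unread = 0;   // grain-local state, single-threaded access

        CompletableFuture<Void> addNewMail(CompletableFuture<Integer> fetch) {
            int seenBefore = unread;         // turn 1: read state, await the fetch
            return fetch.thenAccept(n -> {
                // Turn 2: other operations may have run meanwhile and changed
                // 'unread', so 'seenBefore' must not be treated as current.
                unread += n;                 // update from the fresh value only
            });
        }
    }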
Orleans does not impose a single model on the operations exported by a grain. The semantics of concurrent operations has been formalized in numerous ways, and different models (e.g., sequential consistency, serializability, linearizability) offer varying tradeoffs among simplicity, generality, and efficiency. Orleans will need to provide the support that enables a developer to implement these models where appropriate.

3 Conclusion

Until recently, only a handful of people had ever used more than one computer to solve a problem. This is no longer true, as search engines routinely execute a query across a thousand or so computers. Cloud computing is the next step into a world in which computation and data are no longer tightly tied to a specific computer, and it is possible to share vast computing resources and data sets to build new forms of computing that go far beyond the familiar desktop or laptop PCs. Software development for the cloud offers many new (and some old) challenges that are central to research in programming models, languages, and tools. The language and tools community should embrace this new world as a fertile source of new challenges and opportunities to advance the state of the art.
References

1. Adya, A., Howell, J., Theimer, M., Bolosky, W.J., Douceur, J.R.: Cooperative Task Management without Manual Stack Management or, Event-driven Programming is Not the Opposite of Threaded Programming. In: Proceedings of the USENIX 2002 Conference, pp. 289–302. Usenix, Monterey (2002)
2. Ousterhout, J.: Why Threads are a Bad Idea (for most purposes). In: Proceedings of the 1996 USENIX Technical Conference. Usenix, San Diego (1996)
3. von Behren, R., Condit, J., Zhou, F., Necula, G.C., Brewer, E.: Capriccio: Scalable Threads for Internet Services. In: Proceedings of the 19th ACM Symposium on Operating Systems Principles, pp. 268–281. ACM, Bolton Landing (2003)
4. Larus, J.: Spending Moore’s Dividend. Communications of the ACM 52, 62–69 (2009)
5. Larus, J., Kozyrakis, C.: Transactional Memory. Communications of the ACM 51, 80–88 (2008)
6. Bocchino Jr., R.L., Adve, V.S., Adve, S.V., Snir, M.: Parallel Programming Must Be Deterministic by Default. In: First USENIX Workshop on Hot Topics in Parallelism. Usenix, Berkeley (2009)
7. Hunt, G., Larus, J.: Singularity: Rethinking the Software Stack. ACM SIGOPS Operating Systems Review 41, 37–49 (2007)
8. Armstrong, J.: Programming Erlang: Software for a Concurrent World. The Pragmatic Bookshelf, Raleigh (2007)
9. Fähndrich, M., Aiken, M., Hawblitzel, C., Hodson, O., Hunt, G., Larus, J.R., Levi, S.: Language Support for Fast and Reliable Message Based Communication in Singularity OS. In: Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems, Leuven, Belgium, pp. 177–190 (2006)
10. Barroso, L.A., Hölzle, U.: The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, vol. 6. Morgan & Claypool, San Francisco (2009)
11. Weimer, W., Necula, G.C.: Exceptional Situations and Program Reliability. ACM Transactions on Programming Languages and Systems 30, 1–51 (2008)
12. Lenharth, A., Adve, V.S., King, S.T.: Recovery Domains: An Organizing Principle for Recoverable Operating Systems. In: Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 49–60. ACM, Washington (2009)
13. DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., Vogels, W.: Dynamo: Amazon’s Highly Available Key-value Store. In: Proceedings of the 21st ACM SIGOPS Symposium on Operating Systems Principles, pp. 205–220. ACM, Stevenson (2007)
14. Gray, J.: Distributed Computing Economics. Microsoft Research, Redmond, WA, p. 6 (2003)
15. anon.: Volta Technology Preview from Microsoft Live Labs Helps Developers Build Innovative, Multi-Tiered Web Applications with Existing Tools, Technology. Microsoft Press Pass (2007)
16. Dean, J., Ghemawat, S.: MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM 51, 107–113 (2008)
17. Isard, M., Budiu, M., Yu, Y., Birrell, A., Fetterly, D.: Dryad: Distributed Data-parallel Programs from Sequential Building Blocks. In: Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, pp. 59–72. ACM, Lisbon (2007)
Mining Opportunities for Code Improvement in a Just-In-Time Compiler

Adam Jocksch¹, Marcel Mitran², Joran Siu², Nikola Grcevski², and José Nelson Amaral¹

¹ Department of Computing Science, University of Alberta, Edmonton, Canada
{ajocksch,amaral}@cs.ualberta.ca
² IBM Toronto Software Laboratory, Toronto, Canada
Abstract. The productivity of a compiler development team depends on its ability not only to design effective solutions to known code generation problems, but also to uncover potential code improvement opportunities. This paper describes a data mining tool that can be used to identify such opportunities based on a combination of hardware-profiling data and compiler-generated counters. This data is combined into an Execution Flow Graph (EFG), and then FlowGSP, a new data mining algorithm, finds sequences of attributes associated with subpaths of the EFG. Many examples of important opportunities for code improvement in the IBM Testarossa compiler are described to illustrate the usefulness of this data mining technique. This mining tool is especially useful for programs whose execution is not dominated by a small set of frequently executed loops. Information about the amount of space and time required to run the mining tool is also provided. In comparison with manual search through the data, the mining tool saved a significant amount of compiler development time and effort.
1 Introduction
Compiler developers continue to face the challenges of accelerated time-to-market and significantly reduced release cycles for both hardware and software. Microarchitectures continue to grow in numbers, complexity, and diversity. In this evolving technological environment, commercial-compiler development teams must discover and rank the next set of opportunities for code transformations that will provide the highest performance improvement per development cost ratio. The discovery of opportunities for profitable code transformations in large enterprise applications presents additional challenges. Traditionally, compiler developers have relied on the intuition that the code that is relevant for performance improvement is located in easily identifiable, frequently executed regions of the code — often called hot loops. However, many enterprise applications do not exhibit discernible regions of frequently executed code. Rather, these applications exhibit a flat profile: thousands of methods are invoked along an execution path, and no single method accounts for a significant portion of the
execution time — even though a typical transaction executes millions of instructions. Thus, focusing development effort on any single method provides negligible overall performance improvement. However, these applications may display code patterns that appear repeatedly throughout the code base. Even though no single instance of such a pattern is executed frequently, the aggregated run time of the pattern may be significant. Applications with flat profiles are becoming increasingly important for commercial compilers that are used to generate code for middleware and enterprise information-technology (IT) infrastructure. Thus, a challenge when developing a compiler for applications with flat profiles is to discover code patterns whose aggregated execution time is significant so that development efforts can be focused on improving the code generation for such patterns. This paper describes a data mining infrastructure, based on the recently developed FlowGSP algorithm [13], which can be used for automatic analysis of code compiled by the IBM Testarossa Just-in-Time (JIT) Compiler [8]. This infrastructure was used to discover patterns in the code generated for applications running in the IBM WebSphere Application Server and for SPECjvm2008 [20] running under Linux for System z [22,19]. WebSphere Application Server is a fully compliant Java Enterprise Edition (JEE) application server written in Java code [11]. This paper uses the DayTrader Benchmark in the WebSphere Application Server [7]. This benchmark produces a typical WebSphere Application Server profile reporting the compilation of thousands of methods, with no method representing more than 2% of the total execution time. For instance, cache misses represent 12% of the overall run time in one run of a certain application in the application server. But to account for 75% of those misses requires the aggregation of misses from 750 different methods [8]. SPECjvm2008 exemplifies the growing variety of industry standards that are quickly expanding the scope of benchmarks. The SPECjvm2008 suite comprises more than double the number of benchmarks that were in its predecessor, SPECjvm98 [18]. Some of the benchmarks in the newer suite have flat profiles, making the analysis and identification of opportunities for code improvement more difficult, more tedious and more indeterminate. The IBM Testarossa JIT compiler ships as part of the IBM Developer Kit for Java, which powers thousands of mission-critical applications on everything from embedded devices, to desktops, to high-end servers. The IBM Testarossa JIT is a state-of-the-art commercial compiler that offers a very complete set of traditional OO-based and Java-based optimizations. As a dynamic compiler, Testarossa is also equipped with a sophisticated compilation control system for online feedback-directed re-compilation [21]. The analysis presented in this paper was performed on Linux for System z. System z10 is the latest and most powerful incarnation of IBM’s mainframe family, which continues to provide the foundation for IT centers for many of the world’s largest institutions. The System z10 processor has a 4.4 GHz dual-core super-scalar pipeline, executes instructions in order, and can be characterized as an address-generation-interlocked pipeline. This processor is a complex
instruction set computer with a rich set of register-to-register, register-to-storage, storage-to-storage, and complex branching operations, in addition to hardware co-processors for cryptography, decimal-floating-point, and Lempel-Ziv compression [22]. The System z10 processor also provides an extensive set of performance-monitoring counters that can be used to examine the state of the processor as it executes the program. The data mining infrastructure was applied to a large set of compiler attributes and hardware counters. The attributes and hardware data are organized in a directed graph representing program flow. Edge frequencies are used to represent the probabilistic flow between basic blocks. The FlowGSP algorithm is general and can mine any flow graph. A vertex in this flow graph may represent any single-entry-single-exit region, such as an instruction, a basic block, a bytecode, or a method. Attributes are associated with each vertex, and the algorithm mines for sequences of attributes along a path. The main contributions of this paper are:
– An introduction of the problem of identifying important code patterns that occur in applications with flat profiles, such as enterprise applications.
– A description of a new data mining framework that can be used to discover important opportunities for code generation improvement in a commercial dynamic compiler environment.
– A demonstration of the effectiveness of the data mining tool through the narrative of several discoveries in the code generated for the System z architecture by the IBM Testarossa compiler.
– Statistics on space and time requirements for the usage of the mining tool in this environment. This information should be relevant for other compiler groups that wish to implement a similar tool, as well as for researchers who wish to improve on our design.
Section 2 explains the need for the mining tool through the description of one of the important discoveries in a very common segment of code. The mining tool is described in Section 3. Several additional improvement opportunities discovered by the tool are described in Section 4. Experimental data describing the time and space requirements for the usage of the tool in the Testarossa environment is presented in Section 5. Section 6 discusses previous work related to the development of similar analysis tools.
2 Motivating Case Study
This section outlines the motivation for the use of data mining to discover patterns that account for significant execution time by describing one such pattern discovered by FlowGSP. The data mined by FlowGSP to discover this pattern includes instruction type, execution time, cache misses, pipeline interlocks, etc. [13]. This pattern is part of the array-copy code generated by Testarossa for the System z10 platform. FlowGSP identified that, in some benchmarks, more than 5% of the execution time was due to a single instruction called execute (EX).
This finding is surprising because the IBM Testarossa JIT compiler uses this instruction in only one scenario: to implement the tail-end of an array copy.¹ More specifically, a variable-length array copy is implemented with a loop that executes an MVC (Move Characters) instruction. The MVC instruction is very efficient at copying up to 256 bytes. The 256-byte copy length is encoded as a literal value of the instruction. Figure 1 shows the code generated for array copying. Any residual of the copy after the repeated execution of MVCs is handled by using the EX instruction. The EX instruction executes a target instruction out of order. Before executing the target, EX replaces an 8-bit literal value specified in the target with an 8-bit field from a register specified in EX. The overloading is done through an OR of the two bit fields. For the residual array-copy code generated by Testarossa, the register specified in EX contains the length of the residual array and the target instruction is an MVC instruction.

    Rsrc  = Address of source array;
    Rtrgt = Address of target array;
    while (Rlength >= 256) {          // bulk copy, 256 bytes per iteration
        MVC Rsrc, Rtrgt, 256
        Rsrc    = Rsrc + 256;
        Rtrgt   = Rtrgt + 256;
        Rlength = Rlength - 256;
    }
    EX Rlength, ResLabel              // execute the out-of-line MVC below,
    ...                               // overloading its length literal with Rlength
    ResLabel: MVC Rsrc, Rtrgt, 0

Fig. 1. Pseudo-assembly code for array copy
After the data mining tool identified that 5% of the time was spent in EX, we examined the profiling data more carefully to find out that the 5% of time spent in EX is spread over several methods. Therefore, the time spent in the EX instruction would not be apparent from a study of individual methods. Moreover, part of that time is spent in the MVC instruction. Nonetheless, the EX instruction incurs significantly more data-cache and translation-look-aside-buffer (TLB) misses than expected. There are two potential reasons for this:
1. The length of many array copies is less than 256 bytes. In this case, data-cache misses would occur while fetching the source/target operands of MVC.
2. The EX instruction misses the cache upon fetching the overloaded MVC. This miss occurs because the targeted MVC instruction is located next to other instructions used by the program, and hence resides in the instruction cache.
¹ Array copies use 256-byte copy instructions; the tail-end is any final portion of the copy that is smaller than 256 bytes.
On a z10, the EX instruction needs the targeted MVC in the data cache. Moving the targeted MVC from the instruction cache to the data cache incurs an extra cost that was not apparent to the compiler designers. This discovery started an important review of the array-copy code generated by the compiler. A suitable strategy must be designed to isolate the targeted MVC from the other data values that are located around it. This strategy must take into consideration the long cache lines in the architecture.
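To give a sense of the mechanics, here is a small worked example of ours following the scheme of Figure 1; the numbers are illustrative, not measurements from the paper.

    class CopyBreakdown {
        public static void main(String[] args) {
            int length    = 700;            // bytes to copy
            int fullMoves = length / 256;   // 2: the MVC loop runs twice (512 bytes)
            int residual  = length % 256;   // 188: EX patches this length into the
                                            // out-of-line MVC and runs it once
            System.out.println(fullMoves + " bulk MVCs + 1 EX copying "
                               + residual + " bytes");
        }
    }

A copy shorter than 256 bytes skips the loop entirely and costs one EX; spread over many infrequently executed methods, such single executions are what aggregated into the observed 5% of run time.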
An important question is why a data mining tool is needed to discover such an opportunity. Could simple inspection of the hardware and compiler profiling data reveal this opportunity? Even if a developer were to spot the cache miss caused by the EX instruction, she would have no way to know that the aggregation of occurrences of EX in many infrequently executed methods amounts to a significant performance loss that needs to be addressed. Even though profile logs of code generated by this commercial compiler had been inspected by hand for many years, the issue with the use of EX and MVC for array copy had never been regarded as worthy of attention from the team. Once the mining tool reported it, one of the developers remarked: “Now we can see!”.

3 The Mining Tool
The mining tool design is based on a new data mining algorithm called FlowGSP. FlowGSP mines for subpaths in an execution flow graph (EFG). Jocksch formally defines an EFG as a directed flow graph, possibly containing cycles [13]. Each EFG vertex is annotated with a normalized weight and has an associated list of attributes. Each EFG edge is annotated with a normalized execution frequency. A subpath is of interest if either its frequency of execution, called frequency support, or its vertex weights, called weight support, are above a set threshold. A subpath is also of interest if the difference between its frequency and weight supports is higher than a difference support. FlowGSP reports sequences of attributes whose aggregated support over the entire EFG is higher than the specified supports. FlowGSP is an extension of the Generalized Sequential Pattern (GSP) algorithm, originally introduced by Agrawal et al. [1]. The main difference between the two is that GSP was designed to mine for sequences of attributes in a list of totally ordered transactions, while FlowGSP mines for sequences of attributes in subpaths of a flow graph, thus allowing a partial order between the transactions (the vertices of the EFG). Similar to GSP, FlowGSP allows for windows and gaps. A window allows attributes that occur in distinct vertices that are close in a subpath — within the specified window — to be regarded as occurring in the same vertex. A gap is the maximum number of vertices in the subpath that do not contain attributes in the sequence.
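To make this structure concrete, the following minimal Java sketch shows the shape of the graph being mined; the class and attribute names are ours, not the tool's.

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Sketch of an execution flow graph (EFG) vertex and edge.
    class Vertex {
        final Set<String> attributes = new HashSet<>(); // e.g. "EX", "DCACHE_MISS"
        double weight;             // normalized weight, e.g. share of execution time
        final List<Edge> out = new ArrayList<>();       // successor edges
    }

    class Edge {
        final Vertex to;
        final double frequency;    // normalized execution frequency
        Edge(Vertex to, double frequency) { this.to = to; this.frequency = frequency; }
    }

FlowGSP would report a sequence such as [{EX}, {DCACHE_MISS}] when the aggregated weight or frequency of the subpaths matching it exceeds the chosen support thresholds.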
[Figure: Application Source Code → Java Compiler → Application Byte Codes → JIT Compiler → {Control Flow Graph, Generated Code, Compiler Log} → Program Execution → Hardware Profile; the compiler log and hardware profile feed Data Preparation → Execution Flow Graph → FlowGSP → Mined Sequences.]

Fig. 2. Overall architecture and flow in a system that uses FlowGSP for mining
is added to the control-flow-graph representation of the program created by the compiler to produce the input for the mining tool. The Testarossa compiler comes equipped with a rich set of logging features, including the ability to report all generated machine instructions. The only modification to the compiler was to annotate each instruction with its corresponding basic block so that the log can then be transformed into an EFG. In the implementation of the mining tool, the hardware performance-counter information and the control-flow-graph data from the compiler are stored in IBM DB2 Version 9.1 Express Edition for Linux, a relational database. A relational database was chosen because the amount of input data is quite large (some applications running in the WebSphere Application Server contain over 4000 methods). A flat representation of this data could result in a very large input file with very poor random-access performance. Moreover, a relational database allows concurrent access to the data, which enables the use of a parallel implementation of FlowGSP. For the use of the mining tool reported in this paper, each vertex in the EFG represents an instruction. The weight of each instruction represents the amount of total execution time spent on that instruction. The System z operating system uses an event-based sampling mechanism: active events and the instruction under execution are recorded when the sample takes place. Instructions that occupy more cycles will be sampled more frequently, and the number of sampling hits or “ticks” is recorded on each instruction. The vertex weights are calculated by counting the number of sampling ticks on each instruction. The edge frequencies in the EFG are a measure of how many times each edge was taken during program
execution. In the case of edges that lie between basic blocks, this value can be read directly from the control flow graph in the compiler logs. For intra-basic-block edges, edge weights are assigned the frequency of the basic block in which they reside. Both edge and basic-block frequencies in the control flow graph are obtained by the compiler through counters inserted in the JVM interpreter. Each vertex is assigned attributes based on the corresponding instruction's characteristics or events observed on the instruction in the hardware profile data. Examples of attributes include: the opcode, whether an instruction-cache miss was observed, and whether the instruction caused a TLB miss. In this application FlowGSP mines for sequences of attributes that occur in subpaths of the EFG, but this search is based on the edge frequencies collected by the compiler. Precise path execution frequencies cannot be derived from edge frequencies [2]. Therefore, the results produced by the mining tool are an approximation. The support reported for a sequence of attributes represents the maximal possible execution of that path that could have occurred based on the edge-frequency information available [13]. FlowGSP is a general flow mining algorithm that can be applied to any flow graph. For instance, each vertex of the EFG could represent any single-entry/single-exit region, including a Java bytecode, a basic block, or an entire method. The vertex weights and edge frequencies would have to be computed accordingly.
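For concreteness, the following Python sketch shows one plausible in-memory layout for such an EFG. The class and field names are our own illustration (the actual tool keeps this data in DB2); they are not definitions from the paper:

from dataclasses import dataclass, field

@dataclass
class EFGVertex:
    """One instruction: weight = normalized sampling ticks, plus attributes."""
    instr_id: int
    weight: float                                  # fraction of total sampling ticks
    attributes: set = field(default_factory=set)   # e.g. {"opcode=MVC", "icache-miss"}

@dataclass
class EFGEdge:
    src: int
    dst: int
    frequency: float                               # normalized execution frequency

@dataclass
class EFG:
    vertices: dict = field(default_factory=dict)   # instr_id -> EFGVertex
    edges: list = field(default_factory=list)

# A two-instruction example: a load followed by an add that missed the TLB.
g = EFG()
g.vertices[0] = EFGVertex(0, weight=0.02, attributes={"opcode=L"})
g.vertices[1] = EFGVertex(1, weight=0.05, attributes={"opcode=A", "tlb-miss"})
g.edges.append(EFGEdge(0, 1, frequency=1.0))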
3.2 Operation of the Mining Algorithm
When the tool is run, it first recreates the control flow graph from the information taken from the compiler logs. Then, it inserts each instruction from the hardware profile into the correct basic block using the instruction's annotations. The tool constructs and mines only a single method at a time in order to match the level of granularity of the compiler; the Testarossa JIT compiles each individual method in isolation. As a consequence, FlowGSP does not discover patterns that cross method boundaries. However, this restriction is a design decision of the tool, not a limitation of the algorithm. To mine graphs containing cycles, FlowGSP does not allow a vertex that is the start vertex of a current candidate sequence to start a new sequence. Therefore a vertex within a cycle can only start a sequence the first time that it is visited. FlowGSP can detect frequent subpaths that occur over cycles but avoids looping indefinitely because the length of a sequence is bounded by a specified constant. Jocksch provides a detailed description of FlowGSP [13]. FlowGSP is an iterative generate-and-test algorithm. Each iteration creates a set of candidate sequences from the survivors of the previous generation, and then calculates their supports and tests them against the provided thresholds (discussed in Section 3.3). Each iteration discovers longer sequences in the data. Execution terminates when either a specified number of iterations have completed or no new candidate sequences meet the minimum support thresholds.
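A minimal, runnable Python sketch of this generate-and-test loop is given below. It operates on a toy three-vertex linear EFG, computes only weight support, and ignores edge frequencies, windows, and gaps; all names and data are hypothetical, not taken from the published algorithm [13]:

# Vertices of a toy linear EFG: (normalized weight, attribute set).
VERTICES = [(0.10, frozenset({"load"})),
            (0.25, frozenset({"load", "cache-miss"})),
            (0.15, frozenset({"store"}))]

def weight_support(seq):
    """Best total vertex weight over all subpaths matching `seq`."""
    best = 0.0
    for start in range(len(VERTICES) - len(seq) + 1):
        if all(attrs <= VERTICES[start + i][1] for i, attrs in enumerate(seq)):
            best = max(best, sum(VERTICES[start + i][0] for i in range(len(seq))))
    return best

def flowgsp(min_support=0.2, max_iters=3):
    alphabet = sorted({a for _, attrs in VERTICES for a in attrs})
    candidates = [[{a}] for a in alphabet]        # length-1 seed sequences
    mined = []
    for _ in range(max_iters):
        survivors = [s for s in candidates if weight_support(s) >= min_support]
        if not survivors:
            break                                 # no candidate met the threshold
        mined.extend(survivors)
        # generate step: extend each survivor by one single-attribute set
        candidates = [s + [{a}] for s in survivors for a in alphabet]
    return mined

print(flowgsp())   # e.g. [[{'cache-miss'}], [{'load'}], [{'load'}, {'cache-miss'}], ...]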
3.3 Support Thresholds for Mining
FlowGSP accepts a number of parameters that can adjust the type and quantity of sequences that are discovered. FlowGSP takes a maximal support threshold and a differential support threshold. If the support of a sequence does not meet either of these thresholds, then the sequence is excluded from further mining. FlowGSP also accepts a maximum allowable gap size and window size. The maximum gap size determines how much space is allowed between each part of a sequence, and the maximum window size determines how many vertices to consider when searching for one part of a sequence. Table 1 lists the parameters used in the experimental evaluation for both the SPECjvm2008 benchmarks and the DayTrader 2.0 benchmark in the WebSphere Application Server. The support values for the application server are lower than the corresponding values for the SPECjvm2008 benchmarks because the application server is orders of magnitude larger than any of the SPECjvm2008 benchmarks and has an extremely flat profile. The System z10 instructions are grouped into pairs for execution. Therefore, events that occur on one instruction of a pair can sometimes also appear on the other instruction. A window size of one is used to group paired instructions together so that more accurate patterns can be discovered.

Table 1. FlowGSP parameters used during this study

Parameter        crypto  compiler  sunflow  montecarlo  xml  serial  WebSphere
Maximal support  1%      7%        7%       7%          15%  7%      1%
Diff. support    1%      7%        7%       7%          15%  7%      1%
Gap size         1       0         0        0           0    0       0
Window size      1       0         1        1           1    1       1
Iterations       5       5         5        5           5    5       5
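The window and gap parameters can be pictured with a toy matcher. The sketch below uses illustrative semantics of our own (see [13] for the precise definitions); the z10 pairing example shows why a window size of one is useful:

def matches(path, seq, window=0, gap=0):
    """Does the attribute sequence `seq` occur along `path` (per-vertex sets)?
    window: a set in `seq` may be satisfied by the union of attributes on
            up to window+1 consecutive vertices;
    gap:    up to `gap` vertices may be skipped before each set."""
    def match_from(j, k):
        if k == len(seq):
            return True
        for skip in range(gap + 1):
            i = j + skip
            merged = set()
            for w in range(window + 1):
                if i + w >= len(path):
                    break
                merged |= path[i + w]
                if seq[k] <= merged and match_from(i + w + 1, k + 1):
                    return True
        return False
    return any(match_from(start, 0) for start in range(len(path)))

# z10 executes instructions in pairs, so an event can land on either
# instruction of a pair; window=1 lets one set match across the pair.
path = [{"add"}, {"cmp"}, {"branch", "mispredict"}]
print(matches(path, [{"cmp", "branch"}], window=1))   # True
print(matches(path, [{"cmp", "branch"}], window=0))   # False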
4 Opportunities Discovered
Before the development of the data-mining framework, significant development resources had been invested in the search for performance-improvement opportunities in applications running in the WebSphere Application Server. This investment resulted in many observations about potential opportunities for performance improvement. Therefore, a first effort to test the FlowGSP algorithm, and to build confidence in the compiler development team about the efficacy of the framework, was a set of acid tests to find out if data mining could discover the opportunities for code improvement that were already known to the team. FlowGSP performed extremely well in these tests: it identified all the patterns that were listed by the developers. Examples of these patterns include:

1. A high correlation between data-cache misses, TLB misses, and instruction-cache misses. Consultation with hardware experts led to the observation that the page table is loaded through the instruction cache, which explained the
unusual correlation. After FlowGSP confirmed and quantified this correlation, large pages (1 MB instead of 4 KB) were used to reduce the number of TLB misses, resulting in a performance improvement of 3% on applications running in the WebSphere Application Server.

2. A high incidence of instruction-cache misses on entry to JIT code methods. These are cold cache misses for which effective prefetching is a challenge because of dynamic method dispatching. This observation led to additional efforts for inlining and code-cache organization by the compiler team, as well as to discussions on how to mitigate the cache misses in future hardware releases.

3. A high correlation between branch misprediction and instruction-cache misses on indirect branches, with a higher-than-expected occurrence of these events. A large volume of indirect branches overflows the branch-table buffers. The compiler team implemented code transformations to convert indirect branches into direct branches through versioning. Moreover, the hardware team was engaged to look for solutions to mitigate this issue in future hardware.

The discovery of these issues through manual inspection of performance-monitor data by analysts required orders of magnitude more time and effort than the analysis with the data-mining tool based on FlowGSP. Moreover, the manual approach is not easy to reproduce for a new data set and is less deterministic. Once the development team was confident about the results produced by the mining tool, they started examining the output of the tool to find new opportunities for code improvement. The time spent in the EX instruction in array copies described in Section 2 is one such opportunity. The team discovered most of the new opportunities when applying the tool to profiling data collected from newer benchmarks, such as SPECjvm2008. While extensive development effort has been dedicated to discovering opportunities in applications running in the WebSphere Application Server over many years, these newer benchmarks have received relatively less attention from the compiler development team. Some of the new discoveries are listed here:

– Stores account for a majority of data-cache directory misses [14] in all SPECjvm2008 benchmarks. This is unexpected because the load-to-store ratio in programs is typically on the order of 5:1. Moreover, intuition would indicate that a program writes to locations from which it has read recently. Discussions and analysis are still under way to better understand this ratio. The serial benchmark spends three times more time servicing directory lookups for stores than for loads. This benchmark is highly parallel in nature, which, on the surface, would lead developers to dismiss cache contention as a concern. The trends presented by FlowGSP, which would have remained unobserved under manual inspection, have been instrumental in forcing developers to reconsider cache contention as a possible concern.

– Address-generation interlock (AGI) accounts for more than 10% of the execution time in some benchmarks. In the System z architecture, an AGI occurs when the computation of the address required by a memory-access
instruction has not completed by the time that the instruction needs it [22]. In some cases, such as in a small pointer-chasing loop, AGIs are difficult to avoid. The mining tool's finding is helping to focus analysis on this benchmark, and the team is planning a review of the instruction scheduling in the compiler to reduce the impact of AGIs on execution time.

– Branch misses account for 9% of execution time in montecarlo, a benchmark from the SPECjvm2008 suite. This is unexpected because the execution of this benchmark is dominated by a single method with several hot loops and the benchmark has very good instruction locality. This result led to further analysis that uncovered a limitation in the hardware's instruction fetch unit: the unit stops predicting branches when it cannot detect any previously taken branches within a given window further down the instruction stream. A consequence of this limitation is that when the compiler unrolls a loop, it needs to take into account the size of this window to ensure that the loop back edge is predicted correctly. The compiler team is currently re-examining the loop-unrolling strategy to take into account the penalty for branch misses.

Experienced compiler developers will understand the value of the observations above in providing direction to a compiler development team. These observations focus on the z/Architecture and the Testarossa compiler, and are based on mining data from the SPECjvm2008 benchmark suite. A similar approach can be applied to most combinations of compiler, architecture, and application. Moreover, the mining tool can be used to discover opportunities that might be specific to important applications.
5 Experimental Data on the Usage of the Mining Tool
This section presents statistics on the usage of storage and on the time required to mine several benchmarks. The goal of this section is to provide developers with an idea of the resources needed to deploy such a tool, and to encourage researchers to come up with improvements on our tool design. The information reported here includes the size of the input data, the overall running time, the number of sequences generated, and the format of the rules output by the tool.
5.1 Profiling and Storage Requirements
This experimental evaluation uses the DayTrader 2.0 benchmark in the WebSphere Application Server 7.0 and programs from the SPECjvm2008 benchmark suite. All programs are run using the IBM Testarossa JIT compiler. The WebSphere Application Server workload is DayTrader 2.0 and the server is run for 5 minutes once a stable throughput has been achieved. This delay is necessary to ensure that the Testarossa JIT has compiled the majority of the methods in the application server to native code. The throughput of the application server increases as methods are compiled to native code. Therefore, stabilization of throughput is an indication that the majority of the code being executed has
been natively compiled. A hardware profile of 5 minutes of execution of the WebSphere Application Server results in roughly 37 MB of compressed data. The same run produces a 5.9 GB uncompressed, plain-text compiler log.² At the time of this writing, the Testarossa JIT does not have an option to output logs in a compressed format. Compressing the compiler-generated log using gzip reduces its size to around 700 MB.

Table 2. SPECjvm2008 benchmarks studied

Benchmark           # of Methods to Account  # of Method   # Unique Methods
                    for 50% of time          Compilations  Invoked
compiler.compiler   60                       3659          7113
compiler.sunflow    55                       4009          6946
crypto.signverify   2                        1219          4654
scimark.montecarlo  1                        703           4077
serial              8                        2967          7645
xml.transform       25                       5374          12430
The SPECjvm2008 benchmarks are profiled for a period of 4 minutes after a 1-minute warm-up time. Only a minute is required until most of the benchmark code is executed natively because the SPECjvm2008 benchmarks used in this study are significantly smaller than applications running in the WebSphere Application Server. The six SPECjvm2008 benchmarks examined in this study are listed in Table 2. The data in this table provides an indication of how flat the execution profile of each benchmark is by listing the number of methods that need to be examined to account for 50% of the execution time.³ The table also shows the total number of method compilations and the total number of unique methods that are invoked when the benchmark is executed. These benchmarks were chosen because they form a representative sample of the SPECjvm2008 benchmark suite and they produce both flat and non-flat profiles. Running these benchmarks for 5 minutes results in 7 MB of hardware profiling data per benchmark on average, and an average uncompressed compiler log with 1.4 GB of data. The benchmark with the largest hardware profile is compiler.compiler, which produces 12 MB of data. The largest compiler log has 3.3 GB of data and is produced by xml.transform. The benchmark scimark.montecarlo produces the smallest hardware profile (385 KB) and the smallest compiler log (97 MB).
5.2 Time Needed to Mine
The execution time of the tool depends on the size of the log of the program being mined and the parameters passed to the tool. FlowGSP is multi-threaded in order to exploit the resources available in multi-core architectures.
² The compiler option required to output control-flow-graph data also outputs a large volume of information that was extraneous to the mining process.
³ This measurement is an approximation because it is the number of sampling ticks in the performance monitor that is used to determine the number of methods shown in the table.
FlowGSP was run with 8 threads on a machine equipped with two AMD 2350 quad-core CPUs and 8 GB of memory. All runs were performed with the parameters outlined in Section 3.

Table 3. Running times of FlowGSP, in seconds

Program                                Execution Time
WebSphere App. Server (DayTrader 2.0)  6399
compiler.compiler                      815
compiler.sunflow                       539
scimark.montecarlo                     2
xml.transform                          557
serial                                 215
crypto.signverify                      177
Table 3 lists the running time of FlowGSP on both the DayTrader 2.0 benchmark in the WebSphere Application Server and the SPECjvm2008 benchmark profiles, with execution time in seconds. The xml.transform, compiler.sunflow, serial, and scimark.montecarlo benchmarks terminated when no more candidates with support greater than the minimum threshold remained. Xml.transform and scimark.montecarlo terminated after three iterations, compiler.sunflow and serial after four iterations. Montecarlo has one small method which occupies almost 100% of total execution time; therefore the time to mine this benchmark is significantly lower. The times reported in Table 3 indicate that the mining tool based on FlowGSP can be used on a daily basis in the development of a production compiler.
5.3 Sequences Reported by Mining
FlowGSP outputs frequent sequences in the following format:

S = s1, . . . , sk

where each si, 1 ≤ i ≤ k, is a set of attributes:

si = (α1, . . . , αk)

Each sequence is accompanied by four values, which indicate the sequence's weight, frequency, maximal, and differential support. In this use of the data-mining tool the vertices of the EFG are instructions. Examples of attributes include the instruction type, the occurrence of cache misses, pipeline interlocks, branch mispredictions, the type of bytecode that originated the instruction, etc. Results are output to a plain-text file. In the experiments reported here, the DayTrader 2.0 benchmark in WebSphere produced 1286 sequences while the SPECjvm2008 data produced, on average, 64,000 sequences. The SPECjvm2008 benchmarks exhibited a very wide range in terms of the number of sequences generated.
The most sequences were discovered in the scimark.montecarlo benchmark, with roughly 291,000 sequences. On the other hand, the xml.transform benchmark had the smallest number of sequences, at around 1,900. In general, support thresholds for the SPECjvm2008 benchmarks were set generously low because this is an initial exploration of the applications of data mining in compiler development. These low thresholds ensure that no interesting sequences are overlooked. With experience the support thresholds can be increased to allow only the most interesting sequences to be reported. It could be possible in future work to automate this process based on the number of surviving sequences. We implemented a user interface to display the results of mining. This interface allows sequences to be sorted lexicographically or by any of the support metrics. A maximum and minimum support value can be specified to reduce the number of sequences displayed. The tool can also selectively display sequences based on whether they do or do not contain specific attributes. This filtering is particularly effective at reducing the number of sequences that must be examined by a compiler developer. For instance, the serial benchmark contained 16,518 sequences, but only 2,880 involved pipeline stalls due to AGIs. Ranking these resulting sequences by maximal or differential support allows quick identification of the most interesting patterns. The tool also allows the developer to specify one rule as the baseline against which all other sequences are compared. This feature allows for easy comparison of sequences with respect to the baseline sequence.
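A sketch of this kind of filtering and ranking, with a hypothetical record layout for the four reported support values (the field names are ours, not the tool's), might look as follows:

mined = [
    {"seq": ["AGI", "stall"], "weight": 0.031, "freq": 0.040,
     "maximal": 0.040, "diff": 0.009},
    {"seq": ["load"],         "weight": 0.120, "freq": 0.110,
     "maximal": 0.120, "diff": 0.010},
]

def show(records, must_contain=None, min_maximal=0.0, sort_key="maximal"):
    """Filter and rank mined sequences the way the UI described above does."""
    rows = [r for r in records
            if r["maximal"] >= min_maximal
            and (must_contain is None or must_contain in r["seq"])]
    return sorted(rows, key=lambda r: r[sort_key], reverse=True)

for r in show(mined, must_contain="AGI"):
    print(r["seq"], r["maximal"], r["diff"])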
6 Related Work
This is potentially the first attempt to use data mining to discover patterns of execution that occur frequently in an application but do not necessarily occur inside loops. Work related to this approach includes performance-analysis tools, the use of performance counters in JVMs, and the search for code bloat. Optiscope is an “optimization microscope” developed to aid compiler developers in understanding low-level differences between the code generated by a compiler performing different code transformations, or between code generated by two different compilers for the same program [15]. Optiscope automatically matches up code in two hardware profiles that originated from the same region of source code. Optiscope focuses on loops. In contrast, FlowGSP focuses on finding interesting patterns within a single hardware profile and aims to discover common patterns that occur throughout the profile. The design of most existing performance-analysis tools, such as the popular Intel VTune for Intel chipsets [5], focuses on locating small regions of code that are frequently executed, to concentrate development efforts on these regions. Chen et al. try to capture the most execution time with the least amount of code [4]. Similarly, Schneider et al. use hardware performance monitors to “direct the compiler to those parts of the program that deserve its attention” [17]. Contrary to earlier work, the premise of this paper is that in some applications these parts are scattered through the code and not concentrated in smaller
regions. Hundt presents HP Caliper, a framework for developing performance-analysis tools on the Intel Itanium platform running HP-UX [10]. Similar to the approach presented here, Caliper integrates sampled hardware performance counters with compiler-generated dynamic instrumentation. Dynamic instrumentation involves changing program instructions on the fly to obtain more accurate program analysis. However, unlike our mining tool, HP Caliper does not attempt to mine the combined data for patterns. Huck et al. present PerfExplorer, a parallel performance-analysis tool [9]. PerfExplorer incorporates a number of automated data-analysis techniques such as k-means and hierarchical clustering, coefficient-of-correlation analysis, and comparative analysis. PerfExplorer targets application developers seeking to understand bottlenecks in their code, not compiler developers. Also, PerfExplorer does not search for frequent sequences in the data. Cuthbertson et al. incorporate performance-counter information into a production JVM to improve program performance [6]. They use a custom library to retrieve instruction-cache miss information on the Intel Itanium platform. This information is used to guide both object allocation and instruction scheduling in order to increase performance. They achieve an average performance increase of 2% on various Java benchmarks. Schneider et al. perform similar work using hardware counters on the Intel Itanium platform to guide object co-allocation [17]. However, these approaches can only improve the performance of existing code transformations, whereas FlowGSP is aimed at discovering opportunities for new code transformations. Also, both approaches only look at a small fraction of all available program data. It is not clear how much overhead would result from increasing the amount of data being brought into the compiler. Buytaert et al. use hardware-performance counters to both improve the accuracy and decrease the cost of hot-method detection in a production JVM [3]. Their focus is purely on improving the efficiency and accuracy of the JVM and does not provide any insights into new opportunities for code transformations. Xu et al. develop a method for profiling Java programs to identify areas of code bloat [23]. They evaluate the DaCapo benchmark suite, elements of the Java 1.5 standard library, and Eclipse 3.1, and are able to identify a number of specific opportunities to improve performance by decreasing bloat. Similarly, Novark et al. develop a tool called Hound to identify memory leaks and sources of bloat in C and C++ programs [16]. Hound was able to achieve a 14% performance increase in one of the studied benchmarks by identifying a single line of code that needed to be changed. While removing code bloat can significantly improve the performance of applications, it only addresses performance from the point of view of the application programmer. Proper use of code transformations by the compiler is equally important in increasing program performance.
7 Conclusion
In compiler and computer-architecture development, as in science in general, discovering the question to ask is often as difficult as finding the answer. Recent
developments in hardware performance-monitoring tools, and in leaner techniques to insert profiling counters in generated code, have provided developers with an unprecedented amount of data with which to examine the run-time behavior of a program. The combination of these techniques amounts to a very powerful scope. The mining tool presented in this paper is a mechanism to help focus this powerful scope on patterns that happen frequently enough to warrant the attention of compiler or hardware developers. This paper describes the methodology and the tool used for this mining task. It also presents several examples of discoveries that were made using the tool. Then, it presents statistics on the amount of space and time required to use the tool to mine the data produced by enterprise software on a high-end hardware platform with a mature compiler infrastructure. This data indicates that this methodology can be used routinely for the development of production compilers.
Acknowledgments. We are very thankful to Jane Bartik and John Rankin from the IBM Poughkeepsie campus for sharing their invaluable insight into the z/Architecture. This work was supported by an IBM Centre for Advanced Studies fellowship and by grants from the Natural Sciences and Engineering Research Council (NSERC) of Canada through its Collaborative Research and Development program.
Trademarks. The following are trademarks or registered trademarks of IBM Corporation in the United States, other countries, or both: IBM, WebSphere, z10, and DB2. The symbols ® and ™ indicate U.S. registered or common-law trademarks owned by IBM at the time of publication. Such trademarks may also be registered or common-law trademarks in other countries. Other company, product, and service names may be trademarks or service marks of others.
References

1. Agrawal, R., Srikant, R.: Mining sequential patterns. In: International Conference on Data Engineering (ICDE), pp. 3–14 (March 1995)
2. Ball, T., Mataga, P., Sagiv, M.: Edge profiling versus path profiling: the showdown. In: Symposium on Principles of Programming Languages (POPL), San Diego, CA, USA, pp. 134–148 (1998)
3. Buytaert, D., Georges, A., Hind, M., Arnold, M., Eeckhout, L., De Bosschere, K.: Using HPM-sampling to drive dynamic compilation. In: Object-Oriented Programming, Systems, Languages and Applications (OOPSLA), Montreal, Quebec, Canada, pp. 553–568 (2007)
4. Chen, H., Hsu, W.-C., Lu, J., Yew, P.-C., Chen, D.-Y.: Dynamic trace selection using performance monitoring hardware sampling. In: Code Generation and Optimization (CGO), San Francisco, CA, USA, pp. 79–90 (2003)
5. Intel Corporation: Intel VTune performance analyzer, http://software.intel.com/en-us/articles/intel-vtune-performance-analyzer-white-papers/
6. Cuthbertson, J., Viswanathan, S., Bobrovsky, K., Astapchuk, A., Kaczmarek, E., Srinivasan, U.: A practical approach to hardware performance monitoring based dynamic optimizations in a production JVM. In: Code Generation and Optimization (CGO), Seattle, WA, USA, pp. 190–199 (2009)
7. Apache Geronimo: DayTrader benchmark sample (October 2009), http://cwiki.apache.org/GMOxDOC20/daytrader.html
8. Grcevski, N., Kielstra, A., Stoodley, K., Stoodley, M., Sundaresan, V.: Java just-in-time compiler and virtual machine improvements for server and middleware applications. In: Conference on Virtual Machine Research and Technology Symposium (VM), San Jose, CA, USA, pp. 12–12 (2004)
9. Huck, K.A., Malony, A.D.: PerfExplorer: A performance data mining framework for large-scale parallel computing. In: ACM/IEEE Conference on Supercomputing (SC), Seattle, WA, USA, p. 41 (2005)
10. Hundt, R.: HP Caliper: A framework for performance analysis tools. IEEE Concurrency 8(4), 64–71 (2000)
11. IBM Corporation: WebSphere Application Server (October 2009), http://www-01.ibm.com/software/websphere/
12. Jackson, K.M., Wisniewski, M.A., Schmidt, D., Hild, U., Heisig, S., Yeh, P.C., Gellerich, W.: IBM System z10 performance improvements with software and hardware synergy. IBM J. of Res. and Development 53(1), Paper 16:1–8 (2009)
13. Jocksch, A.: Data mining flow graphs in a dynamic compiler. Master's thesis, University of Alberta, Edmonton, AB, Canada (October 2009)
14. Mak, P., Walters, C.R., Strait, G.E.: IBM System z10 processor cache subsystem microarchitecture. IBM J. of Res. and Development 53(1), Paper 2:1–12 (2009)
15. Moseley, T., Grunwald, D., Peri, R.V.: Optiscope: Performance accountability for optimizing compilers. In: Code Generation and Optimization (CGO), Seattle, WA, USA (2009)
16. Novark, G., Berger, E.D., Zorn, B.G.: Efficiently and precisely locating memory leaks and bloat. In: Conference on Programming Language Design and Implementation (PLDI), Dublin, Ireland, pp. 397–407 (2009)
17. Schneider, F.T., Payer, M., Gross, T.R.: Online optimizations driven by hardware performance monitoring. In: Conference on Programming Language Design and Implementation (PLDI), pp. 373–382 (2007)
18. Shiv, K., Chow, K., Wang, Y., Petrochenko, D.: SPECjvm2008 performance characterization. In: SPEC Workshop on Computer Performance Evaluation and Benchmarking, Austin, TX, USA, pp. 17–35 (2009)
19. Shum, C.-L.K., Busaba, F., Dao-Trong, S., Gerwig, G., Jacobi, C., Koehler, T., Pfeffer, E., Prasky, B.R., Rell, J.G., Tsai, A.: Design and microarchitecture of the IBM System z10 microprocessor. IBM J. of Res. and Development 53(1), Paper 1:1–12 (2009)
20. Standard Performance Evaluation Corporation: SPEC: The Standard Performance Evaluation Corporation, http://www.spec.org/
21. Sundaresan, V., Maier, D., Ramarao, P., Stoodley, M.: Experiences with multi-threading and dynamic class loading in a Java just-in-time compiler. In: Code Generation and Optimization (CGO), New York, NY, USA, pp. 87–97 (2006)
22. Webb, C.F.: IBM z10: The next generation mainframe microprocessor. IEEE Micro 28(2), 19–29 (2008)
23. Xu, G., Arnold, M., Mitchell, N., Rountev, A., Sevitsky, G.: Go with the flow: profiling copies to find runtime bloat. In: Conference on Programming Language Design and Implementation (PLDI), Dublin, Ireland, pp. 419–430 (2009)
Unrestricted Code Motion: A Program Representation and Transformation Algorithms Based on Future Values⋆

Shuhan Ding and Soner Önder

Department of Computer Science, Michigan Technological University
[email protected], [email protected]
Abstract. We introduce the concept of future values. Using future values it is possible to represent programs in a new control-flow form such that on any control-flow path the data-flow aspect of the computation is either traditional (i.e., the definition of a value precedes its consumers) or reversed (i.e., the consumers of a value precede its definition). The representation hence allows unrestricted code motion since the ordering of instructions is not prohibited by data dependencies. We present a new program representation called Recursive Future Predicated Form (RFPF) which implements the concept. RFPF subsumes general if-conversion and permits unrestricted code motion to the extent that the whole procedure can be reduced to a single block. We develop algorithms which enable instruction movement in acyclic as well as cyclic regions and give examples of various optimizations in RFPF form.
1 Introduction
Code motion is an essential tool for many compiler optimizations. By reordering instructions, a compiler can eliminate redundant computations [4,9,10], schedule instructions for faster execution [17], or enable early initiation of long-latency operations, such as possible cache misses. In these optimizations, the range of code motion is limited by data and control dependencies [4,5]. Therefore, code-optimization algorithms which rely on code motion have to make sure that control and data dependencies are not violated. The ability to move code in a control-flow setting in an unrestricted manner would have several significant benefits. Obviously, having the necessary means to move instructions in an unrestricted manner while maintaining correct program semantics could enable the development of simpler algorithms for program optimization. More importantly, however, when we permit code motion beyond the obvious limits, code motion itself can become a very important tool for program analysis.
⋆ This work is supported in part by an NSF CAREER award (CCR-0347592) to Soner Önder.
In this paper, we first present the concept of future values. Future values allow a consumer instruction to be placed before the producer of its source operands. Using the concept, we develop a program representation which is referred to as Recursive Future Predicated Form (RFPF). RFPF is a new control-flow form such that on any control-flow path the data-flow aspect of the computation is either traditional (i.e., the definition of a value precedes its consumers) or reversed (i.e., the consumers of a value precede its definition). When an instruction is to be hoisted above an instruction that defines its source operands, the representation updates the data-flow aspect to become reversed (i.e., future, meaning that the instruction will encounter the definition of its source operands in a future sequence of control flow). If, on the other hand, the same instruction is propagated down, the representation will update the data-flow aspect to become traditional again. The representation hence allows unrestricted code motion since the ordering of instructions is not prohibited by the data dependencies. Of course, for correct computation, the values still need to be produced before they can be consumed. However, with the aid of a future-values-based program representation, the actual time at which this happens will appropriately be delayed. RFPF is a representation built on the principle of single assignment [3,7] and it subsumes general if-conversion [2]. In this respect, RFPF properly extends the SSA representation and covers the domain of legal transformations resulting from instruction movements. Possible transformations range from the starting SSA form, where all data-flow is traditional, to a final reduction where the entire procedure becomes a single block through upward code motion, possibly with mixed (i.e., traditional and future) data-flow. We refer to a procedure which has been reduced to a single block through code motion as being in complete RFPF. Complete RFPF expresses the program semantics without using control-flow edges except sequencing. During the upward motion of instructions, valuable information is collected and, as shown later in the paper, this information can be used to perform several sophisticated optimizations such as Partial Redundancy Elimination (PRE). Such optimizations typically require program analysis followed by code motion and/or code restructuring [12,9,4]. Our contributions in this paper are as follows: (1) We introduce the novel concept of future values which permits a consumer instruction to be encountered before the producer of its source operand(s) in a control-flow setting; (2) Using future values, we introduce the concept of future predicates which permits instruction hoisting above the controlling instructions by specifying future control flow; (3) We introduce the concept of instruction-level recursion. This concept allows loops to be represented as straight-line code and analyzed with ease. The combination of future predicates and instruction-level recursion enables predication of backward branches; (4) Using the concepts of future values, future predicates and instruction-level recursion, we develop a unified representation (RFPF) which is control-flow based, yet instructions can freely be reordered in this representation by simply comparing the instruction's predicate, source and destination variables to the neighboring instruction; (5) We illustrate that unrestricted code motion itself can be used to analyze programs for optimization opportunities.
We present a PRE example in which redundancy cannot be eliminated using code motion alone and restructuring is necessary, yet both the discovery and the optimization of the opportunity can be performed with ease; (6) We present algorithms to convert conventional programs into RFPF. These algorithms are low in complexity and, with the exception of the identification of loop headers and of the nesting of the loops in the program, they do not need additional external information. Instead, these algorithms operate by propagating instructions and predicates and use only the local information available in the vicinity of moved instructions; (7) We illustrate that for any graph with mixed-mode data-flow, there is a path through instruction reordering and control-flow node generation to convert the future data-flow in the representation back to a traditional SSA graph, or to generate code directly from the representation. In the remainder of the paper, in Section 2, we first present the concept of future values. Section 3 through Section 6 illustrate a process through which instructions can be hoisted to convert a program into RFPF while collecting the data and control dependencies necessary to perform optimizations. For this purpose, we first illustrate how the concept can be used for instruction movement in an acyclic region in Section 3. This set of algorithms can be utilized by existing optimization algorithms that need code motion by incorporating the concept of future values into them. Code motion in cyclic regions requires conversion of loops into instruction-level recursion. We introduce the concept of instruction-level recursion in Section 4. This section presents the idea of recursive predicates and illustrates how backward branches can be predicated. Next, in Section 5 we give an algorithm for computing recursive predicates. The combination of code motion in acyclic and cyclic regions enables the development of an algorithm that generates procedures in complete RFPF from a given SSA program using a series of topological traversals of the graph and instruction hoisting. Since reordering of instructions has to deal with explicit dependencies, memory dependencies pose specific challenges. We discuss the handling of code motion involving memory dependencies in Section 6. Section 7 gives examples of optimizations using the RFPF form. We discuss the conversion back into CFG in Section 8. Finally, we describe the related work in Section 9 and summarize the paper in Section 10.
2 The Concept of Future Values
Any instruction ordering must respect the true data dependencies as well as the control dependencies. As a result, an instruction cannot normally be hoisted beyond an instruction which defines the hoisted instruction's source operand(s). When such a hoisting is permitted, a future dependency results:

Definition 1. When instructions I and J are true dependent on each other and the instruction order is reversed, the true dependency becomes a future dependency and is marked on the source operand with the subscript f.

Consider the statements shown in Figure 1(a). In this example, the control first encounters instruction i1, which computes the value x, and then encounters the instruction i2, which consumes the value.
[Figure: (a) True dependence: i1: x = a + b, then i2: z = x + a. (b) Future (reversed) dependence: i2: z = xf + a, then i1: x = a + b. (c) Traditional control flow: i1: if (a < b), then i2: x = x + 1. (d) If-conversion: i1: P = (a < b), then i2: [P] x = x + 1. (e) Future control dependence: i2: [Pf] x = x + 1, then i1: P = (a < b).]

Fig. 1. The concept of future data and control dependences
In Figure 1(b), the instruction i2 has been hoisted above i1, and its source operand x has been marked to be a future value using the subscript f. If the machine buffers any instructions whose operands are future values, alongside any operand values which are not future, until the producer instruction is encountered, the instructions can be executed with proper data flow between them even though the order in which the control has discovered them is reversed. Similarly, we can represent control dependencies in future form as well. Consider Figure 1(c). In this example, i2 is control dependent on i1. In Figure 1(d) predicate P is used to guard i2, which represents the same control dependence. When the order of i1 and i2 is reversed (Figure 1(e)), predicate P becomes a future value and thus the original control dependence becomes a future control dependence. The combination of future data and control dependencies and single-assignment semantics permits unrestricted code motion. In the rest of the paper, single-assignment semantics is assumed and all the transformations maintain the single-assignment semantics. We first discuss code motion using future values in acyclic regions involving control dependencies.
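The buffering idea of Figure 1(b) can be made concrete with a toy evaluator; this is our own illustration of the execution model, not the authors' machine design:

def execute(instrs):
    """Toy evaluator: an instruction whose source is still 'future' waits in
    a buffer until its producer defines the value (single assignment assumed)."""
    env, waiting = {}, []
    def ready(ins):
        return all(src in env for src in ins["srcs"])
    for ins in instrs:
        pending = [ins]
        while pending:
            cur = pending.pop()
            if ready(cur):
                env[cur["dst"]] = cur["op"](*[env[s] for s in cur["srcs"]])
                # a new definition may release buffered consumers
                released = [w for w in waiting if ready(w)]
                waiting[:] = [w for w in waiting if not ready(w)]
                pending.extend(released)
            else:
                waiting.append(cur)        # future operand: buffer the consumer
    return env

a, b = 3, 4
prog = [  # Figure 1(b): consumer i2 appears before producer i1
    {"dst": "z", "srcs": ["x"], "op": lambda x: x + a},   # i2: z = xf + a
    {"dst": "x", "srcs": [],    "op": lambda: a + b},     # i1: x = a + b
]
print(execute(prog)["z"])   # 10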
3 Code Motion in Acyclic Code
For an acyclic control-flow graph G = ⟨s, N, E⟩, where s is the start node, N is the set of nodes and E is the set of edges, instruction hoisting involves one of three possible cases: (1) movement that does not involve control dependencies (i.e., straight-line code), (2) splitting (i.e., a parallel move to predecessor basic blocks), and (3) merging (i.e., a parallel move to a predecessor block that dominates the source blocks). Note that movement of a φ-node is a special case and normally would destroy the single-assignment property. We examine each of these cases below:

Case 1 (Basic block code motion). Consider instructions I and J. Instruction J follows instruction I in program order. If I and J are true dependent, hoisting J above I converts the true dependency to a future dependency. Alternatively, if the instructions are future dependent on each other, hoisting J above I converts the future dependency to a true dependency (Figure 1(a) and (b)).
When code motion involves control dependencies, the instruction propagation is carried out using instruction predication, instruction cloning and instruction merging. An instruction is cloned when it is moved from a control-independent block to a control-dependent block. Cloned copies then propagate along the code-motion direction into different control-dependent blocks. When cloned copies of instructions arrive at the same basic block they can be merged.

Case 2 (Splitting code motion). Consider instruction I that is to be hoisted above the block that contains the instruction. For each incoming edge ei a new block is inserted, a copy of the instruction is placed in these blocks and a φ-node is left in the position of the moved instruction (Figure 2).
[Figure: the instruction I: x1 = … in a join block is split into copies I1: x1,1,2 = … and I2: x1,2,2 = … on the two incoming edges, leaving J: x1 = φ(x1,1,2, x1,2,2) behind.]

Fig. 2. Splitting code motion

[Figure: the instruction I in a block controlled by the branch if (P) is hoisted into the branching block as the predicated instruction [¬P] I.]

Fig. 3. Merging code motion
Note that in Figure 2, when the generated copies I1 and I2 are merged back into a single instruction, the inserted φ-node can safely be deleted and the new instruction can be renamed back to x1. The two new names created during the process, namely x1,1,2 and x1,2,2, are eliminated as part of the merging process. In order to facilitate easy merging of clones, we adopt the naming convention vi,j,k where vi is an SSA name, j is the copy version number and k is the total number of copies. Generated copies can be merged when they arrive at the immediate dominator of the origin block, and in case of reduction to a single block, all copies can be merged. We discuss these aspects of merging later in Section 3.3.

Case 3 (Merging code motion). Consider instruction I that is to be hoisted into a block where the source block is control dependent on the destination block. The instruction I is converted to a predicated instruction labeled with the controlling predicate of the edge (Figure 3).
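As a toy illustration of the splitting rule (Case 2) and the vi,j,k naming convention just described, the following Python sketch works over a made-up string IR:

def split_move(instr, preds):
    """Hoist `instr` out of a join block with len(preds) predecessor edges,
    creating one copy per edge and leaving a phi-node behind."""
    k = len(preds)
    copies = [(p, f"{instr['dst']},{j},{k} = {instr['rhs']}")
              for j, p in enumerate(preds, start=1)]
    operands = ", ".join(f"{instr['dst']},{j},{k}" for j in range(1, k + 1))
    phi = f"{instr['dst']} = phi({operands})"
    return copies, phi

copies, phi = split_move({"dst": "x1", "rhs": "a + b"}, preds=["B2", "B3"])
print(copies)   # [('B2', 'x1,1,2 = a + b'), ('B3', 'x1,2,2 = a + b')]
print(phi)      # x1 = phi(x1,1,2, x1,2,2)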
3.1 Future Predicated Form
When a predicated instruction is hoisted above the instruction which defines its predicate, the predicate guarding the instruction becomes future as the predicate is also a value and the data dependence must be updated properly. Figure 4 shows a control dependent case. Instruction I is control dependent on condition a0 < b0 . When the instruction I is moved from B2 to B1, it becomes predicated and is guarded by Q (Figure 4(b)). In the next step, the instruction is hoisted above the definition of Q and its predicate Q becomes future (i.e., Qf ) (Figure 4(c)).
Unrestricted Code Motion: A Program Representation
B0
N
B0
P = d0 < e0
N
Y
N
Q = a0 < b0
Y
Y
Q = a0 < b0 if (P)
if (P)
if (Q)
N
t0 = x1 + y1
[Qf ] I: z1 = x1 + y1
Q = a0 < b0 [Q] I: z1 = x1 + y1
if (a0 < b0 )
B1
t0 = x1 + y1
t0 = x1 + y1
t0 = x1 + y1
Y
B1
B1
B1
B0
P = d0 < e0 [P ∧ Qf ] I: z1 = x1 + y1 if (P)
if (P) Y
N
Y
B0
P = d0 < e0
if (P)
if (d0 < e0 )
31
Y
N
Y
N
N
B2 B2
B2
B2
I: z1 = x1 + y1
(a) before code motion
(b) after code motion
(c) future predicate
(d) nested predicate
Fig. 4. Code motion across control dependent regions
When a predicated instruction is hoisted further, it may cross additional control-dependent regions and will acquire additional predicates. Consider Figure 4(c). Since the target instruction is already guarded by the predicate Qf, when it moves across the branch defined by P, it becomes guarded by a nested predicate (Figure 4(d)). In terms of control flow, this means that predicate P must appear, and it will appear before Q. Similarly, if P is true, then Q must also appear, since if the flow takes the true path of P the predicate Q will eventually be encountered. In other words, the conjunction operator has the short-circuit property and is evaluated from left to right. Semantically, a nested predicate which involves future predicates is quite interesting, as it defines possible control flow.
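The left-to-right, short-circuit reading of such nested predicates can be sketched as a small three-valued evaluator, where a still-future predicate yields "not yet decidable"; this is a toy model of ours, not the paper's formalism:

def eval_nested(preds, env):
    """Left-to-right, short-circuit evaluation of a nested predicate such
    as P ∧ Qf.  A predicate absent from env is still future; we model the
    undecided case with None."""
    for name in preds:
        val = env.get(name)          # None: predicate value not yet produced
        if val is None:
            return None              # must wait for future control flow
        if val is False:
            return False             # short-circuit: Q will never appear
    return True

print(eval_nested(["P", "Q"], {"P": True}))             # None: Qf still future
print(eval_nested(["P", "Q"], {"P": False}))            # False: short-circuit
print(eval_nested(["P", "Q"], {"P": True, "Q": True}))  # True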
3.2 Elimination of φ-Nodes
RFPF transformations aim to generate a single block representing a given procedure. The algorithms developed for this purpose hoist instructions until all the blocks except the start node are empty. Proper maintenance of the program semantics during this process requires the graph to be in single-assignment form. On the other hand, movement of φ-nodes as regular instructions is not possible, and the elimination of φ-nodes results in the destruction of the single-assignment property. For example, elimination of the φ-node x3 = φ(x1, x2) involves the insertion of copy operations x3 = x1 and x3 = x2 across each incoming edge, in that order. Such elimination creates two definitions of x3 and the resulting graph is no longer in single-assignment form. Our solution is to delay the elimination of φ-nodes until the two definitions can be merged, at which time a gating function [13] can be used if necessary:

Definition 2. We define the gating function ψp(a1, a2) as an executable function which returns the input a1 if the predicate p is true and a2 otherwise.
[Figure: the φ-node K: x3 = φ(x1, x2) with producers I: x1 = … and J: x2 = … is rewritten by inserting K1: x3,1,2 = x1 and K2: x3,2,2 = x2 on the incoming edges, leaving K: x3 = φ(x3,1,2, x3,2,2).]

Fig. 5. φ-node elimination
Note that during merging, cloned copies already bring in the necessary information for computing the controlling predicate for the gating function. The merging process is enabled by transforming the φ-node in a manner similar to the splitting case described above: Case 4 (φ-node elimination). Consider the elimination of the φ-node x3=φ(x1 , x2 ) (Figure 5). φ-node elimination can be carried out by placing copy operations x3,1,2 = x1 and x3,2,2 = x2 across each incoming edge in that order and updating the φ-node with the new definitions to become x3 = φ(x3,1,2 , x3,2,2 ). Merging of the instructions x3,1,2 = x1 and x3,2,2 = x2 requires the insertion of a gating function since the right-hand sides are different. Once the instructions are merged, the φ-node can be eliminated. It is important to observe that until the merging takes place and the deletion of the φ-node, instructions which use the φ-node destination x3 can be freely hoisted by converting their dependencies to future dependencies. 3.3
3.3 Merging of Instructions
In general, upward instruction movement will expose all paths, resulting in many copies of the same instruction guarded by different predicates. This is a desired property for optimizations that examine alternative paths, such as PRE and related optimizations, since partial redundancy needs to be exposed before it can be optimized. We illustrate an example of PRE optimization in Section 7. On the other hand, the code explosion that results from the movement must be controlled. The RFPF representation allows copies of instructions with different predicates to be merged. Merging can be carried out between copies of instructions which result from a splitting move, as well as those created by φ-node elimination. As previously indicated, merging of two instructions with the same derivative destination (i.e., such as those which result from φ-node elimination) requires the introduction of the gating function ψ into the representation, whereas merging of two copies of the same instruction can be conducted without the use of a gating function. When the merged instructions are the only copies, the resulting instruction can be renamed back to the φ destination. Otherwise, a new name is created for the resulting instruction, which will be merged with other copies later during the instruction propagation.
Definition 3. Two instructions γ: xi,m,k ← e1 and δ: xi,n,k ← e2, where γ and δ are predicate expressions, represent the single instruction γ ∨ δ: xi,(m,n),k ← e1 if e1 and e2 are identical.

Definition 4. Two instructions γ: xi,m,k ← e1 and δ: xi,n,k ← e2, where γ and δ are predicate expressions, represent the single instruction γ ∨ δ: xi,(m,n),k ← ψP(e1, e2) if e1 and e2 are not identical. The predicate expression P is the first predicate expression in γ and δ such that P controls γ and ¬P controls δ.

Definition 5. Instruction γ: xi,(p,...,q),k ← e can be renamed back to γ: xi ← e if (p, . . . , q) contains a total of k version numbers.

Theorem 1. Copy instructions generated from a given instruction I during upward propagation are merged at the immediate dominator of the source node of I, since all generated copies will eventually arrive at the immediate dominator of the source block.

Proof. Let node A be the immediate dominator of the source node that I originated from in the forward CFG. Assume there is a copy instruction I′ which does not pass through A during the whole propagation. For this to happen, there must be a path p which, from the start node, reaches I′ and then reaches the source node of I. The fact that p does not pass through node A contradicts the assumption that A is the immediate dominator of the source node of I.
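A toy rendering of Definitions 3 and 4 as a merge routine over a made-up string IR (our own illustration, not the paper's algorithm):

def merge(c1, c2):
    """Merge two clones per Definitions 3 and 4.  `ctrl` is the predicate P
    that controls c1 and whose negation controls c2; identical right-hand
    sides merge directly, differing ones need the gating function ψ."""
    if c1["rhs"] == c2["rhs"]:
        rhs = c1["rhs"]                                      # Definition 3
    else:
        rhs = f"psi_{c1['ctrl']}({c1['rhs']}, {c2['rhs']})"  # Definition 4
    return {"guard": f"{c1['guard']} or {c2['guard']}", "rhs": rhs}

i1 = {"guard": "P", "ctrl": "P", "rhs": "x0 + y0"}
i2 = {"guard": "not P", "ctrl": "P", "rhs": "z1"}
print(merge(i1, i2))
# {'guard': 'P or not P', 'rhs': 'psi_P(x0 + y0, z1)'}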
[Figure: CFG snapshots of Program 1: (c) eliminate loop except loop header; (d) convert to recursive form, in which the loop header B1 holds the recursive predicated instructions w1 = (RP0)[T]{w1_0 = μ(w0, w2_0)}, z1 = (RP0)[T]{z1_0 = μ(z0, z2_0)}, P = (RP0)[T]{P_0 = (z1_0 ≥ 0)}, x4 = (RP0)[P_0]{x4_0 = 1}, w2 = (RP0)[P_0]{w2_0 = x4_0 + w1_0}, z2 = (RP0)[P_0]{z2_0 = z1_0 − 1}, and S = (RP0)[P_0]{S_0 = (w2_0 > z2_0)}, with edges e_back: P ∧ S, e1_exit: ¬P (to B4, use w1), and e2_exit: P ∧ ¬S (to B5, use w2), followed by END.]

Fig. 10. Program 1: Conversion of a cyclic program into RFPF
6 Code Motion Involving Memory Dependencies and Function Calls
Memory dependencies pose significant challenges in code motion. There are many cases in which a compile-time analysis of memory references does not yield precise answers. Our solution is to assume dependence and enforce the original memory ordering in the program through predication. Since a series of consecutive load operations without intervening stores have no dependence on each other, RFPF allows these loads to be executed in any order once the dependence of the first load in the series is satisfied. We define the memory operations as

MEM, @P

where MEM represents a load/store operation and P is a predicate whose value is set to 1 when the memory operation MEM is executed. Any memory operation that has a dependence with MEM will be guarded by P as a predicated operation. In this way, the dependences among memory operations are converted into data dependencies explicitly. Once the memory operations are converted in this manner, they can be moved like any other instruction. Because of the predication, if a memory operation is hoisted above another which defines its controlling predicate, the controlling predicate becomes a future value (Figure 11).
[Figure: predicated memory: LW1, @P1 followed by [P1] SW1, @P2; memory reordering: [P1f] SW1, @P2 followed by LW1, @P1.]
Fig. 11. Predicated memory and reordered memory
Our algorithm to rewrite memory operations is based on Cytron et al.'s SSA construction algorithm [7]. Since all the load/store operations can be treated as assignments to the same variable, Cytron et al.'s algorithm can be modified to accomplish the rewriting. Due to lack of space, we are unable to include the algorithm. We employ a similar algorithm for handling function calls. Because of their side effects, such as input/output, function calls may not be reordered without a proper analysis of the functions referenced. Therefore, we introduce a single predicate for each call instruction which is set when the call is executed. A single φ-node is needed at merge points to enforce the function-call order on any path.
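A deliberately conservative sketch of the memory-predication rewrite described above; the chaining policy shown (loads guarded by the last store, stores guarded by everything since the previous store) is our simplification, not the published algorithm:

def predicate_memory_ops(ops):
    """Rewrite 'MEM, @P': each memory op sets a fresh predicate.  Loads are
    guarded by the predicate of the last store, so consecutive loads may
    reorder freely; a store is guarded by the predicates of all ops since
    the previous store, preserving the original memory order."""
    out, last_store, since_store = [], None, []
    for i, (kind, addr) in enumerate(ops, 1):
        p = f"P{i}"
        if kind == "LW":
            guard = [last_store] if last_store else []
            since_store.append(p)
        else:  # SW
            guard = ([last_store] if last_store else []) + since_store
            last_store, since_store = p, []
        out.append((f"{kind} {addr}", guard, p))
    return out

for ins, guard, sets in predicate_memory_ops(
        [("LW", "a"), ("LW", "b"), ("SW", "a"), ("LW", "c")]):
    print(f"[{' & '.join(guard) or 'true'}] {ins}, @{sets}")
# [true] LW a, @P1
# [true] LW b, @P2
# [P1 & P2] SW a, @P3
# [P3] LW c, @P4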
7 Optimizations Using RFPF
Many optimizations can be carried out on the complete RFPF as well as during the transformation process. One of the advantages of RFPF is its ability to perform traditional optimizations while keeping the graph in single-assignment
form with minimal bookkeeping. We show two examples of optimizations, one which can be employed during the transformation and another after the graph is converted into full RFPF.

Case study 1. PRE during the transformation: Consider Figure 12(a). There is a redundant computation of x0 + y0 along the path (B2, B4, B5). Most PRE algorithms cannot capture this redundancy because node B4 destroys the availability information for x0 + y0. On the other hand, instruction propagation and RFPF cover the case. Observe that during the instruction propagation, one of the clones, namely I1, reaches node B2 (Figure 12(b)). By applying value numbering [1] in the basic block, x0 + y0 in I1 is subsumed by z1 (Figure 12(c)).
[Figure: three CFG snapshots: (a) a PRE example, with J: z1 = x0 + y0 in B2 and I: z2 = x0 + y0 in B5, reached through branches on P and Q; (b) code motion, with clones [Qf] I1: z2,1,2 = x0 + y0 in B2, [Qf] I2: z2,2,2 = x0 + y0 in B3, and [Qf] K: z2 = φ(z2,1,2, z2,2,2) in B4; (c) value numbering, in which I1 becomes [Qf] I1: z2,1,2 = z1.]
Fig. 12. Partial redundancy elimination during the code motion
By further propagating and merging, instructions I1 and I2 are merged in B1 with the addition of the gating function ψ (Figure 13(a)), yielding the complete RFPF:

B1:
P = .....
[¬P] J: z1 = x0 + y0
Q = .....
[Q] I: z2 = ψP(x0 + y0, z1)
Figure 13(b) gives the result of transforming the RFPF back into SSA form using the algorithm in Section 8. This graph is functionally equivalent to Figure 13(c), which shows the result of using the PRE algorithm of Bodik et al. [4]. That algorithm separates the expression-available path from the unavailable path by node cloning, which eliminates all redundancies. As can be seen, RFPF can perform PRE and keep the resulting representation in SSA form. The redundancy elimination in our example is not a coincidence. By splitting instructions into copies, we naturally separate the expression-available
Fig. 13. Merging and converting back to CFG: (a) instruction merging, (b) RFPF to CFG, (c) Bodik et al. [4]
Fig. 14. Constant propagation on RFPF: (a) a CP example in CFG form, (b) the program transformed to RFPF, (c) the complete RFPF, and (d) the result of applying CCP. The textual panels (c) and (d) read:

    (c) Complete RFPF:
        read x1
        y = 0
        Q = (y ≥ 0)
        Q: R = (y == 0)
        Q ∧ R: x2 = 0
        ¬Q: x3 = −1
        x4 = ψQ(ψR(x2, x1), x3)
        use x4

    (d) After applying CCP:
        read x1
        y = 0
        Q = true
        true: R = true
        true: x2 = 0
        false: x3 = −1
        x4 = ψtrue(ψtrue(0, x1), x3)
        use 0 (x4)
path from the unavailable path. In terms of the total number of computations, RFPF yields essentially the same result. The optimality of code-motion-based PRE in RFPF is yet to be studied, but its ability to catch difficult PRE cases is quite promising.
Case study 2. Constant propagation in complete RFPF: We use another example (Figure 14(a)) to show how to perform constant propagation (CP) on the complete RFPF. As in the PRE example, constant propagation opportunities are caught in nodes B2 and B4 (Figure 14(b)). Figures 14(c) and (d) show the complete RFPF of the program and the result after optimization. We use the conditional constant propagation (CCP) approach described in [18]. Note that x4 becomes a constant in our representation because the gating function ψ can be evaluated given constant information about the predicate and the variable values. The choice of applying various optimizations during or after the transformation has to be decided based on the foreseen benefits. This is an open research problem and part of our future work.
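To make the evaluation of ψ concrete, here is a small illustrative sketch (in Python; not code from the paper) of how a CCP-style evaluator can collapse a gating function once its predicate is a known constant, reproducing the x4 = 0 result of Figure 14(d):

    # psi_P(a, b) selects a when P is true and b when P is false. Under
    # CCP, a predicate lattice value may be True, False, or None (None
    # meaning "not known to be constant").
    def eval_psi(pred, val_true, val_false):
        if pred is True:
            return val_true
        if pred is False:
            return val_false
        return None   # predicate unknown: the result is not a constant

    # Figure 14(d): Q = true and R = true, so
    # x4 = psi_Q(psi_R(x2, x1), x3) collapses to x2 = 0:
    x4 = eval_psi(True, eval_psi(True, 0, 'x1'), 'x3')   # -> 0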
8 Algorithms for Converting RFPF Back to CFG
The inverse transformation algorithms are necessary because the existing algorithms can be applied to the resulting CFG for further optimization and to produce machine code. At different stages of compilation, the conversion algorithms have different goals. Before scheduling, the goal is to minimize the number of nodes in the resulting CFG. After scheduling, the goal is to maximize the issue rate on the resulting CFG. At the register allocation stage, the goal is to minimize the live ranges of variables in the resulting CFG. We must therefore take different optimality criteria into account at different conversion stages. The basic algorithm to transform RFPF back to CFG consists of three steps (a sketch of step 2 follows the list):
1. Reorder the RFPF so that no future values occur, by pushing down or moving up instructions; this forms an initial instruction list.
2. Group instructions with identical predicates together. Such grouping avoids multiple node insertions for a single branch condition and forms the loop structures.
3. Iterate through the instruction list and insert the instructions one by one into the corresponding basic blocks.
The optimality of the resulting graph depends on how the predicate expressions are analyzed and combined. A complete inverse transformation framework is part of our future work.
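As a hedged illustration of step 2 only (Python, with invented names; the paper gives no code for this), adjacent instructions sharing a controlling predicate can be grouped so that each predicate introduces a single branch node in the rebuilt CFG:

    from itertools import groupby

    # Instructions are (predicate, text) pairs; a predicate of None means
    # the instruction is unguarded. Consecutive instructions with the same
    # predicate end up in one group, i.e., one future basic block.
    def group_by_predicate(instrs):
        return [(pred, [text for _, text in run])
                for pred, run in groupby(instrs, key=lambda i: i[0])]

    groups = group_by_predicate([
        (None, 'P = ...'),
        ('~P', 'J: z1 = x0 + y0'),
        (None, 'Q = ...'),
        ('Q',  'I: z2 = psiP(x0 + y0, z1)'),
    ])
    # Each group becomes a candidate basic block guarded by its predicate.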
9 Related Work
Intermediate program representation design has always been an important topic in optimizing compiler research, since the choice of program representation significantly affects the design and complexity of optimization algorithms. Among the most relevant to this work are the control flow graph [1], def-use chains [1], the program dependence graph [8], static single assignment (SSA) form [7,3], and the
program dependence web [13]. We directly build on this prior art in this paper. The SSA form, as well as the gating functions that the program dependence web proposes, are essential for the correct translation of programs into RFPF. The dependence flow graph [14] contributed to our thinking in designing the representation. Allen et al. proposed the idea of isomorphic control transformation (ICT) [2], which converts control dependencies into data dependencies. This idea forms the basis of hyperblock formation in many techniques, including ours as well as others [11]. Warter et al. [17] propose a technique which uses ICT, applies local scheduling techniques on the hyperblock, and then transforms the scheduled code back to the CFG representation. RFPF follows a similar, but more comprehensive, path. Partial redundancy elimination (PRE), proposed by Morel and Renvoise [12], is a powerful optimization technique which is usually carried out using code motion [9,10]. As is well known, code motion alone cannot completely eliminate partial redundancies. Click proposed an approach that uses global value numbering supported by code motion to eliminate redundancies [6]. This approach may insert extra computations along some paths. Bodik et al. [4] give an algorithm based on the integration of code motion and CFG restructuring which achieves the complete removal of partial redundancies. Chow et al. [5] propose a similar PRE algorithm for SSA yielding optimality similar to lazy code motion; the algorithm maintains its output in the same SSA form. VanDrunen and Hosking [16] present a structurally similar PRE for SSA covering more cases. Control flow can obfuscate the data-flow information needed by many optimization algorithms. Thakur and Govindarajan [15] propose a framework to identify the merge regions in a CFG that impede data-flow analysis, and to restructure the CFG to make data-flow analysis more accurate. Our technique of instruction propagation and merging exposes similar opportunities.
10 Conclusion
We have presented a new approach to program representation and optimization. The most significant difference in our approach is that it moves instructions in order to collect the necessary data and control flow information, and in the process yields a representation in which compiler optimizations can be carried out. Our future work involves the transformation and adaptation of state-of-the-art optimization algorithms into the new framework.
References
1. Aho, A.V., Sethi, R., Ullman, J.D.: Compilers: principles, techniques, and tools. Addison-Wesley Longman Publishing Co., Inc., Boston (1986)
2. Allen, J.R., Kennedy, K., Porterfield, C., Warren, J.: Conversion of control dependence to data dependence. In: POPL 1983: Proceedings of the 10th ACM SIGACT-SIGPLAN symposium on Principles of programming languages, pp. 177–189. ACM, New York (1983)
3. Bilardi, G., Pingali, K.: Algorithms for computing the static single assignment form. J. ACM 50(3), 375–425 (2003)
4. Bodík, R., Gupta, R., Soffa, M.L.: Complete removal of redundant expressions. In: PLDI 1998: Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation, pp. 1–14. ACM, New York (1998)
5. Chow, F., Chan, S., Kennedy, R., Liu, S.M., Lo, R., Tu, P.: A new algorithm for partial redundancy elimination based on SSA form. In: PLDI 1997: Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation, pp. 273–286. ACM, New York (1997)
6. Click, C.: Global code motion/global value numbering. SIGPLAN Not. 30(6), 246–257 (1995)
7. Cytron, R., Ferrante, J., Rosen, B.K., Wegman, M.N., Zadeck, F.K.: Efficiently computing static single assignment form and the control dependence graph. ACM Trans. Program. Lang. Syst. 13(4), 451–490 (1991)
8. Ferrante, J., Ottenstein, K.J., Warren, J.D.: The program dependence graph and its use in optimization. ACM Trans. Program. Lang. Syst. 9(3), 319–349 (1987)
9. Knoop, J., Rüthing, O., Steffen, B.: Lazy code motion. In: PLDI 1992: Proceedings of the ACM SIGPLAN 1992 conference on Programming language design and implementation, pp. 224–234. ACM, New York (1992)
10. Knoop, J., Rüthing, O., Steffen, B.: Optimal code motion: theory and practice. ACM Trans. Program. Lang. Syst. 16(4), 1117–1155 (1994)
11. Mahlke, S.A., Lin, D.C., Chen, W.Y., Hank, R.E., Bringmann, R.A.: Effective compiler support for predicated execution using the hyperblock. In: MICRO 25: Proceedings of the 25th annual international symposium on Microarchitecture, pp. 45–54. IEEE Computer Society Press, Los Alamitos (1992)
12. Morel, E., Renvoise, C.: Global optimization by suppression of partial redundancies. Commun. ACM 22(2), 96–103 (1979)
13. Ottenstein, K.J., Ballance, R.A., MacCabe, A.B.: The program dependence web: a representation supporting control-, data-, and demand-driven interpretation of imperative languages. SIGPLAN Not. 25(6), 257–271 (1990)
14. Pingali, K., Beck, M., Johnson, R.C., Moudgill, M., Stodghill, P.: Dependence flow graphs: An algebraic approach to program dependencies. Tech. rep., Cornell University, Ithaca, NY, USA (1990)
15. Thakur, A., Govindarajan, R.: Comprehensive path-sensitive data-flow analysis. In: CGO 2008: Proceedings of the sixth annual IEEE/ACM international symposium on Code generation and optimization, pp. 55–63. ACM, New York (2008)
16. VanDrunen, T., Hosking, A.L.: Anticipation-based partial redundancy elimination for static single assignment form. Softw. Pract. Exper. 34(15), 1413–1439 (2004)
17. Warter, N.J., Mahlke, S.A., Hwu, W.M.W., Rau, B.R.: Reverse if-conversion. In: PLDI 1993: Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation, pp. 290–299. ACM, New York (1993)
18. Wegman, M.N., Zadeck, F.K.: Constant propagation with conditional branches. ACM Trans. Program. Lang. Syst. 13(2), 181–210 (1991)
Optimizing Matlab through Just-In-Time Specialization⋆

Maxime Chevalier-Boisvert, Laurie Hendren, and Clark Verbrugge

School of Computer Science, McGill University, Montreal, QC, Canada
{mcheva,hendren,clump}@cs.mcgill.ca

⋆ This work was supported, in part, by NSERC and FQRNT.
Abstract. Scientists are increasingly using dynamic programming languages like Matlab for prototyping and implementation. Effectively compiling Matlab raises many challenges due to the dynamic and complex nature of Matlab types. This paper presents a new JIT-based approach which specializes and optimizes functions on-the-fly based on the current types of function arguments. A key component of our approach is a new type inference algorithm which uses the run-time argument types to infer further type and shape information, which in turn provides new optimization opportunities. These techniques are implemented in McVM, our open implementation of a Matlab virtual machine. As this is the first paper reporting on McVM, a brief introduction to McVM is also given. We have experimented with our implementation and compared it to several other Matlab implementations, including the Mathworks proprietary system, McVM without specialization, the Octave open-source interpreter and the McFor static compiler. The results are quite encouraging and indicate that specialization is an effective optimization: McVM with specialization outperforms Octave by a large margin and also sometimes outperforms the Mathworks implementation.
1 Introduction
Scientists are increasingly using dynamic languages to prototype and implement their applications. Matlab is particularly appealing because it has an interactive development environment, a rich set of libraries, and highly expressive semantics due to its dynamic nature. However, even though the dynamic nature of Matlab may be convenient for scientists, it provides many challenges for effective and efficient compilation and execution. Furthermore, scientists would like to have reasonable performance, as many scientific applications are computation-heavy and execute for a long time. Ideally this performance should be achieved without requiring a rewrite of Matlab code into a more static language such as Fortran. For good performance, we require an optimizing compiler that works directly on Matlab programs. However, Matlab poses several challenges. Firstly, Matlab programs are normally developed incrementally, using an interactive development loop and mixing Matlab scripts (a sequence of commands like those
typed into the interactive loop prompt) with functions that are defined in separate source files. This means that code is dynamically loaded and not all code is known ahead of time. Secondly, Matlab's type system is both dynamic and intricate. The types of variables are not declared, but rather change as the computation proceeds. For example, it is not even straightforward to determine which values are scalars and which are arrays, since a scalar assignment, such as x = 1, is assumed to define x as a 1 × 1 array. Furthermore, the size of an array dynamically increases as new values are written outside the current array bounds, and the effective base type of an array can change when an element of a more general type is written into it. All of these challenges suggest that Matlab is best optimized on-the-fly using a JIT compiler within a Matlab virtual machine. We have developed a new open Matlab VM called McVM which includes a JIT compiler built upon LLVM [1], and which we briefly introduce in this paper. The main feature of the McVM JIT is a new on-the-fly specialization algorithm which specializes functions based on the run-time types of their arguments. This relies on a type and shape inference analysis which is specifically tailored to abstract the key features of the types in the function body. This type and shape analysis must be simple enough to work in the JIT context, but at the same time it must abstract the key features needed for optimization. Our approach is to combine 8 different simple abstractions, consisting of a variable's overall type, whether or not it is a scalar or a 2D matrix, its shape, and so on. The results of this type and shape inference analysis are then used to compile a specialized and optimized version of the function. In order to determine the effectiveness of this argument-type-based specialization approach, we have implemented it and compared it against both McVM without specialization and three other existing Matlab implementations: the Mathworks proprietary implementation, Octave¹, an open-source Matlab interpreter, and McFor, our group's static Matlab-to-Fortran compiler. Initial results are quite encouraging and show that specialization works, provides good performance, and that a reasonable number of specialized versions of functions are created. The main contributions of this paper are:
- McVM: an introduction to McVM giving our design criteria and an overview of the architecture of the system (Section 3);
- Specialization: an introduction to our approach for specializing functions on-the-fly based on the run-time types of function arguments (Section 4);
- Type and Shape Inference: a new type and shape inference algorithm which approximates type and shape information based on argument types (Section 5); and
- Experimental Validation: an experimental validation showing the overall effectiveness of McVM and the effectiveness of specialization and type inference in particular (Section 6).
¹ www.gnu.org/software/octave/
In the remainder of this paper we first describe the challenges of compiling Matlab in Section 2, then we address each of our main contributions in Sections 3 through 6. We then discuss related work in Section 7 and give conclusions and future work in Section 8.
2 Optimization Challenges
Matlab presents many challenges to an optimizing compiler. Traditional static optimization techniques do not work because of the highly dynamic nature and the complex semantics of the language. Dynamic loading of functions and scripts, for example, prevents us from assuming the entire program is known ahead of time. One of the main challenges, however, is dealing with types, since the language is dynamically typed and follows intricate type rules. Listing 1 shows an example of a simple program that illustrates some of the intricacies of the Matlab type system. In this example, the caller function calls the sumvals function twice, with different argument types each time. The sumvals function is designed to sum numbers within a range of values. However, as this example illustrates, in Matlab it can be applied to both scalar types and arrays of values. Specifically, the variable a will be assigned the scalar integer value 5 * 10e11, while b will be assigned the 1 × 2 floating-point array 1.0e12 * [0.8533 1.7067]. These two values are then concatenated into c, a 1 × 3 array.

    function s = sumvals(start, step, stop)
      i = start;
      s = i;
      while i < stop
        i = i + step;
        s = s + i;
      end
    end

    function caller()
      a = sumvals(1, 1, 10^6);
      b = sumvals([1 2], [1.5 3], [20^5, 20^5]);
      c = [a b];
      disp(c);
    end
Listing 1. Implicit typing in Matlab programs
Since the sumvals function can apply to either scalars or arrays, and the values operated on could be integer, real or complex, compiling this program into efficient machine code is challenging: type information is not explicit and can change dynamically. A naive compiler could always store the variables inside the sumvals function as the widest available type (i.e., complex matrices), or even generate code assuming that the types of all variables in the function are unknown, which is clearly very inefficient.
To generate efficient code, type inference is needed to extract implicit type information in the source program. In the case where sumvals is called with only scalar integer inputs, it is possible to logically infer that all of the intermediate variables will also be scalar integers, and generate efficient code for this case. As for the case where sumvals is called with arrays as input, it should be possible to at least infer that complex values will never occur in the computation. This example motivates our approach of specialization based on the run-time types of arguments. Our approach will compile two different versions of the function based on the call signatures. This ensures that efficient code can be generated for each case. More details of our specialization technique are given in Section 4 and our type inference analysis is described in Section 5.
3 Design Overview
Our approach to optimization requires the ability to both interpret and compile multiple versions of code. The McVM virtual machine thus implements a mixed-mode design, consisting of both interpreter and JIT components. The design is modular, making use of an external front-end and a low-level back-end to reduce implementation complexity; Figure 1 shows the overall structure, which we now describe in more detail.
Fig. 1. Structure of the McVM Virtual Machine
The McLab front-end is used to parse interactive-mode commands and M-file source code, producing a common Abstract Syntax Tree (AST) representation for both interpretation and compilation. The functionality of the interpreter is divided into interpretation logic and state management (housekeeping), while the JIT compiler manages the function specialization/versioning system and generates low-level intermediate code for the statements it can compile. Our design allows for incremental and flexible development, with the JIT relying on the interpreter as a fallback to evaluate code for which there is not yet compiler support. At the core, McVM's implementation of matrix types depends directly on a set of mathematical libraries (ATLAS, BLAS and LAPACK) to perform fast matrix and vector operations. We use the Boehm garbage collector library for garbage collection [2], and our JIT compiler uses the LLVM framework to implement low-level JIT compilation and generate machine code [1]. The JIT compiler also implements several analyses to gain additional information about the source programs being compiled, including basic analyses and optimizations such as live variables, reaching definitions, and bounds check elimination, as well as type inference. The McVM interpreter performs a straightforward, pre-order traversal of the internal AST in order to execute the input code. This interpretation approach is naive, but provides a correct, if low-performance, execution that can serve both as a reference and as a fallback when JIT compilation cannot be performed. The interpreter also serves housekeeping roles, providing essential run-time services. These include loading Matlab files on demand, executing interactive-mode commands, hosting library function bindings, maintaining bindings to global variables, and so forth. The JIT compiler improves performance by translating high-level source code into a more efficient low-level form. A fundamental design goal in our VM was to aim for a simple and easily extensible design. Similar to the phc compiler [3], our JIT compiler is built as an extension of the interpreter. The compiler can thus fall back to interpreting sections of code it cannot compile, mixing sections of both compiled and interpreted code in the execution of a given function. This allows for incremental JIT development, and also for language modifications to be more easily incorporated: new data types or statements can be added by modifying only the interpreter, relying on the fallback mechanism for any new features. The JIT compiler can later be modified, if necessary, to gain performance benefits from any additional optimization opportunities. The JIT compiler performs actual code generation in conjunction with LLVM. At run-time, the input AST is first translated by our JIT compiler into a low-level, RISC-like Static Single Assignment (SSA) representation. From this, LLVM generates machine-specific executable code; LLVM also performs basic optimization passes on the code, such as constant propagation, dead code elimination and redundant operation elimination. As such, it greatly simplifies the construction of a JIT compiler by hiding much of the platform-specific detail and providing low-level optimizations. Our fallback mechanism requires a high-level strategy to coordinate the transition from compiled code to interpretation and vice versa. In particular, at each
step of the compilation process the JIT must track how and where each live variable is stored in order to appropriately transfer execution context. When interpreter fallback code is generated, instructions are issued to flush any register variables into memory for interpreter consumption. Upon returning to compiled execution, variables are copied back into their original registers. While spilling variables in this way is expensive, it has the advantage that the interpreter fallback mechanism does not impose extra penalties on compiled code in the case of functions which do not need to use it. The McVM JIT compiler is able to compile and make use of specialized versions of functions based on call signatures. This corresponds to the two shaded boxes in Figure 1 labeled “Versioning Logic” and “Type Inference”. In the next two sections we examine these two important components in more detail.
4 Just-In-Time Specialization
Exposing and using type information is central to most existing approaches to Matlab optimization [4,5]. McVM uses run-time type information to create multiple specialized versions of Matlab functions. This allows for optimized function dispatch and improved code generation for many common operations, greatly reducing the overhead costs necessary in a more generic design. Below we describe our versioning strategy, followed by the core optimizations it enables.
4.1 Function Versioning
Specialization requires creating type-specific versions of function bodies. This process is performed at run-time, by “trapping” commands issued through the interpreter (including calls made in the read-eval-print loop of the interactive mode). If the command is a call to a function (and not a script), the interpreter will try to pass control to the JIT compiler. When this happens, the JIT compiler builds an argument type string from the input arguments to the function, and attempts to locate a previously compiled version of the function with a matching argument type string. If none exists, a new version is first compiled, appropriately specialized to the given argument types. This removes significant dispatch overhead, allowing, for instance, scalar variables to be stored on the stack instead of as objects allocated on the heap. While compiling specialized function versions, the JIT compiler also considers functions called by the function being compiled, compiling them as direct calls to specialized versions as well. Thus entire executions can be specialized in a “deep” fashion. As an example of how our function versioning works, consider the sumvals function shown earlier in Listing 1. This function is meant to sum the numerical values in the range from start to stop, inclusively. In the absence of type information and specialization, a compiler must make conservative assumptions, assuming iteration is potentially performed over arrays. Expensive heap storage is thus required, as well as function calls to generically perform every operation (addition, comparison, etc.).
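The lookup itself behaves like a memoization cache keyed by the argument type string. The following Python sketch is illustrative only (compile_specialized is a stand-in for the real JIT pipeline, and McVM itself is written in C++), but it shows the general shape of this dispatch:

    # Cache of compiled versions, keyed by (function, argument type string).
    version_cache = {}

    def compile_specialized(func, sig):
        # Stand-in for the real JIT: specialize 'func' for signature 'sig'.
        # Here we simply return the function unchanged.
        return func

    def call_function(func, args):
        sig = ','.join(type(a).__name__ for a in args)  # argument type string
        key = (func.__name__, sig)
        if key not in version_cache:                    # miss: compile a new version
            version_cache[key] = compile_specialized(func, sig)
        return version_cache[key](*args)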
    function s<scalar int> = sumvals(start<scalar int>, step<scalar int>, stop<scalar int>)
      i<scalar int> = start;
      s<scalar int> = i;
      while i < stop
        i<scalar int> = i + step;
        s<scalar int> = s + i;
      end
    end
Listing 2. The type-annotated sumvals function
At an actual invocation of the function, however, such as a = sumvals(1, 1, 10^6); in Listing 1, the argument types are known to be scalar integers. This information is flowed through the function by our type inference, producing a type-annotated version as shown in Listing 2. From this, efficient code can be generated: all variables are easily stored on the stack, and there is no need for expensive dispatches, because there are efficient machine instructions to add and compare scalar integer values. The obvious downside is that this scheme has the potential to generate many specialized versions of a function, with each requiring additional compilation time, and potentially impacting the performance of the instruction cache should multiple versions be executed. We will see that this is not the case in practice (see Section 6). From our observations, Matlab programs tend to have few long functions and fewer call sites than code written in other programming languages.
4.2 Additional Optimizations
Type-based specialization greatly simplifies basic arithmetic operations, allowing many uses of scalars to be implemented in just a few machine instructions. The type information, however, also facilitates the optimization of a number of other common operations, in particular certain array access operations, and use of library function calls. These optimizations improve performance by both taking advantage of type information, and eliminating cases where interpreter fallback is otherwise required. Matlab possesses a sophisticated array indexing scheme that allows programmers to read or write to n-dimensional slices (sub-arrays) based on ranges of indices, specified independently for each dimension. This behaviour is implemented through the interpreter, using the fallback mechanism to evaluate complex array reads and writes. When types are known, however, such as in x = a(i); where i is a scalar, optimized code can be generated to read or write the value directly. Type information includes array dimensions as well, eliminating the need for many dynamic array bounds checks. Library functions are implemented in our virtual machine as native C++ functions which take as input (and return as output) dynamically allocated arrays of pointers to data objects. This strategy is conservatively correct in the
presence of unknown types, but can be inefficient because each call to these functions requires array allocation. Even for variables known to be scalar, the use of a generic library routine requires boxing arguments and unboxing return values, reducing the benefit of other optimizations. To address these issues, we have devised a further simple specialization scheme for some library functions. Multiple type-specific versions of library functions are first registered ahead-of-time in McVM. When a library function call is encountered, the JIT compiler attempts to locate an appropriately specialized version, matching the function argument and return types. An obvious example where this is beneficial is in the case of functions like abs or sin, where scalar data allows the direct use of the native C++ versions of these library functions.
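A hedged sketch of this lookup (in Python for brevity; the registry layout and the generic fallback are invented, not McVM's actual API):

    import math

    # Type-specific library versions registered ahead of time,
    # keyed by (function name, argument type signature).
    specialized_lib = {
        ('abs', 'float'): abs,       # direct native scalar versions
        ('sin', 'float'): math.sin,
    }

    def call_library(name, arg):
        fn = specialized_lib.get((name, type(arg).__name__))
        if fn is not None:
            return fn(arg)           # unboxed, direct call
        return generic_call(name, arg)

    def generic_call(name, arg):
        # Stand-in for the generic entry point that boxes its arguments
        # into heap-allocated data objects before dispatching.
        return {'abs': abs, 'sin': math.sin}[name](arg)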
5 Type and Shape Inference System
The McVM JIT compiler uses data provided by our type inference analysis to implement the just-in-time function specialization scheme described in Section 4. The more information the analysis provides about the concrete types and shapes of program variables, the more interpretive dispatching and storage overhead can be eliminated, and the faster the resulting compiled code will be, as demonstrated in Section 6. Our type inference analysis works on a per-function basis, under the assumption that the whole program is not necessarily known at run-time and new functions can be loaded at any time. The analysis assumes that the set of possible types for each input argument of a given function is known, and infers the set of possible types for every variable at every point (before and after every statement) in the function, given those possible input argument types. The analysis is an abstract-interpretation-style analysis, which implements a compositional forward analysis directly on the structured AST representation. The analysis computes an abstraction of the actual types and shapes of variables at each program point. The abstraction is a carefully designed combination of simple abstractions, where each element captures a key aspect of the variable's type or shape. For example, the isScalar flag indicates when a variable is definitely a scalar. If this flag is true, then the JIT compiler can allocate the variable to a register, which is much more efficient than storing it as a matrix. Another key point of our analysis is that it is flow-sensitive: we thus have type and shape information for each program point.
5.1 Abstract Domain
In the real domain of Matlab programs, variables at different program points are bound to actual values (data objects). In our abstract domain, variables instead map to sets of possible abstract types. These sets contain zero or more type abstractions summarizing all possible types and shapes the specific variable can have. Each type abstraction is an 8-tuple: ⟨overallType, is2D, isScalar, isInteger, sizeKnown, size, handle, cellTypes⟩.
If an abstract type set contains multiple type abstractions, it means that the variable whose potential types are represented by the set at that program point could be one of the several types represented by each type abstraction in the set. The empty set is the ⊥ element of the type lattice, representing situations where no information has been computed yet. The set of all type objects is the ⊤ element of the lattice, representing the situation where the type of a variable cannot be determined. The core of the abstraction is the first item of the 8-tuple, the overallType, which represents a specific Matlab language type, such as character array, floating-point matrix, complex number matrix, etc. Figure 2 represents the hierarchical type lattice of McVM overallType values.
Fig. 2. Hierarchical lattice of McVM types
The remaining elements of each 8-tuple provide abstractions of different features of the type. Table 1 describes the fields stored in type objects. These fields cannot hold arbitrary values. For example, if the isScalar flag is set to True, then the sizeKnown flag must also be True. However, the is2D flag does not necessarily indicate that the matrix size is known. For each statement in a program, our analysis produces a mapping of symbols to sets of type abstractions representing the types each variable in the current function may hold before the statement is executed. Formally, if O is the set of all possible type abstractions and S is the set of all symbols, then our analysis operates in the domain of subsets of M, where M is the set of all pairs of symbols and subsets of O (mappings of symbols to type sets):

    M = { (s, t) | s ∈ S, t ∈ P(O) }
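As a hedged illustration (Python, with invented field names mirroring Table 1; McVM itself implements this in C++), the 8-tuple can be rendered as a small record whose defaults are the “unknown” values listed in Table 1. Later sketches in this section reuse this record.

    from dataclasses import dataclass
    from typing import Optional, Tuple, FrozenSet

    @dataclass(frozen=True)
    class TypeAbstraction:
        overall_type: Optional[str] = None       # element of the McVM type lattice
        is_2d: bool = False                      # False means "unknown"
        is_scalar: bool = False
        is_integer: bool = False
        size_known: bool = False
        size: Optional[Tuple[int, ...]] = None   # defined only if size_known
        handle: Optional[object] = None          # function handles only
        cell_types: FrozenSet = frozenset()      # cell arrays only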
5.2 Merge Operator
A merge operator is required to implement inference rules for control flow statements. This is because when multiple control paths join at a given point in a
Table 1. Description of type object fields

    Field        Meaning/Description                                              Default
    overallType  An element of the set of possible McVM data types.               Undefined
    is2D         Matrix types only. True: the matrix has at most two
                 dimensions; False: the number of dimensions is unknown.          False (unknown)
    isScalar     Matrix types only. True: the matrix is a scalar;
                 False: it may not be scalar.                                     False (unknown)
    isInteger    Matrix types only. True: the matrix contains only integer
                 values; False: it may contain non-integer values.                False (unknown)
    sizeKnown    Matrix types only. True: the size of the matrix is known;
                 False: the size is not known.                                    False (unknown)
    size         Matrix types only. A vector of integers storing the matrix
                 dimensions; defined only if sizeKnown is True.                   Undefined
    handle       Function handle types only. A pointer to the function object
                 the handle points to; null if not known at inference time.       null (unknown)
    cellTypes    Cell array types only. The set of type objects representing
                 the possible types the cell array stores.                        ⊥ (undefined)
program, our analysis needs to merge the mappings of symbols to type sets from each of these control flow paths into a single mapping. In our analysis, the merging of two type mappings is accomplished by joining, for each symbol, the type sets of the two mappings:

    merge(M1, M2) = { (s, t) | (s, t1) ∈ M1, (s, t2) ∈ M2, t = join(t1, t2) }

The joining of type sets is accomplished by using set union as the merge operator and then applying a filter operator to the result:

    join(t1, t2) = filter(t1 ∪ t2)

The filter operator takes a type set as input and returns a new type set in which all type objects having the same overallType value have been merged into one. It does so in a pessimistic way: if one of the type objects to be merged has an unknown value for one of its flags, the merged type object has the unknown value for that flag. For example, if we are filtering a type set containing multiple double matrix type objects, the resulting type object has the integer flag set to True only if all input type objects did.
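A hedged Python sketch of these operators, reusing the TypeAbstraction record sketched in Section 5.1 (the size handling here is our own conservative guess, not spelled out in the paper):

    def pessimistic_unify(a, b):
        # Any flag unknown (False) on either side stays unknown in the result;
        # a known size survives only if both sides know the same size.
        same_size = a.size_known and b.size_known and a.size == b.size
        return TypeAbstraction(
            overall_type=a.overall_type,
            is_2d=a.is_2d and b.is_2d,
            is_scalar=a.is_scalar and b.is_scalar,
            is_integer=a.is_integer and b.is_integer,
            size_known=same_size,
            size=a.size if same_size else None)

    def filter_set(typeset):
        by_overall = {}
        for t in typeset:  # merge abstractions that share an overallType
            prev = by_overall.get(t.overall_type)
            by_overall[t.overall_type] = t if prev is None else pessimistic_unify(prev, t)
        return frozenset(by_overall.values())

    def join(t1, t2):
        return filter_set(t1 | t2)

    def merge(m1, m2):
        # Pointwise join of two symbol -> type set mappings.
        return {s: join(m1.get(s, frozenset()), m2.get(s, frozenset()))
                for s in set(m1) | set(m2)}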
5.3 Inference Rules
Our type inference analysis follows inference rules to determine the mapping of possible variable types after a given statement, based on the possible types before that same statement. Each kind of statement has an associated type inference rule that takes the mapping of possible input types as input and returns the mapping of possible output types. Expression statements, such as disp(3);, use the identity type mapping; that is, the output types they produce are the same as the input types.
The statements at the core of our type inference analysis are assignment statements. They are the only kind of statement that can define a variable, and thus change its type. In the case of an assignment statement of the form v = op(a, b);, where op is an element of the set R of all possible binary operators, the type of v is redefined as the set of possible output types of the operator applied to the possible types of a and b, according to the operator's own type rule:

    typeRule_{v=op(a,b)}(Min) = { (s, t) ∈ Min | s ≠ v } ∪ typeRule_{op(v,a,b)}(Min)
    typeRule_{op(v,a,b)}(Min) = { (v, t) | t = outtype_op({ t | (a, t) ∈ Min }, { t | (b, t) ∈ Min }) }

(the first set comprehension, with s ≠ v, removes the previous bindings of v). As an example, consider the assignment c = [a b]; in Listing 1, the horizontal concatenation of arrays a and b. In this case, a holds the value 5 * 10e11, a scalar integer value, and b holds the value 1.0e12 * [0.8533 1.7067], a 1 × 2 floating-point array. Thus, the type abstractions for a and b are:

    type(a) = ⟨overallType = double, is2D = T, isScalar = T, isInteger = T, sizeKnown = T, size = (1, 1), handle = null, cellTypes = ⊥⟩
    type(b) = ⟨overallType = double, is2D = T, isScalar = F, isInteger = F, sizeKnown = T, size = (1, 2), handle = null, cellTypes = ⊥⟩

The type rule associated with the horizontal concatenation operation allows us to infer that c will be a 1 × 3 floating-point array:

    outtype_hcat(type(a), type(b)) = ⟨overallType = double, is2D = T, isScalar = F, isInteger = F, sizeKnown = T, size = (1, 3), handle = null, cellTypes = ⊥⟩

In the case of if statements, the type inference process is handled differently. The “true” and “false” branches of the statement are each treated as a compound statement, as if all statements on the branch were one statement. The output type mappings are determined separately for both branches and then merged into one mapping of the possible types at the output of the if statement itself:

    typeRule_if(Min) = merge(typeRule_trueStmt(Min), typeRule_falseStmt(Min))

Handling of loop statements is slightly more complex. Because the types at the input of the loop depend on the types at its output, a fixed point must be computed iteratively. Before we apply our type inference analysis, all loop statements are converted to while loops. As is the case for if statements, the statements in the loop body are treated as one single compound statement. Special care is taken to properly deal with both break and continue statements.
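A hedged sketch of the assignment rule and of an hcat output-type rule matching this example (Python, reusing the TypeAbstraction and filter_set sketches from earlier; a real implementation would have one such rule per operator, and this one assumes double operands as in the example):

    def type_rule_assign(m_in, v, outtype_op, a, b):
        out = {s: t for s, t in m_in.items() if s != v}   # drop old bindings of v
        out[v] = filter_set(frozenset(
            outtype_op(ta, tb) for ta in m_in[a] for tb in m_in[b]))
        return out

    def outtype_hcat(ta, tb):
        # Rows must agree for horizontal concatenation; columns add up.
        same_rows = ta.size_known and tb.size_known and ta.size[0] == tb.size[0]
        return TypeAbstraction(
            overall_type='double',                 # assuming double operands
            is_2d=ta.is_2d and tb.is_2d,
            is_scalar=False,
            is_integer=ta.is_integer and tb.is_integer,
            size_known=same_rows,
            size=(ta.size[0], ta.size[1] + tb.size[1]) if same_rows else None)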
5.4 Inference Process
In terms of abstract interpretation, we wish to compute, for a given function, the least fixed point of the mapping from program statements and variables to the sets of possible types before each program point. The type inference process for a function begins with the type sets for the input parameters of the function being given. Because of the Matlab semantics, the possible types of all other variables are initialized to ⊤: undeclared variables could be globals, and thus could potentially hold any type. The body of the function is then analyzed. The function body itself is a compound statement. When inferring the types in a compound statement, the statements it contains are traversed in order, and the inferred output type of each statement is stored in a global mapping (e.g., a hash map) of the types at the output of each statement.
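For the loop rule of Section 5.3, the fixed-point iteration can be sketched as follows (Python, illustrative only; body_transfer stands for the inferred transfer function of the loop body, and merge is the operator sketched in Section 5.2):

    def infer_while(body_transfer, m_entry):
        m = m_entry
        while True:
            # Join the back-edge mapping with the loop-entry mapping.
            m_next = merge(m, body_transfer(m))
            if m_next == m:       # types have stabilized: least fixed point
                return m
            m = m_next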
6 Evaluation
In order to assess the performance of our virtual machine, we compare the actual performance of McVM to that obtained by several related systems: Mathworks Matlab, GNU Octave (the GNU Matlab environment) and McFor (a Matlab-to-Fortran translator built by Jun Li, a member of the McLab team). The Octave and Matlab performance numbers are intended to give us some idea of how well our current solution performs against competing implementations. The McFor numbers are provided as a rough “upper bound” on performance: Fortran compilers are known to perform very well on numerical computations, giving an indication of potential compiler performance for non-interactive code. We have performed our tests on a total of 20 benchmark programs. These benchmarks are gathered from previous work on optimizing Matlab², from the FALCON [6] and OTTER projects, Mathworks' Central File Exchange, Chalmers University, and from individual course work and student projects at McGill. Several of them are currently unsupported by the McFor Fortran translator, as it lacks support for cell arrays, closures and function handles at this time. The left part of Table 2 provides characteristic numbers for each of the benchmarks supported by McVM. The number of functions and statements (3-address form) relate to the overall (static) input load on our system, while the number of call sites directly affects specialization. Maximum loop nesting depth affects the theoretical efficiency of our dataflow analysis. Not all benchmarks benefit equally from our optimizations, of course, and in the following sections we show further profiling numbers intended to explain where specific performance bottlenecks occur. Section 6.2 describes the behaviour of the type inference system, while Section 6.3 gives data on the specialization system, including compiler overhead. All of our benchmarking metrics were gathered on a system equipped with an Intel Core 2 Quad Q6600 processor (quad core, 2.4GHz) and 4GB of dual channel DDR2 RAM, running Ubuntu 9.10
² http://www.ece.northwestern.edu/cpdc/pjoisha/MAT2C/
(Linux kernel 2.6.31, 32-bit). We gathered our Matlab performance numbers using Matlab R2009a, and our GNU Octave numbers on Octave version 3.0.5. The Fortran code produced by McFor was compiled using the GNU Fortran compiler version 4.4.1. Because of significant variance when timing benchmarks, attributable to i-cache effects and the garbage collector, all benchmark timing measurements are based on an average over 10 runs.
6.1 Baseline Performance
The rightmost columns of Table 2 show a comparison of benchmark running times under our four execution environments, as well as under a version of McVM with the JIT and specialization disabled, giving absolute times as well as times normalized to the McVM JIT (values greater than 1 run slower than McVM with JIT). As we can see, McVM with JIT performs better than Matlab on 8 of the 20 benchmarks, sometimes by a fair margin. In the cases where it does worse than Matlab, the running times can be relatively close (as with nnet) or, as exemplified by the crni benchmark, dramatically worse; we discuss reasons for this poor performance in Section 6.2. GNU Octave, possessing no JIT compiler, does rather poorly in general. It trails far behind Matlab and outperforms McVM with JIT on only a single benchmark. Interestingly, McVM in interpreted mode, although it performs much worse than the JIT on several benchmarks, actually performs better on some (this will also be discussed further). The McFor running times are generally well ahead of both Matlab and McVM, with the exception of the clos benchmark. This suggests that Matlab and McVM are both still far from the “optimal” performance level.
6.2 Type Inference Efficiency
Our ability to optimize strongly depends on the behaviour of our type inference system. The leftmost part of Table 3 thus shows relevant run-time profiling information, dynamically weighted by the relative execution counts of the associated statements. The first data column gives the percentage of type sets that are at top, providing no type information, while the second column shows the percentage of type sets which contain only one type, and so give exact type data. The third column shows the percentage of times where variables holding scalar values were known ahead of time to be scalar, and the fourth column is the percentage of times where the size of matrix variables was known by the type inference system. In general, the more type information our system has, the better it will be able to optimize code generation. Knowledge of which variables are scalars is even more critical, however, as it lets the JIT compiler know which variables can be stored on the stack. As we can see, this matches our results: the benchmarks with speedups of over 99% all have 100% of scalar variables known. The behaviour of the crni benchmark can also be explained by this data. As can be seen in Table 3, its scalars are known in only 68.7% of cases, one of the lowest such ratios. An examination of the code reveals this benchmark uses matrix “creation on
Table 2. Benchmark characteristics and comparison of running times. Columns 6–10 give absolute running times (s); columns 11–14 give performance relative to McVM JIT (geometric mean used for relative values; values greater than 1 run slower than McVM JIT).

    Benchmark  Functions  Stmts  Nesting  Call Sites | McVM JIT  MATLAB  McVM noJIT  Octave  McFor | MATLAB  McVM noJIT  Octave  McFor
    adpt           2       196      2         6          13.4     2.66      12.6      45.9    0.72     0.20      0.94      3.42    0.05
    beul          10       511      1        38          3.07     3.09      1.56      7.62    N/A      1.01      0.51      2.49    N/A
    capr           5       214      2        10          3.51     8.10      1674      5256    1.26     2.31      478       1499    0.36
    clos           2        58      2         3          6.84     0.75      13.6      17.5    7.87     0.11      1.99      2.56    1.15
    crni           3       142      2         7          1321     6.95      1788      5591    3.56     0.01      1.35      4.23    0.00
    dich           2       144      3         7          2.80     4.71      1149      4254    1.88     1.68      410       1517    0.67
    diff           2       253      3         6          30.0     5.26      41.9      120     0.65     0.17      1.39      3.98    0.02
    edit           2       130      2         6          54.9     11.0      81.4      394     0.13     0.20      1.48      7.17    0.00
    fdtd           2       157      1         3          20.1     3.32      8.56      172     0.29     0.17      0.43      8.55    0.01
    fft            2       159      3         8          12.8     16.2      2470      8794    9.13     1.27      193       689     0.72
    fiff           2       120      2         4          5.37     6.97      1528      4808    0.99     1.30      285       895     0.18
    mbrt           3        78      2        11          34.6     4.53      98.6      295     0.96     0.13      2.84      8.51    0.03
    nb1d           3       194      2        11          4.10     9.85      4.24      43.9    0.74     2.40      1.03      10.7    0.18
    nb3d           3       164      2        12          3.88     1.54      2.51      40.8    0.89     0.40      0.65      10.5    0.23
    nfrc           5       151      2        11          15.7     4.94      26.0      80.3    N/A      0.32      1.66      5.13    N/A
    nnet           4       186      3        16          6.95     6.35      7.32      26.5    N/A      0.91      1.05      3.81    N/A
    play           6       364      2        29          3.37     8.68      4.24      29.0    N/A      2.57      1.26      8.60    N/A
    schr           8       203      1        32          2.48     2.07      3.03      2.31    N/A      0.84      1.22      0.93    N/A
    sdku           9       363      2        49          1.23     9.74      16.0      112     N/A      7.93      13.1      90.9    N/A
    svd           11       308      3        42          8.24     2.38      7.02      10.9    N/A      0.29      0.85      1.33    N/A
    mean          4.3      205     2.1      15.6         77.7     5.96      447       1505    2.24     0.49      3.91      15.4    0.08
assignment” to initialize its input data, resulting in several unknown types being propagated through the entire program. We examine ways to fix this weakness of our type inference system as part of future work. While our JIT compiler is able to speed up most benchmarks, sometimes by very significant margins, some still show slowdowns over interpreted performance. These do not necessarily have poor type information. The nb3d benchmark, for example, has 100% scalar variables known and 96.9% singleton type sets. Most of these benchmarks make heavy use of complex slice read operations operating on entire columns or rows of a matrix at a time, and these are currently implemented through our (expensive) interpreter fallback mechanism.
6.3 JIT Specialization
The benefit of JIT specialization depends on how well it improves the code as well as any introduced overhead. The rightmost three columns of Table 3 show
Table 3. Profiled performance. All values are percentages.

    Benchmark  Top sets  Singleton sets  Scalars known  Size known  JIT speedup  Matrices created  Slice reads  Env. lookups
    adpt         4.18        95.8            100           90.0        -6.82          24.8             16.8         39.2
    beul         55.2        44.8            71.3          29.5        -96.3          85.5             49.8         114
    capr         0.01        100             100           82.8         99.8          0.00             0.00         0.00
    clos         0.00        100             100           99.9         49.7          0.00             100          0.00
    crni         19.1        71.4            68.7          54.8         26.1          66.7             69.2         55.2
    dich         2.09        97.9            100           85.1         99.8          0.00             0.00         0.00
    diff         14.3        82.1            66.7          66.7         28.2          68.3             100          2.45
    edit         5.14        94.9            96.8          81.5         32.5          65.0             40.0         81.6
    fdtd         0.01        100             100           49.8         -135          88.1             90.0         90.5
    fft          0.00        100             100           80.3         99.5          0.00             0.00         0.00
    fiff         0.01        100             100           86.1         99.6          0.01             0.00         0.00
    mbrt         9.09        90.9            100           100          64.9          33.3             100          0.00
    nb1d         5.84        94.2            88.1          34.5         3.33          75.6             0.00         14.9
    nb3d         3.13        96.9            100           16.5        -54.6          94.0             98.3         76.2
    nfrc         16.4        82.7            100           98.9         39.8          42.5             100          19.8
    nnet         52.6        47.4            98.7          55.1         5.08          86.9             100          82.8
    play         23.3        66.6            77.5          52.1         20.6          72.5             100          45.9
    schr         31.8        55.3            99.5          41.7         18.3          65.5             54.0         84.6
    sdku         14.8        85.2            83.8          49.7         92.3          7.55             5.69         4.65
    svd          16.4        73.8            94.2          59.7        -17.4          84.7             100          60.2
    mean         13.7        84.0            92.3          65.7         23.5          48.0             56.2         38.6
the effect of JIT compilation on three profile measures, the number of matrices created, the number of slice reads, and the number of environment lookups, in each case presented as a percentage of the original, interpreted quantity. These are all expensive operations, and so large reductions should map to large improvements from JIT compilation. The fft benchmark, for instance, has 100% of its 789 million interpreter slice reads eliminated, and runs over 190 times faster with the JIT compiler enabled. For a better understanding of the cost/benefit of different components of our system, we also evaluate the performance of McVM with specific JIT optimizations disabled. Relative to the McVM JIT compiler with all optimizations enabled, the five leftmost columns in Table 4 show the ratio of run-times of McVM with optimizations to arithmetic operations, array operations, function calls, specialized library functions, and the entire JIT selectively disabled (a number greater than one signifies a slowdown). Clearly, arithmetic operation and array access optimizations have a tremendous impact as they speed up several benchmarks by two orders of magnitude. In certain cases, such as dich, optimizing library functions also has a large impact.
Table 4. Relative JIT performance with specific optimizations disabled (columns 2–6), and overhead of the optimization system (columns 7–10). The geometric mean was used for relative values (columns 2–6).

    Benchmark  Arith.  Array  Direct calls  Library  JIT   | # functions  # versions  Compile (s)  Analysis (s)
    adpt        1.43    1.12      0.97        1.07    0.94        2            2          0.86          0.79
    beul        1.03    1.00      1.00        1.00    0.51        9           16          1.20          0.90
    capr        590     428       1.73        1.05    478         5            5          0.50          0.43
    clos        3.40    1.01      1.00        1.00    1.99        2            2          0.14          0.12
    crni        1.63    1.27      0.75        0.99    1.35        3            3          0.32          0.26
    dich        459     282       1.00        29.7    410         2            2          0.38          0.32
    diff        2.20    1.03      1.01        0.96    1.39        2            2          1.19          1.10
    edit        1.90    1.46      0.61        0.98    1.48        2            2          0.22          0.17
    fdtd        1.25    1.10      1.01        0.87    0.43        2            2          0.48          0.38
    fft         144     143       1.02        1.01    193         2            2          0.58          0.54
    fiff        280     204       1.01        1.05    285         2            2          0.24          0.20
    mbrt        3.57    1.05      1.05        0.99    2.84        3            3          0.14          0.11
    nb1d        0.90    1.22      1.06        0.97    1.03        3            3          0.51          0.42
    nb3d        0.66    1.07      1.08        0.97    0.65        3            3          0.57          0.46
    nfrc        1.33    1.04      1.77        0.98    1.66        5            5          0.22          0.15
    nnet        1.20    1.01      1.02        0.98    1.05        4            4          0.36          0.29
    play        1.21    1.03      1.11        0.98    1.26        6           10          0.58          0.42
    schr        1.47    1.00      1.02        1.00    1.22        8            9          0.55          0.45
    sdku        1.42    1.67      1.13        0.97    13.1        9           11          1.08          0.85
    svd         3.92    0.98      1.05        0.98    0.85       11           15          0.79          0.61
    mean        4.56    3.28      1.04        1.17    3.91       4.2          5.2         0.55          0.45
The direct call mechanism has much less impressive benefits. It improves benchmarks that perform many function calls, but can also yield lower performance in cases where the types of input parameters to a function are unknown. A version of the function then gets compiled with insufficient type information, whereas the interpreter can extract exact type information on-the-fly when a call is performed with direct calls disabled. Given our specialization strategy, compilation overhead is a concern—if types are highly variable, many function versions will be compiled, adding CPU and memory overhead. We thus measured the number of functions compiled, as well as the total number of specialized versions for each of our benchmarks. Columns 7 and 8 in Table 4 show that excessive specialization is not a problem in practice. In most cases functions are always called with the same argument types, and there are never more than twice as many versions as compiled functions. The last two columns of Table 4 give the absolute compile-time overhead and its analysis-time constituent. As we can see, most of the compilation time is spent performing analyses on the functions to be compiled, as opposed to code
generation. The slowest compilation time is associated with the diff benchmark. We attribute this to the large quantity of code contained in a triple nested loop in this benchmark, for which our analyses take longer to compute a fixed point. In most cases these costs are not excessive and are easily overcome by the performance improvement, especially for longer running benchmarks.
7 Related Work
Our approach to optimizing Matlab has concentrated on dynamic features of the language that interfere with more traditional optimization. This brings together more traditional work on compiling scientific, array-intensive languages and techniques for optimizing dynamic languages, specifically dynamic specialization and type inference. Previous compiler approaches to Matlab have mainly focused on numerical performance, primarily in the context of static language subsets or contexts. As well as more traditional loop and array optimizations, code restructuring can be performed to ensure programs take good advantage of optimized intrinsics [7]. Good performance can also be achieved by translating Matlab code to other static languages, such as C [8] or Fortran 90 [6,9], where further aggressive optimization or parallelization can be performed. A major source of complexity for almost all Matlab optimizations, as in our case, is analyzing and understanding array properties, such as shape and size [10]. Elphick et al. identify similar typing and dynamic language concerns in their partial evaluation approach to optimizing Matlab programs [5]. They develop MPE, an online system to partially evaluate Matlab source functions into more efficient Matlab code. Their design is intra-procedural and does not handle polyvariant types, but as such may provide an additional and orthogonal benefit to our approach. Full VM approaches have also been applied, including JIT-based solutions. MaJIC combines JIT-compilation with an offline code cache maintained through speculative compilation of Matlab code into C/Fortran [4]. They derive the most benefit from optimizations such as array bounds check removal and register allocation. The Match VM project translates Matlab programs to a lower-level intermediate form which is then analyzed for dependencies and used to automatically parallelize computation [11]. The result is invisible to the user and, by relying on run-time estimates for scheduling, avoids static array analysis requirements. Program Specialization. We use program specialization [12] in order to optimize effectively in the presence of imprecise type information. More specifically, we apply procedure cloning [13] to create specialized copies of function bodies in which we can make stronger typing assumptions. Such specialization techniques have previously been used offline to translate Matlab code into optimized C or Fortran code [14]. Our design extends the run-time specialization techniques used by languages such as SELF [15] and is similar to the approach used to optimize the JIT compilation of generics for the C# language [16]. More general specialization designs have also been applied [17]. In practice this can yield
very significant performance gains—Schultz and Consel report speedups of up to 300% for their specializing JSpec Java compiler [18]. Run-time specialization accommodates Matlab’s dynamic nature, and is a technique that has been applied in many other dynamic optimization contexts. The Psyco python virtual machine, for instance, implements specialization “by need” [19], a process similar to the online partial evaluation approaches applied to Matlab [5] and Maple [20]. This specialization technique involves interleaving program specialization and execution; the specializer can request facts such as the type of variables while a procedure is executing, and depending on the result, potentially modify the compiled code to be more efficient. The design goal was to eliminate much of the interpretative overhead through the use of JIT compilation, without sacrificing the dynamic features of the language. Approaches such as Psyco differ from our system by working on fine-grain code fragments rather than functions, trading simpler code-generation and analysis requirements for smaller specialized sequences. Similar to the Psyco effort, the TraceMonkey VM for the JavaScript language has focused on just-in-time specialization based on type information in order to increase performance [21]. The design is based on a bytecode interpreter that can identify frequently executed bytecode sequences (traces) going through loops and compile them to efficient native code based on collected type information. A crucial assumption of their system is that programs will spend most of their time in loops, and that the types of variables will remain mostly stable throughout the execution of loops. They have achieved speedups of up to 25 times on some benchmarks. However, their current VM does poorly on benchmarks making extensive use of recursion. Type Inference. Our specialization approach is facilitated by a type inference analysis [22], where we use a straightforward, if non-trivial dataflow analysis to determine type information. The problem, of course, has been examined in many contexts, and poses an efficiency and accuracy trade-off even in the case of statically typed languages, such as C++ [23] or Java [24]. In these cases relatively cheap flow-insensitive approaches to type analysis have been shown effective. In a more general and flow-sensitive sense the type inference problem can also be seen as a bidirectional dataflow analysis, propagating type information both along and against the direction of control flow [25]. In most such analyses types are considered static, although dynamic types may be reduced to static types through the use of a Static Single Assignment (SSA) representation. Type inference on dynamic languages brings additional complexity. Constructs like eval, Matlab’s cd, as well as dynamic loading and reflection features, make it difficult or impossible to know the entire call graph of a program ahead of time. Despite this, there have been efforts to statically perform type inference on dynamic languages such as Matlab [26] and Ruby [27]. These approaches show potential to detect type errors ahead of time, but they do not address the aforementioned problems. Our approach, on the other hand, can operate on programs whose call graphs are not fully known ahead of time.
8 Conclusions and Future Work
Our experience with McVM demonstrates that online specialization is an effective and viable technique for optimizing Matlab programs. Although other specialization and partial evaluation approaches have been applied to Matlab [4,5] and similar dynamic language contexts [19,21], we provide an efficient and full JIT solution. Our approach focuses on optimizing code generation, uses a coarse-grained strategy that minimizes specialization overhead, and is specifically designed to accommodate complex dynamic language properties. Combined with an effective type and shape inference strategy, McVM is able to achieve performance up to three orders of magnitude faster than competing Matlab implementations such as GNU Octave, and in several cases faster than the commercial product.

Further improvements to performance are possible in a number of ways. The need to be conservative in our type inference analysis means that unknown types dominate in merges. The result is that once "unknown" types are introduced, they often propagate and undermine the type inference efforts. Our code generation strategy is then left with very little information to operate on. In many cases, however, even if the type of a variable cannot be determined with 100% certainty, it may be possible to mitigate the impact of unknown types by predicting the most likely outcome. A speculative design enables heuristic judgements. It is likely, for example, that a variable repeatedly added to integer matrices is itself an integer matrix. Our code generation system could use these "best guesses" to generate an optimized code path. The types of variables can then be tested during execution and either an optimized path or default code chosen as appropriate. Speculative approaches have been successful based on external compilation [4], and a JIT-based solution has potential to yield further significant speed gains.
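The guarded, speculative dispatch described above can be illustrated with a small sketch (a hypothetical Python example of the general pattern, not McVM code): a fast path is compiled under a "best guess" type assumption, a cheap runtime test guards it, and generic code serves as the fallback.

def add_specialized_int(a, b):
    # fast path: valid only under the guessed type "vector of integers"
    return [x + y for x, y in zip(a, b)]

def add_generic(a, b):
    # default path: handles arbitrary dynamic values
    return [x + y for x, y in zip(a, b)]

def add(a, b):
    # guard: verify the speculative type assumption before the fast path
    if all(isinstance(x, int) for x in a) and all(isinstance(x, int) for x in b):
        return add_specialized_int(a, b)
    return add_generic(a, b)

print(add([1, 2], [3, 4]))    # guard passes: specialized path
print(add([1.5, 2], [3, 4]))  # guard fails: generic fallback

In a real JIT the two paths would differ in the machine code emitted (e.g., unboxed integer arithmetic on the fast path); here they are deliberately identical so the sketch stays runnable.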
References

1. Lattner, C.: LLVM: An infrastructure for multi-stage optimization. Master's thesis, Comp. Sci. Dept., U. of Illinois at Urbana-Champaign (December 2002)
2. Boehm, H., Spertus, M.: Transparent programmer-directed garbage collection for C++ (2007)
3. Biggar, P., de Vries, E., Gregg, D.: A practical solution for scripting language compilers. In: SAC 2009, pp. 1916–1923. ACM, New York (2009)
4. Almási, G., Padua, D.: MaJIC: compiling MATLAB for speed and responsiveness. In: PLDI 2002, pp. 294–303. ACM, New York (2002)
5. Elphick, D., Leuschel, M., Cox, S.: Partial evaluation of MATLAB. In: Pfenning, F., Smaragdakis, Y. (eds.) GPCE 2003. LNCS, vol. 2830, pp. 344–363. Springer, Heidelberg (2003)
6. De Rose, L., Gallivan, K., Gallopoulos, E., Marsolf, B., Padua, D.: FALCON: A MATLAB interactive restructuring compiler. In: Huang, C.-H., Sadayappan, P., Banerjee, U., Gelernter, D., Nicolau, A., Padua, D.A. (eds.) LCPC 1995. LNCS, vol. 1033, pp. 269–288. Springer, Heidelberg (1996)
7. Birkbeck, N., Levesque, J., Amaral, J.N.: A dimension abstraction approach to vectorization in MATLAB. In: CGO 2007, pp. 115–130. IEEE Computer Society, Los Alamitos (2007)
8. Joisha, P.G., Banerjee, P.: A translator system for the MATLAB language: Research articles. Softw. Pract. Exper. 37(5), 535–578 (2007)
9. Rose, L.D., Padua, D.: A MATLAB to Fortran 90 translator and its effectiveness. In: ICS 1996, pp. 309–316. ACM, New York (1996)
10. Joisha, P.G., Banerjee, P.: An algebraic array shape inference system for MATLAB. ACM Trans. Program. Lang. Syst. 28(5), 848–907 (2006)
11. Haldar, M., Nayak, A., Kanhere, A., Joisha, P., Shenoy, N., Choudhary, A., Banerjee, P.: Match virtual machine: An adaptive runtime system to execute MATLAB in parallel. In: ICPP 2000, pp. 145–152 (2000)
12. Jones, N.D., Gomard, C.K., Sestoft, P.: Partial evaluation and automatic program generation. Prentice-Hall, Inc., Englewood Cliffs (1993)
13. Cooper, K.D., Hall, M.W., Kennedy, K.: Procedure cloning. In: Computer Languages, pp. 96–105 (1992)
14. Chauhan, A., McCosh, C., Kennedy, K., Hanson, R.: Automatic type-driven library generation for telescoping languages. In: SC 2003 (2003)
15. Chambers, C., Ungar, D.: Customization: optimizing compiler technology for SELF, a dynamically-typed object-oriented programming language. SIGPLAN Not. 24(7), 146–160 (1989)
16. Kennedy, A., Syme, D.: Design and implementation of generics for the .NET Common Language Runtime. In: PLDI 2001, pp. 1–12. ACM, New York (2001)
17. Shankar, A., Sastry, S.S., Bodík, R., Smith, J.E.: Runtime specialization with optimistic heap analysis. SIGPLAN Not. 40(10), 327–343 (2005)
18. Schultz, U., Consel, C.: Automatic program specialization for Java. ACM Trans. Program. Lang. Syst. 25(4), 452–499 (2003)
19. Rigo, A.: Representation-based just-in-time specialization and the Psyco prototype for Python. In: PEPM 2004, pp. 15–26. ACM, New York (2004)
20. Carette, J., Kucera, M.: Partial evaluation of Maple. In: PEPM 2007, pp. 41–50. ACM, New York (2007)
21. Gal, A., Eich, B., Shaver, M., Anderson, D., Mandelin, D., Haghighat, M.R., Kaplan, B., Hoare, G., Zbarsky, B., Orendorff, J., Ruderman, J., Smith, E.W., Reitmaier, R., Bebenita, M., Chang, M., Franz, M.: Trace-based just-in-time type specialization for dynamic languages. In: PLDI 2009, pp. 465–478. ACM, New York (2009)
22. Duggan, D., Bent, F.: Explaining type inference. Science of Computer Programming, 37–83 (1996)
23. Bacon, D.F., Sweeney, P.F.: Fast static analysis of C++ virtual function calls. In: OOPSLA 1996, pp. 324–341. ACM, New York (1996)
24. Tip, F., Palsberg, J.: Scalable propagation-based call graph construction algorithms. In: OOPSLA 2000, pp. 281–293. ACM, New York (2000)
25. Singer, J.: Sparse bidirectional data flow analysis as a basis for type inference. In: Web proceedings of the Applied Semantics Workshop (2004)
26. Joisha, P.G., Banerjee, P.: Correctly detecting intrinsic type errors in typeless languages such as MATLAB. In: APL 2001, pp. 7–21. ACM, New York (2001)
27. Furr, M., An, J.h.D., Foster, J.S., Hicks, M.: Static type inference for Ruby. In: SAC 2009, pp. 1859–1866. ACM, New York (2009)
RATA: Rapid Atomic Type Analysis by Abstract Interpretation – Application to JavaScript Optimization

Francesco Logozzo and Herman Venter

Microsoft Research, Redmond, WA (USA)
{logozzo,hermanv}@microsoft.com
Abstract. We introduce RATA, a static analysis based on abstract interpretation for the rapid inference of atomic types in JavaScript programs. RATA enables aggressive type specialization optimizations in dynamic languages. RATA is a combination of an interval analysis (to determine the range of variables), a kind analysis (to determine if a variable may assume fractional values, or NaN), and a variation analysis (to relate the values of variables). The combination of those three analyses allows our compiler to specialize Float64 variables (the only numerical type in JavaScript) to Int32 variables, providing large performance improvements (up to 7.7×) in some of our benchmarks.
1 Introduction
JavaScript is probably the most widespread programming platform in the world. JavaScript is an object-oriented, dynamically typed language with closures and higher-order functions. JavaScript runtimes can be found in every WEB browser (e.g., Internet Explorer, Firefox, Safari and so on) and in popular software such as Adobe Acrobat and Adobe Flash. Large and complex WEB applications such as Microsoft Office WEB Apps or Google Mail rely on JavaScript to run inside every browser on the planet. A fast JavaScript implementation is crucial to provide a good user experience for rich WEB applications and hence to enable their success. Because of its dynamic nature, a JavaScript program cannot statically be compiled to efficient machine code. A fully interpreted solution for a JavaScript runtime is generally acknowledged to be too slow for the new generation of web applications. Modern implementations rely on Just-in-time (JIT) techniques: When a function f is invoked at runtime, f is compiled to a function f′ in machine code, and it is then executed. The performance gain of executing f′ pays off the extra time spent in the compilation of f. The quality of the code that the JIT generates for f′ depends on the amount of dynamic and static information that is available to it at the moment of the invocation of f. For instance, if the JIT knows that a certain variable is of an atomic type then it generates specialized machine instructions (e.g., incr for an Int32) instead of relying on expensive boxing/unboxing operations.
Motivating Example. Let us consider the nestedLoops function in Fig. 1. Without any knowledge of the concrete types of i and j, the JIT should generate a value wrapper containing: (i) a tag with the dynamic type of the value, and (ii) the value. Value wrappers are disastrous for performance. For instance, the execution of nestedLoops takes 310ms on our laptop.¹ In fact, the dynamic execution of the statement i++ involves: (i) an "unbox" operation to fetch the old value of i and check that it is a numerical type; (ii) incrementing i; (iii) a "box" operation to update the wrapper with the new value. The JIT can specialize the function if it knows that i and j are numerical values. In JavaScript, the only numerical type is a 64-bit floating point (Float64) which follows the IEEE 754 standard [16,19]. In our case, a simple type inference can determine that i and j are Float64: they are initialized to zero and only incremented by one. The execution time then goes down to 180ms. The JIT may do a better job if it knows that i and j are Int32: floating point comparisons are quite inefficient, usually requiring twice as many instructions (or more) as integer comparisons on an x86 architecture. A simple type inference does not help, as it cannot infer that i and j are bounded by 10000. In fact, it is safe to specialize a numerical variable x with type Int32 when one can prove that for all possible executions: (i) x never assumes values outside of the range [−2^31, 2^31 − 1]; and (ii) x is never assigned a fractional value (e.g., 0.5).

Contribution. We introduce RATA, Rapid Atomic Type Analysis, a new static analysis based on abstract interpretation, to quickly and precisely infer the numerical types of variables. RATA is based on a combination of an interval analysis (to determine the range of variables), a kind analysis (to determine if a variable may assume fractional values, or NaN) and a variation analysis (to relate the values of variables). In our example, the first analysis discovers that i ∈ [0, 10000], j ∈ [0, 10000] and the second that i, j ∈ Z. Using this information, the JIT can further specialize the code so that i and j are allocated in integer registers, and as a matter of fact the execution time (inclusive of the analysis time) drops to 31ms! The function bitsinbyte in Fig. 1 (extracted from the SunSpider benchmarks [31]) illustrates the need for the variation analysis. The interval analysis determines that m ∈ [1, 256], c ∈ [0, +∞]. The kind analysis determines that m, c ∈ Z. If we infer that c ≤ m then we can conclude that c is an Int32. In general, we can solve this problem using a relational, or weakly relational, abstract domain, such as Polyhedra [12], Subpolyhedra [23], Octagons [26], or Pentagons [24]. However, all those abstract domains have a cost which is quadratic (Pentagons), cubic (Octagons), polynomial (Subpolyhedra) or exponential (Polyhedra) and hence we rejected their use, as non-linear costs are simply not tolerable at runtime. Our variation analysis infers that: (i) m and c differ by one
¹ The data we report is based on the experience with our own implementation of a JavaScript interpreter for .Net. More details will be given in Sect. 6.
function nestedLoops() {
  var i, j;
  for(i = 0; i < 10000; i++)
    for(j = 0; j < i; j++) {
      // do nothing...
    }
}
function bitsinbyte(b) {
  var m = 1, c = 0;
  while(m < 0x100) {
    if(b & m) c++;
    m <<= 1;
  }
  return c;
}

Fig. 1. The nestedLoops and bitsinbyte functions
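The interval and kind analyses described above can be mimicked for the simple counted-loop pattern of nestedLoops. The sketch below is our own Python illustration of the idea, not RATA's implementation (which runs inside a JavaScript compiler); the helper names are invented for the example:

def analyze_counter(init, bound):
    # abstractly execute: i = init; while (i < bound) i++;
    lo, hi = init, init                      # interval of i
    integral = float(init).is_integer()      # kind: can i become fractional?
    # for this pattern one abstract iteration reaches the fixpoint:
    # inside the loop i < bound, and i++ pushes the upper bound to `bound`
    hi = max(hi, bound)
    integral = integral and float(bound).is_integer()
    return (lo, hi), integral

interval, is_int = analyze_counter(0, 10000)
INT32 = (-2**31, 2**31 - 1)
fits = is_int and INT32[0] <= interval[0] and interval[1] <= INT32[1]
print(interval, is_int, "specialize to Int32:", fits)
# (0, 10000) True specialize to Int32: True

The two facts together (bounded interval, integral kind) are exactly the side conditions stated above for a safe Int32 specialization.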
Table 2. Execution times of the TPC-H queries for hand-written SQL vs. JReq (Time, standard deviation σ, and relative difference ∆ per query; queries running longer than one hour were cancelled).
We executed each of the queries in TPC-H in turn using random query parameters, with a garbage collection cycle run in-between each query. We then executed the corresponding JDBC queries using the same parameters. This was repeated six times, with the last five runs kept for the final results. Queries that ran longer than one hour were cancelled. Table 2 summarises the results of the benchmarks. Unlike TPC-W, the queries in TPC-H take several seconds each to execute, so runtime optimisations do not significantly affect the results. Since almost all the execution time occurs at the database and since the SQL generated from the JQS queries is semantically equivalent to the original SQL queries, differences in execution time are mostly caused by the inability of the database's query optimiser to find optimal execution plans. In order to execute the complex queries in TPC-H efficiently, query optimisers must be able to recognise certain patterns in a query and restructure them into more optimal forms. The particular SQL generated by JReq uses a SQL subset that may match different optimisation patterns in database query optimisers than hand-written SQL code. For example, the original SQL for query 16 evaluates a COUNT(DISTINCT) operation inside of GROUP BY. This is written in JQS using an equivalent triply nested query, but MySQL is not able to optimise the query correctly, and running the triply nested query directly results in extremely poor performance. On the other hand, in query 18, JReq's use of deeply nested queries instead of a more specific SQL operation (in this case, GROUP BY...HAVING) fits a pattern that MySQL is able to execute efficiently, unlike the original hand-written SQL. Because of the sensitivity of MySQL's query optimiser to the structure of SQL queries, it will be important in the future for JReq to provide more flexibility to programmers in adjusting the final SQL generated by JReq. Overall, 21 of the 22 queries from TPC-H could be successfully expressed using the JQS syntax and translated into SQL. Only one query, which used a LEFT OUTER JOIN, could not be handled because JQS and JReq do not currently support the operation yet. For most of the queries, the JQS queries executed with similar performance to the original queries. Where there are differences in
execution time, most of these differences can be eliminated by either improving the MySQL query optimiser, adding special rules to the SQL generator to generate patterns that are better handled by MySQL, or extending the syntax of JQS to allow programmers to more directly specify those specific SQL keywords that are better handled by MySQL.
6 Conclusions
The JReq system translates database queries written in the imperative language Java into SQL. Unlike other systems, the algorithms underlying JReq are able to analyse code written in imperative programming languages and recognise complex query constructs like aggregation and nesting. In developing JReq, we have created a syntax for database queries that can be written entirely with normal Java code, we have designed an algorithm based on symbolic execution to automatically translate these queries into SQL, and we have implemented a research prototype of our system that shows performance competitive with hand-written SQL.

We envision JReq as a useful complement to other techniques for translating imperative code into SQL. For common queries, existing techniques often provide greater syntax flexibility than JReq, but for the most complex queries, programmers can use JReq instead of having to resort to domain-specific languages like SQL. As a result, all queries will end up being written in Java, which can be understood by all the programmers working on the codebase.
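The core idea of the translation, running ordinary query code over symbolic values so that evaluation yields SQL text instead of results, can be sketched compactly. The following toy Python example is our own illustration of that principle; JReq itself operates on Java bytecode with a full symbolic-execution engine and handles far richer constructs:

class Sym:
    # a symbolic column reference that records comparisons
    # instead of evaluating them
    def __init__(self, name):
        self.name = name
    def __gt__(self, other):
        return f"{self.name} > {other!r}"
    def __eq__(self, other):
        return f"{self.name} = {other!r}"

class SymRow:
    def __getattr__(self, col):
        return Sym(col)

def to_where_clause(predicate):
    # execute the predicate once on a symbolic row; the result is SQL text
    return "WHERE " + predicate(SymRow())

# an imperative-style predicate the programmer writes as plain code:
print(to_where_clause(lambda row: row.quantity > 300))
# WHERE quantity > 300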
References

1. American National Standards Institute: American National Standard for Information Systems—Database Language—SQL: ANSI INCITS 135-1992 (R1998). American National Standards Institute (1992)
2. Amza, C., Cecchet, E., Chanda, A., Elnikety, S., Cox, A., Gil, R., Marguerite, J., Rajamani, K., Zwaenepoel, W.: Bottleneck characterization of dynamic web site benchmarks. Tech. Rep. TR02-389, Rice University (February 2002)
3. Bradley, A.R., Manna, Z.: The Calculus of Computation: Decision Procedures with Applications to Verification. Springer, New York (2007)
4. Cook, W.R., Rai, S.: Safe query objects: statically typed objects as remotely executable queries. In: ICSE 2005: Proceedings of the 27th international conference on Software engineering, pp. 97–106. ACM, New York (2005)
5. DeMichiel, L., Keith, M.: JSR 220: Enterprise JavaBeans 3.0, http://www.jcp.org/en/jsr/detail?id=220
6. Flanagan, C., Saxe, J.B.: Avoiding exponential explosion: generating compact verification conditions. In: POPL 2001, pp. 193–205. ACM, New York (2001)
7. Guravannavar, R., Sudarshan, S.: Rewriting procedures for batched bindings. Proc. VLDB Endow. 1(1), 1107–1123 (2008)
8. Iu, M.Y., Zwaenepoel, W.: Queryll: Java database queries through bytecode rewriting. In: van Steen, M., Henning, M. (eds.) Middleware 2006. LNCS, vol. 4290, pp. 201–218. Springer, Heidelberg (2006)
9. Katz, R.H., Wong, E.: Decompiling CODASYL DML into relational queries. ACM Trans. Database Syst. 7(1), 1–23 (1982)
10. Lieuwen, D.F., DeWitt, D.J.: Optimizing loops in database programming languages. In: DBPL3: Proceedings of the third international workshop on Database programming languages: bulk types & persistent data, pp. 287–305. Morgan Kaufmann, San Francisco (1992)
11. Maier, D., Stein, J., Otis, A., Purdy, A.: Development of an object-oriented DBMS. In: OOPSLA 1986, pp. 472–482. ACM Press, New York (1986)
12. Necula, G.C.: Translation validation for an optimizing compiler. In: PLDI 2000, pp. 83–94. ACM, New York (2000)
13. PostgreSQL Global Development Group: PostgreSQL, http://www.postgresql.org/
14. Rinard, M.C.: Credible compilation. Tech. Rep. MIT/LCS/TR-776, Cambridge, MA, USA (1999)
15. Torgersen, M.: Language INtegrated Query: unified querying across data sources and programming languages. In: OOPSLA 2006, pp. 736–737. ACM Press, New York (2006)
16. Transaction Processing Performance Council: TPC Benchmark W (Web Commerce) Specification Version 1.8. Transaction Processing Performance Council (2002)
17. Transaction Processing Performance Council: TPC Benchmark H (Decision Support) Standard Specification Version 2.8.0. Transaction Processing Performance Council (2008)
18. Vallée-Rai, R., Co, P., Gagnon, E., Hendren, L., Lam, P., Sundaresan, V.: Soot - a Java bytecode optimization framework. In: CASCON 1999: Proceedings of the 1999 conference of the Centre for Advanced Studies on Collaborative research, p. 13. IBM Press (1999)
19. Wiedermann, B., Cook, W.R.: Extracting queries by static analysis of transparent persistence. In: POPL 2007, pp. 199–210. ACM Press, New York (2007)
20. Wiedermann, B., Ibrahim, A., Cook, W.R.: Interprocedural query extraction for transparent persistence. In: OOPSLA 2008, pp. 19–36. ACM, New York (2008)
21. Wong, L.: Kleisli, a functional query system. J. Funct. Program. 10(1), 19–56 (2000)
Verifying Local Transformations on Relaxed Memory Models

Sebastian Burckhardt¹, Madanlal Musuvathi¹, and Vasu Singh²

¹ Microsoft Research
² EPFL, Switzerland
Abstract. The problem of locally transforming or translating programs without altering their semantics is central to the construction of correct compilers. For concurrent shared-memory programs this task is challenging because (1) concurrent threads can observe transformations that would be undetectable in a sequential program, and (2) contemporary multiprocessors commonly use relaxed memory models that complicate the reasoning. In this paper, we present a novel proof methodology for verifying that a local program transformation is sound with respect to a specific hardware memory model, in the sense that it is not observable in any context. The methodology is based on a structural induction and relies on a novel compositional denotational semantics for relaxed memory models that formalizes (1) the behaviors of program fragments as a set of traces, and (2) the effect of memory model relaxations as local trace rewrite operations. To apply this methodology in practice, we implemented a semi-automated tool called Traver and used it to verify/falsify several compiler transformations for a number of different hardware memory models.
1 Introduction
Compilers perform a series of transformations that translate a high-level program into low-level machine instructions, while optimizing the code for performance. For correctness, these transformations must preserve the meaning for any input program. Proving the correctness of program transformations has been well studied for sequential programs [29,18,17,19]. However, concurrent shared-memory programs require additional caution because transformations that reorder, introduce, or eliminate accesses to shared memory may be observed by concurrent threads and can thus introduce subtle safety or liveness errors in an otherwise correct program. For example, the redundant read elimination shown in Fig. 1 is not safe because it leads to nontermination, and the branch consolidation in Fig. 2 is unsafe because it can lead to an assertion violation. Typically, only a very small part of all memory accesses (namely the accesses that are used for synchronization purposes) are susceptible to such issues. However, in the absence of a whole-program analysis or user-provided annotations,
int X = 0;

Transformation:                          Observer:
  int r1 = X;          int r1 = X;         X = 1;
  while(X == 0);   ⇒   while(r1 == 0);

Fig. 1. Redundant read elimination causing nontermination

bool B = false, X = false, Y = false;

Transformation:                          Observer:
  bool r = B;          bool r = B;         X = true;
  if(r) {              X = r;              assert(X || Y);
    X = r;         ⇒   Y = !r;
    Y = !r;
  } else {
    Y = !r;
    X = r;
  }

Fig. 2. This branch consolidation is unsafe: the assert can fail in the transformed program, but not the original program. The reason is that the transformation changes the order of the writes to X and Y in the then-branch.
we cannot distinguish between data accesses and accesses that are used for synchronization. In practice, most compilers rely on the programmer to provide special type qualifiers like 'volatile' [20] or 'atomic' [4] or on custom annotations to identify synchronization accesses. Programs that correctly convey all synchronization are called 'properly labeled' [14] or 'data-race-free' [2]. There is a general understanding on how to correctly transform data-race-free programs [20,4,25]. In this paper, however, we address the more conservative problem of safely transforming general programs, including programs that contain data races, or programs that are missing the annotations or types needed to identify synchronization accesses. It may seem at first that under this conservative restriction, very few transformations would be safe. However, we can assume that programs that are designed to work on relaxed hardware memory models are resilient to certain transformations. Clearly, there is no need for a compiler to be more conservative than the hardware executing the compiled program. For example, consider the example in Fig. 2 again, but let the execution be on a machine that relaxes write-to-write order. Now, we may argue that the transformation is indeed correct as it does not introduce new behaviors: if write-to-write order is relaxed by the hardware, the assertion violation may occur even for the original untransformed program. For some transformations it can be rather mind-boggling to determine whether it is safe for a given architecture. For instance, by using the methodology
presented in this paper, we will prove (though not fully comprehend) that the transformation {r := A; if r == 0 then A := 0} → {r := A} is safe on a sequentially consistent machine, unsafe on a machine that relaxes write-to-read order (such as TSO), but once more safe on a machine that additionally relaxes write-to-write order (such as PSO).

Overall, we summarize our contributions as follows:

– (Section 3) We build a semantic foundation for relaxed hardware memory models. We show how many common relaxations can be explained as local rewrite operations on memory access sequences. In particular, we present a novel aggregation rule that can explain the effect of store buffers, the most common relaxation of all. Our semantics is compositional (it defines the behavior of program fragments recursively) and can model infinite executions.
– (Section 4) We present a proof methodology to verify the soundness of local program transformations over relaxed memory models, based on a notion of observations. We introduce a notion of invisible rewrite rules (Section 4.1) to reason about all possible program contexts.
– (Section 5) We show how to apply the methodology in practice by verifying/falsifying 8 program transformations for 5 different memory models (including sequential consistency), aided by a custom semi-automatic tool called Traver. Given a local program transformation and a memory model, Traver uses an automated theorem prover [11] to prove that the set of observations of the transformed program is contained in the set of observations of the original program, for all possible program contexts. Conversely, when provided with an additional falsification context, Traver can automatically show that the transformation leads to observable differences in behavior. This produces a certificate of unsoundness of the transformation.
2 Related Work
Our calculus and semantics, and in particular the handling of infinite executions, were inspired by Brookes’ fully abstract denotational semantics for sequentially consistent programs [6]. Languages and semantics to study relaxed memory models have been developed before, in both operational style [5] and algebraic style [24]. Our work differs in that it (1) guarantees fairness for infinite executions and (2) relates to contemporary multiprocessor architectures and common program transformations. Much prior work on hardware memory models focuses on the complex intricacies of axiomatic specifications and gives only partial formalizations (in particular, program syntax is generally ignored). Some work departs from the mainstream and uses an operational style [23] or an algebraic style [3,27] (where the algebraic style bears some similarity to our use of dynamic rewrite rules, but
does not include the important store-load aggregation rule which is crucial to correctly model contemporary hardware memory models). Recently, researchers have proposed revised axiomatic formalizations of the x86 architecture [13,22]. Our work is orthogonal: our goal is to find simple yet precise means to reason about various common hardware relaxations, rather than fully model all details of one specific hardware architecture. Our work was partly motivated by recent work [9,26] that demonstrated the difficulty of manually verifying compiler optimizations against memory models. It is also similar to efforts on verifying the soundness of compiler transformations for language-level models (Java, DRF) [25]. Unlike the latter, however, we define soundness of transformations relative to the hardware memory model (and are thus not susceptible to whether programs are data-race-free or not), can handle infinite executions, and provide a tool that helps to automate parts of the verification/falsification effort.
3 Semantic Foundation
In this section, we lay the foundation for understanding hardware memory models and for reasoning about them formally. We start by demonstrating how we explain typical relaxations in the hardware using dynamic rewrite operations. We then formalize this concept by defining a simple imperative language for shared-memory programs and a compositional denotational semantics. Along the way, we discuss various challenges, such as how our semantics handles infinite executions and fairness. We start with a quick introduction to relaxed hardware memory models, revisiting classical examples [1,14]. We use special diagrams called derivations to explain how to understand relaxations as a consequence of dynamic rewriting of access sequences. We distinguish three types of dynamic rewrite operations: reordering, aggregation, and splitting. Ordering relaxations allow the hardware to execute operations in a different order than specified by the program. This can speed up execution as it allows the hardware to delay the completion of operations with high latency (such as propagating stores to a global shared memory) past subsequent operations with low latency (such as reading a locally cached value). In Fig. 3 (a) and (b), we show classic "litmus tests" to illustrate the effects of ordering relaxations. These programs distinguish syntactically between processor-local registers (lowercase identifiers) and shared memory locations (capitalized identifiers). Not all effects can be explained by simply reordering instructions. For example, the program in Fig. 3(c) is a variation of 3(b) that shows how stored values can be visible to subsequent loads by the same processor before they have been committed to shared memory. This effect is very common and often attributed to processor-local "store buffers". We explain this effect as an aggregation of the store with the following load. More formally, let ld L,x and st L,x represent a load from or a store to location L, with loaded/stored value x. Now consider the dynamic
(a)                                  (b)
Initially: A = B = 0                 Initially: A = B = 0
P1          P2                       P1          P2
A := 1      r := B                   A := 1      B := 1
B := 1      s := A                   r := B      s := A
Eventually: r = 1, s = 0             Eventually: r = s = 0

(c)
Initially: A = B = 0
P1          P2
A := 1      B := 1
u := A      v := B
r := B      s := A
Eventually: r = s = 0, u = v = 1
Fig. 3. (a) This outcome is possible if the stores by P1 are reordered, or if the loads by P2 are reordered. (b) This outcome (known as Dekker) is possible if the stores are delayed past the loads. (c) This outcome (a variation of Dekker) is possible if stores can be both forwarded to loads and delayed past loads.

sss (swap store-store):    st L,x  st L′,x′  →  st L′,x′  st L,x    (L′ ≠ L)
sll (swap load-load):      ld L,x  ld L′,x′  →  ld L′,x′  ld L,x
ssl (swap store-load):     st L,x  ld L′,x′  →  ld L′,x′  st L,x    (L′ ≠ L)
sls (swap load-store):     ld L,x  st L′,x′  →  st L′,x′  ld L,x    (L′ ≠ L)
asl (aggregate store-load): st L,x  ld L,x  →  st L,x

Model      Rewrite Rules
SC         (none)
390        ssl
TSO        ssl asl
x86-TSO    ssl asl
PSO        ssl asl sss
CLR        ssl asl sll
RMO        ssl asl sss sll_cd sls_cd
Alpha      ssl asl sss sll_≠ sls_cd
Fig. 4. Dynamic rewrite operations employed by some commercial hardware memory models and by the CLR memory model. The subscripts c, d and ≠ indicate that the accesses are swapped only if they are not control dependent, not data dependent, or target a different location, respectively.

Top left (derivation for Fig. 3(a)):
  st A,1  st B,1   --sss-->   st B,1  st A,1
  interleaved with  ld B,1  ld A,0  gives:
  st B,1  ld B,1  ld A,0  st A,1

Top right (derivation for Fig. 3(b)):
  st A,1  ld B,0   --ssl-->   ld B,0  st A,1
  st B,1  ld A,0   --ssl-->   ld A,0  st B,1
  interleaving:  ld B,0  ld A,0  st A,1  st B,1

Bottom (derivation for Fig. 3(c)):
  st A,1  ld A,1  ld B,0   --asl-->   st A,1  ld B,0   --ssl-->   ld B,0  st A,1
  st B,1  ld B,1  ld A,0   --asl-->   st B,1  ld A,0   --ssl-->   ld A,0  st B,1
  interleaving:  ld B,0  ld A,0  st A,1  st B,1
Fig. 5. Top left: Derivation for Fig. 3(a). P1 issues two stores that get reordered by sss before being interleaved with the two loads by P2. Note that we could provide an alternative derivation where the loads get reordered by sll. Top right: Derivation for Fig. 3(b). Both store-load sequences are reordered by ssl before being interleaved. Bottom: Derivation for Fig. 3(c). Both processors first aggregate the stores with the first following load by asl, then delay it past the second load by ssl.
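These derivations can be replayed mechanically on small traces. The following minimal Python sketch is our own encoding of events as tuples (not the paper's formalism or its Traver tool): it applies the ssl rule to each processor's trace of the Dekker test of Fig. 3(b) and checks value consistency of an interleaving.

def ssl(trace):
    # all results of swapping one adjacent (st L, ld L') pair, L != L'
    out = []
    for i in range(len(trace) - 1):
        a, b = trace[i], trace[i + 1]
        if a[0] == "st" and b[0] == "ld" and a[1] != b[1]:
            out.append(trace[:i] + [b, a] + trace[i + 2:])
    return out

def value_consistent(trace):
    mem = {}
    for op, loc, val in trace:
        if op == "st":
            mem[loc] = val
        elif mem.get(loc, 0) != val:   # shared variables start at 0
            return False
    return True

p1 = [("st", "A", 1), ("ld", "B", 0)]   # P1: A := 1; r := B  (wants r = 0)
p2 = [("st", "B", 1), ("ld", "A", 0)]   # P2: B := 1; s := A  (wants s = 0)

print(value_consistent(p1 + p2))        # False: impossible in program order
t1, t2 = ssl(p1)[0], ssl(p2)[0]         # delay each store past its load
print(value_consistent([t1[0], t2[0], t1[1], t2[1]]))  # True: Dekker outcome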
rewrite operations in Fig. 4. All of these operations preserve the semantics of single-processor programs, as long as the conditions are observed (asl applies only to accesses that target the same location and store/load the same value, while sss, ssl, and sls apply only to accesses that target different locations). To see how these dynamic rewrite operations can explain the examples in Fig. 3, consider the derivation diagrams in Fig. 5. Each processor first produces a sequence of memory accesses consistent with the program. These sequences are dynamic, as they contain data that may not be known statically (such as actual addresses and values loaded or stored), and may repeat program fragments that execute in loops. The access sequences may then be locally modified by the dynamic rewrite operations. Next, the sequences of the processors are interleaved. Informally, an interleaving shuffles the various sequences while maintaining the access order within each sequence (we give a formal definition in Section 3.2). Our derivation diagrams show which sequences are being interleaved with an underbrace. At the end of the derivation (but not necessarily before), the sequence must be value-consistent; that is, loaded values must be equal to the latest value stored to the same location, or the initial value if there is no preceding store. In general, it is quite difficult to establish a precise relationship between abstract memory models (described as a collection of relaxations, in the style of [1]) and official memory model specifications of commercially available multiprocessors. However, it is possible and sensible for research purposes to model just the abstract core of such models, by focusing on the behavior of regular loads and stores. Fig. 4 shows how we can model the core of many commercial hardware memory models, and even the CLR memory model, using the dynamic rewrite rules defined in Fig. 4. Our main sources for constructing this table were [16] for 390, [28] for TSO, PSO and RMO, [10] for Alpha, [22] for x86-TSO, and [7,12,21] for CLR. Beyond simple loads and stores, all of these architectures contain additional constructs (such as locked instructions, compare-and-swaps, various memory fences, or volatile memory accesses). Many of them can be formalized using custom syntax and rewrite rules. However, for simplicity, we stick to regular loads and stores in this paper, augmented only by atomic load-stores (which offer a general method to represent synchronization operations such as locked instructions or compare-and-swap) and a full memory fence. Also, we do not currently model control or data dependencies (which would require us to follow the machine language syntax much more closely, as done in [22], for example). Some memory models (such as PPC, ARM, RC, and PC) allow stores to be split into separate components for each processor. By combining the asl rule with a hierarchical cache organization, our formalism can handle a limited form of store splitting that is sufficient to explain most examples (for more detail on this topic, see [8]). To correctly handle examples that involve synchronization with spinloops (such as Fig. 1), our formalism must handle infinite executions and model fairness conditions (e.g., the store must eventually be performed). To illustrate the subtleties of infinite rewriting, consider first the program in Fig. 6(a). If we naively
(a)
Initially: A = B = r = s = 0
P1                      P2
A := 1                  while (s == 0)
while (r == 0)            s := A
  r := B                B := 1
Eventually: P1, P2 do not terminate

(b)
Initially: A = B = r = s = 0
P1                      P2
while (r == 0) {        while (s == 0) {
  r := B                  s := A
  A := 1                  B := 1
  B := 0                  A := 0
}                       }
Eventually: P1, P2 do not terminate
Fig. 6. (a) This outcome is not possible: the store by P1 has to reach P2 eventually, and vice versa. (b) This outcome is possible: both processors repeat Dekker forever.
allow infinite applications of ssl, the store of A can be delayed past the infinite number of subsequent loads in the while loop. As a result, the program may not terminate, which we would like to disallow for the following reason. On actual hardware, stores are not retained indefinitely, so this program is guaranteed to terminate. Now consider Fig. 6(b). This program is essentially a "repeated Dekker" (Fig. 3(b)) and it is conceivable that both P1 and P2 keep executing forever. To explain such behavior, we need to apply ssl infinitely often. To handle both these examples correctly, our denotational semantics uses parallel rewriting on infinite traces (to be formally defined in the next section).

3.1 A Simple Imperative Language for Shared Memory
We now proceed to formalize our description of relaxed memory models. We start by defining a simple imperative "toy" programming language that is sufficient to express the relevant concepts. It is explicitly parallel and distinguishes syntactically between shared variables (uppercase identifiers) and local variables (lowercase identifiers). All variables are mutable and lexically scoped, and must be initialized. For example, the litmus test in Fig. 3(a) looks as follows:

share A = 0 in (share B = 0 in (local r = 0 in (local s = 0 in
  ((A := 1; B := 1) ∥ (r := B; s := A)))))

The formal syntax is shown in Fig. 7. We let L be the set of shared variables (locations in shared memory), R be the set of processor-local variables (registers), V = L ∪ R be the set of all variables, and X be the set of values assumed by the variables. The (load) and (store) statements move values between local and shared variables. The (assign) statement performs computation, such as addition, on local variables. The (compare-and-swap) statement compares the values of L and rc, stores rn to L if they are equal, and assigns the original value of L to rr. Note that our language does not contain lock or unlock instructions, as there is in fact no
L ∈ L                                      (shared variable)
r ∈ R                                      (local variable)
x ∈ X                                      (value)
f : X^n → X                                (local computation), n ≥ 0
s ::= skip                                 (skip)
   |  r := L                               (load)
   |  L := r                               (store)
   |  r := f(r1, . . . , rn)               (assign), n ≥ 0
   |  rr := cas(L, rc, rn)                 (compare and swap)
   |  fence                                (full memory fence)
   |  get r                                (read from console)
   |  print r                              (write to console)
   |  s; s                                 (sequential composition)
   |  s1 ∥ · · · ∥ sn                      (parallel composition), n ≥ 2
   |  if r then s else s                   (conditional)
   |  while r do s                         (loop)
   |  local r = x in s                     (local variable declaration)
   |  share L = x in s                     (shared variable declaration)

Fig. 7. Syntax of program snippets s
blocking synchronization at the hardware level (blocking synchronization can be implemented using spinloops and compare-and-swap). We also include a (fence) statement to enforce a full memory fence. The statements (get) and (print) represent simple I/O in the form of reading from or writing to an interactive console. The statements (sequential composition), (conditional) and (loop) have their usual meaning (we let the special value 0 denote false, and all others denote true). The statement (parallel composition) executes its components concurrently, and waits for all of them to finish before completing. The statements (local) and (shared) declare mutable variables and initialize them to the given value. Compared to let, as used in functional languages, they differ by (1) allowing mutation of the variable, and (2) strictly restricting the scope and lifetime to the nested snippet. To enforce that local variables are not accessed concurrently, we define the free variables as in Fig. 8 and call a snippet ill-formed if it contains a parallel composition s1 ∥ · · · ∥ sn such that for some i ≠ j we have FV(si) ∩ FV(sj) ∩ R ≠ ∅, and well-formed otherwise. We let S be the set of all well-formed snippets. Finally, we define a program to be a well-formed snippet s with no free variables. We let P be the set of all programs. Note that conventional hardware memory models consider only a restricted shape of programs (a single parallel composition of sequential processes). Our syntax is more general, as it allows arbitrary nesting of declarations and compositions. This (1) simplifies the definitions and proofs, (2) lets us perform local reasoning (because we can delimit the scope of variables), and (3) allows us to explore the implications of hierarchical memory organizations.
FV(skip) = ∅
FV(r := L) = {r, L}
FV(L := r) = {L, r}
FV(r0 := f(r1 . . . rn)) = {r0, r1, . . . , rn}
FV(rr := cas(L, rc, rn)) = {L, rr, rc, rn}
FV(fence) = ∅
FV(get r) = {r}
FV(print r) = {r}
FV(s; s′) = FV(s) ∪ FV(s′)
FV(s1 ∥ · · · ∥ sn) = FV(s1) ∪ · · · ∪ FV(sn)
FV(if r then s else s′) = {r} ∪ FV(s) ∪ FV(s′)
FV(while r do s) = {r} ∪ FV(s)
FV(local r = x in s) = FV(s) \ {r}
FV(share L = x in s) = FV(s) \ {L}

Fig. 8. Definition of the set of free variables FV(s) of s
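The free-variable function and the well-formedness condition above are easy to operationalize. Below is a small sketch, assuming our own tuple encoding of snippets (the encoding and names are ours, not the paper's):

def fv(s):
    kind = s[0]
    if kind == "load":  return {s[1], s[2]}          # ("load", r, L)
    if kind == "store": return {s[1], s[2]}          # ("store", L, r)
    if kind == "seq":   return fv(s[1]) | fv(s[2])
    if kind == "par":   return set().union(*(fv(t) for t in s[1:]))
    if kind == "local": return fv(s[3]) - {s[1]}     # ("local", r, x, body)
    raise ValueError(kind)

LOCALS = {"r", "s", "t"}  # registers (lowercase identifiers)

def well_formed(s):
    # no local variable may be free in two branches of a parallel composition
    if s[0] == "par":
        branches = s[1:]
        for i in range(len(branches)):
            for j in range(i + 1, len(branches)):
                if fv(branches[i]) & fv(branches[j]) & LOCALS:
                    return False
    return all(well_formed(t) for t in s[1:] if isinstance(t, tuple))

bad = ("par", ("load", "r", "A"), ("store", "B", "r"))  # r shared across ∥
print(well_formed(bad))  # False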
3.2 Denotational Semantics
Our semantics mirrors the ideas behind the derivation diagrams used in the previous section. Informally speaking, each processor generates a set of potential traces. These traces are concatenated by sequential composition, interleaved by parallel composition, and modified by the dynamic rewrite operations of the memory model. They are then filtered by requiring value consistency (after being interleaved and reordered). To capture the semantics of a program or snippet more formally, we first define a set B of behaviors; we then recursively define the semantic function [[·]]M to map any snippet s onto the set [[s]]M ⊂ B of its behaviors for a given memory model M. We represent the memory model M as a set of dynamic rewrite operations, and model its effect on behaviors as a closure operator. To capture behaviors locally, we use a combination of state valuations (to capture local state) and event traces (to capture externally visible events and accesses to shared variables). Let Q be the set of local states, defined as functions R → X, and let Evt be the set of events e of the form e ::= ld L,x | st L,x | ldst L,xl,xs | fence | get x | print x. We let Evt∗ be the set of finite event sequences (containing in particular the empty sequence, denoted ε), we let Evtω be the set of infinite event sequences, and we let Evt∞ = Evt∗ ∪ Evtω be the set of all event sequences. For two sequences w ∈ Evt∗ and w′ ∈ Evt∞, we let ww′ ∈ Evt∞ be the concatenation as usual. For a sequence of finite sequences w1, w2, · · · ∈ Evt∗, we let w1 w2 · · · ∈ Evt∞ be the concatenation (which may be finite or infinite). We then define the set of behaviors B = (Q × Q × Evt∗) ∪ (Q × Evt∞). A triple (q, q′, w) represents a terminating behavior that starts in local state q, ends in local state q′, and emits the finite event sequence w. A pair (q, w)
represents a nonterminating behavior that starts in local state q and emits the (finite or infinite) event sequence w. For a set B ⊆ B and states q, q′ ∈ Q we define the projections [B]_q^q′ = {w | (q, q′, w) ∈ B} and [B]_q = {w | (q, w) ∈ B}. To specify dynamic rewrite operations formally, we use rewrite rules (as in Fig. 4) of the form p →ϕ q, where p and q are symbolic event sequences (that is, sequences of events where locations and values are represented by variables) and where ϕ (if present) is a formula over the variables appearing in p and q which describes conditions under which the rewrite rule applies. We let T be the set of all such rewrite rules.

Definition 1. A memory model is a finite set M ⊂ T of rewrite rules.

Definition 2. For a rewrite rule t = (p →ϕ q), let g_t ⊂ Evt∗ × Evt∗ be the set of pairs (w1, w2) such that there exists a valuation of the variables in p, q for which p = w1, q = w2 and ϕ is true. Then, define the operator t : P(Evt∗) → P(Evt∗) to map a set A of finite event sequences to the set

  t(A) = {w w2 w′ | w, w′ ∈ Evt∗ ∧ (w1, w2) ∈ g_t ∧ w w1 w′ ∈ A}

For a set of rewrite rules M ⊂ T and a set of finite sequences A ⊂ Evt∗, we define the result of applying M to A as M(A) = A ∪ ⋃_{t∈M} t(A). In order to apply M to infinite sequences as well, we first introduce a definition for parallel rewriting. We generalize the notation for sequence concatenation to sets of sequences as usual (elementwise): for example, for S ⊆ Evt∗ and S′ ⊆ Evt∞ we let S S′ = {s s′ | s ∈ S, s′ ∈ S′}.
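Definition 2 is directly executable on finite traces. The sketch below is our own Python encoding (apply_rule, M_step, and the tuple representation of events are illustrative assumptions, not the paper's notation): a rule rewrites one matching subword, and one step of M closes A under a single application of any of its rules.

def apply_rule(match, A):
    # t(A) for a rule given as a function: match(w1) -> w2, or None
    result = set()
    for seq in A:
        for i in range(len(seq) + 1):
            for j in range(i, len(seq) + 1):
                w2 = match(seq[i:j])
                if w2 is not None:
                    result.add(seq[:i] + w2 + seq[j:])
    return result

def M_step(rules, A):
    # M(A) = A together with every single-rule rewrite of A
    out = set(A)
    for rule in rules:
        out |= apply_rule(rule, A)
    return out

# asl (aggregate store-load):  st L,x  ld L,x  ->  st L,x
def asl(w):
    if len(w) == 2 and w[0][0] == "st" and w[1] == ("ld",) + w[0][1:]:
        return (w[0],)
    return None

A = {(("st", "A", 1), ("ld", "A", 1), ("ld", "B", 0))}
print(M_step([asl], A))  # contains the original and the aggregated trace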
Definition 3. Let f : P(Evt∗) → P(Evt∗). Then we define the operators P_f : P(Evt∗) → P(Evt∗) and P̂_f : P(Evt∞) → P(Evt∞) by

  P_f(A) = ⋃ { f(A1) · · · f(An) | Ai ⊂ Evt∗ such that A1 · · · An ⊆ A }
  P̂_f(Â) = ⋃ { f(A1) f(A2) f(A3) · · · | Ai ⊂ Evt∗ such that A1 A2 A3 · · · ⊆ Â }
Note that P̂_f(Â) may contain infinite sequences even if Â does not.¹ We now show how to construct fixpoints for the effect of memory models M ⊆ T on behaviors.

Definition 4. Let M be a memory model. We define M∗ : P(Evt∗) → P(Evt∗) and M∞ : P(Evt∞) → P(Evt∞) by
  M∗(A) = ⋃_{k≥0} M^k(A)        M∞(Â) = ⋃_{k≥0} (P̂_{M∗})^k(Â)
Moreover, for a set B ⊆ B of behaviors, define the closure

  B^M = {(q, q′, w) | q, q′ ∈ Q and w ∈ M∗([B]_q^q′)} ∪ {(q, w) | q ∈ Q and w ∈ M∞([B]_q)}.

We can show that this is indeed a closure operation, namely, that (B^M)^M = B^M (see our tech report [8] for a proof). Note that our use of parallel rewriting applies the rewrite rules in a "locally finite" manner, which is important to handle infinite executions correctly.²

Definition of the Semantics. Using the notations listed in the next paragraph, Fig. 9 shows our recursive definition of the semantic function [[·]]M : S → P(B) that assigns to each snippet s the set of behaviors [[s]]M that s may exhibit on memory model M. It computes behaviors of snippets from the inside out, applying the rewrite rules at each step. Sequential composition appends the behaviors of its constituents, while parallel composition interleaves them. The behaviors of a load include all possible values it could load (because the actual value depends on the context, which is not known at this point). Value consistency is enforced at the level of the shared-variable declaration, at which point we also project away accesses to that variable.³ Fences are modeled as events that do not participate in any rewrite rules, thus enforcing ordering.

Notations used. For q ∈ Q, r ∈ R and x ∈ X we let q[r → x] denote the function that maps r to x, but is otherwise the same as the function q. For a shared variable L ∈ L, let Evt(L) ⊆ Evt be the set of memory accesses to L. For w ∈ Evt∞ and i ∈ N, let w[i] ∈ Evt be the event at position i (starting with 1). Let dom w ⊆ N be the set of positions of w. For two sequences w, w′ ∈ Evt∞ we define the set of fair interleavings (w # w′) ⊆ Evt∞ to consist of all sequences u ∈ Evt∞ such that there exist strictly monotonic functions f : dom w → dom u and g : dom w′ → dom u satisfying rg f ∩ rg g = ∅ and rg f ∪ rg g = dom u, and such that w[i] = u[f(i)] and w′[i] = u[g(i)] for all valid positions i. Note that the interleaving operator # is commutative and associative. For a subset of events C ⊆ Evt, we define the projection function proj_C : Evt∞ → Evt∞ to map a sequence to the largest subsequence containing only events in C. We write proj_−L short for the function proj_{Evt\Evt(L)} (which removes all accesses to L). We call a sequence w ∈ Evt∞ value-consistent with respect to a shared variable L ∈ L and an initial value x ∈ X if for each load of L appearing in w, the value loaded matches the value of the rightmost store to L that precedes the load in w, or the initial value x if there is no such store. We let Cons(L, x) ⊆ Evt∞ be the set of all sequences that are value-consistent with respect to L and x. Similarly, we let Cons(L, x, x′) ⊆ Evt∗ be the set of finite sequences that are value-consistent with respect to initial and final values x and x′ of L, respectively. For simplicity, we assume X = Z.

¹ For example, consider Â = {ε} and M = {ε → 0}. Then P̂_M(Â) contains the infinite sequence 000···.
² For example, consider the operation ssl in Fig. 4, which represents the effect of stores being delayed in a buffer; while there is no bound on how long stores can be delayed, they must eventually be performed. Our formalism reflects this properly, as follows (using digits 0, 1 instead of load and store events for illustration purposes). Let A = {1010···} and M = {10 → 01}. Then 0^k 1010··· is in M∞(A), but 000··· is not.
³ This behavior is similar to "hide" operators in process algebras. It implies that the behaviors of a program (unlike the behaviors of snippets) contain only external events.
[[skip]]M = {(q, q, ε) | q ∈ Q}^M

[[r := L]]M = {(q, q[r → x], ld L,x) | q ∈ Q, x ∈ X}^M

[[L := r]]M = {(q, q, st L,q(r)) | q ∈ Q}^M

[[r0 := f(r1 . . . rn)]]M = {(q, q[r0 → f(q(r1) . . . q(rn))], ε) | q ∈ Q}^M

[[rr := cas(L, rc, rn)]]M = ( {(q, q[rr → q(rc)], ldst L,q(rc),q(rn)) | q ∈ Q}
  ∪ {(q, q[rr → x], ldst L,x,x) | q ∈ Q, x ∈ X, x ≠ q(rc)} )^M

[[get r]]M = {(q, q[r → x], get x) | q ∈ Q, x ∈ X}^M

[[print r]]M = {(q, q, print q(r)) | q ∈ Q}^M

[[s1; s2]]M = ( {(q, q′, w) | there exist (q, q′′, w1) ∈ [[s1]]M and (q′′, q′, w2) ∈ [[s2]]M with w = w1 w2}
  ∪ {(q, w) | (q, w) ∈ [[s1]]M}
  ∪ {(q, w) | there exist (q, q′, w1) ∈ [[s1]]M and (q′, w2) ∈ [[s2]]M with w = w1 w2} )^M

[[s1 ∥ · · · ∥ sn]]M = ( {(q, q′, w) | there exist (q, qi, wi) ∈ [[si]]M for all 1 ≤ i ≤ n such that w ∈ w1 # . . . # wn and such that q′(r) = qi(r) for all r ∈ FV(si) and q′(r) = q(r) for all r ∉ FV(s1) ∪ · · · ∪ FV(sn)}
  ∪ {(q, w) | there exist w1, . . . , wn ∈ Evt∞ and a nonempty subset D ⊆ {1, . . . , n} such that for all j ∈ D we have a behavior (q, wj) ∈ [[sj]]M, and for all j ∉ D we have a behavior (q, qj, wj) ∈ [[sj]]M for some qj, and w ∈ w1 # . . . # wn} )^M

[[if r then s1 else s2]]M = ( {(q, q′, w) | (q(r) ≠ 0 ∧ (q, q′, w) ∈ [[s1]]M) ∨ (q(r) = 0 ∧ (q, q′, w) ∈ [[s2]]M)}
  ∪ {(q, w) | (q(r) ≠ 0 ∧ (q, w) ∈ [[s1]]M) ∨ (q(r) = 0 ∧ (q, w) ∈ [[s2]]M)} )^M

[[while r do s]]M = ( {(q0, qn, w1 · · · wn) | there exist n ≥ 0 and q0, . . . , qn such that (qi, qi+1, wi+1) ∈ [[s]]M for 0 ≤ i < n, and q0(r) ≠ 0, . . . , qn−1(r) ≠ 0, and qn(r) = 0}
  ∪ {(q0, w1 w2 · · ·) | ∃ q1, q2, . . . : (qi, qi+1, wi+1) ∈ [[s]]M and qi(r) ≠ 0 for all i}
  ∪ {(q0, w1 · · · wn) | there exist n ≥ 1 and q0, . . . , qn−1 such that qi(r) ≠ 0 for all i and (qi, qi+1, wi+1) ∈ [[s]]M for 0 ≤ i < n − 1 and (qn−1, wn) ∈ [[s]]M} )^M

[[local r = x in s]]M = ( {(q, q′, w) | there exists a behavior (q[r → x], q′′, w) ∈ [[s]]M such that q′ = q′′[r → q(r)]}
  ∪ {(q, w) | there exists a behavior (q[r → x], w) ∈ [[s]]M} )^M

[[share L = x in s]]M = ( {(q, q′, w) | there exists a behavior (q, q′, w′) ∈ [[s]]M such that w′ ∈ Cons(L, x) and w = proj_−L(w′)}
  ∪ {(q, w) | there exists a behavior (q, w′) ∈ [[s]]M such that w′ ∈ Cons(L, x) and w = proj_−L(w′)} )^M

Fig. 9. Denotational semantics of our calculus, parameterized by a set M of dynamic rewrite rules. An empty set M represents the standard semantics (sequential consistency).
p1 = local r = 1 in local s = 2 in ((print r) ∥ (print s))
p2 = local r = 1 in local s = 2 in (print r; print s)
p3 = local r = 1 in while r do print r
p4 = local r = 0 in (get r; while r do skip; print r)
Fig. 10. Four example programs. p1 and p2 always terminate, p3 never terminates, and p4 sometimes terminates. p1 can be soundly transformed to p2 , but not vice versa.
4 Verifying Local Program Transformations
In this section, we present our methodology for verifying the soundness of local program transformations on a chosen hardware memory model. We start with a general definition of what can be observed about a program execution. Next, we show how to prove that a local, static program transformation is unobservable (and thus sound) if its effect on dynamic traces can be captured by invisible rewrite rules on those traces (Section 4.1).

For our purposes, the observable behavior of a program includes (1) whether the program terminates or diverges, and (2) the sequence of externally visible events (that is, interactions of the program with the environment). We formalize this by defining the subset Ext ⊂ Evt of externally visible events and the set O of observations as

  Ext = {get n | n ∈ Z} ∪ {print n | n ∈ Z}
  O = {u | u ∈ Ext∗} ∪ {∇u | u ∈ Ext∞}

An observation of the form u represents a terminating execution that produces the finite event sequence u; an observation of the form ∇u represents a nonterminating execution that produces the (finite or infinite) sequence u. For example, the program p1 in Fig. 10 has two possible observations, print 1 print 2 and print 2 print 1; the program p2 has one possible observation, print 1 print 2; the program p3 has one possible observation, ∇ (print 1)^ω; and the program p4 has the set {get 0 print 0} ∪ {∇ get n | n ≠ 0} of observations. Using the semantics established in the previous section, we now formally define the set of observations of a program p on a memory model M as follows:

  obs_M(p) = {u | ∃ (q, q′, w) ∈ [[p]]M : u = proj_Ext(w)} ∪ {∇u | ∃ (q, w) ∈ [[p]]M : u = proj_Ext(w)}

For programs p, p′ ∈ P, we let p ⇒ p′ represent the global transformation of p into p′. We then define a global transformation p ⇒ p′ to be sound for memory model M if it does not introduce any new observations, that is, obs_M(p′) ⊆ obs_M(p). Note that we consider it acceptable if the transformed program has fewer observations than the original one. For example, we would consider it o.k. to
edl (eliminate double load):      ld L,x  ld L,x   →  ld L,x
eds (eliminate double store):     st L,x  st L,x′  →  st L,x′
ecs (eliminate confirmed store):  st L,x  ld L,x   →  st L,x
asl (aggregate store-load):       st L,x  ld L,x   →  st L,x
iil (invent irrelevant load):     ε  →  ld L,∗
eil (eliminate irrelevant load):  ld L,∗  →  ε
Fig. 11. A list of rewrite rules that are invisible for certain memory models. The last two contain wildcards; the meaning is that those rules apply to sets of behaviors, rather than individual behaviors.
transform program p1 to program p2 in Fig. 10, which essentially reduces the nondeterministic choices available to the scheduler in scheduling the two print statements. An external entity interacting with the program cannot conclusively detect that a transformation took place. The reason is that schedulers are free to favor certain schedules over others (as long as the schedules themselves are fair). Therefore, an observer cannot tell whether the reduction in schedules is caused by the transformation or by a whim of the scheduler.

In this work, we focus on local transformations, that is, transformations of components whose context is not known. See Fig. 12 for 8 examples of local transformations. More formally, we define a program context to be a "program with a hole [ ]", defined syntactically as follows:
c ::= [ ]
   |  c ; s  |  s ; c  |  local r = x in c  |  share L = x in c
   |  while r do c  |  if r then c else s  |  if r then s else c
   |  s1 ∥ · · · ∥ sk−1 ∥ c ∥ sk+1 ∥ · · · ∥ sn   (where 1 ≤ k ≤ n)
For a context c and snippet s, we let c[s] be the snippet obtained by replacing the hole in c with s. For two snippets s, s′ ∈ S, we let s → s′ denote a local transformation. We say a local transformation s → s′ induces a global transformation p ⇒ p′ if there exists a context c such that p = c[s] and p′ = c[s′], and we say a local transformation is sound if all induced global transformations are sound.
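Contexts and hole-plugging are easy to make concrete. A toy sketch, again with our own illustrative tuple encoding (HOLE and plug are invented names):

HOLE = ("hole",)

def plug(c, s):
    # replace the hole in context c by snippet s
    if c == HOLE:
        return s
    return tuple(plug(x, s) if isinstance(x, tuple) else x for x in c)

context = ("share", "A", 0, ("par", HOLE, ("store", "A", "r")))
snippet = ("seq", ("load", "r", "A"), ("load", "b", "A"))  # r := A; b := A
print(plug(context, snippet))

Soundness of a local transformation s → s′ quantifies over all such contexts: every c[s′] must observe no more than c[s].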
4.1 Invisible Rewrite Rules
To determine whether a local transformation s → s′ (such as shown in Fig. 12) is sound, we can compare the sets of behaviors [[s]]M and [[s′]]M. Because our denotational semantics is defined recursively, it is quite obvious that [[s′]]M = [[s]]M implies obs_M(c[s′]) = obs_M(c[s]) in any context c, and thus that the transformation is sound. Unfortunately, not all transformations are that simple to prove, because a transformation can be sound even if [[s′]]M ≠ [[s]]M (our semantics is not fully abstract).⁴
⁴ For example, consider the "redundant read-after-read elimination" transformation from Fig. 12, and consider M = SC = ∅. Clearly, the sets [[s′]]M and [[s]]M are not the same and not contained in each other (all behaviors of [[s′]]M contain one fewer load). Nevertheless, this transformation is actually safe, because the removal of the read cannot be observed by any context.
(load reordering)                    {if r then {s := A; t := B} else {t := B; s := A}} → {s := A; t := B}
(store reordering)                   {if r then {A := s; B := t} else {B := t; A := s}} → {A := s; B := t}
(irrelevant read elim.)              {local r = 0 in {r := A; if r then {B := s} else {B := s}}} → {B := s}
(irrelevant read introd.)            {if r then local s = 0 in {s := A; B := s}} → {local s = 0 in {s := A; if r then B := s}}
(redundant read-after-read elim.)    {r := A; b := A} → {r := A; b := r}
(redundant read-after-write elim.)   {A := r; s := A} → {A := r; s := r}
(redundant write-before-write elim.) {A := r; A := s} → {A := s}
(redundant write-after-read elim.)   {r := A; if r == 0 then A := 0} → {r := A}

Fig. 12. Some examples of local transformations [26]. The snippets follow the syntax defined in §3.1, with L = {A, B, . . . } and R = {r, s, t, . . . }.
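Footnote 4 above observed that for the redundant read-after-read elimination the two behavior sets differ even though the transformation is sound. The following toy enumeration makes this concrete; the encoding of events as Python tuples and the two-value domain are our own illustrative assumptions, not part of the paper's formal semantics.

# Hypothetical encoding: a trace is a tuple of ("ld", location, value) events.
# Registers r and b produce no memory events, so only loads of A appear.

def traces_original():
    """{r := A; b := A}: two loads of A, each free to observe any value."""
    return {(("ld", "A", x1), ("ld", "A", x2))
            for x1 in (0, 1) for x2 in (0, 1)}

def traces_transformed():
    """{r := A; b := r}: a single load of A."""
    return {(("ld", "A", x),) for x in (0, 1)}

# The two sets are incomparable: neither contains the other, yet the
# transformation is sound, as argued in footnote 4.
assert not traces_transformed() <= traces_original()
assert not traces_original() <= traces_transformed()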
To handle a broader class of transformations, we introduce the concept of "invisible" rewrite rules on dynamic traces. Essentially, we show that certain dynamic rewrite operations never alter the set of observations. In particular, any rewrite rule that is already part of the memory model is invisible. In general, however, there can be many more such rules. Consider the rules shown in Fig. 11: all of them are "invisible" on at least some of the memory models. More formally, we say a local transformation s → s′ is covered by a set of rewrite rules D if [[s′]]M ⊆ fD([[s]]M), where the operator fD : P(B) → P(B) on behaviors is defined as parallel rewriting:⁵

[fD(B)]qq′ = PD([B]qq′)    [fD(B)]q = PD([B]q)
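To illustrate, here is a small Python sketch of some of the rules from Fig. 11 as rewrites on adjacent events, together with a closure operator that stands in, approximately, for the parallel rewriting operator PD by applying the rules at arbitrary positions until no new traces appear. The encoding is our own simplification, not the paper's formalism.

# Events are tuples: ("ld", L, x) or ("st", L, x).

def edl(a, b):
    # eliminate double load: ld L,x · ld L,x → ld L,x
    if a[0] == "ld" and a == b:
        return (a,)

def eds(a, b):
    # eliminate double store: st L,x · st L,x' → st L,x'
    if a[0] == "st" and b[0] == "st" and a[1] == b[1]:
        return (b,)

def asl(a, b):
    # aggregate store-load: st L,x · ld L,x → st L,x
    if a[0] == "st" and b[0] == "ld" and a[1:] == b[1:]:
        return (a,)

def rewrite_closure(rules, traces):
    """All traces reachable by applying the rules at any positions."""
    result = set(traces)
    while True:
        new = set()
        for w in result:
            for i in range(len(w) - 1):
                for rule in rules:
                    r = rule(w[i], w[i + 1])
                    if r is not None:
                        new.add(w[:i] + r + w[i + 2:])
        if new <= result:
            return result
        result |= new

# Example: double-load elimination followed by store-load aggregation.
w = (("st", "A", 1), ("ld", "A", 1), ("ld", "A", 1))
assert (("st", "A", 1),) in rewrite_closure([edl, asl], {w})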
The following definition and theorem show how invisibility provides the means to prove the soundness of a local transformation, namely by showing that it is covered by some set D of invisible rules.

Definition 5 (Invisibility). Let D be a set of rewrite rules, and let M be a memory model. We say D is invisible on M if any local transformation that is covered by D is sound for M. We say an individual rule d is invisible on M if the set {d} is invisible on M.

Theorem 1. The dynamic rewrite rules edl, eds, ecs, asl, eil, and iil are invisible on SC; the rules edl, eds, eil, and iil are invisible on TSO, 390, and PSO; the rules edl, eds, eil, and iil are invisible on CLR; and the set {eds, ecs} is invisible on PSO.
⁵ Recall our earlier definition of [X]qq′ = {w | (q, q′, w) ∈ X} and [X]q = {w | (q, w) ∈ X} for a set X ⊂ B of behaviors.
The proof of Thm. 1 is based on structural induction, and is available in our tech report [8]. However, walking through the entire proof whenever we wish to enlarge the list of rules or memory models in Thm. 1 is impractical. Thus, we have broken out a set of conditions that are sufficient to prove invisibility and can be checked with relative ease.

Theorem 2 (Simple Conditions for Invisibility). Let M ⊆ T be a memory model, and let D ⊆ T be a set of rewrite rules. Then the following conditions are sufficient to guarantee that D is invisible on M:
1. (Commutativity) m(PD(A)) ⊆ PD(M∗(A)) for all m ∈ M and A ⊆ Evt∗.
2. (Atomicity) If (S1, S2) ∈ Gd for some d ∈ D, then all sequences in S2 are of length 0 or 1.
3. (Value Consistency) If (S1, S2) ∈ Gd for some d ∈ D, and w2 ∈ S2 ∩ Cons(L, x, x′) for some L, x, and x′, then there exists a w1 ∈ S1 ∩ Cons(L, x, x′) such that proj −L({w2}) ∈ D(proj −L({w1})).
4. (External Consistency) If (S1, S2) ∈ Gd for some d ∈ D, then proj Ext(S2) ⊆ proj Ext(S1).

We illustrate the use of these conditions by walking through one case, namely M = 390 = {ssl} and D = {edl}. Atomicity is immediate (the right-hand side of edl is a single event). Value Consistency is straightforward because the left- and right-hand sides of edl are functionally equivalent, and the projection proj −L will either map them both to ǫ or both to themselves. External Consistency is trivial, as edl does not contain external events.

Commutativity requires some work. To show that ssl(Pedl(A)) ⊆ Pedl(ssl∗(A)) for all A, we examine ssl(Pedl(A)) and consider all possible scenarios in which ssl rewrites modified positions of a parallel application of edl (if it rewrites only unmodified positions, it clearly commutes with Pedl). Thinking about this scenario (matching the left-hand side of ssl with the right-hand side of edl), we can single out the following situation:

st L, x · ld L′, x′ · ld L′, x′ ∈ A
st L, x · ld L′, x′ ∈ Pedl(A)
ld L′, x′ · st L, x ∈ ssl(Pedl(A))

Now we see that, starting with the same first line, we can get to the same last line by first applying ssl twice and then applying Pedl:

st L, x · ld L′, x′ · ld L′, x′ ∈ A
ld L′, x′ · st L, x · ld L′, x′ ∈ ssl(A)
ld L′, x′ · ld L′, x′ · st L, x ∈ ssl(ssl(A))
ld L′, x′ · st L, x ∈ Pedl(ssl(ssl(A)))

which implies the claim.
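The Commutativity condition can also be checked mechanically on bounded trace sets. The sketch below is an illustrative brute-force check under our simplified trace encoding, not the proof itself: it tests ssl(Pedl(A)) ⊆ Pedl(ssl∗(A)) for the case M = 390 = {ssl}, D = {edl} on all short traces built from one sample store event and one sample load event.

from itertools import product

ST_A, LD_B = ("st", "A", 1), ("ld", "B", 0)

def p_edl(traces):
    """Pedl: merge any adjacent pair of identical loads (and keep originals)."""
    out = set(traces)
    for w in traces:
        for i in range(len(w) - 1):
            if w[i][0] == "ld" and w[i] == w[i + 1]:
                out.add(w[:i + 1] + w[i + 2:])
    return out

def ssl_step(traces):
    """One swap of a store past a following load to a different location."""
    out = set()
    for w in traces:
        for i in range(len(w) - 1):
            a, b = w[i], w[i + 1]
            if a[0] == "st" and b[0] == "ld" and a[1] != b[1]:
                out.add(w[:i] + (b, a) + w[i + 2:])
    return out

def ssl_star(traces):
    out = set(traces)
    while True:
        new = ssl_step(out)
        if new <= out:
            return out
        out |= new

# Check commutativity on all traces of length <= 4 over the two sample events.
for n in range(5):
    A = set(product([ST_A, LD_B], repeat=n))
    assert ssl_step(p_edl(A)) <= p_edl(ssl_star(A))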
5 Application
To simplify the task of proving or refuting soundness, we automated some parts of the proof by developing a tool called Traver, written in F# and using the
transformation name (see Fig. 12)   SC            390       TSO       PSO            CLR
(load reordering)                   ×             ×         ×         ×              √
(store reordering)                  ×             ×         ×         √              ×
(irrelevant read elim.)             √ (eil)       √ (eil)   √ (eil)   √ (eil)        √ (eil)
(irrelevant read intr.)             √ (iil)       √ (iil)   √ (iil)   √ (iil)        √ (iil)
(red. read-after-read elim.)        √ (edl)       √ (edl)   √ (edl)   √ (edl)        √ (edl)
(red. wr.-bef.-wr. elim.)           √ (eds)       √ (eds)   √ (eds)   √ (eds)        √ (eds)
(red. read-after-wr. elim.)         √ (asl)       ×         √ (asl)   √ (asl)        √ (asl)
(red. wr.-after-read elim.)         √ (ecs)       ×         ×         √ (eds, ecs)   ×

Fig. 13. Soundness results for the examples from Fig. 12. For sound transformations (marked by √), we list the set D of invisible rules employed by the proof. For unsound transformations (marked by ×), we show example derivations in Fig. 14. All results were validated by our tool.
automated theorem prover Z3 [11]. It operates in one of two modes, verification or falsification.

– In verification mode, Traver takes as input a local transformation s → s′, a memory model M, and a set D of invisible rewrite rules supplied by the user. It then executes both s and s′ symbolically to obtain symbolic representations of their behaviors, and attempts to prove that D covers s → s′ by computing the closure of [[s]]M under D and checking whether it contains [[s′]]M. If successful, soundness is established. Otherwise, the result is inconclusive, and Traver reports a behavior in the set difference to the user (which can be inspected to find new candidates for invisible rules that may help to prove soundness, or to provide ideas on how to falsify the transformation).
– In falsification mode, Traver takes as input a local transformation s → s′, a memory model M, and a context c (which may contain several threads). It then computes the closures of c[s′] and c[s] under interleavings and the rules of M, and solves for a behavior of c[s′] that is not observationally equivalent to any behavior of c[s] (assuming that all initial and final values of all variables are being observed). If such a behavior is found, soundness has been successfully refuted. Otherwise, the result is inconclusive.

For both modes, the snippets s, s′ are supplied to Traver using a sugared syntax, which makes it very easy to try out many different local transformations (however, we currently support only loop-free snippets without parallel composition). The model M is specified by selecting a subset of the rewrite rules in Fig. 4 and Fig. 11 (not including rewrite rules that are conditional on control or data dependencies).

Using our tool, we successfully proved or refuted soundness of the 8 transformations in Fig. 12 for the memory models SC, 390, TSO,⁶ PSO, and CLR as defined in Fig. 4. The total time needed by the tool to prove/refute all examples is about 15 seconds. The results are shown in Fig. 13.
⁶ Note that the results for TSO also apply for x86-TSO and for x86-IRIW.
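Traver itself is written in F# and discharges these checks with Z3; the following Python sketch, with hypothetical names and explicitly enumerated finite behavior sets rather than symbolic ones, shows only the shape of the verification-mode check: close [[s]]M under the candidate invisible rules D and test whether [[s′]]M is contained in the closure.

def closure(behaviors, rules):
    """Close a finite set of traces under one-step rule applications."""
    out = set(behaviors)
    while True:
        new = {w2 for w in out for rule in rules for w2 in rule(w)}
        if new <= out:
            return out
        out |= new

def verify(sem_s, sem_s_prime, rules):
    """Verification mode: sound if [[s']] is contained in the closure of [[s]] under D."""
    closed = closure(sem_s, rules)
    missing = sem_s_prime - closed
    if not missing:
        return "sound", None
    # Inconclusive: report a behavior to inspect for new rule candidates.
    return "inconclusive", next(iter(missing))

# Example with the edl rule on a toy behavior set.
def edl_rule(w):
    return {w[:i + 1] + w[i + 2:]
            for i in range(len(w) - 1)
            if w[i][0] == "ld" and w[i] == w[i + 1]}

s_behaviors = {(("ld", "A", 0), ("ld", "A", 0))}   # {r := A; b := A}
s_prime_behaviors = {(("ld", "A", 0),)}            # {r := A; b := r}
print(verify(s_behaviors, s_prime_behaviors, [edl_rule]))  # ('sound', None)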
Left:
  {A := 1; r := A; s := B;} ∥ {B := 1; fence; t := A;}
  ⇓
  {A := 1; r := 1; s := B;} ∥ {B := 1; fence; t := A;}

  final values s = t = 0, A = B = r = 1

  st A, 1 · ld B, 0
  ld B, 0 · st A, 1
  st B, 1 · fence · ld A, 0
  ld B, 0 · st B, 1 · fence · ld A, 0 · st A, 1

Right:
  {B := 1; r := A; if (r == 0) A := 0;} ∥ {A := 1; fence; s := B;}
  ⇓
  {B := 1; r := A;} ∥ {A := 1; fence; s := B;}

  final values r = s = 0, A = B = 1

  st B, 1 · ld A, 0
  ld A, 0 · st B, 1
  st A, 1 · fence · ld B, 0
  ld A, 0 · st A, 1 · fence · ld B, 0 · st B, 1

Fig. 14. (Left) Derivation showing that the redundant-read-after-write elimination is not sound on 390. (Right) Derivation showing that the redundant-write-after-read elimination is not sound on 390, on TSO, and on CLR. (Both) We show the original program, the transformed program, and an execution of the transformed program that is not possible on the original program. All shared variables and registers are initially zero.
As expected, the first two transformations (load reordering and store reordering) are unsound for all models except those that specifically relax load-load order or store-store order, respectively.

The next four transformations (irrelevant-read-elimination, irrelevant-read-introduction, redundant-read-after-read-elimination, and redundant-write-before-write-elimination) are sound for all memory models.

The last two transformations proved more interesting. For both, we were able to prove that they are sound on SC. However, they exhibit some surprising behavior on relaxed memory models.

– The redundant-read-after-write-elimination is unsound on 390. Fig. 14 (left) shows a derivation that explains this effect. Intuitively, the sequence {A := r; s := A} has a fence-like effect on 390 which is lost by the transformation. However, on memory models that also support store-load forwarding (asl), this transformation is sound.
– The redundant-write-after-read-elimination is unsound on 390, TSO, and CLR, but sound on PSO. Fig. 14 (right) shows a derivation that explains this effect. Intuitively, the reason is that because the transformed snippet is a simple load, it can be swapped with a preceding store if the rule ssl is part of the memory model. This would not be possible with the original code unless the memory model also contains the rule sss, which in turn sheds some light on why this transformation is sound for PSO.

We believe it would have been very difficult to correctly determine the soundness of these transformations (in particular the last two), or to discover the derivations that explain the effects, without our proof methodology.
6 Conclusion and Future Work
Our experience with Traver has successfully demonstrated the power of formalism and automation in discovering corner cases where normal intuition fails. We believe that the proof methodology and the tool presented in the paper have many more uses in the future. Of particular interest are (1) verifying translations involving different memory models (between different architectures, or between different intermediate representations), and (2) extending our methodology to transformations involving higher-level synchronization such as locks, semaphores, or sending and receiving messages on channels.
References

1. Adve, S., Gharachorloo, K.: Shared memory consistency models: a tutorial. Computer 29(12), 66–76 (1996)
2. Adve, S., Hill, M.: A unified formalization of four shared-memory models. IEEE Trans. Parallel Distrib. Syst. 4(6), 613–624 (1993)
3. Arvind, Maessen, J.-W.: Memory model = instruction reordering + store atomicity. In: ISCA, pp. 29–40 (2006)
4. Boehm, H.-J., Adve, S.V.: Foundations of the C++ concurrency memory model. In: Programming Language Design and Implementation (PLDI), pp. 68–78 (2008)
5. Boudol, G., Petri, G.: Relaxed memory models: an operational approach. In: Principles of Programming Languages (POPL) (2009)
6. Brookes, S.: Full abstraction for a shared variable parallel language. In: LICS, pp. 98–109 (1993)
7. Brumme, C.: cbrumme's weblog, http://blogs.gotdotnet.com/cbrumme/archive/2003/05/17/51445.aspx
8. Burckhardt, S., Musuvathi, M., Singh, V.: Verification of compiler transformations for concurrent programs. Technical Report MSR-TR-2008-171, Microsoft Research (2008)
9. Cenciarelli, P., Sibilio, E.: The Java memory model: Operationally, denotationally, axiomatically. In: De Nicola, R. (ed.) ESOP 2007. LNCS, vol. 4421, pp. 331–346. Springer, Heidelberg (2007)
10. Compaq Computer Corporation: Alpha Architecture Reference Manual, 4th edn. (January 2002)
11. de Moura, L.M., Bjørner, N.: Z3: An efficient SMT solver. In: Ramakrishnan, C.R., Rehof, J. (eds.) TACAS 2008. LNCS, vol. 4963, pp. 337–340. Springer, Heidelberg (2008)
12. Duffy, J.: Joe Duffy's Weblog, http://www.bluebytesoftware.com/blog/2007/11/10/CLR20MemoryModel.aspx
13. Sarkar, S., et al.: The semantics of x86-CC multiprocessor machine code. In: Principles of Programming Languages (POPL) (2009)
14. Gharachorloo, K.: Memory Consistency Models for Shared-Memory Multiprocessors. PhD thesis, Stanford University (1995)
15. Intel Corporation: Intel 64 Architecture Memory Ordering White Paper (August 2007)
16. International Business Machines Corporation: z/Architecture Principles of Operation, 1st edn. (December 2000)
17. Klein, G., Nipkow, T.: A machine-checked model for a Java-like language, virtual machine, and compiler. ACM Transactions on Programming Languages and Systems 28(4), 619–695 (2006)
18. Lerner, S., Millstein, T., Chambers, C.: Automatically proving the correctness of compiler optimizations. In: Programming Language Design and Implementation (PLDI), pp. 220–231 (2003)
19. Leroy, X.: Formal certification of a compiler back-end or: programming a compiler with a proof assistant. In: Principles of Programming Languages (POPL), pp. 42–54 (2006)
20. Manson, J., Pugh, W., Adve, S.: The Java memory model. In: Principles of Programming Languages (POPL), pp. 378–391 (2005)
21. Morrison, V.: Understand the impact of low-lock techniques in multithreaded apps. MSDN Magazine 20(10) (October 2005)
22. Owens, S., Sarkar, S., Sewell, P.: A better x86 memory model: x86-TSO (extended version). Technical Report UCAM-CL-TR-745, Univ. of Cambridge (2009)
23. Park, S., Dill, D.L.: An executable specification, analyzer and verifier for RMO (relaxed memory order). In: Symposium on Parallel Algorithms and Architectures (SPAA), pp. 34–41 (1995)
24. Saraswat, V., Jagadeesan, R., Michael, M., von Praun, C.: A theory of memory models. In: PPoPP 2007: Principles and Practice of Parallel Programming, pp. 161–172 (2007)
25. Sevcik, J.: Program Transformations in Weak Memory Models. PhD thesis, University of Edinburgh (2008)
26. Sevcik, J., Aspinall, D.: On validity of program transformations in the Java memory model. In: Vitek, J. (ed.) ECOOP 2008. LNCS, vol. 5142, pp. 27–51. Springer, Heidelberg (2008)
27. Shen, X., Arvind, Rudolph, L.: Commit-reconcile & fences (CRF): A new memory model for architects and compiler writers. In: ISCA, pp. 150–161 (1999)
28. Weaver, D., Germond, T. (eds.): The SPARC Architecture Manual, Version 9. PTR Prentice Hall, Englewood Cliffs (1994)
29. Young, W.D.: A mechanically verified code generator. Journal of Automated Reasoning 5(4), 493–518 (1989)
Practical Extensions to the IFDS Algorithm

Nomair A. Naeem, Ondřej Lhoták, and Jonathan Rodriguez

University of Waterloo, Canada
{nanaeem,olhotak,j2rodrig}@uwaterloo.ca
Abstract. This paper presents four extensions to the Interprocedural Finite Distributive Subset (IFDS) algorithm that make it applicable to a wider class of analysis problems. IFDS is a dynamic programming algorithm that implements context-sensitive, flow-sensitive interprocedural dataflow analysis. The first extension constructs the nodes of the supergraph on demand as the analysis requires them, eliminating the need to build a full supergraph before the analysis. The second extension provides the procedure-return flow function with additional information about the program state before the procedure was called. The third extension improves the precision with which φ instructions are modelled when analyzing a program in SSA form. The fourth extension speeds up the algorithm on domains in which some of the dataflow facts subsume each other. These extensions are often necessary when applying the IFDS algorithm to non-separable (i.e., non-bit-vector) problems. We have found them necessary for alias set analysis and multi-object typestate analysis. In this paper, we illustrate and evaluate the extensions on a simpler problem, a variation of variable type analysis.
1 Introduction

The Interprocedural Finite Distributive Subset (IFDS) algorithm [15] is an efficient and precise, context-sensitive and flow-sensitive dataflow analysis algorithm for the class of problems that satisfy its restrictions. Although this class includes the classic bit-vector dataflow problems, the original IFDS algorithm is not directly suitable for more interesting problems for which context- and flow-sensitivity would be useful, particularly problems involving objects and pointers. The algorithm can be extended to solve this larger class of problems, however, and in this paper, we present four such extensions.

The IFDS algorithm is an efficient dynamic programming instantiation of the functional approach to interprocedural analysis [19]. The fundamental restrictions of the algorithm, which we do not seek to eliminate in this paper, are that the analysis domain must be a powerset of some finite set D, and that the dataflow functions must be distributive. We present a detailed overview of the IFDS algorithm in Section 2, and further illustrate the algorithm with a running example, variable type analysis, in Section 3.

A more practical restriction is that the set D must be small, because the algorithm requires as input a so-called exploded supergraph, and the number of nodes in this supergraph is approximately the product of the size of D and the number of instructions in the program. Our first extension, presented in Section 4, removes the restriction on the size of D by enabling the algorithm to compute only those parts of the supergraph that are actually reached in the analysis. This allows the algorithm to be used for problems
in which D is theoretically large, but only a small subset of D is encountered during the analysis, which is typical of analyses modelling objects and pointers.

A second practical restriction of the original IFDS algorithm is that it provides limited information to flow functions modelling return flow from a procedure. For many analyses, mapping dataflow facts from the callee back to the caller requires information about the state before the procedure was called. In Section 5, we extend the IFDS algorithm to provide this information to the return flow function.

A third limitation of many standard dataflow analysis algorithms, IFDS included, is that they can be less precise on a program in Static Single Assignment (SSA) form [2] than on the original non-SSA form of the program. When an instruction has multiple control flow predecessors, incoming dataflow facts are merged before the flow function is applied; this imprecisely models the semantics of φ instructions in SSA form. In Section 6, we present an example that exhibits this imprecision, and we extend the IFDS algorithm to avoid it, so that it is equally precise on SSA form as on non-SSA form programs. SSA form is not only a convenience; in prior work, we showed that SSA form can be used to improve running time and space requirements of analyses such as alias set analysis [13].

Finally, the IFDS algorithm does not take advantage of any structure in the set D. In many analyses of objects and pointers, some elements of D subsume others. In Section 7, we present an extension that exploits such structure to reduce analysis time.

We have implemented the IFDS algorithm with all four of these extensions, as well as the running example variable type analysis. In Section 8, we report on an empirical evaluation of the benefits of the extensions. We survey related work in Section 9 and conclude in Section 10.
2 Background: The Original IFDS Algorithm

The IFDS algorithm of Reps et al. [15] is a dynamic programming algorithm that computes a merge-over-all-valid-paths solution to interprocedural, finite, distributive, subset problems. The merge is over valid paths in that procedure calls and returns are correctly matched (i.e., the analysis is context sensitive). The algorithm requires that the domain of dataflow facts be the powerset of a finite set D, with set union as the merge operator. The dataflow functions must be distributive over set union: f(a) ∪ f(b) = f(a ∪ b).

The algorithm follows the summary function approach to context-sensitive interprocedural analysis [19], in that it computes functions in P(D) → P(D) that summarize the effect of ever-longer sections of code on any given subset of D. The key to the efficiency of the algorithm is the compact representation of these functions, made possible by their distributivity. For example, suppose the set S = {a, b, c} is a subset of D. By distributivity, f(S) can be computed as f(S) = f({}) ∪ f({a}) ∪ f({b}) ∪ f({c}). Thus every distributive function in P(D) → P(D) is uniquely defined by its value on the empty set and on every singleton subset of D. Equivalently, the function can be defined by a bipartite graph ⟨D ∪ {0}, D, E⟩, where E is a set of edges from elements of D ∪ {0} to elements of (a second copy of) D. The graph contains an edge from d1 to d2 if and only if d2 ∈ f({d1}). The special 0 vertex represents the empty set: the edge 0 → d indicates that d ∈ f({}). The function represented by the graph is defined to be
[Figure 1 shows the bipartite graph representations over D = {a, b, c, d} of: (a) g = λS.(S \ {a}) ∪ {b, c}; (b) f = λS.(S \ {d}) ∪ {b}; and (c) f ◦ g = λS.(S \ {a, d}) ∪ {b, c}.]

Fig. 1. Compact representation of functions and their composition
f(S) = {b : (a, b) ∈ E ∧ (a = 0 ∨ a ∈ S)}. For example, the graph in Figure 1(a) represents the function g(S) = {x : x ∈ {b, c} ∨ (x = d ∧ d ∈ S)}, which can be written more simply as g(S) = (S \ {a}) ∪ {b, c}.

The composition f ◦ g of two functions can be computed by combining their graphs, merging the nodes of the range of g with the corresponding nodes of the domain of f, then computing reachability from the nodes of the domain of g to the nodes of the range of f. That is, a relational product of the sets of edges representing the two functions gives a set of edges representing their composition. An example is shown in Figure 1. The graph in Figure 1(c), representing f ◦ g, contains an edge from x to y whenever there is an edge from x to some z in the representation of g in Figure 1(a) and an edge from the same z to y in the representation of f in Figure 1(b).

We have reproduced the original IFDS algorithm [15] in Figure 2. The input to the algorithm is a so-called exploded supergraph that represents both the program being analyzed and the dataflow functions. The supergraph is constructed from the interprocedural control flow graph (ICFG) of the program by replacing each instruction with the graph representation of its flow function. Thus the vertices of the supergraph are pairs ⟨l, d⟩, where l is a label in the program and d ∈ D ∪ {0}. The supergraph contains an edge ⟨l, d⟩ → ⟨l′, d′⟩ if the ICFG contains an edge l → l′ and d′ ∈ f({d}) (or d′ ∈ f({}) when d = 0), where f is the flow function of the instruction at l. For each interprocedural call or return edge in the ICFG, the supergraph contains a set of edges representing the flow function associated with the call or return. The flow function on the call edge typically maps facts about actuals in the caller to facts about formals in the callee. The merge-over-all-valid-paths solution at label l contains exactly the elements d of D for which there exists a valid path from ⟨s, 0⟩ to ⟨l, d⟩ in the supergraph. The dataflow analysis therefore reduces to valid-path reachability on the supergraph.

The IFDS algorithm works by incrementally constructing two tables, PathEdge and SummaryEdge, representing the flow functions of ever-longer sequences of code. The PathEdge table contains triples ⟨d, l, d′⟩, indicating that there is a path from ⟨sp, d⟩ to ⟨l, d′⟩, where sp is the start node of the procedure containing l. These triples are often written in the form ⟨sp, d⟩ → ⟨l, d′⟩ for clarity, but the start node sp is uniquely determined by l, so it is not stored in an actual implementation. The SummaryEdge table contains triples ⟨c, d, d′⟩, where c is the label of a call site. Such a triple indicates that d′ ∈ f({d}), where f is a flow function summarizing the effect of the procedure called at c. These triples are often written ⟨c, d⟩ → ⟨r, d′⟩, where r is the instruction
 1  declare PathEdge, WorkList, SummaryEdge: global edge set
 2  algorithm Tabulate(GIP)
 3  begin
 4    Let (N, E) = GIP
 5    PathEdge := {⟨smain, 0⟩ → ⟨smain, 0⟩}
 6    WorkList := {⟨smain, 0⟩ → ⟨smain, 0⟩}
 7    SummaryEdge := ∅
 8    ForwardTabulateSLRPs()
 9    foreach n ∈ N do
10      Xn := {d2 ∈ D | ∃d1 ∈ (D ∪ {0}) s.t. ⟨sprocOf(n), d1⟩ → ⟨n, d2⟩ ∈ PathEdge}
11    od
12  end
13  procedure Propagate(e)
14  begin
15    if e ∉ PathEdge then Insert e into PathEdge; Insert e into WorkList fi
16  end
17  procedure ForwardTabulateSLRPs()
18  begin
19    while WorkList ≠ ∅ do
20      Select and remove an edge ⟨sp, d1⟩ → ⟨n, d2⟩ from WorkList
21      switch n
22      case n ∈ Callp :
23        foreach d3 s.t. ⟨n, d2⟩ → ⟨scalledProc(n), d3⟩ ∈ E do
24          Propagate(⟨scalledProc(n), d3⟩ → ⟨scalledProc(n), d3⟩)
25        od
26        foreach d3 s.t. ⟨n, d2⟩ → ⟨returnSite(n), d3⟩ ∈ (E ∪ SummaryEdge) do
27          Propagate(⟨sp, d1⟩ → ⟨returnSite(n), d3⟩)
28        od
29      end case
30      case n = ep :
31        foreach c ∈ callers(p) do
32          foreach d4, d5 s.t. ⟨c, d4⟩ → ⟨sp, d1⟩ ∈ E and ⟨ep, d2⟩ → ⟨returnSite(c), d5⟩ ∈ E do
33            if ⟨c, d4⟩ → ⟨returnSite(c), d5⟩ ∉ SummaryEdge then
34              Insert ⟨c, d4⟩ → ⟨returnSite(c), d5⟩ into SummaryEdge
35              foreach d3 s.t. ⟨sprocOf(c), d3⟩ → ⟨c, d4⟩ ∈ PathEdge do
36                Propagate(⟨sprocOf(c), d3⟩ → ⟨returnSite(c), d5⟩)
37              od
38            fi
39          od
40        od
41      end case
42      case n ∈ (Np − Callp − {ep}) :
43        foreach ⟨m, d3⟩ s.t. ⟨n, d2⟩ → ⟨m, d3⟩ ∈ E do
44          Propagate(⟨sp, d1⟩ → ⟨m, d3⟩)
45        od
46      end case
47      end switch
48    od
49  end

Fig. 2. Original IFDS Algorithm reproduced from [15]
following c. For convenience, Reps's presentation of the IFDS algorithm [15] assumes that in the ICFG, every call site c has a single successor, a no-op "return site" node r.

The PathEdge and SummaryEdge tables are interdependent. Consider an edge ⟨sp, d1⟩ → ⟨ep, d2⟩ added to PathEdge, in which ep is the exit node of some procedure p. This edge means that d2 ∈ fp({d1}), where fp is the flow function representing the effect of the entire procedure p. As a result, for every call site c calling procedure p, a corresponding triple must be added to SummaryEdge indicating the newly-discovered effect at that call site. In fact, several such triples may be needed for a single edge added to PathEdge, since the effect of a procedure at c is represented not just by fp, but by the composition fr ◦ fp ◦ fc, where fc and fr are the flow functions representing the function call and return. This composition is computed by combining the graphs representing fc and fr from the supergraph with the newly discovered edge ⟨d1, d2⟩ of fp. That is, for each d4 and d5 such that ⟨d4, d1⟩ ∈ fc and ⟨d2, d5⟩ ∈ fr, ⟨c, d4, d5⟩ is added to SummaryEdge. This is performed in lines 32 to 34 of the algorithm.

Conversely, consider a triple ⟨c, d4, d5⟩ added to SummaryEdge, indicating a new effect of the call at c. As a result, for each d3 such that there is a path from ⟨s, d3⟩ to ⟨c, d4⟩, where s is the start node of the procedure containing c, there is now a valid path from ⟨s, d3⟩ to ⟨r, d5⟩, where r is the successor of c. Thus ⟨s, d3⟩ → ⟨r, d5⟩ must be added to PathEdge. This is performed in lines 35 to 37 of the algorithm.
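As a concrete illustration of the compact function representation and its composition by relational product, here is a small Python sketch using the functions of Figure 1. The edge-set encoding and variable names are our own illustrative choices.

from itertools import chain, combinations

ZERO = "0"
D = {"a", "b", "c", "d"}

def apply_fn(edges, s):
    """Evaluate f(S) = {b : (a, b) ∈ E ∧ (a = 0 ∨ a ∈ S)}."""
    return {b for (a, b) in edges if a == ZERO or a in s}

def compose(f, g):
    """Edges of f ∘ g: relational product of the two edge sets."""
    h = {(ZERO, c) for (b, c) in f if b == ZERO}           # f's constant part
    h |= {(a, c) for (a, b) in g for (b2, c) in f if b == b2}
    return h

# Figure 1: g = λS.(S \ {a}) ∪ {b, c} and f = λS.(S \ {d}) ∪ {b}
g = {(ZERO, "b"), (ZERO, "c"), ("d", "d")}
f = {(ZERO, "b"), ("a", "a"), ("b", "b"), ("c", "c")}

def subsets(xs):
    return chain.from_iterable(combinations(xs, r) for r in range(len(xs) + 1))

# Sanity check: composing edge sets agrees with composing the functions.
fg = compose(f, g)
for s in map(set, subsets(sorted(D))):
    assert apply_fn(fg, s) == apply_fn(f, apply_fn(g, s))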
3 Running Example: Type Analysis

The extensions to the IFDS algorithm presented in this paper were originally motivated by context-sensitive alias set analysis [13] and multi-object typestate analysis [12]. The same extensions are applicable to many other kinds of analyses. In this paper, we will use a much simpler analysis as a running example to illustrate the IFDS extensions.

The example analysis is a variation of Variable Type Analysis (VTA) [21] for Java. The analysis computes the set of possible types for each variable. This information can be used to construct a call graph or to check the validity of casts. At each program point p, the analysis computes a subset of D, where D is defined as the set of all pairs ⟨v, t⟩, where v is a variable in the program and t is a class in the program. The presence of the pair ⟨v, t⟩ in the subset indicates that the variable v may point to an object of type t.

For the sake of the example, we would like the analysis to analyze only the application code and not the large standard library. The analysis therefore makes conservative assumptions about the unanalyzed code based on statically declared types. For example, if m() is in the library, the analysis assumes that m() could return an object of the declared return type of m() or any of its subtypes. To this end we amend the meaning of a pair ⟨v, t⟩ to indicate that v may point to an object of type t or any of its subtypes. The unanalyzed code could write to fields in the heap, either directly or by calling back into application code. To keep the analysis sound yet simple, we make the conservative assumption that a field can point to any object whose type is consistent with its declared type. We model a field read x = y.f with the pair ⟨x, t⟩, where t is the declared type of f. We make these simplifications because the analysis is intended to illustrate the extensions to the IFDS algorithm, not necessarily to serve as a practical analysis.
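For instance, the conservative treatment of unanalyzed code can be pictured as follows. This is a minimal sketch with hypothetical declared-type tables and names of our own invention; the real analysis operates on an intermediate representation rather than on strings.

# Hypothetical declared-type tables for a small example program.
DECLARED_RETURN = {"m": "java.util.AbstractList"}
DECLARED_FIELD = {"f": "java.lang.Number"}

def flow_library_call(x, callee):
    """x = callee(): assume the result is the declared return type
    of the callee or any of its subtypes."""
    return {(x, DECLARED_RETURN[callee])}

def flow_field_read(x, field):
    """x = y.f: a field may hold any object consistent with its
    declared type, so generate the declared-type pair for x."""
    return {(x, DECLARED_FIELD[field])}

print(flow_library_call("x", "m"))   # {('x', 'java.util.AbstractList')}
print(flow_field_read("z", "f"))     # {('z', 'java.lang.Number')}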
When the declared type of a field is an interface, the object read from it could be of any class that implements the interface. For a read from such a field, we generate multiple pairs ⟨x, ti⟩, where the ti are all classes that implement the interface. If class B extends A and both implement the interface, it is redundant to include ⟨x, B⟩, since ⟨x, A⟩ already includes all subclasses of A, including B. For efficiency, we generate only those pairs ⟨x, ti⟩ where ti implements the interface and its superclass does not.

The analysis is performed on an intermediate representation comprising the following kinds of instructions, in addition to procedure calls and returns:

s ::= x ← y | y.f ← x | x ← y.f | x ← null | x ← new T | x ← (T)y

The instructions copy pointers between variables, store and load objects to and from fields, assign null to variables, create new objects, and cast objects to a given type, respectively. We use sP : P(D) → P(D) to denote the transfer function for the type analysis. The IFDS algorithm requires the transfer function to be decomposed into its effect on each individual element of D and on the empty set. We decompose it as s : D ∪ {0} → P(D) and define sP(P) = s(0) ∪ ⋃_{d∈P} s(d). The decomposed transfer function s is defined in Figure 3.

x ← y(⟨v, t⟩) =
  {⟨x, t⟩, ⟨y, t⟩}   if v = y
  {⟨v, t⟩}           if v ≠ y and v ≠ x
  ∅                  if v ≠ y and v = x

y.f ← x(⟨v, t⟩) = {⟨v, t⟩}

x ← null | new T | y.f(⟨v, t⟩) =
  {⟨v, t⟩}   if v ≠ x
  ∅          otherwise

x ← new T(0) = {⟨x, T⟩}

x ← y.f(0) = {⟨x, c⟩ : c ∈ implClasses(type(f))}

x ← (T)y(⟨v, t⟩) = ⋃_{c ∈ implClasses(T)} cast(x, y, c)(⟨v, t⟩)
cast(x, y, t2)(⟨v, t1⟩) =
  {⟨v, t1⟩}            if v ≠ x and v ≠ y
  ∅                    if v = x and v ≠ y
  {⟨x, t1⟩, ⟨y, t1⟩}   if v = y and t1 is a subtype of t2
  {⟨x, t2⟩, ⟨y, t2⟩}   if v = y and t2 is a strict subtype of t1
  ∅                    otherwise
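Two pieces of this definition can be sketched in Python as follows. The toy class hierarchy and all names are our own illustrative assumptions: the implClasses computation keeps a class only when its superclass does not already implement the interface, and the copy transfer function for x ← y follows the first case of Figure 3.

# Toy hierarchy: each class mapped to its superclass, and each class mapped
# to the set of interfaces it implements (directly or by inheritance).
SUPERCLASS = {"A": "Object", "B": "A", "C": "Object"}
INTERFACES = {"A": {"I"}, "B": {"I"}, "C": {"I"}, "Object": set()}

def impl_classes(iface):
    """Classes implementing iface whose superclass does not; <x, A> already
    covers all of A's subclasses, so <x, B> would be redundant."""
    return {c for c in SUPERCLASS
            if iface in INTERFACES[c]
            and iface not in INTERFACES[SUPERCLASS[c]]}

def copy_flow(x, y, fact):
    """Decomposed transfer function s for `x ← y` on one fact (v, t)."""
    v, t = fact
    if v == y:
        return {(x, t), (y, t)}   # y's types flow into x
    if v == x:
        return set()              # x's old types are killed
    return {fact}                 # unrelated facts pass through

assert impl_classes("I") == {"A", "C"}       # B is covered by A
assert copy_flow("x", "y", ("y", "T")) == {("x", "T"), ("y", "T")}
assert copy_flow("x", "y", ("x", "T")) == set()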