English Pages 348 [355] Year 2002
Lecture Notes in Computer Science Edited by G. Goos, J. Hartmanis, and J. van Leeuwen
2304
Berlin Heidelberg New York Barcelona Hong Kong London Milan Paris Tokyo
R. Nigel Horspool (Ed.)
Compiler Construction 11th International Conference, CC 2002 Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2002 Grenoble, France, April 8-12, 2002 Proceedings
Series Editors
Gerhard Goos, Karlsruhe University, Germany
Juris Hartmanis, Cornell University, NY, USA
Jan van Leeuwen, Utrecht University, The Netherlands

Volume Editor
R. Nigel Horspool
University of Victoria, Dept. of Computer Science
Victoria, BC, Canada V8W 3P6
E-mail: [email protected]
Cataloging-in-Publication Data applied for Die Deutsche Bibliothek - CIP-Einheitsaufnahme Compiler construction : 11th international conference ; proceedings / CC 2002, held as part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2002, Grenoble, France, April 8 - 12, 2002. R. Nigel Horspool (ed.). - Berlin ; Heidelberg ; New York ; Barcelona ; Hong Kong ; London ; Milan ; Paris ; Tokyo : Springer, 2002 (Lecture notes in computer science ; Vol. 2304) ISBN 3-540-43369-4
CR Subject Classification (1998): D.3.4, D.3.1, F.4.2, D.2.6, I.2.2, F.3 ISSN 0302-9743 ISBN 3-540-43369-4 Springer-Verlag Berlin Heidelberg New York This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law. Springer-Verlag Berlin Heidelberg New York a member of BertelsmannSpringer Science+Business Media GmbH http://www.springer.de © Springer-Verlag Berlin Heidelberg 2002 Printed in Germany Typesetting: Camera-ready by author, data conversion by DA-TeX Gerd Blumenstein Printed on acid-free paper SPIN 10846505 06/3142 543210
Foreword
ETAPS 2002 was the fifth instance of the European Joint Conferences on Theory and Practice of Software. ETAPS is an annual federated conference that was established in 1998 by combining a number of existing and new conferences. This year it comprised 5 conferences (FOSSACS, FASE, ESOP, CC, TACAS), 13 satellite workshops (ACL2, AGT, CMCS, COCV, DCC, INT, LDTA, SC, SFEDL, SLAP, SPIN, TPTS, and VISS), 8 invited lectures (not including those specific to the satellite events), and several tutorials. The events that comprise ETAPS address various aspects of the system development process, including specification, design, implementation, analysis, and improvement. The languages, methodologies, and tools which support these activities are all well within its scope. Different blends of theory and practice are represented, with an inclination towards theory with a practical motivation on one hand and soundly-based practice on the other. Many of the issues involved in software design apply to systems in general, including hardware systems, and the emphasis on software is not intended to be exclusive. ETAPS is a loose confederation in which each event retains its own identity, with a separate program committee and independent proceedings. Its format is open-ended, allowing it to grow and evolve as time goes by. Contributed talks and system demonstrations are in synchronized parallel sessions, with invited lectures in plenary sessions. Two of the invited lectures are reserved for “unifying” talks on topics of interest to the whole range of ETAPS attendees. The aim of cramming all this activity into a single one-week meeting is to create a strong magnet for academic and industrial researchers working on topics within its scope, giving them the opportunity to learn about research in related areas, and thereby to foster new and existing links between work in areas that were formerly addressed in separate meetings. 
ETAPS 2002 was organized by the Laboratoire Verimag in cooperation with
Centre National de la Recherche Scientifique (CNRS)
Institut de Mathématiques Appliquées de Grenoble (IMAG)
Institut National Polytechnique de Grenoble (INPG)
Université Joseph Fourier (UJF)
European Association for Theoretical Computer Science (EATCS)
European Association for Programming Languages and Systems (EAPLS)
European Association of Software Science and Technology (EASST)
ACM SIGACT, SIGSOFT, and SIGPLAN
The organizing team comprised
Susanne Graf - General Chair
Saddek Bensalem - Tutorials
Rachid Echahed - Workshop Chair
Jean-Claude Fernandez - Organization
Alain Girault - Publicity
Yassine Lakhnech - Industrial Relations
Florence Maraninchi - Budget
Laurent Mounier - Organization

Overall planning for ETAPS conferences is the responsibility of its Steering Committee, whose current membership is: Egidio Astesiano (Genova), Ed Brinksma (Twente), Pierpaolo Degano (Pisa), Hartmut Ehrig (Berlin), José Fiadeiro (Lisbon), Marie-Claude Gaudel (Paris), Andy Gordon (Microsoft Research, Cambridge), Roberto Gorrieri (Bologna), Susanne Graf (Grenoble), John Hatcliff (Kansas), Görel Hedin (Lund), Furio Honsell (Udine), Nigel Horspool (Victoria), Heinrich Hußmann (Dresden), Joost-Pieter Katoen (Twente), Paul Klint (Amsterdam), Daniel Le Métayer (Trusted Logic, Versailles), Ugo Montanari (Pisa), Mogens Nielsen (Aarhus), Hanne Riis Nielson (Copenhagen), Mauro Pezzè (Milan), Andreas Podelski (Saarbrücken), Don Sannella (Edinburgh), Andrzej Tarlecki (Warsaw), Herbert Weber (Berlin), Reinhard Wilhelm (Saarbrücken)

I would like to express my sincere gratitude to all of these people and organizations, the program committee chairs and PC members of the ETAPS conferences, the organizers of the satellite events, the speakers themselves, and finally Springer-Verlag for agreeing to publish the ETAPS proceedings. As organizer of ETAPS'98, I know that there is one person who deserves special applause: Susanne Graf. Her energy and organizational skills have more than compensated for my slow start in stepping into Don Sannella's enormous shoes as ETAPS Steering Committee chairman.
Yes, it is now a year since I took over the role, and I would like my final words to transmit to Don all the gratitude and admiration that is felt by all of us who enjoy coming to ETAPS year after year knowing that we will meet old friends, make new ones, plan new projects and be challenged by a new culture! Thank you Don! January 2002
José Luiz Fiadeiro
Preface
Once again, the number, breadth, and quality of the papers submitted to the CC 2002 conference continue to be impressive. In spite of difficult times that may have discouraged many potential authors from considering travel to a conference, we still received 44 submissions. Of these, 21 came from 12 different European countries, 17 from the USA and Canada, and the remaining 6 from Australia and Asia. In addition to the regular paper submissions, we have an invited paper from Patrick and Radhia Cousot. It is especially fitting that Patrick Cousot should deliver the CC 2002 invited paper in Grenoble, because many years ago he wrote his PhD thesis at the University of Grenoble. The members of the Program Committee took their refereeing task very seriously and decided very early on that a physical meeting was necessary to make the selection process as fair as possible. Accordingly, nine members of the Program Committee attended a meeting in Austin, Texas, on December 1, 2001, where the difficult decisions were made. Three others joined in the deliberations via a telephone conference call. Eventually, and after much (friendly) argument, 18 papers were selected for publication. I wish to thank the Program Committee members for their selfless dedication and their excellent advice. I especially want to thank Kathryn McKinley and her assistant, Gem Naivar, for making the arrangements for the PC meeting. I also wish to thank my assistant, Catherine Emond, for preparing the materials for the PC meeting and for assembling the manuscript of the proceedings. The paper submissions and the reviewing process were supported by the START system (http://www.softconf.com). I thank the author of START, Rich Gerber, for making his software available to CC 2002 and for his prompt attention to the little problems that arose.
These conference proceedings include the invited paper of Patrick and Radhia Cousot, the 18 regular papers, and brief descriptions of three software tools.
January 2002
Nigel Horspool
Program Committee

Uwe Aßmann (Linköpings Universitet, Sweden)
David Bernstein (IBM Haifa, Israel)
Judith Bishop (University of Pretoria, South Africa)
Ras Bodik (University of Wisconsin-Madison, USA)
Cristina Cifuentes (Sun Microsystems, USA)
Christian Collberg (University of Arizona, USA)
Stefano Crespi-Reghizzi (Politecnico di Milano, Italy)
Michael Franz (University of California at Irvine, USA)
Andreas Krall (Technical University of Vienna, Austria)
Reiner Leupers (University of Dortmund, Germany)
Kathryn McKinley (University of Texas at Austin, USA)
Nigel Horspool - Chair (University of Victoria, Canada)
Todd Proebsting (Microsoft Research, USA)
Norman Ramsey (Harvard University, USA)
Additional Reviewers

G.P. Agosta, John Aycock, Jon Eddy, Anton Ertl, Marco Garatti, Görel Hedin, Won Kee Hong, Bruce Kapron, Moshe Klausner, Annie Liu, V. Martena, Bilha Mendelson, Sreekumar Nair, Dorit Naishloss, Ulrich Neumerkel, Mark Probst, Fermin Reig, P.L. San Pietro, Bernhard Scholz, Glenn Skinner, Phil Tomsich, David Ung, Mike Van Emmerik, JingLing Xue, Yaakov Yaari, Ayal Zaks
Table of Contents
Tool Demonstrations

LISA: An Interactive Environment for Programming Language Development ..... 1
  Marjan Mernik, Mitja Lenič, Enis Avdičaušević, and Viljem Žumer

Building an Interpreter with Vmgen ..... 5
  M. Anton Ertl and David Gregg

Compiler Construction Using LOTOS NT ..... 9
  Hubert Garavel, Frédéric Lang, and Radu Mateescu

Analysis and Optimization

Data Compression Transformations for Dynamically Allocated Data Structures ..... 14
  Youtao Zhang and Rajiv Gupta

Evaluating a Demand Driven Technique for Call Graph Construction ..... 29
  Gagan Agrawal, Jinqian Li, and Qi Su

A Graph-Free Approach to Data-Flow Analysis ..... 46
  Markus Mohnen

A Representation for Bit Section Based Analysis and Optimization ..... 62
  Rajiv Gupta, Eduard Mehofer, and Youtao Zhang

Low-Level Analysis

Online Subpath Profiling ..... 78
  David Oren, Yossi Matias, and Mooly Sagiv

Precise Exception Semantics in Dynamic Compilation ..... 95
  Michael Gschwind and Erik Altman

Decompiling Java Bytecode: Problems, Traps and Pitfalls ..... 111
  Jerome Miecznikowski and Laurie Hendren

Grammars and Parsing

Forwarding in Attribute Grammars for Modular Language Design ..... 128
  Eric Van Wyk, Oege de Moor, Kevin Backhouse, and Paul Kwiatkowski
Disambiguation Filters for Scannerless Generalized LR Parsers ..... 143
  Mark G. J. van den Brand, Jeroen Scheerder, Jurgen J. Vinju, and Eelco Visser

Invited Talk

Modular Static Program Analysis ..... 159
  Patrick Cousot and Radhia Cousot

Domain-Specific Languages and Tools

StreamIt: A Language for Streaming Applications ..... 179
  William Thies, Michal Karczmarek, and Saman Amarasinghe

Compiling Mercury to High-Level C Code ..... 197
  Fergus Henderson and Zoltan Somogyi

CIL: Intermediate Language and Tools for Analysis and Transformation of C Programs ..... 213
  George C. Necula, Scott McPeak, Shree P. Rahul, and Westley Weimer

Energy Consumption Optimizations

Linear Scan Register Allocation in the Context of SSA Form and Register Constraints ..... 229
  Hanspeter Mössenböck and Michael Pfeiffer

Global Variable Promotion: Using Registers to Reduce Cache Power Dissipation ..... 247
  Andrea G. M. Cilio and Henk Corporaal

Optimizing Static Power Dissipation by Functional Units in Superscalar Processors ..... 261
  Siddharth Rele, Santosh Pande, Soner Onder, and Rajiv Gupta

Influence of Loop Optimizations on Energy Consumption of Multi-bank Memory Systems ..... 276
  Mahmut Kandemir, Ibrahim Kolcu, and Ismail Kadayif

Loop and Array Optimizations

Effective Enhancement of Loop Versioning in Java ..... 293
  Vitaly V. Mikheev, Stanislav A. Fedoseev, Vladimir V. Sukharev, and Nikita V. Lipsky
Value-Profile Guided Stride Prefetching for Irregular Code ..... 307
  Youfeng Wu, Mauricio Serrano, Rakesh Krishnaiyer, Wei Li, and Jesse Fang

A Comprehensive Approach to Array Bounds Check Elimination for Java ..... 325
  Feng Qian, Laurie Hendren, and Clark Verbrugge

Author Index ..... 343
LISA: An Interactive Environment for Programming Language Development

Marjan Mernik, Mitja Lenič, Enis Avdičaušević, and Viljem Žumer

University of Maribor, Faculty of Electrical Engineering and Computer Science, Institute of Computer Science, Smetanova 17, 2000 Maribor, Slovenia
Abstract. The LISA system is an interactive environment for programming language development. From the formal language specification of a particular programming language, LISA produces a language-specific environment that includes editors (a language-knowledgeable editor and a structured editor), a compiler/interpreter, and other graphical tools. LISA is a set of related tools, such as scanner generators, parser generators, compiler generators, graphic tools, editors, and conversion tools, which are integrated through well-designed interfaces.
1 Introduction
We previously developed the compiler/interpreter generator tool LISA ver. 1.0, which automatically produces a compiler or an interpreter from ordinary attribute grammar specifications [2] [8]. However, that version of the tool did not support incremental language development, so the language designer had to design new languages from scratch or by scavenging old specifications. Other deficiencies of ordinary attribute grammars become apparent in specifications for real programming languages: such specifications are large and unstructured, and are hard to understand, modify, and maintain. The goal of the new version of LISA was to eliminate these deficiencies. We overcome the drawbacks of ordinary attribute grammars with concepts from object-oriented programming, namely templates and multiple inheritance [4]. With attribute grammar templates we can describe semantic rules that are independent of grammar production rules. With multiple attribute grammar inheritance we can organize specifications in such a way that they can be inherited and specialized from ancestor specifications. The proposed approach was successfully implemented in the compiler/interpreter generator LISA ver. 2.0 [5].
2 Architecture of the Tool LISA 2.0
R. N. Horspool (Ed.): CC 2002, LNCS 2304, pp. 1–4, 2002. © Springer-Verlag Berlin Heidelberg 2002

LISA (Fig. 1) consists of several tools: editors, scanner generators, parser generators, compiler generators, graphic tools, and conversion tools such as fsa2rex. The architecture of the system LISA is modular. Integration is achieved
with strictly defined interfaces that describe the behavior and type of integration of the modules. Each module can register actions when it is loaded into the core environment. Actions are methods accessible from the environment. These actions can be executed via class reflection. Their existence is not verified until invocation, so actions are dynamically linked with module methods. The module can be integrated in the environment as a visual or core module. Visual modules are used for the graphical user interface and visual representation of data structures. Core modules are non-visual components, such as the LISA language compiler. This approach is based on class reflection and is similar to JavaBeans technology. With class reflection (java.lang.reflect.* package) we can dynamically obtain a set of public methods and public variables of a module, so we can dynamically link module methods with actions. When the action is executed, the proper method is located and invoked with the description of the action event. With this architecture it is also possible to upgrade our system with different types of scanners, parsers and evaluators, which are presented as modules. This was achieved with a strict definition of communication data structures. Moreover, modules for scanners, parsers and evaluators use templates for code generation, which can be easily changed and improved.
Fig. 1. LISA Integrated Development Environment

From the formal language definition, editors are also generated. The language-knowledgeable editor is a compromise between text editors and structure editors, since it just colors the different parts of a program (comments, operators, reserved
words, etc.) to enhance the understandability and readability of programs. The generated lexical, syntax, and semantic analyzers, also written in Java, can be compiled in the integrated environment without issuing a command to javac (the Java compiler). Programs written in the newly defined language can be executed and evaluated. Users of the generated compiler/interpreter can visually observe the work of the lexical, syntax, and semantic analyzers by watching the animation of the finite state automaton, the parse tree, and the semantic tree. The animation shows the program in action, and the graphical representations of the finite state automaton, the syntax tree, and the semantic tree are automatically updated as the program executes. Animated visualizations help explain the inner workings of programs and are a useful tool for debugging. These features make LISA very appropriate for programming language development. The LISA tool is freely available for educational institutions from http://marcel.uni-mb.si/lisa. It runs on different platforms and requires Java 2 SDK (Software Development Kit & Runtime), version 1.2.2 or higher.
3 Applications of LISA
We have incrementally developed various small programming languages, such as PLM [3]. An application domain for which LISA is very suitable is the development of domain-specific languages. In our opinion, the development of domain-specific languages should exploit the advantages of the formal definitions of general-purpose languages, while taking into consideration the special nature of domain-specific languages. An appropriate methodology that accommodates frequent changes of domain-specific languages is needed, since the language development process should be supported by modularity and abstraction in a manner that allows incremental changes as easily as possible. If incremental language development [7] is not supported, then the language designer has to design languages from scratch or by scavenging old specifications. This approach was successfully used in the design and implementation of various domain-specific languages. In [6], the design and implementation of the Simple Object Description Language SODL for automatic interface creation are presented. The application domain was network applications. Since cross-network method calls slow down the performance of applications, the solution was Tier to Tier Object Transport (TTOT). However, with this approach the network application development time increased. To enhance productivity, a new domain-specific language, SODL, was designed. In [1], the design and implementation of the COOL and AspectCOOL languages using the LISA system are described. Here the application domain was aspect-oriented programming (AOP). AOP is a programming technique for modularizing concerns that crosscut the basic functionality of programs. In AOP, aspect languages are used to describe properties that crosscut basic functionality in a clean and modular way.
AspectCOOL is an extension of the class-based object-oriented language COOL (Classroom Object-Oriented Language), which was designed and implemented simultaneously with AspectCOOL. Both languages were formally specified with multiple attribute grammar inheritance, which enables us to gradually extend the languages with new features and to reuse previously defined specifications. Our experience with these non-trivial examples shows that multiple attribute grammar inheritance is useful in managing the complexity, reusability, and extensibility of attribute grammars. Huge specifications become much shorter and are easier to read and maintain.
4 Conclusion
Many applications today are written in well-understood domains. One trend in programming is to provide software development tools designed specifically to handle such applications and thus to greatly simplify their development. These tools take a high-level description of the specific task and generate a complete application. One such well-established domain is compiler construction: there is a long tradition of producing compilers, the underlying theories are well understood, and many application generators exist that automatically produce compilers or interpreters from programming language specifications. In this paper, the compiler/interpreter generator LISA 2.0 was briefly presented.
References

1. Enis Avdičaušević, Mitja Lenič, Marjan Mernik, and Viljem Žumer. AspectCOOL: An experiment in design and implementation of aspect-oriented language. Accepted for publication in ACM SIGPLAN Notices.
2. Marjan Mernik, Nikolaj Korbar, and Viljem Žumer. LISA: A tool for automatic language implementation. ACM SIGPLAN Notices, 30(4):71–79, April 1995.
3. Marjan Mernik, Mitja Lenič, Enis Avdičaušević, and Viljem Žumer. A reusable object-oriented approach to formal specifications of programming languages. L'Objet, 4(3):273–306, 1998.
4. Marjan Mernik, Mitja Lenič, Enis Avdičaušević, and Viljem Žumer. Multiple Attribute Grammar Inheritance. Informatica, 24(3):319–328, September 2000.
5. Marjan Mernik, Mitja Lenič, Enis Avdičaušević, and Viljem Žumer. Compiler/interpreter generator system LISA. In IEEE CD ROM Proceedings of the 33rd Hawaii International Conference on System Sciences, 2000.
6. Marjan Mernik, Uroš Novak, Enis Avdičaušević, Mitja Lenič, and Viljem Žumer. Design and implementation of simple object description language. In Proceedings of the 16th ACM Symposium on Applied Computing, pages 203–210, 2001.
7. Marjan Mernik and Viljem Žumer. Incremental language design. IEE Proceedings – Software, 145(2-3):85–91, 1998.
8. Viljem Žumer, Nikolaj Korbar, and Marjan Mernik. Automatic implementation of programming languages using object-oriented approach. Journal of Systems Architecture, 43(1-5):203–210, 1997.
Building an Interpreter with Vmgen

M. Anton Ertl¹ and David Gregg²

¹ Institut für Computersprachen, Technische Universität Wien, Argentinierstraße 8, A-1040 Wien, Austria, [email protected]
² Trinity College, Dublin
Abstract. Vmgen automates many of the tasks of writing the virtual machine part of an interpreter, resulting in less coding, debugging and maintenance effort. This paper gives some quantitative data about the source code and generated code for a vmgen-based interpreter, and gives some examples demonstrating the simplicity of using vmgen.
1 Introduction
Interpreters are a popular approach for implementing programming languages, because only interpreters offer all of the following benefits: ease of implementation, portability, and a fast edit-compile-run cycle. The interpreter generator vmgen¹ automates many of the tasks in writing the virtual machine (VM) part of an interpretive system; it takes a simple VM instruction description file and generates code for: executing and tracing VM instructions, generating VM code, disassembling VM code, combining VM instructions into superinstructions, and profiling VM instruction sequences to find superinstructions. Vmgen has special support for stack-based VMs, but most of its features are also useful for register-based VMs. Vmgen supports a number of high-performance techniques and optimizations. The resulting interpreters tend to be faster than other interpreters for the same language. This paper presents an example of vmgen usage. A detailed discussion of the inner workings of vmgen and performance data can be found elsewhere [1].
2 Example Overview
The running example in this paper is the example provided with the vmgen package: an interpretive system for a tiny Modula-2-style language that uses a JVM-style virtual machine. The language supports integer variables and expressions, assignments, if- and while-structures, function definitions and calls. Our example interpreter consists of two conceptual parts: the front-end parses the source code and generates VM code; the VM interpreter executes the VM code.
¹ Vmgen is available at http://www.complang.tuwien.ac.at/anton/vmgen/.
R. N. Horspool (Ed.): CC 2002, LNCS 2304, pp. 5–8, 2002. © Springer-Verlag Berlin Heidelberg 2002
Name                Lines  Description
Makefile               67
mini-inst.vmg         139  VM instruction descriptions
mini.h                 72  common declarations
mini.l                 42  front-end scanner
mini.y                139  front-end (parser, VM code generator)
support.c             220  symbol tables, main()
peephole-blacklist      3  VM instructions that must not be combined
disasm.c               36  template: VM disassembler
engine.c              186  template: VM interpreter
peephole.c            101  template: combining VM instructions
profile.c             160  template: VM instruction sequence profiling
stat.awk               13  template: aggregate profile information
seq2rule.awk            8  template: define superinstructions
                      504  template files total
                      682  specific files total
                     1186  total
Fig. 1. Source files in the example interpreter

Figure 1 shows quantitative data on the source code of our example. Note that the numbers include comments, which are sometimes relatively extensive (in particular, more than half of the lines in mini-inst.vmg are comments or empty). Some of the files are marked as templates; in a typical vmgen application they will be copied from the example and used with few changes, so these files cost very little. The other files contain code that will typically be written specifically for each application. Among the specific files, mini-inst.vmg contains all of the VM description; in addition, there are VM-related declarations in mini.h, calls to VM code generation functions in mini.y, and calls to the VM interpreter, disassembler, and profiler in support.c. Vmgen generates 936 lines in six files from mini-inst.vmg (see Fig. 2). The expansion factor from the source file indicates that vmgen saves a lot of work in coding, maintaining and debugging the VM interpreter. In addition to the reduced line count there is another reason why vmgen reduces the number of bugs: a new VM instruction just needs to be inserted in one place in mini-inst.vmg (and code for generating it should be added to the front end), whereas in a manually coded VM interpreter a new instruction needs code in several places. The various generated files correspond mostly directly to template files, with the template files containing wrapper code that works for all VMs, and the generated files containing code or tables specific to the VM at hand.
Name             Lines  Description
mini-disasm.i      103  VM disassembler
mini-gen.i          84  VM code generation
mini-labels.i       19  VM instruction codes
mini-peephole.i      0  VM instruction combining
mini-profile.i      95  VM instruction sequence profiling
mini-vm.i          635  VM instruction execution
                   936  total
Fig. 2. Vmgen-generated files in the example interpreter
3 Simple VM Instructions
A typical vmgen instruction specification looks like this:

    sub ( i1 i2 -- i )
    i = i1-i2;

The first line gives the name of the VM instruction (sub) and its stack effect: it takes two integers (i1 and i2) from the stack and pushes one integer (i) on the stack. The next line contains C code that accesses the stack items as variables. Loading i1 and i2 from the stack, storing i to the stack, and instruction dispatch are all managed automatically by vmgen. Another example:

    lit ( #i -- i )

The lit instruction takes the immediate argument i from the instruction stream (indicated by the # prefix) and pushes it on the stack. No user-supplied C code is necessary for lit.
4 VM Code Generation
These VM instructions are generated by the following rules in mini.y:

    expr: term '-' term { gen_sub(&vmcodep); }
    term: NUM           { gen_lit(&vmcodep, $1); }

The code generation functions gen_sub and gen_lit are generated automatically by vmgen; gen_lit has a second argument that specifies the immediate argument of lit (in this example, the number being compiled by the front end). Parsing and generating code for all subexpressions, then generating the code for the expression, naturally leads to postfix code for a stack machine. This is one of the reasons why stack-based VMs are very popular in interpreters. The programmer just has to ensure that all rules for term and expr produce code that leaves exactly one value on the stack.
The power of yacc and its actions is sufficient for our example, but for implementing a more complex language the user will probably choose a more sophisticated tool or build a tree and manually code tree traversals. In both cases, generating code in a post-order traversal of the expression parse tree is easy.
5 Superinstructions
In addition to simple instructions, you can define superinstructions as a combination of a sequence of simple instructions:

    lit_sub = lit sub

This defines a new VM instruction lit_sub that behaves in the same way as the sequence lit sub, but is faster. After adding this instruction to mini-inst.vmg and rebuilding the interpreter, this superinstruction is generated automatically whenever a call to gen_lit is followed by a call to gen_sub. But you need not even define the superinstructions yourself; you can generate them automatically from a profile of executed VM instruction sequences: compile the VM interpreter with profiling enabled, and run some programs representing your workload. The resulting profile lists the number of dynamic executions for each static occurrence of a sequence, e.g.,

    18454929 lit sub
    ...
     9227464 lit sub
This indicates that the sequence lit sub occurred in two places, for a total of 27682393 dynamic executions. These data can be aggregated with the stat.awk script; then the user can choose the most promising superinstructions (typically with another small awk or perl script), and finally transform the selected sequences into the superinstruction rule syntax with seq2rule.awk. The original intent of the superinstruction feature was to improve the run-time performance of the interpreter (and it achieves this goal), but we also noticed that it makes interpreter construction easier: in some places in an interpretive system, we can either generate a sequence of existing instructions or define a new instruction and generate that; in a manually written interpreter, the latter approach yields a faster interpreter but requires more work. Using vmgen, you can just take the first approach and let the sequence be optimized into a superinstruction if it occurs frequently; in this way, you get the best of both approaches: little effort and run-time performance.
References

1. M. Anton Ertl, David Gregg, Andreas Krall, and Bernd Paysan. vmgen — a generator of efficient virtual machine interpreters. Software—Practice and Experience, 2002. Accepted for publication.
Compiler Construction Using LOTOS NT

Hubert Garavel, Frédéric Lang, and Radu Mateescu

Inria Rhône-Alpes – Vasy, 655, avenue de l'Europe, 38330 Montbonnot, France
{Hubert.Garavel,Frederic.Lang,Radu.Mateescu}@inria.fr
1 Introduction
Much academic and industrial effort has been invested in compiler construction. Numerous tools and environments¹ have been developed to improve compiler quality while reducing implementation and maintenance costs. In the domain of computer-aided verification, most tools involve compilation and/or translation steps. This is the case with the tools developed by the Vasy team of Inria Rhône-Alpes, for instance the Cadp² [5] tools for the analysis of protocols and distributed systems. As regards lexical and syntax analysis, all Cadp tools are built using Syntax [3], a compiler generator that offers advanced error recovery features. As regards the description, construction, and traversal of abstract syntax trees (Asts), three approaches have been used successively:
– In the Caesar [8] compiler for Lotos [10], Asts are programmed in C. This low-level approach leads to slow development, as one has to deal explicitly with pointers and space management to encode and explore Asts.
– In the Caesar.Adt [6] and Xtl [13] compilers, Asts are described and handled using Lotos abstract data types, which are then translated into C using the Caesar.Adt compiler itself (bootstrap); yet, for convenience and efficiency, certain imperative processing is programmed directly in C. This approach reduces the drawbacks of using C exclusively, but suffers from limitations inherent to the algebraic specification style (lack of local variables, of sequential composition, etc.).
– For the Traian and Svl 1.0 compilers, and for the Evaluator 3.0 [14] model-checker, the Fnc-2³ [12] compiler generator based on attribute grammars was used. Fnc-2 allows one to declare attribute calculations for each Ast node and evaluates the attributes automatically, according to their dependencies.
Although we have been able to suggest many improvements that were incorporated into Fnc-2, it turned out that, for input languages with large grammars, Fnc-2 has practical limitations: development and debugging are complex, and the generated compilers have large object files and exhibit mediocre performance (slow compilation, large memory footprint due to the creation of multiple Asts and the absence of garbage collection). Therefore, the Vasy team switched to a new technology in order to develop its most recent verification tools.
¹ An extensive catalog can be found at http://catalog.compilertools.net
² http://www.inrialpes.fr/vasy/cadp
³ http://www.inrialpes.fr/vasy/fnc2
R. N. Horspool (Ed.): CC 2002, LNCS 2304, pp. 9–13, 2002.
© Springer-Verlag Berlin Heidelberg 2002
2 Using LOTOS NT for Compiler Construction
E-Lotos (Enhanced Lotos) [11] is a new Iso standard for the specification of protocols and distributed systems. Lotos NT [9,16] is a simplified variant of E-Lotos targeting efficient implementation. It combines the strong theoretical foundations of process algebras with language features suitable for wide industrial use. The data part of Lotos NT improves significantly over the previous Lotos standard [10]: equational programming is replaced with a language similar to first-order Ml extended with imperative features (assignments, loops, etc.). A compiler for Lotos NT, named Traian,⁴ translates the data part of Lotos NT specifications into C. Used in conjunction with a parser generator such as Lex/Yacc or Syntax, Traian is suitable for compiler construction:
– Lotos NT allows a straightforward description of Asts: each non-terminal symbol of the grammar is encoded by a data type having a constructor for each grammar rule associated with the symbol. Traversals of Asts for computing attributes are defined by recursive functions using "case" statements and pattern-matching.
– Traian automatically generates "printer" functions for each Lotos NT data type, which makes it possible to inspect Asts and facilitates the debugging of semantic passes.
– Traian also allows a Lotos NT specification to include external data types and functions implemented in C, enabling easy interfacing of Lotos NT specifications with hand-written C modules as well as with C code generated by Lex/Yacc or Syntax.
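The style described above — one type per non-terminal, one constructor per grammar rule, and attribute computation as a recursive function with a "case" over constructors — has a direct counterpart in plain C. Traian ultimately emits C, though not necessarily in this shape; the sketch below is ours, for a tiny expression grammar:

```c
#include <stdlib.h>

/* Hypothetical C rendering of the style described above: one type per
 * non-terminal, one constructor per grammar rule, and an attribute
 * computed by a recursive function with a "case" over constructors. */
typedef struct Expr Expr;
struct Expr {
    enum { NUM, ADD, MUL } tag;            /* one tag per rule */
    union {
        int num;                           /* NUM */
        struct { Expr *l, *r; } bin;       /* ADD, MUL */
    } u;
};

static Expr *mk_num(int n) {
    Expr *e = malloc(sizeof *e);
    e->tag = NUM; e->u.num = n;
    return e;
}
static Expr *mk_bin(int tag, Expr *l, Expr *r) {
    Expr *e = malloc(sizeof *e);
    e->tag = tag; e->u.bin.l = l; e->u.bin.r = r;
    return e;
}

/* Synthesized "value" attribute, computed by an explicit traversal. */
static int eval(const Expr *e) {
    switch (e->tag) {
    case NUM: return e->u.num;
    case ADD: return eval(e->u.bin.l) + eval(e->u.bin.r);
    default:  return eval(e->u.bin.l) * eval(e->u.bin.r);  /* MUL */
    }
}
```

In Lotos NT the tagged union and the "case" are written once, directly in the language; the C version makes explicit the pointer and memory management that Lotos NT hides.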
3 Applications
Since 1999, Lotos NT has been used to develop three significant compilers. For each compiler, the lexer and parser are built using Syntax and the Asts using Lotos NT. Type-checking, program transformation, and code generation are also implemented in Lotos NT. Some hand-written C code is added either for routine tasks (e.g., parsing options) or for some specialized algorithms (e.g., model-checking):
– The Svl 2.0 [7] compiler transforms high-level verification scripts into Bourne shell scripts (see Figure 1).
– The Evaluator 4.0 model-checker transforms a temporal logic formula into a boolean equation system solver written in C; the solver is then compiled and executed, taking as input a labelled transition system and producing a diagnostic (see Figure 2).
– The Ntif tool suite deals with a high-level language for symbolic transition systems; it includes a front-end, the Nt2if back-end generating a lower-level format, and the Nt2dot back-end producing a graph format visualizable with AT&T's GraphViz package.
⁴ http://www.inrialpes.fr/vasy/traian
[Figure 1: the SVL program given as input goes through syntax analysis and AST construction (SYNTAX), type checking (LOTOS NT), expansion of meta-operations (LOTOS NT), and code generation (LOTOS NT), each phase passing a LOTOS NT term to the next; syntax and type errors are reported by the first two phases. The output is a Bourne shell script, which a shell interpreter runs over the input files to produce the output files.]

Fig. 1. Architecture of the Svl 2.0 compiler

[Figure 2: the temporal logic formula given as input goes through syntax analysis and AST construction (SYNTAX) and type checking (LOTOS NT), then translation to boolean equation systems (LOTOS NT). The resulting BES solver (C) is compiled with a C compiler and executed on a labelled transition system to produce a diagnostic file.]
Fig. 2. Architecture of the Evaluator 4.0 model-checker

The table below summarizes the size (in lines of code) of each compiler.

                  Syntax   Lotos NT       C   Shell    Total   Generated C
  Svl 2.0          1,250      2,940     370   2,170    6,730        12,400
  Evaluator 4.0    3,600      7,500   3,900       —   15,000        37,000
  Ntif             1,620      3,620   1,200       —    6,440        20,644
4 Related Work and Conclusions
Alternative approaches exist based upon declarative representations, such as attribute grammars (Fnc-2 [12], SmartTools [1]), logic programming (Ale [4], Centaur [2]), or term rewriting (Txl⁵, Kimwitu [18], Asf+Sdf [17]). In these approaches, Asts are implicit (not directly visible to the programmer) and it is not necessary to specify the order of attribute evaluation, which is inferred from the dependencies. In contrast, our approach requires explicit Ast specification and attribute computation ordering. In practice, this is not too restrictive, since the user is usually aware of these details. Lotos NT is a hybrid between imperative and functional languages. Unlike the object-oriented approach (e.g., JavaCC⁶), in which Asts are defined using classes and visitors are implemented using methods, the Lotos NT code for computing a given attribute does not need to be split across several classes, but can be clearly centralized in a single function containing a "case" statement. Compared to lower-level imperative languages such as C, Lotos NT avoids tedious and error-prone explicit pointer manipulation. Compared to functional languages such as Haskell or Caml⁷ (for which the Happy⁸ and CamlYacc parser generators are available), Lotos NT allows neither higher-order functions nor polymorphism. In practice, we believe that these missing features are not essential for compiler construction; instead, Lotos NT provides useful mechanisms such as strong typing, function overloading, pattern-matching, and sequential composition. Lotos NT's external C types and functions make input/output operations simpler than in Haskell/Happy, where one must be acquainted with the notion of monads. Contrary to functional languages specifically dedicated to compiler construction, such as Puma⁹ and Gentle [15], Lotos NT is a general-purpose language, applicable to a wider range of problems. The Lotos NT technology can be compared with other hybrid approaches such as the App¹⁰ and Memphis¹¹ preprocessors, which extend C/C++ with abstract data types and pattern-matching.

⁵ http://www.thetxlcompany.com
Yet, these preprocessors lack the static analysis checks supported by Lotos NT and Traian (strong typing, detection of uninitialized variables, exhaustiveness of "case" statements, etc.), which significantly facilitate the programming activity. Our experience in using Lotos NT to develop three compilers has demonstrated the efficiency and robustness of this pragmatic approach. Since 1998, the Traian compiler has been available on several platforms (Windows, Linux, Solaris) and can be downloaded from the Internet. The three Traian-based compilers are or will soon be available: Svl 2.0 is distributed within Cadp 2001 "Ottawa"; Evaluator 4.0 and Ntif will be released in future versions of Cadp. Ntif is already used in a test generation platform for smart cards in an industrial project with Schlumberger.
⁶ http://www.webgain.com/products/java_cc
⁷ http://caml.inria.fr
⁸ http://www.haskell.org/happy
⁹ Puma belongs to the Cocktail toolbox (http://www.first.gmd.de/cocktail)
¹⁰ http://www.primenet.com/~georgen/app.html
¹¹ http://memphis.compilertools.net
References

1. I. Attali, C. Courbis, P. Degenne, A. Fau, D. Parigot, and C. Pasquier. SmartTools: A Generator of Interactive Environments Tools. In Proc. of CC 2001, volume 2027 of LNCS, 2001.
2. P. Borras, D. Clément, Th. Despeyroux, J. Incerpi, G. Kahn, B. Lang, and V. Pascual. Centaur: the system. In Proc. of SIGSOFT'88, 3rd Symposium on Software Development Environments (SDE3), 1988.
3. P. Boullier and P. Deschamp. Le système SYNTAX : Manuel d'utilisation et de mise en œuvre sous Unix. http://www-rocq.inria.fr/oscar/www/syntax, 1997.
4. B. Carpenter. The Logic of Typed Feature Structures. Cambridge Tracts in Theoretical Computer Science, 32, 1992.
5. J.-C. Fernandez, H. Garavel, A. Kerbrat, R. Mateescu, L. Mounier, and M. Sighireanu. CADP (CÆSAR/ALDEBARAN Development Package): A Protocol Validation and Verification Toolbox. In Proc. of CAV'96, volume 1102 of LNCS, 1996.
6. H. Garavel. Compilation of LOTOS Abstract Data Types. In Proc. of FORTE'89. North-Holland, 1989.
7. H. Garavel and F. Lang. SVL: A Scripting Language for Compositional Verification. In Proc. of FORTE'2001. Kluwer, 2001. INRIA Research Report RR-4223.
8. H. Garavel and J. Sifakis. Compilation and Verification of LOTOS Specifications. In Proc. of PSTV'90. North-Holland, 1990.
9. H. Garavel and M. Sighireanu. Towards a Second Generation of Formal Description Techniques – Rationale for the Design of E-LOTOS. In Proc. of FMICS'98, Amsterdam, 1998. CWI. Invited lecture.
10. ISO/IEC. LOTOS — A Formal Description Technique Based on the Temporal Ordering of Observational Behaviour. International Standard 8807, 1988.
11. ISO/IEC. Enhancements to LOTOS (E-LOTOS). International Standard 15437:2001, 2001.
12. M. Jourdan, D. Parigot, C. Julié, O. Durin, and C. Le Bellec. Design, Implementation and Evaluation of the FNC-2 Attribute Grammar System. ACM SIGPLAN Notices, 25(6), 1990.
13. R. Mateescu and H. Garavel. XTL: A Meta-Language and Tool for Temporal Logic Model-Checking. In Proc. of STTT'98. BRICS, 1998.
14. R. Mateescu and M. Sighireanu. Efficient On-the-Fly Model-Checking for Regular Alternation-Free Mu-Calculus. In Proc. of FMICS'2000, 2000. INRIA Research Report RR-3899. To appear in Science of Computer Programming.
15. F. W. Schröer. The GENTLE Compiler Construction System. R. Oldenbourg Verlag, 1997.
16. M. Sighireanu. LOTOS NT User's Manual (Version 2.1). INRIA projet VASY. ftp://ftp.inrialpes.fr/pub/vasy/traian/manual.ps.Z, November 2000.
17. M. G. J. van den Brand, A. van Deursen, J. Heering, H. A. de Jong, M. de Jonge, T. Kuipers, P. Klint, L. Moonen, P. A. Olivier, J. Scheerder, J. J. Vinju, E. Visser, and J. Visser. The ASF+SDF Meta-Environment: A Component-Based Language Development Environment. In Proc. of CC 2001, volume 2027 of LNCS, 2001.
18. P. van Eijk, A. Belinfante, H. Eertink, and H. Alblas. The Term Processor Generator Kimwitu. In Proc. of TACAS'97, 1997.
Data Compression Transformations for Dynamically Allocated Data Structures⋆

Youtao Zhang and Rajiv Gupta

Dept. of Computer Science, The University of Arizona, Tucson, Arizona 85721
Abstract. We introduce a class of transformations which modify the representation of dynamic data structures used in programs with the objective of compressing their sizes. We have developed the common-prefix and narrow-data transformations that respectively compress a 32 bit address pointer and a 32 bit integer field into 15 bit entities. A pair of fields which have been compressed by the above transformations are packed together into a single 32 bit word. The transformations are designed to apply to data structures that are partially compressible; that is, they compress the portions of data structures to which they apply and provide a mechanism to handle the data that is not compressible. Accesses to compressed data are efficiently implemented by designing data compression extensions (DCX) to the processor's instruction set. We have observed average reductions in heap allocated storage of 25% and average reductions in execution time and power consumption of 30%. If DCX support is not provided, the reductions in execution times fall from 30% to 12.5%.
1 Introduction
With the proliferation of limited memory computing devices, optimizations that reduce memory requirements are increasing in importance. We introduce a class of transformations which modify the representation of dynamically allocated data structures used in pointer intensive programs with the objective of compressing their sizes. The fields of a node in a dynamic data structure typically consist of both pointer and non-pointer data. Therefore we have developed the common-prefix and narrow-data transformations that respectively compress a 32 bit address pointer and a 32 bit integer field into 15 bit entities. A pair of fields which have been compressed can be packed into a single 32 bit word. As a consequence of compression, the memory footprint of the data structures is significantly reduced, leading to significant savings in heap allocated storage requirements, which is quite important for memory intensive applications. The reduction in memory footprint can also lead to significantly reduced execution times due to a reduction in data cache misses in the transformed program.
⋆ Supported by DARPA PAC/C Award F29601-00-1-0183 and NSF grants CCR-0105355, CCR-0096122, EIA-9806525, and EIA-0080123 to the Univ. of Arizona.
R. N. Horspool (Ed.): CC 2002, LNCS 2304, pp. 14–28, 2002.
© Springer-Verlag Berlin Heidelberg 2002
An important feature of our transformations is that they have been designed to apply to data structures that are partially compressible. In other words, they compress the portions of data structures to which the transformations apply and provide a mechanism to handle the data that is not compressible. Initially, data storage for a compressed data structure is allocated assuming that it is fully compressible. However, at runtime, when uncompressible data is encountered, additional storage is allocated to handle such data. Our experience with applications from the Olden test suite demonstrates that this is a highly important feature, because all the data structures that we examined in our experimentation were highly compressible, but none were fully compressible. For efficiently accessing data in compressed form we propose data compression extensions (DCX) to a RISC-style ISA, which consist of six simple instructions. These instructions perform two types of operations. First, since we must handle partially compressible data structures, whenever a field that has been compressed is updated, we must check whether the new value to be stored in that field is indeed compressible. Second, when we need to make use of a compressed value in a computation, we must perform an extract-and-expand operation to obtain the original 32 bit representation of the value. We have implemented our techniques and evaluated them. The DCX instructions have been incorporated into the MIPS-like instruction set used by the simplescalar simulator. The compression transformations have been incorporated into the gcc compiler. We have also addressed other important implementation issues, including the selection of fields for compression and packing. Our experiments with six benchmarks from the Olden test suite demonstrate average space savings of 25% in heap allocated storage and average reductions of 30% in execution times and power consumption.
The net reduction in execution times is attributable to reduced miss rates for L1 data cache and L2 unified cache and the availability of DCX instructions.
2 Data Compression Transformations
As mentioned earlier, we have developed two compression transformations: one to handle pointer data and the other to handle narrow width non-pointer data. We illustrate the transformations using the dynamically allocated linked list data structure shown below; the next and value fields are compressed to illustrate the compression of both pointer and non-pointer data. The compressed fields are packed together to form a single 32 bit field value_next.

    Original structure:                  Transformed structure:
    struct list_node {                   struct list_node {
        ...;                                 ...;
        int value;                           int value_next;
        struct list_node *next;          } *t;
    } *t;
Common-prefix transformation for pointer data. The pointer contained in the next field of the linked list can be compressed under certain conditions. In particular, consider the addresses corresponding to an instance of list_node (addr1) and the next field in that node (addr2). If the two addresses share a common 17 bit prefix because they are located fairly close in memory, we classify the next pointer as compressible. In this case we eliminate the common prefix from address addr2, which is stored in the next pointer field. The lower order 15 bits of addr2 form the representation of the pointer in compressed form. The 32 bit representation of a next field can be reconstructed when required by obtaining the prefix from the pointer to the list_node instance to which the next field belongs.

Narrow-data transformation for non-pointer data. Now let us consider the compression of the narrow width integer in the value field. If the 18 higher order bits of the value are identical, that is, they are either all 0's or all 1's, it is classified as compressible. The 17 higher order bits are discarded, leaving a 15 bit entity. Since the 17 discarded bits are identical to the most significant bit of the 15 bit entity, the 32 bit representation can easily be derived when needed by replicating the most significant bit.

Packing together compressed fields. The value and next fields of a node belonging to an instance of list_node can be packed together into a single 32 bit word, as they are simply 15 bit entities in their compressed form. Together they are stored in the value_next field of the transformed structure. The 32 bits of value_next are divided into two half words; each compressed field is stored in the lower order 15 bits of the corresponding half word. Under this strategy, bits 15 and 31 are not used by the compressed fields. Next we describe the handling of uncompressible data in partially compressible data structures. The implementation of partially compressible data structures requires an additional bit for encoding information; this is why we compress fields down to 15 bit entities rather than 16 bit entities.

Partial compressibility.
Our basic approach is to allocate only enough storage to accommodate a compressed node when a new node in the data structure is created. Later, as the pointer fields are assigned values, we check whether the fields are compressible. If they are, they can be accommodated in the allocated space; otherwise additional storage is allocated to hold the fields in uncompressed form. The previously allocated location is then used to hold a pointer to this additional storage; accessing uncompressible fields therefore goes through an extra step of indirection. If the uncompressible data stored in the fields is modified, it is possible that the fields may become compressible again. However, we do not carry out such checks and instead leave the fields in uncompressed form, because exploiting such compression opportunities could lead to repeated allocation and deallocation of extra locations if data values keep oscillating between compressible and uncompressible forms. We therefore simplify our approach: once a field is assigned an uncompressible value, the data in the field is maintained in uncompressed form from then onwards.
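The compressibility tests and the compress/expand steps described above amount to a few bit manipulations. A minimal C sketch (our illustration, not code emitted by the paper's compiler), assuming two's complement arithmetic right shifts:

```c
#include <stdint.h>

/* Bit-level sketch (ours, not the paper's compiler output) of the two
 * transformations, with 15 bit compressed entities. */

/* Common-prefix: a pointer addr2 stored in a node at addr1 is
 * compressible if both addresses share the same 17 high-order bits. */
static int prefix_compressible(uint32_t addr1, uint32_t addr2) {
    return (addr1 >> 15) == (addr2 >> 15);
}
static uint32_t compress_ptr(uint32_t addr2) { return addr2 & 0x7fff; }
static uint32_t expand_ptr(uint32_t addr1, uint32_t c) {
    return (addr1 & ~0x7fffu) | c;       /* reattach the shared prefix */
}

/* Narrow-data: compressible if the 18 high-order bits are all 0's or
 * all 1's, i.e. the value fits in 15 signed bits; expansion replicates
 * bit 14 (assumes two's complement arithmetic right shift). */
static int narrow_compressible(int32_t v) {
    return v >= -(1 << 14) && v < (1 << 14);
}
static uint32_t compress_int(int32_t v) { return (uint32_t)v & 0x7fff; }
static int32_t expand_int(uint32_t c) {
    return ((int32_t)(c << 17)) >> 17;   /* sign-extend from bit 14 */
}
```

For example, expand_ptr(addr1, compress_ptr(addr2)) recovers addr2 exactly when prefix_compressible(addr1, addr2) holds.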
We use the most significant bit (bit 31) of the word to indicate whether the data stored in the word is compressed. This is possible because in the MIPS base system that we use, the most significant bit of all heap addresses is always 0. A 0 indicates that the word contains compressed values; a 1 means that one or both values were not compressible and the word instead contains a pointer to an extra pair of dynamically allocated locations holding the values of the two fields in uncompressed form. While bit 31 is used to encode this extra information, bit 15 is never used for any purpose.

[Figure 1 shows the three possible outcomes of setting the value field (to v1) and creating the next link of a list_node instance at addr0: (1) both fields are compressible, so the packed word has bit 31 = 0 and holds both compressed fields; (2) value is compressible but next is not, so extra locations at addr11 are allocated, bit 31 is set to 1, and both fields are stored there uncompressed; (3) value is not compressible, so extra locations are allocated right away and neither field is ever stored compressed.]

Fig. 1. Dealing with uncompressible data.
In Fig. 1 we illustrate the above method using an example in which an instance of list_node is allocated and then the value and next fields are set up one at a time. As we can see, storage is first allocated to accommodate the two fields in compressed form. As soon as the first uncompressible field is encountered, additional storage is allocated to hold the two fields in uncompressed form. Under this scheme there are three possibilities, which are illustrated in Fig. 1. In the first case both fields are found to be compressible and therefore no extra locations are allocated. In the second case the value field, which is accessed first, is compressible but the next field is not. Thus, the value field is initially stored in compressed form, but later, when the next field is found to be uncompressible, extra locations are allocated and both fields are stored in uncompressed form. Finally, in the third case the value field is not compressible, so extra locations are allocated right away and neither field is ever stored in compressed form.
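The allocate-compressed-first, fall-back-on-demand policy of Fig. 1 can be sketched in C. This is our simplified model, not the paper's runtime: the paper tags a real heap pointer with bit 31 (legal because MIPS heap addresses always have bit 31 clear), whereas we index a small static overflow pool to keep the sketch portable:

```c
#include <stdint.h>

/* Simplified model of Fig. 1 (ours, not the paper's runtime): bit 31 of
 * the packed word distinguishes compressed fields (0) from a reference
 * to overflow storage holding the fields uncompressed (1). */

static int32_t overflow[256][2];   /* stand-in for extra heap locations */
static unsigned overflow_next;

static int fits15(int32_t v) { return v >= -(1 << 14) && v < (1 << 14); }

/* Store `value` into the low-half field of *word, falling back to
 * overflow storage the first time an uncompressible value is seen. */
static void store_low(uint32_t *word, int32_t value) {
    if (*word & 0x80000000u) {                     /* already uncompressed */
        overflow[*word & 0x7fffffffu][0] = value;
    } else if (fits15(value)) {                    /* pack into bits 0..14 */
        *word = (*word & ~0x7fffu) | ((uint32_t)value & 0x7fff);
    } else {                                       /* first incompressible */
        unsigned i = overflow_next++;
        overflow[i][0] = value;
        overflow[i][1] = ((int32_t)(*word << 1)) >> 17; /* old high field */
        *word = i | 0x80000000u;
    }
}

/* Read the low-half field back in 32 bit form. */
static int32_t load_low(uint32_t word) {
    if (word & 0x80000000u)
        return overflow[word & 0x7fffffffu][0];
    return ((int32_t)(word << 17)) >> 17;          /* sign-extend bit 14 */
}
```

Note that, as in the paper, once the word has switched to the uncompressed representation it never switches back, even if a later value would again fit in 15 bits.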
3 Instruction Set Support
Compression reduces the amount of heap allocated storage used by the program, which typically improves data cache behavior. Also, if both fields need to be read in tandem, a single load suffices to read them both. However, manipulating the fields also creates additional overhead. To minimize this overhead we have designed new RISC-style instructions: three simple instructions each for pointer and non-pointer data that efficiently implement the common-prefix and narrow-data transformations. The semantics of these instructions are summarized in Fig. 2. They are RISC-style instructions with complexity comparable to existing branch and integer ALU instructions. Let us discuss these instructions in greater detail.

Checking compressibility. Since we would like to handle partially compressible data, before we actually compress a data item at runtime, we must first check whether the data item is compressible. Therefore the first instruction type we introduce allows efficient checking of data compressibility. We provide two new instructions: the first checks the compressibility of pointer data and the second does the same for non-pointer data.

bneh17 R1, R2, L1 – is used to check whether the higher order 17 bits of R1 and R2 are the same. If they are, execution continues and the field held in R2 can be compressed; otherwise the branch is taken to a point where we handle the situation, by allocating additional storage, in which the address in R2 is not compressible. The instruction also handles the case where R2 contains a nil pointer, which is represented by the value 0 in both compressed and uncompressed forms. Since 0 represents a nil pointer, the lower order 15 bits of an allocated address should never be all zeroes; to handle this correctly we have modified our malloc routine so that it never allocates storage locations with such addresses.
bneh18 R1, L1 – is used to check whether the higher order 18 bits of R1 are identical (i.e., all 0's or all 1's). If they are, execution continues and the value held in R1 is compressed; otherwise the value in R1 is not compressible and the branch is taken to a point where we place code to handle this situation by allocating additional storage.

Extract-and-expand. If a pointer is stored in compressed form, before it can be dereferenced, we must first reconstruct its 32-bit representation. We do the same for compressed non-pointer data before its use. Therefore the second instruction type that we introduce carries out extract-and-expand operations. There are four new instructions, described below. The first two extract-and-expand compressed pointer fields from the lower and upper halves of a 32-bit word respectively; the next two do the same for non-pointer data.

xtrhl R1, R2, R3 – extracts the compressed pointer field stored in the lower order bits (0 through 14) of register R3 and appends it to the common prefix contained in the higher order bits (15 through 31) of R2 to construct the uncompressed pointer, which is then made available in R1. We also handle the case when R3 contains a nil pointer: if the compressed field is a nil pointer, R1 is set to nil.
[Figure 2 defines the semantics of the six DCX instructions, where ∥ denotes bit concatenation and Rn[i..j] denotes bits i through j of register Rn:

    bneh17 R1,R2,L1:  if (R2 != 0) and (R1[31..15] != R2[31..15]) goto L1
    bneh18 R1,L1:     if (R1[31..14] != 0) and (R1[31..14] != 0x3ffff) goto L1
    xtrhl  R1,R2,R3:  if (R3[14..0] != 0)  then R1 = R2[31..15] ∥ R3[14..0]  else R1 = 0
    xtrhh  R1,R2,R3:  if (R3[30..16] != 0) then R1 = R2[31..15] ∥ R3[30..16] else R1 = 0
    xtrl   R1,R2:     if (R2[14] == 1) then R1 = 0x1ffff ∥ R2[14..0]  else R1 = R2[14..0]
    xtrh   R1,R2:     if (R2[30] == 1) then R1 = 0x1ffff ∥ R2[30..16] else R1 = R2[30..16]]

Fig. 2. DCX instructions.
xtrhh R1, R2, R3 – extracts the compressed pointer field stored in the higher order bits (16 through 30) of register R3 and appends it to the common prefix contained in the higher order bits (15 through 31) of R2 to construct the uncompressed pointer, which is then made available in R1. If the compressed field is a nil pointer, R1 is set to nil. The instructions xtrhl and xtrhh can also be used to compress two fields together; however, they are not essential for this purpose because typically there are existing instructions which can perform this operation. In the MIPS-like instruction set used in this work, this was indeed the case.

xtrl R1, R2 – extracts the field stored in the lower half of R2, expands it, and stores the resulting 32 bit value in R1.

xtrh R1, R2 – extracts the field stored in the higher order bits of R2, expands it, and stores the resulting 32 bit value in R1.
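The semantics in Fig. 2 can be modeled in C for reference. This is an emulation sketch of ours, not the hardware implementation, and it assumes two's complement arithmetic right shifts:

```c
#include <stdint.h>

/* C emulation (ours) of the six DCX instructions' effects. */

/* bneh17: branch if R2 is non-nil and R1, R2 differ in bits 31..15 */
static int bneh17_taken(uint32_t r1, uint32_t r2) {
    return r2 != 0 && (r1 >> 15) != (r2 >> 15);
}
/* bneh18: branch if bits 31..14 of R1 are neither all 0's nor all 1's */
static int bneh18_taken(uint32_t r1) {
    return (r1 >> 14) != 0 && (r1 >> 14) != 0x3ffff;
}
/* xtrhl/xtrhh: prepend R2's 17 bit prefix to a compressed pointer taken
 * from the lower (bits 14..0) or upper (bits 30..16) half of R3;
 * a nil (all-zero) compressed field expands to nil. */
static uint32_t xtrhl(uint32_t r2, uint32_t r3) {
    uint32_t f = r3 & 0x7fff;
    return f ? (r2 & ~0x7fffu) | f : 0;
}
static uint32_t xtrhh(uint32_t r2, uint32_t r3) {
    uint32_t f = (r3 >> 16) & 0x7fff;
    return f ? (r2 & ~0x7fffu) | f : 0;
}
/* xtrl/xtrh: sign-extend a compressed integer from bit 14 (resp. 30) */
static int32_t xtrl(uint32_t r2) { return ((int32_t)(r2 << 17)) >> 17; }
static int32_t xtrh(uint32_t r2) { return ((int32_t)(r2 << 1))  >> 17; }
```

The branch conditions mirror the compressibility tests of Section 2: bneh17_taken is false exactly when the pointer shares the node's 17 bit prefix (or is nil), and bneh18_taken is false exactly when the integer's 18 high-order bits are uniform.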
Next we give a simple example to illustrate the use of the above instructions. Let us assume that an integer field t->value and a pointer field t->next are compressed together into a single field t->value_next. In Fig. 3a we show how compressibility checks are used prior to storing newvalue and newnext into the compressed fields. In Fig. 3b we illustrate the extract-and-expand instructions by extracting the compressed values stored in t->value_next.

    ; $16 : &t->value_next
    ; $18 : newvalue
    ; $19 : newnext
    ;
          ; branch if newvalue is not compressible
          bneh18  $18, $L1
          ; branch if newnext is not compressible
          bneh17  $16, $19, $L1
          ; store compressed data in t->value_next
          ori     $19, $19, 0x7fff
          swr     $18, 0($16)
          swr     $19, 2($16)
          j       $L2
    $L1:  ; allocate extra locations and store pointer
          ; to extra locations in t->value_next
          ; store uncompressed data in extra locations
          ...
    $L2:  ...

    (a) Illustration of compressibility checks.

    ; $16 : &t->value_next
    ; $17 : uncompressed integer t->value
    ; $18 : uncompressed pointer t->next
    ;
          ; load contents of t->value_next
          lw      $3, 0($16)
          ; branch if $3 is a pointer to extra locations
          bltz    $3, $L1
          ; extract and expand t->value
          xtrl    $17, $3
          ; extract and expand t->next
          xtrhh   $18, $16, $3
          j       $L2
    $L1:  ; load values from extra locations
          ...
    $L2:  ...

    (b) Illustration of extract and expand instructions.

Fig. 3. An example.
4 Compiler Support
Object layout transformations can only be applied to a C program if the user does not access the fields through explicit address arithmetic and does not typecast objects of the transformed type into objects of another type. Like prior work by Truong et al. [14] on field reorganization and instance interleaving, we assume that the programmer has given us the go-ahead to freely transform the data structures when it is appropriate to do so. From this step onwards the rest of the process is carried out automatically by the compiler. In the remainder of this section we describe key aspects of the compiler support required for effective data compression.

Identifying fields for compression and packing. Our observation is that most pointer fields can be compressed quite effectively using the common-prefix transformation. Integer fields to which the narrow-data transformation can be applied can be identified either from knowledge about the application or using value profiling. The most critical issue is that of pairing compressed fields for packing into a single word. For this purpose we first categorize the fields as hot fields and cold fields. It is useful to pack two hot fields together if they are typically accessed in tandem, because a single load can then be shared while reading the two values. It is also useful to pack any two cold fields, even if they are not accessed in tandem, because although they cannot share the same load, they are not accessed frequently. In all other situations it is not as useful to pack data together: even though space savings would be obtained, execution time would be adversely affected. We used basic block frequency counts to identify pairs of fields belonging to the above categories and then applied the compression transformations to them.

ccmalloc vs malloc. We make use of ccmalloc [6], a modified version of malloc, for carrying out storage allocation.
This form of storage allocation was developed by Chilimbi et al. [6]; as described earlier, it improves the locality of dynamic data structures by allocating the linked nodes of the data structure as close to each other as possible in the heap. As a consequence, this technique increases the likelihood that the pointer fields in a given node will be compressible. It therefore makes sense to use ccmalloc in order to exploit the synergy between ccmalloc and data compression.

Register pressure. Another issue that we consider in our implementation is the potential increase in register pressure. The code executed when pointer fields are found to be uncompressible is substantial, and it can therefore increase register pressure significantly, causing a loss in performance. However, we know that this code is executed very infrequently, since very few fields are uncompressible. Therefore, in this piece of code we first free registers by saving their values, and after executing the code the values are restored to the registers. In other words, the increase in register pressure does not have an adverse effect on frequently executed code.
Youtao Zhang and Rajiv Gupta
Instruction cache behavior and code size. The additional instructions generated for implementing compression can lead to an increase in code size, which can in turn impact instruction cache behavior. It is important to note, however, that a large part of the code size increase is due to the handling of the infrequent case in which the data is found not to be compressible. To minimize the impact on code size, we can share the code for handling this infrequent case across all the updates corresponding to a given data field. To minimize the impact on instruction cache performance, we can employ a code layout strategy which places this infrequently executed code elsewhere and creates branches to it and back, so that the instruction cache behavior of more frequently executed code is minimally affected. Our implementation currently does not support these techniques, and therefore we observed code size increases and degraded instruction cache behavior in our experiments.

Code generation. The remaining code generation details for implementing data compression are for the most part quite straightforward. Once the fields have been selected for compression and packing together, whenever a use of the value of any of these fields is encountered, the load is followed by an extract-and-expand instruction. If the value of any of the compressed fields is to be updated, the compressibility check is performed before storing the value. When two hot fields that are packed together are to be read/updated, we initially generate separate loads/stores for them; later, in a separate pass, we eliminate the second of the two loads/stores whenever possible.
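The load/extract-and-expand and check-before-store steps can be modeled in a few lines. The exact DCX encoding is defined earlier in the paper; this sketch uses an assumed 17-bit-prefix/15-bit-suffix split of a 32-bit pointer and a reserved escape pattern purely for illustration.

```python
# Illustrative model of common-prefix compression: two pointers that
# share the upper address bits with the node's own address `base` are
# packed into one 32-bit word.  The 17/15 split and the reserved
# escape pattern are assumptions of this sketch, not the paper's
# exact DCX encoding.

PREFIX_BITS, LOW_BITS = 17, 15
LOW_MASK = (1 << LOW_BITS) - 1
ESCAPE = LOW_MASK  # reserved pattern signalling "not compressible"

def compressible(addr, base):
    """Compressibility check performed before a store: the pointer
    must share the common prefix with the node's address."""
    return (addr >> LOW_BITS) == (base >> LOW_BITS) and \
           (addr & LOW_MASK) != ESCAPE

def pack(a, b, base):
    """Pack two compressible pointers into a single word."""
    assert compressible(a, base) and compressible(b, base)
    return ((a & LOW_MASK) << LOW_BITS) | (b & LOW_MASK)

def extract_and_expand(word, which, base):
    """Model of the extract-and-expand step that follows a load:
    pull one 15-bit field out of the packed word and prepend the
    common prefix taken from the node's address."""
    low = (word >> LOW_BITS) & LOW_MASK if which == 0 else word & LOW_MASK
    return ((base >> LOW_BITS) << LOW_BITS) | low
```

Note how a single load of `word` can serve both `extract_and_expand(word, 0, base)` and `extract_and_expand(word, 1, base)`, which is exactly why packing tandem-accessed hot fields pays off.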
5 Performance Evaluation
Experimental setup. We have implemented the techniques described in order to evaluate their performance. The transformations have been implemented as part of the gcc compiler, and the DCX instructions have been incorporated into the MIPS-like instruction set of the superscalar processor simulated by simplescalar [3]. The evaluation is based upon six benchmarks taken from the Olden test suite [5] (see Fig. 4a), which contains pointer intensive programs that make extensive use of dynamically allocated data structures. In order to study the impact of memory performance we varied the input sizes of the programs and also varied the L2 cache latency. The cache organization of simplescalar is shown in Fig. 4b. There are separate first level instruction and data caches (I-cache and D-cache); the lower level cache is a unified cache for instructions and data. The L1 caches are 16K direct mapped with a 9 cycle miss latency, while the unified L2 cache is 256K 2-way associative with 100/200/400 cycle miss latencies. Our experiments are for an out-of-order issue superscalar processor with an issue width of 4 instructions and the bimod branch predictor.

Impact on storage needs. The transformations applied and their impact on node sizes are shown in Fig. 5a. In the first four benchmarks (treeadd, bisort, tsp, and perimeter), node sizes are reduced by storing pairs of compressed pointers in a single word. In the health benchmark a pair of small values are
(a) Benchmarks.

Program    Application
treeadd    Recursive sum of values in a B-tree
bisort     Bitonic sorting
tsp        Traveling salesman problem
perimeter  Perimeters of regions in images
health     Columbian health care simulation
mst        Minimum spanning tree of a graph

(b) Machine configurations.

Parameter                               Value
Issue width                             4 issue, out of order
I-cache                                 16K direct mapped
I-cache miss latency                    9 cycles
L1 data cache                           16K direct mapped
L1 data cache miss latency              9 cycles
L2 unified cache                        256K 2-way
Memory latency (L2 cache miss latency)  Configuration 1/2/3 = 100/200/400 cycles

Fig. 4. Experimental setup.
compressed together and stored in a single word. Finally, in the mst benchmark a compressed pointer and a compressed small value are stored together in a single word. The reductions in node size range from 25% to 33% for five of the benchmarks; only in the case of tsp is the reduction smaller – just over 10%. We measured the runtime savings in heap allocated storage for small and large program inputs. The results are given in Fig. 5b. The average savings are nearly 25%, and they range from 10% to 33% across the different benchmarks. Even more importantly, these savings represent significant amounts of heap storage – typically in megabytes. For example, the 33% storage savings for treeadd represents 4.2 Mbytes and 17 Mbytes of heap storage savings for the small and large program inputs respectively. It should also be noted that such savings cannot be obtained by the other locality improving techniques described earlier [14,15,6]. From the results in Fig. 5b we make another very important observation. The number of extra locations allocated when non-compressible data is encountered is non-zero for all of the benchmarks. In other words, for none of the data structures to which our compression transformations were applied were all instances of the data encountered at runtime actually compressible. A small number of additional locations was allocated to hold the few uncompressible pointers and small values in each case. Therefore the generality of our transformation, which allows handling of partially compressible data structures, is extremely important. If we had restricted the application of our technique to data fields that are always guaranteed to be compressible, we could not have achieved any compression, and therefore no space savings would have resulted. We also measured the increase in code size caused by our transformations (see Fig. 5c).
The increase in code size prior to linking is significant, while after linking the increase is very small, since the user code is a small part of the binaries. The reason for the significant increase in the user code is that each time a compressed field is updated, our current implementation generates a new copy of the additional code for handling the case where the data being stored may
not be compressible. In practice it is possible to share this code across multiple updates; once such sharing has been implemented, we expect that the increase in the size of the user code will also be quite small.

(a) Reduction in node size.

Program    Transformation Applied  Size Change (bytes)
treeadd    Com.Prefix/Com.Prefix   from 28 to 20
bisort     Com.Prefix/Com.Prefix   from 12 to 8
tsp        Com.Prefix/Com.Prefix   from 36 to 32
perimeter  Com.Prefix/Com.Prefix   from 12 to 8
health     NarrowData/NarrowData   from 16 to 12
mst        Com.Prefix/NarrowData   from 16 to 12

(b) Reduction in heap storage (bytes) for small and large inputs.

                        Small Input                        Large Input
Program    Original   Total (Extra)     Savings | Original   Total (Extra)      Savings
treeadd    12582900   8402040 (13440)   33.2 %  | 50331636   33605684 (51260)   33.2 %
bisort     786420     549880 (25600)    30.1 %  | 3145716    2301304 (204160)   26.8 %
tsp        5242840    4200352 (6080)    19.9 %  | 20971480   16800224 (23040)   19.9 %
perimeter  4564364    3265380 (5120)    28.5 %  | 20332620   14546980 (23680)   28.5 %
health     566872     510272 (320)      10.0 %  | 1128240    1015124 (320)      10.0 %
mst        3414020    2367812 (320)     30.6 %  | 54550532   37781828 (320)     30.7 %
average                                 25.4 %  |                               24.9 %

(c) Code size increase.

Program    Before Linking  After Linking
treeadd    16.4%           0.04%
bisort     40.0%           0.01%
tsp        4.9%            0.18%
perimeter  21.3%           1.97%
health     33.7%           0.23%
mst        10.7%           0.06%
average    21.1%           0.41%

Fig. 5. Impact on storage needs.
Impact on execution times. Based upon the cycle counts provided by the simplescalar simulator, we studied the changes in execution times resulting from the compression transformations; the impact of L2 latency on execution times was also studied. The results in Fig. 6 are for small inputs. For an L2 cache latency of 100 cycles, the reductions in execution time in comparison to the original programs, which use malloc, range from 3% to 64%, while on average the reduction in execution time is around 30%. The reductions for higher latencies are similar. We also compared our execution times with versions of the programs that use ccmalloc. Our approach outperforms ccmalloc in five out of the six benchmarks (our version of mst runs slightly slower than the ccmalloc version); on average we outperform ccmalloc by nearly 10%. Our approach outperforms ccmalloc because once the node sizes are reduced, typically a greater number of nodes fit into a single cache line, leading to fewer cache misses. We do pay additional runtime overhead in the form of the extra instructions needed to carry out compression and extraction of compressed values. However, this additional
execution time is more than offset by the time savings resulting from reduced cache misses, leading to an overall reduction in execution time. On average, compression reduces execution times by 10%, 15%, and 20% relative to ccmalloc for L2 cache latencies of 100, 200, and 400 cycles respectively. Therefore we observe that as the latency of the L2 cache is increased, compression outperforms ccmalloc by a greater margin. In summary, our approach provides large storage savings and significant execution time reductions over ccmalloc.
Fig. 6. Reduction in execution time due to data compression.
We would also like to point out that the use of the special DCX instructions was critical in reducing the overhead of compression and extraction; without DCX instructions the programs would have run significantly slower. We ran versions of the programs which did not use DCX instructions for an L2 cache latency of 100 cycles. The average reduction in execution times, in comparison to the original programs, dropped from 30% to 12.5%, and instead of an average reduction in execution times of 10% in comparison to the ccmalloc versions of the programs, we observed an average increase of 9% in execution times.

Impact on power consumption. We also compared the power consumption of the compression based programs with that of the original programs and the ccmalloc based programs (see Fig. 7). These measurements are based upon the Wattch [1] system, which is built on top of the simplescalar simulator. These results track the execution time results quite closely. The average reduction in power consumption over the original programs is around 30% for the small input. The reductions in power dissipation that compression provides over ccmalloc for the different cache latencies are also given: on average, compression reduces power dissipation by 5%, 10%, and 15% over ccmalloc for L2 cache latencies of 100, 200, and 400 cycles respectively.
Fig. 7. Impact on power consumption.
Impact on cache performance. Finally, in Fig. 8 we present the impact of compression on cache behavior, including the I-cache, D-cache, and unified L2 cache. As expected, I-cache performance is degraded due to the increase in code size caused by our current implementation of compression. However, the performance of the D-cache and the unified cache is significantly improved; this improvement in data cache performance is a direct consequence of compression.
Fig. 8. Impact on cache misses.
6 Related Work
Recently there has been a lot of interest in exploiting narrow width values to improve program performance [2,12,13]. However, our work focuses on pointer intensive applications, for which it is important to also handle pointer data. A great deal of research has been conducted on the development of locality improving transformations for dynamically allocated data structures. These transformations alter object layout and placement to improve cache performance [14,6,15]; however, none of them result in space savings. Existing compression transformations [10,7] rely upon compile time analysis to prove that certain data items do not require a complete word of memory. They are applicable only when the compiler can determine that the data being compressed is fully compressible, and they apply only to narrow width non-pointer data. In contrast, our compression transformations apply to partially compressible data and, in addition to handling narrow width non-pointer data, they also apply to pointer data. Our approach is not only more general but also simpler in one respect: we do not require compile-time analysis to prove that the data is always compressible. Instead, simple compile-time heuristics are sufficient to determine that the data is likely to be compressible. ISA extensions have been developed to efficiently process narrow width data, including Intel's MMX [9] and Motorola's AltiVec [11], and compiler techniques are being developed to exploit such instruction sets [8]. However, the instructions we require are quite different from MMX instructions, because we must handle partially compressible data structures and we must also handle pointer data.
7 Conclusions
We have introduced a new class of transformations that apply data compression techniques to reduce the sizes of dynamically allocated data structures. These transformations yield large space savings and also significant reductions in program execution times and power dissipation due to improved memory performance. An attractive property of these transformations is that they are applicable to partially compressible data structures. This is extremely important because, according to our experiments, while the data structures in all of the benchmarks we studied are highly compressible, they contain small amounts of uncompressible data. Even for programs with fully compressible data structures our approach has one advantage: the application of the compression transformations can be driven by simple value profiling techniques [4], so there is no need for complex compile-time analyses for identifying fully compressible fields in data structures. Our approach is applicable to a more general class of programs than existing compression techniques: we can compress pointers as well as non-pointer data, and we can compress partially compressible data structures. Finally, we have designed the DCX ISA extensions to enable efficient manipulation of compressed data; the same task cannot be carried out using MMX-type instructions. Our main contribution is to show that data compression techniques can now be used to
improve the performance of general purpose programs; this work therefore takes the utility of compression beyond the realm of multimedia applications.
References

1. D. Brooks, V. Tiwari, and M. Martonosi, "Wattch: A Framework for Architecture-Level Power Analysis and Optimizations," 27th International Symposium on Computer Architecture (ISCA), pages 83–94, May 2000.
2. D. Brooks and M. Martonosi, "Dynamically Exploiting Narrow Width Operands to Improve Processor Power and Performance," 5th International Symposium on High-Performance Computer Architecture (HPCA), pages 13–22, January 1999.
3. D. Burger and T.M. Austin, "The Simplescalar Tool Set, Version 2.0," Computer Architecture News, pages 13–25, June 1997.
4. M. Burrows, U. Erlingson, S-T.A. Leung, M.T. Vandevoorde, C.A. Waldspurger, K. Walker, and W.E. Weihl, "Efficient and Flexible Value Sampling," Ninth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 160–167, Cambridge, MA, November 2000.
5. M. Carlisle, "Olden: Parallelizing Programs with Dynamic Data Structures on Distributed-Memory Machines," PhD Thesis, Princeton University, Dept. of Computer Science, June 1996.
6. T.M. Chilimbi, M.D. Hill, and J.R. Larus, "Cache-Conscious Structure Layout," ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 1–12, Atlanta, Georgia, May 1999.
7. J. Davidson and S. Jinturkar, "Memory Access Coalescing: A Technique for Eliminating Redundant Memory Accesses," ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 186–195, 1994.
8. S. Larsen and S. Amarasinghe, "Exploiting Superword Level Parallelism with Multimedia Instruction Sets," ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 145–156, Vancouver, B.C., Canada, June 2000.
9. A. Peleg and U. Weiser, "MMX Technology Extension to the Intel Architecture," IEEE Micro, 16(4):42–50, August 1996.
10. M. Stephenson, J. Babb, and S. Amarasinghe, "Bitwidth Analysis with Application to Silicon Compilation," ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 108–120, Vancouver, B.C., Canada, June 2000.
11. J. Tyler, J. Lent, A. Mather, and H.V. Nguyen, "AltiVec: Bringing Vector Technology to the PowerPC Processor Family," Phoenix, AZ, February 1999.
12. Y. Zhang, J. Yang, and R. Gupta, "Frequent Value Locality and Value-Centric Data Cache Design," Ninth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 150–159, Cambridge, MA, November 2000.
13. J. Yang, Y. Zhang, and R. Gupta, "Frequent Value Compression in Data Caches," 33rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 258–265, Monterey, CA, December 2000.
14. D.N. Truong, F. Bodin, and A. Seznec, "Improving Cache Behavior of Dynamically Allocated Data Structures," International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 322–329, Paris, France, 1998.
15. B. Calder, C. Krintz, S. John, and T. Austin, "Cache-Conscious Data Placement," 8th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 139–149, San Jose, California, October 1998.
Evaluating a Demand Driven Technique for Call Graph Construction⋆

Gagan Agrawal¹, Jinqian Li², and Qi Su²

¹ Department of Computer and Information Sciences, Ohio State University, Columbus, OH 43210, [email protected]
² Department of Computer and Information Sciences, University of Delaware, Newark, DE 19716, {li,su}@eecis.udel.edu
Abstract. With the increasing importance of just-in-time or dynamic compilation and the use of program analysis as part of software development environments, there is a need for techniques for demand driven construction of a call graph. We have developed a technique for demand driven call graph construction which handles the dynamic calls due to polymorphism in object-oriented languages. Our demand driven technique has the same accuracy as the corresponding exhaustive technique. The reduction in graph construction time depends upon the ratio of the cardinality of the set of influencing nodes to the total number of nodes in the entire program. This paper presents a detailed experimental evaluation of the benefits of the demand driven technique over the exhaustive one. We consider a number of scenarios, including resolving a single call site, resolving all call sites in a method, resolving all call sites within all methods in a class, and computing reaching definitions of all actual parameters inside a method. We compare the analysis time, the number of methods analyzed, and the number of nodes in the working set for the demand driven and exhaustive analyses. We use SPECJVM programs as benchmarks for our experiments. Our experiments show that for the larger SPECJVM programs – javac, mpegaudio, and jack – demand driven analysis on average takes nearly an order of magnitude less time than exhaustive analysis.
1 Introduction
A call graph is a static representation of dynamic invocation relationships between procedures (or functions or methods) in a program. A node in this directed graph represents a procedure, and an edge (p → q) exists if the procedure p can invoke the procedure q. In program analysis and compiler optimization for object-oriented programs, call graph construction is a critical step for at least two reasons. First, because the average size of a method is typically quite small, very limited information is available without performing interprocedural analysis. Second, because of the frequent use of virtual functions, the accuracy and efficiency of the call graph construction technique is crucial for the results of interprocedural analysis. Therefore, call graph construction, or dynamic call site resolution, has been a focus of attention lately in the object-oriented compilation community [3,4,8,9,11,13,14,15,19,20,21,24].

⋆ This research was supported by NSF CAREER award ACI-9733520 and NSF grant CCR-9808522.

R. N. Horspool (Ed.): CC 2002, LNCS 2304, pp. 29–45, 2002. © Springer-Verlag Berlin Heidelberg 2002

We believe that with the increasing popularity of just-in-time or dynamic compilation, and with the increasing use of program analysis in software development environments, there is a need for demand driven call graph analysis techniques. In a dynamic or just-in-time compilation environment, aggressive compiler analysis and optimizations are applied to selected portions of the code, and not to less frequently executed or never executed portions. Therefore, the set of procedures called needs to be computed for a small set of call sites, not for all the call sites in the entire program. Similarly, when program analysis is applied in a software development environment, demand driven call graph analysis may be preferable to exhaustive analysis. For example, while constructing static program slices [23], information on the set of procedures called is required only for the call sites included in the slice, and depends upon the slicing criterion used. Similarly, during program analysis for regression testing [16], only a part of the code needs to be analyzed, and therefore demand driven call graph analysis can be significantly quicker than an exhaustive approach. We have developed a technique for performing demand driven call graph analysis [1,2]. The technique has two major theoretical properties.
The worst-case complexity of our analysis is the same as that of the well known 0-CFA exhaustive analysis technique [18], except that its input is the cardinality of the set of influencing nodes rather than the total number of nodes in the program representation. Thus, the advantage of our demand driven technique depends upon the ratio of the size of the set of influencing nodes to the total number of nodes. Second, we have shown that the type information computed by our technique for all the nodes in the set of influencing nodes is as accurate as that of the 0-CFA exhaustive analysis technique. This paper presents an implementation and detailed experimental evaluation of our demand driven call graph construction technique. The implementation has been carried out using the sable infrastructure developed at McGill University [22]. Initial work on call graph construction exclusively focused on exhaustive analysis, i.e., analysis of a complete program. Many recent efforts have focused on analysis when the entire program may not be available, or cannot be analyzed because of memory constraints [6,17,19]. These efforts focus on obtaining the most precision with the amount of available information. In comparison, our goal is to reduce the cost of analysis when demand-driven analysis can be performed, without compromising the accuracy of the analysis. We are not aware of any previous work on performing and evaluating demand-driven call graph analysis for the purpose of efficiency, even when the full program is available. Our work is also related to previous work on demand driven data flow analysis [10,12]. Their work assumes that a call graph is already available and does not, therefore, apply to the demand driven call graph construction problem.

Fig. 1. Procedure A::P's portion of the PSG.

The rest of the paper is organized as follows. The demand driven call graph construction technique is reviewed in Section 2. Our experimental design is presented in Section 3 and experimental results are presented in Section 4. We conclude in Section 5.
2 Demand Driven Call Graph Construction
In this section, we review our demand driven call graph construction technique; more details are available from our previous papers [1,2]. We use the interprocedural representation Program Summary Graph (PSG), initially proposed by Callahan [5], for presenting our demand driven call graph analysis technique. Procedure A::P's portion of the PSG is shown in Figure 1. We also construct a relatively inaccurate initial call graph by performing relatively inexpensive Class Hierarchy Analysis (CHA) [7]. In presenting our technique, we use the following definitions.

pred(v): The set of predecessors of the node v in the PSG. This set is defined during the construction of the PSG and is not modified as the type information becomes more precise.

proc(v): Defined only if the node v is an entry node or an exit node; it denotes the name of the procedure to which the node belongs.

TYPES(v): The set of types associated with a node v in the PSG at any stage in the analysis. This set is initially constructed using Class Hierarchy Analysis, and is later refined through data-flow propagation.
THIS NODE(v): The node corresponding to the this pointer at the procedure entry (if v is an entry node), procedure exit (if v is an exit node), procedure call (if v is a call node), or call return (if v is a return node).

THIS TYPE(v): If the vertex v is a call node or a return node, THIS TYPE(v) returns the types currently associated with the call node for the this pointer at this call site. This relation is not defined if v is an entry or exit node.

PROCS(S): Let S be the set of types associated with a call node for a this pointer. Then PROCS(S) is the set of procedures that can actually be invoked at this call site. This function is computed using Class Hierarchy Analysis (CHA).

We now describe how we compute the set of nodes in the PSG for the entire program that influence the set of procedures invoked at a given call site ci. The PSG for the entire program is never constructed; however, for ease in presenting the definition of the set of influencing nodes, we assume that the PSG components of all procedures in the entire program are connected based upon the initial sound call graph. Let v be the call node for the this pointer at the call site ci. Given the hypothetical complete PSG, the set of influencing nodes (which we denote by S) is the minimal set of nodes such that:

1) v ∈ S,
2) (x ∈ S) ∧ (y ∈ pred(x)) → y ∈ S, and
3) x ∈ S → THIS NODE(x) ∈ S.

Starting from the node v, we include the predecessors of any node already in the set, until we reach internal nodes that do not have any predecessors. For any node included in the set, we also include the corresponding node for the this pointer (denoted by THIS NODE) in the set. The next step in the algorithm is to perform iterative analysis over the set of nodes in the Partial Program Summary Graph (PPSG) to compute the set of types associated with a given initial node. This problem can be modeled as computing the data-flow set TYPES associated with each node in the PPSG and refining it iteratively.
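The three closure rules can be computed with a standard worklist algorithm. A minimal sketch follows; the map representations of pred and THIS NODE, and the node names in the usage below, are ours rather than the paper's.

```python
# Worklist computation of the set of influencing nodes S from the
# three closure rules (pred and this_node are assumed lookup maps
# built from the PSG; node names are illustrative).

def influencing_nodes(v, pred, this_node):
    """v: the call node for the this pointer at the queried call site.
    pred: node -> set of predecessor nodes in the PSG.
    this_node: node -> corresponding this-pointer node."""
    S = set()
    worklist = [v]                      # rule 1: v is in S
    while worklist:
        x = worklist.pop()
        if x in S:
            continue
        S.add(x)
        for y in pred.get(x, ()):       # rule 2: predecessors of members
            worklist.append(y)
        # rule 3: the this-pointer node of every member is also in S
        worklist.append(this_node.get(x, x))
    return S
```

The size of the returned set, relative to the total number of PSG nodes, is exactly the ratio that governs the speedup of the demand driven technique over exhaustive analysis.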
The initial values of TYPES(v) are computed through the class hierarchy analysis described earlier in this section: if a formal or actual parameter is declared to be a reference to class cname, then the actual runtime type of that parameter can be any of the subclasses of cname (including cname itself). The refinement stage can be described by a single equation, which is shown in Figure 2. Consider a node v in the PPSG. Depending upon the type of v, three cases are possible in performing the update: 1) v is a call or exit node, 2) v is an entry node, and 3) v is a return node. In Case 1, the predecessors of the node v are internal nodes, entry nodes of the same procedure, or return nodes at one of the call sites within this procedure. The important observation is that this set of predecessors does not change as the type information is made more precise, so the set TYPES(v) is updated by taking the union of the sets TYPES(p) over the predecessors p of the node v. We next consider Case 2, i.e., when the node v is an entry node. proc(v) is the procedure to which the node v belongs. The predecessors of such a node are the call nodes at all call sites at which the function proc(v) can possibly be called, as per the initial call graph obtained by class hierarchy analysis.
TYPES(v) =
  TYPES(v) ∪ ( ∪ { TYPES(p) : p ∈ pred(v) } )    if v is a call or exit node
  TYPES(v) ∪ ( ∪ { TYPES(p) : (p ∈ pred(v)) ∧ (proc(v) ∈ PROCS(THIS TYPE(p))) } )    if v is an entry node
  TYPES(v) ∪ ( ∪ { TYPES(p) : (p ∈ pred(v)) ∧ (proc(p) ∈ PROCS(THIS TYPE(v))) } )    if v is a return node
Fig. 2. Data-flow equation for propagating type information.

The set of possible call sites for proc(v) becomes restricted as interprocedural type propagation is performed. Let p be a call node that is a predecessor of v. We want to use the set TYPES(p) in updating TYPES(v) only if the call site corresponding to p can invoke proc(v). We determine this by checking the condition proc(v) ∈ PROCS(THIS TYPE(p)): the function THIS TYPE(p) determines the types currently associated with the this pointer at the call site corresponding to p, and the function PROCS determines the set of procedures that can be called at this call site based upon this type information. Case 3 is very similar to Case 2. If the node v is a return node, the predecessor node p of v is an exit node. We want to use the set TYPES(p) in updating TYPES(v) only if the call site corresponding to v can invoke the function proc(p). We determine this by checking the condition proc(p) ∈ PROCS(THIS TYPE(v)); here THIS TYPE(v) determines the types currently associated with the this pointer at the call site corresponding to v.

Theoretical Results: The technique has two major theoretical properties [2]. The worst-case complexity of our analysis is the same as that of the well known 0-CFA exhaustive analysis technique [18], except that the input is the cardinality of the set of influencing nodes rather than the total number of nodes in the program representation. Thus, the advantage of our demand driven technique depends upon the ratio of the size of the set of influencing nodes to the total number of nodes. Second, we have shown that the type information computed by our technique for all the nodes in the set of influencing nodes is as accurate as that of the 0-CFA exhaustive analysis technique.
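The refinement equation can be implemented as a straightforward fixed-point iteration over the PPSG. The sketch below uses simplified node attributes (kind, proc, and this_node maps, and a procs_of callback standing in for PROCS applied to THIS TYPE); these representations are ours, not the paper's.

```python
# Fixed-point iteration implementing the three cases of the
# refinement equation (node attributes and the PROCS lookup are
# simplified stand-ins for the paper's PSG definitions).

def refine_types(nodes, pred, kind, proc, this_node, types, procs_of):
    """types: node -> set of class types, initialized by CHA.
    kind: node -> 'call' | 'exit' | 'entry' | 'return'.
    procs_of: set of types of a this pointer -> set of procedures
    invocable at that call site (computed via CHA)."""
    changed = True
    while changed:
        changed = False
        for v in nodes:
            for p in pred.get(v, ()):
                if kind[v] in ('call', 'exit'):
                    ok = True                 # case 1: always propagate
                elif kind[v] == 'entry':
                    # case 2: caller p must actually reach proc(v)
                    ok = proc[v] in procs_of(types[this_node[p]])
                else:
                    # case 3 (return node): callee proc(p) must be
                    # invocable at v's call site
                    ok = proc[p] in procs_of(types[this_node[v]])
                if ok and not types[p] <= types[v]:
                    types[v] |= types[p]
                    changed = True
    return types
```

Because the TYPES sets only grow and are bounded by the finite set of classes, the iteration is guaranteed to reach a fixed point.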
3
Experiment Design
We have implemented our demand driven technique using the Sable infrastructure developed at McGill University [22]. In this section, we describe the design of the experiments conducted, including the benchmarks used, the scenarios used for evaluating demand driven call graph construction, and the metrics used for comparison.

Benchmark Programs: We have primarily used programs from the most commonly used benchmark set for Java programs, SPECJVM. The 10 SPECJVM programs are check, compress, jess, raytrace, db, javac, mpegaudio, mtrt, jack, and checkit. The total number of classes, methods, and PSG nodes for each of these benchmarks is listed in Figure 3. The number of classes ranges from 4 to 180, the number of methods from 6 to 1004, and the number of PSG nodes from 51 to 48147.

Gagan Agrawal et al.

Benchmark    no. of classes   no. of methods   no. of PSG nodes
check              20               96               3954
compress           15               35                601
jess                8               41               1126
raytrace           28              130               6518
db                  6               34               1452
javac             180             1004              48147
mpegaudio          58              270               6205
mtrt                4                6                 51
jack               61              261              14080
checkit             6                8                495

Fig. 3. Description of benchmarks

Scenarios for Experiments: In Section 2, our technique was presented under the assumption that the call graph edges need to be computed for a single call site. In practice, demand driven analysis may be invoked under more complex scenarios. For example, one may be interested in knowing the reaching definitions for a set of variables in a method. Performing this analysis may require knowing the methods invoked at a set of call sites in the program; thus, demand driven call graph analysis may be performed to determine the call graph edges at the call sites within this set. Alternatively, there may be interest in fully analyzing a single method or class, and selectively analyzing code from other methods or classes to obtain more precise information within that method or class. We have conducted experiments to evaluate demand driven call graph construction under the following scenarios:

– Experiment A: Resolving a single call site in the program. We have only considered call sites that can potentially invoke multiple methods after Class Hierarchy Analysis (CHA) is applied. This is the simplest case for the demand driven technique, and should require analyzing only a small set of procedures and PSG nodes in the program.
– Experiment B: Computing reaching definitions of all actual parameters at all call sites within a method. Computing interprocedural reaching definitions will typically require knowing the calling relationships at a set of call sites. This scenario depicts a situation in which demand driven call graph construction is invoked while computing certain data-flow information on a demand basis.
– Experiment C: Resolving all call sites within a method. This is more complicated than experiment A above, and represents a more realistic case in which interprocedural optimizations are applied to a portion of the program.
Evaluating a Demand Driven Technique for Call Graph Construction
– Experiment D: Resolving all call sites within all methods of a class. This scenario represents analyzing a single class, while performing selective analysis on portions of code from other classes to improve the accuracy of analysis within the class.

Metrics Used: We now describe the metrics used for reporting the benefits of demand driven call graph construction over exhaustive call graph analysis. Performing demand driven analysis will require fewer PSG nodes to be analyzed, fewer procedures to be analyzed, and should require less time. We report these three factors individually. Specifically, the three metrics used are:

– Time Ratio: The ratio of the time required for demand driven analysis to the time for exhaustive analysis. This metric evaluates the benefits of using demand driven analysis, but is dependent on our implementation.
– Node Ratio: The ratio of the number of nodes in the PPSG to the total number of nodes in the PSG of the entire program. This metric is an implementation-independent indicator of the benefits of the analysis.
– Procedure Ratio: The ratio of the number of methods analyzed during demand driven analysis to the total number of methods in the entire program. Since each method's portion of the full program representation used in our analysis is constructed only if that method needs to be analyzed, and is always constructed in its entirety if the method needs to be analyzed, this metric demonstrates the space efficiency of demand driven call graph construction.
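For concreteness, the three metrics above are plain ratios and could be computed as below. This is our own sketch with made-up illustrative numbers; none of the names come from the paper's implementation:

```python
# Sketch: the three comparison metrics as plain ratios.
def metrics(demand_time, exhaustive_time,
            ppsg_nodes, psg_nodes,
            analyzed_methods, total_methods):
    return {
        "time_ratio": demand_time / exhaustive_time,     # implementation dependent
        "node_ratio": ppsg_nodes / psg_nodes,            # implementation independent
        "proc_ratio": analyzed_methods / total_methods,  # indicates space savings
    }

# Hypothetical example values:
m = metrics(3.78, 60.9, 96.6, 6518, 13.7, 130)
print({k: f"{v:.1%}" for k, v in m.items()})
# → {'time_ratio': '6.2%', 'node_ratio': '1.5%', 'proc_ratio': '10.5%'}
```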
4
Experimental Results
We now present the results from our experiments, which were conducted on a Sun 250 MHz Ultra-Sparc processor with 512 MB of main memory. We first present results from exhaustive analysis, and then results from demand driven analysis for scenarios A, B, C, and D.

Exhaustive Analysis: To provide a basis for comparison against demand driven analysis, we first include the results from exhaustive 0-CFA call graph construction on our set of benchmarks, presented in Figure 4. Shown are the time required for Class Hierarchy Analysis (CHA), the time required for the iterative call graph refinement, and the number of call sites that are not monomorphic after applying CHA. Call sites that can potentially invoke multiple methods after CHA has been applied are the ones that can benefit from more aggressive iterative analysis. The time required for the CHA phase in our implementation is dominated by the setting up of data structures, and turns out to be almost the same for all benchmarks. The time required for the iterative refinement phase varies widely between benchmarks, and is roughly proportional to the size of the benchmark.

Benchmark    CHA time (sec.)   Iter. Analysis (sec.)   Polymorphic Call Sites after CHA
check             72.3               27.7                        0
compress          84.5               13.3                        0
jess              96.5               59.4                        0
raytrace          82.1               60.9                       39
db                72.8               12.0                        0
javac             85.6             2613                        577
mpegaudio         73.4              462                         35
mtrt              80.2                3.5                        0
jack              74.1              250.7                       77
checkit           73.6                5.3                        0

Fig. 4. Results from exhaustive analysis

Two important observations from Figure 4 are as follows. First, only 4 of the 10 programs have call sites that are polymorphic after the results of CHA are known. These 4 programs are raytrace, javac, mpegaudio, and jack. They are also the 4 largest programs in this benchmark set, comprising 28 to 180 classes and 130 to 1004 methods. For the smaller programs, CHA is as accurate as any analysis for constructing the call graph. The second observation is that for 7 of the 10 programs, the total time required for exhaustive call graph construction is dominated by the CHA phase. For the three remaining programs, javac, mpegaudio, and jack, the time required for iterative analysis is 30 times, 6 times, and nearly 4 times the time required for CHA, respectively.

Therefore, for the smaller programs in the benchmark set, CHA alone is sufficient, and they do not benefit from more aggressive analysis. For them, the dominant cost of analysis is CHA, which remains the same during demand driven call graph construction; these programs thus cannot benefit from demand driven analysis. For the larger programs, on the other hand, the time required for analysis is dominated by the iterative phase, and a large number of call sites are polymorphic after applying CHA and are therefore likely to benefit from iterative analysis. Since the iterative analysis is applied to a much smaller number of nodes in the demand driven technique, these programs are likely to benefit from the proposed demand driven analysis. This is analyzed in detail in the remainder of this section.

Experiment A: In the first set of experiments, we perform demand driven analysis to resolve a single call site in the program. We only consider call sites that are known to potentially invoke multiple procedures after CHA has been applied. As described above, only raytrace, javac, mpegaudio, and jack contain such polymorphic call sites; therefore, results are presented only for these programs. The averages of time ratio, node ratio, and procedure ratio for these 4 programs are shown in Figure 5.
Benchmark    Cases   Avg. Time (sec.)   Time Ratio   Avg. PPSG Nodes   Node Ratio   Avg. Procedures   Proc. Ratio
raytrace       39         3.78             6.2%            96.6            1.5%            13.7           10.5%
javac         577       341.2             13.1%          9831             20.4%           747             74.5%
mpegaudio      35        15.6              3.3%           186.3            3.0%            31.9           11.8%
jack           77        11.8              4.7%           422.3            2.9%            46.1           17.6%

Fig. 5. Results from experiment A

The analysis time compared in this table is the time for iterative analysis only; for both the demand driven and exhaustive versions, additional time is spent performing CHA. The average ratio of the number of nodes that need to be analyzed during demand driven analysis is extremely low for raytrace, mpegaudio, and jack, ranging between 1.5% and 3.0%. This results in an average iterative analysis time ratio of less than 7%. Even the number of procedures that need to be analyzed is less than 20% for these three programs. The results for javac are significantly different, but still demonstrate gains from the use of demand driven analysis. The average node ratio is 20.4%, resulting in an average time ratio of 13.1%. However, the average procedure ratio is nearly 75%: for most cases, a very large fraction of procedures needs to be involved in demand driven analysis, so the use of demand driven analysis does not result in significant space savings for javac.

After including the time for CHA, the average time ratios are 60%, 16%, 17%, and 26% for raytrace, javac, mpegaudio, and jack, respectively. The gains from demand driven analysis for raytrace are limited, because the time required for CHA exceeds the exhaustive iterative analysis time. javac, which had the highest ratio before CHA time was included, has the lowest ratio after including CHA, because the time required for exhaustive iterative analysis is more than 30 times the time required for CHA. Demand driven analysis gives clear benefits for javac, mpegaudio, and jack, because the time required for the iterative phase dominates the time required for CHA.

To further study the results from these three benchmarks, we present a series of cumulative frequency graphs. For experiment A, the cumulative frequency graphs for javac, mpegaudio, and jack are presented in Figures 6, 7, and 9, respectively. A point (x, y) in such a graph means that the fraction x of the cases in the experiment had a ratio of less than or equal to y.
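The points of such a cumulative frequency plot can be derived from the per-case ratios as follows. This is our own sketch of the construction, not code from the paper:

```python
# Sketch: cumulative-frequency points (x, y) from a list of per-case
# ratios. A point (x, y) means fraction x of the cases had ratio <= y.

def cumulative_frequency(case_ratios):
    ys = sorted(case_ratios)
    n = len(ys)
    return [((i + 1) / n, y) for i, y in enumerate(ys)]

pts = cumulative_frequency([0.02, 0.15, 0.01, 0.15])
print(pts)  # [(0.25, 0.01), (0.5, 0.02), (0.75, 0.15), (1.0, 0.15)]
```

Plotting the y values on a log scale, as in Figures 6-19, makes very small ratios visible.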
The results from javac follow an interesting trend. 56 of the 577 cases require analysis of 120 or fewer procedures, or nearly 12% of all procedures. The same set of cases requires analyzing 257 or fewer nodes, or less than 1% of all nodes. The time taken for these cases is also less than 2% of the time for exhaustive analysis. However, the ratios are very different for the remaining cases. The next 413 cases require analysis of the same set of 837 procedures, or 83% of all procedures, and the remaining cases require between 838 and 876 procedures to be analyzed. The analysis time for these cases is between 15% and 20% of the exhaustive analysis time, and the number of nodes involved is nearly 25% of the total number of nodes.

The results from mpegaudio are as follows. 11 of the 35 cases require analysis of between 73 and 98 procedures, or between 27% and 36% of all procedures. The same 11 cases require analysis of between 8% and 10% of the nodes, and between 2% and 4% of the time. The other 24 cases require analysis of less than 12% of all procedures, and less than 1.5% of the nodes and time.

For jack, 61 of the 77 cases require analysis of 59 or 57 procedures, or nearly 20% of all procedures. The same set of cases requires between 4% and 6% of the time, and between 2% and 4% of all nodes. The other 16 cases involve analyzing less than 5% of all procedures, less than 1% of the time, and less than 0.5% of all nodes.
Fig. 6. Experiment A: Cumulative frequency of time, node, and procedure ratio for javac

Fig. 7. Experiment A: Cumulative frequency of time, node, and procedure ratio for mpegaudio
Experiment B: In the second set of experiments, we evaluated the performance of demand driven call graph construction when it is initiated from demand driven data flow analysis. The particular data flow problem we consider is the computation of reaching definitions for all actual parameters in a procedure. We report results from this experiment only for raytrace, mpegaudio, and jack. The 6 smaller programs in the SPECJVM benchmark set do not contain any polymorphic call sites, and even after many attempts, we could not complete this experiment for javac, the largest program in the benchmark set. We believe this was because of very large memory requirements when the reaching definition and call graph construction analyses are combined.

The average time, node, and procedure ratios for the three benchmarks are presented in Figure 8. As compared to experiment A, we are reporting results from a significantly larger number of cases, because this analysis was performed on all procedures. At the same time, many cases in experiment B may require the resolution of several polymorphic call sites. The three ratios for mpegaudio are lower for experiment B than the ones obtained from experiment A. For raytrace and jack, the reverse is true: the three ratios are higher for experiment B. The ratios for iterative analysis time are 7.2%, 1.2%, and 6.2% for raytrace, mpegaudio, and jack, respectively. After including the time for CHA, the ratios of the time required are 60%, 14%, and 27%, respectively.

Benchmark    Cases   Avg. Time (sec.)   Time Ratio   Avg. PPSG Nodes   Node Ratio   Avg. Procedures   Proc. Ratio
raytrace      129         4.36             7.2%           354.3            5.4%            28.7           22.0%
mpegaudio     270         5.48             1.2%           133.5            2.2%            26.8            9.9%
jack          261        15.44             6.2%           524.9            3.7%            94.8           36.6%

Fig. 8. Results from experiment B

Fig. 9. Experiment A: Cumulative frequency of time, node, and procedure ratio for jack

Fig. 10. Experiment B: Cumulative frequency of time, node, and procedure ratio for mpegaudio

We studied the results in more detail for mpegaudio and jack. The cumulative frequency plots for these two benchmarks are presented in Figures 10 and 11, respectively. The results from mpegaudio are as follows. 192 of the 270 cases require analysis of 33 or fewer procedures, or less than 12% of all procedures. The same set of cases requires analysis of less than 2% of all nodes, and takes less than 1% of the time for exhaustive analysis. For the remaining cases, the number of procedures to be analyzed is distributed fairly uniformly between 66 and 118. For jack, the trends are very different. 126 of the 261 cases require analysis of 162 or 161 procedures, or nearly 62% of all procedures. The same set of cases requires analysis of nearly 800 nodes, or 6% of all nodes, and nearly 9% of the time for exhaustive iterative analysis. The portions of the program that need to be analyzed for this set of cases (48% of all cases) are almost the same. This has the following implication: if demand driven analysis is performed for one of these cases, and then needs to be performed for another case in the same set, very limited additional effort will be required.
Fig. 11. Experiment B: Cumulative frequency of time, node, and procedure ratio for jack

Fig. 12. Experiment C: Cumulative frequency of time, node, and procedure ratio for javac

Benchmark    Cases   Avg. Time (sec.)   Time Ratio   Avg. PPSG Nodes   Node Ratio   Avg. Procedures   Proc. Ratio
raytrace      130         4.51             7.4%           358.9            5.5%            29.1           22.4%
javac        1004       271.1             10.3%          7634.5           15.8%           587             58.5%
mpegaudio     270         5.37             1.2%           133.5            2.1%            26.8            9.9%
jack          261        14.9              5.9%           524.9            3.7%            94.8           36.3%

Fig. 13. Results from experiment C
Experiment C: Our next set of experiments evaluated the performance of demand driven call graph construction when all call sites in a procedure had to be resolved. We present data only for raytrace, javac, mpegaudio, and jack, because they contain polymorphic call sites. For these programs, we include results from the analysis of all methods, even those that do not contain any polymorphic call site. The averages of time, node, and procedure ratios are presented in Figure 13. The averages are very close to the results for experiment B. We believe this is because all call sites in a method had to be resolved for experiment C, while all call sites that can potentially invoke a method had to be resolved for experiment B. The three ratios for javac are lower for experiment C than for experiment A. This is because the averages are taken over a much larger number of cases in experiment C: many of the procedures do not require analysis of any polymorphic call site, and contribute to a lower overall average.

The cumulative frequency plots for javac, mpegaudio, and jack are presented in Figures 12, 14, and 15, respectively. Results from javac for experiment C are similar to the results from experiment A, with one important difference: a larger fraction of cases can be analyzed with a small fraction of procedures and nodes. 316 of the 1004 cases require between 1 and 125 procedures, or up to 12% of all procedures. The remaining 688 cases require between 837 and 907 procedures, nearly 25% of all nodes, and nearly 15% of the exhaustive analysis time. Results from mpegaudio for experiment C are very similar to the results from experiment B: 192 of 270 cases (the same number as in experiment B) require analysis of at most 33 procedures, while the remaining cases need analysis of between 66 and 118 procedures. The same trend (closeness between the results from experiments B and C) continues for jack.
Fig. 14. Experiment C: Cumulative frequency of time, node, and procedure ratio for mpegaudio

Fig. 15. Experiment C: Cumulative frequency of time, node, and procedure ratio for jack
Experiment D: Our final set of experiments evaluates demand driven analysis when all call sites in all procedures of a class are to be resolved. Figure 16 presents the average time ratio, node ratio, and procedure ratio for raytrace, javac, mpegaudio, and jack. Even though each invocation of demand driven analysis may involve resolving several call sites, the ratios are quite small. For raytrace, mpegaudio, and jack, the averages of time ratios and node ratios are still less than 10%. The averages for javac are a bit higher, consistent with the previous experiments: the average time ratio and node ratio are 13.1% and 20.6%, respectively. Space savings are not significant for javac, but quite impressive for the other three benchmarks. After including the time required for CHA, the average time ratio is 61% for raytrace, 16% for javac, 16% for mpegaudio, and 25% for jack.

In comparison with the results from experiment C, the averages of ratios from experiment D are all higher for raytrace, javac, and mpegaudio, as one would normally expect. The surprising results are from jack, where all three ratios are lower in experiment D. The explanation is as follows. The results from experiment D are averaged over a smaller number of cases: 61 instead of 261 for jack. It turns out that the procedures that require the most time, nodes, and procedures to be analyzed belong to a small set of classes. Therefore, they contribute much more significantly to the average ratios in the results from experiment C than in the results from experiment D.

Benchmark    Cases   Avg. Time (sec.)   Time Ratio   Avg. PPSG Nodes   Node Ratio   Avg. Procedures   Proc. Ratio
raytrace       28         5.32             8.7%           598.3            9.2%            41.5           31.9%
javac         180       343.6             13.1%          9940             20.6%           741.3           73.8%
mpegaudio      58        14.1              3.1%           280.5            4.5%            47.6           17.6%
jack           61         7.49             4.7%           291.3            2.1%            27.6           10.5%

Fig. 16. Results from experiment D

Details of the results from javac, mpegaudio, and jack are presented in Figures 17, 18, and 19, respectively. Again, the results from javac are very different from those of the other two benchmarks. In javac, 20 of the 180 classes can be resolved by analyzing a small fraction of procedures: these cases require analysis of between 1 and 63 procedures, i.e., less than 7% of all procedures in the program. However, the other 160 cases require analysis of between 837 and 963 procedures. Each of the cases in this set requires analyzing nearly 25% of all nodes in the program, and takes between 15% and 20% of the time for exhaustive analysis. However, the sets of influencing nodes that need to be analyzed for these cases are almost identical. Our theoretical result therefore implies that after one of these cases has been analyzed, the time required for the other cases will be very small. For mpegaudio, the number of procedures that need to be analyzed for the 58 cases ranges from 1 to 139, or from less than 1% to nearly 50%. The distribution is fairly uniform. The time required for demand driven analysis for these cases also has a fairly uniform distribution, between 0.1 and 22.5 seconds, or between 0.02% and 5% of the time required for exhaustive analysis. Similarly, the number of nodes ranges from 2 to 880, or from 0.03% to 13%. The results from jack are similar.

Fig. 17. Experiment D: Cumulative frequency of time, node, and procedure ratio for javac

Fig. 18. Experiment D: Cumulative frequency of time, node, and procedure ratio for mpegaudio

Fig. 19. Experiment D: Cumulative frequency of time, node, and procedure ratio for jack
5
Conclusions
We have presented an evaluation of an algorithm for resolving call sites in an object-oriented program in a demand driven fashion. The summary of our results using the SPECJVM benchmarks is as follows:

– The time required for Class Hierarchy Analysis (CHA), which is a prerequisite for both exhaustive and demand driven iterative analysis, dominates the exhaustive call graph construction time for 7 of the 10 SPECJVM programs. However, CHA itself is sufficient for constructing an accurate call graph for 6 of these 7 programs. The time required for exhaustive iterative analysis clearly dominates the CHA time for the three largest SPECJVM programs, javac, mpegaudio, and jack.
– For resolving a single call site, demand driven iterative analysis averages nearly 10% of the time required for exhaustive iterative analysis. The number of nodes that need to be analyzed averages nearly 3% for mpegaudio and jack, but around 20% for javac. The number of procedures that need to be analyzed is less than 20% for mpegaudio and jack, but nearly 75% for javac.
– The averages of the number of nodes and procedures analyzed and the time taken stay surprisingly low when all call sites within a class or a method are analyzed instead of a single call site. This is because the program portions that need to be analyzed for resolving different call sites within a method or a class are highly correlated.
References

1. Gagan Agrawal. Simultaneous demand-driven data-flow and call graph analysis. In Proceedings of the International Conference on Software Maintenance (ICSM), September 1999.
2. Gagan Agrawal. Demand-driven call graph construction. In Proceedings of the Compiler Construction (CC) Conference, March 2000.
3. David Bacon and Peter F. Sweeney. Fast static analysis of C++ virtual function calls. In Eleventh Annual Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA '96), pages 324-341, October 1996.
4. Brad Calder and Dirk Grunwald. Reducing indirect function call overhead in C++ programs. In Conference Record of POPL '94: 21st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 397-408, Portland, Oregon, January 1994.
5. D. Callahan. The program summary graph and flow-sensitive interprocedural data flow analysis. In Proceedings of the SIGPLAN '88 Conference on Programming Language Design and Implementation, Atlanta, GA, June 1988.
6. R. Chatterjee, B. G. Ryder, and W. A. Landi. Relevant context inference. In Proceedings of the Conference on Principles of Programming Languages (POPL), pages 133-146, January 1999.
7. Jeffrey Dean, Craig Chambers, and David Grove. Selective specialization for object-oriented languages. In Proceedings of the ACM SIGPLAN '95 Conference on Programming Language Design and Implementation (PLDI), pages 93-102, La Jolla, California, June 1995. SIGPLAN Notices 30(6), June 1995.
8. Greg DeFouw, David Grove, and Craig Chambers. Fast interprocedural class analysis. In Proceedings of the POPL '98 Conference, 1998.
9. A. Diwan, K. S. McKinley, and J. E. B. Moss. Using types to analyze and optimize object-oriented programs. ACM Transactions on Programming Languages and Systems, 23(1):30-72, January 2001.
10. E. Duesterwald, R. Gupta, and M. L. Soffa. A practical framework for demand-driven interprocedural data flow analysis. ACM Transactions on Programming Languages and Systems, 19(6):992-1030, November 1997.
11. David Grove, Greg DeFouw, Jeffrey Dean, and Craig Chambers. Call graph construction in object-oriented languages. In Proceedings of the Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA), 1997.
12. S. Horwitz, T. Reps, and M. Sagiv. Demand interprocedural dataflow analysis. In SIGSOFT '95: Proceedings of the Third ACM SIGSOFT Symposium on the Foundations of Software Engineering, pages 104-115, 1995.
13. Jens Palsberg and Patrick O'Keefe. A type system equivalent to flow analysis. In Conference Record of POPL '95: 22nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 367-378, San Francisco, California, January 1995.
14. Hemant Pande and Barbara Ryder. Data-flow-based virtual function resolution. In Proceedings of the Third International Static Analysis Symposium, 1996.
15. M. Porat, M. Biberstein, L. Koved, and M. Mendelson. Automatic detection of immutable fields in Java. In Proceedings of CASCON, 2000.
16. Gregg Rothermel and M. J. Harrold. Analyzing regression test selection. IEEE Transactions on Software Engineering, 1996.
17. Atanas Rountev, Barbara G. Ryder, and William Landi. Data-flow analysis of program fragments. In Proceedings of the Conference on Foundations of Software Engineering (FSE), pages 235-253, September 1999.
18. O. Shivers. The semantics of Scheme control-flow analysis. In Proceedings of the Symposium on Partial Evaluation and Semantics-Based Program Manipulation, volume 26, pages 190-198, New Haven, CT, June 1991.
19. V. C. Sreedhar, M. Burke, and J. D. Choi. A framework for interprocedural optimization in the presence of dynamic class loading. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2000.
20. Vijay Sundaresan, Laurie Hendren, Chrislain Razafimahefa, Raja Vallee-Rai, Patrick Lam, Etienne Gagnon, and Charles Godin. Practical virtual method call resolution for Java. In Fifteenth Annual Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA 2000), pages 264-280. ACM Press, October 2000.
21. Frank Tip and Jens Palsberg. Scalable propagation-based call graph construction algorithms. In Fifteenth Annual Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA 2000), pages 281-293. ACM Press, October 2000.
22. Raja Vallee-Rai. Soot: A Java Bytecode Optimization Framework. Master's thesis, McGill University, 1999.
23. Mark Weiser. Program slicing. IEEE Transactions on Software Engineering, 10:352-357, 1984.
24. A. Zaks, V. Feldman, and N. Aizikowitz. Sealed calls in Java packages. In Proceedings of the Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA), pages 83-92. ACM Press, October 2000.
A Graph-Free Approach to Data-Flow Analysis

Markus Mohnen

Lehrstuhl für Informatik II, RWTH Aachen, Germany
[email protected]
Abstract. For decades, data-flow analysis (DFA) has been done using an iterative algorithm based on graph representations of programs. For a given data-flow problem, this algorithm computes the maximum fixed point (MFP) solution. The edge structure of the graph represents the possible control flows in the program. In this paper, we present a new, graph-free algorithm for computing the MFP solution. An experimental implementation of the algorithm was applied to a large set of samples. The experiments clearly show that the memory usage of our algorithm is much better: our algorithm always reduces the amount of memory, in the best cases to less than a tenth of that required by the classical algorithm; in the average case, the reduction is about a third of the memory usage of the classical algorithm. In addition, the experiments showed that the runtimes are almost the same: the average speedup of the classical algorithm is only marginally greater than one.
1
Introduction
Optimising compilers perform various static program analyses to obtain the information needed to apply optimisations. In the context of imperative languages, the technique commonly used is data-flow analysis (DFA). It provides information about the properties of the states that may occur at a given program point during execution. Here, the programs considered are intermediate code, e.g. three-address code, register code, or Java Virtual Machine (JVM) code [LY97].

For decades, the de facto classical algorithm for DFA has been an iterative algorithm [MJ81, ASU86, Muc97] which uses a graph as its essential data structure. The graph is extracted from the program, making the possible control flows in the program explicit as the edge structure of the graph. Typically, the nodes of the graph are basic blocks (BB), i.e. maximal sequences of straight-line code (but see also [KKS98] for comments on the adequacy of this choice). A distinct root node of the graph corresponds to the entry point of the program. For a given graph and a given initial annotation of the root node, the algorithm computes an annotation for each of the nodes. Each annotation captures information about the state of the execution at the corresponding program point. The exact relation between annotations and states depends on the data-flow problem. However, independently of this relation, the annotations computed by the algorithm are guaranteed to be the greatest solution of the consistency equations imposed by the data-flow problem. This result is known as the maximal fixed point (MFP) solution.

R. N. Horspool (Ed.): CC 2002, LNCS 2304, pp. 46-61, 2002. © Springer-Verlag Berlin Heidelberg 2002
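The classical iterative computation of the MFP solution can be sketched as a generic worklist fixed-point solver. This is our own illustration, not Mohnen's notation; the lattice meet, the transfer functions, and the node structure are assumed to be supplied by the concrete data-flow problem:

```python
from collections import deque

# Generic sketch of the classical iterative MFP solver on a flow graph.
# succ maps a node to its successors; transfer is the per-node transfer
# function; meet is the lattice meet (e.g. set intersection); top is the
# greatest lattice element; init annotates the root node.

def mfp(nodes, root, succ, transfer, meet, init, top):
    ann = {n: top for n in nodes}   # start from the greatest element
    ann[root] = init
    work = deque(nodes)
    while work:
        n = work.popleft()
        out = transfer(n, ann[n])
        for s in succ[n]:
            new = meet(ann[s], out)
            if new != ann[s]:       # annotation descended: revisit s
                ann[s] = new
                work.append(s)
    return ann

# Tiny example: which facts hold on every path to a node (meet = intersection).
succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
gen = {"A": {"x"}, "B": {"y"}, "C": set(), "D": set()}
ann = mfp(succ, "A", succ, lambda n, a: a | gen[n],
          lambda a, b: a & b, frozenset(), frozenset({"x", "y"}))
print(sorted(ann["D"]))  # ['x']
```

Because the meet is monotone and the lattice has finite height, the loop terminates with the greatest solution of the consistency equations.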
In the context of BB graphs, there is a need for an additional post–processing of the annotations. Since each BB represents a sequence of instructions, the annotation for a single BB must be propagated to the instruction level. As a result of this post–processing, each program instruction is annotated. The contribution of this paper is an alternative algorithm for computing the MFP solution. In contrast to the classical algorithm, our approach is graph–free: Besides a working set, it does not need any additional data structures (of course, the graph structure is always there implicitly in the program). The key idea is to give the program a more active role: While the classical approach transforms the program into a passive data object on which the solver operates, our point of view is that the program itself executes on the annotation. An obvious advantage of this approach is the reduced memory usage. In addition, it is handy if machinery for the execution of programs is already available. Consequently, our execution–based approach is advantageous in settings where optimisations are done immediately before execution of the code. Here it saves the effort of implementing the analyses and it saves valuable memory for the execution. The most prominent example of such a setting is the Java Virtual Machine (JVM) [LY97]. In fact, the JVM specification requires that each class file is verified at linking time by a data–flow analyser. The purpose of this verification is to ensure that the code is well–typed and that no operand stack overflows or underflows occur at runtime. In addition, certain optimisations cannot be done by the Java compiler producing JVM code. For instance, optimisations w.r.t. memory allocation like compile–time garbage collection (CTGC) can only be done in the JVM, since the JVM code does not provide facilities to influence the memory allocation.
CTGC was originally proposed in the context of functional languages [Deu97, Moh97] and then adopted for Java [Bla98, Bla99]. To validate the benefits of our approach, we studied the performance of the new algorithm in competition with the classical one, both in terms of memory usage and runtime. Therefore, we applied both to a large set of samples. The experiments clearly show that the memory usage of our algorithm is much better: Our algorithm always reduces the amount of memory, in the best case to less than a tenth of the memory used by the classical algorithm. In the average case, the memory usage is about a third of that of the classical algorithm. Moreover, the runtimes are comparable in the average case: Using the classical algorithm does not give a substantial speedup.

Structure of this article. We start by defining some basic notions. In Section 3 the classical, iterative algorithm for computing the MFP solution is discussed briefly. Our main contribution starts with Section 4, where we present the new execution algorithm, discuss its relation to the classical algorithm, and prove termination and correctness. Experimental results presented in Section 5 give an estimation of the benefits of our method. Finally, Section 6 concludes the paper.
2 Notations
In this section, we briefly introduce the notations that we use in the rest of the paper. Although we focus on abstract interpretation based DFA, our results are applicable to other DFAs as well. The programs we consider are three–address code programs, i.e. non–empty sequences of instructions I ∈ Instr. Each instruction I is either a jump, which can be conditional (if ψ goto n) or unconditional (goto n), or an assignment (x:=y◦z). In assignments, x must be a variable, and y and z can be variables or constants. Since we consider intraprocedural DFA only, we do not need instructions for procedure calls or exits. The major point of this setting is to distinguish between instructions which cause the control flow to branch and those which keep the control flow linear. Hence, the exact structure is not important. Any other intermediate code, like the JVM code, is suitable as well. To model program properties, we use lattices L = ⟨A, ⊓, ⊔⟩ where A is a set, and ⊓ and ⊔ are binary meet and join operations on A. Furthermore, ⊥ and ⊤ are the least and greatest elements of the lattice. Often, finite lattices are used, but in general it suffices to consider lattices which have only finite chains. The point of view of DFA based on abstract interpretation [CC77, AH87] is to replace the standard semantics of programs by an abstract semantics describing how the instructions operate on the abstract values A. Formally, we assume a monotone semantic functional ![.!] : Instr → (A → A) which assigns a function on A to each instruction. A data–flow problem is a quadruple (P, L, ![.!], a0) where P = I0 . . . In ∈ Instr+ is a program, L is a lattice, ![.!] is an abstract semantics, and a0 ∈ A is an initial value for the entry of P. To define the MFP solution of a data–flow problem, we first introduce the notion of predecessors. For a given program P = I0 . . . In ∈ Instr+, we define the function predP : {0, . . . , n} → P({0, . . . , n}) in the following way: j ∈ predP(i) iff either Ij ∈ {goto i, if ψ goto i}, or i = j + 1 and Ij ≠ goto t for all t. Intuitively, the predecessors of an instruction are all instructions which may be executed immediately before it. The MFP solution is a vector of values s0, . . . , sn ∈ A. Each entry si is the abstract value valid immediately before the instruction Ii. It is defined as the greatest solution of the equation system si = ⊓j∈predP(i) ![Ij!](sj). The well–known fixed point theorem by Tarski guarantees the existence of the MFP solution in this setting.

Example 1 (Constant Folding Propagation). We now introduce an example, which we use as a running example in the rest of the paper. Constant folding and propagation aims at finding as many constants as possible at compile time, and replacing the computations with the constant values. In the setting described above, we associate with each variable and each program point the information whether the variable is always constant at this point. For simplicity, we assume that the program only uses the arithmetic operations on integers. We define a set C := Z ⊎ {⊤, ⊥} and a relation c1 ≤ c2 iff (a) c1 = c2, (b) c1 = ⊥, or
(c) c2 = ⊤. Intuitively, the values can be interpreted in the following way: An integer means "constant value", ⊤ means "not constant due to missing information", and ⊥ means "not constant due to conflict". The relation ≤ induces meet and join operations. Hence, ⟨C, ⊓, ⊔⟩ is a (non–finite) lattice with only finite chains. Fig. 1 shows the corresponding Hasse diagram. The abstract lattice is defined in terms of this lattice. Formally, let X be the set of variables of a program P. By definition, X is finite. We define the set of abstract values as C̄ := X → C, the set of all functions mapping a variable to a value in C. Since X is finite and C has only finite chains, C̄ has only finite chains as well. We obtain meet and join operations ⊓C̄, ⊔C̄ in the canonical way by argument–wise use of the corresponding operations on C. Hence, our lattice for this abstract interpretation is ⟨C̄, ⊓C̄, ⊔C̄⟩. The abstract semantics ![.!]C : Instr → (C̄ → C̄) is defined in the following way: For jumps, we define ![goto l!]C and ![if ψ goto l!]C to be the identity, since jumps do not change any variable. For assignments, we define ![x:=y◦z!]C := c ↦ c′, where c′ = c[x/a], i.e. c′ is the same function as c except at argument x. The new value a is defined as

    c′(x) = a := ay ◦ az   if (y = ay ∈ Z or c(y) = ay ∈ Z) and (z = az ∈ Z or c(z) = az ∈ Z)
    c′(x) = a := ⊥         otherwise

Intuitively, the value of the variable on the left–hand side is constant iff all operands are either constants in the code or known to be constants during execution. For a data–flow problem, the initial value will be a0 = ⊥, the function mapping each variable to ⊥: At the entry, no variable can be constant. Fig. 2 shows an example for a program, the associated abstractions, the equation system, and the MFP solution. This example also demonstrates why it is necessary to use the infinite lattice C: The solution contains the constant '5' which is not found in the program. Our presentation of these notions differs slightly from the presentation found in text books.
Typically, data–flow problems are already formulated using an explicit graph structure. However, we want to point out that this is not a necessity. Furthermore, the graph–free formulation allows us to formulate and prove the correctness of our algorithm without reference to the classical one.
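The flat lattice ⟨C, ⊓, ⊔⟩ and the abstract operation on constants can be sketched in Java. The encoding of ⊤ and ⊥ as sentinel long values is an illustrative assumption, not part of the paper:

```java
// Sketch of the flat constant-propagation lattice C = Z ⊎ {⊤, ⊥}.
// The sentinel long values TOP and BOT are an assumption for illustration;
// they stand outside the range of "ordinary" constants considered here.
public final class FlatLattice {
    public static final long TOP = Long.MAX_VALUE;  // "not constant: missing information"
    public static final long BOT = Long.MIN_VALUE;  // "not constant: conflict"

    // c1 <= c2 iff c1 = c2, c1 = BOT, or c2 = TOP.
    public static boolean leq(long c1, long c2) {
        return c1 == c2 || c1 == BOT || c2 == TOP;
    }

    // Meet (greatest lower bound) induced by leq.
    public static long meet(long c1, long c2) {
        if (c1 == c2) return c1;
        if (c1 == TOP) return c2;
        if (c2 == TOP) return c1;
        return BOT;  // two distinct integers, or anything combined with BOT
    }

    // Abstract addition: the result is constant iff both operands are constants.
    public static long addAbs(long c1, long c2) {
        if (c1 == TOP || c2 == TOP || c1 == BOT || c2 == BOT) return BOT;
        return c1 + c2;
    }
}
```

Two distinct integers meet to ⊥ ("conflict"), matching the Hasse diagram of Fig. 1.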
Fig. 1. Hasse diagram of ⟨C, ⊓, ⊔⟩: ⊤ on top, the integers · · · −2, −1, 0, 1, 2, · · · in the middle, ⊥ at the bottom
Program                 Abstraction*                                    Equation                         Solution
I0 = x := 1             x/1                                             s0 = a0                          x/⊥ y/⊥ z/⊥ r/⊥
I1 = y := 2             y/2                                             s1 = ![I0!]C(s0)                 x/1 y/⊥ z/⊥ r/⊥
I2 = z := 3             z/3                                             s2 = ![I1!]C(s1)                 x/1 y/2 z/⊥ r/⊥
I3 = goto 8             (identity)                                      s3 = ![I2!]C(s2)                 x/1 y/2 z/3 r/⊥
I4 = r := y + z         r/(c(y)+c(z) if c(y), c(z) ∈ Z; ⊥ otherwise)    s4 = ![I8!]C(s8)                 x/⊥ y/2 z/3 r/⊥
I5 = if x ≤ z goto 7    (identity)                                      s5 = ![I4!]C(s4)                 x/⊥ y/2 z/3 r/5
I6 = r := z + y         r/(c(z)+c(y) if c(y), c(z) ∈ Z; ⊥ otherwise)    s6 = ![I5!]C(s5)                 x/⊥ y/2 z/3 r/5
I7 = x := x + 1         x/(c(x)+1 if c(x) ∈ Z; ⊥ otherwise)             s7 = ![I5!]C(s5) ⊓ ![I6!]C(s6)   x/⊥ y/2 z/3 r/5
I8 = if x < 10 goto 4   (identity)                                      s8 = ![I3!]C(s3) ⊓ ![I7!]C(s7)   x/⊥ y/2 z/3 r/5

* For each abstraction only the modification is given, with x/y as abbreviation for c ↦ c[x/y].
Fig. 2. Example for data–flow problem

The approach described so far can be generalised in two dimensions: Firstly, changing ⊓ to ⊔ results in existential data–flow problems, in contrast to universal data–flow problems: The intuition is that a property holds at a point if there is a single path starting at the point such that the property holds on this path. For existential data–flow problems, the least fixed point is computed instead of the greatest fixed point. Secondly, we can change predecessors predP to successors succP : {0, . . . , n} → P({0, . . . , n}) defined as i ∈ succP(j) ⇐⇒ j ∈ predP(i). The resulting class of data–flow problems is called backward problems (in contrast to forward problems), since the flow of information is opposite to the normal execution flow. Here, the abstract values are valid immediately after the corresponding instruction. Altogether, the resulting taxonomy has four cases. However, the algorithms for all the cases have the same general structure. Therefore, we will consider only the forward and universal setting.
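The function predP (and, by inversion, succP) can be computed directly from the instruction sequence, without building a graph. The following Java sketch assumes a simple encoding of instructions as (opcode, jump target) pairs; this encoding is an illustration, not the paper's representation:

```java
import java.util.*;

// Sketch: computing predP for a program given as an instruction list.
// Instructions are encoded as (opcode, target) pairs; the target is ignored
// for assignments. This encoding is an illustrative assumption.
public class Predecessors {
    enum Op { GOTO, IF_GOTO, ASSIGN }
    record Instr(Op op, int target) {}

    // j in predP(i) iff Ij jumps to i, or i = j+1 and Ij is not an
    // unconditional goto (conditional jumps fall through as well).
    static Map<Integer, Set<Integer>> predP(List<Instr> prog) {
        Map<Integer, Set<Integer>> pred = new HashMap<>();
        for (int i = 0; i < prog.size(); i++) pred.put(i, new TreeSet<>());
        for (int j = 0; j < prog.size(); j++) {
            Instr ij = prog.get(j);
            if (ij.op() == Op.GOTO || ij.op() == Op.IF_GOTO)
                pred.get(ij.target()).add(j);      // explicit jump edge
            if (ij.op() != Op.GOTO && j + 1 < prog.size())
                pred.get(j + 1).add(j);            // fall-through edge
        }
        return pred;
    }
}
```

On the running example program of Fig. 2, this yields predP(8) = {3, 7}, predP(4) = {8}, and predP(7) = {5, 6}.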
3 Classical Iterative Basic-Block Based Algorithm
This section reviews the classical, graph–based approach to DFA. To make the data–flow of a program explicit, we define two types of flow graphs: single instruction (SI) graphs and basic block (BB) graphs. For a program P = I0 . . . In, we define the SI graph SIG(P) := ({I0, . . . , In}, {(Ij, Ii) | j ∈ predP(i)}, I0) with a node for each instruction, an edge from node Ij to node Ii iff j is a predecessor of i, and root node I0. Intuitively, the BB graph results from the SI graph by merging maximal sequences of straight–line code. Formally, we define the set of basic blocks as the unique partition of P: BB(P) = {B0, . . . , Bm} iff (a) Bj = Ij1 . . . Ijn with jk+1 = jk + 1, (b) predP(j1) ≠ {(j−1)n} or succP((j−1)n) ≠ {j1}, (c) |predP(jk)| = 1 for j1 < jk ≤ jn, and (d) Ijn+1 = I(j+1)1, I01 = I0, and Imn = In. The BB graph is defined as BBG(P) := (BB(P), {(Bj, Bi) | jn ∈ predP(i1)}, B0).
Fig. 3. Examples for SI graph and BB graph: (a) the SI graph for the example program, with one node per instruction I0, . . . , I8; (b) the BB graph with basic blocks B0 = I0 I1 I2 I3, B1 = I4 I5, B2 = I6, B3 = I7, B4 = I8
Example 2 (Constant Folding Propagation, Cont'd). In Fig. 3 we see the SI graph and the BB graph for the example program from the last section. Obviously, for a given flow graph G = (N, E, r), the usual notions of predecessors predG : N → P(N) and successors succG : N → P(N), defined as n′ ∈ predG(n), n ∈ succG(n′) :⇐⇒ (n′, n) ∈ E, coincide with the corresponding notions for programs. For a given data–flow problem (P, L, ![.!], a0), an additional pre–processing step must be performed to extend the abstract semantics to basic blocks: We define ![.!] : Instr+ → (A → A) as ![I0 . . . In!] := ![In!] ◦ · · · ◦ ![I0!]. The classical iterative algorithm for computing the MFP solution of a data–flow problem is shown in Fig. 4. In addition to the BB graph G, it uses a working set W and an array a, which associates an abstract value with each node. The working set keeps all nodes which must be visited again. In each iteration a node is selected from the working set. At this level, we assume no specific strategy for the working set and consider this choice to be non–deterministic. By visiting all predecessors of this node, a new approximation is computed. If this approximation differs from the last approximation, the new one is used. In addition, all successors of the node are put into the working set. After termination of the main loop, the post–processing is done, which propagates the solution from the basic block level to the instruction level. Example 3 (Constant Folding Propagation, Cont'd). For the example from the last section, Table 1 shows a trace of the execution of the algorithm. Each line shows the state of the working set W, the selected node B, and the array a[.] at
Input: Data–flow problem (P, L, ![.!], a0) where P = I0 . . . In, L = ⟨A, ⊓, ⊔⟩
Output: MFP solution s0, . . . , sn ∈ A

G = (BB(P), E, B0) := BBG(P)
a[B0] := a0
for each B ∈ BB(P) − B0 do a[B] := ⊤
W := BB(P)
while W ≠ ∅ do
    choose B ∈ W
    W := W − B
    new := a[B]
    for each B′ ∈ predG(B) do new := new ⊓ ![B′!](a[B′])
    if new ≠ a[B] then
        a[B] := new
        for each B′ ∈ succG(B) do W := W + B′
    end
end
for each B ∈ BB(P) do
    with B = Ik . . . Il do
        sk := a[B]
        for i := k + 1 to l do si := ![Ii−1!](si−1)
    end
end
Fig. 4. Classical iterative algorithm for computing MFP solution
the end of the main loop. To keep the example brief, we omitted all cells which did not change w.r.t. the previous line and we have chosen the best selection of nodes. The resulting MFP solution is identical to the one in Fig. 2, of course. In an implementation, the non–deterministic structure of the working set must be implemented in a deterministic way. However, both the classical algorithm described above and the new algorithm, which we describe in the next section, are based on the concept of working sets. Therefore, we continue to assume that the working set is non–deterministic.
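The pre-processing step that lifts the abstract semantics from instructions to basic blocks, ![I0 . . . In!] := ![In!] ◦ · · · ◦ ![I0!], is ordinary function composition. A Java sketch (illustrative, not the paper's code):

```java
import java.util.List;
import java.util.function.Function;

// Sketch: lifting the abstract semantics from single instructions to a basic
// block by composing the per-instruction transfer functions; the first
// instruction of the block is applied first, the last one last.
public class BlockSemantics {
    static <A> Function<A, A> blockTransfer(List<Function<A, A>> instrSem) {
        Function<A, A> f = Function.identity();
        for (Function<A, A> g : instrSem) {
            f = f.andThen(g);   // g runs after everything composed so far
        }
        return f;
    }
}
```

For example, composing the transfer functions of a two-instruction block applies them left to right, exactly as the block would execute.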
4 New Execution Based Algorithm
The new algorithm for computing the MFP solution of a given data–flow problem (see Fig. 5) is graph–free. The underlying idea is to give the program a more active role: The program itself executes on the abstract values. The program counter variable pc always holds the currently executing instruction. The execution of this instruction affects the abstract values for all succeeding instructions, and its effect is propagated iff it makes a change. Here we see another difference w.r.t. the classical algorithm: While the pc in our algorithm identifies the instruction causing a change, the current node n in the classical algorithm
Table 1. Example execution of classical iterative algorithm (entries unchanged w.r.t. the previous line are omitted)

W                  B    changed entries
{B1, B2, B3, B4}   B0   a[B0] = x/⊥ y/⊥ z/⊥ r/⊥; a[B1] = · · · = a[B4] = x/⊤ y/⊤ z/⊤ r/⊤
{B1, B2, B3}       B4   a[B4] = x/1 y/2 z/3 r/⊥
{B2, B3}           B1   a[B1] = x/1 y/2 z/3 r/5
{B3}               B2   a[B2] = x/1 y/2 z/3 r/5
{B4}               B3   a[B3] = x/2 y/2 z/3 r/5
{B1}               B4   a[B4] = x/⊥ y/2 z/3 r/⊥
{B2}               B1   a[B1] = x/⊥ y/2 z/3 r/5
{B3}               B2   a[B2] = x/⊥ y/2 z/3 r/5
∅                  B3   a[B3] = x/⊥ y/2 z/3 r/5
identifies the point where a change is cumulated. Note that the algorithm checks whether or not the instruction makes a change by the condition new < spc′, which is equivalent to new ⊓ spc′ = new and new ≠ spc′. Obviously, the execution cannot be deterministic: On the level of abstract values there is no way to determine which branch to follow at conditional jumps. Therefore, we consider both branches here. Consequently, we use a working set of program counters, just like the classical algorithm uses a working set of graph nodes. However, the new algorithm uses the working set in a more modest way than the classical one: While the classical one chooses a new node from the working set in each iteration, the new one follows one path of computation as long as changes occur and the path does not reach the end of the program. This is done in the inner repeat/until loop. Only if this path terminates are elements chosen from the working set in the outer while loop. In addition, the new algorithm tries to keep the working set as small as possible during execution of a path: Note that the instruction W := W − pc is placed inside the inner loop. Hence, even execution of a path may cause the working set to shrink. In comparison to the classical algorithm, our approach has the following advantages:

– It uses less memory: There is neither a graph to store the possible control flows in the program nor an associative array needed to store the abstract values at the basic block level.
Input: Data–flow problem (P, L, ![.!], a0) where P = I0 . . . In, L = ⟨A, ⊓, ⊔⟩
Output: MFP solution s0, . . . , sn ∈ A

s0 := a0
for i := 1 to n do si := ⊤
W := {0, . . . , n}
while W ≠ ∅ do
    choose pc ∈ W
    repeat
        W := W − pc
        new := ![Ipc!](spc)
        if Ipc = (goto l) then pc′ := l else pc′ := pc + 1
        if Ipc = (if ψ goto l) and new < sl then
            W := W + l
            sl := new
        end
        if new < spc′ then
            spc′ := new
            pc := pc′
        else
            pc := n + 1
        end
    until pc = n + 1
end
Fig. 5. New execution algorithm for computing MFP solution
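To make the control flow of Fig. 5 concrete, here is a compact Java rendering of the execution algorithm. The instruction encoding (a transfer function per instruction plus optional jump targets, with -1 meaning absent) is an assumption for illustration, and the change test is written as an explicit meet against the old value, a conservative variant of the strict-decrease test new < s used in the figure:

```java
import java.util.*;
import java.util.function.UnaryOperator;

// Sketch of the graph-free MFP solver of Fig. 5. Instruction i supplies a
// transfer function sem.get(i), an optional unconditional jump target, and an
// optional conditional jump target (-1 = absent). This encoding is an
// illustrative assumption, not the paper's implementation.
public class GraphFreeSolver<A> {
    public interface Lattice<A> { A top(); A meet(A x, A y); }

    public List<A> solve(Lattice<A> lat, A a0, List<UnaryOperator<A>> sem,
                         int[] gotoTarget, int[] condTarget) {
        int n = sem.size() - 1;
        List<A> s = new ArrayList<>(Collections.nCopies(n + 1, lat.top()));
        s.set(0, a0);
        Deque<Integer> w = new ArrayDeque<>();
        for (int i = n; i >= 0; i--) w.push(i);          // working set as a stack
        while (!w.isEmpty()) {
            int pc = w.pop();
            do {
                w.remove(Integer.valueOf(pc));           // W := W - pc
                A nw = sem.get(pc).apply(s.get(pc));
                int next = gotoTarget[pc] >= 0 ? gotoTarget[pc] : pc + 1;
                int l = condTarget[pc];
                if (l >= 0 && !lat.meet(nw, s.get(l)).equals(s.get(l))) {
                    s.set(l, lat.meet(nw, s.get(l)));    // branch target changed
                    w.push(l);
                }
                if (next <= n && !lat.meet(nw, s.get(next)).equals(s.get(next))) {
                    s.set(next, lat.meet(nw, s.get(next)));
                    pc = next;                           // follow the path
                } else {
                    pc = n + 1;                          // path ends
                }
            } while (pc != n + 1);
        }
        return s;
    }
}
```

Note that besides the working set and the array of abstract values, no graph structure is built at any point.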
– The data locality is better. At a node, the classical algorithm visits all predecessors and potentially all successors. Since these nodes will typically be scattered in memory, the access to the abstract values associated with them will often cause data cache misses. In contrast, our algorithm only visits a node and potentially its successors. Typically, one of the successors is the next instruction. Since the abstract values are arranged in an array, the abstract value associated with the next instruction is the next element in the array. Here, the likelihood of cache hits is large. Recent studies show that such small differences in data layout can cause large differences in performance on modern system architectures [CHL99, CDL99].
– There is no need for pre–processing by finding the abstract semantics of a basic block ![I0 . . . In!] := ![In!] ◦ · · · ◦ ![I0!].
– There is no need for a post–processing stage, which propagates the solution from the basic block level to the instruction level.

Theorem 1 (Termination). The algorithm in Fig. 5 terminates for all inputs.

Proof. During each execution of the inner loop, at least one 0 ≤ i ≤ n exists such that the value of the variable si decreases w.r.t. the underlying partial order of the lattice L. Since L only has finite chains, this can happen only finitely many times. Hence, the inner loop always terminates.
Furthermore, the working set grows iff a conditional jump is encountered and the corresponding value sl decreases. Just like above, this can happen only finitely many times. Hence, there is an upper bound for the size of the working set. In addition, during each execution of the outer loop, the working set shrinks at least by one element, the one chosen in the outer loop. Hence, the outer loop always terminates. ⊓⊔

Theorem 2 (Correctness). After termination of the algorithm in Fig. 5, the values of the variables s0, . . . , sn are the MFP solution of the given data–flow problem.

Proof. To prove correctness, we can obviously consider a modified version of the algorithm, where the inner loop is removed and nodes are selected from the working set in each iteration. In this setting, no program point will be ignored forever. Hence, we can use the results from [GKL+94]: The selection of program points is a fair strategy and the correctness of our algorithm directly follows from the theorem on chaotic fixed point iterations. To do so, we have to validate one more premise of the theorem: We have to show that the algorithm computes si = ⊓j∈predP(i) ![Ij!](sj) for each program point 0 ≤ i ≤ n. The algorithm can change si iff it visits a program point pc with pc ∈ predP(i). Let s be the value of si before the loop and s′ be the value after the loop. If we can show that s′ = s ⊓ ![Ipc!](spc), we know that the algorithm computes the meet over all predecessors by iteratively computing the pairwise meet. To show that, we distinguish two cases: 1. If ![Ipc!](spc) = new < s then s′ = new = s ⊓ ![Ipc!](spc). 2. Otherwise, we know that ![Ipc!](spc) = new ≥ s since ![.!] is monotone and the initial value of s is the top element. Hence we also have s′ = s = s ⊓ ![Ipc!](spc). ⊓⊔

Example 4 (Constant Folding Propagation, Cont'd). Table 2 shows a trace of the execution of the new algorithm for the constant folding propagation example.
Each line shows the state of the working set and the approximations at the end of the inner loop, and the values of the program counter pc at the beginning and the end of the inner loop (written in the column pcs in the form begin/end). During this execution, the algorithm loads the value of pc only three times from the working set: Once at the beginning and twice after reaching the end of the program (pcs = 8/9). The adaptation of the execution algorithm to the other three cases of the taxonomy of data–flow problems described at the end of Section 2 is straightforward: (a) Existential problems can simply be handled by replacing < by >, and (b) backward problems require a simple pre–processing step which inserts new pseudo instructions to connect jump targets with the corresponding jump instructions.
5 Experimental Results
To validate the benefits of our approach, we studied the performance of the new algorithm in competition with the classical one, both in terms of memory
Table 2. Example execution of new algorithm (entries unchanged w.r.t. the previous line are omitted; s0 = x/⊥ y/⊥ z/⊥ r/⊥ throughout, and s1, . . . , s8 start at x/⊤ y/⊤ z/⊤ r/⊤)

W               pcs   changed entry
{1, . . . , 8}  0/1   s1 = x/1 y/⊥ z/⊥ r/⊥
{2, . . . , 8}  1/2   s2 = x/1 y/2 z/⊥ r/⊥
{3, . . . , 8}  2/3   s3 = x/1 y/2 z/3 r/⊥
{4, . . . , 8}  3/8   s8 = x/1 y/2 z/3 r/⊥
{4, . . . , 7}  8/9   s4 = x/1 y/2 z/3 r/⊥
{5, 6, 7}       4/5   s5 = x/1 y/2 z/3 r/5
{6, 7}          5/6   s6 = x/1 y/2 z/3 r/5
{7}             6/7   s7 = x/1 y/2 z/3 r/5
∅               7/8   s8 = x/2 y/2 z/3 r/5
{4}             8/9   s4 = x/⊥ y/2 z/3 r/⊥
∅               4/5   s5 = x/⊥ y/2 z/3 r/5
{7}             5/6   s6 = x/⊥ y/2 z/3 r/5
{7}             6/7   s7 = x/⊥ y/2 z/3 r/5
∅               7/8   s8 = x/⊥ y/2 z/3 r/5
∅               8/9   (no change; execution ends)
usage and runtimes. Prior to the presentation of the results, we discuss the experimental setting in more detail. We have implemented the classical BB algorithm and our new execution algorithm for full Java Virtual Machine (JVM) code [LY97]. This decision was taken for the following reasons:

1. As already mentioned, we see the JVM as a natural target environment for our execution–based algorithm, since it already contains an execution environment and is sensitive to high memory overhead.
2. Except for native code compilers for Java [GJS96], all compilers generate the same JVM code as target code. Consequently, we get realistic samples independent of a specific compiler.
3. Java programs are distributed as JVM code, often available for free on the internet.
Although we omitted procedure/method calls from our model, we can handle full JVM code: For intraprocedural analysis, we assume the result of method invocations to be the top element of the lattice. All these aspects allowed us to collect a large repository of JVM code with little effort. In addition to active search, we established a web site for donations of class files at http://www-i2.informatik.rwth-aachen.de/~mohnen/CLASSDONATE/. So far, we have collected 15,339 classes with a total of 98,947 methods. This large set of samples covers a wide range of applications, applets, and APIs. To name a few, it contains the complete JDK runtime environment (including AWT and Swing), the compiler generator ANTLR, the Byte Code Engineering Library, and the knowledge-based system Protégé. The classes were compiled by a variety of compilers: javac (Sun) in different versions, jikes (IBM), CodeWarrior (Metrowerks), and JBuilder (Borland). In some cases, the class files were compiled to JVM code from languages other than Java, for instance from Ada using Jgnat. In contrast to a hand–selected suite of benchmarks like SPECjvm98 [SPE], we do not impose any restrictions on the samples in the set: The samples may contain errors or might not even be working at all. In our opinion, this allows a better estimation of the "average case" a data–flow analyser must face in practice. Altogether, we consider our experiments suitable for estimating the benefits and drawbacks of our method.
import de.fub.bytecode.generic.*;
import Domains.*;

public interface JVMAbstraction {
    public Lattice getLattice();
    public Element getInitialValue(InstructionHandle ih);
    public Function getAbstract(InstructionHandle ih);
}
Fig. 6. Interface JVMAbstraction

However, we did not integrate our experiment into a JVM. Doing so would have fixed the experiment to a specific architecture, since the JVM implementation depends on it. Therefore, we implemented the classical BB algorithm and our new execution algorithm in Java, using the Byte Code Engineering Library [BD98] for accessing JVM class files. The implementation directly follows the notions defined in Section 2: We used the interface concept of Java to model the concepts of lattices, (JVM) abstractions, and data–flow problems. For instance, Fig. 6 shows the essential parts of the interface JVMAbstraction which models JVM abstractions. Consequently, the algorithms do not depend on specific data–flow problems. In contrast, our approach allows us to model any data–flow problem simply by providing a Java class which implements the interface JVMAbstraction.
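A self-contained analogue of the interface in Fig. 6 can illustrate how a data-flow problem is plugged in. The BCEL/Domains types (Lattice, Element, Function, InstructionHandle) are replaced by simple stand-ins here, so all names and the textual instruction format are illustrative assumptions:

```java
import java.util.*;
import java.util.function.UnaryOperator;

// Self-contained analogue of the JVMAbstraction interface from Fig. 6; the
// BCEL types are replaced by a generic abstract-value parameter A and a
// textual instruction format, both of which are illustrative assumptions.
interface Abstraction<A> {
    A initialValue();                        // a0 for the entry point
    UnaryOperator<A> abstractOf(String instr);  // transfer function per instruction
}

// Constant propagation over environments: a variable absent from the map
// plays the role of "not constant" (the bottom element of the flat lattice).
class ConstPropAbstraction implements Abstraction<Map<String, Long>> {
    public Map<String, Long> initialValue() { return Map.of(); }

    public UnaryOperator<Map<String, Long>> abstractOf(String instr) {
        // Only "x := y + z" assignments are interpreted here; everything
        // else (e.g. jumps) is treated as the identity.
        String[] p = instr.split(":=|\\+");
        if (p.length != 3) return env -> env;
        String x = p[0].trim(), y = p[1].trim(), z = p[2].trim();
        return env -> {
            Map<String, Long> out = new HashMap<>(env);
            Long a = lit(y, env), b = lit(z, env);
            if (a != null && b != null) out.put(x, a + b); else out.remove(x);
            return out;
        };
    }

    // A token is a constant if it parses as an integer literal or is a
    // variable currently known to be constant in the environment.
    private static Long lit(String t, Map<String, Long> env) {
        try { return Long.parseLong(t); } catch (NumberFormatException e) { return env.get(t); }
    }
}
```

Because the solver only sees the interface, swapping in a different analysis amounts to providing another implementation, exactly as in the BCEL-based setup described above.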
Fig. 7. Histogram of memory reduction (horizontal axis: memory reduction in %, from 20 to 100)

For the experiment, we implemented constant folding propagation, as described in the previous sections. All experiments were done on a system with a Pentium III at 750 MHz and 256 MB main memory, running Linux 2.2.16 and Sun JDK 1.2.2. For each of the 98,947 JVM methods in the repository, we measured memory usage and runtimes of both our algorithm and the classical algorithm. The working set was implemented as a stack.

Memory improvement. Given the number of bytes mX allocated by our algorithm and the number of bytes mC allocated by the classical algorithm, we compute the memory reduction as the percentage mX/mC · 100. In the resulting distribution, we found a best-case reduction to 7.28%, a worst-case reduction to 74.61%, and an average reduction to 30.83%. Moreover, the median¹ is 31.28%, which is very close to the average. Hence, our algorithm always reduces the amount of memory, in the best case to less than a tenth! In the average case, the memory usage drops to about a third. Fig. 7 shows a histogram of the distribution. A study of the relation between the number of instructions and the memory reduction does not reveal a connection between those values. In Fig. 8(a) each point represents a method: The coordinates are the number of instructions on the horizontal axis and the memory reduction on the vertical axis. We have restricted the plot to the interesting range up to 1,000 instructions: While the sample set contains methods with up to 32,768 instructions, the average number of instructions per method is only 40.3546 and the median is only 11. Obviously, object–orientation has a measurable impact on the structure of programs. Surprisingly, there is a relation between the amount of reduction caused by BBs and the memory reduction. One might expect that the classical algorithm is better for higher amounts of reduction caused by BBs. However, this turns out to
¹ The median (or central value) of a distribution is the value with the property that one half of the elements of the distribution is less than or equal to it and the other half is greater than or equal to it.
be wrong: Fig. 8(b) shows that the new algorithm reduces the memory even more for higher BB reductions.

Fig. 8. Memory reduction of new algorithm: (a) memory reduction vs. number of instructions; (b) memory reduction vs. basic block reduction

Runtimes. For the study of runtimes, we use the speedup caused by the use of the classical algorithm: If tC is the runtime of the classical algorithm and tX is the runtime of our algorithm, we consider tC/tX to be the speedup. The distribution of speedups turned out to be a big surprise: Speedups vary from 291.2 down to 0.015, but the mean is 1.62, the median is 1.33, and the variance is only 7.49! Hence, for the majority of methods our algorithm performs as well as the BB algorithm. Fig. 9 shows a histogram of the interesting area of the distribution. Again, relating speedup to the number of instructions (Fig. 10(a)) did not reveal a significant correlation. In addition, and not surprisingly, the speedup is higher for better BB reduction (Fig. 10(b)).
Fig. 9. Histogram of BB speedup (horizontal axis: speedup from 0.5 to 3.0)
Fig. 10. Speedup of classical algorithm: (a) speedup vs. number of instructions; (b) speedup vs. basic block reduction
6 Conclusions and Future Work
We have shown that data–flow analysis can be done without an explicit graph structure. Our new algorithm for computing the MFP solution of a data–flow problem is based on the idea of the program executing on the abstract values. The advantages resulting from the approach are less memory use, better data locality, and no need for pre–processing or post–processing stages. We validated these expectations by applying a test implementation to a large set of samples. It turned out that while the runtimes are almost identical, our approach always saves between a third and 9/10 of the memory used by the classical algorithm. In the average case, it saves two thirds of the memory used by the classical algorithm. The algorithm is very easy to implement in settings where machinery for the execution of programs is already available, for instance in Java Virtual Machines. In addition, the absence of the graph makes the algorithm easier to implement. In the presence of full JVM code, implementing BB graphs turned out to be trickier than expected. In fact, after having implemented both approaches, errors in the implementation of the BB graphs were revealed by the correct results of the new algorithm.
References

[AH87] S. Abramsky and C. Hankin. An Introduction to Abstract Interpretation. In S. Abramsky and C. Hankin, editors, Abstract Interpretation of Declarative Languages, chapter 1, pages 63–102. Ellis Horwood, 1987.
[ASU86] A. V. Aho, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques, and Tools. Addison Wesley, 1986.
[BD98] B. Bokowski and M. Dahm. Byte Code Engineering. In C. H. Cap, editor, Java-Informations-Tage (JIT), Informatik Aktuell. Springer–Verlag, 1998. See also at http://bcel.sourceforge.net/.
[Bla98] B. Blanchet. Escape Analysis: Correctness Proof, Implementation and Experimental Results. In Proceedings of the 25th Symposium on Principles of Programming Languages (POPL). ACM, January 1998.
[Bla99] B. Blanchet. Escape Analysis for Object Oriented Languages: Application to Java. In Proceedings of the 14th Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), volume 34, 10 of ACM SIGPLAN Notices, pages 20–34. ACM, 1999.
[CC77] P. Cousot and R. Cousot. Abstract Interpretation: A Unified Lattice Model for Static Analysis of Programs by Construction or Approximation of Fixed Points. In Proceedings of the 4th Symposium on Principles of Programming Languages (POPL), pages 238–252. ACM, January 1977.
[CDL99] T. M. Chilimbi, B. Davidson, and J. R. Larus. Cache-Conscious Structure Definition. In PLDI'99 [PLD99], pages 13–24.
[CHL99] T. M. Chilimbi, M. D. Hill, and J. R. Larus. Cache-Conscious Structure Layout. In PLDI'99 [PLD99], pages 1–12.
[Deu97] A. Deutsch. On the Complexity of Escape Analysis. In Proceedings of the 24th Symposium on Principles of Programming Languages (POPL), pages 358–371. ACM, January 1997.
[GJS96] J. Gosling, B. Joy, and G. Steele. The Java Language Specification. The Java Series. Addison Wesley, 1996.
[GKL+94] A. Geser, J. Knoop, G. Lüttgen, O. Rüthing, and B. Steffen. Chaotic Fixed Point Iterations. Technical Report MIP-9403, Fakultät für Mathematik und Informatik, University of Passau, 1994.
[KKS98] J. Knoop, D. Koschützki, and B. Steffen. Basic-Block Graphs: Living Dinosaurs? In K. Koskimies, editor, Proceedings of the 7th International Conference on Compiler Construction (CC), number 1383 in Lecture Notes in Computer Science, pages 65–79. Springer–Verlag, 1998.
[LY97] T. Lindholm and F. Yellin. The Java Virtual Machine Specification. The Java Series. Addison Wesley, 1997.
[MJ81] S. S. Muchnick and N. D. Jones. Program Flow Analysis: Theory and Applications. Prentice–Hall, 1981.
[Moh97] M. Mohnen. Optimising the Memory Management of Higher–Order Functional Programs. Technical Report AIB-97-13, RWTH Aachen, 1997. PhD Thesis.
[Muc97] S. S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann Publishers, 1997.
[PLD99] Proceedings of the ACM SIGPLAN '99 Conference on Programming Language Design and Implementation (PLDI), SIGPLAN Notices 34(5). ACM, 1999.
[SPE] Standard Performance Evaluation Corporation. SPECjvm98 Documentation, Release 1.01. Online version at http://www.spec.org/osg/jvm98/jvm98/doc/.
A Representation for Bit Section Based Analysis and Optimization⋆

Rajiv Gupta¹, Eduard Mehofer², and Youtao Zhang¹

¹ Department of Computer Science, The University of Arizona, Tucson, Arizona
² Institute for Software Science, University of Vienna, Vienna, Austria
Abstract. Programs manipulating data at the subword level are growing in number and importance. Examples are programs running on network processors, media processors, or general purpose processors with media extensions. In addition, data compression techniques, which are vital for embedded system applications, result in code operating at the subword level as well. Performing analysis at the word level, however, is too coarse-grained and misses opportunities for optimization. In this paper we introduce a novel program representation which allows reasoning at the subword level. This is achieved by making accesses to subwords explicit. First, in a local phase, statements are analyzed and accesses at the subword level are identified. Then, in a global phase, the control flow is taken into account and the accesses are related to one another. As a result, various traditional analyses can be performed on our representation at the subword level very easily. We discuss the algorithms for constructing the program representation in detail and illustrate their application with examples.
1 Introduction
Programs that manipulate data at the subword level are growing in number and importance. The need to operate upon subword data arises when multiple data items are packed together into a single word of memory. The packing may be a characteristic of the application domain or it may be carried out automatically by the compiler. We have identified the following categories of applications.

Network processors are specialized processors that are being designed to efficiently manipulate packets [5]. Since a packet is a stream of bits, the individual fields in the packet get mapped to subword entities within a memory location or may even be spread across multiple locations.

Media processors include special purpose processors that process media data (e.g., TigerSHARC [3]) as well as general purpose processors with multimedia extensions (e.g., Intel's MMX [1,6]). The narrow width of media data is exploited by
⋆ Supported by DARPA PAC/C Award F29601-00-1-0183 and NSF grants CCR-0105355, CCR-0096122, EIA-9806525, and EIA-0080123 to the Univ. of Arizona.
R. N. Horspool (Ed.): CC 2002, LNCS 2304, pp. 62–77, 2002. © Springer-Verlag Berlin Heidelberg 2002
packing multiple data items in a single word and supporting instructions that are able to exploit subword parallelism. Data compression transformations reduce the data memory footprint of the program [2,9]. After data compression transformations have been applied, the resulting code operates on subword entities.

Program analysis, which is the basis of the optimization and code generation phases, is a challenging task for the above programs since we need to reason about entities at the subword level. Moreover, accesses at the subword level are expressed in C (a commonly used language in these application domains) by means of rather complex mask and shift operations.

In this paper we introduce a novel program representation that enables reasoning about subword entities corresponding to bit sections (a bit section is a sequence of consecutive bits within a word). This is made possible by explicitly expressing the manipulation of bit sections and relating the flow of values among bit sections. We present algorithms for constructing this representation. The key steps in building our representation are as follows:

– By locally examining the bit operations in an expression appearing on the right hand side of an assignment, we identify the bit sections of interest. In particular, the word corresponding to the variable on the left hand side is split into a number of bit sections such that adjacent bit sections are modified differently by the assignment. The assignment statement is replaced by multiple bit section assignments.

– By carrying out global analysis, explicit relationships are established among different bit sections belonging to the same variable. These relationships are expressed by introducing split and combine nodes. A split node takes a larger bit section and replaces it by multiple smaller bit sections, and a combine node takes multiple adjacent bit sections and replaces them by a single larger bit section.

The above representation is appropriate for reasoning about bit sections.
For example, the flow of values among the bit sections can be easily traced in this representation, resulting in definition-use chains at the bit section level. Moreover, since our representation makes accesses at the subword level explicit, processors with special instructions for packet-level addressing can be supported easily and efficiently by the code generator, and the costly mask and shift operations can be replaced.

The remainder of the paper is organized as follows. In Section 2 we describe our representation, including its form and its important properties. In Sections 3 and 4 we present the local and global phases of the algorithm used to construct the representation. Finally, concluding remarks are given in Section 5.
2 The Representation
This section presents our representation for bit section based analyses and optimizations. The starting point for our extensions are programs modeled as directed control flow graphs (CFG) G = (N, E, entry, exit) with node set N, including the unique entry and exit nodes, and edge set E. For ease of presentation we assume that the nodes represent statements rather than basic blocks¹.

The construction of our representation is driven by assignment statements of the form v = t whereby the right hand side term t contains bit operations only, i.e., & (and), | (or), ~ (not), << (shift left), and >> (shift right). Since the term on the right hand side of such an assignment can be arbitrarily long and intricate, and since our goal is to replace those assignments by a sequence of simplified assignments, we call them complex assignments.

Essentially our representation is based on two transformations performed on the CFG. First, we partition the original program variables into bit sections of interest. The bit sections of interest are identified locally by examining the usage of these bit sections in a complex assignment. Only complex assignments, which are formed using bit operations, are processed by this phase because the partitioning is guided by the special semantics of bit operations. Other assignments are not partitioned since no useful additional information can be exposed in this way. Hence, in the remainder of the discussion, only complex assignments are considered. Second, we relate definitions and uses of bit sections belonging to the same program variable using global analysis. The required program representation is obtained by making the outcomes of the above steps explicit in the CFG. In the remainder of this section, we illustrate the effects of the above two steps and describe the resulting representation in detail.

2.1 Identifying Bit Sections of Interest
Definition 1 (Bit Section). Given a program variable v with a size of c bits, a bit section of v is denoted by vl..h (1 ≤ l ≤ h ≤ c) and refers to the sequence of bits l, l+1, ..., h−1, h.²

The symbol := is used to denote a bit section assignment. In the following discussion, if nothing is said to the contrary, we assume for ease of discussion that variables have a size of 32 bits.

Partitioning a program variable. Given a complex assignment (v = t), the program variable v on the left hand side is partitioned into bit sections if each of the resulting sections is updated differently from its neighboring bit sections by the term t on the right hand side of the complex assignment. In particular, the value of a bit section of the lhs variable v, say vl..h, can be specified in one of the following ways:

¹ Handling basic blocks is straightforward.
² The definition includes 1-bit sections as well as whole variable sections.
– No Modification: The value of vl..h remains unchanged because it is assigned its own value.
– Constant Assignment: vl..h is assigned a compile time constant.
– Copy Assignment: The value of another bit section variable is copied into vl..h.
– Expression Assignment: The value of vl..h is determined by an expression which is in general simpler than t.

The partitioning of variable v is made explicit in the program representation by replacing the complex assignment by a series of bit section assignments. A consequence of this transformation is that operands used in t may also have to be partitioned into compatible bit sections.

Properties. There are two important properties that will be observed by our choice of bit section partitions:

1. Non-overlapping sections. The sections resulting from such partitioning are non-overlapping for individual assignments.
2. Maximal sections. Each section is as large as needed to expose the semantic information that can be extracted from a given complex assignment. In other words, further partitioning will not provide us with any more information about the values stored in the individual bits.

Example. Consider the complex assignment to variable a shown in Fig. 1. If we carefully examine this assignment, we observe that it is equivalent to the bit section assignments shown below. Note that each bit section is updated differently from its neighboring sections. Bit sections a1..4 and a17..32 are set to 0, a5..8 is involved in a copy assignment, and a13..16 is not modified at all (we have placed the assignment below simply for clarity). Bit section a9..12 is computed using an expression which is simpler than the original expression. Finally, as a consequence of a's partitioning, variable b must be partitioned into compatible bit sections as well.
Complex Assignment:
  a = (a & 0xff00) | ((b & 0xff) << 4)

Bit Section Assignments:
  a1..4  := 0
  a5..8  := b1..4
  a9..12 := a9..12 | b5..8
  a13..16 := a13..16
  a17..32 := 0

Fig. 1. A complex assignment and its equivalent bit section assignments

[Table: annotation rules for the shift operators. For ">> c", an annotation var : {[(l, h), s]} yields 0 : {[(32 − c, 32), 0]} for the vacated bits and var : {[(l′ − s + c, h′ − s + c), s − c]}, where (l′, h′) = (l + s − c, h + s − c) ∩ (0, 32), for the surviving bits; a constant section 0 : {[(l, h), 0]} yields 0 : {[(l′, h′), 0]}, where (l′, h′) = (l − c, h − c) ∩ (0, 32).]
2. Ensure all bits within a section are computed identically. Closer examination of bit sections of different operand variables that annotate a given node can reveal whether further splitting of these bit sections is required to ensure that each resulting bit section is computed by exactly one expression. The bit section var1 : {[(l1, h1), s1]} is split by the bit section var2 : {[(l2, h2), s2]}, denoted by var1/var2, at a node in the expression tree by means of the following rule:
var1 : {[(l1, h1), s1]} / var2 : {[(l2, h2), s2]} =
  var1 : {[(l1, l2+s2−s1, h2+s2−s1, h1), s1]}  if l1+s1 < l2+s2 < h2+s2 < h1+s1
  var1 : {[(l1, l2+s2−s1, h1), s1]}            if l1+s1 < l2+s2 < h1+s1 < h2+s2
  var1 : {[(l1, h2+s2−s1, h1), s1]}            if l2+s2 < l1+s1 < h2+s2 < h1+s1
  var1 : {[(l1, h1), s1]}                      otherwise
The splitting is performed by considering every ordered pair of bit sections. As we can see, the above bit sectioning is performed to distinguish between bit sections which are computed differently from the two bit sections var1 and var2. More precisely, we distinguish a bit section which is computed from both var1 and var2 from one which is computed only from var1.

3. Identify bit sections for the lhs variable. After steps 1 and 2, the annotations of the root node of the expression tree are used to identify the bit sections of the variable on the left hand side. Let us assume that the width of a word is 32 bits; then we split the initial bit section of the lhs variable varlhs : {[(0, 32), 0]} if parts are computed differently. More formally, new bit sections are obtained by a repeated evaluation of the split varlhs/any for each annotation any : {[(l, h), s]} at the root node of the rhs tree.

II. Generating bit section assignments. In this step we generate the bit section assignments corresponding to the bit sections identified for a lhs variable of a complex assignment. Given a bit section vl+1..h, the expression which has to be assigned to vl+1..h is returned by the function call genexp((l, h), eroot), where eroot is the root node of the entire expression tree, i.e., for each bit section (l, h) of a lhs variable v we call

  vl+1..h := simplify(genexp((l, h), eroot)).

The function simplify is the last step, in which trivial patterns like "a | 0" or "a & 1" are reduced to "a". As shown in Fig. 5, genexp() traverses the expression, examining the bit sections that annotate each node in order to find those that contribute to bits l+1..h. If only one of the bit sections at a node contributes to bits l+1..h, a traversal of the subtree is not required any more. In this case the operand is a sequence of h−l bits belonging to a variable or it consists of constant (0 or 1) bits.
If multiple bit sections contribute to bits l + 1..h, then the operator represented by the current node is included in the expression and the subexpressions that are its operands are identified by recursively applying genexp() to the descendants.
genexp((l, h), e) {
  BS = ∅
  for each section any : [(el, eh), es] ∈ set of annotations of node e do
    if range (l, h) is contained in range (el + es, eh + es) then
      BS = BS ∪ {any : [(el, eh), es]}
    endif
  endfor
  if BS == {any : [(el, eh), es]} then
    return ("any_{l−es+1..h−es}")
  else
    let e.lchild and e.rchild be the expression trees for the operands of e
    case e.op of
      e.op == "not"  : return ("not" genexp((l, h), e.lchild))
      e.op == "<< c" : return (genexp((l − c, h − c), e.lchild))
      e.op == ">> c" : return (genexp((l + c, h + c), e.lchild))
      e.op == "&"    : return (genexp((l, h), e.lchild) "&" genexp((l, h), e.rchild))
      e.op == "|"    : return (genexp((l, h), e.lchild) "|" genexp((l, h), e.rchild))
    endcase
  endif
}
Fig. 5. Generating Bit Section Assignments

[Figure: the expression tree of a = (a & 0xff00) | ((b & 0xff) << 4), with each node annotated by its Step 1 bit sections, e.g. a : {[(0, 32), 0]} and b : {[(0, 32), 0]} at the leaves, 1 : {[(8, 16), 0]} and 0 : {[(0, 8), 0], [(16, 32), 0]} for the constant 0xff00, 1 : {[(0, 8), 0]} and 0 : {[(8, 32), 0]} for 0xff, b : {[(0, 8), 0]} and 0 : {[(8, 32), 0]} at the inner & node, and b : {[(0, 8), 4]} at the shift node.]
Step 1: b:{[(0,8),4]} 0:{[(0,4),0],