Power-Aware Computer Systems: Second International Workshop, PACS 2002 Cambridge, MA, USA, February 2, 2002, Revised Papers (Lecture Notes in Computer Science, 2325) 3540010289, 9783540010289

Table of Contents
Power-Aware Computer Systems
Preface
PACS 2002 Program Committee
Table of Contents
Early-Stage Definition of LPX: A Low Power Issue-Execute Processor
Introduction
Background: Power-Performance Data
Areas of Focus in Defining the LPX Processor
Tuning the Microarchitecture
High-Level Microarchitecture of LPX
Examples: LPX Microarchitecture Analysis
Conclusions and Future Work
References
Dynamic Tag-Check Omission: A Low Power Instruction Cache Architecture Exploiting Execution Footprints
Introduction
Related Work
History-Based Tag-Comparison Cache
Concept
Organization
Operation
Advantages and Disadvantages
Evaluation
Simulation Environment
Results
Tag-Check Count
Energy Consumption
Performance Overhead
Effects of Execution-Footprint-Invalidation Penalty
Conclusions
References
A Hardware Architecture for Dynamic Performance and Energy Adaptation
Introduction
Opportunities for Scaling
Power Adaptation Unit
PAU Table Entry Management
Handling Cache and Memory
Example
Limits on Energy Savings
PAU Overhead and Tradeoffs
Efficacy of the PAU
Simulation Environment
Effect of PAU Size on Energy Savings
Effect of PAU Size on Performance Degradation
Effect of PAU Size on Energy-Delay Product
Related Work
Summary and Future Work
References
Multi-processor Computer System Having Low Power Consumption
Introduction
A Low Power Multi-Processor Computer System
The Target Device
Processor Energy Model
Task Suite
The Big Picture: A Discussion
Summary
References
An Integrated Heuristic Approach to Power-Aware Real-Time Scheduling
Introduction
Related Work on Variable Voltage Scheduling
System and Energy Models
Formulation of the Problem
The Optimization Problem
PORTS: Power-Optimized Real-Time Scheduling Server
Handling Power-Aware Real-Time Tasks
Activating the PORTS Server and Feasibility Test
Reduction Scheme from MCKP to the Classical KP
Enhanced Greedy Algorithm
Restoring the Solution from the EKP to the MCKP
Scheduling the New Task
Simulation Experiments
Conclusions
References
Power-Aware Task Motion for Enhancing Dynamic Range of Embedded Systems with Renewable Energy Sources
Introduction
Related Work
DVS Anomaly
Task Motion under Timing and Power Constraints
Constraint Graph and Schedule
Task Motion under Timing Constraints
Utilization Constraints
Scheduling Algorithms for Power-Aware Task Motion
Experimental Results
Conclusion
References
A Low-Power Content-Adaptive Texture Mapping Architecture for Real-Time 3D Graphics
Introduction
Texture Mapping
Previous Work
Content Adaptive Texture Mapping
Proposed Approach
Proposed Architecture
Results
Conclusion
References
Energy-Driven Statistical Sampling: Detecting Software Hotspots
Introduction
Energy-Driven Statistical Sampling Prototype
Hardware
Software
Error Analysis and Validation
Comparison of Sampling Approaches
Time-Driven Statistical Sampling Prototypes
Benchmarks
Experimental Results
Observations
Comparison with Activation-Model Approaches
Summary and Future Work
References
Modeling of DRAM Power Control Policies Using Deterministic and Stochastic Petri Nets
Introduction
Methodology
Validation
Model Extension
Conclusion
References
SimDVS: An Integrated Simulation Environment for Performance Evaluation of Dynamic Voltage Scaling Algorithms
Introduction
DVS Algorithms for Hard Real-Time Systems
Our Contribution
Overview of SimDVS
Design Goals
Architectural Organization
Main Modules of SimDVS
SimDVS Inputs
InterDVS Module
IntraDVS and Its Preprocessing Module
Case Studies
Performance Evaluation of InterDVS and IntraDVS
Performance Evaluation of Hybrid Methods
Overhead Measurement of InterDVS Algorithms
Conclusion
References
Application-Supported Device Management for Energy and Performance
Introduction
Related Work
Models
Modeling the Energy-Oblivious Policy
Modeling the Fixed-Thresholds Policy
Modeling the Direct Deactivation Policy
Modeling the Pre-activation Policy
Modeling the Combined Policy
Modeling Whole Applications
Evaluation for a Laptop Disk
The Fujitsu Disk
Model Predictions
Benefits for Applications
Application Transformations
Experiments
Conclusions
References
Energy-Efficient Server Clusters
Introduction
Cluster Power Management
Details of the Coordinated Policy
Evaluation
Methodology
Results
Related Work
Conclusions
References
Single Region vs. Multiple Regions: A Comparison of Different Compiler-Directed Dynamic Voltage Scheduling Approaches
Introduction
Basic Compilation Strategy
Implementation
Estimate the Transition Graph
Reformulate the Problem
Other Issues
Experiments
Related Work
Conclusions
References
Author Index

Lecture Notes in Computer Science Edited by G. Goos, J. Hartmanis, and J. van Leeuwen

2325


Berlin Heidelberg New York Barcelona Hong Kong London Milan Paris Tokyo

B. Falsafi T. N. Vijaykumar (Eds.)

Power-Aware Computer Systems Second International Workshop, PACS 2002 Cambridge, MA, USA, February 2, 2002 Revised Papers


Series Editors
Gerhard Goos, Karlsruhe University, Germany
Juris Hartmanis, Cornell University, NY, USA
Jan van Leeuwen, Utrecht University, The Netherlands

Volume Editors
Babak Falsafi
Carnegie Mellon University
Electrical and Computer Engineering, Computer Science
Hamerschlag Hall A305, 5000 Forbes Ave., Pittsburgh, PA 15213, USA
E-mail: [email protected]

T. N. Vijaykumar
Purdue University
School of Electrical and Computer Engineering
1285 Electrical Engineering Building, West Lafayette, Indiana 47907-1285, USA
E-mail: [email protected]

Cataloging-in-Publication Data applied for. A catalog record for this book is available from the Library of Congress. Bibliographic information published by Die Deutsche Bibliothek: Die Deutsche Bibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data is available on the Internet.

CR Subject Classification (1998): B.7, B.8, C.1, C.2, C.3, C.4, D.4 ISSN 0302-9743 ISBN 3-540-01028-9 Springer-Verlag Berlin Heidelberg New York This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law. Springer-Verlag Berlin Heidelberg New York a member of BertelsmannSpringer Science+Business Media GmbH http://www.springer.de © Springer-Verlag Berlin Heidelberg 2003 Printed in Germany Typesetting: Camera-ready by author, data conversion by DA-TeX Gerd Blumenstein Printed on acid-free paper SPIN: 10846717 06/3142 543210

Preface

Welcome to the proceedings of the Power-Aware Computer Systems (PACS 2002) workshop, held in conjunction with the 8th International Symposium on High Performance Computer Architecture (HPCA-8). Improvements in computer system performance have been accompanied by an alarming increase in power and energy dissipation, leading to higher cost and lower reliability in all computer system market segments. The higher power/energy dissipation has also significantly reduced battery life in portable systems. While circuit-level techniques continue to reduce power and energy, all levels of computer systems are being used to address power and energy issues. PACS 2002 was the second workshop in its series to address power-/energy-awareness at all levels of computer systems, and it brought together experts from academia and industry.

These proceedings include research papers spanning a wide spectrum of areas in power-aware systems. We have grouped the papers into the following categories: (1) power-aware architecture and microarchitecture, (2) power-aware real-time systems, (3) power modeling and monitoring, and (4) power-aware operating systems and compilers.

The first group of papers proposes power-aware techniques for the processor pipeline using adaptive resizing of power-hungry microarchitectural structures and clock gating, and power-aware cache design that avoids tag checks in periods when the tags have not changed. This group also includes ideas to adapt energy and performance dynamically by detecting regions of an application at run time where the supply voltage may be scaled to reduce power with a bounded decrease in performance. Lastly, a paper on multiprocessor designs trades off computing capacity and functionality for improved energy per cycle by scheduling simple tasks on low-end, low-energy processors and complex tasks on high-end processors.

The second group of papers targets real-time systems, including a low-complexity heuristic which schedules real-time tasks such that no task misses its deadline and the total energy savings are maximized. The other papers in this group (1) tune the system-level parallelism to the current level of power/energy availability and optimize the system power utilization, and (2) perform adaptive texture mapping in real-time 3D graphics systems based on a model of human visual perception to achieve significant power savings without noticeable image-quality degradation.

The third group of papers focuses on power modeling and monitoring, including statistical profiling to detect software hotspots of power, and using Petri nets to model DRAM power policies. This group also includes a simulator for evaluating the performance and power of dynamic voltage scaling algorithms.

The last group concentrates on OS and compilers for low power. The first paper proposes application-issued directives to set the power modes in devices such as a disk drive. The second paper proposes policies for cluster-wide power management. The policies employ combinations of dynamic voltage scaling and turning nodes on and off to reduce overall cluster power.

PACS 2002 was a highly successful forum due to the high-quality submissions, the enormous efforts of the program committee and the keynote speaker, and the attendees. We would like to thank Ronny Ronen for an excellent keynote speech, showing the technological scaling trends and their impact on energy/power consumption in general-purpose microprocessors, and pinpointing recent microarchitectural strategies to achieve more power-efficient microprocessors. We would also like to thank Antonio Gonzalez, Andreas Moshovos, John Kalamatianos, and other members of the HPCA-8 organizing committee who helped arrange local accommodation and publicize the workshop.

February 2002

Babak Falsafi and T.N. Vijaykumar

PACS 2002 Program Committee

Babak Falsafi, Carnegie Mellon University (co-chair)
T.N. Vijaykumar, Purdue University (co-chair)
Dave Albonesi, University of Rochester
Krste Asanovic, Massachusetts Institute of Technology
Iris Bahar, Brown University
Luca Benini, University of Bologna
Doug Carmean, Intel
Yuen Chan, IBM
Keith Farkas, Compaq WRL
Mary Jane Irwin, Pennsylvania State University
Stefanos Kaxiras, Agere Systems
Peter Kogge, University of Notre Dame
Uli Kremer, Rutgers University
Alvin Lebeck, Duke University
Andreas Moshovos, University of Toronto
Raj Rajkumar, Carnegie Mellon University
Kaushik Roy, Purdue University

Table of Contents

Power-Aware Architecture/Microarchitecture

Early-Stage Definition of LPX: A Low Power Issue-Execute Processor . . . . . 1
  P. Bose, D. Brooks, A. Buyuktosunoglu, P. Cook, K. Das, P. Emma, M. Gschwind, H. Jacobson, T. Karkhanis, P. Kudva, S. Schuster, J. Smith, V. Srinivasan, V. Zyuban, D. Albonesi, and S. Dwarkadas

Dynamic Tag-Check Omission: A Low Power Instruction Cache Architecture Exploiting Execution Footprints . . . . . 18
  Koji Inoue, Vasily Moshnyaga, and Kazuaki Murakami

A Hardware Architecture for Dynamic Performance and Energy Adaptation . . . . . 33
  Phillip Stanley-Marbell, Michael S. Hsiao, and Ulrich Kremer

Multi-processor Computer System Having Low Power Consumption . . . . . 53
  C. Michael Olsen and L. Alex Morrow

Power-Aware Real-Time Systems

An Integrated Heuristic Approach to Power-Aware Real-Time Scheduling . . . . . 68
  Pedro Mejia, Eugene Levner, and Daniel Mossé

Power-Aware Task Motion for Enhancing Dynamic Range of Embedded Systems with Renewable Energy Sources . . . . . 84
  Jinfeng Liu, Pai H. Chou, and Nader Bagherzadeh

A Low-Power Content-Adaptive Texture Mapping Architecture for Real-Time 3D Graphics . . . . . 99
  Jeongseon Euh, Jeevan Chittamuru, and Wayne Burleson

Power Modeling and Monitoring

Energy-Driven Statistical Sampling: Detecting Software Hotspots . . . . . 110
  Fay Chang, Keith I. Farkas, and Parthasarathy Ranganathan

Modeling of DRAM Power Control Policies Using Deterministic and Stochastic Petri Nets . . . . . 130
  Xiaobo Fan, Carla S. Ellis, and Alvin R. Lebeck

SimDVS: An Integrated Simulation Environment for Performance Evaluation of Dynamic Voltage Scaling Algorithms . . . . . 141
  Dongkun Shin, Woonseok Kim, Jaekwon Jeon, Jihong Kim, and Sang Lyul Min


Power-Aware OS and Compilers

Application-Supported Device Management for Energy and Performance . . . . . 157
  Taliver Heath, Eduardo Pinheiro, and Ricardo Bianchini

Energy-Efficient Server Clusters . . . . . 179
  E.N. (Mootaz) Elnozahy, Michael Kistler, and Ramakrishnan Rajamony

Single Region vs. Multiple Regions: A Comparison of Different Compiler-Directed Dynamic Voltage Scheduling Approaches . . . . . 197
  Chung-Hsing Hsu and Ulrich Kremer

Author Index . . . . . 213

Early-Stage Definition of LPX: A Low Power Issue-Execute Processor

P. Bose(1), D. Brooks(1), A. Buyuktosunoglu(2), P. Cook(1), K. Das(3), P. Emma(1), M. Gschwind(1), H. Jacobson(1), T. Karkhanis(4), P. Kudva(1), S. Schuster(1), J. Smith(5), V. Srinivasan(1), V. Zyuban(1), D. Albonesi(6), and S. Dwarkadas(6)

(1) IBM T. J. Watson Research Center, Yorktown Heights, NY; [email protected]
(2) University of Rochester, NY; summer intern at IBM Watson
(3) University of Michigan, Ann Arbor; summer intern at IBM Watson
(4) University of Wisconsin, Madison; summer intern at IBM Watson
(5) University of Wisconsin, Madison; visiting scientist at IBM Watson
(6) University of Rochester, NY

Abstract. We present the high-level microarchitecture of LPX: a low-power issue-execute processor prototype that is being designed by a joint industry-academia research team. LPX implements a very small subset of a RISC architecture, with a primary focus on a vector (SIMD) multimedia extension. The objective of this project is to validate some key new ideas in power-aware microarchitecture techniques, supported by recent advances in circuit design and clocking.

1 Introduction

Power dissipation limits constitute one of the primary design constraints in future high-performance processors. Also, depending on the thermal time constants implied by the chosen packaging/cooling technology, on-chip power density is in many cases a more critical constraint than overall power. In current CMOS technologies, dynamic (“switching”) power still dominates; but, increasingly, the static (“leakage”) component is threatening to become a major component in future technologies [6]. In this paper, we focus primarily on the dynamic component of power dissipation. Current-generation high-end processors like the IBM POWER4 [3, 26] are performance-driven designs. In POWER4, power dissipation is still comfortably below the 0.5 watts/sq. mm. power-density limit afforded by the package/cooling solution of choice in target server markets. However, in designing and implementing future processors (or even straight “remaps”), the power (and especially the power-density) limits could become a potential “show-stopper” as transistors shrink and the frequency keeps increasing. Techniques like clock-gating (e.g. [21, 13]) and dynamic size adaptation of on-chip resources like caches and queues (e.g. [1, 20, 4, 9, 12, 2, 15, 27]) have been either used or proposed as methods for power management in future processor cores. Many of these techniques, however, have to be used with caution in server-class processors. Aspects like reliability and inductive noise on the power supply rails (Ldi/dt) need to be quantitatively evaluated prior to committing a particular gating or adaptation technique to a real design. Another issue in the design of next-generation, power-aware processors is the development of accurate power-performance simulators for use in early-stage design. University research simulators like Wattch [7] and industrial research simulators like Tempest [10] and PowerTimer [8] have been described in the recent past; however, their use in real design environments is needed to validate the accuracy of the energy models in the context of power-performance tradeoff decisions made in early design.

In the light of the above issues, we decided to design and implement a simple RISC “sub-processor” test chip to validate some of the key new ideas in adaptive and gated architectures. This chip is called LPX, which stands for low-power issue-execute processor. This is a research project, with a goal of influencing real development groups. LPX is a joint university-industry collaboration project. The design and modeling team is composed of 10-12 part-time researchers spanning the two groups (IBM and University of Rochester), aided by several graduate student interns and visiting scientists recruited from multiple universities to work (part-time) at IBM. LPX is targeted for fabrication in a high-end 0.1 micron CMOS technology. RTL (VHDL) simulation and verification is scheduled for completion in 2002. Intermediate circuit test chips are planned (mid- to late 2002) for early validation of the circuit and clocking support. LPX chip tapeout is slated for early 2003. In this paper, we present the microarchitecture definition with preliminary simulation-based characterization of the LPX prototype. We summarize the goals of the LPX project as follows:

– To understand and assess the true worth of a few key ideas in power-aware microarchitecture design through simulation and eventually via direct hardware measurement. Based on known needs in real products of the future, we have set a target of average power-density reduction by at least a factor of 5, with no more than 5% reduction in architectural performance (i.e. instructions per cycle, or IPC).
– To quantify the instantaneous power (current) swings incurred by the use of the adaptive resizing, throttling, and clock-gating ideas that are used to achieve the targeted power reduction factors in each unit of the processor.
– To use the hardware-based average and instantaneous power measurements for calibration and validation of the energy models used in early-stage power-performance simulators.

Clearly, what we learn through the “simulation and prototyping in the small” experiments in LPX will be useful in influencing full-function, power-efficient designs of the future. The calibrated energy models will help us conduct design-space exploration studies for high-end machines with greater accuracy. In this paper, we limit our focus to the microarchitectural definition process, with related simulation-based result snapshots, of the LPX prototype. (Note that LPX is a research test chip. It is not intended to be a full-function, production-quality

Fig. 1. Power profile: (a) relative unit-wise power; (b) power breakdowns: ISU

microprocessor. At this time, LPX is not directly linked to any real development project).

2 Background: Power-Performance Data

In an out-of-order, speculative superscalar design like each of the two cores in POWER4, a large percentage of the core power in the non-cache execution engine is spent in the instruction window or issue queue unit [26, 9, 20, 12]. Figure 1(a) shows the relative distribution of power across the major units within a single POWER4 core. Figure 1(b) zooms in on the instruction sequencing unit that contains the various out-of-order issue queues and rename buffers. Figure 2 shows the power density across some of the major units of a single POWER4 core. The power figures are non-validated pre-silicon projections based on unconstrained (i.e. without any clock-gating assumptions) “average/max” power projections using a circuit-level simulation facility called CPAM [19]. (Actual unit-wise power distributions, with available conditional clocking modes enabled, are not shown.) This tool allowed us to build up unit-level power characteristics from very detailed, macro-level data. Here, the activity (utilization) factors of all units are assumed to be 100% (i.e. worst case, with no clock-gating anywhere); but average, expected input data switching factors (based on representative test cases run at the RTL level, and other heuristics) are assumed for each circuit macro. Such average macro-level input data switching factors typically range from 4-15%. (From Figure 2, we note that although on a unit basis the power-density numbers are under 0.5 watts/sq. mm., there are smaller hotspots, like the integer FX issue queue within the ISU, that are above the limit in an unconstrained mode.) (Legend for Figs. 1-2: IDU: instruction decode unit; FXU: fixed point unit; IFU: instruction fetch unit; BHT: branch history table;

Fig. 2. Unconstrained power density profile

ISU: instruction sequencing unit; LSU: load-store unit, including the L1 data cache; FPU: floating point unit.) Another class of data that we used was the performance and utilization information obtained from pre-silicon performance simulators. Figure 3 shows the relative “active/idle” bar-chart plot across some of the major units for a POWER4-like pre-silicon simulation model. The data plotted is for a commercial TPC-C trace segment. This figure shows, for example, that the instruction fetch unit (IFU) is idle for approximately 47% of the cycles. Similar data, related to activities within other units, like issue queues and execution unit pipes, were collected and analyzed.

3 Areas of Focus in Defining the LPX Processor

Based on microarchitecture-level and circuit-simulation-level utilization, power, and power-density projections, as above, we made a decision to focus on the following aspects of a superscalar processing core in our LPX test chip.

Power-Efficient, Just-in-Time Instruction Fetch. Here, we wanted to study the relative advantages of conditional gating of the ifetch function, with a goal of saving power without appreciable loss of performance. The motivation for this study was clearly established after reviewing data like that depicted in Figures 1 and 2. In simulation mode, we studied the benefit of various hardware heuristics for determining the “gating condition” [18, 14, 5], before fixing on a particular set of choices (reported in detail in [17]) to implement in LPX. Our emphasis here is on studying ifetch gating heuristics that are easy to implement and test, with negligible added power for the control mechanism.

Adaptive Issue Queues. The out-of-order issue queue structure inherent in today’s high-end superscalar processors is a known “hot-spot” in terms of power dissipation. The data shown in Figures 1 and 2, and also corroborative data from

Fig. 3. Unit-wise utilization stack (TPC-C trace)

other processors (e.g. [1]), makes this an obvious area to focus on. In LPX, our goal is also to compare the achieved power savings with a fixed issue queue design, but with fine-grain clock-gating support, where the valid bit for each issue queue entry is used as a clock-gating control. A basic issue in this context is the extra power that is spent due to the presence of out-of-order execution modes. Is the extra power spent worth the performance gain that is achievable? We wish to understand the fundamental power-performance tradeoffs in the design of issue queues for next-generation processors. Again, simplicity of the adaptive control and monitoring logic is crucial, especially in the context of the LPX prototype test vehicle.

Locally Clocked Execution Pipeline. Based on the data shown in Figures 1 and 2, a typical multi-stage complex arithmetic pipeline is also a high power-density region within the chip. We wish to study the comparative benefit of alternate conditional clocking methods proposed in ongoing work in advanced circuit design groups [21, 23, 16]. In particular, we wish to understand: (a) the benefit of simple valid-bit-based clock-gating in a synchronously clocked execution unit; and (b) the added power-savings benefit of using a locally asynchronous arithmetic unit pipeline within a globally synchronous chip. The asynchronously clocked pipeline structure is based on the IPCMOS circuit technology previously tested in isolation [23] by some in our research team. Such locally clocked methods offer the promise of low power at high performance, with manageable inductive noise (Ldi/dt) characteristics. In LPX, we wish to measure and validate these expectations as the IPCMOS pipe is driven by data in real computational loop kernels.

Power-Efficient Stalling of Synchronous Pipelines. In the synchronous regions of the design, we wish to quantify the amount of power that is consumed by pipeline stall (or “hold/recirculation”) conditions. Anticipating (from circuit

Fig. 4. Methodology for fine-tuning the LPX microarchitecture

simulation coupled with microarchitectural simulation data) such wastage to be significant, we wish to experiment with alternate methods to reduce or eliminate the “stall energy” by using a novel circuit technique called interlocked synchronous pipelines (ISP) that was recently invented by some members of our team [16]. Thus, a basic fetch-issue-execute superscalar processing element (see Sections 4 and 5) was decided upon as the study vehicle for implementation by our small research team. The goal is to study the power-performance characteristics of dynamic adaptation, in microarchitectural terms as well as in clocking terms, with the target of achieving significant power (and especially, power-density) reduction, with acceptable margins of IPC loss.

4 Tuning the Microarchitecture

In this section, we outline the methodology adopted for defining the range of hardware design choices to be studied in the LPX test chip. Since we are constrained by the small size of our design team, and yet the ideas explored are targeted to influence real, full-function processor designs, we adopted the following general method. Figure 4 shows the iterative method used to decide what coarse-level features to add into the LPX test chip, starting from an initial, baseline “bare-bones” fetch-issue-execute model.

– A given, power-efficient microarchitectural design idea is first simulated in the context of a realistic, current-generation superscalar processor model (e.g. POWER4-like microarchitectural parameters) and full workloads (like SPEC and TPC-C) to infer the power-performance benefit. Once a basic hardware heuristic is found to yield tangible benefit - in other words, a significant power reduction at small IPC impact - it is selected for possible implementation in LPX.
– A detailed, trace-driven, cycle-by-cycle simulator for the baseline LPX processor is coded to run a set of application-based and synthetic loop test cases designed to test and quantify the LPX-specific power-performance characteristics of the candidate hardware power-saving feature. In order to get a measurable benefit, it may be necessary to further simplify the heuristic, or augment the microarchitecture minimally to create a new baseline. Once the power-performance benefit is deemed significant, we proceed to the next candidate idea.

In this paper we mainly focus on the second step above, i.e. understanding the fundamental power-performance tradeoff characteristics, using a simple, illustrative loop test case. However, we also refer briefly to example full-model superscalar simulation results to motivate the choice of a particular hardware heuristic.

Energy Models Used. The LPX cycle-by-cycle simulator used to analyze early-stage microarchitectural power-performance tradeoffs has integrated energy models, as in the PowerTimer tool [8]. These energy models were derived largely from detailed, macro-level energy data for POWER4, scaled for size and technology to fit the requirements of LPX. The CPAM tool [19] was used to get this data for most of the structures modeled. Additional experiments were performed at the circuit-simulation level to derive power characteristics of newer latch designs (with and without clock- and stall-based gating). The energy-model-enabled LPX simulator is systematically validated using specially architected test cases. Analytical bounds modeling is used to generate bounds on IPC and unit-wise utilization (post-processed to form power bounds). These serve as reference “signatures” for validating the power-performance simulator. Since the LPX design and model are still evolving, validation exercises must necessarily continue throughout the high-level design process. Details of the energy model derivation and validation are omitted for brevity.
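To give a feel for how such utilization-scaled energy models plug into a cycle simulator, here is a minimal sketch in C. It is our illustration only: the unit names, power numbers, and the ungated clock-tree fraction are invented, and this is not the actual PowerTimer or CPAM code.

  #include <stdio.h>

  /* Illustrative utilization-scaled energy accounting, in the spirit of
   * PowerTimer-style models. All names and numbers are invented. */
  enum { IFU, IDU, ISU, VFXU, LSFX, NUM_UNITS };

  static const double unconstrained_power[NUM_UNITS] = /* watts at 100% activity */
      { 0.6, 0.3, 1.0, 0.9, 0.7 };

  int main(void) {
      /* Active-cycle counts per unit over a simulated window (made up). */
      const long active[NUM_UNITS] = { 700, 650, 500, 400, 450 };
      const long cycles = 1000;

      double total = 0.0;
      for (int u = 0; u < NUM_UNITS; u++) {
          double util = (double)active[u] / (double)cycles;
          /* With ideal clock-gating, dynamic power tracks utilization; the
           * fixed 10% term models ungated overhead such as the clock tree. */
          double watts = unconstrained_power[u] * (0.1 + 0.9 * util);
          printf("unit %d: util %.2f -> %.3f W\n", u, util, watts);
          total += watts;
      }
      printf("total: %.3f W\n", total);
      return 0;
  }

The key point of this modeling style is that per-unit "max" energy characterizations (as obtained from a tool like CPAM) are post-processed with simulator-derived utilizations, which is also why the analytical utilization bounds mentioned above translate directly into power bounds.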

5 High-Level Microarchitecture of LPX

Figure 5 shows a very high-level block diagram of the baseline LPX processor that we started with, before further refinement of the microarchitectural features and parameters through a simulation-based study. The function and storage units shown in dashed edges are ones that are modeled (to the extent required) in the simulation infrastructure, but are not targeted for implementation in the initial LPX design. The primary goal of this design is to experiment with the fetch-issue-execute pipe, which processes a basic set of vector integer arithmetic instructions. These instructions are patterned after a standard 4x32-bit SIMD multimedia extension architecture [11], but simplified in syntax and semantics. The “fetch-and-issue” sub-units act together as a producer of instructions, which

Fig. 5. LPX Processor: High-Level Block Diagram

are consumed by the “execute” sub-unit. The design attempts to balance the dynamic complexity of the producer-consumer pair, with the goal of maximizing performance while minimizing power consumption. The basic instruction processing pipeline is illustrated in Figure 6. The decode/dispatch/rename stage, which is shown as a lumped, dummy dispatch unit in Figure 5, is actually modeled in our simulator as an m-stage pipe, where m=1 in the nominal design point. The nominal VFXU execute pipe is n=4 stages deep. The LSFX execute pipe is p=2 stages (in infinite cache mode) and p=12 stages when a data cache miss stall is injected using the stall control registers (Figure 5); in particular, using a miss-control register (MCR). One of the functional units is the scalar FXU (a combined load-store unit and integer unit, LSFX) and the other is the vector integer unit (VFXU). The VFXU execution pipe is multi-cycle (nominally 4 cycles). The LSFX unit has a 1-cycle pipe plus (nominally) a 1-cycle (infinite) data cache access cycle for loads and stores. At the end of the final execution stage, the results are latched onto the result bus while the target register tags are broadcast to the instructions pending in the issue queue. As a substitute for instruction caching, LPX uses a loop buffer into which a loop (of up to 128 instructions) is pre-loaded prior to processor operation. The loaded program consists of predecoded instructions, with inline explicit specifiers of prerenamed register operands in full out-of-order mode of execution. This avoids the task of designing explicit logic for the instruction decode and rename processes.

Fig. 6. LPX (simulator) pipeline stages

LPX also supports an “in-order” mode, without register renaming, as the lowest-performance design point for our tradeoff experiments. The instructions implemented in LPX are listed below in Table 1. For the most part, these are a set of basic vector (SIMD) mode load, store, and arithmetic instructions, following the general semantics of the PowerPC VMX (vector multimedia extension) architecture [11]. There are a few added scalar RISC (PowerPC-like) instructions to facilitate loading and manipulation of the scalar integer registers required in vector load-store instructions. The (vector) load and store instructions have an implied “update” mode in LPX, where the scalar address base register is auto-incremented to hold the address of the next sequential array data in memory.

Table 1. LPX Instruction Set

  Instruction    Example Syntax        Semantics
  Vector Load    VLD vr1, r2, 0x08     Load vr1. Scalar base address register: r2
  Vector Store   VST vr1, r2, 0x08     Store vr1.
  Vector Add     VADD vr1, vr2, vr3    vr1 <- vr2 + vr3

The vect_add loop kernel used in the experiments that follow:

  VLD  vr1, r2 (0x4)
  VADD vr4, vr1, vr6
  VLD  vr6, r2 (0x8)
  VADD vr4, vr4, vr6
  VST  vr4, r3 (0x8)
  DEC  r7
  BRZ  r7, -0x7

The baseline LPX model parameters were fixed as follows, after initial experimentation. Instruction fetch (ifetch) bandwidth is up to four instructions/cycle, with no fetch beyond a branch on a given cycle. The instruction fetch buffer size is four instructions; dispatch bandwidth (into the issue queue) is up to two instructions/cycle; issue bandwidth (into the execution pipes) is up to two instructions/cycle; and completion bandwidth is also two instructions/cycle. Fetch and dispatch are in-order, issue can be in-order or out-of-order (switchable), and instructions finish out-of-order. (LPX does not model or implement in-order completion for precise interrupt support using reorder buffers.)

6 Examples: LPX Microarchitecture Analysis

Conditional Ifetch. Figures 7(a,b) show a snapshot of analysis data from a typical 4-way, out-of-order superscalar processor model. The data reported is for two benchmarks (AMMP and GCC) from the SPEC2000 suite. It shows that the ifetch stage/buffer, the front-end pipe, and the issue queue/window can be idle for

Fig. 7. Idle, speculative and stall waste: (a) AMMP and (b) GCC

significant fractions of the program run. These are cycles where power can be saved by valid-bit-based clock-gating. In addition, the fraction of cycles that are wasted by useful (but stalled) instructions and by incorrectly fetched speculative instructions can also be significant. Gating off the ifetch process, using a hardware heuristic to compute the gating condition, is therefore a viable approach to saving energy. For LPX, we wish to experiment with the simplest of such heuristics, those that are easy to implement. The basic method used is to employ the “stall” or “impending stall” signals available from “downstream” consumer units to throttle back the “upstream” producer (ifetch). Such stall signals are easy to generate and are usually available in the logic design anyway. Figures 8(a,b) show results from an illustrative use of conditional ifetch while simulating the vect_add loop trace. We use the following simple hardware heuristic for determining the ifetch gating scenario. When a “stall” signal is asserted by the instruction buffer (e.g. when the ibuffer is full), the ifetch process is naturally inhibited in most designs; this is assumed in the baseline model. However, additional power savings can be achieved by retaining the “ifetch-hold” condition for a fetch-gate cycle window, GW, beyond the negation of the ibuffer stall signal. Since the ibuffer was full, it takes a while to drain; hence ifetch can be gated off for GW cycles. Depending on the size of the ibuffer, IPC would be expected to drop off to unacceptable levels beyond a certain value of GW; but increasing GW is expected to reduce IFU (instruction fetch unit) and overall chip power.
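A minimal cycle-level C sketch of this fetch-gating heuristic is given below. The structure and names are our assumptions for illustration, not the LPX RTL.

  #include <stdbool.h>

  /* Fetch-gating controller: keep ifetch gated for GW further cycles
   * after the ibuffer-full stall signal is deasserted. */
  typedef struct {
      int gw;         /* fetch-gate cycle window GW, e.g. 10 */
      int hold_left;  /* cycles of gating still pending      */
  } FetchGate;

  /* Called once per simulated cycle; returns true if ifetch may proceed. */
  bool fetch_enabled(FetchGate *fg, bool ibuffer_stall) {
      if (ibuffer_stall) {
          fg->hold_left = fg->gw;  /* (re)arm the window on every stall cycle */
          return false;            /* fetch is naturally inhibited while stalled */
      }
      if (fg->hold_left > 0) {
          fg->hold_left--;         /* stall is gone, but the ibuffer is draining */
          return false;
      }
      return true;
  }

With gw set to 0 the controller degenerates to the baseline behavior; raising gw trades IPC for IFU and overall chip power, which is the trend visible in Figure 8(b).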

Fig. 8. Conditional ifetch in LPX: (a) power reduction and (b) CPI vs. gating window

Adaptive Issue Queue. Figure 9 shows a snapshot of our generalized simulation-based power-savings projection for various styles of out-of-order issue queue design. (An 8-issue, superscalar, POWER4-like research simulator was used.) These studies showed potential power savings of more than 80% in the issue queue, with at most a 2-3% hit in IPC on average. However, the best power reductions were for adaptive and banked CAM/RAM-based designs that are not easy to design and verify.

Fig. 9. Adaptive issue queue: power saving (generalized 8-issue superscalar)

For LPX, we started with a baseline design of the POWER4 integer issue queue [26], which is a latch-based design. It is organized as two chunks, where in adaptive mode one of the chunks can be shut off completely (to eliminate dynamic and static power). Figure 10 illustrates the benefit of using a simple, LPX-specific adaptive issue queue heuristic that is targeted to reduce power without loss of performance; i.e. the size is adapted downwards only when “safe” to do so from a performance viewpoint, and the size is increased in anticipation of increased demand. (In the example data shown in this paper, we consider only the reduction of dynamic power via such adaptation.) The adaptive issue queue control heuristic illustrated is simpler than those proposed in the detailed studies reported earlier [9], for ease of implementation in the LPX context.

Fig. 10. Adaptive issue queue experiment in LPX

The control heuristic in LPX is as follows:

  if (current-cycle-window-issuecount < 0.5 * last-cycle-window-issuecount)
      then increase-size (* if possible *);
      else decrease-size (* if possible *);

Discussion of Results. From Figure 8(a) we note that adding out-of-order (oo) mode to the baseline in-order (io) machine causes an IPC increase (CPI decrease) of 23.6%, but with a 12.5% overall power increase. The ISU, which contains the issue queue, increases in power by 27.5%. So, from an overall power-performance efficiency viewpoint, the out-of-order (oo) mode does seem to pay off in LPX for this loop trace, in infinite cache mode. However, from a power-density “hot-spot” viewpoint, even this basic enhancement may need to be carefully evaluated with a representative workload suite. Adding the valid-bit-based clock-gating (VB-CG) mode in the instruction buffer, issue queue, and execution unit pipes causes a sharp decrease in power (42.4% from the baseline oo design point). Adding a conditional ifetch mode (with a gating window GW of 10 cycles over which ifetch is blocked after the ibuffer stall signal goes away) yields an additional 18.8% power reduction, without loss of IPC performance. As the gating window GW is increased, we see a further sharp decrease in net power beyond GW=10, but with IPC degradation. For the adaptive issue queue experiment shown (Fig. 10), we see that an 8% reduction in net LPX power is possible; but beyond an adaptation cycle window AW of 1, an 11% increase in CPI (cycles-per-instruction) is incurred. Thus, use of fine-grain, valid-bit-based clock-gating is simpler and more effective than adaptive methods. Detailed results, combining VB-CG and adaptation, will be reported in follow-up research.
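Returning to the adaptive issue queue: for concreteness, the window-based control heuristic given above can be rendered as the following C sketch. This is a hypothetical software model of the controller; the actual LPX control is implemented in hardware.

  /* Window-based issue-queue size control, following the heuristic above.
   * The 2-chunk granularity mirrors the POWER4-style latch-based queue;
   * the names and sampling scheme are illustrative assumptions. */
  typedef struct {
      int aw;           /* adaptation cycle window AW       */
      int cycle;        /* cycles elapsed in current window */
      int cur_issued;   /* instructions issued this window  */
      int last_issued;  /* instructions issued last window  */
      int chunks;       /* active queue chunks (1 or 2)     */
  } IssueQueueCtrl;

  void iq_tick(IssueQueueCtrl *c, int issued_this_cycle) {
      c->cur_issued += issued_this_cycle;
      if (++c->cycle < c->aw)
          return;
      /* End of window: apply the control rule. */
      if (2 * c->cur_issued < c->last_issued) {
          if (c->chunks < 2) c->chunks++;  /* increase size, if possible */
      } else {
          if (c->chunks > 1) c->chunks--;  /* decrease size, if possible */
      }
      c->last_issued = c->cur_issued;
      c->cur_issued = 0;
      c->cycle = 0;
  }

Smaller values of aw make the controller more responsive, which is consistent with the observation above that CPI degrades beyond AW = 1.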


Stall-Based Clock-Gating. As previously alluded to, in addition to valid-bit-based clock-gating in synchronous (and locally asynchronous) pipelines, LPX uses a mode in which an instruction stalling in a buffer or queue for multiple cycles is clock-gated off, instead of the recirculation-based hold strategy often used in high-performance processors. The stall-related energy waste is a significant fraction of queue/buffer power that can be saved if the stall signal is available in time to do the gating. Carefully designed control circuits [16] have enabled us to exploit this feature in LPX. In this version of the paper, we could not include the experimental results that show the additional benefits of such stall-based gating. However, suffice it to say that with the addition of stall-based clock-gating, simulations predict that we are well within the target of achieving a factor of 5 reduction in power and power density, without appreciable loss of IPC performance. The use of a locally asynchronous IPCMOS execution pipe [23] is expected to increase the power reduction even further. Detailed LPX-specific simulation results for these circuit-centric features will be available in subsequent reports.
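To illustrate the distinction between the two hold strategies, the toy C model below counts latch updates under recirculation-based hold versus stall-based clock-gating. This is our illustration only; the ISP technique of [16] implements the gating at the circuit level.

  #include <stdint.h>

  /* Toy model of one pipeline latch held across a multi-cycle stall.
   * 'writes' is a rough proxy for clocking energy. */
  typedef struct {
      uint32_t q;       /* latched value      */
      long     writes;  /* latch update count */
  } Latch;

  /* Recirculation-based hold: the latch is clocked every cycle; during a
   * stall its own output is recirculated through a mux, so clock energy
   * is still spent on every stall cycle. */
  void tick_recirculate(Latch *l, uint32_t d, int stall) {
      l->q = stall ? l->q : d;
      l->writes++;
  }

  /* Stall-based clock-gating: the stall signal gates the clock, so the
   * latch is simply not written while the instruction is held. */
  void tick_gated(Latch *l, uint32_t d, int stall) {
      if (!stall) {
          l->q = d;
          l->writes++;
      }
  }

In this toy accounting, an n-cycle stall costs n latch updates under recirculation but only one under gating, which is where the "stall energy" savings come from, provided the stall signal arrives early enough to gate the clock.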

7 Conclusions and Future Work

We presented the early-stage definition of LPX: a low-power issue-execute processor prototype that is designed to serve as a measurement and evaluation vehicle for a few new ideas in adaptive microarchitecture and conditional clocking. We described the methodology that was used to architect and tune simple hardware heuristics in the prototype test chip, with the goal of drawing meaningful conclusions of use in future products. We presented a couple of simple examples to illustrate the process of definition and to report the expected power-performance benefits of the illustrated adaptive features. The basic idea of fetch-throttling to conserve power is not new. In addition to work that we have already alluded to [18, 14, 5], Sanchez et al. [22] describe a fetch-stage throttling mechanism for the G3 and G4 PowerPC processors. The throttling mode in the prior PowerPC processors was architected to respond to thermal emergencies. The work reported in [18, 14, 5] and the new gating heuristics described in this paper and in [17] are aimed at reducing average power during normal operation. Similarly, the adaptive issue queue control heuristics being developed for LPX are intended to be simpler adaptations of our prior general work [9]. We believe that the constraint of designing a simple test chip with a small design team forces us to experiment with heuristics that are easy to implement with low overhead. If some of these heuristics help create relatively simple power management solutions for a full-function, production-quality processor, then the investment in LPX development will be easily justified. In addition to the adaptive microarchitecture principles alluded to above, the team is considering the inclusion of other ideas in the simulation toolkit; some of these remain candidates for inclusion in the actual LPX definition, at least for LPX-II, a follow-on design. The following is a partial list of these other ideas:


– Adaptive, power-efficient cache and register file designs: these were not considered for implementation in the initial LPX prototype, due to lack of seasoned SRAM designers in our research team. In particular, as a candidate data cache design for LPX-II, we are exploring ideas that combine prior energy-efficient solutions [1, 4, 2, 15] with recently proposed, high-performance split-cache architectures [24, 25].
– Exploiting the data sparseness of vector/SIMD-mode execution, through hardware features that minimize clocking waste in processing vector data that contains lots of zeroes.
– Newer features that reduce static (leakage) power waste.
– Adding monitoring hardware to measure current swings in clock-gated and adaptive structures.

Acknowledgement The authors are grateful to Dan Prener, Jaime Moreno, Steve Kosonocky, Manish Gupta and Lew Terman (all at IBM Watson) for many helpful discussions before the inception of the LPX design project. Special thanks are due to Scott Neely (IBM Watson), Michael Wang and Gricell Co (IBM Austin) for access to CPAM and the detailed energy data used in our power analysis. The support and encouragement received from senior management at IBM - in particular Mike Rosenfield and Eric Kronstadt - are gratefully acknowledged. The authors would like to thank Joel Tendler (IBM Austin) for his comments and suggestions during the preparation and clearance of this paper. Also, the comments provided by the anonymous reviewers were useful in preparing an improved final version; this help is gratefully acknowledged. The work done in this project at University of Rochester is supported in part by NSF grants CCR–9701915, CCR–9702466, CCR–9705594, CCR–9811929, EIA–9972881, CCR–9988361 and EIA–0080124; by DARPA/ITO under AFRL contract F29601-00-K-0182; and by an IBM Faculty Partnership Award. Additional support, at University of Wisconsin, for continuation of research on power-efficient microarchitectures, is provided by NSF grant CCR-9900610.

References

[1] D. H. Albonesi. The inherent energy efficiency of complexity-effective processors. In Power-Driven Microarchitecture Workshop at ISCA-25, June 1998.
[2] D. H. Albonesi. Selective cache ways: on-demand cache resource allocation. In Proceedings of the 32nd International Symposium on Microarchitecture (MICRO-32), pages 248–259, Nov. 1999.
[3] C. Anderson et al. Physical design of a fourth-generation POWER GHz microprocessor. In ISSCC Digest of Technical Papers, page 232, 2001.
[4] R. Balasubramonian, D. Albonesi, A. Buyuktosunoglu, and S. Dwarkadas. Memory hierarchy reconfiguration for energy and performance in general purpose architectures. In Proceedings of the 33rd International Symposium on Microarchitecture (MICRO-33), pages 245–257, Dec. 2000.
[5] A. Baniasadi and A. Moshovos. Instruction flow-based front-end throttling for power-aware high performance processors. In Proceedings of the International Symposium on Low Power Electronics and Design, August 2001.
[6] S. Borkar. Design challenges of technology scaling. IEEE Micro, 19(4):23–29, July-August 1999.
[7] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: A framework for architectural-level power analysis and optimizations. In Proceedings of the 27th International Symposium on Computer Architecture (ISCA-27), June 2000.
[8] D. Brooks, J.-D. Wellman, P. Bose, and M. Martonosi. Power-performance modeling and tradeoff analysis for a high-end microprocessor. In Power Aware Computing Systems Workshop at ASPLOS-IX, Nov. 2000.
[9] A. Buyuktosunoglu et al. An adaptive issue queue for reduced power at high performance. In Power Aware Computing Systems Workshop at ASPLOS-IX, Nov. 2000.
[10] A. Dhodapkar, C. Lim, and G. Cai. TEM2P2EST: A thermal enabled multi-model power/performance estimator. In Power Aware Computing Systems Workshop at ASPLOS-IX, Nov. 2000.
[11] K. Diefendorff, P. Dubey, R. Hochsprung, and H. Scales. AltiVec extension to PowerPC accelerates media processing. IEEE Micro, pages 85–95, April 2000.
[12] D. Folegnani and A. Gonzalez. Energy-effective issue logic. In Proceedings of the 28th International Symposium on Computer Architecture (ISCA-28), pages 230–239, June 2001.
[13] M. Gowan, L. Biro, and D. Jackson. Power considerations in the design of the Alpha 21264 microprocessor. In 35th Design Automation Conference, 1998.
[14] D. Grunwald, A. Klauser, S. Manne, and A. Pleszkun. Confidence estimation for speculation control. In Proceedings of the 25th International Symposium on Computer Architecture (ISCA-25), pages 122–131, June 1998.
[15] K. Inoue et al. Way-predicting set-associative cache for high performance and low energy consumption. In Proceedings of the International Symposium on Low Power Electronics and Design, pages 273–275, August 1999.
[16] H. Jacobson et al. Synchronous interlocked pipelines. IBM Research Report RC 22239 (to appear in ASYNC-2002), IBM T. J. Watson Research Center, Oct. 2001.
[17] T. Karkhanis et al. Saving energy with just-in-time instruction delivery. Submitted for publication.
[18] S. Manne, A. Klauser, and D. Grunwald. Pipeline gating: Speculation control for energy reduction. In Proceedings of the 25th International Symposium on Computer Architecture (ISCA-25), pages 132–141, June 1998.
[19] J. Neely et al. CPAM: A common power analysis methodology for high performance design. In Proc. 9th Topical Meeting on Electrical Performance of Electronic Packaging, Oct. 2000.
[20] D. Ponomarev, G. Kucuk, and K. Ghose. Dynamic allocation of datapath resources for low power. In Workshop on Complexity-Effective Design at ISCA-28, June 2001.
[21] J. Rabaey and M. Pedram, editors. Low Power Design Methodologies. Kluwer Academic Publishers, 1996. Proceedings of the NATO Advanced Study Institute on Hardware/Software Co-Design.
[22] H. Sanchez et al. Thermal management system for high performance PowerPC microprocessors. In Digest of Papers, COMPCON IEEE Computer Society International Conference, page 325, 1997.
[23] S. Schuster et al. Asynchronous interlocked pipelined CMOS circuits operating at 3.3-4.5 GHz. In ISSCC Digest of Technical Papers, pages 292–293, February 2000.
[24] V. Srinivasan. Hardware Solutions to Reduce Effective Memory Access Time. PhD thesis, University of Michigan, Ann Arbor, February 2001.
[25] V. Srinivasan et al. Recovering single cycle access of primary caches. Submitted for publication.
[26] J. M. Tendler, S. Dodson, S. Fields, H. Le, and B. Sinharoy. POWER4 system microarchitecture. IBM Journal of Research and Development, 46(1):5–26, 2002.
[27] S.-H. Yang et al. An energy-efficient high performance deep-submicron instruction cache. IEEE Transactions on VLSI Systems, Special Issue on Low Power Electronics and Design, 2001.

Dynamic Tag-Check Omission: A Low Power Instruction Cache Architecture Exploiting Execution Footprints

Koji Inoue(1), Vasily Moshnyaga(1), and Kazuaki Murakami(2)

(1) Dept. of Electronics Engineering and Computer Science, Fukuoka University, 8-19-1 Nanakuma, Jonan-ku, Fukuoka 814-0133, Japan
(2) Dept. of Informatics, Kyushu University, 6-1 Kasuga-Koen, Kasuga, Fukuoka 816-8580, Japan

Abstract. This paper proposes an architecture for low-power direct-mapped instruction caches, called the “history-based tag-comparison (HBTC) cache”. The HBTC cache attempts to detect and omit unnecessary tag checks at run time. Execution footprints are recorded in an extended BTB (Branch Target Buffer), and are used to determine the cache residence of target instructions before starting a cache access. In our simulation, it is observed that our approach can reduce the total count of tag checks by 90%, resulting in a 15% cache-energy reduction, with less than 0.5% performance degradation.

1 Introduction

On-chip caches have been playing an important role in achieving high performance. In particular, instruction caches have a great impact on processor performance because one or more instructions have to be issued on every clock cycle. For the same reason, from the energy point of view, instruction caches consume a lot of energy. Therefore, it is strongly required to reduce the energy consumed by instruction-cache accesses. On a conventional cache access, tag checks and data read are performed in parallel. Thus, the total energy consumed for a cache access consists of two factors: the energy for tag checks and that for data read. In conventional caches, the height (or the total number of word-lines) of the tag memory and that of the data memory are equal, but not the width (or the total number of bit-lines). The tag-memory width depends on the tag size, while the data-memory width depends on the cache-line size. Usually, the tag size is much smaller than the cache-line size. For example, in the case of a 16 KB direct-mapped cache having 32-byte lines, the cache-line size is 256 bits (32 × 8), while the tag size is 18 bits (32 bits - 9-bit index - 5-bit offset). Thus, the total cache energy is dominated by data-memory accesses. Cache subbanking is one of the approaches to reducing the data-memory-access energy. The data-memory array is partitioned into several subbanks, and only one subbank, containing the target data, is activated [6].
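The field widths in the 16 KB example above follow mechanically from the cache geometry; the short C sketch below reproduces the arithmetic (illustrative only).

  #include <stdio.h>

  /* Derive tag/index/offset widths for a direct-mapped cache. For 16 KB
   * capacity, 32-byte lines and 32-bit addresses this yields offset = 5,
   * index = 9, tag = 18 bits, as in the text. */
  static unsigned log2u(unsigned x) {
      unsigned n = 0;
      while (x > 1) { x >>= 1; n++; }
      return n;
  }

  int main(void) {
      const unsigned cache_bytes = 16 * 1024;
      const unsigned line_bytes  = 32;
      const unsigned addr_bits   = 32;

      unsigned lines  = cache_bytes / line_bytes;    /* 512 lines */
      unsigned offset = log2u(line_bytes);           /* 5 bits    */
      unsigned index  = log2u(lines);                /* 9 bits    */
      unsigned tag    = addr_bits - index - offset;  /* 18 bits   */

      printf("offset=%u index=%u tag=%u line=%u bits\n",
             offset, index, tag, line_bytes * 8);
      return 0;
  }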

Fig. 1. Effect of tag-check energy

Figure 1 depicts the breakdown of cache-access energy of a 16 KB direct-mapped cache with a varied number of subbanks. We have calculated the energy based on Kamble's model [6]. All the results are normalized to a conventional configuration, denoted as “1(8)”. It is clear from the figure that increasing the number of subbanks yields a significant reduction in data-memory energy. Since the tag-memory energy is unchanged, however, it becomes a significant factor. If the number of subbanks is 8, about 30% and 50% of the total energy is dissipated by the tag memory where the word size is 32 bits and 64 bits, respectively. In this paper, we focus on the energy consumed for tag checks, and propose an architecture for low-power direct-mapped instruction caches, called the “history-based tag-comparison (HBTC) cache”. The basic idea of the HBTC cache has been introduced in [4]. The HBTC cache attempts to detect and omit unnecessary tag checks at run time. When an instruction block is referenced without causing any cache miss, a corresponding execution footprint is recorded in an extended BTB (Branch Target Buffer). All execution footprints are erased whenever a cache miss takes place, because the instruction block (or a part of the instruction block) might have been evicted from the cache. The execution footprint indicates whether the instruction block currently resides in the cache. At and after the next execution of that instruction block, if the execution footprint is detected, all tag checks are omitted.
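A behavioral C sketch of this footprint mechanism is given below. It is our approximation for illustration only; the paper keeps the footprints in an extended BTB, whose organization is described in Section 3.

  #include <stdbool.h>
  #include <string.h>

  /* Behavioral sketch of history-based tag-check omission. One footprint
   * bit per instruction block; a plain array stands in for the extended
   * BTB. Sizes and names are illustrative assumptions. */
  #define NUM_BLOCKS 1024

  static bool footprint[NUM_BLOCKS];  /* "executed since the last miss" */

  /* On any cache miss, a resident block may have been (partly) evicted,
   * so all execution footprints are erased. */
  void on_cache_miss(void) {
      memset(footprint, 0, sizeof footprint);
  }

  /* Record a footprint once a block has executed without a cache miss. */
  void record_footprint(unsigned block_id) {
      footprint[block_id % NUM_BLOCKS] = true;
  }

  /* Tag checks may be omitted only when the footprint guarantees that
   * the block still resides in the direct-mapped instruction cache. */
  bool may_omit_tag_check(unsigned block_id) {
      return footprint[block_id % NUM_BLOCKS];
  }

The conservative erase-on-miss rule is what makes the optimization safe: a set footprint bit is only ever a promise that no miss has intervened since the block was last fetched in full.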


our approach can reduce the total count of tag checks by 90 %, resulting in a 15 % cache-energy reduction, with less than 0.5 % performance degradation. The rest of this paper is organized as follows. Section 2 reviews related work, and explains in detail another tag-check-omission technique, proposed in [11], which serves as a comparative approach. Section 3 presents the concept and mechanism of the HBTC cache. Section 4 reports evaluation results on the performance/energy efficiency of our approach, and Section 5 concludes this paper.

2 Related Work

A technique to reduce the frequency of tag checks has been proposed in [11]. If successively executed instructions i and j reside in the same cache line, then we can omit the tag check for instruction j. Namely, the cache proposed in [11] performs tag checks only when i and j reside in different cache lines. We call this cache the interline tag-comparison cache (ITC cache). This kind of traditional technique has been employed in commercial microprocessors, e.g., ARM processors. The ITC cache detects unnecessary tag checks by monitoring the program counter (PC). In contrast to the ITC cache, our approach exploits an extended BTB in order to record instruction-access history, and can omit unnecessary tag checks even if successive instructions reside in different cache lines. In Section 4.2, we compare our approach with the ITC cache. Direct Addressing (DA) is another scheme to omit tag checks [13]. In DA, previous tag-check results are recorded in the DA register and reused for future cache accesses. The DA register is controlled by the compiler, whereas our HBTC cache does not need any software support. Note that the ITC cache and the DA scheme can be used for both instruction caches and data caches, while our HBTC cache can be used only for direct-mapped instruction caches. The extension to set-associative caches is discussed in Section 5. Ma et al. [9] have proposed a dynamic approach to omitting tag checks. In their approach, the cache-line structure is extended to record valid links, and a branch-link is implemented per two instructions. Their approach can be applied regardless of cache associativity. The HBTC cache is an alternative way to implement their approach on direct-mapped instruction caches, and can be organized with smaller hardware overhead. This is because the HBTC cache records 1-bit cache-residence information for each instruction block, which can be larger than a cache line. The S-cache has also been proposed in [11]. The S-cache is a small memory added to the L1 cache, and has a statically allocated address space. No cache replacements occur in the S-cache; therefore, S-cache accesses can be performed without tag checks. The scratchpad memory [10], the loop cache [3], and the decompressor memory [5] also employ this kind of small memory, and have the same effect as the S-cache. In the scratchpad memory and the loop cache, application programs are analyzed statically, and the compiler allocates frequently executed instructions to the small memory. For the S-cache and the decompressor memory, prior simulations using an input-data set are required to optimize the code allocation.


They differ from ours in two aspects. First, these caches require static analysis. Second, the cache has to be separated into a dynamically allocated memory space (i.e., the main cache) and a statically allocated memory space (i.e., the small cache). The HBTC cache does not require these arrangements. The filter cache [8] achieves low power consumption by adding a very small L0-cache between the processor and the L1-cache. The advantage of the L0-cache largely depends on how many memory references hit the L0-cache. Block buffering can achieve the same effect as the filter cache [6]. Bellas et al. [2] proposed a run-time cache-management technique to allocate the most frequently executed instruction blocks to the L0-cache. On L0-cache hits, accessing both the tag memory and the data memory of the L1-cache is avoided, so tag checks at the L1-cache are naturally not performed. However, on L0-cache misses, the L1-cache is accessed with conventional behavior (tag checks are required). Our approach can be used in conjunction with L0-caches in order to avoid L1-cache tag checks.

3 History-Based Tag-Comparison Cache

3.1 Concept

On an access to a direct-mapped cache, a tag check is performed to determine whether the memory reference hits the cache. For almost all programs, instruction caches achieve very high hit rates. In other words, the state (or contents) of the instruction cache is rarely changed. Only when a cache miss takes place is the state of the instruction cache changed, by filling the missed instruction (and some instructions residing in the same cache line as the missed instruction). Therefore, if an instruction is referenced once, it stays in the cache at least until the next cache miss occurs. We refer to the period between one cache miss and the next as a stable-time. Now consider the case where an instruction is executed repeatedly. At the first reference of the instruction, the tag check has to be performed. However, at and after the second reference, if no cache miss has occurred since the first reference, it is guaranteed that the target instruction currently resides in the cache. Therefore, for accesses to the same instruction in a stable-time, performing a tag check is absolutely required at the first reference, but not at the following references. We can omit tag checks if the following conditions are satisfied:

– The target instruction has been executed at least once.
– No cache miss has occurred since the previous execution of the target instruction.

Figure 2 shows how many unnecessary tag checks are performed in a conventional 16 KB direct-mapped cache for two SPEC benchmark programs. The simulation environment is explained in Section 4.1. The y-axis is the average reference count (capped at one hundred) for each cache line per stable-time; cases where a cache line was never referenced in a stable-time are ignored. The x-axis is the cache-line address.

Fig. 2. Opportunity of tag-check omission (average number of references per cache line per stable-time, for 129.compress and 132.ijpeg)

It can be seen from the figure that the conventional cache wastes a lot of energy on unnecessary tag checks: almost all cache lines are referenced more than four times in a stable-time, and some cache lines are referenced more than one hundred times. In order to detect the conditions for omitting unnecessary tag checks, the HBTC cache records execution footprints in an extended BTB (Branch Target Buffer). An execution footprint indicates whether the target-instruction block or fall-through-instruction block associated with a branch resides in the cache. An execution footprint is recorded after all instructions in the corresponding instruction block have been referenced. All execution footprints are erased, or invalidated, whenever a cache miss takes place. At the execution of an instruction block, if the corresponding execution footprint is detected, we can fetch instructions without performing tag checks.

3.2 Organization

Figure 3 depicts the organization of the extended BTB. The following two 1-bit flags are added to each BTB entry.

Fig. 3. The organization of a direct-mapped HBTC cache (extended BTB entries carry EFT/EFF flags; a mode controller and the PBAreg decide whether tag checks in the instruction cache may be omitted)

– EFT (Execution Footprint of Target instructions): an execution footprint of the branch-target-instruction block, whose beginning address is indicated by the target address of the current branch.
– EFF (Execution Footprint of Fall-through instructions): an execution footprint of the fall-through-instruction block, whose beginning address is indicated by the fall-through address of the current branch.

The end address of the branch-target- or fall-through-instruction block is indicated by another branch-instruction address which is already registered in the BTB, as shown in Figure 3. In addition, the following hardware components are required (a C sketch of these structures follows the list).

– Mode Controller: selects one of the following operation modes, based on the execution footprints read from the extended BTB. The details of operation are explained in Section 3.3.
  • Normal mode (Nmode): the HBTC cache behaves as a conventional cache (tag checks are performed).
  • Omitting mode (Omode): tag checks for instruction-cache accesses are omitted.
  • Tracing mode (Tmode): the HBTC cache behaves as a conventional cache (tag checks are performed). When a BTB hit is detected in this mode, the execution footprint indexed by the PBAreg is set to ‘1’.
– PBAreg (Previous Branch-instruction Address register): a register that keeps the previous branch-instruction address, together with its prediction result (taken or not-taken).
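To make the BTB extension concrete, the following C sketch models one extended BTB entry and the PBAreg. The field names and widths are our own illustrative choices, not taken from the paper's hardware description.

```c
#include <stdint.h>

/* One entry of the extended BTB (illustrative layout). Only the eft and
 * eff bits are new; the other fields exist in a conventional BTB. */
typedef struct {
    uint32_t branch_pc;        /* branch-instruction address (tag)        */
    uint32_t target;           /* predicted branch-target address         */
    uint8_t  eft : 1;          /* execution footprint, taken path         */
    uint8_t  eff : 1;          /* execution footprint, fall-through path  */
} ExtendedBTBEntry;

/* PBAreg: address and predicted direction of the previous branch, used
 * to locate the footprint to validate on the next BTB hit in Tmode.    */
typedef struct {
    uint32_t branch_pc;
    uint8_t  predicted_taken : 1;
} PBAReg;
```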

Fig. 4. Operation-mode transition (Normal, Tracing, and Omitting modes; an I-cache miss, BTB replacement, RAS access, or branch misprediction returns the cache to Normal mode)

3.3 Operation

Execution footprints (i.e., the EFT and EFF flags) are set or erased at run time. Figure 4 shows the operation-mode transitions. On every BTB hit, the HBTC cache works as follows:

1. Regardless of the current operation mode, both the EFT and EFF flags associated with the BTB-hit entry are read in parallel.
2. Based on the branch-prediction result, EFT (for taken) or EFF (for not-taken) is selected.
3. If the selected execution footprint is ‘1’, the operation mode transitions to Omode.
4. Otherwise, the operation mode transitions to Tmode. At that time, the current PC (the branch-instruction address) and the branch-prediction result are stored into the PBAreg.

Whenever a cache miss takes place, the operation mode transitions to Nmode, as explained in the next paragraph. Therefore, a BTB hit in Tmode means that no cache miss has occurred since the previous BTB hit; in other words, the instruction block whose beginning address is indicated by the PBAreg and whose end address is indicated by the current branch-instruction address has been referenced without causing any cache miss. Thus, when a BTB hit occurs in Tmode, the execution footprint indexed by the PBAreg is validated (set to 1). If one of the following events takes place, execution footprints have to be invalidated, and the operation mode transitions to Nmode (a code sketch of this control flow follows the list):

– Instruction-cache miss: the state of the instruction cache is changed by filling the missed instruction. The cache-line replacement might evict an instruction block (or a part of it) corresponding to valid execution footprints from the cache. Therefore, the execution footprints of the victim line have to be invalidated.


– BTB replacement: as explained in Section 3.2, the end address of an instruction block is indicated by another branch-instruction address already registered in the BTB. We lose this end-address information when the BTB entry is evicted. Thus, the execution footprints of the instruction block whose end address is indicated by the victim BTB entry have to be invalidated.

Although it is possible to invalidate only the execution footprints affected by the cache miss or the BTB replacement, we have employed a conservative scheme: all execution footprints in the extended BTB are invalidated. In addition, when an indirect jump is executed, or a branch misprediction is detected, the HBTC cache works in Nmode (tag checks are performed as in the conventional organization). These decisions make it possible to avoid area overhead and complex control logic.
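The following C sketch summarizes the mode-transition logic described above and in Figure 4, using the ExtendedBTBEntry and PBAReg types sketched earlier. It is a minimal functional model under our own naming assumptions, not the authors' hardware description; set_footprint and invalidate_all_footprints are hypothetical helpers standing in for the BTB-update logic.

```c
typedef enum { NMODE, TMODE, OMODE } Mode;

/* Hypothetical helpers: set the EFT/EFF bit of the entry tagged with
 * branch_pc, and clear every footprint in the extended BTB.          */
static void set_footprint(uint32_t branch_pc, int taken) { (void)branch_pc; (void)taken; }
static void invalidate_all_footprints(void) { }

/* Invoked on every BTB hit. */
Mode on_btb_hit(Mode mode, const ExtendedBTBEntry *e, int predicted_taken,
                PBAReg *pba, uint32_t pc)
{
    if (mode == TMODE)  /* no miss since the last hit: validate traced block */
        set_footprint(pba->branch_pc, pba->predicted_taken);
    int footprint = predicted_taken ? e->eft : e->eff;
    if (footprint)
        return OMODE;                          /* omit tag checks for this block */
    pba->branch_pc = pc;                       /* remember where tracing started */
    pba->predicted_taken = (uint8_t)predicted_taken;
    return TMODE;
}

/* Invoked on an I-cache miss, BTB replacement, RAS access,
 * or branch misprediction: conservative global invalidation. */
Mode on_invalidation_event(void)
{
    invalidate_all_footprints();
    return NMODE;
}
```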

3.4 Advantages and Disadvantages

The total energy dissipated in the HBTC cache ($E_{TOTAL}$) can be expressed as follows:

$$E_{TOTAL} = E_{CACHE} + E_{BTBadd},$$

where $E_{CACHE}$ is the energy consumed in the instruction cache and $E_{BTBadd}$ is the additional energy for the BTB extension. The energy consumed in the conventional BTB organization is not included. $E_{CACHE}$ can be approximated by the following equation:

$$E_{CACHE} = E_{tag} + E_{data} + E_{output} + E_{ainput},$$

where $E_{tag}$ and $E_{data}$ are the energy consumed in the tag memory and the data memory, respectively, $E_{output}$ is the energy for driving output buses, and $E_{ainput}$ is that for address decoding. In this paper, we do not consider $E_{ainput}$, because it has been reported to be about three orders of magnitude smaller than the other components [1] [8]. $E_{BTBadd}$ can be expressed as follows:

$$E_{BTBadd} = E_{BTBef} + E_{BTBlogic},$$

where $E_{BTBef}$ is the energy consumed for reading and writing execution footprints, and $E_{BTBlogic}$ is that for the control logic (i.e., the mode controller and the PBAreg). The logic portion can be implemented with simple and small hardware, so we do not account for $E_{BTBlogic}$. In Omitting mode (Omode), the energy consumed for tag checks ($E_{tag}$) is completely eliminated. However, the energy for accessing execution footprints ($E_{BTBef}$) appears as overhead on every BTB access. From the performance point of view, on the other hand, the HBTC cache causes some degradation. Reading execution footprints can be performed in parallel with the normal BTB access from the microprocessor. For writing, however, the HBTC cache causes one processor-stall cycle, because the BTB entry accessed for execution-footprint writing and that for branch-target reading are different. Whenever


a cache miss or BTB replacement takes place, execution-footprint invalidation is required. This operation also causes processor-stall cycles, because BTB access from the microprocessor has to wait until the invalidation is completed. The invalidation penalty largely depends on the implementation of the BTB. In Section 4.2, we discuss the effects of the invalidation penalty on processor performance.

4 Evaluation

4.1 Simulation Environment

In order to evaluate the performance/energy efficiency of the HBTC cache, we have measured the total energy consumption ($E_{TOTAL}$) explained in Section 3.4, and the total clock cycles as performance. We modified the SimpleScalar source code for this simulation [15]. To calculate energy consumption, the cache energy model for a 0.8 µm CMOS technology explained in [6] was used. We took the load capacitance for each node from [7] [12]. In this simulation, the following configuration was assumed: the instruction-cache size is 16 KB, the cache-line size is 32 B, the number of direct-mapped branch-prediction-table entries is 2048, the predictor type is bimod, the number of BTB sets is 512, the BTB associativity is 4, and the RAS size is 8. For other parameters, the default values of the SimpleScalar out-of-order simulator were used. In addition, we assumed that all caches evaluated in this paper employ the subbanking approach: the 16 KB data memory is partitioned into 4 subbanks. The following benchmark programs were used in this evaluation.

– SPECint95 [16]: 099.go, 124.m88ksim, 126.gcc, 129.compress, 130.li, 132.ijpeg (using training input).
– Mediabench [14]: adpcm encode, adpcm decode, mpeg2 encode, mpeg2 decode.

4.2 Results

Tag-Check Count. Figure 5 shows the tag-check counts required for whole program executions. All results are normalized to a 16 KB conventional cache. The figure includes the simulation results for the ITC cache explained in Section 2, and for the combination of the ITC cache and the HBTC cache. Since sequential accesses are inherent in programs, the ITC cache works well for all benchmark programs, whereas the effectiveness of the HBTC cache is application dependent. The HBTC cache produces a larger tag-check count reduction than the ITC cache for two SPEC integer programs, 129.compress and 132.ijpeg, and for all media programs. In the best case, adpcm_dec, the tag-check count is reduced by about 90 %. However, for the other benchmark programs, the ITC cache is superior to our approach. This result can be understood by considering the characteristics of the benchmark programs. Media application programs have relatively well-structured loops, and the HBTC cache attempts to avoid performing unnecessary tag checks by exploiting iterative execution behavior. Thus, we can

Fig. 5. Tag-check count compared with other approaches (tag-check counts, normalized to a conventional cache, for the ITC cache, the HBTC cache, and the combination of the two)

consider that if our main target is media applications, employing the HBTC cache brings energy advantages; otherwise, we should employ the ITC cache. The hybrid model of the ITC cache and the HBTC cache makes significant reductions: it eliminates between 80 % and 95 % of the unnecessary tag checks for all benchmark programs. Therefore, we conclude that combining the ITC and HBTC caches is the best approach to avoiding the energy dissipation caused by unnecessary tag checks.

Energy Consumption. Figure 6 reports the energy consumption of the HBTC cache and its breakdown for each benchmark program. All results are normalized to the conventional cache. As explained in Section 4.1, a 0.8 µm CMOS technology is assumed. The energy model used in this paper does not account for the energy consumed in sense amplifiers. However, we believe that the energy reduction reported in this section can be achieved even if sense amplifiers are considered, because tag-memory accesses are completely eliminated when the HBTC cache works in Omitting mode, so the energy consumed in the tag-memory sense amplifiers is also eliminated. As discussed above, the HBTC cache achieves a significant tag-check count reduction for 129.compress, 132.ijpeg, and all media application programs.


Fig. 6. Cache-energy consumption (normalized breakdown into $E_{data}$, $E_{tag}$, $E_{output}$, and $E_{BTBadd}$)

Since the extension of each BTB entry for execution footprints is only 2 bits, the energy overhead for BTB accesses ($E_{BTBadd}$) does not have a large impact on the total cache energy. As a result, the HBTC cache reduces the total cache energy by about 15 %. However, for 099.go and 126.gcc, the energy reduction is only 2 % to 3 %, because the HBTC cache could not effectively eliminate unnecessary tag checks due to the irregular behavior of the program execution.

Performance Overhead. As explained in Section 3.4, the HBTC cache causes processor stalls when the extended BTB is updated for recording or invalidating execution footprints. Figure 7 shows the program-execution time in terms of the total number of clock cycles. All results are normalized to the conventional organization. From the simulation results, it is observed that the performance degradation is less than 1 % for all but three benchmark programs. For 126.gcc, however, the performance is degraded by about 2.5 %, which might not be acceptable if high performance is strictly required. The processor stalls are caused by BTB accesses from the processor conflicting with the update operations on execution footprints.


Fig. 7. Program execution time (clock cycles, normalized to the conventional organization)

In order to alleviate the negative effect of the HBTC cache, we can consider two approaches. The first is to pre-decode fetched instructions. Since the conventional BTB is accessed on every instruction fetch regardless of the instruction type, processor stalls occur whenever execution footprints are updated. By pre-decoding fetched instructions, we can determine whether or not the BTB has to be accessed before starting the normal BTB access. In this case, processor stalls occur only when a branch (or jump) instruction conflicts with an execution-footprint update. Another approach to compensating for the processor stalls is to add decoder logic for accessing execution footprints. This makes it possible to access the BTB for obtaining the branch-target address and for updating execution footprints simultaneously.

Effects of Execution-Footprint-Invalidation Penalty. All execution footprints recorded in the extended BTB are invalidated whenever a cache miss, or a BTB replacement, takes place. So far, we have assumed that the invalidation can be completed in one processor-clock cycle. However, the invalidation penalty largely depends on the implementation of the extended BTB.

Fig. 8. Effect of the execution-footprint invalidation penalty (execution time, normalized to the conventional organization, as the invalidation penalty varies from 1 to 32 clock cycles)

Figure 8 depicts the performance overhead caused by the HBTC approach when the invalidation penalty is varied from 1 to 32 cycles. The y-axis indicates the program-execution time normalized to the conventional organization for all benchmark programs, and the x-axis shows the invalidation penalty in terms of clock cycles. For all benchmark programs, it is observed that the performance degradation is trivial if the invalidation penalty is equal to or less than 4 clock cycles. We have analyzed the breakdown of the invalidations, and have found that more than 98 % are caused by cache misses (less than 2 % are caused by BTB replacements). The invalidation penalty can be hidden if it is smaller than the cache-miss penalty; in this evaluation, we have assumed a cache-miss penalty of 6 clock cycles. The invalidation penalty clearly appears where it is greater than 6 clock cycles, so we see large performance degradation for 099.go and 126.gcc. On the other hand, for 132.ijpeg, adpcm_enc, adpcm_dec, and mpeg2 decode, the performance degradation is small even if the invalidation penalty is large. This is because the cache-miss rates for these programs are low, resulting in a small number of invalidations. Indeed, the cache-miss rates of 099.go, 126.gcc, 132.ijpeg, and mpeg2 decode were 4.7 %, 5.5 %, 0.5 %, and 0.5 %, respectively.

5 Conclusions

In this paper, we have proposed the history-based tag-comparison (HBTC) cache for low energy consumption. The HBTC cache exploits the following two facts: first, instruction-cache hit rates are very high; second, almost all programs have many loops. The HBTC cache records execution footprints, and determines whether the instructions to be fetched currently reside in the cache without performing tag checks. An extended branch target buffer (BTB) is used to record the execution footprints. In our simulation, it has been observed that the HBTC cache can reduce the total count of tag checks by about 90 %, resulting in a 15 % cache-energy reduction. In our evaluation, it has been assumed that the BTB size, or the total number of BTB entries, is fixed. Our future work is to evaluate the effects of the BTB size on the energy reduction achieved by the HBTC cache. In addition, the effects of the branch-predictor type will be evaluated. Another direction for future work is to establish a microarchitecture for set-associative caches: by memorizing way-access information as proposed in [9], we can extend the HBTC approach to set-associative caches.

References

[1] Bahar, I., Albera, G., and Manne, S.: Power and Performance Tradeoffs using Various Caching Strategies. Proc. of the 1998 International Symposium on Low Power Electronics and Design, pp. 64–69, Aug. 1998.
[2] Bellas, N., Hajj, I., and Polychronopoulos, C.: Using dynamic cache management techniques to reduce energy in a high-performance processor. Proc. of the 1999 International Symposium on Low Power Electronics and Design, pp. 64–69, Aug. 1999.
[3] Bellas, N., Hajj, I., Polychronopoulos, C., and Stamoulis, G.: Energy and Performance Improvements in Microprocessor Design using a Loop Cache. Proc. of the 1999 International Conference on Computer Design: VLSI in Computers & Processors, pp. 378–383, Oct. 1999.
[4] Inoue, K. and Murakami, K.: A Low-Power Instruction Cache Architecture Exploiting Program Execution Footprints. International Symposium on High-Performance Computer Architecture, Work-in-progress session (included in the CD proceedings), Feb. 2001.
[5] Ishihara, T. and Yasuura, H.: A Power Reduction Technique with Object Code Merging for Application Specific Embedded Processors. Proc. of the Design, Automation and Test in Europe Conference, pp. 617–623, Mar. 2000.
[6] Kamble, M. and Ghose, K.: Analytical Energy Dissipation Models For Low Power Caches. Proc. of the 1997 International Symposium on Low Power Electronics and Design, pp. 143–148, Aug. 1997.
[7] Kamble, M. and Ghose, K.: Energy-Efficiency of VLSI Caches: A Comparative Study. Proc. of the 10th International Conference on VLSI Design, pp. 261–267, Jan. 1997.
[8] Kin, J., Gupta, M., and Mangione-Smith, W.: The Filter Cache: An Energy Efficient Memory Structure. Proc. of the 30th Annual International Symposium on Microarchitecture, pp. 184–193, Dec. 1997.


[9] Ma, A., Zhang, M., and Asanović, K.: Way Memorization to Reduce Fetch Energy in Instruction Caches. ISCA Workshop on Complexity Effective Design, July 2001.
[10] Panda, R., Dutt, N., and Nicolau, A.: Efficient Utilization of Scratch-Pad Memory in Embedded Processor Applications. Proc. of the European Design & Test Conference, Mar. 1997.
[11] Panwar, R. and Rennels, D.: Reducing the frequency of tag compares for low power I-cache design. Proc. of the 1995 International Symposium on Low Power Electronics and Design, Aug. 1995.
[12] Wilton, S. and Jouppi, N.: An Enhanced Access and Cycle Time Model for On-Chip Caches. WRL Research Report 93/5, July 1994.
[13] Witchel, E., Larsen, S., Ananian, C., and Asanović, K.: Direct Addressed Caches for Reduced Power Consumption. Proc. of the 34th International Symposium on Microarchitecture, Dec. 2001.
[14] MediaBench, URL: http://www.cs.ucla.edu/~leec/mediabench/.
[15] SimpleScalar Simulation Tools for Microprocessor and System Evaluation, URL: http://www.simplescalar.org/.
[16] SPEC (Standard Performance Evaluation Corporation), URL: http://www.specbench.org/osg/cpu95.

A Hardware Architecture for Dynamic Performance and Energy Adaptation

Phillip Stanley-Marbell¹, Michael S. Hsiao², and Ulrich Kremer³

¹ Dept. of ECE, Carnegie Mellon University, Pittsburgh, PA 15213, [email protected]
² Dept. of ECE, Virginia Tech, Blacksburg, VA 24061, [email protected]
³ Dept. of Computer Science, Rutgers University, Piscataway, NJ 08854, [email protected]

Abstract. The energy consumption of any single component in a system may constitute just a small percentage of that of the overall system, making it necessary to address the issue of energy efficiency across the entire range of system components, from memory, to the CPU, to peripherals. Presented is a hardware architecture for detecting, at runtime, regions of application execution for which there is an opportunity to run a device at a slightly lower performance level, by reducing the operating frequency and voltage, to save energy. The proposed architecture, the Power Adaptation Unit (PAU), may be used to control the operating voltage of various system components, ranging from the CPU core to memory and peripherals. An evaluation of the tradeoffs in performance versus energy savings and hardware cost of the PAU is conducted, along with results on its efficacy for a set of benchmarks. It is shown that, on average, a single-entry PAU provides energy savings of 27%, with a corresponding performance degradation of 0.75%, for the SPEC CPU 2000 integer and floating-point benchmarks investigated.

1 Introduction

Reduction of the overall energy usage and per-cycle power consumption in microprocessors is becoming increasingly important as device integration increases. Increased integration leads to higher densities of generated heat, which creates problems in reliability and packaging. Increased energy usage is likewise undesirable in applications with limited energy resources, such as mobile, battery-powered applications. Reduction in microprocessor energy consumption can be achieved through many means, from altering the transistor-level design and manufacturing process to consume less power per device, to modification of the processor microarchitecture to reduce energy consumption. It is necessary to address the issue of energy efficiency across the entire range of system components, from memory, to the CPU, to peripherals, since CPU energy consumption may sometimes constitute only a small percentage of that of the complete system. It is no longer sufficient


for systems to be low power; they must also be energy aware, adapting to application behavior and user requirements. In applications in which there is an imbalance between the amount of computation and the time spent waiting for memory, it is possible to reduce the operating frequency and voltage of the CPU, memory, or peripherals, to reduce energy consumption at the cost of a tolerable performance penalty. Previous studies have shown that there is significant opportunity for slowing down system components such as the CPU without incurring significant overall performance penalties [9, 8, 7]. Compiler approaches rely on static analyses to predict program behavior. In many cases, static information may not be accurate enough to take full advantage of program optimization opportunities. However, static analyses often have a more global view of overall program structure, allowing coarse-grain program transformations that enable further fine-grain optimizations by the hardware. Hardware approaches are often complementary to compiler-based approaches such as [9, 7]. The window of instructions seen by hardware may not always be large enough to make voltage and frequency scaling feasible. However, even though only the compiler can potentially have a complete view of the entire program structure, only the hardware has knowledge of runtime program behavior. Presented in this paper is a hardware architecture for detecting regions of application execution at runtime, for which there is the possibility of running a device (e.g., the CPU core) with a bounded decrease in performance while obtaining a significant decrease in per-cycle power and overall energy consumption. The proposed architecture, the Power Adaptation Unit (PAU), appropriately sets the operating voltage and frequency of the device, to reduce power dissipation and conserve energy, while not incurring more than a prescribed performance penalty. The PAU attempts to effectively identify such dynamic program regions, and to determine when it would be beneficial to perform voltage and frequency scaling, given the inherent overheads. Because of the type of behavior the PAU captures, even a small, single-entry PAU is effective in reducing power consumption under a bounded performance penalty, for the benchmark applications investigated. The additional hardware overhead due to the PAU is minimized by the fact that a majority of the facilities it relies on (e.g., performance counters) are already integrated into contemporary processor designs. The overhead of maintaining PAU state is shown to be small, and proposals are provided for using existing hardware to implement other functionality required by the PAU. The remainder of this paper is structured as follows. The next section describes opportunities for implementing dynamic resource scaling. Section 3 details the structure of the PAU architecture, and describes how entries are created and managed in the PAU. Section 4 illustrates the action of the PAU with an example. Section 5 discusses the analytical limits of the utility of the PAU. Section 6 discusses the hardware overheads of the PAU. Section 7 presents simulation results for 8 benchmarks from the SPEC CPU 2000 integer and floating point

benchmark suites. Section 8 discusses related work, and Section 9 concludes the paper.

Fig. 1. Opportunities for energy/performance tradeoff in a single-issue architecture (a timeline of ALU instructions, memory accesses, and memory stalls; slowing the CPU stretches the compute segments while the memory-stall segments stay fixed)

2 Opportunities for Scaling

In statically scheduled single-issue processors, any decrease in the operating voltage or frequency will lead to longer execution times. However, if the application being executed is memory-bound in nature (i.e., it has a significant number of memory accesses which cause cache misses), the processor may spend most of its time waiting for memory. In such memory-bound applications, if the processor is run at a reduced operating voltage while memory remains at speed, the (few) portions of the runtime performing computation will take more time, while the (many) portions of the runtime performing memory stalls will remain the same, as illustrated in Figure 1. As illustrated in the figure, halving the operating voltage and frequency of the CPU while keeping that of memory constant can result in ideal-case energy savings of 87.5 %, with a 43 % degradation in performance for the example scenario. In practice, this can only be approximately achieved, as there exist dependencies between the operating frequency of the CPU core and that of memory (one usually runs at a multiple of the frequency of the other). Dynamically scheduled multiple-issue (superscalar) architectures permit the overlapping of computation and memory stalls, and would witness a smaller slowdown if the CPU core (or portions of it) were run at a lower voltage while the operating voltage of memory were kept constant. This initial work focuses on single-issue in-order processors, such as those typically employed in low-power embedded systems. The benefits to superscalar architectures will be pursued in future work.
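The 87.5 % figure follows from the quadratic dependence of dynamic power on supply voltage. As a quick check under the standard CMOS dynamic-power model (our assumption; the paper does not spell out the derivation):

$$P_{dyn} \propto C_{eff}\, V_{dd}^{2}\, f \quad\Rightarrow\quad \frac{P_{new}}{P_{old}} = \left(\tfrac{1}{2}\right)^{2} \cdot \tfrac{1}{2} = \tfrac{1}{8},$$

i.e., halving both $V_{dd}$ and $f$ cuts CPU power draw by a factor of 8, matching the quoted ideal-case saving of $1 - \tfrac{1}{8} = 87.5\,\%$. The quoted 43 % performance degradation is consistent with the example timeline in Figure 1 stretching from 28 to 40 time units ($40/28 \approx 1.43$).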

3 Power Adaptation Unit

The power adaptation unit (PAU) is a hardware structure which detects regions of program execution with imbalance in memory and CPU activity, such as code execution that leads to frequent repeated memory stalls (e.g. memory-bound loop computations), or regions of execution that lead to significant CPU activity with little memory activity (e.g. CPU-bound loop computations). In both cases, the PAU outputs control signals suitable for adjusting the operating frequency


and voltage of the unit it monitors, to values such that less power is consumed per cycle, while similar performance is maintained. The PAU must determine an appropriate voltage and frequency at which to run the device it controls, such that a specified performance penalty (say, 1 %) will not be exceeded. The PAU in a typical system architecture is shown in Figure 2. The next two subsections focus on controlling the CPU for memory-bound applications, and on extending these ideas to controlling the cache and memory in CPU-bound applications, respectively.

Fig. 2. Typical implementation of the PAU (a programmable voltage/frequency controller driving the FREQ and Vdd inputs of the CPU, memory, and peripherals)

Fig. 3. The Power Adaptation Unit (a direct-mapped PAU table indexed by the PC, with Tag, STRIDE, Q, NCLKS, NINSTR, and state fields feeding a delta computation and a programmable voltage controller)

Fig. 4. PAU entry state transition diagram (INIT, TRANSIENT, ACTIVE, and INVALID states, with transitions on clock ticks and stalls updating NCLK, STRIDE, and Q)

3.1 PAU Table Entry Management

The primary component of the PAU is the PAU table. Figure 3 illustrates the construction of the PAU table for a direct-mapped configuration. The least significant bits of the program counter, the index, are used to select a PAU table entry. A hit in the PAU occurs when there is a match between the tag from the PC (the most significant bits before the index) and the tag in the PAU entry, as illustrated in Figure 3. The PAU operates on windows, which are ranges of program counter values in the dynamic instruction stream. Windows are defined by a starting PC value, a stride in clock cycles, STRIDE, and a count of overall instructions executed, NINSTRS. An entry corresponding to a window is created on an event such as a cache miss. The STRIDE field is a distance vector [3, 14], specifying the distance between occurrences of events in the iteration space. The NCLKS field maintains the age of the entry symbolizing a window. Four state bits, INIT, TRANSIENT, ACTIVE and INVALID, are used by the PAU to manage entries. The Q field is a saturating counter that indicates the degree of confidence to be attached to the particular PAU entry.
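As a concrete illustration, the following C sketch models a direct-mapped PAU table and its hit check. The field widths, names, and table size are illustrative assumptions on our part, not taken from the paper.

```c
#include <stdint.h>
#include <stdbool.h>

#define PAU_ENTRIES 1      /* direct-mapped; a single entry suffices per Section 7 */
#define INSN_BYTES  4      /* assumed fixed 4-byte instructions */

typedef enum { S_INVALID, S_INIT, S_TRANSIENT, S_ACTIVE } PauState;

typedef struct {
    uint32_t tag;          /* upper PC bits                                   */
    uint32_t stride;       /* clock cycles between repeated stalls            */
    uint32_t nclks;        /* cycles since the entry was last reset ("age")   */
    uint32_t ninstr;       /* instructions completed in the last stride       */
    uint8_t  q;            /* saturating confidence counter                   */
    PauState state;
} PauEntry;

static PauEntry pau_table[PAU_ENTRIES];

/* Direct-mapped lookup: drop the byte offset, then split index and tag. */
bool pau_hit(uint32_t pc, PauEntry **entry) {
    uint32_t word = pc / INSN_BYTES;
    uint32_t idx  = word % PAU_ENTRIES;
    uint32_t tag  = word / PAU_ENTRIES;
    *entry = &pau_table[idx];
    return (*entry)->state != S_INVALID && (*entry)->tag == tag;
}
```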


Figure 4 shows the state transition diagram for a PAU entry. There are four states in which an entry may exist, INIT, TRANSIENT, ACTIVE and INVALID, corresponding to the state bits in the PAU entry described previously. Transitions between states occur either when there is a pipeline stall due to a cache miss, or may be induced by the passage of a clock cycle. The two extremes of the state machine are the INVALID and ACTIVE states, with the INIT and TRANSIENT states providing hysteresis between them. In the figure, the transitions between states are labeled with both the event causing the transition (circled) and the action performed in the transition. For example, the transition between the TRANSIENT and INVALID states occurs on the passage of a clock cycle if NCLK is greater than STRIDE and Q is zero. Entries created in the PAU table are initially in the INIT state, and move to the TRANSIENT state when there is a stall caused by the instruction which maps to the entry in question. On every clock cycle, the NCLKS fields of all valid entries are incremented, on the faith that the entries will be used in the future. For all valid entries, the NCLKS field is reset to 0 on a stall, after it is copied to the STRIDE field and Q is incremented. The number of instructions that completed in that period is then recorded in the NINSTRS field for that PAU entry. The goal of a PAU entry is to track PC values for which repeated stalls occur; the PAU will effectively track such cases even if the time between stalls is not constant. Whenever the NCLKS field of a TRANSIENT or ACTIVE entry reaches the value of the STRIDE field, it is reset to zero and the Q field is decremented; therefore, if the distance between stalls decreases monotonically, the STRIDE will be correctly updated with the new iteration distance. If the number of iterations for which the distance between stalls is increased is large, the entry will eventually be invalidated, and then recreated in the table with the new STRIDE. This purpose is served by the high and low water marks (HIH2O and LOH2O). The high and low water marks determine how many repeated stalls must occur before an entry is considered ACTIVE, and how many clock cycles must elapse before the determination is made to degrade an entry from ACTIVE status, respectively. The values of HIH2O and LOH2O may either be hard-coded in the architecture, or may be modified by software such as an operating system, or by applications, as a result of code appropriately inserted by a compiler. It is also possible to have the values of HIH2O and LOH2O adapt to application performance, their values being controlled using the information that is already summarized in the PAU, with a minimal amount of additional logic. Such techniques are beyond the scope of this paper and are left for future research. Once Q reaches the high water mark, the entry goes into the ACTIVE state. If a PAU hit occurs on an ACTIVE entry, VDD and FREQ are altered as described in Section 5. If Q falls below the low water mark, LOH2O, this indicates that the repeated stalls with equal stride have stopped happening, and have not occurred for STRIDE*(HIH2O-LOH2O) cycles. In such a situation, VDD and FREQ are set back to their default values.
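The per-cycle and per-stall bookkeeping described above can be summarized in the following C sketch, which builds on the PauEntry structure introduced earlier. It is an approximate model of the transitions in Figure 4; the water-mark values and the helper name are illustrative assumptions.

```c
#define HIH2O           3           /* matching stalls before ACTIVE (assumed value) */
#define LOH2O           1           /* confidence floor (assumed value)              */
#define PAU_STRIDE_MAX  (1u << 24)  /* INIT-state timeout, as in Section 6           */

static void restore_default_vdd_freq(void) { /* hypothetical helper: undo scaling */ }

/* Called once per clock cycle for every valid entry. */
void pau_on_clock(PauEntry *e) {
    e->nclks++;
    if (e->state == S_INIT && e->nclks == PAU_STRIDE_MAX) {
        e->state = S_INVALID;          /* addresses causing one stall time out */
        return;
    }
    if ((e->state == S_TRANSIENT || e->state == S_ACTIVE) &&
        e->nclks > e->stride) {
        e->stride = e->nclks;          /* stall period grew: remember it */
        e->nclks  = 0;
        if (e->q > 0) e->q--;
        if (e->state == S_ACTIVE && e->q < LOH2O)
            restore_default_vdd_freq();
        if (e->q == 0) e->state = S_INVALID;
    }
}

/* Called when the instruction mapping to this entry stalls. */
void pau_on_stall(PauEntry *e, uint32_t instrs_since_last_stall) {
    e->stride = e->nclks;              /* record the observed stall distance */
    e->nclks  = 0;
    e->ninstr = instrs_since_last_stall;
    if (e->q < HIH2O) e->q++;
    if (e->state == S_INIT)  e->state = S_TRANSIENT;
    if (e->q == HIH2O)       e->state = S_ACTIVE;
}
```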


    for (x = 100;;) {
        if (x-- > 0)
            a = i;
        b = *n;
        c = *p++;
    }

Fig. 5. Example

In a multiprogramming environment where several different processes, with different performance characteristics, are multiplexed onto one or more processing units, HIH2O and LOH2O must permit the PAU to respond quickly enough, and STRIDE*(HIH2O-LOH2O) must be significantly smaller than the length of a process quantum. Alternatively, an operating system could invalidate all entries in the PAU on a context switch. Addresses that cause only one stall could potentially tie up a PAU entry forever; to avoid this, PAU entries in the INIT state time out after PAU_STRIDE_MAX cycles.

3.2 Handling Cache and Memory

For real benefit across the board, both memory- and CPU-bound applications must be handled simultaneously: either the CPU is stalled for memory, or memory is idle while the CPU is busy, or both may be busy. It is desirable to use the same structure, if possible, to detect CPU-bound code regions as for memory-bound regions, to amortize the on-chip real estate used in implementing the PAU. The control signals generated by a PAU entry for a given PC value can also be applied to shutting down memory banks, or shutting down sets in a set-associative cache, along the lines of [2] and [18]. Periods of memory inactivity are detected by identifying memory load/store instructions at PC values for which the corresponding PAU entries' NINSTR and STRIDE fields indicate a large ratio of computation to stalls for memory. For example, if NINSTR is very close to the ratio of STRIDE to the average machine CPI for the given architecture, then for repeated memory accesses to the corresponding address, there are very few cache misses. In such a situation, since most activity is occurring in the cache as opposed to main memory, memory can be run at a lower voltage. Similar techniques have previously been applied to RAMBUS DRAMs [1] in [15]. The PAU ensures that such adjustments only occur when they will be of long enough duration to be beneficial.

4 Example

At any given time, there may be more than one PAU entry in the ACTIVE state; for example, during the execution of a loop, there may be several program counter values


that lead to repeated stalls. In the example illustrated in Figure 5, let us assume that the assignments to variables a, b and c all cause repeated cache misses (e.g., the variables reside at memory addresses that map to the same cache line in a direct-mapped cache). After one iteration of the loop, there will be 3 PAU entries corresponding to the PC values of the memory-access instructions for the three assignments, and these will be placed in the INIT state, with their Q fields set to 0. The NCLK fields of all entries are incremented once each clock cycle hereafter. On the second iteration, after all three memory references cause cache misses once more, the three PAU entries will move from the INIT state to the TRANSIENT state, the Q fields of the entries will be incremented, and the value of the NCLK field will be copied to the STRIDE field. The value of the NCLK fields at this point denotes the number of clock cycles that have elapsed since the last hit for each entry. Likewise, the NINSTR field denotes the number of instructions that have been executed since the last hit to the entry. If the architecture is configured with a LOH2O of 1 and a HIH2O of 3, then, following a process similar to that described above, the entries will graduate to the ACTIVE state in the third iteration of the loop. On the fourth iteration, with all 3 entries in the ACTIVE state, the first PAU hit occurs, due to the memory reference associated with the assignment to variable a. On a hit in an ACTIVE PAU entry, the values of the STRIDE and NINSTR fields are used to calculate the factor by which to slow down the device being controlled, which in these discussions is the CPU. Intuitively, the ratio of NINSTR to STRIDE provides a measure of the ratio of computation to time spent stalling for memory. A detailed analysis of this calculation is given in the next section. After 100 iterations of the loop, the variable x in the program decrements to zero, and the PAU entry corresponding to the memory access to variable a will degrade from ACTIVE to TRANSIENT and eventually to INVALID. In the organization of the PAU described here, the other ACTIVE entries would only be able to influence the operating voltage after this degradation from ACTIVE has occurred, (HIH2O-LOH2O)*STRIDE cycles after the 100th iteration of the loop. Energy is saved when there is a PAU hit on an ACTIVE entry and the operating voltage is lowered. When the operating voltage is lowered, however, increased gate delays make it necessary to reduce the operating frequency as well, to maintain correct circuit behavior. At the new operating voltage, instructions will take longer to execute, but memory accesses will incur the same penalty in terms of absolute time, though the number of memory-stall cycles will be smaller. There is an overhead (in both time and energy) involved in lowering the operating voltage, as well as in bringing it back up. This makes it useful to lower the voltage only if it can be determined that the processor will run at the low voltage for a long enough time. A more formal analysis of the opportunities for lowering the operating voltage/frequency, and the overheads involved therein, is presented in the next section.

5 Limits on Energy Savings

It is possible to incur no performance degradation if computation and memory accesses can be perfectly overlapped, the program being executed is memory-bound, and the CPU is run at a slower-than-default execution rate. For an ACTIVE PAU entry, we can determine the effective instruction execution rate as:

$$\frac{\textit{instructions}}{\textit{time}} = \frac{NINSTRS}{STRIDE / FREQ} = \frac{FREQ}{STRIDE / NINSTRS}$$

In the above, $STRIDE/NINSTRS$ is the effective CPI, and is similar to the inverse of the average-rate requirement defined in [23]. It is desired to find an appropriate frequency at which we can run while keeping the ratio instructions/time constant. The following analysis is performed in terms of the frequency; the interdependence between operating voltage and frequency is not explicitly shown. The maximum value of the instructions/time ratio will be:

$$\frac{1 / AVGCPI}{CYCLETIME} = \frac{RATED\_FREQ}{AVGCPI},$$

where $AVGCPI$ is the theoretical average number of cycles necessary to execute an instruction on the architecture of interest, and $RATED\_FREQ$ is the processor's rated operating frequency. For an architecture in which memory operations can be perfectly overlapped with execution, it will be possible to lower the clock frequency until

$$\frac{RATED\_FREQ}{AVGCPI} = \frac{F_{new}}{STRIDE / NINSTRS}.$$

Therefore

$$F_{new} = \left( \frac{RATED\_FREQ}{AVGCPI} \right) \cdot \left( \frac{STRIDE}{NINSTRS} \right).$$

The slowdown factor, $\delta$, is a number greater than 1 by which the original operating frequency is divided to obtain the scaled frequency. The slowdown factor for cases of possible ideal overlap of memory operations and computation is:

$$\delta_{ideal+overlap} = \frac{RATED\_FREQ}{F_{new}} = \left( \frac{NINSTRS}{STRIDE} \right) \cdot AVGCPI.$$

In the general case, it will not be possible to perfectly overlap computation and memory accesses; thus, slowdown of the processor will not be hidden by memory latency, since memory accesses will be sequential with computation.


In architectures that cannot overlap memory access and computation, the performance penalty can still be relatively small compared to the savings in energy, and per-cycle power will almost certainly be reduced. In such situations, since there will always be a performance degradation, it is necessary to define a limit on the acceptable degradation in performance. For the purposes of evaluation, a maximum degradation in performance of 1% will be used throughout the remainder of this paper.

$$T_{old} = T_{mem} + T_{cpu}$$

$$T_{new} = T_{mem} + \delta_{\text{no-overlap}} \cdot T_{cpu}$$

For a < 1% slowdown:

$$\frac{T_{new} - T_{old}}{T_{old}} \le 0.01$$

Therefore

$$\frac{\delta_{\text{no-overlap}} \cdot T_{cpu} - T_{cpu}}{T_{mem} + T_{cpu}} \le 0.01$$

$$\delta_{\text{no-overlap}} \le \frac{0.01 \cdot (T_{mem} + T_{cpu}) + T_{cpu}}{T_{cpu}}$$

Let

$$mem\_frac = \frac{T_{mem}}{T_{mem} + T_{cpu}}, \qquad cpu\_frac = \frac{T_{cpu}}{T_{mem} + T_{cpu}};$$

then

$$\delta_{\text{no-overlap}} \le \frac{0.01 \cdot (mem\_frac + cpu\_frac) + cpu\_frac}{cpu\_frac}.$$

The slowdown factor can also be expressed in terms of the entries in the PAU structure:

$$mem\_frac = \frac{STRIDE - NINSTR}{STRIDE}, \qquad cpu\_frac = \frac{NINSTR}{STRIDE};$$

then

$$\delta_{\text{no-overlap}} \le \frac{0.01 \cdot STRIDE + NINSTR}{NINSTR}.$$
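As an illustration with hypothetical values (chosen by us, not taken from the paper): an ACTIVE entry with $STRIDE = 1000$ cycles and $NINSTR = 200$ instructions yields

$$\delta_{\text{no-overlap}} \le \frac{0.01 \cdot 1000 + 200}{200} = 1.05,$$

so the operating frequency may be divided by at most 1.05 (about a 5 % reduction) while keeping the slowdown within the 1 % bound.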


Fig. 6. Effect of PAU Table size on energy consumption

6 PAU Overhead and Tradeoffs

In this section, the overheads involved in adjusting the operating voltage and frequency are discussed, as well as the area cost of implementing the PAU table. The energy cost incurred by the PAU structure itself will be addressed in our future research. As will be shown in Section 7, a PAU of even a single entry is effective in reducing energy consumption, while incurring a minimal degradation in performance. Besides the PAU table, most of the information needed for each PAU entry is already available in current state-of-the-art architectures such as the Intel XScale architecture [11] and the Berkeley lpARM processor [16]. For example, the Intel XScale microarchitecture maintains event counters to monitor instruction and data cache hit rates, instruction and data Translation Look-aside Buffer (TLB) hit rates, pipeline stalls, Branch Target Buffer (BTB) prediction hit rates, and instruction execution count. Furthermore, eight additional events may be monitored when using the Intel XScale microarchitecture as the basis for an application-specific standard product [11]. The largest real-estate overhead of the PAU is incurred by the PAU table and the δ calculation. It should be possible to use otherwise idle functional units for the δ computation, as the computation and the attendant voltage scaling can be postponed if resources are unavailable. For an m-entry direct-mapped PAU, in a b-bit architecture with i-byte instructions, the number of bits, $PAU_{bits}$, needed to implement the PAU table is given by:


$$PAU_{bits} = m \cdot \big( (b - \log_2(m) - \log_2(i)) + 3 \cdot \log_2(\mathrm{PAU\_STRIDE\_MAX}) + \log_2(\mathrm{HIH2O}) + 2 \big)$$

The terms on the right-hand side of the above equation correspond to (1) the tag, (2) the NCLK, STRIDE, and NINSTR fields, (3) the Q field, and (4) the FREQ and entry state bits, respectively. Thus, a single-entry PAU table can be implemented with just 106 bits on an architecture with a 32-bit PC, a chosen PAU_STRIDE_MAX of $2^{24}$, a HIH2O of 4, and 4-byte instructions.

Altering the operating voltage via the DC-DC converter is neither instantaneous nor energy-cost-free. In general, the time $t_{RFG}$ taken to reconfigure from a voltage $V_1$ to $V_2$, with a maximum current $I_{MAX}$ at the output of the converter, converter efficiency $\eta$, and a supply-smoothing capacitor with capacitance $C$, is given, from [6], by:

$$t_{RFG} \approx \frac{2 \cdot C}{I_{MAX}} \cdot |V_2 - V_1|$$

Likewise, the energy cost of reconfiguration, $E_{RFG}$, is given by:

$$E_{RFG} = (1 - \eta) \cdot C \cdot |V_2^2 - V_1^2|$$

With a DC-DC converter smoothing capacitance of 10 µF, which is twice the minimum suggested in [6], a transition from 3.3 V to 1.65 V, an $I_{MAX}$ of 1 A, and an $\eta$ of 90 %, $t_{RFG}$ equals 33 µs; similarly, the energy cost of reconfiguration, $E_{RFG}$, is 8.1675 µJ. In the simulations, a reconfiguration penalty of 1024 clock cycles and 14 µJ was used. This penalty may be pessimistic, as it has been shown in [6] that it is possible to perform voltage scaling without halting computation for designs with a small die area, as is the case for embedded processors.
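The quoted numbers can be checked with a few lines of C; this is simply the arithmetic of the formulas above with the paper's parameter values plugged in (compile with -lm).

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    /* Parameters from the text: 10 uF smoothing cap, 3.3 V -> 1.65 V,
     * 1 A maximum converter output current, 90% converter efficiency. */
    double C = 10e-6, V1 = 3.3, V2 = 1.65, Imax = 1.0, eta = 0.90;

    double t_rfg = (2.0 * C / Imax) * fabs(V2 - V1);        /* seconds */
    double e_rfg = (1.0 - eta) * C * fabs(V2*V2 - V1*V1);   /* joules  */

    /* PAU table size: m = 1 entry, b = 32-bit PC, i = 4-byte instructions,
     * PAU_STRIDE_MAX = 2^24, HIH2O = 4. */
    int m = 1, b = 32, i = 4;
    int bits = m * ((b - (int)log2(m) - (int)log2(i)) + 3 * 24 + (int)log2(4) + 2);

    printf("t_RFG    = %.1f us\n", t_rfg * 1e6);   /* prints 33.0 us   */
    printf("E_RFG    = %.4f uJ\n", e_rfg * 1e6);   /* prints 8.1675 uJ */
    printf("PAU bits = %d\n", bits);               /* prints 106       */
    return 0;
}
```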

7 Efficacy of the PAU

Beyond the overall architecture of the PAU, there exist implementation parameters that determine the efficacy of the PAU in a system. This section investigates the effect of the size of the PAU table on the energy savings and performance degradation, and ultimately on the energy-delay product. In a practical implementation, it is unlikely that a fully associative PAU structure would be utilized, due to the hardware overhead involved; a real PAU implementation is more likely to employ a small, set-associative or direct-mapped structure. Eight different direct-mapped PAU sizes of 0, 1, 2, 4, 8, 16, 32 and 64 entries were investigated. In all of these configurations, a VDD reconfiguration penalty of 14 µJ and 1024 clock cycles was used, based on [6]. The overhead involved in performing voltage scaling was discussed in Section 6.


Fig. 7. Effect of PAU Table size on energy savings

7.1 Simulation Environment

The investigation was performed using the Myrmigki simulator, a power-estimating, execution-driven simulator which models a single-issue embedded processor [21]. The modeled architecture has a 5-stage in-order pipeline, a unified 8 KB 4-way set-associative L1 cache with 16-byte blocks, and a miss penalty to main memory of 100 cycles. The power estimation framework has been shown to provide accuracy within 6.5 % of the hardware it models. The benchmarks were taken from the SPEC2000 benchmark suite, and compiled with GCC [20] version 2.95.3 for the Hitachi SH architecture. The optimization flags during compilation were the default flags specified for compiling each benchmark from the SPEC suite. Table 1 provides a summary of the benchmarks

Table 1. Summary of benchmarks used in experimental analysis

Benchmark     SPEC Suite      # of Instructions Simulated
164.gzip      Integer         200,000,000
175.vpr       Integer         200,000,000
197.parser    Integer         200,000,000
256.bzip2     Integer         200,000,000
176.gcc       Integer         200,000,000
181.mcf       Integer         122,076,300
183.equake    Floating Point  200,000,000
188.ammp      Floating Point  200,000,000


Fig. 8. Effect of PAU Table size on performance degradation

used, and the number of dynamic instructions for which they were simulated. Each of the benchmarks was simulated for 200 million dynamic instructions, unless its execution was shorter, as was the case for 181.mcf. The inputs to the benchmarks were taken from the SPEC reduced simulation inputs [13], except for 176.gcc, where the reference input 166.i was used.

7.2 Effect of PAU Size on Energy Savings

Figure 6 illustrates the effect of the number of PAU entries, in a direct-mapped PAU organization, on the energy consumption, for a targeted 1 % performance degradation (the actual performance degradation observed is not exactly 1 %, as discussed in the next section). The zero-sized PAU table is the baseline case, and illustrates the energy consumption without the use of the PAU. The percentage reduction in energy consumption with increasing PAU table size is illustrated in Figure 7. The general trend is that the energy savings for the largest PAU configuration (64 entries) are only slightly better than those of a single-entry PAU. In the case of Gzip, the energy savings with a 64-entry PAU are actually less than those for a single-entry PAU. This non-monotonic increase in savings with increasing PAU size can also be witnessed for Ammp, Vpr and Equake. The reason for this behavior is that as the number of entries in the PAU table is increased, there is a greater possibility that regions of recurrent cache misses which have smaller duration will be allocated entries in the PAU. With an


With an increase in the number of potentially less beneficial occupants of the PAU table, there is a greater occurrence of the voltage and frequency being lowered due to short runs of repeated stalls. Since there is an overhead involved in changing the operating voltage, such short runs yield a smaller benefit in energy savings. Adding more entries to the PAU increases the opportunity for voltage scaling to occur, but does not increase the chance that a more beneficial execution region (longer, with a larger proportion of memory stalls) will be captured.

The trend in energy consumption in Figure 6 tracks that of energy savings in Figure 7, and the benchmarks with a larger energy consumption see greater savings with the use of the PAU. The effect of an increased number of PAU table entries on the energy savings does not follow the same trend, with some benchmarks (e.g., Mcf) benefiting more from the use of larger PAU sizes than others (e.g., Equake). For the average over the 8 benchmarks, there is a steady increase in the energy savings with an increased number of PAU entries, except for the case of a 2-entry PAU table, where there is a slight degradation relative to the single-entry PAU. The additional energy savings from a 64-entry PAU are, however, not significant: the 64-entry PAU achieves an energy saving of 31% versus 27% for the single-entry PAU.

7.3 Effect of PAU Size on Performance Degradation

Figure 8 shows the trend in performance degradation with an increasing number of PAU entries. As the number of PAU entries is increased, the number of times an entry takes control of the operating voltage increases, since there is a general increase in the number of ACTIVE entries. Due to the overhead involved in switching the operating voltage, there is a general increase in performance degradation, which eventually plateaus as the number of stall-inducing memory references approaches the number of PAU entries.

The increase in performance degradation is not monotonic: for example, in going from a 4-entry PAU table to an 8-entry PAU table for Ammp, there is a decrease in the performance degradation. The reasons are similar to those previously discussed for the trend in the energy savings. As the number of PAU entries is increased, the PAU captures more dynamic execution regions. These lead to allocations of entries in the PAU table which increase the occurrence of voltage scaling, but which may or may not be beneficial to the overall energy consumption and performance degradation.

It is important to note that, on average, increasing the PAU size increases both the energy savings and the performance degradation. Using larger PAU tables does not provide greater accuracy; it only provides greater opportunity to perform resource scaling. In choosing an appropriate PAU size, it is therefore necessary to trade off the energy saved against the hit in performance. This makes it desirable to use the energy-delay product as a metric, rather than performance or energy consumption alone.


Fig. 9. Effect of PAU Table size on Energy-Delay product

7.4 Effect of PAU Size on Energy-Delay Product

To evaluate the efficacy of each configuration, the energy savings and performance degradation must be considered together. An appropriate metric is the energy-delay product, with smaller values being better. Figure 9 shows the trend in energy-delay product with increasing PAU size. In the figure, the baseline case (no PAU) is listed as a PAU table size of zero. On average over all the benchmarks, there is little additional decrease in the energy-delay product after the addition of a single PAU entry. Even though there is an apparently significant increase in performance degradation with increasing PAU table size (Figure 8), the contributions to energy savings far outweigh the performance penalty. One factor that is not accounted for in Figure 9 is the additional hardware cost of larger PAU sizes. If this cost is high, it would preclude using larger PAU table sizes, as it might lead to an increase in the energy-delay product.
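As an illustrative calculation using only the averages quoted in this paper: the single-entry PAU's 27% energy savings at 0.75% slowdown (Section 9) give a normalized energy-delay product of about 0.73 x 1.0075 = 0.74 of baseline; with the 64-entry PAU's 31% savings, the EDP remains below that figure for any slowdown under roughly 6%, since 0.69 x 1.065 = 0.73.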

8 Related Work

Although hardware architectures aimed at improving application performance have been around for decades, hardware targeted at reducing per-cycle power and overall energy consumption has only recently begun to be proposed. In [22], it is observed that hardware techniques for shutting down unused hardware modules offer the possibility of significant (upward of 20%) energy savings over software techniques, which themselves involve the execution of instructions that consume energy.


Hardware architectures which adapt to application needs, reconfiguring to match applications and save energy, have been proposed in [2, 12, 18]. In [12], the authors detail a scheme for adjusting the number of hardware units in use, in this case resource reservation units, in a model of the SimpleScalar architecture, in order to reduce power consumption and overall energy consumption. The authors further propose applying the architecture to performing dynamic voltage scaling. In a manner similar to hardware architectures for performance, and similar also to the solutions proposed in [12], the PAU uses application history to determine opportunities for hardware reconfiguration. However, while [12] alters the hardware configuration of superscalar processors to save energy, the PAU performs dynamic voltage scaling and clock-speed setting, and addresses the spectrum of hardware architectures ranging from single-issue processors to multiple-issue VLIW and superscalar architectures.

In [2], ways in a set-associative cache are disabled to reduce the energy dissipation of the cache, with a small degradation in performance. The technique takes advantage of the cache sub-array partitioning that already exists in high-performance cache designs. However, even though the proposal is based on hardware structures, it requires software (the operating system, or applications with the help of a compiler) to perform the selection of the cache ways to be disabled. The proposed mechanism for this interface is the addition of two new instructions to the machine ISA for reading and writing a cache way select register. The Dynamically ResIzable i-cache (DRI i-cache) in [18] employs a combination of two novel techniques, gated-Vdd [17] and a purely hardware structure, to take advantage of the variation in i-cache usage to reduce the leakage power consumption of the instruction cache.

The techniques introduced herein are complementary to those previously proposed in [2, 12, 18]. Like [12], one of the aims of the PAU is to reduce the power consumption of the CPU core. Unlike [18], the PAU does not address leakage power consumption, which is increasingly important as supply and threshold voltages are lowered. It should be possible to employ a combination of the PAU and the techniques proposed in [18, 17], either in concert with voltage scaling or replacing it altogether.

Structures such as those described in [4, 19, 10] perform dynamic thermal management, reducing power consumption and saving energy while incurring only limited application performance degradation. Thermal management is indirectly achieved by the PAU through its attempts to reduce power consumption. The action of the PAU in this regard is proactive as opposed to reactive; however, it will not be able to detect situations of "thermal crisis".

The calculation of the CPU slowdown factor in Section 5 is based on previous efforts described in [9]. In [9], the slowdown factor was determined for processors in which it is possible to overlap computation and memory accesses, such as multiple-issue superscalar processors. The analysis presented in Section 5 builds upon and extends that of [9] to the general case of processors with and without the ability to overlap computation with memory accesses.


The work in [7] discusses a compiler that identifies program regions where the CPU can be slowed down without significant performance penalties. A trace-based prototype compiler, implemented as part of the SUIF2 compiler infrastructure, achieved up to 24% energy savings with performance penalties of less than 2.7% on the SPECfp95 benchmark suite. The PAU hardware is complementary to compiler techniques such as [9] and [7].

9 Summary and Future Work

We presented a hardware structure, the PAU, that detects dynamic execution regions of a program in which there is a mismatch between the number of computations occurring and the number of memory stalls and, when feasible, lowers the operating voltage and frequency of the processor to obtain a savings in energy with a slight degradation in performance. A direct-mapped configuration of the PAU was investigated for 8 PAU sizes, ranging from a baseline configuration with no PAU to a 64-entry PAU. It was observed that a PAU of even a single entry provides an average of 27% savings in energy with a performance degradation of 0.75%. In general, increased energy savings were accompanied by increased performance degradation as the PAU size grew, as more penalties of voltage scaling were incurred. The overall effect of using larger PAUs was, however, positive, with an overall decrease in the energy-delay product as the PAU size was increased.

Lacking in the current analysis is an accurate estimate of the hardware cost of the PAU. This is the subject of our current research, and we are investigating various hardware implementations. The usefulness of a PAU in a superscalar architecture is also being investigated, with the implementation of the PAU in the Wattch [5] simulator. This will also permit a preliminary analysis of the hardware cost of the PAU table, as it will be possible to model the PAU table in Wattch as an array structure. In addition to investigating the utility of the PAU in superscalar architectures, implementation in Wattch permits analysis of the performance of the PAU in a machine with a different ISA. In this regard, it is also planned to implement the PAU in the SimplePower simulator [24] for further comparison.

The proposed hardware structure addresses only dynamic power dissipation, through the use of voltage scaling. With decreasing feature sizes, leakage power is becoming increasingly important, and it is therefore necessary to investigate the possible impact of any proposal on leakage power consumption.

It should be straightforward to incorporate a PAU into current state-of-the-art low-power architectures, given that most of the hardware required by the PAU is beginning to appear in some commercial and research microprocessor designs. In the short term, however, it should be possible to implement the PAU in a programmable logic device and use it as an additional board-level device in a system design.


References

[1] RDRAM. http://www.rambus.com, 1999.
[2] D. H. Albonesi. Selective cache ways: On-demand cache resource allocation. Journal of Instruction Level Parallelism, 2(2000):1-6, May 2000.
[3] J. R. Allen and K. Kennedy. Automatic translation of Fortran programs to vector form. ACM Transactions on Programming Languages and Systems, 9(4):491-542, Oct. 1987.
[4] D. Brooks and M. Martonosi. Dynamic thermal management for high-performance microprocessors. In Proceedings of the 7th International Symposium on High-Performance Computer Architecture, January 2001.
[5] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: A framework for architectural-level power analysis and optimizations. In 27th Annual International Symposium on Computer Architecture, pages 83-94, June 2000.
[6] T. D. Burd and R. W. Brodersen. Design issues for dynamic voltage scaling. In Proceedings of the 2000 International Symposium on Low Power Electronics and Design, ISLPED'00, pages 9-14, July 2000.
[7] C.-H. Hsu and U. Kremer. Compiler-directed dynamic voltage scaling based on program regions. Technical Report DCS-TR-461, Department of Computer Science, Rutgers University, November 2001.
[8] C.-H. Hsu, U. Kremer, and M. Hsiao. Compiler-directed dynamic frequency and voltage scaling. In Workshop on Power-Aware Computer Systems, ASPLOS-IX, November 2000.
[9] C.-H. Hsu, U. Kremer, and M. Hsiao. Compiler-directed dynamic frequency/voltage scheduling for energy reduction in microprocessors. In Proceedings of the 2001 International Symposium on Low Power Electronics and Design, ISLPED'01, pages 275-278, August 2001.
[10] M. Huang, J. Renau, S.-M. Yoo, and J. Torrellas. A framework for dynamic energy efficiency and temperature management. In Proceedings of the 33rd Annual IEEE/ACM International Symposium on Microarchitecture, pages 202-213, 2000.
[11] Intel Corporation. Intel XScale Microarchitecture Technical Summary. Technical report, 2001.
[12] A. Iyer and D. Marculescu. Power aware microarchitecture resource scaling. In Proceedings of Design, Automation and Test in Europe, pages 190-196, 2001.
[13] A. KleinOsowski, J. Flynn, N. Meares, and D. J. Lilja. Adapting the SPEC2000 benchmark suite for simulation-based computer architecture research. In Proceedings of the Workshop on Workload Characterization, International Conference on Computer Design, September 2000.
[14] D. Kuck, R. Kuhn, D. Padua, B. Leasure, and M. J. Wolfe. Dependence graphs and compiler optimizations. In Conference Record of the Eighth Annual ACM Symposium on the Principles of Programming Languages, Jan. 1981.
[15] A. R. Lebeck, X. Fan, H. Zeng, and C. Ellis. Power aware page allocation. In Ninth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 105-116, November 2000.
[16] T. Pering, T. Burd, and R. Brodersen. Voltage scheduling in the lpARM microprocessor system. In Proceedings of the 2000 International Symposium on Low Power Electronics and Design, ISLPED'00, pages 96-101, July 2000.
[17] M. D. Powell, S.-H. Yang, B. Falsafi, K. Roy, and T. N. Vijaykumar. Gated-Vdd: A circuit technique to reduce leakage in cache memories. In ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED'00), pages 90-95, July 2000.
[18] M. D. Powell, S.-H. Yang, B. Falsafi, K. Roy, and T. N. Vijaykumar. Reducing leakage in a high-performance deep-submicron instruction cache. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 9(1):77-89, February 2001.
[19] H. Sanchez, B. Kuttanna, T. Olson, M. Alexander, G. Gerosa, R. Philip, and J. Alvarez. Thermal management system for high performance PowerPC microprocessors. In Proceedings IEEE Compcon, page 325, February 1997.
[20] R. M. Stallman. Using and Porting GNU CC, 1995.
[21] P. Stanley-Marbell and M. Hsiao. Fast, flexible, cycle-accurate energy estimation. In ACM/IEEE International Symposium on Low Power Electronics and Design, ISLPED'01, pages 141-146, August 2001.
[22] V. Tiwari and M. Lee. Power analysis of a 32-bit embedded microcontroller. In Proceedings, Asia and South Pacific DAC, CD-ROM, August 1995.
[23] F. Yao, A. Demers, and S. Shenker. A scheduling model for reduced CPU energy. In Proceedings IEEE Symposium on Foundations of Computer Science, pages 374-382, October 1995.
[24] W. Ye, N. Vijaykrishnan, M. Kandemir, and M. J. Irwin. The design and use of SimplePower: A cycle-accurate energy estimation tool. In Proceedings of the 37th Conference on Design Automation, pages 340-345, 2000.

Multi-Processor Computer System Having Low Power Consumption

C. Michael Olsen and L. Alex Morrow

IBM Research Division, P.O. Box 218, Yorktown Heights, NY 10598, USA
{cmolsen, alex_morrow}@us.ibm.com

Abstract. We propose to improve battery life in pervasive devices by using multiple processors that trade off computing capacity for improved energy-per-cycle (EPC) efficiency. A separate scheduler circuit intercepts interrupts and schedules execution to minimize overall energy consumption. To facilitate this operation, software tasks are compiled and profiled for execution on multiple processors, so that task computing-capacity requirements can be evaluated realistically against system requirements and task response times. We propose a simple model for estimating the EPC of each processor. To optimize energy consumption, processors are designed to satisfy a particular usage model. Thus, the particular task suite that is anticipated to run on the device, in conjunction with user expectations of software reaction times, governs the design point of each processor. We show that the battery life of a wearable device may be extended by a factor of 3-18, depending on user activity.

1 Introduction

A major obstacle to the success of certain types of battery-powered Pervasive Devices (PvDs) is battery life. Depending on the device and its usage model, the battery may last anywhere from hours to months. An important mode of device operation is the user-idling mode. In this mode, the device is always "on" but is not being used; "on" here means that the device is instantly responsive. Wearable devices fall into this category because they form an extension of the user and therefore may be expected to be always instantly available. Secondly, the lower bound on the device's power consumption may be set by its need to keep time and perform periodic tasks, such as polling sensors and evaluating data, regardless of user activity. In other words, the main contributor to the accumulated "on" battery drain is the user-idling mode rather than the user-active mode, in which the user is actively using the device.

Most PDAs follow this usage model. A user turns the PDA on, presses one or two buttons, and makes a selection from the screen. The user then reads the information and either leaves the device idling or turns it off. In either case, the time the PDA spends idling is generally significantly larger than the time it spends executing instructions associated with the button and screen selections.

PvDs with advanced power management capabilities, such as the Compaq Itsy [1] and the IBM Linux Watch [2], have several stages of power-saving modes. In its most efficient "on" low-power state, the Itsy may last for 215 hours on its 610 mAh battery


while the Linux Watch may last for 64 hours on its 60 mAh battery. However, if the Linux Watch, for example, had to perform small periodic tasks more frequently than once per second, it would largely be prevented from taking advantage of its most efficient low-power state, and battery life would drop to 8 hours. A battery lifetime of this magnitude, or even a couple of days with a larger battery, is not satisfactory. Users may not be able to recharge or replace batteries at such short intervals. Further, users may be annoyed by the frequent charging requirements, especially if they feel they are not even using the device. Although the battery drains more quickly when the user does use the device, this is more reasonable, since the user can develop a sense of how much a given action costs and make usage decisions accordingly.

Another lesson we learned from the Linux Watch was that even if keeping and displaying time were the only task expected of it, the battery life of 64 hours would still pale in comparison with commercial wrist watches. Although these devices also use processor chips, they can maintain and display time for several years on a single watch battery. This two-order-of-magnitude discrepancy in battery life was a primary motivation for this investigation. It led us to think there might be great benefits in off-loading simple repetitive tasks, such as time keeping and sensor polling, from the high-performance processor to, perhaps, a low-speed 8-bit processor with a small cache and a few necessary blocks in the I/O ring. The idea is that the low-speed processor would be specifically designed to execute small, simple tasks in such a way as to consume much less active energy than executing the same tasks on the high-performance processor would. In other words, there must be a significant differential in energy-per-cycle (EPC) between low-end and high-end processors. Several means exist to widen this EPC differential, for example, changing the architecture of the low-end processor so that fewer transistors are involved in each cycle. Voltage scaling, transistor device scaling, and the switching scheme known as adiabatic switching are circuit techniques that improve EPC [3].

The concept of using more than one processor in a computer for power management is not new. The PC AT used a separate, small, battery-powered microprocessor to maintain time and date when the PC was powered off. The batteries for this function were often soldered in place in early PCs, so the circuit was clearly designed for very low current drain over a long period of time. Further, a number of mobile phone companies have filed patents on computer architectures which utilize multiple computational devices [4,5]. The common thread among these systems is that they represent static configurations with prescribed functionality. We, on the other hand, are mainly interested in developing a power-efficient dynamic, or general-purpose, computer system with functionality like that of the Palm Pilot, Compaq Itsy, and Linux Watch; in other words, a computer platform for which a programmer can, with relative ease, write new application and driver code, and on which that code is executed in the most power-efficient manner.

The multi-processor system we are going to propose cannot readily be built today, since many of its software and hardware components are presently non-existent or require significant modification. In other words, it would take a


considerable effort to properly research and mature such a system. Nevertheless, we still believe that the system has merit from the perspective of Makimoto's Figure of Merit formula [6],

    Figure of Merit = (Intelligence) / ((Size)(Cost)(Power)),

which is a qualitative measure of the value of a nomadic device as perceived by the user. Even though the formula is crude, it does suggest that it may be acceptable to trade off Size and Cost for improved Power and Functionality/Intelligence.

The paper is organized as follows. In Chapter 2, the multi-processor system is presented and we walk through a usage example. Next, in Chapters 3 and 4, we present the hypothetical target device on which the energy analysis is performed, and the processor energy model for calculating the EPC of each processor. Chapter 5 presents the task suite and user model, and discusses the analytical results. Chapter 6 takes a broader look at the whole system. Chapter 7 is a summary.

2 A Low Power Multi-Processor Computer System

Architecture. In this and the next chapter we propose a low power multi-processor computer system. It is a first attempt to piece the whole system together in enough detail to facilitate some minimal analysis of the power-savings potential. We wish to give readers enough appreciation for how the system may be connected and operated that they can improve on, or suggest alternatives to, the system.


Fig. 1. Multi-processor computer system for power conscious task scheduling.

Figure 1 shows an example of a multi-processor system. It utilizes 2 processors, P1 and P2, and a governor circuit, GOV. MEM is the memory space and I/O is the I/O space. SIG_GP, BUS_MEM, and BUS_I/O are the governor-processor signal lines, the memory bus, and the I/O bus, respectively.


P1 and P2 execute tasks. P1 is the most power-efficient processor but has little computing performance. P2 is the least power-efficient processor but has very high computing performance. All interrupts from the I/O space and from the two processors are brought to GOV. GOV intercepts the interrupt signals and determines which of the 2 processors should handle each interrupt. Issues such as interrupt ownership, and which processor may execute the task associated with the interrupt in the most power-efficient manner, are considered by GOV and are discussed next.

System Infrastructure. In the following we discuss some of the dynamic aspects of the system operation. The discussion is generic and is not limited to a 2-processor system. All static issues, such as initial setup, software loading, table establishment, and so forth, are not discussed. The discussion will shed some light on how the whole computer system may work together. At the end, we give an example of how a calculator application is launched and operated using a touchscreen. The following assumptions about the system infrastructure are made:

- Processors execute tasks simultaneously in parallel.
- In general, the processors share neither code nor data space, and code is never moved from one processor's code space to another's. The only memory spaces shared among the processors are device buffer areas, the aforementioned tables in GOV, and space for passing parameters.
- Interrupt handlers and tasks have been individually profiled, so their computing-capacity requirements are known for each of the processors on which they may execute.
- Four system tables are used to coordinate energy-efficient scheduling of tasks (see Figure 2): the interrupt vector table (IVT), the peripheral device attribute table (DAT), the process task attribute table (TAT), and the processor capacity table (PCT). As shown in Figure 2, the tables are local to GOV. GOV can access the tables without stealing bus cycles from BUS_MEM, and the tables may be updated dynamically by the processors through BUS_MEM. The IVT contains dynamic pointers to DATs and TATs, so GOV can access the proper table upon reception of an interrupt. The DAT and TAT structures are identical. The parameters are shown in Figure 2, most of which are self-explanatory. POWNER is the ID of the processor that currently owns the task or handler. NPH is the number of processors which may potentially host (execute) the task or handler. {P, CPS, ADDR}_TID,i is the {processor ID, demand on processor bandwidth, code entry address} of the i'th most power-efficient processor. Note that processors are listed in order of descending energy efficiency.
- Processors dynamically update the PCT on each launch or termination of a task or interrupt handler to reflect the processor's current instantaneous spare computing capacity. GOV needs this information to properly schedule the execution of tasks and handlers.
- An OS on one processor may utilize the governor to schedule a process task for execution under an OS on another processor. A file system may be shared among OSs.
- Each OS/processor utilizes a local timer interrupt mechanism. The processors share a common time base counter for agreeing on instantaneous time.
- The OS utilizes a work-dependent timing scheme [2] in which the local hardware timer is dynamically programmed to interrupt the processor only when there is work to be done. Physical timer ticks that do not result in work are skipped, enabling the processor to save power and to shut down more effectively.
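The work-dependent timer in the last assumption can be sketched compactly (a hypothetical rendering; the task fields and hw_timer_arm are assumed names, and the actual scheme is described in [2]):

#include <stdint.h>

/* Hypothetical task descriptor: next absolute deadline in timebase ticks. */
typedef struct { uint64_t next_deadline; int pending; } sw_task_t;

extern sw_task_t tasks[];                      /* task list, defined elsewhere */
extern int       n_tasks;
extern void      hw_timer_arm(uint64_t when);  /* assumed HAL call             */

/* Program the hardware timer for the earliest pending deadline only;
 * ticks that would produce no work are never taken.                  */
void arm_next_wakeup(void)
{
    uint64_t earliest = UINT64_MAX;
    for (int i = 0; i < n_tasks; i++)
        if (tasks[i].pending && tasks[i].next_deadline < earliest)
            earliest = tasks[i].next_deadline;
    if (earliest != UINT64_MAX)
        hw_timer_arm(earliest);  /* otherwise sleep until an I/O interrupt */
}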

(Figure 2 depicts the IVT, DATs, TATs, and PCT as tables local to GOV, with the IVT pointing to the DAT/TAT entries and MEM attached over BUS_MEM.) The DAT/TAT fields recoverable from the figure are:

Parameter       Description
TID             Task identification number.
POWNER          ID of current owner processor (if any).
NPH             Number of potential host processors.
P_TID,1         ID of most energy-efficient processor.
CPS_TID,1       Required cycles/sec to sustain the task on it.
ADDR_TID,1      Pointer to task code.
...
P_TID,NPH       ID of least energy-efficient processor.
CPS_TID,NPH     Required cycles/sec to sustain the task on it.
ADDR_TID,NPH    Pointer to task code.

Fig. 2. System tables for energy efficient task/handler scheduling.

Usage Example. The following example demonstrates how the whole system could work together. Assume a process called user interface (UI) is running on the processor PBIG. Assume that UI has opened the touchscreen device, and the system has therefore updated POWNER in the touchscreen DAT located in GOV so that POWNER=PBIG. Now the user uses the touchscreen to select the icon representing a calculator application. The touch interrupt is detected by GOV, which uses the IVT to find the associated attribute table. GOV then checks in the table whether, and by which processor, the touchscreen is currently owned, finds POWNER in the processor list, puts the touchscreen interrupt handler address into a predefined memory slot, and finally signals/interrupts PBIG, which in turn jumps to the interrupt handler address.

UI may now launch the calculator application, which is yet another process task. But let's assume that the launcher software first peeks into the calculator application's TAT and discovers that the application requires very little computing capacity. In this case, the launcher decides not to launch the calculator on PBIG but rather passes the calculator request to GOV for execution on a more power-efficient processor. PBIG now updates


its own interrupt entry in the IVT in GOV with the address of the calculator application's TAT and clears the POWNER field in that attribute table. Then PBIG interrupts GOV. Upon reception of the interrupt, GOV, via the IVT, finds the associated attribute table and determines that it is not owned by any processor. GOV will then schedule the calculator application on the most power-efficient processor, let's call it PLITTLE, on which another UI process is also running. PLITTLE now changes the owner to POWNER=PLITTLE in the calculator's TAT and then launches the calculator application (say, from FLASH).

The next time PLITTLE receives a calculator interrupt, it is probably due to the user entering data. So PLITTLE must determine the proper address to jump to in the calculator application upon future calculator inputs, and it updates the address in the calculator TAT accordingly. Since it is now likely that the next screen interrupt will be associated with the calculator application, PLITTLE further opens a touchscreen driver and updates the driver address and POWNER in the touchscreen DAT accordingly. In this fashion, the next touchscreen interrupt is routed directly to PLITTLE instead of to the original owner PBIG, which can then be put to sleep for a longer period. However, this shift in ownership of the touchscreen interrupt should only occur if PLITTLE has the spare processor bandwidth specified in the touchscreen DAT.

It is also important that the touch handler and/or the UI manager on PLITTLE can determine whether the (x,y)-coordinates belong to the calculator. If the coordinates do not belong to the calculator application, it is equally important that PLITTLE can determine to which application, if any, the coordinates do belong, so that it can reflect the interrupt to the proper processor via GOV (assuming the application is not already running on PLITTLE). PLITTLE would do this by putting the (x,y) data in a shared buffer, updating the jump address in PLITTLE's IVT entry to point to the application's TAT, and then finally interrupting GOV. PLITTLE should also be able to launch a new application in the same fashion that PBIG originally launched the calculator application.

In order to save power effectively, it is important that the UI on PLITTLE is itself capable of updating the screen whenever the calculator is being used. This has two consequences. First, it requires the use of an external display controller. Second, it requires the ability of several UI managers to coordinate access and share information about screen content and contexts. This may be accomplished through the shared file or buffer system.
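To make the table-driven routing concrete, the sketch below shows one plausible rendering of GOV's dispatch path (our illustration under the assumptions above; the structures and function names are invented, since the paper specifies the tables but not GOV's internal logic). An interrupt is resolved through the IVT to a DAT/TAT, the current owner is honored if one exists, and otherwise the processor list is walked in order of descending energy efficiency until the PCT shows sufficient spare capacity:

#include <stdint.h>

#define NPH_MAX 4

/* DAT/TAT entry (identical structure), after the paper's Figure 2. */
typedef struct {
    int      tid;           /* task identification number             */
    int      p_owner;       /* current owner processor, -1 = none     */
    int      nph;           /* number of potential host processors    */
    int      p[NPH_MAX];    /* processor IDs, most efficient first    */
    uint32_t cps[NPH_MAX];  /* required cycles/sec on that processor  */
    uint32_t addr[NPH_MAX]; /* code entry address on that processor   */
} att_t;

extern att_t   *ivt_lookup(int irq);        /* IVT: interrupt -> DAT/TAT   */
extern uint32_t pct_spare_cps(int proc);    /* PCT: spare capacity [Hz]    */
extern void     signal_processor(int proc, uint32_t addr); /* via SIG_GP   */

/* Route an interrupt to the most energy-efficient capable processor. */
void gov_dispatch(int irq)
{
    att_t *t = ivt_lookup(irq);
    if (t->p_owner >= 0) {                  /* owner receives it directly  */
        for (int i = 0; i < t->nph; i++)
            if (t->p[i] == t->p_owner) {
                signal_processor(t->p_owner, t->addr[i]);
                return;
            }
    }
    for (int i = 0; i < t->nph; i++)        /* descending efficiency order */
        if (pct_spare_cps(t->p[i]) >= t->cps[i]) {
            signal_processor(t->p[i], t->addr[i]);
            return;
        }
    /* No processor has spare capacity: fall back to the fastest host. */
    signal_processor(t->p[t->nph - 1], t->addr[t->nph - 1]);
}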

3 The Target Device

In this section, we introduce the SensorWatch, a small wearable device with several sensors intended to help it infer its wearer's condition. For example, our hypothetical SensorWatch is able to measure its wearer's body temperature and pulse. When the


user takes the SensorWatch off, putting it on his bedside table, the device infers, from the lack of a pulse and temperature, that the user is not wearing the device. This inference is used to enter a power-saving mode, disabling interfaces and tasks which are not required when the watch is not worn. On the other hand, the device maintains the integrity of the watch, keeping the time with the most power-efficient processor. Note that, in this case, timekeeping is not the only task that must run in detached mode, since the watch will want to sample sensors periodically to determine when the wearer puts the watch on again.

For simplicity in exposition and analysis, we consider just static scheduling of tasks. In other words, a task's characteristics, such as the processor on which it should run, are established at task creation time and do not vary. A task, whenever it is invoked, will always run on the same processor.

Assumptions. The SensorWatch has time and sensor monitoring functions which must take place continuously, and at the lowest possible power. It also has on-demand functions requested in various ways by the user, which have response-time requirements that may make it necessary to run them on higher-powered processors. SensorWatch is a hypothetical device. For clarity we ignore other power-consuming devices, such as sensors, memory, network interface, and display. It is assumed that:

1. We have a wearable device with multiple processors.
2. The device has multiple sensors it must monitor at the lowest possible power.
3. Wearers create a predictable mix of events.
4. Each processor is maximally duty cycled.
5. The CPU cycles required to enter and exit SLEEP mode are negligible relative to the task CPU cycles.
6. The CPU cycles required by the first-level interrupt handlers are negligible relative to the task CPU cycles.
7. The power consumed by GOV is negligible.
8. Each processor is able to accommodate the worst-case combination of tasks which run concurrently under a multi-tasking operating system on the processor.

We consider SensorWatches with one, two, and three processors. We first describe our hypothetical task characteristics and review certain task scheduling issues. Next we present a processor energy model and define energy-related task parameters. We then give a suite of tasks to be considered for analysis. Finally, we give the results.

Task Characteristics. We characterize tasks as either CPU-bound or I/O-bound. CPU-bound tasks run to completion as quickly as their CPU can process them, never entering SLEEP mode. I/O-bound tasks run until they must issue an I/O request through some interface, which they then wait for by putting the processor in SLEEP mode. When the I/O completes, the SLEEP mode is interrupted and the task resumes execution. Task events arrive in two ways: either randomly or predictably. A randomly scheduled task is characterized by how many times per day, NIPD, the user, or some other random-like process, triggers the task. A predictably scheduled task is characterized by an interrupt frequency, F = 1/T, where T is the scheduling interval. Thus, we can now define the following 4 task types:

Type A: CPU-bound, randomly scheduled.
Type B: CPU-bound, predictably scheduled.
Type C: I/O-bound, randomly scheduled.
Type D: I/O-bound, predictably scheduled.

Scheduling. As a first approximation to the scheduling algorithm outlined earlier, we assume a static distribution of tasks. Thus, any interrupt received by GOV always results in the same processor selection for task execution; one possible encoding of this static mapping is sketched below.
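The four task types and the static assignment map naturally onto a small descriptor (illustrative only; the paper does not prescribe a data structure):

/* The four task types: CPU- vs. I/O-bound, random vs. predictable arrival. */
typedef enum { TYPE_A, TYPE_B, TYPE_C, TYPE_D } task_type_t;

/* Static task descriptor: with static scheduling, the host processor is
 * fixed at task-creation time and never changes.                        */
typedef struct {
    task_type_t type;
    double      T_sec;  /* response time / scheduling interval T          */
    double      NC;     /* cycles to complete one invocation              */
    double      NIPD;   /* interrupts (invocations) per day               */
    int         host;   /* processor the task always runs on              */
} task_t;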

4 Processor Energy Model

We will assume a simple energy model for the processors. It is assumed that a processor dissipates the same amount of energy in each cycle. This enables us to represent a processor's energy efficiency by its Energy Per clock Cycle, EPC. The energy model is based on the assumption that the energy efficiency improves as the processor clock frequency decreases. This may be achieved by voltage scaling and by optimizing transistor design parameters [3]. Further, the overall size of the chip may be reduced by making the caches smaller, by shrinking register and bus widths, and by ensuring that the tasks that run on a low-end processor limit themselves to the native register widths. These constraints would further reduce EPC by lowering the number of switching elements per operation and by reducing wiring capacitance.

Finally, we mention the technique of adiabatic switching [3]. The notion of adiabatic switching is to charge up the switching capacitor slowly, by ramping up the supply voltage in synchronization with the change in output bit value, thus effectively minimizing heat loss in the resistive path. In conventional CMOS switching technology, a transistor state is changed by instantaneously applying or removing the supply voltage, Vdd, across the RC element. Adiabatic switching promises EPC proportional to f_clk; in other words, the slower the processor runs, the better the energy efficiency. However, implementing adiabatic switching requires additional control circuitry, which increases capacitance and complexity, so some of the advantage is lost. Adiabatic switching circuits appear to be most promising in low-speed circuits with clock frequencies below 10 MHz or so [9,10].

By lumping together all the techniques mentioned above, and being somewhat conservative about the net result, we assume that a processor's energy efficiency may be characterized by the equation

    EPC = K √f_clk,                                                    (1)

where K is a proportionality constant.


Task-Related Energy Parameters. Next, we define the following parameters:

NC:   Number of Cycles to complete the task.
CPS:  Cycles Per Second [Hz] required to complete the task in time T.
NIPD: Number of Interrupts Per Day.

For periodic tasks, i.e. type B, NIPD may be calculated as NIPD_i = 86,400 s / T_i, where T_i is the maximum duration the task may take to complete, which in the case of a type B task is identical to the interrupt interval, and 86,400 is the number of seconds in a day. For type A and type C tasks, NIPD is based on the User Activity Level, UAL, or how frequently the user uses his PvD. When discussing specific tasks later, we are going to assign typical values of NIPD to these tasks and then consider what happens if the user is more or less active.

The Number of Cycles Per Day, NCPD, for task i on processor j may be calculated as

    NCPD_i,j = NC_i,j × NIPD_i.                                        (2)

The Energy Per Day, EPD, for task i on processor j may be calculated as

    EPD_i,j = NCPD_i,j × EPC_j.                                        (3)

The Energy Per Day for processor j may be calculated as

    EPD_j = Σ_{i=1..NT_j} EPD_i,j,                                     (4)

where NT_j is the number of tasks on processor j. The total Energy Per Day, EPD_TOT, for all NP processors may be calculated as

    EPD_TOT = Σ_{j=1..NP} EPD_j.                                       (5)
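In code form, Eqs. 1-5 amount to a single weighted sum (a sketch for illustration; since Eqs. 4 and 5 simply regroup the per-task terms, one flat loop suffices). K is the proportionality constant of Eq. 1, for which a concrete value is derived in the next chapter:

#include <math.h>

/* Eq. 1: energy per cycle as a function of clock frequency (K in J*s^1/2). */
static double epc(double K, double f_clk_hz)
{
    return K * sqrt(f_clk_hz);
}

/* Eqs. 2-5: total energy per day over all tasks of all processors.
 * nc[i] and nipd[i] describe task i, host[i] is its processor, and
 * f_clk[j] is the clock frequency chosen for processor j.          */
static double epd_tot(int n_tasks, const double *nc, const double *nipd,
                      const int *host, const double *f_clk, double K)
{
    double total = 0.0;
    for (int i = 0; i < n_tasks; i++) {
        double ncpd = nc[i] * nipd[i];           /* Eq. 2           */
        total += ncpd * epc(K, f_clk[host[i]]);  /* Eqs. 3-5 merged */
    }
    return total;
}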

5 Task Suite

We now create a hypothetical mix of tasks, divided into three categories depending on their requirements for processor performance. The task mix and the associated computational characteristics are listed in Tables 1-3. The names of the tasks should be self-explanatory. The second column accounts for the basic demands on response time (or periodicity) T, the number of cycles NC to run to completion (if applicable), and the number of times per day the task is triggered (user dependent). The third column contains the demand on processor bandwidth required to sustain the task; at the very bottom of each table is the total demand on processor bandwidth, CPS_TOT, assuming the worst-case mix of tasks executing simultaneously. (Note, some tasks may be mutually exclusive.) The fourth, and last, column contains the total number of cycles per day for each task; at the very bottom is the accumulated total number of cycles per day, NCPD_TOT, for the particular task suite.


The low-performance tasks (shown in Table 1) are all CPU-bound and periodic (type B): they are timer-interrupted tasks which in turn poll a sensor interface (except TimeDate, which just updates time and date), update some variables, and then determine whether the new values of the variables have exceeded a threshold, or whether the evolution of the values signifies some interesting change. The purpose of the low-performance tasks is largely to determine whether to initiate/enable or disable other tasks and hardware components for the sake of power management, and to infer the state of the user and the user's surroundings. All tasks may run concurrently.

Table 1. Characteristics of low-performance tasks (NT_low = 8).

Task Name     Basic Demands and Task Properties   CPS [Hz]   NCPD [10^6]
TimeDate      T=1s, NC=500                        500        43
UserTemp      T=60s, NC=500                       8          0.7
UserPulse     T=10s, NC=500                       50         4.3
UserAudio     T=100ms, NC=500                     5k         433
AmbTemp       T=1s, NC=500                        500        43
AmbHumid      T=100ms, NC=500                     5k         433
DeviceOrient  T=50ms, NC=500                      10k        864
DeviceAccel   T=50ms, NC=500                      10k        864
Total                                             31k        2,685

Table 2. Characteristics of medium-performance tasks (NT_med = 5).

Task Name      Basic Demands and Task Properties            CPS [Hz]   NCPD [10^6]
EvaluateWorld  NIPD=250/day, T=50ms, NC=10,000              0.2 M      2.5
UpdateDisplay  NIPD=1000/day, T=50ms, NC=50,000             1 M        50
FetchDbRec     NIPD=250/day, T=50ms, NC=10,000              0.2 M      2.5
UINavigation   NIPD=500/day, T=50ms, NC=5,000               0.1 M      2.5
SyncDb         NIPD=10/day, T=10ms/rec x 1000 rec = 10s,    0.1 M      10
               NC=1000 cycles/rec x 1000 rec = 10^6
Total                                                       1.5 M      67.5

Table 3. Characteristics of high-performance tasks (NT_high = 2).

Task Name      Basic Demands and Task Properties            CPS [Hz]   NCPD [10^6]
VoiceCommand   NIPD=100/day, T=1s, NC=7.5x10^6              7.5 M      750
AudioMemo      NIPD=20/day, T=5s, NC=12.5x10^6              2.5 M      250
Total                                                       7.5 M      1,000

The medium-performance tasks (shown in Table 2) have user-centric real-time requirements: acceptable behavior for them is governed by reaction-time requirements based on user-experience considerations. EvaluateWorld, UpdateDisplay, FetchDbRec, and UINavigation are of type A, since they have to run to completion within a time acceptable to a user. On the other hand, SyncDb (synchronize database) is a task that may incorporate network resources. It will typically send out requests and information and then sit and wait for a reply of sorts; thus, it is of type C. The reply, once it arrives, may not be continuous but may rather arrive in multiple chunks. This task may put the processor into the SLEEP state while waiting for the network interface to generate an interrupt. All tasks may run concurrently.

The high-performance tasks (shown in Table 3) are similar to most of the medium-performance tasks in that they are randomly interrupted and, once interrupted, run as fast as they need to sustain their function. Both tasks are of type A, and the two tasks are mutually exclusive.

User Activity Level. As mentioned earlier, the User Activity Level, UAL, will impact the total energy performance, so UAL must be included in the analysis. If UAL=1, the user is assumed to use the system exactly as described above. For example, he would issue 100 voice commands per day, where each command lasts 1 s, and he would synchronize his databases 10 times per day. If the user is twice as active, i.e. UAL=2, he will issue 200 voice commands per day and synchronize his databases 20 times per day. More generally, we assume that the User Activity Level only affects the medium- and high-performance tasks.

Processor Speeds and Energy Efficiencies. First, we need to make an assumption about energy efficiency at some given clock frequency. Let us take a good mobile processor, the StrongARM SA-1110, as our reference candidate. This processor dissipates 240 mW at 133 MHz. Thus, we can calculate the Energy Per Cycle at this reference point,

    EPC(f_clk = 133 MHz) = 0.24 W / 133 MHz = 1.8 nJ,

from which the proportionality constant in Eq. 1 can be calculated, so that Equation 1 now becomes

    EPC = (1.8 nJ / √(133 MHz)) √f_clk = 156 fJ·s^(1/2) × √f_clk.      (6)

As mentioned earlier, we assume that each processor is designed to support exactly the worst-case combination of tasks that may conceivably run on it. The required clock frequency f_clk,j of processor j is found by appropriately summing CPS_TOT from Tables 1-3, according to how many processors are considered and on which processor each task suite executes. In turn, we can then calculate EPC_j. The results follow.

1-processor system: All tasks run on P1.
    P1: f_clk,1 = 9.131 MHz  =>  EPC_1 = 0.471 nJ

2-processor system: Low-performance tasks run on P1 and all other tasks on P2.
    P1: f_clk,1 = 0.031 MHz  =>  EPC_1 = 0.028 nJ
    P2: f_clk,2 = 9.1 MHz    =>  EPC_2 = 0.471 nJ

3-processor system: Low-, medium-, and high-performance tasks run on P1, P2, and P3, respectively.
    P1: f_clk,1 = 0.031 MHz  =>  EPC_1 = 0.028 nJ
    P2: f_clk,2 = 1.6 MHz    =>  EPC_2 = 0.197 nJ
    P3: f_clk,3 = 7.5 MHz    =>  EPC_3 = 0.427 nJ
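As a quick arithmetic check of Eq. 6: for the 1-processor system, EPC_1 = 156 fJ·s^(1/2) × √(9.131 × 10^6 Hz) ≈ 156 fJ × 3022 ≈ 0.471 nJ, matching the value above; the other entries follow in the same way.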

Results. Assuming the user activity level varies from very inactive (UAL=0.001) to very active (UAL=10), we calculated the energy performance, EPD_TOT, for 1-, 2-, and 3-processor systems. The results are shown in Figure 3. Comparing the 2- and 3-processor cases with the 1-processor case, the processor energy consumption is reduced by a factor of 18 when the user is very inactive (UAL=0.001), by a factor of 3 when the user activity is average (UAL=1), and by a factor of 1.25-1.4 when the user is very active. The reason the 3-processor system does not offer much additional improvement is that the task suite that runs on P2 does not contribute significantly to the total number of task cycles consumed by the entire system.


Fig. 3. Total Energy Per Day versus User Activity Level for 1-, 2- and 3-processor systems. UAL only affects medium- and high-performance tasks.

Keeping the processor designs constant in the 3 cases, and considering a usage model in which the user does not use the high-performance task suite at all (say, if he uses the device in a noisy environment) and in which he toggles the medium-performance tasks at a five-times-higher rate, the 3-processor case would clearly improve the efficiency over the 2-processor case in the UAL=0.1-10 range, where the energy consumption is dominated by the medium- and high-speed processors.


6 The Big Picture: A Discussion

Hardware Systems Perspective. The above results look very promising. But there are several factors that currently make it difficult, if not impossible, to reap the benefits of a low-power multi-processor system. Most importantly, energy-efficient low-speed (say