Computer Architecture, Sixth Edition: A Quantitative Approach [6 ed.] 0128119055, 9780128119051


Computer Architecture Formulas

1. CPU time = Instruction count × Clock cycles per instruction × Clock cycle time

2. X is n times faster than Y: n = Execution time_Y / Execution time_X = Performance_X / Performance_Y

3. Amdahl's Law:
   Speedup_overall = Execution time_old / Execution time_new = 1 / [ (1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]

4. Energy_dynamic ∝ 1/2 × Capacitive load × Voltage²

5. Power_dynamic ∝ 1/2 × Capacitive load × Voltage² × Frequency switched

6. Power_static ∝ Current_static × Voltage

7. Availability = Mean time to fail / (Mean time to fail + Mean time to repair)

8. Die yield = Wafer yield × 1 / (1 + Defects per unit area × Die area)^N
   where Wafer yield accounts for wafers that are so bad they need not be tested and N is a parameter called the process-complexity factor, a measure of manufacturing difficulty. N ranges from 11.5 to 15.5 in 2011.

9. Means—arithmetic (AM), weighted arithmetic (WAM), and geometric (GM):
   AM = (1/n) × Σ_{i=1..n} Time_i
   WAM = Σ_{i=1..n} Weight_i × Time_i
   GM = ( Π_{i=1..n} Time_i )^{1/n}
   where Time_i is the execution time for the ith program of a total of n in the workload and Weight_i is the weighting of the ith program in the workload.

10. Average memory-access time = Hit time + Miss rate × Miss penalty

11. Misses per instruction = Miss rate × Memory accesses per instruction

12. Cache index size: 2^index = Cache size / (Block size × Set associativity)

13. Power Utilization Effectiveness (PUE) of a Warehouse Scale Computer = Total Facility Power / IT Equipment Power
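To make these formulas concrete, here is a minimal Python sketch (not from the book; the numeric inputs are invented for illustration) that evaluates three of them:

def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    # Formula 3: overall speedup when a fraction of execution time is sped up.
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

def average_memory_access_time(hit_time, miss_rate, miss_penalty):
    # Formula 10: all times in the same unit (clock cycles or nanoseconds).
    return hit_time + miss_rate * miss_penalty

def die_yield(wafer_yield, defects_per_unit_area, die_area, n):
    # Formula 8: defects per cm^2 and die area in cm^2; n is the process-complexity factor.
    return wafer_yield / (1.0 + defects_per_unit_area * die_area) ** n

print(amdahl_speedup(0.9, 10))                   # ~5.26: a 10x enhancement applied to 90% of the time
print(average_memory_access_time(1, 0.02, 100))  # 3.0 cycles
print(die_yield(1.0, 0.04, 1.5, 13.5))           # ~0.46 of the dies on a wafer are good

Running the sketch prints a speedup of about 5.3, an average memory-access time of 3 cycles, and a die yield of roughly 0.46.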

Rules of Thumb

1. Amdahl/Case Rule: A balanced computer system needs about 1 MB of main memory capacity and 1 megabit per second of I/O bandwidth per MIPS of CPU performance.
2. 90/10 Locality Rule: A program executes about 90% of its instructions in 10% of its code.
3. Bandwidth Rule: Bandwidth grows by at least the square of the improvement in latency.
4. 2:1 Cache Rule: The miss rate of a direct-mapped cache of size N is about the same as a two-way set-associative cache of size N/2.
5. Dependability Rule: Design with no single point of failure.
6. Watt-Year Rule: The fully burdened cost of a Watt per year in a Warehouse Scale Computer in North America in 2011, including the cost of amortizing the power and cooling infrastructure, is about $2.
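As a quick worked example of how these rules are applied (the numbers here are invented for illustration, not taken from the book): by the Amdahl/Case Rule, a processor sustaining 10,000 MIPS would be balanced by roughly 10 GB of main memory and about 10 Gbit/s of I/O bandwidth, and by the 2:1 Cache Rule, a 64 KB direct-mapped cache should show roughly the same miss rate as a 32 KB two-way set-associative cache.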

In Praise of Computer Architecture: A Quantitative Approach, Sixth Edition

“Although important concepts of architecture are timeless, this edition has been thoroughly updated with the latest technology developments, costs, examples, and references. Keeping pace with recent developments in open-sourced architecture, the instruction set architecture used in the book has been updated to use the RISC-V ISA.”
—from the foreword by Norman P. Jouppi, Google

“Computer Architecture: A Quantitative Approach is a classic that, like fine wine, just keeps getting better. I bought my first copy as I finished up my undergraduate degree and it remains one of my most frequently referenced texts today.”
—James Hamilton, Amazon Web Services

“Hennessy and Patterson wrote the first edition of this book when graduate students built computers with 50,000 transistors. Today, warehouse-size computers contain that many servers, each consisting of dozens of independent processors and billions of transistors. The evolution of computer architecture has been rapid and relentless, but Computer Architecture: A Quantitative Approach has kept pace, with each edition accurately explaining and analyzing the important emerging ideas that make this field so exciting.”
—James Larus, Microsoft Research

“Another timely and relevant update to a classic, once again also serving as a window into the relentless and exciting evolution of computer architecture! The new discussions in this edition on the slowing of Moore's law and implications for future systems are must-reads for both computer architects and practitioners working on broader systems.”
—Parthasarathy (Partha) Ranganathan, Google

“I love the ‘Quantitative Approach’ books because they are written by engineers, for engineers. John Hennessy and Dave Patterson show the limits imposed by mathematics and the possibilities enabled by materials science. Then they teach through real-world examples how architects analyze, measure, and compromise to build working systems. This sixth edition comes at a critical time: Moore’s Law is fading just as deep learning demands unprecedented compute cycles. The new chapter on domain-specific architectures documents a number of promising approaches and prophesies a rebirth in computer architecture. Like the scholars of the European Renaissance, computer architects must understand our own history, and then combine the lessons of that history with new techniques to remake the world.”
—Cliff Young, Google


Computer Architecture A Quantitative Approach Sixth Edition

John L. Hennessy is a Professor of Electrical Engineering and Computer Science at Stanford University, where he has been a member of the faculty since 1977 and was, from 2000 to 2016, its 10th President. He currently serves as the Director of the Knight-Hennessy Fellowship, which provides graduate fellowships to potential future leaders. Hennessy is a Fellow of the IEEE and ACM, a member of the National Academy of Engineering, the National Academy of Science, and the American Philosophical Society, and a Fellow of the American Academy of Arts and Sciences. Among his many awards are the 2001 Eckert-Mauchly Award for his contributions to RISC technology, the 2001 Seymour Cray Computer Engineering Award, and the 2000 John von Neumann Award, which he shared with David Patterson. He has also received 10 honorary doctorates. In 1981, he started the MIPS project at Stanford with a handful of graduate students. After completing the project in 1984, he took a leave from the university to cofound MIPS Computer Systems, which developed one of the first commercial RISC microprocessors. As of 2017, over 5 billion MIPS microprocessors have been shipped in devices ranging from video games and palmtop computers to laser printers and network switches. Hennessy subsequently led the DASH (Director Architecture for Shared Memory) project, which prototyped the first scalable cache coherent multiprocessor; many of the key ideas have been adopted in modern multiprocessors. In addition to his technical activities and university responsibilities, he has continued to work with numerous start-ups, both as an early-stage advisor and an investor. David A. Patterson became a Distinguished Engineer at Google in 2016 after 40 years as a UC Berkeley professor. He joined UC Berkeley immediately after graduating from UCLA. He still spends a day a week in Berkeley as an Emeritus Professor of Computer Science. His teaching has been honored by the Distinguished Teaching Award from the University of California, the Karlstrom Award from ACM, and the Mulligan Education Medal and Undergraduate Teaching Award from IEEE. Patterson received the IEEE Technical Achievement Award and the ACM Eckert-Mauchly Award for contributions to RISC, and he shared the IEEE Johnson Information Storage Award for contributions to RAID. He also shared the IEEE John von Neumann Medal and the C & C Prize with John Hennessy. Like his co-author, Patterson is a Fellow of the American Academy of Arts and Sciences, the Computer History Museum, ACM, and IEEE, and he was elected to the National Academy of Engineering, the National Academy of Sciences, and the Silicon Valley Engineering Hall of Fame. He served on the Information Technology Advisory Committee to the President of the United States, as chair of the CS division in the Berkeley EECS department, as chair of the Computing Research Association, and as President of ACM. This record led to Distinguished Service Awards from ACM, CRA, and SIGARCH. He is currently Vice-Chair of the Board of Directors of the RISC-V Foundation. At Berkeley, Patterson led the design and implementation of RISC I, likely the first VLSI reduced instruction set computer, and the foundation of the commercial SPARC architecture. He was a leader of the Redundant Arrays of Inexpensive Disks (RAID) project, which led to dependable storage systems from many companies. He was also involved in the Network of Workstations (NOW) project, which led to cluster technology used by Internet companies and later to cloud computing. 
His current interests are in designing domain-specific architectures for machine learning, spreading the word on the open RISC-V instruction set architecture, and in helping the UC Berkeley RISELab (Real-time Intelligent Secure Execution).

Computer Architecture: A Quantitative Approach
Sixth Edition

John L. Hennessy, Stanford University
David A. Patterson, University of California, Berkeley

With Contributions by

Krste Asanovic, University of California, Berkeley
Jason D. Bakos, University of South Carolina
Robert P. Colwell, R&E Colwell & Assoc. Inc.
Abhishek Bhattacharjee, Rutgers University
Thomas M. Conte, Georgia Tech
Jose Duato, Proemisa
Diana Franklin, University of Chicago
David Goldberg, eBay
Norman P. Jouppi, Google
Sheng Li, Intel Labs
Naveen Muralimanohar, HP Labs
Gregory D. Peterson, University of Tennessee
Timothy M. Pinkston, University of Southern California
Parthasarathy Ranganathan, Google
David A. Wood, University of Wisconsin–Madison
Cliff Young, Google
Amr Zaky, University of Santa Clara

Morgan Kaufmann is an imprint of Elsevier 50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States © 2019 Elsevier Inc. All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions. This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein). Notices Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility. To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein. Library of Congress Cataloging-in-Publication Data A catalog record for this book is available from the Library of Congress British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library ISBN: 978-0-12-811905-1 For information on all Morgan Kaufmann publications visit our website at https://www.elsevier.com/books-and-journals

Publisher: Katey Birtcher
Acquisition Editor: Stephen Merken
Developmental Editor: Nate McFadden
Production Project Manager: Stalin Viswanathan
Cover Designer: Christian J. Bilbow
Typeset by SPi Global, India

To Andrea, Linda, and our four sons


Foreword by Norman P. Jouppi, Google

Much of the improvement in computer performance over the last 40 years has been provided by computer architecture advancements that have leveraged Moore’s Law and Dennard scaling to build larger and more parallel systems. Moore’s Law is the observation that the maximum number of transistors in an integrated circuit doubles approximately every two years. Dennard scaling refers to the reduction of MOS supply voltage in concert with the scaling of feature sizes, so that as transistors get smaller, their power density stays roughly constant. With the end of Dennard scaling a decade ago, and the recent slowdown of Moore’s Law due to a combination of physical limitations and economic factors, the sixth edition of the preeminent textbook for our field couldn’t be more timely. Here are some reasons. First, because domain-specific architectures can provide equivalent performance and power benefits of three or more historical generations of Moore’s Law and Dennard scaling, they now can provide better implementations than may ever be possible with future scaling of general-purpose architectures. And with the diverse application space of computers today, there are many potential areas for architectural innovation with domain-specific architectures. Second, high-quality implementations of open-source architectures now have a much longer lifetime due to the slowdown in Moore’s Law. This gives them more opportunities for continued optimization and refinement, and hence makes them more attractive. Third, with the slowing of Moore’s Law, different technology components have been scaling heterogeneously. Furthermore, new technologies such as 2.5D stacking, new nonvolatile memories, and optical interconnects have been developed to provide more than Moore’s Law can supply alone. To use these new technologies and nonhomogeneous scaling effectively, fundamental design decisions need to be reexamined from first principles. Hence it is important for students, professors, and practitioners in the industry to be skilled in a wide range of both old and new architectural techniques. All told, I believe this is the most exciting time in computer architecture since the industrial exploitation of instruction-level parallelism in microprocessors 25 years ago. The largest change in this edition is the addition of a new chapter on domainspecific architectures. It’s long been known that customized domain-specific architectures can have higher performance, lower power, and require less silicon area than general-purpose processor implementations. However when general-purpose
processors were increasing in single-threaded performance by 40% per year (see Fig. 1.11), the extra time to market required to develop a custom architecture vs. using a leading-edge standard microprocessor could cause the custom architecture to lose much of its advantage. In contrast, today single-core performance is improving very slowly, meaning that the benefits of custom architectures will not be made obsolete by general-purpose processors for a very long time, if ever. Chapter 7 covers several domain-specific architectures. Deep neural networks have very high computation requirements but lower data precision requirements – this combination can benefit significantly from custom architectures. Two example architectures and implementations for deep neural networks are presented: one optimized for inference and a second optimized for training. Image processing is another example domain; it also has high computation demands and benefits from lower-precision data types. Furthermore, since it is often found in mobile devices, the power savings from custom architectures are also very valuable. Finally, by nature of their reprogrammability, FPGA-based accelerators can be used to implement a variety of different domain-specific architectures on a single device. They also can benefit more irregular applications that are frequently updated, like accelerating internet search. Although important concepts of architecture are timeless, this edition has been thoroughly updated with the latest technology developments, costs, examples, and references. Keeping pace with recent developments in open-sourced architecture, the instruction set architecture used in the book has been updated to use the RISC-V ISA. On a personal note, after enjoying the privilege of working with John as a graduate student, I am now enjoying the privilege of working with Dave at Google. What an amazing duo!

Contents

Foreword ix
Preface xvii
Acknowledgments xxv

Chapter 1 Fundamentals of Quantitative Design and Analysis
1.1 Introduction 2
1.2 Classes of Computers 6
1.3 Defining Computer Architecture 11
1.4 Trends in Technology 18
1.5 Trends in Power and Energy in Integrated Circuits 23
1.6 Trends in Cost 29
1.7 Dependability 36
1.8 Measuring, Reporting, and Summarizing Performance 39
1.9 Quantitative Principles of Computer Design 48
1.10 Putting It All Together: Performance, Price, and Power 55
1.11 Fallacies and Pitfalls 58
1.12 Concluding Remarks 64
1.13 Historical Perspectives and References 67
Case Studies and Exercises by Diana Franklin 67

Chapter 2 Memory Hierarchy Design
2.1 Introduction 78
2.2 Memory Technology and Optimizations 84
2.3 Ten Advanced Optimizations of Cache Performance 94
2.4 Virtual Memory and Virtual Machines 118
2.5 Cross-Cutting Issues: The Design of Memory Hierarchies 126
2.6 Putting It All Together: Memory Hierarchies in the ARM Cortex-A53 and Intel Core i7 6700 129
2.7 Fallacies and Pitfalls 142
2.8 Concluding Remarks: Looking Ahead 146
2.9 Historical Perspectives and References 148
Case Studies and Exercises by Norman P. Jouppi, Rajeev Balasubramonian, Naveen Muralimanohar, and Sheng Li 148

Chapter 3 Instruction-Level Parallelism and Its Exploitation
3.1 Instruction-Level Parallelism: Concepts and Challenges 168
3.2 Basic Compiler Techniques for Exposing ILP 176
3.3 Reducing Branch Costs With Advanced Branch Prediction 182
3.4 Overcoming Data Hazards With Dynamic Scheduling 191
3.5 Dynamic Scheduling: Examples and the Algorithm 201
3.6 Hardware-Based Speculation 208
3.7 Exploiting ILP Using Multiple Issue and Static Scheduling 218
3.8 Exploiting ILP Using Dynamic Scheduling, Multiple Issue, and Speculation 222
3.9 Advanced Techniques for Instruction Delivery and Speculation 228
3.10 Cross-Cutting Issues 240
3.11 Multithreading: Exploiting Thread-Level Parallelism to Improve Uniprocessor Throughput 242
3.12 Putting It All Together: The Intel Core i7 6700 and ARM Cortex-A53 247
3.13 Fallacies and Pitfalls 258
3.14 Concluding Remarks: What’s Ahead? 264
3.15 Historical Perspective and References 266
Case Studies and Exercises by Jason D. Bakos and Robert P. Colwell 266

Chapter 4 Data-Level Parallelism in Vector, SIMD, and GPU Architectures
4.1 Introduction 282
4.2 Vector Architecture 283
4.3 SIMD Instruction Set Extensions for Multimedia 304
4.4 Graphics Processing Units 310
4.5 Detecting and Enhancing Loop-Level Parallelism 336
4.6 Cross-Cutting Issues 345
4.7 Putting It All Together: Embedded Versus Server GPUs and Tesla Versus Core i7 346
4.8 Fallacies and Pitfalls 353
4.9 Concluding Remarks 357
4.10 Historical Perspective and References 357
Case Study and Exercises by Jason D. Bakos 357

Chapter 5 Thread-Level Parallelism
5.1 Introduction 368
5.2 Centralized Shared-Memory Architectures 377
5.3 Performance of Symmetric Shared-Memory Multiprocessors 393
5.4 Distributed Shared-Memory and Directory-Based Coherence 404
5.5 Synchronization: The Basics 412
5.6 Models of Memory Consistency: An Introduction 417
5.7 Cross-Cutting Issues 422
5.8 Putting It All Together: Multicore Processors and Their Performance 426
5.9 Fallacies and Pitfalls 438
5.10 The Future of Multicore Scaling 442
5.11 Concluding Remarks 444
5.12 Historical Perspectives and References 445
Case Studies and Exercises by Amr Zaky and David A. Wood 446

Chapter 6 Warehouse-Scale Computers to Exploit Request-Level and Data-Level Parallelism
6.1 Introduction 466
6.2 Programming Models and Workloads for Warehouse-Scale Computers 471
6.3 Computer Architecture of Warehouse-Scale Computers 477
6.4 The Efficiency and Cost of Warehouse-Scale Computers 482
6.5 Cloud Computing: The Return of Utility Computing 490
6.6 Cross-Cutting Issues 501
6.7 Putting It All Together: A Google Warehouse-Scale Computer 503
6.8 Fallacies and Pitfalls 514
6.9 Concluding Remarks 518
6.10 Historical Perspectives and References 519
Case Studies and Exercises by Parthasarathy Ranganathan 519

Chapter 7 Domain-Specific Architectures
7.1 Introduction 540
7.2 Guidelines for DSAs 543
7.3 Example Domain: Deep Neural Networks 544
7.4 Google’s Tensor Processing Unit, an Inference Data Center Accelerator 557
7.5 Microsoft Catapult, a Flexible Data Center Accelerator 567
7.6 Intel Crest, a Data Center Accelerator for Training 579
7.7 Pixel Visual Core, a Personal Mobile Device Image Processing Unit 579
7.8 Cross-Cutting Issues 592
7.9 Putting It All Together: CPUs Versus GPUs Versus DNN Accelerators 595
7.10 Fallacies and Pitfalls 602
7.11 Concluding Remarks 604
7.12 Historical Perspectives and References 606
Case Studies and Exercises by Cliff Young 606

Appendix A Instruction Set Principles
A.1 Introduction A-2
A.2 Classifying Instruction Set Architectures A-3
A.3 Memory Addressing A-7
A.4 Type and Size of Operands A-13
A.5 Operations in the Instruction Set A-15
A.6 Instructions for Control Flow A-16
A.7 Encoding an Instruction Set A-21
A.8 Cross-Cutting Issues: The Role of Compilers A-24
A.9 Putting It All Together: The RISC-V Architecture A-33
A.10 Fallacies and Pitfalls A-42
A.11 Concluding Remarks A-46
A.12 Historical Perspective and References A-47
Exercises by Gregory D. Peterson A-47

Appendix B Review of Memory Hierarchy
B.1 Introduction B-2
B.2 Cache Performance B-15
B.3 Six Basic Cache Optimizations B-22
B.4 Virtual Memory B-40
B.5 Protection and Examples of Virtual Memory B-49
B.6 Fallacies and Pitfalls B-57
B.7 Concluding Remarks B-59
B.8 Historical Perspective and References B-59
Exercises by Amr Zaky B-60

Appendix C Pipelining: Basic and Intermediate Concepts
C.1 Introduction C-2
C.2 The Major Hurdle of Pipelining—Pipeline Hazards C-10
C.3 How Is Pipelining Implemented? C-26
C.4 What Makes Pipelining Hard to Implement? C-37
C.5 Extending the RISC V Integer Pipeline to Handle Multicycle Operations C-45
C.6 Putting It All Together: The MIPS R4000 Pipeline C-55
C.7 Cross-Cutting Issues C-65
C.8 Fallacies and Pitfalls C-70
C.9 Concluding Remarks C-71
C.10 Historical Perspective and References C-71
Updated Exercises by Diana Franklin C-71

Online Appendices
Appendix D Storage Systems
Appendix E Embedded Systems by Thomas M. Conte
Appendix F Interconnection Networks by Timothy M. Pinkston and Jose Duato
Appendix G Vector Processors in More Depth by Krste Asanovic
Appendix H Hardware and Software for VLIW and EPIC
Appendix I Large-Scale Multiprocessors and Scientific Applications
Appendix J Computer Arithmetic by David Goldberg
Appendix K Survey of Instruction Set Architectures
Appendix L Advanced Concepts on Address Translation by Abhishek Bhattacharjee
Appendix M Historical Perspectives and References

References R-1
Index I-1


Preface

Why We Wrote This Book Through six editions of this book, our goal has been to describe the basic principles underlying what will be tomorrow’s technological developments. Our excitement about the opportunities in computer architecture has not abated, and we echo what we said about the field in the first edition: “It is not a dreary science of paper machines that will never work. No! It’s a discipline of keen intellectual interest, requiring the balance of marketplace forces to cost-performance-power, leading to glorious failures and some notable successes.” Our primary objective in writing our first book was to change the way people learn and think about computer architecture. We feel this goal is still valid and important. The field is changing daily and must be studied with real examples and measurements on real computers, rather than simply as a collection of definitions and designs that will never need to be realized. We offer an enthusiastic welcome to anyone who came along with us in the past, as well as to those who are joining us now. Either way, we can promise the same quantitative approach to, and analysis of, real systems. As with earlier versions, we have strived to produce a new edition that will continue to be as relevant for professional engineers and architects as it is for those involved in advanced computer architecture and design courses. Like the first edition, this edition has a sharp focus on new platforms—personal mobile devices and warehouse-scale computers—and new architectures—specifically, domainspecific architectures. As much as its predecessors, this edition aims to demystify computer architecture through an emphasis on cost-performance-energy trade-offs and good engineering design. We believe that the field has continued to mature and move toward the rigorous quantitative foundation of long-established scientific and engineering disciplines.


This Edition

The ending of Moore’s Law and Dennard scaling is having as profound an effect on computer architecture as did the switch to multicore. We retain the focus on the extremes in size of computing, with personal mobile devices (PMDs) such as cell phones and tablets as the clients and warehouse-scale computers offering cloud computing as the server. We also maintain the other theme of parallelism in all its forms: data-level parallelism (DLP) in Chapters 1 and 4, instruction-level parallelism (ILP) in Chapter 3, thread-level parallelism in Chapter 5, and request-level parallelism (RLP) in Chapter 6. The most pervasive change in this edition is switching from MIPS to the RISC-V instruction set. We suspect this modern, modular, open instruction set may become a significant force in the information technology industry. It may become as important in computer architecture as Linux is for operating systems. The newcomer in this edition is Chapter 7, which introduces domain-specific architectures with several concrete examples from industry. As before, the first three appendices in the book give basics on the RISC-V instruction set, memory hierarchy, and pipelining for readers who have not read a book like Computer Organization and Design. To keep costs down but still supply supplemental material that is of interest to some readers, nine more appendices are available online at https://www.elsevier.com/books-and-journals/book-companion/9780128119051. There are more pages in these appendices than there are in this book! This edition continues the tradition of using real-world examples to demonstrate the ideas, and the “Putting It All Together” sections are brand new. The “Putting It All Together” sections of this edition include the pipeline organizations and memory hierarchies of the ARM Cortex A8 processor, the Intel Core i7 processor, the NVIDIA GTX-280 and GTX-480 GPUs, and one of the Google warehouse-scale computers.

Topic Selection and Organization As before, we have taken a conservative approach to topic selection, for there are many more interesting ideas in the field than can reasonably be covered in a treatment of basic principles. We have steered away from a comprehensive survey of every architecture a reader might encounter. Instead, our presentation focuses on core concepts likely to be found in any new machine. The key criterion remains that of selecting ideas that have been examined and utilized successfully enough to permit their discussion in quantitative terms. Our intent has always been to focus on material that is not available in equivalent form from other sources, so we continue to emphasize advanced content wherever possible. Indeed, there are several systems here whose descriptions cannot be found in the literature. (Readers interested strictly in a more basic introduction to computer architecture should read Computer Organization and Design: The Hardware/Software Interface.)


An Overview of the Content Chapter 1 includes formulas for energy, static power, dynamic power, integrated circuit costs, reliability, and availability. (These formulas are also found on the front inside cover.) Our hope is that these topics can be used through the rest of the book. In addition to the classic quantitative principles of computer design and performance measurement, it shows the slowing of performance improvement of general-purpose microprocessors, which is one inspiration for domain-specific architectures. Our view is that the instruction set architecture is playing less of a role today than in 1990, so we moved this material to Appendix A. It now uses the RISC-V architecture. (For quick review, a summary of the RISC-V ISA can be found on the back inside cover.) For fans of ISAs, Appendix K was revised for this edition and covers 8 RISC architectures (5 for desktop and server use and 3 for embedded use), the 8086, the DEC VAX, and the IBM 360/370. We then move onto memory hierarchy in Chapter 2, since it is easy to apply the cost-performance-energy principles to this material, and memory is a critical resource for the rest of the chapters. As in the past edition, Appendix B contains an introductory review of cache principles, which is available in case you need it. Chapter 2 discusses 10 advanced optimizations of caches. The chapter includes virtual machines, which offer advantages in protection, software management, and hardware management, and play an important role in cloud computing. In addition to covering SRAM and DRAM technologies, the chapter includes new material both on Flash memory and on the use of stacked die packaging for extending the memory hierarchy. The PIAT examples are the ARM Cortex A8, which is used in PMDs, and the Intel Core i7, which is used in servers. Chapter 3 covers the exploitation of instruction-level parallelism in highperformance processors, including superscalar execution, branch prediction (including the new tagged hybrid predictors), speculation, dynamic scheduling, and simultaneous multithreading. As mentioned earlier, Appendix C is a review of pipelining in case you need it. Chapter 3 also surveys the limits of ILP. Like Chapter 2, the PIAT examples are again the ARM Cortex A8 and the Intel Core i7. While the third edition contained a great deal on Itanium and VLIW, this material is now in Appendix H, indicating our view that this architecture did not live up to the earlier claims. The increasing importance of multimedia applications such as games and video processing has also increased the importance of architectures that can exploit data level parallelism. In particular, there is a rising interest in computing using graphical processing units (GPUs), yet few architects understand how GPUs really work. We decided to write a new chapter in large part to unveil this new style of computer architecture. Chapter 4 starts with an introduction to vector architectures, which acts as a foundation on which to build explanations of multimedia SIMD instruction set extensions and GPUs. (Appendix G goes into even more depth on vector architectures.) This chapter introduces the Roofline performance model and then uses it to compare the Intel Core i7 and the NVIDIA GTX 280 and GTX 480 GPUs. The chapter also describes the Tegra 2 GPU for PMDs.


Chapter 5 describes multicore processors. It explores symmetric and distributed-memory architectures, examining both organizational principles and performance. The primary additions to this chapter include more comparison of multicore organizations, including the organization of multicore-multilevel caches, multicore coherence schemes, and on-chip multicore interconnect. Topics in synchronization and memory consistency models are next. The example is the Intel Core i7. Readers interested in more depth on interconnection networks should read Appendix F, and those interested in larger scale multiprocessors and scientific applications should read Appendix I. Chapter 6 describes warehouse-scale computers (WSCs). It was extensively revised based on help from engineers at Google and Amazon Web Services. This chapter integrates details on design, cost, and performance of WSCs that few architects are aware of. It starts with the popular MapReduce programming model before describing the architecture and physical implementation of WSCs, including cost. The costs allow us to explain the emergence of cloud computing, whereby it can be cheaper to compute using WSCs in the cloud than in your local datacenter. The PIAT example is a description of a Google WSC that includes information published for the first time in this book. The new Chapter 7 motivates the need for Domain-Specific Architectures (DSAs). It draws guiding principles for DSAs based on the four examples of DSAs. Each DSA corresponds to chips that have been deployed in commercial settings. We also explain why we expect a renaissance in computer architecture via DSAs given that single-thread performance of general-purpose microprocessors has stalled. This brings us to Appendices A through M. Appendix A covers principles of ISAs, including RISC-V, and Appendix K describes 64-bit versions of RISC V, ARM, MIPS, Power, and SPARC and their multimedia extensions. It also includes some classic architectures (80x86, VAX, and IBM 360/370) and popular embedded instruction sets (Thumb-2, microMIPS, and RISC V C). Appendix H is related, in that it covers architectures and compilers for VLIW ISAs. As mentioned earlier, Appendix B and Appendix C are tutorials on basic caching and pipelining concepts. Readers relatively new to caching should read Appendix B before Chapter 2, and those new to pipelining should read Appendix C before Chapter 3. Appendix D, “Storage Systems,” has an expanded discussion of reliability and availability, a tutorial on RAID with a description of RAID 6 schemes, and rarely found failure statistics of real systems. It continues to provide an introduction to queuing theory and I/O performance benchmarks. We evaluate the cost, performance, and reliability of a real cluster: the Internet Archive. The “Putting It All Together” example is the NetApp FAS6000 filer. Appendix E, by Thomas M. Conte, consolidates the embedded material in one place. Appendix F, on interconnection networks, is revised by Timothy M. Pinkston and Jose Duato. Appendix G, written originally by Krste Asanovic, includes a description of vector processors. We think these two appendices are some of the best material we know of on each topic.


Appendix H describes VLIW and EPIC, the architecture of Itanium. Appendix I describes parallel processing applications and coherence protocols for larger-scale, shared-memory multiprocessing. Appendix J, by David Goldberg, describes computer arithmetic. Appendix L, by Abhishek Bhattacharjee, is new and discusses advanced techniques for memory management, focusing on support for virtual machines and design of address translation for very large address spaces. With the growth in cloud processors, these architectural enhancements are becoming more important. Appendix M collects the “Historical Perspective and References” from each chapter into a single appendix. It attempts to give proper credit for the ideas in each chapter and a sense of the history surrounding the inventions. We like to think of this as presenting the human drama of computer design. It also supplies references that the student of architecture may want to pursue. If you have time, we recommend reading some of the classic papers in the field that are mentioned in these sections. It is both enjoyable and educational to hear the ideas directly from the creators. “Historical Perspective” was one of the most popular sections of prior editions.

Navigating the Text

There is no single best order in which to approach these chapters and appendices, except that all readers should start with Chapter 1. If you don’t want to read everything, here are some suggested sequences:

■ Memory Hierarchy: Appendix B, Chapter 2, and Appendices D and M
■ Instruction-Level Parallelism: Appendix C, Chapter 3, and Appendix H
■ Data-Level Parallelism: Chapters 4, 6, and 7, Appendix G
■ Thread-Level Parallelism: Chapter 5, Appendices F and I
■ Request-Level Parallelism: Chapter 6
■ ISA: Appendices A and K

Appendix E can be read at any time, but it might work best if read after the ISA and cache sequences. Appendix J can be read whenever arithmetic moves you. You should read the corresponding portion of Appendix M after you complete each chapter.

Chapter Structure

The material we have selected has been stretched upon a consistent framework that is followed in each chapter. We start by explaining the ideas of a chapter. These ideas are followed by a “Crosscutting Issues” section, a feature that shows how the ideas covered in one chapter interact with those given in other chapters. This is followed by a “Putting It All Together” section that ties these ideas together by showing how they are used in a real machine. Next in the sequence is “Fallacies and Pitfalls,” which lets readers learn from the mistakes of others. We show examples of common misunderstandings and architectural traps that are difficult to avoid even when you know they are lying in wait for you. The “Fallacies and Pitfalls” sections are among the most popular sections of the book. Each chapter ends with a “Concluding Remarks” section.

Case Studies With Exercises

Each chapter ends with case studies and accompanying exercises. Authored by experts in industry and academia, the case studies explore key chapter concepts and verify understanding through increasingly challenging exercises. Instructors should find the case studies sufficiently detailed and robust to allow them to create their own additional exercises. Brackets for each exercise () indicate the text sections of primary relevance to completing the exercise. We hope this helps readers to avoid exercises for which they haven’t read the corresponding section, in addition to providing the source for review. Exercises are rated, to give the reader a sense of the amount of time required to complete an exercise:

[10] Less than 5 min (to read and understand)
[15] 5–15 min for a full answer
[20] 15–20 min for a full answer
[25] 1 h for a full written answer
[30] Short programming project: less than 1 full day of programming
[40] Significant programming project: 2 weeks of elapsed time
[Discussion] Topic for discussion with others

Solutions to the case studies and exercises are available for instructors who register at textbooks.elsevier.com.

Supplemental Materials

A variety of resources are available online at https://www.elsevier.com/books/computer-architecture/hennessy/978-0-12-811905-1, including the following:

■ Reference appendices, some guest authored by subject experts, covering a range of advanced topics
■ Historical perspectives material that explores the development of the key ideas presented in each of the chapters in the text
■ Instructor slides in PowerPoint
■ Figures from the book in PDF, EPS, and PPT formats
■ Links to related material on the Web
■ List of errata

New materials and links to other resources available on the Web will be added on a regular basis.

Helping Improve This Book Finally, it is possible to make money while reading this book. (Talk about cost performance!) If you read the Acknowledgments that follow, you will see that we went to great lengths to correct mistakes. Since a book goes through many printings, we have the opportunity to make even more corrections. If you uncover any remaining resilient bugs, please contact the publisher by electronic mail ([email protected]). We welcome general comments to the text and invite you to send them to a separate email address at [email protected].

Concluding Remarks Once again, this book is a true co-authorship, with each of us writing half the chapters and an equal share of the appendices. We can’t imagine how long it would have taken without someone else doing half the work, offering inspiration when the task seemed hopeless, providing the key insight to explain a difficult concept, supplying over-the-weekend reviews of chapters, and commiserating when the weight of our other obligations made it hard to pick up the pen. Thus, once again, we share equally the blame for what you are about to read. John Hennessy



David Patterson


Acknowledgments

Although this is only the sixth edition of this book, we have actually created ten different versions of the text: three versions of the first edition (alpha, beta, and final) and two versions of the second, third, and fourth editions (beta and final). Along the way, we have received help from hundreds of reviewers and users. Each of these people has helped make this book better. Thus, we have chosen to list all of the people who have made contributions to some version of this book.

Contributors to the Sixth Edition Like prior editions, this is a community effort that involves scores of volunteers. Without their help, this edition would not be nearly as polished.

Reviewers Jason D. Bakos, University of South Carolina; Rajeev Balasubramonian, University of Utah; Jose Delgado-Frias, Washington State University; Diana Franklin, The University of Chicago; Norman P. Jouppi, Google; Hugh C. Lauer, Worcester Polytechnic Institute; Gregory Peterson, University of Tennessee; Bill Pierce, Hood College; Parthasarathy Ranganathan, Google; William H. Robinson, Vanderbilt University; Pat Stakem, Johns Hopkins University; Cliff Young, Google; Amr Zaky, University of Santa Clara; Gerald Zarnett, Ryerson University; Huiyang Zhou, North Carolina State University. Members of the University of California-Berkeley Par Lab and RAD Lab who gave frequent reviews of Chapters 1, 4, and 6 and shaped the explanation of GPUs and WSCs: Krste Asanovic, Michael Armbrust, Scott Beamer, Sarah Bird, Bryan Catanzaro, Jike Chong, Henry Cook, Derrick Coetzee, Randy Katz, Yunsup Lee, Leo Meyervich, Mark Murphy, Zhangxi Tan, Vasily Volkov, and Andrew Waterman.

Appendices

Krste Asanovic, University of California, Berkeley (Appendix G); Abhishek Bhattacharjee, Rutgers University (Appendix L); Thomas M. Conte, North Carolina State University (Appendix E); Jose Duato, Universitat Politècnica de València and Simula (Appendix F); David Goldberg, Xerox PARC (Appendix J); Timothy M. Pinkston, University of Southern California (Appendix F). Jose Flich of the Universidad Politecnica de Valencia provided significant contributions to the updating of Appendix F.

Case Studies With Exercises Jason D. Bakos, University of South Carolina (Chapters 3 and 4); Rajeev Balasubramonian, University of Utah (Chapter 2); Diana Franklin, The University of Chicago (Chapter 1 and Appendix C); Norman P. Jouppi, Google, (Chapter 2); Naveen Muralimanohar, HP Labs (Chapter 2); Gregory Peterson, University of Tennessee (Appendix A); Parthasarathy Ranganathan, Google (Chapter 6); Cliff Young, Google (Chapter 7); Amr Zaky, University of Santa Clara (Chapter 5 and Appendix B). Jichuan Chang, Junwhan Ahn, Rama Govindaraju, and Milad Hashemi assisted in the development and testing of the case studies and exercises for Chapter 6.

Additional Material

John Nickolls, Steve Keckler, and Michael Toksvig of NVIDIA (Chapter 4 NVIDIA GPUs); Victor Lee, Intel (Chapter 4 comparison of Core i7 and GPU); John Shalf, LBNL (Chapter 4 recent vector architectures); Sam Williams, LBNL (Roofline model for computers in Chapter 4); Steve Blackburn of Australian National University and Kathryn McKinley of University of Texas at Austin (Intel performance and power measurements in Chapter 5); Luiz Barroso, Urs Hölzle, Jimmy Clidaris, Bob Felderman, and Chris Johnson of Google (the Google WSC in Chapter 6); James Hamilton of Amazon Web Services (power distribution and cost model in Chapter 6). Jason D. Bakos of the University of South Carolina updated the lecture slides for this edition. This book could not have been published without a publisher, of course. We wish to thank all the Morgan Kaufmann/Elsevier staff for their efforts and support. For this sixth edition, we particularly want to thank our editors Nate McFadden and Steve Merken, who coordinated surveys, development of the case studies and exercises, manuscript reviews, and the updating of the appendices. We must also thank our university staff, Margaret Rowland and Roxana Infante, for countless express mailings, as well as for holding down the fort at Stanford and Berkeley while we worked on the book. Our final thanks go to our wives for their suffering through increasingly early mornings of reading, thinking, and writing.


Contributors to Previous Editions Reviewers George Adams, Purdue University; Sarita Adve, University of Illinois at UrbanaChampaign; Jim Archibald, Brigham Young University; Krste Asanovic, Massachusetts Institute of Technology; Jean-Loup Baer, University of Washington; Paul Barr, Northeastern University; Rajendra V. Boppana, University of Texas, San Antonio; Mark Brehob, University of Michigan; Doug Burger, University of Texas, Austin; John Burger, SGI; Michael Butler; Thomas Casavant; Rohit Chandra; Peter Chen, University of Michigan; the classes at SUNY Stony Brook, Carnegie Mellon, Stanford, Clemson, and Wisconsin; Tim Coe, Vitesse Semiconductor; Robert P. Colwell; David Cummings; Bill Dally; David Douglas; Jose Duato, Universitat Politècnica de València and Simula; Anthony Duben, Southeast Missouri State University; Susan Eggers, University of Washington; Joel Emer; Barry Fagin, Dartmouth; Joel Ferguson, University of California, Santa Cruz; Carl Feynman; David Filo; Josh Fisher, Hewlett-Packard Laboratories; Rob Fowler, DIKU; Mark Franklin, Washington University (St. Louis); Kourosh Gharachorloo; Nikolas Gloy, Harvard University; David Goldberg, Xerox Palo Alto Research Center; Antonio González, Intel and Universitat Politècnica de Catalunya; James Goodman, University of Wisconsin-Madison; Sudhanva Gurumurthi, University of Virginia; David Harris, Harvey Mudd College; John Heinlein; Mark Heinrich, Stanford; Daniel Helman, University of California, Santa Cruz; Mark D. Hill, University of Wisconsin-Madison; Martin Hopkins, IBM; Jerry Huck, Hewlett-Packard Laboratories; Wen-mei Hwu, University of Illinois at UrbanaChampaign; Mary Jane Irwin, Pennsylvania State University; Truman Joe; Norm Jouppi; David Kaeli, Northeastern University; Roger Kieckhafer, University of Nebraska; Lev G. Kirischian, Ryerson University; Earl Killian; Allan Knies, Purdue University; Don Knuth; Jeff Kuskin, Stanford; James R. Larus, Microsoft Research; Corinna Lee, University of Toronto; Hank Levy; Kai Li, Princeton University; Lori Liebrock, University of Alaska, Fairbanks; Mikko Lipasti, University of Wisconsin-Madison; Gyula A. Mago, University of North Carolina, Chapel Hill; Bryan Martin; Norman Matloff; David Meyer; William Michalson, Worcester Polytechnic Institute; James Mooney; Trevor Mudge, University of Michigan; Ramadass Nagarajan, University of Texas at Austin; David Nagle, Carnegie Mellon University; Todd Narter; Victor Nelson; Vojin Oklobdzija, University of California, Berkeley; Kunle Olukotun, Stanford University; Bob Owens, Pennsylvania State University; Greg Papadapoulous, Sun Microsystems; Joseph Pfeiffer; Keshav Pingali, Cornell University; Timothy M. Pinkston, University of Southern California; Bruno Preiss, University of Waterloo; Steven Przybylski; Jim Quinlan; Andras Radics; Kishore Ramachandran, Georgia Institute of Technology; Joseph Rameh, University of Texas, Austin; Anthony Reeves, Cornell University; Richard Reid, Michigan State University; Steve Reinhardt, University of Michigan; David Rennels, University of California, Los Angeles; Arnold L. Rosenberg, University of Massachusetts, Amherst; Kaushik Roy, Purdue
University; Emilio Salgueiro, Unysis; Karthikeyan Sankaralingam, University of Texas at Austin; Peter Schnorf; Margo Seltzer; Behrooz Shirazi, Southern Methodist University; Daniel Siewiorek, Carnegie Mellon University; J. P. Singh, Princeton; Ashok Singhal; Jim Smith, University of Wisconsin-Madison; Mike Smith, Harvard University; Mark Smotherman, Clemson University; Gurindar Sohi, University of Wisconsin-Madison; Arun Somani, University of Washington; Gene Tagliarin, Clemson University; Shyamkumar Thoziyoor, University of Notre Dame; Evan Tick, University of Oregon; Akhilesh Tyagi, University of North Carolina, Chapel Hill; Dan Upton, University of Virginia; Mateo Valero, Universidad Politecnica de Cataluña, Barcelona; Anujan Varma, University of California, Santa Cruz; Thorsten von Eicken, Cornell University; Hank Walker, Texas A&M; Roy Want, Xerox Palo Alto Research Center; David Weaver, Sun Microsystems; Shlomo Weiss, Tel Aviv University; David Wells; Mike Westall, Clemson University; Maurice Wilkes; Eric Williams; Thomas Willis, Purdue University; Malcolm Wing; Larry Wittie, SUNY Stony Brook; Ellen Witte Zegura, Georgia Institute of Technology; Sotirios G. Ziavras, New Jersey Institute of Technology.

Appendices The vector appendix was revised by Krste Asanovic of the Massachusetts Institute of Technology. The floating-point appendix was written originally by David Goldberg of Xerox PARC.

Exercises George Adams, Purdue University; Todd M. Bezenek, University of WisconsinMadison (in remembrance of his grandmother Ethel Eshom); Susan Eggers; Anoop Gupta; David Hayes; Mark Hill; Allan Knies; Ethan L. Miller, University of California, Santa Cruz; Parthasarathy Ranganathan, Compaq Western Research Laboratory; Brandon Schwartz, University of Wisconsin-Madison; Michael Scott; Dan Siewiorek; Mike Smith; Mark Smotherman; Evan Tick; Thomas Willis.

Case Studies With Exercises Andrea C. Arpaci-Dusseau, University of Wisconsin-Madison; Remzi H. ArpaciDusseau, University of Wisconsin-Madison; Robert P. Colwell, R&E Colwell & Assoc., Inc.; Diana Franklin, California Polytechnic State University, San Luis Obispo; Wen-mei W. Hwu, University of Illinois at Urbana-Champaign; Norman P. Jouppi, HP Labs; John W. Sias, University of Illinois at Urbana-Champaign; David A. Wood, University of Wisconsin-Madison.

Special Thanks Duane Adams, Defense Advanced Research Projects Agency; Tom Adams; Sarita Adve, University of Illinois at Urbana-Champaign; Anant Agarwal; Dave
Albonesi, University of Rochester; Mitch Alsup; Howard Alt; Dave Anderson; Peter Ashenden; David Bailey; Bill Bandy, Defense Advanced Research Projects Agency; Luiz Barroso, Compaq’s Western Research Lab; Andy Bechtolsheim; C. Gordon Bell; Fred Berkowitz; John Best, IBM; Dileep Bhandarkar; Jeff Bier, BDTI; Mark Birman; David Black; David Boggs; Jim Brady; Forrest Brewer; Aaron Brown, University of California, Berkeley; E. Bugnion, Compaq’s Western Research Lab; Alper Buyuktosunoglu, University of Rochester; Mark Callaghan; Jason F. Cantin; Paul Carrick; Chen-Chung Chang; Lei Chen, University of Rochester; Pete Chen; Nhan Chu; Doug Clark, Princeton University; Bob Cmelik; John Crawford; Zarka Cvetanovic; Mike Dahlin, University of Texas, Austin; Merrick Darley; the staff of the DEC Western Research Laboratory; John DeRosa; Lloyd Dickman; J. Ding; Susan Eggers, University of Washington; Wael El-Essawy, University of Rochester; Patty Enriquez, Mills; Milos Ercegovac; Robert Garner; K. Gharachorloo, Compaq’s Western Research Lab; Garth Gibson; Ronald Greenberg; Ben Hao; John Henning, Compaq; Mark Hill, University of WisconsinMadison; Danny Hillis; David Hodges; Urs H€olzle, Google; David Hough; Ed Hudson; Chris Hughes, University of Illinois at Urbana-Champaign; Mark Johnson; Lewis Jordan; Norm Jouppi; William Kahan; Randy Katz; Ed Kelly; Richard Kessler; Les Kohn; John Kowaleski, Compaq Computer Corp; Dan Lambright; Gary Lauterbach, Sun Microsystems; Corinna Lee; Ruby Lee; Don Lewine; Chao-Huang Lin; Paul Losleben, Defense Advanced Research Projects Agency; Yung-Hsiang Lu; Bob Lucas, Defense Advanced Research Projects Agency; Ken Lutz; Alan Mainwaring, Intel Berkeley Research Labs; Al Marston; Rich Martin, Rutgers; John Mashey; Luke McDowell; Sebastian Mirolo, Trimedia Corporation; Ravi Murthy; Biswadeep Nag; Lisa Noordergraaf, Sun Microsystems; Bob Parker, Defense Advanced Research Projects Agency; Vern Paxson, Center for Internet Research; Lawrence Prince; Steven Przybylski; Mark Pullen, Defense Advanced Research Projects Agency; Chris Rowen; Margaret Rowland; Greg Semeraro, University of Rochester; Bill Shannon; Behrooz Shirazi; Robert Shomler; Jim Slager; Mark Smotherman, Clemson University; the SMT research group at the University of Washington; Steve Squires, Defense Advanced Research Projects Agency; Ajay Sreekanth; Darren Staples; Charles Stapper; Jorge Stolfi; Peter Stoll; the students at Stanford and Berkeley who endured our first attempts at creating this book; Bob Supnik; Steve Swanson; Paul Taysom; Shreekant Thakkar; Alexander Thomasian, New Jersey Institute of Technology; John Toole, Defense Advanced Research Projects Agency; Kees A. Vissers, Trimedia Corporation; Willa Walker; David Weaver; Ric Wheeler, EMC; Maurice Wilkes; Richard Zimmerman. John Hennessy



David Patterson

1.1 Introduction 2
1.2 Classes of Computers 6
1.3 Defining Computer Architecture 11
1.4 Trends in Technology 18
1.5 Trends in Power and Energy in Integrated Circuits 23
1.6 Trends in Cost 29
1.7 Dependability 36
1.8 Measuring, Reporting, and Summarizing Performance 39
1.9 Quantitative Principles of Computer Design 48
1.10 Putting It All Together: Performance, Price, and Power 55
1.11 Fallacies and Pitfalls 58
1.12 Concluding Remarks 64
1.13 Historical Perspectives and References 67
Case Studies and Exercises by Diana Franklin 67

1 Fundamentals of Quantitative Design and Analysis

An iPod, a phone, an Internet mobile communicator… these are NOT three separate devices! And we are calling it iPhone! Today Apple is going to reinvent the phone. And here it is. Steve Jobs, January 9, 2007

New information and communications technologies, in particular high-speed Internet, are changing the way companies do business, transforming public service delivery and democratizing innovation. With 10 percent increase in high speed Internet connections, economic growth increases by 1.3 percent. The World Bank, July 28, 2009

Computer Architecture. https://doi.org/10.1016/B978-0-12-811905-1.00001-8 © 2019 Elsevier Inc. All rights reserved.





1.1 Introduction

Introduction Computer technology has made incredible progress in the roughly 70 years since the first general-purpose electronic computer was created. Today, less than $500 will purchase a cell phone that has as much performance as the world’s fastest computer bought in 1993 for $50 million. This rapid improvement has come both from advances in the technology used to build computers and from innovations in computer design. Although technological improvements historically have been fairly steady, progress arising from better computer architectures has been much less consistent. During the first 25 years of electronic computers, both forces made a major contribution, delivering performance improvement of about 25% per year. The late 1970s saw the emergence of the microprocessor. The ability of the microprocessor to ride the improvements in integrated circuit technology led to a higher rate of performance improvement—roughly 35% growth per year. This growth rate, combined with the cost advantages of a mass-produced microprocessor, led to an increasing fraction of the computer business being based on microprocessors. In addition, two significant changes in the computer marketplace made it easier than ever before to succeed commercially with a new architecture. First, the virtual elimination of assembly language programming reduced the need for object-code compatibility. Second, the creation of standardized, vendor-independent operating systems, such as UNIX and its clone, Linux, lowered the cost and risk of bringing out a new architecture. These changes made it possible to develop successfully a new set of architectures with simpler instructions, called RISC (Reduced Instruction Set Computer) architectures, in the early 1980s. The RISC-based machines focused the attention of designers on two critical performance techniques, the exploitation of instruction-level parallelism (initially through pipelining and later through multiple instruction issue) and the use of caches (initially in simple forms and later using more sophisticated organizations and optimizations). The RISC-based computers raised the performance bar, forcing prior architectures to keep up or disappear. The Digital Equipment Vax could not, and so it was replaced by a RISC architecture. Intel rose to the challenge, primarily by translating 80x86 instructions into RISC-like instructions internally, allowing it to adopt many of the innovations first pioneered in the RISC designs. As transistor counts soared in the late 1990s, the hardware overhead of translating the more complex x 86 architecture became negligible. In low-end applications, such as cell phones, the cost in power and silicon area of the x 86-translation overhead helped lead to a RISC architecture, ARM, becoming dominant. Figure 1.1 shows that the combination of architectural and organizational enhancements led to 17 years of sustained growth in performance at an annual rate of over 50%—a rate that is unprecedented in the computer industry. The effect of this dramatic growth rate during the 20th century was fourfold. First, it has significantly enhanced the capability available to computer users. For many applications, the highest-performance microprocessors outperformed the supercomputer of less than 20 years earlier.

[Figure 1.1 plot: processor performance relative to the VAX-11/780 (log scale) for systems from 1978 to 2018, from the VAX-11/780 (5 MHz) and VAX 8700, through MIPS, Sun, IBM, HP, and Digital Alpha workstations, to Intel Pentium, Core 2, Core i7, and Xeon systems with SPEC ratios approaching 50,000; annotated growth segments mark the 25%/year, 52%/year, 23%/year, 12%/year, and 3.5%/year eras.]


Figure 1.1 Growth in processor performance over 40 years. This chart plots program performance relative to the VAX 11/780 as measured by the SPEC integer benchmarks (see Section 1.8). Prior to the mid-1980s, growth in processor performance was largely technology-driven and averaged about 22% per year, or doubling performance every 3.5 years. The increase in growth to about 52% starting in 1986, or doubling every 2 years, is attributable to more advanced architectural and organizational ideas typified in RISC architectures. By 2003 this growth led to a difference in performance of an approximate factor of 25 versus the performance that would have occurred if it had continued at the 22% rate. In 2003 the limits of power due to the end of Dennard scaling and the available instruction-level parallelism slowed uniprocessor performance to 23% per year until 2011, or doubling every 3.5 years. (The fastest SPECintbase performance since 2007 has had automatic parallelization turned on, so uniprocessor speed is harder to gauge. These results are limited to single-chip systems with usually four cores per chip.) From 2011 to 2015, the annual improvement was less than 12%, or doubling every 8 years in part due to the limits of parallelism of Amdahl’s Law. Since 2015, with the end of Moore’s Law, improvement has been just 3.5% per year, or doubling every 20 years! Performance for floating-point-oriented calculations follows the same trends, but typically has 1% to 2% higher annual growth in each shaded region. Figure 1.11 on page 27 shows the improvement in clock rates for these same eras. Because SPEC has changed over the years, performance of newer machines is estimated by a scaling factor that relates the performance for different versions of SPEC: SPEC89, SPEC92, SPEC95, SPEC2000, and SPEC2006. There are too few results for SPEC2017 to plot yet.
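To make the growth rates in the caption concrete, the short C sketch below (not part of the original text) converts an annual improvement rate into an approximate doubling time using doubling_time = ln(2) / ln(1 + rate); the eras and rates are taken from the caption, and the computed values approximate the rounded figures quoted there.

/* Approximate performance-doubling times implied by annual growth rates. */
#include <stdio.h>
#include <math.h>

int main(void) {
    double rates[] = {0.22, 0.52, 0.23, 0.12, 0.035};   /* growth per year */
    const char *era[] = {"pre-1986", "1986-2003", "2003-2011",
                         "2011-2015", "since 2015"};
    for (int i = 0; i < 5; i++) {
        /* doubling time in years for a compound annual growth rate */
        double years = log(2.0) / log(1.0 + rates[i]);
        printf("%-10s %4.1f%%/year -> doubles every %4.1f years\n",
               era[i], 100.0 * rates[i], years);
    }
    /* Note: the caption's "doubling every 8 years" corresponds to a rate
       somewhat below the 12% upper bound it quotes for 2011-2015. */
    return 0;
}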


Second, this dramatic improvement in cost-performance led to new classes of computers. Personal computers and workstations emerged in the 1980s with the availability of the microprocessor. The past decade saw the rise of smart cell phones and tablet computers, which many people are using as their primary computing platforms instead of PCs. These mobile client devices are increasingly using the Internet to access warehouses containing 100,000 servers, which are being designed as if they were a single gigantic computer.

Third, improvement of semiconductor manufacturing as predicted by Moore’s law has led to the dominance of microprocessor-based computers across the entire range of computer design. Minicomputers, which were traditionally made from off-the-shelf logic or from gate arrays, were replaced by servers made by using microprocessors. Even mainframe computers and high-performance supercomputers are all collections of microprocessors.

The preceding hardware innovations led to a renaissance in computer design, which emphasized both architectural innovation and efficient use of technology improvements. This rate of growth compounded so that by 2003, high-performance microprocessors were 7.5 times as fast as what would have been obtained by relying solely on technology, including improved circuit design, that is, 52% per year versus 35% per year.

This hardware renaissance led to the fourth impact, which was on software development. This 50,000-fold performance improvement since 1978 (see Figure 1.1) allowed modern programmers to trade performance for productivity. In place of performance-oriented languages like C and C++, much more programming today is done in managed programming languages like Java and Scala. Moreover, scripting languages like JavaScript and Python, which are even more productive, are gaining in popularity along with programming frameworks like AngularJS and Django. To maintain productivity and try to close the performance gap, interpreters with just-in-time compilers and trace-based compiling are replacing the traditional compiler and linker of the past. Software deployment is changing as well, with Software as a Service (SaaS) used over the Internet replacing shrink-wrapped software that must be installed and run on a local computer.

The nature of applications is also changing. Speech, sound, images, and video are becoming increasingly important, along with predictable response time that is so critical to the user experience. An inspiring example is Google Translate. This application lets you hold up your cell phone to point its camera at an object, and the image is sent wirelessly over the Internet to a warehouse-scale computer (WSC) that recognizes the text in the photo and translates it into your native language. You can also speak into it, and it will translate what you said into audio output in another language. It translates text in 90 languages and voice in 15 languages.

Alas, Figure 1.1 also shows that this 17-year hardware renaissance is over. The fundamental reason is that two characteristics of semiconductor processes that were true for decades no longer hold. In 1974 Robert Dennard observed that power density was constant for a given area of silicon even as you increased the number of transistors because of smaller dimensions of each transistor. Remarkably, transistors could go faster but use less power.


Dennard scaling ended around 2004 because current and voltage couldn’t keep dropping and still maintain the dependability of integrated circuits.

This change forced the microprocessor industry to use multiple efficient processors or cores instead of a single inefficient processor. Indeed, in 2004 Intel canceled its high-performance uniprocessor projects and joined others in declaring that the road to higher performance would be via multiple processors per chip rather than via faster uniprocessors. This milestone signaled a historic switch from relying solely on instruction-level parallelism (ILP), the primary focus of the first three editions of this book, to data-level parallelism (DLP) and thread-level parallelism (TLP), which were featured in the fourth edition and expanded in the fifth edition. The fifth edition also added WSCs and request-level parallelism (RLP), which is expanded in this edition. Whereas the compiler and hardware conspire to exploit ILP implicitly without the programmer’s attention, DLP, TLP, and RLP are explicitly parallel, requiring the restructuring of the application so that it can exploit explicit parallelism. In some instances, this is easy; in many, it is a major new burden for programmers.

Amdahl’s Law (Section 1.9) prescribes practical limits to the number of useful cores per chip. If 10% of the task is serial, then the maximum performance benefit from parallelism is 10 no matter how many cores you put on the chip (a short sketch of this limit follows after the list below).

The second observation that ended recently is Moore’s Law. In 1965 Gordon Moore famously predicted that the number of transistors per chip would double every year, which was amended in 1975 to every two years. That prediction lasted for about 50 years, but no longer holds. For example, in the 2010 edition of this book, the most recent Intel microprocessor had 1,170,000,000 transistors. If Moore’s Law had continued, we could have expected microprocessors in 2016 to have 18,720,000,000 transistors. Instead, the equivalent Intel microprocessor has just 1,750,000,000 transistors, or off by a factor of 10 from what Moore’s Law would have predicted. The combination of

■ transistors no longer getting much better because of the slowing of Moore’s Law and the end of Dennard scaling,

■ the unchanging power budgets for microprocessors,

■ the replacement of the single power-hungry processor with several energy-efficient processors, and

■ the limits to multiprocessing to achieve Amdahl’s Law

caused improvements in processor performance to slow down, that is, to double every 20 years, rather than every 1.5 years as it did between 1986 and 2003 (see Figure 1.1). The only path left to improve energy-performance-cost is specialization. Future microprocessors will include several domain-specific cores that perform only one class of computations well, but they do so remarkably better than general-purpose cores. The new Chapter 7 in this edition introduces domain-specific architectures.
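As a minimal sketch of the Amdahl’s Law limit noted earlier (not from the text), the following C fragment evaluates speedup = 1 / (f_serial + (1 - f_serial)/cores) for a 10% serial fraction; the speedup approaches 10 regardless of the core count.

/* Amdahl's Law: the serial fraction caps the achievable speedup. */
#include <stdio.h>

int main(void) {
    double f_serial = 0.10;                 /* serial fraction of the task */
    int cores[] = {1, 2, 4, 16, 64, 1024};
    for (int i = 0; i < 6; i++) {
        double speedup = 1.0 / (f_serial + (1.0 - f_serial) / cores[i]);
        printf("%5d cores -> speedup %.2f\n", cores[i], speedup);
    }
    return 0;                               /* approaches 10 as cores grow */
}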


This text is about the architectural ideas and accompanying compiler improvements that made the incredible growth rate possible over the past century, the reasons for the dramatic change, and the challenges and initial promising approaches to architectural ideas, compilers, and interpreters for the 21st century. At the core is a quantitative approach to computer design and analysis that uses empirical observations of programs, experimentation, and simulation as its tools. It is this style and approach to computer design that is reflected in this text. The purpose of this chapter is to lay the quantitative foundation on which the following chapters and appendices are based. This book was written not only to explain this design style but also to stimulate you to contribute to this progress. We believe this approach will serve the computers of the future just as it worked for the implicitly parallel computers of the past.

1.2 Classes of Computers

These changes have set the stage for a dramatic change in how we view computing, computing applications, and the computer markets in this new century. Not since the creation of the personal computer have we seen such striking changes in the way computers appear and in how they are used. These changes in computer use have led to five diverse computing markets, each characterized by different applications, requirements, and computing technologies. Figure 1.2 summarizes these mainstream classes of computing environments and their important characteristics.

Internet of Things/Embedded Computers

Embedded computers are found in everyday machines: microwaves, washing machines, most printers, networking switches, and all automobiles.

Feature | Personal mobile device (PMD) | Desktop | Server | Clusters/warehouse-scale computer | Internet of things/embedded
Price of system | $100–$1000 | $300–$2500 | $5000–$10,000,000 | $100,000–$200,000,000 | $10–$100,000
Price of microprocessor | $10–$100 | $50–$500 | $200–$2000 | $50–$250 | $0.01–$100
Critical system design issues | Cost, energy, media performance, responsiveness | Price-performance, energy, graphics performance | Throughput, availability, scalability, energy | Price-performance, throughput, energy proportionality | Price, energy, application-specific performance

Figure 1.2 A summary of the five mainstream computing classes and their system characteristics. Sales in 2015 included about 1.6 billion PMDs (90% cell phones), 275 million desktop PCs, and 15 million servers. The total number of embedded processors sold was nearly 19 billion. In total, 14.8 billion ARM-technology-based chips were shipped in 2015. Note the wide range in system price for servers and embedded systems, which go from USB keys to network routers. For servers, this range arises from the need for very large-scale multiprocessor systems for high-end transaction processing.


The phrase Internet of Things (IoT) refers to embedded computers that are connected to the Internet, typically wirelessly. When augmented with sensors and actuators, IoT devices collect useful data and interact with the physical world, leading to a wide variety of “smart” applications, such as smart watches, smart thermostats, smart speakers, smart cars, smart homes, smart grids, and smart cities.

Embedded computers have the widest spread of processing power and cost. They include 8-bit to 32-bit processors that may cost one penny, and high-end 64-bit processors for cars and network switches that cost $100. Although the range of computing power in the embedded computing market is very large, price is a key factor in the design of computers for this space. Performance requirements do exist, of course, but the primary goal is often meeting the performance need at a minimum price, rather than achieving more performance at a higher price. The projections for the number of IoT devices in 2020 range from 20 to 50 billion.

Most of this book applies to the design, use, and performance of embedded processors, whether they are off-the-shelf microprocessors or microprocessor cores that will be assembled with other special-purpose hardware. Unfortunately, the data that drive the quantitative design and evaluation of other classes of computers have not yet been extended successfully to embedded computing (see the challenges with EEMBC, for example, in Section 1.8). Hence we are left for now with qualitative descriptions, which do not fit well with the rest of the book. As a result, the embedded material is concentrated in Appendix E. We believe a separate appendix improves the flow of ideas in the text while allowing readers to see how the differing requirements affect embedded computing.

Personal Mobile Device

Personal mobile device (PMD) is the term we apply to a collection of wireless devices with multimedia user interfaces such as cell phones, tablet computers, and so on. Cost is a prime concern given the consumer price for the whole product is a few hundred dollars. Although the emphasis on energy efficiency is frequently driven by the use of batteries, the need to use less expensive packaging—plastic versus ceramic—and the absence of a fan for cooling also limit total power consumption. We examine the issue of energy and power in more detail in Section 1.5. Applications on PMDs are often web-based and media-oriented, like the previously mentioned Google Translate example. Energy and size requirements lead to use of Flash memory for storage (Chapter 2) instead of magnetic disks.

The processors in a PMD are often considered embedded computers, but we are keeping them as a separate category because PMDs are platforms that can run externally developed software, and they share many of the characteristics of desktop computers. Other embedded devices are more limited in hardware and software sophistication. We use the ability to run third-party software as the dividing line between nonembedded and embedded computers.

Responsiveness and predictability are key characteristics for media applications. A real-time performance requirement means a segment of the application has an absolute maximum execution time.


For example, in playing a video on a PMD, the time to process each video frame is limited, since the processor must accept and process the next frame shortly. In some applications, a more nuanced requirement exists: the average time for a particular task is constrained as well as the number of instances when some maximum time is exceeded. Such approaches—sometimes called soft real-time—arise when it is possible to miss the time constraint on an event occasionally, as long as not too many are missed. Real-time performance tends to be highly application-dependent.

Other key characteristics in many PMD applications are the need to minimize memory and the need to use energy efficiently. Energy efficiency is driven by both battery power and heat dissipation. The memory can be a substantial portion of the system cost, and it is important to optimize memory size in such cases. The importance of memory size translates to an emphasis on code size, since data size is dictated by the application.
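The soft real-time requirement described above can be illustrated with a small, hypothetical C sketch: the frame times, the average budget, the hard per-frame deadline, and the tolerated miss rate below are invented numbers, not measurements from the text.

/* Soft real-time check: bound the average frame time and the fraction of
   frames that exceed a hard per-frame deadline. All values are examples. */
#include <stdio.h>

int main(void) {
    double frame_ms[] = {28.0, 30.5, 29.1, 41.7, 27.9, 30.2, 33.0, 29.5};
    int n = 8;
    double avg_budget_ms = 33.3;      /* about 30 frames/s on average      */
    double deadline_ms   = 40.0;      /* hard per-frame limit              */
    double max_miss_rate = 0.05;      /* tolerate at most 5% late frames   */

    double sum = 0.0;
    int late = 0;
    for (int i = 0; i < n; i++) {
        sum += frame_ms[i];
        if (frame_ms[i] > deadline_ms) late++;
    }
    double avg = sum / n;
    double miss_rate = (double)late / n;
    printf("avg %.1f ms, %.1f%% of frames late\n", avg, 100.0 * miss_rate);
    printf("soft real-time constraint %s\n",
           (avg <= avg_budget_ms && miss_rate <= max_miss_rate) ? "met" : "missed");
    return 0;
}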

Desktop Computing

The first, and possibly still the largest market in dollar terms, is desktop computing. Desktop computing spans from low-end netbooks that sell for under $300 to high-end, heavily configured workstations that may sell for $2500. Since 2008, more than half of the desktop computers made each year have been battery operated laptop computers. Desktop computing sales are declining.

Throughout this range in price and capability, the desktop market tends to be driven to optimize price-performance. This combination of performance (measured primarily in terms of compute performance and graphics performance) and price of a system is what matters most to customers in this market, and hence to computer designers. As a result, the newest, highest-performance microprocessors and cost-reduced microprocessors often appear first in desktop systems (see Section 1.6 for a discussion of the issues affecting the cost of computers). Desktop computing also tends to be reasonably well characterized in terms of applications and benchmarking, though the increasing use of web-centric, interactive applications poses new challenges in performance evaluation.

Servers

As the shift to desktop computing occurred in the 1980s, the role of servers grew to provide larger-scale and more reliable file and computing services. Such servers have become the backbone of large-scale enterprise computing, replacing the traditional mainframe.

For servers, different characteristics are important. First, availability is critical. (We discuss availability in Section 1.7.) Consider the servers running ATM machines for banks or airline reservation systems. Failure of such server systems is far more catastrophic than failure of a single desktop, since these servers must operate seven days a week, 24 hours a day. Figure 1.3 estimates revenue costs of downtime for server applications.


Application | Cost of downtime per hour | Annual losses at 1% downtime (87.6 h/year) | Annual losses at 0.5% downtime (43.8 h/year) | Annual losses at 0.1% downtime (8.8 h/year)
Brokerage service | $4,000,000 | $350,400,000 | $175,200,000 | $35,000,000
Energy | $1,750,000 | $153,300,000 | $76,700,000 | $15,300,000
Telecom | $1,250,000 | $109,500,000 | $54,800,000 | $11,000,000
Manufacturing | $1,000,000 | $87,600,000 | $43,800,000 | $8,800,000
Retail | $650,000 | $56,900,000 | $28,500,000 | $5,700,000
Health care | $400,000 | $35,000,000 | $17,500,000 | $3,500,000
Media | $50,000 | $4,400,000 | $2,200,000 | $400,000

Figure 1.3 Costs rounded to nearest $100,000 of an unavailable system are shown by analyzing the cost of downtime (in terms of immediately lost revenue), assuming three different levels of availability, and that downtime is distributed uniformly. These data are from Landstrom (2014) and were collected and analyzed by Contingency Planning Research.
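The following short C sketch (not part of the figure) reproduces the arithmetic behind Figure 1.3 for the brokerage row: annual downtime hours are 8760 times (1 - availability), and the lost revenue is those hours times the cost of downtime per hour.

/* Lost revenue from downtime, given availability and cost per hour. */
#include <stdio.h>

int main(void) {
    double cost_per_hour = 4000000.0;            /* brokerage service row  */
    double unavailability[] = {0.01, 0.005, 0.001};
    for (int i = 0; i < 3; i++) {
        double hours_down = 8760.0 * unavailability[i];
        double annual_loss = hours_down * cost_per_hour;
        printf("%.1f%% downtime: %.1f h/year, $%.0f lost\n",
               100.0 * unavailability[i], hours_down, annual_loss);
    }
    return 0;   /* 87.6 h -> $350,400,000, matching the first row above */
}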

A second key feature of server systems is scalability. Server systems often grow in response to an increasing demand for the services they support or an expansion in functional requirements. Thus the ability to scale up the computing capacity, the memory, the storage, and the I/O bandwidth of a server is crucial. Finally, servers are designed for efficient throughput. That is, the overall performance of the server—in terms of transactions per minute or web pages served per second—is what is crucial. Responsiveness to an individual request remains important, but overall efficiency and cost-effectiveness, as determined by how many requests can be handled in a unit time, are the key metrics for most servers. We return to the issue of assessing performance for different types of computing environments in Section 1.8.

Clusters/Warehouse-Scale Computers

The growth of Software as a Service (SaaS) for applications like search, social networking, video viewing and sharing, multiplayer games, online shopping, and so on has led to the growth of a class of computers called clusters. Clusters are collections of desktop computers or servers connected by local area networks to act as a single larger computer. Each node runs its own operating system, and nodes communicate using a networking protocol. WSCs are the largest of the clusters, in that they are designed so that tens of thousands of servers can act as one. Chapter 6 describes this class of extremely large computers.

Price-performance and power are critical to WSCs since they are so large. As Chapter 6 explains, the majority of the cost of a warehouse is associated with power and cooling of the computers inside the warehouse. The annual amortized cost of the computers themselves and the networking gear for a WSC is $40 million, because they are usually replaced every few years.


When you are buying that much computing, you need to buy wisely, because a 10% improvement in price-performance means an annual savings of $4 million (10% of $40 million) per WSC; a company like Amazon might have 100 WSCs!

WSCs are related to servers in that availability is critical. For example, Amazon.com had $136 billion in sales in 2016. As there are about 8800 hours in a year, the average revenue per hour was about $15 million. During a peak hour for Christmas shopping, the potential loss would be many times higher. As Chapter 6 explains, the difference between WSCs and servers is that WSCs use redundant, inexpensive components as the building blocks, relying on a software layer to catch and isolate the many failures that will happen with computing at this scale to deliver the availability needed for such applications. Note that scalability for a WSC is handled by the local area network connecting the computers and not by integrated computer hardware, as in the case of servers.

Supercomputers are related to WSCs in that they are equally expensive, costing hundreds of millions of dollars, but supercomputers differ by emphasizing floating-point performance and by running large, communication-intensive batch programs that can run for weeks at a time. In contrast, WSCs emphasize interactive applications, large-scale storage, dependability, and high Internet bandwidth.
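A minimal C sketch of the warehouse-scale arithmetic in this paragraph, using the numbers quoted above: the savings from a 10% price-performance improvement on a $40 million annual amortized cost, and the average revenue per hour implied by $136 billion in annual sales.

/* Back-of-the-envelope WSC economics from the figures quoted in the text. */
#include <stdio.h>

int main(void) {
    double amortized_cost = 40e6;           /* per WSC, per year            */
    double improvement    = 0.10;           /* 10% better price-performance */
    double n_wscs         = 100.0;          /* e.g., a company like Amazon  */
    printf("savings: $%.0fM per WSC, $%.0fM across %g WSCs\n",
           amortized_cost * improvement / 1e6,
           amortized_cost * improvement * n_wscs / 1e6, n_wscs);

    double annual_sales = 136e9;            /* Amazon.com sales in 2016     */
    double hours_per_year = 8760.0;         /* text rounds to about 8800    */
    printf("average revenue: $%.1fM per hour\n",
           annual_sales / hours_per_year / 1e6);
    return 0;
}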

Classes of Parallelism and Parallel Architectures

Parallelism at multiple levels is now the driving force of computer design across all four classes of computers, with energy and cost being the primary constraints. There are basically two kinds of parallelism in applications:

1. Data-level parallelism (DLP) arises because there are many data items that can be operated on at the same time.

2. Task-level parallelism (TLP) arises because tasks of work are created that can operate independently and largely in parallel.

Computer hardware in turn can exploit these two kinds of application parallelism in four major ways:

1. Instruction-level parallelism exploits data-level parallelism at modest levels with compiler help using ideas like pipelining and at medium levels using ideas like speculative execution.

2. Vector architectures, graphic processor units (GPUs), and multimedia instruction sets exploit data-level parallelism by applying a single instruction to a collection of data in parallel.

3. Thread-level parallelism exploits either data-level parallelism or task-level parallelism in a tightly coupled hardware model that allows for interaction between parallel threads.

4. Request-level parallelism exploits parallelism among largely decoupled tasks specified by the programmer or the operating system.


When Flynn (1966) studied the parallel computing efforts in the 1960s, he found a simple classification whose abbreviations we still use today. They target data-level parallelism and task-level parallelism. He looked at the parallelism in the instruction and data streams called for by the instructions at the most constrained component of the multiprocessor and placed all computers in one of four categories:

1. Single instruction stream, single data stream (SISD)—This category is the uniprocessor. The programmer thinks of it as the standard sequential computer, but it can exploit ILP. Chapter 3 covers SISD architectures that use ILP techniques such as superscalar and speculative execution.

2. Single instruction stream, multiple data streams (SIMD)—The same instruction is executed by multiple processors using different data streams. SIMD computers exploit data-level parallelism by applying the same operations to multiple items of data in parallel. Each processor has its own data memory (hence, the MD of SIMD), but there is a single instruction memory and control processor, which fetches and dispatches instructions. Chapter 4 covers DLP and three different architectures that exploit it: vector architectures, multimedia extensions to standard instruction sets, and GPUs.

3. Multiple instruction streams, single data stream (MISD)—No commercial multiprocessor of this type has been built to date, but it rounds out this simple classification.

4. Multiple instruction streams, multiple data streams (MIMD)—Each processor fetches its own instructions and operates on its own data, and it targets task-level parallelism. In general, MIMD is more flexible than SIMD and thus more generally applicable, but it is inherently more expensive than SIMD. For example, MIMD computers can also exploit data-level parallelism, although the overhead is likely to be higher than would be seen in an SIMD computer. This overhead means that grain size must be sufficiently large to exploit the parallelism efficiently. Chapter 5 covers tightly coupled MIMD architectures, which exploit thread-level parallelism because multiple cooperating threads operate in parallel. Chapter 6 covers loosely coupled MIMD architectures—specifically, clusters and warehouse-scale computers—that exploit request-level parallelism, where many independent tasks can proceed in parallel naturally with little need for communication or synchronization.

This taxonomy is a coarse model, as many parallel processors are hybrids of the SISD, SIMD, and MIMD classes. Nonetheless, it is useful to put a framework on the design space for the computers we will see in this book.

1.3 Defining Computer Architecture


The task the computer designer faces is a complex one: determine what attributes are important for a new computer, then design a computer to maximize performance and energy efficiency while staying within cost, power, and availability constraints. This task has many aspects, including instruction set design, functional organization, logic design, and implementation. The implementation may encompass integrated circuit design, packaging, power, and cooling. Optimizing the design requires familiarity with a very wide range of technologies, from compilers and operating systems to logic design and packaging.

A few decades ago, the term computer architecture generally referred to only instruction set design. Other aspects of computer design were called implementation, often insinuating that implementation is uninteresting or less challenging. We believe this view is incorrect. The architect’s or designer’s job is much more than instruction set design, and the technical hurdles in the other aspects of the project are likely more challenging than those encountered in instruction set design. We’ll quickly review instruction set architecture before describing the larger challenges for the computer architect.

Instruction Set Architecture: The Myopic View of Computer Architecture

We use the term instruction set architecture (ISA) to refer to the actual programmer-visible instruction set in this book. The ISA serves as the boundary between the software and hardware. This quick review of ISA will use examples from 80x86, ARMv8, and RISC-V to illustrate the seven dimensions of an ISA. The most popular RISC processors come from ARM (Advanced RISC Machine), which were in 14.8 billion chips shipped in 2015, or roughly 50 times as many chips as shipped with 80x86 processors. Appendices A and K give more details on the three ISAs.

RISC-V (“RISC Five”) is a modern RISC instruction set developed at the University of California, Berkeley, which was made free and openly adoptable in response to requests from industry. In addition to a full software stack (compilers, operating systems, and simulators), there are several RISC-V implementations freely available for use in custom chips or in field-programmable gate arrays. Developed 30 years after the first RISC instruction sets, RISC-V inherits its ancestors’ good ideas—a large set of registers, easy-to-pipeline instructions, and a lean set of operations—while avoiding their omissions or mistakes. It is a free and open, elegant example of the RISC architectures mentioned earlier, which is why more than 60 companies have joined the RISC-V foundation, including AMD, Google, HP Enterprise, IBM, Microsoft, Nvidia, Qualcomm, Samsung, and Western Digital. We use the integer core ISA of RISC-V as the example ISA in this book.

1. Class of ISA—Nearly all ISAs today are classified as general-purpose register architectures, where the operands are either registers or memory locations. The 80x86 has 16 general-purpose registers and 16 that can hold floating-point data, while RISC-V has 32 general-purpose and 32 floating-point registers (see Figure 1.4).


Register | Name | Use | Saver
x0 | zero | The constant value 0 | N.A.
x1 | ra | Return address | Caller
x2 | sp | Stack pointer | Callee
x3 | gp | Global pointer | —
x4 | tp | Thread pointer | —
x5–x7 | t0–t2 | Temporaries | Caller
x8 | s0/fp | Saved register/frame pointer | Callee
x9 | s1 | Saved register | Callee
x10–x11 | a0–a1 | Function arguments/return values | Caller
x12–x17 | a2–a7 | Function arguments | Caller
x18–x27 | s2–s11 | Saved registers | Callee
x28–x31 | t3–t6 | Temporaries | Caller
f0–f7 | ft0–ft7 | FP temporaries | Caller
f8–f9 | fs0–fs1 | FP saved registers | Callee
f10–f11 | fa0–fa1 | FP function arguments/return values | Caller
f12–f17 | fa2–fa7 | FP function arguments | Caller
f18–f27 | fs2–fs11 | FP saved registers | Callee
f28–f31 | ft8–ft11 | FP temporaries | Caller

Figure 1.4 RISC-V registers, names, usage, and calling conventions. In addition to the 32 general-purpose registers (x0–x31), RISC-V has 32 floating-point registers (f0–f31) that can hold either a 32-bit single-precision number or a 64-bit double-precision number. The registers that are preserved across a procedure call are labeled “Callee” saved.

The two popular versions of this class are register-memory ISAs, such as the 80x86, which can access memory as part of many instructions, and load-store ISAs, such as ARMv8 and RISC-V, which can access memory only with load or store instructions. All ISAs announced since 1985 are load-store.

2. Memory addressing—Virtually all desktop and server computers, including the 80x86, ARMv8, and RISC-V, use byte addressing to access memory operands. Some architectures, like ARMv8, require that objects must be aligned. An access to an object of size s bytes at byte address A is aligned if A mod s = 0. (See Figure A.5 on page A-8.) The 80x86 and RISC-V do not require alignment, but accesses are generally faster if operands are aligned.

3. Addressing modes—In addition to specifying registers and constant operands, addressing modes specify the address of a memory object. RISC-V addressing modes are Register, Immediate (for constants), and Displacement, where a constant offset is added to a register to form the memory address.


The 80x86 supports those three modes, plus three variations of displacement: no register (absolute), two registers (based indexed with displacement), and two registers where one register is multiplied by the size of the operand in bytes (based with scaled index and displacement). It has more like the last three modes, minus the displacement field, plus register indirect, indexed, and based with scaled index. ARMv8 has the three RISC-V addressing modes plus PC-relative addressing, the sum of two registers, and the sum of two registers where one register is multiplied by the size of the operand in bytes. It also has autoincrement and autodecrement addressing, where the calculated address replaces the contents of one of the registers used in forming the address.

4. Types and sizes of operands—Like most ISAs, 80x86, ARMv8, and RISC-V support operand sizes of 8-bit (ASCII character), 16-bit (Unicode character or half word), 32-bit (integer or word), 64-bit (double word or long integer), and IEEE 754 floating point in 32-bit (single precision) and 64-bit (double precision). The 80x86 also supports 80-bit floating point (extended double precision).

5. Operations—The general categories of operations are data transfer, arithmetic logical, control (discussed next), and floating point. RISC-V is a simple and easy-to-pipeline instruction set architecture, and it is representative of the RISC architectures being used in 2017. Figure 1.5 summarizes the integer RISC-V ISA, and Figure 1.6 lists the floating-point ISA. The 80x86 has a much richer and larger set of operations (see Appendix K).

6. Control flow instructions—Virtually all ISAs, including these three, support conditional branches, unconditional jumps, procedure calls, and returns. All three use PC-relative addressing, where the branch address is specified by an address field that is added to the PC. There are some small differences. RISC-V conditional branches (BEQ, BNE, etc.) test the contents of registers, and the 80x86 and ARMv8 branches test condition code bits set as side effects of arithmetic/logic operations. The ARMv8 and RISC-V procedure call places the return address in a register, whereas the 80x86 call (CALLF) places the return address on a stack in memory.

7. Encoding an ISA—There are two basic choices on encoding: fixed length and variable length. All ARMv8 and RISC-V instructions are 32 bits long, which simplifies instruction decoding. Figure 1.7 shows the RISC-V instruction formats. The 80x86 encoding is variable length, ranging from 1 to 18 bytes. Variable-length instructions can take less space than fixed-length instructions, so a program compiled for the 80x86 is usually smaller than the same program compiled for RISC-V. Note that choices mentioned previously will affect how the instructions are encoded into a binary representation. For example, the number of registers and the number of addressing modes both have a significant impact on the size of instructions, because the register field and addressing mode field can appear many times in a single instruction. (Note that ARMv8 and RISC-V later offered extensions, called Thumb-2 and RV64IC, that provide a mix of 16-bit and 32-bit length instructions, respectively, to reduce program size. Code size for these compact versions of RISC architectures is smaller than that of the 80x86. See Appendix K.)


Instruction type/opcode | Instruction meaning
Data transfers | Move data between registers and memory, or between the integer and FP or special registers; only memory address mode is 12-bit displacement + contents of a GPR
lb, lbu, sb | Load byte, load byte unsigned, store byte (to/from integer registers)
lh, lhu, sh | Load half word, load half word unsigned, store half word (to/from integer registers)
lw, lwu, sw | Load word, load word unsigned, store word (to/from integer registers)
ld, sd | Load double word, store double word (to/from integer registers)
flw, fld, fsw, fsd | Load SP float, load DP float, store SP float, store DP float
fmv._.x, fmv.x._ | Copy from/to integer register to/from floating-point register; “_” = S for single-precision, D for double-precision
csrrw, csrrwi, csrrs, csrrsi, csrrc, csrrci | Read counters and write status registers, which include counters: clock cycles, time, instructions retired
Arithmetic/logical | Operations on integer or logical data in GPRs
add, addi, addw, addiw | Add, add immediate (all immediates are 12 bits), add 32-bits only & sign-extend to 64 bits, add immediate 32-bits only
sub, subw | Subtract, subtract 32-bits only
mul, mulw, mulh, mulhsu, mulhu | Multiply, multiply 32-bits only, multiply upper half, multiply upper half signed-unsigned, multiply upper half unsigned
div, divu, rem, remu | Divide, divide unsigned, remainder, remainder unsigned
divw, divuw, remw, remuw | Divide and remainder: as previously, but divide only lower 32-bits, producing 32-bit sign-extended result
and, andi | And, and immediate
or, ori, xor, xori | Or, or immediate, exclusive or, exclusive or immediate
lui | Load upper immediate; loads bits 31-12 of register with immediate, then sign-extends
auipc | Adds immediate in bits 31–12 with zeros in lower bits to PC; used with JALR to transfer control to any 32-bit address
sll, slli, srl, srli, sra, srai | Shifts: shift left logical, right logical, right arithmetic; both variable and immediate forms
sllw, slliw, srlw, srliw, sraw, sraiw | Shifts: as previously, but shift lower 32-bits, producing 32-bit sign-extended result
slt, slti, sltu, sltiu | Set less than, set less than immediate, signed and unsigned
Control | Conditional branches and jumps; PC-relative or through register
beq, bne, blt, bge, bltu, bgeu | Branch GPR equal/not equal; less than; greater than or equal, signed and unsigned
jal, jalr | Jump and link: save PC + 4, target is PC-relative (JAL) or a register (JALR); if specify x0 as destination register, then acts as a simple jump
ecall | Make a request to the supporting execution environment, which is usually an OS
ebreak | Debuggers used to cause control to be transferred back to a debugging environment
fence, fence.i | Synchronize threads to guarantee ordering of memory accesses; synchronize instructions and data for stores to instruction memory

Figure 1.5 Subset of the instructions in RISC-V. RISC-V has a base set of instructions (RV64I) and offers optional extensions: multiply-divide (RVM), single-precision floating point (RVF), double-precision floating point (RVD). This figure includes RVM and the next one shows RVF and RVD. Appendix A gives much more detail on RISC-V.


Instruction type/opcode | Instruction meaning
Floating point | FP operations on DP and SP formats
fadd.d, fadd.s | Add DP, SP numbers
fsub.d, fsub.s | Subtract DP, SP numbers
fmul.d, fmul.s | Multiply DP, SP floating point
fmadd.d, fmadd.s, fnmadd.d, fnmadd.s | Multiply-add DP, SP numbers; negative multiply-add DP, SP numbers
fmsub.d, fmsub.s, fnmsub.d, fnmsub.s | Multiply-sub DP, SP numbers; negative multiply-sub DP, SP numbers
fdiv.d, fdiv.s | Divide DP, SP floating point
fsqrt.d, fsqrt.s | Square root DP, SP floating point
fmax.d, fmax.s, fmin.d, fmin.s | Maximum and minimum DP, SP floating point
fcvt._._, fcvt._._u, fcvt._u._ | Convert instructions: FCVT.x.y converts from type x to type y, where x and y are L (64-bit integer), W (32-bit integer), D (DP), or S (SP). Integers can be unsigned (U)
feq._, flt._, fle._ | Floating-point compare between floating-point registers and record the Boolean result in integer register; “_” = S for single-precision, D for double-precision
fclass.d, fclass.s | Writes to integer register a 10-bit mask that indicates the class of the floating-point number (-∞, +∞, -0, +0, NaN, …)
fsgnj._, fsgnjn._, fsgnjx._ | Sign-injection instructions that change only the sign bit: copy sign bit from other source, the opposite of sign bit of other source, XOR of the 2 sign bits

Figure 1.6 Floating point instructions for RISC-V. RISC-V has a base set of instructions (RV64I) and offers optional extensions for single-precision floating point (RVF) and double-precision floating point (RVD). SP = single precision; DP = double precision.

(Bit fields, most significant to least, occupy bits 31–25, 24–20, 19–15, 14–12, 11–7, and 6–0.)
R-type: funct7 | rs2 | rs1 | funct3 | rd | opcode
I-type: imm[11:0] | rs1 | funct3 | rd | opcode
S-type: imm[11:5] | rs2 | rs1 | funct3 | imm[4:0] | opcode
B-type: imm[12|10:5] | rs2 | rs1 | funct3 | imm[4:1|11] | opcode
U-type: imm[31:12] | rd | opcode
J-type: imm[20|10:1|11|19:12] | rd | opcode

Figure 1.7 The base RISC-V instruction set architecture formats. All instructions are 32 bits long. The R format is for integer register-to-register operations, such as ADD, SUB, and so on. The I format is for loads and immediate operations, such as LD and ADDI. The B format is for branches and the J format is for jumps and link. The S format is for stores. Having a separate format for stores allows the three register specifiers (rd, rs1, rs2) to always be in the same location in all formats. The U format is for the wide immediate instructions (LUI, AUIPC).
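To show why fixed 32-bit formats simplify decoding, here is a small C sketch (not from the text) that slices the R-type fields of Figure 1.7 out of an instruction word with shifts and masks; the example word encodes add x5, x6, x7 and is included only for illustration.

/* Decode the R-type fields of a 32-bit RISC-V instruction word. */
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t inst = 0x007302B3;              /* add x5, x6, x7 (R-type)   */
    unsigned opcode =  inst        & 0x7F;   /* bits 6:0                  */
    unsigned rd     = (inst >> 7)  & 0x1F;   /* bits 11:7                 */
    unsigned funct3 = (inst >> 12) & 0x07;   /* bits 14:12                */
    unsigned rs1    = (inst >> 15) & 0x1F;   /* bits 19:15                */
    unsigned rs2    = (inst >> 20) & 0x1F;   /* bits 24:20                */
    unsigned funct7 = (inst >> 25) & 0x7F;   /* bits 31:25                */
    printf("opcode=0x%02X rd=x%u rs1=x%u rs2=x%u funct3=%u funct7=%u\n",
           opcode, rd, rs1, rs2, funct3, funct7);
    return 0;
}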


The other challenges facing the computer architect beyond ISA design are particularly acute at the present, when the differences among instruction sets are small and when there are distinct application areas. Therefore, starting with the fourth edition of this book, beyond this quick review, the bulk of the instruction set material is found in the appendices (see Appendices A and K).

Genuine Computer Architecture: Designing the Organization and Hardware to Meet Goals and Functional Requirements

The implementation of a computer has two components: organization and hardware. The term organization includes the high-level aspects of a computer’s design, such as the memory system, the memory interconnect, and the design of the internal processor or CPU (central processing unit—where arithmetic, logic, branching, and data transfer are implemented). The term microarchitecture is also used instead of organization. For example, two processors with the same instruction set architecture but different organizations are the AMD Opteron and the Intel Core i7. Both processors implement the 80x86 instruction set, but they have very different pipeline and cache organizations. The switch to multiple processors per microprocessor led to the term core also being used for processors. Instead of saying multiprocessor microprocessor, the term multicore caught on. Given that virtually all chips have multiple processors, the term central processing unit, or CPU, is fading in popularity.

Hardware refers to the specifics of a computer, including the detailed logic design and the packaging technology of the computer. Often a line of computers contains computers with identical instruction set architectures and very similar organizations, but they differ in the detailed hardware implementation. For example, the Intel Core i7 (see Chapter 3) and the Intel Xeon E7 (see Chapter 5) are nearly identical but offer different clock rates and different memory systems, making the Xeon E7 more effective for server computers.

In this book, the word architecture covers all three aspects of computer design—instruction set architecture, organization or microarchitecture, and hardware.

Computer architects must design a computer to meet functional requirements as well as price, power, performance, and availability goals. Figure 1.8 summarizes requirements to consider in designing a new computer. Often, architects also must determine what the functional requirements are, which can be a major task. The requirements may be specific features inspired by the market. Application software typically drives the choice of certain functional requirements by determining how the computer will be used. If a large body of software exists for a particular instruction set architecture, the architect may decide that a new computer should implement an existing instruction set. The presence of a large market for a particular class of applications might encourage the designers to incorporate requirements that would make the computer competitive in that market. Later chapters examine many of these requirements and features in depth.


Functional requirements | Typical features required or supported
Application area | Target of computer
Personal mobile device | Real-time performance for a range of tasks, including interactive performance for graphics, video, and audio; energy efficiency (Chapters 2–5 and 7; Appendix A)
General-purpose desktop | Balanced performance for a range of tasks, including interactive performance for graphics, video, and audio (Chapters 2–5; Appendix A)
Servers | Support for databases and transaction processing; enhancements for reliability and availability; support for scalability (Chapters 2, 5, and 7; Appendices A, D, and F)
Clusters/warehouse-scale computers | Throughput performance for many independent tasks; error correction for memory; energy proportionality (Chapters 2, 6, and 7; Appendix F)
Internet of things/embedded computing | Often requires special support for graphics or video (or other application-specific extension); power limitations and power control may be required; real-time constraints (Chapters 2, 3, 5, and 7; Appendices A and E)
Level of software compatibility | Determines amount of existing software for computer
At programming language | Most flexible for designer; need new compiler (Chapters 3, 5, and 7; Appendix A)
Object code or binary compatible | Instruction set architecture is completely defined—little flexibility—but no investment needed in software or porting programs (Appendix A)
Operating system requirements | Necessary features to support chosen OS (Chapter 2; Appendix B)
Size of address space | Very important feature (Chapter 2); may limit applications
Memory management | Required for modern OS; may be paged or segmented (Chapter 2)
Protection | Different OS and application needs: page versus segment; virtual machines (Chapter 2)
Standards | Certain standards may be required by marketplace
Floating point | Format and arithmetic: IEEE 754 standard (Appendix J), special arithmetic for graphics or signal processing
I/O interfaces | For I/O devices: Serial ATA, Serial Attached SCSI, PCI Express (Appendices D and F)
Operating systems | UNIX, Windows, Linux, CISCO IOS
Networks | Support required for different networks: Ethernet, Infiniband (Appendix F)
Programming languages | Languages (ANSI C, C++, Java, Fortran) affect instruction set (Appendix A)

Figure 1.8 Summary of some of the most important functional requirements an architect faces. The left-hand column describes the class of requirement, while the right-hand column gives specific examples. The right-hand column also contains references to chapters and appendices that deal with the specific issues.

Architects must also be aware of important trends in both the technology and the use of computers because such trends affect not only the future cost but also the longevity of an architecture.

1.4 Trends in Technology

If an instruction set architecture is to prevail, it must be designed to survive rapid changes in computer technology.


After all, a successful new instruction set architecture may last decades—for example, the core of the IBM mainframe has been in use for more than 50 years. An architect must plan for technology changes that can increase the lifetime of a successful computer.

To plan for the evolution of a computer, the designer must be aware of rapid changes in implementation technology. Five implementation technologies, which change at a dramatic pace, are critical to modern implementations:

■ Integrated circuit logic technology—Historically, transistor density increased by about 35% per year, quadrupling somewhat over four years. Increases in die size are less predictable and slower, ranging from 10% to 20% per year. The combined effect was a traditional growth rate in transistor count on a chip of about 40%–55% per year, or doubling every 18–24 months. This trend is popularly known as Moore’s Law. Device speed scales more slowly, as we discuss below. Shockingly, Moore’s Law is no more. The number of devices per chip is still increasing, but at a decelerating rate. Unlike in the Moore’s Law era, we expect the doubling time to be stretched with each new technology generation.

■ Semiconductor DRAM (dynamic random-access memory)—This technology is the foundation of main memory, and we discuss it in Chapter 2. The growth of DRAM has slowed dramatically, from quadrupling every three years as in the past. The 8-gigabit DRAM was shipping in 2014, but the 16-gigabit DRAM won’t reach that state until 2019, and it looks like there will be no 32-gigabit DRAM (Kim, 2005). Chapter 2 mentions several other technologies that may replace DRAM when it hits its capacity wall.

■ Semiconductor Flash (electrically erasable programmable read-only memory)—This nonvolatile semiconductor memory is the standard storage device in PMDs, and its rapidly increasing popularity has fueled its rapid growth rate in capacity. In recent years, the capacity per Flash chip increased by about 50%–60% per year, doubling roughly every 2 years. Currently, Flash memory is 8–10 times cheaper per bit than DRAM. Chapter 2 describes Flash memory.

■ Magnetic disk technology—Prior to 1990, density increased by about 30% per year, doubling in three years. It rose to 60% per year thereafter, and increased to 100% per year in 1996. Between 2004 and 2011, it dropped back to about 40% per year, or doubled every two years. Recently, disk improvement has slowed to less than 5% per year. One way to increase disk capacity is to add more platters at the same areal density, but there are already seven platters within the one-inch depth of the 3.5-inch form factor disks. There is room for at most one or two more platters. The last hope for areal density increase is to use a small laser on each disk read-write head to heat a 30 nm spot to 400°C so that it can be written magnetically before it cools. It is unclear whether Heat Assisted Magnetic Recording can be manufactured economically and reliably, although Seagate announced plans to ship HAMR in limited production in 2018. HAMR is the last chance for continued improvement in areal density of hard disk drives, which are now 8–10 times cheaper per bit than Flash and 200–300 times cheaper per bit than DRAM. This technology is central to server- and warehouse-scale storage, and we discuss the trends in detail in Appendix D.

■ Network technology—Network performance depends both on the performance of switches and on the performance of the transmission system. We discuss the trends in networking in Appendix F.

These rapidly changing technologies shape the design of a computer that, with speed and technology enhancements, may have a lifetime of 3–5 years. Key technologies such as Flash change sufficiently that the designer must plan for these changes. Indeed, designers often design for the next technology, knowing that, when a product begins shipping in volume, the following technology may be the most cost-effective or may have performance advantages. Traditionally, cost has decreased at about the rate at which density increases.

Although technology improves continuously, the impact of these increases can be in discrete leaps, as a threshold that allows a new capability is reached. For example, when MOS technology reached a point in the early 1980s where between 25,000 and 50,000 transistors could fit on a single chip, it became possible to build a single-chip, 32-bit microprocessor. By the late 1980s, first-level caches could go on a chip. By eliminating chip crossings within the processor and between the processor and the cache, a dramatic improvement in cost-performance and energy-performance was possible. This design was simply unfeasible until the technology reached a certain point. With multicore microprocessors and increasing numbers of cores each generation, even server computers are increasingly headed toward a single chip for all processors. Such technology thresholds are not rare and have a significant impact on a wide variety of design decisions.

Performance Trends: Bandwidth Over Latency

As we shall see in Section 1.8, bandwidth or throughput is the total amount of work done in a given time, such as megabytes per second for a disk transfer. In contrast, latency or response time is the time between the start and the completion of an event, such as milliseconds for a disk access. Figure 1.9 plots the relative improvement in bandwidth and latency for technology milestones for microprocessors, memory, networks, and disks. Figure 1.10 describes the examples and milestones in more detail.

Performance is the primary differentiator for microprocessors and networks, so they have seen the greatest gains: 32,000–40,000× in bandwidth and 50–90× in latency. Capacity is generally more important than performance for memory and disks, so capacity has improved more, yet bandwidth advances of 400–2400× are still much greater than gains in latency of 8–9×.

Clearly, bandwidth has outpaced latency across these technologies and will likely continue to do so. A simple rule of thumb is that bandwidth grows by at least the square of the improvement in latency. Computer designers should plan accordingly.
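As a quick, informal check of this rule of thumb (not part of the text), the C sketch below compares the first and last milestones of Figure 1.10 for each technology and tests whether the bandwidth improvement is at least the square of the latency improvement.

/* Check "bandwidth grows by at least the square of the latency improvement"
   against the endpoint milestones listed in Figure 1.10. */
#include <stdio.h>

int main(void) {
    struct { const char *name; double bw0, bw1, lat0, lat1; } t[] = {
        {"Microprocessor (MIPS, ns)",  2.0,   64000.0,  320.0,  4.0},
        {"Memory (MB/s, ns)",         13.0,   27000.0,  225.0, 30.0},
        {"Network (Mbit/s, us)",      10.0,  400000.0, 3000.0, 60.0},
        {"Disk (MB/s, ms)",            0.6,     250.0,   48.3,  3.6},
    };
    for (int i = 0; i < 4; i++) {
        double bw_gain  = t[i].bw1 / t[i].bw0;    /* bandwidth improvement */
        double lat_gain = t[i].lat0 / t[i].lat1;  /* latency improvement   */
        printf("%-28s bandwidth %8.0fx  latency %5.1fx  rule %s\n",
               t[i].name, bw_gain, lat_gain,
               bw_gain >= lat_gain * lat_gain ? "holds" : "fails");
    }
    return 0;
}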


[Figure 1.9 plot: relative bandwidth improvement (log scale, 1–100,000) versus relative latency improvement (log scale, 1–100) for network, microprocessor, memory, and disk milestones, with a reference line marking where latency improvement equals bandwidth improvement.]

Figure 1.9 Log-log plot of bandwidth and latency milestones in Figure 1.10 relative to the first milestone. Note that latency improved 8–91×, while bandwidth improved about 400–32,000×. Except for networking, we note that there were modest improvements in latency and bandwidth in the other three technologies in the six years since the last edition: 0%–23% in latency and 23%–70% in bandwidth. Updated from Patterson, D., 2004. Latency lags bandwidth. Commun. ACM 47 (10), 71–75.

Scaling of Transistor Performance and Wires

Integrated circuit processes are characterized by the feature size, which is the minimum size of a transistor or a wire in either the x or y dimension. Feature sizes decreased from 10 μm in 1971 to 0.016 μm in 2017; in fact, we have switched units, so production in 2017 is referred to as “16 nm,” and 7 nm chips are underway. Since the transistor count per square millimeter of silicon is determined by the surface area of a transistor, the density of transistors increases quadratically with a linear decrease in feature size.
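A minimal C sketch of the quadratic density argument just stated (not from the text): halving the feature size roughly quadruples the transistors per unit area, so shrinking from 10 μm in 1971 to 0.016 μm in 2017 corresponds to about a 625x linear shrink and roughly a 390,000x gain in density.

/* Quadratic transistor-density gain from a linear shrink in feature size. */
#include <stdio.h>

int main(void) {
    double feature_1971 = 10.0;      /* micrometers           */
    double feature_2017 = 0.016;     /* micrometers (16 nm)   */
    double linear_shrink = feature_1971 / feature_2017;
    double density_gain  = linear_shrink * linear_shrink;
    printf("linear shrink: %.0fx, density gain: about %.0fx\n",
           linear_shrink, density_gain);
    return 0;
}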

Microprocessor milestones
Milestone | 16-bit address/bus, microcoded | 32-bit address/bus, microcoded | 5-stage pipeline, on-chip I & D caches, FPU | 2-way superscalar, 64-bit bus | Out-of-order 3-way superscalar | Out-of-order superpipelined, on-chip L2 cache | Multicore OOO 4-way, on-chip L3 cache, Turbo
Product | Intel 80286 | Intel 80386 | Intel 80486 | Intel Pentium | Intel Pentium Pro | Intel Pentium 4 | Intel Core i7
Year | 1982 | 1985 | 1989 | 1993 | 1997 | 2001 | 2015
Die size (mm²) | 47 | 43 | 81 | 90 | 308 | 217 | 122
Transistors | 134,000 | 275,000 | 1,200,000 | 3,100,000 | 5,500,000 | 42,000,000 | 1,750,000,000
Processors/chip | 1 | 1 | 1 | 1 | 1 | 1 | 4
Pins | 68 | 132 | 168 | 273 | 387 | 423 | 1400
Latency (clocks) | 6 | 5 | 5 | 5 | 10 | 22 | 14
Bus width (bits) | 16 | 32 | 32 | 64 | 64 | 64 | 196
Clock rate (MHz) | 12.5 | 16 | 25 | 66 | 200 | 1500 | 4000
Bandwidth (MIPS) | 2 | 6 | 25 | 132 | 600 | 4500 | 64,000
Latency (ns) | 320 | 313 | 200 | 76 | 50 | 15 | 4

Memory module milestones
Memory module | DRAM | Page mode DRAM | Fast page mode DRAM | Fast page mode DRAM | Synchronous DRAM | Double data rate SDRAM | DDR4 SDRAM
Module width (bits) | 16 | 16 | 32 | 64 | 64 | 64 | 64
Year | 1980 | 1983 | 1986 | 1993 | 1997 | 2000 | 2016
Mbits/DRAM chip | 0.06 | 0.25 | 1 | 16 | 64 | 256 | 4096
Die size (mm²) | 35 | 45 | 70 | 130 | 170 | 204 | 50
Pins/DRAM chip | 16 | 16 | 18 | 20 | 54 | 66 | 134
Bandwidth (MBytes/s) | 13 | 40 | 160 | 267 | 640 | 1600 | 27,000
Latency (ns) | 225 | 170 | 125 | 75 | 62 | 52 | 30

Local area network milestones
Local area network | Ethernet | Fast Ethernet | Gigabit Ethernet | 10 Gigabit Ethernet | 100 Gigabit Ethernet | 400 Gigabit Ethernet
IEEE standard | 802.3 | 802.3u | 802.3ab | 802.3ac | 802.3ba | 802.3bs
Year | 1978 | 1995 | 1999 | 2003 | 2010 | 2017
Bandwidth (Mbits/s) | 10 | 100 | 1000 | 10,000 | 100,000 | 400,000
Latency (μs) | 3000 | 500 | 340 | 190 | 100 | 60

Hard disk milestones
Hard disk | 3600 RPM | 5400 RPM | 7200 RPM | 10,000 RPM | 15,000 RPM | 15,000 RPM
Product | CDC WrenI 94145-36 | Seagate ST41600 | Seagate ST15150 | Seagate ST39102 | Seagate ST373453 | Seagate ST600MX0062
Year | 1983 | 1990 | 1994 | 1998 | 2003 | 2016
Capacity (GB) | 0.03 | 1.4 | 4.3 | 9.1 | 73.4 | 600
Disk form factor | 5.25 in. | 5.25 in. | 3.5 in. | 3.5 in. | 3.5 in. | 3.5 in.
Media diameter | 5.25 in. | 5.25 in. | 3.5 in. | 3.0 in. | 2.5 in. | 2.5 in.
Interface | ST-412 | SCSI | SCSI | SCSI | SCSI | SAS
Bandwidth (MBytes/s) | 0.6 | 4 | 9 | 24 | 86 | 250
Latency (ms) | 48.3 | 17.1 | 12.7 | 8.8 | 5.7 | 3.6

Figure 1.10 Performance milestones over 25–40 years for microprocessors, memory, networks, and disks. The microprocessor milestones are several generations of IA-32 processors, going from a 16-bit bus, microcoded 80286 to a 64-bit bus, multicore, out-of-order execution, superpipelined Core i7. Memory module milestones go from 16-bit-wide, plain DRAM to 64-bit-wide DDR4 SDRAM. Ethernet advanced from 10 Mbits/s to 400 Gbits/s. Disk milestones are based on rotation speed, improving from 3600 to 15,000 RPM. Each case is best-case bandwidth, and latency is the time for a simple operation assuming no contention. Updated from Patterson, D., 2004. Latency lags bandwidth. Commun. ACM 47 (10), 71–75.


The increase in transistor performance, however, is more complex. As feature sizes shrink, devices shrink quadratically in the horizontal dimension and also shrink in the vertical dimension. The shrink in the vertical dimension requires a reduction in operating voltage to maintain correct operation and reliability of the transistors. This combination of scaling factors leads to a complex interrelationship between transistor performance and process feature size. To a first approximation, in the past the transistor performance improved linearly with decreasing feature size. The fact that transistor count improves quadratically with a linear increase in transistor performance is both the challenge and the opportunity for which computer architects were created!

In the early days of microprocessors, the higher rate of improvement in density was used to move quickly from 4-bit, to 8-bit, to 16-bit, to 32-bit, to 64-bit microprocessors. More recently, density improvements have supported the introduction of multiple processors per chip, wider SIMD units, and many of the innovations in speculative execution and caches found in Chapters 2–5.

Although transistors generally improve in performance with decreased feature size, wires in an integrated circuit do not. In particular, the signal delay for a wire increases in proportion to the product of its resistance and capacitance. Of course, as feature size shrinks, wires get shorter, but the resistance and capacitance per unit length get worse. This relationship is complex, since both resistance and capacitance depend on detailed aspects of the process, the geometry of a wire, the loading on a wire, and even the adjacency to other structures. There are occasional process enhancements, such as the introduction of copper, which provide one-time improvements in wire delay. In general, however, wire delay scales poorly compared to transistor performance, creating additional challenges for the designer. In addition to the power dissipation limit, wire delay has become a major design obstacle for large integrated circuits and is often more critical than transistor switching delay. Larger and larger fractions of the clock cycle have been consumed by the propagation delay of signals on wires, but power now plays an even greater role than wire delay.

1.5

Trends in Power and Energy in Integrated Circuits

Today, energy is the biggest challenge facing the computer designer for nearly every class of computer. First, power must be brought in and distributed around the chip, and modern microprocessors use hundreds of pins and multiple interconnect layers just for power and ground. Second, power is dissipated as heat and must be removed.

Power and Energy: A Systems Perspective

How should a system architect or a user think about performance, power, and energy? From the viewpoint of a system designer, there are three primary concerns. First, what is the maximum power a processor ever requires? Meeting this demand can be important to ensuring correct operation.


For example, if a processor attempts to draw more power than a power-supply system can provide (by drawing more current than the system can supply), the result is typically a voltage drop, which can cause devices to malfunction. Modern processors can vary widely in power consumption with high peak currents; hence they provide voltage indexing methods that allow the processor to slow down and regulate voltage within a wider margin. Obviously, doing so decreases performance.

Second, what is the sustained power consumption? This metric is widely called the thermal design power (TDP) because it determines the cooling requirement. TDP is neither peak power, which is often 1.5 times higher, nor is it the actual average power that will be consumed during a given computation, which is likely to be lower still. A power supply for a system is typically sized to exceed the TDP, and a cooling system is usually designed to match or exceed TDP. Failure to provide adequate cooling will allow the junction temperature in the processor to exceed its maximum value, resulting in device failure and possibly permanent damage.

Modern processors provide two features to assist in managing heat, since the highest power (and hence heat and temperature rise) can exceed the long-term average specified by the TDP. First, as the thermal temperature approaches the junction temperature limit, circuitry lowers the clock rate, thereby reducing power. Should this technique not be successful, a second thermal overload trap is activated to power down the chip.

The third factor that designers and users need to consider is energy and energy efficiency. Recall that power is simply energy per unit time: 1 watt = 1 joule per second. Which metric is the right one for comparing processors: energy or power? In general, energy is always a better metric because it is tied to a specific task and the time required for that task. In particular, the energy to complete a workload is equal to the average power times the execution time for the workload.

Thus, if we want to know which of two processors is more efficient for a given task, we need to compare energy consumption (not power) for executing the task. For example, processor A may have a 20% higher average power consumption than processor B, but if A executes the task in only 70% of the time needed by B, its energy consumption will be 1.2 × 0.7 = 0.84, which is clearly better.

One might argue that in a large server or cloud, it is sufficient to consider the average power, since the workload is often assumed to be infinite, but this is misleading. If our cloud were populated with processor Bs rather than As, then the cloud would do less work for the same amount of energy expended. Using energy to compare the alternatives avoids this pitfall. Whenever we have a fixed workload, whether for a warehouse-size cloud or a smartphone, comparing energy will be the right way to compare computer alternatives, because the electricity bill for the cloud and the battery lifetime for the smartphone are both determined by the energy consumed.

When is power consumption a useful measure? The primary legitimate use is as a constraint: for example, an air-cooled chip might be limited to 100 W. It can be used as a metric if the workload is fixed, but then it's just a variation of the true metric of energy per task.
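The processor A versus B comparison above can be spelled out in a few lines. This is a minimal sketch with normalized, assumed power and time values, not measurements:

```python
# Comparing two processors on a fixed task: energy (power x time), not power,
# is the right metric. Numbers follow the example in the text.
power_b = 1.0   # processor B's average power, normalized
time_b = 1.0    # processor B's execution time, normalized

power_a = 1.2 * power_b    # A draws 20% more average power
time_a = 0.7 * time_b      # ...but finishes the task in 70% of the time

energy_a = power_a * time_a
energy_b = power_b * time_b

print(f"Energy of A relative to B: {energy_a / energy_b:.2f}")   # 0.84
```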


Energy and Power Within a Microprocessor

For CMOS chips, the traditional primary energy consumption has been in switching transistors, also called dynamic energy. The energy required per transistor is proportional to the product of the capacitive load driven by the transistor and the square of the voltage:

Energy_dynamic ∝ Capacitive load × Voltage²

This equation is the energy of a pulse of the logic transition of 0→1→0 or 1→0→1. The energy of a single transition (0→1 or 1→0) is then:

Energy_dynamic ∝ 1/2 × Capacitive load × Voltage²

The power required per transistor is just the product of the energy of a transition and the frequency of transitions:

Power_dynamic ∝ 1/2 × Capacitive load × Voltage² × Frequency switched

For a fixed task, slowing clock rate reduces power, but not energy. Clearly, dynamic power and energy are greatly reduced by lowering the voltage, so voltages have dropped from 5 V to just under 1 V in 20 years. The capacitive load is a function of the number of transistors connected to an output and the technology, which determines the capacitance of the wires and the transistors.

Example

Some microprocessors today are designed to have adjustable voltage, so a 15% reduction in voltage may result in a 15% reduction in frequency. What would be the impact on dynamic energy and on dynamic power?

Answer

Because the capacitance is unchanged, the answer for energy is the ratio of the voltages:

Energy_new / Energy_old = (Voltage × 0.85)² / Voltage² = 0.85² = 0.72

which reduces energy to about 72% of the original. For power, we add the ratio of the frequencies:

Power_new / Power_old = 0.72 × (Frequency switched × 0.85) / Frequency switched = 0.61

shrinking power to about 61% of the original.

As we move from one process to the next, the increase in the number of transistors switching and the frequency with which they change dominate the decrease in load capacitance and voltage, leading to an overall growth in power consumption and energy.
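The example above follows directly from the two proportionalities. A minimal sketch of the arithmetic, using only the 15% scaling factors from the example:

```python
# Dynamic energy scales with voltage^2 and dynamic power with voltage^2 x
# frequency, so a 15% drop in both voltage and frequency (the example above)
# cuts energy to ~72% and power to ~61% of the original.
voltage_scale = 0.85
frequency_scale = 0.85

energy_ratio = voltage_scale ** 2
power_ratio = voltage_scale ** 2 * frequency_scale

print(f"Energy: {energy_ratio:.2f} of original")   # 0.72
print(f"Power:  {power_ratio:.2f} of original")    # 0.61
```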


The first microprocessors consumed less than a watt, and the first 32-bit microprocessors (such as the Intel 80386) used about 2 W, whereas a 4.0 GHz Intel Core i7-6700K consumes 95 W. Given that this heat must be dissipated from a chip that is about 1.5 cm on a side, we are near the limit of what can be cooled by air, and this is where we have been stuck for nearly a decade.

Given the preceding equation, you would expect clock frequency growth to slow down if we can't reduce voltage or increase power per chip. Figure 1.11 shows that this has indeed been the case since 2003, even for the microprocessors in Figure 1.1 that were the highest performers each year. Note that this period of flatter clock rates corresponds to the period of slow performance improvement in Figure 1.1.

Distributing the power, removing the heat, and preventing hot spots have become increasingly difficult challenges. Energy is now the major constraint to using transistors; in the past, it was the raw silicon area.

[Figure 1.11 plot: clock rate (MHz, log scale) versus year, 1978–2018. Milestones: Digital VAX-11/780, 5 MHz (1978); Sun-4 SPARC, 16.7 MHz (1986); MIPS M2000, 25 MHz (1989); Digital Alpha 21064, 150 MHz (1992); Digital Alpha 21164A, 500 MHz (1996); Intel Pentium III, 1000 MHz (2000); Intel Pentium4 Xeon, 3200 MHz (2003); Intel Skylake Core i7, 4200 MHz (2017). Growth rates: about 15%/year to 1986, 40%/year to 2003, and 2%/year since.]

Figure 1.11 Growth in clock rate of microprocessors in Figure 1.1. Between 1978 and 1986, the clock rate improved less than 15% per year while performance improved by 22% per year. During the “renaissance period” of 52% performance improvement per year between 1986 and 2003, clock rates shot up almost 40% per year. Since then, the clock rate has been nearly flat, growing at less than 2% per year, while single processor performance improved recently at just 3.5% per year.


Therefore modern microprocessors offer many techniques to try to improve energy efficiency despite flat clock rates and constant supply voltages:

1. Do nothing well. Most microprocessors today turn off the clock of inactive modules to save energy and dynamic power. For example, if no floating-point instructions are executing, the clock of the floating-point unit is disabled. If some cores are idle, their clocks are stopped.

2. Dynamic voltage-frequency scaling (DVFS). The second technique comes directly from the preceding formulas. PMDs, laptops, and even servers have periods of low activity where there is no need to operate at the highest clock frequency and voltages. Modern microprocessors typically offer a few clock frequencies and voltages in which to operate that use lower power and energy. Figure 1.12 plots the potential power savings via DVFS for a server as the workload shrinks for three different clock rates: 2.4, 1.8, and 1 GHz. The overall server power savings is about 10%–15% for each of the two steps.

3. Design for the typical case. Given that PMDs and laptops are often idle, memory and storage offer low power modes to save energy. For example, DRAMs have a series of increasingly lower power modes to extend battery life in PMDs and laptops, and there have been proposals for disks that have a mode that spins more slowly when unused to save power. However, you cannot access DRAMs or disks in these modes, so you must return to fully active mode to read or write, no matter how low the access rate. As mentioned, microprocessors for PCs have been designed instead for heavy use at high operating temperatures, relying on on-chip temperature sensors to detect when activity should be reduced automatically to avoid overheating. This "emergency slowdown" allows manufacturers to design for a more typical case and then rely on this safety mechanism if someone really does run programs that consume much more power than is typical.

[Figure 1.12 plot: power as a percentage of peak versus compute load (idle to 100%), with curves for 2.4, 1.8, and 1 GHz clock rates and the resulting DVS savings (%).]

Figure 1.12 Energy savings for a server using an AMD Opteron microprocessor, 8 GB of DRAM, and one ATA disk. At 1.8 GHz, the server can handle at most up to two-thirds of the workload without causing service-level violations, and at 1 GHz, it can safely handle only one-third of the workload (Figure 5.11 in Barroso and Hölzle, 2009).


4. Overclocking. Intel started offering Turbo mode in 2008, where the chip decides that it is safe to run at a higher clock rate for a short time, possibly on just a few cores, until temperature starts to rise. For example, the 3.3 GHz Core i7 can run in short bursts at 3.6 GHz. Indeed, the highest-performing microprocessors each year since 2008 shown in Figure 1.1 have all offered temporary overclocking of about 10% over the nominal clock rate. For single-threaded code, these microprocessors can turn off all cores but one and run it faster. Note that, although the operating system can turn off Turbo mode, there is no notification once it is enabled, so the programmers may be surprised to see their programs vary in performance because of room temperature!

Although dynamic power is traditionally thought of as the primary source of power dissipation in CMOS, static power is becoming an important issue because leakage current flows even when a transistor is off:

Power_static ∝ Current_static × Voltage

That is, static power is proportional to the number of devices. Thus increasing the number of transistors increases power even if they are idle, and current leakage increases in processors with smaller transistor sizes. As a result, very low-power systems are even turning off the power supply (power gating) to inactive modules in order to control loss because of leakage. In 2011 the goal for leakage was 25% of the total power consumption, with leakage in high-performance designs sometimes far exceeding that goal. Leakage can be as high as 50% for such chips, in part because of the large SRAM caches that need power to maintain the storage values. (The S in SRAM is for static.) The only hope to stop leakage is to turn off power to subsets of the chip.

Finally, because the processor is just a portion of the whole energy cost of a system, it can make sense to use a faster, less energy-efficient processor to allow the rest of the system to go into a sleep mode. This strategy is known as race-to-halt.

The importance of power and energy has increased the scrutiny on the efficiency of an innovation, so the primary evaluation now is tasks per joule or performance per watt, rather than performance per mm2 of silicon as in the past. This new metric affects approaches to parallelism, as we will see in Chapters 4 and 5.
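The following sketch shows how the static term adds to the dynamic term from the earlier formulas. The constants are illustrative, normalized placeholders chosen only to show a 50% leakage fraction like the worst case mentioned above; they are not measured values:

```python
# A minimal sketch of how static (leakage) power adds to dynamic power.
# All inputs are illustrative, normalized placeholders, not measured values.
def dynamic_power(capacitive_load, voltage, frequency_switched):
    """Power_dynamic ~ 1/2 x capacitive load x voltage^2 x switching frequency."""
    return 0.5 * capacitive_load * voltage ** 2 * frequency_switched

def static_power(leakage_current, voltage):
    """Power_static ~ static (leakage) current x voltage."""
    return leakage_current * voltage

p_dyn = dynamic_power(capacitive_load=1.0, voltage=1.0, frequency_switched=1.0)
p_sta = static_power(leakage_current=0.5, voltage=1.0)

total = p_dyn + p_sta
print(f"Leakage fraction of total power: {p_sta / total:.0%}")  # 50% with these inputs
```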

The Shift in Computer Architecture Because of Limits of Energy

As transistor improvement decelerates, computer architects must look elsewhere for improved energy efficiency. Indeed, given the energy budget, it is easy today to design a microprocessor with so many transistors that they cannot all be turned on at the same time. This phenomenon has been called dark silicon, in that much of a chip cannot be used ("dark") at any moment in time because of thermal constraints. This observation has led architects to reexamine the fundamentals of processors' design in the search for a greater energy-cost performance.

Figure 1.13, which lists the energy cost and area cost of the building blocks of a modern computer, reveals surprisingly large ratios.


| Operation | Energy (pJ) | Area (µm2) |
| 8b Add | 0.03 | 36 |
| 16b Add | 0.05 | 67 |
| 32b Add | 0.1 | 137 |
| 16b FP Add | 0.4 | 1360 |
| 32b FP Add | 0.9 | 4184 |
| 8b Mult | 0.2 | 282 |
| 32b Mult | 3.1 | 3495 |
| 16b FP Mult | 1.1 | 1640 |
| 32b FP Mult | 3.7 | 7700 |
| 32b SRAM Read (8KB) | 5 | N/A |
| 32b DRAM Read | 640 | N/A |

(Relative energy and area costs are plotted on log scales, 1–10,000, in the figure.) Energy numbers are from Mark Horowitz, "Computing's Energy Problem (and what we can do about it)," ISSCC 2014. Area numbers are from synthesized results using Design Compiler under the TSMC 45 nm tech node; FP units used the DesignWare Library.

Figure 1.13 Comparison of the energy and die area of arithmetic operations and energy cost of accesses to SRAM and DRAM. [Azizi][Dally]. Area is for TSMC 45 nm technology node.

For example, a 32-bit floating-point addition uses 30 times as much energy as an 8-bit integer add. The area difference is even larger, by 60 times. However, the biggest difference is in memory; a 32-bit DRAM access takes 20,000 times as much energy as an 8-bit addition. A small SRAM is 125 times more energy-efficient than DRAM, which demonstrates the importance of careful uses of caches and memory buffers.

The new design principle of minimizing energy per task combined with the relative energy and area costs in Figure 1.13 have inspired a new direction for computer architecture, which we describe in Chapter 7. Domain-specific processors save energy by reducing wide floating-point operations and deploying special-purpose memories to reduce accesses to DRAM. They use those savings to provide 10–100× more (narrower) integer arithmetic units than a traditional processor. Although such processors perform only a limited set of tasks, they perform them remarkably faster and more energy efficiently than a general-purpose processor.

Like a hospital with general practitioners and medical specialists, computers in this energy-aware world will likely be combinations of general-purpose cores that can perform any task and special-purpose cores that do a few things extremely well and even more cheaply.
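The ratios quoted above come straight from the Figure 1.13 energy numbers; a minimal sketch recomputes them (the results round to the approximate factors in the text):

```python
# Recomputing the energy ratios quoted in the text from the Figure 1.13 numbers.
energy_pj = {            # energy per operation, picojoules (Figure 1.13)
    "8b int add": 0.03,
    "32b FP add": 0.9,
    "32b SRAM read (8KB)": 5,
    "32b DRAM read": 640,
}

print("32b FP add vs. 8b add:   %.0fx" % (energy_pj["32b FP add"] / energy_pj["8b int add"]))
print("DRAM read vs. 8b add:    %.0fx" % (energy_pj["32b DRAM read"] / energy_pj["8b int add"]))
print("DRAM read vs. SRAM read: %.0fx" % (energy_pj["32b DRAM read"] / energy_pj["32b SRAM read (8KB)"]))
```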

1.6

Trends in Cost

Although costs tend to be less important in some computer designs—specifically supercomputers—cost-sensitive designs are of growing significance. Indeed, in the past 35 years, the use of technology improvements to lower cost, as well as increase performance, has been a major theme in the computer industry.


Textbooks often ignore the cost half of cost-performance because costs change, thereby dating books, and because the issues are subtle and differ across industry segments. Nevertheless, it’s essential for computer architects to have an understanding of cost and its factors in order to make intelligent decisions about whether a new feature should be included in designs where cost is an issue. (Imagine architects designing skyscrapers without any information on costs of steel beams and concrete!) This section discusses the major factors that influence the cost of a computer and how these factors are changing over time.

The Impact of Time, Volume, and Commoditization

The cost of a manufactured computer component decreases over time even without significant improvements in the basic implementation technology. The underlying principle that drives costs down is the learning curve—manufacturing costs decrease over time. The learning curve itself is best measured by change in yield—the percentage of manufactured devices that survives the testing procedure. Whether it is a chip, a board, or a system, designs that have twice the yield will have half the cost. Understanding how the learning curve improves yield is critical to projecting costs over a product's life.

One example is that the price per megabyte of DRAM has dropped over the long term. Since DRAMs tend to be priced in close relationship to cost—except for periods when there is a shortage or an oversupply—price and cost of DRAM track closely. Microprocessor prices also drop over time, but because they are less standardized than DRAMs, the relationship between price and cost is more complex. In a period of significant competition, price tends to track cost closely, although microprocessor vendors probably rarely sell at a loss.

Volume is a second key factor in determining cost. Increasing volumes affect cost in several ways. First, they decrease the time needed to get through the learning curve, which is partly proportional to the number of systems (or chips) manufactured. Second, volume decreases cost because it increases purchasing and manufacturing efficiency. As a rule of thumb, some designers have estimated that costs decrease about 10% for each doubling of volume. Moreover, volume decreases the amount of development costs that must be amortized by each computer, thus allowing cost and selling price to be closer and still make a profit.

Commodities are products that are sold by multiple vendors in large volumes and are essentially identical. Virtually all the products sold on the shelves of grocery stores are commodities, as are standard DRAMs, Flash memory, monitors, and keyboards. In the past 30 years, much of the personal computer industry has become a commodity business focused on building desktop and laptop computers running Microsoft Windows. Because many vendors ship virtually identical products, the market is highly competitive.


Of course, this competition decreases the gap between cost and selling price, but it also decreases cost. Reductions occur because a commodity market has both volume and a clear product definition, which allows multiple suppliers to compete in building components for the commodity product. As a result, the overall product cost is lower because of the competition among the suppliers of the components and the volume efficiencies the suppliers can achieve. This rivalry has led to the low end of the computer business being able to achieve better price-performance than other sectors and has yielded greater growth at the low end, although with very limited profits (as is typical in any commodity business).

Cost of an Integrated Circuit

Why would a computer architecture book have a section on integrated circuit costs? In an increasingly competitive computer marketplace where standard parts—disks, Flash memory, DRAMs, and so on—are becoming a significant portion of any system's cost, integrated circuit costs are becoming a greater portion of the cost that varies between computers, especially in the high-volume, cost-sensitive portion of the market. Indeed, with PMDs' increasing reliance on whole systems on a chip (SOC), the cost of the integrated circuits is much of the cost of the PMD. Thus computer designers must understand the costs of chips in order to understand the costs of current computers.

Although the costs of integrated circuits have dropped exponentially, the basic process of silicon manufacture is unchanged: A wafer is still tested and chopped into dies that are packaged (see Figures 1.14–1.16). Therefore the cost of a packaged integrated circuit is

Cost of integrated circuit = (Cost of die + Cost of testing die + Cost of packaging and final test) / Final test yield

In this section, we focus on the cost of dies, summarizing the key issues in testing and packaging at the end. Learning how to predict the number of good chips per wafer requires first learning how many dies fit on a wafer and then learning how to predict the percentage of those that will work. From there it is simple to predict cost:

Cost of die = Cost of wafer / (Dies per wafer × Die yield)

The most interesting feature of this initial term of the chip cost equation is its sensitivity to die size, shown below. The number of dies per wafer is approximately the area of the wafer divided by the area of the die. It can be more accurately estimated by

Dies per wafer = π × (Wafer diameter/2)² / Die area − π × Wafer diameter / √(2 × Die area)

The first term is the ratio of wafer area (πr²) to die area. The second compensates for the "square peg in a round hole" problem—rectangular dies near the periphery of round wafers. Dividing the circumference (πd) by the diagonal of a square die is approximately the number of dies along the edge.


Figure 1.14 Photograph of an Intel Skylake microprocessor die, which is evaluated in Chapter 4.

[Die floorplan labels: processor cores, two memory controllers, DDR I/O, and 3x Intel UPI, 3x16 PCIe Gen3, and 1x4 DMI3 interfaces.]

Figure 1.15 The components of the microprocessor die in Figure 1.14 are labeled with their functions.


Figure 1.16 This 200 mm diameter wafer of RISC-V dies was designed by SiFive. It has two types of RISC-V dies using an older, larger processing line. An FE310 die is 2.65 mm × 2.72 mm, and a SiFive test die is 2.89 mm × 2.72 mm. The wafer contains 1846 of the former and 1866 of the latter, totaling 3712 chips.


Example    Find the number of dies per 300 mm (30 cm) wafer for a die that is 1.5 cm on a side and for a die that is 1.0 cm on a side.

Answer    When die area is 2.25 cm²:

Dies per wafer = π × (30/2)² / 2.25 − π × 30 / √(2 × 2.25) = 706.9/2.25 − 94.2/2.12 = 270

Because the area of the larger die is 2.25 times bigger, there are roughly 2.25 times as many smaller dies per wafer:

Dies per wafer = π × (30/2)² / 1.00 − π × 30 / √(2 × 1.00) = 706.9/1.00 − 94.2/1.41 = 640
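The dies-per-wafer formula is easy to check directly. This short sketch reproduces the two results above (roughly 270 and 640 dies):

```python
import math

def dies_per_wafer(wafer_diameter_cm, die_area_cm2):
    """Approximate dies per wafer: wafer area over die area, minus an edge correction."""
    radius = wafer_diameter_cm / 2
    return (math.pi * radius ** 2 / die_area_cm2
            - math.pi * wafer_diameter_cm / math.sqrt(2 * die_area_cm2))

for die_area in (2.25, 1.00):
    print(f"{die_area} cm^2 die: ~{dies_per_wafer(30, die_area):.0f} dies per 300 mm wafer")
```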

However, this formula gives only the maximum number of dies per wafer. The critical question is: What is the fraction of good dies on a wafer, or the die yield?


A simple model of integrated circuit yield, which assumes that defects are randomly distributed over the wafer and that yield is inversely proportional to the complexity of the fabrication process, leads to the following:

Die yield = Wafer yield × 1 / (1 + Defects per unit area × Die area)^N

This Bose-Einstein formula is an empirical model developed by looking at the yield of many manufacturing lines (Sydow, 2006), and it still applies today. Wafer yield accounts for wafers that are completely bad and so need not be tested. For simplicity, we’ll just assume the wafer yield is 100%. Defects per unit area is a measure of the random manufacturing defects that occur. In 2017 the value was typically 0.08–0.10 defects per square inch for a 28-nm node and 0.10–0.30 for the newer 16 nm node because it depends on the maturity of the process (recall the learning curve mentioned earlier). The metric versions are 0.012–0.016 defects per square centimeter for 28 nm and 0.016–0.047 for 16 nm. Finally, N is a parameter called the process-complexity factor, a measure of manufacturing difficulty. For 28 nm processes in 2017, N is 7.5–9.5. For a 16 nm process, N ranges from 10 to 14.
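The yield model is a one-line function of die area, defect density, and process complexity. A minimal sketch follows; the specific parameter values are assumptions picked from the ranges quoted above, not data for any particular process:

```python
# Die yield per the Bose-Einstein model above. Parameter values below are
# illustrative, chosen from the ranges quoted in the text.
def die_yield(defects_per_cm2, die_area_cm2, n, wafer_yield=1.0):
    return wafer_yield / (1 + defects_per_cm2 * die_area_cm2) ** n

# A 1 cm^2 die on a mature 28 nm line versus a newer 16 nm line (assumed values).
print(f"28 nm yield: {die_yield(0.014, 1.0, 8.5):.2f}")
print(f"16 nm yield: {die_yield(0.030, 1.0, 12):.2f}")
```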

Example    Find the die yield for dies that are 1.5 cm on a side and 1.0 cm on a side, assuming a defect density of 0.047 per cm² and N is 12.

Answer    The total die areas are 2.25 and 1.00 cm². For the larger die, the yield is

Die yield = 1/(1 + 0.047 × 2.25)^12 × 270 = 120

For the smaller die, the yield is

Die yield = 1/(1 + 0.047 × 1.00)^12 × 640 = 444

The bottom line is the number of good dies per wafer. Less than half of all the large dies are good, but nearly 70% of the small dies are good. Although many microprocessors fall between 1.00 and 2.25 cm2, low-end embedded 32-bit processors are sometimes as small as 0.05 cm2, processors used for embedded control (for inexpensive IoT devices) are often less than 0.01 cm2, and high-end server and GPU chips can be as large as 8 cm2. Given the tremendous price pressures on commodity products such as DRAM and SRAM, designers have included redundancy as a way to raise yield. For a number of years, DRAMs have regularly included some redundant memory cells so that a certain number of flaws can be accommodated. Designers have used similar techniques in both standard SRAMs and in large SRAM arrays used for caches within microprocessors. GPUs have 4 redundant processors out of 84 for the same reason. Obviously, the presence of redundant entries can be used to boost the yield significantly.


In 2017 processing of a 300 mm (12-inch) diameter wafer in a 28-nm technology costs between $4000 and $5000, and a 16-nm wafer costs about $7000. Assuming a processed wafer cost of $7000, the cost of the 1.00 cm2 die would be around $16, but the cost per die of the 2.25 cm2 die would be about $58, or almost four times the cost of a die that is a little over twice as large.

What should a computer designer remember about chip costs? The manufacturing process dictates the wafer cost, wafer yield, and defects per unit area, so the sole control of the designer is die area. In practice, because the number of defects per unit area is small, the number of good dies per wafer shrinks, and therefore the cost per die grows, roughly as the square of the die area. The computer designer affects die size, and thus cost, both by what functions are included on or excluded from the die and by the number of I/O pins.

Before we have a part that is ready for use in a computer, the die must be tested (to separate the good dies from the bad), packaged, and tested again after packaging. These steps all add significant costs, increasing the total by half.

The preceding analysis focused on the variable costs of producing a functional die, which is appropriate for high-volume integrated circuits. There is, however, one very important part of the fixed costs that can significantly affect the cost of an integrated circuit for low volumes (less than 1 million parts), namely, the cost of a mask set. Each step in the integrated circuit process requires a separate mask. Therefore, for modern high-density fabrication processes with up to 10 metal layers, mask costs are about $4 million for 16 nm and $1.5 million for 28 nm.

The good news is that semiconductor companies offer "shuttle runs" to dramatically lower the costs of tiny test chips. They lower costs by putting many small designs onto a single die to amortize the mask costs, and then later split the dies into smaller pieces for each project. Thus TSMC delivers 80–100 untested dies that are 1.57 × 1.57 mm in a 28 nm process for $30,000 in 2017. Although these dies are tiny, they offer the architect millions of transistors to play with. For example, several RISC-V processors would fit on such a die. Although shuttle runs help with prototyping and debugging runs, they don't address small-volume production of tens to hundreds of thousands of parts. Because mask costs are likely to continue to increase, some designers are incorporating reconfigurable logic to enhance the flexibility of a part and thus reduce the cost implications of masks.
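The per-die costs quoted above follow directly from the wafer cost and the good-die counts worked out earlier in this section; a minimal sketch of that division:

```python
# Cost per die = wafer cost / good dies per wafer, using the good-die counts
# from the die-yield example earlier (120 large dies, 444 small dies) and the
# $7000 16 nm wafer cost quoted in the text.
wafer_cost = 7000                      # dollars per processed 300 mm wafer (2017)
good_dies = {2.25: 120, 1.00: 444}     # die area (cm^2) -> good dies per wafer

for die_area, dies in good_dies.items():
    print(f"{die_area} cm^2 die: about ${wafer_cost / dies:.0f} per die")
```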

Cost Versus Price

With the commoditization of computers, the margin between the cost to manufacture a product and the price the product sells for has been shrinking. Those margins pay for a company's research and development (R&D), marketing, sales, manufacturing equipment maintenance, building rental, cost of financing, pretax profits, and taxes. Many engineers are surprised to find that most companies spend only 4% (in the commodity PC business) to 12% (in the high-end server business) of their income on R&D, which includes all engineering.


Cost of Manufacturing Versus Cost of Operation

For the first four editions of this book, cost meant the cost to build a computer and price meant price to purchase a computer. With the advent of WSCs, which contain tens of thousands of servers, the cost to operate the computers is significant in addition to the cost of purchase. Economists refer to these two costs as capital expenses (CAPEX) and operational expenses (OPEX).

As Chapter 6 shows, the amortized purchase price of servers and networks is about half of the monthly cost to operate a WSC, assuming a short lifetime of the IT equipment of 3–4 years. About 40% of the monthly operational costs are for power use and the amortized infrastructure to distribute power and to cool the IT equipment, despite this infrastructure being amortized over 10–15 years. Thus, to lower operational costs in a WSC, computer architects need to use energy efficiently.

1.7

Dependability

Historically, integrated circuits were one of the most reliable components of a computer. Although their pins may be vulnerable, and faults may occur over communication channels, the failure rate inside the chip was very low. That conventional wisdom is changing as we head to feature sizes of 16 nm and smaller, because both transient faults and permanent faults are becoming more commonplace, so architects must design systems to cope with these challenges. This section gives a quick overview of the issues in dependability, leaving the official definition of the terms and approaches to Section D.3 in Appendix D.

Computers are designed and constructed at different layers of abstraction. We can descend recursively down through a computer seeing components enlarge themselves to full subsystems until we run into individual transistors. Although some faults are widespread, like the loss of power, many can be limited to a single component in a module. Thus utter failure of a module at one level may be considered merely a component error in a higher-level module. This distinction is helpful in trying to find ways to build dependable computers.

One difficult question is deciding when a system is operating properly. This theoretical point became concrete with the popularity of Internet services. Infrastructure providers started offering service level agreements (SLAs) or service level objectives (SLOs) to guarantee that their networking or power service would be dependable. For example, they would pay the customer a penalty if they did not meet an agreement of some hours per month. Thus an SLA could be used to decide whether the system was up or down. Systems alternate between two states of service with respect to an SLA:

1. Service accomplishment, where the service is delivered as specified.
2. Service interruption, where the delivered service is different from the SLA.


Transitions between these two states are caused by failures (from state 1 to state 2) or restorations (2 to 1). Quantifying these transitions leads to the two main measures of dependability:

■ Module reliability is a measure of the continuous service accomplishment (or, equivalently, of the time to failure) from a reference initial instant. Therefore the mean time to failure (MTTF) is a reliability measure. The reciprocal of MTTF is a rate of failures, generally reported as failures per billion hours of operation, or FIT (for failures in time). Thus an MTTF of 1,000,000 hours equals 10^9/10^6 or 1000 FIT. Service interruption is measured as mean time to repair (MTTR). Mean time between failures (MTBF) is simply the sum of MTTF + MTTR. Although MTBF is widely used, MTTF is often the more appropriate term. If a collection of modules has exponentially distributed lifetimes—meaning that the age of a module is not important in probability of failure—the overall failure rate of the collection is the sum of the failure rates of the modules.

■ Module availability is a measure of the service accomplishment with respect to the alternation between the two states of accomplishment and interruption. For nonredundant systems with repair, module availability is

Module availability = MTTF / (MTTF + MTTR)

Note that reliability and availability are now quantifiable metrics, rather than synonyms for dependability. From these definitions, we can estimate reliability of a system quantitatively if we make some assumptions about the reliability of components and that failures are independent.

Example    Assume a disk subsystem with the following components and MTTF:

■ 10 disks, each rated at 1,000,000-hour MTTF
■ 1 ATA controller, 500,000-hour MTTF
■ 1 power supply, 200,000-hour MTTF
■ 1 fan, 200,000-hour MTTF
■ 1 ATA cable, 1,000,000-hour MTTF

Using the simplifying assumptions that the lifetimes are exponentially distributed and that failures are independent, compute the MTTF of the system as a whole.

Answer    The sum of the failure rates is

Failure rate_system = 10 × 1/1,000,000 + 1/500,000 + 1/200,000 + 1/200,000 + 1/1,000,000
                    = (10 + 2 + 5 + 5 + 1)/1,000,000 hours = 23/1,000,000 hours = 23,000/1,000,000,000 hours


or 23,000 FIT. The MTTF for the system is just the inverse of the failure rate:

MTTF_system = 1 / Failure rate_system = 1,000,000,000 hours / 23,000 = 43,500 hours

or just under 5 years.

The primary way to cope with failure is redundancy, either in time (repeat the operation to see if it still is erroneous) or in resources (have other components to take over from the one that failed). Once the component is replaced and the system is fully repaired, the dependability of the system is assumed to be as good as new. Let's quantify the benefits of redundancy with an example.

Example    Disk subsystems often have redundant power supplies to improve dependability. Using the preceding components and MTTFs, calculate the reliability of redundant power supplies. Assume that one power supply is sufficient to run the disk subsystem and that we are adding one redundant power supply.

Answer    We need a formula to show what to expect when we can tolerate a failure and still provide service. To simplify the calculations, we assume that the lifetimes of the components are exponentially distributed and that there is no dependency between the component failures. MTTF for our redundant power supplies is the mean time until one power supply fails divided by the chance that the other will fail before the first one is replaced. Thus, if the chance of a second failure before repair is small, then the MTTF of the pair is large.

Since we have two power supplies and independent failures, the mean time until one supply fails is MTTF_power supply/2. A good approximation of the probability of a second failure is MTTR over the mean time until the other power supply fails. Therefore a reasonable approximation for a redundant pair of power supplies is

MTTF_power supply pair = (MTTF_power supply/2) / (MTTR_power supply / MTTF_power supply) = MTTF_power supply² / (2 × MTTR_power supply)

Using the preceding MTTF numbers, if we assume it takes on average 24 hours for a human operator to notice that a power supply has failed and to replace it, the reliability of the fault-tolerant pair of power supplies is

MTTF_power supply pair = MTTF_power supply² / (2 × MTTR_power supply) = 200,000² / (2 × 24) ≅ 830,000,000

making the pair about 4150 times more reliable than a single power supply.

Having quantified the cost, power, and dependability of computer technology, we are ready to quantify performance.
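The arithmetic in both dependability examples is short enough to script. A minimal sketch, using only the component MTTFs, the 24-hour MTTR, and the formulas given above:

```python
# Sketch of the dependability arithmetic above: failure rates add for a
# series system with exponential lifetimes, and a redundant pair's MTTF is
# MTTF^2 / (2 x MTTR).
component_mttf_hours = [1_000_000] * 10 + [500_000, 200_000, 200_000, 1_000_000]

failure_rate = sum(1 / mttf for mttf in component_mttf_hours)   # failures per hour
system_mttf = 1 / failure_rate
print(f"System FIT:  {failure_rate * 1e9:,.0f}")                # ~23,000 FIT
print(f"System MTTF: {system_mttf:,.0f} hours")                 # ~43,500 hours

mttr = 24                     # hours to notice and replace a failed power supply
ps_mttf = 200_000
pair_mttf = ps_mttf ** 2 / (2 * mttr)
print(f"Redundant power-supply pair MTTF: {pair_mttf:,.0f} hours")   # ~830,000,000

# Availability of a single repairable module: MTTF / (MTTF + MTTR)
print(f"Single power-supply availability: {ps_mttf / (ps_mttf + mttr):.5f}")
```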

1.8

Measuring, Reporting, and Summarizing Performance

When we say one computer is faster than another one is, what do we mean? The user of a cell phone may say a computer is faster when a program runs in less time, while an Amazon.com administrator may say a computer is faster when it completes more transactions per hour. The cell phone user wants to reduce response time—the time between the start and the completion of an event—also referred to as execution time. The operator of a WSC wants to increase throughput—the total amount of work done in a given time.

In comparing design alternatives, we often want to relate the performance of two different computers, say, X and Y. The phrase "X is faster than Y" is used here to mean that the response time or execution time is lower on X than on Y for the given task. In particular, "X is n times as fast as Y" will mean

Execution time_Y / Execution time_X = n

Since execution time is the reciprocal of performance, the following relationship holds:

n = Execution time_Y / Execution time_X = (1/Performance_Y) / (1/Performance_X) = Performance_X / Performance_Y
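The ratio is trivial to compute; a minimal sketch with assumed, illustrative execution times (chosen so that n comes out to the 1.3 used in the next paragraph):

```python
# "X is n times as fast as Y": n is the ratio of execution times (or,
# equivalently, of performances). Times below are assumed, not measured.
exec_time_x = 10.0    # seconds on computer X (assumed)
exec_time_y = 13.0    # seconds on computer Y (assumed)

n = exec_time_y / exec_time_x
print(f"X is {n:.1f} times as fast as Y")   # 1.3 with these inputs
```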

The phrase "the throughput of X is 1.3 times as fast as Y" signifies here that the number of tasks completed per unit time on computer X is 1.3 times the number completed on Y. Unfortunately, time is not always the metric quoted in comparing the performance of computers. Our position is that the only consistent and reliable measure of performance is the execution time of real programs, and that all proposed alternatives to time as the metric or to real programs as the items measured have eventually led to misleading claims or even mistakes in computer design.

Even execution time can be defined in different ways depending on what we count. The most straightforward definition of time is called wall-clock time, response time, or elapsed time, which is the latency to complete a task, including storage accesses, memory accesses, input/output activities, operating system overhead—everything. With multiprogramming, the processor works on another program while waiting for I/O and may not necessarily minimize the elapsed time of one program. Thus we need a term to consider this activity. CPU time recognizes this distinction and means the time the processor is computing, not including the time waiting for I/O or running other programs. (Clearly, the response time seen by the user is the elapsed time of the program, not the CPU time.)

Computer users who routinely run the same programs would be the perfect candidates to evaluate a new computer. To evaluate a new system, these users would simply compare the execution time of their workloads—the mixture of programs and operating system commands that users run on a computer.


Few are in this happy situation, however. Most must rely on other methods to evaluate computers, and often other evaluators, hoping that these methods will predict performance for their usage of the new computer. One approach is benchmark programs, which are programs that many companies use to establish the relative performance of their computers.

Benchmarks

The best choice of benchmarks to measure performance is real applications, such as Google Translate mentioned in Section 1.1. Attempts at running programs that are much simpler than a real application have led to performance pitfalls. Examples include

■ Kernels, which are small, key pieces of real applications.
■ Toy programs, which are 100-line programs from beginning programming assignments, such as Quicksort.
■ Synthetic benchmarks, which are fake programs invented to try to match the profile and behavior of real applications, such as Dhrystone.

All three are discredited today, usually because the compiler writer and architect can conspire to make the computer appear faster on these stand-in programs than on real applications. Regrettably for your authors—who dropped the fallacy about using synthetic benchmarks to characterize performance in the fourth edition of this book since we thought all computer architects agreed it was disreputable—the synthetic program Dhrystone is still the most widely quoted benchmark for embedded processors in 2017!

Another issue is the conditions under which the benchmarks are run. One way to improve the performance of a benchmark has been with benchmark-specific compiler flags; these flags often caused transformations that would be illegal on many programs or would slow down performance on others. To restrict this process and increase the significance of the results, benchmark developers typically require the vendor to use one compiler and one set of flags for all the programs in the same language (such as C++ or C). In addition to the question of compiler flags, another question is whether source code modifications are allowed. There are three different approaches to addressing this question:

1. No source code modifications are allowed.

2. Source code modifications are allowed but are essentially impossible. For example, database benchmarks rely on standard database programs that are tens of millions of lines of code. The database companies are highly unlikely to make changes to enhance the performance for one particular computer.

3. Source modifications are allowed, as long as the altered version produces the same output.


The key issue that benchmark designers face in deciding to allow modification of the source is whether such modifications will reflect real practice and provide useful insight to users, or whether these changes simply reduce the accuracy of the benchmarks as predictors of real performance. As we will see in Chapter 7, domain-specific architects often follow the third option when creating processors for well-defined tasks.

To overcome the danger of placing too many eggs in one basket, collections of benchmark applications, called benchmark suites, are a popular measure of performance of processors with a variety of applications. Of course, such collections are only as good as the constituent individual benchmarks. Nonetheless, a key advantage of such suites is that the weakness of any one benchmark is lessened by the presence of the other benchmarks. The goal of a benchmark suite is that it will characterize the real relative performance of two computers, particularly for programs not in the suite that customers are likely to run.

A cautionary example is the Electronic Design News Embedded Microprocessor Benchmark Consortium (or EEMBC, pronounced "embassy") benchmarks. It is a set of 41 kernels used to predict performance of different embedded applications: automotive/industrial, consumer, networking, office automation, and telecommunications. EEMBC reports unmodified performance and "full fury" performance, where almost anything goes. Because these benchmarks use small kernels, and because of the reporting options, EEMBC does not have the reputation of being a good predictor of relative performance of different embedded computers in the field. This lack of success is why Dhrystone, which EEMBC was trying to replace, is sadly still used.

One of the most successful attempts to create standardized benchmark application suites has been the SPEC (Standard Performance Evaluation Corporation), which had its roots in efforts in the late 1980s to deliver better benchmarks for workstations. Just as the computer industry has evolved over time, so has the need for different benchmark suites, and there are now SPEC benchmarks to cover many application classes. All the SPEC benchmark suites and their reported results are found at http://www.spec.org. Although we focus our discussion on the SPEC benchmarks in many of the following sections, many benchmarks have also been developed for PCs running the Windows operating system.

Desktop Benchmarks

Desktop benchmarks divide into two broad classes: processor-intensive benchmarks and graphics-intensive benchmarks, although many graphics benchmarks include intensive processor activity. SPEC originally created a benchmark set focusing on processor performance (initially called SPEC89), which has evolved into its sixth generation: SPEC CPU2017, which follows SPEC2006, SPEC2000, SPEC95, SPEC92, and SPEC89. SPEC CPU2017 consists of a set of 10 integer benchmarks (CINT2017) and 17 floating-point benchmarks (CFP2017). Figure 1.17 describes the current SPEC CPU benchmarks and their ancestry.

SPEC2017 integer programs (CINT2017):

| Benchmark | Description |
| perl | Perl interpreter |
| gcc | GNU C compiler |
| mcf | Route planning |
| omnetpp | Discrete event simulation—computer network |
| xalancbmk | XML to HTML conversion via XSLT |
| x264 | Video compression |
| deepsjeng | Artificial intelligence: alpha-beta tree search (Chess) |
| leela | Artificial intelligence: Monte Carlo tree search (Go) |
| exchange2 | Artificial intelligence: recursive solution generator (Sudoku) |
| xz | General data compression |

SPEC2017 floating-point programs (CFP2017):

| Benchmark | Description |
| bwaves | Explosion modeling |
| cactuBSSN | Physics: relativity |
| namd | Molecular dynamics |
| parest | Biomedical imaging: optical tomography with finite elements |
| povray | Ray tracing |
| lbm | Fluid dynamics |
| wrf | Weather forecasting |
| blender | 3D rendering and animation |
| cam4 | Atmosphere modeling |
| imagick | Image manipulation |
| nab | Molecular dynamics |
| fotonik3d | Computational electromagnetics |
| roms | Regional ocean modeling |

Earlier-generation programs traced in the figure's SPEC2006, SPEC2000, SPEC95, SPEC92, and SPEC89 columns include bzip2, gzip, vortex, h264ref, gobmk, eon, twolf, astar, vpr, hmmer, libquantum, parser, sjeng, compress, go, ijpeg, eqntott, sc, m88ksim, crafty, espresso, li, fpppp, tomcatv, doduc, nasa7, gamess, wupwise, milc, zeusmp, gromacs, galgel, leslie3d, dealII, mesa, art, soplex, equake, calculix, GemsFDTD, facerec, ammp, tonto, sphinx3, apsi, swim, hydro2d, mgrid, su2cor, applu, turb3d, wave5, spice, matrix300, lucas, fma3d, and sixtrack.

Figure 1.17 SPEC2017 programs and the evolution of the SPEC benchmarks over time, with integer programs above the line and floating-point programs below the line. Of the 10 SPEC2017 integer programs, 5 are written in C, 4 in C++, and 1 in Fortran. For the floating-point programs, the split is 3 in Fortran, 2 in C++, 2 in C, and 6 in mixed C, C++, and Fortran. The figure shows all 82 of the programs in the 1989, 1992, 1995, 2000, 2006, and 2017 releases. Gcc is the senior citizen of the group. Only 3 integer programs and 3 floating-point programs survived three or more generations. Although a few are carried over from generation to generation, the version of the program changes and either the input or the size of the benchmark is often expanded to increase its running time and to avoid perturbation in measurement or domination of the execution time by some factor other than CPU time. The benchmark descriptions on the left are for SPEC2017 only and do not apply to earlier versions. Programs in the same row from different generations of SPEC are generally not related; for example, fpppp is not a CFD code like bwaves.



SPEC benchmarks are real programs modified to be portable and to minimize the effect of I/O on performance. The integer benchmarks vary from part of a C compiler to a go program to video compression. The floating-point benchmarks include molecular dynamics, ray tracing, and weather forecasting. The SPEC CPU suite is useful for processor benchmarking for both desktop systems and single-processor servers. We will see data on many of these programs throughout this book. However, these programs share little with modern programming languages and environments and the Google Translate application that Section 1.1 describes. Nearly half of them are written at least partially in Fortran! They are even statically linked instead of being dynamically linked like most real programs. Alas, the SPEC2017 applications themselves may be real, but they are not inspiring. It's not clear that SPECINT2017 and SPECFP2017 capture what is exciting about computing in the 21st century.

In Section 1.11, we describe pitfalls that have occurred in developing the SPEC CPU benchmark suite, as well as the challenges in maintaining a useful and predictive benchmark suite.

SPEC CPU2017 is aimed at processor performance, but SPEC offers many other benchmarks. Figure 1.18 lists the 17 SPEC benchmarks that are active in 2017.

Server Benchmarks

Just as servers have multiple functions, so are there multiple types of benchmarks. The simplest benchmark is perhaps a processor throughput-oriented benchmark. SPEC CPU2017 uses the SPEC CPU benchmarks to construct a simple throughput benchmark where the processing rate of a multiprocessor can be measured by running multiple copies (usually as many as there are processors) of each SPEC CPU benchmark and converting the CPU time into a rate. This leads to a measurement called the SPECrate, and it is a measure of request-level parallelism from Section 1.2. To measure thread-level parallelism, SPEC offers what they call high-performance computing benchmarks around OpenMP and MPI as well as for accelerators such as GPUs (see Figure 1.18).

Other than SPECrate, most server applications and benchmarks have significant I/O activity arising from either storage or network traffic, including benchmarks for file server systems, for web servers, and for database and transaction-processing systems. SPEC offers both a file server benchmark (SPECSFS) and a Java server benchmark. (Appendix D discusses some file and I/O system benchmarks in detail.) SPECvirt_Sc2013 evaluates end-to-end performance of virtualized data center servers. Another SPEC benchmark measures power, which we examine in Section 1.10.

Transaction-processing (TP) benchmarks measure the ability of a system to handle transactions that consist of database accesses and updates. Airline reservation systems and bank ATM systems are typical simple examples of TP; more sophisticated TP systems involve complex databases and decision-making.


| Category | Name | Measures performance of |
| Cloud | Cloud_IaaS 2016 | Cloud using NoSQL database transaction and K-Means clustering using map/reduce |
| CPU | CPU2017 | Compute-intensive integer and floating-point workloads |
| Graphics and workstation performance | SPECviewperf 12 | 3D graphics in systems running OpenGL and Direct X |
|  | SPECwpc V2.0 | Workstations running professional apps under the Windows OS |
|  | SPECapc for 3ds Max 2015 | 3D graphics running the proprietary Autodesk 3ds Max 2015 app |
|  | SPECapc for Maya 2012 | 3D graphics running the proprietary Autodesk Maya 2012 app |
|  | SPECapc for PTC Creo 3.0 | 3D graphics running the proprietary PTC Creo 3.0 app |
|  | SPECapc for Siemens NX 9.0 and 10.0 | 3D graphics running the proprietary Siemens NX 9.0 or 10.0 app |
|  | SPECapc for SolidWorks 2015 | 3D graphics of systems running the proprietary SolidWorks 2015 CAD/CAM app |
| High performance computing | ACCEL | Accelerator and host CPU running parallel applications using OpenCL and OpenACC |
|  | MPI2007 | MPI-parallel, floating-point, compute-intensive programs running on clusters and SMPs |
|  | OMP2012 | Parallel apps running OpenMP |
| Java client/server | SPECjbb2015 | Java servers |
| Power | SPECpower_ssj2008 | Power of volume server class computers running SPECjbb2015 |
| Solution File Server (SFS) | SFS2014 | File server throughput and response time |
|  | SPECsfs2008 | File servers utilizing the NFSv3 and CIFS protocols |
| Virtualization | SPECvirt_sc2013 | Datacenter servers used in virtualized server consolidation |

Figure 1.18 Active benchmarks from SPEC as of 2017.

In the mid-1980s, a group of concerned engineers formed the vendor-independent Transaction Processing Council (TPC) to try to create realistic and fair benchmarks for TP. The TPC benchmarks are described at http://www.tpc.org. The first TPC benchmark, TPC-A, was published in 1985 and has since been replaced and enhanced by several different benchmarks. TPC-C, initially created in 1992, simulates a complex query environment. TPC-H models ad hoc decision support—the queries are unrelated and knowledge of past queries cannot be used to optimize future queries. The TPC-DI benchmark models data integration (DI), also known as ETL, an important part of data warehousing. TPC-E is an online transaction processing (OLTP) workload that simulates a brokerage firm’s customer accounts.


Recognizing the controversy between traditional relational databases and “No SQL” storage solutions, TPCx-HS measures systems using the Hadoop file system running MapReduce programs, and TPC-DS measures a decision support system that uses either a relational database or a Hadoop-based system. TPC-VMS and TPCx-V measure database performance for virtualized systems, and TPC-Energy adds energy metrics to all the existing TPC benchmarks. All the TPC benchmarks measure performance in transactions per second. In addition, they include a response time requirement so that throughput performance is measured only when the response time limit is met. To model real-world systems, higher transaction rates are also associated with larger systems, in terms of both users and the database to which the transactions are applied. Finally, the system cost for a benchmark system must be included as well to allow accurate comparisons of cost-performance. TPC modified its pricing policy so that there is a single specification for all the TPC benchmarks and to allow verification of the prices that TPC publishes.

Reporting Performance Results

The guiding principle of reporting performance measurements should be reproducibility—list everything another experimenter would need to duplicate the results. A SPEC benchmark report requires an extensive description of the computer and the compiler flags, as well as the publication of both the baseline and the optimized results. In addition to hardware, software, and baseline tuning parameter descriptions, a SPEC report contains the actual performance times, shown both in tabular form and as a graph. A TPC benchmark report is even more complete, because it must include results of a benchmarking audit and cost information. These reports are excellent sources for finding the real costs of computing systems, since manufacturers compete on high performance and cost-performance.

Summarizing Performance Results

In practical computer design, one must evaluate myriad design choices for their relative quantitative benefits across a suite of benchmarks believed to be relevant. Likewise, consumers trying to choose a computer will rely on performance measurements from benchmarks, which ideally are similar to the users’ applications. In both cases, it is useful to have measurements for a suite of benchmarks so that the performance of important applications is similar to that of one or more benchmarks in the suite and so that variability in performance can be understood. In the best case, the suite resembles a statistically valid sample of the application space, but such a sample requires more benchmarks than are typically found in most suites and requires a randomized sampling, which essentially no benchmark suite uses.


Once we have chosen to measure performance with a benchmark suite, we want to be able to summarize the performance results of the suite in a single number. A simple approach to computing a summary result would be to compare the arithmetic means of the execution times of the programs in the suite. An alternative would be to add a weighting factor to each benchmark and use the weighted arithmetic mean as the single number to summarize performance. One approach is to use weights that make all programs execute an equal time on some reference computer, but this biases the results toward the performance characteristics of the reference computer.

Rather than pick weights, we could normalize execution times to a reference computer by dividing the time on the reference computer by the time on the computer being rated, yielding a ratio proportional to performance. SPEC uses this approach, calling the ratio the SPECRatio. It has a particularly useful property that matches the way we benchmark computer performance throughout this text—namely, comparing performance ratios. For example, suppose that the SPECRatio of computer A on a benchmark is 1.25 times that of computer B; then we know

1.25 = SPECRatio_A / SPECRatio_B
     = (Execution time_reference / Execution time_A) / (Execution time_reference / Execution time_B)
     = Execution time_B / Execution time_A
     = Performance_A / Performance_B

Notice that the execution times on the reference computer drop out and the choice of the reference computer is irrelevant when the comparisons are made as a ratio, which is the approach we consistently use. Figure 1.19 gives an example. Because a SPECRatio is a ratio rather than an absolute execution time, the mean must be computed using the geometric mean. (Because SPECRatios have no units, comparing SPECRatios arithmetically is meaningless.) The formula is

Geometric mean = (∏_{i=1}^{n} sample_i)^(1/n)

In the case of SPEC, sample_i is the SPECRatio for program i. Using the geometric mean ensures two important properties:

1. The geometric mean of the ratios is the same as the ratio of the geometric means.

2. The ratio of the geometric means is equal to the geometric mean of the performance ratios, which implies that the choice of the reference computer is irrelevant.

Therefore the motivations to use the geometric mean are substantial, especially when we use performance ratios to make comparisons.

Example

Show that the ratio of the geometric means is equal to the geometric mean of the performance ratios and that the reference computer of SPECRatio does not matter.

Answer

Assume two computers A and B and a set of SPECRatios for each.

Geometric mean_A / Geometric mean_B
  = (∏_{i=1}^{n} SPECRatio A_i)^(1/n) / (∏_{i=1}^{n} SPECRatio B_i)^(1/n)
  = (∏_{i=1}^{n} SPECRatio A_i / SPECRatio B_i)^(1/n)
  = (∏_{i=1}^{n} (Execution time_reference_i / Execution time_A_i) / (Execution time_reference_i / Execution time_B_i))^(1/n)
  = (∏_{i=1}^{n} Execution time_B_i / Execution time_A_i)^(1/n)
  = (∏_{i=1}^{n} Performance_A_i / Performance_B_i)^(1/n)

That is, the ratio of the geometric means of the SPECRatios of A and B is the geometric mean of the performance ratios of A to B of all the benchmarks in the suite. Figure 1.19 demonstrates this validity using examples from SPEC.
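To see the property concretely, here is a minimal Python sketch (not from the text); the reference, A, and B times are the perlbench and bzip2 rows of Figure 1.19, used only as sample numbers, and the helper names are my own.

    from math import prod

    def spec_ratios(ref_times, machine_times):
        # SPECRatio for each benchmark: reference time / measured time.
        return [r / t for r, t in zip(ref_times, machine_times)]

    def geometric_mean(values):
        # n-th root of the product of the values.
        return prod(values) ** (1.0 / len(values))

    # Execution times (seconds) on the reference machine and on computers A and B.
    ref = [9770, 9650]
    a = [261, 422]   # computer A (the Intel column of Figure 1.19)
    b = [401, 505]   # computer B (the AMD column of Figure 1.19)

    gm_a = geometric_mean(spec_ratios(ref, a))
    gm_b = geometric_mean(spec_ratios(ref, b))

    # The reference times cancel: the ratio of geometric means equals the
    # geometric mean of the performance (time) ratios.
    print(gm_a / gm_b)                                         # ~1.36
    print(geometric_mean([tb / ta for ta, tb in zip(a, b)]))   # ~1.36

Swapping in any other reference-time list leaves the two printed values unchanged, which is the reference-independence argued above.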

Benchmarks | Sun Ultra Enterprise 2 time (seconds) | AMD A10-6800K time (seconds) | SPEC 2006Cint ratio | Intel Xeon E5-2690 time (seconds) | SPEC 2006Cint ratio | AMD/Intel times (seconds) | Intel/AMD SPEC ratios
perlbench | 9770 | 401 | 24.36 | 261 | 37.43 | 1.54 | 1.54
bzip2 | 9650 | 505 | 19.11 | 422 | 22.87 | 1.20 | 1.20
gcc | 8050 | 490 | 16.43 | 227 | 35.46 | 2.16 | 2.16
mcf | 9120 | 249 | 36.63 | 153 | 59.61 | 1.63 | 1.63
gobmk | 10,490 | 418 | 25.10 | 382 | 27.46 | 1.09 | 1.09
hmmer | 9330 | 182 | 51.26 | 120 | 77.75 | 1.52 | 1.52
sjeng | 12,100 | 517 | 23.40 | 383 | 31.59 | 1.35 | 1.35
libquantum | 20,720 | 84 | 246.08 | 3 | 7295.77 | 29.65 | 29.65
h264ref | 22,130 | 611 | 36.22 | 425 | 52.07 | 1.44 | 1.44
omnetpp | 6250 | 313 | 19.97 | 153 | 40.85 | 2.05 | 2.05
astar | 7020 | 303 | 23.17 | 209 | 33.59 | 1.45 | 1.45
xalancbmk | 6900 | 215 | 32.09 | 98 | 70.41 | 2.19 | 2.19
Geometric mean | — | — | 31.91 | — | 63.72 | 2.00 | 2.00

Figure 1.19 SPEC2006Cint execution times (in seconds) for the Sun Ultra 5—the reference computer of SPEC2006—and execution times and SPECRatios for the AMD A10 and Intel Xeon E5-2690. The final two columns show the ratios of execution times and SPEC ratios. This figure demonstrates the irrelevance of the reference computer in relative performance. The ratio of the execution times is identical to the ratio of the SPEC ratios, and the ratio of the geometric means (63.72/31.91 = 2.00) is identical to the geometric mean of the ratios (2.00). Section 1.11 discusses libquantum, whose performance is orders of magnitude higher than the other SPEC benchmarks.


1.9 Quantitative Principles of Computer Design

Now that we have seen how to define, measure, and summarize performance, cost, dependability, energy, and power, we can explore guidelines and principles that are useful in the design and analysis of computers. This section introduces important observations about design, as well as two equations to evaluate alternatives.

Take Advantage of Parallelism

Using parallelism is one of the most important methods for improving performance. Every chapter in this book has an example of how performance is enhanced through the exploitation of parallelism. We give three brief examples here, which are expounded on in later chapters.

Our first example is the use of parallelism at the system level. To improve the throughput performance on a typical server benchmark, such as SPECSFS or TPC-C, multiple processors and multiple storage devices can be used. The workload of handling requests can then be spread among the processors and storage devices, resulting in improved throughput. Being able to expand memory and the number of processors and storage devices is called scalability, and it is a valuable asset for servers. Spreading data across many storage devices for parallel reads and writes enables data-level parallelism. SPECSFS also relies on request-level parallelism to use many processors, whereas TPC-C uses thread-level parallelism for faster processing of database queries.

At the level of an individual processor, taking advantage of parallelism among instructions is critical to achieving high performance. One of the simplest ways to do this is through pipelining. (Pipelining is explained in more detail in Appendix C and is a major focus of Chapter 3.) The basic idea behind pipelining is to overlap instruction execution to reduce the total time to complete an instruction sequence. A key insight into pipelining is that not every instruction depends on its immediate predecessor, so executing the instructions completely or partially in parallel may be possible. Pipelining is the best-known example of ILP.

Parallelism can also be exploited at the level of detailed digital design. For example, set-associative caches use multiple banks of memory that are typically searched in parallel to find a desired item. Arithmetic-logical units use carry-lookahead, which uses parallelism to speed the process of computing sums from linear to logarithmic in the number of bits per operand. These are more examples of data-level parallelism.

Principle of Locality

Important fundamental observations have come from properties of programs. The most important program property that we regularly exploit is the principle of locality: programs tend to reuse data and instructions they have used recently. A widely held rule of thumb is that a program spends 90% of its execution time in only 10% of the code. An implication of locality is that we can predict with reasonable


accuracy what instructions and data a program will use in the near future based on its accesses in the recent past. The principle of locality also applies to data accesses, though not as strongly as to code accesses. Two different types of locality have been observed. Temporal locality states that recently accessed items are likely to be accessed soon. Spatial locality says that items whose addresses are near one another tend to be referenced close together in time. We will see these principles applied in Chapter 2.

Focus on the Common Case

Perhaps the most important and pervasive principle of computer design is to focus on the common case: in making a design trade-off, favor the frequent case over the infrequent case. This principle applies when determining how to spend resources, because the impact of the improvement is higher if the occurrence is commonplace.

Focusing on the common case works for energy as well as for resource allocation and performance. The instruction fetch and decode unit of a processor may be used much more frequently than a multiplier, so optimize it first. It works on dependability as well. If a database server has 50 storage devices for every processor, storage dependability will dominate system dependability.

In addition, the common case is often simpler and can be done faster than the infrequent case. For example, when adding two numbers in the processor, we can expect overflow to be a rare circumstance and can therefore improve performance by optimizing the more common case of no overflow. This emphasis may slow down the case when overflow occurs, but if that is rare, then overall performance will be improved by optimizing for the normal case. We will see many cases of this principle throughout this text. In applying this simple principle, we have to decide what the frequent case is and how much performance can be improved by making that case faster. A fundamental law, called Amdahl’s Law, can be used to quantify this principle.

Amdahl’s Law

The performance gain that can be obtained by improving some portion of a computer can be calculated using Amdahl’s Law. Amdahl’s Law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.

Amdahl’s Law defines the speedup that can be gained by using a particular feature. What is speedup? Suppose that we can make an enhancement to a computer that will improve performance when it is used. Speedup is the ratio

Speedup = Performance for entire task using the enhancement when possible / Performance for entire task without using the enhancement

Alternatively,

Speedup = Execution time for entire task without using the enhancement / Execution time for entire task using the enhancement when possible


Speedup tells us how much faster a task will run using the computer with the enhancement as opposed to the original computer. Amdahl’s Law gives us a quick way to find the speedup from some enhancement, which depends on two factors:

1. The fraction of the computation time in the original computer that can be converted to take advantage of the enhancement—For example, if 40 seconds of the execution time of a program that takes 100 seconds in total can use an enhancement, the fraction is 40/100. This value, which we call Fraction_enhanced, is always less than or equal to 1.

2. The improvement gained by the enhanced execution mode, that is, how much faster the task would run if the enhanced mode were used for the entire program—This value is the time of the original mode over the time of the enhanced mode. If the enhanced mode takes, say, 4 seconds for a portion of the program, while it is 40 seconds in the original mode, the improvement is 40/4 or 10. We call this value, which is always greater than 1, Speedup_enhanced.

The execution time using the original computer with the enhanced mode will be the time spent using the unenhanced portion of the computer plus the time spent using the enhancement:

Execution time_new = Execution time_old × ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

The overall speedup is the ratio of the execution times:

Speedup_overall = Execution time_old / Execution time_new = 1 / ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

Example

Suppose that we want to enhance the processor used for web serving. The new processor is 10 times faster on computation in the web serving application than the old processor. Assuming that the original processor is busy with computation 40% of the time and is waiting for I/O 60% of the time, what is the overall speedup gained by incorporating the enhancement?

Answer

Fraction_enhanced = 0.4; Speedup_enhanced = 10;

Speedup_overall = 1 / (0.6 + 0.4/10) = 1/0.64 ≈ 1.56

Amdahl’s Law expresses the law of diminishing returns: The incremental improvement in speedup gained by an improvement of just a portion of the computation diminishes as improvements are added. An important corollary of Amdahl’s Law is that if an enhancement is usable only for a fraction of a task, then we can’t speed up the task by more than the reciprocal of 1 minus that fraction.
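As a quick check of the arithmetic, here is a minimal Python sketch of Amdahl’s Law (not from the text; the function name is my own). The 0.4 fraction and 10× speedup reproduce the web-serving example above, and the second call illustrates the corollary that the speedup is bounded by the reciprocal of 1 minus the enhanced fraction.

    def amdahl_speedup(fraction_enhanced, speedup_enhanced):
        # Overall speedup from Amdahl's Law.
        # fraction_enhanced: fraction of original execution time that can use
        #   the enhancement (between 0 and 1).
        # speedup_enhanced: how much faster that portion runs when enhanced.
        return 1.0 / ((1.0 - fraction_enhanced)
                      + fraction_enhanced / speedup_enhanced)

    print(amdahl_speedup(0.4, 10))    # web-serving example: ~1.56
    print(amdahl_speedup(0.4, 1e9))   # corollary: bounded by 1/(1 - 0.4), ~1.67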


A common mistake in applying Amdahl’s Law is to confuse “fraction of time converted to use an enhancement” and “fraction of time after enhancement is in use.” If, instead of measuring the time that we could use the enhancement in a computation, we measure the time after the enhancement is in use, the results will be incorrect! Amdahl’s Law can serve as a guide to how much an enhancement will improve performance and how to distribute resources to improve cost-performance. The goal, clearly, is to spend resources proportional to where time is spent. Amdahl’s Law is particularly useful for comparing the overall system performance of two alternatives, but it can also be applied to compare two processor design alternatives, as the following example shows.

Example

A common transformation required in graphics processors is square root. Implementations of floating-point (FP) square root vary significantly in performance, especially among processors designed for graphics. Suppose FP square root (FSQRT) is responsible for 20% of the execution time of a critical graphics benchmark. One proposal is to enhance the FSQRT hardware and speed up this operation by a factor of 10. The other alternative is just to try to make all FP instructions in the graphics processor run faster by a factor of 1.6; FP instructions are responsible for half of the execution time for the application. The design team believes that they can make all FP instructions run 1.6 times faster with the same effort as required for the fast square root. Compare these two design alternatives.

Answer

We can compare these two alternatives by comparing the speedups:

Speedup_FSQRT = 1 / ((1 − 0.2) + 0.2/10) = 1/0.82 ≈ 1.22

Speedup_FP = 1 / ((1 − 0.5) + 0.5/1.6) = 1/0.8125 ≈ 1.23

Improving the performance of the FP operations overall is slightly better because of the higher frequency.

Amdahl’s Law is applicable beyond performance. Let’s redo the reliability example from page 39 after improving the reliability of the power supply via redundancy from a 200,000-hour to an 830,000,000-hour MTTF, or 4150 times better.

Example

The calculation of the failure rates of the disk subsystem was

Failure rate_system = 10 × 1/1,000,000 + 1/500,000 + 1/200,000 + 1/200,000 + 1/1,000,000
                    = (10 + 2 + 5 + 5 + 1) / 1,000,000 hours
                    = 23 / 1,000,000 hours


Therefore the fraction of the failure rate that could be improved is 5 per million hours out of 23 for the whole system, or 0.22.

Answer

The reliability improvement would be

Improvement_power supply pair = 1 / ((1 − 0.22) + 0.22/4150) = 1/0.78 ≈ 1.28

Despite an impressive 4150× improvement in reliability of one module, from the system’s perspective, the change has a measurable but small benefit.

In the preceding examples, we needed the fraction consumed by the new and improved version; often it is difficult to measure these times directly. In the next section, we will see another way of doing such comparisons based on the use of an equation that decomposes the CPU execution time into three separate components. If we know how an alternative affects these three components, we can determine its overall performance. Furthermore, it is often possible to build simulators that measure these components before the hardware is actually designed.

The Processor Performance Equation

Essentially all computers are constructed using a clock running at a constant rate. These discrete time events are called clock periods, clocks, cycles, or clock cycles. Computer designers refer to the time of a clock period by its duration (e.g., 1 ns) or by its rate (e.g., 1 GHz). CPU time for a program can then be expressed two ways:

CPU time = CPU clock cycles for a program × Clock cycle time

or

CPU time = CPU clock cycles for a program / Clock rate

In addition to the number of clock cycles needed to execute a program, we can also count the number of instructions executed—the instruction path length or instruction count (IC). If we know the number of clock cycles and the instruction count, we can calculate the average number of clock cycles per instruction (CPI). Because it is easier to work with, and because we will deal with simple processors in this chapter, we use CPI. Designers sometimes also use instructions per clock (IPC), which is the inverse of CPI. CPI is computed as

CPI = CPU clock cycles for a program / Instruction count

This processor figure of merit provides insight into different styles of instruction sets and implementations, and we will use it extensively in the next four chapters.


By transposing the instruction count in the preceding formula, clock cycles can be defined as IC × CPI. This allows us to use CPI in the execution time formula:

CPU time = Instruction count × Cycles per instruction × Clock cycle time

Expanding the first formula into the units of measurement shows how the pieces fit together:

(Instructions / Program) × (Clock cycles / Instruction) × (Seconds / Clock cycle) = Seconds / Program = CPU time
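As a quick illustrative calculation (the numbers here are hypothetical, not from the text): a program that executes 2 × 10^9 instructions with an average CPI of 1.5 on a 2 GHz clock (0.5 ns clock cycle) takes CPU time = 2 × 10^9 × 1.5 × 0.5 ns = 1.5 seconds, and halving any one of the three factors halves the CPU time.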

As this formula demonstrates, processor performance is dependent upon three characteristics: clock cycle (or rate), clock cycles per instruction, and instruction count. Furthermore, CPU time is equally dependent on these three characteristics; for example, a 10% improvement in any one of them leads to a 10% improvement in CPU time. Unfortunately, it is difficult to change one parameter in complete isolation from others because the basic technologies involved in changing each characteristic are interdependent:

■ Clock cycle time—Hardware technology and organization

■ CPI—Organization and instruction set architecture

■ Instruction count—Instruction set architecture and compiler technology

Luckily, many potential performance improvement techniques primarily enhance one component of processor performance with small or predictable impacts on the other two. In designing the processor, sometimes it is useful to calculate the number of total processor clock cycles as

CPU clock cycles = Σ (i = 1 to n) IC_i × CPI_i

where IC_i represents the number of times instruction i is executed in a program and CPI_i represents the average number of clocks per instruction for instruction i. This form can be used to express CPU time as

CPU time = (Σ (i = 1 to n) IC_i × CPI_i) × Clock cycle time

and overall CPI as

CPI = (Σ (i = 1 to n) IC_i × CPI_i) / Instruction count = Σ (i = 1 to n) (IC_i / Instruction count) × CPI_i

The latter form of the CPI calculation uses each individual CPI_i and the fraction of occurrences of that instruction in a program (i.e., IC_i / Instruction count). Because it must include pipeline effects, cache misses, and any other memory system


inefficiencies, CPI_i should be measured and not just calculated from a table in the back of a reference manual. Consider our performance example on page 52, here modified to use measurements of the frequency of the instructions and of the instruction CPI values, which, in practice, are obtained by simulation or by hardware instrumentation.

Suppose we made the following measurements:

Frequency of FP operations = 25%
Average CPI of FP operations = 4.0
Average CPI of other instructions = 1.33
Frequency of FSQRT = 2%
CPI of FSQRT = 20

Assume that the two design alternatives are to decrease the CPI of FSQRT to 2 or to decrease the average CPI of all FP operations to 2.5. Compare these two design alternatives using the processor performance equation.

Answer

First, observe that only the CPI changes; the clock rate and instruction count remain identical. We start by finding the original CPI with neither enhancement:

CPI_original = Σ (i = 1 to n) (IC_i / Instruction count) × CPI_i = (4 × 25%) + (1.33 × 75%) = 2.0

We can compute the CPI for the enhanced FSQRT by subtracting the cycles saved from the original CPI:

CPI_with new FSQRT = CPI_original − 2% × (CPI_old FSQRT − CPI_of new FSQRT only) = 2.0 − 2% × (20 − 2) = 1.64

We can compute the CPI for the enhancement of all FP instructions the same way or by summing the FP and non-FP CPIs. Using the latter gives us

CPI_new FP = (75% × 1.33) + (25% × 2.5) = 1.625

Since the CPI of the overall FP enhancement is slightly lower, its performance will be marginally better. Specifically, the speedup for the overall FP enhancement is

Speedup_new FP = CPU time_original / CPU time_new FP = (IC × Clock cycle × CPI_original) / (IC × Clock cycle × CPI_new FP)
              = CPI_original / CPI_new FP = 2.00 / 1.625 = 1.23

Happily, we obtained this same speedup using Amdahl’s Law on page 51.
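The same bookkeeping is easy to script. Here is a minimal Python sketch (not from the text; the helper name is my own) that reproduces the CPIs and speedups computed above from the instruction-class frequencies and CPIs given in the example.

    def overall_cpi(mix):
        # Weighted CPI from (frequency, CPI) pairs, where each frequency is the
        # fraction of the instruction count contributed by that class.
        return sum(freq * cpi for freq, cpi in mix)

    # Instruction mix from the example: 25% FP operations (CPI 4.0) and
    # 75% other instructions (CPI 1.33).
    cpi_original = overall_cpi([(0.25, 4.0), (0.75, 1.33)])   # ~2.0

    # Alternative 1: FSQRT is 2% of instructions and its CPI drops from 20 to 2.
    cpi_new_fsqrt = cpi_original - 0.02 * (20 - 2)            # ~1.64

    # Alternative 2: all FP operations drop to an average CPI of 2.5.
    cpi_new_fp = overall_cpi([(0.25, 2.5), (0.75, 1.33)])     # ~1.62

    # Instruction count and clock cycle are unchanged, so speedup is the CPI ratio.
    print(cpi_original / cpi_new_fsqrt)   # ~1.22
    print(cpi_original / cpi_new_fp)      # ~1.23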


It is often possible to measure the constituent parts of the processor performance equation. Such isolated measurements are a key advantage of using the processor performance equation versus Amdahl’s Law in the previous example. In particular, it may be difficult to measure things such as the fraction of execution time for which a set of instructions is responsible. In practice, this would probably be computed by summing the product of the instruction count and the CPI for each of the instructions in the set. Since the starting point is often individual instruction count and CPI measurements, the processor performance equation is incredibly useful.

To use the processor performance equation as a design tool, we need to be able to measure the various factors. For an existing processor, it is easy to obtain the execution time by measurement, and we know the default clock speed. The challenge lies in discovering the instruction count or the CPI. Most processors include counters for both instructions executed and clock cycles. By periodically monitoring these counters, it is also possible to attach execution time and instruction count to segments of the code, which can be helpful to programmers trying to understand and tune the performance of an application. Often designers or programmers will want to understand performance at a more fine-grained level than what is available from the hardware counters. For example, they may want to know why the CPI is what it is. In such cases, the simulation techniques used are like those for processors that are being designed.

Techniques that help with energy efficiency, such as dynamic voltage frequency scaling and overclocking (see Section 1.5), make this equation harder to use, because the clock speed may vary while we measure the program. A simple approach is to turn off those features to make the results reproducible. Fortunately, as performance and energy efficiency are often highly correlated—taking less time to run a program generally saves energy—it’s probably safe to consider performance without worrying about the impact of DVFS or overclocking on the results.

1.10 Putting It All Together: Performance, Price, and Power

In the “Putting It All Together” sections that appear near the end of every chapter, we provide real examples that use the principles in that chapter. In this section, we look at measures of performance and power-performance in small servers using the SPECpower benchmark. Figure 1.20 shows the three multiprocessor servers we are evaluating along with their price. To keep the price comparison fair, all are Dell PowerEdge servers. The first is the PowerEdge R710, which is based on the Intel Xeon X5670 microprocessor with a clock rate of 2.93 GHz. Unlike the Intel Core i7-6700 in Chapters 2–5, which has 20 cores and a 40 MB L3 cache, this Intel chip has 22 cores and a 55 MB L3 cache, although the cores themselves are identical. We selected a two-socket system—so 44 cores total—with 128 GB of ECC-protected 2400 MHz DDR4 DRAM. The next server is the PowerEdge C630, with the same processor, number of sockets, and DRAM. The main difference is a smaller rack-mountable package: “2U” high (3.5 inches) for the 730 versus “1U” (1.75 inches) for the 630.


Component | System 1 | Cost (% Cost) | System 2 | Cost (% Cost) | System 3 | Cost (% Cost)
Base server | PowerEdge R710 | $653 (7%) | PowerEdge R815 | $1437 (15%) | PowerEdge R815 | $1437 (11%)
Power supply | 570 W | — | 1100 W | — | 1100 W | —
Processor | Xeon X5670 | $3738 (40%) | Opteron 6174 | $2679 (29%) | Opteron 6174 | $5358 (42%)
Clock rate | 2.93 GHz | — | 2.20 GHz | — | 2.20 GHz | —
Total cores | 12 | — | 24 | — | 48 | —
Sockets | 2 | — | 2 | — | 4 | —
Cores/socket | 6 | — | 12 | — | 12 | —
DRAM | 12 GB | $484 (5%) | 16 GB | $693 (7%) | 32 GB | $1386 (11%)
Ethernet Inter. | Dual 1-Gbit | $199 (2%) | Dual 1-Gbit | $199 (2%) | Dual 1-Gbit | $199 (2%)
Disk | 50 GB SSD | $1279 (14%) | 50 GB SSD | $1279 (14%) | 50 GB SSD | $1279 (10%)
Windows OS | — | $2999 (32%) | — | $2999 (33%) | — | $2999 (24%)
Total | — | $9352 (100%) | — | $9286 (100%) | — | $12,658 (100%)
Max ssj_ops | 910,978 | — | 926,676 | — | 1,840,450 | —
Max ssj_ops/$ | 97 | — | 100 | — | 145 | —

Figure 1.20 Three Dell PowerEdge servers being measured and their prices as of July 2016. We calculated the cost of the processors by subtracting the cost of a second processor. Similarly, we calculated the overall cost of memory by seeing what the cost of extra memory was. Hence the base cost of the server is adjusted by removing the estimated cost of the default processor and memory. Chapter 5 describes how these multisocket systems are connected together, and Chapter 6 describes how clusters are connected together.

The third server is a cluster of 16 of the PowerEdge 630s that is connected together with a 1 Gbit/s Ethernet switch. All are running the Oracle Java HotSpot version 1.7 Java Virtual Machine (JVM) and the Microsoft Windows Server 2012 R2 Datacenter version 6.3 operating system. Note that because of the forces of benchmarking (see Section 1.11), these are unusually configured servers. The systems in Figure 1.20 have little memory relative to the amount of computation, and just a tiny 120 GB solid-state disk. It is inexpensive to add cores if you don’t need to add commensurate increases in memory and storage!

Rather than run statically linked C programs of SPEC CPU, SPECpower uses a more modern software stack written in Java. It is based on SPECjbb, and it represents the server side of business applications, with performance measured as the number of transactions per second, called ssj_ops for server side Java operations per second. It exercises not only the processor of the server, as does SPEC CPU, but also the caches, memory system, and even the multiprocessor interconnection system. In addition, it exercises the JVM, including the JIT runtime compiler and garbage collector, as well as portions of the underlying operating system.

As the last two rows of Figure 1.20 show, the performance winner is the cluster of 16 R630s, which is hardly a surprise since it is by far the most expensive. The price-performance winner is the PowerEdge R630, but it barely beats the cluster at 213 versus 211 ssj_ops/$. Amazingly, the 16-node cluster is within 1% of the same price-performance as a single node despite being 16 times as large.


While most benchmarks (and most computer architects) care only about performance of systems at peak load, computers rarely run at peak load. Indeed, Figure 6.2 in Chapter 6 shows the results of measuring the utilization of tens of thousands of servers over 6 months at Google, and less than 1% operate at an average utilization of 100%. The majority have an average utilization of between 10% and 50%. Thus the SPECpower benchmark captures power as the target workload varies from its peak in 10% intervals all the way to 0%, which is called Active Idle.

Figure 1.21 plots the ssj_ops (SSJ operations/second) per watt and the average power as the target load varies from 100% to 0%. The Intel R730 always has the lowest power and the single node R630 has the best ssj_ops per watt across each target workload level. Since watts = joules/second, this metric is proportional to SSJ operations per joule:

ssj_ops/watt = (ssj_operations/second) / Watt = (ssj_operations/second) / (Joule/second) = ssj_operations/Joule


Figure 1.21 Power-performance of the three servers in Figure 1.20. Ssj_ops/watt values are on the left axis, with the three columns associated with it, and watts are on the right axis, with the three lines associated with it. The horizontal axis shows the target workload, as it varies from 100% to Active Idle. The single node R630 has the best ssj_ops/watt at each workload level, but R730 consumes the lowest power at each level.


To calculate a single number to use to compare the power efficiency of systems, SPECpower uses

Overall ssj_ops/watt = (Σ ssj_ops) / (Σ power)

The overall ssj_ops/watt of the three servers is 10,802 for the R730, 11,157 for the R630, and 10,062 for the cluster of 16 R630s. Therefore the single node R630 has the best power-performance. Dividing by the price of the servers, the ssj_ops/watt/ $1,000 is 879 for the R730, 899 for the R630, and 789 (per node) for the 16-node cluster of R630s. Thus, after adding power, the single-node R630 is still in first place in performance/price, but now the single-node R730 is significantly more efficient than the 16-node cluster.
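For readers who want to reproduce this kind of summary, here is a minimal Python sketch (not from the text); the per-level measurements and the server price in it are made-up placeholders, not the published R630/R730 data.

    def overall_ssj_ops_per_watt(ssj_ops_by_level, watts_by_level):
        # SPECpower-style summary: total ssj_ops summed across the target load
        # levels (including active idle) divided by total power across the
        # same levels.
        return sum(ssj_ops_by_level) / sum(watts_by_level)

    # Hypothetical measurements at three load levels (100%, 50%, active idle).
    ssj_ops = [1_000_000, 500_000, 0]
    watts = [300, 200, 150]

    ops_per_watt = overall_ssj_ops_per_watt(ssj_ops, watts)   # ~2308
    server_price = 12_000                                     # hypothetical price ($)
    print(ops_per_watt / (server_price / 1000))               # ssj_ops/watt per $1,000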

1.11 Fallacies and Pitfalls

The purpose of this section, which will be found in every chapter, is to explain some commonly held misbeliefs or misconceptions that you should avoid. We call such misbeliefs fallacies. When discussing a fallacy, we try to give a counterexample. We also discuss pitfalls—easily made mistakes. Often pitfalls are generalizations of principles that are true in a limited context. The purpose of these sections is to help you avoid making these errors in computers that you design.

Pitfall

All exponential laws must come to an end. The first to go was Dennard scaling. Dennard’s 1974 observation was that power density was constant as transistors got smaller. If a transistor’s linear region shrank by a factor of 2, then both the current and voltage were also reduced by a factor of 2, and so the power it used fell by a factor of 4. Thus chips could be designed to operate faster and still use less power. Dennard scaling ended 30 years after it was observed, not because transistors didn’t continue to get smaller but because integrated circuit dependability limited how far current and voltage could drop. The threshold voltage was driven so low that static power became a significant fraction of overall power.

The next deceleration was hard disk drives. Although there was no law for disks, in the past 30 years the maximum areal density of hard drives—which determines disk capacity—improved by 30%–100% per year. In more recent years, it has been less than 5% per year. Increasing density per drive has come primarily from adding more platters to a hard disk drive.

Next up was the venerable Moore’s Law. It’s been a while since the number of transistors per chip doubled every one to two years. For example, the DRAM chip introduced in 2014 contained 8B transistors, and we won’t have a 16B transistor DRAM chip in mass production until 2019, but Moore’s Law predicts a 64B transistor DRAM chip. Moreover, the scaling of the planar logic transistor was even predicted to end by 2021. Figure 1.22 shows the predictions of the physical gate length



Figure 1.22 Predictions of logic transistor dimensions from two editions of the ITRS report. These reports started in 2001, but 2015 will be the last edition, as the group has disbanded because of waning interest. The only companies that can produce state-of-the-art logic chips today are GlobalFoundries, Intel, Samsung, and TSMC, whereas there were 19 when the first ITRS report was released. With only four companies left, sharing of plans was too hard to sustain. From IEEE Spectrum, July 2016, “Transistors will stop shrinking in 2021, Moore’s Law Roadmap Predicts,” by Rachel Courtland.

of the logic transistor from two editions of the International Technology Roadmap for Semiconductors (ITRS). Unlike the 2013 report that projected gate lengths to reach 5 nm by 2028, the 2015 report projects the length stopping at 10 nm by 2021. Density improvements thereafter would have to come from ways other than shrinking the dimensions of transistors. It’s not as dire as the ITRS suggests, as companies like Intel and TSMC have plans to shrink to 3 nm gate lengths, but the rate of change is decreasing. Figure 1.23 shows the changes in increases in bandwidth over time for microprocessors and DRAM—which are affected by the end of Dennard scaling and Moore’s Law—as well as for disks. The slowing of technology improvements is apparent in the dropping curves. The continued networking improvement is due to advances in fiber optics and a planned change in pulse amplitude modulation (PAM-4) allowing two-bit encoding so as to transmit information at 400 Gbit/s.



Figure 1.23 Relative bandwidth for microprocessors, networks, memory, and disks over time, based on data in Figure 1.10.

Fallacy

Multiprocessors are a silver bullet. The switch to multiple processors per chip around 2005 did not come from some breakthrough that dramatically simplified parallel programming or made it easy to build multicore computers. The change occurred because there was no other option due to the ILP walls and power walls. Multiple processors per chip do not guarantee lower power; it’s certainly feasible to design a multicore chip that uses more power. The potential is just that it’s possible to continue to improve performance by replacing a high-clock-rate, inefficient core with several lower-clock-rate, efficient cores. As technology to shrink transistors improves, it can shrink both capacitance and the supply voltage a bit so that we can get a modest increase in the


number of cores per generation. For example, for the past few years, Intel has been adding two cores per generation in their higher-end chips.

As we will see in Chapters 4 and 5, performance is now a programmer’s burden. The programmers’ La-Z-Boy era of relying on a hardware designer to make their programs go faster without lifting a finger is officially over. If programmers want their programs to go faster with each generation, they must make their programs more parallel. The popular version of Moore’s Law—increasing performance with each generation of technology—is now up to programmers.

Pitfall

Falling prey to Amdahl’s heartbreaking law. Virtually every practicing computer architect knows Amdahl’s Law. Despite this, we almost all occasionally expend tremendous effort optimizing some feature before we measure its usage. Only when the overall speedup is disappointing do we recall that we should have measured first before we spent so much effort enhancing it!

Pitfall

A single point of failure. The calculations of reliability improvement using Amdahl’s Law on page 53 show that dependability is no stronger than the weakest link in a chain. No matter how much more dependable we make the power supplies, as we did in our example, the single fan will limit the reliability of the disk subsystem. This Amdahl’s Law observation led to a rule of thumb for fault-tolerant systems to make sure that every component was redundant so that no single component failure could bring down the whole system. Chapter 6 shows how a software layer avoids single points of failure inside WSCs.

Fallacy

Hardware enhancements that increase performance also improve energy efficiency, or are at worst energy neutral. Esmaeilzadeh et al. (2011) measured SPEC2006 on just one core of a 2.67 GHz Intel Core i7 using Turbo mode (Section 1.5). Performance increased by a factor of 1.07 when the clock rate increased to 2.94 GHz (or a factor of 1.10), but the i7 used a factor of 1.37 more joules and a factor of 1.47 more watt hours!

Fallacy

Benchmarks remain valid indefinitely. Several factors influence the usefulness of a benchmark as a predictor of real performance, and some change over time. A big factor influencing the usefulness of a benchmark is its ability to resist “benchmark engineering” or “benchmarketing.” Once a benchmark becomes standardized and popular, there is tremendous pressure to improve performance by targeted optimizations or by aggressive interpretation of the rules for running the benchmark. Short kernels or programs that spend their time in a small amount of code are particularly vulnerable. For example, despite the best intentions, the initial SPEC89 benchmark suite included a small kernel, called matrix300, which consisted of eight different 300 × 300 matrix multiplications. In this kernel, 99% of the execution time was in a single line (see SPEC, 1989). When an IBM compiler optimized this inner loop


(using a good idea called blocking, discussed in Chapters 2 and 4), performance improved by a factor of 9 over a prior version of the compiler! This benchmark tested compiler tuning and was not, of course, a good indication of overall performance, nor of the typical value of this particular optimization.

Figure 1.19 shows that if we ignore history, we may be forced to repeat it. SPEC Cint2006 had not been updated for a decade, giving compiler writers substantial time to hone their optimizers to this suite. Note that the SPEC ratios of all benchmarks but libquantum fall within the range of 16–52 for the AMD computer and from 22 to 78 for Intel. Libquantum runs about 250 times faster than the reference computer on AMD and 7300 times faster on Intel! This “miracle” is a result of optimizations by the Intel compiler that automatically parallelizes the code across 22 cores and optimizes memory by using bit packing, which packs together multiple narrow-range integers to save memory space and thus memory bandwidth. If we drop this benchmark and recalculate the geometric means, AMD SPEC Cint2006 falls from 31.9 to 26.5 and Intel from 63.7 to 41.4. The Intel computer is now about 1.5 times as fast as the AMD computer instead of 2.0 if we include libquantum, which is surely closer to their real relative performance. SPEC CPU2017 dropped libquantum.

To illustrate the short lives of benchmarks, Figure 1.17 on page 43 lists the status of all 82 benchmarks from the various SPEC releases; Gcc is the lone survivor from SPEC89. Amazingly, about 70% of all programs from SPEC2000 or earlier were dropped from the next release.

Fallacy

The rated mean time to failure of disks is 1,200,000 hours or almost 140 years, so disks practically never fail. The current marketing practices of disk manufacturers can mislead users. How is such an MTTF calculated? Early in the process, manufacturers will put thousands of disks in a room, run them for a few months, and count the number that fail. They compute MTTF as the total number of hours that the disks worked cumulatively divided by the number that failed.

One problem is that this number far exceeds the lifetime of a disk, which is commonly assumed to be five years or 43,800 hours. For this large MTTF to make some sense, disk manufacturers argue that the model corresponds to a user who buys a disk and then keeps replacing the disk every 5 years—the planned lifetime of the disk. The claim is that if many customers (and their great-grandchildren) did this for the next century, on average they would replace a disk 27 times before a failure, or about 140 years.

A more useful measure is the percentage of disks that fail, which is called the annual failure rate. Assume 1000 disks with a 1,000,000-hour MTTF and that the disks are used 24 hours a day. If you replaced failed disks with a new one having the same reliability characteristics, the number that would fail in a year (8760 hours) is

Failed disks = (Number of disks × Time period) / MTTF = (1000 disks × 8760 hours/drive) / (1,000,000 hours/failure) = 9

Stated alternatively, 0.9% would fail per year, or 4.4% over a 5-year lifetime.
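The annual-failure-rate arithmetic is a one-liner; the following minimal Python sketch (not from the text; the helper name is my own) reproduces the numbers above for 1000 disks with a 1,000,000-hour MTTF.

    def expected_failures(num_devices, hours, mttf_hours):
        # Expected failures when each failed device is replaced by an identical
        # one: (number of devices x hours of operation) / MTTF.
        return num_devices * hours / mttf_hours

    HOURS_PER_YEAR = 8760
    failures_per_year = expected_failures(1000, HOURS_PER_YEAR, 1_000_000)
    annual_failure_rate = failures_per_year / 1000

    print(failures_per_year)         # ~9 disks fail per year
    print(annual_failure_rate)       # ~0.9% per year
    print(5 * annual_failure_rate)   # ~4.4% over a 5-year lifetime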


Moreover, those high numbers are quoted assuming limited ranges of temperature and vibration; if they are exceeded, then all bets are off. A survey of disk drives in real environments (Gray and van Ingen, 2005) found that 3%–7% of drives failed per year, for an MTTF of about 125,000–300,000 hours. An even larger study found annual disk failure rates of 2%–10% (Pinheiro et al., 2007). Therefore the real-world MTTF is about 2–10 times worse than the manufacturer’s MTTF.

Fallacy

Peak performance tracks observed performance. The only universally true definition of peak performance is “the performance level a computer is guaranteed not to exceed.” Figure 1.24 shows the percentage of peak performance for four programs on four multiprocessors. It varies from 5% to 58%. Since the gap is so large and can vary significantly by benchmark, peak performance is not generally useful in predicting observed performance.


Figure 1.24 Percentage of peak performance for four programs on four multiprocessors scaled to 64 processors. The Earth Simulator and X1 are vector processors (see Chapter 4 and Appendix G). Not only did they deliver a higher fraction of peak performance, but they also had the highest peak performance and the lowest clock rates. Except for the Paratec program, the Power 4 and Itanium 2 systems delivered between 5% and 10% of their peak. From Oliker, L., Canning, A., Carter, J., Shalf, J., Ethier, S., 2004. Scientific computations on modern parallel vector systems. In: Proc. ACM/IEEE Conf. on Supercomputing, November 6–12, 2004, Pittsburgh, Penn., p. 10.


Pitfall

Fault detection can lower availability. This apparently ironic pitfall arises because computer hardware has a fair amount of state that may not always be critical to proper operation. For example, it is not fatal if an error occurs in a branch predictor, because only performance may suffer. In processors that try to exploit ILP aggressively, not all the operations are needed for correct execution of the program. Mukherjee et al. (2003) found that less than 30% of the operations were potentially on the critical path for the SPEC2000 benchmarks. The same observation is true about programs. If a register is “dead” in a program—that is, the program will write the register before it is read again—then errors do not matter. If you were to crash the program upon detection of a transient fault in a dead register, it would lower availability unnecessarily.

The Sun Microsystems Division of Oracle lived this pitfall in 2000 with an L2 cache that included parity, but not error correction, in its Sun E3000 to Sun E10000 systems. The SRAMs they used to build the caches had intermittent faults, which parity detected. If the data in the cache were not modified, the processor would simply reread the data from the cache. Because the designers did not protect the cache with ECC (error-correcting code), the operating system had no choice but to report an error for dirty data and crash the program. Field engineers found no problems on inspection in more than 90% of the cases.

To reduce the frequency of such errors, Sun modified the Solaris operating system to “scrub” the cache by having a process that proactively wrote dirty data to memory. Because the processor chips did not have enough pins to add ECC, the only hardware option for dirty data was to duplicate the external cache, using the copy without the parity error to correct the error. The pitfall is in detecting faults without providing a mechanism to correct them. These engineers are unlikely to design another computer without ECC on external caches.

1.12 Concluding Remarks

This chapter has introduced a number of concepts and provided a quantitative framework that we will expand on throughout the book. Starting with the last edition, energy efficiency is the constant companion to performance.

In Chapter 2, we start with the all-important area of memory system design. We will examine a wide range of techniques that conspire to make memory look infinitely large while still being as fast as possible. (Appendix B provides introductory material on caches for readers without much experience and background with them.) As in later chapters, we will see that hardware-software cooperation has become a key to high-performance memory systems, just as it has to high-performance pipelines. This chapter also covers virtual machines, an increasingly important technique for protection.

In Chapter 3, we look at ILP, of which pipelining is the simplest and most common form. Exploiting ILP is one of the most important techniques for building


high-speed uniprocessors. Chapter 3 begins with an extensive discussion of basic concepts that will prepare you for the wide range of ideas examined in both chapters. Chapter 3 uses examples that span about 40 years, drawing from one of the first supercomputers (IBM 360/91) to the fastest processors on the market in 2017. It emphasizes what is called the dynamic or runtime approach to exploiting ILP. It also talks about the limits to ILP ideas and introduces multithreading, which is further developed in both Chapters 4 and 5. Appendix C provides introductory material on pipelining for readers without much experience and background in pipelining. (We expect it to be a review for many readers, including those of our introductory text, Computer Organization and Design: The Hardware/Software Interface.)

Chapter 4 explains three ways to exploit data-level parallelism. The classic and oldest approach is vector architecture, and we start there to lay down the principles of SIMD design. (Appendix G goes into greater depth on vector architectures.) We next explain the SIMD instruction set extensions found in most desktop microprocessors today. The third piece is an in-depth explanation of how modern graphics processing units (GPUs) work. Most GPU descriptions are written from the programmer’s perspective, which usually hides how the computer really works. This section explains GPUs from an insider’s perspective, including a mapping between GPU jargon and more traditional architecture terms.

Chapter 5 focuses on the issue of achieving higher performance using multiple processors, or multiprocessors. Instead of using parallelism to overlap individual instructions, multiprocessing uses parallelism to allow multiple instruction streams to be executed simultaneously on different processors. Our focus is on the dominant form of multiprocessors, shared-memory multiprocessors, though we introduce other types as well and discuss the broad issues that arise in any multiprocessor. Here again we explore a variety of techniques, focusing on the important ideas first introduced in the 1980s and 1990s.

Chapter 6 introduces clusters and then goes into depth on WSCs, which computer architects help design. The designers of WSCs are the professional descendants of the pioneers of supercomputers, such as Seymour Cray, in that they are designing extreme computers. WSCs contain tens of thousands of servers, and the equipment and the building that holds them cost nearly $200 million. The concerns of price-performance and energy efficiency of the earlier chapters apply to WSCs, as does the quantitative approach to making decisions.

Chapter 7 is new to this edition. It introduces domain-specific architectures as the only path forward for improved performance and energy efficiency given the end of Moore’s Law and Dennard scaling. It offers guidelines on how to build effective domain-specific architectures, introduces the exciting domain of deep neural networks, describes four recent examples that take very different approaches to accelerating neural networks, and then compares their cost-performance.

This book comes with an abundance of material online (see Preface for more details), both to reduce cost and to introduce readers to a variety of advanced topics. Figure 1.25 shows them all. Appendices A–C, which appear in the book, will be a review for many readers.


Appendix | Title
A | Instruction Set Principles
B | Review of Memory Hierarchies
C | Pipelining: Basic and Intermediate Concepts
D | Storage Systems
E | Embedded Systems
F | Interconnection Networks
G | Vector Processors in More Depth
H | Hardware and Software for VLIW and EPIC
I | Large-Scale Multiprocessors and Scientific Applications
J | Computer Arithmetic
K | Survey of Instruction Set Architectures
L | Advanced Concepts on Address Translation
M | Historical Perspectives and References

Figure 1.25 List of appendices.

In Appendix D, we move away from a processor-centric view and discuss issues in storage systems. We apply a similar quantitative approach, but one based on observations of system behavior and using an end-to-end approach to performance analysis. This appendix addresses the important issue of how to store and retrieve data efficiently using primarily lower-cost magnetic storage technologies. Our focus is on examining the performance of disk storage systems for typical I/O-intensive workloads, such as the OLTP benchmarks mentioned in this chapter. We extensively explore advanced topics in RAID-based systems, which use redundant disks to achieve both high performance and high availability. Finally, Appendix D introduces queuing theory, which gives a basis for trading off utilization and latency.

Appendix E applies an embedded computing perspective to the ideas of each of the chapters and early appendices. Appendix F explores the topic of system interconnect broadly, including wide area and system area networks that allow computers to communicate. Appendix H reviews VLIW hardware and software, which, in contrast, are less popular than when EPIC appeared on the scene just before the last edition. Appendix I describes large-scale multiprocessors for use in high-performance computing. Appendix J is the only appendix that remains from the first edition, and it covers computer arithmetic. Appendix K provides a survey of instruction set architectures, including the 80x86, the IBM 360, the VAX, and many RISC architectures, including ARM, MIPS, Power, RISC-V, and SPARC.

Appendix L is new and discusses advanced techniques for memory management, focusing on support for virtual machines and design of address translation

Case Studies and Exercises by Diana Franklin



67

for very large address spaces. With the growth in cloud processors, these architectural enhancements are becoming more important. We describe Appendix M next.

1.13 Historical Perspectives and References

Appendix M (available online) includes historical perspectives on the key ideas presented in each of the chapters in this text. These historical perspective sections allow us to trace the development of an idea through a series of machines or to describe significant projects. If you're interested in examining the initial development of an idea or processor or want further reading, references are provided at the end of each history. For this chapter, see Section M.2, "The Early Development of Computers," for a discussion on the early development of digital computers and performance measurement methodologies. As you read the historical material, you'll soon come to realize that one of the important benefits of the youth of computing, compared to many other engineering fields, is that some of the pioneers are still alive—we can learn the history by simply asking them!

Case Studies and Exercises by Diana Franklin

Case Study 1: Chip Fabrication Cost

Concepts illustrated by this case study

■ Fabrication Cost
■ Fabrication Yield
■ Defect Tolerance Through Redundancy

Many factors are involved in the price of a computer chip. Intel is spending $7 billion to complete its Fab 42 fabrication facility for 7 nm technology. In this case study, we explore a hypothetical company in the same situation and how different design decisions involving fabrication technology, area, and redundancy affect the cost of chips.

1.1 [10/10] Figure 1.26 gives hypothetical relevant chip statistics that influence the cost of several current chips. In the next few exercises, you will be exploring the effect of different possible design decisions for the Intel chips.

Chip         Die size (mm²)   Estimated defect rate (per cm²)   N    Manufacturing size (nm)   Transistors (billion)   Cores
BlueDragon   180              0.03                              12   10                        7.5                     4
RedDragon    120              0.04                              14   7                         7.5                     4
Phoenix8     200              0.04                              14   7                         12                      8

Figure 1.26 Manufacturing cost factors for several hypothetical current and future processors.

a. [10] What is the yield for the Phoenix chip?
b. [10] Why does Phoenix have a higher defect rate than BlueDragon?

1.2 [20/20/20/20] They will sell a range of chips from that factory, and they need to decide how much capacity to dedicate to each chip. Imagine that they will sell two chips. Phoenix is a completely new architecture designed with 7 nm technology in mind, whereas RedDragon is the same architecture as their 10 nm BlueDragon. Imagine that RedDragon will make a profit of $15 per defect-free chip. Phoenix will make a profit of $30 per defect-free chip. Each wafer has a 450 mm diameter.

a. [20] How much profit do you make on each wafer of Phoenix chips?
b. [20] How much profit do you make on each wafer of RedDragon chips?
c. [20] If your demand is 50,000 RedDragon chips per month and 25,000 Phoenix chips per month, and your facility can fabricate 70 wafers a month, how many wafers should you make of each chip?

1.3 [20/20] Your colleague at AMD suggests that, since the yield is so poor, you might make chips more cheaply if you released multiple versions of the same chip, just with different numbers of cores. For example, you could sell Phoenix8, Phoenix4, Phoenix2, and Phoenix1, which contain 8, 4, 2, and 1 cores on each chip, respectively. If all eight cores are defect-free, then it is sold as Phoenix8. Chips with four to seven defect-free cores are sold as Phoenix4, and those with two or three defect-free cores are sold as Phoenix2. For simplification, calculate the yield for a single core as the yield for a chip that is 1/8 the area of the original Phoenix chip. Then view that yield as an independent probability of a single core being defect free. Calculate the yield for each configuration as the probability of at least the corresponding number of cores being defect free.

a. [20] What is the yield for a single core being defect free as well as the yield for Phoenix4, Phoenix2, and Phoenix1?
b. [5] Using your results from part a, determine which chips you think it would be worthwhile to package and sell, and why.
c. [10] If it previously cost $20 per chip to produce Phoenix8, what will be the cost of the new Phoenix chips, assuming that there are no additional costs associated with rescuing them from the trash?
d. [20] You currently make a profit of $30 for each defect-free Phoenix8, and you will sell each Phoenix4 chip for $25. How much is your profit per Phoenix8 chip if you consider (i) the purchase price of Phoenix4 chips to be entirely profit and (ii) apply the profit of Phoenix4 chips to each Phoenix8 chip in proportion to how many are produced? Use the yields calculated from Problem 1.3 part a, not from Problem 1.1 part a.
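One way to set up these yield calculations is a few lines of Python that apply the die-yield formula from this chapter together with the standard dies-per-wafer estimate; the function names and the Phoenix parameters pulled from Figure 1.26 (with the 450 mm wafer of Exercise 1.2) are used only as an illustrative sketch, not as answers.

import math

def dies_per_wafer(wafer_diameter_mm, die_area_mm2):
    # Wafer area divided by die area, minus a correction for partial dies
    # lost along the circular edge of the wafer.
    wafer_area = math.pi * (wafer_diameter_mm / 2) ** 2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    return int(wafer_area / die_area_mm2 - edge_loss)

def die_yield(defects_per_cm2, die_area_mm2, n, wafer_yield=1.0):
    # Die yield = Wafer yield x 1 / (1 + Defects per unit area x Die area)^N.
    die_area_cm2 = die_area_mm2 / 100.0   # convert mm^2 to cm^2
    return wafer_yield / (1 + defects_per_cm2 * die_area_cm2) ** n

# Hypothetical Phoenix parameters from Figure 1.26 and a 450 mm wafer.
area_mm2, defect_rate, n = 200, 0.04, 14
y = die_yield(defect_rate, area_mm2, n)
dies = dies_per_wafer(450, area_mm2)
print(f"yield = {y:.3f}, dies/wafer = {dies}, good dies/wafer = {int(dies * y)}")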


Case Study 2: Power Consumption in Computer Systems

Concepts illustrated by this case study

■ Amdahl's Law
■ Redundancy
■ MTTF
■ Power Consumption

Power consumption in modern systems is dependent on a variety of factors, including the chip clock frequency, efficiency, and voltage. The following exercises explore the impact on power and energy that different design decisions and use scenarios have.

1.4 [10/10/10/10] A cell phone performs very different tasks, including streaming music, streaming video, and reading email. These tasks perform very different computing tasks. Battery life and overheating are two common problems for cell phones, so reducing power and energy consumption is critical. In this problem, we consider what to do when the user is not using the phone to its full computing capacity. For these problems, we will evaluate an unrealistic scenario in which the cell phone has no specialized processing units. Instead, it has a quad-core, general-purpose processing unit. Each core uses 0.5 W at full use. For email-related tasks, the quad-core is 8× as fast as necessary.

a. [10] How much dynamic energy and power are required compared to running at full power? First, suppose that the quad-core operates for 1/8 of the time and is idle for the rest of the time. That is, the clock is disabled for 7/8 of the time, with no leakage occurring during that time. Compare total dynamic energy as well as dynamic power while the core is running.
b. [10] How much dynamic energy and power are required using frequency and voltage scaling? Assume frequency and voltage are both reduced to 1/8 the entire time.
c. [10] Now assume the voltage may not decrease below 50% of the original voltage. This voltage is referred to as the voltage floor, and any voltage lower than that will lose the state. Therefore, while the frequency can keep decreasing, the voltage cannot. What are the dynamic energy and power savings in this case?
d. [10] How much energy is used with a dark silicon approach? This involves creating specialized ASIC hardware for each major task and power gating those elements when not in use. Only one general-purpose core would be provided, and the rest of the chip would be filled with specialized units. For email, the one core would operate for 25% of the time and be turned completely off with power gating for the other 75% of the time. During the other 75% of the time, a specialized ASIC unit that requires 20% of the energy of a core would be running.

1.5 [10/10/10] As mentioned in Exercise 1.4, cell phones run a wide variety of applications. We'll make the same assumptions for this exercise as the previous one: that it is 0.5 W per core and that a quad core runs email 3× as fast.

a. [10] Imagine that 80% of the code is parallelizable. By how much would the frequency and voltage on a single core need to be increased in order to execute at the same speed as the four-way parallelized code?
b. [10] What is the reduction in dynamic energy from using frequency and voltage scaling in part a?
c. [10] How much energy is used with a dark silicon approach? In this approach, all hardware units are power gated, allowing them to turn off entirely (causing no leakage). Specialized ASICs are provided that perform the same computation for 20% of the power as the general-purpose processor. Imagine that each core is power gated. The video game requires two ASICs and two cores. How much dynamic energy does it require compared to the baseline of running parallelized on four cores?
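Several of these parts hinge on the dynamic energy and power formulas from this chapter (energy proportional to CV², power proportional to CV²f). The short Python sketch below shows the relative scaling; the unit capacitance and the scenarios chosen are illustrative assumptions, not worked answers.

def dynamic_power(cap, voltage, frequency):
    # Power_dynamic ~ 1/2 x Capacitive load x Voltage^2 x Frequency switched.
    return 0.5 * cap * voltage ** 2 * frequency

def dynamic_energy(cap, voltage):
    # Energy_dynamic ~ 1/2 x Capacitive load x Voltage^2 (per switched transition).
    return 0.5 * cap * voltage ** 2

baseline = dynamic_power(cap=1.0, voltage=1.0, frequency=1.0)   # relative units

# Frequency and voltage both scaled to 1/8 of nominal.
print(dynamic_power(1.0, 1 / 8, 1 / 8) / baseline)   # (1/8)^2 * (1/8) ~ 0.002

# Voltage floor at 50% of nominal while frequency still drops to 1/8.
print(dynamic_power(1.0, 0.5, 1 / 8) / baseline)     # 0.25 * (1/8) ~ 0.031

# Energy for a fixed amount of switching depends on V^2 but not on frequency.
print(dynamic_energy(1.0, 1 / 8) / dynamic_energy(1.0, 1.0))    # ~ 0.016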

1.6 [10/10/10/10/10/20] General-purpose processors are optimized for general-purpose computing. That is, they are optimized for behavior that is generally found across a large number of applications. However, once the domain is restricted somewhat, the behavior that is found across a large number of the target applications may be different from general-purpose applications. One such application is deep learning or neural networks. Deep learning can be applied to many different applications, but the fundamental building block of inference—using the learned information to make decisions—is the same across them all. Inference operations are largely parallel, so they are currently performed on graphics processing units, which are specialized more toward this type of computation, and not to inference in particular. In a quest for more performance per watt, Google has created a custom chip using tensor processing units to accelerate inference operations in deep learning.¹ This approach can be used for speech recognition and image recognition, for example. This problem explores the trade-offs between this process, a general-purpose processor (Haswell E5-2699 v3) and a GPU (NVIDIA K80), in terms of performance and cooling. If heat is not removed from the computer efficiently, the fans will blow hot air back onto the computer, not cold air. Note: The differences are more than processor—on-chip memory and DRAM also come into play. Therefore statistics are at a system level, not a chip level.

1. Cite paper at this website: https://drive.google.com/file/d/0Bx4hafXDDq2EMzRNcy1vSUxtcEk/view.


a. [10] If Google's data center spends 70% of its time on workload A and 30% of its time on workload B when running GPUs, what is the speedup of the TPU system over the GPU system?
b. [10] If Google's data center spends 70% of its time on workload A and 30% of its time on workload B when running GPUs, what percentage of Max IPS does it achieve for each of the three systems?
c. [15] Building on (b), assuming that the power scales linearly from idle to busy power as IPS grows from 0% to 100%, what is the performance per watt of the TPU system over the GPU system?
d. [10] If another data center spends 40% of its time on workload A, 10% of its time on workload B, and 50% of its time on workload C, what are the speedups of the GPU and TPU systems over the general-purpose system?
e. [10] A cooling door for a rack costs $4000 and dissipates 14 kW (into the room; additional cost is required to get it out of the room). How many Haswell-, NVIDIA-, or Tensor-based servers can you cool with one cooling door, assuming TDP in Figures 1.27 and 1.28?
f. [20] Typical server farms can dissipate a maximum of 200 W per square foot. Given that a server rack requires 11 square feet (including front and back clearance), how many servers from part (e) can be placed on a single rack, and how many cooling doors are required?

System               Chip                 TDP      Idle power   Busy power
General-purpose      Haswell E5-2699 v3   504 W    159 W        455 W
Graphics processor   NVIDIA K80           1838 W   357 W        991 W
Custom ASIC          TPU                  861 W    290 W        384 W

Figure 1.27 Hardware characteristics for general-purpose processor, graphical processing unit-based or custom ASIC-based system, including measured power (cite ISCA paper).

                                          Throughput                     % Max IPS
System               Chip                 A         B         C          A      B      C
General-purpose      Haswell E5-2699 v3   5482      13,194    12,000     42%    100%   90%
Graphics processor   NVIDIA K80           13,461    36,465    15,000     37%    100%   40%
Custom ASIC          TPU                  225,000   280,000   2000       80%    100%   1%

Figure 1.28 Performance characteristics for general-purpose processor, graphical processing unit-based or custom ASIC-based system on two neural-net workloads (cite ISCA paper). Workloads A and B are from published results. Workload C is a fictional, more general-purpose application.


Exercises

1.7 [10/15/15/10/10] One challenge for architects is that the design created today will require several years of implementation, verification, and testing before appearing on the market. This means that the architect must project what the technology will be like several years in advance. Sometimes, this is difficult to do.

a. [10] According to the trend in device scaling historically observed by Moore's Law, the number of transistors on a chip in 2025 should be how many times the number in 2015?
b. [15] The increase in performance once mirrored this trend. Had performance continued to climb at the same rate as in the 1990s, approximately what performance would chips have over the VAX-11/780 in 2025?
c. [15] At the current rate of increase of the mid-2000s, what is a more updated projection of performance in 2025?
d. [10] What has limited the rate of growth of the clock rate, and what are architects doing with the extra transistors now to increase performance?
e. [10] The rate of growth for DRAM capacity has also slowed down. For 20 years, DRAM capacity improved by 60% each year. If 8 Gbit DRAM was first available in 2015, and 16 Gbit is not available until 2019, what is the current DRAM growth rate?

1.8 [10/10] You are designing a system for a real-time application in which specific deadlines must be met. Finishing the computation faster gains nothing. You find that your system can execute the necessary code, in the worst case, twice as fast as necessary.

a. [10] How much energy do you save if you execute at the current speed and turn off the system when the computation is complete?
b. [10] How much energy do you save if you set the voltage and frequency to be half as much?

1.9 [10/10/20/20] Server farms such as Google and Yahoo! provide enough compute capacity for the highest request rate of the day. Imagine that most of the time these servers operate at only 60% capacity. Assume further that the power does not scale linearly with the load; that is, when the servers are operating at 60% capacity, they consume 90% of maximum power. The servers could be turned off, but they would take too long to restart in response to more load. A new system has been proposed that allows for a quick restart but requires 20% of the maximum power while in this "barely alive" state.

a. [10] How much power savings would be achieved by turning off 60% of the servers?
b. [10] How much power savings would be achieved by placing 60% of the servers in the "barely alive" state?


c. [20] How much power savings would be achieved by reducing the voltage by 20% and frequency by 40%?
d. [20] How much power savings would be achieved by placing 30% of the servers in the "barely alive" state and 30% off?

1.10 [10/10/20] Availability is the most important consideration for designing servers, followed closely by scalability and throughput.

a. [10] We have a single processor with a failure in time (FIT) of 100. What is the mean time to failure (MTTF) for this system?
b. [10] If it takes one day to get the system running again, what is the availability of the system?
c. [20] Imagine that the government, to cut costs, is going to build a supercomputer out of inexpensive computers rather than expensive, reliable computers. What is the MTTF for a system with 1000 processors? Assume that if one fails, they all fail.
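A short Python sketch of the FIT, MTTF, and availability relationships used in these reliability exercises may help (MTTF = 10^9/FIT hours; Availability = MTTF/(MTTF + MTTR)). The FIT of 200, the 12-hour repair time, and the additive-failure-rate assumption below are hypothetical choices for illustration.

def mttf_hours(fit):
    # FIT counts failures per billion (10^9) hours of operation.
    return 1e9 / fit

def availability(mttf, mttr):
    # Availability = MTTF / (MTTF + MTTR).
    return mttf / (mttf + mttr)

single = mttf_hours(200)                  # hypothetical single-processor FIT of 200
print(single, availability(single, 12))   # assumes a 12-hour repair time

# If failure rates simply add, N identical processors have N times the FIT,
# and therefore 1/N of the single-processor MTTF.
print(mttf_hours(200 * 1000))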

1.11 [20/20/20] In a server farm such as that used by Amazon or eBay, a single failure does not cause the entire system to crash. Instead, it will reduce the number of requests that can be satisfied at any one time.

a. [20] If a company has 10,000 computers, each with an MTTF of 35 days, and it experiences catastrophic failure only if 1/3 of the computers fail, what is the MTTF for the system?
b. [20] If it costs an extra $1000 per computer to double the MTTF, would this be a good business decision? Show your work.
c. [20] Figure 1.3 shows, on average, the cost of downtimes, assuming that the cost is equal at all times of the year. For retailers, however, the Christmas season is the most profitable (and therefore the most costly time to lose sales). If a catalog sales center has twice as much traffic in the fourth quarter as every other quarter, what is the average cost of downtime per hour during the fourth quarter and the rest of the year?

1.12 [20/10/10/10/15] In this exercise, assume that we are considering enhancing a quad-core machine by adding encryption hardware to it. When computing encryption operations, it is 20 times faster than the normal mode of execution. We will define percentage of encryption as the percentage of time in the original execution that is spent performing encryption operations. The specialized hardware increases power consumption by 2%.

a. [20] Draw a graph that plots the speedup as a percentage of the computation spent performing encryption. Label the y-axis "Net speedup" and label the x-axis "Percent encryption."
b. [10] With what percentage of encryption will adding encryption hardware result in a speedup of 2?
c. [10] What percentage of time in the new execution will be spent on encryption operations if a speedup of 2 is achieved?

d. [15] Suppose you have measured the percentage of encryption to be 50%. The hardware design group estimates it can speed up the encryption hardware even more with significant additional investment. You wonder whether adding a second unit in order to support parallel encryption operations would be more useful. Imagine that in the original program, 90% of the encryption operations could be performed in parallel. What is the speedup of providing two or four encryption units, assuming that the parallelization allowed is limited to the number of encryption units?

1.13 [15/10] Assume that we make an enhancement to a computer that improves some mode of execution by a factor of 10. Enhanced mode is used 50% of the time, measured as a percentage of the execution time when the enhanced mode is in use. Recall that Amdahl's Law depends on the fraction of the original, unenhanced execution time that could make use of enhanced mode. Thus we cannot directly use this 50% measurement to compute speedup with Amdahl's Law.

a. [15] What is the speedup we have obtained from fast mode?
b. [10] What percentage of the original execution time has been converted to fast mode?
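Exercises 1.12 through 1.14 all revolve around Amdahl's Law, so a short Python sketch of the formula can serve as a sanity check; the sweep of encryption percentages below is illustrative only, not a worked solution.

def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    # Speedup_overall = 1 / ((1 - F) + F / S), per Amdahl's Law.
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# Encryption hardware that is 20x faster on the encrypted fraction of the run.
for frac in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"{frac:.0%} encryption -> overall speedup {amdahl_speedup(frac, 20):.2f}")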

1.14 [20/20/15] When making changes to optimize part of a processor, it is often the case that speeding up one type of instruction comes at the cost of slowing down something else. For example, if we put in a complicated fast floating-point unit, that takes space, and something might have to be moved farther away from the middle to accommodate it, adding an extra cycle in delay to reach that unit. The basic Amdahl's Law equation does not take into account this trade-off.

a. [20] If the new fast floating-point unit speeds up floating-point operations by, on average, 2×, and floating-point operations take 20% of the original program's execution time, what is the overall speedup (ignoring the penalty to any other instructions)?
b. [20] Now assume that speeding up the floating-point unit slowed down data cache accesses, resulting in a 1.5× slowdown (or 2/3 speedup). Data cache accesses consume 10% of the execution time. What is the overall speedup now?
c. [15] After implementing the new floating-point operations, what percentage of execution time is spent on floating-point operations? What percentage is spent on data cache accesses?

1.15 [10/10/20/20] Your company has just bought a new 22-core processor, and you have been tasked with optimizing your software for this processor. You will run four applications on this system, but the resource requirements are not equal. Assume the system and application characteristics listed in Table 1.1.

Table 1.1 Four applications

Application          A    B    C    D
% resources needed   41   27   18   14
% parallelizable     50   80   60   90


The percentage of resources needed is given assuming the applications are all run in serial. Assume that when you parallelize a portion of the program by X, the speedup for that portion is X.

a. [10] How much speedup would result from running application A on the entire 22-core processor, as compared to running it serially?
b. [10] How much speedup would result from running application D on the entire 22-core processor, as compared to running it serially?
c. [20] Given that application A requires 41% of the resources, if we statically assign it 41% of the cores, what is the overall speedup if A is run parallelized but everything else is run serially?
d. [20] What is the overall speedup if all four applications are statically assigned some of the cores, relative to their percentage of resource needs, and all run parallelized?
e. [10] Given acceleration through parallelization, what new percentage of the resources are the applications receiving, considering only active time on their statically assigned cores?

1.16 [10/20/20/20/25] When parallelizing an application, the ideal speedup is speeding up by the number of processors. This is limited by two things: the percentage of the application that can be parallelized and the cost of communication. Amdahl's Law takes into account the former but not the latter.

a. [10] What is the speedup with N processors if 80% of the application is parallelizable, ignoring the cost of communication?
b. [20] What is the speedup with eight processors if, for every processor added, the communication overhead is 0.5% of the original execution time?
c. [20] What is the speedup with eight processors if, for every time the number of processors is doubled, the communication overhead is increased by 0.5% of the original execution time?
d. [20] What is the speedup with N processors if, for every time the number of processors is doubled, the communication overhead is increased by 0.5% of the original execution time?
e. [25] Write the general equation that solves this question: What is the number of processors with the highest speedup in an application in which P% of the original execution time is parallelizable, and, for every time the number of processors is doubled, the communication is increased by 0.5% of the original execution time?
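One reading of the communication-overhead model in parts (c) and (d) can be sketched as an Amdahl-style speedup with an extra overhead term. The Python below is an assumption about how to formalize that model, with an illustrative sweep of processor counts; it is not an official solution.

import math

def speedup_with_comm(parallel_fraction, processors, overhead_per_doubling=0.005):
    # Serial part + parallel part shrunk by N + communication overhead that grows
    # by 0.5% of the original execution time each time the core count doubles.
    doublings = math.log2(processors)
    denom = ((1.0 - parallel_fraction)
             + parallel_fraction / processors
             + overhead_per_doubling * doublings)
    return 1.0 / denom

# Sweep the processor count for an 80%-parallel application.
for n in (2, 4, 8, 16, 32, 64, 128):
    print(n, round(speedup_with_comm(0.80, n), 2))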

2.1 Introduction  78
2.2 Memory Technology and Optimizations  84
2.3 Ten Advanced Optimizations of Cache Performance  94
2.4 Virtual Memory and Virtual Machines  118
2.5 Cross-Cutting Issues: The Design of Memory Hierarchies  126
2.6 Putting It All Together: Memory Hierarchies in the ARM Cortex-A53 and Intel Core i7 6700  129
2.7 Fallacies and Pitfalls  142
2.8 Concluding Remarks: Looking Ahead  146
2.9 Historical Perspectives and References  148
Case Studies and Exercises by Norman P. Jouppi, Rajeev Balasubramonian, Naveen Muralimanohar, and Sheng Li  148

2 Memory Hierarchy Design

Ideally one would desire an indefinitely large memory capacity such that any particular… word would be immediately available… We are… forced to recognize the possibility of constructing a hierarchy of memories each of which has greater capacity than the preceding but which is less quickly accessible. A. W. Burks, H. H. Goldstine, and J. von Neumann, Preliminary Discussion of the Logical Design of an Electronic Computing Instrument (1946).



2.1 Introduction

Computer pioneers correctly predicted that programmers would want unlimited amounts of fast memory. An economical solution to that desire is a memory hierarchy, which takes advantage of locality and trade-offs in the cost-performance of memory technologies. The principle of locality, presented in the first chapter, says that most programs do not access all code or data uniformly. Locality occurs in time (temporal locality) and in space (spatial locality). This principle plus the guideline that for a given implementation technology and power budget, smaller hardware can be made faster led to hierarchies based on memories of different speeds and sizes. Figure 2.1 shows several different multilevel memory hierarchies, including typical sizes and speeds of access. As Flash and next generation memory technologies continue to close the gap with disks in cost per bit, such technologies are likely to increasingly replace magnetic disks for secondary storage. As Figure 2.1 shows, these technologies are already used in many personal computers and increasingly in servers, where the advantages in performance, power, and density are significant.

Because fast memory is more expensive, a memory hierarchy is organized into several levels—each smaller, faster, and more expensive per byte than the next lower level, which is farther from the processor. The goal is to provide a memory system with a cost per byte that is almost as low as the cheapest level of memory and a speed almost as fast as the fastest level. In most cases (but not all), the data contained in a lower level are a superset of the next higher level. This property, called the inclusion property, is always required for the lowest level of the hierarchy, which consists of main memory in the case of caches and secondary storage (disk or Flash) in the case of virtual memory.

The importance of the memory hierarchy has increased with advances in performance of processors. Figure 2.2 plots single processor performance projections against the historical performance improvement in time to access main memory. The processor line shows the increase in memory requests per second on average (i.e., the inverse of the latency between memory references), while the memory line shows the increase in DRAM accesses per second (i.e., the inverse of the DRAM access latency), assuming a single DRAM and a single memory bank. The reality is more complex because the processor request rate is not uniform, and the memory system typically has multiple banks of DRAMs and channels. Although the gap in access time increased significantly for many years, the lack of significant performance improvement in single processors has led to a slowdown in the growth of the gap between processors and DRAM.

Because high-end processors have multiple cores, the bandwidth requirements are greater than for single cores. Although single-core bandwidth has grown more slowly in recent years, the gap between CPU memory demand and DRAM bandwidth continues to grow as the numbers of cores grow. A modern high-end desktop processor such as the Intel Core i7 6700 can generate two data memory references per core each clock cycle. With four cores and a 4.2 GHz clock rate, the i7 can generate a peak of 32.8 billion 64-bit data memory references per second, in addition to a peak instruction demand of about 12.8 billion 128-bit instruction

[Figure 2.1 diagrams three memory hierarchies. (A) Personal mobile device: CPU registers (about 1000 bytes, 300 ps), L1 cache (64 KB, 1 ns), L2 cache (256 KB, 5–10 ns), memory (1–2 GB, 50–100 ns), Flash storage (4–64 GB, 25–50 µs). (B) Laptop or desktop: registers (1000–2000 bytes, 300 ps), L1 (64 KB, 1 ns), L2 (256 KB, 3–10 ns), L3 (4–8 MB laptop, 8–32 MB desktop, 10–20 ns), memory (4–16 GB laptop, 8–64 GB desktop, 50–100 ns), Flash storage (256 GB–1 TB laptop, 256 GB–2 TB desktop, 50–100 µs). (C) Server: registers (about 4000 bytes, 200 ps), L1 (64 KB, 1 ns), L2 (256 KB, 3–10 ns), L3 (16–64 MB, 10–20 ns), memory (32–256 GB, 50–100 ns), Flash storage (1–16 TB, 100–200 µs), disk storage (16–64 TB, 5–10 ms).]

Figure 2.1 The levels in a typical memory hierarchy in a personal mobile device (PMD), such as a cell phone or tablet (A), in a laptop or desktop computer (B), and in a server (C). As we move farther away from the processor, the memory in the level below becomes slower and larger. Note that the time units change by a factor of 109 from picoseconds to milliseconds in the case of magnetic disks and that the size units change by a factor of 1010 from thousands of bytes to tens of terabytes. If we were to add warehouse-sized computers, as opposed to just servers, the capacity scale would increase by three to six orders of magnitude. Solid-state drives (SSDs) composed of Flash are used exclusively in PMDs, and heavily in both laptops and desktops. In many desktops, the primary storage system is SSD, and expansion disks are primarily hard disk drives (HDDs). Likewise, many servers mix SSDs and HDDs.




Figure 2.2 Starting with 1980 performance as a baseline, the gap in performance, measured as the difference in the time between processor memory requests (for a single processor or core) and the latency of a DRAM access, is plotted over time. In mid-2017, AMD, Intel and Nvidia all announced chip sets using versions of HBM technology. Note that the vertical axis must be on a logarithmic scale to record the size of the processor-DRAM performance gap. The memory baseline is 64 KiB DRAM in 1980, with a 1.07 per year performance improvement in latency (see Figure 2.4 on page 88). The processor line assumes a 1.25 improvement per year until 1986, a 1.52 improvement until 2000, a 1.20 improvement between 2000 and 2005, and only small improvements in processor performance (on a per-core basis) between 2005 and 2015. As you can see, until 2010 memory access times in DRAM improved slowly but consistently; since 2010 the improvement in access time has reduced, as compared with the earlier periods, although there have been continued improvements in bandwidth. See Figure 1.1 in Chapter 1 for more information.

references; this is a total peak demand bandwidth of 409.6 GiB/s! This incredible bandwidth is achieved by multiporting and pipelining the caches; by using three levels of caches, with two private levels per core and a shared L3; and by using a separate instruction and data cache at the first level. In contrast, the peak bandwidth for DRAM main memory, using two memory channels, is only 8% of the demand bandwidth (34.1 GiB/s). Upcoming versions are expected to have an L4 DRAM cache using embedded or stacked DRAM (see Sections 2.2 and 2.3). Traditionally, designers of memory hierarchies focused on optimizing average memory access time, which is determined by the cache access time, miss rate, and miss penalty. More recently, however, power has become a major consideration. In high-end microprocessors, there may be 60 MiB or more of on-chip cache, and a large second- or third-level cache will consume significant power both as leakage when not operating (called static power) and as active power, as when performing a read or write (called dynamic power), as described in Section 2.3. The problem is even more acute in processors in PMDs where the CPU is less aggressive and the power budget may be 20 to 50 times smaller. In such cases, the caches can account for 25% to 50% of the total power consumption. Thus more designs must consider both performance and power trade-offs, and we will examine both in this chapter.


Basics of Memory Hierarchies: A Quick Review

The increasing size and thus importance of this gap led to the migration of the basics of memory hierarchy into undergraduate courses in computer architecture, and even to courses in operating systems and compilers. Thus we'll start with a quick review of caches and their operation. The bulk of the chapter, however, describes more advanced innovations that attack the processor—memory performance gap.

When a word is not found in the cache, the word must be fetched from a lower level in the hierarchy (which may be another cache or the main memory) and placed in the cache before continuing. Multiple words, called a block (or line), are moved for efficiency reasons, and because they are likely to be needed soon due to spatial locality. Each cache block includes a tag to indicate which memory address it corresponds to.

A key design decision is where blocks (or lines) can be placed in a cache. The most popular scheme is set associative, where a set is a group of blocks in the cache. A block is first mapped onto a set, and then the block can be placed anywhere within that set. Finding a block consists of first mapping the block address to the set and then searching the set—usually in parallel—to find the block. The set is chosen by the address of the data:

(Block address) MOD (Number of sets in cache)
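The set-selection rule just given is easy to express in code. The Python sketch below splits a byte address into block offset, set index, and tag for a hypothetical cache, using the relationship from Chapter 1 that the number of sets equals cache size divided by (block size times set associativity); the example parameters are assumptions for illustration.

def decompose_address(byte_address, block_size, cache_size, associativity):
    # Number of sets = cache size / (block size x associativity).
    num_sets = cache_size // (block_size * associativity)
    block_address = byte_address // block_size
    return {
        "offset": byte_address % block_size,   # position within the block
        "index": block_address % num_sets,     # (Block address) MOD (Number of sets)
        "tag": block_address // num_sets,      # remaining upper address bits
    }

# Hypothetical 32 KiB, 4-way set associative cache with 64-byte blocks (128 sets).
print(decompose_address(0x12345678, block_size=64, cache_size=32 * 1024, associativity=4))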

If there are n blocks in a set, the cache placement is called n-way set associative. The end points of set associativity have their own names. A direct-mapped cache has just one block per set (so a block is always placed in the same location), and a fully associative cache has just one set (so a block can be placed anywhere).

Caching data that is only read is easy because the copy in the cache and memory will be identical. Caching writes is more difficult; for example, how can the copy in the cache and memory be kept consistent? There are two main strategies. A write-through cache updates the item in the cache and writes through to update main memory. A write-back cache only updates the copy in the cache. When the block is about to be replaced, it is copied back to memory. Both write strategies can use a write buffer to allow the cache to proceed as soon as the data are placed in the buffer rather than wait for full latency to write the data into memory.

One measure of the benefits of different cache organizations is miss rate. Miss rate is simply the fraction of cache accesses that result in a miss—that is, the number of accesses that miss divided by the number of accesses. To gain insights into the causes of high miss rates, which can inspire better cache designs, the three Cs model sorts all misses into three simple categories:

■ Compulsory—The very first access to a block cannot be in the cache, so the block must be brought into the cache. Compulsory misses are those that occur even if you were to have an infinite-sized cache.

■ Capacity—If the cache cannot contain all the blocks needed during execution of a program, capacity misses (in addition to compulsory misses) will occur because of blocks being discarded and later retrieved.

■ Conflict—If the block placement strategy is not fully associative, conflict misses (in addition to compulsory and capacity misses) will occur because a block may be discarded and later retrieved if multiple blocks map to its set and accesses to the different blocks are intermingled.

Figure B.8 on page 24 shows the relative frequency of cache misses broken down by the three Cs. As mentioned in Appendix B, the three C's model is conceptual, and although its insights usually hold, it is not a definitive model for explaining the cache behavior of individual references. As we will see in Chapters 3 and 5, multithreading and multiple cores add complications for caches, both increasing the potential for capacity misses as well as adding a fourth C, for coherency misses due to cache flushes to keep multiple caches coherent in a multiprocessor; we will consider these issues in Chapter 5. However, miss rate can be a misleading measure for several reasons. Therefore some designers prefer measuring misses per instruction rather than misses per memory reference (miss rate). These two are related:

Misses / Instruction = (Miss rate × Memory accesses) / Instruction count = Miss rate × (Memory accesses / Instruction)

(This equation is often expressed in integers rather than fractions, as misses per 1000 instructions.) The problem with both measures is that they don't factor in the cost of a miss. A better measure is the average memory access time,

Average memory access time = Hit time + Miss rate × Miss penalty

where hit time is the time to hit in the cache and miss penalty is the time to replace the block from memory (that is, the cost of a miss). Average memory access time is still an indirect measure of performance; although it is a better measure than miss rate, it is not a substitute for execution time. In Chapter 3 we will see that speculative processors may execute other instructions during a miss, thereby reducing the effective miss penalty. The use of multithreading (introduced in Chapter 3) also allows a processor to tolerate misses without being forced to idle. As we will examine shortly, to take advantage of such latency tolerating techniques, we need caches that can service requests while handling an outstanding miss.

If this material is new to you, or if this quick review moves too quickly, see Appendix B. It covers the same introductory material in more depth and includes examples of caches from real computers and quantitative evaluations of their effectiveness.

Section B.3 in Appendix B presents six basic cache optimizations, which we quickly review here. The appendix also gives quantitative examples of the benefits of these optimizations. We also comment briefly on the power implications of these trade-offs.

1. Larger block size to reduce miss rate—The simplest way to reduce the miss rate is to take advantage of spatial locality and increase the block size. Larger blocks reduce compulsory misses, but they also increase the miss penalty. Because larger blocks lower the number of tags, they can slightly reduce static power. Larger block sizes can also increase capacity or conflict misses, especially in smaller caches. Choosing the right block size is a complex trade-off that depends on the size of cache and the miss penalty.

2. Bigger caches to reduce miss rate—The obvious way to reduce capacity misses is to increase cache capacity. Drawbacks include potentially longer hit time of the larger cache memory and higher cost and power. Larger caches increase both static and dynamic power.

3. Higher associativity to reduce miss rate—Obviously, increasing associativity reduces conflict misses. Greater associativity can come at the cost of increased hit time. As we will see shortly, associativity also increases power consumption.

4. Multilevel caches to reduce miss penalty—A difficult decision is whether to make the cache hit time fast, to keep pace with the high clock rate of processors, or to make the cache large to reduce the gap between the processor accesses and main memory accesses. Adding another level of cache between the original cache and memory simplifies the decision. The first-level cache can be small enough to match a fast clock cycle time, yet the second-level (or third-level) cache can be large enough to capture many accesses that would go to main memory. The focus on misses in second-level caches leads to larger blocks, bigger capacity, and higher associativity. Multilevel caches are more power-efficient than a single aggregate cache. If L1 and L2 refer, respectively, to first- and second-level caches, we can redefine the average memory access time:

Hit time_L1 + Miss rate_L1 × (Hit time_L2 + Miss rate_L2 × Miss penalty_L2)

5. Giving priority to read misses over writes to reduce miss penalty—A write buffer is a good place to implement this optimization. Write buffers create hazards because they hold the updated value of a location needed on a read miss—that is, a read-after-write hazard through memory. One solution is to check the contents of the write buffer on a read miss. If there are no conflicts, and if the memory system is available, sending the read before the writes reduces the miss penalty. Most processors give reads priority over writes. This choice has little effect on power consumption.

6. Avoiding address translation during indexing of the cache to reduce hit time—Caches must cope with the translation of a virtual address from the processor to a physical address to access memory. (Virtual memory is covered in Sections 2.4 and B.4.) A common optimization is to use the page offset—the part that is identical in both virtual and physical addresses—to index the cache, as described in Appendix B, page B.38. This virtual index/physical tag method introduces some system complications and/or limitations on the size and structure of the L1 cache, but the advantages of removing the translation lookaside buffer (TLB) access from the critical path outweigh the disadvantages.
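The average memory access time equations above, including the two-level form in optimization 4, translate directly into code. In the following Python sketch, the hit times, miss rates, and miss penalty are hypothetical values chosen only to show how a second-level cache shrinks the effective L1 miss penalty.

def amat(hit_time, miss_rate, miss_penalty):
    # Average memory access time = Hit time + Miss rate x Miss penalty.
    return hit_time + miss_rate * miss_penalty

def amat_two_level(hit_l1, miss_l1, hit_l2, miss_l2, penalty_l2):
    # The L1 miss penalty is itself an average access time for the L2.
    return hit_l1 + miss_l1 * (hit_l2 + miss_l2 * penalty_l2)

print(amat(1.0, 0.05, 80.0))                        # 5.0 ns with a single cache level
print(amat_two_level(1.0, 0.05, 10.0, 0.20, 80.0))  # 2.3 ns once an L2 filters most misses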


Note that each of the preceding six optimizations has a potential disadvantage that can lead to increased, rather than decreased, average memory access time. The rest of this chapter assumes familiarity with the preceding material and the details in Appendix B. In the "Putting It All Together" section, we examine the memory hierarchy for a microprocessor designed for a high-end desktop or smaller server, the Intel Core i7 6700, as well as one designed for use in a PMD, the ARM Cortex-A53, which is the basis for the processor used in several tablets and smartphones. Within each of these classes, there is a significant diversity in approach because of the intended use of the computer.

Although the i7 6700 has more cores and bigger caches than the Intel processors designed for mobile uses, the processors have similar architectures. A processor designed for small servers, such as the i7 6700, or larger servers, such as the Intel Xeon processors, typically is running a large number of concurrent processes, often for different users. Thus memory bandwidth becomes more important, and these processors offer larger caches and more aggressive memory systems to boost that bandwidth. In contrast, PMDs not only serve one user but generally also have smaller operating systems, usually less multitasking (running of several applications simultaneously), and simpler applications. PMDs must consider both performance and energy consumption, which determines battery life. Before we dive into more advanced cache organizations and optimizations, one needs to understand the various memory technologies and how they are evolving.

2.2 Memory Technology and Optimizations

…the one single development that put computers on their feet was the invention of a reliable form of memory, namely, the core memory. …Its cost was reasonable, it was reliable and, because it was reliable, it could in due course be made large. (p. 209)

Maurice Wilkes, Memoirs of a Computer Pioneer (1985)

This section describes the technologies used in a memory hierarchy, specifically in building caches and main memory. These technologies are SRAM (static random-access memory), DRAM (dynamic random-access memory), and Flash. The last of these is used as an alternative to hard disks, but because its characteristics are based on semiconductor technology, it is appropriate to include in this section.

Using SRAM addresses the need to minimize access time to caches. When a cache miss occurs, however, we need to move the data from the main memory as quickly as possible, which requires a high bandwidth memory. This high memory bandwidth can be achieved by organizing the many DRAM chips that make up the main memory into multiple memory banks and by making the memory bus wider, or by doing both. To allow memory systems to keep up with the bandwidth demands of modern processors, memory innovations started happening inside the DRAM chips themselves. This section describes the technology inside the memory chips and those innovative, internal organizations. Before describing the technologies and options, we need to introduce some terminology.

With the introduction of burst transfer memories, now widely used in both Flash and DRAM, memory latency is quoted using two measures—access time and cycle time. Access time is the time between when a read is requested and when the desired word arrives, and cycle time is the minimum time between unrelated requests to memory. Virtually all computers since 1975 have used DRAMs for main memory and SRAMs for cache, with one to three levels integrated onto the processor chip with the CPU. PMDs must balance power and performance, and because they have more modest storage needs, PMDs use Flash rather than disk drives, a decision increasingly being followed by desktop computers as well.

SRAM Technology

The first letter of SRAM stands for static. The dynamic nature of the circuits in DRAM requires data to be written back after being read—thus the difference between the access time and the cycle time as well as the need to refresh. SRAMs don't need to refresh, so the access time is very close to the cycle time. SRAMs typically use six transistors per bit to prevent the information from being disturbed when read. SRAM needs only minimal power to retain the charge in standby mode.

In earlier times, most desktop and server systems used SRAM chips for their primary, secondary, or tertiary caches. Today, all three levels of caches are integrated onto the processor chip. In high-end server chips, there may be as many as 24 cores and up to 60 MiB of cache; such systems are often configured with 128–256 GiB of DRAM per processor chip. The access times for large, third-level, on-chip caches are typically two to eight times that of a second-level cache. Even so, the L3 access time is usually at least five times faster than a DRAM access.

On-chip cache SRAMs are normally organized with a width that matches the block size of the cache, with the tags stored in parallel to each block. This allows an entire block to be read out or written in a single cycle. This capability is particularly useful when writing data fetched after a miss into the cache or when writing back a block that must be evicted from the cache. The access time to the cache (ignoring the hit detection and selection in a set associative cache) is proportional to the number of blocks in the cache, whereas the energy consumption depends both on the number of bits in the cache (static power) and on the number of blocks (dynamic power). Set associative caches reduce the initial access time to the memory because the size of the memory is smaller, but increase the time for hit detection and block selection, a topic we will cover in Section 2.3.

DRAM Technology

As early DRAMs grew in capacity, the cost of a package with all the necessary address lines was an issue. The solution was to multiplex the address lines, thereby



Figure 2.3 Internal organization of a DRAM. Modern DRAMs are organized in banks, up to 16 for DDR4. Each bank consists of a series of rows. Sending an ACT (Activate) command opens a bank and a row and loads the row into a row buffer. When the row is in the buffer, it can be transferred by successive column addresses at whatever the width of the DRAM is (typically 4, 8, or 16 bits in DDR4) or by specifying a block transfer and the starting address. The Precharge command (PRE) closes the bank and row and readies it for a new access. Each command, as well as block transfers, are synchronized with a clock. See the next section discussing SDRAM. The row and column signals are sometimes called RAS and CAS, based on the original names of the signals.

cutting the number of address pins in half. Figure 2.3 shows the basic DRAM organization. One-half of the address is sent first during the row access strobe (RAS). The other half of the address, sent during the column access strobe (CAS), follows it. These names come from the internal chip organization, because the memory is organized as a rectangular matrix addressed by rows and columns. An additional requirement of DRAM derives from the property signified by its first letter, D, for dynamic. To pack more bits per chip, DRAMs use only a single transistor, which effectively acts as a capacitor, to store a bit. This has two implications: first, the sensing wires that detect the charge must be precharged, which sets them “halfway” between a logical 0 and 1, allowing the small charge stored in the cell to cause a 0 or 1 to be detected by the sense amplifiers. On reading, a row is placed into a row buffer, where CAS signals can select a portion of the row to read out from the DRAM. Because reading a row destroys the information, it must be written back when the row is no longer needed. This write back happens in overlapped fashion, but in early DRAMs, it meant that the cycle time before a new row could be read was larger than the time to read a row and access a portion of that row. In addition, to prevent loss of information as the charge in a cell leaks away (assuming it is not read or written), each bit must be “refreshed” periodically. Fortunately, all the bits in a row can be refreshed simultaneously just by reading that row and writing it back. Therefore every DRAM in the memory system must access every row within a certain time window, such as 64 ms. DRAM controllers include hardware to refresh the DRAMs periodically. This requirement means that the memory system is occasionally unavailable because it is sending a signal telling every chip to refresh. The time for a refresh is a row activation and a precharge that also writes the row back (which takes


roughly 2/3 of the time to get a datum because no column select is needed), and this is required for each row of the DRAM. Because the memory matrix in a DRAM is conceptually square, the number of steps in a refresh is usually the square root of the DRAM capacity. DRAM designers try to keep time spent refreshing to less than 5% of the total time. So far we have presented main memory as if it operated like a Swiss train, consistently delivering the goods exactly according to schedule. In fact, with SDRAMs, a DRAM controller (usually on the processor chip) tries to optimize accesses by avoiding opening new rows and using block transfer when possible. Refresh adds another unpredictable factor. Amdahl suggested as a rule of thumb that memory capacity should grow linearly with processor speed to keep a balanced system. Thus a 1000 MIPS processor should have 1000 MiB of memory. Processor designers rely on DRAMs to supply that demand. In the past, they expected a fourfold improvement in capacity every three years, or 55% per year. Unfortunately, the performance of DRAMs is growing at a much slower rate. The slower performance improvements arise primarily because of smaller decreases in the row access time, which is determined by issues such as power limitations and the charge capacity (and thus the size) of an individual memory cell. Before we discuss these performance trends in more detail, we need to describe the major changes that occurred in DRAMs starting in the mid-1990s.
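As a rough sketch of the refresh arithmetic described above, the following Python assumes a conceptually square array (about the square root of the capacity in rows), a 64 ms refresh window, and a hypothetical 30 ns per-row refresh time; the numbers are illustrative assumptions only.

import math

def refresh_overhead(capacity_bits, window_s=0.064, time_per_row_s=30e-9):
    # A conceptually square array has about sqrt(capacity) rows, each of which
    # must be refreshed (read and written back) once per refresh window.
    rows = math.isqrt(capacity_bits)
    return rows * time_per_row_s / window_s

# An 8 Gbit chip: roughly 93,000 rows of refresh work every 64 ms,
# which with these assumed numbers stays within the few-percent budget.
print(f"{refresh_overhead(8 * 2**30):.1%}")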

Improving Memory Performance Inside a DRAM Chip: SDRAMs

Although very early DRAMs included a buffer allowing multiple column accesses to a single row, without requiring a new row access, they used an asynchronous interface, which meant that every column access and transfer involved overhead to synchronize with the controller. In the mid-1990s, designers added a clock signal to the DRAM interface so that the repeated transfers would not bear that overhead, thereby creating synchronous DRAM (SDRAM). In addition to reducing overhead, SDRAMs allowed the addition of a burst transfer mode where multiple transfers can occur without specifying a new column address. Typically, eight or more 16-bit transfers can occur without sending any new addresses by placing the DRAM in burst mode. The inclusion of such burst mode transfers has meant that there is a significant gap between the bandwidth for a stream of random accesses versus access to a block of data.

To overcome the problem of getting more bandwidth from the memory as DRAM density increased, DRAMs were made wider. Initially, they offered a four-bit transfer mode; in 2017, DDR2, DDR3, and DDR4 DRAMs had up to 4-, 8-, or 16-bit buses. In the early 2000s, a further innovation was introduced: double data rate (DDR), which allows a DRAM to transfer data both on the rising and the falling edge of the memory clock, thereby doubling the peak data rate. Finally, SDRAMs introduced banks to help with power management, improve access time, and allow interleaved and overlapped accesses to different banks.


Access to different banks can be overlapped with each other, and each bank has its own row buffer. Creating multiple banks inside a DRAM effectively adds another segment to the address, which now consists of bank number, row address, and column address. When an address is sent that designates a new bank, that bank must be opened, incurring an additional delay. The management of banks and row buffers is completely handled by modern memory control interfaces, so that when a subsequent access specifies the same row for an open bank, the access can happen quickly, sending only the column address. To initiate a new access, the DRAM controller sends a bank and row number (called Activate in SDRAMs and formerly called RAS—row select). That command opens the row and reads the entire row into a buffer. A column address can then be sent, and the SDRAM can transfer one or more data items, depending on whether it is a single item request or a burst request. Before accessing a new row, the bank must be precharged. If the row is in the same bank, then the precharge delay is seen; however, if the row is in another bank, closing the row and precharging can overlap with accessing the new row. In synchronous DRAMs, each of these command cycles requires an integral number of clock cycles. From 1980 to 1995, DRAMs scaled with Moore’s Law, doubling capacity every 18 months (or a factor of 4 in 3 years). From the mid-1990s to 2010, capacity increased more slowly with roughly 26 months between a doubling. From 2010 to 2016, capacity only doubled! Figure 2.4 shows the capacity and access time for various generations of DDR SDRAMs. From DDR1 to DDR3, access times improved by a factor of about 3, or about 7% per year. DDR4 improves power and bandwidth over DDR3, but has similar access latency. As Figure 2.4 shows, DDR is a sequence of standards. DDR2 lowers power from DDR1 by dropping the voltage from 2.5 to 1.8 V and offers higher clock rates: 266, 333, and 400 MHz. DDR3 drops voltage to 1.5 V and has a maximum clock speed of 800 MHz. (As we discuss in the next section, GDDR5 is a graphics

Production year   Chip size   DRAM type   RAS time (ns)   CAS time (ns)   Total (ns), best case (no precharge)   Total (ns), precharge needed
2000              256M bit    DDR1        21              21              42                                     63
2002              512M bit    DDR1        15              15              30                                     45
2004              1G bit      DDR2        15              15              30                                     45
2006              2G bit      DDR2        10              10              20                                     30
2010              4G bit      DDR3        13              13              26                                     39
2016              8G bit      DDR4        13              13              26                                     39

Figure 2.4 Capacity and access times for DDR SDRAMs by year of production. Access time is for a random memory word and assumes a new row must be opened. If the row is in a different bank, we assume the bank is precharged; if the row is not open, then a precharge is required, and the access time is longer. As the number of banks has increased, the ability to hide the precharge time has also increased. DDR4 SDRAMs were initially expected in 2014, but did not begin production until early 2016.

Standard   I/O clock rate   M transfers/s   DRAM name    MiB/s/DIMM   DIMM name
DDR1       133              266             DDR266       2128         PC2100
DDR1       150              300             DDR300       2400         PC2400
DDR1       200              400             DDR400       3200         PC3200
DDR2       266              533             DDR2-533     4264         PC4300
DDR2       333              667             DDR2-667     5336         PC5300
DDR2       400              800             DDR2-800     6400         PC6400
DDR3       533              1066            DDR3-1066    8528         PC8500
DDR3       666              1333            DDR3-1333    10,664       PC10700
DDR3       800              1600            DDR3-1600    12,800       PC12800
DDR4       1333             2666            DDR4-2666    21,300       PC21300

Figure 2.5 Clock rates, bandwidth, and names of DDR DRAMS and DIMMs in 2016. Note the numerical relationship between the columns. The third column is twice the second, and the fourth uses the number from the third column in the name of the DRAM chip. The fifth column is eight times the third column, and a rounded version of this number is used in the name of the DIMM. DDR4 saw significant first use in 2016.

RAM and is based on DDR3 DRAMs.) DDR4, which was expected in 2014 but did not ship in volume until early 2016, drops the voltage to 1–1.2 V and has a maximum expected clock rate of 1600 MHz. DDR5 is unlikely to reach production quantities until 2020 or later.

With the introduction of DDR, memory designers increasingly focused on bandwidth, because improvements in access time were difficult. Wider DRAMs, burst transfers, and double data rate all contributed to rapid increases in memory bandwidth. DRAMs are commonly sold on small boards called dual inline memory modules (DIMMs) that contain 4–16 DRAM chips and that are normally organized to be 8 bytes wide (+ ECC) for desktop and server systems.

When DDR SDRAMs are packaged as DIMMs, they are confusingly labeled by the peak DIMM bandwidth. Therefore the DIMM name PC3200 comes from 200 MHz × 2 × 8 bytes, or 3200 MiB/s; it is populated with DDR SDRAM chips. Sustaining the confusion, the chips themselves are labeled with the number of bits per second rather than their clock rate, so a 200 MHz DDR chip is called a DDR400. Figure 2.5 shows the relationships among I/O clock rate, transfers per second per chip, chip bandwidth, chip name, DIMM bandwidth, and DIMM name.
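The naming arithmetic is simple enough to check mechanically. The following C sketch (the helper name is ours, not from the text) reproduces the PC3200 calculation: two transfers per I/O clock across an 8-byte-wide DIMM.

#include <stdio.h>

/* Peak DIMM bandwidth for a DDR part: data transfers on both edges of the
   I/O clock (2 transfers per cycle) across an 8-byte-wide DIMM. */
static double dimm_peak_mib_per_s(double io_clock_mhz) {
    double mtransfers_per_s = 2.0 * io_clock_mhz;  /* e.g., 200 MHz -> 400 MT/s */
    return mtransfers_per_s * 8.0;                 /* 8 bytes per transfer */
}

int main(void) {
    /* DDR400 chips on a PC3200 DIMM: 200 MHz I/O clock -> 3200 MiB/s. */
    printf("PC3200 peak bandwidth: %.0f MiB/s\n", dimm_peak_mib_per_s(200.0));
    return 0;
}

Applied to the other rows of Figure 2.5, the same formula reproduces the MiB/s column to within the rounding used in the published clock rates and DIMM names.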

Reducing Power Consumption in SDRAMs

Power consumption in dynamic memory chips consists of both dynamic power used in a read or write and static or standby power; both depend on the operating voltage. In the most advanced DDR4 SDRAMs, the operating voltage has dropped to 1.2 V, significantly reducing power versus DDR2 and DDR3 SDRAMs. The addition of banks also reduced power because only the row in a single bank is read.



[Figure 2.6 chart: power in mW (0–600), divided into read/write/terminate power, activate power, and background power, for low power mode, typical usage, and fully active operation.]

Figure 2.6 Power consumption for a DDR3 SDRAM operating under three conditions: low-power (shutdown) mode, typical system mode (DRAM is active 30% of the time for reads and 15% for writes), and fully active mode, where the DRAM is continuously reading or writing. Reads and writes assume bursts of eight transfers. These data are based on a Micron 1.5V 2GB DDR3-1066, although similar savings occur in DDR4 SDRAMs.

In addition to these changes, all recent SDRAMs support a power-down mode, which is entered by telling the DRAM to ignore the clock. Power-down mode disables the SDRAM, except for internal automatic refresh (without which entering power-down mode for longer than the refresh time will cause the contents of memory to be lost). Figure 2.6 shows the power consumption for three situations in a 2 GB DDR3 SDRAM. The exact delay required to return from low power mode depends on the SDRAM, but a typical delay is 200 SDRAM clock cycles.

Graphics Data RAMs

GDRAMs or GSDRAMs (Graphics or Graphics Synchronous DRAMs) are a special class of DRAMs based on SDRAM designs but tailored for handling the higher bandwidth demands of graphics processing units. GDDR5 is based on DDR3, with earlier GDDRs based on DDR2. Because graphics processor units (GPUs; see Chapter 4) require more bandwidth per DRAM chip than CPUs, GDDRs have several important differences:

1. GDDRs have wider interfaces: 32 bits versus 4, 8, or 16 bits in current designs.

2. GDDRs have a higher maximum clock rate on the data pins. To allow a higher transfer rate without incurring signaling problems, GDRAMs normally connect directly to the GPU and are attached by soldering them to the board, unlike DRAMs, which are normally arranged in an expandable array of DIMMs.

Altogether, these characteristics let GDDRs run at two to five times the bandwidth per DRAM versus DDR3 DRAMs.


Packaging Innovation: Stacked or Embedded DRAMs


The newest DRAM innovation in 2017 is a packaging innovation rather than a circuit innovation. It places multiple DRAMs in a stacked or adjacent fashion embedded within the same package as the processor. (Embedded DRAM also is used to refer to designs that place DRAM on the processor chip.) Placing the DRAM and processor in the same package lowers access latency (by shortening the delay between the DRAMs and the processor) and potentially increases bandwidth by allowing more and faster connections between the processor and DRAM; thus several producers have called it high bandwidth memory (HBM).

One version of this technology places the DRAM die directly on the CPU die using solder bump technology to connect them. Assuming adequate heat management, multiple DRAM dies can be stacked in this fashion. Another approach stacks only DRAMs and abuts them with the CPU in a single package using a substrate (interposer) containing the connections. Figure 2.7 shows these two different interconnection schemes. Prototypes of HBM that allow stacking of up to eight chips have been demonstrated. With special versions of SDRAMs, such a package could contain 8 GiB of memory and have data transfer rates of 1 TB/s.

The 2.5D technique is currently available. Because the chips must be specifically manufactured to stack, it is quite likely that most early uses will be in high-end server chipsets. In some applications, it may be possible to internally package enough DRAM to satisfy the needs of the application. For example, a version of an Nvidia GPU used as a node in a special-purpose cluster design is being developed using HBM, and it is likely that HBM will become a successor to GDDR5 for higher-end applications. In some cases, it may be possible to use HBM as main memory, although the cost limitations and heat removal issues currently rule out this technology for some embedded applications. In the next section, we consider the possibility of using HBM as an additional level of cache.

[Figure 2.7 diagram: DRAM dies vertically stacked on the xPU (3D) versus a DRAM stack placed beside the xPU on an interposer (2.5D).]

Figure 2.7 Two forms of die stacking. The 2.5D form is available now. 3D stacking is under development and faces heat management challenges due to the CPU.


Flash Memory

Flash memory is a type of EEPROM (electronically erasable programmable read-only memory), which is normally read-only but can be erased. The other key property of Flash memory is that it holds its contents without any power. We focus on NAND Flash, which has higher density than NOR Flash and is more suitable for large-scale nonvolatile memories; the drawback is that access is sequential and writing is slower, as we explain below.

Flash is used as the secondary storage in PMDs in the same manner that a disk functions in a laptop or server. In addition, because most PMDs have a limited amount of DRAM, Flash may also act as a level of the memory hierarchy, to a much greater extent than it might have to do in a desktop or server with a main memory that might be 10–100 times larger.

Flash uses a very different architecture and has different properties than standard DRAM. The most important differences are

1. Reads to Flash are sequential and read an entire page, which can be 512 bytes, 2 KiB, or 4 KiB. Thus NAND Flash has a long delay to access the first byte from a random address (about 25 μs), but can supply the remainder of a page block at about 40 MiB/s. By comparison, a DDR4 SDRAM takes about 40 ns to the first byte and can transfer the rest of the row at 4.8 GiB/s. Comparing the time to transfer 2 KiB, NAND Flash takes about 75 μs, while DDR SDRAM takes less than 500 ns, making Flash about 150 times slower (a rough arithmetic check appears in the sketch after this list). Compared to magnetic disk, however, a 2 KiB read from Flash is 300 to 500 times faster. From these numbers, we can see why Flash is not a candidate to replace DRAM for main memory, but is a candidate to replace magnetic disk.

2. Flash memory must be erased (thus the name flash for the "flash" erase process) before it is overwritten, and it is erased in blocks rather than individual bytes or words. This requirement means that when data must be written to Flash, an entire block must be assembled, either as new data or by merging the data to be written and the rest of the block's contents. For writing, Flash is about 1500 times slower than SDRAM, and about 8–15 times as fast as magnetic disk.

3. Flash memory is nonvolatile (i.e., it keeps its contents even when power is not applied) and draws significantly less power when not reading or writing (from less than half in standby mode to zero when completely inactive).

4. Flash memory limits the number of times that any given block can be written, typically at least 100,000. By ensuring uniform distribution of written blocks throughout the memory, a system can maximize the lifetime of a Flash memory system. This technique, called wear leveling, is handled by Flash memory controllers.

5. High-density NAND Flash is cheaper than SDRAM but more expensive than disks: roughly $2/GiB for Flash, $20 to $40/GiB for SDRAM, and $0.09/GiB for magnetic disks. In the past five years, Flash has decreased in cost at a rate that is almost twice as fast as that of magnetic disks.
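The rough arithmetic check promised in item 1: a minimal C sketch (our own illustration, using the latencies and bandwidths quoted in the list) that compares a 2 KiB read from NAND Flash and from DDR4 SDRAM.

#include <stdio.h>

/* Time to read 'bytes' from a device with a fixed first-byte latency and a
   sustained streaming bandwidth (latency in seconds, bandwidth in bytes/s). */
static double read_time_s(double latency_s, double bandwidth_bps, double bytes) {
    return latency_s + bytes / bandwidth_bps;
}

int main(void) {
    double bytes = 2.0 * 1024.0;                                        /* a 2 KiB read        */
    double flash = read_time_s(25e-6, 40.0 * 1024 * 1024, bytes);       /* ~25 us, ~40 MiB/s   */
    double dram  = read_time_s(40e-9, 4.8 * 1024 * 1024 * 1024, bytes); /* ~40 ns, ~4.8 GiB/s  */
    /* Prints roughly 74 us versus under 0.5 us, i.e., on the order of the
       "about 150 times slower" quoted in the text. */
    printf("Flash: %.1f us  DRAM: %.2f us  ratio: %.0fx\n",
           flash * 1e6, dram * 1e6, flash / dram);
    return 0;
}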


Like DRAM, Flash chips include redundant blocks to allow chips with small numbers of defects to be used; the remapping of blocks is handled in the Flash chip. Flash controllers handle page transfers, provide caching of pages, and handle write leveling. The rapid improvements in high-density Flash have been critical to the development of low-power PMDs and laptops, but they have also significantly changed both desktops, which increasingly use solid state disks, and large servers, which often combine disk and Flash-based storage.

Phase-Change Memory Technology

Phase-change memory (PCM) has been an active research area for decades. The technology typically uses a small heating element to change the state of a bulk substrate between its crystalline form and an amorphous form, which have different resistive properties. Each bit corresponds to a crosspoint in a two-dimensional network that overlays the substrate. Reading is done by sensing the resistance between an x and y point (thus the alternative name memristor), and writing is accomplished by applying a current to change the phase of the material. The absence of an active device (such as a transistor) should lead to lower costs and greater density than that of NAND Flash.

In 2017 Micron and Intel began delivering Xpoint memory chips that are believed to be based on PCM. The technology is expected to have much better write durability than NAND Flash and, by eliminating the need to erase a page before writing, achieve an increase in write performance versus NAND of up to a factor of ten. Read latency is also better than Flash by perhaps a factor of 2–3. Initially, it is expected to be priced slightly higher than Flash, but the advantages in write performance and write durability may make it attractive, especially for SSDs. Should this technology scale well and be able to achieve additional cost reductions, it may be the solid state technology that will depose magnetic disks, which have reigned as the primary bulk nonvolatile store for more than 50 years.

Enhancing Dependability in Memory Systems

Large caches and main memories significantly increase the possibility of errors occurring both during the fabrication process and dynamically during operation. Errors that arise from a change in circuitry and are repeatable are called hard errors or permanent faults. Hard errors can occur during fabrication, as well as from a circuit change during operation (e.g., failure of a Flash memory cell after many writes). All DRAMs, Flash memory, and most SRAMs are manufactured with spare rows so that a small number of manufacturing defects can be accommodated by programming the replacement of a defective row by a spare row. Dynamic errors, which are changes to a cell's contents, not a change in the circuitry, are called soft errors or transient faults. Dynamic errors can be detected by parity bits and detected and fixed by the use of error correcting codes (ECCs). Because instruction caches are read-only, parity


suffices. In larger data caches and in main memory, ECC is used to allow errors to be both detected and corrected. Parity requires only one bit of overhead to detect a single error in a sequence of bits. Because a multibit error would be undetected with parity, the number of bits protected by a parity bit must be limited. One parity bit per 8 data bits is a typical ratio. ECC can detect two errors and correct a single error with a cost of 8 bits of overhead per 64 data bits.

In very large systems, the possibility of multiple errors as well as complete failure of a single memory chip becomes significant. Chipkill was introduced by IBM to solve this problem, and many very large systems, such as IBM and SUN servers and the Google Clusters, use this technology. (Intel calls their version SDDC.) Similar in nature to the RAID approach used for disks, Chipkill distributes the data and ECC information so that the complete failure of a single memory chip can be handled by supporting the reconstruction of the missing data from the remaining memory chips. Using an analysis by IBM and assuming a 10,000 processor server with 4 GiB per processor yields the following rates of unrecoverable errors in three years of operation:

■ Parity only: About 90,000, or one unrecoverable (or undetected) failure every 17 minutes.

■ ECC only: About 3500, or about one undetected or unrecoverable failure every 7.5 hours.

■ Chipkill: About one undetected or unrecoverable failure every 2 months.

Another way to look at this is to find the maximum number of servers (each with 4 GiB) that can be protected while achieving the same error rate as demonstrated for Chipkill. For parity, even a server with only one processor will have an unrecoverable error rate higher than a 10,000-server Chipkill protected system. For ECC, a 17-server system would have about the same failure rate as a 10,000-server Chipkill system. Therefore Chipkill is a requirement for the 50,000–100,000 servers in warehouse-scale computers (see Section 6.8 of Chapter 6).
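To make the parity scheme described above concrete, here is a minimal C sketch (our own illustration, not code from the text) that computes one even-parity bit for a byte of data and uses it to detect a single flipped bit; a real memory system does this in hardware for every protected group of bits.

#include <stdint.h>
#include <stdio.h>

/* Even parity over one byte: returns 1 if the number of 1 bits is odd, so that
   data plus its parity bit always contain an even number of 1s. */
static uint8_t parity_bit(uint8_t data) {
    data ^= data >> 4;
    data ^= data >> 2;
    data ^= data >> 1;
    return data & 1u;
}

int main(void) {
    uint8_t stored = 0x5A;
    uint8_t p = parity_bit(stored);        /* computed when the byte is written */
    uint8_t corrupted = stored ^ 0x08;     /* a single-bit soft error */
    /* On read, recompute parity and compare: a mismatch signals a single-bit
       error (a double-bit error would go undetected, as the text notes). */
    printf("error detected: %d\n", parity_bit(corrupted) != p);
    return 0;
}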

2.3 Ten Advanced Optimizations of Cache Performance

The preceding average memory access time formula gives us three metrics for cache optimizations: hit time, miss rate, and miss penalty. Given the recent trends, we add cache bandwidth and power consumption to this list. We can classify the 10 advanced cache optimizations we examine into five categories based on these metrics:

1. Reducing the hit time—Small and simple first-level caches and way-prediction. Both techniques also generally decrease power consumption.

2. Increasing cache bandwidth—Pipelined caches, multibanked caches, and nonblocking caches. These techniques have varying impacts on power consumption.


3. Reducing the miss penalty—Critical word first and merging write buffers. These optimizations have little impact on power.

4. Reducing the miss rate—Compiler optimizations. Obviously any improvement at compile time improves power consumption.

5. Reducing the miss penalty or miss rate via parallelism—Hardware prefetching and compiler prefetching. These optimizations generally increase power consumption, primarily because of prefetched data that are unused.

In general, the hardware complexity increases as we go through these optimizations. In addition, several of the optimizations require sophisticated compiler technology, and the final one depends on HBM. We will conclude with a summary of the implementation complexity and the performance benefits of the 10 techniques presented in Figure 2.18 on page 113. Because some of these are straightforward, we cover them briefly; others require more description.

First Optimization: Small and Simple First-Level Caches to Reduce Hit Time and Power

The pressure of both a fast clock cycle and power limitations encourages limited size for first-level caches. Similarly, use of lower levels of associativity can reduce both hit time and power, although such trade-offs are more complex than those involving size. The critical timing path in a cache hit is the three-step process of addressing the tag memory using the index portion of the address, comparing the read tag value to the address, and setting the multiplexor to choose the correct data item if the cache is set associative. Direct-mapped caches can overlap the tag check with the transmission of the data, effectively reducing hit time. Furthermore, lower levels of associativity will usually reduce power because fewer cache lines must be accessed.

Although the total amount of on-chip cache has increased dramatically with new generations of microprocessors, because of the clock rate impact arising from a larger L1 cache, the size of the L1 caches has recently increased either slightly or not at all. In many recent processors, designers have opted for more associativity rather than larger caches. An additional consideration in choosing the associativity is the possibility of eliminating address aliases; we discuss this topic shortly.

One approach to determining the impact on hit time and power consumption in advance of building a chip is to use CAD tools. CACTI is a program to estimate the access time and energy consumption of alternative cache structures on CMOS microprocessors within 10% of more detailed CAD tools. For a given minimum feature size, CACTI estimates the hit time of caches as a function of cache size, associativity, number of read/write ports, and more complex parameters. Figure 2.8 shows the estimated impact on hit time as cache size and associativity are varied. Depending on cache size, for these parameters, the model suggests that the hit time for direct mapped is slightly faster than two-way set associative and that two-way set associative is 1.2 times as fast as four-way and four-way is 1.4



[Figure 2.8 plot: relative access time versus cache size (16 KB–256 KB) for 1-way, 2-way, 4-way, and 8-way set associative caches.]

Figure 2.8 Relative access times generally increase as cache size and associativity are increased. These data come from the CACTI model 6.5 by Tarjan et al. (2005). The data assume typical embedded SRAM technology, a single bank, and 64-byte blocks. The assumptions about cache layout and the complex trade-offs between interconnect delays (that depend on the size of a cache block being accessed) and the cost of tag checks and multiplexing lead to results that are occasionally surprising, such as the lower access time for a 64 KiB with two-way set associativity versus direct mapping. Similarly, the results with eight-way set associativity generate unusual behavior as cache size is increased. Because such observations are highly dependent on technology and detailed design assumptions, tools such as CACTI serve to reduce the search space. These results are relative; nonetheless, they are likely to shift as we move to more recent and denser semiconductor technologies.

times as fast as eight-way. Of course, these estimates depend on technology as well as the size of the cache, and CACTI must be carefully aligned with the technology; Figure 2.8 shows the relative tradeoffs for one technology.

Example

Using the data in Figure B.8 in Appendix B and Figure 2.8, determine whether a 32 KiB four-way set associative L1 cache has a faster memory access time than a 32 KiB two-way set associative L1 cache. Assume the miss penalty to L2 is 15 times the access time for the faster L1 cache. Ignore misses beyond L2. Which has the faster average memory access time?

Answer

Let the access time for the two-way set associative cache be 1. Then, for the two-way cache,

Average memory access time_2-way = Hit time + Miss rate × Miss penalty = 1 + 0.038 × 15 = 1.57


For the four-way cache, the access time is 1.4 times longer. The elapsed time of the miss penalty is 15/1.4 ≈ 10.7. Assume 10 for simplicity:

Average memory access time_4-way = Hit time_2-way × 1.4 + Miss rate × Miss penalty = 1.4 + 0.037 × 10 = 1.77

Clearly, the higher associativity looks like a bad trade-off; however, because cache access in modern processors is often pipelined, the exact impact on the clock cycle time is difficult to assess. Energy consumption is also a consideration in choosing both the cache size and associativity, as Figure 2.9 shows. The energy cost of higher associativity ranges from more than a factor of 2 to negligible in caches of 128 or 256 KiB when going from direct mapped to two-way set associative. As energy consumption has become critical, designers have focused on ways to reduce the energy needed for cache access. In addition to associativity, the other key factor in determining the energy used in a cache access is the number of blocks in the cache because it determines the number of “rows” that are accessed. A designer could reduce the number of rows by increasing the block size (holding total cache size constant), but this could increase the miss rate, especially in smaller L1 caches.
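The arithmetic of the preceding example is easy to write down directly; the sketch below simply evaluates the average memory access time formula with the example's assumptions (times are in units of the two-way hit time).

#include <stdio.h>

/* Average memory access time = hit time + miss rate x miss penalty. */
static double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

int main(void) {
    /* Two-way: hit time 1.0, miss rate 3.8%, miss penalty 15 (the example's values). */
    printf("2-way AMAT = %.2f\n", amat(1.0, 0.038, 15.0));
    /* Four-way: hit time is 1.4x longer; the example re-expresses the miss
       penalty in the slower cycles (15/1.4, rounded to 10) and uses a 3.7% miss rate. */
    printf("4-way AMAT = %.2f\n", amat(1.4, 0.037, 10.0));
    return 0;
}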

[Figure 2.9 plot: relative energy per read in nanojoules versus cache size (16 KB–256 KB) for 1-way, 2-way, 4-way, and 8-way set associative caches.]

Figure 2.9 Energy consumption per read increases as cache size and associativity are increased. As in the previous figure, CACTI is used for the modeling with the same technology parameters. The large penalty for eight-way set associative caches is due to the cost of reading out eight tags and the corresponding data in parallel.


An alternative is to organize the cache in banks so that an access activates only a portion of the cache, namely the bank where the desired block resides. The primary use of multibanked caches is to increase the bandwidth of the cache, an optimization we consider shortly. Multibanking also reduces energy because less of the cache is accessed. The L3 caches in many multicores are logically unified, but physically distributed, and effectively act as a multibanked cache. Based on the address of a request, only one of the physical L3 caches (a bank) is actually accessed. We discuss this organization further in Chapter 5. In recent designs, there are three other factors that have led to the use of higher associativity in first-level caches despite the energy and access time costs. First, many processors take at least 2 clock cycles to access the cache and thus the impact of a longer hit time may not be critical. Second, to keep the TLB out of the critical path (a delay that would be larger than that associated with increased associativity), almost all L1 caches should be virtually indexed. This limits the size of the cache to the page size times the associativity because then only the bits within the page are used for the index. There are other solutions to the problem of indexing the cache before address translation is completed, but increasing the associativity, which also has other benefits, is the most attractive. Third, with the introduction of multithreading (see Chapter 3), conflict misses can increase, making higher associativity more attractive.
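The virtual-indexing constraint mentioned above is a one-line calculation: the index and block-offset bits must fit within the page offset, so the cache can be no larger than the page size times the associativity. A minimal sketch (the 4 KiB page and 8-way associativity are illustrative assumptions):

#include <stdio.h>

/* Largest L1 that can be virtually indexed without aliasing: all index and
   block-offset bits must lie inside the page offset, so
   cache size <= page size x associativity. */
static long max_vipt_l1_bytes(long page_size, int associativity) {
    return page_size * associativity;
}

int main(void) {
    /* 4 KiB pages and an 8-way set associative cache give a 32 KiB limit,
       which matches the L1 sizes of many recent processors. */
    printf("max VIPT L1 = %ld KiB\n", max_vipt_l1_bytes(4096, 8) / 1024);
    return 0;
}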

Second Optimization: Way Prediction to Reduce Hit Time

Another approach reduces conflict misses and yet maintains the hit speed of direct-mapped cache. In way prediction, extra bits are kept in the cache to predict the way (or block within the set) of the next cache access. This prediction means the multiplexor is set early to select the desired block, and in that clock cycle, only a single tag comparison is performed in parallel with reading the cache data. A miss results in checking the other blocks for matches in the next clock cycle.

Added to each block of a cache are block predictor bits. The bits select which of the blocks to try on the next cache access. If the predictor is correct, the cache access latency is the fast hit time. If not, it tries the other block, changes the way predictor, and has a latency of one extra clock cycle. Simulations suggest that set prediction accuracy is in excess of 90% for a two-way set associative cache and 80% for a four-way set associative cache, with better accuracy on I-caches than D-caches. Way prediction yields lower average memory access time for a two-way set associative cache if it is at least 10% faster, which is quite likely. Way prediction was first used in the MIPS R10000 in the mid-1990s. It is popular in processors that use two-way set associativity and was used in several ARM processors, which have four-way set associative caches. For very fast processors, it may be challenging to implement the one-cycle stall that is critical to keeping the way prediction penalty small.

An extended form of way prediction can also be used to reduce power consumption by using the way prediction bits to decide which cache block to actually


access (the way prediction bits are essentially extra address bits); this approach, which might be called way selection, saves power when the way prediction is correct but adds significant time on a way misprediction, because the access, not just the tag match and selection, must be repeated. Such an optimization is likely to make sense only in low-power processors. Inoue et al. (1999) estimated that using the way selection approach with a four-way set associative cache increases the average access time for the I-cache by a factor of 1.04 and for the D-cache by a factor of 1.13 on the SPEC95 benchmarks, but it yields an average cache power consumption relative to a normal four-way set associative cache that is 0.28 for the I-cache and 0.35 for the D-cache. One significant drawback for way selection is that it makes it difficult to pipeline the cache access; however, as energy concerns have mounted, schemes that do not require powering up the entire cache make increasing sense.

Example

Assume that there are half as many D-cache accesses as I-cache accesses and that the I-cache and D-cache are responsible for 25% and 15% of the processor’s power consumption in a normal four-way set associative implementation. Determine if way selection improves performance per watt based on the estimates from the preceding study.

Answer

For the I-cache, the savings in power is 0.25 × 0.28 = 0.07 of the total power, while for the D-cache it is 0.15 × 0.35 = 0.05 for a total savings of 0.12. The way prediction version requires 0.88 of the power requirement of the standard four-way cache. The increase in cache access time is the increase in I-cache average access time plus one-half the increase in D-cache access time, or 1.04 + 0.5 × 0.13 = 1.11 times longer. This result means that way selection has 0.90 of the performance of a standard four-way cache. Thus way selection improves performance per joule very slightly by a ratio of 0.90/0.88 = 1.02. This optimization is best used where power rather than performance is the key objective.

Third Optimization: Pipelined Access and Multibanked Caches to Increase Bandwidth

These optimizations increase cache bandwidth either by pipelining the cache access or by widening the cache with multiple banks to allow multiple accesses per clock; these optimizations are the dual to the superpipelined and superscalar approaches to increasing instruction throughput. These optimizations are primarily targeted at L1, where access bandwidth constrains instruction throughput. Multiple banks are also used in L2 and L3 caches, but primarily as a power-management technique.

Pipelining L1 allows a higher clock rate, at the cost of increased latency. For example, the pipeline for the instruction cache access for Intel Pentium processors in the mid-1990s took 1 clock cycle; for the Pentium Pro through Pentium III in the mid-1990s through 2000, it took 2 clock cycles; and for the Pentium 4, which became available in 2000, and the current Intel Core i7, it takes 4 clock cycles. Pipelining the instruction cache effectively increases the number of pipeline stages,


Bank 0: block addresses 0, 4, 8, 12
Bank 1: block addresses 1, 5, 9, 13
Bank 2: block addresses 2, 6, 10, 14
Bank 3: block addresses 3, 7, 11, 15

Figure 2.10 Four-way interleaved cache banks using block addressing. Assuming 64 bytes per block, each of these addresses would be multiplied by 64 to get byte addressing.

leading to a greater penalty on mispredicted branches. Correspondingly, pipelining the data cache leads to more clock cycles between issuing the load and using the data (see Chapter 3). Today, all processors use some pipelining of L1, if only for the simple case of separating the access and hit detection, and many high-speed processors have three or more levels of cache pipelining. It is easier to pipeline the instruction cache than the data cache because the processor can rely on high performance branch prediction to limit the latency effects. Many superscalar processors can issue and execute more than one memory reference per clock (allowing a load or store is common, and some processors allow multiple loads). To handle multiple data cache accesses per clock, we can divide the cache into independent banks, each supporting an independent access. Banks were originally used to improve performance of main memory and are now used inside modern DRAM chips as well as with caches. The Intel Core i7 has four banks in L1 (to support up to 2 memory accesses per clock). Clearly, banking works best when the accesses naturally spread themselves across the banks, so the mapping of addresses to banks affects the behavior of the memory system. A simple mapping that works well is to spread the addresses of the block sequentially across the banks, which is called sequential interleaving. For example, if there are four banks, bank 0 has all blocks whose address modulo 4 is 0, bank 1 has all blocks whose address modulo 4 is 1, and so on. Figure 2.10 shows this interleaving. Multiple banks also are a way to reduce power consumption in both caches and DRAM. Multiple banks are also useful in L2 or L3 caches, but for a different reason. With multiple banks in L2, we can handle more than one outstanding L1 miss, if the banks do not conflict. This is a key capability to support nonblocking caches, our next optimization. The L2 in the Intel Core i7 has eight banks, while Arm Cortex processors have used L2 caches with 1–4 banks. As mentioned earlier, multibanking can also reduce energy consumption.
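A minimal sketch of the sequential interleaving of Figure 2.10 (function and constant names are ours): the bank is simply the block address modulo the number of banks.

#include <stdio.h>

#define BLOCK_SIZE 64   /* bytes per block, as in Figure 2.10 */
#define NUM_BANKS   4

/* Sequential interleaving: consecutive block addresses rotate across banks. */
static unsigned bank_of(unsigned long byte_addr) {
    unsigned long block_addr = byte_addr / BLOCK_SIZE;
    return (unsigned)(block_addr % NUM_BANKS);
}

int main(void) {
    for (unsigned long block = 0; block < 8; block++)
        printf("block %lu (byte address %lu) -> bank %u\n",
               block, block * BLOCK_SIZE, bank_of(block * BLOCK_SIZE));
    return 0;
}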

Fourth Optimization: Nonblocking Caches to Increase Cache Bandwidth

For pipelined computers that allow out-of-order execution (discussed in Chapter 3), the processor need not stall on a data cache miss. For example, the processor could


continue fetching instructions from the instruction cache while waiting for the data cache to return the missing data. A nonblocking cache or lockup-free cache escalates the potential benefits of such a scheme by allowing the data cache to continue to supply cache hits during a miss. This “hit under miss” optimization reduces the effective miss penalty by being helpful during a miss instead of ignoring the requests of the processor. A subtle and complex option is that the cache may further lower the effective miss penalty if it can overlap multiple misses: a “hit under multiple miss” or “miss under miss” optimization. The second option is beneficial only if the memory system can service multiple misses; most high-performance processors (such as the Intel Core processors) usually support both, whereas many lower-end processors provide only limited nonblocking support in L2. To examine the effectiveness of nonblocking caches in reducing the cache miss penalty, Farkas and Jouppi (1994) did a study assuming 8 KiB caches with a 14-cycle miss penalty (appropriate for the early 1990s). They observed a reduction in the effective miss penalty of 20% for the SPECINT92 benchmarks and 30% for the SPECFP92 benchmarks when allowing one hit under miss. Li et al. (2011) updated this study to use a multilevel cache, more modern assumptions about miss penalties, and the larger and more demanding SPECCPU2006 benchmarks. The study was done assuming a model based on a single core of an Intel i7 (see Section 2.6) running the SPECCPU2006 benchmarks. Figure 2.11 shows the reduction in data cache access latency when allowing 1, 2, and 64 hits under a miss; the caption describes further details of the memory system. The larger caches and the addition of an L3 cache since the earlier study have reduced the benefits with the SPECINT2006 benchmarks showing an average reduction in cache latency of about 9% and the SPECFP2006 benchmarks about 12.5%.

Example

Which is more important for floating-point programs: two-way set associativity or hit under one miss for the primary data caches? What about integer programs? Assume the following average miss rates for 32 KiB data caches: 5.2% for floating-point programs with a direct-mapped cache, 4.9% for the programs with a two-way set associative cache, 3.5% for integer programs with a direct-mapped cache, and 3.2% for integer programs with a two-way set associative cache. Assume the miss penalty to L2 is 10 cycles, and the L2 misses and penalties are the same.

Answer

For floating-point programs, the average memory stall times are

Miss rate_DM × Miss penalty = 5.2% × 10 = 0.52
Miss rate_2-way × Miss penalty = 4.9% × 10 = 0.49

The cache access latency (including stalls) for two-way associativity is 0.49/0.52 or 94% of direct-mapped cache. Figure 2.11 caption indicates that a hit under one miss reduces the average data cache access latency for floating-point programs to 87.5% of a blocking cache. Therefore, for floating-point programs, the

[Figure 2.11 plot: data cache access latency relative to a blocking cache (40%–100%) for hit-under-1-miss, hit-under-2-misses, and hit-under-64-misses, across the SPECINT benchmarks bzip2, gcc, mcf, hmmer, sjeng, libquantum, h264ref, omnetpp, and astar and the SPECFP benchmarks gamess, zeusmp, milc, gromacs, cactusADM, namd, soplex, povray, calculix, GemsFDTD, tonto, lbm, wrf, and sphinx3.]

Figure 2.11 The effectiveness of a nonblocking cache is evaluated by allowing 1, 2, or 64 hits under a cache miss with 9 SPECINT (on the left) and 9 SPECFP (on the right) benchmarks. The data memory system modeled after the Intel i7 consists of a 32 KiB L1 cache with a four-cycle access latency. The L2 cache (shared with instructions) is 256 KiB with a 10-clock cycle access latency. The L3 is 2 MiB and a 36-cycle access latency. All the caches are eight-way set associative and have a 64-byte block size. Allowing one hit under miss reduces the miss penalty by 9% for the integer benchmarks and 12.5% for the floating point. Allowing a second hit improves these results to 10% and 16%, and allowing 64 results in little additional improvement.

direct-mapped data cache supporting one hit under one miss gives better performance than a two-way set-associative cache that blocks on a miss. For integer programs, the calculation is

Miss rate_DM × Miss penalty = 3.5% × 10 = 0.35
Miss rate_2-way × Miss penalty = 3.2% × 10 = 0.32

The data cache access latency of a two-way set associative cache is thus 0.32/0.35 or 91% of direct-mapped cache, while the reduction in access latency when allowing a hit under one miss is 9%, making the two choices about equal. The real difficulty with performance evaluation of nonblocking caches is that a cache miss does not necessarily stall the processor. In this case, it is difficult to judge the impact of any single miss and thus to calculate the average memory access time. The effective miss penalty is not the sum of the misses but the nonoverlapped time that the processor is stalled. The benefit of nonblocking caches is complex, as it depends upon the miss penalty when there are multiple misses, the memory reference pattern, and how many instructions the processor can execute with a miss outstanding. In general, out-of-order processors are capable of hiding much of the miss penalty of an L1 data cache miss that hits in the L2 cache but are not capable


of hiding a significant fraction of a lower-level cache miss. Deciding how many outstanding misses to support depends on a variety of factors:

■ The temporal and spatial locality in the miss stream, which determines whether a miss can initiate a new access to a lower-level cache or to memory.

■ The bandwidth of the responding memory or cache.

■ Allowing more outstanding misses at the lowest level of the cache (where the miss time is the longest) requires supporting at least that many misses at a higher level, because the miss must initiate at the highest level cache.

■ The latency of the memory system.

The following simplified example illustrates the key idea.

Example

Assume a main memory access time of 36 ns and a memory system capable of a sustained transfer rate of 16 GiB/s. If the block size is 64 bytes, what is the maximum number of outstanding misses that we need to support, assuming that we can maintain the peak bandwidth given the request stream and that accesses never conflict? If the probability of a reference colliding with one of the previous four is 50%, and we assume that the access has to wait until the earlier access completes, estimate the maximum number of outstanding references. For simplicity, ignore the time between misses.

Answer

In the first case, assuming that we can maintain the peak bandwidth, the memory system can support (16 × 10^9)/64 = 250 million references per second. Because each reference takes 36 ns, we can support 250 × 10^6 × 36 × 10^-9 = 9 references. If the probability of a collision is greater than 0, then we need more outstanding references, because we cannot start work on those colliding references; the memory system needs more independent references, not fewer! To approximate, we can simply assume that half the memory references do not have to be issued to the memory. This means that we must support twice as many outstanding references, or 18.

In Li, Chen, Brockman, and Jouppi's study, they found that the reduction in CPI for the integer programs was about 7% for one hit under miss and about 12.7% for 64. For the floating-point programs, the reductions were 12.7% for one hit under miss and 17.8% for 64. These reductions track fairly closely the reductions in the data cache access latency shown in Figure 2.11.
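The example's Little's Law-style arithmetic, written as a sketch (the constants are the example's assumptions):

#include <stdio.h>

int main(void) {
    double latency_s = 36e-9;          /* main memory access time */
    double bandwidth = 16e9;           /* sustained bytes per second */
    double block     = 64.0;           /* bytes fetched per miss */

    /* Misses per second needed to keep the memory system at peak bandwidth. */
    double refs_per_s = bandwidth / block;              /* 250 million */
    /* Each miss is outstanding for latency_s, so the number in flight at any
       moment is rate x latency. */
    double outstanding = refs_per_s * latency_s;        /* 9 */
    printf("independent misses needed: %.0f\n", outstanding);
    /* If half the references collide with an earlier access and must wait, we
       need roughly twice as many outstanding references to keep the pipe full. */
    printf("with 50%% collisions: %.0f\n", 2.0 * outstanding);
    return 0;
}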

Implementing a Nonblocking Cache

Although nonblocking caches have the potential to improve performance, they are nontrivial to implement. Two initial types of challenges arise: arbitrating contention between hits and misses, and tracking outstanding misses so that we know when loads or stores can proceed. Consider the first problem. In a blocking cache, misses cause the processor to stall and no further accesses to the cache will occur


until the miss is handled. In a nonblocking cache, however, hits can collide with misses returning from the next level of the memory hierarchy. If we allow multiple outstanding misses, which almost all recent processors do, it is even possible for misses to collide. These collisions must be resolved, usually by first giving priority to hits over misses, and second by ordering colliding misses (if they can occur). The second problem arises because we need to track multiple outstanding misses. In a blocking cache, we always know which miss is returning, because only one can be outstanding. In a nonblocking cache, this is rarely true. At first glance, you might think that misses always return in order, so that a simple queue could be kept to match a returning miss with the longest outstanding request. Consider, however, a miss that occurs in L1. It may generate either a hit or miss in L2; if L2 is also nonblocking, then the order in which misses are returned to L1 will not necessarily be the same as the order in which they originally occurred. Multicore and other multiprocessor systems that have nonuniform cache access times also introduce this complication. When a miss returns, the processor must know which load or store caused the miss, so that instruction can now go forward; and it must know where in the cache the data should be placed (as well as the setting of tags for that block). In recent processors, this information is kept in a set of registers, typically called the Miss Status Handling Registers (MSHRs). If we allow n outstanding misses, there will be n MSHRs, each holding the information about where a miss goes in the cache and the value of any tag bits for that miss, as well as the information indicating which load or store caused the miss (in the next chapter, you will see how this is tracked). Thus, when a miss occurs, we allocate an MSHR for handling that miss, enter the appropriate information about the miss, and tag the memory request with the index of the MSHR. The memory system uses that tag when it returns the data, allowing the cache system to transfer the data and tag information to the appropriate cache block and “notify” the load or store that generated the miss that the data is now available and that it can resume operation. Nonblocking caches clearly require extra logic and thus have some cost in energy. It is difficult, however, to assess their energy costs exactly because they may reduce stall time, thereby decreasing execution time and resulting energy consumption. In addition to the preceding issues, multiprocessor memory systems, whether within a single chip or on multiple chips, must also deal with complex implementation issues related to memory coherency and consistency. Also, because cache misses are no longer atomic (because the request and response are split and may be interleaved among multiple requests), there are possibilities for deadlock. For the interested reader, Section I.7 in online Appendix I deals with these issues in detail.
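As a rough sketch of the bookkeeping described above, an MSHR might hold fields like the following; the structure and field names are illustrative, not taken from any particular design.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_MSHRS 16   /* number of outstanding misses supported (assumed) */

/* One Miss Status Handling Register: enough state to place the returning data
   in the cache and to wake up the load or store that caused the miss. */
typedef struct {
    bool     valid;       /* entry in use */
    uint64_t block_addr;  /* address (tag + index) of the missing block */
    unsigned cache_way;   /* where in the set the returning block will be filled */
    unsigned rob_entry;   /* which load/store to notify (tracked as in Chapter 3) */
} mshr_t;

static mshr_t mshr_table[NUM_MSHRS];

/* On a miss: allocate an MSHR and tag the memory request with its index.
   Returns -1 if all MSHRs are busy, in which case the miss must stall. */
static int mshr_allocate(uint64_t block_addr, unsigned way, unsigned rob_entry) {
    for (int i = 0; i < NUM_MSHRS; i++)
        if (!mshr_table[i].valid) {
            mshr_table[i] = (mshr_t){ true, block_addr, way, rob_entry };
            return i;
        }
    return -1;
}

/* When the memory system returns data tagged with 'index', the fill logic uses
   the MSHR contents to install the block and resume the waiting instruction. */
static mshr_t mshr_complete(int index) {
    mshr_t m = mshr_table[index];
    mshr_table[index].valid = false;
    return m;
}

int main(void) {
    int id = mshr_allocate(0x4000, 2, 17);
    mshr_t done = mshr_complete(id);
    printf("MSHR %d: fill block 0x%llx into way %u, wake ROB entry %u\n",
           id, (unsigned long long)done.block_addr, done.cache_way, done.rob_entry);
    return 0;
}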

Fifth Optimization: Critical Word First and Early Restart to Reduce Miss Penalty

This technique is based on the observation that the processor normally needs just one word of the block at a time. This strategy is impatience: don't wait for the full


block to be loaded before sending the requested word and restarting the processor. Here are two specific strategies:

■ Critical word first—Request the missed word first from memory and send it to the processor as soon as it arrives; let the processor continue execution while filling the rest of the words in the block.

■ Early restart—Fetch the words in normal order, but as soon as the requested word of the block arrives, send it to the processor and let the processor continue execution.

These techniques generally benefit only designs with large cache blocks, because with a small block there is little time to be saved by not waiting for the rest of it. Note that caches normally continue to satisfy accesses to other blocks while the rest of the block is being filled. However, given spatial locality, there is a good chance that the next reference is to the rest of the block. Just as with nonblocking caches, the miss penalty is not simple to calculate. When there is a second request in critical word first, the effective miss penalty is the nonoverlapped time from the reference until the second piece arrives. The benefits of critical word first and early restart depend on the size of the block and the likelihood of another access to the portion of the block that has not yet been fetched. For example, for SPECint2006 running on the i7 6700, which uses early restart and critical word first, there is more than one reference made to a block with an outstanding miss (1.23 references on average with a range from 0.5 to 3.0). We explore the performance of the i7 memory hierarchy in more detail in Section 2.6.
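A back-of-the-envelope model shows why these techniques pay off only with large blocks. The sketch below (all parameters are assumptions for illustration) compares the cycles until restart with and without critical word first for a fixed miss latency and an 8-bytes-per-cycle refill path.

#include <stdio.h>

/* Cycles until the processor can restart, with and without critical word
   first, for a block refilled 8 bytes per cycle after a fixed memory latency. */
static unsigned restart_plain(unsigned latency, unsigned block_bytes) {
    return latency + block_bytes / 8;      /* wait for the whole block */
}
static unsigned restart_cwf(unsigned latency) {
    return latency + 1;                    /* requested word is delivered first */
}

int main(void) {
    unsigned latency = 30;                 /* assumed miss latency in cycles */
    unsigned sizes[] = { 32, 64, 128 };
    for (int i = 0; i < 3; i++)
        printf("%3u-byte block: %u cycles without CWF, %u with\n",
               sizes[i], restart_plain(latency, sizes[i]), restart_cwf(latency));
    return 0;
}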

Sixth Optimization: Merging Write Buffer to Reduce Miss Penalty

Write-through caches rely on write buffers, as all stores must be sent to the next lower level of the hierarchy. Even write-back caches use a simple buffer when a block is replaced. If the write buffer is empty, the data and the full address are written in the buffer, and the write is finished from the processor's perspective; the processor continues working while the write buffer prepares to write the word to memory. If the buffer contains other modified blocks, the addresses can be checked to see if the address of the new data matches the address of a valid write buffer entry. If so, the new data are combined with that entry. Write merging is the name of this optimization. The Intel Core i7, among many others, uses write merging.

If the buffer is full and there is no address match, the cache (and processor) must wait until the buffer has an empty entry. This optimization uses the memory more efficiently because multiword writes are usually faster than writes performed one word at a time. Skadron and Clark (1997) found that even a merging four-entry write buffer generated stalls that led to a 5%–10% performance loss.


Without write merging (each store occupies its own entry):

Write address 100: V=1 Mem[100] | V=0 | V=0 | V=0
Write address 108: V=1 Mem[108] | V=0 | V=0 | V=0
Write address 116: V=1 Mem[116] | V=0 | V=0 | V=0
Write address 124: V=1 Mem[124] | V=0 | V=0 | V=0

With write merging (the four stores share one entry):

Write address 100: V=1 Mem[100] | V=1 Mem[108] | V=1 Mem[116] | V=1 Mem[124]
(the three remaining entries are empty, all V=0)

Figure 2.12 In this illustration of write merging, the write buffer on top does not use write merging while the write buffer on the bottom does. The four writes are merged into a single buffer entry with write merging; without it, the buffer is full even though three-fourths of each entry is wasted. The buffer has four entries, and each entry holds four 64-bit words. The address for each entry is on the left, with a valid bit (V) indicating whether the next sequential 8 bytes in this entry are occupied. (Without write merging, the words to the right in the upper part of the figure would be used only for instructions that wrote multiple words at the same time.)

The optimization also reduces stalls because of the write buffer being full. Figure 2.12 shows a write buffer with and without write merging. Assume we had four entries in the write buffer, and each entry could hold four 64-bit words. Without this optimization, four stores to sequential addresses would fill the buffer at one word per entry, even though these four words when merged fit exactly within a single entry of the write buffer. Note that input/output device registers are often mapped into the physical address space. These I/O addresses cannot allow write merging because separate I/O registers may not act like an array of words in memory. For example, they may require one address and data word per I/O register rather than use multiword writes using a single address. These side effects are typically implemented by marking the pages as requiring nonmerging write through by the caches.
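A sketch of the merging check for a buffer like the one in Figure 2.12 (four entries of four 64-bit words); the data structure and function names are our own simplification, not an actual controller design.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define ENTRIES          4
#define WORDS_PER_ENTRY  4                 /* four 64-bit words = 32 bytes */
#define ENTRY_BYTES      (WORDS_PER_ENTRY * 8)

typedef struct {
    bool     valid;                        /* entry holds a pending write */
    uint64_t base;                         /* address of the entry's first word */
    bool     word_valid[WORDS_PER_ENTRY];  /* the V bits of Figure 2.12 */
    uint64_t data[WORDS_PER_ENTRY];
} wb_entry_t;

static wb_entry_t buffer[ENTRIES];

/* Try to place a 64-bit store in the write buffer, merging with an existing
   entry when the address falls in the same aligned 32-byte region.
   Returns false when the buffer is full, i.e., the processor must stall. */
static bool write_buffer_insert(uint64_t addr, uint64_t value) {
    uint64_t base = addr & ~(uint64_t)(ENTRY_BYTES - 1);
    unsigned word = (unsigned)((addr - base) / 8);

    for (int i = 0; i < ENTRIES; i++)      /* merge with a matching entry */
        if (buffer[i].valid && buffer[i].base == base) {
            buffer[i].word_valid[word] = true;
            buffer[i].data[word] = value;
            return true;
        }
    for (int i = 0; i < ENTRIES; i++)      /* otherwise take a free entry */
        if (!buffer[i].valid) {
            buffer[i].valid = true;
            buffer[i].base = base;
            buffer[i].word_valid[word] = true;
            buffer[i].data[word] = value;
            return true;
        }
    return false;                          /* full: wait for a writeback */
}

int main(void) {
    /* Four stores to sequential 8-byte addresses (the situation of Figure 2.12)
       merge into a single buffer entry instead of occupying all four. */
    uint64_t addrs[] = { 0x100, 0x108, 0x110, 0x118 };
    for (int i = 0; i < 4; i++)
        write_buffer_insert(addrs[i], (uint64_t)i);
    int used = 0;
    for (int i = 0; i < ENTRIES; i++)
        if (buffer[i].valid) used++;
    printf("entries used: %d of %d\n", used, ENTRIES);   /* prints 1 of 4 */
    return 0;
}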


Seventh Optimization: Compiler Optimizations to Reduce Miss Rate

Thus far, our techniques have required changing the hardware. This next technique reduces miss rates without any hardware changes. This magical reduction comes from optimized software—the hardware designer's favorite solution! The increasing performance gap between processors and main memory has inspired compiler writers to scrutinize the memory hierarchy to see if compile time optimizations can improve performance. Once again, research is split between improvements in instruction misses and improvements in data misses. The optimizations presented next are found in many modern compilers.

Loop Interchange

Some programs have nested loops that access data in memory in nonsequential order. Simply exchanging the nesting of the loops can make the code access the data in the order in which they are stored. Assuming the arrays do not fit in the cache, this technique reduces misses by improving spatial locality; reordering maximizes use of data in a cache block before they are discarded. For example, if x is a two-dimensional array of size [5000,100] allocated so that x[i,j] and x[i,j + 1] are adjacent (an order called row major because the array is laid out by rows), then the two pieces of the following code show how the accesses can be optimized:

/* Before */
for (j = 0; j < 100; j = j + 1)
    for (i = 0; i < 5000; i = i + 1)
        x[i][j] = 2 * x[i][j];

/* After */
for (i = 0; i < 5000; i = i + 1)
    for (j = 0; j < 100; j = j + 1)
        x[i][j] = 2 * x[i][j];

The original code would skip through memory in strides of 100 words, while the revised version accesses all the words in one cache block before going to the next block. This optimization improves cache performance without affecting the number of instructions executed.

Blocking

This optimization improves temporal locality to reduce misses. We are again dealing with multiple arrays, with some arrays accessed by rows and some by columns. Storing the arrays row by row (row major order) or column by column (column major order) does not solve the problem because both rows and columns are used in every loop iteration. Such orthogonal accesses mean that transformations such as loop interchange still leave plenty of room for improvement.


Figure 2.13 A snapshot of the three arrays x, y, and z when N = 6 and i = 1. The age of accesses to the array elements is indicated by shade: white means not yet touched, light means older accesses, and dark means newer accesses. The elements of y and z are read repeatedly to calculate new elements of x. The variables i, j, and k are shown along the rows or columns used to access the arrays.

Instead of operating on entire rows or columns of an array, blocked algorithms operate on submatrices or blocks. The goal is to maximize accesses to the data loaded into the cache before the data are replaced. The following code example, which performs matrix multiplication, helps motivate the optimization:

/* Before */
for (i = 0; i < N; i = i + 1)
    for (j = 0; j < N; j = j + 1)
        {r = 0;
         for (k = 0; k < N; k = k + 1)
             r = r + y[i][k]*z[k][j];
         x[i][j] = r;
        };

The two inner loops read all N-by-N elements of z, read the same N elements in a row of y repeatedly, and write one row of N elements of x. Figure 2.13 gives a snapshot of the accesses to the three arrays. A dark shade indicates a recent access, a light shade indicates an older access, and white means not yet accessed.

The number of capacity misses clearly depends on N and the size of the cache. If it can hold all three N-by-N matrices, then all is well, provided there are no cache conflicts. If the cache can hold one N-by-N matrix and one row of N, then at least the ith row of y and the array z may stay in the cache. Less than that and misses may occur for both x and z. In the worst case, there would be 2N^3 + N^2 memory words accessed for N^3 operations.

To ensure that the elements being accessed can fit in the cache, the original code is changed to compute on a submatrix of size B by B. Two inner loops now compute in steps of size B rather than the full length of x and z. B is called the blocking factor. (Assume x is initialized to zero.)


/* After */
for (jj = 0; jj < N; jj = jj + B)
    for (kk = 0; kk < N; kk = kk + B)
        for (i = 0; i < N; i = i + 1)
            for (j = jj; j < min(jj + B,N); j = j + 1)
                {r = 0;
                 for (k = kk; k < min(kk + B,N); k = k + 1)
                     r = r + y[i][k]*z[k][j];
                 x[i][j] = x[i][j] + r;
                };

Figure 2.14 illustrates the accesses to the three arrays using blocking. Looking only at capacity misses, the total number of memory words accessed is 2N^3/B + N^2. This total is an improvement by an approximate factor of B. Therefore blocking exploits a combination of spatial and temporal locality, because y benefits from spatial locality and z benefits from temporal locality. Although our example uses a square block (B × B), we could also use a rectangular block, which would be necessary if the matrix were not square.

Although we have aimed at reducing cache misses, blocking can also be used to help register allocation. By taking a small blocking size such that the block can be held in registers, we can minimize the number of loads and stores in the program. As we shall see in Section 4.8 of Chapter 4, cache blocking is absolutely necessary to get good performance from cache-based processors running applications using matrices as the primary data structure.

Eighth Optimization: Hardware Prefetching of Instructions and Data to Reduce Miss Penalty or Miss Rate

Nonblocking caches effectively reduce the miss penalty by overlapping execution with memory access. Another approach is to prefetch items before the processor requests them. Both instructions and data can be prefetched, either directly into


Figure 2.14 The age of accesses to the arrays x, y, and z when B = 3. Note that, in contrast to Figure 2.13, a smaller number of elements is accessed.


the caches or into an external buffer that can be more quickly accessed than main memory. Instruction prefetch is frequently done in hardware outside of the cache. Typically, the processor fetches two blocks on a miss: the requested block and the next consecutive block. The requested block is placed in the instruction cache when it returns, and the prefetched block is placed in the instruction stream buffer. If the requested block is present in the instruction stream buffer, the original cache request is canceled, the block is read from the stream buffer, and the next prefetch request is issued.

A similar approach can be applied to data accesses (Jouppi, 1990). Palacharla and Kessler (1994) looked at a set of scientific programs and considered multiple stream buffers that could handle either instructions or data. They found that eight stream buffers could capture 50%–70% of all misses from a processor with two 64 KiB four-way set associative caches, one for instructions and the other for data.

The Intel Core i7 supports hardware prefetching into both L1 and L2, with the most common case of prefetching being accessing the next line. Some earlier Intel processors used more aggressive hardware prefetching, but that resulted in reduced performance for some applications, causing some sophisticated users to turn off the capability.

Figure 2.15 shows the overall performance improvement for a subset of SPEC2000 programs when hardware prefetching is turned on. Note that this figure

[Figure 2.15 bar chart: performance improvement from hardware prefetching, roughly 1.16 to 1.97, for gap and mcf (SPECint2000) and fam3d, wupwise, galgel, facerec, swim, applu, lucas, mgrid, and equake (SPECfp2000).]

Figure 2.15 Speedup because of hardware prefetching on Intel Pentium 4 with hardware prefetching turned on for 2 of 12 SPECint2000 benchmarks and 9 of 14 SPECfp2000 benchmarks. Only the programs that benefit the most from prefetching are shown; prefetching speeds up the missing 15 SPECCPU benchmarks by less than 15% (Boggs et al., 2004).


includes only 2 of 12 integer programs, while it includes the majority of the SPECCPU floating-point programs. We will return to our evaluation of prefetching on the i7 in Section 2.6. Prefetching relies on utilizing memory bandwidth that otherwise would be unused, but if it interferes with demand misses, it can actually lower performance. Help from compilers can reduce useless prefetching. When prefetching works well, its impact on power is negligible. When prefetched data are not used or useful data are displaced, prefetching will have a very negative impact on power.

Ninth Optimization: Compiler-Controlled Prefetching to Reduce Miss Penalty or Miss Rate

An alternative to hardware prefetching is for the compiler to insert prefetch instructions to request data before the processor needs it. There are two flavors of prefetch:

■ Register prefetch loads the value into a register.

■ Cache prefetch loads data only into the cache and not the register.

Either of these can be faulting or nonfaulting; that is, the address does or does not cause an exception for virtual address faults and protection violations. Using this terminology, a normal load instruction could be considered a “faulting register prefetch instruction.” Nonfaulting prefetches simply turn into no-ops if they would normally result in an exception, which is what we want. The most effective prefetch is “semantically invisible” to a program: it doesn’t change the contents of registers and memory, and it cannot cause virtual memory faults. Most processors today offer nonfaulting cache prefetches. This section assumes nonfaulting cache prefetch, also called nonbinding prefetch. Prefetching makes sense only if the processor can proceed while prefetching the data; that is, the caches do not stall but continue to supply instructions and data while waiting for the prefetched data to return. As you would expect, the data cache for such computers is normally nonblocking. Like hardware-controlled prefetching, the goal is to overlap execution with the prefetching of data. Loops are the important targets because they lend themselves to prefetch optimizations. If the miss penalty is small, the compiler just unrolls the loop once or twice, and it schedules the prefetches with the execution. If the miss penalty is large, it uses software pipelining (see Appendix H) or unrolls many times to prefetch data for a future iteration. Issuing prefetch instructions incurs an instruction overhead, however, so compilers must take care to ensure that such overheads do not exceed the benefits. By concentrating on references that are likely to be cache misses, programs can avoid unnecessary prefetches while improving average memory access time significantly.
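As a concrete illustration of a nonbinding cache prefetch (our example, not the book's): GCC and Clang expose such prefetches through the __builtin_prefetch intrinsic, which compiles to a nonfaulting prefetch instruction on targets that support one and to nothing otherwise. The loop below prefetches an assumed eight iterations ahead; the distance and the function are ours.

    #include <stddef.h>

    /* Sum an array while prefetching ahead of use. The arguments to
       __builtin_prefetch are the address, 0 for a read, and 3 for high temporal
       locality; the prefetch is nonbinding and nonfaulting, so it changes no
       registers or memory and is dropped if the address would fault. */
    double sum(const double *a, size_t n) {
        const size_t dist = 8;                 /* assumed prefetch distance (iterations) */
        double s = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + dist < n)
                __builtin_prefetch(&a[i + dist], 0, 3);
            s += a[i];
        }
        return s;
    }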


Example

For the following code, determine which accesses are likely to cause data cache misses. Next, insert prefetch instructions to reduce misses. Finally, calculate the number of prefetch instructions executed and the misses avoided by prefetching. Let’s assume we have an 8 KiB direct-mapped data cache with 16-byte blocks, and it is a write-back cache that does write allocate. The elements of a and b are 8 bytes long because they are double-precision floating-point arrays. There are 3 rows and 100 columns for a and 101 rows and 3 columns for b. Let’s also assume they are not in the cache at the start of the program.

for (i = 0; i < 3; i = i + 1)
  for (j = 0; j < 100; j = j + 1)
    a[i][j] = b[j][0] * b[j + 1][0];

Answer

The compiler will first determine which accesses are likely to cause cache misses; otherwise, we will waste time on issuing prefetch instructions for data that would be hits. Elements of a are written in the order that they are stored in memory, so a will benefit from spatial locality: The even values of j will miss and the odd values will hit. Because a has 3 rows and 100 columns, its accesses will lead to 3 × (100/2), or 150 misses. The array b does not benefit from spatial locality because the accesses are not in the order it is stored. The array b does benefit twice from temporal locality: the same elements are accessed for each iteration of i, and each iteration of j uses the same value of b as the last iteration. Ignoring potential conflict misses, the misses because of b will be for b[j + 1][0] accesses when i = 0, and also the first access to b[j][0] when j = 0. Because j goes from 0 to 99 when i = 0, accesses to b lead to 100 + 1, or 101 misses. Thus this loop will miss the data cache approximately 150 times for a plus 101 times for b, or 251 misses.

To simplify our optimization, we will not worry about prefetching the first accesses of the loop. These may already be in the cache, or we will pay the miss penalty of the first few elements of a or b. Nor will we worry about suppressing the prefetches at the end of the loop that try to prefetch beyond the end of a (a[i][100] … a[i][106]) and the end of b (b[101][0] … b[107][0]). If these were faulting prefetches, we could not take this luxury. Let’s assume that the miss penalty is so large we need to start prefetching at least, say, seven iterations in advance. (Stated alternatively, we assume prefetching has no benefit until the eighth iteration.) We underline the changes to the preceding code needed to add prefetching.

for (j = 0; j < 100; j = j + 1) {
    prefetch(b[j + 7][0]);   /* b(j,0) for 7 iterations later */
    prefetch(a[0][j + 7]);   /* a(0,j) for 7 iterations later */
    a[0][j] = b[j][0] * b[j + 1][0];};


for (i = 1; i < 3; i = i + 1)
  for (j = 0; j < 100; j = j + 1) {
      prefetch(a[i][j + 7]);   /* a(i,j) for +7 iterations */
      a[i][j] = b[j][0] * b[j + 1][0];}

This revised code prefetches a[i][7] through a[i][99] and b[7][0] through b[100][0], reducing the number of nonprefetched misses to

■ 7 misses for elements b[0][0], b[1][0], …, b[6][0] in the first loop

■ 4 misses (⌈7/2⌉) for elements a[0][0], a[0][1], …, a[0][6] in the first loop (spatial locality reduces misses to 1 per 16-byte cache block)

■ 4 misses (⌈7/2⌉) for elements a[1][0], a[1][1], …, a[1][6] in the second loop

■ 4 misses (⌈7/2⌉) for elements a[2][0], a[2][1], …, a[2][6] in the second loop

or a total of 19 nonprefetched misses. The cost of avoiding 232 cache misses is executing 400 prefetch instructions, likely a good trade-off.

Example

Calculate the time saved in the preceding example. Ignore instruction cache misses and assume there are no conflict or capacity misses in the data cache. Assume that prefetches can overlap with each other and with cache misses, thereby transferring at the maximum memory bandwidth. Here are the key loop times ignoring cache misses: the original loop takes 7 clock cycles per iteration, the first prefetch loop takes 9 clock cycles per iteration, and the second prefetch loop takes 8 clock cycles per iteration (including the overhead of the outer for loop). A miss takes 100 clock cycles.

Answer

The original doubly nested loop executes the multiply 3 × 100 or 300 times. Because the loop takes 7 clock cycles per iteration, the total is 300 × 7 or 2100 clock cycles plus cache misses. Cache misses add 251 × 100 or 25,100 clock cycles, giving a total of 27,200 clock cycles. The first prefetch loop iterates 100 times; at 9 clock cycles per iteration the total is 900 clock cycles plus cache misses. Now add 11 × 100 or 1100 clock cycles for cache misses, giving a total of 2000. The second loop executes 2 × 100 or 200 times, and at 8 clock cycles per iteration, it takes 1600 clock cycles plus 8 × 100 or 800 clock cycles for cache misses. This gives a total of 2400 clock cycles. From the prior example, we know that this code executes 400 prefetch instructions during the 2000 + 2400 or 4400 clock cycles to execute these two loops. If we assume that the prefetches are completely overlapped with the rest of the execution, then the prefetch code is 27,200/4400, or 6.2 times faster.
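The cycle counts above are easy to mistype, so here is a small C check of the same arithmetic; the iteration counts, per-iteration cycle times, and miss counts (251, 11, and 8) are taken directly from the example, and the rest is bookkeeping.

    #include <stdio.h>

    int main(void) {
        /* Original loop: 3 x 100 iterations at 7 cycles, plus 251 misses at 100 cycles. */
        int original = 3 * 100 * 7 + 251 * 100;        /* 2100 + 25,100 = 27,200 */

        /* First prefetch loop: 100 iterations at 9 cycles, plus 11 misses. */
        int loop1 = 100 * 9 + 11 * 100;                /* 900 + 1100 = 2000 */

        /* Second prefetch loop: 2 x 100 iterations at 8 cycles, plus 8 misses. */
        int loop2 = 2 * 100 * 8 + 8 * 100;             /* 1600 + 800 = 2400 */

        printf("original = %d cycles, prefetched = %d cycles, speedup = %.1f\n",
               original, loop1 + loop2, (double)original / (loop1 + loop2));
        return 0;
    }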


Although array optimizations are easy to understand, modern programs are more likely to use pointers. Luk and Mowry (1999) have demonstrated that compiler-based prefetching can sometimes be extended to pointers as well. Of 10 programs with recursive data structures, prefetching all pointers when a node is visited improved performance by 4%–31% in half of the programs. On the other hand, the remaining programs were still within 2% of their original performance. The issue is both whether prefetches are to data already in the cache and whether they occur early enough for the data to arrive by the time it is needed. Many processors support instructions for cache prefetch, and high-end processors (such as the Intel Core i7) often also do some type of automated prefetch in hardware.

Tenth Optimization: Using HBM to Extend the Memory Hierarchy

Because most general-purpose processors in servers will likely want more memory than can be packaged with HBM packaging, it has been proposed that the in-package DRAMs be used to build massive L4 caches, with upcoming technologies ranging from 128 MiB to 1 GiB and more, considerably more than current on-chip L3 caches. Using such large DRAM-based caches raises an issue: where do the tags reside? That depends on the number of tags. Suppose we were to use a 64B block size; then a 1 GiB L4 cache requires 96 MiB of tags—far more static memory than exists in the caches on the CPU. Increasing the block size to 4 KiB yields a dramatically reduced tag store of 256 K entries or less than 1 MiB total storage, which is probably acceptable, given L3 caches of 4–16 MiB or more in next-generation, multicore processors.

Such large block sizes, however, have two major problems. First, the cache may be used inefficiently when the contents of many blocks are not needed; this is called the fragmentation problem, and it also occurs in virtual memory systems. Furthermore, transferring such large blocks is inefficient if much of the data is unused. Second, because of the large block size, the number of distinct blocks held in the DRAM cache is much lower, which can result in more misses, especially for conflict and consistency misses. One partial solution to the first problem is to add subblocking. Subblocking allows parts of the block to be invalid, requiring that they be fetched on a miss. Subblocking, however, does nothing to address the second problem.

The tag storage is the major drawback for using a smaller block size. One possible solution for that difficulty is to store the tags for L4 in the HBM. At first glance this seems unworkable, because it requires two accesses to DRAM for each L4 access: one for the tags and one for the data itself. Because of the long access time for random DRAM accesses, typically 100 or more processor clock cycles, such an approach had been discarded. Loh and Hill (2011) proposed a clever solution to this problem: place the tags and the data in the same row in the HBM SDRAM. Although opening the row (and eventually closing it) takes a large amount of time, the CAS latency to access a different part of the row is about one-third the new row access time. Thus we can access the tag portion of the block first, and if it is a hit, then use a column access to choose the correct word.
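As a quick check of the tag-storage arithmetic earlier in this optimization, the sketch below recomputes both cases. The roughly 6 bytes of tag and state per 64-byte block is our own round number chosen to reproduce the 96 MiB figure, and the 4 bytes per entry for the large-block case is likewise an assumption.

    #include <stdio.h>
    #include <stdint.h>

    /* Recompute the L4 tag-store sizes discussed above. The bytes-per-entry
       values are illustrative assumptions, not numbers from the text. */
    int main(void) {
        uint64_t cache_bytes = 1ULL << 30;              /* 1 GiB L4 cache */

        uint64_t small_blocks = cache_bytes / 64;       /* 64-byte blocks -> 16M entries  */
        uint64_t large_blocks = cache_bytes / 4096;     /* 4 KiB blocks   -> 256K entries */

        /* About 6 bytes of tag + state per 64-byte block reproduces the 96 MiB figure. */
        printf("64 B blocks:  %llu entries, ~%llu MiB of tags\n",
               (unsigned long long)small_blocks,
               (unsigned long long)(small_blocks * 6 >> 20));

        /* With 4 KiB blocks there are only 256 K entries, so even a few bytes per
           entry keeps the tag store around 1 MiB, small enough to keep on chip. */
        printf("4 KiB blocks: %llu entries, ~%llu MiB of tags\n",
               (unsigned long long)large_blocks,
               (unsigned long long)(large_blocks * 4 >> 20));

        return 0;
    }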


Loh and Hill (L-H) have proposed organizing the L4 HBM cache so that each SDRAM row consists of a set of tags (at the head of the block) and 29 data segments, making a 29-way set associative cache. When L4 is accessed, the appropriate row is opened and the tags are read; a hit requires one more column access to get the matching data. Qureshi and Loh (2012) proposed an improvement called an alloy cache that reduces the hit time. An alloy cache molds the tag and data together and uses a direct mapped cache structure. This allows the L4 access time to be reduced to a single HBM cycle by directly indexing the HBM cache and doing a burst transfer of both the tag and data. Figure 2.16 shows the hit latency for the alloy cache, the L-H scheme, and SRAM-based tags. The alloy cache reduces hit time by more than a factor of 2 versus the L-H scheme, in return for an increase in the miss rate by a factor of 1.1–1.2. The choice of benchmarks is explained in the caption.

[Figure 2.16 plots average hit latency (0–125 clock cycles) for LH-Cache, SRAM-Tags, and the alloy cache on the benchmarks mcf_r, lbm_r, soplex_r, milc_r, omnet_r, bwaves_r, gcc_r, libqntm_r, sphinx_r, and gems_r.]

Figure 2.16 Average hit time latency in clock cycles for the L-H scheme, a currently-impractical scheme using SRAM for the tags, and the alloy cache organization. In the SRAM case, we assume the SRAM is accessible in the same time as L3 and that it is checked before L4 is accessed. The average hit latencies are 43 (alloy cache), 67 (SRAM tags), and 107 (L-H). The 10 SPECCPU2006 benchmarks used here are the most memory-intensive ones; each of them would run twice as fast if L3 were perfect.


Unfortunately, in both schemes, misses require two full DRAM accesses: one to get the initial tag and a follow-on access to the main memory (which is even slower). If we could speed up the miss detection, we could reduce the miss time. Two different solutions have been proposed to solve this problem: one uses a map that keeps track of the blocks in the cache (not the location of the block, just whether it is present); the other uses a memory access predictor that predicts likely misses using history prediction techniques, similar to those used for global branch prediction (see the next chapter). It appears that a small predictor can predict likely misses with high accuracy, leading to an overall lower miss penalty. Figure 2.17 shows the speedup obtained on SPECrate for the memory-intensive benchmarks used in Figure 2.16. The alloy cache approach outperforms the L-H scheme and even the impractical SRAM tags, because the combination of a fast access time for the miss predictor and good prediction results leads to a shorter time to predict a miss, and thus a lower miss penalty. The alloy cache performs close to the Ideal case, an L4 with perfect miss prediction and minimal hit time.

[Figure 2.17 plots speedup on SPECrate (1.0–1.5) for LH-Cache, SRAM-Tags, the alloy cache, and Ideal at L4 cache sizes of 64 MB, 128 MB, 256 MB, 512 MB, and 1 GB.]

Figure 2.17 Performance speedup running the SPECrate benchmark for the LH scheme, an SRAM tag scheme, and an ideal L4 (Ideal); a speedup of 1 indicates no improvement with the L4 cache, and a speedup of 2 would be achievable if L4 were perfect and took no access time. The 10 memory-intensive benchmarks are used with each benchmark run eight times. The accompanying miss prediction scheme is used. The Ideal case assumes that only the 64-byte block requested in L4 needs to be accessed and transferred and that prediction accuracy for L4 is perfect (i.e., all misses are known at zero cost).


HBM is likely to have widespread use in a variety of different configurations, from containing the entire memory system for some high-performance, special-purpose systems to use as an L4 cache for larger server configurations.

Cache Optimization Summary

The techniques to improve hit time, bandwidth, miss penalty, and miss rate generally affect the other components of the average memory access equation as well as the complexity of the memory hierarchy. Figure 2.18 summarizes these techniques and estimates the impact on complexity, with + meaning that the technique improves the factor, – meaning it hurts that factor, and blank meaning it has no impact. Generally, no technique helps more than one category.

Technique | Hit time | Bandwidth | Miss penalty | Miss rate | Power consumption | Hardware cost/complexity | Comment
Small and simple caches | + | | | | + | 0 | Trivial; widely used
Way-predicting caches | + | | | | + | 1 | Used in Pentium 4
Pipelined & banked caches | – | + | | | | 1 | Widely used
Nonblocking caches | | + | + | | | 3 | Widely used
Critical word first and early restart | | | + | | | 2 | Widely used
Merging write buffer | | | + | | | 1 | Widely used with write through
Compiler techniques to reduce cache misses | | | | + | | 0 | Software is a challenge, but many compilers handle common linear algebra calculations
Hardware prefetching of instructions and data | | | + | + | – | 2 instr., 3 data | Most provide prefetch instructions; modern high-end processors also automatically prefetch in hardware
Compiler-controlled prefetching | | | + | + | | 3 | Needs nonblocking cache; possible instruction overhead; in many CPUs
HBM as additional level of cache | +/– | | + | | | 3 | Depends on new packaging technology. Effects depend heavily on hit rate improvements

Figure 2.18 Summary of 10 advanced cache optimizations showing impact on cache performance, power consumption, and complexity. Although generally a technique helps only one factor, prefetching can reduce misses if done sufficiently early; if not, it can reduce miss penalty. + means that the technique improves the factor, – means it hurts that factor, and blank means it has no impact. The complexity measure is subjective, with 0 being the easiest and 3 being a challenge.


2.4 Virtual Memory and Virtual Machines

A virtual machine is taken to be an efficient, isolated duplicate of the real machine. We explain these notions through the idea of a virtual machine monitor (VMM)… a VMM has three essential characteristics. First, the VMM provides an environment for programs which is essentially identical with the original machine; second, programs run in this environment show at worst only minor decreases in speed; and last, the VMM is in complete control of system resources.

Gerald Popek and Robert Goldberg, “Formal requirements for virtualizable third generation architectures,” Communications of the ACM (July 1974).

Section B.4 in Appendix B describes the key concepts in virtual memory. Recall that virtual memory allows the physical memory to be treated as a cache of secondary storage (which may be either disk or solid state). Virtual memory moves pages between the two levels of the memory hierarchy, just as caches move blocks between levels. Likewise, TLBs act as caches on the page table, eliminating the need to do a memory access every time an address is translated. Virtual memory also provides separation between processes that share one physical memory but have separate virtual address spaces. Readers should ensure that they understand both functions of virtual memory before continuing. In this section, we focus on additional issues in protection and privacy between processes sharing the same processor. Security and privacy are two of the most vexing challenges for information technology in 2017. Electronic burglaries, often involving lists of credit card numbers, are announced regularly, and it’s widely believed that many more go unreported. Of course, such problems arise from programming errors that allow a cyberattack to access data it should be unable to access. Programming errors are a fact of life, and with modern complex software systems, they occur with significant regularity. Therefore both researchers and practitioners are looking for improved ways to make computing systems more secure. Although protecting information is not limited to hardware, in our view real security and privacy will likely involve innovation in computer architecture as well as in systems software. This section starts with a review of the architecture support for protecting processes from each other via virtual memory. It then describes the added protection provided by virtual machines, the architecture requirements of virtual machines, and the performance of a virtual machine. As we will see in Chapter 6, virtual machines are a foundational technology for cloud computing.


Protection via Virtual Memory

Page-based virtual memory, including a TLB that caches page table entries, is the primary mechanism that protects processes from each other. Sections B.4 and B.5 in Appendix B review virtual memory, including a detailed description of protection via segmentation and paging in the 80x86. This section acts as a quick review; if it’s too quick, please refer to the denoted Appendix B sections.

Multiprogramming, where several programs running concurrently share a computer, has led to demands for protection and sharing among programs and to the concept of a process. Metaphorically, a process is a program’s breathing air and living space—that is, a running program plus any state needed to continue running it. At any instant, it must be possible to switch from one process to another. This exchange is called a process switch or context switch.

The operating system and architecture join forces to allow processes to share the hardware yet not interfere with each other. To do this, the architecture must limit what a process can access when running a user process yet allow an operating system process to access more. At a minimum, the architecture must do the following:

1. Provide at least two modes, indicating whether the running process is a user process or an operating system process. This latter process is sometimes called a kernel process or a supervisor process.

2. Provide a portion of the processor state that a user process can use but not write. This state includes a user/supervisor mode bit, an exception enable/disable bit, and memory protection information. Users are prevented from writing this state because the operating system cannot control user processes if users can give themselves supervisor privileges, disable exceptions, or change memory protection.

3. Provide mechanisms whereby the processor can go from user mode to supervisor mode and vice versa. The first direction is typically accomplished by a system call, implemented as a special instruction that transfers control to a dedicated location in supervisor code space. The PC is saved from the point of the system call, and the processor is placed in supervisor mode. The return to user mode is like a subroutine return that restores the previous user/supervisor mode.

4. Provide mechanisms to limit memory accesses to protect the memory state of a process without having to swap the process to disk on a context switch.

Appendix A describes several memory protection schemes, but by far the most popular is adding protection restrictions to each page of virtual memory. Fixed-sized pages, typically 4 KiB, 16 KiB, or larger, are mapped from the virtual address space into physical address space via a page table. The protection restrictions are included in each page table entry. The protection restrictions might determine whether a user process can read this page, whether a user process can write to this page, and whether code can be executed from this page.


In addition, a process can neither read nor write a page if it is not in the page table. Because only the OS can update the page table, the paging mechanism provides total access protection.

Paged virtual memory means that every memory access logically takes at least twice as long, with one memory access to obtain the physical address and a second access to get the data. This cost would be far too dear. The solution is to rely on the principle of locality; if the accesses have locality, then the address translations for the accesses must also have locality. By keeping these address translations in a special cache, a memory access rarely requires a second access to translate the address. This special address translation cache is referred to as a TLB. A TLB entry is like a cache entry where the tag holds portions of the virtual address and the data portion holds a physical page address, protection field, valid bit, and usually a use bit and a dirty bit. The operating system changes these bits by changing the value in the page table and then invalidating the corresponding TLB entry. When the entry is reloaded from the page table, the TLB gets an accurate copy of the bits.

Assuming the computer faithfully obeys the restrictions on pages and maps virtual addresses to physical addresses, it would seem that we are done. Newspaper headlines suggest otherwise. The reason we’re not done is that we depend on the accuracy of the operating system as well as the hardware. Today’s operating systems consist of tens of millions of lines of code. Because bugs are measured in number per thousand lines of code, there are thousands of bugs in production operating systems. Flaws in the OS have led to vulnerabilities that are routinely exploited. This problem and the possibility that not enforcing protection could be much more costly than in the past have led some to look for a protection model with a much smaller code base than the full OS, such as virtual machines.
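To make the TLB entry just described concrete, here is a minimal C sketch of one entry and its permission check; the field names and widths are illustrative assumptions rather than any particular processor's format.

    #include <stdbool.h>
    #include <stdint.h>

    /* One TLB entry, following the description above: a tag drawn from the
       virtual address plus a data portion with the physical page number,
       protection bits, a valid bit, and the usual use and dirty bits. */
    typedef struct {
        uint64_t vpn_tag;        /* portion of the virtual page number used as the tag */
        uint64_t ppn;            /* physical page number                               */
        bool     valid;
        bool     user_read, user_write, execute;   /* protection field                 */
        bool     use, dirty;                       /* reference and dirty bits         */
    } tlb_entry;

    /* Check an entry against the attempted access; a false return means either a
       TLB miss (walk the page table) or a protection violation (raise an exception). */
    bool tlb_access_ok(tlb_entry *e, uint64_t vpn, bool is_write, bool is_exec) {
        if (!e->valid || e->vpn_tag != vpn)
            return false;                          /* miss: reload from page table */
        if ((is_write && !e->user_write) ||
            (is_exec && !e->execute) ||
            (!is_write && !is_exec && !e->user_read))
            return false;                          /* protection violation */
        e->use = true;
        if (is_write)
            e->dirty = true;
        return true;
    }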

Protection via Virtual Machines

An idea related to virtual memory that is almost as old is that of virtual machines (VMs). They were first developed in the late 1960s, and they have remained an important part of mainframe computing over the years. Although largely ignored in the domain of single-user computers in the 1980s and 1990s, they have recently gained popularity because of

■ the increasing importance of isolation and security in modern systems;

■ the failures in security and reliability of standard operating systems;

■ the sharing of a single computer among many unrelated users, such as in a data center or cloud; and

■ the dramatic increases in the raw speed of processors, which make the overhead of VMs more acceptable.

The broadest definition of VMs includes basically all emulation methods that provide a standard software interface, such as the Java VM. We are interested in VMs that provide a complete system-level environment at the binary instruction set architecture (ISA) level.


Most often, the VM supports the same ISA as the underlying hardware; however, it is also possible to support a different ISA, and such approaches are often employed when migrating between ISAs in order to allow software from the departing ISA to be used until it can be ported to the new ISA. Our focus here will be on VMs where the ISA presented by the VM and the underlying hardware match. Such VMs are called (operating) system virtual machines. IBM VM/370, VMware ESX Server, and Xen are examples. They present the illusion that the users of a VM have an entire computer to themselves, including a copy of the operating system. A single computer runs multiple VMs and can support a number of different operating systems (OSes). On a conventional platform, a single OS “owns” all the hardware resources, but with a VM, multiple OSes all share the hardware resources.

The software that supports VMs is called a virtual machine monitor (VMM) or hypervisor; the VMM is the heart of virtual machine technology. The underlying hardware platform is called the host, and its resources are shared among the guest VMs. The VMM determines how to map virtual resources to physical resources: A physical resource may be time-shared, partitioned, or even emulated in software. The VMM is much smaller than a traditional OS; the isolation portion of a VMM is perhaps only 10,000 lines of code.

In general, the cost of processor virtualization depends on the workload. User-level processor-bound programs, such as SPECCPU2006, have zero virtualization overhead because the OS is rarely invoked, so everything runs at native speeds. Conversely, I/O-intensive workloads generally are also OS-intensive and execute many system calls (which doing I/O requires) and privileged instructions that can result in high virtualization overhead. The overhead is determined by the number of instructions that must be emulated by the VMM and how slowly they are emulated. Therefore, when the guest VMs run the same ISA as the host, as we assume here, the goal of the architecture and the VMM is to run almost all instructions directly on the native hardware. On the other hand, if the I/O-intensive workload is also I/O-bound, the cost of processor virtualization can be completely hidden by low processor utilization because it is often waiting for I/O.

Although our interest here is in VMs for improving protection, VMs provide two other benefits that are commercially significant:

1. Managing software—VMs provide an abstraction that can run the complete software stack, even including old operating systems such as DOS. A typical deployment might be some VMs running legacy OSes, many running the current stable OS release, and a few testing the next OS release.

2. Managing hardware—One reason for multiple servers is to have each application running with its own compatible version of the operating system on separate computers, as this separation can improve dependability. VMs allow these separate software stacks to run independently yet share hardware, thereby consolidating the number of servers. Another example is that most newer VMMs support migration of a running VM to a different computer, either to balance load or to evacuate from failing hardware.


The rise of cloud computing has made the ability to swap out an entire VM to another physical processor increasingly useful. These two reasons are why cloud-based servers, such as Amazon’s, rely on virtual machines.

Requirements of a Virtual Machine Monitor

What must a VM monitor do? It presents a software interface to guest software, it must isolate the state of guests from each other, and it must protect itself from guest software (including guest OSes). The qualitative requirements are

■ Guest software should behave on a VM exactly as if it were running on the native hardware, except for performance-related behavior or limitations of fixed resources shared by multiple VMs.

■ Guest software should not be able to directly change allocation of real system resources.

To “virtualize” the processor, the VMM must control just about everything—access to privileged state, address translation, I/O, exceptions and interrupts—even though the guest VM and OS currently running are temporarily using them. For example, in the case of a timer interrupt, the VMM would suspend the currently running guest VM, save its state, handle the interrupt, determine which guest VM to run next, and then load its state. Guest VMs that rely on a timer interrupt are provided with a virtual timer and an emulated timer interrupt by the VMM.

To be in charge, the VMM must be at a higher privilege level than the guest VM, which generally runs in user mode; this also ensures that the execution of any privileged instruction will be handled by the VMM. The basic requirements of system virtual machines are almost identical to those for the previously mentioned paged virtual memory:

■ At least two processor modes, system and user.

■ A privileged subset of instructions that is available only in system mode, resulting in a trap if executed in user mode. All system resources must be controllable only via these instructions.

Instruction Set Architecture Support for Virtual Machines

If VMs are planned for during the design of the ISA, it’s relatively easy to reduce both the number of instructions that must be executed by a VMM and how long it takes to emulate them. An architecture that allows the VM to execute directly on the hardware earns the title virtualizable, and the IBM 370 architecture proudly bears that label.


However, because VMs have been considered for desktop and PC-based server applications only fairly recently, most instruction sets were created without virtualization in mind. These culprits include 80x86 and most of the original RISC architectures, although the latter had fewer issues than the 80x86 architecture. Recent additions to the x86 architecture have attempted to remedy the earlier shortcomings, and RISC V explicitly includes support for virtualization. Because the VMM must ensure that the guest system interacts only with virtual resources, a conventional guest OS runs as a user mode program on top of the VMM. Then, if a guest OS attempts to access or modify information related to hardware resources via a privileged instruction—for example, reading or writing the page table pointer—it will trap to the VMM. The VMM can then effect the appropriate changes to corresponding real resources. Therefore, if any instruction that tries to read or write such sensitive information traps when executed in user mode, the VMM can intercept it and support a virtual version of the sensitive information as the guest OS expects. In the absence of such support, other measures must be taken. A VMM must take special precautions to locate all problematic instructions and ensure that they behave correctly when executed by a guest OS, thereby increasing the complexity of the VMM and reducing the performance of running the VM. Sections 2.5 and 2.7 give concrete examples of problematic instructions in the 80x86 architecture. One attractive extension allows the VM and the OS to operate at different privilege levels, each of which is distinct from the user level. By introducing an additional privilege level, some OS operations—e.g., those that exceed the permissions granted to a user program but do not require intervention by the VMM (because they cannot affect any other VM)—can execute directly without the overhead of trapping and invoking the VMM. The Xen design, which we examine shortly, makes use of three privilege levels.

Impact of Virtual Machines on Virtual Memory and I/O

Another challenge is virtualization of virtual memory, as each guest OS in every VM manages its own set of page tables. To make this work, the VMM separates the notions of real and physical memory (which are often treated synonymously) and makes real memory a separate, intermediate level between virtual memory and physical memory. (Some use the terms virtual memory, physical memory, and machine memory to name the same three levels.) The guest OS maps virtual memory to real memory via its page tables, and the VMM page tables map the guests’ real memory to physical memory. The virtual memory architecture is specified either via page tables, as in IBM VM/370 and the 80x86, or via the TLB structure, as in many RISC architectures.

Rather than pay an extra level of indirection on every memory access, the VMM maintains a shadow page table that maps directly from the guest virtual address space to the physical address space of the hardware. By detecting all modifications to the guest’s page table, the VMM can ensure that the shadow page table entries being used by the hardware for translations correspond to those of the guest OS environment, with the exception of the correct physical pages substituted for the real pages in the guest tables.


Therefore the VMM must trap any attempt by the guest OS to change its page table or to access the page table pointer. This is commonly done by write protecting the guest page tables and trapping any access to the page table pointer by a guest OS. As previously noted, the latter happens naturally if accessing the page table pointer is a privileged operation.

The IBM 370 architecture solved the page table problem in the 1970s with an additional level of indirection that is managed by the VMM. The guest OS keeps its page tables as before, so the shadow pages are unnecessary. AMD has implemented a similar scheme for its 80x86. To virtualize the TLB in many RISC computers, the VMM manages the real TLB and has a copy of the contents of the TLB of each guest VM. To pull this off, any instructions that access the TLB must trap. TLBs with Process ID tags can support a mix of entries from different VMs and the VMM, thereby avoiding flushing of the TLB on a VM switch. Meanwhile, in the background, the VMM supports a mapping between the VMs’ virtual Process IDs and the real Process IDs. Section L.7 of online Appendix L describes additional details.

The final portion of the architecture to virtualize is I/O. This is by far the most difficult part of system virtualization because of the increasing number of I/O devices attached to the computer and the increasing diversity of I/O device types. Another difficulty is the sharing of a real device among multiple VMs, and yet another comes from supporting the myriad of device drivers that are required, especially if different guest OSes are supported on the same VM system. The VM illusion can be maintained by giving each VM generic versions of each type of I/O device driver, and then leaving it to the VMM to handle real I/O. The method for mapping a virtual-to-physical I/O device depends on the type of device. For example, physical disks are normally partitioned by the VMM to create virtual disks for guest VMs, and the VMM maintains the mapping of virtual tracks and sectors to the physical ones. Network interfaces are often shared between VMs in very short time slices, and the job of the VMM is to keep track of messages for the virtual network addresses to ensure that guest VMs receive only messages intended for them.
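A highly simplified sketch of the shadow-page-table mechanism described above, for a single-level page table; the data structures and the trap handler name are our own assumptions, and real implementations also deal with multilevel tables, permissions, and TLB invalidation.

    #include <stdint.h>
    #include <stdbool.h>

    #define NPAGES 1024   /* toy single-level page table with 1024 pages */

    /* Guest page table: guest virtual page -> guest "real" page, as managed by
       the guest OS. The VMM's pmap gives guest real page -> host physical page. */
    typedef struct { uint32_t frame; bool valid, writable; } pte;

    static pte      guest_pt[NPAGES];     /* written by the guest OS (write-protected) */
    static uint32_t vmm_pmap[NPAGES];     /* guest real page -> host physical page     */
    static pte      shadow_pt[NPAGES];    /* what the hardware actually walks          */

    /* Called by the VMM when a guest write to its page table traps: emulate the
       update, then rebuild the shadow entry so it maps guest virtual directly to
       host physical, as described in the text. */
    void on_guest_pte_write(uint32_t vpn, pte new_entry) {
        guest_pt[vpn] = new_entry;
        if (new_entry.valid) {
            shadow_pt[vpn].frame    = vmm_pmap[new_entry.frame];
            shadow_pt[vpn].valid    = true;
            shadow_pt[vpn].writable = new_entry.writable;
        } else {
            shadow_pt[vpn].valid = false;
        }
    }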

Extending the Instruction Set for Efficient Virtualization and Better Security

In the past 5–10 years, processor designers, including those at AMD and Intel (and to a lesser extent ARM), have introduced instruction set extensions to more efficiently support virtualization. Two primary areas of performance improvement have been in handling page tables and TLBs (the cornerstone of virtual memory) and in I/O, specifically handling interrupts and DMA. Virtual memory performance is enhanced by avoiding unnecessary TLB flushes and by using the nested page table mechanism, employed by IBM decades earlier, rather than a complete set of shadow page tables (see Section L.7 in Appendix L).


To improve I/O performance, architectural extensions are added that allow a device to directly use DMA to move data (eliminating a potential copy by the VMM) and allow device interrupts and commands to be handled by the guest OS directly. These extensions show significant performance gains in applications that are intensive either in their memory-management aspects or in the use of I/O.

With the broad adoption of public cloud systems for running critical applications, concerns have risen about security of data in such applications. Any malicious code that is able to access a higher privilege level than data that must be kept secure compromises the system. For example, if you are running a credit card processing application, you must be absolutely certain that malicious users cannot get access to the credit card numbers, even when they are using the same hardware and intentionally attack the OS or even the VMM. Through the use of virtualization, we can prevent accesses by an outside user to the data in a different VM, and this provides significant protection compared to a multiprogrammed environment. That might not be enough, however, if the attacker compromises the VMM or can find out information by observations in another VMM. For example, suppose the attacker penetrates the VMM; the attacker can then remap memory so as to access any portion of the data. Alternatively, an attack might rely on a Trojan horse (see Appendix B) introduced into the code that can access the credit cards. Because the Trojan horse is running in the same VM as the credit card processing application, the Trojan horse only needs to exploit an OS flaw to gain access to the critical data. Most cyberattacks have used some form of Trojan horse, typically exploiting an OS flaw, that either has the effect of returning access to the attacker while leaving the CPU still in privilege mode or allows the attacker to upload and execute code as if it were part of the OS. In either case, the attacker obtains control of the CPU and, using the higher privilege mode, can proceed to access anything within the VM. Note that encryption alone does not prevent this attacker. If the data in memory is unencrypted, which is typical, then the attacker has access to all such data. Furthermore, if the attacker knows where the encryption key is stored, the attacker can freely access the key and then access any encrypted data.

More recently, Intel introduced a set of instruction set extensions, called the software guard extensions (SGX), to allow user programs to create enclaves, portions of code and data that are always encrypted and decrypted only on use and only with the key provided by the user code. Because the enclave is always encrypted, standard OS operations for virtual memory or I/O can access the enclave (e.g., to move a page) but cannot extract any information. For an enclave to work, all the code and all the data required must be part of the enclave. Although the topic of finer-grained protection has been around for decades, it has gotten little traction before because of the high overhead and because other solutions that are more efficient and less intrusive have been acceptable. The rise of cyberattacks and the amount of confidential information online have led to a reexamination of techniques for improving such fine-grained security. Like Intel’s SGX, IBM and AMD’s recent processors support on-the-fly encryption of memory.


An Example VMM: The Xen Virtual Machine

Early in the development of VMs, a number of inefficiencies became apparent. For example, a guest OS manages its virtual-to-real page mapping, but this mapping is ignored by the VMM, which performs the actual mapping to physical pages. In other words, a significant amount of wasted effort is expended just to keep the guest OS happy. To reduce such inefficiencies, VMM developers decided that it may be worthwhile to allow the guest OS to be aware that it is running on a VM. For example, a guest OS could assume a real memory as large as its virtual memory so that no memory management is required by the guest OS. Allowing small modifications to the guest OS to simplify virtualization is referred to as paravirtualization, and the open source Xen VMM is a good example. The Xen VMM, which is used in Amazon’s web services data centers, provides a guest OS with a virtual machine abstraction that is similar to the physical hardware, but drops many of the troublesome pieces. For example, to avoid flushing the TLB, Xen maps itself into the upper 64 MiB of the address space of each VM. Xen allows the guest OS to allocate pages, checking only to be sure the guest OS does not violate protection restrictions. To protect the guest OS from the user programs in the VM, Xen takes advantage of the four protection levels available in the 80x86. The Xen VMM runs at the highest privilege level (0), the guest OS runs at the next level (1), and the applications run at the lowest privilege level (3). Most OSes for the 80x86 keep everything at privilege levels 0 or 3. For subsetting to work properly, Xen modifies the guest OS to not use problematic portions of the architecture. For example, the port of Linux to Xen changes about 3000 lines, or about 1% of the 80x86-specific code. These changes, however, do not affect the application binary interfaces of the guest OS.

To simplify the I/O challenge of VMs, Xen assigned privileged virtual machines to each hardware I/O device. These special VMs are called driver domains. (Xen calls VMs “domains.”) Driver domains run the physical device drivers, although interrupts are still handled by the VMM before being sent to the appropriate driver domain. Regular VMs, called guest domains, run simple virtual device drivers that must communicate with the physical device drivers in the driver domains over a channel to access the physical I/O hardware. Data are sent between guest and driver domains by page remapping.

2.5 Cross-Cutting Issues: The Design of Memory Hierarchies

This section describes four topics discussed in other chapters that are fundamental to memory hierarchies.

Protection, Virtualization, and Instruction Set Architecture

Protection is a joint effort of architecture and operating systems, but architects had to modify some awkward details of existing instruction set architectures when virtual memory became popular. For example, to support virtual memory in the IBM 370, architects had to change the successful IBM 360 instruction set architecture that had been announced just 6 years before.


Similar adjustments are being made today to accommodate virtual machines. For example, the 80x86 instruction POPF loads the flag registers from the top of the stack in memory. One of the flags is the Interrupt Enable (IE) flag. Until recent changes to support virtualization, running the POPF instruction in user mode, rather than trapping it, simply changed all the flags except IE. In system mode, it does change the IE flag. Because a guest OS runs in user mode inside a VM, this was a problem, as the OS would expect to see a changed IE. Extensions of the 80x86 architecture to support virtualization eliminated this problem.

Historically, IBM mainframe hardware and VMM took three steps to improve performance of virtual machines:

1. Reduce the cost of processor virtualization.

2. Reduce interrupt overhead cost due to the virtualization.

3. Reduce interrupt cost by steering interrupts to the proper VM without invoking VMM.

IBM is still the gold standard of virtual machine technology. For example, an IBM mainframe ran thousands of Linux VMs in 2000, while Xen ran 25 VMs in 2004 (Clark et al., 2004). Recent versions of Intel and AMD chipsets have added special instructions to support devices in a VM to mask interrupts at lower levels from each VM and to steer interrupts to the appropriate VM.

Autonomous Instruction Fetch Units

Many processors with out-of-order execution and even some with simply deep pipelines decouple the instruction fetch (and sometimes initial decode), using a separate instruction fetch unit (see Chapter 3). Typically, the instruction fetch unit accesses the instruction cache to fetch an entire block before decoding it into individual instructions; such a technique is particularly useful when the instruction length varies. Because the instruction cache is accessed in blocks, it no longer makes sense to compare miss rates to processors that access the instruction cache once per instruction. In addition, the instruction fetch unit may prefetch blocks into the L1 cache; these prefetches may generate additional misses, but may actually reduce the total miss penalty incurred. Many processors also include data prefetching, which may increase the data cache miss rate, even while decreasing the total data cache miss penalty.

Speculation and Memory Access

One of the major techniques used in advanced pipelines is speculation, whereby an instruction is tentatively executed before the processor knows whether it is really needed. Such techniques rely on branch prediction, which if incorrect requires that the speculated instructions are flushed from the pipeline.


There are two separate issues in a memory system supporting speculation: protection and performance. With speculation, the processor may generate memory references, which will never be used because the instructions were the result of incorrect speculation. Those references, if executed, could generate protection exceptions. Obviously, such faults should occur only if the instruction is actually executed. In the next chapter, we will see how such “speculative exceptions” are resolved. Because a speculative processor may generate accesses to both the instruction and data caches, and subsequently not use the results of those accesses, speculation may increase the cache miss rates. As with prefetching, however, such speculation may actually lower the total cache miss penalty. The use of speculation, like the use of prefetching, makes it misleading to compare miss rates to those seen in processors without speculation, even when the ISA and cache structures are otherwise identical.

Special Instruction Caches

One of the biggest challenges in superscalar processors is to supply the instruction bandwidth. For designs that translate the instructions into micro-operations, such as most recent Arm and i7 processors, instruction bandwidth demands and branch misprediction penalties can be reduced by keeping a small cache of recently translated instructions. We explore this technique in greater depth in the next chapter.

Coherency of Cached Data

Data can be found in memory and in the cache. As long as the processor is the sole component changing or reading the data and the cache stands between the processor and memory, there is little danger in the processor seeing the old or stale copy. As we will see, multiple processors and I/O devices raise the opportunity for copies to be inconsistent and to read the wrong copy.

The frequency of the cache coherency problem is different for multiprocessors than for I/O. Multiple data copies are a rare event for I/O—one to be avoided whenever possible—but a program running on multiple processors will want to have copies of the same data in several caches. Performance of a multiprocessor program depends on the performance of the system when sharing data.

The I/O cache coherency question is this: where does the I/O occur in the computer—between the I/O device and the cache or between the I/O device and main memory? If input puts data into the cache and output reads data from the cache, both I/O and the processor see the same data. The difficulty in this approach is that it interferes with the processor and can cause the processor to stall for I/O. Input may also interfere with the cache by displacing some information with new data that are unlikely to be accessed soon.


The goal for the I/O system in a computer with a cache is to prevent the stale data problem while interfering as little as possible. Many systems therefore prefer that I/O occur directly to main memory, with main memory acting as an I/O buffer. If a write-through cache were used, then memory would have an up-to-date copy of the information, and there would be no stale data issue for output. (This benefit is a reason processors used write through.) However, today write through is usually found only in first-level data caches backed by an L2 cache that uses write back. Input requires some extra work. The software solution is to guarantee that no blocks of the input buffer are in the cache. A page containing the buffer can be marked as noncachable, and the operating system can always input to such a page. Alternatively, the operating system can flush the buffer addresses from the cache before the input occurs. A hardware solution is to check the I/O addresses on input to see if they are in the cache. If there is a match of I/O addresses in the cache, the cache entries are invalidated to avoid stale data. All of these approaches can also be used for output with write-back caches. Processor cache coherency is a critical subject in the age of multicore processors, and we will examine it in detail in Chapter 5.
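A minimal sketch of the hardware check just described: on a DMA input, every cache block that overlaps the input buffer is invalidated so the processor cannot read stale data. The cache_contains and cache_invalidate hooks are assumed stand-ins for the snoop logic, and the 64-byte block size is an assumption.

    #include <stdint.h>
    #include <stdbool.h>

    #define BLOCK_SIZE 64   /* assumed cache block size in bytes */

    /* Assumed hooks into a cache model; in hardware these would be snoop lookups. */
    extern bool cache_contains(uint64_t block_addr);
    extern void cache_invalidate(uint64_t block_addr);

    /* Called when an I/O device writes [buf, buf+len) to main memory by DMA:
       invalidate any cached copies of the blocks the input overwrites. */
    void dma_input_coherence(uint64_t buf, uint64_t len) {
        uint64_t first = buf / BLOCK_SIZE;
        uint64_t last  = (buf + len - 1) / BLOCK_SIZE;
        for (uint64_t blk = first; blk <= last; blk++) {
            if (cache_contains(blk * BLOCK_SIZE))
                cache_invalidate(blk * BLOCK_SIZE);
        }
    }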

2.6 Putting It All Together: Memory Hierarchies in the ARM Cortex-A53 and Intel Core i7 6700

This section reveals the ARM Cortex-A53 (hereafter called the A53) and Intel Core i7 6700 (hereafter called i7) memory hierarchies and shows the performance of their components on a set of single-threaded benchmarks. We examine the Cortex-A53 first because it has a simpler memory system; we go into more detail for the i7, tracing out a memory reference in detail. This section presumes that readers are familiar with the organization of a two-level cache hierarchy using virtually indexed caches. The basics of such a memory system are explained in detail in Appendix B, and readers who are uncertain of the organization of such a system are strongly advised to review the Opteron example in Appendix B. Once they understand the organization of the Opteron, the brief explanation of the A53 system, which is similar, will be easy to follow.

The ARM Cortex-A53

The Cortex-A53 is a configurable core that supports the ARMv8A instruction set architecture, which includes both 32-bit and 64-bit modes. The Cortex-A53 is delivered as an IP (intellectual property) core. IP cores are the dominant form of technology delivery in the embedded, PMD, and related markets; billions of ARM and MIPS processors have been created from these IP cores. Note that IP cores are different from the cores in the Intel i7 or AMD Athlon multicores. An IP core (which may itself be a multicore) is designed to be incorporated with other logic (thus it is the core of a chip), including application-specific processors (such as an encoder or decoder for video), I/O interfaces, and memory interfaces, and then fabricated to yield a processor optimized for a particular application.


For example, the Cortex-A53 IP core is used in a variety of tablets and smartphones; it is designed to be highly energy-efficient, a key criterion in battery-based PMDs. The A53 core is capable of being configured with multiple cores per chip for use in high-end PMDs; our discussion here focuses on a single core.

Generally, IP cores come in two flavors. Hard cores are optimized for a particular semiconductor vendor and are black boxes with external (but still on-chip) interfaces. Hard cores typically allow parametrization only of logic outside the core, such as L2 cache sizes, and the IP core cannot be modified. Soft cores are usually delivered in a form that uses a standard library of logic elements. A soft core can be compiled for different semiconductor vendors and can also be modified, although extensive modifications are very difficult because of the complexity of modern-day IP cores. In general, hard cores provide higher performance and smaller die area, while soft cores allow retargeting to other vendors and can be more easily modified.

The Cortex-A53 can issue two instructions per clock at clock rates up to 1.3 GHz. It supports both a two-level TLB and a two-level cache; Figure 2.19 summarizes the organization of the memory hierarchy. The critical word is returned first, and the processor can continue while the miss completes; a memory system with up to four banks can be supported. For a D-cache of 32 KiB and a page size of 4 KiB, each physical page could map to two different cache addresses; such aliases are avoided by hardware detection on a miss as in Section B.3 of Appendix B. Figure 2.20 shows how the 32-bit virtual address is used to index the TLB and the caches, assuming 32 KiB primary caches and a 1 MiB secondary cache with 16 KiB page size.
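The aliasing remark above follows from simple arithmetic: in a virtually indexed, physically tagged cache, aliases are possible whenever the bytes in one way exceed the page size. The helper below is ours; the 32 KiB capacity and 4 KiB page come from the text, and the four-way organization is taken from the Figure 2.20 caption.

    #include <stdio.h>

    /* Number of distinct cache indices a single physical page can map to in a
       virtually indexed, physically tagged cache: the bytes covered by one way
       divided by the page size (1 means no aliasing is possible). */
    static unsigned aliases(unsigned cache_bytes, unsigned ways, unsigned page_bytes) {
        unsigned way_bytes = cache_bytes / ways;
        return way_bytes <= page_bytes ? 1 : way_bytes / page_bytes;
    }

    int main(void) {
        /* 32 KiB D-cache, 4-way, 4 KiB pages: each physical page can land at two
           indices, so the hardware must detect such aliases on a miss. */
        printf("aliases = %u\n", aliases(32 * 1024, 4, 4 * 1024));
        return 0;
    }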

Structure | Size | Organization | Typical miss penalty (clock cycles)
Instruction MicroTLB | 10 entries | Fully associative | 2
Data MicroTLB | 10 entries | Fully associative | 2
L2 Unified TLB | 512 entries | 4-way set associative | 20
L1 Instruction cache | 8–64 KiB | 2-way set associative; 64-byte block | 13
L1 Data cache | 8–64 KiB | 2-way set associative; 64-byte block | 13
L2 Unified cache | 128 KiB to 2 MiB | 16-way set associative; LRU | 124

Figure 2.19 The memory hierarchy of the Cortex A53 includes multilevel TLBs and caches. A page map cache keeps track of the location of a physical page for a set of virtual pages; it reduces the L2 TLB miss penalty. The L1 caches are virtually indexed and physically tagged; both the L1 D cache and L2 use a write-back policy defaulting to allocate on write. Replacement policy is LRU approximation in all the caches. Miss penalties to L2 are higher if both a MicroTLB and L1 miss occur. The L2 to main memory bus is 64–128 bits wide, and the miss penalty is larger for the narrow bus.


Figure 2.20 The virtual address, physical and data blocks for the ARM Cortex-A53 caches and TLBs, assuming 32-bit addresses. The top half (A) shows the instruction access; the bottom half (B) shows the data access, including L2. The TLB (instruction or data) is fully associative, each with 10 entries, using a 64 KiB page in this example. The L1 I-cache is two-way set associative, with 64-byte blocks and 32 KiB capacity; the L1 D-cache is 32 KiB, four-way set associative, and 64-byte blocks. The L2 TLB is 512 entries and four-way set associative. The L2 cache is 16-way set associative with 64-byte blocks and 128 KiB to 2 MiB capacity; a 1 MiB L2 is shown. This figure doesn’t show the valid bits and protection bits for the caches and TLB.


Performance of the Cortex-A53 Memory Hierarchy

The memory hierarchy of the Cortex-A53 was measured with 32 KiB primary caches and a 1 MiB L2 cache running the SPECInt2006 benchmarks. The instruction cache miss rates for these SPECInt2006 benchmarks are very small even for just the L1: close to zero for most and under 1% for all of them. This low rate probably results from the computationally intensive nature of the SPECCPU programs and the two-way set associative cache that eliminates most conflict misses.

Figure 2.21 shows the data cache results, which have significant L1 and L2 miss rates. The L1 rate varies by a factor of 75, from 0.5% to 37.3% with a median miss rate of 2.4%. The global L2 miss rate varies by a factor of 180, from 0.05% to 9.0% with a median of 0.3%. MCF, which is known as a cache buster, sets the upper bound and significantly affects the mean. Remember that the L2 global miss rate is significantly lower than the L2 local miss rate; for example, the median L2 stand-alone miss rate is 15.1% versus the global miss rate of 0.3%.

Using the miss penalties in Figure 2.19, Figure 2.22 shows the average penalty per data access. Although the L1 miss rates are about seven times higher than the L2 miss rate, the L2 penalty is 9.5 times as high, leading to L2 misses slightly dominating for the benchmarks that stress the memory system. In the next chapter, we will examine the impact of the cache misses on overall CPI.
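To make the averaging in Figure 2.22 concrete, here is a minimal sketch of the calculation, using the Figure 2.19 penalties (13 cycles to L2, 124 cycles to memory) and the median and worst-case miss rates quoted above as illustrative inputs; it ignores the extra penalty paid when a MicroTLB miss accompanies the cache miss.

#include <stdio.h>

/* Approximate average miss penalty per data reference: every L1 miss pays the
   L1-to-L2 penalty, and every (global) L2 miss additionally pays the trip to
   main memory. Penalties are from Figure 2.19; the rates are illustrative. */
static double avg_penalty(double l1_miss, double l2_global_miss,
                          double l1_pen, double l2_pen) {
    return l1_miss * l1_pen + l2_global_miss * l2_pen;
}

int main(void) {
    printf("median benchmark: %.2f cycles per data reference\n",
           avg_penalty(0.024, 0.003, 13.0, 124.0));   /* about 0.7 cycles */
    printf("cache buster:     %.2f cycles per data reference\n",
           avg_penalty(0.373, 0.090, 13.0, 124.0));   /* about 16 cycles  */
    return 0;
}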

[Figure 2.21 chart: L1 data miss rate and global L2 data miss rate, one pair of bars per SPECInt2006 benchmark.]

Figure 2.21 The data miss rate for ARM with a 32 KiB L1 and the global data miss rate for a 1 MiB L2 using the SPECInt2006 benchmarks are significantly affected by the applications. Applications with larger memory footprints tend to have higher miss rates in both L1 and L2. Note that the L2 rate is the global miss rate that is counting all references, including those that hit in L1. MCF is known as a cache buster.


[Figure 2.22 chart: average miss penalty per data reference, split into L1 and L2 contributions, one bar per SPECInt2006 benchmark.]

Figure 2.22 The average memory access penalty per data memory reference coming from L1 and L2 is shown for the A53 processor when running SPECInt2006. Although the miss rates for L1 are significantly higher, the L2 miss penalty, which is more than five times higher, means that the L2 misses can contribute significantly.

The Intel Core i7 6700

The i7 supports the x86-64 instruction set architecture, a 64-bit extension of the 80x86 architecture. The i7 is an out-of-order execution processor that includes four cores. In this chapter, we focus on the memory system design and performance from the viewpoint of a single core. The system performance of multiprocessor designs, including the i7 multicore, is examined in detail in Chapter 5.

Each core in an i7 can execute up to four 80x86 instructions per clock cycle, using a multiple issue, dynamically scheduled, 16-stage pipeline, which we describe in detail in Chapter 3. The i7 can also support up to two simultaneous threads per processor, using a technique called simultaneous multithreading, described in Chapter 4. In 2017 the fastest i7 had a clock rate of 4.0 GHz (in Turbo Boost mode), which yielded a peak instruction execution rate of 16 billion instructions per second, or 64 billion instructions per second for the four-core design. Of course, there is a big gap between peak and sustained performance, as we will see over the next few chapters. The i7 can support up to three memory channels, each consisting of a separate set of DIMMs, and each of which can transfer in parallel. Using DDR3-1066 (DIMM PC8500), the i7 has a peak memory bandwidth of just over 25 GB/s.
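The peak figures quoted above follow from simple products; the sketch below redoes the arithmetic. The 8-byte transfer per channel is the usual 64-bit DIMM width and is an assumption of this sketch.

#include <stdio.h>

int main(void) {
    /* Peak instruction rate: 4 instructions per clock at 4.0 GHz, per core. */
    double per_core = 4 * 4.0e9;        /* 16 billion instructions/s */
    double per_chip = 4 * per_core;     /* 64 billion instructions/s */

    /* Peak DRAM bandwidth: 3 channels x 1066 MT/s x 8 bytes per transfer. */
    double bandwidth = 3 * 1066e6 * 8;  /* about 25.6 GB/s           */

    printf("%.0f billion instr/s per core, %.0f billion per chip, %.1f GB/s\n",
           per_core / 1e9, per_chip / 1e9, bandwidth / 1e9);
    return 0;
}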


The i7 uses 48-bit virtual addresses and 36-bit physical addresses, yielding a maximum physical memory of 64 GiB. Memory management is handled with a two-level TLB (see Appendix B, Section B.4), summarized in Figure 2.23. Figure 2.24 summarizes the i7's three-level cache hierarchy. The first-level caches are virtually indexed and physically tagged (see Appendix B, Section B.3), while the L2 and L3 caches are physically indexed. Some versions of the i7 6700 will support a fourth-level cache using HBM packaging.

Figure 2.25 is labeled with the steps of an access to the memory hierarchy. First, the PC is sent to the instruction cache. The instruction cache index is

2^Index = Cache size / (Block size × Set associativity) = 32K / (64 × 8) = 64 = 2^6, or 6 bits.

Characteristic    Instruction TLB   Data TLB      Second-level TLB
Entries           128               64            1536
Associativity     8-way             4-way         12-way
Replacement       Pseudo-LRU        Pseudo-LRU    Pseudo-LRU
Access latency    1 cycle           1 cycle       8 cycles
Miss              9 cycles          9 cycles      Hundreds of cycles to access page table

Figure 2.23 Characteristics of the i7’s TLB structure, which has separate first-level instruction and data TLBs, both backed by a joint second-level TLB. The first-level TLBs support the standard 4 KiB page size, as well as having a limited number of entries of large 2–4 MiB pages; only 4 KiB pages are supported in the second-level TLB. The i7 has the ability to handle two L2 TLB misses in parallel. See Section L.3 of online Appendix L for more discussion of multilevel TLBs and support for multiple page sizes.

Characteristic        L1                    L2           L3
Size                  32 KiB I/32 KiB D     256 KiB      2 MiB per core
Associativity         both 8-way            4-way        16-way
Access latency        4 cycles, pipelined   12 cycles    44 cycles
Replacement scheme    Pseudo-LRU            Pseudo-LRU   Pseudo-LRU but with an ordered selection algorithm

Figure 2.24 Characteristics of the three-level cache hierarchy in the i7. All three caches use write back and a block size of 64 bytes. The L1 and L2 caches are separate for each core, whereas the L3 cache is shared among the cores on a chip and is a total of 2 MiB per core. All three caches are nonblocking and allow multiple outstanding writes. A merging write buffer is used for the L1 cache, which holds data in the event that the line is not present in L1 when it is written. (That is, an L1 write miss does not cause the line to be allocated.) L3 is inclusive of L1 and L2; we explore this property in further detail when we explain multiprocessor caches. Replacement is by a variant on pseudo-LRU; in the case of L3, the block replaced is always the lowest numbered way whose access bit is off. This is not quite random but is easy to compute.
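The L3 policy in the caption (replace the lowest-numbered way whose access bit is off) is simple enough to sketch in a few lines of C. The access-bit bookkeeping below, which clears the other bits once every way has been touched, is an assumption of this sketch rather than a statement of the exact Intel implementation.

#include <stdbool.h>
#include <stdio.h>

#define WAYS 16

typedef struct {
    bool access_bit[WAYS];   /* "recently used" hint per way in one L3 set */
} l3_set_t;

/* Mark a way as recently used; if every bit becomes set, clear the others so
   at least one replacement candidate always remains (assumed bookkeeping). */
static void touch(l3_set_t *set, int way) {
    set->access_bit[way] = true;
    for (int i = 0; i < WAYS; i++)
        if (!set->access_bit[i])
            return;                         /* some way is still a candidate */
    for (int i = 0; i < WAYS; i++)          /* all set: reset all but this one */
        set->access_bit[i] = (i == way);
}

/* Victim selection per the caption: lowest-numbered way whose bit is off. */
static int choose_victim(const l3_set_t *set) {
    for (int i = 0; i < WAYS; i++)
        if (!set->access_bit[i])
            return i;
    return 0;                               /* unreachable if touch() is used */
}

int main(void) {
    l3_set_t set = {{false}};
    touch(&set, 3);
    touch(&set, 0);
    printf("victim way: %d\n", choose_victim(&set));   /* prints 1 */
    return 0;
}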


[Figure 2.25 diagram: the 16 numbered steps of an access, from the virtual page number and page offset through the instruction TLB, data TLB, and L2 TLB, the L1 instruction and data caches, and the unified L2 and L3 caches, and finally to the memory interface and the DIMMs of main memory.]

Figure 2.25 The Intel i7 memory hierarchy and the steps in both instruction and data access. We show only reads. Writes are similar, except that misses are handled by simply placing the data in a write buffer, because the L1 cache is not write-allocated.


The page frame of the instruction's address (48 − 12 = 36 bits) is sent to the instruction TLB (step 1). At the same time, the 12-bit page offset from the virtual address is sent to the instruction cache (step 2). Notice that for the eight-way set associative instruction cache, 12 bits are needed for the cache address: 6 bits to index the cache plus 6 bits of block offset for the 64-byte block, so no aliases are possible. The previous versions of the i7 used a four-way set associative I-cache, meaning that a block corresponding to a virtual address could actually be in two different places in the cache, because the corresponding physical address could have either a 0 or 1 in this location. For instructions this did not pose a problem because even if an instruction appeared in the cache in two different locations, the two versions must be the same. If such duplication, or aliasing, of data is allowed, the cache must be checked when the page map is changed, which is an infrequent event. Note that a very simple use of page coloring (see Appendix B, Section B.3) can eliminate the possibility of these aliases. If even-address virtual pages are mapped to even-address physical pages (and the same for odd pages), then these aliases can never occur because the low-order bit in the virtual and physical page number will be identical.

The instruction TLB is accessed to find a match between the address and a valid page table entry (PTE) (steps 3 and 4). In addition to translating the address, the TLB checks to see if the PTE demands that this access result in an exception because of an access violation. An instruction TLB miss first goes to the L2 TLB, which contains 1536 PTEs of 4 KiB page sizes and is 12-way set associative. It takes 8 clock cycles to load the L1 TLB from the L2 TLB, which leads to the 9-cycle miss penalty including the initial clock cycle to access the L1 TLB. If the L2 TLB misses, a hardware algorithm is used to walk the page table and update the TLB entry. Sections L.5 and L.6 of online Appendix L describe page table walkers and page structure caches. In the worst case, the page is not in memory, and the operating system gets the page from secondary storage. Because millions of instructions could execute during a page fault, the operating system will swap in another process if one is waiting to run.

Otherwise, if there is no TLB exception, the instruction cache access continues. The index field of the address is sent to all eight banks of the instruction cache (step 5). The instruction cache tag is 36 bits − 6 bits (index) − 6 bits (block offset), or 24 bits. The eight tags and valid bits are compared to the physical page frame from the instruction TLB (step 6). Because the i7 expects 16 bytes in each instruction fetch, an additional 2 bits are used from the 6-bit block offset to select the appropriate 16 bytes. Therefore 6 + 2 or 8 bits are used to send 16 bytes of instructions to the processor. The L1 cache is pipelined, and the latency of a hit is 4 clock cycles (step 7). A miss goes to the second-level cache.

As mentioned earlier, the instruction cache is virtually addressed and physically tagged. Because the second-level caches are physically addressed, the physical page address from the TLB is composed with the page offset to make an address to access the L2 cache. The L2 index is

2^Index = Cache size / (Block size × Set associativity) = 256K / (64 × 4) = 1024 = 2^10


so the 30-bit block address (36-bit physical address − 6-bit block offset) is divided into a 20-bit tag and a 10-bit index (step 8). Once again, the index and tag are sent to the four banks of the unified L2 cache (step 9), which are compared in parallel. If one matches and is valid (step 10), it returns the block in sequential order after the initial 12-cycle latency at a rate of 8 bytes per clock cycle. If the L2 cache misses, the L3 cache is accessed. For a four-core i7, which has an 8 MiB L3, the index size is

2^Index = Cache size / (Block size × Set associativity) = 8M / (64 × 16) = 8192 = 2^13

The 13-bit index (step 11) is sent to all 16 banks of the L3 (step 12). The L3 tag, which is 36 − (13 + 6) = 17 bits, is compared against the physical address from the TLB (step 13). If a hit occurs, the block is returned after an initial latency of 42 clock cycles, at a rate of 16 bytes per clock, and placed into both L1 and L3. If L3 misses, a memory access is initiated.

If the instruction is not found in the L3 cache, the on-chip memory controller must get the block from main memory. The i7 has three 64-bit memory channels that can act as one 192-bit channel, because there is only one memory controller and the same address is sent on all channels (step 14). Wide transfers happen when the channels have identical DIMMs. Each channel supports up to four DDR DIMMs (step 15). When the data return they are placed into L3 and L1 (step 16) because L3 is inclusive.

The total latency of the instruction miss that is serviced by main memory is approximately 42 processor cycles to determine that an L3 miss has occurred, plus the DRAM latency for the critical instructions. For a single-bank DDR4-2400 SDRAM and 4.0 GHz CPU, the DRAM latency is about 40 ns or 160 clock cycles to the first 16 bytes, leading to a total miss penalty of about 200 clock cycles. The memory controller fills the remainder of the 64-byte cache block at a rate of 16 bytes per I/O bus clock cycle, which takes another 5 ns or 20 clock cycles.

Because the second-level cache is a write-back cache, any miss can lead to an old block being written back to memory. The i7 has a 10-entry merging write buffer that writes back dirty cache lines when the next level in the cache is unused for a read. The write buffer is checked on a miss to see if the cache line exists in the buffer; if so, the miss is filled from the buffer. A similar buffer is used between the L1 and L2 caches.

If this initial instruction is a load, the data address is sent to the data cache and data TLBs, acting very much like an instruction cache access. Suppose the instruction is a store instead of a load. When the store issues, it does a data cache lookup just like a load. A miss causes the block to be placed in a write buffer because the L1 cache does not allocate the block on a write miss. On a hit, the store does not update the L1 (or L2) cache until later, after it is known to be nonspeculative. During this time, the store resides in a load-store queue, part of the out-of-order control mechanism of the processor.

The i7 also supports prefetching for L1 and L2 from the next level in the hierarchy. In most cases, the prefetched line is simply the next block in the cache. By prefetching only for L1 and L2, high-cost unnecessary fetches to memory are avoided.
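The index and tag widths used in this walkthrough follow mechanically from the cache parameters in Figure 2.24 and the 36-bit physical address; the following minimal sketch (the helper names are ours) reproduces the 6-, 10-, and 13-bit indexes and the 24-, 20-, and 17-bit tags.

#include <stdio.h>

/* log2 for exact powers of two */
static int lg2(long x) { int n = 0; while (x > 1) { x >>= 1; n++; } return n; }

/* Split a physical address into tag/index/block-offset widths:
   2^index = cache size / (block size x associativity). */
static void split(const char *name, long size, int block, int assoc, int paddr) {
    int offset = lg2(block);
    int index  = lg2(size / ((long)block * assoc));
    printf("%s: offset=%d index=%d tag=%d\n", name, offset, index,
           paddr - index - offset);
}

int main(void) {
    split("L1 I", 32L * 1024, 64, 8, 36);          /* index 6,  tag 24 */
    split("L2  ", 256L * 1024, 64, 4, 36);         /* index 10, tag 20 */
    split("L3  ", 8L * 1024 * 1024, 64, 16, 36);   /* index 13, tag 17 */
    return 0;
}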


Performance of the i7 Memory System

We evaluate the performance of the i7 cache structure using the SPECint2006 benchmarks. The data in this section were collected by Professor Lu Peng and PhD student Qun Liu, both of Louisiana State University. Their analysis is based on earlier work (see Prakash and Peng, 2008).

The complexity of the i7 pipeline, with its use of an autonomous instruction fetch unit, speculation, and both instruction and data prefetch, makes it hard to compare cache performance against simpler processors. As mentioned on page 110, processors that use prefetch can generate cache accesses independent of the memory accesses performed by the program. A cache access that is generated because of an actual instruction access or data access is sometimes called a demand access to distinguish it from a prefetch access. Demand accesses can come from both speculative instruction fetches and speculative data accesses, some of which are subsequently canceled (see Chapter 3 for a detailed description of speculation and instruction graduation). A speculative processor generates at least as many misses as an in-order nonspeculative processor, and typically more. In addition to demand misses, there are prefetch misses for both instructions and data.

The i7's instruction fetch unit attempts to fetch 16 bytes every cycle, which complicates comparing instruction cache miss rates because multiple instructions are fetched every cycle (roughly 4.5 on average). In fact, the entire 64-byte cache line is read and subsequent 16-byte fetches do not require additional accesses. Thus misses are tracked only on the basis of 64-byte blocks. The 32 KiB, eight-way set associative instruction cache leads to a very low instruction miss rate for the SPECint2006 programs. If, for simplicity, we measure the miss rate of SPECint2006 as the number of misses for a 64-byte block divided by the number of instructions that complete, the miss rates are all under 1% except for one benchmark (XALANCBMK), which has a 2.9% miss rate. Because a 64-byte block typically contains 16–20 instructions, the effective miss rate per instruction is much lower, depending on the degree of spatial locality in the instruction stream.

The frequency at which the instruction fetch unit is stalled waiting for the I-cache misses is similarly small (as a percentage of total cycles), increasing to 2% for two benchmarks and 12% for XALANCBMK, which has the highest I-cache miss rate. In the next chapter, we will see how stalls in the IFU contribute to overall reductions in pipeline throughput in the i7.

The L1 data cache is more interesting and even trickier to evaluate because in addition to the effects of prefetching and speculation, the L1 data cache is not write-allocated, and writes to cache blocks that are not present are not treated as misses. For this reason, we focus only on memory reads. The performance monitor measurements in the i7 separate out prefetch accesses from demand accesses, but only keep demand accesses for those instructions that graduate. The effect of speculative instructions that do not graduate is not negligible, although pipeline effects probably dominate secondary cache effects caused by speculation; we will return to the issue in the next chapter.


[Figure 2.26 chart: L1 data cache miss rate per SPECint2006 benchmark, shown both for prefetches and demand reads together and for demand reads only.]

Figure 2.26 The L1 data cache miss rate for the SPECint2006 benchmarks is shown in two ways relative to the demand L1 reads: one including both demand and prefetch accesses and one including only demand accesses. The i7 separates out L1 misses for a block not present in the cache and L1 misses for a block already outstanding that is being prefetched from L2; we treat the latter group as hits because they would hit in a blocking cache. These data, like the rest in this section, were collected by Professor Lu Peng and PhD student Qun Liu, both of Louisiana State University, based on earlier studies of the Intel Core Duo and other processors (see Peng et al., 2008).

To address these issues, while keeping the amount of data reasonable, Figure 2.26 shows the L1 data cache misses in two ways:

1. The L1 miss rate relative to demand references, given by L1 misses including prefetches and speculative loads divided by L1 demand read references for those instructions that graduate.


2. The demand miss rate, given by L1 demand misses divided by L1 demand read references, both measurements only for instructions that graduate.

On average, the miss rate including prefetches is 2.8 times as high as the demand-only miss rate. Comparing these data to those from the earlier i7 920, which had the same size L1, we see that the miss rate including prefetches is higher on the newer i7, but the number of demand misses, which are more likely to cause a stall, is usually lower.

To understand the effectiveness of the aggressive prefetch mechanisms in the i7, let's look at some measurements of prefetching. Figure 2.27 shows both the fraction of L2 requests that are prefetches versus demand requests and the prefetch miss rate. The data are probably astonishing at first glance: there are roughly 1.5 times as many prefetches as there are L2 demand requests, which come directly from L1 misses. Furthermore, the prefetch miss rate is amazingly high, with an average miss rate of 58%. Although the prefetch ratio varies considerably, the prefetch miss rate is always significant. At first glance, you might conclude that the designers made a mistake: they are prefetching too much, and the miss rate is too high. Notice, however, that the benchmarks with the higher prefetch ratios (ASTAR, BZIP2, HMMER, LIBQUANTUM, and OMNETPP) also show the greatest gap between the prefetch miss rate and the demand miss rate, more than a factor of 2 in each case. The aggressive prefetching is trading prefetch misses, which occur earlier, for demand misses, which occur later; and as a result, a pipeline stall is less likely to occur due to the prefetching.

Similarly, consider the high prefetch miss rate. If the majority of the prefetches are actually useful (this is hard to measure because it involves tracking individual cache blocks), then a prefetch miss indicates a likely L2 cache miss in the future. Uncovering and handling the miss earlier via the prefetch is likely to reduce the stall cycles. Performance analysis of speculative superscalars, like the i7, has shown that cache misses tend to be the primary cause of pipeline stalls, because it is hard to keep the processor going, especially for longer running L2 and L3 misses. The Intel designers could not easily increase the size of the caches without incurring both energy and cycle time impacts; thus the use of aggressive prefetching to try to lower effective cache miss penalties is an interesting alternative approach.

With the combination of the L1 demand misses and prefetches going to L2, roughly 17% of the loads generate an L2 request. Analyzing L2 performance requires including the effects of writes (because L2 is write-allocated), as well as the prefetch hit rate and the demand hit rate. Figure 2.28 shows the miss rates of the L2 caches for demand and prefetch accesses, both versus the number of L1 references (reads and writes). As with L1, prefetches are a significant contributor, generating 75% of the L2 misses. Comparing the L2 demand miss rate with that of earlier i7 implementations (again with the same L2 size) shows that the i7 6700 has a lower L2 demand miss rate by an approximate factor of 2, which may well justify the higher prefetch miss rate.
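A minimal sketch of the two ratios just defined, computed from hypothetical counter totals; the names and numbers (chosen so that the combined rate comes out 2.8 times the demand-only rate, as in the average above) are illustrative, not the i7's actual performance-monitor events.

#include <stdio.h>

int main(void) {
    /* Hypothetical counts for one benchmark run (illustrative only). */
    long demand_reads          = 1000000;  /* graduated demand L1 read refs     */
    long demand_misses         =   40000;  /* demand reads that missed in L1    */
    long prefetch_spec_misses  =   72000;  /* misses from prefetches and
                                              speculative loads                 */

    double demand_rate = (double)demand_misses / demand_reads;
    double total_rate  = (double)(demand_misses + prefetch_spec_misses)
                         / demand_reads;   /* both relative to demand reads     */

    printf("demand-only: %.1f%%, including prefetches: %.1f%% (%.1fx)\n",
           100 * demand_rate, 100 * total_rate, total_rate / demand_rate);
    return 0;
}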


[Figure 2.27 chart: per SPECint2006 benchmark, the ratio of prefetches to L2 demand references (columns) and the prefetch miss rate (line).]

Figure 2.27 The fraction of L2 requests that are prefetches is shown via the columns and the left axis. The right axis and the line show the prefetch miss rate. These data, like the rest in this section, were collected by Professor Lu Peng and PhD student Qun Liu, both of Louisiana State University, based on earlier studies of the Intel Core Duo and other processors (see Peng et al., 2008).

Because the cost for a miss to memory is over 100 cycles and the average data miss rate in L2 combining both prefetch and demand misses is over 7%, L3 is obviously critical. Without L3 and assuming that about one-third of the instructions are loads or stores, L2 cache misses could add over two cycles per instruction to the CPI! Obviously, prefetching past L2 would make no sense without an L3. In comparison, the average L3 data miss rate of 0.5% is still significant but less than one-third of the L2 demand miss rate and 10 times less than the L1 demand miss rate. Only in two benchmarks (OMNETPP and MCF) is the L3 miss rate

[Figure 2.28 chart: L2 demand miss rate and L2 prefetch miss rate per SPECint2006 benchmark, both relative to all L1 references.]

Figure 2.28 The L2 demand miss rate and prefetch miss rate, both shown relative to all the references to L1, which also includes prefetches, speculative loads that do not complete, and program-generated loads and stores (demand references). These data, like the rest in this section, were collected by Professor Lu Peng and PhD student Qun Liu, both of Louisiana State University.

above 0.5%; in those two cases, the miss rate of about 2.3% likely dominates all other performance losses. In the next chapter, we will examine the relationship between the i7 CPI and cache misses, as well as other pipeline effects.
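The "over two cycles per instruction" estimate above follows from a one-line calculation; here is a minimal sketch using the figures quoted in the text (about one-third of instructions are memory references, an L2 miss rate of about 7% of L1 references, and a cost of over 100 cycles for a miss to memory).

#include <stdio.h>

int main(void) {
    double mem_refs_per_instr = 1.0 / 3.0;  /* loads and stores per instruction */
    double l2_miss_rate       = 0.07;       /* L2 misses per L1 data reference  */
    double miss_cost          = 100.0;      /* cycles to memory, a lower bound  */

    /* CPI added if every L2 miss had to go all the way to memory (no L3). */
    double added_cpi = mem_refs_per_instr * l2_miss_rate * miss_cost;
    printf("added CPI without L3: > %.1f cycles per instruction\n", added_cpi);
    return 0;
}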

2.7 Fallacies and Pitfalls

As the most naturally quantitative of the computer architecture disciplines, memory hierarchy would seem to be less vulnerable to fallacies and pitfalls. Yet we were limited here not by lack of warnings, but by lack of space!

Fallacy

Predicting cache performance of one program from another. Figure 2.29 shows the instruction miss rates and data miss rates for three programs from the SPEC2000 benchmark suite as cache size varies. Depending on the


[Figure 2.29 chart: instruction and data misses per 1000 instructions for gap, gcc, and lucas as cache size varies from 4 KB to 4096 KB.]

Figure 2.29 Instruction and data misses per 1000 instructions as cache size varies from 4 KiB to 4096 KiB. Instruction misses for gcc are 30,000–40,000 times larger than for lucas, and, conversely, data misses for lucas are 2–60 times larger than for gcc. The programs gap, gcc, and lucas are from the SPEC2000 benchmark suite.

program, the data misses per thousand instructions for a 4096 KiB cache are 9, 2, or 90, and the instruction misses per thousand instructions for a 4 KiB cache are 55, 19, or 0.0004. Commercial programs such as databases will have significant miss rates even in large second-level caches, which is generally not the case for the SPECCPU programs. Clearly, generalizing cache performance from one program to another is unwise. As Figure 2.24 reminds us, there is a great deal of variation, and even predictions about the relative miss rates of integer and floating-point-intensive programs can be wrong, as mcf and sphinx3 remind us!

Pitfall

Simulating enough instructions to get accurate performance measures of the memory hierarchy. There are really three pitfalls here. One is trying to predict performance of a large cache using a small trace. Another is that a program's locality behavior is not constant over the run of the entire program. The third is that a program's locality behavior may vary depending on the input. Figure 2.30 shows the cumulative average instruction misses per thousand instructions for five inputs to a single SPEC2000 program. For these inputs, the average miss rate for the first 1.9 billion instructions is very different from the average miss rate for the rest of the execution.

Pitfall

Not delivering high memory bandwidth in a cache-based system. Caches help with average cache memory latency but may not deliver high memory bandwidth to an application that must go to main memory. The architect must design a high bandwidth memory behind the cache for such applications. We will revisit this pitfall in Chapters 4 and 5.

[Figure 2.30 charts: running average instruction misses per 1000 references for five perl inputs, over the first 1.9 billion instructions (top) and over the complete runs (bottom).]

Figure 2.30 Instruction misses per 1000 references for five inputs to the perl benchmark in SPEC2000. There is little variation in misses and little difference between the five inputs for the first 1.9 billion instructions. Running to completion shows how misses vary over the life of the program and how they depend on the input. The top graph shows the running average misses for the first 1.9 billion instructions, which starts at about 2.5 and ends at about 4.7 misses per 1000 references for all five inputs. The bottom graph shows the running average misses to run to completion, which takes 16–41 billion instructions depending on the input. After the first 1.9 billion instructions, the misses per 1000 references vary from 2.4 to 7.9 depending on the input. The simulations were for the Alpha processor using separate L1 caches for instructions and data, each being two-way 64 KiB with LRU, and a unified 1 MiB direct-mapped L2 cache.

Pitfall

Implementing a virtual machine monitor on an instruction set architecture that wasn't designed to be virtualizable. Many architects in the 1970s and 1980s weren't careful to make sure that all instructions reading or writing information related to hardware resources were privileged. This laissez faire attitude causes problems for VMMs for all of these architectures, including the 80x86, which we use here as an example. Figure 2.31 describes the 18 instructions that cause problems for virtualization (Robin and Irvine, 2000). The two broad classes are instructions that

■ read control registers in user mode that reveal that the guest operating system is running in a virtual machine (such as POPF mentioned earlier), and

■ check protection as required by the segmented architecture but assume that the operating system is running at the highest privilege level.

Virtual memory is also challenging. Because the 80x86 TLBs do not support process ID tags, as do most RISC architectures, it is more expensive for the VMM and guest OSes to share the TLB; each address space change typically requires a TLB flush.

Problem category: Access sensitive registers without trapping when running in user mode
Problem 80x86 instructions: Store global descriptor table register (SGDT); Store local descriptor table register (SLDT); Store interrupt descriptor table register (SIDT); Store machine status word (SMSW); Push flags (PUSHF, PUSHFD); Pop flags (POPF, POPFD)

Problem category: When accessing virtual memory mechanisms in user mode, instructions fail the 80x86 protection checks
Problem 80x86 instructions: Load access rights from segment descriptor (LAR); Load segment limit from segment descriptor (LSL); Verify if segment descriptor is readable (VERR); Verify if segment descriptor is writable (VERW); Pop to segment register (POP CS, POP SS, …); Push segment register (PUSH CS, PUSH SS, …); Far call to different privilege level (CALL); Far return to different privilege level (RET); Far jump to different privilege level (JMP); Software interrupt (INT); Store segment selector register (STR); Move to/from segment registers (MOVE)

Figure 2.31 Summary of 18 80x86 instructions that cause problems for virtualization (Robin and Irvine, 2000). The first five instructions of the top group allow a program in user mode to read a control register, such as a descriptor table register without causing a trap. The pop flags instruction modifies a control register with sensitive information but fails silently when in user mode. The protection checking of the segmented architecture of the 80x86 is the downfall of the bottom group because each of these instructions checks the privilege level implicitly as part of instruction execution when reading a control register. The checking assumes that the OS must be at the highest privilege level, which is not the case for guest VMs. Only the MOVE to segment register tries to modify control state, and protection checking foils it as well.


Virtualizing I/O is also a challenge for the 80x86, in part because it supports memory-mapped I/O and has separate I/O instructions, but more importantly because there are a very large number and variety of types of devices and device drivers for PCs for the VMM to handle. Third-party vendors supply their own drivers, and they may not properly virtualize. One solution for conventional VM implementations is to load real device drivers directly into the VMM.

To simplify implementations of VMMs on the 80x86, both AMD and Intel have proposed extensions to the architecture. Intel's VT-x provides a new execution mode for running VMs, an architected definition of the VM state, instructions to swap VMs rapidly, and a large set of parameters to select the circumstances where a VMM must be invoked. Altogether, VT-x adds 11 new instructions for the 80x86. AMD's Secure Virtual Machine (SVM) provides similar functionality.

After turning on the mode that enables VT-x support (via the new VMXON instruction), VT-x offers four privilege levels for the guest OS that are lower in priority than the original four (which fix issues like the problem with the POPF instruction mentioned earlier). VT-x captures all the states of a virtual machine in the Virtual Machine Control State (VMCS) and then provides atomic instructions to save and restore a VMCS. In addition to critical state, the VMCS includes configuration information to determine when to invoke the VMM and then specifically what caused the VMM to be invoked. To reduce the number of times the VMM must be invoked, this mode adds shadow versions of some sensitive registers and adds masks that check to see whether critical bits of a sensitive register will be changed before trapping. To reduce the cost of virtualizing virtual memory, AMD's SVM adds an additional level of indirection, called nested page tables, which makes shadow page tables unnecessary (see Section L.7 of Appendix L).

2.8 Concluding Remarks: Looking Ahead

Over the past thirty years there have been several predictions of the eminent [sic] cessation of the rate of improvement in computer performance. Every such prediction was wrong. They were wrong because they hinged on unstated assumptions that were overturned by subsequent events. So, for example, the failure to foresee the move from discrete components to integrated circuits led to a prediction that the speed of light would limit computer speeds to several orders of magnitude slower than they are now. Our prediction of the memory wall is probably wrong too but it suggests that we have to start thinking "out of the box."

Wm. A. Wulf and Sally A. McKee, Hitting the Memory Wall: Implications of the Obvious, Department of Computer Science, University of Virginia (December 1994). This paper introduced the term memory wall.

The possibility of using a memory hierarchy dates back to the earliest days of general-purpose digital computers in the late 1940s and early 1950s. Virtual memory was introduced in research computers in the early 1960s and into IBM mainframes in the 1970s. Caches appeared around the same time. The basic concepts


have been expanded and enhanced over time to help close the access time gap between main memory and processors, but the basic concepts remain.

One trend that is causing a significant change in the design of memory hierarchies is a continued slowdown in the improvement of both the density and the access time of DRAMs. In the past 15 years, both these trends have been observed and have been even more obvious over the past 5 years. While some increases in DRAM bandwidth have been achieved, decreases in access time have come much more slowly and almost vanished between DDR3 and DDR4. The end of Dennard scaling as well as a slowdown in Moore's Law both contributed to this situation. The trench capacitor design used in DRAMs is also limiting their ability to scale. It may well be the case that packaging technologies such as stacked memory will be the dominant source of improvements in DRAM access bandwidth and latency.

Independently of improvements in DRAM, Flash memory has been playing a much larger role. In PMDs, Flash has dominated for 15 years and became the standard for laptops almost 10 years ago. In the past few years, many desktops have shipped with Flash as the primary secondary storage. Flash's potential advantage over DRAMs, specifically the absence of a per-bit transistor to control writing, is also its Achilles heel. Flash must use bulk erase-rewrite cycles that are considerably slower. As a result, although Flash has become the fastest growing form of secondary storage, SDRAMs still dominate for main memory.

Although phase-change materials as a basis for memory have been around for a while, they have never been serious competitors either for magnetic disks or for Flash. The recent announcement by Intel and Micron of the cross-point technology may change this. The technology appears to have several advantages over Flash, including the elimination of the slow erase-to-write cycle and greater longevity. It could be that this technology will finally replace the electromechanical disks that have dominated bulk storage for more than 50 years!

For some years, a variety of predictions have been made about the coming memory wall (see previously cited quote and paper), which would lead to serious limits on processor performance. Fortunately, the extension of caches to multiple levels (from 2 to 4), more sophisticated refill and prefetch schemes, greater compiler and programmer awareness of the importance of locality, and tremendous improvements in DRAM bandwidth (a factor of over 150 times since the mid-1990s) have helped keep the memory wall at bay. In recent years, the combination of access time constraints on the size of L1 (which is limited by the clock cycle) and energy-related limitations on the size of L2 and L3 have raised new challenges. The evolution of the i7 processor class over 6–7 years illustrates this: the caches are the same size in the i7 6700 as they were in the first generation i7 processors! The more aggressive use of prefetching is an attempt to overcome the inability to increase L2 and L3. Off-chip L4 caches are likely to become more important because they are less energy-constrained than on-chip caches.

In addition to schemes relying on multilevel caches, the introduction of out-of-order pipelines with multiple outstanding misses has allowed available instruction-level parallelism to hide the memory latency remaining in a cache-based system.
The introduction of multithreading and more thread-level parallelism takes this a step further by providing more parallelism and thus more latency-hiding


opportunities. It is likely that the use of instruction- and thread-level parallelism will be a more important tool in hiding whatever memory delays are encountered in modern multilevel cache systems.

One idea that periodically arises is the use of programmer-controlled scratchpad or other high-speed visible memories, which we will see are used in GPUs. Such ideas have never made it into the mainstream in general-purpose processors for several reasons: First, they break the memory model by introducing address spaces with different behavior. Second, unlike compiler-based or programmer-based cache optimizations (such as prefetching), memory transformations with scratchpads must completely handle the remapping from main memory address space to the scratchpad address space. This makes such transformations more difficult and limited in applicability. In GPUs (see Chapter 4), where local scratchpad memories are heavily used, the burden for managing them currently falls on the programmer. For domain-specific software systems that can use such memories, the performance gains are very significant. It is likely that HBM technologies will thus be used for caching in large, general-purpose computers and quite possibly as the main working memories in graphics and similar systems. As domain-specific architectures become more important in overcoming the limitations arising from the end of Dennard's Law and the slowdown in Moore's Law (see Chapter 7), scratchpad memories and vector-like register sets are likely to see more use.

The implications of the end of Dennard's Law affect both DRAM and processor technology. Thus, rather than a widening gulf between processors and main memory, we are likely to see a slowdown in both technologies, leading to slower overall growth rates in performance. New innovations in computer architecture and in related software that together increase performance and efficiency will be key to continuing the performance improvements seen over the past 50 years.

2.9 Historical Perspectives and References

In Section M.3 (available online) we examine the history of caches, virtual memory, and virtual machines. IBM plays a prominent role in the history of all three. References for further reading are included.

Case Studies and Exercises by Norman P. Jouppi, Rajeev Balasubramonian, Naveen Muralimanohar, and Sheng Li

Case Study 1: Optimizing Cache Performance via Advanced Techniques

Concepts illustrated by this case study

■ Nonblocking Caches
■ Compiler Optimizations for Caches
■ Software and Hardware Prefetching
■ Calculating Impact of Cache Performance on More Complex Processors


The transpose of a matrix interchanges its rows and columns; this concept is illustrated here:

A11 A12 A13 A14        A11 A21 A31 A41
A21 A22 A23 A24   →    A12 A22 A32 A42
A31 A32 A33 A34        A13 A23 A33 A43
A41 A42 A43 A44        A14 A24 A34 A44

Here is a simple C loop to show the transpose:

for (i = 0; i < 3; i++) {
  for (j = 0; j < 3; j++) {
    output[j][i] = input[i][j];
  }
}

Assume that both the input and output matrices are stored in the row major order (row major order means that the row index changes fastest). Assume that you are executing a 256 × 256 double-precision transpose on a processor with a 16 KB fully associative (don't worry about cache conflicts) least recently used (LRU) replacement L1 data cache with 64-byte blocks. Assume that the L1 cache misses or prefetches require 16 cycles and always hit in the L2 cache, and that the L2 cache can process a request every 2 processor cycles. Assume that each iteration of the preceding inner loop requires 4 cycles if the data are present in the L1 cache. Assume that the cache has a write-allocate fetch-on-write policy for write misses. Unrealistically, assume that writing back dirty cache blocks requires 0 cycles.

2.1

[10/15/15/12/20] For the preceding simple implementation, this execution order would be nonideal for the input matrix; however, applying a loop interchange optimization would create a nonideal order for the output matrix. Because loop interchange is not sufficient to improve its performance, it must be blocked instead.

a. [10] What should be the minimum size of the cache to take advantage of blocked execution?

b. [15] How do the relative number of misses in the blocked and unblocked versions compare in the preceding minimum-sized cache?

c. [15] Write code to perform a transpose with a block size parameter B that uses B × B blocks.

d. [12] What is the minimum associativity required of the L1 cache for consistent performance independent of both arrays' position in memory?

e. [20] Try out blocked and nonblocked 256 × 256 matrix transpositions on a computer. How closely do the results match your expectations based on what you know about the computer's memory system? Explain any discrepancies if possible.

2.2

[10] Assume you are designing a hardware prefetcher for the preceding unblocked matrix transposition code. The simplest type of hardware prefetcher only prefetches sequential cache blocks after a miss. More complicated “nonunit stride” hardware prefetchers can analyze a miss reference stream and detect and prefetch nonunit strides. In contrast, software prefetching can determine nonunit strides as easily as it can determine unit strides. Assume prefetches write directly into the cache and that there is no “pollution” (overwriting data that must be used before the data that are prefetched). For best performance given a nonunit stride prefetcher, in the steady state of the inner loop, how many prefetches must be outstanding at a given time?

2.3

[15/20] With software prefetching, it is important to be careful to have the prefetches occur in time for use but also to minimize the number of outstanding prefetches to live within the capabilities of the microarchitecture and minimize cache pollution. This is complicated by the fact that different processors have different capabilities and limitations.

a. [15] Create a blocked version of the matrix transpose with software prefetching.

b. [20] Estimate and compare the performance of the blocked and unblocked transpose codes both with and without software prefetching.

Case Study 2: Putting It All Together: Highly Parallel Memory Systems

Concept illustrated by this case study

■ Cross-Cutting Issues: The Design of Memory Hierarchies

The program in Figure 2.32 can be used to evaluate the behavior of a memory system. The key is having accurate timing and then having the program stride through memory to invoke different levels of the hierarchy. Figure 2.32 shows the code in C. The first part is a procedure that uses a standard utility to get an accurate measure of the user CPU time; this procedure may have to be changed to work on some systems. The second part is a nested loop to read and write memory at different strides and cache sizes. To get accurate cache timing, this code is repeated many times. The third part times the nested loop overhead only so that it can be subtracted from overall measured times to see how long the accesses were. The results are output in .csv file format to facilitate importing into spreadsheets. You may need to change CACHE_MAX depending on the question you are answering and the size of memory on the system you are measuring. Running the program in single-user mode or at least without other active applications will give more consistent results. The code in Figure 2.32 was derived from a program written by Andrea Dusseau at the University of California-Berkeley and was based on a detailed description found in Saavedra-Barrera (1992). It has been modified to fix a number of issues with more modern machines and to run under Microsoft Visual C++.

#include "stdafx.h"
#include <stdio.h>
#include <time.h>
#define ARRAY_MIN (1024)      /* 1/4 smallest cache */
#define ARRAY_MAX (4096*4096) /* 1/4 largest cache */
int x[ARRAY_MAX];             /* array going to stride through */
double get_seconds() {        /* routine to read time in seconds */
  __time64_t ltime;
  _time64( &ltime );
  return (double) ltime;
}
int label(int i) {            /* generate text labels */
  if (i
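The remainder of Figure 2.32 is not reproduced above. As a rough illustration of the measurement idea it implements, here is a minimal, portable sketch that strides through an array at varying working-set sizes and strides and prints the time per access in .csv form; it uses standard C clock() instead of the Windows-specific timing above, omits the overhead-subtraction pass, and all constants are illustrative.

#include <stdio.h>
#include <time.h>

#define MAX_BYTES (64 * 1024 * 1024)     /* largest working set tested (assumed) */
static int a[MAX_BYTES / sizeof(int)];

int main(void) {
  volatile long sink = 0;                /* keeps the reads live */
  printf("size_bytes,stride_bytes,ns_per_access\n");
  for (long size = 4096; size <= MAX_BYTES; size *= 2) {
    long n = size / (long)sizeof(int);   /* elements in this working set */
    for (long stride = 1; stride <= n / 2; stride *= 2) {
      long per_pass = n / stride;        /* accesses per sweep of the array */
      long reps = 20000000 / per_pass + 1;  /* keep total work roughly constant */
      clock_t start = clock();
      for (long r = 0; r < reps; r++)
        for (long i = 0; i < n; i += stride)
          sink += a[i];                  /* read at the chosen stride */
      double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
      printf("%ld,%ld,%.2f\n", size, stride * (long)sizeof(int),
             1e9 * secs / ((double)reps * per_pass));
    }
  }
  return 0;
}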

Figure 3.52 Sample VLIW code with two adds, two loads, and two stalls.


Loop: lw   x1,0(x2)
      addi x1,x1,1
      sw   x1,0(x2)
      addi x2,x2,4
      sub  x4,x3,x2
      bnz  x4,Loop

Figure 3.53 Code loop for Exercise 3.11.

[Figure 3.54 diagram: instructions from the decoder enter a reservation station, which dispatches to ALU 0, ALU 1, and a LD/ST unit connected to memory.]

Figure 3.54 Microarchitecture for Exercise 3.12.

c. [10] Assume a dynamic branch predictor. How many cycles are lost on a correct prediction?

3.12

[15/20/20/10/20] Let's consider what dynamic scheduling might achieve here. Assume a microarchitecture as shown in Figure 3.54. Assume that the arithmetic-logical units (ALUs) can do all arithmetic ops (fmul.d, fdiv.d, fadd.d, addi, sub) and branches, and that the Reservation Station (RS) can dispatch, at most, one operation to each functional unit per cycle (one op to each ALU plus one memory op to the fld/fsd).

a. [15] Suppose all of the instructions from the sequence in Figure 3.47 are present in the RS, with no renaming having been done. Highlight any instructions in the code where register renaming would improve performance. (Hint: look for read-after-write and write-after-write hazards. Assume the same functional unit latencies as in Figure 3.47.)

b. [20] Suppose the register-renamed version of the code from part (a) is resident in the RS in clock cycle N, with latencies as given in Figure 3.47. Show how the RS should dispatch these instructions out of order, clock by clock, to


obtain optimal performance on this code. (Assume the same RS restrictions as in part (a). Also assume that results must be written into the RS before they're available for use; no bypassing.) How many clock cycles does the code sequence take?

c. [20] Part (b) lets the RS try to optimally schedule these instructions. But in reality, the whole instruction sequence of interest is not usually present in the RS. Instead, various events clear the RS, and as a new code sequence streams in from the decoder, the RS must choose to dispatch what it has. Suppose that the RS is empty. In cycle 0, the first two register-renamed instructions of this sequence appear in the RS. Assume it takes one clock cycle to dispatch any op, and assume functional unit latencies are as they were for Exercise 3.2. Further assume that the front end (decoder/register-renamer) will continue to supply two new instructions per clock cycle. Show the cycle-by-cycle order of dispatch of the RS. How many clock cycles does this code sequence require now?

d. [10] If you wanted to improve the results of part (c), which would have helped most: (1) Another ALU? (2) Another LD/ST unit? (3) Full bypassing of ALU results to subsequent operations? or (4) Cutting the longest latency in half? What's the speedup?

e. [20] Now let's consider speculation, the act of fetching, decoding, and executing beyond one or more conditional branches. Our motivation to do this is twofold: the dispatch schedule we came up with in part (c) had lots of nops, and we know computers spend most of their time executing loops (which implies the branch back to the top of the loop is pretty predictable). Loops tell us where to find more work to do; our sparse dispatch schedule suggests we have opportunities to do some of that work earlier than before. In part (d) you found the critical path through the loop. Imagine folding a second copy of that path onto the schedule you got in part (b). How many more clock cycles would be required to do two loops' worth of work (assuming all instructions are resident in the RS)? (Assume all functional units are fully pipelined.)

Exercises

3.13

[25] In this exercise, you will explore performance trade-offs between three processors that each employ different types of multithreading (MT). Each of these processors is superscalar, uses in-order pipelines, requires a fixed three-cycle stall following all loads and branches, and has identical L1 caches. Instructions from the same thread issued in the same cycle are read in program order and must not contain any data or control dependences.

■ Processor A is a superscalar simultaneous MT architecture, capable of issuing up to two instructions per cycle from two threads.

■ Processor B is a fine-grained MT architecture, capable of issuing up to four instructions per cycle from a single thread and switches threads on any pipeline stall.


■ Processor C is a coarse-grained MT architecture, capable of issuing up to eight instructions per cycle from a single thread and switches threads on an L1 cache miss.

Our application is a list searcher, which scans a region of memory for a specific value stored in x9 between the address range specified in x16 and x17. It is parallelized by evenly dividing the search space into four equal-sized contiguous blocks and assigning one search thread to each block (yielding four threads). Most of each thread's runtime is spent in the following unrolled loop body:

loop: lw   x1,0(x16)
      lw   x2,8(x16)
      lw   x3,16(x16)
      lw   x4,24(x16)
      lw   x5,32(x16)
      lw   x6,40(x16)
      lw   x7,48(x16)
      lw   x8,56(x16)
      beq  x9,x1,match0
      beq  x9,x2,match1
      beq  x9,x3,match2
      beq  x9,x4,match3
      beq  x9,x5,match4
      beq  x9,x6,match5
      beq  x9,x7,match6
      beq  x9,x8,match7
      addi x16,x16,#64
      blt  x16,x17,loop

Assume the following:

■ A barrier is used to ensure that all threads begin simultaneously.
■ The first L1 cache miss occurs after two iterations of the loop.
■ None of the beq branches is taken.
■ The blt is always taken.

All three processors schedule threads in a round-robin fashion. Determine how many cycles are required for each processor to complete the first two iterations of the loop.

3.14

[25/25/25] In this exercise, we look at how software techniques can extract instruction-level parallelism (ILP) in a common vector loop. The following loop is the so-called DAXPY loop (double-precision aX plus Y) and is the central operation in Gaussian elimination. The following code implements the DAXPY operation, Y = aX + Y, for a vector length 100. Initially, x1 is set to the base address of array X and x2 is set to the base address of Y:

      addi   x4,x1,#800   ; x4 = upper bound for X
foo:  fld    F2,0(x1)     ; (F2) = X(i)
      fmul.d F4,F2,F0     ; (F4) = a*X(i)
      fld    F6,0(x2)     ; (F6) = Y(i)
      fadd.d F6,F4,F6     ; (F6) = a*X(i) + Y(i)
      fsd    F6,0(x2)     ; Y(i) = a*X(i) + Y(i)
      addi   x1,x1,#8     ; increment X index
      addi   x2,x2,#8     ; increment Y index
      sltu   x3,x1,x4     ; test: continue loop?
      bnez   x3,foo       ; loop if needed

Assume the functional unit latencies as shown in the following table. Assume a one-cycle delayed branch that resolves in the ID stage. Assume that results are fully bypassed.

Instruction producing result       Instruction using result    Latency in clock cycles
FP multiply                        FP ALU op                   6
FP add                             FP ALU op                   4
FP multiply                        FP store                    5
FP add                             FP store                    4
Integer operations and all loads   Any                         2

a. [25] Assume a single-issue pipeline. Show how the loop would look both unscheduled by the compiler and after compiler scheduling for both floating-point operation and branch delays, including any stalls or idle clock cycles. What is the execution time (in cycles) per element of the result vector, Y, unscheduled and scheduled? How much faster must the clock be for processor hardware alone to match the performance improvement achieved by the scheduling compiler? (Neglect any possible effects of increased clock speed on memory system performance.)

b. [25] Assume a single-issue pipeline. Unroll the loop as many times as necessary to schedule it without any stalls, collapsing the loop overhead instructions. How many times must the loop be unrolled? Show the instruction schedule. What is the execution time per element of the result?

c. [25] Assume a VLIW processor with instructions that contain five operations, as shown in Figure 3.20. We will compare two degrees of loop unrolling. First, unroll the loop 6 times to extract ILP and schedule it without any stalls (i.e., no completely empty issue cycles), collapsing the loop overhead instructions, and then repeat the process but unroll the loop 10 times. Ignore the branch delay slot. Show the two schedules. What is the execution time per element of the result vector for each schedule? What percent of the operation slots are used in each schedule? How much does the size of the code differ between the two schedules? What is the total register demand for the two schedules?

3.15

[20/20] In this exercise, we will look at how variations on Tomasulo's algorithm perform when running the loop from Exercise 3.14. The functional units (FUs) are described in the following table.

FU type         Cycles in EX    Number of FUs    Number of reservation stations
Integer         1               1                5
FP adder        10              1                3
FP multiplier   15              1                2

Assume the following:

■ Functional units are not pipelined.
■ There is no forwarding between functional units; results are communicated by the common data bus (CDB).
■ The execution stage (EX) does both the effective address calculation and the memory access for loads and stores. Thus, the pipeline is IF/ID/IS/EX/WB.
■ Loads require one clock cycle.
■ The issue (IS) and write-back (WB) result stages each require one clock cycle.
■ There are five load buffer slots and five store buffer slots.
■ Assume that the Branch on Not Equal to Zero (BNEZ) instruction requires one clock cycle.

a. [20] For this problem use the single-issue Tomasulo MIPS pipeline of Figure 3.10 with the pipeline latencies from the preceding table. Show the number of stall cycles for each instruction and what clock cycle each instruction begins execution (i.e., enters its first EX cycle) for three iterations of the loop. How many cycles does each loop iteration take? Report your answer in the form of a table with the following column headers:

■ Iteration (loop iteration number)
■ Instruction
■ Issues (cycle when instruction issues)
■ Executes (cycle when instruction executes)
■ Memory access (cycle when memory is accessed)
■ Write CDB (cycle when result is written to the CDB)
■ Comment (description of any event on which the instruction is waiting)

Show three iterations of the loop in your table. You may ignore the first instruction.

b. [20] Repeat part (a) but this time assume a two-issue Tomasulo algorithm and a fully pipelined floating-point unit (FPU).

3.16

[10] Tomasulo’s algorithm has a disadvantage: only one result can complete per clock per CDB. Use the hardware configuration and latencies from the previous question and find a code sequence of no more than 10 instructions where Tomasulo’s algorithm must stall due to CDB contention. Indicate where this occurs in your sequence.

3.17

[20] An (m,n) correlating branch predictor uses the behavior of the most recent m executed branches to choose from 2^m predictors, each of which is an n-bit predictor. A two-level local predictor works in a similar fashion, but only keeps track of the past behavior of each individual branch to predict future behavior. There is a design trade-off involved with such predictors: correlating predictors require little memory for history, which allows them to maintain 2-bit predictors for a large number of individual branches (reducing the probability of branch instructions reusing the same predictor), while local predictors require substantially more memory to keep history and are thus limited to tracking a relatively small number of branch instructions. For this exercise, consider a (1,2) correlating predictor that can track four branches (requiring 16 bits) versus a (1,2) local predictor that can track two branches using the same amount of memory. For the following branch outcomes, provide each prediction, the table entry used to make the prediction, any updates to the table as a result of the prediction, and the final misprediction rate of each predictor. Assume that all branches up to this point have been taken. Initialize each predictor to the following:

Correlating predictor

Entry | Branch | Last outcome | Prediction
0 | 0 | T | T with one misprediction
1 | 0 | NT | NT
2 | 1 | T | NT
3 | 1 | NT | T
4 | 2 | T | T
5 | 2 | NT | T
6 | 3 | T | NT with one misprediction
7 | 3 | NT | NT

Local predictor

Entry | Branch | Last 2 outcomes (right is most recent) | Prediction
0 | 0 | T,T | T with one misprediction
1 | 0 | T,NT | NT
2 | 0 | NT,T | NT
3 | 0 | NT,NT | T
4 | 1 | T,T | T
5 | 1 | T,NT | T with one misprediction
6 | 1 | NT,T | NT
7 | 1 | NT,NT | NT


Branch PC (word address) | Outcome
454 | T
543 | NT
777 | NT
543 | NT
777 | NT
454 | T
777 | NT
454 | T
543 | T

3.18

[10] Suppose we have a deeply pipelined processor, for which we implement a branch-target buffer for the conditional branches only. Assume that the misprediction penalty is always four cycles and the buffer miss penalty is always three cycles. Assume a 90% hit rate, 90% accuracy, and 15% branch frequency. How much faster is the processor with the branch-target buffer versus a processor that has a fixed two-cycle branch penalty? Assume a base clock cycle per instruction (CPI) without branch stalls of one.

3.19

[10/5] Consider a branch-target buffer that has penalties of zero, two, and two clock cycles for correct conditional branch prediction, incorrect prediction, and a buffer miss, respectively. Consider a branch-target buffer design that distinguishes conditional and unconditional branches, storing the target address for a conditional branch and the target instruction for an unconditional branch.

a. [10] What is the penalty in clock cycles when an unconditional branch is found in the buffer?

b. [10] Determine the improvement from branch folding for unconditional branches. Assume a 90% hit rate, an unconditional branch frequency of 5%, and a two-cycle penalty for a buffer miss. How much improvement is gained by this enhancement? How high must the hit rate be for this enhancement to provide a performance gain?


4.1 Introduction
4.2 Vector Architecture
4.3 SIMD Instruction Set Extensions for Multimedia
4.4 Graphics Processing Units
4.5 Detecting and Enhancing Loop-Level Parallelism
4.6 Cross-Cutting Issues
4.7 Putting It All Together: Embedded Versus Server GPUs and Tesla Versus Core i7
4.8 Fallacies and Pitfalls
4.9 Concluding Remarks
4.10 Historical Perspective and References
Case Study and Exercises by Jason D. Bakos

4 Data-Level Parallelism in Vector, SIMD, and GPU Architectures

We call these algorithms data parallel algorithms because their parallelism comes from simultaneous operations across large sets of data rather than from multiple threads of control. W. Daniel Hillis and Guy L. Steele, “Data parallel algorithms,” Commun. ACM (1986)

If you were plowing a field, which would you rather use: two strong oxen or 1024 chickens? Seymour Cray, Father of the Supercomputer (arguing for two powerful vector processors versus many simple processors)



4.1 Introduction

A question for the single instruction multiple data (SIMD) architecture, which Chapter 1 introduced, has always been just how wide a set of applications has significant data-level parallelism (DLP). Fifty years after the SIMD classification was proposed (Flynn, 1966), the answer is not only the matrix-oriented computations of scientific computing but also the media-oriented image and sound processing and machine learning algorithms, as we will see in Chapter 7. Since a multiple instruction multiple data (MIMD) architecture needs to fetch one instruction per data operation, single instruction multiple data (SIMD) is potentially more energy-efficient since a single instruction can launch many data operations. These two answers make SIMD attractive for personal mobile devices as well as for servers. Finally, perhaps the biggest advantage of SIMD versus MIMD is that the programmer continues to think sequentially yet achieves parallel speedup by having parallel data operations.

This chapter covers three variations of SIMD: vector architectures, multimedia SIMD instruction set extensions, and graphics processing units (GPUs).1

The first variation, which predates the other two by more than 30 years, extends pipelined execution of many data operations. These vector architectures are easier to understand and to compile to than other SIMD variations, but they were considered too expensive for microprocessors until recently. Part of that expense was in transistors, and part was in the cost of sufficient dynamic random access memory (DRAM) bandwidth, given the widespread reliance on caches to meet memory performance demands on conventional microprocessors.

The second SIMD variation borrows from the SIMD name to mean basically simultaneous parallel data operations and is now found in most instruction set architectures that support multimedia applications. For x86 architectures, the SIMD instruction extensions started with the MMX (multimedia extensions) in 1996, which were followed by several SSE (streaming SIMD extensions) versions in the next decade, and they continue until this day with AVX (advanced vector extensions). To get the highest computation rate from an x86 computer, you often need to use these SIMD instructions, especially for floating-point programs.

The third variation on SIMD comes from the graphics accelerator community, offering higher potential performance than is found in traditional multicore computers today. Although GPUs share features with vector architectures, they have their own distinguishing characteristics, in part because of the ecosystem in which they evolved. This environment has a system processor and system memory in addition to the GPU and its graphics memory. In fact, to recognize those distinctions, the GPU community refers to this type of architecture as heterogeneous.

1. This chapter is based on material in Appendix F, “Vector Processors,” by Krste Asanovic, and Appendix G, “Hardware and Software for VLIW and EPIC” from the 5th edition of this book; on material in Appendix A, “Graphics and Computing GPUs,” by John Nickolls and David Kirk, from the 5th edition of Computer Organization and Design; and to a lesser extent on material in “Embracing and Extending 20th-Century Instruction Set Architectures,” by Joe Gebis and David Patterson, IEEE Computer, April 2007.


For problems with lots of data parallelism, all three SIMD variations share the advantage of being easier on the programmer than classic parallel MIMD programming. The goal of this chapter is for architects to understand why vector is more general than multimedia SIMD, as well as the similarities and differences between vector and GPU architectures. Because vector architectures are supersets of the multimedia SIMD instructions, including a better model for compilation, and because GPUs share several similarities with vector architectures, we start with vector architectures to set the foundation for the following two sections. The next section introduces vector architectures, and Appendix G goes much deeper into the subject.

4.2 Vector Architecture

The most efficient way to execute a vectorizable application is a vector processor.
Jim Smith, International Symposium on Computer Architecture (1994)

Vector architectures grab sets of data elements scattered about memory, place them into large sequential register files, operate on data in those register files, and then disperse the results back into memory. A single instruction works on vectors of data, which results in dozens of register-register operations on independent data elements. These large register files act as compiler-controlled buffers, both to hide memory latency and to leverage memory bandwidth. Because vector loads and stores are deeply pipelined, the program pays the long memory latency only once per vector load or store versus once per element, thus amortizing the latency over, say, 32 elements. Indeed, vector programs strive to keep the memory busy.

The power wall leads architects to value architectures that can deliver good performance without the energy and design complexity costs of highly out-of-order superscalar processors. Vector instructions are a natural match to this trend because architects can use them to increase performance of simple in-order scalar processors without greatly raising energy demands and design complexity. In practice, developers can express many of the programs that ran well on complex out-of-order designs more efficiently as data-level parallelism in the form of vector instructions, as Kozyrakis and Patterson (2002) showed.
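To make "vectorizable" concrete, here is a minimal C sketch (ours, not the book's; the function name and signature are illustrative) of the classic DAXPY computation Y = a*X + Y. Every iteration is independent, so a vector compiler can map the loop body onto a pair of vector loads, a vector multiply-add, and a vector store that each operate on dozens of elements at a time:

    // DAXPY: the archetypal vectorizable loop (illustrative sketch)
    void daxpy(long n, double a, const double *x, double *y)
    {
        for (long i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];   // no loop-carried dependence: iterations are independent
    }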

RV64V Extension

We begin with a vector processor consisting of the primary components that Figure 4.1 shows. It is loosely based on the 40-year-old Cray-1, which was one of the first supercomputers. At the time of the writing of this edition, the RISC-V vector instruction set extension RVV was still under development.

[Figure 4.1 block diagram: main memory feeding a vector load/store unit and scalar registers; vector registers connected to vector functional units for FP add/subtract, FP multiply, FP divide, integer, and logical operations.]

Figure 4.1 The basic structure of a vector architecture, RV64V, which includes a RISC-V scalar architecture. There are also 32 vector registers, and all the functional units are vector functional units. The vector and scalar registers have a significant number of read and write ports to allow multiple simultaneous vector operations. A set of crossbar switches (thick gray lines) connects these ports to the inputs and outputs of the vector functional units.

(The vector extension by itself is called RVV, so RV64V refers to the RISC-V base instructions plus the vector extension.) We show a subset of RV64V, trying to capture its essence in a few pages. The primary components of the instruction set architecture of RV64V are the following:

■ Vector registers—Each vector register holds a single vector, and RV64V has 32 of them, each 64 bits wide. The vector register file needs to provide enough ports to feed all the vector functional units. These ports will allow a high degree of overlap among vector operations to different vector registers. The read and write ports, which total at least 16 read ports and 8 write ports, are connected to the functional unit inputs or outputs by a pair of crossbar switches. One way to increase the register file bandwidth is to compose it from multiple banks, which work well with relatively long vectors.

■ Vector functional units—Each unit is fully pipelined in our implementation, and it can start a new operation on every clock cycle. A control unit is needed to detect hazards, both structural hazards for functional units and data hazards on register accesses. Figure 4.1 shows that we assume an implementation of RV64V has five functional units. For simplicity, we focus on the floating-point functional units in this section.

■ Vector load/store unit—The vector memory unit loads or stores a vector to or from memory. The vector loads and stores are fully pipelined in our hypothetical RV64V implementation so that words can be moved between the vector registers and memory with a bandwidth of one word per clock cycle, after an initial latency. This unit would also normally handle scalar loads and stores.

■ A set of scalar registers—Scalar registers can likewise provide data as input to the vector functional units, as well as compute addresses to pass to the vector load/store unit. These are the normal 31 general-purpose registers and 32 floating-point registers of RV64G. One input of the vector functional units latches scalar values as they are read out of the scalar register file.

Figure 4.2 lists the RV64V vector instructions we use in this section. The description in Figure 4.2 assumes that the input operands are all vector registers, but there are also versions of these instructions where an operand can be a scalar register (xi or fi). RV64V uses the suffix .vv when both are vectors, .vs when the second operand is a scalar, and .sv when the first is a scalar register. Thus these three are all valid RV64V instructions: vsub.vv, vsub.vs, and vsub.sv. (Add and other commutative operations have only the first two versions, as vadd.sv and vadd.vs would be redundant.) Because the operands determine the version of the instruction, we usually let the assembler supply the appropriate suffix. The vector functional unit gets a copy of the scalar value at instruction issue time.

Although the traditional vector architectures didn’t support narrow data types efficiently, vectors naturally accommodate varying data sizes (Kozyrakis and Patterson, 2002). Thus, if a vector register has 32 64-bit elements, then 128 16-bit elements, and even 256 8-bit elements are equally valid views. Such hardware multiplicity is why a vector architecture can be useful for multimedia applications as well as for scientific applications.

Note that the RV64V instructions in Figure 4.2 omit the data type and size! An innovation of RV64V is to associate a data type and data size with each vector register, rather than the normal approach of the instruction supplying that information. Thus, before executing the vector instructions, a program configures the vector registers being used to specify their data type and widths. Figure 4.3 lists the options for RV64V.

Mnemonic | Name | Description
vadd | ADD | Add elements of V[rs1] and V[rs2], then put each result in V[rd]
vsub | SUBtract | Subtract elements of V[rs2] from V[rs1], then put each result in V[rd]
vmul | MULtiply | Multiply elements of V[rs1] and V[rs2], then put each result in V[rd]
vdiv | DIVide | Divide elements of V[rs1] by V[rs2], then put each result in V[rd]
vrem | REMainder | Take remainder of elements of V[rs1] by V[rs2], then put each result in V[rd]
vsqrt | SQuare RooT | Take square root of elements of V[rs1], then put each result in V[rd]
vsll | Shift Left | Shift elements of V[rs1] left by V[rs2], then put each result in V[rd]
vsrl | Shift Right | Shift elements of V[rs1] right by V[rs2], then put each result in V[rd]
vsra | Shift Right Arithmetic | Shift elements of V[rs1] right by V[rs2] while extending sign bit, then put each result in V[rd]
vxor | XOR | Exclusive OR elements of V[rs1] and V[rs2], then put each result in V[rd]
vor | OR | Inclusive OR elements of V[rs1] and V[rs2], then put each result in V[rd]
vand | AND | Logical AND elements of V[rs1] and V[rs2], then put each result in V[rd]
vsgnj | SiGN source | Replace sign bits of V[rs1] with sign bits of V[rs2], then put each result in V[rd]
vsgnjn | Negative SiGN source | Replace sign bits of V[rs1] with complemented sign bits of V[rs2], then put each result in V[rd]
vsgnjx | Xor SiGN source | Replace sign bits of V[rs1] with xor of sign bits of V[rs1] and V[rs2], then put each result in V[rd]
vld | Load | Load vector register V[rd] from memory starting at address R[rs1]
vlds | Strided Load | Load V[rd] from address at R[rs1] with stride in R[rs2] (i.e., R[rs1] + i × R[rs2])
vldx | Indexed Load (Gather) | Load V[rs1] with vector whose elements are at R[rs2] + V[rs2] (i.e., V[rs2] is an index)
vst | Store | Store vector register V[rd] into memory starting at address R[rs1]
vsts | Strided Store | Store V[rd] into memory at address R[rs1] with stride in R[rs2] (i.e., R[rs1] + i × R[rs2])
vstx | Indexed Store (Scatter) | Store V[rs1] into memory vector whose elements are at R[rs2] + V[rs2] (i.e., V[rs2] is an index)
vpeq | Compare = | Compare elements of V[rs1] and V[rs2]. When equal, put a 1 in the corresponding 1-bit element of p[rd]; otherwise, put 0
vpne | Compare != | Compare elements of V[rs1] and V[rs2]. When not equal, put a 1 in the corresponding 1-bit element of p[rd]; otherwise, put 0

vplt | Compare < | Compare elements of V[rs1] and V[rs2]. When less than, put a 1 in the corresponding 1-bit element of p[rd]; otherwise, put 0

Instruction | Example | Meaning | Comments
max.type | max.f32 d, a, b | d = (a > b)? a: b; | floating selects non-NaN
setp.cmp.type | setp.lt.f32 p, a, b | p = (a < b); | compare and set predicate
numeric .cmp = eq, ne, lt, le, gt, ge; unordered .cmp = equ, neu, ltu, leu, gtu, geu, num, nan
mov.type | mov.b32 d, a | d = a; | move
selp.type | selp.f32 d, a, b, p | d = p? a: b; | select with predicate
cvt.dtype.atype | cvt.f32.s32 d, a | d = convert(a); | convert atype to dtype

Special function: .type = .f32 (some .f64)
rcp.type | rcp.f32 d, a | d = 1/a; | reciprocal
sqrt.type | sqrt.f32 d, a | d = sqrt(a); | square root
rsqrt.type | rsqrt.f32 d, a | d = 1/sqrt(a); | reciprocal square root
sin.type | sin.f32 d, a | d = sin(a); | sine
cos.type | cos.f32 d, a | d = cos(a); | cosine
lg2.type | lg2.f32 d, a | d = log(a)/log(2) | binary logarithm
ex2.type | ex2.f32 d, a | d = 2 ** a; | binary exponential

Logical: .type = .pred, .b32, .b64
and.type | and.b32 d, a, b | d = a & b;
or.type | or.b32 d, a, b | d = a | b;
xor.type | xor.b32 d, a, b | d = a ^ b;
not.type | not.b32 d, a, b | d = ~a; | one’s complement
cnot.type | cnot.b32 d, a, b | d = (a==0)? 1:0; | C logical not
shl.type | shl.b32 d, a, b | d = a << b; | shift left
shr.type | shr.b32 d, a, b | d = a >> b; | shift right

Memory access: .space = .global, .shared, .local, .const; .type = .b8, .u8, .s8, .b16, .b32, .b64
ld.space.type | ld.global.b32 d, [a+off] | d = *(a+off); | load from memory space
st.space.type | st.shared.b32 [d+off], a | *(d+off) = a; | store to memory space
tex.nd.dtyp.btype | tex.2d.v4.f32.f32 d, a, b | d = tex2d(a, b); | texture lookup
atom.spc.op.type | atom.global.add.u32 d,[a], b; atom.global.cas.b32 d,[a], b, c | atomic { d = *a; *a = op(*a, b); } | atomic read-modify-write operation
atom.op = and, or, xor, add, min, max, exch, cas; .spc = .global; .type = .b32

Control flow
branch | @p bra target | if (p) goto target; | conditional branch
call | call (ret), func, (params) | ret = func(params); | call function
ret | ret | return; | return from function call
bar.sync | bar.sync d | wait for threads | barrier synchronization
exit | exit | exit; | terminate thread execution

Figure 4.17 Basic PTX GPU thread instructions.


All data transfers are gather-scatter! To regain the efficiency of sequential (unit-stride) data transfers, GPUs include special Address Coalescing hardware to recognize when the SIMD Lanes within a thread of SIMD instructions are collectively issuing sequential addresses. That runtime hardware then notifies the Memory Interface Unit to request a block transfer of 32 sequential words. To get this important performance improvement, the GPU programmer must ensure that adjacent CUDA Threads access nearby addresses at the same time so that they can be coalesced into one or a few memory or cache blocks, which our example does.
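As a rough CUDA sketch of this guideline (ours, not from the text; the kernel names, types, and the factor of 32 are illustrative), the first kernel below lets the Address Coalescing hardware merge a warp's 32 loads into a few block transfers, while the second defeats coalescing by making adjacent CUDA Threads touch addresses 32 words apart:

    __global__ void coalesced(const float *x, float *y, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;          // adjacent threads touch adjacent words
        if (i < n)
            y[i] = 2.0f * x[i];
    }

    __global__ void strided(const float *x, float *y, int n)
    {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * 32;   // adjacent threads are 32 words apart
        if (i < n)
            y[i] = 2.0f * x[i];
    }

Both kernels compute the same values for the elements they touch; only the address pattern, and therefore the number of memory blocks requested per SIMD Thread, differs.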

Conditional Branching in GPUs

Just like the case with unit-stride data transfers, there are strong similarities between how vector architectures and GPUs handle IF statements, with the former implementing the mechanism largely in software with limited hardware support and the latter making use of even more hardware. As we will see, in addition to explicit predicate registers, GPU branch hardware uses internal masks, a branch synchronization stack, and instruction markers to manage when a branch diverges into multiple execution paths and when the paths converge.

At the PTX assembler level, control flow of one CUDA Thread is described by the PTX instructions branch, call, return, and exit, plus individual per-thread-lane predication of each instruction, specified by the programmer with per-thread-lane 1-bit predicate registers. The PTX assembler analyzes the PTX branch graph and optimizes it to the fastest GPU hardware instruction sequence. Each CUDA Thread can make its own decision on a branch and does not need to be in lock step.

At the GPU hardware instruction level, control flow includes branch, jump, jump indexed, call, call indexed, return, exit, and special instructions that manage the branch synchronization stack. GPU hardware provides each SIMD Thread with its own stack; a stack entry contains an identifier token, a target instruction address, and a target thread-active mask. There are GPU special instructions that push stack entries for a SIMD Thread and special instructions and instruction markers that pop a stack entry or unwind the stack to a specified entry and branch to the target instruction address with the target thread-active mask. GPU hardware instructions also have an individual per-lane predication (enable/disable), specified with a 1-bit predicate register for each lane.

The PTX assembler typically optimizes a simple outer-level IF-THEN-ELSE statement coded with PTX branch instructions to solely predicated GPU instructions, without any GPU branch instructions. A more complex control flow often results in a mixture of predication and GPU branch instructions with special instructions and markers that use the branch synchronization stack to push a stack entry when some lanes branch to the target address, while others fall through. NVIDIA says a branch diverges when this happens. This mixture is also used when a SIMD Lane executes a synchronization marker or converges, which pops a stack entry and branches to the stack-entry address with the stack-entry thread-active mask.


The PTX assembler identifies loop branches and generates GPU branch instructions that branch to the top of the loop, along with special stack instructions to handle individual lanes breaking out of the loop and converging the SIMD Lanes when all lanes have completed the loop. GPU indexed jump and indexed call instructions push entries on the stack so that when all lanes complete the switch statement or function call, the SIMD Thread converges.

A GPU set predicate instruction (setp in Figure 4.17) evaluates the conditional part of the IF statement. The PTX branch instruction then depends on that predicate. If the PTX assembler generates predicated instructions with no GPU branch instructions, it uses a per-lane predicate register to enable or disable each SIMD Lane for each instruction. The SIMD instructions in the threads inside the THEN part of the IF statement broadcast operations to all the SIMD Lanes. Those lanes with the predicate set to 1 perform the operation and store the result, and the other SIMD Lanes don’t perform an operation or store a result. For the ELSE statement, the instructions use the complement of the predicate (relative to the THEN statement), so the SIMD Lanes that were idle now perform the operation and store the result while their formerly active siblings don’t. At the end of the ELSE statement, the instructions are unpredicated so the original computation can proceed. Thus, for equal length paths, an IF-THEN-ELSE operates at 50% efficiency or less. IF statements can be nested, thus the use of a stack, and the PTX assembler typically generates a mix of predicated instructions and GPU branch and special synchronization instructions for complex control flow. Note that deep nesting can mean that most SIMD Lanes are idle during execution of nested conditional statements. Thus, doubly nested IF statements with equal-length paths run at 25% efficiency, triply nested at 12.5% efficiency, and so on. The analogous case would be a vector processor operating where only a few of the mask bits are ones.

Dropping down a level of detail, the PTX assembler sets a “branch synchronization” marker on appropriate conditional branch instructions that pushes the current active mask on a stack inside each SIMD Thread. If the conditional branch diverges (some lanes take the branch but some fall through), it pushes a stack entry and sets the current internal active mask based on the condition. A branch synchronization marker pops the diverged branch entry and flips the mask bits before the ELSE portion. At the end of the IF statement, the PTX assembler adds another branch synchronization marker that pops the prior active mask off the stack into the current active mask. If all the mask bits are set to 1, then the branch instruction at the end of the THEN skips over the instructions in the ELSE part. There is a similar optimization for the THEN part in case all the mask bits are 0 because the conditional branch jumps over the THEN instructions.

Parallel IF statements and PTX branches often use branch conditions that are unanimous (all lanes agree to follow the same path) such that the SIMD Thread does not diverge into a different individual lane control flow. The PTX assembler optimizes such branches to skip over blocks of instructions that are not executed by any lane of a SIMD Thread.

This optimization is useful in conditional error checking, for example, where the test must be made but is rarely taken. The code for a conditional statement similar to the one in Section 4.2 is

    if (X[i] != 0)
        X[i] = X[i] - Y[i];
    else X[i] = Z[i];

This IF statement could compile to the following PTX instructions (assuming that R8 already has the scaled thread ID), with *Push, *Comp, *Pop indicating the branch synchronization markers inserted by the PTX assembler that push the old mask, complement the current mask, and pop to restore the old mask:

            ld.global.f64  RD0, [X+R8]    ; RD0 = X[i]
            setp.neq.s32   P1, RD0, #0    ; P1 is predicate reg 1
            @!P1, bra ELSE1, *Push        ; Push old mask, set new mask bits
                                          ; if P1 false, go to ELSE1
            ld.global.f64  RD2, [Y+R8]    ; RD2 = Y[i]
            sub.f64        RD0, RD0, RD2  ; Difference in RD0
            st.global.f64  [X+R8], RD0    ; X[i] = RD0
            @P1, bra ENDIF1, *Comp        ; complement mask bits
                                          ; if P1 true, go to ENDIF1
    ELSE1:  ld.global.f64  RD0, [Z+R8]    ; RD0 = Z[i]
            st.global.f64  [X+R8], RD0    ; X[i] = RD0
    ENDIF1:, *Pop                         ; pop to restore old mask

Once again, normally all instructions in the IF-THEN-ELSE statement are executed by a SIMD Processor. It’s just that only some of the SIMD Lanes are enabled for the THEN instructions and some lanes for the ELSE instructions. As previously mentioned, in the surprisingly common case that the individual lanes agree on the predicated branch—such as branching on a parameter value that is the same for all lanes so that all active mask bits are 0s or all are 1s—the branch skips the THEN instructions or the ELSE instructions.

This flexibility makes it appear that an element has its own program counter; however, in the slowest case, only one SIMD Lane could store its result every 2 clock cycles, with the rest idle. The analogous slowest case for vector architectures is operating with only one mask bit set to 1. This flexibility can lead naive GPU programmers to poor performance, but it can be helpful in the early stages of program development. Keep in mind, however, that the only choice for a SIMD Lane in a clock cycle is to perform the operation specified in the PTX instruction or be idle; two SIMD Lanes cannot simultaneously execute different instructions.

This flexibility also helps explain the name CUDA Thread given to each element in a thread of SIMD instructions, because it gives the illusion of acting independently. A naive programmer may think that this thread abstraction means GPUs handle conditional branches more gracefully.

Some threads go one way, the rest go another, which seems true as long as you’re not in a hurry. Each CUDA Thread is either executing the same instruction as every other thread in the Thread Block or it is idle. This synchronization makes it easier to handle loops with conditional branches because the mask capability can turn off SIMD Lanes and it detects the end of the loop automatically. The resulting performance sometimes belies that simple abstraction. Writing programs that operate SIMD Lanes in this highly independent MIMD mode is like writing programs that use lots of virtual address space on a computer with a smaller physical memory. Both are correct, but they may run so slowly that the programmer will not be pleased with the result.

Conditional execution is a case where GPUs do in runtime hardware what vector architectures do at compile time. Vector compilers do a double IF-conversion, generating four different masks. The execution is basically the same as GPUs, but there are some more overhead instructions executed for vectors. Vector architectures have the advantage of being integrated with a scalar processor, allowing them to avoid the time for the 0 cases when they dominate a calculation. Although it will depend on the speed of the scalar processor versus the vector processor, the crossover point when it’s better to use scalar might be when less than 20% of the mask bits are 1s. One optimization available at runtime for GPUs, but not at compile time for vector architectures, is to skip the THEN or ELSE parts when mask bits are all 0s or all 1s.

Thus the efficiency with which GPUs execute conditional statements comes down to how frequently the branches will diverge. For example, one calculation of eigenvalues has deep conditional nesting, but measurements of the code show that around 82% of clock cycle issues have between 29 and 32 out of the 32 mask bits set to 1, so GPUs execute this code more efficiently than one might expect. Note that the same mechanism handles the strip-mining of vector loops—when the number of elements doesn’t perfectly match the hardware. The example at the beginning of this section shows that an IF statement checks to see if this SIMD Lane element number (stored in R8 in the preceding example) is less than the limit (i < n), and it sets masks appropriately.
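A CUDA-source view of the same conditional (our sketch; the kernel wrapper, names, and the placement of the bounds test are ours) makes the two masks visible: the outer i < n test is the strip-mining check described above, and the inner IF is the branch that may diverge within a SIMD Thread:

    __global__ void cond(double *X, const double *Y, const double *Z, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {                   // strip-mining mask: lanes past the end are idle
            if (X[i] != 0)             // lanes within a warp may diverge here
                X[i] = X[i] - Y[i];    // THEN lanes enabled, ELSE lanes idle
            else
                X[i] = Z[i];           // mask complemented: ELSE lanes enabled
        }
    }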

NVIDIA GPU Memory Structures

Figure 4.18 shows the memory structures of an NVIDIA GPU. Each SIMD Lane in a multithreaded SIMD Processor is given a private section of off-chip DRAM, which we call the private memory. It is used for the stack frame, for spilling registers, and for private variables that don’t fit in the registers. SIMD Lanes do not share private memories. GPUs cache this private memory in the L1 and L2 caches to aid register spilling and to speed up function calls.

We call the on-chip memory that is local to each multithreaded SIMD Processor local memory. It is a small scratchpad memory with low latency (a few dozen clocks) and high bandwidth (128 bytes/clock) where the programmer can store data that needs to be reused, either by the same thread or another thread in the same Thread Block.

[Figure 4.18 diagram: a CUDA Thread with per-CUDA Thread private memory; a Thread Block with per-block local memory; a sequence of Grids (Grid 0, Grid 1, ...) sharing GPU memory, with inter-grid synchronization between them.]

Figure 4.18 GPU memory structures. GPU memory is shared by all Grids (vectorized loops), local memory is shared by all threads of SIMD instructions within a Thread Block (body of a vectorized loop), and private memory is private to a single CUDA Thread. Pascal allows preemption of a Grid, which requires that all local and private memory be able to be saved to and restored from global memory. For completeness’ sake, the GPU can also access CPU memory via the PCIe bus. This path is commonly used for a final result when its address is in host memory. This option eliminates a final copy from the GPU memory to the host memory.

Local memory is limited in size, typically to 48 KiB. It carries no state between Thread Blocks executed on the same processor. It is shared by the SIMD Lanes within a multithreaded SIMD Processor, but this memory is not shared between multithreaded SIMD Processors. The multithreaded SIMD Processor dynamically allocates portions of the local memory to a Thread Block when it creates the Thread Block, and frees the memory when all the threads of the Thread Block exit. That portion of local memory is private to that Thread Block.

Finally, we call the off-chip DRAM shared by the whole GPU and all Thread Blocks GPU Memory. Our vector multiply example used only GPU Memory. The system processor, called the host, can read or write GPU Memory. Local memory is unavailable to the host, as it is private to each multithreaded SIMD Processor. Private memories are unavailable to the host as well.
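The CUDA sketch below (ours; the names and the 256-thread block size are assumptions) shows where the three memory spaces surface in source code: in and out live in GPU Memory, buf is the per-Thread-Block local memory (declared __shared__ in CUDA), and v sits in a register or, if spilled, in private memory:

    __global__ void blockSum(const float *in, float *out, int n)
    {
        __shared__ float buf[256];       // local memory: one copy per Thread Block (assumes blockDim.x <= 256)
        float v = 0.0f;                  // private to this CUDA Thread
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            v = in[i];                   // in/out are in off-chip GPU Memory
        buf[threadIdx.x] = v;
        __syncthreads();                 // barrier among the threads of this Thread Block
        if (threadIdx.x == 0) {
            float s = 0.0f;
            for (int j = 0; j < blockDim.x; j++)
                s += buf[j];
            out[blockIdx.x] = s;         // one partial sum per Thread Block
        }
    }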


Rather than rely on large caches to contain the whole working sets of an application, GPUs traditionally use smaller streaming caches and, because their working sets can be hundreds of megabytes, rely on extensive multithreading of threads of SIMD instructions to hide the long latency to DRAM. Given the use of multithreading to hide DRAM latency, the chip area used for large L2 and L3 caches in system processors is spent instead on computing resources and on the large number of registers to hold the state of many threads of SIMD instructions. In contrast, as mentioned, vector loads and stores amortize the latency across many elements because they pay the latency only once and then pipeline the rest of the accesses.

Although hiding memory latency behind many threads was the original philosophy of GPUs and vector processors, all recent GPUs and vector processors have caches to reduce latency. The argument follows Little’s Law from queuing theory: the longer the latency, the more threads need to run during a memory access, which in turn requires more registers. Thus GPU caches are added to lower average latency and thereby mask potential shortages of the number of registers.

To improve memory bandwidth and reduce overhead, as mentioned, PTX data transfer instructions in cooperation with the memory controller coalesce individual parallel thread requests from the same SIMD Thread together into a single memory block request when the addresses fall in the same block. These restrictions are placed on the GPU program, somewhat analogous to the guidelines for system processor programs to engage hardware prefetching (see Chapter 2). The GPU memory controller will also hold requests and send them together to the same open page to improve memory bandwidth (see Section 4.6). Chapter 2 describes DRAM in sufficient detail for readers to understand the potential benefits of grouping related addresses.
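To see the scale Little’s Law implies, take some illustrative round numbers (ours, not measurements of any particular GPU): sustaining 256 GB/s of DRAM bandwidth at 400 ns of memory latency requires

Data in flight = memory bandwidth × memory latency ≈ 256 GB/s × 400 ns ≈ 100 KB

At 128 bytes per block request, that is on the order of 800 requests outstanding at once, and every outstanding request belongs to some resident SIMD Thread whose registers must stay on chip.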

Innovations in the Pascal GPU Architecture

The multithreaded SIMD Processor of Pascal is more complicated than the simplified version in Figure 4.20. To increase hardware utilization, each SIMD Processor has two SIMD Thread Schedulers, each with multiple instruction dispatch units (some GPUs have four thread schedulers). The dual SIMD Thread Scheduler selects two threads of SIMD instructions and issues one instruction from each to two sets of 16 SIMD Lanes, 16 load/store units, or 8 special function units. With multiple execution units available, two threads of SIMD instructions are scheduled each clock cycle, allowing 64 lanes to be active. Because the threads are independent, there is no need to check for data dependences in the instruction stream. This innovation would be analogous to a multithreaded vector processor that can issue vector instructions from two independent threads. Figure 4.19 shows the Dual Scheduler issuing instructions, and Figure 4.20 shows the block diagram of the multithreaded SIMD Processor of a Pascal GP100 GPU.

Each new generation of GPU typically adds some new features that increase performance or make it easier for programmers. Here are the four main innovations of Pascal:

[Figure 4.19 diagram: two SIMD Thread Schedulers, each with its own instruction dispatch unit; over successive clock cycles each scheduler issues instructions from different SIMD Threads (e.g., SIMD Threads 8 and 9, 2 and 3, 14 and 15).]

Figure 4.19 Block diagram of Pascal’s dual SIMD Thread scheduler. Compare this design to the single SIMD Thread design in Figure 4.16.

■ Fast single-precision, double-precision, and half-precision floating-point arithmetic—The Pascal GP100 chip has significant floating-point performance in three sizes, all part of the IEEE standard for floating-point. The single-precision floating-point of the GPU runs at a peak of 10 TeraFLOP/s. Double-precision is roughly half-speed at 5 TeraFLOP/s, and half-precision is about double-speed at 20 TeraFLOP/s when expressed as 2-element vectors. The atomic memory operations include floating-point add for all three sizes. Pascal GP100 is the first GPU with such high performance for half-precision.

■ High-bandwidth memory—The next innovation of the Pascal GP100 GPU is the use of stacked, high-bandwidth memory (HBM2). This memory has a wide bus with 4096 data wires running at 0.7 GHz offering a peak bandwidth of 732 GB/s, which is more than twice as fast as previous GPUs.

■ High-speed chip-to-chip interconnect—Given the coprocessor nature of GPUs, the PCI bus can be a communications bottleneck when trying to use multiple GPUs with one CPU. Pascal GP100 introduces the NVLink communications channel that supports data transfers of up to 20 GB/s in each direction. Each GP100 has 4 NVLink channels, providing a peak aggregate chip-to-chip bandwidth of 160 GB/s per chip. Systems with 2, 4, and 8 GPUs are available for multi-GPU applications, where each GPU can perform load, store, and atomic operations to any GPU connected by NVLink. Additionally, an NVLink channel can communicate with the CPU in some cases. For example, the IBM Power9 CPU supports CPU-GPU communication. In this chip, NVLink provides a coherent view of memory between all GPUs and CPUs connected together. It also provides cache-to-cache communication instead of memory-to-memory communication.


[Figure 4.20 diagram: instruction cache and two instruction buffers; two SIMD Thread Schedulers, each with paired dispatch units and a 32,768 × 32-bit register file; 64 SIMD Lanes, 32 double-precision (DP) units, 16 load/store (LD/ST) units, and 16 special function units (SFUs); texture/L1 cache, four texture units, and 64 KB shared memory.]

Figure 4.20 Block diagram of the multithreaded SIMD Processor of a Pascal GPU. Each of the 64 SIMD Lanes (cores) has a pipelined floating-point unit, a pipelined integer unit, some logic for dispatching instructions and operands to these units, and a queue for holding results. The 64 SIMD Lanes interact with 32 double-precision ALUs (DP units) that perform 64-bit floating-point arithmetic, 16 load-store units (LD/STs), and 16 special function units (SFUs) that calculate functions such as square roots, reciprocals, sines, and cosines.

■ Unified virtual memory and paging support—The Pascal GP100 GPU adds page-fault capabilities within a unified virtual address space. This feature allows a single virtual address for every data structure that is identical across all the GPUs and CPUs in a single system. When a thread accesses an address that is remote, a page of memory is transferred to the local GPU for subsequent use. Unified memory simplifies the programming model by providing demand paging instead of explicit memory copying between the CPU and GPU or between GPUs. It also allows allocating far more memory than exists on the GPU to solve problems with large memory requirements. As with any virtual memory system, care must be taken to avoid excessive page movement.

Similarities and Differences Between Vector Architectures and GPUs

As we have seen, there really are many similarities between vector architectures and GPUs. Along with the quirky jargon of GPUs, these similarities have contributed to the confusion in architecture circles about how novel GPUs really are. Now that you’ve seen what is under the covers of vector computers and GPUs, you can appreciate both the similarities and the differences. Because both architectures are designed to execute data-level parallel programs, but take different paths, this comparison is in depth in order to provide a better understanding of what is needed for DLP hardware. Figure 4.21 shows the vector term first and then the closest equivalent in a GPU.

A SIMD Processor is like a vector processor. The multiple SIMD Processors in GPUs act as independent MIMD cores, just as many vector computers have multiple vector processors. This view will consider the NVIDIA Tesla P100 as a 56-core machine with hardware support for multithreading, where each core has 64 lanes. The biggest difference is multithreading, which is fundamental to GPUs and missing from most vector processors.

Looking at the registers in the two architectures, the RV64V register file in our implementation holds entire vectors—that is, a contiguous block of elements. In contrast, a single vector in a GPU will be distributed across the registers of all SIMD Lanes. A RV64V processor has 32 vector registers with perhaps 32 elements, or 1024 elements total. A GPU thread of SIMD instructions has up to 256 registers with 32 elements each, or 8192 elements. These extra GPU registers support multithreading.

Figure 4.22 is a block diagram of the execution units of a vector processor on the left and a multithreaded SIMD Processor of a GPU on the right. For pedagogic purposes, we assume the vector processor has four lanes and the multithreaded SIMD Processor also has four SIMD Lanes. This figure shows that the four SIMD Lanes act in concert much like a four-lane vector unit, and that a SIMD Processor acts much like a vector processor.

In reality, there are many more lanes in GPUs, so GPU “chimes” are shorter. While a vector processor might have 2 to 8 lanes and a vector length of, say, 32—making a chime 4 to 16 clock cycles—a multithreaded SIMD Processor might have 8 or 16 lanes. A SIMD Thread is 32 elements wide, so a GPU chime would just be 2 or 4 clock cycles. This difference is why we use “SIMD Processor” as the more descriptive term because it is closer to a SIMD design than it is to a traditional vector processor design.
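The chime arithmetic behind that comparison, ignoring start-up latency, is simply

Chime ≈ vector length / number of lanes

For the vector processor above: 32 / 8 = 4 up to 32 / 2 = 16 clock cycles. For a 32-element SIMD Thread on the GPU: 32 / 16 = 2 or 32 / 8 = 4 clock cycles.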

Vector term | Closest CUDA/NVIDIA GPU term | Comment

Program abstractions
Vectorized Loop | Grid | Concepts are similar, with the GPU using the less descriptive term
Chime | | Because a vector instruction (PTX instruction) takes just 2 cycles on Pascal to complete, a chime is short in GPUs. Pascal has two execution units that support the most common floating-point instructions that are used alternately, so the effective issue rate is 1 instruction every clock cycle

Machine objects
Vector Instruction | PTX Instruction | A PTX instruction of a SIMD Thread is broadcast to all SIMD Lanes, so it is similar to a vector instruction
Gather/Scatter | Global load/store (ld.global/st.global) | All GPU loads and stores are gather and scatter, in that each SIMD Lane sends a unique address. It’s up to the GPU Coalescing Unit to get unit-stride performance when addresses from the SIMD Lanes allow it
Mask Registers | Predicate Registers and Internal Mask Registers | Vector mask registers are explicitly part of the architectural state, while GPU mask registers are internal to the hardware. The GPU conditional hardware adds a new feature beyond predicate registers to manage masks dynamically

Processing and memory hardware
Vector Processor | Multithreaded SIMD Processor | These are similar, but SIMD Processors tend to have many lanes, taking a few clock cycles per lane to complete a vector, while vector architectures have few lanes and take many cycles to complete a vector. They are also multithreaded where vectors usually are not
Control Processor | Thread Block Scheduler | The closest is the Thread Block Scheduler that assigns Thread Blocks to a multithreaded SIMD Processor. But GPUs have no scalar-vector operations and no unit-stride or strided data transfer instructions, which Control Processors often provide in vector architectures
Scalar Processor | System Processor | Because of the lack of shared memory and the high latency to communicate over a PCI bus (1000s of clock cycles), the system processor in a GPU rarely takes on the same tasks that a scalar processor does in a vector architecture
Vector Lane | SIMD Lane | Very similar; both are essentially functional units with registers
Vector Registers | SIMD Lane Registers | The equivalent of a vector register is the same register in all 16 SIMD Lanes of a multithreaded SIMD Processor running a thread of SIMD instructions. The number of registers per SIMD Thread is flexible, but the maximum is 256 in Pascal, so the maximum number of vector registers is 256
Main Memory | GPU Memory | Memory for GPU versus system memory in vector case

Figure 4.21 GPU equivalent to vector terms.

[Figure 4.22 diagram: on the left, a vector processor with a control processor, instruction cache, instruction register, vector registers (elements 0–63 spread across four lanes), four functional units (FU0–FU3) with mask registers, and a vector load/store unit feeding a memory interface unit; on the right, a multithreaded SIMD Processor with a SIMD Thread scheduler, per-SIMD-Thread PCs, instruction cache, dispatch unit, instruction register, per-lane registers (0–1023), four SIMD Lanes (FU0–FU3) with mask registers, and a SIMD load/store unit feeding an address coalescing unit and a memory interface unit.]

Figure 4.22 A vector processor with four lanes on the left and a multithreaded SIMD Processor of a GPU with four SIMD Lanes on the right. (GPUs typically have 16 or 32 SIMD Lanes.) The Control Processor supplies scalar operands for scalar-vector operations, increments addressing for unit and nonunit stride accesses to memory, and performs other accounting-type operations. Peak memory performance occurs only in a GPU when the Address Coalescing Unit can discover localized addressing. Similarly, peak computational performance occurs when all internal mask bits are set identically. Note that the SIMD Processor has one PC per SIMD Thread to help with multithreading.

The closest GPU term to a vectorized loop is Grid, and a PTX instruction is the closest to a vector instruction because a SIMD Thread broadcasts a PTX instruction to all SIMD Lanes. With respect to memory access instructions in the two architectures, all GPU loads are gather instructions and all GPU stores are scatter instructions. If data addresses of CUDA Threads refer to nearby addresses that fall into the same cache/memory block at the same time, the Address Coalescing Unit of the GPU will ensure high memory bandwidth. The explicit unit-stride load and store instructions of vector architectures versus the implicit unit stride of GPU programming is why writing efficient GPU code requires that programmers think in terms of SIMD operations, even though the CUDA programming model looks like MIMD. Because CUDA Threads can generate their own addresses, strided as well as gather-scatter, addressing vectors are found in both vector architectures and GPUs.


As we mentioned several times, the two architectures take very different approaches to hiding memory latency. Vector architectures amortize it across all the elements of the vector by having a deeply pipelined access, so you pay the latency only once per vector load or store. Therefore vector loads and stores are like a block transfer between memory and the vector registers. In contrast, GPUs hide memory latency using multithreading. (Some researchers are investigating adding multithreading to vector architectures to try to capture the best of both worlds.)

With respect to conditional branch instructions, both architectures implement them using mask registers. Both conditional branch paths occupy time and/or space even when they do not store a result. The difference is that the vector compiler manages mask registers explicitly in software while the GPU hardware and assembler manages them implicitly using branch synchronization markers and an internal stack to save, complement, and restore masks.

The Control Processor of a vector computer plays an important role in the execution of vector instructions. It broadcasts operations to all the Vector Lanes and broadcasts a scalar register value for vector-scalar operations. It also does implicit calculations that are explicit in GPUs, such as automatically incrementing memory addresses for unit-stride and nonunit-stride loads and stores. The Control Processor is missing in the GPU. The closest analogy is the Thread Block Scheduler, which assigns Thread Blocks (bodies of vector loop) to multithreaded SIMD Processors. The runtime hardware mechanisms in a GPU that both generate addresses and then discover if they are adjacent, which is commonplace in many DLP applications, are likely less power-efficient than using a Control Processor.

The scalar processor in a vector computer executes the scalar instructions of a vector program; that is, it performs operations that would be too slow to do in the vector unit. Although the system processor that is associated with a GPU is the closest analogy to a scalar processor in a vector architecture, the separate address spaces plus transferring over a PCIe bus means thousands of clock cycles of overhead to use them together. The scalar processor can be slower than a vector processor for floating-point computations in a vector computer, but not by the same ratio as the system processor versus a multithreaded SIMD Processor (given the overhead). Therefore each “vector unit” in a GPU must do computations that you would expect to do using a scalar processor in a vector computer. That is, rather than calculate on the system processor and communicate the results, it can be faster to disable all but one SIMD Lane using the predicate registers and built-in masks and do the scalar work with one SIMD Lane. The relatively simple scalar processor in a vector computer is likely to be faster and more power-efficient than the GPU solution. If system processors and GPUs become more closely tied together in the future, it will be interesting to see if system processors can play the same role as scalar processors do for vector and multimedia SIMD architectures.


Similarities and Differences Between Multimedia SIMD Computers and GPUs

At a high level, multicore computers with multimedia SIMD instruction extensions do share similarities with GPUs. Figure 4.23 summarizes the similarities and differences. Both are multiprocessors whose processors use multiple SIMD Lanes, although GPUs have more processors and many more lanes. Both use hardware multithreading to improve processor utilization, although GPUs have hardware support for many more threads. Both have roughly 2:1 performance ratios between peak performance of single-precision and double-precision floating-point arithmetic. Both use caches, although GPUs use smaller streaming caches, and multicore computers use large multilevel caches that try to contain whole working sets completely. Both use a 64-bit address space, although the physical main memory is much smaller in GPUs. Both support memory protection at the page level as well as demand paging, which allows them to address far more memory than they have on board.

In addition to the large numerical differences in processors, SIMD Lanes, hardware thread support, and cache sizes, there are many architectural differences. The scalar processor and multimedia SIMD instructions are tightly integrated in traditional computers; they are separated by an I/O bus in GPUs, and they even have separate main memories. The multiple SIMD Processors in a GPU use a single address space and can support a coherent view of all memory on some systems given support from CPU vendors (such as the IBM Power9). Unlike GPUs, multimedia SIMD instructions historically did not support gather-scatter memory accesses, which Section 4.7 shows is a significant omission.

Feature | Multicore with SIMD | GPU
SIMD Processors | 4–8 | 8–32
SIMD Lanes/Processor | 2–4 | up to 64
Multithreading hardware support for SIMD Threads | 2–4 | up to 64
Typical ratio of single-precision to double-precision performance | 2:1 | 2:1
Largest cache size | 40 MB | 4 MB
Size of memory address | 64-bit | 64-bit
Size of main memory | up to 1024 GB | up to 24 GB
Memory protection at level of page | Yes | Yes
Demand paging | Yes | Yes
Integrated scalar processor/SIMD Processor | Yes | No
Cache coherent | Yes | Yes on some systems

Figure 4.23 Similarities and differences between multicore with multimedia SIMD extensions and recent GPUs.


Summary

Now that the veil has been lifted, we can see that GPUs are really just multithreaded SIMD Processors, although they have more processors, more lanes per processor, and more multithreading hardware than do traditional multicore computers. For example, the Pascal P100 GPU has 56 SIMD Processors with 64 lanes per processor and hardware support for 64 SIMD Threads. Pascal embraces instruction-level parallelism by issuing instructions from two SIMD Threads to two sets of SIMD Lanes. GPUs also have less cache memory—Pascal’s L2 cache is 4 MiB—and it can be coherent with a cooperative distant scalar processor or distant GPUs.

The CUDA programming model wraps up all these forms of parallelism around a single abstraction, the CUDA Thread. Thus the CUDA programmer can think of programming thousands of threads, although they are really executing each block of 32 threads on the many lanes of the many SIMD Processors. The CUDA programmer who wants good performance keeps in mind that these threads are organized in blocks and executed 32 at a time and that accesses need to be to adjacent addresses to get good performance from the memory system.

Although we’ve used CUDA and the NVIDIA GPU in this section, rest assured that the same ideas are found in the OpenCL programming language and in GPUs from other companies.

Now that you understand better how GPUs work, we reveal the real jargon. Figures 4.24 and 4.25 match the descriptive terms and definitions of this section with the official CUDA/NVIDIA and AMD terms and definitions. We also include the OpenCL terms. We believe the GPU learning curve is steep in part because of using terms such as “streaming multiprocessor” for the SIMD Processor, “thread processor” for the SIMD Lane, and “shared memory” for local memory—especially because local memory is not shared between SIMD Processors! We hope that this two-step approach gets you up that curve quicker, even if it’s a bit indirect.
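A minimal host-side sketch (ours; the kernel, names, and the 256-thread block size are illustrative) of the arithmetic behind "thousands of threads, executed 32 at a time":

    __global__ void someKernel(const float *x, float *y, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = x[i];                 // adjacent threads touch adjacent addresses
    }

    void launchExample(const float *d_x, float *d_y, int n)
    {
        int threadsPerBlock = 256;       // a multiple of the 32-wide SIMD Thread (warp)
        int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;   // ceil(n / 256) Thread Blocks in the Grid
        // Each Thread Block runs as 256 / 32 = 8 SIMD Threads on one multithreaded SIMD Processor.
        someKernel<<<blocksPerGrid, threadsPerBlock>>>(d_x, d_y, n);
    }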

4.5 Detecting and Enhancing Loop-Level Parallelism

Loops in programs are the fountainhead of many of the types of parallelism we discussed here and in Chapter 5. In this section, we discuss compiler technology for discovering the amount of parallelism that we can exploit in a program, as well as hardware support for these compiler techniques. We define precisely when a loop is parallel (or vectorizable), how a dependence can prevent a loop from being parallel, and techniques for eliminating some types of dependences. Finding and manipulating loop-level parallelism is critical to exploiting both DLP and TLP, as well as the more aggressive static ILP approaches (e.g., VLIW) that we examine in Appendix H.

Program abstractions

Vectorizable loop (official CUDA/NVIDIA term: Grid). A vectorizable loop, executed on the GPU, made up of one or more "Thread Blocks" (or bodies of vectorized loop) that can execute in parallel. OpenCL name is "index range." AMD name is "NDRange." Official CUDA/NVIDIA definition: a Grid is an array of Thread Blocks that can execute concurrently, sequentially, or a mixture.

Body of Vectorized loop (official CUDA/NVIDIA term: Thread Block). A vectorized loop executed on a multithreaded SIMD Processor, made up of one or more threads of SIMD instructions. These SIMD Threads can communicate via local memory. AMD and OpenCL name is "work group." Official CUDA/NVIDIA definition: a Thread Block is an array of CUDA Threads that execute concurrently and can cooperate and communicate via shared memory and barrier synchronization. A Thread Block has a Thread Block ID within its Grid.

Sequence of SIMD Lane operations (official CUDA/NVIDIA term: CUDA Thread). A vertical cut of a thread of SIMD instructions corresponding to one element executed by one SIMD Lane. Result is stored depending on mask. AMD and OpenCL call a CUDA Thread a "work item." Official CUDA/NVIDIA definition: a CUDA Thread is a lightweight thread that executes a sequential program and that can cooperate with other CUDA Threads executing in the same Thread Block. A CUDA Thread has a thread ID within its Thread Block.

Machine objects

A thread of SIMD instructions (official CUDA/NVIDIA term: Warp). A traditional thread, but it contains just SIMD instructions that are executed on a multithreaded SIMD Processor. Results are stored depending on a per-element mask. AMD name is "wavefront." Official CUDA/NVIDIA definition: a warp is a set of parallel CUDA Threads (e.g., 32) that execute the same instruction together in a multithreaded SIMT/SIMD Processor.

SIMD instruction (official CUDA/NVIDIA term: PTX instruction). A single SIMD instruction executed across the SIMD Lanes. AMD name is "AMDIL" or "FSAIL" instruction. Official CUDA/NVIDIA definition: a PTX instruction specifies an instruction executed by a CUDA Thread.

Figure 4.24 Conversion from terms used in this chapter to official NVIDIA/CUDA and AMD jargon. OpenCL names are given in the book's definitions.

Loop-level parallelism is normally investigated at the source level or close to it, while most analysis of ILP is done once instructions have been generated by the compiler. Loop-level analysis involves determining what dependences exist among the operands in a loop across the iterations of that loop. For now, we will consider only data dependences, which arise when an operand is written at some point and read at a later point. Name dependences also exist and may be removed by the renaming techniques discussed in Chapter 3. The analysis of loop-level parallelism focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations; such dependence is called a loop-carried dependence. Most of the examples

Processing hardware

Multithreaded SIMD Processor (official CUDA/NVIDIA term: Streaming multiprocessor). Multithreaded SIMD Processor that executes thread of SIMD instructions, independent of other SIMD Processors. Both AMD and OpenCL call it a "compute unit." However, the CUDA programmer writes program for one lane rather than for a "vector" of multiple SIMD Lanes. Official CUDA/NVIDIA definition: a streaming multiprocessor (SM) is a multithreaded SIMT/SIMD Processor that executes warps of CUDA Threads. A SIMT program specifies the execution of one CUDA Thread, rather than a vector of multiple SIMD Lanes.

Thread Block Scheduler (official CUDA/NVIDIA term: Giga Thread Engine). Assigns multiple bodies of vectorized loop to multithreaded SIMD Processors. AMD name is "Ultra-Threaded Dispatch Engine." Official CUDA/NVIDIA definition: distributes and schedules Thread Blocks of a grid to streaming multiprocessors as resources become available.

SIMD Thread scheduler (official CUDA/NVIDIA term: Warp scheduler). Hardware unit that schedules and issues threads of SIMD instructions when they are ready to execute; includes a scoreboard to track SIMD Thread execution. AMD name is "Work Group Scheduler." Official CUDA/NVIDIA definition: a warp scheduler in a streaming multiprocessor schedules warps for execution when their next instruction is ready to execute.

SIMD Lane (official CUDA/NVIDIA term: Thread processor). Hardware SIMD Lane that executes the operations in a thread of SIMD instructions on a single element. Results are stored depending on mask. OpenCL calls it a "processing element." AMD name is also "SIMD Lane." Official CUDA/NVIDIA definition: a thread processor is a datapath and register file portion of a streaming multiprocessor that executes operations for one or more lanes of a warp.

Memory hardware

GPU Memory (official CUDA/NVIDIA term: Global memory). DRAM memory accessible by all multithreaded SIMD Processors in a GPU. OpenCL calls it "global memory." Official CUDA/NVIDIA definition: global memory is accessible by all CUDA Threads in any Thread Block in any grid; implemented as a region of DRAM, and may be cached.

Private memory (official CUDA/NVIDIA term: Local memory). Portion of DRAM memory private to each SIMD Lane. Both AMD and OpenCL call it "private memory." Official CUDA/NVIDIA definition: private "thread-local" memory for a CUDA Thread; implemented as a cached region of DRAM.

Local memory (official CUDA/NVIDIA term: Shared memory). Fast local SRAM for one multithreaded SIMD Processor, unavailable to other SIMD Processors. OpenCL calls it "local memory." AMD calls it "group memory." Official CUDA/NVIDIA definition: fast SRAM memory shared by the CUDA Threads composing a Thread Block, and private to that Thread Block. Used for communication among CUDA Threads in a Thread Block at barrier synchronization points.

SIMD Lane registers (official CUDA/NVIDIA term: Registers). Registers in a single SIMD Lane allocated across body of vectorized loop. AMD also calls them "registers." Official CUDA/NVIDIA definition: private registers for a CUDA Thread; implemented as multithreaded register file for certain lanes of several warps for each thread processor.

Figure 4.25 Conversion from terms used in this chapter to official NVIDIA/CUDA and AMD jargon. Note that our descriptive terms "local memory" and "private memory" use the OpenCL terminology. NVIDIA uses SIMT (single-instruction multiple-thread) rather than SIMD to describe a streaming multiprocessor. SIMT is preferred over SIMD because the per-thread branching and control flow are unlike any SIMD machine.


we considered in Chapters 2 and 3 had no loop-carried dependences and thus are loop-level parallel. To see that a loop is parallel, let us first look at the source representation:

for (i=999; i>=0; i=i-1)
    x[i] = x[i] + s;

In this loop, the two uses of x[i] are dependent, but this dependence is within a single iteration and is not loop-carried. There is a loop-carried dependence between successive uses of i in different iterations, but this dependence involves an induction variable that can be easily recognized and eliminated. We saw examples of how to eliminate dependences involving induction variables during loop unrolling in Section 2.2 of Chapter 2, and we will look at additional examples later in this section.

Because finding loop-level parallelism involves recognizing structures such as loops, array references, and induction variable computations, a compiler can do this analysis more easily at or near the source level, in contrast to the machine-code level. Let's look at a more complex example.
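Before turning to that example, a minimal C sketch (not from the text; function names are illustrative) contrasts the parallel loop above with a loop that does carry a data dependence across iterations.

/* The first loop's only cross-iteration dependence is on the induction
   variable i, so its iterations may execute in parallel. The second loop
   carries a true data dependence: iteration i reads the value that
   iteration i-1 wrote, so its iterations must run in order. */
void loop_parallel(double *x, double s) {
    for (int i = 999; i >= 0; i = i - 1)
        x[i] = x[i] + s;              /* no loop-carried data dependence */
}

void loop_carried(double *x, double s) {
    for (int i = 1; i < 1000; i = i + 1)
        x[i] = x[i - 1] + s;          /* x[i-1] was produced last iteration */
}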

Example

Consider a loop like this one:

for (i=0; i<100; i=i+1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}


Figure 5.27 The on-chip organizations of the Power8 and Xeon E7 are shown. The Power8 uses 8 separate buses between L3 and the CPU cores. Each Power8 also has two sets of links for connecting larger multiprocessors. The Xeon uses three rings to connect processors and L3 cache banks, as well as QPI for interchip links. Software is used to logically associate half the cores with each memory channel.

5.8 Putting It All Together: Multicore Processors and Their Performance

increases the probability that a desired memory page is open on a given access. The E7 provides 3 QuickPath Interconnect (QPI) links for connecting multiple E7s.

Multiprocessors consisting of these multicores use a variety of different interconnection strategies, as Figure 5.28 shows. The Power8 design provides support for connecting 16 Power8 chips for a total of 192 cores. The intragroup links provide higher bandwidth interconnect among a completely connected module of 4 processor chips. The intergroup links are used to connect each processor chip to the 3 other modules. Thus each processor is at most two hops from any other, and the memory access time is determined by whether an address resides in local memory, cluster memory, or intercluster memory (actually the latter can have two different values, but the difference is swamped by the intercluster time).

The Xeon E7 uses QPI to interconnect multiple multicore chips. In a 4-chip multiprocessor, which with the latest announced Xeon could have 128 cores, the three QPI links on each processor are connected to three neighbors, yielding a 4-chip fully connected multiprocessor. Because memory is directly connected to each E7 multicore, even this 4-chip arrangement has nonuniform memory access time (local versus remote). Figure 5.28 shows how 8 E7 processors can be connected; like the Power8, this leads to a situation where every processor is one or two hops from every other processor.

There are a number of Xeon-based multiprocessor servers that have more than 8 processor chips. In such designs, the typical organization is to connect 4 processor chips together in a square, as a module, with each processor connecting to two neighbors. The third QPI link in each chip is connected to a crossbar switch. Very large systems can be created in this fashion. Memory accesses can then occur at four locations with different timings: local to the processor, an immediate neighbor, the neighbor in the cluster that is two hops away, and through the crossbar. Other organizations are possible and require less than a full crossbar in return for more hops to get to remote memory.

The SPARC64 X+ also uses a 4-processor module, but each processor has three connections to its immediate neighbors plus two (or three in the largest configuration) connections to a crossbar. In the largest configuration, 64 processor chips can be connected to two crossbar switches, for a total of 1024 cores. Memory access is NUMA (local, within a module, and through the crossbar), and coherency is directory-based.

Performance of Multicore-Based Multiprocessors on a Multiprogrammed Workload

First, we compare the performance scalability of these three multicore processors using SPECintRate, considering configurations up to 64 cores. Figure 5.29 shows how the performance scales relative to the performance of the smallest configuration, which varies between 4 and 16 cores. In the plot, the smallest configuration is assumed to have perfect speedup (i.e., 8 for 8 cores, 12 for 12 cores, etc.). This figure does not show relative performance among these different processors. Indeed such performance varies significantly: in the 4-core configuration, the IBM Power8 is


(A) Power8 system with up to 16 chips, organized as 4-chip groups connected by 78.4 GB/s intra-group buses and 25.6 GB/s inter-group cables. (B) Xeon E7 system with up to 8 chips. (C) SPARC64 X+ with the 4-chip building block.

Figure 5.28 The system architecture for three multiprocessors built from multicore chips.

(Plot: speedup relative to the smallest configuration versus number of cores, 4-64, for the IBM Power8, Fujitsu SPARC64 X+, and Intel Xeon E7.)

Figure 5.29 The performance scaling on the SPECintRate benchmarks for three multicore processors as the number of cores is increased to 64. Performance for each processor is plotted relative to the smallest configuration and assuming that configuration had perfect speedup. Although this chart shows how a given multiprocessor scales with additional cores, it does not supply any data about performance among processors. There are differences in the clock rates, even within a given processor family. These are generally swamped by the core scaling effects, except for the Power8, which shows a clock rate spread of 1.5× from the smallest configuration to the 64-core configuration.

1.5 times as fast as the SPARC64 X+ on a per-core basis! Instead, Figure 5.29 shows how the performance scales for each processor family as additional cores are added. Two of the three processors show diminishing returns as they scale to 64 cores.

The Xeon systems appear to show the most degradation at 56 and 64 cores. This may be largely due to having more cores share a smaller L3. For example, the 40-core system uses 4 chips, each with 60 MiB of L3, yielding 6 MiB of L3 per core. The 56-core and 64-core systems also use 4 chips but have 35 or 45 MiB of L3 per chip, or 2.5-2.8 MiB per core. It is likely that the resulting larger L3 miss rates lead to the reduction in speedup for the 56-core and 64-core systems.

The IBM Power8 results are also unusual, appearing to show significant superlinear speedup. This effect, however, is due largely to differences in the clock rates, which are much larger across the Power8 processors than for the other processors

(Plot: speedup relative to the smallest configuration versus number of cores, for configurations beyond 64 cores, for the IBM Power8, Fujitsu SPARC64 X+, and Intel Xeon E7.)

Figure 5.30 The scaling of relative performance for multicore-based multiprocessors with more than 64 cores. As before, performance is shown relative to the smallest available system. The Xeon result at 80 cores reflects the same L3 effect that showed up for smaller configurations. All systems larger than 80 cores have between 2.5 and 3.8 MiB of L3 per core, whereas the 80-core and smaller systems have 6 MiB per core.

in this figure. In particular, the 64-core configuration has the highest clock rate (4.4 GHz), whereas the 4-core configuration has a 3.0 GHz clock. If we normalize the relative speedup for the 64-core system based on the clock rate differential with the 4-core system, the effective speedup is 57 rather than 84. Therefore, while the Power8 system scales well, and perhaps the best among these processors, it is not miraculous.

Figure 5.30 shows scaling for these three systems at configurations above 64 processors. Once again, the clock rate differential explains the Power8 results; the clock-rate-equivalent scaled speedup with 192 processors is 167, versus 223 when not accounting for clock rate differences. Even at 167, the Power8 scaling is somewhat better than that of the SPARC64 X+ or Xeon systems. Surprisingly, although there are some effects on speedup in going from the smallest system to 64 cores, they do not seem to get dramatically worse at these larger configurations. The nature of the workload, which is highly parallel and user-CPU-intensive, and the overheads paid in going to 64 cores probably lead to this result.
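The clock-rate normalization used here is simple to reproduce; the following C fragment uses the clock rates quoted in the text and is only a check of the arithmetic.

/* Normalize a measured speedup by the clock-rate ratio between the largest
   and smallest configurations (Power8: 84 measured, 3.0 GHz base, 4.4 GHz
   at 64 cores). */
#include <stdio.h>

int main(void) {
    double measured_speedup = 84.0;   /* 64-core Power8 vs. smallest system */
    double base_clock  = 3.0;         /* GHz, 4-core configuration          */
    double large_clock = 4.4;         /* GHz, 64-core configuration         */

    double normalized = measured_speedup * (base_clock / large_clock);
    printf("clock-normalized speedup = %.0f\n", normalized);  /* about 57 */
    return 0;
}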


Scalability in an Xeon MP With Different Workloads

In this section, we focus on the scalability of the Xeon E7 multiprocessors on three different workloads: a Java-based commercially oriented workload, a virtual machine workload, and a scientific parallel processing workload, all from the SPEC benchmarking organization, as described next.

■ SPECjbb2015: Models a supermarket IT system that handles a mix of point-of-sale requests, online purchases, and data-mining operations. The performance metric is throughput-oriented, and we use the maximum performance measurement on the server side running multiple Java virtual machines.

■ SPECVirt2013: Models a collection of virtual machines running independent mixes of other SPEC benchmarks, including CPU benchmarks, web servers, and mail servers. The system must meet a quality of service guarantee for each virtual machine.

■ SPECOMP2012: A collection of 14 scientific and engineering programs written with the OpenMP standard for shared-memory parallel processing. The codes are written in Fortran, C, and C++ and range from fluid dynamics to molecular modeling to image manipulation.

As with the previous results, Figure 5.31 shows performance assuming linear speedup on the smallest configuration, which for these benchmarks varies from 48 cores to 72 cores, and plotting performance relative to that smallest configuration. SPECjbb2015 and SPECVirt2013 include significant systems software, including the Java VM software and the VM hypervisor. Other than the system software, the interaction among the processes is very small. In contrast, SPECOMP2012 is a true parallel code with multiple user processes sharing data and collaborating in the computation.

Let's begin by examining SPECjbb2015. It obtains a speedup efficiency (speedup/processor ratio) of between 78% and 95%, showing good speedup even in the largest configuration. SPECVirt2013 does even better (for the range of systems measured), obtaining almost linear speedup at 192 cores. Both SPECjbb2015 and SPECVirt2013 are benchmarks that scale up the application size (as in the TPC benchmarks discussed in Chapter 1) with larger systems so that the effects of Amdahl's Law and interprocess communication are minor.

Finally, let's turn to SPECOMP2012, the most compute-intensive of these benchmarks and the one that truly involves parallel processing. The major trend visible here is a steady loss of efficiency as we scale from 30 to 576 cores, so that by 576 cores the system exhibits only half the efficiency it showed at 30 cores. This reduction leads to a relative speedup of 284, assuming that the 30-core speedup is 30. These are probably Amdahl's Law effects resulting from limited parallelism as well as synchronization and communication overheads. Unlike


(Plot: performance relative to the smallest configuration versus number of cores for SPECVirtSC2013, SPECjbb2015, and SPECOMP2012.)

Figure 5.31 Scaling of performance on a range of Xeon E7 systems showing performance relative to the smallest benchmark configuration, and assuming that configuration gets perfect speedup (e.g., the smallest SPECOMP2012 configuration is 30 cores, and we assume a performance of 30 for that system). Only relative performance can be assessed from this data, and comparisons across the benchmarks have no relevance. Note the difference in the scale of the vertical and horizontal axes.

the SPECjbb2015 and SPECVirt2013, these benchmarks are not scaled for larger systems.

Performance and Energy Efficiency of the Intel i7 920 Multicore

In this section, we closely examine the performance of the i7 920, a predecessor of the 6700, on the same two groups of benchmarks we considered in Chapter 3: the parallel Java benchmarks and the parallel PARSEC benchmarks (described in detail in Figure 3.32 on page 247). Although this study uses the older i7 920, it remains, by far, the most comprehensive study of energy efficiency in multicore processors and the effects of multicore combined with SMT. The fact that the i7 920 and 6700 are similar indicates that the basic insights should also apply to the 6700.


First, we look at the multicore performance and scaling versus a single core without the use of SMT. Then we combine both the multicore and SMT capability. All the data in this section, like that in the earlier i7 SMT evaluation (Chapter 3), come from Esmaeilzadeh et al. (2011). The dataset is the same as that used earlier (see Figure 3.32 on page 247), except that the Java benchmarks tradebeans and pjbb2005 are removed (leaving only the five scalable Java benchmarks); tradebeans and pjbb2005 never achieve speedup above 1.55 even with four cores and a total of eight threads, and thus are not appropriate for evaluating more cores.

Figure 5.32 plots both the speedup and energy efficiency of the Java and PARSEC benchmarks without the use of SMT. Energy efficiency is computed as the ratio of the energy consumed by the single-core run to the energy consumed by the two- or four-core run (i.e., efficiency is the inverse of energy consumed). Higher energy efficiency means that the processor consumes less energy for the same computation, with a value of 1.0 being the break-even

(Plot: PARSEC and Java speedup and energy efficiency for i7 2P and 4P configurations; left axis is speedup, right axis is energy efficiency.)

Figure 5.32 This chart shows the speedup and energy efficiency for two- and four-core executions of the parallel Java and PARSEC workloads without SMT. These data were collected by Esmaeilzadeh et al. (2011) using the same setup as described in Chapter 3. Turbo Boost is turned off. The speedup and energy efficiency are summarized using harmonic mean, implying a workload where the total time spent running each benchmark on 2 cores is equivalent.


point. The unused cores in all cases were in deep sleep mode, which minimized their power consumption by essentially turning them off. In comparing the data for the single-core and multicore benchmarks, it is important to remember that the full energy cost of the L3 cache and memory interface is paid in the single-core (as well as the multicore) case. This fact increases the likelihood that energy consumption will improve for applications that scale reasonably well. Harmonic mean is used to summarize results with the implication described in the caption.

As the figure shows, the PARSEC benchmarks get better speedup than the Java benchmarks, achieving 76% speedup efficiency (i.e., actual speedup divided by processor count) on four cores, whereas the Java benchmarks achieve 67% speedup efficiency on four cores. Although this observation is clear from the data, analyzing why this difference exists is difficult. It is quite possible that Amdahl's Law effects have reduced the speedup for the Java workload, which includes some typically serial parts, such as the garbage collector. In addition, interaction between the processor architecture and the application, which affects issues such as the cost of synchronization or communication, may also play a role. In particular, well-parallelized applications, such as those in PARSEC, sometimes benefit from an advantageous ratio between computation and communication, which reduces the dependence on communication costs (see Appendix I).

These differences in speedup translate to differences in energy efficiency. For example, the PARSEC benchmarks actually slightly improve energy efficiency over the single-core version; this result may be significantly affected by the fact that the L3 cache is more effectively used in the multicore runs than in the single-core case and the energy cost is identical in both cases. Thus, for the PARSEC benchmarks, the multicore approach achieves what designers hoped for when they switched from an ILP-focused design to a multicore design; namely, it scales performance as fast or faster than scaling power, resulting in constant or even improved energy efficiency. In the Java case, we see that neither the two- nor four-core runs break even in energy efficiency because of the lower speedup levels of the Java workload (although Java energy efficiency for the 2P run is the same as for PARSEC). The energy efficiency in the four-core Java case is reasonably high (0.94). It is likely that an ILP-centric processor would need even more power to achieve a comparable speedup on either the PARSEC or Java workload. Thus the TLP-centric approach is also certainly better than the ILP-centric approach for improving performance for these applications. As we will see in Section 5.10, there are reasons to be pessimistic about simple, efficient, long-term scaling of multicore.
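The two metrics used throughout this section can be summarized in a short C sketch; the input values below are illustrative numbers in the spirit of the discussion, not the measured results from the study.

/* Speedup efficiency = speedup / core count.
   Energy efficiency  = energy of the single-core run / energy of the
   multicore run (values above 1.0 mean the multicore run used less energy). */
#include <stdio.h>

static double speedup_efficiency(double speedup, int cores) {
    return speedup / cores;
}

static double energy_efficiency(double energy_1core, double energy_ncore) {
    return energy_1core / energy_ncore;
}

int main(void) {
    /* e.g., a 4-core run that achieves a speedup of 3.04 (76% efficiency)
       while consuming slightly less energy than the single-core run */
    printf("speedup efficiency: %.2f\n", speedup_efficiency(3.04, 4));
    printf("energy efficiency:  %.2f\n", energy_efficiency(100.0, 98.0));
    return 0;
}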

Putting Multicore and SMT Together

Finally, we consider the combination of multicore and multithreading by measuring the two benchmark sets for two to four processors and one to two threads (a total of four data points and up to eight threads). Figure 5.33 shows the speedup

(Plot: PARSEC and Java speedup and energy efficiency for i7 2Px1T, 2Px2T, 4Px1T, and 4Px2T configurations; left axis is speedup, right axis is energy efficiency.)

Figure 5.33 This chart shows the speedup for two- and four-core executions of the parallel Java and PARSEC workloads both with and without SMT. Remember that the preceding results vary in the number of threads from two to eight and reflect both architectural effects and application characteristics. Harmonic mean is used to summarize results, as discussed in the Figure 5.32 caption.

and energy efficiency obtained on the Intel i7 when the processor count is two or four and SMT is or is not employed, using harmonic mean to summarize the two benchmark sets. Clearly, SMT can add to performance when there is sufficient thread-level parallelism available even in the multicore situation. For example, in the four-core, no-SMT case, the speedup efficiencies were 67% and 76% for Java and PARSEC, respectively. With SMT on four cores, those ratios are an astonishing 83% and 97%.

Energy efficiency presents a slightly different picture. In the case of PARSEC, speedup is essentially linear for the four-core SMT case (eight threads), and power scales more slowly, resulting in an energy efficiency of 1.1 for that case. The Java situation is more complex; energy efficiency peaks for the two-core SMT (four-thread) run at 0.97 and drops to 0.89 in the four-core SMT (eight-thread) run. It seems highly likely that the Java benchmarks are encountering Amdahl's Law effects when more than four threads are deployed. As some architects have observed, multicore does shift more responsibility for performance (and thus energy efficiency) to the programmer, and the results for the Java workload certainly bear this out.


5.9 Fallacies and Pitfalls

Given the lack of maturity in our understanding of parallel computing, there are many hidden pitfalls that will be uncovered either by careful designers or by unfortunate ones. Given the large amount of hype that has surrounded multiprocessors over the years, common fallacies abound. We have included a selection of them.

Pitfall

Measuring performance of multiprocessors by linear speedup versus execution time.

Graphs like those in Figures 5.32 and 5.33, which plot performance versus number of processors, showing linear speedup, a plateau, and then a falling off, have long been used to judge the success of parallel processors. Although speedup is one facet of a parallel program, it is not a direct measure of performance. The first issue is the power of the processors being scaled: a program that linearly improves performance to equal 100 Intel Atom processors (the low-end processor used for netbooks) may be slower than the version run on an 8-core Xeon. Be especially careful of floating-point-intensive programs; processing elements without hardware assist may scale wonderfully but have poor collective performance.

Comparing execution times is fair only if you are comparing the best algorithms on each computer. Comparing the identical code on two computers may seem fair, but it is not; the parallel program may be slower on a uniprocessor than a sequential version. Developing a parallel program will sometimes lead to algorithmic improvements, so comparing the previously best-known sequential program with the parallel code—which seems fair—will not compare equivalent algorithms. To reflect this issue, the terms relative speedup (same program) and true speedup (best program) are sometimes used.

Results that suggest superlinear performance, when a program on n processors is more than n times faster than the equivalent uniprocessor, may indicate that the comparison is unfair, although there are instances where "real" superlinear speedups have been encountered. For example, some scientific applications regularly achieve superlinear speedup for small increases in processor count (2 or 4 to 8 or 16). These results usually arise because critical data structures that do not fit into the aggregate caches of a multiprocessor with 2 or 4 processors fit into the aggregate cache of a multiprocessor with 8 or 16 processors. As we saw in the previous section, other differences (such as high clock rate) may appear to yield superlinear speedups when comparing slightly different systems.

In summary, comparing performance by comparing speedups is at best tricky and at worst misleading. Comparing the speedups for two different multiprocessors does not necessarily tell us anything about the relative performance of the multiprocessors, as we also saw in the previous section. Even comparing two different algorithms on the same multiprocessor is tricky because we must use true speedup, rather than relative speedup, to obtain a valid comparison.
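A small C sketch (with made-up execution times, not measurements) makes the relative versus true speedup distinction concrete.

/* Relative speedup compares the parallel program with itself on one
   processor; true speedup compares it with the best sequential program. */
#include <stdio.h>

int main(void) {
    double t_parallel_1p = 120.0;  /* parallel code on 1 processor (seconds) */
    double t_best_seq    = 100.0;  /* best sequential algorithm              */
    double t_parallel_np =  10.0;  /* parallel code on n processors          */

    printf("relative speedup = %.1f\n", t_parallel_1p / t_parallel_np); /* 12.0 */
    printf("true speedup     = %.1f\n", t_best_seq    / t_parallel_np); /* 10.0 */
    return 0;
}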

Fallacy

Amdahl's Law doesn't apply to parallel computers.

In 1987 the head of a research organization claimed that Amdahl's Law (see Section 1.9) had been broken by an MIMD multiprocessor. This statement hardly


meant, however, that the law has been overturned for parallel computers; the neglected portion of the program will still limit performance. To understand the basis of the media reports, let's see what Amdahl (1967) originally said:

    A fairly obvious conclusion which can be drawn at this point is that the effort expended on achieving high parallel processing rates is wasted unless it is accompanied by achievements in sequential processing rates of very nearly the same magnitude. [p. 483]

One interpretation of the law was that, because portions of every program must be sequential, there is a limit to the useful economic number of processors—say, 100. By showing linear speedup with 1000 processors, this interpretation of Amdahl's Law was disproved.

The basis for the statement that Amdahl's Law had been "overcome" was the use of scaled speedup, also called weak scaling. The researchers scaled the benchmark to have a dataset size that was 1000 times larger and compared the uniprocessor and parallel execution times of the scaled benchmark. For this particular algorithm, the sequential portion of the program was constant independent of the size of the input, and the rest was fully parallel—thus, linear speedup with 1000 processors. Because the running time grew faster than linear, the program actually ran longer after scaling, even with 1000 processors. Speedup that assumes scaling of the input is not the same as true speedup, and reporting it as if it were is misleading.

Because parallel benchmarks are often run on different-sized multiprocessors, it is important to specify what type of application scaling is permissible and how that scaling should be done. Although simply scaling the data size with processor count is rarely appropriate, assuming a fixed problem size for a much larger processor count (called strong scaling) is often inappropriate as well, because it is likely that users given a much larger multiprocessor would opt to run a larger or more detailed version of an application. See Appendix I for more discussion on this important topic.
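A minimal C sketch, with assumed serial and parallel workloads (not the workload from the 1987 claim), shows how different fixed-size and scaled speedups can be.

/* Contrast true (fixed-size, strong-scaling) speedup with scaled
   (weak-scaling) speedup when the sequential part is a fixed amount of work. */
#include <stdio.h>

int main(void) {
    double serial = 1.0, parallel = 99.0;  /* time units on one processor */
    int p = 1000;

    /* Fixed problem size: only the parallel part shrinks with p. */
    double strong = (serial + parallel) / (serial + parallel / p);

    /* Scaled problem: parallel work grows with p, so the p-processor run
       still takes serial + parallel, while a uniprocessor run of the scaled
       problem would take serial + parallel * p. */
    double scaled = (serial + parallel * p) / (serial + parallel);

    printf("true (strong-scaling) speedup on %d processors:  %.0f\n", p, strong);
    printf("scaled (weak-scaling) speedup on %d processors: %.0f\n", p, scaled);
    return 0;
}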

Fallacy

Linear speedups are needed to make multiprocessors cost-effective.

It is widely recognized that one of the major benefits of parallel computing is to offer a "shorter time to solution" than the fastest uniprocessor. Many people, however, also hold the view that parallel processors cannot be as cost-effective as uniprocessors unless they can achieve perfect linear speedup. This argument says that, because the cost of the multiprocessor is a linear function of the number of processors, anything less than linear speedup means that the performance/cost ratio decreases, making a parallel processor less cost-effective than using a uniprocessor.

The problem with this argument is that cost is not only a function of processor count but also depends on memory, I/O, and the overhead of the system (box, power supply, interconnect, etc.). It also makes less sense in the multicore era, when there are multiple processors per chip.

The effect of including memory in the system cost was pointed out by Wood and Hill (1995). We use an example based on more recent data using TPC-C and SPECRate benchmarks, but the argument could also be made with a parallel scientific application workload, which would likely make the case even stronger.


(Plot: speedup versus processor count, 0-64, for TPM, SPECintRate, and SPECfpRate, with a dashed linear-speedup reference line.)

Figure 5.34 Speedup for three benchmarks on an IBM eServer p5 multiprocessor when configured with 4, 8, 16, 32, and 64 processors. The dashed line shows linear speedup.

Figure 5.34 shows the speedup for TPC-C, SPECintRate, and SPECfpRate on an IBM eServer p5 multiprocessor configured with 4-64 processors. The figure shows that only TPC-C achieves better than linear speedup. For SPECintRate and SPECfpRate, speedup is less than linear, but so is the cost, because unlike TPC-C, the amount of main memory and disk required both scale less than linearly.

As Figure 5.35 shows, larger processor counts can actually be more cost-effective than the 4-processor configuration. In comparing the cost-performance of two computers, we must be sure to include accurate assessments of both total system cost and what performance is achievable. For many applications with larger memory demands, such a comparison can dramatically increase the attractiveness of using a multiprocessor.
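The comparison behind Figure 5.35 can be sketched as follows; the cost model and every number in it are hypothetical illustrations, not the actual IBM configurations or prices.

/* Relative performance/cost of a large system versus a small baseline,
   under an assumed cost model (base cost + per-processor + per-GiB terms). */
#include <stdio.h>

static double system_cost(int procs, double mem_gib) {
    return 15000.0 + 6000.0 * procs + 150.0 * mem_gib;  /* assumed model */
}

int main(void) {
    /* Hypothetical 4-processor baseline versus a 64-processor system that
       achieves a 14x speedup while memory grows from 8 GiB to 20 GiB. */
    double perf4 = 1.0, perf64 = 14.0;
    double cost4  = system_cost(4, 8.0);
    double cost64 = system_cost(64, 20.0);

    double relative = (perf64 / cost64) / (perf4 / cost4);
    printf("performance/cost relative to the 4-processor system: %.2f\n",
           relative);
    return 0;
}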

Pitfall

Not developing the software to take advantage of, or optimize for, a multiprocessor architecture.

There is a long history of software lagging behind on multiprocessors, probably because the software problems are much harder. We give one example to show the subtlety of the issues, but there are many examples we could choose from. One frequently encountered problem occurs when software designed for a uniprocessor is adapted to a multiprocessor environment. For example, the SGI


(Plot: performance/cost relative to the four-processor system versus processor count, 0-64, for TPM, SPECint, and SPECfp.)

Figure 5.35 The performance/cost for IBM eServer p5 multiprocessors with 4-64 processors is shown relative to the 4-processor configuration. Any measurement above 1.0 indicates that the configuration is more cost-effective than the 4-processor system. The 8-processor configurations show an advantage for all three benchmarks, whereas two of the three benchmarks show a cost-performance advantage in the 16- and 32-processor configurations. For TPC-C, the configurations are those used in the official runs, which means that disk and memory scale nearly linearly with processor count, and a 64-processor machine is approximately twice as expensive as a 32-processor version. In contrast, for the SPECRate runs, the disk and memory are scaled more slowly (although still faster than necessary to achieve the best SPECRate at 64 processors). In particular, the disk configurations go from one drive for the 4-processor version to four drives (140 GB) for the 64-processor version. Memory is scaled from 8 GiB for the 4-processor system to 20 GiB for the 64-processor system.

operating system in 2000 originally protected the page table data structure with a single lock, assuming that page allocation was infrequent. In a uniprocessor, this does not represent a performance problem. In a multiprocessor, it can become a major performance bottleneck for some programs. Consider a program that uses a large number of pages that are initialized at startup, which UNIX does for statically allocated pages. Suppose the program is parallelized so that multiple processes allocate the pages. Because page allocation requires the use of the page table data structure, which is locked whenever it is in use, even an OS kernel that allows multiple threads in the OS will be serialized if the processes all try to allocate their pages at once (which is exactly what we might expect at initialization time).


This page table serialization eliminates parallelism in initialization and has significant impact on overall parallel performance. This performance bottleneck persists even under multiprogramming. For example, suppose we split the parallel program apart into separate processes and run them, one process per processor, so that there is no sharing between the processes. (This is exactly what one user did, because he reasonably believed that the performance problem was due to unintended sharing or interference in his application.) Unfortunately, the lock still serializes all the processes, so even the multiprogramming performance is poor.

This pitfall indicates the kind of subtle but significant performance bugs that can arise when software runs on multiprocessors. Like many other key software components, the OS algorithms and data structures must be rethought in a multiprocessor context. Placing locks on smaller portions of the page table effectively eliminates the problem. Similar problems exist in memory structures, which increase the coherence traffic in cases where no sharing is actually occurring.

As multicore became the dominant theme in everything from desktops to servers, the lack of an adequate investment in parallel software became apparent. Given the lack of focus, it will likely be many years before the software systems we use adequately exploit the growing numbers of cores.
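A minimal sketch of the fix, using POSIX threads and hypothetical function and data-structure names rather than the actual SGI kernel code, shows the difference between one global lock and per-portion locks.

#include <pthread.h>

#define NBUCKETS 64

/* Coarse-grained: a single lock protects the whole table, so every
   allocation from every thread serializes on it. */
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

/* Fine-grained: one lock per portion of the table, so threads touching
   different portions proceed in parallel. */
static pthread_mutex_t bucket_lock[NBUCKETS];

void init_bucket_locks(void) {
    for (int i = 0; i < NBUCKETS; i++)
        pthread_mutex_init(&bucket_lock[i], NULL);
}

void allocate_page_coarse(int page) {
    pthread_mutex_lock(&table_lock);
    /* ... update the page-table entry for 'page' ... */
    pthread_mutex_unlock(&table_lock);
}

void allocate_page_fine(int page) {
    pthread_mutex_t *l = &bucket_lock[page % NBUCKETS];
    pthread_mutex_lock(l);
    /* ... update the page-table entry for 'page' ... */
    pthread_mutex_unlock(l);
}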

5.10 The Future of Multicore Scaling

For more than 30 years, researchers and designers have predicted the end of uniprocessors and their dominance by multiprocessors. Until the early years of this century, this prediction was constantly proven wrong. As we saw in Chapter 3, the costs of trying to find and exploit more ILP became prohibitive in efficiency (both in silicon area and in power). Of course, multicore does not magically solve the power problem, because it clearly increases both the transistor count and the active number of transistors switching, which are the two dominant contributions to power. As we will see in this section, energy issues are likely to limit multicore scaling more severely than previously thought.

ILP scaling failed because of both limitations in the ILP available and the efficiency of exploiting that ILP. Similarly, a combination of two factors means that simply scaling performance by adding cores is unlikely to be broadly successful. This combination arises from the challenges posed by Amdahl's Law, which assesses the efficiency of exploiting parallelism, and the end of Dennard scaling, which dictates the energy required for a multicore processor. To understand these factors, we use a simple model of technology scaling (based on an extensive and highly detailed analysis in Esmaeilzadeh et al. (2012)).

Let's start by reviewing energy consumption and power in CMOS. Recall from Chapter 1 that the energy to switch a transistor is given as

    Energy ∝ Capacitive load × Voltage²

CMOS scaling is limited primarily by thermal power, which is a combination of static leakage power and dynamic power; the latter tends to dominate. Power is given by

    Power = Energy per transistor × Frequency × Transistors switched
          = Capacitive load × Voltage² × Frequency × Transistors switched


Device count scaling (since a transistor is 1/4 the size)                 4
Frequency scaling (based on projections of device speed)                  1.75
Voltage scaling projected                                                  0.81
Capacitance scaling projected                                              0.39
Energy per switched transistor scaling (CV²)                               0.26
Power scaling, assuming the fraction of transistors switching is the
  same and the chip exhibits full frequency scaling                        1.79

Figure 5.36 A comparison of the 22 nm technology of 2016 with a future 11 nm technology, likely to be available sometime between 2022 and 2024. The characteristics of the 11 nm technology are based on the International Technology Roadmap for Semiconductors, which has been recently discontinued because of uncertainty about the continuation of Moore's Law and what scaling characteristics will be seen.
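The derived rows of Figure 5.36 follow directly from the projected inputs; a small C check (illustrative only) reproduces them.

/* Energy per switched transistor scales as C * V^2; chip power scales as
   energy * frequency * device count when all transistors switch. */
#include <stdio.h>

int main(void) {
    double device  = 4.0;    /* device count scaling, 22 nm -> 11 nm */
    double freq    = 1.75;   /* projected frequency scaling          */
    double voltage = 0.81;   /* projected voltage scaling            */
    double cap     = 0.39;   /* projected capacitance scaling        */

    double energy = cap * voltage * voltage;   /* ~0.26 */
    double power  = energy * freq * device;    /* ~1.79 */

    printf("energy per switched transistor scaling: %.2f\n", energy);
    printf("power scaling:                           %.2f\n", power);
    return 0;
}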

To understand the implications of how energy and power scale, let's compare today's 22 nm technology with a technology projected to be available in 2021-2024 (depending on the rate at which Moore's Law continues to slow down). Figure 5.36 shows this comparison based on technology projections and the resulting effects on energy and power scaling. Notice that power scaling > 1.0 means that the future device consumes more power; in this case, 1.79 times as much.

Consider the implications of this for one of the latest Intel Xeon processors, the E7-8890, which has 24 cores, 7.2 billion transistors (including almost 70 MiB of cache), operates at 2.2 GHz, has a thermal power rating of 165 watts, and a die size of 456 mm². The clock frequency is already limited by power dissipation: a 4-core version has a clock of 3.2 GHz, and a 10-core version has a 2.8 GHz clock. With the 11 nm technology, the same size die would accommodate 96 cores with almost 280 MiB of cache and operate at a clock rate (assuming perfect frequency scaling) of 4.9 GHz. Unfortunately, with all cores operating and no efficiency improvements, it would consume 165 × 1.79 = 295 watts. If we assume the 165-W heat dissipation limit remains, then only 54 cores can be active. This limit yields a maximum performance speedup of 54/24 = 2.25 over a 5-6 year period, less than one-half the performance scaling seen in the late 1990s. Furthermore, we may have Amdahl's Law effects, as the next example shows.

Example

Suppose we have a 96-core future generation processor, but on average only 54 cores can be busy. Suppose that 90% of the time, we can use all available cores; 9% of the time, we can use 50 cores; and 1% of the time is strictly serial. How much speedup might we expect? Assume that cores can be turned off when not in use and draw no power and assume that the use of a different number of cores is distributed so that we need to worry only about average power consumption. How would the


multicore speedup compare to the 24-processor version that can use all its processors 99% of the time?

Answer

We can find how many cores can be used for the 90% of the time when more than 54 are usable, as follows:

    Average processor usage = 0.09 × 50 + 0.01 × 1 + 0.90 × Max processors
    54 = 4.51 + 0.90 × Max processors
    Max processors = 55

Now we can find the speedup:

    Speedup = 1 / (Fraction_55/55 + Fraction_50/50 + (1 − Fraction_55 − Fraction_50))
    Speedup = 1 / (0.90/55 + 0.09/50 + 0.01) = 35.5

Now compute the speedup on 24 processors:

    Speedup = 1 / (Fraction_24/24 + (1 − Fraction_24))
    Speedup = 1 / (0.99/24 + 0.01) = 19.5

When considering both power constraints and Amdahl's Law effects, the 96-processor version achieves less than a factor of 2 speedup over the 24-processor version. In fact, the speedup from the clock rate increase nearly matches the speedup from the 4× processor count increase. We comment on these issues further in the concluding remarks.
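A short C sketch (a check of the arithmetic above, not part of the original example) reproduces these results.

/* Power budget limits the number of active cores; Amdahl's Law then bounds
   the speedup of the 96-core and 24-core configurations. */
#include <stdio.h>

int main(void) {
    /* The 96-core chip would draw 1.79x the 165 W budget, so on average only
       about 96/1.79 = 54 cores can be active. */
    double active = 96.0 / 1.79;

    /* With 54 cores busy on average, the fully used configuration can have
       at most (54 - 0.09*50 - 0.01*1)/0.90 = 55 cores. */
    double max_procs = (54.0 - 0.09 * 50.0 - 0.01 * 1.0) / 0.90;

    double speedup96 = 1.0 / (0.90 / 55.0 + 0.09 / 50.0 + 0.01);
    double speedup24 = 1.0 / (0.99 / 24.0 + 0.01);

    printf("average active cores: %.1f\n", active);      /* ~53.6 */
    printf("max usable cores:     %.0f\n", max_procs);   /* ~55   */
    printf("96-core speedup:      %.1f\n", speedup96);   /* ~35.5 */
    printf("24-core speedup:      %.1f\n", speedup24);   /* ~19.5 */
    return 0;
}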

5.11 Concluding Remarks

As we saw in the previous section, multicore does not magically solve the power problem because it clearly increases both the transistor count and the active number of transistors switching, which are the two dominant contributions to power. The failure of Dennard scaling merely makes it more extreme.

But multicore does alter the game. By allowing idle cores to be placed in power-saving mode, some improvement in power efficiency can be achieved, as the results in this chapter have shown. For example, shutting down cores in the Intel i7 allows other cores to operate in Turbo mode. This capability allows a trade-off between higher clock rates with fewer processors and more processors with lower clock rates. More importantly, multicore shifts the burden for keeping the processor busy by relying more on TLP, which the application and programmer are responsible for


identifying, rather than on ILP, for which the hardware is responsible. Multiprogrammed and highly parallel workloads that avoid Amdahl's Law effects will benefit more easily.

Although multicore provides some help with the energy efficiency challenge and shifts much of the burden to the software system, there remain difficult challenges and unresolved questions. For example, attempts to exploit thread-level versions of aggressive speculation have so far met the same fate as their ILP counterparts. That is, the performance gains have been modest and are likely less than the increase in energy consumption, so ideas such as speculative threads or hardware run-ahead have not been successfully incorporated in processors. As in speculation for ILP, unless the speculation is almost always right, the costs exceed the benefits.

Thus, at the present, it seems unlikely that some form of simple multicore scaling will provide a cost-effective path to growing performance. A fundamental problem must be overcome: finding and exploiting significant amounts of parallelism in an energy- and silicon-efficient manner. In the previous chapter, we examined the exploitation of data parallelism via a SIMD approach. In many applications, data parallelism occurs in large amounts, and SIMD is a more energy-efficient method for exploiting data parallelism. In the next chapter, we explore large-scale cloud computing. In such environments, massive amounts of parallelism are available from millions of independent tasks generated by individual users. Amdahl's Law plays little role in limiting the scale of such systems because the tasks (e.g., millions of Google search requests) are independent. Finally, in Chapter 7, we explore the rise of domain-specific architectures (DSAs). Most domain-specific architectures exploit the parallelism of the targeted domain, which is often data parallelism, and as with GPUs, DSAs can achieve much higher efficiency as measured by energy consumption or silicon utilization.

In the last edition, published in 2012, we raised the question of whether it would be worthwhile to consider heterogeneous processors. At that time, no such multicore was delivered or announced, and heterogeneous multiprocessors had seen only limited success in special-purpose computers or embedded systems. While the programming models and software systems remain challenging, it appears inevitable that multiprocessors with heterogeneous processors will play an important role. Combining domain-specific processors, like those discussed in Chapters 4 and 7, with general-purpose processors is perhaps the best road forward to achieve increased performance and energy efficiency while maintaining some of the flexibility that general-purpose processors offer.

5.12 Historical Perspectives and References

Section M.7 (available online) looks at the history of multiprocessors and parallel processing. Divided by both time period and architecture, the section features discussions on early experimental multiprocessors and some of the great debates in parallel processing. Recent advances are also covered. References for further reading are included.


Case Studies and Exercises by Amr Zaky and David A. Wood

Case Study 1: Single Chip Multicore Multiprocessor

Concepts illustrated by this case study

■ Snooping Coherence Protocol Transitions

■ Coherence Protocol Performance

■ Coherence Protocol Optimizations

■ Synchronization

A multicore SMT multiprocessor is illustrated in Figure 5.37. Only the cache contents are shown. Each core has a single, private cache with coherence maintained using the snooping coherence protocol of Figure 5.7. Each cache is direct-mapped, with four lines, each holding 2 bytes (to simplify the diagram). For further simplification, the whole line addresses in memory are shown in the address fields in the caches, where the tag would normally exist. The coherence states are denoted M, S, and I for Modified, Shared, and Invalid.

5.1 [10/10/10/10/10/10/10] For each part of this exercise, the cache and memory are assumed to initially have the contents shown in Figure 5.37. Each part of this exercise specifies a sequence of one or more CPU operations of the form

Core 0 cache
Line  State  Address  Data
0     I      AC00     0010
1     S      AC08     0008
2     M      AC10     0030
3     I      AC18     0010

Core 1 cache
Line  State  Address  Data
0     I      AC00     0010
1     M      AC28     0068
2     I      AC10     0010
3     S      AC18     0018

Core 3 cache
Line  State  Address  Data
0     S      AC20     0020
1     S      AC08     0008
2     I      AC10     0010
3     I      AC18     0010

Memory
Address  Data
AC00     0010
AC08     0008
AC10     0010
AC18     0018
AC20     0020
AC28     0028
AC30     0030

Figure 5.37 Multicore (point-to-point) multiprocessor.


Ccore#: R, <address> for reads and Ccore#: W, <address> for writes. For example, C3: R, AC10 & C0: W, AC18

For the SGEMM code developed above for the i7 processor, include the use of AVX2 intrinsics to improve the performance. In particular, try to vectorize your code to better utilize the AVX hardware. Compare the code size and performance to the original code. Compare your results to Intel's Math Kernel Library (MKL) implementation for SGEMM.

A.16

[30] < A.7, A.9 > The RISC-V processor is open source and boasts an impressive collection of implementations, simulators, compilers, and other tools. See riscv.org for an overview of tools, including spike, a simulator for RISC-V processors. Use spike or another simulator to measure the instruction set mix for some SPEC CPU2017 benchmark programs.

A.17

[35/35/35/35] < A.2–A.8 > gcc targets most modern instruction set architectures (see www.gnu.org/software/gcc/). Create a version of gcc for several architectures that you have access to, such as x86, RISC-V, PowerPC, and ARM. a. [35] < A.2–A.8 > Compile a subset of SPEC CPU2017 integer benchmarks and create a table of code sizes. Which architecture is best for each program? b. [35] < A.2–A.8 > Compile a subset of SPEC CPU2017 floating-point benchmarks and create a table of code sizes. Which architecture is best for each program? c. [35] < A.2–A.8 > Compile a subset of EEMBC AutoBench benchmarks (see www.eembc.org/home.php) and create a table of code sizes. Which architecture is best for each program? d. [35] < A.2–A.8 > Compile a subset of EEMBC FPBench floating-point benchmarks and create a table of code sizes. Which architecture is best for each program?

A.18

[40] < A.2–A.8 > Power efficiency has become very important for modern processors, particularly for embedded systems. Create a version of gcc for two architectures that you have access to, such as x86, RISC-V, PowerPC, Atom, and ARM. (Note that the different versions of RISC-V can also be explored and compared.) Compile a subset of EEMBC benchmarks while using EnergyBench to measure energy usage during execution. Compare code size, performance, and energy usage for the processors. Which is best for each program?

A.19

[20/15/15/20] Your task is to compare the memory efficiency of four different styles of instruction set architectures. The architecture styles are:

■ Accumulator—All operations occur between a single register and a memory location.

■ Memory-memory—All instruction addresses reference only memory locations.

■ Stack—All operations occur on top of the stack. Push and pop are the only instructions that access memory; all others remove their operands from the stack and replace them with the result. The implementation uses a hardwired stack for only the top two stack entries, which keeps the processor circuit very small and low in cost. Additional stack positions are kept in memory locations, and accesses to these stack positions require memory references.

■ Load-store—All operations occur in registers, and register-to-register instructions have three register names per instruction.

To measure memory efficiency, make the following assumptions about all four instruction sets:

■ All instructions are an integral number of bytes in length.

■ The opcode is always one byte (8 bits).

■ Memory accesses use direct, or absolute, addressing.

■ The variables A, B, C, and D are initially in memory.

a. [20] < A.2, A.3 > Invent your own assembly language mnemonics (Figure A.2 provides a useful sample to generalize), and for each architecture write the best equivalent assembly language code for this high-level language code sequence:

    A = B + C;
    B = A + C;
    D = A - B;

b. [15] < A.3 > Label each instance in your assembly codes for part (a) where a value is loaded from memory after having been loaded once. Also label each instance in your code where the result of one instruction is passed to another instruction as an operand, and further classify these events as involving storage within the processor or storage in memory.

c. [15] < A.7 > Assume that the given code sequence is from a small, embedded computer application that uses a 16-bit memory address and data operands. If a load-store architecture is used, assume it has 16 general-purpose registers. For each architecture answer the following questions: How many instruction bytes are fetched? How many bytes of data are transferred from/to memory? Which architecture is most efficient as measured by total memory traffic (code + data)?

d. [20] < A.7 > Now assume a processor with 64-bit memory addresses and data operands. For each architecture answer the questions of part (c). How have the relative merits of the architectures changed for the chosen metrics?

A.20 [30] < A.2, A.3 > Use the four different instruction set architecture styles from above, but assume that the memory operations supported include register indirect as well as direct addressing. Invent your own assembly language mnemonics (Figure A.2 provides a useful sample to generalize), and for each architecture, write the best equivalent assembly language code for this fragment of C code:

    for (i = 0; i <= 100; i = i + 1) {
        A[i] = B[i] + C;
    }

A.21 [20/20] < A.3, A.6, A.9 > The size of displacement values needed for the displacement addressing mode or for PC-relative addressing can be extracted from compiled applications. Use a disassembler with one or more of the SPEC CPU2017 or EEMBC benchmarks compiled for the RISC-V processor.

a. [20] < A.3, A.9 > For each instruction using displacement addressing, record the displacement value used. Create a histogram of displacement values. Compare the results to those shown in this appendix in Figure A.8.

b. [20] < A.6, A.9 > For each branch instruction using PC-relative addressing, record the offset value used. Create a histogram of offset values. Compare the results to those shown in this chapter in Figure A.15.

A.22

[15/15/10/10/10/10] < A.3 > The value represented by the hexadecimal number 5249 5343 5643 5055 is to be stored in an aligned 64-bit double word.

a. [15] < A.3 > Using the physical arrangement of the first row in Figure A.5, write the value to be stored using Big Endian byte order. Next, interpret each byte as an ASCII character and below each byte write the corresponding character, forming the character string as it would be stored in Big Endian order.

b. [15] < A.3 > Using the same physical arrangement as in part (a), write the value to be stored using Little Endian byte order, and below each byte write the corresponding ASCII character.

c. [10] < A.3 > What are the hexadecimal values of all misaligned 2-byte words that can be read from the given 64-bit double word when stored in Big Endian byte order?

d. [10] < A.3 > What are the hexadecimal values of all misaligned 4-byte words that can be read from the given 64-bit double word when stored in Big Endian byte order?

e. [10] < A.3 > What are the hexadecimal values of all misaligned 2-byte words that can be read from the given 64-bit double word when stored in Little Endian byte order?

f. [10] < A.3 > What are the hexadecimal values of all misaligned 4-byte words that can be read from the given 64-bit double word when stored in Little Endian byte order?
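For readers who want to see byte ordering concretely before attempting the parts above, the short C sketch below stores the 64-bit value from this exercise and prints its bytes in ascending memory order; the variable names are only illustrative, and the output depends on the endianness of the machine it runs on.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    /* The 64-bit value from the exercise; each byte is a printable ASCII code. */
    uint64_t value = 0x5249534356435055ULL;
    unsigned char bytes[8];

    memcpy(bytes, &value, sizeof(value));   /* copy in the host's native byte order */

    for (int i = 0; i < 8; i++)             /* lowest address first */
        printf("byte %d: 0x%02X '%c'\n", i, (unsigned)bytes[i], bytes[i]);

    /* On a little-endian host the least significant byte (0x55, 'U') prints first;
       on a big-endian host the most significant byte (0x52, 'R') prints first. */
    return 0;
}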

A.23

[25/25] < A.3, A.9 > The relative frequency of different addressing modes impacts the choice of addressing modes supported by an instruction set architecture. Figure A.7 illustrates the relative frequency of addressing modes for three applications on the VAX.

a. [25] < A.3 > Compile one or more programs from the SPEC CPU2017 or EEMBC benchmark suites to target the x86 architecture. Using a disassembler, inspect the instructions and the relative frequency of various addressing modes. Create a histogram to illustrate the relative frequency of the addressing modes. How do your results compare to Figure A.7?


b. [25] < A.3, A.9 > Compile one or more programs from the SPEC CPU2017 or EEMBC benchmark suites to target the RISC-V architecture. Using a disassembler, inspect the instructions and the relative frequency of various addressing modes. Create a histogram to illustrate the relative frequency of the addressing modes. How do your results compare to Figure A.7?

A.24

[Discussion] < A.2–A.12 > Consider typical applications for desktop, server, cloud, and embedded computing. How would instruction set architecture be impacted for machines targeting each of these markets?

B.1  Introduction  B-2
B.2  Cache Performance  B-15
B.3  Six Basic Cache Optimizations  B-22
B.4  Virtual Memory  B-40
B.5  Protection and Examples of Virtual Memory  B-49
B.6  Fallacies and Pitfalls  B-57
B.7  Concluding Remarks  B-59
B.8  Historical Perspective and References  B-59
     Exercises by Amr Zaky  B-60

B Review of Memory Hierarchy

Cache: a safe place for hiding or storing things. Webster’s New World Dictionary of the American Language, Second College Edition (1976)


B.1

Introduction

This appendix is a quick refresher of the memory hierarchy, including the basics of cache and virtual memory, performance equations, and simple optimizations. This first section reviews the following 36 terms:

cache, fully associative, write allocate, virtual memory, dirty bit, unified cache, memory stall cycles, block offset, misses per instruction, direct mapped, write back, block, valid bit, data cache, locality, block address, hit time, address trace, write through, cache miss, set, instruction cache, page fault, random replacement, average memory access time, miss rate, index field, cache hit, n-way set associative, no-write allocate, page, least recently used, write buffer, miss penalty, tag field, write stall

If this review goes too quickly, you might want to look at Chapter 7 in Computer Organization and Design, which we wrote for readers with less experience. Cache is the name given to the highest or first level of the memory hierarchy encountered once the address leaves the processor. Because the principle of locality applies at many levels, and taking advantage of locality to improve performance is popular, the term cache is now applied whenever buffering is employed to reuse commonly occurring items. Examples include file caches, name caches, and so on. When the processor finds a requested data item in the cache, it is called a cache hit. When the processor does not find a data item it needs in the cache, a cache miss occurs. A fixed-size collection of data containing the requested word, called a block or line run, is retrieved from the main memory and placed into the cache. Temporal locality tells us that we are likely to need this word again in the near future, so it is useful to place it in the cache where it can be accessed quickly. Because of spatial locality, there is a high probability that the other data in the block will be needed soon. The time required for the cache miss depends on both the latency and bandwidth of the memory. Latency determines the time to retrieve the first word of the block, and bandwidth determines the time to retrieve the rest of this block. A cache miss is handled by hardware and causes processors using in-order execution to pause, or stall, until the data are available. With out-of-order execution, an instruction using the result must still wait, but other instructions may proceed during the miss. Similarly, not all objects referenced by a program need to reside in main memory. Virtual memory means some objects may reside on disk. The address space is

The levels of a typical memory hierarchy: (1) registers, (2) cache, (3) main memory, and (4) disk storage.

When you run a program with an overall miss rate of 3%, what will the average memory access time (in CPU cycles) be?

b. [10] < B.1 > Next, you run a program specifically designed to produce completely random data addresses with no locality. Toward that end, you use an array of size 1 GB (all of which fits in the main memory). Accesses to random elements of this array are continuously made (using a uniform random number generator to generate the elements indices). If your data cache size is 64 KB, what will the average memory access time be?

c. [10] If you compare the result obtained in part (b) with the main memory access time when the cache is disabled, what can you conclude about the role of the principle of locality in justifying the use of cache memory?

d. [15] < B.1 > You observed that a cache hit produces a gain of 104 cycles (1 cycle vs. 105), but it produces a loss of 5 cycles in the case of a miss (110 cycles vs. 105). In the general case, we can express these two quantities as G (gain) and L (loss). Using these two quantities (G and L), identify the highest miss rate after which the cache use would be disadvantageous.

B.2

[15/15] For the purpose of this exercise, we assume that we have a 512-byte cache with 64-byte blocks. We will also assume that the main memory is 2 KB large. We can regard the memory as an array of 64-byte blocks: M0, M1, …, M31. Figure B.30 sketches the memory blocks that can reside in different cache blocks if the cache was direct-mapped.

a. [15] Show the contents of the table if the cache is organized as a fully-associative cache.

b. [15] < B.1 > Repeat part (a) with the cache organized as a four-way set associative cache.
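The placement rule behind Figure B.30 is the usual index computation, number of sets = cache size / (block size × associativity), with memory block k mapping to set k mod (number of sets). The C sketch below is a minimal illustration using the exercise's assumed 512-byte cache and 64-byte blocks; the function name is hypothetical.

#include <stdio.h>

/* Set index for a given memory block number, using
   number of sets = cache size / (block size * associativity). */
static unsigned set_index(unsigned mem_block, unsigned cache_bytes,
                          unsigned block_bytes, unsigned assoc) {
    unsigned num_sets = cache_bytes / (block_bytes * assoc);
    return mem_block % num_sets;
}

int main(void) {
    for (unsigned m = 0; m < 32; m++)        /* 2 KB memory = 32 blocks of 64 bytes */
        printf("M%-2u -> direct-mapped set %u, four-way set %u\n",
               m,
               set_index(m, 512, 64, 1),     /* direct-mapped: 8 sets */
               set_index(m, 512, 64, 4));    /* four-way:      2 sets */
    /* A fully associative cache is the degenerate case with a single set. */
    return 0;
}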

B.3

[10/10/10/10/15/10/15/20] < B.1 > Cache organization is often influenced by the desire to reduce the cache's power consumption. For that purpose we assume that the cache is physically distributed into a data array (holding the data), a tag array (holding the tags), and replacement array (holding information needed by replacement policy). Furthermore, every one of these arrays is physically distributed into multiple subarrays (one per way) that can be individually accessed; for example, a four-way set associative least recently used (LRU) cache would have four data subarrays, four tag subarrays, and four replacement subarrays. We assume that the


Cache block   Set   Way   Possible memory blocks
0             0     0     M0, M8, M16, M24
1             1     0     M1, M9, M17, M25
2             2     0     M2, M10, M18, M26
3             3     0     ….
4             4     0     ….
5             5     0     ….
6             6     0     ….
7             7     0     M7, M15, M23, M31

Figure B.30 Memory blocks distributed to direct-mapped cache.

Array                  Power consumption weight (per way accessed)
Data array             20 units
Tag array              5 units
Miscellaneous array    1 unit
Memory access          200 units

Figure B.31 Power consumption costs of different operations.

replacement subarrays are accessed once per access when the LRU replacement policy is used, and once per miss if the first-in, first-out (FIFO) replacement policy is used. It is not needed when a random replacement policy is used. For a specific cache, it was determined that the accesses to the different arrays have the following power consumption weights (Figure B.31):

a. [10] < B.1 > A cache read hit. All arrays are read simultaneously.

b. [10] < B.1 > Repeat part (a) for a cache read miss.

c. [10] < B.1 > Repeat part (a) assuming that the cache access is split across two cycles. In the first cycle, all the tag subarrays are accessed. In the second cycle, only the subarray whose tag matched will be accessed.

d. [10] < B.1 > Repeat part (c) for a cache read miss (no data array accesses in the second cycle).

e. [15] < B.1 > Repeat part (c) assuming that logic is added to predict the cache way to be accessed. Only the tag subarray for the predicted way is accessed in cycle one. A way hit (address match in predicted way) implies a cache hit. A way miss dictates examining all the tag subarrays in the second cycle. In case of a way hit, only one data subarray (the one whose tag matched) is accessed in cycle two. Assume the way predictor hits.


f. [10] < B.1 > Repeat part (e) assuming that the way predictor misses (the way it chooses is wrong). When it fails, the way predictor adds an extra cycle in which it accesses all the tag subarrays. Assume the way predictor miss is followed by a cache read hit.

g. [15] < B.1 > Repeat part (f) assuming a cache read miss.

h. [20] < B.1 > Use parts (e), (f), and (g) for the general case where the workload has the following statistics: way predictor miss rate = 5% and cache miss rate = 3%. (Consider different replacement policies.) Estimate the memory system (cache + memory) power usage (in power units) for the following configurations. We assume the cache is four-way set associative. Provide answers for the LRU, FIFO, and random replacement policies.

B.4

[10/10/15/15/15/20] We compare the write bandwidth requirements of write-through versus write-back caches using a concrete example. Let us assume that we have a 64 KB cache with a line size of 32 bytes. The cache will allocate a line on a write miss. If configured as a write-back cache, it will write back all of the dirty line if it needs to be replaced. We will also assume that the cache is connected to the lower level in the hierarchy through a 64-bit-wide (8-byte-wide) bus. The number of CPU cycles for a B-byte write access on this bus is 10 + 5 × ⌈B/8 − 1⌉, where ⌈ ⌉ denotes the “ceiling” function. For example, an 8-byte write would take 10 + 5 × ⌈8/8 − 1⌉ = 10 cycles, whereas using the same formula a 12-byte write would take 15 cycles. Answer the following questions while referring to the C code snippet below:

...
#define PORTION 1
...
base = 8*i;
for (unsigned int j = base; j < base + PORTION; j++) //assume j is stored in a register
{
    data[j] = j;
}

a. [10] For a write-through cache, how many CPU cycles are spent on write transfers to the memory for all the combined iterations of the j loop?

b. [10] < B.1 > If the cache is configured as a write-back cache, how many CPU cycles are spent on writing back a cache line?

c. [15] < B.1 > Change PORTION to 8 and repeat part (a).

d. [15] < B.1 > What is the minimum number of array updates to the same cache line (before replacing it) that would render the write-back cache superior?

e. [15] < B.1 > Think of a scenario where all the words of the cache line will be written (not necessarily using the above code) and a write-through cache will require fewer total CPU cycles than the write-back cache.
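For reference, the bus-cost formula above can be tabulated with a few lines of C; the helper name below is hypothetical, and the only assumption is the 8-byte-wide bus already stated in the exercise.

#include <stdio.h>

/* CPU cycles for a B-byte write over the 8-byte-wide bus:
   10 + 5 * ceil(B/8 - 1), evaluated with integer arithmetic. */
static unsigned write_cycles(unsigned bytes) {
    unsigned beats = (bytes + 7) / 8;      /* ceil(B/8) bus transfers */
    return 10 + 5 * (beats - 1);
}

int main(void) {
    unsigned sizes[] = {8, 12, 32};
    for (int i = 0; i < 3; i++)
        printf("%u-byte write: %u cycles\n", sizes[i], write_cycles(sizes[i]));
    /* prints 10, 15, and 25 cycles; the first two match the worked examples above */
    return 0;
}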

B.5

[10/10/10/10] < B.2 > You are building a system around a processor with in-order execution that runs at 1.1 GHz and has a CPI of 1.35 excluding memory accesses. The only instructions that read or write data from memory are loads (20% of all instructions) and stores (10% of all instructions). The memory system for this computer is composed of a split L1 cache that imposes no penalty on hits. Both the I-cache and D-cache are direct-mapped and hold 32 KB each. The I-cache has a 2% miss rate and 32-byte blocks, and the D-cache is write-through with a 5% miss rate and 16-byte blocks. There is a write buffer on the D-cache that eliminates stalls for 95% of all writes. The 512 KB write-back, unified L2 cache has 64-byte blocks and an access time of 15 ns. It is connected to the L1 cache by a 128-bit data bus that runs at 266 MHz and can transfer one 128-bit word per bus cycle. Of all memory references sent to the L2 cache in this system, 80% are satisfied without going to main memory. Also, 50% of all blocks replaced are dirty. The 128-bit-wide main memory has an access latency of 60 ns, after which any number of bus words may be transferred at the rate of one per cycle on the 128-bit-wide 133 MHz main memory bus.

a. [10] What is the average memory access time for instruction accesses?

b. [10] < B.2 > What is the average memory access time for data reads?

c. [10] < B.2 > What is the average memory access time for data writes?

d. [10] < B.2 > What is the overall CPI, including memory accesses?

B.6

[10/15/15] < B.2 > Converting miss rate (misses per reference) into misses per instruction relies upon two factors: references per instruction fetched and the fraction of fetched instructions that actually commits.

a. [10] < B.2 > The formula for misses per instruction on page B-5 is written first in terms of three factors: miss rate, memory accesses, and instruction count. Each of these factors represents actual events. What is different about writing misses per instruction as miss rate times the factor memory accesses per instruction?

b. [15] < B.2 > Speculative processors will fetch instructions that do not commit. The formula for misses per instruction on page B-5 refers to misses per instruction on the execution path; that is, only the instructions that must actually be executed to carry out the program. Convert the formula for misses per instruction on page B-5 into one that uses only miss rate, references per instruction fetched, and fraction of fetched instructions that commit. Why rely upon these factors rather than those in the formula on page B-5?

c. [15] < B.2 > The conversion in part (b) could yield an incorrect value to the extent that the value of the factor references per instruction fetched is not equal to the number of references for any particular instruction. Rewrite the formula of part (b) to correct this deficiency.

B.7

[20] < B.1, B.3 > In systems with a write-through L1 cache backed by a write-back L2 cache instead of main memory, a merging write buffer can be simplified. Explain how this can be done. Are there situations where having a full write buffer (instead of the simple version you have just proposed) could be helpful?

B.8

[5/5/5] < B.3 > We want to observe the following calculation:

di = ai + bi ∗ ci ,  i : (0 : 511)

The memory layout of arrays a, b, c, and d is displayed below (each has 512 4-byte-wide integer elements). The above calculation employs a for loop that runs through 512 iterations. Assume a 32 KB 4-way set associative cache with a single-cycle access time. The miss penalty is 100 CPU cycles/access, and so is the cost of a write-back. The cache is a write-back on hits, write-allocate on misses cache (Figure B.32).

a. [5] < B.3 > How many cycles will an iteration take if all three loads and the single store miss in the data cache?

b. [5] < B.3 > If the cache line size is 16 bytes, what is the average number of cycles an average iteration will take? (Hint: Spatial locality!)

c. [5] < B.3 > If the cache line size is 64 bytes, what is the average number of cycles an average iteration will take?

d. If the cache is direct-mapped and its size is reduced to 2048 bytes, what is the average number of cycles an average iteration will take?
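Written out as C, the calculation being analyzed is simply the loop below; the array names follow Figure B.32, and the 4-byte int elements match the stated layout.

/* The 512-iteration loop analyzed in this exercise: three loads (a[i], b[i], c[i])
   and one store (d[i]) per iteration, over 4-byte integer elements. */
int a[512], b[512], c[512], d[512];

void compute(void) {
    for (int i = 0; i < 512; i++)
        d[i] = a[i] + b[i] * c[i];
}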

B.9

[20] < B.3 > Increasing a cache's associativity (with all other parameters kept constant) statistically reduces the miss rate. However, there can be pathological cases where increasing a cache's associativity would increase the miss rate for a particular workload. Consider the case of direct-mapped compared to a two-way set associative cache of equal size. Assume that the set associative cache uses the LRU replacement policy. To simplify, assume that the block size is one word. Now, construct a trace of word accesses that would produce more misses in the two-way associative cache. (Hint: Focus on constructing a trace of accesses that are exclusively directed to a single set of the two-way set associative cache, such that the same trace would exclusively access two blocks in the direct-mapped cache.)

B.10

[10/10/15] < B.3 > Consider a two-level memory hierarchy made of L1 and L2 data caches. Assume that both caches use write-back policy on write hit and both have the same block size. List the actions taken in response to the following events:

a. [10] < B.3 > An L1 cache miss when the caches are organized in an inclusive hierarchy.

Mem. address in bytes    Contents
0–2047                   Array a
2048–4095                Array b
4096–6143                Array c
6144–8191                Array d

Figure B.32 Arrays layout in memory.


b. [10] < B.3 > An L1 cache miss when the caches are organized in an exclusive hierarchy.

c. [15] < B.3 > In both parts (a) and (b), consider the possibility that the evicted line might be clean or dirty.

B.11

[15/20] < B.2, B.3 > Excluding some instructions from entering the cache can reduce conflict misses.

a. [15] < B.3 > Sketch a program hierarchy where parts of the program would be better excluded from entering the instruction cache. (Hint: Consider a program with code blocks that are placed in deeper loop nests than other blocks.)

b. [20] < B.2, B.3 > Suggest software or hardware techniques to enforce exclusion of certain blocks from the instruction cache.

B.12

[5/15] < B.3 > Whereas larger caches have lower miss rates, they also tend to have longer hit times. Assume a direct-mapped 8 KB cache has 0.22 ns hit time and miss rate m1; also assume a 4-way associative 64 KB cache has 0.52 ns hit time and a miss rate m2.

a. [5] < B.3 > If the miss penalty is 100 ns, when would it be advantageous to use the smaller cache to reduce the overall memory access time?

b. [15] < B.3 > Repeat part (a) for miss penalties of 10 and 1000 cycles. Conclude when it might be advantageous to use a smaller cache.

B.13

[15] A program is running on a computer with a four-entry fully associative (micro) translation lookaside buffer (TLB) (Figure B.33): The following is a trace of virtual page numbers accessed by a program. For each access indicate whether it produces a TLB hit/miss and, if it accesses the page table, whether it produces a page hit or fault. Put an X under the page table column if it is not accessed (Figures B.34 and B.35).

B.14

[15/15/15/15] < B.4 > Some memory systems handle TLB misses in software (as an exception), while others use hardware for TLB misses.

a. [15] < B.4 > What are the trade-offs between these two methods for handling TLB misses?

b. [15] < B.4 > Will TLB miss handling in software always be slower than TLB miss handling in hardware? Explain.

VP#   PP#   Entry valid
5     30    1
7     1     0
10    10    1
15    25    1

Figure B.33 TLB contents (problem B.12).


Virtual page index   Physical page #   Present
0                    3                 Y
1                    7                 N
2                    6                 N
3                    5                 Y
4                    14                Y
5                    30                Y
6                    26                Y
7                    11                Y
8                    13                N
9                    18                N
10                   10                Y
11                   56                Y
12                   110               Y
13                   33                Y
14                   12                N
15                   25                Y

Figure B.34 Page table contents.

Virtual page accessed   TLB (hit or miss)   Page table (hit or fault)
1
5
9
14
10
6
15
12
7
2

Figure B.35 Page access trace.

c. [15] < B.4 > Are there page table structures that would be difficult to handle in hardware but possible in software? Are there any such structures that would be difficult for software to handle but easy for hardware to manage? d. [15] < B.4 > Why are TLB miss rates for floating-point programs generally higher than those for integer programs?

B.15

[20/20] < B.5 > It is possible to provide more flexible protection than that in the Intel Pentium architecture by using a protection scheme similar to that used in the Hewlett-Packard Precision Architecture (HP/PA). In such a scheme, each page table entry contains a “protection ID” (key) along with access rights for the page. On each reference, the CPU compares the protection ID in the page table entry with those stored in each of four protection ID registers (access to these registers requires that the CPU be in supervisor mode). If there is no match for the protection ID in the page table entry or if the access is not a permitted access (writing to a read-only page, for example), an exception is generated.

a. [20] < B.5 > Explain how this model could be used to facilitate the construction of operating systems from relatively small pieces of code that cannot overwrite each other (microkernels). What advantages might such an operating system have over a monolithic operating system in which any code in the OS can write to any memory location?

b. [20] < B.5 > A simple design change to this system would allow two protection IDs for each page table entry, one for read access and the other for either write or execute access (the field is unused if neither the writable nor executable bit is set). What advantages might there be from having different protection IDs for read and write capabilities? (Hint: Could this make it easier to share data and code between processes?)

C.1   Introduction  C-2
C.2   The Major Hurdle of Pipelining—Pipeline Hazards  C-10
C.3   How Is Pipelining Implemented?  C-26
C.4   What Makes Pipelining Hard to Implement?  C-37
C.5   Extending the RISC V Integer Pipeline to Handle Multicycle Operations  C-45
C.6   Putting It All Together: The MIPS R4000 Pipeline  C-55
C.7   Cross-Cutting Issues  C-65
C.8   Fallacies and Pitfalls  C-70
C.9   Concluding Remarks  C-71
C.10  Historical Perspective and References  C-71
      Updated Exercises by Diana Franklin  C-71

C Pipelining: Basic and Intermediate Concepts

It is quite a three-pipe problem. Sir Arthur Conan Doyle, The Adventures of Sherlock Holmes


C.1

Introduction

Many readers of this text will have covered the basics of pipelining in another text (such as our more basic text Computer Organization and Design) or in another course. Because Chapter 3 builds heavily on this material, readers should ensure that they are familiar with the concepts discussed in this appendix before proceeding. As you read Chapter 3, you may find it helpful to turn to this material for a quick review.

We begin the appendix with the basics of pipelining, including discussing the data path implications, introducing hazards, and examining the performance of pipelines. This section describes the basic five-stage RISC pipeline that is the basis for the rest of the appendix. Section C.2 describes the issue of hazards, why they cause performance problems, and how they can be dealt with. Section C.3 discusses how the simple five-stage pipeline is actually implemented, focusing on control and how hazards are dealt with. Section C.4 discusses the interaction between pipelining and various aspects of instruction set design, including discussing the important topic of exceptions and their interaction with pipelining. Readers unfamiliar with the concepts of precise and imprecise interrupts and resumption after exceptions will find this material useful, because they are key to understanding the more advanced approaches in Chapter 3. Section C.5 discusses how the five-stage pipeline can be extended to handle longer-running floating-point instructions. Section C.6 puts these concepts together in a case study of a deeply pipelined processor, the MIPS R4000/4400, including both the eight-stage integer pipeline and the floating-point pipeline. The MIPS R4000 is similar to a single-issue embedded processor, such as the ARM Cortex-A5, which became available in 2010 and was used in several smart phones and tablets. Section C.7 introduces the concept of dynamic scheduling and the use of scoreboards to implement dynamic scheduling. It is introduced as a cross-cutting issue, because it can be used to serve as an introduction to the core concepts in Chapter 3, which focuses on dynamically scheduled approaches. Section C.7 is also a gentle introduction to the more complex Tomasulo’s algorithm covered in Chapter 3. Although Tomasulo’s algorithm can be covered and understood without introducing scoreboarding, the scoreboarding approach is simpler and easier to comprehend.

What Is Pipelining? Pipelining is an implementation technique whereby multiple instructions are overlapped in execution; it takes advantage of parallelism that exists among the actions needed to execute an instruction. Today, pipelining is the key implementation technique used to make fast processors, and even processors that cost less than a dollar are pipelined.


A pipeline is like an assembly line. In an automobile assembly line, there are many steps, each contributing something to the construction of the car. Each step operates in parallel with the other steps, although on a different car. In a computer pipeline, each step in the pipeline completes a part of an instruction. Like the assembly line, different steps are completing different parts of different instructions in parallel. Each of these steps is called a pipe stage or a pipe segment. The stages are connected one to the next to form a pipe—instructions enter at one end, progress through the stages, and exit at the other end, just as cars would in an assembly line.

In an automobile assembly line, throughput is defined as the number of cars per hour and is determined by how often a completed car exits the assembly line. Likewise, the throughput of an instruction pipeline is determined by how often an instruction exits the pipeline. Because the pipe stages are hooked together, all the stages must be ready to proceed at the same time, just as we would require in an assembly line. The time required between moving an instruction one step down the pipeline is a processor cycle. Because all stages proceed at the same time, the length of a processor cycle is determined by the time required for the slowest pipe stage, just as in an auto assembly line the longest step would determine the time between advancing cars in the line. In a computer, this processor cycle is almost always 1 clock cycle.

The pipeline designer’s goal is to balance the length of each pipeline stage, just as the designer of the assembly line tries to balance the time for each step in the process. If the stages are perfectly balanced, then the time per instruction on the pipelined processor—assuming ideal conditions—is equal to

Time per instruction on unpipelined machine / Number of pipe stages

Under these conditions, the speedup from pipelining equals the number of pipe stages, just as an assembly line with n stages can ideally produce cars n times as fast. Usually, however, the stages will not be perfectly balanced; furthermore, pipelining does involve some overhead. Thus, the time per instruction on the pipelined processor will not have its minimum possible value, yet it can be close. Pipelining yields a reduction in the average execution time per instruction. If the starting point is a processor that takes multiple clock cycles per instruction, then pipelining reduces the CPI. This is the primary view we will take. Pipelining is an implementation technique that exploits parallelism among the instructions in a sequential instruction stream. It has the substantial advantage that, unlike some speedup techniques (see Chapter 4), it is not visible to the programmer.

The Basics of the RISC V Instruction Set Throughout this book we use RISC V, a load-store architecture, to illustrate the basic concepts. Nearly all the ideas we introduce in this book are applicable to other


processors, but the implementation may be much more complicated with complex instructions. In this section, we make use of the core of the RISC V architecture; see Chapter 1 for a full description. Although we use RISC V, the concepts are significantly similar in that they will apply to any RISC, including the core architectures of ARM and MIPS. All RISC architectures are characterized by a few key properties: ■

All operations on data apply to data in registers and typically change the entire register (32 or 64 bits per register).



The only operations that affect memory are load and store operations that move data from memory to a register or to memory from a register, respectively. Load and store operations that load or store less than a full register (e.g., a byte, 16 bits, or 32 bits) are often available.



The instruction formats are few in number, with all instructions typically being one size. In RISC V, the register specifiers rs1, rs2, and rd are always in the same place, simplifying the control.

These simple properties lead to dramatic simplifications in the implementation of pipelining, which is why these instruction sets were designed this way. Chapter 1 contains a full description of the RISC V ISA, and we assume the reader has read Chapter 1.

A Simple Implementation of a RISC Instruction Set To understand how a RISC instruction set can be implemented in a pipelined fashion, we need to understand how it is implemented without pipelining. This section shows a simple implementation where every instruction takes at most 5 clock cycles. We will extend this basic implementation to a pipelined version, resulting in a much lower CPI. Our unpipelined implementation is not the most economical or the highest-performance implementation without pipelining. Instead, it is designed to lead naturally to a pipelined implementation. Implementing the instruction set requires the introduction of several temporary registers that are not part of the architecture; these are introduced in this section to simplify pipelining. Our implementation will focus only on a pipeline for an integer subset of a RISC architecture that consists of load-store word, branch, and integer ALU operations. Every instruction in this RISC subset can be implemented in, at most, 5 clock cycles. The 5 clock cycles are as follows. 1. Instruction fetch cycle (IF): Send the program counter (PC) to memory and fetch the current instruction from memory. Update the PC to the next sequential instruction by adding 4 (because each instruction is 4 bytes) to the PC.


2. Instruction decode/register fetch cycle (ID): Decode the instruction and read the registers corresponding to register source specifiers from the register file. Do the equality test on the registers as they are read, for a possible branch. Sign-extend the offset field of the instruction in case it is needed. Compute the possible branch target address by adding the sign-extended offset to the incremented PC. Decoding is done in parallel with reading registers, which is possible because the register specifiers are at a fixed location in a RISC architecture. This technique is known as fixed-field decoding. Note that we may read a register we don’t use, which doesn’t help but also doesn’t hurt performance. (It does waste energy to read an unneeded register, and power-sensitive designs might avoid this.) For loads and ALU immediate operations, the immediate field is always in the same place, so we can easily sign extend it. (For a more complete implementation of RISC V, we would need to compute two different sign-extended values, because the immediate field for store is in a different location.) 3. Execution/effective address cycle (EX): The ALU operates on the operands prepared in the prior cycle, performing one of three functions, depending on the instruction type. ■

Memory reference—The ALU adds the base register and the offset to form the effective address.



Register-Register ALU instruction—The ALU performs the operation specified by the ALU opcode on the values read from the register file.



Register-Immediate ALU instruction—The ALU performs the operation specified by the ALU opcode on the first value read from the register file and the sign-extended immediate.



Conditional branch—Determine whether the condition is true.

In a load-store architecture the effective address and execution cycles can be combined into a single clock cycle, because no instruction needs to simultaneously calculate a data address and perform an operation on the data. 4. Memory access (MEM): If the instruction is a load, the memory does a read using the effective address computed in the previous cycle. If it is a store, then the memory writes the data from the second register read from the register file using the effective address. 5. Write-back cycle (WB): ■

Register-Register ALU instruction or load instruction:

Write the result into the register file, whether it comes from the memory system (for a load) or from the ALU (for an ALU instruction). In this implementation, branch instructions require three cycles, store instructions require four cycles, and all other instructions require five cycles. Assuming a


branch frequency of 12% and a store frequency of 10%, a typical instruction distribution leads to an overall CPI of 4.66. This implementation, however, is not optimal either in achieving the best performance or in using the minimal amount of hardware given the performance level; we leave the improvement of this design as an exercise for you and instead focus on pipelining this version.
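That 4.66 figure is just the weighted sum of the per-class cycle counts; the small C sketch below re-derives it under the stated 12% branch and 10% store frequencies, with the remaining 78% of instructions taking five cycles.

#include <stdio.h>

int main(void) {
    /* Instruction mix assumed in the text: 12% branches (3 cycles),
       10% stores (4 cycles), and the remaining 78% take 5 cycles. */
    double cpi = 0.12 * 3 + 0.10 * 4 + 0.78 * 5;
    printf("Overall CPI = %.2f\n", cpi);   /* prints 4.66 */
    return 0;
}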

The Classic Five-Stage Pipeline for a RISC Processor We can pipeline the execution described in the previous section with almost no changes by simply starting a new instruction on each clock cycle. (See why we chose this design?) Each of the clock cycles from the previous section becomes a pipe stage—a cycle in the pipeline. This results in the execution pattern shown in Figure C.1, which is the typical way a pipeline structure is drawn. Although each instruction takes 5 clock cycles to complete, during each clock cycle the hardware will initiate a new instruction and will be executing some part of the five different instructions. You may find it hard to believe that pipelining is as simple as this; it’s not. In this and the following sections, we will make our RISC pipeline “real” by dealing with problems that pipelining introduces. To start with, we have to determine what happens on every clock cycle of the processor and make sure we don’t try to perform two different operations with the same data path resource on the same clock cycle. For example, a single ALU cannot be asked to compute an effective address and perform a subtract operation at the same time. Thus, we must ensure that the overlap of instructions in the pipeline cannot cause such a conflict. Fortunately, the simplicity of a RISC instruction set makes resource evaluation relatively easy. Figure C.2 shows a simplified version of a RISC data path drawn in pipeline fashion. As you can see, the major functional units are used in different cycles, and hence overlapping the execution of multiple

                      Clock number
Instruction number    1    2    3    4    5    6    7    8    9
Instruction i         IF   ID   EX   MEM  WB
Instruction i + 1          IF   ID   EX   MEM  WB
Instruction i + 2               IF   ID   EX   MEM  WB
Instruction i + 3                    IF   ID   EX   MEM  WB
Instruction i + 4                         IF   ID   EX   MEM  WB

Figure C.1 Simple RISC pipeline. On each clock cycle, another instruction is fetched and begins its five-cycle execution. If an instruction is started every clock cycle, the performance will be up to five times that of a processor that is not pipelined. The names for the stages in the pipeline are the same as those used for the cycles in the unpipelined implementation: IF = instruction fetch, ID = instruction decode, EX = execution, MEM = memory access, and WB = write-back.



Figure C.2 The pipeline can be thought of as a series of data paths shifted in time. This figure shows the overlap among the parts of the data path, with clock cycle 5 (CC 5) showing the steady-state situation. Because the register file is used as a source in the ID stage and as a destination in the WB stage, it appears twice. We show that it is read in one part of the stage and written in another by using a solid line, on the right or left, respectively, and a dashed line on the other side. The abbreviation IM is used for instruction memory, DM for data memory, and CC for clock cycle.

instructions introduces relatively few conflicts. There are three observations on which this fact rests. First, we use separate instruction and data memories, which we would typically implement with separate instruction and data caches (discussed in Chapter 2). The use of separate caches eliminates a conflict for a single memory that would arise between instruction fetch and data memory access. Notice that if our pipelined processor has a clock cycle that is equal to that of the unpipelined version, the memory system must deliver five times the bandwidth. This increased demand is one cost of higher performance. Second, the register file is used in the two stages: one for reading in ID and one for writing in WB. These uses are distinct, so we simply show the register file in two places. Hence, we need to perform two reads and one write every clock cycle.


To handle reads and a write to the same register (and for another reason, which will become obvious shortly), we perform the register write in the first half of the clock cycle and the read in the second half. Third, Figure C.2 does not deal with the PC. To start a new instruction every clock, we must increment and store the PC every clock, and this must be done during the IF stage in preparation for the next instruction. Furthermore, we must also have an adder to compute the potential branch target address during ID. One further problem is that we need the ALU in the ALU stage to evaluate the branch condition. Actually, we don’t really need a full ALU to evaluate the comparison between two registers, but we need enough of the function that it has to occur in this pipestage. Although it is critical to ensure that instructions in the pipeline do not attempt to use the hardware resources at the same time, we must also ensure that instructions in different stages of the pipeline do not interfere with one another. This separation is done by introducing pipeline registers between successive stages of the pipeline, so that at the end of a clock cycle all the results from a given stage are stored into a register that is used as the input to the next stage on the next clock cycle. Figure C.3 shows the pipeline drawn with these pipeline registers. Although many figures will omit such registers for simplicity, they are required to make the pipeline operate properly and must be present. Of course, similar registers would be needed even in a multicycle data path that had no pipelining (because only values in registers are preserved across clock boundaries). In the case of a pipelined processor, the pipeline registers also play the key role of carrying intermediate results from one stage to another where the source and destination may not be directly adjacent. For example, the register value to be stored during a store instruction is read during ID, but not actually used until MEM; it is passed through two pipeline registers to reach the data memory during the MEM stage. Likewise, the result of an ALU instruction is computed during EX, but not actually stored until WB; it arrives there by passing through two pipeline registers. It is sometimes useful to name the pipeline registers, and we follow the convention of naming them by the pipeline stages they connect, so the registers are called IF/ID, ID/EX, EX/MEM, and MEM/WB.

Basic Performance Issues in Pipelining Pipelining increases the processor instruction throughput—the number of instructions completed per unit of time—but it does not reduce the execution time of an individual instruction. In fact, it usually slightly increases the execution time of each instruction due to overhead in the control of the pipeline. The increase in instruction throughput means that a program runs faster and has lower total execution time, even though no single instruction runs faster! The fact that the execution time of each instruction does not decrease puts limits on the practical depth of a pipeline, as we will see in the next section. In addition to limitations arising from pipeline latency, limits arise from imbalance


Figure C.3 A pipeline showing the pipeline registers between successive pipeline stages. Notice that the registers prevent interference between two different instructions in adjacent stages in the pipeline. The registers also play the critical role of carrying data for a given instruction from one stage to the other. The edge-triggered property of registers—that is, that the values change instantaneously on a clock edge—is critical. Otherwise, the data from one instruction could interfere with the execution of another!

among the pipe stages and from pipelining overhead. Imbalance among the pipe stages reduces performance because the clock can run no faster than the time needed for the slowest pipeline stage. Pipeline overhead arises from the combination of pipeline register delay and clock skew. The pipeline registers add setup time, which is the time that a register input must be stable before the clock signal that triggers a write occurs, plus propagation delay to the clock cycle. Clock skew, which is the maximum delay between when the clock arrives at any two registers,


also contributes to the lower limit on the clock cycle. Once the clock cycle is as small as the sum of the clock skew and latch overhead, no further pipelining is useful, because there is no time left in the cycle for useful work. The interested reader should see Kunkel and Smith (1986).

Example: Consider the unpipelined processor in the previous section. Assume that it has a 4 GHz clock (or a 0.5 ns clock cycle) and that it uses four cycles for ALU operations and branches and five cycles for memory operations. Assume that the relative frequencies of these operations are 40%, 20%, and 40%, respectively. Suppose that due to clock skew and setup, pipelining the processor adds 0.1 ns of overhead to the clock. Ignoring any latency impact, how much speedup in the instruction execution rate will we gain from a pipeline?

Answer: The average instruction execution time on the unpipelined processor is

Average instruction execution time = Clock cycle × Average CPI
                                   = 0.5 ns × [(40% + 20%) × 4 + 40% × 5]
                                   = 0.5 ns × 4.4
                                   = 2.2 ns

In the pipelined implementation, the clock must run at the speed of the slowest stage plus overhead, which will be 0.5 + 0.1 or 0.6 ns; this is the average instruction execution time. Thus, the speedup from pipelining is

Speedup from pipelining = Average instruction time unpipelined / Average instruction time pipelined
                        = 2.2 ns / 0.6 ns = 3.7 times
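The same arithmetic is easy to check mechanically; the C sketch below encodes only the assumptions stated in the example (0.5 ns clock, 0.1 ns pipelining overhead, and the given instruction mix).

#include <stdio.h>

int main(void) {
    double clock = 0.5;                    /* ns, unpipelined clock cycle      */
    double overhead = 0.1;                 /* ns of skew/setup added per stage */
    /* 40% ALU + 20% branch at 4 cycles, 40% memory at 5 cycles */
    double avg_cpi = (0.40 + 0.20) * 4 + 0.40 * 5;
    double unpipelined = clock * avg_cpi;          /* 2.2 ns */
    double pipelined   = clock + overhead;         /* 0.6 ns */
    printf("speedup = %.1f\n", unpipelined / pipelined);  /* about 3.7 */
    return 0;
}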

The 0.1 ns overhead essentially establishes a limit on the effectiveness of pipelining. If the overhead is not affected by changes in the clock cycle, Amdahl’s Law tells us that the overhead limits the speedup. This simple RISC pipeline would function just fine for integer instructions if every instruction were independent of every other instruction in the pipeline. In reality, instructions in the pipeline can depend on one another; this is the topic of the next section.

C.2

The Major Hurdle of Pipelining—Pipeline Hazards There are situations, called hazards, that prevent the next instruction in the instruction stream from executing during its designated clock cycle. Hazards reduce the performance from the ideal speedup gained by pipelining. There are three classes of hazards:


1. Structural hazards arise from resource conflicts when the hardware cannot support all possible combinations of instructions simultaneously in overlapped execution. In modern processors, structural hazards occur primarily in special purpose functional units that are less frequently used (such as floating point divide or other complex long running instructions). They are not a major performance factor, assuming programmers and compiler writers are aware of the lower throughput of these instructions. Instead of spending more time on this infrequent case, we focus on the two other hazards that are much more frequent. 2. Data hazards arise when an instruction depends on the results of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline. 3. Control hazards arise from the pipelining of branches and other instructions that change the PC. Hazards in pipelines can make it necessary to stall the pipeline. Avoiding a hazard often requires that some instructions in the pipeline be allowed to proceed while others are delayed. For the pipelines we discuss in this appendix, when an instruction is stalled, all instructions issued later than the stalled instruction—and hence not as far along in the pipeline—are also stalled. Instructions issued earlier than the stalled instruction—and hence farther along in the pipeline—must continue, because otherwise the hazard will never clear. As a result, no new instructions are fetched during the stall. We will see several examples of how pipeline stalls operate in this section— don’t worry, they aren’t as complex as they might sound!

Performance of Pipelines With Stalls

A stall causes the pipeline performance to degrade from the ideal performance. Let’s look at a simple equation for finding the actual speedup from pipelining, starting with the formula from the previous section:

Speedup from pipelining = Average instruction time unpipelined / Average instruction time pipelined
                        = (CPI unpipelined × Clock cycle unpipelined) / (CPI pipelined × Clock cycle pipelined)
                        = (CPI unpipelined / CPI pipelined) × (Clock cycle unpipelined / Clock cycle pipelined)

Pipelining can be thought of as decreasing the CPI or the clock cycle time. Because it is traditional to use the CPI to compare pipelines, let’s start with that assumption. The ideal CPI on a pipelined processor is almost always 1. Hence, we can compute the pipelined CPI:

CPI pipelined = Ideal CPI + Pipeline stall clock cycles per instruction
              = 1 + Pipeline stall clock cycles per instruction


If we ignore the cycle time overhead of pipelining and assume that the stages are perfectly balanced, then the cycle time of the two processors can be equal, leading to

Speedup = CPI unpipelined / (1 + Pipeline stall cycles per instruction)

One important simple case is where all instructions take the same number of cycles, which must also equal the number of pipeline stages (also called the depth of the pipeline). In this case, the unpipelined CPI is equal to the depth of the pipeline, leading to

Speedup = Pipeline depth / (1 + Pipeline stall cycles per instruction)

If there are no pipeline stalls, this leads to the intuitive result that pipelining can improve performance by the depth of the pipeline.
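As a minimal sketch of that relationship, assuming perfectly balanced stages and equal cycle times as in the derivation above, the speedup can be packaged as a small C helper (the function name is illustrative):

/* Ideal pipelined speedup when the unpipelined CPI equals the pipeline depth
   and the cycle times are equal: depth / (1 + stall cycles per instruction). */
double pipeline_speedup(double depth, double stalls_per_instr) {
    return depth / (1.0 + stalls_per_instr);
}
/* Example: a 5-stage pipeline with 0.25 stall cycles per instruction
   yields 5 / 1.25 = 4.0, rather than the ideal factor of 5. */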

Data Hazards

A major effect of pipelining is to change the relative timing of instructions by overlapping their execution. This overlap introduces data and control hazards. Data hazards occur when the pipeline changes the order of read/write accesses to operands so that the order differs from the order seen by sequentially executing instructions on an unpipelined processor. Assume instruction i occurs in program order before instruction j and both instructions use register x; then there are three different types of hazards that can occur between i and j:

1. Read After Write (RAW) hazard: the most common, these occur when a read of register x by instruction j occurs before the write of register x by instruction i. If this hazard were not prevented, instruction j would use the wrong value of x.

2. Write After Read (WAR) hazard: this hazard occurs when a read of register x by instruction i occurs after a write of register x by instruction j. In this case, instruction i would use the wrong value of x. WAR hazards are impossible in the simple five-stage integer pipeline, but they occur when instructions are reordered, as we will see when we discuss dynamically scheduled pipelines beginning on page C-65.

3. Write After Write (WAW) hazard: this hazard occurs when a write of register x by instruction i occurs after a write of register x by instruction j. When this occurs, register x will have the wrong value going forward. WAW hazards are also impossible in the simple five-stage integer pipeline, but they occur when instructions are reordered or when running times vary, as we will see later.

Chapter 3 explores the issues of data dependence and hazards in much more detail. For now, we focus only on RAW hazards.
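The three definitions can be made concrete with a few register comparisons. The C sketch below is illustrative only: it models an instruction by its destination and two source registers, treats register 0 as "no register" in the spirit of RISC-V's hardwired x0, and assumes instruction i precedes instruction j in program order.

#include <stdbool.h>

/* Simplified view of an instruction: one destination and two source registers.
   Register 0 stands for "no register," mirroring RISC-V's hardwired x0. */
typedef struct { int rd, rs1, rs2; } Instr;

static bool reads(Instr in, int reg) {
    return reg != 0 && (in.rs1 == reg || in.rs2 == reg);
}

/* Instruction i precedes instruction j in program order. */
bool raw_hazard(Instr i, Instr j) { return i.rd != 0 && reads(j, i.rd); } /* j reads what i writes   */
bool war_hazard(Instr i, Instr j) { return j.rd != 0 && reads(i, j.rd); } /* j writes what i reads   */
bool waw_hazard(Instr i, Instr j) { return i.rd != 0 && i.rd == j.rd;   } /* both write the same reg */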


Consider the pipelined execution of these instructions:

add x1,x2,x3
sub x4,x1,x5
and x6,x1,x7
or  x8,x1,x9
xor x10,x1,x11

All the instructions after the add use the result of the add instruction. As shown in Figure C.4, the add instruction writes the value of x1 in the WB pipe stage, but the sub instruction reads the value during its ID stage, which results in a RAW hazard. Unless precautions are taken to prevent it, the sub instruction will read the wrong value and try to use it. In fact, the value used by the sub instruction is not even deterministic: though we might think it logical to assume that sub would always use the value of x1 that was assigned by an instruction prior to add, this is not

Figure C.4 The use of the result of the add instruction in the next three instructions causes a hazard, because the register is not written until after those instructions read it.


always the case. If an interrupt should occur between the add and sub instructions, the WB stage of the add will complete, and the value of x1 at that point will be the result of the add. This unpredictable behavior is obviously unacceptable. The and instruction also creates a possible RAW hazard. As we can see from Figure C.4, the write of x1 does not complete until the end of clock cycle 5. Thus, the and instruction that reads the registers during clock cycle 4 will receive the wrong results. The xor instruction operates properly because its register read occurs in clock cycle 6, after the register write. The or instruction also operates without incurring a hazard because we perform the register file reads in the second half of the cycle and the writes in the first half. Note that the xor instruction still depends on the add, but it no longer creates a hazard; a topic we explore in more detail in Chapter 3. The next subsection discusses a technique to eliminate the stalls for the hazard involving the sub and and instructions.

Minimizing Data Hazard Stalls by Forwarding The problem posed in Figure C.4 can be solved with a simple hardware technique called forwarding (also called bypassing and sometimes short-circuiting). The key insight in forwarding is that the result is not really needed by the sub until after the add actually produces it. If the result can be moved from the pipeline register where the add stores it to where the sub needs it, then the need for a stall can be avoided. Using this observation, forwarding works as follows: 1. The ALU result from both the EX/MEM and MEM/WB pipeline registers is always fed back to the ALU inputs. 2. If the forwarding hardware detects that the previous ALU operation has written the register corresponding to a source for the current ALU operation, control logic selects the forwarded result as the ALU input rather than the value read from the register file. Notice that with forwarding, if the sub is stalled, the add will be completed and the bypass will not be activated. This relationship is also true for the case of an interrupt between the two instructions. As the example in Figure C.4 shows, we need to forward results not only from the immediately previous instruction but also possibly from an instruction that started two cycles earlier. Figure C.5 shows our example with the bypass paths in place and highlighting the timing of the register read and writes. This code sequence can be executed without stalls. Forwarding can be generalized to include passing a result directly to the functional unit that requires it: a result is forwarded from the pipeline register corresponding to the output of one unit to the input of another, rather than just from


Figure C.5 A set of instructions that depends on the add result uses forwarding paths to avoid the data hazard. The inputs for the sub and and instructions forward from the pipeline registers to the first ALU input. The or receives its result by forwarding through the register file, which is easily accomplished by reading the registers in the second half of the cycle and writing in the first half, as the dashed lines on the registers indicate. Notice that the forwarded result can go to either ALU input; in fact, both ALU inputs could use forwarded inputs from either the same pipeline register or from different pipeline registers. This would occur, for example, if the and instruction was and x6,x1,x4.

the result of a unit to the input of the same unit. Take, for example, the following sequence:

add x1,x2,x3
ld  x4,0(x1)
sd  x4,12(x1)

To prevent a stall in this sequence, we would need to forward the values of the ALU output and memory unit output from the pipeline registers to the ALU and data memory inputs. Figure C.6 shows all the forwarding paths for this example.


Figure C.6 Forwarding of operand required by stores during MEM. The result of the load is forwarded from the memory output to the memory input to be stored. In addition, the ALU output is forwarded to the ALU input for the address calculation of both the load and the store (this is no different than forwarding to another ALU operation). If the store depended on an immediately preceding ALU operation (not shown herein), the result would need to be forwarded to prevent a stall.
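In hardware, the forwarding decision for each ALU input reduces to comparing destination fields in the EX/MEM and MEM/WB pipeline registers against the source register being read in EX. The C sketch below expresses the usual priority (forward the newest value first) for the first ALU input; the field and function names are illustrative rather than taken from the text.

/* Select the source for the ALU's first operand in EX.
   Forward from EX/MEM if the immediately preceding instruction writes the
   register being read; otherwise from MEM/WB if the instruction two ahead does.
   Register 0 never triggers forwarding (it is hardwired to zero in RISC-V). */
enum fwd { FROM_REGFILE, FROM_EX_MEM, FROM_MEM_WB };

enum fwd forward_a(int id_ex_rs1,
                   int ex_mem_rd, int ex_mem_regwrite,
                   int mem_wb_rd, int mem_wb_regwrite) {
    if (ex_mem_regwrite && ex_mem_rd != 0 && ex_mem_rd == id_ex_rs1)
        return FROM_EX_MEM;                 /* newest value wins */
    if (mem_wb_regwrite && mem_wb_rd != 0 && mem_wb_rd == id_ex_rs1)
        return FROM_MEM_WB;
    return FROM_REGFILE;
}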

Data Hazards Requiring Stalls

Unfortunately, not all potential data hazards can be handled by bypassing. Consider the following sequence of instructions:

ld  x1,0(x2)
sub x4,x1,x5
and x6,x1,x7
or  x8,x1,x9

The pipelined data path with the bypass paths for this example is shown in Figure C.7. This case is different from the situation with back-to-back ALU operations. The ld instruction does not have the data until the end of clock cycle 4 (its MEM cycle), while the sub instruction needs to have the data by the beginning of that clock cycle. Thus, the data hazard from using the result of a load instruction cannot be completely eliminated with simple hardware. As Figure C.7 shows, such a forwarding path would have to operate backward in time—a capability not yet available to computer designers! We can forward the result immediately to the ALU from the pipeline registers for use in the and operation, which begins 2 clock cycles after the load. Likewise, the or instruction has no problem, because it receives the value through the register file. For the sub instruction, the forwarded


Figure C.7 The load instruction can bypass its results to the and and or instructions, but not to the sub, because that would mean forwarding the result in “negative time.”

For the sub instruction, the forwarded result arrives too late—at the end of a clock cycle, when it is needed at the beginning. The load instruction has a delay or latency that cannot be eliminated by forwarding alone. Instead, we need to add hardware, called a pipeline interlock, to preserve the correct execution pattern. In general, a pipeline interlock detects a hazard and stalls the pipeline until the hazard is cleared. In this case, the interlock stalls the pipeline, beginning with the instruction that wants to use the data until the source instruction produces it. This pipeline interlock introduces a stall or bubble, just as it did for the structural hazard. The CPI for the stalled instruction increases by the length of the stall (1 clock cycle in this case).

Figure C.8 shows the pipeline before and after the stall using the names of the pipeline stages. Because the stall causes the instructions starting with the sub to move one cycle later in time, the forwarding to the and instruction now goes through the register file, and no forwarding at all is needed for the or instruction. The insertion of the bubble causes the number of cycles to complete this sequence to grow by one. No instruction is started during clock cycle 4 (and none finishes during cycle 6).


ld  x1,0(x2)    IF  ID  EX  MEM  WB
sub x4,x1,x5        IF  ID  EX   MEM  WB
and x6,x1,x7            IF  ID   EX   MEM  WB
or  x8,x1,x9                IF   ID   EX   MEM  WB

ld  x1,0(x2)    IF  ID  EX     MEM    WB
sub x4,x1,x5        IF  ID     Stall  EX   MEM  WB
and x6,x1,x7            IF     Stall  ID   EX   MEM  WB
or  x8,x1,x9                   Stall  IF   ID   EX   MEM  WB

Figure C.8 In the top half, we can see why a stall is needed: the MEM cycle of the load produces a value that is needed in the EX cycle of the sub, which occurs at the same time. This problem is solved by inserting a stall, as shown in the bottom half.
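The interlock condition that triggers this stall is itself a small comparison. Here is a rough Python sketch of the check (the field names are illustrative assumptions, not actual hardware signal names): the pipeline must stall when the instruction in EX is a load whose destination matches a source register of the instruction being decoded.

# A minimal sketch of the load interlock described above: if the instruction
# in EX is a load and its destination matches a source of the instruction in
# ID, insert one bubble, because the loaded value is not available until the
# end of MEM.

def load_use_stall(id_ex, if_id_rs1, if_id_rs2):
    """id_ex models the ID/EX pipeline register for the instruction in EX."""
    return (id_ex["mem_read"] and                   # instruction in EX is a load
            id_ex["rd"] != 0 and
            id_ex["rd"] in (if_id_rs1, if_id_rs2))  # consumer needs it next cycle

# Example: ld x1,0(x2) followed immediately by sub x4,x1,x5 stalls one cycle.
assert load_use_stall({"mem_read": True, "rd": 1}, 1, 5)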

Branch Hazards

Control hazards can cause a greater performance loss for our RISC V pipeline than do data hazards. When a branch is executed, it may or may not change the PC to something other than its current value plus 4. Recall that if a branch changes the PC to its target address, it is a taken branch; if it falls through, it is not taken, or untaken. If instruction i is a taken branch, then the PC is usually not changed until the end of ID, after the completion of the address calculation and comparison.

Figure C.9 shows that the simplest method of dealing with branches is to redo the fetch of the instruction following a branch, once we detect the branch during ID (when instructions are decoded). The first IF cycle is essentially a stall, because it never performs useful work. You may have noticed that if the branch is untaken, then the repetition of the IF stage is unnecessary because the correct instruction was indeed fetched. We will develop several schemes to take advantage of this fact shortly.

One stall cycle for every branch will yield a performance loss of 10% to 30% depending on the branch frequency, so we will examine some techniques to deal with this loss.

Branch instruction       IF  ID  EX  MEM  WB
Branch successor             IF  IF  ID   EX   MEM  WB
Branch successor + 1                 IF   ID   EX   MEM
Branch successor + 2                      IF   ID   EX

Figure C.9 A branch causes a one-cycle stall in the five-stage pipeline. The instruction after the branch is fetched, but the instruction is ignored, and the fetch is restarted once the branch target is known. It is probably obvious that if the branch is not taken, the second IF for branch successor is redundant. This will be addressed shortly.


Reducing Pipeline Branch Penalties

There are many methods for dealing with the pipeline stalls caused by branch delay; we discuss four simple compile time schemes in this subsection. In these four schemes the actions for a branch are static—they are fixed for each branch during the entire execution. The software can try to minimize the branch penalty using knowledge of the hardware scheme and of branch behavior. We will then look at hardware-based schemes that dynamically predict branch behavior, and Chapter 3 looks at more powerful hardware techniques for dynamic branch prediction.

The simplest scheme to handle branches is to freeze or flush the pipeline, holding or deleting any instructions after the branch until the branch destination is known. The attractiveness of this solution lies primarily in its simplicity both for hardware and software. It is the solution used earlier in the pipeline shown in Figure C.9. In this case, the branch penalty is fixed and cannot be reduced by software.

A higher-performance, and only slightly more complex, scheme is to treat every branch as not taken, simply allowing the hardware to continue as if the branch were not executed. Here, care must be taken not to change the processor state until the branch outcome is definitely known. The complexity of this scheme arises from having to know when the state might be changed by an instruction and how to "back out" such a change. In the simple five-stage pipeline, this predicted-not-taken or predicted-untaken scheme is implemented by continuing to fetch instructions as if the branch were a normal instruction. The pipeline looks as if nothing out of the ordinary is happening. If the branch is taken, however, we need to turn the fetched instruction into a no-op and restart the fetch at the target address. Figure C.10 shows both situations.

An alternative scheme is to treat every branch as taken. As soon as the branch is decoded and the target address is computed, we assume the branch to be taken and begin fetching and executing at the target.

Untaken branch instruction   IF  ID  EX  MEM  WB
Instruction i + 1                IF  ID  EX   MEM  WB
Instruction i + 2                    IF  ID   EX   MEM  WB
Instruction i + 3                        IF   ID   EX   MEM  WB
Instruction i + 4                             IF   ID   EX   MEM  WB

Taken branch instruction     IF  ID    EX    MEM   WB
Instruction i + 1                IF    idle  idle  idle  idle
Branch target                          IF    ID    EX    MEM  WB
Branch target + 1                            IF    ID    EX   MEM  WB
Branch target + 2                                  IF    ID   EX   MEM  WB

Figure C.10 The predicted-not-taken scheme and the pipeline sequence when the branch is untaken (top) and taken (bottom). When the branch is untaken, determined during ID, we fetch the fall-through and just continue. If the branch is taken during ID, we restart the fetch at the branch target. This causes all instructions following the branch to stall 1 clock cycle.


This buys us a one-cycle improvement when the branch is actually taken, because we know the target address at the end of ID, one cycle before we know whether the branch condition is satisfied in the ALU stage. In either a predicted-taken or predicted-not-taken scheme, the compiler can improve performance by organizing the code so that the most frequent path matches the hardware's choice.

A fourth scheme, which was heavily used in early RISC processors, is called delayed branch. In a delayed branch, the execution cycle with a branch delay of one is

branch instruction
sequential successor1
branch target if taken

The sequential successor is in the branch delay slot. This instruction is executed whether or not the branch is taken. The pipeline behavior of the five-stage pipeline with a branch delay is shown in Figure C.11. Although it is possible to have a branch delay longer than one, in practice almost all processors with delayed branch have a single instruction delay; other techniques are used if the pipeline has a longer potential branch penalty. The job of the compiler is to make the successor instructions valid and useful. Although the delayed branch was useful for short simple pipelines at a time when hardware prediction was too expensive, the technique complicates implementation when there is dynamic branch prediction. For this reason, RISC V appropriately omitted delayed branches.

Untaken branch instruction         IF  ID  EX  MEM  WB
Branch delay instruction (i + 1)       IF  ID  EX   MEM  WB
Instruction i + 2                          IF  ID   EX   MEM  WB
Instruction i + 3                              IF   ID   EX   MEM  WB
Instruction i + 4                                   IF   ID   EX   MEM  WB

Taken branch instruction           IF  ID  EX  MEM  WB
Branch delay instruction (i + 1)       IF  ID  EX   MEM  WB
Branch target                              IF  ID   EX   MEM  WB
Branch target + 1                              IF   ID   EX   MEM  WB
Branch target + 2                                   IF   ID   EX   MEM  WB

Figure C.11 The behavior of a delayed branch is the same whether or not the branch is taken. The instructions in the delay slot (there was only one delay slot for most RISC architectures that incorporated them) are executed. If the branch is untaken, execution continues with the instruction after the branch delay instruction; if the branch is taken, execution continues at the branch target. When the instruction in the branch delay slot is also a branch, the meaning is unclear: if the branch is not taken, what should happen to the branch in the branch delay slot? Because of this confusion, architectures with delayed branches often disallow putting a branch in the delay slot.


Performance of Branch Schemes

What is the effective performance of each of these schemes? The effective pipeline speedup with branch penalties, assuming an ideal CPI of 1, is

Pipeline speedup = Pipeline depth / (1 + Pipeline stall cycles from branches)

Because of the following:

Pipeline stall cycles from branches = Branch frequency × Branch penalty

we obtain:

Pipeline speedup = Pipeline depth / (1 + Branch frequency × Branch penalty)

The branch frequency and branch penalty can have a component from both unconditional and conditional branches. However, the latter dominate because they are more frequent.

Example

For a deeper pipeline, such as that in a MIPS R4000 and later RISC processors, it takes at least three pipeline stages before the branch-target address is known and an additional cycle before the branch condition is evaluated, assuming no stalls on the registers in the conditional comparison. A three-stage delay leads to the branch penalties for the three simplest prediction schemes listed in Figure C.12. Find the effective addition to the CPI arising from branches for this pipeline, assuming the following frequencies:

Unconditional branch            4%
Conditional branch, untaken     6%
Conditional branch, taken      10%

Answer

We find the CPIs by multiplying the relative frequency of unconditional, conditional untaken, and conditional taken branches by the respective penalties. The results are shown in Figure C.13.

Branch scheme        Penalty unconditional    Penalty untaken    Penalty taken
Flush pipeline                 2                      3                3
Predicted taken                2                      3                2
Predicted untaken              2                      0                3

Figure C.12 Branch penalties for the three simplest prediction schemes for a deeper pipeline.


Additions to the CPI from branch costs

Branch scheme        Unconditional       Untaken conditional    Taken conditional     All branches
                     branches (4%)       branches (6%)          branches (10%)        (20%)
Stall pipeline           0.08                0.18                   0.30                 0.56
Predicted taken          0.08                0.18                   0.20                 0.46
Predicted untaken        0.08                0.00                   0.30                 0.38

Figure C.13 CPI penalties for three branch-prediction schemes and a deeper pipeline.

The differences among the schemes are substantially increased with this longer delay. If the base CPI were 1 and branches were the only source of stalls, the ideal pipeline would be 1.56 times faster than a pipeline that used the stall-pipeline scheme. The predicted-untaken scheme would be 1.13 times better than the stall-pipeline scheme under the same assumptions.
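The arithmetic behind these numbers is easy to check. The following short Python sketch simply multiplies the branch frequencies of the example by the penalties of Figure C.12; it reproduces the CPI additions of Figure C.13 and the speedups quoted above. It is a worked check of the example, not a general branch model.

# CPI addition = sum over branch types of (frequency x penalty)
freqs = {"unconditional": 0.04, "cond_untaken": 0.06, "cond_taken": 0.10}
penalties = {  # from Figure C.12 (the stall scheme is labeled "Flush pipeline" there)
    "Stall pipeline":    {"unconditional": 2, "cond_untaken": 3, "cond_taken": 3},
    "Predicted taken":   {"unconditional": 2, "cond_untaken": 3, "cond_taken": 2},
    "Predicted untaken": {"unconditional": 2, "cond_untaken": 0, "cond_taken": 3},
}

for scheme, pen in penalties.items():
    cpi_add = sum(freqs[kind] * pen[kind] for kind in freqs)
    print(f"{scheme:18s} CPI addition = {cpi_add:.2f}")   # 0.56, 0.46, 0.38

# With an ideal CPI of 1, speedups relative to the stall-pipeline scheme:
print(f"Ideal over stall-pipeline:    {(1 + 0.56) / 1:.2f}")           # 1.56
print(f"Predicted untaken over stall: {(1 + 0.56) / (1 + 0.38):.2f}")  # 1.13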

Reducing the Cost of Branches Through Prediction

As pipelines get deeper and the potential penalty of branches increases, using delayed branches and similar schemes becomes insufficient. Instead, we need to turn to more aggressive means for predicting branches. Such schemes fall into two classes: low-cost static schemes that rely on information available at compile time and strategies that predict branches dynamically based on program behavior. We discuss both approaches here.

Static Branch Prediction

A key way to improve compile-time branch prediction is to use profile information collected from earlier runs. The key observation that makes this worthwhile is that the behavior of branches is often bimodally distributed; that is, an individual branch is often highly biased toward taken or untaken. Figure C.14 shows the success of branch prediction using this strategy. The same input data were used for runs and for collecting the profile; other studies have shown that changing the input so that the profile is for a different run leads to only a small change in the accuracy of profile-based prediction.

The effectiveness of any branch prediction scheme depends both on the accuracy of the scheme and the frequency of conditional branches, which vary in SPEC from 3% to 24%. The fact that the misprediction rate for the integer programs is higher and such programs typically have a higher branch frequency is a major limitation for static branch prediction. In the next section, we consider dynamic branch predictors, which most recent processors have employed.



Figure C.14 Misprediction rate on SPEC92 for a profile-based predictor varies widely but is generally better for the floating-point programs, which have an average misprediction rate of 9% with a standard deviation of 4%, than for the integer programs, which have an average misprediction rate of 15% with a standard deviation of 5%. The actual performance depends on both the prediction accuracy and the branch frequency, which vary from 3% to 24%.

Dynamic Branch Prediction and Branch-Prediction Buffers

The simplest dynamic branch-prediction scheme is a branch-prediction buffer or branch history table. A branch-prediction buffer is a small memory indexed by the lower portion of the address of the branch instruction. The memory contains a bit that says whether the branch was recently taken or not. This scheme is the simplest sort of buffer; it has no tags and is useful only to reduce the branch delay when it is longer than the time to compute the possible target PCs.

With such a buffer, we don't know, in fact, if the prediction is correct—it may have been put there by another branch that has the same low-order address bits. But this doesn't matter. The prediction is a hint that is assumed to be correct, and fetching begins in the predicted direction. If the hint turns out to be wrong, the prediction bit is inverted and stored back. This buffer is effectively a cache where every access is a hit, and, as we will see, the performance of the buffer depends on both how often the prediction is for the branch of interest and how accurate the prediction is when it matches. Before we analyze the performance, it is useful to make a small, but important, improvement in the accuracy of the branch-prediction scheme.


This simple 1-bit prediction scheme has a performance shortcoming: even if a branch is almost always taken, we will likely predict incorrectly twice, rather than once, when it is not taken, because the misprediction causes the prediction bit to be flipped. To remedy this weakness, 2-bit prediction schemes are often used. In a 2-bit scheme, a prediction must miss twice before it is changed. Figure C.15 shows the finite-state processor for a 2-bit prediction scheme.

A branch-prediction buffer can be implemented as a small, special "cache" accessed with the instruction address during the IF pipe stage, or as a pair of bits attached to each block in the instruction cache and fetched with the instruction. If the instruction is decoded as a branch and if the branch is predicted as taken, fetching begins from the target as soon as the PC is known. Otherwise, sequential fetching and executing continue. As Figure C.15 shows, if the prediction turns out to be wrong, the prediction bits are changed.

What kind of accuracy can be expected from a branch-prediction buffer using 2 bits per entry on real applications? Figure C.16 shows that for the SPEC89 benchmarks a branch-prediction buffer with 4096 entries results in a prediction accuracy ranging from over 99% to 82%, or a misprediction rate of 1%–18%.

Figure C.15 is a state-transition diagram with four states—Predict taken (11), Predict taken (10), Predict not taken (01), and Predict not taken (00)—where each branch outcome of taken or not taken moves the predictor between states.

Figure C.15 The states in a 2-bit prediction scheme. By using 2 bits rather than 1, a branch that strongly favors taken or not taken—as many branches do—will be mispredicted less often than with a 1-bit predictor. The 2 bits are used to encode the four states in the system. The 2-bit scheme is actually a specialization of a more general scheme that has an n-bit saturating counter for each entry in the prediction buffer. With an n-bit counter, the counter can take on values between 0 and 2^n - 1: when the counter is greater than or equal to one-half of its maximum value (2^(n-1)), the branch is predicted as taken; otherwise, it is predicted as untaken. Studies of n-bit predictors have shown that the 2-bit predictors do almost as well, thus most systems rely on 2-bit branch predictors rather than the more general n-bit predictors.
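As a concrete illustration of such a buffer, the following Python sketch models a table of 2-bit saturating counters indexed by the low-order bits of the branch address, following the description in the caption above. The table size, initial counter value, and indexing choice are illustrative assumptions, not the parameters of any particular processor.

# A minimal simulation sketch of a branch-prediction buffer with 2-bit
# saturating counters: each entry counts from 0 to 3, and values of 2 or 3
# predict taken.

class TwoBitPredictor:
    def __init__(self, entries=4096):
        self.entries = entries
        self.table = [2] * entries          # start weakly predicting taken (assumption)

    def _index(self, pc):
        return (pc >> 2) % self.entries     # low-order bits of the branch address

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

# A loop-closing branch that is taken 9 times and then falls through is
# mispredicted only on the final iteration; a 1-bit scheme would also
# mispredict the first iteration of the next execution of the loop.
p = TwoBitPredictor()
mispredicts = 0
for taken in [True] * 9 + [False]:
    if p.predict(0x400) != taken:
        mispredicts += 1
    p.update(0x400, taken)
print(mispredicts)   # prints 1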

SPEC89 benchmark     Frequency of mispredictions
nasa7                 1%
matrix300             0%
tomcatv               1%
doduc                 5%
spice                 9%
fpppp                 9%
gcc                  12%
espresso              5%
eqntott              18%
li                   10%

Figure C.16 Prediction accuracy of a 4096-entry 2-bit prediction buffer for the SPEC89 benchmarks. The misprediction rate for the integer benchmarks (gcc, espresso, eqntott, and li) is substantially higher (average of 11%) than that for the floating-point programs (average of 4%). Omitting the floating-point kernels (nasa7, matrix300, and tomcatv) still yields a higher accuracy for the FP benchmarks than for the integer benchmarks. These data, as well as the rest of the data in this section, are taken from a branch-prediction study done using the IBM Power architecture and optimized code for that system. See Pan et al. (1992). Although these data are for an older version of a subset of the SPEC benchmarks, the newer benchmarks are larger and would show slightly worse behavior, especially for the integer benchmarks.

A 4K entry buffer, like that used for these results, is considered small in 2017, and a larger buffer could produce somewhat better results.

As we try to exploit more ILP, the accuracy of our branch prediction becomes critical. As we can see in Figure C.16, the accuracy of the predictors for integer programs, which typically also have higher branch frequencies, is lower than for the loop-intensive scientific programs. We can attack this problem in two ways: by increasing the size of the buffer and by increasing the accuracy of the scheme we use for each prediction. A buffer with 4K entries, however, as Figure C.17 shows, performs quite comparably to an infinite buffer, at least for benchmarks like those in SPEC. The data in Figure C.17 make it clear that the hit rate of the buffer is not the major limiting factor. As we mentioned, simply increasing the number of bits per predictor without changing the predictor structure also has little impact. Instead, we need to look at how we might increase the accuracy of each predictor, as we will in Chapter 3.



SPEC89 benchmark     4096 entries, 2 bits per entry     Unlimited entries, 2 bits per entry
nasa7                 1%                                  0%
matrix300             0%                                  0%
tomcatv               1%                                  0%
doduc                 5%                                  5%
spice                 9%                                  9%
fpppp                 9%                                  9%
gcc                  12%                                 11%
espresso              5%                                  5%
eqntott              18%                                 18%
li                   10%                                 10%

Figure C.17 Prediction accuracy of a 4096-entry 2-bit prediction buffer versus an infinite buffer for the SPEC89 benchmarks. Although these data are for an older version of a subset of the SPEC benchmarks, the results would be comparable for newer versions, with perhaps as many as 8K entries needed to match an infinite 2-bit predictor.

C.3

How Is Pipelining Implemented?

Before we proceed to basic pipelining, we need to review a simple implementation of an unpipelined version of RISC V.

A Simple Implementation of RISC V

In this section we follow the style of Section C.1, showing first a simple unpipelined implementation and then the pipelined implementation. This time, however, our example is specific to the RISC V architecture.


In this subsection, we focus on a pipeline for an integer subset of RISC V that consists of load-store word, branch equal, and integer ALU operations. Later in this appendix we will incorporate the basic floating-point operations. Although we discuss only a subset of RISC V, the basic principles can be extended to handle all the instructions; for example, adding store involves some additional computing of the immediate field. We initially use a less aggressive implementation of a branch instruction. We show how to implement the more aggressive version at the end of this section.

Every RISC V instruction can be implemented in, at most, 5 clock cycles. The 5 clock cycles are as follows:

1. Instruction fetch cycle (IF):

   IR ← Mem[PC];
   NPC ← PC + 4;

   Operation—Send out the PC and fetch the instruction from memory into the instruction register (IR); increment the PC by 4 to address the next sequential instruction. The IR is used to hold the instruction that will be needed on subsequent clock cycles; likewise, the register NPC is used to hold the next sequential PC.

2. Instruction decode/register fetch cycle (ID):

   A ← Regs[rs1];
   B ← Regs[rs2];
   Imm ← sign-extended immediate field of IR;

   Operation—Decode the instruction and access the register file to read the registers (rs1 and rs2 are the register specifiers). The outputs of the general-purpose registers are read into two temporary registers (A and B) for use in later clock cycles. The immediate field of the IR is also sign extended and stored into the temporary register Imm, for use in the next cycle. Decoding is done in parallel with reading registers, which is possible because these fields are at a fixed location in the RISC V instruction format. Because the immediate portion of a load and an ALU immediate is located in an identical place in every RISC V instruction, the sign-extended immediate is also calculated during this cycle in case it is needed in the next cycle. For stores, a separate sign-extension is needed, because the immediate field is split in two pieces.

3. Execution/effective address cycle (EX):

   The ALU operates on the operands prepared in the prior cycle, performing one of four functions depending on the RISC V instruction type:

   ■ Memory reference:

     ALUOutput ← A + Imm;

     Operation—The ALU adds the operands to form the effective address and places the result into the register ALUOutput.




   ■ Register-register ALU instruction:

     ALUOutput ← A func B;

     Operation—The ALU performs the operation specified by the function code (a combination of the func3 and func7 fields) on the value in register A and on the value in register B. The result is placed in the temporary register ALUOutput.

   ■ Register-Immediate ALU instruction:

     ALUOutput ← A op Imm;

     Operation—The ALU performs the operation specified by the opcode on the value in register A and on the value in register Imm. The result is placed in the temporary register ALUOutput.

   ■ Branch:

     ALUOutput ← NPC + (Imm << 1);
     Cond ← (A == B);

     Operation—The ALU adds the NPC to the sign-extended, shifted immediate to compute the address of the branch target, and the two registers read in the prior cycle are compared to determine whether the branch is taken.

Updated Exercises by Diana Franklin

C.1

Use the following code fragment:

Loop: ld   x1,0(x2)    ;load x1 from address 0+x2
      addi x1,x1,1     ;x1=x1+1
      sd   x1,0(x2)    ;store x1 at address 0+x2
      addi x2,x2,4     ;x2=x2+4
      sub  x4,x3,x2    ;x4=x3-x2
      bnez x4,Loop     ;branch to Loop if x4!=0

Assume that the initial value of x3 is x2 + 396.

a. [15] < C.2 > Data hazards are caused by data dependences in the code. Whether a dependency causes a hazard depends on the machine implementation (i.e., number of pipeline stages). List all of the data dependences in the code above. Record the register, source instruction, and destination instruction; for example, there is a data dependency for register x1 from the ld to the addi.

b. [15] < C.2 > Show the timing of this instruction sequence for the 5-stage RISC pipeline without any forwarding or bypassing hardware but assuming that a register read and a write in the same clock cycle "forwards" through the register file, as between the add and or shown in Figure C.5. Use a pipeline timing chart like that in Figure C.8. Assume that the branch is handled by flushing the pipeline. If all memory references take 1 cycle, how many cycles does this loop take to execute?


c. [15] < C.2 > Show the timing of this instruction sequence for the 5-stage RISC pipeline with full forwarding and bypassing hardware. Use a pipeline timing chart like that shown in Figure C.8. Assume that the branch is handled by predicting it as not taken. If all memory references take 1 cycle, how many cycles does this loop take to execute?

d. [15] < C.2 > Show the timing of this instruction sequence for the 5-stage RISC pipeline with full forwarding and bypassing hardware, as shown in Figure C.6. Use a pipeline timing chart like that shown in Figure C.8. Assume that the branch is handled by predicting it as taken. If all memory references take 1 cycle, how many cycles does this loop take to execute?

e. [25] < C.2 > High-performance processors have very deep pipelines—more than 15 stages. Imagine that you have a 10-stage pipeline in which every stage of the 5-stage pipeline has been split in two. The only catch is that, for data forwarding, data are forwarded from the end of a pair of stages to the beginning of the two stages where they are needed. For example, data are forwarded from the output of the second execute stage to the input of the first execute stage, still causing a 1-cycle delay. Show the timing of this instruction sequence for the 10-stage RISC pipeline with full forwarding and bypassing hardware. Use a pipeline timing chart like that shown in Figure C.8 (but with stages labeled IF1, IF2, ID1, etc.). Assume that the branch is handled by predicting it as taken. If all memory references take 1 cycle, how many cycles does this loop take to execute?

f. [10] < C.2 > Assume that in the 5-stage pipeline, the longest stage requires 0.8 ns, and the pipeline register delay is 0.1 ns. What is the clock cycle time of the 5-stage pipeline? If the 10-stage pipeline splits all stages in half, what is the cycle time of the 10-stage machine?

g. [15] < C.2 > Using your answers from parts (d) and (e), determine the cycles per instruction (CPI) for the loop on a 5-stage pipeline and a 10-stage pipeline. Make sure you count only from when the first instruction reaches the write-back stage to the end. Do not count the start-up of the first instruction. Using the clock cycle time calculated in part (f), calculate the average instruction execute time for each machine.

C.2

[15/15] < C.2 > Suppose the branch frequencies (as percentages of all instructions) are as follows:

Conditional branches          15%
Jumps and calls                1%
Taken conditional branches    60% are taken


a. [15] < C.2 > We are examining a four-stage pipeline where the branch is resolved at the end of the second cycle for unconditional branches and at the end of the third cycle for conditional branches. Assuming that only the first pipe stage can always be completed independent of whether the branch is taken and ignoring other pipeline stalls, how much faster would the machine be without any branch hazards?

b. [15] < C.2 > Now assume a high-performance processor in which we have a 15-deep pipeline where the branch is resolved at the end of the fifth cycle for unconditional branches and at the end of the tenth cycle for conditional branches. Assuming that only the first pipe stage can always be completed independent of whether the branch is taken and ignoring other pipeline stalls, how much faster would the machine be without any branch hazards?

C.3

[5/15/10/10] < C.2 > We begin with a computer implemented in single-cycle implementation. When the stages are split by functionality, the stages do not require exactly the same amount of time. The original machine had a clock cycle time of 7 ns. After the stages were split, the measured times were IF, 1 ns; ID, 1.5 ns; EX, 1 ns; MEM, 2 ns; and WB, 1.5 ns. The pipeline register delay is 0.1 ns.

a. [5] < C.2 > What is the clock cycle time of the 5-stage pipelined machine?

b. [15] < C.2 > If there is a stall every four instructions, what is the CPI of the new machine?

c. [10] < C.2 > What is the speedup of the pipelined machine over the single-cycle machine?

d. [10] < C.2 > If the pipelined machine had an infinite number of stages, what would its speedup be over the single-cycle machine?

C.4

[15] < C.1, C.2 > A reduced hardware implementation of the classic five-stage RISC pipeline might use the EX stage hardware to perform a branch instruction comparison and then not actually deliver the branch target PC to the IF stage until the clock cycle in which the branch instruction reaches the MEM stage. Control hazard stalls can be reduced by resolving branch instructions in ID, but improving performance in one respect may reduce performance in other circumstances. Write a small snippet of code in which calculating the branch in the ID stage causes a data hazard, even with data forwarding.

C.5

[12/13/20/20/15/15] < C.2, C.3 > For these problems, we will explore a pipeline for a register-memory architecture. The architecture has two instruction formats: a register-register format and a register-memory format. There is a single memory addressing mode (offset + base register). There is a set of ALU operations with the format:

ALUop Rdest, Rsrc1, Rsrc2

or

ALUop Rdest, Rsrc1, MEM


where the ALUop is one of the following: add, subtract, AND, OR, load (Rsrc1 ignored), or store. Rsrc or Rdest are registers. MEM is a base register and offset pair. Branches use a full compare of two registers and are PC relative. Assume that this machine is pipelined so that a new instruction is started every clock cycle. The pipeline structure, similar to that used in the VAX 8700 micropipeline (Clark, 1987), is

IF  RF  ALU1  MEM   ALU2  WB
    IF  RF    ALU1  MEM   ALU2  WB
        IF    RF    ALU1  MEM   ALU2  WB
              IF    RF    ALU1  MEM   ALU2  WB
                    IF    RF    ALU1  MEM   ALU2  WB
                          IF    RF    ALU1  MEM   ALU2  WB

The first ALU stage is used for effective address calculation for memory references and branches. The second ALU cycle is used for operations and branch comparison. RF is both a decode and register-fetch cycle. Assume that when a register read and a register write of the same register occur in the same clock, the write data are forwarded.

a. [12] < C.2 > Find the number of adders needed, counting any adder or incrementer; show a combination of instructions and pipe stages that justify this answer. You need only give one combination that maximizes the adder count.

b. [13] < C.2 > Find the number of register read and write ports and memory read and write ports required. Show that your answer is correct by showing a combination of instructions and pipeline stage indicating the instruction and the number of read ports and write ports required for that instruction.

c. [20] < C.3 > Determine any data forwarding for any ALUs that will be needed. Assume that there are separate ALUs for the ALU1 and ALU2 pipe stages. Put in all forwarding among ALUs necessary to avoid or reduce stalls. Show the relationship between the two instructions involved in forwarding using the format of the table in Figure C.23 but ignoring the last two columns. Be careful to consider forwarding across an intervening instruction—for example,

add x1, ...
any instruction
add ..., x1, ...

d. [20] < C.3 > Show all of the data forwarding requirements necessary to avoid or reduce stalls when either the source or destination unit is not an ALU. Use the same format as in Figure C.23, again ignoring the last two columns. Remember to forward to and from memory references.


e. [15] < C.3 > Show all the remaining hazards that involve at least one unit other than an ALU as the source or destination unit. Use a table like that shown in Figure C.25, but replace the last column with the lengths of the hazards.

f. [15] < C.2 > Show all control hazards by example and state the length of the stall. Use a format like that shown in Figure C.11, labeling each example.

C.6

[12/13/13/15/15] < C.1, C.2, C.3 > We will now add support for register-memory ALU operations to the classic five-stage RISC pipeline. To offset this increase in complexity, all memory addressing will be restricted to register indirect (i.e., all addresses are simply a value held in a register; no offset or displacement may be added to the register value). For example, the register-memory instruction add x4, x5, (x1) means add the contents of register x5 to the contents of the memory location with address equal to the value in register x1 and put the sum in register x4. Register-register ALU operations are unchanged. The following items apply to the integer RISC pipeline:

a. [12] < C.1 > List a rearranged order of the five traditional stages of the RISC pipeline that will support register-memory operations implemented exclusively by register indirect addressing.

b. [13] < C.2, C.3 > Describe what new forwarding paths are needed for the rearranged pipeline by stating the source, destination, and information transferred on each needed new path.

c. [13] < C.2, C.3 > For the reordered stages of the RISC pipeline, what new data hazards are created by this addressing mode? Give an instruction sequence illustrating each new hazard.

d. [15] < C.3 > List all of the ways that the RISC pipeline with register-memory ALU operations can have a different instruction count for a given program than the original RISC pipeline. Give a pair of specific instruction sequences, one for the original pipeline and one for the rearranged pipeline, to illustrate each way.

e. [15] < C.3 > Assume that all instructions take 1 clock cycle per stage. List all of the ways that the register-memory RISC V can have a different CPI for a given program as compared to the original RISC V pipeline.

C.7

[10/10] < C.3 > In this problem, we will explore how deepening the pipeline affects performance in two ways: faster clock cycle and increased stalls due to data and control hazards. Assume that the original machine is a 5-stage pipeline with a 1 ns clock cycle. The second machine is a 12-stage pipeline with a 0.6 ns clock cycle. The 5-stage pipeline experiences a stall due to a data hazard every five instructions, whereas the 12-stage pipeline experiences three stalls every eight instructions. In addition, branches constitute 20% of the instructions, and the misprediction rate for both machines is 5%.

a. [10] < C.3 > What is the speedup of the 12-stage pipeline over the 5-stage pipeline, taking into account only data hazards?

b. [10] < C.3 > If the branch mispredict penalty for the first machine is 2 cycles but the second machine is 5 cycles, what are the CPIs of each, taking into account the stalls due to branch mispredictions?

C.8

[15] < C.5 > Construct a table like that shown in Figure C.21 to check for WAW stalls in the RISC V FP pipeline of Figure C.30. Do not consider FP divides.

C.9

[20/22/22] < C.4, C.6 > In this exercise, we will look at how a common vector loop runs on statically and dynamically scheduled versions of the RISC V pipeline. The loop is the so-called DAXPY loop (discussed extensively in Appendix G) and the central operation in Gaussian elimination. The loop implements the vector operation Y = a*X + Y for a vector of length 100. Here is the RISC V code for the loop:

foo: fld    f2, 0(x1)      ; load X(i)
     fmul.d f4, f2, f0     ; multiply a*X(i)
     fld    f6, 0(x2)      ; load Y(i)
     fadd.d f6, f4, f6     ; add a*X(i) + Y(i)
     fsd    f6, 0(x2)      ; store Y(i)
     addi   x1, x1, 8      ; increment X index
     addi   x2, x2, 8      ; increment Y index
     sltiu  x3, x1, done   ; test if done
     bnez   x3, foo        ; loop if not done

For parts (a) to (c), assume that integer operations issue and complete in 1 clock cycle (including loads) and that their results are fully bypassed. You will use the FP latencies (only) shown in Figure C.29, but assume that the FP unit is fully pipelined. For scoreboards below, assume that an instruction waiting for a result from another function unit can pass through read operands at the same time the result is written. Also assume that an instruction in WB completing will allow a currently active instruction that is waiting on the same functional unit to issue in the same clock cycle in which the first instruction completes WB.

a. [20] < C.5 > For this problem, use the RISC V pipeline of Section C.5 with the pipeline latencies from Figure C.29, but a fully pipelined FP unit, so the initiation interval is 1. Draw a timing diagram, similar to Figure C.32, showing the timing of each instruction's execution. How many clock cycles does each loop iteration take, counting from when the first instruction enters the WB stage to when the last instruction enters the WB stage?

b. [20] < C.8 > Perform static instruction reordering to reorder the instructions to minimize the stalls for this loop, renaming registers where necessary. Use all the same assumptions as in (a). Draw a timing diagram, similar to Figure C.32, showing the timing of each instruction's execution. How many clock cycles does each loop iteration take, counting from when the first instruction enters the WB stage to when the last instruction enters the WB stage?


c. [20] < C.8 > Using the original code above, consider how the instructions would have executed using scoreboarding, a form of dynamic scheduling. Draw a timing diagram, similar to Figure C.32, showing the timing of the instructions through stages IF, IS (issue), RO (read operands), EX (execution), and WR (write result). How many clock cycles does each loop iteration take, counting from when the first instruction enters the WB stage to when the last instruction enters the WB stage?

C.10

[25] < C.8 > It is critical that the scoreboard be able to distinguish RAW and WAR hazards, because a WAR hazard requires stalling the instruction doing the writing until the instruction reading an operand initiates execution, but a RAW hazard requires delaying the reading instruction until the writing instruction finishes—just the opposite. For example, consider the sequence:

fmul.d f0,f6,f4
fsub.d f8,f0,f2
fadd.d f2,f10,f2

The fsub.d depends on the fmul.d (a RAW hazard), thus the fmul.d must be allowed to complete before the fsub.d. If the fmul.d were stalled for the fsub.d due to the inability to distinguish between RAW and WAR hazards, the processor will deadlock. This sequence contains a WAR hazard between the fadd.d and the fsub.d, and the fadd.d cannot be allowed to complete until the fsub.d begins execution. The difficulty lies in distinguishing the RAW hazard between fmul.d and fsub.d, and the WAR hazard between the fsub.d and fadd.d.

To see just why the three-instruction scenario is important, trace the handling of each instruction stage by stage through issue, read operands, execute, and write result. Assume that each scoreboard stage other than execute takes 1 clock cycle. Assume that the fmul.d instruction requires 3 clock cycles to execute and that the fsub.d and fadd.d instructions each take 1 cycle to execute. Finally, assume that the processor has two multiply function units and two add function units. Present the trace as follows.

1. Make a table with the column headings Instruction, Issue, Read Operands, Execute, Write Result, and Comment. In the first column, list the instructions in program order (be generous with space between instructions; larger table cells will better hold the results of your analysis). Start the table by writing a 1 in the Issue column of the fmul.d instruction row to show that fmul.d completes the issue stage in clock cycle 1. Now, fill in the stage columns of the table through the cycle at which the scoreboard first stalls an instruction.

2. For a stalled instruction write the words "waiting at clock cycle X," where X is the number of the current clock cycle, in the appropriate table column to show that the scoreboard is resolving an RAW or WAR hazard by stalling that stage. In the Comment column, state what type of hazard and what dependent instruction is causing the wait.

3. Adding the words "completes with clock cycle Y" to a "waiting" table entry, fill in the rest of the table through the time when all instructions are complete. For an instruction that stalled, add a description in the Comments column telling why the wait ended when it did and how deadlock was avoided. (Hint: Think about how WAW hazards are prevented and what this implies about active instruction sequences.) Note the completion order of the three instructions as compared to their program order.

C.11

[10/10/10] < C.5 > For this problem, you will create a series of small snippets that illustrate the issues that arise when using functional units with different latencies. For each one, draw a timing diagram similar to Figure C.32 that illustrates each concept, and clearly indicate the problem.

a. [10] < C.5 > Demonstrate, using code different from that used in Figure C.32, the structural hazard of having the hardware for only one MEM and WB stage.

b. [10] < C.5 > Demonstrate a WAW hazard requiring a stall.

D.1  Introduction  D-2
D.2  Advanced Topics in Disk Storage  D-2
D.3  Definition and Examples of Real Faults and Failures  D-10
D.4  I/O Performance, Reliability Measures, and Benchmarks  D-15
D.5  A Little Queuing Theory  D-23
D.6  Crosscutting Issues  D-34
D.7  Designing and Evaluating an I/O System—The Internet Archive Cluster  D-36
D.8  Putting It All Together: NetApp FAS6000 Filer  D-41
D.9  Fallacies and Pitfalls  D-43
D.10 Concluding Remarks  D-47
D.11 Historical Perspective and References  D-48
     Case Studies with Exercises by Andrea C. Arpaci-Dusseau and Remzi H. Arpaci-Dusseau  D-48

D Storage Systems

I think Silicon Valley was misnamed. If you look back at the dollars shipped in products in the last decade, there has been more revenue from magnetic disks than from silicon. They ought to rename the place Iron Oxide Valley.

Al Hoagland
A pioneer of magnetic disks (1982)

Combining bandwidth and storage … enables swift and reliable access to the ever expanding troves of content on the proliferating disks and … repositories of the Internet … the capacity of storage arrays of all kinds is rocketing ahead of the advance of computer performance.

George Gilder
"The End Is Drawing Nigh," Forbes ASAP (April 4, 2000)


D.1

Introduction

The popularity of Internet services such as search engines and auctions has enhanced the importance of I/O for computers, since no one would want a desktop computer that couldn't access the Internet. This rise in importance of I/O is reflected by the names of our times. The 1960s to 1980s were called the Computing Revolution; the period since 1990 has been called the Information Age, with concerns focused on advances in information technology versus raw computational power. Internet services depend upon massive storage, which is the focus of this chapter, and networking, which is the focus of Appendix F.

This shift in focus from computation to communication and storage of information emphasizes reliability and scalability as well as cost-performance. Although it is frustrating when a program crashes, people become hysterical if they lose their data; hence, storage systems are typically held to a higher standard of dependability than the rest of the computer. Dependability is the bedrock of storage, yet it also has its own rich performance theory—queuing theory—that balances throughput versus response time. The software that determines which processor features get used is the compiler, but the operating system usurps that role for storage. Thus, storage has a different, multifaceted culture from processors, yet it is still found within the architecture tent.

We start our exploration with advances in magnetic disks, as they are the dominant storage device today in desktop and server computers. We assume that readers are already familiar with the basics of storage devices, some of which were covered in Chapter 1.

D.2

Advanced Topics in Disk Storage

The disk industry historically has concentrated on improving the capacity of disks. Improvement in capacity is customarily expressed as improvement in areal density, measured in bits per square inch:

Areal density = Tracks per inch on a disk surface × Bits per inch on a track

Through about 1988, the rate of improvement of areal density was 29% per year, thus doubling density every 3 years. Between then and about 1996, the rate improved to 60% per year, quadrupling density every 3 years and matching the traditional rate of DRAMs. From 1997 to about 2003, the rate increased to 100%, doubling every year. After the innovations that allowed this renaissance had largely played out, the rate has dropped recently to about 30% per year. In 2011, the highest density in commercial products is 400 billion bits per square inch. Cost per gigabyte has dropped at least as fast as areal density has increased, with smaller diameter drives playing the larger role in this improvement. Costs per gigabyte improved by almost a factor of 1,000,000 between 1983 and 2011.
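The quoted growth rates compound in the usual way. The following short Python check, assuming simple compound annual improvement, confirms that 29% per year roughly doubles density every 3 years, 60% per year roughly quadruples it every 3 years, and 100% per year doubles it every year:

# Compound growth check for the areal-density rates quoted above.
for rate, years in [(0.29, 3), (0.60, 3), (1.00, 1)]:
    print(f"{rate:.0%}/year over {years} year(s) -> x{(1 + rate) ** years:.1f}")
# 29%/year over 3 year(s) -> x2.1
# 60%/year over 3 year(s) -> x4.1
# 100%/year over 1 year(s) -> x2.0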


Magnetic disks have been challenged many times for supremacy of secondary storage. Figure D.1 shows one reason: the fabled access time gap between disks and DRAM. DRAM latency is about 100,000 times less than disk, and that performance advantage costs 30 to 150 times more per gigabyte for DRAM.

The bandwidth gap is more complex. For example, a fast disk in 2011 transfers at 200 MB/sec from the disk media with 600 GB of storage and costs about $400. A 4 GB DRAM module costing about $200 in 2011 could transfer at 16,000 MB/sec (see Chapter 2), giving the DRAM module about 80 times higher bandwidth than the disk. However, the bandwidth per GB is 6000 times higher for DRAM, and the bandwidth per dollar is 160 times higher.

Many have tried to invent a technology cheaper than DRAM but faster than disk to fill that gap, but thus far all have failed. Challengers have never had a product to market at the right time. By the time a new product ships, DRAMs and disks have made advances as predicted earlier, costs have dropped accordingly, and the challenging product is immediately obsolete.

The closest challenger is Flash memory. This semiconductor memory is nonvolatile like disks, and it has about the same bandwidth as disks, but latency is 100 to 1000 times faster than disk. In 2011, the price per gigabyte of Flash was 15 to 20 times cheaper than DRAM. Flash is popular in cell phones because it comes in much smaller capacities and it is more power efficient than disks, despite the cost per gigabyte being 15 to 25 times higher than disks.


Figure D.1 Cost versus access time for DRAM and magnetic disk in 1980, 1985, 1990, 1995, 2000, and 2005. The two-order-of-magnitude gap in cost and five-order-of-magnitude gap in access times between semiconductor memory and rotating magnetic disks have inspired a host of competing technologies to try to fill them. So far, such attempts have been made obsolete before production by improvements in magnetic disks, DRAMs, or both. Note that between 1990 and 2005 the cost per gigabyte of DRAM chips made less improvement, while disk cost made dramatic improvement.




Unlike disks and DRAM, Flash memory bits wear out—typically limited to 1 million writes—and so they are not popular in desktop and server computers.

While disks will remain viable for the foreseeable future, the conventional sector-track-cylinder model did not. The assumptions of the model are that nearby blocks are on the same track, blocks in the same cylinder take less time to access since there is no seek time, and some tracks are closer than others.

First, disks started offering higher-level intelligent interfaces, like ATA and SCSI, when they included a microprocessor inside a disk. To speed up sequential transfers, these higher-level interfaces organize disks more like tapes than like random access devices. The logical blocks are ordered in serpentine fashion across a single surface, trying to capture all the sectors that are recorded at the same bit density. (Disks vary the recording density since it is hard for the electronics to keep up with the blocks spinning much faster on the outer tracks, and lowering linear density simplifies the task.) Hence, sequential blocks may be on different tracks. We will see later in Figure D.22 on page D-45 an illustration of the fallacy of assuming the conventional sector-track model when working with modern disks.

Second, shortly after the microprocessors appeared inside disks, the disks included buffers to hold the data until the computer was ready to accept it, and later caches to avoid read accesses. They were joined by a command queue that allowed the disk to decide in what order to perform the commands to maximize performance while maintaining correct behavior. Figure D.2 shows how a queue depth of 50 can double the number of I/Os per second of random I/Os due to better scheduling of accesses. Although it's unlikely that a system would really have 256 commands in a queue, it would triple the number of I/Os per second. Given buffers, caches, and out-of-order accesses, an accurate performance model of a real disk is much more complicated than sector-track-cylinder.


Figure D.2 Throughput versus command queue depth using random 512-byte reads. The disk performs 170 reads per second starting at no command queue and doubles performance at 50 and triples at 256 [Anderson 2003].


Finally, the number of platters shrank from 12 in the past to 4 or even 1 today, so the cylinder has less importance than before because the percentage of data in a cylinder is much less.

Disk Power

Power is an increasing concern for disks as well as for processors. A typical ATA disk in 2011 might use 9 watts when idle, 11 watts when reading or writing, and 13 watts when seeking. Because it is more efficient to spin smaller mass, smaller-diameter disks can save power. One formula that indicates the importance of rotation speed and the size of the platters for the power consumed by the disk motor is the following [Gurumurthi et al. 2005]:

Power ∝ Diameter^4.6 × RPM^2.8 × Number of platters

                          SATA           SAS
Price                     $85            $400
Capacity (GB)             2000           600
Platters                  4              4
RPM                       5900           15,000
Diameter (inches)         3.7            2.6
Average seek (ms)         16             3–4
Power (watts)             12             16
I/O/sec                   47             285
Disk BW (MB/sec)          45–95          122–204
Buffer BW (MB/sec)        300            750
Buffer size (MB)          32             16
MTTF (hrs)                0.6 M          1.6 M

Thus, smaller platters, slower rotation, and fewer platters all help reduce disk motor power, and most of the power is in the motor. Figure D.3 shows the specifications of two 3.5-inch disks in 2011. The Serial ATA (SATA) disks shoot for high capacity and the best cost per gigabyte, so the 2000 GB drives cost less than $0.05 per gigabyte. They use the widest platters that fit the form factor and use four or five of them, but they spin at 5900 RPM and seek relatively slowly to allow a higher areal density and to lower power. The corresponding Serial Attach SCSI (SAS) drive aims at performance, so it spins at 15,000 RPM and seeks much faster. It uses a lower areal density to spin at that high rate. To reduce power, the platter is much narrower than the form factor. This combination reduces capacity of the SAS drive to 600 GB. The cost per gigabyte is about a factor of five better for the SATA drives, and, conversely, the cost per I/O per second or MB transferred per second is about a factor of five better for the SAS drives. Despite using smaller platters and many fewer of them, the SAS disks use twice the power of the SATA drives, due to the much faster RPM and seeks.

Figure D.3 Serial ATA (SATA) versus Serial Attach SCSI (SAS) drives in 3.5-inch form factor in 2011. The I/Os per second were calculated using the average seek plus the time for one-half rotation plus the time to transfer one sector of 512 bytes.
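The motor-power relation above can be used to compare the two drives of Figure D.3. The following Python sketch is a back-of-the-envelope calculation under the stated drive parameters; because the relation is only a proportionality, only the ratios between configurations are meaningful.

# Relative motor power from the scaling relation:
# Power ~ Diameter^4.6 x RPM^2.8 x Number of platters
def relative_motor_power(diameter_in, rpm, platters):
    return (diameter_in ** 4.6) * (rpm ** 2.8) * platters

sata = relative_motor_power(3.7, 5900, 4)       # wide platters, slow spindle
sas = relative_motor_power(2.6, 15000, 4)       # narrow platters, fast spindle
sas_wide = relative_motor_power(3.7, 15000, 4)  # hypothetical full-width 15K drive

print(f"SAS/SATA motor power ratio:                  {sas / sata:.1f}")      # ~2.7
print(f"Ratio if the 15K drive kept 3.7-in platters: {sas_wide / sata:.1f}") # ~13.6

The comparison shows why the high-RPM drive uses platters much narrower than its form factor: shrinking the diameter offsets most of the steep RPM^2.8 increase in motor power.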


Advanced Topics in Disk Arrays

An innovation that improves both dependability and performance of storage systems is disk arrays. One argument for arrays is that potential throughput can be increased by having many disk drives and, hence, many disk arms, rather than fewer large drives. Simply spreading data over multiple disks, called striping, automatically forces accesses to several disks if the data files are large. (Although arrays improve throughput, latency is not necessarily improved.) As we saw in Chapter 1, the drawback is that with more devices, dependability decreases: N devices generally have 1/N the reliability of a single device.

Although a disk array would have more faults than a smaller number of larger disks when each disk has the same reliability, dependability is improved by adding redundant disks to the array to tolerate faults. That is, if a single disk fails, the lost information is reconstructed from redundant information. The only danger is in having another disk fail during the mean time to repair (MTTR). Since the mean time to failure (MTTF) of disks is tens of years, and the MTTR is measured in hours, redundancy can make the measured reliability of many disks much higher than that of a single disk.

Such redundant disk arrays have become known by the acronym RAID, which originally stood for redundant array of inexpensive disks, although some prefer the word independent for I in the acronym. The ability to recover from failures plus the higher throughput, measured as either megabytes per second or I/Os per second, make RAID attractive. When combined with the advantages of smaller size and lower power of small-diameter drives, RAIDs now dominate large-scale storage systems.

Figure D.4 summarizes the five standard RAID levels, showing how eight disks of user data must be supplemented by redundant or check disks at each RAID level, and it lists the pros and cons of each level. The standard RAID levels are well documented, so we will just do a quick review here and discuss advanced levels in more depth.

■ RAID 0—It has no redundancy and is sometimes nicknamed JBOD, for just a bunch of disks, although the data may be striped across the disks in the array. This level is generally included to act as a measuring stick for the other RAID levels in terms of cost, performance, and dependability.

■ RAID 1—Also called mirroring or shadowing, there are two copies of every piece of data. It is the simplest and oldest disk redundancy scheme, but it also has the highest cost. Some array controllers will optimize read performance by allowing the mirrored disks to act independently for reads, but this optimization means it may take longer for the mirrored writes to complete.

■ RAID 2—This organization was inspired by applying memory-style error-correcting codes (ECCs) to disks. It was included because there was such a disk array product at the time of the original RAID paper, but none since then as other RAID organizations are more attractive.

■ RAID 3—Since the higher-level disk interfaces understand the health of a disk, it's easy to figure out which disk failed.

D.2

Advanced Topics in Disk Storage



D-7

RAID level

Disk failures tolerated, check space overhead for 8 data disks

Pros

Cons

0

Nonredundant striped

0 failures, 0 check disks

No space overhead

No protection

Widely used

1

Mirrored

1 failure, 8 check disks

No parity calculation; fast recovery; small writes faster than higher RAIDs; fast reads

Highest check storage overhead

EMC, HP (Tandem), IBM

2

Memory-style ECC

1 failure, 4 check disks

Doesn’t rely on failed disk to self-diagnose

 Log 2 check storage overhead

Not used

3

Bit-interleaved parity

1 failure, 1 check disk

Low check overhead; high bandwidth for large reads or writes

No support for small, random reads or writes

Storage Concepts

4

Blockinterleaved parity

1 failure, 1 check disk

Low check overhead; more bandwidth for small reads

Parity disk is small write bottleneck

Network Appliance

5

Blockinterleaved distributed parity

1 failure, 1 check disk

Low check overhead; more bandwidth for small reads and writes

Small writes ! 4 disk accesses

Widely used

6

Row-diagonal parity, EVENODD

2 failures, 2 check disks

Protects against 2 disk failures

Small writes ! 6 disk accesses; 2  check overhead

Network Appliance

Company products

Figure D.4 RAID levels, their fault tolerance, and their overhead in redundant disks. The paper that introduced the term RAID [Patterson, Gibson, and Katz 1987] used a numerical classification that has become popular. In fact, the nonredundant disk array is often called RAID 0, indicating that the data are striped across several disks but without redundancy. Note that mirroring (RAID 1) in this instance can survive up to eight disk failures provided only one disk of each mirrored pair fails; worst case is both disks in a mirrored pair fail. In 2011, there may be no commercial implementations of RAID 2; the rest are found in a wide range of products. RAID 0 + 1, 1 + 0, 01, 10, and 6 are discussed in the text.

contains the parity of the information in the data disks, a single disk allows recovery from a disk failure. The data are organized in stripes, with N data blocks and one parity block. When a failure occurs, we just “subtract” the good data from the good blocks, and what remains is the missing data. (This works whether the failed disk is a data disk or the parity disk.) RAID 3 assumes that the data are spread across all disks on reads and writes, which is attractive when reading or writing large amounts of data. ■

■ RAID 4—Many applications are dominated by small accesses. Since sectors have their own error checking, you can safely increase the number of reads per second by allowing each disk to perform independent reads. It would seem that writes would still be slow, if you have to read every disk to calculate parity. To increase the number of writes per second, an alternative approach involves only two disks. First, the array reads the old data that are about to be overwritten, and then calculates what bits would change before it writes the new data. It then reads the old value of the parity on the check disks, updates parity according to the list of changes, and then writes the new value of parity to the check disk. Hence, these so-called "small writes" are still slower than small reads—they involve four disk accesses—but they are faster than if you had to read all disks on every write. (A sketch of this small-write parity update appears in the code after this list.) RAID 4 has the same low check disk overhead as RAID 3, and it can still do large reads and writes as fast as RAID 3 in addition to small reads and writes, but control is more complex.

■ RAID 5—Note that a performance flaw for small writes in RAID 4 is that they all must read and write the same check disk, so it is a performance bottleneck. RAID 5 simply distributes the parity information across all disks in the array, thereby removing the bottleneck. The parity block in each stripe is rotated so that parity is spread evenly across all disks. The disk array controller must now calculate which disk holds the parity when it wants to write a given block, but that can be a simple calculation. RAID 5 has the same low check disk overhead as RAID 3 and 4, and it can do the large reads and writes of RAID 3 and the small reads of RAID 4, but it has higher small write bandwidth than RAID 4. Nevertheless, RAID 5 requires the most sophisticated controller of the classic RAID levels.
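To make the small-write mechanics concrete, here is a minimal Python sketch (not from the original text) of the read-modify-write parity update described for RAID 4 and RAID 5, treating each block as an integer and using XOR as even parity. The block representation, the rotate-by-stripe parity placement, and the function names are illustrative assumptions, not a specific product's layout.

```python
def parity_disk(stripe, ndisks):
    """RAID 5 rotates parity across disks; this is one common placement convention."""
    return (ndisks - 1 - stripe) % ndisks

def small_write(disks, stripe, data_disk, new_data, ndisks):
    """RAID 4/5 small write: 2 reads + 2 writes instead of reading the whole stripe.
    disks[d][stripe] holds the block on disk d; blocks are ints, XOR is the parity op."""
    pdisk = parity_disk(stripe, ndisks)
    assert data_disk != pdisk, "data block must not live on the parity disk of this stripe"
    old_data = disks[data_disk][stripe]        # read 1: old data
    old_parity = disks[pdisk][stripe]          # read 2: old parity
    delta = old_data ^ new_data                # which bits change
    disks[data_disk][stripe] = new_data        # write 1: new data
    disks[pdisk][stripe] = old_parity ^ delta  # write 2: updated parity

def reconstruct(disks, failed, stripe):
    """Rebuild one lost block by XORing the surviving blocks in the stripe."""
    val = 0
    for d, disk in enumerate(disks):
        if d != failed:
            val ^= disk[stripe]
    return val

# Tiny demo: 4 data disks + rotated parity, then recover a failed disk's block.
NDISKS, NSTRIPES = 5, 4
disks = [[0] * NSTRIPES for _ in range(NDISKS)]
small_write(disks, stripe=0, data_disk=2, new_data=0xBEEF, ndisks=NDISKS)
assert reconstruct(disks, failed=2, stripe=0) == 0xBEEF
```

The two reads and two writes in `small_write` are exactly the four disk accesses counted in Figure D.4 for RAID 5 small writes.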

Having completed our quick review of the classic RAID levels, we can now look at two levels that have become popular since RAID was introduced.

RAID 10 versus 01 (or 1 + 0 versus RAID 0 + 1)

One topic not always described in the RAID literature involves how mirroring in RAID 1 interacts with striping. Suppose you had, say, four disks' worth of data to store and eight physical disks to use. Would you create four pairs of disks—each organized as RAID 1—and then stripe data across the four RAID 1 pairs? Alternatively, would you create two sets of four disks—each organized as RAID 0—and then mirror writes to both RAID 0 sets? The RAID terminology has evolved to call the former RAID 1 + 0 or RAID 10 ("striped mirrors") and the latter RAID 0 + 1 or RAID 01 ("mirrored stripes").
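As a side illustration (not from the original text), the sketch below lays out eight physical disks either as four mirrored pairs that are striped (RAID 10) or as two four-disk stripe sets that are mirrored (RAID 01), and counts which double-disk failures each layout survives. It uses the common simplified assumption that a RAID 0 stripe set is taken offline once any of its member disks fails; the disk numbering and that survival rule are assumptions of this sketch.

```python
from itertools import combinations

DISKS = range(8)

# RAID 10 ("striped mirrors"): four RAID 1 pairs, data striped across the pairs.
raid10_pairs = [(0, 1), (2, 3), (4, 5), (6, 7)]

# RAID 01 ("mirrored stripes"): two RAID 0 sets, writes mirrored to both sets.
raid01_sets = [(0, 1, 2, 3), (4, 5, 6, 7)]

def raid10_survives(failed):
    # Data is lost only if both disks of some mirrored pair fail.
    return not any(a in failed and b in failed for a, b in raid10_pairs)

def raid01_survives(failed):
    # Simplified model: a stripe set is offline once any member fails;
    # data is lost if both stripe sets have at least one failed disk.
    return sum(any(d in failed for d in s) for s in raid01_sets) < 2

double_failures = list(combinations(DISKS, 2))
print("RAID 10 survives", sum(raid10_survives(set(f)) for f in double_failures),
      "of", len(double_failures), "double failures")   # 24 of 28
print("RAID 01 survives", sum(raid01_survives(set(f)) for f in double_failures),
      "of", len(double_failures), "double failures")   # 12 of 28
```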

RAID 6: Beyond a Single Disk Failure

The parity-based schemes of RAID levels 1 to 5 protect against a single self-identifying failure; however, if an operator accidentally replaces the wrong disk during a failure, then the disk array will experience two failures, and data will be lost. Another concern is that since disk bandwidth is growing more slowly than disk capacity, the MTTR of a disk in a RAID system is increasing, which in turn increases the chances of a second failure. For example, a 500 GB SATA disk could take about 3 hours to read sequentially assuming no interference. Given that the damaged RAID is likely to continue to serve data, reconstruction could be stretched considerably, thereby increasing MTTR. Besides increasing reconstruction time, another concern is that reading much more data during reconstruction means increasing the chance of an uncorrectable media failure, which would result in data loss. Other arguments for concern about simultaneous multiple failures are the increasing number of disks in arrays and the use of ATA disks, which are slower and larger than SCSI disks.

Hence, over the years, there has been growing interest in protecting against more than one failure. Network Appliance (NetApp), for example, started by building RAID 4 file servers. As double failures were becoming a danger to customers, they created a more robust scheme to protect data, called row-diagonal parity or RAID-DP [Corbett et al. 2004]. Like the standard RAID schemes, row-diagonal parity uses redundant space based on a parity calculation on a per-stripe basis. Since it is protecting against a double failure, it adds two check blocks per stripe of data. Let's assume there are p + 1 disks total, so p − 1 disks have data. Figure D.5 shows the case when p is 5. The row parity disk is just like in RAID 4; it contains the even parity across the other four data blocks in its stripe. Each block of the diagonal parity disk contains the even parity of the blocks in the same diagonal. Note that each diagonal does not cover one disk; for example, diagonal 0 does not cover disk 1. Hence, we need just p − 1 diagonals to protect the p disks, so the diagonal parity disk holds only diagonals 0 to 3 in Figure D.5.

Data disk 0 | Data disk 1 | Data disk 2 | Data disk 3 | Row parity | Diagonal parity
0 | 1 | 2 | 3 | 4 | 0
1 | 2 | 3 | 4 | 0 | 1
2 | 3 | 4 | 0 | 1 | 2
3 | 4 | 0 | 1 | 2 | 3

Figure D.5 Row diagonal parity for p = 5, which protects four data disks from double failures [Corbett et al. 2004]. This figure shows the diagonal groups for which parity is calculated and stored in the diagonal parity disk. Although this shows all the check data in separate disks for row parity and diagonal parity as in RAID 4, there is a rotated version of row-diagonal parity that is analogous to RAID 5. Parameter p must be prime and greater than 2; however, you can make p larger than the number of data disks by assuming that the missing disks have all zeros and the scheme still works. This trick makes it easy to add disks to an existing system. NetApp picks p to be 257, which allows the system to grow to up to 256 data disks.

Let's see how row-diagonal parity works by assuming that data disks 1 and 3 fail in Figure D.5. We can't perform the standard RAID recovery on the first row using row parity, since it is missing two data blocks, from disks 1 and 3. However, we can perform recovery on diagonal 0, since it is only missing the data block associated with disk 3. Thus, row-diagonal parity starts by recovering one of the four blocks on a failed disk in this example using diagonal parity. Since each diagonal misses one disk, and all diagonals miss a different disk, two diagonals are only missing one block. They are diagonals 0 and 2 in this example, so we next restore the block from diagonal 2 from failed disk 1. When the data for those blocks have been recovered, the standard RAID recovery scheme can be used to recover two more blocks in the standard RAID 4 stripes 0 and 2, which in turn allows us to recover more diagonals. This process continues until the two failed disks are completely restored.

The EVEN-ODD scheme developed earlier by researchers at IBM is similar to row-diagonal parity, but it has a bit more computation during operation and recovery [Blaum 1995]. More recent papers show how to expand EVEN-ODD to protect against three failures [Blaum, Bruck, and Vardy 1996; Blaum et al. 2001].
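The iterative recovery described for Figure D.5 can be made concrete with a small sketch. This is not NetApp's implementation; it simply builds the p = 5 layout of Figure D.5 with XOR parity over integer blocks, erases two columns, and alternates diagonal and row recovery until both are rebuilt. The diagonal numbering follows Figure D.5; the names and block representation are illustrative assumptions.

```python
import random

P = 5                          # parameter p from Figure D.5
ROWS = P - 1                   # 4 rows per stripe; only diagonals 0..p-2 are stored
ROW_PAR, DIAG_PAR = P - 1, P   # column 4 = row parity, column 5 = diagonal parity

def diagonal(g):
    """Blocks (row, col) in diagonal group g; columns 0..4 cover data + row parity."""
    return [(r, c) for r in range(ROWS) for c in range(P) if (r + c) % P == g]

def build_stripe():
    """Random data blocks plus row parity and diagonal parity (ints, XOR = even parity)."""
    b = [[random.getrandbits(32) for _ in range(P + 1)] for _ in range(ROWS)]
    for r in range(ROWS):
        b[r][ROW_PAR] = 0
        for c in range(ROW_PAR):
            b[r][ROW_PAR] ^= b[r][c]
    for g in range(ROWS):
        b[g][DIAG_PAR] = 0
        for (r, c) in diagonal(g):
            b[g][DIAG_PAR] ^= b[r][c]
    return b

def recover(b, lost):
    """Rebuild two lost data/row-parity columns, as in the Figure D.5 walkthrough."""
    missing = {(r, c) for r in range(ROWS) for c in lost}
    for (r, c) in missing:
        b[r][c] = None
    while missing:
        progressed = False
        for g in range(ROWS):              # diagonal recovery: a diagonal missing one block
            gaps = [blk for blk in diagonal(g) if blk in missing]
            if len(gaps) == 1:
                r, c = gaps[0]
                val = b[g][DIAG_PAR]
                for (rr, cc) in diagonal(g):
                    if (rr, cc) != (r, c):
                        val ^= b[rr][cc]
                b[r][c] = val
                missing.discard((r, c))
                progressed = True
        for r in range(ROWS):              # row recovery: a row missing one block
            gaps = [(r, c) for c in range(P) if (r, c) in missing]
            if len(gaps) == 1:
                _, c = gaps[0]
                val = 0
                for cc in range(P):
                    if cc != c:
                        val ^= b[r][cc]
                b[r][c] = val
                missing.discard((r, c))
                progressed = True
        assert progressed, "double failures of data/row-parity columns always make progress"

stripe = build_stripe()
damaged = [row[:] for row in stripe]
recover(damaged, lost={1, 3})              # data disks 1 and 3, as in the text
assert damaged == stripe
print("recovered data disks 1 and 3")
```

Running the sketch with `lost={1, 3}` follows the same order as the text: diagonals 0 and 2 each yield one block, row parity then yields two more, and the process repeats until both columns are whole.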

D.3 Definition and Examples of Real Faults and Failures

Although people may be willing to live with a computer that occasionally crashes and forces all programs to be restarted, they insist that their information is never lost. The prime directive for storage is then to remember information, no matter what happens. Chapter 1 covered the basics of dependability, and this section expands that information to give the standard definitions and examples of failures.

The first step is to clarify confusion over terms. The terms fault, error, and failure are often used interchangeably, but they have different meanings in the dependability literature. For example, is a programming mistake a fault, error, or failure? Does it matter whether we are talking about when it was designed or when the program is run? If the running program doesn't exercise the mistake, is it still a fault/error/failure? Try another one. Suppose an alpha particle hits a DRAM memory cell. Is it a fault/error/failure if it doesn't change the value? Is it a fault/error/failure if the memory doesn't access the changed bit? Did a fault/error/failure still occur if the memory had error correction and delivered the corrected value to the CPU? You get the drift of the difficulties. Clearly, we need precise definitions to discuss such events intelligently.

To avoid such imprecision, this subsection is based on the terminology used by Laprie [1985] and Gray and Siewiorek [1991], endorsed by IFIP Working Group 10.4 and the IEEE Computer Society Technical Committee on Fault Tolerance. We talk about a system as a single module, but the terminology applies to submodules recursively. Let's start with a definition of dependability:

Computer system dependability is the quality of delivered service such that reliance can justifiably be placed on this service. The service delivered by a system is its observed actual behavior as perceived by other system(s) interacting with this system's users. Each module also has an ideal specified behavior, where a service specification is an agreed description of the expected behavior. A system failure occurs when the actual behavior deviates from the specified behavior. The failure occurred because of an error, a defect in that module. The cause of an error is a fault. When a fault occurs, it creates a latent error, which becomes effective when it is activated; when the error actually affects the delivered service, a failure occurs. The time between the occurrence of an error and the resulting failure is the error latency. Thus, an error is the manifestation in the system of a fault, and a failure is the manifestation on the service of an error. [p. 3]

Let's go back to our motivating examples above. A programming mistake is a fault. The consequence is an error (or latent error) in the software. Upon activation, the error becomes effective. When this effective error produces erroneous data that affect the delivered service, a failure occurs. An alpha particle hitting a DRAM can be considered a fault. If it changes the memory, it creates an error. The error will remain latent until the affected memory word is read. If the effective word error affects the delivered service, a failure occurs. If ECC corrected the error, a failure would not occur. A mistake by a human operator is a fault. The resulting altered data is an error. It is latent until activated, and so on as before.

To clarify, the relationship among faults, errors, and failures is as follows:

■ A fault creates one or more latent errors.



■ The properties of errors are (1) a latent error becomes effective once activated; (2) an error may cycle between its latent and effective states; and (3) an effective error often propagates from one component to another, thereby creating new errors. Thus, either an effective error is a formerly latent error in that component or it has propagated from another error in that component or from elsewhere.



■ A component failure occurs when the error affects the delivered service.



These properties are recursive and apply to any component in the system.

Gray and Siewiorek classified faults into four categories according to their cause:

1. Hardware faults—Devices that fail, such as when an alpha particle hits a memory cell

2. Design faults—Faults in software (usually) and hardware design (occasionally)

3. Operation faults—Mistakes by operations and maintenance personnel

4. Environmental faults—Fire, flood, earthquake, power failure, and sabotage

Faults are also classified by their duration into transient, intermittent, and permanent [Nelson 1990]. Transient faults exist for a limited time and are not recurring. Intermittent faults cause a system to oscillate between faulty and fault-free operation. Permanent faults do not correct themselves with the passing of time.

Now that we have defined the difference between faults, errors, and failures, we are ready to see some real-world examples. Publications of real error rates are rare for two reasons. First, academics rarely have access to significant hardware resources to measure. Second, industrial researchers are rarely allowed to publish failure information for fear that it would be used against their companies in the marketplace. A few exceptions follow.


Berkeley's Tertiary Disk

The Tertiary Disk project at the University of California created an art image server for the Fine Arts Museums of San Francisco in 2000. This database consisted of high-quality images of over 70,000 artworks [Talagala et al. 2000]. The database was stored on a cluster, which consisted of 20 PCs connected by a switched Ethernet and containing 368 disks. It occupied seven 7-foot-high racks.

Figure D.6 shows the failure rates of the various components of Tertiary Disk. In advance of building the system, the designers assumed that SCSI data disks would be the least reliable part of the system, as they are both mechanical and plentiful. Next would be the IDE disks since there were fewer of them, then the power supplies, followed by integrated circuits. They assumed that passive devices such as cables would scarcely ever fail.

Figure D.6 shatters some of those assumptions. Since the designers followed the manufacturer's advice of making sure the disk enclosures had reduced vibration and good cooling, the data disks were very reliable. In contrast, the PC chassis containing the IDE/ATA disks did not afford the same environmental controls. (The IDE/ATA disks did not store data but helped the application and operating system to boot the PCs.) Figure D.6 shows that the SCSI backplane, cables, and Ethernet cables were no more reliable than the data disks themselves!

Component | Total in system | Total failed | Percentage failed
SCSI controller | 44 | 1 | 2.3%
SCSI cable | 39 | 1 | 2.6%
SCSI disk | 368 | 7 | 1.9%
IDE/ATA disk | 24 | 6 | 25.0%
Disk enclosure—backplane | 46 | 13 | 28.3%
Disk enclosure—power supply | 92 | 3 | 3.3%
Ethernet controller | 20 | 1 | 5.0%
Ethernet switch | 2 | 1 | 50.0%
Ethernet cable | 42 | 1 | 2.3%
CPU/motherboard | 20 | 0 | 0%

Figure D.6 Failures of components in Tertiary Disk over 18 months of operation. For each type of component, the table shows the total number in the system, the number that failed, and the percentage failure rate. Disk enclosures have two entries in the table because they had two types of problems: backplane integrity failures and power supply failures. Since each enclosure had two power supplies, a power supply failure did not affect availability. This cluster of 20 PCs, contained in seven 7-foot-high, 19-inch-wide racks, hosted 368 8.4 GB, 7200 RPM, 3.5-inch IBM disks. The PCs were P6-200 MHz with 96 MB of DRAM each. They ran FreeBSD 3.0, and the hosts were connected via switched 100 Mbit/sec Ethernet. All SCSI disks were connected to two PCs via double-ended SCSI chains to support RAID 1. The primary application was called the Zoom Project, which in 1998 was the world's largest art image database, with 72,000 images. See Talagala et al. [2000b].

As Tertiary Disk was a large system with many redundant components, it could survive this wide range of failures. Components were connected and mirrored images were placed so that no single failure could make any image unavailable. This strategy, which initially appeared to be overkill, proved to be vital.

This experience also demonstrated the difference between transient faults and hard faults. Virtually all the failures in Figure D.6 appeared first as transient faults. It was up to the operator to decide whether a component's behavior was so poor that it needed to be replaced or whether it could continue in service. In fact, the word "failure" was not used; instead, the group borrowed terms normally used for dealing with problem employees, with the operator deciding whether a problem component should or should not be "fired."

Tandem

The next example comes from industry. Gray [1990] collected data on faults for Tandem Computers, which was one of the pioneering companies in fault-tolerant computing and used primarily for databases. Figure D.7 graphs the faults that caused system failures between 1985 and 1989 in absolute faults per system and in percentage of faults encountered. The data show a clear improvement in the reliability of hardware and maintenance. Disks in 1985 required yearly service by Tandem, but they were replaced by disks that required no scheduled maintenance. Shrinking numbers of chips and connectors per system plus software's ability to tolerate hardware faults reduced hardware's contribution to only 7% of failures by 1989. Moreover, when hardware was at fault, software embedded in the hardware device (firmware) was often the culprit.

The data indicate that software in 1989 was the major source of reported outages (62%), followed by system operations (15%). The problem with any such statistics is that the data only refer to what is reported; for example, environmental failures due to power outages were not reported to Tandem because they were seen as a local problem. Data on operation faults are very difficult to collect because operators must report personal mistakes, which may affect the opinion of their managers, which in turn can affect job security and pay raises. Gray suggested that both environmental faults and operator faults are underreported. His study concluded that achieving higher availability requires improvement in software quality and software fault tolerance, simpler operations, and tolerance of operational faults.

[Figure D.7: bar charts of faults per 1000 systems and of the percentage of faults per category—software (applications + OS), hardware, maintenance (by Tandem), operations (by customer), environment (power, network), and unknown—for 1985, 1987, and 1989.]

Figure D.7 Faults in Tandem between 1985 and 1989. Gray [1990] collected these data for fault-tolerant Tandem Computers based on reports of component failures by customers.

Other Studies of the Role of Operators in Dependability

While Tertiary Disk and Tandem are storage-oriented dependability studies, we need to look outside storage to find better measurements on the role of humans in failures. Murphy and Gent [1995] tried to improve the accuracy of data on operator faults by having the system automatically prompt the operator on each boot for the reason for that reboot. They classified consecutive crashes to the same fault as operator fault and included operator actions that directly resulted in crashes, such as giving parameters bad values, bad configurations, and bad application installation. Although they believed that operator error is underreported, they did get more accurate information than did Gray, who relied on a form that the operator filled out and then sent up the management chain. The hardware/operating system went from causing 70% of the failures in VAX systems in 1985 to 28% in 1993, and failures due to operators rose from 15% to 52% in that same period. Murphy and Gent expected managing systems to be the primary dependability challenge in the future.

The final set of data comes from the government. The Federal Communications Commission (FCC) requires that all telephone companies submit explanations when they experience an outage that affects at least 30,000 people or lasts 30 minutes. These detailed disruption reports do not suffer from the self-reporting problem of earlier figures, as investigators determine the cause of the outage rather than the operators of the equipment. Kuhn [1997] studied the causes of outages between 1992 and 1994, and Enriquez [2001] did a follow-up study for the first half of 2001. Although there was a significant improvement in failures due to overloading of the network over the years, failures due to humans increased, from about one-third to two-thirds of the customer-outage minutes.

These four examples and others suggest that the primary cause of failures in large systems today is faults by human operators. Hardware faults have declined due to a decreasing number of chips in systems and fewer connectors. Hardware dependability has improved through fault tolerance techniques such as memory ECC and RAID. At least some operating systems are considering reliability implications before adding new features, so in 2011 the failures largely occurred elsewhere.

Although failures may be initiated due to faults by operators, it is a poor reflection on the state of the art of systems that the processes of maintenance and upgrading are so error prone. Most storage vendors claim today that customers spend much more on managing storage over its lifetime than they do on purchasing the storage. Thus, the challenge for dependable storage systems of the future is either to tolerate faults by operators or to avoid faults by simplifying the tasks of system administration. Note that RAID 6 allows the storage system to survive even if the operator mistakenly replaces a good disk.

We have now covered the bedrock issue of dependability, giving definitions, case studies, and techniques to improve it. The next step in the storage tour is performance.

D.4 I/O Performance, Reliability Measures, and Benchmarks

I/O performance has measures that have no counterparts in processor design. One of these is diversity: Which I/O devices can connect to the computer system? Another is capacity: How many I/O devices can connect to a computer system? In addition to these unique measures, the traditional measures of performance (namely, response time and throughput) also apply to I/O. (I/O throughput is sometimes called I/O bandwidth, and response time is sometimes called latency.) The next two figures offer insight into how response time and throughput trade off against each other.

Figure D.8 shows the simple producer-server model. The producer creates tasks to be performed and places them in a buffer; the server takes tasks from the first in, first out buffer and performs them. Response time is defined as the time a task takes from the moment it is placed in the buffer until the server finishes the task. Throughput is simply the average number of tasks completed by the server over a time period. To get the highest possible throughput, the server should never be idle, thus the buffer should never be empty. Response time, on the other hand, counts time spent in the buffer, so an empty buffer shrinks it.

[Figure D.8: a producer places tasks in a queue, and a server removes and performs them.]

Figure D.8 The traditional producer-server model of response time and throughput. Response time begins when a task is placed in the buffer and ends when it is completed by the server. Throughput is the number of tasks completed by the server in unit time.

Another measure of I/O performance is the interference of I/O with processor execution. Transferring data may interfere with the execution of another process. There is also overhead due to handling I/O interrupts. Our concern here is how much longer a process will take because of I/O for another process.

Throughput versus Response Time

Figure D.9 shows throughput versus response time (or latency) for a typical I/O system. The knee of the curve is the area where a little more throughput results in much longer response time or, conversely, a little shorter response time results in much lower throughput. How does the architect balance these conflicting demands? If the computer is interacting with human beings, Figure D.10 suggests an answer.

[Figure D.9: response time (latency) in ms versus percentage of maximum throughput (bandwidth), from 0% to 100%.]

Figure D.9 Throughput versus response time. Latency is normally reported as response time. Note that the minimum response time achieves only 11% of the throughput, while the response time for 100% throughput takes seven times the minimum response time. Note also that the independent variable in this curve is implicit; to trace the curve, you typically vary load (concurrency). Chen et al. [1990] collected these data for an array of magnetic disks.

An interaction, or transaction, with a computer is divided into three parts:

1. Entry time—The time for the user to enter the command.

2. System response time—The time between when the user enters the command and the complete response is displayed.

3. Think time—The time from the reception of the response until the user begins to enter the next command.

The sum of these three parts is called the transaction time. Several studies report that user productivity is inversely proportional to transaction time. The results in Figure D.10 show that cutting system response time by 0.7 seconds saves 4.9 seconds (34%) from the conventional transaction and 2.0 seconds (70%) from the graphics transaction. This implausible result is explained by human nature: People need less time to think when given a faster response. Although this study is 20 years old, response times are often still much slower than 1 second, even if processors are 1000 times faster. Examples of long delays include starting an application on a desktop PC due to many disk I/Os, or network delays when clicking on Web links.

[Figure D.10: stacked bars of entry time, system response time, and think time for conventional and high-function graphics workloads at 1.0 sec and 0.3 sec system response times; the faster response cuts the conventional transaction by 34% (think time by 70%) and the graphics transaction by 70% (think time by 81%).]

Figure D.10 A user transaction with an interactive computer divided into entry time, system response time, and user think time for a conventional system and graphics system. The entry times are the same, independent of system response time. The entry time was 4 seconds for the conventional system and 0.25 seconds for the graphics system. Reduction in response time actually decreases transaction time by more than just the response time reduction. (From Brady [1986].)

To reflect the importance of response time to user productivity, I/O benchmarks also address the response time versus throughput trade-off. Figure D.11 shows the response time bounds for three I/O benchmarks. They report maximum throughput given either that 90% of response times must be less than a limit or that the average response time must be less than a limit. Let's next look at these benchmarks in more detail.

I/O benchmark | Response time restriction | Throughput metric
TPC-C: Complex Query OLTP | 90% of transactions must meet response time limit; 5 seconds for most types of transactions | New order transactions per minute
TPC-W: Transactional Web benchmark | 90% of Web interactions must meet response time limit; 3 seconds for most types of Web interactions | Web interactions per second
SPECsfs97 | Average response time ≤ 40 ms | NFS operations per second

Figure D.11 Response time restrictions for three I/O benchmarks.

Transaction-Processing Benchmarks

Transaction processing (TP, or OLTP for online transaction processing) is chiefly concerned with I/O rate (the number of disk accesses per second), as opposed to data rate (measured as bytes of data per second). TP generally involves changes to a large body of shared information from many terminals, with the TP system guaranteeing proper behavior on a failure. Suppose, for example, that a bank's computer fails when a customer tries to withdraw money from an ATM. The TP system would guarantee that the account is debited if the customer received the money and that the account is unchanged if the money was not received. Airline reservation systems as well as banks are traditional customers for TP.

As mentioned in Chapter 1, two dozen members of the TP community conspired to form a benchmark for the industry and, to avoid the wrath of their legal departments, published the report anonymously [Anon. et al. 1985]. This report led to the Transaction Processing Council, which in turn has led to eight benchmarks since its founding. Figure D.12 summarizes these benchmarks.

Benchmark | Data size (GB) | Performance metric | Date of first results
A: debit credit (retired) | 0.1–10 | Transactions per second | July 1990
B: batch debit credit (retired) | 0.1–10 | Transactions per second | July 1991
C: complex query OLTP | 100–3000 (minimum 0.07 * TPM) | New order transactions per minute (TPM) | September 1992
D: decision support (retired) | 100, 300, 1000 | Queries per hour | December 1995
H: ad hoc decision support | 100, 300, 1000 | Queries per hour | October 1999
R: business reporting decision support (retired) | 1000 | Queries per hour | August 1999
W: transactional Web benchmark | 50, 500 | Web interactions per second | July 2000
App: application server and Web services benchmark | 2500 | Web service interactions per second (SIPS) | June 2005

Figure D.12 Transaction Processing Council benchmarks. The summary results include both the performance metric and the price-performance of that metric. TPC-A, TPC-B, TPC-D, and TPC-R were retired.

Let's describe TPC-C to give a flavor of these benchmarks. TPC-C uses a database to simulate an order-entry environment of a wholesale supplier, including entering and delivering orders, recording payments, checking the status of orders, and monitoring the level of stock at the warehouses. It runs five concurrent transactions of varying complexity, and the database includes nine tables with a scalable range of records and customers. TPC-C is measured in transactions per minute (tpmC) and in price of system, including hardware, software, and three years of maintenance support. Figure 1.17 on page 42 in Chapter 1 describes the top systems in performance and cost-performance for TPC-C.

These TPC benchmarks were the first—and in some cases still the only ones—that have these unusual characteristics:

■ Price is included with the benchmark results. The cost of hardware, software, and maintenance agreements is included in a submission, which enables evaluations based on price-performance as well as high performance.



■ The dataset generally must scale in size as the throughput increases. The benchmarks are trying to model real systems, in which the demand on the system and the size of the data stored in it increase together. It makes no sense, for example, to have thousands of people per minute access hundreds of bank accounts.



■ The benchmark results are audited. Before results can be submitted, they must be approved by a certified TPC auditor, who enforces the TPC rules that try to make sure that only fair results are submitted. Results can be challenged and disputes resolved by going before the TPC.



■ Throughput is the performance metric, but response times are limited. For example, with TPC-C, 90% of the new order transaction response times must be less than 5 seconds.






■ An independent organization maintains the benchmarks. Dues collected by TPC pay for an administrative structure including a chief operating office. This organization settles disputes, conducts mail ballots on approval of changes to benchmarks, holds board meetings, and so on.

SPEC System-Level File Server, Mail, and Web Benchmarks

The SPEC benchmarking effort is best known for its characterization of processor performance, but it has created benchmarks for file servers, mail servers, and Web servers. Seven companies agreed on a synthetic benchmark, called SFS, to evaluate systems running the Sun Microsystems network file service (NFS). This benchmark was upgraded to SFS 3.0 (also called SPEC SFS97_R1) to include support for NFS version 3, using TCP in addition to UDP as the transport protocol, and making the mix of operations more realistic. Measurements on NFS systems led to a synthetic mix of reads, writes, and file operations. SFS supplies default parameters for comparative performance. For example, half of all writes are done in 8 KB blocks and half are done in partial blocks of 1, 2, or 4 KB. For reads, the mix is 85% full blocks and 15% partial blocks.

Like TPC-C, SFS scales the amount of data stored according to the reported throughput: For every 100 NFS operations per second, the capacity must increase by 1 GB. It also limits the average response time, in this case to 40 ms. Figure D.13 shows average response time versus throughput for two NetApp systems. Unfortunately, unlike the TPC benchmarks, SFS does not normalize for different price configurations.

[Figure D.13: average response time (ms) versus operations per second for the FAS3000 (2 Xeons: 34,089 ops/sec; 4 Xeons: 47,927 ops/sec) and the FAS6000 (4 Opterons: 100,295 ops/sec; 8 Opterons: 136,048 ops/sec).]

Figure D.13 SPEC SFS97_R1 performance for the NetApp FAS3050c NFS servers in two configurations. Two processors reached 34,089 operations per second and four processors did 47,927. Reported in May 2005, these systems used the Data ONTAP 7.0.1R1 operating system, 2.8 GHz Pentium Xeon microprocessors, 2 GB of DRAM per processor, 1 GB of nonvolatile memory per system, and 168 15 K RPM, 72 GB, Fibre Channel disks. These disks were connected using two or four QLogic ISP-2322 FC disk controllers.

SPECMail is a benchmark to help evaluate performance of mail servers at an Internet service provider. SPECMail2001 is based on the standard Internet protocols SMTP and POP3, and it measures throughput and user response time while scaling the number of users from 10,000 to 1,000,000.

SPECWeb is a benchmark for evaluating the performance of World Wide Web servers, measuring the number of simultaneous user sessions. The SPECWeb2005 workload simulates accesses to a Web service provider, where the server supports home pages for several organizations. It has three workloads: Banking (HTTPS), E-commerce (HTTP and HTTPS), and Support (HTTP).

Examples of Benchmarks of Dependability

The TPC-C benchmark does in fact have a dependability requirement. The benchmarked system must be able to handle a single disk failure, which means in practice that all submitters are running some RAID organization in their storage system.

Efforts that are more recent have focused on the effectiveness of fault tolerance in systems. Brown and Patterson [2000] proposed that availability be measured by examining the variations in system quality-of-service metrics over time as faults are injected into the system. For a Web server, the obvious metrics are performance (measured as requests satisfied per second) and degree of fault tolerance (measured as the number of faults that can be tolerated by the storage subsystem, network connection topology, and so forth).

The initial experiment injected a single fault—such as a write error in a disk sector—and recorded the system's behavior as reflected in the quality-of-service metrics. The example compared software RAID implementations provided by Linux, Solaris, and Windows 2000 Server. SPECWeb99 was used to provide a workload and to measure performance. To inject faults, one of the SCSI disks in the software RAID volume was replaced with an emulated disk: a PC running software using a SCSI controller that appears to other devices on the SCSI bus as a disk. The disk emulator allowed the injection of faults. The faults injected included a variety of transient disk faults, such as correctable read errors, and permanent faults, such as disk media failures on writes.

Figure D.14 shows the behavior of each system under different faults. The two top graphs show Linux (on the left) and Solaris (on the right). As RAID systems can lose data if a second disk fails before reconstruction completes, the longer the reconstruction (MTTR), the lower the availability. Faster reconstruction implies decreased application performance, however, as reconstruction steals I/O resources from running applications. Thus, there is a policy choice between taking a performance hit during reconstruction or lengthening the window of vulnerability and thus lowering the predicted MTTF.
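To see how much the reconstruction window matters, here is a hedged back-of-the-envelope sketch (not from the study above) using the standard approximation that data are lost if any of the remaining disks fails while one disk is being rebuilt. The disk count, MTTF, and MTTR values are invented for illustration.

```python
import math

def p_second_failure(n_disks, mttf_hours, mttr_hours):
    """Approximate probability that another disk fails during reconstruction.
    Back-of-the-envelope model: (n_disks - 1) surviving disks, independent
    exponential failures with the given MTTF, over an MTTR-long window."""
    rate = (n_disks - 1) / mttf_hours          # combined failure rate of the survivors
    return 1.0 - math.exp(-rate * mttr_hours)  # ~ (n_disks - 1) * MTTR / MTTF when small

# Hypothetical 8-disk RAID 5 group with a 1,000,000-hour disk MTTF:
for mttr in (3, 24, 72):                       # fast rebuild vs. rebuild throttled by live traffic
    print(f"MTTR = {mttr:3d} h -> P(second failure during rebuild) ~ "
          f"{p_second_failure(8, 1_000_000, mttr):.1e}")
```

Under these assumptions, stretching reconstruction from 3 hours to 72 hours raises the chance of a data-losing second failure by roughly a factor of 24, which is the trade-off the reconstruction-rate policy must weigh.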

[Figure D.14: hits per second (SPECWeb99) versus time in minutes for Linux, Solaris, and Windows, each showing the drop in performance during RAID reconstruction after the injected fault.]

Figure D.14 Availability benchmark for software RAID systems on the same computer running Red Hat 6.0 Linux, Solaris 7, and Windows 2000 operating systems. Note the difference in philosophy on speed of reconstruction of Linux versus Windows and Solaris. The y-axis is behavior in hits per second running SPECWeb99. The arrow indicates time of fault insertion. The lines at the top give the 99% confidence interval of performance before the fault is inserted. A 99% confidence interval means that if the variable is outside of this range, the probability is only 1% that this value would appear.

Although none of the tested systems documented their reconstruction policies outside of the source code, even a single fault injection was able to give insight into those policies. The experiments revealed that both Linux and Solaris initiate automatic reconstruction of the RAID volume onto a hot spare when an active disk is taken out of service due to a failure. Although Windows supports RAID reconstruction, the reconstruction must be initiated manually. Thus, without human intervention, a Windows system that did not rebuild after a first failure remains susceptible to a second failure, which increases the window of vulnerability. It does repair quickly once told to do so.

The fault injection experiments also provided insight into other availability policies of Linux, Solaris, and Windows 2000 concerning automatic spare utilization, reconstruction rates, transient errors, and so on. Again, no system documented its policies.

In terms of managing transient faults, the fault injection experiments revealed that Linux's software RAID implementation takes the opposite approach from the RAID implementations in Solaris and Windows. The Linux implementation is paranoid—it would rather shut down a disk in a controlled manner at the first error than wait to see if the error is transient. In contrast, Solaris and Windows are more forgiving—they ignore most transient faults with the expectation that they will not recur. Thus, these systems are substantially more robust to transients than the Linux system. Note that both Windows and Solaris do log the transient faults, ensuring that the errors are reported even if not acted upon. When faults were permanent, the systems behaved similarly.

D.5 A Little Queuing Theory

In processor design, we have simple back-of-the-envelope calculations of performance associated with the CPI formula in Chapter 1, or we can use full-scale simulation for greater accuracy at greater cost. In I/O systems, we also have a best-case analysis as a back-of-the-envelope calculation. Full-scale simulation is also much more accurate and much more work to calculate expected performance.

With I/O systems, however, we also have a mathematical tool to guide I/O design that is a little more work and much more accurate than best-case analysis, but much less work than full-scale simulation. Because of the probabilistic nature of I/O events and because of sharing of I/O resources, we can give a set of simple theorems that will help calculate response time and throughput of an entire I/O system. This helpful field is called queuing theory. Since there are many books and courses on the subject, this section serves only as a first introduction to the topic. However, even this small amount can lead to better design of I/O systems.

Let's start with a black-box approach to I/O systems, as shown in Figure D.15. In our example, the processor is making I/O requests that arrive at the I/O device, and the requests "depart" when the I/O device fulfills them.

We are usually interested in the long term, or steady state, of a system rather than in the initial start-up conditions. Suppose we weren't. Although there is a mathematics that helps (Markov chains), except for a few cases, the only way to solve the resulting equations is simulation. Since the purpose of this section is to show something a little harder than back-of-the-envelope calculations but less than simulation, we won't cover such analyses here. (See the references in Appendix M for more details.)


[Figure D.15: arrivals enter the black-box I/O system and departures leave it.]

Figure D.15 Treating the I/O system as a black box. This leads to a simple but important observation: If the system is in steady state, then the number of tasks entering the system must equal the number of tasks leaving the system. This flow-balanced state is necessary but not sufficient for steady state. If the system has been observed or measured for a sufficiently long time and mean waiting times stabilize, then we say that the system has reached steady state.

Hence, in this section we make the simplifying assumption that we are evaluating systems with multiple independent requests for I/O service that are in equilibrium: The input rate must be equal to the output rate. We also assume there is a steady supply of tasks independent of how long they wait for service. In many real systems, such as TPC-C, the task consumption rate is determined by other system characteristics, such as memory capacity.

This leads us to Little's law, which relates the average number of tasks in the system, the average arrival rate of new tasks, and the average time to perform a task:

Mean number of tasks in system = Arrival rate × Mean response time

Little's law applies to any system in equilibrium, as long as nothing inside the black box is creating new tasks or destroying them. Note that the arrival rate and the response time must use the same time unit; inconsistency in time units is a common cause of errors.

Let's try to derive Little's law. Assume we observe a system for Timeobserve minutes. During that observation, we record how long it took each task to be serviced, and then sum those times. The number of tasks completed during Timeobserve is Numbertasks, and the sum of the times each task spends in the system is Timeaccumulated. Note that the tasks can overlap in time, so Timeaccumulated ≥ Timeobserve. Then,

Mean number of tasks in system = Timeaccumulated / Timeobserve
Mean response time = Timeaccumulated / Numbertasks
Arrival rate = Numbertasks / Timeobserve

Algebra lets us split the first formula:

Timeaccumulated / Timeobserve = (Timeaccumulated / Numbertasks) × (Numbertasks / Timeobserve)

[Figure D.16: arrivals wait in a queue and are serviced by a server (the I/O controller and device).]

Figure D.16 The single-server model for this section. In this situation, an I/O request "departs" by being completed by the server.

If we substitute the three definitions above into this formula, and swap the resulting two terms on the right-hand side, we get Little's law:

Mean number of tasks in system = Arrival rate × Mean response time

This simple equation is surprisingly powerful, as we shall see.

If we open the black box, we see Figure D.16. The area where the tasks accumulate, waiting to be serviced, is called the queue, or waiting line. The device performing the requested service is called the server. Until we get to the last two pages of this section, we assume a single server.

Little's law and a series of definitions lead to several useful equations:

This simple equation is surprisingly powerful, as we shall see. If we open the black box, we see Figure D.16. The area where the tasks accumulate, waiting to be serviced, is called the queue, or waiting line. The device performing the requested service is called the server. Until we get to the last two pages of this section, we assume a single server. Little’s law and a series of definitions lead to several useful equations: ■




■ Timeserver—Average time to service a task; the average service rate is 1/Timeserver, traditionally represented by the symbol μ in many queuing texts.

■ Timequeue—Average time per task in the queue.

■ Timesystem—Average time per task in the system, or the response time, which is the sum of Timequeue and Timeserver.

■ Arrival rate—Average number of arriving tasks per second, traditionally represented by the symbol λ in many queuing texts.

■ Lengthserver—Average number of tasks in service.

■ Lengthqueue—Average length of queue.

■ Lengthsystem—Average number of tasks in system, which is the sum of Lengthqueue and Lengthserver.

One common misunderstanding can be made clearer by these definitions: whether the question is how long a task must wait in the queue before service starts (Timequeue) or how long a task takes until it is completed (Timesystem). The latter term is what we mean by response time, and the relationship between the terms is Timesystem = Timequeue + Timeserver.

The mean number of tasks in service (Lengthserver) is simply Arrival rate × Timeserver, which is Little's law. Server utilization is simply the mean number of tasks being serviced divided by the service rate. For a single server, the service rate is 1/Timeserver. Hence, server utilization (and, in this case, the mean number of tasks per server) is simply:

Server utilization = Arrival rate × Timeserver


Server utilization must be between 0 and 1; otherwise, there would be more tasks arriving than could be serviced, violating our assumption that the system is in equilibrium. Note that this formula is just a restatement of Little's law. Utilization is also called traffic intensity and is represented by the symbol ρ in many queuing theory texts.

Example    Suppose an I/O system with a single disk gets on average 50 I/O requests per second. Assume the average time for a disk to service an I/O request is 10 ms. What is the utilization of the I/O system?

Answer    Using the equation above, with 10 ms represented as 0.01 seconds, we get:

Server utilization = Arrival rate × Timeserver = 50 per second × 0.01 seconds = 0.50

Therefore, the I/O system utilization is 0.5.

How the queue delivers tasks to the server is called the queue discipline. The simplest and most common discipline is first in, first out (FIFO). If we assume FIFO, we can relate time waiting in the queue to the mean number of tasks in the queue:

Timequeue = Lengthqueue × Timeserver + Mean time to complete service of the task being served when a new task arrives, if the server is busy

That is, the time in the queue is the number of tasks in the queue times the mean service time plus the time it takes the server to complete whatever task is being serviced when a new task arrives. (There is one more restriction about the arrival of tasks, which we reveal on page D-28.) The last component of the equation is not as simple as it first appears. A new task can arrive at any instant, so we have no basis to know how long the existing task has been in the server. Although such requests are random events, if we know something about the distribution of events, we can predict performance.

Poisson Distribution of Random Variables

To estimate the last component of the formula we need to know a little about distributions of random variables. A variable is random if it takes one of a specified set of values with a specified probability; that is, you cannot know exactly what its next value will be, but you may know the probability of all possible values. Requests for service from an I/O system can be modeled by a random variable because the operating system is normally switching between several processes that generate independent I/O requests. We also model I/O service times by a random variable given the probabilistic nature of disks in terms of seek and rotational delays.

One way to characterize the distribution of values of a random variable with discrete values is a histogram, which divides the range between the minimum and maximum values into subranges called buckets. Histograms then plot the number in each bucket as columns.


Histograms work well for distributions that are discrete values—for example, the number of I/O requests. For distributions that are not discrete values, such as time waiting for an I/O request, we have two choices. Either we need a curve to plot the values over the full range, so that we can estimate accurately the value, or we need a very fine time unit so that we get a very large number of buckets to estimate time accurately. For example, a histogram can be built of disk service times measured in intervals of 10 μs although disk service times are truly continuous.

Hence, to be able to solve the last part of the previous equation we need to characterize the distribution of this random variable. The mean time and some measure of the variance are sufficient for that characterization. For the first term, we use the weighted arithmetic mean time. Let's first assume that after measuring the number of occurrences, say, ni, of tasks, you could compute the frequency of occurrence of task i:

fi = ni / (n1 + n2 + … + nn)

Then the weighted arithmetic mean is

Weighted arithmetic mean time = f1 × T1 + f2 × T2 + … + fn × Tn

where Ti is the time for task i and fi is the frequency of occurrence of task i.

To characterize variability about the mean, many people use the standard deviation. Let's use the variance instead, which is simply the square of the standard deviation, as it will help us with characterizing the probability distribution. Given the weighted arithmetic mean, the variance can be calculated as

Variance = (f1 × T1² + f2 × T2² + … + fn × Tn²) − Weighted arithmetic mean time²

It is important to remember the units when computing variance. Let's assume the distribution is of time. If time is about 100 milliseconds, then squaring it yields 10,000 square milliseconds. This unit is certainly unusual. It would be more convenient if we had a unitless measure. To avoid this unit problem, we use the squared coefficient of variance, traditionally called C²:

C² = Variance / Weighted arithmetic mean time²

We can solve for C, the coefficient of variance, as

C = (√Variance) / Weighted arithmetic mean time = Standard deviation / Weighted arithmetic mean time
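The three formulas above translate directly into a few lines of Python. The sample service-time distribution below is an invented illustration, not data from the text.

```python
# Measured (time, count) pairs for a hypothetical disk: most requests fast, a few slow.
observations = [(5.0, 60), (10.0, 30), (50.0, 10)]        # times in ms, counts of occurrences

total = sum(count for _, count in observations)
freqs = [(t, count / total) for t, count in observations]  # f_i = n_i / sum of n_i

mean = sum(f * t for t, f in freqs)                        # weighted arithmetic mean time
variance = sum(f * t * t for t, f in freqs) - mean ** 2    # weighted mean of T^2 minus mean^2
c_squared = variance / mean ** 2                           # squared coefficient of variance
print(f"mean = {mean:.1f} ms, variance = {variance:.1f} ms^2, C^2 = {c_squared:.2f}")
```

For this made-up mix the mean is 11 ms, the variance 174 ms², and C² about 1.44, i.e., somewhat more variable than an exponential distribution.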

We are trying to characterize random events, but to be able to predict performance we need a distribution of random events where the mathematics is tractable. The most popular such distribution is the exponential distribution, which has a C value of 1. Note that we are using a constant to characterize variability about the mean. The invariance of C over time reflects the property that the history of events has no impact on the probability of an event occurring now. This forgetful property is called memoryless, and this property is an important assumption used to predict behavior using these models. (Suppose this memoryless property did not exist; then we would have to worry about the exact arrival times of requests relative to each other, which would make the mathematics considerably less tractable!)

One of the most widely used exponential distributions is called a Poisson distribution, named after the mathematician Simeon Poisson. It is used to characterize random events in a given time interval and has several desirable mathematical properties. The Poisson distribution is described by the following equation (called the probability mass function):

Probability(k) = (e^(−a) × a^k) / k!

where a = Rate of events × Elapsed time. If interarrival times are exponentially distributed and we use the arrival rate from above for the rate of events, the number of arrivals in a time interval t is a Poisson process, which has the Poisson distribution with a = Arrival rate × t. As mentioned on page D-26, the equation for Timequeue has another restriction on task arrival: It holds only for Poisson processes.

Finally, we can answer the question about the length of time a new task must wait for the server to complete a task, called the average residual service time, which again assumes Poisson arrivals:

Average residual service time = 1/2 × Arithmetic mean × (1 + C²)

Although we won't derive this formula, we can appeal to intuition. When the distribution is not random and all possible values are equal to the average, the standard deviation is 0 and so C is 0. The average residual service time is then just half the average service time, as we would expect. If the distribution is random and it is Poisson, then C is 1 and the average residual service time equals the weighted arithmetic mean time.
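A short sketch of the two formulas just given; the arrival rate, interval, and service-time values are made up for illustration, and the printed values match the intuition above (C = 0 gives half the mean, C = 1 gives the mean itself).

```python
import math

def poisson_pmf(k, rate, elapsed):
    """Probability of exactly k arrivals in an interval, with a = rate * elapsed."""
    a = rate * elapsed
    return math.exp(-a) * a ** k / math.factorial(k)

def residual_service_time(mean, c_squared):
    """Average residual service time = 1/2 * mean * (1 + C^2)."""
    return 0.5 * mean * (1 + c_squared)

# 50 I/O requests per second observed over a 100 ms window: a = 5 expected arrivals.
print([round(poisson_pmf(k, rate=50, elapsed=0.1), 3) for k in range(8)])

print(residual_service_time(mean=10.0, c_squared=0.0))  # deterministic service: 5.0 ms
print(residual_service_time(mean=10.0, c_squared=1.0))  # exponential service: 10.0 ms
```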

Example    Using the definitions and formulas above, derive the average time waiting in the queue (Timequeue) in terms of the average service time (Timeserver) and server utilization.

Answer

All tasks in the queue (Lengthqueue) ahead of the new task must be completed before the task can be serviced; each takes on average Timeserver. If a task is at the server, it takes average residual service time to complete. The chance the server is busy is server utilization; hence, the expected time for service is Server utilization × Average residual service time. This leads to our initial formula:

Timequeue = Lengthqueue × Timeserver + Server utilization × Average residual service time

Replacing the average residual service time by its definition and Lengthqueue by Arrival rate × Timequeue yields

Timequeue = Server utilization × (1/2 × Timeserver × (1 + C²)) + (Arrival rate × Timequeue) × Timeserver

Since this section is concerned with exponential distributions, C² is 1. Thus

Timequeue = Server utilization × Timeserver + (Arrival rate × Timequeue) × Timeserver

Rearranging the last term, let us replace Arrival rate × Timeserver by Server utilization:

Timequeue = Server utilization × Timeserver + (Arrival rate × Timeserver) × Timequeue
          = Server utilization × Timeserver + Server utilization × Timequeue

Rearranging terms and simplifying gives us the desired equation:

Timequeue = Server utilization × Timeserver + Server utilization × Timequeue
Timequeue − Server utilization × Timequeue = Server utilization × Timeserver
Timequeue × (1 − Server utilization) = Server utilization × Timeserver
Timequeue = Timeserver × Server utilization / (1 − Server utilization)

Little’s law can be applied to the components of the black box as well, since they must also be in equilibrium:

Lengthqueue = Arrival rate × Timequeue

If we substitute for Timequeue from above, we get:

Lengthqueue = Arrival rate × Timeserver × Server utilization / (1 − Server utilization)

Since Arrival rate × Timeserver = Server utilization, we can simplify further:

Lengthqueue = Server utilization × Server utilization / (1 − Server utilization) = Server utilization² / (1 − Server utilization)

This relates the number of items in the queue to server utilization.

Example

For the system in the example on page D-26, which has a server utilization of 0.5, what is the mean number of I/O requests in the queue?

Answer

Using the equation above,

Lengthqueue = Server utilization² / (1 − Server utilization) = 0.5² / (1 − 0.5) = 0.25/0.50 = 0.5

Therefore, there are 0.5 requests on average in the queue. As mentioned earlier, these equations and this section are based on an area of applied mathematics called queuing theory, which offers equations to predict


behavior of such random variables. Real systems are too complex for queuing theory to provide exact analysis, hence queuing theory works best when only approximate answers are needed. Queuing theory makes a sharp distinction between past events, which can be characterized by measurements using simple arithmetic, and future events, which are predictions requiring more sophisticated mathematics. In computer systems, we commonly predict the future from the past; one example is least recently used block replacement (see Chapter 2). Hence, the distinction between measurements and predicted distributions is often blurred; we use measurements to verify the type of distribution and then rely on the distribution thereafter.

Let’s review the assumptions about the queuing model:

■ The system is in equilibrium.

■ The times between two successive requests arriving, called the interarrival times, are exponentially distributed, which characterizes the arrival rate mentioned earlier.

■ The number of sources of requests is unlimited. (This is called an infinite population model in queuing theory; finite population models are used when arrival rates vary with the number of jobs already in the system.)

■ The server can start on the next job immediately after finishing the prior one.

■ There is no limit to the length of the queue, and it follows the first in, first out order discipline, so all tasks in line must be completed.

■ There is one server.

Such a queue is called M/M/1:

M = exponentially random request arrival (C² = 1), with M standing for A. A. Markov, the mathematician who defined and analyzed the memoryless processes mentioned earlier
M = exponentially random service time (C² = 1), with M again for Markov
1 = single server

The M/M/1 model is a simple and widely used model. The assumption of exponential distribution is commonly used in queuing examples for three reasons—one good, one fair, and one bad. The good reason is that a superposition of many arbitrary distributions acts as an exponential distribution. Many times in computer systems, a particular behavior is the result of many components interacting, so an exponential distribution of interarrival times is the right model. The fair reason is that when variability is unclear, an exponential distribution with intermediate variability (C = 1) is a safer guess than low variability (C ≈ 0) or high variability (large C). The bad reason is that the math is simpler if you assume exponential distributions.


Let’s put queuing theory to work in a few examples.

Example

Suppose a processor sends 40 disk I/Os per second, these requests are exponentially distributed, and the average service time of an older disk is 20 ms. Answer the following questions:

1. On average, how utilized is the disk?
2. What is the average time spent in the queue?
3. What is the average response time for a disk request, including the queuing time and disk service time?

Answer

Let’s restate these facts: Average number of arriving tasks/second is 40. Average disk time to service a task is 20 ms (0.02 sec). The server utilization is then

Server utilization = Arrival rate × Timeserver = 40 × 0.02 = 0.8

Since the service times are exponentially distributed, we can use the simplified formula for the average time spent waiting in line:

Timequeue = Timeserver × Server utilization / (1 − Server utilization) = 20 ms × 0.8 / (1 − 0.8) = 20 × 4 = 80 ms

The average response time is

Timesystem = Timequeue + Timeserver = 80 + 20 ms = 100 ms

Thus, on average we spend 80% of our time waiting in the queue!

Example

Suppose we get a new, faster disk. Recalculate the answers to the questions above, assuming the disk service time is 10 ms.

Answer

The disk utilization is then

Server utilization = Arrival rate × Timeserver = 40 × 0.01 = 0.4

The formula for the average time spent waiting in line:

Timequeue = Timeserver × Server utilization / (1 − Server utilization) = 10 ms × 0.4 / (1 − 0.4) = 10 × 2/3 = 6.7 ms

The average response time is 10 + 6.7 ms or 16.7 ms, 6.0 times faster than the old response time even though the new service time is only 2.0 times faster.
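These two results are easy to re-check mechanically. The following C sketch is our own illustration (not from the text): it plugs an arrival rate and a service time into the M/M/1 equations derived above and reproduces the 80 ms and 6.7 ms queue times for the 20 ms and 10 ms disks.

#include <stdio.h>

/* M/M/1 sketch: server utilization, time in queue, and system response time. */
static void mm1(double arrival_rate, double time_server) {
    double util = arrival_rate * time_server;               /* Server utilization */
    double time_queue = time_server * util / (1.0 - util);  /* M/M/1 waiting time */
    printf("Utilization %.2f  Timequeue %.1f ms  Timesystem %.1f ms\n",
           util, time_queue * 1000.0, (time_queue + time_server) * 1000.0);
}

int main(void) {
    mm1(40.0, 0.020);   /* older disk: 40 I/Os per second, 20 ms service time */
    mm1(40.0, 0.010);   /* newer disk: 10 ms service time */
    return 0;
}

The nonlinear 1/(1 − Server utilization) term is what turns a disk that is only 2.0 times faster into a response time that is 6.0 times faster.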


Figure D.17 The M/M/m multiple-server model. Arrivals enter a single queue, which feeds several servers (I/O controllers and their devices).

Thus far, we have been assuming a single server, such as a single disk. Many real systems have multiple disks and hence could use multiple servers, as in Figure D.17. Such a system is called an M/M/m model in queuing theory. Let’s give the same formulas for the M/M/m queue, using Nservers to represent the number of servers. The first two formulas are easy:

Utilization = Arrival rate × Timeserver / Nservers

Lengthqueue = Arrival rate × Timequeue

The time waiting in the queue is

Timequeue = Timeserver × Prob(tasks ≥ Nservers) / (Nservers × (1 − Utilization))

This formula is related to the one for M/M/1, except we replace utilization of a single server with the probability that a task will be queued as opposed to being immediately serviced, and divide the time in queue by the number of servers. Alas, calculating the probability of jobs being in the queue is much more complicated when there are Nservers. First, the probability that there are no tasks in the system is

Prob(0 tasks) = [1 + (Nservers × Utilization)^Nservers / (Nservers! × (1 − Utilization)) + Σ (n = 1 to Nservers − 1) (Nservers × Utilization)^n / n!]^−1

Then the probability there are as many or more tasks than we have servers is

Prob(tasks ≥ Nservers) = (Nservers × Utilization)^Nservers / (Nservers! × (1 − Utilization)) × Prob(0 tasks)


Note that if Nservers is 1, Prob(tasks ≥ Nservers) simplifies back to Utilization, and we get the same formula as for M/M/1. Let’s try an example.

Example

Suppose instead of a new, faster disk, we add a second slow disk and duplicate the data so that reads can be serviced by either disk. Let’s assume that the requests are all reads. Recalculate the answers to the earlier questions, this time using an M/M/m queue.

Answer

The average utilization of the two disks is then

Server utilization = Arrival rate × Timeserver / Nservers = 40 × 0.02 / 2 = 0.4

We first calculate the probability of no tasks in the queue:

Prob(0 tasks) = [1 + (2 × Utilization)² / (2! × (1 − Utilization)) + Σ (n = 1 to 1) (2 × Utilization)^n / n!]^−1
             = [1 + (2 × 0.4)² / (2 × (1 − 0.4)) + (2 × 0.4)]^−1
             = [1 + 0.640/1.2 + 0.800]^−1
             = [1 + 0.533 + 0.800]^−1 = 2.333^−1

We use this result to calculate the probability of tasks in the queue:

Prob(tasks ≥ Nservers) = (2 × Utilization)² / (2! × (1 − Utilization)) × Prob(0 tasks)
                       = (2 × 0.4)² / (2 × (1 − 0.4)) × 2.333^−1
                       = 0.640/1.2 × 2.333^−1
                       = 0.533/2.333 = 0.229

Finally, the time waiting in the queue:

Timequeue = Timeserver × Prob(tasks ≥ Nservers) / (Nservers × (1 − Utilization))
          = 0.020 × 0.229 / (2 × (1 − 0.4)) = 0.020 × 0.229/1.2
          = 0.020 × 0.190 = 0.0038

The average response time is 20 + 3.8 ms or 23.8 ms. For this workload, two disks cut the queue waiting time by a factor of 21 over a single slow disk and a factor of 1.75 versus a single fast disk. The mean response time of a system with a single fast disk, however, is still 1.4 times faster than one with two disks since the disk service time is 2.0 times faster.
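The M/M/m bookkeeping is tedious by hand, so it is worth automating. The following C sketch is our own code (not from the text); it evaluates the Prob(0 tasks), Prob(tasks ≥ Nservers), and Timequeue formulas above for the two-disk example and prints approximately 0.43, 0.229, and 3.8 ms.

#include <stdio.h>
#include <math.h>

int main(void) {
    int m = 2;                        /* number of servers (disks) */
    double arrival_rate = 40.0;       /* tasks per second */
    double time_server  = 0.020;      /* 20 ms per disk read */
    double util = arrival_rate * time_server / m;

    /* sum of (m x util)^n / n! for n = 0 .. m-1 */
    double sum = 1.0, fact = 1.0, mu = m * util, pw = 1.0;
    for (int n = 1; n < m; n++) { fact *= n; pw *= mu; sum += pw / fact; }

    double tail = pow(mu, m) / (tgamma(m + 1) * (1.0 - util)); /* (m x util)^m / (m! (1 - util)) */
    double prob0 = 1.0 / (sum + tail);                         /* probability of an empty system */
    double prob_queue = tail * prob0;                          /* probability a task must wait */
    double time_queue = time_server * prob_queue / (m * (1.0 - util));

    printf("Prob0 %.3f  ProbQueue %.3f  Timequeue %.1f ms\n",
           prob0, prob_queue, time_queue * 1000.0);
    return 0;
}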


It would be wonderful if we could generalize the M/M/m model to multiple queues and multiple servers, as this step is much more realistic. Alas, these models are very hard to solve and to use, and so we won’t cover them here.

D.6 Crosscutting Issues

Point-to-Point Links and Switches Replacing Buses

Point-to-point links and switches are increasing in popularity as Moore’s law continues to reduce the cost of components. Combined with the higher I/O bandwidth demands from faster processors, faster disks, and faster local area networks, the decreasing cost advantage of buses means the days of buses in desktop and server computers are numbered. This trend started in high-performance computers in the last edition of the book, and by 2011 had spread throughout storage. Figure D.18 shows the old bus-based standards and their replacements. The number of bits and bandwidth for the new generation is per direction, so they double for both directions. Since these new designs use many fewer wires, a common way to increase bandwidth is to offer versions with several times the number of wires and bandwidth.

Block Servers versus Filers

Thus far, we have largely ignored the role of the operating system in storage. In a manner analogous to the way compilers use an instruction set, operating systems determine what I/O techniques implemented by the hardware will actually be used. The operating system typically provides the file abstraction on top of blocks stored on the disk. The terms logical units, logical volumes, and physical volumes are related terms used in Microsoft and UNIX systems to refer to subset collections of disk blocks.

Standard              Width (bits)   Length (meters)   Clock rate      MB/sec   Max I/O devices
(Parallel) ATA        8              0.5               133 MHz         133      2
Serial ATA            2              2                 3 GHz           300      ?
SCSI                  16             12                80 MHz (DDR)    320      15
Serial Attach SCSI    1              10                —               375      16,256
PCI                   32/64          0.5               33/66 MHz       533      ?
PCI Express           2              0.5               3 GHz           250      ?

Figure D.18 Parallel I/O buses and their point-to-point replacements. Note the bandwidth and wires are per direction, so bandwidth doubles when sending both directions.


A logical unit is the element of storage exported from a disk array, usually constructed from a subset of the array’s disks. A logical unit appears to the server as a single virtual “disk.” In a RAID disk array, the logical unit is configured as a particular RAID layout, such as RAID 5. A physical volume is the device file used by the file system to access a logical unit. A logical volume provides a level of virtualization that enables the file system to split the physical volume across multiple pieces or to stripe data across multiple physical volumes. A logical unit is an abstraction of a disk array that presents a virtual disk to the operating system, while physical and logical volumes are abstractions used by the operating system to divide these virtual disks into smaller, independent file systems.

Having covered some of the terms for collections of blocks, we must now ask: Where should the file illusion be maintained: in the server or at the other end of the storage area network? The traditional answer is the server. It accesses storage as disk blocks and maintains the metadata. Most file systems use a file cache, so the server must maintain consistency of file accesses. The disks may be direct attached—found inside a server connected to an I/O bus—or attached over a storage area network, but the server transmits data blocks to the storage subsystem.

The alternative answer is that the disk subsystem itself maintains the file abstraction, and the server uses a file system protocol to communicate with storage. Example protocols are Network File System (NFS) for UNIX systems and Common Internet File System (CIFS) for Windows systems. Such devices are called network attached storage (NAS) devices since it makes no sense for storage to be directly attached to the server. The name is something of a misnomer because a storage area network like FC-AL can also be used to connect to block servers. The term filer is often used for NAS devices that only provide file service and file storage. Network Appliance was one of the first companies to make filers.

The driving force behind placing storage on the network is to make it easier for many computers to share information and for operators to maintain the shared system.

Asynchronous I/O and Operating Systems

Disks typically spend much more time in mechanical delays than in transferring data. Thus, a natural path to higher I/O performance is parallelism, trying to get many disks to simultaneously access data for a program. The straightforward approach to I/O is to request data and then start using it. The operating system then switches to another process until the desired data arrive, and then the operating system switches back to the requesting process. Such a style is called synchronous I/O—the process waits until the data have been read from disk. The alternative model is for the process to continue after making a request, and it is not blocked until it tries to read the requested data. Such asynchronous I/O


allows the process to continue making requests so that many I/O requests can be operating simultaneously. Asynchronous I/O shares the same philosophy as caches in out-of-order CPUs, which achieve greater bandwidth by having multiple outstanding events.
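As a concrete illustration of this style, the POSIX asynchronous I/O interface lets a process issue a read and keep computing until it actually needs the data. The sketch below is our own example, not from the text; the file name is a placeholder, and a real program would issue many requests or do useful work instead of spinning.

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    static char buf[4096];
    int fd = open("/tmp/example-file", O_RDONLY);   /* placeholder path */
    if (fd < 0) { perror("open"); return 1; }

    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = 0;

    if (aio_read(&cb) != 0) { perror("aio_read"); return 1; }

    /* The process is not blocked: it could compute or issue more I/Os here. */
    while (aio_error(&cb) == EINPROGRESS)
        ;                                           /* or sleep in aio_suspend() */

    printf("asynchronously read %zd bytes\n", aio_return(&cb));
    close(fd);
    return 0;
}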

D.7 Designing and Evaluating an I/O System—The Internet Archive Cluster

The art of I/O system design is to find a design that meets goals for cost, dependability, and variety of devices while avoiding bottlenecks in I/O performance and dependability. Avoiding bottlenecks means that components must be balanced between main memory and the I/O device, because performance and dependability—and hence effective cost-performance or cost-dependability—can only be as good as the weakest link in the I/O chain. The architect must also plan for expansion so that customers can tailor the I/O to their applications. This expansibility, both in numbers and types of I/O devices, has its costs in longer I/O buses and networks, larger power supplies to support I/O devices, and larger cabinets.

In designing an I/O system, we analyze performance, cost, capacity, and availability using varying I/O connection schemes and different numbers of I/O devices of each type. Here is one series of steps to follow in designing an I/O system. The answers for each step may be dictated by market requirements or simply by cost, performance, and availability goals.

1. List the different types of I/O devices to be connected to the machine, or list the standard buses and networks that the machine will support.

2. List the physical requirements for each I/O device. Requirements include size, power, connectors, bus slots, expansion cabinets, and so on.

3. List the cost of each I/O device, including the portion of cost of any controller needed for this device.

4. List the reliability of each I/O device.

5. Record the processor resource demands of each I/O device. This list should include:

■ Clock cycles for instructions used to initiate an I/O, to support operation of an I/O device (such as handling interrupts), and to complete I/O

■ Processor clock stalls due to waiting for I/O to finish using the memory, bus, or cache

■ Processor clock cycles to recover from an I/O activity, such as a cache flush

6. List the memory and I/O bus resource demands of each I/O device. Even when the processor is not using memory, the bandwidth of main memory and the I/O connection is limited.


7. The final step is assessing the performance and availability of the different ways to organize these I/O devices. When you can afford it, try to avoid single points of failure. Performance can only be properly evaluated with simulation, although it may be estimated using queuing theory. Reliability can be calculated assuming I/O devices fail independently and that the times to failure are exponentially distributed. Availability can be computed from reliability by estimating MTTF for the devices, taking into account the time from failure to repair.

Given your cost, performance, and availability goals, you then select the best organization. Cost-performance goals affect the selection of the I/O scheme and physical design. Performance can be measured either as megabytes per second or I/Os per second, depending on the needs of the application. For high performance, the only limits should be speed of I/O devices, number of I/O devices, and speed of memory and processor. For low cost, most of the cost should be the I/O devices themselves. Availability goals depend in part on the cost of unavailability to an organization.

Rather than create a paper design, let’s evaluate a real system.

The Internet Archive Cluster

To make these ideas clearer, we’ll estimate the cost, performance, and availability of a large storage-oriented cluster at the Internet Archive. The Internet Archive began in 1996 with the goal of making a historical record of the Internet as it changed over time. You can use the Wayback Machine interface to the Internet Archive to perform time travel to see what the Web site at a URL looked like sometime in the past. It contains over a petabyte (10^15 bytes) and is growing by 20 terabytes (10^12 bytes) of new data per month, so expansible storage is a requirement. In addition to storing the historical record, the same hardware is used to crawl the Web every few months to get snapshots of the Internet.

Clusters of computers connected by local area networks have become a very economical computation engine that works well for some applications. Clusters also play an important role in Internet services such as the Google search engine, where the focus is more on storage than it is on computation, as is the case here.

Although it has used a variety of hardware over the years, the Internet Archive is moving to a new cluster to become more efficient in power and in floor space. The basic building block is a 1U storage node called the PetaBox GB2000 from Capricorn Technologies. In 2006, it used four 500 GB Parallel ATA (PATA) disk drives, 512 MB of DDR266 DRAM, one 10/100/1000 Ethernet interface, and a 1 GHz C3 processor from VIA, which executes the 80x86 instruction set. This node dissipates about 80 watts in typical configurations. Figure D.19 shows the cluster in a standard VME rack. Forty of the GB2000s fit in a standard VME rack, which gives the rack 80 TB of raw capacity.


Figure D.19 The TB-80 VME rack from Capricorn Systems used by the Internet Archive. All cables, switches, and displays are accessible from the front side, and the back side is used only for airflow. This allows two racks to be placed back-to-back, which reduces the floor space demands in machine rooms.

The 40 nodes are connected together with a 48-port 10/100 or 10/100/1000 switch, and the rack dissipates about 3 KW. The limit is usually 10 KW per rack in computer facilities, so it is well within the guidelines. A petabyte needs 12 of these racks, connected by a higher-level switch that connects the Gbit links coming from the switches in each of the racks.

Estimating Performance, Dependability, and Cost of the Internet Archive Cluster

To illustrate how to evaluate an I/O system, we’ll make some guesses about the cost, performance, and reliability of the components of this cluster. We make the following assumptions about cost and performance:

■ The VIA processor, 512 MB of DDR266 DRAM, ATA disk controller, power supply, fans, and enclosure cost $500.

■ Each of the four 7200 RPM Parallel ATA drives holds 500 GB, has an average seek time of 8.5 ms, transfers at 50 MB/sec from the disk, and costs $375. The PATA link speed is 133 MB/sec.




■ The 48-port 10/100/1000 Ethernet switch and all cables for a rack cost $3000.

■ The performance of the VIA processor is 1000 MIPS.

■ The ATA controller adds 0.1 ms of overhead to perform a disk I/O.

■ The operating system uses 50,000 CPU instructions for a disk I/O.

■ The network protocol stacks use 100,000 CPU instructions to transmit a data block between the cluster and the external world.

■ The average I/O size is 16 KB for accesses to the historical record via the Wayback interface, and 50 KB when collecting a new snapshot.

Example

Evaluate the cost per I/O per second (IOPS) of the 80 TB rack. Assume that every disk I/O requires an average seek and average rotational delay. Assume that the workload is evenly divided among all disks and that all devices can be used at 100% of capacity; that is, the system is limited only by the weakest link, and it can operate that link at 100% utilization. Calculate for both average I/O sizes.

Answer

I/O performance is limited by the weakest link in the chain, so we evaluate the maximum performance of each link in the I/O chain for each organization to determine the maximum performance of that organization. Let’s start by calculating the maximum number of IOPS for the CPU, main memory, and I/O bus of one GB2000. The CPU I/O performance is determined by the speed of the CPU and the number of instructions to perform a disk I/O and to send it over the network:

Maximum IOPS for CPU = 1000 MIPS / (50,000 instructions per I/O + 100,000 instructions per message) = 6667 IOPS

The maximum performance of the memory system is determined by the memory bandwidth and the size of the I/O transfers:

Maximum IOPS for main memory = (266 × 8) / 16 KB per I/O ≈ 133,000 IOPS
Maximum IOPS for main memory = (266 × 8) / 50 KB per I/O ≈ 42,500 IOPS

The Parallel ATA link performance is limited by the bandwidth and the size of the I/O:

Maximum IOPS for the I/O bus = 133 MB/sec / 16 KB per I/O ≈ 8300 IOPS
Maximum IOPS for the I/O bus = 133 MB/sec / 50 KB per I/O ≈ 2700 IOPS

Since the box has two buses, the I/O bus limits the maximum performance to no more than 16,600 IOPS for 16 KB blocks and 5400 IOPS for 50 KB blocks.


Now it’s time to look at the performance of the next link in the I/O chain, the ATA controllers. The time to transfer a block over the PATA channel is

Parallel ATA transfer time = 16 KB / 133 MB/sec ≈ 0.1 ms
Parallel ATA transfer time = 50 KB / 133 MB/sec ≈ 0.4 ms

Adding the 0.1 ms ATA controller overhead means 0.2 ms to 0.5 ms per I/O, making the maximum rate per controller

Maximum IOPS per ATA controller = 1/0.2 ms = 5000 IOPS
Maximum IOPS per ATA controller = 1/0.5 ms = 2000 IOPS

The next link in the chain is the disks themselves. The time for an average disk I/O is

I/O time = 8.5 ms + 0.5/7200 RPM + 16 KB / 50 MB/sec = 8.5 + 4.2 + 0.3 = 13.0 ms
I/O time = 8.5 ms + 0.5/7200 RPM + 50 KB / 50 MB/sec = 8.5 + 4.2 + 1.0 = 13.7 ms

Therefore, disk performance is

Maximum IOPS (using average seeks) per disk = 1/13.0 ms ≈ 77 IOPS
Maximum IOPS (using average seeks) per disk = 1/13.7 ms ≈ 73 IOPS

or 292 to 308 IOPS for the four disks. The final link in the chain is the network that connects the computers to the outside world. The link speed determines the limit:

Maximum IOPS per 1000 Mbit Ethernet link = 1000 Mbit / (16 K × 8) = 7812 IOPS
Maximum IOPS per 1000 Mbit Ethernet link = 1000 Mbit / (50 K × 8) = 2500 IOPS

Clearly, the performance bottleneck of the GB2000 is the disks. The IOPS for the whole rack is 40 × 308 or 12,320 IOPS to 40 × 292 or 11,680 IOPS. The network switch would be the bottleneck if it couldn’t support 12,320 × 16 K × 8 or 1.6 Gbits/sec for 16 KB blocks and 11,680 × 50 K × 8 or 4.7 Gbits/sec for 50 KB blocks. We assume that the extra 8 Gbit ports of the 48-port switch connect the rack to the rest of the world, so it could support the full IOPS of the collective 160 disks in the rack. Using these assumptions, the cost is 40 × ($500 + 4 × $375) + $3000 + $1500 or $84,500 for an 80 TB rack. The disks themselves are almost 60% of the cost. The cost per terabyte is almost $1000, which is about a factor of 10 to 15 better than the storage cluster from the prior edition in 2001. The cost per IOPS is about $7.
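This weakest-link analysis is easy to repeat for other configurations with a short program. The C sketch below is our own code, using only the assumptions listed above (and decimal kilobytes, as in the example); for 16 KB I/Os it reproduces the roughly 308 IOPS per node, limited by the disks, and about 12,320 IOPS per rack.

#include <stdio.h>

static double min2(double a, double b) { return a < b ? a : b; }

int main(void) {
    double io_bytes = 16e3;                                 /* 16 KB per I/O */
    double cpu  = 1000e6 / (50e3 + 100e3);                  /* 1000 MIPS / instructions per I/O */
    double mem  = 266e6 * 8 / io_bytes;                     /* DDR266 x 8 bytes per transfer */
    double bus  = 2 * 133e6 / io_bytes;                     /* two 133 MB/sec PATA buses */
    double ctrl = 2.0 / (0.1e-3 + io_bytes / 133e6);        /* assuming one controller per bus */
    double disk = 4.0 / (8.5e-3 + 0.5 / (7200.0 / 60.0) + io_bytes / 50e6);  /* four disks */
    double net  = 1000e6 / (io_bytes * 8);                  /* one 1000 Mbit Ethernet link */

    double node = min2(min2(cpu, mem), min2(min2(bus, ctrl), min2(disk, net)));
    printf("per-node limit %.0f IOPS (disks); 40-node rack %.0f IOPS\n", node, 40 * node);
    return 0;
}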


Calculating MTTF of the TB-80 Cluster

Internet services such as Google rely on many copies of the data at the application level to provide dependability, often at different geographic sites to protect against environmental faults as well as hardware faults. Hence, the Internet Archive has two copies of the data in each site and has sites in San Francisco, Amsterdam, and Alexandria, Egypt. Each site maintains a duplicate copy of the high-value content—music, books, film, and video—and a single copy of the historical Web crawls. To keep costs low, there is no redundancy in the 80 TB rack.

Example

Let’s look at the resulting mean time to fail of the rack. Rather than use the manufacturer’s quoted MTTF of 600,000 hours, we’ll use data from a recent survey of disk drives [Gray and van Ingen 2005]. As mentioned in Chapter 1, about 3% to 7% of ATA drives fail per year, for an MTTF of about 125,000 to 300,000 hours. Make the following assumptions, again assuming exponential lifetimes:

■ CPU/memory/enclosure MTTF is 1,000,000 hours.
■ PATA Disk MTTF is 125,000 hours.
■ PATA controller MTTF is 500,000 hours.
■ Ethernet Switch MTTF is 500,000 hours.
■ Power supply MTTF is 200,000 hours.
■ Fan MTTF is 200,000 hours.
■ PATA cable MTTF is 1,000,000 hours.

Answer

Collecting these together, we compute these failure rates:

Failure rate = 40/1,000,000 + 160/125,000 + 40/500,000 + 1/500,000 + 40/200,000 + 40/200,000 + 80/1,000,000
             = (40 + 1280 + 80 + 2 + 200 + 200 + 80) / 1,000,000 hours = 1882 / 1,000,000 hours

The MTTF for the system is just the inverse of the failure rate:

MTTF = 1/Failure rate = 1,000,000 hours / 1882 = 531 hours

That is, given these assumptions about the MTTF of components, something in a rack fails on average every 3 weeks. About 70% of the failures would be the disks, and about 20% would be fans or power supplies.
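Because failure rates of independent components simply add, the calculation above is easy to automate. The C sketch below is our own code, using the component counts and MTTF assumptions listed above; it reproduces the 1882 failures per million hours and the 531-hour MTTF.

#include <stdio.h>

int main(void) {
    /* {count, MTTF in hours} for each component class in one TB-80 rack */
    double comp[][2] = {
        {  40, 1000000 },   /* CPU/memory/enclosure */
        { 160,  125000 },   /* PATA disks */
        {  40,  500000 },   /* PATA controllers */
        {   1,  500000 },   /* Ethernet switch */
        {  40,  200000 },   /* power supplies */
        {  40,  200000 },   /* fans */
        {  80, 1000000 },   /* PATA cables */
    };
    double rate = 0.0;                              /* failures per hour */
    for (unsigned i = 0; i < sizeof(comp) / sizeof(comp[0]); i++)
        rate += comp[i][0] / comp[i][1];
    printf("failure rate %.0f per million hours, MTTF %.0f hours (about %.1f weeks)\n",
           rate * 1e6, 1.0 / rate, 1.0 / rate / 168.0);
    return 0;
}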

D.8 Putting It All Together: NetApp FAS6000 Filer

Network Appliance entered the storage market in 1992 with a goal of providing an easy-to-operate file server running NFS using their own log-structured file system and a RAID 4 disk array. The company later added support for the Windows CIFS


file system and a RAID 6 scheme called row-diagonal parity or RAID-DP (see page D-8). To support applications that want access to raw data blocks without the overhead of a file system, such as database systems, NetApp filers can serve data blocks over a standard Fibre Channel interface. NetApp also supports iSCSI, which allows SCSI commands to run over a TCP/IP network, thereby allowing the use of standard networking gear to connect servers to storage, such as Ethernet, and hence at a greater distance.

The latest hardware product is the FAS6000. It is a multiprocessor based on the AMD Opteron microprocessor connected using its HyperTransport links. The microprocessors run the NetApp software stack, including NFS, CIFS, RAID-DP, SCSI, and so on. The FAS6000 comes as either a dual processor (FAS6030) or a quad processor (FAS6070). As mentioned in Chapter 5, DRAM is distributed to each microprocessor in the Opteron. The FAS6000 connects 8 GB of DDR2700 to each Opteron, yielding 16 GB for the FAS6030 and 32 GB for the FAS6070. As mentioned in Chapter 4, the DRAM bus is 128 bits wide, plus extra bits for SEC/DED memory. Both models dedicate four HyperTransport links to I/O.

As a filer, the FAS6000 needs a lot of I/O to connect to the disks and to connect to the servers. The integrated I/O consists of:

■ 8 Fibre Channel (FC) controllers and ports
■ 6 Gigabit Ethernet links
■ 6 slots for x8 (2 GB/sec) PCI Express cards
■ 3 slots for PCI-X 133 MHz, 64-bit cards
■ Standard I/O options such as IDE, USB, and 32-bit PCI

The 8 Fibre Channel controllers can each be attached to 6 shelves containing 14 3.5-inch FC disks. Thus, the maximum number of drives for the integrated I/O is 8 × 6 × 14 or 672 disks. Additional FC controllers can be added to the option slots to connect up to 1008 drives, to reduce the number of drives per FC network so as to reduce contention, and so on. At 500 GB per FC drive, if we assume the RAID-DP group is 14 data disks and 2 check disks, the available data capacity is 294 TB for 672 disks and 441 TB for 1008 disks. It can also connect to Serial ATA disks via a Fibre Channel to SATA bridge controller, which, as its name suggests, allows FC and SATA to communicate.

The six 1-gigabit Ethernet links connect to servers to make the FAS6000 look like a file server if running NFS or CIFS or like a block server if running iSCSI.

For greater dependability, FAS6000 filers can be paired so that if one fails, the other can take over. Clustered failover requires that both filers have access to all disks in the pair of filers using the FC interconnect. This interconnect also allows each filer to have a copy of the log data in the NVRAM of the other filer and to keep the clocks of the pair synchronized. The health of the filers is constantly monitored, and failover happens automatically.


The healthy filer maintains its own network identity and its own primary functions, but it also assumes the network identity of the failed filer and handles all its data requests via a virtual filer until an administrator restores the data service to the original state.
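As a quick check on the capacity arithmetic above, the following C sketch (our own, using the 500 GB drive size and the 14 data plus 2 check disks assumed per RAID-DP group) reproduces the 294 TB and 441 TB figures.

#include <stdio.h>

int main(void) {
    const double drive_tb = 0.5;            /* 500 GB per FC drive */
    const double data_fraction = 14.0 / 16; /* 14 data disks in a 16-disk RAID-DP group */
    int drives[] = { 672, 1008 };           /* integrated I/O vs. added FC controllers */
    for (int i = 0; i < 2; i++)
        printf("%4d drives -> %.0f TB of data capacity\n",
               drives[i], drives[i] * data_fraction * drive_tb);
    return 0;
}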

D.9 Fallacies and Pitfalls

Fallacy

Components fail fast

A good deal of the fault-tolerant literature is based on the simplifying assumption that a component operates perfectly until a latent error becomes effective, and then a failure occurs that stops the component. The Tertiary Disk project had the opposite experience. Many components started acting strangely long before they failed, and it was generally up to the system operator to determine whether to declare a component as failed. The component would generally be willing to continue to act in violation of the service agreement until an operator “terminated” that component. Figure D.20 shows the history of four drives that were terminated, and the number of hours they started acting strangely before they were replaced.

Fallacy

Computer systems achieve 99.999% availability (“five nines”), as advertised

Marketing departments of companies making servers started bragging about the availability of their computer hardware; in terms of Figure D.21, they claim availability of 99.999%, nicknamed five nines. Even the marketing departments of operating system companies tried to give this impression. Five minutes of unavailability per year is certainly impressive, but given the failure data collected in surveys, it’s hard to believe.

Messages in system log for failed disk                                                 Number of log messages   Duration (hours)
Hardware Failure (Peripheral device write fault [for] Field Replaceable Unit)          1763                     186
Not Ready (Diagnostic failure: ASCQ = Component ID [of] Field Replaceable Unit)        1460                     90
Recovered Error (Failure Prediction Threshold Exceeded [for] Field Replaceable Unit)   1313                     5
Recovered Error (Failure Prediction Threshold Exceeded [for] Field Replaceable Unit)   431                      17

Figure D.20 Record in system log for 4 of the 368 disks in Tertiary Disk that were replaced over 18 months. See Talagala and Patterson [1999]. These messages, matching the SCSI specification, were placed into the system log by device drivers. Messages started occurring as much as a week before one drive was replaced by the operator. The third and fourth messages indicate that the drive’s failure prediction mechanism detected and predicted imminent failure, yet it was still hours before the drives were replaced by the operator.


Unavailability (minutes per year)   Availability (percent)   Availability class (“number of nines”)
50,000                              90%                      1
5000                                99%                      2
500                                 99.9%                    3
50                                  99.99%                   4
5                                   99.999%                  5
0.5                                 99.9999%                 6
0.05                                99.99999%                7

Figure D.21 Minutes unavailable per year to achieve availability class. (From Gray and Siewiorek [1991].) Note that five nines mean unavailable five minutes per year.

For example, Hewlett-Packard claims that the HP-9000 server hardware and HP-UX operating system can deliver a 99.999% availability guarantee “in certain pre-defined, pre-tested customer environments” (see Hewlett-Packard [1998]). This guarantee does not include failures due to operator faults, application faults, or environmental faults, which are likely the dominant fault categories today. Nor does it include scheduled downtime. It is also unclear what the financial penalty is to a company if a system does not match its guarantee.

Microsoft also promulgated a five nines marketing campaign. In January 2001, www.microsoft.com was unavailable for 22 hours. For its Web site to achieve 99.999% availability, it will require a clean slate for 250 years. In contrast to marketing suggestions, well-managed servers typically achieve 99% to 99.9% availability.
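The arithmetic behind such claims is worth doing once. The C sketch below (our own illustration) converts availability levels into allowed minutes of downtime per year, as in Figure D.21, and shows why a single 22-hour outage takes roughly 250 years to average out at the five-nines level.

#include <stdio.h>

int main(void) {
    double minutes_per_year = 365.25 * 24 * 60;
    double levels[] = { 0.99, 0.999, 0.9999, 0.99999 };
    for (int i = 0; i < 4; i++)     /* Figure D.21 rounds these to 5000, 500, 50, and 5 minutes */
        printf("%.3f%% -> %.1f minutes of downtime per year\n",
               levels[i] * 100, (1 - levels[i]) * minutes_per_year);

    double outage_minutes = 22 * 60;    /* the January 2001 www.microsoft.com outage */
    printf("22 hours at five nines takes %.0f years to average out\n",
           outage_minutes / ((1 - 0.99999) * minutes_per_year));
    return 0;
}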

Pitfall

Where a function is implemented affects its reliability

In theory, it is fine to move the RAID function into software. In practice, it is very difficult to make it work reliably. The software culture is generally based on eventual correctness via a series of releases and patches. It is also difficult to isolate from other layers of software. For example, proper software behavior is often based on having the proper version and patch release of the operating system. Thus, many customers have lost data due to software bugs or incompatibilities in environment in software RAID systems. Obviously, hardware systems are not immune to bugs, but the hardware culture tends to place a greater emphasis on testing correctness in the initial release. In addition, the hardware is more likely to be independent of the version of the operating system.

Fallacy

Operating systems are the best place to schedule disk accesses

Higher-level interfaces such as ATA and SCSI offer logical block addresses to the host operating system. Given this high-level abstraction, the best an OS can do is to try to sort the logical block addresses into increasing order. Since only the disk knows the mapping of the logical addresses onto the physical geometry of sectors, tracks, and surfaces, it can reduce the rotational and seek latencies.


For example, suppose the workload is four reads [Anderson 2003]:

Operation   Starting LBA   Length
Read        724            8
Read        100            16
Read        9987           1
Read        26             128

The host might reorder the four reads into logical block order:

Read        26             128
Read        100            16
Read        724            8
Read        9987           1

Depending on the relative location of the data on the disk, reordering could make it worse, as Figure D.22 shows. The disk-scheduled reads complete in three-quarters of a disk revolution, but the OS-scheduled reads take three revolutions.

Fallacy

The time of an average seek of a disk in a computer system is the time for a seek of one-third the number of cylinders

This fallacy comes from confusing the way manufacturers market disks with the expected performance, and from the false assumption that seek times are linear in distance. The one-third-distance rule of thumb comes from calculating the distance of a seek from one random location to another random location, not including the current track and assuming there is a large number of tracks. In the past, manufacturers listed the seek of this distance to offer a consistent basis for comparison.

Figure D.22 Example showing OS versus disk schedule accesses, labeled host-ordered versus drive-ordered. The former takes 3 revolutions to complete the 4 reads, while the latter completes them in just 3/4 of a revolution. (From Anderson [2003].)



(Today, they calculate the “average” by timing all seeks and dividing by the number.) Assuming (incorrectly) that seek time is linear in distance, and using the manufacturer’s reported minimum and “average” seek times, a common technique to predict seek time is

Timeseek = Timeminimum + (Distance / Distanceaverage) × (Timeaverage − Timeminimum)

The fallacy concerning seek time is twofold. First, seek time is not linear with distance; the arm must accelerate to overcome inertia, reach its maximum traveling speed, decelerate as it reaches the requested position, and then wait to allow the arm to stop vibrating (settle time). Moreover, sometimes the arm must pause to control vibrations. For disks with more than 200 cylinders, Chen and Lee [1995] modeled the seek distance as:

Seek time(Distance) = a × √(Distance − 1) + b × (Distance − 1) + c

where a, b, and c are selected for a particular disk so that this formula will match the quoted times for Distance = 1, Distance = max, and Distance = 1/3 max. Figure D.23 plots this equation versus the fallacy equation. Unlike the first equation, the square root of the distance reflects acceleration and deceleration.

a = (−10 × Timemin + 15 × Timeavg − 5 × Timemax) / (3 × √(Number of cylinders))

b = (7 × Timemin − 15 × Timeavg + 8 × Timemax) / (3 × Number of cylinders)

c = Timemin

Figure D.23 Seek time versus seek distance for sophisticated model versus naive model. Chen and Lee [1995] found that the equations shown above for parameters a, b, and c worked well for several disks.

The second problem is that the average in the product specification would only be true if there were no locality to disk activity. Fortunately, there is both temporal and spatial locality (see page B-2 in Appendix B). For example, Figure D.24 shows sample measurements of seek distances for two workloads: a UNIX time-sharing workload and a business-processing workload.
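Before turning to the locality data in Figure D.24, it may help to see how differently the two seek-time models behave. The C sketch below is our own illustration; the minimum, average, and maximum seek times and the cylinder count are made-up values, not measurements of a real disk.

#include <stdio.h>
#include <math.h>

/* Naive model: linear interpolation from the quoted minimum and "average" times. */
static double naive_seek(double d, double d_avg, double tmin, double tavg) {
    return tmin + (d / d_avg) * (tavg - tmin);
}

/* Chen and Lee [1995] square-root model, with a, b, c fit from quoted times. */
static double chen_lee_seek(double d, double cyl, double tmin, double tavg, double tmax) {
    double a = (-10 * tmin + 15 * tavg - 5 * tmax) / (3 * sqrt(cyl));
    double b = (  7 * tmin - 15 * tavg + 8 * tmax) / (3 * cyl);
    return a * sqrt(d - 1) + b * (d - 1) + tmin;
}

int main(void) {
    double cyl = 2500, tmin = 1.0, tavg = 8.0, tmax = 18.0;   /* ms; illustrative only */
    for (double d = 1; d <= cyl; d *= 4)
        printf("distance %6.0f: naive %5.1f ms   Chen-Lee %5.1f ms\n",
               d, naive_seek(d, cyl / 3, tmin, tavg), chen_lee_seek(d, cyl, tmin, tavg, tmax));
    return 0;
}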

Figure D.24 Sample measurements of seek distances for two systems. The measurements on the left were taken on a UNIX time-sharing system. The measurements on the right were taken from a business-processing application in which the disk seek activity was scheduled to improve throughput. Seek distance of 0 means the access was made to the same cylinder. The rest of the numbers show the collective percentage for distances between numbers on the y-axis. For example, 11% for the bar labeled 16 in the business graph means that the percentage of seeks between 1 and 16 cylinders was 11%. The UNIX measurements stopped at 200 of the 1000 cylinders, but this captured 85% of the accesses. The business measurements tracked all 816 cylinders of the disks. The only seek distances with 1% or greater of the seeks that are not in the graph are 224 with 4%, and 304, 336, 512, and 624, each having 1%. This total is 94%, with the difference being small but nonzero distances in other categories. Measurements courtesy of Dave Anderson of Seagate.

Notice the high percentage of disk accesses to the same cylinder, labeled distance 0 in the graphs, in both workloads. Thus, this fallacy couldn’t be more misleading.

D.10 Concluding Remarks

Storage is one of those technologies that we tend to take for granted. And yet, if we look at the true status of things today, storage is king. One can even argue that servers, which have become commodities, are now becoming peripheral to storage devices. Driving that point home are some estimates from IBM, which expects storage sales to surpass server sales in the next two years.

Michael Vizard
Editor-in-chief, Infoworld (August 11, 2001)

As their value is becoming increasingly evident, storage systems have become the target of innovation and investment. The challenges for storage systems today are dependability and maintainability. Not only do users want to be sure their data are never lost (reliability),


applications today increasingly demand that the data are always available to access (availability). Despite improvements in hardware and software reliability and fault tolerance, the awkwardness of maintaining such systems is a problem both for cost and for availability. A widely mentioned statistic is that customers spend $6 to $8 operating a storage system for every $1 of purchase price. When dependability is attacked by having many redundant copies at a higher level of the system—such as for search—then very large systems can be sensitive to the price-performance of the storage components. Today, challenges in storage dependability and maintainability dominate the challenges of I/O.

D.11 Historical Perspective and References

Section M.9 (available online) covers the development of storage devices and techniques, including who invented disks, the story behind RAID, and the history of operating systems and databases. References for further reading are included.

Case Studies with Exercises by Andrea C. Arpaci-Dusseau and Remzi H. Arpaci-Dusseau

Case Study 1: Deconstructing a Disk

Concepts illustrated by this case study

■ Performance Characteristics
■ Microbenchmarks

The internals of a storage system tend to be hidden behind a simple interface, that of a linear array of blocks. There are many advantages to having a common interface for all storage systems: An operating system can use any storage system without modification, and yet the storage system is free to innovate behind this interface. For example, a single disk can map its internal < sector, track, surface > geometry to the linear array in whatever way achieves the best performance; similarly, a multidisk RAID system can map the blocks on any number of disks to this same linear array. However, this fixed interface has a number of disadvantages, as well; in particular, the operating system is not able to perform some performance, reliability, and security optimizations without knowing the precise layout of its blocks inside the underlying storage system. In this case study, we will explore how software can be used to uncover the internal structure of a storage system hidden behind a block-based interface. The basic idea is to fingerprint the storage system: by running a well-defined workload on top of the storage system and measuring the amount of time required for different requests, one is able to infer a surprising amount of detail about the underlying system.


The Skippy algorithm, from work by Nisha Talagala and colleagues at the University of California–Berkeley, uncovers the parameters of a single disk. The key is to factor out disk rotational effects by making consecutive seeks to individual sectors with addresses that differ by a linearly increasing amount (increasing by 1, 2, 3, and so forth). Thus, the basic algorithm skips through the disk, increasing the distance of the seek by one sector before every write, and outputs the distance and time for each write. The raw device interface is used to avoid file system optimizations. The SECTOR_SIZE is set equal to the minimum amount of data that can be read at once from the disk (e.g., 512 bytes). (Skippy is described in more detail in Talagala and Patterson [1999].)

fd = open("raw disk device");          /* raw device, so the file system cannot reorder or cache */
for (i = 0; i < measurements; i++) {
    begin_time = gettime();
    lseek(fd, i*SECTOR_SIZE, SEEK_CUR);  /* seek i sectors past the current position */
    write(fd, buffer, SECTOR_SIZE);
    interval_time = gettime() - begin_time;
    printf("Stride: %d Time: %d\n", i, interval_time);
}
close(fd);

By graphing the time required for each write as a function of the seek distance, one can infer the minimal transfer time (with no seek or rotational latency), head switch time, cylinder switch time, rotational latency, and the number of heads in the disk. A typical graph will have four distinct lines, each with the same slope, but with different offsets. The highest and lowest lines correspond to requests that incur different amounts of rotational delay, but no cylinder or head switch costs; the difference between these two lines reveals the rotational latency of the disk. The second lowest line corresponds to requests that incur a head switch (in addition to increasing amounts of rotational delay). Finally, the third line corresponds to requests that incur a cylinder switch (in addition to rotational delay).

D.1 [10/10/10/10/10] < D.2 > The results of running Skippy are shown for a mock disk (Disk Alpha) in Figure D.25.

a. [10] < D.2 > What is the minimal transfer time?
b. [10] < D.2 > What is the rotational latency?
c. [10] < D.2 > What is the head switch time?
d. [10] < D.2 > What is the cylinder switch time?
e. [10] < D.2 > What is the number of disk heads?

Figure D.25 Results from running Skippy on Disk Alpha.

D.2 [25] < D.2 > Draw an approximation of the graph that would result from running Skippy on Disk Beta, a disk with the following parameters:

■ Minimal transfer time, 2.0 ms
■ Rotational latency, 6.0 ms
■ Head switch time, 1.0 ms
■ Cylinder switch time, 1.5 ms
■ Number of disk heads, 4
■ Sectors per track, 100

D.3 [10/10/10/10/10/10/10] < D.2 > Implement and run the Skippy algorithm on a disk drive of your choosing.

a. [10] < D.2 > Graph the results of running Skippy. Report the manufacturer and model of your disk.
b. [10] < D.2 > What is the minimal transfer time?
c. [10] < D.2 > What is the rotational latency?
d. [10] < D.2 > What is the head switch time?
e. [10] < D.2 > What is the cylinder switch time?
f. [10] < D.2 > What is the number of disk heads?
g. [10] < D.2 > Do the results of running Skippy on a real disk differ in any qualitative way from that of the mock disk?


Case Study 2: Deconstructing a Disk Array

Concepts illustrated by this case study

■ Performance Characteristics
■ Microbenchmarks

The Shear algorithm, from work by Timothy Denehy and colleagues at the University of Wisconsin [Denehy et al. 2004], uncovers the parameters of a RAID system. The basic idea is to generate a workload of requests to the RAID array and time those requests; by observing which sets of requests take longer, one can infer which blocks are allocated to the same disk. We define RAID properties as follows. Data are allocated to disks in the RAID at the block level, where a block is the minimal unit of data that the file system reads or writes from the storage system; thus, block size is known by the file system and the fingerprinting software. A chunk is a set of blocks that is allocated contiguously within a disk. A stripe is a set of chunks across each of D data disks. Finally, a pattern is the minimum sequence of data blocks such that block offset i within the pattern is always located on disk j. D.4

[20/20] < D.2 > One can uncover the pattern size with the following code. The code accesses the raw device to avoid file system optimizations. The key to all of the Shear algorithms is to use random requests to avoid triggering any of the prefetch or caching mechanisms within the RAID or within individual disks. The basic idea of this code sequence is to access N random blocks at a fixed interval p within the RAID array and to measure the completion time of each interval.

for (p = BLOCKSIZE; p <= maxpatternsize; p += BLOCKSIZE) {
    for (i = 0; i < N; i++) {
        request[i] = random()*p;    /* a random block at a multiple of the assumed pattern size p */
    }
    begin_time = gettime();
    issue all request[N] to raw device in parallel;
    wait for all request[N] to complete;
    interval_time = gettime() - begin_time;
    printf("PatternSize: %d Time: %d\n", p, interval_time);
}

a. [20] < D.2 > Figure D.26 shows the results of running the pattern size algorithm on an unknown RAID system.


Figure D.26 Results from running the pattern size algorithm of Shear on a mock storage system.

What is the pattern size of this storage system?



What do the measured times of 0.4, 0.8, and 1.6 seconds correspond to in this storage system?



If this is a RAID 0 array, then how many disks are present?



If this is a RAID 0 array, then what is the chunk size?

b. [20] < D.2 > Draw the graph that would result from running this Shear code on a storage system with the following characteristics:

■ Number of requests, N = 1000
■ Time for a random read on disk, 5 ms
■ RAID level, RAID 0
■ Number of disks, 4
■ Chunk size, 8 KB

D.5 [20/20] < D.2 > One can uncover the chunk size with the following code. The basic idea is to perform reads from N patterns chosen at random but always at controlled offsets, c and c − 1, within the pattern.

for (c = 0; c < patternsize; c += BLOCKSIZE) {
    for (i = 0; i < N; i++) {
        requestA[i] = random()*patternsize + c;
        requestB[i] = random()*patternsize + (c-1)%patternsize;
    }
    begin_time = gettime();
    issue all requestA[N] and requestB[N] to raw device in parallel;
    wait for requestA[N] and requestB[N] to complete;
    interval_time = gettime() - begin_time;
    printf("ChunkSize: %d Time: %d\n", c, interval_time);
}


If you run this code and plot the measured time as a function of c, then you will see that the measured time is lowest when the requestA and requestB reads fall on two different disks. Thus, the values of c with low times correspond to the chunk boundaries between disks of the RAID.

a. [20] < D.2 > Figure D.27 shows the results of running the chunk size algorithm on an unknown RAID system.

What is the chunk size of this storage system?



What do the measured times of 0.75 and 1.5 seconds correspond to in this storage system?

b. [20] < D.2 > Draw the graph that would result from running this Shear code on a storage system with the following characteristics:

Number of requests, N = 1000



Time for a random read on disk, 5 ms



RAID level, RAID 0



Number of disks, 8



Chunk size, 12 KB

Figure D.27 Results from running the chunk size algorithm of Shear on a mock storage system.

D.6 [10/10/10/10] < D.2 > Finally, one can determine the layout of chunks to disks with the following code. The basic idea is to select N random patterns and to exhaustively read together all pairwise combinations of the chunks within the pattern.

for (a = 0; a < numchunks; a += chunksize) {
    for (b = a; b < numchunks; b += chunksize) {
        for (i = 0; i < N; i++) {
            requestA[i] = random()*patternsize + a;
            requestB[i] = random()*patternsize + b;
        }
        begin_time = gettime();
        issue all requestA[N] and requestB[N] to raw device in parallel;
        wait for all requestA[N] and requestB[N] to complete;
        interval_time = gettime() - begin_time;
        printf("A: %d B: %d Time: %d\n", a, b, interval_time);
    }
}

After running this code, you can report the measured time as a function of a and b. The simplest way to graph this is to create a two-dimensional table with a and b as the parameters and the time scaled to a shaded value; we use darker shadings for faster times and lighter shadings for slower times. Thus, a light shading indicates that the two offsets of a and b within the pattern fall on the same disk. Figure D.28 shows the results of running the layout algorithm on a storage system that is known to have a pattern size of 384 KB and a chunk size of 32 KB.

a. [20] < D.2 > How many chunks are in a pattern?
b. [20] < D.2 > Which chunks of each pattern appear to be allocated on the same disks?
c. [20] < D.2 > How many disks appear to be in this storage system?
d. [20] < D.2 > Draw the likely layout of blocks across the disks.

D.7

[20] < D.2 > Draw the graph that would result from running the layout algorithm on the storage system shown in Figure D.29. This storage system has four disks and a chunk size of four 4 KB blocks (16 KB) and is using a RAID 5 Left-Asymmetric layout.

Figure D.28 Results from running the layout algorithm of Shear on a mock storage system.

Parity: RAID 5 Left-Asymmetric, stripe = 16, pattern = 48

Figure D.29 A storage system with four disks, a chunk size of four 4 KB blocks, and using a RAID 5 Left-Asymmetric layout. Two repetitions of the pattern are shown.

Case Study 3: RAID Reconstruction

Concepts illustrated by this case study

■ RAID Systems
■ RAID Reconstruction
■ Mean Time to Failure (MTTF)
■ Mean Time until Data Loss (MTDL)
■ Performability
■ Double Failures

A RAID system ensures that data are not lost when a disk fails. Thus, one of the key responsibilities of a RAID is to reconstruct the data that were on a disk when it failed; this process is called reconstruction and is what you will explore in this case study. You will consider both a RAID system that can tolerate one disk failure and a RAID-DP, which can tolerate two disk failures. Reconstruction is commonly performed in two different ways. In offline reconstruction, the RAID devotes all of its resources to performing reconstruction and does not service any requests from the workload. In online reconstruction, the RAID continues to service workload requests while performing the reconstruction; the reconstruction process is often limited to use some fraction of the total bandwidth of the RAID system. How reconstruction is performed impacts both the reliability and the performability of the system. In a RAID 5, data are lost if a second disk fails before the data from the first disk can be recovered; therefore, the longer the reconstruction time (MTTR), the lower the reliability or the mean time until data loss (MTDL). Performability is a metric meant to combine both the performance of a system and its

D-56



Appendix D Storage Systems

availability; it is defined as the performance of the system in a given state multiplied by the probability of that state. For a RAID array, possible states include normal operation with no disk failures, reconstruction with one disk failure, and shutdown due to multiple disk failures.

For these exercises, assume that you have built a RAID system with six disks, plus a sufficient number of hot spares. Assume that each disk is the 37 GB SCSI disk shown in Figure D.3 and that each disk can sequentially read data at a peak of 142 MB/sec and sequentially write data at a peak of 85 MB/sec. Assume that the disks are connected to an Ultra320 SCSI bus that can transfer a total of 320 MB/sec. You can assume that each disk failure is independent and ignore other potential failures in the system. For the reconstruction process, you can assume that the overhead for any XOR computation or memory copying is negligible. During online reconstruction, assume that the reconstruction process is limited to use a total bandwidth of 10 MB/sec from the RAID system.

D.8

[10] < D.2 > Assume that you have a RAID 4 system with six disks. Draw a simple diagram showing the layout of blocks across disks for this RAID system.

D.9

[10] < D.2, D.4 > When a single disk fails, the RAID 4 system will perform reconstruction. What is the expected time until a reconstruction is needed?

D.10

[10/10/10] < D.2, D.4 > Assume that reconstruction of the RAID 4 array begins at time t.

a. [10] < D.2, D.4 > What read and write operations are required to perform the reconstruction?
b. [10] < D.2, D.4 > For offline reconstruction, when will the reconstruction process be complete?
c. [10] < D.2, D.4 > For online reconstruction, when will the reconstruction process be complete?

D.11

[10/10/10/10] < D.2, D.4 > In this exercise, we will investigate the mean time until data loss (MTDL). In RAID 4, data are lost only if a second disk fails before the first failed disk is repaired. a. [10] < D.2, D.4 > What is the likelihood of having a second failure during offline reconstruction? b. [10] < D.2, D.4 > Given this likelihood of a second failure during reconstruction, what is the MTDL for offline reconstruction? c. [10] < D.2, D.4 > What is the likelihood of having a second failure during online reconstruction? d. [10] < D.2, D.4 > Given this likelihood of a second failure during reconstruction, what is the MTDL for online reconstruction?
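The reconstruction-time and MTDL reasoning above follows a common pattern: estimate the repair window, estimate the chance that one of the surviving disks fails inside that window, and scale the time between first failures accordingly. The C sketch below is illustrative only; it assumes independent, exponentially distributed failures, uses a placeholder per-disk MTTF (the real value comes from Figure D.3), and picks a rebuild rate as a stated assumption rather than deriving the actual bottleneck.

#include <stdio.h>

/* Illustrative sketch of the MTDL reasoning in this case study.
   Assumes independent, exponentially distributed disk failures. */
int main(void) {
    double n_disks      = 6.0;      /* disks in the array                          */
    double mttf_hours   = 1.0e6;    /* per-disk MTTF: placeholder, see Figure D.3  */
    double capacity_gb  = 37.0;     /* per-disk capacity                           */
    double offline_mbps = 85.0;     /* assumed rebuild rate; substitute the actual
                                       bottleneck (write rate, read rate, or bus)  */
    double online_mbps  = 10.0;     /* online reconstruction bandwidth limit       */

    /* Reconstruction time = data to rebuild / rebuild bandwidth (in hours). */
    double t_off = capacity_gb * 1024.0 / offline_mbps / 3600.0;
    double t_on  = capacity_gb * 1024.0 / online_mbps  / 3600.0;

    /* With one disk already failed, (n-1) survivors remain.  The chance that
       any of them fails during a repair window t is roughly (n-1)*t/MTTF
       when t << MTTF. */
    double p_off = (n_disks - 1.0) * t_off / mttf_hours;
    double p_on  = (n_disks - 1.0) * t_on  / mttf_hours;

    /* First failures arrive about every MTTF/n hours; only a fraction p of
       them lead to data loss, so MTDL is approximately (MTTF/n)/p. */
    printf("offline: repair %.3f h, P(2nd) %.2e, MTDL %.3e h\n",
           t_off, p_off, (mttf_hours / n_disks) / p_off);
    printf("online : repair %.3f h, P(2nd) %.2e, MTDL %.3e h\n",
           t_on, p_on, (mttf_hours / n_disks) / p_on);
    return 0;
}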

D.12

[10] < D.2, D.4 > What is performability for the RAID 4 array for offline reconstruction? Calculate the performability using IOPS, assuming a random readonly workload that is evenly distributed across the disks of the RAID 4 array.


D.13

[10] < D.2, D.4 > What is the performability for the RAID 4 array for online reconstruction? During online repair, you can assume that the IOPS drop to 70% of their peak rate. Does offline or online reconstruction lead to better performability?

D.14

[10] < D.2, D.4 > RAID 6 is used to tolerate up to two simultaneous disk failures. Assume that you have a RAID 6 system based on row-diagonal parity, or RAID-DP; your six-disk RAID-DP system is based on RAID 4, with p = 5, as shown in Figure D.5. If data disk 0 and data disk 3 fail, how can those disks be reconstructed? Show the sequence of steps that are required to compute the missing blocks in the first four stripes.

Case Study 4: Performance Prediction for RAIDs

Concepts illustrated by this case study
■ RAID Levels
■ Queuing Theory
■ Impact of Workloads
■ Impact of Disk Layout

In this case study, you will explore how simple queuing theory can be used to predict the performance of the I/O system. You will investigate how both storage system configuration and the workload influence service time, disk utilization, and average response time.

The configuration of the storage system has a large impact on performance. Different RAID levels can be modeled using queuing theory in different ways. For example, a RAID 0 array containing N disks can be modeled as N separate systems of M/M/1 queues, assuming that requests are appropriately distributed across the N disks. The behavior of a RAID 1 array depends upon the workload: A read operation can be sent to either mirror, whereas a write operation must be sent to both disks. Therefore, for a read-only workload, a two-disk RAID 1 array can be modeled as an M/M/2 queue, whereas for a write-only workload, it can be modeled as an M/M/1 queue. The behavior of a RAID 4 array containing N disks also depends upon the workload: A read will be sent to a particular data disk, whereas writes must all update the parity disk, which becomes the bottleneck of the system. Therefore, for a read-only workload, RAID 4 can be modeled as N − 1 separate systems, whereas for a write-only workload, it can be modeled as one M/M/1 queue.

The layout of blocks within the storage system can have a significant impact on performance. Consider a single disk with a 40 GB capacity. If the workload randomly accesses 40 GB of data, then the layout of those blocks on the disk does not have much of an impact on performance. However, if the workload randomly accesses only half of the disk’s capacity (i.e., 20 GB of data on that disk), then layout does matter: To reduce seek time, the 20 GB of data can be compacted within 20 GB of consecutive tracks instead of being spread uniformly over the entire 40 GB capacity.
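For reference, the M/M/1 bookkeeping used throughout these exercises can be captured in a few lines of C. This is a sketch only: the arrival rate and service time below are placeholders, and the formulas (utilization = arrival rate × service time, queueing delay = service time × utilization/(1 − utilization), Little's law for queue length) are the standard ones from the queuing theory discussion in this appendix.

#include <stdio.h>

/* Minimal M/M/1 helper: given an arrival rate (requests/sec) and a mean
   service time (sec), report utilization, queueing delay, queue length,
   and response time.  Valid only while utilization < 1. */
static void mm1(double arrival_rate, double service_time) {
    double util      = arrival_rate * service_time;          /* rho           */
    double t_queue   = service_time * util / (1.0 - util);   /* time in queue */
    double len_queue = arrival_rate * t_queue;               /* Little's law  */
    double response  = service_time + t_queue;
    printf("util %.2f  wait %.2f ms  queue len %.2f  response %.2f ms\n",
           util, t_queue * 1e3, len_queue, response * 1e3);
}

int main(void) {
    /* Placeholder values in the spirit of this case study: a stream of small
       random requests against one disk with a 5 ms average service time
       (the service time itself comes from the seek-distance model below). */
    mm1(167.0, 0.005);
    return 0;
}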


For this problem, we will use a rather simplistic model to estimate the service time of a disk. In this basic model, the average positioning and transfer time for a small random request is a linear function of the seek distance. For the 40 GB disk in this problem, assume that the service time is 5 ms * space utilization. Thus, if the entire 40 GB disk is used, then the average positioning and transfer time for a random request is 5 ms; if only the first 20 GB of the disk is used, then the average positioning and transfer time is 2.5 ms. Throughout this case study, you can assume that the processor sends 167 small random disk requests per second and that these requests are exponentially distributed. You can assume that the size of the requests is equal to the block size of 8 KB. Each disk in the system has a capacity of 40 GB. Regardless of the storage system configuration, the workload accesses a total of 40 GB of data; you should allocate the 40 GB of data across the disks in the system in the most efficient manner. D.15

[10/10/10/10/10] < D.5 > Begin by assuming that the storage system consists of a single 40 GB disk. a. [10] < D.5 > Given this workload and storage system, what is the average service time? b. [10] < D.5 > On average, what is the utilization of the disk? c. [10] < D.5 > On average, how much time does each request spend waiting for the disk? d. [10] < D.5 > What is the mean number of requests in the queue? e. [10] < D.5 > Finally, what is the average response time for the disk requests?

D.16

[10/10/10/10/10/10] < D.2, D.5 > Imagine that the storage system is now configured to contain two 40 GB disks in a RAID 0 array; that is, the data are striped in blocks of 8 KB equally across the two disks with no redundancy. a. [10] < D.2, D.5 > How will the 40 GB of data be allocated across the disks? Given a random request workload over a total of 40 GB, what is the expected service time of each request? b. [10] < D.2, D.5 > How can queuing theory be used to model this storage system? c. [10] < D.2, D.5 > What is the average utilization of each disk? d. [10] < D.2, D.5 > On average, how much time does each request spend waiting for the disk? e. [10] < D.2, D.5 > What is the mean number of requests in each queue? f. [10] < D.2, D.5 > Finally, what is the average response time for the disk requests?

D.17

[20/20/20/20/20] < D.2, D.5 > Instead imagine that the storage system is configured to contain two 40 GB disks in a RAID 1 array; that is, the data are mirrored


across the two disks. Use queuing theory to model this system for a read-only workload. a. [20] < D.2, D.5 > How will the 40 GB of data be allocated across the disks? Given a random request workload over a total of 40 GB, what is the expected service time of each request? b. [20] < D.2, D.5 > How can queuing theory be used to model this storage system? c. [20] < D.2, D.5 > What is the average utilization of each disk? d. [20] < D.2, D.5 > On average, how much time does each request spend waiting for the disk? e. [20] < D.2, D.5 > Finally, what is the average response time for the disk requests? D.18

[10/10] < D.2, D.5 > Imagine that instead of a read-only workload, you now have a write-only workload on a RAID 1 array. a. [10] < D.2, D.5 > Describe how you can use queuing theory to model this system and workload. b. [10] < D.2, D.5 > Given this system and workload, what are the average utilization, average waiting time, and average response time?

Case Study 5: I/O Subsystem Design

Concepts illustrated by this case study
■ RAID Systems
■ Mean Time to Failure (MTTF)
■ Performance and Reliability Trade-Offs

In this case study, you will design an I/O subsystem, given a monetary budget. Your system will have a minimum required capacity and you will optimize for performance, reliability, or both. You are free to use as many disks and controllers as fit within your budget. Here are your building blocks:
■ A 10,000 MIPS CPU costing $1000. Its MTTF is 1,000,000 hours.
■ A 1000 MB/sec I/O bus with room for 20 Ultra320 SCSI buses and controllers.
■ Ultra320 SCSI buses that can transfer 320 MB/sec and support up to 15 disks per bus (these are also called SCSI strings). The SCSI cable MTTF is 1,000,000 hours.
■ An Ultra320 SCSI controller that is capable of 50,000 IOPS, costs $250, and has an MTTF of 500,000 hours.
■ A $2000 enclosure supplying power and cooling to up to eight disks. The enclosure MTTF is 1,000,000 hours, the fan MTTF is 200,000 hours, and the power supply MTTF is 200,000 hours.
■ The SCSI disks described in Figure D.3.
■ Replacing any failed component requires 24 hours.

You may make the following assumptions about your workload:
■ The operating system requires 70,000 CPU instructions for each disk I/O.
■ The workload consists of many concurrent, random I/Os, with an average size of 16 KB.

All of your constructed systems must have the following properties:
■ You have a monetary budget of $28,000.
■ You must provide at least 1 TB of capacity.

D.19

[10] < D.2 > You will begin by designing an I/O subsystem that is optimized only for capacity and performance (and not reliability), specifically IOPS. Discuss the RAID level and block size that will deliver the best performance.

D.20

[20/20/20/20] < D.2, D.4, D.7 > What configuration of SCSI disks, controllers, and enclosures results in the best performance given your monetary and capacity constraints? a. [20] < D.2, D.4, D.7 > How many IOPS do you expect to deliver with your system? b. [20] < D.2, D.4, D.7 > How much does your system cost? c. [20] < D.2, D.4, D.7 > What is the capacity of your system? d. [20] < D.2, D.4, D.7 > What is the MTTF of your system?
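For the MTTF question, recall that the failure rates (reciprocals of MTTF) of independent components add, so the system MTTF is the reciprocal of the summed rates. A minimal sketch of that calculation is shown below; the component counts and the per-disk MTTF are placeholders rather than a proposed design, since the disk's MTTF comes from Figure D.3.

#include <stdio.h>

/* Failure rates of independent components add; the system MTTF is the
   reciprocal of the total rate.  Counts below are placeholders only. */
int main(void) {
    struct { const char *name; double count; double mttf_hours; } part[] = {
        { "CPU",          1,  1.0e6 },
        { "SCSI cable",   4,  1.0e6 },
        { "SCSI ctrl",    4,  5.0e5 },
        { "enclosure",    4,  1.0e6 },
        { "fan",          4,  2.0e5 },
        { "power supply", 4,  2.0e5 },
        { "disk",        32,  1.0e6 },  /* placeholder: substitute the MTTF from Figure D.3 */
    };
    double rate = 0.0;
    for (unsigned i = 0; i < sizeof(part) / sizeof(part[0]); i++)
        rate += part[i].count / part[i].mttf_hours;
    printf("system failure rate: %.3e per hour, MTTF: %.0f hours\n",
           rate, 1.0 / rate);
    return 0;
}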

D.21

[10] < D.2, D.4, D.7 > You will now redesign your system to optimize for reliability, by creating a RAID 10 or RAID 01 array. Your storage system should be robust not only to disk failures but also to controller, cable, power supply, and fan failures as well; specifically, a single component failure should not prohibit accessing both replicas of a pair. Draw a diagram illustrating how blocks are allocated across disks in the RAID 10 and RAID 01 configurations. Is RAID 10 or RAID 01 more appropriate in this environment?

D.22

[20/20/20/20/20] < D.2, D.4, D.7 > Optimizing your RAID 10 or RAID 01 array only for reliability (but staying within your capacity and monetary constraints), what is your RAID configuration? a. [20] < D.2, D.4, D.7 > What is the overall MTTF of the components in your system?


b. [20] < D.2, D.4, D.7 > What is the MTDL of your system? c. [20] < D.2, D.4, D.7 > What is the usable capacity of this system? d. [20] < D.2, D.4, D.7 > How much does your system cost? e. [20] < D.2, D.4, D.7 > Assuming a write-only workload, how many IOPS can you expect to deliver? D.23

[10] < D.2, D.4, D.7 > Assume that you now have access to a disk that has twice the capacity, for the same price. If you continue to design only for reliability, how would you change the configuration of your storage system? Why?

Case Study 6: Dirty Rotten Bits

Concepts illustrated by this case study
■ Partial Disk Failure
■ Failure Analysis
■ Performance Analysis
■ Parity Protection
■ Checksumming

You are put in charge of avoiding the problem of “bit rot”—bits or blocks in a file going bad over time. This problem is particularly important in archival scenarios, where data are written once and perhaps accessed many years later; without taking extra measures to protect the data, the bits or blocks of a file may slowly change or become unavailable due to media errors or other I/O faults. Dealing with bit rot requires two specific components: detection and recovery. To detect bit rot efficiently, one can use checksums over each block of the file in question; a checksum is just a function of some kind that takes a (potentially long) string of data as input and outputs a fixed-size string (the checksum) of the data as output. The property you will exploit is that if the data changes then the computed checksum is very likely to change as well. Once detected, recovering from bit rot requires some form of redundancy. Examples include mirroring (keeping multiple copies of each block) and parity (some extra redundant information, usually more space efficient than mirroring). In this case study, you will analyze how effective these techniques are given various scenarios. You will also write code to implement data integrity protection over a set of files. D.24

[20/20/20] < D.2 > Assume that you will use simple parity protection in Exercises D.24 through D.27. Specifically, assume that you will be computing one parity block for each file in the file system. Further, assume that you will also use a 20-byte MD5 checksum per 4 KB block of each file.


We first tackle the problem of space overhead. According to studies by Douceur and Bolosky [1999], these file size distributions are what is found in modern PCs:

File size   1 KB    2 KB    4 KB    8 KB    16 KB   32 KB   64 KB   128 KB   256 KB   512 KB   1 MB
Fraction    26.6%   11.0%   11.2%   10.9%   9.5%    8.5%    7.1%    5.1%     3.7%     2.4%     4.0%

The study also finds that file systems are usually about half full. Assume that you have a 37 GB disk volume that is roughly half full and follows that same distribution, and answer the following questions: a. [20] < D.2 > How much extra information (both in bytes and as a percent of the volume) must you keep on disk to be able to detect a single error with checksums? b. [20] < D.2 > How much extra information (both in bytes and as a percent of the volume) would you need to be able to both detect a single error with checksums as well as correct it? c. [20] < D.2 > Given this file distribution, is the block size you are using to compute checksums too big, too little, or just right? D.25

[10/10] < D.2, D.3 > One big problem that arises in data protection is error detection. One approach is to perform error detection lazily—that is, wait until a file is accessed, and at that point, check it and make sure the correct data are there. The problem with this approach is that files that are not accessed frequently may slowly rot away and when finally accessed have too many errors to be corrected. Hence, an eager approach is to perform what is sometimes called disk scrubbing— periodically go through all data and find errors proactively. a. [10] < D.2, D.3 > Assume that bit flips occur independently, at a rate of 1 flip per GB of data per month. Assuming the same 20 GB volume that is half full, and assuming that you are using the SCSI disk as specified in Figure D.3 (4 ms seek, roughly 100 MB/sec transfer), how often should you scan through files to check and repair their integrity? b. [10] < D.2, D.3 > At what bit flip rate does it become impossible to maintain data integrity? Again assume the 20 GB volume and the SCSI disk.

D.26

[10/10/10/10] < D.2, D.4 > Another potential cost of added data protection is found in performance overhead. We now study the performance overhead of this data protection approach. a. [10] < D.2, D.4 > Assume we write a 40 MB file to the SCSI disk sequentially, and then write out the extra information to implement our data protection scheme to disk once. How much write traffic (both in total volume of bytes and as a percentage of total traffic) does our scheme generate? b. [10] < D.2, D.4 > Assume we now are updating the file randomly, similar to a database table. That is, assume we perform a series of 4 KB random writes to the file, and each time we perform a single write, we must update the on-disk protection information. Assuming that we perform 10,000 random writes, how


much I/O traffic (both in total volume of bytes and as a percentage of total traffic) does our scheme generate? c. [10] < D.2, D.4 > Now assume that the data protection information is always kept in a separate portion of the disk, away from the file it is guarding (that is, assume for each file A, there is another file Achecksums that holds all the check-sums for A). Hence, one potential overhead we must incur arises upon reads—that is, upon each read, we will use the checksum to detect data corruption. Assume you read 10,000 blocks of 4 KB each sequentially from disk. Assuming a 4 ms average seek cost and a 100 MB/sec transfer rate (like the SCSI disk in Figure D.3), how long will it take to read the file (and corresponding checksums) from disk? What is the time penalty due to adding checksums? d. [10] < D.2, D.4 > Again assuming that the data protection information is kept separate as in part (c), now assume you have to read 10,000 random blocks of 4 KB each from a very large file (much bigger than 10,000 blocks, that is). For each read, you must again use the checksum to ensure data integrity. How long will it take to read the 10,000 blocks from disk, again assuming the same disk characteristics? What is the time penalty due to adding checksums? D.27

[40] < D.2, D.3, D.4 > Finally, we put theory into practice by developing a user-level tool to guard against file corruption. Assume you are to write a simple set of tools to detect and repair data integrity. The first tool is used for checksums and parity. It should be called build and used like this:

build <filename>

The build program should then store the needed checksum and redundancy information for the file filename in a file in the same directory called .filename.cp (so it is easy to find later). A second program is then used to check and potentially repair damaged files. It should be called repair and used like this:

repair <filename>

The repair program should consult the .cp file for the filename in question and verify that all the stored checksums match the computed checksums for the data. If the checksums don’t match for a single block, repair should use the redundant information to reconstruct the correct data and fix the file. However, if two or more blocks are bad, repair should simply report that the file has been corrupted beyond repair. To test your system, we will provide a tool to corrupt files called corrupt. It works as follows:

corrupt <filename> <blocknumber>

All corrupt does is fill the specified block number of the file with random noise. For checksums you will be using MD5. MD5 takes an input string and gives you a


128-bit “fingerprint” or checksum as an output. A great and simple implementation of MD5 is available here: http://sourceforge.net/project/showfiles.php?group_id=42360. Parity is computed with the XOR operator. In C code, you can compute the parity of two blocks, each of size BLOCKSIZE, as follows:

unsigned char block1[BLOCKSIZE];
unsigned char block2[BLOCKSIZE];
unsigned char parity[BLOCKSIZE];

// first, clear parity block
for (int i = 0; i < BLOCKSIZE; i++)
    parity[i] = 0;

// then compute parity; the caret symbol does XOR in C
for (int i = 0; i < BLOCKSIZE; i++) {
    parity[i] = block1[i] ^ block2[i];
}
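Extending that fragment, the same XOR identity is what repair would rely on: XOR-ing the parity block with every surviving block of the file reproduces the one missing block. The helpers below are a sketch of that step under the same BLOCKSIZE convention, not the required tool; the names are illustrative.

#include <string.h>

#define BLOCKSIZE 4096

/* Fold one more data block into a running parity block (call once per block). */
static void parity_add(unsigned char parity[BLOCKSIZE],
                       const unsigned char block[BLOCKSIZE]) {
    for (int i = 0; i < BLOCKSIZE; i++)
        parity[i] ^= block[i];
}

/* Rebuild a single missing block: start from the parity and XOR in every
   surviving block; what remains is the lost block's contents. */
static void parity_rebuild(unsigned char missing[BLOCKSIZE],
                           const unsigned char parity[BLOCKSIZE],
                           const unsigned char *survivors[], int nsurvivors) {
    memcpy(missing, parity, BLOCKSIZE);
    for (int b = 0; b < nsurvivors; b++)
        for (int i = 0; i < BLOCKSIZE; i++)
            missing[i] ^= survivors[b][i];
}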

Case Study 7: Sorting Things Out

Concepts illustrated by this case study
■ Benchmarking
■ Performance Analysis
■ Cost/Performance Analysis
■ Amortization of Overhead
■ Balanced Systems

The database field has a long history of using benchmarks to compare systems. In this question, you will explore one of the benchmarks introduced by Anon. et al. [1985] (see Chapter 1): external, or disk-to-disk, sorting. Sorting is an exciting benchmark for a number of reasons. First, sorting exercises a computer system across all its components, including disk, memory, and processors. Second, sorting at the highest possible performance requires a great deal of expertise about how the CPU caches, operating systems, and I/O subsystems work. Third, it is simple enough to be implemented by a student (see below!). Depending on how much data you have, sorting can be done in one or multiple passes. Simply put, if you have enough memory to hold the entire dataset in memory, you can read the entire dataset into memory, sort it, and then write it out; this is called a “one-pass” sort. If you do not have enough memory, you must sort the data in multiple passes. There are many different approaches possible. One simple approach is to sort each


chunk of the input file and write it to disk; this leaves (input file size)/(memory size) sorted files on disk. Then, you have to merge each sorted temporary file into a final sorted output. This is called a “two-pass” sort. More passes are needed in the unlikely case that you cannot merge all the streams in the second pass. In this case study, you will analyze various aspects of sorting, determining its effectiveness and cost-effectiveness in different scenarios. You will also write your own version of an external sort, measuring its performance on real hardware. D.28

[20/20/20] < D.4 > We will start by configuring a system to complete a sort in the least possible time, with no limits on how much we can spend. To get peak bandwidth from the sort, we have to make sure all the paths through the system have sufficient bandwidth. Assume for simplicity that the time to perform the in-memory sort of keys is linearly proportional to the CPU rate and memory bandwidth of the given machine (e.g., sorting 1 MB of records on a machine with 1 MB/sec of memory bandwidth and a 1 MIPS processor will take 1 second). Assume further that you have carefully written the I/O phases of the sort so as to achieve sequential bandwidth. And, of course, realize that if you don’t have enough memory to hold all of the data at once, the sort will take two passes. One problem you may encounter in performing I/O is that systems often perform extra memory copies; for example, when the read() system call is invoked, data may first be read from disk into a system buffer and then subsequently copied into the specified user buffer. Hence, memory bandwidth during I/O can be an issue. Finally, for simplicity, assume that there is no overlap of reading, sorting, or writing. That is, when you are reading data from disk, that is all you are doing; when sorting, you are just using the CPU and memory bandwidth; when writing, you are just writing data to disk. Your job in this task is to configure a system to extract peak performance when sorting 1 GB of data (i.e., roughly 10 million 100-byte records). Use the following table to make choices about which machine, memory, I/O interconnect, and disks to buy.

Component          Option     Performance   Cost
CPU                Slow       1 GIPS        $200
                   Standard   2 GIPS        $1000
                   Fast       4 GIPS        $2000
Memory             Slow       512 MB/sec    $100/GB
                   Standard   1 GB/sec      $200/GB
                   Fast       2 GB/sec      $500/GB
I/O interconnect   Slow       80 MB/sec     $50
                   Standard   160 MB/sec    $100
                   Fast       320 MB/sec    $400
Disks              Slow       30 MB/sec     $70
                   Standard   60 MB/sec     $120
                   Fast       110 MB/sec    $300


Note: Assume that you are buying a single-processor system and that you can have up to two I/O interconnects. However, the amount of memory and number of disks are up to you (assume there is no limit on disks per I/O interconnect). a. [20] < D.4 > What is the total cost of your machine? (Break this down by part, including the cost of the CPU, amount of memory, number of disks, and I/O bus.) b. [20] < D.4 > How much time does it take to complete the sort of 1 GB worth of records? (Break this down into time spent doing reads from disk, writes to disk, and time spent sorting.) c. [20] < D.4 > What is the bottleneck in your system? D.29

[25/25/25] < D.4 > We will now examine cost-performance issues in sorting. After all, it is easy to buy a high-performing machine; it is much harder to buy a cost-effective one. One place where this issue arises is with the PennySort competition (research.microsoft.com/barc/SortBenchmark/). PennySort asks that you sort as many records as you can for a single penny. To compute this, you should assume that a system you buy will last for 3 years (94,608,000 seconds), and divide this by the total cost in pennies of the machine. The result is your time budget per penny. Our task here will be a little simpler. Assume you have a fixed budget of $2000 (or less). What is the fastest sorting machine you can build? Use the same hardware table as in Exercise D.28 to configure the winning machine. (Hint: You might want to write a little computer program to generate all the possible configurations.) a. [25] < D.4 > What is the total cost of your machine? (Break this down by part, including the cost of the CPU, amount of memory, number of disks, and I/O bus.) b. [25] < D.4 > How does the reading, writing, and sorting time break down with this configuration? c. [25] < D.4 > What is the bottleneck in your system?
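The hint invites a small program, and one possible shape for it is sketched below in C. Everything about the time model is an assumption for illustration: it charges one sequential read and one sequential write per pass, charges the in-memory sort at the slower of memory bandwidth and instruction rate, and simply doubles the work for a two-pass sort. The option tables transcribe the table above, and the $2000 filter corresponds to this exercise's budget.

#include <stdio.h>

/* Crude design-space walker for the sort exercises.  Treat the time model
   as one simple reading of the text, not the intended grading model. */
typedef struct { const char *name; double perf; double cost; } Opt;

static const Opt cpu[]  = { {"slow",1000,200}, {"std",2000,1000}, {"fast",4000,2000} };  /* MIPS, $    */
static const Opt mem[]  = { {"slow",512,100},  {"std",1024,200},  {"fast",2048,500}  };  /* MB/s, $/GB */
static const Opt bus[]  = { {"slow",80,50},    {"std",160,100},   {"fast",320,400}   };  /* MB/s, $    */
static const Opt disk[] = { {"slow",30,70},    {"std",60,120},    {"fast",110,300}   };  /* MB/s, $    */

static double dmin(double a, double b) { return a < b ? a : b; }

int main(void) {
    const double data_mb = 1024.0;                /* 1 GB of 100-byte records */
    double best = 1e30;
    for (int c = 0; c < 3; c++)
     for (int m = 0; m < 3; m++)
      for (double mem_gb = 0.5; mem_gb <= 2.0; mem_gb *= 2.0)
       for (int i = 0; i < 3; i++)
        for (int nio = 1; nio <= 2; nio++)
         for (int d = 0; d < 3; d++)
          for (int nd = 1; nd <= 16; nd++) {
            double cost = cpu[c].cost + mem_gb * mem[m].cost +
                          nio * bus[i].cost + nd * disk[d].cost;
            if (cost > 2000.0) continue;                   /* D.29 budget (assumed filter) */
            int passes = (mem_gb * 1024.0 >= data_mb) ? 1 : 2;
            double io_bw  = dmin(nd * disk[d].perf, nio * bus[i].perf);
            double t_io   = 2.0 * data_mb / io_bw;         /* read + write per pass */
            double t_sort = data_mb / dmin(mem[m].perf, cpu[c].perf);
            double t      = passes * (t_io + t_sort);
            if (t < best) {
                best = t;
                printf("%.1fs $%.0f: cpu %s, %.1f GB %s mem, %d %s bus, %d %s disks\n",
                       t, cost, cpu[c].name, mem_gb, mem[m].name,
                       nio, bus[i].name, nd, disk[d].name);
            }
          }
    return 0;
}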

D.30

[20/20/20] < D.4, D.6 > Getting good disk performance often requires amortization of overhead. The idea is simple: If you must incur an overhead of some kind, do as much useful work as possible after paying the cost and hence reduce its impact. This idea is quite general and can be applied to many areas of computer systems; with disks, it arises with the seek and rotational costs (overheads) that you must incur before transferring data. You can amortize an expensive seek and rotation by transferring a large amount of data. In this exercise, we focus on how to amortize seek and rotational costs during the second pass of a two-pass sort. Assume that when the second pass begins, there are N sorted runs on the disk, each of a size that fits within main memory. Our task here is to read in a chunk from each sorted run and merge the results into a final sorted


output. Note that a read from one run will incur a seek and rotation, as it is very likely that the last read was from a different run. a. [20] < D.4, D.6 > Assume that you have a disk that can transfer at 100 MB/sec, with an average seek cost of 7 ms, and a rotational rate of 10,000 RPM. Assume further that every time you read from a run, you read 1 MB of data and that there are 100 runs each of size 1 GB. Also assume that writes (to the final sorted output) take place in large 1 GB chunks. How long will the merge phase take, assuming I/O is the dominant (i.e., only) cost? b. [20] < D.4, D.6 > Now assume that you change the read size from 1 MB to 10 MB. How is the total time to perform the second pass of the sort affected? c. [20] < D.4, D.6 > In both cases, assume that what we wish to maximize is disk efficiency. We compute disk efficiency as the ratio of the time spent transferring data over the total time spent accessing the disk. What is the disk efficiency in each of the scenarios mentioned above?
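The arithmetic being amortized here, an average seek plus half a rotation paid before each transfer, can be sketched in a few lines of C. The sketch uses the parameters from part (a), charges only the read side of the merge, and treats disk efficiency as transfer time over total access time; all of these simplifications are stated assumptions, not the exercise's required model.

#include <stdio.h>

/* Time and efficiency of reading sorted runs in fixed-size chunks when every
   chunk costs an average seek plus half a rotation before the transfer. */
int main(void) {
    double xfer_mbps  = 100.0;               /* sequential transfer rate */
    double seek_ms    = 7.0;                 /* average seek             */
    double rpm        = 10000.0;             /* rotational rate          */
    double total_mb   = 100.0 * 1024.0;      /* 100 runs of 1 GB         */
    double chunk_mb[] = { 1.0, 10.0 };

    double half_rot_ms = 0.5 * 60000.0 / rpm;   /* 3 ms at 10,000 RPM */
    for (int i = 0; i < 2; i++) {
        double xfer_ms      = chunk_mb[i] / xfer_mbps * 1000.0;
        double per_chunk_ms = seek_ms + half_rot_ms + xfer_ms;
        double nchunks      = total_mb / chunk_mb[i];
        double total_s      = nchunks * per_chunk_ms / 1000.0;
        double efficiency   = xfer_ms / per_chunk_ms;
        printf("%4.0f MB reads: %8.1f s of reading, efficiency %.2f\n",
               chunk_mb[i], total_s, efficiency);
    }
    return 0;
}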

D.31

[40] < D.2, D.4, D.6 > In this exercise, you will write your own external sort. To generate the data set, we provide a tool generate that works as follows:

generate <filename> <size (in MB)>

By running generate, you create a file named filename of size size MB. The file consists of 100-byte records, each with a 10-byte key (the part that must be sorted). We also provide a tool called check that checks whether a given input file is sorted or not. It is run as follows:

check <filename>

The basic one-pass sort does the following: reads in the data, sorts the data, and then writes the data out. However, numerous optimizations are available to you: overlapping reading and sorting, separating keys from the rest of the record for better cache behavior and hence faster sorting, overlapping sorting and writing, and so forth. One important rule is that data must always start on disk (and not in the file system cache). The easiest way to ensure this is to unmount and remount the file system. One goal: Beat the Datamation sort record. Currently, the record for sorting 1 million 100-byte records is 0.44 seconds, which was obtained on a cluster of 32 machines. If you are careful, you might be able to beat this on a single PC configured with a few disks.

E.1 Introduction
E.2 Signal Processing and Embedded Applications: The Digital Signal Processor
E.3 Embedded Benchmarks
E.4 Embedded Multiprocessors
E.5 Case Study: The Emotion Engine of the Sony PlayStation 2
E.6 Case Study: Sanyo VPC-SX500 Digital Camera
E.7 Case Study: Inside a Cell Phone
E.8 Concluding Remarks

E Embedded Systems

By Thomas M. Conte
North Carolina State University

Where a calculator on the ENIAC is equipped with 18,000 vacuum tubes and weighs 30 tons, computers in the future may have only 1,000 vacuum tubes and perhaps weigh 1 1/2 tons.
Popular Mechanics, March 1949


E.1

Introduction Embedded computer systems—computers lodged in other devices where the presence of the computers is not immediately obvious—are the fastest-growing portion of the computer market. These devices range from everyday machines (most microwaves, most washing machines, printers, network switches, and automobiles contain simple to very advanced embedded microprocessors) to handheld digital devices (such as PDAs, cell phones, and music players) to video game consoles and digital set-top boxes. Although in some applications (such as PDAs) the computers are programmable, in many embedded applications the only programming occurs in connection with the initial loading of the application code or a later software upgrade of that application. Thus, the application is carefully tuned for the processor and system. This process sometimes includes limited use of assembly language in key loops, although time-to-market pressures and good software engineering practice restrict such assembly language coding to a fraction of the application. Compared to desktop and server systems, embedded systems have a much wider range of processing power and cost—from systems containing low-end 8bit and 16-bit processors that may cost less than a dollar, to those containing full 32-bit microprocessors capable of operating in the 500 MIPS range that cost approximately 10 dollars, to those containing high-end embedded processors that cost hundreds of dollars and can execute several billions of instructions per second. Although the range of computing power in the embedded systems market is very large, price is a key factor in the design of computers for this space. Performance requirements do exist, of course, but the primary goal is often meeting the performance need at a minimum price, rather than achieving higher performance at a higher price. Embedded systems often process information in very different ways from general-purpose processors. Typically these applications include deadline-driven constraints—so-called real-time constraints. In these applications, a particular computation must be completed by a certain time or the system fails (there are other constraints considered real time, discussed in the next subsection). Embedded systems applications typically involve processing information as signals. The lay term “signal” often connotes radio transmission, and that is true for some embedded systems (e.g., cell phones). But a signal may be an image, a motion picture composed of a series of images, a control sensor measurement, and so on. Signal processing requires specific computation that many embedded processors are optimized for. We discuss this in depth below. A wide range of benchmark requirements exist, from the ability to run small, limited code segments to the ability to perform well on applications involving tens to hundreds of thousands of lines of code. Two other key characteristics exist in many embedded applications: the need to minimize memory and the need to minimize power. In many embedded applications, the memory can be a substantial portion of the system cost, and it is important to optimize memory size in such cases. Sometimes the application is expected to fit


entirely in the memory on the processor chip; other times the application needs to fit in its entirety in a small, off-chip memory. In either case, the importance of memory size translates to an emphasis on code size, since data size is dictated by the application. Some architectures have special instruction set capabilities to reduce code size. Larger memories also mean more power, and optimizing power is often critical in embedded applications. Although the emphasis on low power is frequently driven by the use of batteries, the need to use less expensive packaging (plastic versus ceramic) and the absence of a fan for cooling also limit total power consumption. We examine the issue of power in more detail later in this appendix. Another important trend in embedded systems is the use of processor cores together with application-specific circuitry—so-called “core plus ASIC” or “system on a chip” (SOC), which may also be viewed as special-purpose multiprocessors (see Section E.4). Often an application’s functional and performance requirements are met by combining a custom hardware solution together with software running on a standardized embedded processor core, which is designed to interface to such special-purpose hardware. In practice, embedded problems are usually solved by one of three approaches: 1. The designer uses a combined hardware/software solution that includes some custom hardware and an embedded processor core that is integrated with the custom hardware, often on the same chip. 2. The designer uses custom software running on an off-the-shelf embedded processor. 3. The designer uses a digital signal processor and custom software for the processor. Digital signal processors are processors specially tailored for signalprocessing applications. We discuss some of the important differences between digital signal processors and general-purpose embedded processors below. Figure E.1 summarizes these three classes of computing environments and their important characteristics.

Real-Time Processing Often, the performance requirement in an embedded application is a real-time requirement. A real-time performance requirement is one where a segment of the application has an absolute maximum execution time that is allowed. For example, in a digital set-top box the time to process each video frame is limited, since the processor must accept and process the frame before the next frame arrives (typically called hard real-time systems). In some applications, a more sophisticated requirement exists: The average time for a particular task is constrained as well as is the number of instances when some maximum time is exceeded. Such approaches (typically called soft real-time) arise when it is possible to occasionally miss the time constraint on an event, as long as not too many are missed. Real-time


Feature                          Desktop                                   Server                                  Embedded
Price of system                  $1000–$10,000                             $10,000–$10,000,000                     $10–$100,000 (including network routers at the high end)
Price of microprocessor module   $100–$1000                                $200–$2000 (per processor)              $0.20–$200 (per processor)
Microprocessors sold per year    150,000,000                               4,000,000                               300,000,000 (32-bit and 64-bit processors only)
(estimates for 2000)
Critical system design issues    Price-performance, graphics performance   Throughput, availability, scalability   Price, power consumption, application-specific performance

Figure E.1 A summary of the three computing classes and their system characteristics. Note the wide range in system price for servers and embedded systems. For servers, this range arises from the need for very large-scale multiprocessor systems for high-end transaction processing and Web server applications. For embedded systems, one significant high-end application is a network router, which could include multiple processors as well as lots of memory and other electronics. The total number of embedded processors sold in 2000 is estimated to exceed 1 billion, if you include 8-bit and 16-bit microprocessors. In fact, the largest-selling microprocessor of all time is an 8-bit microcontroller sold by Intel! It is difficult to separate the low end of the server market from the desktop market, since lowend servers—especially those costing less than $5000—are essentially no different from desktop PCs. Hence, up to a few million of the PC units may be effectively servers.

performance tends to be highly application dependent. It is usually measured using kernels either from the application or from a standardized benchmark (see Section E.3). The construction of a hard real-time system involves three key variables. The first is the rate at which a particular task must occur. Coupled to this are the hardware and software required to achieve that real-time rate. Often, structures that are very advantageous on the desktop are the enemy of hard real-time analysis. For example, branch speculation, cache memories, and so on introduce uncertainty into code. A particular sequence of code may execute either very efficiently or very inefficiently, depending on whether the hardware branch predictors and caches “do their jobs.” Engineers must analyze code assuming the worst-case execution time (WCET). In the case of traditional microprocessor hardware, if one assumes that all branches are mispredicted and all caches miss, the WCET is overly pessimistic. Thus, the system designer may end up overdesigning a system to achieve a given WCET, when a much less expensive system would have sufficed. In order to address the challenges of hard real-time systems, and yet still exploit such well-known architectural properties as branch behavior and access locality, it is possible to change how a processor is designed. Consider branch prediction: Although dynamic branch prediction is known to perform far more accurately than static “hint bits” added to branch instructions, the behavior of static hints is much more predictable. Furthermore, although caches perform better than softwaremanaged on-chip memories, the latter produces predictable memory latencies. In some embedded processors, caches can be converted into software-managed on-chip memories via line locking. In this approach, a cache line can be locked in the cache so that it cannot be replaced until the line is unlocked

E.2 Signal Processing and Embedded Applications: The Digital Signal Processor

A digital signal processor (DSP) is a special-purpose processor optimized for executing digital signal processing algorithms. Most of these algorithms, from time-domain filtering (e.g., infinite impulse response and finite impulse response filtering), to convolution, to transforms (e.g., fast Fourier transform, discrete cosine transform), to even forward error correction (FEC) encodings, all have as their kernel the same operation: a multiply-accumulate operation. For example, the discrete Fourier transform has the form:

X(k) = Σ_{n=0}^{N−1} x(n) W_N^{kn},   where   W_N^{kn} = e^{−j2πkn/N} = cos(2πkn/N) − j sin(2πkn/N)

The discrete cosine transform is often a replacement for this because it does not require complex number operations. Either transform has as its core the sum of a product. To accelerate this, DSPs typically feature special-purpose hardware to perform multiply-accumulate (MAC). A MAC instruction of “MAC A,B,C” has the semantics of “A = A + B * C.” In some situations, the performance of this operation is so critical that a DSP is selected for an application based solely upon its MAC operation throughput. DSPs often employ fixed-point arithmetic. If you think of integers as having a binary point to the right of the least-significant bit, fixed point has a binary point just to the right of the sign bit. Hence, fixed-point data are fractions between −1 and +1.

Example

Here are three simple 16-bit patterns:

0100 0000 0000 0000
0000 1000 0000 0000
0100 1000 0000 1000

What values do they represent if they are two’s complement integers? Fixed-point numbers?

Answer

Number representation tells us that the ith digit to the left of the binary point represents 2^(i−1) and the ith digit to the right of the binary point represents 2^(−i). First assume these three patterns are integers. Then the binary point is to the far right, so they represent 2^14, 2^11, and (2^14 + 2^11 + 2^3), or 16,384, 2048, and 18,440. Fixed point places the binary point just to the right of the sign bit, so as fixed point these patterns represent 2^(−1), 2^(−4), and (2^(−1) + 2^(−4) + 2^(−12)). The fractions are 1/2, 1/16, and (2048 + 256 + 1)/4096 or 2305/4096, which represents about 0.50000, 0.06250, and 0.56274. Alternatively, for an n-bit two’s complement,


fixed-point number we could just divide the integer representation by 2^(n−1) to derive the same results: 16,384/32,768 = 1/2, 2,048/32,768 = 1/16, and 18,440/32,768 = 2305/4096.
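The conversions in this example are easy to check mechanically: interpret each 16-bit pattern as a signed integer and divide by 2^15 to place the binary point just to the right of the sign bit. A small C sketch, using the three patterns above:

#include <stdio.h>
#include <stdint.h>

/* Interpret 16-bit two's-complement patterns both as integers and as
   fixed-point fractions (binary point just right of the sign bit). */
int main(void) {
    int16_t patterns[] = { 0x4000, 0x0800, 0x4808 };   /* the three patterns above */
    for (int i = 0; i < 3; i++) {
        int16_t v = patterns[i];
        printf("0x%04x  integer: %6d   fixed point: %.5f\n",
               (uint16_t)v, v, v / 32768.0);            /* divide by 2^(n-1) */
    }
    return 0;
}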

Fixed point can be thought of as a low-cost floating point. It doesn’t include an exponent in every word and doesn’t have hardware that automatically aligns and normalizes operands. Instead, fixed point relies on the DSP programmer to keep the exponent in a separate variable and ensure that each result is shifted left or right to keep the answer aligned to that variable. Since this exponent variable is often shared by a set of fixed-point variables, this style of arithmetic is also called blocked floating point, since a block of variables has a common exponent. To support such manual calculations, DSPs usually have some registers that are wider to guard against round-off error, just as floating-point units internally have extra guard bits. Figure E.2 surveys four generations of DSPs, listing data sizes and width of the accumulating registers. Note that DSP architects are not bound by the powers of 2 for word sizes. Figure E.3 shows the size of data operands for the TI TMS320C55 DSP. In addition to MAC operations, DSPs often also have operations to accelerate portions of communications algorithms. An important class of these algorithms revolves around encoding and decoding forward error correction codes—codes in which extra information is added to the digital bit stream to guard against errors in transmission. A code of rate m/n carries m information bits in every n transmitted bits. So, for example, a rate-1/2 code would have 1 information bit per every 2 bits. Such codes are often called trellis codes because one popular graphical flow diagram of

Generation   Year   Example DSP         Data width   Accumulator width
1            1982   TI TMS32010         16 bits      32 bits
2            1987   Motorola DSP56001   24 bits      56 bits
3            1995   Motorola DSP56301   24 bits      56 bits
4            1998   TI TMS320C6201      16 bits      40 bits

Figure E.2 Four generations of DSPs, their data width, and the width of the registers that reduces round-off error.

Data size   Memory operand in operation   Memory operand in data transfer
16 bits     89.3%                         89.0%
32 bits     10.7%                         11.0%

Figure E.3 Size of data operands for the TMS320C55 DSP. About 90% of operands are 16 bits. This DSP has two 40-bit accumulators. There are no floating-point operations, as is typical of many DSPs, so these data are all fixed-point integers.


their encoding resembles a garden trellis. A common algorithm for decoding trellis codes is due to Viterbi. This algorithm requires a sequence of compares and selects in order to recover a transmitted bit’s true value. Thus DSPs often have compareselect operations to support Viterbi decode for FEC codes. To explain DSPs better, we will take a detailed look at two DSPs, both produced by Texas Instruments. The TMS320C55 series is a DSP family targeted toward battery-powered embedded applications. In stark contrast to this, the TMS VelociTI 320C6x series is a line of powerful, eight-issue VLIW processors targeted toward a broader range of applications that may be less power sensitive.

The TI 320C55

At one end of the DSP spectrum is the TI 320C55 architecture. The C55 is optimized for low-power, embedded applications. Its overall architecture is shown in Figure E.4. At the heart of it, the C55 is a seven-stage pipelined CPU. The stages are outlined below:
■ Fetch stage reads program data from memory into the instruction buffer queue.
■ Decode stage decodes instructions and dispatches tasks to the other primary functional units.
■ Address stage computes addresses for data accesses and branch addresses for program discontinuities.
■ Access 1/Access 2 stages send data read addresses to memory.
■ Read stage transfers operand data on the B bus, C bus, and D bus.
■ Execute stage executes operations in the A unit and D unit and performs writes on the E bus and F bus.

[Figure E.4 block diagram: the C55 CPU contains an instruction buffer unit (IU), a program flow unit (PU), an address data flow unit (AU), and a data computation unit (DU), fed by a 32-bit program read bus (PB), a 24-bit program address bus (PAB), three 16-bit data read buses (BB, CB, DB) with 24-bit read address buses (BAB, CAB, DAB), and two 16-bit data write buses (EB, FB) with 24-bit write address buses (EAB, FAB).]

Figure E.4 Architecture of the TMS320C55 DSP. The C55 is a seven-stage pipelined processor with some unique instruction execution facilities. (Courtesy Texas Instruments.)


The C55 pipeline performs pipeline hazard detection and will stall on write after read (WAR) and read after write (RAW) hazards. The C55 does have a 24 KB instruction cache, but it is configurable to support various workloads. It may be configured to be two-way set associative, directmapped, or as a “ramset.” This latter mode is a way to support hard realtime applications. In this mode, blocks in the cache cannot be replaced. The C55 also has advanced power management. It allows dynamic power management through software-programmable “idle domains.” Blocks of circuitry on the device are organized into these idle domains. Each domain can operate normally or can be placed in a low-power idle state. A programmer-accessible Idle Control Register (ICR) determines which domains will be placed in the idle state when the execution of the next IDLE instruction occurs. The six domains are CPU, direct memory access (DMA), peripherals, clock generator, instruction cache, and external memory interface. When each domain is in the idle state, the functions of that particular domain are not available. However, in the peripheral domain, each peripheral has an Idle Enable bit that controls whether or not the peripheral will respond to the changes in the idle state. Thus, peripherals can be individually configured to idle or remain active when the peripheral domain is idled. Since the C55 is a DSP, the central feature is its MAC units. The C55 has two MAC units, each comprised of a 17-bit by 17-bit multiplier coupled to a 40-bit dedicated adder. Each MAC unit performs its work in a single cycle; thus, the C55 can execute two MACs per cycle in full pipelined operation. This kind of capability is critical for efficiently performing signal processing applications. The C55 also has a compare, select, and store unit (CSSU) for the add/compare section of the Viterbi decoder.
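The role of the MAC units is easiest to see in the inner loop they accelerate. The fragment below is generic C rather than C55 code, and the 32-bit accumulator is only a stand-in for the C55's wider 40-bit accumulators; the point is simply that each iteration is exactly one multiply-accumulate.

#include <stdint.h>

/* The heart of filtering/transform kernels: acc = acc + x * h, repeated.
   On the C55 each iteration's multiply-add maps onto a MAC unit; the wide
   accumulator absorbs the growth of the intermediate sum. */
int32_t dot_q15(const int16_t *x, const int16_t *h, int n) {
    int32_t acc = 0;                      /* stand-in for a 40-bit accumulator */
    for (int i = 0; i < n; i++)
        acc += (int32_t)x[i] * h[i];      /* multiply-accumulate */
    return acc;
}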

The TI 320C6x In stark contrast to the C55 DSP family is the high-end Texas Instruments VelociTI 320C6x family of processors. The C6x processors are closer to traditional very long instruction word (VLIW) processors because they seek to exploit the high levels of instruction-level parallelism (ILP) in many signal processing algorithms. Texas Instruments is not alone in selecting VLIW for exploiting ILP in the embedded space. Other VLIW DSP vendors include Ceva, StarCore, Philips/TriMedia, and STMicroelectronics. Why do these vendors favor VLIW over superscalar? For the embedded space, code compatibility is less of a problem, and so new applications can be either hand tuned or recompiled for the newest generation of processor. The other reason superscalar excels on the desktop is because the compiler cannot predict memory latencies at compile time. In embedded, however, memory latencies are often much more predictable. In fact, hard real-time constraints force memory latencies to be statically predictable. Of course, a superscalar would also perform well in this environment with these constraints, but the extra hardware to dynamically schedule instructions is both wasteful in terms of precious chip area and in terms of power consumption. Thus VLIW is a natural choice for highperformance embedded.


The C6x family employs different pipeline depths depending on the family member. For the C64x, for example, the pipeline has 11 stages. The first four stages of the pipeline perform instruction fetch, followed by two stages for instruction decode, and finally four stages for instruction execution. The overall architecture of the C64x is shown below in Figure E.5. The C6x family’s execution stage is divided into two parts, the left or “1” side and the right or “2” side. The L1 and L2 units perform logical and arithmetic operations. D units in contrast perform a subset of logical and arithmetic operations but also perform memory accesses (loads and stores). The two M units perform multiplication and related operations (e.g., shifts). Finally the S units perform comparisons, branches, and some SIMD operations (see the next subsection for a detailed explanation of SIMD operations). Each side has its own 32-entry, 32-bit register file (the A file for the 1 side, the B file for the 2 side). A side may access the other side’s registers, but with a 1- cycle penalty. Thus, an instruction executing on side 1 may access B5, for example, but it will take 1- cycle extra to execute because of this. VLIWs are traditionally very bad when it comes to code size, which runs contrary to the needs of embedded systems. However, the C6x family’s approach “compresses” instructions, allowing the VLIW code to achieve the same density as equivalent RISC (reduced instruction set computer) code. To do so, instruction fetch is carried out on an “instruction packet,” shown in Figure E.6. Each instruction has a p bit that specifies whether this instruction is a member of the current

[Figure E.5 block diagram: the C6000 CPU couples program fetch, instruction dispatch, and instruction decode with two data paths, A and B, each holding a register file and .L, .S, .M, and .D units; around the core sit a program cache/memory with a 256-bit fetch path, a data cache/memory supporting 8-, 16-, 32-, and 64-bit accesses, EDMA/EMIF, control registers and logic, power-down support, and peripherals such as timers and serial ports.]

Figure E.5 Architecture of the TMS320C64x family of DSPs. The C6x is an eight-issue traditional VLIW processor. (Courtesy Texas Instruments.)


[Figure E.6 diagram: a fetch packet of eight 32-bit instructions, A through H, each carrying a p bit in bit 0.]

Figure E.6 Instruction packet of the TMS320C6x family of DSPs. The p bits determine whether an instruction begins a new VLIW word or not. If the p bit of instruction i is 1, then instruction i + 1 is to be executed in parallel with (in the same cycle as) instruction i. If the p bit of instruction i is 0, then instruction i + 1 is executed in the cycle after instruction i. (Courtesy Texas Instruments.)

VLIW word or the next VLIW word (see the figure for a detailed explanation). Thus, there are now no NOPs that are needed for VLIW encoding. Software pipelining is an important technique for achieving high performance in a VLIW. But software pipelining relies on each iteration of the loop having an identical schedule to all other iterations. Because conditional branch instructions disrupt this pattern, the C6x family provides a means to conditionally execute instructions using predication. In predication, the instruction performs its work. But when it is done executing, an additional register, for example A1, is checked. If A1 is zero, the instruction does not write its results. If A1 is nonzero, the instruction proceeds normally. This allows simple if-then and if-then-else structures to be collapsed into straight-line code for software pipelining.
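Predication is an ISA mechanism, but its effect can be mimicked in C to show why it helps software pipelining: the branchy loop below takes a different path each iteration, while the branch-free version does the same work every iteration and therefore schedules identically each time. This is only an analogy for the C6x's predicated instructions, not their actual encoding.

/* Branchy form: iterations take different paths, which breaks the identical
   per-iteration schedule that software pipelining wants. */
void clamp_branchy(int *a, int n, int limit) {
    for (int i = 0; i < n; i++) {
        if (a[i] > limit)
            a[i] = limit;
    }
}

/* "Predicated" form: every iteration executes the same straight-line work;
   the condition only selects which value is written back. */
void clamp_predicated(int *a, int n, int limit) {
    for (int i = 0; i < n; i++) {
        int over = a[i] > limit;          /* the predicate */
        a[i] = over ? limit : a[i];       /* select, no control flow */
    }
}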

Media Extensions There is a middle ground between DSPs and microcontrollers: media extensions. These extensions add DSP-like capabilities to microcontroller architectures at relatively low cost. Because media processing is judged by human perception, the data for multimedia operations are often much narrower than the 64-bit data word of modern desktop and server processors. For example, floating-point operations for graphics are normally in single precision, not double precision, and often at a precision less than is required by IEEE 754. Rather than waste the 64-bit arithmeticlogical units (ALUs) when operating on 32-bit, 16-bit, or even 8-bit integers, multimedia instructions can operate on several narrower data items at the same time. Thus, a partitioned add operation on 16-bit data with a 64-bit ALU would perform four 16-bit adds in a single clock cycle. The extra hardware cost is simply to prevent carries between the four 16-bit partitions of the ALU. For example, such instructions might be used for graphical operations on pixels. These operations are commonly called single-instruction multiple-data (SIMD) or vector instructions. Most graphics multimedia applications use 32-bit floating-point operations. Some computers double peak performance of single-precision, floating-point operations; they allow a single instruction to launch two 32-bit operations on operands found side by side in a double-precision register. The two partitions must be insulated to prevent operations on one half from affecting the other. Such floating-point operations are called paired single operations. For example, such an operation


might be used for graphical transformations of vertices. This doubling in performance is typically accomplished by doubling the number of floating-point units, making it more expensive than just suppressing carries in integer adders. Figure E.7 summarizes the SIMD multimedia instructions found in several recent computers. DSPs also provide operations found in the first three rows of Figure E.7, but they change the semantics a bit. First, because they are often used in real-time applications, there is not an option of causing an exception on arithmetic overflow (otherwise it could miss an event); thus, the result will be used no matter what the inputs. To support such an unyielding environment, DSP architectures use saturating arithmetic: If the result is too large to be represented, it is set to the largest representable number, depending on the sign of the result. In contrast, two’s complement arithmetic can add a small positive number to a large positive number and wrap around to a negative result.

Figure E.7 (partial) SIMD multimedia instructions in HP PA-RISC MAX2, Intel Pentium MMX, PowerPC AltiVec, and SPARC VIS, showing the partitioned operand widths supported (B = 8-bit byte, H = 16-bit half word, W = 32-bit word):

Operation                 HP PA-RISC MAX2   Intel Pentium MMX   PowerPC AltiVec   SPARC VIS
Add/subtract              4H                8B, 4H, 2W          16B, 8H, 4W       4H, 2W
Saturating add/subtract   4H                8B, 4H              16B, 8H, 4W       4H
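The saturating behavior described above can be written directly in C to make the contrast concrete; wrap-around two's-complement addition would simply return the low 16 bits instead. This is a scalar illustration of the idea, not the actual SIMD instruction, and it assumes 16-bit signed operands.

#include <stdint.h>

/* Saturating signed 16-bit add: clamp to the largest/smallest representable
   value instead of wrapping around on overflow. */
int16_t sat_add16(int16_t a, int16_t b) {
    int32_t sum = (int32_t)a + (int32_t)b;   /* compute in a wider type */
    if (sum > INT16_MAX) return INT16_MAX;   /* clamp positive overflow */
    if (sum < INT16_MIN) return INT16_MIN;   /* clamp negative overflow */
    return (int16_t)sum;
}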

Figure F.2 Relationship of the four interconnection network domains in terms of number of devices connected and their distance scales: on-chip network (OCN), system/storage area network (SAN), local area network (LAN), and wide area network (WAN). Note that there are overlapping ranges where some of these networks compete. Some supercomputer systems use proprietary custom networks to interconnect several thousands of computers, while other systems, such as multicomputer clusters, use standard commercial networks.

Approach and Organization of This Appendix Interconnection networks can be well understood by taking a top-down approach to unveiling the concepts and complexities involved in designing them. We do this by viewing the network initially as an opaque “black box” that simply and ideally performs certain necessary functions. Then we systematically open various layers of the black box, allowing more complex concepts and nonideal network behavior to be revealed. We begin this discussion by first considering the interconnection of just two devices in Section F.2, where the black box network can be viewed as a simple dedicated link network—that is, wires or collections of wires running bidirectionally between the devices. We then consider the interconnection of more than two devices in Section F.3, where the black box network can be viewed as a shared link network or as a switched point-to-point network connecting the devices. We continue to peel away various other layers of the black box by considering in more detail the network topology (Section F.4); routing, arbitration, and switching (Section F.5); and switch microarchitecture (Section F.6). Practical issues for commercial networks are considered in Section F.7, followed by examples illustrating the trade-offs for each type of network in Section F.8. Internetworking is briefly discussed in Section F.9, and additional crosscutting issues for interconnection networks are presented in Section F.10. Section F.11 gives some common fallacies

and pitfalls related to interconnection networks, and Section F.12 presents some concluding remarks. Finally, we provide a brief historical perspective and some suggested reading in Section F.13.

F.2 Interconnecting Two Devices

This section introduces the basic concepts required to understand how communication between just two networked devices takes place. This includes concepts that deal with situations in which the receiver may not be ready to process incoming data from the sender and situations in which transport errors may occur. To ease understanding, the black box network at this point can be conceptualized as an ideal network that behaves as simple dedicated links between the two devices. Figure F.3 illustrates this, where unidirectional wires run from device A to device B and vice versa, and each end node contains a buffer to hold the data. Regardless of the network complexity, whether dedicated link or not, a connection exists from each end node device to the network to inject and receive information to/from the network. We first describe the basic functions that must be performed at the end nodes to commence and complete communication, and then we discuss network media and the basic functions that must be performed by the network to carry out communication. Later, a simple performance model is given, along with several examples to highlight implications of key network parameters.

Network Interface Functions: Composing and Processing Messages

Suppose we want two networked devices to read a word from each other’s memory. The unit of information sent or received is called a message. To acquire the desired data, the two devices must first compose and send a certain type of message in the form of a request containing the address of the data within the other device. The address (i.e., memory or operand location) allows the receiver to identify where to find the information being requested. After processing the request, each device then composes and sends another type of message, a reply, containing the data. The address and data information is typically referred to as the message payload.


Figure F.3 A simple dedicated link network bidirectionally interconnecting two devices.


In addition to payload, every message contains some control bits needed by the network to deliver the message and process it at the receiver. The most typical are bits to distinguish between different types of messages (e.g., request, reply, request acknowledge, reply acknowledge) and bits that allow the network to transport the information properly to the destination. These additional control bits are encoded in the header and/or trailer portions of the message, depending on their location relative to the message payload. As an example, Figure F.4 shows the format of a message for the simple dedicated link network shown in Figure F.3. This example shows a single-word payload, but messages in some interconnection networks can include several thousands of words. Before message transport over the network occurs, messages have to be composed. Likewise, upon receipt from the network, they must be processed. These and other functions described below are the role of the network interface (also referred to as the channel adapter) residing at the end nodes. Together with some direct memory access (DMA) engine and link drivers to transmit/receive messages to/from the network, some dedicated memory or register(s) may be used to buffer outgoing and incoming messages. Depending on the network domain and design specifications for the network, the network interface hardware may consist of nothing more than the communicating device itself (i.e., for OCNs and some SANs) or a separate card that integrates several embedded processors and DMA engines with thousands of megabytes of RAM (i.e., for many SANs and most LANs and WANs). In addition to hardware, network interfaces can include software or firmware to perform the needed operations. Even the simple example shown in Figure F.3 may invoke messaging software to translate requests and replies into messages with the appropriate headers. This way, user applications need not worry about composing and processing messages as these tasks can be performed automatically at a lower level. An application program usually cooperates with the operating or runtime

Header: destination port, message ID, sequence number, type (00 = request, 01 = reply, 10 = request acknowledge, 11 = reply acknowledge). Payload: data. Trailer: checksum.

Figure F.4 An example packet format with header, payload, and checksum in the trailer.


system to send and receive messages. As the network is likely to be shared by many processes running on each device, the operating system cannot allow messages intended for one process to be received by another. Thus, the messaging software must include protection mechanisms that distinguish between processes. This distinction could be made by expanding the header with a port number that is known by both the sender and intended receiver processes. In addition to composing and processing messages, additional functions need to be performed by the end nodes to establish communication among the communicating devices. Although hardware support can reduce the amount of work, some can be done by software. For example, most networks specify a maximum amount of information that can be transferred (i.e., maximum transfer unit) so that network buffers can be dimensioned appropriately. Messages longer than the maximum transfer unit are divided into smaller units, called packets (or datagrams), that are transported over the network. Packets are reassembled into messages at the destination end node before delivery to the application. Packets belonging to the same message can be distinguished from others by including a message ID field in the packet header. If packets arrive out of order at the destination, they are reordered when reassembled into a message. Another field in the packet header containing a sequence number is usually used for this purpose. The sequence of steps the end node follows to commence and complete communication over the network is called a communication protocol. It generally has symmetric but reversed steps between sending and receiving information. Communication protocols are implemented by a combination of software and hardware to accelerate execution. For instance, many network interface cards implement hardware timers as well as hardware support to split messages into packets and reassemble them, compute the cyclic redundancy check (CRC) checksum, handle virtual memory addresses, and so on. Some network interfaces include extra hardware to offload protocol processing from the host computer, such as TCP offload engines for LANs and WANs. But, for interconnection networks such as SANs that have low latency requirements, this may not be enough even when lighter-weight communication protocols are used such as message passing interface (MPI). Communication performance can be further improved by bypassing the operating system (OS). OS bypassing can be implemented by directly allocating message buffers in the network interface memory so that applications directly write into and read from those buffers. This avoids extra memory-to-memory copies. The corresponding protocols are referred to as zero-copy protocols or user-level communication protocols. Protection can still be maintained by calling the OS to allocate those buffers at initialization and preventing unauthorized memory accesses in hardware. In general, some or all of the following are the steps needed to send a message at end node devices over a network: 1. The application executes a system call, which copies data to be sent into an operating system or network interface buffer, divides the message into packets (if needed), and composes the header and trailer for packets.


2. The checksum is calculated and included in the header or trailer of packets.

3. The timer is started, and the network interface hardware sends the packets.

Message reception is in the reverse order:

3. The network interface hardware receives the packets and puts them into its buffer or the operating system buffer.

2. The checksum is calculated for each packet. If the checksum matches the sender’s checksum, the receiver sends an acknowledgment back to the packet sender. If not, it deletes the packet, assuming that the sender will resend the packet when the associated timer expires.

1. Once all packets pass the test, the system reassembles the message, copies the data to the user’s address space, and signals the corresponding application.

The sender must still react to packet acknowledgments (a code sketch of this sender-side bookkeeping follows this list): ■

When the sender gets an acknowledgment, it releases the copy of the corresponding packet from the buffer.



If the sender reaches the time-out instead of receiving an acknowledgment, it resends the packet and restarts the timer.
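A minimal C sketch of that sender-side bookkeeping is shown below. It combines the packet format of Figure F.4 with the time-out and acknowledgment steps above; send_raw, checksum32, and now_ns are hypothetical hooks standing in for whatever a real network interface and runtime would provide, and the 1 ms time-out is an arbitrary assumption.

```c
#include <stddef.h>
#include <stdint.h>

enum { TYPE_REQUEST = 0, TYPE_REPLY = 1, TYPE_REQ_ACK = 2, TYPE_REPLY_ACK = 3 };

struct packet {                        /* fields follow Figure F.4              */
    uint16_t dest_port;                /* header: destination port              */
    uint16_t msg_id;                   /* header: message ID                    */
    uint16_t seq_num;                  /* header: sequence number               */
    uint8_t  type;                     /* header: 2-bit message type            */
    uint16_t len;                      /* payload length in bytes               */
    uint8_t  payload[1500];            /* payload (assumed maximum size)        */
    uint32_t checksum;                 /* trailer: checksum over the packet     */
};

struct pending {                       /* copy kept until acknowledged          */
    struct packet pkt;
    uint64_t      deadline_ns;         /* when the retransmission timer expires */
    int           in_use;
};

/* Hypothetical lower-level hooks, not a real driver API. */
extern void     send_raw(const struct packet *p);
extern uint32_t checksum32(const void *data, size_t len);
extern uint64_t now_ns(void);

#define TIMEOUT_NS 1000000ULL          /* assumed 1 ms retransmission time-out  */

void send_packet(struct pending *slot, struct packet *p) {
    p->checksum = checksum32(p, offsetof(struct packet, checksum));
    slot->pkt = *p;                    /* keep a copy in case a resend is needed */
    slot->deadline_ns = now_ns() + TIMEOUT_NS;   /* start the timer              */
    slot->in_use = 1;
    send_raw(p);                                 /* transmit the packet          */
}

void on_acknowledgment(struct pending *slot) {
    slot->in_use = 0;                  /* release the buffered copy              */
}

void on_timer_tick(struct pending *slot) {
    if (slot->in_use && now_ns() >= slot->deadline_ns) {
        slot->deadline_ns = now_ns() + TIMEOUT_NS;   /* restart the timer        */
        send_raw(&slot->pkt);                        /* resend the packet        */
    }
}
```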

Just as a protocol is implemented at network end nodes to support communication, protocols are also used across the network structure at the physical, data link, and network layers responsible primarily for packet transport, flow control, error handling, and other functions described next.

Basic Network Structure and Functions: Media and Form Factor, Packet Transport, Flow Control, and Error Handling

Once a packet is ready for transmission at its source, it is injected into the network using some dedicated hardware at the network interface. The hardware includes some transceiver circuits to drive the physical network media—either electrical or optical. The type of media and form factor depends largely on the interconnect distances over which certain signaling rates (e.g., transmission speed) should be sustainable. For centimeter or less distances on a chip or multichip module, typically the middle to upper copper metal layers can be used for interconnects at multi-Gbps signaling rates per line. A dozen or more layers of copper traces or tracks imprinted on circuit boards, midplanes, and backplanes can be used for Gbps differential-pair signaling rates at distances of about a meter or so. Category 5E unshielded twisted-pair copper wiring allows 0.25 Gbps transmission speed over distances of 100 meters. Coaxial copper cables can deliver 10 Mbps over kilometer distances. In these conductor lines, distance can usually be traded off for higher transmission speed, up to a certain point. Optical media enable faster transmission


speeds at distances of kilometers. Multimode fiber supports 100 Mbps transmission rates over a few kilometers, and more expensive single-mode fiber supports Gbps transmission speeds over distances of several kilometers. Wavelength division multiplexing allows several times more bandwidth to be achieved in fiber (i.e., by a factor of the number of wavelengths used). The hardware used to drive network links may also include some encoders to encode the signal in a format other than binary that is suitable for the given transport distance. Encoding techniques can use multiple voltage levels, redundancy, data and control rotation (e.g., 4b5b encoding), and/or a guaranteed minimum number of signal transitions per unit time to allow for clock recovery at the receiver. The signal is decoded at the receiver end, and the packet is stored in the corresponding buffer. All of these operations are performed at the network physical layer, the details of which are beyond the scope of this appendix. Fortunately, we do not need to worry about them. From the perspective of the data link and higher layers, the physical layer can be viewed as a long linear pipeline without staging in which signals propagate as waves through the network transmission medium. All of the above functions are generally referred to as packet transport. Besides packet transport, the network hardware and software are jointly responsible at the data link and network protocol layers for ensuring reliable delivery of packets. These responsibilities include: (1) preventing the sender from sending packets at a faster rate than they can be processed by the receiver, and (2) ensuring that the packet is neither garbled nor lost in transit. The first responsibility is met by either discarding packets at the receiver when its buffer is full and later notifying the sender to retransmit them, or by notifying the sender to stop sending packets when the buffer becomes full and to resume later once it has room for more packets. The latter strategy is generally known as flow control. There are several interesting techniques commonly used to implement flow control beyond simple handshaking between the sender and receiver. The more popular techniques are Xon/Xoff (also referred to as Stop & Go) and credit-based flow control. Xon/Xoff consists of the receiver notifying the sender either to stop or to resume sending packets once high and low buffer occupancy levels are reached, respectively, with some hysteresis to reduce the number of notifications. Notifications are sent as “stop” and “go” signals using additional control wires or encoded in control packets. Credit-based flow control typically uses a credit counter at the sender that initially contains a number of credits equal to the number of buffers at the receiver. Every time a packet is transmitted, the sender decrements the credit counter. When the receiver consumes a packet from its buffer, it returns a credit to the sender in the form of a control packet that notifies the sender to increment its counter upon receipt of the credit. These techniques essentially control the flow of packets into the network by throttling packet injection at the sender when the receiver reaches a low watermark or when the sender runs out of credits. Xon/Xoff usually generates much less control traffic than credit-based flow control because notifications are only sent when the high or low buffer occupancy levels are crossed. 
On the other hand, credit-based flow control requires less than half the buffer size required by Xon/Xoff. Buffers for Xon/Xoff must be large


enough to prevent overflow before the “stop” control signal reaches the sender. Overflow cannot happen when using credit-based flow control because the sender will run out of credits, thus stopping transmission. For both schemes, full link bandwidth utilization is possible only if buffers are large enough for the distance over which communication takes place. Let’s compare the buffering requirements of the two flow control techniques in a simple example covering the various interconnection network domains.

Example

Suppose we have a dedicated-link network with a raw data bandwidth of 8 Gbps for each link in each direction interconnecting two devices. Packets of 100 bytes (including the header) are continuously transmitted from one device to the other to fully utilize network bandwidth. What is the minimum amount of credits and buffer space required by credit-based flow control assuming interconnect distances of 1 cm, 1 m, 100 m, and 10 km if only link propagation delay is taken into account? How does the minimum buffer space compare against Xon/Xoff?

Answer

At the start, the receiver buffer is initially empty and the sender contains a number of credits equal to buffer capacity. The sender will consume a credit every time a packet is transmitted. For the sender to continue transmitting packets at network speed, the first returned credit must reach the sender before the sender runs out of credits. After receiving the first credit, the sender will keep receiving credits at the same rate it transmits packets. As we are considering only propagation delay over the link and no other sources of delay or overhead, null processing time at the sender and receiver are assumed. The time required for the first credit to reach the sender since it started transmission of the first packet is equal to the round-trip propagation delay for the packet transmitted to the receiver and the return credit transmitted back to the sender. This time must be less than or equal to the packet transmission time multiplied by the initial credit count:

Packet propagation delay + Credit propagation delay ≤ (Packet size / Bandwidth) × Credit count

The speed of light is about 300,000 km/sec. Assume we can achieve 66% of that in a conductor. Thus, the minimum number of credits for each distance is given by

(2 × Distance) / (2/3 × 300,000 km/sec) ≤ (100 bytes / 8 Gbits/sec) × Credit count

As each credit represents one packet-sized buffer entry, the minimum amount of credits (and, likewise, buffer space) needed by each device is one for the 1 cm and 1 m distances, 10 for the 100 m distance, and 1000 packets for the 10 km distance. For Xon/Xoff, this minimum buffer size corresponds to the buffer fragment from the high occupancy level to the top of the buffer and from the low occupancy level to the bottom of the buffer. With the added hysteresis between both occupancy levels to reduce notifications, the minimum buffer space for Xon/Xoff turns out to be more than twice that for credit-based flow control.
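The arithmetic in this example can be reproduced directly; the C sketch below (an illustration, not part of the original text) computes the minimum credit count from the round-trip propagation delay and the packet transmission time, printing 1, 1, 10, and 1000 for the four distances. As noted above, a comparable Xon/Xoff buffer would need to be more than twice this size.

```c
#include <math.h>
#include <stdio.h>

/* Minimum credits for credit-based flow control: the outstanding credits must
   cover the round-trip propagation delay (packet out, credit back), measured
   in units of packet transmission time.  Signals travel at 2/3 the speed of light. */
double min_credits(double distance_m, double bandwidth_bps, double packet_bits) {
    double v     = (2.0 / 3.0) * 300000000.0;     /* propagation speed, m/s       */
    double rtt   = 2.0 * distance_m / v;          /* round-trip time, seconds     */
    double t_pkt = packet_bits / bandwidth_bps;   /* transmission time, seconds   */
    return ceil(rtt / t_pkt - 1e-9);              /* epsilon guards FP round-up   */
}

int main(void) {
    double dist[] = { 0.01, 1.0, 100.0, 10000.0 };   /* 1 cm, 1 m, 100 m, 10 km   */
    for (int i = 0; i < 4; i++)
        printf("%.0f credits\n", min_credits(dist[i], 8e9, 100 * 8));
    return 0;
}
```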


Networks that implement flow control do not need to drop packets and are sometimes referred to as lossless networks; networks that drop packets are sometimes referred to as lossy networks. This single difference in the way packets are handled by the network drastically constrains the kinds of solutions that can be implemented to address other related network problems, including packet routing, congestion, deadlock, and reliability, as we will see later in this appendix. This difference also affects performance significantly as dropped packets need to be retransmitted, thus consuming more link bandwidth and suffering extra delay. These behavioral and performance differences ultimately restrict the interconnection network domains for which certain solutions are applicable. For instance, most networks delivering packets over relatively short distances (e.g., OCNs and SANs) tend to implement flow control; on the other hand, networks delivering packets over relatively long distances (e.g., LANs and WANs) tend to be designed to drop packets. For the shorter distances, the delay in propagating flow control information back to the sender can be negligible, but not so for longer distance scales. The kinds of applications that are usually run also influence the choice of lossless versus lossy networks. For instance, dropping packets sent by an Internet client like a Web browser affects only the delay observed by the corresponding user. However, dropping a packet sent by a process from a parallel application may lead to a significant increase in the overall execution time of the application if that packet’s delay is on the critical path. The second responsibility of ensuring that packets are neither garbled nor lost in transit can be met by implementing some mechanisms to detect and recover from transport errors. Adding a checksum or some other error detection field to the packet format, as shown in Figure F.4, allows the receiver to detect errors. This redundant information is calculated when the packet is sent and checked upon receipt. The receiver then sends an acknowledgment in the form of a control packet if the packet passes the test. Note that this acknowledgment control packet may simultaneously contain flow control information (e.g., a credit or stop signal), thus reducing control packet overhead. As described earlier, the most common way to recover from errors is to have a timer record the time each packet is sent and to presume the packet is lost or erroneously transported if the timer expires before an acknowledgment arrives. The packet is then resent. The communication protocol across the network and network end nodes must handle many more issues other than packet transport, flow control, and reliability. For example, if two devices are from different manufacturers, they might order bytes differently within a word (Big Endian versus Little Endian byte ordering). The protocol must reverse the order of bytes in each word as part of the delivery system. It must also guard against the possibility of duplicate packets if a delayed packet were to become unstuck. Depending on the system requirements, the protocol may have to implement pipelining among operations to improve performance. Finally, the protocol may need to handle network congestion to prevent performance degradation when more than two devices are connected, as described later in Section F.7.
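The error detection field is commonly a cyclic redundancy check; the bitwise C sketch below shows one widely used 32-bit CRC (the reflected 0xEDB88320 polynomial used by Ethernet). It is only an illustration; network interfaces normally compute the CRC in hardware, and different networks choose different polynomials.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 over a buffer; the receiver recomputes it and compares the
   result against the value carried in the packet trailer. */
uint32_t crc32(const uint8_t *data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;                       /* final bit inversion */
}
```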


Characterizing Performance: Latency and Effective Bandwidth

Now that we have covered the basic steps for sending and receiving messages between two devices, we can discuss performance. We start by discussing the latency when transporting a single packet. Then we discuss the effective bandwidth (also known as throughput) that can be achieved when the transmission of multiple packets is pipelined over the network at the packet level. Figure F.5 shows the basic components of latency for a single packet. Note that some latency components will be broken down further in later sections as the internals of the “black box” network are revealed. The timing parameters in Figure F.5 apply to many interconnection network domains: inside a chip, between chips on a board, between boards in a chassis, between chassis within a computer, between computers in a cluster, between clusters, and so on. The values may change, but the components of latency remain the same. The following terms are often used loosely, leading to confusion, so we define them here more precisely:

■

Bandwidth—Strictly speaking, the bandwidth of a transmission medium refers to the range of frequencies for which the attenuation per unit length introduced by that medium is below a certain threshold. It must be distinguished from the transmission speed, which is the amount of information transmitted over a medium per unit time. For example, modems successfully increased transmission speed in the late 1990s for a fixed bandwidth (i.e., the 3 KHz bandwidth provided by voice channels over telephone lines) by encoding more voltage levels and, hence, more bits per signal cycle. However, to be consistent with

Sender: sending overhead, then transmission time (bytes/bandwidth). Time of flight. Receiver: transmission time (bytes/bandwidth), then receiving overhead. Transport latency spans time of flight plus transmission time; total latency adds the sending and receiving overheads.

Figure F.5 Components of packet latency. Depending on whether it is an OCN, SAN, LAN, or WAN, the relative amounts of sending and receiving overhead, time of flight, and transmission time are usually quite different from those illustrated here.

its more widely understood meaning, we use the term bandwidth to refer to the maximum rate at which information can be transferred, where information includes packet header, payload, and trailer. The units are traditionally bits per second, although bytes per second is sometimes used. The term bandwidth is also used to mean the measured speed of the medium (i.e., network links). Aggregate bandwidth refers to the total data bandwidth supplied by the network, and effective bandwidth or throughput is the fraction of aggregate bandwidth delivered by the network to an application.









■ Time of flight—This is the time for the first bit of the packet to arrive at the receiver, including the propagation delay over the links and delays due to other hardware in the network such as link repeaters and network switches. The unit of measure for time of flight can be in milliseconds for WANs, microseconds for LANs, nanoseconds for SANs, and picoseconds for OCNs.

■ Transmission time—This is the time for the packet to pass through the network, not including time of flight. One way to measure it is the difference in time between when the first bit of the packet arrives at the receiver and when the last bit of that packet arrives at the receiver. By definition, transmission time is equal to the size of the packet divided by the data bandwidth of network links. This measure assumes there are no other packets contending for that bandwidth (i.e., a zero-load or no-load network).

■ Transport latency—This is the sum of time of flight and transmission time. Transport latency is the time that the packet spends in the interconnection network. Stated alternatively, it is the time between when the first bit of the packet is injected into the network and when the last bit of that packet arrives at the receiver. It does not include the overhead of preparing the packet at the sender or processing it when it arrives at the receiver.

■ Sending overhead—This is the time for the end node to prepare the packet (as opposed to the message) for injection into the network, including both hardware and software components. Note that the end node is busy for the entire time, hence the use of the term overhead. Once the end node is free, any subsequent delays are considered part of the transport latency. We assume that overhead consists of a constant term plus a variable term that depends on packet size. The constant term includes memory allocation, packet header preparation, setting up DMA devices, and so on. The variable term is mostly due to copies from buffer to buffer and is usually negligible for very short packets.

■ Receiving overhead—This is the time for the end node to process an incoming packet, including both hardware and software components. We also assume here that overhead consists of a constant term plus a variable term that depends on packet size. In general, the receiving overhead is larger than the sending overhead. For example, the receiver may pay the cost of an interrupt or may have to reorder and reassemble packets into messages.


The total latency of a packet can be expressed algebraically by the following:

Latency = Sending overhead + Time of flight + Packet size / Bandwidth + Receiving overhead

Let’s see how the various components of transport latency and the sending and receiving overheads change in importance as we go across the interconnection network domains: from OCNs to SANs to LANs to WANs.

Example

Assume that we have a dedicated link network with a data bandwidth of 8 Gbps for each link in each direction interconnecting two devices within an OCN, SAN, LAN, or WAN, and we wish to transmit packets of 100 bytes (including the header) between the devices. The end nodes have a per-packet sending overhead of x + 0.05 ns/byte and receiving overhead of 4/3(x) + 0.05 ns/byte, where x is 0 μs for the OCN, 0.3 μs for the SAN, 3 μs for the LAN, and 30 μs for the WAN, which are typical for these network types. Calculate the total latency to send packets from one device to the other for interconnection distances of 0.5 cm, 5 m, 5000 m, and 5000 km assuming that time of flight consists only of link propagation delay (i.e., no switching or other sources of delay).

Answer

Using the above expression and the calculation for propagation delay through a conductor given in the previous example, we can plug in the parameters for each of the networks to find their total packet latency. For the OCN:

Latency = Sending overhead + Time of flight + Packet size / Bandwidth + Receiving overhead
        = 5 ns + 0.5 cm / (2/3 × 300,000 km/sec) + 100 bytes / (8 Gbits/sec) + 5 ns

Converting all terms into nanoseconds (ns) leads to the following for the OCN:

Total latency (OCN) = 5 ns + 0.5 cm / (2/3 × 300,000 km/sec) + (100 × 8) / 8 ns + 5 ns
                    = 5 ns + 0.025 ns + 100 ns + 5 ns
                    = 110.025 ns

Substituting in the appropriate values for the SAN gives the following latency:

Total latency (SAN) = 0.305 μs + 5 m / (2/3 × 300,000 km/sec) + 100 bytes / (8 Gbits/sec) + 0.405 μs
                    = 0.305 μs + 0.025 μs + 0.1 μs + 0.405 μs
                    = 0.835 μs

Substituting in the appropriate values for the LAN gives the following latency:

Total latency (LAN) = 3.005 μs + 5 km / (2/3 × 300,000 km/sec) + 100 bytes / (8 Gbits/sec) + 4.005 μs
                    = 3.005 μs + 25 μs + 0.1 μs + 4.005 μs
                    = 32.11 μs

Substituting in the appropriate values for the WAN gives the following latency:

Total latency (WAN) = 30.005 μs + 5000 km / (2/3 × 300,000 km/sec) + 100 bytes / (8 Gbits/sec) + 40.005 μs
                    = 30.005 μs + 25,000 μs + 0.1 μs + 40.005 μs
                    = 25.07 ms
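The four calculations above can be checked with a few lines of C (a sketch using the example's own parameters, not code from the text):

```c
#include <stdio.h>

/* Zero-load packet latency, all times in seconds:
   sending overhead + time of flight + packet size / bandwidth + receiving overhead.
   Overheads follow the example: x + 0.05 ns/byte and (4/3)x + 0.05 ns/byte. */
double total_latency(double x_s, double distance_m,
                     double packet_bytes, double bandwidth_bps) {
    double v        = (2.0 / 3.0) * 300000000.0;    /* propagation speed in a conductor */
    double send_ovh = x_s + 0.05e-9 * packet_bytes;
    double recv_ovh = (4.0 / 3.0) * x_s + 0.05e-9 * packet_bytes;
    double flight   = distance_m / v;
    double transmit = packet_bytes * 8.0 / bandwidth_bps;
    return send_ovh + flight + transmit + recv_ovh;
}

int main(void) {
    double x[] = { 0.0, 0.3e-6, 3e-6, 30e-6 };      /* OCN, SAN, LAN, WAN overheads */
    double d[] = { 0.005, 5.0, 5000.0, 5000000.0 }; /* 0.5 cm, 5 m, 5 km, 5000 km   */
    for (int i = 0; i < 4; i++)                     /* ~110 ns, 0.835 us, 32.1 us, 25.07 ms */
        printf("%g s\n", total_latency(x[i], d[i], 100.0, 8e9));
    return 0;
}
```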

The increased fraction of the latency required by time of flight for the longer distances along with the greater likelihood of errors over the longer distances are among the reasons why WANs and LANs use more sophisticated and time-consuming communication protocols, which increase sending and receiving overheads. The need for standardization is another reason. Complexity also increases due to the requirements imposed on the protocol by the typical applications that run over the various interconnection network domains as we go from tens to hundreds to thousands to many thousands of devices. We will consider this in later sections when we discuss connecting more than two devices. The above example shows that the propagation delay component of time of flight for WANs and some LANs is so long that other latency components—including the sending and receiving overheads—can practically be ignored. This is not so for SANs and OCNs where the propagation delay pales in comparison to the overheads and transmission delay. Remember that time-of-flight latency due to switches and other hardware in the network besides sheer propagation delay through the links is neglected in the above example. For noncongested networks, switch latency generally is small compared to the overheads and propagation delay through the links in WANs and LANs, but this is not necessarily so for multiprocessor SANs and multicore OCNs, as we will see in later sections. So far, we have considered the transport of a single packet and computed the associated end-to-end total packet latency. In order to compute the effective bandwidth for two networked devices, we have to consider a continuous stream of packets transported between them. We must keep in mind that, in addition to minimizing packet latency, the goal of any network optimized for a given cost and power consumption target is to transfer the maximum amount of available information in the shortest possible time, as measured by the effective bandwidth delivered by the network. For applications that do not require a response before sending the next packet, the sender can overlap the sending overhead of later packets with the transport latency and receiver overhead of prior packets. This essentially pipelines the transmission of packets over the network, also known as link pipelining. Fortunately, as discussed in prior chapters of this book, there are many application


areas where communication from either several applications or several threads from the same application can run concurrently (e.g., a Web server concurrently serving thousands of client requests or streaming media), thus allowing a device to send a stream of packets without having to wait for an acknowledgment or a reply. Also, as long messages are usually divided into packets of maximum size before transport, a number of packets are injected into the network in succession for such cases. If such overlap were not possible, packets would have to wait for prior packets to be acknowledged before being transmitted and thus suffer significant performance degradation. Packets transported in a pipelined fashion can be acknowledged quite straightforwardly simply by keeping a copy at the source of all unacknowledged packets that have been sent and keeping track of the correspondence between returned acknowledgments and packets stored in the buffer. Packets will be removed from the buffer when the corresponding acknowledgment is received by the sender. This can be done by including the message ID and packet sequence number associated with the packet in the packet’s acknowledgment. Furthermore, a separate timer must be associated with each buffered packet, allowing the packet to be resent if the associated time-out expires. Pipelining packet transport over the network has many similarities with pipelining computation within a processor. However, among some differences are that it does not require any staging latches. Information is simply propagated through network links as a sequence of signal waves. Thus, the network can be considered as a logical pipeline consisting of as many stages as are required so that the time of flight does not affect the effective bandwidth that can be achieved. Transmission of a packet can start immediately after the transmission of the previous one, thus overlapping the sending overhead of a packet with the transport and receiver latency of previous packets. If the sending overhead is smaller than the transmission time, packets follow each other back-to-back, and the effective bandwidth approaches the raw link bandwidth when continuously transmitting packets. On the other hand, if the sending overhead is greater than the transmission time, the effective bandwidth at the injection point will remain well below the raw link bandwidth. The resulting link injection bandwidth, BWLinkInjection, for each link injecting a continuous stream of packets into a network is calculated with the following expression:

BWLinkInjection = Packet size / max(Sending overhead, Transmission time)

We must also consider what happens if the receiver is unable to consume packets at the same rate they arrive. This occurs if the receiving overhead is greater than the sending overhead and the receiver cannot process incoming packets fast enough. In this case, the link reception bandwidth, BWLinkReception, for each reception link of the network is less than the link injection bandwidth and is obtained with this expression:

BWLinkReception = Packet size / max(Receiving overhead, Transmission time)


When communication takes place between two devices interconnected by dedicated links, all the packets sent by one device will be received by the other. If the receiver cannot process packets fast enough, the receiver buffer will become full, and flow control will throttle transmission at the sender. As this situation is produced by causes external to the network, we will not consider it further here. Moreover, if the receiving overhead is greater than the sending overhead, the receiver buffer will fill up and flow control will, likewise, throttle transmission at the sender. In this case, the effect of flow control is, on average, the same as if we replace sending overhead with receiving overhead. Assuming an ideal network that behaves like two dedicated links running in opposite directions at the full link bandwidth between the two devices—which is consistent with our black box view of the network to this point—the resulting effective bandwidth is the smaller of twice the injection bandwidth (to account for the two injection links, one for each device) or twice the reception bandwidth. This results in the following expression for effective bandwidth:

Effective bandwidth = min(2 × BWLinkInjection, 2 × BWLinkReception) = (2 × Packet size) / max(Overhead, Transmission time)

where Overhead = max(Sending overhead, Receiving overhead). Taking into account the expression for the transmission time, it is obvious that the effective bandwidth delivered by the network is identical to the aggregate network bandwidth when the transmission time is greater than the overhead. Therefore, full network utilization is achieved regardless of the value for the time of flight and, thus, regardless of the distance traveled by packets, assuming ideal network behavior (i.e., enough credits and buffers are provided for credit-based and Xon/Xoff flow control). This analysis assumes that the sender and receiver network interfaces can process only one packet at a time. If multiple packets can be processed in parallel (e.g., as is done in IBM’s Federation network interfaces), the overheads for those packets can be overlapped, which increases effective bandwidth by that overlap factor up to the amount bounded by the transmission time. Let’s use the equation on page F-17 to explore the impact of packet size, transmission time, and overhead on BWLinkInjection, BWLinkReception, and effective bandwidth for the various network domains: OCNs, SANs, LANs, and WANs.
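Before working through the example, the link-level model can be written out in a few lines of C (a sketch of the expressions above, not code from the text); it returns the effective bandwidth in bits per second for two devices joined by dedicated links:

```c
/* Effective bandwidth between two devices connected by dedicated links:
   each direction is limited by the larger of its overhead and the packet
   transmission time, and the two injection/reception links are counted. */
static double max2(double a, double b) { return a > b ? a : b; }

double effective_bandwidth(double packet_bytes, double link_bw_bps,
                           double send_ovh_s, double recv_ovh_s) {
    double bits       = packet_bytes * 8.0;
    double transmit   = bits / link_bw_bps;                 /* transmission time */
    double bw_inject  = bits / max2(send_ovh_s, transmit);  /* BWLinkInjection   */
    double bw_receive = bits / max2(recv_ovh_s, transmit);  /* BWLinkReception   */
    return 2.0 * (bw_inject < bw_receive ? bw_inject : bw_receive);
}
```

With the OCN parameters and 100-byte packets, this returns 16 Gbps, matching the aggregate bandwidth cited in the answer below.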

Example

As in the previous example, assume we have a dedicated link network with a data bandwidth of 8 Gbps for each link in each direction interconnecting the two devices within an OCN, SAN, LAN, or WAN. Plot effective bandwidth versus packet size for each type of network for packets ranging in size from 4 bytes (i.e., a single 32-bit word) to 1500 bytes (i.e., the maximum transfer unit for Ethernet), assuming that end nodes have the same per-packet sending and receiving overheads as before: x + 0.05 ns/byte and 4/3(x) + 0.05 ns/byte, respectively, where x is 0 μs for the OCN, 0.3 μs for the SAN, 3 μs for the LAN, and 30 μs for the WAN. What limits the effective bandwidth, and for what packet sizes is the effective bandwidth within 10% of the aggregate network bandwidth?


Effective bandwidth (Gbits/sec) on a log scale from 0.001 to 100, plotted versus packet size (bytes), with one curve each for the OCN, SAN, LAN, and WAN.

Figure F.6 Effective bandwidth versus packet size plotted in semi-log form for the four network domains. Overhead can be amortized by increasing the packet size, but for too large of an overhead (e.g., for WANs and some LANs) scaling the packet size is of little help. Other considerations come into play that limit the maximum packet size.

Answer

Figure F.6 plots effective bandwidth versus packet size for the four network domains using the simple equation and parameters given above. For all packet sizes in the OCN, transmission time is greater than overhead (sending or receiving), allowing full utilization of the aggregate bandwidth, which is 16 Gbps—that is, injection link (alternatively, reception link) bandwidth times two to account for both devices. For the SAN, overhead—specifically, receiving overhead—is larger than transmission time for packets less than about 800 bytes; consequently, packets of 655 bytes and larger are needed to utilize 90% or more of the aggregate bandwidth. For LANs and WANs, most of the link bandwidth is not utilized since overhead in this example is many times larger than transmission time for all packet sizes. This example highlights the importance of reducing the sending and receiving overheads relative to packet transmission time in order to maximize the effective bandwidth delivered by the network. The analysis above suggests that it is possible to provide some upper bound for the effective bandwidth by analyzing the path followed by packets and determining where the bottleneck occurs. We can extend this idea beyond the network interfaces by defining a model that considers the entire network from end to


end as a pipe and identifying the narrowest section of that pipe. There are three areas of interest in that pipe: the aggregate of all network injection links and the corresponding network injection bandwidth (BWNetworkInjection), the aggregate of all network reception links and the corresponding network reception bandwidth (BWNetworkReception), and the aggregate of all network links and the corresponding network bandwidth (BWNetwork). Expressions for these will be given in later sections as various layers of the black box view of the network are peeled away. To this point, we have assumed that for just two interconnected devices the black box network behaves ideally and the network bandwidth is equal to the aggregate raw network bandwidth. In reality, it can be much less than the aggregate bandwidth as we will see in the following sections. In general, the effective bandwidth delivered end-to-end by the network to an application is upper bounded by the minimum across all three potential bottleneck areas:

Effective bandwidth = min(BWNetworkInjection, BWNetwork, BWNetworkReception)

We will expand upon this expression further in the following sections as we reveal more about interconnection networks and consider the more general case of interconnecting more than two devices. In some sections of this appendix, we show how the concepts introduced in the section take shape in example high-end commercial products. Figure F.7 lists several commercial computers that, at one point in time in their existence, were among the highest-performing systems in the world within their class. Although these systems are capable of interconnecting more than two devices, they implement the basic functions needed for interconnecting only two devices. In addition to being applicable to the SANs used in those systems, the issues discussed in this section also apply to other interconnect domains: from OCNs to WANs.

F.3 Connecting More than Two Devices

To this point, we have considered the connection of only two devices communicating over a network viewed as a black box, but what makes interconnection networks interesting is the ability to connect hundreds or even many thousands of devices together. Consequently, what makes them interesting also makes them more challenging to build. In order to connect more than two devices, a suitable structure and more functionality must be supported by the network. This section continues with our black box approach by introducing, at a conceptual level, additional network structure and functions that must be supported when interconnecting more than two devices. More details on these individual subjects are given in Sections F.4 through F.7. Where applicable, we relate the additional structure and functions to network media, flow control, and other basics presented in the previous section. In this section, we also classify networks into two broad categories

(Fields per system: intro year; maximum number of compute nodes [# CPUs]; system footprint for maximum configuration; packet [header] maximum size in bytes; injection [reception] node bandwidth in MB/sec; minimum send/receive overhead; maximum copper link length, flow control, and error handling.)

Intel ASCI Red Paragon, 2001: 4510 [2]; 2500 ft2; 1984 [4]; 400 [400]; few μs; handshaking; CRC + parity.

IBM ASCI White SP Power3 [Colony], 2001: 512 [16]; 10,000 ft2; 1024 [6]; 500 [500]; 3 μs; 25 m; credit-based; CRC.

Intel Thunder Itanium2 Tiger4 [QsNetII], 2004: 1024 [4]; 120 m2; 2048 [14]; 928 [928]; 0.240 μs; 13 m; credit-based; CRC for link, dest.

Cray XT3 [SeaStar], 2004: 30,508 [1]; 263.8 m2; 80 [16]; 3200 [3200]; few μs; 7 m; credit-based; CRC.

Cray X1E, 2004: 1024 [1]; 27 m2; 32 [16]; 1600 [1600]; 0 (direct LD/ST accesses); 5 m; credit-based; CRC.

IBM ASC Purple pSeries 575 [Federation], 2005: >1280 [8]; 6720 ft2; 2048 [7]; 2000 [2000]; 1 μs with up to 4 packets processed in parallel; 25 m; credit-based; CRC.

IBM Blue Gene/L eServer Sol. [Torus Net.], 2005: 65,536 [2]; 2500 ft2 (0.9 × 0.9 × 1.9 m3 per 1 K node rack); 256 [8]; 612.5 [1050]; 3 μs (2300 cycles); 8.6 m; credit-based; CRC (header/pkt).
Figure F.7 Basic characteristics of interconnection networks in commercial high-performance computer systems.

based on their connection structure—shared-media versus switched-media networks—and we compare them. Finally, expanded expressions for characterizing network performance are given, followed by an example.

Additional Network Structure and Functions: Topology, Routing, Arbitration, and Switching

Networks interconnecting more than two devices require mechanisms to physically connect the packet source to its destination in order to transport the packet and deliver it to the correct destination. These mechanisms can be implemented in different ways and significantly vary across interconnection network domains. However, the types of network structure and functions performed by those mechanisms are very much the same, regardless of the domain. When multiple devices are interconnected by a network, the connections between them oftentimes cannot be permanently established with dedicated links.


This could either be too restrictive as all the packets from a given source would go to the same one destination (and not to others) or prohibitively expensive as a dedicated link would be needed from every source to every destination (we will evaluate this further in the next section). Therefore, networks usually share paths among different pairs of devices, but how those paths are shared is determined by the network connection structure, commonly referred to as the network topology. Topology addresses the important issue of “What paths are possible for packets?” so packets reach their intended destinations. Every network that interconnects more than two devices also requires some mechanism to deliver each packet to the correct destination. The associated function is referred to as routing, which can be defined as the set of operations that need to be performed to compute a valid path from the packet source to its destinations. Routing addresses the important issue of “Which of the possible paths are allowable (valid) for packets?” so packets reach their intended destinations. Depending on the network, this function may be executed at the packet source to compute the entire path, at some intermediate devices to compute fragments of the path on the fly, or even at every possible destination device to verify whether that device is the intended destination for the packet. Usually, the packet header shown in Figure F.4 is extended to include the necessary routing information. In general, as networks usually contain shared paths or parts thereof among different pairs of devices, packets may request some shared resources. When several packets request the same resources at the same time, an arbitration function is required to resolve the conflict. Arbitration, along with flow control, addresses the important issue of “When are paths available for packets?” Every time arbitration is performed, there is a winner and possibly several losers. The losers are not granted access to the requested resources and are typically buffered. As indicated in the previous section, flow control may be implemented to prevent buffer overflow. The winner proceeds toward its destination once the granted resources are switched in, providing a path for the packet to advance. This function is referred to as switching. Switching addresses the important issue of “How are paths allocated to packets?” To achieve better utilization of existing communication resources, most networks do not establish an entire end-to-end path at once. Instead, as explained in Section F.5, paths are usually established one fragment at a time. These three network functions—routing, arbitration, and switching—must be implemented in every network connecting more than two devices, no matter what form the network topology takes. This is in addition to the basic functions mentioned in the previous section. However, the complexity of these functions and the order in which they are performed depends on the category of network topology, as discussed below. In general, routing, arbitration, and switching are required to establish a valid path from source to destination from among the possible paths provided by the network topology. Once the path has been established, the packet transport functions previously described are used to reliably transmit packets and receive them at the corresponding destination. Flow control, if implemented, prevents buffer overflow by throttling the sender. 
It can be implemented at the end-to-end level, the link level within the network, or both.


Shared-Media Networks

The simplest way to connect multiple devices is to have them share the network media, as shown for the bus in Figure F.8 (a). This has been the traditional way of interconnecting devices. The shared media can operate in half-duplex mode, where data can be carried in either direction over the media but simultaneous transmission and reception of data by the same device is not allowed, or in full-duplex, where the data can be carried in both directions and simultaneously transmitted and received by the same device. Until very recently, I/O devices in most systems typically shared a single I/O bus, and early system-on-chip (SoC) designs made use of a shared bus to interconnect on-chip components. The most popular LAN, Ethernet, was originally implemented as a half-duplex bus shared by up to a hundred computers, although now switched-media versions also exist. Given that network media are shared, there must be a mechanism to coordinate and arbitrate the use of the shared media so that only one packet is sent at a time. If the physical distance between network devices is small, it may be possible to have a central arbiter to grant permission to send packets. In this case, the network nodes may use dedicated control lines to interface with the arbiter. Centralized arbitration is impractical, however, for networks with a large number of nodes spread over large distances, so distributed forms of arbitration are also used. This is the case for the original Ethernet shared-media LAN. A first step toward distributed arbitration of shared media is “looking before you leap.” A node first checks the network to avoid trying to send a packet while another packet is already in the network. Listening before transmission to avoid collisions is called carrier sensing. If the interconnection is idle, the node tries to send. Looking first is not a guarantee of success, of course, as some other node may also decide to send at the same instant. When two nodes send at the same time,


Figure F.8 (a) A shared-media network versus (b) a switched-media network. Ethernet was originally a shared media network, but switched Ethernet is now available. All nodes on the shared-media networks must dynamically share the raw bandwidth of one link, but switched-media networks can support multiple links, providing higher raw aggregate bandwidth.

a collision occurs. Let’s assume that the network interface can detect any resulting collisions by listening to hear if the data become garbled by other data appearing on the line. Listening to detect collisions is called collision detection. This is the second step of distributed arbitration. The problem is not solved yet. If, after detecting a collision, every node on the network waited exactly the same amount of time, listened to be sure there was no traffic, and then tried to send again, we could still have synchronized nodes that would repeatedly bump heads. To avoid repeated head-on collisions, each node whose packet gets garbled waits (or backs off) a random amount of time before resending. Randomization breaks the synchronization. Subsequent collisions result in exponentially increasing time between attempts to retransmit, so as not to tax the network. Although this approach controls congestion on the shared media, it is not guaranteed to be fair—some subsequent node may transmit while those that collided are waiting. If the network does not have high demand from many nodes, this simple approach works well. Under high utilization, however, performance degrades since the media are shared and fairness is not ensured. Another distributed approach to arbitration of shared media that can support fairness is to pass a token between nodes. The function of the token is to grant the acquiring node the right to use the network. If the token circulates in a cyclic fashion between the nodes, a certain amount of fairness is ensured in the arbitration process. Once arbitration has been performed and a device has been granted access to the shared media, the function of switching is straightforward. The granted device simply needs to connect itself to the shared media, thus establishing a path to every possible destination. Also, routing is very simple to implement. Given that the media are shared and attached to all the devices, every device will see every packet. Therefore, each device just needs to check whether or not a given packet is intended for that device. A beneficial side effect of this strategy is that a device can send a packet to all the devices attached to the shared media through a single transmission. This style of communication is called broadcasting, in contrast to unicasting, in which each packet is intended for only one device. The shared media make it easy to broadcast a packet to every device or, alternatively, to a subset of devices, called multicasting.
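The carrier-sensing, collision-detection, and backoff sequence just described can be sketched in C as follows (an illustration only; carrier_idle, transmit_frame, collision_detected, and wait_slots are hypothetical hooks, and the 16-attempt limit is an assumption rather than part of the text):

```c
#include <stdlib.h>

/* Distributed arbitration on shared media: listen before sending, detect
   collisions while sending, and back off a random, exponentially growing
   number of slot times after each collision. */
extern int  carrier_idle(void);
extern void transmit_frame(const void *frame, int len);
extern int  collision_detected(void);
extern void wait_slots(int nslots);

int send_on_shared_media(const void *frame, int len) {
    for (int attempt = 0; attempt < 16; attempt++) {
        while (!carrier_idle())
            ;                                   /* carrier sensing: look before you leap */
        transmit_frame(frame, len);
        if (!collision_detected())
            return 0;                           /* success */
        /* Binary exponential backoff: the random range doubles after each
           collision so that colliding nodes desynchronize. */
        int range = 1 << (attempt < 10 ? attempt + 1 : 10);
        wait_slots(rand() % range);
    }
    return -1;                                  /* repeated collisions: give up */
}
```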

Switched-Media Networks

The alternative to sharing the entire network media at once across all attached nodes is to switch between disjoint portions of it shared by the nodes. Those portions consist of passive point-to-point links between active switch components that dynamically establish communication between sets of source-destination pairs. These passive and active components make up what is referred to as the network switch fabric or network fabric, to which end nodes are connected. This approach is shown conceptually in Figure F.8(b). The switch fabric is described in greater detail in Sections F.4 through F.7, where various black box layers for switched-media networks are further revealed. Nevertheless, the high-level view shown


in Figure F.8(b) illustrates the potential bandwidth improvement of switched-media networks over shared-media networks: aggregate bandwidth can be many times higher than that of shared-media networks, allowing the possibility of greater effective bandwidth to be achieved. At best, only one node at a time can transmit packets over the shared media, whereas it is possible for all attached nodes to do so over the switched-media network. Like their shared-media counterparts, switched-media networks must implement the three additional functions previously mentioned: routing, arbitration, and switching. Every time a packet enters the network, it is routed in order to select a path toward its destination provided by the topology. The path requested by the packet must be granted by some centralized or distributed arbiter, which resolves conflicts among concurrent requests for resources along the same path. Once the requested resources are granted, the network “switches in” the required connections to establish the path and allows the packet to be forwarded toward its destination. If the requested resources are not granted, the packet is usually buffered, as mentioned previously. Routing, arbitration, and switching functions are usually performed within switched networks in this order, whereas in shared-media networks routing typically is the last function performed.

Comparison of Shared- and Switched-Media Networks

In general, the advantage of shared-media networks is their low cost, but, consequently, their aggregate network bandwidth does not scale at all with the number of interconnected devices. Also, a global arbitration scheme is required to resolve conflicting demands, possibly introducing another type of bottleneck and again limiting scalability. Moreover, every device attached to the shared media increases the parasitic capacitance of the electrical conductors, thus increasing the time of flight propagation delay accordingly and, possibly, clock cycle time. In addition, it is more difficult to pipeline packet transmission over the network as the shared media are continuously granted to different requesting devices. The main advantage of switched-media networks is that the amount of network resources implemented scales with the number of connected devices, increasing the aggregate network bandwidth. These networks allow multiple pairs of nodes to communicate simultaneously, allowing much higher effective network bandwidth than that provided by shared-media networks. Also, switched-media networks allow the system to scale to very large numbers of nodes, which is not feasible when using shared media. Consequently, this scaling advantage can, at the same time, be a disadvantage if network resources grow superlinearly. Networks of superlinear cost that provide an effective network bandwidth that grows only sublinearly with the number of interconnected devices are inefficient designs for many applications and interconnection network domains.

Characterizing Performance: Latency and Effective Bandwidth

The routing, switching, and arbitration functionality described above introduces some additional components of packet transport latency that must be taken into
account in the expression for total packet latency. Assuming there is no contention for network resources—as would be the case in an unloaded network—total packet latency is given by the following:

Latency = Sending overhead + TTotalProp + TR + TA + TS + (Packet size / Bandwidth) + Receiving overhead

Here TR, TA, and TS are the total routing time, arbitration time, and switching time experienced by the packet, respectively, and are either measured quantities or calculated quantities derived from more detailed analyses. These components are added to the total propagation delay through the network links, TTotalProp, to give the overall time of flight of the packet. The expression above gives only a lower bound for the total packet latency as it does not account for additional delays due to contention for resources that may occur. When the network is heavily loaded, several packets may request the same network resources concurrently, thus causing contention that degrades performance. Packets that lose arbitration have to be buffered, which increases packet latency by some contention delay amount of waiting time. This additional delay is not included in the above expression. When the network or part of it approaches saturation, contention delay may be several orders of magnitude greater than the total packet latency suffered by a packet under zero load or even under slightly loaded network conditions. Unfortunately, it is not easy to compute analytically the total packet latency when the network is more than moderately loaded. Measurement of these quantities using cycle-accurate simulation of a detailed network model is a better and more precise way of estimating packet latency under such circumstances. Nevertheless, the expression given above is useful in calculating best-case lower bounds for packet latency. For similar reasons, effective bandwidth is not easy to compute exactly, but we can estimate best-case upper bounds for it by appropriately extending the model presented at the end of the previous section. What we need to do is to find the narrowest section of the end-to-end network pipe by finding the network injection bandwidth (BWNetworkInjection), the network reception bandwidth (BWNetworkReception), and the network bandwidth (BWNetwork) across the entire network interconnecting the devices. The BWNetworkInjection can be calculated simply by multiplying the expression for link injection bandwidth, BWLinkInjection, by the total number of network injection links. The BWNetworkReception is calculated similarly using BWLinkReception, but it must also be scaled by a factor that reflects application traffic and other characteristics. For more than two interconnected devices, it is no longer valid to assume a one-to-one relationship among sources and destinations when analyzing the effect of flow control on link reception bandwidth. It could happen, for example, that several packets from different injection links arrive concurrently at the same reception link for applications that have many-to-one traffic characteristics, which causes contention at the reception links. This effect can be taken into account by an average reception factor parameter, σ, which is either a measured quantity or a calculated quantity derived from detailed analysis. It is defined as the average
fraction or percentage of packets arriving at reception links that can be accepted. Only those packets can be immediately delivered, thus reducing network reception bandwidth by that factor. This reduction occurs as a result of application behavior regardless of internal network characteristics. Finally, BWNetwork takes into account the internal characteristics of the network, including contention. We will progressively derive expressions in the following sections that will enable us to calculate this as more details are revealed about the internals of our black box interconnection network. Overall, the effective bandwidth delivered by the network end-to-end to an application is determined by the minimum across the three sections, as described by the following:

Effective bandwidth = min(BWNetworkInjection, BWNetwork, σ × BWNetworkReception)
                    = min(N × BWLinkInjection, BWNetwork, σ × N × BWLinkReception)
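
To make these bounds concrete, the following small Python sketch (illustrative only; the function and parameter names are not from the text) evaluates the zero-load latency lower bound and the effective-bandwidth upper bound defined above.

```python
# Illustrative sketch of the two bounds above. Names and units are assumptions
# made for this example, not definitions from the text.

def packet_latency_lower_bound(send_ovhd, t_total_prop, t_r, t_a, t_s,
                               packet_size_bits, link_bw_bits_per_sec,
                               recv_ovhd):
    """Zero-load (contention-free) packet latency, in seconds."""
    transmission = packet_size_bits / link_bw_bits_per_sec
    return send_ovhd + t_total_prop + t_r + t_a + t_s + transmission + recv_ovhd

def effective_bandwidth_upper_bound(n_nodes, bw_link_injection,
                                    bw_network, sigma, bw_link_reception):
    """Best-case effective bandwidth: the narrowest of the three pipe sections."""
    bw_injection = n_nodes * bw_link_injection
    bw_reception = sigma * n_nodes * bw_link_reception
    return min(bw_injection, bw_network, bw_reception)
```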



Let’s use the above expressions to compare the latency and effective bandwidth of shared-media networks against switched-media networks for the four interconnection network domains: OCNs, SANs, LANs, and WANs.

Example

Plot the total packet latency and effective bandwidth as the number of interconnected nodes, N, scales from 4 to 1024 for shared-media and switched-media OCNs, SANs, LANs, and WANs. Assume that all network links, including the injection and reception links at the nodes, each have a data bandwidth of 8 Gbps, and unicast packets of 100 bytes are transmitted. Shared-media networks share one link, and switched-media networks have at least as many network links as there are nodes. For both, ignore latency and bandwidth effects due to contention within the network. End nodes have per-packet sending and receiving overheads of x + 0.05 ns/byte and 4/3(x) + 0.05 ns/byte, respectively, where x is 0 μs for the OCN, 0.3 μs for the SAN, 3 μs for the LAN, and 30 μs for the WAN, and interconnection distances are 0.5 cm, 5 m, 5000 m, and 5000 km, respectively. Also assume that the total routing, arbitration, and switching times are constants or functions of the number of interconnected nodes: TR = 2.5 ns, TA = 2.5(N) ns, and TS = 2.5 ns for shared-media networks and TR = TA = TS = 2.5(log_2 N) ns for switched-media networks. Finally, taking into account application traffic characteristics for the network structure, the average reception factor, σ, is assumed to be N^(-1) for shared media and polylogarithmic (log_2 N)^(-1/4) for switched media.

Answer

All components of total packet latency are the same as in the example given in the previous section except for time of flight, which now has additional routing, arbitration, and switching delays. For shared-media networks, the additional delays total 5 + 2.5(N) ns; for switched-media networks, they total 7.5(log2 N) ns. Latency is plotted only for OCNs and SANs in Figure F.9 as these networks give the more interesting results. For OCNs, TR, TA, and TS combine to dominate time of flight

Figure F.9 Latency versus number of interconnected nodes plotted in semi-log form for OCNs and SANs. Routing, arbitration, and switching have more of an impact on latency for networks in these two domains, particularly for networks with a large number of nodes, given the low sending and receiving overheads and low propagation delay.

and are much greater than each of the other latency components for a moderate to large number of nodes. This is particularly so for the shared-media network. The latency increases much more dramatically with the number of nodes for shared media as compared to switched media given the difference in arbitration delay between the two. For SANs, TR, TA, and TS dominate time of flight for most network sizes but are greater than each of the other latency components in shared-media networks only for large-sized networks; they are less than the other latency components for switched-media networks but are not negligible. For LANs and WANs, time of flight is dominated by propagation delay, which dominates other latency components as calculated in the previous section; thus, TR, TA, and TS are negligible for both shared and switched media.

Figure F.10 plots effective bandwidth versus number of interconnected nodes for the four network domains. The effective bandwidth for all shared-media networks is constant through network scaling as only one unicast packet can be received at a time over all the network reception links, and that is further limited by the receiving overhead of each network for all but the OCN. The effective bandwidth for all switched-media networks increases with the number of interconnected nodes, but it is scaled down by the average reception factor. The receiving overhead further limits effective bandwidth for all but the OCN.

Figure F.10 Effective bandwidth versus number of interconnected nodes plotted in semi-log form for the four network domains. The disparity in effective bandwidth between shared- and switched-media networks for all interconnect domains widens significantly as the number of nodes in the network increases. Only the switched on-chip network is able to achieve an effective bandwidth equal to the aggregate bandwidth for the parameters given in this example.
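
One way to see why the shared-media latency curves in Figure F.9 climb so much faster is to evaluate the two added delay expressions used in the example. The sketch below is illustrative only and simply tabulates the extra routing, arbitration, and switching delay assumed for each organization.

```python
# Illustrative: extra routing + arbitration + switching delay (in ns) from the
# example's assumptions: 5 + 2.5*N for shared media versus 7.5*log2(N) for
# switched media.
import math

for n in [4, 16, 64, 256, 1024]:
    shared_ns = 5 + 2.5 * n
    switched_ns = 7.5 * math.log2(n)
    print(f"N={n:5d}  shared={shared_ns:8.1f} ns  switched={switched_ns:6.1f} ns")
```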

Given the obvious advantages, why weren’t switched networks always used? Earlier computers were much slower and could share the network media with little impact on performance. In addition, the switches for earlier LANs and WANs took up several large boards and were about as large as an entire computer. As a consequence of Moore’s law, the size of switches has reduced considerably, and systems have a much greater need for high-performance communication. Switched networks allow communication to harvest the same rapid advancements from silicon as processors and main memory. Whereas switches from telecommunication companies were once the size of mainframe computers, today we see single-chip switches and even entire switched networks within a chip. Thus, technology and application trends favor switched networks today. Just as single-chip processors led to processors replacing logic circuits in a surprising number of places, single-chip switches and switched on-chip networks are increasingly replacing shared-media networks (i.e., buses) in several application domains. As an example, PCI-Express (PCIe)—a switched network—was introduced in 2005 to replace the traditional PCI-X bus on personal computer motherboards. The previous example also highlights the importance of optimizing the routing, arbitration, and switching functions in OCNs and SANs. For these network domains in particular, the interconnect distances and overheads typically are small
enough to make latency and effective bandwidth much more sensitive to how well these functions are implemented, particularly for larger-sized networks. This leads to implementations based mainly on faster hardware solutions for these domains. In LANs and WANs, implementations based on the slower but more flexible software solutions suffice given that performance is largely determined by other factors. The design of the topology for switched-media networks also plays a major role in determining how close the network can come to the lower bound on latency and the upper bound on effective bandwidth for OCN and SAN domains. The next three sections touch on these important issues in switched networks, with the next section focused on topology.

F.4 Network Topology

When the number of devices is small enough, a single switch is sufficient to interconnect them within a switched-media network. However, the number of switch ports is limited by existing very-large-scale integration (VLSI) technology, cost considerations, power consumption, and so on. When the number of required network ports exceeds the number of ports supported by a single switch, a fabric of interconnected switches is needed. To embody the necessary property of full access (i.e., connectedness), the network switch fabric must provide a path from every end node device to every other device. All the connections to the network fabric and between switches within the fabric use point-to-point links as opposed to shared links—that is, links with only one switch or end node device on either end. The interconnection structure across all the components—including switches, links, and end node devices—is referred to as the network topology.

The number of network topologies described in the literature would be difficult to count, but the number that have been used commercially is no more than about a dozen or so. During the 1970s and early 1980s, researchers struggled to propose new topologies that could reduce the number of switches through which packets must pass, referred to as the hop count. In the 1990s, thanks to the introduction of pipelined transmission and switching techniques, the hop count became less critical. Nevertheless, today, topology is still important, particularly for OCNs and SANs, as subtle relationships exist between topology and other network design parameters that impact performance, especially when the number of end nodes is very large (e.g., 64 K in the Blue Gene/L supercomputer) or when the latency is critical (e.g., in multicore processor chips). Topology also greatly impacts the implementation cost of the network.

Topologies for parallel supercomputer SANs have been the most visible and imaginative, usually converging on regularly structured ones to simplify routing, packaging, and scalability. Those for LANs and WANs tend to be more haphazard or ad hoc, having more to do with the challenges of long distance or connecting across different communication subnets. Switch-based topologies for OCNs are only recently emerging but are quickly gaining in popularity. This section
describes the more popular topologies used in commercial products. Their advantages, disadvantages, and constraints are also briefly discussed.

Centralized Switched Networks

As mentioned above, a single switch suffices to interconnect a set of devices when the number of switch ports is equal to or larger than the number of devices. This simple network is usually referred to as a crossbar or crossbar switch. Within the crossbar, crosspoint switch complexity increases quadratically with the number of ports, as illustrated in Figure F.11(a). Thus, a cheaper solution is desirable when the number of devices to be interconnected scales beyond the point supportable by implementation technology.

A common way of addressing the crossbar scaling problem consists of splitting the large crossbar switch into several stages of smaller switches interconnected in such a way that a single pass through the switch fabric allows any destination to be reached from any source. Topologies arranged in this way are usually referred to as multistage interconnection networks or multistage switch fabrics, and these networks typically have complexity that increases in proportion to N log N. Multistage interconnection networks (MINs) were initially proposed for telephone exchanges in the 1950s and have since been used to build the communication backbone for parallel supercomputers, symmetric multiprocessors, multicomputer clusters, and IP router switch fabrics.

Figure F.11 Popular centralized switched networks: (a) the crossbar network requires N^2 crosspoint switches, shown as black dots; (b) the Omega, a MIN, requires (N/2) log_2 N switches, shown as vertical rectangles. End node devices are shown as numbered squares (total of eight). Links are unidirectional—data enter at the left and exit out the top or right.

The interconnection pattern or patterns between MIN stages are permutations that can be represented mathematically by a set of functions, one for each stage. Figure F.11(b) shows a well-known MIN topology, the Omega, which uses the perfect-shuffle permutation as its interconnection pattern for each stage, followed by exchange switches, giving rise to a perfect-shuffle exchange for each stage. In this example, eight input-output ports are interconnected with three stages of 2 × 2 switches. It is easy to see that a single pass through the three stages allows any input port to reach any output port. In general, when using k × k switches, a MIN with N input-output ports requires at least log_k N stages, each of which contains N/k switches, for a total of (N/k) log_k N switches.

Despite their internal structure, MINs can be seen as centralized switch fabrics that have end node devices connected at the network periphery, hence the name centralized switched network. From another perspective, MINs can be viewed as interconnecting nodes through a set of switches that may not have any nodes directly connected to them, which gives rise to another popular name for centralized switched networks—indirect networks.
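
The perfect-shuffle pattern of Figure F.11(b) is easy to state in code: the wiring into each stage is a 1-bit cyclic left rotation of the port address, and each 2 × 2 exchange switch then pairs ports that differ in the least significant bit. The sketch below is illustrative only and simply prints the shuffle wiring for N = 8.

```python
# Illustrative sketch: perfect-shuffle wiring used between Omega stages.
# For N ports (N a power of 2), input port i is wired to the port whose
# address is the 1-bit cyclic left rotation of i.

def perfect_shuffle(i, n_bits):
    """Cyclic left rotation of an n_bits-wide port address."""
    msb = (i >> (n_bits - 1)) & 1
    return ((i << 1) & ((1 << n_bits) - 1)) | msb

N = 8
n_bits = N.bit_length() - 1          # log2(N) = 3 stages of 2 x 2 switches
for src in range(N):
    print(f"port {src} -> port {perfect_shuffle(src, n_bits)}")
```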

Example

Compute the cost of interconnecting 4096 nodes using a single crossbar switch relative to doing so using a MIN built from 2 × 2, 4 × 4, and 16 × 16 switches. Consider separately the relative cost of the unidirectional links and the relative cost of the switches. Switch cost is assumed to grow quadratically with the number of input (alternatively, output) ports, k, for k × k switches.

Answer

The switch cost of the network when using a single crossbar is proportional to 4096^2. The unidirectional link cost is 8192, which accounts for the set of links from the end nodes to the crossbar and also from the crossbar back to the end nodes. When using a MIN with k × k switches, the cost of each switch is proportional to k^2, but there are (4096/k) log_k 4096 total switches. Likewise, there are log_k 4096 stages of N unidirectional links per stage from the switches plus N links to the MIN from the end nodes. Therefore, the relative costs of the crossbar with respect to each MIN are given by the following:

Relative cost (2 × 2)_switches = 4096^2 / (2^2 × 4096/2 × log_2 4096) = 170
Relative cost (4 × 4)_switches = 4096^2 / (4^2 × 4096/4 × log_4 4096) = 170
Relative cost (16 × 16)_switches = 4096^2 / (16^2 × 4096/16 × log_16 4096) = 85
Relative cost (2 × 2)_links = 8192 / (4096 × (log_2 4096 + 1)) = 2/13 = 0.1538
Relative cost (4 × 4)_links = 8192 / (4096 × (log_4 4096 + 1)) = 2/7 = 0.2857
Relative cost (16 × 16)_links = 8192 / (4096 × (log_16 4096 + 1)) = 2/4 = 0.5
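
These ratios are mechanical to verify. The following sketch (illustrative; the names are not from the text) recomputes them for N = 4096 and any switch radix k.

```python
# Illustrative check of the relative-cost ratios above for N = 4096 nodes.
import math

def relative_costs(N, k):
    crossbar_switch = N ** 2                   # crosspoints grow as N^2
    crossbar_links = 2 * N                     # to and from the crossbar
    stages = math.log(N, k)                    # log_k N stages
    min_switch = k ** 2 * (N / k) * stages     # (N/k) switches per stage, cost k^2 each
    min_links = N * (stages + 1)               # N links per stage plus injection links
    return crossbar_switch / min_switch, crossbar_links / min_links

for k in (2, 4, 16):
    sw, ln = relative_costs(4096, k)
    print(f"k={k:2d}: switch cost ratio = {sw:5.1f}, link cost ratio = {ln:.4f}")
```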

In all cases, the single crossbar has much higher switch cost than the MINs. The most dramatic reduction in cost comes from the MIN composed from the smallest sized but largest number of switches, but it is interesting to see that the MINs with 2 × 2 and 4 × 4 switches yield the same relative switch cost. The relative link cost
of the crossbar is lower than the MINs, but by less than an order of magnitude in all cases. We must keep in mind that end node links are different from switch links in their length and packaging requirements, so they usually have different associated costs. Despite the lower link cost, the crossbar has higher overall relative cost.

The reduction in switch cost of MINs comes at the price of performance: contention is more likely to occur on network links, thus degrading performance. Contention in the form of packets blocking in the network arises due to paths from different sources to different destinations simultaneously sharing one or more links. The amount of contention in the network depends on communication traffic behavior. In the Omega network shown in Figure F.11(b), for example, a packet from port 0 to port 1 blocks in the first stage of switches while waiting for a packet from port 4 to port 0. In the crossbar, no such blocking occurs as links are not shared among paths to unique destinations. The crossbar, therefore, is nonblocking. Of course, if two nodes try to send packets to the same destination, there will be blocking at the reception link even for crossbar networks. This is accounted for by the average reception factor parameter (σ) when analyzing performance, as discussed at the end of the previous section.

To reduce blocking in MINs, extra switches must be added or larger ones need to be used to provide alternative paths from every source to every destination. The first commonly used solution is to add a minimum of log_k N − 1 extra switch stages to the MIN in such a way that they mirror the original topology. The resulting network is rearrangeably nonblocking as it allows nonconflicting paths among new source-destination pairs to be established, but it also doubles the hop count and could require the paths of some existing communicating pairs to be rearranged under some centralized control. The second solution takes a different approach. Instead of using more switch stages, larger switches—which can be implemented by multiple stages if desired—are used in the middle of two other switch stages in such a way that enough alternative paths through the middle-stage switches allow for nonconflicting paths to be established between the first and last stages. The best-known example of this is the Clos network, which is nonblocking. The multipath property of the three-stage Clos topology can be recursively applied to the middle-stage switches to reduce the size of all the switches down to 2 × 2, assuming that switches of this size are used in the first and last stages to begin with. What results is a Beneš topology consisting of 2(log_2 N) − 1 stages, which is rearrangeably nonblocking. Figure F.12(a) illustrates both topologies, where all switches not in the first and last stages comprise the middle-stage switches (recursively) of the Clos network.

The MINs described so far have unidirectional network links, but bidirectional forms are easily derived from symmetric networks such as the Clos and Beneš simply by folding them. The overlapping unidirectional links run in different directions, thus forming bidirectional links, and the overlapping switches merge into a single switch with twice the ports (i.e., a 4 × 4 switch). Figure F.12(b) shows the resulting folded Beneš topology but in this case with the end nodes connected

Figure F.12 Two Beneš networks. (a) A 16-port Clos topology, where the middle-stage switches shown in the darker shading are implemented with another Clos network whose middle-stage switches shown in the lighter shading are implemented with yet another Clos network, and so on, until a Beneš network is produced that uses only 2 × 2 switches everywhere. (b) A folded Beneš network (bidirectional) in which 4 × 4 switches are used; end nodes attach to the innermost set of the Beneš network (unidirectional) switches. This topology is equivalent to a fat tree, where tree vertices are shown in shades.

to the innermost switch stage of the original Beneš. Ports remain free at the other side of the network but can be used for later expansion of the network to larger sizes. These kinds of networks are referred to as bidirectional multistage interconnection networks. Among many useful properties of these networks are their modularity and their ability to exploit communication locality, which saves packets from having to hop across all network stages. Their regularity also reduces routing complexity and their multipath property enables traffic to be routed more evenly across network resources and to tolerate faults.

Another way of deriving bidirectional MINs with nonblocking (rearrangeable) properties is to form a balanced tree, where end node devices occupy leaves of the tree and switches occupy vertices within the tree. Enough links in each tree level must be provided such that the total link bandwidth remains constant across all levels. Also, except for the root, switch ports for each vertex typically grow as k^i × k^i, where i is the tree level. This can be accomplished by using k^(i−1) total switches at each vertex, where each switch has k input and k output ports, or k bidirectional ports (i.e., k × k input-output ports). Networks having such topologies are called fat tree networks. As only half of the k bidirectional ports are used in each direction, 2N/k switches are needed in each stage, totaling (2N/k) log_{k/2} N switches in the fat tree. The number of switches in the root stage can be halved as no forward links are needed, reducing switch count by N/k. Figure F.12(b) shows a fat tree for 4 × 4 switches. As can be seen, this is identical to the folded Beneš. The fat tree is the topology of choice across a wide range of network sizes for most commercial systems that use multistage interconnection networks. Most SANs used in multicomputer clusters, and many used in the most powerful
supercomputers, are based on fat trees. Commercial communication subsystems offered by Myrinet, Mellanox, and Quadrics are also built from fat trees.

Distributed Switched Networks

Switched-media networks provide a very flexible framework to design communication subsystems external to the devices that need to communicate, as presented above. However, there are cases where it is convenient to more tightly integrate the end node devices with the network resources used to enable them to communicate. Instead of centralizing the switch fabric in an external subsystem, an alternative approach is to distribute the network switches among the end nodes, which then become network nodes or simply nodes, yielding a distributed switched network. As a consequence, each network switch has one or more end node devices directly connected to it, thus forming a network node. These nodes are directly connected to other nodes without indirectly going through some external switch, giving rise to another popular name for these networks—direct networks.

The topology for distributed switched networks takes on a form much different from centralized switched networks in that end nodes are connected across the area of the switch fabric, not just at one or two of the peripheral edges of the fabric. This causes the number of switches in the system to be equal to the total number of nodes.

A quite obvious way of interconnecting nodes consists of connecting a dedicated link between each node and every other node in the network. This fully connected topology provides the best connectivity (full connectivity in fact), but it is more costly than a crossbar network, as the following example shows.

Example

Compute the cost of interconnecting N nodes using a fully connected topology relative to doing so using a crossbar topology. Consider separately the relative cost of the unidirectional links and the relative cost of the switches. Switch cost is assumed to grow quadratically with the number of unidirectional ports for k × k switches but to grow only linearly with 1 × k switches.

Answer

The crossbar topology requires an N × N switch, so the switch cost is proportional to N^2. The link cost is 2N, which accounts for the unidirectional links from the end nodes to the centralized crossbar, and vice versa. In the fully connected topology, two sets of 1 × (N − 1) switches (possibly merged into one set) are used in each of the N nodes to connect nodes directly to and from all other nodes. Thus, the total switch cost for all N nodes is proportional to 2N(N − 1). Regarding link cost, each of the N nodes requires two unidirectional links in opposite directions between its end node device and its local switch. In addition, each of the N nodes has N − 1 unidirectional links from its local switch to other switches distributed across all the other end nodes. Thus, the total number of unidirectional links is 2N + N(N − 1), which is equal to N(N + 1) for all N nodes. The relative costs of the fully connected topology with respect to the crossbar are, therefore, the following:

Relative cost_switches = 2N(N − 1) / N^2 = 2(N − 1)/N = 2(1 − 1/N)
Relative cost_links = N(N + 1) / 2N = (N + 1)/2

As the number of interconnected devices increases, the switch cost of the fully connected topology is nearly double the crossbar, with both being very high (i.e., quadratic growth). Moreover, the fully connected topology always has higher relative link cost, which grows linearly with the number of nodes. Again, keep in mind that end node links are different from switch links in their length and packaging, particularly for direct networks, so they usually have different associated costs. Despite its higher cost, the fully connected topology provides no extra performance benefits over the crossbar as both are nonblocking. Thus, crossbar networks are usually used in practice instead of fully connected networks.

A lower-cost alternative to fully connecting all nodes in the network is to directly connect nodes in sequence along a ring topology, as shown in Figure F.13. For bidirectional rings, each of the N nodes now uses only 3 × 3 switches and just two bidirectional network links (shared by neighboring nodes), for a total of N switches and N bidirectional network links. This linear cost excludes the N injection-reception bidirectional links required within nodes. Unlike shared-media networks, rings can allow many simultaneous transfers: the first node can send to the second while the second sends to the third, and so on. However, as dedicated links do not exist between logically nonadjacent node pairs, packets must hop across intermediate nodes before arriving at their destination, increasing their transport latency. For bidirectional rings, packets can be transported in either direction, with the shortest path to the destination usually being the one selected. In this case, packets must travel N/4 network switch hops, on average, with total switch hop count being one more to account for the local switch at the packet source node. Along the way, packets may block on network resources due to other packets contending for the same resources simultaneously.

Fully connected and ring-connected networks delimit the two extremes of distributed switched topologies, but there are many points of interest in between for a given set of cost-performance requirements. Generally speaking, the ideal switched-media topology has cost approaching that of a ring but performance

Figure F.13 A ring network topology, folded to reduce the length of the longest link. Shaded circles represent switches, and black squares represent end node devices. The gray rectangle signifies a network node consisting of a switch, a device, and its connecting link.

(A) 2D grid or mesh of 16 nodes. (B) 2D torus of 16 nodes. (C) Hypercube of 16 nodes (16 = 2^4, so n = 4).

Figure F.14 Direct network topologies that have appeared in commercial systems, mostly supercomputers. The shaded circles represent switches, and the black squares represent end node devices. Switches have many bidirectional network links, but at least one link goes to the end node device. These basic topologies can be supplemented with extra links to improve performance and reliability. For example, connecting the switches on the periphery of the 2D mesh, shown in (a), using the unused ports on each switch forms a 2D torus, shown in (b). The hypercube topology, shown in (c), is an n-dimensional interconnect for 2^n nodes, requiring n + 1 ports per switch: one for each of the n nearest neighbor nodes and one for the end node device.

approaching that of a fully connected topology. Figure F.14 illustrates three popular direct network topologies commonly used in systems spanning the cost-performance spectrum. All of them consist of sets of nodes arranged along multiple dimensions with a regular interconnection pattern among nodes that can be expressed mathematically. In the mesh or grid topology, all the nodes in each dimension form a linear array. In the torus topology, all the nodes in each dimension form a ring. Both of these topologies provide direct communication to neighboring nodes with the aim of reducing the number of hops suffered by packets in the network with respect to the ring. This is achieved by providing greater connectivity through additional dimensions, typically no more than three in commercial systems. The hypercube or n-cube topology is a particular case of the mesh in which only two nodes are interconnected along each dimension, leading to a number of dimensions, n, that must be large enough to interconnect all N nodes in the system (i.e., n = log_2 N). The hypercube provides better connectivity than meshes
and tori at the expense of higher link and switch costs, in terms of the number of links and number of ports per node.

Example

Compute the cost of interconnecting N devices using a torus topology relative to doing so using a fat tree topology. Consider separately the relative cost of the bidirectional links and the relative cost of the switches—which is assumed to grow quadratically with the number of bidirectional ports. Provide an approximate expression for the case of switches being similar in size.

Answer

Using k × k switches, the fat tree requires (2N/k) log_{k/2} N switches, assuming the last stage (the root) has the same number of switches as each of the other stages. Given that the number of bidirectional ports in each switch is k (i.e., there are k input ports and k output ports for a k × k switch) and that the switch cost grows quadratically with this, total network switch cost is proportional to 2kN log_{k/2} N. The link cost is N log_{k/2} N as each of the log_{k/2} N stages requires N bidirectional links, including those between the devices and the fat tree. The torus requires as many switches as nodes, each of them having 2n + 1 bidirectional ports, including the port to attach the communicating device, where n is the number of dimensions. Hence, total switch cost for the torus is (2n + 1)^2 N. Each of the torus nodes requires 2n + 1 bidirectional links for the n different dimensions and the connection for its end node device, but as the dimensional links are shared by two nodes, the total number of links is (2n/2 + 1)N = (n + 1)N bidirectional links for all N nodes. Thus, the relative costs of the torus topology with respect to the fat tree are

Relative cost_switches = (2n + 1)^2 N / (2kN log_{k/2} N) = (2n + 1)^2 / (2k log_{k/2} N)
Relative cost_links = (n + 1)N / (N log_{k/2} N) = (n + 1) / log_{k/2} N

When switch sizes are similar, 2n + 1 ≈ k. In this case, the relative cost is

Relative cost_switches = (2n + 1)^2 / (2k log_{k/2} N) = (2n + 1) / (2 log_{k/2} N) = k / (2 log_{k/2} N)
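
Under the same cost assumptions, the two expressions are easy to evaluate for particular values of N, k, and n. The sketch below is illustrative only, and the example values in the last line are an assumption chosen for demonstration, not taken from the text.

```python
# Illustrative evaluation of the torus-versus-fat-tree relative costs derived above.
import math

def torus_vs_fat_tree(N, k, n):
    """Relative switch and link cost of an n-dimensional torus versus a fat
    tree built from k x k switches, per the expressions above."""
    stages = math.log(N, k / 2)               # log_{k/2} N
    rel_switch = (2 * n + 1) ** 2 / (2 * k * stages)
    rel_links = (n + 1) / stages
    return rel_switch, rel_links

print(torus_vs_fat_tree(N=256, k=8, n=3))     # assumed case: 3D torus vs. 8 x 8 switches
```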

When the number of switch ports (also called switch degree) is small, tori have lower cost, particularly when the number of dimensions is low. This is an especially useful property when N is large. On the other hand, when larger switches and/or a high number of tori dimensions are used, fat trees are less costly and preferable. For example, when interconnecting 256 nodes, a fat tree is four times more expensive in terms of switch and link costs when 4 × 4 switches are used. This higher cost is compensated for by lower network contention, on average. The fat tree is comparable in cost to the torus when 8 × 8 switches are used (e.g., for interconnecting 256 nodes). For larger switch sizes beyond this, the torus costs more than the fat tree as each node includes a switch. This cost can be amortized by connecting multiple end node devices per switch, called bristling.

The topologies depicted in Figure F.14 all have in common the interesting characteristic of having their network links arranged in several orthogonal dimensions in a regular way. In fact, these topologies all happen to be particular
instances of a larger class of direct network topologies known as k-ary n-cubes, where k signifies the number of nodes interconnected in each of the n dimensions. The symmetry and regularity of these topologies simplify network implementation (i.e, packaging) and packet routing as the movement of a packet along a given network dimension does not modify the number of remaining hops in any other dimension toward its destination. As we will see in the next section, this topological property can be readily exploited by simple routing algorithms. Like their indirect counterpart, direct networks can introduce blocking among packets that concurrently request the same path, or part of it. The only exception is fully connected networks. The same way that the number of stages and switch hops in indirect networks can be reduced by using larger switches, the hop count in direct networks can likewise be reduced by increasing the number of topological dimensions via increased switch degree. It may seem to be a good idea always to maximize the number of dimensions for a system of a certain size and switch cost. However, this is not necessarily the case. Most electronic systems are built within our three-dimensional (3D) world using planar (2D) packaging technology such as integrated circuit chips, printed circuit boards, and backplanes. Direct networks with up to three dimensions can be implemented using relatively short links within this 3D space, independent of system size. Links in higher-dimensioned networks would require increasingly longer wires or fiber. This increase in link length with system size is also indicative of MINs, including fat trees, which require either long links within all the stages or increasingly longer links as more stages are added. As we saw in the first example given in Section F.2, flow-controlled buffers increase in size proportionally to link length, thus requiring greater silicon area. This is among the reasons why the supercomputer with the largest number of compute nodes existing in 2005, the IBM Blue Gene/L, implemented a 3D torus network for interprocessor communication. A fat tree would have required much longer links, rendering a 64K node system less feasible. This highlights the importance of correctly selecting the proper network topology that meets system requirements. Besides link length, other constraints derived from implementing the topology may also limit the degree to which a topology can scale. These are available pin-out and achievable bisection bandwidth. Pin count is a local restriction on the bandwidth of a chip, printed circuit board, and backplane (or chassis) connector. In a direct network that integrates processor cores and switches on a single chip or multichip module, pin bandwidth is used both for interfacing with main memory and for implementing node links. In this case, limited pin count could reduce the number of switch ports or bit lines per link. In an indirect network, switches are implemented separately from processor cores, allowing most of the pins to be dedicated to communication bandwidth. However, as switches are grouped onto boards, the aggregate of all input-output links of the switch fabric on a board for a given topology must not exceed the board connector pin-outs. The bisection bandwidth is a more global restriction that gives the interconnect density and bandwidth that can be achieved by a given implementation
(packaging) technology. Interconnect density and clock frequency are related to each other: When wires are packed closer together, crosstalk and parasitic capacitance increase, which usually impose a lower clock frequency. For example, the availability and spacing of metal layers limit wire density and frequency of on-chip networks, and copper track density limits wire density and frequency on a printed circuit board. To be implementable, the topology of a network must not exceed the available bisection bandwidth of the implementation technology. Most networks implemented to date are constrained more by pin-out limitations than by bisection bandwidth, particularly with the recent move to blade-based systems. Nevertheless, bisection bandwidth largely affects performance.

For a given topology, bisection bandwidth, BWBisection, is calculated by dividing the network into two roughly equal parts—each with half the nodes—and summing the bandwidth of the links crossing the imaginary dividing line. For nonsymmetric topologies, bisection bandwidth is the smallest of all pairs of equal-sized divisions of the network. For a fully connected network, the bisection bandwidth is proportional to N^2/2 unidirectional links (or N^2/4 bidirectional links), where N is the number of nodes. For a bus, bisection bandwidth is the bandwidth of just the one shared half-duplex link. For other topologies, values lie in between these two extremes. Network injection and reception bisection bandwidth is commonly used as a reference value, which is N/2 for a network with N injection and reception links, respectively. Any network topology that provides this bisection bandwidth is said to have full bisection bandwidth. Figure F.15 summarizes the number of switches and links required, the corresponding switch size, the maximum and average switch hop distances between nodes, and the bisection bandwidth in terms of links for several topologies discussed in this section for interconnecting 64 nodes.

Evaluation category                 | Bus   | Ring    | 2D mesh | 2D torus | Hypercube | Fat tree | Fully connected
Performance: BWBisection in # links | 1     | 2       | 8       | 16       | 32        | 32       | 1024
Performance: Max (ave.) hop count   | 1 (1) | 32 (16) | 14 (7)  | 8 (4)    | 6 (3)     | 11 (9)   | 1 (1)
Cost: I/O ports per switch          | NA    | 3       | 5       | 5        | 7         | 4        | 64
Cost: Number of switches            | NA    | 64      | 64      | 64       | 64        | 192      | 64
Cost: Number of network links       | 1     | 64      | 112     | 128      | 192       | 320      | 2016
Cost: Total number of links         | 1     | 128     | 176     | 192      | 256       | 384      | 2080

Figure F.15 Performance and cost of several network topologies for 64 nodes. The bus is the standard reference at unit network link cost and bisection bandwidth. Values are given in terms of bidirectional links and ports. Hop count includes a switch and its output link, but not the injection link at end nodes. Except for the bus, values are given for the number of network links and total number of links, including injection/reception links between end node devices and the network.
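
The bisection-bandwidth row of Figure F.15 follows from simple closed forms. The sketch below is illustrative only and reproduces those values for N = 64 with bidirectional links.

```python
# Illustrative closed forms for the bisection bandwidth (in bidirectional
# links) of the 64-node topologies summarized in Figure F.15.
N = 64
k = 8                                   # 8 x 8 arrangement for the mesh and torus
bisection = {
    "bus": 1,                           # one shared half-duplex link
    "ring": 2,
    "2D mesh": k,                       # k links cross the cut of a k x k mesh
    "2D torus": 2 * k,                  # wrap-around links double the cut
    "hypercube": N // 2,                # full bisection bandwidth
    "fat tree": N // 2,                 # full bisection bandwidth
    "fully connected": N * N // 4,
}
print(bisection)
```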

Effects of Topology on Network Performance

Switched network topologies require packets to take one or more hops to reach their destination, where each hop represents the transport of a packet through a switch and one of its corresponding links. Interestingly, each switch and its corresponding links can be modeled as a black box network connecting more than two devices, as was described in the previous section, where the term “devices” here refers to end nodes or other switches. The only differences are that the sending and receiving overheads are null through the switches, and the routing, switching, and arbitration delays are not cumulative but, instead, are delays associated with each switch. As a consequence of the above, if the average packet has to traverse d hops to its destination, then TR + TA + TS = (Tr + Ta + Ts) × d, where Tr, Ta, and Ts are the routing, arbitration, and switching delays, respectively, of a switch. With the assumption that pipelining over the network is staged on each hop at the packet level (this assumption will be challenged in the next section), the transmission delay is also increased by a factor of the number of hops. Finally, with the simplifying assumption that all injection links to the first switch or stage of switches and all links (including reception links) from the switches have approximately the same length and delay, the total propagation delay through the network TTotalProp is the propagation delay through a single link, TLinkProp, multiplied by d + 1, which is the hop count plus one to account for the injection link. Thus, the best-case lower-bound expression for average packet latency in the network (i.e., the latency in the absence of contention) is given by the following expression:

Latency = Sending overhead + TLinkProp × (d + 1) + (Tr + Ta + Ts) × d + (Packet size / Bandwidth) × (d + 1) + Receiving overhead
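
A per-hop version of the earlier latency sketch is a direct transcription of the expression above; the function below is illustrative only, with assumed parameter names.

```python
# Illustrative: zero-load latency lower bound for a d-hop path, assuming
# packet-level pipelining at every switch (the assumption stated above).
def latency_d_hops(send_ovhd, t_link_prop, t_r, t_a, t_s,
                   packet_size_bits, link_bw_bits_per_sec, recv_ovhd, d):
    transmission = packet_size_bits / link_bw_bits_per_sec
    return (send_ovhd
            + t_link_prop * (d + 1)            # d + 1 links, counting the injection link
            + (t_r + t_a + t_s) * d            # per-switch routing/arbitration/switching
            + transmission * (d + 1)           # retransmitted on every hop
            + recv_ovhd)
```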

Again, the expression above assumes that switches are able to pipeline packet transmission at the packet level. Following the method presented previously, we can estimate the best-case upper bound for effective bandwidth by finding the narrowest section of the end-to-end network pipe. Focusing on the internal network portion of that pipe, network bandwidth is determined by the blocking properties of the topology. Non-blocking behavior can be achieved only by providing many alternative paths between every source-destination pair, leading to an aggregate network bandwidth that is many times higher than the aggregate network injection or reception bandwidth. This is quite costly. As this solution usually is prohibitively expensive, most networks have different degrees of blocking, which reduces the utilization of the aggregate bandwidth provided by the topology. This, too, is costly, but the cost is paid in performance rather than in hardware.

The amount of blocking in a network depends on its topology and the traffic distribution. Assuming the bisection bandwidth, BWBisection, of a topology is implementable (as typically is the case), it can be used as a constant measure of the maximum degree of blocking in a network. In the ideal case, the network always achieves full bisection bandwidth irrespective of the traffic behavior, thus
transferring the bottlenecking point to the injection or reception links. However, as packets destined to locations in the other half of the network necessarily must cross the bisection links, those links pose as potential bottleneck links—potentially reducing the network bandwidth to below full bisection bandwidth. Fortunately, not all of the traffic must cross the network bisection, allowing more of the aggregate network bandwidth provided by the topology to be utilized. Also, network topologies with a higher number of bisection links tend to have less blocking as more alternative paths are possible to reach destinations and, hence, a higher percentage of the aggregate network bandwidth can be utilized. If only a fraction of the traffic must cross the network bisection, as captured by a bisection traffic fraction parameter γ (0 < γ ≤ 1), the network pipe at the bisection is, effectively, widened by the reciprocal of that fraction, assuming a traffic distribution that loads the bisection links at least as heavily, on average, as other network links. This defines the upper limit on achievable network bandwidth, BWNetwork:

BWNetwork = BWBisection / γ

Accordingly, the expression for effective bandwidth becomes the following when network topology is taken into consideration:

Effective bandwidth = min(N × BWLinkInjection, BWBisection / γ, σ × N × BWLinkReception)

It is important to note that γ depends heavily on the traffic patterns generated by applications. It is a measured quantity or calculated from detailed traffic analysis.
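
Folding the topology term into the earlier effective-bandwidth sketch changes only the middle argument of the min(). The function below is illustrative; the parameter names are assumptions.

```python
# Illustrative: effective-bandwidth upper bound once the topology's bisection
# bandwidth and the bisection traffic fraction (gamma) are taken into account.
def effective_bandwidth_topology(n_nodes, bw_link_injection, bw_bisection,
                                 gamma, sigma, bw_link_reception):
    bw_network = bw_bisection / gamma          # BW_Network = BW_Bisection / gamma
    return min(n_nodes * bw_link_injection,
               bw_network,
               sigma * n_nodes * bw_link_reception)
```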

Example

A common communication pattern in scientific programs is to have nearest-neighbor elements of a two-dimensional array communicate in a given direction. This pattern is sometimes called NEWS communication, standing for north, east, west, and south—the directions on a compass. Map an 8 × 8 array of elements one-to-one onto 64 end node devices interconnected in the following topologies: bus, ring, 2D mesh, 2D torus, hypercube, fully connected, and fat tree. How long does it take in the best case for each node to send one message to its northern neighbor and one to its eastern neighbor, assuming packets are allowed to use any minimal path provided by the topology? What is the corresponding effective bandwidth? Ignore elements that have no northern or eastern neighbors. To simplify the analysis, assume that all networks experience unit packet transport time for each network hop—that is, TLinkProp, Tr, Ta, Ts, and packet transmission time for each hop sum to one. Also assume the delay through injection links is included in this unit time, and sending/receiving overhead is null.

Answer

This communication pattern requires us to send 2 × (64 − 8) or 112 total packets—that is, 56 packets in each of the two communication phases: northward and eastward. The number of hops suffered by packets depends on the topology. Communication between sources and destinations is one-to-one, so σ is 100%.

The injection and reception bandwidth cap the effective bandwidth to a maximum of 64 BW units (even though the communication pattern requires only 56 BW units). However, this maximum may get scaled down by the achievable network bandwidth, which is determined by the bisection bandwidth and the fraction of traffic crossing it, γ, both of which are topology dependent. Here are the various cases:
■ Bus—The mapping of the 8 × 8 array elements to nodes makes no difference for the bus as all nodes are equally distant at one hop away. However, the 112 transfers are done sequentially, taking a total of 112 time units. The bisection bandwidth is 1, and γ is 100%. Thus, effective bandwidth is only 1 BW unit.

■ Ring—Assume the first row of the array is mapped to nodes 0 to 7, the second row to nodes 8 to 15, and so on. It takes just one time unit for all nodes simultaneously to send to their eastern neighbor (i.e., a transfer from node i to node i + 1). With this mapping, the northern neighbor for each node is exactly eight hops away so it takes eight time units, which also is done in parallel for all nodes. Total communication time is, therefore, 9 time units. The bisection bandwidth is 2 bidirectional links (assuming a bidirectional ring), which is less than the full bisection bandwidth of 32 bidirectional links. For eastward communication, because only 2 of the eastward 56 packets must cross the bisection in the worst case, the bisection links do not pose as bottlenecks. For northward communication, 8 of the 56 packets must cross the two bisection links, yielding a γ of 10/112 = 8.93%. Thus, the network bandwidth is 2/0.0893 = 22.4 BW units. This limits the effective bandwidth at 22.4 BW units as well, which is less than half the bandwidth required by the communication pattern.

■ 2D mesh—There are eight rows and eight columns in our grid of 64 nodes, which is a perfect match to the NEWS communication. It takes a total of just 2 time units for all nodes to send simultaneously to their northern neighbors followed by simultaneous communication to their eastern neighbors. The bisection bandwidth is 8 bidirectional links, which is less than full bisection bandwidth. However, the perfect matching of this nearest neighbor communication pattern on this topology allows the maximum effective bandwidth to be achieved regardless. For eastward communication, 8 of the 56 packets must cross the bisection in the worst case, which does not exceed the bisection bandwidth. None of the northward communications crosses the same network bisection, yielding a γ of 8/112 = 7.14% and a network bandwidth of 8/0.0714 = 112 BW units. The effective bandwidth is, therefore, limited by the communication pattern at 56 BW units as opposed to the mesh network.

■ 2D torus—Wrap-around links of the torus are not used for this communication pattern, so the torus has the same mapping and performance as the mesh.

■ Hypercube—Assume elements in each row are mapped to the same location within the eight 3-cubes comprising the hypercube such that consecutive row elements are mapped to nodes only one hop away. Northern neighbors can be similarly mapped to nodes only one hop away in an orthogonal dimension. Thus, the communication pattern takes just 2 time units. The hypercube provides full bisection bandwidth of 32 links, but at most only 8 of the 112 packets must cross the bisection. Thus, effective bandwidth is limited only by the communication pattern to be 56 BW units, not by the hypercube network.

■ Fully connected—Here, nodes are equally distant at one hop away, regardless of the mapping. Parallel transfer of packets in both the northern and eastern directions would take only 1 time unit if the injection and reception links could source and sink two packets at a time. As this is not the case, 2 time units are required. Effective bandwidth is limited by the communication pattern at 56 BW units, so the 1024 network bisection links largely go underutilized.

■ Fat tree—Assume the same mapping of elements to nodes as is done for the ring and the use of switches with eight bidirectional ports. This allows simultaneous communication to eastern neighbors that takes at most three hops and, therefore, 3 time units through the three bidirectional stages interconnecting the eight nodes in each of the eight groups of nodes. The northern neighbor for each node resides in the adjacent group of eight nodes, which requires five hops, or 5 time units. Thus, the total time required on the fat tree is 8 time units. The fat tree provides full bisection bandwidth, so in the worst case of half the traffic needing to cross the bisection, an effective bandwidth of 56 BW units (as limited by the communication pattern and not by the fat-tree network) is achieved when packets are continually injected.
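
The γ bookkeeping in the ring and 2D mesh cases above can be checked mechanically. The sketch below is illustrative only and recomputes the network-bandwidth limit from the packet counts given in the example.

```python
# Illustrative check of the gamma calculation for the ring and 2D mesh cases
# above (64 nodes, 112 packets total, one BW unit per link).
def network_bw_limit(packets_crossing_bisection, total_packets, bisection_links):
    gamma = packets_crossing_bisection / total_packets
    return bisection_links / gamma            # BW_Bisection / gamma

print(network_bw_limit(10, 112, 2))   # ring:  about 22.4 BW units
print(network_bw_limit(8, 112, 8))    # mesh: 112 BW units
```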

The above example should not lead one to the wrong conclusion that meshes are just as good as tori, hypercubes, fat trees, and other networks with higher bisection bandwidth. A number of simplifications that benefit low-bisection networks were assumed to ease the analysis. In practice, packets typically are larger than the link width and occupy links for many more than just one network cycle. Also, many communication patterns do not map so cleanly to the 2D mesh network topology; instead, usually they are more global and irregular in nature. These and other factors combine to increase the chances of packets blocking in low-bisection networks, increasing latency and reducing effective bandwidth. To put this discussion on topologies into further perspective, Figure F.16 lists various attributes of topologies used in commercial high-performance computers.

F.5 Network Routing, Arbitration, and Switching

Routing, arbitration, and switching are performed at every switch along a packet’s path in a switched-media network, no matter what the network topology. Numerous interesting techniques for accomplishing these network functions have been

Company | System [network] name | Max. number of nodes [× # CPUs] | Basic network topology | Injection [reception] node BW in MB/sec | # of data bits per link per direction | Raw network link BW per direction in MB/sec | Raw network bisection BW (bidirectional) in GB/sec
Intel | ASCI Red Paragon | 4816 [2] | 2D mesh 64 × 64 | 400 [400] | 16 bits | 400 | 51.2
IBM | ASCI White SP Power3 [Colony] | 512 [16] | Bidirectional MIN with 8-port bidirectional switches (typically a fat tree or Omega) | 500 [500] | 8 bits (+1 bit of control) | 500 | 256
Intel | Thunder Itanium2 Tiger4 [QsNetII] | 1024 [4] | Fat tree with 8-port bidirectional switches | 928 [928] | 8 bits (+2 of control for 4b/5b encoding) | 1333 | 1365
Cray | XT3 [SeaStar] | 30,508 [1] | 3D torus 40 × 32 × 24 | 3200 [3200] | 12 bits | 3800 | 5836.8
Cray | X1E | 1024 [1] | 4-way bristled 2D torus (~23 × 11) with express links | 1600 [1600] | 16 bits | 1600 | 51.2
IBM | ASC Purple pSeries 575 [Federation] | >1280 [8] | Bidirectional MIN with 8-port bidirectional switches (typically a fat tree or Omega) | 2000 [2000] | 8 bits (+2 bits of control for novel 5b/6b encoding scheme) | 2000 | 2560
IBM | Blue Gene/L eServer Sol. [Torus Net.] | 65,536 [2] | 3D torus 32 × 32 × 64 | 612.5 [1050] | 1 bit (bit serial) | 175 | 358.4

Figure F.16 Topological characteristics of interconnection networks used in commercial high-performance machines.

proposed in the literature. In this section, we focus on describing a representative set of approaches used in commercial systems for the more commonly used network topologies. Their impact on performance is also highlighted.

Routing

The routing algorithm defines which network path, or paths, are allowed for each packet. Ideally, the routing algorithm supplies shortest paths to all packets such that
traffic load is evenly distributed across network links to minimize contention. However, some paths provided by the network topology may not be allowed in order to guarantee that all packets can be delivered, no matter what the traffic behavior. Paths that have an unbounded number of allowed nonminimal hops from packet sources, for instance, may result in packets never reaching their destinations. This situation is referred to as livelock. Likewise, paths that cause a set of packets to block in the network forever waiting only for network resources (i.e., links or associated buffers) held by other packets in the set also prevent packets from reaching their destinations. This situation is referred to as deadlock. As deadlock arises due to the finiteness of network resources, the probability of its occurrence increases with increased network traffic and decreased availability of network resources. For the network to function properly, the routing algorithm must guard against this anomaly, which can occur in various forms—for example, routing deadlock, request-reply (protocol) deadlock, and fault-induced (reconfiguration) deadlock, etc. At the same time, for the network to provide the highest possible performance, the routing algorithm must be efficient—allowing as many routing options to packets as there are paths provided by the topology, in the best case. The simplest way of guarding against livelock is to restrict routing such that only minimal paths from sources to destinations are allowed or, less restrictively, only a limited number of nonminimal hops. The strictest form has the added benefit of consuming the minimal amount of network bandwidth, but it prevents packets from being able to use alternative nonminimal paths in case of contention or faults along the shortest (minimal) paths. Deadlock is more difficult to guard against. Two common strategies are used in practice: avoidance and recovery. In deadlock avoidance, the routing algorithm restricts the paths allowed by packets to only those that keep the global network state deadlock-free. A common way of doing this consists of establishing an ordering between a set of resources—the minimal set necessary to support network full access—and granting those resources to packets in some total or partial order such that cyclic dependency cannot form on those resources. This allows an escape path always to be supplied to packets no matter where they are in the network to avoid entering a deadlock state. In deadlock recovery, resources are granted to packets without regard for avoiding deadlock. Instead, as deadlock is possible, some mechanism is used to detect the likely existence of deadlock. If detected, one or more packets are removed from resources in the deadlock set—possibly by regressively dropping the packets or by progressively redirecting the packets onto special deadlock recovery resources. The freed network resources are then granted to other packets needing them to resolve the deadlock. Let us consider routing algorithms designed for distributed switched networks. Figure F.17(a) illustrates one of many possible deadlocked configurations for packets within a region of a 2D mesh network. The routing algorithm can avoid all such deadlocks (and livelocks) by allowing only the use of minimal paths that cross the network dimensions in some total order. 

Figure F.17 A mesh network with packets routing from sources, si, to destinations, di. (a) Deadlock forms from packets destined to d1 through d4 blocking on others in the same set that fully occupy their requested buffer resources one hop away from their destinations. This deadlock cycle causes other packets needing those resources also to block, like packets from s5 destined to d5 that have reached node s3. (b) Deadlock is avoided using dimension-order routing. In this case, packets exhaust their routes in the X dimension before turning into the Y dimension in order to complete their routing.

That is, links of a given dimension are not supplied to a packet by the routing algorithm until no other links are needed by the packet in all of the preceding dimensions for it to reach its
destination. This is illustrated in Figure F.17(b), where dimensions are crossed in XY dimension order. All the packets must follow the same order when traversing dimensions, exiting a dimension only when links are no longer required in that dimension. This well-known algorithm is referred to as dimension-order routing (DOR) or e-cube routing in hypercubes. It is used in many commercial systems built from distributed switched networks and on-chip networks. As this routing algorithm always supplies the same path for a given source-destination pair, it is a deterministic routing algorithm. Crossing dimensions in order on some minimal set of resources required to support network full access avoids deadlock in meshes and hypercubes. However, for distributed switched topologies that have wrap-around links (e.g., rings and tori), a total ordering on a minimal set of resources within each dimension is also needed if resources are to be used to full capacity. Alternatively, some empty resources or bubbles along the dimensions would be required to remain below full capacity and avoid deadlock. To allow full access, either the physical links must be duplicated or the logical buffers associated with each link must be duplicated, resulting in physical channels or virtual channels, respectively, on which the ordering is done. Ordering is not necessary on all network resources to avoid deadlock—it is needed only on some minimal set required to support network full access (i.e., some escape resource set). Routing algorithms based on this technique (called Duato’s protocol) can be defined that allow alternative paths provided by the topology to be used for a given source-destination pair in addition to the escape resource set. One of those allowed paths must be selected, preferably the most

efficient one. Adapting the path in response to prevailing network traffic conditions enables the aggregate network bandwidth to be better utilized and contention to be reduced. Such routing capability is referred to as adaptive routing and is used in many commercial systems.

Example

How many of the possible dimensional turns are eliminated by dimension-order routing on an n-dimensional mesh network? What is the fewest number of turns that actually need to be eliminated while still maintaining connectedness and deadlock freedom? Explain using a 2D mesh network.

Answer

The dimension-order routing algorithm eliminates exactly half of the possible dimensional turns as it is easily proven that all turns from any lower-ordered dimension into any higher-ordered dimension are allowed, but the converse is not true. For example, of the eight possible turns in the 2D mesh shown in Figure F.17, the four turns from X+ to Y+, X+ to Y-, X- to Y+, and X- to Y- are allowed, where the signs (+ or -) refer to the direction of travel within a dimension. The four turns from Y+ to X+, Y+ to X-, Y- to X+, and Y- to X- are disallowed turns. The elimination of these turns prevents cycles of any kind from forming—and, thus, avoids deadlock—while keeping the network connected. However, it does so at the expense of not allowing any routing adaptivity.

The Turn Model routing algorithm proves that the minimum number of eliminated turns to prevent cycles and maintain connectedness is a quarter of the possible turns, but the right set of turns must be chosen. Only some particular set of eliminated turns allow both requirements to be satisfied. With the elimination of the wrong set of a quarter of the turns, it is possible for combinations of allowed turns to emulate the eliminated ones (and, thus, form cycles and deadlock) or for the network not to be connected. For the 2D mesh, for example, it is possible to eliminate only the two turns ending in the westward direction (i.e., Y+ to X- and Y- to X-) by requiring packets to start their routes in the westward direction (if needed) to maintain connectedness. Alternatives to this west-first routing for 2D meshes are negative-first routing and north-last routing. For these, the extra quarter of turns beyond that supplied by DOR allows for partial adaptivity in routing, making these adaptive routing algorithms.
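These turn restrictions fall directly out of how dimension-order routing computes its output port. The following sketch (ours, not from the appendix; the coordinates and direction names are only illustrative) walks a packet through a 2D mesh with XY routing: because the X offset is always exhausted before any Y hop is taken, none of the four Y-to-X turns can ever occur.

```python
# Minimal sketch of dimension-order (XY) routing on a 2D mesh.
# Offsets in X are exhausted before any hop in Y, so the four
# Y-to-X turns can never occur, which breaks all cycles.

def dor_xy_output(curr, dest):
    """Return the output direction ('X+', 'X-', 'Y+', 'Y-', or 'local')
    chosen by XY dimension-order routing at node `curr` for a packet
    headed to node `dest`. Nodes are (x, y) coordinates."""
    (xc, yc), (xd, yd) = curr, dest
    if xd > xc:
        return "X+"
    if xd < xc:
        return "X-"
    if yd > yc:
        return "Y+"
    if yd < yc:
        return "Y-"
    return "local"          # packet has arrived

def route(curr, dest):
    """List the sequence of hops taken from `curr` to `dest`."""
    step = {"X+": (1, 0), "X-": (-1, 0), "Y+": (0, 1), "Y-": (0, -1)}
    hops = []
    while curr != dest:
        d = dor_xy_output(curr, dest)
        hops.append(d)
        dx, dy = step[d]
        curr = (curr[0] + dx, curr[1] + dy)
    return hops

if __name__ == "__main__":
    # Only X-then-Y turns appear; no Y-to-X turn is ever taken.
    print(route((0, 0), (3, 2)))   # ['X+', 'X+', 'X+', 'Y+', 'Y+']
```

A west-first variant would instead forbid only the two turns ending in the X- direction, which is why any required X- hops must be taken first.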


deterministic path to the destination is provided by the fat tree topology from a least common ancestor. This self-routing property is common to many MINs and can be readily exploited: The switch output port at each stage is given simply by shifts of the destination node address. More generally, a tree graph can be mapped onto any topology—whether direct or indirect—and links between nodes at the same tree level can be allowed by assigning directions to them, where “up” designates paths moving toward the tree root and “down” designates paths moving away from the root node. This allows for generic up*/down* routing to be defined on any topology such that packets follow paths (possibly adaptively) consisting of zero or more up links followed by zero or more down links to their destination. Up/down ordering prevents cycles from forming, avoiding deadlock. This routing technique was used in Autonet—a self-configuring switched LAN—and in early Myrinet SANs. Routing algorithms are implemented in practice by a combination of the routing information placed in the packet header by the source node and the routing control mechanism incorporated in the switches. For source routing, the entire routing path is precomputed by the source—possibly by table lookup—and placed in the packet header. This usually consists of the output port or ports supplied for each switch along the predetermined path from the source to the destination, which can be stripped off by the routing control mechanism at each switch. An additional bit field can be included in the header to signify whether adaptive routing is allowed (i.e., that any one of the supplied output ports can be used). For distributed routing, the routing information usually consists of the destination address. This is used by the routing control mechanism in each switch along the path to determine the next output port, either by computing it using a finite-state machine or by looking it up in a local routing table (i.e., forwarding table). Compared to distributed routing, source routing simplifies the routing control mechanism within the network switches, but it requires more routing bits in the header of each packet, thus increasing the header overhead.
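The difference between the two implementation styles just described can be made concrete with a small sketch (all names and data structures here are ours, purely for illustration): a source-routed packet carries the precomputed list of output ports in its header and each switch strips off its own entry, whereas a distributed-routed packet carries only a destination identifier that each switch looks up in its local forwarding table.

```python
# Sketch contrasting source routing and distributed routing.
# All names and data structures are illustrative only.

from collections import deque

# --- Source routing: the header carries the whole path of output ports.
def source_route_step(header_ports):
    """Each switch pops its own output port off the front of the header."""
    port = header_ports.popleft()     # port to use at this switch
    return port                       # the shortened header travels onward

# --- Distributed routing: the header carries only the destination;
#     each switch looks the destination up in its local forwarding table.
def distributed_route_step(forwarding_table, destination):
    return forwarding_table[destination]

if __name__ == "__main__":
    # Source-routed packet traversing three switches.
    header = deque([2, 0, 3])         # one output port per hop, precomputed at the source
    print([source_route_step(header) for _ in range(3)])     # [2, 0, 3]

    # Distributed routing at one switch with a per-destination table.
    table = {"node5": 1, "node7": 3}
    print(distributed_route_step(table, "node7"))            # 3
```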

Arbitration

The arbitration algorithm determines when requested network paths are available for packets. Ideally, arbiters maximize the matching of free network resources and packets requesting those resources. At the switch level, arbiters maximize the matching of free output ports and packets located in switch input ports requesting those output ports. When all requests cannot be granted simultaneously, switch arbiters resolve conflicts by granting output ports to packets in a fair way such that starvation of requested resources by packets is prevented. This could happen to packets in shorter queues if a serve-longest-queue (SLQ) scheme is used. For packets having the same priority level, simple round-robin (RR) or age-based schemes are sufficiently fair and straightforward to implement.

Figure F.18 Two arbitration techniques. (a) Two-phased arbitration in which two of the four input ports are granted requested output ports. (b) Three-phased arbitration in which three of the four input ports are successful in gaining the requested output ports, resulting in higher switch utilization.

Arbitration can be distributed to avoid centralized bottlenecks. A straightforward technique consists of two phases: a request phase and a grant phase. Let us assume that each switch input port has an associated queue to hold incoming

packets and that each switch output port has an associated local arbiter implementing a round-robin strategy. Figure F.18(a) shows a possible set of requests for a four-port switch. In the request phase, packets at the head of each input port queue send a single request to the arbiters corresponding to the output ports requested by them. Then, each output port arbiter independently arbitrates among the requests it receives, selecting only one. In the grant phase, one of the requests to each arbiter is granted the requested output port. When two packets from different input ports request the same output port, only one receives a grant, as shown in the figure. As a consequence, some output port bandwidth remains unused even though all input queues have packets to transmit. The simple two-phase technique can be improved by allowing several simultaneous requests to be made by each input port, possibly coming from different virtual channels or from multiple adaptive routing options. These requests are sent to different output port arbiters. By submitting more than one request per input port, the probability of matching increases. Now, arbitration requires three phases: request, grant, and acknowledgment. Figure F.18(b) shows the case in which up to two requests can be made by packets at each input port. In the request phase, requests are submitted to output port arbiters, and these arbiters select one of the received requests, as is done for the two-phase arbiter. Likewise, in the grant phase, the selected requests are granted to the corresponding requesters. Taking into account that an input port can submit more than one request, it may receive more than one grant. Thus, it selects among possibly multiple grants using some arbitration strategy such as round-robin. The selected grants are confirmed to the corresponding output port arbiters in the acknowledgment phase. As can be seen in Figure F.18(b), it could happen that an input port that submits several requests does not receive any grants, while some of the requested ports remain free. Because of this, a second arbitration iteration can improve the probability of matching. In this iteration, only the requests corresponding to nonmatched input and output ports are submitted. Iterative arbiters with multiple

requests per input port are able to increase the utilization of switch output ports and, thus, the network link bandwidth. However, this comes at the expense of additional arbiter complexity and increased arbitration delay, which could increase the router clock cycle time if it is on the critical path.
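The request-grant-acknowledgment exchange described above can be sketched in a few lines. This is an illustrative model only (real switch arbiters maintain rotating round-robin pointers and may iterate several times); it shows how independent per-output grants and per-input acknowledgments produce a conflict-free matching in one pass.

```python
# Sketch of one request-grant-acknowledge iteration of distributed
# switch arbitration (illustrative only; a real arbiter also rotates
# round-robin pointers between cycles).

def arbitration_iteration(requests, num_outputs):
    """requests[i] is the set of output ports requested by input i
    (e.g., one request per virtual channel or routing option).
    Returns a dict mapping matched input -> granted output."""
    # Request phase: every input sends all its requests to the output arbiters.
    requests_at_output = {o: [] for o in range(num_outputs)}
    for inp, outs in enumerate(requests):
        for o in outs:
            requests_at_output[o].append(inp)

    # Grant phase: each output arbiter independently grants one request
    # (lowest-numbered input stands in for a round-robin choice here).
    grants_at_input = {}
    for o, inputs in requests_at_output.items():
        if inputs:
            winner = min(inputs)
            grants_at_input.setdefault(winner, []).append(o)

    # Acknowledgment phase: each input accepts at most one of its grants.
    matching = {}
    for inp, outs in grants_at_input.items():
        matching[inp] = min(outs)
    return matching

if __name__ == "__main__":
    # Inputs 0 and 1 both want output 2; input 1 also wants output 0.
    reqs = [{2}, {2, 0}, {1}, set()]
    print(arbitration_iteration(reqs, 4))   # e.g., {0: 2, 1: 0, 2: 1}
```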

Switching

The switching technique defines how connections are established in the network. Ideally, connections between network resources are established or “switched in” only for as long as they are actually needed and exactly at the point that they are ready and needed to be used, considering both time and space. This allows efficient use of available network bandwidth by competing traffic flows and minimal latency. Connections at each hop along the topological path allowed by the routing algorithm and granted by the arbitration algorithm can be established in three basic ways: prior to packet arrival using circuit switching, upon receipt of the entire packet using store-and-forward packet switching, or upon receipt of only portions of the packet with unit size no smaller than that of the packet header using cut-through packet switching.

Circuit switching establishes a circuit a priori such that network bandwidth is allocated for packet transmissions along an entire source-destination path. It is possible to pipeline packet transmission across the circuit using staging at each hop along the path, a technique known as pipelined circuit switching. As routing, arbitration, and switching are performed only once for one or more packets, routing bits are not needed in the header of packets, thus reducing latency and overhead. This can be very efficient when information is continuously transmitted between devices for the same circuit setup. However, as network bandwidth is removed from the shared pool and preallocated regardless of whether sources are in need of consuming it or not, circuit switching can be very inefficient and highly wasteful of bandwidth.

Packet switching enables network bandwidth to be shared and used more efficiently when packets are transmitted intermittently, which is the more common case. Packet switching comes in two main varieties—store-and-forward and cut-through switching, both of which allow network link bandwidth to be multiplexed on packet-sized or smaller units of information. This better enables bandwidth sharing by packets originating from different sources. The finer granularity of sharing, however, increases the overhead needed to perform switching: Routing, arbitration, and switching must be performed for every packet, and routing and flow control bits are required for every packet if flow control is used.

Store-and-forward packet switching establishes connections such that a packet is forwarded to the next hop in sequence along its source-destination path only after the entire packet is first stored (staged) at the receiving switch. As packets are completely stored at every switch before being transmitted, links are completely decoupled, allowing full link bandwidth utilization even if links have very different bandwidths. This property is very important in WANs, but the price to pay is packet latency; the total routing, arbitration, and switching delay is multiplicative with the number of hops, as we have seen in Section F.4 when analyzing performance under this assumption.

Cut-through packet switching establishes connections such that a packet can “cut through” switches in a pipelined manner once the header portion of the packet (or equivalent amount of payload trailing the header) is staged at receiving switches. That is, the rest of the packet need not arrive before switching in the granted resources. This allows routing, arbitration, and switching delay to be additive with the number of hops rather than multiplicative to reduce total packet latency.

Cut-through comes in two varieties, the main differences being the size of the unit of information on which flow control is applied and, consequently, the buffer requirements at switches. Virtual cut-through switching implements flow control at the packet level, whereas wormhole switching implements it on flow units, or flits, which are smaller than the maximum packet size but usually at least as large as the packet header. Since wormhole switches need to be capable of storing only a small portion of a packet, packets that block in the network may span several switches. This can cause other packets to block on the links they occupy, leading to premature network saturation and reduced effective bandwidth unless some centralized buffer is used within the switch to store them—a technique called buffered wormhole switching. As chips can implement relatively large buffers in current technology, virtual cut-through is the more commonly used switching technique. However, wormhole switching may still be preferred in OCNs designed to minimize silicon resources.

Premature network saturation caused by wormhole switching can be mitigated by allowing several packets to share the physical bandwidth of a link simultaneously via time-multiplexed switching at the flit level. This requires physical links to have a set of virtual channels (i.e., the logical buffers mentioned previously) at each end, into which packets are switched. Before, we saw how virtual channels can be used to decouple physical link bandwidth from buffered packets in such a way as to avoid deadlock. Now, virtual channels are multiplexed in such a way that bandwidth is switched in and used by flits of a packet to advance even though the packet may share some links in common with a blocked packet ahead. This, again, allows network bandwidth to be used more efficiently, which, in turn, reduces the average packet latency.

Impact on Network Performance

Routing, arbitration, and switching can impact the packet latency of a loaded network by reducing the contention delay experienced by packets. For an unloaded network that has no contention, the algorithms used to perform routing and arbitration have no impact on latency other than to determine the amount of delay incurred in implementing those functions at switches—typically, the pin-to-pin latency of a switch chip is several tens of nanoseconds. The only change to the best-case packet latency expression given in the previous section comes from the switching technique. Store-and-forward packet switching was assumed before in which transmission delay for the entire packet is incurred on all d hops plus at the source node. For cut-through packet switching, transmission delay is pipelined across the network links comprising the packet’s path at the granularity of the packet header instead of the entire packet. Thus, this delay component is reduced, as shown in the following lower-bound expression for packet latency:

Latency = Sending overhead + T_LinkProp × (d + 1) + (T_r + T_a + T_s) × d + (Packet + (d × Header)) / Bandwidth + Receiving overhead
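As a quick sanity check of the expression, the sketch below evaluates it for made-up parameter values and contrasts it with the store-and-forward bound from Section F.4, in which the full packet transmission time is paid on every hop.

```python
# Illustrative evaluation of the packet latency lower bounds for
# cut-through versus store-and-forward switching (all values made up).

def latency_cut_through(overhead_snd, overhead_rcv, t_link_prop, t_r, t_a, t_s,
                        d, packet_bytes, header_bytes, bandwidth):
    return (overhead_snd
            + t_link_prop * (d + 1)
            + (t_r + t_a + t_s) * d
            + (packet_bytes + d * header_bytes) / bandwidth
            + overhead_rcv)

def latency_store_and_forward(overhead_snd, overhead_rcv, t_link_prop, t_r, t_a, t_s,
                              d, packet_bytes, bandwidth):
    # The whole packet is retransmitted on each of the d + 1 links.
    return (overhead_snd
            + t_link_prop * (d + 1)
            + (t_r + t_a + t_s) * d
            + (d + 1) * packet_bytes / bandwidth
            + overhead_rcv)

if __name__ == "__main__":
    args = dict(overhead_snd=100, overhead_rcv=100, t_link_prop=10,
                t_r=1, t_a=1, t_s=1, d=6, bandwidth=1.0)
    print(latency_cut_through(packet_bytes=256, header_bytes=4, **args))   # 568.0 cycles
    print(latency_store_and_forward(packet_bytes=256, **args))             # 2080.0 cycles
```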

The effective bandwidth is impacted by how efficiently routing, arbitration, and switching allow network bandwidth to be used. The routing algorithm can distribute traffic more evenly across a loaded network to increase the utilization of the aggregate bandwidth provided by the topology—particularly, by the bisection links. The arbitration algorithm can maximize the number of switch output ports that accept packets, which also increases the utilization of network bandwidth. The switching technique can increase the degree of resource sharing by packets, which further increases bandwidth utilization. These combine to affect network bandwidth, BWNetwork, by an efficiency factor, ρ, where 0 < ρ ≤ 1:

BWNetwork = ρ × BWBisection / γ

The efficiency factor, ρ, is difficult to calculate or to quantify by means other than simulation. Nevertheless, with this parameter we can estimate the best-case upper-bound effective bandwidth by using the following expression that takes into account the effects of routing, arbitration, and switching:

Effective bandwidth = min(N × BWLinkInjection, ρ × BWBisection / γ, σ × N × BWLinkReception)
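Transcribed directly into code, the bound is simply a three-way minimum. The sketch below is ours; the parameter values in the usage line are invented for illustration, and σ is given a default of 1.0 as a simplifying assumption when it is not otherwise specified.

```python
# Direct transcription of the best-case effective bandwidth bound.

def effective_bandwidth(n, bw_link_injection, bw_link_reception,
                        bw_bisection, gamma, rho, sigma=1.0):
    """Upper bound on delivered traffic, limited by injection bandwidth,
    by the efficiency-scaled bisection bandwidth, and by reception
    bandwidth scaled by the fraction sigma of traffic nodes can accept."""
    return min(n * bw_link_injection,
               rho * bw_bisection / gamma,
               sigma * n * bw_link_reception)

if __name__ == "__main__":
    # 64 nodes, 1 byte/cycle links, 16-link bisection, gamma = 0.5, rho = 0.8.
    print(effective_bandwidth(64, 1.0, 1.0, 16.0, 0.5, 0.8))   # 25.6 bytes/cycle
```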

We note that ρ also depends on how well the network handles the traffic generated by applications. For instance, ρ could be higher for circuit switching than for cut-through switching if large streams of packets are continually transmitted between a source-destination pair, whereas the converse could be true if packets are transmitted intermittently.

Example

Compare the performance of deterministic routing versus adaptive routing for a 3D torus network interconnecting 4096 nodes. Do so by plotting latency versus applied load and throughput versus applied load. Also compare the efficiency of the best and worst of these networks. Assume that virtual cut-through switching, three-phase arbitration, and virtual channels are implemented. Consider separately the cases for two and four virtual channels, respectively. Assume that one of the virtual channels uses bubble flow control in dimension order so as to avoid deadlock; the other virtual channels are used either in dimension order (for deterministic routing) or minimally along shortest paths (for adaptive routing), as is done in the IBM Blue Gene/L torus network.

Answer

It is very difficult to compute analytically the performance of routing algorithms given that their behavior depends on several network design parameters with complex interdependences among them. As a consequence, designers typically resort to cycle-accurate simulators to evaluate performance. One way to evaluate the effect of a certain design decision is to run sets of simulations over a range of network loads, each time modifying one of the design parameters of interest while

keeping the remaining ones fixed. The use of synthetic traffic loads is quite frequent in these evaluations as it allows the network to stabilize at a certain working point and for behavior to be analyzed in detail. This is the method we use here (alternatively, trace-driven or execution-driven simulation can be used).

Figure F.19 shows the typical interconnection network performance plots. On the left, average packet latency (expressed in network cycles) is plotted as a function of applied load (traffic generation rate) for the two routing algorithms with two and four virtual channels each; on the right, throughput (traffic delivery rate) is similarly plotted. Applied load is normalized by dividing it by the number of nodes in the network (i.e., bytes per cycle per node). Simulations are run under the assumption of uniformly distributed traffic consisting of 256-byte packets, where flits are byte sized. Routing, arbitration, and switching delays are assumed to sum to 1 network cycle per hop while the time-of-flight delay over each link is assumed to be 10 cycles. Link bandwidth is 1 byte per cycle, thus providing results that are independent of network clock frequency.

Figure F.19 Deterministic routing is compared against adaptive routing, both with either two or four virtual channels, assuming uniformly distributed traffic on a 4 K node 3D torus network with virtual cut-through switching and bubble flow control to avoid deadlock. (a) Average latency is plotted versus applied load, and (b) throughput is plotted versus applied load (the upper grayish plots show peak throughput, and the lower black plots show sustained throughput). Simulation data were collected by P. Gilabert and J. Flich at the Universidad Politècnica de València, Spain (2006).

As can be seen, the plots within each graph have similar characteristic shapes, but they have different values. For the latency graph, all start at the no-load latency
as predicted by the latency expression given above, then slightly increase with traffic load as contention for network resources increases. At higher applied loads, latency increases exponentially, and the network approaches its saturation point as it is unable to absorb the applied load, causing packets to queue up at their source nodes awaiting injection. In these simulations, the queues keep growing over time, making latency tend toward infinity. However, in practice, queues reach their capacity and trigger the application to stall further packet generation, or the application throttles itself waiting for acknowledgments/responses to outstanding packets. Nevertheless, latency grows at a slower rate for adaptive routing as alternative paths are provided to packets along congested resources. For this same reason, adaptive routing allows the network to reach a higher peak throughput for the same number of virtual channels as compared to deterministic routing.

At nonsaturation loads, throughput increases fairly linearly with applied load. When the network reaches its saturation point, however, it is unable to deliver traffic at the same rate at which traffic is generated. The saturation point, therefore, indicates the maximum achievable or “peak” throughput, which would be no more than that predicted by the effective bandwidth expression given above. Beyond saturation, throughput tends to drop as a consequence of massive head-of-line blocking across the network (as will be explained further in Section F.6), very much like cars tend to advance more slowly at rush hour. This is an important region of the throughput graph as it shows how significant a performance drop the routing algorithm can cause if congestion management techniques (discussed briefly in Section F.7) are not used effectively. In this case, adaptive routing has more of a performance drop after saturation than deterministic routing, as measured by the postsaturation sustained throughput. For both routing algorithms, more virtual channels (i.e., four) give packets a greater ability to pass over blocked packets ahead, allowing for a higher peak throughput as compared to fewer virtual channels (i.e., two).

For adaptive routing with four virtual channels, the peak throughput of 0.43 bytes/cycle/node is near the maximum of 0.5 bytes/cycle/node that can be obtained with 100% efficiency (i.e., ρ = 100%), assuming there is enough injection and reception bandwidth to make the network bisection the bottlenecking point. In that case, the network bandwidth is simply 100% times the network bisection bandwidth (BWBisection) divided by the fraction of traffic crossing the bisection (γ), as given by the expression above. Taking into account that the bisection splits the torus into two equally sized halves, γ is equal to 0.5 for uniform traffic as only half the injected traffic is destined to a node at the other side of the bisection. The BWBisection for a 4096-node 3D torus network is 16 × 16 × 4 unidirectional links times the link bandwidth (i.e., 1 byte/cycle). If we normalize the bisection bandwidth by dividing it by the number of nodes (as we did with network bandwidth), the BWBisection is 0.25 bytes/cycle/node. Dividing this by γ gives the ideal maximally obtainable network bandwidth of 0.5 bytes/cycle/node.

We can find the efficiency factor, ρ, of the simulated network simply by dividing the measured peak throughput by the ideal throughput. The efficiency factor for
the network with fully adaptive routing and four virtual channels is 0.43/(0.25/0.5) = 86%, whereas for the network with deterministic routing and two virtual channels it is 0.37/(0.25/0.5) = 74%. Besides the 12% difference in efficiency between the two, another 14% gain in efficiency might be obtained with even better routing, arbitration, switching, and virtual channel designs.

To put this discussion on routing, arbitration, and switching in perspective, Figure F.20 lists the techniques used in SANs designed for commercial high-performance computers. In addition to being applied to the SANs as shown in the figure, the issues discussed in this section also apply to other interconnect domains: from OCNs to WANs.
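The efficiency arithmetic in this example is easy to reproduce; the short sketch below (ours) recomputes the 86% and 74% figures from the bisection link count and the measured peak throughputs quoted above.

```python
# Reproduce the efficiency-factor arithmetic from the example above.

nodes = 4096                      # 16 x 16 x 16 3D torus
bisection_links = 16 * 16 * 4     # unidirectional links crossing the bisection
link_bw = 1.0                     # bytes/cycle per link
gamma = 0.5                       # fraction of uniform traffic crossing the bisection

bw_bisection_per_node = bisection_links * link_bw / nodes   # 0.25 bytes/cycle/node
ideal_bw_per_node = bw_bisection_per_node / gamma           # 0.5 bytes/cycle/node

for label, peak in [("adaptive, 4 VCs", 0.43), ("deterministic, 2 VCs", 0.37)]:
    rho = peak / ideal_bw_per_node
    print(f"{label}: efficiency = {rho:.0%}")   # 86% and 74%
```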

Company | System [network] name | Max. number of nodes [× # CPUs] | Basic network topology | Switch queuing (buffers) | Network routing algorithm | Switch arbitration technique | Network switching technique
Intel | ASCI Red Paragon | 4510 [2] | 2D mesh (64 × 64) | Input buffered (1 flit) | Distributed dimension-order routing | 2-phased RR, distributed across switch | Wormhole with no virtual channels
IBM | ASCI White SP Power3 [Colony] | 512 [16] | Bidirectional MIN with 8-port bidirectional switches (typically a fat tree or Omega) | Input and central buffer with output queuing (8-way speedup) | Source-based LCA adaptive, shortest-path routing, and table-based multicast routing | 2-phased RR, centralized and distributed at outputs for bypass paths | Buffered wormhole and virtual cut-through for multicasting, no virtual channels
Intel | Thunder Itanium2 Tiger4 [QsNetII] | 1024 [4] | Fat tree with 8-port bidirectional switches | Input buffered | Source-based LCA adaptive, shortest-path routing | 2-phased RR, priority, aging, distributed at output ports | Wormhole with 2 virtual channels
Cray | XT3 [SeaStar] | 30,508 [1] | 3D torus (40 × 32 × 24) | Input with staging output | Distributed table-based dimension-order routing | 2-phased RR, distributed at output ports | Virtual cut-through with 4 virtual channels
Cray | X1E | 1024 [1] | 4-way bristled 2D torus (23 × 11) with express links | Input with virtual output queuing | Distributed table-based dimension-order routing | 2-phased wavefront (pipelined) global arbiter | Virtual cut-through with 4 virtual channels
IBM | ASC Purple pSeries 575 [Federation] | >1280 [8] | Bidirectional MIN with 8-port bidirectional switches (typically a fat tree or Omega) | Input and central buffer with output queuing (8-way speedup) | Source and distributed table-based LCA adaptive, shortest-path routing, and multicast | 2-phased RR, centralized and distributed at outputs for bypass paths | Buffered wormhole and virtual cut-through for multicasting with 8 virtual channels
IBM | Blue Gene/L eServer Solution [Torus Net.] | 65,536 [2] | 3D torus (32 × 32 × 64) | Input-output buffered | Distributed, adaptive with bubble escape virtual channel | 2-phased SLQ, distributed at input and output | Virtual cut-through with 4 virtual channels

Figure F.20 Routing, arbitration, and switching characteristics of interconnection networks in commercial machines.

F.6 Switch Microarchitecture

Network switches implement the routing, arbitration, and switching functions of switched-media networks. Switches also implement buffer management mechanisms and, in the case of lossless networks, the associated flow control. For some networks, switches also implement part of the network management functions that explore, configure, and reconfigure the network topology in response to boot-up and failures. Here, we reveal the internal structure of network switches by describing a basic switch microarchitecture and various alternatives suitable for different routing, arbitration, and switching techniques presented previously.

Basic Switch Microarchitecture

The internal data path of a switch provides connectivity among the input and output ports. Although a shared bus or a multiported central memory could be used, these solutions are insufficient or too expensive, respectively, when the required aggregate switch bandwidth is high. Most high-performance switches implement an internal crossbar to provide nonblocking connectivity within the switch, thus allowing concurrent connections between multiple input-output port pairs. Buffering of blocked packets can be done using first in, first out (FIFO) or circular queues, which can be implemented as dynamically allocatable multi-queues (DAMQs) in static RAM to provide high capacity and flexibility. These queues can be placed at input ports (i.e., input buffered switch), output ports (i.e., output buffered switch), centrally within the switch (i.e., centrally buffered switch), or at both the input and output ports of the switch (i.e., input-output-buffered switch). Figure F.21 shows a block diagram of an input-output-buffered switch.

Figure F.21 Basic microarchitectural components of an input-output-buffered switch.

Routing can be implemented using a finite-state machine or forwarding table within the routing control unit of switches. In the former case, the routing information given in the packet header is processed by a finite-state machine that determines the allowed switch output port (or ports if routing is adaptive), according to the routing algorithm. Portions of the routing information in the header are usually
stripped off or modified by the routing control unit after use to simplify processing at the next switch along the path. When routing is implemented using forwarding tables, the routing information given in the packet header is used as an address to access a forwarding table entry that contains the allowed switch output port(s) provided by the routing algorithm. Forwarding tables must be preloaded into the switches at the outset of network operation. Hybrid approaches also exist where the forwarding table is reduced to a small set of routing bits and combined with a small logic block. Those routing bits are used by the routing control unit to know what paths are allowed and decide the output ports the packets need to take. The goal with those approaches is to build flexible yet compact routing control units, eliminating the area and power wastage of a large forwarding table and thus being suitable for OCNs. The routing control unit is usually implemented as a centralized resource, although it could be replicated at every input port so as not to become a bottleneck. Routing is done only once for every packet, and packets typically are large enough to take several cycles to flow through the switch, so a centralized routing control unit rarely becomes a bottleneck. Figure F.21 assumes a centralized routing control unit within the switch. Arbitration is required when two or more packets concurrently request the same output port, as described in the previous section. Switch arbitration can be implemented in a centralized or distributed way. In the former case, all of the requests and status information are transmitted to the central switch arbitration unit; in the latter case, the arbiter is distributed across the switch, usually among the input and/or output ports. Arbitration may be performed multiple times on packets, and there may be multiple queues associated with each input port,

increasing the number of arbitration requests that must be processed. Thus, many implementations use a hierarchical arbitration approach, where arbitration is first performed locally at every input port to select just one request among the corresponding packets and queues, and later arbitration is performed globally to process the requests made by each of the local input port arbiters. Figure F.21 assumes a centralized arbitration unit within the switch.

The basic switch microarchitecture depicted in Figure F.21 functions in the following way. When a packet starts to arrive at a switch input port, the link controller decodes the incoming signal and generates a sequence of bits, possibly deserializing data to adapt them to the width of the internal data path if different from the external link width. Information is also extracted from the packet header or link control signals to determine the queue to which the packet should be buffered. As the packet is being received and buffered (or after the entire packet has been buffered, depending on the switching technique), the header is sent to the routing unit. This unit supplies a request for one or more output ports to the arbitration unit. Arbitration for the requested output port succeeds if the port is free and has enough space to buffer the entire packet or flit, depending on the switching technique. If wormhole switching with virtual channels is implemented, additional arbitration and allocation steps may be required for the transmission of each individual flit. Once the resources are allocated, the packet is transferred across the internal crossbar to the corresponding output buffer and link if no other packets are ahead of it and the link is free. Link-level flow control implemented by the link controller prevents input queue overflow at the neighboring switch on the other end of the link. If virtual channel switching is implemented, several packets may be time-multiplexed across the link on a flit-by-flit basis. As the various input and output ports operate independently, several incoming packets may be processed concurrently in the absence of contention.
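The packet flow just described (buffer at the input, route, arbitrate, then traverse the crossbar to an output buffer) can be summarized in a few lines of code. The sketch below is ours and heavily simplified: it handles one packet per input queue per cycle, uses a fixed-priority stand-in for the round-robin arbiters, and ignores flow control and virtual channels.

```python
# Sketch of the per-packet control flow in an input-output-buffered switch,
# following the description above. All classes and names are illustrative.

from collections import deque

class Switch:
    def __init__(self, num_ports, forwarding_table):
        self.input_queues = [deque() for _ in range(num_ports)]
        self.output_queues = [deque() for _ in range(num_ports)]
        self.forwarding_table = forwarding_table   # destination -> output port

    def receive(self, port, packet):
        # The link controller has already deserialized the packet; buffer it.
        self.input_queues[port].append(packet)

    def cycle(self):
        # Routing: look up the requested output for each head-of-queue packet.
        requests = {}
        for i, q in enumerate(self.input_queues):
            if q:
                requests[i] = self.forwarding_table[q[0]["dest"]]
        # Arbitration: grant each requested output to one input (round-robin
        # pointers omitted for brevity; lowest input index wins here).
        granted = {}
        for i in sorted(requests):
            out = requests[i]
            if out not in granted.values():
                granted[i] = out
        # Switching: move granted packets across the crossbar to output buffers.
        for i, out in granted.items():
            self.output_queues[out].append(self.input_queues[i].popleft())

if __name__ == "__main__":
    sw = Switch(4, {"A": 2, "B": 2, "C": 0})
    sw.receive(0, {"dest": "A"})
    sw.receive(1, {"dest": "B"})   # conflicts with input 0 for output 2
    sw.receive(3, {"dest": "C"})
    sw.cycle()
    print([len(q) for q in sw.output_queues])   # [1, 0, 1, 0]; input 1 waits
```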

Buffer Organizations

As mentioned above, queues can be located at the switch input, output, or both sides. Output-buffered switches have the advantage of completely eliminating head-of-line blocking. Head-of-line (HOL) blocking occurs when two or more packets are buffered in a queue, and a blocked packet at the head of the queue blocks other packets in the queue that would otherwise be able to advance if they were at the queue head. This cannot occur in output-buffered switches as all the packets in a given queue have the same status; they require the same output port. However, it may be the case that all the switch input ports simultaneously receive a packet for the same output port. As there are no buffers at the input side, output buffers must be able to store all those incoming packets at the same time. This requires implementing output queues with an internal switch speedup of k. That is, output queues must have a write bandwidth k times the link bandwidth, where k is the number of switch ports. This oftentimes is too expensive. Hence, this solution by itself has rarely been implemented in lossless networks. As the probability of concurrently receiving many packets for the same output port is usually small,
commercial systems that use output-buffered switches typically implement only moderate switch speedup, dropping packets on rare buffer overflow.

Switches with buffers on the input side are able to receive packets without having any switch speedup; however, HOL blocking can occur within input port queues, as illustrated in Figure F.22(a). This can reduce switch output port utilization to less than 60% even when packet destinations are uniformly distributed. As shown in Figure F.22(b), the use of virtual channels (two in this case) can mitigate HOL blocking but does not eliminate it. A more effective solution is to organize the input queues as virtual output queues (VOQs), shown in Figure F.22(c). With this, each input port implements as many queues as there are output ports, thus providing separate buffers for packets destined to different output ports. This is a popular technique widely used in ATM switches and IP routers.

Figure F.22 (a) Head-of-line blocking in an input buffer, (b) the use of two virtual channels to reduce HOL blocking, and (c) the use of virtual output queuing to eliminate HOL blocking within a switch. The shaded input buffer is the one to which the crossbar is currently allocated. This assumes each input port has only one access port to the switch’s internal crossbar.

The main drawbacks of
VOQs, however, are cost and lack of scalability: The number of VOQs grows quadratically with switch ports. Moreover, although VOQs eliminate HOL blocking within a switch, HOL blocking occurring at the network level end-to-end is not solved. Of course, it is possible to design a switch with VOQ support at the network level also—that is, to implement as many queues per switch input port as there are output ports across the entire network—but this is extremely expensive. An alternative is to dynamically assign only a fraction of the queues to store (cache) separately only those packets headed for congested destinations. Combined input-output-buffered switches minimize HOL blocking when there is sufficient buffer space at the output side to buffer packets, and they minimize the switch speedup required due to buffers being at the input side. This solution has the further benefit of decoupling packet transmission through the internal crossbar of the switch from transmission through the external links. This is especially useful for cut-through switching implementations that use virtual channels, where flit transmissions are time-multiplexed over the links. Many designs used in commercial systems implement input-output-buffered switches.
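A minimal sketch of the virtual-output-queue organization at a single input port (illustrative only, with names of our choosing) makes the key property explicit: a blocked head-of-queue packet for one output cannot prevent a request on behalf of a packet bound for a different output.

```python
# Minimal sketch of virtual output queues (VOQs) at one switch input port:
# one queue per output port, so a blocked head packet for one output cannot
# block packets bound for a different output.

from collections import deque

class VOQInputPort:
    def __init__(self, num_outputs):
        self.voqs = [deque() for _ in range(num_outputs)]

    def enqueue(self, packet, output_port):
        self.voqs[output_port].append(packet)

    def eligible_requests(self, blocked_outputs):
        """Outputs this port will request: any nonempty VOQ whose output
        is not currently blocked downstream."""
        return [o for o, q in enumerate(self.voqs)
                if q and o not in blocked_outputs]

if __name__ == "__main__":
    port = VOQInputPort(num_outputs=4)
    port.enqueue("pkt-to-2", 2)
    port.enqueue("pkt-to-3", 3)
    # Even with output 2 blocked, the packet for output 3 can still be requested.
    print(port.eligible_requests(blocked_outputs={2}))   # [3]
```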

Routing Algorithm Implementation

It is important to distinguish between the routing algorithm and its implementation. While the routing algorithm describes the rules to forward packets across the network and affects packet latency and network throughput, its implementation affects the delay suffered by packets when reaching a node, the required silicon area, and the power consumption associated with the routing computation. Several techniques have been proposed to pre-compute the routing algorithm and/or hide the routing computation delay. However, significantly less effort has been devoted to reduce silicon area and power consumption without significantly affecting routing flexibility. Both issues have become very important, particularly for OCNs. Many existing designs address these issues by implementing relatively simple routing algorithms, but more sophisticated routing algorithms will likely be needed in the future to deal with increasing manufacturing defects, process variability, and other complications arising from continued technology scaling, as discussed briefly below.

As mentioned in a previous section, depending on where the routing algorithm is computed, two basic forms of routing exist: source and distributed routing. In source routing, the complexity of implementation is moved to the end nodes where paths need to be stored in tables, and the path for a given packet is selected based on the destination end node identifier. In distributed routing, however, the complexity is moved to the switches where, at each hop along the path of a packet, a selection of the output port to take is performed. In distributed routing, two basic implementations exist. The first one consists of using a logic block that implements a fixed routing algorithm for a particular topology. The most common example of such an implementation is dimension-order routing, where dimensions are offset in an established order. Alternatively, distributed routing can be implemented with forwarding tables, where each entry encodes the output port to be used for a particular
destination. Therefore, in the worst case, as many entries as destination nodes are required.

Both methods for implementing distributed routing have their benefits and drawbacks. Logic-based routing features a very short computation delay, usually requires a small silicon area, and has low power consumption. However, logic-based routing needs to be designed with a specific topology in mind and, therefore, is restricted to that topology. Table-based distributed routing is quite flexible and supports any topology and routing algorithm. Simply, tables need to be filled with the proper contents based on the applied routing algorithm (e.g., the up*/down* routing algorithm can be defined for any irregular topology). However, the down side of table-based distributed routing is its non-negligible area and power cost. Also, scalability is problematic in table-based solutions as, in the worst case, a system with N end nodes (and switches) requires as many as N tables each with N entries, thus having quadratic cost. Depending on the network domain, one solution is more suitable than the other. For instance, in SANs, it is usual to find table-based solutions as is the case with InfiniBand. In other environments, like OCNs, table-based implementations are avoided due to the aforementioned costs in power and silicon area. In such environments, it is more advisable to rely on logic-based implementations.

Herein lies some of the challenges OCN designers face: ever continuing technology scaling through device miniaturization leads to increases in the number of manufacturing defects, higher failure rates (either transient or permanent), significant process variations (transistors behaving differently from design specs), the need for different clock frequency and voltage domains, and tight power and energy budgets. All of these challenges translate to the network needing support for heterogeneity. Different—possibly irregular—regions of the network will be created owing to failed components, powered-down switches and links, disabled components (due to unacceptable variations in performance) and so on. Hence, heterogeneous systems may emerge from a homogeneous design. In this framework, it is important to efficiently implement routing algorithms designed to provide enough flexibility to address these new challenges.

A well-known solution for providing a certain degree of flexibility while being much more compact than traditional table-based approaches is interval routing [Leeuwen 1987], where a range of destinations is defined for each output port. Although this approach is not flexible enough, it provides a clue on how to address emerging challenges. A more recent approach provides a plausible implementation design point that lies between logic-based implementation (efficiency) and table-based implementation (flexibility). Logic-Based Distributed Routing (LBDR) is a hybrid approach that takes as a reference a regular 2D mesh but allows an irregular network to be derived from it due to changes in topology induced by manufacturing defects, failures, and other anomalies. Due to the faulty, disabled, and powered-down components, regularity is compromised and the dimension-order routing algorithm can no longer be used. To support such topologies, LBDR defines a set of configuration bits at each switch. Four connectivity bits are used at each switch to indicate the connectivity of the switch to the neighbor switches in the
topology. Thus, one connectivity bit per port is used. Those connectivity bits are used, for instance, to disable an output port leading to a faulty component. Additionally, eight routing bits are used, two per output port, to define the available routing options. The value of the routing bits is set at power-on and is computed from the routing algorithm to be implemented in the network. Basically, when a routing bit is set, it indicates that a packet can leave the switch through the associated output port and is allowed to perform a certain turn at the next switch. In this respect, LBDR is similar to interval routing, but it defines geographical areas instead of ranges of destinations.

Figure F.23 shows an example where a topology-agnostic routing algorithm is implemented with LBDR on an irregular topology. The figure shows the computed configuration bits. The connectivity and routing bits are used to implement the routing algorithm. For that purpose, a small set of logic gates is used in combination with the configuration bits. Basically, the LBDR approach takes as a reference the initial topology (a 2D mesh), and makes a decision based on the current coordinates of the router, the coordinates of the destination router, and the configuration bits. Figure F.24 shows the required logic, and Figure F.25 shows an example where a packet is forwarded from its source to its destination with the use of the configuration bits. As can be noticed, routing restrictions are enforced by preventing the use of the west port at switch 10.

LBDR represents a method for efficient routing implementation in OCNs. This mechanism has been recently extended to support non-minimal paths, collective communication operations, and traffic isolation. All of these improvements have been made while maintaining a compact and efficient implementation with the use of a small set of configuration bits. A detailed description of LBDR and its extensions, and the current research on OCNs can be found in Flich [2010].

Figure F.23 Shown is an example of an irregular network that uses LBDR to implement the routing algorithm. For each router, connectivity and routing bits are defined.

Figure F.24 LBDR logic at each input port of the router. A first-stage comparator derives the relative direction signals N', E', W', S' of the destination (Xdst, Ydst) with respect to the current router (Xcurr, Ycurr); a second stage combines those signals with the connectivity bits (Cn, Ce, Cw, Cs) and the routing bits:
N = Cn·(N'·!E'·!W' + N'·E'·Rne + N'·W'·Rnw)
E = Ce·(E'·!N'·!S' + E'·N'·Ren + E'·S'·Res)
W = Cw·(W'·!N'·!S' + W'·N'·Rwn + W'·S'·Rws)
S = Cs·(S'·!E'·!W' + S'·E'·Rse + S'·W'·Rsw)
L = !N'·!E'·!W'·!S'
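The second-stage equations above translate almost literally into code. The sketch below is our packaging of that logic (the dictionary keys and the assumption that Y grows toward the north are ours); given the current and destination coordinates plus the connectivity and routing bits, it returns the set of output ports the packet may take.

```python
# Transcription of the LBDR output-port logic sketched in Figure F.24.
# C holds the per-port connectivity bits and R the per-port routing bits;
# variable names follow the figure, the packaging into a function is ours.

def lbdr_outputs(xcurr, ycurr, xdst, ydst, C, R):
    """Return the set of output ports ('N', 'E', 'W', 'S', 'L') a packet
    may take at router (xcurr, ycurr) toward destination (xdst, ydst).
    Assumes Y grows toward the north. C maps 'N'/'E'/'W'/'S' to connectivity
    bits; R maps two-letter keys such as 'NE' (leave north, then turn east
    at the next switch) to routing bits."""
    # First stage: relative position of the destination.
    Np, Sp = ydst > ycurr, ydst < ycurr
    Ep, Wp = xdst > xcurr, xdst < xcurr
    # Second stage: combine with connectivity and routing bits.
    out = set()
    if C['N'] and ((Np and not Ep and not Wp) or (Np and Ep and R['NE'])
                   or (Np and Wp and R['NW'])):
        out.add('N')
    if C['E'] and ((Ep and not Np and not Sp) or (Ep and Np and R['EN'])
                   or (Ep and Sp and R['ES'])):
        out.add('E')
    if C['W'] and ((Wp and not Np and not Sp) or (Wp and Np and R['WN'])
                   or (Wp and Sp and R['WS'])):
        out.add('W')
    if C['S'] and ((Sp and not Ep and not Wp) or (Sp and Ep and R['SE'])
                   or (Sp and Wp and R['SW'])):
        out.add('S')
    if not (Np or Ep or Wp or Sp):
        out.add('L')            # destination reached: deliver locally
    return out

if __name__ == "__main__":
    C = {'N': 1, 'E': 1, 'W': 1, 'S': 1}
    R = {k: 1 for k in ('NE', 'NW', 'EN', 'ES', 'WN', 'WS', 'SE', 'SW')}
    # Full connectivity with all turns allowed: both N and E are offered.
    print(sorted(lbdr_outputs(0, 0, 2, 2, C, R)))   # ['E', 'N']
```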

Figure F.25 Example of routing a message from Router 14 to Router 5 using LBDR at each router.

Pipelining the Switch Microarchitecture

Performance can be enhanced by pipelining the switch microarchitecture. Pipelined processing of packets in a switch has similarities with pipelined execution of instructions in a vector processor. In a vector pipeline, a single instruction indicates what operation to apply to all the vector elements executed in a pipelined way. Similarly, in a switch pipeline, a single packet header indicates how to process all of the internal data path physical transfer units (or phits) of a packet, which are processed in a pipelined fashion. Also, as packets at different input ports are independent of each other, they can be processed in parallel similar to the way multiple independent instructions or threads of pipelined instructions can be executed in parallel.

The switch microarchitecture can be pipelined by analyzing the basic functions performed within the switch and organizing them into several stages. Figure F.26 shows a block diagram of a five-stage pipelined organization for the basic switch microarchitecture given in Figure F.21, assuming cut-through switching and the use of a forwarding table to implement routing. After receiving the header portion of the packet in the first stage, the routing information (i.e., destination address) is used in the second stage to look up the allowed routing option(s) in the forwarding table. Concurrent with this, other portions of the packet are received and buffered in the input port queue at the first stage. Arbitration is performed in the third stage. The crossbar is configured to allocate the granted output port for the packet in the fourth stage, and the packet header is buffered in the switch output port and ready for transmission over the external link in the fifth stage.

Packet header:       IB  RC  SA  ST  OB
Payload fragment:        IB  IB  IB  ST  OB
Payload fragment:            IB  IB  IB  ST  OB
Payload fragment:                IB  IB  IB  ST  OB

Figure F.26 Pipelined version of the basic input-output-buffered switch. The notation in the figure is as follows: IB is the input link control and buffer stage, RC is the route computation stage, SA is the crossbar switch arbitration stage, ST is the crossbar switch traversal stage, and OB is the output buffer and link control stage. Packet fragments (flits) coming after the header remain in the IB stage until the header is processed and the crossbar switch resources are provided.

Note that the second and
third stages are used only by the packet header; the payload and trailer portions of the packet use only three of the stages—those used for data flow-thru once the internal data path of the switch is set up. A virtual channel switch usually requires an additional stage for virtual channel allocation. Moreover, arbitration is required for every flit before transmission through the crossbar. Finally, depending on the complexity of the routing and arbitration algorithms, several clock cycles may be required for these operations.
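The stage occupancy shown in Figure F.26 can be generated mechanically. The sketch below (ours) assumes one flit enters the input buffer per cycle and the crossbar passes one flit per cycle, so each payload flit waits in IB until the slot behind its predecessor opens up in ST.

```python
# Sketch reproducing the stage occupancy of the five-stage switch pipeline
# in Figure F.26: the header visits IB, RC, SA, ST, OB, while payload flits
# wait in the input buffer until the crossbar is set up, then follow one
# cycle apart through ST and OB.

def flit_schedule(num_payload_flits):
    """Map each flit to a {cycle: stage} dict. Flit 0 is the header."""
    sched = {0: {1: "IB", 2: "RC", 3: "SA", 4: "ST", 5: "OB"}}
    for i in range(1, num_payload_flits + 1):
        entry = {c: "IB" for c in range(1 + i, 4 + i)}   # wait in the input buffer
        entry[4 + i] = "ST"                              # crossbar traversal
        entry[5 + i] = "OB"                              # output buffer and link
        sched[i] = entry
    return sched

if __name__ == "__main__":
    sched = flit_schedule(3)
    last_cycle = max(max(stages) for stages in sched.values())
    for flit, stages in sched.items():
        row = [stages.get(c, "--") for c in range(1, last_cycle + 1)]
        name = "header " if flit == 0 else f"flit {flit}  "
        print(name, " ".join(row))
```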

Other Switch Microarchitecture Enhancements

As mentioned earlier, internal switch speedup is sometimes implemented to increase switch output port utilization. This speedup is usually implemented by increasing the clock frequency and/or the internal data path width (i.e., phit size) of the switch. An alternative solution consists of implementing several parallel data paths from each input port’s set of queues to the output ports. One way of doing this is by increasing the number of crossbar input ports. When implementing several physical queues per input port, this can be achieved by devoting a separate crossbar port to each input queue. For example, the IBM Blue Gene/L implements two crossbar access ports and two read ports per switch input port.

Another way of implementing parallel data paths between input and output ports is to move the buffers to the crossbar crosspoints. This switch architecture is usually referred to as a buffered crossbar switch. A buffered crossbar provides independent data paths from each input port to the different output ports, thus making it possible to send up to k packets at a time from a given input port to k different output ports. By implementing independent crosspoint memories for each input-output port pair, HOL blocking is eliminated at the switch level. Moreover, arbitration is significantly simpler than in other switch architectures. Effectively, each output port can receive packets from only a disjoint subset of the crosspoint memories. Thus, a completely independent arbiter can be implemented at each switch output port, each of those arbiters being very simple. A buffered crossbar would be the ideal switch architecture if it were not so expensive. The number of crosspoint memories increases quadratically with the number of switch ports, dramatically increasing its cost and reducing its scalability with respect to the basic switch architecture. In addition, each crosspoint memory must be large enough to efficiently implement link-level flow control. To reduce cost, most designers prefer input-buffered or combined input-output-buffered switches enhanced with some of the mechanisms described previously.

F.7 Practical Issues for Commercial Interconnection Networks

There are practical issues in addition to the technical issues described thus far that are important considerations for interconnection networks within certain domains. We mention a few of these below.

Connectivity

The type and number of devices that communicate and their communication requirements affect the complexity of the interconnection network and its protocols. The protocols must target the largest network size and handle the types of anomalous systemwide events that might occur. Among some of the issues are the following: How lightweight should the network interface hardware/software be? Should it attach to the memory network or the I/O network? Should it support cache coherence? If the operating system must get involved for every network transaction, the sending and receiving overhead becomes quite large. If the network interface attaches to the I/O network (PCI-Express or HyperTransport interconnect), the injection and reception bandwidth will be limited to that of the I/O network. This is the case for the Cray XT3 SeaStar, Intel Thunder Tiger 4 QsNetII, and many other supercomputer and cluster networks. To support coherence, the sender may have to flush the cache before each send, and the receiver may have to flush its cache before each receive to prevent the stale-data problem. Such flushes further increase sending and receiving overhead, often causing the network interface to be the network bottleneck.

Computer systems typically have a multiplicity of interconnects with different functions and cost-performance objectives. For example, processor-memory interconnects usually provide higher bandwidth and lower latency than I/O interconnects and are more likely to support cache coherence, but they are less likely to follow or become standards. Personal computers typically have a processor-memory interconnect and an I/O interconnect (e.g., PCI-X 2.0, PCIe or Hyper-Transport) designed to connect both fast and slow devices (e.g., USB 2.0, Gigabit Ethernet LAN, Firewire 800).

The Blue Gene/L supercomputer uses five interconnection networks, only one of which is the 3D torus used for most of the interprocessor application traffic. The others include a tree-based collective communication network for broadcast and multicast; a tree-based barrier network for combining results (scatter, gather); a control network for diagnostics, debugging, and initialization; and a Gigabit Ethernet network for I/O between the nodes and disk. The University of Texas at Austin’s TRIPS Edge processor has eight specialized on-chip networks—some with bidirectional channels as wide as 128 bits and some with 168 bits in each direction—to interconnect the 106 heterogeneous tiles composing the two processor cores with L2 on-chip cache. It also has a chip-to-chip switched network to interconnect multiple chips in a multiprocessor configuration. Two of the on-chip networks are switched networks: One is used for operand transport and the other is used for on-chip memory communication. The others are essentially fan-out trees or recombination dedicated link networks used for status and control. The portion of chip area allocated to the interconnect is substantial, with five of the seven metal layers used for global network wiring.

Standardization: Cross-Company Interoperability Standards are useful in many places in computer design, including interconnection networks. Advantages of successful standards include low cost and stability.


The customer has many vendors to choose from, which keeps price close to cost due to competition. It makes the viability of the interconnection independent of the stability of a single company. Components designed for a standard interconnection may also have a larger market, and this higher volume can reduce the vendors' costs, further benefiting the customer. Finally, a standard allows many companies to build products with interfaces to the standard, so the customer does not have to wait for a single company to develop interfaces to all the products of interest.

One drawback of standards is the time it takes for committees and special-interest groups to agree on the definition of standards, which is a problem when technology is changing rapidly. Another problem is when to standardize: On the one hand, designers would like to have a standard before anything is built; on the other hand, it would be better if something were built before standardization to avoid legislating useless features or omitting important ones. When done too early, it is often done entirely by committee, which is like asking all of the chefs in France to prepare a single dish of food—masterpieces are rarely served. Standards can also suppress innovation at that level, since standards fix the interfaces—at least until the next version of the standards surfaces, which can be every few years or longer. More often, we are seeing consortiums of companies getting together to define and agree on technologies that serve as "de facto" industry standards. This was the case for InfiniBand.

LANs and WANs use standards and interoperate effectively. WANs involve many types of companies and must connect to many brands of computers, so it is difficult to imagine a proprietary WAN ever being successful. The ubiquitous nature of the Ethernet shows the popularity of standards for LANs as well as WANs, and it seems unlikely that many customers would tie the viability of their LAN to the stability of a single company. Some SANs are standardized, such as Fibre Channel, but most are proprietary. OCNs for the most part are proprietary designs, with a few gaining widespread commercial use in system-on-chip (SoC) applications, such as IBM's CoreConnect and ARM's AMBA.

Congestion Management Congestion arises when too many packets try to use the same link or set of links. This leads to a situation in which the bandwidth required exceeds the bandwidth supplied. Congestion by itself does not degrade network performance: simply, the congested links are running at their maximum capacity. Performance degradation occurs in the presence of HOL blocking where, as a consequence of packets going to noncongested destinations getting blocked by packets going to congested destinations, some link bandwidth is wasted and network throughput drops, as illustrated in the example given at the end of Section F.4. Congestion control refers to schemes that reduce traffic when the collective traffic of all nodes is too large for the network to handle. One advantage of a circuit-switched network is that, once a circuit is established, it ensures that there is sufficient bandwidth to deliver all the information


sent along that circuit. Interconnection bandwidth is reserved as circuits are established, and if the network is full, no more circuits can be established. Other switching techniques generally do not reserve interconnect bandwidth in advance, so the interconnection network can become clogged with too many packets. Just as with poor rush-hour commuters, a traffic jam of packets increases packet latency and, in extreme cases, fewer packets per second get delivered by the interconnect. In order to handle congestion in packet-switched networks, some form of congestion management must be implemented. The two kinds of mechanisms used are those that control congestion and those that eliminate the performance degradation introduced by congestion. There are three basic schemes used for congestion control in interconnection networks, each with its own weaknesses: packet discarding, flow control, and choke packets. The simplest scheme is packet discarding, which we discussed briefly in Section F.2. If a packet arrives at a switch and there is no room in the buffer, the packet is discarded. This scheme relies on higher-level software that handles errors in transmission to resend lost packets. This leads to significant bandwidth wastage due to (re)transmitted packets that are later discarded and, therefore, is typically used only in lossy networks like the Internet. The second scheme relies on flow control, also discussed previously. When buffers become full, link-level flow control provides feedback that prevents the transmission of additional packets. This backpressure feedback rapidly propagates backward until it reaches the sender(s) of the packets producing congestion, forcing a reduction in the injection rate of packets into the network. The main drawbacks of this scheme are that sources become aware of congestion too late when the network is already congested, and nothing is done to alleviate congestion. Backpressure flow control is common in lossless networks like SANs used in supercomputers and enterprise systems. A more elaborate way of using flow control is by implementing it directly between the sender and the receiver end nodes, generically called end-to-end flow control. Windowing is one version of end-to-end credit-based flow control where the window size should be large enough to efficiently pipeline packets through the network. The goal of the window is to limit the number of unacknowledged packets, thus bounding the contribution of each source to congestion, should it arise. The TCP protocol uses a sliding window. Note that end-to-end flow control describes the interaction between just two nodes of the interconnection network, not the entire interconnection network between all end nodes. Hence, flow control helps congestion control, but it is not a global solution. Choke packets are used in the third scheme, which is built upon the premise that traffic injection should be throttled only when congestion exists across the network. The idea is for each switch to see how busy it is and to enter into a warning state when it passes a threshold. Each packet received by a switch in the warning state is sent back to the source via a choke packet that includes the intended destination. The source is expected to reduce traffic to that destination by a fixed percentage. Since it likely will have already sent other packets along that path, the source node waits for all the packets in transit to be returned before acting on


the choke packets. In this scheme, congestion is controlled by reducing the packet injection rate until traffic reduces, just as metering lights that guard on-ramps control the rate of cars entering a freeway. This scheme works efficiently when the feedback delay is short. When congestion notification takes a long time, usually due to long time of flight, this congestion control scheme may become unstable—reacting too slowly or producing oscillations in packet injection rate, both of which lead to poor network bandwidth utilization. An alternative to congestion control consists of eliminating the negative consequences of congestion. This can be done by eliminating HOL blocking at every switch in the network as discussed previously. Virtual output queues can be used for this purpose; however, it would be necessary to implement as many queues at every switch input port as devices attached to the network. This solution is very expensive, and not scalable at all. Fortunately, it is possible to achieve good results by dynamically assigning a few set-aside queues to store only the congested packets that travel through some hot-spot regions of the network, very much like caches are intended to store only the more frequently accessed memory locations. This strategy is referred to as regional explicit congestion notification (RECN).
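As a rough illustration of the choke-packet scheme described above, the sketch below has a switch enter a warning state when its buffer occupancy crosses a threshold and a source cut its injection rate toward the congested destination by a fixed fraction. The threshold, the reduction factor, and the class structure are invented for illustration and do not model any real network.

# Sketch of choke-packet congestion control (all values are illustrative).

WARNING_THRESHOLD = 0.8   # fraction of buffer occupancy that triggers warnings
RATE_REDUCTION = 0.5      # source cuts its rate to this fraction per choke packet

class Switch:
    def __init__(self, buffer_size):
        self.buffer_size = buffer_size
        self.occupancy = 0

    def receive(self, packet):
        self.occupancy += 1
        if self.occupancy > WARNING_THRESHOLD * self.buffer_size:
            # In the warning state: send a choke packet back toward the source,
            # naming the congested destination.
            return ("CHOKE", packet["dst"])
        return None

class Source:
    def __init__(self):
        self.rate = {}  # injection rate per destination (1.0 = full rate)

    def handle_choke(self, dst):
        # Reduce traffic toward the congested destination by a fixed percentage.
        self.rate[dst] = self.rate.get(dst, 1.0) * RATE_REDUCTION

src, sw = Source(), Switch(buffer_size=10)
for _ in range(9):
    feedback = sw.receive({"dst": 7})
    if feedback:
        src.handle_choke(feedback[1])
print(src.rate)   # rate toward destination 7 shrinks once the threshold is passed

As the text notes, the effectiveness of this feedback loop depends on how quickly the choke packets reach the source relative to how quickly congestion builds.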

Fault Tolerance The probability of system failures increases as transistor integration density and the number of devices in the system increases. Consequently, system reliability and availability have become major concerns and will be even more important in future systems with the proliferation of interconnected devices. A practical issue arises, therefore, as to whether or not the interconnection network relies on all the devices being operational in order for the network to work properly. Since software failures are generally much more frequent than hardware failures, another question surfaces as to whether a software crash on a single device can prevent the rest of the devices from communicating. Although some hardware designers try to build fault-free networks, in practice, it is only a question of the rate of failures, not whether they can be prevented. Thus, the communication subsystem must have mechanisms for dealing with faults when—not if—they occur. There are two main kinds of failure in an interconnection network: transient and permanent. Transient failures are usually produced by electromagnetic interference and can be detected and corrected using the techniques described in Section F.2. Oftentimes, these can be dealt with simply by retransmitting the packet either at the link level or end-to-end. Permanent failures occur when some component stops working within specifications. Typically, these are produced by overheating, overbiasing, overuse, aging, and so on and cannot be recovered from simply by retransmitting packets with the help of some higher-layer software protocol. Either an alternative physical path must exist in the network and be supplied by the routing algorithm to circumvent the fault or the network will be crippled, unable to deliver packets whose only paths are through faulty resources. Three major categories of techniques are used to deal with permanent failures: resource sparing, fault-tolerant routing, and network reconfiguration. In the first


technique, faulty resources are switched off or bypassed, and some spare resources are switched in to replace the faulty ones. As an example, the ServerNet interconnection network is designed with two identical switch fabrics, only one of which is usable at any given time. In case of failure in one fabric, the other is used. This technique can also be implemented without switching in spare resources, leading to a degraded mode of operation after a failure. The IBM Blue Gene/L supercomputer, for instance, has the facility to bypass failed network resources while retaining its base topological structure and routing algorithm. The main drawback of this technique is the relatively large number of healthy resources (e.g., midplane node boards) that may need to be switched off after a failure in order to retain the base topological structure (e.g., a 3D torus). Fault-tolerant routing, on the other hand, takes advantage of the multiple paths already existing in the network topology to route messages in the presence of failures without requiring spare resources. Alternative paths for each supported fault combination are identified at design time and incorporated into the routing algorithm. When a fault is detected, a suitable alternative path is used. The main difficulty when using this technique is guaranteeing that the routing algorithm will remain deadlock-free when using the alternative paths, given that arbitrary fault patterns may occur. This is especially difficult in direct networks whose regularity can be compromised by the fault pattern. The Cray T3E is an example system that successfully applies this technique on its 3D torus direct network. There are many examples of this technique in systems using indirect networks, such as with the bidirectional multistage networks in the ASCI White and ASC Purple. Those networks provide multiple minimal paths between end nodes and, inherently, have no routing deadlock problems (see Section F.5). In these networks, alternative paths are selected at the source node in case of failure. Network reconfiguration is yet another, more general technique to handle voluntary and involuntary changes in the network topology due either to failures or to some other cause. In order for the network to be reconfigured, the nonfaulty portions of the topology must first be discovered, followed by computation of the new routing tables and distribution of the routing tables to the corresponding network locations (i.e., switches and/or end node devices). Network reconfiguration requires the use of programmable switches and/or network interfaces, depending on how routing is performed. It may also make use of generic routing algorithms (e.g., up*/down* routing) that can be configured for all the possible network topologies that may result after faults. This strategy relieves the designer from having to supply alternative paths for each possible fault combination at design time. Programmable network components provide a high degree of flexibility but at the expense of higher cost and latency. Most standard and proprietary interconnection networks for clusters and SANs—including Myrinet, Quadrics, InfiniBand, Advanced Switching, and Fibre Channel—incorporate software for (re)configuring the network routing in accordance with the prevailing topology. Another practical issue ties to node failure tolerance. 
If an interconnection network can survive a failure, can it also continue operation while a new node is added to or removed from the network, usually referred to as hot swapping? If not, each addition or removal of a new node disables the interconnection network, which is


impractical for WANs and LANs and is usually intolerable for most SANs. Online system expansion requires hot swapping, so most networks allow for it. Hot swapping is usually supported by implementing dynamic network reconfiguration, in which the network is reconfigured without having to stop user traffic. The main difficulty with this is guaranteeing deadlock-free routing while routing tables for switches and/or end node devices are dynamically and asynchronously updated as more than one routing algorithm may be alive (and, perhaps, clashing) in the network at the same time. Most WANs solve this problem by dropping packets whenever required, but dynamic network reconfiguration is much more complex in lossless networks. Several theories and practical techniques have recently been developed to address this problem efficiently.
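The fault-tolerant routing approach described earlier can be sketched as a routing table that stores a precomputed alternative next hop for each destination and falls back to it when the primary hop is reported faulty. The table contents and link names below are invented; in a real design the hard part, not captured here, is guaranteeing that the alternative paths remain deadlock-free for arbitrary fault patterns.

# Sketch of fault-tolerant routing with precomputed alternative paths.
# Each entry maps a destination to (primary next hop, alternative next hop);
# the table contents below are invented for illustration.

routing_table = {
    "nodeB": ("link0", "link2"),
    "nodeC": ("link1", "link3"),
}
failed_links = {"link0"}   # reported by fault detection

def next_hop(dst):
    primary, alternative = routing_table[dst]
    if primary not in failed_links:
        return primary
    if alternative not in failed_links:
        return alternative
    raise RuntimeError("no healthy path to " + dst)

print(next_hop("nodeB"))   # falls back to link2 because link0 is faulty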

Example

Figure F.27 shows the number of failures of 58 desktop computers on a local area network for a period of just over one year. Suppose that one local area network is based on a network that requires all machines to be operational for the interconnection network to send data; if a node crashes, it cannot accept messages, so the interconnection becomes choked with data waiting to be delivered. An alternative is the traditional local area network, which can operate in the presence of node failures; the interconnection simply discards messages for a node that decides not to accept them. Assuming that you need to have both your workstation and the connecting LAN to get your work done, how much greater are your chances of being prevented from getting your work done using the failure-intolerant LAN versus traditional LANs? Assume the downtime for a crash is less than 30 minutes. Calculate using the one-hour intervals from this figure.

Answer

Assuming the numbers for Figure F.27, the percentage of hours that you can't get your work done using the failure-intolerant network is

Intervals with failures / Total intervals = (Total intervals - Intervals with no failures) / Total intervals
= (8974 - 8605) / 8974 = 369 / 8974 = 4.1%

The percentage of hours that you can't get your work done using the traditional network is just the time your workstation has crashed. If these failures are equally distributed among workstations, the percentage is

(Failures / Machines) / Total intervals = (654 / 58) / 8974 = 11.28 / 8974 = 0.13%

Hence, you are more than 30 times more likely to be prevented from getting your work done with the failure-intolerant LAN than with the traditional LAN, according to the failure statistics in Figure F.27. Stated alternatively, the person responsible for maintaining the LAN would receive a 30-fold increase in phone calls from irate users!
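The arithmetic in this answer is easy to reproduce; the short script below uses only the totals from Figure F.27.

# Reproducing the failure-rate arithmetic from the example above.
total_intervals = 8974    # one-hour intervals in the measurement period
fault_free      = 8605    # one-hour intervals with zero failed machines
total_failures  = 654     # machine failures summed over all one-hour intervals
machines        = 58

failure_intolerant = (total_intervals - fault_free) / total_intervals
traditional        = (total_failures / machines) / total_intervals

print(f"{failure_intolerant:.1%}")                  # about 4.1%
print(f"{traditional:.2%}")                         # about 0.13%
print(f"{failure_intolerant / traditional:.0f}x")   # more than 30 times (about 33)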


Failed machines per time interval | One-hour intervals with that number of failed machines | Total failures per one-hour interval | One-day intervals with that number of failed machines | Total failures per one-day interval
0     | 8605 | 0   | 184 | 0
1     | 264  | 264 | 105 | 105
2     | 50   | 100 | 35  | 70
3     | 25   | 75  | 11  | 33
4     | 10   | 40  | 6   | 24
5     | 7    | 35  | 9   | 45
6     | 3    | 18  | 6   | 36
7     | 1    | 7   | 4   | 28
8     | 1    | 8   | 4   | 32
9     | 2    | 18  | 2   | 18
10    | 2    | 20  |     |
11    | 1    | 11  | 2   | 22
12    |      |     | 1   | 12
17    | 1    | 17  |     |
20    | 1    | 20  |     |
21    | 1    | 21  | 1   | 21
31    |      |     | 1   | 31
38    |      |     | 1   | 38
58    |      |     | 1   | 58
Total | 8974 | 654 | 373 | 573

Figure F.27 Measurement of reboots of 58 DECstation 5000 s running Ultrix over a 373-day period. These reboots are distributed into time intervals of one hour and one day. The first column sorts the intervals according to the number of machines that failed in that interval. The next two columns concern one-hour intervals, and the last two columns concern one-day intervals. The second and fourth columns show the number of intervals for each number of failed machines. The third and fifth columns are just the product of the number of failed machines and the number of intervals. For example, there were 50 occurrences of one-hour intervals with 2 failed machines, for a total of 100 failed machines, and there were 35 days with 2 failed machines, for a total of 70 failures. As we would expect, the number of failures per interval changes with the size of the interval. For example, the day with 31 failures might include one hour with 11 failures and one hour with 20 failures. The last row shows the total number of each column; the number of failures doesn’t agree because multiple reboots of the same machine in the same interval do not result in separate entries. (Randy Wang of the University of California–Berkeley collected these data.)

F.8 Examples of Interconnection Networks

To further provide mass to the concepts described in the previous sections, we look at five example networks from the four interconnection network domains considered in this appendix. In addition to one for each of the OCN, LAN, and WAN areas, we look at two examples from the SAN area: one for system area networks


and one for system/storage area networks. The first two examples are proprietary networks used in high-performance systems; the latter three examples are network standards widely used in commercial systems.

On-Chip Network: Intel Single-Chip Cloud Computer

With continued increases in transistor integration as predicted by Moore's law, processor designers are under the gun to find ways of combating chip-crossing wire delay and other problems associated with deep submicron technology scaling. Multicore microarchitectures have gained popularity, given their advantages of simplicity, modularity, and ability to exploit parallelism beyond that which can be achieved through aggressive pipelining and multiple instruction/data issuing on a single core. No matter whether the processor consists of a single core or multiple cores, higher and higher demands are being placed on intrachip communication bandwidth to keep pace—not to mention interchip bandwidth. This has spurred a great amount of interest in OCN designs that efficiently support communication of instructions, register operands, memory, and I/O data within and between processor cores both on and off the chip. Here we focus on one such on-chip network: The Intel Single-chip Cloud Computer prototype.

The Single-chip Cloud Computer (SCC) is a prototype chip multiprocessor with 48 Intel IA-32 architecture cores. Cores are laid out (see Figure F.28) on a network with a 2D mesh topology (6 × 4). The network connects 24 tiles, 4 on-die memory controllers, a voltage regulator controller (VRC), and an external system interface controller (SIF). In each tile two cores are connected to a router. The four memory controllers are connected at the boundaries of the mesh, two on each side, while the VRC and SIF controllers are connected at the bottom border of the mesh. Each memory controller can address two DDR3 DIMMs, each up to 8 GB of memory, thus resulting in a maximum of 64 GB of memory.

The VRC controller allows any core or the system interface to adjust the voltage in any of the six predefined regions configuring the network (two 2-tile regions). The clock can also be adjusted at a finer granularity with each tile having its own operating frequency. These regions can be turned off or scaled down for large power savings. This method allows full application control of the power state of the cores. Indeed, applications have an API available to define the voltage and the frequency of each region. The SIF controller is used to communicate with the network from outside the chip.

Each of the tiles includes two processor cores (P54C-based IA) with associated L1 16 KB data cache and 16 KB instruction cache and a 256 KB L2 cache (with the associated controller), a 5-port router, a traffic generator (for testing purposes only), a mesh interface unit (MIU) handling all message passing requests, memory look-up tables (with configuration registers to set the mapping of a core's physical addresses to the extended memory map of the system), a message-passing buffer, and circuitry for the clock generation and synchronization for crossing asynchronous boundaries.


Figure F.28 SCC Top-level architecture. From Howard, J. et al., IEEE International Solid-State Circuits Conference Digest of Technical Papers, pp. 58–59.

Focusing on the OCN, the MIU unit is in charge of interfacing the cores to the network, including the packetization and de-packetization of large messages; command translation and address decoding/lookup; link-level flow control and credit management; and arbiter decisions following a round-robin scheme. A credit-based flow control mechanism is used together with virtual cut-through switching (thus making it necessary to split long messages into packets). The routers are connected in a 2D mesh layout, each on its own power supply and clock source. Links connecting routers have 16B + 2B side bands running at 2 GHz. Zero-load latency is set to 4 cycles, including link traversal. Eight virtual channels are used for performance (6 VCs) and protocol-level deadlock handling (2 VCs). Message-level arbitration is implemented by a wrapped wave-front arbiter. The dimension-order XY routing algorithm is used and pre-computation of the output port is performed at every router. Besides the tiles having regions defined for voltage and frequency, the network (made of routers and links) has its own single region. Thus, all the network components run at the same speed and use the same power supply. An asynchronous clock transition is required between the router and the tile. One of the distinctive features of the SCC architecture is the support for a messaging-based communication protocol rather than hardware cache-coherent


memory for inter-core communication. Message passing buffers are located on every router and APIs are provided to take full control of MPI structures. Cache coherency can be implemented by software. The SCC router represents a significant improvement over the Teraflops processor chip in the implementation of a 2D on-chip interconnect. Contrasted with the 2D mesh implemented in the Teraflops processor, this implementation is tuned for a wider data path in a multiprocessor interconnect and is more latency, area, and power optimized for such a width. It targets a lower 2-GHz frequency of operation compared to the 5 GHz of its predecessor Teraflops processor, yet with a higher-performance interconnect architecture.
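The dimension-order XY routing used by the SCC routers can be sketched in a few lines: a packet first travels along the X dimension until its column matches the destination, then along Y. The coordinate convention and port names below are assumptions for illustration, not the SCC's actual encoding.

# Sketch of dimension-order XY routing on a 6 x 4 mesh (SCC-like layout).
# The packet resolves its X offset completely before taking any Y hop.

def xy_route(src, dst):
    """Return the ordered list of output ports taken from src to dst."""
    (x, y), (dx, dy) = src, dst
    hops = []
    while x != dx:                    # route in X first
        step = 1 if dx > x else -1
        hops.append("EAST" if step == 1 else "WEST")
        x += step
    while y != dy:                    # then route in Y
        step = 1 if dy > y else -1
        hops.append("NORTH" if step == 1 else "SOUTH")
        y += step
    return hops

print(xy_route((0, 0), (5, 3)))  # five hops east, then three hops north

Because every packet exhausts its X hops before taking any Y hop, only a restricted set of turns can occur, which is what makes dimension-order routing deadlock-free on a mesh.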

System Area Network: IBM Blue Gene/L 3D Torus Network

The IBM Blue Gene/L was the largest-scaled, highest-performing computer system in the world in 2005, according to www.top500.org. With 65,536 dual-processor compute nodes and 1024 I/O nodes, this 360 TFLOPS (peak) supercomputer has a system footprint of approximately 2500 square feet. Both processors at each node can be used for computation and can handle their own communication protocol processing in virtual mode or, alternatively, one of the processors can be used for computation and the other for network interface processing.

Packets range in size from 32 bytes to a maximum of 256 bytes, and 8 bytes are used for the header. The header includes routing, virtual channel, link-level flow control, packet size, and other such information, along with 1 byte for CRC to protect the header. Three bytes are used for CRC at the packet level, and 1 byte serves as a valid indicator.

The main interconnection network is a proprietary 32 × 32 × 64 3D torus SAN that interconnects all 64K nodes. Each node switch has six 350 MB/sec bidirectional links to neighboring torus nodes, an injection bandwidth of 612.5 MB/sec from the two node processors, and a reception bandwidth of 1050 MB/sec to the two node processors. The reception bandwidth from the network equals the inbound bandwidth across all switch ports, which prevents reception links from bottlenecking network performance. Multiple packets can be sunk concurrently at each destination node because of the higher reception link bandwidth.

Two nodes are implemented on a 2 × 1 × 1 compute card, 16 compute cards and 2 I/O cards are implemented on a 4 × 4 × 2 node board, 16 node boards are implemented on an 8 × 8 × 8 midplane, and 2 midplanes form a 1024-node rack with physical dimensions of 0.9 × 0.9 × 1.9 meters. Links have a maximum physical length of 8.6 meters, thus enabling efficient link-level flow control with reasonably low buffering requirements. Low latency is achieved by implementing virtual cut-through switching, distributing arbitration at switch input and output ports, and precomputing the current routing path at the previous switch using a finite-state machine so that part of the routing delay is removed from the critical path in switches. High effective bandwidth is achieved using input-buffered


switches with dual read ports, virtual cut-through switching with four virtual channels, and fully adaptive deadlock-free routing based on bubble flow control. A key feature in networks of this size is fault tolerance. Failure rate is reduced by using a relatively low link clock frequency of 700 MHz (same as processor clock) on which both edges of the clock are used (i.e., 1.4 Gbps or 175 MB/sec transfer rate is supported for each bit-serial network link in each direction), but failures may still occur in the network. In case of failure, the midplane node boards containing the fault(s) are switched off and bypassed to isolate the fault, and computation resumes from the last checkpoint. Bypassing is done using separate bypass switch boards associated with each midplane that are additional to the set of torus node boards. Each bypass switch board can be configured to connect either to the corresponding links in the midplane node boards or to the next bypass board, effectively removing the corresponding set of midplane node boards. Although the number of processing nodes is reduced to some degree in some network dimensions, the machine retains its topological structure and routing algorithm. Some collective communication operations such as barrier synchronization, broadcast/multicast, reduction, and so on are not performed well on the 3D torus as the network would be flooded with traffic. To remedy this, two separate tree networks with higher per-link bandwidth are used to implement collective and combining operations more efficiently. In addition to providing support for efficient synchronization and broadcast/multicast, hardware is used to perform some arithmetic reduction operations in an efficient way (e.g., to compute the sum or the maximum value of a set of values, one from each processing node). In addition to the 3D torus and the two tree networks, the Blue Gene/L implements an I/O Gigabit Ethernet network and a control system Fast Ethernet network of lower bandwidth to provide for parallel I/O, configuration, debugging, and maintenance.
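The per-node bandwidth figures quoted for the torus are easy to check from the link clock; the short sketch below redoes the arithmetic using the 700 MHz dual-edge bit-serial links described above.

# Checking the Blue Gene/L per-node bandwidth numbers quoted in the text.
link_clock_mhz = 700          # link clock, same as the processor clock
bits_per_cycle = 2            # both clock edges are used on each bit-serial link
gbps_per_link  = link_clock_mhz * bits_per_cycle / 1000   # 1.4 Gbps per direction
mbs_per_link   = gbps_per_link * 1000 / 8                 # 175 MB/sec per direction

torus_links   = 6             # +/- neighbors in each of the three torus dimensions
inbound_total = torus_links * mbs_per_link

print(gbps_per_link, mbs_per_link, inbound_total)   # 1.4, 175.0, 1050.0

The 1050 MB/sec total matches the reception bandwidth quoted earlier, which is why traffic arriving simultaneously on all six torus links does not bottleneck at the destination node.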

System/Storage Area Network: InfiniBand

InfiniBand is an industrywide de facto networking standard developed in October 2000 by a consortium of companies belonging to the InfiniBand Trade Association. InfiniBand can be used as a system area network for interprocessor communication or as a storage area network for server I/O. It is a switch-based interconnect technology that provides flexibility in the topology, routing algorithm, and arbitration technique implemented by vendors and users. InfiniBand supports data transmission rates of 2 to 120 Gbps per link per direction across distances of up to 300 meters. It uses cut-through switching, 16 virtual channels and service levels, credit-based link-level flow control, and weighted round-robin fair scheduling and implements programmable forwarding tables. It also includes features useful for increasing reliability and system availability, such as communication subnet management, end-to-end path establishment, and virtual destination naming.


Institution and processor [network] name | Year built | Number of network ports [cores or tiles + other ports] | Basic network topology | # of data bits per link per direction | Link bandwidth [link clock speed] | Routing; arbitration; switching | # of chip metal layers; flow control; # virtual channels
MIT Raw [General Dynamic Network] | 2002 | 16 ports [16 tiles] | 2D mesh (4 × 4) | 32 bits | 0.9 GB/sec [225 MHz, clocked at proc speed] | XY DOR with request-reply deadlock recovery; RR arbitration; wormhole | 6 layers; credit-based; no virtual channels
IBM Power5 | 2004 | 7 ports [2 PE cores + 5 other ports] | Crossbar | 256 bits inst fetch; 64 bits for stores; 256 bits LDs | [1.9 GHz, clocked at proc speed] | Shortest-path; nonblocking; circuit switch | 7 layers; handshaking; no virtual channels
U.T. Austin TRIPS Edge [Operand Network] | 2005 | 25 ports [25 execution unit tiles] | 2D mesh (5 × 5) | 110 bits | 5.86 GB/sec [533 MHz clock scaled by 80%] | YX DOR; distributed RR arbitration; wormhole | 7 layers; on/off flow control; no virtual channels
U.T. Austin TRIPS Edge [On-Chip Network] | 2005 | 40 ports [16 L2 tiles + 24 network interface tiles] | 2D mesh (10 × 4) | 128 bits | 6.8 GB/sec [533 MHz clock scaled by 80%] | YX DOR; distributed RR arbitration; VCT switched | 7 layers; credit-based flow control; 4 virtual channels
Sony, IBM, Toshiba Cell BE [Element Interconnect Bus] | 2005 | 12 ports [1 PPE and 8 SPEs + 3 other ports for memory, I/O interface] | Ring (4 total, 2 in each direction) | 128 bits data (+16 bits tag) | 25.6 GB/sec [1.6 GHz, clocked at half the proc speed] | Shortest-path; tree-based RR arbitration (centralized); pipelined circuit switch | 8 layers; credit-based flow control; no virtual channels
Sun UltraSPARC T1 processor | 2005 | Up to 13 ports [8 PE cores + 4 L2 banks + 1 shared I/O] | Crossbar | 128 bits both for the 8 cores and the 4 L2 banks | 19.2 GB/sec [1.2 GHz, clocked at proc speed] | Shortest-path; age-based arbitration; VCT switched | 9 layers; handshaking; no virtual channels

Figure F.29 Characteristics of on-chip networks implemented in recent research and commercial processors. Some processors implement multiple on-chip networks (not all shown)—for example, two in the MIT Raw and eight in the TRIPS Edge.

Figure F.30 shows the packet format for InfiniBand juxtaposed with two other network standards from the LAN and WAN areas. Figure F.31 compares various characteristics of the InfiniBand standard with two proprietary system area networks widely used in research and commercial high-performance computer systems.


Figure F.30 Packet format for InfiniBand, Ethernet, and ATM. ATM calls their messages “cells” instead of packets, so the proper name is ATM cell format. The width of each drawing is 32 bits. All three formats have destination addressing fields, encoded differently for each situation. All three also have a checksum field to catch transmission errors, although the ATM checksum field is calculated only over the header; ATM relies on higher-level protocols to catch errors in the data. Both InfiniBand and Ethernet have a length field, since the packets hold a variable amount of data, with the former counted in 32-bit words and the latter in bytes. InfiniBand and ATM headers have a type field (T) that gives the type of packet. The remaining Ethernet fields are a preamble to allow the receiver to recover the clock from the self-clocking code used on the Ethernet, the source address, and a pad field to make sure the smallest packet is 64 bytes (including the header). InfiniBand includes a version field for protocol version, a sequence number to allow inorder delivery, a field to select the destination queue, and a partition key field. Infiniband has many more small fields not shown and many other packet formats; above is a simplified view. ATM’s short, fixed packet is a good match to real-time demand of digital voice.




Network name [vendors] | Used in top 10 supercomputer clusters (2005) | Number of nodes | Basic network topology | Raw link bidirectional BW | Routing algorithm | Arbitration technique | Switching technique; flow control
InfiniBand [Mellanox, Voltaire] | SGI Altix and Dell PowerEdge Thunderbird | >Millions (2^128 GUID addresses, like IPv6) | Completely configurable (arbitrary) | 4–240 Gbps | Arbitrary (table-driven), typically up*/down* | Weighted RR fair scheduling (2-level priority) | Cut-through, 16 virtual channels (15 for data); credit-based
Myrinet-2000 [Myricom] | Barcelona Supercomputer Center in Spain | 8192 nodes | Bidirectional MIN with 16-port bidirectional switches (Clos network) | 4 Gbps | Source-based dispersive (adaptive) minimal routing | Round-robin arbitration | Cut-through switching with no virtual channels; Xon/Xoff flow control
QsNetII [Quadrics] | Intel Thunder Itanium2 Tiger4 | >Tens of thousands | Fat tree with 8-port bidirectional switches | 21.3 Gbps | Source-based LCA adaptive shortest-path routing | 2-phased RR, priority, aging, distributed at output ports | Wormhole with 2 virtual channels; credit-based

Figure F.31 Characteristics of system area networks implemented in various top 10 supercomputer clusters in 2005.

InfiniBand offers two basic mechanisms to support user-level communication: send/receive and remote DMA (RDMA). With send/receive, the receiver has to explicitly post a receive buffer (i.e., allocate space in its channel adapter network interface) before the sender can transmit data. With RDMA, the sender can remotely DMA data directly into the receiver device’s memory. For example, for a nominal packet size of 4 bytes measured on a Mellanox MHEA28-XT channel adapter connected to a 3.4 GHz Intel Xeon host device, sending and receiving overhead is 0.946 and 1.423 μs, respectively, for the send/receive mechanism, whereas it is 0.910 and 0.323 μs, respectively, for the RDMA mechanism. As discussed in Section F.2, the packet size is important in getting full benefit of the network bandwidth. One might ask, “What is the natural size of messages?” Figure F.32(a) shows the size of messages for a commercial fluid dynamics simulation application, called Fluent, collected on an InfiniBand network at The Ohio State University’s Network-Based Computer Laboratory. One plot is cumulative in messages sent and the other is cumulative in data bytes sent. Messages in this graph are message passing interface (MPI) units of information, which gets divided into InfiniBand maximum transfer units (packets) transferred over the network. As shown, the maximum message size is over 512 KB, but approximately 90% of the messages are less than 512 bytes. Messages of 2 KB represent approximately 50% of the bytes transferred. An Integer Sort application kernel in the NAS Parallel


Figure F.32 Data collected by D.K. Panda, S. Sur, and L. Chai (2005) in the Network-Based Computing Laboratory at The Ohio State University. (a) Cumulative percentage of messages and volume of data transferred as message size varies for the Fluent application (www.fluent.com). Each x-axis entry includes all bytes up to the next one; for example, 128 represents 1 byte to 128 bytes. About 90% of the messages are less than 512 bytes, which represents about 40% of the total bytes transferred. (b) Effective bandwidth versus message size measured on SDR and DDR InfiniBand networks running MVAPICH (http://nowlab.cse.ohio-state.edu/projects/mpi-iba) with OS bypass (native) and without (IPoIB).

Benchmark suite is also measured to have about 75% of its messages below 512 bytes (plots not shown). Many applications send far more small messages than large ones, particularly since requests and acknowledgments are more frequent than data responses and block writes. InfiniBand reduces protocol processing overhead by allowing it to be offloaded from the host computer to a controller on the InfiniBand network interface card. The benefits of protocol offloading and bypassing the operating system are shown in Figure F.32(b) for MVAPICH, a widely used implementation of MPI over InfiniBand. Effective bandwidth is plotted against message size for MVAPICH configured in two modes and two network speeds. One mode runs IPoIB, in which InfiniBand communication is handled by the IP layer implemented by the host’s operating system (i.e., no OS bypass). The other mode runs MVAPICH directly over VAPI, which is the native Mellanox InfiniBand interface that offloads transport protocol processing to the channel adapter hardware (i.e., OS bypass). Results are shown for 10 Gbps single data rate (SDR) and 20 Gbps double data rate (DDR) InfiniBand networks. The results clearly show that offloading the protocol processing and bypassing the OS significantly reduce sending and receiving overhead to allow near wire-speed effective bandwidth to be achieved.
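A simple model helps explain the shape of these curves. If we assume, purely as an approximation, that each message pays the fixed RDMA sending and receiving overheads quoted earlier (0.910 and 0.323 microseconds) plus its transfer time at the raw link rate, effective bandwidth grows with message size as shown below. The model and helper function are illustrative, not measurements, and they ignore pipelining, MTU fragmentation, and congestion.

# Illustrative model: effective bandwidth versus message size.
# Assumes a fixed per-message overhead plus serialization at the raw link rate.

def effective_bw_MBps(msg_bytes, overhead_us, raw_gbps):
    transfer_us = msg_bytes * 8 / (raw_gbps * 1000.0)  # 1 Gbps = 1000 bits per microsecond
    return msg_bytes / (overhead_us + transfer_us)     # bytes per microsecond = MB/sec

overhead = 0.910 + 0.323   # RDMA sending + receiving overhead, in microseconds
for size in (64, 1024, 16 * 1024, 256 * 1024):
    print(size, round(effective_bw_MBps(size, overhead, raw_gbps=10)))
    # roughly 50, 500, 1140, and 1240 MB/sec on a 10 Gbps (SDR) link

Small messages are dominated by the fixed overhead term, which is why offloading the protocol (shrinking that overhead) matters far more for them than raw link speed does.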


Ethernet: The Local Area Network Ethernet has been extraordinarily successful as a LAN—from the 10 Mbit/sec standard proposed in 1978 used practically everywhere today to the more recent 10 Gbit/sec standard that will likely be widely used. Many classes of computers include Ethernet as a standard communication interface. Ethernet, codified as IEEE standard 802.3, is a packet-switched network that routes packets using the destination address. It was originally designed for coaxial cable but today uses primarily Cat5E copper wire, with optical fiber reserved for longer distances and higher bandwidths. There is even a wireless version (802.11), which is testimony to its ubiquity. Over a 20-year span, computers became thousands of times faster than they were in 1978, but the shared media Ethernet network remained the same. Hence, engineers had to invent temporary solutions until a faster, higher-bandwidth network became available. One solution was to use multiple Ethernets to interconnect machines and to connect those Ethernets with internetworking devices that could transfer traffic from one Ethernet to another, as needed. Such devices allow individual Ethernets to operate in parallel, thereby increasing the aggregate interconnection bandwidth of a collection of computers. In effect, these devices provide similar functionality to the switches described previously for point-to-point networks. Figure F.33 shows the potential parallelism that can be gained. Depending on how they pass traffic and what kinds of interconnections they can join together, these devices have different names:

Single Ethernet: one packet at a time
Multiple Ethernets: multiple packets at a time

Figure F.33 The potential increased bandwidth of using many Ethernets and bridges.


Bridges—These devices connect LANs together, passing traffic from one side to another depending on the addresses in the packet. Bridges operate at the Ethernet protocol level and are usually simpler and cheaper than routers, discussed next. Using the notation of the OSI model described in the next section (see Figure F.36 on page F-85), bridges operate at layer 2, the data link layer.



Routers or gateways—These devices connect LANs to WANs, or WANs to WANs, and resolve incompatible addressing. Generally slower than bridges, they operate at OSI layer 3, the network layer. WAN routers divide the network into separate smaller subnets, which simplifies manageability and improves security.



The final internetworking devices are hubs, but they merely extend multiple segments into a single LAN. Thus, hubs do not help with performance, as only one message can transmit at a time. Hubs operate at OSI layer 1, called the physical layer.


Figure F.34 The connection established between mojave.stanford.edu and mammoth.berkeley.edu (1995). FDDI is a 100 Mbit/sec LAN, while a T1 line is a 1.5 Mbit/sec telecommunications line and a T3 is a 45 Mbit/sec telecommunications line. BARRNet stands for Bay Area Research Network. Note that inr-111-cs2.Berkeley.edu is a router with two Internet addresses, one for each port.


Applications / Internetworking / Networks

Figure F.35 The role of internetworking. The width indicates the relative number of items at each level.

Layer number | Layer name | Main function | Example protocol | Network component
7 | Application | Used for applications specifically written to run over the network | FTP, DNS, NFS, http | Gateway, smart switch
6 | Presentation | Translates from application to network format, and vice versa | | Gateway
5 | Session | Establishes, maintains, and ends sessions across the network | Named pipes, RPC | Gateway
4 | Transport | Additional connection below the session layer | TCP | Gateway
3 | Network | Translates logical network address and names to their physical address (e.g., computer name to MAC address) | IP | Router, ATM switch
2 | Data Link | Turns packets into raw bits and at the receiving end turns bits into packets | Ethernet | Bridge, network interface card
1 | Physical | Transmits raw bit stream over physical cable | IEEE 802 | Hub

Figure F.36 The OSI model layers. Based on www.geocities.com/SiliconValley/Monitor/3131/ne/osimodel.html.

Since these devices were not planned as part of the Ethernet standard, their ad hoc nature has added to the difficulty and cost of maintaining LANs. As of 2011, Ethernet link speeds are available at 10, 100, 1000, 10,000, and 100,000 Mbits/sec. Although 10 and 100 Mbits/sec Ethernets share the media with multiple devices, 1000 Mbits/sec and above Ethernets rely on point-to-point links and switches. Ethernet switches normally use some form of store-and-forward. Ethernet has no real flow control, dating back to its first instantiation. It originally used carrier sensing with exponential back-off (see page F-23) to arbitrate for the shared media. Some switches try to use that interface to retrofit their version of flow control, but flow control is not part of the Ethernet standard.

Wide Area Network: ATM Asynchronous Transfer Mode (ATM) is a wide area networking standard set by the telecommunications industry. Although it flirted as competition to Ethernet as a LAN in the 1990s, ATM has since retreated to its WAN stronghold.


The telecommunications standard has scalable bandwidth built in. It starts at 155 Mbits/sec and scales by factors of 4 to 620 Mbits/sec, 2480 Mbits/sec, and so on. Since it is a WAN, ATM’s medium is fiber, both single mode and multimode. Although it is a switched medium, unlike the other examples it relies on virtual connections for communication. ATM uses virtual channels for routing to multiplex different connections on a single network segment, thereby avoiding the inefficiencies of conventional connection-based networking. The WAN focus also led to storeand-forward switching. Unlike the other protocols, Figure F.30 shows ATM has a small, fixed-sized packet with 48 bytes of payload. It uses a credit-based flow control scheme as opposed to IP routers that do not implement flow control. The reason for virtual connections and small packets is quality of service. Since the telecommunications industry is concerned about voice traffic, predictability matters as well as bandwidth. Establishing a virtual connection has less variability than connectionless networking, and it simplifies store-and-forward switching. The small, fixed packet also makes it simpler to have fast routers and switches. Toward that goal, ATM even offers its own protocol stack to compete with TCP/IP. Surprisingly, even though the switches are simple, the ATM suite of protocols is large and complex. The dream was a seamless infrastructure from LAN to WAN, avoiding the hodgepodge of routers common today. That dream has faded from inspiration to nostalgia.

F.9 Internetworking

Undoubtedly one of the most important innovations in the communications community has been internetworking. It allows computers on independent and incompatible networks to communicate reliably and efficiently. Figure F.34 illustrates the need to traverse between networks. It shows the networks and machines involved in transferring a file from Stanford University to the University of California at Berkeley, a distance of about 75 km.

The low cost of internetworking is remarkable. For example, it is vastly less expensive to send electronic mail than to make a coast-to-coast telephone call and leave a message on an answering machine. This dramatic cost improvement is achieved using the same long-haul communication lines as the telephone call, which makes the improvement even more impressive.

The enabling technologies for internetworking are software standards that allow reliable communication without demanding reliable networks. The underlying principle of these successful standards is that they were composed as a hierarchy of layers, each layer taking responsibility for a portion of the overall communication task. Each computer, network, and switch implements its layer of the standards, relying on the other components to faithfully fulfill their responsibilities. These layered software standards are called protocol families or protocol suites. They enable applications to work with any interconnection without extra work by the application programmer. Figure F.35 suggests the hierarchical model of communication.


The most popular internetworking standard is TCP/IP (Transmission Control Protocol/Internet Protocol). This protocol family is the basis of the humbly named Internet, which connects hundreds of millions of computers around the world. This popularity means TCP/IP is used even when communicating locally across compatible networks; for example, the network file system (NFS) uses IP even though it is very likely to be communicating across a homogenous LAN such as Ethernet. We use TCP/IP as our protocol family example; other protocol families follow similar lines. Section F.13 gives the history of TCP/IP. The goal of a family of protocols is to simplify the standard by dividing responsibilities hierarchically among layers, with each layer offering services needed by the layer above. The application program is at the top, and at the bottom is the physical communication medium, which sends the bits. Just as abstract data types simplify the programmer’s task by shielding the programmer from details of the implementation of the data type, this layered strategy makes the standard easier to understand. There were many efforts at network protocols, which led to confusion in terms. Hence, Open Systems Interconnect (OSI) developed a model that popularized describing networks as a series of layers. Figure F.36 shows the model. Although all protocols do not exactly follow this layering, the nomenclature for the different layers is widely used. Thus, you can hear discussions about a simple layer 3 switch versus a layer 7 smart switch. The key to protocol families is that communication occurs logically at the same level of the protocol in both sender and receiver, but services of the lower level implement it. This style of communication is called peer-to-peer. As an analogy, imagine that General A needs to send a message to General B on the battlefield. General A writes the message, puts it in an envelope addressed to General B, and gives it to a colonel with orders to deliver it. This colonel puts it in an envelope, and writes the name of the corresponding colonel who reports to General B, and gives it to a major with instructions for delivery. The major does the same thing and gives it to a captain, who gives it to a lieutenant, who gives it to a sergeant. The sergeant takes the envelope from the lieutenant, puts it into an envelope with the name of a sergeant who is in General B’s division, and finds a private with orders to take the large envelope. The private borrows a motorcycle and delivers the envelope to the other sergeant. Once it arrives, it is passed up the chain of command, with each person removing an outer envelope with his name on it and passing on the inner envelope to his superior. As far as General B can tell, the note is from another general. Neither general knows who was involved in transmitting the envelope, nor how it was transported from one division to the other. Protocol families follow this analogy more closely than you might think, as Figure F.37 shows. The original message includes a header and possibly a trailer sent by the lower-level protocol. The next-lower protocol in turn adds its own header to the message, possibly breaking it up into smaller messages if it is too large for this layer. Reusing our analogy, a long message from the general is divided and placed in several envelopes if it could not fit in one. This division of the message and appending of headers and trailers continues until the message



Figure F.37 A generic protocol stack with two layers. Note that communication is peer-to-peer, with headers and trailers for the peer added at each sending layer and removed by each receiving layer. Each layer offers services to the one above to shield it from unnecessary details.

descends to the physical transmission medium. The message is then sent to the destination. Each level of the protocol family on the receiving end will check the message at its level and peel off its headers and trailers, passing it on to the next higher level and putting the pieces back together. This nesting of protocol layers for a specific message is called a protocol stack, reflecting the last in, first out nature of the addition and removal of headers and trailers.

As in our analogy, the danger in this layered approach is the considerable latency added to message delivery. Clearly, one way to reduce latency is to reduce the number of layers, but keep in mind that protocol families define a standard but do not force how to implement the standard. Just as there are many ways to implement an instruction set architecture, there are many ways to implement a protocol family.

Our protocol stack example is TCP/IP. Let's assume that the bottom protocol layer is Ethernet. The next level up is the Internet Protocol or IP layer; the official term for an IP packet is a datagram. The IP layer routes the datagram to the destination machine, which may involve many intermediate machines or switches. IP makes a best effort to deliver the packets but does not guarantee delivery, content, or order of datagrams. The TCP layer above IP makes the guarantee of reliable, in-order delivery and prevents corruption of datagrams.

Following the example in Figure F.37, assume an application program wants to send a message to a machine via an Ethernet. It starts with TCP. The largest number of bytes that can be sent at once is 64 KB. Since the data may be much larger than 64 KB, TCP must divide them into smaller segments and reassemble them in proper order upon arrival. TCP adds a 20-byte header (Figure F.38) to every datagram and passes them down to IP. The IP layer above the physical layer adds a 20-byte header, also shown in Figure F.38.
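The header nesting just described can be sketched as follows. This is a toy illustration, not TCP/IP: the 20-byte TCP and IP headers are reduced to placeholder byte strings, and real fragmentation, checksums, and acknowledgments are omitted.

# Sketch of protocol-stack encapsulation: each layer adds its header on the
# way down and strips it on the way up (placeholder 20-byte headers).

MAX_TCP_SEGMENT = 64 * 1024

def tcp_send(data):
    segments = [data[i:i + MAX_TCP_SEGMENT] for i in range(0, len(data), MAX_TCP_SEGMENT)]
    return [b"T" * 20 + seg for seg in segments]      # prepend 20-byte TCP header

def ip_send(segment):
    return b"I" * 20 + segment                        # prepend 20-byte IP header

def ip_receive(datagram):
    return datagram[20:]                              # strip IP header

def tcp_receive(segments):
    return b"".join(seg[20:] for seg in segments)     # strip TCP headers, reassemble

message = b"x" * 100_000                              # larger than one TCP segment
datagrams = [ip_send(s) for s in tcp_send(message)]
recovered = tcp_receive([ip_receive(d) for d in datagrams])
assert recovered == message                           # peer layers see each other's data

The assertion at the end captures the peer-to-peer property: the receiving TCP layer sees exactly the bytes the sending TCP layer was given, with every lower-level header stripped along the way.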


Figure F.38 The headers for IP and TCP. This drawing is 32 bits wide. The standard headers for both are 20 bytes, but both allow the headers to optionally lengthen for rarely transmitted information. Both headers have a length of header field (L) to accommodate the optional fields, as well as source and destination fields. The length field of the whole datagram is in a separate length field in IP, while TCP combines the length of the datagram with the sequence number of the datagram by giving the sequence number in bytes. TCP uses the checksum field to be sure that the datagram is not corrupted, and the sequence number field to be sure the datagrams are assembled into the proper order when they arrive. IP provides checksum error detection only for the header, since TCP has protected the rest of the packet. One optimization is that TCP can send a sequence of datagrams before waiting for permission to send more. The number of datagrams that can be sent without waiting for approval is called the window, and the window field tells how many bytes may be sent beyond the byte being acknowledged by this datagram. TCP will adjust the size of the window depending on the success of the IP layer in sending datagrams; the more reliable and faster it is, the larger TCP makes the window. Since the window slides forward as the data arrive and are acknowledged, this technique is called a sliding window protocol. The piggyback acknowledgment field of TCP is another optimization. Since some applications send data back and forth over the same connection, it seems wasteful to send a datagram containing only an acknowledgment. This piggyback field allows a datagram carrying data to also carry the acknowledgment for a previous transmission, “piggybacking” on top of a data transmission. The urgent pointer field of TCP gives the address within the datagram of an important byte, such as a break character. This pointer allows the application software to skip over data so that the user doesn’t have to wait for all prior data to be processed before seeing a character that tells the software to stop. The identifier field and fragment field of IP allow intermediary machines to break the original datagram into many smaller datagrams. A unique identifier is associated with the original datagram and placed in every fragment, with the fragment field saying which piece is which. The time-to-live field allows a datagram to be killed off after going through a maximum number of intermediate switches no matter where it is in the network. Knowing the maximum number of hops that it will take for a datagram to arrive—if it ever arrives—simplifies the protocol software. The protocol field identifies which possible upper layer protocol sent the IP datagram; in our case, it is TCP. The V (for version) and type fields allow different versions of the IP protocol software for the network. Explicit version numbering is included so that software can be upgraded gracefully machine by machine, without shutting down the entire network. Version 6 of the Internet Protocol (IPv6) is now widely used.
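The sliding-window idea described in the caption above can be sketched as a toy sender that never has more than a fixed number of unacknowledged packets outstanding. Real TCP counts its window in bytes and adapts the window size as the caption explains; this sketch, with invented packet counts, does neither.

# Toy sliding-window sender: at most 'window' unacknowledged packets in flight.

def run_sender(total_packets, window):
    base, next_seq, in_flight = 0, 0, []
    while base < total_packets:
        # Send new packets while the window allows.
        while next_seq < total_packets and next_seq < base + window:
            in_flight.append(next_seq)
            next_seq += 1
        # Simulate the acknowledgment of the oldest outstanding packet,
        # which slides the window forward by one.
        acked = in_flight.pop(0)
        base = acked + 1
    return next_seq

print(run_sender(total_packets=10, window=4))   # all 10 packets eventually sent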

The data sent down from the IP level to the Ethernet are sent in packets with the format shown in Figure F.30. Note that the TCP packet appears inside the data portion of the IP datagram, just as Figure F.37 suggests.
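As a concrete illustration, the following C sketch lays out the fields named in Figure F.38 for the fixed 20-byte headers. The layout is only illustrative: real protocol code must respect network byte order, pack the 4-bit version and header-length fields at the bit level (they are folded into single bytes here), and handle optional header extensions.

#include <stdint.h>

/* Sketch of the fixed 20-byte IP and TCP headers of Figure F.38. */
struct ip_header {
    uint8_t  version_and_length; /* V (4 bits) + header length L (4 bits)  */
    uint8_t  type;               /* type of service                        */
    uint16_t length;             /* length of the whole datagram, in bytes */
    uint16_t identifier;         /* identifies fragments of one datagram   */
    uint16_t fragment;           /* flags + fragment offset                */
    uint8_t  time;               /* time to live (maximum hop count)       */
    uint8_t  protocol;           /* upper-layer protocol, e.g., TCP        */
    uint16_t header_checksum;    /* protects the IP header only            */
    uint32_t source;             /* source address                         */
    uint32_t destination;        /* destination address                    */
};                               /* 20 bytes without options               */

struct tcp_header {
    uint16_t source;             /* source port                            */
    uint16_t destination;        /* destination port                       */
    uint32_t sequence;           /* sequence number, given in bytes        */
    uint32_t piggyback_ack;      /* acknowledgment carried with data       */
    uint16_t length_and_flags;   /* header length L (4 bits) + flags       */
    uint16_t window;             /* bytes allowed beyond the acked byte    */
    uint16_t checksum;           /* protects the TCP segment               */
    uint16_t urgent_pointer;     /* offset of an urgent byte, if any       */
};                               /* 20 bytes without options               */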

F.10 Crosscutting Issues for Interconnection Networks

This section describes five topics discussed in other chapters that are fundamentally impacted by interconnection networks, and vice versa.

Density-Optimized Processors versus SPEC-Optimized Processors

Given that people all over the world are accessing Web sites, it doesn’t really matter where servers are located. Hence, many servers are kept at collocation sites, which charge by network bandwidth reserved and used and by space occupied and power consumed. Desktop microprocessors in the past have been designed to be as fast as possible at whatever heat could be dissipated, with little regard for the size of the package and surrounding chips. In fact, some desktop microprocessors from Intel and AMD as recently as 2006 burned as much as 130 watts! Floor space efficiency was also largely ignored. As a result of these priorities, power is a major cost for collocation sites, and processor density is limited by the power consumed and dissipated, including within the interconnect!

With the proliferation of portable computers (notebook sales exceeded desktop sales for the first time in 2005) and their reduced power consumption and cooling demands, the opportunity exists for using this technology to create considerably denser computation. For instance, the power consumption for the Intel Pentium M in 2006 was 25 watts, yet it delivered performance close to that of a desktop microprocessor for a wide set of applications. It is therefore conceivable that performance per watt or performance per cubic foot could replace performance per microprocessor as the important figure of merit. The key is that many applications already make use of large clusters, so it is possible that replacing 64 power-hungry processors with, say, 256 power-efficient processors could be cheaper yet be software compatible. This places greater importance on power- and performance-efficient interconnection network design.

The Google cluster is a prime example of this migration to many “cooler” processors versus fewer “hotter” processors. It uses racks of up to 80 Intel Pentium III 1 GHz processors instead of more power-hungry high-end processors. Other examples include blade servers consisting of 1-inch-wide by 7-inch-high rack unit blades designed based on mobile processors. The HP ProLiant BL10e G2 blade server supports up to 20 1-GHz ultra-low-voltage Intel Pentium M processors with a 400-MHz front-side bus, 1-MB L2 cache, and up to 1 GB memory. The Fujitsu Primergy BX300 blade server supports up to 20 1.4- or 1.6-GHz Intel Pentium M processors, each with 512 MB of memory expandable to 4 GB.


Smart Switches versus Smart Interface Cards

Figure F.39 shows a trade-off as to where intelligence can be located within a network. Generally, the question is whether to have either smarter network interfaces or smarter switches. Making one smarter generally makes the other simpler and less expensive. By having an inexpensive interface, it was possible for Ethernet to become standard as part of most desktop and server computers. Lower-cost switches were made available for people with small configurations who do not need the sophisticated forwarding tables and spanning-tree protocols of larger Ethernet switches.

Myrinet followed the opposite approach. Its switches are dumb components that, other than implementing flow control and arbitration, simply extract the first byte from the packet header and use it to directly select the output port. No routing tables are implemented, so the intelligence is in the network interface cards (NICs). The NICs are responsible for providing support for efficient communication and for implementing a distributed protocol for network (re)configuration.

InfiniBand takes a hybrid approach by offering lower-cost, less sophisticated interface cards called target channel adapters (or TCAs) for less demanding devices such as disks—in the hope that it can be included within some I/O devices—and by offering more expensive, powerful interface cards for hosts called host channel adapters (or HCAs). The switches implement routing tables.

[Figure F.39 drawing: a spectrum of increasing intelligence, with small-scale Ethernet switches, large-scale Ethernet switches, Myrinet switches, and InfiniBand switches on the switch side, and Ethernet interfaces, InfiniBand target channel adapters, Myrinet interfaces, and InfiniBand host channel adapters on the interface card side.]
Figure F.39 Intelligence in a network: switch versus network interface card. Note that Ethernet switches come in two styles, depending on the size of the network, and that InfiniBand network interfaces come in two styles, depending on whether they are attached to a computer or to a storage device. Myrinet is a proprietary system area network.
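To make the “dumb switch, smart NIC” division of labor concrete, the following sketch (our own, with invented names, not Myrinet’s actual interface) shows a switch that simply consumes the leading route byte of each packet to select its output port; the NIC is responsible for prepending the full route before injection.

/* Illustrative sketch of a source-routed "dumb" switch: the route is a
   sequence of output-port bytes prepended by the sending NIC, and each
   switch strips the byte it consumes before forwarding the rest. */
typedef struct {
    unsigned char *bytes;   /* leading byte(s): route; remainder: payload */
    int            length;
} packet_t;

int switch_forward(packet_t *p) {
    int out_port = p->bytes[0];   /* first header byte selects the port   */
    p->bytes++;                   /* strip the consumed routing byte      */
    p->length--;
    return out_port;              /* the crossbar is set to this output   */
}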


Protection and User Access to the Network

A challenge is to ensure safe communication across a network without invoking the operating system in the common case. The Cray Research T3D supercomputer offers an interesting case study. Like the more recent Cray X1E, the T3D supports a global address space, so loads and stores can access memory across the network. Protection is ensured because each access is checked by the TLB. To support transfer of larger objects, a block transfer engine (BLT) was added to the hardware. Protection of access requires invoking the operating system before using the BLT to check the range of accesses to be sure there will be no protection violations.

Figure F.40 compares the bandwidth delivered as the size of the object varies for reads and writes. For very large reads (e.g., 512 KB), the BLT achieves the highest performance: 140 MB/sec. But simple loads get higher performance for 8 KB or less. For the write case, both achieve a peak of 90 MB/sec, presumably because of the limitations of the memory bus. But, for writes, the BLT can only match the performance of simple stores for transfers of 2 MB; anything smaller and it’s faster to send stores. Clearly, a BLT that can avoid invoking the operating system in the common case would be more useful.
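A simple first-order model (ours, not from the T3D measurements) shows why the operating-system call hurts small transfers: if each BLT transfer pays a fixed setup overhead $o_{\mathrm{OS}}$ for invoking the operating system and then moves $n$ bytes at a peak rate $B_{\mathrm{peak}}$, the delivered bandwidth is approximately

\[ \text{Delivered bandwidth}(n) \approx \frac{n}{o_{\mathrm{OS}} + n / B_{\mathrm{peak}}} \]

Only when $n$ is much larger than $o_{\mathrm{OS}} \times B_{\mathrm{peak}}$ does the BLT approach its peak, which is consistent with simple loads and stores winning for 8 KB or less. The value of $o_{\mathrm{OS}}$ is not reported in the figure, so the model is only illustrative.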

Efficient Interface to the Memory Hierarchy versus the Network

Traditional evaluations of processor performance, such as SPECint and SPECfp, encourage integration of the memory hierarchy with the processor as the efficiency of the memory hierarchy translates directly into processor performance.

[Figure F.40 drawing: bandwidth (MB/sec) versus transfer size (bytes), with curves for BLT read, CPU read, BLT write, and CPU write.]

Figure F.40 Bandwidth versus transfer size for simple memory access instructions versus a block transfer device on the Cray Research T3D. (From Arpaci et al. [1995].)


Hence, microprocessors have multiple levels of caches on chip along with buffers for writes. Because benchmarks such as SPECint and SPECfp do not reward good interfaces to interconnection networks, many machines make the access time to the network delayed by the full memory hierarchy. Writes must lumber their way through full write buffers, and reads must go through the cycles of first-, second-, and often third-level cache misses before reaching the interconnection network. This hierarchy results in newer systems having higher latencies to the interconnect than older machines.

Let’s compare three machines from the past: a 40-MHz SPARCstation-2, a 50-MHz SPARCstation-20 without an external cache, and a 50-MHz SPARCstation-20 with an external cache. According to SPECint95, this list is in order of increasing performance. The time to access the I/O bus (S-bus), however, increases in this sequence: 200 ns, 500 ns, and 1000 ns. The SPARCstation-2 is fastest because it has a single bus for memory and I/O, and there is only one level to the cache. The SPARCstation-20 memory access must first go over the memory bus (M-bus) and then to the I/O bus, adding 300 ns. Machines with a second-level cache pay an extra penalty of 500 ns before accessing the I/O bus.
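Summarizing the progression:

\[
\begin{aligned}
\text{SPARCstation-2 (single bus):} \quad & 200\ \text{ns} \\
\text{SPARCstation-20, no external cache:} \quad & 200 + 300\ (\text{M-bus to S-bus}) = 500\ \text{ns} \\
\text{SPARCstation-20, with external cache:} \quad & 500 + 500\ (\text{second-level cache}) = 1000\ \text{ns}
\end{aligned}
\]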

Compute-Optimized Processors versus Receiver Overhead

The overhead to receive a message likely involves an interrupt, which bears the cost of flushing and then restarting the processor pipeline, if not offloaded. As mentioned earlier, reading network status and receiving data from the network interface likely operate at cache miss speeds. If microprocessors become more superscalar and go to even faster clock rates, the number of missed instruction issue opportunities per message reception will likely rise to unacceptable levels.

F.11 Fallacies and Pitfalls

Myths and hazards are widespread with interconnection networks. This section mentions several warnings, so proceed carefully.

Fallacy

The interconnection network is very fast and does not need to be improved The interconnection network provides certain functionality to the system, very much like the memory and I/O subsystems. It should be designed to allow processors to execute instructions at the maximum rate. The interconnection network subsystem should provide high enough bandwidth to keep from continuously entering saturation and becoming an overall system bottleneck. In the 1980s, when wormhole switching was introduced, it became feasible to design large-diameter topologies with single-chip switches so that the bandwidth capacity of the network was not the limiting factor. This led to the flawed belief that interconnection networks need no further improvement.


Since the 1980s, much attention has been placed on improving processor performance, but comparatively less has been focused on interconnection networks. As technology advances, the interconnection network tends to represent an increasing fraction of system resources, cost, power consumption, and various other attributes that impact functionality and performance. Scaling the bandwidth simply by overdimensioning certain network parameters is no longer a cost-viable option. Designers must carefully consider the end-to-end interconnection network design in concert with the processor, memory, and I/O subsystems in order to achieve the required cost, power, functionality, and performance objectives of the entire system. An obvious case in point is multicore processors with on-chip networks.

Fallacy

Bisection bandwidth is an accurate cost constraint of a network Despite being very popular, bisection bandwidth has never been a practical constraint on the implementation of an interconnection network, although it may be one in future designs. It is more useful as a performance measure than as a cost measure. Chip pin-outs are the more realistic bandwidth constraint.

Pitfall

Using bandwidth (in particular, bisection bandwidth) as the only measure of network performance It seldom is the case that aggregate network bandwidth (likewise, network bisection bandwidth) is the end-to-end bottlenecking point across the network. Even if it were the case, networks are almost never 100% efficient in transporting packets across the bisection (i.e., ρ < 100%) nor at receiving them at network endpoints (i.e., σ < 100%). The former is highly dependent upon routing, switching, arbitration, and other such factors while both the former and the latter are highly dependent upon traffic characteristics. Ignoring these important factors and concentrating only on raw bandwidth can give very misleading performance predictions. For example, it is perfectly conceivable that a network could have higher aggregate bandwidth and/or bisection bandwidth relative to another network but also have lower measured performance!

Apparently, given sophisticated protocols like TCP/IP that maximize delivered bandwidth, many network companies believe that there is only one figure of merit for networks. This may be true for some applications, such as video streaming, where there is little interaction between the sender and the receiver. Many applications, however, are of a request-response nature, and so for every large message there must be one or more small messages. One example is NFS. Figure F.41 compares a shared 10-Mbit/sec Ethernet LAN to a switched 155-Mbit/sec ATM LAN for NFS traffic. Ethernet drivers were better tuned than the ATM drivers, such that 10-Mbit/sec Ethernet was faster than 155-Mbit/sec ATM for payloads of 512 bytes or less. Figure F.41 shows the overhead time, transmission time, and total time to send all the NFS messages over Ethernet and ATM. The peak link speed of ATM is 15 times faster, and the measured link speed for 8-KB messages is almost 9 times faster. Yet, the higher overheads offset the benefits so that ATM would transmit NFS traffic only 1.2 times faster.


Size    Number of     Overhead (sec)     Number of        Transmission (sec)   Total time (sec)
        messages      ATM     Ethernet   data bytes       ATM     Ethernet     ATM     Ethernet

32      771,060       532     389        33,817,052       4       48           536     436
64      56,923        39      29         4,101,088        0       5            40      34
96      4,082,014     2817    2057       428,346,316      46      475          2863    2532
128     5,574,092     3846    2809       779,600,736      83      822          3929    3631
160     328,439       227     166        54,860,484       6       56           232     222
192     16,313        11      8          3,316,416        0       3            12      12
224     4820          3       2          1,135,380        0       1            3       4
256     24,766        17      12         9,150,720        1       9            18      21
512     32,159        22      16         25,494,920       3       23           25      40
1024    69,834        48      35         70,578,564       8       72           56      108
1536    8842          6       4          15,762,180       2       14           8       19
2048    9170          6       5          20,621,760       2       19           8       23
2560    20,206        14      10         56,319,740       6       51           20      61
3072    13,549        9       7          43,184,992       4       39           14      46
3584    4200          3       2          16,152,228       2       14           5       17
4096    67,808        47      34         285,606,596      29      255          76      290
5120    6143          4       3          35,434,680       4       32           8       35
6144    5858          4       3          37,934,684       4       34           8       37
7168    4140          3       2          31,769,300       3       28           6       30
8192    287,577       198     145        2,390,688,480    245     2132         444     2277
Total   11,387,913    7858    5740       4,352,876,316    452     4132         8310    9872

Figure F.41 Total time on a 10-Mbit Ethernet and a 155-Mbit ATM, calculating the total overhead and transmission time separately. Note that the size of the headers needs to be added to the data bytes to calculate transmission time. The higher overhead of the software driver for ATM offsets the higher bandwidth of the network. These measurements were performed in 1994 using SPARCstation 10s, the Fore Systems SBA-200 ATM interface card, and the Fore Systems ASX-200 switch. (NFS measurements taken by Mike Dahlin of the University of California–Berkeley.)
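The roughly 1.2 times claim can be checked directly from the Total row of Figure F.41. The short calculation below (totals in seconds) contrasts the transmission-only speedup with the end-to-end speedup once software overhead is included:

#include <stdio.h>

/* Totals from the last row of Figure F.41, in seconds. */
int main(void) {
    double atm_overhead = 7858, atm_transmission = 452;
    double eth_overhead = 5740, eth_transmission = 4132;

    double atm_total = atm_overhead + atm_transmission;   /* 8310 s */
    double eth_total = eth_overhead + eth_transmission;   /* 9872 s */

    /* Link-level win vs. end-to-end win for this NFS trace. */
    printf("transmission-only speedup: %.1fx\n",
           eth_transmission / atm_transmission);          /* ~9.1x  */
    printf("end-to-end speedup:        %.2fx\n",
           eth_total / atm_total);                        /* ~1.19x */
    return 0;
}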

Pitfall

Not providing sufficient reception link bandwidth, which causes the network end nodes to become even more of a bottleneck to performance Unless the traffic pattern is a permutation, several packets will concurrently arrive at some destinations when most source devices inject traffic, thus producing contention. If this problem is not addressed, contention may turn into congestion that will spread across the network. This can be dealt with by analyzing traffic patterns and providing extra reception bandwidth. For example, it is possible to implement more reception bandwidth than injection bandwidth. The IBM Blue Gene/L, for example, implements an on-chip switch with 7-bit

injection and 12-bit reception links, where the reception bandwidth equals the aggregate switch input link bandwidth.

Pitfall

Using high-performance network interface cards but forgetting about the I/O subsystem that sits between the network interface and the host processor This issue is related to the previous one. Messages are usually composed in user space buffers and later sent by calling a send function from the communications library. Alternatively, a cache controller implementing a cache coherence protocol may compose a message in some SANs and in OCNs. In both cases, messages have to be copied to the network interface memory before transmission. If the I/O bandwidth is lower than the link bandwidth or introduces significant overhead, this is going to affect communication performance significantly. As an example, the first 10-Gigabit Ethernet cards in the market had a PCI-X bus interface for the system with a significantly lower bandwidth than 10 Gbps.
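As a rough check (the bus parameters are our assumption, not given in the text), a first-generation 64-bit, 133-MHz PCI-X interface peaks at about

\[ 64\ \text{bits} \times 133\ \text{MHz} \approx 8.5\ \text{Gbit/s} < 10\ \text{Gbit/s}, \]

already below the Ethernet link rate before any bus protocol overhead is subtracted.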

Fallacy

Zero-copy protocols do not require copying messages or fragments from one buffer to another Traditional communication protocols for computer networks allow access to communication devices only through system calls in supervisor mode. As a consequence of this, communication routines need to copy the corresponding message from the user buffer to a kernel buffer when sending a message. Note that the communication protocol may need to keep a copy of the message for retransmission in case of error, and the application may modify the contents of the user buffer once the system call returns control to the application. This buffer-to-buffer copy is eliminated in zero-copy protocols because the communication routines are executed in user space and protocols are much simpler. However, messages still need to be copied from the application buffer to the memory in the network interface card (NIC) so that the card hardware can transmit it from there through to the network. Although it is feasible to eliminate this copy by allocating application message buffers directly in the NIC memory (and, indeed, this is done in some protocols), this may not be convenient in current systems because access to the NIC memory is usually performed through the I/O subsystem, which usually is much slower than accessing main memory. Thus, it is generally more efficient to compose the message in main memory and let DMA devices take care of the transfer to the NIC memory. Moreover, what few people count is the copy from where the message fragments are computed (usually the ALU, with results stored in some processor register) to main memory. Some systolic-like architectures in the 1980s, like the iWarp, were able to directly transmit message fragments from the processor to the network, effectively eliminating all the message copies. This is the approach taken in the Cray X1E shared-memory multiprocessor supercomputer. Similar comments can be made regarding the reception side; however, this does not mean that zero-copy protocols are inefficient. These protocols represent the most efficient kind of implementation used in current systems.
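The distinction can be sketched in a few lines of C. The buffer sizes and helper below are hypothetical placeholders, not a real kernel or NIC API; the point is only where the copies occur.

#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for kernel and NIC memory. */
static char kernel_buffer[65536];
static char nic_memory[65536];
static void nic_dma_from(const void *src, size_t n) { memcpy(nic_memory, src, n); }

/* Traditional protocol: the send system call copies the user buffer into a
   kernel buffer (kept for possible retransmission), and the data is copied
   again into NIC memory over the I/O subsystem before transmission. */
static void send_traditional(const void *user_buf, size_t n) {
    memcpy(kernel_buffer, user_buf, n);     /* copy #1: user -> kernel */
    memcpy(nic_memory, kernel_buffer, n);   /* copy #2: kernel -> NIC  */
}

/* Zero-copy protocol: user-level routines compose the message in main
   memory and a DMA engine moves it to the NIC; the user-to-kernel copy
   disappears, but the memory-to-NIC transfer remains. */
static void send_zero_copy(const void *user_buf, size_t n) {
    nic_dma_from(user_buf, n);              /* single DMA-driven transfer */
}

int main(void) {
    char msg[] = "example message";
    send_traditional(msg, sizeof msg);
    send_zero_copy(msg, sizeof msg);
    return 0;
}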

Pitfall

Ignoring software overhead when determining performance Low software overhead requires cooperation with the operating system as well as with the communication libraries, but even with protocol offloading it continues to dominate the hardware overhead and must not be ignored. Figures F.32 and F.41 give two examples, one for a SAN standard and the other for a WAN standard. Other examples come from proprietary SANs for supercomputers. The Connection Machine CM-5 supercomputer in the early 1990s had a software overhead of 20 μs to send a message and a hardware overhead of only 0.5 μs. The first Intel Paragon supercomputer built in the early 1990s had a hardware overhead of just 0.2 μs, but the initial release of the software had an overhead of 250 μs. Later releases reduced this overhead down to 25 μs and, more recently, down to only a few microseconds, but this still dominates the hardware overhead. The IBM Blue Gene/L has an MPI sending/receiving overhead of approximately 3 μs, only a third of which (at most) is attributed to the hardware. This pitfall is simply Amdahl’s law applied to networks: Faster network hardware is superfluous if there is not a corresponding decrease in software overhead. The software overhead is much reduced these days with OS bypass, lightweight protocols, and protocol offloading down to a few microseconds or less, typically, but it remains a significant factor in determining performance.
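The CM-5 numbers above make the Amdahl’s law argument concrete. Total sending time is

\[ T_{\text{send}} = o_{\text{software}} + o_{\text{hardware}} = 20 + 0.5 = 20.5\ \mu\text{s}, \]

so even an infinitely fast hardware path could improve message-sending time by at most

\[ \frac{20.5}{20} \approx 1.03, \]

about 3%, unless the software overhead is reduced as well.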

Fallacy

MINs are more cost-effective than direct networks A MIN is usually implemented using significantly fewer switches than the number of devices that need to be connected. On the other hand, direct networks usually include a switch as an integral part of each node, thus requiring as many switches as nodes to interconnect. However, nothing prevents the implementation of nodes with multiple computing devices on them (e.g., a multicore processor with an on-chip switch) or with several devices attached to each switch (i.e., bristling). In these cases, a direct network may be as cost-effective as (or even more cost-effective than) a MIN. Note that, for a MIN, several network interfaces may be required at each node to match the bandwidth delivered by the multiple links per node provided by the direct network.

Fallacy

Low-dimensional direct networks achieve higher performance than high-dimensional networks such as hypercubes

This conclusion was drawn by several studies that analyzed the optimal number of dimensions under the main physical constraint of bisection bandwidth. However, most of those studies did not consider link pipelining, considered only very short links, and/or did not consider switch architecture design constraints. The misplaced assumption that bisection bandwidth serves as the main limit did not help matters. Nowadays, most researchers and designers believe that high-radix switches are more cost-effective than low-radix switches, including some who concluded the opposite before.

Fallacy

Wormhole switching achieves better performance than other switching techniques Wormhole switching delivers the same no-load latency as other pipelined switching techniques, like virtual cut-through switching. The introduction of wormhole switches in the late 1980s coinciding with a dramatic increase in network bandwidth led many to believe that wormhole switching was the main reason for the performance boost. Instead, most of the performance increase came from a drastic increase in link bandwidth, which, in turn, was enabled by the ability of wormhole switching to buffer packet fragments using on-chip buffers, instead of using the node’s main memory or some other off-chip source for that task. More recently, much larger on-chip buffers have become feasible, and virtual cut-through achieved the same no-load latency as wormhole while delivering much higher throughput. This did not mean that wormhole switching was dead. It continues to be the switching technique of choice for applications in which only small buffers should be used (e.g., perhaps for on-chip networks).

Fallacy

Implementing a few virtual channels always increases throughput by allowing packets to pass through blocked packets ahead In general, implementing a few virtual channels in a wormhole switch is a good idea because packets are likely to pass blocked packets ahead of them, thus reducing latency and significantly increasing throughput. However, the improvements are not as dramatic for virtual cut-through switches. In virtual cut-through, buffers should be large enough to store several packets. As a consequence, each virtual channel may introduce HOL blocking, possibly degrading performance at high loads. Adding virtual channels increases cost, but it may deliver little additional performance unless there are as many virtual channels as switch ports and packets are mapped to virtual channels according to their destination (i.e., virtual output queueing). It is certainly the case that virtual channels can be useful in virtual cut-through networks to segregate different traffic classes, which can be very beneficial. However, multiplexing the packets over a physical link on a flit-by-flit basis causes all the packets from different virtual channels to get delayed. The average packet delay is significantly shorter if multiplexing takes place on a packet-by-packet basis, but in this case packet size should be bounded to prevent any one packet from monopolizing the majority of link bandwidth.

Fallacy

Adaptive routing causes out-of-order packet delivery, thus introducing too much overhead needed to reorder packets at the destination device Adaptive routing allows packets to follow alternative paths through the network depending on network traffic; therefore, adaptive routing usually introduces out-of-order packet delivery. However, this does not necessarily imply that reordering packets at the destination device is going to introduce a large overhead, making adaptive routing not useful. For example, the most efficient adaptive routing algorithms to date support fully adaptive routing in some virtual channels but require
deterministic routing to be implemented in some other virtual channels in order to prevent deadlocks (à la the IBM Blue Gene/L). In this case, it is very easy to select between adaptive and deterministic routing for each individual packet. A single bit in the packet header can indicate to the switches whether all the virtual channels can be used or only those implementing deterministic routing. This hardware support can be used as indicated below to eliminate packet reordering overhead at the destination. Most communication protocols for parallel computers and clusters implement two different protocols depending on message size. For short messages, an eager protocol is used in which messages are directly transmitted, and the receiving nodes use some preallocated buffer to temporarily store the incoming message. On the other hand, for long messages, a rendezvous protocol is used. In this case, a control message is sent first, requesting the destination node to allocate a buffer large enough to store the entire message. The destination node confirms buffer allocation by returning an acknowledgment, and the sender can proceed with fragmenting the message into bounded-size packets, transmitting them to the destination. If eager messages use only deterministic routing, it is obvious that they do not introduce any reordering overhead at the destination. On the other hand, packets belonging to a long message can be transmitted using adaptive routing. As every packet contains the sequence number within the message (or the offset from the beginning of the message), the destination node can store every incoming packet directly in its correct location within the message buffer, thus incurring no overhead with respect to using deterministic routing. The only thing that differs is the completion condition. Instead of checking that the last packet in the message has arrived, it is now necessary to count the arrived packets, notifying the end of reception when the count equals the message size. Taking into account that long messages, even if not frequent, usually consume most of the network bandwidth, it is clear that most packets can benefit from adaptive routing without introducing reordering overhead when using the protocol described above. Fallacy
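A minimal sketch of the receive-side bookkeeping for the rendezvous (long-message) case follows; the structure and field names are illustrative, not any particular library’s API. Each packet is stored at its byte offset, so arrival order is irrelevant, and completion is detected by counting bytes rather than waiting for a “last” packet.

#include <string.h>

/* Receiver state for one in-flight rendezvous message. */
typedef struct {
    char   *buffer;          /* preallocated at rendezvous time */
    size_t  message_size;    /* total bytes expected            */
    size_t  bytes_received;  /* running count                   */
} recv_state_t;

/* Called for each arriving packet; returns 1 when the message is complete. */
int on_packet(recv_state_t *s, size_t offset, const char *data, size_t len) {
    memcpy(s->buffer + offset, data, len);   /* place packet by its offset  */
    s->bytes_received += len;                /* order of arrival is ignored */
    return s->bytes_received == s->message_size;
}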

Fallacy

Adaptive routing by itself always improves network fault tolerance because it allows packets to follow alternative paths Adaptive routing by itself is not enough to tolerate link and/or switch failures. Some mechanism is required to detect failures and report them, so that the routing logic can exclude faulty paths and use the remaining ones. Moreover, while a given link or switch failure affects a certain number of paths when using deterministic routing, many more source/destination pairs could be affected by the same failure when using adaptive routing. As a consequence of this, some switches implementing adaptive routing transition to deterministic routing in the presence of failures. In this case, failures are usually tolerated by sending messages through alternative paths from the source node. As an example, the Cray T3E implements direction-order routing to tolerate a few failures. This fault-tolerant routing technique avoids cycles in the use of resources by crossing directions in order
(e.g., X+, Y+, Z+, Z−, Y−, then X−). At the same time, it provides an easy way to send packets through nonminimal paths, if necessary, to avoid crossing faulty components. For instance, a packet can be initially forwarded a few hops in the X+ direction even if it has to go in the X− direction at some point later.

Pitfall

Trying to provide features only within the network versus end-to-end The concern is that of providing at a lower level the features that can only be accomplished at the highest level, thus only partially satisfying the communication demand. Saltzer, Reed, and Clark [1984] gave the end-to-end argument as follows: The function in question can completely and correctly be specified only with the knowledge and help of the application standing at the endpoints of the communication system. Therefore, providing that questioned function as a feature of the communication system itself is not possible. [page 278] Their example of the pitfall was a network at MIT that used several gateways, each of which added a checksum from one gateway to the next. The programmers of the application assumed that the checksum guaranteed accuracy, incorrectly believing that the message was protected while stored in the memory of each gateway. One gateway developed a transient failure that swapped one pair of bytes per million bytes transferred. Over time, the source code of one operating system was repeatedly passed through the gateway, thereby corrupting the code. The only solution was to correct infected source files by comparing them to paper listings and repairing code by hand! Had the checksums been calculated and checked by the application running on the end systems, safety would have been ensured. There is a useful role for intermediate checks at the link level, however, provided that end-to-end checking is available. End-to-end checking may show that something is broken between two nodes, but it doesn’t point to where the problem is. Intermediate checks can discover the broken component. A second issue regards performance using intermediate checks. Although it is sufficient to retransmit the whole in case of failures from the end point, it can be much faster to retransmit a portion of the message at an intermediate point rather than wait for a time-out and a full message retransmit at the end point.

Pitfall

Relying on TCP/IP for all networks, regardless of latency, bandwidth, or software requirements The network designers on the first workstations decided it would be elegant to use a single protocol stack no matter where the destination of the message: Across a room or across an ocean, the TCP/IP overhead must be paid. This might have been a wise decision back then, especially given the unreliability of early Ethernet hardware, but it sets a high software overhead barrier for commercial systems of today. Such an obstacle lowers the enthusiasm for low-latency network interface hardware and low-latency interconnection networks if the software is just going to waste hundreds of microseconds when the message must travel only dozens of meters or less. It also can use significant processor resources. One rough rule of

thumb is that each Mbit/sec of TCP/IP bandwidth needs about 1 MHz of processor speed, so a 1000-Mbit/sec link could saturate a processor with an 800- to 1000-MHz clock. The flip side is that, from a software perspective, TCP/IP is the most desirable target since it is the most connected and, hence, provides the largest number of opportunities. The downside of using software optimized to a particular LAN or SAN is that it is limited. For example, communication from a Java program depends on TCP/IP, so optimization for another protocol would require creation of glue software to interface Java to it. TCP/IP advocates point out that the protocol itself is theoretically not as burdensome as current implementations, but progress has been modest in commercial systems. There are also TCP/IP offloading engines in the market, with the hope of preserving the universal software model while reducing processor utilization and message latency. If processors continue to improve much faster than network speeds, or if multiple processors become ubiquitous, software TCP/IP may become less significant for processor utilization and message latency.

F.12 Concluding Remarks

Interconnection network design is one of the most exciting areas of computer architecture development today. With the advent of new multicore processor paradigms and advances in traditional multiprocessor/cluster systems and the Internet, many challenges and opportunities exist for interconnect architecture innovation. These apply to all levels of computer systems: communication between cores on a chip, between chips on a board, between boards in a system, and between computers in a machine room, over a local area and across the globe. Irrespective of their domain of application, interconnection networks should transfer the maximum amount of information within the least amount of time for given cost and power constraints so as not to bottleneck the system. Topology, routing, arbitration, switching, and flow control are among the key concepts in realizing such high-performance designs.

The design of interconnection networks is end-to-end: It includes injection links, reception links, and the interfaces at network end points as much as it does the topology, switches, and links within the network fabric. It is often the case that the bandwidth and overhead at the end node interfaces are the bottleneck, yet many mistakenly think of the interconnection network to mean only the network fabric. This is as bad as processor designers thinking of computer architecture to mean only the instruction set architecture or only the microarchitecture! End-to-end issues and understanding of the traffic characteristics make the design of interconnection networks challenging and very much relevant even today. For instance, the need for low end-to-end latency is driving the development of efficient network interfaces located closer to the processor/memory controller. We may soon see most multicore processors used in multiprocessor systems implementing network interfaces on-chip,
devoting some core(s) to execute communication tasks. This is already the case for the IBM Blue Gene/L supercomputer, which uses one of its two cores on each processor chip for this purpose. Networking has a long way to go from its humble shared-media beginnings. It is in “catch-up” mode, with switched-media point-to-point networks only recently displacing traditional bus-based networks in many networking domains, including on chip, I/O, and the local area. We are not near any performance plateaus, so we expect rapid advancement of WANs, LANs, SANs, and especially OCNs in the near future. Greater interconnection network performance is key to the information- and communication-centric vision of the future of our field, which, so far, has benefited many millions of people around the world in various ways. As the quotes at the beginning of this appendix suggest, this revolution in two-way communication is at the heart of changes in the form of our human associations and actions.

Acknowledgments

We express our sincere thanks to the following persons who, in some way, have contributed to the contents of the previous edition of the appendix: Lei Chai, Scott Clark, Jose Flich, Jose Manuel Garcia, Paco Gilabert, Rama Govindaraju, Manish Gupta, Wai Hong Ho, Siao Jer, Steven Keckler, Dhabaleswar (D.K.) Panda, Fabrizio Petrini, Steve Scott, Jeonghee Shin, Craig Stunkel, Sayantan Sur, Michael B. Taylor, and Bilal Zafar. We especially appreciate the new contributions of Jose Flich to this edition of the appendix.

F.13 Historical Perspective and References

This appendix has taken the perspective that interconnection networks for very different domains—from on-chip networks within a processor chip to wide area networks connecting computers across the globe—share many of the same concerns. With this, interconnection network concepts are presented in a unified way, irrespective of their application; however, their histories are vastly different, as evidenced by the different solutions adopted to address similar problems. The lack of significant interaction between research communities from the different domains certainly contributed to the diversity of implemented solutions. Highlighted below are relevant readings on each topic. In addition, good general texts featuring WAN and LAN networking have been written by Davie, Peterson, and Clark [1999] and by Kurose and Ross [2001]. Good texts focused on SANs for multiprocessors and clusters have been written by Duato, Yalamanchili, and Ni [2003] and by Dally and Towles [2004]. An informative chapter devoted to deadlock resolution in interconnection networks was written by Pinkston [2004]. Finally, an edited work by Jantsch and Tenhunen [2003] on OCNs for multicore processors and system-on-chips is also interesting reading.


Wide Area Networks

Wide area networks are the earliest of the data interconnection networks. The forerunner of the Internet is the ARPANET, which in 1969 connected computer science departments across the United States that had research grants funded by the Advanced Research Project Agency (ARPA), a U.S. government agency. It was originally envisioned as using reliable communications at lower levels. Practical experience with failures of the underlying technology led to the failure-tolerant TCP/IP, which is the basis for the Internet today. Vint Cerf and Robert Kahn are credited with developing the TCP/IP protocols in the mid-1970s, winning the ACM Software Award in recognition of that achievement. Kahn [1972] is an early reference on the ideas of ARPANET. For those interested in learning more about TCP/IP, Stevens [1994–1996] has written classic books on the topic.

In 1975, there were roughly 100 networks in the ARPANET; in 1983, only 200. In 1995, the Internet encompassed 50,000 networks worldwide, about half of which were in the United States. That number is hard to calculate now, but the number of IP hosts grew by a factor of 15 from 1995 to 2000, reaching 100 million Internet hosts by the end of 2000. It has grown much faster since then. With most service providers assigning dynamic IP addresses, many local area networks using private IP addresses, and with most networks allowing wireless connections, the total number of hosts in the Internet is nearly impossible to compute. In July 2005, the Internet Systems Consortium (www.isc.org) estimated more than 350 million Internet hosts, with an annual increase of about 25% projected. Although key government networks made the Internet possible (i.e., ARPANET and NSFNET), these networks have been taken over by the commercial sector, allowing the Internet to thrive. But major innovations to the Internet are still likely to come from government-sponsored research projects rather than from the commercial sector. The National Science Foundation’s Global Environment for Network Innovation (GENI) initiative is an example of this.

The most exciting application of the Internet is the World Wide Web, developed in 1989 by Tim Berners-Lee, a programmer at the European Center for Particle Research (CERN), for information access. In 1992, a young programmer at the University of Illinois, Marc Andreessen, developed a graphical interface for the Web called Mosaic. It became immensely popular. He later became a founder of Netscape, which popularized commercial browsers. In May 1995, at the time of the second edition of this book, there were over 30,000 Web pages, and the number was doubling every two months. During the writing of the third edition of this text, there were more than 1.3 billion Web pages. In December 2005, the number of Web servers approached 75 million, having increased by 30% during that same year.

Asynchronous Transfer Mode (ATM) was an attempt to design the definitive communication standard. It provided good support for data transmission as well as digital voice transmission (i.e., phone calls). From a technical point of view, it combined the best from packet switching and circuit switching, also providing excellent support for quality of service (QoS). Alles [1995] offers a good
survey on ATM. In 1995, no one doubted that ATM was going to be the future for this community. Ten years later, the high equipment and personnel training costs basically killed ATM, and we returned to the simplicity of TCP/IP. Another important blow to ATM was its defeat by the Ethernet family in the LAN domain, where packet switching achieved significantly lower latencies than ATM, which required establishing a connection before data transmission. ATM connectionless servers were later introduced in an attempt to fix this problem, but they were expensive and represented a central bottleneck in the LAN.

Finally, WANs today rely on optical fiber. Fiber technology has made so many advances that today WAN fiber bandwidth is often underutilized. The main reason for this is the commercial introduction of wavelength division multiplexing (WDM), which allows each fiber to transmit many data streams simultaneously over different wavelengths, thus allowing three orders of magnitude bandwidth increase in just one generation, that is, 3 to 5 years (a good text by Senior [1993] discusses optical fiber communications). However, IP routers may still become a bottleneck. At 10- to 40-Gbps link rates, and with thousands of ports in large core IP routers, packets must be processed very quickly—that is, within a few tens of nanoseconds. The most time-consuming operation is routing. The way IP addresses have been defined and assigned to Internet hosts makes routing very complicated, usually requiring a complex search in a tree structure for every packet. Network processors have become popular as a cost-effective solution for implementing routing and other packet-filtering operations. They usually are RISC-like and highly multithreaded and implement local stores instead of caches.

Local Area Networks

ARPA’s success with wide area networks led directly to the most popular local area networks. Many researchers at Xerox Palo Alto Research Center had been funded by ARPA while working at universities, so they all knew the value of networking. In 1974, this group invented the Alto, the forerunner of today’s desktop computers [Thacker et al. 1982], and the Ethernet [Metcalfe and Boggs 1976], today’s LAN. This group—David Boggs, Butler Lampson, Ed McCreight, Bob Sprowl, and Chuck Thacker—became luminaries in computer science and engineering, collecting a treasure chest of awards among them.

This first Ethernet provided a 3-Mbit/sec interconnection, which seemed like an unlimited amount of communication bandwidth with computers of that era. It relied on the interconnect technology developed for the cable television industry. Special microcode support gave a round-trip time of 50 μs for the Alto over Ethernet, which is still a respectable latency. It was Boggs’ experience as a ham radio operator that led to a design that did not need a central arbiter, but instead listened before use and then varied back-off times in case of conflicts.

The announcement by Digital Equipment Corporation, Intel, and Xerox of a standard for 10-Mbit/sec Ethernet was critical to the commercial success of
Ethernet. This announcement short-circuited a lengthy IEEE standards effort, which eventually did publish IEEE 802.3 as a standard for Ethernet. There have been several unsuccessful candidates that have tried to replace the Ethernet. The Fiber Data Distribution Interconnect (FDDI) committee, unfortunately, took a very long time to agree on the standard, and the resulting interfaces were expensive. It was also a shared medium when switches were becoming affordable. ATM also missed the opportunity in part because of the long time to standardize the LAN version of ATM, and in part because of the high latency and poor behavior of ATM connectionless servers, as mentioned above. InfiniBand for the reasons discussed below has also faltered. As a result, Ethernet continues to be the absolute leader in the LAN environment, and it remains a strong opponent in the high-performance computing market as well, competing against the SANs by delivering high bandwidth at low cost. The main drawback of Ethernet for high-end systems is its relatively high latency and lack of support in most interface cards to implement the necessary protocols. Because of failures of the past, LAN modernization efforts have been centered on extending Ethernet to lower-cost media such as unshielded twisted pair (UTP), switched interconnects, and higher link speeds as well as to new domains such as wireless communication. Practically all new PC motherboards and laptops implement a Fast/Gigabit Ethernet port (100/1000 Mbps), and most laptops implement a 54 Mbps Wireless Ethernet connection. Also, home wired or wireless LANs connecting all the home appliances, set-top boxes, desktops, and laptops to a shared Internet connection are very common. Spurgeon [2006] has provided a nice online summary of Ethernet technology, including some of its history.

System Area Networks

One of the first nonblocking multistage interconnection networks was proposed by Clos [1953] for use in telephone exchange offices. Building on this, many early inventions for system area networks came from their use in massively parallel processors (MPPs). One of the first MPPs was the Illiac IV, a SIMD array built in the early 1970s with 64 processing elements (“massive” at that time) interconnected using a topology based on a 2D torus that provided neighbor-to-neighbor communication. Another representative early MPP was the Cosmic Cube, which used Ethernet interface chips to connect 64 processors in a 6-cube. Communication between nonneighboring nodes was made possible by store-and-forwarding of packets at intermediate nodes toward their final destination. A much larger and truly “massive” MPP built in the mid-1980s was the Connection Machine, a SIMD multiprocessor consisting of 64 K 1-bit processing elements, which also used a hypercube with store-and-forwarding. Since these early MPP machines, interconnection networks have improved considerably.

In the 1970s through the 1990s, considerable research went into trying to optimize the topology and, later, the routing algorithm, switching, arbitration, and flow control techniques. Initially, research focused on maximizing performance with
little attention paid to implementation constraints or crosscutting issues. Many exotic topologies were proposed having very interesting properties, but most of them complicated the routing. Rising from the fray was the hypercube, a very popular network in the 1980s that has all but disappeared from MPPs since the 1990s. What contributed to this shift was a performance model by Dally [1990] that showed that if the implementation is wire limited, lower-dimensional topologies achieve better performance than higher-dimensional ones because of their wider links for a given wire budget. Many designers followed that trend assuming their designs to be wire limited, even though most implementations were (and still are) pin limited. Several supercomputers since the 1990s have implemented low-dimensional topologies, including the Intel Paragon, Cray T3D, Cray T3E, HP AlphaServer, Intel ASCI Red, and IBM Blue Gene/L.

Meanwhile, other designers followed a very different approach, implementing bidirectional MINs in order to reduce the number of required switches below the number of network nodes. The most popular bidirectional MIN was the fat tree topology, originally proposed by Leiserson [1985] and first used in the Connection Machine CM-5 supercomputer and, later, the IBM ASCI White and ASC Purple supercomputers. This indirect topology was also used in several European parallel computers based on the Transputer. The Quadrics network has inherited characteristics from some of those Transputer-based networks. Myrinet has also evolved significantly from its first version, with Myrinet 2000 incorporating the fat tree as its principal topology. Indeed, most current implementations of SANs, including Myrinet, InfiniBand, and Quadrics as well as future implementations such as PCI-Express Advanced Switching, are based on fat trees.

Although the topology is the most visible aspect of a network, other features also have a significant impact on performance. A seminal work that raised awareness of deadlock properties in computer systems was published by Holt [1972]. Early techniques for avoiding deadlock in store-and-forward networks were proposed by Merlin and Schweitzer [1980] and by Gunther [1981]. Pipelined switching techniques were first introduced by Kermani and Kleinrock [1979] (virtual cut-through) and improved upon by Dally and Seitz [1986] (wormhole), which significantly reduced low-load latency and the topology’s impact on message latency over previously proposed techniques. Wormhole switching was initially better than virtual cut-through largely because flow control could be implemented at a granularity smaller than a packet, allowing high-bandwidth links that were not as constrained by available switch memory bandwidth. Today, virtual cut-through is usually preferred over wormhole because it achieves higher throughput due to less HOL blocking effects and is enabled by current integration technology that allows the implementation of many packet buffers per link.

Tamir and Frazier [1992] laid the groundwork for virtual output queuing with the notion of dynamically allocated multiqueues. Around this same time, Dally [1992] contributed the concept of virtual channels, which was key to the development of more efficient deadlock-free routing algorithms and congestion-reducing flow control techniques for improved network throughput. Another highly relevant contribution to routing was a new theory proposed by Duato [1993] that allowed
the implementation of fully adaptive routing with just one “escape” virtual channel to avoid deadlock. Previous to this, the required number of virtual channels to avoid deadlock increased exponentially with the number of network dimensions. Pinkston and Warnakulasuriya [1997] went on to show that deadlock actually can occur very infrequently, giving credence to deadlock recovery routing approaches. Scott and Goodman [1994] were among the first to analyze the usefulness of pipelined channels for making link bandwidth independent of the time of flight. These and many other innovations have become quite popular, finding use in most high-performance interconnection networks, both past and present. The IBM Blue Gene/L, for example, implements virtual cut-through switching, four virtual channels per link, fully adaptive routing with one escape channel, and pipelined links.

MPPs represent a very small (and currently shrinking) fraction of the information technology market, giving way to bladed servers and clusters. In the United States, government programs such as the Advanced Simulation and Computing (ASC) program (formerly known as the Accelerated Strategic Computing Initiative, or ASCI) have promoted the design of those machines, resulting in a series of increasingly powerful one-of-a-kind MPPs costing $50 million to $100 million. These days, many are basically lower-cost clusters of symmetric multiprocessors (SMPs) (see Pfister [1998] and Sterling [2001] for two perspectives on clustering). In fact, in 2005, nearly 75% of the TOP500 supercomputers were clusters. Nevertheless, the design of each generation of MPPs and even clusters pushes interconnection network research forward to confront new problems arising due to sheer size and other scaling factors. For instance, source-based routing—the simplest form of routing—does not scale well to large systems. Likewise, fat trees require increasingly longer links as the network size increases, which led IBM Blue Gene/L designers to adopt a 3D torus network with distributed routing that can be implemented with bounded-length links.

Storage Area Networks

System area networks were originally designed for a single room or single floor (thus their distances are tens to hundreds of meters) and were for use in MPPs and clusters. In the intervening years, the acronym SAN has been co-opted to also mean storage area networks, whereby networking technology is used to connect storage devices to compute servers. Today, many refer to “storage” when they say SAN. The most widely used SAN example in 2006 was Fibre Channel (FC), which comes in many varieties, including various versions of Fibre Channel Arbitrated Loop (FC-AL) and Fibre Channel Switched (FC-SW). Not only are disk arrays attached to servers via FC links, but there are even some disks with FC links attached to switches so that storage area networks can enjoy the benefits of greater bandwidth and interconnectivity of switching.

In October 2000, the InfiniBand Trade Association announced the version 1.0 specification of InfiniBand [InfiniBand Trade Association 2001]. Led by Intel, HP, IBM, Sun, and other companies, it was targeted to the high-performance
computing market as a successor to the PCI bus by having point-to-point links and switches with its own set of protocols. Its characteristics are desirable potentially both for system area networks to connect clusters and for storage area networks to connect disk arrays to servers. Consequently, it has had strong competition from both fronts. On the storage area networking side, the chief competition for InfiniBand has been the rapidly improving Ethernet technology widely used in LANs. The Internet Engineering Task Force proposed a standard called iSCSI to send SCSI commands over IP networks [Satran et al. 2001]. Given the cost advantages of the higher-volume Ethernet switches and interface cards, Gigabit Ethernet dominates the low-end and medium range for this market. What’s more, the slow introduction of InfiniBand and its small market share delayed the development of chip sets incorporating native support for InfiniBand. Therefore, network interface cards had to be plugged into the PCI or PCI-X bus, thus never delivering on the promise of replacing the PCI bus. It was another I/O standard, PCI-Express, that finally replaced the PCI bus. Like InfiniBand, PCI-Express implements a switched network but with pointto-point serial links. To its credit, it maintains software compatibility with the PCI bus, drastically simplifying migration to the new I/O interface. Moreover, PCI-Express benefited significantly from mass market production and has found application in the desktop market for connecting one or more high-end graphics cards, making gamers very happy. Every PC motherboard now implements one or more 16x PCI-Express interfaces. PCI-Express absolutely dominates the I/O interface, but the current standard does not provide support for interprocessor communication. Yet another standard, Advanced Switching Interconnect (ASI), may emerge as a complementary technology to PCI-Express. ASI is compatible with PCI-Express, thus linking directly to current motherboards, but it also implements support for interprocessor communication as well as I/O. Its defenders believe that it will eventually replace both SANs and LANs with a unified network in the data center market, but ironically this was also said of InfiniBand. The interested reader is referred to Pinkston et al. [2003] for a detailed discussion on this. There is also a new disk interface standard called Serial Advanced Technology Attachment (SATA) that is replacing parallel Integrated Device Electronics (IDE) with serial signaling technology to allow for increased bandwidth. Most disks in the market use this new interface, but keep in mind that Fibre Channel is still alive and well. Indeed, most of the promises made by InfiniBand in the SAN market were satisfied by Fibre Channel first, thus increasing their share of the market. Some believe that Ethernet, PCI-Express, and SATA have the edge in the LAN, I/O interface, and disk interface areas, respectively. But the fate of the remaining storage area networking contenders depends on many factors. A wonderful characteristic of computer architecture is that such issues will not remain endless academic debates, unresolved as people rehash the same arguments repeatedly. Instead, the battle is fought in the marketplace, with well-funded and talented groups giving their best efforts at shaping the future. Moreover, constant changes to technology reward those who are either astute or lucky. The best combination of


technology and follow-through has often determined commercial success. Time will tell us who will win and who will lose, at least for the next round!

On-Chip Networks Relative to the other network domains, on-chip networks are in their infancy. As recently as the late 1990s, the traditional way of interconnecting devices such as caches, register files, ALUs, and other functional units within a chip was to use dedicated links aimed at minimizing latency or shared buses aimed at simplicity. But with subsequent increases in the volume of interconnected devices on a single chip, the length and delay of wires to cross a chip, and chip power consumption, it has become important to share on-chip interconnect bandwidth in a more structured way, giving rise to the notion of a network on-chip. Among the first to recognize this were Agarwal [Waingold et al. 1997] and Dally [Dally 1999; Dally and Towles 2001]. They and others argued that on-chip networks that route packets allow efficient sharing of burgeoning wire resources between many communication flows and also facilitate modularity to mitigate chip-crossing wire delay problems identified by Ho, Mai, and Horowitz [2001]. Switched on-chip networks were also viewed as providing better fault isolation and tolerance. Challenges in designing these networks were later described by Taylor et al. [2005], who also proposed a 5-tuple model for characterizing the delay of OCNs. A design process for OCNs that provides a complete synthesis flow was proposed by Bertozzi et al. [2005]. Following these early works, much research and development has gone into on-chip network design, making this a very hot area of microarchitecture activity. Multicore and tiled designs featuring on-chip networks have become very popular since the turn of the millennium. Pinkston and Shin [2005] provide a survey of on-chip networks used in early multicore/tiled systems. Most designs exploit the reduced wiring complexity of switched OCNs as the paths between cores/tiles can be precisely defined and optimized early in the design process, thus enabling improved power and performance characteristics. With typically tens of thousands of wires attached to the four edges of a core or tile as “pinouts,” wire resources can be traded off for improved network performance by having very wide channels over which data can be sent broadside (and possibly scaled up or down according to the power management technique), as opposed to serializing the data over fixed narrow channels. Rings, meshes, and crossbars are straightforward to implement in planar chip technology and routing is easily defined on them, so these were popular topological choices in early switched OCNs. It will be interesting to see if this trend continues in the future when several tens to hundreds of heterogeneous cores and tiles will likely be interconnected within a single chip, possibly using 3D integration technology. Considering that processor microarchitecture has evolved significantly from its early beginnings in response to application demands and technological advancements, we would expect to see vast architectural improvements to onchip networks as well.


References Agarwal, A., 1991. Limits on interconnection network performance. IEEE Trans. on Parallel and Distributed Systems 2 (4 (April)), 398–412. Alles, A., 1995. “ATM internetworking” (May). www.cisco.com/warp/public/614/12.html. Anderson, T.E., Culler, D.E., Patterson, D., 1995. A case for NOW (networks of workstations). IEEE Micro 15 (1 (February)), 54–64. Anjan, K.V., Pinkston, T.M., 1995. An efficient, fully-adaptive deadlock recovery scheme: Disha. In: Proc. 22nd Annual Int’l. Symposium on Computer Architecture, June 22–24, 1995. Santa Margherita Ligure, Italy. Arpaci, R.H., Culler, D.E., Krishnamurthy, A., Steinberg, S.G., Yelick, K., 1995. Empirical evaluation of the Cray-T3D: A compiler perspective. In: Proc. 22nd Annual Int’l. Symposium on Computer Architecture, June 22–24, 1995. Santa Margherita Ligure, Italy. Bell, G., Gray, J., 2001. Crays, Clusters and Centers. Microsoft Corporation, Redmond, Wash. MSRTR-2001-76. Benes, V.E., 1962. Rearrangeable three stage connecting networks. Bell Syst. Tech. J. 41, 1481–1492. Bertozzi, D., Jalabert, A., Murali, S., Tamhankar, R., Stergiou, S., Benini, L., De Micheli, G., 2005. NoC synthesis flow for customized domain specific multiprocessor systems-on-chip. IEEE Trans. on Parallel and Distributed Systems 16 (2 (February)), 113–130. Bhuyan, L.N., Agrawal, D.P., 1984. Generalized hypercube and hyperbus structures for a computer network. IEEE Trans. on Computers 32 (4 (April)), 322–333. Brewer, E.A., Kuszmaul, B.C., 1994. How to get good performance from the CM-5 data network. In: Proc. Eighth Int’l Parallel Processing Symposium, April 26–29, 1994. Cancun, Mexico. Clos, C., 1953. A study of non-blocking switching networks. Bell Systems Technical Journal 32 (March), 406–424. Dally, W.J., 1990. Performance analysis of k-ary n-cube interconnection networks. IEEE Trans. on Computers 39 (6 (June)), 775–785. Dally, W.J., 1992. Virtual channel flow control. IEEE Trans. on Parallel and Distributed Systems 3 (2 (March)), 194–205. Dally, W.J., 1999. Interconnect limited VLSI architecture. In: Proc. of the Int’l. Interconnect Technology Conference, May 24–26, 1999. San Francisco, Calif. Dally, W.J., Seitz, C.I., 1986. The torus routing chip. Distributed Computing 1 (4), 187–196. Dally, W.J., Towles, B., 2001. Route packets, not wires: On-chip interconnection networks. In: Proc. of the 38th Design Automation Conference, June 18–22, 2001. Las Vegas, Nev. Dally, W.J., Towles, B., 2004. Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers, San Francisco. Davie, B.S., Peterson, L.L., Clark, D., 1999. Computer Networks: A Systems Approach, second ed. Morgan Kaufmann Publishers, San Francisco. Duato, J., 1993. A new theory of deadlock-free adaptive routing in wormhole networks. IEEE Trans. on Parallel and Distributed Systems 4 (12 (December)), 1320–1331. Duato, J., Pinkston, T.M., 2001. A general theory for deadlock-free adaptive routing using a mixed set of resources. IEEE Trans. on Parallel and Distributed Systems 12 (12 (December)), 1219–1235. Duato, J., Yalamanchili, S., Ni, L., 2003. Interconnection Networks: An Engineering Approach. Morgan Kaufmann Publishers, San Francisco. 2nd printing. Duato, J., Johnson, I., Flich, J., Naven, F., Garcia, P., Nachiondo, T., 2005a. A new scalable and costeffective congestion management strategy for lossless multistage interconnection networks. In: Proc. 11th Int’l. Symposium on High Performance Computer Architecture, February 12–16, 2005 San Francisco. 
Duato, J., Lysne, O., Pang, R., Pinkston, T.M., 2005b. Part I: A theory for deadlock-free dynamic reconfiguration of interconnection networks. IEEE Trans. on Parallel and Distributed Systems 16 (5 (May)), 412–427. Flich, J., Bertozzi, D., 2010. Designing Network-on-Chip Architectures in the Nanoscale Era. CRC Press, Boca Raton, FL. Glass, C.J., Ni, L.M., 1992. The Turn Model for adaptive routing. In: Proc. 19th Int’l. Symposium on Computer Architecture. May, Gold Coast, Australia. Gunther, K.D., 1981. Prevention of deadlocks in packet-switched data transport systems. IEEE Trans. on Communications, 512–524. COM–29:4 (April). Ho, R., Mai, K.W., Horowitz, M.A., 2001. The future of wires. In: Proc. of the IEEE 89:4 (April), pp. 490–504. Holt, R.C., 1972. Some deadlock properties of computer systems. ACM Computer Surveys 4 (3 (September)), 179–196.


Hoskote, Y., Vangal, S., Singh, A., Borkar, N., Borkar, S., 2007. A 5-ghz mesh interconnect for a teraflops processor. IEEE Micro 27 (5), 51–61. Howard, J., Dighe, S., Hoskote, Y., Vangal, S., Finan, S., Ruhl, G., Jenkins, D., Wilson, H., Borka, N., Schrom, G., Pailet, F., Jain, S., Jacob, T., Yada, S., Marella, S., Salihundam, P., Erraguntla, V., Konow, M., Riepen, M., Droege, G., Lindemann, J., Gries, M., Apel, T., Henriss, K., LundLarsen, T., Steibl, S., Borkar, S., De, V., Van Der Wijngaart, R., Mattson, T., 2010. A 48-core IA-32 message-passing processor with DVFS in 45 nm CMOS. In: IEEE International Solid-State Circuits Conference Digest of Technical Papers, pp. 58–59. InfiniBand Trade Association, 2001. InfiniBand Architecture Specifications Release 1.0.a. www. infinibandta.org. Jantsch, A., Tenhunen, H. (Eds.), 2003. Networks on Chips. Kluwer Academic Publishers, The Netherlands. Kahn, R.E., 1972. Resource-sharing computer communication networks. In: Proc. IEEE 60:11 (November), pp. 1397–1407. Kermani, P., Kleinrock, L., 1979. Virtual cut-through: A new computer communication switching technique. Computer Networks 3 (January), 267–286. Kurose, J.F., Ross, K.W., 2001. Computer Networking: A Top-Down Approach Featuring the Internet. Addison-Wesley, Boston. Leiserson, C.E., 1985. Fat trees: Universal networks for hardware-efficient supercomputing. IEEE Trans. on Computers, 892–901. C–34:10 (October). Merlin, P.M., Schweitzer, P.J., 1980. Deadlock avoidance in store-and-forward networks. I. Store-andforward deadlock. IEEE Trans. on Communications, 345–354. COM–28:3 (March). Metcalfe, R.M., 1993. Computer/network interface design: Lessons from Arpanet and Ethernet. IEEE J. on Selected Areas in Communications 11 (2 (February)), 173–180. Metcalfe, R.M., Boggs, D.R., 1976. Ethernet: Distributed packet switching for local computer networks. Comm. ACM 19 (7 (July)), 395–404. Partridge, C., 1994. Gigabit Networking. Addison-Wesley, Reading, Mass. Peh, L.S., Dally, W.J., 2001. A delay model and speculative architecture for pipelined routers. In: Proc. 7th Int’l. Symposium on High Performance Computer Architecture, January 20–24, 2001. Monterrey, Mexico. Pfister, G.F., 1998. In Search of Clusters, second ed. Prentice Hall, Upper Saddle River, N.J. Pinkston, T.M., 2004. Deadlock characterization and resolution in interconnection networks. In: Zhu, M.C., Fanti, M.P. (Eds.), Deadlock Resolution in Computer-Integrated Systems. CRC Press, Boca Raton, Fl, pp. 445–492. Pinkston, T.M., Shin, J., 2005. Trends toward on-chip networked microsystems. Int’l. J. of High Performance Computing and Networking 3 (1), 3–18. Pinkston, T.M., Warnakulasuriya, S., 1997. On deadlocks in interconnection networks. In: Proc. 24th Int’l. Symposium on Computer Architecture, June 2–4, 1997. Denver, Colo. Pinkston, T.M., Benner, A., Krause, M., Robinson, I., Sterling, T., 2003. InfiniBand: The ‘de facto’ future standard for system and local area networks or just a scalable replacement for PCI buses?” Special Issue on Communication Architecture for Clusters 6:2 (April). Cluster Computing, 95–104. Puente, V., Beivide, R., Gregorio, J.A., Prellezo, J.M., Duato, J., Izu, C., 1999. Adaptive bubble router: A design to improve performance in torus networks. In: Proc. 28th Int’l. Conference on Parallel Processing, September 21–24, 1999. Aizu-Wakamatsu, Japan. Rodrigo, S., Flich, J., Duato, J., Hummel, M., 2008. Efficient unicast and multicast support for CMPs. In: Proc. 
41st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-41), November 8–12, 2008. Lake Como, Italy, pp. 364–375. Saltzer, J.H., Reed, D.P., Clark, D.D., 1984. End-to-end arguments in system design. ACM Trans. on Computer Systems 2 (4 (November)), 277–288. Satran, J., Smith, D., Meth, K., Sapuntzakis, C., Wakeley, M., Von Stamwitz, P., Haagens, R., Zeidner, E., Dalle Ore, L., Klein, Y., 2001. “iSCSI”, IPS working group of IETF, Internet draft. www.ietf.org/internet-drafts/draft-ietf-ips-iscsi-07.txt. Scott, S.L., Goodman, J., 1994. The impact of pipelined channels on k-ary n-cube networks. IEEE Trans. on Parallel and Distributed Systems 5 (1 (January)), 1–16. Senior, J.M., 1993. Optical Fiber Communications: Principles and Practice, second ed. Prentice Hall, Hertfordshire, U.K. Spurgeon, C., 2006. Charles Spurgeon’s Ethernet Web Site. www.ethermanage.com/ethernet/ethernet.html.


Sterling, T., 2001. Beowulf PC Cluster Computing with Windows and Beowulf PC Cluster Computing with Linux. MIT Press, Cambridge, Mass. Stevens, W.R., 1994–1996. TCP/IP Illustrated (three volumes). Addison-Wesley, Reading, Mass. Tamir, Y., Frazier, G., 1992. Dynamically-allocated multi-queue buffers for VLSI communication switches. IEEE Trans. on Computers 41 (6 (June)), 725–734. Tanenbaum, A.S., 1988. Computer Networks, second ed. Prentice Hall, Englewood Cliffs, N.J. Taylor, M.B., Lee, W., Amarasinghe, S.P., Agarwal, A., 2005. Scalar operand networks. IEEE Trans. on Parallel and Distributed Systems 16 (2 (February)), 145–162. Thacker, C.P., McCreight, E.M., Lampson, B.W., Sproull, R.F., Boggs, D.R., 1982. Alto: A personal computer. In: Siewiorek, D.P., Bell, C.G., Newell, A. (Eds.), Computer Structures: Principles and Examples. McGraw-Hill, New York, pp. 549–572. TILE-GX, http://www.tilera.com/sites/default/files/productbriefs/PB025_TILE-Gx_Processor_A_v3. pdf. Vaidya, A.S., Sivasubramaniam, A., Das, C.R., 1997. Performance benefits of virtual channels and adaptive routing: An application-driven study. In: Proc. 11th ACM Int’l Conference on Supercomputing, July 7–11, 1997. Vienna, Austria. Van Leeuwen, J., Tan, R.B., 1987. Interval Routing. The Computer Journal 30 (4), 298–307. von Eicken, T., Culler, D.E., Goldstein, S.C., Schauser, K.E., 1992. Active messages: A mechanism for integrated communication and computation. In: Proc. 19th Annual Int’l. Symposium on Computer Architecture, May 19–21, 1992. Gold Coast, Australia. Waingold, E., Taylor, M., Srikrishna, D., Sarkar, V., Lee, W., Lee, V., Kim, J., Frank, M., Finch, P., Barua, R., Babb, J., Amarasinghe, S., Agarwal, A., 1997. Baring it all to software: Raw Machines. IEEE Computer 30 (September), 86–93. Yang, Y., Mason, G., 1991. Nonblocking broadcast switching networks. IEEE Trans. on Computers 40 (9 (September)), 1005–1015.

Exercises

Solutions to “starred” exercises are available for instructors who register at textbooks.elsevier.com.

✪ F.1

[15] < F.2, F.3 > Is electronic communication always faster than nonelectronic means for longer distances? Calculate the time to send 1000 GB using 25 8-mm tapes and an overnight delivery service versus sending 1000 GB by FTP over the Internet. Make the following four assumptions:
■ The tapes are picked up at 4 P.M. Pacific time and delivered 4200 km away at 10 A.M. Eastern time (7 A.M. Pacific time).
■ On one route the slowest link is a T3 line, which transfers at 45 Mbits/sec.
■ On another route the slowest link is a 100-Mbit/sec Ethernet.
■ You can use 50% of the slowest link between the two sites.
Will all the bytes sent by either Internet route arrive before the overnight delivery person arrives?



✪ F.2

[10] < F.2, F.3 > For the same assumptions as Exercise F.1, what is the bandwidth of overnight delivery for a 1000-GB package?

✪ F.3

[10] < F.2, F.3 > For the same assumptions as Exercise F.1, what is the minimum bandwidth of the slowest link to beat overnight delivery? What standard network options match that speed?




✪ F.4

[15] < F.2, F.3 > The original Ethernet standard was for 10 Mbits/sec and a maximum distance of 2.5 km. How many bytes could be in flight in the original Ethernet? Assume you can use 90% of the peak bandwidth.

✪ F.5

[15] < F.2, F.3 > Flow control is a problem for WANs due to the long time of flight, as the example on page F-14 illustrates. Ethernet did not include flow control when it was first standardized at 10 Mbits/sec. Calculate the number of bytes in flight for a 10-Gbit/sec Ethernet over a 100 meter link, assuming you can use 90% of peak bandwidth. What does your answer mean for network designers?

✪ F.6

[15] < F.2, F.3 > Assume the total overhead to send a zero-length data packet on an Ethernet is 100 μs and that an unloaded network can transmit at 90% of the peak 1000-Mbit/sec rating. For the purposes of this question, assume that the size of the Ethernet header and trailer is 56 bytes. Assume a continuous stream of packets of the same size. Plot the delivered bandwidth of user data in Mbits/sec as the payload data size varies from 32 bytes to the maximum size of 1500 bytes in 32-byte increments.

✪ F.7

[10] < F.2, F.3 > Exercise F.6 suggests that the delivered Ethernet bandwidth to a single user may be disappointing. Making the same assumptions as in that exercise, by how much would the maximum payload size have to be increased to deliver half of the peak bandwidth?

✪ F.8

[10] < F.2, F.3 > One reason that ATM has a fixed transfer size is that when a short message is behind a long message, a node may need to wait for an entire transfer to complete. For applications that are time sensitive, such as when transmitting voice or video, the large transfer size may result in transmission delays that are too long for the application. On an unloaded interconnection, what is the worst-case delay in microseconds if a node must wait for one full-size Ethernet packet versus an ATM transfer? See Figure F.30 (page F-78) to find the packet sizes. For this question assume that you can transmit at 100% of the 622-Mbits/sec ATM network and 100% of the 1000-Mbit/sec Ethernet.

✪ F.9

[10] < F.2, F.3 > Exercise F.7 suggests the need for expanding the maximum payload to increase the delivered bandwidth, but Exercise F.8 suggests the impact on worst-case latency of making it longer. What would be the impact on latency of increasing the maximum payload size by the answer to Exercise F.7?

✪ F.10

[12/12/20] < F.4 > The Omega network shown in Figure F.11 on page F-31 consists of three columns of four switches, each with two inputs and two outputs. Each switch can be set to straight, which connects the upper switch input to the upper switch output and the lower input to the lower output, and to exchange, which connects the upper input to the lower output and vice versa for the lower input. For each column of switches, label the inputs and outputs 0, 1, …, 7 from top to bottom, to correspond with the numbering of the processors.


a. [12] < F.4 > When a switch is set to exchange and a message passes through, what is the relationship between the label values for the switch input and output used by the message? (Hint: Think in terms of operations on the digits of the binary representation of the label number.)
b. [12] < F.4 > Between any two switches in adjacent columns that are connected by a link, what is the relationship between the label of the output connected to the input?
c. [20] < F.4 > Based on your results in parts (a) and (b), design and describe a simple routing scheme for distributed control of the Omega network. A message will carry a routing tag computed by the sending processor. Describe how the processor computes the tag and how each switch can set itself by examining a bit of the routing tag.
✪ F.11

[12/12/12/12/12/12] < F.4 > Prove whether or not it is possible to realize the following permutations (i.e., communication patterns) on the eight-node Omega network shown in Figure F.11 on page F-31:
a. [12] < F.4 > Bit-reversal permutation—the node with binary coordinates a_{n-1}, a_{n-2}, …, a_1, a_0 communicates with the node a_0, a_1, …, a_{n-2}, a_{n-1}.
b. [12] < F.4 > Perfect shuffle permutation—the node with binary coordinates a_{n-1}, a_{n-2}, …, a_1, a_0 communicates with the node a_{n-2}, a_{n-3}, …, a_0, a_{n-1} (i.e., rotate left 1 bit).
c. [12] < F.4 > Bit-complement permutation—the node with binary coordinates a_{n-1}, a_{n-2}, …, a_1, a_0 communicates with the node ā_{n-1}, ā_{n-2}, …, ā_1, ā_0 (i.e., complement each bit).
d. [12] < F.4 > Butterfly permutation—the node with binary coordinates a_{n-1}, a_{n-2}, …, a_1, a_0 communicates with the node a_0, a_{n-2}, …, a_1, a_{n-1} (i.e., swap the most and least significant bits).
e. [12] < F.4 > Matrix transpose permutation—the node with binary coordinates a_{n-1}, a_{n-2}, …, a_1, a_0 communicates with the node a_{n/2-1}, …, a_0, a_{n-1}, …, a_{n/2} (i.e., transpose the bits in positions approximately halfway around).
f. [12] < F.4 > Barrel-shift permutation—node i communicates with node i + 1 modulo N − 1, where N is the total number of nodes and 0 ≤ i.

✪ F.12

[12] < F.4 > Design a network topology using 18-port crossbar switches that has the minimum number of switches to connect 64 nodes. Each switch port supports communication to and from one device.

✪ F.13

[15] < F.4 > Design a network topology that has the minimum latency through the switches for 64 nodes using 18-port crossbar switches. Assume unit delay in the switches and zero delay for wires.

✪ F.14

[15] < F.4 > Design a switch topology that balances the bandwidth required for all links for 64 nodes using 18-port crossbar switches. Assume a uniform traffic pattern.




✪ F.15

[15] < F.4 > Compare the interconnection latency of a crossbar, Omega network, and fat tree with eight nodes. Use Figure F.11 on page F-31, Figure F.12 on page F-33, and Figure F.14 on page F-37. Assume that the fat tree is built entirely from two-input, two-output switches so that its hardware resources are more comparable to that of the Omega network. Assume that each switch costs a unit time delay. Assume that the fat tree randomly picks a path, so give the best case and worst case for each example. How long will it take to send a message from node 0 to node 6? How long will it take node 1 and node 7 to communicate?

✪ F.16

[15] < F.4 > Draw the topology of a 6-cube after the same manner of the 4-cube in Figure F.14 on page F-37. What is the maximum and average number of hops needed by packets assuming a uniform distribution of packet destinations?

✪ F.17

[15] < F.4 > Complete a table similar to Figure F.15 on page F-40 that captures the performance and cost of various network topologies, but do it for the general case of N nodes using k × k switches instead of the specific case of 64 nodes.

✪ F.18

[20] < F.4 > Repeat the example given on page F-41, but use the bit-complement communication pattern given in Exercise F.11 instead of NEWS communication.

✪ F.19

[15] < F.5 > Give the four specific conditions necessary for deadlock to exist in an interconnection network. Which of these are removed by dimension-order routing? Which of these are removed in adaptive routing with the use of “escape” routing paths? Which of these are removed in adaptive routing with the technique of deadlock recovery (regressive or progressive)? Explain your answer.

✪ F.20

[12/12/12/12] < F.5 > Prove whether or not the following routing algorithms based on prohibiting dimensional turns are suitable to be used as escape paths for 2D meshes by analyzing whether they are both connected and deadlock-free. Explain your answer. (Hint: You may wish to refer to the Turn Model algorithm and/or to prove your answer by drawing a directed graph for a 4 × 4 mesh that depicts dependencies between channels and verifying the channel dependency graph is free of cycles.) The routing algorithms are expressed with the following abbreviations: W = west, E = east, N = north, and S = south.
a. [12] < F.5 > Allowed turns are from W to N, E to N, S to W, and S to E.
b. [12] < F.5 > Allowed turns are from W to S, E to S, N to E, and S to E.
c. [12] < F.5 > Allowed turns are from W to S, E to S, N to W, S to E, W to N, and S to W.
d. [12] < F.5 > Allowed turns are from S to E, E to S, S to W, N to W, N to E, and E to N.

✪ F.21

[15] < F.5 > Compute and compare the upper bound for the efficiency factor, ρ, for dimension-order routing and up*/down* routing assuming uniformly distributed traffic on a 64-node 2D mesh network. For up*/down* routing, assume optimal placement of the root node (i.e., a node near the middle of the mesh). (Hint: You will have to find the loading of links across the network bisection that carries the global load as determined by the routing algorithm.)


✪ F.22

[15] < F.5 > For the same assumptions as Exercise F.21, find the efficiency factor for up*/down* routing on a 64-node fat tree network using 4 × 4 switches. Compare this result with the ρ found for up*/down* routing on a 2D mesh. Explain.

✪ F.23

[15] < F.5 > Calculate the probability of matching two-phased arbitration requests from all k input ports of a switch simultaneously to the k output ports assuming a uniform distribution of requests and grants to/from output ports. How does this compare to the matching probability for three-phased arbitration in which each of the k input ports can make two simultaneous requests (again, assuming a uniform random distribution of requests and grants)?

✪ F.24

[15]< F.5 > The equation on page F-52 shows the value of cut-through switching. Ethernet switches used to build clusters often do not support cut-through switching. Compare the time to transfer 1500 bytes over a 1000-Mbit/sec Ethernet with and without cut-through switching for a 64-node cluster. Assume that each Ethernet switch takes 1.0 μs and that a message goes through seven intermediate switches.

✪ F.25

[15] < F.5 > Making the same assumptions as in Exercise F.24, what is the difference between cut-through and store-and-forward switching for 32 bytes?

✪ F.26

[15] < F.5 > One way to reduce latency is to use larger switches. Unlike Exercise F.24, let’s assume we need only three intermediate switches to connect any two nodes in the cluster. Make the same assumptions as in Exercise F.24 for the remaining parameters. What is the difference between cut-through and store-and-forward for 1500 bytes? For 32 bytes?

✪ F.27

[20] < F.5 > Using FlexSim 1.2 (http://ceng.usc.edu/smart/FlexSim/flexsim.html) or some other cycle-accurate network simulator, simulate a 256-node 2D torus network assuming wormhole routing, 32-flit packets, uniform (random) communication pattern, and four virtual channels. Compare the performance of deterministic routing using DOR, adaptive routing using escape paths (i.e., Duato’s Protocol), and true fully adaptive routing using progressive deadlock recovery (i.e., Disha routing). Do so by plotting latency versus applied load and throughput versus applied load for each, as is done in Figure F.19 for the example on page F-53. Also run simulations and plot results for two and eight virtual channels for each. Compare and explain your results by addressing how/why the number and use of virtual channels by the various routing algorithms affect network performance. (Hint: Be sure to let the simulation reach steady state by allowing a warm-up period of several thousand network cycles before gathering results.)

✪ F.28

[20] < F.5 > Repeat Exercise F.27 using bit-reversal communication instead of the uniform random communication pattern. Compare and explain your results by addressing how/why the communication pattern affects network performance.

✪ F.29

[40] < F.5 > Repeat Exercises F.27 and F.28 using 16-flit packets and 128-flit packets. Compare and explain your results by addressing how/why the packet size along with the other design parameters affect network performance.

F.30

[20] < F.2, F.4, F.5, F.8 > Figures F.7, F.16, and F.20 show interconnection network characteristics of several of the top 500 supercomputers by machine type


as of the publication of the fourth edition. Update that figure to the most recent top 500. How have the systems and their networks changed since the data in the original figure? Do similar comparisons for OCNs used in microprocessors and SANs targeted for clusters using Figures F.29 and F.31. ✪ F.31

[12/12/12/15/15/18] < F.8 > Use the M/M/1 queuing model to answer this exercise. Measurements of a network bridge show that packets arrive at 200 packets per second and that the gateway forwards them in about 2 ms.
a. [12] < F.8 > What is the utilization of the gateway?
b. [12] < F.8 > What is the mean number of packets in the gateway?
c. [12] < F.8 > What is the mean time spent in the gateway?
d. [15] < F.8 > Plot response time versus utilization as you vary the arrival rate.
e. [15] < F.8 > For an M/M/1 queue, the probability of finding n or more tasks in the system is Utilization^n. What is the chance of an overflow of the FIFO if it can hold 10 messages?
f. [18] < F.8 > How big must the gateway be to have packet loss due to FIFO overflow less than one packet per million?

✪ F.32

[20] < F.8 > The imbalance between the time of sending and receiving can cause problems in network performance. Sending too fast can cause the network to back up and increase the latency of messages, since the receivers will not be able to pull out the message fast enough. A technique called bandwidth matching proposes a simple solution: Slow down the sender so that it matches the performance of the receiver [Brewer and Kuszmaul 1994]. If two machines exchange an equal number of messages using a protocol like UDP, one will get ahead of the other, causing it to send all its messages first. After the receiver puts all these messages away, it will then send its messages. Estimate the performance for this case versus a bandwidth-matched case. Assume that the send overhead is 200 μs, the receive overhead is 300 μs, time of flight is 5 μs, latency is 10 μs, and that the two machines want to exchange 100 messages.

F.33

[40] < F.8 > Compare the performance of UDP with and without bandwidth matching by slowing down the UDP send code to match the receive code as advised by bandwidth matching [Brewer and Kuszmaul 1994]. Devise an experiment to see how much performance changes as a result. How should you change the send rate when two nodes send to the same destination? What if one sender sends to two destinations?

✪ F.34

[40] < F.6, F.8 > If you have access to an SMP and a cluster, write a program to measure latency of communication and bandwidth of communication between processors, as was plotted in Figure F.32 on page F-80.

F.35

[20/20/20] < F.9 > If you have access to a UNIX system, use ping to explore the Internet. First read the manual page. Then use ping without option flags to be sure you can reach the following sites. It should say that X is alive. Depending on your system, you may be able to see the path by setting the flags to verbose mode


(-v) and trace route mode (-R) to see the path between your machine and the example machine. Alternatively, you may need to use the program traceroute to see the path. If so, try its manual page. You may want to use the UNIX command script to make a record of your session.
a. [20] < F.9 > Trace the route to another machine on the same local area network. What is the latency?
b. [20] < F.9 > Trace the route to another machine on your campus that is not on the same local area network. What is the latency?
c. [20] < F.9 > Trace the route to another machine off campus. For example, if you have a friend you send email to, try tracing that route. See if you can discover what types of networks are used along that route. What is the latency?
F.36

[15] < F.9 > Use FTP to transfer a file from a remote site and then between local sites on the same LAN. What is the difference in bandwidth for each transfer? Try the transfer at different times of day or days of the week. Is the WAN or LAN the bottleneck?

✪ F.37

[10/10] < F.9, F.11 > Figure F.41 on page F-93 compares latencies for a high-bandwidth network with high overhead and a low-bandwidth network with low overhead for different TCP/IP message sizes.
a. [10] < F.9, F.11 > For what message sizes is the delivered bandwidth higher for the high-bandwidth network?
b. [10] < F.9, F.11 > For your answer to part (a), what is the delivered bandwidth for each network?

✪ F.38

[15] < F.9, F.11 > Using the statistics in Figure F.41 on page F-93, estimate the per-message overhead for each network.

✪ F.39

[15] < F.9, F.11 > Exercise F.37 calculates which message sizes are faster for two networks with different overhead and peak bandwidth. Using the statistics in Figure F.41 on page F-93, what is the percentage of messages that are transmitted more quickly on the network with low overhead and bandwidth? What is the percentage of data transmitted more quickly on the network with high overhead and bandwidth?

✪ F.40

[15] < F.9, F.11 > One interesting measure of the latency and bandwidth of an interconnection is to calculate the size of a message needed to achieve one-half of the peak bandwidth. This halfway point is sometimes referred to as n_{1/2}, taken from the terminology of vector processing. Using Figure F.41 on page F-93, estimate n_{1/2} for a TCP/IP message using 155-Mbit/sec ATM and 10-Mbit/sec Ethernet.

F.41

[Discussion] < F.10 > The Google cluster used to be constructed from 1 rack unit (RU) PCs, each with one processor and two disks. Today there are considerably denser options. How much less floor space would it take if we were to replace the 1 RU PCs with modern alternatives? Go to the Compaq or Dell Web sites to find the densest alternative. What would be the estimated impact on cost of the equipment? What would be the estimated impact on rental cost of floor space?


What would be the impact on interconnection network design for achieving power/performance efficiency?
F.42

[Discussion] < F.13 > At the time of the writing of the fourth edition, it was unclear what would happen with Ethernet versus InfiniBand versus Advanced Switching in the machine room. What are the technical advantages of each? What are the economic advantages of each? Why would people maintaining the system prefer one to the other? How popular is each network today? How do they compare to proprietary commercial networks such as Myrinet and Quadrics?

G.1 Introduction G-2
G.2 Vector Performance in More Depth G-2
G.3 Vector Memory Systems in More Depth G-9
G.4 Enhancing Vector Performance G-11
G.5 Effectiveness of Compiler Vectorization G-14
G.6 Putting It All Together: Performance of Vector Processors G-15
G.7 A Modern Vector Supercomputer: The Cray X1 G-21
G.8 Concluding Remarks G-25
G.9 Historical Perspective and References G-26
Exercises G-29

G Vector Processors in More Depth

Revised by Krste Asanovic
Massachusetts Institute of Technology

I’m certainly not inventing vector processors. There are three kinds that I know of existing today. They are represented by the Illiac-IV, the (CDC) Star processor, and the TI (ASC) processor. Those three were all pioneering processors.…One of the problems of being a pioneer is you always make mistakes and I never, never want to be a pioneer. It’s always best to come second when you can look at the mistakes the pioneers made.
Seymour Cray
Public lecture at Lawrence Livermore Laboratories on the introduction of the Cray-1 (1976)


G.1 Introduction

Chapter 4 introduces vector architectures and places Multimedia SIMD extensions and GPUs in proper context to vector architectures. In this appendix, we go into more detail on vector architectures, including more accurate performance models and descriptions of previous vector architectures. Figure G.1 shows the characteristics of some typical vector processors, including the size and count of the registers, the number and types of functional units, the number of load-store units, and the number of lanes.

G.2 Vector Performance in More Depth

The chime approximation is reasonably accurate for long vectors. Another source of overhead is far more significant than the issue limitation. The most important source of overhead ignored by the chime model is vector start-up time. The start-up time comes from the pipelining latency of the vector operation and is principally determined by how deep the pipeline is for the functional unit used. The start-up time increases the effective time to execute a convoy to more than one chime. Because of our assumption that convoys do not overlap in time, the start-up time delays the execution of subsequent convoys. Of course, the instructions in successive convoys either have structural conflicts for some functional unit or are data dependent, so the assumption of no overlap is reasonable. The actual time to complete a convoy is determined by the sum of the vector length and the start-up time. If vector lengths were infinite, this start-up overhead would be amortized, but finite vector lengths expose it, as the following example shows.

Example

Assume that the start-up overhead for functional units is shown in Figure G.2. Show the time that each convoy can begin and the total number of cycles needed. How does the time compare to the chime approximation for a vector of length 64?

Answer

Figure G.3 provides the answer in convoys, assuming that the vector length is n. One tricky question is when we assume the vector sequence is done; this determines whether the start-up time of the SV is visible or not. We assume that the instructions following cannot fit in the same convoy, and we have already assumed that convoys do not overlap. Thus, the total time is given by the time until the last vector instruction in the last convoy completes. This is an approximation, and the start-up time of the last vector instruction may be seen in some sequences and not in others. For simplicity, we always include it. The time per result for a vector of length 64 is 4 + (42/64) = 4.65 clock cycles, while the chime approximation would be 4. The execution time with start-up overhead is 1.16 times higher.
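The arithmetic behind Figure G.3 is easy to check mechanically. The following is a minimal C sketch (not from the text) that evaluates the last-result-time expressions from the figure for n = 64, using the start-up overheads of Figure G.2:

    #include <stdio.h>

    /* Last-result times of the four convoys in Figure G.3 as functions of the
     * vector length n (start-up overheads: load/store 12, multiply 7, add 6). */
    int main(void) {
        int n = 64;
        printf("1. LV         : %d\n", 11 + n);      /*  75 */
        printf("2. MULVS.D LV : %d\n", 23 + 2 * n);  /* 151 */
        printf("3. ADDV.D     : %d\n", 29 + 3 * n);  /* 221 */
        printf("4. SV         : %d\n", 41 + 4 * n);  /* 297 */
        return 0;
    }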

Processor (year) | Clock rate (MHz) | Vector registers | Elements per register (64-bit elements) | Vector arithmetic units | Vector load-store units | Lanes
Cray-1 (1976) | 80 | 8 | 64 | 6: FP add, FP multiply, FP reciprocal, integer add, logical, shift | 1 | 1
Cray X-MP (1983), Cray Y-MP (1988) | 118, 166 | 8 | 64 | 8: FP add, FP multiply, FP reciprocal, integer add, 2 logical, shift, population count/parity | 2 loads, 1 store | 1
Cray-2 (1985) | 244 | 8 | 64 | 5: FP add, FP multiply, FP reciprocal/sqrt, integer add/shift/population count, logical | 1 | 1
Fujitsu VP100/VP200 (1982) | 133 | 8–256 | 32–1024 | 3: FP or integer add/logical, multiply, divide | 2 | 1 (VP100), 2 (VP200)
Hitachi S810/S820 (1983) | 71 | 32 | 256 | 4: FP multiply-add, FP multiply/divide-add unit, 2 integer add/logical | 3 loads, 1 store | 1 (S810), 2 (S820)
Convex C-1 (1985) | 10 | 8 | 128 | 2: FP or integer multiply/divide, add/logical | 1 | 1 (64 bit), 2 (32 bit)
NEC SX/2 (1985) | 167 | 8 + 32 | 256 | 4: FP multiply/divide, FP add, integer add/logical, shift | 1 | 4
Cray C90 (1991), Cray T90 (1995) | 240, 460 | 8 | 128 | 8: FP add, FP multiply, FP reciprocal, integer add, 2 logical, shift, population count/parity | 2 loads, 1 store | 2
NEC SX/5 (1998) | 312 | 8 + 64 | 512 | 4: FP or integer add/shift, multiply, divide, logical | 1 | 16
Fujitsu VPP5000 (1999) | 300 | 8–256 | 128–4096 | 3: FP or integer multiply, add/logical, divide | 1 load, 1 store | 16
Cray SV1 (1998), SV1ex (2001) | 300, 500 | 8 | 64 (MSP) | 8: FP add, FP multiply, FP reciprocal, integer add, 2 logical, shift, population count/parity | 1 load-store, 1 load | 2, 8 (MSP)
VMIPS (2001) | 500 | 8 | 64 | 5: FP multiply, FP divide, FP add, integer add/shift, logical | 1 load-store | 1
NEC SX/6 (2001) | 500 | 8 + 64 | 256 | 4: FP or integer add/shift, multiply, divide, logical | 1 | 8
NEC SX/8 (2004) | 2000 | 8 + 64 | 256 | 4: FP or integer add/shift, multiply, divide, logical | 1 | 4
Cray X1 (2002), Cray X1E (2005) | 800, 1130 | 32 | 64, 256 (MSP) | 3: FP or integer, add/logical, multiply/shift, divide/square root/logical | 1 load, 1 store | 2, 8 (MSP)

Figure G.1 Characteristics of several vector-register architectures. If the machine is a multiprocessor, the entries correspond to the characteristics of one processor. Several of the machines have different clock rates in the vector and scalar units; the clock rates shown are for the vector units. The Fujitsu machines’ vector registers are configurable: The size and count of the 8K 64-bit entries may be varied inversely to one another (e.g., on the VP200, from eight registers each 1K elements long to 256 registers each 32 elements long). The NEC machines have eight foreground vector registers connected to the arithmetic units plus 32 to 64 background vector registers connected between the memory system and the foreground vector registers. Add pipelines perform add and subtract. The multiply/divide-add unit on the Hitachi S810/820 performs an FP multiply or divide followed by an add or subtract (while the multiply-add unit performs a multiply followed by an add or subtract). Note that most processors use the vector FP multiply and divide units for vector integer multiply and divide, and several of the processors use the same units for FP scalar and FP vector operations. Each vector load-store unit represents the ability to do an independent, overlapped transfer to or from the vector registers. The number of lanes is the number of parallel pipelines in each of the functional units as described in Section G.4. For example, the NEC SX/5 can complete 16 multiplies per cycle in the multiply functional unit. Several machines can split a 64-bit lane into two 32-bit lanes to increase performance for applications that require only reduced precision. The Cray SV1 and Cray X1 can group four CPUs with two lanes each to act in unison as a single larger CPU with eight lanes, which Cray calls a Multi-Streaming Processor (MSP).


Unit                | Start-up overhead (cycles)
Load and store unit | 12
Multiply unit       | 7
Add unit            | 6

Figure G.2 Start-up overhead.

Convoy        | Starting time | First-result time | Last-result time
1. LV         | 0             | 12                | 11 + n
2. MULVS.D LV | 12 + n        | 12 + n + 12       | 23 + 2n
3. ADDV.D     | 24 + 2n       | 24 + 2n + 6       | 29 + 3n
4. SV         | 30 + 3n       | 30 + 3n + 12      | 41 + 4n

Figure G.3 Starting times and first- and last-result times for convoys 1 through 4. The vector length is n.

For simplicity, we will use the chime approximation for running time, incorporating start-up time effects only when we want performance that is more detailed or to illustrate the benefits of some enhancement. For long vectors, a typical situation, the overhead effect is not that large. Later in the appendix, we will explore ways to reduce start-up overhead. Start-up time for an instruction comes from the pipeline depth for the functional unit implementing that instruction. If the initiation rate is to be kept at 1 clock cycle per result, then

Pipeline depth = ⌈Total functional unit time / Clock cycle time⌉

For example, if an operation takes 10 clock cycles, it must be pipelined 10 deep to achieve an initiation rate of one per clock cycle. Pipeline depth, then, is determined by the complexity of the operation and the clock cycle time of the processor. The pipeline depths of functional units vary widely—2 to 20 stages are common—although the most heavily used units have pipeline depths of 4 to 8 clock cycles. For VMIPS, we will use the same pipeline depths as the Cray-1, although latencies in more modern processors have tended to increase, especially for loads. All functional units are fully pipelined. From Chapter 4, pipeline depths are 6 clock cycles for floating-point add and 7 clock cycles for floating-point multiply. On VMIPS, as on most vector processors, independent vector operations using different functional units can issue in the same convoy. In addition to the start-up overhead, we need to account for the overhead of executing the strip-mined loop. This strip-mining overhead, which arises from

Operation       | Start-up penalty
Vector add      | 6
Vector multiply | 7
Vector divide   | 20
Vector load     | 12

Figure G.4 Start-up penalties on VMIPS. These are the start-up penalties in clock cycles for VMIPS vector operations.

the need to reinitiate the vector sequence and set the Vector Length Register (VLR), effectively adds to the vector start-up time, assuming that a convoy does not overlap with other instructions. If that overhead for a convoy is 10 cycles, then the effective overhead per 64 elements increases by 10 cycles, or 0.15 cycles per element. Two key factors contribute to the running time of a strip-mined loop consisting of a sequence of convoys:
1. The number of convoys in the loop, which determines the number of chimes. We use the notation Tchime for the execution time in chimes.
2. The overhead for each strip-mined sequence of convoys. This overhead consists of the cost of executing the scalar code for strip-mining each block, Tloop, plus the vector start-up cost for each convoy, Tstart.
There may also be a fixed overhead associated with setting up the vector sequence the first time. In recent vector processors, this overhead has become quite small, so we ignore it. The components can be used to state the total running time for a vector sequence operating on a vector of length n, which we will call Tn:

Tn = ⌈n / MVL⌉ × (Tloop + Tstart) + n × Tchime

The values of Tstart, Tloop, and Tchime are compiler and processor dependent. The register allocation and scheduling of the instructions affect both what goes in a convoy and the start-up overhead of each convoy. For simplicity, we will use a constant value for Tloop on VMIPS. Based on a variety of measurements of Cray-1 vector execution, the value chosen is 15 for Tloop. At first glance, you might think that this value is too small. The overhead in each loop requires setting up the vector starting addresses and the strides, incrementing counters, and executing a loop branch. In practice, these scalar instructions can be totally or partially overlapped with the vector instructions, minimizing the time spent on these overhead functions. The value of Tloop of course depends on the loop structure, but the dependence is slight compared with the connection between the vector code and the values of Tchime and Tstart.


Example

What is the execution time on VMIPS for the vector operation A = B × s, where s is a scalar and the length of the vectors A and B is 200?

Answer

Assume that the addresses of A and B are initially in Ra and Rb, s is in Fs, and recall that for MIPS (and VMIPS) R0 always holds 0. Since (200 mod 64) = 8, the first iteration of the strip-mined loop will execute for a vector length of 8 elements, and the following iterations will execute for a vector length of 64 elements. The starting byte address of the next segment of each vector is eight times the vector length. Since the vector length is either 8 or 64, we increment the address registers by 8 × 8 = 64 after the first segment and 8 × 64 = 512 for later segments. The total number of bytes in the vector is 8 × 200 = 1600, and we test for completion by comparing the address of the next vector segment to the initial address plus 1600. Here is the actual code:

      DADDUI  R2,R0,#1600   ;total # bytes in vector
      DADDU   R2,R2,Ra      ;address of the end of A vector
      DADDUI  R1,R0,#8      ;loads length of 1st segment
      MTC1    VLR,R1        ;load vector length in VLR
      DADDUI  R1,R0,#64     ;length in bytes of 1st segment
      DADDUI  R3,R0,#64     ;vector length of other segments
Loop: LV      V1,Rb         ;load B
      MULVS.D V2,V1,Fs      ;vector * scalar
      SV      Ra,V2         ;store A
      DADDU   Ra,Ra,R1      ;address of next segment of A
      DADDU   Rb,Rb,R1      ;address of next segment of B
      DADDUI  R1,R0,#512    ;load byte offset next segment
      MTC1    VLR,R3        ;set length to 64 elements
      DSUBU   R4,R2,Ra      ;at the end of A?
      BNEZ    R4,Loop       ;if not, go back
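For readers who prefer to see the strip-mining structure in a high-level language, here is a minimal C sketch of the same computation (the function name strip_mined_scale and its arguments are hypothetical; an MVL of 64 is assumed, as on VMIPS). It peels off the short first segment of n mod 64 elements and then processes full 64-element segments, mirroring the assembly above:

    #define MVL 64   /* maximum vector length assumed for VMIPS */

    /* A = B * s for vectors of length n, strip-mined into one short segment
     * of (n mod MVL) elements followed by full MVL-element segments. */
    void strip_mined_scale(double *A, const double *B, double s, int n) {
        int low = 0;
        int vl = n % MVL;              /* length of the first segment (8 when n = 200) */
        if (vl == 0) vl = MVL;
        while (low < n) {
            for (int i = low; i < low + vl; i++)   /* one vector segment */
                A[i] = B[i] * s;
            low += vl;
            vl = MVL;                  /* all remaining segments are full length */
        }
    }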

The three vector instructions in the loop are dependent and must go into three convoys, hence Tchime = 3. Let’s use our basic formula:

Tn = ⌈n / MVL⌉ × (Tloop + Tstart) + n × Tchime
T200 = 4 × (15 + Tstart) + 200 × 3
T200 = 60 + (4 × Tstart) + 600 = 660 + (4 × Tstart)

The value of Tstart is the sum of:
■ The vector load start-up of 12 clock cycles
■ A 7-clock-cycle start-up for the multiply
■ A 12-clock-cycle start-up for the store

Thus, the value of Tstart is given by:

Tstart = 12 + 7 + 12 = 31


[Figure G.5 plots clock cycles per element (vertical axis, 0 to 9) against vector size (horizontal axis, 10 to 190), with two curves: total time per element and total overhead per element.]

Figure G.5 The total execution time per element and the total overhead time per element versus the vector length for the example on page G-6. For short vectors, the total start-up time is more than one-half of the total time, while for long vectors it reduces to about one-third of the total time. The sudden jumps occur when the vector length crosses a multiple of 64, forcing another iteration of the strip-mining code and execution of a set of vector instructions. These operations increase Tn by Tloop + Tstart.

So, the overall value becomes:

T200 = 660 + 4 × 31 = 784

The execution time per element with all start-up costs is then 784/200 = 3.9, compared with a chime approximation of three. In Section G.4, we will be more ambitious—allowing overlapping of separate convoys. Figure G.5 shows the overhead and effective rates per element for the previous example (A = B × s) with various vector lengths. A chime-counting model would lead to 3 clock cycles per element, while the two sources of overhead add 0.9 clock cycles per element in the limit.
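The running-time model is also easy to evaluate mechanically. The following C sketch (not from the text; the function name vector_time is mine) applies the Tn formula with the VMIPS constants used above (MVL = 64, Tloop = 15, Tstart = 31, Tchime = 3) and reproduces T200 = 784:

    #include <stdio.h>

    /* Tn = ceil(n / MVL) * (Tloop + Tstart) + n * Tchime */
    static long vector_time(long n, long mvl, long t_loop, long t_start, long t_chime) {
        long segments = (n + mvl - 1) / mvl;        /* ceiling division */
        return segments * (t_loop + t_start) + n * t_chime;
    }

    int main(void) {
        long t200 = vector_time(200, 64, 15, 31, 3);
        printf("T200 = %ld cycles (%.2f per element)\n", t200, (double)t200 / 200);
        /* prints: T200 = 784 cycles (3.92 per element) */
        return 0;
    }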

Pipelined Instruction Start-Up and Multiple Lanes

Adding multiple lanes increases peak performance but does not change start-up latency, and so it becomes critical to reduce start-up overhead by allowing the start of one vector instruction to be overlapped with the completion of preceding vector instructions. The simplest case to consider is when two vector instructions access a different set of vector registers. For example, in the code sequence

    ADDV.D V1,V2,V3
    ADDV.D V4,V5,V6


An implementation can allow the first element of the second vector instruction to follow immediately the last element of the first vector instruction down the FP adder pipeline. To reduce the complexity of control logic, some vector machines require some recovery time or dead time in between two vector instructions dispatched to the same vector unit. Figure G.6 is a pipeline diagram that shows both start-up latency and dead time for a single vector pipeline. The following example illustrates the impact of this dead time on achievable vector performance. Example

The Cray C90 has two lanes but requires 4 clock cycles of dead time between any two vector instructions to the same functional unit, even if they have no data dependences. For the maximum vector length of 128 elements, what is the reduction in achievable peak performance caused by the dead time? What would be the reduction if the number of lanes were increased to 16?

Answer

A maximum length vector of 128 elements is divided over the two lanes and occupies a vector functional unit for 64 clock cycles. The dead time adds another 4 cycles of occupancy, reducing the peak performance to 64/(64 + 4) = 94.1% of the value without dead time. If the number of lanes is increased to 16, maximum length vector instructions will occupy a functional unit for only 128/16 = 8 cycles, and the dead time will reduce peak performance to 8/(8 + 4) = 66.6% of the value without dead time. In this second case, the vector units can never be more than 2/3 busy!
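The dead-time penalty follows directly from the occupancy ratio. Here is a small C sketch (not from the text; peak_fraction is a hypothetical helper) that reproduces the two figures above:

    #include <stdio.h>

    /* Fraction of peak sustained when a maximum-length vector occupies a
     * functional unit for (vector_length / lanes) cycles and each instruction
     * is followed by dead_time idle cycles. */
    static double peak_fraction(int vector_length, int lanes, int dead_time) {
        double busy = (double)vector_length / lanes;
        return busy / (busy + dead_time);
    }

    int main(void) {
        printf("2 lanes : %.1f%%\n", 100.0 * peak_fraction(128, 2, 4));   /* 64/68 = 94.1%      */
        printf("16 lanes: %.1f%%\n", 100.0 * peak_fraction(128, 16, 4));  /* 8/12 = 2/3 of peak */
        return 0;
    }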

Figure G.6 Start-up latency and dead time for a single vector pipeline. Each element has a 5-cycle latency: 1 cycle to read the vector-register file, 3 cycles in execution, then 1 cycle to write the vector-register file. Elements from the same vector instruction can follow each other down the pipeline, but this machine inserts 4 cycles of dead time between two different vector instructions. The dead time can be eliminated with more complex control logic. (Reproduced with permission from Asanovic [1998].)


Pipelining instruction start-up becomes more complicated when multiple instructions can be reading and writing the same vector register and when some instructions may stall unpredictably—for example, a vector load encountering memory bank conflicts. However, as both the number of lanes and pipeline latencies increase, it becomes increasingly important to allow fully pipelined instruction start-up.

G.3 Vector Memory Systems in More Depth

To maintain an initiation rate of one word fetched or stored per clock, the memory system must be capable of producing or accepting this much data. As we saw in Chapter 4, this is usually done by spreading accesses across multiple independent memory banks. Having significant numbers of banks is useful for dealing with vector loads or stores that access rows or columns of data. The desired access rate and the bank access time determined how many banks were needed to access memory without stalls. This example shows how these timings work out in a vector processor.

Example

Suppose we want to fetch a vector of 64 elements starting at byte address 136, and a memory access takes 6 clocks. How many memory banks must we have to support one fetch per clock cycle? With what addresses are the banks accessed? When will the various elements arrive at the CPU?

Answer

Six clocks per access require at least 6 banks, but because we want the number of banks to be a power of 2, we choose to have 8 banks. Figure G.7 shows the timing for the first few sets of accesses for an 8-bank system with a 6-clock-cycle access latency. The timing of real memory banks is usually split into two different components, the access latency and the bank cycle time (or bank busy time). The access latency is the time from when the address arrives at the bank until the bank returns a data value, while the busy time is the time the bank is occupied with one request. The access latency adds to the start-up cost of fetching a vector from memory (the total memory latency also includes time to traverse the pipelined interconnection networks that transfer addresses and data between the CPU and memory banks). The bank busy time governs the effective bandwidth of a memory system because a processor cannot issue a second request to the same bank until the bank busy time has elapsed. For simple unpipelined SRAM banks as used in the previous examples, the access latency and busy time are approximately the same. For a pipelined SRAM bank, however, the access latency is larger than the busy time because each element access only occupies one stage in the memory bank pipeline. For a DRAM bank, the access latency is usually shorter than the busy time because a DRAM needs extra time to restore the read value after the destructive read operation. For memory systems that support multiple simultaneous vector accesses


Cycle access begins | Address (bytes) | Bank
0  | 136 | 1
1  | 144 | 2
2  | 152 | 3
3  | 160 | 4
4  | 168 | 5
5  | 176 | 6
6  | 184 | 7
7  | 192 | 0
8  | 200 | 1
9  | 208 | 2
10 | 216 | 3
11 | 224 | 4
12 | 232 | 5
13 | 240 | 6
14 | 248 | 7
15 | 256 | 0
16 | 264 | 1

(The original figure also marks each bank as busy for the 6 clock cycles following the cycle in which it latches an address.)

Figure G.7 Memory addresses (in bytes) by bank number and time slot at which access begins. Each memory bank latches the element address at the start of an access and is then busy for 6 clock cycles before returning a value to the CPU. Note that the CPU cannot keep all 8 banks busy all the time because it is limited to supplying one new address and receiving one data item each cycle.

or allow nonsequential accesses in vector loads or stores, the number of memory banks should be larger than the minimum; otherwise, memory bank conflicts will exist. Memory bank conflicts will not occur within a single vector memory instruction if the stride and number of banks are relatively prime with respect to each other and there are enough banks to avoid conflicts in the unit stride case. When there are no bank conflicts, multiword and unit strides run at the same rates. Increasing the number of memory banks to a number greater than the minimum to prevent stalls with a stride of length 1 will decrease the stall frequency for some other strides. For example, with 64 banks, a stride of 32 will stall on every other access, rather than every access. If we originally had a stride of 8 and 16 banks, every other access would stall; with 64 banks, a stride of 8 will stall on every eighth access. If we have multiple memory pipelines and/or multiple processors sharing the same memory system, we will also need more banks to prevent conflicts. Even machines with a single memory pipeline can experience memory bank conflicts on unit stride

accesses between the last few elements of one instruction and the first few elements of the next instruction, and increasing the number of banks will reduce the probability of these inter-instruction conflicts. In 2011, most vector supercomputers spread the accesses from each CPU across hundreds of memory banks. Because bank conflicts can still occur in non-unit stride cases, programmers favor unit stride accesses whenever possible. A modern supercomputer may have dozens of CPUs, each with multiple memory pipelines connected to thousands of memory banks. It would be impractical to provide a dedicated path between each memory pipeline and each memory bank, so, typically, a multistage switching network is used to connect memory pipelines to memory banks. Congestion can arise in this switching network as different vector accesses contend for the same circuit paths, causing additional stalls in the memory system.
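To make the bank-conflict condition concrete, the following C sketch (not from the text) applies the simple model above: one new address per clock, and each bank occupied for a fixed busy time. The bank count, busy time, and strides are illustrative values chosen to match the discussion, not parameters of any particular machine.

/* Assumed model: a stride revisits the same bank every banks/gcd(stride, banks)
 * element accesses; if that interval is shorter than the bank busy time, each
 * revisit stalls until the bank is free again. Parameter values are illustrative. */
#include <stdio.h>

static long gcd(long a, long b) {
    while (b != 0) { long t = a % b; a = b; b = t; }
    return a;
}

int main(void) {
    const long banks = 64, busy = 6;          /* illustrative values */
    const long strides[] = {1, 7, 32, 64};

    for (int i = 0; i < 4; i++) {
        long s = strides[i];
        long revisit = banks / gcd(s, banks); /* accesses between visits to one bank */
        if (revisit >= busy)
            printf("stride %2ld: a bank is reused every %2ld accesses -> no stalls\n",
                   s, revisit);
        else
            printf("stride %2ld: a bank is reused every %2ld accesses -> %ld-cycle stall per reuse\n",
                   s, revisit, busy - revisit);
    }
    return 0;
}

With these values, stride 32 stalls on every other access, as noted above, while any stride that is relatively prime to the number of banks avoids conflicts.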

G.4 Enhancing Vector Performance

In this section, we present techniques for improving the performance of a vector processor in more depth than we did in Chapter 4.

Chaining in More Depth

Early implementations of chaining worked like forwarding, but this restricted the timing of the source and destination instructions in the chain. Recent implementations use flexible chaining, which allows a vector instruction to chain to essentially any other active vector instruction, assuming that no structural hazard is generated. Flexible chaining requires simultaneous access to the same vector register by different vector instructions, which can be implemented either by adding more read and write ports or by organizing the vector-register file storage into interleaved banks in a similar way to the memory system. We assume this type of chaining throughout the rest of this appendix. Even though a pair of operations depends on one another, chaining allows the operations to proceed in parallel on separate elements of the vector. This permits the operations to be scheduled in the same convoy and reduces the number of chimes required. For the previous sequence, a sustained rate (ignoring start-up) of two floating-point operations per clock cycle, or one chime, can be achieved, even though the operations are dependent! The total running time for the above sequence becomes:

Vector length + Start-up time_ADDV + Start-up time_MULV

Figure G.8 shows the timing of a chained and an unchained version of the above pair of vector instructions with a vector length of 64. This convoy requires one chime; however, because it uses chaining, the start-up overhead will be seen in the actual timing of the convoy. In Figure G.8, the total time for chained operation is 77 clock cycles, or 1.2 cycles per result. With 128 floating-point operations done in that time, 1.7 FLOPS per clock cycle are obtained. For the unchained version, there are 141 clock cycles, or 0.9 FLOPS per clock cycle.

[Figure G.8 is a timing diagram. Unchained: the MULV (7 + 64 cycles) must complete before the ADDV (6 + 64 cycles) can begin, for a total of 141 clock cycles. Chained: the ADDV starts 7 cycles after the MULV, so the total is 7 + 64 + 6 = 77 clock cycles.]

Figure G.8 Timings for a sequence of dependent vector operations ADDV and MULV, both unchained and chained. The 6- and 7-clock-cycle delays are the latency of the adder and multiplier.

Although chaining allows us to reduce the chime component of the execution time by putting two dependent instructions in the same convoy, it does not eliminate the start-up overhead. If we want an accurate running time estimate, we must count the start-up time both within and across convoys. With chaining, the number of chimes for a sequence is determined by the number of different vector functional units available in the processor and the number required by the application. In particular, no convoy can contain a structural hazard. This means, for example, that a sequence containing two vector memory instructions must take at least two convoys, and hence two chimes, on a processor like VMIPS with only one vector load-store unit. Chaining is so important that every modern vector processor supports flexible chaining.
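The chained and unchained totals quoted above are easy to reproduce from the start-up values alone. The C sketch below is only a back-of-the-envelope model under the assumptions of this discussion (64-element vectors, 7-cycle multiply and 6-cycle add start-up, no memory stalls); it is not VMIPS simulator code.

/* Back-of-the-envelope timing for a dependent MULV/ADDV pair (Figure G.8). */
#include <stdio.h>

int main(void) {
    const int n = 64;          /* vector length */
    const int mul_start = 7;   /* multiplier start-up latency */
    const int add_start = 6;   /* adder start-up latency */

    /* Unchained: the ADDV waits for the MULV to complete all 64 results. */
    int unchained = (mul_start + n) + (add_start + n);

    /* Chained: the ADDV starts as soon as the first MULV result is available,
       so only the two start-up latencies plus one vector length are exposed. */
    int chained = mul_start + add_start + n;

    printf("unchained: %d cycles, chained: %d cycles\n", unchained, chained);
    printf("chained rate: %.1f FLOPS/clock, %.1f cycles/result\n",
           2.0 * n / chained, (double)chained / n);
    return 0;
}

Running this reproduces the 141- and 77-cycle totals and the 1.7 FLOPS per clock quoted for the chained case.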

Sparse Matrices in More Depth

Chapter 4 shows techniques to allow programs with sparse matrices to execute in vector mode. Let’s start with a quick review. In a sparse matrix, the elements of a vector are usually stored in some compacted form and then accessed indirectly. Assuming a simplified sparse structure, we might see code that looks like this:

      do 100 i = 1,n
100      A(K(i)) = A(K(i)) + C(M(i))

This code implements a sparse vector sum on the arrays A and C, using index vectors K and M to designate the nonzero elements of A and C. (A and C must have the same number of nonzero elements—n of them.) Another common representation for sparse matrices uses a bit vector to show which elements exist and a dense vector for the nonzero elements. Often both representations exist in the same program. Sparse matrices are found in many codes, and there are many ways to implement them, depending on the data structure used in the program. A simple vectorizing compiler could not automatically vectorize the source code above because the compiler would not know that the elements of K are distinct values and thus that no dependences exist. Instead, a programmer directive would tell the compiler that it could run the loop in vector mode. More sophisticated vectorizing compilers can vectorize the loop automatically without programmer annotations by inserting run time checks for data

dependences. These run time checks are implemented with a vectorized software version of the advanced load address table (ALAT) hardware described in Appendix H for the Itanium processor. The associative ALAT hardware is replaced with a software hash table that detects if two element accesses within the same strip-mine iteration are to the same address. If no dependences are detected, the strip-mine iteration can complete using the maximum vector length. If a dependence is detected, the vector length is reset to a smaller value that avoids all dependency violations, leaving the remaining elements to be handled on the next iteration of the strip-mined loop. Although this scheme adds considerable software overhead to the loop, the overhead is mostly vectorized for the common case where there are no dependences; as a result, the loop still runs considerably faster than scalar code (although much slower than if a programmer directive was provided).

A scatter-gather capability is included on many of the recent supercomputers. These operations often run more slowly than strided accesses because they are more complex to implement and are more susceptible to bank conflicts, but they are still much faster than the alternative, which may be a scalar loop. If the sparsity properties of a matrix change, a new index vector must be computed. Many processors provide support for computing the index vector quickly. The CVI (create vector index) instruction in VMIPS creates an index vector given a stride (m), where the values in the index vector are 0, m, 2 × m, …, 63 × m. Some processors provide an instruction to create a compressed index vector whose entries correspond to the positions with a one in the mask register. Other vector architectures provide a method to compress a vector. In VMIPS, we define the CVI instruction to always create a compressed index vector using the vector mask. When the vector mask is all ones, a standard index vector will be created. The indexed loads-stores and the CVI instruction provide an alternative method to support conditional vector execution. Let us first recall code from Chapter 4:

      do 100 i = 1,64
         if (A(i) .ne. 0) then
            A(i) = A(i) - B(i)
         endif
100   continue

Here is a vector sequence that implements that loop using CVI:

LV        V1,Ra          ;load vector A into V1
L.D       F0,#0          ;load FP zero into F0
SNEVS.D   V1,F0          ;sets the VM to 1 if V1(i)!=F0
CVI       V2,#8          ;generates indices in V2
POP       R1,VM          ;find the number of 1’s in VM
MTC1      VLR,R1         ;load vector-length register
CVM                      ;clears the mask
LVI       V3,(Ra+V2)     ;load the nonzero A elements
LVI       V4,(Rb+V2)     ;load corresponding B elements
SUBV.D    V3,V3,V4       ;do the subtract
SVI       (Ra+V2),V3     ;store A back
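For readers who want to see the effect of the sequence above in a scalar language, here is a rough C equivalent (an illustration only; the array sizes and data are made up, and no claim is made about VMIPS semantics). The first loop plays the role of SNEVS.D/CVI/POP, building a compressed index vector; the second plays the role of the indexed loads, subtract, and indexed store.

/* Rough scalar C equivalent of the CVI/indexed load-store sequence above. */
#include <stdio.h>

#define N 64

static void conditional_subtract(double *A, const double *B, int n) {
    int idx[N];                        /* compressed index vector (CVI under the mask) */
    int count = 0;                     /* POP: number of 1's in the mask */

    for (int i = 0; i < n; i++)        /* SNEVS.D: mask bit set where A(i) != 0 */
        if (A[i] != 0.0)
            idx[count++] = i;

    for (int j = 0; j < count; j++)    /* LVI, LVI, SUBV.D, SVI */
        A[idx[j]] = A[idx[j]] - B[idx[j]];
}

int main(void) {
    double A[N] = {0.0, 2.5, 0.0, 4.0};   /* illustrative data */
    double B[N] = {1.0, 1.0, 1.0, 1.0};
    conditional_subtract(A, B, N);
    printf("A[1] = %.1f, A[3] = %.1f\n", A[1], A[3]);   /* prints 1.5 and 3.0 */
    return 0;
}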

Whether the implementation using scatter-gather is better than the conditionally executed version depends on the frequency with which the condition holds and the cost of the operations. Ignoring chaining, the running time of the original version is 5n + c1. The running time of the second version, using indexed loads and stores with a running time of one element per clock, is 4n + 4fn + c2, where f is the fraction of elements for which the condition is true (i.e., A(i) ≠ 0). If we assume that the values of c1 and c2 are comparable, or that they are much smaller than n, we can find when this second technique is better.

Time1 = 5(n)
Time2 = 4n + 4fn

We want Time1 > Time2, so

5n > 4n + 4fn
1/4 > f

That is, the second method is faster if less than one-quarter of the elements are nonzero. In many cases, the frequency of execution is much lower. If the index vector can be reused, or if the number of vector statements within the if statement grows, the advantage of the scatter-gather approach will increase sharply.
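The run-time dependence check sketched earlier in this subsection (the software stand-in for the ALAT) can also be illustrated in C. The hash-table size, the probing scheme, and the way conflicts shorten the vector length are all illustrative choices here, not the actual implementation used by any vectorizing compiler.

/* Illustrative run-time check: how many leading elements of an index vector
 * are free of duplicates within one strip-mined chunk? The caller would use
 * the returned value as the safe vector length for that chunk. */
#include <stdio.h>
#include <string.h>

#define HASH_SIZE 256   /* must exceed the chunk length (e.g., MVL = 64) */

static int safe_vector_length(const int *idx, int vl) {
    int table[HASH_SIZE];
    memset(table, 0xff, sizeof(table));          /* mark all slots empty (-1) */
    for (int i = 0; i < vl; i++) {
        unsigned h = (unsigned)idx[i] % HASH_SIZE;
        while (table[h] != -1) {
            if (table[h] == idx[i])
                return i;                        /* duplicate index: stop before it */
            h = (h + 1) % HASH_SIZE;             /* linear probing */
        }
        table[h] = idx[i];
    }
    return vl;                                   /* no conflicts in this chunk */
}

int main(void) {
    int K[8] = {3, 9, 12, 40, 9, 5, 7, 8};       /* index vector; the value 9 repeats */
    printf("safe vector length = %d of %d\n", safe_vector_length(K, 8), 8);
    return 0;
}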

G.5 Effectiveness of Compiler Vectorization

Two factors affect the success with which a program can be run in vector mode. The first factor is the structure of the program itself: Do the loops have true data dependences, or can they be restructured so as not to have such dependences? This factor is influenced by the algorithms chosen and, to some extent, by how they are coded. The second factor is the capability of the compiler. While no compiler can vectorize a loop where no parallelism among the loop iterations exists, there is tremendous variation in the ability of compilers to determine whether a loop can be vectorized. The techniques used to vectorize programs are the same as those discussed in Chapter 3 for uncovering ILP; here, we simply review how well these techniques work. There is tremendous variation in how well different compilers do in vectorizing programs. As a summary of the state of vectorizing compilers, consider the data in Figure G.9, which shows the extent of vectorization for different processors using a test suite of 100 handwritten FORTRAN kernels. The kernels were designed to test vectorization capability and can all be vectorized by hand; we will see several examples of these loops in the exercises.

Processor            Compiler                 Completely vectorized   Partially vectorized   Not vectorized
CDC CYBER 205        VAST-2 V2.21             62                      5                      33
Convex C-series      FC5.0                    69                      5                      26
Cray X-MP            CFT77 V3.0               69                      3                      28
Cray X-MP            CFT V1.15                50                      1                      49
Cray-2               CFT2 V3.1a               27                      1                      72
ETA-10               FTN 77 V1.0              62                      7                      31
Hitachi S810/820     FORT77/HAP V20-2B        67                      4                      29
IBM 3090/VF          VS FORTRAN V2.4          52                      4                      44
NEC SX/2             FORTRAN77 / SX V.040     66                      5                      29

Figure G.9 Result of applying vectorizing compilers to the 100 FORTRAN test kernels. For each processor we indicate how many loops were completely vectorized, partially vectorized, and unvectorized. These loops were collected by Callahan, Dongarra, and Levine [1988]. Two different compilers for the Cray X-MP show the large dependence on compiler technology.

G.6 Putting It All Together: Performance of Vector Processors

In this section, we look at performance measures for vector processors and what they tell us about the processors. To determine the performance of a processor on a vector problem we must look at the start-up cost and the sustained rate. The simplest and best way to report the performance of a vector processor on a loop is to give the execution time of the vector loop. For vector loops, people often give the MFLOPS (millions of floating-point operations per second) rating rather than execution time. We use the notation Rn for the MFLOPS rating on a vector of length n. Using the measurements Tn (time) or Rn (rate) is equivalent if the number of FLOPS is agreed upon. In any event, either measurement should include the overhead. In this section, we examine the performance of VMIPS on a DAXPY loop (see Chapter 4) by looking at performance from different viewpoints. We will continue to compute the execution time of a vector loop using the equation developed in Section G.2. At the same time, we will look at different ways to measure performance using the computed time. The constant values for Tloop used in this section introduce some small amount of error, which will be ignored.

Measures of Vector Performance

Because vector length is so important in establishing the performance of a processor, length-related measures are often applied in addition to time and MFLOPS. These length-related measures tend to vary dramatically across different processors and are interesting to compare. (Remember, though, that time is always the measure of interest when comparing the relative speed of two processors.) Three of the most important length-related measures are

■ R∞—The MFLOPS rate on an infinite-length vector. Although this measure may be of interest when estimating peak performance, real problems have limited vector lengths, and the overhead penalties encountered in real problems will be larger.
■ N1/2—The vector length needed to reach one-half of R∞. This is a good measure of the impact of overhead.
■ Nv—The vector length needed to make vector mode faster than scalar mode. This measures both overhead and the speed of scalars relative to vectors.

Let’s look at these measures for our DAXPY problem running on VMIPS. When chained, the inner loop of the DAXPY code in convoys looks like Figure G.10 (assuming that Rx and Ry hold starting addresses). Recall our performance equation for the execution time of a vector loop with n elements, Tn:

Tn = ⌈n/MVL⌉ × (Tloop + Tstart) + n × Tchime

Chaining allows the loop to run in three chimes (and no less, since there is one memory pipeline); thus, Tchime = 3. If Tchime were a complete indication of performance, the loop would run at an MFLOPS rate of 2/3 × clock rate (since there are 2 FLOPS per iteration). Thus, based only on the chime count, a 500 MHz VMIPS would run this loop at 333 MFLOPS assuming no strip-mining or start-up overhead. There are several ways to improve the performance: Add additional vector load-store units, allow convoys to overlap to reduce the impact of start-up overheads, and decrease the number of loads required by vector-register allocation. We will examine the first two extensions in this section. The last optimization is actually used for the Cray-1, VMIPS’s cousin, to boost the performance by 50%. Reducing the number of loads requires an interprocedural optimization; we examine this transformation in Exercise G.6. Before we examine the first two extensions, let’s see what the real performance, including overhead, is.

LV V1,Rx          MULVS.D V2,V1,F0     Convoy 1: chained load and multiply
LV V3,Ry          ADDV.D V4,V2,V3      Convoy 2: second load and add, chained
SV Ry,V4                               Convoy 3: store the result

Figure G.10 The inner loop of the DAXPY code in chained convoys.


The Peak Performance of VMIPS on DAXPY

First, we should determine what the peak performance, R∞, really is, since we know it must differ from the ideal 333 MFLOPS rate. For now, we continue to use the simplifying assumption that a convoy cannot start until all the instructions in an earlier convoy have completed; later we will remove this restriction. Using this simplification, the start-up overhead for the vector sequence is simply the sum of the start-up times of the instructions:

Tstart = 12 + 7 + 12 + 6 + 12 = 49

Using MVL = 64, Tloop = 15, Tstart = 49, and Tchime = 3 in the performance equation, and assuming that n is not an exact multiple of 64, the time for an n-element operation is

Tn = ⌈n/64⌉ × (15 + 49) + 3n ≤ (n + 64) + 3n = 4n + 64

The sustained rate is actually over 4 clock cycles per iteration, rather than the theoretical rate of 3 chimes, which ignores overhead. The major part of the difference is the cost of the start-up overhead for each block of 64 elements (49 cycles versus 15 for the loop overhead). We can now compute R∞ for a 500 MHz clock as:

R∞ = lim (n→∞) [(Operations per iteration × Clock rate) / (Clock cycles per iteration)]

The numerator is independent of n, hence

R∞ = (Operations per iteration × Clock rate) / lim (n→∞) (Clock cycles per iteration)

lim (n→∞) (Clock cycles per iteration) = lim (n→∞) (Tn/n) = lim (n→∞) ((4n + 64)/n) = 4

R∞ = (2 × 500 MHz) / 4 = 250 MFLOPS

The performance without the start-up overhead, which is the peak performance given the vector functional unit structure, is now 1.33 times higher. In actuality, the gap between peak and sustained performance for this benchmark is even larger!
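The running-time model of this section is simple enough to evaluate directly. The C sketch below plugs the constants quoted above (MVL = 64, Tloop = 15, Tstart = 49, Tchime = 3, a 500 MHz clock, and 2 FLOPS per DAXPY iteration) into Tn = ⌈n/MVL⌉ × (Tloop + Tstart) + n × Tchime; it is a calculator for the model, not a simulation of VMIPS.

/* Evaluate Tn and Rn for the VMIPS DAXPY model of this section. */
#include <stdio.h>

#define MVL            64
#define TLOOP          15
#define TSTART         49
#define TCHIME          3
#define CLOCK_MHZ     500.0
#define FLOPS_PER_ITER  2.0

static long t_n(long n) {
    long blocks = (n + MVL - 1) / MVL;            /* ceil(n/MVL) strip-mined blocks */
    return blocks * (TLOOP + TSTART) + n * TCHIME;
}

static double r_n(long n) {                       /* MFLOPS at vector length n */
    return FLOPS_PER_ITER * n * CLOCK_MHZ / (double)t_n(n);
}

int main(void) {
    const long lengths[] = {1, 13, 64, 66, 100, 1000, 100000};
    for (int i = 0; i < 7; i++)
        printf("n = %6ld  Tn = %8ld cycles  Rn = %6.1f MFLOPS\n",
               lengths[i], t_n(lengths[i]), r_n(lengths[i]));
    return 0;
}

For n = 66 this reproduces the 326-cycle, 202 MFLOPS estimate used for the Linpack discussion below, and the rate approaches the 250 MFLOPS R∞ as n grows.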

Sustained Performance of VMIPS on the Linpack Benchmark

The Linpack benchmark is a Gaussian elimination on a 100 × 100 matrix. Thus, the vector element lengths range from 99 down to 1. A vector of length k is used k times. Thus, the average vector length is given by:

Σ(i = 1..99) i² / Σ(i = 1..99) i = 66.3
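As a quick check of the 66.3 figure, the two sums can be evaluated directly; the snippet below is a trivial calculation, not something from the text.

/* Average vector length for 100 x 100 Gaussian elimination, where a vector
 * of length i is used i times (i = 1, ..., 99). */
#include <stdio.h>

int main(void) {
    double num = 0.0, den = 0.0;
    for (int i = 1; i <= 99; i++) {
        num += (double)i * i;    /* length i, weighted by its i uses */
        den += i;
    }
    printf("average vector length = %.1f\n", num / den);   /* prints 66.3 */
    return 0;
}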

Now we can obtain an accurate estimate of the performance of DAXPY using a vector length of 66:

T66 = 2 × (15 + 49) + 66 × 3 = 128 + 198 = 326
R66 = (2 × 66 × 500) / 326 MFLOPS = 202 MFLOPS

The peak number, ignoring start-up overhead, is 1.64 times higher than this estimate of sustained performance on the real vector lengths. In actual practice, the Linpack benchmark contains a nontrivial fraction of code that cannot be vectorized. Although this code accounts for less than 20% of the time before vectorization, it runs at less than one-tenth of the performance when counted as FLOPS. Thus, Amdahl’s law tells us that the overall performance will be significantly lower than the performance estimated from analyzing the inner loop. Since vector length has a significant impact on performance, the N1/2 and Nv measures are often used in comparing vector machines.

Example

What is N1/2 for just the inner loop of DAXPY for VMIPS with a 500 MHz clock?

Answer

Using R∞ as the peak rate, we want to know the vector length that will achieve about 125 MFLOPS. We start with the formula for MFLOPS assuming that the measurement is made for N1/2 elements:

MFLOPS = (FLOPS executed in N1/2 iterations / Clock cycles to execute N1/2 iterations) × (Clock cycles / Second) × 10^-6

125 = (2 × N1/2 × 500) / TN1/2

Simplifying this and then assuming N1/2 < 64, so that TN1/2 = 1 × 64 + 3 × N1/2, yields 1000 × N1/2 = 125 × (64 + 3 × N1/2), or N1/2 = 12.8. Because N1/2 must be an integer, N1/2 = 13.

Write a VMIPS vector sequence that achieves the peak MFLOPS performance of the processor (use the functional unit and instruction description in Section G.2). Assuming a 500-MHz clock rate, what is the peak MFLOPS?

G.2

[20/15/15] < G.1–G.6 > Consider the following vector code run on a 500 MHz version of VMIPS for a fixed vector length of 64:

LV      V1,Ra
MULV.D  V2,V1,V3
ADDV.D  V4,V1,V3
SV      Rb,V2
SV      Rc,V4


Ignore all strip-mining overhead, but assume that the store latency must be included in the time to perform the loop. The entire sequence produces 64 results.

a. [20] < G.1–G.4 > Assuming no chaining and a single memory pipeline, how many chimes are required? How many clock cycles per result (including both stores as one result) does this vector sequence require, including start-up overhead?
b. [15] < G.1–G.4 > If the vector sequence is chained, how many clock cycles per result does this sequence require, including overhead?
c. [15] < G.1–G.6 > Suppose VMIPS had three memory pipelines and chaining. If there were no bank conflicts in the accesses for the above loop, how many clock cycles are required per result for this sequence?

G.3

[20/20/15/15/20/20/20] < G.2–G.6 > Consider the following FORTRAN code:

      do 10 i = 1,n
         A(i) = A(i) + B(i)
         B(i) = x * B(i)
10    continue

Use the techniques of Section G.6 to estimate performance throughout this exercise, assuming a 500 MHz version of VMIPS.

a. [20] < G.2–G.6 > Write the best VMIPS vector code for the inner portion of the loop. Assume x is in F0 and the addresses of A and B are in Ra and Rb, respectively.
b. [20] < G.2–G.6 > Find the total time for this loop on VMIPS (T100). What is the MFLOPS rating for the loop (R100)?
c. [15] < G.2–G.6 > Find R∞ for this loop.
d. [15] < G.2–G.6 > Find N1/2 for this loop.
e. [20] < G.2–G.6 > Find Nv for this loop. Assume the scalar code has been pipeline scheduled so that each memory reference takes six cycles and each FP operation takes three cycles. Assume the scalar overhead is also Tloop.
f. [20] < G.2–G.6 > Assume VMIPS has two memory pipelines. Write vector code that takes advantage of the second memory pipeline. Show the layout in convoys.
g. [20] < G.2–G.6 > Compute T100 and R100 for VMIPS with two memory pipelines.

G.4

[20/10] < G.2 > Suppose we have a version of VMIPS with eight memory banks (each a double word wide) and a memory access time of eight cycles.

a. [20] < G.2 > If a load vector of length 64 is executed with a stride of 20 double words, how many cycles will the load take to complete?
b. [10] < G.2 > What percentage of the memory bandwidth do you achieve on a 64-element load at stride 20 versus stride 1?

G.5

[12/12] < G.5–G.6 > Consider the following loop:

      C = 0.0
      do 10 i = 1,64
         A(i) = A(i) + B(i)
         C = C + A(i)
10    continue

a. [12] < G.5–G.6 > Split the loop into two loops: one with no dependence and one with a dependence. Write these loops in FORTRAN—as a source-to-source transformation. This optimization is called loop fission.
b. [12] < G.5–G.6 > Write the VMIPS vector code for the loop without a dependence.

G.6

[20/15/20/20] < G.5–G.6 > The compiled Linpack performance of the Cray-1 (designed in 1976) was almost doubled by a better compiler in 1989. Let’s look at a simple example of how this might occur. Consider the DAXPY-like loop (where k is a parameter to the procedure containing the loop):

      do 10 i = 1,64
         do 10 j = 1,64
            Y(k,j) = a*X(i,j) + Y(k,j)
10    continue

a. [20] < G.5–G.6 > Write the straightforward code sequence for just the inner loop in VMIPS vector instructions.
b. [15] < G.5–G.6 > Using the techniques of Section G.6, estimate the performance of this code on VMIPS by finding T64 in clock cycles. You may assume that Tloop of overhead is incurred for each iteration of the outer loop. What limits the performance?
c. [20] < G.5–G.6 > Rewrite the VMIPS code to reduce the performance limitation; show the resulting inner loop in VMIPS vector instructions. (Hint: Think about what establishes Tchime; can you affect it?) Find the total time for the resulting sequence.
d. [20] < G.5–G.6 > Estimate the performance of your new version, using the techniques of Section G.6 and finding T64.

G.7

[15/15/25] < G.4 > Consider the following code:

      do 10 i = 1,64
         if (B(i) .ne. 0) then
            A(i) = A(i)/B(i)
         endif
10    continue

Assume that the addresses of A and B are in Ra and Rb, respectively, and that F0 contains 0.

a. [15] < G.4 > Write the VMIPS code for this loop using the vector-mask capability.
b. [15] < G.4 > Write the VMIPS code for this loop using scatter-gather.
c. [25] < G.4 > Estimate the performance (T100 in clock cycles) of these two vector loops, assuming a divide latency of 20 cycles. Assume that all vector instructions run at one result per clock, independent of the setting of the vector-mask register. Assume that 50% of the entries of B are 0. Considering hardware costs, which would you build if the above loop were typical?

G.8

[15/20/15/15] < G.1–G.6 > The difference between peak and sustained performance can be large. For one problem, a Hitachi S810 had a peak speed twice as high as that of the Cray X-MP, while for another more realistic problem, the Cray X-MP was twice as fast as the Hitachi processor. Let’s examine why this might occur using two versions of VMIPS and the following code sequences:

C     Code sequence 1
      do 10 i = 1,10000
         A(i) = x * A(i) + y * A(i)
10    continue

C     Code sequence 2
      do 10 i = 1,100
         A(i) = x * A(i)
10    continue

Assume there is a version of VMIPS (call it VMIPS-II) that has two copies of every floating-point functional unit with full chaining among them. Assume that both VMIPS and VMIPS-II have two load-store units. Because of the extra functional units and the increased complexity of assigning operations to units, all the overheads (Tloop and Tstart) are doubled for VMIPS-II.

a. [15] < G.1–G.6 > Find the number of clock cycles on code sequence 1 on VMIPS.
b. [20] < G.1–G.6 > Find the number of clock cycles on code sequence 1 for VMIPS-II. How does this compare to VMIPS?
c. [15] < G.1–G.6 > Find the number of clock cycles on code sequence 2 for VMIPS.
d. [15] < G.1–G.6 > Find the number of clock cycles on code sequence 2 for VMIPS-II. How does this compare to VMIPS?

G.9

[20] < G.5 > Here is a tricky piece of code with two-dimensional arrays. Does this loop have dependences? Can these loops be written so they are parallel? If so, how? Rewrite the source code so that it is clear that the loop can be vectorized, if possible.

      do 290 j = 2,n
         do 290 i = 2,j
            aa(i,j) = aa(i-1,j)*aa(i-1,j) + bb(i,j)
290   continue

G.10

[12/15] < G.5 > Consider the following loop:

      do 10 i = 2,n
         A(i) = B
10    C(i) = A(i - 1)

a. [12] < G.5 > Show there is a loop-carried dependence in this code fragment.
b. [15] < G.5 > Rewrite the code in FORTRAN so that it can be vectorized as two separate vector sequences.

G.11

[15/25/25] < G.5 > As we saw in Section G.5, some loop structures are not easily vectorized. One common structure is a reduction—a loop that reduces an array to a single value by repeated application of an operation. This is a special case of a recurrence. A common example occurs in dot product:

      dot = 0.0
      do 10 i = 1,64
10       dot = dot + A(i) * B(i)

This loop has an obvious loop-carried dependence (on dot) and cannot be vectorized in a straightforward fashion. The first thing a good vectorizing compiler would do is split the loop to separate out the vectorizable portion and the recurrence and perhaps rewrite the loop as:

      do 10 i = 1,64
10       dot(i) = A(i) * B(i)
      do 20 i = 2,64
20       dot(1) = dot(1) + dot(i)

The variable dot has been expanded into a vector; this transformation is called scalar expansion. We can try to vectorize the second loop either relying strictly on the compiler (part (a)) or with hardware support as well (part (b)). There is an important caveat in the use of vector techniques for reduction. To make reduction work, we are relying on the associativity of the operator being used for the reduction. Because of rounding and finite range, however, floating-point arithmetic is not strictly associative. For this reason, most compilers require the programmer to indicate whether associativity can be used to more efficiently compile reductions.

a. [15] < G.5 > One simple scheme for compiling the loop with the recurrence is to add sequences of progressively shorter vectors—two 32-element vectors, then two 16-element vectors, and so on. This technique has been called recursive doubling. It is faster than doing all the operations in scalar mode. Show how the FORTRAN code would look for execution of the second loop in the preceding code fragment using recursive doubling.
b. [25] < G.5 > In some vector processors, the vector registers are addressable, and the operands to a vector operation may be two different parts of the same vector register. This allows another solution for the reduction, called partial sums. The key idea in partial sums is to reduce the vector to m sums where m is the total latency through the vector functional unit, including the operand read and write times. Assume that the VMIPS vector registers are addressable (e.g., you can initiate a vector operation with the operand V1(16), indicating that the input operand began with element 16). Also, assume that the total latency for adds, including operand read and write, is eight cycles. Write a VMIPS code sequence that reduces the contents of V1 to eight partial sums. It can be done with one vector operation.
c. [25] < G.5 > Discuss how adding the extension in part (b) would affect a machine that had multiple lanes.

G.12

[40] < G.3–G.4 > Extend the MIPS simulator to be a VMIPS simulator, including the ability to count clock cycles. Write some short benchmark programs in MIPS and VMIPS assembly language. Measure the speedup on VMIPS, the percentage of vectorization, and usage of the functional units.

G.13

[50] < G.5 > Modify the MIPS compiler to include a dependence checker. Run some scientific code and loops through it and measure what percentage of the statements could be vectorized.

G.14

[Discussion] Some proponents of vector processors might argue that the vector processors have provided the best path to ever-increasing amounts of processor power by focusing their attention on boosting peak vector performance. Others would argue that the emphasis on peak performance is misplaced because an increasing percentage of the programs are dominated by nonvector performance. (Remember Amdahl’s law?) The proponents would respond that programmers should work to make their programs vectorizable. What do you think about this argument?

H.1 Introduction: Exploiting Instruction-Level Parallelism Statically    H-2
H.2 Detecting and Enhancing Loop-Level Parallelism    H-2
H.3 Scheduling and Structuring Code for Parallelism    H-12
H.4 Hardware Support for Exposing Parallelism: Predicated Instructions    H-23
H.5 Hardware Support for Compiler Speculation    H-27
H.6 The Intel IA-64 Architecture and Itanium Processor    H-32
H.7 Concluding Remarks    H-43

H   Hardware and Software for VLIW and EPIC

The EPIC approach is based on the application of massive resources. These resources include more load-store, computational, and branch units, as well as larger, lower-latency caches than would be required for a superscalar processor. Thus, IA-64 gambles that, in the future, power will not be the critical limitation, and that massive resources, along with the machinery to exploit them, will not penalize performance with their adverse effect on clock speed, path length, or CPI factors. M. Hopkins in a commentary on the EPIC approach and the IA-64 architecture (2000)


H.1 Introduction: Exploiting Instruction-Level Parallelism Statically

In this appendix, we discuss compiler technology for increasing the amount of parallelism that we can exploit in a program as well as hardware support for these compiler techniques. The next section defines when a loop is parallel, how a dependence can prevent a loop from being parallel, and techniques for eliminating some types of dependences. The following section discusses the topic of scheduling code to improve parallelism. These two sections serve as an introduction to these techniques. We do not attempt to explain the details of ILP-oriented compiler techniques, since that would take hundreds of pages, rather than the 20 we have allotted. Instead, we view this material as providing general background that will enable the reader to have a basic understanding of the compiler techniques used to exploit ILP in modern computers. Hardware support for these compiler techniques can greatly increase their effectiveness, and Sections H.4 and H.5 explore such support. The IA-64 represents the culmination of the compiler and hardware ideas for exploiting parallelism statically and includes support for many of the concepts proposed by researchers during more than a decade of research into the area of compiler-based instruction-level parallelism. Section H.6 provides a description and performance analyses of the Intel IA-64 architecture and its second-generation implementation, Itanium 2.

The core concepts that we exploit in statically based techniques—finding parallelism, reducing control and data dependences, and using speculation—are the same techniques we saw exploited in Chapter 3 using dynamic techniques. The key difference is that the techniques in this appendix are applied at compile time by the compiler, rather than at runtime by the hardware. The advantages of compile time techniques are primarily two: They do not burden runtime execution with any inefficiency, and they can take into account a wider range of the program than a runtime approach might be able to incorporate. As an example of the latter, the next section shows how a compiler might determine that an entire loop can be executed in parallel; hardware techniques might or might not be able to find such parallelism. The major disadvantage of static approaches is that they can use only compile time information. Without runtime information, compile time techniques must often be conservative and assume the worst case.

H.2 Detecting and Enhancing Loop-Level Parallelism

Loop-level parallelism is normally analyzed at the source level or close to it, while most analysis of ILP is done once instructions have been generated by the compiler. Loop-level analysis involves determining what dependences exist among the operands in a loop across the iterations of that loop. For now, we will
consider only data dependences, which arise when an operand is written at some point and read at a later point. Name dependences also exist and may be removed by renaming techniques like those we explored in Chapter 3. The analysis of loop-level parallelism focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations; such a dependence is called a loop-carried dependence. Most of the examples we considered in Section 3.2 have no loop-carried dependences and, thus, are loop-level parallel. To see that a loop is parallel, let us first look at the source representation:

for (i=1000; i>0; i=i-1)
    x[i] = x[i] + s;

In this loop, there is a dependence between the two uses of x[i], but this dependence is within a single iteration and is not loop carried. There is a dependence between successive uses of i in different iterations, which is loop carried, but this dependence involves an induction variable and can be easily recognized and eliminated. We saw examples of how to eliminate dependences involving induction variables during loop unrolling in Section 3.2, and we will look at additional examples later in this section. Because finding loop-level parallelism involves recognizing structures such as loops, array references, and induction variable computations, the compiler can do this analysis more easily at or near the source level, as opposed to the machine-code level. Let’s look at a more complex example.

Example

Consider a loop like this one: for (i=1; i