Lennart Bamberg Jan Moritz Joseph Alberto García-Ortiz Thilo Pionteck
3D Interconnect Architectures for Heterogeneous Technologies: Modeling and Optimization
Lennart Bamberg NXP Semiconductors Hamburg, Germany
Jan Moritz Joseph RWTH Aachen University Aachen, Germany
Alberto García-Ortiz Universität Bremen Bremen, Germany
Thilo Pionteck Otto-von-Guericke Universität Magdeburg Magdeburg, Germany
ISBN 978-3-030-98228-7
ISBN 978-3-030-98229-4 (eBook)
https://doi.org/10.1007/978-3-030-98229-4
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2022

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To our families, partners, friends, and mentors.
Preface
The global interconnects are one of the most fundamental bottlenecks in modern systems on chips (SoCs). They determine to a large degree the performance and energy consumption of the complete system. To palliate this problem, it is mandatory to explore new directions in integrated circuit (IC) manufacturing beyond mere logic scaling, which has little impact on the interconnect performance. One of the most promising emerging technologies here is silicon integration into the third dimension. In three-dimensional (3D) ICs, the maximum wire length is drastically reduced, which improves nearly all key metrics: timing/latency, power consumption, and area. Moreover, 3D integration enables heterogeneous stacking, in which dies manufactured in different technology nodes are tightly co-integrated. This enables block-level heterogeneous integration, that is, each block can be manufactured in the best-suited technology node. Thereby, vast improvements can be achieved for systems with mixed-signal or noise-critical blocks as well as pure-digital blocks, such as modern artificial intelligence (AI) accelerators and embedded systems. However, 3D integration entails the challenge that the interconnect architecture also has to be 3D; that is, it has to interconnect blocks integrated over various layers/dies. For heterogeneous stacking, this becomes even more challenging, as the different dies have different electrical characteristics. Consequently, the same logic block incurs different integration costs (in terms of power, timing, and area) in the different dies of the stack. This demands an interconnect architecture that is not only 3D but also heterogeneous.
Scope

The efficient use of 3D technologies, especially with heterogeneous integration, requires work on all levels of abstraction, from low-level physical effects to system-level aspects, as architects face many cross-cutting issues when designing systems for such drastically different emerging technologies. This book provides
a comprehensive and unique approach for the modeling and optimization of interconnect architectures in 3D SoCs at all abstraction levels, specifically addressing the challenges and opportunities arising from heterogeneous stacking. Thereby, it improves the most relevant quality metrics: throughput, power consumption, area, and yield. To date, no comprehensive work has focused on interconnect architectures for heterogeneous 3D SoCs; this book closes that gap. Furthermore, it is the first comprehensive work to investigate the potential gains of cross-layer optimization for 3D interconnect networks, starting at the lowest (physical) abstraction level and going up via the gate level to router (micro-)architectures before spanning system-level aspects such as SoC floor planning. For this purpose, an extensive set of power-performance models for 3D-interconnect structures, abstract enough for architectural exploration and optimization yet physically precise, is derived. These models consider complex physical phenomena but are still generic enough to be highly reusable. Thereby, the models build a solid foundation for quantitative, technology-aware architecture evaluation of 3D integrated systems. Moreover, the book contributes optimization techniques based on the presented models. These techniques demonstrate the high value of the presented models for 3D-interconnect optimization and performance estimation, as they provide massive improvements over the state of the art.
Contribution

This book is a unique reference for all experts in academia and industry who require a deep and ready-to-use understanding of the state of the art in interconnect architectures for 3D ICs based on through-silicon vias (TSVs), and of the challenges and opportunities of heterogeneous integration. Readers will learn about the physical implications of using heterogeneous 3D stacking for SoC integration. They will also understand how to get the most out of the used technology through physical-effect-aware architecture design with concrete optimization methods. The book provides a deep theoretical background covering all abstraction levels needed to research and architect tomorrow's 3D SoCs. Moreover, the book comes with an open-source optimization and simulation framework for a systematic exploration of novel 3D interconnect architectures. Specifically, this book contributes:

1. Modeling and optimization techniques for technological and system-level aspects, hardware architectures, and physical-design tools
2. Abstract yet physically precise models for the power consumption and performance of TSV-based 3D interconnects to optimize and evaluate interconnect architectures for 3D SoCs
3. A wide set of generic, ready-to-use optimization techniques which substantially improve the power consumption, performance, and yield of TSV-based 3D interconnects while exploiting key characteristics of heterogeneous integration
4. Novel architectures for 3D networks on chips (NoCs) using heterogeneous integration, providing higher network performance at lower power consumption and area
5. An open-source NoC simulator and optimization tool for heterogeneous SoC architectures
Structure
[Figure: structure of the book's core parts. Modeling: Part II (Technology Modeling) at the circuit level, and Part III (System Modeling) at the network-architecture and SoC-design levels. Optimization: Part IV (Link Optimization) at the bit level, and Part V (NoC Optimization) at the network-architecture and SoC-design levels.]
The book starts with an introductory part on 3D integration, interconnects, and NoCs. This provides the knowledge required by the reader to understand the book's contributions. As the book offers insights for architects working on system-level as well as physical-design aspects, we provide two separate chapters covering the complementary basic knowledge. Afterwards, the book's core parts follow; their structure is shown in the figure above. We focus on modeling first and follow a comprehensive route starting at the low abstraction levels. We start at the bottom layer with the modeling of the physical interconnects. In detail, we contribute universally valid formulas for the power consumption and performance of 3D interconnects based on characteristics of the transmitted data. These formulas operate on the logical bit level, enabling the design and analysis of complex optimization techniques such as data encoding for improved interconnect power consumption and performance. In Part III, we shift gears towards the higher abstraction levels. We start by introducing models for the application and simulation of NoC architectures. Then, we use the application models to find fast and precise models that estimate the bit-level statistics needed for our interconnect power models from Part II. This makes the low-level models available at the system level for a fast yet accurate power-consumption estimation. We consolidate these two approaches in our Ratatoskr framework, which can model, simulate, and optimize 3D NoCs for heterogeneous 3D integration. This simulation tool is extensively used in the next part on optimization.
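To make the notion of "bit-level statistics" concrete, here is a small illustration of ours (not the book's actual models): the dynamic power of a link scales with how often each line toggles, so a first-order estimate needs the per-line toggle probabilities of the transmitted data. The function name `toggle_probabilities` is hypothetical.

```python
# Illustrative sketch (not the book's models): estimate per-line toggle
# probabilities of a data stream. These are the bit-level statistics that
# drive dynamic-power estimates of a link.
def toggle_probabilities(stream, width):
    """stream: list of integers transmitted over a `width`-bit link."""
    toggles = [0] * width
    for prev, cur in zip(stream, stream[1:]):
        diff = prev ^ cur                 # XOR marks the lines that switch
        for i in range(width):
            toggles[i] += (diff >> i) & 1
    n = len(stream) - 1                   # number of transmitted transitions
    return [t / n for t in toggles]

# Example: a 4-bit counter toggles its LSB every cycle, bit 1 every
# second cycle, and so on.
print(toggle_probabilities(list(range(9)), 4))  # [1.0, 0.5, 0.25, 0.125]
```

A data encoding that lowers these probabilities on the most capacitive lines lowers the link's dynamic power, which is the kind of optimization the preface refers to.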
We start with the optimization of the physical 3D-interconnect structures in Part IV. These optimization techniques are based on the lowest abstraction levels (physical/circuit level and bit/gate level), as they exploit physical phenomena of 3D interconnects identified in Part II. A wide range of methods is contributed to optimize various interconnect metrics: some methods only optimize the interconnect power consumption, while others also optimize the timing/performance and the manufacturing yield. Again, we shift gears towards higher abstraction levels in Part V by optimizing NoCs in heterogeneous 3D SoCs. We thoroughly optimize each component of the NoC. We begin with the main contributor to NoC area, the buffers. Then, we propose a routing algorithm that also affects the crossbar and link architectures of the NoC router to improve performance. Next, we improve the distribution of virtual channels (VCs) in the NoC for high-performance or low-power designs. Finally, we move towards the system level and propose application-specific network-synthesis and SoC floor-planning optimizations. The book is concluded in Part VI.
Acknowledgments

We acknowledge the practical and theoretical participation in the project work of our students Imad Hajjar, Dominik Ermel, Christopher Blochwitz, Sven Wrieden, and Behnam Razi Perjikolaei (listed in no particular order). They contributed to this book with many hours of patient programming, debugging, and modeling. Thanks to our (former) colleagues Anna Drewes and Robert Schmidt for intellectual help, emotional support, and the countless discussions. We would like to thank Prof. Dr. Rainer Leupers (RWTH Aachen University, Germany) for his support in realizing this book. The research this book is based on was mainly funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), project number 328514428.

Hamburg, Germany
Aachen, Germany
Bremen, Germany
Magdeburg, Germany
December 2021
Lennart Bamberg Jan Moritz Joseph Alberto García-Ortiz Thilo Pionteck
Contents
Part I Introduction

1 Introduction to 3D Technologies
  1.1 Motivation for Heterogeneous 3D ICs
    1.1.1 Motivation for 3D: Wire-Length Reduction
    1.1.2 Motivation for 3D: Heterogeneous Integration
    1.1.3 Examples of Applications
  1.2 3D Technologies
    1.2.1 Monolithic Integration
    1.2.2 TSV-Based Integration
  1.3 TSV Capacitances—A Problem Resistant to Scaling
    1.3.1 Model to Extract the TSV Parasitics
    1.3.2 Analysis
  1.4 Conclusion

2 Interconnect Architectures for 3D Technologies
  2.1 Interconnect Architectures
    2.1.1 Networks on Chips (NoCs)
  2.2 Overview of Interconnect Architectures for 3D ICs
  2.3 Three-Dimensional Networks on Chips
    2.3.1 Performance, Power, and Area
  2.4 Conclusion

Part II 3D Technology Modeling

3 High-Level Formulas for the 3D-Interconnect Power Consumption and Performance
  3.1 High-Level Formula for the Power Consumption
    3.1.1 Effective Capacitance
    3.1.2 Power Consumption
  3.2 High-Level Formula for the Propagation Delay
  3.3 Matrix Formulations
  3.4 Evaluation
  3.5 Conclusion

4 High-Level Estimation of the 3D-Interconnect Capacitances
  4.1 Existing Capacitance Models
  4.2 Edge and MOS Effects on the TSV Capacitances
    4.2.1 MOS Effect
    4.2.2 Edge Effects
  4.3 TSV Capacitance Model
  4.4 Evaluation
    4.4.1 Model Coefficients and Accuracy
    4.4.2 Accuracy for the Estimation of the TSV Power Consumption and Performance
  4.5 Conclusion

Part III System Modeling

5 Models for Application Traffic and 3D-NoC Simulation
  5.1 Overview of the Modeling Approach
  5.2 Application Traffic Model
  5.3 Simulation Model of 3D NoCs
  5.4 Simulator Interfaces
  5.5 Conclusion

6 Estimation of the Bit-Level Statistics
  6.1 Existing Approaches to Estimate the Bit-Level Statistics for Single Data Streams
    6.1.1 Random Data
    6.1.2 Normally Distributed Data
    6.1.3 One-Hot-Encoded Data
    6.1.4 Sequential Data
  6.2 Data-Stream Multiplexing
    6.2.1 Data-Stream Multiplexing to Reduce the TSV Count
    6.2.2 Data-Stream Multiplexing in NoCs
    6.2.3 Impact on the Power Consumption
  6.3 Estimation of the Bit-Level Statistics in the Presence of Data-Stream Multiplexing
  6.4 Evaluation
    6.4.1 Model Accuracy
    6.4.2 Low-Power Coding
  6.5 Conclusion

7 Ratatoskr: A Simulator for NoCs in Heterogeneous 3D SoCs
  7.1 Ratatoskr for Practitioners
    7.1.1 Parts and Functionality
    7.1.2 User Input: Setting Design Parameters
    7.1.3 User Output: Power-Performance-Area Reports
    7.1.4 Router Architecture
    7.1.5 Power Model of Interconnects Using Data-Flow Matrices
  7.2 Implementation
    7.2.1 NoC Simulator in C++/SystemC
    7.2.2 Router in VHDL
    7.2.3 Power Models in C++/Python
  7.3 Evaluation
    7.3.1 Simulation Performance
    7.3.2 Power, Performance, and Area of the RTL Router Model
  7.4 Case Study: Link Power Estimation and Optimization
  7.5 Conclusion

Part IV 3D-Interconnect Optimization

8 Low-Power Technique for 3D Interconnects
  8.1 Fundamental Idea
  8.2 Power-Optimal TSV Assignment
  8.3 Systematic Net-to-TSV Assignments
  8.4 Combination with Traditional Low-Power Codes
  8.5 Evaluation
    8.5.1 Worst-Case Impact on the 3D-Interconnect Parasitics
    8.5.2 Systematic Versus Optimal Assignment for Real Data
    8.5.3 Combination with Traditional Coding Techniques
  8.6 Conclusion

9 Low-Power Technique for High-Performance 3D Interconnects
  9.1 Edge-Effect-Aware Crosstalk Classification
  9.2 Existing Approaches and Their Limitations
  9.3 Proposed Technique
    9.3.1 General TSV-CAC Approach
    9.3.2 3D-CAC Technique
  9.4 Extension to a Low-Power 3D CAC
  9.5 Evaluation
    9.5.1 TSV-Performance Improvement
    9.5.2 Simultaneous TSV Delay and Power-Consumption Reduction
    9.5.3 Comparison with Existing 3D CACs
  9.6 Conclusion

10 Low-Power Technique for High-Performance 3D Interconnects in the Presence of Temporal Misalignment
  10.1 Temporal-Misalignment Effect on the Crosstalk
    10.1.1 Linear Model
    10.1.2 Look-Up-Table Model
  10.2 Exploiting Misalignment to Improve the Performance
  10.3 Effect on the TSV Power Consumption
  10.4 Evaluation
    10.4.1 Expected Delay Reduction
    10.4.2 Delay Reduction for Various Misalignment Scenarios
    10.4.3 Comparison with 3D-CAC Techniques
  10.5 Conclusion

11 Low-Power Technique for Yield-Enhanced 3D Interconnects
  11.1 Existing TSV Yield-Enhancement Techniques
  11.2 Preliminaries—Logical Impact of TSV Faults
  11.3 Fundamental Idea
  11.4 Formal Problem Description
    11.4.1 Decodability
    11.4.2 Circuit Complexity
  11.5 TSV Redundancy Schemes
    11.5.1 Fixed-Decoding Scheme
    11.5.2 Fixed-Encoding Scheme
  11.6 Evaluation
    11.6.1 Yield Enhancement
    11.6.2 Impact on the Power Consumption
    11.6.3 Hardware Complexity
  11.7 Case Study
  11.8 Conclusion

Part V NoC Optimization for Heterogeneous 3D Integration

12 Heterogeneous Buffering for 3D NoCs
  12.1 Buffer Distributions and Depths
  12.2 Routers with Optimized Buffer Distribution
    12.2.1 Router Pipelines
  12.3 Routers with Optimized Buffer Depths
  12.4 Evaluation
    12.4.1 Routers with Optimized Buffer Distribution
    12.4.2 Routers with Optimized Buffer Depths
    12.4.3 Combination of Both Optimizations
    12.4.4 Influence of Clock Frequency Deviation
  12.5 Discussion
  12.6 Conclusion

13 Heterogeneous Routing for 3D NoCs
  13.1 Heterogeneity and Routing
  13.2 Modeling Heterogeneous Technologies
    13.2.1 Area
    13.2.2 Timing
  13.3 Modeling Communication
    13.3.1 Horizontal Communication
    13.3.2 Vertical Communication
  13.4 Routing Limitations from Heterogeneity
    13.4.1 Tackling Latency with Routing Algorithms
    13.4.2 Tackling Throughput with Router Architectures
  13.5 Heterogeneous Routing Algorithms
    13.5.1 Fundamentals of Heterogeneous Routing Algorithms
    13.5.2 Model of the NoC
    13.5.3 Z+(XY)Z− Routing Algorithm
    13.5.4 Proof of Deadlock and Livelock Freedom
    13.5.5 Z+(XY)Z−: R1 is Deadlock Free
    13.5.6 Livelock Freedom
  13.6 Heterogeneous Router Architectures
    13.6.1 High Vertical-Throughput Router
    13.6.2 Pseudo-Mesochronous High-Throughput Link
  13.7 Low-Power Routing in Heterogeneous 3D ICs
  13.8 Evaluation
    13.8.1 Model Accuracy
    13.8.2 Latency of Routing Algorithm Z+(XY)Z−
    13.8.3 Throughput of High Vertical-Throughput Router
    13.8.4 Area Costs
    13.8.5 Power Savings
    13.8.6 Case Study
  13.9 Discussion
    13.9.1 Model Accuracy
    13.9.2 Power-Performance-Area Evaluation of Heterogeneous Routing
  13.10 Conclusion

14 Heterogeneous Virtualisation for 3D NoCs
  14.1 Problem Description
  14.2 Heterogeneous Microarchitectures Exploiting Traffic Imbalance
  14.3 Evaluation
    14.3.1 Area and Energy Consumption
    14.3.2 Network Performance
  14.4 Conclusion

15 Network Synthesis and SoC Floor Planning
  15.1 Fundamental Idea
    15.1.1 Existing Approaches
  15.2 Modelling and Optimization
    15.2.1 Router Model
    15.2.2 Three-Dimensional Technology Model
    15.2.3 Modelling Assumptions
  15.3 Mixed-Integer Linear Program
    15.3.1 Constants and Definitions
    15.3.2 Variables
    15.3.3 Objective Function
    15.3.4 Constraints
    15.3.5 Case Study: Modeling Elevator-First Dimension-Order Routing
  15.4 Heuristic Solution
    15.4.1 Heuristic Algorithm
  15.5 Evaluation
    15.5.1 Performance and Computational Complexity
    15.5.2 Mixed-Integer Linear Program
    15.5.3 Heuristic Algorithm
    15.5.4 Optimization Results
    15.5.5 Discussion
    15.5.6 Case Study: 3D VSoC
  15.6 Conclusion

Part VI Finale

16 Conclusion
  16.1 Modeling and Optimization of 3D Interconnects
  16.2 Modeling and Optimization of 3D NoCs
  16.3 System Design
  16.4 Putting It All Together
  16.5 Impact on Future Work

A Pseudo Codes

B Method to Calculate the Depletion-Region Widths

C Modeling Logical OR Relations

References
Specific Mathematical Symbols

¬  bit-wise Boolean negation
◦  Hadamard-product operator (for matrices)
⟨·, ·⟩  Frobenius inner product of two matrices
B^n  vector of n binary numbers
B^(n×m)  matrix with n rows and m columns of binary numbers
B  set of binary numbers: {0, 1}
E{·}  expectation operator
F2  binary Galois field
N  set of natural numbers: {0, 1, 2, 3, ...}
Q  set of rational numbers
R  set of real numbers
L{·}  Laplace transform
1_S  indicator function of a set S; evaluates to 1 for s ∈ S, 0 otherwise
⊕  Boolean XOR operator
=!  demanded equality
=def  defined as being equal to
∼  proportionality
exp()  exponent operation
max diag()  maximum diagonal entry of a matrix
max_{k,i}()  maximum entry of a discrete-time vector over all cycles
spark()  spark of a matrix
max()  maximum matrix of two matrices (entry-by-entry)
log2()  base-2 logarithm
min(·)  minimum of a function
min{·}  minimum of a vector
mod()  modulo operation
‖·‖1  Manhattan/L1 norm: ‖p‖1 = Σ_{i=1}^{n} |p_i| for p ∈ R^n
‖·‖2  Euclidean-distance/L2 norm: ‖p‖2 = (Σ_{i=1}^{n} p_i²)^{1/2} for p ∈ R^n
Specific Units

GE  gate equivalent
pp  percentage point
Specific Symbols Used in Multiple Chapters

bi  clock-cycle-based self switching of bi
C i,j  change in Ci,j with increasing pi and pj
C  matrix of the C i,j values
Colors of tokens in a Petri net, used equivalently with a type and properties of a data stream for power modeling
αi  switching activity of bi
γi,j  switching correlation of bi and bj
SIπ,n  set of all valid n×n permutation matrices
ρ  correlation of subsequent data words
σ  standard deviation of the data words
δi,j  clock-cycle-based crosstalk factor for bi and bj
bi  clock-cycle-based bit signal
Cc0  ground capacitance of a corner TSV
Cc1  coupling capacitance of a corner TSV and a directly adjacent TSV
Cc2  coupling capacitance of a corner TSV and an indirectly adjacent TSV
Cd  coupling capacitance of diagonally adjacent TSVs
Ce0  ground capacitance of an edge TSV
Ce1  coupling capacitance of two directly adjacent edge TSVs
Ce2  coupling capacitance of two indirectly adjacent edge TSVs
Ceff,i  clock-cycle-based effective capacitance of interconnect i
Ĉeff  maximum possible effective capacitance for all interconnects
C̄eff,i  mean effective capacitance of interconnect i
C_eff  clock-cycle-based vector of the Ceff,i values
CG,i,j  Ci,j value for pi and pj equal to 0
CG  matrix of the CG,i,j values
Ci,j  coupling capacitance between interconnects i and j for i ≠ j; ground capacitance of interconnect i for i = j
C  matrix of the Ci,j values
Cmw,c  coupling capacitance of adjacent metal wires
Cmw,g  ground capacitance of a metal wire
Cn  coupling capacitance of directly adjacent middle TSVs
CR,i,j  Ci,j value for pi and pj equal to 0.5
CR  matrix of the CR,i,j values
dmin  minimum TSV pitch
f  clock frequency (i.e., 1/Tclk)
fs  significant frequency of the TSV signals
Iπ  permutation matrix
Iπ,perf-opt  performance-optimal net-to-TSV assignment expressed as a permutation matrix
ltsv  TSV length
M×N  TSV-array shape
P  interconnect power consumption
pi  probability of bi being 1
RD  driver equivalent resistance
rtsv  TSV radius
XYZ  conventional dimension-ordered routing algorithm in 3D
Z+(XY)Z−  novel minimal heterogeneous routing algorithm
ZXYZ  novel non-minimal heterogeneous routing algorithm
S  clock-cycle-based switching matrix of the bit values
SE  mean switching matrix of the bit values
Swc  worst-case switching matrix of the bit values
Tclk  cycle duration of the clock
TD,0  driver-induced offset in the signal propagation delay
Tedge,i  delay of the signal edges on bi relative to the rising clock edges
tox  TSV-oxide thickness
Tpd  interconnect signal-propagation delay
T̂pd  maximum interconnect signal-propagation delay
Vdd  power-supply voltage
wdep,i  depletion-region width of TSV i
W/Lmin  transistor sizing
Acronyms

2.5D  2.5-dimensional
2D  two-dimensional
3D  three-dimensional
ADC  analog-to-digital converter
AI  artificial intelligence
AND  logical conjunction
ASIC  application-specific integrated circuit
BEOL  back-end of line
BP  bus partitioning
BPSK  binary phase-shift keying
CA  cycle-accurate-level model
CAC  crosstalk-avoidance code
CBI  classical bus invert
CIS  complementary metal-oxide-semiconductor (CMOS) image sensor
CMOS  complementary metal-oxide-semiconductor
CNFET  carbon-nanotube field-effect transistor
CNN  convolutional neural network
CODEC  coder-decoder circuit
CPU  central processing unit
DAC  digital-to-analog converter
DOR  dimension-ordered routing
DRAM  dynamic random-access memory
DSA  domain-specific accelerator
DSP  digital signal processor
EDA  electronic design automation
EDP  energy-delay product
EM  electromagnetic
FEOL  front-end of line
FinFET  fin field-effect transistor
FPF  forbidden-pattern free
FPGA  field-programmable gate array
FSS  full-system simulator
FTF  forbidden-transition free
GALS  globally asynchronous locally synchronous
GP  general-purpose
GPU  graphics processing unit
IC  integrated circuit
ILD  inter-layer dielectric
ILP  integer linear program
ITRS  International Technology Roadmap for Semiconductors
KOZ  keep-out zone
LP  linear program
LPC  low-power code
LSB  least significant bit
LUT  lookup table
M3D  monolithic 3D
MAE  maximum absolute error
MILP  mixed-integer linear program
MIV  monolithic inter-tier via
MOS  metal-oxide-semiconductor
MOSFET  MOS field-effect transistor
MSB  most significant bit
NAND  logical non-conjunction
NI  network interface
NMAE  normalized maximum absolute error
NoC  network on a chip
NOR  logical non-disjunction
NRMSE  normalized root-mean-square error
NVM  non-volatile memory
OR  logical disjunction
P/G  power or ground
PE  processing element
PPA  power-performance-area
PSO  particle swarm optimization
PTM  Predictive Technology Model
QAM16  16-point quadrature amplitude modulation
QAM64  64-point quadrature amplitude modulation
QOR  quality of results
QoS  quality-of-service
RAM  random-access memory
RC  resistance-capacitance
RD  redistribution
RF  radio frequency
RGB  red-green-blue
RLC  resistance-inductance-capacitance
RMS  root mean square
RMSE  root-mean-square error
RR  repair register
RRAM  resistive random-access memory
RS  repair signature
RTL  register-transfer level
Rx  receiver
SA  stuck-at
SA0  stuck-at-0
SA1  stuck-at-1
SDP  semi-definite program
SIMD  single instruction multiple data
SoC  system on a chip
SOI  silicon on insulator
SRAM  static random-access memory
TFT  thin film transistor
TLM  transaction-level model
TSV  through-silicon via
TSV-3D  TSV-based 3D
Tx  transmitter
ULV  ultra-low voltage
VC  virtual channel
VHDL  very high speed integrated circuit hardware description language
VLSI  very-large-scale integration
VSoC  vision-system on a chip
XNOR  logical exclusive non-disjunction
XOR  logical exclusive disjunction
Part I
Introduction
Chapter 1
Introduction to 3D Technologies
Before discussing the modeling and optimization of 3D interconnect architectures for heterogeneous technologies in detail, it is advisable to review the key ideas of 3D integration and interconnect architectures. This is precisely the goal of the first part of the book. We start in this chapter with an introduction to 3D technologies, with a special focus on heterogeneous aspects. A precise discussion of 3D technologies, with all their different alternatives and continuous development, is outside the scope of this book. Rather, in this first introductory chapter, our goal is to present a short overview of the different 3D technologies and to provide enough background information so that the ideas discussed in Parts II–IV of this book can be properly assessed. For that, we present the fundamental characteristics of monolithic 3D (M3D) integration and TSV-based 3D (TSV-3D) integration and discuss their key advantages and disadvantages. Special emphasis is placed on the aspect of heterogeneity, i.e., the possibility of constructing a system in which each of the vertically interconnected tiers/dies is integrated in the optimal technology for its purpose. The remainder of this chapter is structured in three main parts. Firstly, we summarize the key motivations for 3D integration in Sect. 1.1; after that, we overview the two key alternatives for 3D integration in Sect. 1.2; and then, we outline the key challenges in terms of modeling and optimization for 3D interconnects, with emphasis on through-silicon via (TSV)-based approaches, in Sect. 1.3. Finally, the chapter is concluded in Sect. 1.4.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. L. Bamberg et al., 3D Interconnect Architectures for Heterogeneous Technologies, https://doi.org/10.1007/978-3-030-98229-4_1

1.1 Motivation for Heterogeneous 3D ICs

As traditional complementary metal-oxide-semiconductor (CMOS) scaling comes to an end [114], 3D integration is seen as a viable solution to sustain the promise of Moore's law. Three-dimensional integration, as an alternative to standard 2D and
Fig. 1.1 Impact of 3D integration on the wire lengths and the system footprint. 2D: silicon area = x², footprint = x², longest wire = 2x. 3D (2 tiers): silicon area = 2·(x/√2)² = x², footprint = x²/2, longest wire = 2·(x/√2) = √2·x. 3D (4 tiers): silicon area = 4·(x/2)² = x², footprint = x²/4, longest wire = 2·(x/2) = x.
2.5-dimensional (2.5D) integration,1 not only promises a significant reduction in latency, power consumption, and area, as well as an increase in bandwidth [221], but also allows for extrinsic heterogeneity. Tiers or dies with different electrical characteristics can be tightly co-integrated, enabling new architectures and system functionalities [47]. In the following, we discuss these two fundamental advantages in more detail.
1.1.1 Motivation for 3D: Wire-Length Reduction

The first fundamental reason motivating the use of 3D integration is the resulting decrease in wire lengths over an equivalent 2D-integrated system [194]. Figure 1.1 illustrates the reason behind this phenomenon in more detail. Consider a 2D integrated circuit (IC) with a square floorplan that has a side length of x. Hence, the available silicon/substrate area, as well as the footprint of the IC, is x². Since wires are only routed in two directions that are orthogonal to each other (i.e., Manhattan routing), the maximum wire length in an IC is equal to the sum of the two side lengths of the rectangular floorplan (here 2x). Now consider an alternative implementation as a 3D IC made up of two stacked substrates (referred to as tiers). With a footprint that is half as big, the same substrate area is available in this 3D IC as in the 2D IC. Thus, the side lengths of the floorplan can theoretically be reduced by a factor of √2× to x/√2, without increasing the
1 Integration methods in which the vertical interconnects between the tiers are realized off-chip (e.g., through wire bonding or a passive interposer) do not provide a high vertical throughput. Thus, such techniques are referred to here as 3D-packaging or 2.5D-integration, rather than 3D-integration, strategies [194].
integration density in the substrates with respect to the substrate of the 2D IC. When the number of tiers in the 3D IC is again doubled to four, the side lengths, and hence the wire lengths, can be further reduced by √2× to x/2, while still providing the same substrate area for integration. Hence, the maximum and mean wire lengths are reduced by about a factor of √NT×, where NT is the number of tiers of the 3D system (i.e., physically stacked substrates). Thus, 3D integration results in a continuous scaling of the wire lengths with an increase in the number of tiers. Please note that, in this motivational example, overhead costs for the vertical interconnects—required to establish inter-tier connections—are not considered. Thus, 3D integration is a viable solution for the interconnect bottleneck only if the power consumption, performance, or area of the vertical interconnects does not cancel out the promised wire-length savings gained from the 3D organization. Hence, efficient vertical interconnects between the substrates are imperative to obtain the promised interconnect-related power and performance gains from integration into the third dimension.
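The scaling argument can be checked with a short numeric sketch (an illustration added here, not part of the book); it reproduces the idealized numbers of Fig. 1.1 while, as in the motivational example, ignoring the overhead of the vertical interconnects:

```python
import math

def ideal_3d_scaling(x: float, n_tiers: int):
    """Ideal scaling of a square floorplan with side length x when the same
    silicon area is split evenly across n_tiers stacked tiers (the overhead
    of the vertical interconnects is ignored)."""
    side = x / math.sqrt(n_tiers)        # side length of each tier
    footprint = side ** 2                # x^2 / n_tiers
    longest_wire = 2 * side              # Manhattan routing: sum of both sides
    silicon_area = n_tiers * footprint   # total silicon area stays x^2
    return side, footprint, longest_wire, silicon_area

for n in (1, 2, 4):
    side, fp, wire, area = ideal_3d_scaling(1.0, n)
    print(f"{n} tier(s): footprint={fp:.2f}, longest wire={wire:.2f}, area={area:.2f}")
```

For x = 1, the longest wire shrinks from 2 (2D) to √2 (two tiers) to 1 (four tiers), i.e., by the factor √NT derived above, while the total silicon area stays constant.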
1.1.2 Motivation for 3D: Heterogeneous Integration

Besides the advantages in terms of wire-length reduction, another fundamental feature of 3D integration—often seen as even more promising than the implied wire-length reductions—is that it enables heterogeneous integration. As pointed out by Ref. [153], this is actually the final goal of 3D integration. In a 3D IC, the electrical characteristics of the individual tiers/substrates can be fine-tuned in such a way that each layer is particularly efficient for the integration of a specific kind of component. Furthermore, components that are physically located in different tiers are no longer constrained to be fully process-compatible with each other. This feature brings numerous advantages such as a decreased design complexity for full-custom components, a higher system performance, and a lower power consumption [2]. An already well-established example of heterogeneous integration is memory-on-logic stacking, where some dies are only dedicated to the integration of memory components (e.g., SRAM cells), while others are dedicated to the integration of semi-custom logic blocks made up of standard cells, as illustrated in Fig. 1.2a. Through such a heterogeneous 3D integration, the memory cells are no longer constrained to be process-compatible with logic cells, and vice versa, as they are located in different substrates. Already for a two-tier 3D system, memory-on-logic 3D integration has been shown to improve the performance of a placed and routed multi-core processor system by 36.8% compared to a 2D baseline design [198]. Other works such as Ref. [48] advocate increasing the degree of heterogeneity by adding a field-programmable gate array (FPGA) layer between a standard-logic layer (used to integrate a central processing unit (CPU)) and a memory layer, as illustrated in Fig. 1.2b. This
Fig. 1.2 Examples of heterogeneous 3D integration: (a) memory-logic 3D integration (memory-logic stack [198]); (b) memory-FPGA-CPU 3D integration (CPU-FPGA-DRAM stack [48])
organization promises a power-consumption reduction of up to 47.5% versus a baseline 2D system [48]. Mixed-signal systems on chips (SoCs) also benefit from heterogeneous 3D integration. In contrast to logic components, sensors and other analog or mixed-signal blocks typically do not benefit from using ultimately scaled technology nodes. Thus, in 3D-integrated mixed-signal SoCs, one or more substrates can be optimized for the integration of digital components by using an aggressively scaled technology, while other substrates are optimized for mixed-signal and analog components by using a less aggressively scaled technology. Hence, heterogeneous 3D integration promises significantly better power, performance, area, and cost metrics than homogeneous 2D or 3D integration for a broad set of systems.
1.1.3 Examples of Applications

The significant promises of 3D integration have generated a plethora of different architectures and paradigms: 3D-integrated dynamic random-access memory (DRAM) subsystems, 3D FPGAs [193], and even vision systems on chips (VSoCs) with stacked sensors [260]. In general, these architectures use either a 2.5D approach, where a silicon interposer connects wire-bonded heterogeneous dies, or a true 3D approach, using either monolithic integration or TSV-based integration. All these approaches allow and exploit heterogeneous integration. While common 2.5D technologies require routing through an interposer layer and only connect layers horizontally, a full 3D technology allows for a direct vertical connection between different dies or tiers. Thus, the integration density can be increased. The 2.5D technology is well established today. It is regularly used in current chip designs, such as in Intel's, AMD's, and Nvidia's recent CPUs and graphics processing units (GPUs), to connect the memory to the cores [3, 4, 167]. Another prominent example is the Xilinx Virtex-7 FPGA [251]. Here, heterogeneous
transceivers and homogeneous 28-nm FPGA dies are connected via a 65-nm silicon interposer. In heterogeneous TSV-based 3D integration, individually fabricated dies—potentially with very different electrical characteristics and technologies (analog, mixed-signal, logic, and memory)—are mechanically stacked on top of each other with vertical connections through the substrates [153, 216].2 In high-performance applications, for example, dies optimized for high-speed digital applications are connected with dies optimized for DRAM integration. In 3D VSoCs, a relatively conservative mixed-signal die (typically using a 130- or 180-nm technology) is connected with a high-speed digital die (using a technology node below 65 nm). A promising approach to mitigate the memory wall in processor-based systems is random-access memory (RAM) stacking [115]. Examples of this kind of 3D design are found in Refs. [5, 131, 135, 233]. These designs comprise two different silicon technologies on multiple layers, one optimized for logic and the other for memory. However, 3D stacking is also used to increase the bandwidth of individually packaged memory modules. For example, the Hybrid Memory Cube by Micron consists of a stack of four DRAM dies and a logic die that contains the DRAM interfaces [244]. Another approach for stacked/3D memory chips, which does not only allow for interposer stacking but also for vertical stacking, is the high bandwidth memory (HBM) standard [151]. Again, DRAM dies are stacked on top of an optional logic die. A more advanced technique, which does not only combine RAM blocks per die but uses several dies to form a RAM block, is DiRAM (dis-integrated memory) [219]. Here, memory cells and access transistors of one RAM block are spread among multiple dies. An integrated 3D-stacked multicore processor with stacked memory layers is presented in Ref. [150].
In contrast to the previously mentioned works, four processor layers are stacked on top of each other in addition to two RAM layers. Beyond these memory-based design paradigms, the VSoC provides an example of a 3D-SoC design composed of multiple layers with stacked sensors, mixed-signal units, and processing layers. In this area, a 3D demonstrator chip for the readout of pixel vertex detectors is presented in Ref. [257]. The design comprises three layers: one layer for the analog circuitry and storage capacitors, and two digital layers for the time-stamp circuitry and the sparsification logic, respectively. While this design does not include the actual sensor, an entire 3D sensor-processor circuit design is presented within the "Viscube" project [260]. This 3D sensor-processor does not only read out the sensor data but also offers image processing as well as basic feature-extraction capabilities on-chip. A 3D layout was chosen to meet the strict performance requirements, thus allowing for a frameless sensor plate and high frame rates. The design is intrinsically heterogeneous as it comprises four different layers. On the top is the sensor layer, which is connected
2 More details on the realization of TSV-based 3D integration will be provided in the next chapter.
to the subjacent layer using a bump-bonding interface. The second layer is a mixed-signal processing layer, which is connected to the frame-buffer layer via TSVs. Through-silicon vias are also used to connect the digital processor layer, which is the lowest layer of the 3D stack. Finally, a general concept for 3D stacking of heterogeneous layers is proposed for "smart dust" in ambient-intelligence applications. Here, small chips with broad functionality are required for sensing and interaction with the environment. The "eCube" [90] is a wireless, autonomous module concept. It consists of multiple stacked layers for power scavenging and saving, application, sensing, and processing tasks, and radio communication. In monolithic integration, the 3D stack is sequentially fabricated on top of a single handle substrate rather than mechanically stacking individually prefabricated 2D-integrated dies. This enables a drastic increase in the achievable density of vertical intra-die connections.3 The use of monolithic integration provides examples of even more disruptive applications of 3D integration. For example, a comparison of a multi-tier system against a 2D baseline, a 2.5D interposer system, and a 3D TSV-based system was done in Ref. [80]. By a simulative analysis, the authors showed an energy-delay product (EDP) improvement by a factor of 11.4×, 5.7×, and 1.2×, respectively. In Ref. [45], a post-layout analysis of a homogeneous 2-tier neural-network design showed a 22.3% iso-performance power saving. Even larger iso-performance power savings of 41% could be achieved for a 2-tier ARM Cortex-A processor core [255]. In Ref. [175], a 2-tier monolithic 3D FPGA with thin-film-transistor (TFT) SRAM over a 90-nm 9-layer Cu CMOS process is demonstrated. More recently, Ref. [250] presents a 2-tier TFT-based static random-access memory (SRAM).
A 4-tier system with one layer of silicon logic, one layer of carbon-nanotube field-effect transistors (CNFETs), one layer of resistive random-access memory (RRAM), and one layer of CNFETs and sensor logic is presented in Ref. [225] as a working prototype. These works illustrate the increasing relevance of M3D in industry and research, although this technology is still not as mature as 2.5D or TSV-based 3D integration.
1.2 3D Technologies Three-dimensional integration can be achieved with different manufacturing technologies, which can be categorized into two schemes: sequential and parallel [60]. The first category of 3D integration, called sequential or monolithic integration, allows for much finer integration of inter-tier connections [157]. For monolithic 3D integration (M3D), the stacked layers/tiers are sequentially fabricated on top of a single handle substrate [35, 93].
3 More on the realization of monolithic 3D ICs will be provided in the next section.
Regarding the second category, the most commonly used parallel integration scheme is TSV-based 3D integration (TSV-3D). Here, individually fabricated planar dies are stacked on top of each other and are vertically connected by TSVs. In the remainder of this section, we will outline the manufacturing as well as the pros and cons of these two 3D-IC variants in more detail.
1.2.1 Monolithic Integration

In this subsection, the concept of monolithic 3D integration for future 3D ICs is discussed in more detail, alongside its advantages and disadvantages. As previously mentioned, a monolithic 3D system is fabricated sequentially on a single handle bulk/substrate, one tier after another, instead of stacking pre-fabricated 2D dies [25, 27, 40]. The structure of a monolithic 3D IC is illustrated in Fig. 1.3. First, the active circuits and the metallization in the bottom tier (handle bulk) of a monolithic 3D IC are processed with standard techniques, which are also used to manufacture 2D ICs. Second, an inter-layer dielectric (ILD) is added on top of the metallization, followed by direct bonding of a top substrate. Afterward, this pure substrate is aggressively thinned down, for example through a wet-etching process. The resulting silicon-on-insulator (SOI) structure is used to integrate the active circuit elements of the second tier. However, to fabricate the active components in the higher tiers (i.e., second and above), only a lower thermal budget (ca. 500 °C) is available, to preserve the already manufactured components in the metal layers of the lower tiers. After forming the active circuits of the second tier, the associated metallization is created. Thereby, the Bosch process is used to etch the vertical interconnects connecting the tiers. The resulting vertical inter-tier interconnects are referred to as monolithic inter-tier vias (MIVs). To build more tiers on top, the manufacturing steps are repeated from step two, the ILD formation.
Fig. 1.3 Monolithic 3D integration: (a) cross-view illustration of a three-tier circuit; (b) cross-view of a small manufactured monolithic 3D circuit from Ref. [243]. ©IEEE
The previously summarized sequential flow offers extremely small and thus dense vertical inter-tier interconnects. Monolithic inter-tier vias (MIVs) are similar to metal-to-metal vias not only in dimensions but also in parasitics [217, 256]. The tiny diameters and parasitics of MIVs allow for an inter-tier connection density comparable to the vertical connectivity in metal 1 through metal 4 in conventional 2D ICs [60], reaching up to 10⁸ inter-tier vias per mm² [33]. This allows for very fine-grained vertical interconnections between neighboring tiers, enabling 3D partitioning at the level of transistors [26, 217].
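To put such via densities into perspective, a back-of-the-envelope conversion (an illustrative sketch added here, not part of the book's models) relates the density of a regular square via grid to its pitch; the 10⁴ per mm² figure used for comparison is an assumed, merely TSV-like density, not a value from this text:

```python
import math

def pitch_from_density(vias_per_mm2: float) -> float:
    """Pitch (in micrometers) of a regular square grid that achieves the
    given via density, assuming one via per pitch-by-pitch cell."""
    cell_area_mm2 = 1.0 / vias_per_mm2
    return math.sqrt(cell_area_mm2) * 1000.0  # mm -> um

print(pitch_from_density(1e8))  # MIVs: 10^8 per mm^2 corresponds to a 0.1-um pitch
print(pitch_from_density(1e4))  # an assumed TSV-like density, orders of magnitude coarser
```

The quoted 10⁸ MIVs per mm² thus implies a grid pitch of only about 0.1 µm, which makes the comparison with lower metal-layer vias plausible.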
1.2.1.1 Advantages and Disadvantages of Monolithic Integration for Architectural Design
Monolithic 3D integration is a disruptive technology for homogeneous [45, 80, 255] and heterogeneous [175, 225, 250] applications. Part of these disruptive advantages comes from the rich set of system-partitioning possibilities that it provides [26, 160]: on the component level, block level, gate level (intra-block), and even transistor level. In transistor-level partitioning—the most commonly used approach for M3D designs today [224]—nMOS and pMOS transistors are separated onto two different tiers [256]. This allows for an individual optimization of technology parameters; however, it does not provide an optimal footprint reduction [149]. Moreover, it does not exploit the extended design space provided by 3D integration. Gate-level partitioning, also known as intra-block partitioning, splits a circuit at the granularity of multiple gates within a functional block. This allows for a reduction of the area footprint and, thus, of the intra-block wire lengths [93], allowing for substantial power and performance benefits for wire-dominated blocks [160]. This partitioning scheme can be much more beneficial than transistor-level partitioning [37]. However, like transistor-level partitioning, it requires an extensive amount of MIVs, which is challenging due to the low MIV yield today. Partitioning on the level of cores tackles this yield issue as it reduces the core-to-core communication paths; yet, it underutilizes the available bandwidth between tiles and does not allow for performance improvements within a core [160]. Block-level 3D partitioning distributes the blocks of a core across multiple tiers. This approach limits the performance benefits to the connections between blocks and does not allow for critical-path reduction within a block [93]. Particularly interesting for the (intended) heterogeneous 3D systems are core-level and block-level 3D integration [12].
For example, with block-level partitioning, storage elements can be placed on a tier optimized for memory, while digital and analog processing blocks are placed on tiers optimized for digital logic and analog logic, respectively. With core-level partitioning, standard CPU cores are placed on a tier for digital logic, while sensing elements are placed on a mixed-signal tier. Consequently, we focus in this book mainly on block- and core-level partitioning due to their larger optimization potential with heterogeneous technologies.
The partitioning strategy and the fabrication constraints have several implications for the structure of the process stack [122], which introduce an intrinsic heterogeneity in the tiers. A low-temperature top-tier annealing process is required during fabrication to prevent yield degradation of the bottom-tier transistors; as a consequence, the top-tier transistors offer lower performance. A temperature-resilient tungsten interconnect is required in the bottom tier to withstand the high top-tier fabrication temperatures; as a consequence, the resistance of the lower-tier interconnects is higher [173]. This can result in a performance degradation of 20% in the top-tier transistors [162] and an additional propagation delay per unit length of 10–30% in the bottom tier [173]. It is obvious that this heterogeneity needs to be taken into account; otherwise, overestimates of more than 50% in the EDP appear when developing interconnect architectures [173]. The left side of Fig. 1.4 shows the typical structure of a 2D-IC metal stack with an active layer (AL), local wires (M1–5), intermediate wires (M6–10), and global wires (M11–12). For transistor-level partitioning, the bottom tier typically has no intermediate or global wires (see Fig. 1.4, TR-M3D), since these would limit the minimum achievable MIV pitch drastically [146].
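A first-order estimate illustrates how such tier-dependent penalties translate into link delay. The sketch below is added for illustration only: the 50/50 gate/wire delay split and the simple additive delay model are assumptions, while the 20% transistor and 30% wire penalties are the figures quoted above:

```python
def link_delay(gate_delay, wire_delay, gate_penalty=0.0, wire_penalty=0.0):
    """First-order link delay: one driving gate plus one wire segment,
    each degraded by a tier-dependent relative penalty."""
    return gate_delay * (1 + gate_penalty) + wire_delay * (1 + wire_penalty)

# Normalized example: gate and wire each contribute half of the nominal delay.
nominal = link_delay(0.5, 0.5)
top = link_delay(0.5, 0.5, gate_penalty=0.20)     # top tier: slower transistors
bottom = link_delay(0.5, 0.5, wire_penalty=0.30)  # bottom tier: tungsten wires

print(f"top-tier link:    +{(top / nominal - 1) * 100:.0f}%")
print(f"bottom-tier link: +{(bottom / nominal - 1) * 100:.0f}%")
```

Even under these simple assumptions, links on different tiers differ by roughly 10–15% in delay, which is why ignoring the tier heterogeneity distorts interconnect-architecture evaluations.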
Fig. 1.4 Sections of 2D and M3D technologies (transistor and gate-level partitioning) (adapted from [12])
In the same way, for gate-level partitioning, the bottom tier does not have global wires (see Fig. 1.4, G-M3D). The consequence is that, for gate-level partitioning, there is a scarcity of global interconnects per unit gate that limits the achievable link frequency. The preliminary results presented in Ref. [122] show that this effect is so dramatic that the overall advantage of using M3D technology over 3D technologies that provide lower maximum vertical-interconnect densities may even vanish for partitioning schemes more coarse-grained than transistor-level partitioning. Moreover, monolithic 3D integration is still a relatively immature technology with an inferior manufacturing yield, and not only for the inter-tier vias. The reason for this low yield is that monolithic 3D integration requires fundamentally new processing steps to form the active circuits and metal layers, not only because of the lower thermal budget for the upper tiers but also because of the vastly different technological parameters [243]. Another major disadvantage of M3D comes from the thin substrates from the second tier onward, which are not only a threat to reliability. These thin substrates are enclosed by metal wires from below and above, as shown in Fig. 1.4, which induces substantial coupling noise on the active circuits in the substrate. Such noise is typically not tolerable for the analog and mixed-signal components found in heterogeneous systems. In summary, even though M3D promises to exploit the full potential of 3D integration, the technology is today not best suited for the block- and component-level heterogeneous 3D SoCs targeted by this book. First, due to cost, yield, and reliability concerns, this M3D SoC-integration style is still several years away from mass production. Second, for any 3D partitioning scheme that is less fine-grained than gate level, the gains of monolithic over TSV-based 3D integration diminish.
Hence, throughout the remainder of this book, we focus on TSV-based 3D integration, which provides a lower maximum vertical-interconnect density but does not require fundamentally novel processing steps compared to traditional planar/2D ICs. Moreover, TSV-based 3D integration retains thicker substrates per die, increasing the substrate-noise robustness. The manufacturing of TSV-based 3D ICs and the challenges it entails for designers are discussed next.
1.2.2 TSV-Based Integration

The key idea of the more mature TSV-based 3D integration—discussed throughout this subsection—is to mechanically stack pre-fabricated "2D dies", as illustrated in Fig. 1.5. In this 3D-integration style, the vertical interconnects through the substrates are realized by through-silicon vias (TSVs). Stacking pre-fabricated "2D dies" brings the tremendous advantage that only mature manufacturing techniques from traditional 2D IC manufacturing are required, except for the TSV fabrication and the die-to-die bonding. Hence, arbitrary front-end of line (FEOL) and back-end of line (BEOL) manufacturing techniques known from 2D-IC manufacturing can be reused for TSV-based 3D integration. Consequently, TSV-based 3D ICs can
Fig. 1.5 Cross-view illustration of a three-tier TSV-based 3D IC. The first two dies are face-to-face bonded, and the third die is bonded on top of the second one in a face-to-back manner
be manufactured with well-established transistor technologies such as SOI, fin field-effect transistor (FinFET), or traditional planar CMOS. This accelerates the process of making 3D ICs more efficient than their 2D counterparts. It also increases the manufacturing yield compared to sequential/monolithic approaches, where the 3D stack is grown on top of a single handle substrate. In a stacked 3D IC, TSVs are not needed for the interconnects between the first two tiers. The first two pre-fabricated dies can be electrically connected through their top-most metal layers using a face-to-face bonding technology, as illustrated in Fig. 1.5. This only requires bonding bumps, which are typically made from copper due to its compatibility with the metal wires and vias in the BEOLs of the dies. As the bonding resin, epoxies and polymers are commonly used due to their excellent adhesive properties. From the third die onward, a face-to-back bonding is needed, which connects the top metal layer of the added die with the substrate backside of the previously bonded die. For face-to-back bonding, TSVs are needed to establish low-resistive electrical connections between elements located in different dies, as shown in Fig. 1.5. In contrast to metal wires and vias, a TSV occupies an area of the substrate, which consequently cannot be used for active-circuit elements. Hence, TSVs increase substrate-area requirements.
1.2.2.1 TSV Manufacturing
In the following, the manufacturing steps for the fabrication of TSVs are briefly reviewed. Three main TSV-manufacturing variants exist: via-first, via-last, and via-middle [194]. In the via-first process, TSVs are formed in the substrate before the active circuits (i.e., the FEOL) and the metal layers (i.e., the BEOL). Via-first TSV manufacturing has the advantage that it generally results in the shortest TSVs. However, the subsequent manufacturing of the FEOL requires very high temperatures, which the already-formed TSVs must withstand. Consequently, via-first TSVs must have a strong thermal reliability, which typically forbids copper as the conductor material. However, copper TSVs are desirable due to their compatibility with standard BEOL fabrication steps. The second variant, via-last TSVs, are manufactured after the FEOL and the BEOL. Via-last manufacturing has the advantage that the TSVs only have to withstand the manufacturing stress caused by the bonding and wafer thinning. On the downside, via-last TSVs are the longest, and the TSV etching must be performed through several metal and dielectric layers besides the substrate. Furthermore, via-last TSVs have a lower thermal budget for manufacturing, as the previously fabricated metal layers in the BEOL must be preserved. Thus, the predominant approach today is to use via-middle TSVs. As illustrated in Fig. 1.6, via-middle TSVs are fabricated after the FEOL but before the BEOL. Hence, first the active circuits in the substrate and the pre-metal dielectric are fabricated, subsequently the TSVs, and finally the BEOL. Through-silicon vias are formed by etching a cylindrical hole in the substrate, which is then filled with copper surrounded by a dielectric liner to isolate the TSV conductor from the doped, and thus conductive, substrate. For this purpose, the Bosch process is used, which was initially invented to manufacture micro-electro-mechanical systems (MEMS) [141, 194].
The Bosch process alternates between an etching step and a silicon-dioxide (insulator) deposition in successive time intervals in the range of seconds. Afterward, the etched hole with the insulator is filled with the TSV conductor material. Typically, TSVs are formed as blind vias before bonding and are exposed during a wafer-thinning step (e.g., wet etching), as illustrated in Fig. 1.6. The advantages of this approach are the compatibility with well-established manufacturing techniques and a simplified wafer handling [134]. However, it has the severe disadvantage that the following wafer-thinning and bonding process steps induce stress on the formed TSVs, impairing their manufacturing yield. Nevertheless, thinning the wafers or dies before the TSV manufacturing is typically not an alternative, as this requires several processing steps with a thin wafer, which makes manufacturing significantly more difficult. The wafer is typically thinned before bonding, which often demands a temporary bonding of a carrier/wafer-handle after thinning to increase the mechanical stability during the bonding stage [194]. After thinning and bonding, an ILD is deposited on top of the stack together with copper bumpers. These bumpers are required for the electrical connections with the next die that will be bonded to the stack. For the last die of the stack, the bumpers are extended by solder balls, required to bond the 3D IC onto a printed circuit board in a flip-chip manner.

Fig. 1.6 Basic (simplified) steps to fabricate a 3D IC with via-middle TSVs: Step 1: FEOL (pre-metal dielectric, well, contact); Step 2: Etch; Step 3: TSV (TSV liner, TSV conductor); Step 4: BEOL (bump, ILD); Step 5: Bonding (flipped thinned die); Step 6: Finish
1.2.2.2 TSV-Manufacturing Challenges
While metal wires, vias, and active-circuit elements are fabricated with mature, and thus efficient, manufacturing techniques, the fabrication of the TSVs in stacked 3D ICs is challenging. The first critical challenge is to achieve a high TSV manufacturing yield [128]. Even though correctly manufactured TSVs enable relatively reliable high-bandwidth communication between the dies (compared with other approaches such as inductive coupling through the dies), they have the drawback of being fabricated with a poor manufacturing yield due to the immaturity of the involved process steps.
In the following, the major TSV manufacturing defects are briefly summarized. Four main TSV defect types exist: voids, delamination at the interface, material impurities, and TSV-to-substrate shorts [106, 259, 263]. Improper TSV filling or stress during bonding can result in voids or cracks in the TSV conductor. The second defect type, TSV delamination, is caused by a spatial misalignment between a TSV and its bumper during the bonding stage. Material impurities degrade the conductivity of the TSV channel. A TSV-to-substrate short arises due to a pinhole in the oxide liner, which usually isolates the TSV from the conductive substrate. Since the occurrence of TSV defects can dramatically reduce the overall manufacturing yield, TSVs have to be integrated efficiently (i.e., sparsely). This typically demands using TSVs only on a global level in the form of clustered arrays (i.e., to connect larger circuit blocks), as this reduces the overall number of required TSVs. Moreover, using regular arrays facilitates TSV manufacturing due to the more regular patterning for the Bosch process and the TSV filling. Furthermore, redundancy techniques can be integrated to increase the overall manufacturing yield if the TSVs are clustered together [128]. However, the immaturity of the TSV manufacturing steps has a critical impact not only on the manufacturing yield but also on the silicon-area requirements. Globalfoundries has presented its via-middle TSV manufacturing processes, used for their 14- and 20-nm wafers, in a comprehensive set of publications (e.g., [203, 204]). Pictures of TSVs manufactured by Globalfoundries are shown in Fig. 1.7. A close look at the figures reveals that the TSVs are extremely large compared to the metal wires in the BEOL and the transistors in the FEOL. The radius of the commercially available TSVs is about 2.5 μm for the 14-nm technology and 3 μm for the 20-nm technology.
Furthermore, the fabricated TSVs are 50 μm long/deep after die thinning for both technology nodes. Also for the foundry's successive TSV-manufacturing node, the TSV depth is kept at 50 μm, but the radius is further reduced to 1.5 μm [262]. The likely reason for the steady TSV depth is that an even more aggressively thinned substrate dramatically decreases the yield. Moreover, a too aggressively thinned substrate leads to a critical increase in the inter-die substrate noise and thermal coupling. The reason for the large TSV radius—determining the reduction in the available substrate area for active-circuit components—is that TSVs cannot yet be manufactured with a large aspect ratio (i.e., the ratio of the via length over its diameter). An accepted target for this aspect ratio is 20/1, which is, for example, the target for the year 2018 reported in the International Technology Roadmap for Semiconductors (ITRS) predictions [1]. Today it is theoretically possible to implement aspect ratios as large as 25/1 [264], but only with the inferior via-first approach, which has BEOL compatibility issues, as discussed before. Even a high maximum aspect ratio of 25/1 implies a TSV radius of at least 1 μm for a 50-μm thin substrate. Furthermore, since the die-to-die bonding is a mechanical process, its alignment is relatively poor (typically in the micrometer range), which is another limiter for the minimum TSV or bumper size.
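The relation between substrate thickness, aspect ratio, and minimum TSV radius can be checked with a few lines of Python (an illustrative sketch, not code from this book):

```python
def min_tsv_radius(substrate_thickness_um, max_aspect_ratio):
    """Minimum TSV radius for a given substrate thickness.

    The aspect ratio is defined as via length over via diameter,
    so diameter >= length / aspect_ratio, and the radius is half of that.
    """
    return substrate_thickness_um / (2.0 * max_aspect_ratio)

# A 50-um substrate with the aggressive 25/1 aspect ratio of [264]
# already requires a radius of at least 1 um.
print(min_tsv_radius(50, 25))  # -> 1.0
# The ITRS target of 20/1 implies an even larger minimum radius.
print(min_tsv_radius(50, 20))  # -> 1.25
```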
1.2 3D Technologies
17
Fig. 1.7 Globalfoundries via-middle TSV process: (a) Cross-view of the top of a fabricated TSV with a radius of 2.5 μm and a depth of 50 μm and the full BEOL stack for the 14-nm node; (b) FinFET devices at the edge of the TSV KOZ for the 14-nm node; (c) Full cross-view of a fabricated via-middle TSV with a radius of 3 μm and a depth of 55 μm for the 20-nm node before wafer thinning (note that after wafer thinning the TSV depth is reduced from 55 to 50 μm). Pictures taken from [203, 204]. ©IEEE
To put the substrate-area occupation of a TSV into a better perspective, the quadratic substrate footprint of a TSV (i.e., 4 rtsv²) with a 2.5-μm radius—integrated into 14-nm wafers [204]—is compared to the footprint of the logical non-conjunction (NAND) cell (drive strength 1×) and the full-adder cell of the publicly available 15-nm standard-cell library NanGate15 [177]. Furthermore, the footprint of the two standard cells is compared to the footprint of an aggressively scaled global TSV with a 1-μm radius. In Fig. 1.8, the results of the comparison are illustrated to scale. The substrate-area occupation of a commercial TSV is about 125× and 21× bigger than that of a NAND cell and a full-adder cell, respectively. Even for the aggressively scaled radius, the TSV footprint is still more than three times larger than the footprint of the 15-nm full-adder cell. Furthermore, a TSV is surrounded by a so-called keep-out zone (KOZ), an area in which no active circuit can be placed in the substrate, resulting in a further reduction in the effectively available area for logic cells. The reason for the KOZ is that the copper filling of a via-middle TSV results in high thermal stress around the TSV, which can impair previously fabricated FEOL structures [203]. For the 14-nm FinFET transistors used in Ref. [204], the TSV KOZ is about the size of the TSV diameter, as shown in Fig. 1.7b.
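These ratios can be reproduced with a small computation. The NAND-cell footprint of roughly 0.2 μm² used here is back-derived from the 125× ratio stated above, and is therefore an assumption, not a value quoted from NanGate15:

```python
def tsv_footprint_um2(radius_um):
    # Quadratic substrate footprint of a TSV: (2*r)^2 = 4*r^2
    return 4.0 * radius_um ** 2

NAND_AREA_UM2 = 0.2                       # assumption, back-derived from the 125x ratio
FULL_ADDER_AREA_UM2 = 6 * NAND_AREA_UM2   # full adder is ~6x a NAND cell (Fig. 1.8)

commercial = tsv_footprint_um2(2.5)  # 25.0 um^2
mini = tsv_footprint_um2(1.0)        # 4.0 um^2

print(round(commercial / NAND_AREA_UM2))        # -> 125
print(round(commercial / FULL_ADDER_AREA_UM2))  # -> 21
print(round(mini / FULL_ADDER_AREA_UM2, 1))     # -> 3.3
```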
[Relative footprints shown in Fig. 1.8: 15-nm NAND = 1×; 15-nm full adder = 6×; mini TSV (rtsv = 1 μm) = 20×; TSV (rtsv = 2.5 μm) = 125×]
Fig. 1.8 Substrate footprints of modern TSVs and standard cells from the 15-nm library NanGate15. The considered radius, rtsv , of a standard TSV is equal to the one demonstrated in commercial 14-nm wafers [204]. For the “mini TSV”, the radius is equal to the minimum possible TSV radius for a substrate as thin as 50 μm and a maximum TSV aspect ratio of 25/1
In summary, two main TSV-manufacturing challenges need to be mastered for TSV-based 3D integration: first, to achieve a higher TSV yield, and second, to reduce the TSV area occupation. Palliating both issues from the design perspective requires using TSVs only on the global level (i.e., to connect larger blocks) and in the form of regular array arrangements.
1.3 TSV Capacitances—A Problem Resistant to Scaling

Extensive research has been conducted on how to accelerate the process of overcoming TSV-related yield and area issues through techniques derived on higher levels of abstraction (e.g., [107, 128, 184, 190, 238, 246]). However, another major issue of TSV structures has not yet been adequately addressed: TSVs entail large parasitic capacitances that can be a crucial bottleneck of modern and future 3D ICs, especially with regard to the power consumption. This TSV-related issue is outlined in this section. For this purpose, we compare the parasitics of modern global TSVs to the parasitics of modern standard cells. The investigated standard cells belong to two
different libraries for 3D-IC design from Ref. [133]; one is based on the 22-nm Predictive Technology Model (PTM) and the other one on the 16-nm PTM. The used libraries also consider predictive TSV parasitics. However, the assumed TSV dimensions in the sub-micron range are far too optimistic, as revealed by the commercially available TSV structures. For example, the smallest analyzed TSV radius in Ref. [133] is 50 nm, which is 50× smaller than the radius of the TSVs integrated into commercial 14-nm wafers (2.5 μm).
1.3.1 Model to Extract the TSV Parasitics

To obtain TSV parasitics for realistic geometrical TSV dimensions and arbitrary array shapes, a parameterizable Python script is developed, which builds a 3D model of a rectangular TSV array in the Ansys Electromagnetics Suite. Such a 3D model enables us to extract the TSV-array parasitics by means of the electromagnetic (EM) field solver Q3D Extractor, which uses the method of moments [73].4 The TSVs in the 3D model are homogeneously placed in an M × N array and indexed as TSVi, where the column location of the TSV in the array is equal to i modulo N, and the row location is equal to ⌊i/N⌋. Both M and N can be arbitrarily defined through two parameters in the script. An exemplary model instance of a 3 × 3 TSV array is depicted in Fig. 1.9. The minimum pitch between the centers of directly adjacent TSVs is set by a parameter, represented by dmin in this book. Thus, diagonally adjacent TSVs in the model have a pitch of √2 dmin. The length, ltsv, and radius, rtsv, of the cylindrical copper TSVs are parameterizable as well. A substrate area with an x/y-expansion of at least 2 dmin, in which no other TSV is located, surrounds the TSV array in the 3D model.5 The TSVs in the model mainly traverse a silicon substrate, which has an electrical conductivity defined by the parameter σsubs. Unless stated otherwise, a typical p-doped (Boron) substrate, biased at 0 V, with a dopant concentration of about 1.35 × 10¹⁵ cm⁻³, is considered in this book. Such a substrate has a conductivity of about 10 S/m due to the doping. For direct-current insulation from the conductive substrate, TSVs are surrounded by SiO2 dielectrics of thickness tox. The ITRS does not report values for the thickness of these SiO2 liners. In the 3D model, a TSV-liner thickness of 0.2 rtsv is considered, being a realistic value according to existing manufacturing nodes. The TSVs start and terminate in quadratic copper pads located in thin dielectric layers.
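The array indexing and pitch rules above can be sketched in a few lines (an illustrative reconstruction of the geometry setup only; the actual script drives the Ansys Electromagnetics Suite and is not reproduced here):

```python
from math import sqrt, hypot

def tsv_centers(m_rows, n_cols, d_min):
    """Centers of the TSVs of an M x N array with minimum pitch d_min.

    TSV_i sits in column (i mod N) and row (i // N), matching the
    indexing of the 3D model described above.
    """
    return [((i % n_cols) * d_min, (i // n_cols) * d_min)
            for i in range(m_rows * n_cols)]

centers = tsv_centers(3, 3, d_min=8.0)  # e.g., an 8-um pitch (4*rtsv for rtsv = 2 um)

# Directly adjacent TSVs (e.g., TSV_0 and TSV_1) are d_min apart ...
d_direct = hypot(centers[1][0] - centers[0][0], centers[1][1] - centers[0][1])
# ... while diagonally adjacent TSVs (TSV_0 and TSV_4) are sqrt(2)*d_min apart.
d_diag = hypot(centers[4][0] - centers[0][0], centers[4][1] - centers[0][1])
print(d_direct, d_diag)  # 8.0 and ~11.31
```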
Fig. 1.9 Parameterizable 3D model of a TSV array used for parasitic extraction: (a) Substrate top view for a 3 × 3 array (showing copper, depleted silicon, silicon substrate, and silicon dioxide, as well as dmin and the 2 dmin margin); (b) TSV cross-view (wdep, tox, rtsv); (c) TSV conductor side view (ltsv)

These copper pads model bonding bumps at one side of the TSVs, and quadratic landing pads on the other side.6 However, the copper pads and the thin dielectric layers have only a secondary impact on the parasitics. Thus, the quadratic bumps, as well as the landing pads, have a fixed side length of 2 rtsv and a thickness of 50 nm in all analyses conducted throughout this book. This makes the TSV model reciprocal (i.e., it is irrelevant which TSV side is the signal source and which the signal sink). The dielectrics used to isolate the copper pads from the conductive substrate have a constant thickness of 0.25 μm. A TSV, its dielectric, and the substrate form a metal-oxide-semiconductor (MOS) junction. Consequently, a TSV is surrounded by a depletion region. For typical p-doped substrates, the width of the depletion region increases with an increase in the mean voltage on the related TSV, which further isolates the TSV from the conductive substrate. In contrast, for n-doped substrates, the depletion-region width shrinks with an increasing mean TSV voltage [253]. The exact formulas from [253] are applied to determine the widths of the depletion regions surrounding the TSVs (represented by wdep,i), which depend on the mean TSV voltages.7 A depletion region is represented in the 3D model by a fully depleted substrate area (i.e., an electrical conductivity equal to zero), as in previous works [197, 199, 200, 253]. The parasitics of a TSV array show a slight frequency dependence, as shown in Chap. 4 of this book. Hence, TSV parasitics must always be extracted for a given significant frequency, fs. According to the findings from Ref. [23], the significant frequency of a digital signal can be estimated using the driver-dependent mean rise/fall time of the TSV signals, Trf:

fs ≈ 1 / (2 Trf).    (1.1)

4 Parasitic extractions for this TSV-array model using Q3D Extractor are used extensively in several chapters of this book.
5 Due to the large TSV depth, compared to the depth of an n-well or a p-well (see Fig. 1.7), nearby active circuits in the substrate have a negligible impact on the TSV parasitics.
6 Quadratic landing pads are used to connect TSVs with metal wires/vias [133].
With the parameterizable TSV model, a parasitics extraction for various TSV-array structures is possible. The used Q3D Extractor employs a quasi-static field approximation for the parasitic extraction, providing a high accuracy for reasonable significant frequencies. For all structures analyzed throughout this book, the relative error within the quasi-static approximation is smaller than 0.2% [232]. Even for an analysis of frequencies as high as 40 GHz (i.e., Trf ≈ 13 ps), this error remains below 2%.
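Equation (1.1) is straightforward to apply; for instance, the 83-ps rise/fall time used in the analysis below corresponds to a significant frequency of about 6 GHz (illustrative sketch):

```python
def significant_frequency(t_rf_s):
    """Significant frequency of a digital signal, Eq. (1.1): fs ~ 1/(2*Trf)."""
    return 1.0 / (2.0 * t_rf_s)

print(significant_frequency(83e-12) / 1e9)  # ~6 GHz for Trf = 83 ps
print(significant_frequency(13e-12) / 1e9)  # ~38 GHz for Trf = 13 ps
```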
1.3.2 Analysis A TSV radius of 2.5 μm (demonstrated for a commercial 14-nm wafer Ref. [203]), is considered in this work as a “large” value. Furthermore, a shrunk radius of 2 μm is chosen as a “typical” value. Thereby, even a conservative analysis of the impact of the TSV parasitics in a 16-nm or a 22-nm transistor technology is presented in the following. To moreover analyze the effect of an aggressive TSV scaling, “small” TSVs with the maximum possible radius for an aspect ratio pf 25/1 (here, 1 μm) are considered as well. The TSV depth is fixed to 50 μm. Hence, the depth does not scale in accordance with Ref. [203, 204, 262].
7 In Appendix B, the formal method to determine the depletion-region widths is outlined.
Table 1.1 Extracted parasitics for densely spaced global TSVs

TSV radius               Rtsv [Ω]   Ctsv [fF]   κcoup,tsv [%]
Large (rtsv = 2.5 μm)    0.08       31.62       99.45
Typical (rtsv = 2 μm)    0.10       31.35       99.61
Small (rtsv = 1 μm)      0.27       29.21       99.74

Table 1.2 Input capacitances of 22- and 16-nm standard cells [133]

Cell              22-nm Cin [fF]   16-nm Cin [fF]
NAND 1×           0.24             0.22
XOR 1×            0.55             0.45
D flip-flop 1×    0.41             0.26
Inverter 4×       0.69             0.56
Full adder        1.31             1.36
Considered are densely spaced TSVs, which minimize the area overhead of a TSV array. Thus, the pitch between directly adjacent TSVs, dmin, is chosen as 4 rtsv, in accordance with the minimum possible value predicted by the ITRS for all manufacturing nodes [1]. The array dimensions are varied in this analysis between 3 × 3 and 10 × 10 to consider arrays with small, as well as large, amounts of TSVs. Lumped resistance-inductance-capacitance (RLC) parasitics of the individual arrays are extracted with the Q3D Extractor for a significant signal frequency of 6 GHz (i.e., Trf ≈ 83 ps). The resulting maximum overall capacitance of a TSV, Ctsv, as well as the maximum TSV resistance, Rtsv, are subsequently compared for the three TSV radii. In Table 1.1, the results are reported; they tend to be independent of the array shape, M × N. Moreover, the table also includes the coupling ratio, κcoup,tsv, defined as the ratio of the sum of all coupling capacitances of a TSV over its accumulated capacitance value (a coupling capacitance is a capacitance between two different TSVs). To set the reported TSV capacitances in relation to the capacitances of logic cells, Table 1.2 includes the input capacitances of representative standard cells reported in [133]. The two tables reveal that switching the logical value on a TSV requires, on average, 122× to 132× more charge than toggling an input of a standard two-input NAND gate for the 22-nm technology. These factors even increase to 133× and 144× for the 16-nm technology. The accumulated TSV capacitance is also more than 20× larger than the input capacitances of the full-adder cells. Even scaling the TSV radius to the smallest possible value (i.e., 1 μm), which implies a drastic increase in the manufacturing efforts, does not significantly close this gap. The scaling by a factor of 2× only reduces the TSV capacitances by less than 7%.
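The charge factors follow directly from Tables 1.1 and 1.2: the charge to (dis)charge a capacitance C at a fixed supply voltage is proportional to C, so the factors reduce to capacitance ratios.

```python
C_TSV_SMALL = 29.21  # fF, Table 1.1 (rtsv = 1 um)
C_TSV_LARGE = 31.62  # fF, Table 1.1 (rtsv = 2.5 um)
C_NAND_22NM = 0.24   # fF, Table 1.2
C_NAND_16NM = 0.22   # fF, Table 1.2

# 22-nm technology: ~122x (small TSV) to ~132x (large TSV)
print(round(C_TSV_SMALL / C_NAND_22NM), round(C_TSV_LARGE / C_NAND_22NM))
# 16-nm technology: ~133x to ~144x
print(round(C_TSV_SMALL / C_NAND_16NM), round(C_TSV_LARGE / C_NAND_16NM))
```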
Thus, just relying on advances in the TSV radius due to improved manufacturing techniques does not help to overcome the problems caused by the large parasitic TSV capacitances. In fact, with the ongoing technology scaling of the transistors, the TSV parasitics likely become even more critical. The only approach that can overcome this issue is to aggressively shrink the die/substrate thickness, and thereby the TSV
Table 1.3 Worst-case wire parasitics for the 22- and the 16-nm technology, reported per unit wire length

       22 nm                                     16 nm
Wire   Rmw [Ω/μm]  Cmw [fF/μm]  κcoup,mw [%]    Rmw [Ω/μm]  Cmw [fF/μm]  κcoup,mw [%]
M1     9.43        0.11         65.20           23.92       0.09         48.18
M4     3.39        0.12         46.88           11.3        0.11         51.44
M8     0.38        0.16         47.13           0.95        0.15         48.91
depth. However, this typically has unacceptable drawbacks for the manufacturing and the reliability, as outlined in Sect. 1.2.2. Coupling capacitances contribute more than 99% to the total capacitance of a TSV for all three analyzed radii. Thus, the coupling capacitances are the primary design concern. Coupling can also be a critical issue for metal wires [68], but the coupling capacitances of the metal wires are not reported in [133]. However, the dimensions of the eight-track metal stack per die are given. These values can be entered into the PTM interconnect tool [205] to obtain resistance and capacitance values for the local (M1), the intermediate (M4), and the global (M8) wires in the worst case (i.e., minimum-spaced wires). Doing so results in the metal-wire parasitics reported in Table 1.3. The results reveal that the coupling coefficients of metal wires are critical, but still significantly smaller than those of TSVs. Coupling effects are much more dominant for TSVs due to the increased number of adjacent aggressors in 3D and the doped substrate the TSVs traverse. Additionally, the coupling capacitances of metal wires, and thereby Cmw and κcoup,mw, can be effectively reduced by increasing the line spacing, which does not work well for TSVs, as shown in [158]. However, even without wire spacing, the overall capacitance of a small TSV is higher than the maximum capacitance of a global metal wire with a length of 182 μm. This value increases to over 243 μm when compared with local or intermediate wires. Hence, the capacitance per unit length of a TSV is much larger than that of a metal wire. That the capacitances of local and intermediate wires as short as 3 μm already dominate over the input capacitances of standard cells is another strong piece of evidence that parasitic TSV capacitances, as well as metal-wire capacitances, are a serious concern.
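The quoted break-even lengths follow directly from the extracted values (small-TSV capacitance from Table 1.1, worst-case per-μm wire capacitances of the 22-nm stack from Table 1.3):

```python
C_TSV_SMALL_FF = 29.21  # fF, small TSV (rtsv = 1 um), Table 1.1
C_M8_FF_PER_UM = 0.16   # fF/um, worst-case global wire (22 nm), Table 1.3
C_M4_FF_PER_UM = 0.12   # fF/um, worst-case intermediate wire (22 nm), Table 1.3

# Wire length whose total capacitance equals that of one small TSV:
print(C_TSV_SMALL_FF / C_M8_FF_PER_UM)  # about 182 um of global wire
print(C_TSV_SMALL_FF / C_M4_FF_PER_UM)  # about 243 um of intermediate wire
```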
Thus, the TSVs are a severe threat to the power consumption of a 3D IC, since the parasitic capacitances, combined with the switching of the transmitted bits, determine the interconnect power consumption, as shown in Ref. [68]—especially since this TSV issue proves to be resistant to scaling. Consequently, addressing this challenge on higher abstraction levels, by exploiting the bit-pattern-dependent nature of the power consumption, is of particular importance. A positive property of TSVs is their negligibly low resistance due to their large radius and relatively short length/depth, especially when compared to long metal wires. For all three analyzed radii, the TSV resistance is significantly below 1 Ω, while the resistance of a long metal wire is often in the kΩ range. The signal-propagation delay of an interconnect for a transmission of random bit patterns can be estimated through its RC constant (i.e., the product of the total interconnect-path capacitance
Fig. 1.10 Equivalent channel resistance of the 22- and the 16-nm PTM p-channel MOSFET transistor over the transistor size (W/Lmin)
and resistance) [74]. The total path resistance of an interconnect segment is estimated by the sum of the equivalent resistance of the driver's pull-up/pull-down path in the conductive state and the resistance of the actual interconnect. To quantify the range of the driver's equivalent resistance, the current through the pull-up path of an inverting driver as a function of the transistor sizing is analyzed with the Spectre circuit simulator. For this purpose, a Spectre inverter circuit is built twice, once out of the 22-nm and once out of the 16-nm PTM transistors. Ground is set at the input and the output of the inverter (beginning of the pull-up phase), and the drain-source current, Ids, through the p-channel MOS field-effect transistor (MOSFET) is measured. The quotient of the power-supply voltage and the current (i.e., Vdd/Ids) is used to estimate the effective channel resistance. In Fig. 1.10, the resulting resistances are plotted over W/Lmin (the width of the transistor channel over its length). The equivalent resistance of the p-channel MOSFET is in the kΩ range for both technologies. It is about 40 kΩ for small transistor widths and decreases toward 2.5 kΩ for very large transistor widths. This is orders of magnitude bigger than the resistance of TSVs and shorter metal wires. Thus, for a 3D interconnect made up of two drivers, a TSV, and short wires in between, the total interconnect-path resistance is mainly determined by the drivers. In contrast, for long 2D interconnects, the metal-wire resistance also contributes significantly to the overall path resistance. Hence, the interconnect bottleneck in terms of timing can already be mitigated with today's TSV-manufacturing techniques; but there still is plenty of room for improvement.
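A rough plausibility check of this driver-dominated delay (a sketch with an assumed driver resistance, not a simulation from this book): a first-order RC estimate with a kΩ-range driver resistance and the sub-Ω TSV resistance from Table 1.1 shows that the TSV contributes capacitance, but essentially no resistance, to the path delay.

```python
def rc_delay_s(r_driver_ohm, r_wire_ohm, c_load_f):
    """First-order RC delay estimate: total path resistance times capacitance."""
    return (r_driver_ohm + r_wire_ohm) * c_load_f

R_DRV = 5e3        # Ohm, assumed mid-size driver in the kOhm range (Fig. 1.10)
R_TSV = 0.10       # Ohm, typical TSV (Table 1.1)
C_TSV = 31.35e-15  # F, typical TSV (Table 1.1)

with_tsv_r = rc_delay_s(R_DRV, R_TSV, C_TSV)
without_tsv_r = rc_delay_s(R_DRV, 0.0, C_TSV)
# The TSV resistance changes the delay estimate by only ~2e-5 (relative).
print((with_tsv_r - without_tsv_r) / without_tsv_r)
```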
1.4 Conclusion

In this chapter, we have shown that 3D integration offers an enormous potential for the optimization of complex ICs. The two main advantages it offers are the reduction of the global-interconnect lengths and the possibility of heterogeneous integration. Although the main characteristic exploited by previous work is the interconnect-length reduction, the possibilities offered by heterogeneous integration are even more promising. We have also discussed different alternatives for the implementation of 3D technologies. Monolithic integration is certainly a promising approach that will play an important role in the future; TSV-based integration, on the other hand, offers a smaller integration density but a more mature technology, ready to be employed. Also, we have shown that, for the heterogeneous 3D systems targeted by this book, the TSV technology in combination with block/core-level 3D partitioning is the most viable integration approach (at least in the near future). Hence, in the remainder of this book, we focus on this type of 3D integration. Furthermore, it was shown in this chapter that the large TSV capacitances are a severe threat to the power consumption of TSV-based 3D ICs, a problem that is resistant to scaling. Hence, even for future smaller TSV dimensions, a 3D implementation can result in a higher power consumption than a logically equivalent 2D implementation. Moreover, the large TSV capacitances affect the performance/timing negatively. Nevertheless, a 3D system will, in most cases, still result in a better performance than a 2D system due to the small resistance of a TSV. These outlined TSV-related power and performance issues will be addressed in the later optimization part of this book. For this purpose, we need abstract yet universally valid models for the pattern-dependent TSV power consumption, which will be presented in the modeling part of this book.
Chapter 2
Interconnect Architectures for 3D Technologies
In this chapter, we discuss the basics of interconnect architectures for 2D and 3D SoCs. The trend toward more components in a SoC requires an interconnect architecture whose bandwidth and complexity scale linearly with the number of its elements. Traditional interconnect architectures such as bus systems cannot fulfill these requirements. Networks on chips meet these demands, as the number of routers can easily be adjusted to the number of processing elements. In addition, they mitigate the global-wire delay problem. We therefore focus on networks on chips (NoCs) and provide an overview of all relevant aspects of NoC design, starting from basics like switching techniques and proceeding to flow control and router architectures. In addition, we discuss aspects that are related to the later chapters of this book, like application mapping and the evaluation of NoCs. For NoCs in 3D IC designs, there is a vast design space that results not only from the availability of the third dimension but also from the potential heterogeneity of tiers. We also briefly discuss the different design options arising from these characteristics and categorize different NoC variants for 3D ICs. The remainder of this chapter is structured as follows. First, basic information on interconnect architectures for SoCs is given, with a special focus on NoCs. Section 2.2 briefly summarizes interconnect architectures for 3D SoCs, while Sect. 2.3 discusses design options and challenges when using NoCs for 3D SoCs. Finally, the chapter is concluded.
2.1 Interconnect Architectures

The continuous trend towards more and more components in SoCs brings traditional interconnect architectures to their limits. Dedicated point-to-point connections are not flexible enough for complex communication patterns. Fully connected crossbars yield unbearable costs since their area grows quadratically with the number of
components. Bus systems are limited by arbitration among many components. None of these architectures scales linearly in bandwidth and complexity with the number of its elements. In contrast, NoCs [30] offer scalability. Networks on chips are interconnect architectures that connect components via a network of routers and transmit data via packets. The number of routers scales with the number of components. One distinctive feature of NoCs compared to the other interconnect architectures is that they implement packet-based transmission, as opposed to the wire-based transmission of data in direct links, crossbars, and buses. This approach follows the principle route packets, not wires [58]. The payload data are therefore split into packets, to which control information, including the packet's destination address, is attached.
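The packetization principle above can be sketched in a few lines. The packet format here is purely illustrative (flit width, packet length, and the header layout are assumptions, not the format of any particular NoC): the payload is chopped into fixed-size packets, and a head flit carrying the destination address is prepended to each.

```python
# Sketch of NoC packetization (hypothetical format): the payload is split
# into fixed-size packets, and a head flit carrying the destination
# address is prepended to each.

FLIT_BYTES = 4        # width of one flit (assumption)
FLITS_PER_PACKET = 4  # one head flit plus up to three payload flits (assumption)

def packetize(payload: bytes, dest_addr: int) -> list[list[int]]:
    """Split a payload into packets; each packet is a list of flits,
    the first flit being a header with the destination address."""
    body_bytes = (FLITS_PER_PACKET - 1) * FLIT_BYTES
    packets = []
    for off in range(0, len(payload), body_bytes):
        chunk = payload[off:off + body_bytes]
        flits = [dest_addr]  # head flit: routing information
        for i in range(0, len(chunk), FLIT_BYTES):
            flits.append(int.from_bytes(chunk[i:i + FLIT_BYTES], "big"))
        packets.append(flits)
    return packets

packets = packetize(b"hello, network-on-chip!", dest_addr=0x7)
```

The network interface would perform this step in hardware; the inverse operation (reassembly) happens at the receiving NI.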
2.1.1 Networks on Chips (NoCs)

There is a wide variety of NoC architectures and possible implementations. An example is shown in Fig. 2.1. Different processing elements (PEs) each connect to a network interface (NI). This serialization/deserialization logic is the entry point for data into the network. Routers transmit the data. These three parts are the typical building blocks of NoCs, with the following functionality:

Processing element: Processing elements connect via an NI to the NoC. Each PE represents one component of the SoC. Addresses are associated with PEs and used for the calculation of packet paths. The set of a router, an NI, and a PE at the same location (i.e., with the same address) is typically called a tile.

Network interface: Network interfaces connect PEs and routers. Within the NI, the data from the PE are serialized and split up into packets, and vice versa. The NI connects to the local port of the router.

Router: Routers transmit data through the network. They are connected following a given topology. A straightforward mesh network is shown in Fig. 2.1, in which routers are connected via their four ports called north, east, south, and west. Routers may have a varying number of ports in other topologies (connection schemes between routers). Routers receive and send packets to transmit data. The correct output port is calculated in each router per packet

Fig. 2.1 Schematic view of an NoC with components
following a routing algorithm. If the output port is not in use, i.e., not blocked, the router will send the data via a small internal crossbar to the next downstream router. Otherwise, the data are held back or stored until a path is available, depending on the router architecture. There are many challenges in NoC design. For instance, NoCs are ultimately limited in scalability as well, since every additional router increases the maximum communication latency by an additional hop. Although less severe than for other interconnect architectures, this leads to significant performance degradations for some applications, e.g., [186]. Therefore, optimized architectures have been proposed, which, for instance, accelerate data streams under high network loads [123]. Another challenge lies in router costs, which are tackled, e.g., by routers that are frugal in terms of implementation area and power consumption [41]. Furthermore, the incorporation of today's emerging technologies, such as 3D-integrated chips, is a relevant research topic, which is also within the scope of this book.
2.1.1.1 Packet Transmission
The switching method determines how data are transmitted through the network. Switching can be either packet-based, in which data are transmitted using packets without path reservation, or circuit-based, in which packets are transmitted along a reserved path. Store-and-forward switching is a packet-based method in which packets are completely stored in a router before they are forwarded to the next downstream router, as shown in Fig. 2.2. Since each message is completely stored in the routers along the path, a large buffer space is required. Store-and-forward switching was the first transmission scheme used in NoCs (e.g., [168]). Today it is obsolete due to its large area costs. In virtual cut-through switching, a packet header can leave the router before the entire packet has been received. The header allocates crossbar connections, and this allocation is kept until the packet is completely transmitted. If the header is blocked,
Fig. 2.2 Store-and-forward switching: message transmission is started after the prior transmission is completed
the buffer space for the remaining part of the packet will be allocated, and the part of the packet that is still in transmission can be stored at the blocking router. This is shown in Fig. 2.3. Wormhole switching is the most common packet-based switching method. It was proposed by Dally et al. [56]. Packets are split up into flits (short for flow control digits or flow control units). The head flit contains the routing information. The body flits contain the payload. A tail flit finishes the transmission of the packet. Depending on the implementation, the tail flit can be the last body flit (i.e., there is no separate tail flit). Flow control is applied at the flit level rather than at the packet level. Each flit must secure its buffer space separately. This allows reducing the buffer sizes required in routers, since packets need not be stored entirely within each router, as shown in Fig. 2.4. In a wormhole-based network, a packet can block many links. A packet at the front of a buffer with a blocked output will block subsequent packets that want to travel to another, free output. This scenario is called head-of-line blocking. It can lead to congestion propagating through the network. Virtual cut-through and wormhole switching have the same latency in a zero-load model, i.e., without congestion. Virtual cut-through switching has reduced latency and higher acceptance rates for light traffic and equal buffer sizes [24]. In any case, wormhole switching is widely used in NoCs due to its acceptable area costs. It can be found in
Fig. 2.3 Virtual-cut-through switching: packets with three flits span multiple routers
Fig. 2.4 Wormhole switching: a packet with four flits spans multiple routers
industry research such as Intel's Teraflops [104], the Tile64 [11], or the TRIPS [94] chips. When circuit switching is used, a virtual circuit from the source to the destination is established. The virtual circuit is exclusively reserved, and the data are transmitted through this connection. After successful transmission, the virtual-circuit connection is terminated. This reservation yields a high start-up delay but maximum bandwidth with small buffers (i.e., area). One possible application is quality of service (QoS), since circuit connections allow for service guarantees on bandwidth and throughput. For instance, the Æthereal NoC [92] consists of separate subrouters, of which one is optimized for guaranteed-throughput traffic using circuit switching, while the other targets best-effort traffic using wormhole switching. The QoS is managed by service guarantees, which determine the chosen subnetwork. Routers can implement multiple switching methods to combine their advantages.
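The buffer-versus-latency trade-off between the switching methods can be illustrated with the textbook zero-load latency model (a first-order sketch, not a cycle-accurate account of any specific router; it assumes one cycle per hop per flit and one flit per cycle on a link):

```python
# First-order zero-load latency comparison of switching methods
# (textbook model, no congestion). hops: path length in routers,
# packet_flits: packet length in flits.

def store_and_forward_latency(hops: int, packet_flits: int) -> int:
    # the whole packet is serialized again at every hop
    return hops * packet_flits

def wormhole_latency(hops: int, packet_flits: int) -> int:
    # only the head flit pays the per-hop latency; the body is pipelined
    # behind it (virtual cut-through behaves identically at zero load)
    return hops + (packet_flits - 1)

H, L = 6, 8
print(store_and_forward_latency(H, L))  # 48 cycles
print(wormhole_latency(H, L))           # 13 cycles
```

The gap grows with both the hop count and the packet length, which is why store-and-forward is obsolete in NoCs despite its simplicity.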
2.1.1.2 Router Architecture
The router architecture is the hardware structure in which a router is built. An exemplary, basic router architecture is shown in Fig. 2.5. It is not optimized and merely exemplifies the functionality of NoC routers. It is based on [81]. Data are transmitted from the input ports (left) to the output ports (right). Flits are stored in the input units' buffers. Based on the routing information in the head flits, the routing unit calculates the next router along the packet path. The switch arbiter allocates the crossbar using the state of the output unit and the results of the routing calculation and virtual channel (VC) allocation (Sect. "Virtual Channels"). Data transmission
Fig. 2.5 Schematics of an exemplary router architecture
can, for instance, be done using status fields in the input and output units [81]. The microarchitecture of such a router consists of the following components:

Buffer: There are two buffer locations in the exemplary router design. In the input unit, first-in-first-out (FiFo) buffers are used. Each buffer has the same width as a single flit and is several flits deep. Buffers are replicated per VC and port. In the exemplary design, there are also buffers in the output unit, which reduces the length of the critical path but increases area costs. Since memory adds significant area overhead to the router design, many works reduce the buffer area, for instance, by buffer sharing [145, 212] or by (buffer-less) deflection-routing schemes [41].

Crossbar: Crossbars consist of multiplexer and demultiplexer logic that pairwise connects input and output ports. The arbiter configures it. Optimized crossbar designs that save costs are proposed in [64].

Routing unit: The routing unit computes the routing decision for incoming packets. There are different routing algorithms; those with a small area and power overhead are preferred.

VC allocator: The VC allocator allocates a VC for incoming packets. For this purpose, it reads the status of each input port and VC. It selects the next free VC available at the output port based on the state of the downstream router, or it uses the fixed VC assigned to the packet. In general, VCs enable deadlock-free routing methods, a better router utilization, and QoS. Virtual channels are explained in Sect. "Virtual Channels".1

Arbitration unit: The arbitration unit selects which pairs of input and output ports are allowed to transmit data via the crossbar in each clock cycle. It reads the status of the input channels and finds a cover between the requests. An optimal solution to this coverage problem is infeasible in hardware, so heuristics are implemented. 
The most common is round-robin arbitration due to its low cost, but priority-based, congestion-aware, and fixed-order schemes have also been published.

Flow and link controller: The flow and link controller manages the data transmission between adjacent routers or between routers and PEs. It ensures that data are neither duplicated nor lost during transmission, and it prevents buffer overflows by flow control. Flow control is explained in Sect. 2.1.1.3.

The timing behavior of an exemplary, basic router architecture under zero load, in which only two subsequent packets traverse the router without congestion or collision, is shown in Fig. 2.6. After the head flit of the first packet has been received, the routing is calculated (RC) in the first cycle. In the second clock cycle, a virtual channel is allocated for the packet (VA). Next, the switch is arbitrated (SA). Finally, the head flit traverses the switch (ST) in the fourth clock cycle. In parallel, the first body flit wins switch arbitration. This pipeline is repeated for all body flits and, finally, the tail flit, which resets the routing calculation and the VC allocation for the
1 Please note that allocators may also be separated into input and output allocators. We do not explain this architecture here for the sake of brevity and kindly refer to [28].
Fig. 2.6 Exemplary router pipeline in a zero-load model
next packet. Then, the subsequent packet is processed. Other router architectures require more or fewer clock cycles to set up a connection.
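The zero-load pipeline just described can be written down as a small schedule generator. This is a simplified model of the four stages described above (RC, VA, SA, ST, one stage per cycle); the function and its names are illustrative, not part of any concrete router implementation:

```python
# Sketch of the four-stage router pipeline under zero load. Each flit is
# mapped to the cycle in which it occupies a stage; body and tail flits
# skip RC and VA and only compete for the switch (SA) before traversal (ST).

def pipeline_schedule(num_body_flits: int, start_cycle: int = 1):
    """Return {flit_name: {stage: cycle}} for one packet."""
    sched = {"head": {"RC": start_cycle, "VA": start_cycle + 1,
                      "SA": start_cycle + 2, "ST": start_cycle + 3}}
    cycle = start_cycle + 3  # body flits follow the head, one per cycle
    for i in range(num_body_flits):
        sched[f"body{i + 1}"] = {"SA": cycle, "ST": cycle + 1}
        cycle += 1
    sched["tail"] = {"SA": cycle, "ST": cycle + 1}
    return sched

sched = pipeline_schedule(num_body_flits=1)
# the head flit traverses the switch in cycle 4, the body flit in cycle 5,
# and the tail flit in cycle 6, matching the zero-load pipeline.
```

A subsequent packet would start its RC stage once the first head flit has left, which reproduces the overlapped schedule of two packets under zero load.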
2.1.1.3 Flow Control
Flow control ensures that packets are neither lost nor duplicated. It is done at the packet or flit level, depending on the switching method. There are two basic schemes for flow control. The ready-valid or on-off method uses two binary signals: valid indicates that the sender is providing data; ready indicates that the receiver can accept data. Thus, if both signals are true, a flit is transmitted. The credit-based method has a counter in the sender that stores the available buffer space in the receiver. The counter is decremented for each data unit sent and incremented for each data unit removed from the receiver's buffers. Both methods provide the required functionality, but the credit-based method is more expensive in area. It has two advantages, however. If pipeline stages are implemented on the link connecting the routers, credits will yield a better buffer utilization, because the equivalent of the ready signal is generated locally in the sender and therefore need not be transmitted with a delay. The more critical drawback of ready-valid flow control is the combinational paths spanning the sender and the receiver if it is used without pipeline stages in between. These paths result in poor timing because sender and receiver are distant in the physical layout. Hence, the credit-based method is the de facto standard.
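The credit counter described above can be sketched as a tiny state machine (an illustrative model of the sender side only; class and method names are assumptions):

```python
# Minimal sketch of credit-based flow control: the sender keeps a counter
# of free buffer slots in the receiver; it may only send while credits
# remain, and a returned credit restores one slot.

class CreditSender:
    def __init__(self, receiver_buffer_depth: int):
        self.credits = receiver_buffer_depth

    def can_send(self) -> bool:
        return self.credits > 0

    def send_flit(self):
        assert self.can_send(), "no credits: receiver buffer is full"
        self.credits -= 1  # one slot in the receiver is now occupied

    def credit_returned(self):
        self.credits += 1  # receiver removed a flit from its buffer

tx = CreditSender(receiver_buffer_depth=2)
tx.send_flit(); tx.send_flit()
assert not tx.can_send()      # buffer full: sender must stall
tx.credit_returned()
assert tx.can_send()          # one slot freed downstream
```

Because the counter lives entirely in the sender, no combinational path spans sender and receiver, which is the timing advantage noted above.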
Virtual Channels

Virtual channels time-multiplex a physical channel between senders and receivers. If a packet is blocked, another packet on a different VC is allowed to pass the channel in the otherwise idle time. This multiplexing reduces the impact of head-of-line blocking in wormhole switching [55]. Virtual-channel-based wormhole and virtual cut-through switching are compared in [70]. The latency of both methods
is comparable, yet a much higher throughput can be achieved using VCs for equal buffer capacity (i.e., area). Therefore, wormhole switching with VCs is preferable to virtual cut-through switching. Routers with VCs have a higher energy consumption and yield a larger area, since buffers are replicated per VC in every physical channel. In [186], it is found that more than four VCs per channel add only a marginal performance advantage. There are different selection strategies for VCs. Dynamic VC allocation in each router allows for time-efficient multiplexing of the physical channel, which can be realized, for instance, using a round-robin arbiter. Virtual channels that are assigned to packets during packet generation and do not change during transmission can be used for QoS by prioritization of channels.
2.1.1.4 Network Topology
The network topology is the interconnection scheme between routers. It can be defined via the topology graph of the network.

Definition 1 (Topology Graph) The topology graph of an NoC is given by the graph N = (R, E_N), in which the set R of vertices consists of all routers r_i, with i ∈ {1, …, |R|}, and the set of edges e_{i,j} ∈ E_N models the connections between the routers r_i and r_j ∈ R. The graph is not directed, since links are usually bidirectional.2 Routers are not self-connected, i.e., E_N ⊆ {(i, j) | i, j ∈ R, i ≠ j}.

Network topologies are characterized as follows:

Network diameter: The network diameter is the largest shortest-path distance, counted in hops, between any two routers in the network. In a network with a large diameter, packets travel many hops to reach their destination in the worst case. In general, small network diameters are preferable [59].

Average distance: The average distance, counted in hops, is calculated as the average of the shortest distances between all pairs of routers. A large average distance has a negative influence on performance, since it increases the average transmission latency.

Node degree: The node degree denotes the number of ports of the routers. Routers with fewer ports generally have reduced area costs but possibly yield a worse network performance. A large node degree complicates wire routing during physical layout.

Bisection width: The bisection width is defined as the number of edges that must be removed to bisect the network. A network with a small bisection width is more prone to partitioning due to link failures.

Number of links: The number of links in a network should be large, since it increases the network's bandwidth.
2 Links are unidirectional from an architectural point of view. However, from a network perspective, they are bidirectional, because in nearly all proposed practical implementations connected pairs of routers have links in both directions. Hence, we prefer modeling via an undirected graph.
There are many works on the influence of different topologies. Most common is the 2D mesh, in which routers are located in a grid and neighboring routers are connected, as shown in Fig. 2.7a. It has an obvious structure and allows for lightweight routing algorithms. This topology is also common in industry. The first such chip was presented in 2007 by Intel, which implemented the Teraflops chip with 80 cores and an NoC in mesh topology as interconnect [105]. More recently, in 2020, the topology can be found in many products, e.g., in AI accelerators [170]. A 2D mesh topology with n and m routers per dimension has a rather large diameter of m + n − 2 [139]. The bisection width is also large with min(n, m), and the number of links is 2(m(n − 1) + n(m − 1)) [139]. The node degree is between 3 and 5, depending on the router's position. The average distance is (m + n)/3 [139]. The mesh topology can be extended to a torus by connecting the outer routers to their peers at the opposing side of the network (Fig. 2.7b). The torus topology reduces the average hop distance but is difficult to implement due to wire-length restrictions and layout complications. It is gaining attention again with optical NoCs [258] and 3D technologies [192] due to relaxed layout constraints. Another topological type is a tree-based network, which is especially popular in high-performance computing networks. One example is shown in Fig. 2.7c. Sending data via the network requires moving them up and down the tree, depending on the number of components, their location, and the tree structure.
Fig. 2.7 NoC topologies. (a) Mesh. (b) Torus. (c) Tree. (d) Small world
Routing algorithms for trees are rather simple, in particular for binary trees. Finally, the advantages of small-world graphs, such as a small diameter, can be exploited in NoCs as well [53] (Fig. 2.7d).
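The closed-form 2D-mesh metrics quoted above can be cross-checked by brute force, since hop distances in a mesh are Manhattan distances. This sketch (function name and structure are illustrative) verifies the diameter m + n − 2 and the link count 2(m(n − 1) + n(m − 1)):

```python
# Cross-check of the quoted 2D-mesh metrics by brute force: the diameter
# is the maximum hop (Manhattan) distance over all router pairs of an
# n-by-m grid; links are counted in both directions.

from itertools import product

def mesh_metrics(n: int, m: int):
    nodes = list(product(range(n), range(m)))
    dists = [abs(x1 - x2) + abs(y1 - y2)
             for (x1, y1), (x2, y2) in product(nodes, nodes)]
    diameter = max(dists)                     # equals n + m - 2
    links = 2 * (m * (n - 1) + n * (m - 1))   # unidirectional links
    return diameter, links

print(mesh_metrics(4, 4))  # (6, 48)
```

The same brute-force approach extends to the average distance or, with wrap-around distances, to the torus.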
2.1.1.5 Routing Algorithm
The routing algorithm, or routing function, calculates the path of packets from source to destination. The calculation depends on the topology of the network and might also consider the network's state. The routing function is defined in [69] as S : N × N → P(E_N), in which P(E_N) denotes the power set of E_N. Routing functions possess the following properties:

Unique addresses of nodes: A unique address is assigned to each node to avoid ambiguity.

Deadlock freedom: Deadlocks stall the network and thus prevent packet delivery. Therefore, deadlock freedom must be guaranteed, or deadlocks must be resolved, e.g., using VCs. Deadlocks are explained in detail in Sect. "Deadlocks".

Livelock freedom: In a livelock, packets travel without ever reaching their destination. This prevents packet delivery. Livelock freedom must be proven. Livelocks are explained in detail in Sect. "Livelocks".

Networks on chips can implement two options for the locality of the routing calculation. First, in source routing, the path of each packet is calculated at the source and attached to the packet. During transmission, this information is read, and routing is executed without further calculations. Second, in distributed routing, each node calculates the next hop of the route from the destination address in the packet header. Both techniques can be implemented either using look-up tables, in which a routing decision is stored per destination address, or via dynamic calculation using a routing function. The latter is preferable because look-up tables require a significant area overhead for larger networks. Routing functions can be classified into two categories. In the first category, routing algorithms can be minimal, in which packets follow one of the shortest paths between source and destination, or non-minimal, in which longer routes can be selected. The second category distinguishes the information sources that the routing function takes as input. 
If the path between source and destination is known before the transmission starts, this is called deterministic routing. The path does not depend on network conditions such as link loads, congestion, or faulty links. In oblivious routing, the selected path of a packet does not depend on the state of the network either, but it is selected randomly or cyclically from a given set of alternatives. If the decision about the path taken depends on the network status at run time, adaptive routing is applied. Adaptive routing algorithms can adapt to influences on the network from external sources, such as faulty links, or internal sources, such as congestion or high link loads. However, adaptive routing algorithms have
an area overhead and worse timing, and they might further increase the total network load due to detours. Many adaptive routing algorithms have been proposed [46, 71, 72]. For oblivious and adaptive routing algorithms, it is required to differentiate between the routing algorithm and the selection: while the routing algorithm defines a set of possible paths, the selection picks one element of this set. Popular algorithms are based on turn models [91], in which some of the turns in the network are forbidden to avoid cyclic dependencies and thus deadlocks. The most common routing algorithm that implements a turn model is dimension-ordered routing (DOR). The packet travels along one dimension of the network until the difference between the packet's location and its destination in this dimension is zero. This is repeated in the following dimensions until the destination is reached. In a 2D mesh NoC, this routing algorithm is called "XY routing", indicating that the offset in the X dimension is zeroed before the Y dimension.
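Dimension-ordered XY routing is simple enough to sketch in full (port names follow the mesh ports introduced earlier; the function itself is illustrative):

```python
# Sketch of dimension-ordered (XY) routing in a 2D mesh: the X offset is
# zeroed first, then the Y offset; one output port is emitted per hop.

def xy_route(src: tuple[int, int], dst: tuple[int, int]) -> list[str]:
    """Return the sequence of output ports a packet takes under XY routing."""
    x, y = src
    ports = []
    while x != dst[0]:                     # first zero the X dimension
        ports.append("east" if dst[0] > x else "west")
        x += 1 if dst[0] > x else -1
    while y != dst[1]:                     # then zero the Y dimension
        ports.append("north" if dst[1] > y else "south")
        y += 1 if dst[1] > y else -1
    ports.append("local")                  # deliver to the attached PE
    return ports

print(xy_route((0, 0), (2, 1)))  # ['east', 'east', 'north', 'local']
```

Because X is always zeroed before Y, turns from the Y dimension back into the X dimension never occur, which is exactly the turn restriction that makes XY routing deadlock-free.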
2.1.1.6 Deadlocks
Deadlocks occur due to cyclic dependencies between packets in a network. This is depicted in Fig. 2.8. Let us start by looking at the upper left router. The gray flits cannot be transmitted since the northern input ports of the router on the lower
Fig. 2.8 Deadlock configuration in the network due to cyclic dependency between four packets. The network stalls, since none of the packets can be transmitted, although subsequent routes are not blocked
right-hand side are blocked by green flits. The green flits themselves cannot be transmitted either, since red flits block their route. Those, in turn, are blocked by blue flits. These, closing the cyclic dependency, are blocked by the gray flits again. The network stalls since none of the flits can be transmitted. Deadlocks can be identified using the channel dependency graph [57]. The vertices of this graph are the channels of the NoC. For each dependency between a channel pair in the network, an edge is drawn. Each cycle in this graph represents a deadlock configuration. In the example in Fig. 2.8, the four channels between the four routers have a cyclic dependency. Therefore, a cycle can be found in the corresponding channel dependency graph. Deadlock freedom can be proven using the channel dependency graph with Duato's theorem [69]. A simple way to avoid deadlocks is the use of routing algorithms in which one of the turn directions is prohibited. Another solution is the introduction of VCs: the dependency between the blocked packets can be resolved by switching VCs for one of the packets, which then can be transmitted, thus breaking the cyclic dependency. Finally, routers can reserve empty spaces in the buffers that periodically move through the network to resolve deadlocks [188]. Deadlocks can also arise from system-level dependencies. One example is dependent data types, such as cache requests and responses, that cannot be resolved. These deadlocks can occur even though the routing algorithm is deadlock-free by construction. Most commonly, virtual networks (separated subnets) are used to resolve this; they can be implemented either by replicating routers or by using separate sets of VCs in the same router that do not interfere.
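The cycle check on the channel dependency graph is an ordinary graph-cycle search. A sketch (the channel names model the four-channel ring of the deadlock example; the detection itself is standard DFS coloring, not a method specific to this book):

```python
# Sketch of deadlock detection via the channel dependency graph: vertices
# are channels, edges are dependencies between channels, and any cycle
# indicates a deadlock configuration.

def has_cycle(deps: dict[str, set[str]]) -> bool:
    """Detect a cycle in a directed dependency graph via DFS coloring."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in deps}

    def dfs(v: str) -> bool:
        color[v] = GRAY                      # on the current DFS path
        for w in deps.get(v, ()):
            if color[w] == GRAY or (color[w] == WHITE and dfs(w)):
                return True                  # back edge found: cycle
        color[v] = BLACK                     # fully explored
        return False

    return any(color[v] == WHITE and dfs(v) for v in deps)

# four channels blocking each other in a ring, as in the deadlock example
cyclic = {"c0": {"c1"}, "c1": {"c2"}, "c2": {"c3"}, "c3": {"c0"}}
acyclic = {"c0": {"c1"}, "c1": {"c2"}, "c2": set(), "c3": {"c0"}}
print(has_cycle(cyclic), has_cycle(acyclic))  # True False
```

A turn-restricted routing algorithm such as XY routing guarantees that the resulting channel dependency graph is acyclic by construction.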
2.1.1.7 Livelocks
Livelocks occur if packets travel infinitely in the network but never reach their destination. As a solution, packets can be restricted to minimal paths. Thereby, packets reduce the distance to their destination in each step and will ultimately reach it. If the routing algorithm is non-minimal, the nonexistence of livelocks must either be proven mathematically or ensured by livelock detection. The latter can be realized via counters that track how often a packet traverses nodes. If a threshold is surpassed, a livelock is detected, and the packet is treated specially.
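The counter-based detection can be sketched in a few lines (threshold value, field names, and the escalation action are assumptions for illustration):

```python
# Sketch of livelock detection for non-minimal routing: a per-packet hop
# counter is incremented at every router; once a threshold is exceeded,
# the packet is flagged for special treatment (e.g., forced onto a
# minimal path).

HOP_THRESHOLD = 32  # assumption; would be sized relative to the diameter

def forward(packet: dict) -> dict:
    packet["hops"] += 1
    if packet["hops"] > HOP_THRESHOLD:
        packet["livelocked"] = True  # escalate to special treatment
    return packet

pkt = {"hops": 0, "livelocked": False}
for _ in range(33):
    pkt = forward(pkt)
assert pkt["livelocked"]
```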
2.1.1.8 Application Mapping
The problem of assigning tasks, often called cores, to PEs is called application mapping.3 The aim is the reduction of communication latency for a given network and application. It is a rather significant problem because it allows for efficient
3 The usage of the terms cores and tasks is ambiguous in the literature, since the term core is used both for tasks and for PEs.
application execution on chips connected by an NoC. The mapping consists of two steps: First, the task graph of the application is used to allocate tasks to cores. The resulting core graph has cores as vertices and the bandwidth requirements as edge weights:

Definition 2 (Core Graph) The core graph is defined as the digraph G = (C, E). Vertices c_i ∈ C represent cores. Edges e_{i,j} ∈ E model the communication between cores c_i and c_j. The bandwidth requirement between two cores is given by the edge weight u_{i,j}.

Second, during the actual mapping, cores from the core graph are mapped onto PEs under consideration of the topology graph. The objective is to reduce the communication delay of the application. The mapping problem is NP-hard [201]. Mapping is divided into two classes: In static mapping, the core graph is assigned to the topology graph at design time. In dynamic mapping, the assignment is allowed to change at run time, which may yield performance benefits by avoiding congestion with the communication of other tasks [214]. There are exact methods to solve the problem via a formulation as a mixed-integer linear program (MILP) [29]. This modeling allows for an exact solution yet yields long or even prohibitive compute times. This issue led to the use of heuristics; an overview can be found in the survey [214].
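The quality of a given mapping can be scored with the bandwidth-weighted hop count, the objective that both exact MILP formulations and heuristics typically minimize. A sketch for a 2D mesh under XY routing (the core graph, the two candidate mappings, and all names are illustrative):

```python
# Sketch of the mapping objective: a mapping assigns cores to mesh tiles,
# and its cost is the total bandwidth-weighted hop count of all core-graph
# edges. A mapping heuristic would search for a low-cost assignment.

def mapping_cost(edges, mapping):
    """edges: {(src_core, dst_core): bandwidth}; mapping: core -> (x, y)."""
    cost = 0
    for (ci, cj), bw in edges.items():
        (x1, y1), (x2, y2) = mapping[ci], mapping[cj]
        hops = abs(x1 - x2) + abs(y1 - y2)  # XY-routing hop distance
        cost += bw * hops
    return cost

core_graph = {("c0", "c1"): 10, ("c1", "c2"): 5}
good = {"c0": (0, 0), "c1": (0, 1), "c2": (1, 1)}  # communicating cores adjacent
bad = {"c0": (0, 0), "c1": (2, 2), "c2": (0, 1)}   # heavy edge stretched
print(mapping_cost(core_graph, good), mapping_cost(core_graph, bad))  # 15 55
```

Placing heavily communicating cores on adjacent tiles minimizes this objective, which is the intuition behind most mapping heuristics.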
2.1.1.9 Evaluation
Network performance, area costs, and power consumption are the most crucial design metrics for NoCs. This set is called quality of results (QoR) or power-performance-area (PPA). Other, less relevant metrics are defined as well: For instance, reusability allows reducing the time-to-market. Reliability is relevant for special applications such as space. Scalability measures how many components can be connected. Good PPA naturally increases scalability, which is why scalability is seldom considered as a separate measure.
Performance

The most obvious metric to evaluate an NoC is its performance, i.e., its capability to transmit data. Different design targets can be relevant: If the transmission is time-critical, the latency must be small. If large amounts of data are transmitted, the throughput must be high.

Packet latency: Packet latency is the time for the transmission of a packet via the NoC. If a packet p is injected at time t_i and received at time t_r, the packet's latency is l_p = t_r − t_i. Latency can be measured in simulations for every packet, and the usual statistical measures can be applied. It is denoted in clock
cycles, independent of the implementation, or in seconds if the clock frequency is available.

Throughput: Throughput measures the available bandwidth for communication. It is defined as the number of packets n_packets that are accepted per period of time Δt: Throughput = n_packets / Δt. This metric is denoted in packets or flits per cycle or per second. The throughput can be converted to [Mb/s] using the link width and packet size.

There are three options to evaluate these metrics: First, analytical models are fast but demanding to formulate. Second, emulations can be conducted on a transaction-level model (TLM) or a cycle-accurate model (CA). These offer less speed but can rather effortlessly cover many effects. Finally, simulations at the register-transfer level (RTL) or via FPGA prototyping are possible, in which the actual circuit is modeled; they therefore offer the highest precision. Since the network load depends on the traffic patterns within the network, it is essential to compare different NoCs under the same load. Therefore, different traffic patterns (benchmarks) are defined. A fair comparison and, therefore, a proper evaluation is possible using a clear definition of benchmarks.
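The two metrics follow directly from a simulation record of injection and reception times. A sketch over (t_i, t_r) pairs, using the definitions above (the record itself is made-up example data):

```python
# Sketch of latency and throughput evaluation from a simulation record of
# (injection_cycle, reception_cycle) pairs, following the definitions above.

def evaluate(records: list[tuple[int, int]]):
    latencies = [t_r - t_i for t_i, t_r in records]          # l_p = t_r - t_i
    avg_latency = sum(latencies) / len(latencies)            # in cycles
    span = max(t_r for _, t_r in records) - min(t_i for t_i, _ in records)
    throughput = len(records) / span                         # packets/cycle
    return avg_latency, throughput

records = [(0, 12), (4, 18), (8, 22), (12, 28)]  # example data
avg_latency, throughput = evaluate(records)
print(avg_latency, throughput)
```

With a known clock frequency, link width, and packet size, the same numbers convert to seconds and Mb/s as noted above.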
Synthetic Traffic Patterns

Synthetic traffic patterns follow mathematically defined spatial and temporal distributions, which do not necessarily reflect the properties of a real system. Synthetic traffic patterns offer high comparability between different designs but limited modeling of realistic traffic properties. The spatial distribution can be modeled via a relationship between source and destination addresses. Assuming an n-bit binary address b_1, b_2, …, b_n per component, the source-destination relation is given by an address permutation f_p : {0, …, 2^n − 1} → {0, …, 2^n − 1}. The most common spatial distributions are the following:

Uniform random: In uniform random traffic, a new destination address is drawn per packet prior to transmission following a uniform random function. It is a very common traffic pattern, shown in Fig. 2.9a.

Hotspot: In the hotspot traffic scenario, each component sends data to the same destination with the address d_1, d_2, …, d_n. It follows the function f_hotspot(b_1, b_2, …, b_n) = d_1, d_2, …, d_n. This spatial distribution is not a permutation since it is not bijective. Hotspot traffic is used to test the worst-case communication in a network, since the router at the global destination is loaded as much as possible. The traffic pattern is shown in Fig. 2.9b.

Transpose: Data are sent to the diagonally opposite side of the network. This pattern is equivalent to a matrix-transposition permutation. It puts high stress on both vertical and horizontal links. The traffic pattern is shown in Fig. 2.9c.

Bit complement: Here, data are sent to the opposing side of the network. The permutation function of the bit complement spatial distribution is
Fig. 2.9 Synthetic traffic patterns. (a) Uniform random. (b) Hotspot. (c) Transpose. (d) Bit complement. (e) Bit reversal
f_bit complement(b_1, b_2, …, b_n) = ¬b_1, ¬b_2, …, ¬b_n. This traffic pattern stresses vertical and horizontal links (between diagonally opposing nodes) similarly to the transpose traffic pattern; its distribution is more homogeneous, however, because some links are not loaded at all under transpose traffic. The traffic pattern is shown in Fig. 2.9d.

Bit reversal: Data are sent between nodes with reversed bit addresses. The permutation function is f_bit reversal(b_1, b_2, …, b_n) = b_n, b_n−1, …, b_1. The pattern stresses horizontal and vertical links locally, between locally opposing nodes. The traffic pattern is shown in Fig. 2.9e.

In terms of temporal distribution, the injection rate is modulated from low to high to find the saturation point of the network. Injection rates are measured in flits or packets per cycle. Therefore, the average time between two injected packets/flits must leave enough idle time in between such that their ratio gives the desired injection rate. This injection mechanism is shown in Fig. 2.10: The PEs inject at a random time in each time slot; the length of the time slots is inversely proportional to the injection rate. Please note that the injection rate can be measured per component or for the whole network. If measured per component, the instances of time for injection must be different between components. Otherwise, bursts in the traffic result in (virtually) higher injection rates and disproportionate traffic properties.
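The spatial permutations above can be written down directly for n-bit addresses represented as bit tuples. The half-swap formulation of transpose is the usual convention for this pattern and is an assumption here; function names are illustrative:

```python
# Sketch of the synthetic spatial distributions for n-bit addresses,
# operating on bit tuples (b1, ..., bn).

def bit_complement(bits):
    # each destination bit is the inverted source bit
    return tuple(1 - b for b in bits)

def bit_reversal(bits):
    # destination address is the source address with the bits reversed
    return tuple(reversed(bits))

def transpose(bits):
    # swap the two address halves (row/column swap in a square mesh);
    # the usual convention for the matrix-transposition pattern
    h = len(bits) // 2
    return bits[h:] + bits[:h]

addr = (1, 0, 1, 1)
print(bit_complement(addr))  # (0, 1, 0, 0)
print(bit_reversal(addr))    # (1, 1, 0, 1)
print(transpose(addr))       # (1, 1, 1, 0)
```

In a traffic generator, such a permutation is evaluated once per source to obtain the fixed destination, whereas uniform random traffic draws a fresh destination per packet.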
2 Interconnect Architectures for 3D Technologies
Fig. 2.10 Modeling a random injection rate. Time-slot length (gray) is given by the injection rate, and PEs send data at a random instant of time within each slot
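The slotted injection mechanism of Fig. 2.10 can be sketched as follows (a minimal illustration under our own naming; one packet per slot, with the slot length equal to the inverse of the injection rate):

```python
import random

def injection_times(num_packets, injection_rate, seed=0):
    # slot length in cycles; a higher injection rate gives shorter slots
    slot = 1.0 / injection_rate
    rng = random.Random(seed)
    # each packet is injected at a uniformly random instant inside its own slot,
    # so the long-run rate matches injection_rate while synchronized bursts are avoided
    return [k * slot + rng.uniform(0.0, slot) for k in range(num_packets)]
```

Seeding each PE with a different value keeps the injection instants decorrelated between components, as required above.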
Task-Graph-Based Traffic Patterns Realistic traffic patterns are needed that do not require running a slow full-system simulator (FSS). Hence, task-graph-based traffic patterns have been proposed. Combining the simulation of application-based data streams and router simulation on a low abstraction level is conducted in the NocBench project [163]. Applications are modeled using Kahn process networks [126], a common software model. There are different benchmarks such as universal mobile telecommunication system (UMTS) modems or video encoders and decoders [195]. Liu et al. propose traffic patterns for NoCs based on real-world applications using a task graph in the MCSL suite [159]. The suite comprises eight benchmarks covering applications such as a sample-rate converter, media encoders and decoders, or robot control.
Real-World-Based Traffic Patterns The most precise method to benchmark NoCs with real-world applications is an FSS running in parallel to the simulation of the NoC. This method, however, is very slow. For FSS of multi-core processors, simulators such as Gem5 [34] can be used. The Princeton Application Repository for Shared-Memory Computers (PARSEC) benchmark suite [32] is one option to benchmark systems, with a focus on workloads that are emerging as representative of next-generation multi-core processors. Many other benchmark suites exist as well, but we skip a discussion for the sake of brevity. However, we would like to note that most of them are not focused on modern SoC workloads. Trace-driven simulation tackles the slow simulation speed of FSS. A trace of the packet-injection activity in the network is recorded within a single FSS run. Then, this trace can be replayed to generate the same traffic again. Modifications in the system architecture (not the interconnect architecture) require newly recorded traces. Dependency tracking covers the effects of congested interconnects, as proposed in Netrace [103]. The traces are generated using an FSS of a 64-core system. To further increase the performance, these trace-driven simulations can be conducted on an FPGA [65].
Area The second important design metric is area. It should be minimized, since auxiliary functions such as the interconnection network must occupy as little chip area as possible to leave more area for functional components. Usually, the area costs are evaluated via synthesis of the router architecture for FPGAs or standard cells. In the case of FPGAs, vendor-specific compilers such as Xilinx Vivado are used to synthesize the very high speed integrated circuit hardware description language (VHDL) or Verilog description of the routers. Since NoCs yield relatively large resource utilization on FPGAs, NoCs are normally not used to connect components there. Therefore, FPGAs are used for NoC prototyping or verification rather than for production purposes. For this, an NoC performance evaluation framework for FPGAs can be used [65]. If NoCs are to be used in production, a synthesis for standard-cell technology will be conducted with synthesis tools from Synopsys [234] or Cadence [113]. This approach also requires libraries of the target technology nodes, which are available for commercial technologies from the vendors as closed source. Academic researchers will use open-source libraries for predictive technologies. Synthesis for standard cells is the only reasonable evaluation for heterogeneous 3D integration because only it reveals the effects of disparate technologies.
Power The third important design metric for NoCs is their power consumption. Synthesis both for FPGAs and for standard cells provides power-consumption estimations per transaction and for average usage patterns. NoC simulators sometimes also provide power estimations. Most often, the power consumption of routers is estimated by simply counting events such as buffer writes, buffer reads, routing calculations, or data transmissions [42]. This modeling is imprecise since the power consumption depends on the transmitted data [87]. Therefore, novel methods and models for power estimation of NoCs are required, especially in the presence of heterogeneity.
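Such an event-count power model can be sketched in a few lines; the per-event energy values below are placeholders for illustration only, not numbers from [42] or any real technology:

```python
# hypothetical per-event energies in pJ (illustrative only)
ENERGY_PJ = {
    "buffer_write": 1.2,
    "buffer_read": 0.9,
    "route_calc": 0.4,
    "link_traversal": 2.1,
}

def router_energy_pj(event_counts):
    # total energy = sum over event types of (event count x per-event energy);
    # note that this ignores the transmitted data, which is exactly the
    # imprecision criticized above
    return sum(ENERGY_PJ[event] * count for event, count in event_counts.items())
```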
Timing The maximum achievable speed at which routers can be clocked is another critical metric for NoCs. This metric is called timing. It determines the NoC performance to a large degree, since a router clocked faster by a given factor yields a performance increase by the same factor. The timing of an NoC can be found by synthesis from VHDL or by using estimation tools such as Orion 3.0 [127]. In the scope of this book, none of the existing estimation tools are applicable since they do not account for heterogeneity.
Network on a Chip Evaluation via Simulations A fair and comparable evaluation of NoC design methods and implementations is complex, since assessing performance and power consumption is difficult. Common design tools only vaguely estimate the power consumption; the estimation quality differs largely depending on the assumptions about the utilization of the on-chip network. Detailed knowledge about the network load is required for an exact power evaluation. These data can often only be obtained via simulations. Many simulators have been proposed for this purpose; both universal simulators and specific tools exist [161]. According to the NoC Blog [181], the most popular NoC simulators are BookSim 2.0 [121] and Noxim [43], which offer various features and are actively maintained. Noxim is implemented in SystemC, and many parameters, such as the depth of buffers, the size of packets, and the routing algorithm used, can be set. For benchmarking, synthetic traffic patterns are implemented. Noxim measures throughput and packet delay and estimates power consumption. The power estimation is based on a simple cycle-accurate (CA) model, which tracks different events. The timing behavior of routers in Noxim is static yet can be modified via the source code. The router model is divided into a receive method, which handles incoming flits and stores them into buffers, and a transmit process, which sends flits and calculates routes. The simulator BookSim is comparable in features to Noxim. Its router model is also cycle-accurate, and the implementation is written in C++. BookSim is a flexible tool since many topologies can be read from configuration files. Also, multiple router architectures with various routing algorithms are already implemented. Again, only synthetic traffic patterns can be injected into the network. The evaluation features are similar to Noxim; a power model is not provided.
2.2 Overview of Interconnect Architectures for 3D ICs Three-dimensional systems on chips require a complex communication strategy including not only dedicated interconnect architectures within each die but also interconnect architectures spanning multiple (possibly heterogeneous) dies. For the communication between dies, point-to-point interconnects, shared buses, and NoCs are proposed. Point-to-point interconnects between dies show the same limitations as in 2D designs: they are neither flexible enough for the communication patterns occurring in 3D SoCs nor do they allow for design reuse. Because of their simplicity, they are nevertheless often used to connect directly adjacent dies. The implementation of vertical buses shows some principal problems. A general scalable shared bus is difficult to implement in the vertical direction, especially since most standard shared buses require a central arbiter and decoder, which is hardly compatible with the any-to-any post-silicon stacking of dies [261]. Although some research is conducted on 3D buses [261], most of the existing systems prefer either
dedicated point-to-point ad-hoc approaches or NoCs. Simple 3D buses are gaining attention in hybrid NoC-bus interconnects, where the buses vertically connect NoC routers [7, 207].
2.3 Three-Dimensional Networks on Chips The scalability of NoCs makes them well suited even for 3D SoCs. Three-dimensional technologies enable previously discarded topologies for NoCs, since the wire-length constraints and layout complications of 2D designs are reduced [79]. For example, the torus topology is again promising for 3D designs, since its main drawback (long wires on 2D chips) is nullified [194]. In addition, new technologies such as optical links and 3D chip stacking provide new freedom in the design of network topologies [194, 220]. Moreover, the zero-load latency can be reduced by 15 to 20% for chips with a large number of components in comparison to 2D designs [194]. Three-dimensional NoCs can be divided into three categories: homogeneous 3D NoCs, heterogeneous 3D NoCs, and hybrid 3D NoCs, as shown in Fig. 2.11. Homogeneous 3D NoCs extend 2D designs by a spatial dimension, yet not by new manufacturing technologies (e.g., see [235]). Existing works tacitly assume a multi-layer homogeneous 3D SoC, such that communication costs in each die are identical. The main difference to 2D NoCs is that routers in 3D systems have more ports, e.g., seven instead of five in a 3D mesh. This drives area, power, and timing. However, the advantages of a 3D topology with more paths and a lower network diameter usually offset these costs. Heterogeneous 3D NoCs [7, 8] denote NoC designs with non-uniform properties at the architectural level [124].4 In one approach, a heterogeneous mixture of 2D and 3D routers is constructed. This inter-router heterogeneity has been investigated in [8], achieving power reductions of around 20% with negligible performance
Fig. 2.11 Schematic illustration of different 3D NoC types. (a) Homogeneous 3D NoC. (b) Heterogeneous 3D NoC. (c) Hybrid 3D NoC
4 The term can also be used for 2D NoCs. In this case, heterogeneous (non-uniform) designs combine different router architectures in a single silicon layer.
degradation. Further on, [227] reports additional energy reductions of 40%. Interestingly, this latter work claims that 3D systems require heterogeneous NoCs due to the technology asymmetry typically found in 3D systems. However, the technology asymmetry is not further investigated. In another approach, multiple router architectures are implemented side-by-side on homogeneous 3D SoCs targeting cost reductions. A standard router can be divided [189] and implemented in multiple layers. This method provides a multi-layered 2D NoC router in a 3D IC. Units that can be separated span multiple layers (e.g., crossbars and buffers). Routing and arbitration are inseparable and thus located in one layer. Signals within the router are transmitted using TSVs. This 3D router architecture with a 2D network topology achieves a 42% area reduction and a 51% performance increase for synthetic traffic patterns in comparison to a 2D router architecture. For real-world applications, energy savings are achieved by dynamically shutting down idle layers. In [211], link and crossbar sharing between routers decreases the network latency by up to 21% compared with a standard 3D router. Methods in which buses connect NoC routers are referred to as hybrid NoCs [7, 207, 208]. Buses can be vertical or horizontal. The concept can be applied both to 2D and 3D NoCs.5 The fundamental premise for vertical buses is that these transmissions do not need to be hop-by-hop; routers communicating vertically via a local bus are sufficient. This saves area, since the number of ports per router is reduced, and improves performance, since packets can cross more than one die in one step. The logical bus can be implemented either with tri-state logic [7] or with a simple demultiplexing and multiplexing stage [208]. The number of buses connecting the silicon layers in the hybrid 3D-mesh topology has been optimized in [7]; it produces an intra-router heterogeneity which saves up to 20% of the area of 3D-mesh NoCs.
Finally, fault tolerance and thermal issues are discussed in [207].
2.3.1 Performance, Power, and Area The same optimizations for throughput and latency known from 2D NoCs are also valid for (homogeneous) 3D systems. Look-ahead routing, for instance, reduces the system latency and increases the throughput by approximately 45% in 3D systems [9], which is similar to the result in 2D networks [172]. While many benchmarks exist for 2D systems, there are few for 3D NoCs. Synthetic benchmarks can be applied to both 2D and 3D NoCs. Most of the real-world-based benchmarks are tailored to 2D ICs and therefore cannot be used. Common NoC simulators offer the capability to model 3D networks as well, but are not focused on the special properties of (heterogeneous) 3D technology.
5 As with heterogeneous NoCs, this term can also be used for 2D NoCs. Several approaches exist to synthesize an application-specific NoC-bus topology especially suited to the communication patterns found in the system.
Area and power of routers increase if 2D designs are extended to the third dimension. Additional buffer space is required for the ports in the vertical directions. The vertical links themselves require area for keep-out zones (KOZs), as discussed in Paragraph 1.2.2.2. The crossbar size increases, and additional logic is needed for arbitration and routing calculation, although this contribution is marginal. Reference [79] shows that the area increase per router is about 50% when moving from a 2D-mesh to a 3D-mesh topology without optimizing the router architecture. Therefore, area reductions in 3D NoCs are essential. The majority of works addressing this focus on the placement and number of vertical links: The TSV count is reduced by partially connecting layers [17, 82] or by sharing TSVs among neighboring routers [246]. The router architecture can be optimized as well [247]. Novel router architectures achieve optimizations of the power consumption. For instance, reference [148] proposes to use partially activated crossbars, clock-frequency scaling, and serial-link coding. This can be applied to 2D and 3D NoCs. The differing properties of vertical and horizontal links, which have a vast influence on the energy consumption, have only recently been investigated [18].
2.4 Conclusion In this chapter, we introduced NoCs as the primary communication infrastructure used in SoCs. In addition to the presentation of the general properties and design challenges of NoCs, we focused on the extended design options in heterogeneous 3D ICs. For such chips, it is not only possible to select different network topologies or router architectures per tier, but also to distribute the components of an NoC router among neighboring tiers. This approach results in a large number of macro- and micro-architectural design options. Without a good understanding of the impact of individual design decisions on the overall system design and the interdependencies between individual design decisions, it will not be possible for a system designer to identify good design options. For this reason, we present models and design guidelines for optimizing 3D SoC interconnect architectures in the following chapters. Heterogeneous 3D technology poses several critical challenges for NoCs. We will evaluate and solve these challenges throughout this book in detail, but the following simple example highlights the critical effect of heterogeneity. If one combines a more conservative technology with a more advanced one, the complexity and costs of the router microarchitecture in the more conservative layer must be reduced. Otherwise, the design is limited by a significant power/area or timing imbalance. As we will show in this book, only specific router architectures can resolve this issue. We will introduce one approach that saves area, improves timing, and optimizes yield by reducing the number of TSV links.
Part II
3D Technology Modeling
Chapter 3
High-Level Formulas for the 3D-Interconnect Power Consumption and Performance
In this chapter, generally valid high-level formulas to estimate the interconnect power consumption and performance are systematically derived. An abstract and yet accurate model that maps abstract pattern properties to power-consumption and performance quantities is required to derive and rapidly evaluate techniques to improve the 3D-interconnect power consumption and performance at higher abstraction levels. The presented high-level formulas enable the estimation of the power requirements and performance of a system at early design stages, without complex circuit simulations. The interconnect architecture of modern systems on chips (SoCs) has a throughput of several GB/s. This data rate makes it almost impossible to perform circuit simulations to determine the application-specific power requirements of the interconnects. A Spectre circuit simulation for the transmission of only 3500 bit-patterns/data-samples with a throughput of 1.1 GB/s over a single global 3 × 3 through-silicon via (TSV) array already results in a run time of about 1140 s on a Linux Xeon E5-2630 machine with 256 GB of random-access memory (RAM).1 In contrast, a high-level estimation (typically based on SystemC [43, 44]) only takes a few seconds, independent of the pattern count and rate, with an error of only a few percent. For system simulations with data sets containing millions of samples, this translates to a speedup that is significantly higher than 1 × 10^6, just due to a high-level formula for the interconnect power consumption. The remainder of this chapter is structured as follows. First, the formulas to estimate the interconnect power consumption and performance are systematically derived in Sects. 3.1 and 3.2, respectively. Afterward, matrix formulations for both formulas are provided in Sect. 3.3, which enable the systematic derivation and evaluation of our proposed optimization techniques in Part IV of this book. In the
1 All run times reported in this book are for the execution of a single thread, a utilization below 5% for all cores, and more than 80% available main-memory space.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 L. Bamberg et al., 3D Interconnect Architectures for Heterogeneous Technologies, https://doi.org/10.1007/978-3-030-98229-4_3
following Sect. 3.4, the derived formulas are validated through circuit simulations for a modern global TSV array. Finally, the chapter is concluded.
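The quoted speedup can be reproduced with simple arithmetic. The Spectre numbers below are from the text above; the data-set size and the absolute high-level run time are assumptions for illustration:

```python
# circuit-level cost measured for the 3 x 3 TSV-array example above
patterns_simulated = 3500
spectre_runtime_s = 1140.0
per_pattern_s = spectre_runtime_s / patterns_simulated   # roughly 0.33 s per pattern

# extrapolate to a data set with millions of samples (assumed size)
samples = 10_000_000
circuit_total_s = per_pattern_s * samples                # roughly 3.3e6 s of simulation

highlevel_total_s = 3.0                                  # "a few seconds" (assumed)
speedup = circuit_total_s / highlevel_total_s            # exceeds 1e6, as stated
```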
3.1 High-Level Formula for the Power Consumption

Formulas to estimate the dynamic power consumption of 2D interconnects are well known (e.g., [68, 89, 215]).2 However, previous formulas consider only two adjacent neighbors for each interconnect. Therefore, they are not applicable for 3D interconnects, as a TSV in an array arrangement is surrounded by up to eight adjacent TSVs. Here, a formula that overcomes these limitations is derived. Figure 3.1 is considered the starting point to derive the formula. The energy extracted from the driver of the ith interconnect in the kth cycle is expressed as

$$E_{e,i}[k] = \int_{(k-1)T_{\mathrm{clk}}}^{kT_{\mathrm{clk}}} V_{dd}\, i_i(t)\, dt, \qquad (3.1)$$
Fig. 3.1 Equivalent circuit of an exemplary interconnect structure used for the derivation of the formulas for the interconnect power consumption
2 As shown in Ref. [68], leakage effects are negligible for traditional very-large-scale integration (VLSI) interconnects. This still holds for TSV-based interconnects, as shown in the evaluation section of this chapter.
where Vdd is the power-supply voltage, ii(t) is the current that flows through the driver at time t, and Tclk is the clock period. If the binary input bi is zero, the current ii, and therefore also the energy extracted from the supply, is zero for neglected leakage effects. Since $i_i = \sum_l i_{i,l}$ (Kirchhoff’s law), the contribution of each capacitance, Ci,j, can be analyzed individually in the remainder. First, the scenario of no temporal misalignment between the signal edges at the input ports of the drivers of the ith and the jth interconnect is considered. In other words, the inputs bi and bj switch exactly at the beginning of the cycle (in case of an event on both signals). By considering the current-voltage relation of capacitances, the energy extracted from the ith driver due to the coupling capacitance between the interconnects i and j, Ci,j, is calculated as follows:

$$\begin{aligned} E_{e,i,j}[k] &= b_i[k] \int_{(k-1)T_{\mathrm{clk}}}^{kT_{\mathrm{clk}}} V_{dd} C_{i,j}\, \frac{d\left(v_i(t)-v_j(t)\right)}{dt}\, dt \\ &= V_{dd} C_{i,j}\, b_i[k]\left(V_{dd} b_i[k] - V_{dd} b_j[k] - V_{dd} b_i[k-1] + V_{dd} b_j[k-1]\right) \\ &= V_{dd}^2 C_{i,j}\, b_i[k]\left(\Delta b_i[k] - \Delta b_j[k]\right), \qquad (3.2) \end{aligned}$$
where vi(t) is the voltage between the ith interconnect and the ground at time t, which is equal to the binary value bi times Vdd after the switching on the interconnect is completed (i.e., at the end of a cycle). bi[k] and bi[k−1] are the binary values transmitted over the ith interconnect in the current, kth, and the previous, (k−1)th, clock cycle, respectively. Δbi[k] is equal to bi[k] − bi[k−1] and thus indicates the switching of the logical/binary value on the interconnect. It is equal to 1 for a logical 0 to logical 1 transition, 0 for no transition in the logical value, and −1 for a logical 1 to 0 transition. Analogously, one can calculate the energy extracted over the driver of the jth interconnect due to the same capacitance Ci,j:

$$E_{e,j,i}[k] = V_{dd}^2 C_{i,j}\, b_j[k]\left(\Delta b_j[k] - \Delta b_i[k]\right). \qquad (3.3)$$
Consequently, the total extracted energy because of Ci,j is expressed as

$$E_{e,\{i,j\}}[k] = E_{e,i,j}[k] + E_{e,j,i}[k] = V_{dd}^2 C_{i,j}\left(b_i[k] - b_j[k]\right)\left(\Delta b_i[k] - \Delta b_j[k]\right). \qquad (3.4)$$
Equation (3.4) does not hold anymore if a temporal misalignment between the edges of the two input signals exists. Temporal misalignment implies here that one of the two inputs, bi or bj, switches later than the other in case of an event/edge on both signals in the current cycle. Exemplarily, the scenario in which the signal edges on the ith input bi are delayed by Tedge,i against the edges on bj is considered in the following. If Tedge,i is greater than the rise/fall time of the input drivers, plus the delay over the interconnects, Eq. (3.2) has to be modified to

$$\begin{aligned} E_{e,i,j}[k] &= b_i[k-1] \int_{(k-1)T_{\mathrm{clk}}}^{(k-1)T_{\mathrm{clk}}+T_{\mathrm{edge},i}} V_{dd} C_{i,j}\, \frac{-dv_j(t)}{dt}\, dt + b_i[k] \int_{(k-1)T_{\mathrm{clk}}+T_{\mathrm{edge},i}}^{kT_{\mathrm{clk}}} V_{dd} C_{i,j}\, \frac{dv_i(t)}{dt}\, dt \\ &= V_{dd}^2 C_{i,j}\left(b_i[k]\,\Delta b_i[k] - b_i[k-1]\,\Delta b_j[k]\right). \qquad (3.5) \end{aligned}$$
Ee,j,i can still be calculated with Eq. (3.3) if the edges on bi are delayed. Thus, the total extracted energy because of Ci,j is in this case

$$E_{e,\{i,j\}}[k] = V_{dd}^2 C_{i,j}\left(\Delta b_i[k]\left(b_i[k] - b_j[k]\right) + \Delta b_j[k]\left(b_j[k] - b_i[k-1]\right)\right). \qquad (3.6)$$
The formulas simplify if the dissipated energy is considered instead of the extracted energy. On average, the two energy quantities are equal. Thus, it is irrelevant which quantity is used in the following to derive a formula for the mean power consumption. The dissipated energy is the difference between the extracted energy and the differential stored energy:

$$E_{i,j}[k] = E_{e,\{i,j\}}[k] - \Delta E_{\mathrm{stored},i,j}[k], \qquad (3.7)$$
where the difference between the energies stored in Ci,j after and before the transitions can be expressed as

$$\begin{aligned} \Delta E_{\mathrm{stored},i,j} &= \frac{1}{2} C_{i,j}\left(V_{i,j}^2[k] - V_{i,j}^2[k-1]\right) \qquad &(3.8\mathrm{a}) \\ &= \frac{V_{dd}^2 C_{i,j}}{2}\left(\left(b_i[k]-b_j[k]\right)^2 - \left(b_i[k-1]-b_j[k-1]\right)^2\right) \qquad &(3.8\mathrm{b}) \\ &= \frac{V_{dd}^2 C_{i,j}}{2}\left(b_i[k]-b_j[k]+b_i[k-1]-b_j[k-1]\right)\left(\Delta b_i[k]-\Delta b_j[k]\right). \qquad &(3.8\mathrm{c}) \end{aligned}$$
Here, Vi,j[k−1] and Vi,j[k] are the voltage drops over Ci,j before and after the transitions, respectively. Equation (3.8c) can be derived from Eq. (3.8b) by applying the third binomial formula. By substituting Eqs. (3.8c) and (3.4) in Eq. (3.7), a formula to calculate the energy dissipation due to Ci,j in case of temporally aligned signal edges is obtained:

$$E_{i,j}[k] = \frac{V_{dd}^2 C_{i,j}}{2}\left(\Delta b_i[k] - \Delta b_j[k]\right)^2. \qquad (3.9)$$
For misaligned/skewed input signals, Eq. (3.6) instead of Eq. (3.4) has to be substituted, resulting in

$$E_{i,j}[k] = \frac{V_{dd}^2 C_{i,j}}{2}\left(\Delta b_i^2[k] + \Delta b_j^2[k]\right). \qquad (3.10)$$
Analyzing Eq. (3.10) reveals that it is irrelevant which interconnect changes its level first for temporally misaligned edges if the dissipated energy is considered instead of the extracted energy (i.e., the formula is distributive for i and j, as Ei,j is always equal to Ej,i). The only important question is whether the edges show sufficient temporal misalignment or not. Another advantage of using the dissipated energy is that we generally do not have to differentiate between a charging and a discharging capacitance. Consequently, the formulas for the dissipated energy exhibit a significantly lower complexity than the ones for the extracted energy. By substituting a 0 for Δbj[k] in Eq. (3.9) or (3.10), the well-known formula for the dissipated energy due to the self/ground capacitance between node i and the (logically stable) ground, Ci,i, is obtained:

$$E_{i,i}[k] = \frac{V_{dd}^2 C_{i,i}}{2}\, \Delta b_i^2[k]. \qquad (3.11)$$
To summarize, the dissipated energy in clock cycle k due to an interconnect structure containing n lines, for perfectly temporally aligned signal edges, is equal to

$$\begin{aligned} E[k] &= \frac{V_{dd}^2}{2} \sum_{i=1}^{n}\left(C_{i,i}\,\Delta b_i^2[k] + \sum_{j=i+1}^{n} C_{i,j}\left(\Delta b_i[k] - \Delta b_j[k]\right)^2\right) \qquad &(3.12\mathrm{a}) \\ &= \frac{V_{dd}^2}{2} \sum_{i=1}^{n}\left(C_{i,i}\,\Delta b_i^2[k] + \sum_{j=i+1}^{n} C_{i,j}\left(\Delta b_i^2[k] + \Delta b_j^2[k] - 2\Delta b_i[k]\Delta b_j[k]\right)\right) \qquad &(3.12\mathrm{b}) \\ &= \frac{V_{dd}^2}{2} \sum_{i=1}^{n}\left(C_{i,i}\,\Delta b_i^2[k] + \sum_{\substack{j=1 \\ j\neq i}}^{n} C_{i,j}\left(\Delta b_i^2[k] - \Delta b_i[k]\Delta b_j[k]\right)\right). \qquad &(3.12\mathrm{c}) \end{aligned}$$

From Eq. (3.12c), a formula to quantify the energy consumption associated with an individual interconnect, symbolized as Ei, is derived. This formula is for example required to identify interconnects that contribute the most to the power consumption and thus strongly demand optimization.
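Equation (3.12a) translates directly into code. The following sketch (our own helper, with abstract capacitance units) evaluates the per-cycle dissipated energy for temporally aligned edges:

```python
def energy_aligned(C, db, vdd=1.0):
    """C: symmetric n x n capacitance matrix (C[i][i] self, C[i][j] coupling);
    db: switching values Delta b_i in {-1, 0, 1} for the considered cycle."""
    n = len(db)
    acc = 0.0
    for i in range(n):
        acc += C[i][i] * db[i] ** 2                  # self-capacitance term
        for j in range(i + 1, n):
            acc += C[i][j] * (db[i] - db[j]) ** 2    # coupling (Miller) term
    return 0.5 * vdd ** 2 * acc
```

For two coupled lines with C = [[1, 0.5], [0.5, 1]], opposite switching ([1, −1]) dissipates twice the energy of common-mode switching ([1, 1]), reproducing the crosstalk effect.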
Defining $E \stackrel{\mathrm{def}}{=} \sum_i E_i$, the following formula is obtained for perfectly temporally aligned edges:

$$E_i[k] = \frac{V_{dd}^2}{2}\left(C_{i,i}\,\Delta b_i^2[k] + \sum_{\substack{j=1 \\ j\neq i}}^{n} C_{i,j}\left(\Delta b_i^2[k] - \Delta b_i[k]\Delta b_j[k]\right)\right). \qquad (3.13)$$
While the energy consumption due to the self-capacitance of an interconnect, Ci,i, is determined only by the self-switching of the associated input, Δbi, the power consumption due to a coupling capacitance of the interconnect, Ci,j, additionally depends on the switching of the logical value on the jth interconnect, Δbj, if the lines switch temporally aligned. Compared to the scenario where only the ith interconnect toggles (i.e., Δbj[k] = 0), the contribution of capacitance Ci,j to energy consumption Ei is doubled when bj toggles in the opposite direction (i.e., Δbi[k]Δbj[k] = −1), and vanishes if it toggles in the same direction (i.e., Δbi[k]Δbj[k] = 1). This effect is commonly referred to as crosstalk or coupling switching of the interconnects, while sometimes also the term “Miller effect” is used. The energy dissipated in clock cycle k due to an n-bit interconnect in case of significant temporal misalignment between the switching times of all input signal pairs is expressed through Eqs. (3.10) and (3.11) as

$$\begin{aligned} E[k] &= \frac{V_{dd}^2}{2} \sum_{i=1}^{n}\left(C_{i,i}\,\Delta b_i^2[k] + \sum_{j=i+1}^{n} C_{i,j}\left(\Delta b_i^2[k] + \Delta b_j^2[k]\right)\right) \qquad &(3.14\mathrm{a}) \\ &= \frac{V_{dd}^2}{2} \sum_{i=1}^{n}\left(C_{i,i} + \sum_{\substack{j=1 \\ j\neq i}}^{n} C_{i,j}\right)\Delta b_i^2[k] \qquad &(3.14\mathrm{b}) \\ &= \frac{V_{dd}^2}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} C_{i,j}\,\Delta b_i^2[k]. \qquad &(3.14\mathrm{c}) \end{aligned}$$
Hence, for misaligned edges, the energy dissipation associated with the ith interconnect is independent of the switching on the other lines (i.e., no crosstalk effects):

$$E_i[k] = \frac{V_{dd}^2}{2} \sum_{j=1}^{n} C_{i,j}\,\Delta b_i^2[k]. \qquad (3.15)$$
3.1.1 Effective Capacitance

Introducing the concept of an effective lumped capacitance for each interconnect has greatly simplified the derivation and evaluation of high-level optimization techniques for 2D interconnects in Ref. [68]. Thus, the concept of the effective capacitance is extended in this paragraph so that it can be applied also for 3D interconnects. The effective capacitance, Ceff,i, is defined as a single ground capacitance connected to the ith line that, if recharged, results in the same energy consumption as the real circuit considering complex coupling effects. The energy dissipated if a ground capacitance Cg is recharged is equal to $(V_{dd}^2 C_g)/2$. Thus, the effective capacitance is formally defined as follows.

Definition 3 (Effective Capacitance)

$$E_i \stackrel{\mathrm{def}}{=} \frac{V_{dd}^2}{2}\, C_{\mathrm{eff},i}, \qquad (3.16)$$

which implies that

$$E \stackrel{\mathrm{def}}{=} \frac{V_{dd}^2}{2} \sum_{i=1}^{n} C_{\mathrm{eff},i}. \qquad (3.17)$$
Hence, the effective capacitance of the ith interconnect for temporally aligned edges, derived from Eq. (3.13), is expressed as

$$C_{\mathrm{eff},i}[k] = C_{i,i}\,\Delta b_i^2[k] + \sum_{\substack{j=1 \\ j\neq i}}^{n} C_{i,j}\left(\Delta b_i^2[k] - \Delta b_i[k]\Delta b_j[k]\right). \qquad (3.18)$$
For completely misaligned edges, the formula for the effective capacitance (derived from Eq. (3.15)) is

$$C_{\mathrm{eff},i}[k] = \sum_{j=1}^{n} C_{i,j}\,\Delta b_i^2[k]. \qquad (3.19)$$
In the same way, one can derive a formula for the effective capacitance if some signal pairs switch temporally aligned, while others switch temporally misaligned:

$$C_{\mathrm{eff},i}[k] = \sum_{j=1}^{n} C_{i,j} \cdot \begin{cases} \Delta b_i^2[k] - \Delta b_i[k]\Delta b_j[k] & \text{for aligned } \Delta b_i, \Delta b_j \text{ and } i\neq j, \\ \Delta b_i^2[k] & \text{else.} \end{cases} \qquad (3.20)$$
This more complex formula is, for example, required to estimate the energy dissipation in scenarios where one-half of the signals switch temporally aligned with the rising clock edge while the other half switches with the falling clock edges, resulting in 50% temporally misaligned signal pairs.
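As a small numerical check of the aligned-edge case, Eq. (3.18) can be sketched as follows (our own helper, abstract capacitance units):

```python
def c_eff_aligned(C, db, i):
    # effective capacitance of line i for temporally aligned edges, Eq. (3.18)
    c = C[i][i] * db[i] ** 2
    for j in range(len(db)):
        if j != i:
            c += C[i][j] * (db[i] ** 2 - db[i] * db[j])
    return c
```

With C = [[1, 0.5], [0.5, 1]], line 0 sees an effective capacitance of 2.0 when the neighbor toggles oppositely, 1.5 when it is quiet, and 1.0 when it toggles in the same direction, i.e., the doubling/vanishing of the coupling contribution described above.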
3.1.2 Power Consumption

In general, the critical concern is not the energy dissipation in a specific clock cycle, but the mean power consumption over many/all clock cycles, represented by P. This quantity can be derived through the expected value of the dissipated energy.

Definition 4 (Expected Value) For a stochastic variable X, representing continuous values, with a cumulative density function described by fcd(x), the expected value is defined as:3

$$\mathrm{E}\{X\} = \int_{-\infty}^{\infty} x \cdot f_{cd}(x)\, dx. \qquad (3.21)$$
The expected value of a stochastic variable X with a finite number of finite outcomes is defined as

$$\mathrm{E}\{X\} = \sum_{i=1}^{m} x_i \cdot p_{x_i}, \qquad (3.22)$$
where x1, x2, ..., xm are the possible outcomes occurring with the respective probabilities px1, px2, ..., pxm. Since the power-supply voltage, as well as the capacitance values, are considered as constant, the mean energy dissipation for perfectly temporally aligned signal edges is derived from Eq. (3.12) as follows:

$$\begin{aligned} \bar{E} &= \mathrm{E}\left\{\frac{V_{dd}^2}{2} \sum_{i=1}^{n}\left(C_{i,i}\,\Delta b_i^2 + \sum_{\substack{j=1 \\ j\neq i}}^{n} C_{i,j}\left(\Delta b_i^2 - \Delta b_i \Delta b_j\right)\right)\right\} \qquad &(3.23\mathrm{a}) \\ &= \frac{V_{dd}^2}{2} \sum_{i=1}^{n}\left(C_{i,i}\cdot\mathrm{E}\{\Delta b_i^2\} + \sum_{\substack{j=1 \\ j\neq i}}^{n} C_{i,j}\left(\mathrm{E}\{\Delta b_i^2\} - \mathrm{E}\{\Delta b_i \Delta b_j\}\right)\right). \qquad &(3.23\mathrm{b}) \end{aligned}$$
3 This integral does not necessarily exist.
In the same way, the formula for the mean energy dissipation in case of completely temporally misaligned signal edges is derived from Eq. (3.14):

$$\bar{E} = \frac{V_{dd}^2}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} C_{i,j} \cdot \mathrm{E}\{\Delta b_i^2\}. \qquad (3.24)$$
From these two formulas, a simple formula is derived which can be used for all cases in which binary input signal pairs switch either perfectly aligned or perfectly misaligned:

$$\bar{E} = \frac{V_{dd}^2}{2} \sum_{i=1}^{n}\left(C_{i,i}\,\alpha_i + \sum_{\substack{j=1 \\ j\neq i}}^{n} C_{i,j}\left(\alpha_i - \gamma_{i,j}\right)\right). \qquad (3.25)$$
(3.26)
For a sufficient misalignment between the signals bi and bj , γi,j = 0.
(3.27)
Coupling parameter γi,j can take any real value in the range of −1 to 1. It is equal to: −1 if the signals bi and bj switch simultaneously, but in the opposite direction, in every clock cycle; 1 if the signals switch simultaneously, and in the same direction, in every clock cycle; and 0 if the signals never switch simultaneously or are completely uncorrelated (in terms of the switching directions). Generally, γi,j is bigger than zero if the two signals switch with a higher probability in the same direction than in the opposite direction, and smaller than zero if they are more likely to switch in the opposite direction (for temporally aligned edges). Hence, γi,j is referred to as the switching correlation of the bits in the remainder of this book. However, note that a γi,j value does not represent a correlation in the traditional form defined by Pearson. Introducing the concept of a mean effective capacitance, C¯ eff,i , for each interconnect—used to estimate the power consumption—simplifies the derivation of the optimization techniques in Part IV of this book.
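For temporally aligned edges, both statistical parameters can be estimated directly from a sampled bit stream: α_i is the mean of Δb_i², and γ_{i,j} is the mean of Δb_i Δb_j. A minimal NumPy sketch (the function name and the random example data are illustrative, not from this book):

```python
import numpy as np

def switching_statistics(bits):
    """Estimate alpha_i = E{db_i^2} and gamma_ij = E{db_i db_j} from a
    (cycles x n) array of transmitted bits (0/1), assuming temporally
    aligned signal edges."""
    db = np.diff(bits.astype(int), axis=0)  # db_i[k] in {-1, 0, 1}
    alpha = np.mean(db ** 2, axis=0)        # toggle/switching activities
    gamma = db.T @ db / db.shape[0]         # switching correlations
    return alpha, gamma

rng = np.random.default_rng(0)
bits = rng.integers(0, 2, size=(10000, 4))  # 4 uncorrelated random lines
alpha, gamma = switching_statistics(bits)
# For i.i.d. random data, alpha_i is close to 0.5 and gamma_ij (i != j)
# is close to 0; the diagonal of gamma equals alpha by construction.
```

Note that the diagonal entries of the estimated γ matrix are exactly the toggle activities α_i, mirroring the relation α_i = E{Δb_i²}.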
Definition 5 (Mean Effective Capacitance)

\bar{E}_i \overset{\mathrm{def}}{=} \frac{V_{dd}^2}{2}\, \bar{C}_{\mathrm{eff},i} \quad \text{with} \quad \bar{E} \overset{\mathrm{def}}{=} \sum_i \bar{E}_i.   (3.28)
Comparing this definition with Eq. (3.25) shows that the mean effective capacitance of the ith interconnect can be expressed as

\bar{C}_{\mathrm{eff},i} = C_{i,i}\,\alpha_i + \sum_{\substack{j=1 \\ j \neq i}}^{n} C_{i,j} \left( \alpha_i - \gamma_{i,j} \right).   (3.29)
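Equation (3.29) translates directly into a few lines of NumPy. In the sketch below, the function name is illustrative; C denotes the capacitance matrix (ground capacitances on the diagonal, coupling capacitances off-diagonal), and alpha and gamma are the switching statistics defined above:

```python
import numpy as np

def mean_effective_capacitances(C, alpha, gamma):
    """Mean effective capacitance per line, Eq. (3.29):
    C_eff_i = C_ii*alpha_i + sum_{j != i} C_ij*(alpha_i - gamma_ij)."""
    n = C.shape[0]
    c_eff = np.empty(n)
    for i in range(n):
        coupling = sum(C[i, j] * (alpha[i] - gamma[i, j])
                       for j in range(n) if j != i)
        c_eff[i] = C[i, i] * alpha[i] + coupling
    return c_eff

# Small illustrative example (capacitances in arbitrary units):
C = np.array([[1.0, 0.5],
              [0.5, 2.0]])
alpha = np.array([0.5, 0.5])
gamma = np.zeros((2, 2))            # uncorrelated switching
c_eff = mean_effective_capacitances(C, alpha, gamma)  # -> [0.75, 1.25]
```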
Since energy is the integral of power over time, the mean power consumption is calculated by means of

P = \frac{\bar{E}}{T_{clk}} = f\,\bar{E} = \frac{V_{dd}^2\, f}{2} \sum_{i=1}^{n} \bar{C}_{\mathrm{eff},i}.   (3.30)

In this equation, f is the frequency used for the signal transmission, equal to the inverse of the clock period T_clk.
3.2 High-Level Formula for the Propagation Delay

The reciprocal of the maximum signal-propagation delay of any line is the common performance metric of an interconnect structure, as it quantifies the maximum rate at which bit patterns can be transmitted error-free (i.e., the bandwidth). Typically, the signal-propagation delay is defined as the time difference between the point at which the input of the interconnect driver changes its logical value (i.e., the node voltage crosses Vdd/2) and the time at which the node at the far end of the interconnect changes its logical value accordingly. In this section, a formula to estimate this propagation delay is derived from the simplified lumped-RC equivalent circuit shown in Fig. 3.2b. The actual non-linear driver is modeled as an ideal switch and a resistance, R_D, representing the on-resistance of the pull-up or pull-down channel in the conductive state. Furthermore, the binary input value, b_i[k], is used in the model to control the switch and thereby the set potential, V_set,i[k], for the interconnect in the current cycle, k. V_set,i[k] is equal to V_dd and 0 V (ground potential) for b_i[k] equal to 1 and 0, respectively. First, a formula for the output-voltage waveform, v_i(t′), of an interconnect as a function of the input switching, Δb[k] = [Δb_1, …, Δb_n]^T, is derived. Thereby, t′ refers to the time relative to the kth rising clock edge (i.e., t′ = t − k · T_clk). Kirchhoff's current law at the interconnect output node i results in the following
Fig. 3.2 Equivalent circuit of an interconnect structure with: (a) an actual driver; (b) a simplified linear driver model for the derivation of a formula for the signal propagation delay
equation:

\frac{V_{\mathrm{set},i}[k] - v_i(t')}{R} = C_{i,i}\, \frac{dv_i(t')}{dt'} + \sum_{\substack{j=1 \\ j \neq i}}^{n} C_{i,j} \left( \frac{dv_i(t')}{dt'} - \frac{dv_j(t')}{dt'} \right),   (3.31)

where R is the sum of the equivalent resistance of the driver, R_D, and the interconnect resistance, R_i. By definition, the line only has a propagation delay when the logical value transmitted through V_set,i is toggling compared to the previous clock cycle (i.e., Δb_i²[k] = 1). Thus, this particular scenario is considered in the following. In this case, the voltage change on any other line j, dv_j(t′)/dt′, can be approximated as η_{i,j} · dv_i(t′)/dt′, with η_{i,j} ∈ {−1, 0, 1} [68].⁴ For perfectly temporal-aligned edges on b_i and b_j, η_{i,j} is equal to: 1 if b_j switches in the same direction as b_i; 0 if b_j is stable; and −1 if b_j toggles in the opposite direction of b_i. Hence,

\frac{dv_i(t')}{dt'} - \frac{dv_j(t')}{dt'} = \left( 1 - \eta_{i,j} \right) \frac{dv_i(t')}{dt'} = \left( 1 - \Delta b_i[k]\,\Delta b_j[k] \right) \frac{dv_i(t')}{dt'}.   (3.32)
⁴ This assumption holds exactly for drivers with perfectly matched rising and falling transitions for the individual interconnects.
By substituting Eq. (3.32) into Eq. (3.31), the differential equation for solely temporal-aligned edges is rewritten as

\frac{V_{\mathrm{set},i}[k] - v_i(t')}{R} = \left( C_{i,i} + \sum_{\substack{j=1 \\ j \neq i}}^{n} C_{i,j} \left( 1 - \Delta b_i[k]\,\Delta b_j[k] \right) \right) \frac{dv_i(t')}{dt'}.   (3.33)
For completely temporal-misaligned edges on b_i and b_j, v_j is stable while v_i transitions from logical 0 to logical 1, implying dv_j(t′)/dt′ = 0 in the delay analysis of the ith line. Hence, if all edges occur temporally misaligned, Eq. (3.31) simplifies to

\frac{V_{\mathrm{set},i}[k] - v_i(t')}{R} = \left( C_{i,i} + \sum_{\substack{j=1 \\ j \neq i}}^{n} C_{i,j} \right) \frac{dv_i(t')}{dt'} = \sum_{j=1}^{n} C_{i,j}\, \frac{dv_i(t')}{dt'}.   (3.34)
Since the formulas were derived under the assumption that Δb_i²[k] is equal to 1, Eq. (3.33) can be rewritten as follows:

\frac{V_{\mathrm{set},i}[k] - v_i(t')}{R} = \left( C_{i,i}\,\Delta b_i^2[k] + \sum_{\substack{j=1 \\ j \neq i}}^{n} C_{i,j} \left( \Delta b_i^2[k] - \Delta b_i[k]\,\Delta b_j[k] \right) \right) \frac{dv_i(t')}{dt'}.   (3.35)

In the same way, Eq. (3.34) is rewritten to

\frac{V_{\mathrm{set},i}[k] - v_i(t')}{R} = \sum_{j=1}^{n} C_{i,j}\,\Delta b_i^2[k]\, \frac{dv_i(t')}{dt'}.   (3.36)
Comparing these two formulas with Eqs. (3.18) and (3.19) shows that, for both cases, aligned and misaligned signal edges, the differential equation can be simplified by using the concept of the effective capacitance:⁵

\frac{V_{\mathrm{set},i}[k] - v_i(t')}{R} = C_{\mathrm{eff},i}[k]\, \frac{dv_i(t')}{dt'}.   (3.37)
⁵ The same formula can be derived for the case in which temporally aligned and misaligned signal edges occur simultaneously.
In the following, the Laplace transform, represented by L{}, is used to solve this differential equation. Multiplying both sides with R and then transforming the equation into the frequency domain results in

\frac{V_{\mathrm{set},i}[k]}{s} - \mathcal{L}\{v_i\}(s) = R\,C_{\mathrm{eff},i}[k] \left( s\,\mathcal{L}\{v_i\}(s) - v_i(0) \right),   (3.38)
where v_i(0) is the potential of the output node at the beginning of the clock cycle (t′ = 0), equal to V_dd b_i[k − 1]. Furthermore, the set voltage can be expressed as V_dd b_i[k]. Thus, the equation can be rewritten as follows:

\frac{V_{dd}\, b_i[k]}{s} - \mathcal{L}\{v_i\}(s) = R\,C_{\mathrm{eff},i}[k] \left( s\,\mathcal{L}\{v_i\}(s) - V_{dd}\, b_i[k-1] \right).   (3.39)
Solving this equation for L{v_i}(s) results in

\mathcal{L}\{v_i\}(s) = V_{dd} \left( b_i[k]\, \frac{1}{R\,C_{\mathrm{eff},i}[k]\, s^2 + s} + b_i[k-1]\, \frac{1}{s + \frac{1}{R\,C_{\mathrm{eff},i}[k]}} \right).   (3.40)
Finally, the inverse Laplace transform is applied to obtain v_i(t′). The inverse Laplace transforms of the terms 1/(s + a) and 1/(a s² + s) can be looked up as e^{−at} and 1 − e^{−t/a}, respectively. Thus, the inverse Laplace transform results in:
v_i(t') = V_{dd} \left( b_i[k] \left( 1 - e^{-\frac{t'}{R C_{\mathrm{eff},i}[k]}} \right) + b_i[k-1]\, e^{-\frac{t'}{R C_{\mathrm{eff},i}[k]}} \right)   (3.41a)

= V_{dd} \left( b_i[k] - \left( b_i[k] - b_i[k-1] \right) e^{-\frac{t'}{R C_{\mathrm{eff},i}[k]}} \right)   (3.41b)

= V_{dd} \left( b_i[k] - \Delta b_i[k]\, e^{-\frac{t'}{R C_{\mathrm{eff},i}[k]}} \right).   (3.41c)
For a logical 0 to logical 1 transition on the interconnect (i.e., Δb_i[k] = 1) as well as for a logical 1 to logical 0 transition (i.e., Δb_i[k] = −1), the rise and fall time for the node voltage v_i(t′) to reach χV_dd and (1 − χ)V_dd, respectively, can be derived from Eq. (3.41) as

T_{\mathrm{delay},i}(\chi)[k] = -\ln(1 - \chi)\, R\, C_{\mathrm{eff},i}[k].   (3.42)
The time to propagate the 50% switching point is T_delay,i(0.5)[k], equal to 0.69 R C_eff,i[k]. For the derivation of this formula, an ideal/simplified driver model was considered. To partially account for non-ideal driver effects, a driver-specific constant T_D,0 has to be added to the delay. Hence, the signal propagation delay to reach the 50% switching point is

T_{\mathrm{pd},i}[k] = 0.69\, R\, C_{\mathrm{eff},i}[k] + T_{D,0}\, \Delta b_i^2[k].   (3.43)
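As a plausibility check, the 0.69 R C_eff factor in the delay formula can be reproduced by numerically integrating the first-order RC model of Eq. (3.37) for a rising edge. The parameter values below are arbitrary examples, not taken from the evaluation in this book:

```python
import math

def crossing_time(R, C_eff, chi, dt=1e-15):
    """Forward-Euler integration of R*C_eff*dv/dt' = Vdd - v (rising edge,
    v(0) = 0) until v crosses chi*Vdd; returns the crossing time."""
    Vdd, v, t = 1.0, 0.0, 0.0
    while v < chi * Vdd:
        v += dt * (Vdd - v) / (R * C_eff)
        t += dt
    return t

R, C_eff = 1e3, 50e-15                 # 1 kOhm, 50 fF (example values)
t_sim = crossing_time(R, C_eff, 0.5)   # simulated 50% crossing time
t_model = -math.log(0.5) * R * C_eff   # closed form: 0.693 * R * C_eff
# t_sim and t_model agree to within the integration-step resolution.
```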
Although this formula is derived under the assumption that b_i toggles, it is also valid for the scenario in which b_i is stable. Clearly, in this case, the propagation delay is zero, which is equal to the output of the formula for Δb_i[k] equal to 0 (resulting in C_eff,i[k] equal to 0). The performance is quantified by the reciprocal of the maximum propagation delay over all cycles and interconnects, as it equals the maximum rate at which bit patterns can be transmitted over the interconnect structure error-free. This delay metric, represented by T̂_pd, can be estimated through the following equation:

\hat{T}_{\mathrm{pd}} = \max_{k,i} T_{\mathrm{pd},i}[k] = R\, \hat{C}_{\mathrm{eff}} + T_{D,0},   (3.44)

assuming that one line eventually toggles, resulting in a non-zero entropy for the transmitted data (i.e., max_{k,i}(Δb_i²[k]) = 1). In this equation, Ĉ_eff is the maximum effective-capacitance value of all interconnects for all possible bit-pattern transitions. In conclusion, the energy consumption of an interconnect and the signal-propagation delay are proportional to the pattern-dependent effective capacitance. Thus, decreasing the effective-capacitance values can drastically improve the performance as well as the power consumption of VLSI interconnects. This correlation between the metrics enables us to design high-level techniques that effectively improve the 3D-interconnect power consumption and performance by optimizing the transmitted bit patterns.
3.3 Matrix Formulations

Matrix formulations for the derived formulas to estimate the power consumption and propagation delay are presented in this section. These formulations simplify the notation and allow for a more straightforward model implementation with Matlab or the Python package NumPy, but they are also the key enabler for the optimization techniques presented and evaluated in Part IV of this book. Since the sum of the mean effective capacitances is directly proportional to the dynamic power consumption (Eq. (3.30)), the interconnects can be optimized toward a low power consumption due to the pattern-dependent nature of this metric. The sum of the mean effective capacitances can be expressed through the Frobenius inner product of two matrices:⁶
⁶ The Frobenius inner product, represented by ⟨·, ·⟩, of two real-valued matrices A and B is equal to the sum of the element-wise multiplications (Σ_{i,j} A_{i,j} B_{i,j}).
\sum_{i=1}^{n} \bar{C}_{\mathrm{eff},i} = \sum_{i=1}^{n} \left( C_{i,i}\,\alpha_i + \sum_{\substack{j=1 \\ j \neq i}}^{n} C_{i,j} \left( \alpha_i - \gamma_{i,j} \right) \right) = \langle S_E, C \rangle.   (3.45)
Here, C is the capacitance matrix with capacitance C_{i,j} on entry (i, j). Matrix S_E contains the switching statistics. In detail, the entries of this switching matrix depend on the switching activities and correlations of the binary values transmitted over the interconnects:

S_{E,i,j} = \begin{cases} \alpha_i & \text{for } i = j \\ \alpha_i - \gamma_{i,j} & \text{for } i \neq j. \end{cases}   (3.46)

Hence, the mean power consumption of an interconnect structure can also be formulated using a matrix instead of a sum notation:

P = \frac{V_{dd}^2\, f}{2} \langle S_E, C \rangle.   (3.47)
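With Eqs. (3.45)-(3.47), the power estimation reduces to a single Frobenius inner product. A NumPy sketch with illustrative names, where alpha and gamma are the switching statistics of Sect. 3.1.2:

```python
import numpy as np

def mean_power(C, alpha, gamma, Vdd, f):
    """Mean power consumption, Eq. (3.47): P = (Vdd^2*f/2) * <S_E, C>.
    S_E is built per Eq. (3.46): alpha_i - gamma_ij off-diagonal,
    alpha_i on the diagonal."""
    S_E = alpha[:, None] - gamma          # row i holds alpha_i - gamma_ij
    np.fill_diagonal(S_E, alpha)          # diagonal entries are alpha_i
    return Vdd ** 2 * f / 2 * np.sum(S_E * C)  # Frobenius inner product

# Illustrative 2-line example: uncorrelated random data (gamma = 0),
# capacitances in farads, Vdd = 1 V, f = 1 GHz.
C = np.array([[1e-15, 0.5e-15],
              [0.5e-15, 1e-15]])
alpha = np.array([0.5, 0.5])
gamma = np.zeros((2, 2))
P = mean_power(C, alpha, gamma, Vdd=1.0, f=1e9)   # watts
```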
In the formula for the cycle-based interconnect propagation delay (i.e., Eq. (3.43)), the effective capacitance is the pattern-dependent metric that allows optimization at higher abstraction levels. Instead of using Eq. (3.20) n times to calculate the effective capacitances of all interconnects individually, the vector containing all effective capacitances can be expressed in a single matrix notation:

\boldsymbol{C}_{\mathrm{eff}}[k] = \mathrm{diag}(S[k] \cdot C),   (3.48)
where the ith vector entry is equal to the effective capacitance of the ith line, C_eff,i[k]. The function "diag()" returns the diagonal of a matrix as a vector. The switching matrix S[k] is defined as follows:

S_{i,j}[k] = \begin{cases} \Delta b_i^2[k] - \Delta b_i[k]\,\Delta b_j[k] & \text{for aligned } \Delta b_i, \Delta b_j \text{ and } i \neq j \\ \Delta b_i^2[k] & \text{else.} \end{cases}   (3.49)
Consequently, the maximum propagation delay for all interconnects, and thus the performance of the whole interconnect structure, can be estimated by the following matrix equation:

\hat{T}_{\mathrm{pd}} = R\, \hat{C}_{\mathrm{eff}} + T_{D,0} = R \max_k \left( \mathrm{diag}(S[k] \cdot C) \right) + T_{D,0}.   (3.50)
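The cycle-based formulation of Eqs. (3.48)-(3.50) maps directly to NumPy. The sketch below assumes temporally aligned edges; all names are illustrative:

```python
import numpy as np

def switching_matrix(db):
    """S[k] for one cycle, Eq. (3.49), assuming aligned edges:
    S_ij = db_i^2 - db_i*db_j for i != j, and db_i^2 on the diagonal."""
    S = db[:, None] ** 2 - np.outer(db, db)
    np.fill_diagonal(S, db ** 2)
    return S

def max_delay(C, transitions, R, T_D0):
    """Performance metric, Eq. (3.50): R * max_k C_eff_hat + T_D0,
    where C_eff[k] = diag(S[k] @ C) per Eq. (3.48)."""
    c_hat = max(np.diag(switching_matrix(db) @ C).max()
                for db in transitions)
    return R * c_hat + T_D0

# Illustrative 2-line example: one opposite-direction transition and
# one transition with a stable neighbor (capacitances in arbitrary units).
C = np.array([[1.0, 0.5],
              [0.5, 1.0]])
transitions = [np.array([1, -1]), np.array([1, 0])]
t_hat = max_delay(C, transitions, R=1.0, T_D0=0.1)
```

For the opposite-direction transition, the coupling term is counted twice (Miller effect), so the maximum effective capacitance becomes C_{i,i} + 2C_{i,j}.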
The two derived matrix formulations for the mean-sum and the cycle-based effective capacitance values (i.e., Eqs. (3.45) and (3.48)) are of particular importance in the context of this book, as they are extensively used for most optimization techniques presented later on.
3.4 Evaluation

The previously derived formulas for the interconnect power consumption and performance are validated in this section by comparing their results with those obtained from precise but computationally complex circuit simulations. For this purpose, first, a setup is constructed which allows determining the pattern-dependent power consumption and propagation delay of interconnects by circuit simulations with Cadence Spectre. The setup for each individual line of the interconnect structure under analysis is illustrated in Fig. 3.3. Each interconnect is driven by an inverter made up of one n-channel and one p-channel MOS field-effect transistor (MOSFET) stemming from the 22-nm Predictive Technology Model (PTM). The W/L_min ratios of the two transistor channels define the strength of the interconnect drivers. For the analysis in this chapter, the W/L_min ratios of the n-channel and the p-channel MOSFETs are 12 and 20, respectively. At the far end, each TSV terminates in a smaller 22-nm PTM inverter acting as a driver for a capacitive load of 1 fF. The transistor channels of this load driver have widths that are a factor of 3 smaller than the respective channel widths of the interconnect driver. Bit sources (V_dd = 1 V) with a transition time of 10 ps and a bit duration of 1 ns (i.e., f_clk = 1 GHz) are used to generate the stimuli for the circuit simulations. The transmitted bit sequence and the delay of the signal edges, relative to the rising clock edges (at k · T_clk, with k ∈ ℕ), can be arbitrarily defined for each line. To achieve a realistic shape of the input-voltage waveforms of the TSVs, v_in,i, additional inverters (signal shapers) are added, which are sized like the ones charging the capacitive load at the interconnect termination.
Fig. 3.3 Setup to measure the power consumption and propagation delay of interconnects through circuit simulations
The energy consumption of the ith interconnect for a specific time frame is measured by integrating the current that flows into the respective driver, i_i, over the considered time window. By dividing this energy consumption by the duration of the time frame, the mean power consumption is obtained. The difference between the simulation time at which the driver input voltage, v_in,i, crosses Vdd/2 and the simulation time at which the interconnect output voltage, v_i, crosses Vdd/2 accordingly is the measured signal-propagation delay of the ith interconnect in the respective cycle. In this section, the power consumption of a 3 × 3 TSV array with a TSV radius and minimum pitch of 1 and 4 μm, respectively, is analyzed. These geometrical TSV dimensions correspond to the minimum global TSV dimensions predicted by the International Technology Roadmap for Semiconductors (ITRS) [1]. A TSV length/depth of 50 μm is chosen to model a commonly thinned substrate. To extensively validate the derived formulas, all possible switching scenarios for the TSV located in the array middle (i.e., TSV5 in Fig. 1.9 on Page 20) are investigated. Furthermore, each scenario is analyzed twice: once for perfectly temporal-aligned signal edges, and once for completely temporal-misaligned edges on TSV5 compared to the other TSVs. To analyze temporally misaligned edges, the input of the driver of TSV5 switches 0.5 ns earlier/later than the inputs of the remaining drivers. By means of the Q3D Extractor and the 3D model presented in Sect. 1.3.1, the TSV parasitics are obtained. A mean TSV voltage of 0.5 V (Vdd/2) is considered for the parasitic extraction, since 0 and 1 bits are equally distributed in the analyzed pattern set. For the setup in this analysis, the significant frequency for the TSV parasitic extraction is 6 GHz. The resulting 3π-RLC equivalent Spectre circuit is integrated into the setup to model the TSVs in the circuit simulations. Furthermore, contact resistances of 100 Ω are added between the drivers and the TSVs in accordance with Ref. [132]. Besides the equivalent circuit, the capacitance matrix, required to perform the high-level estimation, is also generated from the Q3D-Extractor results. The effective TSV load capacitance due to the drivers (ca. 1 fF) is added to the self/ground capacitance of each TSV (i.e., the diagonal entries of the capacitance matrix). Afterward, by analyzing the abstract bit-level switching of the considered pattern set, the power consumption is predicted with the derived formula. To quantify the accuracy of the power formula for an extensive set of switching scenarios, not only the mean power consumption of the TSV over all cycles is analyzed, but also the individual mean power-consumption quantities for each possible effective-capacitance value in the respective cycles.
Furthermore, contact resistances of 100 are added between the drivers and the TSVs in accordance with Ref. [132]. Besides the equivalent circuit, the capacitance matrix—required to perform the high-level estimation—is also generated out of the Q3D-Extractor results. The effective TSV load capacitance due to the drivers (ca. 1 fF) is added to the self/ground capacitance of each TSV (i.e., the diagonal entries of the capacitance matrix). Afterward, by analyzing the abstract bit-level switching of the considered pattern set, the power consumption is predicted with the derived formula. To quantify the accuracy of the power formula for an extensive set of switching scenarios, not only the mean power consumption of the TSV over all cycles is analyzed, but also the individual mean power consumption quantities for each possible effective capacitance value in the respective cycles. The driver-dependent values R and TD,0 need to be determined once before the propagation delay can be estimated through Eq. (3.43). Here, a simple linear leastsquare fit for the propagation delay of the middle TSV5 as a function of the effective capacitance Ceff,5 is performed. Thereby, rising edges on the TSVs are considered, as they result in a higher propagation delay due to the lower channel conductivity of the pull-up path (i.e., p-channel MOSFET). The maximum propagation delay
[Figure 3.4 consists of two panels plotting the power consumption (0-30 μW) and the propagation delay (0-300 ps) over the effective capacitance (0-100 fF), each comparing the circuit simulations with the proposed high-level formula.]
Fig. 3.4 Mean power consumption and maximum propagation delay of a TSV located in the middle of a 3 × 3 array over the effective capacitance according to circuit simulations and the derived high-level formulas
for each possible effective-capacitance value of TSV5 is subsequently analyzed to quantify the accuracy of the derived formula for the performance estimation. In Fig. 3.4, the results of the analysis are plotted. As shown in the figure, the derived high-level formula enables a close-to-perfect estimation of the power consumption, despite the neglected inductance, internal-driver, and leakage effects. Thus, these effects are negligible for the TSV power consumption, as they are for the metal-wire power consumption [68]. The overall root-mean-square error (RMSE) for the power estimation, normalized by the mean power consumption, is as low as 0.4%. Only for larger effective capacitances does the model show a slightly noticeable error, due to the intrinsic misalignment effect for transitions that entail a large power consumption [176]. The main reason for the intrinsic misalignment here is a mismatch in the equivalent on-resistances of the pull-up and pull-down paths of the TSV driver due to the moderate increase in the width of the p-channel compared to the n-channel, which does not fully compensate for the lower hole mobility. Nevertheless, the normalized maximum absolute error (MAE) is 2.5% for all estimated power quantities. Despite the high accuracy, the high-level estimation, performed with Matlab, requires a 115,000× lower execution time than the circuit simulations on the Linux Xeon E5-2630 machine. For the estimation of the propagation delay, the derived formula results in an overall normalized root-mean-square error (NRMSE) of 3.9%. Thus, the derived formula for the propagation delay is not as accurate as the one for the power consumption. Especially for small effective-capacitance (and thus delay) values,
the estimates are not reliable. However, the strong general correlation between effective capacitance, power consumption, and propagation delay is validated by the results. Furthermore, the performance is in any case determined only by the maximum occurring delay. Considering only the switching scenarios that result in a propagation delay greater than 50% of the maximum possible one, the formula estimates all delay values with a relative MAE in the 10% range. Thus, the accuracy of the derived delay formula is sufficient for the scope of this work, as none of the later proposed optimization techniques increases the interconnect performance by more than a factor of 2×. The lower accuracy of the delay estimation is mainly due to the simplified driver model. In reality, the driver's equivalent pull-up and pull-down resistance, R_D, as well as the offset factor T_D,0, depend on the size of the capacitive load (here, C_eff,i). A more accurate piece-wise-linear model for R_D and T_D,0 as a function of the capacitive load could be extracted from the non-linear delay model (NLDM) liberty file of the used standard-cell technology. Afterward, a precise delay estimation could be performed even for switching scenarios that result in a small propagation delay. Moreover, inductance effects, which have a minor impact on the propagation delay, could be considered to increase the accuracy of the delay formula further. However, complex driver or inductance effects are not desired for the objective of this book, as they would profoundly complicate the systematic derivation of optimization techniques without revealing a significantly higher optimization potential. In summary, the results validate the derived formulas for the interconnect metrics. The derived formula for the power consumption is close to perfect, while the one for the propagation delay should only be used for performance (i.e., maximum-delay) estimation, as long as a non-linear driver model does not extend it.
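The one-time extraction of R and T_D,0 described in this section can be sketched as a linear least-squares regression over (C_eff, delay) samples, per Eq. (3.43). The sample values below are invented for illustration and do not stem from the reported simulations:

```python
import numpy as np

# Hypothetical (C_eff, delay) samples for rising edges, as they would be
# obtained from circuit simulations of the middle TSV.
c_eff = np.array([20e-15, 40e-15, 60e-15, 80e-15, 100e-15])   # farads
delay = np.array([35e-12, 63e-12, 91e-12, 119e-12, 147e-12])  # seconds

# Fit T_pd = 0.69 * R * C_eff + T_D0 (Eq. (3.43)) with a degree-1 polyfit:
slope, T_D0 = np.polyfit(c_eff, delay, 1)
R = slope / 0.69   # equivalent driver + interconnect resistance in ohms
```

With perfectly linear samples such as these, the fit recovers the slope and offset exactly; for real simulation data, the residuals indicate how well the linear driver model holds.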
3.5 Conclusion In this chapter, abstract formulas for the power consumption and the signal propagation delay (required for a performance estimation) of 3D interconnects were systematically derived. Both formulas were not only derived for perfectly temporal-aligned signal edges on the interconnects, but also for a substantial temporal misalignment (skew) between the signal edges. An important finding is that the derived formulas reveal a strong correlation between the interconnect power consumption and the interconnect performance through the pattern-dependent effective capacitance. This fact theoretically allows for a simultaneous optimization of both metrics on higher abstraction levels. The derived formulas have been validated against precise circuit simulations, considering a modern 3 × 3 array of global TSVs and 22-nm drivers. For both cases, temporally aligned and misaligned edges, as well as all possible pattern transitions, the power consumption is estimated with the derived formula with a normalized RMSE and MAE as low as 0.4 and 2.5%, respectively. A propagation-delay
estimation with the proposed formula is not as accurate as the power estimation (NRMSE 3.9%). However, for the estimation of the pattern-dependent performance, the formula still provides more than sufficient accuracy. The presented formulas for the 3D-interconnect power consumption and performance are essential for the derivation and fast evaluation of the optimization techniques presented in Part IV of this book. However, knowledge about the interconnect capacitances is required for a power and performance estimation with the derived formulas. Moreover, the bit-level statistics of the transmitted data are required for precise power estimation. Hence, methods to precisely estimate the 3D-interconnect capacitances and the bit-level statistics will be presented in the following chapters.
Chapter 4
High-Level Estimation of the 3D-Interconnect Capacitances
In the previous Chap. 3, formulas were presented that allow for a fast estimation of the power consumption and performance of 3D interconnects on higher abstraction levels. However, these formulas require knowledge of the 3D-interconnect capacitances. The interconnect capacitances can be extracted with electromagnetic (EM) solvers on 3D models, as done in the previous Chap. 3. Nevertheless, an abstract and yet universally valid model for the capacitances is still required for mainly two reasons. First, EM solvers have significant run times, and their results are not scalable. If the line count of the interconnect structure changes, the capacitances must be extracted again. Thus, in large systems with hundreds of global interconnect structures, possibly all having a different shape, the capacitances would have to be extracted separately for each bundle. As an illustration, for a single 8 × 8 through-silicon via (TSV) array, already 2080 capacitance values must be extracted for a full parasitic extraction based on an EM solver. The second reason why an abstract capacitance model is required is that it is impossible to derive optimization techniques that are generally efficient for most 3D-interconnect structures without having an abstract knowledge about regularities in the capacitances. Moreover, the model should abstractly encapsulate complex physical phenomena affecting the capacitance quantities. Thereby, optimization techniques based on higher abstraction levels can effectively exploit these phenomena. Previous works have proposed a set of equivalent circuits for TSV and metal-wire arrangements, which partially overcome the need for parasitic extractions by means of an EM solver [196, 197, 226, 249, 253]. However, all existing equivalent circuits for TSVs are too complex to work with the formulas derived in the previous chapter, as the equivalent circuits do not allow obtaining lumped coupling-capacitance and ground-capacitance quantities for TSV arrays.
Hence, existing equivalent circuits for TSV arrays are only usable for power and performance analysis in combination with computationally complex circuit simulations.
In contrast, the commonly used metal-wire equivalent circuit exhibits such a low complexity that its parameters can be easily used to estimate the power consumption and performance of an arbitrary metal-wire arrangement through the derived formulas [68, 249]. Moreover, the capacitance model of the equivalent circuit has proven over many years to be very practical for the derivation of high-level techniques, which drastically optimize the power consumption and performance of metal wires [66, 68, 89, 116, 187]. Thus, a capacitance model for TSV arrays that is similar to the one for metal wires was used to systematically derive existing coding techniques that aim for an improvement in the TSV performance [54, 138, 265]. This capacitance model was derived by merely extending the model for metal wires into the third dimension (i.e., increasing the number of adjacent neighbors per line from two to eight). However, with this approach, TSV-specific physical phenomena that affect the capacitances, such as the TSV metal-oxide-semiconductor (MOS) effect or the electric-field-sharing effect [14, 197, 253], are not captured in the capacitance model at all. Capturing these effects in the capacitance model is of particular importance for two reasons. First, the degraded electric-field-sharing effect at the edges of a TSV array typically increases the maximum capacitance value by over 45%. Second, the MOS effect results in capacitance values that vary with the bit-level properties of the transmitted data by over 25%. Thus, there is a strong need to extend the previously used abstract TSV capacitance model by the newly arising physical phenomena. Such an extended model is presented in this chapter. The proposed capacitance model exhibits a low complexity, is scalable, and yet provides high accuracy.
An evaluation shows that using the proposed capacitance model for a high-level estimation of the power consumption of modern TSV arrangements for random transmitted bit patterns results in a precise estimation with an error as low as 2.6%. In contrast, the previously used model results in an underestimation of the power requirements by as much as 21.1% for the same pattern set. Furthermore, the previous model results in an underestimation of the propagation delay of edge TSVs by 25.1% due to the neglected electric-field sharing. In contrast, applying the proposed capacitance model instead results in a precise estimation of the edge-TSV delay, as the error is only 4.2%. Hence, the proposed capacitance model enables a precise TSV power-consumption and performance estimation without the need for a computationally expensive full-chip parasitic extraction. The rest of this chapter is structured as follows. First, the existing abstract capacitance models for metal wires and TSV arrays are reviewed in Sect. 4.1. In the following Sect. 4.2, TSV-specific physical phenomena and their impact on the capacitances are outlined. Afterward, the proposed capacitance model is presented in Sect. 4.3. The coefficients and accuracy of the proposed model are quantified for typical TSV structures in the first part of the evaluation section, Sect. 4.4.1. Furthermore, a comparison of the accuracy of the proposed and the previously used capacitance model is drawn. In Sect. 4.4.2, the accuracy of a high-level TSV power-consumption and performance estimation based on the proposed and the previous
TSV capacitance models is investigated through experimental results. Finally, the chapter is concluded.
4.1 Existing Capacitance Models

In this section, the previous capacitance models, used to estimate the interconnect power consumption and performance on higher abstraction levels, are reviewed. First, the model for metal wires is summarized. Subsequently, the model previously used for TSVs is discussed. The scalable capacitance model used for the metal-wire power-performance estimation and optimization is illustrated in Fig. 4.1. Algorithm 1 in Appendix A is a pseudocode to generate the corresponding capacitance matrix as it is required for the proposed high-level formulas. Between every adjacent metal-wire pair, there is a coupling capacitance of size C_mw,c, and every metal wire has a ground capacitance of size C_mw,g [249]. The overall ground capacitance of a metal wire is equal to the sum of C_mw,g and the input capacitance of the interconnect load. Capacitances between non-directly adjacent wires are negligible, as interjacent conductors almost entirely shield the electric field (Faraday-cage effect) [249]. Both capacitance values, C_mw,c and C_mw,g, increase linearly with the wire length. Thus, for a constant wire spacing and width, only two capacitance values, given per unit length, are needed to model all metal-wire buses of arbitrary length and line count. These capacitances can be obtained through EM solvers or analytical formulas such as the ones presented in [249]. With ongoing technology scaling (resulting in denser-spaced wires), the coupling capacitances in a metal-wire bus dominate more and more, so that the ground capacitances, C_mw,g, become negligible [68]. Despite its simplicity, the existing metal-wire capacitance model has proven over decades to be accurate for the estimation of the power consumption and performance of metal wires. Therefore, the accuracy of this model is not further investigated. Furthermore, the simplicity and scalability of the metal-wire capacitance model
Fig. 4.1 Capacitances of metal wires
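In the spirit of Algorithm 1 (given in Appendix A), the capacitance matrix of an n-line metal-wire bus can be generated as follows. The function below is an illustrative sketch, placing the ground capacitance C_mw,g on the diagonal and the coupling capacitance C_mw,c between adjacent wires:

```python
import numpy as np

def metal_wire_cap_matrix(n, c_c, c_g):
    """Capacitance matrix of an n-line metal-wire bus: ground capacitance
    c_g on the diagonal, coupling capacitance c_c on the entries (i, i+1)
    and (i+1, i) of adjacent wires; capacitances between non-adjacent
    wires are neglected (Faraday-cage effect)."""
    C = np.diag(np.full(n, float(c_g)))
    idx = np.arange(n - 1)
    C[idx, idx + 1] = c_c
    C[idx + 1, idx] = c_c
    return C
```

Since C_mw,c and C_mw,g scale linearly with the wire length, per-unit-length values multiplied by the length can be passed directly for c_c and c_g.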
Fig. 4.2 Capacitances of a TSV in an array arrangement according to the previously used capacitance model
enabled the derivation of a broad set of universally valid optimization techniques that effectively improve the power consumption and performance of metal-wire buses [68, 89]. This fact likely was the main driver for previous works to merely extend the existing capacitance model for metal wires into the third dimension for the derivation of high-level techniques aiming to optimize the TSV-array performance [54, 138, 265]. Furthermore, this enabled the reuse of an existing optimization approach for 2D integrated circuits (ICs), namely crosstalk-avoidance coding, for TSV-based 3D integration. The previously used TSV capacitance model is illustrated in Fig. 4.2. Algorithm 2 in Appendix A is a pseudo code for the generation of the according capacitance matrix. A TSV-based interconnect is surrounded by up to eight directly (i.e., horizontally or vertically) and diagonally adjacent TSVs. As in the model for metal wires, capacitances between non-adjacent interconnects are neglected in the existing TSV capacitance model. One parameter, Cn,prev, is used to model all capacitances between directly adjacent TSVs. Another one, Cd,prev, is used to model the capacitances between all diagonally adjacent TSVs. Due to the increased pitch, the coupling-capacitance value for diagonally adjacent TSVs, Cd,prev, is smaller than Cn,prev. The following relationship was assumed for the two capacitance values [54, 265]:

Cd,prev = Cn,prev / 4.    (4.1)
This relationship was determined in previous work by extracting the two capacitance values for one exemplary TSV-array structure using an EM solver [265]. In the existing TSV capacitance model, all self/ground capacitances are neglected, since the eight TSVs surrounding a TSV form a strong Faraday cage. Thus, like the metal-wire capacitance model, the existing TSV capacitance model consists of only two parameters. These parameters are the coupling capacitance for directly adjacent
TSV pairs, Cn,prev , and the coupling capacitance for diagonally adjacent TSV pairs, Cd,prev .
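The two-parameter model can be stated compactly in code. The following sketch (an illustration in the spirit of Algorithm 2 in Appendix A, not a reproduction of it; the function name and row-major, 0-based indexing are assumptions) builds the capacitance matrix of an M × N TSV array from the single value Cn,prev, deriving Cd,prev via Eq. (4.1):

```python
import numpy as np

def prev_tsv_cap_matrix(M, N, c_n):
    """Capacitance matrix of an M x N TSV array under the previous model:
    C_n,prev for directly adjacent pairs, C_d,prev = C_n,prev / 4 for
    diagonally adjacent pairs (Eq. (4.1)); all other entries, including
    the self/ground capacitances, are zero."""
    n = M * N
    C = np.zeros((n, n))
    c_d = c_n / 4.0                      # Eq. (4.1)
    for i in range(n):
        r1, c1 = divmod(i, N)            # row-major, 0-based index
        for j in range(i + 1, n):
            r2, c2 = divmod(j, N)
            dr, dc = abs(r1 - r2), abs(c1 - c2)
            if dr + dc == 1:             # horizontally/vertically adjacent
                C[i, j] = C[j, i] = c_n
            elif dr == 1 and dc == 1:    # diagonally adjacent
                C[i, j] = C[j, i] = c_d
    return C
```

For a 5 × 5 array, the middle TSV (index 12) then couples with Cn,prev to its four direct neighbors and with Cn,prev/4 to its four diagonal neighbors, exactly as Fig. 4.2 depicts.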
4.2 Edge and MOS Effects on the TSV Capacitances

The conductive substrate that TSVs traverse, combined with their 3D arrangement, entails coupling effects that do not occur for planar metal wires traversing dielectric layers. These effects are categorized as TSV MOS and edge effects in this work. Both effect types and their general impact on the TSV capacitances are outlined in the remainder of this section.
4.2.1 MOS Effect

The MOS effect and its implications on the electrical characteristics of TSV structures have been investigated on the circuit level in many previous publications [22, 75, 197, 199, 200, 253]. Thus, the TSV MOS effect is only briefly summarized in the following. A TSV, its surrounding dielectric, and the conductive substrate form a metal-oxide-semiconductor (MOS) junction (see Fig. 1.9 on Page 20). Consequently, TSVs are surrounded in the substrate by depletion regions [22]. The width of the depletion region surrounding a TSV generally depends on the voltage between the TSV conductor and the substrate. However, the width of the depletion region does not change instantaneously with the voltage on the TSV. Instead, it reaches a stable state depending on the mean TSV voltage over several hundred, or even thousands, of clock cycles. The reason for this behavior is that the substrate is relatively lowly doped, which is why the generation/recombination of inversion carriers, as well as the charging/discharging of interface states, cannot follow the fast voltage changes of digital signals [253]. Typical p-doped substrates are grounded and have a negative flat-band voltage [253].1 In this case, a depletion region always exists around TSVs carrying digital signals (i.e., mean voltage between 0 V and Vdd). The width of a depletion region increases with the mean voltage on the TSV, which further isolates the TSV conductor from the conductive substrate. Since the depletion region follows the “medium-frequency” curve for digital signals [253], at a specific mean TSV-to-substrate voltage, the depletion region reaches its maximum width (i.e., no deep depletion or inversion). Thus, any further increase in the mean voltage beyond this point does not affect the depletion-region width.
1 If the TSV-to-substrate voltage exceeds the flat-band voltage, a depletion region exists.
Fig. 4.3 Size of the coupling capacitance between two directly adjacent TSVs over the mean voltage on the TSVs for a p-doped substrate and a TSV radius, minimum pitch, and depth of 1 μm, 4 μm, and 50 μm, respectively
The parameterizable 3D model of a TSV array, presented in Sect. 1.3.1, is used to outline the effect of the varying depletion-region widths on the capacitances of modern global TSVs (i.e., rtsv = 1 μm, dmin = 4 μm, ltsv = 50 μm, and fs = 6 GHz). In detail, the size of the coupling capacitance between the TSV in the middle of a 5 × 5 array and a directly adjacent neighbor TSV (modeled by Cn,prev in the previous capacitance model) is extracted with the Q3D Extractor for different mean voltages on all TSVs.2 The results are illustrated in Fig. 4.3. The analysis shows that the capacitance decreases with an ongoing increase in the mean voltages on the TSVs. From the point where the depletion-region widths no longer change with an increase in the mean voltages, however, no further change in the capacitance occurs. Moreover, the maximum possible TSV voltage is in any case bounded by the power-supply voltage, Vdd, which, in most modern technologies, is lower than the voltage at which a depletion region reaches its maximum width (in this example, about 1.6 V). For the analyzed modern TSV geometries and a power-supply voltage of 1 V, the MOS effect can affect the capacitance value by more than 25%. Thus, a high-level TSV capacitance model must consider dynamic instead of fixed capacitance quantities. Finally, note that for n-doped substrates, which are biased at Vdd, the width of a depletion region decreases, instead of increases, with an increasing mean voltage on the related TSV [253]. Thus, for n-doped substrates, the capacitances increase instead of decrease with increasing mean TSV voltages. However, n-doped substrates are not investigated in depth since common substrates are p-doped.
2 The mathematical method to determine the varying depletion-region widths for the 3D model, as a function of the mean TSV voltages, is reviewed in Appendix B.
Nevertheless, the capacitance model presented in this chapter is also valid for n-doped substrates.
4.2.2 Edge Effects

Most previous works on TSV parasitic extraction focus on arrangements of only two TSVs. Hence, the research already carried out on effects that occur at the edges of TSV-array arrangements (referred to as edge effects throughout this book) is insufficient, even on the lower levels of abstraction. Therefore, the parameterizable TSV-array model from Sect. 1.3.1 is used to outline the TSV edge effects in the following. Electric-field vectors, E⃗, can visualize the capacitances. The Ansys Electromagnetic Suite is used to draw the electric-field vectors in the substrate cross-view of the 3D model for different TSV-conductor potentials. In this section, the electric-field distribution for an exemplary 5 × 5 array with a TSV radius and minimum pitch of 2 μm and 8.5 μm, respectively, is investigated. Thereby, no further TSV is assumed to be located nearby the TSV array (other structures that are not implanted deep into the substrate, such as doping wells, have a negligible impact on the TSV parasitics). In Fig. 4.4, the electric-field vectors are illustrated for a potential of 1 V on a TSV located in the middle of the TSV array (referred to as a middle TSV) while
Fig. 4.4 Electric-field vectors for a potential of 1 V on a TSV located in the array middle, while all other TSVs are grounded
all remaining TSVs are grounded. For clarity, only vectors with an absolute value bigger than 5% of the maximum one are shown in the figures of this section. In accordance with the model used in previous works, only adjacent TSVs show a noticeable coupling in the array middle, since the surrounding eight TSVs form a Faraday cage around a middle TSV, which terminates the electric-field vectors. Since the absolute value of E⃗ decreases with an increasing distance from the TSV, the coupling between diagonally adjacent TSVs is lower than the coupling between directly adjacent TSVs. This is also correctly captured by the previous capacitance model. Second, the edges of the array are analyzed. A TSV located on at least one edge of the array is referred to in the remainder of this book as an edge TSV. For every TSV that is not an edge TSV (i.e., for all middle TSVs), the surrounding eight TSVs form a strong Faraday cage. This also entails that a middle TSV couples with any adjacent neighbor mainly through the sides of the two TSVs that face each other. Consequently, an edge TSV couples with an adjacent middle TSV only over the side facing toward the middle. As a result, the edge effects can only noticeably influence the coupling between two edge TSVs. This is validated by Fig. 4.5, which illustrates the electric-field vectors for a potential of 1 V on middle TSV9, located one row and one column away from a corner. The figure shows that the coupling of TSV9 with its adjacent edge TSVs does not noticeably differ from the coupling of the TSV with its adjacent middle TSVs.
Fig. 4.5 Electric-field vectors for a potential of 1 V on a TSV located one row and one column away from two edges, while all other TSVs are grounded
Fig. 4.6 Electric-field vectors for a potential of 1 V on a TSV located at a single edge of the array, while all other TSVs are grounded
To further investigate the edge effects, Fig. 4.6 illustrates the electric-field distribution for a potential of 1 V on a TSV located at a single edge of the array. In Fig. 4.7, the electric-field distribution for a potential of 1 V on a corner TSV, located at two edges, is illustrated as well. Between an edge-TSV pair, the electric field is generally stronger (compared to a middle-TSV pair with the same distance between the TSVs), as no solid Faraday cage encloses edge TSVs. Consequently, the coupling between indirectly/second-order adjacent TSV pairs (e.g., TSV2 and TSV4) is no longer negligible at the edges. Additionally, this implies that edge TSVs can have a non-negligible self-capacitance due to the substrate grounding. Moreover, the free conductive substrate in at least one direction (i.e., the reduced electric-field sharing with other TSVs) enables additional paths for the field lines at the array edges. This increases the coupling between two edge TSVs. At the array corners, the reduced electric-field sharing has an even stronger impact on the coupling, as a corner TSV is surrounded by only two directly adjacent TSVs. Consequently, the coupling of a corner TSV and a directly adjacent edge TSV is particularly strong. The TSV edge effects are of particular importance, as 2M + 2N − 4 TSVs are located at the edges of an M × N array. In contrast, only two wires can be located at the edges of any parallel metal-wire structure. For example, in the analyzed 5 × 5
Fig. 4.7 Electric-field vectors for a potential of 1 V on a TSV located at a corner of the array, while all other TSVs are grounded
array, as much as 64% of the TSVs are located at one or more edges. Even for a large 10 × 10 array, this proportion is still 36%. In summary, the edge effects result in increased capacitance quantities at the edges of an array, especially at the four corners, which a capacitance model must not neglect. Thus, different parameters must be considered in the capacitance model for middle, single-edge, and corner TSVs.
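The quoted proportions follow directly from the 2M + 2N − 4 boundary count. A two-line helper (hypothetical, for illustration only) reproduces them:

```python
# Fraction of TSVs located at the edges of an M x N array:
# 2M + 2N - 4 of the M*N TSVs lie on the boundary.
def edge_fraction(m, n):
    return (2 * m + 2 * n - 4) / (m * n)

print(edge_fraction(5, 5))    # 0.64 -> 64 % of a 5 x 5 array
print(edge_fraction(10, 10))  # 0.36 -> still 36 % for 10 x 10
```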
4.3 TSV Capacitance Model

This section proposes a new TSV capacitance model, which abstractly captures the previously outlined MOS and edge effects. The MOS effect results in capacitance values that are a function of the mean TSV voltages, instead of being constant as in the previously used model. In principle, the average TSV voltages used to determine the depletion-region widths must be calculated over a sliding window, and the formulas to find the correct window duration exhibit a high complexity. However, this work considers strongly stationary data streams. In this case, the average TSV voltages for the calculation
of the depletion-region widths can be considered as constant for each data stream. Consequently, the mean TSV voltages, V̄i, which determine the depletion-region widths, can be expressed by means of the 1-bit probabilities on the TSVs, pi:

V̄i = E{bi} · Vdd = pi · Vdd.    (4.2)
The resulting dependency between the size of the coupling capacitances and the bit probabilities is still complicated. Nevertheless, from the previous section, it is known that the capacitance values can only increase or decrease (not both) with increasing mean voltages (i.e., 1-bit probabilities) on the TSVs. As shown in Fig. 4.3 on Page 76, the capacitance values decrease with an increase in the 1-bit probabilities for common p-doped substrates. In contrast, the capacitance values increase with the 1-bit probabilities for n-doped substrates. To achieve a low model complexity, the dependency between the bit probabilities and the capacitances is approximated in the proposed model by a linear regression. The average of the 1-bit probabilities for the TSVs with the indices i and j, symbolized as p̄i,j, is used as the independent variable (i.e., feature) to estimate/model the size of the capacitance Ci,j. Please note that for i equal to j (i.e., the fit for the self-capacitance Ci,i), the feature p̄i,j is equal to the logical 1-bit probability on the ith TSV, pi. In summary, the following formula models the MOS effect on the TSV capacitances:

Ci,j = CG,i,j + ΔCi,j · (pi + pj)/2 = CG,i,j + ΔCi,j · p̄i,j,    (4.3)
where CG,i,j is the fitted capacitance value for a mean voltage of 0 V on all TSVs, and ΔCi,j is the fitted derivative of the capacitance value with respect to an increasing 1-bit probability on the TSVs. Hence, the full capacitance matrix for a structure of n lines is mathematically expressed as

C = CG + ΔC ◦ P̄R,    (4.4)

where P̄R is an n × n probability matrix with p̄i,j on entry (i, j), and ◦ is the Hadamard-product operator, resulting in an element-wise multiplication of the entries of two matrices. CG and ΔC are n × n matrices with CG,i,j and ΔCi,j on the (i, j) entries, respectively. The entries of ΔC are negative for typical p-doped substrates. In contrast, the entries are positive for n-doped substrates, indicating increasing capacitance values with increasing 1-bit probabilities. If the power-supply voltage of the driver technology exceeds the absolute voltage at which the depletion region of a TSV reaches its maximum width, the capacitance values must additionally be clipped to a lower bound:

C = max(CG + ΔC ◦ P̄R, Cmax-dep),    (4.5)
where Cmax-dep is the fitted capacitance matrix for all TSVs surrounded by a depletion region of maximum width, resulting in the lowest capacitance values.
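In matrix form, applying the clipping of Eq. (4.5) is a single element-wise operation. The sketch below uses illustrative 2 × 2 numbers only: the 104.0 pF/m and −19.3 pF/(m V) figures echo fitted values reported later, while the 88.0 pF/m lower bound is an assumed placeholder for the maximum-depletion fit.

```python
import numpy as np

# Eq. (4.5): entry-wise lower bound on the capacitance matrix.
# All numbers are illustrative; real matrices come from the fits.
CG     = np.array([[0.0, 104.0], [104.0, 0.0]])  # pF/m, zero-volt fit
dC     = np.array([[0.0, -19.3], [-19.3, 0.0]])  # pF/(m V), fitted slope
Pbar   = np.array([[0.0,   0.9], [  0.9, 0.0]])  # mean 1-bit probabilities
C_maxd = np.array([[0.0,  88.0], [ 88.0, 0.0]])  # assumed max-depletion fit

# Hadamard product "◦" is plain element-wise "*" in numpy;
# "max()" of Eq. (4.5) is the entry-wise np.maximum.
C = np.maximum(CG + dC * Pbar, C_maxd)
# off-diagonal entries: 104.0 - 19.3*0.9 = 86.63 -> clipped up to 88.0
```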
Fig. 4.8 Proposed edge-effect-aware TSV capacitance model
The function “max()” returns a matrix with the entry-wise maximum values of the two matrices. In this particular case, it is important that the extracted capacitance values used to fit a ΔCi,j value and the respective CG,i,j value are obtained for mean voltages that do not result in a maximum depletion-region width for any TSV, to avoid inaccurate ΔC entries. However, for all TSV structures analyzed in this book, the power-supply voltages of the driver cells are low enough that the extended model (i.e., Eq. (4.5)) is not needed. Finally, abstract, scalable, and yet universally valid models to build CG and ΔC are required. The previously discussed TSV edge effects reveal that more than two different capacitance types, Cn and Cd, have to be considered in these models, as the coupling capacitance between two edge TSVs is generally higher than its counterpart in the middle of the array, especially if one of the two edge TSVs is located at an array corner. Also, the ground capacitances of TSVs are not necessarily negligible at the edges. Thus, this book proposes the TSV capacitance model illustrated in Fig. 4.8. The model distinguishes between capacitances connected to at least one middle TSV (marked black in Fig. 4.8) and capacitances connected only to edge TSVs. For the second case, the capacitances also generally differ depending on whether an associated TSV is located at a corner of the array (marked blue in Fig. 4.8) or not (marked red in Fig. 4.8). For the coupling capacitances between any two directly adjacent TSVs, out of which at least one TSV is located in the array middle, a single fit Cn(p̄i,j) is used. This fit results in the two model coefficients CG,n and ΔCn. Although the edge
effects sometimes have a slight impact on the capacitance value of two diagonally adjacent edge TSVs over a corner, a single linear fit, Cd(p̄i,j), is used to model all coupling capacitances between diagonally adjacent TSVs. The capacitances between the corner TSVs and their two directly adjacent edge TSVs are modeled using another fit, Cc1(p̄i,j). Also, the capacitance value between a directly adjacent edge-TSV pair that is not located at an array corner is modeled by an individual fit, Ce1(p̄i,j). At the edges, the capacitances between indirectly adjacent TSVs must be included in the capacitance model. Thus, one linear fit, Cc2(p̄i,j), is used to model the capacitances between a corner TSV and its two indirectly adjacent edge TSVs. Another fit, Ce2(p̄i,j), is used to model the capacitances between indirectly adjacent TSV pairs located only at one edge of the array. Finally, the fits Cc0(pi) and Ce0(pi) model the ground capacitances of TSVs located at a corner and a single edge of the array, respectively. In summary, in the proposed capacitance model defined by Eq. (4.4), CG is constructed out of eight capacitance values: CG,n, CG,d, CG,e0, CG,e1, CG,e2, CG,c0, CG,c1, and CG,c2. Analogously, ΔC is constructed out of eight capacitance deviations: ΔCn, ΔCd, ΔCe0, ΔCe1, ΔCe2, ΔCc0, ΔCc1, and ΔCc2. Thus, the proposed model has 16 parameters in total. A pseudo code to build the two matrices out of these 16 parameters is presented through Algorithm 3 in Appendix A on Page 375. To fit the 16 capacitance coefficients of the model for a certain TSV technology, eight capacitance values must be extracted for an exemplary TSV array bigger than 4 × 4 for varying (i.e., at least two) depletion-region widths. The more depletion-region widths (and thus 1-bit probabilities) are considered for parasitic extraction, the better the capacitance fits. On the downside, considering more depletion-region widths has the drawback of increased computational complexity for parasitic extraction. However, for each different TSV technology, the model coefficients only have to be fitted once, as the proposed model is universally valid and scalable. Due to this strong reusability, the overhead of analyzing more depletion-region widths vanishes over time in most application scenarios. To further achieve reusability of the fitted coefficients for different substrate thicknesses, all 16 values must be reported per unit TSV length. Moreover, the ΔC values must additionally be normalized to the power-supply voltage to make the coefficients reusable for different driver technologies.
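A possible realization of such a matrix builder is sketched below. It follows the classification rules just described (middle/edge/corner TSVs; direct, diagonal, and indirect adjacency) but is not a reproduction of Algorithm 3; the function name, the coefficient-dictionary layout, and the row-major 0-based indexing are illustrative assumptions.

```python
import numpy as np

def build_cap_matrix(M, N, p, coef):
    """Assemble the n x n capacitance matrix of an M x N TSV array via
    Eq. (4.4), n = M*N. `p` is the length-n vector of 1-bit probabilities;
    `coef` maps the keys 'n','d','e0','e1','e2','c0','c1','c2' to
    (CG, dC) pairs, i.e., the 16 fitted model coefficients."""
    n = M * N
    CG = np.zeros((n, n))
    dC = np.zeros((n, n))

    def on_edge(r, c):
        return r in (0, M - 1) or c in (0, N - 1)

    def is_corner(r, c):
        return r in (0, M - 1) and c in (0, N - 1)

    def same_boundary(r1, c1, r2, c2):
        # both TSVs lie on one common boundary line of the array
        return (r1 == r2 and r1 in (0, M - 1)) or \
               (c1 == c2 and c1 in (0, N - 1))

    for i in range(n):
        r1, c1 = divmod(i, N)
        # ground capacitances: only edge/corner TSVs have one
        if is_corner(r1, c1):
            CG[i, i], dC[i, i] = coef['c0']
        elif on_edge(r1, c1):
            CG[i, i], dC[i, i] = coef['e0']
        for j in range(i + 1, n):
            r2, c2 = divmod(j, N)
            dr, dc = abs(r1 - r2), abs(c1 - c2)
            key = None
            if dr + dc == 1:                      # directly adjacent
                if not (on_edge(r1, c1) and on_edge(r2, c2)):
                    key = 'n'                     # at least one middle TSV
                elif is_corner(r1, c1) or is_corner(r2, c2):
                    key = 'c1'                    # corner-edge pair
                else:
                    key = 'e1'                    # edge-edge pair
            elif dr == 1 and dc == 1:             # diagonally adjacent
                key = 'd'
            elif {dr, dc} == {0, 2} and same_boundary(r1, c1, r2, c2):
                # indirect (second-order) neighbors along one edge
                key = 'c2' if is_corner(r1, c1) or is_corner(r2, c2) else 'e2'
            if key:
                CG[i, j] = CG[j, i] = coef[key][0]
                dC[i, j] = dC[j, i] = coef[key][1]

    pbar = 0.5 * (p[:, None] + p[None, :])        # p̄_ij; diagonal gives p_i
    return CG + dC * pbar                         # Eq. (4.4)
```

Indirect pairs in the array middle, as well as the ground capacitances of middle TSVs, are left at zero, matching the model's neglect of those terms.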
4.4 Evaluation

The evaluation section of this chapter consists of two subsections. Section 4.4.1 reports the coefficients of the proposed TSV capacitance model. Furthermore, the accuracy of the capacitance model is quantified and compared to the accuracy of the previously used capacitance model to put the accuracy into perspective. In detail, two analyses are conducted to quantify the accuracy of the TSV capacitance models. The first primarily quantifies how well the MOS effect is captured by
applying linear fits, and what errors result from the constant capacitance values of the previous model. In the second, it is investigated how precisely the abstract capacitance models construct the complete capacitance matrices of several TSV arrays with varying line count and bit-level properties. Section 4.4.2 investigates the accuracy of a high-level TSV power-consumption and performance estimation based on the proposed capacitance model combined with the formulas presented in Chap. 3. Again, a comparison is drawn with the results for the previous TSV capacitance model.
4.4.1 Model Coefficients and Accuracy

This section determines the coefficients and accuracy of the presented as well as the previous TSV capacitance model. Note that our previous work, Ref. [18], shows that the proposed MOS-effect-aware capacitance model will become even more essential for future submicrometric TSV dimensions. The reason is the increasing relative impact of the depletion-region widths on the coupling-capacitance values with reduced TSV dimensions. For example, it is shown in Ref. [18] that, for a TSV radius and pitch of 0.5 μm and 2 μm, respectively, the MOS effect can impact the TSV capacitance sizes by over 40%. Furthermore, Ref. [18] discusses the necessity of more complex quadratic fits to capture the MOS effect for future TSV dimensions. However, these results are not included in this book, as submicrometric global TSVs are not expected to become a reality soon. Moreover, the authors’ initial publication on edge effects in TSV-array arrangements (see Ref. [20]) shows that the proposed edge-effect-aware capacitance model remains valid if structures deeply integrated into the substrate are located nearby the TSV arrays to suppress the substrate noise TSVs induce on active circuits. Examples of such structures are shielding lines or guard rings [50]. However, structures to reduce the TSV substrate noise are not investigated here, as they are rarely implemented in modern 3D systems. Thus, these results are also not included here. The interested reader can find further information in related publications of the authors.
4.4.1.1 Experimental Setup
In the following, the experimental setup to obtain the model coefficients is outlined. To determine the model coefficients, parasitic extractions for a 5 × 5 instance of the scalable TSV-array model are performed with the Q3D Extractor. All coefficients of the proposed capacitance model are reported per unit TSV length. Hence, the model coefficients only depend on the TSV radius, rtsv , the minimum TSV pitch, dmin , and the significant signal frequency, fs . Consequently, the TSV length is fixed to 50 μm for the determination of the model coefficients.
Two variants are analyzed for the TSV radius: 1 μm and 2 μm. Primarily, densely spaced TSVs are considered, as an enlarged TSV pitch increases the critical area occupation of TSV arrays even further. Moreover, previous research has shown that increasing the TSV spacing does not effectively reduce the TSV coupling capacitances, in contrast to metal-wire structures, where spacing efficiently reduces the capacitances [158]. The minimum predicted TSV pitches for the analyzed TSV radii of 1 μm and 2 μm are 4 μm and 8 μm, respectively. Besides these minimum values for the TSV pitches, a reasonable increase of the minimum pitches by 0.5 μm, to 4.5 μm and 8.5 μm respectively, is considered. Two significant signal frequencies, fs, are analyzed: 6 GHz and 11 GHz. The first corresponds to a typical TSV-signal rise and fall time of about 80 ps, while the second corresponds to a TSV-signal rise and fall time of less than 50 ps for strong drivers. To obtain the model coefficients, different mean voltages on the TSVs against ground are analyzed. Varying the voltages leads to varying depletion-region widths, which differ for the two considered TSV radii (note that the depletion-region widths are independent of the TSV pitches). After the depletion-region widths for the investigated mean voltages are determined using the method presented in Appendix B, they are fed into the TSV-array model to extract the associated capacitance values. These extracted capacitance values are then used to perform linear regression analyses using the least-squares approach. Thereby, the feature values for the regression analyses are the mean voltages for the related TSVs—which can later be mapped to the actual 1-bit probabilities through Eq. (4.2). For each capacitance fit, extraction results for a single capacitance are considered. The extracted values for the coupling capacitances C13,12 and C13,7 of TSV13 in the middle of the array (see Fig.
4.4 on Page 77) are analyzed to obtain the respective capacitance fits Cn(p̄i,j) and Cd(p̄i,j). To obtain the fits Ce0(pi), Ce1(p̄i,j), and Ce2(p̄i,j), the extracted values for the capacitances C2,2, C2,3, and C2,4 of TSV2, located at a single array edge, are considered, respectively. For the last three fits, Cc0(pi), Cc1(p̄i,j), and Cc2(p̄i,j), the extracted values for C1,1, C1,2, and C1,3 of TSV1, located in a corner of the array, are analyzed. Each analyzed capacitance is extracted for a wide set of different mean TSV voltages. First, the mean voltage on the related TSV is varied between 0 and 1 V with a constant step size. Furthermore, the mean voltage on all remaining TSVs is varied independently using the same voltage steps. For example, consider an analysis where the step size between the analyzed TSV voltages is 0.5 V while the power-supply voltage is 1 V. In this exemplary analysis, the 1-bit probabilities on the TSVs are either 0, 0.5, or 1. In this case, three different capacitance values would be extracted for an analyzed coupling capacitance Ci,j and p̄i,j equal to 0.5: one value for a 1-bit probability of 0.5 on all TSVs in the array; one value for a 1-bit probability of 0 on TSVi, while the 1-bit probability is 1 for all other TSVs in the array; and one value for a 1-bit probability of 1 on TSVi, while the 1-bit probability is 0 for all other TSVs in the array. Also, three different values would be extracted for an analyzed ground capacitance Ci,i and pi equal to 0.5: one value for a 1-bit probability of 0 on all other TSVs; one
value for a 1-bit probability of 0.5 on all other TSVs; and one value for a 1-bit probability of 1 on all other TSVs. This setup is chosen as it later allows quantifying the model inaccuracy due to the averaging of the two bit probabilities pi and pj, as well as due to the neglected impact of the depletion regions of the remaining TSVs that are not directly associated with the analyzed capacitance. In the actual experiment, the step size in the considered voltages is 0.1 V instead of 0.5 V, to provide a higher model accuracy and to better quantify the model errors. Thus, 121 different values are extracted for each analyzed capacitance, and the smallest resolution of the feature variable for the linear regression, p̄i,j, is 0.05.
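The per-capacitance least-squares fits described above can be sketched as follows. The sample values are synthetic stand-ins for the Q3D extractions (loosely shaped like the curve of Fig. 4.3); the variable names are illustrative assumptions:

```python
import numpy as np

# Least-squares linear fit of one extracted capacitance over the mean
# TSV voltage. Synthetic samples: a -1 fF/V slope around 5.2 fF at 0 V,
# plus small reproducible noise standing in for extraction scatter.
rng = np.random.default_rng(0)
v_mean = np.arange(0.0, 1.1, 0.1)                 # 0 V ... 1 V, 0.1 V steps
c_extr = 5.2 - 1.0 * v_mean + 0.03 * rng.standard_normal(v_mean.size)

# Degree-1 least-squares fit: highest power first.
slope, intercept = np.polyfit(v_mean, c_extr, 1)

CG_fit = intercept      # fitted capacitance at 0 V (the CG coefficient)
dC_fit = slope          # fitted slope per volt; normalized by Vdd to
                        # map mean voltages to 1-bit probabilities (Eq. (4.2))
```

With a real extraction sweep, `v_mean` would hold the analyzed mean TSV voltages and `c_extr` the corresponding extracted capacitance values for one of the eight fitted capacitances.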
4.4.1.2 Model Coefficients
The extracted capacitance values are first used to perform linear regression analyses to obtain the coefficients of the proposed model. Furthermore, the mean extracted value for C13,12 is used as Cn,prev for the previous model. Afterward, the second parameter of the previously used model, Cd,prev, is calculated as Cn,prev/4. The fittings of Cc1(p̄i,j) and Cn(p̄i,j) for the proposed model are exemplarily illustrated in Fig. 4.9a for the capacitance values extracted for a TSV radius, minimum pitch, and significant frequency of 1 μm, 4.5 μm, and 6 GHz, respectively. Furthermore, the fitting of Cn,prev (used in previous works to model the coupling capacitances of all directly adjacent TSV pairs) is illustrated in the figure. In Fig. 4.9b, the relative errors of the two models in recreating the extracted capacitance values for C13,12 and C1,2 are plotted. The resulting coefficients of the proposed capacitance model are reported in Table 4.1 for all analyzed geometrical TSV dimensions and significant frequencies. To obtain the actual model parameters for a given power-supply voltage, Vdd, and TSV length, ltsv, the normalized CG values must be multiplied by ltsv, and the normalized ΔC values by ltsv·Vdd. However, note that the reported ΔC values are fitted for a voltage range of 0–1 V. Hence, the fits are least-squares fits only for Vdd equal to 1 V. For comparison purposes, the coefficients for the previously used TSV capacitance model are additionally reported in Table 4.2. The results in Table 4.1 confirm that the edge effects lead to a drastic increase in the capacitance values. For example, the coupling capacitance between a corner TSV and a directly adjacent edge TSV, CG,c1, is 40–46% bigger than the coupling capacitance between directly adjacent TSVs in the array middle, CG,n, for all analyzed TSV radii, pitches, and significant frequencies.
Also, the fitted capacitance value for two directly adjacent single-edge TSVs, CG,e1, is always 30–33% bigger than CG,n. Moreover, the CG,e2 and CG,c2 values are in the range of the CG,d values. This underscores the importance of considering the edge effects in the capacitance model. For the analyzed TSV structures, the self-capacitances are relatively small. Nevertheless, the coefficients CG,e0, ΔCe0, CG,c0, and ΔCc0 cannot generally be removed from the model to reduce its complexity. The reason is that these values
Fig. 4.9 Fitting of the capacitance-model coefficients: (a) extracted capacitance values and the fitted models over the mean voltage on the related TSVs; (b) relative errors of the models to reconstruct the extracted capacitance values
Table 4.1 Coefficients of the proposed TSV capacitance model for different TSV parameters

TSV parameters
  rtsv  [μm]       1.0    1.0    1.0    1.0    2.0    2.0    2.0    2.0
  dmin  [μm]       4.0    4.0    4.5    4.5    8.0    8.0    8.5    8.5
  fs    [GHz]      6.0   11.0    6.0   11.0    6.0   11.0    6.0   11.0
Model coefficients
  CG,n  [pF/m]   104.0   91.4  102.8   85.8  116.2   95.7  115.2   92.7
  ΔCn   [pF/m V] −19.3  −11.1  −23.5  −10.3  −10.2   −3.7   −9.9   −3.1
  CG,d  [pF/m]    36.4   26.4   37.3   26.3   38.4   26.7   38.5   26.6
  ΔCd   [pF/m V]  −4.2   −1.6   −5.4   −0.9   −0.7    0.0   −0.7    0.0
  CG,e0 [pF/m]     8.3    2.7    3.4    2.6    3.4    4.9    3.2    5.1
  ΔCe0  [pF/m V]  −6.3   −0.7   −0.4   −0.1   −0.9   −0.9    0.0   −0.6
  CG,e1 [pF/m]   138.3  120.3  136.9  112.8  154.1  125.2  151.5  120.8
  ΔCe1  [pF/m V] −24.6  −13.0  −29.9  −12.1  −13.2   −4.2  −12.3   −3.5
  CG,e2 [pF/m]    21.8   14.7   22.8   15.1   22.6   14.7   22.2   14.5
  ΔCe2  [pF/m V]   0.0    0.0   −0.2    0.0    0.0    0.0    0.0    0.0
  CG,c0 [pF/m]     8.6    5.6    4.2    5.1    5.9   10.5    9.6   12.3
  ΔCc0  [pF/m V]  −5.0   −0.4   −1.2   −2.1    0.0   −0.1   −0.4   −0.6
  CG,c1 [pF/m]   150.0  127.8  150.5  121.8  168.0  134.2  164.3  129.6
  ΔCc1  [pF/m V] −23.6  −14.2  −29.0  −11.8  −13.9   −4.0  −12.3   −3.4
  CG,c2 [pF/m]    28.8   19.5   29.3   19.4   29.5   19.5   29.0   19.1
  ΔCc2  [pF/m V]   0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
4 High-Level Estimation of the 3D-Interconnect Capacitances
Table 4.2 Coefficients of the previous TSV capacitance model for different TSV parameters

  rtsv    [μm]     1.0    1.0    1.0    1.0    2.0    2.0    2.0    2.0
  dmin    [μm]     4.0    4.0    4.5    4.5    8.0    8.0    8.5    8.5
  fs      [GHz]    6.0   11.0    6.0   11.0    6.0   11.0    6.0   11.0
  Cn,prev [pF/m]  94.4   85.9   91.1   80.6  110.6   93.7  109.8   91.0
  Cd,prev [pF/m]  23.6   21.5   22.8   20.2   27.6   23.4   27.5   22.8
drastically increase if the TSV arrays are enclosed by ground rings to reduce the substrate noise, as shown in the authors' initial publication [20]. Generally, the relative increase of the coupling-capacitance values toward the edges is nearly independent of the TSV radius and pitch, but it shows a slight dependency on the significant frequency. For example, the ratio CG,c1/CG,n is always in the range from 1.44 to 1.46 for a significant frequency of 6 GHz, and from 1.40 to 1.41 for a significant frequency of 11 GHz. In the following, the reason for this frequency dependency is outlined. Furthermore, the dependency on the substrate doping is discussed. Through-silicon vias couple through their insulating silicon-oxide liners, the surrounding depletion regions, and the substrate—all connected in series. While the electrical behavior of the TSV oxide liners and the depletion regions can be precisely modeled by capacitances, the substrate is modeled at lower abstraction levels by parallel capacitance-conductance components due to its electrical conductivity [253]. Thus, the admittance of a substrate unit element is electrically modeled as

Ysubs = j 2π fs Csubs + Gsubs ,   (4.6)

where Csubs and Gsubs are the capacitance and the conductance of the unit element, respectively. Here, a lossy (complex) equivalent capacitance is introduced to simplify the following discussion:

Ceq,subs ≜ Ysubs / (j 2π fs) = Csubs + Gsubs / (j 2π fs) .   (4.7)

The absolute value of Ceq,subs decreases with increasing fs or decreasing Gsubs, and it asymptotically reaches Csubs. Hence, the TSV capacitance values decrease with increasing significant frequency and decreasing substrate doping (which reduces Gsubs). Furthermore, a decreasing Ceq,subs value leads to slightly decreasing edge effects, since edge effects occur due to free substrate paths for the electric field. Thus, with
Fig. 4.10 Ratio of the coupling capacitance between two directly adjacent edge TSVs over the coupling capacitance between two directly adjacent middle TSVs (Cc1/Cn and Ce1/Cn, for rtsv = 1.0 μm with dmin = 4.0 μm and 4.5 μm) over the significant signal frequency (6–32 GHz) and a mean voltage of 0.5 V on all TSVs
higher frequencies or lower doping concentrations, the magnitude of the edge effects decreases. Asymptotically, for high significant frequencies and low doping profiles, the magnitude of the edge effects is determined solely by the capacitive behavior of the substrate. An analysis is conducted to quantify the magnitude of the edge effects for ultra-high-speed signaling (fs > 10 GHz). For a mean voltage of 0.5 V on all TSVs, the coupling-capacitance values for directly adjacent edge TSVs (i.e., Cc1 and Ce1) and directly adjacent middle TSVs (i.e., Cn) are extracted with the Q3D Extractor over the significant signal frequency. A TSV radius of 1 μm is assumed as an example. The ratios Cc1/Cn and Ce1/Cn for the two analyzed TSV pitches are plotted in Fig. 4.10 over fs. For all analyzed TSV dimensions and significant frequencies as high as 31 GHz, the Cc1 and Ce1 values are always at least 33% and 27% bigger than their counterpart in the array middle, Cn, respectively. With a further increase in the significant frequency, no significant change in the capacitance ratios is observed. The same asymptotic Cc1/Cn and Ce1/Cn ratios are obtained if the substrate conductivity is decreased toward 0 S/m. Therefore, even for ultra-high frequencies and low doping profiles, the edge effects are still of paramount importance and must be captured in the capacitance model. So far, only the magnitude of the edge effects, but not the MOS effect, has been discussed. The MOS effect, too, has a significant impact on the capacitances, especially for moderate significant frequencies. However, the magnitude of the MOS effect decreases with increasing significant frequency or decreasing substrate conductivity. The rationale for this behavior is again the capacitive-conductive duality of the substrate. Since a depletion region is an area of the substrate whose conductivity is decreased to zero, the admittance of a depletion-region unit element is
Ysubs,depleted = j 2π fs Csubs .   (4.8)
With increasing depletion-region widths, a larger area of the substrate has the smaller admittance described by Eq. (4.8) instead of Eq. (4.6), and vice versa. However, with increasing fs or decreasing Gsubs, the relative difference between Eq. (4.8) and Eq. (4.6) decreases. Thus, the smaller the substrate conductivity and the higher the significant frequency, the lower the impact of the depletion regions on the capacitances. In contrast to the edge effects, the MOS effect even vanishes completely with increasing significant frequencies or decreasing substrate doping, as

lim(fs → ∞) Ysubs = lim(Gsubs → 0) Ysubs = Ysubs,depleted .   (4.9)
Consequently, for high frequencies or non-doped substrates, the capacitance model can be simplified by removing the MOS effect. In this case, all ΔC entries are zero, and the capacitance matrix is simply modeled by CG, so that only eight capacitance values must be extracted, without any linear-regression analysis. Furthermore, for indirectly adjacent TSV pairs, the magnitude of the MOS effect is generally negligible even for small significant frequencies and high doping profiles. Thus, the ΔCe2 and ΔCc2 values can be removed from the model to decrease the model complexity. Another fact worth mentioning is that the magnitude of the MOS effect increases with ongoing TSV scaling. The main reason for this behavior is that the relative dynamic width of the depletion region is higher for thinner TSV oxides.
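The frequency behavior discussed above can be checked numerically: the magnitude of the lossy equivalent capacitance of Eq. (4.7) falls monotonically with the significant frequency and approaches the purely capacitive value Csubs. The unit-element values below are arbitrary placeholders, not taken from the extraction setup.

```python
import math

# Illustrative numbers only: Csubs and Gsubs of a substrate unit element
# are technology dependent and are not taken from the book.
C_SUBS = 1e-16   # F
G_SUBS = 1e-5    # S

def ceq_subs_abs(fs):
    """Magnitude of the lossy equivalent capacitance of Eq. (4.7)."""
    return math.hypot(C_SUBS, G_SUBS / (2 * math.pi * fs))

# |Ceq,subs| decreases monotonically with fs and approaches Csubs:
vals = [ceq_subs_abs(f) for f in (6e9, 11e9, 31e9)]
assert vals[0] > vals[1] > vals[2] > C_SUBS
```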
4.4.1.3 Model Accuracy—Goodness of the Linear Fits
The following analysis quantifies the maximum absolute error (MAE) as well as the root-mean-square error (RMSE) of the linear fits. All MAE and RMSE values are obtained by considering all 121 extracted values for each capacitance, which are compared with the respective model predictions. Furthermore, the error values for the previously used TSV capacitance model are determined. The resulting error metrics for the proposed and the previous capacitance model are reported in Tables 4.3 and 4.4, respectively. In the tables, all error values are reported as percentages relative to the corresponding Cn,prev value (i.e., Cn,prev · ltsv) to remove any scaling bias. This allows us to better relate the error values for different TSV parameters.

Table 4.3 Normalized maximum-absolute and RMS errors of the linear fits for the extracted TSV capacitances (all values in %)

  rtsv [μm]          1.0   1.0   1.0   1.0   2.0   2.0   2.0   2.0
  dmin [μm]          4.0   4.0   4.5   4.5   8.0   8.0   8.5   8.5
  fs   [GHz]         6.0  11.0   6.0  11.0   6.0  11.0   6.0  11.0
  C13,12—Cn   MAE    4.0   1.1   5.0   1.9   1.0   0.1   4.0   0.1
              RMSE   1.3   0.4   2.0   0.7   0.4   0.0   0.5   0.0
  C13,7—Cd    MAE    1.0   0.3   1.8   0.7   0.5   0.2   1.3   0.2
              RMSE   0.4   0.1   0.8   0.4   0.2   0.1   0.3   0.1
  C2,2—Ce0    MAE    9.2   1.7   6.6   2.6   1.2   0.9   0.9   0.6
              RMSE   4.3   0.5   2.6   0.9   0.5   0.4   0.4   0.3
  C2,3—Ce1    MAE    5.0   1.2   4.1   1.6   0.5   0.3   5.1   0.3
              RMSE   1.6   0.4   1.8   0.7   0.2   0.1   0.6   0.1
  C2,4—Ce2    MAE    1.6   0.5   0.9   1.0   0.5   0.3   0.5   0.4
              RMSE   0.5   0.2   0.2   0.3   0.2   0.1   0.2   0.2
  C1,1—Cc0    MAE    5.6   1.1   6.9   4.4   2.3   1.0   0.7   0.7
              RMSE   3.3   0.3   2.8   1.5   1.7   0.3   0.3   0.2
  C1,2—Cc1    MAE    5.8   2.2   3.0   1.7   0.4   0.3   0.8   0.1
              RMSE   2.5   0.8   1.3   0.6   0.2   0.1   0.4   0.0
  C1,3—Cc2    MAE    1.4   0.6   1.3   1.1   0.3   0.3   0.4   0.3
              RMSE   0.6   0.2   0.3   0.4   0.1   0.1   0.2   0.2

Table 4.4 Normalized maximum-absolute and RMS errors of the previous TSV capacitance model for the nine capacitance values extracted for varying mean TSV voltages (all values in %)

  rtsv [μm]               1.0   1.0   1.0   1.0   2.0   2.0   2.0   2.0
  dmin [μm]               4.0   4.0   4.5   4.5   8.0   8.0   8.5   8.5
  fs   [GHz]              6.0  11.0   6.0  11.0   6.0  11.0   6.0  11.0
  C13,12—Cn,prev  MAE    14.2   7.2  15.5   6.9   4.5   1.8   5.3   1.5
                  RMSE    4.7   2.9   6.1   3.0   1.9   0.8   1.9   0.7
  C13,7—Cd,prev   MAE    13.7   5.5  16.2   8.0   9.7   3.7  10.2   4.5
                  RMSE   11.4   4.8  13.1   7.1   9.4   3.6   9.7   4.2
  C2,2—0          MAE    15.3   4.8  10.3   4.6   4.2   6.0   3.9   5.9
                  RMSE    7.3   2.8   4.4   3.2   2.7   4.7   3.0   5.3
  C2,3—Cn,prev    MAE    51.4  40.9  51.8  40.1  38.5  33.2  37.2  32.3
                  RMSE   34.0  32.7  34.7  32.5  33.0  31.3  32.0  30.7
  C2,4—0          MAE    23.7  17.2  25.4  19.1  20.7  16.0  20.7  16.3
                  RMSE   23.1  17.1  25.0  18.7  20.5  15.8  20.3  16.0
  C1,1—0          MAE    12.9   7.0  11.3   9.0   7.6  11.7   9.2  13.8
                  RMSE    7.4   6.2   4.9   5.3   5.6  11.2   8.5  13.2
  C1,2—Cn,prev    MAE    64.3  49.9  65.1  51.2  51.0  42.8  48.9  42.0
                  RMSE   46.8  40.8  49.8  43.9  45.1  40.9  43.6  40.4
  C1,3—0          MAE    31.4  22.9  32.6  24.6  27.0  21.2  26.7  21.3
                  RMSE   30.5  22.7  32.2  24.1  26.7  20.9  26.4  21.0

The results show that the proposed model accurately replicates the extracted capacitance values for all analyzed TSV parameters, as all normalized MAE values are below 10%. In contrast, the previous model results in errors as high as 64.3%. The normalized RMSE values of the linear fits of the proposed model do not exceed 4.3%. This is a dramatic improvement over the previous model, which results in RMSE values of up to 46.3%. Generally, both models result in higher errors for aggressively scaled TSV dimensions and lower significant frequencies. For example, for a TSV radius of 2 μm, a minimum TSV pitch of 8 μm, and a significant frequency of 11 GHz, all reported MAE and RMSE values of the proposed model do not exceed 1% and 0.4%, respectively. For these TSV parameters, the previously used model still results in MAE and RMSE values of up to 42.8% and 40.9%, respectively.

The lower accuracy of the capacitance model for smaller TSV dimensions and lower frequencies has several reasons. First, the magnitude of the MOS effect increases with shrinking TSV geometries and a decreasing significant frequency, so the error due to the simple linear fit increases as well. As shown in the authors' publication [18], using non-linear fits overcomes this issue of the proposed model in future technologies. Second, the impact of the depletion regions of the TSVs that are not directly connected to the modeled coupling capacitance increases with shrinking TSV geometries and lower significant frequencies. Last, non-idealities in the substrate become more dominant at lower frequencies.
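The normalization used for these error metrics can be reproduced in a few lines: both metrics are computed over the extracted-versus-modeled value pairs and divided by the reference value Cn,prev. The helper below is a sketch with toy numbers, not the book's tooling.

```python
import math

def normalized_errors(extracted, modeled, cn_prev):
    """MAE and RMSE of a fit, as a fraction of the reference value Cn,prev,
    mirroring the normalization used for the error tables."""
    residuals = [e - m for e, m in zip(extracted, modeled)]
    mae = max(abs(r) for r in residuals) / cn_prev
    rmse = math.sqrt(sum(r * r for r in residuals) / len(residuals)) / cn_prev
    return mae, rmse

# Toy data (not from the book): three extracted values against a fit,
# normalized by Cn,prev = 94.4 pF/m.
mae, rmse = normalized_errors([100.0, 95.0, 90.0], [99.0, 95.0, 92.0], 94.4)
```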
4.4.1.4 Model Accuracy for Complete Capacitance Matrices
So far, it has been shown that linear fits, using p̄i,j as the feature variable, allow us to accurately reproduce extracted capacitance values for varying depletion-region widths. However, it has not yet been proven that using only eight linear fits to model full capacitance matrices is accurate, despite the strong heterogeneity in the capacitance values due to the edge effects. Thus, the coefficients from Tables 4.1 and 4.2 are used to construct the capacitance matrices of complete array arrangements. These modeled capacitance matrices are subsequently compared to the "true" capacitance matrices extracted with the Q3D Extractor. In the first part of this analysis, the transmission of random data bits over the TSVs is considered, implying equal amounts of 1-bits and 0-bits transmitted over all TSVs (i.e., pi = 0.5 for all i). To show the reusability of the model, the coefficients—fitted using extracted capacitance values for 5 × 5 arrays and a TSV length of 50 μm—are used to: first, reproduce the capacitance matrices of the 5 × 5 arrays; and second, generate the capacitance matrices of 7 × 7 arrays with an increased TSV length of 70 μm. The resulting RMSE and MAE values for the modeled capacitance matrices compared to the respective extracted matrices are reported in Table 4.5. Again, all numbers are reported as a percentage of the corresponding Cn,prev value to better relate the results for different TSV technologies. The table reveals that the errors of the presented model do not exceed 14.6%, while the previous model shows errors as high as 52.9% of Cn,prev.
Table 4.5 Normalized MAE and RMSE values of the proposed and the previous TSV capacitance models for full capacitance matrices and a transmission of random bit-patterns (all values in %)

  rtsv [μm]                  1.0   1.0   1.0   1.0   2.0   2.0   2.0   2.0
  dmin [μm]                  4.0   4.0   4.5   4.5   8.0   8.0   8.5   8.5
  fs   [GHz]                 6.0  11.0   6.0  11.0   6.0  11.0   6.0  11.0
  Proposed model, 5×5 MAE   14.5  11.6  14.6  11.6  12.5  10.6  12.7  10.8
                      RMSE   3.5   2.5   3.6   2.7   2.6   2.1   3.0   2.3
  Proposed model, 7×7 MAE   13.2  10.1  13.4  10.4  10.4   8.9  10.3   9.0
                      RMSE   2.2   1.7   2.2   1.7   1.6   1.3   1.7   1.3
  Previous model, 5×5 MAE   47.9  41.1  50.0  43.8  44.3  39.5  44.2  41.0
                      RMSE  12.1   9.9  12.8  10.5  10.9   9.2  10.8   9.5
  Previous model, 7×7 MAE   52.9  46.4  51.4  46.4  49.3  43.2  45.8  43.2
                      RMSE   7.7   6.4   8.1   6.6   7.1   5.8   6.8   5.9
Only for some third-order-adjacent TSV pairs at the array edges does the proposed model result in errors higher than 10%. Furthermore, using only a single parameter to model all coupling capacitances between diagonally adjacent TSVs—even if both TSVs are located at an edge—results in some noticeable model errors. Nevertheless, the RMSE value for the proposed model does not exceed 3.6% in all cases. In contrast, the previously used capacitance model results in RMSE values that are 3.5× to 4.5× higher than the ones of the proposed model. For the 7 × 7 arrays, the presented approach models the capacitance matrices with RMSE values that are even smaller (always below 2.3%), which proves the scalability of the model. The model errors for larger array dimensions are lower due to the decreased ratio of edge TSVs to middle TSVs, resulting in a decreased impact of the neglected coupling effects at the edges. The error values for the previously used model also decrease with a decreased ratio of edge TSVs (the maximum RMSE for the 7 × 7 arrays is equal to 8.1%). Furthermore, as expected from the previous subsection, the accuracy of both models slightly increases for the less aggressively scaled TSV radius of 2 μm. The increased errors of the traditional model in the previous analysis, compared to the proposed one, are only due to the non-considered edge effects, as the transmission of random bits over the TSVs was considered. For scenarios in which the bit probabilities are not equally distributed, the MOS effect results in further heterogeneity in the capacitance matrices, which is not captured by the previous capacitance model. Hence, as a second analysis, the accuracy of the models is quantified for three scenarios in which 1 and 0 bits are not equally distributed. Through the previous analysis, it was already proven that the relative model errors increase with a further scaled TSV radius and smaller array dimensions.
Thus, in this analysis, only the 5×5 arrays with the smaller TSV radius of 1 μm are considered to report worst-case error values for the models. The first investigated scenario represents the transmission of random data words that are one-hot encoded.3 One-hot-encoded data words result in a 1-bit probability of 1/25 for all TSVs of a 5×5 array. The second analyzed scenario is equal to the first, but with inverting TSV drivers, resulting in swapped logical 1-bit and 0-bit probabilities for the TSVs. Hence, all 1-bit probabilities are equal to 24/25. In the last analyzed scenario, only every second line has an inverting driver. Consequently, in the third scenario, the 1-bit probabilities of all TSVs with an odd-numbered index i are equal to 1/25, while the 1-bit probabilities of all TSVs with an even-numbered index i are 24/25. The last scenario is less common in practice than the others, but it quantifies the error of using the mean of only two 1-bit probabilities as the feature variable for fitting in case of huge variations in the 1-bit probabilities. The resulting normalized RMSE and MAE values of the two capacitance models, compared to full parasitic extractions, are reported in Table 4.6.

Table 4.6 Normalized MAE and RMSE values of the TSV capacitance models for full capacitance matrices and the transmission of one-hot-encoded data (5×5 arrays, rtsv = 1.0 μm; all error values in %)

  Data                           dmin [μm]  fs [GHz]  Proposed MAE  RMSE  Previous MAE  RMSE
  One-hot encoded                4.0         6.0      13.7          4.2   67.9          16.6
                                 4.0        11.0      10.6          2.7   54.9          12.7
                                 4.5         6.0      13.8          3.3   69.7          17.2
                                 4.5        11.0      11.0          2.5   53.4          12.4
  Inverted one-hot encoded       4.0         6.0      14.6          4.0   37.2          10.3
                                 4.0        11.0      11.0          3.3   39.9           9.6
                                 4.5         6.0      17.0          5.0   41.8          11.4
                                 4.5        11.0      11.6          3.5   41.6          10.1
  Half-inverted one-hot encoded  4.0         6.0      20.5          5.5   51.0          12.9
                                 4.0        11.0      14.9          3.9   47.5          11.0
                                 4.5         6.0      22.8          6.3   54.4          13.8
                                 4.5        11.0      13.0          3.8   47.7          11.2

For the two more conventional scenarios, the proposed model results in error values of similar magnitude as the ones reported above for random data. In contrast, the error values of the previous model increase significantly due to the neglected MOS effect. The normalized MAE and RMSE of the previous model reach values as high as 69.7% and 17.2%, respectively. Consequently, the proposed model results in MAE and RMSE values that are more than 5× smaller than the ones obtained for the previous model. For the third, less common scenario, the errors of the proposed model slightly increase due to the applied averaging of the two bit probabilities for the MOS-effect
3 A more in-depth explanation of the properties of one-hot encoded data can be found in the later Sect. 6.1.3 on Page 118.
modeling. Here, the proposed model shows MAE and RMSE values of up to 22.8% and 6.3%, respectively. Thus, the RMSE value—being the more important metric for power and performance estimation—is still tolerable. Moreover, the proposed model still results in an improvement in the MAE and RMSE values by a factor of 2.2× to 3.7× compared to the previous model. This proves a substantial superiority of the proposed capacitance model for a wide range of scenarios.
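The effect of averaging the two 1-bit probabilities into the feature variable p̄i,j can be made concrete with a toy sketch (our own illustration of the three scenarios, not code from the book): in the half-inverted scenario, an (odd, even) TSV pair averages to p̄ = 0.5, the same feature value as random data, although the individual probabilities are extreme; this is exactly why the model error grows in that scenario.

```python
# 1-bit probabilities of the three analyzed scenarios for a 5x5 array
# (TSV indices i = 1..25), and the averaged feature variable used for
# fitting the coupling capacitance between TSVs i and j.

def p_one(i, scenario):
    if scenario == "one-hot":
        return 1 / 25
    if scenario == "inverted":
        return 24 / 25
    # half-inverted: only every second line has an inverting driver
    return 1 / 25 if i % 2 == 1 else 24 / 25

def p_mean(i, j, scenario):
    return (p_one(i, scenario) + p_one(j, scenario)) / 2

# An (odd, even) pair in the half-inverted scenario averages to 0.5,
# hiding the extreme individual probabilities 1/25 and 24/25:
assert abs(p_mean(1, 2, "half-inverted") - 0.5) < 1e-12
```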
4.4.2 Accuracy for the Estimation of the TSV Power Consumption and Performance

After the accuracy of the presented and the previous capacitance model has been quantified in the previous subsection, the accuracy of the capacitance models for the estimation of the TSV power consumption and performance is analyzed. For this purpose, a Spectre circuit simulation of the general setup presented in the evaluation section of the previous chapter (Sect. 3.4) is performed. A 4 × 5 array over which 10,000 random 20-bit patterns are transmitted is analyzed. This particular array shape is chosen to additionally prove that the proposed model remains valid for non-quadratic arrays, which have not been considered yet. The investigated transmission of random data/bit-patterns results in equally distributed 1 and 0 bits on the TSVs. This significantly reduces the errors of the previously used capacitance model, which neglects the MOS effect. Edge effects—neglected as well by the previous capacitance model—decrease with an increasing significant frequency, as shown before. Thus, to report optimistic values for the previous model, the significant frequency is increased from 6 to 10 GHz compared to the setup from Sect. 3.4. The increased significant frequency is obtained through stronger TSV drivers: the strength of the 22-nm drivers in the circuit-simulation setup is increased by doubling the W/Lmin of each transistor compared to the previous chapter. Again, the TSV array is modeled in the circuit simulations by a 3π-RLC equivalent circuit extracted with the Q3D Extractor. The same TSV radius, minimum pitch, and length as in Sect. 3.4 are considered (i.e., rtsv = 1 μm, dmin = 4 μm, and ltsv = 50 μm). Thus, only the array dimensions and the significant frequency are changed in the 3D model used for parasitic extraction. The rest of the circuit-simulation setup is kept the same as in the evaluation part of the previous chapter.
Three different capacitance matrices are investigated for the high-level estimation of the power consumption and performance using the formulas presented in Chap. 3. The first is the exact capacitance matrix from the Q3D Extractor; the second is the capacitance matrix according to the proposed model; and the third is the capacitance matrix according to the previously used model. An effective TSV load capacitance of about 2 fF is added to the diagonal/self-capacitance entries of each of the three capacitance matrices to account for the drivers. Furthermore, the driver-dependent parameters for the delay/performance estimation with Eq. (3.50) (i.e., R and TD,0) have to be fitted again due to the modified drivers.

In contrast to the circuit simulations, which take several hours, the high-level estimations based on the derived formulas are performed within a few seconds. Using the more complex proposed capacitance model instead of the previous model does not significantly impact the run time (i.e., the run time for creating the capacitance matrices, determining the bit-level properties, and finally applying the formulas), as the bit-level analysis of the transmitted patterns dominates the run time either way, not the generation of the capacitance matrix with the models.

The power consumption and performance of a middle TSV, a single-edge TSV, and a corner TSV are investigated. Moreover, the power consumption and performance of the whole array are analyzed. For all cases, the high-level estimates are compared with the results from the circuit simulations. The results are reported in Table 4.7.

Table 4.7 Power consumption, P, and maximum propagation delay, T̂pd, of the TSVs in a 4 × 5 array for the transmission of random data according to circuit simulations and high-level estimates. The percentage errors relative to the circuit-simulation results are given in parentheses for the high-level estimates

                             Circuit sim.  Exact capacitances  Proposed model   Previous model
  Middle TSV      P [μW]     6.40          6.39 (−0.2%)        6.52 (+1.8%)     6.19 (−3.4%)
                  T̂pd [ps]   117.0         116.1 (−0.8%)       115.9 (−0.9%)    113.7 (−2.8%)
  Edge TSV        P [μW]     5.85          5.83 (−0.3%)        5.68 (−2.9%)     4.36 (−25.5%)
                  T̂pd [ps]   109.5         108.7 (−0.7%)       104.8 (−4.2%)    82.04 (−25.1%)
  Corner TSV      P [μW]     5.15          5.04 (−2.1%)        4.65 (−9.7%)     2.99 (−42.0%)
                  T̂pd [ps]   98.8          87.5 (−11.5%)       79.8 (−19.3%)    53.2 (−46.1%)
  Complete array  P [μW]     117.53        116.82 (−0.5%)      114.51 (−2.6%)   92.7 (−21.1%)
                  T̂pd [ps]   117.0         116.1 (−0.8%)       115.9 (−0.9%)    113.7 (−2.8%)

The results show that using the previous capacitance model leads to an underestimation of the power consumption of the TSV array by 21.1%. In contrast, a high-level estimation using the proposed capacitance model results in a precise power estimate, as the error is as low as 2.6%. The significant error of the previous model makes the use of the proposed model inevitable for precise power estimation. While both the proposed and the previous capacitance model allow for a precise estimation of the power consumption of the middle TSVs (error below 3.5%), the previous model results in a large underestimation of the power consumption of the TSVs located at the edges (25.5–42.0%). According to the estimates obtained with the previous model, a TSV located at a single edge of the array exhibits a 29.6% lower power consumption than a middle TSV.
For a corner TSV, the previous capacitance model even indicates a 51.7% lower power consumption. However,
in reality, a TSV at a single edge and a TSV at a corner only exhibit power consumptions that are 8.6% and 19.5% lower than for a middle TSV, respectively. The proposed model accurately estimates the power consumption of the middle and the edge TSVs. Only for the corner TSVs does the proposed model result in a non-negligible misprediction of 9.7%. This relatively large error is mainly due to the neglected third-order coupling capacitances. However, since an array arrangement always contains only four corner TSVs, the power consumption of the whole array is still estimated precisely with the proposed model. The maximum propagation delay occurs for the middle TSVs in an array arrangement, as they have a larger accumulated capacitance. For middle TSVs, the previous and the proposed capacitance model differ only slightly in this analysis due to the balanced bit probabilities on each TSV. Consequently, both capacitance models result in a relatively accurate estimation of the maximum performance of the TSV array (error below 3%). However, note that the previous capacitance model underestimates the maximum propagation delay of an edge TSV by 25.1%. In contrast, the proposed model underestimates this quantity by only 4.2%. According to the estimates obtained with the previous capacitance model, the maximum delay of an edge TSV is 27.8% lower than for a middle TSV. However, the true maximum delay of an edge TSV is only 6.4% lower. This fact is not essential for the performance estimation in case of a transmission of random data, as the performance is defined as the reciprocal of the maximum propagation delay of any TSV in the array (i.e., T̂pd). However, it becomes essential for the derivation of optimization techniques, since the proposed model reveals that the middle-TSV as well as the edge-TSV performance must be optimized to increase the overall performance by more than about 10%.
For corner TSVs, the proposed and the previously used capacitance model underestimate the maximum propagation delay by 19.3% and 46.1%, respectively. These larger errors for both models also arise because the driver-related coefficients in the formula for the propagation-delay estimation have been fitted to be most accurate for larger effective capacitance values. However, since the maximum propagation delay of a corner TSV is significantly lower than the value for a middle or an edge TSV, this limitation of the proposed approach can typically be tolerated for performance estimation.
4.5 Conclusion

In this chapter, an abstract, universally valid, and yet scalable model for the capacitances of TSV-array arrangements was presented. The model extends the previously used model such that the edge effects, as well as the MOS effect, are considered abstractly. An in-depth evaluation showed that the model is highly accurate compared with precise, but computationally expensive, full parasitic extractions (overall NRMSE below 5%). Furthermore, the proposed model allows us to precisely estimate the power consumption and performance of TSV arrays on high abstraction levels
(error below 3%). In contrast, a high-level estimation using the previously used capacitance model showed an underestimation of the TSV power consumption by over 21%. The proposed capacitance model enables systematic derivation of universally valid optimization approaches in Part IV. Furthermore, the strong heterogeneity in the capacitances of edge and middle TSVs, revealed by the proposed model, can be exploited to improve the TSV power consumption and performance effectively. Another important finding of this chapter is that the capacitance values can be optimized through the 1-bit probabilities, which can be affected by data-encoding techniques. Thus, exploiting the TSV edge effects and the TSV MOS effect has large potential to improve the efficiency of optimization techniques.
Part III
System Modeling
Chapter 5
Models for Application Traffic and 3D-NoC Simulation
In the previous chapters, high-level models for 3D interconnects were introduced. They enable power and performance estimation of the TSVs and metal wires that make up 3D interconnects. The models are universally applicable to any stacked 3D-IC. In this and the subsequent chapters, we will show their advantage for architectural optimization by using the models to build and optimize 3D networks on chips (NoCs) for heterogeneous 3D-stacked ICs. Therefore, the models must be made available at the system level to improve the power-performance-area (PPA) figures of the whole NoC. To improve the PPA figures of NoCs, optimization methods and simulation tools are among the most important pieces of equipment for engineers and researchers. Throughout this book, we will introduce such methods and tools for 3D NoCs. In this part, we focus on system modeling to derive a design and simulation framework called Ratatoskr in Chap. 7. We will introduce the models for application traffic and 3D-NoC simulation in this chapter. This book pursues a rigorous approach, as it starts with a definition of system models. This method has several advantages. First, it clearly shows the features and limitations of the simulations. Second, it enables dataflow modeling of SoC applications. Third, it facilitates the integration of low-level models (such as those defined before) at the system level by combining the two levels of abstraction in the models. Like the previous chapters, which modeled the physical foundations before deriving practical tools and methods, we will use the simulation models from this chapter to derive our tool Ratatoskr in Chap. 7. The modeling and design philosophy followed by this book is shown in Fig. 5.1. The models are introduced in this chapter. Our tool Ratatoskr for design optimization is proposed later, in Chap. 7, along with the design process, i.e., how to use it. In the last parts of the book on link-level, architectural-level, and system-level optimization, we will use the proposed design process with Ratatoskr to generate and analyze the results.
Fig. 5.1 Models, tools, processes, and analysis for design space exploration using simulations: Models (definition and limits of the design space) → Tools (e.g., implementation of a simulator) → Processes (definition of methodology and tool flow) → Analysis (evaluation of results and validation)
The two most popular existing NoC simulators are Noxim [43] and Booksim 2.0 [121]. Both simulators target conventional NoCs not implemented in heterogeneous 3D technologies. We require a novel approach to tackle the following challenges:

• Consideration of technology properties: It is insufficient to model only the router and the network architecture; technology properties must be considered as well. With a change of the technology node, the routers' maximum achievable clock frequency differs, and the simulation must support non-purely synchronous communication. Furthermore, the area footprint of routers and components varies with the technology node.
• Irregular network topologies: As a consequence of heterogeneous technology, the topology of the NoC might not follow the usual patterns such as a 2D mesh. Hence, the simulator must model any irregular network topology.
• Benchmarks and traffic generation: Typical applications (i.e., benchmarks) in heterogeneous 3D SoCs yield distinctive traffic patterns, since they typically combine sensing and processing. The simulator must support an application model with sophisticated capabilities that allow changing timing and bandwidth based on a statistical understanding of this class of applications.

The remainder of this chapter is structured as follows. In Sect. 5.1, we give a short overview of our modeling approach. In Sect. 5.2, we introduce the application model and, in Sect. 5.3, the 3D-NoC model. The models are connected with the interfaces in Sect. 5.4. Finally, the chapter is concluded.
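To make the second requirement concrete, an irregular topology can be captured as a plain attributed graph instead of a fixed mesh. The snippet below is a minimal, hypothetical data layout (not Ratatoskr's actual API) in which each router carries per-layer technology attributes such as its maximum clock frequency.

```python
# Hypothetical attributed-graph topology: router names, layer assignments,
# and clock frequencies are illustrative placeholders.
routers = {
    "r0": {"layer": 0, "f_clk_ghz": 1.5},  # e.g., digital-logic layer
    "r1": {"layer": 0, "f_clk_ghz": 1.5},
    "r2": {"layer": 1, "f_clk_ghz": 0.6},  # e.g., mixed-signal/sensing layer
}
# Arbitrary adjacency list -- no 2D/3D-mesh regularity is assumed;
# the r0-r2 link is a vertical (TSV-based) connection.
links = [("r0", "r1"), ("r0", "r2")]

def neighbors(r):
    """All routers directly linked to router r."""
    return [b if a == r else a for a, b in links if r in (a, b)]
```

Because routers in different layers run at different maximum clock frequencies, a cycle-accurate simulation over such a graph must support non-purely synchronous communication across the r0-r2 link.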
5.1 Overview of the Modeling Approach

The proposed models must be accurate in describing system parameters, precise in measuring design metrics, adjustable in covering different use cases, and versatile in simulating different architectures. We achieve this by the following five basic approaches:

• We decompose models into functional components that represent different parts of the system. This separation of units allows identifying influencing parameters and determining where their effects are located.

• We model components at the most abstract level possible that is still accurate. This abstraction has a positive impact on simulation speed. Furthermore, well-defined abstraction isolates the essential design parameters of components.
[Fig. 5.2 Simulation model: the application model (tasks exchanging application data; transaction-level model) connects to the hardware model (processing element → network interface → router → link; cycle-accurate model). Packets pass between the network interfaces, and flits between routers and links.]
• We cover multiple levels of abstraction in the simulation simultaneously, e.g., a co-simulation of a faster and less accurate application model with a slower and more precise NoC model.

• We define interfaces that encapsulate the interaction between components and carry the relevant data passed between them.

• We define parameter sets that encapsulate influencing parameters and enable flexible modeling of each component.

The proposed models for 3D NoCs in heterogeneous technologies comprise two parts, an application model and a hardware model, as shown in Fig. 5.2. The application model generates network traffic modeled on the transaction level. The hardware model consists of processing elements (PEs), network interfaces (NIs), routers, and links, i.e., all parts of an NoC, modeled cycle-accurately. Running the models estimates network performance and energy consumption. We provide parameter sets that define the variable properties of the hardware models. Their definition is delicate since it limits the expressiveness of the model, and it must account for all relevant use cases and possible architectures. The interfaces connect components in the model as shown in Fig. 5.2 and transmit data between tasks in the application model. These application data are converted to network packets for injection into the NoC. Packets are split into flits for transmission between routers. We provide a detailed definition of the interfaces in Sect. 5.4.
5.2 Application Traffic Model

The application model must capture the typical use cases of heterogeneous 3D SoCs, which mostly combine sensing and processing. Our model has three unique properties not found in other application simulators for NoCs: First, the model includes
the timing of data processing. Processing times can vary depending on technology, data, and processor type. For example, analog-digital conversion has a different timing than digital data processing, and both are present side by side in heterogeneous 3D SoCs. Second, there are dynamic effects of changing input data from sensors. For abstraction, the expected behavior for inputs is modeled rather than the actual control and data flows per input; we model the statistical properties of control and data flow. Third, the dynamic energy consumption of links and routers, which depends on the properties of the transmitted data, is estimated during simulation. For simulation performance, we rely on statistical modeling, in which data are classified by similar statistical properties and annotated with these properties, instead of modeling each data item individually.

Petri nets are an application model with well-known properties. Data are represented as tokens and are passed between places via transitions. Conventional Petri nets are not sufficient in our case because they do not express the three properties mentioned above. Therefore, we extend them as described in the following.

1. Timing: Places (i.e., processors) have a delay that models the calculation times of an application. In other words, places are invalid after consuming a token and validate themselves again later. We use the concept of retention time on places introduced by Popova and Zeugmann [202].

2. Statistical communication selection: Tokens can be transmitted via a set of enabled transitions, of which a subset is selected at random. This randomness models applications with non-deterministic behavior originating from varying input data, e.g., a changing environment. We modify the model of statistical Petri nets as proposed by Haas [96].

3. Annotation with data classes: Tokens must be annotated with data classes to compute the dynamic energy consumption of links and routers.
The details of this power model are explained later in this book. It shows that a link's energy consumption depends on the transmitted data's statistical properties and the data coding method. To represent these properties, we annotate application data with colors, leading to colored Petri nets. Each color represents one data type, denoted by ψ ∈ Ψ. Bringing the three properties together, we propose stochastic, colored Petri nets with retention time on places. We define the model based on a conventional Petri net given by:

Definition 6 (Petri Net) A Petri net N = (ρ, τ, F, υ, m0) is a tuple such that:

• The finite sets ρ and τ are called places and transitions, with ρ ∩ τ = ∅ and ρ ∪ τ ≠ ∅.

• The relation F ⊆ (ρ × τ) ∪ (τ × ρ) is called flow relation and consists of arcs. Arcs connect places and transitions.

• The weight function υ : F → ℕ \ {0} assigns a weight to each arc.

• The initial marking m0 : ρ → ℕ assigns a number of initial marks (tokens) to each place.

A Petri net is shown in Fig. 5.3. There are two places p1 and p2, holding 3 and 2 tokens, respectively. Data are sent between the two places and from the first place to itself.
[Fig. 5.3 Petri net with two places p1 and p2]

[Fig. 5.4 Petri net with retention time for places: p1 : [4, 7], p2 : [2, 3]]
The distribution of tokens in the Petri net is given by a marking:

Definition 7 (Marking) Let N be a Petri net with places ρ. The function m : ρ → ℕ is a marking. The set of all markings is called G. An element m ∈ G is given by m = (m(p1), ..., m(p|ρ|)), i.e., by the number of tokens m(p) in each place p ∈ ρ, where |ρ| is the number of places.

Tokens are transmitted via enabled transitions:

Definition 8 (Enabled) Let N be a Petri net and let m be a marking in N. A transition t ∈ τ is enabled if every pre-place holds at least as many tokens as the corresponding arc weight in the marking m, i.e., m(p) ≥ υ(p, t) for all p ∈ •t. The set of pre-places of a transition t is denoted by •t = {p ∈ ρ : (p, t) ∈ F}. The set of enabled transitions for a given marking m ∈ G is τ(m) = {t ∈ τ : m(p) ≥ υ(p, t) for all p ∈ •t}. The set of markings with a transition t enabled is G(t) = {m ∈ G : t ∈ τ(m)}. In a dual manner, t• = {p ∈ ρ : (t, p) ∈ F} denotes the set of post-places of the transition t.

There are many definitions of Petri nets with timing annotation in the literature. As we model the calculation times of tasks, places must be invalidated for a while, even though they have already consumed enough tokens. This timing behavior is modeled by a duration of invalidity for places. An exemplary net is shown in Fig. 5.4, in which the two places have the retention time intervals [4, 7] and [2, 3].

Definition 9 (Petri Net with Retention Time on Places) The tuple I = (N, I) is called Petri net with retention time on places. The set N = (ρ, τ, F, υ, m0) is a Petri net, cp. Definition 6. The function I assigns a retention time interval to each place:

I : ρ → Q0⁺ × (Q0⁺ ∪ {∞})   (5.1)

It must further hold that the bounds of the retention time interval are well-ordered (ascending), i.e., for all places p ∈ ρ with I(p) = (lp, up): lp ≤ up.¹

¹ It is also possible to use ℕ0 × (ℕ0 ∪ {∞}) as the target set; however, it is sufficiently precise to use ℕ [231].
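Definitions 6–9 can be sketched as a small data structure. The following Python snippet is our own illustration, not the book's simulator code (class and method names are assumptions); it builds the net of Fig. 5.4 and checks whether a transition is enabled under Definition 8:

```python
# Sketch of Definitions 6-9: a Petri net with retention time on places.
# Class and method names are illustrative, not the book's implementation.

class PetriNet:
    def __init__(self, places, transitions, arcs, weights, m0, retention):
        self.places = places            # rho
        self.transitions = transitions  # tau
        self.arcs = arcs                # F: arcs (place, t) and (t, place)
        self.w = weights                # upsilon: arc -> weight in N \ {0}
        self.m = dict(m0)               # marking m: place -> token count
        self.retention = retention      # I: place -> (l_p, u_p)
        self.ready_at = {p: 0.0 for p in places}  # place valid again at t

    def pre_places(self, t):
        # Pre-places of t: {p in rho : (p, t) in F}
        return {p for p in self.places if (p, t) in self.arcs}

    def enabled(self, t, now=0.0):
        # Definition 8 plus retention: every pre-place must hold at least
        # upsilon(p, t) tokens and must no longer be invalidated.
        return all(self.m[p] >= self.w[(p, t)] and now >= self.ready_at[p]
                   for p in self.pre_places(t))

# The net of Fig. 5.4: p1 (3 tokens, retention [4, 7]), p2 (2 tokens, [2, 3]).
net = PetriNet(
    places={"p1", "p2"},
    transitions={"t_self", "t_12"},
    arcs={("p1", "t_self"), ("t_self", "p1"), ("p1", "t_12"), ("t_12", "p2")},
    weights={("p1", "t_self"): 1, ("p1", "t_12"): 1},
    m0={"p1": 3, "p2": 2},
    retention={"p1": (4, 7), "p2": (2, 3)},
)
print(net.enabled("t_12"))  # True: p1 holds 3 >= 1 tokens
```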
The retention time influences the enabling of transitions. Places are only allowed to fire after their retention time is over. We refer to [202], Definitions 6–11, for a complete definition of this behavior.

Petri nets that select transitions non-deterministically are called stochastic Petri nets [96]. Each transition has a firing probability:

Definition 10 (Firing Probability Function) Consider a transition set τ* ⊆ τ. The probability of reaching a new marking m′ from an old marking m via the transition set τ* is p(m′, m, τ*). Further properties of p are:

• For each marking m ∈ G and each enabled transition set τ* ⊆ τ(m), the function p(·, m, τ*) is a probability mass function on G: Σm′∈G p(m′, m, τ*) = 1.

• The function p(m′, m, τ*) may only be positive if m = (m1, ..., m|ρ|) and m′ = (m′1, ..., m′|ρ|) satisfy for all j ∈ [1, ..., |ρ|]:

mj − Σt*∈τ* υ(dj, t*) · 1•t*(dj) ≤ m′j ≤ mj + Σt*∈τ* υ(t*, dj) · 1t*•(dj).   (5.2)
Here, 1S is the indicator function of a set S. The sum Σt*∈τ* 1•t*(dj) is the number of transitions in τ* with the input place dj, and Σt*∈τ* 1t*•(dj) is the number of transitions in τ* that terminate at the output place dj. Multiplication with the arc weights υ ensures that the token counts increase and decrease as intended.

We extend our Petri net with timing annotation from Definition 9 using the firing probability function:

Definition 11 (Stochastic Petri Net with Retention Time on Places) The tuple P = (S, I) is called stochastic Petri net with retention time on places. The set S = (ρ, τ, F, υ, m0, p(·, m, τ*)) is a stochastic Petri net. The function I assigns a retention time interval to each place (cp. Definition 9).

An example of such a Petri net is shown in Fig. 5.5. Two transitions are selected non-deterministically and are taken with probabilities p̂ and 1 − p̂. We refer to [96], Definition 1.5, for a definition of deterministic transitions.

[Fig. 5.5 Stochastic Petri net with retention time for places: p1 : [4, 7], p2 : [2, 3]; transition probabilities p̂, 1 − p̂, and 1]

We use colors to annotate tokens with the data-type classes.

Definition 12 (Colored Stochastic Petri Net with Retention Time on Places) The tuple A = (P, Ψ, C) is called colored stochastic Petri net with retention time on places. The set P = (ρ, τ, F, υ, m0, p(·, m, τ*), I) is a stochastic Petri net with retention time on places. The set Ψ is the set of colors. The function C assigns a color to each place: C : ρ → Ψ.

[Fig. 5.6 Colored stochastic Petri net with retention time for places: p1 : [4, 7], p2 : [2, 3]; transition probabilities p̂, 1 − p̂, and 1]

An example of a colored stochastic Petri net with retention time on places is shown in Fig. 5.6, which gives an illustrative overview of the model. There are two places, depicted as blue circles. The retention time is given for both places; place p1 has a retention time between 4 s and 7 s and place p2 between 2 s and 3 s. Both places hold different tokens. The tokens are of varying types, i.e., colors, illustrated by red rectangles and purple circles. (We vary both color and shape for improved accessibility.) Transitions are shown as gray rectangles. The probability of transmitting tokens from place p1 back to place p1 is p̂.² The probability of transmitting tokens from place p1 to place p2 is 1 − p̂. Tokens from place p2 are always transmitted to place p1 (i.e., p(·) = 1).
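The firing behavior of this example can be sketched in a few lines. Only the probabilities p̂ and 1 − p̂ come from the model; the function name, the token/color representation, and the seed below are illustrative assumptions:

```python
import random

# Sketch of the firing behavior of the Fig. 5.6 example net. A token leaving
# p1 returns to p1 with probability p_hat and moves to p2 with probability
# 1 - p_hat; colors (Definition 12) tag tokens with their data class.

def fire_from_p1(marking, colors, p_hat, rng):
    """Move one colored token out of p1 per the firing probabilities."""
    if marking["p1"] == 0:
        return None
    token = colors["p1"].pop()          # token annotated with its color psi
    marking["p1"] -= 1
    dest = "p1" if rng.random() < p_hat else "p2"
    marking[dest] += 1
    colors[dest].append(token)
    return dest

rng = random.Random(42)
marking = {"p1": 3, "p2": 2}
colors = {"p1": ["red", "red", "purple"], "p2": ["purple", "red"]}
fire_from_p1(marking, colors, p_hat=0.5, rng=rng)
print(marking["p1"] + marking["p2"])  # 5: the total token count is conserved
```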
5.3 Simulation Model of 3D NoCs

In this section, we define the simulation model through all components of an NoC. They are given by the hardware and include PEs, NIs, routers, and links (see Fig. 5.2).

Definition 13 (Processing Element) The PE model is an abstract representation of any type of hardware, with a focus on traffic generation. Processing elements typically represent sensors, actuators, processors, memory, and hardware accelerators.

Function: Processing elements inject traffic from tasks into the network. The model comprises a static mapping of tasks to PEs; dynamic remapping is not modeled, but such dynamic properties can be approximated using the stochastic property of the application model.

Parameter sets: The number of PEs o is adjusted by defining their set [o]. The mapping function M : ρ → [o] maps places in the application to PEs.

Definition 14 (Network Interface) The NI serializes payload data into network packets, which consist of flits, when sending, and deserializes the network packets upon receipt. There is one NI for each PE.
² Transmission from a task to itself does not make sense for actual applications but can be modeled.
Function: Network interfaces are modeled as message queues with infinite depth. Methodologically, they realize the transaction-level model (TLM) interconnect between the application model and the hardware model in the simulation.

Parameter sets: The NI divides the payload of a packet into flits using the link's bit width b ∈ ℕ. It adds a header flit of bit size h ∈ ℕ and, if used, marks the last flit as such (even though it is called a tail flit in the implementation, it contains valid payload data). We model the NI cycle-accurately. A 3D NoC in heterogeneous technologies is not purely synchronous, since components in different layers run at different clock speeds. Accordingly, we model the NIs with a layer-wise clock frequency, given by a parameter clkNI that maps each layer to a frequency in ℝ>0.

Definition 15 (Router) Routers send packets through the network following a routing function. We model input-buffered routers, which were the most common in industry and academia at the time of writing this book. Based on the destination address provided by the flits, all possible routes are calculated, one of them is selected, and a virtual channel (VC) is allocated.

Function: Routers are modeled cycle-accurately as a state machine with multiple pipeline stages per input channel. The states are idle, routing calculation (RC), VC calculation (VC), switch arbitration (SA), and switch traversal (ST): state = {idle, RC, VC, SA, ST}, which are assigned to each input port via [m] × in → state, in which m is the number of routers and in the set of input ports. A transition relation δ ⊂ state × Flits × state models the router's behavior. The function reports power-relevant events. The selection and number of transitions in the state machine that can be taken per clock cycle define different pipeline depths.

Parameter sets: The number of routers is adjusted via m, with their set being called [m].
The network topology is defined as a network graph, in which vertices represent routers and edges represent links. The set of all possible directed edges is given by EN = {(k, l) | k, l ∈ [m]}. The network graph is N = ([m], EN) (of course, not every router pair can be connected in a real system due to wire-length and wire-count constraints). The graph is spatially embedded via the positions of the routers, which map each router to a coordinate in ℝ²>0 and a layer. Input and output ports of routers are given by in : [m] → P(EN) and out : [m] → P(EN), in which P(EN) denotes the power set of EN.³ All links connected to the input port of a router r are given by in(r) = EN,r = {(k, l) | l = r, (k, l) ∈ EN} ⊂ EN. The number of input ports of this router is |in(r)|. The definition of the output ports is analogous. Routers are modeled cycle-accurately using one clock frequency per layer, set by clkrouter. The number of VCs can be set per router, v : [m] → ℕ, and the buffer depth can be set per port, d : [m] × in → ℕ. This allows modeling asymmetric router buffer depths, cf. Chap. 12.
³ The power set of a set A denotes the set of all subsets of A.
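The port definitions above translate directly into set comprehensions. The following sketch computes in(r) and out(r) for a small graph; the router indices and example edge set are hypothetical:

```python
# Sketch of the network graph of Definition 15: vertices are router indices,
# directed edges are links; in(r)/out(r) are the edge sets ending/starting
# at router r.

def in_ports(edges, r):
    """in(r) = {(k, l) in E_N : l = r}."""
    return {(k, l) for (k, l) in edges if l == r}

def out_ports(edges, r):
    """out(r) = {(k, l) in E_N : k = r}."""
    return {(k, l) for (k, l) in edges if k == r}

# Four routers; 0<->1 and 2<->3 are intra-layer links, 0<->2 is vertical.
edges = {(0, 1), (1, 0), (0, 2), (2, 0), (2, 3), (3, 2)}
print(len(in_ports(edges, 2)))  # 2 input ports: links from routers 0 and 3
```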
The routing function, the path selection function, and the channel allocation can be set via routing : [m] × Flits → out⁴ and selection : Flit × state × state → v, depending on the state of the corresponding downstream router. The timing behavior of each input port of a router can be set by its delay, in clock cycles, to evaluate the head flit: Δ : [m] × in → ℝ. For example, a router with a three-stage pipeline is modeled by setting Δ := 3.

Definition 16 (Links) Links send data between routers.

Function: Links calculate transmission matrices. These are used to evaluate their dynamic energy consumption, which depends on factors such as the transferred data [86] or the VCs; the latter is shown in Chap. 6.

Parameter sets: The links do not require parameter sets. Their properties are already modeled by the link widths (in the NIs), and their energy consumption is calculated from the data-flow matrices in post-simulation processing.
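The flit serialization described in Definition 14 can be illustrated with a short sketch. The function name and flit labels are our own; the header bit size h affects flit contents only, not the flit count, and is therefore omitted:

```python
from math import ceil

# Sketch of the flit serialization of Definition 14: a packet's payload is
# split into flits of the link bit width b, a header flit is prepended, and
# the last payload-carrying flit is marked 'tail' although it still carries
# valid payload data.

def serialize(payload_bits, b):
    """Split a packet's payload into flits for a link of bit width b."""
    n_payload_flits = ceil(payload_bits / b)
    return ["head"] + ["body"] * (n_payload_flits - 1) + ["tail"]

print(serialize(100, 32))  # ['head', 'body', 'body', 'body', 'tail']
```

A 100-bit payload on a 32-bit link needs ceil(100/32) = 4 payload-carrying flits, the last of which doubles as the tail.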
5.4 Simulator Interfaces

Interfaces are the means of communication between components of the model. The interfaces are defined as follows:

Definition 17 (Application Data) Application data are communicated between the application and hardware models, i.e., between places ρ and PEs. The amount of data is represented by the number of transmitted tokens. Bit-level statistics of the data are kept by annotation with the colors Ψ. The amount of payload data per packet is given by u.

ApplicationData = (ρ × ρ, Ψ, u)   (5.3)
Definition 18 (Packets) Within the network, packets are transmitted between PEs. They contain a source and a destination address from the address space of the PEs ([o] × [o]).

Packets = ([o] × [o], ρ × ρ, Ψ)   (5.4)
It is required to include the source and destination within the application (ρ × ρ), since the inverse mapping function M⁻¹ between task addresses and hardware addresses is not well-defined when multiple tasks are executed on the same PE. The size of packets is limited by the payload per flit times the maximum number of flits.

Definition 19 (Flits) Within the network, flits are transmitted between routers. The type of a flit is denoted by the set Type = {head, body, tail} (remark: the tail flit
⁴ Adaptive routing algorithms additionally require the state of adjacent routers.
will only be used to denote the last flit in the implementation but may contain payload data, as tail-only flits are suboptimal).

Flits = ([o] × [o], ρ × ρ, Type, Ψ)   (5.5)
5.5 Conclusion

In this chapter, we introduced models for application traffic and for the simulation of 3D NoCs in heterogeneous 3D ICs. These models will be used in Chap. 7 for the implementation of the Ratatoskr framework. Before that, we introduce methods to estimate the bit-level statistics of interconnects, so that the high-level models for 3D interconnects from the previous chapters can be used at this level of abstraction. This implementation in a simulator enables precise and fast evaluation of the dynamic power consumption of the NoC interconnects.
Chapter 6
Estimation of the Bit-Level Statistics
Chapter 3 presented high-level formulas to precisely calculate the 3D-interconnect power consumption from the interconnect capacitances and the self-switching as well as the coupling switching. To overcome the need for computationally complex EM simulations for each interconnect structure, a universally valid high-level model to estimate the interconnect capacitances was presented in Chap. 4. The presented TSV capacitance model requires the logical bit probabilities of the transmitted patterns as inputs. In summary, three bit-level characteristics of the transmitted data are thus needed for a precise power estimation at higher abstraction levels: first, the self-switching probabilities of the bits; second, the mean coupling switching between the bits; and third, the logical 1-bit probabilities.

All three bit-level properties can be obtained by means of exact bit-level simulations, which, again, have the drawback of a long run time, especially for large data sets. Bit-level simulations also require sufficient samples of the physically transmitted data, which are often not available at early design stages. Furthermore, bit-level simulations do not enable a systematic derivation and fast evaluation of low-power coding techniques [88]. This, too, can be overcome through high-level models that estimate the bit-level characteristics.

High-level formulas to estimate the bit-level characteristics for the transmission of data streams of various kinds (e.g., DSP, multimedia, compressed) are well known [36, 88, 101, 144, 174, 209, 210, 218] and adopted for a wide range of low-power techniques (e.g., [6, 31, 86, 87, 89, 174, 191, 210, 223]). However, existing formulas are only applicable to estimate the interconnect power consumption for a sequential transmission of a single data stream, and not if multiple data streams are transmitted over the interconnect structure in a time-multiplexed fashion.
Popular interconnect architectures that result in time multiplexing of the transmission channel to transmit multiple data streams interleaved are, among others, buses and virtual-channel-based NoCs. For TSV-based interconnects, data-stream multiplexing is even more likely to be applied due to the large TSV footprint and the limited manufacturing yield. In this chapter, as an example, packet-switching-based NoCs are
analyzed. However, the model and the findings presented throughout this chapter can easily be adopted for other communication architectures that result in a time-multiplexed access to the physical transmission channel.

Virtual channels are a commonly implemented technique to increase NoC performance or to avoid deadlocks, but the resulting data-stream multiplexing can also drastically increase the switching activities compared to NoCs without virtual channels, as shown in this chapter. Since the switching activities are directly proportional to the power consumption, as shown in Chap. 3, the energy requirements are drastically underestimated when existing high-level models are applied.

Data-stream multiplexing also has major implications for one of the most promising low-power approaches: coding. Since the overhead of a coding approach must not cancel out the savings in the links, low-power codes (LPCs) for NoCs are typically implemented in an end-to-end manner [30, 116]. Thus, the data is encoded and decoded in the network interfaces at the source and the destination processing element, respectively, and not on a link level. On an end-to-end basis, however, the effect of virtual channels cannot yet be modeled. Thus, the power savings of most techniques vanish when virtual channels are integrated into the network. Hence, the majority of techniques are explicitly designed for NoCs without virtual channels [116, 187], even though virtual channels are commonly implemented in NoCs. The only technique applicable to virtual-channel-based NoCs is probability coding [86]. However, this method is designed only for the uncommon scenario in which links permanently make use of multiple non-prioritized virtual channels, which is unrealistic. Thus, no optimal coding approach can be identified yet, due to the lack of a pattern-dependent model for the power consumption of links with virtual channels.
This chapter presents an accurate, coding-aware, high-level model to estimate the bit-level characteristics required for a fast yet precise estimation of the link power consumption in the presence of virtual channels. In detail, the existing model is extended so that it can estimate the bit-level characteristics not only for a sequential transmission of data packets/streams over a link (no virtual channels), but also if the flits/patterns of multiple streams are virtually transmitted in parallel, implemented via time-multiplexed access to the single physical channel. While the existing model likely results in the implementation of inefficient coding techniques and in an underestimation of the link power consumption by up to a factor of 4×, the proposed model precisely predicts the power consumption of 2D as well as 3D NoC links (error below 1%), as well as the efficiencies of LPCs. Furthermore, the presented model reveals new possibilities to reduce the link power consumption through coding techniques.

The remainder of this chapter is organized as follows. First, existing models to estimate the bit-level statistics for non-multiplexed data transmissions are reviewed in Sect. 6.1. Afterward, the concept of data-stream multiplexing and its impact on the power consumption are outlined in Sect. 6.2. The proposed model to estimate the bit-level statistics for multiplexed data streams is presented in Sect. 6.3 and evaluated in Sect. 6.4. Finally, the chapter is concluded.
6.1 Existing Approaches to Estimate the Bit-Level Statistics for Single Data Streams

In this section, formulas to estimate the bit-level statistics for a sequential (non-multiplexed) transmission of data streams are reviewed. As shown in the previous chapters, the three bit-level statistics required to estimate the interconnect power consumption are the switching activities, αi, the switching correlations, γi,j, and the bit probabilities, pi. Since the switching correlations are always zero in the case of temporally misaligned signal edges, irrespective of the data stream (see Eq. (3.27) on Page 59), temporally aligned edges are considered throughout this chapter. Due to the wide range of different data types and models to estimate their bit-level statistics (e.g., [36, 85, 86, 88, 101, 143, 144, 209, 218]), this section only covers the data types that are considered throughout this book.
6.1.1 Random Data

An m-bit random data stream implies a sequence of integer data words in the range from 0 to 2^m − 1 (unsigned base-2 representation) that are uncorrelated and uniformly distributed. For the transmission of samples of a random data stream, each possible bit pattern, as well as every possible pattern transition, occurs with the same probability. Consequently, each bit of the binary words has a 1-bit probability of 50%, the switching probability of the bits is 50%, and all bit pairs switch completely uncorrelated. Hence, the required bit-level statistics are:

αi = E{Δbi²} = 0.5;   (6.1)

γi,j = E{Δbi Δbj} = 0;   (6.2)

pi = E{bi} = 0.5.   (6.3)
To validate these formulas, a 16-bit random data stream containing 100,000 samples is generated synthetically, which is subsequently analyzed. Exact bit-level statistics, from bit-level simulations, alongside the values from Eqs. (6.1) to (6.3), are plotted in Fig. 6.1. The results show the correctness of the formulas.
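This validation can be reproduced with a short sketch (here with 50,000 samples, a fixed seed, and tolerances of our choice):

```python
import random

# Empirical check of Eqs. (6.1) and (6.3) on a synthetic 16-bit random
# stream, mirroring the validation described above.
rng = random.Random(0)
m, n = 16, 50_000
words = [rng.getrandbits(m) for _ in range(n)]

def bit(w, i):
    return (w >> i) & 1

for i in range(m):
    ones = sum(bit(w, i) for w in words)
    toggles = sum(bit(a, i) != bit(b, i) for a, b in zip(words, words[1:]))
    p_i = ones / n               # 1-bit probability, expected 0.5
    alpha_i = toggles / (n - 1)  # switching activity, expected 0.5
    assert abs(p_i - 0.5) < 0.02 and abs(alpha_i - 0.5) < 0.02
print("Eqs. (6.1) and (6.3) confirmed within 2% on all bits")
```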
[Fig. 6.1 Bit-level statistics of a stream of purely random data words (i.e., uniform-distributed and uncorrelated) according to bit-level simulations and higher-level estimates: estimated and true αi and pi lie at 0.5, and γi,i+1 at 0, for all bit indices i from 1 to 16.]

6.1.2 Normally Distributed Data

In many scenarios (e.g., DSP applications), the data words tend to be Gaussian/normally distributed. The rationale is the central limit theorem, which states that the normalized sum of mutually independent, zero-mean random variables tends to follow a Gaussian distribution, even if the original variables themselves are not normally distributed [230]. The joint probability density function of the discrete-time mean-free Gaussian processes Xk and Xk−1 (Xk delayed by one sample instance) is mathematically expressed as

fPD(x, y) = 1/(2πσ²√(1 − ρ²)) · e^(−(x² + y² − 2ρxy)/(2σ²(1 − ρ²))),   (6.4)

where Xk and Xk−1 are substituted for x and y, respectively [85]. In this equation, σ is the standard deviation of the mean-free data words:

σ = √(E{Xk²}).   (6.5)

ρ is the word-level correlation of the data words:

ρ = E{Xk · Xk−1}/σ².   (6.6)

Throughout this work, an m-bit Gaussian/normally distributed data stream with a given standard deviation, σ, and correlation, ρ, is a sequence of digitalized data words (signed integer, base-2 representation) that were generated according to the probability density function described by Eq. (6.4). In the following, the model to estimate the required bit-level statistics of such normally distributed data streams, as a function of the word-level statistics σ and ρ, is briefly reviewed. For a more in-depth analysis of the bit-level properties of normally distributed data, please refer to Refs. [36, 85, 101, 143, 144].

After analyzing the switching of the bits in a set of samples from digital signal processor (DSP) signals, Landman and Rabaey concluded that the bit-switching statistics can be represented by a piece-wise-linear model composed of three regions [143, 144]. For each of the three regions, different bit-level properties are assumed in the model. The first region ranges from the least significant bit (LSB)
of the data words to the so-called "LSB break-point", BPlsb, and the second region ranges from the "MSB break-point", BPmsb, to the most significant bit (MSB) of the data words. All bits between BPlsb and BPmsb compose the third, intermediate region. In Ref. [144], the following formulas to calculate the break-points were derived:

BPlsb = log2(σ) + log2(√(1 − ρ²) + |ρ|/8);   (6.7)

BPmsb = log2(3σ).   (6.8)
The bits in the LSB region up to BPlsb tend to be random. Hence, the toggle activity and the 1-bit probability of a bit belonging to this first region are approximately 0.5. Furthermore, the switching correlation of a bit belonging to this region with any other bit (from the same or another region) is 0 in the model. The bits in the MSB region show a strong spatial correlation, which implies that if one of the bits in the MSB region toggles, in nearly all cases, all other bits in the region toggle as well and in the same direction. According to Ref. [36], the toggle activity of the bits in the MSB region can be calculated from the word-level correlation via

αmsb-reg = cos⁻¹(ρ)/π.   (6.9)
For uncorrelated data words (i.e., ρ = 0), αmsb-reg is equal to 0.5. With an increasing (positive) pattern correlation, the switching activity of the bits in the MSB region decreases toward 0. For a decreasing (negative) pattern correlation, the switching activity of the MSBs increases toward 1. The switching correlation, γi,j, between two bits that both belong to the MSB region is also equal to αmsb-reg due to their strong spatial correlation. For all bits in the MSB region, the 1-bit probability is 0.5. For the intermediate region, a continuous linear interpolation between the bit-level statistics of the LSB and MSB regions is assumed. Thus, the model to estimate the bit-level statistics of zero-mean, normally distributed data by means of the word-level statistics is expressed through the following formulas:
αi = 0.5, for i ≤ BPlsb;
αi = 0.5 + (i − BPlsb)(αmsb-reg − 0.5)/(BPmsb − BPlsb), for BPlsb < i < BPmsb;   (6.10)
αi = αmsb-reg, for i ≥ BPmsb.

γi,j = 0, for i or j ≤ BPlsb;
γi,j = (i − BPlsb) αmsb-reg/(BPmsb − BPlsb), for BPlsb < i < BPmsb and j > i;   (6.11)
γi,j = (j − BPlsb) αmsb-reg/(BPmsb − BPlsb), for BPlsb < j < BPmsb and j < i;
γi,j = αmsb-reg, for i and j ≥ BPmsb.

pi = 0.5.   (6.12)
We validate these formulas with five Gaussian-distributed 16-bit data streams, each containing 100,000 samples. The individual data streams have different word-level correlations and standard deviations in the ranges from −0.9 to 0.9 and from 2⁷ to 2¹⁰, respectively. The resulting exact bit-level statistics from bit-level simulations, alongside the estimates from Eqs. (6.10) to (6.12), are plotted in Fig. 6.2. The results show a good agreement between the bit-level simulations and the model estimates. Although the previously reviewed formulas were derived by Landman et al. primarily for Gaussian-distributed data, they are usable for other uni-modal distributions (e.g., Laplace) as well [143].
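Eqs. (6.7)–(6.10) can be sketched in a few lines; the interpolation below is anchored so that the switching activity meets 0.5 at BPlsb and αmsb-reg at BPmsb:

```python
from math import log2, sqrt, acos, pi

# Sketch of the piecewise-linear bit-statistics model, Eqs. (6.7)-(6.10).

def bp_lsb(sigma, rho):
    return log2(sigma) + log2(sqrt(1 - rho**2) + abs(rho) / 8)  # Eq. (6.7)

def bp_msb(sigma):
    return log2(3 * sigma)                                      # Eq. (6.8)

def alpha(i, sigma, rho):
    """Switching activity of bit i for word-level statistics sigma, rho."""
    lo, hi = bp_lsb(sigma, rho), bp_msb(sigma)
    a_msb = acos(rho) / pi                                      # Eq. (6.9)
    if i <= lo:
        return 0.5                # random LSB region
    if i >= hi:
        return a_msb              # strongly correlated MSB region
    return 0.5 + (i - lo) * (a_msb - 0.5) / (hi - lo)  # linear interpolation

sigma, rho = 2**8, 0.9
print(alpha(0, sigma, rho))             # 0.5 (LSB region)
print(round(alpha(15, sigma, rho), 3))  # 0.144 = acos(0.9)/pi (MSB region)
```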
6.1.3 One-Hot-Encoded Data

An m-bit one-hot-encoded data stream implies a sequence of integer numbers in the range from 0 to m − 1, which are uncorrelated and uniformly distributed. The representation of the data is one-hot, meaning that only a single bit is 1 in each of the m possible binary data words (all occurring with the same probability), while all other bits are 0 [100]. The pattern with a 1 on its ith bit represents the integer value i − 1. For example, the four possible data words of a 4-bit one-hot-encoded data stream, “0001”, “0010”, “0100”, and “1000”, represent the integer numbers 0, 1, 2, and 3, respectively. One-hot encoding is often used for states due to its robustness against single flipped bits (the Hamming distance between two valid codewords/states is always two). Furthermore, one-hot encoding is often used in machine-learning applications to encode categorical data. Deriving the bit-level statistics of one-hot-encoded random data is straightforward. Since each of the m patterns occurs with the same probability 1/m, while each bit is only 1 in a single pattern, the 1-bit probability of each bit is

pi = 1/m.    (6.13)
Thus, the probability of a bit being 0 is (m − 1)/m. Hence, the probability of switching from logical 1 to logical 0 in two consecutive clock cycles is (m − 1)/m² for all bits. Since the probability of a 0-to-1 transition is the same as that of a 1-to-0 transition, the switching activity of each bit is

αi = 2(m − 1)/m².    (6.14)
In a one-hot-encoded data stream, two or more bits can never switch in the same direction, only in the opposite direction. Hence, the Δbi Δbj values can only be 0 or −1, implying negative switching correlations for the bit pairs. A switching in the
[Fig. 6.2 shows five panels (different combinations of ρ ∈ [−0.9, 0.9] and σ ∈ [2⁷, 2¹⁰]) plotting the estimated and true values of pi, αi, and γi,i+1 over the bit index i.]
Fig. 6.2 Bit-level statistics of normally distributed data according to bit-level simulations and higher-level estimates
[Fig. 6.3 plots the estimated and true values of pi, αi, and γi,j over the bit index i.]
Fig. 6.3 Bit-level statistics of one-hot-encoded random data streams according to bit-level simulations and higher-level estimates
opposite direction for two bit values bi and bj (i.e., Δbi Δbj = −1) occurs only for two pattern combinations. For an exemplary 4-bit one-hot-encoded data stream, opposite transitions in the first two bits only occur for the two pattern sequences “0001” → “0010” and “0010” → “0001”. Thus, the switching correlation of the bit pairs of an m-bit one-hot-encoded data stream is

γi,j = −2/m².    (6.15)
The derived formulas are validated through an exemplary 8-bit one-hot-encoded data stream made up of 100,000 samples. The exact bit-level statistics from bit-level simulations, alongside the estimates from the derived formulas, are plotted in Fig. 6.3. The results confirm the correctness of the formulas.
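A small Monte-Carlo sketch can reproduce the closed-form values of Eqs. (6.13) and (6.14); the function name, stream length, and seed are arbitrary choices of ours:

```python
import random

def one_hot_stats(m, n_samples, seed=0):
    # Empirically measure p_i and alpha_i for a random m-bit one-hot stream
    rng = random.Random(seed)
    prev = rng.randrange(m)          # index of the previously set bit
    ones = [0] * m
    toggles = [0] * m
    for _ in range(n_samples):
        cur = rng.randrange(m)       # uniformly distributed, uncorrelated values
        for i in range(m):
            bit = 1 if i == cur else 0
            ones[i] += bit
            toggles[i] += bit ^ (1 if i == prev else 0)
        prev = cur
    p = [c / n_samples for c in ones]
    alpha = [t / n_samples for t in toggles]
    return p, alpha
```

For m = 8, the measured values converge to pi = 1/8 and αi = 2 · 7/64 ≈ 0.219, in line with Eqs. (6.13) and (6.14).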
6.1.4 Sequential Data

Sequential data streams form a specific data type for which subsequent data words are strongly correlated. Data streams of this type can be found in most real systems in the form of address signals or program counters [31]. A sequential m-bit data stream is a stream of integer words (unsigned base-2 representation), in which the value of a data word is always the increment (+1) of the value of the previous data word. The exceptions are branches, where the value changes randomly. Sequential signals are strongly correlated but tend to be uniformly distributed. Hence, the 1-bit probability of each bit is

pi = 0.5.    (6.16)
For an ideal sequential data stream without any branches, the toggle activity of the bits decreases exponentially with the significance of the bit:

αi = 2^(1−i).    (6.17)
Considering that a branch occurs with a probability of pbranch, the switching activities of the bits are precisely estimated by

αi = (1 − pbranch) · 2^(1−i) + pbranch · 0.5.    (6.18)
Due to the uniform distribution, the individual bits of sequential data toggle in an uncorrelated manner:

γi,j = 0.    (6.19)
The derived formulas for the bit-level statistics of sequential data streams are validated through two synthetically generated sequential data streams with different branch probabilities of 5% and 20%. Each of the two synthetically generated data streams is made up of 100,000 8-bit words. The exact bit-level statistics from bit-level simulations, alongside the estimates of the formulas, are plotted in Fig. 6.4. The results confirm the correctness of the derived formulas.
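A minimal sketch of such a validation, assuming an 8-bit counter with uniformly random branch targets (function names are ours):

```python
import random

def sequential_alpha_model(i, p_branch):
    # Eq. (6.18): expected toggle activity of bit i (i = 1 is the LSB)
    return (1 - p_branch) * 2 ** (1 - i) + p_branch * 0.5

def measure_sequential_alpha(m, p_branch, n, seed=0):
    # Generate an m-bit sequential stream with branch probability p_branch
    # and measure the per-bit toggle activity.
    rng = random.Random(seed)
    val = rng.randrange(2 ** m)
    toggles = [0] * m
    for _ in range(n):
        if rng.random() < p_branch:
            nxt = rng.randrange(2 ** m)      # branch: random new value
        else:
            nxt = (val + 1) % (2 ** m)       # sequential increment
        diff = val ^ nxt
        for i in range(m):
            toggles[i] += (diff >> i) & 1
        val = nxt
    return [t / n for t in toggles]
```

The measured activities closely follow Eq. (6.18); for pbranch = 0.05, the LSB toggles in about 97.5% of the cycles.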
[Fig. 6.4 shows two panels (pbranch = 0.05 and pbranch = 0.2) plotting the estimated and true values of pi, αi, and γi,i+1 over the bit index i.]
Fig. 6.4 Bit-level statistics of sequential data according to bit-level simulations and higher-level estimates
6.2 Data-Stream Multiplexing

In this section, the two main use cases of data-stream multiplexing are outlined: TSV-count reduction to reduce TSV-related area and yield issues, and virtual-channel usage to improve the NoC performance. Furthermore, it is shown that data-stream multiplexing can have a large impact on the interconnect power consumption and on the efficiency of a wide set of low-power techniques.
6.2.1 Data-Stream Multiplexing to Reduce the TSV Count

Consider three different data streams A, B, and C with the same bit-width m that have to be transmitted from one die of an IC to an adjacent one. One possibility would be to integrate an array containing 3m data TSVs so that the three data streams can be transmitted in parallel. Following this simple approach can result in excessive TSV area requirements (due to the large TSV radius) and a low overall yield (due to the high TSV-defect probability). To tackle this, the concept of TSV multiplexing has been introduced [238]. Consider the simple example where the source of data stream A only produces new data samples at a rate that is two times lower than the maximum frequency at which bits can be transmitted over a TSV. Furthermore, we assume that for the data streams B and C, the rate of new samples is even four times lower than the maximum TSV transmission rate. In this scenario, it is possible to integrate only an array of size m over which the samples are transmitted in a time-multiplexed manner as A1 B1 A2 C1 A3 B2 A4 C2 . . . (here, the indices are the sample numbers of the individual data streams). In this example, a multiplexed data transmission reduces the number of TSVs to one-third at the cost of an extra circuit that manages the multiplexed transmission.
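The slot assignment of this example can be sketched as follows; the fixed A-B-A-C slot pattern and the function name are illustrative assumptions of ours:

```python
def tdm_schedule(n_slots):
    # Stream A gets every second slot; B and C share the remaining
    # slots, so each of them gets every fourth slot.
    counters = {"A": 0, "B": 0, "C": 0}
    schedule = []
    for t in range(n_slots):
        stream = "A" if t % 2 == 0 else ("B" if t % 4 == 1 else "C")
        counters[stream] += 1
        schedule.append(stream + str(counters[stream]))
    return schedule
```

`tdm_schedule(8)` yields the sequence A1 B1 A2 C1 A3 B2 A4 C2, so a single m-bit TSV array suffices for all three streams.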
6.2.2 Data-Stream Multiplexing in NoCs

The (micro)architectures and concepts of modern NoCs were discussed in depth in Chap. 2 of this book. Thus, this subsection only covers the logical data-stream multiplexing on the physical links resulting from virtual channels in a NoC. To improve performance or resolve sources of deadlocks, virtual channels make it possible to transmit multiple packets over a single physical link in parallel through a distribution of the available bandwidth between the packets. To this end, more than one logical buffer is integrated into each input port when virtual channels are implemented. Thereby, different packets can be buffered simultaneously, and thus transmitted interleaved. While the assignment of virtual channels (input buffers) is
packet-based, the arbitration for physical-channel bandwidth is typically performed on a flit-by-flit basis [55]. Different techniques exist to arbitrate the use of a physical channel (i.e., the interconnect structure). The most common one is balanced time multiplexing [129]. Thereby, the available bandwidth of the link is equally partitioned among the virtual channels (strong fairness). Let us consider three packets, P1, P2, and P3, that request the usage of one link at the same time. Here, the available link bandwidth is shared by transmitting the flit sequence P1₁ P2₁ P3₁ P1₂ P2₂ P3₂ . . . P1_NF P2_NF P3_NF, assuming that no congestion occurs in the preceding paths and that the number of virtual channels is greater than two. Here, the indices are the flit numbers of the packets containing NF flits each. Another common technique uses priority-based multiplexing [38]. Thereby, each virtual channel is associated with a different priority, depending on the service class of the message. The transmission of packets with a lower priority is deferred if higher-priority packets use the link. This guarantees a good quality of service for high-priority traffic at the cost of reduced fairness. In summary, virtual channels result not only in sequential packet transmissions but also in multiplexed/interleaved transmissions of multiple packets. The probability of data-stream multiplexing in two subsequent cycles is reduced for a priority-based virtual-channel arbitration compared to a round-robin arbitration, but it is not zero.
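The balanced flit-by-flit multiplexing described above can be sketched as a round-robin over the virtual-channel buffers; this is a simplified illustration that ignores congestion and packet boundaries:

```python
from collections import deque

def balanced_interleave(packets):
    # Round-robin arbitration: one flit per requesting virtual channel
    # per round, until all packets are fully transmitted.
    queues = [deque(p) for p in packets]
    link = []
    while any(queues):
        for q in queues:
            if q:
                link.append(q.popleft())
    return link
```

For three two-flit packets, this reproduces the sequence P1₁ P2₁ P3₁ P1₂ P2₂ P3₂ from the text.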
6.2.3 Impact on the Power Consumption

In this subsection, we outline the impact of data-stream multiplexing on the interconnect power consumption and on coding-based low-power techniques. For the sequential transmission of highly correlated data (e.g., DSP signals or address signals), the switching activities of the bits are small compared to the transmission of uncorrelated data, as outlined in the previous section. Hence, the interconnect power consumption is relatively low for the sequential transmission of such data. However, when the data streams are transmitted in a multiplexed manner, this beneficial behavior is lost, as the individual data streams are uncorrelated. Consequently, the interconnect power consumption can increase drastically when data-stream multiplexing is applied. The following analysis validates this. The power consumption for the transmission of two data sets is investigated. Each set contains 100,000 Gaussian-distributed 16-bit data words with a word-level correlation ρ of 0.99 and a standard deviation σ of 2⁸. A clock frequency of 1 GHz is considered for the pattern transmission. Two interconnect structures are investigated: a TSV array and a metal-wire bus, both driven by commercial 40-nm drivers. To obtain the metal-wire capacitances, a commercial electronic design automation (EDA) tool, provided by the vendor of the 40-nm technology, is used. The metal-wire width and pitch are set to 0.3 μm
[Fig. 6.5 shows (a) the power consumption in µW/bit and (b) the coding gain in % over the multiplexing probability, each for the metal wires and the TSVs.]
Fig. 6.5 Effect of data-stream multiplexing on: (a) the interconnect power consumption for the transmission of correlated data words; (b) the gain of CBI coding for the transmission of two random data streams
and 0.6 μm, respectively. Through-silicon-via parasitics are generated with the capacitance model presented in Chap. 4. For the square 4 × 4 TSV array, a typical TSV radius of 2 μm and the corresponding minimum TSV pitch of 8 μm are analyzed. Again, a common TSV depth of 50 μm is considered, while the metal wires have a length of 100 μm to obtain a similar power consumption for both structures. The parasitic capacitances of the drivers, required for the power estimation, are extracted from the process design kit of the 40-nm standard-cell library. Employing exact bit-level simulations and the precise high-level formula from Chap. 3, the power consumption for various data-stream-multiplexing probabilities is quantified. The multiplexing probability is defined as the probability of a change in the transmitted data stream, so that the next transmitted bit pattern belongs to another data stream than the previous one. In Fig. 6.5a, the resulting interconnect power consumption, normalized by the number of transmitted bits per cycle, is plotted over the data-stream-multiplexing probability.¹ For no data-stream multiplexing, the power consumption of the metal wires and the TSVs is 2.10 and 3.40 μW/b, respectively. In the case of continuous cycle-by-cycle data-stream multiplexing (i.e., a multiplexing probability of 1), the power consumption of the links is almost doubled (metal wires: 4.27 μW/b; TSVs: 6.11 μW/b) compared to the scenario without data-stream multiplexing. This increase shows the possibly dramatic effect of data-stream multiplexing on the power consumption.
¹ The power values are plotted normalized to evaluate an LPC technique in the following, which adds redundancy to the data words, reducing the number of effectively transmitted bits per cycle.
To illustrate the impact of data-stream multiplexing on the efficiency of existing low-power coding techniques, the transmission of two random data streams, encoded individually with the classical bus invert (CBI) technique [229], is analyzed. The CBI technique is one of the most well-known low-power techniques based on higher abstraction levels. It belongs to the class of LPCs. The conceptual idea of low-power coding is to integrate an encoder and a decoder architecture at the beginning and the termination of a communication path, respectively, which adapt the bit-level characteristics of the signals transmitted over the physical medium to reduce the power consumption. For example, the CBI technique aims at reducing the switching activities of the bits. For this purpose, one redundant bit is added to the transmitted codewords. Depending on the logical value of this so-called invert bit, the remaining bits are either equal to the initial/unencoded data word (if the invert bit is 0) or equal to its bitwise negation (if the invert bit is 1). A memory-based block sets the invert bit if this reduces the Hamming distance between the previously encoded codeword and the currently encoded codeword. Thereby, CBI coding reduces the switching activity for random data by up to 25% [229]. The impact of CBI coding on the power consumption of the interconnect structures, over the data-stream-multiplexing probability, is investigated for an end-to-end encoding scheme. End-to-end encoding implies that the data is encoded and decoded at its source and sink, respectively, and not on a link level. Such an end-to-end encoding is, for example, widely used in NoCs, which employ a multi-hop transmission of the data. Here, with link-level coding, the data must be encoded and decoded once per link it traverses [30, 116].
In contrast, with an end-to-end encoding, each data stream only undergoes one encoding and one decoding, which maximizes the coding gain and minimizes the power consumption induced by the coding architectures. Figure 6.5b includes the results of the analysis. If the two data streams are transmitted without any data-stream multiplexing, the CBI technique leads to a reduction in the TSV and metal-wire power consumption by approximately 14%. However, with increasing data-stream multiplexing, the coding efficiency vanishes. Due to the redundancy the technique adds to the codewords (the invert bit), the encoding approach even increases the power consumption by 6% for continuous data-stream multiplexing. One can show in the same manner that the effectiveness of other existing LPCs vanishes if they are used on an end-to-end basis in the presence of data-stream multiplexing. The previous analyses highlight the strong need to consider the effect of data-stream multiplexing on the interconnect power consumption in high-level models: not only to precisely estimate the power requirements, but also to systematically derive efficient LPCs whose power gains do not vanish for multiplexed data streams.
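A minimal sketch of the CBI encoder and decoder described above, treating data words as integers; the all-zero initial link state and the tie-breaking rule (invert only if more than half of the bits would toggle) are our assumptions:

```python
def cbi_encode(words, width):
    # Classical bus invert: add one invert bit; send the bitwise complement
    # of the data word whenever that lowers the Hamming distance to the
    # previously transmitted codeword.
    mask = (1 << width) - 1
    prev = 0                                   # link assumed reset to all zeros
    codewords = []
    for w in words:
        dist = bin((w ^ prev) & mask).count("1")
        if dist > width // 2:                  # inverting saves transitions
            code, invert = (~w) & mask, 1
        else:
            code, invert = w, 0
        codewords.append((code, invert))
        prev = code
    return codewords

def cbi_decode(codewords, width):
    # The invert bit alone tells the receiver whether to negate the word
    mask = (1 << width) - 1
    return [(~c) & mask if inv else c for c, inv in codewords]
```

For random data, at most width/2 data bits plus the invert bit toggle per cycle, which yields the up to 25% activity reduction cited above.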
6.3 Estimation of the Bit-Level Statistics in the Presence of Data-Stream Multiplexing

In this section, a model is presented to estimate the bit-level statistics in the presence of data-stream multiplexing. Let us assume that up to n_ds different data streams, represented by D^1 to D^{n_ds}, are transmitted over an interconnect structure. One can use one of the various existing models to obtain the bit-level statistics for a sequential transmission of the individual data streams. For the approach presented in this chapter, not only the 1-bit probabilities for the individual data streams (i.e., p_i^1 to p_i^{n_ds}) have to be determined through the existing methods, but also the joint bit-probability matrices, symbolized as JP^1 to JP^{n_ds}, with

JP^x_{i,j} = E{b_i^x · b_j^x}.    (6.20)
The ith diagonal entry of a joint bit-probability matrix, JP^x_{i,i}, is equal to the 1-bit probability of b_i in the data stream D^x:

JP^x_{i,i} = E{b_i^x b_i^x} = E{b_i^x} = p_i^x.    (6.21)
A non-diagonal entry, JP^x_{i,j}, is equal to the probability that both bits b_i and b_j of a data word of D^x are 1. First, the data-flow-independent JP matrices are used to estimate the switching activities (α_i^{x→y} = E{Δb_i²}^{x→y}) and the switching correlations (γ_{i,j}^{x→y} = E{Δb_i Δb_j}^{x→y} with i ≠ j) for the case that two data streams D^x and D^y are transmitted in a continuous (i.e., cycle-by-cycle) multiplexed manner:

E{Δb_i Δb_j}^{x→y} = E{(b_i^y − b_i^x)(b_j^y − b_j^x)}
                   = E{b_i^y b_j^y + b_i^x b_j^x − b_i^y b_j^x − b_i^x b_j^y}    (6.22)
                   = JP^y_{i,j} + JP^x_{i,j} − JP^y_{i,i} JP^x_{j,j} − JP^x_{i,i} JP^y_{j,j}.

For i equal to j, the self-switching probability α_i^{x→y} is obtained and, for i unequal to j, the switching correlation γ_{i,j}^{x→y}. Note that in the derived equation, it is exploited that the cross-correlation of two different data streams is zero, resulting in

E{b_i^y b_j^x} = E{b_i^y} · E{b_j^x} = JP^y_{i,i} · JP^x_{j,j}.    (6.23)
Employing the resulting switching statistics for a continuous multiplexing as well as the switching statistics for no multiplexing (i.e., α_i^x = α_i^{x→x} and γ_{i,j}^x = γ_{i,j}^{x→x}), the overall switching statistics for an interconnect structure with given multiplexing
probabilities, M_{x,y}, can be determined through

α_i = Σ_{x=1}^{n_ds} Σ_{y=1}^{n_ds} (M_{x,y} + M_{x+n_ds,y}) · α_i^{x→y};    (6.24)

γ_{i,j} = Σ_{x=1}^{n_ds} Σ_{y=1}^{n_ds} (M_{x,y} + M_{x+n_ds,y}) · γ_{i,j}^{x→y}.    (6.25)
In these equations, α_i represents the overall switching activity of the ith line of the interconnect structure, while γ_{i,j} is the overall switching correlation of the ith and the jth line. M_{x,y} is equal to the probability of transmitting a pattern of data stream D^y after transmitting a pattern of data stream D^x (i.e., a switching in the active data stream from D^x to D^y). Thus, M_{x,x} is equal to the probability of two subsequently transmitted patterns belonging to the same data stream D^x (i.e., no multiplexing event). M_{x,x+n_ds} is equal to the probability that the link is idle, holding a pattern of stream D^x. Therefore, M_{x+n_ds,y} is the probability of transmitting a pattern of D^y after being idle, holding a pattern of D^x. The 2n_ds × 2n_ds matrix M, containing the M_{x,y} values, is referred to as the data-flow matrix, since it contains the abstract information about the word-level data flow. Note that in Eqs. (6.24) and (6.25) the switching characteristics for a sequential/non-multiplexed transmission of a data stream D^x (i.e., α_i^{x→x} and γ_{i,j}^{x→x}) must be determined with the traditional methods. The reason is that the derived Eq. (6.22) only holds if the patterns of D^x and D^y are uncorrelated. Besides the switching activities and the switching correlations, the 1-bit probabilities are required for interconnect structures that contain TSV segments to estimate the TSV capacitances. The 1-bit probabilities do not depend on the bit switching. Thus, the 1-bit probabilities are independent of the multiplexing; it is only relevant how many samples of each data stream are transmitted on average. Consequently, previous methods generally can correctly estimate the 1-bit probabilities. In the proposed model, the 1-bit probabilities are calculated via

p_i = Σ_{x=1}^{n_ds} Σ_{y=1}^{n_ds} (M_{x,y} + M_{x,y+n_ds} + M_{x+n_ds,y}) · p_i^y.    (6.26)
In summary, Eqs. (6.22)–(6.26) extend existing methods such that the bit-level statistics can also be estimated in the case of a complex time-multiplexed transmission of several data streams over arbitrary 2D and 3D interconnect structures.
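Equations (6.22) to (6.26) translate directly into a small NumPy sketch. The matrix layout and the function names are ours, and the sequential x→x statistics are assumed to be supplied by the traditional models:

```python
import numpy as np

def cross_switching(jp_x, jp_y):
    # Eq. (6.22): switching matrix for a pattern of stream x followed by a
    # pattern of the (uncorrelated) stream y. Diagonal entries are
    # alpha_i^{x->y}; off-diagonal entries are gamma_{i,j}^{x->y}.
    px, py = np.diag(jp_x), np.diag(jp_y)
    return jp_y + jp_x - np.outer(py, px) - np.outer(px, py)

def overall_switching(jp_list, seq_sw, M):
    # Eqs. (6.24)/(6.25): blend with the data-flow matrix M (2*nds x 2*nds).
    # seq_sw[x] holds the x->x statistics from the traditional methods.
    nds = len(jp_list)
    total = np.zeros_like(jp_list[0])
    for x in range(nds):
        for y in range(nds):
            sw = seq_sw[x] if x == y else cross_switching(jp_list[x], jp_list[y])
            total += (M[x, y] + M[x + nds, y]) * sw
    return total

def overall_bit_probability(p_list, M):
    # Eq. (6.26): overall 1-bit probabilities
    nds = len(p_list)
    p = np.zeros_like(p_list[0])
    for x in range(nds):
        for y in range(nds):
            p += (M[x, y] + M[x, y + nds] + M[x + nds, y]) * p_list[y]
    return p
```

For two uncorrelated uniform random streams (JP diagonal 0.5, off-diagonal 0.25), the model reproduces α_i = 0.5 and γ_{i,j} = 0 for any multiplexing probability, as expected.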
6.4 Evaluation

The proposed method is evaluated in this section. First, the accuracy of the technique to predict the bit-level statistics for data-flow scenarios is quantified in Sect. 6.4.1. Afterward, the implications of the model for low-power coding techniques are outlined.
6.4.1 Model Accuracy

The accuracy of the presented model to estimate the bit-level statistics is investigated in this section. For this purpose, the transmission of two to five data types/streams is analyzed for different mean multiplexing probabilities with Python. Furthermore, each simulation is executed 1000 times for a flit width of 16 b and 1000 times for a flit width of 32 b. To cover the large space of data-type combinations in the best possible way, the synthetically generated data streams vary randomly in each run. The pattern distribution of each single data stream is either uniform, normal (Gaussian), or log-normal. For the last two distributions, the standard deviation of the patterns is in the range from 2^(N/10) to 2^(N−1), and the relative pattern correlation is in the range from 0 to 1. Compared are the switching properties, SE, estimated with the high-level model proposed in this chapter and the exact switching properties determined by means of exact bit-level simulations for the transmission of 10,000 flits. Reported are the overall RMSE values as well as the MAE values. Since the switching properties are independent of the link type (2D or 3D), this particular simulation setup enables to quantify the general accuracy of the proposed modeling approach for a possibly large set of data-flow scenarios. The results, presented in Fig. 6.6, show that the presented approach enables a precise estimation of the switching characteristics independent of the multiplexing probabilities, the number of multiplexed data streams, or the flit width. For all analyzed scenarios, the RMSE of the estimates is at most 0.6 percentage points (pp). The MAE for all 5,120,000 estimated switching properties does not exceed 2.8 pp. Although the proposed model has a close to perfect match with bit-level simulations, it requires a more than 2000 times lower execution time on an Intel i5-4690 machine, with 16 GB of RAM, running Linux kernel 3.16.
This speed-up even increases with an increasing pattern and/or link count. Thus, the proposed model enables to precisely predict the energy consumption of a full virtual-channel-based NoC, containing multiple processing elements transmitting various data-stream types over the network, within a tolerable time. Furthermore, the experiment proves that, as expected, the switching activities increase linearly with an increasing multiplexing probability (see also Fig. 6.5), while the number of multiplexed data streams does not affect the switching properties. Since switching is only determined by directly consecutive pattern pairs, a multiplexing of two data streams results, on average, in the same link energy
[Fig. 6.6 shows (a) the NRMSE and (b) the NMAE in % over the data-stream count (2 to 5), for multiplexing probabilities of 0.1, 0.4, 0.7, and 1.0.]
Fig. 6.6 Errors of the proposed model to estimate the switching statistics for various data-stream counts and multiplexing probabilities: (a) NRMSE; (b) NMAE
consumption as a multiplexing of three or more data streams. Thus, without loss of generality, in the remainder of this section the analysis can be restricted to scenarios where only two data types are multiplexed.
6.4.2 Low-Power Coding

As outlined in Sect. 6.2.3, there is a strong need for coding techniques that reduce the energy consumption of links with virtual channels. The modeling technique presented in this chapter enables a fast estimation of the efficiency of such low-power code (LPC) techniques. With a single simulation, the coding-independent (i.e., data-stream-independent) data-flow matrices, M, are determined once using, for example, a
SystemC simulator. Subsequently, employing the M matrices, the bit and switching probabilities, SE and p, can be determined for arbitrary encoded or unencoded data streams using the model presented in this chapter. From p, the pattern-dependent TSV capacitance matrix is estimated as proposed in Chap. 4. Finally, the switching-probability and capacitance matrices are used to obtain the link energy consumption for the analyzed data streams by means of Eq. (3.47). Thereby, the link energy requirements are estimated for arbitrary data streams, and thus for different applied LPC techniques, while running the NoC simulator only once. Moreover, the high-level model presented in this chapter enables the design of new coding techniques that consider the effect of virtual channels on the energy consumption. This approach is investigated in more detail in the present subsection. For this purpose, the simultaneous transmission of 2 MB of data from two different sources over 26-bit-wide 2D and 3D links is analyzed. For the physical transmission media (2D and 3D links), the same structures as in Sect. 6.2.3 are considered. Investigated is the dissipated energy per effectively transmitted byte over the mean multiplexing probability (mean(M_{1,2}, M_{2,1})) to take possible bit overheads of the encoding techniques into account and to present values independent of the NoC router clock frequency. For the data streams, uniformly distributed patterns whose eight MSBs show a strong temporal correlation (ρ = 0.99), as well as completely random (uncorrelated) patterns, are considered. Here, CBI coding is investigated exemplarily for its usability as an LPC technique for the uncorrelated data. For the correlated data, the power reduction of correlator coding, proposed in [89], is analyzed. This second LPC technique correlates (bit-wise XOR operation) every data word with the previous data word of the stream.
Therefore, in the analyzed example, the high MSB correlation, in combination with the inverting drivers, leads to codeword MSBs nearly stable on logical 1 [86]. The energy quantities are determined twice: once using the high-level model presented in this chapter to estimate SE and p, and once using the exact SE and p obtained by means of bit-level simulations. The resulting energy quantities are illustrated in Fig. 6.7. Markers indicate the energy quantities obtained for the exact bit properties, which are in perfect accordance with the energy quantities obtained with the proposed high-level model (lines). Thus, the model proposed in this chapter enables a fast and precise estimation of the efficiencies of LPC techniques for multiplexed data streams. For example, the model precisely predicts the decreasing coding efficiency of the CBI technique with an increasing multiplexing probability, as well as the increasing efficiency of the correlator coding for the transmission of two correlated data streams. Thus, it makes it possible to identify that for a high multiplexing probability (high virtual-channel usage) a different LPC technique than CBI coding is required, while correlator coding performs particularly well for this data-flow scenario. Without the proposed high-level model, it is not trivial for a designer to explain this observation, which complicates the design of new coding techniques for links with virtual channels. However, the model proposed in the present chapter provides the clear answer. CBI coding only affects the switching probabilities for the sequential (non-multiplexed) transmission of data streams. However, it does not
[Fig. 6.7 shows the power consumption in µW/bit over the multiplexing probability for the metal wires and for the TSV array, each for the unencoded and encoded transmission of: two streams of random data words; two streams of correlated data words; and one stream of random and one stream of correlated data words.]
Fig. 6.7 Effect of existing low-power coding techniques on the 3D-interconnect power consumption for the transmission of correlated and completely random data words in the presence of data-stream multiplexing. Markers indicate results of exact bit-level simulations; lines indicate estimates of the proposed method
affect the bit probability matrices, S. Thus, as revealed by the proposed model, it does not decrease the energy consumption per clock cycle for a continuous data-stream multiplexing, and due to its induced overhead, it even increases the energy consumption per effectively transmitted byte. In contrast, the correlator coding leads to MSBs nearly stable on logical 1. This increases the S values, which reduces
the SE^{x→y} values as well as the TSV capacitance quantities via the MOS effect, resulting in significantly reduced link energy requirements in the presence of data-stream multiplexing. In summary, the model proposed in the present chapter reveals an important message for the design of LPC techniques: in order to obtain efficient coding approaches for links with data-stream multiplexing (e.g., due to virtual channels), a technique must not only affect the switching activities of the single data streams, E{Δb_i Δb_j}, but also the bit probabilities, E{b_i · b_j}.
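The correlator coding discussed above can be sketched as the following XOR scheme; this is our reading of the description in the text, and implementation details of [89] may differ:

```python
def correlate_encode(words, width):
    # Each transmitted codeword is the bitwise XOR of the current and the
    # previous data word, so temporally correlated MSBs map to nearly
    # constant (low-weight) codeword MSBs.
    mask = (1 << width) - 1
    prev = 0
    codes = []
    for w in words:
        codes.append((w ^ prev) & mask)
        prev = w
    return codes

def correlate_decode(codes):
    # XOR-accumulate to recover the original stream
    acc = 0
    words = []
    for c in codes:
        acc ^= c
        words.append(acc)
    return words
```

For a strongly correlated stream such as 200, 201, 201, 202, the encoded words 200, 1, 0, 3 have nearly constant MSBs, which (together with the inverting drivers) yields the stable-at-1 codeword MSBs exploited above.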
6.5 Conclusion

In this chapter, a method to precisely estimate the bit-level statistics in the presence of data-stream multiplexing was presented. The method enables the precise estimation of the 3D-interconnect power consumption without the need for computationally expensive bit-level simulations for systems that use TSV-sharing approaches or virtual-channel NoCs. Furthermore, the proposed technique enables the systematic derivation of techniques that effectively reduce the interconnect power consumption of such systems. This is of particular importance since the interconnect power consumption can dramatically increase if multiple data streams are transmitted in a time-multiplexed fashion. An important finding of this chapter for the design of end-to-end LPCs is that the joint bit probabilities of the transmitted data have to be maximized to effectively improve the 3D-interconnect power consumption in the presence of extensive data-stream multiplexing. This behavior is exploited in the 3D-interconnect optimization techniques presented in the following part of this book.
Chapter 7
Ratatoskr: A Simulator for NoCs in Heterogeneous 3D SoCs
In the previous chapters, we introduced abstract application and physical models for three-dimensional (3D) interconnects and 3D networks on chips (NoCs). In this chapter, we bring both together by introducing this book's 3D NoC simulator for heterogeneous 3D integrated circuits (ICs), called Ratatoskr. The simulator is provided as open-source software along with this book. Please check out our project website at https://ratatoskr-project.github.io for further details. Power consumption, performance, and area are the key performance indicators for nearly all system on a chip (SoC) components. As shown in the models in the previous chapters, heterogeneous integration requires novel power and performance models for physical interconnects, whole networks, and, to a limited extent, even for applications. Ratatoskr is a comprehensive framework to precisely determine power-performance-area (PPA) for NoCs from the physical via the gate to the transaction level, i.e., including a register-transfer level (RTL) router model, a cycle-accurate (CA) NoC model, and a TLM application model. Furthermore, it includes optimization methods for heterogeneous 3D integration, as introduced in this book. It also automatically provides an optimized hardware description for the NoC router. Such a comprehensive framework cannot be found in the literature so far. While individual tools exist, they are neither integrated into one single tool flow nor able to cope with heterogeneous 3D integration. Simulators such as Noxim [78] and Booksim 2.0 [120] allow for performance evaluation of NoCs but are not applicable for heterogeneous 3D integration. Power models of routers, such as ORION [127], also do not model heterogeneity and do not provide a model for the dynamic power consumption of links. Existing open-source RTL router models such as OpenSoCFabric [77] or OpenSmart [137] are not shipped with a simulator for scaled performance analysis and do not allow the heterogeneous architectures
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 L. Bamberg et al., 3D Interconnect Architectures for Heterogeneous Technologies, https://doi.org/10.1007/978-3-030-98229-4_7
needed in heterogeneous 3D integration. Ratatoskr addresses these shortcomings and provides the following specific features:

• Power estimation:
  – Static and dynamic power estimation of routers and links on a cycle-accurate level.
  – The accuracy of the dynamic link-energy estimation is within 2.4% of bit-level accurate simulations.
• Performance estimation:
  – Network performance for millions of clock cycles using the CA NoC simulator.
  – Network performance for thousands of clock cycles using RTL.
  – Timing of routers from RTL-to-gate-level synthesis.
• Area analysis: NoC area from RTL-to-gate-level synthesis.
• Benchmark models:
  – Support for a realistic application model on transaction level.
  – Conventional synthetic traffic patterns.
• Support for heterogeneous 3D integration:
  – Heterogeneity yields non-purely synchronous systems, since the same circuit in different technology nodes (e.g., digital and mixed-signal) achieves a different maximum clock speed. The NoC simulation and the router hardware model allow implementing a pseudo-mesochronous router, as introduced later in this book.
  – Heterogeneity yields a different number of routers per layer, since (identical) circuits in different technology nodes have a different area. The NoC simulation allows for non-regular network topologies via XML configuration files.
  – With the same approach, NoCs in heterogeneous 2.5D integration (using interposers) can also be modeled.
• Yield estimation: Estimation of the through-silicon via (TSV) yield for any manufacturing technology. Since the TSV yield limits the chip yield, this estimation is also representative of the overall manufacturing yield.
• Focus on ease of use: A single point of entry to set design parameters. The design parameters allow rapid testing of different designs.
• Reports: Automatic generation of detailed reports, from network-scale performance to buffer utilization.
• Open-source availability: The framework can be used without access to closed-source tools. However, parts of the framework must rely on closed-source software to generate industrial-grade results.

The framework's name, Ratatoskr, is based on the squirrel of the same name from Norse mythology. In the mythological text, the squirrel runs up and down a tall tree named Yggdrasil to transport messages from an eagle sitting on top
of the tree to the dragon Nidhögg lying under the tree. The eagle and the dragon constantly exchange hostile gossip because they are feuding; thus, the squirrel is kept busy with its task. The inspiration for our framework comes directly from the squirrel's task: just like a NoC in a heterogeneous 3D SoC, it transmits messages vertically between different types of senders and receivers. The remainder of this chapter is structured as follows. In Sect. 7.1, we introduce Ratatoskr for practitioners, explaining its functionality with user inputs and reported outputs. In Sect. 7.2, we explain the technical details of the tool's implementation. In Sect. 7.3, we analyze the capabilities of our tool, including a case study. Finally, the chapter is concluded.
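The TSV-yield estimation listed among the features above can be sketched with a very simple model. The independent-failure assumption and the function below are illustrative only; Ratatoskr's actual yield model is technology-dependent and may differ.

```python
# Minimal sketch of a TSV-array yield estimate, assuming independent and
# identically distributed TSV failures (an illustrative assumption, not
# Ratatoskr's actual, technology-dependent model).

def tsv_array_yield(n_tsvs: int, p_fail: float) -> float:
    """Probability that all n_tsvs TSVs work, each failing with p_fail."""
    return (1.0 - p_fail) ** n_tsvs

# Example: 1000 TSVs, each with a 1e-4 failure probability
y = tsv_array_yield(1000, 1e-4)  # roughly 0.905
```

Since the TSV yield limits the chip yield, such an estimate is already indicative of the overall manufacturing yield.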
7.1 Ratatoskr for Practitioners

This section targets the practitioner who designs a NoC for heterogeneous 3D integration. We explain the framework's parts and functionality to generate an in-depth PPA analysis and introduce the process to set design parameters.
7.1.1 Parts and Functionality

The tool flow is shown in Fig. 7.1. It is tailored towards a unified user experience in the generation of PPA reports. One can set parameters, evaluate the design, and iterate until targets are met. There are two sets of design parameters derived from the NoC and the application models presented in Chap. 5. The network properties include the network topology and floorplan, the number of virtual channels (VCs), the buffer depths, the used routing algorithm, and the bit width of the link (i.e., the flit width). The application properties define either a synthetic traffic pattern or an instance of the application model. Setting the parameters is simple, as the user modifies only a single point-of-entry file bin/config.ini and uses a Python library. We kindly refer to the project website for the detailed documentation, which is kept up to date. Together, this sets up the configuration of the framework. We also provide a GUI that displays the simulation of the network in real time. The execution of Ratatoskr starts after setting the design parameters. Three parts reflect the different levels of abstraction. The parts are depicted in Fig. 7.1 in blue; they are

• Register-transfer level NoC model: The box on the left-hand side shows the hardware implementation. An RTL NoC model is generated with Python scripts that write very high speed integrated circuit hardware description language (VHDL) code automatically from the network properties. The router architecture
Fig. 7.1 Tool flow of Ratatoskr (parameters: network properties such as topology, VCs, and buffers, plus application properties such as the traffic pattern or Petri net model; parts: RTL NoC with synthesis, cycle-accurate NoC model with simulation, and link model with analytical model; outputs: reports on power, performance, and area)
is the pseudo-mesochronous vertical high-throughput router presented later in Chap. 13 of this book. There is a significant advantage to having an RTL model of the NoC: it can be synthesized to gate level for any standard-cell library, generating precise power, area, and timing data. This approach targets the needs of heterogeneous 3D ICs, as one can obtain area and power figures for different technology nodes. The gate-level NoC enables timing verification and opens the possibility for chip manufacturing. The RTL NoC can be simulated using VHDL simulators. We provide pseudo-processing elements (PEs) that inject uniform random traffic, as well as a trace-file-based traffic generator for real-world application traffic. The CA NoC model offers higher simulation performance, as RTL simulation is slower and typically enables only a few thousand clock cycles of simulation time.
• Cycle-accurate NoC model: The CA NoC simulator is shown in the blue box in the middle of Fig. 7.1. It takes both network and application parameters as input; the latter are required to load traffic patterns. The CA router model copes with the non-purely synchronous transmission of data typical for heterogeneous 3D ICs. The simulator generates results for the network performance (e.g., flit latency) and the dynamic router power consumption. We use the router power model of Noxim, in which power-relevant events are back-annotated with energy figures from gate-level simulations (cf. [78, Sec. 5.1]). As an innovative feature, Ratatoskr's simulator stores data-flow matrices that feed a power model for links, as introduced in Chap. 6. This feature allows precisely and dynamically estimating the pattern-dependent link-energy consumption.
• Cycle-accurate link-power estimation model: To generate power estimates for the links, we use our power models for 2D (horizontal) and 3D (vertical) links from Chap. 3 and integrate them into the framework. We use data-flow matrices from simulations, which enable precise power estimates at a high simulation performance.

During execution, Ratatoskr generates PPA results, as shown at the bottom of Fig. 7.1. Besides the individual results of the framework's parts, Ratatoskr encapsulates the most relevant ones in a report file. This approach maintains high usability and aligns with our plan to enable rapid design-space exploration by iterating design parameters and testing them against constraints.
7.1.2 User Input: Setting Design Parameters

For the user, an easy-to-use interface to set the critical design parameters is imperative to handle the capabilities of the underlying models. We made these parameters accessible via a configuration file or Python classes. The content of the file is shown in Table 7.1 (the software architecture in Python mirrors this structure). The four subsections in the "application properties" configure the simulator (in "Config"), link a task-graph-based benchmark incl. mapping (in "Task"), define a synthetic traffic pattern (incl. warm-up phases) as known from other simulators (in "Synthetic"), and set router identifier numbers enabling micro-architectural reports (in "Report"). The parameters' usage is essentially identical to other simulators, with two significant differences. First, the task model implements the book's application models. Second, synthetic traffic patterns are usually parameterized by the injection rate (as done here using the "Rate" intervals), measured in flits/cycle (or packets/cycle). This definition is not directly applicable to heterogeneous 3D ICs, as the length of a clock cycle varies between layers. We chose a model in which the slowest clock cycle in the system serves as the measurement reference. The "network properties" configure the NoC. As in other simulators, the user defines architectural parameters such as the network size, the routing algorithm, the operating frequency, etc. As Ratatoskr targets heterogeneous 3D ICs, two differences extend these modeling capabilities. First, a non-regular mesh topology can be defined using x- and y-dimensions that are layer-wise lists. Each element in a list represents the size of the network in the corresponding layer. Because these dimensions may differ, 3D links can be arbitrarily placed.
Ratatoskr can also simulate any other topology, but we did not make this feature available at the high-level interface because regular topologies are the most common. Second, the user can set micro-architectural parameters. The clock period is a list to set the operating frequency per layer; the VC counts and buffer depths are lists to simulate different router resources per layer. This feature addresses a requirement of heterogeneous 3D ICs, for which those parameters will vary between layers.
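The injection-rate convention described above, measuring rates against the slowest clock in the system, can be sketched as follows. The function name and the example clock periods are illustrative assumptions, not part of Ratatoskr's interface.

```python
# Sketch: expressing a layer-local injection rate relative to the slowest
# clock in a heterogeneous stack, as described in the text. Names and values
# are illustrative assumptions.

def normalize_rate(rate_flits_per_local_cycle: float,
                   local_period_ns: float,
                   all_periods_ns) -> float:
    """Convert a layer-local rate into flits per slowest-clock cycle."""
    reference_period = max(all_periods_ns)   # slowest clock in the system
    rate_per_ns = rate_flits_per_local_cycle / local_period_ns
    return rate_per_ns * reference_period

# Example: three layers with 1 ns, 2 ns, and 4 ns clock periods.
periods = [1.0, 2.0, 4.0]
# 0.05 flits/cycle on the fast 1-ns layer equals 0.2 flits per 4-ns cycle.
r = normalize_rate(0.05, 1.0, periods)
```

This makes injection rates comparable across layers that run at different clock speeds.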
Table 7.1 A description of software and hardware configurations using the configuration file

Application properties:
• Config — General parameters such as simulation time and benchmark
  – simulationtime: Duration of the simulation in ns
  – flitsperpacket: Maximum length of packets
  – benchmark: ["synthetic", "task"] to select the application model
• Task — Configuration of the task (application) model traffic
  – libDir: Folder path for configuration data
• Synthetic — Configuration of synthetic uniform random traffic
  – simDir: Folder path for temporary data
  – restarts: Number of simulations
  – runRateMin, runRateMax, runRateStep: Injection rates for the simulation
  – runDuration: Duration of the run phase (the one after a warm-up phase)
  – numCores: Number of simulated cores
• Report — Configuration of reports
  – bufferReportRouters: List of routers' IDs to be reported

Network properties — Hardware configurations like network and router properties:
• z: Layer count
• x, y: Lists of router counts per layer
• routing: Routing algorithm
• clockPeriod: List with the clock period in each layer
• bufferDepthType: ["single", "perVC"] single or VC-wise buffer depth
• bufferDepth: Value for the buffer depth
• buffersDepths: List of buffer depths (one for each VC)
• vcCount: List of VCs per router port
• topologyFile: File with the network topology
• flitSize: Bit width of flits
The abstract interfaces via the single configuration file or the Python interface classes can be extended by writing individual XML files. However, this should be out of scope for most users of the framework, as our simulation models from Chap. 5 encapsulate the most important parameters.
7.1.3 User Output: Power-Performance-Area Reports

Ratatoskr generates a comprehensive set of results as a PDF file with three types of result plots: the network performance, i.e., the latency over the injection rate; the average buffer usage per layer and direction as 3D histograms; and the average VC usage per layer as a histogram. Exemplary plots are given in Figs. 7.2 and 7.3. In Fig. 7.2, the VC usage is given for the bottom layer (therefore, the downwards direction is never used). One can see that higher VC numbers are used less, which is expected, as the book's router model always fills the lower VCs first. Figure 7.3 gives a histogram that reports the buffer and VC usage for one direction in a layer. The report gives an in-depth insight into the network dynamics and allows for router optimization on a micro-architectural level, e.g., on the level of single buffers, as shown in Chap. 12 of this book. To summarize, the generated reports, made available by the book's framework, provide detailed insights into the NoC. Especially the buffer usage and the VC usage are unique features that enable novel micro-architectural optimizations, as demonstrated later in this book (Part IV).
7.1.4 Router Architecture

We use a lightweight router architecture in Ratatoskr. A lightweight model suits the needs of heterogeneous 3D ICs, as communication infrastructures must be small (area-wise) due to the relatively large feature size of mixed-signal technologies compared to digital ones.
Fig. 7.2 Exemplary VC usage histogram (layer 0, injection rate 0.07; count over the number of used VCs for the local, east, west, north, south, and up directions) generated by Ratatoskr after a simulation with uniform random traffic

Fig. 7.3 Exemplary buffer usage report generated by Ratatoskr after a simulation with uniform random traffic
The user simultaneously configures the NoC simulation, i.e., the CA NoC simulation model and the RTL NoC hardware model, when using Ratatoskr. Many macro- and micro-architectural parameters of the routers are configurable in Ratatoskr. An overview of the router architecture is shown in Fig. 7.4. The router model has the following properties: (a) We implement wormhole packet switching to optimize for buffer depth and network congestion. (b) We use credit-based flow control to effectively throttle routers in case of back-pressure. (c) We provide an input-buffered router, probably the most common router architecture in today's systems. (d) The architecture of the central control and arbitration unit is one of the most lightweight ones published [28]. Each input port selects (pre-arbitrates) at most one VC that sends a request to the respective output port. Thereby, an output port has to arbitrate at most #InputPorts − 1 requests rather than #VCs · (#InputPorts − 1) requests. This substantially improves timing and area,
Fig. 7.4 Schematic illustration of the router (input ports with VC buffers and input units, a central control unit with routing computation, VC allocation, and switch allocation, and a central crossbar; the configurable items 1–9 discussed in the text are highlighted)
as the logic depth and the area of a (round-robin) arbiter scale quadratically and linearly, respectively, with the number of possible requesters. It comes at a typically negligible performance loss due to the loss of maximal matching [28]. Flit-level arbitration of the next VC to be routed over the crossbar/link happens on a round-robin basis or, alternatively, on a fixed-priority basis (switch allocation). The packet-level allocation of VC buffers happens on a hop-to-hop basis, i.e., a packet can use a different VC on each hop. Virtual channel buffer allocation prioritizes lower VC numbers; i.e., the first packet to be routed over a link always uses VC0, and only when multiple VCs are currently in use are higher-numbered VC buffers sequentially allocated. Some parts of the simulated router model are parameterizable, as highlighted in blue in Fig. 7.4:

1. The routing algorithm;
2. The switch-allocation method;
3. The flit width;
4. The link traversal (whether it is an individual pipeline stage);
5. The maximum packet length.
In addition, the Ratatoskr NoC enables the micro-architectural fine-tuning required to target heterogeneous 3D systems. Users can generically define:

6. The input port count, which allows implementing any network topology;
7. The VC count per port;
8. The buffer depth per port and VC, which, together with the previous feature, allows modeling heterogeneous router microarchitectures;
9. The turns that are impossible in the specified routing algorithm, which enables an automated PPA improvement. In the RTL router model, the central arbiter and crossbar implementation is automatically modified such that the connections and requests for all impossible turns are cut. This allows a gate-level synthesis tool
to improve the area, power, and timing of the crossbar and the central arbiter unit remarkably, without any drawback. Impossible turns are normally (i.e., without the proposed optimization) not easily visible to the synthesis tool, as detecting them requires higher-level algorithmic knowledge.
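The turn pruning of feature 9 can be sketched for XYZ dimension-order routing, where a packet may never turn from a higher dimension back into a lower one. The port naming and the encoding of the DOR rule below are illustrative assumptions, not Ratatoskr's implementation.

```python
# Sketch: enumerating the crossbar turns that XYZ dimension-order routing can
# take; all other turns may be cut from the crossbar and arbiter (feature 9).
# Port names and the rule encoding are illustrative assumptions.

PORTS = ["Local", "East", "West", "North", "South", "Up", "Down"]
DIM = {"East": 0, "West": 0, "North": 1, "South": 1, "Up": 2, "Down": 2}

def allowed_turns():
    """XYZ DOR: a packet may only continue in the same or a higher dimension."""
    turns = set()
    for src in PORTS:          # input port the packet arrived on
        for dst in PORTS:      # output port it requests
            if src == dst:
                continue       # no U-turn back through the same port
            if src == "Local" or dst == "Local":
                turns.add((src, dst))      # injection/ejection always allowed
            elif DIM[dst] >= DIM[src]:
                turns.add((src, dst))      # same or higher dimension only
    return turns

turns = allowed_turns()
# e.g. the turn ("North", "East") is impossible: Y is never routed before X
```

Connecting the complementary set to constant-0 lets the synthesis tool strip the corresponding crossbar multiplexer inputs and arbiter requests.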
7.1.5 Power Model of Interconnects Using Data-Flow Matrices

The Ratatoskr NoC simulator generates CA data-flow matrices (one for each link), M, needed for our highly accurate high-level model of the link power consumption from Chap. 6. For each flit type, a matrix indicates the frequency of flit transitions from one data type to another, as well as the number of flit transitions between the same data type, on the respective link. The M-entries (defining the expected data flow over the link) depend on the application (i.e., the NoC traffic) and the NoC architecture (virtual channel count/arbitration, routing, etc.). A typical scenario for the generation of data-flow matrices is shown in Fig. 7.5. In the upper part of the figure, an application is shown. It consists of a sender and two receivers; the data for each receiver have different pattern types, i.e., different switching (T) and bit (p) probabilities. In the lower part, a simple 3 × 2 NoC is shown with mesh topology and dimension-order routing. For the sake of simplicity, only routers and links are shown; NIs and PEs are linked to the routers at the same location. The sender is mapped to the upper-right PE. The receivers are mapped to two PEs on the left-hand side of the NoC.1 We use the colored Petri net application model (Sect. 5.2), in which each transmission between two tasks is annotated with a color ψi from the set of all colors Ψ = {ψ1, . . . , ψn}. A data-flow matrix, therefore, denotes the transitions between different colors on that link and also denotes whether the link was idle or used. As introduced, for n colors, each data-flow
Fig. 7.5 Schematic view of the construction of data-flow matrices (two pattern types, p1 and p2, with different bit probabilities are transmitted over an exemplary 2 × 3 NoC)
1 Actually, to comply with the model from Chap. 5, two different places are required to send two different pattern types, which are mapped to the same PE. For the sake of simplicity, we depict a single place.
matrix M has the size [0, 1]^((2n+3)×(2n+3)). It has a row and a column for each color, both as active (data of that color have been sent) and as idle (the last time the link was active, data of that color were sent). Head flits form an extra data/flit type, which is why two more rows and columns are added to the matrices (for active and idle). Initially, after reset, the link contains all zeros or all ones, depending on the flip-flop reset value. Thus, the matrix has an extra row and column for this initial idle state, used until data are sent for the first time via the link. This method of saving the data-flow matrix M is superior to saving a whole protocol of the transmitted flit types (out of which the M matrices are calculated in a post-simulation step), since the latter requires memory that increases linearly with the number of simulated clock cycles, i.e., the trace of the link has a memory complexity of O(t), with t as the simulation time. The effort to save the data-flow matrices, however, is constant because the matrices do not grow with the simulation time, i.e., the generation of the matrices has a memory complexity of O(n²). Naturally, the execution-time complexity is identical for both methods, since the whole simulation must be executed for t clock cycles (O(t)). Let us consider a toy example using a simulation of a router without VCs, which is passed by two data streams with different colors and different injection rates of 0.5625 flits/cycle and 0.1875 flits/cycle. This results in a transmission matrix such as:
                to:  idle   (head,d) (head,i) (ψ1,d)  (ψ1,i)  (ψ2,d)  (ψ2,i)
  idle              0.010    0.000    0.042    0.005     −       −       −
  (head, data)        −        −        −      0.041   0.113   0.114     −
  (head, idle)        −        −      0.014      −     0.039   0.006     −
  (ψ1, data)          −        −        −      0.326     −       −       −
  (ψ1, idle)          −        −      0.007    0.126   0.014     −     0.045
  (ψ2, data)          −        −      0.013      −       −       −     0.014
  (ψ2, idle)          −      0.019      −      0.013     −     0.040     −
To highlight the expressiveness of the matrix, this example shows that 32.6% of all transmissions of color ψ1 are followed by the same color (the matrix is read row-wise). Since different colors represent data/flits with different coupling and switching activities, the power consumption of the network can be calculated quickly and yet extremely precisely (with an error in the range of 1%) based on this information. For this purpose, the power models from the previous part of this book are employed.
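The constant-memory accumulation described above can be sketched as a small recorder class: instead of storing the full flit trace (O(t) memory), it counts state transitions online (O(n²) memory). The class and state encoding below are illustrative assumptions, not Ratatoskr's implementation.

```python
# Sketch of online data-flow-matrix accumulation: count transitions between
# consecutive link states instead of storing the whole trace. Names and the
# string-based state encoding are illustrative assumptions.
from collections import defaultdict

class DataFlowMatrix:
    def __init__(self):
        self.counts = defaultdict(int)  # (prev_state, cur_state) -> cycles
        self.cycles = 0
        self.prev = "idle"              # initial post-reset idle state

    def record_cycle(self, state: str):
        """Call once per clock cycle with the link's current state, e.g.
        'idle', '(head,data)', '(psi1,data)', '(psi1,idle)', ..."""
        self.counts[(self.prev, state)] += 1
        self.cycles += 1
        self.prev = state

    def matrix(self):
        """Relative transition frequencies, i.e., the entries of M."""
        return {k: v / self.cycles for k, v in self.counts.items()}

m = DataFlowMatrix()
for s in ["(head,data)", "(psi1,data)", "(psi1,data)", "(psi1,idle)"]:
    m.record_cycle(s)
# m.matrix()[("(psi1,data)", "(psi1,data)")] == 0.25
```

The memory footprint depends only on the number of distinct states, never on the simulated time.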
7.2 Implementation

7.2.1 NoC Simulator in C++/SystemC

Ratatoskr's NoC simulator is implemented in C++ using the SystemC 2.3.3 library, which provides the simulation kernel. Inside the router model, we rely on C++
data structures. Hence, requests and acknowledgments for VC and switch allocation are std::maps. The resulting flat hierarchy is advantageous both for software maintenance and for simulation performance: if we used SystemC modules and communication via ports, we would incur slow context switches during simulation, whereas iterating over a plain data structure offers improved performance.
7.2.2 Router in VHDL

The register-transfer level implementation is written in VHDL. There are three architectural options:

1. A NoC using a conventional router, as introduced in Sect. 7.1.4.
2. A (heterogeneous) NoC using the high-throughput router, as introduced later in Chap. 13 of this book.
3. A (heterogeneous) extra-lightweight NoC with varying VC counts and buffer depths, as introduced later in Chap. 14 of this book.

The NoC can be synthesized for field-programmable gate array (FPGA)-based prototyping, for which we provide a framework. We include an automated method to shrink the crossbar by removing the unused turns defined by the routing algorithm: Python scripts connect the unused turns to a constant-0 value, and the synthesis tools automatically remove them from the crossbar. This method improves the router area. Ratatoskr also provides pseudo-PEs for the routers that inject traffic into the RTL NoC for RTL simulation and verification. They support both synthetic traffic (e.g., uniform random) and trace-based files (e.g., generated from the task model in Ratatoskr's simulator). A so-called "Data Generate Unit" injects packets following the packet length and injection time given in the traces. A "Data Converter Unit" converts the data received from the network to an end-user-friendly format; this allows for:

1. Converting received data back to the original data type for post-processing (like noise analysis);
2. Comparing the received data (from hardware simulation) with the generated data (from high-level simulation) for verification and error analysis; and
3. Reporting of system statistics, just as generated by the higher-level simulation models.

This hardware-level functionality gives the user many options for verification, prototyping, and emulation.
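The trace-driven injection performed by the "Data Generate Unit" can be sketched as a simple replay of (injection time, packet length) entries. The line format shown here is a hypothetical illustration; the actual trace-file format is tool-specific.

```python
# Sketch: parsing a packet trace for trace-driven injection. Each line is
# assumed (hypothetically) to hold '<injection_cycle> <packet_length_flits>';
# the real Ratatoskr trace format may differ.

def parse_trace(lines):
    """Return (cycle, length) injection events, sorted by injection time."""
    events = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip comments and empty lines
        cycle, length = line.split()
        events.append((int(cycle), int(length)))
    return sorted(events)  # inject in time order

trace = ["# cycle length", "10 32", "4 32", "120 8"]
events = parse_trace(trace)
# events == [(4, 32), (10, 32), (120, 8)]
```

A pseudo-PE would then emit one packet of the given length whenever the simulation reaches the corresponding cycle.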
from interconnect import Interconnect, Driver, DataStream, DataStreamProb

# define 3D link
phit_width = 16  # transmitted bits per link (incl. flow-control, ECC, etc.)
wire_spacing, wire_width = 0.6e-6, 0.3e-6
TSV_pitch, TSV_radius, TSV_length = 8e-6, 2e-6, 50e-6  # length is constant
metal_layer = 5
ground_ring = False  # structure used to reduce TSV noise (affects power)
KOZ = 2 * TSV_pitch  # TSV keep-out zone, a manufacturing constraint
driver_40nm_d4 = Driver.predefined('comm_45nm', 4)  # comm. 40-nm driver / or define own via Driver()

# set parameters of a 3D link (TSV array); for 2D links, set TSVs=False
interconnect_3D_dig = Interconnect(B=phit_width, wire_spacing=wire_spacing,
                                   wire_width=wire_width,
                                   metal_layer=metal_layer,
                                   Driver=driver_40nm_d4, TSVs=True,
                                   TSV_radius=TSV_radius, TSV_pitch=TSV_pitch,
                                   TSV_length=TSV_length, KOZ=KOZ,
                                   ground_ring=ground_ring)

# define three data streams with different properties
ds1 = DataStream(ex_samp, B=16, is_signed=False)  # 16-b data stream from specific samples
ds2 = DataStream.from_stoch(N=1000, B=16, uniform=1, ro=0.4)  # random dist.; ro := correlation (1000 samples)
ds3 = DataStream.from_stoch(N=1000, B=16, uniform=0, ro=0.95, mu=2, log2_std=8)  # gaussian

# calculate energy based on the data-flow matrix
E_mean = interconnect_3D_dig.E([ds1, ds2, ds3], data_flow_matrix)

# get maximum length based on target clock frequency
f_clk_dig = 1e9
l_max_3D_dig = interconnect_3D_dig.max_metal_wire_length(f_clk_dig)

# get area of link
TSV_area = interconnect_3D_dig.area_3D
Fig. 7.6 Exemplary usage of the link power model
7.2.3 Power Models in C++/Python

The model for the dynamic energy/power consumption of routers and links is implemented in the NoC simulator analogously to Noxim's power model. It counts buffer writes, buffer reads, head-element pops from a buffer, routing calculations, and crossbar traversals; the power class counts these events. The dynamic link power consumption is generated from data-flow matrices, which the Link class produces during simulation. Our Python scripts link them to the power model using the link size, the switching activities, and the used technology node. Figure 7.6 shows the exemplary usage of the link power model: the mean energy consumption E_mean is calculated from the given link parameters. Moreover, besides the power consumption, the user can obtain other physical design properties from the link model, e.g., the keep-out zone (KOZ) area or the maximum allowed length (in mm) for the given target clock frequency.
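Conceptually, the mean link energy obtained from a data-flow matrix is a frequency-weighted sum of per-transition energies. The sketch below illustrates this idea with placeholder numbers; the real per-transition energies come from the models of Chap. 6, and the function name is an assumption.

```python
# Sketch: mean link energy as a frequency-weighted sum over a data-flow
# matrix. M and E below are illustrative placeholders; Chap. 6's models
# provide the actual per-transition energies.

def mean_energy(M, E):
    """M[i][j]: relative frequency of transition i->j;
    E[i][j]: energy of one such transition (e.g., in pJ)."""
    return sum(m * e
               for row_m, row_e in zip(M, E)
               for m, e in zip(row_m, row_e))

M = [[0.5, 0.2],   # toy 2-state data-flow matrix
     [0.2, 0.1]]
E = [[0.0, 1.0],   # placeholder energies per transition
     [1.0, 2.0]]
e_mean = mean_energy(M, E)  # 0.5*0 + 0.2*1 + 0.2*1 + 0.1*2 = 0.6
```

Because the matrix already condenses the whole simulated traffic, this sum is evaluated once per link rather than once per cycle.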
7.3 Evaluation

7.3.1 Simulation Performance

The simulation time for different injection rates is shown in Fig. 7.7. We simulate a 4 × 4 NoC with 32 flits per packet, 4-flit deep buffers, 4 VCs, and dimension-ordered routing (DOR) with Booksim 2.0 (depicted as green rectangles), Noxim
Fig. 7.7 Relation between simulation performance and injection rate (median simulation time in [s] over the injection rate [flits/cycle] for Booksim 2.0, Noxim, and Ratatoskr)

Fig. 7.8 Relation between simulation performance and network size (median simulation time in [s] over the NoC size in n × n mesh for Booksim 2.0, Noxim, and Ratatoskr)
(orange circles), and our simulator inside Ratatoskr (blue crosses). Uniform random traffic is injected into the network with injection rates from 0.015 flits/cycle to 0.08 flits/cycle. We run ten simulations using Ubuntu 18.04 and one core of an Intel i7-6700 clocked at 4 GHz with 8 GB of main memory. Figure 7.7 reports the run-time statistics. The performance of Noxim and Ratatoskr is independent of the traffic. Ratatoskr is consistently slower than Noxim by approximately 4–8 seconds. We show the simulation time for different network sizes in Fig. 7.8. We do not change the parameters from the previous experiment. Uniform random traffic is injected into the network with an injection rate of 0.03 flits/cycle. The network size varies between 4 × 4 and 10 × 10. Figure 7.8 reports the run-time statistics. The performance relationship between the programs remains the same for all network sizes: Booksim is slower than Ratatoskr, which is slower than Noxim. Ratatoskr is 2× slower than Noxim. This result is not due to the application model; the impact of the application is around 1/30 of the simulation time. The added functionality of the power estimation for links and the detailed buffer statistics costs simulation performance because large data structures must be handled for every link in every clock cycle. This finding can easily be shown using profiling (using gcc -pg). Thus, Ratatoskr is slower than the competition but adds more features; this is an acceptable trade-off for the accuracy of the power model. The simulation time increases linearly with the router count, as expected.
Fig. 7.9 Average flit and packet latency [ns] for different injection rates [flits/cycle]

Table 7.2 Router cost reduction by removing the turns impossible in the routing algorithm

Crossbar      Fully connected                 XYZ routing algorithm
connections   Power [mW]   Area [μm²]        Power [mW]      Area [μm²]
Router        5.40         39168             4.49 (−17%)     37942 (−3%)
Crossbar      183.588      1288              64.005 (−65%)   894 (−31%)
7.3.2 Power, Performance, and Area of the RTL Router Model

We showcase the PPA metrics of our router architecture, as delivered with each simulation. We kindly refer to Chap. 13 for the architectural details. The router performance is shown in Fig. 7.9. The average flit and packet latencies for injection rates from 0.01 to 0.08 flits/cycle are reported using a 4 × 4 × 4 NoC with DOR, 4 VCs, 4-flit deep buffers, and a 1 GHz clock speed. We simulated 100,000 clock cycles using uniform random traffic. We synthesize the router for a 45-nm technology at a 250 MHz frequency to obtain area and power. Table 7.2 shows the results for a router with a fully connected crossbar as the baseline and for a router with DOR (XYZ) and the corresponding turns removed from the crossbar. We report the power consumption in [mW] and the area in [μm²] for both the complete router and its crossbar. One individual inner router has a total cell area of 37899–39168 μm² and a total power consumption of 4.57–5.4 mW, depending on the routing algorithm. These findings highlight the advantages of crossbar area/power reduction using the routing algorithm's properties. The power consumption of the crossbar is reduced by 61 and 65%, respectively, using our power-reduction approach. This method has a 15–17% positive effect on the total router power consumption. The area of the crossbar is reduced by 31–32%. This crossbar reduction improves the total router area by approximately 3%. The power and area enhancements do not affect the router's performance, as the turns removed from the crossbar would never have been taken anyway.
7 Ratatoskr: A Simulator for NoCs in Heterogeneous 3D SoCs
7.4 Case Study: Link Power Estimation and Optimization

This section presents a case study to demonstrate that the power estimation methods from Chap. 6 are suitable for usage in combination with our system simulator. In detail, the usability of our link power model—derived in the previous parts of this book—is quantitatively assessed for an estimation of the link energy consumption in a heterogeneous 3D NoC using Ratatoskr. Furthermore, an exploration of the efficiency of low-power code (LPC) techniques is performed, demonstrating the value of Ratatoskr for architectural power optimization. A heterogeneous 3D vision-system on a chip (VSoC) is considered in this case study. In contrast to mere complementary metal-oxide-semiconductor (CMOS) image sensors, VSoCs capture and process the images on a single chip. This overcomes the limitations of traditional systems due to expensive image transmissions between sensors and processors, especially for high frame rates and resolutions [95]. The NoC architecture of the analyzed VSoC has a flit width of 16 bit with one head and 31 body flits (payload) per packet, supports up to four virtual channels per port, and has an input-buffer depth of four. The full 3D stack consists of three layers: one mixed-signal layer containing six CMOS image sensors (S1–S6) at the top, one underlying memory layer, and at the bottom one digital layer containing the actual processors. In this case study, the transmission of 8-bit grayscale-image pixels (two per flit) from the sensors to the memory is analyzed and optimized toward a low interconnect energy consumption. Since the analyzed NoC uses dimension-order (XYZ) routing [30], and images are read from the memory by the cores, the traffic from the sensors to the memory can be analyzed without considering the traffic of the cores. Thus, the structure sketched in Fig. 7.10 is analyzed.
In total, seven different data/flit types are transmitted over the links: one for head flits, H, and six for body flits, B1–B6 (one per source). Each of the six image sensors in the mixed-signal layer is linked to one router, R1–R6, which are connected by 2D/metal-wire links. Router R5 is connected via a 3D/TSV link with router R7 in the memory layer. Connected to R7 is a memory block to store the sensor data. Thus, over the links connecting R1→R2, R3→R2, R4→R5, and R6→R5, only data stemming from one sensor and head flits are transmitted, resulting in unused virtual channels. Usage of virtual channels occurs in the links R2→R5 and R5→R7, as they are part of the transmission paths from multiple sensors to the memory. In this section, the same metal-wire and TSV structures as in the previous sections of this chapter are considered. A simulation of the analyzed 3D-NoC architecture with our Ratatoskr NoC simulator is executed for a mean flit injection rate of 20% per sensor element. To allow for subsequent bit-level simulations (to obtain reference values), the simulator is modified so that it saves the complete log of the transmitted flits. After the simulation, the energy quantities per transmitted packet are determined based on: first, the proposed method from Chap. 6; second, the standard method used
Fig. 7.10 Part of the 3D VSoC analyzed in the case study

Table 7.3 Results of the NoC case study

                         Energy per transmitted packet [pJ]
  Data                   Bit-level sim.     Proposed model     Standard model [116]
  Unencoded              4.18               4.15               2.40
  Gray encoded           4.11 (−1.67%)      4.09 (−1.44%)      2.23 (−7.08%)
  Correlator encoded     2.69 (−35.64%)     2.68 (−35.42%)     2.71 (+12.92%)
in Ref. [43, 116, 187] (neglects effects of data-stream multiplexing); and, third, exact bit-level simulations.2 A perfect knowledge of the bit-level statistics of the individual data streams is assumed. Two different traffic scenarios are analyzed, representing realistic use cases of a VSoC. In the first scenario, all six sensors capture road images with a resolution of 512 × 512 pixels during daylight and, in the second scenario, during the night. These particular traffic scenarios are chosen as they result in relatively high errors for the proposed technique, which requires zero cross-correlation between the individual data streams to be exact. This assumption is not necessarily correct if all sensors capture pictures of the same environment from different perspectives. The results are presented in the first row of Table 7.3. They show that for realistic NoC-traffic scenarios, the proposed model can precisely predict the energy consumption (error below 1%). In contrast, a high-level model that neglects the effect of data-stream multiplexing leads to an error of almost 50%, even though, in the analyzed scenario, less than 50% of the links make use of VCs. However,
2 Circuit simulations to obtain the energy quantities on the lowest levels of abstraction are here not possible due to the complexity of the system.
the two links which use virtual channels show the highest energy consumption, and, for these links, the traditional model leads to an underestimation of the energy consumption by more than a factor of 4. Additionally, the integration of the two low-power techniques correlator coding and Gray coding [83] for the body flits is investigated. These bit-overhead-free LPCs are analyzed because a wider bit width for the codewords would increase the buffer cost of a router, which is a crucial concern in NoCs. Gray encoding reduces the switching activities for a sequential transmission of the highly correlated pixels, while a correlator mainly affects the bit probabilities. As outlined in Chap. 6, correlator coding shows an excellent coding efficiency for multiplexed correlated data streams, while its efficiency without multiplexing is very limited. The energy quantities for the transmission of the encoded data are shown in the second and third rows of Table 7.3. Furthermore, the (estimated) percentage reductions in the energy consumption of the links due to the LPC techniques are shown in parentheses. The presented method—as well as the results from the exact bit-level simulations—indicates that the correlator leads to a drastically better coding gain (36% reduction in the link-energy requirements compared to slightly above 1%). In contrast, the standard/reference model to estimate the bit-level statistics falsely predicts a much higher coding efficiency for Gray coding and even a significant increase in the energy consumption for the correlator coding. This underlines again that using any model other than our link power model can result in the implementation of inefficient low-power optimization techniques and a dramatic underestimation of the power requirements.
7.5 Conclusion

In this chapter, we introduced Ratatoskr, this book's open-source framework for in-depth PPA analysis of NoCs for heterogeneous 3D ICs. We offer power estimation of routers and links at the CA level. The accuracy of the dynamic power estimation of links is within 2.4% of bit-level simulations while maintaining CA simulation speed. The shipped hardware implementation of the routers can be synthesized for any standard-cell technology. The whole framework is targeted to be user-friendly: it uses single configuration points to set the most important design parameters easily, but more detailed and complex configuration options are also available. The framework's output is a set of comprehensive and insightful reports. An evaluation of the proposed simulator with the power estimation technique introduced in Chap. 6 showed that using standard models results in an underestimation of the interconnect power consumption by up to a factor of 4×. Moreover, the standard models were shown to miscalculate the gains of LPCs dramatically. In contrast, the power consumption of the NoC links, as well as the coding gains, are estimated precisely when existing models are extended by the proposed method (all errors below 1%).
Thus, by integrating the physical and bit-level models from the previous chapters into our CA NoC simulator, we proposed a simulation framework that enables an extremely precise estimation of the NoC performance metrics in heterogeneous 3D-integrated systems. This framework enables not only a precise estimation but also a systematic optimization of the key NoC metrics.
Part IV
3D-Interconnect Optimization
Chapter 8
Low-Power Technique for 3D Interconnects
The edge effects, as well as the metal-oxide-semiconductor (MOS) effect, outlined in Chap. 4 of this work, can significantly impact the through-silicon via (TSV) power consumption. Nevertheless, existing low-power techniques neither consider nor exploit these effects, as they have been designed for traditional metal-wire interconnects, which do not show these effects. Low-power techniques designed explicitly for TSV-based interconnects are still lacking, mainly due to the previous absence of a pattern-dependent model for the TSV power consumption. Such a model was presented in Part II of this book. Consequently, efficient low-power techniques for TSV-based 3D interconnects can now be derived. In this chapter, a coding-based low-power technique for 3D interconnects is presented. This technique is based on the observation that the varying bit-level statistics of typical data streams in modern systems on chips (SoCs) can be exploited to effectively reduce the TSV power consumption by an intelligent net-to-TSV assignment. The reason for this is the strong heterogeneity in the TSV capacitances due to the edge effects. Furthermore, an assignment of some bits as inverted (i.e., logically negated) can further decrease the power consumption, mainly due to the MOS effect. Moreover, the proposed power-optimal assignment can boost the efficiency of traditional low-power codes (LPCs) for TSV-based 3D interconnects. The presented approach only affects the local net-to-TSV assignments within the individual TSV arrays, while the global net-to-TSV-bundle assignment remains routing-optimal. A modification in the local net-to-TSV assignment, even in the worst case, has a negligible impact on the parasitics of the full 3D-interconnect paths for reasonable array sizes, as shown in this chapter. Thus, the implementation costs of the proposed technique are negligibly low.
A key contribution of this chapter is a formal method to find the power-optimal net-to-TSV assignment (including inversions) for any given TSV arrangement by means of the formulas for the estimation of the TSV power consumption and capacitances derived in Part II of this book. The method considers the bit-level
statistics of the data that is transmitted over the nets and the capacitances of the TSV array to find the power-optimal assignment. To overcome the necessity to have in-depth knowledge about the transmitted data and the TSV capacitances, systematic net-to-TSV assignments, which are generally valid for the transmission of correlated and normally distributed data words, are contributed as well. These systematic assignments are derived by means of the insights provided by Part II of this book. The proposed technique is evaluated for a broad set of data streams and modern TSV arrangements of different sizes. These analyses show that the proposed technique can reduce the TSV power consumption by over 45%, despite its negligible implementation costs. The remainder of this chapter is structured as follows. First, the fundamental idea of the proposed technique is outlined in Sect. 8.1. Subsequently, a formal method to determine the power-optimal TSV assignment is derived in Sect. 8.2. Systematic assignments for correlated or normally distributed patterns are presented in Sect. 8.3. Afterward, the combination of the proposed technique with traditional LPC techniques is discussed in Sect. 8.4. In Sect. 8.5, the proposed method is evaluated in-depth. Finally, the chapter is concluded.
8.1 Fundamental Idea

In this section, the fundamental idea to minimize the TSV power consumption through a power-optimal, fixed, net-to-TSV assignment is presented. The TSV-assignment technique is embedded at the end of the detailed routing step of the physical design and does not result in a TSV overhead. In detail, the technique only affects the layout-generation step at which a set of source nets, located at an edge of a TSV array in the ith layer, have to be electrically connected with the sink nets in the adjacent (i + 1)th/(i − 1)th layer through a predetermined TSV array. The layout for an exemplary 3 × 3 array before and after this step is illustrated in Fig. 8.1a, b, respectively. Each source net must be connected to a TSV through a metal wire. Additionally, the associated sink net in the adjacent layer must be connected to the other end of the same TSV. The standard approach performs this task with the aim of minimizing the wiring length (routing-minimal assignment). Instead, the proposed technique performs this task with the intent of minimizing the TSV power consumption. For this purpose, two variation possibilities in the net-to-TSV assignment are taken into account simultaneously: first, interleaving the net-to-TSV assignment; and, second, assigning nets negated by swapping inverting drivers with non-inverting drivers in the source and the sink layer. Interleaving has promising potential to reduce the TSV power consumption due to the heterogeneity in the TSV capacitances, arising from the edge effects outlined in Sect. 4.2.2. Inversions swap the logical 1-bit and 0-bit probabilities for the corresponding nets, which can reduce the power consumption through the TSV MOS effect outlined in Sect. 4.2.1.
Fig. 8.1 Simplified TSV-array layout: (a) before local/detailed TSV routing; (b) after local TSV routing
Furthermore, if the logical bit values b_i and b_j on two nets with temporally aligned signal edges have a negative switching correlation (i.e., γ_{i,j} = E{Δb_i Δb_j} < 0), an inversion can improve the power consumption. As shown in Sect. 3.1, a negative switching correlation increases the power consumption due to the coupling capacitance between the two TSVs over which b_i and b_j are transmitted, compared to the case of uncorrelated switching on the two lines (coupling effect). However, after negating one of the two lines, the sign of the switching correlation is changed since Δ(¬b_i)Δb_j = −Δb_i Δb_j, resulting in a positive switching correlation (i.e., γ_{i,j} > 0) and consequently a reduced power consumption. Examples of bit pairs with a negative switching correlation are bits that belong to the same one-hot encoded signal, as shown in Sect. 6.1.3.
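The sign flip under inversion can be checked numerically. The sketch below (illustrative, not from the book) estimates γ_{i,j} from sampled bit streams and verifies that inverting one net negates the switching correlation:

```python
# Sketch (assumed helper names): estimate gamma_ij = E{db_i * db_j}
# from a sampled bit stream and check that inverting one net flips
# the sign of the switching correlation.
import numpy as np

def switching_correlation(bits_i, bits_j):
    """gamma_ij = E{(b_i[k] - b_i[k-1]) * (b_j[k] - b_j[k-1])}."""
    di = np.diff(bits_i.astype(int))
    dj = np.diff(bits_j.astype(int))
    return np.mean(di * dj)

rng = np.random.default_rng(0)
# Complementary pair: when one bit rises the other falls -> negative gamma.
b_i = rng.integers(0, 2, 10_000)
b_j = 1 - b_i
gamma = switching_correlation(b_i, b_j)
gamma_inv = switching_correlation(1 - b_i, b_j)  # invert net i
assert gamma < 0
assert np.isclose(gamma_inv, -gamma)  # inversion flips the correlation sign
```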
8.2 Power-Optimal TSV Assignment In this section, a formal method to determine the power-optimal net-to-TSV assignment is derived. For this purpose, the formulas that were derived in Chaps. 3 and 4 of this book are used.
If the logical value on the ith net in the kth cycle is symbolized as b_i[k], the matrix notation of the power consumption (Eq. (3.47) on Page 65) expresses the power consumption for the initial assignment:1

  P_{init} = \frac{V_{dd}^2 f}{2} \sum_{i=1}^{n} \bar{C}_{eff,i} = \frac{V_{dd}^2 f}{2} \langle S_E, C \rangle.   (8.1)
In this equation, the first term, V_{dd}^2 f / 2, depends on the power-supply voltage and the clock frequency. This term can only be affected at the circuit level, not at higher abstraction levels. The metric that can be optimized at higher abstraction levels, and thus through changing the net-to-TSV assignment, is the sum of the mean effective capacitances:

  \sum_{i=1}^{n} \bar{C}_{eff,i} = \langle S_E, C \rangle.   (8.2)
To systematically determine the power-optimal assignment, the effect of a certain assignment on the switching matrix, S_E, as well as on the capacitance matrix, C, must be expressed mathematically. First, a method to formally express the assignment is required. A mere reordering of the net-to-TSV assignment could be mathematically expressed using the concept of the permutation matrix [39]. A permutation matrix is a matrix with exactly one 1-entry in each row and column while all other entries are 0. If an assignment of the ith net to the jth TSV is expressed by a 1 on matrix entry (j, i), each permutation matrix represents a valid assignment (i.e., each net is assigned to exactly one TSV and vice versa). Furthermore, in that case, the set of all permutation matrices represents the set of all possible assignments that do not include inversions of nets. However, inversions are an additional assignment variant in the proposed technique. Thus, a signed form of the permutation matrix, represented by I_σ, is used here. Such a signed permutation matrix has exactly one 1 or −1 in each row and column while all other entries are 0. If the ith net is assigned negated to the jth TSV, the (j, i) entry of the signed permutation matrix is set to −1. Still, the (j, i) entry is set to 1 if the net is assigned as non-negated. Thus, for an exemplary 2 × 2 TSV array, if net 1 is assigned to TSV 2, net 2 to TSV 3, net 3 negated to TSV 1, and net 4 to TSV 4, the assignment matrix is

  I_\sigma = \begin{bmatrix} 0 & 0 & -1 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}.   (8.3)

1 An initial assignment implies here that the ith net is assigned to the ith TSV.
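A small sketch (with hypothetical helper names and 0-based indices) can make the signed-permutation bookkeeping concrete; it builds the matrix of Eq. (8.3) and checks its defining properties:

```python
# Sketch (illustrative, not the book's code): building the signed
# permutation matrix of Eq. (8.3) and checking its defining properties.
import numpy as np

def assignment_matrix(n, mapping):
    """mapping: net i -> (tsv j, sign); entry (j, i) is set to sign."""
    I_sigma = np.zeros((n, n), dtype=int)
    for net, (tsv, sign) in mapping.items():
        I_sigma[tsv, net] = sign
    return I_sigma

# Net 1 -> TSV 2, net 2 -> TSV 3, net 3 negated -> TSV 1, net 4 -> TSV 4
# (written with 0-based indices below).
I_sigma = assignment_matrix(4, {0: (1, 1), 1: (2, 1), 2: (0, -1), 3: (3, 1)})
# Exactly one nonzero entry (+-1) per row and per column:
assert (np.abs(I_sigma).sum(axis=0) == 1).all()
assert (np.abs(I_sigma).sum(axis=1) == 1).all()
# A signed permutation matrix is orthogonal: I_sigma @ I_sigma.T = identity.
assert (I_sigma @ I_sigma.T == np.eye(4, dtype=int)).all()
```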
In the following, formulas to express the switching matrix and the capacitance matrix as a function of the assignment, I_σ, are derived. A Boolean inversion of a net does not affect the switching activity of the logical value transmitted over a TSV, as 0-to-1 transitions are transformed into 1-to-0 transitions and vice versa. However, the inversion negates/sign-changes the switching correlations with all other TSV signals. Thus, the switching matrix, defined by Eq. (3.46) on Page 65, is here divided into two sub-matrices to separate the switching activities (i.e., the α_i values) and the switching correlations (i.e., the γ_{i,j} values) of the nets that have to be assigned to the TSVs:

  S_E = S_\alpha 1_{n \times n} - S_\gamma,   (8.4)

where S_α is a matrix with the α_i values on the diagonal entries and zeros on the non-diagonal entries:

  S_{\alpha,i,j} = \begin{cases} \alpha_i & \text{for } i = j \\ 0 & \text{else.} \end{cases}   (8.5)

S_γ contains the γ_{i,j} values on the non-diagonal entries and zeros on the diagonal entries:

  S_{\gamma,i,j} = \begin{cases} 0 & \text{for } i = j \\ \gamma_{i,j} & \text{else.} \end{cases}   (8.6)

1_{n×n} is an n × n matrix of ones. Afterward, the effect of a reassignment on S_E is expressed as

  S_{E,assign} = S_{\alpha,assign} 1_{n \times n} - S_{\gamma,assign} = I_\sigma S_\alpha I_\sigma^T 1_{n \times n} - I_\sigma S_\gamma I_\sigma^T.   (8.7)

As desired, the switching activities are merely reordered, since the minuses in I_σ cancel each other out on the diagonal entries for a left-hand-side multiplication with I_σ and a right-hand-side multiplication with I_σ^T. In contrast, the signs of the switching correlations are changed if some nets are assigned negated. If all signal edges on the different nets are completely temporally misaligned, all switching correlations, and thus all entries of S_γ, are 0. Consequently, Eq. (8.7) simplifies in this scenario to

  S_{E,assign} = S_{\alpha,assign} 1_{n \times n} = I_\sigma S_\alpha I_\sigma^T 1_{n \times n}.   (8.8)
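The behavior stated above — activities are merely reordered while correlations involving negated nets change sign — can be verified numerically. The following sketch uses made-up statistics and the example assignment of Eq. (8.3) with 0-based indices:

```python
# Sketch verifying Eq. (8.7) numerically with made-up statistics:
# the assignment reorders the switching activities but sign-flips
# correlations of negated nets.
import numpy as np

n = 4
alpha = np.array([0.5, 0.3, 0.2, 0.1])
S_alpha = np.diag(alpha)
rng = np.random.default_rng(1)
G = rng.normal(size=(n, n))
S_gamma = (G + G.T) / 2          # symmetric toy correlation matrix
np.fill_diagonal(S_gamma, 0.0)

# Net 0 -> TSV 1, net 1 -> TSV 2, net 2 negated -> TSV 0, net 3 -> TSV 3.
I_sigma = np.zeros((n, n), dtype=int)
for net, (tsv, sign) in {0: (1, 1), 1: (2, 1), 2: (0, -1), 3: (3, 1)}.items():
    I_sigma[tsv, net] = sign

S_alpha_a = I_sigma @ S_alpha @ I_sigma.T
S_gamma_a = I_sigma @ S_gamma @ I_sigma.T
# Activities land unchanged on the target TSVs (signs cancel on the diagonal):
assert np.allclose(np.diag(S_alpha_a), alpha[[2, 0, 1, 3]])
# Correlations involving the negated net 2 change sign:
assert np.isclose(S_gamma_a[0, 1], -S_gamma[2, 0])
```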
Second, a formula to express the TSV capacitances as a function of the net-to-TSV assignment is derived. As shown in Chap. 4, the following equation can be used to estimate the size of a capacitance for an assignment of the nets i and j to the TSVs i and j:

  C_{i,j} = C_{0,i,j} + \frac{\Delta C_{i,j}}{2} (p_i + p_j).   (8.9)

Here, the requirement is a formula where an inversion of the bits/nets leads to simple negations in the formula to use the signed permutation matrix. Thus, a shifted form of Eq. (8.9) is used:

  C_{i,j} = C_{R,i,j} + \frac{\Delta C_{i,j}}{2} (\epsilon_i + \epsilon_j),   (8.10)

where C_{R,i,j} is the capacitance value for all 1-bit probabilities, p_i, equal to 1/2 (i.e., C_{R,i,j} = C_{0,i,j} + \Delta C_{i,j}/2). \epsilon_i is mathematically expressed as

  \epsilon_i = p_i - \frac{1}{2} = E\{b_i\} - \frac{1}{2}.   (8.11)

Since E{¬b_i} = 1 − E{b_i}, an inversion of b_i negates the related ε_i value. Thus, the capacitance matrix as a function of I_σ is formulated as follows:

  C_{assign} = C_R + \frac{\Delta C}{2} \circ \left( I_\sigma \epsilon 1_n^T + 1_n \epsilon^T I_\sigma^T \right),   (8.12)

where C_R and ΔC are matrices containing the C_{R,i,j} and ΔC_{i,j} values, respectively, ε is the vector containing the ε_i values, and 1_n is a vector of n ones. Summarized, the assignment-dependent power consumption is

  P_{assign} = \frac{V_{dd}^2 f}{2} \langle S_{E,assign}, C_{assign} \rangle = \frac{V_{dd}^2 f}{2} \left\langle I_\sigma S_\alpha I_\sigma^T 1_{n \times n} - I_\sigma S_\gamma I_\sigma^T,\; C_R + \frac{\Delta C}{2} \circ \left( I_\sigma \epsilon 1_n^T + 1_n \epsilon^T I_\sigma^T \right) \right\rangle.   (8.13)

Thus, the power-optimal bit assignment, I_{σ,power-opt}, fulfills

  I_{\sigma,\mathrm{power\text{-}opt}} = \arg\min_{I_\sigma \in S_{I_\sigma,n}} \langle S_{E,assign}, C_{assign} \rangle,   (8.14)

where S_{I_σ,n} is the set of valid signed permutation matrices of shape n × n. In practice, I_{σ,power-opt} can be determined with any of the several optimization tools available to reduce the computational complexity compared to iterating over the full search space. Although up to several hundreds of TSVs exist in modern 3D integrated circuits (ICs), the run-time to perform the proposed optimization in the
TSV assignment is relatively low, as the optimization is applied to each TSV bundle individually, and the individual bundle sizes are relatively small. Here, simulated annealing [240] is used as an example to determine the "power-optimal" assignment for a TSV array.
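As an illustration of this optimization step, the following sketch minimizes ⟨S_E,assign, C_assign⟩ by simulated annealing over swap and inversion moves. All names, statistics, and capacitance values are made up; the cost function applies a candidate (permutation, inversion) directly to the net statistics rather than evaluating Eq. (8.13) in matrix form:

```python
# Illustrative sketch (not the book's tool): simulated annealing over
# signed permutations to minimize the sum of effective capacitances.
import numpy as np

def cost(perm, inv, alpha, gamma, C_R, dC, eps):
    """Total effective capacitance when net perm[j] (sign inv[j]) drives TSV j."""
    a = alpha[perm]                                      # reordered activities
    g = gamma[np.ix_(perm, perm)] * np.outer(inv, inv)   # sign-flipped correlations
    e = eps[perm] * inv                                  # sign-flipped eps values
    C = C_R + 0.5 * dC * (e[:, None] + e[None, :])       # capacitances, Eq. (8.10)
    S_E = a[:, None] * np.ones_like(C) - g               # switching matrix, Eq. (8.4)
    return float((S_E * C).sum())

def anneal(alpha, gamma, C_R, dC, eps, steps=20_000, T0=1.0, seed=0):
    rng = np.random.default_rng(seed)
    n = len(alpha)
    perm, inv = np.arange(n), np.ones(n, dtype=int)
    cur = cost(perm, inv, alpha, gamma, C_R, dC, eps)
    best_perm, best_inv, best = perm.copy(), inv.copy(), cur
    for k in range(steps):
        p, v = perm.copy(), inv.copy()
        if rng.random() < 0.5:                           # move 1: swap two TSVs
            i, j = rng.choice(n, 2, replace=False)
            p[[i, j]] = p[[j, i]]
        else:                                            # move 2: toggle one inversion
            v[rng.integers(n)] *= -1
        c = cost(p, v, alpha, gamma, C_R, dC, eps)
        T = T0 * (1.0 - k / steps) + 1e-9                # linear cooling schedule
        if c < cur or rng.random() < np.exp((cur - c) / T):
            perm, inv, cur = p, v, c
            if cur < best:
                best_perm, best_inv, best = perm.copy(), inv.copy(), cur
    return best_perm, best_inv, best

# Toy 6-TSV instance with random, symmetric statistics and capacitances.
rng = np.random.default_rng(42)
n = 6
alpha = rng.uniform(0.05, 0.5, n)
G = rng.normal(scale=0.05, size=(n, n)); gamma = (G + G.T) / 2
np.fill_diagonal(gamma, 0.0)
B = rng.uniform(1.0, 3.0, (n, n)); C_R = (B + B.T) / 2
D = rng.uniform(0.1, 0.5, (n, n)); dC = (D + D.T) / 2
eps = rng.uniform(-0.3, 0.3, n)

perm, inv, best = anneal(alpha, gamma, C_R, dC, eps, steps=5_000)
baseline = cost(np.arange(n), np.ones(n, dtype=int), alpha, gamma, C_R, dC, eps)
assert best <= baseline  # never worse than the initial assignment
```

Since the best-seen state is tracked separately, the returned assignment can never be worse than the initial one, even though the annealer occasionally accepts uphill moves.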
8.3 Systematic Net-to-TSV Assignments

In some scenarios, a representative number of samples of the transmitted data, required to precisely determine the exact bit-level statistics of the nets, may not be available at design time. Alternatively, the exact capacitance values may not be known. In such cases, the power-optimal assignment cannot be determined by means of Eq. (8.14). However, the essential characteristics of the data and the capacitances can be used to obtain systematic assignments. Furthermore, this approach results in regular assignments that are generally valid for a broad set of data streams and TSV technologies. As examples, this section proposes systematic assignments that are generally applicable for normally distributed and correlated data streams, as they constitute two of the most common data types in SoCs.2 In the following, the term "assigning bit i to TSV j" is used as an expression for "assigning the net over which the ith bit of the data words is transmitted to the jth TSV" to increase the readability.

As shown in Sect. 6.1, for mean-free normally distributed as well as correlated data words, 1 and 0 bits are equally distributed on all lines (i.e., ε_i = 0 for all i). Hence, the TSV capacitance matrix is simply C_R for the transmission of such data streams, which is why assigning nets negated cannot optimize the capacitances through the MOS effect. In fact, inversions can only result in an increased power consumption, as they potentially destroy the positive switching correlations of the most significant bits (MSBs) of normally distributed data streams, outlined in Sect. 6.1.2. Consequently, the systematic assignments presented in the following do not include inversions. First, positively correlated and uniformly distributed data is considered (e.g., sequential data streams). The bit-level statistics of such data streams have been outlined in Sect. 6.1.4 of this book.
All logical bit values switch completely uncorrelated (i.e., all γ_{i,j} values are equal to 0). Thus, S_{E,i,j} is simply equal to the switching activity α_i. Hence, the power consumption for the initial assignment simplifies to

  P_{corr,init} = \frac{V_{dd}^2 f}{2} \langle S_E, C_R \rangle = \frac{V_{dd}^2 f}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} C_{R,i,j} \, \alpha_i.   (8.15)
2 The bit-level statistics for correlated and normally distributed data streams have been outlined in Sect. 6.1.
Fig. 8.2 Systematic net-to-TSV assignments: (a) Spiral mapping for correlated data words; (b) Sawtooth mapping for normally distributed data words
Therefore, bits with the highest switching activity, α_i, should optimally be reassigned to TSVs with the lowest overall capacitance, \sum_{j=1}^{n} C_{R,i,j}, and vice versa, as this minimizes the power consumption. The TSVs in the array corners have the lowest overall capacitance, and edge TSVs have a lower overall capacitance than TSVs in the middle of an array, as shown in Chap. 4. Thus, optimally, the bits with the highest self-switching are assigned to the TSVs at the array corners. The bits with the next highest switching activities are assigned to the array edges. Remaining bits are assigned to the array middle. The switching activity decreases with an increase in the significance of the bit for correlated data streams, as shown in Chap. 6. Hence, the proposed systematic assignment forms a spiral starting at the least significant bit (LSB) and ending at the MSB, illustrated in Fig. 8.2a. The proposed spiral mapping is validated for various sequential data streams with varying branch probability, each containing 1 × 10^5 samples.3 With the branch probability, the correlation of subsequent data words varies. The resulting power-consumption reductions due to the spiral mapping and due to an optimal assignment are shown in Fig. 8.3 for two different TSV arrays: first, a 4 × 4 array with a TSV radius of 2 μm and a minimum pitch of 8 μm is considered; moreover, a 5 × 5 array with a TSV radius of 1 μm and a minimum pitch of 4 μm is investigated. All power-consumption values reported in this section are calculated with the proposed high-level formula, employing exact capacitance quantities obtained by parasitic extractions with the Q3D Extractor for the TSV-array model from Sect. 1.3.1. Thereby, a significant signal frequency of 6 GHz is considered for the parasitic extraction. The results in Fig. 8.3 show that the power savings for both assignment techniques, optimal and systematic, are almost equal. Thus, the analysis
3 In the evaluation section of this chapter, the efficiency of the mapping for real correlated image data streams is analyzed.
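The spiral mapping itself is easy to generate programmatically. The sketch below (illustrative, not the book's tool) produces the bit-to-TSV positions of Fig. 8.2a for an arbitrary array, with the LSB on a corner and the MSB ending in the array middle:

```python
# Sketch: generating a spiral bit-to-TSV order for a rows x cols array.
# Bit 0 (the LSB, highest switching activity) lands on a corner; the
# last bit (the MSB) ends up in the array middle, where the total TSV
# capacitance is largest.
def spiral_order(rows, cols):
    """Return [(row, col), ...]: position of bit 0 (LSB) first."""
    order, top, bottom, left, right = [], 0, rows - 1, 0, cols - 1
    while top <= bottom and left <= right:
        order += [(top, c) for c in range(left, right + 1)]          # top edge
        order += [(r, right) for r in range(top + 1, bottom + 1)]    # right edge
        if top < bottom:
            order += [(bottom, c) for c in range(right - 1, left - 1, -1)]
        if left < right:
            order += [(r, left) for r in range(bottom - 1, top, -1)]
        top, bottom, left, right = top + 1, bottom - 1, left + 1, right - 1
    return order

pos = spiral_order(3, 3)
assert pos[0] == (0, 0) and pos[-1] == (1, 1)  # LSB on a corner, MSB in the center
```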
Fig. 8.3 Reduction in the TSV power consumption due to the proposed power-optimal net-to-TSV assignment and the systematic spiral mapping for sequential data streams with varying branch probabilities (see Sect. 6.1.4)
proves the optimal nature of the systematic spiral mapping for uniformly distributed and correlated data words.

As a second scenario, a systematic assignment for normally distributed, but uncorrelated, data is derived. Such data implies that the switching probability of each bit is 1/2. Thus, the power consumption for an initial assignment can be expressed as:

  P_{normal\text{-}dist,init} = \frac{V_{dd}^2 f}{2} \langle S_E, C_R \rangle = \frac{V_{dd}^2 f}{2} \sum_{i=1}^{n} \left( \frac{1}{2} C_{R,i,i} + \sum_{\substack{j=1 \\ j \neq i}}^{n} \left( \frac{1}{2} - \gamma_{i,j} \right) C_{R,i,j} \right).   (8.16)
Thus, to minimize the power consumption, bit pairs with a strong switching correlation, γ_{i,j}, have to be assigned to TSV pairs connected by a large coupling capacitance, C_{R,i,j}, in order to reduce the contribution of the capacitance to the power consumption as much as possible. The biggest coupling capacitances in an array are located between the four corner TSVs and their two directly adjacent edge TSVs due to the edge effects outlined in Chap. 4. In Sect. 6.1.2, it was shown that the MSBs of normally distributed signals have the highest switching correlation. Thus, the second systematic mapping assigns the MSB onto a corner TSV and the next lower significant bit onto one of its directly adjacent edge TSVs. The following bits are then assigned recursively by finding the TSV in the array that has the biggest accumulated coupling capacitance
Fig. 8.4 Reduction in the TSV power consumption due to the proposed assignment techniques for uncorrelated (ρ = 0), negatively correlated (ρ < 0), and positively correlated (ρ > 0) normally distributed data words
with all previously assigned TSVs. Figure 8.2b illustrates the resulting systematic assignment. Over the first two rows, the bits, from the MSB downwards, are mapped in a sawtooth manner. From the third row on, a simple row-by-row mapping is used. This second systematic assignment is referred to as sawtooth mapping in the following. Figure 8.4a shows the reduction in the power consumption due to the proposed optimal assignment technique and the systematic sawtooth mapping for the transmission of Gaussian-distributed 16-bit data streams with varying standard deviation, σ, over a 4 × 4 TSV array (r_tsv = 2 μm and d_min = 8 μm). The results underline the optimal nature of the sawtooth mapping for normally distributed, but temporally uncorrelated, data streams. In some applications, normally distributed signals occur that are also temporally correlated. In these cases, the optimal net-to-TSV assignment is not as trivial and depends on the correlation quantities. As shown in Fig. 8.4, the sawtooth mapping leads to the lowest power consumption (reduction up to 40%) for negatively correlated (i.e., ρ < 0) and normally distributed data words. However, neither the sawtooth nor the spiral mapping results in the lowest possible power consumption for positively correlated and normally distributed data words. Nevertheless, compared to a random assignment, both approaches still lead to a significant improvement in the power consumption. In summary, if it is not possible to determine the optimal assignment by means of Eq. (8.14), the proposed sawtooth mapping should be applied for normally distributed signals and the spiral mapping for primarily temporally correlated signals. Following the same approach, systematic assignments can be derived for other scenarios as well.
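The greedy rule behind the sawtooth mapping — seed a corner TSV with the MSB, then repeatedly pick the free TSV with the largest accumulated coupling capacitance to the already-assigned TSVs — can be sketched as follows (the capacitance matrix is a toy example, not extracted data):

```python
# Illustrative sketch of the greedy rule behind the sawtooth mapping.
# C_R here is a toy, symmetric coupling-capacitance matrix.
import numpy as np

def greedy_assignment(C_R, corner=0):
    """Return the TSV visiting order; position k receives bit k (k = 0 is the MSB)."""
    n = C_R.shape[0]
    assigned = [corner]
    free = [t for t in range(n) if t != corner]
    while free:
        # Pick the free TSV most strongly coupled to the assigned set.
        nxt = max(free, key=lambda t: C_R[t, assigned].sum())
        assigned.append(nxt)
        free.remove(nxt)
    return assigned

# Toy 2x2 array: adjacent pairs coupled by 1.0, the diagonal pair by 0.2.
C_R = np.array([[0.0, 1.0, 1.0, 0.2],
                [1.0, 0.0, 0.2, 1.0],
                [1.0, 0.2, 0.0, 1.0],
                [0.2, 1.0, 1.0, 0.0]])
order = greedy_assignment(C_R)
assert order[0] == 0                # MSB seeds the corner
assert order[1] in (1, 2)           # next bit goes to an adjacent TSV
```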
8.4 Combination with Traditional Low-Power Codes

Long 3D interconnects are commonly made up of multiple metal-wire and TSV segments. Since both structures typically contribute significantly to the overall power consumption of a 3D system, a simultaneous reduction in the power consumption of TSVs and metal wires is desirable. Using one of the existing low-power coding techniques to reduce the power consumption of the metal wires and a different, completely newly designed coding technique for the vertical TSVs is impractical due to the overhead cost of the encoder-decoder pairs. In that scenario, 2·n_die − 1 encoder-decoder pairs would be required for a 3D interconnect spanning n_die dies: n_die − 1 for the TSV segments, and n_die for the metal-wire segments. For example, three encoder-decoder pairs would already be required for a 3D interconnect that only spans two dies: two encoder-decoder pairs for the metal wires in the source and the destination layer, and one encoder-decoder pair for the TSVs in between. Thus, there is a high demand for low-power coding techniques which simultaneously optimize the power consumption of metal wires and TSVs. Such techniques—referred to in this book as 3D low-power codes (LPCs)—only require a single
166
8 Low-Power Technique for 3D Interconnects
Fig. 8.5 Proposed low-power coding approach for 3D interconnects: a single low-power encoder on die i maps the data bits d1…dk to codeword bits b1…bn, which traverse the complete 3D interconnect (alternating metal-wire and TSV segments) to a low-power decoder on die i+1
low-power encoder-decoder pair enclosing the full 3D interconnect, as illustrated in Fig. 8.5. Nevertheless, 3D LPCs still effectively optimize the power consumption of all TSV and metal-wire segments. The technique proposed in the present chapter enables the use of traditional LPCs as 3D LPCs in the best possible way: by finding the optimal codeword-bit-to-TSV assignment, the coding gains are extended to the TSVs without increasing the encoding overhead. Unencoded, most data streams have a balanced number of 0 and 1 bits. However, data-encoding techniques often lead to a larger fraction of 0 bits, which negatively affects the power consumption of TSVs traversing a typical p-doped substrate due to the MOS effect. Here, the proposed assignment technique further improves the efficiency of the data encoding by transmitting inverted bits. As mentioned before, an inversion can be realized by using inverting drivers instead of non-inverting ones (or vice versa) on both sides of the corresponding TSV. However, inversions can also be hidden in the encoder and decoder architectures. For example, Gray encoding is a popular approach to reduce the power consumption of gates and metal wires. Gray encoding reduces the power consumption not only for sequential but also for normally distributed data. This is due to the strong spatial correlation of the bits in the MSB region of normally distributed data words, outlined in Sect. 6.1.2. The ith output of a Gray encoder is equal to the ith input XORed with the (i+1)th input (i.e., bi = di ⊕ di+1), while the MSB of the encoder output is equal to the MSB of the unencoded data word (i.e., bn = dn). This logical exclusive disjunction (XOR) of two neighboring, spatially correlated MSBs in the Gray encoder results in output bits that are almost stable at logical 0, independent of the pattern correlation.
Hence, Gray encoding reduces the switching activities of the MSBs for normally distributed signals to nearly zero, which is beneficial for the power consumption of TSVs and metal wires. However, the encoding also decreases the 1-bit probabilities, resulting in a higher TSV power consumption for typical p-doped substrates. Thus, inverting the codeword bits has good potential to further increase the power savings of the coding technique for TSV-based interconnects. For many data-encoding techniques, inversions of the codeword bits can be realized inside the encoder-decoder architectures, instead of by swapping inverting and non-inverting drivers. For example, for Gray coding, the XOR operations in the encoder and the decoder are replaced by logical exclusive non-disjunction (XNOR) operations to obtain negated codeword bits. Since
XOR and XNOR operations typically have the same cost, this optimization of the data-encoding technique is again overhead-free.
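The XOR/XNOR swap can be made concrete with a small sketch (assuming 16-bit words; this mirrors the described architecture, not code from the book): replacing every XOR by an XNOR simply inverts all codeword bits below the pass-through MSB, and the decoder undoes the inversion before the usual prefix-XOR.

```python
MASK16 = (1 << 16) - 1

def gray_encode(d):
    """Standard Gray encoder: b_i = d_i XOR d_(i+1), MSB passes through."""
    return d ^ (d >> 1)

def gray_decode(b):
    """Inverse Gray mapping via prefix XOR (valid for 16-bit words)."""
    for shift in (1, 2, 4, 8):
        b ^= b >> shift
    return b

def gray_encode_xnor(d):
    """XNOR variant: every XOR replaced by an XNOR, so all codeword bits
    below the MSB come out inverted (raising the 1-bit probability)."""
    return (d ^ (d >> 1)) ^ (MASK16 >> 1)

def gray_decode_xnor(b):
    """Undo the bitwise inversion, then apply the normal Gray decoder."""
    return gray_decode(b ^ (MASK16 >> 1))
```

Both variants are bijective, so the inversion indeed comes for free in hardware where XOR and XNOR gates have equal cost.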
8.5 Evaluation

In this section, the proposed low-power technique is evaluated in depth. The only implementation cost of the proposed technique is a non-routing-minimal local net-to-TSV assignment. Thus, in Sect. 8.5.1, the impact of the local TSV assignment on the 3D-interconnect parasitics is quantified. In Sect. 8.5.2, the power savings of the proposed systematic assignments are investigated for real data streams and compared to the power savings for the respective power-optimal assignments. Afterward, the TSV power-consumption reductions of the proposed technique, used in combination with traditional LPCs, are analyzed.
8.5.1 Worst-Case Impact on the 3D-Interconnect Parasitics

In this subsection, the worst-case impact of a non-routing-minimal local net-to-TSV wiring on the parasitics is quantified. To keep the TSV delays as small as possible, the TSV drivers have to be placed as close as possible to the arrays [158]. Thus, the sources and the sinks of the nets in Fig. 8.1 on Page 157 are driver outputs and inputs, respectively. The minimum distance between a driver and a TSV is determined by the TSV keep-out-zone (KOZ) constraint, which, in this analysis, is set to two times the minimum TSV pitch, representing a realistic value. Since longer metal wires reduce the relative impact of the local routing within the TSV array on the overall parasitics, the full 3D-interconnect paths in this analysis only range between these nearby TSV drivers. Thus, in the following, worst-case values for the impact of the assignment are reported. The effect of one million random assignments of the driver inputs and outputs to the TSVs is analyzed. Thereby, the driver inputs and outputs are assumed to be aligned centrally at two edges of the TSV-array KOZ, as illustrated in Fig. 8.1. Typical global TSV dimensions (i.e., rtsv = 2 μm, dmin = 8 μm, ltsv = 50 μm, and fs = 6 GHz) are considered, and the array size is varied to quantify how the results scale with the number of TSVs/nets. Two scenarios for the local routing of the nets to the TSVs are considered. Scenario one uses the lower metal layers for routing. In this scenario, the nets are initially spaced with the minimum M1 spacing reported in Ref. [133] for a 22-nm technology. The second scenario considers intermediate metal layers for routing, which is why the input nets are spaced with the minimum M4 spacing reported in Ref. [133].
The higher metal layers—reserved for long global metal wires—are not required for the relatively short connections between the nearby drivers and the TSVs, especially since the lower metal layers have sufficient capacity in that area due to the large TSV dimensions and the absence
168
8 Low-Power Technique for 3D Interconnects
of active elements caused by the KOZ constraints. Thus, a routing in the global metal layers is not analyzed. The Manhattan distances between the driver inputs and outputs and the assigned TSVs are calculated to estimate the lengths of the metal wires for all assignments. Afterward, the two assignments that result in the lowest and the highest maximum length of a single wire are taken as the routing-minimal (i.e., best-case parasitics) and the routing-maximal (i.e., worst-case parasitics) assignment, respectively. For both assignments, the maximum wire length is multiplied by the wire resistance per unit length reported in Table 1.3 on Page 23 to obtain best-case and worst-case values for the maximum wire resistances. Added to both values are the TSV resistance and the driver's channel resistance Reff,d of 3.02 kΩ (see Sect. 1.3). Furthermore, 100-Ω resistances are added to model the connections between the metal wires and the TSVs [132]. Thereby, values for the maximum resistance of an interconnect path, R̂, are obtained for a routing-minimal and a routing-maximal assignment. The metal-wire capacitances per unit length in Table 1.3 are only valid for minimum-spaced wires, which result in the highest coupling capacitances. However, the analyzed minimum TSV pitch is more than 60 and 100 times larger than the minimum pitch of the local and intermediate metal wires, respectively. Thus, the wire pitches are typically much higher for the local net-to-TSV routing within an array. Here, a mean metal-wire pitch equal to half of the minimum TSV pitch is considered. This is still a pessimistic estimation, as the low utilization of the metal layers allows for an even higher spacing for critical/worst-case paths. The normalized wire capacitances for the increased spacing are extracted by means of the Predictive Technology Model (PTM) interconnect tool [205].
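The random-assignment experiment can be miniaturized in a few lines (a sketch with simplified, hypothetical geometry — all driver pins centered on one KOZ edge — and far fewer than the one million trials of the text):

```python
import random

def wiring_spread(n_rows, n_cols, pitch=8.0, trials=10_000, seed=1):
    """Estimate the best- and worst-case maximum Manhattan wire length
    over random net-to-TSV assignments for an n_rows x n_cols array."""
    tsvs = [(c * pitch, r * pitch)
            for r in range(n_rows) for c in range(n_cols)]
    n = len(tsvs)
    span = (n_cols - 1) * pitch
    # driver pins spread along one array edge, two pitches outside the KOZ
    pins = [(i * span / (n - 1), -2.0 * pitch) for i in range(n)]
    rng = random.Random(seed)
    perm = list(range(n))
    best, worst = float("inf"), 0.0
    for _ in range(trials):
        rng.shuffle(perm)
        m = max(abs(px - tsvs[j][0]) + abs(py - tsvs[j][1])
                for (px, py), j in zip(pins, perm))
        best, worst = min(best, m), max(worst, m)
    return best, worst
```

Multiplying `best` and `worst` by the per-unit-length resistance and capacitance then yields the kind of best-case/worst-case R̂ and Ĉeff bounds reported in Table 8.1.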
Compared to the values reported in Table 1.3, the capacitances of local and intermediate metal wires decrease to 0.09 and 0.10 fF/μm, respectively, due to the increased spacing. Moreover, the coupling capacitances become negligible compared to the ground capacitances. These normalized wire capacitances, multiplied by the maximum wire lengths for the routing-minimal and the routing-maximal assignments, are used to obtain best-case and worst-case values for the maximum wire capacitances. Added are the maximum effective TSV capacitance—extracted with Q3D Extractor—and a load capacitance of 0.22 fF to model a small inverter as the load [133]. In Table 8.1, the resulting worst-case percentage increases in the maximum-path resistance and effective capacitance due to a non-routing-minimal TSV assignment are reported. The results show that—in contrast to the global net-to-array routing—
Table 8.1 Worst-case parasitic increases due to a non-routing-minimal local TSV assignment for a routing in the local (M1) or the intermediate (M4) metal layers

              3 × 3 array        5 × 5 array        7 × 7 array        10 × 10 array
Metal layer   R̂ [%]  Ĉeff [%]    R̂ [%]  Ĉeff [%]    R̂ [%]  Ĉeff [%]    R̂ [%]  Ĉeff [%]
M1            0.24   0.12        0.68   0.35        1.22   0.65        2.37   1.34
M4            0.17   0.24        0.53   0.75        0.99   1.38        1.92   2.63
the effect of the local routing within the individual arrays is, as expected, negligible for reasonable array sizes. Considering local metal wires, the maximum possible increase in the parasitics is below 0.24% for a 3 × 3 array. The maximum increase for a 7 × 7 array is still as low as 1.22%. Thus, with an increase in the TSV count by a factor of 5.44×, the maximum parasitic increase only goes up by a factor of 5.08×. Generally, the values increase sub-linearly with the TSV count, as the maximum relative increase is 2.37% for 100 TSVs arranged as a 10 × 10 array. Using higher metal layers decreases the relative impact of the assignment on the resistances at the cost of a higher increase in the capacitances. For the 7 × 7 array, the maximum possible increase in the interconnect capacitance is 1.38% for M4 wires and only 0.75% for M1 wires. Also for intermediate metal layers, the relative increases in the parasitics scale almost linearly with the TSV count. In summary, the added parasitics due to a non-minimal local TSV assignment can in fact be neglected for arrays with fewer than 100 TSVs. Hence, in such cases, the power reduction of the proposed technique comes at negligible cost and performance degradation. For larger arrays, the added parasitics should be considered, which requires a small extension of the proposed technique. For example, one could only include assignments in SIσ,n in Eq. (8.14) that do not increase the maximum Manhattan distance between a sink/source net and the assigned TSV beyond a threshold value. However, arrays large enough to require this extension are rather uncommon and thus not further considered in this work.
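The suggested extension for large arrays can be sketched as a simple filter on the candidate set (names and the slack value are illustrative; `assignments` stands in for the set SIσ,n of Eq. (8.14)):

```python
def constrain_by_distance(assignments, pins, tsvs, slack=2.0):
    """Keep only assignments whose maximum Manhattan pin-to-TSV distance
    stays within `slack` of the routing-minimal value, so the power
    optimization cannot degrade the parasitics beyond a threshold.
    Each assignment is a tuple perm with perm[i] = TSV index of pin i."""
    def worst(perm):
        return max(abs(px - tsvs[j][0]) + abs(py - tsvs[j][1])
                   for (px, py), j in zip(pins, perm))

    best = min(worst(p) for p in assignments)
    return [p for p in assignments if worst(p) <= best + slack]
```

The power-optimal assignment is then searched only within the filtered set.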
8.5.2 Systematic Versus Optimal Assignment for Real Data

In Sect. 8.3, the efficiency of the proposed systematic assignments was only investigated for synthetically generated data streams to show the optimal nature of the systematic assignments. In the following, the efficiencies of the systematic and optimal TSV assignments are compared for real data streams. The focus lies on an important class of systems: heterogeneous 3D SoCs. Two commercially relevant examples are vision systems-on-chip (VSoCs) [95], including dies for image sensing and dies for digital image processing, and SoCs with one or more mixed-signal dies for sensing and sampling of signals, bonded to one or more digital dies for computation [180].
8.5.2.1 Image-Sensor Data
In a 3D VSoC, some dies (typically integrated into a rather conservative technology node) are dedicated to image sensing, while others are used for image processing (e.g., filtering, compression). In this subsection, the proposed assignment technique is analyzed for the transmission of digitized image pixels from a mixed-signal die to a processor die in such a VSoC.
The first three analyses are performed for data stemming from a 0–255 red-green-blue (RGB) image sensor using a standard Bayer filter [147]. First, the parallel transmission of all four RGB colors (one red, two green, and one blue) of each Bayer-pattern pixel over one 4 × 8 TSV array is analyzed. For the second analysis, four additional TSVs are considered in the array, which then has a 6 × 6 shape. Two of the additional TSVs are used to transmit enable signals with a set probability of 1 × 10−5. The other two additional TSVs are a power/ground (P/G) signal pair to supply the sensor. In the third analysis, the four colors of each pixel are transmitted one after another (i.e., one-by-one time-multiplexed) together with an enable signal (set probability 1 × 10−5) over a 3 × 3 array. The multiplexing allows reducing the number of TSVs required for the image transmission to eight. The fourth analysis is performed for a data stream stemming from a 0–255 grayscale image sensor. Here, the transmission of one pixel per cycle over a 3 × 3 array, including the same enable signal as in the third analysis, is investigated. All analyzed data streams are composed of an extensive set of pictures of cars, people, and landscapes to obtain representative results. For all analyzed scenarios, the reduction in the TSV power consumption, compared to worst-case random assignments, is investigated for the optimal assignment and the systematic spiral mapping, since the strong correlation of adjacent pixels results in transmitted data words that are strongly correlated but tend to be uniformly distributed. The enable and P/G signals are (almost) stable. An enable signal is assumed here to be at logical 0 when unused, which can be exploited by an inversion. Power and ground lines are always at logical 1 and logical 0, respectively. However, an inversion of the logical values of P/G lines is not possible and consequently forbidden for the assignment.
For the simultaneous transmission of a complete RGB pixel, the bits of the four color components are spatially interleaved one by one for the proposed systematic spiral mapping. Stable lines are assigned as the MSBs for the spiral assignment, as they have the lowest switching activities of all nets. Aggressively scaled global TSV dimensions are considered in this analysis (i.e., rtsv = 1 μm, dmin = 4 μm, and ltsv = 50 μm). The power consumption for the 3 × 3 and the 6 × 6 array is also investigated for a TSV radius, rtsv, and minimum pitch, dmin, of 2 μm and 8 μm, respectively, to show the effect of varying TSV geometries. Again, the power-consumption quantities are obtained through the high-level formula, employing capacitance matrices that were extracted with the Q3D Extractor for a significant frequency of 6 GHz. In Fig. 8.6, the resulting TSV power-consumption reductions due to the proposed technique are reported. The results show that the spiral mapping is nearly optimal for the transmission of the RGB data without additional stable lines in the array and leads to a power-consumption reduction of more than 11%. For the multiplexed colors, the lowest power improvement of only about 6% is achieved, as the correlation between subsequently transmitted bit patterns is lost by the multiplexing. Thus, only a wise assignment of the almost stable enable signal can be exploited here to reduce the power consumption.
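The high-level power estimate referenced above is derived in the earlier chapters; a hedged sketch of its general form (a quadratic form of the pattern transitions with the extracted capacitance matrix — names and normalization are our assumptions, not the book's exact formula) is:

```python
def mean_power(bits, C, f=6e9, vdd=1.0):
    """Approximate mean interconnect power:
    P ~ 0.5 * f * Vdd^2 * E[db^T C db],
    where `bits` is a list of 0/1 patterns (one per cycle) and C the
    n x n capacitance matrix (list of rows) from field extraction."""
    n_cycles = len(bits) - 1
    acc = 0.0
    for t in range(n_cycles):
        # per-cycle transition vector db in {-1, 0, +1}
        db = [b1 - b0 for b0, b1 in zip(bits[t], bits[t + 1])]
        acc += sum(C[i][j] * db[i] * db[j]
                   for i in range(len(db)) for j in range(len(db)))
    return 0.5 * f * vdd ** 2 * acc / n_cycles
```

Evaluating this estimate for every candidate net-to-TSV assignment (i.e., for every permutation of the capacitance-matrix rows and columns) reproduces the kind of comparison reported in Fig. 8.6.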
Fig. 8.6 Reduction in the TSV power consumption due to the proposed power-optimal assignment and the systematic spiral mapping for Bayer-pattern (RGB) and gray-scale image-sensor data (data sets: RGB; RGB + enable + P/G; RGB multiplexed + enable; grayscale + enable; each evaluated for rtsv = 1 μm, dmin = 4 μm and for rtsv = 2 μm, dmin = 8 μm)
With (almost) stable lines in the TSV array, the power-consumption reduction due to an optimal assignment is generally higher than for the systematic assignment (by up to 2.5 percentage points (pp)). The reason for this is that only the optimal assignment considers possible inversions for the enable signals in order to increase the 1-bit probabilities. Moreover, additional (almost) stable enable or P/G nets in the array generally increase the power-consumption improvement of the proposed technique. The reason is that the bit-level statistics of the nets become more heterogeneous due to the added lines, which increases the optimization potential. In conclusion, the results show that both net-to-TSV assignment approaches, optimal and systematic, can effectively reduce the TSV power consumption for the transmission of real image data. However, in the presence of additional control-signal and P/G nets, the optimal approach can result in noticeably higher power-consumption reductions than the systematic one.
8.5.2.2 Smartphone Sensor Data
In the following, the efficiency of the proposed technique for real smartphone sensor data, transmitted from a sensing die to a processing die, is investigated. For this purpose, sensor signals from a modern smartphone in various daily-use scenarios are captured. Analyzed are the magnetometer, the accelerometer, and the gyroscope sensor, all sensing on three axes x, y, and z. Considered is a transmission of one 16-bit sample per cycle over a 4 × 4 array with a TSV radius and minimum pitch
Fig. 8.7 Reduction in the TSV power consumption due to the proposed optimal and systematic net-to-TSV assignment approach for real smartphone-sensor data (optimal, sawtooth, and spiral mappings; data sets: RMS and XYZ-multiplexed streams of the magnetometer, accelerometer, and gyroscope, plus all sensors multiplexed)
of 2 μm and 8 μm, respectively. The transmission of the individual data streams is analyzed for two scenarios. In the first one, the root-mean-square (RMS) values—calculated from the respective three axis values—are transmitted. Scenario two represents the transmission of the x-axis, y-axis, and z-axis values in a one-by-one time-multiplexed manner (XYZ multiplexed). Since the sample time can be much lower than the maximum propagation delay of a TSV, the transmission of all three data streams over a single 4 × 4 array is analyzed as well. Thereby, a regular pattern-by-pattern multiplexing of the three individual XYZ-multiplexed data streams is considered. In this analysis, both systematic bit-to-TSV assignments are investigated, since normally distributed as well as temporally correlated data streams are amongst the analyzed set. Figure 8.7 shows the mean power-consumption reductions compared to worst-case assignments. The figure reveals that, for the multiplexed data streams, the proposed sawtooth mapping is only slightly worse than the optimal assignment, which reduces the power consumption by up to 21.1%. Generally, the single-axis values are normally distributed and temporally correlated. However, for multiplexed data streams, the correlation of subsequently transmitted patterns is lost, as outlined in Chap. 6. Thus, these scenarios are examples of temporally uncorrelated and normally distributed data, as multiplexing does not affect the pattern distribution. The small gain of the optimal net-to-TSV assignment over the sawtooth assignment arises because not all sensor signals are perfectly mean-free, resulting in slightly unequal bit probabilities [85]. In contrast, for the RMS data, the spiral mapping significantly outperforms the sawtooth mapping, as the RMS patterns are unsigned (i.e., no mean-free normal distribution) and correlated. However, for the RMS data, the maximum possible
power reduction due to a reassignment of the nets is 13.3%, which is noticeably lower than the maximum power reduction for the interleaved data streams. In conclusion, exploiting a normal distribution in the data words proves to be more efficient than exploiting a strong correlation of consecutive data words. Furthermore, due to non-idealities in real signals compared to the abstract data model, the optimal assignment approach has a slightly higher gain than the systematic one. However, both assignment approaches, systematic and optimal, generally lead to a significant improvement in the TSV power consumption.
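The correlation loss caused by one-by-one multiplexing, which drives these results, can be demonstrated with a small self-contained sketch (synthetic AR(1) processes stand in for the axis signals; all names and parameters are illustrative):

```python
import random

def lag1_corr(xs):
    """Sample correlation between consecutive elements of a stream."""
    n = len(xs) - 1
    a, b = xs[:-1], xs[1:]
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def mux_demo(n=3000, seed=0):
    """Three strongly correlated axis streams (AR(1), phi = 0.95) are
    one-by-one multiplexed; consecutive transmitted samples then stem
    from different axes, so the temporal correlation collapses."""
    rng = random.Random(seed)
    axes = []
    for _ in range(3):
        x, xs = 0.0, []
        for _ in range(n):
            x = 0.95 * x + rng.gauss(0.0, 1.0)
            xs.append(x)
        axes.append(xs)
    muxed = [axes[t % 3][t // 3] for t in range(3 * n)]
    return lag1_corr(axes[0]), lag1_corr(muxed)
```

The single-axis stream shows a lag-1 correlation near the AR coefficient, while the multiplexed stream's lag-1 correlation is close to zero, even though the per-sample value distribution is unchanged.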
8.5.3 Combination with Traditional Coding Techniques

In this subsection, the combination of the proposed net-to-TSV assignment technique with traditional LPCs is investigated for real data streams by means of Spectre circuit simulations for aggressively scaled TSVs (i.e., rtsv = 1 μm, dmin = 4 μm, and ltsv = 50 μm) and 22-nm drivers. The setup with the moderately sized 22-nm drivers from Sect. 3.4 is used here for the circuit simulations in order to obtain a realistic driver power consumption. Compared to Sect. 3.4, the duration of the transmitted bit patterns is reduced to 333 ps, resulting in a throughput of about 3 Gb/s per data TSV. The power-consumption quantities in Fig. 8.8 are scaled to
Fig. 8.8 Power consumption of TSVs (including drivers and leakage) in case of a transmission of raw and low-power-encoded data for the proposed power-optimal as well as a random net-to-TSV assignment (data sets: sequential sensor, multiplexed sensor, RGB multiplexed, random)
values for a transmission of one 32-bit word per cycle. Hereby, we report values independent of the TSV count in the array and of redundant bits in the transmitted patterns. The power consumption is investigated for the transmission of four different data streams, each twice: once for the power-optimal net-to-TSV assignment and once for a random assignment. The first data stream contains the sensor data from Sect. 8.5.2.2, where, for 3900 cycles, patterns of a single axis of one sensor are transmitted. Subsequently, patterns stemming from the next sensor are transmitted for 3900 cycles, and so on, until data for all axes and sensors has been transmitted. This data stream is referred to in the following as "sequential sensor data". For the second data stream, labeled "multiplexed sensor data", the patterns belonging to the individual axes are multiplexed one by one. The pattern width is 16 bit for the first two data streams, and a 4 × 4 array is chosen as the transmission medium. As expected, the results in Fig. 8.8 show that the multiplexed sensor data leads to a dramatically higher power consumption, since the correlation between consecutively transmitted patterns and its beneficial impact on the bit switching is lost. Nevertheless, because of limited buffer capabilities in the mixed-signal sensing layer, a multiplexed transmission is more likely than a sequential transmission. However, in Sect. 6.4.2, it was shown that the interconnect power consumption for the transmission of multiplexed data streams can be effectively reduced on an end-to-end basis by increasing or decreasing the joint 1-bit probabilities (i.e., the probabilities that two bits are at logical 1 in the same cycle). Gray encoding can be used to decrease the joint bit probabilities for normally distributed data, as it results in codeword MSBs that are almost stable at logical 0, as outlined previously.
Furthermore, Gray encoding has the advantage here that it can be realized on an end-to-end basis in the analog-to-digital converters (ADCs) of the sensors to cut the implementation cost of the coding technique. Thus, Gray coding is analyzed twice for the sensor data: once in the traditional way, and once in combination with the proposed optimal net-to-TSV assignment technique. For the raw multiplexed sensor data, the proposed assignment technique leads to a reduction in the power consumption by 18.3% compared to a random assignment. This is more than two times the power-consumption reduction of a mere Gray encoding, which is only 8.6%. Thus, the proposed low-cost net-to-TSV assignment technique can even significantly outperform well-established data-encoding techniques. A combination of Gray coding with the proposed technique leads to the lowest power consumption and more than doubles the Gray-coding efficiency (power reduction of 21.7%). The third analyzed data stream contains multiplexed RGB pixels from different pictures. A transmission of the 8-bit data stream together with an enable signal (set probability 1 × 10−5) over a 3 × 3 array is investigated. Again, the beneficial correlation between consecutively transmitted patterns is lost if the RGB colors are multiplexed. This leads to a dramatic increase in the interconnect power consumption and to no power reduction for a Gray encoding. A correlator is used to reduce the switching activity for the transmission of the multiplexed color values on an end-to-end basis, hidden in the ADCs of the image sensor. For a new red,
green, or blue value, the pattern is first bitwise XORed with the previous value of the same color and subsequently transmitted. Since consecutive red, green, and blue values are highly correlated, this again leads to MSBs nearly stable at logical 0. Thereby, the correlator coding drastically reduces the switching activities on the TSVs without inducing a bit overhead. Furthermore, the proposed assignment technique exploits the resulting increased number of 0 bits through inversions. Thus, the transmission of the correlator-encoded RGB data over a 3 × 3 array, including the enable signal, is analyzed in addition to the unencoded data. The results show that, combined with the correlator, the proposed optimal assignment technique results in a drastic improvement in the power consumption from 0.61 to 0.36 mW (−41.0%). In contrast, the correlator alone only reduces the power consumption by 25.2%. Thus, the technique proposed throughout this chapter again significantly enhances the gain of the coding technique. For the mere assignment approach without the correlator encoding, the proposed technique only results in a power-consumption improvement of 6.8%. Thus, without a strong heterogeneity in the bit-level statistics, the proposed technique should be supplemented by a traditional low-power coding technique to be most effective. To show the general usability of the proposed technique for all data types, the last analyzed data stream is a random 7-bit data stream, encoded into an 8-bit data stream using the coupling-invert LPC presented in Ref. [187]. The encoded data words are transmitted in this analysis together with a flag with a set probability of 1 × 10−4 over a 3 × 3 array. The analyzed coding technique is tailored to the specific capacitance structure of a metal-wire bus and is thus intrinsically not suitable for TSV arrays.
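The described correlator can be sketched in a few lines (an illustration of the scheme as described in the text, assuming three one-by-one multiplexed color streams; function names are ours):

```python
def correlator_encode(values, n_streams=3):
    """XOR each new value bitwise with the previous value of the SAME
    (multiplexed) color stream; highly correlated consecutive values
    thus yield codewords with MSBs nearly stable at logical 0."""
    prev = [0] * n_streams
    out = []
    for t, v in enumerate(values):
        s = t % n_streams           # which color stream this sample is
        out.append(v ^ prev[s])
        prev[s] = v
    return out

def correlator_decode(codewords, n_streams=3):
    """Invert the encoder: recover each value from the codeword and the
    previously decoded value of the same stream."""
    prev = [0] * n_streams
    out = []
    for t, w in enumerate(codewords):
        s = t % n_streams
        v = w ^ prev[s]
        out.append(v)
        prev[s] = v
    return out
```

Because the mapping is an XOR with known state on both sides, it induces no bit overhead, matching the text's claim.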
Here, a 3D network-on-chip (NoC) is considered as an exemplary use case in which the data is mainly transmitted over 2D links made up of metal wires, while a dedicated encoding for each TSV link is too cost-intensive. However, the coding approach leads to a positive switching correlation between some bit pairs. These correlations and the set probability of the flag are exploited by the low-power approach proposed in this chapter. Thereby, a further reduction in the TSV power consumption by 11.2% is achieved. This proves the efficiency of the technique for a broad set of applications. Please note that the TSV dimensions analyzed in this subsection are equal to the minimum ones predicted by the International Technology Roadmap for Semiconductors (ITRS). For thicker TSVs and larger TSV pitches, which is the more common case today, the proposed technique results in an even higher reduction in the TSV power consumption. This increase in efficiency is mainly due to the more dominant edge effects for larger TSV dimensions. For example, the power-consumption improvement of the proposed technique reaches values as high as 48% if the experiment presented in the current subsection is repeated for a TSV radius, minimum pitch, and length of 2, 8, and 50 μm, respectively.
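For intuition, a coupling-aware invert code can be sketched as follows; this is a generic illustration of the idea, not the exact scheme of Ref. [187]:

```python
def coupling_invert_encode(prev_code, word, width=7):
    """Transmit the 7-bit word or its complement plus one invert-flag
    bit, whichever causes fewer opposite transitions on adjacent lines
    (the dominant coupling term of a metal-wire bus)."""
    mask = (1 << width) - 1

    def cost(nxt):
        # transition per line in {-1, 0, +1} relative to the last codeword
        d = [((nxt >> i) & 1) - ((prev_code >> i) & 1) for i in range(width)]
        # count adjacent line pairs toggling in opposite directions
        return sum(1 for i in range(width - 1) if d[i] * d[i + 1] < 0)

    inverted = word ^ mask
    if cost(inverted) < cost(word):
        return inverted, 1
    return word, 0

def coupling_invert_decode(code, flag, width=7):
    """Undo the conditional inversion indicated by the flag bit."""
    return code ^ ((1 << width) - 1) if flag else code
```

The cost function is tailored to the linear neighbor structure of a metal-wire bus, which is precisely why such a code does not transfer directly to the 2D capacitance structure of a TSV array and benefits from the proposed assignment instead.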
8.6 Conclusion

In this chapter, a technique was presented that reduces the TSV power consumption at negligible cost. Its fundamental idea is a physical-effect-aware net-to-TSV assignment that exploits the bit-level properties of the transmitted patterns and the heterogeneity in the TSV-array capacitances, outlined in Chap. 4. Analyses for real and synthetic data streams have proven the efficiency of the proposed low-power technique, which can reduce the power consumption of modern TSVs by over 45% without inducing noticeable overhead costs. Furthermore, the proposed technique is the key enabler for efficient low-power coding for 3D interconnects, as it allows the reuse of traditional low-power codes for TSVs in the most effective way. Thereby, such traditional techniques can be used to efficiently improve the power consumption of the horizontal (i.e., metal-wire) and vertical (i.e., TSV) interconnects in 3D SoCs simultaneously.
Chapter 9
Low-Power Technique for High-Performance 3D Interconnects
It was shown in the previous chapter that the TSV edge and metal-oxide-semiconductor (MOS) effects, outlined in Chap. 4, can be exploited to effectively improve the TSV power consumption. In this chapter, a coding technique is presented that aims at an increase in the performance of TSVs and metal wires. The method takes the edge effects into account, which enables a much higher improvement in the TSV performance than previous techniques and, for the first time, a simultaneous improvement in the 3D-interconnect power consumption. High-level approaches that improve the performance of traditional very-large-scale integration (VLSI) interconnects are well established today and commonly referred to as crosstalk-avoidance coding techniques [68].1 For metal wires, this data-encoding approach can improve the performance by over 90% while additionally reducing the power consumption by over 20% [68]. In contrast to TSV low-power codes (LPCs), several TSV crosstalk-avoidance codes (CACs) have been proposed prior to this work [54, 138, 265]. These 3D CACs aim to improve the TSV performance by always keeping the effective capacitance of each TSV in the middle of an array below a certain level. However, none of the existing methods ever analyzed the performance of the TSVs at the edges of an array arrangement. The edge TSVs are completely neglected in previous works—claiming that the maximum propagation delay of a TSV at an array edge is in any case significantly lower than for a middle TSV due to the reduced number of directly adjacent neighbors. However, this assumption does not hold because of the edge effects. The coupling capacitance between two adjacent edge TSVs is significantly larger than its counterpart in the middle of the array, as shown in Chap. 4. Consequently, the maximum propagation delay of the edge TSVs is only slightly lower than that of the middle TSVs. This fact drastically reduces the coding efficiency of all existing
1 The term crosstalk is used to indicate that the propagation delay of a VLSI interconnect heavily depends on the simultaneous switching on the other interconnects.
3D CACs. For modern TSV arrays, the actual performance improvements of these CACs are less than 50% of the previously reported values, as shown in this chapter. The lower coding gains, in combination with their high overhead costs, make existing 3D CACs impractical for most real applications. Thus, an efficient coding method needs to be aware of the edge effects, which demands a simultaneous improvement in the performance of the middle and the edge TSVs. Another limitation of existing 3D CACs is that they only aim to improve the TSV performance. However, metal wires are not absent in 3D integrated circuits (ICs), and the impact of long metal wires on the system's performance is often not negligible. Hence, an efficient technique should improve the performance of metal wires and TSVs simultaneously. Furthermore, the high bit overheads of existing 3D CACs lead to a drastic increase in the TSV power consumption by up to 50%. This often strictly forbids the usage of existing techniques due to the importance of a low TSV power consumption in 3D ICs. Since existing 2D CACs can effectively improve the performance and power consumption of metal wires, the technique proposed in this chapter focuses on 2D CACs and provides a methodology to make them efficient for arbitrary TSV arrangements without inducing noticeable costs. Thereby, the proposed technique fully overcomes the severe limitations of previous techniques. The main idea is to use a performance-optimal, edge-effect-aware bit-to-TSV assignment, similar to the low-power technique presented in Chap. 8. Thereby, the CACs retain their gains for metal wires, but additionally improve the TSV performance. A formal method to determine the performance-optimal bit-to-TSV assignment for a given array is derived in this chapter. Moreover, a systematic assignment that is generally valid for arbitrary TSV arrays is contributed.
For the optimal/formal method, the optimization constraints can furthermore be adapted to solely optimize toward a high TSV performance, or additionally toward a low TSV power consumption. In the latter scenario, the weight of the power consumption as an optimization objective can be defined arbitrarily. An in-depth evaluation considering modern TSV arrangements shows that, for all analyzed underlying 2D CACs, the proposed technique drastically outperforms all existing 3D CACs. For example, considering an underlying forbidden-transition-free (FTF) encoding, one of the most promising 2D CACs [68], the evaluation shows a more than five times higher TSV performance improvement compared to the most recently proposed 3D CAC [54]. Moreover, while that technique increases the TSV power consumption by 50.0%, the proposed FTF approach decreases the power consumption by 5.3%. Despite this drastic improvement in the coding gain, the proposed technique requires a 12.0% lower bit overhead and a 42.8% lower coder-decoder circuit (CODEC) area. Furthermore, the technique results in an improvement in the power consumption and performance of global metal wires by 21.9% and 90.8%, respectively. The remainder of this chapter is organized as follows. In Sect. 9.1, the recently used model to classify the TSV crosstalk is extended to consider the edge effects. Afterward, the limitations of previous 3D CACs are precisely outlined in Sect. 9.2
with the help of the extended crosstalk classification. The proposed 3D-CAC technique is presented in Sect. 9.3 and extended toward a low-power 3D-CAC technique in Sect. 9.4. An in-depth evaluation of the proposed technique, alongside a comparison with previous approaches, is presented in Sect. 9.5. Finally, the chapter is concluded in Sect. 9.6.
9.1 Edge-Effect-Aware Crosstalk Classification

The metric that allows for an optimization of the performance at higher abstraction levels is the maximum effective capacitance due to its strongly bit-pattern-dependent nature, as the other two parameters in the formula for the maximum propagation delay (Eq. (3.44) on Page 64) are purely technology dependent. Thus, for the derivation of existing high-level optimization techniques, the pattern-dependent performance of the ith interconnect was classified by its effective-capacitance value, mathematically expressed in previous works as follows (e.g., [68, 265]):

$$C_{\mathrm{eff},i}[k] = C_{i,i}\,\Delta b_i^2[k] + \sum_{\substack{j=1 \\ j \neq i}}^{n} C_{i,j}\,\delta_{i,j}[k]. \qquad (9.1)$$
In this equation, Δbi²[k], multiplied with the interconnect ground capacitance, Ci,i, is equal to 1 if the interconnect toggles in the respective kth clock cycle, else it is 0. δi,j depends on the transitions in the logical values on the ith and the jth interconnect and quantifies the impact of the coupling capacitance between the two lines, Ci,j, on the performance of the ith interconnect. The impact of the switching on a remote (jth) interconnect on the signal integrity (e.g., propagation delay, switching noise) of the interconnect is commonly referred to as the crosstalk between the two lines [68]. Hence, δi,j is referred to as the crosstalk factor throughout this book. Existing CAC techniques assume perfectly temporally aligned signal edges on all interconnects. Consequently, the performance gains of all existing CAC techniques vanish completely for strongly temporally misaligned signal edges. Thus, temporally aligned signal edges are considered throughout this chapter.2 Hence, the interconnect performance is classified in this chapter by the effective capacitance in the case of temporally aligned edges. Comparing the derived formula for this quantity (i.e., Eq. (3.18) on Page 57) with Eq. (9.1) reveals that, for perfectly temporally aligned
2 In the following Chap. 10, a technique to improve the performance in the presence of an arbitrary temporal misalignment between the signal edges is presented.
edges, the crosstalk factor can be expressed as follows:

$$\delta_{i,j}[k] = \Delta b_i^2[k] - \Delta b_i[k]\,\Delta b_j[k]. \qquad (9.2)$$
For temporally aligned edges, the crosstalk factor can only be 0, 1, or 2. It is equal to 2 if reverse signal transitions occur on the two interconnects (e.g., bi switches from logical 0 to 1, while bj switches from logical 1 to 0). δi,j is equal to 1 if only interconnect i switches. Otherwise, it is 0. Note that this systematically derived expression for δi,j slightly differs from the one used in previous works:

$$\delta_{\mathrm{prev},i,j}[k] = \left|\Delta b_i[k] - \Delta b_j[k]\right|. \qquad (9.3)$$
The expression derived in this work differs from the traditional one in that it always results in an effective capacitance of zero for stable interconnects. This has the advantage that it will always consider a propagation delay of zero for stable interconnects. However, for the estimation of the performance (i.e., the maximum propagation delay over all possible switching scenarios), it is irrelevant which expression is used. Combining Eq. (9.1) with an abstract capacitance model for the considered interconnect structure results in a discrete crosstalk classification. Figure 4.1 on Page 73 illustrates the capacitance model for metal wires. Between every adjacent metal-wire pair exists a coupling capacitance of size Cmw,c, and every metal wire has a ground capacitance of size Cmw,g. Combining this metal-wire capacitance model with Eq. (9.1) results in effective-capacitance values that range from 0 to Cmw,g + 4Cmw,c, where the worst case occurs if both adjacent aggressor lines of a wire switch in the opposite direction. Previous works neglect the ground capacitances due to the dominance of Cmw,c over Cmw,g. Consequently, the metal-wire crosstalk is traditionally classified using five classes: 0Cmw,c, 1Cmw,c, 2Cmw,c, 3Cmw,c, and 4Cmw,c [68]. In the traditional TSV capacitance model, each directly adjacent conductor pair is connected by a coupling capacitance of size Cn,prev, and each diagonally adjacent pair by a coupling capacitance of size Cd,prev. Thus, the crosstalk of a TSV was previously classified in 81 classes in the range from 0 to (8Cn,prev + 8Cd,prev), where the maximum crosstalk class occurs if all eight adjacent neighbors of a middle TSV switch in the opposite direction. To simplify the crosstalk classification, the capacitance value between two directly adjacent middle TSVs (i.e., Cn,prev) was used in previous works to normalize the crosstalk classes. Moreover, the normalization value was referred to as C3D.
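The discrete classification above can be sketched in a few lines (an illustration, not the book's code; Δb values in {−1, 0, +1} encode falling, stable, and rising transitions):

```python
# Illustrative sketch of Eq. (9.2) and the resulting metal-wire crosstalk
# classes 0..4 (in multiples of C_mw,c; the ground capacitance is neglected,
# as in the traditional five-class scheme).

def crosstalk_factor(db_i, db_j):
    """delta_{i,j} = db_i^2 - db_i*db_j, with db in {-1, 0, +1}."""
    return db_i * db_i - db_i * db_j

def wire_crosstalk_class(db_left, db_i, db_right):
    """Crosstalk class of wire i given its two adjacent aggressors,
    counted in multiples of the coupling capacitance C_mw,c."""
    return crosstalk_factor(db_i, db_left) + crosstalk_factor(db_i, db_right)

# Opposite transitions on both neighbors -> worst-case class 4C:
assert wire_crosstalk_class(-1, +1, -1) == 4
# Only wire i switches -> class 2C; a stable wire always sees class 0:
assert wire_crosstalk_class(0, +1, 0) == 2
assert wire_crosstalk_class(+1, 0, -1) == 0
```

Note that, in line with the text, a stable victim line (Δbi = 0) yields a crosstalk factor of 0 regardless of its neighbors, which is where Eq. (9.2) deviates from Eq. (9.3).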
Hence, the TSV crosstalk was classified in the range from 0 C3D to (8+8λd,prev )C3D . Furthermore, in some works, the maximum crosstalk class was denoted as 10 C3D as λd,prev (i.e., Cd,prev/Cn,prev ) is 0.25 in the traditional capacitance model. In the following, an edge-effect-aware crosstalk classification for TSV structures, based on the TSV capacitance model from Chap. 4, is presented. The MOS effect is not considered explicitly for TSV crosstalk classification due to two main reasons. First, all existing 3D CAC techniques, as well as the one presented in this chapter,
result in equally distributed 0-bit and 1-bit probabilities for all data TSVs. Second, exploiting the MOS effect is not well suited for performance improvement. In contrast to power optimization, where the mean values of the capacitances over time are essential, for performance improvement the maximum capacitance values over all cycles count. Thus, one must ensure a much stronger stationarity of the transmitted data in order to safely exploit the MOS effect for performance improvement. This typically demands additional coding complexity. Consequently, for performance optimization and estimation, capacitance matrices for equally balanced bit probabilities on all TSVs (i.e., pi = 0.5 for all i) are considered throughout this book. The same straightforward approach as in previous works is used to obtain an edge-effect-aware TSV crosstalk classification based on the proposed regular capacitance structure (see Fig. 4.8 on Page 82). Again, all coefficients in the crosstalk model are capacitance values that are normalized by C3D, which is here equal to Cn. These unit-less coefficients are represented by λ. Thus, λx is equal to Cx/Cn. Employing the edge-effect-aware TSV capacitance model, the crosstalk of a TSV located at a single edge of an exemplary 5 × 5 array is classified in the range from 0 C3D to (2 + 4λd + λe0 + 4λe1 + 4λc2)C3D, while the crosstalk classes of a corner TSV range from 0 C3D to (2λd + λc0 + 4λc1 + 4λc2)C3D. Since dynamic capacitance values due to the MOS effect are not considered here, there is no significant difference between the proposed and the traditional capacitance model for middle TSVs. In both models, a middle TSV has four coupling capacitances with its directly adjacent neighbors and four with its diagonally adjacent neighbors. Thus, the edge-effect-aware crosstalk model still classifies the middle-TSV crosstalk in 81 classes, ranging from 0 C3D to (8 + 8λd)C3D.
To map the crosstalk classes to actual effective-capacitance values, and thereby to comparable performance quantities, the capacitance-model coefficients reported in Table 4.1 on Page 88 are used to obtain values for C3D and the λ coefficients for various TSV parameters. Table 9.1 reports the resulting values for a mean voltage of 0.5 V on all TSVs (i.e., Vdd = 1 V). The table reveals that the relative

Table 9.1 C3D and λ values for the proposed edge-effect-aware crosstalk classification based on the TSV capacitance model proposed in Chap. 4

rtsv [μm]  dmin [μm]  fs [GHz]  C3D [pF/m]  λd    λe0   λe1   λe2   λc0   λc1   λc2
1.0        4.0        6         94.4        0.36  0.05  1.33  0.23  0.06  1.46  0.31
1.0        4.0        11        85.9        0.30  0.03  1.33  0.17  0.06  1.41  0.23
1.0        4.5        6         91.1        0.38  0.04  1.34  0.25  0.04  1.49  0.32
1.0        4.5        11        80.6        0.32  0.03  1.32  0.19  0.05  1.44  0.24
2.0        8.0        6         111.1       0.34  0.03  1.33  0.20  0.05  1.45  0.27
2.0        8.0        11        93.8        0.29  0.05  1.31  0.16  0.11  1.41  0.21
2.0        8.5        6         110.3       0.35  0.03  1.32  0.20  0.09  1.43  0.26
2.0        8.5        11        91.2        0.29  0.05  1.31  0.16  0.13  1.40  0.21
λ coefficients only slightly differ for the different analyzed TSV geometries. Hence, despite the added complexity due to the considered edge effects, the proposed crosstalk classification allows the general gains of different crosstalk-avoidance strategies to be derived and rapidly evaluated without an in-depth analysis of an extensive set of different TSV technologies.
9.2 Existing Approaches and Their Limitations

This section precisely outlines the limitations of existing CAC techniques for 3D ICs. Crosstalk-avoidance coding was initially proposed for metal wires [68]. The idea of such traditional 2D CACs is to strictly avoid all pattern transitions that result in a crosstalk class for a metal wire that exceeds a specific threshold value. For example, in contrast to an unencoded metal-wire bus, in which the effective capacitance of a line can be as large as 4Cmw,c + Cmw,g, in a 3C-encoded and a 2C-encoded bus the effective capacitance of a metal wire never exceeds 3Cmw,c + Cmw,g and 2Cmw,c + Cmw,g, respectively. For aggressively scaled metal-wire pitches with Cmw,c ≫ Cmw,g, 3C and 2C CACs improve the maximum propagation delay by asymptotically 25% and 50%, respectively. Since the performance is defined as the reciprocal of the maximum propagation delay, 1/Tˆpd, this implies a performance increase of up to 33.3% for the 3C encoding and 100% for the 2C encoding. Besides the performance improvement, some 2D CACs moreover improve the power consumption by reducing the sum of the mean effective-capacitance values [68]. This makes these techniques particularly efficient.

Since 2D CACs themselves cannot optimize the TSV performance, dedicated crosstalk-avoidance techniques have been proposed recently for TSV arrays [54, 138, 265]. For the derivation of all existing TSV CACs, the edge effects are not considered, as the traditional capacitance model used for the underlying crosstalk classification does not capture them. Furthermore, the efficiency of previous CAC approaches was only evaluated for TSVs located in the middle of an array. The implicit claim of the authors of the related publications was that the edge TSVs anyway have a much lower maximum effective capacitance due to the reduced number of adjacent TSVs and thus have to be neither optimized nor investigated.
However, the proposed edge-effect-aware TSV crosstalk classification shows that this assumption does not hold. Table 9.2 reports the maximum effective capacitance of a middle, an edge, and a corner TSV according to the edge-effect-aware crosstalk model. In accordance with the previously used crosstalk model, the reported maximum normalized effective-capacitance value of a middle TSV is about 10 C3D to 11 C3D. According to the previously used crosstalk classification, the maximum effective capacitance of an edge TSV is equal to 7 C3D ((6 + 4λd,prev)C3D), and thus much lower. However, taking the edge effects into account reveals that the maximum effective capacitance of an edge TSV can actually also exceed 10 C3D. Consequently, the maximum effective capacitance of an edge TSV is only 8–10% (instead of the previously reported 30%) lower than for a middle TSV. This disproves the fundamental assumption existing 3D CACs rely on.

Table 9.2 Maximum effective-capacitance values of a middle, an edge, and a corner TSV according to the edge-effect-aware crosstalk classification

rtsv [μm]  dmin [μm]  fs [GHz]  Middle TSV (8+8λd)  Edge TSV (2+4λd+λe0+4λe1+4λc2)  Corner TSV (2λd+λc0+4λc1+4λc2)
1.0        4.0        6         10.90 C3D           10.06 C3D                       7.87 C3D
1.0        4.0        11        10.38 C3D           9.43 C3D                        7.19 C3D
1.0        4.5        6         11.04 C3D           10.20 C3D                       8.06 C3D
1.0        4.5        11        10.57 C3D           9.57 C3D                        7.41 C3D
2.0        8.0        6         10.74 C3D           9.77 C3D                        7.60 C3D
2.0        8.0        11        10.28 C3D           9.27 C3D                        7.15 C3D
2.0        8.5        6         10.77 C3D           9.74 C3D                        7.56 C3D
2.0        8.5        11        10.33 C3D           9.28 C3D                        7.17 C3D

Existing 3D CACs have in common that they limit the maximum possible number of opposite transitions on adjacent neighbors for each TSV. For example, the 6C CAC ensures that, for each TSV, at most three directly adjacent (aggressor) TSVs switch in the opposite direction; and if three directly adjacent neighbors switch in the opposite direction, the fourth one always switches in the same direction [138]. Consequently, for a middle TSV, the maximum crosstalk class is reduced to (6 + 8λd)C3D. As an example, for aggressively scaled global TSV dimensions (i.e., rtsv = 1 μm and dmin = 4 μm) and a significant frequency of 6 GHz, (6 + 8λd)C3D is equal to 8.88 C3D. Edge TSVs have a maximum of three directly adjacent neighbors. Thus, the 6C CAC does not provide any optimization of the maximum crosstalk class of an edge TSV. Encoded and unencoded, the maximum effective capacitance of an edge TSV is the same (10.06 C3D for the considered example). Thus, the worst-case delay for the encoded patterns occurs at the array edges and no longer in the middle. Consequently, the edge effects reduce the true coding efficiency. Instead of the previously reported theoretical performance increase of 25.0% (Cˆeff decreases from 10 C3D to 8 C3D), the improvement is only 7.8% (10.90 C3D to 10.06 C3D) for the analyzed TSV technology. The same analysis for other TSV technologies reveals the same drastic degradation in the coding efficiency, by at least 50%, due to the neglected edge effects. Furthermore, one can show in the same way that all other existing 3D CACs also have an over 50% lower performance gain when the edge effects are considered. To precisely quantify the real efficiencies of existing 3D CACs, the reduction in the maximum propagation delay of the previously most promising approaches is reanalyzed by means of circuit simulation.
Besides the 6C CAC, the 4LAT coding [265] and the 6C-FNS coding [54] are investigated. The 4LAT coding limits, for each TSV, the maximum number of switching neighbor-TSVs to four. A 6C-FNS encoding limits the maximum crosstalk class of each middle TSV
to 6.5 C3D, according to the previous crosstalk classification. The 6C and 4LAT coding techniques are evaluated for a square 5 × 5 array with rtsv, dmin, and ltsv equal to 1 μm, 4 μm, and 50 μm, respectively. A drawback of the 6C-FNS coding is that it only works for 3×x arrays. Thus, for the analysis of this 3D CAC, the array dimensions are changed to 3×8. In contrast to the analyses in the respective papers, where only the delay of a middle TSV is investigated, here, the delay of an edge TSV is analyzed as well. To determine the pattern-dependent TSV delays with the Spectre circuit simulator, the setup from Sect. 3.4 is used. Thereby, realistic non-linear driver effects are considered. In the simulations, the strong 22-nm drivers from Sect. 4.4.2 are used. Due to these high-performance drivers, TSV equivalent circuits that were extracted for a significant signal frequency of 10 GHz are integrated into the circuit-simulation setup. Figure 9.1 illustrates the simulation results for the 6C-encoded and the unencoded data. Table 9.3 reports the measured maximum propagation delays for all pattern sets. As expected, the worst-case propagation delay always occurs at the array edges for the 3D CACs, while it occurs in the array middle only for the unencoded patterns. Table 9.3 also reports the relative improvements in the worst-case delay for middle TSVs, ΔTˆpd,middle, and all TSVs, ΔTˆpd. The reduction in the overall worst-case delay determines the actual performance improvement of the respective CAC technique. All measured delay improvements for the middle TSVs are in accordance with the ones reported in Refs. [54, 138, 265] for the overall delay improvements. However, the results prove that the edge effects drastically decrease
[Fig. 9.1 Driver-input and TSV-output voltage waveforms (node voltage [V] over time [ps]) for the worst-case TSV propagation delay, Tˆpd, for unencoded data and the 6C CAC technique. Curves shown: inverter/driver input; edge TSV (unencoded and 6C-encoded); middle TSV (unencoded); middle TSV (6C-encoded).]
Table 9.3 Maximum propagation delay of a middle TSV, Tˆpd,middle, and an edge TSV, Tˆpd,edge, for unencoded and CAC-encoded data, besides the relative coding gains. Here, large-sized drivers are considered

Data         Tˆpd,middle [ps]  Tˆpd,edge [ps]  ΔTˆpd,middle [%]  ΔTˆpd [%]
Unencoded    117.0             109.6           –                 –
6C [138]     100.6             109.6           −14.1             −6.3
4LAT [265]   93.3              104.5           −20.3             −10.7
6C-FNS [54]  82.9              110.1           −29.2             −5.9
the improvements in the maximum delay, and consequently the efficiencies of the CACs. For example, the delay reductions of the 6C and the 4LAT encodings are only about 50% of their previously expected values. For the recently most promising 6C-FNS coding [54], the delay improvement is only about one-fifth of the previously expected value (5.9% instead of 29.2%). This implies that the performance improvement is only 6.2% instead of 41.2%, which is a dramatic loss. One reason for this drastic degradation in the coding efficiency is the neglected edge effects. Another one is the generally unconsidered edge TSVs. The 6C-FNS CAC simply limits the maximum crosstalk class of the middle TSVs to 6.5 C3D, according to the previous crosstalk classification, while the maximum crosstalk class of the edge TSVs remains unaffected. If all neighbors of a TSV located at a single edge switch in the opposite direction, its crosstalk class is equal to 6Cn + 4Cd (7 C3D), according to the previously used crosstalk classification. Therefore, the 6C-FNS encoding is actually a 7C-FNS encoding, even when the traditional crosstalk classification is considered.

In contrast to the power consumption, the delay has a significant dependency on the drivers. Thus, to prove that the previously outlined degradation in the gain of existing 3D CACs is independent of the drivers, the previous analysis is repeated for the more moderate-sized drivers from Sect. 3.4. Hence, the width of each transistor is halved compared to the previous analysis. Thereby, the strength of the drivers is reduced. Besides the driver sizing, the rest of the experimental setup is the same as before. Thus, although the resulting weaker drivers actually lead to a reduction in the significant signal frequency, the TSV equivalent circuits are not changed since the edge effects increase for lower frequencies, as shown in Chap. 4.
Increasing edge effects are undesired here, as they would further decrease the gain of existing CACs, which would prevent quantifying the driver impact in isolation. Table 9.4 includes the results for the repeated experiment with the more moderate-sized drivers. As expected, the individual delay values are about two times higher compared to the previous experiment. Nevertheless, the percentage delay reductions of the 3D CACs do not change significantly compared to the results for the stronger drivers. Still, the edge effects result in the same dramatic decrease in the delay reductions by about 50–75%. Hence, the outlined dramatic degradation in the gains of the previous 3D CACs is independent of the TSV drivers.
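Since performance is defined as 1/Tˆpd, a relative delay reduction r translates into a performance gain of 1/(1 − r) − 1. A short sketch (illustrative, not from the book) reproduces the numbers quoted above:

```python
# Converting a relative reduction of the maximum propagation delay into
# the resulting performance gain (performance = 1 / T_pd).

def perf_gain(delay_reduction):
    """Performance improvement for a relative delay reduction r in [0, 1)."""
    return 1.0 / (1.0 - delay_reduction) - 1.0

# 2C/3C metal-wire CACs: 50 % / 25 % delay reduction -> 100 % / 33.3 % gain.
assert abs(perf_gain(0.50) - 1.0) < 1e-9
assert abs(perf_gain(0.25) - 1 / 3) < 1e-9
# 6C-FNS (Table 9.3): 29.2 % middle-TSV delay reduction -> ~41.2 % gain,
# but only 5.9 % -> ~6.2 % once the edge TSVs are taken into account.
assert abs(perf_gain(0.292) - 0.412) < 0.002
assert abs(perf_gain(0.059) - 0.062) < 0.002
```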
Table 9.4 Maximum propagation delay of a middle TSV, Tˆpd,middle, and an edge TSV, Tˆpd,edge, for unencoded and CAC-encoded data, besides the relative coding gains. Here, moderate-sized drivers are considered

Data         Tˆpd,middle [ps]  Tˆpd,edge [ps]  ΔTˆpd,middle [%]  ΔTˆpd [%]
Unencoded    212.1             197.4           –                 –
6C [138]     179.6             197.5           −15.3             −6.9
4LAT [265]   171.1             192.1           −19.3             −9.4
6C-FNS [54]  148.0             197.6           −30.0             −6.8
A general drawback of CACs is that they add several redundant bits to the transmitted codewords, compared to the unencoded data words. This cost is quantified by the (bit) overhead of a coding technique, which is mathematically defined as follows:

$$\mathrm{OH}(m) = \frac{n-m}{m}, \qquad (9.4)$$
where m and n are the bit widths of the unencoded data words and the codewords, respectively [68]. Existing 3D-CAC techniques require asymptotic overheads (i.e., limm→∞ OH(m)) of 44–80%. These high bit overheads of existing 3D CACs lead to a drastically increased overall TSV power consumption, as they surpass the power savings per TSV reported in previous works. In the later evaluation section of this chapter, it is shown that existing 3D CACs increase the overall TSV power consumption by 8–50%. Two severe limitations of previous 3D CACs were outlined so far: first, the limited performance improvements and, second, the increased TSV power consumption due to the large bit overheads. However, there is even a third drawback of existing approaches: no CAC technique can optimize the performance and the power consumption of metal wires and TSVs simultaneously. This drawback results in high coding-hardware costs for an effective CAC coding of 3D interconnects that are made up of several metal-wire and TSV segments. In such cases, one CAC encoding and decoding is needed per segment. Thus, a 3D CAC should be capable of optimizing the performance of metal wires and TSVs at the same time. Thereby, only one encoder-decoder pair would be required to optimize the performance of arbitrary 3D-interconnect structures.
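The overhead of Eq. (9.4) can be sketched as follows (the concrete bit widths are hypothetical illustration values, not taken from the book):

```python
# Bit overhead of a coding technique per Eq. (9.4).

def overhead(m, n):
    """OH(m) = (n - m) / m for m data bits encoded into n code bits."""
    return (n - m) / m

# A hypothetical 32-to-46-bit encoding has a 43.75 % bit overhead,
# close to the ~44 % asymptotic overhead of FPF/FTF 2C coding:
assert overhead(32, 46) == 0.4375
```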
9.3 Proposed Technique

A 3D-CAC technique that overcomes all outlined limitations of previous techniques is derived throughout this section. First, a TSV-CAC approach that overcomes the limitations due to the edge effects is presented in Sect. 9.3.1. Afterward, it is shown
in Sect. 9.3.2 how the proposed TSV-CAC technique is implemented such that a simultaneous improvement in the metal-wire and the TSV performance is achieved. Finally, this 3D CAC is extended to a low-power 3D CAC, which increases the performance of TSVs and metal wires while it simultaneously decreases their power consumption.
9.3.1 General TSV-CAC Approach

In this subsection, a CAC approach for TSV arrays, called ωm/ωe TSV CAC, is proposed. The presented coding approach overcomes the limitations of previous 3D CACs, which arise due to the edge effects. In detail, the presented coding approach does not only improve the performance of the middle TSVs, but it also improves the performance of TSVs located at the array edges. The general idea is to reduce the maximum possible effective capacitance of each middle and each single-edge TSV by at least ωm C3D and ωe C3D, respectively. Consequently, the maximum crosstalk class of a TSV in the middle of an array is reduced by an ωm/ωe encoding to

$$(8 + 8\lambda_d - \omega_m)\,C_{3D}, \qquad (9.5)$$

while the maximum crosstalk class of TSVs at an array edge is reduced to

$$(2 + 4\lambda_d + \lambda_{e0} + 4\lambda_{e1} + 4\lambda_{c2} - \omega_e)\,C_{3D}. \qquad (9.6)$$
As shown in Table 9.2, the highest possible crosstalk class of a middle TSV results in an effective-capacitance value that is about 0.8 C3D to 1.0 C3D bigger than the counterpart for the highest crosstalk class of an edge TSV. Hence, the most effective ωm/ωe encoding has an ωm that is larger than ωe by approximately this amount.
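The effect of Eqs. (9.5) and (9.6) can be sketched numerically (a sketch assuming the rounded first-row coefficients of Table 9.1; the ω values are hypothetical illustration values):

```python
# Encoded maximum crosstalk classes per Eqs. (9.5)/(9.6), in units of C3D.
# Lambda values: Table 9.1, first row (rtsv = 1 um, dmin = 4 um, fs = 6 GHz).

lam_d, lam_e0, lam_e1, lam_c2 = 0.36, 0.05, 1.33, 0.31

def max_class_encoded(wm, we):
    middle = 8 + 8 * lam_d - wm                                       # Eq. (9.5)
    edge = 2 + 4 * lam_d + lam_e0 + 4 * lam_e1 + 4 * lam_c2 - we      # Eq. (9.6)
    return middle, edge

# With wm = we, the middle TSVs remain the bottleneck by ~0.8 C3D:
m, e = max_class_encoded(2.0, 2.0)
assert m > e
# Choosing wm larger than we by about that amount balances both maxima:
m, e = max_class_encoded(2.8, 2.0)
assert abs(m - e) < 0.1
```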
9.3.2 3D-CAC Technique

In this subsection, a coding technique is presented which implements the previously introduced ωm/ωe-CAC approach by means of a traditional 2D CAC. Thereby, a 3D CAC is obtained which simultaneously improves the TSV and the metal-wire performance. The fundamental idea is to exploit the bit-level properties of a pattern set that is encoded with a 2D CAC, by means of an assignment of the codeword bits to the TSVs that results in an ωm/ωe TSV CAC. In practice, this approach is implemented through the local net-to-TSV assignment at negligible extra costs compared to the traditional 2D CAC, as shown in Sect. 8.5.1.
Besides this ωm/ωe-CAC assignment, the proposed technique is completely constructed of coder-decoder circuits (CODECs), which are well known from 2D crosstalk-avoidance coding. An in-depth explanation of the implementation of these well-known circuits can be found in Ref. [68] and is thus not repeated in this book. However, the fundamental ideas of the used 2D CACs are briefly reviewed in the following. Memory-less 2C CACs are the most popular 2D CACs [66, 67, 206, 241]. One major advantage of memory-less over memory-based CACs is their significantly lower hardware requirements [68]. Furthermore, memory-less CACs also work for multiplexed data streams when the encoding and decoding are applied on an end-to-end basis, since the coding is based solely on a pattern level. Thus, only memory-less CACs are considered in this work. Mainly, two different data-encoding approaches exist for memory-less 2C-CAC encoding: forbidden-pattern-free (FPF) [66] and forbidden-transition-free (FTF) [241] encoding. The FPF-CAC approach forbids bit patterns that contain a "010" or a "101" bit sequence as codewords. For example, "111000110" is a valid 9-bit FPF codeword, while "110100011" is a forbidden pattern and thus not included in the set of codewords. In [68], the authors prove that an FPF metal-wire bus is a 2C-encoded bus (i.e., the maximum possible crosstalk class is 2Cmw,c), since, for all i, if the crosstalk factor δi,i−1 is equal to 2 (i.e., bits switch in the opposite direction), the crosstalk factor δi,i+1 is always 0 (i.e., bits switch in the same direction) and vice versa. For the FTF 2C CAC, all δi,i−1 and δi,i+1 values are limited to a maximum of 1 by prohibiting adjacent bits from switching in the opposite direction. Hence, the forbidden transitions are "01" → "10" and "10" → "01". It is proven in Ref.
[241], that the largest set of FTF codewords is generated by eliminating the "01" pattern from the odd-even (i.e., b2i+1 b2i) bit boundaries, and the "10" pattern from the even-odd (i.e., b2i b2i−1) boundaries. The asymptotic bit overheads of both CAC techniques, FPF and FTF, are equal (about 44%) [68]. For the proposed technique, only FTF encoding is investigated, since it has several advantages over FPF encoding. One advantage is that the CODEC of an FTF CAC requires an approximately 17% lower gate count and an almost 50% lower circuit delay [68]. Nevertheless, a traditional FTF CODEC still exhibits a quadratic growth in complexity with an increasing bit width [68]. To overcome this limitation, the bus can be partitioned into smaller groups that are encoded individually. In this case, a difficulty arises due to undesired transitions between adjacent lines of different groups. Two techniques exist which address this issue: group complement and bit overlapping [68]. Both cause a significant bit and CODEC overhead, which makes them suboptimal. In this chapter, a more effective technique, applicable only to FTF encoding, is presented. Building a 3D power-distribution network requires several power and ground (P/G) TSVs. Power and ground lines are stable, and stable lines can be used for FTF-encoding partitioning, as illustrated in Fig. 9.2. The metal-wire bus containing n data lines is divided into multiple groups, which are encoded individually using an mg-to-ng-bit FTF CAC, where ng is the number of data lines
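The two codeword conditions can be sketched as simple validity checks (an illustration, not the book's CODEC implementation; the helper names are hypothetical, and codeword strings are written MSB-first as in the text):

```python
# Validity checks for memory-less 2D-CAC codewords.

def is_fpf(word):
    """FPF codewords contain neither a '010' nor a '101' bit sequence."""
    return "010" not in word and "101" not in word

def has_forbidden_transition(w1, w2):
    """FTF forbids adjacent bit pairs switching '01' -> '10' or '10' -> '01'."""
    for a1, a2, b1, b2 in zip(w1, w1[1:], w2, w2[1:]):
        if {a1 + a2, b1 + b2} == {"01", "10"}:
            return True
    return False

# Examples from the text:
assert is_fpf("111000110")          # valid 9-bit FPF codeword
assert not is_fpf("110100011")      # forbidden pattern (contains "101")
assert has_forbidden_transition("01", "10")      # opposite transitions
assert not has_forbidden_transition("01", "11")  # only one bit switches
```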
[Fig. 9.2 Power/ground (P/G) lines used for bus partitioning and their influence on the maximum crosstalk for: (a) FPF encoding, where max(δng,ng−1) = 2 and max(δng,P/G) = 1, so max(Ceff,ng) = 3Cmw,c + Cmw,g; (b) FTF encoding, where max(δng,ng−1) = 1 and max(δng,P/G) = 1, so max(Ceff,ng) = 2Cmw,c + Cmw,g.]
in a group. Between each first line of a group and the last (i.e., ng th) line of the previous group, a stable P/G line is added. This bus partitioning generally causes no overhead if dynamic data lines, as well as stable P/G lines, are transmitted over the same 3D-interconnect cluster, which is the typical case [252]. The crosstalk factor δi,s of a data line and a stable line is Δbi²[k] (Eq. (9.2) with Δbj[k] equal to 0), and thus limited to a maximum of 1. Consequently, stable lines cannot violate the FTF criterion. In contrast, for an FPF encoding, additional stable lines only lead to a 3C encoding for the metal wires instead of a 2C encoding, as illustrated in Fig. 9.2a. For an exemplary group size of 5, "00001" → "11110" is a valid FPF-CAC sequence. If the group is terminated by a power line (constant logical 1), the effective pattern sequence, including the stable line, is "000011" → "111101". The second pattern is a forbidden pattern (it includes a "101" sequence). Thus, a stable line violates the FPF condition, and the ng th line experiences a crosstalk of 3 Cmw,c in the respective cycle. Therefore, the metal-wire crosstalk would no longer be bounded to 2 Cmw,c. Another advantage of FTF encoding is that it can be implemented for free for one-hot-encoded data lines. One-hot-encoded bit pairs can only take the values "00", "01", or "10". Thus, inverting every second one-hot-encoded line by swapping inverting with non-inverting drivers, similar to the approach presented in Chap. 8, already results in an FTF encoding for the lines. After the fixed inversions, the one-hot-encoded bit pair b2i+1 b2i can only be "10", "11", or "00", while b2i b2i−1 can only be "01", "00", or "11". Hence, the "01" pattern is eliminated from the b2i+1 b2i boundaries and the "10" pattern from the b2i b2i−1 boundaries, resulting in an FTF encoding without inducing any additional bit overhead.
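The fixed-inversion trick can be verified exhaustively for a small one-hot bus (a sketch, not the book's implementation; the bus width N is a hypothetical illustration value):

```python
# For a one-hot-encoded bus, inverting every second line yields an FTF
# code: no two adjacent lines ever switch in opposite directions.

N = 6  # hypothetical one-hot bus width

def invert_odd(word):
    """Fixed inversion of every second line (realized by inverting drivers)."""
    return tuple(b ^ 1 if i % 2 else b for i, b in enumerate(word))

# All one-hot words plus the all-zero word, after the fixed inversions:
codewords = [invert_odd(tuple(int(i == k) for i in range(N))) for k in range(N)]
codewords.append(invert_odd((0,) * N))

def opposite_adjacent(w1, w2):
    """True if any two adjacent lines toggle in opposite directions."""
    return any((w2[i] - w1[i]) * (w2[i + 1] - w1[i + 1]) == -1
               for i in range(len(w1) - 1))

# No codeword transition causes opposite transitions on adjacent lines:
assert all(not opposite_adjacent(a, b) for a in codewords for b in codewords)
```

Intuitively, a one-hot transition toggles at most two lines, one rising and one falling; if those lines happen to be adjacent, exactly one of them is driven inverted, so on the wires both transitions point in the same direction.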
The second analyzed 2D-CAC technique is shielding, which adds stable P/G lines between the data lines to avoid transitions in the opposite direction for adjacent wires. For 2C shielding, data lines, D, are regularly interleaved with stable shield lines, S, resulting in a DSDSDS. . . D 2D layout. In this case, each data line is shielded by two adjacent stable lines, resulting in a metal-wire crosstalk of
9 Low-Power Technique for High-Performance 3D Interconnects

Fig. 9.3 Snake mapping of the bits of a 2D-CAC-encoded data stream onto a TSV array (from b15 (MSB) to b0 (LSB))
2C_mw,c · Δb_i²[k] ≤ 2C_mw,c for each data line i. For 3C shielding, data-line pairs are shielded by stable lines, resulting in a DDSDDS…D 2D layout, which limits the maximum occurring crosstalk class to 3C_mw,c. An assignment of all consecutive bit pairs of any 2D-CAC-encoded data stream onto directly adjacent TSV pairs leads to an ωm/ωe crosstalk-avoidance encoding for the TSVs. One possible systematic assignment, referred to as snake mapping, is illustrated in Fig. 9.3. The snake mapping for a traditional 3C CAC results in a guaranteed ωm/ωe TSV CAC with ωm and ωe at least equal to 1 (i.e., a 1/1 CAC). A traditional 2C CAC results in a guaranteed ωm/ωe TSV CAC with ωm and ωe at least equal to 2. Thus, one can use any 2D CAC as an ωm/ωe CAC by simply applying the snake mapping.
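The snake mapping can be sketched in a few lines of Python (a hypothetical helper; the row-wise boustrophedon order used here is one concrete realization of the mapping in Fig. 9.3):

```python
def snake_mapping(rows, cols):
    # bit index -> (row, col); rows are traversed in alternating direction
    # so that consecutive bit indices always land on directly adjacent TSVs
    mapping = []
    for r in range(rows):
        cs = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
        mapping.extend((r, c) for c in cs)
    return mapping

m = snake_mapping(4, 4)  # e.g., a 16-bit word on a 4 x 4 TSV array
# consecutive bits sit at Manhattan distance 1, so any 2D CAC carries
# over to the TSV array as an ωm/ωe CAC
assert all(abs(m[i][0] - m[i + 1][0]) + abs(m[i][1] - m[i + 1][1]) == 1
           for i in range(len(m) - 1))
```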
9.3.2.1 Performance-Optimal 2D-CAC-to-3D-CAC Assignment
Although the previously introduced snake mapping for a 2D CAC results in an ωm/ωe CAC for the TSVs, other assignments can still result in an even higher TSV performance. There are mainly two reasons for this. First, ωm should ideally be slightly bigger than ωe, which is not satisfied by the systematic snake mapping. Second, the snake mapping does not fully exploit the properties of stable lines. As outlined previously, the crosstalk factor between any data line and a stable line is limited to a maximum of 1. Thus, stable lines should be mapped to the middle of a TSV array in order to reduce the crosstalk of as many data lines as possible. The systematic snake mapping does not meet this requirement, as it only exploits a reduced maximum crosstalk between directly neighboring bit pairs. Hence, in the following, a formal method to find the performance-optimal assignment of the bits of a given pattern set onto an interconnect structure is derived. By means of this mathematical method, the optimal use of any 2D CAC for a specific TSV array can be determined. Equation (3.48) on Page 65 can be used to express the effective capacitances for an initial mapping, assigning codeword bit b_i to the
ith interconnect:

$C_{eff,init}[k] = \operatorname{diag}(S[k] \cdot C).$  (9.7)
As outlined before, the capacitance matrix C is simply equal to C_R for performance optimization. The maximum values of the crosstalk factors δ_i,j and δ_i,l (for all j ≠ l) are assumed to be independent in the following. Please note that this assumption holds for random data words and the 2D CACs discussed in this chapter, but not for an FPF CAC, where max_k(δ_i,i−1) and max_k(δ_i,i+1) are each equal to 2, but max_k(δ_i,i−1 + δ_i,i+1) is also equal to 2 and not 4. For independent crosstalk factors, the maximum effective-capacitance value, for any possible pattern transition and any line (defining the interconnect performance), can be mathematically expressed as

$\hat{C}_{eff,init} = \max \operatorname{diag}(S_{wc} \cdot C_R),$  (9.8)
where “max diag( )” returns the maximum diagonal entry of a matrix. S_wc is a matrix containing the worst-case switching/crosstalk factors:

$S_{wc,i,j} = \max_k(\delta_{i,j}[k]) = \begin{cases} \max_k(\Delta b_i^2[k]) & \text{for } i = j, \\ \max_k(\Delta b_i^2[k] - \Delta b_i[k]\,\Delta b_j[k]) & \text{else.} \end{cases}$  (9.9)
The characteristics of a specific pattern set are captured by S_wc. For random/unencoded bit patterns, S_wc,i,j is equal to 2, except for the diagonal entries, which are equal to 1. For an FTF-encoded pattern set, S_wc,i,j is equal to 2 except for the diagonal elements and their adjacent entries (i.e., j = i + 1 or j = i − 1), which are equal to 1. An additional stable line at position s (i.e., Δb_s = 0) leads to S_wc,s,j equal to 0 for all j, and S_wc,i,s equal to 1 for all i ≠ s. In contrast to the technique presented in Chap. 8, here, only a mere reordering of the bit-to-TSV assignment is considered. This is formally expressed by swapping rows and the corresponding columns of the switching matrix:

$S_{wc,assign} = I_\pi S_{wc} I_\pi^{T},$  (9.10)
where I_π is a valid n × n permutation matrix defining the bit-to-TSV assignment. A valid I_π has exactly one 1 in each row and column, while all other entries are 0. To assign the ith bit to TSV j, the ith entry of the jth row of the matrix is set to 1. Thus, for an exemplary 2 × 2 TSV array, to assign bit 1 to TSV 2, bit 2 to TSV 3, bit 3 to TSV 1, and bit 4 to TSV 4,

$I_\pi = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}.$  (9.11)
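As a sketch (with invented helper names), the permutation matrix of the example above can be constructed and validated with NumPy:

```python
import numpy as np

def perm_matrix(assignment):
    # assignment[i] = TSV index (1-based) of bit i+1; to assign the ith
    # bit to TSV j, the ith entry of the jth row is set to 1
    n = len(assignment)
    P = np.zeros((n, n), dtype=int)
    for bit, tsv in enumerate(assignment):
        P[tsv - 1, bit] = 1
    return P

# bit 1 -> TSV 2, bit 2 -> TSV 3, bit 3 -> TSV 1, bit 4 -> TSV 4
Ipi = perm_matrix([2, 3, 1, 4])
assert (Ipi == [[0, 0, 1, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]]).all()
# a valid permutation matrix has exactly one 1 per row and column
assert (Ipi.sum(axis=0) == 1).all() and (Ipi.sum(axis=1) == 1).all()
```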
Thus, the maximum effective capacitance as a function of the bit-to-TSV assignment is expressed as:

$\hat{C}_{eff,assign} = \max \operatorname{diag}(I_\pi S_{wc} I_\pi^{T} \cdot C_R).$  (9.12)
Finally, the performance-optimal assignment I_π,perf-opt, which minimizes Ĉ_eff, has to fulfill

$I_{\pi,\text{perf-opt}} = \underset{I_\pi \in S_{I_\pi,n}}{\arg\min}\left(\max \operatorname{diag}(I_\pi S_{wc} I_\pi^{T} \cdot C_R)\right),$  (9.13)
where S_{I_π,n} is the set of all valid n × n permutation matrices. Instead of solving Eq. (9.13) exactly, I_π,perf-opt is determined in this work with simulated annealing to reduce the computational complexity.
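The flow of Eqs. (9.9) to (9.13) can be sketched as follows. This is a simplified Python/NumPy illustration with invented helper names and a toy capacitance matrix, not the book's implementation; the annealing schedule is a plain linear cool-down:

```python
import numpy as np

def swc_matrix(n, ftf=False, stable=()):
    # worst-case switching/crosstalk matrix S_wc of Eq. (9.9) for
    # uniform-random data; ftf=True bounds the factor for directly
    # adjacent lines by 1; `stable` lists indices of stable P/G lines
    S = np.full((n, n), 2.0)
    np.fill_diagonal(S, 1.0)
    if ftf:
        for i in range(n - 1):
            S[i, i + 1] = S[i + 1, i] = 1.0
    for s in stable:
        S[:, s] = 1.0   # crosstalk with a stable aggressor is bounded by 1
        S[s, :] = 0.0   # a stable line itself never switches
    return S

def c_eff_max(perm, Swc, CR):
    # maximum effective capacitance of Eq. (9.12); perm[i] = TSV of bit i
    n = len(perm)
    P = np.zeros((n, n))
    P[perm, np.arange(n)] = 1.0
    return np.max(np.diag(P @ Swc @ P.T @ CR))

def anneal_assignment(Swc, CR, steps=5000, T0=1.0, seed=0):
    # simulated-annealing search for Eq. (9.13): swap two bits per step,
    # accept worse solutions with a temperature-dependent probability
    rng = np.random.default_rng(seed)
    n = Swc.shape[0]
    perm = np.arange(n)
    cur = best = c_eff_max(perm, Swc, CR)
    best_perm = perm.copy()
    for k in range(steps):
        T = max(T0 * (1 - k / steps), 1e-9)
        i, j = rng.choice(n, size=2, replace=False)
        cand = perm.copy()
        cand[[i, j]] = cand[[j, i]]
        c = c_eff_max(cand, Swc, CR)
        if c < cur or rng.random() < np.exp((cur - c) / T):
            perm, cur = cand, c
            if c < best:
                best, best_perm = c, cand.copy()
    return best_perm, best

# toy 3 x 3 array: self-capacitance 10 fF, 1 fF between edge-adjacent TSVs
n = 9
pos = [(r, c) for r in range(3) for c in range(3)]
CR = 10.0 * np.eye(n)
for a in range(n):
    for b in range(n):
        if abs(pos[a][0] - pos[b][0]) + abs(pos[a][1] - pos[b][1]) == 1:
            CR[a, b] = 1.0
Swc = swc_matrix(n, ftf=True, stable=(8,))  # last bit is a stable line
perm, c = anneal_assignment(Swc, CR)
assert c <= c_eff_max(np.arange(n), Swc, CR)  # never worse than identity
```

With the stable line included, the search tends to move it away from the array edge, mirroring the observation below that stable lines are most effective in the array middle.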
9.4 Extension to a Low-Power 3D CAC

In this section, the proposed 3D-CAC technique is extended to a low-power 3D-CAC technique, improving the performance and the power consumption of metal wires and TSVs simultaneously. Of the underlying 2D CACs in the proposed 3D-CAC approach, only FTF encoding effectively reduces the metal-wire power consumption [68]. Therefore, only an ωm/ωe TSV CAC based on an FTF encoding is analyzed here (with and without bus partitioning). For the low-power extension, the formal method to express the interconnect power consumption as a function of the TSV assignment, presented in the previous chapter, is reused. However, the previous method considers the possibility of assigning negated nets as well as the MOS effect. Since neither is required here, the formula for the assignment-dependent power consumption (Eq. (8.13) on Page 160) simplifies to

$P_{assign} = \frac{f V_{dd}^2}{2}\,\langle S_{E,assign}, C_{assign}\rangle = \frac{f V_{dd}^2}{2}\,\langle I_\pi S_E I_\pi^{T}, C_R\rangle.$  (9.14)
Matrix SE includes the switching statistics of the bit values (see Eq. (3.46) on Page 65). These switching statistics for unencoded and FTF-encoded data streams are briefly discussed in the following. Thereby, the transmission of uniform-random data (before the CAC encoding) is considered. SE,i,j values for the FTF-encoded and the unencoded data are plotted in Fig. 9.4. For the unencoded data, all SE entries are about 0.5 (see Fig. 9.4b), as the toggle probability of each bit is 50% (i.e., αi = 0.5) while the bit pairs switch completely uncorrelated (i.e., γi,j = 0). The
Fig. 9.4 S_E,i,j values for an FTF-encoded (a) and an unencoded (b) random data stream, plotted over the bit index i, with S_E,i,i = α_i and S_E,i,i+k = α_i − γ_i,i+k for k = 1, …, 5
FTF encoding reduces the switching activities to about 40%, as it results in a positive temporal correlation between consecutively transmitted bit patterns. Furthermore, FTF encoding induces positive switching correlations between bit pairs (i.e., γ_i,j > 0), which further decreases some S_E entries. S_E,i,j for directly adjacent neighbors is even reduced by about 50% compared to its value for unencoded data. The further j is apart from i, the more the bit pairs b_i and b_j tend to switch uncorrelated, resulting in increased S_E,i,j values, which asymptotically reach α_i ≈ 0.4. These bit-level statistics reveal that neighboring bit pairs have to be assigned to TSVs connected by a large coupling capacitance to effectively reduce the TSV power consumption for the transmission of FTF-encoded patterns. Consequently, neighboring bit pairs have to be mapped onto directly adjacent TSVs. This criterion is satisfied by the proposed snake mapping, presented in the previous section as a high-performance assignment. Thus, the snake mapping of FTF-encoded patterns even results in a low-power 3D CAC. However, the snake mapping will likely not lead to the lowest power consumption possible, as it considers neither the heterogeneity in the S_E entries (e.g., Fig. 9.4 shows an S_E,i,j increase for i equal to 1 and a decrease for i equal to 2) nor the properties of any stable lines. Thus, a formal method to obtain the optimal low-power and high-performance assignment is derived in the following. For this purpose, two metrics must be minimized through the assignment: the power consumption and the maximum propagation delay. Again, the delay is estimated through Ĉ_eff,assign to obtain solutions that do not depend on the driver technology. To combine the two metrics into a single total-cost function, TC, the individual costs are normalized by the values for the initial assignment. Hence, the assignment has
to minimize

$TC = \kappa_p \frac{P_{assign}}{P_{init}} + (1-\kappa_p)\frac{\hat{C}_{eff,assign}}{\hat{C}_{eff,init}} = \kappa_p \frac{\langle I_\pi S_E I_\pi^{T}, C_R\rangle}{\langle S_E, C_R\rangle} + (1-\kappa_p)\frac{\max \operatorname{diag}(I_\pi S_{wc} I_\pi^{T} C_R)}{\max \operatorname{diag}(S_{wc} C_R)},$  (9.15)
where κ_p, which can take any real value between 0 and 1, defines the weighting of the two optimization objectives. A low power consumption is prioritized more strongly with increasing κ_p, and vice versa. Finally, the optimal bit-to-TSV assignment, for a given κ_p, can be determined by the following equation:
$I_{\pi,\text{TC-opt}} = \underset{I_\pi \in S_{I_\pi,n}}{\arg\min}\left(\kappa_p \frac{P_{assign}}{P_{init}} + (1-\kappa_p)\frac{\hat{C}_{eff,assign}}{\hat{C}_{eff,init}}\right).$  (9.16)
Instead of solving Eq. (9.16) exactly, again, simulated annealing is used to reduce the computational complexity.
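A compact Python/NumPy sketch of the switching-statistics estimation and of the total cost of Eq. (9.15) is shown below. The helper names are invented, and the relation S_E,i,i = α_i, S_E,i,j = α_i − γ_i,j is assumed from Fig. 9.4 and Eq. (3.46):

```python
import numpy as np

def estimate_SE(bits):
    # empirical switching matrix S_E: diagonal entries are the toggle
    # probabilities alpha_i, off-diagonal entries alpha_i - gamma_{i,j},
    # with gamma_{i,j} the switching correlation E[db_i * db_j]
    d = np.diff(bits.astype(float), axis=0)   # db[k] in {-1, 0, 1}
    alpha = np.mean(d * d, axis=0)
    gamma = (d.T @ d) / d.shape[0]
    SE = alpha[:, None] - gamma
    np.fill_diagonal(SE, alpha)
    return SE

def total_cost(perm, SE, Swc, CR, kappa_p):
    # weighted total cost TC of Eq. (9.15); <A, B> is realized as the
    # elementwise (Frobenius) inner product
    n = len(perm)
    P = np.zeros((n, n))
    P[perm, np.arange(n)] = 1.0
    p_ratio = np.sum((P @ SE @ P.T) * CR) / np.sum(SE * CR)
    c_ratio = (np.max(np.diag(P @ Swc @ P.T @ CR))
               / np.max(np.diag(Swc @ CR)))
    return kappa_p * p_ratio + (1 - kappa_p) * c_ratio

rng = np.random.default_rng(1)
SE = estimate_SE(rng.integers(0, 2, size=(100000, 8)))
assert np.allclose(SE, 0.5, atol=0.02)  # uniform-random data: entries near 0.5

Swc = np.full((8, 8), 2.0)
np.fill_diagonal(Swc, 1.0)
CR = np.eye(8)
# the identity assignment reproduces the initial costs, so TC = 1
assert abs(total_cost(np.arange(8), SE, Swc, CR, 0.5) - 1.0) < 1e-9
```

Minimizing `total_cost` over permutations (e.g., with the annealing sketch from the previous section) then yields a candidate for Eq. (9.16).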
9.5 Evaluation

Throughout this section, the proposed low-power technique for high-performance 3D interconnects is evaluated in depth. The expected improvement in the TSV performance due to the proposed 3D-CAC technique is analyzed in Sect. 9.5.1 for various TSV arrays, assuming that the power consumption is not considered as an optimization objective. Afterward, the trade-off when the power consumption and the performance are both considered as optimization objectives is investigated in Sect. 9.5.2. Finally, the proposed technique is compared to previous 3D-CAC techniques.
9.5.1 TSV-Performance Improvement

The effect of the proposed CAC technique on the interconnect performance for various TSV arrangements is quantified in the following for the case that the TSV power consumption is not an optimization objective. For the analyses, TSV capacitance matrices, extracted with the Q3D Extractor for a significant frequency of 10 GHz, are employed in the formulas. The capacitance matrices are extracted for a TSV radius and minimum pitch of 1 μm and 4 μm, respectively. To show the gain for larger TSV dimensions, capacitance matrices are
also extracted for a TSV radius and minimum pitch of 2 μm and 8 μm, respectively. The TSV length is 50 μm in both cases. Capacitance matrices for a wide range of TSV-array dimensions M × N are investigated. More precisely, square arrays from 3 × 3 to 7 × 7 and non-square arrays, with M equal to 3 and N equal to 6, 9, and 12, are analyzed. As the underlying 2D CACs for the proposed technique, 2C and 3C shielding, as well as FTF encoding, are analyzed. Besides the traditional FTF encoding, the CAC is additionally investigated for the scenario where approximately 10% stable TSVs are present in each array, which are exploited by the proposed bus-partitioning (BP) technique to reduce the CODEC complexity. For example, two stable lines are assumed in a 4 × 4 array, which partition the 14 remaining data bits into three groups (b1 to b5, b′1 to b′5, and b∗1 to b∗4). The overall-maximum effective capacitance, Ĉ_eff, is directly proportional to the maximum delay for ideal drivers, modeled by a simple pull-up/pull-down resistance, as shown in Sect. 3.2. Thus, Ĉ_eff is used throughout this book to estimate the driver-independent/theoretical delay or performance improvement of optimization techniques. For the analyzed variants of the proposed technique, as well as for random/unencoded pattern sets, the maximum effective-capacitance values for the TSVs are reported in Fig. 9.5. The figure reveals that, for unencoded data, the performance is almost independent of the TSV-array shape. For all TSV arrays, the snake mapping leads to a 2/(1+λ_e1) (ωm/ωe) CAC for FTF pattern sets and, if no stable lines are present, the expected delay improvements due to the snake mapping and an optimal assignment do not differ noticeably. Compared to an unencoded pattern set, the FTF encoding without BP always leads to an improvement in the maximum effective capacitance by 18–21%.
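The grouping used for BP can be reproduced with a small hypothetical helper. The book does not prescribe this exact split rule, but an as-equal-as-possible split matches both examples given in this chapter:

```python
def bus_partition(n_bits, n_stable):
    # split n_bits data lines into n_stable + 1 groups of (almost)
    # equal size, one stable P/G line separating consecutive groups
    groups = n_stable + 1
    base, extra = divmod(n_bits, groups)
    return [base + 1 if i < extra else base for i in range(groups)]

assert bus_partition(14, 2) == [5, 5, 4]   # the 4 x 4 example above
assert bus_partition(16, 2) == [6, 5, 5]   # 16 data bits, two stable lines
```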
As expected, when stable lines are used for shielding or BP, an optimal TSV assignment generally results in a noticeably higher TSV-performance improvement than the systematic snake mapping. The only exception is the 2C-shielding technique, where the snake mapping is equivalent to the optimal assignment. Here, the snake mapping results in a 4/(2+2λ_e1) CAC and thus completely avoids opposite transitions between directly adjacent TSVs, as illustrated in Fig. 9.6a. Thus, the maximum possible effective-capacitance value is drastically improved by about 40%, implying a performance improvement of over 60%. For all other analyzed scenarios, a reassignment of the stable lines, performed by the optimal assignment technique, increases the efficiency of the proposed 3D-CAC technique, as outlined for an exemplary 4 × 4 array. The snake mapping of 3C-shielded patterns leads to a (1+2λ_d)/(1+λ_d) CAC (Ĉ_eff reduction by about 15%). Here, a different assignment can lead to a further increased ωm value, resulting in a higher performance. As illustrated in Fig. 9.6b, the stable shields (S) are optimally placed in a way that ωm is increased to 2 + 2λ_d. Consequently, the 3C shielding ideally results in an improvement in Ĉ_eff by 22% instead of the 15% obtained for the snake mapping.
Fig. 9.5 Effect of the proposed technique on the TSV delay/performance (i.e., Ĉ_eff in fF) for the different underlying 2D crosstalk-avoidance techniques (unencoded, FTF, FTF+BP, 2C shielding, and 3C shielding). Compared are the two assignment techniques, optimal and snake mapping, for two different TSV dimensions and various array shapes (4 × 4, 3 × 6, 5 × 5, 3 × 12, and 7 × 7). (a) rtsv = 1 μm, dmin = 4 μm, ltsv = 50 μm. (b) rtsv = 2 μm, dmin = 8 μm, ltsv = 50 μm
Fig. 9.6 Performance-optimal bit-to-TSV assignments for traditional crosstalk-avoidance techniques and a 4 × 4 array. (a) 2C shielding. (b) 3C shielding. (c) FTF+BP
Also, for the combination of FTF encoding and the proposed BP technique, the performance improvement of the optimal assignment technique is significantly higher than for the snake mapping, which results in a 2/(1+λe1 ) TSV CAC. For the exemplary 4 × 4 array, the optimal assignment—illustrated in Fig. 9.6c—increases the ωm value to 3 + λd by mapping the two stable lines to the array middle. This boosts the theoretical performance improvement from about 23% to above 40%. However, a significant increase in ωm is here only possible for small TSV bundles due to the small fraction of stable TSVs considered. For bigger bundles, the ratio of middle TSVs over edge TSVs increases. Thus, for the FTF encoding with BP, the gain of the optimal assignment technique over the snake mapping decreases toward zero with an increasing TSV count. In summary, the snake mapping is equivalent to the performance-optimal assignment if no stable lines are present. Stable lines, located at the array edges for the snake mapping, are optimally reassigned to TSVs located in the middle of the arrays in order to shield the maximum amount of TSVs, resulting in a further performance improvement over the snake mapping.
9.5.2 Simultaneous TSV Delay and Power-Consumption Reduction

In the following, the simultaneous reduction in the TSV delay and power consumption by the proposed low-power 3D-CAC technique is investigated. Besides the trade-off between performance and power-consumption improvement, the gain of the optimal assignment over the systematic snake mapping is analyzed. For this purpose, the optimal bit-to-TSV assignments are determined by means of Eq. (9.16) for different κ_p values between 0 and 1. Two TSV-array structures are analyzed in this subsection: a 4 × 4 and a 5 × 5 array. In both arrays, the TSVs have a radius of 2 μm, a minimum pitch of 8 μm, and a length of 50 μm. The TSV capacitance matrices are extracted with the Q3D Extractor for a significant frequency of 10 GHz. Investigated are the transmission of random data, FTF-encoded random data, and FTF-encoded random data with two stable lines exploited for bus partitioning (FTF+BP). To take the induced overhead of an FTF encoding as well as the different numbers of data TSVs into account, the power consumptions normalized by the number of transmitted bits per cycle are compared. The resulting theoretical delay and power-consumption reductions are shown in Fig. 9.7. For the FTF encoding without BP, the results show a reduction in the maximum delay by about 19% (i.e., a 23% performance increase) for both arrays if the power consumption is not considered as an optimization objective (i.e., κ_p = 0). However, no reduction or even a slight increase in the power consumption is obtained in this case. For all κ_p values between 0.01 and 0.99, still a delay reduction of about 19% is achieved, but also a power-consumption reduction by about 8–9%. Therefore, the low-power extension enables a significant reduction in the power
Fig. 9.7 Effect of the proposed low-power 3D-CAC technique on the maximum delay and the power consumption of TSVs, compared to unencoded/random data, for (a) FTF encoding, (b) FTF encoding with two stable lines used for BP. Arrows indicate the locations of κp values for the 4 × 4 array. Negative values indicate a degradation in power consumption or performance due to the applied technique
consumption while providing the same performance improvement as the mere high-performance approach. If only the power consumption is considered (i.e., κ_p = 1), the reduction in the delay drops to about 10%, without a noticeable further decrease in the power consumption. As already discussed in the previous subsection, the snake mapping also leads to the best TSV performance for FTF-encoded data without bus partitioning. The analysis in this subsection shows that the systematic assignment furthermore improves the power consumption by 6.5–7.6% for FTF-encoded data and thus also results in an effective low-power 3D CAC. However, due to some unexploited heterogeneity in the S_E entries, the snake mapping does not result in the minimum possible power consumption and is therefore not optimal. If stable lines are exploited to partition the FTF encoding (FTF+BP), there is a much stronger trade-off between performance and power-consumption optimization. When the power consumption is not considered during the optimization (i.e., κ_p = 0), the delay reduction is 25.2% and 32.6% for the 5 × 5 and the 4 × 4 array, respectively. However, here, the power consumption can even increase. If both quantities are weighted equally (i.e., κ_p = 0.5), the delay reduction is 24.5% and 28.8%, while the power consumption is decreased by 4.5% and 6.7%, for the 5 × 5 array and the 4 × 4 array, respectively. κ_p equal to 0.5 offers a good
trade-off for larger TSV counts. While for the smaller 4 × 4 array the degradation in the delay reduction compared to the performance-optimal case is about four percentage points (pp), the degradation is below one percentage point for the 5 × 5 array. For a power weight of 10% (i.e., κ_p = 0.1), the delay reduction is 25.2–32.1% and the power-consumption reduction is 1.1–1.8%. Hence, a small penalty in the delay reduction of at most 0.5 pp already results in a reduction, instead of an increase, in the power consumption. If the power weight is nine times higher than the performance weight (i.e., κ_p = 0.9), the power consumption can be reduced by 6.6% and 7.7%, while the delay reduction is degraded to 22.2–28.6%. The significantly stronger trade-off for the FTF encoding with BP is caused by the stable lines, which can either be mapped onto the TSVs with the overall highest accumulated capacitance, in order to reduce the power consumption effectively, or between the largest number of dynamic data lines, where they are the most effective shields. Furthermore, since stable lines are not exploited by the systematic snake mapping, it results in neither the lowest possible power consumption nor the highest performance if BP is applied. In conclusion, the theoretical evaluation reveals that the low-power extension presented in this chapter allows for a decrease in the power consumption while still providing nearly the same performance improvement, but only if no stable lines are present in the array. If stable lines are present, there often is a clear trade-off between power and performance improvement. However, the trade-off vanishes with a decreasing fraction of stable lines. Thus, for big TSV arrays with only a few stable TSVs, a power- and performance-aware assignment results in (almost) the same performance improvement as an assignment that only aims to optimize the TSV performance.
9.5.3 Comparison with Existing 3D CACs

In the following, the proposed technique is compared to the previous 3D CACs from [54, 138, 265]. Thereby, not only the power consumption and the performance of TSVs are investigated, but also the impact of the techniques on the metal-wire performance and power consumption, as well as the coding overhead. To measure the power and performance metrics, the transmission of 16-bit data words over a metal-wire bus and a TSV array is simulated with the Spectre circuit simulator. For this purpose, the driver and measurement setup from Sect. 3.4 is used. The investigated stream of data words that have to be transmitted is made up of 1 × 10⁵ uniform-random bit patterns. In this experiment, the dimensions of the global TSVs are equal to the minimum predicted ones (i.e., rtsv = 1 μm, dmin = 4 μm, and ltsv = 50 μm). To obtain the capacitance matrices of the TSV arrays, as well as 3π resistance-inductance-capacitance (RLC) equivalent circuits for the Spectre simulations, parasitic extractions for the parameterizable TSV-array model are performed. The capacitance matrices and the equivalent circuits of the metal-wire structures are generated with the 40-nm TSMC wire tool, which is based on parasitic extractions with Synopsys Raphael. Here, metal wires in the fourth layer (M4) with a width/spacing of 0.15 μm and a segment length of 100 μm are considered. The chosen value for the wire width and spacing corresponds to the minimum possible one for M4. As the underlying 2D CAC in the proposed technique, FTF encoding, 2C shielding, and 3C shielding are analyzed. Thereby, two P/G lines are considered for the FTF encoding, which are exploited for BP. This represents a typical case, as at least one power and one ground TSV are required to set up a power network spanning both dies. Thus, the FTF encoding is partitioned into three groups employing two 5-bit-to-7-bit FTF encoders and one 6-bit-to-9-bit FTF encoder. Hence, arrays containing 25 TSVs are required for the 23 FTF-encoded data lines and the two P/G lines. A 5 × 5 array is considered here, as TSV arrays as square as possible are considered in this subsection. Consequently, a 4 × 4 TSV array is considered for the unencoded 16-bit data. For the 2C- and the 3C-shielded data, TSV arrays containing 32 and 24 TSVs are required, respectively. Thereby, the two required P/G TSVs are simply embedded into the TSV arrays as shields to minimize the TSV overhead. Thus, a 4 × 8 and a 4 × 6 TSV array are used for the analysis of the shielding techniques. For all previous 3D CACs except the 4LAT, the encoded data words cannot be transmitted together with P/G lines over one array without violating the CAC-pattern conditions of the encoding. Thus, the existence of a separate bundle for P/G TSVs is assumed for the analysis of these previous 3D CACs. To obtain the minimum overhead for the previous 6C, 4LAT, and 6C-FNS encodings, a 5 × 4, a 3 × 9, and a 3 × 8 TSV array are required, respectively. Reported in this subsection are the power-consumption and delay/performance improvements according to circuit simulations for the moderate-sized drivers from Sect.
9.2.³ To determine the CODEC complexities, all encoder-decoder pairs are synthesized in a commercial 40-nm technology, and the resulting areas in gate equivalents (GE) are reported. Here, CODEC delays are not reported since they can be hidden in a pipeline [138]. Furthermore, the growth in the coding complexity with the input bit width, m, and the asymptotic bit overheads are reported. Thereby, a minimum of 5% stable lines is assumed in the interconnect structures. The results, reported in Table 9.5, show that previous techniques reduce the maximum TSV delay by at most 9.4% (4LAT encoding). This corresponds to a performance increase of only 10.4%. For the presented technique, the TSV-performance improvement can be more than five times larger (2C shielding: 60.8%). Moreover, the 2C shielding improves the metal-wire performance by 90.8% and does not require any CODEC. In comparison, the 4LAT approach does not optimize the metal-wire performance and requires a large CODEC of 1,915 GE. Moreover, a 4LAT encoding dramatically increases the TSV and the metal-wire power consumption by 50.1% and 46.0%, respectively, while the proposed shielding techniques do not affect the power consumption noticeably. However, the proposed 2C shielding
³ For the larger-sized drivers from Sect. 9.2, the results show no remarkable differences.
Table 9.5 Effect of the proposed and existing 3D-CAC techniques on the maximum delay of TSVs (T̂pd,tsv) and metal wires (T̂pd,mw), as well as the power consumption of TSVs (Ptsv) and metal wires (Pmw). Moreover, the induced bit overhead (OH), the required CODEC area (Acodec), the increase in the CODEC complexity with increasing input width, and the asymptotic bit overhead are reported. Delay and power changes are for random 16-bit input data.

| Method | ΔT̂pd,tsv [%] | ΔT̂pd,mw [%] | OH [%] | Acodec [GE] | ΔPtsv [%] | ΔPmw [%] | lim m→∞ OH(m) [%] | CODEC complexity |
| Proposed 3D-CAC technique | | | | | | | | |
| FTF BP (κp = 0) | −20.4 | −47.6 | 43.8 | 411 | 1.2 | −21.9 | 44 | O(m) |
| FTF BP (κp = 0.5) | −20.4 | −47.6 | 43.8 | 411 | −5.3 | −21.9 | 44 | O(m) |
| 2C shielding | −37.8 | −47.6 | 87.5 | 0 | 1.8 | 0.0 | 95 | 0 |
| 3C shielding | −20.6 | −21.3 | 37.5 | 0 | −1.4 | 0.0 | 45 | 0 |
| Previous 3D-CAC techniques | | | | | | | | |
| 6C [138] | −6.9 | 0.0 | 25.0 | 181 | 8.1 | −7.4 | 44 | O(m^1.5) |
| 4LAT [265] | −9.4 | 0.0 | 68.8 | 1915 | 50.1 | 46.0 | 80 | O(e^m) |
| 6C-FNS [54] | −6.8 | 0.0 | 50.0 | 718 | 25.7 | 19.1 | 50 | O(m) |
has the drawback that it results in a drastically increased number of TSVs (95%), quantified by the bit overhead, which typically is a critical metric. Considering all metrics, the best previous technique is the 6C coding. It is the only one that does not drastically increase the interconnect power consumption; it also results in the lowest bit overhead and needs the smallest CODEC. However, it only achieves a TSV-performance improvement of 7.4% (6.9% delay reduction). The best proposed ωm/ωe-CAC variant, if all metrics are taken into account, is the FTF encoding with BP based on a power- and performance-aware TSV assignment (highlighted in Table 9.5). With the same minimal bit overhead as the best previous 6C technique, it achieves a more than three times higher TSV-performance increase, and furthermore an increase in the metal-wire performance by 90.8%. At the same time, it can decrease the power consumption of TSVs and metal wires by 5.3% and 21.9%, respectively. Thus, at the same increase in the line count, the proposed technique results in drastically higher power and performance gains than all previous methods. The only drawback of the proposed FTF technique compared with the 6C CAC is a slight increase in the CODEC size. However, the CODEC is of secondary importance, as both techniques generally have relatively low hardware requirements. Furthermore, due to the superior scaling of the proposed technique, enabled by the applied BP technique, this minor drawback will vanish with an increasing bit width of the transmitted patterns. Hence, the proposed technique significantly outperforms existing 3D CACs in all relevant metrics.
9.6 Conclusion

This chapter presented a low-power technique for high-performance 3D interconnects. Previous 3D-CAC techniques only allow for a limited improvement in the TSV performance due to the neglected edge effects. Moreover, previous 3D CACs increase the already critical TSV power consumption. By taking the edge effects into account, the proposed technique overcomes both limitations. Furthermore, it allows for a simultaneous improvement in the power consumption and performance of the metal wires in 3D ICs. The fundamental idea of the proposed method is to exploit the characteristics of pattern sets encoded with a traditional 2D CAC using a performance- and power-aware net-to-TSV assignment that takes the edge effects into account. Various underlying 2D CACs have been analyzed for the proposed technique, which shows a maximum TSV and metal-wire performance improvement of 60.8% and 90.8%, respectively. Despite this vast performance gain, the technique can improve the TSV and metal-wire power consumption by 5.3% and 21.9%, respectively. In contrast, the best previous approach improves the TSV performance by at most 12.0%, while it provides no performance increase for metal wires. It also has higher hardware costs and increases the interconnect power consumption by up to 50%. Hence, the proposed technique outperforms previous methods in all important metrics.
Chapter 10
Low-Power Technique for High-Performance 3D Interconnects in the Presence of Temporal Misalignment
The previous chapter presented a technique that increases the performance and reduces the power consumption of three-dimensional (3D) interconnects. However, due to the high coupling/crosstalk complexity in 3D, the technique still leads to an asymptotic bit overhead of 44%, which is unacceptable in cases where the large through-silicon via (TSV) dimensions or their low manufacturing yield are serious design concerns. Additionally, the technique is only applicable if the signal edges on the interconnects are perfectly temporally aligned. Thus, a technique to increase the TSV performance in the presence of temporal misalignment (skew) between the signal edges, without inducing any bit overhead, is presented in this chapter. The proposed technique is again based on an intelligent local net-to-TSV assignment, which here exploits temporal misalignment in order to improve the TSV performance. In detail, the switching times on the signal nets that have to be electrically connected through a given TSV array are analyzed. Afterward, the proposed technique interleaves the net-to-TSV assignment such that the performance is optimized. The proposed technique only requires an interleaving in the local net-to-TSV assignment within the individual arrays, and thus does not result in noticeable implementation costs, as shown in Sect. 8.5.1. Despite the negligible implementation costs, experimental results show that the approach can improve the TSV performance by 65%. Moreover, the proposed assignment technique increases the efficiency of traditional lightweight low-power codes (LPCs), which aim to optimize the switching activities of the individual lines [89]. In detail, it is shown how the proposed technique is extended by low-complexity LPCs such that the TSV power consumption is reduced by over 15%.
Thus, the technique can improve the TSV performance more significantly than all 3D-crosstalk-avoidance code (CAC) techniques, while still allowing for a more than 5× higher reduction in power consumption at a bit overhead that is
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 L. Bamberg et al., 3D Interconnect Architectures for Heterogeneous Technologies, https://doi.org/10.1007/978-3-030-98229-4_10
lower by a factor of 5×. Hence, the proposed technique can outperform all other techniques significantly. The remainder of this chapter is structured as follows. Section 10.1 extends the crosstalk model from the previous chapter so that it is also applicable in the case of an arbitrary partial temporal misalignment between the signal edges.1 The proposed optimization technique is presented in Sect. 10.2. Afterward, it is outlined in Sect. 10.3 how the assignment method can be used to reduce the 3D-interconnect power consumption effectively. In Sect. 10.4, the technique is evaluated. Finally, a conclusion is drawn.
10.1 Temporal-Misalignment Effect on the Crosstalk

For the high-level optimization techniques presented in this part of the book, the interconnect performance is mathematically quantified through the pattern-dependent maximum effective capacitance, which is proportional to the maximum signal propagation delay (for ideal drivers), as shown in Sect. 3.2. For all CAC techniques, the effective capacitance is expressed by Eq. (9.1) on Page 179. In this equation, the crosstalk factor, δ_{i,j}, determines the increase in the delay of the ith interconnect in the respective clock cycle due to the coupling capacitance between the ith and the jth interconnect. Generally, δ_{i,j} depends not only on the switching of the binary value transmitted over the ith interconnect in the current cycle, Δb_i[k], but also on the switching on the remote jth interconnect, Δb_j[k]; hence the term crosstalk. Perfectly temporally aligned signal edges were assumed for the derivation of all existing CACs. In this case, the crosstalk factor can be mathematically expressed as

δ_{i,j}[k] = Δb_i²[k] − Δb_i[k] Δb_j[k],   (10.1)
as shown in the previous chapter. According to Eq. (10.1), a crosstalk factor can only take one of the three values 0, 1, or 2. The highest/worst-case crosstalk factor of 2 occurs for transitions in the opposite direction on the two interconnects i and j (i.e., Δb_i[k]Δb_j[k] = −1). However, the formula for the effective capacitance changes if the signals on the interconnects switch completely temporally misaligned (e.g., the input of the ith line, b_i, switches after the switching of b_j has already propagated over the jth line). In this case, the effective capacitance has to be calculated by means of Eq. (3.19) (see Page 57) instead of Eq. (3.18). Comparing Eq. (3.19) with the general formula
1 A partial temporal misalignment between two signals implies here that one signal switches after the other one, but still before the switching of the other line is completed.
Eq. (9.1) reveals that the crosstalk factor for completely temporally misaligned signal edges is equal to

δ_{i,j}[k] = Δb_i²[k].   (10.2)
Hence, the delay/performance of interconnect i no longer depends on the switching on interconnect j when the events on both nets occur with a large enough temporal misalignment (skew). The expected values of Eqs. (10.1) and (10.2) are generally equal. With misalignment, δ_{i,j} is increased from 0 to 1 for transitions in the same direction (i.e., Δb_i[k]Δb_j[k] = 1). In contrast, δ_{i,j} is reduced from 2 to 1 for transitions in the opposite direction (i.e., Δb_i[k]Δb_j[k] = −1). Thus, misalignment itself cannot improve the power consumption, which is determined by the mean effective capacitance values, at least not for random bit patterns with E{Δb_i Δb_j} equal to 0 for i ≠ j. However, the performance can be improved, as the misalignment reduces the maximum value of a crosstalk factor from 2 to 1, which implies an improvement in the maximum propagation delay. Consequently, a circuit designer could artificially force a sufficient misalignment on some input lines in order to improve the interconnect performance without increasing the power requirements. For TSVs, a small temporal misalignment is often already enough to improve the performance effectively. The reason is the relatively fast TSV switching when crosstalk effects are reduced, due to the low TSV resistances and ground capacitances. One possibility to force misalignment is to use the rising clock edges for the data transmission over one half of the TSVs in the array and the falling clock edges for the other half. In this case, 50% of the signal-edge pairs are temporally misaligned. However, small intrinsic temporal skews between the input-signal transitions are typically present anyway if the inputs of the TSVs are not direct flip-flop outputs. In that case, some signals always switch temporally misaligned as a result of different timing paths in the preceding circuit. Generally, such an intrinsic misalignment tends to be normally distributed with zero mean.
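The contrast between Eqs. (10.1) and (10.2) can be checked numerically. The following sketch (variable names are my own, not from the book) draws random bit streams and compares the crosstalk factors for perfectly aligned and completely misaligned edges: the maximum drops from 2 to 1, while the mean stays near 0.5 for uncorrelated data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random bit streams on two lines; Delta b is the per-cycle transition (-1, 0, +1).
b_i = rng.integers(0, 2, 10001)
b_j = rng.integers(0, 2, 10001)
db_i, db_j = np.diff(b_i), np.diff(b_j)

delta_aligned    = db_i**2 - db_i * db_j   # Eq. (10.1): perfectly aligned edges
delta_misaligned = db_i**2                 # Eq. (10.2): completely misaligned edges

# Misalignment halves the worst case (2 -> 1) but leaves the mean essentially
# unchanged for random data, since E{db_i * db_j} ~ 0 for i != j.
print(delta_aligned.max(), delta_misaligned.max())    # 2 1
print(delta_aligned.mean(), delta_misaligned.mean())  # both close to 0.5
```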
Hence, the assumption that only completely misaligned or perfectly aligned edges occur does not hold in this case. To estimate the interconnect crosstalk in the presence of an arbitrary/partial misalignment using Eq. (9.1), δ_{i,j} must be a function of the temporal alignment of the signal edges on the interconnects i and j:

δ_{i,j}[k] = Δb_i²[k] − A_{i,j} Δb_i[k] Δb_j[k].   (10.3)
Here, A_{i,j} is the alignment factor, which depends on the time skew between the edges of the binary signals transmitted over the interconnects i and j. The skew is mathematically expressed as

T_skew,i,j = T_edge,i − T_edge,j,   (10.4)
where T_edge,i and T_edge,j are the switching times of the binary values on the interconnects i and j (if both signals toggle in the cycle) relative to the rising clock edge. It is already known from the previous analysis that A_{i,j} is 0 if the signals switch completely temporally misaligned, and 1 if they switch perfectly temporally aligned (i.e., for T_edge,i = T_edge,j). To obtain a good estimate for A_{i,j} as a function of T_skew,i,j, Spectre circuit simulations for the setup illustrated in Fig. 3.3 on Page 66 are carried out. A 4 × 4 array with typical global TSV dimensions (i.e., r_tsv = 2 μm, d_min = 8 μm, and l_tsv = 50 μm) is analyzed. All TSV parasitics used throughout this chapter are obtained by parasitic extractions with the Q3D Extractor for the TSV-array model presented in Sect. 1.3.1. In the simulations, each TSV (modeled by means of an extracted 3π resistance-inductance-capacitance (RLC) circuit) is driven by an inverter made up of 22-nm Predictive Technology Model (PTM) transistors. The W/L_min ratios of the n-MOS and the p-MOS transistor of these drivers are 24 and 48, respectively. This sizing is chosen for two reasons. First, a transistor-sizing ratio of 2× between the p-MOS and the n-MOS results in almost equal rise and fall times for the driver outputs. This ensures that the drivers do not increase misalignment effects artificially. Second, the crosstalk noise that is induced on the TSVs can exceed Vdd/2 (i.e., 0.5 V) for smaller transistor sizes, resulting in undesired signal glitches. Such glitches are particularly critical for misaligned signal edges, where they do not necessarily occur at the beginning of a cycle. At the far end, each TSV is terminated in a 22-nm PTM inverter (W/L_min equal to 4 for the n-MOS and 8 for the p-MOS) charging a 1 fF load. Bit sources (Vdd = 1 V) with a transition time of 10 ps and a bit duration of 1 ns are used to generate the stimuli for the circuit simulations.
For each line, the transmitted bit sequence and the delay of the signal edges, T_edge,i, relative to the rising clock edges (at k · 1 ns with k ∈ ℕ) can be varied to model different misalignment scenarios. Again, the inverter circuits used to create a realistic shape of the TSV input-voltage waveforms are sized like the load drivers at the far end. The worst-case propagation delay, T̂_pd, from the TSV input, v_in, to the TSV output, v_tsv, is measured for each analyzed scenario. The measured propagation delays are interpreted to build models for A_{i,j} for the worst-case switching (i.e., transitions in the opposite direction on TSV_i and TSV_j) as a function of the switching times of the input signals. The alignment factor for the worst-case switching is symbolized as Â_{i,j} in the following.
10.1.1 Linear Model

In the author's initial publication about exploiting misalignment [19], the worst-case propagation delay of a crosstalk victim, TSV_v, located in the middle of the TSV array, was analyzed as a function of the misalignment/time skew of its signal edges against the signal edges of all other TSVs, which served as aggressors in the analysis. In other words, in the analysis, all TSV-input signals switched at
Fig. 10.1 Maximum propagation delay of a victim TSV over the delay of its input-signal edges, T_skew,v,a, relative to the signal edges on all remaining aggressor TSVs (circuit simulations and linear model; perfect alignment corresponds to Â_{v,a} = 1 at T_skew,v,a = 0, complete misalignment to Â_{v,a} = 0 beyond T_m)
the exact same time, except the input signal of the crosstalk victim, TSV_v, which switched later by T_skew,v,a, where the T_skew,v,a value is varied. The resulting maximum propagation delay of the victim TSV over the time skew, T_skew,v,a, is shown by the blue dots in Fig. 10.1. Without misalignment, the worst-case TSV delay is 187.89 ps. However, the delay decreases significantly with increasing misalignment until a certain point, from which the curve flattens out rapidly. This behavior is captured quite well by a simple linear model for Â_{v,a}: starting from Â_{v,a} equal to 1 for no misalignment (i.e., T_skew,v,a = 0), Â_{v,a} decreases linearly to 0 at a threshold value, T_m, and is clipped to 0 afterward. The fitted T_m value is in the range of the mean propagation delay of the victim for no signal misalignment. The red lines in Fig. 10.1 illustrate the estimated propagation delays according to this linear regression. In the initial publication [19], the same behavior was assumed for a negative temporal misalignment. Negative misalignment implies here that the input of an aggressor TSV switches later than the input of the victim TSV. Thus, the following formula was proposed to estimate the Â_{i,j} values:

Â_{i,j} ≈ max(1 − |T_skew,i,j| / T_m, 0).   (10.5)
This equation, combined with Eq. (10.3), builds a simple model to estimate the maximum interconnect crosstalk in the presence of arbitrary signal misalignment. In the remainder of this chapter, this simple model is referred to as the linear crosstalk model.
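As a sketch, the linear crosstalk model can be written directly from Eq. (10.5); the threshold of 150 ps used below is an assumed placeholder (roughly the mean victim delay without misalignment), not a fitted value from the book.

```python
def alignment_factor_linear(t_skew_ps, t_m_ps=150.0):
    """Linear model of Eq. (10.5): A_hat = max(1 - |T_skew| / T_m, 0).

    t_m_ps is an assumed placeholder threshold, not the fitted value.
    """
    return max(1.0 - abs(t_skew_ps) / t_m_ps, 0.0)

# Worst-case crosstalk factor of Eq. (10.3) for opposite transitions:
# delta = 1 + A_hat, falling linearly from 2 (aligned) to 1 (fully misaligned).
for t in (0.0, 75.0, 150.0, 300.0):
    print(t, 1.0 + alignment_factor_linear(t))  # 2.0, 1.5, 1.0, 1.0
```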
10.1.2 Look-Up-Table Model

An in-depth analysis of the crosstalk as a function of the misalignment shows that a crosstalk optimization by means of the linear model might even result in a lower interconnect performance than no optimization at all. In the following, the reason for this is outlined. Furthermore, a new model to estimate the alignment factors, Â_{i,j}, is presented, which resolves this issue. First, the circuit simulations from Sect. 10.1.1 are extended such that the effect of a negative temporal misalignment between the edges on the victim line and the aggressor lines is analyzed as well. As illustrated in Fig. 10.2 for a set of representative examples, this analysis shows that, in contradiction to the linear model, temporal misalignment does not affect the crosstalk/performance only positively. When aggressor signals switch their level shortly after a victim, they induce crosstalk noise on the victim line during its transition phase. In the worst-case crosstalk scenarios, where multiple aggressors switch in the opposite direction of the victim, the aggressors induce a huge noise peak on the victim line, which counteracts its active switching. Thus, if this noise is induced before the victim net exceeds the threshold value of the load driver (typically Vdd/2), or if the noise raises the signal temporarily over the threshold again (glitch), the misalignment increases the propagation delay compared to the scenario of no misalignment. These scenarios are illustrated in Fig. 10.2 by the green and the red curves, which represent a negative victim-to-aggressor misalignment of −50 ps and −100 ps, respectively. In both scenarios, the victim output toggles/switches later than in the case of no misalignment (black/solid curves), although the victim input switches at the same time in all cases.
The green curves (T_skew,v,a = −50 ps) represent the scenario where the noise is induced before the signal crosses the threshold value, while the red curves (T_skew,v,a = −100 ps) represent the scenario where the aggressor noise leads to a glitch on the signal. Note that this glitch is also noticeable in the output signal, v_out, of the aggressor line. For large negative skews, represented by the blue curves (T_skew,v,a = −250 ps), the misalignment again has a strongly positive impact on the performance. In this case, the crosstalk noise is induced after the switching of the logical value on the TSV is already completed. Thus, the crosstalk does not affect the output signal negatively here, resulting in the lowest propagation delay (i.e., the highest performance). Another phenomenon not captured by the linear crosstalk model is that the effect of the skew between two signal edges also depends on the overall standard deviation of the switching times. In those scenarios where most signal edges are strongly misaligned, the maximum height of the induced noise is significantly lower compared to the scenario where most aggressor edges occur temporally aligned. Additionally, when the mean misalignment is strong, the transition times of the signals generally tend to be shorter. Hence, with an increasing standard deviation of the switching times, the time range where misalignment has a negative impact is reduced.
Fig. 10.2 Effect of positive and negative time skews between the signal edges of a victim TSV relative to the edges on the remaining aggressor TSVs on the signal integrity (waveforms v_in, v_tsv, and v_out of the victim and the aggressors for T_skew,v,a ∈ {−250, −100, −50, 0, 100, 250} ps)
Thus, to obtain a more precise crosstalk model, the propagation delay of a victim TSV_v is analyzed as a function of the temporal misalignment between its input signal and the input signal of one directly adjacent aggressor TSV_a. Furthermore, this simulation is executed for various values of the standard deviation in the switching times of the remaining input signals, σ_skew. The time skew T_skew,v,a is swept from −500 ps to 500 ps with a step width of 2 ps to cover a wide range of positive and negative misalignment scenarios. Additionally, the standard deviation, σ_skew, is swept from 0 to 500 ps. Out of the maximum signal propagation delays, the Â_{v,a} values are determined as follows:

Â_{v,a}(σ_skew, T_skew,v,a) = (T̂_pd(σ_skew, T_skew,v,a) − T̂_pd(σ_skew, 500 ps)) / (T̂_pd(σ_skew, 0 ps) − T̂_pd(σ_skew, 500 ps)),   (10.6)
where T̂_pd(σ_skew, 500 ps) and T̂_pd(σ_skew, 0 ps) are the delays for a complete misalignment and a perfect alignment, respectively, between the switching times on TSV_v and TSV_a, for a standard deviation of the misalignment of σ_skew. The results, illustrated in Fig. 10.3, show that misalignment can even increase some Â_{i,j} values to over 1.2, implying an increase in the worst-case crosstalk between the lines by more than 20% compared to the scenario of no signal misalignment. As expected, the maximum possible Â_{i,j} value decreases with an increasing standard deviation of the switching times. Furthermore, an increased standard deviation leads to a narrower Â_{i,j} curve, which results in generally lower
Fig. 10.3 Worst-case TSV propagation delay and resulting alignment coefficients, Â_{v,a}, over the misalignment, T_skew,v,a, between the victim TSV_v and one adjacent aggressor TSV_a (for σ_skew ∈ {10, 50, 150, 200, 250, 300} ps)
Â_{i,j} values for a given time skew. However, the Â_{i,j} curves remain unchanged from a certain standard deviation, σ_skew,max, onward. The observed behavior is hard to capture in a single closed-form expression. Thus, this work proposes to create a 2D lookup table (LUT) out of the circuit-simulation results, which can be used at design time to estimate the crosstalk peak and thereby the required timing budget for the TSVs.2 The output of the LUT is the Â_{i,j} value, the first input is the σ_skew value, and the second input is the T_skew,v,a value. In between the discrete analyzed points, a linear interpolation is used to determine the Â_{i,j} values for arbitrary misalignment scenarios. Outside the analyzed range, the values are clipped, as an Â_{i,j} value does not change with any further change in σ_skew or T_skew,v,a. After using the LUT to determine the Â_{i,j} values, the maximum crosstalk of the lines is again estimated through Eq. (10.3). This enhanced model for the crosstalk in the presence of arbitrary misalignment scenarios is referred to as the LUT-based crosstalk model in the following.
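A minimal sketch of such a LUT-based model, with bilinear interpolation between grid points and clipping outside the analyzed range, could look as follows. The grid values here are invented for illustration; real tables come from the circuit-simulation sweeps described above.

```python
import numpy as np

class AlignmentLUT:
    """2D LUT for A_hat(sigma_skew, T_skew) with bilinear interpolation.

    Grid axes and table values below are placeholders; in practice they are
    filled from circuit-simulation sweeps.
    """
    def __init__(self, sigmas, skews, table):
        self.sigmas = np.asarray(sigmas)
        self.skews = np.asarray(skews)
        self.table = np.asarray(table)

    def __call__(self, sigma, t_skew):
        # Clip outside the analyzed range, as A_hat saturates there.
        sigma = np.clip(sigma, self.sigmas[0], self.sigmas[-1])
        t_skew = np.clip(t_skew, self.skews[0], self.skews[-1])
        # Interpolate along the sigma axis first, then along the skew axis.
        col = np.array([np.interp(sigma, self.sigmas, self.table[:, j])
                        for j in range(len(self.skews))])
        return float(np.interp(t_skew, self.skews, col))

# Tiny illustrative grid (values invented for the sketch):
lut = AlignmentLUT(sigmas=[0.0, 250.0, 500.0],
                   skews=[-500.0, 0.0, 500.0],
                   table=[[0.0, 1.2, 0.0],
                          [0.0, 1.0, 0.0],
                          [0.0, 1.0, 0.0]])
print(lut(125.0, 0.0))   # halfway between 1.2 and 1.0, i.e. about 1.1
```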
10.2 Exploiting Misalignment to Improve the Performance

In the previous chapter, a technique was presented which effectively improves the TSV performance by means of a crosstalk-aware net-to-TSV assignment, exploiting the specific bit-level characteristics of 2D-CAC-encoded data. The idea of using the crosstalk-aware assignment is reused for the technique presented in this chapter. However, here, the effect of misalignment between the edges of the input signals is exploited. The following formula for the performance-optimal assignment was derived in Sect. 9.3.2:

I_π,perf-opt = arg min_{I_π ∈ S_{I_π,n}} max diag(I_π S_wc I_π^T · C_R).   (10.7)

In this equation, the entries of the worst-case switching matrix are equal to

S_wc,i,j = max_k(Δb_i²[k])   for i = j,
           max_k(δ_{i,j}[k])  else.   (10.8)
The technique presented in the previous chapter is based on the crosstalk model for perfectly aligned edges. Thus, Δb_i²[k] − Δb_i[k]Δb_j[k] was previously used for δ_{i,j}[k].
2 Since this approach requires extensive circuit simulations for each new driver sizing/technology, the resulting high-level model for the TSV performance is not universally valid. For this reason, it is not included in Part II of this book.
In this section, the misalignment-aware crosstalk model (i.e., Eq. (10.3)) has to be used instead for δ_{i,j}[k], resulting in

S_wc,i,j = max_k(Δb_i²[k])                              for i = j,
           max_k(Δb_i²[k] − A_{i,j} Δb_i[k] Δb_j[k])    else.   (10.9)
Without any further extension, the approach from the previous chapter can be used to obtain the performance-optimal TSV assignment for arbitrary data streams (e.g., encoded or random) in the presence of temporal misalignment. For perfectly temporally aligned signal edges, a data encoding is needed to improve the TSV performance effectively. In contrast, an intelligent assignment can already significantly improve the TSV performance for any random/unencoded data stream when the signal edges are temporally misaligned, as shown in the following. This implies that no bit or circuit overhead due to an encoding is required for the performance improvement. For random data, max_k(Δb_i²[k]) is 1 for a signal line i which eventually toggles its logical value. Again, the properties of stable power or ground (P/G) lines are exploited to obtain an even better TSV performance. For a stable line, max_k(Δb_i²[k]) is clearly 0. The worst-case crosstalk (i.e., max_k(Δb_i²[k] − A_{i,j} Δb_i[k]Δb_j[k])) for two random data lines i and j is equal to 1 + Â_{i,j}, occurring for opposite transitions of the bit values on the two lines. However, if the ith line is a dynamic data line while the jth line is stable, max_k(δ_{i,j}[k]) is equal to 1, as Δb_j[k] is always 0. If i is a stable line, max_k(δ_{i,j}[k]) is 0 in any case. Thus, the technique proposed throughout this chapter consists of the following three steps. First, either the linear crosstalk model or the LUT-based crosstalk model is used to determine the Â_{i,j} values, out of which S_wc is generated using
S_wc,i,j = 0             if i is a stable line,
           1             else if i = j, or if j is a stable line,
           1 + Â_{i,j}   else.   (10.10)
Subsequently, the heterogeneity in the S_wc entries due to the varying Â_{i,j} values is exploited to improve the performance through an optimal net-to-TSV assignment, determined by means of Eq. (10.7). After the optimal assignment is determined, a final sanity check should be performed if the LUT-based crosstalk model was used in the first step. Initially, the standard deviation of the switching times on all nets is considered to obtain the Â_{i,j} values in the LUT-based model. However, only adjacent TSVs show significant crosstalk due to their large coupling capacitances. The Â_{i,j} values, and thereby the crosstalk, increase with a decreased σ_skew. Therefore, one has to check whether, for any TSV, the standard deviation of the switching times of (only) the adjacent TSVs is smaller than the σ_skew used to determine the Â_{i,j} values. If so, the related Â_{i,j} values have to be adjusted.
However, it is improbable that an adjustment is needed, since the technique proposed in this chapter generally maximizes the standard deviation of the switching times for adjacent TSVs, as it always assigns heavily misaligned nets to strongly coupled lines to reduce their crosstalk effectively. In fact, none of the cases analyzed in the remainder of the chapter required an adjustment of an Â_{i,j} value during the sanity check.
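The three steps above can be sketched as follows. The exhaustive permutation search stands in for whatever heuristic a real flow would use (it is only feasible for small arrays), and the alignment-factor and capacitance matrices below are illustrative assumptions, not values from the book.

```python
import itertools
import numpy as np

def build_S_wc(A_hat, stable):
    """Worst-case switching matrix per Eq. (10.10)."""
    S = 1.0 + A_hat                    # 1 + A_hat for two dynamic data lines
    np.fill_diagonal(S, 1.0)           # i == j
    S[:, list(stable)] = 1.0           # j stable: entry 1
    S[list(stable), :] = 0.0           # i stable: entry 0 (overrides the rest)
    return S

def perf_optimal_assignment(S_wc, C_R):
    """Brute-force net-to-TSV assignment per Eq. (10.7); small n only."""
    n = S_wc.shape[0]
    best_perm, best_cost = None, np.inf
    for perm in itertools.permutations(range(n)):
        P = np.eye(n)[list(perm)]      # permutation matrix I_pi
        cost = np.diag(P @ S_wc @ P.T @ C_R).max()
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm, best_cost

# Illustrative 3-net example: net 0 stable, nets 1 and 2 perfectly aligned.
A_hat = np.array([[0.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0]])
S_wc = build_S_wc(A_hat, stable={0})
C_R = np.array([[1.0, 0.8, 0.1],       # assumed relative capacitance matrix
                [0.8, 1.0, 0.8],
                [0.1, 0.8, 1.0]])
perm, cost = perf_optimal_assignment(S_wc, C_R)
# The stable net ends up between the two strongly coupled data nets.
print(perm, cost)
```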
10.3 Effect on the TSV Power Consumption

The effect of the proposed technique on the TSV power consumption is investigated in this section. Thereby, it is shown that the proposed method drastically improves the effectiveness of traditional hardware-efficient low-power codes. In Chap. 3, an equation for the mean power consumption due to each individual interconnect was derived, which helped to outline the magnitude of the edge effects and the resulting optimization potential in the following chapters. In contrast, here the power consumption due to the individual capacitances is analyzed to outline the effect of the proposed misalignment-aware assignment on the TSV power consumption and its implications for the integration of low-power techniques. A formula for the power consumption due to an individual interconnect ground/self-capacitance, C_{i,i}, can be derived from the formula for the respective cycle-based energy dissipation (Eq. (3.11) on Page 55):

P_C,i,i = E{E_i,i} · f = (f V_dd²/2) · C_{i,i} · E{Δb_i²}.   (10.11)

In the same way, a formula for the mean power consumption due to a coupling capacitance, C_{i,j}, for perfectly temporally aligned edges on the two interconnects i and j, can be derived from Eq. (3.9) on Page 54:

P_C,i,j = E{E_i,j} · f = (f V_dd²/2) · C_{i,j} · E{(Δb_i − Δb_j)²}.   (10.12)
These two formulas were used for the derivation of most existing LPCs [89]. Hence, two main low-power coding approaches exist. The first one aims to minimize the power consumption due to the coupling capacitances by minimizing the coupling-switching activities, defined as

α_coup,i,j := E{(Δb_i − Δb_j)²} = E{Δb_i² + Δb_j² − 2 Δb_i Δb_j}.   (10.13)
Note that Δb_i²[k] + Δb_j²[k] − 2Δb_i[k]Δb_j[k] is equal to the sum of the two crosstalk factors for the ith and the jth line (i.e., δ_{i,j}[k] + δ_{j,i}[k]) for aligned signal edges. Thus, the first approach aims to minimize the crosstalk between the lines. The second approach reduces the self-switching activities, α_i, equal to the expected Δb_i² values. This approach traditionally aims primarily at reducing the power consumption due to the ground/self-capacitances. A metric for the choice of the right coding approach is the coupling-switching factor, ζ. It defines the ratio of the power consumption due to coupling switching over the power consumption due to self-switching for random/unencoded data (i.e., α_coup,i,j = 1 and α_i = 0.5). Without temporal misalignment, the coupling-switching factor is mathematically expressed as

ζ = (2 Σ_{i=1}^{n} Σ_{j=i+1}^{n} C_{i,j}) / (Σ_{i=1}^{n} C_{i,i}).   (10.14)
The coupling-switching factor is much greater than 1 for modern 2D and 3D interconnects. Thus, recent LPCs for metal wires focus primarily on reducing the coupling switching instead of the self-switching (e.g., [116, 117, 187]). However, these codes are typically not effective for TSV arrays, as the coupling changes due to the different capacitance structures of metal wires and TSV arrays, outlined in Chap. 4. Another drawback of codes reducing the α_coup,i,j values is that they show a significantly higher complexity, since the probability of opposite transitions on adjacent lines has to be minimized. Low-power codes focusing on the α_i values need to minimize the activities of the individual bits, which can typically be achieved with a much simpler encoder-decoder architecture and thus lower costs. Furthermore, note that encoding techniques which primarily improve the self-switching typically also reduce the coupling switching, and vice versa, since

α_coup,i,j = E{Δb_i² + Δb_j² − 2 Δb_i Δb_j} = α_i + α_j − 2 γ_{i,j}.   (10.15)
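The identity in Eq. (10.15) holds sample-by-sample, so it can be verified directly on arbitrary bit streams. A quick numerical check (variable names are my own):

```python
import numpy as np

rng = np.random.default_rng(1)
db = np.diff(rng.integers(0, 2, (2, 20001)), axis=1)  # transitions on lines i, j
db_i, db_j = db

alpha_i   = np.mean(db_i**2)             # self-switching activity of line i
alpha_j   = np.mean(db_j**2)             # self-switching activity of line j
gamma_ij  = np.mean(db_i * db_j)         # transition correlation
alpha_cpl = np.mean((db_i - db_j)**2)    # coupling-switching activity, Eq. (10.13)

# Eq. (10.15): alpha_coup = alpha_i + alpha_j - 2*gamma (exact, per sample).
assert np.isclose(alpha_cpl, alpha_i + alpha_j - 2 * gamma_ij)
```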
The formula for the power consumption due to a coupling capacitance differs for completely misaligned signal edges. From Eq. (3.10) on Page 55, the following formula can be derived for this case:

P_C,i,j = (f V_dd²/2) · C_{i,j} · E{Δb_i² + Δb_j²} = (f V_dd²/2) · C_{i,j} · (α_i + α_j).   (10.16)

For random data, which implies that E{Δb_i Δb_j} is equal to 0, the expected values of Eqs. (10.12) and (10.16) are equal. Hence, a temporal misalignment alone does not affect the overall power consumption for the transmission of random data. However, Eq. (10.16) reveals that the mean power consumption due to a coupling
Fig. 10.4 Average clock-cycle power consumption of two lines connected by a 20 fF coupling capacitance over the skew between the signal edges (circuit simulations vs. the formula for aligned edges, the formula for misaligned edges, and the new combined formula, each for Δb_i[k]Δb_j[k] = 1 and Δb_i[k]Δb_j[k] = −1)
capacitance only depends on the self-switching and no longer on the coupling switching for completely misaligned signal edges. Thus, the coupling capacitances behave as self-capacitances for temporally misaligned signal edges. This implies a theoretical coupling-switching factor of 0. Hence, for a complete temporal misalignment between the signal edges, simple codes that focus on reducing the self-switching have the highest gains, independent of the sizes of the coupling capacitances. The effect of partial misalignment on the power consumption is discussed in the following. For this purpose, the clock-cycle-average power consumption (T_clk = 1 ns) of two lines connected by a 20 fF coupling capacitance is analyzed with Spectre over the time skew between the edges of the input signals of the 22-nm TSV drivers. The results are plotted in Fig. 10.4 for all switching patterns that are affected by temporal misalignment (i.e., all patterns with a switching on both lines). Note that, due to the symmetry, the concept of "negative" time skews does not apply in this power analysis. The figure reveals that a linear fit between the power estimates for full and no misalignment, employing an alignment factor Ā_{i,j}, enables accurate modeling of partial-misalignment cases. Hence, the power consumption due to a coupling capacitance for arbitrary misalignment scenarios is estimated as

P_C,i,j = (f V_dd²/2) · C_{i,j} · E{Δb_i² + Δb_j² − 2 Ā_{i,j} Δb_i Δb_j} = (f V_dd²/2) · C_{i,j} · ((1 − Ā_{i,j})(α_i + α_j) + Ā_{i,j} α_coup,i,j).   (10.17)
This equation reveals that the power consumption due to a coupling capacitance depends on the coupling switching as well as on the self-switching for partially
misaligned signal edges. Consequently, the effective coupling-switching factor for the case of arbitrary temporal misalignment is mathematically expressed as

ζ_e = (2 Σ_{i=1}^{n} Σ_{j=i+1}^{n} Ā_{i,j} C_{i,j}) / (Σ_{i=1}^{n} C_{i,i} + 2 Σ_{i=1}^{n} Σ_{j=i+1}^{n} (1 − Ā_{i,j}) C_{i,j}).   (10.18)
Hence, ζ_e is effectively minimized if strongly misaligned signals (i.e., small Ā_{i,j}) are transmitted over interconnects connected by a large coupling capacitance, C_{i,j}, implying an increased effectiveness of low-cost LPCs. This requirement is satisfied by the proposed assignment technique, as a significant temporal misalignment on strongly coupled lines improves the TSV performance effectively. Consider the example of artificial misalignment employing both clock edges. In this case, Ā_{i,j} is 0 if one of the two nets i or j is the output of a falling-edge-triggered flip-flop while the other one belongs to a rising-edge-triggered flip-flop, and 1 if both nets are outputs of the same kind of flip-flop. Without the artificial misalignment (i.e., for all Ā_{i,j} equal to 1), ζ_e is above 60 for typical global TSV dimensions. Thus, using coding techniques that minimize the coupling switching is reasonable here, despite the resulting increased coding complexity. However, after the proposed assignment technique is applied for the artificial misalignment, ζ_e is drastically reduced to about 0.2. Hence, after applying the proposed technique, existing lightweight LPCs are even several times more effective than complex ones. Consequently, the proposed technique maximizes the efficiency of many traditional LPCs for TSV-based interconnects.
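Equation (10.18) is straightforward to evaluate for a given capacitance matrix. The following sketch (matrix values assumed for illustration) reproduces the two limiting cases discussed above:

```python
import numpy as np

def zeta_eff(C, A_bar):
    """Effective coupling-switching factor of Eq. (10.18).

    C: symmetric capacitance matrix (diagonal C_ii = self-capacitances,
    off-diagonal C_ij = coupling capacitances); A_bar: per-pair alignment
    factors (1 = perfectly aligned edges, 0 = fully misaligned).
    """
    iu = np.triu_indices_from(C, k=1)          # each pair (i, j), j > i, once
    num = 2.0 * np.sum(A_bar[iu] * C[iu])
    den = np.trace(C) + 2.0 * np.sum((1.0 - A_bar[iu]) * C[iu])
    return num / den

# Illustrative 2x2 example with a dominant coupling capacitance:
C = np.array([[1.0, 10.0],
              [10.0, 1.0]])
print(zeta_eff(C, np.ones((2, 2))))    # aligned: 2*10 / 2 = 10.0
print(zeta_eff(C, np.zeros((2, 2))))   # fully misaligned: 0.0
```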
10.4 Evaluation

This section evaluates the proposed technique in depth. First, the expected reduction in the maximum propagation delay due to the proposed technique is investigated in Sect. 10.4.1. Furthermore, the gain of exploiting an artificially generated misalignment compared to an intrinsic misalignment is analyzed. To report representative numbers for the expected/mean gain of the proposed technique, several thousand different misalignment scenarios are considered in this first subsection. This makes an evaluation by means of circuit simulations impossible. Hence, delay reductions according to the high-level formulas (ideal drivers) are presented in Sect. 10.4.1. In the following two subsections, the impact of the technique on the TSV performance for the 22-nm drivers is investigated. For this purpose, extensive Spectre circuit simulations are carried out for a 4 × 4 array. In Sect. 10.4.2, the performance improvement of the proposed technique is investigated for different misalignment scenarios. Thereby, the focus lies on the difference between the application of the simple linear and the more complex LUT-based misalignment-aware crosstalk model for the proposed technique.
Afterward, an in-depth comparison of the presented technique and existing 3D-CAC techniques is drawn in Sect. 10.4.3. The common objective of all analyzed techniques is an improved TSV performance, while the techniques proposed in this book furthermore aim for an improved TSV power consumption. Nevertheless, the proposed and the previous techniques are analyzed based on a large number of criteria in Sect. 10.4.3. In addition to the power consumption and performance, the switching noise, the bit overhead, the coder-decoder circuit (CODEC) area, and the maximum throughput of the 4 × 4 array are analyzed.
10.4.1 Expected Delay Reduction

In this subsection, the expected reduction in the maximum propagation delay of the newly proposed technique is investigated for a large set of misalignment scenarios. Thereby, the advanced LUT-based crosstalk model is applied to determine the performance-optimal local TSV assignment for various misalignment scenarios in the switching times of the nets. A zero-mean, normally distributed time skew in the switching times of the signals transmitted over a typical 4 × 4 TSV array (i.e., rtsv = 2 μm, dmin = 8 μm, and ltsv = 50 μm) is considered in the first analysis. The standard deviation of the time skews, σskew, is varied to quantify the expected performance gains of the proposed technique for various misalignment scenarios. In order to obtain representative results, all simulations are executed 1000 times for no stable lines, and 1000 times for one power and one ground line in the array (new time skews are drawn randomly for each run). The resulting mean improvements in the maximum TSV delay, compared to a random assignment, as well as the 68% confidence intervals, are plotted in Fig. 10.5 over σskew. For no stable line and no signal misalignment (i.e., for σskew = 0), the performance/delay improvement is obviously zero. Thus, a routing-minimal assignment should be applied in this case. However, for a standard deviation in the switching times of 55 ps, the mean reduction in the TSV delay is already above 15%, with a confidence interval of ±1.6%. The highest delay reduction of 24.7% (i.e., 32.9% performance increase) is obtained for a σskew of about 160 ps. For a huge standard deviation in the switching times, a significant misalignment between most adjacent lines is generally present before the proposed optimization in the TSV assignment is applied. Thus, with an ongoing increase in σskew, the performance/delay improvement of the proposed technique decreases again.
However, even for a standard deviation of 500 ps, the expected delay reduction is still above 17%. With stable lines, the curve has a similar shape. However, in this case, the delay reduction is already 13.5% for no temporal misalignment (i.e., σskew = 0), as the proposed technique exploits the stable P/G lines. Here, the technique finds the position in the array where the stable lines serve as the most effective shields for the dynamic data lines. Exploiting stable lines also leads to a potentially higher
10 Low-Power Technique for High-Performance 3D Interconnects in the Presence. . .
Fig. 10.5 Expected TSV-delay reduction with 68% confidence intervals over the standard deviation of the input-signal switching times, σskew (0–500 ps; curves for no stable line and for two stable P/G lines). For each analyzed σskew value, 1000 misalignment scenarios are analyzed
delay reduction by approximately two percentage points (pp). However, for a large standard deviation in the switching times, the delay reduction for an array with stable lines is lower than for one without stable lines (ca. 2.5 pp for σskew equal to 500 ps). With stable lines, fewer adjacent lines potentially toggle in the opposite direction. Consequently, the optimization potential is reduced for higher σskew values, compared to the scenario where no stable lines are present. Compared to the expected delay improvements according to the linear crosstalk model (presented in Ref. [19], but not included in this book), the reported reductions are lower for small σskew values and higher for large σskew values. The reason is that the advanced LUT-based model takes into account that misalignment is not always beneficial for the crosstalk/delay. A second analysis compares the optimization potential of the proposed technique for artificially generated and intrinsic signal misalignment. For the analyzed scenario of artificial misalignment, the TSVs are driven by flip-flops, where one half of the flip-flops are rising-edge triggered and the other half falling-edge triggered. The clock frequency is set to 1 GHz. Thereby, 50% of the signal pairs switch perfectly temporally aligned and 50% temporally misaligned. Applying the proposed assignment technique results in a regular interleaving of nets belonging to rising-edge-triggered (R) and falling-edge-triggered (F) flip-flops for the artificial misalignment, illustrated in Fig. 10.6 for a 4 × 4 array. This regular interleaving leads to the highest performance as it completely avoids temporally aligned transitions on directly adjacent TSVs, which are connected by the largest coupling capacitances. A zero-mean Gaussian-distributed skew in the switching times with a standard deviation of around 80–90 ps is analyzed for the intrinsic temporal misalignment.
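The regular interleaving shown in Fig. 10.6 can be generated and verified in a few lines (a sketch; the function name is hypothetical):

```python
def interleave(rows, cols):
    """Checkerboard assignment of rising- (R) and falling-edge-
    triggered (F) nets: no two directly adjacent TSVs carry nets of
    the same edge type, so the strongest coupling capacitances never
    see temporally aligned transitions."""
    return [['R' if (r + c) % 2 == 0 else 'F' for c in range(cols)]
            for r in range(rows)]

grid = interleave(4, 4)
# Verify: every horizontally/vertically adjacent pair differs.
ok = all(grid[r][c] != grid[r][c + 1]
         for r in range(4) for c in range(3)) and \
     all(grid[r][c] != grid[r + 1][c]
         for r in range(3) for c in range(4))
```

For the 4 × 4 case this reproduces the R/F pattern of Fig. 10.6 row by row.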
Fig. 10.6 Assignment for artificially temporally misaligned signals: regular interleaving of the nets of rising-edge-triggered (R) and falling-edge-triggered (F) flip-flops (4 × 4 pattern: R F R F / F R F R / R F R F / F R F R)

Fig. 10.7 Reduction in the maximum TSV delay, T̂pd, over the array shape (3 × 3, 4 × 4, 3 × 6, 5 × 5, 3 × 9, 6 × 6, 3 × 12) for artificial and intrinsic (σskew ≈ 80–90 ps) temporal signal misalignment. Thereby, aggressively scaled TSV dimensions are considered (i.e., rtsv = 1 μm and dmin = 4 μm)
In this analysis, the array dimensions M and N are varied to determine whether the array size affects the optimization potential. In this second analysis, the TSV radius and minimum pitch are scaled down by a factor of 2× compared to the first one, to 1 μm and 4 μm, respectively. This aims at quantifying the effectiveness of the proposed technique for more advanced TSV manufacturing processes. The resulting reductions in the maximum TSV delay, compared to no temporal misalignment for artificially generated misalignment and to a random assignment for intrinsic misalignment, are reported in Fig. 10.7. The delay reduction for the artificial misalignment is nearly constant over the array dimensions (ca. 36–37%). These values are significantly larger than for the scenarios where an intrinsic temporal misalignment is exploited. For intrinsic signal misalignment, even a non-optimal assignment typically results in a higher TSV performance, compared to the scenario of no signal misalignment. In these scenarios, the optimization potential is consequently lower. Furthermore, the variance in the switching times is much larger
for artificial misalignment. The amplitude of the misalignment between the outputs of falling-edge- and rising-edge-triggered flip-flops is large enough to consider the transitions as completely independent (i.e., Āi,j = 0). This leads to a particularly small crosstalk. Moreover, the delay improvement shows significantly higher fluctuations for intrinsic misalignment, as the gain depends on the quality of the random assignment. However, a general dependency between the performance gains and the array dimensions, M × N, cannot be identified. Furthermore, the average reduction in the maximum delay for the more aggressively scaled TSV dimensions is about 21%, which is in accordance with the expected delay reduction for the larger TSV dimensions reported in Fig. 10.5. Thus, the potential gain of the approach tends to be independent of the TSV-manufacturing maturity. In summary, the analyses show that even exploiting a small intrinsic signal misalignment has the potential to improve the TSV performance effectively, especially if stable lines are present in the array. For a guaranteed performance improvement, an artificially generated signal misalignment is required, which has the drawback of additional costs, for example, due to the integration of falling-edge-triggered flip-flops. Furthermore, the analyses indicate that the theoretical performance improvement is largely unaffected by the array size and the geometrical TSV dimensions.
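The structure of the experiments in this subsection can be reproduced in toy form. Everything numeric below is an illustrative stand-in (a 2 × 2 array, a simple linear alignment weight instead of the book's LUT-based model, arbitrary coupling values); only the experimental skeleton (draw random skews, compare a brute-force optimal assignment against a random one, average over many runs) mirrors the analysis:

```python
import itertools
import random
import statistics

def linear_align(dt, t_r=100.0):
    # Linear misalignment weight: full crosstalk for aligned edges,
    # none once edges are separated by the transition time (illustrative).
    return max(0.0, 1.0 - abs(dt) / t_r)

def max_delay_proxy(perm, skews, coup):
    # perm[pos] = net assigned to TSV position pos (2x2 array, pos 0..3)
    n = len(perm)
    worst = 0.0
    for p in range(n):
        load = sum(coup[p][q] * linear_align(skews[perm[p]] - skews[perm[q]])
                   for q in range(n) if q != p)
        worst = max(worst, load)
    return worst

# 2x2 TSV array: direct neighbours couple strongly, diagonals weakly.
coup = {0: {1: 1.0, 2: 1.0, 3: 0.4}, 1: {0: 1.0, 3: 1.0, 2: 0.4},
        2: {0: 1.0, 3: 1.0, 1: 0.4}, 3: {1: 1.0, 2: 1.0, 0: 0.4}}

def improvement(sigma_skew, runs=200, rng=random.Random(1)):
    """Mean relative delay-proxy reduction of the brute-force optimal
    assignment over a random one, across `runs` skew scenarios."""
    gains = []
    for _ in range(runs):
        skews = [rng.gauss(0.0, sigma_skew) for _ in range(4)]
        rand_perm = [0, 1, 2, 3]
        rng.shuffle(rand_perm)
        best = min(max_delay_proxy(list(p), skews, coup)
                   for p in itertools.permutations(range(4)))
        rand = max_delay_proxy(rand_perm, skews, coup)
        gains.append(0.0 if rand == 0 else (rand - best) / rand)
    return statistics.mean(gains)
```

As in Fig. 10.5, the gain is exactly zero without misalignment and becomes positive once the skews spread the switching times apart.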
10.4.2 Delay Reduction for Various Misalignment Scenarios

The maximum TSV propagation delay is analyzed in this subsection with Spectre for no stable line in the array and five different scenarios of intrinsic temporal misalignment in the input signals. Here, an analysis of a wide range of misalignment scenarios, as done in the theoretical analysis in the previous subsection, is not possible due to the high computational complexity of the circuit simulations. In all scenarios, the mean switching time of the signals is 390 ps after the rising clock edge in the respective cycle. The analyzed standard deviations of the switching times are 0 ps (no misalignment), 15 ps, 25 ps, 90 ps, and 180 ps. All switching times are truncated to a minimum of 0 ps and a maximum of 700 ps in order to represent realistic misalignment scenarios. For all analyzed scenarios, the maximum propagation delay is measured for:

1. a random net-to-TSV assignment;
2. an assignment determined by means of the proposed technique employing the linear crosstalk model;
3. an assignment determined by means of the proposed technique employing the LUT-based crosstalk model.

These different settings allow for quantifying the effect of the proposed technique for rather small, as well as relatively large, intrinsic misalignment quantities. Artificial misalignment is analyzed in the following Sect. 10.4.3.
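The switching-time stimuli described above can be reproduced with a few lines (parameter values from the text; the clipping implements the truncation to the 0–700 ps window; the function name is hypothetical):

```python
import random

def switching_times(n, sigma_ps, mean_ps=390.0, lo=0.0, hi=700.0,
                    rng=random.Random(42)):
    """Draw n switching times (ps after the rising clock edge) with
    the given standard deviation, truncated to [lo, hi]."""
    return [min(hi, max(lo, rng.gauss(mean_ps, sigma_ps))) for _ in range(n)]

aligned = switching_times(16, 0.0)    # scenario with no misalignment
skewed = switching_times(16, 180.0)   # largest analyzed misalignment
```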
Table 10.1 Maximum propagation delay of a TSV for five scenarios of temporal misalignment if the proposed TSV assignment technique is applied and if a random assignment is used

Assignment       | σskew = 0 ps | σskew ≈ 15 ps | σskew ≈ 25 ps | σskew ≈ 90 ps  | σskew ≈ 180 ps
Random           | 185.3        | 186.9 (+0.8%) | 187.9 (+1.4%) | 207.2 (+11.8%) | 155.0 (−16.4%)
Linear model     | 185.3        | 186.6 (+0.7%) | 195.0 (+5.2%) | 155.9 (−15.9%) | 110.4 (−40.4%)
LUT-based model  | 185.3        | 169.9 (−8.3%) | 175.5 (−5.3%) | 150.2 (−19.0%) | 103.4 (−44.2%)

(All values are maximum propagation delays in ps; percentages are relative to the σskew = 0 ps case.)
Table 10.1 reports the resulting maximum propagation delays. The relative percentage deviations, compared to the scenario of no signal misalignment, are also included. This comparison is used to prevent an influence of the varying quality of the random assignment on the reported performance improvements. The results show that the maximum possible propagation delay in the presence of signal misalignment is even higher than for no misalignment (reported up to 11.8%). Thus, the results validate again that input-signal misalignment does not necessarily have a positive effect on the TSV performance. Since the linear model does not consider such adverse effects of misalignment, the usage of this model occasionally results in assignments that actually have poor performance properties. Small negative time skews between the signal edges, with an adverse impact on the crosstalk, are treated as positive. Thus, for smaller standard deviations in the switching times, the TSV performance for a random assignment can be even better than for an assignment determined by means of the linear crosstalk model. In contrast, the proposed optimization technique based on the LUT-based crosstalk model considers positive and negative aspects of temporal misalignment. Consequently, the usage of the LUT-based crosstalk model always results in the best TSV performance. Furthermore, in this case, the performance is always significantly improved, compared to the case of no signal misalignment. Positive effects of misalignment dominate more and more with an increasing standard deviation in the switching times. Therefore, asymptotically, the proposed assignment technique employing the linear crosstalk model shows the same improvement as the technique employing the complex LUT-based model. 
For example, the usage of the linear model for the proposed technique results in a delay increase (i.e., reduced performance), compared to the scenario of no misalignment, by 5.2% for a standard deviation in the switching times of 25 ps. Contrary to this, the usage of the LUT-based crosstalk model results in a delay improvement by 5.3%. However, for a standard deviation of about 180 ps, the usage of both models results in almost the same significant delay improvement of more than 40%. In conclusion, the usage of the more complex crosstalk model employing a LUT is indispensable for smaller misalignment quantities in order to ensure that the
assignment results in a good TSV performance. For larger misalignment quantities, the linear model likely also results in a high TSV performance.
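The qualitative difference between the two models can be made concrete with a small sketch. All numeric values below are invented for illustration; the point is only structural: the linear model is an even function of the skew (any misalignment looks beneficial), whereas a LUT-based model can assign a weight above 1 to small adverse negative skews:

```python
def linear_weight(dt_ps, t_r=100.0):
    # Linear model: misalignment of either sign is assumed to reduce
    # crosstalk, falling off with the skew magnitude.
    return max(0.0, 1.0 - abs(dt_ps) / t_r)

def lut_weight(dt_ps):
    # Illustrative LUT: a small negative skew (aggressor edge arriving
    # during the victim transition) *worsens* effective crosstalk (>1).
    table = [(-150, 0.2), (-50, 1.3), (0, 1.0), (50, 0.6), (150, 0.1)]
    # Nearest-neighbour lookup (a real LUT would be indexed/interpolated).
    return min(table, key=lambda entry: abs(entry[0] - dt_ps))[1]
```

An assignment optimizer driven by `linear_weight` cannot distinguish a skew of −50 ps from +50 ps, while one driven by `lut_weight` avoids placing the adverse pair on a strongly coupled TSV pair, which is the behavior discussed above.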
10.4.3 Comparison with 3D-CAC Techniques

The effect of the proposed optimization technique on various metrics is compared to the values obtained for CAC techniques in this subsection. Investigated is the transmission of 8 kB of random data over a TSV array of fixed size (4 × 4). Thereby, two P/G lines are considered in the array. Furthermore, typical TSV dimensions are investigated (i.e., rtsv = 2 μm, dmin = 8 μm, and ltsv = 50 μm). The transmission of the data is analyzed under nine different scenarios. In scenario 1, the data is transmitted without any optimization. Scenario 2 represents the transmission of the data optimized with the ωm/ωe CAC (based on a forbidden-transition-free (FTF) encoding), presented in the previous Chap. 9. As shown in Chap. 9, the ωm/ωe-CAC technique leads to the highest performance improvement of all existing 3D CACs. Additionally, based on an FTF encoding, it is the only technique that can noticeably reduce the TSV power consumption. Nevertheless, previous 3D CACs, which neglect the edge effects, are also analyzed for the sake of completeness. Scenario 3 represents the data transmission for the best previous technique, the 6C CAC. Aligned signal edges are assumed for the first three scenarios since existing CAC techniques are only effective in this particular case. In scenarios 4 and 5, the net-to-TSV assignment approach presented in this chapter is investigated for an intrinsic misalignment between the switching times. Thereby, the simple linear model is used in scenario 4 for the crosstalk estimation, and in scenario 5, the LUT-based model. For the intrinsic temporal misalignment, the switching times of the input signals (with respect to the rising clock edges at k · 1 ns with k ∈ N) are normally distributed with a mean of 500 ps and a standard deviation, σskew, of 100 ps. Furthermore, the switching times are truncated to a minimum of 0 ps and a maximum of 700 ps.
Scenario 6 represents the case where rising-edge- and falling-edge-triggered flip-flops are used to generate an artificial signal misalignment, exploited by the technique proposed in this chapter. For artificial temporal misalignment, there is no need to differentiate whether the linear or the LUT-based crosstalk model is applied. All input signal pairs are either way perfectly aligned or completely misaligned for the artificial misalignment, resulting in the same Āi,j values (only ones and zeros) for both crosstalk models. Thus, both models result in the same crosstalk estimation and consequently in the same assignment. Scenarios 7–9 represent scenarios 4–6, respectively, with an additional classical bus invert (CBI) coding [229]. Classical bus invert is a well-known LPC, which effectively reduces the power consumption of metal wires through a reduction in the switching activities by up to 25%, while inducing only a relatively low bit overhead. The CBI is analyzed to validate that, after exploiting misalignment to improve the TSV performance, low-power techniques that aim to reduce the
switching activities reduce the TSV power consumption effectively. A CBI encoding requires only one invert bit as overhead per codeword (here, equivalent to a bit overhead of 7.7%). In contrast, the 3D CACs result in a bit overhead of 40%, due to the coupling complexity in 3D. Thus, in scenarios 1, 4, 5, and 6, 14 data bits are effectively transmitted per bit-pattern (no overhead); in scenarios 7 to 9, 13 bits; and in scenarios 2 and 3, 10 bits. This results in 9363, 10,083, and 13,108 bit-patterns to transmit the 8 kB of data, respectively. The analyzed power, performance, area, and noise metrics, relative to the values for the transmission of the raw data (scenario 1), are reported in Fig. 10.8. In accordance with the results presented in the previous chapter, the power-consumption reduction is ca. 5% for the ωm/ωe CAC, while the 6C CAC even slightly increases the TSV power consumption. As expected, the proposed overhead-free assignment technique leaves the power consumption more or less unaffected. However, a combination of the proposed method with the CBI coding leads to a reduction in the TSV power consumption by about 17%, independent of the applied crosstalk model. Thus, CBI coding combined with the exploitation of misalignment outperforms the best CAC technique in terms of power savings by a factor of 3×, despite a more than five times lower induced bit overhead. Another investigated metric is the maximum switching noise that is induced on an interconnect due to coupling effects, reported in Fig. 10.8c. Previous techniques lead to a noise-peak reduction of about 12%. Switching on adjacent lines induces the noise, and the more of these aggressors switch at the same time, the higher the noise peak. Consequently, the more adjacent lines switch misaligned, the more the noise peak is reduced. Thus, the proposed assignment approach also leads to much better noise properties.
Compared to the raw data, the noise peak can be reduced by up to 50% for artificial misalignment. For the intrinsic misalignment, the application of the linear crosstalk model leads to lower noise values than the LUT-based model. The reason is that misalignment always has a positive effect on the height of the noise peak, which is captured by the linear, but not the LUT-based, crosstalk model. However, for all analyzed misalignment scenarios, the noise reduction of the proposed technique employing a LUT for the crosstalk estimation is still 2.8× higher than the noise reduction of the CAC approaches. The reductions in the maximum propagation delay are shown in Fig. 10.8d. In accordance with the experimental results from the previous chapter, the maximum delay reduction of the best/proposed 3D-CAC technique is about 20%. The proposed assignment method leads to significantly higher performance improvements for all investigated misalignment scenarios. As expected, the highest delay reduction of approximately 40% (i.e., 65% performance increase) is obtained for artificial misalignment. Due to the relatively large standard deviation in the switching times for the intrinsic misalignment, the application of the more complex LUT-based crosstalk model only increases the performance improvement by about 1 pp, compared to the application of the linear model. Generally, both newly proposed assignment variants lead to delay reductions of about 34–35% for the intrinsic misalignment, implying a performance increase of 53–55%.
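The classical bus invert scheme used in scenarios 7–9 is compact enough to sketch fully. This is a generic implementation of the standard scheme from [229], not code from the book; `width=13` matches the 13 data bits per codeword used above:

```python
def cbi_encode(words, width=13):
    """Classical bus invert: transmit either the word or its complement,
    whichever flips fewer lines relative to the previous codeword, and
    signal the choice with one extra invert bit (overhead 1/width)."""
    mask = (1 << width) - 1
    prev, out = 0, []
    for w in words:
        flips = bin((w ^ prev) & mask).count('1')
        if flips > width // 2:          # inverting flips fewer lines
            out.append((w ^ mask, 1))
        else:
            out.append((w, 0))
        prev = out[-1][0]               # bus state = transmitted word
    return out

def cbi_decode(codewords, width=13):
    mask = (1 << width) - 1
    return [(w ^ mask) if inv else w for w, inv in codewords]
```

By construction, no codeword ever flips more than half of the data lines, which is the source of the up to 25% switching-activity reduction cited above.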
Fig. 10.8 Effect of the proposed technique exploiting misalignment and best-practice 3D CACs on the: (a) power consumption per transmitted bit; (b) bit overhead; (c) noise peak; (d) maximum propagation delay; (e) maximum throughput; (f) coder-decoder circuit (CODEC) area
An alternative performance metric, for the case that the number of TSVs is fixed as in this analysis, is the maximum possible data throughput, reported in Fig. 10.8e. This metric takes the propagation delay as well as the induced bit overhead into account. The bit overheads of existing 3D CACs are much higher than their delay reductions. Thus, when the number of TSVs in the array is fixed (to keep the TSV area occupation and yield unaffected), existing CACs even decrease the maximum data throughput. In contrast, the technique presented in this chapter allows for a significant throughput increase. Since the proposed technique without CBI coding induces no overhead, it results in throughput increases that equal the previously reported performance gains (i.e., 65% for artificial misalignment, and 53–55% for intrinsic misalignment). Applying CBI encoding reduces the power consumption at the cost of a decrease in the throughput due to the induced bit overhead. Thus, the method, combined with the CBI encoding, leads to a slightly lower throughput increase. Exploiting artificial misalignment, in combination with a CBI encoding, increases the maximum throughput by 52%. For the intrinsic misalignment and the CBI coding, the throughput can be increased by about 45% for both applied crosstalk models, linear and LUT-based. While the proposed assignment approach alone does not require any active circuit components, coding techniques require coder-decoder circuits (CODECs). Thus, the CODEC areas in gate equivalents (GE), resulting from gate-level syntheses with Synopsys tools, are included in Fig. 10.8f for the sake of completeness. In summary, the in-depth analysis shows that the proposed technique can significantly outperform CAC techniques in all analyzed metrics if a sufficient temporal signal misalignment is present. In combination with traditional LPCs, the proposed assignment technique furthermore effectively reduces the 3D-interconnect power consumption.
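The throughput figures follow from the delay and overhead numbers by simple arithmetic. The sketch below uses the rounded percentages from the text in place of the simulated delays, so it reproduces the reported values only approximately:

```python
def throughput_gain(delay_ratio, payload_bits, word_bits):
    """Relative throughput change for a fixed number of TSVs:
    (payload fraction) * (clock-rate gain from the shorter delay) - 1."""
    return (payload_bits / word_bits) * (1.0 / delay_ratio) - 1.0

# Artificial misalignment, no coding: ~40% delay reduction, no overhead.
g_plain = throughput_gain(1 - 0.394, 14, 14)   # ~ +65%
# Artificial misalignment + CBI: same delay, 1 invert bit per 13 data bits.
g_cbi = throughput_gain(1 - 0.394, 13, 14)     # ~ +53% (reported: 52%)
# Best 3D CAC: ~20% delay reduction but 40% bit overhead (10 of 14 bits).
g_cac = throughput_gain(1 - 0.20, 10, 14)      # negative: throughput drops
```

The negative value for the CAC case reproduces the observation above that, for a fixed TSV count, the bit overhead of existing 3D CACs outweighs their delay reduction.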
10.5 Conclusion

In this chapter, the effects of temporal misalignment between the switching times of the input signals of TSV-based interconnect structures have been investigated and subsequently exploited in order to improve the TSV performance without noticeable costs. For this purpose, the proposed technique exploits the heterogeneity in the TSV capacitances, which arises from the edge effects outlined in Chap. 4. The proposed technique drastically improves the TSV performance and switching noise. Moreover, the approach mitigates the TSV coupling problem such that low-complexity LPCs, aiming for a reduction in the switching activities of the transmitted bits, can effectively improve the TSV power consumption. For a 22-nm technology and modern TSVs, the CAC technique presented in Chap. 9 (which already outperforms all previous techniques drastically) improves the power consumption, the performance, and the maximum switching noise of TSVs by a maximum of 5%, 25%, and 12%, respectively. For a sufficient temporal
misalignment, the presented approach can even improve the performance by up to 65%, while it simultaneously decreases the noise and the power consumption by about 45% and 17%, respectively, despite showing a more than five times lower bit overhead than the best CAC technique. This underlines the massive potential of the proposed technique in scenarios where temporal misalignment between the signals is naturally present.
Chapter 11
Low-Power Technique for Yield-Enhanced 3D Interconnects
In the previous chapters, techniques that significantly improve the 3D-interconnect power consumption and performance at relatively low cost were presented. However, issues due to the relatively low through-silicon via (TSV) manufacturing yield have not been adequately addressed up to this point. Previous techniques add redundant TSVs to address the yield issues. However, the redundant TSVs of such techniques are only required in the presence of manufacturing defects, while they are a waste of resources in the most likely case of a defect-free manufacturing. Furthermore, due to their implementation costs, existing redundancy techniques lead to an increase in the already critical TSV power consumption. In this chapter, the first technique is presented that addresses TSV yield issues while moreover significantly reducing the power consumption of arbitrary 3D interconnects. For this purpose, low-power encoder-decoder architectures are contributed, which are capable of improving the TSV manufacturing yield. The encoder-decoder structures increase the manufacturing yield by the same amount as existing redundancy techniques, while they additionally provide drastic power-consumption improvements. This duality of the coding gain is achieved by ensuring that the redundancy in the low-power codewords, compared to the unencoded data words, can be exploited for yield enhancement. Such a hybrid coding approach allows us to drastically reduce the implementation costs compared to a separate integration of a low-power coding (LPC) technique and a redundancy scheme. To reduce the implementation costs even further, the technique is designed in a way that exploits technological heterogeneity between the individual dies of a 3D integrated circuit (IC). Technological heterogeneity implies that the logic costs differ for the individual dies of a 3D IC.
Hence, the proposed low-power code (LPC)-based redundancy technique is constructed such that it is of minimum possible complexity in costly mixed-signal or radio frequency (RF) dies, in exchange for a small increase in the hardware overhead in dies implemented in an aggressively scaled technology node.
An evaluation of the proposed technique shows that it reduces the relative number of defective chips by a factor of 17×, while it is additionally capable of reducing the power consumption by over 40%. Additionally, a case study for the commercially available heterogeneous system on a chip (SoC) from Ref. [76] and real RF data streams shows that the proposed technique improves the logic-area requirements and power consumption, compared to the best previous redundancy technique, by 69% and 33%, respectively, while still providing the same TSV yield enhancement. This underlines the substantial superiority of the proposed approach for real systems. The remainder of this chapter is organized as follows. In Sect. 11.1, the related work is reviewed. The logical impacts of the common TSV defects on the transmitted signals are outlined in Sect. 11.2. Limitations of existing yield-enhancement approaches, as well as the fundamental idea of the approach proposed in this chapter, are outlined in Sect. 11.3. A formal problem description, regarding defect fixing (decodability) and minimum possible circuit complexity, is derived in Sect. 11.4. Afterward, in Sect. 11.5, the proposed technique is presented, which is subsequently evaluated in Sect. 11.6. The aforementioned case study is presented in Sect. 11.7. Finally, the chapter is concluded in Sect. 11.8.
11.1 Existing TSV Yield-Enhancement Techniques

The related work on TSV yield enhancement is presented in this section. In order to cope with the relatively low TSV manufacturing yield, a dedicated test method in combination with a TSV redundancy scheme is typically used [128]. Initially, the TSVs are grouped into sets of size m, and to each set, one or more redundant TSVs are added. A dedicated methodology tests the TSVs for manufacturing defects. The test results are subsequently interpreted to identify faulty TSVs. If the number of faulty TSVs in a set does not exceed the number of redundant TSVs per set, the redundancy scheme is used to repair the TSV-based links. Otherwise, the die is not repairable and must be discarded. In the following, existing TSV test approaches and redundancy schemes are reviewed. Thereby, the focus lies more on redundancy schemes, since this chapter contributes a redundancy approach, which is generally independent of the testing methodology. A wide range of research has been conducted on TSV testing methods, which are classified into two major categories: pre-bond testing, using dedicated test instruments (e.g., [16, 63, 110, 111, 182, 259]), and post-bond testing, which operates similarly to traditional logic testing known from 2D ICs (e.g., [118, 152, 155, 245]). The main advantage of the second approach is that it also allows for detecting bonding failures. Furthermore, it potentially reduces the test time as it enables combining logic and TSV testing. However, post-bond TSV testing has the drawback that the testing is executed after the manufacturing of the complete stack is completed. Thus, whenever a single die has an irreparable TSV defect,
the full stack has to be discarded. In contrast, with pre-bond testing, only the single faulty die needs to be replaced. Thus, post-bond testing can significantly increase manufacturing costs. Consequently, a combination of pre-bond and post-bond testing is often applied [166]. Existing TSV redundancy schemes are also classified into two main categories, illustrated in Fig. 11.1: multiplexer based and coding based. The first approach uses multiplexers to reroute the signals around faulty TSVs. Two representative concepts exist for this approach: Signal-Reroute [107] and Signal-Shift [107, 128]. Signal-Reroute, illustrated in Fig. 11.2a, assigns every signal bit to a TSV and, if a signal TSV is faulty, only the associated net is rerouted over a redundant TSV. In the second concept, illustrated in Fig. 11.2b, beginning at the position of the fault,
Fig. 11.1 Types of TSV redundancy schemes: (a) multiplexer based; (b) coding based

Fig. 11.2 Variants of multiplexer-based TSV redundancy schemes: (a) Signal-Reroute [107]; (b) Signal-Shift [107, 128]
11 Low-Power Technique for Yield-Enhanced 3D Interconnects
all signal nets are shifted toward the redundant TSV. Signal shifting increases the number of rerouted signals, but it decreases the maximum delay of the redundancy scheme [184]. Configuration bit-vectors at the transmitter side, ctx, and the receiver side, crx, of a TSV group control the multiplexers. These configuration bits are determined through the repair signature (RS) for the set, which defines the bypassed TSV. To repair one fault, the compressed repair signature of a set containing n lines is log2(n) bits wide. Since the configuration is fixed after identifying the faulty TSVs, redundancy schemes generally require one-time-programmable non-volatile memory (NVM) cells to store the RSs. In standard technologies, E-fuses are typically used to implement one-time-programmable NVM cells [15]. A high-voltage circuit is required to program such NVM cells, which results in high area costs. Furthermore, a high-voltage power grid that spans all dies of a 3D IC drastically increases the layout/design complexity [15]. Therefore, the advantages of having a single/global NVM macro in one die, from which the repair signatures are shifted/loaded serially into distributed repair registers (RRs) during start-up, are extensively discussed in [109]. In summary, besides overcoming the need for NVM cells in each die, using a global macro also reduces the required NVM space by 50% for multiplexer-based redundancy schemes, as the RSs at both sides of a TSV set are equal. If possible, choosing a location where an NVM macro is required anyway further decreases the design complexity. Also, a controller for the global NVM macro, which serially shifts the RSs into the RRs while minimizing the number of inter-die connections (i.e., TSVs), is proposed in [109]. The major drawback of using a single global macro is that all RRs must be connected to one long shift-register chain spanning all dies of the system.
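The Signal-Shift rerouting just described can be sketched compactly. The function below is an illustrative model (names and 0-based indexing are ours, not from [107, 128]): physical TSV m is the redundant one, and all signals at or past the faulty position move one TSV toward it.

```python
import math

def signal_shift_assignment(m, faulty=None):
    """Return assignment[i] = index of the physical TSV carrying data bit i.
    With a fault on physical TSV `faulty`, all signals from that position on
    are shifted by one toward the redundant TSV (index m)."""
    if faulty is None:
        return list(range(m))            # defect-free: redundant TSV unused
    assert 0 <= faulty <= m, "only one repairable fault per set"
    return [i if i < faulty else i + 1 for i in range(m)]

def rs_width(n):
    """Width of the compressed repair signature for a set of n lines."""
    return math.ceil(math.log2(n))

print(signal_shift_assignment(4, faulty=2))  # -> [0, 1, 3, 4]
```

For a fault on TSV 2 of a 4-bit set, data bits 2 and 3 shift onto TSVs 3 and 4, so the faulty TSV is bypassed while only the nets past the fault are rerouted.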
Traditional fixed error-correcting codes (e.g., Hamming [97]) require large numbers of redundant TSVs, making them impractical for addressing TSV yield issues due to the enormous TSV dimensions and parasitics. Hence, a coding-based approach, illustrated in Fig. 11.1b, has been proposed, which requires only a reconfiguration of the decoder circuit while still minimizing the number of required redundant TSVs [184]. Thus, using a single redundant line, the approach can recover the information words in the case of any single static defect by modifying the decoding. Since the redundancy technique proposed in this chapter is also based on coding, the concept of coding-based redundancy schemes is explained in more detail in Sects. 11.4 and 11.5. The main advantage of the existing coding-based approach is that configuration is only required at the receiver side of a link. This reduces the number of required RSs in the individual dies by 50%. Thereby, the length of the RR chain for the global-NVM-macro approach is halved, which reduces the area overhead and the wiring complexity. For distributed NVM cells, the approach decreases the number of required NVM cells by 50%, which is a drastic improvement.
11.2 Preliminaries—Logical Impact of TSV Faults

The common TSV manufacturing defects—outlined in Sect. 1.2.2 on Page 15—can be classified as:

1. voids;
2. delamination at the interface;
3. material impurities;
4. TSV-to-substrate shorts.
In this subsection, the logical effects of the four different TSV defect types on the transmitted digital signals are outlined. The first three defect types result in an increased TSV resistance, which increases the signal propagation delay (see Eq. (3.43) on Page 63) [259, 263]. Hence, these defects entail logical delay faults. A TSV-to-substrate short results in a resistive electrical connection between the TSV conductor and the substrate-bias/bulk contact through the conductive substrate. Hence, a TSV-to-substrate short draws the potential of the TSV toward the substrate-bias potential [259]. Thus, for typical grounded p-doped substrates, this defect type entails logical stuck-at-0 (SA0) faults, while for an n-doped substrate, biased at the power-supply potential, it entails stuck-at-1 (SA1) faults. Furthermore, substrate shorts result in increased leakage currents whenever the TSV-conductor potential differs from the substrate potential (e.g., when, for a p-doped grounded substrate, a logical 1 is on the TSV, implying a conductor potential of Vdd). In summary, a redundancy technique has to cope with stuck-at (SA) and delay errors. Typically, only one SA type (SA1 or SA0) occurs, depending on the substrate doping. Nevertheless, for increased robustness, a scheme that can repair both stuck-at fault types is also presented in this chapter.
11.3 Fundamental Idea

In this section, first, the drawbacks of existing TSV redundancy schemes are outlined. Based on this, the fundamental idea of the proposed technique is presented. Three major drawbacks of existing TSV reliability schemes can be identified. First, current redundancy techniques do not consider the implications of heterogeneous integration. Digital logic elements are more expensive (in terms of area, power consumption, and delay) when integrated into a less aggressively scaled technology. This creates a demand for heterogeneous redundancy schemes: in dies with larger feature sizes, a redundancy scheme should incur the lowest possible hardware overhead. The second drawback is that recent techniques require repair signatures (RSs) in every die, which increases the manufacturing complexity. For example, consider a system consisting of three different dies, illustrated in Fig. 11.3a. In the example, the
Fig. 11.3 Heterogeneous 3D IC with: (a) the previous coding-based TSV redundancy scheme [184]; (b) the proposed TSV redundancy scheme
previous coding-based redundancy approach is already applied to reduce the total number of required RSs. Thus, an RS is only required at each decoder. However, since data flows in both directions, RSs are still required in every die. Thus, the repair register (RR) chain—through which the RSs are distributed at run time—still spans all dies of the system when a global NVM macro is used. Alternatively, with distributed NVM cells, expensive NVM programming circuits are required in all dies, resulting in high costs. To overcome this bottleneck, a redundancy technique is proposed in this chapter in which the reconfigurable circuit is always located in the more aggressively scaled technology at the interface of two heterogeneous dies. Furthermore, this reduces the RR hardware costs for the global NVM macro. Thus, for data flowing toward a die fabricated in a bigger technology node, a scheme with a configurable encoding and a fixed decoding of minimal circuit complexity is proposed. If data flows toward a die fabricated in a more aggressively scaled technology, the decoding should be configurable, while the encoding should be of minimal complexity. One exception is made for systems using field-programmable gate array (FPGA) blocks to integrate semi-custom digital blocks after manufacturing. Redundancy schemes are designed to cope with manufacturing defects; a dynamic reconfiguration of the encoding or decoding at run time is generally not required. Thus, if one of the two dies contains an FPGA, the placement and routing of the FPGA soft-components, executed after TSV manufacturing and testing, can be adapted to the defect characteristics of the links. This eliminates the high costs due to NVM and RR cells storing RSs on-chip, for the proposed as well as all existing redundancy schemes, as outlined in detail in Sect. 11.5. Thus, at the interface
of a die containing an FPGA (referred to in the following as an FPGA die) and another die, it is proposed to place all fixed encoders and decoders in the non-FPGA side/die. For the example illustrated in Fig. 11.3b, the proposed technique requires no RSs in the mixed-signal and the central processing unit (CPU) die anymore. Furthermore, the need for NVM and RR cells vanishes completely. This is an extreme scenario, but it represents existing products, such as those presented in Refs. [48, 76]. For systems without FPGA dies and a large number of layers, the number of dies that require RSs is still reduced by 50% compared to all previous approaches. Here, the proposed technique avoids the need for an NVM programming circuit for distributed NVM cells in every second die. For the global NVM macro, the RR chain has to cover 50% fewer dies. The last drawback of existing TSV redundancy schemes is that they waste resources in the most likely event of a defect-free TSV manufacturing. Thus, in this chapter, a redundancy technique is presented that optimizes the TSV power consumption—especially in scenarios where redundancy is not required for yield enhancement. Therefore, the encoder-decoder pairs are moved from the boundaries of the TSV links to the boundaries of the full 3D links, including the metal wires. This implies another requirement for the fixed encoding and the fixed decoding. Besides being of minimal complexity, the fixed encoder and decoder circuits also have to be usable as a low-power encoder and decoder for TSVs and metal wires. Generally, the requirements for a coding technique to serve as an efficient LPC depend on various factors. First, the coupling phenomenon changes drastically between TSV arrays and metal wires due to the different capacitance structures outlined in Chap. 4. Also, temporal misalignment affects the coupling behavior in a complex way, as shown in Chap. 10.
This makes it hard to identify a suitable low-power code for long 3D interconnects made up of several metal-wire and TSV segments when the coupling is considered explicitly. However, in the previous chapter, it was shown that reducing the self-switching activities is in any case the more efficient coding approach if nets switch (even only slightly) temporally misaligned. The previous considerations, combined with the fact that considering coupling results in a much larger coder-decoder circuit (CODEC) complexity, indicate that the proposed coding technique should focus on reducing the power consumption of the 3D interconnects by minimizing the switching activities of the transmitted bits. In addition to the switching activities, an optimization of the bit probabilities is desirable in order to exploit the TSV metal-oxide-semiconductor (MOS) effect. Exploiting the MOS effect demands increased logical 1-bit probabilities for typical p-doped substrates and increased 0-bit probabilities for n-doped substrates, as shown in Chap. 4. However, the bit probabilities should only be fine-tuned if this can be achieved at a reasonable increase in the coding complexity, for mainly two reasons. First, only the TSV but not the metal-wire power consumption can be optimized through the MOS effect, and second, the magnitude of the TSV MOS effect varies strongly between technologies, as shown in Chap. 4.
In summary, the fundamental idea is to design two coding architectures that exploit heterogeneity to reduce implementation costs by having either a minimal fixed encoder or a minimal fixed decoder (one each), allow for TSV yield enhancement, and optimize the 3D-interconnect power consumption by reducing the switching activities. Moreover, the possibility of extending the techniques such that the bit probabilities are additionally optimized should be investigated.
11.4 Formal Problem Description

The formal problem description considering decodability (i.e., defect fixing) and circuit complexity is presented in the following. Therefore, a link composed of n lines over which m bits have to be transmitted is considered. Thus, the number of redundant lines, r, is n − m.
11.4.1 Decodability

To guarantee a fault-free transmission in the case of a defect in a TSV group, a data encoding is integrated, as illustrated in Fig. 11.4. A fault is corrected by modifying only the encoding or the decoding while the counterpart is fixed. In this subsection, the analysis is restricted to linear block codes in the binary Galois field, F2, as these codes allow for a systematic analysis and result in low hardware complexities. Thus, the encoding of the m-bit information word, d = [d1, …, dm] ∈ B^(1×m), to the n-bit codeword, b = [b1, …, bn] ∈ B^(1×n), is expressed as a matrix multiplication in the Galois field 2:

b = d · Γ,    (11.1)

where Γ ∈ B^(m×n) is the encoding matrix. The effect of the physical link on the received word, b*, is expressed by the function fL:

b* = fL(b) = fL(d · Γ).    (11.2)

Fig. 11.4 Arrangement of the redundancy encoder-decoder pair enclosing the full 3D-interconnect structure
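Equation (11.1) is a plain GF(2) matrix product: every codeword bit is the XOR of the information bits selected by one column of Γ. The following sketch (function and matrix names are ours, for illustration only) makes this concrete for a toy code with m = 2 and n = 3:

```python
def gf2_matvec(d, G):
    """b = d · G over GF(2): bit b_j is the XOR of all d_i with G[i][j] = 1."""
    m, n = len(G), len(G[0])
    return [sum(d[i] & G[i][j] for i in range(m)) % 2 for j in range(n)]

# Toy encoding matrix Gamma (m = 2, n = 3): identity plus one parity column
GAMMA = [[1, 0, 1],
         [0, 1, 1]]

print(gf2_matvec([1, 1], GAMMA))  # -> [1, 1, 0]
```

The third line of the codeword carries the parity d1 ⊕ d2, which is exactly what the all-ones column of GAMMA expresses.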
In the defect-free case, the received word, b*, is always equal to the transmitted one, b. However, the received and the transmitted words differ for a defect in the link. A linear decoding, expressed by a multiplication with Φ ∈ B^(n×m), has to recover the information word, d, out of b*. Hence, the decoding has to satisfy

d = d̂ = fL(b) · Φ = fL(d · Γ) · Φ.    (11.3)

The set of considered errors is symbolized as SE. Consequently, a fixed encoding, expressed by Γfix, in combination with a configurable decoding, expressed by Φconf, must satisfy

∀e ∈ SE ∃ Φconf,e : fL,e(d · Γfix) · Φconf,e = d.    (11.4)

For a fixed decoding, expressed by Φfix, the constraint is

∀e ∈ SE ∃ Γconf,e : fL,e(d · Γconf,e) · Φfix = d.    (11.5)
In the following, it is outlined how the encoder-decoder pairs cope with the two logical fault types, stuck-at (SA) and delay. First, faults are masked in a way that they do not affect the decoded information, d̂i. If the encoding (Γ) is fixed, a fault on line i is masked by configuring the decoding in a way that all elements in the ith row of Φ are equal to 0. Thereby, the decoded output becomes independent of the value on the defective line. This kind of masking is not possible if the decoding is fixed. Here, in case of a fault, the encoding is configured such that a constant value is assigned to the faulty line, irrespective of the transmitted information word. This value is logical 0 for an SA0, and logical 1 for an SA1. Thus, a faulty line is set to constant logical 0 and 1 for p-doped and n-doped substrates, respectively.1 If, for example, a constant logical 0 is assigned to the input of a line that might have either an SA0 or a delay fault, the output at the far end is also constant logical 0, independent of the presence of the fault and its type (i.e., SA0 or delay). Thus, the transmitted and the received word are always equal, despite the defect (i.e., b* = b). Assigning constant/stable values to faulty lines furthermore enables us to treat delay faults in the same way as SA faults. Hence, the remainder of this discussion can be limited to SA faults. In the following, SA0 faults are considered as the substrate is typically p-doped. However, an analysis of SA1 faults results in the same constraints. For a single SA0 on line i, the effect of the link is expressed by a multiplication with a matrix similar to the identity, but with a 0 on the ith diagonal entry, represented by Esai. Thus, for an SA0 fault, Eq. (11.3) is expressed as
1 This forbids later inversions for the nets, as proposed in Chap. 8, to exploit the MOS effect since this would destroy the error masking. Thus, for nets that belong to the output of a configurable encoder, the technique proposed in Chap. 8 can only be applied without inversions (resulting in a mere reordering of the net-to-TSV assignment).
d = fL(b) · Φ = d · Γ · Esai · Φ.    (11.6)
Hence, to cope with every single SA fault, a fixed encoding has to satisfy

∀i ∈ {1, 2, …, n} ∃ Φconf,sai : Γfix · Esai · Φconf,sai = Im,    (11.7)

where Im is the m × m identity matrix. The resulting equation systems can be solved if the rank of Γfix · Esai is at least m (i.e., a right inverse exists). The multiplication with Esai simply sets all entries of the ith column of Γfix to 0. Since all entries of the ith row of the decoding matrix Φconf are also 0 to mask the error, the equation systems from Eq. (11.7) have unique solutions if all subsets of m columns in Γfix are linearly independent. Mathematically, this is expressed as

spark(Γfix) > m.    (11.8)
Analogously, one can show that decodability for up to s arbitrary errors, using r redundant lines, requires that, in all subsets of m + r − s columns of the encoding matrix, Γfix, at least m columns are linearly independent. For a fixed decoding, Φfix, the equation systems are as follows:

∀i ∈ {1, 2, …, n} ∃ Γconf,sai : Γconf,sai · Esai · Φfix = Im.    (11.9)

In this equation, the multiplication with Esai sets all entries of the ith row of Φfix to 0. Thus, the left inverse of Esai · Φfix always exists if all subsets of m rows of Φfix are linearly independent, mathematically expressed as

spark(Φfix^T) > m.    (11.10)
Consequently, the general constraint to be able to handle arbitrary combinations of s errors with a fixed decoder and r redundant lines is that, in all subsets of m + r − s rows of Φfix, at least m rows are linearly independent.
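The spark conditions (11.8) and (11.10) are straightforward to check numerically with a GF(2) rank routine. The sketch below is our own helper code (not from the book): it tests whether every subset of m columns is linearly independent; applying it to the rows instead covers the fixed-decoder constraint.

```python
from itertools import combinations

def gf2_rank(vecs, width):
    """Rank over GF(2) of bit-vectors encoded as `width`-bit integers."""
    vecs, rank = list(vecs), 0
    for bit in range(width - 1, -1, -1):
        pivot = next((v for v in vecs if (v >> bit) & 1), None)
        if pivot is None:
            continue
        vecs.remove(pivot)
        vecs = [v ^ pivot if (v >> bit) & 1 else v for v in vecs]
        rank += 1
    return rank

def spark_exceeds(columns, m):
    """True iff every subset of m columns is linearly independent (spark > m)."""
    return all(gf2_rank(sub, m) == m for sub in combinations(columns, m))

# Columns of Gamma_fix = [I | all-ones] for m = 3, n = 4 (bit i = row i):
print(spark_exceeds([0b001, 0b010, 0b100, 0b111], 3))  # -> True
```

The matrix [I | all-ones] passes the check because any m − 1 unit columns plus the all-ones column still span the full space, so a single masked line never loses information.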
11.4.2 Circuit Complexity

In this subsection, the lower bounds for the circuit complexities of the fixed encoding and the fixed decoding are derived. To take TSV costs into account, only redundancy schemes where the number of simultaneously repairable errors is equal to the number of redundant TSVs (i.e., s = r) are considered. Again, fixed encoding and decoding are expressed by Galois-field-2 multiplications with Γfix and Φfix, respectively. The number of ones in the ith column of Γfix and Φfix defines how many input bits are combined by a logical exclusive disjunction (XOR) in the encoding and decoding, respectively, to generate the ith output bit. Thus, the maximum number of ones in a single column of Γfix and Φfix defines the maximum delay of the encoder and the decoder circuit, respectively. Assuming that only two-input gates are available, representing the worst case, the delay of an XOR operation on i inputs is equal to ⌈log2(i)⌉ · Txor, where Txor is the delay of a single XOR gate. Thus, the maximum delay of the fixed encoder is

T̂Γfix = ⌈log2( max_j Σ_{i=1}^{m} Γfix,i,j )⌉ · Txor,    (11.11)

while the maximum delay of the fixed decoder is

T̂Φfix = ⌈log2( max_j Σ_{i=1}^{n} Φfix,i,j )⌉ · Txor.    (11.12)
The area for an XOR operation on i inputs, employing only two-input gates, is equal to (i − 1) · Axor, where Axor is the XOR-gate area. Thus, the overall area of the fixed encoder circuit is expressed as follows:

AΓfix = ( −n + Σ_{i=1}^{m} Σ_{j=1}^{n} Γfix,i,j ) · Axor.    (11.13)

Consequently, the overall area of the fixed decoder circuit is

AΦfix = ( −m + Σ_{i=1}^{n} Σ_{j=1}^{m} Φfix,i,j ) · Axor.    (11.14)
First, the lower bound for the circuit complexity of a fixed encoder is derived. Since all m input bits have to affect at least s + 1 bits of the transmitted n = m + s codeword bits to cope with every s-bit error, the minimum number of ones in the encoder matrix, Γfix, is m · (s + 1). Thus, the optimal area is

Aopt,Γfix = (m · (s + 1) − n) · Axor = s · (m − 1) · Axor.    (11.15)

The optimal delay is obtained if the m · (s + 1) ones are equally distributed over the n = m + s columns of Γfix. In this scenario, the maximum number of ones in a column is ⌈m · (s + 1)/(m + s)⌉. Thus, the optimal delay of a fixed encoder is

T̂opt,Γfix = ⌈log2⌈m · (s + 1)/(m + s)⌉⌉ · Txor.    (11.16)

For a fixed decoding, Φfix, each of the m output bits must be a Boolean expression of at least s + 1 input bits to cope with up to s entropy-less/stable input bits (i.e., masked defective lines). Thus, the optimal delay and area are:
T̂opt,Φfix = ⌈log2(s + 1)⌉ · Txor;    (11.17)

Aopt,Φfix = (m · (s + 1) − m) · Axor = m · s · Axor.    (11.18)
As a sanity check, an extensive search of all matrices in the Galois field 2 that fulfill the decodability constraints is performed for m equal to 4 and 5 and s/r equal to 1 and 2. The check validates the correctness of the derived formulas for the minimum possible circuit complexities.
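That sanity check can be reproduced at a smaller scale in a few lines. The brute-force search below (our own sketch) enumerates all encoder matrices for m = 3, s = r = 1 over GF(2), keeps those satisfying the decodability constraint spark(Γfix) > m, and confirms that the minimum number of ones equals m · (s + 1), in line with Eq. (11.15):

```python
from itertools import combinations, product

def gf2_rank(vecs, width):
    """Rank over GF(2) of bit-vectors encoded as `width`-bit integers."""
    vecs, rank = list(vecs), 0
    for bit in range(width - 1, -1, -1):
        pivot = next((v for v in vecs if (v >> bit) & 1), None)
        if pivot is None:
            continue
        vecs.remove(pivot)
        vecs = [v ^ pivot if (v >> bit) & 1 else v for v in vecs]
        rank += 1
    return rank

m, s = 3, 1                 # 3 information bits, 1 redundant line
n = m + s
best = None
for cols in product(range(1, 2 ** m), repeat=n):   # columns as m-bit integers
    if all(gf2_rank(sub, m) == m for sub in combinations(cols, m)):
        ones = sum(bin(c).count("1") for c in cols)
        best = ones if best is None else min(best, ones)

print(best, m * (s + 1))    # -> 6 6
```

All-zero columns are skipped, as they can never belong to a linearly independent column subset. The minimum of six ones is attained, for instance, by [I | all-ones].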
11.5 TSV Redundancy Schemes

The proposed TSV redundancy schemes are outlined in this section. In Sect. 11.5.1, a fixed decoding and the associated configurable encoding are presented. Section 11.5.2 covers the required counterpart: a fixed minimal low-power encoding with a configurable decoding. The proposed schemes are designed to repair a single error per group (i.e., r = s = 1). A minimal/optimal circuit complexity for one redundant line implies a maximum delay of only Txor for both circuits, fixed encoder and fixed decoder. The optimal circuit areas of the fixed encoder and the fixed decoder are (m − 1) · Axor and m · Axor, respectively.
11.5.1 Fixed-Decoding Scheme

The proposed TSV redundancy scheme with a fixed decoding is illustrated in Fig. 11.5. In the right part of Fig. 11.5a, the fixed decoder is illustrated, which is simply a controllable inverter: the most significant bit (MSB) of the received word controls whether the decoded m-bit output is equal to the first m bits of the received word, b*, or equal to its bit-wise negation. The Boolean equation describing the decoding is

d̂i = b*i ⊕ b*m+1   for i ∈ {1, 2, …, m},    (11.19)

where ⊕ is the Boolean XOR operator. For m equal to 4, the decoding matrix is

       ⎡1 0 0 0⎤
       ⎢0 1 0 0⎥
Φfix = ⎢0 0 1 0⎥ .    (11.20)
       ⎢0 0 0 1⎥
       ⎣1 1 1 1⎦
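A quick exhaustive check confirms that the controllable-inverter decoding of Eq. (11.19) and the matrix Φfix of Eq. (11.20) are the same mapping. The code below is an illustrative sketch (the function names are ours):

```python
from itertools import product

def decode_invert(b):
    """Eq. (11.19): XOR every data bit with the received MSB line b*_{m+1}."""
    return [bi ^ b[-1] for bi in b[:-1]]

# Phi_fix from Eq. (11.20) for m = 4 (rows: received lines, cols: output bits)
PHI = [[1, 0, 0, 0],
       [0, 1, 0, 0],
       [0, 0, 1, 0],
       [0, 0, 0, 1],
       [1, 1, 1, 1]]

def decode_matrix(b):
    """d_hat = b* · Phi_fix over GF(2)."""
    return [sum(b[i] & PHI[i][j] for i in range(5)) % 2 for j in range(4)]

assert all(decode_invert(list(b)) == decode_matrix(list(b))
           for b in product([0, 1], repeat=5))
print("Eq. (11.19) and Eq. (11.20) agree on all 32 received words")
```

The all-ones bottom row of Φfix is exactly the invert line feeding every output XOR.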
Fig. 11.5 Proposed redundancy scheme with a fixed decoding for a 4-bit input: (a) General structure; (b) Configurable-hardware structure for p-doped substrates and an implementation into an ASIC die; (c) Configurable hardware structure for n-substrates and an implementation into an ASIC die; (d) Configurable hardware structure to repair both stuck-at fault types and an implementation into an ASIC die; (e) Encoder for an implementation into an FPGA die
11.5.1.1 Defect Fixing
In the following, it is outlined how the minimal invert decoding and a configurable encoding handle a defect on an arbitrary line with index j. Thus, the received value b*j is faulty, while all other received values are correct (i.e., b*i = bi for i ∈ {1, …, m+1}\{j}). A correct transmission of d despite the defect requires a transmission of information bit dj over the redundant line (i.e., bm+1 = dj). Over all other lines, the information bits XORed with the value on the redundant line are transmitted:

bi = di ⊕ bm+1 = di ⊕ dj   for i ∈ {1, 2, …, m}.    (11.21)

Therefore, the encoder also contains a controllable inverter. As required, the encoding masks the delay or SA0 error on line j:

bj = dj ⊕ bm+1 = dj ⊕ dj = 0.    (11.22)

Also, the correct information is decoded despite the fault:

d̂i = b*i ⊕ b*m+1 = { (di ⊕ dj) ⊕ dj = di   for i ≠ j;   0 ⊕ dj = dj   for i = j (b*j has SA0) }.    (11.23)
A constant 0 is assigned to the redundant invert line for an SA0 fault on the (m + 1)th interconnect. Afterward, the first m bits of the transmitted and the received word are equal to the original information word, d, since the inverter is turned off permanently. If the substrate is n-doped—entailing SA1 instead of SA0 faults—the codewords only have to be logically negated. The negation does not affect the decoding since the invert line is also negated. To obtain the negated codewords at the encoder output, the value on the invert line of the encoder simply has to be negated (i.e., ¬dj instead of dj is transmitted over the redundant line for a defect on line j). Thus, for an SA1 or delay fault, the Boolean equation describing the encoding is

bi = di ⊕ bm+1 = di ⊕ ¬dj   for i ∈ {1, …, m}.    (11.24)

Again, the requirement of masking the fault is satisfied:

bj = dj ⊕ bm+1 = dj ⊕ ¬dj = 1.    (11.25)

The decoding is mathematically expressed as

d̂i = b*i ⊕ b*m+1 = { (di ⊕ ¬dj) ⊕ ¬dj = di   for i ≠ j;   1 ⊕ ¬dj = dj   for i = j (b*j has SA1) }.    (11.26)
Here, a constant logical 1 is assigned to the invert line if the redundant line is faulty. The encoding for an exemplary SA0 on line 3 and a group size of four is described by

                        ⎡1 0 0 0 0⎤
b = d · Γconf,sa3 = d · ⎢0 1 0 0 0⎥ .    (11.27)
                        ⎢1 1 0 1 1⎥
                        ⎣0 0 0 1 0⎦
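The repair of the exemplary SA0 on line 3 can be replayed end to end. In the sketch below (our own model with 0-based indices, so the book's line 3 is index 2), a configurable encoder implements the Γconf,sa3 behavior, the channel forces the faulty line to 0, and the minimal fixed decoder of Eq. (11.19) recovers every information word:

```python
from itertools import product

M, FAULTY = 4, 2            # group size 4; SA0 on line 3 (0-based index 2)

def encode_sa0(d, j):
    """Configurable encoding for an SA0 on data line j: send d[j] on the
    redundant line, XOR it onto all other lines, and drive the faulty line
    to constant 0 (masking, cf. Eq. (11.22))."""
    inv = d[j]
    b = [di ^ inv for di in d]
    b[j] = 0
    return b + [inv]

def channel_sa0(b, j):
    """Physical link with a stuck-at-0 defect on line j."""
    return [0 if i == j else bi for i, bi in enumerate(b)]

def decode_fixed(b):
    """Minimal fixed decoder of Eq. (11.19)."""
    return [bi ^ b[-1] for bi in b[:-1]]

for d in product([0, 1], repeat=M):
    received = channel_sa0(encode_sa0(list(d), FAULTY), FAULTY)
    assert decode_fixed(received) == list(d)
print("all", 2 ** M, "information words recovered despite the SA0")
```

Because the encoder drives the faulty line to 0 anyway, the stuck-at channel changes nothing, and the fixed invert decoder restores d without any receiver-side configuration.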
The required encoder hardware, for an implementation into an application-specific integrated circuit (ASIC) die containing hard components, is shown in Fig. 11.5a for m equal to 4. Besides the controllable inverter, made up of m XOR gates, a configurable block is required. This configurable block outputs the invert signal, equal to the signal transmitted over the redundant line. For TSVs traversing a p-doped substrate, the minimal configurable block simply consists of the multiplexer shown in Fig. 11.5b. Signal ctx, equal to the RS stored in an RR, controls the multiplexer. If a data line i is defective (i.e., integer(ctx) < m), the multiplexer passes di through. Otherwise, the multiplexer passes through a constant logical 0. An inversion of the multiplexer output is added if the TSVs traverse an n-doped substrate, as shown in Fig. 11.5c. Such an inversion does not affect the hardware complexity. In contrast, the hardware complexity increases by one XOR gate and one configuration bit if the system should be capable of repairing SA0 as well as SA1 faults, as shown in Fig. 11.5d. Here, in case of a stuck-at fault, the RS does not only have to indicate the location of the defective line but also the defect type (i.e., SA1 or SA0), using an additional bit. For a delay instead of an SA fault on a line, both configurations—the one that treats an SA0 as well as the one that treats an SA1 on the according line—ensure a correct functionality of the link. Thus, one of the two configurations can be chosen arbitrarily. The encoder hardware for an implementation into an FPGA die, illustrated in Fig. 11.5e, has a lower complexity. Here, just the controllable inverter is physically required. This makes the encoder hardware requirements equal to the decoder hardware requirements (i.e., minimum delay). For FPGA dies, the RSs should be used offline during the placement and routing of the FPGA soft-components, performed after TSV testing.
Therefore, after manufacturing and testing, the TSV-defect information is stored not on-chip, but in a file readable by the electronic design automation (EDA) tool of the FPGA. During the place-and-route stage of the FPGA soft-components, the tool reads the RS of the link and then routes the information bits that have to be transmitted over the TSVs to the encoder inputs such that a correct transmission is ensured. If the ith (i ≤ m) line of the link has an SA0 fault, the tool routes di to the invert input, Inv, to ensure a correct transmission. ¬di is assigned to the invert input for an SA1 fault. The invert input is connected to logical 0 and logical 1 if the redundant line has an SA0 and SA1, respectively. Again, delay faults can be treated either as an SA0 or as an SA1 fault. In case of no defect, an arbitrary value can be assigned to the redundant invert line for all encoder implementations. This freedom is exploited to effectively reduce the 3D-interconnect power consumption: the proposed redundancy technique operates in a low-power configuration in that case.
11.5.1.2 Low-Power Configuration
In the following, the low-power configurations for the fixed-decoding scheme are presented. The same encoding as for a defect on line m, expressed by Γconf,sam, reduces the 3D-interconnect power consumption if the data words are normally distributed. This is because of the specific characteristics of the MSB region of normally distributed data words, outlined in Sect. 6.1.2 on Page 115. In brief, the more-significant bits of normally distributed data show a strong spatial correlation. Hence, if one of the correlated MSBs toggles, the remaining ones toggle as well in the same direction. The switching activity of the correlated bits in the MSB region varies between 0 and 1, depending on the word-level correlation of the data. However, if, for each pattern d of a normally distributed data stream, the bits are XORed with the MSB dm (Γconf,sam for p-doped substrates), the strong spatial correlation of the higher-value bits results in codeword bits that are nearly stable at logical 0. An XOR operation with the negated MSB ¬dm (Γconf,sam for n-doped substrates) results in codeword bits nearly stable at 1. Thus, independent of the word-level correlation, the switching of the more-significant lines is reduced to almost 0 when the encoding is Γconf,sam, for p-doped as well as n-doped substrates. This coding scheme is based on the same LPC concept as the Gray-coding technique used in Chap. 8, which was shown to improve the TSV power consumption effectively. However, the technique in Chap. 8 also employs inversions to further optimize the TSV power consumption through the bit probabilities. These inversions were realized by swapping inverting with non-inverting drivers. This is not possible for the proposed technique, as it would destroy the error masking for SA faults. Nevertheless, an alternative approach that allows optimizing the TSV power consumption through the bit probabilities is presented in the following.
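The effect can be demonstrated on synthetic data. The sketch below is our own experiment (not from the book): it draws Gaussian-distributed 8-bit two's-complement samples whose upper bits are sign-extension bits, applies the Γconf,sam-style encoding (XOR with the MSB, MSB on the redundant invert line), and counts per-line toggles; the more-significant lines stop switching entirely:

```python
import random

random.seed(1)
W = 8                                    # word width m
vals = [max(-32, min(31, round(random.gauss(0, 8)))) for _ in range(2000)]
raw = [[((v & 0xFF) >> i) & 1 for i in range(W)] for v in vals]

def toggles(stream):
    """Per-line switching counts of a bit-vector stream."""
    t = [0] * len(stream[0])
    for prev, cur in zip(stream, stream[1:]):
        for i in range(len(t)):
            t[i] += prev[i] ^ cur[i]
    return t

# Low-power configuration for a p-doped substrate: XOR every bit with the
# MSB and transmit the MSB itself on the redundant invert line.
enc = [[b ^ w[W - 1] for b in w] + [w[W - 1]] for w in raw]

t_raw, t_enc = toggles(raw), toggles(enc)
print(sum(t_raw), sum(t_enc))
# The sign-extension lines (bits 5..7) are constant 0 after encoding:
assert t_enc[5] == t_enc[6] == t_enc[7] == 0
```

Because the samples are clamped to [−32, 31], bits 5–7 always equal the sign bit; XORing with the MSB turns them into constant-0 lines, so their switching activity drops to exactly zero while the words remain fully recoverable at the fixed decoder.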
The lowest power consumption is obtained for p-doped substrates if the 1-bit probabilities are maximized, as shown in Chap. 4. For n-doped substrates, the 1-bit probabilities should be minimized. Both can be achieved by assigning the negated MSB for pdoped substrates, and the non-negated MSB for n-doped substrates, to the invert line. Thus, the enhanced low-power configuration—optimizing the bit probabilities besides the switching activities—is the same as for an SA1 on line m for pdoped substrates (typically entail SA0 faults). For n-doped substrates, the enhanced low-power configuration is the same as for an SA0 on line m. The improved configuration comes at slightly increased hardware costs if an encoder that is only capable of fixing one SA fault type is integrated into an ASIC die. Here, the multiplexers of the two configurable circuits, shown in Fig. 11.5b and c, have to be extended by one input ¬dm . The fixed-decoder scheme can also reduce the power consumption for arbitrary distributed data words. In the defect-free case, the invert line can be used to implement any low-power invert-coding. For example, the widely known classical bus invert (CBI) technique, presented in Ref. [229], has already shown to effectively reduce the power consumption of TSVs (besides metal wires as intended) in previous chapters of this book. The multiplexers in Fig. 11.5b–d need an additional input, Invlp , to integrate an invert-coding technique into an ASIC die. Additionally, a block
11.5 TSV Redundancy Schemes
that checks the invert condition and sets Invlp if required has to be implemented. This block is cut off from the power supply if it is not used (i.e., for an erroneous link). If the block is implemented by means of an FPGA soft-component, it can simply be erased when it is not needed. Note that CBI is just the best known of many possible invert-codes (e.g., [116, 187]). Generally, all invert-codes can be combined with the proposed redundancy scheme. However, CBI encoding typically results in the best trade-off between the 3D-interconnect power-consumption improvement and the CODEC complexity. Hence, other invert-coding methods are not investigated for the proposed technique.
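As a generic illustration of the CBI idea from Ref. [229] (a sketch, not the exact circuit of this chapter): the encoder compares each new word with the previously transmitted codeword and, if more than half of the lines would toggle, transmits the inverted word and sets the invert line.

```python
def cbi_encode(words, width):
    """Classical bus invert (CBI): invert a word when that halves the toggles.

    Returns (codeword, invert_bit) pairs; the invert line is an extra wire.
    """
    mask = (1 << width) - 1
    prev, out = 0, []
    for w in words:
        ham = bin((w ^ prev) & mask).count("1")  # toggles without inversion
        if ham > width // 2:                     # inverting saves transitions
            w, inv = (~w) & mask, 1
        else:
            inv = 0
        out.append((w, inv))
        prev = w                                 # compare against the codeword
    return out

# 4-bit example: 0b0000 -> 0b1111 would toggle all four lines,
# so CBI transmits 0b0000 again with the invert line set instead.
print(cbi_encode([0b0000, 0b1111], width=4))
```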
11.5.2 Fixed-Encoding Scheme

Two TSV redundancy schemes based on a fixed encoding and a configurable decoding were proposed in Ref. [184]. Although not noticed by the authors, one of them even has the smallest possible encoder complexity. This fixed encoder, illustrated in Fig. 11.6a, is a simple Gray encoder with the extension that the least significant bit (LSB) of the information word, d1, is attached as the codeword LSB. Thus, it also has good potential to reduce the 3D-interconnect power consumption. Consequently, the fixed-encoding scheme proposed in this section is based on this existing technique. The fixed encoding of the technique is expressed as
Fig. 11.6 Fixed-encoder circuits for a group size of four: (a) Existing structure presented in Ref. [184] (reused in the proposed technique for n-doped substrates); (b) Proposed structure for p-doped substrates
11 Low-Power Technique for Yield-Enhanced 3D Interconnects
Fig. 11.7 Decoder circuits in the fixed encoder scheme for a group size of four: (a) Previous structure [184] (used here for n-doped substrates); (b) Proposed structure for p-doped substrates; (c) Proposed structure for FPGA dies and p-doped substrates
$$
b_i = \begin{cases}
d_1 & \text{for } i = 1\\
d_i \oplus d_{i-1} & \text{for } i \in \{2, \dots, m\}\\
d_m & \text{for } i = m+1.
\end{cases}
\qquad (11.28)
$$
For a group size of four, the corresponding encoding matrix, fix,prev, is equal to

$$
\begin{bmatrix}
1 & 1 & 0 & 0 & 0\\
0 & 1 & 1 & 0 & 0\\
0 & 0 & 1 & 1 & 0\\
0 & 0 & 0 & 1 & 1
\end{bmatrix}.
\qquad (11.29)
$$
As the decoding is already extensively discussed in Ref. [184], it is only briefly summarized in the following. The decoder circuit is shown in Fig. 11.7a. The decoding is formally expressed as:
$$
\hat{d}_i = \begin{cases}
\bigoplus_{l=1}^{i} b^*_l & \text{for } c_{rx,i} = 0\\
\bigoplus_{l=i+1}^{m+1} b^*_l & \text{for } c_{rx,i} = 1.
\end{cases}
\qquad (11.30)
$$
The m-bit configuration signal, c_rx, controls the m output multiplexers. If all configuration bits for the link are 0, the multiplexers connect through the upper inputs, and the output is independent of line m+1. An error on any line with index j is bypassed by setting all bits c_rx,i with i ≥ j to 1. Thus, c_rx has to be "0011" to treat an exemplary error on the third line (i.e., b*_3 is faulty). This results in the following decoding:

$$
\begin{aligned}
\hat{d}_1 &= b^*_1 = d_1;\\
\hat{d}_2 &= b^*_1 \oplus b^*_2 = d_1 \oplus (d_1 \oplus d_2) = d_2;\\
\hat{d}_3 &= b^*_5 \oplus b^*_4 = d_4 \oplus (d_4 \oplus d_3) = d_3;\\
\hat{d}_4 &= b^*_5 = d_4.
\end{aligned}
\qquad (11.31)
$$

Hence, the correct information word is decoded independent of the logical value on the faulty line. Furthermore, the specific behavior of the more-significant bits of normally distributed data words, in combination with the XORing of neighboring bits, leads to codeword bits that are almost stable at logical 0. Hence, the previous technique is optimal for n-doped substrates, as it reduces the switching activities as well as the 1-bit probabilities. Consequently, the previously proposed fixed-encoder scheme is reused for n-doped substrates. However, for p-doped substrates, this previous encoder results in increased TSV capacitance quantities, which makes the approach suboptimal. Consequently, a small extension of the existing scheme is presented in the following, which maximizes the 1-bit probabilities for typical p-doped substrates without the need to invert nets through a driver swapping.2 To maximize the 1-bit probabilities, the XOR gates are replaced by logical exclusive non-disjunction (XNOR) gates, as shown in Fig. 11.6b. As a result, the codeword bits with indexes 2 to m are negated compared to the previous encoding architecture. This modification does not affect the switching properties; it only maximizes the 1-bit probabilities instead of the 0-bit probabilities. Thus, it results in a lower TSV power consumption without affecting the power-consumption reduction for metal wires. The modified circuit is still of minimal complexity, as the area and the delay of an XOR and an XNOR gate are generally equal. To decode the correct information word, d̂, the XOR and XNOR gates also have to be swapped in the decoder. The resulting decoder structure for a p-doped substrate is illustrated in Fig. 11.7b.
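A behavioral sketch of the fixed encoding (11.28) and the configurable decoding (11.30), here for the n-substrate (XOR) variant and a group size of four, illustrates the error masking for the "0011" example above (the bit vectors and the fault model are illustrative assumptions):

```python
from functools import reduce

def encode(d):
    """Fixed Gray-style encoder of (11.28): b1=d1, bi=di^d(i-1), b(m+1)=dm."""
    m = len(d)
    return [d[0]] + [d[i] ^ d[i - 1] for i in range(1, m)] + [d[m - 1]]

def decode(b, c_rx):
    """Configurable decoder of (11.30): XOR a prefix or suffix of the codeword."""
    m = len(c_rx)
    d_hat = []
    for i in range(1, m + 1):
        if c_rx[i - 1] == 0:           # use lines 1..i
            span = b[:i]
        else:                          # use lines i+1..m+1 (skips the faulty line)
            span = b[i:]
        d_hat.append(reduce(lambda x, y: x ^ y, span))
    return d_hat

d = [1, 0, 1, 1]                       # information word d1..d4
b = encode(d)
b[2] ^= 1                              # a defect flips line 3 (b3)
c_rx = [0, 0, 1, 1]                    # bypass configuration for a fault on line 3
print(decode(b, c_rx))                 # recovers [1, 0, 1, 1]
```

For a fault on line j, setting c_rx,i = 1 for all i ≥ j makes every output independent of line j, so any logical value on the faulty line is masked.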
2 For the fixed-encoder scheme, inverting nets through automatic driver swapping, as proposed in Chap. 8, is possible. The extension is presented to enable the use of the proposed technique in the best possible way, without the need to implement the technique from Chap. 8.
Again, the multiplexers and the configuration signals are not required if the decoder is implemented in an FPGA die, as illustrated in Fig. 11.7c. Here, during placement and routing of the FPGA soft-components, the EDA tool connects only those decoder outputs that are independent of the broken lines to the inputs of the FPGA soft-components.
11.6 Evaluation

In this section, the proposed technique is evaluated in terms of yield enhancement, power-consumption reduction, and hardware complexity. Furthermore, a comparison with existing multiplexer-based and coding-based redundancy schemes is drawn. Thereby, only the values for the previous coding-based scheme based on a Gray encoding are presented, as it significantly outperforms the other scheme, also presented in [184], in all analyzed metrics.
11.6.1 Yield Enhancement

In order to evaluate the yield enhancement, first, the repair limitations of the proposed and existing redundancy schemes are analyzed. Furthermore, it is outlined that the shielding approach presented in Chap. 9 can be used to partially overcome the limitations of the proposed technique. Afterward, the overall TSV-yield enhancement is quantified.
11.6.1.1 Repair Limitations
Not all single-line defects can be fixed with the proposed or existing redundancy techniques. Pinhole defects entail a resistive connection between the TSV conductor and the substrate-bias contacts [259]. The bigger the pinhole, the lower the resistance. Thus, a leakage current flows through the conductive substrate if the potential on a TSV with a pinhole defect differs from the substrate-bias potential. These leakage currents can dramatically increase the static power consumption. Furthermore, the resistive connection affects the body-bias potential of active elements located near the defective TSV. With increasing pinhole size, this will eventually result in malfunctions of active circuit elements. This cannot be fixed by existing redundancy schemes, nor by the proposed fixed-encoding scheme, since these schemes do not allow the potential of a faulty TSV to be permanently set to the substrate-bias potential. However, the proposed fixed-decoding scheme overcomes this limitation: the configurable encoding sets a faulty line permanently to the potential of the substrate bias to mask the error.
Consequently, no leakage current flows, and the body bias of the active circuits is not impaired. However, the proposed fixed-decoding scheme has a different limitation: a too-large increase in a TSV resistance makes the signal at the far end sensitive to coupling noise. The previously used equivalent circuit for TSV defect modeling does not include coupling effects [259]. Thus, this model cannot identify any limitation for the proposed fixed-decoding scheme, and a more detailed setup is required to outline the limitations of the scheme. Coupling effects are maximized for large, densely bundled TSVs and low significant frequencies, as shown in Chap. 4. Thus, a 3π resistance-inductance-capacitance (RLC) equivalent circuit of a 3×3 array with the minimum TSV spacing is extracted by means of the scalable TSV-array model presented in Sect. 1.3.1 and the Q3D Extractor.3 The TSV radius and length are set to 2 and 50 μm, respectively, which are typical global TSV dimensions. For this radius, the minimum TSV spacing is 8 μm [1]. A TSV in the middle of an array experiences the highest coupling, as shown in Chap. 4. Thus, a resistive open in the middle TSV is analyzed. The extracted TSV resistance is increased by a parameterizable value Ro to model a defect, analogously to [259, 263]. Spectre circuit simulations are used to analyze the worst-case TSV coupling for the fixed-decoding scheme. Thereby, inverters of strength 1×, stemming from the aggressively scaled 15-nm standard-cell library NanGate15 [177], are used to drive/load the signal TSVs, as small-sized cells are most sensitive to coupling noise. Contact resistances of 0.1 kΩ are included between the drivers and the TSVs to obtain realistic values for the path resistance. Ramp voltage sources with a rise time of 10 ps, in combination with an additional inverter for realistic signal shaping, are used to generate the input stimuli.
Due to the p-doped substrate, the faulty TSV (noise victim) is set to a constant logical 0 for error masking. All other signal TSVs switch from logical 0 (i.e., 0 V) to logical 1 (i.e., Vdd = 0.8 V), which results in the highest coupling noise on a stable victim line [142]. Over multiple simulation runs, the value of the defect resistance, Ro, is increased from 0 Ω to 1 MΩ, since this is the maximum reported increase in the TSV resistance due to void or delamination defects [263]. Reported is the output waveform of the load driver at the far end of the faulty TSV. Since the load driver is an inverter, while the faulty TSV is grounded for error masking, the analyzed signal is constant at Vdd (i.e., 0.8 V) in the ideal/noise-free scenario. The results are illustrated in Fig. 11.8. In the first analyzed scenario, all remaining eight TSVs in the array are signal TSVs, which switch from logical 0 to 1. Here, the large coupling capacitances in the TSV array affect the output signal even if the victim TSV has no open defect (i.e., for Ro = 0 Ω). The maximum induced noise without a manufacturing defect is 0.46 V. Thus, the noise even pushes the signal for a few picoseconds below the threshold voltage (Vdd/2 = 0.4 V), resulting in a logical glitch in the signal. With increasing defect resistance, Ro, the coupling noise increases as well.
3 All TSV parasitics in this evaluation section are extracted for a significant frequency of 6 GHz.
Fig. 11.8 Worst-case coupling noise on the load-driver output of a stable and faulty TSV line (curves: defect free, Ro = 0 Ω; defect, Ro = 1 MΩ; defect, Ro = 1 MΩ with 33% shields)
For an open resistance of 1 MΩ, the noise-peak magnitude is in the range of Vdd, and the maximum duration of a logical glitch is several nanoseconds. This exceeds the typical timing budget for a TSV link. Thus, delay faults can even result in an erroneous transmission for a stable input signal if coupling effects are considered. However, the shielding technique presented in Chap. 9 can be used to overcome this bottleneck. Thereby, stable power or ground (P/G) TSVs are positioned within the TSV array using the proposed performance-optimal assignment technique. Adding stable shields does not impair the overall TSV yield if the shields are at the substrate-bias potential (i.e., all errors are masked anyhow). Thus, for typical p-doped substrates, ground shields must be used and, for n-doped substrates, Vdd shields. Hence, the coupling-noise analysis is repeated after placing 33% stable ground TSVs in the array (i.e., 3C shielding) in a performance-optimal way. As shown in Fig. 11.8, the shield lines reduce the maximum coupling noise at the output in the case of a defect to about zero. Thus, shielding completely overcomes the repair limitations for the analyzed technology. However, logical glitches might still occur for future technologies. In this case, a higher shield density could be used to increase the yield further. For the extreme scenario where eight shields enclose every signal TSV, no coupling noise occurs. Here, the fixed-decoding approach can repair faults as long as the open resistance, Ro, is several times lower than the gate-to-substrate resistance, which exceeds the GΩ range. Integrating shields not only addresses yield issues due to coupling noise but also improves the maximum TSV performance. Consequently, the proposed technique enables low-power, yield-enhanced, and yet high-performance 3D interconnects. However, since the performance improvement of the proposed shielding technique was already evaluated in Chap. 9, the interconnect performance is not further investigated in the present chapter. In summary, no redundancy technique is
capable of fixing all possible faults. However, shielding techniques can effectively increase the number of repairable errors for the proposed fixed-decoding scheme.
11.6.1.2 Overall TSV Yield
Previous works use the following formula for the manufacturing yield of an m-bit-wide link with r redundant lines [109, 184]:

$$
Y_{\text{prev}}(m, r) = \sum_{i=0}^{r} \binom{m+r}{i} (1 - p_{\text{def}})^{m+r-i}\, p_{\text{def}}^{\,i},
\qquad (11.32)
$$

where p_def is the TSV-defect probability. However, this formula is based on the assumption that the redundancy scheme can cope with every defect, as long as the total number of defects in the set does not exceed r. This is not precise, as outlined previously. The actual yield is slightly lower:

$$
Y(m, r, \kappa_{\text{repair}}) = \sum_{i=0}^{r} \binom{m+r}{i} (1 - p_{\text{def}})^{m+r-i}\, (\kappa_{\text{repair}} \cdot p_{\text{def}})^{i},
\qquad (11.33)
$$
where κ_repair is the conditional probability that an occurring defect is treatable. Thus, for the technique based on two redundancy schemes, the overall TSV manufacturing yield is:

$$
Y_t = Y(m_e, 1, \kappa_e)^{\left\lfloor N_e / m_e \right\rfloor} \cdot Y(\mathrm{mod}(N_e, m_e), 1, \kappa_e) \cdot Y(m_d, 1, \kappa_d)^{\left\lfloor N_d / m_d \right\rfloor} \cdot Y(\mathrm{mod}(N_d, m_d), 1, \kappa_d).
\qquad (11.34)
$$

In this equation, m_e and m_d are the group sizes for the fixed-encoding and the fixed-decoding scheme, respectively. N_e and N_d are the numbers of overall required functional data TSVs for the two schemes. κ_e and κ_d indicate the repair capabilities of the schemes. The modulo terms (i.e., the terms with "mod(N_e, m_e)" or "mod(N_d, m_d)") represent the grouping of the remaining TSVs in case of a non-integer ratio between the number of required data TSVs and the group sizes. For an integer ratio between N_e and m_e, as well as N_d and m_d, the formula for the overall yield simplifies as follows:

$$
Y_t = Y(m_e, 1, \kappa_e)^{N_e / m_e} \cdot Y(m_d, 1, \kappa_d)^{N_d / m_d}.
\qquad (11.35)
$$
Previous works (e.g., [109, 184]) already analyzed how the yield is reduced with increasing group size, m, and the overall number of required functional TSVs, N. The following analysis quantifies the effect of varying repair capabilities for the
Fig. 11.9 Effect of the proposed redundancy technique on the overall TSV manufacturing yield, over the TSV defect probability, for different repair capabilities of the fixed encoder (κe ) and decoder (κd ) scheme. The yield is analyzed for: (a) a system with 512 required functional/data TSVs, grouped in sets of eight; and (b) a system with 16 k required functional/data TSVs, grouped in sets of 16
redundancy schemes. In the analysis, the number of required data TSVs, as well as the group sizes, are equal for both schemes (i.e., Ne = Nd = N/2; me = md = m). First, a system with a moderate number of required TSVs (N = 512) and a group size, m, of eight is considered; second, a system with a large number of required TSVs (N = 16,384) and a group size of 16. Repair capabilities κe and κd of 0.91, 0.95, and 0.99 are analyzed. The overall TSV yields over the TSV defect probability are plotted in Fig. 11.9 for all analyzed scenarios. In accordance with previous works, an increase in the TSV count and the group size generally results in a decreased yield. As expected, the overall TSV yield increases with increasing repair capabilities of the two schemes, κe and κd. An interesting observation is that the mean of κe and κd, represented by κ̄repair, seems to be the determining factor for the overall yield. Thus, increasing just the repair capability of one of the two redundancy schemes by x is typically as effective as increasing both by x/2. As outlined previously, shielding can enhance the repair capabilities of the proposed fixed-decoder scheme with relatively low effort. Thus, shielding effectively increases the overall yield for the proposed technique employing two different redundancy schemes. Reported TSV defect rates are not smaller than 1 × 10−5 [84]. A TSV defect rate of 1 × 10−5 results in an unacceptably low yield of 84.89% for the complex
system. However, integrating the proposed low-power redundancy scheme enhances the yield to above 99.13%, even for a pessimistic mean repair capability of 95% (i.e., κ̄repair equal to 0.95, which implies that 5% of the occurring single-line defects are not treatable despite a redundant TSV being left). This yield enhancement of the proposed high-level technique is even higher than the one for an improvement in the TSV defect rate by a factor of 10×, which results in an overall TSV manufacturing yield of 98.37%. This underlines the effectiveness of the proposed technique in terms of yield enhancement. The use of the proposed redundancy technique for a repair capability of 95% and the improved TSV defect rate of 1 × 10−6 results in an overall TSV manufacturing yield as high as 99.91%. Consequently, the proposed technique paves the way for three-sigma TSV production for complex systems (which requires a yield of 99.73%). For the smaller system with 512 required data TSVs, the overall yield without a redundancy scheme is 99.49% for a TSV defect rate of 1 × 10−5. Integration of a redundancy technique with a mean repair capability of 95% results in a yield higher than 99.97%. Thus, the proposed technique enables three-sigma production for the smaller system without the need for an improvement in the TSV defect rate. Furthermore, the yield improvement implies a reduction in the number of systems that have to be discarded after manufacturing due to TSV defects by a factor of 17×. Applying the proposed yield-enhancement technique for an improved TSV defect rate of 1 × 10−6 results in an overall yield of 99.9994% when 99% of the occurring TSV defects are generally treatable. This is close to six-sigma production, which requires a yield of 99.9997%. Hence, the proposed technique drastically accelerates the process of overcoming TSV-related yield issues for complex 3D systems.
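The yield figures quoted above follow directly from (11.33) and (11.35). A short script, assuming Ne = Nd = N/2 and me = md = m with one redundant line per group as in the analysis, reproduces them:

```python
from math import comb

def y_group(m, r, kappa, p_def):
    """Per-group yield of (11.33): up to r repairable defects are tolerated."""
    return sum(comb(m + r, i)
               * (1 - p_def) ** (m + r - i)
               * (kappa * p_def) ** i
               for i in range(r + 1))

def y_total(n, m, kappa, p_def):
    """Overall yield of (11.35); with κe = κd and me = md = m, the two-scheme
    product collapses to Y(m, 1, κ)^(N/m)."""
    assert n % m == 0
    return y_group(m, 1, kappa, p_def) ** (n // m)

# Complex system: N = 16,384 data TSVs, group size 16, p_def = 1e-5
print(f"{(1 - 1e-5) ** 16_384:.2%}")             # no redundancy   -> 84.89%
print(f"{y_total(16_384, 16, 0.95, 1e-5):.2%}")  # proposed scheme -> 99.13%
print(f"{(1 - 1e-6) ** 16_384:.2%}")             # 10x better rate -> 98.37%
print(f"{y_total(16_384, 16, 0.95, 1e-6):.2%}")  # both            -> 99.91%

# Smaller system: N = 512 data TSVs, group size 8
print(f"{(1 - 1e-5) ** 512:.2%}")                # no redundancy   -> 99.49%
print(f"{y_total(512, 8, 0.95, 1e-5):.2%}")      # proposed scheme -> 99.97%
print(f"{y_total(512, 8, 0.99, 1e-6):.4%}")      # near six sigma  -> 99.9994%
```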
11.6.2 Impact on the Power Consumption

The effect of the redundancy schemes on the bit-level characteristics, which determine the 3D-interconnect power consumption, is investigated in this subsection for various data types and a common p-doped substrate. Therefore, normally distributed 16-bit data streams, each containing 1 × 10^5 words, with different word-level correlations, ρ, and standard deviations, σ, are generated. For the various data streams, the information words are either negatively correlated (ρ = −0.95), uncorrelated (ρ = 0), or positively correlated (ρ = 0.95). The logarithm to the base two of the standard deviation, log2(σ), is varied between 1 and 15 for all three correlation quantities. Subsequently, the effects of the proposed and previous redundancy schemes (all adding one redundant line) on the transmitted codewords are included for the case of a defect-free link. Afterward, the total/accumulated switching activity of the bits (α_total = Σ_i α_i) and the total/mean
1-bit probability for all bits (p̄ = (1/n) Σ_i p_i) are determined.4 The total switching activity and the 1-bit probability are investigated, as these metrics quantify by how much the approach is generally capable of reducing the power consumption of TSV and metal-wire structures. In the case study (Sect. 11.7), the effect of the techniques on the true power consumption of an exemplary TSV and metal-wire structure is investigated. Three different encoder variants are considered for the proposed fixed-decoding scheme. In the first one, the invert line is equal to the MSB of the transmitted information word in the low-power configuration (i.e., the minimal encoder implementation for an ASIC die). In the second variant, the invert line is equal to the negated MSB of the transmitted data word, in order to optimize the 1-bit probabilities besides the switching probabilities (i.e., the extended ASIC or FPGA implementation). In the last one, the invert line is used for CBI coding. The results of the analysis are plotted in Fig. 11.10. Both multiplexer-based techniques, Signal-Shift and Signal-Reroute, always increase the interconnect power consumption compared to the case where no redundancy technique is integrated: the added redundant line transmits a data bit even in the defect-free scenario and thus increases the total switching activity. In contrast, all other redundancy schemes enable a drastic improvement in the interconnect power consumption. The maximum decrease in the total switching activity for the proposed scheme with the fixed encoding and any analyzed data stream is 84.2%, while the overall 1-bit probability can be increased by up to 81.8%. On average, over all analyzed data streams, the fixed-encoding scheme improves the total switching activity and the 1-bit probability by 37.1% and 53.4%, respectively.
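The two metrics can be computed directly from a bit-level trace. The following sketch (with an illustrative toy trace, not the book's generated streams) shows the definitions α_total = Σ_i α_i and p̄ = (1/n) Σ_i p_i in code:

```python
def bit_metrics(words, width):
    """Total switching activity (sum of per-line toggle rates) and mean
    1-bit probability over an n-line trace of integer words."""
    n_words = len(words)
    # total toggles per transition, averaged over all transitions
    alpha_total = sum(
        bin(a ^ b).count("1") for a, b in zip(words, words[1:])
    ) / (n_words - 1)
    # fraction of all transmitted bits that are logical 1
    p_mean = sum(bin(w).count("1") for w in words) / (width * n_words)
    return alpha_total, p_mean

# toy trace on a 4-bit link: line 0 toggles every cycle, lines 1-3 are stable 1
trace = [0b1110, 0b1111, 0b1110, 0b1111, 0b1110]
alpha, p = bit_metrics(trace, 4)
print(alpha, p)   # -> 1.0 0.85
```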
The same reduction in the switching activities is obtained for the existing coding-based redundancy scheme, but for typical p-doped substrates it has the drawback that it decreases the 1-bit probability, by 77.6% in the peak and 42.2% in the mean. Thus, as expected, the previous coding-based technique shows the same power savings for metal wires as the proposed/modified version, but lower power savings for TSVs. The fixed-encoding schemes only optimize the power consumption for normally distributed data words. Hence, the more the data-word distribution tends toward a uniform distribution (i.e., high σ), the lower the improvement in the power consumption. Consequently, both fixed-encoding schemes result in an increase in the total switching activity by 6.3% for uniformly distributed data due to the added redundant line. This increase is roughly the bit overhead of the encoding technique (i.e., the reciprocal of the group size, 1/m). For all three variants of the proposed fixed-decoding scheme, the switching reduction reaches values as high as 86.3%. Using the MSB or the negated MSB of the information word as the invert line reduces the switching by 36.2% in the

4 The
accumulated switching activity instead of the mean switching probability is investigated to take the added redundant line/bit for the yield-enhancement techniques into account. This ensures a fair comparison.
Fig. 11.10 Effect of the proposed and previous redundancy schemes on the power-related bit-level statistics for various synthetic data streams
mean. The total 1-bit probability is additionally increased, by 87.7% in the peak and 62.5% in the mean, for the negated MSB as the invert bit. In contrast, using the non-negated MSB for inversion reduces the 1-bit probability by at least 5.81% (50.5% in the mean). Hence, the non-minimal ASIC implementation can result in a significantly lower TSV power consumption if the TSVs show a significant MOS effect. The negated or non-negated MSB as the invert signal, again, is only effective for normally distributed data. Thus, the power-consumption reduction vanishes with increasing standard deviation, σ. However, in both variants of the proposed fixed-decoding scheme, one line of the link is always stable. Thus, the scheme never results in an increase in the interconnect power consumption compared to the scenario without any integrated redundancy scheme, even if all redundant lines are required for yield enhancement and the data is completely random. This is another major advantage of the proposed redundancy scheme over all previous ones. Integrating CBI coding in the proposed fixed-decoder scheme enables a power-consumption reduction also for uniformly distributed data (here, about 11%). The toggle reduction with CBI coding for normally distributed data is about the same as for the two more hardware-efficient approaches. Thus, the two non-CBI alternatives are generally more efficient for normally distributed data due to their lower CODEC complexities. In summary, the proposed technique can significantly reduce the power consumption of 3D interconnects compared to all previous techniques. If the TSV MOS effect is not too significant and the data words tend to be normally distributed, even the simplest proposed configurable ASIC encoder structure already results in a drastic improvement in the interconnect power consumption. With an increasing TSV MOS effect, adding the small extension to optimize the bit probabilities becomes reasonable.
Adding an extra invert-coding is only required for data that tends to be uniformly distributed.
11.6.3 Hardware Complexity

The hardware costs associated with loading the repair signatures (RSs) from the NVM cells into the repair registers (RRs) are discussed in this subsection. Furthermore, the costs associated with the actual redundancy schemes are quantified.
11.6.3.1 NVM Cells and Controller
The lower bound for the overall number of bits of the compressed concatenated repair signatures (RSs) is

$$
NB_{rs} = \frac{N}{m} \cdot \left\lceil \log_2(m+1) \right\rceil.
\qquad (11.36)
$$
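For illustration, evaluating (11.36) for the two example systems used earlier in this chapter (N = 512 with m = 8, and N = 16,384 with m = 16, assumed here purely as worked examples) gives:

```python
from math import ceil, log2

def rs_bits(n, m):
    """Lower bound (11.36) on the compressed repair-signature bits:
    one signature of ceil(log2(m+1)) bits per group of m data TSVs."""
    return (n // m) * ceil(log2(m + 1))

print(rs_bits(512, 8))       # 64 groups  * 4 bit -> 256
print(rs_bits(16_384, 16))   # 1024 groups * 5 bit -> 5120
```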
In the following, the worst case, a system without FPGA dies, is considered. The number of NVM cells for the multiplexer-based schemes is equal to 2·NBrs when the cells are distributed with the transmitter (Tx) and receiver (Rx) components. This also requires the programming of NVM cells in every die of the system. The number of required NVM cells can be minimized to NBrs with the coding-based scheme presented in Ref. [184]. However, an expensive programming circuit is still required in every die. With the proposed technique, the number of required NVM cells and the number of dies that require NVM programming are minimized. Consequently, considering distributed NVM cells would strongly favor the technique presented throughout this chapter. Thus, a global NVM macro, from which the RSs are loaded into the RRs during start-up, is considered in the following for a fair comparison. In this scenario, the number of NVM cells is minimized for all techniques, and at most one NVM-programming circuit has to be integrated. No extra programming circuit is required if one die includes one-time-programmable NVM cells anyway (e.g., for memory repair). An advantage of coding-based techniques over multiplexer-based techniques with a global NVM macro is the reduced implementation cost of the NVM controller, which serially loads the RSs from the NVM cells into the RRs at start-up. This NVM controller, proposed in [109] along with an approach to connect the RRs to a shift register while minimizing the inter-die connections, is briefly reviewed in the following. The repair signatures are stored in ascending order in the NVM macro. The controller then has to shift out the stored RSs twice: first in ascending and subsequently in descending order. This results in a controller architecture of relatively low complexity.
For the proposed and the previous coding-based approach, every RS only has to be loaded once, since it is only required at the encoder (i.e., Tx) or the decoder (i.e., Rx) side. This asymptotically reduces the length of the shift-register chain and the number of inter-die connections by 50%. The RSs are stored in the NVM macro in the order in which the corresponding RRs are connected. This further reduces the controller complexity, as the values only have to be shifted out once. Furthermore, the number of cycles to load the configuration data is reduced by 50%. To quantify the controller complexity for multiplexer-based and coding-based redundancy techniques, generic RTL implementations are created. At start-up, the controllers serially shift the RSs, stored in a global NVM macro with a word width of NBnvm, into the RRs using a serial output, rr_in. Therefore, the signal shift_rr is set, which enables shifting the values of the RR chain. The completion of loading the RSs into the RRs is acknowledged by a flag rr_ready. After completion, the controllers go to an idle state until they receive a request to reload the RSs via a signal load_rb (e.g., after a reset of a block). Except during start-up, an NVM controller is idle. Thus, it is cut from the power supply afterward. Consequently, the power consumption of a controller does not have a noticeable impact. Furthermore, the shift_rr signal, which must be routed to all RRs, is the limiting factor for the speed at which the RSs can be loaded, not the controller. Thus, only the controller area is a major concern. The area of the
Fig. 11.11 Area in gate equivalents of the synthesized controller, which loads the RSs from the global NVM macro into the distributed RRs for the two redundancy approaches, multiplexer based and coding based
controllers, depending on NBrs, is determined by means of gate-level syntheses in the NanGate15 technology using Synopsys tools. Thereby, an NVM word width of 32 bit is considered. In Fig. 11.11, the resulting controller areas in gate equivalents are plotted over log2(NBrs). The results show that both NVM-controller areas increase linearly with log2(NBrs). Using a coding-based redundancy technique instead of a multiplexer-based technique reduces the controller complexity by about 25%, independent of the length of the concatenated RS.
11.6.3.2 Redundancy Scheme
The complexities of the encoder-decoder pairs are assessed in the following. Using the area of typical standard cells is a straightforward approach to obtain formulas for the area of the components, including the RRs. Such formulas are reported in Table 11.1. Note that no differentiation is made between gates of equal complexity (e.g., XOR and XNOR, or logical conjunction (AND) and logical non-disjunction (NOR)). Thus, the hardware complexities of the previous and the extended fixed-encoding scheme are reported in a single row. To analyze heterogeneity, the component complexities for the source (Tx) and the destination (Rx) die are reported individually. A unary or a one-hot encoding of the compressed RS, stored in the RR, is required to generate the configuration signals, c_tx or c_rx, for some redundancy schemes. Such an encoding requires at least m gates of the size of a standard AND gate, which is considered for the area estimation. For the proposed fixed-decoding scheme, four alternative implementations are considered for an ASIC die. In the first one, just the minimal configurable hardware illustrated in Fig. 11.5b or c is implemented. In the second and third one, an extra input is added to the multiplexers, which is either an invert line for CBI coding or the negated MSB for low-power coding of digital signal processor (DSP) signals. Both structures have the same hardware complexity, and in both cases, the
11.6 Evaluation
Table 11.1 Area requirements of the transmitter (Tx) and receiver (Rx) units of the redundancy schemes

Scheme | Area Tx^a | Area Rx^a
Fix. decoder minimal (ASIC) | m(Axor+Amux) + ⌈log2(m+1)⌉ Aff | mAxor
Fix. decoder low-power (ASIC) | m(Axor+Amux) + Amux + ⌈log2(m+2)⌉ Aff | mAxor
Fix. decoder both SAs (ASIC) | m(Axor+Amux) + Axor + ⌈log2(2(m+1))⌉ Aff | mAxor
Fix. decoder (FPGA) | mAxor | mAxor
Fix. encoder (ASIC) | (m−1)Axor | m(2Axor+Amux+Aand) − 2Axor + ⌈log2(m+1)⌉ Aff
Fix. encoder (FPGA) | (m−1)Axor | 2(m−1)Axor
Signal-Reroute [107] (ASIC) | (m−1)Amux + ⌈log2(m+1)⌉ Aff | m(Amux+Aand) + ⌈log2(m+1)⌉ Aff
Signal-Shift [107] (ASIC) | m(Amux+Aand) − Amux + ⌈log2(m+1)⌉ Aff | m(Amux+Aand) + ⌈log2(m+1)⌉ Aff
Signal-Reroute/Shift [107] (FPGA) | 0 | 0

^a Amux, Axor, and Aand are the areas of two-input multiplexer, XOR, and AND cells in the die, respectively. Aff is the area of a flip-flop
hardware is extended to improve the low-power configuration. Thus, the values for both alternatives are reported in a single row of the table, using the identifier "Fix. decoder low-power". In the fourth one, an XOR gate and a configuration bit are added, as shown in Fig. 11.5d, in order to cope with both SA fault types. For an implementation in an FPGA die, the minimal circuit (shown in Fig. 11.5e) can operate in all low-power configurations and repair all fault types. The configuration hardware (i.e., multiplexers and RRs) vanishes in FPGA dies for all proposed and previous redundancy schemes. Redundancy schemes realized in ASIC dies by means of multiplexers have no hardware complexity at all in FPGA dies. Here, the place-and-route tool for the FPGA soft-components simply assigns the signals to the functional lines and skips defective ones. All paths starting at the RRs are not considered for the delay analysis, as the RR signals are constant during normal operation (after start-up). Thus, the formulas presented in Table 11.2 are used to estimate the component delays. Power estimation is not as straightforward, for several reasons. First, the configuration signals are only updated once during start-up. Afterward, they are stable. Furthermore, this allows effective clock gating of all RRs with the shift signal of the NVM controller, shift_rr, which is only active during start-up. This asymptotically reduces the dynamic power consumption associated with the RRs and the configuration signals to zero. Second, the power consumption depends on the switching activities, which in turn depend on the signal characteristics. Third, the power consumption is determined mainly by the interconnects in between, and not by the encoder and the decoder, as shown in the case study of this work. Thus,
11 Low-Power Technique for Yield-Enhanced 3D Interconnects
Table 11.2 Delay of the transmitter (Tx) and receiver (Rx) units

Scheme | Delay Tx^a | Delay Rx^a
Fix. decoder minimal (ASIC) | Txor + ⌈log2(m+1)⌉ Tmux | Txor
Fix. decoder low-power (ASIC) | Txor + ⌈log2(m+2)⌉ Tmux | Txor
Fix. decoder both SAs (ASIC) | 2Txor + ⌈log2(m+1)⌉ Tmux | Txor
Fix. decoder (FPGA) | Txor | Txor
Fix. encoder (ASIC) | Txor | (m−1)Txor + Tmux
Fix. encoder (FPGA) | Txor | (m−1)Txor
Signal-Reroute [107] (ASIC) | ⌈log2(m)⌉ Tmux | Tmux
Signal-Shift [107] (ASIC) | Tmux | Tmux
Signal-Reroute/Shift [107] (FPGA) | 0 | 0

^a Tmux, Txor: delay of two-input multiplexer and XOR cells, respectively
the power consumption of the components is not considered in this subsection. An in-depth power analysis is included in the case study of this chapter. The standard-cell libraries NanGate15 [177] (15 nm) and NanGate45 [178] (45 nm) are used to compare the complexity of the redundancy techniques for a bidirectional link with varying group size, m. Thereby, the integration of the extended low-power configuration for the fixed-decoder scheme is considered for the proposed technique, which employs two redundancy schemes. In order to obtain pessimistic values, only two-input standard cells with a drive strength of 1× are considered. Reported are the products of the total cell area and the maximum component delay, obtained by means of the formulas reported in Tables 11.1 and 11.2. Four different die interfaces are analyzed. The first one, representing a heterogeneous scenario, is made up of an FPGA die using a 15-nm technology and a mixed-signal die made up of 45-nm standard cells. In the second scenario, the 15-nm technology is used for an ASIC instead of an FPGA die. The third scenario is the interface between a 15-nm ASIC die and a 15-nm FPGA die. As the last scenario, a homogeneous interface between two 15-nm ASIC dies is considered. Figure 11.12 illustrates the results of the analysis. Compared to the best previous technique, Signal-Shift, the coding-based approach from Ref. [184] results in an increase in the complexity by a factor of 3× for a group size of four, and by a factor of 34× for a group size of 32, for the first, strongly heterogeneous interface. Thus, the previously outlined reduction in the complexity of the NVM controller, as well as the reduction in the interconnect power consumption, comes at a high cost. This is due to the delay of the XOR chains of the decoder located in the slower mixed-signal technology. The length of the XOR chains, and thereby the delay, increases linearly with the group size, m.
Since the hardware complexity also increases with the group size, the area-delay product increases quadratically. However, the proposed technique partially overcomes this bottleneck. First, the critical XOR chain is only present in the faster layer, while, in the slower layer, the delay is minimal. Additionally, integrating just fixed circuits of minimal complexity
[Four panels: 15-nm FPGA die and 45-nm mixed-signal die; 15-nm ASIC die and 45-nm mixed-signal die; 15-nm FPGA die and 15-nm ASIC die; 15-nm ASIC die and 15-nm ASIC die. Each panel plots the area-delay product [μm² ns] over the group size (8 to 32) for the proposed technique, the previous coding-based technique, Signal-Reroute, and Signal-Shift.]

Fig. 11.12 Area-delay products of the redundancy techniques over the group size for a bidirectional 3D link and different die interfaces
in the mixed-signal die reduces the overall area overhead drastically. For a group size of four, the proposed technique even reduces the circuit complexity compared to the best previous technique, Signal-Shift, by a factor of 4×. Albeit more slowly, the maximum delay of the proposed technique still increases with the group size. Consequently, the proposed technique scales worse than the Signal-Shift method, whose delay is constant. Thus, from a group size of 20 or higher, the proposed technique results in a higher circuit complexity. However, commercially
available systems typically use significantly smaller group sizes in order to achieve the highest yield [128]. A similar trend can be observed for the second interface (15-nm digital and 45-nm mixed-signal), with the exception that here the complexity of the proposed redundancy technique exceeds the complexity of the Signal-Shift scheme for group sizes of 16 or higher. If the same technology is used for both dies of the interface, the die location of the XOR chains does not affect the overall circuit delay. Thus, the only advantage of the proposed technique in the third analyzed scenario is that the configuration hardware vanishes. Hence, the area is reduced at the cost of a delay increase, compared to the best previous technique. However, the configuration-hardware costs increase only logarithmically. Thus, in the third analyzed scenario, the Signal-Shift approach results in a significantly lower hardware complexity for bigger group sizes than the proposed technique with its linearly increasing delay. In the last scenario, no heterogeneity remains, resulting in a generally higher complexity for the proposed technique. The same applies to the interface between two FPGA dies. Hence, this scenario is not investigated here. Nevertheless, even for such homogeneous interfaces, the proposed technique outperforms the existing coding-based approach in terms of CODEC complexity. Furthermore, an integration of a low-power Gray encoding together with the Signal-Shift method also requires a long XOR chain. Thus, if a low-power coding is required anyway, the proposed technique is also superior for homogeneous interfaces.
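The scaling behavior discussed in this section can be reproduced qualitatively from the formulas in Tables 11.1 and 11.2. The sketch below evaluates the area-delay product of the proposed combination (configurable fixed-decoder Tx and XOR-chain fixed-encoder Rx in the faster die; only the simple mAxor decoder and (m−1)Axor encoder in the slower die) against a bidirectional Signal-Shift link with its Rx placed in the slower die. The unit-cell areas/delays and the slow-die scaling factor are placeholders, not the NanGate library values:

```python
import math

clog2 = lambda x: math.ceil(math.log2(x))

# Placeholder unit costs (arbitrary units), NOT NanGate15/45 values.
A_XOR, A_MUX, A_AND, A_FF = 1.0, 0.9, 0.6, 2.0
T_XOR, T_MUX = 1.0, 0.8
SLOW = 3.0  # assumed area/delay penalty of the less advanced die


def proposed(m):
    """Area-delay product from the Table 11.1/11.2 formulas (minimal variants)."""
    area = (m * (A_XOR + A_MUX) + clog2(m + 1) * A_FF           # fix.-dec. Tx (fast die)
            + m * (2 * A_XOR + A_MUX + A_AND) - 2 * A_XOR
            + clog2(m + 1) * A_FF)                              # fix.-enc. Rx (fast die)
    area += (m * A_XOR + (m - 1) * A_XOR) * SLOW                # fixed parts (slow die)
    delay = max(T_XOR + clog2(m + 1) * T_MUX,                   # fix.-dec. Tx
                (m - 1) * T_XOR + T_MUX,                        # fix.-enc. Rx XOR chain
                T_XOR * SLOW)                                   # fixed circuits, slow die
    return area * delay


def signal_shift(m):
    """Bidirectional Signal-Shift: one Tx and one Rx per die; constant mux delay."""
    a_tx = m * (A_MUX + A_AND) - A_MUX + clog2(m + 1) * A_FF
    a_rx = m * (A_MUX + A_AND) + clog2(m + 1) * A_FF
    area = (a_tx + a_rx) * (1 + SLOW)                           # one Tx/Rx pair per die
    return area * (T_MUX * SLOW)                                # slow-die mux dominates


for m in (4, 8, 16, 32):
    print(m, round(proposed(m), 1), round(signal_shift(m), 1))
```

Even with these arbitrary constants, the model shows the two effects from the text: the proposed scheme is cheaper for small groups, but its linearly growing XOR-chain delay makes its area-delay product grow roughly quadratically, so Signal-Shift wins for large group sizes.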
11.7 Case Study

A case study, considering the commercially available heterogeneous SoC from Ref. [76], in which a TSV redundancy technique has to be integrated, is presented in this section. Even though the SoC integration is referred to as 3D, the actual integration method in Ref. [76] is 2.5-dimensional (2.5D), as TSVs connect the dies through an interposer. Interposer-based 2.5D integration is an intermediate solution, mainly driven by cost concerns. Since technology has advanced since the publication date of Ref. [76], true heterogeneous 3D integration is considered throughout this book. The reference system consists of two 28-nm FPGA dies and a 65-nm mixed-signal die containing sixteen 13-bit analog-to-digital converters (ADCs) and sixteen 16-bit digital-to-analog converters (DACs). Again, technology improvement is taken into account: The 15-nm technology NanGate15 and the 45-nm technology NanGate45 are considered for the FPGA dies and the mixed-signal die, respectively. The proposed technique is not suitable for the interface between the two FPGA dies, where the standard Signal-Shift approach requires no hardware at all. Thus, only the die interface between the data converters and the FPGA is extended by the proposed technique. Compared is the integration of the proposed redundancy technique, without CBI coding, with the best previous coding-based and multiplexer-based approaches. The
Table 11.3 Logic area, Alogic, maximum delay, T̂total, and required NVM space for different redundancy techniques and the heterogeneous SoC from [76]

Metric | Proposed | Previous coding [184] | Signal-Shift [107]
Alogic | 1050 μm² | 3715 μm² | 3394 μm²
T̂total | 198 ps | 559 ps | 166 ps
NVM space | 0 b | 80 b | 144 b
group size is chosen to match the pattern width (i.e., 13 bit for the down-stream from the ADCs to the FPGAs; 16 bit for the up-stream from the FPGAs to the DACs).5 Thus, including the redundant lines, the total TSV count is 16 · (14 + 17) = 496. In this analysis, the TSV radius, pitch, and depth are 2 μm, 8 μm, and 80 μm, respectively.6 Crosstalk is minimized by the 2C shielding technique from Chap. 9 to guarantee maximized repair capabilities and an improved TSV performance. For a fair comparison, an integration of 2C shielding is also considered for the previous techniques. With 2C shielding, one 4 × 7 and one 5 × 7 TSV array are required per ADC and DAC, respectively. Furthermore, typical global metal wires with a minimum width/spacing of 0.45 μm and a length of 100 μm are added between the components and the TSVs to obtain full 3D-interconnect paths. The Predictive Technology Model (PTM) interconnect tool [205] is used to obtain the metal-wire parasitics. Q3D Extractor and the TSV-array model are again used to obtain the TSV parasitics for a significant frequency of 6 GHz. The logic areas of the redundancy techniques are obtained by RTL-to-gate-level syntheses. Furthermore, the synthesis results in combination with Spectre simulations, employing the extracted metal-wire and TSV parasitics, are used to obtain the accumulated maximum propagation delay, T̂total, of the encoder–3D-interconnect–decoder arrangements. In Table 11.3, the results are presented. The proposed technique does not require any RRs for the analyzed system. This fact, combined with the minimal circuit complexity in the mixed-signal die, results in a decrease in the area requirements by more than a factor of 3×, compared to the best previous multiplexer-based or coding-based techniques. Compared to the best coding-based technique, the total delay is reduced by a factor of 2.8×.
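The factors quoted in the text follow directly from the values in Table 11.3:

```python
# Values from Table 11.3 (logic area in um^2, maximum delay in ps)
area = {"proposed": 1050, "coding": 3715, "shift": 3394}
delay = {"proposed": 198, "coding": 559, "shift": 166}

area_gain_vs_coding = area["coding"] / area["proposed"]     # > 3x smaller
area_gain_vs_shift = area["shift"] / area["proposed"]       # > 3x smaller
delay_gain_vs_coding = delay["coding"] / delay["proposed"]  # ~2.8x faster
delay_penalty_ps = delay["proposed"] - delay["shift"]       # 32 ps vs Signal-Shift
delay_penalty_rel = delay_penalty_ps / delay["shift"]       # ~19%

print(round(area_gain_vs_coding, 2), round(delay_gain_vs_coding, 2),
      delay_penalty_ps, round(100 * delay_penalty_rel))
```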
The proposed technique induces a small delay increase of 32 ps (i.e., 19%), compared to the best multiplexer-based technique, Signal-Shift. A delay increase only occurs for the down-stream path (i.e., ADC to FPGA), which employs the fixed-encoder scheme with the long XOR chains. However, the ADC sample duration in Ref. [76] is 8 ns, while it is only 625 ps for a DAC. Consequently,
5 This maximum group size results in worst-case values for the proposed technique due to the inferior scaling of the fixed-encoder scheme, compared to previous multiplexer-based techniques, outlined in Sect. 11.6.3. 6 The higher TSV depth, compared to previous analyses, is chosen for proper noise insulation between the mixed-signal and the FPGA dies.
this delay penalty is not critical. The previous coding-based technique requires a 5-bit-wide RR for each of the 16 decoders in the mixed-signal die. The Signal-Shift technique additionally requires a 4-bit-wide RR for each of the 16 encoders in the mixed-signal die. Thus, for the reference techniques, RSs with a total length of 80 bit and 144 bit have to be loaded into the RRs during start-up. In contrast, not a single RS has to be loaded for the proposed technique. Although the NVM controllers are responsible for less than 5% of the area requirements of the reference techniques, the previous analysis reveals a major advantage of the proposed technique: NVM is no longer required. The publicly available RF data set Deepsig [62] is used to quantify the data-dependent power consumption of the logic and the interconnects. The modulation schemes 64-point quadrature amplitude modulation (QAM64), 16-point quadrature amplitude modulation (QAM16), and binary phase-shift keying (BPSK) are analyzed to consider RF signals with low and high modulation orders. The mean power consumption for the transmission of 1028 different example signals, each 1000 bit-patterns long, with a signal-to-noise ratio of 18 dB, is investigated for each modulation scheme. Therefore, the switching and bit probabilities of the raw information bits and the transmitted (encoded) bits are determined by exact bit-level simulations. This information is forwarded to the synthesis tool to obtain values for the power consumption of the logic circuits. Thereby, the clock-gate signal of the RRs, shift_rr, is defined as constant zero. The pattern-dependent interconnect power consumption is calculated by means of the precise high-level formulas presented in Part II, employing the exact bit-level probabilities. Both proposed redundancy schemes are based on low-power coding techniques that primarily aim for a reduction in the switching activities rather than the coupling switching activities. Hence, as shown in Sect.
10.3, their relative power savings decrease the more the signal edges are temporally aligned. Thus, perfectly temporally aligned signal edges are considered in this case study to report pessimistic values for the proposed technique. The results, including the leakage power consumption of the cells, are presented in Table 11.4. Compared to the best multiplexer-based scheme, the proposed technique results in a reduction in the interconnect and logic power consumption
Table 11.4 Power consumption for different RF signals

Metric | Proposed | Previous coding [184] | Signal-Shift [107]
Pdyn,link – QAM64 | 8.69 mW | 9.49 mW | 13.44 mW
Pdyn,link – QAM16 | 8.83 mW | 9.52 mW | 12.87 mW
Pdyn,link – BPSK | 8.73 mW | 9.48 mW | 12.80 mW
Pdyn,logic – QAM64 | 0.98 mW | 4.11 mW | 1.44 mW
Pdyn,logic – QAM16 | 0.98 mW | 4.10 mW | 1.45 mW
Pdyn,logic – BPSK | 1.00 mW | 4.16 mW | 1.50 mW
Pleak,logic | 35.85 μW | 85.42 μW | 70.24 μW
mean(Ptotal) | 9.77 mW | 13.70 mW | 14.57 mW
by over 30% for all analyzed modulation schemes. Leakage currents are even reduced by about 50%. The previous coding-based technique results in about 5% lower reductions in the 3D-interconnect power consumption. However, here, the complex configurable decoders in the mixed-signal die result in an increased power consumption of the logic cells compared to the Signal-Shift technique. Hence, the overall power-consumption reduction for the previous coding-based technique is, on average, only 6% and thus more than five times lower than for the proposed technique. In summary, the proposed technique significantly outperforms previous TSV redundancy schemes in nearly all metrics for the existing heterogeneous SoC. Hence, the system is an ideal candidate for the integration of the proposed technique. The reason is that the proposed technique can heavily exploit the technological heterogeneity between the dies and the normal distribution of the data words transmitted over the 3D links of the RF SoC.
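The relative savings quoted above can be recomputed directly from the Table 11.4 values:

```python
# Values from Table 11.4 (mW; leakage in uW)
link_qam64 = {"proposed": 8.69, "coding": 9.49, "shift": 13.44}
leak_uw = {"proposed": 35.85, "coding": 85.42, "shift": 70.24}
total = {"proposed": 9.77, "coding": 13.70, "shift": 14.57}

link_saving = 1 - link_qam64["proposed"] / link_qam64["shift"]  # ~35% (over 30%)
leak_saving = 1 - leak_uw["proposed"] / leak_uw["shift"]        # ~49% (about 50%)
total_saving_proposed = 1 - total["proposed"] / total["shift"]  # ~33%
total_saving_coding = 1 - total["coding"] / total["shift"]      # ~6%

print(round(100 * link_saving), round(100 * leak_saving),
      round(100 * total_saving_proposed), round(100 * total_saving_coding))
```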
11.8 Conclusion

In this chapter, a TSV redundancy technique was proposed, designed to improve the manufacturing yield and the power consumption of 3D interconnects. The technique is based on two optimal coding-based redundancy schemes, used in combination, which minimize the complexity of redundancy techniques in heterogeneous 3D-integrated systems. Furthermore, the coding techniques are designed in such a way that the 3D-interconnect power consumption can be reduced effectively. Moreover, using the proposed technique in combination with the shielding technique from Chap. 9 not only results in an additional improvement in the 3D-interconnect performance, it also further enhances the TSV yield. Despite its low complexity and its capability to significantly improve the power consumption, the proposed technique is shown to improve the overall TSV yield by a factor of 17×. Furthermore, an extensive set of analyses, as well as a case study for a commercial SoC, show that the proposed technique significantly outperforms previous approaches in nearly all metrics.
Part V
NoC Optimization for Heterogeneous 3D Integration
Chapter 12
Heterogeneous Buffering for 3D NoCs
In the previous Part IV, different optimizations for power consumption, yield, and reliability of interconnects in three-dimensional (3D) integrated circuits (ICs) have been introduced. This part introduces orthogonal optimizations on the system level, focusing on networks on chips (NoCs) for communication. As the optimizations are complementary, both can be combined to multiply the gains, which we will do in our concluding Chap. 16. Here, we tackle architectural challenges for 3D NoCs in heterogeneous 3D ICs. This chapter optimizes the buffers of routers by accounting for heterogeneity. Thereafter, Chap. 13 will optimize routing, Chap. 14 will optimize virtual channels (VCs), and Chap. 15 will further broaden the system-level perspective and optimize the floorplanning and application mapping for the whole system on a chip (SoC). As briefly stated, this chapter optimizes router buffers in heterogeneous 3D interconnects. They are a worthy optimization target because they are a major source of area and power consumption in 3D NoCs. The buffers make up approx. 79% of the router area and account for 85% of the energy consumption of a standard 3D router (excluding links) with 4 VCs and 8-flit-deep buffers, synthesized for a commercial 65-nm digital technology. Many works reduce the router area by tackling the router buffer area, for instance, by buffer sharing [145, 212] or by (buffer-less) deflection routing schemes [41]. Buffer area and power consumption are a topic of even higher importance in heterogeneous 3D ICs. The same buffer space will require more area and energy if implemented in a mixed-signal node or a less advanced digital technology compared to modern digital nodes. Therefore, reducing the buffer space in layers implemented in a mixed-signal or less aggressively scaled technology is advantageous. Area reductions using buffer optimization are not novel, as the numerous existing approaches show.
However, our optimization differs by using heterogeneous architectures (for both buffer depths and router microarchitectures), which reduce area and power effectively only in heterogeneous 3D SoCs. To the best of our knowledge,
similar ideas have not been investigated in the literature so far. The proposed buffer redistribution scheme can be applied to any input-buffered router architecture. The remainder of this chapter is structured as follows. In Sect. 12.1, we review router and network architectures. Based on these, we propose heterogeneous buffer distributions in Sect. 12.2 and heterogeneous buffer depths in Sect. 12.3. In Sect. 12.4, we evaluate our optimizations. Finally, the chapter is concluded.
12.1 Buffer Distributions and Depths

Buffers differ in their area and power requirements between layers in a heterogeneous 3D SoC, as already discussed. Therefore, the number of implemented buffers must be reduced in layers in which memory is expensive. This reduction has the largest positive effect on the system's costs. In principle, placing as many buffers as possible in a layer that yields cheaper memory costs in any metric is advantageous. There are two options to realize this principle. First, for router microarchitectures, buffers can be redistributed between routers and layers. This redistribution affects the buffer distributions, with improvements discussed in Sect. 12.4.1. Second, for network architectures, buffer depths can vary between layers. This results in heterogeneous buffer depths, with improvements discussed in Sect. 12.4.2. We also assess a combined approach for additional benefits. We use a minimal example for the 3D technology and NoC to ease evaluation and avoid unintended side effects. We use a 3D SoC with 2 heterogeneous layers (Fig. 12.1). The influence of most NoC architectures, including buffer distributions and buffer depths, can be evaluated using only two heterogeneous layers, as this encapsulates the important heterogeneous interface. Without loss of generality, we put the more advanced (more modern/scaled) technology node in the bottom layer and the less advanced technology in the top layer. In the example, the NoC topology is two stacked 4 × 4 grids. The results can be generalized to any combination of more and less expensive technology, such as digital and mixed-signal nodes. This chapter is only interested in the influence of the buffers, hence the following simplifications. Our approach does not consider different topologies or clock frequencies per layer (which will be done in subsequent chapters). We model a synchronous SoC, in which all components are clocked at the lowest speed.
This assumption is the most pessimistic scenario. The next Chap. 13 evaluates the impact of varying clock frequencies. As a baseline, we use a homogeneous input-buffered router with dimension-order routing, 8-flit-deep buffers, and 4 VCs. Routers are connected to their neighbors, as shown in Fig. 12.2. A default router pipeline representative of conventional routers is used [58]. There are four stages (Fig. 12.3a). First, routing
Fig. 12.1 Exemplary 3D NoC spanning two heterogeneous layers (a mixed-signal layer on top of a digital layer, connected by TSV arrays)

Fig. 12.2 Schematic of a 3D router with links and crossbar (ports: north, east, south, west, up, and down)

Fig. 12.3 Time behavior of input buffers of vertical links (RC–routing calculation, VC–VC calculation, SA–switch allocation, F–fetch, ST–switch traversal, and LT–link traversal). (a) Baseline router pipeline. (b) "Aggressive" pipeline. (c) "Delay-oriented" pipeline. [Each panel shows the head flit, first body flit, and other flits over clock cycles: 6 cycles in (a), 9 in (b), and 7 in (c)]
is calculated (RC). Second, a virtual channel is allocated (VC). A route and a VC are only calculated for head flits. Third, the switch is arbitrated among flits (SA). During the arbitration, VCs with a lower number have a higher priority. Fourth, the arbitrated flits traverse the switch (ST) and the link (LT).
12.2 Routers with Optimized Buffer Distribution

The microarchitecture of the routers is optimized for area and power by distributing buffers among pairs of adjacent routers in different layers. In our exemplary setting, buffers are more expensive in the upper layer. To reduce costs, we redistribute the buffers of vertical upward links. We move the input buffers from the upper router to the lower router, where they act as output buffers. We change the location of all VC buffers, but the buffer status registers for the state, route, and output VC remain in the upper router. The other buffers in the routers are untouched. This yields a novel router microarchitecture, as shown in Fig. 12.4b. In the upper layer, input buffers are located on all links, except the incoming vertical links. In the lower layer, input buffers are located on all links, and output buffers are located on the vertical upward links. This optimization reduces the area and power consumption of the routers in the upper layer. However, the increased delay of buffer accesses impedes performance. Hence, we name this architecture "aggressive" (in terms of area savings). To improve its performance, we additionally propose a "delay-oriented" microarchitectural optimization. We split the buffers of the vertical links into two parts. A flit can be stored in an intermediate buffer in the upper router, located between the output buffers in the lower layer and the crossbar in the upper router. The remaining buffers are located as output buffers in the lower router. This is shown in Fig. 12.4c.
Fig. 12.4 Architecture of router pairs for the proposed microarchitectural optimization. (a) Baseline router. (b) "Aggressive" router. (c) "Delay-oriented" router. [Each pair spans the upper (mixed-signal) and lower (digital) layer, connected by TSVs; shown are the input buffers with VCs and, for (b) and (c), the output buffer in the lower layer plus, for (c), the additional buffer in the upper router]
12.2.1 Router Pipelines

The standard pipeline of the routers in the digital layer is shown in Fig. 12.3a. The proposed modifications to the router buffers do not influence this router pipeline. Only flits transmitted from the lower to the upper layer are affected because buffers from the upper layer are moved downwards as an output buffer. The critical path from the input buffers via the crossbar to the output buffers in the lower layer is reduced in length. However, architects cannot exploit this for improved performance because clocking only these links faster is unrealistic. Routers with intrinsically asynchronous clocking would be complex and costly, and the routers in the slower-clocked part cannot feed the faster-clocked part with data. The router pipeline of the "aggressive" architecture is shown in Fig. 12.3b. Flits traveling the upward connections have a longer critical path due to the traversal of the through-silicon vias (TSVs). Data is fetched from the output buffers in the adjacent layer during the fetch cycle ("F"). Therefore, flits from this link leave the router only every other cycle (under zero load). All other input ports and their pipelines are standard, as shown in Fig. 12.3a. The router pipeline of the "delay-oriented" router is shown in Fig. 12.3c. Only head flits require the additional fetch cycle in this architecture, as the other flits can use pipelining. Follow-up flits are sent in parallel to the upper router's intermediate buffer while previous flits traverse the router. Hence, flits from the vertical links leave the router every clock cycle after an initial delay (under zero load).
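The zero-load behavior of the three pipelines can be summarized in a small timing model. The stage counts are read off Fig. 12.3 (four stages in the baseline, one extra fetch cycle for the "aggressive" variant on every flit, and only for the head flit in the "delay-oriented" variant); this is an illustrative model, not the simulated RTL:

```python
def departure_cycles(n_flits, variant):
    """Cycle in which each flit of a packet completes link traversal (LT)
    on a vertical upward link, under zero load."""
    if variant == "baseline":
        # RC, VC, SA, ST/LT for the head flit; body flits follow every cycle
        return [4 + i for i in range(n_flits)]
    if variant == "aggressive":
        # Extra fetch cycle; flits leave only every other cycle
        return [5 + 2 * i for i in range(n_flits)]
    if variant == "delay-oriented":
        # Only the head flit pays the fetch cycle; body flits are pipelined
        return [5 + i for i in range(n_flits)]
    raise ValueError(variant)


# Matches the cycle spans of Fig. 12.3 for a 3-flit packet: 6, 9, and 7 cycles
print([departure_cycles(3, v)[-1]
       for v in ("baseline", "aggressive", "delay-oriented")])
```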
12.3 Routers with Optimized Buffer Depths

As a different approach to optimization, we reduce the buffer depths in the more expensive layer. This approach is depicted in Fig. 12.5 for a conventional router pair. These heterogeneous buffer-depth reductions offer comprehensive optimization potential. While minimizing the area footprint via buffer-depth changes yields a linear relation, the performance does not decline proportionally, since the relationship between buffer depth and performance is nonlinear. It declines slowly until network saturation is reached and then collapses. The theoretical relation between buffer depths, network costs, power consumption, and performance (measured in packet latency) is well-known for two-dimensional (2D) systems (e.g., [186]). For NoCs in heterogeneous 3D SoCs, this relation is exemplified in Fig. 12.6, with heterogeneous buffer depths of 8 flits in the lower layer and 2 to 16 flits in the upper layer, using uniform traffic at 3.2 GB/s traffic injection. We synthesized a gate-level router model for 130-nm and 65-nm commercial digital technologies to generate the results. Area and power results are normalized to their maximum value. The plot shows a nonlinear relationship between buffer depth and costs. Large buffer depths do not offer performance advantages. The results imply that a small buffer depth will offer an acceptable compromise with low costs and adequate performance if the buffer depth is still located on the left-hand side of the plateau. Based on these findings, we optimize buffer depths both for traditional router architectures with standard buffer organization (see Fig. 12.5) and for the proposed microarchitectural buffer reorganization (see Fig. 12.7).

Fig. 12.5 Conventional pair of routers with buffer-depth optimization [124] [Upper (mixed-signal) layer with a reduced input buffer, lower (digital) layer with input buffers with VCs, connected by TSVs]

Fig. 12.6 Relation between buffer depths and PPA. Area and power consumption obtained from synthesis. Performance simulated with Ratatoskr (buffer depths: 8 flits in the digital layer, 2 to 16 flits in the mixed-signal layer). Data normalized to 16-flit buffers. [Plot of normalized power-performance-area (PPA) over buffer depths 2 to 16]
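The qualitative shape of Fig. 12.6 (linearly growing area and power versus a performance curve that saturates) can be illustrated with a toy model; the saturation depth below is a hypothetical knee, not a synthesized or simulated value:

```python
def normalized_ppa(depth, max_depth=16, knee=6):
    """Toy PPA model: area and power grow linearly with the buffer depth,
    while performance saturates once the depth covers the credit
    round-trip of the link (knee is a hypothetical value)."""
    area = depth / max_depth
    power = depth / max_depth
    performance = min(depth, knee) / knee
    return area, power, performance


# Beyond the knee, deeper buffers only add cost
for d in (2, 4, 6, 8, 16):
    print(d, normalized_ppa(d))
```

Choosing the depth just left of the knee keeps performance while minimizing area and power, which is the compromise described in the text.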
12.4 Evaluation

We chose an incremental approach, which broadens the design space step by step. This enables us to understand all implications of the architectural optimizations. We use real-world benchmarks with a focus on video and audio processing.
Fig. 12.7 "Aggressive" pair of routers with buffer-depth optimization [Upper (mixed-signal) layer with reduced buffer depth, lower (digital) layer with input buffers with VCs and the relocated output buffer, connected by TSVs]
First, we evaluate optimizations of the buffer distribution by benchmarking our router architectures with buffer reorganization and 8-flit-deep buffers in all layers in Sect. 12.4.1. We chose this buffer depth as a baseline since larger buffers only increase the area and power overhead without performance advantages. Second, in Sect. 12.4.2, we evaluate optimizations with different buffer depths from 4 to 8 flits to find the tipping point at which a buffer-size increase adds additional costs without a positive performance influence. Third, the findings are consolidated by combining both optimizations. The "aggressive" router microarchitecture is evaluated with buffer depths of 8 flits in the lower layer and 4 to 8 flits in the mixed-signal layer. Since the other architectures show similar results, we do not present them in detail. This approach delivers the optimal buffer depth per layer and is applied to the baseline (homogeneous) and the "delay-oriented" router architecture in Sect. 12.4.3. As a case study, a standard router sets the baseline, synthesized from register-transfer level (RTL) for commercial digital 130-nm and 65-nm technologies, with 8-flit-deep VCs and a flit width of 32 bit. The clock frequency of the routers depends mainly on the technology: for the 65-nm technology, the routers clock at approximately 1 GHz, and for 130-nm at 820 MHz. We use this setup to compare power and area savings. We refer to routers with the same x- and y-coordinates in adjacent layers as a router pair.
12.4.1 Routers with Optimized Buffer Distribution

12.4.1.1 Power Savings
The total power consumption of the routers is given in Table 12.1 for a pair of routers at 50% toggle activity. The savings are calculated relative to the baseline (conventional architecture, homogeneous buffers). The power consumption in the 65-nm layer increases by relocating buffers to the digital layer.
12 Heterogeneous Buffering for 3D NoCs
Table 12.1 Power consumption of a router pair with symmetric buffer depths

Router architecture | 65 nm | 130 nm | Router pair | Savings
Symmetric | 20.7 mW | 58.3 mW | 79 mW | –
Aggressive | 23.6 mW | 49.7 mW | 73.3 mW | 7.2%
Delay-oriented | 22.9 mW | 51.9 mW | 74.8 mW | 5.4%
Since single buffers consume less power in this layer, the power consumption declines by 7.2% for the "aggressive" architecture and by 5.4% for the "delay-oriented" design.
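The savings follow directly from the router-pair totals; a minimal cross-check in Python (the power values are taken from Table 12.1):

```python
# Cross-check of Table 12.1: a router pair combines one 65-nm and one
# 130-nm router; savings are relative to the symmetric baseline pair.
pairs_mw = {
    "symmetric":      (20.7, 58.3),  # (65-nm power, 130-nm power) in mW
    "aggressive":     (23.6, 49.7),
    "delay-oriented": (22.9, 51.9),
}
baseline_mw = sum(pairs_mw["symmetric"])  # 79 mW
for name, (p65, p130) in pairs_mw.items():
    total = p65 + p130
    saving = 100 * (1 - total / baseline_mw)
    print(f"{name}: {total:.1f} mW pair power, {saving:.1f}% saving")
```

The computed savings reproduce the table's 7.2% and roughly 5.4%; small deviations stem from the rounded per-router values.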
12.4.1.2 Area Savings
The buffers occupy 84% of the router area in the 130-nm technology and 79% in the 65-nm technology. Different silicon nodes yield varying memory-cell sizes. We measure an actual area ratio of 3.7 for similar flip-flops in the commercial 65-nm and 130-nm technologies. By exploiting this area difference, the "aggressive" and "delay-oriented" routers are smaller than the baseline by 9.6 and 8.3%, respectively (Table 12.2).
12.4.1.3 Performance Implications
We use Ratatoskr for the performance evaluation. The different pipeline depths of the routers are modeled using the parameter Δ (see Sect. 5.3). It is set to 4 for the downward input ports of routers in the upper layer and 3 for all remaining ports. We benchmark with synthetic and real-world application traffic patterns [214, 239]. Manual mapping of tasks to processing elements (PEs) prioritizes short communication distances to reduce the network load in the upper layer. The results of the simulations are shown in Table 12.2. We use a single instance of inputs per application and measure the application delay in clock cycles. This approach makes the performance measurement independent of the target clock frequency. It is expected that the proposed router architectures have reduced performance due to the additional fetch cycles in the router pipelines. Furthermore, the experimental results show that the "aggressive" router has lower performance than the "delay-oriented" one. The "aggressive" and "delay-oriented" routers have 14 and 2.1% worse average performance than the baseline, respectively. We approximate the worst case via hot-spot traffic, in which all PEs send data to a single destination located in the more advanced layer. This yields high performance losses of 33 and 4.9% for the two routers. If the hotspots were located in the mixed-signal layer, the performance would be even worse. This effect is demonstrated for real-world workloads by the audio decoder and encoder benchmarks, with their high utilization of vertical links. We observe the highest performance losses of 40 and 4.9%.
Table 12.2 Benchmark results with symmetric buffer depth [124]

Traffic/application | Tasks | Packets/inj. rate | Baseline clock cycl. | Aggressive clock cycl. | Δ | Delay-oriented clock cycl. | Δ
Uniform (median) | – | 0.8 GB/s | 1318 | 1310 | 0.6% | 1322 | −0.3%
Complement | – | 0.8 GB/s | 170 | 210 | −19% | 178 | −4.4%
Hotspot | – | 0.8 GB/s | 1290 | 1930 | −33% | 1656 | −4.9%
VOPD & Shape Dec. [239] | 17 | 3416 | 55188 | 55208 | −0.7% | 55192 | −0.7%
VOPD [214] | 16 | 4044 | 63766 | 69766 | −8.6% | 64368 | −0.9%
DVOPD [214] | 32 | 8762 | 65202 | 77316 | −16% | 66458 | −1.9%
MPEG-4 [214] | 12 | 3467 | 142010 | 142010 | 0.0% | 142010 | 0.0%
PIP [214] | 8 | 576 | 9984 | 11364 | −11% | 10114 | −1.3%
MWD [214] | 12 | 1120 | 14114 | 16216 | −13% | 14338 | −1.6%
H.263 dec., mp3 dec. [214] | 13 | 19636 | 172554 | 181068 | −4.7% | 173232 | −0.4%
mp3 enc., mp3 dec. [214] | 12 | 1652 | 18500 | 30752 | −40% | 19730 | −6.2%
H.263 enc., mp3 dec. [214] | 14 | 24291 | 373778 | 430884 | −13% | 378704 | −1.3%
Average performance loss to symmetric baseline | | | – | | 14% | | 2.1%
Area saving for router input buffers to symmetric baseline | | | – | | 9.6% | | 8.3%
Table 12.3 Performance results for homogeneous 4-flit buffers [124]

Application | Baseline clock cycles | Symmetric 4-flit router clock cycles | Loss | Aggressive 4-flit router clock cycles | Loss | Delay-oriented 4-flit router clock cycles | Loss
Uniform | 1318 | 2411 | 45% | 2480 | 47% | 2448 | 46%
Complement | 170 | 292 | 42% | 310 | 45% | 302 | 44%
Hotspot | 1290 | 2380 | 46% | 2542 | 49% | 2438 | 47%
VOPD & Shape decoder [239] | 55188 | 124574 | 56% | 124578 | 56% | 124578 | 56%
VOPD [214] | 63766 | 131556 | 52% | 134460 | 53% | 132734 | 52%
DVOPD [214] | 65202 | 132236 | 51% | 133636 | 51% | 132838 | 51%
MPEG-4 [214] | 142010 | 267264 | 47% | 267264 | 47% | 267264 | 47%
PIP [214] | 9984 | 21670 | 54% | 21926 | 54% | 21800 | 54%
MWD [214] | 14114 | 27120 | 48% | 27868 | 49% | 27414 | 49%
Average performance loss | – | | 49% | | 50% | | 50%
Area savings | – | | 36.1% | | 40.1% | | 39.8%
Power savings | – | | 37.3% | | 41.6% | | 40.6%
The unexpected performance increase of 0.6% for uniform random traffic is an artifact of the statistics of our measurement setup.
12.4.2 Routers with Optimized Buffer Depths

Reductions in the buffer depth are evaluated by comparing routers using 8-flit buffers in both layers to routers using 4-flit buffers. The results are shown in Table 12.3. Reducing the buffer depth in the standard router yields a performance loss of 49%. In combination with the proposed router architectures, the performance declines by an additional 2.5 and 1.2% for the respective architectures. The power and area savings of all three router architectures are between 36.1 and 41.6%. As expected, the proposed router architectures have lower performance, area, and power than the baseline. The performance loss of homogeneous buffer-depth reductions is high, and the power and area savings do not justify it. Therefore, we also evaluate heterogeneous buffer-depth reductions. Heterogeneous buffer-depth reductions are compared for the "aggressive" router architecture with 8-flit buffers in all layers vs. 8-flit buffers in the lower layer and 4- to 8-flit buffers in the upper layer. The results are shown in Table 12.4. For small buffer depths in the upper layer, performance losses are significant, with up to 40%. Therefore, a considerable asymmetry in buffer depth is not advantageous. However, a slight asymmetry is promising. Reducing the buffer depth from 8 to 7 flits in the upper layer yields a constant performance loss of 14% vs. the baseline but doubles the area savings from 9.6 to 18% and increases the total power savings from 7.2 to 12.8%.
Table 12.4 Results for heterogeneous buffer depths with the heterogeneous "aggressive" router architecture [124]. Runtime in clock cycles; loss vs. the baseline

Application | Baseline | 4 flits | Loss | 5 flits | Loss | 6 flits | Loss | 7 flits | Loss | 8 flits | Loss
Uniform (median) | 1318 | 1804 | 27% | 1438 | 8.3% | 1364 | 3.4% | 1320 | 0.2% | 1310 | −0.6%
Complement | 170 | 288 | 41% | 218 | 22% | 214 | 21% | 210 | 19% | 210 | 19%
Hotspot | 1290 | 2466 | 48% | 1978 | 35% | 1930 | 33% | 1930 | 33% | 1930 | 33%
VOPD & Sh. Dec. [239] | 55188 | 118686 | 54% | 90000 | 39% | 70560 | 22% | 55208 | 0% | 55208 | 0%
VOPD [214] | 63766 | 94754 | 33% | 81490 | 22% | 75038 | 15% | 69766 | 8.6% | 69766 | 8.6%
DVOPD [214] | 65202 | 115532 | 44% | 93300 | 30% | 79830 | 18% | 77316 | 16% | 77316 | 16%
MPEG-4 [214] | 142010 | 175972 | 19% | 159158 | 11% | 149680 | 5.1% | 142010 | 0.0% | 142010 | 0.0%
PIP [214] | 9984 | 13382 | 25% | 12148 | 18% | 11642 | 14% | 11264 | 11% | 11264 | 11%
MWD [214] | 14114 | 21214 | 33% | 17864 | 16% | 16808 | 16% | 16220 | 13% | 16216 | 13%
H.263 dec., mp3 dec. [214] | 172554 | 370904 | 53% | 283872 | 39% | 230020 | 25% | 181072 | 4.7% | 181068 | 4.7%
mp3 enc., mp3 dec. [214] | 18500 | 39622 | 53% | 33808 | 45% | 32086 | 42% | 30756 | 40% | 30752 | 40%
H.263 enc., mp3 dec. [214] | 373778 | 554312 | 33% | 484378 | 23% | 453832 | 18% | 430888 | 13% | 430884 | 13%
Average performance loss | 0% | | 40% | | 28% | | 21% | | 14% | | 14%
Area saving | 0% | | 44% | | 36% | | 27% | | 18% | | 9.6%
Power savings | 0% | | 29.3% | | 23.8% | | 18.2% | | 12.8% | | 7.2%
Table 12.5 Benchmarks for the symmetric and proposed heterogeneous architectures with 7-flit buffers. The runtime is measured in clock cycles [124]

Application | Baseline, 8-flit buffers | Symmetric router, 7-flit buffers | Loss | Delay-oriented router, 7-flit buffers | Loss
Uniform | 1318 | 1316 | 0.2% | 1372 | 4.0%
Complement | 170 | 170 | 0.0% | 186 | 8.6%
Hotspot | 1290 | 1300 | 0.8% | 1412 | 8.6%
VOPD & Shape Decoder [239] | 55188 | 55188 | 0.0% | 55192 | 0.01%
VOPD [214] | 63766 | 63766 | 0.0% | 65564 | 2.7%
DVOPD [214] | 65202 | 65210 | 0.01% | 67170 | 3.0%
MPEG-4 [214] | 142010 | 142014 | 0.003% | 142014 | 0.003%
PIP [214] | 9984 | 9984 | 0.0% | 10366 | 3.7%
MWD [214] | 14114 | 14118 | 0.03% | 14476 | 2.5%
H.263 enc., mp3 dec. [214] | 373778 | 373780 | 0.0% | 388554 | 3.8%
mp3 enc., mp3 dec. [214] | 18500 | 18500 | 0.0% | 22182 | 17%
H.263 dec., mp3 dec. [214] | 172554 | 172558 | 0.0% | 174976 | 1.4%
Average performance loss | 0% | | 0.06% | | 4.6%
Buffer area savings | 0% | | 13% | | 28%
Total power savings | 0% | | 9.3% | | 15%
This result shows that the asymptotic performance limit has to be evaluated in simulations and that a deliberate tradeoff between power/area and performance is possible at this design point.
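Selecting the buffer depth for such a tradeoff can be automated. A minimal sketch: the loss and saving percentages below are the averages from Table 12.4, while the loss budget is a hypothetical design constraint:

```python
# Average performance loss and area saving per upper-layer buffer depth,
# taken from Table 12.4 (in percent vs. the 8-flit baseline).
avg_loss    = {4: 40, 5: 28, 6: 21, 7: 14, 8: 14}
area_saving = {4: 44, 5: 36, 6: 27, 7: 18, 8: 9.6}

def pick_depth(loss_budget):
    """Smallest depth (= largest saving) whose average loss fits the budget."""
    feasible = [d for d, loss in avg_loss.items() if loss <= loss_budget]
    return min(feasible) if feasible else max(avg_loss)

depth = pick_depth(loss_budget=15)
print(depth, area_saving[depth])  # 7 flits match the loss of 8 but save more area
```

With a 15% loss budget, the selection lands on 7 flits, reproducing the tradeoff point discussed above.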
12.4.3 Combination of Both Optimizations

We also evaluate a combination of heterogeneous buffer depths and buffer locations to enable further improvements in all metrics. The "delay-oriented" router and a small asymmetry of 7-flit- and 8-flit-deep buffers individually showed the most promising results, so they are combined. The results are shown in Table 12.5. Using the conventional router with heterogeneous buffer depths yields a negligible performance loss, with 13 and 9.3% reduced area and power, respectively. Using the "delay-oriented" router, the performance declines by 4.6%, with 28 and 15% reduced area and power consumption, respectively. This approach yields a good low-power design.
12.4.4 Influence of Clock Frequency Deviation

As a worst-case approximation, we assumed globally synchronous clocks. Here, we assess the influence of this assumption. We use the maximum clock frequencies from the RTL-model synthesis for the routers and model a globally asynchronous locally synchronous (GALS) system, in which the upper layer is clocked at 820 MHz and the lower at 1000 MHz. We do not consider the actual hardware implementation of router interfaces supporting different clock speeds. Uniform random traffic is injected at 3.2 GB/s. The results change by only 2% on average. We will discuss the effect of different clock speeds in detail in Chap. 13.
12.5 Discussion

We showed that heterogeneous technologies can pose severe limitations for 3D NoCs, as layers in mixed-signal/less-advanced technologies are restricted in all metrics. Here we demonstrated that the power and area consumption of the NoC can be reduced if router buffers are optimized exploiting this heterogeneity. Only one of our proposed architecture optimizations is promising, the "delay-oriented" router. A relatively small performance loss of 2.1% offers area savings of 8.3%. The area savings are similar for the "aggressive" router, but its performance loss is too high. Considerable heterogeneity in buffer depth results in significant performance losses. Network performance can be maintained, though, using a moderate amount of heterogeneity. We achieved cost reductions of up to 13% at a negligible performance loss. Applying both approaches simultaneously is most promising for low-power designs with relaxed performance constraints. Using the proposed "delay-oriented" router with a light asymmetry in buffer depth of one flit offers area reductions of 28%, power reductions of 15%, and a minor performance loss of 4.6%.
12.6 Conclusion

The area and power consumption of NoCs in heterogeneous 3D SoCs can be reduced by applying heterogeneous architecture optimizations, shown here using the example of buffer distributions and depths. These types of optimizations require delicate fine-tuning of parameters since the influence on performance is nonlinear. It is necessary to find buffer depths and distributions which offer the best compromise between penalized performance, reduced area, and declined power consumption. The two proposed architectures in this chapter offer either small area-power reductions at negligible performance loss or higher area-power savings at significant
performance loss. Therefore, a universal design principle cannot be derived. An acceptable tradeoff depends on a specific design’s power-area budget and its performance constraints. The proposed architectures can be applied to any state-of-the-art router architecture, which has input buffers and does not require special buffer features (such as the original implementation of SMART [49]).
Chapter 13
Heterogeneous Routing for 3D NoCs
In the previous chapter, we discussed heterogeneous architectural optimizations of router buffers targeting heterogeneous three-dimensional (3D) networks on chips (NoCs). These optimizations did not account for the varying clock frequencies found in heterogeneous 3D integrated circuits (ICs) (i.e., that routers are globally asynchronous). This chapter extends the previous findings by considering this challenge and derives specific routing algorithms for heterogeneous 3D NoCs. The routing algorithms also require co-designed router architectures. There are multiple publications on routing algorithms for 3D NoCs, such as [9, 10]. These do not consider the effect of the implementation technology because homogeneous 3D systems on chips (SoCs) do not have varying technologies. However, novel routing algorithms are required for heterogeneous 3D SoCs due to the change of the technology node. We exemplify the chapter's idea with a motivational statement: The performance of routers in mixed-signal layers is lower than the performance in advanced digital nodes. Architects can exploit this by designing tailored routing. It is advantageous to send packets along paths with high-performance routers. Models that allow calculating the transmission time along heterogeneous paths, i.e., paths through layers in disparate technologies, did not exist before this work. These models enable finding paths with optimized performance for packets and hence are proposed here. We further introduce two routing algorithms and a bespoke router architecture for this purpose. Both increase the network performance beyond the state of the art in heterogeneous 3D NoCs. The remainder of this chapter is structured as follows. We highlight the main challenges for routing arising from heterogeneity in Sect. 13.1. We model the effect of heterogeneous integration on the NoC performance in Sects. 13.2 and 13.3. Next, we quantify the impact of heterogeneity in Sect. 13.4 and tackle this obstacle with new principles for routing algorithms and modified router architectures. We apply the principles to routing algorithms in Sect. 13.5 and design a fitting router architecture
in Sect. 13.6. We quantify the advantages in Sect. 13.8 and discuss the findings in Sect. 13.9. Finally, the chapter is concluded.
13.1 Heterogeneity and Routing

If a network on chip connects components in a heterogeneous 3D IC, the network will span layers in different technologies. This heterogeneity causes a severe issue. Homogeneous routers, which are common for 3D NoCs [82, 246], yield unbearable area/power costs in mixed-signal nodes. Heterogeneous routers with aligned properties solve this issue. The most relevant challenges for their design are:

Challenge 1: Routers in layers using a conservative technology node are disproportionately expensive.
Challenge 2: The different technology nodes can influence the maximum number of routers per layer.
Challenge 3: Routers in layers using a conservative technology node are clocked slower than those in a scaled one. Interaction between routers in adjacent layers is intrinsically not purely synchronous.
Challenge 4: Routers in different layers are clocked differently. This influences packet provision.
These challenges must be tackled, especially since communication in heterogeneous 3D NoCs has a unique characteristic: Throughput and latency differ between layers due to varying router counts and router clock speeds. We will show that low packet provision in some layers impedes a packet's performance in the whole network.
13.2 Modeling Heterogeneous Technologies

We model the influence of heterogeneous technologies on area and timing. The models cover any commercial technology and feature size under the following assumptions. The through-silicon vias (TSVs) do not need a separate pipeline stage for traversal. We do not model keep-out zones (KOZs) because they are a constant overhead independent of the technology node. Routers need not be located at the same position in their layers, since a redistribution (RD) layer can connect routers and TSVs (Sect. 15.2.1). The variability of the RD is modeled by converting router locations to router addresses. We assume that components are clocked at different speeds. A synchronous model would waste performance, especially since mixed-signal components have poor clock speeds. Typically, the number of routers in each layer is more or less the same, as less complex processing elements (PEs) are implemented in mixed-signal technologies compared to the ones implemented in scaled digital technologies.
Fig. 13.1 Area-size model with constant area (orange, lined) and scalable area (green, dotted); scaling by 2×
The proposed techniques presented in this chapter have even higher gains for this common (less complex) scenario. Still, we consider the complex scenario in which equally complex PEs are integrated in all technology nodes, resulting in more routers in more scaled technologies. This approach maximizes the usability of the proposed technique and results in conservative gains. Thus, the reported gains are realistically achieved in most real systems. We model a chip with ℓ layers and their index set [ℓ] = {1, . . . , ℓ}. We assume an n×m-mesh topology per layer. The structure size of the technology nodes, measured in [nm], is given by τ : [ℓ] → N. A chip layer with index ι will be called a »more advanced node« than a layer with index ξ if τ(ι) is smaller than τ(ξ) (for clear notation). We define:

Definition 20 (Relative Technology-Scaling Factor) Let ξ and ι be the indexes of layers with technologies τ(ξ) and τ(ι) and with τ(ξ) > τ(ι). The relative technology-scaling factor Ξ is:

Ξ(ξ, ι) := τ(ξ) / τ(ι).   (13.1)
13.2.1 Area

The area of the NoC is determined by the size of individual routers times their count. We propose an abstract model covering the influence of technology nodes, synthesis constraints, synthesis tools, and router architectures.
13.2.1.1 Area of Routers and PEs
The technology node in which a router is implemented affects the size of each of its architectural components. The area requirements of both combinatorial and sequential logic are influenced. Therefore, the sizes of the routing-computation, crossbar, and buffer components vary. The area of logic (ideally) reduces quadratically with the feature size of the technology node. Some parts of the router do not scale or scale worse, which we approximate as a constant. This yields a total area model of the form α̂ + ας², in which α̂ is the constant part, α is a non-ideality factor, and ς is the feature size. This model is depicted in Fig. 13.1. By this model, we define the area-scaling factor as the ratio between the baseline technology, i.e., the most advanced technology, and another technology:
Definition 21 (Area-Scaling Factor) Let ξ and ι be the indices of two chip layers with technologies τ(ξ) and τ(ι) and with a relative technology-scaling factor Ξ(ξ, ι). The area-scaling factor, sf : R → R, is given by:

sf(Ξ) := (α + α̂) / (α/Ξ² + α̂).   (13.2)
The model assumes the chip area is normalized to 1. The non-ideality factor α denotes how well the technology scales. The base-technology area offset α̂ is dominated by unscalable components. Both parameters are evaluated for the used technology by synthesis of a small circuit with typical properties, such as a basic router model (we demonstrate this in Sect. 13.8.1). The parameters can be estimated using an off-the-shelf function-fitting tool. In an ideal setting, α = 1 and α̂ = 0. For instance, if two layers are implemented in an ideal theoretical technology node with τ(1) = 180 nm and τ(2) = 45 nm, the area-scaling factor will be sf(Ξ(180, 45)) = 16. For a setting with 90-nm and 180-nm nodes, it will be sf(Ξ(180, 90)) = 4.
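The area-scaling factor is straightforward to evaluate; a minimal sketch reproducing the ideal examples from the text (α = 1, α̂ = 0):

```python
def sf(xi, alpha=1.0, alpha_hat=0.0):
    """Area-scaling factor of Eq. (13.2): sf(Ξ) = (α + α̂) / (α/Ξ² + α̂).

    alpha     -- non-ideality factor (1.0 = ideal quadratic scaling)
    alpha_hat -- constant, unscalable area offset
    """
    return (alpha + alpha_hat) / (alpha / xi**2 + alpha_hat)

print(sf(180 / 45))  # ideal 180-nm vs. 45-nm: Ξ = 4, sf = 16
print(sf(180 / 90))  # ideal 180-nm vs. 90-nm: Ξ = 2, sf = 4
```

With a nonzero α̂, sf saturates at (α + α̂)/α̂ for large Ξ, capturing the unscalable area offset.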
13.2.1.2 Router Count
Varying nodes influence the size of individual routers and potentially also the number of routers. This effect can be modeled using the area-scaling factor sf as an approximate lower bound for the number of routers implemented in a more advanced node.
13.2.2 Timing

We model the transmission time of packets in heterogeneous 3D NoCs. It is determined by the timing of individual routers and their interaction, i.e., the path in the network topology. We account for the clock delay of individual routers and deduce the propagation speed of packets that traverse multiple routers. This approach gives a zero-load model to analyze the relevant properties for designing an efficient routing algorithm.
13.2.2.1 Clock Period
The clock delays in heterogeneous 3D SoCs vary. In more advanced layers, routers are potentially faster than in mixed-signal nodes. Two drivers determine the clock delay of routers. The interconnect delay does not scale and limits router performance in modern nodes (power constraints also limit the maximum achievable clock
frequency). The logic delay scales and is larger than the interconnect delay for conservative nodes. We model a clock-scaling factor, which gives the ratio of two clock delays in different technologies. There are two drivers, one of which scales and one of which does not. Since the physical influence is difficult to model, we propose an empirical approach by fitting a sigmoid function. This approach shows a high fitting accuracy (we will demonstrate this in Sect. 13.8.1).

Definition 22 (Clock-Scaling Factor) Let ξ and ι be the indices of two layers with technologies τ(ξ) and τ(ι), with τ(ξ) > τ(ι) and with technology difference Ξ(ξ, ι). Let cb be the base clock delay of the layer with index ξ, and cc be the minimum achievable clock delay, which is limited by physical effects such as power dissipation or interconnect delays. Let β be the maximum speedup, β := cb/cc. The clock-scaling factor cf : R → R is:

cf(Ξ) := β / (1 + β̂ exp(−β̃(Ξ − β̄))).   (13.3)
Please note that this function is an empirical modeling approach. The function converges to the maximum achievable speedup β, determined by the theoretical maximum clock frequency that the architect can assume for their approach (e.g., from limits in power density). The other parameters must be set by fitting the function to synthesis results (see Sect. 13.8.1).
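A minimal sketch of Eq. (13.3); the parameter values here are illustrative placeholders, not fitted synthesis results:

```python
import math

def cf(xi, beta, beta_hat, beta_tilde, beta_bar):
    """Clock-scaling factor of Eq. (13.3):
    cf(Ξ) = β / (1 + β̂ · exp(−β̃ · (Ξ − β̄)))."""
    return beta / (1 + beta_hat * math.exp(-beta_tilde * (xi - beta_bar)))

# Illustrative parameters: maximum speedup β = 5 (e.g., a power-density cap).
params = dict(beta=5.0, beta_hat=1.0, beta_tilde=0.8, beta_bar=2.0)
for xi in (1, 2, 4, 8, 16):
    print(xi, round(cf(xi, **params), 2))
```

The speedup grows monotonically with the technology difference Ξ and saturates at β, mirroring the limited clock scaling of very small nodes.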
13.3 Modeling Communication

We model latency, throughput, and transmission speed under zero load. Horizontal communication within a layer is synchronous. Vertical communication between layers is not purely synchronous, depending on the router architecture and the technology nodes of the involved layers. Hence, we give two models.
13.3.1 Horizontal Communication

We call the speed at which packets are transmitted horizontally under zero load the propagation speed. It varies with technology since the number and frequency of routers differ. We assume that communication is synchronous. The distance divided by the packet's latency determines the propagation speed. We calculate a packet's traveled distance. All positions of routers are given by the set P = {(px, py, pz) ∈ R × R × [ℓ]}. The x- and y-coordinates of routers
Fig. 13.2 Horizontal communication of two consecutive packets (orange, green). Routers have delay δ = 3 and pipelining χ = 2. The latency Δ(π, ξ) is the difference between head-flit send and receive times
are measured in [m].¹ The symbols px, py, and pz denote the components of each position p ∈ P. The packet payload is modeled using the number of transmitted flits l ∈ L with L ⊂ N⁺. Together, the set of packets is given by D = P × P × L. Packets are transmitted from a source to a destination position. This yields:

Definition 23 (Horizontal Transmission Distance) Let π be a packet with π = (p1, p2, l), with source node p1, destination node p2, and l flits. The horizontal transmission distance s(π) is defined as the distance between the source and destination positions in x- and y-dimension under minimal routing:

s(π) = ‖(p1,x, p1,y) − (p2,x, p2,y)‖₁.   (13.4)
For example, in a mesh, the distance between the source and destination positions in x- and y-dimension of a packet π = (p1, p2, l) is calculated by s(π) := ‖(p1,x, p1,y) − (p2,x, p2,y)‖₁. The norm ‖·‖₁ denotes the Manhattan norm.² The latency of a packet is calculated by accumulating the latency of the routers on the path. This is shown in Fig. 13.2. Each router requires δ(ξ) clock cycles to process the head flit in layer ξ ∈ [ℓ]. If a single packet with l flits passes through a single router, the transmission will be finished after δ(ξ) + l cycles due to pipelining of the body flits. Let ρ(ξ) be defined as the average distance between routers in layer ξ. A packet traverses s(π)/ρ(ξ) + 1 routers, including the destination router, during its transmission. This yields the horizontal packet latency and throughput:

Definition 24 (Horizontal Packet Header Latency Under Zero Load) Let π be a packet with π = (p1, p2, l) and ξ ∈ [ℓ] a layer. The average distance between routers in layer ξ is ρ(ξ), and the delay for processing head flits per router is
¹ Please note that "measured in [m]" refers to the SI unit meter; "[m]" refers to the set {1, . . . , m}. Thereby we avoid ambiguity.
² ‖p‖₁ = Σ_{i=1}^{n} |p_i| for p ∈ Rⁿ.
δ(ξ). The clock delay of routers is clk(ξ), measured in [s]. The horizontal packet header latency under zero load, measured in [s], in layer ξ is

Δ_H(π, ξ) = (s(π)/ρ(ξ) + 1) δ(ξ) clk(ξ).   (13.5)
Hence, the latency of all l flits of a packet is Δ_H(π, ξ) plus l clock cycles.

Definition 25 (Horizontal Router Throughput) Let π be a packet with π = (p1, p2, l) and ξ ∈ [ℓ] a layer. The delay for processing head flits per router is δ(ξ). The router is pipelined with χ(ξ) ∈ [0, δ(ξ)] steps. The clock delay of routers is clk(ξ), measured in [s]. The horizontal router throughput, measured in [flits/s], is given by the number of flits that a router can pass in a given time frame:

Δ̂_H(π, ξ) = 1/clk(ξ).   (13.6)
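Equations (13.5) and (13.6) translate directly into code. A minimal sketch; the router pitch, head delay, and clock period below are hypothetical values chosen for illustration:

```python
def horizontal_latency(s, rho, delta, clk):
    """Eq. (13.5): zero-load header latency (s/ρ + 1) · δ(ξ) · clk(ξ) in [s].
    s and rho must use the same length unit; only their ratio matters."""
    return (s / rho + 1) * delta * clk

def horizontal_throughput(clk):
    """Eq. (13.6): zero-load router throughput 1/clk(ξ) in [flits/s]."""
    return 1.0 / clk

# Hypothetical layer: 1-mm router pitch, 3-cycle head delay, 1-GHz clock.
lat = horizontal_latency(s=4.0, rho=1.0, delta=3, clk=1e-9)  # 4-mm distance
print(lat)  # 5 traversed routers, 3 cycles each, 1 ns per cycle
print(horizontal_throughput(1e-9))
```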
13.3.2 Vertical Communication

While horizontal communication is analogous to known models for 2D NoCs, heterogeneity affects vertical communication. The model must comprise non-purely-synchronous communication and different router and link architectures.

Definition 26 (Vertical Packet Header Latency Under Zero Load) Let π be a packet with π = (p1, p2, l) and ξ and λ ∈ [ℓ] layers with p1,z = ξ and p2,z = λ. Without loss of generality, ξ ≤ λ, and slower layers are above faster layers. The clock delay of routers is clk(i) for layers i ∈ [ℓ], measured in [s]. The vertical packet header latency under zero load (downwards), measured in [s], is:

Δ_V↓(π, ξ, λ) = Σ_{i=ξ}^{λ} δ(i) clk(i).   (13.7)
The vertical packet header latency under zero load (upwards), measured in [s], is given by the accumulated router delay plus a clock cycle for synchronization, as illustrated in Fig. 13.3. This occurs only once at the heterogeneous interface. The slower clock frequency dominates, yielding:

Δ_V↑(π, ξ, λ) = Σ_{i=ξ}^{λ} ((δ(i) − 1) clk(i) + clk(i + 1)).   (13.8)
Definition 27 (Vertical Router Throughput) Let π be a packet with π = (p1, p2, l) and ξ and λ ∈ [ℓ] layers with p1,z = ξ and p2,z = λ. Without loss of generality, ξ ≤ λ. Routers are pipelined with χ(i) ∈ [0, δ(i)] steps in each
Fig. 13.3 Vertical communication is dominated by the slowest clock frequency. In this illustration, routers are clocked at a delay of 1 and 1/2. Routers have head delay δ = 0 and pipelining χ = 0
layer i ∈ [ℓ]. The clock delay of routers is clk(i), measured in [s]. The horizontal throughput of routers in any layer i is Δ̂(π, i). The slowest router determines the vertical router throughput, measured in [flits/s]:

Δ̂_V(π, ξ, λ) = min_{i∈[ξ,...,λ]} { Δ̂(π, i) }.   (13.9)
Lengthy delays for processing a head flit are not relevant due to pipelining. Figure 13.3 illustrates how the slowest clock dominates the transmission throughput.
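A sketch of Eqs. (13.7)–(13.9), with layers indexed from the slower upper layer downwards; the clock and delay values are hypothetical. For Eq. (13.8), the clk list must also cover index λ + 1, since the formula references clk(i + 1):

```python
def vertical_latency_down(delta, clk, xi, lam):
    """Eq. (13.7): Σ_{i=ξ}^{λ} δ(i)·clk(i), in [s]."""
    return sum(delta[i] * clk[i] for i in range(xi, lam + 1))

def vertical_latency_up(delta, clk, xi, lam):
    """Eq. (13.8): Σ_{i=ξ}^{λ} ((δ(i) − 1)·clk(i) + clk(i + 1)), in [s]."""
    return sum((delta[i] - 1) * clk[i] + clk[i + 1] for i in range(xi, lam + 1))

def vertical_throughput(clk, xi, lam):
    """Eq. (13.9): the slowest traversed router bounds the throughput [flits/s]."""
    return min(1.0 / clk[i] for i in range(xi, lam + 1))

# Hypothetical stack: layer 0 at 500 MHz (mixed signal), layers 1-2 at 1 GHz.
delta = [4, 3, 3]           # head-flit delay per layer in cycles
clk   = [2e-9, 1e-9, 1e-9]  # clock period per layer in seconds
print(vertical_latency_down(delta, clk, 0, 1))  # 4·2 ns + 3·1 ns
print(vertical_throughput(clk, 0, 1))           # bounded by the 500-MHz layer
```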
13.4 Routing Limitations from Heterogeneity

The main limiting factor for conventional routing in heterogeneous 3D NoCs is the different clock speeds of the routers. This clocking impedes latency and throughput. Novel routing algorithms can limit the length of these slow paths (improving performance), and modified router architectures can improve throughput. There is an essential need for a co-design of router architectures and routing algorithms because of their interactions, shown in this chapter.
13.4.1 Tackling Latency with Routing Algorithms

The communication speed differs with technology. Routing algorithms can use this by identifying faster layers. In more advanced layers, routers are clocked faster, which increases the speed. However, there are potentially more routers (from feature-size scaling), which adds delays. To determine the dominating effect, we use Eqs. (13.4) and (13.5), whose derivation yields the propagation speed of a packet under zero load in the horizontal direction.
Fig. 13.4 Experimental propagation speed ω (in [m/s]) for commercial 180-nm mixed-signal and 130–45-nm digital technologies, using an NoC router with a head-flit delay of δ = 3 and a 2×2 NoC in the mixed-signal layer, and based on the model for 180–7-nm predictive technologies (legend: predictive technology (modeled), commercial technology (measured))
Definition 28 (Propagation Speed) Let ξ ∈ [ℓ] be a layer. The propagation speed in layer ξ is

ω(ξ) = ρ(ξ) / (δ(ξ) clk(ξ)),   (13.10)
measured in [m/s]. As the speed is distance per time, it can be rewritten as ω(ξ) = s(π)/Δ_H(π, ξ) for a given packet. We evaluate the propagation speed in Fig. 13.4. We use a commercial 180-nm mixed-signal node and various 130–45-nm digital nodes. The results are based on synthesis for a standard NoC router with a head-flit delay of δ = 3. The NoC in the mixed-signal layer has 2×2 routers, and in the digital layer, there is a larger, scaled NoC. Comparing mixed-signal and digital technologies, we observe a propagation-speed improvement between 2.7× and 4.3×. Our experiments hence prove that clock scaling is stronger than area scaling in all technologies for the 180-nm mixed-signal node as a baseline. This relation can be used to design efficient routing algorithms that prefer paths in modern technologies. Please note that one cannot transfer the findings from Fig. 13.4 to a more modern mixed-signal technology without reevaluating the models. In other words, if one simply takes the 28-nm node in Fig. 13.4 as a baseline, it appears that there is no performance advantage because the model predicts a decline in transmission speed. This is not the actual behavior for this mixed-signal node because the number of routers in the digital node cannot be scaled up, and switching from 180-nm to 28-nm technology would also result in a different router architecture, possibly with fewer pipeline stages. Hence, one can expect performance improvements for a 28-nm with 14-nm setup similar to those for a combination of 180-nm mixed-signal and 90-nm digital. For further evaluation, we use the proposed models and fit synthesis results (Sect. 13.8.1). The models enable the prediction of propagation speeds of layers manufactured in nodes with feature sizes below 45 nm, which were not available at
13 Heterogeneous Routing for 3D NoCs
the time of this experiment. We set the maximum clock frequency to 5 GHz as this is a common maximum frequency seen in products at the time of writing this book considering power density limitations. Using this parameter, we observe a maximum speed improvement of 5.1× for 28 nm. For smaller nodes, it is reduced to 3.3× due to limited clock speed. Summing up, the clock-frequency scaling remains dominant over area scaling for all nodes, but its advantage declines. Hence, smaller nodes are faster than mixed-signal nodes, in general.
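To make Definition 28 concrete, the propagation speed can be sketched numerically. The router pitch, head-flit delay, and clock periods below are hypothetical placeholders, not the synthesis values underlying Fig. 13.4.

```python
# Sketch of Definition 28: omega(xi) = rho(xi) / (delta(xi) * clk(xi)),
# with rho the router-to-router distance [m], delta the head-flit delay
# [cycles], and clk the clock period [s]. All numbers are illustrative.

def propagation_speed(rho_m, delta_cycles, clk_period_s):
    """Propagation speed omega of a layer in m/s."""
    return rho_m / (delta_cycles * clk_period_s)

# Hypothetical slow mixed-signal layer vs. faster digital layer:
# area scaling shrinks rho by 4x, clock scaling shrinks the period by 8x.
omega_ms  = propagation_speed(rho_m=2.0e-3, delta_cycles=3, clk_period_s=1 / 250e6)
omega_dig = propagation_speed(rho_m=0.5e-3, delta_cycles=3, clk_period_s=1 / 2e9)

# Clock scaling dominates area scaling, so the digital layer is faster,
# mirroring the chapter's observation.
print(omega_dig / omega_ms)
```

With these placeholder numbers the digital layer propagates flits twice as fast, even though its routers are packed four times more densely.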
13.4.2 Tackling Throughput with Router Architectures

Routing algorithms cannot increase the throughput. It is limited by the difference in the clock frequencies of routers in heterogeneous technologies. Only novel router architectures can solve this issue. To illustrate, we consider packets with length l. According to Eq. (13.6), the throughput of horizontal communication is Δ̂H = 1/clk(ξ), determined by the layer's clock frequency. If communication spans layers in another technology (i.e., a layer with another clock frequency), Eq. (13.9) yields the vertical throughput:

Δ̂(π, λ) = min{Δ̂V(π, ξ, λ), Δ̂H(π, λ)} = Δ̂V(π, ξ, λ) ≤ 1/clk(ξ).    (13.11)
This equation shows that the slowest clock frequency limits the throughput on paths through a heterogeneous interface. This limitation is universal for routing in heterogeneous 3D SoCs. Thus communication may not span slower clocked layers if high throughput is required. Due to this principle, increases in transmission speed are restricted by reduced throughput. Even worse, packets from and to slower layers are inevitably limited. Horizontal transmission in slower layers must be reduced to a minimum. We approach this limit by a router architecture that improves the throughput for packets from and to slow layers.
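The throughput cap of Eq. (13.11) reduces to a minimum over the clocks along a path; the frequencies below are illustrative assumptions, not measured values.

```python
# Sketch of Eq. (13.11): the slowest clock along a path caps the flit rate,
# regardless of which horizontal route a routing algorithm picks.

def path_throughput(clock_freqs_hz):
    """Maximum flits per second along a path of routers at the given clocks."""
    return min(clock_freqs_hz)

# Illustrative path staying in a 2 GHz digital layer vs. one that also
# spans a 250 MHz mixed-signal layer:
same_layer  = path_throughput([2e9, 2e9, 2e9])
cross_layer = path_throughput([250e6, 2e9, 2e9])
print(same_layer / cross_layer)  # 8.0 -> motivates Sect. 13.6's router
```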
13.5 Heterogeneous Routing Algorithms

Conventional dimension-ordered routing (DOR, XYZ routing) cannot always be used in heterogeneous 3D NoCs because the routing algorithm may not be connected. (Even if connected, its performance is not on par with our proposed heterogeneous routing.) Our models show that not all routers in digital layers are connected to a router in mixed-signal layers, since the number of routers differs. The routing algorithm is not connected, as packets from a node in the mixed-signal layer to a destination in a digital layer cannot be routed if the destination router is not connected upwards. This effect is illustrated in Fig. 13.5. The closest variation
Fig. 13.5 Sectional drawing of a 3D NoC with two layers in disparate technologies: conventional XYZ routing is not connected. The closest variation routes down as late as possible
Fig. 13.6 “Stay in faster layers!” in a sectional drawing of the 3D NoC with routers plotted in gray and links in black and a scaling factor sf = 4
of conventional XYZ routing for heterogeneous 3D ICs is also shown. Packets are routed to the digital layers via the vertical connection, which is closest to the destination.3
13.5.1 Fundamentals of Heterogeneous Routing Algorithms

Two paradigms enable the design of routing algorithms that improve latency in 3D NoCs with heterogeneous technologies:

– “Stay in faster layers!”: Packets stay in layers with high propagation speed as long as possible. This is exemplified in Fig. 13.6. Usually, data transmitted from router R1 to R2 stays in the upper layer until reaching the router above R2, which yields a high latency. It is favorable to route all packets via the preferred path in the more modern technology.

– “Go through faster layers!”: Packets can be routed via adjacent, faster layers. This detour can improve latency if the difference in clock speeds is sufficient. This is exemplified in Fig. 13.7. Usually, data is transmitted via the upper layer, which is slower than the lower one. It is favorable to route packets with a detour through the faster layer.

Both principles can be applied to propose routing algorithms. The first principle yields Z+ (XY)Z− routing. It is the standard routing function used in the remainder
3 We do not prove livelock and deadlock freedom for this routing method since the proof is merely rearranging the arguments from Sect. 13.5.4. Naturally, heterogeneous XYZ performs worse in every metric than the proposed routing algorithms and is only used as an exemplary baseline.
Fig. 13.7 “Go through faster layers!” in a sectional drawing of the 3D NoC with routers plotted in gray and links in black and a scaling factor sf = 4
of this book. The second principle was used in Ref. [125] to derive ZXYZ routing. However, it did not yield better performance than Z+ (XY)Z− routing while incurring higher area costs. Furthermore, it is non-minimal and increases the number of power-costly vertical transmissions over TSV links. Thus, it is generally too expensive for low-power systems and should be avoided (Part II). Therefore, the second principle is not used further in this book; the interested reader is kindly referred to Ref. [125].
13.5.2 Model of the NoC

In this subsection, we outline the exemplary NoC model chosen to implement routing algorithms. It is valid for most commercial and research 3D NoCs. A heterogeneous 3D IC with ℓ ∈ ℕ layers is used. Its layers are ordered by technology, as in all commercial examples, e.g., [48]. As the network topology, each layer has a grid with mξ rows and nξ columns, wherein ξ ∈ [ℓ] is the layer index; routers are arranged in rows and columns. Therein, neighboring routers connect horizontally, which yields an mξ-nξ-mesh topology.4 All routers, except those on the bottommost layer, have a (bidirectional) vertical link to the adjacent router in the next lower layer. This linking is possible due to the ordering of layers (see Fig. 13.8). The set of routers V is also the vertex set of the network digraph T = (V, A).5 The set of arcs A contains the directed links between routers.
13.5.2.1 Network Addresses

The coordinate system of the NoC, which models the locations of routers, is shown in Fig. 13.9. The origin of the coordinates is the top-left corner of the NoC. We introduce row, column, and layer numbers for routers. Rows and columns are based on the network digraph, not the physical locations. For example, pairs of neighbored
4 We do not model long-range links or express virtual channels. Therefore, routers only have one link in the same direction. 5 In Duato [69] the network digraph is called interconnection network.
Fig. 13.8 Layers are ordered by technology. Routers in the same row and column must not be at the same physical location in their layer. Links connect routers via redistribution and TSV arrays (bold)
Fig. 13.9 Cardinal directions in model coordinates
routers in adjacent layers do not necessarily have the same physical x- and y-coordinates but the same column and row number. This is shown in Fig. 13.8. If routers are stacked, such as in Fig. 13.6, this will model a physical placement comparable to Fig. 13.8, in which routers are not necessarily stacked. Connecting these routers is done using redistribution in the topmost metal layer. We notate row, column, and layer of each router as w = (wx, wy, wz) for w ∈ W = ℕ³, which is also the network address space. An injective function

m : W → P,    (13.12)

converts addresses to locations of routers. Packets with source and destination address are given by D̃ = W × W × L.
13.5.2.2 Cardinal Directions

We use the six directions C := {north, east, south, west, up, down} to sort the arcs, i.e., links in the network, as shown in Fig. 13.9. A local direction is not used in our mathematical models because it is not required to prove livelock/deadlock freedom. We define functions that return the set of all links in one of these cardinal directions. These are given for all links (v, w) ∈ A:

(v, w) ∈ north(A) ⇔ vx = wx, vy > wy, vz = wz;
(v, w) ∈ east(A)  ⇔ vx < wx, vy = wy, vz = wz;
(v, w) ∈ south(A) ⇔ vx = wx, vy < wy, vz = wz;
(v, w) ∈ west(A)  ⇔ vx > wx, vy = wy, vz = wz;
(v, w) ∈ up(A)    ⇔ vz > wz;
(v, w) ∈ down(A)  ⇔ vz < wz.

For example, north(A) contains all links which point to the north. We introduce functions that return the neighbor of a router in a certain cardinal direction if a link exists.6 Routers at the edges do not have links in that direction, which is denoted by the value 0. We define for all f ∈ C:

f : V → V ∪ {0},  v ↦ w if (v, w) ∈ f(A), and v ↦ 0 otherwise.
13.5.3 Z+ (XY)Z− Routing Algorithm

This routing algorithm uses Principle I (“Stay in faster layers!”). Let π̃ = (v, w, l) be a packet. If source and destination are in different layers, i.e., vz ≠ wz, the faster of the two layers is taken for horizontal routing. We apply Eq. (13.10) to calculate the average propagation speed at design time. This yields the following rules for the transmission of packet π (in the router with address v):

– If ω(vz) < ω(wz), XYZ routing is applied.
– If ω(vz) > ω(wz), ZXY routing is applied.
– If ω(vz) = ω(wz), either XYZ routing or ZXY routing is applied, selected at design time depending on other network properties such as the energy consumption of routers.

We call this routing algorithm Z+ (XY)Z−. Since the layers are ordered, and hence also the transmission speeds, the implementation shown in Listing 13.2 modifies XYZ routing simply by reordering if-statements. The resulting routing is illustrated in Fig. 13.10.

Definition 29 (Routing Function R1 for Z+ (XY)Z− Routing [125]) Let T = (V, A) be the topology digraph with the set of routers V and the set of links A.
6 The function is only well-defined if routers do not have more than one link in the same direction.
Listing 13.1: DOR routing (XYZ).

if vx − dx < 0 then
    route to EAST
else if vx − dx > 0 then
    route to WEST
else if vy − dy > 0 then
    route to NORTH
else if vy − dy < 0 then
    route to SOUTH
else if vz − dz > 0 then
    route to UP
else if vz − dz < 0 then
    route to DOWN
else
    route to LOCAL
end if

Listing 13.2: Z+ (XY)Z− routing.

if vz − dz < 0 then
    route to DOWN
else if vx − dx < 0 then
    route to EAST
else if vx − dx > 0 then
    route to WEST
else if vy − dy > 0 then
    route to NORTH
else if vy − dy < 0 then
    route to SOUTH
else if vz − dz > 0 then
    route to UP
else
    route to LOCAL
end if
Fig. 13.10 Z+ (XY)Z− routing. Transmission through the lower layer is faster in this NoC with two layers in disparate technologies (sectional drawing)
Further, P(A) is the power set of A. The routing function R1 : V × V → P(A) is defined as:

R1(v, d) =
    ∅            for v = d;
    {north(v)}   for vx = dx, vy > dy, vz ≥ dz;
    {east(v)}    for vx < dx, vz ≥ dz;
    {south(v)}   for vx = dx, vy < dy, vz ≥ dz;
    {west(v)}    for vx > dx, vz ≥ dz;
    {up(v)}      for vx = dx, vy = dy, vz > dz;
    {down(v)}    for vz < dz.
Please note that {0} is impossible by construction since all routers have a downward link in our setting (except those in the bottommost layer), as formally proven in Lemma 3. Minimality refers to the shortest path in the interconnection network. Our routing algorithm will be minimal if links in the interconnection graph are weighted with their speed. The proposed routing algorithm is not minimal in terms of hop distance (which is usually used to assess minimality).
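Definition 29 can be expressed as a compact executable sketch. It assumes a simplified stacked grid in which every vertical link required by Lemma 2 exists and ignores the differing layer dimensions of the real heterogeneous topology; z grows downward, toward the faster layers.

```python
# Sketch of routing function R1 (Z+(XY)Z-): descend first, route XY in the
# fastest layer touched, ascend last. Addresses are (x, y, z) tuples.

def r1(v, d):
    """Return the cardinal direction of the next hop, or None at the destination."""
    if v == d:
        return None            # deliver to the local port
    if v[2] < d[2]:
        return "down"          # Z+: toward the faster layer first
    if v[0] < d[0]:
        return "east"          # XY routing, only taken while vz >= dz
    if v[0] > d[0]:
        return "west"
    if v[1] > d[1]:
        return "north"
    if v[1] < d[1]:
        return "south"
    return "up"                # Z-: vx = dx, vy = dy, vz > dz

def walk(src, dst):
    """Follow R1 hop by hop and record the taken directions."""
    moves = {"east": (1, 0, 0), "west": (-1, 0, 0), "south": (0, 1, 0),
             "north": (0, -1, 0), "down": (0, 0, 1), "up": (0, 0, -1)}
    v, path = src, []
    while (step := r1(v, dst)) is not None:
        path.append(step)
        v = tuple(c + m for c, m in zip(v, moves[step]))
    return path

# Fast layer (z=1) to slow layer (z=0): XY below, then up as the last step.
print(walk((0, 0, 1), (2, 1, 0)))  # ['east', 'east', 'south', 'up']
# Slow layer to a deeper destination: down immediately, then XY.
print(walk((0, 0, 0), (2, 1, 1)))  # ['down', 'east', 'east', 'south']
```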
13.5.4 Proof of Deadlock and Livelock Freedom

We prove that the routing algorithm Z+ (XY)Z− is deadlock-free and livelock-free with Duato's theorem [69]. It states that routing will be deadlock-free if the routing function is connected and the channel dependency graph is cycle-free. We use terms and definitions from [69]. Among them are routing function, adaptive, connected, direct dependency, and channel dependency graph. If there is a direct dependency from a to b, we also say: »b is directly dependent on a.« Graph-related terms like path, closed walk, or cycle are used as defined in [136]. We introduce the terms possible turn and impossible turn according to a routing function R. These terms determine whether the routing function allows packets to take a turn.

Definition 30 A pair of cardinal directions (f, g) ∈ C × C is called a possible turn according to R if there are two consecutive arcs (u, v) and (v, w) ∈ A with (u, v) ∈ f(A), (v, w) ∈ g(A), and a direct dependency from (u, v) to (v, w).

Lemma 1 If there is a cycle in the channel dependency graph (CDG), then we can also find a closed walk (v1, a1, v2, . . . , vk, ak, v1) (for k ∈ ℕ) in the topology digraph with

– ai+1 directly dependent on ai for all i ∈ {1, . . . , k − 1},
– and a1 directly dependent on ak.

Proof Assume there is a cycle ({a1, . . . , ak}, {(a1, a2), . . . , (ak−1, ak), (ak, a1)}) in the CDG. According to the definition of direct dependency, the destination node of ai in the topology digraph is also the source node of ai+1 (for all i ∈ {1, . . . , k}, and ak+1 := a1). Let us call this node vi+1 (for all i ∈ {1, . . . , k}). Then, (vk+1, a1, v2, . . . , vk, ak, vk+1) is a closed walk in the topology digraph.
13.5.5 Z+ (XY)Z−: R1 is Deadlock-Free

We determine the impossible and possible turns, as shown in Table 13.1. We assume that the numbers of rows, columns, and layers mξ, nξ, and ℓ are sufficiently large, i.e., mξ, nξ, ℓ ≥ 2 for all ξ ∈ {1, . . . , ℓ}.

Table 13.1 Possible turns (f, g) in R1 (Z+ (XY)Z−); columns give the previous direction f, rows the next direction g (1 = possible, 0 = impossible)

g \ f   n.  e.  s.  w.  u.  d.
n.      1   1   0   1   0   1
e.      0   1   0   0   0   1
s.      0   1   1   1   0   1
w.      0   0   0   1   0   1
u.      1   1   1   1   1   0
d.      0   0   0   0   0   1
Lemma 2 When routing R1 returns a direction, the requested link exists.

Proof There are two kinds of places without links.

Case 1: Places at the outside faces of the 3D NoC, i.e., links at the edges of layers, upward links from the topmost layer, and downward links from the bottommost layer. By the definition of R1, every routing step brings the packet nearer to d. Hence, the nonexistent links on the outer faces are never taken.

Case 2: In some places, there are no upward links between layers in heterogeneous technologies; not every router has a link in direction up. Every router, except those in the bottommost layer, has a downward link by the premise. Downward links of a router are upward links of the router below: when router v has the same x- and y-coordinates as the destination router d and v is below d, v has an up-link. These are also exactly the conditions for traveling up in R1.
Lemma 3 R1 is connected.

Proof Let s and d be any two vertices in V. R1 returns a direction for every vertex except d (for which it returns ∅). The links in the chosen direction always exist (Lemma 2). If we apply the routing function and proceed through the network in the given directions, we will find a route. As shown in the proof of livelock freedom, the route is not infinite (Theorem 2). Hence, it terminates. Termination can only happen at d, by definition. Hence, with the routing function R1, we always find a path from s to d.

Theorem 1 R1 is deadlock-free.

Proof R1 is connected because of Lemma 3. Assume that the CDG of T and R1 has a cycle. Lemma 1 then gives a closed walk in T in which each two consecutive arcs are directly dependent.

Case 1: All vertices of the cycle are in the same layer. We know from Dally and Seitz [57] that XY routing has a cycle-free CDG due to its impossible turns. Thus, Case 1 does not occur.

Case 2: The vertices of the cycle are in at least two layers. Since the vertices are in different layers, there is at least one arc that goes up. According to Table 13.1, the only possible direction after »up« is »up«, so the cycle could never be closed. Hence, Case 2 is also impossible.
By contradiction, we have shown that the CDG is cycle-free, and we apply Duato's theorem to R1.
13.5.6 Livelock Freedom

Palesi et al. [185] define that “livelock is a condition where a packet keeps circulating within the network without ever reaching its destination”. Hence the following definition.
Definition 31 (Livelock-Free) A routing algorithm is livelock-free if every packet reaches its destination after a finite number of hops.

Remark A routing algorithm consists of a routing function and a selection. R1 is such a routing function. If an adaptive routing function returns more than one link, the selection chooses one. The property livelock-free belongs to the routing algorithm. We call a routing function livelock-free if every routing algorithm with this routing function is livelock-free, independent of the selection.

Theorem 2 R1 is livelock-free.

Proof Assume there were two vertices, s and d, with the property that the routing R1 makes infinitely many steps and never reaches d starting from s. Under this assumption, at least one direction must be traveled infinitely many times. We do a case-by-case analysis over the cardinal directions and show that none of them can be traveled infinitely often, which contradicts the assumption that there could be a livelock.

Case 1: »up« is traveled infinitely many times. By the definition of R1 (Definition 29), up is only used if vx = dx, vy = dy, and vz > dz, with v being the current vertex. Traveling up one layer preserves vx = dx and vy = dy and results either in vz = dz or vz > dz. The only possible direction after »up« is »up«. Since there are only ℓ < ∞ layers, d is reached after finitely many steps. Thus, Case 1 cannot occur.

Case 2: »down« is traveled infinitely many times. Since up cannot be traveled infinitely many times (Case 1), down cannot either. It is limited by the number of layers, ℓ, plus the number of times up is traveled.

Case 3: »east« and »west« are traveled infinitely many times. Similar to Case 2, infinite steps to the west imply infinite steps to the east and vice versa. From the definition of R1, we know:

– east and west are the only directions that affect the x-value of v;
– a step to the east is only done if vx < dx;
– a step to the west is only done if vx > dx;
– a step to the west or east is only done if vz ≥ dz.

We never step onto a router with vx = dx: if we reached a router with vx = dx, north, south, up, or down routing would reach the destination. Steps to the east or west are only done in the destination layer or below. In these layers, each row has a router at position dx. Routing from west to east and back without using one of these routers is impossible. Thus, Case 3 cannot occur.

Case 4: »north« and »south« are traveled infinitely many times. This case is analogous to Case 3.

None of the cases can occur. Thus, the assumption is wrong, and R1 is livelock-free.

Remark The proof requires that for u and v with down(u) = v it holds that up(v) = u, ux = vx, and uy = vy. It also requires the mesh topology.
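Theorem 2 can be spot-checked exhaustively on a small instance. The sketch below re-implements one routing step of Definition 29 on a homogeneous 4×4×3 stacked grid (hypothetical sizes); note that on this simplified grid the route is even hop-minimal, which does not hold for the real heterogeneous topology (Sect. 13.5.3).

```python
from itertools import product

def step(v, d):
    """One hop of Z+(XY)Z- routing (re-sketch of Definition 29)."""
    if v == d:
        return v
    if v[2] < d[2]:
        return (v[0], v[1], v[2] + 1)                            # down first
    if v[0] != d[0]:
        return (v[0] + (1 if v[0] < d[0] else -1), v[1], v[2])   # X
    if v[1] != d[1]:
        return (v[0], v[1] + (1 if v[1] < d[1] else -1), v[2])   # Y
    return (v[0], v[1], v[2] - 1)                                # up last

# Exhaustive check: every pair is delivered within its Manhattan distance,
# so no packet circulates forever (livelock freedom on this instance).
nodes = list(product(range(4), range(4), range(3)))
for s, d in product(nodes, nodes):
    v, hops = s, 0
    while v != d:
        v = step(v, d)
        hops += 1
        assert hops <= 12, "hop bound exceeded: livelock"
    assert hops == sum(abs(a - b) for a, b in zip(s, d))
print("all", len(nodes) ** 2, "pairs delivered")
```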
13.6 Heterogeneous Router Architectures

In Sect. 13.4.2, a fundamental limitation of routing in heterogeneous 3D SoCs was revealed: throughput is limited by the slowest clock frequency of a router along the path of a packet. The limitation is only found in heterogeneous 3D integration, as the clock frequencies of routers vary. The deviation of clock frequencies in a homogeneous 3D or 2D system is much smaller, so this principle is not relevant there.7 Only novel router architectures can tackle this limitation. We assume an integer ratio cf between the clock frequencies, with a constant phase shift. We propose a co-design of router architectures and routing algorithms. Therefore, we use our finding that horizontal transmission in slower layers must be minimized. Our routing algorithm guarantees this: if a packet traverses multiple heterogeneous layers, it is always transmitted horizontally in the fastest layer. Thus, packets are directly forwarded from local ports or incoming upward ports to the vertical port in the direction of the fastest layer (i.e., down) in the slower router. In the opposite direction, packets from the downward direction are only transmitted to upward or local ports. We contribute router architectures that enable transmission along these paths through heterogeneous interfaces. The router transmits multiple flits in parallel, improving throughput. This architecture is called the high vertical-throughput router.
13.6.1 High Vertical-Throughput Router

In the faster layers, we use a standard router architecture. Only the vertical link architecture is changed in the slower layers, as explained in the next section. To achieve the desired parallelism, we exploit that PEs connected to the router's local port can provide multiple flits in parallel because the complete packet is available before transmission. The resulting router architectures are heterogeneously aligned with the technology. We modify a conventional input-buffered 3D router with a link width of N, as shown in Fig. 13.11. To achieve parallelism, the input buffers of vertical and local links can read/write up to cf flits in parallel, each N bits wide. The modified buffers are shown in Fig. 13.12. The crossbar can also read single flits or cf flits in parallel. Its modifications are shown in Fig. 13.13. Some turns cannot occur in the proposed routing algorithm (e.g., down to north, east, west, or south), which reduces the size of the crossbar. It transmits up to cf flits between local and vertical ports in parallel. For other connections, i.e., horizontal routes via the slower layer, the crossbar switches single flits between horizontal ports, the upward port, or the local port. The remaining (cf − 1)N lines of the output (to local or up) are zero, and a single flit is transmitted per cycle.
7 This is even true for globally asynchronous locally synchronous (GALS) systems.
Fig. 13.11 High vertical-throughput router architecture
Fig. 13.12 Modified input buffer
Fig. 13.13 Modified crossbar which allows routing cf flits between the local and vertical ports
Despite these modifications, the complexity of this router architecture will be reduced in the most common scenario with a slower mixed-signal layer and a faster digital layer. The routers in the mixed-signal layer do not have a port in the up direction. Therefore, the high throughput path only connects the down and local ports. The modified crossbar with cf N bits (Fig. 13.13, left-hand side) has two input and output ports. Thus, the local and down ports are directly connected without hardware costs.
13.6 Heterogeneous Router Architectures
301
Compared to the baseline router, this reduces the router's area costs. Furthermore, only two input buffers are modified for parallelism. The complexity can be further reduced by considering the possible paths in the routing algorithm: single flits never travel from horizontal to vertical ports, allowing further reductions of the crossbar and arbiter sizes.
13.6.2 Pseudo-Mesochronous High-Throughput Link

The architecture of the vertical links is modified for increased throughput. We propose two designs. The first option targets future technologies with increased TSV yield. The second option can be manufactured at the time of writing this book, but its structure is more complex. Both link architectures require that a circuit in the mixed-signal layer is driven by a clock from a layer in a digital node.
13.6.2.1 Implementation for Modern/Future TSV Technologies

In this implementation, a large TSV array transmits cf N bits in parallel. Through-silicon-via yield is relatively low at the time of writing this book; thus, this architecture cannot yet be manufactured practically and targets future technologies with a higher yield. Before transmission, the data is parallelized in the faster layer by a shift register. The link architecture is shown in Fig. 13.14a. From the slower to the faster layer, data is transmitted in parallel at the slower clock frequency via the TSV array. The data is stored in the input buffers, modified to fetch cf N bits in parallel at the slower clock frequency. This does not require routing the slower clock to the faster layer or vice versa, because the valid line is high only every cf clock cycles with respect to the fast clock.
13.6.2.2 Implementation for Conservative TSV Technology

The previous solution requires a higher TSV yield than currently available. Hence, we propose a more intricate solution for conservative TSV technology: we clock parts of the logic in the slower layer at the faster clock speed. The clock is transmitted via a clock TSV from the faster layer. This architecture is shown in Fig. 13.14b and c. The faster clock triggers the shift registers in the slower layer. For the upward path (Fig. 13.14b), a shift register is filled with data from the faster layer at its clock speed. Then, the flits are transmitted in parallel to the input buffers of the router in the upper layer at the slower clock frequency. Alternatively, the input buffers in the upper layer can be directly clocked at the higher rate by the faster clock, which removes the shift register's costs.
Fig. 13.14 High-throughput connection between two routers in heterogeneous technologies using shift registers and TSVs. (a) Upward, with large TSV array. (b) Upward, with small TSV array and shift register. (c) Downward, with small TSV array and shift register
For the downward path (Fig. 13.14c), in which data is sent from a slower layer to a faster layer, the shift register is loaded in parallel with up to cf flits at the slower clock rate. Then, the data is shifted out at the faster clock speed. Each flit is transmitted via an N-bit TSV array to the input buffers in the faster layer.
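The downward path of Fig. 13.14c can be modeled behaviorally. The sketch below is an assumption-laden abstraction: cf, the flit labels, and the packet length are illustrative, and flow control is ignored.

```python
# Behavioral sketch of the downward shift-register link (Fig. 13.14c):
# up to cf flits are loaded in parallel at each slow-clock edge, then
# shifted out one flit per fast-clock cycle toward the faster layer.

def downward_link(flits, cf):
    """Yield (fast_cycle, flit) pairs as seen at the faster layer's input."""
    fast_cycle = 0
    for i in range(0, len(flits), cf):
        chunk = flits[i:i + cf]        # parallel load at the slow clock edge
        for flit in chunk:             # shift out at the fast clock
            yield fast_cycle, flit
            fast_cycle += 1
        fast_cycle += cf - len(chunk)  # idle fast cycles if the chunk is short

packet = ["H", "B0", "B1", "B2", "B3", "T"]  # head, body, and tail flits
out = list(downward_link(packet, cf=4))
print(out)  # one flit per fast cycle: cf times the slow layer's flit rate
```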
13.7 Low-Power Routing in Heterogeneous 3D ICs

We proposed a heterogeneous routing algorithm, Z+ (XY)Z−, motivated by the idea of improving the network performance in heterogeneous systems. In this
section, we will show how this routing technique also effectively improves the NoC power consumption in heterogeneous systems. In a heterogeneous 3D NoC, the power consumption must be reduced without increasing the number of energy-expensive transmissions over vertical links. Hence, a power-ideal routing must be heterogeneity-aware by minimizing the hops in the power-inefficient mixed-signal layer without detours. Z+ (XY)Z− routing satisfies all these requirements. We will quantify the power gains of Z+ (XY)Z− routing for uniform random traffic and compare them against conventional XYZ routing. Thereby, we consider only two layers in the NoC, one implemented in an advanced technology node and one implemented in a conservative technology node. Also, we consider a fully connected 3D torus network topology such that we do not have to distinguish between cores located in the middle and at the edge of a layer. This results in a simpler and more intuitive mathematical model for the power savings. However, all findings of this section are also valid for other topologies and layer counts (only the resulting mathematical models become more complex). Let m and n be the x and y size of the considered 2-layer NoC torus, respectively. A PE never sends data to itself. Thus, for a packet injected at location (xsrc, ysrc, zsrc), 2mn − 1 different destinations are possible, out of which m · n are located in the other layer. Hence, the average number of inter-layer/vertical hops in a packet path is, for both routing schemes (XYZ and Z+ (XY)Z−) and random traffic, equal to

Δz = mn / (2mn − 1) ≈ 1/2    for large m and n.    (13.13)
Next, we derive a formula for the average number of horizontal (i.e., x or y) hops of a packet with random source and destination. Due to the symmetry of the torus, each router can be regarded as located in the middle of the mesh. To the west and east, (m − 1)/2 columns of n routers are located each. To the north and south, (n − 1)/2 rows of m routers are located each. 4m routers (2m per layer) are exactly i steps in y away from a router, where i is bounded between 1 and (n − 1)/2. Analogously, 4n routers are exactly i steps in x away, where i is bounded between 1 and (m − 1)/2. For uniform random traffic, the average number of hops in x or y per transmitted flit is

ΔXY = 1/(2mn − 1) · ( 4n · sum_{i=1}^{(m−1)/2} i + 4m · sum_{i=1}^{(n−1)/2} i )
    = 1/(2mn − 1) · ( n · (m² − 1)/2 + m · (n² − 1)/2 )
    ≈ (m + n)/4    for large m and n.    (13.14)
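The averages in Eqs. (13.13) and (13.14) can be cross-checked by brute force on a small two-layer torus; odd m and n are assumed so that (m − 1)/2 is an integer.

```python
from itertools import product

# Brute-force check of Eqs. (13.13) and (13.14) on a 2-layer m x n torus
# under uniform random traffic. The torus is vertex-transitive, so a
# single source suffices.
m, n = 5, 7

def torus_dist(a, b, size):
    d = abs(a - b)
    return min(d, size - d)

src = (0, 0, 0)
dests = [d for d in product(range(m), range(n), range(2)) if d != src]

avg_z  = sum(abs(src[2] - d[2]) for d in dests) / len(dests)
avg_xy = sum(torus_dist(src[0], d[0], m) + torus_dist(src[1], d[1], n)
             for d in dests) / len(dests)

assert abs(avg_z - m * n / (2 * m * n - 1)) < 1e-9                 # Eq. (13.13)
assert abs(avg_xy - (n * (m**2 - 1) / 2 + m * (n**2 - 1) / 2)
                     / (2 * m * n - 1)) < 1e-9                     # Eq. (13.14)
print(avg_z, avg_xy)
```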
For XYZ routing, the mean number of horizontal hops in x or y in the mixed-signal versus the digital layer is
ΔXY,mixed-signal = ΔXY,digital = 0.5 · ΔXY.    (13.15)
By pushing data as early as possible to the digital layer, the mean hops per layer are asymmetric for Z+ (XY)Z− routing. Only if both the source and the destination are in the mixed-signal layer is it used for horizontal/XY routing. The probability that the source of a packet is located in the mixed-signal layer is mn/(2mn) = 1/2, while the conditional probability that the destination is also located in the mixed-signal layer is (mn − 1)/(2mn − 1) ≈ 1/2. Thus, the average numbers of horizontal hops in the mixed-signal and the digital layer with Z+ (XY)Z− are

ΔXY,mixed-signal = (1/2) · ((mn − 1)/(2mn − 1)) · ΔXY ≈ (1/4) · ΔXY,    (13.16)
ΔXY,digital = (1 − (1/2) · ((mn − 1)/(2mn − 1))) · ΔXY ≈ (3/4) · ΔXY.    (13.17)
The dynamic NoC energy consumed to route N flits with XYZ routing becomes

EXYZ ≈ N · ( (m + n)/4 · ( (1/2) · Emixed-signal + (1/2) · Edigital ) + (1/2) · E3D ),    (13.18)

where Emixed-signal and Edigital are the energies consumed to exchange a flit between two routers in the mixed-signal layer and between two routers in the digital layer, respectively. E3D is the energy consumed transferring a flit between two routers located in different layers, mainly determined by the energy consumption of the 3D link. For Z+ (XY)Z− routing, the energy consumption is

EZ+(XY)Z− ≈ N · ( (m + n)/4 · ( (1/4) · Emixed-signal + (3/4) · Edigital ) + (1/2) · E3D ).    (13.19)
Thus, Z+ (XY)Z− routing saves energy, since Emixed-signal is larger than Edigital . Assuming a 16 × 16 × 2 NoC with E3D = 2Emixed-signal = 8Edigital (which are typical values), the savings would be 25%, which outperforms even some of the sophisticated data-encoding techniques presented in Part IV of this book.
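Plugging the book's example numbers into Eqs. (13.18) and (13.19) reproduces the 25% figure (per-hop energies normalized to E_digital = 1):

```python
# Energy of XYZ vs. Z+(XY)Z- routing per flit, Eqs. (13.18)/(13.19),
# for the 16 x 16 x 2 example with E_3D = 2*E_ms = 8*E_dig.
m = n = 16
E_dig, E_ms, E_3d = 1.0, 4.0, 8.0
hops_xy = (m + n) / 4          # average horizontal hops, Eq. (13.14)

E_xyz = hops_xy * (0.50 * E_ms + 0.50 * E_dig) + 0.5 * E_3d  # Eq. (13.18)
E_z   = hops_xy * (0.25 * E_ms + 0.75 * E_dig) + 0.5 * E_3d  # Eq. (13.19)

savings = 1 - E_z / E_xyz
print(f"{savings:.0%}")  # 25%
```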
13.8 Evaluation

13.8.1 Model Accuracy

We evaluate the accuracy of the model fit for the proposed area/timing models against synthesis results of a 3D router with 2 virtual channels (VCs), 4-flit buffers, credit-based flow control, wormhole switching with decentralized arbiters, and the XYZ routing algorithm. The synthesis is conducted with Synopsys Design Compiler
Fig. 13.15 Area comparison of commercial 180-nm mixed-signal GP (green) and ULV (orange) node with 45–130-nm digital GP nodes for synthesized router and model fit
Fig. 13.16 Timing comparison of commercial 180-nm mixed-signal GP node (green) and ULV node (orange) with 45–130-nm ULV digital nodes for a synthesized NoC router and model fit with predictive 5 GHz maximum achievable clock speed
for a commercial 180-nm mixed-signal technology and commercial 45-nm to 130-nm digital technologies. We use both a general-purpose (GP) and an ultra-low voltage (ULV) variant of the mixed-signal technology for all structure sizes. We show synthesis results and model fit for the area model in Fig. 13.15. Mathematica 10 [248] is used for curve fitting. For the 180-nm GP technology, we obtain a non-ideality factor α = 3462.7 and an offset α̂ = 29.8 with a root-mean-square error (RMSE) of 0.1286. The ultra-low voltage technology yields α = 13.2 and an offset α̂ = 0.124 with an RMSE of 0.1414. The timing model is shown in Fig. 13.16. We use a predicted maximum achievable clock frequency of 5.0 GHz. The results for GP nodes are β = 32.85, β̂ = 7.88, β̃ = 0.76, and β̄ = 1.26 with an RMSE of 0.30. For ULV nodes, the model yields the parameters β = 77.45, β̂ = 2.48, β̃ = 0.76, and β̄ = 2.77, with an RMSE of 0.71.
13 Heterogeneous Routing for 3D NoCs
Fig. 13.17 Latency speedup of Z+ (XY)Z− to conventional XYZ in simulations and model for packets from any node in a 4 × 4 NoC in 180-nm layer in commercial mixed-signal technology to any node in layers in different 130–45-nm commercial digital node
13.8.2 Latency of Routing Algorithm Z+ (XY)Z−

We evaluate the performance of the routing algorithm Z+ (XY)Z− by using our zero-load latency model. Z+ (XY)Z− reduces the latency of packets from nodes in the mixed-signal layers to nodes in the digital layers. We compare it to conventional XYZ routing. A 4 × 4 NoC implemented in commercial 180-nm mixed-signal technology is combined with an NoC with more nodes, sized according to the area model and the synthesis results (Eq. 13.2), in one commercial 130-nm to 45-nm digital technology. The latency speedup is calculated using ΔH from Eq. (13.5), and it is determined via simulations in Ratatoskr with 4 VCs, 16-flit buffers, and wormhole switching. The results are shown in Fig. 13.17 for all available hop distances in the layer in mixed-signal technology. Model and simulation yield identical results, as expected, since the model is accurate under zero load. The results show a speedup between 1.5× and 6.5×. It will be larger if a more advanced digital technology accompanies the mixed-signal one. This is consistent with the expectations (Sect. 13.4). Z+ (XY)Z− yields this speedup without implementation costs (Sect. 13.8.4).
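The qualitative behavior of this speedup can be illustrated with a simplified zero-load model (our own sketch, not the fitted model of Eq. (13.5); the per-hop delays below are illustrative placeholders). A packet from a mixed-signal source to a digital destination pays its horizontal hops either in the slow or in the fast layer:

```python
# Simplified zero-load latency comparison, XYZ vs. Z+(XY)Z-, for a packet
# from a mixed-signal-layer source to a digital-layer destination.
# t_ms, t_dig, t_3d are illustrative per-hop delays, not fitted values.

def latency_xyz(h_xy, t_ms, t_dig, t_3d):
    # XYZ resolves all horizontal hops first, i.e., in the slow source layer.
    return h_xy * t_ms + t_3d

def latency_z_xy_z(h_xy, t_ms, t_dig, t_3d):
    # Z+(XY)Z- descends to the digital layer first and routes XY there.
    return t_3d + h_xy * t_dig

# Example: routers assumed 6x faster in the digital layer.
t_ms, t_dig, t_3d = 6.0, 1.0, 1.0  # arbitrary time units
for h in range(1, 7):
    s = latency_xyz(h, t_ms, t_dig, t_3d) / latency_z_xy_z(h, t_ms, t_dig, t_3d)
    print(f"hop distance {h}: speedup {s:.2f}x")
```

As in Fig. 13.17, the speedup grows with the hop distance and with the clock-speed ratio between the layers.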
13.8.3 Throughput of High Vertical-Throughput Router

If architects use the high vertical-throughput router for transmission from a slower to a faster layer, the slower clock frequency will not determine the throughput, since the packet is transmitted at once. This is shown in Fig. 13.18, left-hand side. Analogously,
Fig. 13.18 Throughput of high vertical-throughput router
for transmission in the opposite direction, the slower clock frequency also does not dominate the throughput. This is shown in Fig. 13.18, on the right-hand side.
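The bandwidth-matching argument behind this can be sketched as follows (our own illustration, not code from the book): the slow router forwards cf flits in parallel per slow-clock cycle, where cf is the clock ratio between the layers; the frequency values follow Sect. 13.8.4.

```python
# The high vertical-throughput router forwards cf flits in parallel per
# slow-clock cycle, so the vertical bandwidth equals what a router at the
# fast layer's clock would provide serially.

FLIT_WIDTH = 16  # bits, as in the evaluated routers

def vertical_throughput_bps(f_slow_hz, cf):
    """Bits/s across the layer boundary with cf flits in parallel."""
    return cf * FLIT_WIDTH * f_slow_hz

f_slow = 150e6  # 150-MHz mixed-signal router
for f_fast in (300e6, 600e6):
    cf = int(f_fast / f_slow)
    matched = vertical_throughput_bps(f_slow, cf) == FLIT_WIDTH * f_fast
    print(f"cf = {cf}: matches fast-layer bandwidth: {matched}")
```

For cf = 2 and cf = 4, the parallel transfer exactly matches the fast layer's serial bandwidth, so the slow clock drops out of the throughput equation.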
13.8.4 Area Costs

We evaluate the area costs of the proposed high vertical-throughput router using Z+ (XY)Z− routing in a commercial 180-nm ULV mixed-signal technology. We assume a 4 × 4 × 2 NoC (one digital layer accompanies the mixed-signal layer). As a baseline, we synthesize a standard router using conventional XYZ routing, including the crossbar optimization from Sect. 7.2.2 on Page 144 (which removes impossible turns). The flit width in all routers is 16 bit, and the input buffers are 4 flits deep. The routers implement credit-based flow control. Virtual channels are only used in digital layers. Both the baseline standard router and the high vertical-throughput router support a maximum frequency of 150 MHz. The frequency is identical, as the routing algorithm is not part of the critical path. Z+ (XY)Z− routing has 0% overhead compared to conventional XYZ routing. The costs of the high vertical-throughput router depend on the clock-frequency difference between routers in mixed-signal and digital technology. For a clock frequency in the digital layer of 300 MHz (cf = 2), the silicon area required for the crossbar and the routers reduces by 7.29%. For routers clocked at 600 MHz (cf = 4), the area increases by 4.36%. A higher scaling factor is impossible for 4-flit-deep input buffers. The area increase of the router also depends on the implementation of the switch arbitration, i.e., centralized or decentralized; it is typically between 3 and 4%. To summarize, either the router area is reduced while the throughput is doubled, or the area slightly increases while the throughput is quadrupled.
13.8.5 Power Savings

To better understand the gains in terms of dynamic power/energy consumption, we inject random data at a rate of 0.01 relative to the clock speed of the slowest
[Figure content, top to bottom: sensing die with CIS (180-nm technology); mixed-signal die with 3 × 3 ADCs (180-nm technology); digital die with 3 × 2 CPUs, 3 × 2 SIMDs, and 3 × 1 DSAs (90-nm technology); digital die with 3 × 4 central processing units (CPUs) (90-nm technology)]
Fig. 13.19 Three-dimensional vision-system on a chip case study [260]. The CIS directly connects to ADCs. A network on a chip connects the other components in a 3 × 3 × 4 mesh
layer into the network (random data are the worst case for energy consumption). We run the simulations for 40k cycles. For each router, we measure the number of transmitted flits and multiply the value with the energy per transmitted flit, obtained from synthesis results. Since transmissions over TSV arrays consume more energy than over 2D wires, we consider an extra energy consumption of 100 pJ for each traversal of a vertical link and 10 pJ for each traversal of a horizontal link. We report the power savings for the proposed Z+ (XY)Z− routing scheme versus a standard XYZ routing scheme. The results show that the proposed low-power technique yields a power reduction of over 17%. This is a significant power improvement. It is important to mention that the analyzed NoC has a relatively small xy dimension of only 4 × 4. Hence, the potential savings of moving the horizontal routing into the cheaper layer for all 3D paths are smaller than for larger mesh dimensions (see Eqs. (13.18)–(13.19)). Thus, the reported 17% power savings are even pessimistic. This result underlines that Z+ (XY)Z− routing is well suited to reducing power consumption.
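The energy accounting described above can be sketched as follows (our own illustration, not the book's tooling): per-router flit counts are weighted with per-flit energies from synthesis, and each link traversal adds a fixed energy. The 100 pJ vertical-link and 10 pJ horizontal-link energies follow the text; the per-flit router energies are invented placeholders.

```python
# Post-simulation energy accounting: sum of per-router flit counts times
# per-flit energy, plus a fixed energy per link traversal.

def noc_energy_pj(flits_per_router, energy_per_flit_pj,
                  vertical_traversals, horizontal_traversals):
    """Total dynamic energy in pJ for one simulation run."""
    router_energy = sum(flits * energy_per_flit_pj[router]
                        for router, flits in flits_per_router.items())
    link_energy = 100.0 * vertical_traversals + 10.0 * horizontal_traversals
    return router_energy + link_energy

# Hypothetical measurement (flit counts and pJ/flit values are invented):
flits = {"router_ms": 1000, "router_dig": 3000}
e_flit = {"router_ms": 8.0, "router_dig": 2.0}
total = noc_energy_pj(flits, e_flit,
                      vertical_traversals=500, horizontal_traversals=4000)
print(total)  # 8000 + 6000 + 50000 + 40000 = 104000.0 pJ
```

Dividing such a total by the simulated time yields the dynamic power figures compared in this section.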
13.8.6 Case Study

Similar to the case studies from Part IV, we analyze our approach for a typical heterogeneous 3D SoC for vision applications (vision-system on a chip (VSoC)). This case study is based on [260] with four layers. The case-study IC is schematically depicted in Fig. 13.19. The first layer is the sensing die, implementing a 180-nm CIS. The second layer implements nine ADCs and three analog accelerators [119] in a 180-nm mixed-signal node. The third layer implements 6 processors and 6 single instruction multiple data (SIMD) domain-specific accelerator (DSA) units in a 90-nm digital node. The fourth layer hosts 12 processor cores in a 90-nm digital node. The first and second layers are connected via point-to-point links.
The second, third, and fourth layers are connected via an NoC with 32-bit-wide links, 8-flit buffers, and 4 VCs. Packets have 32 flits. Routers in the digital layers are clocked at 1 GHz and in the mixed-signal layer at 0.5 GHz. The chip executes a face-recognition pipeline on 720p images. The ADCs send the raw digital image to the processors in the third layer, which apply the Bayer filter. The SIMD units reduce the resolution by 4× to increase feature-extraction speed. The result is transmitted to the analog accelerators in the second layer, which extract features using the Viola-Jones algorithm [242]. The region of interest is transmitted to the fourth layer, in which the processors execute the Shi-Tomasi algorithm [222] to find features to track, and the Kanade-Lucas-Tomasi algorithm [236] tracks them. Work is split equally among the available resources in each step. We compare Z+ (XY)Z− with conventional XYZ routing using the application traffic. We simulate 3M clock cycles in the digital layers and 1.5M in the mixed-signal layer. We measure 145.91 ns average flit latency for conventional XYZ routing and 64.46 ns for our optimized routing. This result equates to a speedup of 2.26×. We calculated a theoretical speedup of 2.28× under zero load, which demonstrates the quality of our modeling. The average delay for whole packets is reduced from 229.23 ns to 123.07 ns, a speedup of 1.86×. The dynamic power consumption is reduced by over 15%.
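The quoted speedups follow directly from the measured latencies, which can be verified in one line each (values taken verbatim from the text):

```python
# Arithmetic check of the reported case-study speedups.
flit_lat_xyz, flit_lat_opt = 145.91, 64.46   # ns, average flit latency
pkt_lat_xyz, pkt_lat_opt = 229.23, 123.07    # ns, average packet delay

flit_speedup = flit_lat_xyz / flit_lat_opt
pkt_speedup = pkt_lat_xyz / pkt_lat_opt
print(round(flit_speedup, 2), round(pkt_speedup, 2))  # -> 2.26 1.86
```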
13.9 Discussion

13.9.1 Model Accuracy

Figures 13.15 and 13.16 show the area and timing model for an exemplary NoC router. Both models yield good fits in our experiments. Due to the physical foundation of the area model, it has a small RMSE of at most 0.1414. The timing model is empirical and thus yields a less accurate fit than the area model (increased RMSE of 0.71). Also, the model converges to the target maximum clock frequency, as desired. If more modern technology nodes were available, our model fit would improve further. As demonstrated by the small RMSE, the accuracy of the proposed area/timing models is sufficient to evaluate the influence of heterogeneous interconnects. We further evaluate the expressiveness of the models in terms of routing. As shown in Fig. 13.4, we use data from the model fit to calculate the propagation speed ω for predictive technologies. Comparing the predictive technologies to the synthesis results for the 180-nm commercial mixed-signal and 130-nm to 45-nm commercial digital technologies yields an error between 1.4% and 7.8%. This small error supports the validity of our models. The accuracy of the model for transmission latency is shown in Fig. 13.17. Model and simulation yield identical results, so the models are precise under zero load. Thus, there is no urgent need to model the behavior under load to develop routing
for NoCs targeting heterogeneous 3D SoCs. The tiny deviation between simulation and model in the case study further supports the validity of the chosen approach.
13.9.2 Power-Performance-Area Evaluation of Heterogeneous Routing

We evaluated the power-performance-area (PPA) of our routing optimization and the co-designed router architectures for heterogeneous 3D integration. The main limitation of heterogeneity (limited throughput from clock differences) is more significant for larger differences between the mixed-signal and the purely digital technology. Our worst case is a combination of 180-nm commercial mixed-signal technology and 45-nm commercial digital technology. These results are valid for any other combination of technology nodes with a similar relative technology scaling factor Ξ. The baseline for comparison is a conventional, homogeneous NoC with conventional XYZ routing. Routing following Z+ (XY)Z− provides up to 6.5× latency reduction for packets from routers in the mixed-signal node to routers in the digital layer (Fig. 13.17). The vertical high-throughput router offers an increased throughput of up to 2×, with router-area savings compared to a standard router for conventional XYZ routing. Further, a 4× throughput increase is possible with a small router-area increase of ≤ 1%. If an even larger throughput increase is desired, additional area costs become necessary. The overhead for the link area depends on the TSV technology. Both improvements reduce area costs and are beneficial for monolithic stacking. For a real-world benchmark with the image-processing case study, we achieve a 2.26× flit-latency reduction in simulation. This speedup demonstrates the performance benefit of the proposed approach for typical applications of heterogeneous 3D ICs. The power consumption of the NoC can be reduced by up to 25%. These 25% are reached for random traffic and large mesh dimensions. For smaller mesh dimensions and less ideal traffic scenarios, our experiments show power savings of at least 15%.
To summarize, Z+ (XY)Z−, combined with the proposed router architectures, has negligible area overhead and better performance and power characteristics than state-of-the-art approaches in both theoretical and practical evaluations.
13.10 Conclusion

In this chapter, we showed that conventional routing algorithms and router architectures pose severe limitations for heterogeneous 3D interconnect architectures. Notably, we have shown that the varying throughput and latency of NoCs in layers in disparate technologies drastically degrade network performance.
We applied the models to develop universal principles for routing in heterogeneous 3D ICs. Further, we developed a routing algorithm and a co-designed router architecture implementing the most promising principle. For an exemplary SoC, with layers in commercial 45-nm digital and commercial 180-nm mixed-signal technology, we achieved a latency reduction of up to 6.5× at negligible hardware-area overhead compared to conventional approaches. Moreover, our technique reduced the power consumption by up to 25%. Our vertical high-throughput router architecture overcomes the throughput limitation and increases throughput by up to 2× at 6% reduced router hardware costs for the same exemplary set of technologies. Thus, a co-design of routing algorithms and router architectures based on the proposed principles exploits heterogeneity for performance advantages and mitigates the limitations of conventional routing algorithms without drawbacks in implementation costs. The area of vertical links connecting heterogeneous routers is an open issue, which will be tackled in the next chapter.
Chapter 14
Heterogeneous Virtualisation for 3D NoCs
In the previous Chap. 13, we introduced routing algorithms for networks on chips (NoCs) in heterogeneous three-dimensional (3D) integrated circuits (ICs). We observed that routers in digital layers can be clocked faster. Thus, we proposed a heterogeneous routing to improve latency by increasing the number of hops in layers fabricated in advanced technology nodes, in exchange for a hop reduction in layers fabricated in less-scaled technology nodes (e.g., mixed-signal layers) with worse power-performance-area (PPA) logic characteristics. However, this routing alone could not improve the network throughput. Slower routers will limit the maximum throughput along a packet path. Moreover, each heterogeneous packet path (i.e., a path between a core in a fast technology and a core in a slow technology) has at least one slow router, irrespective of the routing. We solved this limitation with a modified router architecture that increases the vertical packet throughput by transmitting multiple flits in parallel in slower-clocked routers. Our proposed router architecture from the previous chapter still exhibits two significant disadvantages. First, it either requires many more through-silicon vias (TSVs) than a standard router implementation (which threatens the manufacturing yield) or a mesochronous clocking strategy that is very challenging from physical-design aspects. Another drawback comes from the area and power mismatch. The physical implementation of a digital NoC router results in higher energy requirements and area footprints in less-scaled technologies. Consequently, we have an area and energy-per-transmitted-flit mismatch (in addition to the timing mismatch discussed before), making routers more expensive in, for example, mixed-signal layers. If this area mismatch is not addressed, we often cannot implement regular (same-sized) mesh arrangements in all layers.
In a fast digital technology, we can theoretically fit more routers per square millimeter of silicon than in a less-scaled technology, while the available silicon area in each 3D-IC layer is the same. Such heterogeneity in the dimensions of the NoC meshes results in poorer network connectivity [179]. The energy mismatch between routers yields heat problems.
Our proposed router architecture from the previous chapter makes this problem even worse. The applied acceleration of the vertical flit routing in the slower layers increases the power consumption and area of a router in the costly layers even more versus routers implemented in ultimately scaled technology nodes. Thus, the proposed architecture worsens the area-energy mismatch of routers in different technologies rather than addressing the problem. In this chapter, we fully overcome both remaining issues of our existing technique. We improve the throughput without needing a non-standard router architecture, thereby avoiding yield and physical-design issues. Moreover, we reduce the router complexity in less-scaled technologies through micro-architectural adjustments. Thereby, we effectively close the technology gap through a heterogeneous router architecture. The idea of the proposed architectural changes is based on an intelligent pruning of virtual channel (VC) resources in layers implemented in less-scaled technology nodes. The remainder of this chapter is structured as follows. In Sect. 14.1, we briefly revisit the issues addressed by this chapter. Afterwards, our proposed technique is presented in Sect. 14.2 and evaluated in Sect. 14.3. Finally, the chapter is concluded.
14.1 Problem Description

The most critical challenges for NoCs in heterogeneous 3D ICs are the varying area, power, and clock frequencies of routers, as stated multiple times in this book. We observed that the clock frequency, area, and power of uniform (homogeneous) routers synthesized for disparate technologies vary substantially, as expressed by our models in Chaps. 12 and 13. The proposed scaling factor is between 2× and 16× when combining digital and mixed-signal nodes with a difference in feature size of 1:2 and 1:4, respectively, as shown in Fig. 13.16 on Page 305. This effect can be generalized to 15-nm digital and 30-nm or 60-nm mixed-signal technologies, respectively. This result implies that the same architecture can be clocked up to 16× faster in a digital technology than in a mixed-signal technology and is still significantly smaller and consumes less energy per transmitted flit. The massive clock-speed variations have two consequences. First, the latency of a packet transmission varies between layers. Second, the throughput of a packet transmission is limited by the slowest-clocked router along the path. The consequence of the varying power consumption is that the dissipated energy per transmitted flit varies between layers. Rather than suffering from this technological heterogeneity, we want to exploit it to improve the NoC's PPA characteristics. Hence, a possibly large part of the packet path should be located in the fast and power-efficient digital layers (while still having paths of minimal length) to reduce latency and energy consumption. The throughput limitation can be overcome by accelerating flit transmission in all slower routers through hardware acceleration for heterogeneous packet paths (i.e.,
paths where either the source or the destination router is located in a less-advanced technology than the other). The previous chapters solved this by combining an improved network architecture and a technology-aware, minimal routing algorithm, routing as much as possible over layers implemented in an advanced technology node. The proposed network architecture is heterogeneous, i.e., the router architecture varies between layers. The vertical high-throughput router implemented in the slower technology node can route multiple flits in parallel between its local, up, and down ports. This architecture boosts the throughput of all heterogeneous paths for our Z+ (XY)Z− routing algorithm. The routing uses the fastest layer for all horizontal (i.e., XY) hops of any heterogeneous path. Thereby, only the vertical and local flit routing had to be optimized/accelerated in the slower layers to achieve the maximum throughput. One disadvantage of our proposed technique is that it results in yield or physical-design issues. To increase the vertical throughput over the vertical TSV links, without the possibility to increase the clock frequency of the slower router, we either need to implement more TSVs, which has dramatic impacts on the yield, or use the clock from the faster layer at both ends of the TSV array. The latter is a headache for the physical design, as the clock from the faster layer has to be routed into the slower layer just for the vertical TSV links. Another disadvantage of our technique from the previous chapter is that the vertical throughput acceleration in the slower routers further impairs the area and timing of these routers. Also, the energy consumption per effectively transmitted flit increases even further in the slower layers. We offset this overhead through other optimizations in the router architecture in the previous chapter, such as removing forbidden turns from the arbiter and crossbar logic.
However, these optimizations are effective on all layers and could also be applied to a baseline router architecture. Thus, our router architecture from the previous chapter typically results in an even more significant timing, area, and energy gap between routers in advanced and less-scaled technologies. Thus, we need a technique to close the throughput gap of routers that, at the same time, also (partially) closes the energy and area gap. Moreover, the technique must not entail any modifications to the vertical/TSV link architecture compared to a standard router. A naive alternative approach to close the throughput gap would be to move all router logic to the digital layer. However, this flips the area and power gap in the other direction. It also increases the number of power-costly TSV data transmissions, as all paths between two mixed-signal components would have to be routed over at least two vertical links. These costly vertical transmissions are not needed in a network architecture that implements routers in all layers. Additionally, moving all routers to one layer would result in physical wiring congestion in the digital layer. Thus, a more elaborate technique is needed, presented in this chapter.
14.2 Heterogeneous Microarchitectures Exploiting Traffic Imbalance
Our technique is based on the traffic characteristics resulting from our Z+ (XY)Z− routing scheme from Chap. 13. As packets are routed as much as possible over layers implemented in aggressively scaled technology nodes (without taking detours), the link load on digital layers increases, and the load on mixed-signal layers decreases. This is demonstrated in Fig. 14.1, which shows the VC usage for uniform random traffic in a 4 × 4 × 2 NoC, with two VCs, a synchronous 0.5-GHz clock frequency, a 0.03-flits/cycle injection rate, and Z+ (XY)Z− routing, obtained with our Ratatoskr simulator. The results show that, generally, the routers in the mixed-signal layer transmit 50% fewer flits than those in the digital layer, as the routing algorithm pushes data to layers using advanced technologies as soon as possible, while leaving such layers as late as possible. This traffic imbalance between the layers allows closing the energy, timing, and area gap at once by implementing simpler router architectures in layers using less-scaled technology nodes and sophisticated router architectures in advanced technology nodes, where the NoC has to sustain a higher traffic load. One approach would be to reduce the depth of the NoC buffers in less-scaled layers. A similar technique was demonstrated to achieve significant power and area savings in Chap. 12 of this book. However, the buffer depth barely affects the achievable clock frequency. Thus, reducing the buffer depth fails to close the timing gap and would merely sacrifice network performance. To verify this, we synthesize an input-buffered router from register-transfer level (RTL) into the 15-nm Nangate Open Cell Library (OCL) targeting 0.5, 1, 2, and
Fig. 14.1 Virtual channel usage for uniform random traffic in a 4 × 4 × 2 NoC using Z+ (XY)Z− routing
Table 14.1 Maximum achievable clock frequency of a NoC router depending on micro-architectural parameters and the used technology node

         2 VCs, depth 16   2 VCs, depth 4   1 VC, depth 16   1 VC, depth 4
15 nm    1 GHz             1 GHz            2 GHz            2 GHz
30 nm    0.5 GHz           0.5 GHz          1 GHz            1 GHz
4 GHz, and scale the results to a 30-nm mixed-signal technology (cross-validated data). The maximum achievable clock frequencies (in 250-MHz steps) of the syntheses are listed in Table 14.1. Indeed, the buffer depth cannot improve the maximum achievable clock frequency, but removing VCs (i.e., a VC count of 1, resulting in no channel virtualisation) does. The synthesis results show that one can double the clock frequency by removing VCs in the mixed-signal layer. The reason is that multiple VCs require two extra arbitration stages for VC allocation (input VC and output VC), which are found to be in the critical path. Moreover, multiple VCs increase the wiring complexity within a router. Halving the VCs also halves the buffer spaces (which account for 84% of the area in the synthesized router with 2 VCs and a buffer depth of four). Thus, the most promising solution to close the technology gap for all metrics is to remove the virtualization of links in layers using less-scaled technology nodes. Figure 14.1 demonstrates that this is a valid approach, as the reduced load results in a data flow that rarely uses multiple VCs in the mixed-signal layer at the same time. Therefore, we propose such a heterogeneous NoC virtualization scheme, along with the Z+ (XY)Z− routing and a globally asynchronous locally synchronous (GALS) clocking scheme for the network. This approach achieves high performance while keeping the timing, power consumption, and area of routers in different technologies in balance. Note that our approach requires that message-dependent deadlocks are resolved based on end-to-end flow control and not based on a message reordering employing VCs. Thereby, we can reduce the number of VCs in a router to one while maintaining deadlock freedom (i.e., VCs are only used for performance enhancement). The requirement of solving deadlocks through flow control is not a severe constraint, as an end-to-end flow-control scheme is the superior approach [99].
14.3 Evaluation

This section evaluates the proposed technique in terms of all relevant metrics: area, energy consumption, and network performance. We investigate a 2-layer heterogeneous 3D IC with an 8 × 8 × 2 NoC arranged in a 3D-mesh topology. Two layers are sufficient to demonstrate the advantages of the proposed technique, which can be applied for any layer count in the 3D IC. The
investigated routers are input-buffered with 4-flit-deep buffers. The packet length is 16 flits. Simulations are done with our Ratatoskr tool from Chap. 7.
14.3.1 Area and Energy Consumption

We already know from Table 14.1 that removing VCs effectively closes the timing gap. In this section, we show that it also allows us to (partially) close the area and energy gap. The synthesis results for the standard NoC routers from Sect. 14.2 are used to validate that the removal of VCs significantly reduces the area footprint and energy consumption of a router. In the synthesis, the VC count of the baseline router is fixed to two in both layers. For larger VC counts in the baseline, we could close the area and energy gap even further by removing VCs in the mixed-signal layers. Thereby, we present conservative values in the following. The results are shown in Table 14.2. Removing the virtualization in the mixed-signal layer reduces the maximum area footprint of a router by 45%. Thereby, the area gap is reduced from 3.55× to 1.91×, which is a drastic improvement. The energy gap is closed in a similar fashion. Static power consumption (leakage) scales roughly with area, which is why it is not investigated separately. In summary, the results show that the proposed technique drastically closes the technology gap in all metrics, rather than widening it like the technique proposed in the previous chapter. However, in contrast to the frequency gap, the energy and area gaps are harder to close. For the considered baseline router with only two VCs and a technology-size difference of 2× between the layers, the proposed technique reduces the area and energy gap by 45% and 53%, respectively, but the gaps do not vanish. To close them (almost) entirely, either a baseline VC count of four is needed here, or a technology difference of only √2×. Alternatively, one could increase the VC count in the faster technologies while reducing the VC count in slower technologies. Thus, VCs would be "moved" rather than "removed".
This is plausible, as the applied routing algorithm pushes more traffic to the routers in the faster technologies, increasing their virtualisation needs, as shown in the previous section.

Table 14.2 Synthesis results for the VC optimization

Technology   VCs   Normalized area   Max. frequency [GHz]   Normalized energy per transmitted flit
15 nm        1     0.55              2                      0.47
15 nm        2     1                 1                      1
30 nm        1     1.91              1                      1.67
30 nm        2     3.55              0.5                    3.48
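The gap reductions follow from the normalized values in Table 14.2: the 30-nm router is compared against the 2-VC 15-nm baseline router (normalized area and energy of 1), since the digital layer keeps its two VCs. A back-of-the-envelope check (our own; the roughly one-point deviation from the quoted 45% and 53% presumably stems from rounding of the table entries):

```python
# Area/energy gap reduction implied by the Table 14.2 values:
# removing VCs in the 30-nm layer shrinks the area gap 3.55x -> 1.91x
# and the energy gap 3.48x -> 1.67x versus the 2-VC 15-nm baseline.
area_gap_2vc, area_gap_1vc = 3.55, 1.91
energy_gap_2vc, energy_gap_1vc = 3.48, 1.67

area_reduction = 1 - area_gap_1vc / area_gap_2vc
energy_reduction = 1 - energy_gap_1vc / energy_gap_2vc
print(f"area gap reduced by {area_reduction:.0%}, "
      f"energy gap by {energy_reduction:.0%}")  # -> 46%, 52%
```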
Fig. 14.2 Latency for the low-power optimization under uniform random traffic. (a) XYZ routing, synchronous routers. (b) XYZ routing, asynchronous routers. (c) Z+ (XY)Z− routing, synchronous routers. (d) Z+ (XY)Z− routing, asynchronous routers
14.3.2 Network Performance

To fully understand the performance implications of the proposed optimization technique, we run simulations for the 4 × 4 × 2 NoC for various flit injection rates. The results are shown in Fig. 14.2. We compare XYZ and Z+ (XY)Z− routing for synchronous and GALS systems, as well as homogeneous and heterogeneous VC configurations. In the baseline, the homogeneous network architecture, routers with two VCs outperform the ones without virtualization (i.e., 1 VC), which is expected. More interestingly, one can observe that Z+ (XY)Z− routing saturates earlier than XYZ routing in a synchronous setting (Fig. 14.2a, c). The reason is that the digital layer must cope with more load due to the Z+ (XY)Z− routing while being limited by the clock speed of the slower mixed-signal layer, without architectural compensation. The GALS clocking changes the picture. Both homogeneous architectures, with and without VCs, outperform their synchronous, XYZ-routed counterparts (Fig. 14.2d). The proposed heterogeneous architecture that combines 2 VCs in the digital die with 1 VC in the mixed-signal die for the Z+ (XY)Z− routing and a GALS clocking has the highest saturation point and the best latency metrics of all investigated variants. Thus, the proposed scheme has the best performance for random traffic patterns, while also showing the best power and area characteristics. These results prove that the proposed scheme fully overcomes the limitation of the technique presented in the previous chapter, as it improves all PPA metrics substantially without any sacrifices in terms of yield or physical-design complexity.
14.3.2.1 Case Study
So far, we have only demonstrated the gains of the proposed technique for synthetic benchmarks (random traffic), not for real non-uniform traffic scenarios. This section investigates the performance gains of the proposed technique for our case study from Sect. 13.8.6 on Page 308. We only consider the network performance here, as it is mathematically proven that Z+ (XY)Z- routing always selects the energy-wise cheapest minimal route (see Sect. 13.7 on Page 302). At the same time, the proposed technique only optimizes the energy consumption per transmitted flit of routers. The area savings reported in Sect. 14.3.1 are even entirely data-stream/application independent. Such a general statement about the performance gains of the proposed VC-architecture optimization is not possible. The advantage of the proposed technique is that the routing ensures that packets traverse as many fast routers as possible, while the VC removal increases the router frequency in the slower layer(s). However, Z+ (XY)Z- routing results in more congestion in the fast layers due to the outlined traffic imbalance, sometimes impairing performance.
14.3 Evaluation
Fig. 14.3 Average flit latency for low-power optimization in a 3D VSoC case study, normalized to XYZ-routing with asynchronous routers, 2+2 VCs (Large values are clipped)
The results for the case study are shown in Fig. 14.3. We plot the average flit latency normalized to XYZ routing with asynchronous routers and 2 VCs in all layers, as this configuration shows the best performance among the standard/baseline router architectures. Our proposed technique has the best performance of all investigated schemes. Compared to the reference non-heterogeneous routing scheme and microarchitecture with 1 VC or 2 VCs, we measure large performance improvements of 69% and 52%, respectively. Furthermore, our proposed NoC has lower area and power, as argued before. This shows the superiority of heterogeneity, both in routing and in VC count. The only architecture that can compete with the technique proposed in this chapter is the setup from Chap. 13 using Z+ (XY)Z- routing and the vertical high-throughput router without VC removal. It still shows 12% less performance, while it requires more area and energy due to the additional VCs in the mixed-signal layer and the acceleration of the vertical communication. These results for a real-world case study highlight the significant gains of a careful, technology-aware co-design of routing algorithms and microarchitecture, as done throughout this part of the book. The results validate that globally synchronous clocking yields worse performance in strongly technology-heterogeneous 3D ICs. Moreover, the performance advantage of the digital layer vanishes in this case. Hence, a routing overloading the digital layer is not plausible for performance reasons in this case. Consequently, VC-removal techniques as well as the proposed Z+ (XY)Z- routing scheme should only be used in combination with a GALS clocking scheme if performance is a prime optimization target. However, if power is the only prime optimization target, the techniques can also be interesting for synchronously clocked systems.
14.4 Conclusion

In this chapter, we proposed an additional micro-architectural optimization technique for NoCs in heterogeneous 3D ICs. In the previous two chapters, we applied the concept of architectural heterogeneity to buffers, routing, and crossbars. In this chapter, we extended it to VCs. We proposed an optimization technique that boosts every PPA metric of NoC routers implemented in less-scaled/slow technology nodes. The idea is based on the systematic exploitation of the traffic characteristics arising from the heterogeneous routing scheme proposed in the previous chapter of this book. It identifies rarely used VC resources and removes them at design time. This removal does not harm performance; it even boosts performance, as it makes it possible to overcome the technology gap in terms of clock speed. A real-life VSoC case study showed a performance improvement of 52% compared to the best reference scheme. Additionally, the area footprint and energy consumption per flit of the most costly routers can be reduced by over 40%. Thus, in this chapter, we proposed a technique that simultaneously improves all relevant metrics of the network without exhibiting any yield or physical-design complications. This contribution concludes the book's final chapter on heterogeneous NoC architectures, as every component of the router (buffers, routing, crossbar, and VCs) has been systematically optimized, considering the implications of technological heterogeneity. In the next chapter, we will again broaden the view and consider heterogeneous 3D integration at the level of application mapping and system on a chip (SoC) floor planning.
Chapter 15
Network Synthesis and SoC Floor Planning
In the previous chapters, we optimized the architecture of networks on chips (NoCs) for heterogeneous three-dimensional (3D) integrated circuits (ICs) by applying heterogeneous optimizations to buffers, routing, virtual channels (VCs), and microarchitecture. In this chapter, we will broaden the view and focus on network synthesis. Thereby, we will optimize the application mapping and the system on a chip (SoC) floor plan. This chapter contributes a linear model for the design space of application-specific NoC-interconnected heterogeneous 3D SoCs. Our model is novel because heterogeneous technologies are accounted for during application mapping. We also contribute a heuristic to solve the optimization problem effectively. Thereby, we enable area improvements by reducing white space. We will validate our method with the camera-chip (vision-system on a chip (VSoC)) case study used throughout this book. The remainder of this chapter is structured as follows. First, we introduce the fundamental ideas and terms and differentiate them against existing methods in Sect. 15.1. Next, we model the design space in Sect. 15.2, including technology and NoC architecture. In Sect. 15.3, we constrain the optimization problem so that a linear optimization problem results. In Sect. 15.4, we introduce a heuristic to solve the optimization problem. In Sect. 15.5, we evaluate our method for computational complexity and quality of optimization results. We also discuss our findings. Finally, the chapter is concluded.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 L. Bamberg et al., 3D Interconnect Architectures for Heterogeneous Technologies, https://doi.org/10.1007/978-3-030-98229-4_15
15.1 Fundamental Idea

For the optimization problem, we use the following fundamental definitions:

Definition 32 (Components) Components denote an abstract representation of processing elements (PEs). Components are parts of an SoC, which consume and generate data and require area.

Definition 33 (Bounding Boxes) Bounding boxes are rectangular outlines on an SoC with an area to place a component, a router, or a through-silicon via (TSV) array.

These two definitions are used to model the optimization problem of this chapter, defined as:

Definition 34 (System-Level Optimization of NoCs in Heterogeneous 3D SoCs) System-level optimization of NoCs in heterogeneous 3D SoCs consists of NoC synthesis after layer assignment and component positioning. Network on a chip synthesis optimizes the network by assigning routers and TSVs to bounding boxes.

Component positioning and NoC planning cannot be separated because the topology and the locations of components and routers mutually influence each other. When the number and location of vertical links are fixed, this also determines the component locations. This dependency makes the optimization a challenging and interesting task for heterogeneous 3D SoCs. Our optimization is multivariate. The objective function comprises area, power, performance, network load, and network congestion models. We contribute an analytical formulation accounting for technology parameters to understand the network synthesis optimization problem after discussing existing approaches.
15.1.1 Existing Approaches

Network on a chip synthesis with layer assignment and component positioning is not relevant for 2D or homogeneous 3D SoCs. Only in heterogeneous 3D SoCs does the size of components vary with technology, which influences the network topology. Furthermore, the performance and power of components and routers differ with their location/layer. There are two suboptimizations known from existing works: partitioning resources and positioning TSV arrays. Resource partitioning for application-specific integrated circuits (ASICs) and layer assignment are similar [156], but differ in two key aspects. Application-specific integrated circuit design targets gate-level inputs, in which circuits with many gates are partitioned; the gates have the same properties in each partition, and the partitioning targets optimal communication. In our network synthesis, the number of components is smaller, but their properties differ between layers.
Positioning TSVs is a well-researched topic. There are three general approaches: first, positioning individual TSVs for ASIC design; second, positioning TSV arrays as macroblocks; and third, positioning TSV arrays in 3D NoCs, which determines a topology for vertical links. When positioning individual TSVs for ASIC design, the objective is a reduction of wire lengths to increase the performance of the 3D chip by the optimized placement of single TSVs. For instance, [51] proposes a method to shorten wires by an average of 20% using an analytical placer. The optimization affects the gates connected by the TSV. For instance, both [237] and [108] propose to combine layout placement and TSV placement by a 3D cell placement algorithm. All these publications target the placement of individual TSVs in homogeneous 3D chips at the gate level. Heterogeneity makes these approaches not applicable to this chapter. Furthermore, placing whole TSV arrays instead of single TSVs improves yield and area (from overlapping keep-out zones (KOZs)). When positioning TSV arrays modeled as macroblocks, the focus shifts from individual TSVs to arrays of TSVs. Through-silicon via arrays are larger than individual ones. Hence, they are modeled as macroblocks, a common problem found in ASIC design. Usually, the macroblocks are placed first with an overlap, and then the placement is legalized [108]. PolarBear [52] proposes a method for mixed-size placement with less than 5% non-utilized space (white space). The unconstrained reordering of components in this method makes it inapplicable here, as the NoC topology constrains the process. Correct-by-construction methods [213] are computationally too expensive. When positioning TSV arrays in 3D NoCs, homogeneous 3D mesh topologies are optimized. In [164, 254], the number of TSVs is fixed, optimizing for a given application traffic. As we will show in this chapter, a homogeneous mesh yields large white space for heterogeneous 3D SoCs.
15.2 Modelling and Optimization

We define the input, objective function, and design space for NoC synthesis. The technology model comprises layer count, layer order, and the used technologies. Our model for the application gives the communication between components and the data flow direction. Its input graph is similar to the core graph [228], including power, performance, and area of components in different technologies. Our optimization minimizes the area and power requirements while maximizing the performance of components and NoC.1 We model this using a linear weighted objective function with five addends for the area, power, component performance, network latency, and network throughput.
1 We target typical square chips, but the model can easily be extended to rectangular SoCs.
The optimization yields bounding boxes, which provide an area for components, routers, and TSV arrays. Since components and routers have different sizes in each layer (due to heterogeneity), the sizes of bounding boxes differ. Bounding boxes implicitly define the NoC topology. We schematically depict this in Fig. 15.1. After network synthesis, the floorplanning within bounding boxes and of TSVs can be done using well-known methods.
15.2.1 Router Model

We model routers with varying area/power depending on their port count. Routers at the edges or ones without vertical links require less area than routers in the middle or with vertical links. A keep-out zone is only needed for downward connections, as shown schematically in Fig. 15.2. A keep-out zone is not required if routers connect only upwards. We allow connecting routers and TSV arrays via redistribution (RD). We use the elevator-first dimension-order routing algorithm [179] because it ensures connectivity for a partially connected 3D topology, but one can use any other routing within our models.

Fig. 15.1 Schematic of a solution generated by the proposed model
Fig. 15.2 Routers with TSV arrays in different directions. (a) both upwards and downwards vertical links. (b) only downwards vertical link. (c) only upwards vertical link
15.2.2 Three-Dimensional Technology Model

Components and routers have varying power, performance, and area per technology. Some components are constrained and cannot be implemented in a given technology (e.g., analog sensors cannot be implemented in digital technology). We also model TSVs with their area. For vertical links, an RD metal wire can be used to route to TSV arrays. Using an RD metal wire gives the architect the freedom to place routers/components. An exemplary RD of a single-bit connection is shown in Fig. 15.3. When connecting heterogeneous layers, the RD wires in the less advanced node are kept as short as possible to reduce the driver size, since it is relatively expensive. The driver connects to the TSV, and the driver of the RD wire in the digital layer is behind the TSV. In the opposite direction, the driver in the digital layer is behind the router and drives RD wires and TSVs. The placement variability for two vertically connected routers is constrained by the maximum length, D, of the RD wire that does not violate the clock frequency of the routers. It can be calculated from technology data. For heterogeneous technologies, D will differ between the two directions. Since we use bidirectional links, we use the minimum as the RD length. The router design using an RD wire is shown in Fig. 15.4. Keep-out zones are only required for downward connections.
15.2.3 Modelling Assumptions

Further assumptions are made in our model:
• Bounding boxes for components and routers are rectangular. This assumption restricts the optimization potential but is a usual approximation to reduce the problem complexity [156].

Fig. 15.3 Redistribution wires connecting routers via a heterogeneous interface
Fig. 15.4 Router using RD wires
• The optimization only models system-level decisions and not microarchitectural features. For instance, routers are not separated into multiple layers.
• Within layers, components (and routers) are positioned on a grid.
• Routers are positioned at joints of the component grid within bounding boxes.
• The network on a chip is horizontally fully connected.
15.3 Mixed-Integer Linear Program

We model the network synthesis as a mixed-integer linear program (MILP):

minimize c^T x + d^T y,
subject to Ax + By ≤ b,
with x ∈ Z^n, y ∈ R^k, A ∈ R^{m×n}, B ∈ R^{m×k}, b ∈ R^m, c ∈ R^n, d ∈ R^k.

In the next paragraphs, we introduce the detailed model using these notations: For n ∈ N, we introduce the notation [n] := {1, . . . , n}. The symbols p_x, p_y, and p_z denote the entries of a vector p ∈ R^3: p = (p_x, p_y, p_z). By χ we mean the indicator function as commonly defined: χ_A(x) = 1 if x ∈ A; otherwise χ_A(x) = 0.
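To make the abstract MILP form concrete, the following minimal Python sketch solves a tiny, invented instance of min c^T x + d^T y subject to Ax + By ≤ b by brute force over a discretized grid. The instance data are purely illustrative; a model of this chapter's realistic size is, of course, handed to a dedicated MILP solver instead.

```python
# Brute-force illustration of the MILP standard form  min c^T x + d^T y
# s.t. A x + B y <= b, with one integer x and one real y (discretized).
# Toy instance: minimize 2x + y subject to x + y >= 4.
c, d = [2.0], [1.0]
A, B, b = [[-1.0]], [[-1.0]], [-4.0]     # -x - y <= -4  <=>  x + y >= 4

best = None
for x in range(10):                          # integer variable x in Z
    for y in (k / 10 for k in range(100)):   # real variable y on a 0.1 grid
        if A[0][0] * x + B[0][0] * y <= b[0]:   # feasibility check
            cost = c[0] * x + d[0] * y
            if best is None or cost < best[0]:
                best = (cost, x, y)
# optimum of this toy instance: x = 0, y = 4.0 with cost 4.0
```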
15.3.1 Constants and Definitions

15.3.1.1 Component and Communication Model
There are n components. The set of components is [n].2 There are m routers. The number of bounding boxes is set prior to optimization as the number of routers/components. Within the MILP, we refer to the bounding boxes as "tiles" with a modified definition. Tiles provide an area for routers, components, and vertical links but can contain more than one router (Sect. 15.3.2) to improve packaging density. A directed graph models the communication between components with weighted edges. The edge set of the component digraph (directed graph) contains all directed pairs of components which communicate. An edge follows the pattern (sender, receiver), which yields the edge set: EA = {(i, j) ∈ [n]×[n] | component i sends to j}. This yields the directed and weighted component graph:

A = ([n], EA).   (15.1)
2 [n] is treated as if it were the set of components in the way that each number represents a particular component. It is just a set of indices, which are used to reference components.
The weights of the graph's edges represent communication between components, called the bandwidth requirement, given by the function:

u : EA → R+,   (15.2)

with R+ := {x ∈ R | x > 0}.
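The component graph and bandwidth requirement above can be sketched as plain Python data. The four-component pipeline and all bandwidth values below are invented for illustration; the assertions only check the structural properties required by Eqs. 15.1 and 15.2.

```python
# Sketch of the component digraph A = ([n], E_A) and bandwidth requirement u
# (Eqs. 15.1-15.2) for a hypothetical 4-component pipeline; values are invented.
n = 4
E_A = [(0, 1), (1, 2), (2, 3)]                  # directed (sender, receiver) pairs
u = {(0, 1): 3.2, (1, 2): 1.6, (2, 3): 0.8}     # bandwidth requirement per edge

assert set(u) == set(E_A)                       # u is defined exactly on E_A
assert all(i != j and 0 <= i < n and 0 <= j < n for i, j in E_A)
assert all(w > 0 for w in u.values())           # u maps into R+ = {x | x > 0}
```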
15.3.1.2 Technologies and Layers Model

There are k manufacturing technologies and ℓ layers. The sets [k] and [ℓ] are treated as the sets of manufacturing technologies and layers. A technology is selected per layer using τ : [ℓ] → [k].
15.3.1.3 Implementation Costs Model

Each component can be implemented in different technologies, yielding the set of available implementations I = [n] × [k]. The area of components per technology is:

f_c : I → R ∪ {∞}.   (15.3)

A component will not be implemented in a given layer if its costs are infinite. Router implementation costs are modeled for 2D/3D routers (routers without/with vertical links):

f_R2D : [k] → R,   (15.4)
f_R3D : [k] → R.   (15.5)

The through-silicon via area from the KOZ etc. is a constant f_KOZ.
15.3.1.4 Energy Model

Components have a different power per layer, f_E : I → R ∪ {∞}. Similarly for routers, f_ER : [k] → R.
15.3.1.5 Performance Model

Component performance is technology dependent. We use a numerical performance value as an estimate, f_P : I → {x ∈ R | x ≥ 0}.
15.3.1.6 Coordinates

The three-dimensional SoC is spatially embedded in a coordinate system, with upper bounds x_max and y_max for the x- and y-dimensions:

P := {x | x ∈ R, 0 ≤ x ≤ x_max} × {y | y ∈ R, 0 ≤ y ≤ y_max} × [ℓ].

The space P is bounded to model logical disjunction (OR) relations: this allows for two constraints, of which only one must be satisfied (Appendix C).
15.3.2 Variables

The positions of components, the network topology, the positions of routers, and the SoC dimensions are optimized. The places of components are given by their upper left corner, with positions s_i ∈ P for all i ∈ [n].3 This location is also the place of the first n routers, which connect one component each. We introduce tiles, which are bounding boxes for routers and components belonging to the same place. A variable denotes tile places, with positions t_i ∈ P for all i ∈ [m]. Tiles represent rectangular bounding boxes for the area of a component, routers, and KOZs, represented by length a_i and width b_i for all i ∈ [m]. A tile at position p_i ∈ T is given by the interval:

A_p = [p_{i,x}, p_{i,x} + a_i) × [p_{i,y}, p_{i,y} + b_i) × p_{i,z}.   (15.6)

We model the NoC as a spatially embedded graph. Router locations r_i for all i ∈ [m] are given by their positions. A variable is defined for the connection of two routers p and q: e_{p,q} ∈ {0, 1}. It is 1 for a connection and 0 otherwise. Connections are bidirectional. This yields the connections between all pairs of routers p, q: EN = {(p, q) | e_{p,q} = 1}. The network topology N is defined as the spatially embedded directed graph:

N = ([m], EN).   (15.7)
3 The positions of routers cannot be at the upper limits of the coordinates P , since they have a size. This inequality is not modeled here, but taken into account by constraints.
15.3.3 Objective Function

The size of the SoC is determined by the size of the largest layer, since stacked layers have the same dimensions. The positions of the last tiles plus their size determine the SoC dimension given by its position p and its size a_p and b_p, respectively. Thus the costs of the chip are defined as c̃_area = max_{i∈[m]}(t_{i,x} + a_i) · max_{i∈[m]}(t_{i,y} + b_i). This must be linearized. Therefore, we use a simpler cost function that minimizes the area of the chip for square-shaped chips:

c_area = max{ max_{i∈[m]}(t_{i,x} + a_i), max_{i∈[m]}(t_{i,y} + b_i) }.   (15.8)

The power consumption of routers and components is:

c_power = Σ_{i∈[n]} f_E(i, τ(t_{i,z})) + Σ_{j∈[m]} f_ER(τ(t_{j,z})).   (15.9)

The system performance is:

c_perf = − Σ_{i∈[n]} f_P(i, τ(t_{i,z})).   (15.10)
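The three cost terms above can be evaluated directly once a candidate placement is fixed. The following sketch does so for a hypothetical two-layer SoC with two components and two tiles; all positions, technology tables, and power/performance numbers are invented for illustration.

```python
# Toy evaluation of cost terms (15.8)-(15.10); all numbers are illustrative.
tiles = [
    # (t_x, t_y, layer t_z, length a_i, width b_i)
    (0.0, 0.0, 1, 2.0, 1.5),
    (2.0, 0.0, 2, 1.0, 1.0),
]
tau = {1: 0, 2: 1}                      # layer -> technology, tau : [l] -> [k]
f_E = {(0, 0): 1.0, (1, 1): 0.5}        # component power f_E(i, technology)
f_ER = {0: 0.2, 1: 0.1}                 # router power per technology
f_P = {(0, 0): 10.0, (1, 1): 8.0}       # component performance f_P(i, technology)

# Eq. 15.8: edge length of the (square) chip
c_area = max(max(t[0] + t[3] for t in tiles),
             max(t[1] + t[4] for t in tiles))
# Eq. 15.9: power of components plus power of routers
c_power = (sum(f_E[(i, tau[tiles[i][2]])] for i in range(2))
           + sum(f_ER[tau[t[2]]] for t in tiles))
# Eq. 15.10: negated performance (the MILP minimizes)
c_perf = -sum(f_P[(i, tau[tiles[i][2]])] for i in range(2))
```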
To optimize the network performance, we estimate the utilization per link. The »flow« of packets is modeled by a source-sink flow in the network digraph. The flow is, mathematically, ≤ 1, so that the actual source-sink flow of packets is given by multiplication with the bandwidth requirement u in the cost function. We define the function f which gives us this flow for each pair of components:

f : EA → ⋃_{(i,j)∈EA} { f | f is an s_i–s_j flow in N, value(f) = 1 }.   (15.11)

The flow value is value(f), following the convention used in [136]. If we write f(s, t) (which is technically not defined), we mean f((s, t)). f^{(i,j)}_{(k,l)} denotes the corresponding variable for all network links k, l ∈ [m] and components i, j ∈ [n]. Flows model any routing algorithm. Deterministic routing algorithms have binary flows. For adaptive routing, flow values are in the interval [0, 1] and represent the probability that packets use a link. We optimize the flow as it models the network load. Congestion increases the average network delay, reduces throughput, and increases energy consumption. To minimize congestion risk, the network load is reduced by ensuring packets travel
shorter distances and pass through fewer routers on their path. The costs to reduce network utilization are:

c_util = Σ_{e∈EA} u(e) ( Σ_{v∈EN} (f(e))(v) ).   (15.12)

To further reduce congestion risk, peak loads are avoided. This defines the loads as the function load : EN → R≥0, which returns the summed utilization on a link:

load(v) := Σ_{e∈EA} u(e) (f(e))(v).   (15.13)

We propose a load-heterogeneity measure that penalizes loads larger than the average link load given by:

μ_l = (1/|EN|) Σ_{v∈EN} load(v).   (15.14)

We estimate |EN| ≈ n² − n as an approximate linearization. This corresponds to a fully connected network and underestimates μ_l. With the indicator function χ, the peak costs are defined as

c_peak = Σ_{v∈EN} χ_{(μ_l,∞)}(load(v)) (load(v) − μ_l)   (15.15)
       = Σ_{(k,l)∈[m]×[m]} max{0, load((k, l)) − μ_l}.   (15.16)

The complete objective function comprises the summed costs, with weighting factors ω1 to ω5:

c = ω1 c_util + ω2 c_power + ω3 c_perf + ω4 c_peak + ω5 c_area.   (15.17)
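The utilization, load, mean-load, and peak-cost terms can be sketched on a toy network. The three-router topology, the abstract traffic-pair ids, and all flow fractions and bandwidths below are invented; the code follows Eqs. 15.12–15.16 term by term.

```python
# Toy computation of c_util (15.12), link load (15.13), mean load (15.14),
# and peak cost (15.15/15.16); network and traffic data are illustrative.
E_N = [(0, 1), (1, 2), (0, 2)]
u = {'a': 2.0, 'b': 1.0}                 # bandwidth per traffic pair (abstract ids)
flow = {                                 # (f(e))(v): fraction of pair e on link v
    'a': {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 0.0},
    'b': {(0, 1): 0.0, (1, 2): 0.0, (0, 2): 1.0},
}

c_util = sum(u[e] * sum(flow[e].values()) for e in u)           # Eq. 15.12
load = {v: sum(u[e] * flow[e][v] for e in u) for v in E_N}      # Eq. 15.13
mu = sum(load.values()) / len(E_N)                              # Eq. 15.14
c_peak = sum(max(0.0, load[v] - mu) for v in E_N)               # Eq. 15.16
```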
15.3.4 Constraints

We explain and illustrate the constraints of the optimization problem. Their mathematical formulation is a straightforward task. We constrain indices for convenience: components, tiles, and routers with an index i ∈ [n] are at this exact location. Excess tiles and routers that are not required are at the end of the index range [m].
15.3.4.1 Network Model
Tiles are modeled so that the router is located in the upper left corner of the bounding box. The network implements a grid-like topology. Neighboring routers with either the same x- or y-coordinates are connected. Long-range links are prohibited. Three-dimensional routers connect adjacent layers. Which routers can be connected depends on their projected distance within layers, which must be smaller than D (|p_x − q_x| + |p_y − q_y| ≤ D). The Manhattan distance linearizes the equation and is reasonable to model link manufacturing. Each component is connected to the local port, and routers are not connected to themselves. The resulting constraints can be illustrated as follows:

• Topology as grid and TSVs connect to adjacent layers: Within a layer, wires form grids. TSVs only connect neighboring layers.
• Components are connected to routers: Each component has a router since it has a NI with a unique address.
• Routers are not self-connected: Routers connect among each other yet not to themselves.
• Forbid connections between non-neighboring routers: Only neighboring routers can be connected. Long-range and diagonal links are prohibited.
15.3.4.2 Modeling Tiles (Bounding Boxes)

The tiles provide an area to implement components, routers, and TSV arrays. The assigned area of a tile is given by the side product a_p b_p. It is constrained to be larger than the tile's area requirement. The area requirement Â_p is the sum of the component size, the size of each router in the tile, and their KOZs. This yields a constraint a_p b_p ≥ Â_p for all p ∈ T. Tiles are without overlaps; routers and components are assigned to them. In general, more routers than components might be required to fulfill the connectivity of the routing algorithm. Routers either have a separate tile or share one with a component. Links connect routers that are assumed to be located at the borders of tiles (simply for the ease of modeling, but routers can be moved in postprocessing). Routers must be at different locations. The actual router placement in a bounding box can be determined in postprocessing. Routers can be freely placed in the tiles, and links can cross them. The resulting constraints can be illustrated as follows:
• Starting positions of tiles: Components start tiles. Each component has its own tile.
• Routers and tiles: Each tile has a router. Routers are either part of a tile of a component, or they have their own tile.
• Sizes of tiles: Tiles provide enough space for their component, routers, and TSVs.
• Tiles may not overlap: Tiles are forbidden to overlap.
• Links do not cross tiles: Links cannot be placed within a tile.
• Routers have different locations: Routers are not at the same location.
To model these constraints, we introduce a minimum distance and an area linearization. The former allows modeling the inequality of positions, and the latter enables the calculation of area, which intrinsically is a product, i.e., non-linear.
15.3.4.3 Modeling Unequal Positions on SoCs
It is impossible to state a ≠ b for two reals a and b through an MILP, since this would require modeling open sets. Hence, a minimum distance δ around one of the variables is introduced. This allows modeling unequal relations such as a ≤ b − δ or b ≤ a − δ. The physical representation of δ is the semiconductor scale (i.e., the feature size). Modeling the unequal relation is shown in Fig. 15.5. The variables a, b ∈ R are constrained to be unequal using the distance δ. The variable b1 is within the interval [a, a + δ], which is not allowed. The variable b2 is larger than a + δ, which models the inequality.
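The δ trick can be sketched as a predicate: a position b is accepted only if it lies outside the forbidden band of width δ around a. The helper name and the numeric values below are ours, chosen for illustration.

```python
# Minimal sketch of the delta trick: a != b is encoded as the disjunction
# b <= a - delta  OR  b >= a + delta, with delta the feature size.
def unequal(a: float, b: float, delta: float) -> bool:
    return b <= a - delta or b >= a + delta

assert not unequal(1.0, 1.0, 0.01)    # identical positions violate the constraint
assert not unequal(1.0, 1.005, 0.01)  # within the forbidden band around a
assert unequal(1.0, 1.02, 0.01)       # separated by at least delta: allowed
```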
15.3.4.4 Linearization of Area Products
In the definition of constraints on sizes of tiles, the product of a tile’s edges ai bi is calculated, which is not linear. We use the linearization following Montreuil [169], as illustrated in Fig. 15.6. The size of the tile i ∈ [m] must be larger than the summed size of its component i, α implemented routers (2D and 3D routers), of which β are
Fig. 15.5 a ≠ b1 (left-hand side) and a ≠ b2 (right-hand side) modeled using the distance δ
Fig. 15.6 Linearization of a product ai bi , with aspect ratio η and bounds xmax and ymax
3D routers, and γ KOZs with the technology costs of the component's layer ξ ∈ [ℓ]. For this given area F^{ξ,i}_{α,β,γ}, the possible length and width combinations of the tile are given by the half-space:

a_i b_i ≥ F^{ξ,i}_{α,β,γ}.   (15.18)

In addition, each side of a tile is constrained by the maximum size of the layer:

a_i ≤ x_max,   (15.19)
b_i ≤ y_max.   (15.20)

To estimate the product from Eq. 15.18, we introduce a limit on the aspect ratio η. In general, the aspect ratio is between 0 and 1, i.e., 0 < η ≤ 1. Stricter bounds for η reduce the estimation error. The constraints for a given η, also shown in Fig. 15.6, are:

a_i ≥ η b_i,   (15.21)
a_i ≤ η^{−1} b_i.   (15.22)

Connecting the intersection between a_i b_i = F^{ξ,i}_{α,β,γ} and a_i = η b_i as well as the intersection between a_i b_i = F^{ξ,i}_{α,β,γ} and a_i = η^{−1} b_i yields a line equation, which linearly approximates a_i b_i. This linearization is given by:

a_i + b_i ≥ √(F^{ξ,i}_{α,β,γ} η^{−1}) + √(F^{ξ,i}_{α,β,γ} η).   (15.23)
This approximation will have an error of 9–12% if applied to the facility layout problem [140], which is similar to our model. The square root of F^{ξ,i}_{α,β,γ} cannot be calculated within the MILP model, since it contains variables for the area of the component, routers, and KOZs. The router count in a tile has an upper bound, i.e., α, β, γ ∈ [m − n + 1]. Therefore, m matrices are introduced which contain all precomputed square roots of tile sizes depending on the router and TSV count: F̃^{ξ,i} = (f̃^{ξ,i}_{α,β,γ}) ∈ R^{[m−n+1]×[m−n+1]×[m−n+1]}. The corresponding area requirement of a tile i implemented in layer ξ is given by the (ξ, i)-th matrix element f̃^{ξ,i}_{α,β,γ}. We introduce auxiliary binary variables h^{ξ,i,XI}_{α,β,γ} for each i ∈ [m], ξ ∈ [ℓ], and α, β, γ ∈ [m − n + 1], selecting the element of this matrix.4 These variables can be arranged in a matrix of the same dimensions as F̃^{ξ,i}. Using an analogy of the Frobenius scalar product of matrices A = (a_{i,j,k}) ∈ R^{m×n×q} and B = (b_{i,j,k}) ∈ R^{m×n×q}, ⟨A, B⟩ = Σ_{i=1}^{m} Σ_{j=1}^{n} Σ_{k=1}^{q} a_{i,j,k} b_{i,j,k}, the constraint in Eq. 15.18 can be written as m inequalities:

a_i + b_i ≥ Σ_{ξ∈[ℓ]} ⟨H^{ξ,i,XI}, F̃^{ξ,i}⟩ (√(η^{−1}) + √η),   (15.24)

which is a linear equation using h^{ξ,i,XI}_{α,β,γ} as the elements of H^{ξ,i,XI}. It must hold for all i ∈ [m], ξ ∈ [ℓ], and α, β, γ ∈ [m − n + 1]:

h^{ξ,i,XI}_{α,β,γ} = 1 if tile i is in layer ξ, has (α − 1) routers, (β − 1) 3D routers, and (γ − 1) KOZs; else 0.   (15.25)

The linearization error can be reduced by using multiple linear equations that piecewise approximate segments of the iso-area line a_i b_i = F^{ξ,i}_{α,β,γ}. This method increases the number of inequalities.
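The linearized bound of Eq. 15.23 can be checked numerically: at the two intersection points with the aspect-ratio lines it is tight, while near-square tiles are over-constrained, which is the source of the quoted 9–12% error. The area F and ratio η below are toy values; in the real model, F comes from component, router, and KOZ sizes.

```python
import math

# Check of the linearized tile constraint (15.23): for a required area F and
# aspect-ratio bound eta, a_i + b_i >= sqrt(F/eta) + sqrt(F*eta). Toy values.
F, eta = 16.0, 0.5
bound = math.sqrt(F / eta) + math.sqrt(F * eta)

# the intersections of a*b = F with a = eta*b and a = b/eta meet the bound exactly
a1, b1 = math.sqrt(F * eta), math.sqrt(F / eta)
assert abs(a1 * b1 - F) < 1e-9 and abs((a1 + b1) - bound) < 1e-9

# a square tile a = b = sqrt(F) has the exact area F but a smaller perimeter sum,
# so the linear constraint over-constrains it -- the source of the 9-12% error
assert 2 * math.sqrt(F) < bound
```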
15.3.4.5 Modeling Routing Algorithms

We define a modeling framework for any routing algorithm. The used algorithm must be modeled via separate constraints. We model routing algorithms via the flow f(s, t) [136]. The routing algorithm only traverses existing links between routers. Packets cannot be duplicated or lost, so flow is conserved. This approach allows for verification of the routing algorithm. If the constraints are fulfilled, the routing will be connected. As a neat byproduct, livelock freedom is proven, since the flow is acyclic. The resulting constraints are illustrated as follows:

4 The exponent is part of the variable name.
• Flow in EN: Only existing links can be traversed.
• Flow conservation: Packets are not lost or duplicated.
• Livelock and deadlock freedom: Routing algorithms are livelock and deadlock free.
Flow in EN The flow in EN can traverse existing links only. It is defined on a fully connected network graph. If only the actually implemented links in each solution candidate were used, the size of the constraints would depend on the solution and not on the input constants, which is illegal for MILPs. Therefore, all flow values of edges which are not found in the network topology EN are set to zero by the following m²|EA| inequalities for all (i, j) ∈ EA, k, l ∈ [m]:

f^{(i,j)}_{(k,l)} ≤ e_{k,l}.   (15.26)
Flow Is Conserved Flow must be conserved, or else packets are duplicated or lost. We define flows in the network E_N with value 1, which are also non-circular; the latter is given by the subsequent constraint. According to [136] (Definition 8.1), the flow for a vertex v is conserved if the summed flow over all incoming edges e ∈ δ⁻(v) is equal to the summed flow over all outgoing edges e ∈ δ⁺(v): Σ_{e∈δ⁻(v)} f(e) = Σ_{e∈δ⁺(v)} f(e). In our model, the incoming edges of a vertex with index l (i.e., a router) are given by e ∈ [m] × {l}. Analogously, the outgoing edges are given by e ∈ {l} × [m]. This yields the flow-conservation equations for the fully connected network for all (i, j) ∈ E_A and l ∈ [m] \ {i, j} (since the source and the sink do not obey flow conservation):

Σ_{k∈[m]} f^{(i,j)}_{(k,l)} = Σ_{k∈[m]} f^{(i,j)}_{(l,k)}.   (15.27)
This relates to 2(m − 2)|E_A| inequalities. For the source i, the equation is given by

1 + Σ_{k∈[m]} f^{(i,j)}_{(k,i)} = Σ_{k∈[m]} f^{(i,j)}_{(i,k)},   (15.28)

and for the sink j, it is given by:

Σ_{k∈[m]} f^{(i,j)}_{(k,j)} = Σ_{k∈[m]} f^{(i,j)}_{(j,k)} + 1.   (15.29)
338
15 Network Synthesis and SoC Floor Planning
Flow Is Acyclic Flows in the network must be acyclic for livelock freedom of routing algorithms. A digraph has a topological order if and only if it is acyclic ([136], Definition 2.8 and Proposition 2.9). To generate a topological order, variables Γ^{e_A}_r ∈ ℤ are defined such that they enumerate the vertexes in the network E_N for all e_A ∈ E_A and r ∈ [m]. The enumeration is modeled for all (k, l) ∈ [m] × [m]:

f^{e_A}_{(k,l)} > 0 → Γ^{e_A}_k < Γ^{e_A}_l ⟺ f^{e_A}_{(k,l)} = 0 or Γ^{e_A}_k < Γ^{e_A}_l.   (15.30)

For a binary flow (i.e., f^{(i,j)}_{(k,l)} ∈ 𝔹), this yields the following m²|E_A| inequalities for all (k, l) ∈ [m] × [m] and (i, j) ∈ E_A, using the constant c^{XVII} = m + 1 (again, the "exponent" 17 is part of the variable name):

Γ^{(i,j)}_k + 1/2 ≤ Γ^{(i,j)}_l + (1 − f^{(i,j)}_{(k,l)}) c^{XVII}.   (15.31)

For non-binary flows, the auxiliary binary variable h^{i,j,XVII,OR}_{k,l}, which is 1 if the flow is non-zero (h^{i,j,XVII,OR}_{k,l} ≥ f^{(i,j)}_{(k,l)}), is introduced. This changes the inequalities to Γ^{(i,j)}_k + 1/2 ≤ Γ^{(i,j)}_l + (1 − h^{i,j,XVII,OR}_{k,l}) c^{XVII}.
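The existence of the Γ variables in Eq. (15.31) is exactly the existence of a topological order. A sketch that constructs Γ with Kahn's algorithm and re-checks the big-M inequalities (the example edges are invented):

```python
def topological_gamma(m, edges):
    """Kahn's algorithm: returns an enumeration gamma with gamma[k] < gamma[l]
    for every used edge (k, l), or None if the flow digraph is cyclic."""
    succ = {v: [] for v in range(m)}
    indeg = {v: 0 for v in range(m)}
    for k, l in edges:
        succ[k].append(l)
        indeg[l] += 1
    ready = [v for v in range(m) if indeg[v] == 0]
    order = []
    while ready:
        v = ready.pop()
        order.append(v)
        for w in succ[v]:
            indeg[w] -= 1
            if indeg[w] == 0:
                ready.append(w)
    if len(order) < m:
        return None          # a cycle blocks the enumeration
    return {v: i for i, v in enumerate(order)}

def big_m_ok(edges, gamma):
    """Eq. (15.31) for a binary flow: on used edges f = 1, so the big-M term
    (1 - f) * c^XVII vanishes and gamma_k + 1/2 <= gamma_l must hold."""
    return all(gamma[k] + 0.5 <= gamma[l] for k, l in edges)

gamma = topological_gamma(4, [(0, 1), (1, 3)])
```

For a cyclic flow no such enumeration exists, so the MILP constraints become infeasible, which is precisely the livelock-freedom argument.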
15.3.5 Case Study: Modeling Elevator-First Dimension-Order Routing

As an example, we model elevator-first dimension-order routing to demonstrate the expressiveness of our approach. For dimension-ordered routing (DOR), each pair of routers within a layer must span a rectangle with two other routers. Thus, wires between all neighboring routers must be implemented. The flow is binary: it is 1 on the paths following the routing algorithm and 0 on all other edges. The constraints can be summarized as follows. Please note that they require the implementation of a search algorithm for the next router with a TSV using an MILP, which is inefficient.

Connect neighboring routers: Forces wires between routers within a layer to enable routing.
Flow for the routing algorithm: DOR is used.
Topology: For DOR, routers must form rectangles.
Flow is binary: Packets follow a single path in deterministic routing.
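The elevator-first idea itself is easy to sketch operationally. The following is an illustrative path generator, not the MILP constraints; mesh coordinates and the elevator list are invented:

```python
def elevator_first_dor(src, dst, elevators):
    """Hypothetical elevator-first DOR path on (x, y, z) mesh coordinates:
    XY-route to the nearest elevator, traverse vertically, then XY-route to
    the destination. `elevators` lists (x, y) positions owning a vertical link."""
    def xy_walk(pos, tx, ty):
        x, y, z = pos
        while x != tx:                     # X first (dimension order)
            x += 1 if tx > x else -1
            yield (x, y, z)
        while y != ty:                     # then Y
            y += 1 if ty > y else -1
            yield (x, y, z)

    path = [src]
    if src[2] != dst[2]:
        ex, ey = min(elevators,
                     key=lambda e: abs(e[0] - src[0]) + abs(e[1] - src[1]))
        path += list(xy_walk(path[-1], ex, ey))
        while path[-1][2] != dst[2]:       # ride the elevator layer by layer
            x, y, z = path[-1]
            path.append((x, y, z + (1 if dst[2] > z else -1)))
    path += list(xy_walk(path[-1], dst[0], dst[1]))
    return path

route = elevator_first_dor((0, 0, 0), (2, 1, 1), elevators=[(1, 0)])
```

Finding the nearest router with a TSV, a one-line `min` here, is the search that is awkward to express inside an MILP, which is why the text calls the constraint formulation inefficient.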
15.4 Heuristic Solution

The proposed model in the form of an MILP provides an optimal solution. However, it is not efficient and, therefore, hardly provides any solution for large input sets. Defining a modified MILP with higher performance is unrealistic. Instead, we construct solutions with an efficient heuristic. We dissect the problem and split it into a set of steps, each of which can be optimized efficiently. This divide-and-conquer-like approach reduces the optimization potential, but a fast heuristic can be developed. We identify the following five steps, for each of which a graphical representation is given in Fig. 15.7.

1 Components are partitioned into layers (Fig. 15.7a). Since dies of identical size are stacked, the largest one determines the SoC area consumption. Thus, the optimization adds up the area requirements of the components per die and targets a homogeneous distribution. This step is akin to partitioning, as argued in Sect. 15.1.1. At the end of the first step, the power consumption and performance of components are optimal. The area is only approximated but yields similar-sized layers.

2 A floor plan is computed per layer for bounding boxes of components (Fig. 15.7b). We use rectangular bounding boxes in line with the tiles of the MILP. This optimization targets two objectives: minimizing the layer area by reducing the sizes of bounding boxes and tightly packing them, and minimizing the communication within layers by locating components according to their bandwidth (pairs of components with significant communication requirements are placed adjacently within a layer). This step determines the size of the layer rather precisely; only the area of the 3D communication infrastructure is missing. Furthermore, by only optimizing communication within layers, the complexity of this problem is vastly reduced.
Fig. 15.7 Visual representation of the heuristic's steps. (a) Component-to-layer assignment. (b) Component floor planning. (c) TSV array count. (d) Place 3D routers. (e) Legalization
3 The communication between layers is optimized: the number of TSV arrays connecting layers is calculated (Fig. 15.7c). There are two adversarial trends: many TSV arrays reduce the load on each single TSV array, but more area is required. Therefore, an optimum with both low link load and low area is sought. Determining only the TSV array count, but not the precise locations, allows for an effective solution.5 Up to this point, interlayer and intralayer communication have been optimized separately.

4 The global communication is optimized in this step. Through-silicon via arrays and 3D routers are placed by adding them to the bounding boxes of components (Fig. 15.7d). This placement results in bounding boxes as already shown in the MILP (as tiles). This step uses the TSV connection model and the actual routing algorithm, so that the paths of the data are known. Constrained by the given TSV count and the order of components in their layer, this is an extensive optimization that accounts for detailed network information. At the end of this step, the size of bounding boxes increases, and they might overlap.

5 The solution is legalized to accommodate the area of the added connections.6

Step 1 splits the design into layers before floor planning each layer in step 2, which improves optimization speed. Deciding step 3 as late as possible is beneficial for the network design, since component floor plans can then be taken into account. This order avoids the over-allocation of network resources. Steps 3 and 4 are separated to allow comparison against standard approaches.
15.4.1 Heuristic Algorithm

Using the five steps, we develop an efficient heuristic. The solutions proposed per step represent only one possible implementation. The structure of the resulting heuristic is illustrated in Fig. 15.8 as a flow chart and explained in the following paragraphs.

1 The heuristic starts with a component-to-layer assignment. An optimized assignment function α : [n] → [ℓ] is found, which assigns components to layers w.r.t. an optimized area, power, and performance. We use this objective function, with weights ω_C, ω_E, and ω_P ∈ ℝ:

c^(1) = ω_C h + ω_E Σ_{i∈[n]} f_E(i, α(i)) − ω_P Σ_{i∈[n]} f_P(i, α(i)),   (15.32)

whereas

5 The TSV array count is, in short, referred to as TSV count.
6 Strictly speaking, this step might generate illegal solutions by violating the constraint that the distance between routers must be below D if connected vertically. However, this issue is easily fixed by post-processing or by using a slightly reduced D.
Fig. 15.8 Steps of the heuristic algorithm: 1 component-to-layer assignment (ILP; area, power); 2 component floor planning (SA with LP/SDP; area, intralayer communication); 3 TSV array count (exhaustive search with approximated routing; area, interlayer communication); 4 placement of 3D routers (SA with accurate routing; global communication); 5 legalization (LP/SDP; area). The input panel shows an example component graph with heterogeneous technologies (130 nm mixed-signal, 90 nm digital SRAM, 45 nm digital CMOS)
h = max_{k∈[ℓ]} Σ_{i∈{j | α(j)=k}} f_C(i, α(i)).   (15.33)
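For intuition, a greedy longest-processing-time heuristic approximates the balancing term h of Eqs. (15.32)/(15.33) when ω_E = ω_P = 0. This is a stand-in sketch, not the ILP itself; component areas are invented:

```python
import heapq

def assign_layers(areas, num_layers):
    """Greedy longest-processing-time stand-in for the step-1 ILP with
    omega_E = omega_P = 0: spread component areas over layers so that
    h = max layer area (Eq. 15.33) stays small."""
    heap = [(0.0, layer) for layer in range(num_layers)]
    heapq.heapify(heap)
    alpha = {}
    for comp in sorted(range(len(areas)), key=lambda i: -areas[i]):
        load, layer = heapq.heappop(heap)      # least-loaded layer first
        alpha[comp] = layer
        heapq.heappush(heap, (load + areas[comp], layer))
    h = max(sum(areas[i] for i in alpha if alpha[i] == k)
            for k in range(num_layers))
    return alpha, h

# Five equal components over two layers: the best split is 30 vs 20 area units.
alpha, h = assign_layers([10, 10, 10, 10, 10], 2)
```

The heap keeps the least-loaded layer on top, so each component lands where it hurts the maximum the least; the exact ILP additionally trades this balance against the energy and performance terms.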
This model is an integer linear program (ILP). The first step is shown in Fig. 15.8 on top. It takes technology parameters of components as input and returns an assignment.

2 In the second step, relative positions and bounding boxes are determined for the components in their layer w.r.t. minimized area and communication, called component floor planning. To remain with a grid, we define an order that describes the relative positioning of components by assigning them to rows and columns. Each component obtains its bounding box by being assigned a row number and a column number. The advantages of using this order to describe the positioning of components are twofold. First, efficient (polynomial-time) area minimization via a (linearized) linear program (LP) or an exact semi-definite program (SDP) is possible under a given maximum aspect ratio η. Second, it allows modifying the size of bounding boxes (to add routers and TSVs) while maintaining the order. The solution can be legalized, and the bounding boxes are non-overlapping after adding the 3D infrastructure. To formalize the area optimization, assume a given order of l·k or fewer components in l rows and k columns. Each component has the size a_{i,j}, for certain i ∈ [l] and j ∈ [k]; a_{i,j} = 0 if there is no component at row i, column j, for all pairs (i, j) ∈ [l] × [k]. The height of row i is r_i ∈ ℝ for all i ∈ [l]; the width of column j is c_j ∈ ℝ for all j ∈ [k]. It must hold that

r_i c_j ≥ a_{i,j}   for all i ∈ [l], j ∈ [k].   (15.34)
The objective function is: minimize the side length of a square that encloses all bounding boxes,

max(Σ_{i∈[l]} r_i, Σ_{j∈[k]} c_j) → min.   (15.35)
Using LP and Linearization The optimization is conducted subject to

r_i ≥ η c_j   ∀i ∈ [l], ∀j ∈ [k],   (15.36)
c_j ≥ η r_i   ∀i ∈ [l], ∀j ∈ [k],   (15.37)
r_i + c_j ≥ √(a_{i,j} η) + √(a_{i,j}/η)   ∀i ∈ [l], ∀j ∈ [k],   (15.38)

with a given bounding-box aspect-ratio limit η ∈ (0, 1). The linearization is conducted in Eq. (15.38). The formulation as an LP allows for an efficient solution.

Using SDP Variables in SDPs are positive semi-definite matrices. Here, these are l·k variables X_{k(i−1)+j}. We define them such that the desired product constraint r_i c_j ≥ a_{i,j} is obtained, as follows.
We set l·k variables X_{k(i−1)+j} such that

X_{k(i−1)+j} = [[r_i, √a_{i,j}], [√a_{i,j}, c_j]] ⪰ 0,   ∀i ∈ [l], ∀j ∈ [k].   (15.39)
These are positive semi-definite matrices; thus each principal minor is greater than or equal to 0:

det X_{k(i−1)+j} ≥ 0,   (15.40)
⇒ r_i c_j − a_{i,j} ≥ 0,   (15.41)
⇒ r_i c_j ≥ a_{i,j},   ∀i ∈ [l], ∀j ∈ [k].   (15.42)
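The principal-minor argument of Eqs. (15.39)–(15.42) can be sanity-checked numerically: for non-negative r and c, the 2×2 matrix is positive semi-definite exactly when r·c ≥ a. An illustrative check via the closed-form eigenvalues (example values are invented):

```python
import math

def psd_2x2(r, c, a):
    """The matrix [[r, sqrt(a)], [sqrt(a), c]] of Eq. (15.39) is positive
    semi-definite iff its smallest closed-form eigenvalue
    lambda_min = (tr - sqrt(tr^2 - 4*det)) / 2 is non-negative."""
    tr = r + c
    det = r * c - a
    disc = math.sqrt(max(tr * tr - 4.0 * det, 0.0))
    return (tr - disc) / 2.0 >= -1e-12

# PSD exactly when the bounding box r x c can hold area a (Eqs. 15.40-15.42).
```

So a solver that keeps X_{k(i−1)+j} in the PSD cone automatically enforces the non-convex product constraint without any linearization error.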
We formulate it as an SDP. The objective function minimizes the variable x subject to the following constraints. We assign the corresponding area values to each matrix:

2√a_{i,j} ≤ ⟨[[0, 1], [1, 0]], X_{k(i−1)+j}⟩ ≤ 2√a_{i,j},   ∀i ∈ [l], ∀j ∈ [k].   (15.43)
For each i ∈ [l], the upper left entry of the matrices X_{k(i−1)+j} has the same value for all j ∈ [k] (this models r_i):

0 ≤ ⟨[[−1, 0], [0, 0]], X_{k(i−1)+1}⟩ + ⟨[[1, 0], [0, 0]], X_{k(i−1)+j}⟩ ≤ 0.   (15.44)
For each j ∈ [k], the lower right entry of the matrices X_{k(i−1)+j} has the same value for all i ∈ [l] (this models c_j):

0 ≤ ⟨[[0, 0], [0, −1]], X_j⟩ + ⟨[[0, 0], [0, 1]], X_{k(i−1)+j}⟩ ≤ 0.   (15.45)
We model the maximum variable x for the objective function:

0 ≤ x + Σ_{i=1}^{l} ⟨[[−1, 0], [0, 0]], X_{k(i−1)+1}⟩,   (15.46)
0 ≤ x + Σ_{j=1}^{k} ⟨[[0, 0], [0, −1]], X_j⟩.   (15.47)
Component areas are constrained by an aspect ratio η (to bound the length of a critical path). The constraints do not enforce this aspect ratio between r_i and c_j directly. Rather, the component can be placed as a rectangle inside the bounding box given by r_i c_j.
This rectangle has the size of the component, and its edge lengths are within the aspect ratio η. We formulate for all i ∈ [l] and for all j ∈ [k]:

√(η a_{i,j}) ≤ ⟨[[1, 0], [0, 0]], X_{k(i−1)+1}⟩,   (15.48)
√(η a_{i,j}) ≤ ⟨[[0, 0], [0, 1]], X_j⟩.   (15.49)
We optimize the assignment of components to rows and columns by simulated annealing. Regarding the relative placement of components, n_ξ × n_ξ rows and columns are sufficient for every possible placement if layer ξ holds n_ξ components. As an initial solution, we position components with an area-efficient approach. Optimizing row and column sizes will yield white space if the size difference between adjacent components is significant and the area requirements alternate. Therefore, we cluster components in rectangles by their sizes in descending order. Rows (columns) are filled with components as long as the summed height (width) of the chip is smaller than the width (height); otherwise, a new column (row) is filled. In every iteration of the simulated annealing, we move a component selected from a uniform random distribution to another row or column, also selected from a uniform random distribution. If another component is located there, the two components are swapped. After the deletion of rows and columns without components, the optimization mentioned above minimizes the area of the layer. We evaluate the following objective function for this solution candidate, with the weight ω_a to minimize area and ω_c for communication. The communication between two components i, j is calculated by summing their bandwidth u(i, j) multiplied by their hop distance Δ(i, j), i.e., their distance in numbers of rows and columns. We minimize:

c^(2) = ω_a max(Σ_{i∈[n]} r_i, Σ_{j∈[n]} c_j) + ω_c Σ_{i∈[n]} Σ_{j∈[n]} u(i, j) Δ(i, j).   (15.50)
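A compact sketch of the step-2 annealing loop follows. It is simplified: each component is treated as a √a × √a square, the move set only swaps grid cells, and all names and numbers are illustrative rather than the book's implementation:

```python
import math
import random

def anneal_floorplan(areas, traffic, grid, iters=3000, t0=30.0, cool=0.98, seed=1):
    """Simulated-annealing sketch in the spirit of Eq. (15.50): components sit
    on a rows x cols grid; a move relocates one component (swapping if the
    target cell is taken); the cost mixes side length and weighted hop distance."""
    rng = random.Random(seed)
    rows, cols = grid
    cells = [(i, j) for i in range(rows) for j in range(cols)]
    pos = {comp: cells[comp] for comp in range(len(areas))}

    def cost(p):
        row_h = [max([math.sqrt(areas[c]) for c in p if p[c][0] == i], default=0.0)
                 for i in range(rows)]
        col_w = [max([math.sqrt(areas[c]) for c in p if p[c][1] == j], default=0.0)
                 for j in range(cols)]
        area = max(sum(row_h), sum(col_w))
        comm = sum(u * (abs(p[a][0] - p[b][0]) + abs(p[a][1] - p[b][1]))
                   for (a, b), u in traffic.items())
        return area + comm

    cur, cur_cost = dict(pos), cost(pos)
    best, best_cost, temp = dict(cur), cur_cost, t0
    for _ in range(iters):
        cand = dict(cur)
        comp = rng.randrange(len(areas))
        cell = cells[rng.randrange(len(cells))]
        for other in cand:                 # swap if the target cell is occupied
            if cand[other] == cell:
                cand[other] = cand[comp]
        cand[comp] = cell
        c_cost = cost(cand)
        if c_cost < cur_cost or rng.random() < math.exp((cur_cost - c_cost) / temp):
            cur, cur_cost = cand, c_cost
            if c_cost < best_cost:
                best, best_cost = dict(cand), c_cost
        temp *= cool
    return best, best_cost

areas = [4.0, 4.0, 4.0, 4.0]
traffic = {(0, 1): 5.0, (2, 3): 5.0}
plan, c2 = anneal_floorplan(areas, traffic, grid=(2, 2))
```

In the full heuristic the area term is not this square approximation but the result of the LP or SDP above, evaluated for every solution candidate.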
This second step is shown in Fig. 15.8. It takes the intralayer communication and component areas as input and returns bounding boxes.

3 In the third step, the TSV array count connecting adjacent layers, t : [ℓ − 1] → [n], is determined by optimizing the communication between layers and minimizing the area requirements of KOZs and routers. Based on the router and TSV model as defined in Sect. 15.2.1, all possible implementations are modeled by a "TSV graph", with all bounding boxes as vertices and all possible (physically implementable) TSV arrays as edges. The subgraphs of nodes located in adjacent layers are bipartite. An example is shown in Fig. 15.9 for a chip with two layers. In this example, components 1 and 2 are located in the upper layer and components 3 to 6 in the lower one. It is physically possible to connect component 1 to all components in the lower layer; component 2 can only connect to 5 and 6. The vertical links, which
Fig. 15.9 Exemplary chip with corresponding TSV graph, which is bipartite. Selected TSVs are dashed and demonstrate the matching property. (a) Example chip. (b) Example TSV graph
can be implemented, form a matching in the TSV graph, as shown by the dashed links in Fig. 15.9. The upper bound for the number of TSV arrays connecting two layers is given by the largest maximum matching in these subgraphs, and it is small, since the number of components per layer bounds it. Therefore, an optimal solution can be found efficiently by iterating over all numbers of TSVs. The locations of the bounding boxes for components are already known. The actual location of the TSVs is unknown and only determined in the next step. Therefore, it is approximated by dissecting the covered area of the bounding boxes into equal-sized rectangles. The number of rectangles is given by the next larger square number of TSVs (i.e., for 6 TSV arrays, 9 rectangles are defined, with 3 rows and 3 columns). This approach retains the well-known topological grid. Through-silicon vias are positioned in the centers of these rectangles, as shown in Fig. 15.10, filling the rectangles randomly. The distance from the center of each component bounding box to the next TSV array is calculated. It is measured as a hop distance multiplied by the communication of the component with components in the other layers (both in the upward and downward direction). The sum of these weighted distances is an estimate of the communication costs, which we call c. The best solution candidate is selected using the objective function, with weights ω_a and ω_c:

c^(3) = ω_a Σ_{ξ∈[ℓ−1]} (f_KOZ + f_R3D) t(ξ) + ω_c c → min.   (15.51)
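Because the count is bounded by the components per layer, step 3 can afford an exhaustive scan in the spirit of Eq. (15.51). A toy sketch; the 1/t communication model and all constants are invented for illustration:

```python
def best_tsv_count(max_count, area_per_tsv, comm_cost, w_a=1.0, w_c=1.0):
    """Step-3 sketch in the spirit of Eq. (15.51): scan every feasible TSV-array
    count and trade KOZ/router area against the estimated inter-layer
    communication cost c = comm_cost(t)."""
    best_t, best = None, float("inf")
    for t in range(1, max_count + 1):
        c3 = w_a * area_per_tsv * t + w_c * comm_cost(t)
        if c3 < best:
            best_t, best = t, c3
    return best_t, best

# Toy estimate: communication cost shrinks like 1/t (more arrays, shorter detours).
t_opt, c3 = best_tsv_count(8, area_per_tsv=4.0, comm_cost=lambda t: 64.0 / t)
```

The two adversarial trends are visible directly: the area term grows linearly in t while the communication estimate falls, so the scan picks the interior minimum.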
The third step is shown in Fig. 15.8 in the middle. It takes the interlayer communication and component areas as input and returns the number of TSV arrays.

4 In the fourth step, the positions of the 3D routers are determined, and TSV arrays are associated with bounding boxes. This step increases the area requirement of the bounding boxes. If necessary, routers are added to keep the routing algorithm connected, as already described in the MILP. Simulated annealing is used to find an optimized network E_N. The possible connection candidates are already given in the "TSV graph". Since routers are only
Fig. 15.10 Approximated places of TSV arrays for 4 bounding boxes
allowed to connect to a single TSV array per direction, a legal solution must be a matching in the subgraph of the TSV graph with edges between adjacent layers. The initial solution generates a random matching for every pair of adjacent layers using a simple greedy algorithm, with a matching size smaller than or equal to the number of TSV arrays determined in step three. The neighbor function randomly selects a pair of layers and includes a new random TSV array from the TSV graph. If the added connection violates the matching property, all TSVs that connect to the routers at the ends of the new connection are deleted. If the maximum number of TSVs is exceeded, another random connection is deleted. Then, a greedy algorithm finds a new matching, with a size smaller than or equal to the number of TSV arrays. The objective function minimizes the communication for the given network E_N and the flow of the routing algorithm, following Eq. (15.12). At the end of this step, a complete network graph is defined. The fourth step is shown in Fig. 15.8 in the last node. It takes the TSV connection model as input and returns bounding boxes for TSVs and routers.

5 After adding the 3D router area and KOZs, the area of bounding boxes may be too small, or boxes may overlap. In the fifth step, the solution from step four is legalized while retaining the order of components in layers, rows, and columns. The LP or SDP formulated in step two (floor planning of components) is reused. This step is shown in Fig. 15.8 at the end. It returns an optimized solution based on the router and TSV area model. This final step defines a complete network graph and non-overlapping bounding boxes, and the network can be designed. The solution can be further optimized by determining the exact locations of routers, components, and TSV arrays within their bounding boxes. Standard approaches from layout synthesis are applicable in a post-processing step.
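The greedy matching used for initial and neighbor solutions can be sketched in a few lines. The edge list mirrors the Fig. 15.9 example; the traversal order is an arbitrary choice, not the book's:

```python
def greedy_matching(edges, limit):
    """Greedy matching as used for initial/neighbor solutions in step 4: accept
    TSV-graph edges (upper router, lower router) while no router is reused and
    at most `limit` arrays are placed."""
    used_top, used_bot, picked = set(), set(), []
    for top, bot in edges:
        if len(picked) == limit:
            break
        if top not in used_top and bot not in used_bot:
            picked.append((top, bot))
            used_top.add(top)
            used_bot.add(bot)
    return picked

# Edge list mirroring the Fig. 15.9 example: component 1 may reach 3..6,
# component 2 only 5 and 6.
edges = [(1, 3), (1, 4), (1, 5), (1, 6), (2, 5), (2, 6)]
picked = greedy_matching(edges, limit=2)
```

Shuffling `edges` before the scan yields the random matchings the neighbor function needs; the matching property guarantees that no router ends up with two vertical links in the same direction.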
15.5 Evaluation

In this section, we evaluate our approach to NoC synthesis for application-specific SoC design as introduced before. The MILP and the heuristic algorithm are implemented in MATLAB. We use an Intel Core i7-6700 running at 3.4 GHz or an Intel Core i7-7740X processor running at 4.3 GHz with Windows 10 and MATLAB R2018a. For optimizing LPs, we use CPLEX 12.8.0 [112]. For optimizing SDPs, we use Mosek 8.1 [171]. Both tools utilize the CPU's 8 logical cores and 8 threads.
15.5.1 Performance and Computational Complexity

Network-on-chip synthesis shows similarities to layout synthesis, since problems like floor planning and partitioning are solved. Layout synthesis is NP-hard [156], and the solution space is even larger for 3D systems. As reported in [154], the number of possible arrangements in the solution space for 3D IC floor planning is increased by a factor of N^{n−1}/(n − 1)! with N components and n layers compared to 2D. Hence, the optimization problem in this chapter is NP-hard. As in layout synthesis, standard divide-and-conquer approaches cannot be applied directly due to interdependencies: positions of components in one layer influence the other layers. Some constraints, such as non-overlapping tiles, are difficult to satisfy.
15.5.2 Mixed-Integer Linear Program

The complexity of the MILP can be demonstrated by an exemplary implementation using MATLAB R2018a and IBM CPLEX 12.8.0 [112]. Inequalities and variables are generated automatically from the input sets. CPLEX finds a valid solution within a few minutes for fewer than 3 layers and 5 components on an Intel i7-6700 CPU running Windows 10. For realistic sets with more than 5 components, the performance of CPLEX is too low, which naturally motivates a heuristic solution.
15.5.3 Heuristic Algorithm

We analyze the computation time of the heuristic algorithm per step.

1 The calculation time of the component-to-layer assignment is difficult to estimate without experiments, since it is solved using an ILP. Experiments yield less than a second of run time, even for larger input sets (Table 15.4).

2 The component floor planning uses simulated annealing. Its run time is determined by the maximum number of iterations and the time required to generate solution candidates and assess the cost function. Calculating an initial solution and the neighbor function costs O(m) for generating a random vector. For the cost function, the area of each configuration must be minimized; therefore, an LP (with linearization) or an SDP is used. Both are polynomial-time optimizations. We use an ellipsoid algorithm for an upper bound on the LP computation time, i.e., the worst-case approximation. Following Khachiyan's approach, solving an LP can be done in O(n⁴L), with n variables and L input bits [130]. A layer with k rows and l columns yields O(k + l) variables. L is essentially the bit size of x_max and η, so it is bounded. The pairwise distance between communicating components is calculated for the cost function, i.e., this takes O(kl).
3 To calculate the number of TSVs, we inspect every possible solution and find the global minimum. The largest matching is calculated in O(|E_N|n) = O(n³) ([136], Theorem 10.4). The number of inspected solutions is equal to the size of the largest maximum matching, which is a constant smaller than nl−1. A cost function is evaluated for each candidate by calculating pairwise distances between components, i.e., O(n²).

4 Placing the 3D routers uses simulated annealing. Initial and neighboring solutions are generated with a greedy algorithm, which constructs a matching by iterating over all possible edges in the network, i.e., O(m²). The cost function calculates the routing algorithm for all edges in E_A, yielding |E_A| ≤ n² iterations. Calculating the routing function is similar to breadth-first search, i.e., it takes O(m + m²) ([136], Theorem 2.19). Together this yields O(n²m²).

5 The legalization uses the same LP or SDP as already discussed.

Summing up, the proposed heuristic allows for an efficient solution in polynomial time, since it is in O(n²m²L).
15.5.4 Optimization Results

We start with an evaluation using homogeneous 3D SoC examples, as their optimal results are easily generated manually. Thereby, we can validate our approach.
15.5.4.1 Case Study for Technology Model
The redistribution length, D, is calculated using the formula provided by technology vendors. As an exemplary case study, we connect two layers with a 9-bit vertical link. One layer is implemented in a commercial 45-nm digital technology and the other in a 180-nm mixed-signal technology. For a target frequency of 10 MHz, with a 20% delay margin left for the remaining circuit, the length of the RD from the 45-nm node to the 180-nm node is 0.07 m. In the opposite direction, it is 0.03 m.
15.5.4.2 Comparison to Existing Approaches
The proposed heuristic algorithm cannot be compared to existing approaches as a whole, since it solves a novel problem formulated in this chapter. Individual steps are compared to similar methods. This comparison is not worthwhile for steps 1 and 3, as they are merely a linear optimization in preparation of the remainder of the heuristic algorithm. The third step determines the number of TSVs. The result is highly dependent on the parameters of the cost function. Hence, realistic parameters must be found for each set of technologies, which are not available to academia. Therefore, [165] uses a fixed ratio between implemented TSVs and available TSV
Table 15.1 Area and performance comparison with benchmarks [228]. The simulated annealing is executed with 30 reruns, initial temperature 30, cooling 0.98, and 15,000 iterations. The semi-definite program allows for an aspect ratio of up to 0.1

mp3 enc mp3 dec
  Configuration       Area [A]: Mean / Std / Δ    Comm. [Hops Mb]: Mean / Std / Δ   Bandwidth [Mb/s]: Mean / Std / Δ
  Baseline [228]      11301 / — / —               19858 / — / —                     4060 / — / —
  Baseline with SDP   10178 / — / −9.94%          19858 / — / 0.0%                  4060 / — / 0.0%
  Initial solution    7902 / — / −30.1%           33707 / — / +69.7%                7994 / — / +96.9%
  SA communication    11699 / 1598 / +3.52%       20449 / 404 / +2.98%              4265 / 201 / +5.05%
  SA balanced         8244 / 505 / −27.1%         21280 / 624 / +7.16%              4452 / 674 / +9.66%

H263 enc mp3 dec
  Baseline [228]      12535 / — / —               255324 / — / —                    84884 / — / —
  Baseline with SDP   10178 / — / −18.8%          255324 / — / 0.0%                 84884 / — / 0.0%
  Initial solution    6993 / — / −44.2%           525537 / — / +106%                85244 / — / +0.42%
  SA communication    15762 / 1723 / +25.7%       241479 / 15333 / −5.42%           73012 / 14302 / −14.0%
  SA balanced         10474 / 2148 / −16.4%       250187 / 14763 / −2.0%            73161 / 17497 / −13.8%

H256 dec mp3 dec
  Baseline [228]      8568 / — / —                17546 / — / —                     4085 / — / —
  Baseline with SDP   8091 / — / −5.57%           17546 / — / 0.0%                  4085 / — / 0.0%
  Initial solution    7281 / — / −15.0%           39171 / — / +123.3%               6560 / — / +60.1%
  SA communication    10779 / 1460 / +25.8%       17341 / 342 / −1.17%              5065 / 906 / +24.0%
  SA balanced         8516 / 796 / −0.61%         17572 / 487 / +0.15%              4974 / 902 / +21.8%
positions. We compare steps 2 and 4 in the next sections. The SDP or LP in step 5 is reused from the second step and therefore not assessed individually.

Step 2 and 5: Placement of Components [228] maps quadratic-shaped cores of varying size onto a 2D-mesh NoC. The work proposes an MILP and a heuristic approach. It optimizes network performance, minimizing the transmission-energy consumption. We compare the results of our heuristic algorithm, step 2, with the results of [228] using the three benchmarks provided by [228], namely H.256 decoder mp3 decoder, H.263 encoder mp3 decoder, and mp3 encoder mp3 decoder. Traffic patterns are taken from the benchmarks in [214] and the core areas from [228]. The results for area and network performance are given in Table 15.1. The area is measured as the area of the complete chip embracing the components. Network performance is measured by accumulating the load of all links (delay) and by the maximum link load (throughput). Five data sets are given per benchmark. First, the baseline is given by Ref. [228] for mesh-based solutions. We do not compare with non-mesh solutions, since they allow additional freedom and would provide an unfair baseline. Second, this configuration is optimized using the proposed SDP. The aspect ratio in the SDP is η = 0.1, so that non-quadratic, rectangular shapes of components with that maximum aspect ratio are possible. Thereby, we assess the optimization potential from the additional freedom in core shapes. Third, the initial solution for the subsequent simulated annealing is given, an area/communication-efficient solution. It shows the optimization potential in terms of area. Fourth, the proposed simulated annealing is executed 30 times with 15,000 iterations, an initial temperature of 30, and a cooling of 0.98. The results of the 30 reruns are averaged, and the standard deviation is calculated. We set the weight of the area in the cost function to zero to optimize communication. Fifth, we balance the weights in the cost function and prioritize neither area nor communication. A single run of the simulated annealing terminates after approximately 17 minutes on a Windows 10 workstation using an Intel Core i7-7740X processor at 4.3 GHz. The results are discussed in Sect. 15.5.5.1.

Step 4: Placement of Vertical Links (TSV-Array Placement) There are multiple works that place vertical links in a 3D NoC for a given TSV count. At the time of writing this book, the most recent work also mapping an application onto a 3D-mesh NoC is [165]. The authors propose an ILP and a particle swarm optimization (PSO). Since the mapping is already conducted in step 2, we only compare the TSV placement. We take video object plane detection (VOPD) and double video object plane detection (DVOPD) [214] as benchmarks. The other benchmarks from [214] are smaller, and a comparison is not helpful because of convergence to the global minimum.
Table 15.2 VOPD benchmark [165]: network performance comparison (hop distance [HD] times bandwidth [Mb]) in a 4 × 2 × 2 NoC. 20 reruns for PSO and simulated annealing with the same computational time budget

TSV count   PSO: Mean / Std   Proposed: Mean / Std   Difference
1           12229 / 0         12229 / 0              0%
2           10591 / 581       9005 / 0               15%
3           8894 / 102        8659 / 0               3%
4           9013 / 364        8595 / 0               5%
5           8725 / 155        8595 / 0               1%
6           8723 / 148        8595 / 0               1%
7           8595 / 0          8595 / 0               0%
8           8595 / 0          8595 / 0               0%
Average improvement                                  3.125%
Table 15.3 DVOPD benchmark [165]: network performance comparison (bandwidth times hop distance) in a 4 × 4 × 2 NoC. 20 reruns for PSO and simulated annealing with the same computational time budget

TSV count   PSO: Mean / Std   Proposed: Mean / Std   Difference
1           43330 / 0         43330 / 0              0%
2           38274 / 163       37954 / 395            1%
3           34636 / 0         33854 / 0              2%
4           34217 / 674       32382 / 0              5%
5           33249 / 555       31014 / 0              7%
6           32351 / 699       30168 / 0              7%
7           31920 / 575       29916 / 0              6%
8           30767 / 679       29744 / 0              3%
9           30767 / 679       29744 / 0              3%
10          30318 / 453       29712 / 0              2%
11          30235 / 409       29712 / 0              2%
12          29764 / 69        29712 / 0              0%
13          29996 / 340       29712 / 0              1%
14          29805 / 208       29712 / 0              0%
15          29712 / 0         29712 / 0              0%
16          29712 / 0         29712 / 0              0%
Average improvement                                  2.563%
We chose an arbitrary but fixed mapping for both benchmarks. We use 20 reruns for the PSO and for our simulated annealing with the same computation time budget. The parameters of the PSO are given by Manna et al. [165] (k1 = 1, k2 = 0.04, k3 = 0.02). The simulated annealing uses an initial temperature of 30, a cooling of 0.97, and 1,000 iterations. Both techniques use the same objective, which minimizes bandwidth times communication-hop distance. The results are shown in Table 15.2 for VOPD and in Table 15.3 for DVOPD. The proposed heuristic algorithm allows for up to 15% improved performance vs. [165].
Fig. 15.11 Component communication digraph for small input
15.5.4.3 Comparison to the Global Minimum
The mixed-integer linear program can optimize tiny input sets, which allows comparing the heuristic to the global optimum. We use a homogeneous 3D SoC with 2 layers and 5 components. Components require 10 area units (A), routers 3/5 A per port, and KOZs 2 A. The performance/power of all components and routers is identical. The maximum length of the RD is √5 A. We model a fully adaptive routing algorithm, i.e., we only require connectivity of the network. We use a neutral cost function with all weights set to 1. The component communication digraph has bidirectional links with bandwidth 1 between subsequent components, as shown in Fig. 15.11.

The result of the MILP is shown for 5 components in Fig. 15.12a. The model comprises 23,277 inequalities and 18,121 variables, requires 7.95 s to set up, and uses 1.3 GB of memory. The optimization is terminated after 10 minutes. It takes 94 s to find an initial solution with a gap of 25.93%. After 599.34 s, a second solution is found, and the gap is reduced to 25.00%. The area of the upper layer is 57.43 A and that of the lower layer 29.35 A.

An exemplary result of the heuristic for the same input with 5 components is shown in Fig. 15.12b (optimization with the linearized LP) and Fig. 15.12c (optimization with the SDP). To compare against the MILP, we use the linearized LP (the SDP reduces the linearization error). The simulated annealing for component floor planning is executed with an initial temperature of 20, 120 iterations, and a cooling factor of 0.97 per iteration. The placement of routers uses an initial temperature of 100, 50 iterations, and a cooling factor of 0.97 per iteration. All weights of the cost functions are set to 1. As shown in Table 15.4, the heuristic algorithm using the LP requires approximately 24 s to find a solution. When using the SDP, 152.4 s elapse until termination. In both cases, 1.3 GB of memory are used. The resulting areas are given in Table 15.5. For the LP, the size of the upper layer is 43.0 A and that of the lower layer 25.7 A.
For the SDP, the area requirements are smaller with 36.8 A and 23.0 A, respectively.
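The simulated-annealing schedule used above (initial temperature, fixed iteration budget, geometric cooling per iteration) follows the standard template. A minimal sketch, with a placeholder cost and neighbor move standing in for the book's actual floor-planning cost functions and perturbations:

```python
import math
import random

def simulated_annealing(initial, cost, neighbor,
                        t0=20.0, iterations=120, cooling=0.97, seed=0):
    """Generic simulated-annealing loop with the schedule from the text:
    initial temperature t0, a fixed number of iterations, and geometric
    cooling by the factor `cooling` per iteration."""
    rng = random.Random(seed)
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    t = t0
    for _ in range(iterations):
        candidate = neighbor(current, rng)
        candidate_cost = cost(candidate)
        delta = candidate_cost - current_cost
        # Always accept improvements; accept degradations with
        # Boltzmann probability exp(-delta / t) to escape local minima.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
        t *= cooling
    return best, best_cost
```

For the router placement described above, the same loop would be run with t0=100 and 50 iterations; only the cost and move functions differ.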
15.5.4.4 Case Study: Homogeneous 3D SoC
The proposed heuristic is executed for a homogeneous 3D SoC. The homogeneity allows us to understand and validate the solution since the properties of optimal solutions are known. We use a SoC with 4 identical layers and a fully connected
15.5 Evaluation
353
Fig. 15.12 Result for 5 components using different optimizers. Bounding boxes are shown in gray and TSV arrays in red. (a) MILP. (b) Heuristic algorithm with LP area optimization. (c) Heuristic algorithm with SDP area optimization
component graph E_A with 40 or 80 components. The remaining parameters are identical to the previous example (Sect. 15.5.4.3). One possible result of the heuristic for a SoC with 80 components, using the LP for optimization, is shown in Fig. 15.13a, and the corresponding one for the SDP in Fig. 15.13b. For an input set with 40 components, the heuristic algorithm terminates after 503 s using the LP and after 675 s using the SDP. For an input set with 80
354
15 Network Synthesis and SoC Floor Planning
Table 15.4 Execution time of the heuristic algorithm (homogeneous 3D SoC)

Heuristic step                 5 comp. (2 layers)    40 comp. (4 layers)   80 comp. (4 layers)
                               LP        SDP         LP        SDP         LP        SDP
1 Comp.-to-layer assignment    0.4 s     0.4 s       0.4 s     0.4 s       0.4 s     0.4 s
2 Layer floor planning         20.2 s    146 s       194 s     375 s       609 s     802 s
3 Number of TSV arrays         0.1 s     0.1 s       0.2 s     0.2 s       0.2 s     0.2 s
4 Placement of TSV arrays      3.9 s     3.8 s       308 s     298 s       1246 s    1300 s
5 Legalization                 0.2 s     0.2 s       0.3 s     0.5 s       0.3 s     0.5 s
Total execution time           24.7 s    152 s       503 s     675 s       1856 s    2104 s
Table 15.5 Area comparison of LP and SDP (area of the homogeneous SoC, in A)

           5 components           40 components          80 components
           LP     SDP    Δ        LP     SDP    Δ        LP     SDP    Δ
Layer 1    43.0   36.8   16.8%    211    178    18.5%    364    301    20.9%
Layer 2    25.7   23.0   19.6%    222    180    23.3%    379    313    21.1%
Layer 3    –      –      –        214    183    16.9%    378    313    20.8%
Layer 4    –      –      –        185    154    27.6%    316    261    21.1%
Average                  18.2%                  21.6%                  21.1%
components, the heuristic algorithm terminates after 1856 s using the LP and after 2103 s using the SDP. The execution times are shown in detail in Table 15.4. The average hop distance in the network with 40 components is 3.19 (LP) and 3.29 (SDP); in the example with 80 components, it is 3.76 (LP) and 4.71 (SDP). The area of the 3D SoC with 40 components is 214 A for the LP and 183 A for the SDP; with 80 components, it is 379 A for the LP and 313 A for the SDP. The detailed area results are shown in Table 15.5.
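The reported average hop distances are measured on the synthesized networks. As a rough illustration of the metric itself, the following sketch computes the mean Manhattan hop distance over all distinct router pairs of a regular 3D mesh under minimal routing and uniform traffic (a simplification of the actual evaluation):

```python
from itertools import product

def avg_hop_distance(dim_x, dim_y, dim_z):
    """Mean Manhattan hop distance between distinct routers of a
    dim_x x dim_y x dim_z mesh, assuming minimal routing and
    uniform all-to-all traffic."""
    nodes = list(product(range(dim_x), range(dim_y), range(dim_z)))
    total = pairs = 0
    for a in nodes:
        for b in nodes:
            if a == b:
                continue
            total += sum(abs(p - q) for p, q in zip(a, b))
            pairs += 1
    return total / pairs
```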
15.5.4.5 Case Study: 3D VSoC
We apply our heuristic algorithm to a heterogeneous 3D SoC: the typical 3D VSoC case study used throughout this book. Here, the vision system on a chip consists of only two layers, as the analog sensor layer is not optimized by our method; we only consider the mixed-signal layer on top, used for analog-digital conversion of the image data from a sensor. The digital layer is located below for data processing. The chip implements 18 components: 9 analog-to-digital converters (ADCs), which can only be implemented in the mixed-signal layer, and 9 processors that can be implemented in both layers but are larger and perform less
Fig. 15.13 Result for homogeneous 3D SoC with 80 components and with fully-connected component communication graph using linearized LP and SDP. Bounding boxes are shown in gray and TSV arrays in red. (a) Result from linearized LP. (b) Result from SDP
in the mixed-signal layer than in the digital layer (100 A and 130 A). The ADCs, with 25 A, are smaller than the processors in the digital layer. Each analog-to-digital converter sends data to one processor. The processors send data to all others, forming a fully connected application graph. We do not consider energy consumption and computational performance in this example and set them to the same value in each layer for each component. This simplification is reasonable, as the LP in the first step then does not need to be evaluated, due to its reduced complexity.

The conventional design for the VSoC is shown in Fig. 15.14a. The analog-to-digital converters are located in a 3 × 3 grid in the mixed-signal layer; the processors are located in a 3 × 3 grid in the digital layer. Since traditional approaches do not use the RD and use components with identical sizes, the grids in both layers are similar, and the routers are located at the same positions. Please note the white space in the mixed-signal layer and the dense packing in the digital layer.

The result of the proposed heuristic algorithm is shown in Fig. 15.14b, with a maximum aspect ratio of 0.1, 500 iterations, a cooling factor of 0.97, and an initial temperature of 30 in the simulated annealing. Three processors are located in the mixed-signal layer to reduce implementation costs by exploiting the advantages of heterogeneous integration. Fewer TSV arrays between routers are implemented than in the traditional approach. The run time of the example is 93 s.
Fig. 15.14 Floor plan of a heterogeneous 3D SoC with 9 ADCs (yellow) in a mixed-signal layer (on top) and 9 processors (gray) in a digital layer (on bottom). Bounding boxes are shown in gray and TSV arrays in red. (a) Conventional result. (b) Result from heuristic
15.5.5 Discussion

In this subsection, we discuss the results presented before. We focus on the comparison against existing approaches to NoC synthesis and on our case studies.
15.5.5.1 Comparison to Existing Approaches
15.5.5.2 Steps 2 and 5: Placement of Components
The placement of different-sized components in each layer is compared to [228] in Table 15.1. Applying the SDP, which offers further area-optimization potential through non-quadratic component areas, reduces the area by up to ca. 19%. Communication is unchanged since the layout is identical. The initial solution for the simulated annealing only optimizes the area without considering the communication. This result shows an area-optimization potential between 15 and 44%. Communication is worse, as expected, by up to 2.2×.

Our simulated annealing begins with this initial solution and restores the desired communication. The communication is only about 3% worse, and the maximum link load 5% worse on average, for the H.263 decoder mp3 decoder benchmark. Communication and link load are better than the baseline for the H.263 encoder mp3 decoder as a result of a different cost function in [228]. In the case of the mp3 decoder mp3 encoder, communication is slightly improved by 1%, while the maximum link load is worse by 24%. These three benchmarks demonstrate that the
proposed heuristic algorithm optimizes network performance from an inferior initial solution. Both area and communication are optimized in a realistic scenario; the results are shown in the last rows of Table 15.1, using a balanced cost function. The optimized H.263 decoder mp3 decoder has an area reduction of 27%, while communication and link load are worse than the baseline by 7 to 9% on average. This result shows that considerable area reductions are possible. Optimization of the H.263 encoder mp3 decoder reduces the area by 16%, the accumulated network load is better by 2%, and the maximum link load is better by 13% on average. Here, the heuristic outperforms state-of-the-art methods.

In summary, the proposed heuristic algorithm allows for an up to 16% reduced area with simultaneous improvements in network performance over state-of-the-art approaches. At maximum, 44.2% area reduction is possible, without improved communication. The results can be improved further by allocating more than 17 minutes of CPU time.
15.5.5.3 Step 4: Placement of Vertical Links
The placement of vertical links is compared to [165] in Tables 15.2 and 15.3. For the VOPD benchmark, the proposed method yields up to 15% better communication (measured in bandwidth times hop distance) and is more efficient: with the same time budget, it reliably finds the global optimum. For TSV counts of 2 to 6, the PSO does not find the global minimum, which leads to worse communication. We achieve an average improvement of 3.125% for VOPD. The DVOPD benchmark yields similar results.
15.5.5.4 Validity and Quality of the Results
It is not trivial to assess the validity and quality of the heuristic algorithm since calculating a provably optimal solution (using the MILP) is infeasible in general. Evaluation is possible using two scenarios. First, the MILP can calculate solutions for tiny input sets, which can be compared to the results of the heuristic for the same input set. Second, input scenarios with known optimal results can be used. For this, we use the homogeneous 3D SoC case study with a fully connected communication graph.
15.5.5.5 Comparison to Optimal Results
For our small input sets, the heuristic shows better performance than the results of the (prematurely terminated) MILP. To generate a first solution candidate, CPLEX requires a little over 3 minutes, while the heuristic terminates after 25 s (LP) or 152 s (SDP). Considering chip area, the heuristic algorithm outperforms the MILP for similar optimization times: the upper layer is 34% (LP) or 56% (SDP) smaller, and the lower layer is 14% (LP) or 28% (SDP) smaller than the result from the MILP.
Both the MILP and the heuristic algorithm with the LP use the same linearization and can thus be directly compared. The better optimization capabilities of the SDP are also demonstrated. Regarding power and component performance, the results are equivalent. Considering network performance, the MILP wins: the components are strung together like a necklace, which perfectly matches the application graph. In summary, the area results obtained with the heuristic algorithm are superior, since non-linearized, non-approximated optimization is possible. The results are inferior in terms of network traffic, since communication spanning multiple layers is not considered.
15.5.5.6 Case Study: Homogeneous 3D SoC
Two exemplary results for a homogeneous 3D SoC with 4 layers and 80 components are shown in Fig. 15.13a using the LP and in Fig. 15.13b using the SDP. The component-to-layer assignment assigns 20 components to each layer, so one of the Pareto-optimal solutions is selected (each solution with 20 components per layer is Pareto optimal). As expected, the number of TSV arrays is identical between all layers (16 in the present solutions). The assignment of components and routers to rows and columns also returns the expected result, with 5 × 4 NoCs in all layers.

One would expect the positions of TSV arrays to be similar between all layers. We set the length of the RD in all layers to √5 A, so it spans a hop distance of 2 to 3 routers to the TSV array. Since the component communication graph is fully connected and all edge weights are identical, every connection scheme within this distance is Pareto optimal. The distance for the RD is large enough that TSV arrays can span multiple PEs. From a practical point of view, these can be deleted or merged in post-processing, which further reduces the area required for KOZs.

In an optimal solution, the area of all layers would be identical. Since solely downward TSV connections have KOZs, the bottom layer is smaller. (The keep-out zones are not accounted for in the component-to-layer assignment.) The area of the individual layers is given in Table 15.5. As expected, the lower layer is approximately 18% smaller than the average of the other layers, for both the LP and the SDP in the current results. To summarize, the heuristic algorithm is validated for a homogeneous 3D SoC, for which optimal results are known.
15.5.6 Case Study: 3D VSoC

The comparison for the heterogeneous SoC yields the expected results, as shown in Table 15.6. We achieve an area reduction of 28% through more efficient packing and an 83% reduction in white space. The heuristic algorithm does not consider interlayer communication; thus, communication and maximum link load are worse by 49% and 60%, respectively. This result does not necessarily limit the system
Table 15.6 Comparison of conventional and proposed NoC planning for a heterogeneous 3D SoC with one mixed-signal and one digital layer and 18 components (9 ADCs and 9 processors). Parameters: initial temperature 30, cooling 0.98, 500 iterations, aspect ratio of bounding boxes η = 0.1

                     Conventional      Proposed          Δ
Area                 3712 A            2676 A            −28%
Whitespace           663 A             110 A             −84%
Communication        1890 Mb·√A/s      2810 Mb·√A/s      +49%
Maximum link load    100 Mb/s          160 Mb/s          +60%
performance, as long as the maximum link load is below the throughput capabilities of the corresponding link.
15.5.6.1 Performance of the Heuristic Algorithm
The individual execution times per step of the heuristic algorithm are shown in Table 15.4. The mixed-integer linear program for the component-to-layer assignment is efficient and accounts for only a fraction of the computation time. The layer floor planning requires 3.1× more time for a doubled input size using the LP. This effect is not due to a larger input size to the LP for area minimization, since the legalization in the last step does not suffer from a larger input size and the optimization is identical. This finding demonstrates that the worst-case approximation of O(n⁴) overestimates the execution time. The increase in computation time is instead a result of calculating the cost function, which matches the theoretical expectation, since its complexity scales quadratically with the number of components. Using the SDP is more efficient for larger input sets compared to the LP; for small sets, the SDP has a high overhead. Iterating all possible solutions for the TSV count is a reasonable choice, since the increase in computation time was not measurable at the given precision. The placement of TSV arrays increases its computation time by 4× for a double-sized input. This finding also matches theory, since the cost calculation scales quadratically. Legalization scales well, as discussed.
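The quadratic scaling of the cost calculation argued above can be made concrete: a pairwise communication cost (here a simplified bandwidth-times-Manhattan-distance term, not the book's full cost function) touches all component pairs, i.e., O(n²) work per evaluation, so doubling the input roughly quadruples the time spent in it.

```python
def communication_cost(positions, bandwidth):
    """Simplified pairwise communication cost: sum over all component
    pairs of bandwidth times Manhattan distance. Evaluation is O(n^2)
    in the number of components."""
    n = len(positions)
    cost = 0.0
    for i in range(n):
        xi, yi = positions[i]
        for j in range(i + 1, n):
            xj, yj = positions[j]
            cost += bandwidth[i][j] * (abs(xi - xj) + abs(yi - yj))
    return cost
```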
15.5.6.2 Gap to the Optimal Solution
In general, it is impossible to determine the gap to the optimum for the proposed heuristic algorithm, since an exact solution cannot be calculated efficiently. Our results outline the potential that is lost through the approximations. During component-to-layer assignment, only the component area is considered, since the count of TSV arrays is still unknown. This simplification influences the area of the layers: for the input set with 80 components, the bottom layer is 16% smaller than the average of the other layers. During the determination of the TSV
count and the component floor planning, interlayer and intralayer communication are considered separately. As demonstrated in the example with 5 components, the MILP achieves an up to 25% lower average hop distance for packets. The quality of the heuristic varies, as it often gets stuck in local minima; further fine-tuning of algorithmic parameters is required.

One can use either the SDP or the LP to optimize the area of the layers. Using the LP provides comparability to the MILP but yields worse results due to the approximation error.7 One exemplary result for a homogeneous 3D SoC with 80 components is shown in Fig. 15.13a using the linearized LP and in Fig. 15.13b using the SDP. The average hop distance of packets for both solutions is similar, with values of 4.71 and 4.78. The area of both chips should be similar, with a linearization error of approximately 12% [140]. We can validate this for our input examples, as shown in the area comparison of the results from LP and SDP in Table 15.5. On average, the SDP performs 20% better. In terms of computational speed, the LP offers up to 6× higher performance than the SDP, as shown in Table 15.4. In summary, the LP has higher speed, but the SDP produces better results. The linearization error of the LP results in 20% more white space than in solutions from the SDP for our benchmarks. Competing approaches in the field of placement, such as [52], report between 5 and 20% white space.

The provided solution requires post-processing. Standard methods can be used for the actual placement of gates and TSVs within the bounding boxes. The placement of routers within the bounding boxes can also be optimized; for instance, they can be placed at the edges of the bounding boxes, as proposed in methods for pin placement [156].
15.6 Conclusion

This chapter described the workload-specific NoC synthesis for heterogeneous 3D SoCs. The problem is modeled and solved with a heuristic, using a decomposition into individual, solvable parts. It provides an optimized solution for large input sets of 80 components in approximately 35 minutes. We compare different optimization methods within the heuristic and remove 20% of the white space from the design. We validate our method with input sets for which the optimal solution is known. We find that our solution is better than the state of the art in terms of area because we use non-linear models with more freedom in placing and shaping components. For example, we achieve up to 15% better communication from better TSV placement. Furthermore, our approach converges faster than comparable methods.
7 The SDP is optimal and does not yield an error. However, it will yield solutions with white space if the configuration of components is inefficient and does not allow tight packaging.
This chapter concludes the optimization of NoCs for heterogeneous 3D integration. We considered every component of the NoC, from the application via the architecture down to the placement of TSVs. This comprehensive method covers the full stack of NoC design. In combination with the low-power/high-performance methods for the individual vertical links proposed in Part IV, this book enables the efficient design of communication networks in heterogeneous 3D SoCs.
Part VI
Finale
Chapter 16
Conclusion
The focus of this book was on high-level modeling and optimization of network-on-chip (NoC)-based interconnect architectures for technologically heterogeneous three-dimensional (3D) integrated circuits (ICs). We identified two major challenges that were not adequately addressed by previous works on the modeling and optimization of 3D interconnect architectures.

First, the parasitic capacitances of through-silicon vias (TSVs), used for the vertical links in today's heterogeneous 3D ICs, heavily impair the power consumption and timing of the interconnect architectures. Thus, the dynamic (i.e., pattern-dependent) power consumption and timing of the links must be encapsulated in high-level models and systematically optimized. A wide range of previous works deals with the high-level modeling and optimization of the metal-wire interconnects used for the horizontal links in traditional 2D and 3D ICs, as metal wires exhibit critical parasitics as well. Thus, there is demand for universal modeling and optimization techniques that work for metal-wire as well as TSV interconnects at the same time. Still, they were lacking in the published literature prior to this book.

The second identified challenge is due to the technological heterogeneity between the various dies/layers of a 3D IC. Technology heterogeneity implies that the implementation cost of a standard digital network router varies between the layers. In less-scaled technologies, the router is not only slower but also exhibits a larger silicon footprint and consumes more energy for the same work. For standard 3D NoCs, this is a pain: the overall network performance, footprint, and energy consumption become dominated by the slowest layers. To address the outlined challenges, high-level modeling frameworks for 3D NoCs must encapsulate the implications of technological heterogeneity. This technology modeling is lacking in prior 3D-NoC simulators.
Also, technological heterogeneity can be transformed from a "pain" into a "gain" through technology-aware, heterogeneous 3D-NoC architectures, which prior works likewise did not investigate adequately.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 L. Bamberg et al., 3D Interconnect Architectures for Heterogeneous Technologies, https://doi.org/10.1007/978-3-030-98229-4_16
After identifying these two fundamental open research problems, the book addressed them rigorously and systematically from both ends: modeling and optimization. The following two paragraphs summarize the book’s contributions in terms of physical 3D-interconnect modeling and optimization as well as heterogeneous 3D-NoC modeling and optimization. Afterwards, the contribution in terms of system design is outlined.
16.1 Modeling and Optimization of 3D Interconnects

Previous optimization techniques show rather limited capabilities to improve the TSV performance and are furthermore not capable of reducing the critical TSV power consumption. This book has shown that these severe limitations of existing approaches are mainly due to the absence of precise models to estimate the TSV power consumption and performance on higher abstraction levels. Optimization techniques will only effectively overcome interconnect-related issues in TSV-based 3D ICs if the techniques are derived from physically precise, yet universally valid, high-level models.

To address this problem, a set of abstract models has been proposed in this book, which enables precise estimation of the power consumption and the performance of 3D interconnects solely on high abstraction levels. The proposed models are physically precise since they were derived while considering the lowest abstraction levels. Nevertheless, the methods are universally valid and provide the required level of abstraction. Evaluations have shown that the proposed models allow for an estimation of the power consumption and performance of modern 3D interconnects with relative errors below 3%. Moreover, it was demonstrated that the proposed models enable the derivation of efficient optimization techniques for 3D interconnects. Four optimization techniques, derived from the contributed high-level formulas, have been proposed in this book that drastically outperform previous methods.

At first, a low-power technique for TSV structures was presented in Chap. 8. The technique improves the TSV power consumption by an intelligent, physical-effect-aware, local net-to-TSV assignment, exploiting the bit-level statistics of the transmitted patterns.
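The core idea of that assignment, pairing the nets that switch most often with the TSV positions of lowest effective capacitance (e.g., array corners and edges), can be illustrated by a deliberately simplified greedy sketch; `switching_activity` and `tsv_cost` are hypothetical inputs, and the actual technique in Chap. 8 is physical-effect-aware and more elaborate:

```python
def assign_nets_to_tsvs(switching_activity, tsv_cost):
    """Toy net-to-TSV assignment: the most active net is mapped to the
    TSV with the lowest effective capacitance/cost, the second most
    active net to the second cheapest TSV, and so on."""
    nets = sorted(range(len(switching_activity)),
                  key=lambda n: -switching_activity[n])
    tsvs = sorted(range(len(tsv_cost)), key=lambda t: tsv_cost[t])
    return dict(zip(nets, tsvs))
```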
Analyses for a broad set of real and synthetic data streams underlined the efficiency of the proposed low-power technique, which can reduce the TSV power consumption by over 45% without inducing noticeable overhead costs. Furthermore, the proposed technique enables us to use any existing low-power approach for metal wires in the most efficient way for TSVs. This allows for improving the power consumption of both structures effectively through a single data-encoding technique.

To provide a further improvement in the performance of 3D interconnects, two optimization techniques for low-power and yet high-performance 3D interconnects were proposed in Chaps. 9 and 10 of this book. The first technique exploits the bit-level switching characteristics of traditional 2D-CAC-encoded data by a net-to-TSV assignment, which results in a simultaneous improvement in the power consumption
and performance of TSVs as well as metal wires. Experimental results showed that this approach drastically improves the performance of modern TSV arrays and metal-wire buses by 25.6% and 90.8%, respectively, while it simultaneously reduces the TSV and metal-wire power consumption by up to 5.3% and 20.4%, respectively. In comparison, prior approaches improve the TSV performance by a maximum of 12.0% while providing no optimization for the metal wires at all. Moreover, the best previous technique results in significantly higher hardware costs and a dramatic increase in the TSV power consumption by 50%.

Since the first-proposed, as well as all previous, optimization techniques for high-performance 3D interconnects are only efficient for the case of a perfect temporal alignment between the signal edges on the lines, the second proposed technique is designed to improve the TSV performance for the case of an arbitrary temporal misalignment between the signal edges. More precisely, the technique exploits temporal misalignment between the signals by a physical-effect-aware TSV assignment, which again results in negligible overhead costs. An in-depth evaluation showed that the proposed technique can improve the TSV performance by over 65%. Paired with a traditional low-power code for metal wires, a further improvement in the TSV power consumption by about 17% can be achieved, despite resulting in a more than five times lower bit overhead than the best previous technique.

The last proposed low-power technique for 3D interconnects, presented in Chap. 11, furthermore addresses the low manufacturing yield of TSV structures. This yield-enhancement technique is based on two optimal coding-based redundancy schemes that are used in combination. These coding techniques enable us to minimize the complexity of a redundancy scheme by exploiting technological heterogeneity between the dies of a 3D IC as much as possible.
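To see why redundancy lifts the yield of a TSV array, a textbook binomial model is helpful (this is only an illustration under an independent-failure assumption, not the book's coding-based scheme):

```python
from math import comb

def array_yield(n, p, spares=0):
    """Probability that an array of n signal TSVs plus `spares` redundant
    TSVs is functional, assuming independent per-TSV failure probability p.
    The array works as long as at most `spares` TSVs fail."""
    total = n + spares
    return sum(comb(total, k) * p**k * (1 - p)**(total - k)
               for k in range(spares + 1))
```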
Furthermore, the coding techniques are designed such that they additionally decrease the power consumption of TSVs and metal wires effectively. An evaluation considering a commercially available heterogeneous SoC showed that the proposed technique decreases the area overhead and the power consumption compared to the best previous technique by 69.1% and 32.9%, respectively. Despite its low hardware requirements and the large power savings it provides, the proposed yield-enhancement method can improve the overall manufacturing yield by a factor of about 17× for typical TSV defect rates (same as previous techniques). Moreover, it was shown that the proposed yield-enhancement method can be combined with the proposed technique for high-performance 3D interconnects to achieve all objectives at once: improved power consumption, improved performance, and improved overall yield.

In conclusion, all proposed techniques optimize the quality of 3D interconnects drastically beyond state-of-the-art methods, enabled by physically precise yet abstract models for the TSV metrics. Thereby, this book will accelerate the process of overcoming interconnect-related issues in 3D-integrated systems through effective high-level optimization techniques. Using high-level techniques to address the power issues of 3D interconnects is of particular importance because relying solely on advances in the manufacturing of TSV structures will likely not suffice, due to the demonstrated poor scaling of the parasitics with shrinking geometrical TSV dimensions.
16.2 Modeling and Optimization of 3D NoCs

This book contributed through Chap. 7 a modeling framework for in-depth power-performance-area (PPA) analysis of NoCs for heterogeneous 3D ICs, called Ratatoskr. The framework is open source to reach a wide range of users and contributors from academia and industry. It is the only framework that properly models the implications of heterogeneous technologies in 3D-integrated systems. Moreover, it also encapsulates the power and timing models for 3D interconnects discussed above. Thereby, it not only enables precise quantification of the network performance in cycles but also of the physical power consumption and timing. These are unique features not found in any other existing NoC simulation framework; however, they are essential for NoCs in heterogeneous 3D ICs. We showed that technological heterogeneity can result in variations in the achievable clock frequency, power consumption, and silicon area of routers by up to 16× between the various layers of a 3D IC. Hence, if heterogeneity is not properly modeled, no PPA metric can be reliably predicted. Furthermore, the physical links dominate the power consumption and timing of NoCs in advanced technologies; without proper models for the physical links, neither power consumption nor physical timing can be reliably predicted. These modeling capabilities make Ratatoskr unique in its field.

In addition to modeling and estimation, the added features enable the derivation of a wide range of architectural PPA optimizations, as shown in this book. We proposed and evaluated (employing Ratatoskr) systematically derived heterogeneous 3D-NoC architectures. In detail, all major components of an NoC were carefully redesigned for their use in heterogeneous 3D ICs. First, the NoC input buffers were investigated in Chap. 12. Our proposed technique reduces the number of implemented buffers in layers in which memory is expensive.
At the same time, the buffer count in layers in which memory is relatively cheap can be increased. This yields heterogeneous buffering architectures with positive effects on the implementation costs. Experimental results showed area savings of 8.3% and a power reduction of 5.4%; the drawback is a 2.1% worse average performance.

Next, the packet routing in a heterogeneous 3D NoC was analyzed in Chap. 13, and a heterogeneous routing algorithm was proposed. The minimal routing algorithm routes packets as early as possible to layers implemented in aggressively scaled technology nodes and leaves such layers as late as possible. Thus, all planar (i.e., XY) routing is always done in the most advanced, i.e., cheapest, layers along the path. If a packet goes from a more advanced layer to a less advanced layer, it is first routed horizontally and then vertically; if a packet goes in the opposite direction, vertical routing happens first. This heterogeneity allows reducing the energy consumption in larger NoC meshes by up to 25%, with latency improvements by a factor of 6×.
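The layer-dependent ordering of planar and vertical routing described above can be sketched as a simple decision rule; `layer_tech`, mapping a layer index to its technology node in nm (smaller = more advanced), is a hypothetical input:

```python
def route_order(src, dst, layer_tech):
    """Decide whether the planar (XY) part of a minimal route is taken in
    the source or the destination layer, so that XY routing always happens
    in the more advanced (smaller node) of the two. src and dst are
    (x, y, layer) tuples; returns 'xy_first' or 'z_first'."""
    src_layer, dst_layer = src[2], dst[2]
    if layer_tech[src_layer] <= layer_tech[dst_layer]:
        # Source layer is at least as advanced: route XY there, change layers late.
        return "xy_first"
    # Destination layer is more advanced: go vertical first, route XY there.
    return "z_first"
```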
However, the technique alone failed to improve the network throughput, as a single slow router hop along the packet path determines the throughput entirely. This throughput limitation was subsequently overcome by accelerating vertical flit transmission in all slower routers through a heterogeneous buffer, crossbar, and link architecture. Since horizontal routing is always done in the fastest layers with the proposed routing algorithm, this acceleration resolves all throughput bottlenecks of any heterogeneous routing path. Thus, combining this heterogeneous router architecture for vertical-throughput enhancement with our routing technique resulted in additional throughput improvements by up to 2×.

On the downside, we found that the vertical-throughput enhancement increases the area of the slower routers. Such an area increase is in some cases not tolerable, as even for a baseline NoC the router area is much larger in the less-scaled technologies; actually, a decrease in the area of slower routers is desired. Another drawback of the proposed vertical-throughput enhancement is that it entails either yield or physical-design complications.

To resolve these two remaining issues, the virtualization of the links was lastly investigated in Chap. 14. Rather than changing the hardware of routers and links to increase the vertical throughput at a given clock speed, we proposed to remove or reduce the virtual channels (VCs) of links in the slower layers. Virtual channels are a complex feature that at least doubles the buffer requirements and the arbitration depths. Thus, the technique yields drastic bandwidth improvements through a higher achievable clock frequency and also decreases the area requirements drastically. Normally, removing VCs is not possible without noticeably sacrificing network performance. However, with our heterogeneous routing, the slower layers transmit so little traffic that virtualization is no longer required.
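That virtual channels at least double the buffer requirements follows directly from the usual one-FIFO-per-VC-per-input-port organization; a minimal sketch of that bookkeeping (port count, VC count, FIFO depth, and flit width are illustrative parameters):

```python
def router_buffer_bits(ports, vcs, depth, flit_width):
    """Total input-buffer storage of a router: one FIFO of `depth` flits
    per virtual channel per input port. Removing VCs (vcs = 1) in the
    slower layers at least halves this storage."""
    return ports * vcs * depth * flit_width
```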
This final heterogeneous NoC router architecture showed area savings of over 40% at even better performance than our technique with the special architecture for vertical-throughput enhancement, without yield or physical-design limitations. In summary, a careful, thorough, and technology-aware analysis and design of heterogeneous NoC architectures allowed us to improve all PPA metrics at once significantly, without any noticeable drawbacks. All this was enabled by our technology-aware models embedded in the contributed Ratatoskr framework.
16.3 System Design

In Chap. 15, the last chapter of this book, we additionally proposed a technique for workload-specific NoC synthesis and floor planning for heterogeneous 3D systems-on-chips (SoCs). The problem is modeled and solved with a heuristic that decomposes it into individually solvable parts. It is the missing piece for NoC design in heterogeneous 3D SoCs. Compared to the state of the art, the technique achieves a floor plan with 20% less white space while still improving the communication properties.
16 Conclusion
16.4 Putting It All Together

Throughout this book, we frequently presented vision system-on-chip (VSoC) case studies. Some of the most relevant emerging VSoC architectures are those used for efficient convolutional neural network (CNN) acceleration. Convolutional neural networks are used for computer vision. In such VSoCs, the images are sensed from the physical environment and processed by the CNN in the same chip. Finally, the neural network's prediction results are output via an interface. Convolutional neural networks have massive compute requirements (e.g., the popular ResNet50 requires over 4 GFLOPS/frame [102]). Thus, they need a massive number of cores whose communication can only be handled by a NoC [13, 21, 61, 170]. Moreover, neural networks have massive parameter requirements (e.g., ResNet50 requires over 23 M parameters [102]). These parameters are typically stored on chip in a distributed fashion to overcome DRAM-related power and performance issues [21, 98]. Thus, large on-chip memories are needed. Since SRAM is too expensive for such large memories, emerging resistive memory technologies with a larger bit-cell density are widely used. Such AI-VSoCs are ideal candidates for NoC-interconnected, heterogeneous 3D integration as discussed in this book. First, CMOS image sensors can only be implemented in relatively conservative technology nodes (about 130 nm). Also, resistive memories are today not available in ultimately scaled technology nodes (they are typically available in about 20-nm nodes). However, the compute cores should be implemented in ultimately scaled technology nodes (e.g., below 16 nm) due to the massive compute requirements of CNNs, which otherwise cannot be handled efficiently. This is today only possible with TSV-based heterogeneous 3D integration. In the following, we briefly outline what this book brings to the design of such systems.
System-on-chip floor planning is rather straightforward here, as the cores are extremely regular and each component can only be placed in one die. Thus, it can be done with less elaborate techniques than the one proposed in Chap. 15 of this book. However, all other contributed modeling and optimization techniques prove valuable for the design of such systems. First, the modeling techniques allow identifying power, performance, and yield issues precisely at early design stages (e.g., architecture specification). This reduces the risk of costly design changes at later design stages (post RTL design, when these metrics can also be estimated at lower abstraction levels). In terms of optimization, the impact of the proposed NoC-architecture and 3D-interconnect optimization techniques can be discussed individually. The techniques are orthogonal/complementary, as they optimize different components. We start with the 3D-interconnect optimizations. It is known that the patterns and feature maps in CNNs tend to be normally distributed with a strong temporal correlation [98, 170, 183]. Thus, the redundancy-coding technique proposed in Chap. 11 will not only improve the manufacturing yield of the critical TSVs by 17×, but it will also improve the power consumption of TSVs and metal wires by 37% in the mean (see Sect. 11.6). In combination with the proposed shielding techniques and the
exploitation of temporal signal misalignment, the timing and noise of the TSV and metal-wire interconnects can be additionally reduced by over 40% (see Sects. 9.5 and 10.4). These gains come at negligible area overheads, as the proposed technique results in ultra-low gate requirements in the costly technology by exploiting the technological heterogeneity as much as possible. Through the proposed heterogeneous NoC router architecture from Chap. 14, designers and architects of such VSoC systems will additionally achieve a network-performance enhancement of over 2× and another power improvement of about 15–25% at an approximately 40% reduced router footprint. Combining the techniques, designers and architects can realistically expect NoC improvements of approximately 40% in area, 45% in power (35% from redundancy coding and another 15% from heterogeneous routing), and 3× in performance (2× from the heterogeneous NoC architecture, and another 1.5× from the improvement of the 3D-interconnect timing). These gains come on top of remarkably improved reliability from reduced coupling noise and an overall manufacturing yield enhanced by over 10×. This demonstrates the value of the contributions of this book for state-of-the-art SoCs.
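The combined figures follow from multiplying the individual factors, which a few lines of Python can verify (the percentages are the ones quoted above):

```python
# Power: a 35% saving from redundancy coding followed by a 15% saving from
# heterogeneous routing combine multiplicatively, not additively.
power_factor = (1 - 0.35) * (1 - 0.15)   # remaining power: 0.5525
power_saving = 1 - power_factor          # about 45% overall

# Performance: 2x from the heterogeneous NoC architecture and 1.5x from the
# improved 3D-interconnect timing.
speedup = 2.0 * 1.5                      # 3x overall
```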
16.5 Impact on Future Work

This book is expected to have a noticeable impact in academia and industry. The contributed high-level models can pioneer the derivation of novel optimization techniques for future applications. Emerging architectures for 3D systems from neuromorphic computing or secure processing will bring new types of data flow and thus might require a new set of low-power techniques. However, the derivation of the optimization techniques proposed in Part IV of this book has shown a systematic way to approach this challenge based on the contributed high-level models. Moreover, through the publicly available modeling framework Ratatoskr, this book enables a technology-precise simulation of the performance and power requirements of heterogeneous 3D-integrated systems. This, for example, allows for an early design-space exploration considering the impact of the TSVs on the system's power consumption and performance. Furthermore, the contributed optimization techniques themselves are expected to have an impact on future optimization methods, rather than just serving as reference approaches. For example, for each newly proposed low-power coding technique for metal wires, the method from Chap. 8 can directly quantify how effectively the technique could be used for TSVs, determining the usability of the technique for 3D integration. In the same way, one can precisely quantify the maximum efficiency of a newly proposed 2D CAC for 3D-integrated systems with the formal method contributed in Chap. 9. Also, the mathematical background provided throughout the derivation of the contributed low-power technique for yield-enhanced 3D interconnects (Chap. 11) is expected to impact future work. First, a precise mathematical formulation of
the requirements for a reconfigurable encoding and decoding method to serve as an efficient yield-enhancement technique is contributed. With these formulas, new coding techniques can be systematically derived and verified. Second, the lower bounds for the minimum possible circuit complexities of a yield-enhancement encoder and decoder allow for a precise quantification of the hardware efficiency of novel coding approaches for yield-enhanced 3D interconnects. In summary, the contributions of this book will boost the process of overcoming interconnect-related issues in modern 3D ICs through the provided optimization and modeling techniques. Moreover, the book enables and assists new research projects covering various emerging domains of 3D integration through its mathematical substance.
Appendix A
Pseudo Codes
In the following, we list pseudo codes to generate the capacitance matrix of a metal-wire bus (Algorithm 1), of a through-silicon via (TSV) array following the previously used capacitance model (Algorithm 2), and of a TSV array following the proposed capacitance model (Algorithm 3).
Algorithm 1: Generate the capacitance matrix of a metal-wire bus
  Input: Number of lines n; self-capacitance value C_mw,g; coupling-capacitance value C_mw,c.
  Output: n × n capacitance matrix C.
  C := zeros(n, n)                  // initialize matrix with zeros
  diag(C) := C_mw,g                 // set ground capacitances on the matrix diagonal
  for i = 1 to n − 1 do
      C_i,(i+1) := C_mw,c           // set coupling capacitance
      C_(i+1),i := C_mw,c
  return C
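Algorithm 1 translates directly into executable code. The following Python sketch (the function name and the example capacitance values are ours, not from the book) builds the same tridiagonal matrix with plain nested lists:

```python
def metal_wire_cap_matrix(n, c_g, c_c):
    """Tridiagonal capacitance matrix of an n-line metal-wire bus.

    c_g: self (ground) capacitance per line; c_c: coupling capacitance
    between directly adjacent lines.
    """
    C = [[0.0] * n for _ in range(n)]   # initialize matrix with zeros
    for i in range(n):
        C[i][i] = c_g                   # ground capacitances on the diagonal
    for i in range(n - 1):
        C[i][i + 1] = c_c               # coupling to the right neighbor
        C[i + 1][i] = c_c               # symmetric entry
    return C

# Example: 4-line bus with illustrative values (e.g., in fF)
C = metal_wire_cap_matrix(4, 100.0, 50.0)
```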
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 L. Bamberg et al., 3D Interconnect Architectures for Heterogeneous Technologies, https://doi.org/10.1007/978-3-030-98229-4
Algorithm 2: Generate the capacitance matrix of a TSV array according to the previously used capacitance model
  Input: Array dimensions M, N; coupling-capacitance values C_n,prev, C_d,prev.
  Output: (M·N) × (M·N) capacitance matrix C.
  C := zeros(M·N, M·N)              // initialize matrix with zeros
  /* Get matrix D with the distance (normalized by d_min) between TSV_i and
     TSV_j on entry (i, j). Thus, D_i,j is equal to 1 and √2 for directly and
     diagonally adjacent TSVs, respectively. Function "distances()" depends
     on how the TSVs are indexed (e.g., column-by-row). */
  D := distances(M, N)
  for i = 1 to M·N do
      for j = 1 to M·N do
          if D_i,j = 1 then
              C_i,j := C_n,prev     // directly adjacent TSV pair
          else if D_i,j = √2 then
              C_i,j := C_d,prev     // diagonally adjacent TSV pair
  return C
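Algorithm 2 can likewise be sketched in Python. Here the `distances()` helper assumes a row-by-row TSV indexing (the book notes that the function depends on the chosen indexing), and all names and example values are illustrative:

```python
import math

def distances(M, N):
    """Normalized center-to-center distances between all M*N TSVs.

    TSVs are indexed row-by-row: TSV i sits at grid position (i // N, i % N).
    Distances are in units of the minimum pitch, so directly adjacent pairs
    have distance 1 and diagonally adjacent pairs sqrt(2).
    """
    pos = [(i // N, i % N) for i in range(M * N)]
    return [[math.hypot(ra - rb, ca - cb) for (rb, cb) in pos]
            for (ra, ca) in pos]

def tsv_cap_matrix_prev(M, N, c_n, c_d, eps=1e-9):
    """Capacitance matrix of an M x N TSV array (previously used model):
    only direct (c_n) and diagonal (c_d) neighbor couplings are modeled."""
    D = distances(M, N)
    size = M * N
    C = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if abs(D[i][j] - 1.0) < eps:              # directly adjacent pair
                C[i][j] = c_n
            elif abs(D[i][j] - math.sqrt(2)) < eps:   # diagonally adjacent pair
                C[i][j] = c_d
    return C
```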
Algorithm 3: Generate the capacitance matrix of a TSV array according to the proposed capacitance model
  Input: Array dimensions M, N; capacitance values C_G,n, C_G,d, C_G,e0, C_G,e1, C_G,e2, C_G,c0, C_G,c1, C_G,c2; capacitance deviations ΔC_n, ΔC_d, ΔC_e0, ΔC_e1, ΔC_e2, ΔC_c0, ΔC_c1, ΔC_c2.
  Output: (M·N) × (M·N) capacitance matrices C_G, ΔC.
  C_G := ΔC := zeros(M·N, M·N)      // initialize matrices with zeros
  D := distances(M, N)              // get normalized TSV distances
  S_E := edge_tsv_set(M, N)         // set of all TSVs located at a single edge
  S_C := corner_tsv_set(M, N)       // set of the four corner TSVs
  for i = 1 to M·N do
      if i ∈ S_E then
          [C_G,i,i, ΔC_i,i] := [C_G,e0, ΔC_e0]     // edge TSV self-capacitance
      else if i ∈ S_C then
          [C_G,i,i, ΔC_i,i] := [C_G,c0, ΔC_c0]     // corner TSV self-capacitance
      for j = 1 to M·N do
          if D_i,j = 1 then                        // directly adjacent TSV pair
              [C_G,i,j, ΔC_i,j] := [C_G,n, ΔC_n]
              if i ∈ S_E and j ∈ S_E then          // pair located at an edge
                  [C_G,i,j, ΔC_i,j] := [C_G,e1, ΔC_e1]
              else if i ∈ S_C or j ∈ S_C then      // pair located at a corner
                  [C_G,i,j, ΔC_i,j] := [C_G,c1, ΔC_c1]
          else if D_i,j = √2 then                  // diagonally adjacent TSV pair
              [C_G,i,j, ΔC_i,j] := [C_G,d, ΔC_d]
          else if D_i,j = 2 then                   // indirectly adjacent TSV pair
              if i ∈ S_E and j ∈ S_E then          // pair located at an edge
                  [C_G,i,j, ΔC_i,j] := [C_G,e2, ΔC_e2]
              else if i ∈ S_C or j ∈ S_C then      // pair located at a corner
                  [C_G,i,j, ΔC_i,j] := [C_G,c2, ΔC_c2]
  return C_G, ΔC
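A self-contained Python sketch of Algorithm 3 follows. The index sets are computed inline, the row-by-row indexing and all names are our assumptions, and the capacitance classes are passed as dictionaries keyed 'n', 'd', 'e0', 'e1', 'e2', 'c0', 'c1', 'c2':

```python
import math

def tsv_cap_matrix_proposed(M, N, cg, dc, eps=1e-9):
    """Nominal (CG) and deviation (dC) capacitance matrices of an M x N TSV
    array for the proposed model.  cg and dc map the capacitance classes
    to values; TSVs are indexed row-by-row."""
    size = M * N
    pos = [(i // N, i % N) for i in range(size)]

    def cls(i):                      # classify a TSV: 'inner', 'edge', 'corner'
        r, c = pos[i]
        on = (r in (0, M - 1)) + (c in (0, N - 1))
        return ('inner', 'edge', 'corner')[on]

    CG = [[0.0] * size for _ in range(size)]
    dC = [[0.0] * size for _ in range(size)]
    for i in range(size):
        if cls(i) == 'edge':                          # edge TSV self-capacitance
            CG[i][i], dC[i][i] = cg['e0'], dc['e0']
        elif cls(i) == 'corner':                      # corner TSV self-capacitance
            CG[i][i], dC[i][i] = cg['c0'], dc['c0']
        for j in range(size):
            d = math.hypot(pos[i][0] - pos[j][0], pos[i][1] - pos[j][1])
            key = None
            if abs(d - 1.0) < eps:                    # directly adjacent pair
                key = 'n'
                if cls(i) == 'edge' and cls(j) == 'edge':
                    key = 'e1'                        # pair located at an edge
                elif 'corner' in (cls(i), cls(j)):
                    key = 'c1'                        # pair located at a corner
            elif abs(d - math.sqrt(2)) < eps:         # diagonally adjacent pair
                key = 'd'
            elif abs(d - 2.0) < eps:                  # indirectly adjacent pair
                if cls(i) == 'edge' and cls(j) == 'edge':
                    key = 'e2'
                elif 'corner' in (cls(i), cls(j)):
                    key = 'c2'
            if key is not None:
                CG[i][j], dC[i][j] = cg[key], dc[key]
    return CG, dC
```

As in the book's pseudocode, indirect couplings between two inner TSVs (and inner self-capacitances) are left at zero.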
Appendix B
Method to Calculate the Depletion-Region Widths
The applied method to determine the width of the depletion regions for the parameterisable 3D model of a TSV array is outlined in this part of the appendices. An in-depth derivation of the formulas can be found in Ref. [253]. A TSV oxide is surrounded by a depletion region since the TSV metal, its isolating oxide liner, and the conductive substrate form a metal-oxide-semiconductor (MOS) junction. A depletion region is an area where the normally conductive substrate lacks free charge carriers, resulting in a silicon region with nearly zero conductivity. The width of the depletion region depends on the mean voltage on the TSV relative to the substrate (which is grounded for common p-doped substrates). Hence, the width of the depletion region can differ for the individual TSVs in an array arrangement. The method presented in the following is applied to determine the width of the depletion region of each TSV in the 3D model individually. Since a TSV has a relatively large length-over-diameter ratio, a 2D approximation of the electrostatics can be used. Consequently, Poisson's equation for the electrical field surrounding the TSV in a p-doped substrate can be expressed as

\frac{1}{r}\,\frac{d}{dr}\!\left(r\,\frac{d\psi}{dr}\right)=\begin{cases}\dfrac{qN_a}{\varepsilon_{\mathrm{subs}}} & \text{for } r_{\mathrm{tsv}}+t_{\mathrm{ox}}<r<r_{\mathrm{tsv}}+t_{\mathrm{ox}}+w_{\mathrm{dep}},\\ 0 & \text{for } r_{\mathrm{tsv}}<r<r_{\mathrm{tsv}}+t_{\mathrm{ox}},\end{cases} \quad (B.1)
where w_dep is the width of the depletion region surrounding the TSV, which has to be determined; ψ(r) is the electrical potential at radial distance r from the center of the TSV conductor; N_a is the acceptor doping concentration in the substrate; q is the elementary charge; and ε_subs is the permittivity of the substrate. The boundary conditions for the Poisson equation are:

\psi(r_{\mathrm{tsv}}+t_{\mathrm{ox}}+w_{\mathrm{dep}})=0; \qquad \psi(r_{\mathrm{tsv}})=\bar{V}+\phi_{\mathrm{MS}};

\left.\frac{d\psi}{dr}\right|_{r=r_{\mathrm{tsv}}+t_{\mathrm{ox}}+w_{\mathrm{dep}}}=0; \qquad \varepsilon_{\mathrm{subs}}\left.\frac{d\psi}{dr}\right|_{r=r_{\mathrm{tsv}}+t_{\mathrm{ox}}+dr}=\varepsilon_{\mathrm{ox}}\left.\frac{d\psi}{dr}\right|_{r=r_{\mathrm{tsv}}+t_{\mathrm{ox}}-dr}. \quad (B.2)
In the boundary conditions, V̄ is the mean voltage on the TSV, ε_ox is the permittivity of the TSV oxide, and φ_MS is the work-function difference between the TSV metal (i.e., copper) and the doped silicon substrate:

\phi_{\mathrm{MS}}=\phi_{\mathrm{metal}}-\phi_{\mathrm{subs}}. \quad (B.3)
Furthermore, r_tsv + t_ox + dr and r_tsv + t_ox − dr indicate the conductive-substrate (depletion-region) side and the oxide side, respectively, of the infinitesimally thin boundary between them. By solving the Poisson equation under consideration of the boundary conditions, w_dep can be determined. For weak inversion/depletion, where a negligible amount of charge exists at the oxide-silicon interface (a strong inversion does not occur due to the relatively low substrate conductivity), the total charge is expressed as

\pi\left[(r_{\mathrm{tsv}}+t_{\mathrm{ox}}+w_{\mathrm{dep}})^2-(r_{\mathrm{tsv}}+t_{\mathrm{ox}})^2\right]qN_a=\left[\bar{V}-\phi_{\mathrm{MS}}-\psi(r_{\mathrm{tsv}}+t_{\mathrm{ox}})\right]\cdot\frac{2\pi\varepsilon_{\mathrm{ox}}}{\log_e\frac{r_{\mathrm{tsv}}+t_{\mathrm{ox}}}{r_{\mathrm{tsv}}}}. \quad (B.4)
This equation represents Gauss's law. The scalar potential at the interface can be expressed by integrating Poisson's equation:

\psi(r_{\mathrm{tsv}}+t_{\mathrm{ox}})=\frac{qN_a}{2\varepsilon_{\mathrm{subs}}}\left[(r_{\mathrm{tsv}}+t_{\mathrm{ox}}+w_{\mathrm{dep}})^2\log_e\frac{r_{\mathrm{tsv}}+t_{\mathrm{ox}}+w_{\mathrm{dep}}}{r_{\mathrm{tsv}}+t_{\mathrm{ox}}}-0.5\,w_{\mathrm{dep}}^2-w_{\mathrm{dep}}(r_{\mathrm{tsv}}+t_{\mathrm{ox}})\right]. \quad (B.5)

Two unknown variables exist in Eqs. (B.4) and (B.5): w_dep and ψ(r_tsv + t_ox). Therefore, the two equations can be solved self-consistently to determine w_dep for any given mean TSV voltage V̄. Solving the equations shows that w_dep increases with the mean TSV voltage for p-doped substrates. However, the depletion-region width does not increase infinitely with the mean voltage, as it is limited to a maximum value. In the case where the
[Fig. B.1 Depletion-region width over the mean TSV voltage for a grounded p-doped substrate. Thereby, the TSV parameters are: r_tsv = 2.5 μm, t_ox = 0.5 μm, φ_metal = 4.6 eV (copper TSVs), φ_subs = 4.89 eV, ε_ox = 3.9 ε_0, ε_subs = 11.9 ε_0, and N_a = 1.35 × 10^15 cm^−3 (i.e., σ_subs = 10 S/m)]
maximum depletion-region width is reached, the electrical potential drop across the depletion region is

\frac{2k_BT}{q}\log_e\frac{N_a}{n_i}=\frac{qN_a}{2\varepsilon_{\mathrm{subs}}}\left[-0.5\,w_{\mathrm{dep}}^2-w_{\mathrm{dep}}(r_{\mathrm{tsv}}+t_{\mathrm{ox}})+(r_{\mathrm{tsv}}+t_{\mathrm{ox}}+w_{\mathrm{dep}})^2\log_e\frac{r_{\mathrm{tsv}}+t_{\mathrm{ox}}+w_{\mathrm{dep}}}{r_{\mathrm{tsv}}+t_{\mathrm{ox}}}\right], \quad (B.6)
where n_i is the intrinsic carrier concentration of silicon, k_B is the Boltzmann constant, and T is the temperature (assumed to be 300 K). Equation (B.6) is derived from Eq. (B.5) for a potential ψ(r_tsv + t_ox) that is fixed to 2φ_F = 2k_B T/q · log_e(N_a/n_i). As an example, the width of the depletion region over the mean TSV voltage, determined by means of the derived methodology for a TSV radius of 2.5 μm, is shown in Fig. B.1. The same concept can be used to derive a method to determine the depletion-region widths for n-doped substrates, as shown in Ref. [253]. While the width of a depletion region increases with the mean voltage on the related TSV conductor for common p-doped substrates, the opposite holds for n-doped substrates (i.e., the width decreases with an increased mean voltage).
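The self-consistent solution of Eqs. (B.4)–(B.6) can be sketched numerically. The following Python code is our own illustration, not the book's implementation: it uses the parameter values listed for Fig. B.1, simple bisection instead of a dedicated solver, and an assumed textbook value n_i ≈ 10^10 cm^−3 for silicon. It reproduces the qualitative behavior of Fig. B.1 (monotone increase, saturation near 0.75 μm):

```python
import math

# Constants and the TSV/substrate parameters from Fig. B.1 (SI units)
Q = 1.602e-19                 # elementary charge [C]
EPS0 = 8.854e-12              # vacuum permittivity [F/m]
KT = 0.0259                   # k_B*T/q at 300 K [V]
NA = 1.35e21                  # acceptor concentration [1/m^3]
NI = 1.0e16                   # intrinsic carrier concentration of Si [1/m^3]
R_TSV, T_OX = 2.5e-6, 0.5e-6  # TSV radius and oxide thickness [m]
EPS_OX, EPS_SUBS = 3.9 * EPS0, 11.9 * EPS0
PHI_MS = 4.6 - 4.89           # work-function difference Cu vs. p-Si [V]

def psi_interface(w):
    """Potential at the oxide-silicon interface for width w, Eq. (B.5)."""
    a = R_TSV + T_OX
    f = (a + w) ** 2 * math.log((a + w) / a) - 0.5 * w ** 2 - w * a
    return Q * NA / (2 * EPS_SUBS) * f

def bisect(res, lo, hi, iters=100):
    """Bisection for a monotonically increasing residual function."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if res(mid) < 0.0 else (lo, mid)
    return 0.5 * (lo + hi)

def depletion_width(v_mean):
    """Self-consistent depletion-region width for mean TSV voltage v_mean."""
    a = R_TSV + T_OX
    if v_mean - PHI_MS <= 0.0:
        return 0.0                     # no depletion for a p-doped substrate
    # Eq. (B.4) (Gauss) combined with Eq. (B.5), solved for w by bisection
    def residual(w):
        lhs = ((a + w) ** 2 - a ** 2) * Q * NA \
              * math.log(a / R_TSV) / (2 * EPS_OX)
        return lhs - (v_mean - PHI_MS - psi_interface(w))
    w = bisect(residual, 0.0, 20e-6)
    # Eq. (B.6): maximum width, reached once psi is pinned at 2*phi_F
    psi_max = 2 * KT * math.log(NA / NI)
    w_max = bisect(lambda u: psi_interface(u) - psi_max, 0.0, 20e-6)
    return min(w, w_max)
```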
Appendix C
Modeling Logical OR Relations
Modeling the logical OR relation with upper bounds is illustrated in Fig. C.1. On the left-hand side, the optimization curves of the OR relation in the z-dimension are shown. A variable should lie either on the top or on the bottom orange line. If the lines are unbounded, no convex optimization space can be defined, as the red lines indicate. This is not the case with upper bounds, as shown on the right-hand side of the figure, in which x_max and y_max are used to limit the size of the optimization space. It now has a tetrahedral shape and is convex. Therefore, bounds allow modeling the logical OR. The logical AND relation can be modeled using multiple constraints, which must be satisfied together. The logical implication a → b can be transformed into an OR relation using the equivalence a → b ↔ ā ∨ b, where ā denotes "not a".
Fig. C.1 Modeling a logical OR via limits in the solution space
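The role of the bounds can be illustrated with the standard big-M construction, of which the figure shows a geometric instance. In this Python sketch (all names are ours, and the disjunction "x = 0 OR y = 0" is a simplified stand-in for the top/bottom-line choice), a binary selector d picks the active branch, and the finite bounds x_max, y_max are exactly what makes the disjunction expressible with linear constraints:

```python
def or_feasible(x, y, x_max, y_max):
    """Check feasibility of: x = 0 OR y = 0, modeled with bounds and a
    binary selector d via  0 <= x <= x_max*d  and  0 <= y <= y_max*(1-d)."""
    for d in (0, 1):          # a MILP solver would branch on d
        if 0 <= x <= x_max * d and 0 <= y <= y_max * (1 - d):
            return True
    return False

# With x_max unbounded (infinite), x <= x_max*d could never force x to 0,
# so the finite bounds are what makes the OR relation modelable.
```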
References
1. 2015 International Technology Roadmap for Semiconductors (ITRS) – Interconnects. https://eps.ieee.org/images/files/Roadmap/ITRSIntercon2015.pdf. Accessed 03 Oct 2019 2. A 3D technology toolbox in support of system-technology co-optimization (2019). https://www.imec-int.com/en/imec-magazine/imec-magazine-july-2019. Accessed 15 Aug 2019 3. F. Abazovic, AMD Fiji HBM limited to 4 GB stacked memory (2015). http://www.fudzilla.com/news/graphics/36995-amd-fiji-hbm-limited-to-4-gb-stacked-memory 4. F. Abazovic, Pascal uses 2.5D HBM memory (2015). http://www.fudzilla.com/news/graphics/37294-pascal-uses-2-5d-hbm-memory 5. K. Abe et al., Ultra-high bandwidth memory with 3D-stacked emerging memory cells, in IEEE International Conference on Integrated Circuit Design and Technology and Tutorial (ICICDT) (2008), pp. 203–206. https://doi.org/10.1109/ICICDT.2008.4567279 6. Y. Aghaghiri, F. Fallah, M. Pedram, Transition reduction in memory buses using sector-based encoding techniques. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 23(8), 1164–1174 (2004) 7. M.O. Agyeman, A. Ahmadinia, Optimising heterogeneous 3D networks-on-chip, in Parallel Computing in Electrical Engineering (2011). https://doi.org/10.1109/PARELEC.2011.40 8. M.O. Agyeman, A. Ahmadinia, A. Shahrabi, Low power heterogeneous 3D networks-on-chip architectures, in 2011 International Conference on High Performance Computing and Simulation (HPCS) (2011). https://doi.org/10.1109/HPCSim.2011.5999871 9. A.B. Ahmed, A.B. Abdallah, LA-XYZ: low latency high throughput look-ahead routing algorithm for 3D network-on-chip (3D-NoC) architecture, in International Symposium on Embedded Multicore SoCs (2012). https://doi.org/10.1109/MCSoC.2012.24 10. A.B. Ahmed, A.B. Abdallah, Low-overhead routing algorithm for 3D network-on-chip, in ICNC (2012), pp. 23–32. https://doi.org/10.1109/ICNC.2012.14 11. T.W. Ainsworth, T.M. Pinkston, Characterizing the cell EIB on-chip network. IEEE Micro 27(5), 6–14 (2007).
https://doi.org/10.1109/MM.2007.4378779 12. I. Akgun, D. Stow, Y. Xie, Network-on-chip design guidelines for monolithic 3-D integration. IEEE Micro 39(6), 46–53 (2019) 13. F. Akopyan et al., Truenorth: design and tool flow of a 65 mW 1 million neuron programmable neurosynaptic chip. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 34(10), 1537– 1557 (2015) 14. K. Ali et al., Different scenarios for estimating coupling capacitances of through silicon via (TSV) arrays, in International Conference on Energy Aware Computing Systems & Applications (ICEAC) (IEEE, Piscataway, 2015), pp. 1–4
15. D. Anand et al., An on-chip self-repair calculation and fusing methodology. IEEE Des. Test Comput. 20(5), 67–75 (2003) 16. D. Arumí, R. Rodríguez-Montañés, J. Figueras, Prebond testing of weak defects in TSVs. IEEE Trans. Very Large Scale Integr. Syst. 24(4), 1503–1514 (2015) 17. M. Bahmani et al., A 3D-NoC router implementation exploiting vertically-partially-connected topologies, in IEEE Computer Society Annual (2012). https://doi.org/10.1109/ISVLSI.2012. 19 18. L. Bamberg, A. García-Ortiz, High-level energy estimation for submicrometric TSV arrays. IEEE Trans. Very Large Scale Integr. Syst. 25(10), 2856–2866 (2017). ISSN: 1 063-8210. https://doi.org/10.1109/TVLSI.2017.2713601 19. L. Bamberg, A. García-Ortiz, Exploiting temporal misalignment to optimize the interconnect performance for 3D integration, in International Symposium on Power and Timing Modeling, Optimization and Simulation (PATMOS) (IEEE, Piscataway, 2018), pp. 214–221 20. L. Bamberg, A. Najafi, A. García-Ortiz, Edge effects on the TSV array capacitances and their performance influence. Elsevier Integr. 61, 1–10 (2018) 21. L. Bamberg et al., Synapse compression for event-based convolutional-neural-network accelerators (2021). Preprint arXiv:2112.07019 22. T. Bandyopadhyay et al., Rigorous electrical modeling of through silicon vias (TSVs) with MOS capacitance effects. IEEE Trans. Compon. Packag. Manuf. Technol. 1(6), 893–903 (2011) 23. K. Banerjee, S. Im, N. Srivastava, Interconnect modeling and analysis in the nanometer era: Cu and beyond, in Advanced Metallization Conference (AMC) (2005), pp. 1–7 24. N. Banerjee, P. Vellanki, K.S. Chatha, A power and performance model for network-on-chip architectures, in Design, Automation and Test in Europe Conference and Exhibition (IEEE, Piscataway, 2004). ISBN: 0-7695-2085-5. https://doi.org/10.1109/DATE.2004.1269067 25. P. 
Batude et al., Advances in 3D CMOS sequential integration, in International Electron Devices Meeting (IEDM) (IEEE, Piscataway, 2009), pp. 1–4 26. P. Batude et al., 3D sequential integration opportunities and technology optimization, in IEEE International Interconnect Technology Conference (2014), pp. 373–376 27. P. Batude et al., 3D sequential integration: Application-driven technological achievements and guidelines, in International Electron Devices Meeting (IEDM) (IEEE, Piscataway, 2017), pp. 1–3 28. D.U. Becker, W.J. Dally, Allocator Implementations for network-on-chip routers, in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (ACM, New York, 2009), pp. 1–12. https://doi.org/10.1145/1654059.1654112 29. A. Bender, MILP based task mapping for heterogeneous multiprocessor systems, in Proceedings of the Conference on European Design Automation (IEEE, Piscataway, 1996) 30. L. Benini, G. De Micheli, Networks on chips: a new SoC paradigm. Computer 35(1), 70–78 (2002). https://doi.org/10.1109/2.976921 31. L. Benini et al., Asymptotic zero-transition activity encoding for address busses in lowpower microprocessor-based systems, in Great Lakes Symposium on VLSI (GLSVLSI) (IEEE, Piscataway, 1997), pp. 77–82 32. C. Bienia, Benchmarking Modern Multiprocessors. PhD Thesis. Princeton University, 1.01.2011 33. O. Billoint et al., Merging PDKs to build a design environment for 3D circuits: Methodology challenges and limitations, in 2019 International 3D Systems Integration Conference (3DIC) (2019), pp. 1–5 34. N. Binkert et al., The Gem5 simulator. SIGARCH Comput. Archit. News 39(2), 1–7 (2011). https://doi.org/10.1145/2024716.2024718 35. M.D. Bishop et al., Monolithic 3-D integration. IEEE Micro 39(6), 16–27 (2019)
36. S. Bobba, I.N. Hajj, N.R. Shanbhag, Analytical expressions for average bit statistics of signal lines in DSP architectures, in International Symposium on Circuits and Systems (ISCAS), vol. 6 (IEEE, Piscataway, 1998), pp. 33–36 37. S. Bobba et al., CELONCEL: Effective design technique for 3-D monolithic integration targeting high performance integrated circuits, in 16th Asia and South Pacific Design Automation Conference (ASP-DAC 2011) (2011), pp. 336–343 38. E. Bolotin et al., QNoC: QoS architecture and design process for network on chip. Elsevier J. Syst. Archit. 50(2–3), 105–128 (2004) 39. R.A. Brualdi, Combinatorial Matrix Classes, vol. 13 (Cambridge University Press, Cambridge, 2006) 40. L. Brunet et al., Breakthroughs in 3D sequential technology, in International Electron Devices Meeting (IEDM) (IEEE, Piscataway, 2018), pp. 2–7 41. Y. Cai, K. Mai, O. Mutlu, Comparative evaluation of FPGA and ASIC implementations of bufferless and buffered routing algorithms for on-chip networks, in 16th International Symposium on Quality Electronic Design (IEEE, Piscataway, 2015). ISBN: 978-1-4799-75815. https://doi.org/10.1109/ISQED.2015.7085472 42. V. Catania et al., Noxim: An open, extensible and cycle-accurate network on chip simulator, in International Conference on Application-specific Systems, Architectures and Processors (IEEE, Piscataway, 2015). https://doi.org/10.1109/ASAP.2015.7245728 43. V. Catania et al., Cycle-accurate network on chip simulation with noxim. ACM Trans. Model. Comput. Simul. 27(1), 1–25 (2016). https://doi.org/10.1145/2953878 44. N. Chandoke, A.K. Sharma, A novel approach to estimate power consumption using SystemC transaction level modelling, in Annual IEEE India Conference (INDICON) (IEEE, Piscataway, 2015), pp. 1–6 45. K. Chang et al., Power, performance, and area benefit of monolithic 3D ICs for on-chip deep neural networks targeting speech recognition. J. Emerg. Technol. Comput. Syst. 14(4), 1–19 (2018). ISSN: 1550-4832 46. A. Charif, N.E. 
Zergainoh, M. Nicolaidis, A new approach to deadlock-free fully adaptive routing for high-performance fault-tolerant NoCs, in 2016 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT) (2016), pp. 121–126. https://doi.org/10.1109/DFT.2016.7684082 47. B. Charlet et al., 3D integration at CEA-LETI, in Handbook of 3D Integration, chap. 19 (Wiley, Hoboken, 2008), pp. 375–392. ISBN: 978-3-5276-2305-1 48. X. Chen, N.K. Jha, A 3-D CPU-FPGA-DRAM hybrid architecture for low-power computation. IEEE Trans. Very Large Scale Integr. Syst. 24(5), 1649–1662 (2016). ISSN: 1063-8210. https://doi.org/10.1109/TVLSI.2015.2483525 49. C.-H.H. Chen et al., SMART: A single-cycle reconfigurable NoC for SoC applications, in Conference on Design, Automation and Test in Europe DATE ’13. EDA Consortium (2013). ISBN: 978-1-4503-2153-2 50. J. Cho et al., Modeling and analysis of through-silicon via (TSV) noise coupling and suppression using a guard ring. IEEE Trans. Compon. Packag. Manufact. Technol. 1(2), 220– 233 (2011) 51. J. Cong, G. Luo, A multilevel analytical placement for 3D ICs, in Asia and South Pacific Design Automation Conference ASP-DAC ’09 (IEEE, Piscataway, 2009), pp. 361–366. ISBN: 978-1-4244-2748-2 52. J. Cong, M. Romesis, J.R. Shinnerl, Robust mixed-size placement under tight white-space constraints, in International Conference on Computer-aided Design ICCAD ’05 (IEEE, Piscataway, 2005). ISBN: 0-7803-9254-X 53. M. Coppola, Spidergon STNoC: The technology that adds value to your System, in 2010 IEEE Hot Chips 22 Symposium (HCS) (2010), pp. 1–39. https://doi.org/10.1109/HOTCHIPS. 2010.7480082 54. X. Cui et al., An enhancement of crosstalk avoidance code based on Fibonacci numeral system for through silicon vias. IEEE Trans. Very Large Scale Integr. Syst. 25(5), 1601–1610 (2017)
55. W.J. Dally, Virtual-channel flow control. IEEE Trans. Parallel Distrib. Syst. 3(2), 194–205 (1992) 56. W.J. Dally, C.L. Seitz, The torus routing chip. Distrib. Comput. 1(4), 187–196 (1986). https:// doi.org/10.1007/BF01660031 57. W.J. Dally, C.L. Seitz, Deadlock-free message routing in multiprocessor interconnection networks. IEEE Trans. Comput. C-36(5), 547–553 (1987). ISSN: 0018-9340. https://doi.org/ 10.1109/TC.1987.1676939 58. W.J. Dally, B. Towles, Route packets, not wires: on-chip interconnection networks, in Proceedings of the Design Automation Conference (IEEE, Piscataway, 2001) 59. W.J. Dally, B. Towles, Principles and Practices of Interconnection Networks (Elsevier, Amsterdam, 2004) 60. S. Datta et al., Back-end-of-line compatible transistors for monolithic 3-D integration. IEEE Micro 39(6), 8–15 (2019) 61. M. Davies et al., Loihi: A neuromorphic manycore processor with on-chip learning. IEEE Micro 38(1), 82–99 (2018) 62. DeepSig Inc. https://www.deepsig.io/datasets. Accessed 18 Feb 2019 63. S. Deutsch, K. Chakrabarty, Contactless pre-bond TSV test and diagnosis using ring oscillators and multiple voltage levels. IEEE Trans. Comput.-Aided Design Integr. Circuits Syst. 33(5), 774–785 (2014) 64. G. Dimitrakopoulos, A. Psarras, I. Seitanidis, Microarchitecture of Network-on-Chip Routers: A Designer’s Perspective (Springer, Berlin, 2015). ISBN: 978-1-4614-4301-8 65. T. Drewes, J.M. Joseph, T. Pionteck, An FPGA-based prototyping framework for networkson-chip, in International Conference on ReConFigurable and FPGAs (IEEE, Piscataway, 2017), pp. 1–7. https://doi.org/10.1109/RECONFIG.2017.8279775 66. C. Duan, A. Tirumala, S.P. Khatri, Analysis and avoidance of cross-talk in on-chip buses, in Symposium on High Performance Interconnects (HOT Interconnects) (IEEE, Piscataway, 2001), pp. 133–138 67. C. Duan, C. Zhu, S.P. 
Khatri, Forbidden transition free crosstalk avoidance CODEC design, in Design Automation Conference (DAC) (ACM/IEEE, New York/Piscataway, 2008), pp. 986– 991 68. C. Duan, B.J. LaMeres, S.P. Khatri, On and Off-Chip Crosstalk Avoidance in VLSI Design (Springer, Berlin, 2010) 69. J. Duato, A new theory of deadlock-free adaptive routing in wormhole networks. IEEE Trans. Parallel Distrib. Syst. 4(12), 1320–1331 (1993). https://doi.org/10.1109/71.250114 70. J. Duato, S. Yalamanchili, L.M. Ni, Interconnection Networks: An Engineering Approach (Morgan Kaufmann, Burlington, 2003). ISBN: 978-1-55860-852-8 71. M. Ebrahimi et al., MAFA: Adaptive fault-tolerant routing algorithm for networks-on-chip, in 2012 15th Euromicro Conference on Digital System Design (2012), pp. 201–207. https:// doi.org/10.1109/DSD.2012.82 72. M. Ebrahimi et al., DyXYZ: Fully adaptive routing algorithm for 3D NoCs, in Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (2013), pp. 499–503. https://doi.org/10.1109/PDP.2013.80 73. Electromagnetics |Electronic Simulation Software |ANSYS https://www.ansys.com/products/ electronics. Accessed 13 March 2019 74. W.C. Elmore, The transient response of damped linear networks with particular regard to wideband amplifiers. AIP J. Appl. Phys. 19(1), 55–63 (1948) 75. A.E. Engin, S.R. Narasimhan, Modeling of crosstalk in through silicon vias. IEEE Trans. Electromagn. Compat. 55(1), 149–158 (2012) 76. C. Erdmann et al., A heterogeneous 3D-IC consisting of two 28 nm FPGA die and 32 reconfigurable high-performance data converters. IEEE J. Solid-State Circuits 50(1), 258– 269 (2015) 77. F. Fatollahi-Fard et al., OpenSoCFabric (2019). http://www.opensocfabric.org/home.php. Accessed 15 March 2019
78. F. Fazzino, M. Palesi, D. Patti, Noxim: Network-on-chip simulator (2008). http://sourceforge.net/projects/noxim
79. B.S. Feero, P.P. Pande, Networks-on-chip in a three-dimensional environment: a performance evaluation. IEEE Trans. Comput. 58(1), 32–45 (2009). ISSN: 0018-9340. https://doi.org/10.1109/TC.2008.142
80. A.M. Felfel et al., Quantifying the benefits of monolithic 3D computing systems enabled by TFT and RRAM, in 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE) (2020), pp. 43–48
81. J. Flich, D. Bertozzi, Designing Network On-Chip Architectures in the Nanoscale Era (Taylor and Francis, Milton Park, 2010). ISBN: 978-1-4398-3710-8
82. S. Foroutan, A. Sheibanyrad, F. Petrot, Assignment of vertical-links to routers in vertically-partially-connected 3-D-NoCs. IEEE Trans. Comput.-Aided Design Integr. Circuits Syst. 33(8), 1208–1218 (2014). https://doi.org/10.1109/TCAD.2014.2323219
83. G. Frank, Pulse code communication. US Patent 2,632,058, 1953
84. J.P. Gambino, S.A. Adderly, J.U. Knickerbocker, An overview of through-silicon-via technology and manufacturing challenges. Elsevier Microelectron. Eng. 135, 73–106 (2015)
85. A. García Ortiz, Stochastic Data Models for Power Estimation at High-Levels of Abstraction (Shaker, Germany, 2004)
86. A. García-Ortiz, L.S. Indrusiak, Practical and theoretical considerations on low-power probability-codes for networks-on-chip, in International Symposium on Power and Timing Modeling, Optimization and Simulation (PATMOS) (Springer, Berlin, 2011), pp. 160–169. ISBN: 978-3-642-17752-1
87. A. García-Ortiz et al., Low-power coding for networks-on-chip with virtual channels. ASP J. Low Power Electron. 5(1), 77–84 (2009). https://doi.org/10.1166/jolpe.2009.1006
88. A. García-Ortiz, D. Gregorek, C. Osewold, Analysis of bus-invert coding in the presence of correlations, in Saudi International Electronics, Communications and Photonics Conference (SIECPC) (IEEE, Piscataway, 2011), pp. 1–5
89. A. García-Ortiz, L. Bamberg, A. Najafi, Low-power coding: trends and new challenges. ASP J. Low Power Electron. 13(3), 356–370 (2017)
90. P.E. Garrou, M. Koyanagi, P. Ramm, 3D Process Technology: Robust Circuit and Physical Design for Sub-65 nm Technology Nodes. Handbook of 3D Integration, vol. 3, 1st edn. (Wiley, Hoboken, 2009). ISBN: 978-3-527-32034-9
91. C.J. Glass, L.M. Ni, The turn model for adaptive routing, in Proceedings of the 19th Annual International Symposium on Computer Architecture (1992). https://doi.org/10.1109/ISCA.1992.753324
92. K. Goossens, J. Dielissen, A. Radulescu, AEthereal network on chip: Concepts, architectures, and implementations. IEEE Design Test 22, 414–421 (2005). https://doi.org/10.1109/MDT.2005.99
93. B. Gopireddy, J. Torrellas, Designing vertical processors in monolithic 3D, in 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA) (2019), pp. 643–656
94. P. Gratz et al., On-chip interconnection networks of the TRIPS chip. IEEE Micro 27(5), 41–50 (2007). https://doi.org/10.1109/MM.2007.4378782
95. F. Guezzi-Messaoud et al., Benefits of three-dimensional circuit stacking for image sensors, in International New Circuits and Systems Conference (NEWCAS) (IEEE, Piscataway, 2013), pp. 1–4
96. P.J. Haas, Stochastic Petri Nets: Modelling, Stability, Simulation (Springer, Berlin, 2002). ISBN: 978-0-387-21552-5
97. R.W. Hamming, Error detecting and error correcting codes. Bell Syst. Techn. J. 29(2), 147–160 (1950)
98. S. Han, H. Mao, W.J. Dally, Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding (2015). Preprint arXiv:1510.00149
99. A. Hansson, K. Goossens, A. Rădulescu, Avoiding message-dependent deadlock in network-based systems on chip. VLSI Design 2007, 10 (2007)
References
100. D. Harris, S. Harris, Digital Design and Computer Architecture (Morgan Kaufmann, Burlington, 2010)
101. A. Havashki et al., Analysis of switching activity in DSP signals in the presence of noise, in IEEE EUROCON (IEEE, Piscataway, 2009), pp. 234–239
102. K. He et al., Deep residual learning for image recognition, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 770–778
103. J. Hestness, S.W. Keckler, Netrace: Dependency-Tracking Traces for Efficient Network-on-Chip Experimentation
104. Y. Hoskote et al., A 5-GHz mesh interconnect for a teraflops processor. IEEE Micro 27(5), 51–61 (2007). https://doi.org/10.1109/MM.2007.4378783
105. Y. Hoskote et al., Teraflops prototype processor with 80 cores, in Hot Chips 19 Symposium (IEEE, Piscataway, 2007). ISBN: 978-1-4673-8869-6. https://doi.org/10.1109/HOTCHIPS.2007.7482494
106. N.M. Hossain, R.K.R. Kuchukulla, M.H. Chowdhury, Failure analysis of the through silicon via in three-dimensional integrated circuit (3D-IC), in International Symposium on Circuits and Systems (ISCAS) (IEEE, Piscataway, 2018), pp. 1–4
107. A.-C. Hsieh, T.T. Hwang, TSV redundancy: Architecture and design issues in 3-D IC. IEEE Trans. Very Large Scale Integr. Syst. 20(4), 711–722 (2011)
108. M.-K. Hsu, Y.-W. Chang, V. Balabanov, TSV-aware analytical placement for 3D IC designs, in Design Automation Conference (ACM, New York, 2011). https://doi.org/10.1145/2024724.2024875
109. Y.-J. Huang, J.-F. Li, Built-in self-repair scheme for the TSVs in 3-D ICs. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 31(10), 1600–1613 (2012)
110. L.-R. Huang et al., Oscillation-based prebond TSV test. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 32(9), 1440–1444 (2013)
111. S.-Y. Huang et al., Programmable leakage test and binning for TSVs with self-timed timing control. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 32(8), 1265–1273 (2013)
112. IBM, Cplex 12.8 User's Manual (2017)
113. Cadence Design Systems, Inc., Genus Synthesis Solution (2016). https://www.cadence.com/content/cadence-www/global/en_US/home/tools/digital-design-and-signoff/synthesis/genus-synthesis-solution.html
114. International Roadmap for Devices and Systems. Technical Report (IEEE, Piscataway, 2018)
115. P. Jacob et al., Mitigating memory wall effects in high-clock-rate and multicore CMOS 3-D processor memory stacks. Proc. IEEE 97(1), 108–122 (2009). ISSN: 0018-9219. https://doi.org/10.1109/JPROC.2008.2007472
116. N. Jafarzadeh et al., Data encoding techniques for reducing energy consumption in network-on-chip. IEEE Trans. Very Large Scale Integr. Syst. 22(3), 675–685 (2014)
117. N. Jafarzadeh et al., Low energy yet reliable data communication scheme for network-on-chip. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 34(12), 1892–1904 (2015)
118. I. Jani et al., BISTs for post-bond test and electrical analysis of high density 3D interconnect defects, in European Test Symposium (ETS) (IEEE, Piscataway, 2018), pp. 1–6
119. K. Jia et al., AICNN: Implementing typical CNN algorithms with analog-to-information conversion architecture, in IEEE Computer Society Annual Symposium on VLSI (ISVLSI) (IEEE, Piscataway, 2017). ISBN: 978-1-5090-6762-6. https://doi.org/10.1109/ISVLSI.2017.23
120. N. Jiang et al., Booksim Interconnection Network Simulator. http://nocs.stanford.edu
121. N. Jiang et al., A detailed and flexible cycle-accurate network-on-chip simulator, in International Symposium on Performance Analysis of Systems and Software (IEEE, Piscataway, 2013), pp. 86–96. https://doi.org/10.1109/ISPASS.2013.6557149
122. B.K. Joardar et al., Design and optimization of heterogeneous manycore systems enabled by emerging interconnect technologies: Promises and challenges, in 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE) (2019), pp. 138–143
123. J.M. Joseph, C. Blochwitz, T. Pionteck, Adaptive allocation of default router paths in network-on-chips for latency reduction, in International Conference on High Performance Computing & Simulation (IEEE, Piscataway, 2016), pp. 140–147. ISBN: 978-1-5090-2088-1. https://doi.org/10.1109/HPCSim.2016.7568328
124. J.M. Joseph et al., Area and power savings via asymmetric organization of buffers in 3D-NoCs for heterogeneous 3D-SoCs. Microprocess. Microsyst. 48, 36–47 (2017). ISSN: 0141-9331. https://doi.org/10.1016/j.micpro.2016.09.011
125. J.M. Joseph et al., NoCs in heterogeneous 3D SoCs: Co-design of routing strategies and microarchitectures. IEEE Access 7, 135145–135163 (2019)
126. G. Kahn, The semantics of a simple language for parallel programming, in Proceedings of the IFIP Congress on Information Processing (1974)
127. A.B. Kahng et al., ORION 2.0: A fast and accurate NoC power and area model for early-stage design space exploration, in Design, Automation & Test in Europe Conference & Exhibition, 2009. DATE '09 (2009), pp. 423–428. https://doi.org/10.1109/DATE.2009.5090700
128. U. Kang et al., 8 Gb 3-D DDR3 DRAM using through-silicon-via technology. IEEE J. Solid-State Circuits 45(1), 111–119 (2009)
129. N. Kavaldjiev, G.J.M. Smit, P.G. Jansen, A virtual channel router for on-chip networks, in International SOC Conference (SOCC) (IEEE, Piscataway, 2004), pp. 289–293
130. L. Khachiyan, A polynomial algorithm in linear programming. Doklady Academii Nauk SSSR 244, 1093–1096 (1979)
131. Y. Kikuchi et al., A 40 nm 222 mW H.264 full-HD decoding, 25 power domains, 14-core application processor with x512b stacked DRAM. IEEE J. Solid-State Circuits 46(1), 32–41 (2011). ISSN: 0018-9200. https://doi.org/10.1109/JSSC.2010.2079370
132. D.H. Kim, S.K. Lim, Through-silicon-via-aware delay and power prediction model for buffered interconnects in 3D ICs, in International Workshop on System Level Interconnect Prediction (SLIP) (ACM/IEEE, New York/Piscataway, 2010), pp. 25–32
133. D.H. Kim, S.K. Lim, Design quality trade-off studies for 3-D ICs built with sub-micron TSVs and future devices. IEEE J. Emerging Sel. Top. Circuits Syst. 2(2), 240–248 (2012)
134. B. Kim et al., Factors affecting copper filling process within high aspect ratio deep vias for 3D chip stacking, in Electronic Components and Technology Conference (ECTC) (IEEE, Piscataway, 2006), 6 pp.
135. D.H. Kim et al., Design and analysis of 3D-MAPS (3D massively parallel processor with stacked memory). IEEE Trans. Comput. 64(1), 112–125 (2015). ISSN: 0018-9340. https://doi.org/10.1109/TC.2013.192
136. B. Korte, J. Vygen, Combinatorial Optimization: Theory and Algorithms, 5th edn. (Springer, Berlin, 2012). ISBN: 3-642-42767-7
137. K. Krishna, H. Kwon, OpenSMART (2017). http://synergy.ece.gatech.edu/tools/opensmart/. Accessed 15 March 2019
138. R. Kumar, S.P. Khatri, Crosstalk avoidance codes for 3D VLSI, in Design, Automation & Test in Europe Conference (DATE) (IEEE, Piscataway, 2013), pp. 1673–1678
139. S. Kundu, Network-on-Chip: The Next Generation of System-on-Chip Integration (CRC Press, Boca Raton, 2017). ISBN: 978-1-1387-4935-1
140. T.A. Lacksonen, Static and dynamic layout problems with varying areas. J. Oper. Res. Soc. 45(1), 59–69 (1994). https://doi.org/10.1057/jors.1994.7
141. F. Laermer, A. Schilp, Method of anisotropically etching silicon. US Patent 5,501,893, 1996
142. M. Lampropoulos, B.M. Al-Hashimi, P. Rosinger, Minimization of crosstalk noise, delay and power using a modified bus invert technique, in Design, Automation & Test in Europe Conference (DATE), vol. 2 (IEEE, Piscataway, 2004), pp. 1372–1373
143. P.E. Landman, J.M. Rabaey, Power estimation for high level synthesis, in European Conference on Design Automation with the European Event in ASIC Design (IEEE, Piscataway, 1993), pp. 361–366
144. P.E. Landman, J.M. Rabaey, Architectural power analysis: The dual bit type method. IEEE Trans. Very Large Scale Integr. Syst. 3(2), 173–187 (1995)
145. M. Langar, R. Bourguiba, J. Mouine, Virtual channel router architecture for network on chip with adaptive inter-port buffers sharing, in 13th International Multi-Conference on Systems, Signals & Devices (IEEE, Piscataway, 2016). ISBN: 978-1-5090-1291-6. https://doi.org/10.1109/SSD2016.7473771
146. Y. Lee, S.K. Lim, Ultrahigh density logic designs using monolithic 3-D integration. IEEE Trans. Comput.-Aided Design Integr. Circuits Syst. 32(12), 1892–1905 (2013)
147. S.-Y. Lee, A. Ortega, A novel approach of image compression in digital cameras with a Bayer color filter array, in International Conference on Image Processing (ICIP), vol. 3 (IEEE, Piscataway, 2001), pp. 482–485
148. K. Lee, S.-J. Lee, H.-J. Yoo, Low-power network-on-chip for high-performance SoC design. IEEE Trans. Very Large Scale Integr. Syst. 14(2), 148–160 (2006). ISSN: 1063-8210. https://doi.org/10.1109/TVLSI.2005.863753
149. Y. Lee, D. Limbrick, S.K. Lim, Power benefit study for ultra-high density transistor-level monolithic 3D ICs, in 2013 50th ACM/EDAC/IEEE Design Automation Conference (DAC) (2013), pp. 1–10
150. K.-W. Lee et al., Highly dependable 3-D stacked multicore processor system module fabricated using reconfigured multichip-on-wafer 3-D integration technology, in 2014 IEEE International Electron Devices Meeting (IEDM) (2014), pp. 28.6.1–28.6.4. https://doi.org/10.1109/IEDM.2014.7047128
151. J.C. Lee et al., High bandwidth memory (HBM) with TSV technique, in 2016 International SoC Design Conference (ISOCC) (IEEE, Piscataway, 2016), pp. 181–182
152. Y.-W. Lee, H. Lim, S. Kang, Grouping-based TSV test architecture for resistive open and bridge defects in 3-D-ICs. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 36(10), 1759–1763 (2016)
153. M. Lee, J.S. Pak, J. Kim, Electrical Design of Through Silicon Via (Springer, Berlin, 2014). ISBN: 978-94-017-9037-6
154. Z. Li et al., Hierarchical 3-D floorplanning algorithm for wirelength optimization. IEEE Trans. Circuits Syst. I: Regul. Pap. 53(12), 2637–2646 (2006). https://doi.org/10.1109/TCSI.2006.883857
155. L.-C. Li et al., An efficient 3D-IC on-chip test framework to embed TSV testing in memory BIST, in Asia and South Pacific Design Automation Conference (ASP-DAC) (IEEE, Piscataway, 2015), pp. 520–525
156. J. Lienig, Layoutsynthese Elektronischer Schaltungen – Grundlegende Algorithmen für die Entwurfsautomatisierung (Springer, Berlin, 2006). ISBN: 978-3-540-29942-4. https://doi.org/10.1007/3-540-29942-4
157. C. Liu, S.K. Lim, A design tradeoff study with monolithic 3D integration, in 13th International Symposium on Quality Electronic Design (ISQED) (2012), pp. 529–536
158. C. Liu et al., Full-chip TSV-to-TSV coupling analysis and optimization in 3D IC, in Design Automation Conference (DAC) (ACM/IEEE, New York/Piscataway, 2011), pp. 783–788
159. W. Liu et al., A NoC traffic suite based on real applications, in 2011 IEEE Computer Society Annual Symposium on VLSI (ISVLSI) (2011), pp. 66–71. https://doi.org/10.1109/ISVLSI.2011.49
160. G.H. Loh, Y. Xie, B. Black, Processor design in 3D die-stacking technologies. IEEE Micro 27(3), 31–48 (2007)
161. Z. Lu et al., NNSE: Nostrum network-on-chip simulation environment, in Proceedings of SSoCC (2005)
162. A. Mallik et al., The impact of sequential-3D integration on semiconductor scaling roadmap, in 2017 IEEE International Electron Devices Meeting (IEDM) (2017), pp. 32.1.1–32.1.4
163. S.K. Mandal et al., NoCBench: a benchmarking platform for network on chip, in Workshop on Unique Chips and Systems (2009)
164. K. Manna, S. Chattopadhyay, I. Sengupta, Through silicon via placement and mapping strategy for 3D mesh based network-on-chip, in International Conference on Very Large Scale Integration (VLSI-SoC) (IEEE, Piscataway, 2014). https://doi.org/10.1109/VLSI-SoC.2014.7004177
165. K. Manna et al., Integrated through-silicon via placement and application mapping for 3D mesh-based NoC design. ACM Trans. Embed. Comput. Syst. 16(1) (2016). https://doi.org/10.1145/2968446
166. E.J. Marinissen, Testing TSV-based three-dimensional stacked ICs, in Design, Automation & Test in Europe Conference (DATE) (IEEE, Piscataway, 2010), pp. 1689–1694
167. R. Merritt, Intel Opens Door on 7nm, Foundry (2014). http://www.eetimes.com/document.asp?doc_id=1323865
168. M. Millberg et al., Guaranteed bandwidth using looped containers in temporally disjoint networks within the nostrum network on chip, in Design, Automation and Test in Europe Conference and Exhibition (IEEE, Piscataway, 2004). ISBN: 0-7695-2085-5. https://doi.org/10.1109/DATE.2004.1269001
169. B. Montreuil, A modelling framework for integrating layout design and flow network design, in Material Handling '90 (Springer, Berlin, Heidelberg, 1990), pp. 95–115
170. O. Moreira et al., NeuronFlow: A hybrid neuromorphic–dataflow processor architecture for AI workloads, in 2020 2nd IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS) (IEEE, Piscataway, 2020), pp. 1–5
171. Mosek ApS, Mosek (2018)
172. R. Mullins, A. West, S. Moore, Low-latency virtual-channel routers for on-chip networks, in ACM SIGARCH Computer Architecture News (2004)
173. S. Musavvir et al., Inter-tier process-variation-aware monolithic 3-D NoC design space exploration. IEEE Trans. Very Large Scale Integr. Syst. 28(3), 686–699 (2020)
174. E. Musoll, T. Lang, J. Cortadella, Exploiting the locality of memory references to reduce the address bus energy, in International Symposium on Low Power Electronics and Design (ISLPED) (IEEE, Piscataway, 1997), pp. 202–207
175. T. Naito et al., World's first monolithic 3D-FPGA with TFT SRAM over 90nm 9 layer Cu CMOS, in 2010 Symposium on VLSI Technology (2010), pp. 219–220
176. A. Najafi et al., Energy modeling of coupled interconnects including intrinsic misalignment effects, in International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS) (IEEE, Piscataway, 2016), pp. 262–267
177. NanGate15: 15nm Open Cell Library. https://www.silvaco.com/products/nangate/FreePDK15_Open_Cell_Library/index.html. Accessed 15 March 2019
178. NanGate45: 45nm Open Cell Library. https://www.silvaco.com/products/nangate/FreePDK45_Open_Cell_Library. Accessed 20 Oct 2019
179. B. Niazmand et al., Logic-based implementation of fault-tolerant routing in 3D network-on-chips, in International Symposium on Networks-on-Chip (IEEE, Piscataway, 2016). ISBN: 978-1-4673-9030-9. https://doi.org/10.1109/NOCS.2016.7579317
180. F. Niklaus, A.C. Fischer, Heterogeneous 3D integration of MOEMS and ICs, in International Conference on Optical MEMS and Nanophotonics (OMN) (IEEE, Piscataway, 2016), pp. 1–2
181. NoC Blog, Top 5 most popular NoC simulators (2012). https://networkonchip.wordpress.com/2015/11/02/what-is-the-most-popular-full-system-simulator/
182. B. Noia et al., Scan test of die logic in 3-D ICs using TSV probing. IEEE Trans. Very Large Scale Integr. Syst. 23(2), 317–330 (2014)
183. P. O'Connor, M. Welling, Sigma delta quantized networks (2016). Preprint arXiv:1611.02024
184. C. Osewold, W. Büter, A. García-Ortiz, A coding-based configurable and asymmetrical redundancy scheme for 3-D interconnects, in International Symposium on Reconfigurable and Communication-Centric Systems-on-Chip (ReCoSoC) (IEEE, Piscataway, 2014), pp. 1–8
185. M. Palesi, M. Daneshtalab, Routing Algorithms in Networks-on-Chip (Springer, Berlin, 2014). ISBN: 978-1-4614-8274-1
186. P.P. Pande et al., Performance evaluation and design trade-offs for network-on-chip interconnect architectures. IEEE Trans. Comput. 54(8), 1025–1040 (2005). ISSN: 0018-9340
187. M. Palesi et al., Data encoding schemes in networks on chip. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 30(5), 774–786 (2011)
188. M. Parasar, A. Sinha, T. Krishna, Brownian bubble router: Enabling deadlock freedom via guaranteed forward progress, in 2018 Twelfth IEEE/ACM International Symposium on Networks-on-Chip (NOCS) (IEEE, Piscataway, 2018), pp. 1–8
189. D. Park et al., MIRA: A multi-layered on-chip interconnect router architecture, in 35th International Symposium on Computer Architecture (IEEE, Piscataway, 2008). https://doi.org/10.1109/ISCA.2008.13
190. S. Pasricha, A framework for TSV serialization-aware synthesis of application specific 3D networks-on-chip, in International Conference on VLSI Design (IEEE, Piscataway, 2012), pp. 268–273
191. S. Pasricha, N. Dutt, On-chip Communication Architectures: System on Chip Interconnect (Morgan Kaufmann, Burlington, 2010)
192. V.F. Pavlidis, E.G. Friedman, 3-D topologies for networks-on-chip. IEEE Trans. Very Large Scale Integr. Syst. 15(10), 1081–1090 (2007). ISSN: 1063-8210. https://doi.org/10.1109/TVLSI.2007.893649
193. V.F. Pavlidis, E.G. Friedman, Three-Dimensional Integrated Circuit Design. Morgan Kaufmann Series in Systems on Silicon (Elsevier Science, Amsterdam, 2010). ISBN: 978-0-08-092186-0
194. V.F. Pavlidis, I. Savidis, E.G. Friedman, Three-Dimensional Integrated Circuit Design (Newnes, London, 2017)
195. E. Pekkarinen et al., A set of traffic models for network-on-chip benchmarking, in International Symposium on System on Chip (2011)
196. Y. Peng et al., On accurate full-chip extraction and optimization of TSV-to-TSV coupling elements in 3D ICs, in International Conference on Computer-Aided Design (ICCAD) (ACM/IEEE, New York/Piscataway, 2013), pp. 281–288
197. Y. Peng et al., Silicon effect-aware full-chip extraction and mitigation of TSV-to-TSV coupling. IEEE Trans. Comput.-Aided Design Integr. Circuits Syst. 33(12), 1900–1913 (2014)
198. S.S.K. Pentapati et al., A logic-on-memory processor-system design with monolithic 3D technology. IEEE Micro 39(6), 38–45 (2019)
199. S. Piersanti et al., Transient analysis of TSV equivalent circuit considering nonlinear MOS capacitance effects. IEEE Trans. Electromagn. Compat. 57(5), 1216–1225 (2015)
200. S. Piersanti et al., Algorithm for extracting parameters of the coupling capacitance hysteresis cycle for TSV transient modeling and robustness analysis. IEEE Trans. Electromagn. Compat. 59(4), 1329–1338 (2016)
201. R. Pop, S. Kumar, A survey of techniques for mapping and scheduling applications to network on chip systems. Research Report, School of Engineering, Jönköping University (2004)
202. L. Popova-Zeugmann, Time and Petri Nets (Springer, Berlin, 2013). ISBN: 978-3-642-41115-1
203. C.S. Premachandran et al., Impact of 3D via middle TSV process on 20nm wafer level FEOL and BEOL reliability, in Electronic Components and Technology Conference (ECTC) (IEEE, Piscataway, 2016), pp. 1593–1598
204. C.S. Premachandran et al., Comprehensive 3D TSV reliability study on 14nm FINFET technology with thinned wafers, in Electron Devices Technology and Manufacturing Conference (EDTM) (IEEE, Piscataway, 2018), pp. 37–40
205. PTM – Interconnect. http://ptm.asu.edu/interconnect.html. Accessed 15 March 2019
206. C. Raghunandan, K.S. Sainarayanan, M.B. Srinivas, Process variation aware bus-coding scheme for delay minimization in VLSI interconnects, in International Symposium on Quality Electronic Design (ISQED) (IEEE, Piscataway, 2008), pp. 43–46
207. A.M. Rahmani et al., High-performance and fault-tolerant 3D NoC-bus hybrid architecture using ARB-NET-based adaptive monitoring platform. IEEE Trans. Comput. 63(3), 734–747 (2014). ISSN: 0018-9340. https://doi.org/10.1109/TC.2012.278
208. R.S. Ramanujam, B. Lin, A layer-multiplexed 3D on-chip network architecture. IEEE Embed. Syst. Lett. 1(2), 50–55 (2009). ISSN: 1943-0663. https://doi.org/10.1109/LES.2009.2034710
209. S. Ramprasad, N.R. Shanbhag, I.N. Hajj, Analytical estimation of transition activity from word-level signal statistics. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 16(7), 718–733 (1997)
210. S. Ramprasad, N.R. Shanbhag, I.N. Hajj, A coding framework for low-power address and data busses. IEEE Trans. Very Large Scale Integr. Syst. 7(2), 212–221 (1999)
211. S.H.S. Rezaei et al., Dynamic resource sharing for high-performance 3-D networks-on-chip. IEEE Comput. Archit. Lett. 15(1) (2015). ISSN: 1556-6056. https://doi.org/10.1109/LCA.2015.2448532
212. S.H.S. Rezaei et al., A three-dimensional networks-on-chip architecture with dynamic buffer sharing, in 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (IEEE, Piscataway, 2016). ISBN: 978-1-4673-8776-7. https://doi.org/10.1109/PDP.2016.124
213. J.A. Roy et al., Capo: Robust and scalable open-source min-cut floorplacer, in International Symposium on Physical Design (ACM, New York, 2005). https://doi.org/10.1145/1055137.1055184
214. P.K. Sahu, S. Chattopadhyay, A survey on application mapping strategies for network-on-chip design. J. Syst. Archit. 59, 60–76 (2013). https://doi.org/10.1016/j.sysarc.2012.10.004
215. S. Saini, Low Power Interconnect Design (Springer, Berlin, 2015)
216. K. Salah, Y.I. Ismail, A. El-Rouby, Arbitrary Modeling of TSVs for 3D Integrated Circuits. Analog Circuits and Signal Processing (Springer, Berlin, 2015). ISBN: 978-3-319-07610-2
217. S.K. Samal et al., Monolithic 3D IC vs. TSV-based 3D IC in 14nm FinFET technology, in 2016 IEEE SOI-3D-Subthreshold Microelectronics Technology Unified Conference (S3S) (2016), pp. 1–2
218. J.H. Satyanarayana, K.K. Parhi, Theoretical analysis of word-level switching activity in the presence of glitching and correlation. IEEE Trans. Very Large Scale Integr. Syst. 8(2), 148–159 (2000)
219. Tezzaron Semiconductor, Our Technology 101 (2015). http://www.tezzaron.com/aboutus/ourtechnology101/
220. A. Shacham, K. Bergman, L.P. Carloni, Photonic networks-on-chip for future generations of chip multiprocessors. IEEE Trans. Comput. 57(9), 1246–1260 (2008). ISSN: 0018-9340. https://doi.org/10.1109/TC.2008.78
221. A. Sheibanyrad, F. Petrot, A. Jantsch, 3D Integration for NoC-Based SoC Architectures (Springer, Berlin, 2010). ISBN: 1-4419-7617-5
222. J. Shi, C. Tomasi, Good features to track, in Conference on Computer Vision and Pattern Recognition (IEEE, Piscataway, 1994). ISBN: 0-8186-5825-8. https://doi.org/10.1109/CVPR.1994.323794
223. Y. Shin, S.-I. Chae, K. Choi, Partial bus-invert coding for power optimization of application-specific systems. IEEE Trans. Very Large Scale Integr. Syst. 9(2), 377–383 (2001)
224. J. Shi et al., Routability in 3D IC design: Monolithic 3D vs. Skybridge 3D CMOS, in 2016 IEEE/ACM International Symposium on Nanoscale Architectures (NANOARCH) (2016), pp. 145–150
225. M. Shulaker et al., Three-dimensional integration of nanotechnologies for computing and data storage on a single chip. Nature 547, 74–78 (2017)
226. T. Song et al., Full-chip signal integrity analysis and optimization of 3-D ICs. IEEE Trans. Very Large Scale Integr. Syst. 24(5), 1636–1648 (2015)
227. E. Sotiriou-Xanthopoulos et al., A framework for rapid evaluation of heterogeneous 3-D NoC architectures. Microprocess. Microsyst. 38(4), 292–303 (2014). ISSN: 0141-9331. https://doi.org/10.1016/j.micpro.2013.09.003
228. K. Srinivasan, K.S. Chatha, G. Konjevod, Linear-programming-based techniques for synthesis of network-on-chip architectures. IEEE Trans. Very Large Scale Integr. Syst. 14(4), 407–420 (2006). ISSN: 1063-8210. https://doi.org/10.1109/TVLSI.2006.871762
229. M.R. Stan, W.P. Burleson, Bus-invert coding for low-power I/O. IEEE Trans. Very Large Scale Integr. Syst. 3(1), 49–58 (1995)
230. H. Stark, J.W. Woods, Probability, Random Processes, and Estimation Theory for Engineers (Prentice-Hall, Hoboken, 1986)
231. P.H. Starke, A Memo on Time Constraints in Petri Nets. Informatik-Bericht Nr. 46 (1995)
232. T. Steinmetz, S. Kurz, M. Clemens, Domains of validity of quasistatic and quasistationary field approximations. COMPEL: The Int. J. Comput. Math. Electr. Electron. Eng. 30(4), 1237–1247 (2011)
233. H. Sun et al., Design of 3D DRAM and its application in 3D integrated multi-core computing systems. IEEE Design Test (2013). https://doi.org/10.1109/MDT.2009.93
234. Synopsys, RTL Synthesis and Test (2017). https://www.synopsys.com/implementation-and-signoff/rtl-synthesis-test.html
235. K. Tatas et al., Designing 2D and 3D Network-on-Chip Architectures (Springer, Berlin, 2014)
236. C. Tomasi, T. Kanade, Detection and tracking of point features. Int. J. Comput. Vis. 9, 137–154 (1991)
237. M.-C. Tsai, T.-C. Wang, T.T. Hwang, Through-silicon via planning in 3-D floorplanning. IEEE Trans. Very Large Scale Integr. Syst. 19(8) (2011). ISSN: 1063-8210. https://doi.org/10.1109/TVLSI.2010.2050012
238. W.-P. Tu, Y.-H. Lee, S.-H. Huang, TSV sharing through multiplexing for TSV count minimization in high-level synthesis, in International SOC Conference (SOCC) (IEEE, Piscataway, 2011), pp. 156–159
239. E.B. van der Tol, E.G. Jaspers, Mapping of MPEG-4 decoding on a flexible architecture platform, in Media Processors (2002)
240. P.J.M. Van Laarhoven, E.H.L. Aarts, Simulated annealing, in Simulated Annealing: Theory and Applications (Springer, Berlin, 1987), pp. 7–15
241. B. Victor, K. Keutzer, Bus encoding to prevent crosstalk delay, in International Conference on Computer-Aided Design (ICCAD) (ACM/IEEE, New York/Piscataway, 2001), pp. 57–63
242. P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in Computer Society Conference on Computer Vision and Pattern Recognition (IEEE, Piscataway, 2001). ISBN: 0-7695-1272-0. https://doi.org/10.1109/CVPR.2001.990517
243. P. Vivet et al., Monolithic 3D: An alternative to advanced CMOS scaling, technology perspectives and associated design methodology challenges, in International Conference on Electronics, Circuits and Systems (ICECS) (IEEE, Piscataway, 2018), pp. 157–160
244. F. von Trapp, The Many Flavors of 3D DRAM (2013). http://www.3dincites.com/2013/10/the-many-flavors-of-3d-dram/
245. C. Wang et al., BIST methodology architecture and circuits for pre-bond TSV testing in 3D stacking IC systems. IEEE Trans. Circuits Syst. I: Regul. Pap. 62(1), 139–148 (2014)
246. Y. Wang et al., Economizing TSV resources in 3-D network-on-chip design. IEEE Trans. Very Large Scale Integr. Syst. 23(3), 493–506 (2015). ISSN: 1063-8210. https://doi.org/10.1109/TVLSI.2014.2311835
247. T. Webber et al., Tiny – optimised 3D mesh NoC for area and latency minimisation. Electron. Lett. 50(3), 165–166 (2014). ISSN: 0013-5194. https://doi.org/10.1049/el.2013.2557
248. Wolfram Research, Mathematica Edition: Version 10.4 (Wolfram Research, Champaign, Illinois, 2016)
249. S.-C. Wong, G.-Y. Lee, D.-J. Ma, Modeling of interconnect capacitance, delay and crosstalk in VLSI. IEEE Trans. Semicond. Manuf. 13(1), 108–111 (2000)
250. T. Wu et al., High performance and low power monolithic three-dimensional sub-50 nm Poly Si thin film transistor (TFTs) circuits. Nat. Sci. Rep. 7(1368), 2045–2322 (2017)
251. Xilinx, Xilinx Stacked Silicon Interconnect Technology Delivers Breakthrough FPGA Capacity, Bandwidth, and Power Efficiency, v1.2, WP380 (Xilinx, 2012). www.xilinx.com
252. K. Xu, E.G. Friedman, Scaling trends of power noise in 3-D ICs. Elsevier Integr. 51, 139–148 (2015)
253. C. Xu et al., Compact AC modeling and performance analysis of through-silicon vias in 3-D ICs. IEEE Trans. Electron Devices 57(12), 3405–3417 (2010)
254. T.C. Xu et al., Optimal placement of vertical connections in 3D network-on-chip. J. Syst. Archit. 59(7), 441–454 (2013). https://doi.org/10.1016/j.sysarc.2013.05.002
255. X. Xu et al., Enhanced 3D implementation of an Arm® Cortex®-A microprocessor, in 2019 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED) (2019), pp. 1–6
256. C. Yan, E. Salman, Mono3D: Open source cell library for monolithic 3-D integrated circuits. IEEE Trans. Circuits Syst. I: Regul. Pap. 65(3), 1075–1085 (2018)
257. R. Yarema, Development of 3D for HEP (2013). http://inspirehep.net/record/731918/files/fermilab-pub-06-343.pdf?version=1
258. Y. Ye et al., Holistic comparison of optical routers for chip multiprocessors, in Anti-counterfeiting, Security, and Identification (2012), pp. 1–5. https://doi.org/10.1109/ICASID.2012.6325348
259. M. Yi et al., A pulse shrinking-based test solution for prebond through silicon via in 3-D ICs. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 38(4), 755–766 (2018)
260. Á. Zarándy, Focal-Plane Sensor-Processor Chips (Springer, Berlin, 2011). ISBN: 978-1-4419-6475-5
261. Z. Zhang, P. Franzon, TSV-based, modular and collision detectable face-to-back shared bus design, in 3D Systems Integration Conference (3DIC), 2013 IEEE International (2013), pp. 1–5. https://doi.org/10.1109/3DIC.2013.6702399
262. D. Zhang et al., Process development and optimization for 3μm high aspect ratio via-middle through-silicon vias at wafer level. IEEE Trans. Semicond. Manuf. 28(4), 454–460 (2015)
263. Y. Zhao, S. Khursheed, B.M. Al-Hashimi, Online fault tolerance technique for TSV-based 3-D-IC. IEEE Trans. Very Large Scale Integr. Syst. 23(8), 1567–1571 (2015)
264. P. Zhong et al., High-aspect-ratio TSV process with thermomigration refilling of Au–Si eutectic alloy. IEEE Trans. Compon. Packag. Manuf. Technol. 11(2), 191–199 (2021). https://doi.org/10.1109/TCPMT.2020.3047907
265. Q. Zou et al., 3DLAT: TSV-based 3D ICs crosstalk minimization utilizing less adjacent transition code, in Asia and South Pacific Design Automation Conference (ASP-DAC) (IEEE, Piscataway, 2014), pp. 762–767