Efficient Execution of Irregular Dataflow Graphs: Hardware/Software Co-optimization for Probabilistic AI and Sparse Linear Algebra [1st ed. 2023] 3031331354, 9783031331350

This book focuses on the acceleration of emerging irregular sparse workloads posed by novel artificial intelligence (AI) models and sparse linear algebra.


English Pages 164 [155] Year 2023


Table of contents :
Preface
Contents
List of Abbreviations
List of Symbols
List of Figures
List of Tables
1 Irregular Workloads at Risk of Losing the Hardware Lottery
1.1 Domain Specialization and the Hardware Lottery
1.2 Recent Trends and Irregular Workloads
1.3 Introduction to Graphs
1.4 Target Workloads
1.4.1 Probabilistic Circuit (PC)
1.4.2 Sparse Matrix Triangular Solves (SpTRSV)
1.4.3 Comparison of PC and SpTRSV
1.5 Open Research Questions for Efficient Execution of Irregular DFGs
1.5.1 Q1: What Type of Data Representation Is Suitable?
1.5.2 Q2: How Can We Parallelize Irregular DFGs Effectively?
1.5.3 Q3: How Can We Improve the Throughput and Energy Efficiency Through a Custom Processor Architecture?
1.5.4 Q4: How Can We Improve the Hardware Further Through a Dedicated Datapath Design?
1.6 Book Contributions
1.6.1 ProbLP and the Custom Posit Representation
1.6.2 GraphOpt: A Tool for Effective Parallelization of DFGs
1.6.3 DAG Processing Unit: Version 1 (DPU)
1.6.4 DAG Processing Unit-Version 2 (DPU-v2)
2 Suitable Data Representation: A Study of Fixed-Point, Floating-Point, and PositTM Formats for Probabilistic AI
2.1 Error-Bound Analysis
2.1.1 Fixed-Point Error Models
2.1.2 Floating-Point Error Models
2.1.3 Ensuring That All the Intermediate Values of a PC Are Within the Range
2.1.4 Error Propagation from PC Inputs to the Output
2.2 ProbLP
2.2.1 Bounds for Probabilistic Queries
Marginal Probability and MPE
Conditional Probability
2.2.2 Selecting Optimal Representation
2.2.3 Automatic Hardware Generation
2.2.4 Experimental Results
Validation of Bounds
Overall Performance
2.3 Beyond Fixed and Floating Point: Posit Representation
2.4 Conclusions
3 GraphOpt: Constrained-Optimization-Based Parallelization of Irregular Workloads for Multicore Processors
3.1 Graph Partitioning for Parallelization
3.2 GraphOpt
3.2.1 Recursive Two-Way Partitioning (M1)
Optimization Model for the Two-Way Partitioning
Example
3.2.2 Workload Balancing (M2)
3.2.3 Scale to Large Graphs (S1, S2, S3)
Consider Limited Layers (S1)
Independent Connected Components (S2)
Heuristic Coarsening (S3)
3.3 Performance Evaluation
3.3.1 Experimental Setup
3.3.2 Analysis of Super Layers
How Large Are the Super Layers?
Workload Balancing
Throughput Scaling
Impact of the Scalability Techniques
3.3.3 Comparison with State-of-the-Art Libraries
Sparse Matrix Triangular Solves
Probabilistic Circuits
3.4 Discussion and Related Work
3.4.1 Sparse Triangular Solves
3.4.2 Probabilistic Circuits
3.4.3 Graph Partitioning
3.4.4 DAG Scheduling
3.5 Conclusion
4 DAG Processing Unit Version 1 (DPU): Efficient Execution of Irregular Workloads on a Multicore Processor
4.1 Challenges Due to Irregularity
4.1.1 SIMD Unfriendly
4.1.2 Frequent Synchronizations
4.1.3 Inefficient Use of Caches
4.1.4 Data Prefetching
4.2 DPU Architecture
4.2.1 Compute Units (CUs)
4.2.2 Global Scratchpad and Asymmetric Crossbar
4.2.3 Global Sync Unit
4.3 Compute Unit (CU) Architecture
4.3.1 Local Scratchpad
4.3.2 Data Prefetching Using Decoupled Instruction Streams
4.4 Precision-Scalable Custom Posit Unit
4.5 Implementation and Experiments
4.5.1 Physical Implementation
4.5.2 Peak Performance and Voltage Scaling
4.5.3 Workloads
4.5.4 Throughput Scaling with Different Active CUs
4.5.5 Comparison with CPU and GPU
4.5.6 DPU's Performance for a Regular DAG
4.6 Related Work
4.7 Conclusion
5 DAG Processing Unit Version 2 (DPU-v2): Efficient Execution of Irregular Workloads on a Spatial Datapath
5.1 Designing a Processor with a Spatial Datapath for Large Irregular DAGs
5.1.1 Which Spatial Datapath Topology Should Be Used?
5.1.2 How to Read/Write the Inputs/Outputs?
5.1.3 How to Handle Bank Access Conflicts?
5.2 DPU-v2 Architecture Template
5.2.1 Parallel Tree of PEs
5.2.2 Register File Architecture
5.2.3 Datapath-Register Banks Connections
5.2.4 Load, Store, and Copy of Data
5.2.5 Long, Variable-Length Instructions
5.3 Compiler for DAG
5.3.1 Block Decomposition (Step 1)
5.3.2 PE and Register Bank Mapping (Step 2)
5.3.3 Pipeline-Aware Reordering (Step 3)
5.3.4 Spilling from Register File (Step 4)
5.3.5 Reduction in Memory Footprint
5.4 Design Space Exploration
5.4.1 The Most-Efficient Design Configuration
5.5 State-of-the-Art Comparison
5.5.1 Comparison Using PC and SpTRSV
5.5.2 Comparison Using Large PCs
5.5.3 Detailed Comparison of DPU-v2 and DPU
5.6 Additional Related Works
5.7 Conclusion
6 Conclusions and Future Work
6.1 Contributions and Conclusions
6.2 Suggestions for Future Works
6.3 Closing Remarks
A The Two-Way Partitioning Model of GraphOpt
Bibliography
Index


Nimish Shah • Wannes Meert • Marian Verhelst

Efficient Execution of Irregular Dataflow Graphs Hardware/Software Co-optimization for Probabilistic AI and Sparse Linear Algebra

Nimish Shah ESAT-MICAS KU Leuven Leuven, Belgium

Wannes Meert CS-DTAI KU Leuven Leuven, Belgium

Marian Verhelst ESAT-MICAS KU Leuven Leuven, Belgium

ISBN 978-3-031-33135-0    ISBN 978-3-031-33136-7 (eBook)
https://doi.org/10.1007/978-3-031-33136-7

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

To meet the ever-present demand for smarter and more intelligent machines, increasing research efforts are focused on developing novel AI models. However, despite their promising algorithmic properties, these novel models may not execute well on existing hardware architectures. This mismatch prevents researchers from experimenting with bigger versions of the models on larger datasets or deploying these models for energy-constrained applications, putting them at a competitive disadvantage. This problem is becoming more severe due to the rise of highly domain-specialized hardware and software, which raises the barrier to straying away from the established domains. The impact that existing hardware platforms have on the abandonment of potentially promising models/algorithms is termed losing the hardware lottery.

This book focuses on two workloads that are at risk of losing the hardware lottery: an emerging AI model called the probabilistic circuit and a linear algebra operation called the sparse matrix triangular solve. The computations in both workloads can be modeled as irregularly structured dataflow graphs, which do not execute with high throughput and energy efficiency on existing hardware. To address this limitation, this book contributes cohesive solutions at different levels of the software/hardware stack, namely application-specific optimized data representations, targeted compilation/mapping algorithms, throughput- and energy-efficient hardware architectures, and an optimized silicon implementation.

On the application level, the book identifies the most suitable data representation for the application requirements by developing analytical error and energy models of customized data representations. The selected representation is based on a novel format called posit™, customized to achieve the same accuracy as 32-bit floating point with just 8 bits, improving the throughput and energy efficiency by up to 2×.

On the compilation level, the book proposes optimized mapping algorithms for general-purpose CPUs and dedicated hardware architectures. A constrained-optimization-based framework is designed to effectively parallelize the irregular dataflow graphs by minimizing synchronization and communication overheads, achieving a speedup of 2×.


Furthermore, for custom hardware architectures, dedicated algorithms are developed to map the workloads for high hardware utilization.

On the hardware level, two versions of the DAG processing unit (DPU) are designed to alleviate the execution bottlenecks of irregular dataflow graphs. The first version of DPU hides the long latency of irregular data accesses by decoupling data-access and computational instructions, which enables aggressive data prefetching. Furthermore, it is equipped with precision-scalable arithmetic units that perform 1×32b, 2×16b, or 4×8b operations, depending on the application requirements. The second version of DPU is upgraded further with a spatial datapath of processing elements, which is interfaced with a high-bandwidth register file designed to mitigate the impact of irregular accesses. Both versions of DPU are programmable with custom instruction sets and can execute any arbitrary acyclic dataflow graph. They achieve speedups of 5× and 20× over a CPU and a GPU, respectively, while operating below a 0.25 W power budget.

On the implementation level, the first version of DPU is physically implemented on chip in a 28 nm CMOS technology to validate the feasibility of the proposed hardware innovations. Measurements of the fabricated prototype confirm that DPU achieves a peak throughput of 74 GOPS at 0.23 W, while reaching a clock frequency of 288 MHz.

This way, the book contributes important pieces involving optimizations across the software-hardware stack, which culminate in an end-to-end solution for high-throughput and energy-efficient execution of irregular dataflow graphs, ensuring that these promising workloads do not lose the dreaded hardware lottery.

Leuven, Belgium

Nimish Shah Wannes Meert Marian Verhelst


List of Abbreviations

AI      artificial intelligence
CMOS    complementary metal-oxide semiconductor
CPU     central processing unit
CSC     compressed sparse column
CSR     compressed sparse row
DAG     directed acyclic graph
DFG     dataflow graph
DNN     deep neural network
DPU     DAG processing unit
EDP     energy-delay product
FIFO    first in, first out
FPGA    field-programmable gate array
GOPS    giga operations per second
GPU     graphics processing unit
NPU     neural processing unit
PC      probabilistic circuit: a model that can perform complex probabilistic inference queries tractably
RTL     register-transfer level hardware description
SIMD    single instruction, multiple data
SoC     system-on-chip
SpTRSV  sparse matrix triangular solve: an operation to solve a sparse triangular system of equations
SPU     sparse processing unit [33]
SRAM    static random access memory
TOPS    tera operations per second
VLIW    very long instruction word

List of Symbols

ε    The relative error in the floating-point representation
f̃    The quantized value of a number represented in a finite-precision arithmetic representation

List of Figures

Fig. 1.1 Probabilistic circuit, an emerging AI model, is an irregular dataflow graph
Fig. 1.2 The computation required to find the solution of a system of equations represented by a sparse triangular matrix can be represented as a dataflow graph
Fig. 1.3 Throughput of SpTRSV and PC on general-purpose CPU and GPU
Fig. 1.4 Different types of graphs are shown in (a), (b), and (c). (d) shows how a graph can be represented as an adjacency matrix, and how a directed acyclic graph can be renamed to get a triangular adjacency matrix. For real-world graphs, the adjacency matrix is typically sparse, which can be stored in a compressed form as shown in (e)
Fig. 1.5 A sample Bayesian network
Fig. 1.6 The PC compiled from the Bayesian network in Fig. 1.5 for tractable inference
Fig. 1.7 The pseudocode of evaluating the output of a PC
Fig. 1.8 Solving a dense matrix triangular system. (a) Dense triangular system. (b) Elements of x are evaluated one by one, by substituting the previous values. (c) DFG where nodes are add/mul operators. (d) DFG where nodes represent all the operations to evaluate an element of x
Fig. 1.9 A sparse matrix in a triangular system leads to an irregular DFG
Fig. 1.10 The pseudocode for performing SpTRSV to get x in Lx=b
Fig. 1.11 An example of a spatial datapath: a two-dimensional systolic array of PEs
Fig. 1.12 The overview of the book contributions
Fig. 2.1 The fields of (a) fixed-point and (b) floating-point formats, with examples and minimum/maximum values
Fig. 2.2 Error propagation with fixed-point error models
Fig. 2.3 Internal working of ProbLP
Fig. 2.4 Automatic conversion of a PC to pipelined hardware
Fig. 2.5 Analytical error bounds and the observed error over a test set for varying fraction bits F, validating the correctness of fixed-point bounds
Fig. 2.6 Analytical error bounds and the observed error over a test set for varying mantissa bits M, validating the correctness of floating-point bounds
Fig. 2.7 The posit representation
Fig. 2.8 Accuracy impact of different representations. Custom posit performs better due to lower error for a wider range of values. (a) Exponent length for different formats. (b) Error in representations @32b. (c) Probabilistic circuit (PC). (d) Sp. Triangular Solve (SpTRSV)
Fig. 3.1 An example for the need of synchronization
Fig. 3.2 Different ways to partition a DAG, and impact on parallelization, communication, and workload balance
Fig. 3.3 Super layers. GraphOpt decomposes a fine-grained DAG into super layers, each having P partitions. The partitions are made as large as possible but also of similar size to ensure workload balancing
Fig. 3.4 GraphOpt overview. One super layer, starting from the bottom, is generated in every iteration. The main steps M1 and M2 use an optimization model with the Google OR-Tools solver [122] to find good partitions. The scalability steps S1, S2, and S3 are used to handle graphs with millions of nodes/edges
Fig. 3.5 The recursive two-way partitioning (M1) uses a MiniZinc-based optimization model to partition subgraphs until getting P partitions. Due to the recursive approach, the partitions can be imbalanced. The workload balancing (M2) iteratively redistributes the nodes, using the same MiniZinc-based model, to generate a balanced super layer
Fig. 3.6 Example. An optimal partitioning for a simple graph
Fig. 3.7 The full flow with S1, S2, and S3 scalability steps that enables GraphOpt to handle large graphs with millions of nodes/edges
Fig. 3.8 The data structures generated by depth-first traversal for the heuristic coarsening step (S3)
Fig. 3.9 Detailed analysis of super layers. Row (f) shows that DAGs can be partitioned into a few, large super layers. It also shows the size of the original DAG layers for comparison. Row (g) shows the workload balancing across threads, in the form of operations in different threads in super layers. Row (h) shows the throughput scaling with parallel threads, demonstrating the advantage of using super layers versus direct DAG layer partitioning. Rows (i) and (j) show the improvement in partitioning time due to scalability techniques S1, S2, and S3, while row (k) shows the degradation of performance due to these techniques
Fig. 3.10 Sparse matrix triangular solve performance. GraphOpt achieves a mean speedup of 2.0, 3.3, 3.1, 5.6, 10.8, and 23.7 over SuiteSparse CXSparse, SuiteSparse UMFPACK, Intel MKL, P2P, DAG layer partitioning, and KokkosKernels, respectively
Fig. 3.11 Probabilistic circuits performance. GraphOpt achieves a mean speedup of 1.8× and 1052× over DAG layer partitioning and Juice, respectively
Fig. 4.1 Synchronizations. GraphOpt's super layer-based approach (b) is used for DPU as it improves the amount of computation done per synchronization barrier by 3× compared to the standard DAG layer-based approach (a)
Fig. 4.2 The DPU architecture with 64 parallel CUs
Fig. 4.3 Asymmetric crossbar to reduce area and power overhead
Fig. 4.4 Decoupled streams to overlap memory and processing instructions
Fig. 4.5 Internal PE design and FIFO flow-control for precise ld/st timing
Fig. 4.6 Posit arithmetic unit with precision-scalable subunits
Fig. 4.7 The physical implementation of DPU. (a) Floorplan of the physical implementation of DPU. (b) Standard cells (green) and nets (white) of the crossbar and arbiter. (c) Routing congestion hotspots. (d) Placement density map (violet → red for least → most density)
Fig. 4.8 Chip micrograph and specifications
Fig. 4.9 Peak performance scaling with voltage and precision
Fig. 4.10 Scaling of throughput with increasing active CUs at 8b precision
Fig. 4.11 Performance comparison. The DPU operating point is 0.9 V and 32b
Fig. 5.1 A few examples of spatial datapaths used for applications like (a) DNNs [147], (b) digital signal processing [159], and (c) sparse matrix-matrix multiplications [128]
Fig. 5.2 Systolic arrays (a) are underutilized by irregular DAGs, while a tree-shaped datapath (b) is a promising alternative as measured by peak utilization (c)
Fig. 5.3 Example of a DAG execution on a tree of PEs, causing irregular register accesses. (a) A sample DAG to be mapped on a tree of PEs with banked register file. (b) Possible decomposition of the DAG along with the input/output variables. (c) Sequence of execution along with the state of the register file
Fig. 5.4 (a) A common approach for executing DAGs in parallel, in which the unpredictable irregular accesses happen to scratchpad/memory. (b) In DPU-v2, by pushing the interconnect closer to the datapath, the compiler is equipped to predict the irregular accesses and prevent bank conflicts
Fig. 5.5 The DPU-v2 architecture template consists of processing elements (PEs) connected in a tree topology. The trees are connected to parallel register banks with input and output interconnects. The datapath is pipelined according to the number of PE layers
Fig. 5.6 Automatic write address generation by tracking the occupancy status of registers with valid bits. A priority encoder chooses the empty location with the smallest address to write data
Fig. 5.7 Different interconnection topologies and their impact on bank conflicts
Fig. 5.8 (a) Shows that vectors are loaded/stored from a single address in the data memory, but register banks are addressed independently. (b) Shows how data can be copied between banks via the input interconnect to avoid bank access conflicts
Fig. 5.9 Variable-length instructions (in (a)) are densely packed in the memory (in (b)) and are appropriately aligned during decoding for stall-free execution
Fig. 5.10 Steps involved in the custom compiler to generate optimized instructions for a given DAG
Fig. 5.11 (a) Cycles to be avoided during block generation. (b) Avoid block dependencies like the dependence of D on B. (c) Any subgraph with two-input nodes, one sink node, and with the longest path length less than the depth of a PE tree can be mapped to the PE tree by replicating certain nodes. (d) Subgraphs that can be combined into a block (possibly with permutations) for mapping to a datapath with one PE tree of depth 3
Fig. 5.12 (a) An example of how the mapping of node b affects the compatible PEs and banks of the other nodes. (b) Our algorithm achieves considerably lower bank conflicts than random allocation. (c) and (d) show that the data allocation across register banks remains well-balanced
Fig. 5.13 Design space exploration to identify the min-EDP design. The optimal points are highlighted with large markers. (a) Latency. (b) Energy. (c) Energy-delay product
Fig. 5.14 Latency vs. energy plot with a constant EDP curve passing through the min-EDP point, colored according to D, B, and R in (a), (c), and (d). (b) Shows the inset of (a) that highlights the operating points closer to the min-EDP point
Fig. 5.15 Breakdown of instructions
Fig. 5.16 Throughput for every workload
Fig. 5.17 How to read the plot in Fig. 5.18: This figure describes the different parts of the bar charts in Fig. 5.18. The two bar charts for each processor show the status of the datapath and the data memory during the execution, exhibiting the overlap of active/inactive phases. Note: The bars in this figure are just for illustration. See Fig. 5.18 for proper bar heights
Fig. 5.18 Detailed comparison of the execution phases of DPU-v2 and DPU (see Fig. 5.17 to understand the phases). DPU incurs higher stalls due to bank access conflicts and higher memory transactions than DPU-v2, but 57% of these are overlapped with active PE execution because of the decoupled instructions

List of Tables

Table 2.1 Energy models for arithmetic operators in 65 nm technology at 1 V and 100 MHz frequency. N is the total number of fixed-point bits, and M is the number of mantissa bits in a floating-point format
Table 2.2 Optimal fixed-point and floating-point representations that meet the required error tolerance. The selected representation is in bold. Measured error and energy for the selected representation. I, F, E, and M stand for the number of integer, fraction, exponent, and mantissa bits
Table 3.1 Optimization model. Different parts of the MiniZinc two-way partitioning optimization model
Table 4.1 Challenges and opportunities in irregular DAG execution and related DPU innovations
Table 4.2 Instructions of PE
Table 4.3 Posit unit area and power breakdown
Table 4.4 Performance comparison with a state-of-the-art posit unit
Table 4.5 Post-layout area and power breakdown
Table 4.6 Statistics of the benchmarked DAGs
Table 4.7 Performance comparison with other platforms
Table 5.1 Statistics of the benchmarked DAGs
Table 5.2 Area and power breakdown of DPU-v2
Table 5.3 Performance comparison with other platforms
Table 5.4 Specification of the processors used for comparison

Chapter 1

Irregular Workloads at Risk of Losing the Hardware Lottery

1.1 Domain Specialization and the Hardware Lottery

The world has an insatiable demand for fast and energy-efficient computation. Today, our mobile devices are more powerful than supercomputers from just a few decades ago, while operating on tiny batteries. They can display 8K-resolution videos and render photo-realistic games for 10 hours straight without a recharge. Furthermore, the devices are also becoming "smarter" with features like voice assistants, face unlock, and interactive augmented reality. As computational platforms keep improving, newer and smarter features that are not feasible today will keep getting added.

Domain Specialization As the improvements in silicon fabrication technology become slower and costlier, it is becoming increasingly challenging to keep improving these computational platforms at the same pace as earlier. In the past decade, this has prompted a trend to move toward domain-specialized hardware. Consider the latest Apple A16 Bionic system-on-chip (SoC) [164] designed for mobile phones. Along with two-core high-performance and four-core high-efficiency CPUs for general-purpose computation, it has:

• a five-core graphics processing unit (GPU) for rendering user interfaces and 3D graphics,
• a 16-core neural engine for processing deep neural networks (DNNs),
• an image signal processor (ISP) dedicated to processing data from high-resolution image sensors,
• a modem unit for 5G wireless communication, and
• multiple media engines for high-resolution video encoding and decoding.

As evident from this list, there is a trend toward using dedicated hardware for domain-specific workloads. This aggressive specialization is needed because general-purpose CPUs can no longer achieve the required throughput and energy efficiency with similar silicon area budgets.


The Hardware Lottery With the advent of domain-specialized hardware, a notable phenomenon of uneven progress in algorithmic ideas emerges. Hooker [74] coined the term "hardware lottery" to describe when an algorithmic research direction wins simply because it is suited to the existing software and hardware domains, and not because the direction is superior to alternative approaches. The lost decades of deep neural networks clearly illustrate the impact of losing the hardware lottery. The main ideas used in the current state-of-the-art neural networks were invented in the 1980s–1990s, but they did not demonstrate the anticipated success simply because the hardware was not ready yet. With the CPUs available in that era, it was impractical to train large and deep networks with multiple layers. DNNs had to wait until the 2010s, and the arrival of massively parallel and easily programmable graphics processing units (GPUs), to get widely accepted as a promising research direction.

With increasing domain specialization, the hardware lottery will play an even more important role because it becomes increasingly costly to stray off the established "domains." Certain research directions that fall into the established domains will reap the benefits of domain specialization, while others remain dependent on slow and inefficient general-purpose platforms. As such, non-conventional algorithmic breakthroughs can no longer happen in isolation. They require dedicated research efforts across the software and hardware stack to ensure that the novel ideas do not lose the hardware lottery.

This book focuses on promising emerging algorithms for artificial intelligence (AI) and sparse linear algebra, which are at risk of losing the hardware lottery. They are ill-suited for the existing software frameworks and hardware platforms, and require significant innovations across the stack. This book contributes important pieces to bridge these gaps.

1.2 Recent Trends and Irregular Workloads

Dense Tensors In recent years, incredible advancements in the domain of efficient dense tensor/matrix computations have been achieved, primarily because of the surge of deep neural networks (DNNs). Software frameworks and languages like TensorFlow [1], PyTorch [119], and Halide [129] enable the modeling of tensor manipulations in a relatively simple way, with the flexibility of targeting diverse hardware platforms. Similarly, the hardware platforms have also rapidly evolved for optimized execution, like GPUs and FPGAs equipped with tensor cores [7, 11, 25], and a large variety of highly specialized DNN processors [22, 78, 93, 118, 121, 151]. In addition, several automated design space exploration tools have been proposed, which can generate highly targeted and optimized hardware accelerators for specific DNNs [79, 90, 165]. These advancements in specialized accelerators provide orders of magnitude higher performance and energy efficiency compared to general-purpose CPUs [118].

Fig. 1.1 Probabilistic circuit, an emerging AI model, is an irregular dataflow graph

Sparse and Irregular Workloads However, emerging workloads that cannot be modeled as dense tensor operations cannot leverage these advancements. A salient example is the emerging AI model called the probabilistic circuit (PC), which enables the integration of statistical and symbolic methods, a promising combination pursued to achieve neuro-symbolic reasoning [73, 89]. PCs are also central to performing inference in the field of probabilistic (logic) programming [51], enabling robust inference under uncertainty [154] (discussed further in Sect. 1.4.1). Computationally, a PC is essentially a sparse and irregular dataflow graph (DFG) [46] without apparent repeating patterns in the edge connectivity (see Fig. 1.1). Such irregular DFGs cannot be modeled as dense tensor operations. Similarly, operations with highly sparse tensors/matrices are also ill-suited to the techniques developed for dense tensors. For example, consider the operation of solving a triangular system of equations represented by a sparse matrix, called the sparse matrix triangular solve (SpTRSV) [44] operation, which is used in robotics [82], wireless communication [4], cryptography [57], etc. The computation of an SpTRSV can be represented as a dataflow graph, as shown in Fig. 1.2. If the matrix sparsity is unstructured without apparent repeating patterns, the edge connectivity of the DFG will also be unstructured and irregular, in which the edges connect seemingly random nodes.

Fig. 1.2 The computation required to find the solution of a system of equations represented by a sparse triangular matrix can be represented as a dataflow graph

Fig. 1.3 Throughput of SpTRSV and PC on general-purpose CPU and GPU

Inefficient Execution The irregularity of these workloads poses challenges for parallel execution on CPUs and GPUs. As shown in Fig. 1.3, the throughput of SpTRSV and PC on the Nvidia RTX 2080Ti GPU is lower than on the Intel Xeon Gold 6154 CPU, despite the GPU's highly parallel hardware. Furthermore, the CPU could achieve only a tiny fraction of its peak throughput of 3.4 tera operations per second (TOPS). This underperformance is due to two main reasons:

• The seemingly random edge connectivity of an unstructured DFG results in irregular memory accesses with low spatial locality. Hence, the large granularity of cache-line size in CPU and GPU leads to severe underutilization of caches and memory bandwidth, because only 4B of data would most likely get consumed out of the 64/128B fetched to a cache line. If the data is mapped to scratchpads instead of caches in the GPU to avoid the granularity mismatch, a different problem of frequent bank access conflicts arises due to the irregular accesses. Furthermore, these irregular accesses severely degrade GPU performance because they prevent memory coalescing, which is crucial for high GPU performance [153].


• Due to the irregular structure, parallelizing different parts of DFGs across multiple units (like CPU cores, GPU streaming multiprocessors, etc.) can lead to workload imbalance and high communication and synchronization overheads, which severely undermine the parallelization benefits.

This book develops hardware and software solutions to achieve faster and more energy-efficient execution of the irregular DFGs of emerging AI and sparse algebra workloads. Section 1.3 introduces graphs and DFGs, and Sect. 1.4 further discusses the target workloads that can be modeled as irregular DFGs. Section 1.5 describes the major state-of-the-art work for efficient DFG execution and the open research questions that are still unaddressed. Finally, Sect. 1.6 discusses the book's major contributions and structure.

1.3 Introduction to Graphs

This section introduces graphs, their relevant types and properties, and compressed data structures that can be used to represent them. This is a basic introductory discussion that can be safely skipped by readers familiar with these concepts.

Graphs A graph is a pair G = (V, E), where V is the set of vertices or nodes, and E ⊆ V × V is the set of edges, each of which connects two nodes. A graph can be undirected or directed (see Fig. 1.4a and b). In undirected graphs, the edge (x, y) ∈ E is an unordered tuple, such that (x, y) and (y, x) represent the same edge. In contrast, in directed graphs, the edge (x, y) is an ordered tuple in which node x is called the source or tail of the edge and node y is called the destination or head of the edge. Furthermore, x is called a predecessor of y, and y a successor of x. All the edges that have a node x as the destination are called the incoming edges of x, and the edges that have x as the source are called the outgoing edges of node x.

Paths and Cycles A path is defined as an alternating sequence of nodes and edges v1, e1, v2, e2, ..., ek-1, vk, in which all the edges ei are distinct, ei = (vi, vi+1), and all the nodes are distinct, except possibly the first and the last one. A cycle is a path in which the first and the last nodes, v1 and vk, are the same. Examples of cycles are the paths 1-4-2 in Fig. 1.4a and 1->4->2 in Fig. 1.4b. A directed graph without a cycle is called a directed acyclic graph (DAG). Figure 1.4c is a DAG.
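As a minimal sketch (not from the book), the snippet below stores a small directed graph as an edge list and derives the successor and predecessor sets of each node; apart from the 1->4->2 cycle mentioned for Fig. 1.4b, the edges are made up for illustration.

```python
# Hypothetical directed graph as an edge list; it contains the cycle 1->4->2->1.
edges = [(1, 4), (4, 2), (2, 1), (2, 3), (3, 5), (5, 6)]

successors = {}    # node -> destinations of its outgoing edges
predecessors = {}  # node -> sources of its incoming edges
for src, dst in edges:
    successors.setdefault(src, set()).add(dst)
    predecessors.setdefault(dst, set()).add(src)

print(successors.get(2, set()))    # {1, 3}: successors of node 2
print(predecessors.get(2, set()))  # {4}: predecessor of node 2
```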

Graphs and Matrices A graph G = (V, E) can be represented as an adjacency matrix A of dimension |V| × |V|, in which the elements aij = 1 if (vi, vj) ∈ E and 0 otherwise. Figure 1.4c shows adjacency matrices for a DAG. Note that an adjacency matrix is not unique for a graph; a different ordering of the nodes produces a different matrix. An adjacency matrix of an undirected graph is symmetric, while it is asymmetric for a DAG. In fact, for a DAG, there exists at least one triangular adjacency matrix, and conversely, for any given triangular matrix, there exists a DAG such that the matrix is its adjacency matrix [163]. An upper (lower) triangular adjacency matrix of a DAG can be constructed by ordering the nodes through topological sorting [163], in which a node vi is before (or after) a node vj if (vi, vj) ∈ E. Figure 1.4c shows how DAG nodes can be renamed to obtain a triangular adjacency matrix.

Fig. 1.4 Different types of graphs are shown in (a), (b), and (c). (d) shows how a graph can be represented as an adjacency matrix, and how a directed acyclic graph can be renamed to get a triangular adjacency matrix. For real-world graphs, the adjacency matrix is typically sparse, which can be stored in a compressed form as shown in (e)
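The following sketch (a minimal illustration, not the book's code) makes this link concrete: it topologically sorts a small DAG with Kahn's algorithm and checks that, after renaming the nodes by their position in that order, every edge points from a lower-numbered node to a higher-numbered one, which is exactly the condition for an upper-triangular adjacency matrix. The six-node edge list matches the renamed example of Fig. 1.4 as far as it can be reconstructed here.

```python
from collections import deque

# DAG edges consistent with the renamed six-node example of Fig. 1.4 (nodes 1..6).
edges = [(1, 2), (1, 3), (2, 3), (2, 5), (2, 6), (3, 4)]
nodes = range(1, 7)

# Kahn's algorithm: repeatedly emit a node whose predecessors have all been emitted.
indeg = {v: 0 for v in nodes}
succ = {v: [] for v in nodes}
for s, d in edges:
    succ[s].append(d)
    indeg[d] += 1

order, queue = [], deque(v for v in nodes if indeg[v] == 0)
while queue:
    v = queue.popleft()
    order.append(v)
    for w in succ[v]:
        indeg[w] -= 1
        if indeg[w] == 0:
            queue.append(w)
assert len(order) == len(nodes), "a cycle would leave some nodes unprocessed"

# Rename nodes by their position in the topological order: every edge now goes
# from a smaller index to a larger one, so the adjacency matrix is upper triangular.
rank = {v: i for i, v in enumerate(order)}
assert all(rank[s] < rank[d] for s, d in edges)
```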


Sparse Representation If a graph is sparsely connected, i.e., nodes are connected to only a few other nodes, the corresponding adjacency matrix will also be sparse. Such sparse matrices can be compactly stored in a compressed format like the compressed sparse row (CSR) or the compressed sparse column (CSC) representation [133]. These compressed formats aim to eliminate zeroes from the matrix, and store only the non-zero elements and their positions. Suppose an n × m matrix A consists of nnz non-zero elements. The CSR format represents A with three one-dimensional vectors: val, col_index, and row_offset (see Fig. 1.4d):

• The val and col_index vectors are of length nnz and store the non-zero values and the column indices of those values, respectively.
• The row_offset vector is of length n + 1 and stores the index in val and col_index where each row starts. As such, row_offset[i] stores the total number of non-zero elements prior to row i, and the total number of non-zero elements in row i is equal to row_offset[i + 1] - row_offset[i]. In other words, val[row_offset[i]] would be the first non-zero element of row i, if it has any, and col_index[row_offset[i]] would be its column index.

This way, the CSR format stores the non-zero elements of a row and their positions contiguously, but the elements of a column are scattered. In contrast, the CSC format stores the elements of a column contiguously, with vectors val, row_index, and col_offset. This distinction implies that for an adjacency matrix of a DAG, the CSR format stores the list of successor nodes of a given node contiguously, while the CSC format stores the list of predecessor nodes contiguously. For example, in the CSR format in Fig. 1.4d, the successor nodes {2,3} and {3,5,6} of nodes 1 and 2 are stored contiguously in the col_index.

Dataflow Graphs A dataflow graph (DFG) is a model of a program in which the directed edges represent the flow of data, and the nodes represent operations on the data flowing in through the incoming edges. This way, a DFG represents the data dependencies among operations as a graph. DFGs are devoid of conditional and control-flow operations, and only contain data operations. This book focuses on DFGs that are acyclic. Hence, the acronyms DFG and DAG are used interchangeably in the rest of the book. DFGs of real programs are typically sparsely connected, allowing them to be represented compactly as a sparse adjacency matrix stored in the CSR or CSC format.

Execution of a DFG Executing the DFG of a program entails evaluating the operations in all the nodes in a valid order. The directed edges represent data dependencies, which impose an execution ordering on the nodes. As such, for every edge, the source node should be executed before the destination node. Put differently, a node can only be executed after all the predecessor nodes connected to its incoming edges have finished execution.


At a given moment, all the nodes whose predecessor nodes have finished execution can be scheduled. These are called the active nodes. The execution begins with the source nodes of the DFG being active, as they do not have any incoming edges. Subsequently, other nodes become active as the execution progresses. The active nodes can be executed in parallel, as there are no dependencies among active nodes by definition (otherwise, they would not have been active together).
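The bookkeeping behind active nodes can be sketched with per-node counters of unfinished predecessors (essentially Kahn's algorithm); the function below is an illustration with assumed array names, not code from the book:

#include <stdlib.h>

/* row_offset/col_index store the successors of every node in CSR form;
   pred_count[v] initially holds the number of predecessors of node v.     */
void execute_in_dependency_order(int n, const int *row_offset,
                                 const int *col_index, int *pred_count,
                                 void (*execute_node)(int)) {
    int *active = malloc(n * sizeof(int));
    int head = 0, tail = 0;
    for (int v = 0; v < n; v++)
        if (pred_count[v] == 0) active[tail++] = v;  /* source nodes start active */

    while (head < tail) {
        int v = active[head++];
        execute_node(v);             /* all currently active nodes are independent
                                        and could also be executed in parallel     */
        for (int j = row_offset[v]; j < row_offset[v + 1]; j++) {
            int s = col_index[j];                    /* successor of v             */
            if (--pred_count[s] == 0) active[tail++] = s;  /* s becomes active     */
        }
    }
    free(active);
}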

1.4 Target Workloads

Equipped with an understanding of DFGs, we can now discuss the target workloads of this book: PC and SpTRSV. We chose these as our targets because:
• PC is a highly promising emerging AI model that can address the limitations of existing techniques, and SpTRSV is a ubiquitous operation required in many domains, including energy-constrained applications.
• The existing computing platforms do not execute these workloads efficiently, as evident from the throughput shown earlier in Fig. 1.3.

1.4.1 Probabilistic Circuit (PC)

A PC represents the probability distribution over a set of random variables (say, A, B, ..., X) in a way that makes complex probabilistic queries tractable, like the marginal probability, Pr(A = a, B = b); the conditional probability, Pr(A = a | B = b); the maximum a posteriori, argmax Pr(A = a | B = b); etc. PC is a unifying umbrella term used for different types of tractable probabilistic models, each having different properties, like sum-product networks (SPN), probabilistic sentential decision diagrams (PSDD), arithmetic circuits (AC), etc. (more details in [24]). PCs are increasingly being used for energy-constrained applications like robotic navigation [175], human activity recognition [53, 54, 111], and robust and interpretable image classification [87, 146]. The following are some of the exciting capabilities demonstrated by PCs:

• Interpretable Inference: DNNs are considered black boxes because it is not apparent which input features/variables contributed to the output decision or classification result. Such interpretability is required for safety-critical applications like medical diagnosis and autonomous vehicles. For example, a doctor can fully trust an automated cancer diagnosis only if the system can highlight the part of the scan that looks/does not look malignant, instead of a black-box positive/negative classification. To this end, PCs are amenable to such interpretability because of their capability of answering complex probabilistic queries [87, 104, 168].


• Handling Missing Inputs: In some scenarios, some of the inputs of the system may not be valid or available. For example, consider a sensing application like human activity detection based on sensory inputs from an accelerometer, gyroscope, microphone, etc. Some of these sensors may become unavailable due to energy-saving requirements. In such a scenario, a model trained for all the sensory inputs should still be capable of performing meaningful inferences with missing inputs. PCs are capable of such robust inference as they can tractably marginalize over the missing inputs, as demonstrated in [53, 83, 112].

• Performing Causal Inference: The Book of Why [120] argues that the ability to perform causal inference (identifying causal relations) is the key to understanding the world and demonstrating intelligent behavior. The usefulness of causal inference is evident in some fields; e.g., medical diagnosis requires understanding which microbes/genetic factors cause which diseases, and which symptoms these diseases cause. Interestingly, DNNs do not explicitly try to model or learn the causality of a system, but researchers are starting to see the benefits of augmenting DNNs with causality in practice [19, 166]. Works in [36, 97] have shown that causal models can be converted to PCs for tractable causal inference.

Because of these exciting emerging use cases, our work aims to develop a suitable hardware-software stack for PC to ensure that it does not lose the hardware lottery.

Fig. 1.5 A sample Bayesian network

Computational Kernel
Computationally, a PC is a directed acyclic DFG in which nodes represent either sum or product operations on the node's inputs.^1 Figure 1.5 shows a Bayesian network and Fig. 1.6 shows a PC compiled using UCLA's ACE compiler [37]. PCs have desirable structural properties that enable the computation of probabilistic queries in linear time in terms of the size of the DAG [24, 38]. But these properties do not make them suitable for execution on parallel hardware architectures.

^1 The sum operations are replaced with max operations for some types of probabilistic queries.


Fig. 1.6 The PC compiled from the Bayesian network in Fig. 1.5 for tractable inference

For example, these properties do not enforce any repetitive patterns in the structure that can be utilized for parallel execution. As discussed earlier, computing a DFG requires iterating over the nodes in a topologically sorted order. The nodes of the DFG can be renamed in topologically sorted order (once at the beginning) to ease the computation. To further simplify the iteration, the renaming can be performed such that the input nodes of the DFG (the nodes without incoming edges) get the 0th to (start-1)th indices, and the internal nodes get the startth to endth indices in topological order. Suppose the vector x stores the value of every node of the DFG. The 0th to (start-1)th indices of x correspond to the input nodes and can be initialized with the input values. The rest of the nodes of the DFG can be evaluated by iterating over the nodes from the start to end indices and performing sum or product operations (depending on the type of the node) on the values of the predecessor nodes. The list of the predecessor nodes can be obtained by storing the adjacency matrix of the DFG structure in the CSC format (as discussed in Sect. 1.3). Figure 1.7 shows a for loop that computes the startth to endth elements of the vector x. The final output node of the DFG corresponds to the last element of x, which is the output of the PC. As seen on lines 12 and 13, indirect accesses are performed to the x vector, which leads to irregular memory accesses.
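A minimal C sketch of such an evaluation loop is shown below; the node_type array and the sum/product selection are illustrative assumptions, so the exact code of Fig. 1.7 may differ in details:

enum { SUM_NODE, PROD_NODE };

/* The PC structure is stored as an adjacency matrix in the CSC format
   (val, row_index, col_offset); x[0 : start-1] is initialized with the
   inputs of the PC, and x[start : end] stores the output of every node.  */
void evaluate_pc(double *x, const int *node_type,
                 const int *col_offset, const int *row_index,
                 int start, int end) {
    /* iterate over all the nodes in topological order */
    for (int i = start; i <= end; i++) {
        double acc = (node_type[i] == SUM_NODE) ? 0.0 : 1.0; /* identity element */
        for (int j = col_offset[i]; j < col_offset[i + 1]; j++) {
            double pred = x[row_index[j]];          /* indirect access into x     */
            acc = (node_type[i] == SUM_NODE) ? acc + pred : acc * pred;
        }
        x[i] = acc;                                 /* output of node i           */
    }
}

The indirect accesses x[row_index[j]] are exactly the irregular memory accesses that the later chapters target.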

Table 2.2 Optimal fixed-point and floating-point representations that meet the required error tolerance. The selected representation is in bold. Measured error and energy for the selected representation. I, F, E, and M stand for the number of integer, fraction, exponent, and mantissa bits

PC     | Type of query | Fixed-pt I, F (Energy^a in nJ) | Fl-pt E, M (Energy^a in nJ) | Max error observed on test set | Post-syn. energy (nJ) | Energy of 32b Fl-pt (E = 8, M = 23)
HAR    | Marg. prob.   | 1, >64 (−)   | 9, 14 (6.7)  | 5.9 × 10^−4 | 5.3  | 10.8
HAR    | Marg. prob.   | 1, >64 (−)   | 9, 14 (6.7)  | 1.0 × 10^−3 | 7.2  |
HAR    | Cond. prob.   | 1, >64 (−)   | 9, 14 (6.7)  | 2.6 × 10^−4 | 7.2  |
HAR    | Cond. prob.   | 1, >64 (−)   | 9, 14 (6.7)  | 1.0 × 10^−3 | 7.2  |
UNIMIB | Marg. prob.   | 1, 13 (0.4)  | 7, 12 (0.6)  | 4.9 × 10^−4 | 0.34 | 0.89
UNIMIB | Cond. prob.   | 1, >64 (−)   | 7, 12 (0.6)  | 1.1 × 10^−3 | 0.44 |
UIWADS | Marg. prob.   | 1, 11 (0.06) | 6, 10 (0.09) | 1.3 × 10^−3 | 0.06 | 0.18
UIWADS | Marg. prob.   | 1, 47 (1.3)  | 6, 10 (0.09) | 1.2 × 10^−3 | 0.08 |
Alarm  | Marg. prob.   | 1, 14 (2.2)  | 8, 13 (3.2)  | 2.2 × 10^−4 | 2.43 | 5.37
Alarm  | Cond. prob.   | 1, >64 (−)   | 8, 13 (3.2)  | 2.8 × 10^−4 | 3.18 |

^a Predicted consumptions based on the energy models



The PCs used in this section are compiled using the ACE tool [37], with the cd06 and -forceC2d options enabled. For the experiments on HAR, UNIMIB, and UIWADS, we trained a naive Bayes classifier on 60% of the data and used the rest for testing. The testing dataset for Alarm is generated by sampling 1000 instances from the trained network. In all the experiments, the input nodes of the BN were used as the evidence nodes e and one of the root nodes in the BN (the class node in the case of the classifiers) as the query node q.

Validation of Bounds
This experiment validates the developed error bounds, using the PC compiled from the Alarm network. The experimental setting is as follows:

Fixed Point  The number of integer bits is set to 1 based on the max-min analysis, and the fraction bits are varied from 8 to 40.

Floating Point  The number of exponent bits is set to 8 based on the max-min analysis, and the mantissa bits are varied from 8 to 40.

Figures 2.5 and 2.6 show the max and mean error with fixed and floating point, respectively, on the test set for the PC compiled from the Alarm network, confirming the validity of the bounds.

Overall Performance

Fig. 2.5 Analytical error bounds and the observed absolute error over a test set for varying fraction bits F, validating the correctness of the fixed-point bounds

Fig. 2.6 Analytical error bounds and the observed relative error over a test set for varying mantissa bits M, validating the correctness of the floating-point bounds

In this experiment, the complete PROBLP framework is deployed to choose an appropriate data representation and generate hardware for different PCs and for given user requirements. The results of the experiment are summarized in Table 2.2. Experiments are performed for all the combinations of queries and types of error tolerances for the HAR PC, and for two combinations for the rest of the PCs. The table shows the optimal fixed-point and floating-point representations that meet the target error tolerance. Among these, PROBLP selects the one with the lower predicted

energy, highlighted in bold. The resulting maximum error observed on the test sets remains within the required error tolerance. The post-synthesis energy consumption matches well with the energy predicted by the framework. The energy consumption of hardware with a standard single-precision 32b floating-point representation (E = 8, M = 23, 1 sign bit) is also shown for comparison. Note that the choice of 0.01 error tolerance is arbitrary, and higher energy efficiency can be achieved for relaxed error tolerances. In summary, the results highlight that:

• A custom application-specific representation can lead to significant energy savings (up to 67%) compared to the standard 32b floating point.
• The maximum error observed on the test set remains within the required error tolerance, validating the error-bound models.
• The fixed-point representation often requires more than 64 bits for conditional queries and relative error requirements. As such, a processor designed for all-purpose probabilistic inference requires at least a floating-point representation.

2.3 Beyond Fixed and Floating Point: Posit Representation

The experiments with PROBLP show that the fixed-point representation is not suitable for all types of probabilistic queries and error tolerances. The main reason is the narrower range of numbers that a fixed-point format can represent, compared to a floating-point format with the same number of bits. For example, a 16b fixed-point format with (I, F) = (1, 15) underflows below 2^−15, while a 16b floating-point format with (E, M) = (6, 10) underflows below a significantly smaller value of 2^−31. This limited range prevents fixed point from representing the small probabilities that commonly occur in probabilistic inference for real-world applications. Due to this need to represent small numbers, probabilistic machine learning practitioners normally use double-precision 64b floating point or even a logarithmic representation [160].


Fig. 2.7 The posit representation

Trade-off Between Accuracy and Range  A floating-point format has to use more exponent bits to increase the range of representation. However, this comes at the expense of the fraction (mantissa) bits for a given total number of bits. Fewer fraction bits in turn decrease the accuracy of representation, as we saw in Eq. (2.9) in the previous section. The number of exponent and fraction bits is a design parameter and remains fixed irrespective of the magnitude of the number being represented. This limitation is addressed by a novel representation called posit [64], shown in Fig. 2.7a. Posit uses a regime field in addition to the exponent and fraction (mantissa) fields, whose length dynamically varies depending on the magnitude of the number. A posit format is described by a tuple (L, es), where L is the total length and es is the maximum length of the exponent field. The value of a posit-encoded number is given as follows:

f̃ = 1.fraction × 2^{exponent} × 2^{regime × 2^{es}}        (2.23)

Fig. 2.8 Accuracy impact of different representations. Custom posit performs better due to lower error for a wider range of values. (a) Exponent length for different formats. (b) Error in representations @32b. (c) Probabilistic circuit (PC). (d) Sp. Triangular Solve (SpTRSV)

Instead of under-/overflowing, the regime's length increases (at runtime instead of design time) by taking away bits from the other fields. Since the length is not fixed, the regime field is encoded as a continuous stream of 0s/1s followed by a 1/0 (see Fig. 2.7b for the regime encoding). Figure 2.7c shows how the length of the regime increases as the number becomes bigger/smaller. The minimum and maximum encodable values are shown in Fig. 2.7d.

Custom Posit  The standard posit representation proposed by Gustafson [64] aims to have higher accuracy than the floating point around the value of 1.0 while sacrificing range (Fig. 2.8b). While this design choice might be suitable for some applications, PCs require a large range of representation, as evident from the experiments with PROBLP. To this end, we propose a custom posit representation specialized for PCs, in which a longer exponent field es is used (Fig. 2.8a), such that the custom posit has a precision similar to floating point around 1.0, but a more gradual precision degradation for small/large values, greatly increasing the representable range. Figure 2.8b demonstrates this gradual, tapered degradation of precision.
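As an illustration of Eq. (2.23), the following C routine decodes a positive posit value; the function and its simplifications (sign, zero, and NaR handling omitted) are assumptions made for clarity, not the DPU's posit unit of Chap. 4:

#include <stdint.h>
#include <math.h>

/* Decode a positive L-bit posit (right-aligned in 'bits') with a maximum
   exponent-field length of 'es', following Eq. (2.23).                     */
double posit_decode(uint32_t bits, int L, int es) {
    int i = L - 2;                                  /* skip the sign bit          */
    int first = (bits >> i) & 1;                    /* regime: run of equal bits  */
    int run = 0;
    while (i >= 0 && (int)((bits >> i) & 1) == first) { run++; i--; }
    if (i >= 0) i--;                                /* skip the terminating bit   */
    int regime = first ? run - 1 : -run;

    int exponent = 0, e_bits = 0;                   /* up to 'es' exponent bits   */
    while (e_bits < es && i >= 0) {
        exponent = (exponent << 1) | ((bits >> i) & 1);
        e_bits++; i--;
    }
    exponent <<= (es - e_bits);                     /* missing exponent bits = 0  */

    double fraction = 1.0;                          /* hidden 1 + fraction bits   */
    for (double w = 0.5; i >= 0; i--, w *= 0.5)
        if ((bits >> i) & 1) fraction += w;

    return fraction * exp2((double)exponent) * exp2((double)regime * (1 << es));
}

For example, with (L, es) = (8, 2), the bit pattern 01110101 has regime 2, exponent 2, and fraction 1.5, decoding to 1.5 × 2^2 × 2^{2×4} = 1536.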


Probabilistic Inference Accuracy with Custom Posit  The analytical modeling of the error with posit operations is difficult because of the varying lengths of the fields. Instead, the suitability of posit is empirically tested by quantifying its impact on the accuracy of probabilistic inference. Figure 2.8c shows how the accuracy of PCs varies with the number of total bits in floating point, standard posit, and custom posit. The PCs are trained for classification on activity recognition datasets from the UCI repository [48]. Custom posit achieves similar accuracies with 32, 16, and 8b, in contrast to the others. Hence, the custom posit is more suitable for probabilistic inference tasks compared to floating point, due to the wider range of representation.

SpTRSV Accuracy with Custom Posit  This book also focuses on DAGs from SpTRSV applications. As it turns out, the properties of posit are also suitable for SpTRSV evaluations. Figure 2.8d shows the mean relative error in an iterative evaluation of the SpTRSV operation on matrices from the SuiteSparse matrix collection [42].^5 In summary, these results show that the custom posit representation is a suitable candidate for a specialized DAG processor that targets these applications. The hardware impact of the varying length of the fields in custom posit is studied in Chap. 4.

2.4 Conclusions

This chapter presents an analytical framework called PROBLP that can identify the most energy-efficient data representation for a given PC, the type of probabilistic query to be performed, and the error tolerance of the application. This is achieved by developing error bounds for arithmetic operations (addition and multiplication) that utilize the properties of PCs. Based on the selected data representation, PROBLP generates a fully pipelined spatial hardware description of the PC. The experiments show that the chosen data representation achieves up to 67% energy savings compared to the 32b single-precision floating-point format. The PROBLP error models also show that the fixed-point format is unsuitable due to its limited range, as probabilities can become very small. Based on this learning, we designed a customized posit representation with a significantly larger range than floating point for the same number of bits. Experiments confirm that an 8b custom posit can achieve the same accuracy as the 32b floating-point format for probabilistic inference tasks. Moreover, it also achieves better accuracy for SpTRSV operations, demonstrating its general-purpose utility. The customized posit format is used later in the DPU, the dedicated processor designed for PC and SpTRSV.

^5 The details of the matrices are described later in Table 4.6 in Chap. 4.

Chapter 3

GRAPHOPT: Constrained-Optimization-Based Parallelization of Irregular Workloads for Multicore Processors

The previous chapter improved energy efficiency by optimizing the data representation without sacrificing application requirements. This chapter improves the execution throughput by exploiting the parallelism of the workloads. The DAGs^1 of PC and SpTRSV offer the opportunity of parallel execution, such that different nodes can be executed simultaneously on parallel CPU or GPU threads. However, the nodes in these DAGs are fine-grained, as they represent only a few scalar operations whose computation cost cannot amortize the synchronization and task-management overheads when each node is modeled as an individual task. As such, these DAGs cannot be accelerated by simply modeling them as task graphs in TensorFlow [1], Intel TBB [131], etc. Their acceleration needs the creation of coarser partitions by combining the original fine-grained nodes, to increase the computation-to-synchronization/communication ratio. But if the partitions become too coarse, they hurt parallelism, as there might not be enough partitions available to execute in parallel. Hence, appropriate granularity and parallelism of the partitions are critical for good acceleration. To this end, this chapter proposes GRAPHOPT^2, a partitioner to efficiently parallelize fine-grained DAGs through hardware-aware partitioning. The key contributions of the chapter are as follows:

• GRAPHOPT models the graph partitioning for parallel execution as a constrained-optimization problem, and leverages a state-of-the-art optimization solver.
• Several scalability techniques are proposed to handle real-world graphs with millions of nodes and edges.
• The performance of GRAPHOPT is validated for SpTRSV and PC, and compared against standard libraries.

^1 Note that DAG and DFG are used interchangeably in the rest of the book.
^2 Available at https://github.com/nimish15shah/GRAPHOPT.



Fig. 3.1 An example of the need for synchronization: nodes scheduled on two asynchronous threads (e.g., CPU and GPU threads) require a synchronization barrier before a dependent node can be executed

The chapter is organized as follows. Section 3.1 defines the problem of parallelizing fine-grained DAGs and GRAPHOPT's approach. Section 3.2 details the working of GRAPHOPT, followed by extensive performance benchmarking in Sect. 3.3. Finally, Sect. 3.4 discusses related work and Sect. 3.5 concludes the chapter. This chapter is based on its related paper: N. Shah, W. Meert, and M. Verhelst (2022). "GRAPHOPT: constrained-optimization-based parallelization of irregular graphs." In: IEEE Transactions on Parallel and Distributed Systems [141].

3.1 Graph Partitioning for Parallelization

Modern computing platforms like multicore CPUs and GPUs are equipped with parallel hardware threads, which can execute DAG nodes in parallel. However, the nodes cannot be arbitrarily scheduled because of the data dependencies represented by the DAG edges, i.e., a node can only be executed after all its predecessor nodes have finished their execution. This data dependency demands thread synchronization when a predecessor node is mapped to a different thread than the current one. This is due to the asynchronous execution behavior of CPU and GPU threads; a stalled thread does not stall the others. Consider a DAG with three nodes as shown in Fig. 3.1. Suppose nodes A and B are scheduled on different threads. To schedule node C on thread 1, it must be ensured that node B has finished its execution on thread 2 and that its results are visible to thread 1. But, by definition, asynchronous threads do not provide such a guarantee. Hence, an explicit synchronization barrier is needed before scheduling node C. Such synchronizations can be managed at runtime by a task-scheduling framework like Intel TBB [131], etc. However, the fine granularity of the target DAG nodes renders these frameworks ineffective because the nodes do not have enough computation to amortize the task-management overhead. This can be addressed by increasing the computational granularity, which can be done by clustering the nodes into coarser partitions that are executed as monolithic tasks to amortize the overheads. However, the quality and correctness of these coarser partitions depend on several aspects, as discussed next (with illustrations in Fig. 3.2).


Fig. 3.2 Different ways to partition a DAG, and impact on parallelization, communication, and workload balance

Data Dependencies and Acyclic Partitions  The data dependencies of the DAG edges imply that a partition can be launched as a monolithic unit only if all its predecessor partitions have finished execution. Furthermore, the edges between partitions should not create cycles, because cyclic partitions need intermittent synchronization to resolve intermediate data dependencies. The cycles can be prevented by generating convex partitions, i.e., if two nodes are in a partition, the nodes on all the (directed) paths between the two nodes also have to be in the partition. An example of cyclic partitions is shown in Fig. 3.2d. GRAPHOPT avoids these cycles by enforcing an acyclic constraint.

Granularity and Parallelism  The sizes of the partitions dictate how often the threads need to synchronize and communicate. To reduce the overall runtime, there are multiple conflicting requirements:


• Partitions should be as large as possible to amortize the overheads and increase the local data reuse, minimizing communication across threads. For example, increasing the granularity in Fig. 3.2c, e helps.
• There should be enough parallel partitions to keep all the available threads in the underlying hardware busy. The partitions in Fig. 3.2a, b do not have any parallelism across threads at all.
• Partitions that are supposed to execute in parallel should be of similar size to balance the workload across threads. Figure 3.2c shows a counterexample.

These requirements suggest that the partition granularity depends on the parallelism of the underlying hardware, like the number of CPU threads, which is fixed and known beforehand. It also depends on the available parallelism in the DAG, which can vary in different parts of the DAG. Hence, the partition granularity cannot be manually fixed but has to be automatically adjusted depending on the DAG structure. GRAPHOPT achieves a balance by generating partitions that are as large as possible, as long as there are enough parallel partitions.

Communication  An edge from one partition to another indicates communication of data. If the two partitions are executed on the same thread, the data can be locally reused via local caches or scratchpads, whereas communication across threads incurs the higher overhead of flushing data to the shared caches. This communication overhead can be reduced by keeping the edges within the same thread as much as possible. Figure 3.2e shows an example where partitions can simply be remapped to different threads to avoid inter-thread communication.

GRAPHOPT's Approach  To appropriately handle the various trade-offs, instead of creating arbitrary partitions, GRAPHOPT creates a layered graph of coarse partitions as shown in Fig. 3.3. To avoid confusion, a layer of coarse partitions is always referred to as a super layer in the rest of the chapter, and a layer simply means a layer of nodes in the original fine-grained DAG. Every super layer has P independent partitions, where the user can specify an arbitrary P based on the target hardware (e.g., P can be set to 8 for a CPU with 8 single-threaded cores). This ensures that, whenever possible, all the P hardware threads have a corresponding partition to execute. The partitions in a super layer are independent, i.e., there is no edge crossing among them, enabling parallel execution. The parallel threads need to be synchronized after every super layer to allow race-free data communication required by the blue edges. GRAPHOPT models this multi-objective problem of DAG partitioning to create super layers as a constrained-optimization problem, as explained in the next section.
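The resulting execution model can be illustrated with a small OpenMP sketch, where each thread runs its partition of a super layer and a barrier separates consecutive super layers (execute_partition() and the loop structure are assumptions for illustration, not GRAPHOPT's generated code):

#include <omp.h>

/* Assumed helper: executes all the nodes of the partition that super
   layer 'sl' assigns to hardware thread 'tid'.                          */
extern void execute_partition(int sl, int tid);

void run_super_layers(int num_super_layers, int P) {
    #pragma omp parallel num_threads(P)
    {
        int tid = omp_get_thread_num();
        for (int sl = 0; sl < num_super_layers; sl++) {
            execute_partition(sl, tid);  /* partitions within a super layer are
                                            independent, so no locking is needed */
            #pragma omp barrier          /* synchronize after every super layer  */
        }
    }
}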

3.2 GRAPHOPT

Fig. 3.3 Super layers. GRAPHOPT decomposes a fine-grained DAG into super layers, each having P partitions. The partitions are made as large as possible but also of similar size to ensure workload balancing

This section explains the generation of super layers with P parallel, balanced, coarse partitions from a fine-grained DAG. The optimal graph partitioning, even

without the acyclic constraint and parallelism requirement, is known to be NP-complete [50, 80]. Recent work [96] has shown that graph partitioning with the acyclic constraint is also NP-complete. Section 3.4 explains why general graph-partitioning approaches cannot be used here because of the parallelism requirement (P independent partitions) and the acyclic constraint. These general approaches only focus on minimizing edges among (possibly cyclic) partitions, and do not aim to generate layered graphs with P parallel, workload-balanced partitions needed to keep parallel threads busy. Figure 3.4 shows GRAPHOPT's approach for generating partitioned super layers. To reduce the complexity, instead of finding all the super layers simultaneously, the tool iteratively constructs one super layer at a time, going from the bottom to the top. In an iteration, a super layer is generated with two main steps: (M1) recursive two-way partitioning and (M2) workload balancing. S1, S2, and S3 are scalability techniques that enable GRAPHOPT to handle large graphs with millions of nodes/edges. An iteration starts by selecting a part of the DAG that is considered for generating the current super layer. Ideally, all the currently unmapped nodes should be considered, but to limit the complexity, step S1 selects a subgraph G from the unmapped DAG. The M1 step splits this subgraph G into P parallel partitions and associates them with the P underlying hardware threads. These partitions, however, could potentially be of different sizes. The M2 step redistributes nodes among these


Fig. 3.4 GRAPHOPT overview. One super layer, starting from the bottom, is generated in every iteration. The main steps M1 and M2 use an optimization model with the Google OR-Tool solver [122] to find good partitions. The scalability steps S1, S2, and S3 are used to handle graphs with millions of nodes/edges

partitions for workload balancing, generating P balanced partitions for the current super layer. The unmapped nodes that could not be mapped to any partition are then considered for the subsequent super layers.

3.2.1 Recursive Two-Way Partitioning (M1)

Fig. 3.5 The recursive two-way partitioning (M1) uses a Minizinc-based optimization model to partition subgraphs until getting P partitions. Due to the recursive approach, the partitions can be imbalanced. The workload balancing (M2) iteratively redistributes the nodes, using the same Minizinc-based model, to generate a balanced super layer

Given a subgraph G, a super layer with parallel partitions should ideally be constructed by a direct P-way partitioning. GRAPHOPT relaxes this by using recursive two-way partitioning, i.e., recursively splitting subgraphs into two parallel smaller subgraphs until getting P subgraphs (partitions) for P threads, as illustrated in Fig. 3.5 (top). The recursion starts with the aim of mapping the input subgraph G to P threads. The first two-way partitioning generates two output partitions, one of which represents the nodes corresponding to threads t1, ..., tP/2, and the other corresponding to threads tP/2+1, ..., tP. The third output is a set of unmapped nodes that cannot be mapped to either partition without adding an edge between the partitions. These unmapped nodes will be considered for the subsequent super layers. The next recursion splits the first partition (which becomes the current G) into two smaller partitions, one for threads t1, ..., tP/4 and the other for threads tP/4+1, ..., tP/2. This repeats until partitions for individual threads are determined. Thus, the problem of generating one super layer is reduced to multiple iterations of two-way partitioning under given constraints and objectives. Note that if P is not a power of 2, the partitions can end up being imbalanced, which is addressed with the M2 step. Instead of developing a custom heuristic algorithm, GRAPHOPT models the two-way partitioning problem with the Minizinc constrained-optimization language [101]. Such a model can be solved with state-of-the-art constraint programming


(CP) or mixed-integer linear programming (MILP) solvers like Google OR-Tools [122], SCIP [55], Gurobi [63], etc. These solvers are designed and tuned with decades of research and often outperform custom heuristics for smaller problem instances, but struggle to scale to larger problems. This is addressed in this work by developing the scalability techniques explained later in Sect. 3.2.3. An optimization model consists of four parts: (1) the inputs of the problem, (2) the decision variables whose values need to be determined by the solver, (3) the constraints on the decision variables, and (4) the objective of the optimization (refer to [105] for more details on constrained optimization). The rest of the section describes these four parts.

Optimization Model for the Two-Way Partitioning

Given an input graph, the two-way partitioning aims to allocate the nodes to two parallel partitions such that there is no edge crossing from one to the other. Furthermore, the sizes of the partitions should be as large and as equal as possible. If the targets for the current recursion are threads t1, t2, ..., tx, the first output partition corresponds to threads t1, ..., tx/2 and the other to threads tx/2+1, ..., tx. The inter-thread communication (blue edges in Fig. 3.3) would reduce if the sources of the incoming edges of the partitions are from the same group of threads. Hence, one of the aims of the two-way partitioning is also to perform the allocation such that the nodes in the first partition mostly have incoming edges from threads t1, ..., tx/2, and the second partition from threads tx/2+1, ..., tx. Table 3.1 shows the different parts of the optimization model. The current input DAG is denoted by G(V, E), where V is the set of nodes (vertices) and E is the set of directed edges. Note that this is not the complete original DAG, but the input subgraph for the current recursion of the two-way partitioning. The output partitions of G are decided by the decision variable PART(V), an array of integers, one for every node. If PART[v] = 1 or 2, the node v is allocated to partition 1 or 2, respectively, and if PART[v] = 0, it is not allocated. The constraints and objectives are discussed next.

Notations: ∀ for all, ∈ is a member of, | such that, ∨ logical or, ∧ logical and, ⌊ ⌋ floor function.

Acyclic and Data-Dependency Constraint  There should not be any edge from one partition to the other, which implies that, for the destination and source nodes of every edge, PART[dst] should either be equal to PART[src] or should be 0. This also ensures that the destination node is unallocated if the source node is unallocated, which is needed because the edges represent data dependencies. This constraint is modeled as follows:

∀(src, dst) ∈ E,   PART[dst] = PART[src] ∨ PART[dst] = 0        (3.1)


Table 3.1 Optimization model. Different parts of the Minizinc two-way partitioning optimization model (Name: Description (Type))

Inputs:
• t1, ..., tx: Target threads for this recursion (array of int)
• V: Set of nodes in the current DAG G (set of int)
• E (V, V): Set of directed edges in G expressed as tuples of source and destination nodes (set of int tuples)
• node_w (V): Weights indicating the amount of computation within nodes (array of int)
• Vin: Set of nodes that are already allocated to previous super layers (set of int)
• Ein (Vin, V): Set of directed edges with source node in Vin and destination in V (set of int tuples)
• PARTin (Vin): Allocated partitions of nodes in Vin (array of int in [1, 2])

Decision variables (final output):
• PART (V): Allocated partitions for nodes in V, where 0 indicates no allocation (array of int in [0, 2])

Decision variables (intermediate):
• PART_1_size: Amount of computation allocated to partition 1 (int)
• PART_2_size: Amount of computation allocated to partition 2 (int)
• Ein_crossing (Ein): Indicates whether the edges in Ein are crossing partitions or not (array of bool)

Constraints:
• Acyclic parts: ∀(src, dst) ∈ E, PART[dst] = PART[src] ∨ PART[dst] = 0
• Partition size: ∀v ∈ V, PART_1_size = sum(node_w[v] | if PART[v] = 1), PART_2_size = sum(node_w[v] | if PART[v] = 2)
• Inter-thread comm.: ∀(src, dst) = e ∈ Ein, Ein_crossing[e] = ((PART[dst] ≠ 0) ∧ (PART[dst] ≠ PARTin[src]))

Objective:
• maximize ws × min(PART_1_size, PART_2_size) − wc × sum(Ein_crossing)

Partition Size Constraint  The input parameter node_w(V) (array of integers) represents the amount of computation to be performed in each node. For workload balancing, the amount of computation in the two partitions should be the same. To formulate such an optimization objective, the decision variables PART_1_size and PART_2_size are used, which are modeled with the following constraint:

∀v ∈ V,   PART_1_size = sum(node_w[v] | if PART[v] = 1),
          PART_2_size = sum(node_w[v] | if PART[v] = 2)        (3.2)


Inter-Thread Communication Constraint  The blue edges in Fig. 3.3 contribute to inter-thread communication. An edge becomes a blue edge if its source node is already allocated in a previous super layer to a thread ta, and the destination node is currently allocated to a partition that does not correspond to ta. Note that these edges are not in the current edge set E, because the edges in E have source and destination nodes that are being considered for the current super layer. Another input to the model, Ein, represents such edges, which have source nodes (represented by the set Vin) in the previously generated super layers, and destination nodes in the current G. The input PARTin(Vin) represents the partitions of the nodes in Vin based on the threads they are allocated to. If a node vin was allocated to any of the threads t1, ..., tx/2 in the previous super layers, PARTin[vin] = 1, and if allocated to any of the threads tx/2+1, ..., tx, PARTin[vin] = 2. If a vin was not allocated to any of the current target threads t1, ..., tx, its corresponding edges always result in inter-thread communication irrespective of how the current partitioning is done. Hence, such vin and the corresponding edges are not considered in the model. An edge in Ein is a crossing edge when the partitions of the destination and source nodes are different. The edge crossings are tracked with Ein_crossing (array of booleans), modeled with the following constraint:

∀(src, dst) = e ∈ Ein,   Ein_crossing[e] = ((PART[dst] ≠ 0) ∧ (PART[dst] ≠ PARTin[src]))        (3.3)

The first inequality makes sure that the edges with an unallocated destination are not marked as crossing edges.

Objective  The objectives of GRAPHOPT are maximizing but also equalizing the partition sizes, and minimizing the edge crossings. The optimization solvers generally support only one objective; hence, a weighted sum of the multiple objectives is used. The objective makes sure the smaller of the two partitions is made as big as possible:

maximize   ws × min(PART_1_size, PART_2_size) − wc × sum(Ein_crossing)        (3.4)

Typically, the global synchronization cost is significantly higher than the communication cost (i.e., accessing data from another core's cache); hence, the hyperparameter ws should be set higher than wc. In our experiments, ws is set to 10 × wc. The objective along with the constraints in Eqs. 3.1–3.4 form the Minizinc optimization model,^3 which is solved with the Google OR-Tools solver.

^3 The optimization model in the Minizinc language is available in Appendix A.


Fig. 3.6 Example. An optimal partitioning for a simple graph

Example  Figure 3.6 illustrates the optimal two-way partitioning of a simple G. The node set V is {1, ..., 9} and the edge set E is {(1,5), (2,5), ..., (8,9)}. The computation in every node, node_w, is assumed to be 1. Assume that the target threads are {t1, ..., t4}. The incoming edges Ein are {(10,1), (10,7), ..., (13,4)}, and suppose their source nodes Vin, {10, ..., 13}, were mapped to the threads {t2, t2, t4, t3}, respectively, in the previous super layers. Hence, PARTin[10] = PARTin[11] = 1, and PARTin[12] = PARTin[13] = 2. The variable PART[9] for the top node will remain 0; otherwise, due to the constraint in Eq. 3.1, all the nodes would end up in the same partition (which would be a valid but suboptimal solution). The PART values for the other nodes are decided such that the Ein_crossing (blue edges) are minimized. Switching the PART of nodes 1, 2, 5, 7 with 3, 4, 6, 8 would lead to more blue edges. Hence, the solution shown in the figure is the optimal solution. In one of the next recursions, G will contain nodes {1, 2, 5, 7} with the target threads {t1, t2}, and the other recursion will be with nodes {3, 4, 6, 8} and target threads {t3, t4}. Node 9 is an unmapped node, which will return to the set of remaining nodes to be used in the construction of the next super layer.


Algorithm 1: M2(Gfull, node_w, imbalanced_Dth, Dth, nThreads)

3.2.2 Workload Balancing (M2)

A penalty of using recursive two-way partitioning instead of a direct P-way partitioning is that the P partitions are not guaranteed to have the same size. The two-way partitioning attempts to equalize the partition sizes, but this does not guarantee equal parallelism. This imbalance can lead to divergences in partition sizes in subsequent recursions. Moreover, the imbalance can also occur when P is not a power of 2. To address this, a workload-balancing step (M2) is used as shown in Fig. 3.5 (bottom), which redistributes the nodes across imbalanced partitions. In every iteration, the partitions are sorted according to their sizes. The nodes of the largest and the smallest partitions are combined and two-way repartitioned (with the Minizinc model) in an attempt to redistribute the workload. This repeats until the size of the smallest partition no longer increases. This process is described in Algorithm 1. If the sizes are still unequal, the larger partitions are truncated in topological order to equalize the sizes (with some margin). The truncated nodes are added to the pool of unmapped nodes to be considered for the next super layer. The final output of this step is a super layer with balanced P partitions.
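The balancing loop can be sketched as follows; the partition-size bookkeeping and the repartition_pair() helper are assumptions for illustration, not the exact Algorithm 1:

/* size[p] holds the amount of computation currently assigned to partition p.
   repartition_pair() is an assumed helper that re-runs the two-way
   partitioning model on the union of two partitions and updates size[].     */
extern void repartition_pair(int a, int b, long *size);

void balance_partitions(int P, long *size) {
    long prev_smallest = -1;
    for (;;) {
        int smallest = 0, largest = 0;
        for (int p = 1; p < P; p++) {                     /* find the extremes      */
            if (size[p] < size[smallest]) smallest = p;
            if (size[p] > size[largest])  largest  = p;
        }
        if (size[smallest] <= prev_smallest) break;       /* smallest stopped growing */
        prev_smallest = size[smallest];
        repartition_pair(largest, smallest, size);        /* redistribute their nodes */
    }
}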


Fig. 3.7 The full flow with the S1, S2, and S3 scalability steps that enable GRAPHOPT to handle large graphs with millions of nodes/edges

3.2.3 Scale to Large Graphs (S1, S2, S3)

The two-way partitioning model can handle graphs with tens of thousands of nodes and edges in a reasonable time. Hence, the M1 and M2 steps explained earlier are sufficient to generate super layers for such graphs. However, graphs from real applications can have millions of nodes/edges, making the solver runtime prohibitively long. GRAPHOPT uses several techniques to handle the complexity of such large graphs, as shown in the full flow in Fig. 3.7 and described in Algorithm 2.

Algorithm 2: GRAPHOPT(Vfull, Efull, node_w, nThreads)

Consider Limited Layers (S1)

Ideally, all the unmapped nodes of the DAG should be considered for allocation in every iteration of super layer generation. However, during the partitioning for the initial bottom super layers, it is unlikely that the nodes from the top of the graph will be allocated. Hence, it is wasteful to consider the entire unmapped graph for the partitioning, and the complexity can be limited by choosing a subset G. With this aim, each node is assigned to a DAG layer before beginning the generation of super layers, using the "as-last-as-possible" (ALAP) heuristic, such that every node is in one layer below its lowest successor node (see Algorithm 3). This heuristic only needs a topological sort of the graph nodes, which runs in O(|V| + |E|) (i.e., linear in the size of G) time. In every iteration, GRAPHOPT adaptively considers a limited number of layers of the unmapped graph, chosen such that the input graph G for the M1 step has a size α (set to 4 in our experiments) times the size of the output super layer of the previous iteration (see Algorithm 4). This automatically chooses an appropriately sized G depending on the parallelism in the DAG. A high α leads to better super layer quality at the expense of partitioning time, and vice versa.
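A compact sketch of the ALAP layering is given below, assuming the nodes are already topologically numbered (every edge goes from a lower to a higher index) and the successors are stored CSR-style; it illustrates the heuristic, not the exact Algorithm 3:

#include <stdlib.h>

void assign_alap_layers(int n, const int *row_offset, const int *col_index,
                        int *layer /* out: layer per node, 0 = bottom */) {
    int *from_top = malloc(n * sizeof(int));
    int depth = 0;
    for (int i = n - 1; i >= 0; i--) {          /* reverse topological order      */
        int d = 0;                              /* sink nodes sit at the top      */
        for (int j = row_offset[i]; j < row_offset[i + 1]; j++) {
            int s = col_index[j];               /* successor of node i            */
            if (from_top[s] + 1 > d) d = from_top[s] + 1;
        }
        from_top[i] = d;
        if (d > depth) depth = d;
    }
    for (int i = 0; i < n; i++)                 /* renumber so that every node is */
        layer[i] = depth - from_top[i];         /* one layer below its lowest     */
    free(from_top);                             /* successor                      */
}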

Algorithm 3: GenDAGLayers(Gfull)

Algorithm 4: S1(G, node_w)

Independent Connected Components (S2)

During the two-way partitioning in M1 and M2, the current G to partition may not be a connected graph, but may instead consist of multiple disconnected components. This can simplify the partitioning because disconnected components, by definition, do not have edge crossings. Hence, each component is successively considered as the current G and partitioned with the solver independently. Suppose the target for the current recursion is Y threads; each component is partitioned with a target of X threads, chosen as

X = ⌈ Y × (size of the current component) / (size of all components) ⌉.

These independent connected components are discovered using a breadth-first search, running in O(|V| + |E|) (i.e., linear in the size of G) time.

Heuristic Coarsening (S3)

Even with the S1 and S2 techniques, a large graph may have to be two-way partitioned. If the graph is larger than a threshold thresh_G, a list-based heuristic coarsening is used before the partitioning. The graph nodes are sorted in a list according to the depth-first traversal order as shown in Fig. 3.8, which runs in O(|V| + |E|) (i.e., linear in the size of G) time. During the traversal, the differences in depth between subsequent nodes are also noted in a list. A third list indicates the outdegree of each node. The node list is then broken into clusters according to the following criteria:


Fig. 3.8 The data structures generated by depth-first traversal for the heuristic coarsening step (S3)

• The size of a cluster should be less than a size_threshold. In the example, cluster 1 is stopped at node 7 for a size_threshold of 4.
• The difference in depth among consecutive nodes should be less than a depth_threshold. For example, the difference in depth for the consecutive nodes 7 and 3 is high, so cluster 1 should be stopped at node 7 according to the depth difference as well.
• Clusters should be stopped at nodes with an outdegree higher than a degree_threshold. For example, cluster 2 is stopped at node 6 if the degree_threshold is 2.

A coarse graph is constructed with these clusters, which will be used as the current G for the two-way partitioning. Each cluster is represented as a single node in the coarse graph, with a node weight (amount of computation) equal to the sum of all the enclosed node weights. If the resulting graph is very coarse, then it may adversely affect the quality of the subsequent two-way partitioning. As such, the size threshold is chosen according to the current graph size, resulting in a coarse graph with around 1000 nodes (empirically found to be sufficient for good-quality two-way partitioning). The thresholds are as follows:

size_threshold   = (size of the current graph to coarsen) / 1000
depth_threshold  = log2(size_threshold)
degree_threshold = 10
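The clustering criteria above can be sketched as a single pass over the depth-first node list; the precomputed arrays (dfs_order, depth, outdeg) and the exact tie-breaking are assumptions for illustration, not the exact Algorithm 5:

#include <stdlib.h>

void coarsen_clusters(int n, const int *dfs_order, const int *depth,
                      const int *outdeg, int size_threshold,
                      int depth_threshold, int degree_threshold,
                      int *cluster_id /* out, indexed by node */) {
    int cluster = 0, cluster_size = 0;
    for (int k = 0; k < n; k++) {
        int v = dfs_order[k];
        cluster_id[v] = cluster;
        cluster_size++;
        int close_cluster = 0;
        if (cluster_size >= size_threshold) close_cluster = 1;   /* criterion 1 */
        if (k + 1 < n &&
            abs(depth[dfs_order[k + 1]] - depth[v]) >= depth_threshold)
            close_cluster = 1;                                   /* criterion 2 */
        if (outdeg[v] > degree_threshold) close_cluster = 1;     /* criterion 3 */
        if (close_cluster) { cluster++; cluster_size = 0; }
    }
}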


Algorithm 5: S3(G, node_w)

The algorithm of S3 is described in Algorithm 5. Furthermore, Algorithm 6 describes the complete recursive two-way partitioning (M1) equipped with the S2 and S3 scalability techniques. With these three techniques, GRAPHOPT is able to partition graphs with millions of nodes/edges in seconds.


Algorithm 6: M1_S2_S3(Va, Gfull, node_w, Dth, Y)

3.3 Performance Evaluation

3.3.1 Experimental Setup

Experiments are conducted on DAGs of SpTRSV and PC. The SpTRSV is benchmarked on matrices from the SuiteSparse matrix collection [42] of real-world applications like power network optimization, structural analysis, computational fluid dynamics, nonlinear optimization, economics, robotics, etc. The PCs are from a standard benchmark [88]. The super layers generated by GRAPHOPT are parallelized across multicore CPU threads using OpenMP. The


Fig. 3.9 Detailed analysis of super layers. Row (f) shows that DAGs can be partitioned into a few, large super layers. It also shows the size of the original DAG layers for comparison. Row (g) shows the workload balancing across threads, in the form of operations in different threads in super layers. Row (h) shows the throughput scaling with parallel threads, demonstrating the advantage of using super layers versus direct DAG layer partitioning. Rows (i) and (j) show the improvement in partitioning time due to scalability techniques S1, S2, and S3, while row (k) shows the degradation of performance due to these techniques

throughput results are averaged over 1000 iterations on up to 18 threads of an Intel® Xeon Gold 6154 CPU, with the GCC v4.8.5 compiler, the -march=native -Ofast flags, and the thread affinity set as KMP_AFFINITY=granularity=fine,compact,1,0. The experiments with 18 threads are adequate for the target workloads because the workloads have a mean parallelism (quantified as total nodes in a DAG / critical path length) of 8.6, and 95% of the DAGs achieve peak throughput with fewer than 18 threads. The caches are warmed up by executing the same program before the actual measurement.


3.3.2 Analysis of Super Layers

Several experiments are conducted on workloads of varying sizes to evaluate the different properties of super layers, as summarized in Fig. 3.9.

How Large Are the Super Layers?

GRAPHOPT combines nodes from multiple DAG layers to create large super layers. Row (f) in Fig. 3.9 shows the number of multiply-accumulate operations in the super layers compared to the original DAG layers. GRAPHOPT manages to compress thousands of DAG layers into tens of super layers, reducing the required synchronization barriers.

Workload Balancing

Large super layers reduce the number of synchronizations, but good parallel performance also demands that the workload is balanced across threads. Row (g) in Fig. 3.9 shows the number of operations in each super layer for P equal to 2 and 8 threads. As seen, GRAPHOPT balances the workload across the parallel partitions in a super layer as much as possible. Also note that the granularity of partitions varies across different super layers, depending on the available parallelism in the corresponding DAG region.

Throughput Scaling

Row (h) shows the scaling of throughput with parallel threads. Different matrices reach peak throughput with a different preferred number of threads, depending on the parallelism in the DAGs. For comparison, the throughput of direct "DAG layer partitioning" is shown in the orange curve. In this partitioning, the nodes of a single DAG layer are executed in parallel across threads, and the threads are synchronized after every DAG layer. The super layers achieve better performance because they need fewer synchronizations.

Impact of the Scalability Techniques

To see the impact of the S1–S3 scalability techniques on the partitioning time, the partitioning is performed with and without these techniques, with a timeout of 1 hour. Figure 3.9i, j shows that the tool times out for the larger graphs without these techniques. Figure 3.9k shows that there is a 22% drop in performance due to these techniques for one of the workloads.


Fig. 3.10 Sparse matrix triangular solve performance. GRAPHOPT achieves a mean speedup of 2.0, 3.3, 3.1, 5.6, 10.8, and 23.7 over SuiteSparse CXSparse, SuiteSparse UMFPACK, Intel MKL, P2P, DAG layer partitioning, and KokkosKernels, respectively

3.3.3 Comparison with State-of-the-Art Libraries

Sparse Matrix Triangular Solves

As shown in Fig. 3.10, the performance of state-of-the-art libraries is evaluated on the L triangular factor DAG of 370 matrices from the SuiteSparse matrix collection [42], with the selection criteria that the matrices should be real, square, non-singular, and with ≥10k non-zeroes. For smaller matrices, SuiteSparse CXSparse reports the highest performance. Overall, GRAPHOPT achieves mean speedups of 2.0, 3.3, 3.1, 5.6, 10.8, and 23.7 over SuiteSparse CXSparse, Intel MKL, SuiteSparse UMFPACK, P2P, DAG layer partitioning, and KokkosKernels, respectively. The main contributor to the speedup is fewer synchronization barriers (a 99% reduction compared to the DAG layer partitioning).

Probabilistic Circuits

Figure 3.11 shows the throughput for 16 PCs from the standard benchmark [88], achieved with the following implementations:
• This work: GRAPHOPT is used to generate super layers with P set to 2, 4, ..., 18 parallel threads.

Fig. 3.11 Probabilistic circuits performance. GRAPHOPT achieves a mean speedup of 1.8× and 1052× over DAG layer partitioning and Juice, respectively


• DAG layer partitioning: Same as in the sparse triangular solve experiment in Sect. 3.3.3.
• Juice: Performance of Juice, a widely used Julia-based library for PCs [35].

Results  The Juice library, despite using the layer-based partitioning scheme, achieves significantly lower throughput than the OpenMP-based DAG layer partitioning implementation. GRAPHOPT achieves a mean speedup of 1.8× and 1052× over DAG layer partitioning and Juice, respectively, due to 88.5% fewer synchronization barriers.

3.4 Discussion and Related Work

3.4.1 Sparse Triangular Solves

A common approach for parallelizing triangular solves is the DAG layer partitioning method, first introduced in [5], which directly partitions the nodes of each DAG layer into P partitions and synchronizes the threads after every layer. This is one of the baselines in the experiments section. This partitioning is quicker than our approach when the sparsity pattern/DAG structure changes for every triangular solve. However, for large matrices, thousands of synchronization barriers might be needed depending on the critical path length of the DAG, incurring a large overhead. As shown in our experiments, 99% of the barriers can be avoided by combining multiple DAG layers into super layers, improving the throughput significantly compared to the DAG layer partitioning (Fig. 3.10). The point-to-point (P2P) barrier approach alleviates some overhead of the global barrier as shown in Fig. 3.10, but still fails to outperform other libraries. The works in [123] and [71] are the most similar to GRAPHOPT, but could not be compared due to the unavailability of open-source code. These works have two key conceptual differences from our work:

1. Both papers limit the partition sizes. In [123], the partitions should fit in the local scratchpads of the GPU, while in [71], the size is controlled with a predefined hyperparameter that has to be tuned for every DAG. This hyperparameter remains the same for the entire DAG even though the local parallelism can vary in different parts of the DAG. This is in contrast with our approach of making the partitions as large as possible as long as there are P parallel partitions. In other words, by explicitly defining the hardware parallelism (via P), GRAPHOPT automatically adjusts the partition sizes based on the available parallelism in the DAG, reducing the overall number of partitions and global barriers. The partition size limit is also unnecessary because a global barrier is not needed when partition sizes exceed, for example, the GPU scratchpad size. Instead, large partitions can be further divided into subpartitions that fit into the scratchpad, and a local barrier (e.g., __syncthreads() in CUDA) can be used.


2. Both [123] and [71] develop partitioning heuristics, while GRAPHOPT uses constrained optimization and leverages a state-of-the-art solver, which often achieves better solutions than custom heuristics.

3.4.2 Probabilistic Circuits

The DAG layer partitioning heuristic is used to parallelize PCs on CPUs, GPUs, and a custom ASIC [33, 35, 136]. Our experiments show a speedup of 1052× over Juice [35]. Libraries based on TensorFlow have also been developed for coarse-grained PCs with explicit regular structures [92, 127]. However, such a TensorFlow-based approach is not useful for fine-grained, irregular PCs due to the overheads of kernel launches, etc. GRAPHOPT does not assume any regularity in the PC structure.

3.4.3 Graph Partitioning

As explained in Sect. 3.2, GRAPHOPT essentially needs to solve a P-way independent partitioning of a directed graph while ensuring that the resulting partitions are acyclic and balanced. In general, graph partitioning is known to be NP-complete [50, 80] and is a widely studied problem. Several partitioning algorithms have been developed [10, 14, 144], but they only focus on undirected graphs, intending to reduce the edge crossings between balanced partitions while ignoring the edge direction. As a result, popular undirected partitioning software like JOSTLE [156] and METIS [80] cannot be used for acyclic partitioning. The acyclic partitioning of DAGs is also shown to be NP-complete, like the undirected version of the problem [96]. In recent years, several works have been proposed to tackle this problem [30, 72, 95, 96]. However, these works do not focus on parallelism. The resulting partitions would be acyclic and well-balanced, with minimal edge crossings, but can end up being completely sequential, i.e., only one partition can be executed at a time. This stems from the fact that the usual objective of minimizing edge crossings does not guarantee parallelism. Hence, these methods are not suitable for parallelizing a DAG execution over multiple threads.

3.4.4 DAG Scheduling

Several algorithms have been proposed for scheduling DAGs [12, 114, 157], which use either list-based or clustering-based scheduling heuristics, while GRAPHOPT takes the different approach of modeling the core routine of the tool as a constrained-optimization problem, allowing the use of open-source solvers. The constrained-optimization-based approach is explored in [59, 152]. However, both these works


use several simplifying assumptions that do not hold for multithreaded CPU execution, for example, that (1) all the threads are synchronous, (2) execution and communication latencies are fixed and predefined, and (3) a node execution can be launched precisely in a given cycle. As such, these models cannot generate valid schedules for asynchronous multithreaded CPU execution. Hence, to the best of our knowledge, ours is the first work employing constrained optimization for multithreaded CPU scheduling of irregular DAGs.

3.5 Conclusion

This chapter describes GRAPHOPT, a tool developed to efficiently parallelize sparse, irregular graph workloads on parallel compute threads. Graphs are decomposed into super layers with P parallel partitions, using multiple recursions of two-way partitioning of subgraphs. The two-way partitioning problem is modeled as an optimization problem with the MiniZinc constraint-modeling language and solved with the open-source Google OR-Tools solver. The full flow of GRAPHOPT also contains steps for workload balancing and scalability techniques to handle large graphs. The resulting performance of this super-layer-based partitioning is benchmarked for SpTRSV and PC, achieving speedups of 2.0× and 1.8×, respectively, over the best existing libraries. Thus, GRAPHOPT demonstrates that constrained optimization is effective for the parallelization of large DAGs.

Chapter 4

DAG Processing Unit Version 1 (DPU): Efficient Execution of Irregular Workloads on a Multicore Processor

The previous chapter described GRAPHOPT, a tool that improves the performance of multithreaded CPU execution by reducing the number of synchronization barriers, minimizing inter-core communications, and improving workload balance across cores. Despite these optimizations, the CPU could only achieve a fraction of its peak throughput of 3.4 TOPS, as evident from the results of GRAPHOPT (Figs. 3.10 and 3.11 from Chap. 3). Several factors contribute to this underperformance of general-purpose processors, as discussed subsequently in Sect. 4.1. To overcome the limitations of general-purpose processors, this chapter proposes DPU, a specialized DAG processing unit designed to efficiently execute highly irregular DAGs and provide an optimized hardware platform for these emerging workloads. The DPU is taped out in TSMC 28nm technology to obtain silicon-proven measurement results. Some of the key features are:

• Parallel asynchronous compute units equipped with software-managed local scratchpads for data reuse, connected to a global banked scratchpad via a low-overhead asymmetric crossbar interconnect for high memory bandwidth.
• Hardware support for fast synchronizations of compute units, as frequently required for irregular DAGs.
• Execution based on decoupled data handling and compute streams to overlap memory and arithmetic instructions.
• Precision-scalable arithmetic units based on the customized posit representation (proposed in Chap. 2) to enable application-dependent precision selection.

The chapter is organized as follows. The hardware challenges of irregular DAG execution are discussed in Sect. 4.1. Section 4.2 describes the DPU architecture and Sect. 4.3 explains the internals of a compute unit of the DPU, followed by Sect. 4.4, which discusses the precision-scalable posit unit. Subsequently, Sect. 4.5 presents the experimental results and measurements of the taped-out processor. Finally, Sects. 4.6 and 4.7 discuss the related work and conclude the chapter. This chapter is based on its related papers:


• N. Shah, L. I. G. Olascoaga, S. Zhao, W. Meert, and M. Verhelst (2021). PIU: A 248GOPS/W stream-based processor for irregular probabilistic inference networks using precision-scalable posit arithmetic in 28nm. In IEEE International Solid-State Circuits Conference (ISSCC), vol. 64, IEEE, pp. 150–152 [138].
• N. Shah, L. I. G. Olascoaga, S. Zhao, W. Meert, and M. Verhelst (2021). DPU: DAG processing unit for irregular graphs with precision-scalable posit arithmetic in 28 nm. IEEE Journal of Solid-State Circuits [139].

4.1 Challenges Due to Irregularity

The irregularity in DAG structure poses several challenges for efficient execution on general-purpose processors like CPUs and GPUs. These are summarized in Table 4.1 along with the related DPU solutions to address them.

4.1.1 SIMD Unfriendly

The CPU and GPU can reach peak performance if the workload is single-instruction-multiple-data (SIMD) friendly, because that allows the use of CPU vector instructions and high utilization of the GPU's parallel cores. Interestingly, the DAG nodes to be executed in parallel may perform the same arithmetic operation on their inputs, which would make them suitable for SIMD execution. However, the inputs of these nodes typically reside in random memory locations owing to the irregularity of the DAG structure, making them unlikely to be co-located in a CPU or GPU cache line. In fact, some of them may not be cached at all and need to be fetched from external memory. Experiments in [2] find that around 50% of the load requests in irregular graph workloads result in cache misses. This leads to high variability in the load latency of these inputs, causing all the SIMD lanes to stall for the slowest input.

Table 4.1 Challenges and opportunities in irregular DAG execution and related DPU innovations (challenge: DPU solution)

• SIMD unfriendly: asynchronous compute units with independent instructions
• Frequent synchronizations: hardware-supported synchronization with special instructions, and superlayer-based parallel execution
• Inefficient use of caches: software-managed scratchpads
• Data prefetching: decoupled instruction streams for efficient hardware prefetching
• Diverse applications with varying precision requirement: precision-scalable custom posit arithmetic unit


Thus, despite the availability of parallel operations to execute, random memory loads make irregular DAGs SIMD unfriendly. Consequently, the x86 CPU SIMD vector instructions are not useful for irregular DAGs. GPUs can potentially tolerate irregular memory latency by switching thread warps when a thread stalls (provided enough thread warps are available to schedule). However, GPUs suffer from other bottlenecks as discussed subsequently. The DPU uses asynchronous compute units that execute independent instruction streams instead of a SIMD unit.

4.1.2 Frequent Synchronizations

The total amount of computation done per synchronization barrier is a key indicator of parallel performance: the higher the better, to amortize the synchronization cost. This amount depends on the DAG structure and the barrier-placement technique. In our benchmarks, the DAG layer-based approach (shown in Fig. 4.1a and described in the previous Chap. 3) results in 210 compute operations per barrier, which increases to 633 with the superlayer-based approach of GRAPHOPT (in Fig. 4.1b). Yet, if a barrier takes a number of cycles comparable to the computation between barriers, it can severely degrade the parallel performance, possibly making it even worse than a sequential execution. The synchronization barriers in multicore CPUs are implemented with atomic operations on a shared memory location. These atomic operations typically incur a long latency from the CPU core to the outermost shared caches or external memory. Furthermore, these operations also create a burst of cache-coherency traffic as every core modifies the same location, increasing the barrier cost further. Due to these reasons, the synchronization barriers in CPUs typically take 3000 cycles (measured with the EPCC microbenchmark [13]). Similarly, on GPUs, the global synchronization of all the CUDA cores consumes around 2000 cycles [172]. To avoid this bottleneck, DPU is equipped with a hardware-supported barrier instruction that synchronizes all the units in a single cycle, which becomes feasible due to a lower target clock frequency of 300 MHz.
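As an illustration, the operations-per-barrier metric can be estimated directly from a DAG and a chosen barrier placement. The sketch below is a minimal example (not the GRAPHOPT implementation), assuming one arithmetic operation per DAG node and one global barrier after every (super) layer; the helper names are illustrative.

import networkx as nx

def dag_layers(dag: nx.DiGraph) -> list:
    """Group nodes by depth (longest distance from any source node)."""
    depth = {}
    for n in nx.topological_sort(dag):
        preds = list(dag.predecessors(n))
        depth[n] = 1 + max((depth[p] for p in preds), default=-1)
    layers = [[] for _ in range(max(depth.values()) + 1)]
    for n, d in depth.items():
        layers[d].append(n)
    return layers

def ops_per_barrier(dag: nx.DiGraph, layers: list) -> float:
    return dag.number_of_nodes() / len(layers)  # one global barrier per layer

With plain DAG layers, the denominator equals the critical path length, whereas GRAPHOPT's super layers merge many DAG layers, which is what raises the metric from 210 to 633 operations per barrier in our benchmarks.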

4.1.3 Inefficient Use of Caches

DAG execution results in randomly addressed single-word memory accesses (typically of 4B) with low spatial locality, which implies that the typical 128/256B cache line granularity of CPUs and GPUs is too coarse, and most of the fetched words are unlikely to be used. Such random accesses also prevent memory-request coalescing, which is critical for good GPU performance. Smaller cache lines are preferable for these workloads, but lead to a higher area/energy overhead of tag storage and lookup. Furthermore, depending on the access patterns, some words


Fig. 4.1 Synchronizations. GRAPHOPT's super-layer-based approach (b) is used for DPU as it improves the amount of computation done per synchronization barrier by 3× compared to the standard DAG layer-based approach (a)

are reused much more frequently than the others. Thus, selectively choosing which words to store locally results in an efficient use of the scarce on-chip local storage. Due to these reasons, DPU uses software-managed local scratchpads with single-word accesses, instead of hardware-managed caches.

4.1.4 Data Prefetching

An out-of-order CPU core tries to find data-prefetching opportunities at runtime, and can issue multiple outstanding load requests [47, 67]. However, [8] reports that an Intel CPU could only keep 2–3 load requests in flight while executing graph workloads, even when the architecture supported up to 10 load requests. The main reason for this underperformance is the limited instruction window in which the


core looks for independent load requests, and widening this window is area- and power-hungry. To prefetch data efficiently, DPU exploits the fact that the DAG structure is known at compile time. The compiler decouples the DPU's instructions into memory and processing streams, which are executed independently on different subunits. This enables low-overhead prefetching and memory-compute overlap, without using costly out-of-order hardware.
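The following minimal sketch (not the actual DPU compiler) illustrates this compile-time decoupling: a coupled instruction list is split into the load, processing, and store streams that Sect. 4.3.2 describes, preserving the original order within each stream. The instruction fields used here are illustrative assumptions.

def decouple(instructions):
    load_stream, proc_stream, store_stream = [], [], []
    for ins in instructions:
        if ins["kind"] == "load":      # scratchpad address + destination register
            load_stream.append((ins["addr"], ins["dst_reg"]))
        elif ins["kind"] == "store":   # scratchpad address of a produced value
            store_stream.append(ins["addr"])
        else:                          # arithmetic op with FIFO flow-control flags
            proc_stream.append((ins["op"], ins["srcs"], ins["dst"],
                                ins.get("pop_load_fifo", False),
                                ins.get("push_store_fifo", False)))
    return load_stream, proc_stream, store_stream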

4.2 DPU Architecture

This section describes the salient architectural features of DPU designed to alleviate the challenges discussed in the previous section.

4.2.1 Compute Units (CUs)

Figure 4.2 shows the DPU architecture with 64 parallel compute units (CUs) that execute the 64 subgraphs in each superlayer shown in Fig. 4.1b. Each CU is equipped with its own instruction memory and executes these instructions asynchronously; a stalled CU does not stall the others. The CUs communicate via a global scratchpad connected by an asymmetric crossbar (Sect. 4.2.2). The single-cycle synchronization of CUs is made possible with a global barrier instruction and a global sync unit (Sect. 4.2.3). The detailed architecture of a CU is explained later in Sect. 4.3. A specialized compiler is designed that takes an arbitrary DAG, generates the super layers using GRAPHOPT (discussed in Chap. 3), schedules operations on each CU, allocates data to scratchpads and register files, etc.

4.2.2 Global Scratchpad and Asymmetric Crossbar

As observed in [2], irregular graphs typically lead to relatively high global traffic when executed on parallel units. In our benchmarks, we also observe a similar trend: the blue global edges in the super layers in Fig. 4.1b account for 33% of the total edges. In DPU, this global inter-CU communication happens via a high-bandwidth global scratchpad. The 256 KB scratchpad is constructed with 64 banks of 4 KB each, providing an overall bandwidth of 2 Kb/cycle. Furthermore, for low hardware overhead, the CUs connect to the global scratchpad via an asymmetric crossbar (Fig. 4.3), such that a CU can load data from any bank but can store to only one specific bank. This asymmetry of global loads but restricted stores (instead of the other way around) is a deliberate design choice considering that an output of a node is stored only once but usually loaded multiple times by different CUs.


Fig. 4.2 The DPU architecture with 64 parallel CUs

Fig. 4.3 Asymmetric crossbar to reduce area and power overhead

The asymmetric design does reduce the flexibility of mapping intermediate node outputs to the global scratchpad banks. In fact, the bank mapping is fully determined based on the mapping of operations to the CUs (which happens during the super


layer generation), because the output of an operation on a CU can only be stored to that CU's respective bank. On the other hand, with a full crossbar, the compiler could possibly map an output to any bank to reduce bank conflicts. However, predicting and averting these bank conflicts during compilation is anyway not possible given that the CUs are asynchronous and can stall for an unpredictable number of cycles (e.g., waiting for the next instruction). As a result, in practice, it is very difficult for a compiler to exploit the additional flexibility coming from a full crossbar, and hence the inflexibility of an asymmetric crossbar does not drastically impact the throughput. For each bank, the load requests are selected based on a round-robin arbitration scheme, while the store request has the highest priority. The store request cannot participate in the round-robin arbitration with the 64 load requests because that would reduce the store bandwidth by 64× compared to the loads. Overall, this asymmetric crossbar consumes 45% lower area and energy than the symmetric counterpart that allows store requests to reach every bank.
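A minimal behavioral sketch of this per-bank arbitration is shown below (illustrative Python, not the RTL): the single store request wins immediately, and otherwise one of the pending load requests is granted in round-robin order.

class BankArbiter:
    def __init__(self, num_cus=64):
        self.num_cus = num_cus
        self.rr_ptr = 0                      # round-robin pointer over the CUs

    def grant(self, store_req, load_reqs):
        if store_req:                        # the store has the highest priority
            return ("store", None)
        for i in range(self.num_cus):        # scan starting from the round-robin pointer
            cu = (self.rr_ptr + i) % self.num_cus
            if load_reqs[cu]:
                self.rr_ptr = (cu + 1) % self.num_cus
                return ("load", cu)
        return (None, None)                  # no request in this cycle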

4.2.3 Global Sync Unit

To mitigate the overhead of frequent synchronizations, the CUs are equipped with a special instruction for a global barrier, complemented with a dedicated central global sync hardware unit. When a CU reaches a global barrier instruction, it indicates this to the global sync unit and stalls until the other CUs hit their global barrier instruction. The global sync unit uses a tree of AND gates to determine if all the CUs have reached the barrier, and communicates this to all the CUs within the same cycle, enabling a single-cycle synchronization. Even though the paths to/from the global sync unit are long combinational paths due to the unit's centralized role, they are not the critical paths in our design.

4.3 Compute Unit (CU) Architecture

4.3.1 Local Scratchpad

Because the subgraphs in the super layers are made as large as possible, they have a significant proportion of intra-CU edges, shown in red in Fig. 4.1b (67% of the total number of edges in our benchmarks). To exploit this locality, the CUs are equipped with an address-mapped local scratchpad. The compiler maps the output data of a node to the local scratchpad if all outgoing edges of the node are intra-CU (colored red in Fig. 4.1b); otherwise, it is mapped to the global scratchpad. A scratchpad is used instead of a cache due to the following reasons:


• A software-managed scratchpad can selectively store only the data that has local reuse. • The typical cacheline granularity of 32 or 64 words is too large for graphs due to frequent irregular memory fetches, and results in wasted interconnect traffic [2, 8]. • An address-mapped scratchpad avoids tag storage and lookup, reducing the area and energy footprint.
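A minimal sketch of the data-placement rule described above is shown below, assuming a NetworkX DAG and a precomputed node-to-CU assignment (both names are illustrative, not the DPU compiler's API).

import networkx as nx

def assign_output_location(dag: nx.DiGraph, cu_of: dict) -> dict:
    location = {}
    for node in dag.nodes():
        consumers = list(dag.successors(node))
        intra_cu = all(cu_of[c] == cu_of[node] for c in consumers)
        # Outputs consumed only inside the producing CU stay in its local
        # scratchpad; anything read by another CU goes to the global scratchpad.
        location[node] = "local" if consumers and intra_cu else "global"
    return location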

4.3.2 Data Prefetching Using Decoupled Instruction Streams

For a given DAG, the memory load and store instructions to/from the scratchpads can be predicted at compile time. This is leveraged to perform aggressive data prefetching by overlapping memory requests with arithmetic operations, without the need for expensive out-of-order hardware. As shown in Fig. 4.4, the instructions for the CUs are decoupled into three streams (load, processing, and store streams), which are executed independently on different components of the CU. Load Streaming Unit The load addresses (for both global and local scratchpads) along with the corresponding destination registers are programmed in the load address memory. The load streaming unit prefetches the data by issuing these load requests to the local or global scratchpads. The loaded data may not yet be allowed

Fig. 4.4 Decoupled streams to overlap memory and processing instructions


Fig. 4.5 Internal PE design and FIFO flow control for precise ld/st timing

Table 4.2 Instructions of the PE

add, mul                   Add or multiply two numbers
max, min                   Max or min of two numbers
global or local barrier    Barriers for synchronization
set_ld_stream_len          Sets the load stream length until the next barrier
set_precision              Sets the precision of the arithmetic unit

to be written to the register file in the PE, because the PE might still be using the destination register for some other computation. Instead, the data is pushed on a FIFO going to the PE, from where the PE will eventually consume it at the right moment. The load streaming unit keeps streaming the load requests as long as there is space in the FIFO. The prefetching cannot cross barriers; hence, the length of the stream is controlled by a special load stream length register, which is programmed by the PE after every barrier. The FIFO becomes empty at the barriers; hence, frequent barriers reduce prefetching efficiency. PE The processing streams contain the actual arithmetic instructions to be executed on the PE (Fig. 4.5). A custom instruction set is designed (Table 4.2), which dictates the computation of the arithmetic unit in the PE. The PE does not contain pipeline stages, and all the instructions have a latency of one cycle. The instructions have an 18b compute field, with a 3b opcode for ALU and special-function instructions (Table 4.2) and 3 × 5b operand register file addresses. A 32-entry 32b register file is used with 3 write ports (2 for the load ports and 1 for the arithmetic unit) and 2 read ports (for the arithmetic unit). The compiler makes sure that the 3 write ports write to different registers in the same cycle, to avoid conflicts. This is done by precisely controlling the timing of the loads, using flow control bits (see further). The output of the arithmetic unit connects to one of the register write ports and to the FIFO of the store streaming unit. The 3 remaining instruction bits control the PE's IO dataflow. The PE cannot directly communicate with the scratchpads; it only gets/puts data from/to the FIFOs of the load/store units. The FIFO flow control bits are encoded in the processing


instructions, to let the PE precisely control the timing of migrating data from the load FIFO to the register file, and from the ALU output to the store FIFO, respectively. These flow control bits indicate whether a FIFO load/store is needed in parallel with the compute operation of the current instruction. The PE stalls if a load flow control bit is set but there is no data in the load FIFO, or if the store bit is set but the store FIFO is full. With such flow control, the PE avoids data hazards like the write-after-read (WAR) hazard, in which data from the load FIFOs overwrites an active register that is not yet fully consumed. Store Streaming Unit The store streaming unit waits for data to show up on the store FIFO from the PE, and stores it to the local/global scratchpads according to the addresses in the store stream. It also informs the PE whether all the data has been stored to memory and no outstanding store requests are left, which helps the PE decide whether the compute unit is ready to sync with the other units at a barrier. Local Barrier The load-to-PE and the PE-to-store communication happens via FIFOs, and the corresponding data dependencies are resolved by the FIFOs' flow control. There is also a dependency of the load stream on the store stream: a load following a store to the same scratchpad address (adrX in Fig. 4.4) should not be executed before the store finishes. Since the streaming units operate independently, this ordering is not guaranteed. To address this, an intra-CU local barrier is used. The load stream length is programmed such that the load streaming unit waits at the local barrier for the store unit to catch up before proceeding. Stream Generation Given a subgraph to be executed on a CU, the DPU compiler first generates the corresponding instructions by (1) scheduling the operations of the subgraph nodes in a topological order using a depth-first traversal, (2) performing register allocation using the linear-scan method [124], and (3) inserting the load/store instructions as needed depending on the register file size. Next, local barriers are inserted such that every load following a store to the same address has at least one barrier in between. Finally, from these instructions, the decoupled streams are generated by assigning the instructions to their respective type of stream while preserving the order. Performance Impact Due to the data prefetching enabled by the decoupled streams, the DPU achieves a 1.8× speedup over an in-order version of DPU that uses coupled instructions (for the benchmarks described in Sect. 4.5).
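A minimal sketch of the local-barrier insertion step is shown below (illustrative Python with assumed instruction fields, not the DPU compiler). A barrier is placed so that any load from an address follows at least one barrier after the most recent store to that same scratchpad address.

def insert_local_barriers(instructions):
    out, stored_since_barrier = [], set()
    for ins in instructions:
        if ins["kind"] == "load" and ins["addr"] in stored_since_barrier:
            out.append({"kind": "local_barrier"})   # force the store stream to catch up
            stored_since_barrier.clear()
        out.append(ins)
        if ins["kind"] == "store":
            stored_since_barrier.add(ins["addr"])
    return out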

4.4 Precision-Scalable Custom Posit Unit

Chapter 2 proposed the custom posit format that is more suitable for PC and SpTRSV evaluations than the standard posit and floating-point representations. The PEs of DPU are equipped with arithmetic units based on this format.


Application-Dependent Precision Scalability DAGs from different applications can have widely varying precision requirements. For example, PCs can be used for safety-critical applications like autonomous navigation [175] demanding highly accurate computation, but can also be deployed for simpler applications like human activity classification (running, sitting, etc.) [53, 135], which can tolerate some mispredictions. To meet such diverse requirements, the PEs are equipped with precision-scalable arithmetic units that can perform 1 × 32b, 2 × 16b, or 4 × 8b operations in a single cycle, enabling batch execution based on the application requirements. Hardware Design A posit operator at its core needs a floating-point operator, since, for a given regime value, a posit behaves like a floating-point number. Hence, a posit operation is strictly costlier than a floating-point operation (ignoring the exceptions of IEEE float). The arithmetic unit contains posit format decoders and encoders, shared among the floating-point adder and multiplier (Fig. 4.6). The decoder finds the length of the regime field (with a priority encoder) and aligns the rest of the fields accordingly (with a barrel shifter) for addition and multiplication. The encoder aligns the output (with a barrel shifter) according to the output regime. All the blocks in the arithmetic unit support precision scalability to perform 1 × 32b, 2 × 16b, or 4 × 8b operations. This runtime scalability is novel and not available in other posit hardware generators [20, 76, 149]. Figure 4.6 shows how 32b building blocks are constructed from two 16b blocks, which in turn are made of 8b blocks. The posit unit consumes 1.8× the area and power of a floating-point counterpart (Table 4.3), but enables 8b or 16b operations as discussed earlier. Table 4.4 reports a comparison with PACoGen [76], a state-of-the-art posit unit generator, which consumes 0.76× the area and 0.68× the energy at 32b due to the overhead of precision scalability in our unit. On the other hand, for applications requiring only 8b precision, the 32b PACoGen unit consumes 3.7× the energy per operation of the precision-scalable unit.
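To make the decoder's role concrete, the sketch below decodes a standard n-bit posit with es exponent bits (the DPU silicon uses the custom posit variant of Chap. 2, so treat this only as an illustration of the regime run-length detection and field alignment that the priority encoder and barrel shifter perform in hardware).

def decode_posit(bits: int, n: int = 8, es: int = 0) -> float:
    if bits == 0:
        return 0.0
    sign = (bits >> (n - 1)) & 1
    if sign:                                   # standard posits negate in two's complement
        bits = (-bits) & ((1 << n) - 1)
    body = bits & ((1 << (n - 1)) - 1)         # bits after the sign
    rem = n - 1
    first = (body >> (rem - 1)) & 1            # leading regime bit
    run = 0
    while run < rem and ((body >> (rem - 1 - run)) & 1) == first:
        run += 1                               # the priority-encoder step in hardware
    k = (run - 1) if first == 1 else -run      # regime value
    rem -= min(run + 1, rem)                   # drop the regime bits and the terminating bit
    exp = (body >> max(rem - es, 0)) & ((1 << min(es, rem)) - 1) if es else 0
    frac_bits = max(rem - es, 0)
    frac = body & ((1 << frac_bits) - 1)
    value = (1 + frac / (1 << frac_bits)) * 2.0 ** (k * (1 << es) + exp)
    return -value if sign else value

For example, decode_posit(0b01000000, n=8, es=0) returns 1.0 and decode_posit(0b01100000, n=8, es=0) returns 2.0.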

4.5 Implementation and Experiments

4.5.1 Physical Implementation

Figure 4.7a shows the floor plan of the physical implementation of DPU in an active area of 2.0 × 1.9 mm². The global scratchpad SRAM banks are placed in the center, interfaced with 32 CUs on the top and bottom via the asymmetric crossbar. Figure 4.7b shows the standard cells and the nets of the crossbar and the arbiter, which are mainly concentrated in the center of the chip. The ten metal layers available in the technology ease the crossbar routing, limiting the congestion hotspots to the center of the chip as shown in Fig. 4.7c. The routing congestion at the center is also alleviated by manually limiting the density of the standard cell


Fig. 4.6 Posit arithmetic unit with precision-scalable subunits

Table 4.3 Posit unit area and power breakdown

                               Area (×10³ μm²)   %    Power (μW)   %
Decoders                       1.2               17   166          19
Float add and mul              4.3               57   484          56
Normalize and round            1.4               18   142          17
Encoder                        0.6               8    64           8
Precision-scalable posit unit  7.5                    856

Table 4.4 Performance comparison with a state-of-the-art posit unit

                            This work              PACoGen [76]
Operating mode              32b     16b     8b     32b
Area (×10³ μm²)             7.5                    5.7
Energy (pJ/op)              3.05    1.33    0.57   2.08
Throughput (ops/cycle)      1       2       4      1

placement in the center of the chip, as reflected in the placement density map in Fig. 4.7d. Table 4.5 shows the area breakdown and the post-layout and post-synthesis power breakdown of the processor (estimated after activity and cell-delay annotation). The memories occupy half of the area and consume 34% of the power. Post layout, the power of the register file and the crossbar increases by 99% and 77%, respectively, showing that the post-synthesis estimation can be significantly inaccurate for such modules. Figure 4.8 shows the taped-out prototype of DPU in a 28nm CMOS technology, together with the salient specifications of the processor.

4.5.2 Peak Performance and Voltage Scaling

The chip's electrical performance is measured by scaling the voltage from the nominal 0.9 V down to 0.6 V. The chip can operate at a maximum frequency of 288 MHz at 0.9 V and 8b precision, with a peak throughput of 73.8 GOPS (Fig. 4.9a), and at a peak energy efficiency of 538 GOPS/W at 0.6 V (Fig. 4.9b). The top 50 critical paths are in the posit arithmetic unit and the crossbar. Both blocks can be pipelined to increase the clock frequency further, but this would have a negative impact on instruction scheduling due to the higher posit unit pipelining latency, and would induce a potential throughput trade-off due to the increased access latency of the global scratchpad. We do not assess this trade-off further because, at 288 MHz, DPU already outperforms the CPU and GPU, making it less attractive to increase the clock frequency further and potentially reduce the energy efficiency in the process.
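As a rough cross-check (assuming each of the 64 CUs completes one arithmetic instruction per cycle, i.e., four operations in the 4 × 8b mode), the peak figure follows directly from the architecture parameters:

\[ 64~\text{CUs} \times 4~\frac{\text{ops}}{\text{CU}\cdot\text{cycle}} \times 0.288~\text{GHz} \approx 73.7~\text{GOPS}, \]

which is consistent with the measured 73.8 GOPS at 8b precision.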

4.5.3 Workloads

The performance of the chip is benchmarked with the PC and SpTRSV DAGs listed in Table 4.6. The DPU compiler takes as input a DAG in any of the popular graph formats (i.e., all formats supported by the NetworkX package [65]) and generates an execution binary that can be directly programmed to DPU. The experiments are performed with DAGs that fit in the on-chip data memory (global and local scratchpads). The memories are programmed with an FPGA via a slow chip I/O interface; this programming time is not included in the throughput results.
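The statistics in Table 4.6 can be reproduced from such a graph file with a few lines of NetworkX; the sketch below is illustrative (the exact input format and whether the longest path length is counted in nodes or edges are assumptions).

import networkx as nx

def dag_statistics(path: str) -> dict:
    dag = nx.read_edgelist(path, create_using=nx.DiGraph)  # one of many supported formats
    n = dag.number_of_nodes()
    l = nx.dag_longest_path_length(dag) + 1   # longest path counted in nodes
    return {"nodes": n, "longest_path": l, "parallelism": n / l}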



Fig. 4.7 The physical implementation of DPU. (a) Floorplan of the physical implementation of DPU. (b) Standard cells (green) and nets (white) of the crossbar and arbiter. (c) Routing congestion hotspots. (d) Placement density map (violet → red for least → most density)


Table 4.5 Post-layout area and power breakdown

                                 Area               Power (post layout)   Power (post synth.)   Increase from synth. to layout
                                 mm²      %         mW        %           mW        %            %
64 Compute units:                2.67     70        183.1     80          151.3     83           21
  PEs:                           0.96     26        104.7     46          76.7      42           37
    Posit units                  0.48     13        69.4      30          58.2      32           19
    Register file and decoder    0.48     13        31.5      14          15.8      9            99
  Local scratchpads              0.27     7         14.2      6           14.0      8            1
  Instr, LD and ST addr. mem     1.18     30        44.9      20          44.9      25           0
  LD streaming unit              0.20     5         14.1      6           11.6      6            22
  ST streaming unit              0.04     1         3.9       2           3.2       2            22
Global scratchpads               0.54     14        18.5      8           18.1      10           2
Crossbar:                        0.19     5         13.8      6           7.8       4            77
  Datapath                       0.13     3         6.8       3           5.4       3            26
  Arbiter                        0.06     2         6.3       3           1.9       1            232
Rest                             0.40     11        13.5      6           5.0       3            170
DPU                              3.8                228.9                 182.2                  26

Fig. 4.8 Chip micrograph and specifications


Fig. 4.9 Peak performance scaling with voltage and precision

Table 4.6 Statistics of the benchmarked DAGs

Application                                Workload    Nodes (n)   Longest path length (l)   Parallelism (n/l)
Probabilistic Circuits (PC)                mnist       10,414      26                         400
                                           nltcs       13,627      27                         504
                                           msnbc       47,334      28                         1690
                                           bnetflix    55,007      53                         1037
                                           ad          66,819      93                         718
                                           bbc         77,457      92                         841
                                           c20ng       80,962      81                         999
                                           kdd         98,211      54                         1818
                                           baudio      121,263     70                         1732
                                           pumsbstar   149,662     82                         1825
Sparse Matrix Triangular Solves (SpTRSV)   tols4000    5978        52                         114
                                           bp_200      8406        139                        60
                                           west2021    10,159      136                        74
                                           qh1484      11,298      237                        47
                                           sieber      22,768      242                        94
                                           gemat12     74,199      778                        95
                                           dw2048      79,240      929                        85
                                           orani678    114,275     634                        180
                                           pde2961     140,303     1357                       103
                                           blckhole    150,876     1264                       119


Fig. 4.10 Scaling of throughput with increasing active CUs at 8b precision

4.5.4 Throughput Scaling with Different Active CUs

The throughput of DPU is measured with different numbers of active CUs to evaluate how effectively the parallelization scales with increasing parallel CUs (Fig. 4.10). The PC throughput scales better than the SpTRSV throughput because of the higher DAG parallelism (Table 4.6). Apart from DAG parallelism, other factors that lower the average throughput compared to the peak are (1) the impact of barriers on data prefetching and (2) global scratchpad access conflicts. Overall, the average throughput is 2.8× lower than the peak for 64 CUs.

4.5.5 Comparison with CPU and GPU

Since there is no previous silicon-proven chip targeting highly irregular DAGs, this experiment is designed to compare DPU's performance with state-of-the-art CPU and GPU implementations. The details of the platforms are as follows: DPU The results are for 64 active CUs operating at 278 MHz and 32b precision. CPU An Intel(R) Xeon Gold 6154 CPU operating at 3 GHz is used for comparison. For PC, a standard Julia-based library called Juice [35] (CPU-JUICE) and a highly optimized OpenMP-based implementation of the super layers from GRAPHOPT (CPU-OMP) are used for comparison. The SpTRSV performance is evaluated with the standard Intel Math Kernel Library (MKL v2021.1). The programs are compiled with the GCC v4.8.5 compiler, the -Ofast flag, and OpenMP v3.1. GPU The GPU baseline is evaluated with an RTX 2080Ti GPU operating at 1.3 GHz, with the code compiled with the CUDA v10.2.89 compiler. For PC, an efficient CUDA code described in [137] is used for benchmarking. For SpTRSV, the cusparseScsrsv_solve() function from the standard cuSPARSE library


[99, 100] is used. For a fair comparison, the memory copy time from the host to the GPU is not considered. Table 4.7 and Fig. 4.11 summarize the comparison results. All the platforms show higher performance for PC than for SpTRSV due to the higher parallelism (Table 4.6). Juice performs considerably slower than the OpenMP counterpart, despite using the same number of CPU cores. Overall, the DPU outperforms the CPU, which in turn beats the GPU. The DPU achieves an average throughput of 6.2 GOPS, a speedup of 5.1× and 20.6× over the CPU and GPU, respectively, at an average efficiency of 27 GOPS/W at 32b precision, showing the effectiveness of the specialized DPU architecture for irregular DAGs.

Table 4.7 Performance comparison with other platforms

                               DPU (a)            CPU           GPU
Workloads                      PC and SpTRSV DAGs
Technology                     28 nm              14 nm         12 nm
Area (mm²)                     3.8                NA            754
Frequency (GHz)                0.28               3             1.35
Arithmetic representation      Custom posit       Float         Float
Peak throughput (GOPS)         17.8               3.4 × 10³     13.5 × 10³
Avg. throughput (GOPS)         6.2                1.2           0.4
Power (W)                      0.23               55            98
Avg. energy eff. (GOPS/W)      27                 0.02          0.004
EDP (pJ × ns)                  6.0                38 k          1 M

(a) DPU operating point is 0.9 V and 32b precision

Fig. 4.11 Performance comparison. The DPU operating point is 0.9 V and 32b precision


4.5.6 DPU’s Performance for a Regular DAG DPU’s performance for regular DAGs would be an interesting result, quantifying the effectiveness of asynchronous CUs for a regular workload. Hence, as an additional experiment, performance is benchmarked for a regular DAG of dense matrix-vector multiplication (GEMV). For a .128 × 128 matrix, DPU achieves a throughput of 17.5 GOPS at a utilization of 97.5%, while consuming 414 mW at 0.9 V, 0.28 GHz, and 32b precision, resulting in an efficiency of 42 GOPS/W. This shows that DPU can achieve near-peak throughput for a regular DAG, although with an inefficiency that separate instructions and load/store addresses are used for every CU due to the absence of SIMD support. For reference, the CPU achieves 7.4 GOPS and 0.14 GOPS/W with the Intel MKL GEMV function cblas_sgemv().

4.6 Related Work

Neural-network processors exploiting sparsity like [70, 85, 115, 170, 174] have special hardware support to handle the irregularity resulting from the sparsity. However, the sparsity in our DAG workloads is typically more than 99.99%, significantly higher than NN sparsity (less than 70–80%). As a result, sparse NNs exhibit higher compute-to-memory-fetch ratios and some repetitive structures that can be exploited, e.g., by using a systolic array of PEs, while the DPU needs different architectural techniques due to the ultrahigh sparsity. In recent years, architectures like [2, 61, 66, 86, 167] have been proposed for general graph-analytic workloads like PageRank, breadth-first search, single-source shortest paths, etc. The key difference is that these architectures work well when a significant portion of the graph nodes is active, while in compute DAGs only the nodes in a DAG layer are active at a time due to the dependency-induced node ordering. In general graph analytics, the active nodes cannot be predicted at compile time, while they can be predicted for our target DAGs, which is heavily utilized in this work for reducing the number of barriers, aggressive data prefetching, etc. The sparse processing unit (SPU) [33] uses hardware support for stream-joins, which is similar to our decoupled streams (Sect. 4.3.2). However, DPU uses a register-bank-based PE for local data reuse, while SPU uses a coarse-grained reconfigurable array (CGRA)-based spatial dataflow datapath. Such a spatial datapath can be programmed for small DAGs in the innermost loops of irregular applications, but it is not clear if such an approach is suitable for large DAGs with thousands of nodes (this will be studied in the next chapter). Furthermore, the SPU operates at a significantly higher power budget of 16 W (simulated) as opposed to DPU's 0.23 W.


4.7 Conclusion

This chapter proposed DPU, a processor designed for energy-efficient parallel execution of irregular DAGs. The DPU is equipped with 64 parallel compute units (CUs), each executing a DAG subgraph independently. The CUs communicate via a high-bandwidth global scratchpad connected using a low-overhead asymmetric crossbar. Synchronization of CUs, frequently needed for DAGs, happens in a single cycle using a specialized hardware unit. The instructions of the CUs are decoupled into multiple streams for overlapping execution, resulting in a 1.8× speedup. For the arithmetic operations, the CUs are equipped with precision-scalable custom posit units that can perform low-precision batch inference depending on the application requirements. The DPU is fabricated in 28nm technology, and benchmarked on irregular DAGs from probabilistic machine learning and sparse linear algebra. Measurement results show a mean speedup of 5.1× and 20.6× over state-of-the-art CPU and GPU implementations, respectively, with a peak performance of 73.8 GOPS and a peak efficiency of 538 GOPS/W. Thus, DPU takes a step toward supporting emerging irregular DAG workloads in energy-constrained platforms.

Chapter 5

DAG Processing Unit Version 2 (DPU-v2): Efficient Execution of Irregular Workloads on a Spatial Datapath

The previous chapter presented the first version of DPU, which demonstrates that specialized hardware optimizations can improve throughput and energy efficiency significantly compared to general-purpose CPUs and GPUs. This chapter investigates whether the DPU can be enhanced further through a dedicated datapath design. Specifically, the suitability of a spatial datapath is explored for the second version of DPU. A spatial datapath contains processing elements (PEs) connected such that an output of a PE can be routed directly to an input of another PE, instead of routing through a register file. Figure 5.1 shows a few examples of spatial datapath topologies used for different applications like DNNs [147], digital signal processing [159], sparse matrix-matrix multiplications [128], etc. The key advantages of such an approach are:

• the reduction in register file reads/writes due to spatial data reuse, saving significant energy consumption, and
• the reduction of program size as (a) the register file locations are not required for the inputs/outputs of most of the PEs, and (b) the datapath can possibly be programmed with SIMD instructions.

For irregular DAGs, flexible spatial datapaths with data routers¹ along with PEs are studied in DSAGEN [161], SPU [33], Plasticine [125], etc. The data routers enable the flexibility of mapping DAGs with different structures to the same spatial datapath. However, these works focus on small DAGs extracted from the for loops of various iterative algorithms, which typically consist of fewer than 100 nodes. Such DAGs can be fully spatially mapped to the datapath, which allows the configuration of the datapath (routing, PE operations, etc.) to be fixed at the beginning and to remain unchanged throughout the execution. This is not possible for large DAGs with thousands of nodes, as they cannot be fully mapped. To address the limitation

¹ In some works, the PE itself acts as a router by passing one of the inputs to the output.


Fig. 5.1 A few examples of spatial datapaths used for applications like (a) DNNs [147], (b) digital signal processing [159], and (c) sparse matrix-matrix multiplications [128]


of prior works for large DAGs, this chapter proposes the second version of the DAG processing unit (DPU-v2). The major contributions of this chapter are as follows:

1. Processor: A parameterized architecture template is designed with a tree-structured spatial datapath that is interfaced with a banked register file through high-bandwidth flexible interconnects. The datapath executes a different subgraph from the target DAG every cycle, which is made possible through a custom instruction set with long words.
2. Compiler: A specialized compiler is designed to decompose a large DAG into subgraphs that can be scheduled temporally on the datapath while ensuring that the datapath is maximally utilized. The irregularity of DAGs leads to frequent register bank conflicts, which are reduced with a conflict-aware register allocation that takes into account the interconnect routing constraints.


3. Design space exploration: Since there are multiple design parameters in the architecture template (e.g., the number of register banks, the size of the spatial datapath, etc.), a design space exploration is performed to arrive at the architecture with the minimum energy-delay product in a 28 nm CMOS technology, which is then compared with state-of-the-art implementations.

The chapter is organized as follows. Section 5.1 discusses the key ideas required to make spatial datapaths suitable for large DAGs. Section 5.2 describes the processor architecture template of DPU-v2, followed by Sect. 5.3 explaining the compiler. Next, Sect. 5.4 presents the design space exploration to find the design with the minimum energy-delay product for the target DAGs, and compares it with state-of-the-art work. Finally, Sect. 5.6 discusses related works and Sect. 5.7 concludes the chapter. This chapter is based on the related papers:

• N. Shah, W. Meert, and M. Verhelst (2022). DPU-v2: Energy-efficient execution of irregular directed acyclic graphs. In 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), IEEE, pp. 1288–1307 [140].
• N. Shah, L. I. G. Olascoaga, W. Meert, and M. Verhelst (2020). Acceleration of probabilistic reasoning through custom processor architecture. In Design, Automation & Test in Europe Conference & Exhibition (DATE), IEEE, pp. 322–325 [137].

5.1 Designing a Processor with a Spatial Datapath for Large Irregular DAGs

This section discusses the main design principles of the DPU-v2 architecture template.

5.1.1 Which Spatial Datapath Topology Should Be Used?

Finding a suitable datapath topology for the target workloads is a challenging task, especially for large DAGs. For small DAGs with up to a hundred nodes, design space exploration frameworks like DSAGEN [161] can identify the optimal spatial topology. However, these frameworks attempt to fully spatially map the DAGs, which is not feasible for DAGs with thousands of nodes. A datapath can be properly utilized for a given large DAG if there are subgraphs in the DAG that can be mapped to the datapath with high utilization of the datapath resources. To identify a suitable datapath, the peak utilization of a given datapath for a given DAG can be used as a figure of merit: low peak utilization implies that there does not exist any subgraph in the DAG that can be properly mapped to


Fig. 5.2 Systolic arrays (a) are underutilized by irregular DAGs, while a tree-shaped datapath (b) is a promising alternative as measured by peak utilization (c)

the datapath. As it turns out, the peak utilization can be estimated with the spatial datapath mapper from [106], which uses optimization solvers to identify the largest subgraph that can be mapped to the datapath. We estimate the peak utilization for two candidate datapath topologies shown in Fig. 5.2:

• A two-dimensional systolic array of PEs [84], a commonly used topology for DNNs and scientific computing workloads.
• A binary tree of PEs, which is found to be suitable for sparse workloads [85, 128].

For both topologies, the PEs can either perform arithmetic operations on the two inputs or act as a router and pass one of the inputs to the output. Experiments show that for the target DAGs, the utilization of the 2D systolic array drops quickly as the size of the array grows (see Fig. 5.2c). A 4×4 array with 8 inputs achieves a peak utilization of less than 50%. On the other hand, the DAGs contain subgraphs that can almost fully utilize large trees. As such, a tree of PEs is selected for DPU-v2. Note that the spatial mapper used here can quantify the peak utilization, but the constrained-optimization-based approach used in the mapper is too slow to fully map large DAGs with tens of thousands of nodes. For a scalable solution, we later present a heuristic algorithm to decompose irregular DAGs into tree-like subgraphs (Sect. 5.3.1).

5.1.2 How to Read/Write the Inputs/Outputs?

A parallel datapath can operate without stalls only if the inputs can be provided and the outputs get consumed in every cycle. This could be enabled by reading/writing the parallel inputs/outputs as vectors from a vector register file. However, the irregularity of the DAG structure prevents such a regular, vector-based approach, and instead requires gathering/scattering data from seemingly random locations.


Fig. 5.3 Example of a DAG execution on a tree of PEs, causing irregular register accesses. (a) A sample DAG to be mapped on a tree of PEs with a banked register file. (b) Possible decomposition of the DAG along with the input/output variables. (c) Sequence of execution along with the state of the register file

Figure 5.3 illustrates this with a concrete example. Figure 5.3a shows a sample DAG and a target datapath consisting of a PE tree that is interfaced with a banked register file for parallel inputs/outputs. Figure 5.3b shows an example decomposition of the DAG into tree-shaped structures that can be sequentially mapped to the datapath. The outputs of some of the nodes (like node a) are required to be stored in the register file for later consumption, while others (like node d) are fully consumed within the datapath. Figure 5.3c shows the temporal execution with the evolving state of the register file. The DAG irregularity manifests in the form of irregular accesses to register banks as explained next. Irregular Bank Accesses The inputs and outputs of the PE tree access different banks over the cycles. For example, PE2 reads banks 2 and 3 at T0 and banks 1 and 3 at T1. The access pattern cannot be regularized by simply remapping the variables to different banks or remapping nodes to different PEs. Consider the input variables of node b, which are also consumed by c and d. At T0, they are consumed by one PE, while at T1, by two PEs, one each. It is not possible to map the variables among banks such that the PEs read the same banks at both T0 and T1. One alternative could


be to execute the nodes c and d in different cycles, but that would lead to idle PEs. Hence, for high PE utilization, some form of flexibility is required in interconnecting the inputs/outputs of the PE trees to the register banks. Irregular Addresses The PEs read different addresses from different banks in a cycle. This is caused by the fact that input variables consumed simultaneously can be generated in different cycles (e.g., the inputs of T3). Similarly, variables that are generated together can be consumed in different cycles (e.g., the outputs of T1). Thus, unlike in SIMD execution, not all the register banks use the same read/write addresses, requiring flexible addressing hardware. Replication Is Not a Solution It may appear that replicating a variable in multiple banks and at multiple addresses can solve both of the above problems for register reads. With such replication, a variable would be written as well as read multiple times. However, there are three problems:

• The register file can be severely underutilized due to replicated copies of the same data.
• The energy consumption increases due to more writes compared to Fig. 5.3's approach of single writes.
• The replication alleviates the irregular reads but exacerbates the irregularity of writes. Hence, it simply shifts the problem from reads to writes.

Therefore, DPU-v2 uses flexible interconnects to interface the datapath with the register file, and independent read/write addresses for every bank in the register file, to ease the gathering/scattering of data from irregular locations without relying on data replication.

5.1.3 How to Handle Bank Access Conflicts?

Despite having independent banks in the register file as discussed above, the irregular reads/writes can still cause conflicts while accessing the register banks. A bank with a single read and a single write port can read and write only one register in every cycle. If the execution demands reading/writing more than one register from the same bank in a cycle, a bank conflict occurs, which stalls the execution. In the first version of the DPU, these conflicts happened while accessing the global scratchpad banks; precisely, 43% of the load requests resulted in bank conflicts. DPU and other works targeting irregular accesses like [33, 132] attempt to hide these conflicts by (a) overlapping them with computational operations through aggressive prefetching, and/or (b) reordering the accesses that cause conflicts at runtime. However, these techniques require costly hardware structures like prefetching FIFOs, reordering buffers, etc. In contrast, DPU-v2 aims to eliminate the bank conflicts altogether at compile time. This becomes possible because the compiler can predict the parallel inputs/outputs required by the spatial datapath in every cycle. In DPU, the compute


Fig. 5.4 (a) A common approach for executing DAGs in parallel, in which the unpredictable irregular accesses happen to scratchpad/memory. (b) In DPU-v2 by pushing the interconnect closer to the datapath, the compiler is equipped to predict the irregular accesses and prevent bank conflicts

units operate asynchronously, as a stall in one unit does not stall the others. This prevents the compiler from predicting the exact timing of the irregular accesses to the global scratchpad banks (see Fig. 5.4). On the other hand, in DPU-v2, the irregular accesses from the spatial datapath are limited to the register banks (as discussed in the previous subsection, Sect. 5.1.2), while the memory/scratchpad is accessed in a regular pattern. In every cycle, the compiler knows which registers are accessed and can thereby also predict register bank conflicts. This way, the compiler can perform an appropriate bank mapping to eliminate these conflicts, which improves the throughput while avoiding costly hardware structures. In summary, the architecture template uses the following features to handle DAG irregularity:

• A spatial datapath with parallel PEs connected in a tree topology.
• An interface from the datapath to a register file with parallel inputs/outputs and an independent addressing scheme, via a flexible interconnect.
• Elimination of bank access conflicts at compile time to simplify the hardware.
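Because the instruction sequence is fully static, this conflict check reduces to a simple compile-time scan; the sketch below is an illustrative example (not the DPU-v2 compiler), assuming a precomputed variable-to-bank mapping.

from collections import Counter

def read_conflicts(cycle_reads, bank_of):
    """cycle_reads[t] = variables read in cycle t; bank_of maps variable -> bank."""
    conflicts = []
    for t, reads in enumerate(cycle_reads):
        per_bank = Counter(bank_of[v] for v in reads)
        if any(count > 1 for count in per_bank.values()):  # >1 read from a single-ported bank
            conflicts.append(t)
    return conflicts

The conflict-aware register allocation of Sect. 5.3.2 chooses the bank mapping so that this list stays (close to) empty, instead of resolving the conflicts with extra hardware at runtime.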


Fig. 5.5 The DPU-v2 architecture template consists of processing elements (PEs) connected in a tree topology. The trees are connected to parallel register banks with input and output interconnects. The datapath is pipelined according to the number of PE layers

5.2 DPU-v2 Architecture Template

This section describes the proposed DPU-v2 architecture template (Fig. 5.5) designed for irregular DAGs, which utilizes the design principles discussed in the previous section.

5.2.1 Parallel Tree of PEs

As shown in Fig. 5.5, the datapath consists of many parallel processing elements (PEs) to execute the DAG nodes.

• PE: Each PE can be configured to perform a basic arithmetic operation (+ and ×). Additionally, a PE can bypass one of its inputs to its output.
• Trees of PEs: PEs are interconnected in T parallel structures with a tree topology containing D layers, where the outputs of a PE layer are connected to the inputs of the next PE layer. This enables immediate reuse of intermediate results, avoiding register file accesses and the associated energy consumption.


• Pipeline stages: The PE outputs are registered, resulting in D pipeline stages in the trees.

5.2.2 Register File Architecture

The PE trees read and write data from a shared register file containing B parallel banks with R registers per bank (Fig. 5.5). The number of banks B is chosen such that there is one register bank for each input of the trees (i.e., B = T × 2^D). As such, the register file has enough bandwidth to feed the data into the trees in every cycle. However, due to DAG irregularity, the parallel inputs needed in a clock cycle typically do not reside at the same address in the different banks, as discussed in Sect. 5.1.2. To handle this irregularity, the banks are made independent in terms of read/write addressing. Each bank can read/write from/to any location, independent of the other banks. This flexibility comes at the overhead of additional instruction bits to encode these addresses. While this overhead is present for reads, it is alleviated for writes as discussed below. Automatic Write Address Generation To reduce the instruction size, a novel register-writing policy is used, alleviating the need to encode write addresses in instructions. The policy is to always write data to the empty location with the lowest address within a register bank. Thus, instructions do not control where a write happens, but the register bank itself chooses the appropriate location for the incoming data. Why Does the Policy Work? To perform correct execution with such an automatic write policy, the compiler should be able to predict at compile time the write addresses that will be chosen at runtime. Since our target DAGs are known at compile time and the PEs execute synchronously, the instruction execution sequence is deterministic. This enables the compiler to predict the write addresses that will be chosen in every cycle, given that the execution begins with a known state of the register banks (e.g., all the banks are empty). Hardware The policy requires that the occupancy state of every register is tracked. This is done via a valid bit for every register, which is set to 1 when the

Fig. 5.6 Automatic write address generation by tracking the occupancy status of registers with valid bits. A priority encoder chooses the empty location with the smallest address to write data



corresponding register contains valid data. Using these valid bits, the write address is computed with a priority encoder as shown in Fig. 5.6. A write to a register sets the respective valid bit. However, a read should not automatically reset the bit, as a variable residing in a register can be reused multiple times. The instructions should therefore indicate the last read to the register to reset the valid bit and let the register be subsequently available to write new data. This resetting is done by a valid_rst bit in the instruction, one for each bank. Thus, a multi-bit write address is replaced with a single valid_rst bit, reducing the width of the instructions. This automatic write policy results in a program size reduction of 30% on average.
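A minimal behavioral model of one register bank with this policy is sketched below (illustrative Python, not the RTL): a write always lands in the empty location with the lowest address, and a read can free the register via the valid_rst bit.

class RegisterBank:
    def __init__(self, num_regs):
        self.data = [0] * num_regs
        self.valid = [False] * num_regs

    def write(self, value):
        addr = self.valid.index(False)       # priority encoder: lowest empty slot
        self.data[addr] = value
        self.valid[addr] = True
        return addr                          # deterministic, so the compiler can predict it

    def read(self, addr, valid_rst=False):
        value = self.data[addr]
        if valid_rst:                        # the last read frees the register
            self.valid[addr] = False
        return value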

5.2.3 Datapath-Register Banks Connections

As discussed in Sect. 5.1.2, the connection of the datapath to the register banks requires some flexibility due to the irregularity of DAG edges. However, the design space for interconnecting B read ports to B inputs and #PE outputs to B write ports is huge. On one extreme, full crossbars can be used for both inputs and outputs (Fig. 5.7a). On the other extreme, one-to-one connections can be used (Fig. 5.7d). Compilation Considerations The interconnections not only have a hardware impact but also influence the compilation complexity. Crossbars significantly simplify the compilation as they decouple PE mapping and bank mapping. For example, for the design in Fig. 5.7a, when a DAG node is mapped to a PE, the bank mapping options for its input data and output results do not get restricted, as all the banks are accessible for reading and writing. In contrast, for Fig. 5.7d, when a DAG node is mapped to a PE, its output result can be stored to only one bank (or two in the case of the top PE), and the input data should be mapped to the banks the PE can access. Thus, the PE mapping and bank mapping get coupled. This coupling complicates the compilation because the mapping of one node to a PE can potentially impact the mapping of all the rest of the nodes.

Fig. 5.7 Different interconnection topologies and their impact on bank conflicts


Thus, using at least one crossbar, either at the input or the output, ensures that the mapping of one node only impacts a limited set of nodes. Furthermore, hardware synthesis results revealed that a crossbar only consumes 4% of the area and 9% of the power of the whole design (see Table 5.2 (input interconnect)). As a result, designs with at least one crossbar are considered for exploration (Fig. 5.7a–c), simplifying compilation without significant hardware overhead. The mapping algorithm, explained in detail in Sect. 5.3.2, is used to minimize the bank conflicts (Fig. 5.7e) for the different design options. The bank conflicts happen due to the following structural hazards: (1) two PEs simultaneously access the same bank (but banks have only one read and one write port), or (2) the input data (or the output result) is mapped to a bank that the PE cannot access, inducing a stall for copying data across banks. As expected, design (a) achieves minimal conflicts. Yet, design (b) is selected, which has a crossbar for the inputs, and each bank is connected to the output of one PE per layer. It induces only 1.4× higher conflicts due to the limited output connectivity, increasing the overall latency by 1% but reducing the overall power by 9%. As such, it achieves a better latency-power trade-off. Design (c) results in significantly higher conflicts without a considerable area/power advantage over (b). Design (d) is not evaluated further as it would incur even more conflicts than (c).

5.2.4 Load, Store, and Copy of Data

The register banks connect to an on-chip SRAM-based data memory with a read-write width of B words. The load/store from the data memory happens in the form of vectors (with a word-level enable mask) as shown in Fig. 5.8a. But this data can be written to/read from different addresses in the register banks. Note that the register write addresses for a load operation are automatically generated as described in Sect. 5.2.2, whereas the register read addresses for a store operation are encoded in the instructions. Furthermore, to handle bank conflicts, a copy instruction enables an arbitrary shuffle of data across banks using the crossbar as shown in Fig. 5.8b, implemented as depicted on the left of Fig. 5.5.
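The masked vector load/store behavior can be sketched as follows (an illustrative Python model under assumed interfaces, not the hardware's actual one; B, the function names, and the data layout are placeholders):

B = 4  # words per vector / number of banks (illustrative value)

def vector_load(data_row, enable_mask, banks):
    # Load one B-word row from a single data-memory address; each enabled word
    # goes to the next free register of its bank (write addresses are generated
    # automatically, mimicking the valid-bit scheme of Sect. 5.2.2).
    for b in range(B):
        if enable_mask[b]:
            banks[b].append(data_row[b])

def vector_store(read_addrs, enable_mask, banks):
    # Store one row back to memory; per-bank read addresses come from the instruction.
    return [banks[b][read_addrs[b]] if enable_mask[b] else None for b in range(B)]

banks = [[] for _ in range(B)]
vector_load([1.0, 2.0, 3.0, 4.0], [True, False, True, True], banks)
print(banks)                                             # words land in banks 0, 2, 3
print(vector_store([0, 0, 0, 0], [True, False, True, True], banks))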

5.2.5 Long, Variable-Length Instructions

The architecture is designed to be programmable for arbitrary DAGs, with a custom VLIW instruction set (Fig. 5.9a) that can configure the trees and crossbar, copy data from one register bank to another, or load/store from data memory. As different instructions encode different types of information, they have different lengths, depending on the hardware parameters D, B, and R. Figure 5.9a shows instruction lengths for a sample design configuration. For proper utilization of the instruction memory, the instructions are packed densely, without any bubbles (Fig. 5.9b).


Fig. 5.8 (a) Shows that vectors are loaded/stored from a single address in the data memory, but register banks are addressed independently. (b) Shows how data can be copied between banks via the input interconnect to avoid bank access conflict

Fig. 5.9 Variable-length instructions (in (a)) are densely packed in the memory (in (b)) and are appropriately aligned during decoding for stall-free execution

The instruction memory can supply IL bits in every cycle, which is the same as the size of the longest instruction. To handle the varying lengths, a shifter before the instruction decoder performs the appropriate alignment, ensuring that the next instruction is always fully available for decoding and that execution never stalls. It may appear that the long instructions would increase the memory footprint significantly, but in fact, the overall memory footprint decreases by 48% compared to the CSR representation (discussed further in Sect. 5.3.5). In summary, DPU-v2 contains a datapath with parallel PE trees connected to a banked register file with flexible interconnects. It is programmed with custom variable-length VLIW instructions, which are packed densely in the instruction memory.


The independent parameters of the architecture template (D, B, and R) are not fixed at this point and will be selected based on a design space exploration described in Sect. 5.4.

5.3 Compiler for DAG

Due to the nonconventional datapath with flexible interconnects and the VLIW instruction set, it is difficult to modify a conventional compiler for traditional processors to generate instructions for the proposed architecture. Hence, a targeted compiler is developed to fully utilize the PE trees despite the irregularity of unstructured DAGs. The compiler is designed to work for any values of the design parameters (i.e., D, B, and R), instead of targeting a fixed configuration of the architecture template. The compiler takes as input a DAG in any of the popular graph formats (i.e., all formats supported by the NetworkX package [65]) and generates an execution binary that can be directly programmed to DPU-v2. Since the structure of the DAG remains static across multiple executions (during inference with PC and for SpTRSV with multiple right-hand side vectors), the compilation is performed offline, and the DAG is unfolded at compile time to generate DAG-specific instructions. The major compilation steps are shown in Fig. 5.10 and discussed next.

5.3.1 Block Decomposition (Step 1)

Compilation begins by decomposing the input DAG, which is first converted to a binary DAG (containing two-input nodes only) by replacing every multi-input node with a tree of two-input nodes (a small sketch of this conversion follows the list below). This ensures that the nodes can be mapped directly to the two-input PEs. The binary DAG is decomposed into sets of nodes called blocks. A block is supposed to be a monolithic unit that can be executed with a single exec instruction (from Fig. 5.9a). The constraints and objectives for this decomposition step are as follows:
• Constraint A: The resulting graph of blocks should be acyclic for functional correctness. Figure 5.11a shows an example of a cyclic decomposition, in which it cannot be determined which block should be scheduled first.
• Constraint B: The nodes in a block should be spatially schedulable on the PE trees, considering the number of PEs available and their connectivity.
• Objective C: The PE trees should be maximally utilized.
• Objective D: The number of dependencies among blocks should be minimized to reduce read-after-write hazards during the pipelined execution. Figure 5.11b shows an example where the dependency of block D on B could be avoided by an improved combination of nodes in C, as the nodes in D are not actually dependent on B in the DAG.
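The binarization mentioned above can be sketched as follows (an illustrative Python/NetworkX sketch; the temporary-node naming is an assumption, not the compiler's actual implementation):

import itertools
import networkx as nx

def binarize(dag: nx.DiGraph) -> nx.DiGraph:
    # Replace every node with more than two inputs by a tree of two-input nodes.
    out = dag.copy()
    counter = itertools.count()
    for node in list(dag.nodes):
        preds = list(dag.predecessors(node))
        while len(preds) > 2:
            a, b = preds.pop(), preds.pop()
            tmp = f"tmp_{next(counter)}"       # temporary two-input node
            out.add_edge(a, tmp)
            out.add_edge(b, tmp)
            out.remove_edge(a, node)
            out.remove_edge(b, node)
            out.add_edge(tmp, node)
            preds.append(tmp)
    return out

g = nx.DiGraph([("a", "sum"), ("b", "sum"), ("c", "sum"), ("d", "sum")])
print(max(dict(binarize(g).in_degree()).values()))   # 2: maximum in-degree is now two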


Fig. 5.10 Steps involved in the custom compiler to generate optimized instructions for a given DAG

Fig. 5.11 (a) Cycles to be avoided during block generation. (b) Avoid block dependencies like the dependence of D on B. (c) Any subgraph with two-input nodes, one sink node, and with the longest path length less than the depth of a PE tree can be mapped to the PE tree by replicating certain nodes. (d) Subgraphs that can be combined into a block (possibly with permutations) for mapping to a datapath with one PE tree of depth 3


Algorithm 1: Step 1. Inputs: G(V, E), D, T

To achieve these objectives under the mentioned constraints, a greedy iterative algorithm (see Algorithm 1) is designed that constructs one block in every iteration as follows:
1. Search all the schedulable connected subgraphs. A subgraph is schedulable if all of its predecessor nodes (if any) are already assigned to blocks in previous iterations. Ensuring this schedulability property in every iteration results in blocks that satisfy constraint A. Furthermore, a subgraph is schedulable only if it can be completely mapped to a PE tree, which is needed for constraint B. Subgraphs to be considered for this schedulability check are constructed as follows: a node that is less than D distance away from the already mapped nodes (called curr_source_nodes) is considered as a sink node and, along with its unmapped ancestors, forms a subgraph. The check of whether a subgraph satisfies constraint B is simplified due to the tree topology of the PEs.


Any connected subgraph with only two-input nodes, exactly one sink node, and a longest path length less than D can be mapped to a PE tree of depth D. Figure 5.11c shows an example of a non-tree subgraph that satisfies these properties and can be mapped through replication. In this way, the find_schedulable_subg() function (line 6) finds all the schedulable subgraphs that have a given predecessor node as an input. The overall search for subgraphs is made efficient by keeping track of the schedulable subgraphs across iterations (in the set Dsch), to avoid re-searching the same subgraphs again. In every iteration, only the subgraphs that originate from the nodes mapped in the previous iteration are added to the set Dsch (lines 5–7), by setting curr_source_nodes appropriately (line 23).
2. Among the schedulable subgraphs, select a set to form a block. The previous step finds connected subgraphs that can be mapped to the datapath. As shown in Fig. 5.11d, multiple such subgraphs (with possibly different longest path lengths) can be combined into a block. To find the appropriate set of subgraphs to combine in a block, the fitness of a block is evaluated based on objectives C and D: (1) according to obj. C, blocks with more nodes are more fit; (2) quantifying obj. D is difficult. In practice, the dependencies across blocks decrease if the subgraphs combined in a block lie closer to each other in the DAG. Hence, a block is penalized according to the distance between the nodes in a block. The distance is approximated by the difference in the occurrences of the nodes during a depth-first traversal of the DAG (performed once at the beginning). These two metrics of fitness are used to add an appropriate subgraph to a block (lines 15–16) and to compare different valid blocks (lines 17–19). The process repeats until all the nodes are mapped to blocks. Note that nodes are not mapped to specific PEs yet, which is subsequently done in step 2.

Asymptotic Complexity
The number of iterations scales linearly with the number of DAG nodes for a given datapath. In each iteration, selecting a set of subgraphs to form a block scales linearly with the number of schedulable subgraphs (i.e., the size of Dsch), which in turn scales linearly with the number of DAG nodes in the worst case. As such, the complexity is O(N²), where N is the number of nodes.
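A much-simplified sketch of this step is given below (Python, illustrative only): it packs ready nodes (all predecessors already mapped) into blocks up to the capacity of the PE trees, and omits the depth-D subgraph growth and the fitness-based selection of Algorithm 1; the function and parameter names are placeholders.

import networkx as nx

def decompose_into_blocks(dag: nx.DiGraph, depth: int, num_trees: int):
    # Every iteration packs ready nodes (all predecessors already mapped in
    # earlier blocks) into one block, up to the capacity of the PE trees.
    capacity = num_trees * (2 ** depth - 1)   # nodes a full tree of depth D can hold
    mapped, blocks = set(), []
    while len(mapped) < dag.number_of_nodes():
        ready = [n for n in dag.nodes
                 if n not in mapped
                 and all(p in mapped for p in dag.predecessors(n))]
        block = ready[:capacity]
        mapped.update(block)
        blocks.append(block)
    return blocks

g = nx.DiGraph([("a", "c"), ("b", "c"), ("c", "e"), ("d", "e")])
print(decompose_into_blocks(g, depth=3, num_trees=2))
# [['a', 'b', 'd'], ['c'], ['e']] -- the blocks form an acyclic schedule (constraint A)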

5.3.2 PE and Register Bank Mapping (Step 2)

To generate an instruction for each block, every node of the block needs to be assigned to a hardware PE in the PE trees, and the source/destination of the inputs/outputs of a block should be mapped to register banks. The constraints and objectives of the mappings are as follows:
• Constraint E: Only one node can be mapped to one PE (note that the replication in Fig. 5.11c is achieved by splitting nodes into multiple temporary nodes). The mapping should be topologically consistent, which requires that if a node n is mapped to a PE p, then the predecessors and successors of n are also mapped to, respectively, the predecessors and successors of p.
• Constraint F: Two different inputs of a block should not be mapped to the same register bank, to prevent bank conflicts while reading.
• Constraint G: Two different outputs of a block should not be mapped to the same register bank, to prevent bank conflicts while writing.
• Constraint H: For the nodes whose results are stored as outputs of a block, the PE mapping and the bank mapping of that node should be compatible, i.e., the bank should be writable from that PE. This constraint arises due to the limited connectivity of PEs and banks in the output interconnect (Fig. 5.5). Note that such a constraint is not required for block inputs because the input interconnect is a crossbar.
• Objective I: Minimize bank conflicts, as every bank conflict will result in a stalling cycle for data copy.
• Objective J: Balance the distribution across register banks.

Due to constraint H, the PE mapping is performed in tandem with the register bank mapping, since assigning a node to a PE restricts the number of compatible banks and vice versa. Specifically, a greedy iterative algorithm is used, which performs the mappings for one node in every iteration as described in Algorithm 2. The algorithm keeps track of a set of compatible PEs (Sp) for every node and a set of compatible banks (Sb) for nodes that serve as inputs/outputs of blocks (and hence should be stored to the register file). These nodes involved in inputs/outputs (e.g., nodes a, b, and d in Fig. 5.12a) are called io_nodes in the algorithm. A PE is compatible if the topological consistency defined in constraint E can be satisfied for all the nodes in the block after the node is mapped to the PE. A bank is compatible if mapping the node to that bank does not cause a bank conflict with the already mapped nodes.

Node Order for Mapping
The mapping is done for io_nodes first. In every iteration, the node with the least number of compatible banks in Sb (to achieve objective I) is chosen for mapping, from anywhere in the DAG, ignoring the block boundaries. The runtime of this selection is made independent of the size of the DAG by using the Mnodes data structure (constructed in lines 9–12 of Algorithm 2) and searching in it in every iteration (lines 15–18).

Mapping
The chosen node is mapped to a compatible bank (chosen randomly to achieve objective J) if available, otherwise to the bank that leads to the least conflicts (lines 21–24). Subsequently, the node is mapped to a PE (lines 25–29) that can write to the chosen bank, if such a PE is compatible. If not, a compatible PE is chosen at random from Sp. Note that at least one compatible PE will always be left because, as explained in step 1 (Sect. 5.3.1), the blocks satisfy constraint B, ensuring the schedulability of nodes. Figure 5.12a shows the mapping of a node b to an appropriate PE and bank (in color). The following two updates happen to the Sp and Sb of the nodes affected by this mapping:


Algorithm 2: Step 2. Inputs: G(V, E), D, B, output_connectivity, B


Fig. 5.12 (a) An example of how the mapping of node b affects the compatible PEs and banks of the other nodes. (b) Our algorithm achieves considerably lower bank conflicts than random allocation. (c) and (d) show that the data allocation across register banks remains well-balanced

Intra-Block Compatibility Update
The PE and bank options for the unmapped nodes a, c, and d decrease due to the mapping of b in Fig. 5.12a, according to constraints E and G. Sp and Sb for these nodes are updated accordingly (lines 30–33).


Table 5.1 Statistics of the benchmarked DAGs

Type          Workload   Nodes (n)   Longest path (l)   Parallelism (n/l)   Compile time (min.)
(a) PC        tretail    9k          49                 180                 1
              mnist      10k         26                 400                 2
              nltcs      14k         27                 504                 3
              msnbc      48k         28                 1.7k                28
              msweb      51k         73                 698                 31
              bnetflix   55k         53                 1k                  35
(b) SpTRSV    bp_200     8k          139                60                  1
              west2021   10k         136                74                  1
              sieber     23k         242                94                  6
              jagmesh4   44k         215                203                 12
              rdb968     51k         278                181                 15
              dw2048     79k         929                85                  16
(c) Large PC  pigs       0.6M        90                 6.9k                121
              andes      0.7M        84                 8.6k                145
              munin      3.1M        337                9.1k                850
              mildew     3.3M        176                18.8k               910

Inter-Block Compatibility Update
Nodes from other blocks are also affected by the bank assignment to b. The nodes e and f in Fig. 5.12a are consumed together with b and cannot be allocated to the same bank (to satisfy constraint F). Hence, b's bank is removed from their Sb (line 34). After these compatibility updates, a new node is selected in the next iteration, and the process repeats until all the io_nodes are mapped. Next, the non-io_nodes (node c in Fig. 5.12a) are mapped to PEs in the same way as discussed earlier, but without the steps for bank mapping.

Asymptotic Complexity
The work done in the intra-block update depends only on the size of the datapath and not on the properties of the DAG, while the work done in the inter-block update scales with the outdegree of the node being assigned. Since the number of iterations is the same as the number of nodes, the algorithm complexity is O(N × Δ(G)), where N is the number of DAG nodes and Δ(G) is the maximum outdegree of the DAG.

Figure 5.12b shows that our algorithm achieves a 292× reduction in bank conflicts compared to random bank allocation, confirming that objective I is met. Figure 5.12c shows the number of active registers across banks during the execution of a workload (from Table 5.1), demonstrating a well-balanced distribution, in line with objective J. The spilling of registers for a limited bank size is discussed later in this section. The bank conflicts are resolved with the copy instruction from Fig. 5.9, which copies the data to an appropriate bank, eliminating the need to handle bank conflicts in hardware. Finally, an instruction list is constructed with the exec and copy instructions, along with the load instructions to fetch inputs from the data memory for the first time.
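A simplified sketch of the conflict-aware bank mapping is shown below (Python, illustrative; it captures only constraints F/G and objectives I/J, and omits the PE mapping and constraint H of Algorithm 2; all names are placeholders):

import random

def map_banks(io_nodes, consumed_together, produced_together, num_banks, seed=0):
    # consumed_together / produced_together map each node to the nodes it shares a
    # block input / output with; those nodes must not land in the same bank.
    rng = random.Random(seed)
    banks = {}
    compatible = {n: set(range(num_banks)) for n in io_nodes}
    while compatible:
        # Pick the most constrained node first (objective I).
        node = min(compatible, key=lambda n: len(compatible[n]))
        options = compatible.pop(node)
        # Random choice among compatible banks (objective J); fall back to any bank
        # if none is conflict-free (a stall-inducing conflict in the real machine).
        bank = rng.choice(sorted(options)) if options else rng.randrange(num_banks)
        banks[node] = bank
        # Conflicting nodes lose this bank from their compatible sets.
        for other in consumed_together.get(node, []) + produced_together.get(node, []):
            compatible.get(other, set()).discard(bank)
    return banks

# Toy example: a and b are read together, b and c are written together.
print(map_banks(["a", "b", "c"],
                consumed_together={"a": ["b"], "b": ["a"]},
                produced_together={"b": ["c"], "c": ["b"]},
                num_banks=2))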


Impact of the Crossbar
Note that the input crossbar significantly simplifies the inter-block update. It decouples the PE allocations of a block from the bank allocations of its inputs, and limits constraint H to outputs only. This ensures that the inter-block updates remain limited to the inputs of the successor blocks only and do not propagate to the whole DAG. Without a crossbar, the inter-block update could possibly affect all the nodes in the DAG, quickly reducing the number of compatible banks for the rest of the nodes in every iteration.

5.3.3 Pipeline-Aware Reordering (Step 3)

As the datapath has D + 1 pipeline stages, the instruction list (generated in step 2) is reordered to ensure that dependent instructions are at least D + 1 instructions apart, to prevent read-after-write (RAW) hazards. The reordering involves a search for independent instructions to insert between dependent instructions. This search is limited to a fixed-size window of succeeding instructions (300 in our experiments) to make the runtime scale linearly with the size of the instruction list. Subsequently, no-operation (nop) instructions are inserted for the unresolved hazards.
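A compact sketch of such a reordering pass is given below (Python, illustrative; it assumes the input list is already in a valid dependency order, as produced by step 2, and uses placeholder names):

def reorder_for_pipeline(instrs, deps, depth, window=300):
    # Dependent instructions must end up at least `depth` slots apart
    # (depth = D + 1 here); independent instructions from a bounded look-ahead
    # window are pulled in between, and 'nop's fill any remaining gap.
    scheduled, pos = [], {}
    pending = list(instrs)            # must already be in a valid dependency order
    while pending:
        placed = False
        for i, ins in enumerate(pending[:window]):
            dep_list = deps.get(ins, [])
            if all(d in pos for d in dep_list):
                ready_at = max((pos[d] + depth for d in dep_list), default=0)
                if ready_at <= len(scheduled):
                    pos[ins] = len(scheduled)
                    scheduled.append(ins)
                    pending.pop(i)
                    placed = True
                    break
        if not placed:
            scheduled.append("nop")   # unresolved hazard within the window
    return scheduled

instrs = ["i0", "i1", "i2", "i3"]
deps = {"i1": ["i0"], "i3": ["i2"]}   # i1 reads i0's result, i3 reads i2's
print(reorder_for_pipeline(instrs, deps, depth=2))   # ['i0', 'i2', 'i1', 'i3']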

5.3.4 Spilling from Register File (Step 4)

Data must be spilled from the register file if the intermediate results do not fit. Given the schedule of execution, a live-range analysis is performed to determine when spilling is required, and store instructions are inserted appropriately. The spilled data are later loaded back by inserting load instructions before they are supposed to be consumed, in a way that avoids new RAW pipeline hazards. The live-range analysis and the insertion of intermediate loads/stores scale linearly with the size of the DAG. Figure 5.12c and d show the register occupancy profile for a workload without and with register spilling, respectively, for R = 64. The final output of the compilation process is a list of instructions to be executed on DPU-v2 for the given DAG, which can be directly translated into a binary program for execution. The distribution of the different types of instructions is shown in Fig. 5.15 in the next section.
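The live-range analysis behind this step can be sketched as follows (Python, illustrative; it only computes live ranges and register pressure to show where spills become necessary, and does not insert the actual store/load instructions; names are placeholders):

def live_ranges(schedule, consumers):
    # schedule[t] is the value produced at cycle t; `consumers` maps a value to
    # the schedule positions that read it. A value is live from its definition
    # until its last use; spills are needed wherever pressure exceeds R.
    last_use = {v: max(pos) for v, pos in consumers.items()}
    pressure, live = [], set()
    for t, value in enumerate(schedule):
        live.add(value)
        pressure.append(len(live))
        live -= {v for v in live if last_use.get(v, t) <= t}   # dead after cycle t
    return last_use, pressure

schedule = ["v0", "v1", "v2", "v3"]
consumers = {"v0": [3], "v1": [2], "v2": [3]}
last_use, pressure = live_ranges(schedule, consumers)
print(pressure)   # [1, 2, 3, 3]; store/load pairs are inserted where pressure > R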

5.3.5 Reduction in Memory Footprint

The list of instructions generated by the compiler statically encodes the entire DAG structure, leading to a larger instruction footprint compared to the conventional approach of a for loop iterating over the DAG stored in a compressed sparse row (CSR)-like format and performing indirect memory accesses.


However, note that the overall memory footprint (instructions + data) of our approach is in fact lower than that of the CSR data structure, because the statically generated instructions can encode the edge connectivity information with fewer bits, for the following reasons:
• For the DAG edges that are mapped to PE-PE connections, the addresses for the source and destination of the edges need not be stored, in contrast to the CSR data structure, which stores a pointer for every edge.
• Due to the static mapping of variables to the register file, the address for the source/destination of the edges that are mapped to PE-register or register-PE connections can be specified with a register address (= 11b in our final design configuration), instead of being encoded in a global address space of 32/64b as in the CSR data structure.
For the target workloads, the total memory footprint of instructions and data is 48% smaller than that of the CSR data structure, which is not needed at all in our approach.
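The effect can be illustrated with a back-of-the-envelope calculation (the numbers below are assumptions for illustration, not measurements; the real 48% figure also accounts for opcodes, data, and the PE-tree structure):

# Illustrative comparison of edge-address storage only (all numbers assumed).
edges          = 100_000
pe_to_pe_frac  = 0.5    # fraction of edges absorbed into PE-PE connections (assumed)
reg_addr_bits  = 11     # register address width in the final design configuration
csr_index_bits = 32     # per-edge index in a CSR-like structure (assumed)

csr_bits   = edges * csr_index_bits                    # row pointers ignored for simplicity
instr_bits = edges * (1 - pe_to_pe_frac) * reg_addr_bits
print(f"CSR-like edge addresses:       {csr_bits / 8 / 1024:.0f} KiB")
print(f"Instruction-encoded addresses: {instr_bits / 8 / 1024:.0f} KiB")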

5.4 Design Space Exploration

Experiments are designed to find the most energy-efficient hardware configuration (i.e., D, B, and R) of DPU-v2 for a suite of target DAGs from PC and SpTRSV workloads (shown in Table 5.1), and to compare it with state-of-the-art implementations. The PCs are from the standard benchmark of density-estimation applications from [88] and from [143], while sparse matrices from the SuiteSparse collection of real applications [42] are used for SpTRSV.

5.4.1 The Most-Efficient Design Configuration

To reliably estimate the energy impact of design choices, a parameterized Verilog model is developed, which is used to synthesize gate-level netlists in a 28 nm CMOS technology. The design space exploration is done by varying the parameters over the following sets of values: D in [1, 2, 3], B in [8, 16, 32, 64], and R in [16, 32, 64, 128], leading to 48 combinations. For each of the design points, the energy is computed by mapping actual workloads ((a) and (b) in Table 5.1) with the compiler and annotating the resulting switching activities of the workloads from gate-level netlist simulations with a target frequency of 300 MHz. The data and instruction memories are 512 KB each. The mean latency, energy, and energy-delay product (EDP) per operation, averaged over the workloads, are plotted in Fig. 5.13a, b, and c, respectively. As expected, the minimum-latency design point (D = 3, B = 64, R = 128) has the highest values of D, B, and R, i.e., it is the one with the largest area. Note that increasing R beyond 32 gives diminishing improvement, as the active set of data is able to fit within the register file.


Fig. 5.13 Design space exploration to identify the min-EDP design. The optimal points are highlighted with large markers. (a) Latency. (b) Energy. (c) Energy-delay product

Unlike latency, the minimum-energy point (D = 3, B = 16, R = 64) uses a smaller B (and hence also fewer PE trees) and a smaller R. Notice that increasing D improves latency without consuming additional power, resulting in an energy improvement as well. This demonstrates the effectiveness of the tree-based spatial datapath. On the other hand, increasing B decreases latency but with proportionally more power consumption, because the utilization of the increasingly parallel datapath gradually decreases, shifting the minimum-energy point toward a lower value of B.


Fig. 5.14 Latency vs. Energy plot with a constant EDP curve passing through the min-EDP point, colored according to D, B, and R in (a), (c), and (d). (b) Shows the inset of (a) that highlights the operating points closer to the min-EDP point

Minimum EDP
The minimum energy-delay product (EDP; Fig. 5.13c) is achieved at (D = 3, B = 64, R = 32). Figure 5.14 shows latency vs. energy charts for a different perspective on the design space, along with a constant-EDP curve passing through the min-EDP point. The slope of the curve indicates that the latency has more variation than the energy.

Breakdown of Area, Power, and Latency
Table 5.2 shows the distribution of area and power across the different modules for the min-EDP design. Memories consume most of the area and 32% of the power.


Table 5.2 Area and power breakdown of DPU-v2

                               Area (mm²)   %    Power (mW)   %
Datapath
  PEs                          0.13         4    11.9         11
  Pipelining registers         0.04         1     8.0          7
  Input interconnect           0.14         4    10.0          9
  Output interconnect          0.01         0     0.5          1
Register file
  Banks                        0.35        11    24.0         22
  Wr addr generator            0.03         1     7.8          7
Control
  Instr fetch                  0.06         2     7.0          6
  Decode                       0.04         1     2.6          2
  Pipelining registers         0.01         0     2.7          2
Instruction memory (512KB)     1.20        38    27.7         26
Data memory (512KB)            1.20        38     6.7          6
Total                          3.20             108.9

Fig. 5.15 Breakdown of instructions

The datapath and the register file consume around equal power: 28% and 29%, respectively. Figure 5.15 shows the breakdown of the different categories of instructions across the different workloads executed on this architecture instantiation.

Compilation Time
The compilation times of the DAGs are reported in Table 5.1 for the min-EDP design.


For the large PCs, the decomposition into blocks (step 1) becomes too slow, so the DAG is first coarsely decomposed into partitions of 20k nodes each, using the S3 technique of GRAPHOPT (which scales linearly with the DAG size; see Sect. 3.2.3), and then each partition is decomposed independently into blocks.

5.5 State-of-the-Art Comparison

The proposed processor is benchmarked against DPU, the sparse processor SPU [33], and optimized CPU and GPU baselines:
• CPU: GRAPHOPT's multithreaded results benchmarked on an Intel(R) Xeon Gold 6154 CPU are used.
• GPU: The SpTRSV implementation [99] from the cuSPARSE library is benchmarked on an RTX 2080Ti GPU. In the absence of a GPU library for PC, we implemented a layer-wise parallelization in CUDA [109] that is inspired by the cuSPARSE SpTRSV implementation.

5.5.1 Comparison Using PC and SpTRSV

For the PC and SpTRSV workloads in Table 5.1a and b, the performance of the obtained min-EDP configuration of our processor is compared with DPU, CPU, and GPU.

DPU
For a fair comparison, a DPU with a similar area, number of registers, arithmetic representation, and data memory bandwidth (76.8 GB/s) as the min-EDP DPU-v2 is synthesized. This equivalent DPU has 32 compute units, as opposed to the 64 units described in Chap. 4 (see Table 5.4 for more details).

Results
Figure 5.16a shows the throughput of the individual workloads, and Table 5.3 summarizes the results. DPU-v2 outperforms the DPU with 32 units for all the workloads except bnetflix and sieber, with an average speedup of 1.35×, 4.2×, and 10.5× over DPU, CPU, and GPU, respectively, while achieving a 15% better EDP than DPU.

Analysis
Although operating at higher frequencies, the CPU and GPU underperform the specialized architectures due to severe inefficiency in their cache hierarchies caused by the irregular fine-grained accesses. Furthermore, due to the relatively small size of the workloads, there is not enough parallel work to amortize the synchronization and communication overheads of multiple cores. Despite the 15% EDP improvement, DPU-v2 achieves only 35% higher throughput while using 75% more PEs than DPU, leading to lower energy efficiency. To identify the cause of this underutilization of PEs, further benchmarking is done in Sect. 5.5.3.


Fig. 5.16 Throughput for every workload

5.5.2 Comparison Using Large PCs

To demonstrate the scalability of this work, a large configuration of the proposed processor is benchmarked with large PCs of up to 3.3M nodes (Table 5.1c) and compared with the sparse processing unit (SPU) [33], the CPU baseline from the SPU work (CPU_SPU), CPU, and GPU:
• SPU: SPU is a coarse-grained reconfigurable array (CGRA)-like architecture with a flexible spatial datapath. Instead of treating a PC as a large DAG, the iterative for-loop-based computation of a PC described in Sect. 1.4.1 of the introduction chapter is mapped to the spatial datapath. This entails mapping the small computational DAG inside the for loop to the spatial datapath, instead of unrolling the loop and mapping the large DAG as done for DPU-v2. The throughput of SPU is estimated based on the speedups reported over its CPU baseline.
• CPU_SPU: For estimating the throughput of SPU, the CPU baseline from the SPU work is benchmarked.²
• DPU-v2 (L): For the large workloads, our processor is synthesized with a larger on-chip data memory (2 MB) and 256 registers per bank. Unlike the previous experiments, the instructions no longer fit in the on-chip memory and are streamed from an external interface with a bandwidth of 64 GB/s. Since SPU assumes a 256 GB/s memory interface and performs batch execution, our system is benchmarked with 4 cores (with a 64 GB/s requirement per core) for a fair comparison.

² The CPU_SPU baseline code is provided by the SPU authors.

Table 5.3 Performance comparison with other platforms

Workloads: (a) and (b) from Table 5.1
                      DPU-v2     DPU equivalent   DPU 64CU   CPU [141]   GPU [99]
Technology            28 nm      28 nm            28 nm      14 nm       12 nm
Area (mm²)            3.2        3.6              3.7        NA          754
Freq (GHz)            0.3        0.3              0.3        3           1.35
Arithmetic repr.      32b float  32b float        –          –           –
Mem BW (GB/s)         –          –                –          120         616
Throughput (GOPS)     4.2        3.1              5.3        1.2         0.4
Speedup               3.5×       2.6×             4.4×       1×          0.3×
Power (W)             0.11       0.07             0.11       55          98
Efficiency (GOPS/W)   38.2       44.3             48.1       0.02        0.004
EDP (pJ×ns)           6.0        7.1              4.0        38k         1M

Workloads: (c) from Table 5.1
                      DPU-v2 (L)  SPU [33]   CPU_SPU [33]   CPU [141]   GPU
Technology            28 nm       28 nm      14 nm          14 nm       12 nm
Area (mm²)            40.4        36.6       NA             NA          754
Freq (GHz)            0.3         NA         3              3           1.35
Arithmetic repr.      32b float   –          –              –           –
Mem BW (GB/s)         256         256        120            120         616
Throughput (GOPS)     34.6        22.2ᵃ      1.7            1.8         4.6
Speedup               20.7×       13.3×      1×             1.1×        2.8×
Power (W)             1.1         16         61             65          155
Efficiency (GOPS/W)   31.5        1.4        0.03           0.03        0.03
EDP (pJ×ns)           1.0         57.4       36k            27k         9k

ᵃ Estimated based on the speedups over CPU_SPU


The parallel cores can either perform batch execution (used for benchmarking) or execute different DAGs. As in the SPU experiments, the power consumption of the external memory and memory controller is not included in the measurements.

Results
The per-workload throughput is shown in Fig. 5.16b, and the overall results are summarized in Table 5.3. With a similar area and external bandwidth consumption, the proposed processor outperforms the prior work, with an overall speedup of 1.6×, 20.7×, 19.2×, and 7.5× over SPU, CPU_SPU, CPU, and GPU, respectively. Furthermore, the speedup is achieved at a significantly lower power (1.1 W) compared to SPU (16 W), greatly improving the EDP.

Analysis
The large DAGs are able to utilize the GPU better, leading to higher throughput than the CPU, but also at a higher power consumption. DPU-v2 (L) consumes significantly lower power than SPU due to the following main architectural differences:
• SPU does not have a register file, requiring significantly more SRAM accesses compared to this work.
• SPU aggressively reorders memory requests in hardware to properly utilize the on-chip memory bandwidth in the presence of irregular requests, at the expense of higher hardware complexity. This work, on the other hand, largely eliminates bank conflicts altogether by limiting the irregular accesses to register banks and through advanced bank mapping during compilation.
This way, the hardware-software co-optimization presented for DPU-v2 results in higher performance and better energy efficiency than SPU, CPU, and GPU. Detailed benchmarking is done in the next section to compare DPU-v2 and DPU further.

5.5.3 Detailed Comparison of DPU-v2 and DPU

This section examines the execution profiles of DPU-v2 and DPU to further analyze the trade-offs between the two approaches. The specifications of the processors are the same as those used in the previous section, shown in detail in Table 5.4.

Table 5.4 Specification of the processors used for comparison

                                       DPU-v2                  DPU
Configuration                          D = 3, B = 64, R = 32   Number of CUs = 32
Number of PEs                          56                      32
Number of registers                    1024                    1024
Number of register file ports          64                      64
Number of SRAM ports in data memory    64                      64
Total instruction memory bandwidth     38 GB/s                 77 GB/s


Fig. 5.17 How to read the plot in Fig. 5.18: This figure describes the different parts of the bar charts in Fig. 5.18. The two bar charts for each processor show the status of the datapath and the data memory during the execution, exhibiting the overlap of active/inactive phases. Note: The bars in this figure are just for illustration. See Fig. 5.18 for proper bar heights

DPU-v2 contains more PEs due to the tree-based datapath, for the same number of register read ports. Furthermore, DPU-v2 requires lower instruction bandwidth due to the automatic write address generation and the tree-based datapath (as explained in Sects. 5.2.2 and 5.3.5). The datapath and the memory of the processors are profiled and plotted side by side to illustrate the (non-)overlap of (in)active execution phases (see Figs. 5.17 and 5.18). The key highlights from these profiles are as follows:
• Underutilized PE trees in DPU-v2 (see A in Fig. 5.18): DPU-v2 spends 28% more cycles executing operations on PEs compared to DPU (the blue part of the bars) despite using more PEs than DPU (56 vs. 32). This indicates an inefficiency in step 1 of the DPU-v2 compiler, which decomposes a DAG into blocks, leading to the underutilization of the PE trees. DPU-v2 even ends up performing worse for one of the workloads, bnetflix, for this reason (see B). This inefficiency in step 1 boils down to its inability to simultaneously achieve objectives C (fewer blocks) and D (fewer dependencies between blocks). If step 1 only focused on objective C, it could find blocks with high PE utilization (hence a smaller blue part in the bar chart), but would suffer from more no-operation bubbles due to pipeline RAW hazards, leading to lower overall performance. As such, step 1 balances the two objectives by quantifying the fitness for each of the objectives and maximizing their weighted sum.


Fig. 5.18 Detailed comparison of the execution phases of DPU-v2 and DPU (see Fig. 5.17 to understand the phases). DPU incurs higher stalls due to bank access conflicts and higher memory transactions than DPU-v2, but 57% of these are overlapped with active PE execution because of the decoupled instructions

However, as explained in Sect. 5.3.1, quantifying the fitness of a block for objective D is nontrivial and is currently approximated. Future work to improve step 1 should focus on a better quantification of the objective-D fitness and on constructing blocks that achieve objective D without sacrificing objective C.
• Parallel PE and memory operations in DPU: On average, 57% of the memory operations and stalls in DPU (the green and red parts) are overlapped with the active execution of PEs because of the decoupling of instructions, which allows parallel execution. DPU-v2 does not have this overlap due to its in-order execution.
• Reduced bank access conflicts in DPU-v2: DPU-v2 incurs 85% fewer stalls due to bank access conflicts (red parts), showing that step 2 of the DPU-v2 compiler is able to minimize the impact of irregular bank accesses. However, as discussed in the previous point, DPU hides a significant number of these stalls by overlapping them with active PE execution, reducing the penalty of these conflicts.
• Higher memory transactions in DPU: DPU performs 66% more memory loads/stores compared to DPU-v2, due to the sharing of data via the global scratchpad instead of a shared register file as in DPU-v2. This problem is especially severe for workloads with low parallelism (see C), because of the frequent barriers and sharing of data needed to keep all the PEs busy. Note, however, that as discussed earlier, DPU partially mitigates the impact of these additional memory transactions by performing them in parallel with the execution of the PEs with the help of instruction decoupling.


In conclusion,
• although the compiler-centric DPU-v2 approach should ideally achieve a much higher speedup, in practice the optimization steps become highly complicated and lead to suboptimal results, as evident from the underutilization of the PE trees. Yet, the compiler is able to significantly reduce bank access conflicts.
• the decoupling of instructions in DPU helps in overlapping a large proportion of the memory operations with datapath execution, nullifying the gains made by DPU-v2. To address this shortcoming of DPU-v2, its in-order execution should be extended with DPU's instruction decoupling, getting the best of both designs.
• for workloads with low parallelism, the shared register file approach of DPU-v2 is much better than the sharing of data via global scratchpads in DPU.
Overall, the compiler-driven approach of DPU-v2 achieves better results (latency and EDP) than DPU, and can be extended with the decoupling of instructions for even better results.

5.6 Additional Related Works

Specialized Dataflow
MAERI [85], an architecture with flexible spatial dataflow for neural networks, is the most similar to this work, as it also uses a tree-based datapath of PEs. However, the controller and the interconnection networks are designed specifically for neural networks and do not natively support arbitrary DAGs. Due to the regularity in NNs, the mapping of operations and the routing of operands are significantly simpler than for arbitrary DAGs. The regularity also limits bank conflicts, which get exacerbated by irregularity. Clark et al. proposed approaches related to this work, in which frequently recurring subgraphs are identified in a DAG [27, 28] to determine the most promising datapath structure. However, the number of ports of the datapath (four inputs, two outputs) is severely limited due to the use of a monolithic register file, limiting hardware parallelism. This work addresses that by using a banked register file and appropriate interconnections, allowing a highly parallel datapath with up to 64 read/write ports. Chainsaw [142] decomposes DAGs into chains instead of trees, which is also a promising alternative as multi-input nodes can also be unfolded as chains. However, the inter-chain communication, which is frequent for irregular DAGs, happens via a central register file accessed with a bus. This would become a major bottleneck, as Fig. 5.7 shows that the multi-ported register file and high-bandwidth interconnection used in this work are critical for good performance. SEED [107] combines a dataflow architecture with a general-purpose processor. It focuses on small DAGs within nested loops, which typically contain a high proportion of conditional nodes, while our work focuses on large DAGs, requiring a different approach based on high-bandwidth interconnects and a banked register file.


Operand Networks
The input/output interconnects of our work can be classified as scalar operand networks as discussed in [148]; however, we closely couple the operand networks with a spatial tree-based datapath, which significantly complicates the spatial and temporal mapping. The conflict-avoiding bank mapping under the networking constraints is also novel.

FPGA
The work in [110] targets PCs through fully spatial mapping on an FPGA. However, since large PCs cannot be completely spatially mapped, their approach is limited to small DAGs with fewer than 500 nodes.

CGRA
Coarse-grained reconfigurable array (CGRA) architectures like HRL [56], Plasticine [125], DySER [60], and CGRA Express [116] can execute DAGs with a spatial datapath. However, existing CGRAs suffer from several problems:
• Most CGRAs either have a local register file within a PE or do not have a register file at all. For example, DySER does not have a register file in the PEs, and the inputs/outputs are sequentially accessed from a global register file, which would severely limit the throughput. CGRA Express [116] uses a local register file per PE and also has register bypass networks to dynamically fuse operations, with the same goal as the PE trees in this work. An advantage of using PE trees is that X PEs can operate with X + 1 register read ports, while in CGRA Express, X PEs would require 2X ports. Furthermore, the experiments in Sect. 5.2.3 show that a one-to-one connection of PEs and banks would lead to frequent conflicts, revealing that local register files (similar to a one-to-one connection) would be insufficient and that more complex connectivity between the PEs and register banks is essential for good performance.
• The spatial connectivity of CGRAs is typically not reconfigured every cycle, but is fixed for the entire execution. Given that a large DAG cannot be entirely mapped spatially, it would have to be decomposed into one recurring pattern, which is difficult to find due to irregularity.
• CGRAs normally use a 2D mesh. But, as described in Sect. 5.1.1, unstructured DAGs cannot fully utilize a 2D array.

Graph Analytics
In recent years, several accelerators like Fifer [103], PolyGraph [34], and Graphicionado [66], [61], with hardware support for irregularity, have been proposed for graph kernels like traversal, node ranking, clustering, etc. In graph analytics, the ordering of nodes/edges cannot be determined at compile time, limiting the scope of compile-time optimizations. On the other hand, the node ordering in our target computational DAGs is fixed based on the dependencies, which are known at compile time. This fact is utilized extensively in this work to perform the mapping of nodes to PEs, conflict-aware bank allocation, pipeline-aware reordering, etc. This way, the hardware is simplified by assigning several responsibilities to the compiler, but general graph analytic operations are excluded from the scope of this work.


Sparse Linear Algebra
A growing number of hardware solutions are being designed for sparse linear algebra, like Sparse-TPU [70], SpArch [173], SparseP [58], SIGMA [128], etc. Some specifically target sparsity in deep learning, e.g., SNAP [174] and Sticker [170], [39]. These works mainly focus on sparse matrix-matrix or matrix-vector multiplications (SpMM and SpMV). But SpTRSV has a different kind of parallelism compared to SpMM and SpMV due to its inductive nature, which is discussed in detail in REVEL [162]. Put simply, in SpMV and SpMM, outputs only depend on inputs, while in SpTRSV, outputs can depend on previous outputs, creating possibly long chains of producer-consumer dependencies. REVEL [162] provides a solution for dense matrices. Sparsity exacerbates the parallelization difficulties by introducing irregularity, which is addressed in this work. There remains scope for a more general solution that does not rely on the DAG-specific (in other words, sparsity-pattern-specific) compilation developed in this work, allowing the sparsity pattern to change with every execution.

Compilers
The greedy block generation in step 1 (Sect. 5.3.1) is closely related to the work in [26] and its advanced algorithm that can achieve optimal mapping for a general spatial datapath. However, its experiments reported a modest speedup of 10% over greedy heuristics, suggesting limited opportunity for further improvement over this work. CGRA mappers like [15, 21, 68, 108, 155], and [106] can map a DAG to generic spatial datapaths, including trees. However, they can only handle relatively small DAGs with up to 500 nodes, which are supposed to be fully mapped to the datapath. In contrast, this work can handle more than 100k nodes by limiting the focus to trees. Moreover, those works usually consider a local register file or none at all, making them unsuitable for a global, banked register file. Like this work, [32] also developed a DAG mapping algorithm for an architecture with a tree of accelerator blocks (instead of PEs) [31], but the similarity ends there. Since the target was of a larger hardware granularity, i.e., accelerator blocks instead of PEs, it did not have to consider the constraints of a banked register file connected through a custom interconnection topology, like this work. VLIW compilers that consider partitioned register files, like [49, 77], and [113], share several similarities with the compiler of this work. Specifically, the bottom-up greedy (BUG) algorithm [49] can compile for independent register banks and arbitrary interconnections between the banks and PEs. Yet, it enforces the limitation that the output of a PE cannot be connected to another PE but only to a register bank. This limits the design space to an array of PEs (i.e., our template with D = 1) and excludes the PE trees used in this work. As noted in [49], the algorithm cannot be easily modified to relax this constraint. Our compiler, on the other hand, targets PE trees but limits the interconnection design space to have at least one crossbar.


5.7 Conclusion

This chapter presents DPU-v2, a specialized processor template designed for the energy-efficient parallel execution of irregular DAGs. The architecture is equipped with a tree-based spatial datapath of processing elements enabling immediate data reuse. Instead of a multi-ported global register file, parallel register banks with independent addressing provide the required bandwidth. The instruction overhead of independent register addresses is eased with an automatic writing scheme. Optimized interconnect networks between the datapath and the register banks enable the flexible routing of data required for irregular data accesses. The architecture is made programmable with a variable-length VLIW-like instruction set. A targeted compiler is developed to map DAGs to DPU-v2, which maximizes the datapath utilization while minimizing stalls due to bank conflicts and pipeline hazards. This way, a cohesive hardware-software co-optimization approach is developed to address the complexities arising from irregularity and to simplify the hardware for energy efficiency. Finally, a design space exploration is performed to identify the configuration with minimal EDP. The optimal design achieved a speedup over an equivalent DPU, SPU, CPU, and GPU while consuming lower EDP. DPU-v2 can be augmented with the optimized features of DPU to further improve throughput. Thus, this chapter demonstrates the effectiveness of hardware-software co-optimization for irregular DAG execution, enabling emerging AI and sparse linear algebra applications.

Chapter 6

Conclusions and Future Work

The overarching goal of this book is to develop the critical pieces of the software-hardware stack for the emerging AI and sparse linear algebra workloads. The book proposed a suitable energy-efficient data representation for the target workloads, developed tools to perform compile-time optimizations for effective parallelization, and designed optimized hardware architectures. The feasibility of the hardware innovations is validated through physical implementation in a 28 nm CMOS technology. The experimental results demonstrate that the proposed techniques achieve significantly higher throughput and energy efficiency than state-of-the-art implementations.

6.1 Contributions and Conclusions

The main book contributions addressing the open research questions can be summarized as follows:

Research Question 1: What Type of Data Representation Is Suitable?
Chapter 2 presents an analytical framework called PROBLP that can identify the most energy-efficient data representation for a given PC, the type of probabilistic query to be performed, and the error tolerance of the application. We achieved this by developing error bounds for arithmetic operations (addition and multiplication) that utilize the properties of PCs. Based on the selected data representation, PROBLP generates a fully pipelined spatial hardware description of the PC that can be mapped to an FPGA. The experiments show that the chosen data representation is up to 67% more energy-efficient than the 32b single-precision floating-point format. The PROBLP error models also show that the fixed-point format is unsuitable due to its limited range, as probabilities can become very small. Based on this learning, we designed a customized posit™ representation with a significantly larger range than floating point for the same number of bits.


Experiments confirm that an 8b custom posit can achieve the same accuracy as the 32b floating-point format for probabilistic inference tasks. Moreover, it also achieves better accuracy for SpTRSV operations, demonstrating a general-purpose utility. The customized posit format is used in DPU, the dedicated processor designed for PC and SpTRSV.

Research Question 2: How Can We Parallelize Irregular DFGs Effectively?
Identifying the key requirements for effective parallelization, Chap. 3 proposes a DFG parallelization tool called GRAPHOPT. It decomposes a DFG to generate superlayers of partitions, each of which can execute in parallel on a CPU/GPU/DPU core or thread. At the core of GRAPHOPT is a two-way graph partitioner, modeled as a constrained-optimization problem and solved with the Google OR-Tools solver. This partitioner is used iteratively to generate one superlayer at a time while achieving workload balance and minimal communication across the parallel cores. The partitions in the superlayers are made as large as possible, as long as there are enough parallel partitions, to reduce the total number of superlayers and, thereby, minimize the number of costly synchronization barriers. Several scalability techniques are developed for large DFGs with more than 100K nodes. The performance of the generated superlayers is evaluated with a multithreaded CPU implementation. The experiments show that GRAPHOPT achieves a speedup of 2.0× and 1.8× over the state-of-the-art CPU libraries for SpTRSV and PC, respectively. GRAPHOPT is also used for generating parallel partitions for DPU, demonstrating its utility across different platforms.

Research Question 3: How Can We Improve the Throughput and Energy Efficiency Through a Custom Processor Architecture?
Chapter 4 discusses the limitations of CPUs and GPUs in executing irregular DFGs and proposes the DAG processing unit (DPU) to address them. DPU is equipped with 64 parallel compute units (CUs), each of which executes a DFG partition from the superlayers generated by GRAPHOPT. Instead of a cache-based memory hierarchy, DPU contains distributed local scratchpads and a global banked scratchpad for communication between CUs. Synchronization of CUs, frequently needed for DFGs, happens in a single cycle using a specialized hardware unit, significantly reducing the overhead. Furthermore, the irregularity of DFGs leads to frequent bank access conflicts in the global scratchpad, leading to high latency. This latency is hidden through the technique of decoupling memory accesses from compute operations, which allows overlapping the long scratchpad access latency with computational operations, resulting in a 1.8× speedup compared to an in-order coupled execution. For arithmetic operations, the CUs are equipped with precision-scalable custom posit units that can perform low-precision batch inference depending on the application requirements. The DPU is fabricated in 28 nm technology and benchmarked on irregular DFGs of PC and SpTRSV. The measurement results show a mean speedup of 5.1× and 20.6× over state-of-the-art CPU and GPU implementations, with a peak performance of 73.8 GOPS and 538 GOPS/W. Thus, DPU demonstrates that a custom architecture improves energy efficiency significantly over CPUs and GPUs, which can enable energy-constrained edge applications.


Research Question 4: How Can We Improve the Hardware Further Through a Dedicated Datapath Design?
Chapter 5 proposes the second version of DPU, called DPU-v2, for which a dedicated datapath is designed for irregular DFGs. The datapath contains a spatial tree-based topology of parallel processing elements (PEs), instead of the scalar PEs used in the previous version of DPU. Such a spatial topology enables the reuse of intermediate data within the datapath and reduces register file reads and writes. To provide the parallel inputs/outputs to the datapath, parallel register banks with independent addresses are used, instead of a multi-ported monolithic register file. This approach allows a high number of register read/write ports (up to 64 in our experiments), which would not have been possible with a multi-ported monolithic file. The instruction overhead of independent register bank addresses is eased with an automatic writing scheme that eliminates the need to encode write addresses in the instructions. Flexible interconnect networks are used to connect the datapath with the register banks, allowing the flexible routing of data required for irregular data accesses. The architecture is made programmable with a variable-length VLIW-like custom instruction set, which achieves a 48% lower memory footprint compared to the CSR representation of the DFG. A targeted compiler is developed to map DAGs to DPU-v2 that maximizes the datapath utilization while minimizing stalls due to bank conflicts and pipeline hazards. This way, a cohesive hardware-software co-optimization approach addresses the complexities arising from irregularity and simplifies the hardware for energy efficiency. Finally, a design space exploration is performed to identify the configuration with minimal EDP. The optimal design achieved higher throughput than an equivalent DPU, SPU, CPU, and GPU, while consuming lower EDP. Thus, DPU-v2 demonstrates the effectiveness of a dedicated datapath and the accompanying compile-time optimizations in further improving energy efficiency.

The book also allows us to draw more general conclusions:
• A novel workload demands co-optimization at different levels of the software-hardware stack for a cohesive end-to-end solution.
• By utilizing the properties of the target workload (e.g., unsigned operations, acyclic structure, etc.), it can become possible to analytically model the error injected due to low-precision data representations. This eliminates the need to rely on an empirical analysis.
• The posit representation can be parameterized differently depending on whether the workload requires a wide dynamic range or high precision for a narrow range.
• Constrained-optimization-based models allow us to formally express the constraints and objectives of a compile-time optimization problem. Using such a formal model, readily available solvers can be utilized to find optimized solutions. However, this approach may not scale to large problem sizes (e.g., the graph sizes in this book). This scalability issue can be addressed by a hybrid approach, in which heuristics break a large problem into smaller problems, each of which can be solved optimally with a solver.
• A crossbar can be a feasible choice for interconnection in medium-sized designs in modern technology nodes that have ten or more metal routing layers.


The physical implementation of DPU shows that the area and power overhead of a 64 × 64 32b crossbar is not significantly higher than that of the other hardware blocks.
• The decoupled streams of memory and computation instructions enable aggressive prefetching of data, which can be utilized for workloads beyond irregular DFGs.
• A spatial datapath of a suitable topology can be used for irregular workloads, but high datapath utilization needs appropriate placement of operations and routing of data.

6.2 Suggestions for Future Works

Some suggestions for future works that can build on the contributions of this book:
• In Chap. 2, PROBLP develops analytical error bounds for the fixed-point and floating-point formats. These analytical models can be extended to the posit representation, eliminating the need to rely on empirical analysis.
• In Chap. 3, GRAPHOPT uses a two-way partitioning solver recursively to generate P-way partitions (P > 2). Interestingly, the two-way partitioning model in Sect. 3.2.1 can be extended to P-way partitions by simply allowing the PART variable to take integer values in [0, P] instead of [0, 2]. However, allowing [0, P] values adds symmetries to the search space, since two partitions can be swapped without breaking any constraint. These symmetries prohibitively increase the runtime of the solver, since it keeps finding (or refuting) candidates that are just permutations of an earlier solution (or non-solution). The usual technique of symmetry breaking with a lexicographic ordering constraint on the PART variable is not sufficient because it does not break all the symmetries. A recent paper [29] proposes stricter symmetry-breaking constraints for several graph optimization problems, which we believe can also be applied to this problem. This could be an interesting extension of GRAPHOPT that can improve the quality and speed of the partitioning solver.
• The superlayers generated by GRAPHOPT are used for the multithreaded CPU implementation and DPU execution in this book. In the future, the superlayers can also be used for improving GPU execution.
• Although DPU and DPU-v2 are presented independently in this book, their optimizations are orthogonal and complementary to each other. As such, both architectures can be combined into a unified processor, such as a DPU equipped with the tree-based datapath and the banked register file of DPU-v2.
• Both DPU and DPU-v2 use the crossbar interconnection network, which consumes a relatively small area and is feasible to place and route at the current scale of the prototypes. As such, it does not require further optimization for the purpose of this book. However, crossbars would not be feasible for larger designs with, say, 1024 scratchpad/register banks.

6.3 Closing Remarks

129

processor, a scalable hierarchical interconnection network could be a better alternative, like a 2D mesh topology connecting clusters that internally use crossbars. • The main optimizations in the book utilize the fact that the DFG structure is fixed and known at compile time. Although this assumption allows extensive hardware-software codesign, it also limits the scope of this book. For a more general approach applicable to frequently varying DFG structures, some of the compile-time optimizations can be simplified by delegating the responsibility to hardware for runtime optimizations. This would reduce the compilation overhead, which can possibly be amortized with fewer DFG evaluations. • A full-fledged application typically contains sparse workloads in combination with dense counterparts. For example, a neuro-symbolic system can contain a DNN for feature extraction along with a PC for probabilistic reasoning. Similarly, an autonomous navigation system may contain a DNN for processing sensory inputs, along with a localization block requiring SpTRSV operations. Hence, in the future, an end-to-end system could be developed, in which CPU, NPU, DPU, etc. operate harmoniously. A promising candidate for this purpose would be a chiplet-based system, where each chiplet is specialized for a domain of target workloads.
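To illustrate the kind of symmetry breaking discussed in the GRAPHOPT bullet above, the following hypothetical Python sketch formulates a P-way partitioning with a simple value-precedence constraint: a node may only use a partition label at most one larger than the largest label used by earlier nodes, which forbids pure relabelings of the same partitioning (but, as noted above, not all symmetries). It uses Google OR-Tools CP-SAT [122] rather than the MiniZinc model of Appendix A; the function name, the 0-indexed node numbering, and the placeholder objective are illustrative assumptions, not GRAPHOPT code.

    # Hypothetical sketch: P-way partitioning with value-precedence symmetry
    # breaking, using OR-Tools CP-SAT (not the MiniZinc model of Appendix A).
    from ortools.sat.python import cp_model

    def p_way_partition(n_nodes, edges, P):
        model = cp_model.CpModel()
        # part[v] = 0 means "not mapped in this super layer"; 1..P are partitions.
        part = [model.NewIntVar(0, P, f"part_{v}") for v in range(n_nodes)]

        # Data-dependency constraint, in the same spirit as Listing A.1:
        # a consumer is either unmapped or in the same partition as its producer.
        for (src, dst) in edges:
            same = model.NewBoolVar(f"same_{src}_{dst}")
            model.Add(part[dst] == part[src]).OnlyEnforceIf(same)
            model.Add(part[dst] == 0).OnlyEnforceIf(same.Not())

        # Value-precedence symmetry breaking: node v may use a label at most one
        # larger than the largest label used by nodes 0..v-1.
        model.Add(part[0] <= 1)
        prefix_max = part[0]
        for v in range(1, n_nodes):
            model.Add(part[v] <= prefix_max + 1)
            new_max = model.NewIntVar(0, P, f"prefix_max_{v}")
            model.AddMaxEquality(new_max, [prefix_max, part[v]])
            prefix_max = new_max

        # Placeholder objective: map as many nodes as possible. GraphOpt's real
        # objective also balances partition sizes and penalizes edge crossings.
        mapped = []
        for v in range(n_nodes):
            m = model.NewBoolVar(f"mapped_{v}")
            model.Add(part[v] != 0).OnlyEnforceIf(m)
            model.Add(part[v] == 0).OnlyEnforceIf(m.Not())
            mapped.append(m)
        model.Maximize(sum(mapped))

        solver = cp_model.CpSolver()
        solver.Solve(model)
        return [solver.Value(p) for p in part]

    # Example: a small DAG with 4 nodes mapped to up to 3 partitions.
    print(p_way_partition(4, [(0, 2), (1, 2), (2, 3)], 3))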

6.3 Closing Remarks

The rise of aggressive domain specialization of hardware and systems raises the barrier to developing novel algorithms that do not fall within the established domains. As such, new algorithms cannot reach their full potential in isolation, but require efforts across the stack. This book focuses on such novel and promising workloads from AI and sparse linear algebra that can be modeled as irregular dataflow graphs, which execute poorly on existing hardware. To address this problem, the book contributes important pieces involving optimizations at the application, compilation, hardware, and implementation levels, which enable high-throughput and energy-efficient execution of these workloads. Our work demonstrates that the hardware lottery can be won through dedicated and cohesive hardware/software optimizations.

Appendix A

The Two-Way Partitioning Model of GRAPHOPT

Listing A.1 is the MiniZinc-based optimization model for the two-way partitioning of graphs, as described in Sect. 3.2.1. A Python wrapper to this MiniZinc model generates the input parameters, such as Vin, Ein, and PARTin, from the graph structure and the mapping of previous superlayers (an illustrative sketch of such a wrapper follows Listing A.2). Listing A.2 shows the input parameters to the model for the example in Fig. 3.6.

%%%%% input parameters %%%%
int: n_V;                               % number of nodes in graph G
set of int: V = 1..n_V;
int: max_node_w;                        % maximum node weight in current G
array[V] of 1..max_node_w: node_w;
int: n_E;
array[1..n_E, 1..2] of int: E;          % edges (an array (instead of a set) of tuples, for ease of modelling)
int: n_Vin;
set of int: Vin = 1..n_Vin;             % source nodes on incoming edges
int: n_Ein;
array[1..n_Ein, 1..2] of int: Ein;
array[Vin] of 1..2: PARTin;

%%%%% decision variables %%%%
array[V] of var 0..2: PART;
var 0..(max_node_w * n_V): PART_1_size;
var 0..(max_node_w * n_V): PART_2_size;
array[1..n_Ein] of var bool: Ein_crossings;

%%%%% constraints %%%%
% acyclic and data-dependency constraint
constraint forall (e in 1..n_E) (
  let {
    int: src = E[e, 1];                 % local variables for readability
    int: dst = E[e, 2];
  } in
  PART[dst] = PART[src] \/ PART[dst] = 0
);



% partition sizes constraint
constraint PART_1_size = sum ( [ node_w[v] | v in V where PART[v] == 1 ] );
constraint PART_2_size = sum ( [ node_w[v] | v in V where PART[v] == 2 ] );

% inter-thread communication constraint
constraint forall (e in 1..n_Ein) (
  let {
    int: src = Ein[e, 1];               % local variables for readability
    int: dst = Ein[e, 2];
  } in
  Ein_crossings[e] = ( PART[dst] != 0 /\ PART[dst] != PARTin[src] )
);

%%%%% objective %%%%
solve maximize 10 * min(PART_1_size, PART_2_size) - sum(Ein_crossings);

Listing A.1 MiniZinc code for the two-way partitioning optimization model described in Sect. 3.2.1

n_V = 9;
max_node_w = 1;
node_w = [1, 1, 1, 1, 1, 1, 1, 1, 1];
n_E = 8;
E = array2d(1..n_E, 1..2,
    [ 1, 5,   2, 5,   5, 7,   3, 6,
      4, 6,   6, 8,   7, 9,   8, 9 ]);
n_Vin = 4;
n_Ein = 9;

% For ease of modelling, Vin are numbered 1,...,4 instead of 10,...,13 as shown in Fig. 3.6
Ein = array2d(1..n_Ein, 1..2,
    [ 1, 1,   1, 4,   1, 7,   2, 1,   2, 2,
      2, 8,   3, 2,   3, 8,   4, 4 ]);
PARTin = [1, 1, 2, 2];

Listing A.2 Parameters for the example in Fig. 3.6, to be passed as inputs to the model in Listing A.1
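The Python wrapper that produces these parameters is not listed in the book. The following is an illustrative sketch of how such a wrapper could derive them from a DAG and the partition labels assigned in previous superlayers; the function name, the data layout, and the use of NetworkX [65] are assumptions made here for illustration only.

    # Illustrative sketch of a Python wrapper that emits the input parameters of
    # Listing A.1 (n_V, node_w, E, Vin, Ein, PARTin) for one partitioning call.
    # Not the actual GraphOpt implementation; names and layout are assumptions.
    import networkx as nx

    def make_minizinc_params(dag: nx.DiGraph, current_nodes, prev_partition):
        """current_nodes: nodes still to be mapped (the current sub-graph).
        prev_partition: dict node -> 1 or 2, for nodes mapped in earlier superlayers."""
        current = sorted(current_nodes)
        idx = {v: i + 1 for i, v in enumerate(current)}      # MiniZinc is 1-indexed

        # Edges internal to the current sub-graph.
        E = [(idx[u], idx[v]) for u, v in dag.edges() if u in idx and v in idx]

        # Incoming edges from already-mapped source nodes, renumbered from 1.
        sources = sorted({u for u, v in dag.edges()
                          if u in prev_partition and v in idx})
        src_idx = {u: i + 1 for i, u in enumerate(sources)}
        Ein = [(src_idx[u], idx[v]) for u, v in dag.edges()
               if u in prev_partition and v in idx]
        PARTin = [prev_partition[u] for u in sources]

        node_w = [dag.nodes[v].get("weight", 1) for v in current]
        return {
            "n_V": len(current), "max_node_w": max(node_w), "node_w": node_w,
            "n_E": len(E), "E": E,
            "n_Vin": len(sources), "n_Ein": len(Ein), "Ein": Ein,
            "PARTin": PARTin,
        }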

Bibliography

1. Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., Zheng, X.: Tensorflow: A system for large-scale machine learning. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283 (2016) 2. Addisie, A., Kassa, H., Matthews, O., Bertacco, V.: Heterogeneous memory subsystem for natural graph analytics. In: 2018 IEEE International Symposium on Workload Characterization (IISWC), pp. 134–145. IEEE, Piscataway (2018) 3. Agrawal, A., Lee, S.K., Silberman, J., Ziegler, M., Kang, M., Venkataramani, S., Cao, N., Fleischer, B., Guillorn, M., Cohen, M., et al.: 9.1 a 7 nm 4-core AI chip with 25.6 TFLOPS hybrid FP8 training, 102.4 TOPS INT4 inference and workload-aware throttling. In: 2021 IEEE International Solid-State Circuits Conference (ISSCC), vol. 64, pp. 144–146. IEEE, Piscataway (2021) 4. Al-Abbasi, A.O., Hamila, R., Bajwa, W.U., Al-Dhahir, N.: A general framework for the design and analysis of sparse FIR linear equalizers. In: 2015 IEEE Global Conference on Signal and Information Processing (GlobalSIP), pp. 834–838. IEEE, Piscataway (2015) 5. Anderson, E., Saad, Y.: Solving sparse triangular linear systems on parallel computers. Int. J. High Speed Comput. 1(1), 73–95 (1989) 6. Anguita, D., Ghio, A., Oneto, L., Parra, X., Reyes-Ortiz, J.L.: A public domain dataset for human activity recognition using smartphones. In: ESANN (2013) 7. Arora, A., Mehta, S., Betz, V., John, L.K.: Tensor slices to the rescue: Supercharging ml acceleration on fpgas. In: The 2021 ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays, pp. 23–33 (2021) 8. Beamer, S., Asanovic, K., Patterson, D.: Locality exists in graph processing: Workload characterization on an ivy bridge server. In: 2015 IEEE International Symposium on Workload Characterization, pp. 56–65. IEEE, Piscataway (2015) 9. Beinlich, I.A., Suermondt, H.J., Chavez, R.M., Cooper, G.F.: The alarm monitoring system: A case study with two probabilistic inference techniques for belief networks. In: AIME 89, pp. 247–256. Springer, Berlin 1989 10. Bichot, C.-E., Siarry, P.: Graph Partitioning. Wiley, Hoboken (2013) 11. Boutros, A., Nurvitadhi, E., Ma, R., Gribok, S., Zhao, Z., Hoe, J.C., Betz, V., Langhammer, M.: Beyond peak performance: Comparing the real performance of AI-optimized FPGAs and GPUs. In: 2020 International Conference on Field-Programmable Technology (ICFPT), pp. 10–19. IEEE, Piscataway (2020)


12. Bramas, B., Ketterlin, A.: Improving parallel executions by increasing task granularity in task-based runtime systems using acyclic DAG clustering. PeerJ Comput. Sci. 6, e247 (2020) 13. Bull, J.M., Reid, F., McDonnell, N.: A microbenchmark suite for openmp tasks. In: International Workshop on OpenMP, pp. 271–274. Springer, Berlin (2012) 14. Buluç, A., Meyerhenke, H., Safro, I., Sanders, P., Schulz, C.: Recent advances in graph partitioning. In: Algorithm Engineering , pp. 117–158. Springer, Cham (2016) 15. Canesche, M., Menezes, M., Carvalho, W., Torres, F.S., Jamieson, P., Nacif, J.A., Ferreira, R.: Traversal: a fast and adaptive graph-based placement and routing for CGRAs. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 40(8), 1600–1612 (2020) 16. Casale, P., Pujol, O., Radeva, P.: Personalization and user verification in wearable systems using biometric walking patterns. Pers. Ubiquitous Comput. 16(5), 563–580 (2012) 17. Chan, H., Darwiche, A.: When do numbers really matter? J. Artif. Intell. Res. 17, 265–287 (2002) 18. Chan, H., Darwiche, A.: Sensitivity analysis in bayesian networks: From single to multiple parameters. In: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pp. 67–75. AUAI Press, Washington (2004) 19. Chattopadhyay, A., Manupriya, P., Sarkar, A., Balasubramanian, V.N.: Neural network attributions: A causal perspective. In: International Conference on Machine Learning, pp. 981–990. PMLR, Cambridge (2019) 20. Chaurasiya, R., Gustafson, J., Shrestha, R., Neudorfer, J., Nambiar, S., Niyogi, K., Merchant, F., Leupers, R.: Parameterized posit arithmetic hardware generator. In: 2018 IEEE 36th International Conference on Computer Design (ICCD), pp. 334–341. IEEE, Piscataway (2018) 21. Chen, L., Mitra, T.: Graph minor approach for application mapping on cgras. ACM Trans. Reconfig. Technol. Syst. 7(3), 1–25 (2014) 22. Chen, Y.-H., Krishna, T., Emer, J.S., Sze, V.: Eyeriss: an energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE J. Solid-State Circuits 52(1), 127– 138 (2016) 23. Choi, A., Darwiche, A.: On the relative expressiveness of bayesian and neural networks. In: International Conference on Probabilistic Graphical Models, pp. 157–168. PMLR, Cambridge (2018) 24. Choi, Y., Vergari, A., Van den Broeck, G.: Probabilistic circuits: A unifying framework for tractable probabilistic models. Technical Report (2020) 25. Choquette, J., Gandhi, W., Giroux, O., Stam, N., Krashinsky, R.: Nvidia a100 tensor core GPU: performance and innovation. IEEE Micro 41(2), 29–35 (2021) 26. Clark, N., Hormati, A., Mahlke, S., Yehia, S.: Scalable subgraph mapping for acyclic computation accelerators. In: Proceedings of the 2006 International Conference on Compilers, Architecture and Synthesis for Embedded Systems, pp. 147–157 (2006) 27. Clark, N., Kudlur, M., Park, H., Mahlke, S., Flautner, K.: Application-specific processing on a general-purpose core via transparent instruction set customization. In: 37th International Symposium on Microarchitecture (MICRO-37’04), pp. 30–40. IEEE, Piscataway (2004) 28. Clark, N.T., Zhong, H., Mahlke, S.A.: Automated custom instruction generation for domainspecific processor acceleration. IEEE Trans. Comput. 54(10), 1258–1270 (2005) 29. Codish, M., Miller, A., Prosser, P., Stuckey, P.J.: Constraints for symmetry breaking in graph representation. Constraints 24(1), 1–24 (2019) 30. Cong, J., Li, Z., Bagrodia, R.L.: Acyclic multi-way partitioning of boolean networks. 
In: Proceedings of the 31st Conference on Design Automation, pp. 670–675. ACM Press, New York (1994) 31. Cong, J., Ghodrat, M.A., Gill, M., Grigorian, B., Reinman, G.: Charm: A composable heterogeneous accelerator-rich microprocessor. In: Proceedings of the 2012 ACM/IEEE International Symposium on Low Power Electronics and Design, pp. 379–384 (2012) 32. Cong, J., Huang, H., Ghodrat, M.A.: A scalable communication-aware compilation flow for programmable accelerators. In: 2016 21st Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 503–510. IEEE, Piscataway (2016)


33. Dadu, V., Weng, J., Liu, S., Nowatzki, T.: Towards general purpose acceleration by exploiting common data-dependence forms. In: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, pp. 924–939 (2019) 34. Dadu, V., Liu, S., Nowatzki, T.: PolyGraph: exposing the value of flexibility for graph processing accelerators. In: 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), pp. 595–608. IEEE, Piscataway (2021) 35. Dang, M., Khosravi, P., Liang, Y., Vergari, A., Van den Broeck, G.: Juice: A Julia package for logic and probabilistic circuits. In: Proceedings of the 35th AAAI Conference on Artificial Intelligence (Demo Track) (2021). 36. Darwiche, A.: Causal inference using tractable circuits (2022). Preprint arXiv:2202.02891 37. Darwiche, A., Chavira, M.: Ace, an arithmetic circuit compiler (2007). http://reasoning.cs. ucla.edu/ace\/ 38. Darwiche, A., Marquis, P.: A knowledge compilation map. J. Artif. Intell. Res. 17, 229–264 (2002) 39. Dave, S., Baghdadi, R., Nowatzki, T., Avancha, S., Shrivastava, A., Li, B.: Hardware acceleration of sparse and irregular tensor computations of ML models: a survey and insights. Proc. IEEE 109(10), 1706–1752 (2021) 40. Davis, T.A.: Algorithm 832: UMFPACK v4. 3—an unsymmetric-pattern multifrontal method. ACM Trans. Math. Softw. 30(2), 196–199 (2004) 41. Davis, T.A.: Direct Methods for Sparse Linear Systems. SIAM, Philadelphia (2006) 42. Davis, T.A., Hu, Y.: The university of florida sparse matrix collection. ACM Trans. Math. Softw. 38(1), 1:1–1:25 (2011) 43. Davis, T.A., Palamadai Natarajan, E.: Algorithm 907: KLU, a direct sparse solver for circuit simulation problems. ACM Trans. Math. Softw. 37(3), 1–17 (2010) 44. Davis, T.A., Rajamanickam, S., Sid-Lakhdar, W.M.: A survey of direct methods for sparse linear systems. Acta Numerica 25, 383–566 (2016) 45. Delaplace, C.: Linear alebra algorithms for cryptography. PhD Thesis, Université Rennes 1, (2018) 46. Dennis, J.: Data Flow Graphs, pp. 512–518. Springer, Boston (2011) 47. Doweck, J., Kao, W.-F., Lu, A.K.-Y., Mandelblat, J., Rahatekar, A., Rappoport, L., Rotem, E., Yasin, A., Yoaz, A.: Inside 6th-generation intel core: new microarchitecture code-named skylake. IEEE Micro 37(2), 52–62 (2017) 48. Dua, D., Graff, C.: UCI Machine Learning Repository (2017) 49. Ellis, J.R.: Bulldog: A Compiler for VLSI Architectures. Mit Press, Cambridge (1986) 50. Feldmann, A.E.: Fast balanced partitioning is hard even on grids and trees. Theor. Comput. Sci. 485, 61–68 (2013) 51. Fierens, D., Van den Broeck, G., Thon, I., Gutmann, B., De Raedt, L.: Inference and learning in probabilistic logic programs using weighted CNFs. Theory Practice Logic Program. 15, 358–401 (2015) 52. Frigerio, M., Buchli, J., Caldwell, D.G., Semini, C.: Robcogen: a code generator for efficient kinematics and dynamics of articulated robots, based on domain specific languages. J. Softw. Eng. Robot. 7(1), 36–54 (2016) 53. Galindez Olascoaga, L.I., Meert, W., Shah, N., Verhelst, M., Van den Broeck, G.: Towards hardware-aware tractable learning of probabilistic models. In: Advances in Neural Information Processing Systems, vol. 32 (2019) 54. Galindez Olascoaga, L.I., Meert, W., Shah, N., Verhelst, M.: Dynamic complexity tuning for hardware-aware probabilistic circuits. In: IoT Streams for Data-Driven Predictive Maintenance and IoT, Edge, and Mobile for Embedded Machine Learning, pp. 283–295. Springer, Berlin (2020) 55. 
Gamrath, G., Anderson, D., Bestuzheva, K., Chen, W.-K., Eifler, L., Gasse, M., Gemander, P., Gleixner, A., Gottwald, L., Halbig, K., Hendel, G., Hojny, C., Koch, T., Le Bodic, P., Maher, S.J., Matter, F., Miltenberger, M., Mühmer, E., Müller, B., Pfetsch, M.E., Schlösser, F., Serrano, F., Shinano, Y., Tawfik, C., Vigerske, S., Wegscheider, F., Weninger, D., Witzig, J.: The SCIP Optimization Suite 7.0. Technical Report, Optimization Online, March (2020)


56. Gao, M., Kozyrakis, C.: HRL: Efficient and flexible reconfigurable logic for near-data processing. In 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 126–137 (2016) 57. Geiselmann, W., Shamir, A., Steinwandt, R., Tromer, E.: Scalable hardware for sparse systems of linear equations, with applications to integer factorization. In: International Workshop on Cryptographic Hardware and Embedded Systems, pp. 131–146. Springer, Berlin (2005) 58. Giannoula, C., Fernandez, I., Gómez-Luna, J., Koziris, N., Goumas, G., Mutlu, O.: SparseP: Towards efficient sparse matrix vector multiplication on real processing-in-memory systems (2022). Preprint arXiv:2201.05072 59. Goulas, G., Gogos, C., Valouxis, C., Alefragis, P., Voros, N.S.: Coarse grain parallelization using integer programming. In: 2013 11th IEEE International Conference on Industrial Informatics (INDIN), pp. 816–820 (2013) 60. Govindaraju, V., Ho, C.-H., Nowatzki, T., Chhugani, J., Satish, N., Sankaralingam, K., Kim, C.: DySER: unifying functionality and parallelism specialization for energy-efficient computing. IEEE Micro 32(5), 38–51 (2012) 61. Gui, C.-Y., Zheng, L., He, B., Liu, C., Chen, X.-Y., Liao, X.-F., Jin, H.: A survey on graph processing accelerators: challenges and opportunities. J. Comput. Sci. Technol. 34(2), 339– 371 (2019) 62. Gupta, S., Agrawal, A., Gopalakrishnan, K., Narayanan, P.: Deep learning with limited numerical precision. In: International Conference on Machine Learning, pp. 1737–1746. PMLR, Cambridge (2015) 63. Gurobi Optimization LLC. Gurobi optimizer reference manual (2021) 64. Gustafson, J.L., Yonemoto, I.T.: Beating floating point at its own game: posit arithmetic. Supercomput. Front. Innovat. 4(2), 71–86 (2017) 65. Hagberg, A., Swart, P., and Schult, D.: Exploring network structure, dynamics, and function using networkX. Technical Report, Los Alamos National Lab.(LANL), Los Alamos, NM (United States) (2008) 66. Ham, T.J., Wu, L., Sundaram, N., Satish, N., Martonosi, M.: Graphicionado: A highperformance and energy-efficient accelerator for graph analytics. In: 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1–13. IEEE, Piscataway (2016) 67. Hammarlund, P., Martinez, A.J., Bajwa, A.A., Hill, D.L., Hallnor, E., Jiang, H., Dixon, M., Derr, M., Hunsaker, M., Kumar, R., et al.: Haswell: the fourth-generation intel core processor. IEEE Micro 34(2), 6–20 (2014) 68. Hamzeh, M., Shrivastava, A., Vrudhula, S.: REGIMap: Register-aware application mapping on coarse-grained reconfigurable architectures (CGRAs). In: Proceedings of the 50th Annual Design Automation Conference, pp. 1–10 (2013) 69. Haotian, L., Qiyue, Y.: Doubly-iterative sparsified mmse turbo equalization for OTFs modulation (2022). Preprint arXiv:2207.00866 70. He, X., Pal, S., Amarnath, A., Feng, S., Park, D.-H., Rovinski, A., Ye, H., Chen, Y., Dreslinski, R., Mudge, T.: Sparse-TPU: Adapting systolic arrays for sparse matrices. In: Proceedings of the 34th ACM International Conference on Supercomputing, pp. 1–12 (2020) 71. Helal, A.E., Aji, A.M., Chu, M.L., Beckmann, B.M., Feng, W.: Adaptive task aggregation for high-performance sparse solvers on GPUs. In: 28th International Conference on Parallel Architectures and Compilation Techniques PACT, pp. 324–336 (2019) 72. Herrmann, J., Özkaya, M.Y., Uçar, B., Kaya, K., Çatalyürek, Ü.V.: Multilevel algorithms for acyclic partitioning of directed acyclic graphs. SIAM J. Sci. Comput. 41(4), A2117–A2145 (2019) 73. 
Hitzler, P.: Neuro-Symbolic Artificial Intelligence: The State of the Art. IOS Press, Amsterdam (2022) 74. Hooker, S.: The hardware lottery. Commun. ACM 64(12), 58–65 (2021) 75. Huang, J., Chavira, M., Darwiche, A., et al.: Solving MAP exactly by searching on compiled arithmetic circuits. In: AAAI (2006)


76. Jaiswal, M.K., So, H.K.-H.: Pacogen: a hardware posit arithmetic core generator. IEEE Access 7, 74586–74601 (2019) 77. Jang, S., Carr, S., Sweany, P., Kuras, D.: A code generation framework for VLIW architectures with partitioned register banks. In: Proceedings of the 3rd International Conference on Massively Parallel Computing Systems, vol. 4. Citeseer (1998) 78. Jouppi, N.P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., et al.: In-datacenter performance analysis of a tensor processing unit. In: Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 1–12 (2017) 79. Kao, S.-C., Parashar, A., Tsai, P.-A., Krishna, T.: Demystifying map space exploration for NPUs (2022). Preprint arXiv:2210.03731 80. Karypis, G., Kumar, V.: A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput. 20(1), 359–392 (1998) 81. Khan, O.U., Wentzloff, D.D.: Hardware accelerator for probabilistic inference in 65-nm CMOS. IEEE Trans. Very Large Scale Integr. Syst. 24(3), 837–845 (2016) 82. Khosoussi, K., Huang, S., Dissanayake, G.: A sparse separable slam back-end. IEEE Trans. Robot. 32(6), 1536–1549 (2016) 83. Khosravi, P., Vergari, A., Choi, Y., Liang, Y., Broeck, G.V.d.: Handling missing data in decision trees: A probabilistic approach (2020). Preprint arXiv:2006.16341 84. Kung, H., Leiserson, C.E.: Systolic arrays (for VLSI). In: Sparse Matrix Proceedings 1978, vol. 1, pp. 256–282. Society for Industrial and Applied Mathematics, Philadelphia (1979) 85. Kwon, H., Samajdar, A., Krishna, T.: MAERI: Enabling flexible dataflow mapping over DNN accelerators via reconfigurable interconnects. In: Proceedings of the TwentyThird International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 461–475 (2018) 86. Li, G., Dai, G., Li, S., Wang, Y., Xie, Y.: GraphIA: An in-situ accelerator for large-scale graph processing. In: Proceedings of the International Symposium on Memory Systems, pp. 79–84 (2018) 87. Liang, Y., Van den Broeck, G.: Learning logistic circuits. In: Proceedings of the 33rd Conference on Artificial Intelligence (AAAI) (2019) 88. Liang, Y., Bekker, J., den Broeck, G.V.: Learning the structure of probabilistic sentential decision diagrams. In: Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence UAI (2017) 89. Manhaeve, R., Dumanˇci´c, S., Kimmig, A., Demeester, T., De Raedt, L.: Deepproblog: Neural probabilistic logic programming. In: 32nd Conference on Neural Information Processing Systems (2018) 90. Mei, L., Houshmand, P., Jain, V., Giraldo, S., Verhelst, M.: ZigZag: Enlarging joint architecture-mapping design space exploration for dnn accelerators. IEEE Trans. Comput. 70(8), 1160–1174 (2021) 91. Micucci, D., Mobilio, M., Napoletano, P.: Unimib SHAR: a dataset for human activity recognition using acceleration data from smartphones. Appl. Sci. 7(10), 1101 (2017) 92. Molina, A., Vergari, A., Stelzner, K., Peharz, R., Subramani, P., Di Mauro, N., Poupart, P., Kersting, K.: SPFlow: An easy and extensible library for deep probabilistic learning using sum-product networks (2019). Preprint arXiv:1901.03704 93. Moons, B., Uytterhoeven, R., Dehaene, W., Verhelst, M.: 14.5 ENVISION: A 0.26-to10 TOPS/W subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural network processor in 28nm FDSOI. In: 2017 IEEE International Solid-State Circuits Conference (ISSCC), pp. 246–247. 
IEEE, Piscataway (2017) 94. Moons, B., Bankman, D., Yang, L., Murmann, B., Verhelst, M.: BinarEye: An always-on energy-accuracy-scalable binary cnn processor with all memory on chip in 28nm cmos. In 2018 IEEE Custom Integrated Circuits Conference (CICC), pp. 1–4. IEEE, Piscataway (2018) 95. Moreira, O., Popp, M., Schulz, C.: Graph partitioning with acyclicity constraints. In: 16th International Symposium on Experimental Algorithms SEA, vol. 75, pp. 30:1–30:15 (2017)


96. Moreira, O., Popp, M., Schulz, C.: Evolutionary multi-level acyclic graph partitioning. J. Heuristics 26(5), 771–799 (2020) 97. Mossé, M., Ibeling, D., Icard, T.: Is causal reasoning harder than probabilistic reasoning? Rev. Symbol. Logic, 1–26 (2022). https://doi.org/10.1017/S1755020322000211 98. Muller, J.-M., Brisebarre, N., De Dinechin, F., Jeannerod, C.-P., Lefevre, V., Melquiond, G., Revol, N., Stehlé, D., Torres, S., et al.: Handbook of Floating-Point Arithmetic. Springer, Berlin (2018) 99. Naumov, M.: Parallel solution of sparse triangular linear systems in the preconditioned iterative methods on the GPU. NVIDIA Corporation, Westford, MA, USA, Technical Report NVR-2011 1 (2011) 100. Naumov, M., Chien, L., Vandermersch, P., Kapasi, U.: Cusparse library. In: GPU Technology Conference (2010) 101. Nethercote, N., Stuckey, P.J., Becket, R., Brand, S., Duck, G.J., Tack, G.: MiniZinc: Towards a standard CP modelling language. In: Principles and Practice of Constraint Programming CP, vol. 4741, pp. 529–543 (2007) 102. Neuman, S.M., Plancher, B., Bourgeat, T., Tambe, T., Devadas, S., Reddi, V.J.: Robomorphic computing: A design methodology for domain-specific accelerators parameterized by robot morphology. In: Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 674–686 (2021) 103. Nguyen, Q.M., Sanchez, D.: Fifer: Practical acceleration of irregular applications on reconfigurable architectures. In: MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 1064–1077 (2021) 104. Nourani, M., Roy, C., Rahman, T., Ragan, E.D., Ruozzi, N., Gogate, V.: Don’t explain without verifying veracity: An evaluation of explainable ai with video activity recognition (2020). Preprint arXiv:2005.02335 105. Nowatzki, T., Ferris, M.C., Sankaralingam, K., Estan, C., Vaish, N., Wood, D.A.: Optimization and mathematical modeling in computer architecture. Synth. Lect. Comput. Archit. 8, 1–144 (2013) 106. Nowatzki, T., Sartin-Tarm, M., De Carli, L., Sankaralingam, K., Estan, C., Robatmili, B.: A scheduling framework for spatial architectures across multiple constraint-solving theories. ACM Trans. Program. Languages Syst. 37(1), 1–30 (2014) 107. Nowatzki, T., Gangadhar, V., Sankaralingam, K.: Exploring the potential of heterogeneous von neumann/dataflow execution models. In: Proceedings of the 42nd Annual International Symposium on Computer Architecture, pp. 298–310 (2015) 108. Nowatzki, T., Ardalani, N., Sankaralingam, K., Weng, J.: Hybrid optimization/heuristic instruction scheduling for programmable accelerator codesign. In: Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques, pp. 1–15 (2018) 109. NVIDIA Corporation: NVIDIA CUDA C programming guide (2019). Version 10.1. 110. Ober, M., Hofmann, J., Sommer, L., Weber, L., Koch, A.: High-throughput multi-threaded sum-product network inference in the reconfigurable cloud. In: 2019 IEEE/ACM International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC), pp. 26–33 (2019) 111. Olascoaga, L.I.G., Meert, W., Shah, N., Van den Broeck, G., Verhelst, M.: On hardwareaware probabilistic frameworks for resource constrained embedded applications. In: 2019 Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing-NeurIPS Edition (EMC2-NIPS), pp. 66–70. IEEE, Piscataway (2019) 112. 
Olascoaga, L.I.G., Meert, W., Verhelst, M.: Hardware-Aware Probabilistic Machine Learning Models: Learning, Inference and Use Cases. Springer Nature, Berlin (2021) 113. Ozer, E., Banerjia, S., Conte, T.M.: Unified assign and schedule: A new approach to scheduling for clustered register file microarchitectures. In: Proceedings of the 31st Annual ACM/IEEE International Symposium on Microarchitecture, pp. 308–315. IEEE, Piscataway (1998)


114. Özkaya, M.Y., Benoit, A., Uçar, B., Herrmann, J., Çatalyürek, Ü.V.: A scalable clusteringbased task scheduler for homogeneous processors using dag partitioning. In: 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 155–165. IEEE, Piscataway (2019) 115. Parashar, A., Rhu, M., Mukkara, A., Puglielli, A., Venkatesan, R., Khailany, B., Emer, J., Keckler, S.W., Dally, W.J.: SCNN: An accelerator for compressed-sparse convolutional neural networks. ACM SIGARCH Comput. Architect. News 45(2), 27–40 (2017) 116. Park, Y., Park, H., Mahlke, S.: CGRA express: accelerating execution using dynamic operation fusion. In: Proceedings of the 2009 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, pp. 271–280 (2009) 117. Park, J., Smelyanskiy, M., Sundaram, N., Dubey, P.: Sparsifying synchronization for highperformance shared-memory sparse triangular solver. In: International Supercomputing Conference, pp. 124–140. Springer, Berlin (2014) 118. Park, J.-S., Park, C., Kwon, S., Kim, H.-S., Jeon, T., Kang, Y., Lee, H., Lee, D., Kim, J., Lee, Y., et al.: A multi-mode 8k-MAC HW-utilization-aware neural processing unit with a unified multi-precision datapath in 4nm flagship mobile SoC. In: 2022 IEEE International Solid-State Circuits Conference (ISSCC), vol. 65, pp. 246–248. IEEE, Piscataway (2022) 119. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: PyTorch: An imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems, vol. 32 (2019) 120. Pearl, J., Mackenzie, D.: The Book of Why: The New Science of Cause and Effect. Basic Books, New York (2018) 121. Peemen, M., Setio, A.A., Mesman, B., Corporaal, H.: Memory-centric accelerator design for convolutional neural networks. In: 2013 IEEE 31st International Conference on Computer Design (ICCD), pp. 13–19. IEEE, Piscataway (2013) 122. Perron, L., Furnon, V.: Or-tools 123. Picciau, A., Inggs, G.E., Wickerson, J., Kerrigan, E.C., Constantinides, G.A.: Balancing locality and concurrency: Solving sparse triangular systems on GPUs. In: 23rd IEEE International Conference on High Performance Computing HiPC, pp. 183–192 (2016) 124. Poletto, M., Sarkar, V.: Linear scan register allocation. ACM Trans. Program. Languages Syst. 21(5), 895–913 (1999) 125. Prabhakar, R., Zhang, Y., Koeplinger, D., Feldman, M., Zhao, T., Hadjis, S., Pedram, A., Kozyrakis, C., Olukotun, K.: Plasticine: A reconfigurable architecture for parallel patterns. In: 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pp. 389–402 (2017) 126. Pradhan, R., Yang, S., Dellaert, F., Choset, H., Travers, M.: Optimal control for structurally sparse systems using graphical inference (2021). Preprint arXiv:2104.02945 127. Pronobis, A., Rao, R.P.N.: Learning deep generative spatial models for mobile robots. In: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems IROS, pp. 755– 762 (2017) 128. Qin, E., Samajdar, A., Kwon, H., Nadella, V., Srinivasan, S., Das, D., Kaul, B., Krishna, T.: SIGMA: A sparse and irregular gemm accelerator with flexible interconnects for dnn training. In: 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 58–70. IEEE, Piscataway (2020) 129. 
Ragan-Kelley, J., Barnes, C., Adams, A., Paris, S., Durand, F., Amarasinghe, S.P.: Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In: ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI, pp. 519–530 (2013) 130. Rajamanickam, S., Acer, S., Berger-Vergiat, L., Dang, V., Ellingwood, N., Harvey, E., Kelley, B., Trott, C.R., Wilke, J., Yamazaki, I.: Kokkos kernels: Performance portable sparse/dense linear algebra and graph kernels (2021). Preprint arXiv:2103.11991 131. Robison, A.D.: Intel®threading building blocks (TBB). In: Encyclopedia of Parallel Computing, pp. 955–964. Springer, Boston (2011)


132. Rucker, A., Vilim, M., Zhao, T., Zhang, Y., Prabhakar, R., Olukotun, K.: Capstan: A vector RDA for sparsity. In: MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 1022–1035 (2021) 133. Saad, Y.: Iterative Methods for Sparse Linear Systems. SIAM, Philadelphia (2003) 134. Schwarz, E.M., Schmookler, M., Trong, S.D.: Hardware implementations of denormalized numbers. In: Proceedings 2003 16th IEEE Symposium on Computer Arithmetic, pp. 70–78. IEEE, Piscataway (2003) 135. Shah, N., Olascoaga, L.I.G., Meert, W., Verhelst, M.: PROBLP: A framework for iowprecision probabilistic inference. In: 2019 56th ACM/IEEE Design Automation Conference (DAC), pp. 1–6 (2019) 136. Shah, N., Olascoaga, L.I.G., Meert, W., Verhelst, M.: Acceleration of probabilistic reasoning through custom processor architecture. In: 2020 Design, Automation & Test in Europe Conference & Exhibition DATE, pp. 322–325 (2020) 137. Shah, N., Olascoaga, L.I.G., Meert, W., Verhelst, M.: Acceleration of probabilistic reasoning through custom processor architecture. In: 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 322–325. IEEE, Piscataway (2020) 138. Shah, N., Olascoaga, L.I.G., Zhao, S., Meert, W., Verhelst, M.: 9.4 PIU: A 248GOPS/W stream-based processor for irregular probabilistic inference networks using precision-scalable posit arithmetic in 28nm. In: 2021 IEEE International Solid-State Circuits Conference (ISSCC), vol. 64, pp. 150–152. IEEE, Piscataway (2021) 139. Shah, N., Olascoaga, L.I.G., Zhao, S., Meert, W., Verhelst, M.: DPU: DAG processing unit for irregular graphs with precision-scalable posit arithmetic in 28 nm. IEEE J. Solid-State Circuits 57, 2586–2596 (2021) 140. Shah, N., Meert, W., Verhelst, M.: DPU-v2: Energy-efficient execution of irregular directed acyclic graphs. In: 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1288–1307. IEEE, Piscataway (2022) 141. Shah, N., Meert, W., Verhelst, M.: GraphOpt: constrained-optimization-based parallelization of irregular graphs. IEEE Trans. Parall. Distrib Syst. 33, 3321–3332 (2022) 142. Sharifian, A., Kumar, S., Guha, A., Shriraman, A.: Chainsaw: Von-neumann accelerators to leverage fused instruction chains. In: 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1–14. IEEE, Piscataway (2016) 143. Shen, Y., Choi, A., Darwiche, A.: Tractable operations for arithmetic circuits of probabilistic models. In: Advances in Neural Information Processing Systems, vol. 29 (2016) 144. Slota, G.M., Root, C., Devine, K., Madduri, K., Rajamanickam, S.: Scalable, multi-constraint, complex-objective graph partitioning. IEEE Trans. Parallel. Distrib. Syst. 31(12), 2789–2801 (2020) 145. Sommer, L., Weber, L., Kumm, M., Koch, A.: Comparison of arithmetic number formats for inference in sum-product networks on FPGAs. In: 2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 75–83. IEEE, Piscataway (2020) 146. Stelzner, K., Peharz, R., Kersting, K.: Faster attend-infer-repeat with tractable probabilistic models. In: Proceedings of the 36th International Conference on Machine Learning, ICML, vol. 97, pp. 5966–5975 (2019) 147. Sze, V., Chen, Y.-H., Yang, T.-J., Emer, J.S.: Efficient processing of deep neural networks: A tutorial and survey. Proc. IEEE 105(12), 2295–2329 (2017) 148. Taylor, M.B., Lee, W., Amarasinghe, S., Agarwal, A.: Scalar operand networks: On-chip interconnect for ILP in partitioned architectures. 
In: The Ninth International Symposium on High-Performance Computer Architecture (HPCA-9), pp. 341–353. IEEE, Piscataway (2003) 149. Tiwari, S., Gala, N., Rebeiro, C., Kamakoti, V.: PERI: A configurable posit-enabled RISC-V core. ACM Trans. Architect. Code Optim. 18(3), 1–26 (2021) 150. Tschiatschek, S., Pernkopf, F.: On Bayesian network classifiers with reduced precision parameters. IEEE Trans. Pattern Anal. Mach. Intell. 37(4), 774–785 (2015)


151. Ueyoshi, K., Papistas, I.A., Houshmand, P., Sarda, G.M., Jain, V., Shi, M., Zheng, Q., Giraldo, S., Vrancx, P., Doevenspeck, J., et al.: Diana: An end-to-end energy-efficient digital and analog hybrid neural network soc. In: 2022 IEEE International Solid-State Circuits Conference (ISSCC), vol. 65, pp. 1–3. IEEE, Piscataway (2022) 152. Valouxis, C., Gogos, C., Alefragis, P., Goulas, G., Voros, N., Housos, E.: Dag scheduling using integer programming in heterogeneous parallel execution environments. In: Proceedings of the Multidisciplinary International Conference on Scheduling: Theory and Applications MISTA, pp. 392–401 (2013) 153. van den Braak, G.: Improving GPU performance: reducing memory conflicts and latency. PhD Thesis, Technische Universiteit Eindhoven (2015) 154. Verreet, V., Derkinderen, V., Dos Martires, P.Z., De Raedt, L.: Inference and learning with model uncertainty in probabilistic logic programs. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 10060–10069 (2022) 155. Walker, M.J., Anderson, J.H.: Generic connectivity-based CGRA mapping via integer linear programming. In: 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 65–73. IEEE, Piscataway (2019) 156. Walshaw, C., Cross, M.: JOSTLE: parallel multilevel graph-partitioning software–an overview. Mesh Partition. Techniq. Domain Decomposit. Techniq. 10, 27–58 (2007) 157. Wang, H., Sinnen, O.: List-scheduling versus cluster-scheduling. IEEE Trans. Parall. Distrib. Syst. 29(8), 1736–1749 (2018) 158. Wang, E., Zhang, Q., Shen, B., Zhang, G., Lu, X., Wu, Q., Wang, Y.: Intel math kernel library. In: High-Performance Computing on the Intel® Xeon Phi.T M , pp. 167–188. Springer, Berlin (2014) 159. Wanhammar, L.: DSP Integrated Circuits. Elsevier, Amsterdam (1999) 160. Weber, L., Sommer, L., Oppermann, J., Molina, A., Kersting, K., Koch, A.: Resource-efficient logarithmic number scale arithmetic for SPN inference on FPGAs. In: 2019 International Conference on Field-Programmable Technology (ICFPT), pp. 251–254. IEEE, Piscataway (2019) 161. Weng, J., Liu, S., Dadu, V., Wang, Z., Shah, P., Nowatzki, T.: DSAGEN: Synthesizing programmable spatial accelerators. In: 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), pp. 268–281. IEEE, Piscataway (2020) 162. Weng, J., Liu, S., Wang, Z., Dadu, V., Nowatzki, T.: A hybrid systolic-dataflow architecture for inductive matrix algorithms. In: 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 703–716. IEEE, Piscataway (2020) 163. West, D.B.: Introduction to Graph Theory, vol. 2. Prentice Hall, Upper Saddle River (2001) 164. Wikipedia Contributors: Apple A16 — Wikipedia, the free encyclopedia (2022). Online Accessed 13 Nov 2022 165. Wu, Y.N., Tsai, P.-A., Parashar, A., Sze, V., Emer, J.S.: Sparseloop: An analytical, energyfocused design space exploration methodology for sparse tensor accelerators. In: 2021 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 232–234. IEEE, Piscataway (2021) 166. Xia, K., Lee, K.-Z., Bengio, Y., Bareinboim, E.: The causal-neural connection: Expressiveness, learnability, and inference. In: Advances in Neural Information Processing Systems, vol. 34, pp. 10823–10836 (2021) 167. Yao, P., Zheng, L., Liao, X., Jin, H., He, B.: An efficient graph accelerator with parallel data conflict management. 
In: Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques, pp. 1–12 (2018) 168. Yao, S., Yang, J.-B., Xu, D.-L., Dark, P.: Probabilistic modeling approach for interpretable inference and prediction with data for sepsis diagnosis. Expert Syst. Appl. 183, 115333 (2021) 169. Yates, R.: Fixed-point arithmetic: an introduction. Digital Signal Labs 81(83), 198 (2009) 170. Yuan, Z., Liu, Y., Yue, J., Yang, Y., Wang, J., Feng, X., Zhao, J., Li, X., Yang, H.: STICKER: An energy-efficient multi-sparsity compatible accelerator for convolutional neural networks in 65-nm CMOS. IEEE J. Solid-State Circuits 55(2), 465–477 (2020)


171. Zermani, S., Dezan, C., Chenini, H., Diguet, J.-P., Euler, R.: FPGA implementation of bayesian network inference for an embedded diagnosis. In: 2015 IEEE Conference on Prognostics and Health Management (PHM), pp. 1–10. IEEE, Piscataway (2015) 172. Zhang, L., Wahib, M., Zhang, H., Matsuoka, S.: A study of single and multi-device synchronization methods in Nvidia GPUs. In: 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 483–493. IEEE, Piscataway (2020) 173. Zhang, Z., Wang, H., Han, S., Dally, W.J.: Sparch: Efficient architecture for sparse matrix multiplication. In: 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 261–274. IEEE, Piscataway (2020) 174. Zhang, J.-F., Lee, C.-E., Liu, C., Shao, Y.S., Keckler, S.W., Zhang, Z.: Snap: An efficient sparse neural acceleration processor for unstructured sparse deep neural network inference. IEEE J. Solid-State Circuits 56(2), 636–647 (2021) 175. Zheng, K., Pronobis, A.: From pixels to buildings: End-to-end probabilistic deep networks for large-scale semantic mapping. In: 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3511–3518 (2019)

Index

A
Accuracy of probabilistic inference, 41, 126
Analytical error analysis, v, 19, 23, 36, 37, 128
Approximate computing, 15

C
Custom floating point, v, 19, 23, 40
Custom interconnection networks, 122

D
Dataflow graphs, v, vi, 3, 4, 7, 129
Decoupled instruction streams, 70, 76–78, 119, 128
Design space exploration, 2, 21, 91, 101, 110–114, 123, 127
Domain-specialized hardware, v, 1, 2

G
Graph partitioning with constrained optimization, 20, 43

H
Hardware lottery, v, vi, 1–22, 129
Hardware-software codesign, 21, 129
Hardware synchronization, 16, 69–71

I
Irregular workloads, 1–22, 43–67, 69–123, 128

L
Low-precision data representation, 127

M
Multithreaded execution, 17, 67, 69

P
Parallel processor for irregular graphs, 67
Posit™ representation, 19–20, 23–41, 69, 125, 127
Precision-scalable Posit™ arithmetic unit, 69, 70, 78–79
Probabilistic circuits (PC), v, 3, 4, 8–11, 14–17, 19, 21, 23–37, 40, 41, 43, 60, 64, 66, 67, 78, 81, 84–86, 101, 110, 113, 115, 121, 125, 126, 129

R
Reducing communication and synchronization, 69

S
Silicon implementation, v, 15
Sparse matrix triangular solve (SpTRSV), v, 3, 4, 8, 11–17, 21, 23, 24, 40, 41, 43, 60, 63–64, 67, 78, 81, 84–86, 101, 108, 110, 114, 122, 126, 129
Spatial datapath for irregular graphs, 89–123, 128

W
Workload balance, 20, 45, 47, 69, 126
