Embedded Machine Learning for Cyber-Physical, IoT, and Edge Computing: Software Optimizations and Hardware/Software Codesign 3031399315, 9783031399312

This book presents recent advances towards the goal of enabling efficient implementation of machine learning models on r


English Pages 491 [481] Year 2023


Table of contents :
Preface
Acknowledgments
Contents
Part I Efficient Software Design for Embedded Machine Learning
Machine Learning Model Compression for Efficient Indoor Localization on Embedded Platforms
1 Introduction
2 Background and Related Work
3 CHISEL Framework
3.1 Data Preprocessing and Augmentation
3.2 Network Architecture
3.3 Model Compression
4 Experiments
4.1 Evaluation on UJIIndoorLoc Dataset
4.2 Evaluation on Compression-Aware Training
5 Conclusion
References
A Design Methodology for Energy-Efficient Embedded Spiking Neural Networks
1 Introduction
1.1 Overview
1.2 Design Constraints for Embedded SNNs
2 Preliminaries
2.1 Spiking Neural Networks (SNNs)
2.2 Spike-Timing-Dependent Plasticity (STDP)
3 A Design Methodology for Embedded SNNs
3.1 Overview
3.2 Reduction of SNN Operations
3.3 Learning Enhancements
3.4 Weight Quantization
3.5 Evaluation of Memory and Energy Requirements
3.6 Employment of Approximate DRAM
4 Experimental Evaluations
4.1 Classification Accuracy
4.2 Reduction of Memory Requirement
4.3 Improvement of Energy Efficiency
4.4 Impact of Approximate DRAM
5 Conclusion
References
Compilation and Optimizations for Efficient Machine Learning on Embedded Systems
1 Introduction
2 Background and Related Works
2.1 Efficient DNN Designs
2.2 Efficient Accelerator Designs and DNN Mapping Methods
2.3 Efficient Co-Design Optimization
3 Efficient Machine Learning Model Designs
3.1 The ELB-NN
3.1.1 Hybrid Quantization Scheme
3.1.2 Hardware Accelerator for ELB-NN
3.2 The VecQ
3.2.1 Quantization with Vector Loss
3.2.2 Framework Integration
4 Efficient Accelerator Design and Workload Mapping
4.1 DNNBuilder
4.1.1 An End-to-end Automation Flow
4.1.2 Architecture Novelties
4.1.3 State-of-the-art Performance
4.2 PyLog: A Python-Based FPGA Programming Flow
4.2.1 PyLog Flow Overview
4.2.2 PyLog Features
4.2.3 PyLog Evaluation Results
5 Efficient Optimizations
5.1 Overview of Hardware-aware Neural Architecture Search (NAS)
5.2 HW-Aware NAS Formulation
5.3 FPGA/DNN Co-Design
5.3.1 The Key to Co-Design: Bundle
5.3.2 Progressively Reducing Search Space
5.3.3 Evaluation Results
5.4 EDD: Efficient Differential DNN Architecture Search
5.4.1 Fused Co-Design Space
5.4.2 Differentiable Performance and Resource Formulation
5.4.3 State-of-the-art Results
6 Conclusion
References
A Pedestrian Detection Case Study for a Traffic Light Controller
1 Introduction
2 Related Work
2.1 Neural Networks for Pedestrian Detection
2.2 Pedestrian Detection on Embedded Systems
2.3 Quantization
3 Pedestrian Detection Use Case
4 Results
4.1 Experimentation Setup
4.2 No Constraints
4.3 Cost Constraints
4.4 Cost, Latency, and Precision Constraints
4.5 Effect of Resolution and Quantization
5 Conclusion
References
How to Train Accurate BNNs for Embedded Systems?
1 Introduction
2 Related Work
3 Background on BNNs
3.1 Inference
3.2 Training
4 Classification of Accuracy Repair Techniques
5 Overview of Accuracy Repair Techniques as Applied in the Literature
5.1 Training Techniques
5.1.1 Binarizer (STE)
5.1.2 Normalization
5.1.3 Teacher–Student
5.1.4 Regularization
5.1.5 Two-Stage Training
5.1.6 Optimizer
5.2 Network Topology Changing
5.2.1 Scaling Factor
5.2.2 Ensemble
5.2.3 Activation Function
5.2.4 Double Residual
5.2.5 Squeeze-and-Excitation
6 Empirical Review of Accuracy Repair Methods
6.1 Establishing the Design Space
6.2 Finding a Good Baseline BNN
6.3 Design Space Exploration
6.3.1 Binarizer (STE)
6.3.2 Normalization
6.3.3 Scaling Factor
6.3.4 Two-Stage Training, Activation Function, and Double Residual
7 Discussion and Future Research
7.1 Accuracy Gap
7.2 Benefit and Cost of BNNs
8 Conclusion
References
Embedded Neuromorphic Using Intel's Loihi Processor
1 Introduction
2 Brain-Inspired Spiking Neural Networks
2.1 Spiking Neuron Models
2.2 Spike Coding Methods
2.3 SNN Learning Methods
3 Conventional Architectures vs. Neuromorphic Architectures
4 Event-Based Cameras
5 Applications and Datasets for Event-Based SNNs
6 The Loihi Architecture
6.1 Neuron Model
6.2 Chip Architecture
6.3 Second Generation: Loihi 2
6.4 Tools to Support Loihi Developers
6.5 SOTA Results of Event-Based SNNs on Loihi
7 Case Study for Autonomous Vehicles: Car Detection with CarSNN
7.1 Problem Analysis and General Design Decisions
7.2 CarSNN Methodology
7.2.1 CarSNN Model Design
7.2.2 Parameters for Training
7.2.3 Parameters for Feeding the Input Data
7.3 Evaluation of CarSNN Implemented on Loihi
7.3.1 Experimental Setup
7.3.2 Accuracy Results for Offline Trained CarSNN
7.3.3 CarSNN Implemented on Loihi
7.3.4 Comparison with the State of the Art
8 Conclusion
References
Part II Hardware-Software Co-Design and Co-Optimizations for Embedded Machine Learning
Machine Learning for Heterogeneous Manycore Design
1 Introduction
2 ML-Enabled 3D CPU/GPU-Based Heterogeneous Manycore Design
2.1 Related Prior Work
2.1.1 3D Heterogeneous Manycore Systems
2.1.2 Multi-Objective Optimization Algorithms
3 3D Heterogeneous Manycore Design Formulation
4 MOO-STAGE: ML-Enabled Manycore Design Framework
4.1 MOO-STAGE: Local Search
4.2 MOO-STAGE: Meta Search
5 Experimental Results
5.1 Experimental Setup
5.2 Comparing the Different Algorithms
5.3 Comparison with Mesh NoC-Based Heterogeneous Manycore System
6 MOO-STAGE for M3D-Based Manycore Systems
6.1 MOO-STAGE for M3D Design
7 Conclusion
References
Hardware–Software Co-design for Ultra-Resource-Constrained Embedded Machine Learning Inference: A Printed Electronics Use Case
1 Introduction
2 Background on Printed Electronics
3 Preliminaries
4 Bespoke ML Classification Circuits
4.1 Resource-Aware ML Algorithm Selection
4.2 Bespoke Classifier Implementation
5 Co-Design for Approximate ML Classification Circuits
5.1 Approximate MLPs and SVMs
5.2 Approximate Decision Trees
6 Co-design for Stochastic Neural Network Circuits
6.1 Mixed-Signal Stochastic Neuron
6.2 Analog Stochastic SNG
6.3 Analog Stochastic Activation Function
6.4 Hardware-Driven Training
6.5 Mixed-Signal Stochastic Inference
7 Conclusion
References
Cross-Layer Optimizations for Efficient Deep Learning Inference at the Edge
1 Introduction
2 Preliminaries
3 DNN Optimization Techniques
3.1 Pruning
3.1.1 Fine-Grained Pruning
3.1.2 Coarse-Grained Pruning
3.2 Quantization
3.3 Knowledge Distillation
3.4 Neural Architecture Search
3.5 Hardware Approximations
4 Cross-Layer Optimization
4.1 Methodology
4.2 Structured Pruning
4.3 Quantization
4.4 Hardware-Level Approximations: Impact of Self-Healing and Non-Self-Healing Approximate Designs on DNN Accuracy
5 End-to-End System-Level Approximations
6 Conclusion
References
Co-designing Photonic Accelerators for Machine Learning on the Edge
1 Introduction
2 Background and Related Work
3 Noncoherent Photonic Computation Overview
4 CrossLight Architecture
4.1 MR Device Engineering and Fabrication
4.2 Tuning Circuit Design
4.3 Architecture Design
4.3.1 Decomposing Vector Operations in CONV/FC Layers
4.3.2 Vector Dot Product (VDP) Unit Design
4.3.3 Optical Wavelength Reuse in VDP Units
5 Evaluation and Simulation Results
5.1 Simulation Setup
5.2 Results: CrossLight Resolution Analysis
5.3 Results: CrossLight Sensitivity Analysis
5.4 Results: Comparison with State-of-the-Art Accelerators
6 Conclusion
References
Hardware–Software Co-design of Deep Neural Architectures: From FPGAs and ASICs to Computing-in-Memories
1 Introduction
2 Hardware–Software Co-design with Neural Architecture Search
3 Hardware-Aware Neural Architecture Search for FPGA
3.1 Implementation of DNNs on FPGAs
3.2 Co-design Framework for FPGAs
3.2.1 Problem Statement and Solution
3.3 Experiments
3.3.1 Search Space Setup
3.4 Comparison Results with the Existing NAS Frameworks
3.5 Comparison Results with the Existing Architectures
3.6 Importance of Co-exploration
3.7 Concluding Remarks for NAS-F
4 Co-design of Neural Networks and ASICs
4.1 Problem Analysis for DNN-ASIC Co-design
4.1.1 Major Components
4.1.2 Problem Definition
4.2 Co-design Framework for ASIC
4.3 Experimental Evaluation
4.3.1 Evaluation Environment
4.4 Design Space Exploration
4.4.1 Results on Multiple Tasks for Multiple Datasets
4.5 Concluding Remarks for NASAIC
5 Co-design of Neural Networks and Computing-in-Memory Accelerators
5.1 Compute-in-Memory Neural Accelerators
5.1.1 Device and Its Variations
5.1.2 Crossbar Architecture
5.1.3 NeuroSIM
5.2 Problem Definition
5.3 Co-design Framework for CiM
5.4 Experiments and Results
5.4.1 Experiment Setup
5.4.2 Comparison Results to State-of-the-Art NAS
5.4.3 Results of Multi-Objective Optimization
5.5 Concluding Remarks for NACIM
6 Conclusions
References
Hardware and Software Optimizations for Capsule Networks
1 Introduction
2 Traditional DNNs vs. CapsNets
3 CapsNet Models and Applications
4 Efficient CapsNets Training Methodologies
5 Hardware Architectures for Accelerating CapsNets' Inference
5.1 CapsNets Execution on GPUs and Their Bottlenecks
5.2 The CapsAcc Accelerator
5.3 FEECA Methodology
5.4 Memory Design and Management for CapsNets
6 Lightweight Optimizations for CapsNets Inference
6.1 Quantization of CapsNets with the Q-CapsNets Framework
6.2 Approximations for Energy-Efficient CapsNets with the ReD-CaNe Methodology
7 HW-Aware Neural Architecture Search for DNNs and CapsNets
8 Conclusion
References
Design of Sparsity Optimized Photonic Deep Learning Accelerators
1 Introduction
2 Background and Related Work
3 Software and Dataflow Optimizations
3.1 Model Sparsification
3.2 Weight Clustering
3.3 Dataflow Optimizations
4 SONIC Hardware Accelerator Overview
4.1 Microring Resonators (MRs) and Robust Tuning
4.2 Vector-Dot-Product Unit (VDU) Design
4.3 SONIC Architecture
5 Experiments and Results
5.1 Model Sparsification and Clustering Results
5.2 Comparison with State-of-the-Art Accelerators
6 Conclusion
References
Algorithm-System Co-design for Efficient and Hardware-Aware Embedded Machine Learning
1 Overview
2 Efficient Inference Systems
2.1 Inference Frameworks for TinyML
2.1.1 Interpreter-Based vs. Code-Generation
2.1.2 Low-Precision Support
2.2 Scheduling Optimizations
2.2.1 Reducing Peak Memory with Optimized Scheduling
2.2.2 Kernel Optimization for Faster Inference
3 Efficient Deep Learning Model Design
3.1 Model Compression
3.1.1 Pruning
3.1.2 Quantization
3.1.3 Knowledge Distillation
3.1.4 AutoML-Based Model Compression
3.2 Neural Architecture Search
4 System-Algorithm Co-design
4.1 Co-design to Achieve the Best Performance
4.2 Co-design Broadens the Design Space
5 Summary
References
Efficient Hardware and Software Design for On-device Learning
1 Introduction
2 Related Works
2.1 Software
2.2 Hardware
3 Software: On-device Continuous Self-supervised Contrastive Learning with Selective Data Contrast
3.1 Framework Overview
3.2 Data Replacement by Contrast Scoring
3.3 Understanding the Effectiveness of Contrast Score
4 EF-Train: Enable Efficient On-device CNN Training on FPGA Through Data Reshaping for Online Adaptation or Personalization
4.1 On-device CNN Training Accelerator
4.1.1 The Architecture of the Training Accelerator
4.1.2 The Forward and Backward Propagation of a Convolutional Layer
4.1.3 The Weight Update of a Convolutional Layer
4.2 Data Reshaping Approach
4.2.1 Analysis on Discontinuous Memory Access
4.2.2 Optimizing Discontinuous Memory Access
4.2.3 Weight Reuse in Mini-Batch Training
5 Experiment
5.1 Software Experimental Results
5.1.1 Experimental Setup
5.1.2 Improved Accuracy
5.1.3 Learning Speed
5.2 Hardware Experimental Results
5.2.1 Correctness of the Accelerator
5.2.2 Effectiveness of the Data Reshaping Approach
5.2.3 Comparison with the State-of-the-Art Works
6 Conclusion
References
Pipelined CNN Inference on Heterogeneous Multi-processor System-on-Chip
1 Introduction
2 Background
2.1 Heterogeneous Multi-processor System-on-Chips
2.2 Convolution Neural Networks
2.3 ARM Compute Library (ARM-CL)
3 Related Work
4 Experimental Setup
5 Non-pipelined Parallel Inference
6 Pipelined CNN Inference on Asymmetric Multi-core
7 Pipelined CNN Inference on Asymmetric Multi-core and GPU
8 Pipelined CNN Inference and NPU
9 Conclusion and Future Outlook
References
Efficient Neural Networks and Their Acceleration Techniques for Embedded Machine Learning
1 HW–SW–ML Co-design Flow and Enabling Techniques
2 Model Compression Techniques
2.1 Quantization
2.2 Pruning
2.3 Weight Sharing
2.4 Low-Rank Approximation
2.5 Knowledge Distillation
3 Lightweight Neural Architectures
3.1 Depthwise Separable Convolution
3.2 Neural ODE
3.3 Neural Architecture Search
3.4 On-Device Learning
4 Accelerator Design Techniques
4.1 SoC FPGA Platform
4.2 Design Optimization Techniques
4.2.1 Data Type
4.2.2 Loop Unrolling
4.2.3 Array Partitioning
4.2.4 Pipelining
4.2.5 Dataflow Optimization
4.2.6 Extensibility
4.3 Evaluation Results
4.3.1 Design Techniques
4.3.2 Quantization
4.3.3 Data Transfer Overhead
5 Summary
References
Hardware–Software Codesign of an Adder-Tree Type CNN Accelerator
1 Introduction
2 HW/SW Codesign Flow for NPU Design
3 MIDAP Baseline Architecture
3.1 Data Layout and Alignment Submodule
3.2 Extended CIM
4 Virtual Prototyping and Design Space Exploration
4.1 L1 Control Code
4.2 High-Level Compiler
4.3 Early Simulation Results
4.4 Design Space Exploration
5 Supporting Non-Convolutional Layers
5.1 PE Extension to Support Element-Wise Operation
5.2 Reduction Module
5.3 Simulation Results
6 Remaining Design Steps to NPU Implementation
7 Summary and Future Work
References
Index

Sudeep Pasricha • Muhammad Shafique
Editors

Embedded Machine Learning for Cyber-Physical, IoT, and Edge Computing: Software Optimizations and Hardware/Software Codesign


Editors

Sudeep Pasricha
Colorado State University
Fort Collins, CO, USA

Muhammad Shafique
New York University Abu Dhabi
Abu Dhabi, Abu Dhabi, United Arab Emirates

ISBN 978-3-031-39931-2
ISBN 978-3-031-39932-9 (eBook)
https://doi.org/10.1007/978-3-031-39932-9

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2024

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG.
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Paper in this product is recyclable.

Preface

Machine Learning (ML) has emerged as a prominent approach for achieving state-of-the-art accuracy for many data analytic applications, ranging from computer vision (e.g., classification, segmentation, and object detection in images and video), speech recognition, language translation, healthcare diagnostics, robotics, and autonomous vehicles to business and financial analysis. The driving force behind the success of ML is the advent of Neural Network (NN) algorithms, such as Deep Neural Networks (DNNs)/Deep Learning (DL) and Spiking Neural Networks (SNNs), with support from today’s evolving computing landscape to better exploit data and thread-level parallelism with ML accelerators. Current trends show an immense interest in attaining the powerful abilities of NN algorithms for solving ML tasks using embedded systems with limited compute and memory resources, i.e., so-called Embedded ML. One of the main reasons is that embedded ML systems may enable a wide range of applications, especially the ones with tight memory and power/energy constraints, such as mobile systems, Internet of Things (IoT), edge computing, and cyber-physical applications. Furthermore, embedded ML systems can also improve the quality of service (e.g., personalized systems) and privacy as compared to centralized ML systems (e.g., based on cloud computing). However, state-of-the-art NN-based ML algorithms are costly in terms of memory sizes and power/energy consumption, thereby making it difficult to enable embedded ML systems. This book, which consists of three volumes, explores and identifies the most challenging issues that hinder the implementation of embedded ML systems.
These issues arise from the fact that, to achieve better accuracy, the development of NN algorithms has led to state-of-the-art models with higher complexity with respect to model sizes and operations, the implications of which are discussed below:

• Massive Model Sizes: Larger NN models usually obtain higher accuracy than smaller ones because they have a larger number of NN parameters that can better learn the features from the training dataset. However, a huge number of parameters may not be fully stored on-chip, hence requiring large-sized off-chip memory to store them and intensive off-chip memory accesses during run time. Furthermore, these intensive off-chip accesses are significantly more expensive in terms of latency and energy than on-chip operations, hence increasing the overall system energy consumption.
• Complex and Intensive Operations: The complexity of operations in NN algorithms depends on the computational model and the network architecture. For instance, DNNs and SNNs have different complexity of operations since DNNs typically employ Multiply-and-Accumulate (MAC) while SNNs employ more bio-plausible operations like Leaky-Integrate-and-Fire (LIF). Besides, more complex neural architectures (e.g., residual networks) may require additional operations to accommodate the architectural variations. These complex architectures with a huge number of parameters also lead to intensive neural operations (e.g., a large number of MAC operations in DNNs), thereby requiring high computing power/energy during model execution.

In summary, achieving acceptable accuracy for the given ML applications while meeting the latency, memory, and power/energy constraints of the embedded ML systems is not a trivial task. To address these challenges, this book discusses potential solutions from multiple design aspects, presents multiple applications that can benefit from embedded ML systems, and discusses the security, privacy, and robustness aspects of embedded ML systems.

To provide comprehensive coverage of all these different topics, which are crucial for designing and deploying embedded ML for real-world applications, this book is partitioned into three volumes. The first volume covers the Hardware Architectures, the second volume covers Software Optimizations and Hardware/Software Codesign, and the third volume presents different Use Cases and Emerging Challenges. The brief outline of the second volume of this Embedded ML book, targeting Software Optimizations and Hardware/Software Codesign, along with the part structure is as follows.
Part I – Efficient Software Design: To efficiently run the NN algorithms on hardware platforms, the software should be designed and optimized judiciously. Toward this goal, the first part of Volume 2 of this book provides efficient software designs for embedded ML systems.

• Chapter 1 explains how to obtain optimized ML models for indoor localization applications on mobile devices through convolutional encoders, CNNs, and model compression.
• Chapter 2 discusses a methodology for optimizing SNN processing for both the training and inference phases on memory- and energy-constrained embedded platforms.
• Chapter 3 employs a series of design methodologies including compilation and optimization techniques when designing DNN models for efficient execution on hardware accelerators.
• Chapter 4 explores different NN architectures and hardware platforms to find the most suitable embedded ML solution for a pedestrian detection system.
• Chapter 5 describes techniques for efficiently training Binary Neural Networks (BNNs) targeting resource-constrained embedded applications.
• Chapter 6 studies the challenges and benefits of employing neuromorphic computing using Intel’s Loihi processor for embedded applications.

Part II – Hardware-Software Co-design and Co-optimization: One-sided design and optimization (i.e., either on the hardware or software layer) may not be sufficient to improve the efficiency of embedded ML since benefits from one layer might not be exploited by the other layer. To address this, the second part of Volume 2 of this book discusses hardware and software co-design and co-optimization techniques that either employ ML algorithms, target embedded ML systems, or do both.

• Chapter 7 explores ML-based design space exploration methods to learn the behavior of the search space for finding suitable solutions when designing heterogeneous manycore systems.
• Chapter 8 describes hardware-software co-design techniques for embedded ML inference targeting a printed electronics application.
• Chapter 9 presents an overview of different DNN optimization and approximation techniques, then discusses cross-layer methodologies for realizing DNN inference at the edge.
• Chapter 10 develops a cross-layer optimized silicon photonic-based NN accelerator while considering device-level fabrication optimizations, circuit-level tuning enhancements, as well as architecture-level design and mapping improvements.
• Chapter 11 explores hardware/software co-design techniques for deep neural architectures across different platforms, like FPGAs, Application-Specific Integrated Circuits (ASICs), and in-memory computing (IMC).
• Chapter 12 presents hardware and software optimizations for capsule networks that include hardware acceleration, training methods, and neural architecture search (NAS).
• Chapter 13 proposes a silicon photonic-based accelerator for sparse DL inference by exploiting the low latency nature of photonic devices coupled with software-level optimizations.
• Chapter 14 describes the efficient algorithm and system co-design for reducing memory and computation costs of DNNs including model designs, model compression, and NAS.
• Chapter 15 discusses hardware and software techniques for enabling on-device DNN learning by considering the most representative training data and employing a specialized accelerator.
• Chapter 16 develops techniques to perform pipelined CNN inference on a heterogeneous multi-processor system-on-chip platform.
• Chapter 17 outlines the optimization techniques of DNN models (such as quantization, loop unrolling, and pipelining) and evaluates their implementations on a system-on-chip platform.
• Chapter 18 presents hardware and software techniques for implementing a CNN accelerator with adder-tree-type neural processing units.

We hope this book provides a comprehensive review and useful information on the recent advances in embedded machine learning for cyber-physical, IoT, and edge computing applications.

Sudeep Pasricha, Fort Collins, CO, USA
Muhammad Shafique, Abu Dhabi, UAE
September 1, 2023

Acknowledgments

This book would not be possible without the contributions of many researchers and experts in the fields of embedded systems, machine learning, IoT, edge platforms, and cyber-physical systems. We would like to gratefully acknowledge the contributions of Rachmad Putra (Technische Universität Wien), Muhammad Abdullah Hanif (New York University, Abu Dhabi), Febin Sunny (Colorado State University), Asif Mirza (Colorado State University), Mahdi Nikdast (Colorado State University), Ishan Thakkar (University of Kentucky), Maarten Molendijk (Eindhoven University of Technology), Floran de Putter (Eindhoven University of Technology), Henk Corporaal (Eindhoven University of Technology), Salim Ullah (Technische Universität Dresden), Siva Satyendra Sahoo (Technische Universität Dresden), Akash Kumar (Technische Universität Dresden), Arnab Raha (Intel), Raymond Sung (Intel), Soumendu Ghosh (Purdue University), Praveen Kumar Gupta (Intel), Deepak Mathaikutty (Intel), Umer I. Cheema (Intel), Kevin Hyland (Intel), Cormac Brick (Intel), Vijay Raghunathan (Purdue University), Gokul Krishnan (Arizona State University), Sumit K. Mandal (Arizona State University), Chaitali Chakrabarti (Arizona State University), Jae-sun Seo (Arizona State University), Yu Cao (Arizona State University), Umit Y. Ogras (University of Wisconsin, Madison), Ahmet Inci (University of Texas, Austin), Mehmet Meric Isgenc (University of Texas, Austin), Diana Marculescu (University of Texas, Austin), Rehan Ahmed (National University of Sciences and Technology, Islamabad), Muhammad Zuhaib Akbar (National University of Sciences and Technology, Islamabad), Lois Orosa (ETH Zürich), Skanda Koppula (ETH Zürich), Konstantinos Kanellopoulos (ETH Zürich), A. Giray Yağlıkçı (ETH Zürich), Onur Mutlu (ETH Zürich), Saideep Tiku (Colorado State University), Liping Wang (Colorado State University), Xiaofan Zhang (University of Illinois Urbana-Champaign), Yao Chen (University of Illinois Urbana-Champaign), Cong Hao (University of Illinois Urbana-Champaign), Sitao Huang (University of Illinois Urbana-Champaign), Yuhong Li (University of Illinois Urbana-Champaign), Deming Chen (University of Illinois Urbana-Champaign), Alexander Wendt (Technische Universität Wien), Horst Possegger (Technische Universität Graz), Matthias Bittner (Technische Universität Wien), Daniel Schnoell (Technische Universität Wien), Matthias Wess (Technische Universität Wien),


Dušan Malić (Technische Universität Graz), Horst Bischof (Technische Universität Graz), Axel Jantsch (Technische Universität Wien), Floran de Putter (Eindhoven University of Technology), Alberto Marchisio (Technische Universität Wien), Fan Chen (Indiana University Bloomington), Lakshmi Varshika Mirtinti (Drexel University), Anup Das (Drexel University), Supreeth Mysore Shivanandamurthy (University of Kentucky), Sayed Ahmad Salehi (University of Kentucky), Biresh Kumar Joardar (University of Houston), Janardhan Rao Doppa (Washington State University), Partha Pratim Pande (Washington State University), Georgios Zervakis (Karlsruhe Institute of Technology), Mehdi B. Tahoori (Karlsruhe Institute of Technology), Jörg Henkel (Karlsruhe Institute of Technology), Zheyu Yan (University of Notre Dame), Qing Lu (University of Notre Dame), Weiwen Jiang (George Mason University), Lei Yang (University of New Mexico), X. Sharon Hu (University of Notre Dame), Jingtong Hu (University of Pittsburgh), Yiyu Shi (University of Notre Dame), Beatrice Bussolino (Politecnico di Torino), Alessio Colucci (Technische Universität Wien), Vojtech Mrazek (Brno University of Technology), Maurizio Martina (Politecnico di Torino), Guido Masera (Politecnico di Torino), Ji Lin (Massachusetts Institute of Technology), Wei-Ming Chen (Massachusetts Institute of Technology), Song Han (Massachusetts Institute of Technology), Yawen Wu (University of Pittsburgh), Yue Tang (University of Pittsburgh), Dewen Zeng (University of Notre Dame), Xinyi Zhang (University of Pittsburgh), Peipei Zhou (University of Pittsburgh), Ehsan Aghapour (University of Amsterdam), Yujie Zhang (National University of Singapore), Anuj Pathania (University of Amsterdam), Tulika Mitra (National University of Singapore), Hiroki Matsutani (Keio University), Keisuke Sugiura (Keio University), Soonhoi Ha (Seoul National University), Donghyun Kang (Seoul National University), Ayush Mittal (Colorado State University), Bharath Srinivas Prabakaran (Technische Universität Wien), Ganapati Bhat (Washington State University), Dina Hussein (Washington State University), Nuzhat Yamin (Washington State University), Rafael Makrigiorgis (University of Cyprus), Shahid Siddiqui (University of Cyprus), Christos Kyrkou (University of Cyprus), Panayiotis Kolios (University of Cyprus), Theocharis Theocharides (University of Cyprus), Anil Kanduri (University of Turku), Sina Shahhosseini (University of California, Irvine), Emad Kasaeyan Naeini (University of California, Irvine), Hamidreza Alikhani (University of California, Irvine), Pasi Liljeberg (University of Turku), Nikil Dutt (University of California, Irvine), Amir M. Rahmani (University of California, Irvine), Sizhe An (University of Wisconsin-Madison), Yigit Tuncel (University of Wisconsin-Madison), Toygun Basaklar (University of Wisconsin-Madison), Aditya Khune (Colorado State University), Rozhin Yasaei (University of California, Irvine), Mohammad Abdullah Al Faruque (University of California, Irvine), Kruttidipta Samal (University of Nebraska, Lincoln), Marilyn Wolf (University of Nebraska, Lincoln), Joydeep Dey (Colorado State University), Vipin Kumar Kukkala (Colorado State University), Sooryaa Vignesh Thiruloga (Colorado State University), Marios Pafitis (University of Cyprus), Antonis Savva (University of Cyprus), Yue Wang (New York University), Esha Sarkar (New York University), Saif Eddin Jabari (New York University Abu Dhabi), Michail Maniatakos (New York University Abu Dhabi), Mahum Naseer (Technische Universität Wien), Iram


Tariq Bhatti (National University of Sciences and Technology, Islamabad), Osman Hasan (National University of Sciences and Technology, Islamabad), Hao Fu (New York University), Alireza Sarmadi (New York University), Prashanth Krishnamurthy (New York University), Siddharth Garg (New York University), Farshad Khorrami (New York University), Priyadarshini Panda (Yale University), Abhiroop Bhattacharjee (Yale University), Abhishek Moitra (Yale University), Ihsen Alouani (Queen’s University Belfast), Stefanos Koffas (Delft University of Technology), Behrad Tajalli (Radboud University), Jing Xu (Delft University of Technology), Mauro Conti (University of Padua), and Stjepan Picek (Radboud University).

This work was partially supported by the National Science Foundation (NSF) grants CCF-1302693, CCF-1813370, and CNS-2132385; by the NYUAD Center for Interacting Urban Networks (CITIES), funded by Tamkeen under the NYUAD Research Institute Award CG001, Center for Cyber Security (CCS), funded by Tamkeen under the NYUAD Research Institute Award G1104, and Center for Artificial Intelligence and Robotics (CAIR), funded by Tamkeen under the NYUAD Research Institute Award CG010; and by the project “eDLAuto: An Automated Framework for Energy-Efficient Embedded Deep Learning in Autonomous Systems,” funded by the NYUAD Research Enhancement Fund (REF). The opinions, findings, conclusions, or recommendations presented in this book are those of the authors and do not necessarily reflect the views of the National Science Foundation and other funding agencies.

Contents

Part I Efficient Software Design for Embedded Machine Learning

Machine Learning Model Compression for Efficient Indoor Localization on Embedded Platforms
  Saideep Tiku, Liping Wang, and Sudeep Pasricha

A Design Methodology for Energy-Efficient Embedded Spiking Neural Networks
  Rachmad Vidya Wicaksana Putra and Muhammad Shafique

Compilation and Optimizations for Efficient Machine Learning on Embedded Systems
  Xiaofan Zhang, Yao Chen, Cong Hao, Sitao Huang, Yuhong Li, and Deming Chen

A Pedestrian Detection Case Study for a Traffic Light Controller
  Alexander Wendt, Horst Possegger, Matthias Bittner, Daniel Schnöll, Matthias Wess, Dušan Malić, Horst Bischof, and Axel Jantsch

How to Train Accurate BNNs for Embedded Systems?
  F. A. M. de Putter and Henk Corporaal

Embedded Neuromorphic Using Intel’s Loihi Processor
  Alberto Marchisio and Muhammad Shafique

Part II Hardware-Software Co-Design and Co-Optimizations for Embedded Machine Learning

Machine Learning for Heterogeneous Manycore Design
  Biresh Kumar Joardar, Janardhan Rao Doppa, and Partha Pratim Pande

Hardware–Software Co-design for Ultra-Resource-Constrained Embedded Machine Learning Inference: A Printed Electronics Use Case
  Georgios Zervakis, Mehdi B. Tahoori, and Jörg Henkel

Cross-Layer Optimizations for Efficient Deep Learning Inference at the Edge
  Muhammad Abdullah Hanif and Muhammad Shafique

Co-designing Photonic Accelerators for Machine Learning on the Edge
  Febin P. Sunny, Asif Mirza, Mahdi Nikdast, and Sudeep Pasricha

Hardware–Software Co-design of Deep Neural Architectures: From FPGAs and ASICs to Computing-in-Memories
  Zheyu Yan, Qing Lu, Weiwen Jiang, Lei Yang, X. Sharon Hu, Jingtong Hu, and Yiyu Shi

Hardware and Software Optimizations for Capsule Networks
  Alberto Marchisio, Beatrice Bussolino, Alessio Colucci, Vojtech Mrazek, Muhammad Abdullah Hanif, Maurizio Martina, Guido Masera, and Muhammad Shafique

Design of Sparsity Optimized Photonic Deep Learning Accelerators
  Febin Sunny, Mahdi Nikdast, and Sudeep Pasricha

Algorithm-System Co-design for Efficient and Hardware-Aware Embedded Machine Learning
  Ji Lin, Wei-Ming Chen, and Song Han

Efficient Hardware and Software Design for On-device Learning
  Yawen Wu, Yue Tang, Dewen Zeng, Xinyi Zhang, Peipei Zhou, Yiyu Shi, and Jingtong Hu

Pipelined CNN Inference on Heterogeneous Multi-processor System-on-Chip
  Ehsan Aghapour, Yujie Zhang, Anuj Pathania, and Tulika Mitra

Efficient Neural Networks and Their Acceleration Techniques for Embedded Machine Learning
  Hiroki Matsutani and Keisuke Sugiura

Hardware–Software Codesign of an Adder-Tree Type CNN Accelerator
  Soonhoi Ha and Donghyun Kang

Index

Part I

Efficient Software Design for Embedded Machine Learning

Machine Learning Model Compression for Efficient Indoor Localization on Embedded Platforms

Saideep Tiku, Liping Wang, and Sudeep Pasricha

1 Introduction

Contemporary geo-location services have eliminated the need for cumbersome paper-based maps that were the dominant navigation strategy of the past. Outdoor mapping, localization, and navigation technologies have reinvented the way we interact with the world around us. However, due to the limited permeability of GPS signals within indoor environments, such services do not function in buildings such as malls, hospitals, and schools. In an effort to extend localization services to buildings and subterranean locales, indoor localization solutions are experiencing a recent upsurge in interest. While substantial progress has been made in this area (see Sect. 2), WiFi-based fingerprinting for the purpose of indoor localization stands out as the most promising solution. This is mainly due to the ubiquitous nature of WiFi access points (APs) and their signals in buildings and the superior localization accuracies demonstrated with it.

Fingerprinting consists of two phases: the first phase, known as the offline phase, consists of collecting WiFi signal characteristics such as RSSI (received signal strength indicator) at various indoor locations or reference points (RPs) in a building. The vector of wireless signal RSSI values from all APs observable at an indoor location represents a fingerprint of that location. Such fingerprints collected across RPs in the offline phase form a fingerprint dataset, where each row in the dataset consists of an RSSI fingerprint along with its associated RP location. Using this dataset from the offline phase, a machine learning (ML) model can then be trained and deployed on embedded devices (e.g., smartphones) equipped with WiFi

S. Tiku · L. Wang · S. Pasricha
Department of Electrical and Computer Engineering, Colorado State University, Fort Collins, CO, USA
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. S. Pasricha, M. Shafique (eds.), Embedded Machine Learning for Cyber-Physical, IoT, and Edge Computing, https://doi.org/10.1007/978-3-031-39932-9_1


transceivers. In the second phase, called the online phase, WiFi RSSI captured by a user is sent to the ML model and used to predict the user’s location in the building.

Recent efforts report improved indoor localization accuracy through the use of convolutional neural network (CNN) models [1]. This success is mainly attributed to the superior ability of CNNs to discern underlying patterns within fingerprints. CNN models can be deployed on smartphones and allow users to localize themselves within buildings in real time. Executing these models on smartphones instead of the cloud further enables security and sustainability, as it eliminates the need for user data to be shared through unsecured networks [2].

Unfortunately, research in the domain of indoor localization neglects the high memory and computational requirements of CNNs, making deployment on resource-constrained embedded systems such as smartphones a challenge. While post-training model compression can ease model deployability in the online phase, it leads to an unpredictable degradation in localization performance. Thus, there is a critical demand for holistic deep learning solutions that can provide robust indoor localization performance when deployed on embedded devices.

In this chapter, we propose a novel multidimensional approach toward indoor localization that combines convolutional autoencoders, CNNs, and model compression to deliver a sustainable and lightweight framework called CHISEL. The main contributions of this work can be summarized as follows:

• We propose a novel deep learning-based indoor localization solution that is centered on RSSI pattern recognition and is aware of pruning and quantization; it combines a convolutional autoencoder (CAE) and a CNN classifier to deliver a lightweight and robust solution for embedded deployment.
• We describe a methodology for fingerprint augmentation that, in combination with our proposed model, improves generalization and lowers overfitting in the offline phase.
• We benchmark the performance of CHISEL against state-of-the-art ML and deep learning-based indoor localization frameworks with an open indoor localization database to quantify its superior performance and lower overheads.

2 Background and Related Work

An intuitively straightforward and fairly well-studied approach to indoor localization is trilateration/triangulation using angle of arrival (AoA) or time of flight (ToF) methods [2]. However, AoA introduces hardware complexity and is very sensitive to computational error, especially when the distance between the transmitter and receiver becomes large. ToF requires tight synchronization, and even with sufficient resolution from signal bandwidth and sampling rate, significant localization errors cannot be eliminated when no line-of-sight paths are available. Both of the aforementioned methods also require precise knowledge of AP locations, making them impractical for many indoor


environments. Moreover, indoor locales pose the additional challenge of being composed of complex angular interior structures as well as diverse construction materials, e.g., concrete, metal, and wood, which contribute to hard-to-predict reductions in localization accuracy due to wireless multipath and shadowing effects [2].

The core advantage of fingerprinting-based approaches can be attributed to their ubiquity and flexibility. Fingerprinting does not require rigid synchronization or knowledge of AP locations and is also relatively immune to multipath and shadowing effects. Many RSSI fingerprinting-based ML solutions have been proposed for indoor localization, e.g., approaches using support vector regression (SVR) [3], k-nearest neighbors (KNN) [4–7], and random forest (RF) [8].

Deep learning-based fingerprinting methods present themselves as a promising avenue, as they have been shown to outperform classical ML approaches. Many deep learning techniques have been adapted to the domain of indoor localization [2, 9–11]. The work by Jang et al. [12] proposed a CNN classifier, while Nowicki et al. [13] built a model consisting of a stacked autoencoder (SAE) for feature space reduction followed by a DNN classifier (SAEDNN). The experiments were conducted on the UJIIndoorLoc dataset [14] to predict the building and floor that a user is located on. However, these works do not consider positioning the user within a given floor, which is a much harder problem. At the same time, another SAEDNN-based model was proposed in [15] that reduced the number of nodes in the final layer. Later, a 1D CNN approach [25] was shown to outperform these works. This is achieved through the additional overhead of deploying multiple CNN models hierarchically that predict the user’s building, then the floor, and finally the location on that floor, which has high memory and computational costs.
While previous works propose promising deep learning-based approaches for relatively well-known challenges associated with indoor localization such as device heterogeneity [5–7] and temporal variation [16], they consistently overlook deployability issues arising from memory and computational overheads in embedded devices that can directly impact the localization latency [17] and energy efficiency of the framework [18, 19]. Post-training model compression techniques can help mitigate these deployment issues, but lead to an unacceptable loss in localization accuracy, as discussed in Sect. 4. To overcome these challenges, we first present an evaluation of the impact of compression-agnostic indoor localization model design. We then present our deployment-centric approach that utilizes a combination of a convolutional autoencoder (CAE) and a CNN, and show that it maintains its prediction robustness across the deployment process.

3 CHISEL Framework

In this section, we discuss the various components of the CHISEL framework as covered in our work [20]. The following subsections discuss our preprocessing and augmentation approach followed by a detailed view of our deep-learning network architecture and, finally, the model compression and pruning methodologies evaluated.

3.1 Data Preprocessing and Augmentation

For the purpose of the experimental evaluations presented in this chapter, we employ the UJIIndoorLoc indoor localization benchmark [14] that covers a total of three buildings and five floors. Our approach considers a total of 905 unique RPs, such that each RP represents a unique combination of [building ID/floor ID/space ID/relative position ID]. Here the space ID is used to differentiate between locations inside and outside a room. The relative position ID locates the user on a given floor. The RSSI values for WiFi APs vary in the range of −100 to 0 dBm, where −100 dBm indicates no signal and 0 dBm indicates full signal strength. We normalize these RSSI values to a range of 0–1. As there is no test data in the UJIIndoorLoc dataset, we utilize the validation component (1111 samples) of the dataset as the test set. The training portion of the dataset is split into training (15,950 samples) and validation (3987 samples) subsets, based on an 80:20 split.

To compensate for the limited samples per RP and to further improve generalization, we augment the fingerprint dataset. For each RP, we first calculate the mean value of all nonzero AP RSSI readings within that RP and the absolute difference between this mean and each AP value. We then generate new AP RSSI values from a uniform distribution over the difference range obtained in the first step. The final dataset is the shuffled combination of the original and augmented fingerprints.

Considering our use of convolutional deep learning networks that are designed to work with images, we translate RSSI fingerprints into greyscale images. To achieve this, each fingerprint is zero-padded and translated into a single-channel square-shaped image, similar to the work in [1]. For the UJIIndoorLoc dataset, this produces 24 × 24 × 1 dimensional images. This new fingerprint image-based dataset is then used to train the deep learning model described in the next subsection.
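The preprocessing steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, missing readings are assumed to be already coded as −100 dBm, and the augmentation follows one possible reading of the text in which the mean and the deviation range are computed per AP over the fingerprints of a single RP. The 520-AP width used in the usage note comes from the public UJIIndoorLoc release rather than from this chapter.

```python
import numpy as np

NO_SIGNAL_DBM = -100.0  # the chapter's value for "no signal"


def normalize_rssi(rssi_dbm):
    """Map RSSI from [-100, 0] dBm to [0, 1]; 0 then means 'no signal'."""
    rssi = np.clip(np.asarray(rssi_dbm, dtype=np.float32), NO_SIGNAL_DBM, 0.0)
    return (rssi - NO_SIGNAL_DBM) / (0.0 - NO_SIGNAL_DBM)


def to_image(fingerprint, side=24):
    """Zero-pad a 1-D fingerprint and reshape it to side x side x 1."""
    fp = np.asarray(fingerprint, dtype=np.float32).ravel()
    if fp.size > side * side:
        raise ValueError("fingerprint does not fit into the image")
    padded = np.zeros(side * side, dtype=np.float32)
    padded[: fp.size] = fp
    return padded.reshape(side, side, 1)


def augment_rp(fingerprints, n_new, rng):
    """Generate n_new synthetic fingerprints for ONE reference point.

    Assumed interpretation: per AP, take the mean of its nonzero readings,
    find the largest absolute deviation from that mean, and sample new
    values uniformly inside [mean - dev, mean + dev].
    """
    fps = np.asarray(fingerprints, dtype=np.float32)
    seen = fps > 0                      # nonzero readings only
    counts = seen.sum(axis=0)
    visible = counts > 0                # APs observed at this RP
    mean = np.zeros(fps.shape[1], dtype=np.float32)
    mean[visible] = fps.sum(axis=0)[visible] / counts[visible]
    dev = np.where(seen, np.abs(fps - mean), 0.0).max(axis=0)
    low = np.clip(mean - dev, 0.0, 1.0)
    high = np.clip(mean + dev, 0.0, 1.0)
    new = rng.uniform(low, high, size=(n_new, fps.shape[1]))
    new[:, ~visible] = 0.0              # unseen APs stay at 'no signal'
    return new.astype(np.float32)
```

For example, `to_image(normalize_rssi(raw), side=24)` turns a 520-value fingerprint into a 24 × 24 × 1 image by padding it to 576 values, and `augment_rp(rp_fingerprints, 10, np.random.default_rng(0))` would add ten synthetic fingerprints for one RP.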

3.2 Network Architecture

Table 1 depicts our proposed deep learning model, which contains the CAE and CNN components that are trained in two stages. The first stage of CHISEL’s localization deep-learning model comprises an overcomplete CAE. This encoder-decoder pair is trained using the mean squared error (MSE) loss between the input and the output fingerprint. This process enables the CAE encoder to efficiently extract hidden features within the input fingerprints. In the second stage, the decoder is dropped and replaced by the CNN classifier as given in Table 1. The goal of this classifier is to predict the user’s location, given the encoded input from the CAE. The model is then retrained with the weights associated with the encoder frozen


Table 1 CHISEL’s CAECNN network model layers

| Layer type | Layer size | Filter count | Filter size | Stride value | Output size |
|---|---|---|---|---|---|
| CAE: Encoder | | | | | |
| Input | – | – | – | – | 24 × 24 × 1 |
| Convolutional | 24 × 24 | 16 | 3 × 3 | 1 × 1 | 24 × 24 × 16 |
| Max pooling | – | 1 | 2 × 2 | 2 × 2 | 12 × 12 × 16 |
| Convolutional | 12 × 12 | 8 | 3 × 3 | 1 × 1 | 12 × 12 × 8 |
| CAE: Decoder | | | | | |
| Up sampling | – | 1 | 2 × 2 | 2 × 2 | 24 × 24 × 8 |
| Convolutional | 24 × 24 | 1 | 3 × 3 | 2 × 2 | 24 × 24 × 1 |
| CNN classifier | | | | | |
| Convolutional | 12 × 12 | 8 | 3 × 3 | 1 × 1 | 12 × 12 × 8 |
| Convolutional | 12 × 12 | 16 | 3 × 3 | 1 × 1 | 12 × 12 × 16 |
| Max pooling | – | 1 | 2 × 2 | 2 × 2 | 6 × 6 × 16 |
| Convolutional | 6 × 6 | 32 | 3 × 3 | 1 × 1 | 6 × 6 × 32 |
| Convolutional | 6 × 6 | 32 | 3 × 3 | 1 × 1 | 6 × 6 × 32 |
| Max pooling | – | 1 | 2 × 2 | 2 × 2 | 3 × 3 × 32 |
| Flatten | 1 × 288 | – | – | – | 1 × 1 × 288 |
| Fully connected | 1 × 128 | – | – | – | 1 × 1 × 128 |
| Batch norm | 1 × 128 | – | – | – | 1 × 1 × 128 |
| Softmax | 1 × 905 | – | – | – | 1 × 1 × 905 |

in place and the loss function set to sparse categorical cross-entropy. ReLU (rectified linear unit) is the only activation function used for all convolutional and fully connected layers. The full model has 171,209 parameters in total.
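The stated parameter total can be cross-checked against the layer sizes in Table 1 with a short calculation. The sketch below rests on several assumptions that are not spelled out in the text: every convolutional and fully connected layer has a bias term, the softmax row is a 905-unit fully connected layer, batch normalization is counted with four values per feature (scale, shift, moving mean, moving variance), and the deployed model consists of the encoder plus the classifier, with the decoder dropped. Under these assumptions the sum matches the figure quoted above.

```python
def conv_params(kernel, c_in, c_out):
    """Weights plus biases of a kernel x kernel convolution."""
    return kernel * kernel * c_in * c_out + c_out


def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def batchnorm_params(features):
    """Scale, shift, moving mean, and moving variance per feature."""
    return 4 * features


# Encoder and classifier rows of Table 1 (decoder excluded, see lead-in).
layers = {
    "encoder conv 1 -> 16": conv_params(3, 1, 16),
    "encoder conv 16 -> 8": conv_params(3, 16, 8),
    "classifier conv 8 -> 8": conv_params(3, 8, 8),
    "classifier conv 8 -> 16": conv_params(3, 8, 16),
    "classifier conv 16 -> 32": conv_params(3, 16, 32),
    "classifier conv 32 -> 32": conv_params(3, 32, 32),
    "fully connected 288 -> 128": dense_params(288, 128),
    "batch norm 128": batchnorm_params(128),
    "softmax 128 -> 905": dense_params(128, 905),
}

total_params = sum(layers.values())
for name, count in layers.items():
    print(f"{name:28s} {count:7d}")
print(f"{'total':28s} {total_params:7d}")  # 171209
```

The two final dense layers account for most of the parameters, which is one reason magnitude-based compression of this model can be effective.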

3.3 Model Compression

In this chapter, we explore combinations of two orthogonal approaches for model compression toward the goal of improving localization inference time:

Quantization The parameters of a neural network are, in general, represented by 32-bit floating-point values. However, a single neural network model can consist of hundreds of thousands to millions of parameters, leading to very large memory footprints and high inference latency. Quantization achieves model compression and computational relaxation by limiting the bit-width used to represent weights and activations. However, this can lead to unpredictable accuracy degradation due to the reduction in floating-point (FP) precision [21]. To overcome this issue, researchers have proposed quantization-aware training (QAT), which involves quantizing weights and/or activations before the training process actually begins. Numerous variations of this approach have been proposed over the years, such as binarization, low-bit FP representation, and integer estimation


[21]. The integer estimation approach improves latency by replacing FP operations with integer operations at inference and is therefore, by itself, not effective at memory footprint reduction. Alternatively, binarization heavily limits achievable model accuracy due to the low coverage of possible values each parameter can assume. Therefore, a low-bit INT representation is the most flexible middle ground. In this work, we evaluated both post-training and training-aware quantization across a range of bit widths. The results of this analysis are presented in Sect. 4. We explored quantization levels ranging from 32 bits down to 2 bits. We applied a uniform quantizer to all convolutional layers, keeping an equal number of positive and negative representations of parameters. Scaling the input tensors can further improve quantized model accuracy [21]. We calculated scaling factors channel by channel by averaging the absolute values of the weights in each channel. In addition, to overcome the issue of vanishing gradients, we apply the “straight-through estimator” (STE) [22] to the standard ReLU activation function.

Pruning This approach involves selectively removing weights from a previously trained neural network. Out of the many pruning methodologies such as filter pruning, layer pruning, and connection pruning, we employ connection pruning and filter pruning due to their diverse applicability and promising results across different kinds of models and layers [21]. Toward this goal, we implemented sparse connection pruning and filter pruning that are focused on zeroing out either a weight value or entire filters based on their magnitude [23, 24]. To achieve a sparsity of S%, the weights of the model are ranked based on magnitude, and the smallest S% are set to zero. In the case of filter pruning, we utilize the L2-norm of the filter weights in order to rank them.
We performed connection + filter pruning with varying sparsity values of 0% (no pruning), 25%, 50%, and 75% for the CHISEL model to identify the best variation (Sect. 4).
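The two pruning rules described above (zero the smallest-magnitude S% of weights, and rank whole filters by their L2-norm) can be illustrated with NumPy. This is a minimal sketch, not the chapter's implementation: the function names are hypothetical, the kernel layout (height, width, input channels, output filters) is an assumption, and how the authors combine the two rules or schedule them during training is not specified in the text.

```python
import numpy as np


def prune_connections(weights, sparsity):
    """Set the smallest-magnitude fraction `sparsity` of weights to zero."""
    w = np.asarray(weights, dtype=np.float32)
    flat = np.abs(w).ravel()
    k = int(round(sparsity * flat.size))
    if k == 0:
        return w.copy()
    # Indices of the k weights with the smallest absolute value.
    smallest = np.argpartition(flat, k - 1)[:k]
    mask = np.ones(flat.size, dtype=bool)
    mask[smallest] = False
    return w * mask.reshape(w.shape)


def prune_filters(kernel, sparsity):
    """Zero the fraction `sparsity` of filters with the smallest L2-norm.

    `kernel` is assumed to have shape (height, width, c_in, n_filters).
    """
    k4 = np.asarray(kernel, dtype=np.float32)
    n_filters = k4.shape[-1]
    norms = np.sqrt((k4 ** 2).reshape(-1, n_filters).sum(axis=0))
    n_drop = int(round(sparsity * n_filters))
    pruned = k4.copy()
    if n_drop > 0:
        drop = np.argsort(norms)[:n_drop]
        pruned[..., drop] = 0.0
    return pruned


# Usage on a random 3 x 3 kernel with 8 input channels and 16 filters.
rng = np.random.default_rng(0)
kernel = rng.normal(size=(3, 3, 8, 16)).astype(np.float32)
sparse_kernel = prune_connections(kernel, 0.25)   # 25% of weights zeroed
filter_pruned = prune_filters(kernel, 0.50)       # 8 of 16 filters zeroed
```

Connection pruning yields unstructured sparsity, which mainly helps storage, whereas filter pruning removes whole output channels and can therefore also reduce computation on hardware without sparse kernels.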

4 Experiments

In this section, we compare our proposed CHISEL framework with its data augmentation and novel CAECNN network architecture with state-of-the-art deep learning-based techniques SAEDNN [15] and 1D CNN [25], as well as classical ML methods, KNN [4] and RF [8], all of which were described in Sect. 2. To highlight the impact of fingerprint augmentation, we evaluate and present the experimental results for two variants of our proposed framework. The first variant, CHISEL-DA, employs our fingerprint augmentation methodology as discussed in Sect. 3.1, whereas the second variant, CHISEL, does not. We utilize the UJIIndoorLoc dataset for all experiments.


Table 2 Localization performance comparison

| Method | Building (%) | Floor (%) | Position (m) |
|---|---|---|---|
| KNN [4] | 98.42 | 90.05 | 9.82 |
| RF [8] | 100 | 91.2 | 7.85 |
| SAE DNN [13] | 99.82 | 91.27 | 9.29 |
| 1D CNN [25] | 100 | 94.68 | 11.78 |
| CHISEL | 99.64 | 91.72 | 8.80 |
| CHISEL-DA | 99.96 | 93.87 | 6.95 |

4.1 Evaluation on UJIIndoorLoc Dataset

The comparison of building and floor accuracy is shown in Table 2. CHISEL has nearly 100% accuracy on building prediction and outperforms almost all other approaches on floor accuracy, except for 1D CNN, which has three dedicated models for building, floor, and location, respectively. As shown in Table 2, the best average localization errors of the proposed models are ≈8.80 m and ≈6.95 m for CHISEL and CHISEL-DA, respectively. Based on our experiments, we believe that 1D CNN is unable to outperform optimized versions of classical ML-based techniques due to the lack of data augmentation, as proposed in our work.
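The position column of Table 2 is an average error in meters. The chapter does not define the metric, so the following sketch assumes the common choice: the mean Euclidean distance between the coordinates of the predicted location and the true coordinates. The function name and the two-column coordinate layout are illustrative only.

```python
import numpy as np


def mean_localization_error(predicted_xy, true_xy):
    """Mean Euclidean distance (same unit as the inputs, e.g. meters)."""
    pred = np.asarray(predicted_xy, dtype=float)
    true = np.asarray(true_xy, dtype=float)
    return float(np.linalg.norm(pred - true, axis=1).mean())


# Two test samples: one exact hit, one prediction that is 5 m off.
error_m = mean_localization_error([[0.0, 0.0], [3.0, 4.0]],
                                  [[0.0, 0.0], [0.0, 0.0]])
print(error_m)  # 2.5
```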

4.2 Evaluation on Compression-Aware Training

Given the superior performance of CHISEL-DA, we select it as the baseline CHISEL model for further optimization. The uncompressed size of this CHISEL model is 801 KB, and it delivers an average localization error of 6.95 m at a latency of 5.82 ms on the Galaxy S7 device. To make the model more amenable to resource-limited embedded devices, we evaluate the impact of quantization and pruning on the accuracy, latency, and memory use of CHISEL. Figure 1 presents the impact of the various compression and pruning configurations on the average localization error. The memory footprints of the CHISEL model configurations are captured in Table 3. Configurations suffixed with WO and WA, respectively, represent weight-only quantization and weight + activation quantization. From Fig. 1a, we observe that post-training quantization yields models with higher localization error in all cases as compared to quantization-aware training (QAT) in Fig. 1b. This motivates the use of QAT when deploying CHISEL. As expected, we observe a general overarching trend that using fewer bits to represent weight values leads to worsening accuracy. This is due to the lack of unique values available to represent weights and activations. At the extreme end, we observe that when CHISEL is about 1/17th of its original size, in the INT2-WA-25% configuration (46 KB) (Fig. 1b), its localization error is ≈1.82× larger than that of CHISEL-DA but is still competitive with 1D CNN (Table 2).


Fig. 1 Average error comparisons across various configurations of pruned and quantized versions of the CHISEL model. (a) Post-training quantization. (b) Quantization-aware training. Numbers on top of each bar represent the model’s average error (meters). WO and WA, respectively, represent weight-only and weight + activation quantization

Table 3 Memory footprints of different compression combinations

| Config | 0% sparsity | 25% sparsity | 50% sparsity | 75% sparsity |
|---|---|---|---|---|
| INT2 WO | 128 KB | 116 KB | 102 KB | 90 KB |
| INT4 WO | 173 KB | 148 KB | 124 KB | 101 KB |
| INT8 WO | 262 KB | 215 KB | 169 KB | 122 KB |
| INT16 WO | 442 KB | 350 KB | 259 KB | 165 KB |
| INT2 WA | 57 KB | 46 KB | 33 KB | 21 KB |
| INT4 WA | 107 KB | 92 KB | 68 KB | 45 KB |
| INT8 WA | 206 KB | 160 KB | 115 KB | 68 KB |
| INT16 WA | 407 KB | 315 KB | 222 KB | 130 KB |
| FP32 NQ | 801 KB | 620 KB | 440 KB | 259 KB |

In both Fig. 1a and b, we note that pruning from 0% (no pruning) to 25% has almost no impact on localization accuracy while reducing the model footprint by up to 25%. This strongly suggests that pruning plays a positive role in deep-learning model deployment for indoor localization through CHISEL. The impact of


Fig. 2 On-device inference time from all compression configurations of CHISEL. CHISEL FLOAT32 represents the baseline model

pruning becomes more pronounced when aggressively quantizing the model. This is especially true for the WA quantization as shown in Fig. 1b. It is important to pay more attention to activations when quantizing CNN models with low bit widths; otherwise, aggressive quantization may result in a large accuracy reduction after compression. Based on the results observed in Fig. 1, a good candidate for the compressed model is INT4-WO-25% with QAT, resulting in a 148 KB memory footprint (Fig. 1b). We shrank the model by ≈5.41× while mildly improving the accuracy as compared to state-of-the-art deep learning designs in prior works [15, 25].

To capture the impact of compression on localization latency, we deployed all of the compressed configurations on a Samsung Galaxy S7 smartphone using Android Studio’s on-device inference framework in v4.0.1. Our application is designed to directly receive RSSI from a file that contains all 1111 samples from the test set. RSSI values are processed into matrices in-app and fed to the CHISEL model. The captured latencies include the time required to pre-process the RSSI fingerprint into images and are averaged over 100 repetitions. Inference time results of all compression configurations are presented in Fig. 2. We observe that both quantization and pruning can offer notable acceleration over FLOAT32 models. The INT4-WA-50% sparsity model cuts localization latency roughly in half (~2.5 ms), while taking a penalty of 2.63 m (38%) in localization error. Aggressive quantization and pruning beyond this point yield limited benefits, e.g., INT2-WA + 75% sparsity only reduces the latency to ~2.25 ms while increasing the localization error by more than 3× (to 25.32 m). INT4-WO-25% continues to present itself as a good candidate with a ~31% reduction in latency.
In summary, the intelligent data augmentation and the novel, compression-amenable CAECNN deep learning network model allow our CHISEL framework to provide new options for high-accuracy indoor localization while minimizing deployment costs on embedded devices.
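The headline ratios in this section follow directly from the baseline figures (801 KB, 5.82 ms, 6.95 m) and the footprints in Table 3. The short calculation below reproduces them; the helper names are illustrative, and the 1.80 ms and 0.34 m deltas for INT4-WO-25% are the values quoted in the chapter's conclusion.

```python
# Baseline CHISEL-DA figures quoted in Sect. 4.2.
BASELINE_SIZE_KB = 801.0
BASELINE_LATENCY_MS = 5.82
BASELINE_ERROR_M = 6.95


def shrink_factor(size_kb):
    """How many times smaller a configuration is than the baseline."""
    return BASELINE_SIZE_KB / size_kb


def percent_of(delta, baseline):
    """Express a change as a percentage of the baseline value."""
    return 100.0 * delta / baseline


# INT4-WO-25% with QAT (148 KB in Table 3).
int4_shrink = shrink_factor(148.0)                                   # ~5.41x
int4_size_cut = percent_of(BASELINE_SIZE_KB - 148.0, BASELINE_SIZE_KB)  # ~81.52%
int4_latency_cut = percent_of(1.80, BASELINE_LATENCY_MS)             # ~30.93%
int4_error_penalty = percent_of(0.34, BASELINE_ERROR_M)              # ~4.89%

# INT2-WA-25% (46 KB in Table 3): "about 1/17th" of the original size.
int2_shrink = shrink_factor(46.0)                                    # ~17.4x

# INT4-WA-50%: a 2.63 m penalty relative to the 6.95 m baseline.
int4_wa_penalty = percent_of(2.63, BASELINE_ERROR_M)                 # ~38%

print(f"{int4_shrink:.2f}x, {int4_size_cut:.2f}%, "
      f"{int4_latency_cut:.2f}%, {int4_error_penalty:.2f}%")
```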


5 Conclusion

In this chapter, we presented a novel indoor localization framework called CHISEL. Our approach outperforms state-of-the-art ML and deep learning-based localization frameworks, achieving higher localization accuracy. The CHISEL framework attains this through the intelligent confluence of data-centric fingerprint augmentation, feature extraction, and robust model compression. The compression-friendly CAECNN models in CHISEL maintain accuracy in the presence of aggressive model compression. Our compressed model versions are easy to deploy on smartphones and resource-constrained embedded devices that may have KB-sized resource budgets. Based on all of the results presented in Sect. 4 and considering model size and latency, our best configuration of CHISEL is INT4-WO-25% with QAT, which reduces the model size to 148 KB (an 81.52% reduction) and improves the latency by 1.80 ms (a 30.93% reduction) at the cost of a 0.34 m (4.89%) increase in localization error.

Acknowledgments This work was supported by the National Science Foundation (NSF), through grant CNS-2132385.

References

1. Mittal, A., Tiku, S., Pasricha, S.: Adapting convolutional neural networks for indoor localization with smart mobile devices. In: Great Lakes Symposium on VLSI (GLSVLSI). ACM (2018)
2. Zafari, F., Gkelias, A., Leung, K.K.: A survey of indoor localization systems and technologies. IEEE Commun. Surv. Tutorials. 21(3), 2568–2599 (2019)
3. Shi, K., Ma, Z., Zhang, R., Hu, W., Chen, H.: Support Vector Regression Based Indoor Location in IEEE 802.11 Environments. Hindawi Mobile Information Systems (2015)
4. Xie, Y., Wang, Y., Nallanathan, A., Wang, L.: An improved K-nearest-neighbor indoor localization method based on spearman distance. IEEE Signal Process Lett. 23(3), 351–355 (2016)
5. Tiku, S., Pasricha, S., Notaros, B., Han, Q.: SHERPA: a lightweight smartphone heterogeneity resilient portable indoor localization framework. In: International Conference on Embedded Software and Systems (ICESS). IEEE (2019)
6. Tiku, S., Pasricha, S.: PortLoc: a portable data-driven indoor localization framework for smartphones. IEEE Des. Test. 36(5), 18–26 (2019)
7. Tiku, S., Pasricha, S., Notaros, B., Han, Q.: A Hidden Markov Model based smartphone heterogeneity resilient portable indoor localization framework. J. Syst. Archit. 108, 101806 (2020)
8. Jedari, E., Wu, Z., Rashidzadeh, R., Saif, M.: Wi-Fi based indoor location positioning employing random forest classifier. In: International Conference on Indoor Positioning and Indoor Navigation (IPIN). IEEE (2015)
9. Langlois, C., Tiku, S., Pasricha, S.: Indoor localization with smartphones: harnessing the sensor suite in your pocket. IEEE Consum. Electron. Mag. 6, 70–80 (2017)
10. Pasricha, S., Ugave, V., Han, Q., Anderson, C.: LearnLoc: a framework for smart indoor localization with embedded mobile devices. In: International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS). IEEE (2015)


11. Tiku, S., Pasricha, S.: Overcoming security vulnerabilities in deep learning–based indoor localization frameworks on mobile devices. ACM Trans. Embed. Comput. Syst. 18, 1–24 (2020)
12. Jang, J., Hong, S.: Indoor localization with WiFi fingerprinting using convolutional neural network. In: International Conference on Ubiquitous and Future Networks (ICUFN). IEEE (2018)
13. Nowicki, M., Wietrzykowski, J.: Low-effort place recognition with WiFi fingerprints using deep learning. In: Springer International Conference Automation. Springer (2017)
14. Torres-Sospedra, J., Montoliu, R., Martinez-Uso, A., Avariento, J.P., Arnau, T.J., Benedito-Bordonau, M., Huerta, J.: UJIIndoorLoc: A new multi-building and multi-floor database for WLAN fingerprint-based indoor localization problems. In: International Conference on Indoor Positioning and Indoor Navigation (IPIN). IEEE (2014)
15. Kim, K.S., Lee, S., Huang, K.: A scalable deep neural network architecture for multi-building and multi-floor indoor localization based on Wi-Fi fingerprinting. Big Data Anal. 3(1), 1–17 (2018)
16. Tiku, S., Pasricha, S.: Siamese neural encoders for long-term indoor localization with mobile devices. In: IEEE/ACM Design, Automation and Test in Europe (DATE) Conference and Exhibition. IEEE (2022)
17. Tiku, S., Kale, P., Pasricha, S.: QuickLoc: adaptive deep-learning for fast indoor localization with mobile devices. ACM Trans. Cyber-Phys. Syst. 5(4), 1–30 (2021)
18. Pasricha, S., Doppa, J., Chakrabarty, K., Tiku, S., Dauwe, D., Jin, S., Pande, P.: Data analytics enables energy-efficiency and robustness: from mobile to manycores, datacenters, and networks. In: ACM/IEEE International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS). IEEE (2017)
19. Tiku, S., Pasricha, S.: Energy-efficient and robust middleware prototyping for smart mobile computing. In: IEEE International Symposium on Rapid System Prototyping (RSP). IEEE (2017)
20. Wang, L., Tiku, S., Pasricha, S.: CHISEL: compression-aware high-accuracy embedded indoor localization with deep learning. IEEE Embed. Syst. Lett. 14, 23–26 (2021)
21. Mishra, R., Gupta, H. P., Dutta, T.: A survey on deep neural network compression: challenges, overview, and solutions. arXiv preprint arXiv:2010.03954 (2020)
22. Bengio, Y., Leonard, N., Courville, A.: Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432 (2013)
23. Li, H., Kadav, A., Durdanovic, I., Samet, H., Graf, H.P.: Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710 (2016)
24. Molchanov, P., Tyree, S., Karras, T., Aila, T., Kautz, J.: Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440 (2016)
25. Song, X., Fan, X., Xiang, C., Ye, Q., Liu, L., Wang, Z., He, X., Yang, N., Fang, G.: A novel convolutional neural network based indoor localization framework with WiFi fingerprinting. IEEE Access. 7, 110698–110709 (2019)

A Design Methodology for Energy-Efficient Embedded Spiking Neural Networks Rachmad Vidya Wicaksana Putra and Muhammad Shafique

1 Introduction

1.1 Overview

Artificial Intelligence (AI) is currently considered the state-of-the-art solution for analyzing huge amounts of data in the era of Big Data and the Internet-of-Things (IoT), where digital data are continuously generated. Analyzing and inferring meaningful information from these data are the key ingredients for improving the quality of life [33, 66]. Therefore, AI developments, especially Machine Learning (ML), have expanded widely across diverse applications in the last ten years, like computer vision [28, 36, 73], healthcare [3, 5, 37], business and finance [21, 71, 72], and autonomous systems (e.g., drones and self-driving cars) [20, 43]. Currently, the prominent ML algorithms employ brain-inspired computations, namely (1) Artificial Neural Networks (ANNs), which include Deep Neural Networks (DNNs)/Deep Learning (DL) [6, 33, 66, 68], and (2) Spiking Neural Networks (SNNs) [35, 54, 57, 58, 64, 66], as shown in Fig. 1a. Among these algorithms, SNNs have shown the ability to achieve high accuracy with ultra-low power/energy consumption because of their sparse spike-based computations [1, 13]. Moreover, SNNs can also efficiently learn from unlabeled data using bio-plausible learning rules like Spike-Timing-Dependent Plasticity (STDP), hence offering unsupervised learning capabilities to SNN-based systems, as shown in Fig. 1b.

R. V. W. Putra: Embedded Computing Systems, Institute of Computer Engineering, Technische Universität Wien, Vienna, Austria
M. Shafique: A1-173, Division of Engineering, New York University Abu Dhabi, Saadiyat Island, UAE
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. S. Pasricha, M. Shafique (eds.), Embedded Machine Learning for Cyber-Physical, IoT, and Edge Computing, https://doi.org/10.1007/978-3-031-39932-9_2


[Figure 1. Panel (a) labels: Artificial Intelligence (AI), programs that try to mimic the cognitive function of the human brain; Machine Learning (ML), algorithms that learn without being explicitly programmed; Brain-inspired Computation, algorithms that take inspiration from neuron operations; Artificial Neural Networks (ANNs), including Deep Neural Networks (DNNs)/Deep Learning (DL); Spiking Neural Networks (SNNs). Panel (b) labels: an SNN with input, excitatory, and inhibitory layers; training uses a large batch of samples and a bio-plausible learning rule, such as STDP; inference uses a small batch of samples, and the excitatory signals (output spike trains) are used for the prediction.]

Fig. 1 (a) Relation of AI, ML, DNNs, and SNNs (adapted from [68]). (b) Training and inference phases in an SNN, which will be discussed in Sect. 2.1

Unsupervised learning (i.e., learning with unlabeled data) is beneficial for many real-world applications (e.g., autonomous agents) since it avoids costly data labeling [54, 60], and it also enables efficient online learning to make SNN-based systems adaptive to diverse operational environments [2, 44, 56]. All these advantages make SNNs suitable for solving ML tasks on resource- and energy-constrained embedded platforms [50, 52, 53].

In the training phase, an SNN performs a forward pass while applying a bio-plausible learning rule (e.g., STDP [14, 55, 56]) to update its weights. In the inference phase, an SNN performs a forward pass using the trained weights, hence incurring much lower memory access and computation energy costs than training.

The state-of-the-art SNNs usually achieve higher accuracy at the cost of a larger model size, which necessitates more intensive memory accesses and computations, hence incurring higher power/energy consumption [54]. For instance, a network with 200 neurons achieves ∼75% accuracy on the MNIST dataset and requires ∼1 MB of memory, while a network with 9800 neurons achieves ∼92% accuracy and requires ∼200 MB of memory [52].

Recently, privacy and security concerns have been pushing many ML-based applications towards embedded platforms, such as IoT-Edge devices [6, 7, 66]. Hence, employing SNNs for embedded applications is highly desired, as it may provide better services (e.g., real-time response) with better privacy and security. However, embedded applications are typically memory- and energy-constrained, thereby making it challenging to run SNN processing efficiently. Therefore, it is necessary to provide a design methodology that improves the energy efficiency of SNNs while maintaining their accuracy, thereby enabling their implementation on embedded platforms.

1.2 Design Constraints for Embedded SNNs

Hardware platforms (e.g., accelerators) for embedded applications usually have tight resource (e.g., memory) and power budgets. Besides, some applications may pose other constraints like latency and throughput, especially applications that need real-time decisions (e.g., self-driving cars) [66]. All these constraints make it more challenging to efficiently run SNNs on embedded platforms. For instance, embedded platforms typically have small on-chip memory (≤ 500 KB) [49, 51, 68] and low operational power (≤ 5 W) [41, 52, 56]. Consequently, memory and power constraints limit the SNN operations that can be run simultaneously, thereby leading to long latency and low throughput.

To improve the performance and efficiency of SNN processing, previous works have proposed specialized accelerators, such as [1, 11–13, 15, 32, 42, 45, 61, 65]. Recent studies have identified that the energy consumption of SNN accelerators is dominated by memory accesses [31]; see Fig. 2. The reasons are the following.

• High memory access energy: The (off-chip and on-chip) memory accesses are much more expensive than the operations of the computation engine, since this engine typically employs neuron models like the Leaky Integrate-and-Fire (LIF), which performs simple operations (e.g., an addition for updating the membrane potential) [31].
• A large number of memory accesses: The number of memory accesses required for SNN processing is proportional to the number of SNN parameters (i.e., weights and neuron parameters) [54, 55]. Hence, larger SNNs (which typically offer higher accuracy) require more intensive memory accesses.

[Figure 2. Horizontal stacked bars for TrueNorth, PEASE, and SNNAP; x-axis: Breakdown of Energy Consumption [%] (0% to 100%); legend: Memory Accesses, Communication, Computation.]

Fig. 2 Breakdown of the energy consumption incurred by different SNN hardware accelerators, i.e., TrueNorth [1], PEASE [61], and SNNAP [65] (adapted from studies in [31])

In summary, optimizing the memory requirements (e.g., number of accesses, operational power/energy) is the key to improving the energy efficiency of SNN processing, thereby enabling efficient embedded SNNs. In the remaining sections, we discuss the following points.

1. In Sect. 2, we present an overview of SNNs and their learning rules.
2. In Sect. 3, we discuss our design methodology for optimizing the memory and energy requirements of SNNs targeting embedded platforms.
3. In Sect. 4, we present the experimental evaluations for our design methodology.
4. In Sect. 5, we conclude the chapter with a summary.

2 Preliminaries

2.1 Spiking Neural Networks (SNNs)

SNNs are computation models for neural networks that employ bio-plausible neurons, synapses, and learning rules (e.g., STDP) [35, 48, 69]. They employ sequences of action potentials/spikes (so-called spike trains) for representing data. An SNN consists of several components, i.e., network topology/architecture, neuron model, synapse model, spike coding, and learning rule [39].

• Network architecture: It defines how neurons are connected to each other. Here, we consider the network architecture in Fig. 1b, as it has been used widely in previous works and offers robustness for performing different learning rules [14]. This architecture has input, excitatory, and inhibitory layers. The input image represents the input layer, whose pixels are connected to all excitatory neurons. Each excitatory neuron should recognize a specific class, and its output is connected to a specific inhibitory neuron. Each inhibitory neuron produces spikes for inhibiting all excitatory neurons, except the one from which it receives its input connection.

[Figure 3. Three traces over time: presynaptic spikes; membrane potential, with the threshold Vth, the reset level Vreset, and the refractory period marked; postsynaptic spikes.]

Fig. 3 Illustration of the neuron dynamics of the LIF neuron model (adapted from [54])

• Neuron and synapse model: The neuron model represents the neuron dynamics. Here, we use the Leaky Integrate-and-Fire (LIF) model, as it offers bio-plausible behavior with low complexity [27]. Its membrane potential ($V_{mem}$) increases if there is an input spike, and otherwise $V_{mem}$ decreases. A spike is produced when $V_{mem}$ reaches the threshold ($V_{th}$), and then $V_{mem}$ returns to the reset potential ($V_{reset}$), as shown in Fig. 3. For the synapse model, we use a conductance that increases by the respective weight ($w$) when an input spike arrives, and otherwise the conductance decreases.
• Spike coding: It translates the input data into spike trains. There are several spike coding techniques in the literature, such as rate, temporal, rank-order, phase, burst, and time-to-first-spike coding [18, 29, 46, 47, 70]. Here, we consider rate coding since it can achieve high accuracy under unsupervised learning scenarios [14].
• Learning rule: It defines how an SNN learns salient features from input samples and updates the weights. Several bio-plausible learning rules have been developed, such as STDP, Spike-Driven Synaptic Plasticity (SDSP), Local Correlation Plasticity (LCP), Modified Ion Channel-Based Plasticity, and Iono-Neuromorphic Intracellular Calcium-Mediated Plasticity [59]. Here, we consider STDP since it has achieved high accuracy for diverse SNN models with relatively low computational complexity. A detailed discussion of STDP is provided in Sect. 2.2.
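The rate coding and LIF dynamics described above can be sketched in a few lines of Python. This is a minimal discrete-time illustration, not the simulator used in this chapter: the function names, the leak factor, the per-step spike probability `max_rate`, and all default values are assumptions chosen for readability, and the synapse is simplified to a direct weighted input instead of a conductance.

```python
import random


def poisson_spike_train(intensity, n_steps, max_rate=0.25, rng=None):
    """Rate coding: the per-timestep spike probability is proportional to
    the normalized pixel intensity (0..1). max_rate is an assumed scaling."""
    rng = rng or random.Random(0)
    p = max(0.0, min(1.0, intensity)) * max_rate
    return [1 if rng.random() < p else 0 for _ in range(n_steps)]


def simulate_lif(spikes, weight=0.3, leak=0.95, v_th=1.0, v_reset=0.0,
                 refractory=2):
    """Discrete-time LIF neuron: V_mem leaks towards V_reset every step,
    rises by `weight` on an input spike, and fires when it reaches V_th."""
    v = v_reset
    wait = 0  # remaining refractory timesteps
    out = []
    for s in spikes:
        if wait > 0:  # inputs are ignored during the refractory period
            wait -= 1
            out.append(0)
            continue
        v = v_reset + (v - v_reset) * leak + weight * s
        if v >= v_th:
            out.append(1)
            v = v_reset       # V_mem returns to the reset potential
            wait = refractory
        else:
            out.append(0)
    return out
```

For example, `simulate_lif(poisson_spike_train(0.9, 200))` produces the output spike train of one neuron for a bright pixel; a brighter pixel yields a denser input train and therefore more frequent threshold crossings.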

2.2 Spike-Timing-Dependent Plasticity (STDP)

STDP is a bio-plausible learning rule that leverages the timing correlation between presynaptic and postsynaptic spikes. Although there are several variants of STDP, we consider the pairwise weight-dependent STDP as the baseline since it has relatively lower computational complexity than other learning rules, and it has been widely used in previous works [14, 23, 25, 59, 62, 63]. At every presynaptic and postsynaptic spike, this STDP calculates the weight change (i.e., either potentiation or depression) based on Eq. (1) and then updates the corresponding weight.

$$
\Delta w =
\begin{cases}
-\eta_{pre} \cdot x_{post} \cdot w^{\mu} & \text{on presynaptic spike} \\
\eta_{post} \cdot x_{pre} \cdot (w_m - w)^{\mu} & \text{on postsynaptic spike}
\end{cases}
\tag{1}
$$

$\Delta w$ denotes the weight change (i.e., either potentiation or depression). $\eta_{pre}$ and $\eta_{post}$ denote the learning rates for the presynaptic and postsynaptic spike, respectively. $x_{pre}$ and $x_{post}$ denote the traces for the presynaptic and postsynaptic spike, respectively. When a spike occurs, the corresponding trace is set to 1; otherwise, the trace decreases, as shown in Fig. 4. These traces are used for improving the simulation speed [38]. Meanwhile, $w$ denotes the current weight, $w_m$ denotes the maximum possible value of the weight, and $\mu$ denotes the weight dependence factor.

[Figure 4. A presynaptic neuron (neuron-p) connected to a postsynaptic neuron (neuron-q) through a synapse with weight wpq; below, the presynaptic and postsynaptic spikes over time and the corresponding spike traces xpre and xpost, each ranging between 0 and 1.]

Fig. 4 Illustration of a single synaptic connection and the dynamics of spike traces for presynaptic spike $x_{pre}$ and postsynaptic spike $x_{post}$ (adapted from [54])
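As an illustration of Eq. (1), the following Python sketch computes the pairwise weight-dependent STDP weight change and a simple trace decay. The learning rates, the decay constant `tau`, and the function names are hypothetical values chosen for the example; the chapter does not specify them.

```python
def decay_trace(trace, tau=20.0, dt=1.0):
    """Between spikes the trace decreases; a first-order decay with an
    assumed time constant tau is used here. On a spike, set the trace to 1."""
    return trace * (1.0 - dt / tau)


def stdp_delta_w(w, event, x_pre, x_post,
                 eta_pre=1e-4, eta_post=1e-2, w_max=1.0, mu=1.0):
    """Weight change of Eq. (1).

    event == "pre":  depression, scaled by the postsynaptic trace.
    event == "post": potentiation, scaled by the presynaptic trace.
    """
    if event == "pre":
        return -eta_pre * x_post * (w ** mu)
    if event == "post":
        return eta_post * x_pre * ((w_max - w) ** mu)
    raise ValueError("event must be 'pre' or 'post'")
```

With these assumed rates, a postsynaptic spike that closely follows a presynaptic one (`x_pre` near 1) potentiates a weight of 0.5 by about 0.005, while a presynaptic spike arriving just after a postsynaptic one depresses it by about 0.00005.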

3 A Design Methodology for Embedded SNNs

3.1 Overview

In the literature, several techniques can be employed for improving the energy efficiency of SNN processing for the training and inference phases. These techniques can be classified into hardware- and software-level techniques, as shown in Fig. 5. The software-level techniques encompass model compression (e.g., pruning [54, 56, 60] and quantization [54, 55, 60]), efficient data representation (e.g., low-complexity spike coding [29, 46, 47, 70] and spike bundling [31]), neuron model selection [27], efficient DRAM data mapping [52], and approximate neuron operations [65]. Meanwhile, the hardware-level techniques encompass efficient neuron hardware [8, 16, 34], efficient learning hardware [4, 15, 17], and employment of approximate memories (e.g., reduced-voltage DRAM and SRAM) [50, 52].

[Figure 5. Overview diagram of techniques for improving the energy efficiency of SNN processing. Software: network/model compression (pruning/elimination, quantization); efficient data representation (rate, phase, burst, rank-order, time-to-first-spike coding, etc., and spike bundling); neuron model selection (Leaky Integrate-and-Fire (LIF), Integrate-and-Fire (IF), etc.); DRAM data mapping; approximate computations (approximate neuron operations). Hardware: efficient neuron hardware; efficient hardware for the on-chip learning rule (e.g., STDP); approximate memories (reduced-voltage off-chip DRAM and reduced-voltage on-chip buffers). The SNN accelerator is drawn with a compute engine, a learning unit, weight and neuron buffers, control logic, and off-chip DRAM.]

Fig. 5 Techniques for improving the energy efficiency of SNN processing for enabling their implementations on embedded platforms

[Figure 6. Flow diagram. Inputs: memory budget, energy budget, target accuracy, DRAM configuration, SNN architecture, learning rule, spike coding, neuron model, hyperparameters, and dataset. Step (1), optimization steps applied during training for each input sample: (a) reduction of neuron and learning operations (SNN architecture with direct inhibition, simplified learning rule), (b) enhancements for the learning rule/mechanism, (c) quantization. Step (2): evaluation of memory and energy requirements, followed by evaluation and selection of the optimized SNN model. Step (3): approximate DRAM (DRAM error modeling, DRAM error profile, test, accuracy-based selection). Output: optimized SNN model with reduced-voltage DRAM.]

Fig. 6 Our design methodology for optimizing memory and energy requirements for embedded SNNs, showing the key proposed techniques (adapted from studies in [52, 54])

In this chapter, we discuss our design methodology that minimizes the memory and energy consumption of SNN processing for the training and inference phases by (1) optimizing SNN operations through (a) a reduction of neuron and learning operations, (b) learning rule enhancements, and (c) quantization; (2) evaluating the memory and energy costs; and (3) employing approximate DRAM, as shown in Fig. 6. Our design methodology is applicable to different sizes of SNN models and different sizes of hardware platforms (e.g., memory size), thereby supporting diverse embedded applications. In the following, we provide a detailed discussion of each proposed technique.

[Figure 7. Panel (a), SNN architecture with STDP: input, synapses learned by STDP, excitatory layer, inhibitory layer. Panel (b), proposed SNN architecture: input, synapses learned by STDP, excitatory layer with direct lateral inhibition. An arrow labeled "Modification" leads from (a) to (b).]

Fig. 7 (a) SNN architecture with STDP learning rule. (b) Proposed SNN architecture. These figures are adapted from [54]

3.2 Reduction of SNN Operations

The state-of-the-art SNN architecture with the STDP learning rule usually employs pairs of excitatory and inhibitory neurons, each having different functions and parameters, as shown in Fig. 7a. We observe that a large number of inhibitory neurons only process a small number of incoming spikes to produce the inhibition spikes, at the cost of a high amount of memory and energy. Therefore, we optimize the use of inhibitory neurons to reduce memory and energy consumption, as shown by (a) in Fig. 6. To do this, we replace the inhibitory neurons with direct lateral inhibition to reduce the neuron operations, as shown in Fig. 7b. In this manner, the function of the spikes from the inhibitory neurons, which enable competition among excitatory neurons, is taken over by the spikes from the excitatory neurons.
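One way to picture direct lateral inhibition is the Python sketch below, in which every spiking excitatory neuron directly lowers the membrane potential of all other excitatory neurons. The subtractive form, the `strength` value, and the function names are assumptions made for illustration; the chapter does not give the exact inhibition mechanism. The second helper only counts neurons under the one-to-one pairing of excitatory and inhibitory neurons described in Sect. 2.1.

```python
def apply_direct_lateral_inhibition(v_mem, spiked, strength=0.5, v_floor=0.0):
    """Lower V_mem of each neuron by `strength` for every spike emitted by
    any OTHER excitatory neuron in this timestep (assumed subtractive form)."""
    n_spikes = sum(spiked)
    new_v = []
    for v, s in zip(v_mem, spiked):
        others = n_spikes - s  # spikes from neurons other than this one
        new_v.append(max(v_floor, v - strength * others))
    return new_v


def neuron_count(n_excitatory, use_inhibitory_layer):
    """Neurons whose parameters must be stored: each excitatory neuron is
    paired with one inhibitory neuron in the architecture of Fig. 7a."""
    return 2 * n_excitatory if use_inhibitory_layer else n_excitatory
```

Under this counting, a network with 400 excitatory neurons stores parameters for 800 neurons with an inhibitory layer and for 400 neurons with direct lateral inhibition.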

3.3 Learning Enhancements

In the learning process, at least one excitatory neuron is expected to recognize the class of an input image by generating the highest number of spikes. Therefore, information on the postsynaptic spikes should be leveraged to enhance the learning quality. Towards this, we propose a technique to enhance the quality of STDP-based learning by employing timestep-based weight updates, an adaptive potentiation factor (k), and an effective inhibition strength, as shown by (b) in Fig. 6. The details of each technique are as follows.

• Timestep-based weight update: It aims at reducing spurious weight updates that are triggered by postsynaptic spikes. Hence, it updates the weights once within a timestep to ensure that the updates are necessary and performed with proper weight values. Moreover, this technique also reduces the computational requirements for weight updates, thereby curtailing the memory accesses and energy consumption.
• Adaptive potentiation factor (k): It aims at providing proper potentiation for each weight update by leveraging the information of postsynaptic spikes. Therefore, the number of postsynaptic spikes is accumulated from the start of an input image presentation until the observation time. This number of postsynaptic spikes is used to calculate the adaptive potentiation factor k using Eq. (2). Here, $maxS$ denotes the maximum number of accumulated spikes, and $S_{th}$ denotes the threshold of spikes that normalizes $maxS$. Afterwards, k is used for calculating the weight change using Eq. (3), and we select $\mu = 1$ to simplify the computation and hardware implementation.

$$
k = \left\lceil \frac{maxS}{S_{th}} \right\rceil \tag{2}
$$

$$
\Delta w = k \cdot \eta_{post} \cdot x_{pre} \cdot (w_m - w)^{\mu} \quad \text{on update time} \tag{3}
$$

• Effective inhibition strength: Balancing the excitatory and inhibitory strength is important for ensuring proper inhibition [14]. Too strong inhibition causes ineffective domination since the winning neuron strongly prevents others from firing. Meanwhile, too weak inhibition causes no actual competition among neurons since it lets all neurons easily generate spikes for any input features. Towards this, we perform an experimental analysis to investigate the impact of different inhibition strengths on accuracy. Afterwards, the inhibition strength that leads to high accuracy is used for the corresponding optimized SNN model.
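The adaptive potentiation of Eqs. (2) and (3) can be sketched as follows. This is an illustrative reading, with two assumptions stated up front: the normalization of maxS by S_th in Eq. (2) is taken to be rounded up (a ceiling), and maxS is taken as the largest accumulated postsynaptic spike count among the excitatory neurons. The function names and the learning rate are hypothetical.

```python
import math


def adaptive_potentiation_factor(max_s, s_th):
    """Eq. (2), read here as a ceiling of maxS / S_th (an assumption)."""
    return math.ceil(max_s / s_th)


def timestep_weight_update(w, x_pre, post_spike_counts, s_th,
                           eta_post=0.01, w_max=1.0):
    """Eq. (3) with mu = 1, applied once per timestep.

    post_spike_counts: postsynaptic spikes accumulated per excitatory neuron
    since the start of the current input image presentation.
    """
    k = adaptive_potentiation_factor(max(post_spike_counts), s_th)
    return w + k * eta_post * x_pre * (w_max - w)
```

With accumulated counts of `[2, 7, 1]` and `s_th = 3`, the factor is k = 3, so the potentiation is three times as strong as a plain update with k = 1. With no postsynaptic spikes, k = 0 and the weight is left unchanged.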

3.4 Weight Quantization

SNN processing is usually performed using floating-point data to achieve high accuracy, at the cost of a large memory footprint and high energy consumption. To obtain memory- and energy-efficient SNN processing, it is common practice to use fixed-point data through quantization. However, a quantized value has a reduced precision of data representation, hence potentially degrading the accuracy. Towards this, we perform an exploration to investigate the impact of different levels of weight quantization on the accuracy, as shown by (c) in Fig. 6. Then, the acceptable quantization levels can be used for further design consideration, e.g., selecting the quantized SNN model that offers an acceptable memory footprint and energy consumption. For quantization, we employ the round-to-nearest technique, which rounds up values that lie exactly halfway between two representable numbers [26].
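A minimal sketch of round-to-nearest quantization with ties rounded up is shown below. The unsigned code range, the weight interval [0, w_max], and the function names are assumptions for illustration; the exact fixed-point format used in the chapter is not stated. Note that Python's built-in `round` rounds ties to the nearest even value, so `math.floor(x + 0.5)` is used instead.

```python
import math


def quantize_weight(w, n_bits, w_max=1.0):
    """Map a weight in [0, w_max] to an unsigned n-bit integer code using
    round-to-nearest, with exact halves rounded up."""
    levels = (1 << n_bits) - 1
    scaled = min(max(w, 0.0), w_max) / w_max * levels
    return math.floor(scaled + 0.5)


def dequantize_weight(code, n_bits, w_max=1.0):
    """Recover the approximate weight value represented by a code."""
    return code / ((1 << n_bits) - 1) * w_max
```

For 8-bit weights, `quantize_weight(0.5, 8)` scales to 127.5 and rounds up to the code 128. The reconstruction error of any weight in range is at most half a quantization step, i.e., `0.5 / 255` of `w_max` for 8 bits.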


3.5 Evaluation of Memory and Energy Requirements

To be applicable to many embedded applications, our optimizations have to meet the given memory and energy budgets. Towards this, we propose a technique to find the sizes of SNN models that have acceptable accuracy and fulfill the memory and energy budgets for both the training and inference phases, as shown by (2) in Fig. 6. Its key idea is to gradually increase the network size (i.e., the number of excitatory neurons) and then evaluate the design metrics, including accuracy, memory, and energy consumption. If several models fulfill the design requirements, then we select the one with the highest accuracy. Meanwhile, if several models fulfill the design requirements and have the same accuracy, then we select the smallest model to keep the memory footprint and energy consumption low.
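The selection step described above can be sketched as a small filter-and-rank routine. The dictionary keys and the function name are hypothetical, and the metric values of each candidate are assumed to come from evaluating networks of increasing size beforehand.

```python
def select_model(candidates, mem_budget, energy_budget, target_acc):
    """Pick the candidate that meets all budgets with the highest accuracy;
    among candidates with equal accuracy, prefer the smaller network.

    Each candidate is a dict with the (hypothetical) keys
    'n_exc', 'accuracy', 'memory', and 'energy'.
    Returns None if no candidate fulfills the requirements.
    """
    feasible = [
        c for c in candidates
        if c["memory"] <= mem_budget
        and c["energy"] <= energy_budget
        and c["accuracy"] >= target_acc
    ]
    if not feasible:
        return None
    # Higher accuracy first; on ties, the smaller number of neurons wins.
    return max(feasible, key=lambda c: (c["accuracy"], -c["n_exc"]))
```

Ranking by the tuple `(accuracy, -n_exc)` encodes both rules from the text in one comparison, which keeps the tie-breaking explicit.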

3.6 Employment of Approximate DRAM

The discussion in Sect. 1.2 highlights that memory accesses dominate the energy consumption of SNN processing. Previous studies in [68] further emphasize that a single off-chip DRAM access is significantly more expensive than a single on-chip buffer access. Therefore, we further reduce the memory access energy for SNN inference by employing a reduced-voltage-based approximate DRAM, as shown by (3) in Fig. 6. To highlight the potentials and challenges of approximate DRAM, we conduct experiments to analyze the DRAM access energy under different conditions (i.e., a row buffer hit, miss, or conflict; see footnote 1). Moreover, we also study the DRAM bit error rate (BER) due to reduced-voltage DRAM operations. For the experiments, we employ LPDDR3-1600 4 Gb DRAM, the DRAM circuit model from [10], and the SPICE simulator to analyze the dynamics of the DRAM bitline voltage ($V_{bitline}$). Then, we use the DRAMPower simulator [9] to obtain the DRAM access energy. The original DRAM operates at a supply voltage ($V_{supply}$) of 1.35 V, while the approximate one operates below 1.35 V. The experimental results are provided in Fig. 8, from which we make the following key observations.

• The reduced $V_{supply}$ decreases the DRAM energy-per-access (i.e., by up to 42%) across different access conditions (i.e., a row buffer hit, miss, and conflict).
• The reduced $V_{supply}$ causes the DRAM weak cells (see footnote 2) to become vulnerable to errors in the form of bit flips. Such bit flips can alter the values of trained SNN weights and eventually degrade the accuracy.

Footnote 1: In a row buffer hit, the requested data are already in the DRAM row buffer, hence they can be accessed directly. Meanwhile, in a row buffer miss or conflict, the requested data are not in the DRAM row buffer, hence the requested row has to be opened before its contents are loaded into the row buffer and then accessed. Further details of DRAM fundamentals can be found in [19].
Footnote 2: DRAM weak cells are the cells that fail when the DRAM parameters are reduced [52].
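The three access conditions can be told apart with a toy model of a single DRAM bank, sketched below. It follows the commonly used distinction, which is an assumption relative to the text above: a miss occurs when no row is open in the bank, and a conflict occurs when a different row is open and must be closed first. The sketch also assumes that a row stays open after each access; the function names are hypothetical.

```python
def classify_access(open_row, requested_row):
    """Classify one access to a bank whose row buffer holds `open_row`
    (None when the bank has no open row)."""
    if open_row is None:
        return "miss"
    return "hit" if open_row == requested_row else "conflict"


def count_access_conditions(rows):
    """Count hits, misses, and conflicts for a sequence of requested rows
    in one bank, starting with no open row and keeping each row open."""
    counts = {"hit": 0, "miss": 0, "conflict": 0}
    open_row = None
    for row in rows:
        counts[classify_access(open_row, row)] += 1
        open_row = row
    return counts
```

Counting conditions this way makes it possible to weight a per-condition energy figure, such as those reported in Fig. 8a, by how often each condition occurs for a given weight layout.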

[Figure 8. Panel (a): bar chart of DRAM Access Energy [nJ] versus DRAM Access Condition (Hit, Miss, Conflict) for supply voltages of 1.350 V and 1.025 V, with the energy saving annotated. Panel (b): Bit Error Rate (logarithmic axis, 10^-8 to 10^-2) versus Supply Voltage [V] (1.03 to 1.33); the bit error rate increases as the supply voltage decreases.]

Fig. 8 (a) DRAM access energy for different conditions, i.e., a row buffer hit, miss, and conflict (adapted from [52]). (b) Bit error rate (BER) for different DRAM supply voltages (adapted from [10])

Towards this, we employ the approximate DRAM by gradually reducing $V_{supply}$ while observing whether the obtained accuracy meets the design specifications. If several $V_{supply}$ values result in acceptable accuracy scores, then we select the lowest $V_{supply}$ to achieve high DRAM access energy savings.
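The voltage selection described above can be sketched as a simple sweep. The callable `accuracy_at` stands in for running SNN inference with the bit error profile of a given supply voltage; it and the function name are hypothetical.

```python
def select_supply_voltage(voltages, accuracy_at, min_accuracy):
    """Sweep the supply voltage from high to low and return the lowest
    value whose accuracy still meets the specification, or None.

    accuracy_at: callable mapping a voltage to the measured accuracy
    (a placeholder for an actual inference run under bit errors).
    """
    chosen = None
    for v in sorted(voltages, reverse=True):
        if accuracy_at(v) >= min_accuracy:
            chosen = v
    return chosen
```

A call such as `select_supply_voltage([1.35, 1.325, 1.25, 1.175, 1.1, 1.025], accuracy_at, 88.0)` then returns the lowest voltage that satisfies an 88% accuracy requirement. The sketch checks every voltage instead of stopping at the first failure, which is a design choice of this example and keeps it correct even if accuracy is not strictly monotonic in the voltage.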


4 Experimental Evaluations

To evaluate our design methodology, we build the following experimental setup.

SNN Evaluation. We use a Python-based SNN simulator [24] for evaluating the accuracy of a given SNN, which runs on a multi-GPU machine (i.e., Nvidia GeForce RTX 2080 Ti [40]). For data encoding, we employ rate coding to convert input images to Poisson-distributed spike trains. For workloads, we consider the MNIST dataset since it has been widely used for evaluating the accuracy of SNNs with the STDP learning rule [69]. For fair comparisons, we recreate the baseline SNN design [14] and the Self-Learning STDP (SL-STDP) [67], then run them using the same Python-based SNN simulator. In the simulation, we extract the size of the SNN model to evaluate the memory footprint. Furthermore, we also estimate the energy consumption using the approach of [22], i.e., by leveraging the simulation time and the processing power reported by the nvidia-smi utility.

DRAM Evaluation. We generate bit errors in DRAM based on the DRAM error model with a uniform random distribution across a bank [30]. Furthermore, we employ the LPDDR3-1600 4 Gb DRAM configuration, the DRAM circuit model from [10], and the SPICE simulator to analyze the dynamics of the DRAM voltage. Afterwards, we leverage the DRAM voltage dynamics for estimating the DRAM access energy using the DRAMPower simulator [9].
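The energy estimate from simulation time and processing power reduces to a product, as the following sketch shows. It assumes that power readings in watts have already been collected during the run (for instance, from the nvidia-smi utility) and averages them; the function names are hypothetical, and the sampling details of [22] are not reproduced here.

```python
def estimate_energy_joules(power_samples_w, duration_s):
    """Energy = mean processing power [W] x simulation time [s]."""
    mean_power = sum(power_samples_w) / len(power_samples_w)
    return mean_power * duration_s


def normalized_efficiency(baseline_energy_j, method_energy_j):
    """Energy efficiency normalized to the baseline: values above 1 mean
    that the method consumes less energy than the baseline."""
    return baseline_energy_j / method_energy_j
```

For instance, readings of 100 W, 120 W, and 110 W over a 10 s run give an estimate of 1100 J. These numbers are placeholders for illustration, not measurements from the chapter.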

4.1 Classification Accuracy

Figure 9 presents the experimental results for classification accuracy. Results in Fig. 9a show that our methodology can achieve higher accuracy than the baseline and the SL-STDP across different network sizes, i.e., by up to 7.2% accuracy improvement. The reason is that our methodology incorporates the improved STDP-based learning that avoids spurious weight updates and employs adaptive potentiation and effective inhibition strength for increasing the confidence level of learning over time in the training phase, thereby improving accuracy in the inference phase. Meanwhile, other techniques do not consider improving the confidence level of learning over time, resulting in sub-optimal accuracy.

[Figure 9. Panel (a): Accuracy [%] versus Network Size (400, 900, 1600, 2500) for the baseline, the SL-STDP, and our method, with floating-point precision; our methodology maintains high accuracy across different network sizes. Panel (b): Accuracy [%] versus Quantization Level (4-bit to 16-bit) for the same three techniques; label (1) marks a good trade-off between accuracy and memory.]

Fig. 9 (a) Accuracy of SNNs with floating-point precision (FP32) across different network sizes, i.e., 400, 900, 1600, and 2500 excitatory neurons. (b) Accuracy of an SNN with 400 excitatory neurons across different quantized precision levels, i.e., 4–16 bits of weights. These results are adapted from [54]

After applying weight quantization, different techniques have different accuracy profiles, as shown in Fig. 9b. Here, our methodology achieves higher accuracy than the baseline and the SL-STDP when the bit-width of quantization is at least 8 bits, as indicated by label (1) in Fig. 9b. The reason is that a precision of 8 bits (or more) in our methodology provides a sufficient range of weight values to represent unique features in the input images, thereby making each neuron recognize a specific class. With 8-bit precision, our methodology achieves 91.6% accuracy, while the baseline and the SL-STDP achieve 87.6% and 82%, respectively. This shows that the accuracy obtained by our methodology with 8-bit precision is still higher than that of the baseline and the SL-STDP with FP32 precision. These results suggest that our methodology (8-bit) incurs no accuracy loss relative to these FP32 designs despite the reduced bit-width, thereby offering a good trade-off between accuracy and memory footprint.

[Figure 10. Bar chart of Memory [MB] (0 to 80) versus Network Size (400, 900, 1600, 2500) for the baseline, the SL-STDP, and our method with FP32, 16-bit, and 8-bit precision; the savings of our method are annotated. Our method can significantly reduce the memory footprint.]

Fig. 10 Memory footprints for different network sizes, i.e., 400, 900, 1600, and 2500 excitatory neurons, and different quantization levels, i.e., without quantization/FP32, 16-bit, and 8-bit precision (adapted from [54])

4.2 Reduction of Memory Requirement

The experimental results for memory requirements (i.e., footprints) across different network sizes are provided in Fig. 10. These results show that an SNN with 2500 excitatory neurons that employs the baseline or the SL-STDP technique requires more than 50 MB of memory, thereby making it challenging to run the SNN model on embedded platforms with limited on-chip memory. On the other hand, our methodology without quantization (FP32) obtains a 1.8x memory saving as compared to the baseline or the SL-STDP. The reason is that our methodology eliminates the inhibitory neurons, hence the corresponding neuron parameters do not need to be stored in memory. Moreover, the memory footprint is reduced even more when quantization is applied. For instance, our methodology achieves about 3.5x and 7x memory savings as compared to the baseline for 16-bit and 8-bit precision, respectively. These results indicate that the optimized SNN with 2500 excitatory neurons and 8-bit weight precision requires ca. 8.5 MB of memory, thereby making it more feasible to deploy and run the SNN model on embedded platforms. At this point, if we consider both the obtained accuracy and the memory footprint in the design process, we can select the optimized SNN model that offers a good trade-off between acceptable accuracy and memory footprint.

4.3 Improvement of Energy Efficiency

Figure 11 shows the experimental results for energy efficiency across different network sizes for both the training and inference phases.

[Figure 11. Panels (a) Training and (b) Inference: Energy Efficiency (Normalized to Baseline) versus Network Size (400, 900, 1600, 2500) for the baseline, the SL-STDP, and our method with FP32, 16-bit, and 8-bit precision. Our methodology improves energy efficiency for both training and inference.]

Fig. 11 (a) Energy efficiency across different network sizes for the training phase. (b) Energy efficiency across different network sizes for the inference phase. These results are adapted from [54]

Training. The SL-STDP improves the energy efficiency over the baseline by 1.1x–1.2x across different network sizes. The reason is that the SL-STDP employs weight updates based on postsynaptic spikes, thereby performing fewer updates than the baseline, whose weight updates are based on both presynaptic and postsynaptic spikes. Meanwhile, our methodology (FP32) improves the energy efficiency more than the SL-STDP, i.e., by 1.3x–1.6x as compared to the baseline. The reason is that, in addition to the removal of presynaptic spike-based updates, our methodology also removes the inhibitory neurons and reduces the learning complexity. Moreover, if weight quantization is applied, our methodology improves the energy efficiency even further due to a lower memory footprint, i.e., by 1.5x–2.1x and 1.7x–2.3x compared to the baseline for 16-bit and 8-bit precision, respectively.

Inference. The SL-STDP and the baseline have comparable energy efficiency across different network sizes since they have similar operations during the inference phase. Meanwhile, our methodology (FP32) improves the energy efficiency over the baseline by 1.2x–1.4x due to the removal of inhibitory neurons. Moreover, if weight quantization is applied, our methodology further improves the energy efficiency due to a lower memory footprint, i.e., by 1.5x–2x and 2.1x–2.2x compared to the baseline or the SL-STDP for 16-bit and 8-bit precision, respectively.

In summary, these results highlight that the SL-STDP can achieve higher energy efficiency than the baseline in the training phase, while our methodology can achieve the highest energy efficiency in both the training and inference phases. At this point, if we consider the accuracy, memory footprint, and energy efficiency that the quantized designs can achieve, then we can judiciously select the optimized SNN model from our methodology to offer a good design trade-off in terms of accuracy, memory, and energy efficiency for the given embedded applications.

4.4 Impact of Approximate DRAM

Here, we discuss the impact of approximate DRAM for SNN inference in terms of classification accuracy and DRAM access energy savings.

Classification Accuracy Figure 12 presents the experimental results for the accuracy of SNNs across different network sizes and different bit error rates (BER)

[Figure 12: four line-plot panels (a)–(d). x-axis: Bit Error Rate (BER), from 10^-9 to 10^-3; y-axis: Accuracy [%]. Series: baseline SNN with accurate DRAM, baseline SNN with approximate DRAM. Panel annotation: "A higher bit error rate (BER) in approximate DRAM leads to lower accuracy."]

Fig. 12 The accuracy of the baseline SNN with accurate and approximate DRAM across different BER and different network sizes: (a) 400, (b) 900, (c) 1600, and (d) 2500 excitatory neurons (adapted from [52])

in DRAM. These results show that the baseline SNN with approximate DRAM faces accuracy degradation as compared to the baseline SNN with accurate DRAM. In general, the accuracy decreases as the BER increases. The reason is that, in approximate DRAM, the weight bits are corrupted (i.e., flipped) and their values are changed. These changes may lead to accuracy degradation, as they can prevent the corresponding neurons from recognizing the correct class. Moreover, we observe that different bit error locations have a different impact on accuracy. If the bit errors alter the most significant bits (MSBs) of weights, then they may significantly change the respective weight values, hence leading to noticeable accuracy degradation. Meanwhile, if the bit errors alter the least significant bits (LSBs) of weights, then they may not significantly change the respective weight values, hence the accuracy is not much affected.

DRAM Access Energy Figure 13 provides the experimental results of the DRAM access energy for the baseline SNN with accurate and approximate DRAM across different V_supply values and different network sizes. Here, we observe that reducing V_supply to 1.325 V, 1.25 V, 1.175 V, 1.1 V, and 1.025 V leads to DRAM access energy savings of ca. 3%, 13%, 22%, 31%, and 39% on average, respectively. These results indicate that our design methodology with approximate DRAM substantially decreases the DRAM access energy for SNN inference compared to accurate DRAM. The reason is that the reduced V_supply forces the DRAM to run at reduced operational parameters (e.g., voltage, power).

The above discussion highlights that employing approximate DRAM can achieve substantial DRAM access energy savings at the cost of accuracy degradation. Therefore, the trade-off between accuracy and DRAM energy savings should be exploited by carefully employing approximate DRAM so that the obtained accuracy is still acceptable, i.e., it meets the given design specifications. Therefore, it is necessary to select the effective V_supply reduction levels so that

[Figure 13: four bar-chart panels for network sizes 400, 900, 1600, and 2500. x-axis: accurate DRAM (Acc. DRAM, 1.350 V) and approximate DRAM at 1.325 V, 1.250 V, 1.175 V, 1.100 V, and 1.025 V; y-axis: DRAM Access Energy [mJ]. Panel annotation: "Employing reduced-voltage DRAM in our methodology leads to the reduction of DRAM access energy, as compared to accurate DRAM (i.e., Acc. DRAM)."]

Fig. 13 Results of DRAM access energy for an SNN inference (i.e., with a single input sample) using accurate and approximate DRAM across different DRAM supply voltages V_supply and different network sizes (adapted from [52])

both the significant DRAM access energy savings and acceptable accuracy can be achieved for enabling embedded SNNs.
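The effect of bit errors on stored weights, and the difference between MSB and LSB errors, can be reproduced with a small experiment. This is a minimal sketch under stated assumptions, not the authors' error model: weights are assumed to be stored as unsigned 8-bit integers, each bit flips independently with probability equal to the BER, and the helper names `inject_bit_errors` and `flip_single_bit` are hypothetical.

```python
import numpy as np


def inject_bit_errors(weights_u8, ber, rng):
    """Flip each stored bit independently with probability `ber`.

    ASSUMPTION: uniform, independent bit errors on uint8 weights.
    """
    bits = np.unpackbits(weights_u8)  # flattened array of 0/1 bits
    flips = (rng.random(bits.shape) < ber).astype(np.uint8)
    corrupted = np.bitwise_xor(bits, flips)
    return np.packbits(corrupted).reshape(weights_u8.shape)


def flip_single_bit(weights_u8, bit_position):
    """Flip one chosen bit (0 = LSB, 7 = MSB) in every weight."""
    return np.bitwise_xor(weights_u8, np.uint8(1 << bit_position))


rng = np.random.default_rng(0)
weights = rng.integers(0, 256, size=(784, 400), dtype=np.uint8)

# Fraction of weights whose value changes at the BER values shown in Fig. 12.
for ber in (1e-9, 1e-7, 1e-5, 1e-3):
    noisy = inject_bit_errors(weights, ber, rng)
    changed = np.count_nonzero(noisy != weights) / weights.size
    print(f"BER={ber:.0e}: fraction of changed weights = {changed:.2e}")

# Magnitude of the value change for an MSB flip versus an LSB flip.
original = weights.astype(int)
msb_shift = np.abs(flip_single_bit(weights, 7).astype(int) - original)
lsb_shift = np.abs(flip_single_bit(weights, 0).astype(int) - original)
print("MSB flip changes each weight by", int(msb_shift.max()))
print("LSB flip changes each weight by", int(lsb_shift.max()))
```

In this 8-bit representation an MSB flip shifts a weight by half of its full range, while an LSB flip shifts it by one quantization step, which is consistent with the observation that error location matters.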

5 Conclusion

In this chapter, we discuss our design methodology to improve the energy efficiency of SNN processing for embedded applications. It employs several network optimization steps, including the reduction of neuron and learning operations, enhancements of learning quality, and weight quantization. Afterwards, our methodology evaluates whether the optimized SNN model meets the memory and energy constraints. If so, our design methodology then employs a reduced-voltage approximate DRAM to substantially reduce the DRAM energy-per-access while maintaining the accuracy within an acceptable range. In this manner, our design methodology provides an optimized SNN model that meets the accuracy, memory, and energy constraints, thereby having high applicability for diverse embedded applications.

Acknowledgments This work was partly supported by Intel Corporation through Gift funding for the project “Cost-Effective Dependability for Deep Neural Networks and Spiking Neural Networks,” and by Indonesia Endowment Fund for Education (IEFE/LPDP) Graduate Scholarship Program, Ministry of Finance, Republic of Indonesia under Grant PRJ1477/LPDP.3/2017. This work was also jointly supported by the NYUAD Center for Interacting Urban Networks (CITIES), funded by Tamkeen under the NYUAD Research Institute Award CG001, and Center for CyberSecurity (CCS), funded by Tamkeen under the NYUAD Research Institute Award G1104.


References

1. Akopyan, F., Sawada, J., Cassidy, A., Alvarez-Icaza, R., Arthur, J., Merolla, P., Imam, N., Nakamura, Y., Datta, P., Nam, G., Taba, B., Beakes, M., Brezzo, B., Kuang, J.B., Manohar, R., Risk, W.P., Jackson, B., Modha, D.S.: TrueNorth: Design and tool flow of a 65 mW 1 million neuron programmable neurosynaptic chip. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 34 (2015). https://doi.org/10.1109/TCAD.2015.2474396
2. Allred, J.M., Roy, K.: Controlled forgetting: Targeted stimulation and dopaminergic plasticity modulation for unsupervised lifelong learning in spiking neural networks. Front. Neurosci. 14, 7 (2020). https://doi.org/10.3389/fnins.2020.00007
3. Arslan, A.K., Yasar, S., Colak, C.: An intelligent system for the classification of lung cancer based on deep learning strategy. In: 2019 International Artificial Intelligence and Data Processing Symposium (IDAP), pp. 1–4 (2019). https://doi.org/10.1109/IDAP.2019.8875896
4. Baek, E., Lee, H., Kim, Y., Kim, J.: FlexLearn: Fast and highly efficient brain simulations using flexible on-chip learning. In: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO ’52, pp. 304–318. Association for Computing Machinery, New York (2019). https://doi.org/10.1145/3352460.3358268
5. Barata, C., Marques, J.S.: Deep learning for skin cancer diagnosis with hierarchical architectures. In: 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), pp. 841–845 (2019). https://doi.org/10.1109/ISBI.2019.8759561
6. Capra, M., Bussolino, B., Marchisio, A., Shafique, M., Masera, G., Martina, M.: An updated survey of efficient hardware architectures for accelerating deep convolutional neural networks. Future Internet 12(7), 113 (2020)
7. Capra, M., Peloso, R., Masera, G., Ruo Roch, M., Martina, M.: Edge computing: A survey on the hardware requirements in the internet of things world. Future Internet 11(4) (2019). https://doi.org/10.3390/fi11040100. https://www.mdpi.com/1999-5903/11/4/100
8. Cassidy, A.S., Merolla, P., Arthur, J.V., Esser, S.K., Jackson, B., Alvarez-Icaza, R., Datta, P., Sawada, J., Wong, T.M., Feldman, V., Amir, A., Rubin, D.B.D., Akopyan, F., McQuinn, E., Risk, W.P., Modha, D.S.: Cognitive computing building block: A versatile and efficient digital neuron model for neurosynaptic cores. In: The 2013 International Joint Conference on Neural Networks (IJCNN), pp. 1–10 (2013). https://doi.org/10.1109/IJCNN.2013.6707077
9. Chandrasekar, K., Weis, C., Li, Y., Goossens, S., Jung, M., Naji, O., Akesson, B., Wehn, N., Goossens, K.: DRAMPower. http://www.drampower.info
10. Chang, K.K., Yağlıkçı, A.G., Ghose, S., Agrawal, A., Chatterjee, N., Kashyap, A., Lee, D., O’Connor, M., Hassan, H., Mutlu, O.: Understanding reduced-voltage operation in modern DRAM devices: experimental characterization, analysis, and mechanisms. Proc. ACM Measurements and Analysis of Computing Systems 1(1) (2017). https://doi.org/10.1145/3084447
11. Chen, G.K., Kumar, R., Sumbul, H.E., Knag, P.C., Krishnamurthy, R.K.: A 4096-neuron 1M-synapse 3.8-pJ/SOP spiking neural network with on-chip STDP learning and sparse weights in 10-nm FinFET CMOS. IEEE J. Solid State Circuits 54(4), 992–1002 (2018)
12. Chen, Q., He, G., Wang, X., Xu, J., Shen, S., Chen, H., Fu, Y., Li, L.: A 67.5 μJ/prediction accelerator for spiking neural networks in image segmentation. IEEE Trans. Circuits Syst. II Express Briefs 69(2), 574–578 (2021)
13. Davies, M., Srinivasa, N., Lin, T., Chinya, G., Cao, Y., Choday, S.H., Dimou, G., Joshi, P., Imam, N., Jain, S., Liao, Y., Lin, C., Lines, A., Liu, R., Mathaikutty, D., McCoy, S., Paul, A., Tse, J., Venkataramanan, G., Weng, Y., Wild, A., Yang, Y., Wang, H.: Loihi: A neuromorphic manycore processor with on-chip learning. IEEE Micro 38(1), 82–99 (2018). https://doi.org/10.1109/MM.2018.112130359
14. Diehl, P., Cook, M.: Unsupervised learning of digit recognition using spike-timing-dependent plasticity. Front. Comput. Neurosci. 9, 99 (2015). https://doi.org/10.3389/fncom.2015.00099
15. Frenkel, C., Lefebvre, M., Legat, J., Bol, D.: A 0.086-mm² 12.7-pJ/SOP 64k-synapse 256-neuron online-learning digital spiking neuromorphic processor in 28-nm CMOS. IEEE Trans. Biomed. Circuits Syst. 13(1), 145–158 (2019). https://doi.org/10.1109/TBCAS.2018.2880425


16. Frenkel, C., Legat, J.D., Bol, D.: A compact phenomenological digital neuron implementing the 20 Izhikevich behaviors. In: 2017 IEEE Biomedical Circuits and Systems Conference (BioCAS), pp. 1–4 (2017). https://doi.org/10.1109/BIOCAS.2017.8325231
17. Frenkel, C., Legat, J.D., Bol, D.: Morphic: A 65-nm 738k-synapse/mm² quad-core binary-weight digital neuromorphic processor with stochastic spike-driven online learning. IEEE Trans. Biomed. Circuits Syst. 13(5), 999–1010 (2019)
18. Gautrais, J., Thorpe, S.: Rate coding versus temporal order coding: a theoretical approach. Biosystems 48(1), 57–65 (1998). https://doi.org/10.1016/S0303-2647(98)00050-1
19. Ghose, S., et al.: Demystifying complex workload-DRAM interactions: an experimental study. In: Proceedings of the SIGMETRICS, pp. 93–93 (2019). https://doi.org/10.1145/3309697.3331482
20. Grigorescu, S., Trasnea, B., Cocias, T., Macesanu, G.: A survey of deep learning techniques for autonomous driving. J. Field Rob. 37(3), 362–386 (2020). https://doi.org/10.1002/rob.21918
21. Ha, V.S., Lu, D.N., Choi, G.S., Nguyen, H.N., Yoon, B.: Improving credit risk prediction in online peer-to-peer (p2p) lending using feature selection with deep learning. In: Proceedings of the 2019 21st International Conference on Advanced Communication Technology (ICACT), pp. 511–515 (2019). https://doi.org/10.23919/ICACT.2019.8701943
22. Han, S., Mao, H., Dally, W.J.: Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149 (2015)
23. Hazan, H., Saunders, D., Sanghavi, D.T., Siegelmann, H., Kozma, R.: Unsupervised learning with self-organizing spiking neural networks. In: International Joint Conference on Neural Networks, pp. 1–6 (2018). https://doi.org/10.1109/IJCNN.2018.8489673
24. Hazan, H., Saunders, D.J., Khan, H., Patel, D., Sanghavi, D.T., Siegelmann, H.T., Kozma, R.: BindsNET: A machine learning-oriented spiking neural networks library in python. Front. Neuroinform. 12, 89 (2018). https://doi.org/10.3389/fninf.2018.00089
25. Hazan, H., Saunders, D.J., Sanghavi, D.T., Siegelmann, H., Kozma, R.: Lattice map spiking neural networks (LM-SNNs) for clustering and classifying image data. Ann. Math. Artif. Intell. 88(11), 1237–1260 (2019). https://doi.org/10.1007/s10472-019-09665-3
26. Hopkins, M., Mikaitis, M., Lester, D.R., Furber, S.: Stochastic rounding and reduced-precision fixed-point arithmetic for solving neural ordinary differential equations. Phil. Trans. R. Soc. A 378(2166), 20190052 (2020)
27. Izhikevich, E.M.: Which model to use for cortical spiking neurons? IEEE Trans. on Neural Networks (TNN) 15(5) (2004). https://doi.org/10.1109/TNN.2004.832719
28. Kaskavalci, H.C., Gören, S.: A deep learning based distributed smart surveillance architecture using edge and cloud computing. In: Proceedings of the 2019 International Conference on Deep Learning and Machine Learning in Emerging Applications (Deep-ML), pp. 1–6 (2019). https://doi.org/10.1109/Deep-ML.2019.00009
29. Kayser, C., Montemurro, M.A., Logothetis, N.K., Panzeri, S.: Spike-phase coding boosts and stabilizes information carried by spatial and temporal spike patterns. Neuron 61(4), 597–608 (2009). https://doi.org/10.1016/j.neuron.2009.01.008
30. Koppula, S., Orosa, L., Yağlıkçı, A.G., Azizi, R., Shahroodi, T., Kanellopoulos, K., Mutlu, O.: Eden: Enabling energy-efficient, high-performance deep neural network inference using approximate DRAM. In: 52nd Annual IEEE/ACM International Symposium on Microarchitecture, pp. 166–181 (2019). https://doi.org/10.1145/3352460.3358280
31. Krithivasan, S., Sen, S., Venkataramani, S., Raghunathan, A.: Dynamic spike bundling for energy-efficient spiking neural networks. In: 2019 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), pp. 1–6 (2019). https://doi.org/10.1109/ISLPED.2019.8824897
32. Kuang, Y., Cui, X., Zhong, Y., Liu, K., Zou, C., Dai, Z., Wang, Y., Yu, D., Huang, R.: A 64K-neuron 64M-1b-synapse 2.64 pJ/SOP neuromorphic chip with all memory on chip for spike-based models in 65nm CMOS. IEEE Trans. Circuits Syst. II Express Briefs 68(7), 2655–2659 (2021)
33. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)


34. Lee, D., Lee, G., Kwon, D., Lee, S., Kim, Y., Kim, J.: Flexon: A flexible digital neuron for efficient spiking neural network simulations. In: 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pp. 275–288. IEEE, New York (2018)
35. Maass, W.: Networks of spiking neurons: The third generation of neural network models. Neural Netw. 10(9), 1659–1671 (1997). https://doi.org/10.1016/S0893-6080(97)00011-7
36. Minaee, S., Boykov, Y.Y., Porikli, F., Plaza, A.J., Kehtarnavaz, N., Terzopoulos, D.: Image segmentation using deep learning: A survey. In: IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), pp. 1–1 (2021). https://doi.org/10.1109/TPAMI.2021.3059968
37. Mohsen, H., El-Dahshan, E.S.A., El-Horbaty, E.S.M., Salem, A.B.M.: Classification using deep learning neural networks for brain tumors. Future Computing and Informatics Journal 3(1), 68–71 (2018). https://doi.org/10.1016/j.fcij.2017.12.001. https://www.sciencedirect.com/science/article/pii/S2314728817300636
38. Morrison, A., Aertsen, A., Diesmann, M.: Spike-timing-dependent plasticity in balanced random networks. Neural Comput. 19(6), 1437–1467 (2007). https://doi.org/10.1162/neco.2007.19.6.1437
39. Mozafari, M., Ganjtabesh, M., Nowzari-Dalini, A., Masquelier, T.: SpykeTorch: Efficient simulation of convolutional spiking neural networks with at most one spike per neuron. Front. Neurosci. 13, 625 (2019). https://doi.org/10.3389/fnins.2019.00625
40. NVIDIA: NVIDIA GeForce RTX 2080 Ti. https://www.nvidia.com/de-at/geforce/graphics-cards/rtx-2080-ti
41. NVIDIA: NVIDIA Jetson Nano. https://developer.nvidia.com/embedded/jetson-nano-developer-kit
42. Painkras, E., Plana, L.A., Garside, J., Temple, S., Galluppi, F., Patterson, C., Lester, D.R., Brown, A.D., Furber, S.B.: Spinnaker: A 1-w 18-core system-on-chip for massively-parallel neural network simulation. IEEE J. Solid State Circuits 48(8), 1943–1953 (2013)
43. Palossi, D., Loquercio, A., Conti, F., Flamand, E., Scaramuzza, D., Benini, L.: Ultra low power deep-learning-powered autonomous nano drones. CoRR abs/1805.01831 (2018). http://arxiv.org/abs/1805.01831
44. Panda, P., et al.: Asp: Learning to forget with adaptive synaptic plasticity in spiking neural networks. IEEE J. Emerging Sel. Top. Circuits Syst. (JETCAS) 8(1), 51–64 (2018). https://doi.org/10.1109/JETCAS.2017.2769684
45. Park, J., Lee, J., Jeon, D.: A 65-nm neuromorphic image classification processor with energy-efficient training through direct spike-only feedback. IEEE J. Solid State Circuits 55(1), 108–119 (2019)
46. Park, S., Kim, S., Choe, H., Yoon, S.: Fast and efficient information transmission with burst spikes in deep spiking neural networks. In: 56th Annual Design Automation Conference (DAC), p. 53 (2019)
47. Park, S., Kim, S., Na, B., Yoon, S.: T2FSNN: Deep spiking neural networks with time-to-first-spike coding. In: Proceedings of 57th ACM/IEEE Design Automation Conference (DAC), pp. 1–6 (2020). https://doi.org/10.1109/DAC18072.2020.9218689
48. Pfeiffer, M., Pfeil, T.: Deep learning with spiking neurons: Opportunities and challenges. Front. Neurosci. 12, 774 (2018). https://doi.org/10.3389/fnins.2018.00774
49. Putra, R.V.W., Hanif, M.A., Shafique, M.: DRMap: A generic DRAM data mapping policy for energy-efficient processing of convolutional neural networks. In: 2020 57th ACM/IEEE Design Automation Conference, pp. 1–6 (2020). https://doi.org/10.1109/DAC18072.2020.9218672
50. Putra, R.V.W., Hanif, M.A., Shafique, M.: Respawn: Energy-efficient fault-tolerance for spiking neural networks considering unreliable memories. In: 2021 IEEE/ACM International Conference On Computer Aided Design, pp. 1–9 (2021). https://doi.org/10.1109/ICCAD51958.2021.9643524
51. Putra, R.V.W., Hanif, M.A., Shafique, M.: ROMANet: Fine-grained reuse-driven off-chip memory access management and data organization for deep neural network accelerators. IEEE Trans. Very Large Scale Integr. VLSI Syst. 29(4), 702–715 (2021). https://doi.org/10.1109/TVLSI.2021.3060509


52. Putra, R.V.W., Hanif, M.A., Shafique, M.: SparkXD: A framework for resilient and energy-efficient spiking neural network inference using approximate DRAM. In: 2021 58th ACM/IEEE Design Automation Conference, pp. 379–384 (2021). https://doi.org/10.1109/DAC18074.2021.9586332
53. Putra, R.V.W., Hanif, M.A., Shafique, M.: SoftSNN: Low-cost fault tolerance for spiking neural network accelerators under soft errors. arXiv preprint arXiv:2203.05523 (2022)
54. Putra, R.V.W., Shafique, M.: FSpiNN: An optimization framework for memory- and energy-efficient spiking neural networks. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 39(11), 3601–3613 (2020). https://doi.org/10.1109/TCAD.2020.3013049
55. Putra, R.V.W., Shafique, M.: Q-SpiNN: A framework for quantizing spiking neural networks. In: 2021 International Joint Conference on Neural Networks, pp. 1–8 (2021). https://doi.org/10.1109/IJCNN52387.2021.9534087
56. Putra, R.V.W., Shafique, M.: SpikeDyn: A framework for energy-efficient spiking neural networks with continual and unsupervised learning capabilities in dynamic environments. In: 2021 58th ACM/IEEE Design Automation Conference, pp. 1057–1062 (2021). https://doi.org/10.1109/DAC18074.2021.9586281
57. Putra, R.V.W., Shafique, M.: lpSpikeCon: Enabling low-precision spiking neural network processing for efficient unsupervised continual learning on autonomous agents. arXiv preprint arXiv:2205.12295 (2022)
58. Putra, R.V.W., Shafique, M.: tinySNN: Towards memory- and energy-efficient spiking neural networks. arXiv preprint arXiv:2206.08656 (2022)
59. Rahimi Azghadi, M., Iannella, N., Al-Sarawi, S.F., Indiveri, G., Abbott, D.: Spike-based synaptic plasticity in silicon: Design, implementation, application, and challenges. Proc. IEEE 102(5), 717–737 (2014). https://doi.org/10.1109/JPROC.2014.2314454
60. Rathi, N., Panda, P., Roy, K.: STDP-based pruning of connections and weight quantization in spiking neural networks for energy-efficient recognition. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 38(4), 668–677 (2019). https://doi.org/10.1109/TCAD.2018.2819366
61. Roy, A., Venkataramani, S., Gala, N., Sen, S., Veezhinathan, K., Raghunathan, A.: A programmable event-driven architecture for evaluating spiking neural networks. In: 2017 IEEE/ACM International Symposium on Low Power Electronics and Design, pp. 1–6 (2017). https://doi.org/10.1109/ISLPED.2017.8009176
62. Saunders, D.J., Patel, D., Hazan, H., Siegelmann, H.T., Kozma, R.: Locally connected spiking neural networks for unsupervised feature learning. Neural Netw. 119, 332–340 (2019). https://doi.org/10.1016/j.neunet.2019.08.016
63. Saunders, D.J., Siegelmann, H.T., Kozma, R., Ruszinkó, M.: STDP learning of image patches with convolutional spiking neural networks. In: International Joint Conference on Neural Networks, pp. 1–7 (2018). https://doi.org/10.1109/IJCNN.2018.8489684
64. Schuman, C.D., Potok, T.E., Patton, R.M., Birdwell, J.D., Dean, M.E., Rose, G.S., Plank, J.S.: A survey of neuromorphic computing and neural networks in hardware. arXiv preprint arXiv:1705.06963 (2017)
65. Sen, S., Venkataramani, S., Raghunathan, A.: Approximate computing for spiking neural networks. In: Design, Automation Test in Europe Conf. Exhibition, pp. 193–198 (2017). https://doi.org/10.23919/DATE.2017.7926981
66. Shafique, M., Marchisio, A., Putra, R.V.W., Hanif, M.A.: Towards energy-efficient and secure edge AI: A cross-layer framework ICCAD special session paper. In: 2021 IEEE/ACM International Conference On Computer Aided Design, pp. 1–9 (2021). https://doi.org/10.1109/ICCAD51958.2021.9643539
67. Srinivasan, G., Roy, S., Raghunathan, V., Roy, K.: Spike timing dependent plasticity based enhanced self-learning for efficient pattern recognition in spiking neural networks. In: 2017 International Joint Conference on Neural Networks (IJCNN), pp. 1847–1854 (2017). https://doi.org/10.1109/IJCNN.2017.7966075
68. Sze, V., Chen, Y., Yang, T., Emer, J.S.: Efficient processing of deep neural networks: A tutorial and survey. Proc. IEEE 105(12), 2295–2329 (2017). https://doi.org/10.1109/JPROC.2017.2761740


69. Tavanaei, A., Ghodrati, M., Kheradpisheh, S.R., Masquelier, T., Maida, A.: Deep learning in spiking neural networks. Neural Netw. 111, 47–63 (2019). https://doi.org/10.1016/j.neunet.2018.12.002
70. Thorpe, S., Gautrais, J.: Rank order coding. In: Computational Neuroscience, pp. 113–118. Springer, New York (1998)
71. Ying, J.J.C., Huang, P.Y., Chang, C.K., Yang, D.L.: A preliminary study on deep learning for predicting social insurance payment behavior. In: 2017 IEEE International Conference on Big Data, pp. 1866–1875 (2017). https://doi.org/10.1109/BigData.2017.8258131
72. Zanc, R., Cioara, T., Anghel, I.: Forecasting financial markets using deep learning. In: 2019 IEEE 15th International Conference on Intelligent Computer Communication and Processing, pp. 459–466 (2019). https://doi.org/10.1109/ICCP48234.2019.8959715
73. Zhang, D., Liu, S.E.: Top-down saliency object localization based on deep-learned features. In: 2018 11th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics, pp. 1–9 (2018). https://doi.org/10.1109/CISP-BMEI.2018.8633218

Compilation and Optimizations for Efficient Machine Learning on Embedded Systems

Xiaofan Zhang, Yao Chen, Cong Hao, Sitao Huang, Yuhong Li, and Deming Chen

1 Introduction

The recent development of Deep Neural Networks (DNNs) has made machine-learning-based smart solutions more relevant and accessible to the general public. We have seen that some DNN technologies have been integrated into our daily applications to provide high-quality inference services, such as image recognition, natural language processing, self-driving cars, and augmented and virtual reality [1–4], which have made our lives more convenient and our work more efficient. A significant number of these machine learning applications leverage edge devices and need to be deployed onto resource-constrained embedded systems, such as cell phones, cameras, and unmanned aerial vehicles (UAVs). They require not only higher inference accuracy to achieve intelligent responses but also aggressive inference speed, throughput, and energy efficiency to meet real-life demands. As DNNs become more complicated, developing and serving the DNN-enabled applications requires more compute and memory resources, longer latency, and

X. Zhang (✉) · Y. Li · D. Chen
Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Champaign, IL, USA

Y. Chen
Advanced Digital Sciences Center, Singapore, Singapore

C. Hao
Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, USA

S. Huang
Electrical Engineering and Computer Science, University of California Irvine, Irvine, CA, USA

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
S. Pasricha, M. Shafique (eds.), Embedded Machine Learning for Cyber-Physical, IoT, and Edge Computing, https://doi.org/10.1007/978-3-031-39932-9_3


greater energy consumption. For example, the computation demands of DNN training rose by more than 300,000 times between AlexNet [5], the champion model of the 2012 ImageNet competition, and AlphaGo Zero [6], the AI player proposed in 2017 for the board game Go with superhuman skills [7]. Among image recognition models, model complexity increased 16-fold from AlexNet, with 85% top-5 accuracy, to ResNet-152 [2], with 95% top-5 accuracy. Such exponentially increasing compute and memory demands have created challenges and difficulties for DNN deployment on hardware, especially when targeting edge embedded devices with strictly limited compute and memory resources and tight power budgets [8, 9].

Although cloud computing can alleviate the burden of edge computing by taking over computationally intensive tasks, it is not always feasible in real-life scenarios. The primary reasons for staying on edge embedded devices come from the unique requirements of edge applications, which typically need real-time decision-making and reduced reliance on network communication and accessibility. They typically cannot tolerate the extra latency caused by network data transfer due to their real-time response requirements. In addition, private information, such as personal and sensitive data, should not be uploaded to the cloud without permission. This means that edge devices are required to deliver not only high inference accuracy from DNNs but also aggressive inference speed, throughput, and energy efficiency to meet various real-life demands.

In summary, the challenges of deploying machine learning workloads on edge embedded devices mainly come from three aspects: (1) DNN models are getting complicated and may fail to run efficiently, especially when targeting low-power edge devices with scarce compute and memory resources; (2) mapping DNNs onto existing hardware or building domain-specific hardware is tedious and time-consuming; (3) additional challenges come from inefficient optimization strategies that focus on hardware or software optimizations alone and lack the software/hardware co-design or cross-system-stack design methods that can potentially deliver better overall solutions.

Despite the aforementioned challenges, there has been continuous progress in recent studies to explore various optimization strategies for edge machine learning solutions. In this chapter, we present comprehensive design methodologies to face and overcome the challenges and enable efficient DNN applications on embedded systems. These methods include efficient DNN model designs in Sect. 3, accelerator design and workload mapping technologies in Sect. 4, and cross-stack optimization strategies in Sect. 5.

2 Background and Related Works

Existing solutions to enable efficient DNN on embedded systems attempt to address challenges from the DNN model to the entire hardware-software system. These different methods cover different development cycles and have different

Table 1 Design methodologies and their attributes

Methods | Attributes
Efficient DNN model design | Design methods to create DNN models with fewer parameters, fewer memory demands, and lower computational complexity
Efficient accelerator design and DNN mapping | Solutions to build domain-specific hardware/software accelerators with optimized task scheduling
Efficient DNN/accelerator co-design | Optimization strategies that integrate both the hardware design process and DNN algorithm design process

characteristics, as shown in Table 1. In this section, we present existing work on the different design methods in terms of their different properties.

2.1 Efficient DNN Designs

A DNN includes multiple intermediate layers between the input and output layers, and each intermediate layer consists of artificial neurons for transforming the input information (e.g., input feature maps) following the predefined network connection. In general, a DNN contains millions of parameters and requires billions of operations during inference. To successfully deploy DNNs onto hardware with desired performance, developers focus on network compression to reduce network complexities and lower the compute and memory demands. Recent research has demonstrated the possibility of using quantized data to represent original floating-point parameters, such as using 8-bit quantization or even binary and ternary data representation [10–15]. These solutions are intended to replace the hardware-intensive floating-point multiplications by logical operations so that DNNs can be more efficient on hardware platforms.

Another method to compress DNNs is network pruning, which aims to reduce the redundancy of DNN structures [16–18]. According to the published pruning strategies, the less essential connections between DNN layers are discarded, and network retraining is then performed to regain accuracy. Significant reductions can be achieved on the classic DNNs, such as AlexNet [5] and VGG-16 [19]. Since the major benefit of network compression comes from the fully connected (FC) layers, later DNNs with fewer FC layers (e.g., GoogLeNet [20] and ResNet [2]) require more sophisticated algorithms to keep pruning effective, such as evolutionary algorithms [21], the alternating direction method of multipliers [22], and iterative pruning [23].

As most of the computations happen inside the convolutional (Conv) layers, previous works also attempt to reduce the computational complexity by using depth-wise separable Conv layers [24]. The depth-wise separable structure can effectively reduce the number of operations and provide more compact DNN designs for resource-constrained hardware. To further improve the DNN deployment on hardware, layer fusion is proposed in [25] to minimize data movements between on-chip and off-chip memory.
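The operation savings of a depth-wise separable Conv layer can be checked with a short calculation. The sketch below counts multiply-accumulate (MAC) operations for a standard Conv layer and for its depth-wise separable counterpart (a per-channel K x K depth-wise Conv followed by a 1 x 1 point-wise Conv). The layer dimensions are hypothetical values chosen for illustration, and bias terms and stride effects are ignored.

```python
import math


def standard_conv_macs(h_out, w_out, c_in, c_out, k):
    """MACs of a standard K x K convolution producing an h_out x w_out map."""
    return h_out * w_out * c_in * c_out * k * k


def depthwise_separable_macs(h_out, w_out, c_in, c_out, k):
    """MACs of a K x K depth-wise conv followed by a 1 x 1 point-wise conv."""
    depthwise = h_out * w_out * c_in * k * k  # one K x K filter per input channel
    pointwise = h_out * w_out * c_in * c_out  # 1 x 1 conv mixes the channels
    return depthwise + pointwise


# Hypothetical layer: 56 x 56 output, 64 input channels, 128 output channels, 3 x 3 kernel.
H, W, C_IN, C_OUT, K = 56, 56, 64, 128, 3
std = standard_conv_macs(H, W, C_IN, C_OUT, K)
sep = depthwise_separable_macs(H, W, C_IN, C_OUT, K)
print(f"standard:  {std:,} MACs")
print(f"separable: {sep:,} MACs")
print(f"ratio:     {sep / std:.3f} (closed form: 1/C_out + 1/K^2 = {1 / C_OUT + 1 / K**2:.3f})")
```

For a 3 x 3 kernel the ratio approaches 1/9 as the number of output channels grows, which is where the often-quoted reduction of roughly 8 to 9 times comes from.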

2.2 Efficient Accelerator Designs and DNN Mapping Methods

Building domain-specific hardware accelerators is another popular approach for efficient DNN deployment. These accelerators attempt to take advantage of customized or specialized hardware and software designs, such as adopting acceleration libraries on CPUs [26], exploring kernel optimization on GPUs [27], and building customized accelerators on FPGAs [28–30] and ASICs [10, 31, 32] to improve the speed and efficiency of DNN inference and training processes. Among these accelerator designs, FPGA- and ASIC-based designs can be fully customized to implement the neural network functionality with improved latency, throughput, and energy consumption compared to CPU- and GPU-based designs.

Still, developing customized accelerators presents significant challenges, such as the tedious hardware design process, the intricate hardware verification problems, and the time-consuming design space exploration during DNN deployment. To alleviate these challenges, recent investigations have started focusing on techniques including high-level synthesis [33–35] and end-to-end design frameworks for fast DNN accelerator design and efficient workload deployment [30, 36–38]. They support high-abstraction inputs, such as Python-based DNN descriptions used by popular machine learning frameworks (e.g., Caffe [39], TensorFlow [40], PyTorch [41]), so DNNs can be directly imported without manual code conversions and then parsed and mapped onto hardware. These frameworks, such as DNNBuilder [30] and HybridDNN [37], also integrate design space exploration (DSE) engines to perform effective and systematic explorations and deliver highly optimized accelerators that meet user-specific requirements.
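To make the idea of a DSE engine concrete, the following is a toy sketch and not the algorithm used by DNNBuilder or HybridDNN. It assumes a layer-pipelined accelerator in which each layer receives its own number of parallel MAC units, the total number of units is bounded by a resource budget, and throughput is limited by the slowest pipeline stage. The function name `explore`, the layer workloads, and the budget are hypothetical.

```python
from itertools import product
from math import ceil


def explore(layer_macs, choices, budget):
    """Brute-force DSE: choose per-layer parallelism under a resource budget.

    Returns (bottleneck_cycles, allocation) that minimizes the latency of the
    slowest pipeline stage, or None if no allocation fits the budget.
    ASSUMPTION: a stage with p MAC units needs ceil(macs / p) cycles.
    """
    best = None
    for alloc in product(choices, repeat=len(layer_macs)):
        if sum(alloc) > budget:
            continue  # exceeds the available MAC units
        bottleneck = max(ceil(m / p) for m, p in zip(layer_macs, alloc))
        if best is None or bottleneck < best[0]:
            best = (bottleneck, alloc)
    return best


# Hypothetical three-layer workload (MACs per input) and a budget of 14 MAC units.
layer_macs = [1000, 4000, 2000]
result = explore(layer_macs, choices=(1, 2, 4, 8), budget=14)
print("bottleneck cycles and allocation:", result)
```

The search gives the most units to the heaviest layer so that all stages finish in a similar number of cycles. Exhaustive enumeration is only practical for small spaces; real DSE engines rely on analytical models and pruning to keep the search tractable.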

2.3 Efficient Co-Design Optimization

Recent research also focuses on cross-stack co-design optimizations to enable successful DNN deployment on embedded systems [42]. Instead of independently optimizing hardware and software components, researchers proposed algorithm/accelerator co-design and co-search to solve the edge AI challenges: DNNs are designed to satisfy accuracy demands and must be aware of the hardware constraints with rational network configurations. At the same time, the accelerators need to provide extensive support for different DNN components without introducing too many restrictions on network design and guarantee performance to meet the specifications. The authors of [43] were the first to propose the concept of DNN/accelerator co-design, which aims to consider software and hardware metrics simultaneously to “automatically generate both DNN models and their corresponding implementations as pairs.” This concept was then demonstrated by winning the competitive System Design Contest for low-power object detection in the 56th IEEE/ACM Design Automation Conference (DAC-SDC) [44]. Many follow-up works continued investigating the co-design opportunities between different AI algorithms and hardware devices [45–52]. These co-design approaches have achieved remarkable results by combining multiple optimization techniques across both hardware and software. For example, while neural architecture search (NAS) has been largely successful in designing high-quality DNN models [53, 54], hardware-aware NAS, which aims at delivering high-accuracy models that are also hardware efficient (e.g., FBNet [55] and MNasNet [56]), is drawing increasing attention. Other machine learning algorithm/hardware co-design works include FNAS [57], NAIS [48], EDD [58], and NASAIC [59]. Driven by the success of such a co-design strategy, other types of co-design methods have also been proposed recently, including software/compiler co-design [60–62] and compiler/hardware co-design [63–65].

3 Efficient Machine Learning Model Designs

Machine learning applications require not only high inference accuracy but also aggressive inference speed, throughput, and energy efficiency to meet real-life demands. They rely on hardware-efficient DNN designs, especially when targeting edge scenarios with limited hardware resources. In this section, we introduce ELB-NN [12] and VecQ [14] to deliver hardware-efficient DNNs for embedded systems.

3.1 The ELB-NN

ELB-NN (Extremely Low Bitwidth Neural Network) is proposed to enhance energy efficiency when running image classification on an embedded FPGA. It is one of the first hybrid low-bitwidth designs that supports arbitrary DNN quantization. This subsection presents the hybrid quantization feature of ELB-NN and its corresponding hardware accelerator design on embedded systems.

3.1.1 Hybrid Quantization Scheme

Hybrid quantization means that different quantization schemes are applied to the network’s parameters and activations. The quantization scheme can go all the way down to binary. To better adopt hybrid quantization, we first investigate its impact on the network inference accuracy. We follow Eq. (1) to calculate the binary weights. Here $\tilde{w}$ represents the full-precision weights after back propagation, while $E(|\tilde{w}|)$, the mean of the absolute values of all the full-precision weights, serves as a scaling factor. For the ternary training, $w_t$ (representing ternary parameters) can be calculated following Eq. (2). Here we set the threshold $w_{thres} = 0.7E(|\tilde{w}|)$ and calculate the scaling factor $E$ as suggested in [66]. We also apply relatively high precision using 8- or 4-bit fixed-point representation. We then use AlexNet [5] to perform quantitative analysis when applying hybrid quantization.

$$w_b = \mathrm{sign}(\tilde{w}) \times E(|\tilde{w}|) \tag{1}$$

$$w_t = \begin{cases} \mathrm{sign}(\tilde{w}) \times E, & |\tilde{w}| > w_{thres} \\ 0, & |\tilde{w}| \le w_{thres} \end{cases} \tag{2}$$
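Eqs. (1) and (2) can be illustrated with a minimal Python sketch. It makes two labeled assumptions: sign(0) is taken as +1, and the ternary scaling factor E is computed as the mean magnitude of the weights above the threshold, which is how we read the suggestion in [66]. The function names and the example weights are illustrative only.

```python
def mean_abs(weights):
    """E(|w~|): mean of the absolute values of the full-precision weights."""
    return sum(abs(w) for w in weights) / len(weights)


def sign(x):
    # Assumption: sign(0) is treated as +1 so binary weights never become 0.
    return 1.0 if x >= 0 else -1.0


def binarize(weights):
    """Eq. (1): w_b = sign(w~) * E(|w~|)."""
    scale = mean_abs(weights)
    return [sign(w) * scale for w in weights]


def ternarize(weights, ratio=0.7):
    """Eq. (2) with threshold w_thres = ratio * E(|w~|)."""
    thres = ratio * mean_abs(weights)
    kept = [abs(w) for w in weights if abs(w) > thres]
    # Assumption: E is the mean magnitude of the weights above the threshold.
    scale = sum(kept) / len(kept) if kept else 0.0
    return [sign(w) * scale if abs(w) > thres else 0.0 for w in weights]


# Illustrative full-precision weights (not taken from the chapter).
w = [0.9, -0.2, 0.05, -1.1, 0.4]
w_b = binarize(w)   # every entry has magnitude E(|w~|)
w_t = ternarize(w)  # small weights are zeroed, the rest share one magnitude
```

In this example the two small weights fall below the threshold and become zero, while the remaining three share a single magnitude, so each ternary weight needs only two bits plus one shared scale.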

In this analysis with AlexNet, we focus on the impact of (1) the quantized parameters of the convolutional (Conv) and fully connected (FC) layers, and (2) the quantized activations. We use mid-Conv to denote all the Conv layers except the first Conv layer and mid-FC to denote all the FC layers except the last FC layer. The naming rule of the proposed hybrid-precision networks is shown in Fig. 1. In Table 2, the 8-bit design (AlexNet-8-8888) only reduces the accuracy by 1.3% compared to the original float32 version. The accuracy is still promising after using ternary (AlexNet-8-8228) and binary (AlexNet-8-8218) parameters for the mid-Conv and mid-FC layers. This means that the network is relatively robust to the precision of parameters. In contrast, the precision of activations significantly impacts classification accuracy. Compared to AlexNet-8-8218, we observe 3.3% and 6.5% accuracy drops when activations move to 4 bits (AlexNet-4-8218) and 2 bits (AlexNet-2-8218), respectively. To further investigate, we disable the group function, which was originally proposed to handle the limited GPU memory issue. As a result, we observe an 80% computation increase and a 3.9% accuracy improvement in AlexNet-4-8218 (w/o g.). We further increase the channel number for the five Conv layers in AlexNet-4-8218 (ext.) and achieve a 1.3% accuracy gain at the cost of 61% extra computation compared to AlexNet-4-8218 (w/o g.). By increasing the model complexity, we can recover the accuracy. These observations are insightful for hybrid DNN quantization, as parameters can be quantized more aggressively than activations.

Fig. 1 Network representation when using hybrid quantization [12]. Example name: AlexNet-4-8218, read as network type, activation bitwidth, and then the weight bitwidths of the first-Conv, mid-Conv, mid-FC, and last-FC layers.

Table 2 Inference accuracy with hybrid quantization using ImageNet dataset [12]

| Network precision | Accuracy (Top-1) |
|---|---|
| AlexNet with float32 | 55.9% [67] |
| AlexNet-8-8888 | 54.6% |
| AlexNet-8-8228 | 53.3% |
| AlexNet-8-8218 | 52.6% |
| AlexNet-8-8118 | 51.1% |
| AlexNet-4-8218 | 49.3% |
| AlexNet-2-8218 | 46.1% |
| AlexNet-4-8218 (w/o g.) | 53.2% |
| AlexNet-4-8218 (ext.) | 54.5% |

3.1.2 Hardware Accelerator for ELB-NN

To handle ELB-NN, we propose a parameterized Computation Engine (CE) in [12] with flexible support of low-bitwidth quantization and configurable parallelism during execution. Figure 2 shows a four-input-parallel CE as an example, where four inputs can be processed simultaneously (including binary/ternary operations and accumulation), followed by batch normalization (BN) and the activation function. The precision of the accumulator is adjustable, which allows more flexible quantization designs while maintaining output accuracy. For a larger number of inputs, an adder tree is used before the accumulator for timing closure.
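A behavioral sketch in Python may help picture the CE dataflow. It assumes ternary weights encoded as -1, 0, and +1 (binary weights are the subset ±1), models BN as a single scale and shift, and uses ReLU as the activation. The function name `compute_engine` and all of its parameters are illustrative and are not taken from [12].

```python
def compute_engine(inputs, weights, parallelism=4, bn_scale=1.0, bn_shift=0.0):
    """Toy software model of a low-bitwidth computation engine.

    With weights restricted to {-1, 0, +1}, each multiply reduces to an
    add, a subtract, or a skip, so no multiplier is needed per input.
    """
    assert len(inputs) == len(weights)
    acc = 0
    # Process `parallelism` inputs per step, as a parallel CE would.
    for start in range(0, len(inputs), parallelism):
        partial = 0
        chunk = zip(inputs[start:start + parallelism],
                    weights[start:start + parallelism])
        for x, w in chunk:
            if w == 1:
                partial += x
            elif w == -1:
                partial -= x
            # w == 0 contributes nothing
        acc += partial  # accumulate the partial sum of this step
    y = bn_scale * acc + bn_shift  # simplified batch normalization
    return max(y, 0.0)             # ReLU, assumed here as the activation


result = compute_engine([1, 2, 3, 4, 5, 6, 7, 8], [1, -1, 0, 1, 1, 1, -1, 0])
```

The inner loop corresponds to the parallel binary/ternary logic, and the outer loop to the accumulator that collects partial sums over successive groups of inputs.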

Fig. 2 Computation engine (CE) with binary and ternary logic operations [12]


To demonstrate the hardware efficiency of ELB-NN, we adopt the accelerator with the proposed CE and accelerate different quantized versions of the AlexNet and VGG16 using an embedded platform called ZC706 (with an ARM CPU and a Xilinx Kintex-7 FPGA). Results are shown in Table 3. ELB-NN can achieve throughput performance up to 10.3 TOPS, which outperforms previous designs in [68–70].

3.2 The VecQ

Vectorized Quantization (VecQ) is a training quantization framework that is based on a novel quantization loss measurement metric called vector loss. It is proposed to provide flexible bitwidth support with minimal quantization loss to achieve higher model accuracy. In this subsection, we present the detailed definition of vector loss and the VecQ quantization framework.

3.2.1 Quantization with Vector Loss

We use the square of the Euclidean distance of the data before and after quantization to represent quantization loss. It is also called the squared 2-norm or L2 distance. Minimizing the L2 loss during model training has proved adequate for providing higher model accuracy [67, 71–74]. However, because optimizing the L2 loss under a constrained bitwidth is non-convex, the quantization easily falls into a sub-optimal solution space. In addition, adopting the L2 distance collects the loss of each quantized data point individually and neglects the distribution and correlations among these data points in a kernel or a layer. To address these limitations, in VecQ we first flatten the weight set $W_f(l)$ of a DNN layer into a vector $w_f(l)$ whose dimension equals the number of elements. For example, it is an $N \times M \times K^2$-dimensional vector for a CNN layer with $N$ input channels, $M$ output channels, and filter kernels of size $K$. The quantized weight vector is denoted as $w_q(l)$. A vector has two attributes, as shown in Fig. 3: orientation and modulus. We define a quantization angle as the intersection angle between the original weight vector and the quantized vector. The vector distance between the weight vector before and after quantization is therefore determined by the quantization angle and the vector’s modulus. We then define a vector loss, denoted as $J_v$, and compose it of the orientation loss $J_o$ and the modulus loss $J_m$:

$$J_v = J(w_f, w_q) = J_o(w_f, w_q) + J_m(w_f, w_q) \tag{3}$$

where $J_o$ and $J_m$ are computed as:

Table 3 ELB-NN performance evaluated on an embedded platform (Xilinx ZC706) [12]

| Network | LUT | FF | BRAM | DSP | Batch size | Bandwidth (GBytes/s) | Complexity (GOP) | Images/s | Perf. (TOPS) |
|---|---|---|---|---|---|---|---|---|---|
| AlexNet-8-8888 | 86262 (39%) | 51387 (12%) | 303 (56%) | 808 (90%) | 2 | 10.8 | 1.45 | 340 | 0.493 |
| AlexNet-8-8218 | 103505 (47%) | 90125 (21%) | 498 (91%) | 550 (61%) | 5 | 3.35 | 1.45 | 856.1 | 1.24 |
| AlexNet-4-8218 | 105673 (48%) | 94149 (22%) | 463 (85%) | 880 (98%) | 8 | 3.35 | 1.45 | 1369.6 | 1.99 |
| AlexNet-4-8218 (w/o g.) | 127393 (58%) | 105328 (24%) | 435 (80%) | 839 (93%) | 7 | 4.30 | 2.61 | 1198.5 | 2.59 |
| AlexNet-4-8218 (ext.) | 124317 (57%) | 101558 (23%) | 481 (88%) | 783 (87%) | 7 | 3.4 | 4.22 | 599.2 | 2.53 |
| VGG16-4-8218 | 112992 (52%) | 99396 (23%) | 509 (93%) | 298 (33%) | 2 | 5.85 | 31.0 | 110.7 | 3.43 |
| VGG16-2-8118 | 137973 (63%) | 113262 (26%) | 499 (92%) | 651 (72%) | 3 | 6.67 | 31.0 | 332.2 | 10.3 |

The LUT, FF, BRAM, and DSP columns report resource utilization.


Fig. 3 The attributes of a vector and the quantization angle of wf and wq

$$\begin{aligned} J_o &= 1 - \cos\theta, \quad \left(\cos\theta = \frac{\alpha w_q \cdot w_f}{|\alpha w_q|\,|w_f|}\right) \\ &= 1 - e_v \cdot e_{w_f} \\ &= 1 - \sum_{i=1}^{d} \left(e_{v_i}\, e_{w_{f_i}}\right) \\ J_m &= \|w_f - \alpha w_q\|_2^2 \end{aligned} \tag{4}$$

Here, $e_v$ and $e_{w_f}$ represent the unit vectors for $v$ and $w_f$, and $w_f$ is the weight vector of a DNN layer containing $d$ weights. With these definitions, the orientation loss $J_o$ indicates the optimized quantization angle, and the modulus loss $J_m$ indicates the optimized scale at this angle. Therefore, our quantization takes two stages to minimize the two losses independently, which are defined as the steering stage and the driving stage, as shown in Fig. 4. In the steering stage, we adjust the orientation of the weight vector to minimize the orientation loss. Then, in the driving stage, we fix the orientation and only scale the modulus of the vector to minimize the modulus loss.
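The two loss terms of Eqs. (3) and (4) can be evaluated with a short Python sketch. The closed-form scale used in `best_scale` is the ordinary least-squares solution for a fixed direction; it is our assumption for illustration and is not claimed to be the exact driving-stage procedure of VecQ. The example vectors are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def orientation_loss(w_f, w_q):
    """J_o = 1 - cos(theta), theta being the angle between w_f and w_q."""
    return 1.0 - dot(w_f, w_q) / (norm(w_f) * norm(w_q))


def best_scale(w_f, w_q):
    """Scale alpha minimizing ||w_f - alpha * w_q||^2 for a fixed w_q.

    Assumption: plain least squares, alpha = (w_f . w_q) / (w_q . w_q).
    """
    return dot(w_f, w_q) / dot(w_q, w_q)


def modulus_loss(w_f, w_q, alpha):
    """J_m = ||w_f - alpha * w_q||_2^2."""
    return sum((f - alpha * q) ** 2 for f, q in zip(w_f, w_q))


# Hypothetical full-precision weights and a ternary direction for them.
w_f = [0.9, -0.2, 0.05, -1.1, 0.4]
w_q = [1, 0, 0, -1, 1]

j_o = orientation_loss(w_f, w_q)      # depends only on the direction of w_q
alpha = best_scale(w_f, w_q)          # scale chosen after fixing the direction
j_m = modulus_loss(w_f, w_q, alpha)
j_v = j_o + j_m                       # Eq. (3)
```

Because cos θ is unchanged when w_q is multiplied by any positive α, the orientation loss can be minimized first and the scale chosen afterwards, which is what makes the two-stage split possible.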

3.2.2 Framework Integration

The VecQ quantization is integrated into the DNN training flow for both the weight data and the activation data, as shown in Fig. 5. For weight data, taking a layer $l$ as an example, during the forward propagation the weights $w_f(l)$ represented in floating point are quantized into $w_q(l)$, and the quantized weights are then used to compute the output of this layer. To simplify the computing process, the weights are treated as normally distributed, and an interval λ is used according to the given bitwidth constraint. During the backward propagation, the gradient is calculated with $w_q(l)$ instead of $w_f(l)$ and propagated. In the final update process, the gradient $g(l)$ of $w_q(l)$ is applied to update $w_f(l)$ [67].

For the activation output of a layer, during training we compute a distribution parameter of the activation outputs, $p(t)$, and update it with an exponential moving average. During inference, the distribution parameter is employed as a linear factor to the activation function [75]. $A(l)$ is the activation output of layer $l$, and Activation(·) is the non-linear activation function following the convolution or fully connected layers, such as Sigmoid, Tanh, or ReLU.

Fig. 4 The overall flow of the quantization process, including both the steering and driving stages [14]

Fig. 5 Integrated quantization process in DNN training [14]

Fig. 6 Comparison with state-of-the-art solutions

We evaluate VecQ on the image classification task with popular models and compare the results to state-of-the-art quantization solutions with the same DNN model and bitwidth configurations. The state-of-the-art quantization solutions include BWN [11], TWN [66], TTQ [76], TSQ [77], INQ [78], and ENN [79]. Note that not all of these quantization solutions provide bitwidth support from 1 to 8 bits. As shown in Fig. 6, VecQ outperforms most of the solutions with the same bitwidth configurations, and it provides a wider range of bitwidth coverage as well. It only loses the advantage when compared with solutions specifically designed for binary weights.
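The weight and activation handling described in this subsection can be sketched in a few lines of Python. The quantizer `quantize_uniform` is a hypothetical stand-in (simple rounding and clipping), not the VecQ steering and driving procedure; the class and function names, the learning rate, and the EMA momentum are likewise assumptions made for illustration.

```python
def quantize_uniform(w, step=0.25, levels=3):
    """Hypothetical stand-in quantizer: round to a multiple of `step`, then clip."""
    q = round(w / step)
    q = max(-levels, min(levels, q))
    return q * step


class QuantizedLayerWeights:
    """Keeps full-precision weights w_f and exposes quantized weights w_q."""

    def __init__(self, w_f, lr=0.1):
        self.w_f = list(w_f)  # full-precision master copy
        self.lr = lr

    def forward_weights(self):
        # Forward pass: the layer output is computed with w_q, not w_f.
        return [quantize_uniform(w) for w in self.w_f]

    def update(self, grad_wrt_w_q):
        # Backward pass: the gradient computed with w_q is applied to w_f
        # (an approach often called a straight-through estimator).
        self.w_f = [w - self.lr * g for w, g in zip(self.w_f, grad_wrt_w_q)]


def ema_update(p_prev, p_batch, momentum=0.9):
    """Exponential moving average of an activation distribution parameter."""
    return momentum * p_prev + (1.0 - momentum) * p_batch


layer = QuantizedLayerWeights([0.30, -0.62])
w_q = layer.forward_weights()   # quantized weights used in the forward pass
layer.update([1.0, -1.0])       # illustrative gradients with respect to w_q
p = ema_update(1.0, 2.0)        # running activation statistic after one batch
```

Keeping a full-precision copy lets many small gradient steps accumulate before they change the quantized value, which a purely quantized weight could not represent.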

4 Efficient Accelerator Design and Workload Mapping

As discussed before, there exists an ever-widening barrier between fast DNN model design in software and slow hardware accelerator implementation. To bridge the hardware-software gap, in this section, we introduce DNNBuilder [30] and PyLog [38] to provide efficient solutions for automatically generating high-performance hardware accelerators for DNN workload deployments.

4.1 DNNBuilder

DNNBuilder is an end-to-end automation framework that can transform DNN designs from popular deep learning frameworks to highly optimized hardware deployment on customized accelerators implemented on FPGAs. Users are no longer required to design and optimize accelerators manually but can use the auto-generated hardware accelerators for desired AI workloads. DNNBuilder introduces two major architecture innovations: the fine-grained layer-based pipeline architecture and the column-based cache scheme, which achieve 7.7× and 43× reduction of latency and on-chip memory usage, respectively. This subsection presents the novel designs introduced by DNNBuilder and showcases its promising edge AI performance.

Fig. 7 The end-to-end design flow introduced by DNNBuilder [30]

4.1.1 An End-to-end Automation Flow

DNNBuilder produces customized DNN accelerators in three steps: Design, Generation, and Execution (Fig. 7).

During the Design step, a DNN is designed and trained using deep learning frameworks, which in general employ CPUs and GPUs. After training, network definition files and trained parameters are passed to the next step. To preserve the design freedom specified by users, the proposed flow supports hybrid quantization schemes, where different quantization schemes can be applied to the parameters and activations of different network layers, to explore tradeoffs among inference accuracy, resource utilization, performance, etc. One important feature of this step is the feedback function that provides hardware metrics estimation. If the current DNN runs slower or consumes more resources than expected, users can update their network designs, for example by adjusting quantization schemes or modifying network layers, to meet performance and resource requirements. This function also makes hardware-software co-design possible.

In the Generation step, network parsing is launched to decompose the input models. Different network layers, e.g., Conv, Pooling, and FC layers, are decomposed and then mapped to our pre-built RTL IPs, which are the basic building blocks of the generated accelerator. The computationally intensive nested loops are captured by parameterized compute engines. Then, automated optimization explores the hardware design space and provides configuration guidelines so that the generated accelerator can achieve maximum performance. Following these guidelines, network construction builds the DNN implementation with the pre-built RTL IPs, dataflow controller, and memory instances, which are highly configurable to ensure adaptability and scalability for various DNNs. After that, code generation produces the accelerator-related files for FPGA-based instances.

In the Execution step, the DNN accelerator is instantiated in the FPGA with unified interfaces, including a FIFO-like data input/output interface and a weight access interface connecting the off-chip memory controller. In this final step, the DNN accelerator is ready for eventual deployment.
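One way to picture the design space exploration in the Generation step is as a resource allocation problem across pipeline stages. The sketch below is a simple greedy heuristic written for illustration; it is not DNNBuilder's actual DSE algorithm, and the function name, the per-layer operation counts, and the notion of a generic "compute unit" budget are all assumptions.

```python
def balance_pipeline(layer_ops, total_units):
    """Greedily distribute compute units so that stage latencies even out.

    Every layer (pipeline stage) starts with one unit; each remaining unit
    goes to the stage that is currently the slowest, because the slowest
    stage bounds the throughput of a layer-based pipeline.
    """
    units = [1] * len(layer_ops)
    for _ in range(total_units - len(layer_ops)):
        latency = [ops / u for ops, u in zip(layer_ops, units)]
        slowest = latency.index(max(latency))
        units[slowest] += 1
    return units


# Hypothetical per-layer workloads (arbitrary operation counts).
layer_ops = [100, 400, 200, 50]
units = balance_pipeline(layer_ops, total_units=15)
stage_latency = [ops / u for ops, u in zip(layer_ops, units)]
```

With this budget the heuristic ends up giving each stage a share of units proportional to its workload, so all four stages finish in the same time and none of them stalls the others.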

4.1.2 Architecture Novelties

We propose a fine-grained layer-based pipeline to deliver high-throughput performance and promising real-time response. Each major neural network layer in the targeted DNN model, such as a Conv or FC layer, is handled by one pipeline stage, as major layers dominate computation and memory consumption. The remaining layers, such as batch normalization (BN), scale, and activation layers, are aggregated into their neighboring major layers so that we reduce the number of pipeline stages for lower latency. In addition, DNNBuilder enables pipeline stage overlapping to overcome the long initial latency, which is frequently encountered by conventional pipelines. We demonstrate the proposed fine-grained layer-based pipeline by accelerating an object detection DNN model called YOLO [80] and show the results in Fig. 8. DNNBuilder can effectively hide the data transmission delay and generate outputs even when the first input frame is still loading. It helps achieve a 7.7× smaller startup latency (9.92 ms) compared to the conventional pipeline design (411.99 ms).

Fig. 8 Latency comparison between the proposed fine-grained (left) and conventional (right) pipeline when handling the same object detection DNN model with a ZC706 embedded FPGA [30]

The other novel design is the column-based cache scheme, which reduces on-chip memory utilization during DNN inference and supports high-definition image input for resource-constrained embedded systems. In the pipeline architecture, intermediate results between pipeline stages are stored on-chip to guarantee seamless pipeline operations. However, feature maps can be enormous when real-life inputs become large, and it becomes impossible to hold them on-chip entirely. The column-based cache scheme is designed to address this problem, as it only keeps a subset of the input feature map on chip. Figure 9 shows an example in which DNNBuilder processes a convolution layer (with kernel size = 3 and stride = 1). Since slices 1–3 contribute to the first sliding window operation (from top to bottom), we name the first three slices column 1. Similarly, column 2 represents the data for the second sliding window operation, so slices 2–4 constitute column 2. DNNBuilder caches at least two columns before starting computation, which allows the kernel to perform the second vertical sliding window operation immediately after finishing the first one. Caching one more column in this way prevents delays caused by data shortage. Meanwhile, slice 5 starts buffering to form the next column (with slices 3–5) after the room taken by slice 1 is released. When serving the same object detection AI model (YOLO with high-definition inputs), the proposed column-based cache reduces on-chip memory usage by 43× compared to the accelerator without this technology [30].

Fig. 9 The proposed column-based cache scheme [30]
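The slice and column bookkeeping of the column-based cache can be simulated with a small Python sketch. It is a behavioral model under our own assumptions (slices numbered from 1, a buffer sized for two overlapping columns, i.e., kernel + stride slices); the function name and the return format are illustrative and do not come from [30].

```python
from collections import deque


def column_cache_schedule(num_slices, kernel=3, stride=1):
    """Simulate which slices are on chip while columns are processed.

    Returns the list of processed columns (tuples of slice indices) and
    the peak number of slices buffered at any time.
    """
    capacity = kernel + stride  # room for two overlapping columns
    buffer = deque()
    next_slice = 1              # next slice to load from off-chip memory
    first = 1                   # first slice of the column being computed
    schedule = []
    peak = 0
    while first + kernel - 1 <= num_slices:
        # Load upcoming slices into whatever buffer space is free.
        while len(buffer) < capacity and next_slice <= num_slices:
            buffer.append(next_slice)
            next_slice += 1
        peak = max(peak, len(buffer))
        column = tuple(range(first, first + kernel))
        assert all(s in buffer for s in column)  # data is ready, no stall
        schedule.append(column)
        # Release the slices that later columns no longer need.
        for _ in range(stride):
            buffer.popleft()
        first += stride
    return schedule, peak


schedule, peak = column_cache_schedule(num_slices=6)
```

For kernel size 3 and stride 1 the model never holds more than four slices, regardless of how many slices the feature map has, which is the source of the on-chip memory saving for wide inputs.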

Table 4 Comparison with existing embedded FPGA-based DNN accelerators [30]

| Reference | [29] | [81] | DNNBuilder |
|---|---|---|---|
| FPGA chip | Zynq XC7Z045 | Zynq XC7Z045 | Zynq XC7Z045 |
| Frequency | 150 MHz | 100 MHz | 200 MHz |
| Network | VGG | VGG | VGG |
| Precision | Fix16 | Fix16 | Fix16 (Fix8) |
| DSPs (used/total) | 780/900 | 824/900 | 680/900 |
| DSP Efficiency | 44.0% | 69.6% | 96.2% |
| Performance (GOPS) | 137 | 230 | 262 (524) |
| Power Efficiency (GOPS/W) | 14.2 | 24.4 | 36.4 (72.8) |

Table 5 AlexNet inference comparison on embedded GPU and FPGA platforms [30]

| Platform | Precision | Batch | Throughput (img./s) | Power (W) | Efficiency (img./s/W) |
|---|---|---|---|---|---|
| DNNBuilder (ZC706) | Fix16, Fix8 | 1, 2 | 170, 340 | 7.2 | 23.6, 47.2 |
| GPU-TX2 [82] | Float16 | 2 | 250 | 10.7 | 23.3 |

4.1.3 State-of-the-art Performance

We demonstrate our design by accelerating popular AI workloads on an embedded platform (ZC706). As shown in Table 4, our DNNBuilder-generated design reaches the best performance (524 and 262 GOPS in Fix8 and Fix16 quantization schemes) and power efficiency (72.8 GOPS/Watt in Fix8 and 36.4 GOPS/Watt in Fix16). We also extend our comparison to the embedded GPU (TX2) in Table 5. The DNNBuilder-generated design can deliver higher efficiency than the TX2-based solution even without using batch processing (batch size = 1), and it can achieve up to 47.2 image/Second/Watt.
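The efficiency figures quoted from Table 5 follow from dividing throughput by power. The short Python check below uses only the throughput and power values listed in Table 5; the dictionary keys are descriptive labels of our own.

```python
def images_per_second_per_watt(throughput, power_w):
    """Energy efficiency as throughput (images/s) divided by power (W)."""
    return throughput / power_w


# (throughput in images/s, power in W), as listed in Table 5.
rows = {
    "DNNBuilder Fix16 (batch 1)": (170, 7.2),
    "DNNBuilder Fix8 (batch 2)": (340, 7.2),
    "GPU-TX2 Float16 (batch 2)": (250, 10.7),
}

efficiency = {
    name: images_per_second_per_watt(throughput, power)
    for name, (throughput, power) in rows.items()
}
```

The computed values agree with the Efficiency column of Table 5 to within about 0.1 img./s/W.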

4.2 PyLog: A Python-Based FPGA Programming Flow

The fast-growing complexity of new applications and new use scenarios poses serious challenges for computing systems. Embedded hardware accelerator systems have demonstrated great flexibility, performance, and efficiency in many different applications and scenarios. However, as system complexity and application complexity grow rapidly, programming and optimizing embedded accelerator systems require great manual effort and consume a lot of time. Compiling and optimizing a general application specified in a high-level language like Python are becoming common tasks in creating embedded accelerator designs. High-level synthesis (HLS) transforms design inputs written in high-level languages (e.g., C++, OpenCL, Python) to hardware descriptions in RTL (Register-Transfer Level) languages such as Verilog. HLS offers up to 10× code reduction and 1000× simulation time reduction over manual RTL design solutions. HLS has been intensively studied in the past three decades [33, 34, 83–97], and there are popular commercial HLS tools used by many designers [98, 99].

4.2.1 PyLog Flow Overview

PyLog [38] is a Python-based high-level programming flow for FPGAs. It allows users to create FPGA accelerators with Python code using PyLog high-level operators and Python syntax. PyLog presents a unified programming model for host and accelerator logic with consistent Python-based syntax and semantics. This seamless host-accelerator programming model enables agile system design, convenient functional simulation, and flexible design space exploration.

Figure 10 shows the overall PyLog flow at a high level. The PyLog flow allows users to create efficient FPGA accelerators and program the host system with Python. The input to the PyLog flow is Python code, where the FPGA kernel function is decorated with the @pylog decorator. The PyLog flow contains an accelerator synthesis flow and a runtime flow. In the accelerator synthesis flow, the @pylog decorator calls the PyLog compiler to compile the kernel function into optimized high-level synthesis (HLS) C code, which is then compiled into efficient FPGA IPs with the HLS flow and integrated into a complete accelerator system by the PyLog system generator. Besides the PyLog kernel function, the rest of the PyLog program is interpreted by the standard Python interpreter running on the host CPU side, which supports all Python libraries and language features. This part of the PyLog program naturally becomes the host program of the whole accelerator system. After the accelerator system is generated by the synthesis flow, the system can be deployed on the target FPGA platform using the generated FPGA bitstream configuration file, and it runs with support from the PyLog runtime flow. During runtime, the PyLog runtime can prepare input data, launch the accelerator, and collect results according to the host code. Interactions between the host CPU and the FPGA accelerator are handled automatically by the PyLog runtime and the underlying Xilinx PYNQ library [100].

Fig. 10 The PyLog Flow and Example System Architecture [38]

Table 6 PyLog Supported Language Features [38]

| Category | Operators |
|---|---|
| PyLog high-level operators | map, dot, user-defined ops |
| NumPy operators | argmax, argmin, max, min, matmul, convolve, sort |
| Python features | list, functions, calls, lambda, for, while, if...else..., slice, subscript, attribute, bin_op, unary_op, return |
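The split between a decorated kernel function and ordinary host code can be illustrated with a plain Python decorator. The sketch below is a hypothetical stand-in named `toy_kernel`; it is not the PyLog decorator or its API, it does not generate any hardware, and the registry and wrapper are assumptions that only mimic the roles of the synthesis and runtime flows.

```python
import functools
import inspect

KERNEL_REGISTRY = {}  # hypothetical record of functions marked as kernels


def toy_kernel(func):
    """Mark `func` as a kernel, similar in spirit to a kernel decorator."""
    try:
        # A real flow would hand the kernel source to its compiler here.
        source = inspect.getsource(func)
    except (OSError, TypeError):
        source = None  # source text is unavailable in some environments
    KERNEL_REGISTRY[func.__name__] = source

    @functools.wraps(func)
    def launch(*args):
        # Stand-in for the runtime: prepare inputs, launch, collect results.
        inputs = [list(a) for a in args]
        return func(*inputs)  # here the "accelerator" is simply the host CPU

    return launch


@toy_kernel
def vec_add(a, b):
    return [x + y for x, y in zip(a, b)]


# Everything outside the decorated function is ordinary host code.
result = vec_add((1, 2, 3), (10, 20, 30))
```

The point of the pattern is that the decorator is the single place where kernel code is intercepted, so the rest of the program remains standard Python that any interpreter can run.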

4.2.2 PyLog Features

PyLog has several unique features that help users to create FPGA accelerators more efficiently.

(i) High-Level Operators. In addition to commonly used Python standard language features, PyLog also supports several built-in high-level operators and NumPy operators that allow users to express computation patterns at a high level and enable better compiler optimizations. Table 6 summarizes the language features supported in PyLog, including PyLog high-level operators, NumPy operators, and standard Python features. Listing 1 demonstrates a few example usages of the PyLog map and dot operators.

```python
# Vector add
out = map(lambda x, y: x + y, vec_a, vec_b)

# 1D convolution
out = map(lambda x: w0*x[-1] + w1*x[0] + w2*x[1], vec)

# Inner product
out_vec[i] = dot(matrix[i,:], in_vec)

# Square matrix multiplication
out = map(lambda x, y: dot(x[0,:], y[:,0]), ma, mb)
```

Listing 1 PyLog map and dot examples [38]
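For readers unfamiliar with these operators, the computations in Listing 1 can be written out as plain Python reference functions. This sketch rests on one assumption about the listing: in the 1D convolution, the subscripts x[-1], x[0], and x[1] are read as offsets relative to the current element, and boundary elements without both neighbors are skipped. The function names are ours.

```python
def vector_add(vec_a, vec_b):
    """Element-wise sum, as in the 'Vector add' example."""
    return [x + y for x, y in zip(vec_a, vec_b)]


def conv1d(vec, w0, w1, w2):
    """Three-tap 1D convolution over interior elements.

    Assumption: offsets -1, 0, +1 are relative to the current element.
    """
    return [w0 * vec[i - 1] + w1 * vec[i] + w2 * vec[i + 1]
            for i in range(1, len(vec) - 1)]


def inner_product(row, in_vec):
    """Dot product of one matrix row with a vector."""
    return sum(a * b for a, b in zip(row, in_vec))


def matmul(ma, mb):
    """Matrix product: each output is a row of ma dotted with a column of mb."""
    columns = list(zip(*mb))
    return [[inner_product(row, col) for col in columns] for row in ma]


added = vector_add([1, 2, 3], [10, 20, 30])
filtered = conv1d([1, 2, 3, 4], w0=1, w1=0, w2=-1)
product = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

Each reference function needs an explicit loop or comprehension, whereas the corresponding PyLog line states only the per-element computation and leaves the iteration structure to the compiler.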

Fig. 11 Different implementations generated from the same PyLog code [38]. The figure shows a PyLog high-level operation (y = map(lambda a, b: dot(a[0,:], b[:,0]), mat_a, mat_b)), several HLS C loop-nest implementations derived from it (one of them with #pragma HLS unroll), and the corresponding hardware implementations built from arrays of PEs.

These operators not only simplify programming for users, but they also pass more information on computation to the compiler (e.g., computation patterns, data flow information, data/task parallelism, etc.) than programming in C/C++ does, and thus allow compilers to perform more optimizations and choose the optimal code generation. Figure 11 shows an example of generating multiple hardware implementations from a PyLog map operation. The compiler generates HLS C implementations in different styles, which correspond to different hardware structures, e.g., shift registers, systolic arrays, etc. Depending on the context and constraints, the optimal implementation will be chosen.

(ii) Type Inference and Type Checking. Python is a dynamically typed language, and there is no explicit type declaration in the Python code. PyLog has a built-in type inference and type checking engine that can infer the type and shape information of code objects in the PyLog kernel functions. This type inference engine is critical in PyLog since the same operators may have completely different meanings when applied to operands with different types or shapes. With this type inference engine, PyLog users do not need to provide explicit type annotations or hints in a PyLog program.

(iii) Compiler Optimizations. PyLog provides a set of compiler optimizations that improve the design quality of generated accelerators. PyLog uses its own PyLog intermediate representation (PLIR) as the internal representation of the input code. PyLog code analysis and transformation passes work on PLIR to perform a sequence of optimizations including high-level operator lowering, loop transformation, HLS pragma insertion, etc. The internal PLIR is capable of expressing different design options and can therefore form a design space that covers not only low-level design parameter tuning but also high-level design pattern selection, which has not been explored in previous tools.

Fig. 12 Length of HLS C code and PyLog code [38]
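To make "operator lowering" and "HLS pragma insertion" more concrete, the toy Python function below turns an element-wise expression into the text of an HLS C loop. It is an illustration written for this section, not PyLog's compiler or PLIR; the function name, the array names in the emitted code, and the choice of the unroll pragma are assumptions.

```python
def lower_map_to_hls_c(expr, length, unroll=1):
    """Emit HLS C text for an element-wise map over `length` elements.

    `expr` is the loop body expression written in terms of the index `i`,
    e.g. "a[i] + b[i]". An unroll pragma is inserted when unroll > 1.
    """
    lines = [f"for (int i = 0; i < {length}; i++) {{"]
    if unroll > 1:
        # Pragma insertion: ask the HLS tool to replicate the loop body.
        lines.append(f"#pragma HLS unroll factor={unroll}")
    lines.append(f"    out[i] = {expr};")
    lines.append("}")
    return "\n".join(lines)


# The same high-level operation lowered with two different design options.
sequential = lower_map_to_hls_c("a[i] + b[i]", length=1024)
unrolled = lower_map_to_hls_c("a[i] + b[i]", length=1024, unroll=4)
```

The two outputs differ only in the pragma, yet they describe different hardware (one adder reused over time versus four adders in parallel), which is the kind of design choice a compiler can explore when the source states the pattern rather than the loop.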

4.2.3 PyLog Evaluation Results

We evaluate PyLog in terms of expressiveness and accelerator performance, using real-world applications.

(i) Expressiveness. We evaluated the expressiveness of PyLog by comparing the number of lines of code needed to implement a benchmark using PyLog and HLS C. Figure 12 shows the evaluation results. For the benchmarks evaluated, on average PyLog only needs around 30% of the code length of HLS C. This indicates that PyLog provides good expressiveness compared with HLS C and allows users to describe their computation with fewer lines of code.

(ii) Accelerator Performance. We evaluated the performance of the accelerators generated by PyLog using real-world benchmarks. Evaluation was done on an Amazon EC2 F1 (f1.2xlarge) instance, a cloud computing platform that has an 8-core Intel Xeon E5-2686 v4 CPU and a Xilinx Virtex UltraScale+ XCVU9P FPGA. The evaluated benchmarks are from different domains and have various computation patterns. They are representative FPGA workloads, including linear algebra, data analytics, stencil, sparse operations, etc. Table 7 shows the evaluation results. The table lists the FPGA resource utilization as well as the accelerator execution time. We compared the PyLog accelerator execution time against the optimized CPU time as well as the execution time of accelerators generated by HeteroCL [101]. On average, PyLog accelerators achieve around 3.17× and 1.24× speedup over CPU baseline and manually optimized accelerators [38].

Table 7 Accelerator Performance Evaluation on AWS F1 Instance

| Benchmark | LUT | FF | BRAM | DSP | f (MHz) | P (W) | T_CPU | T_HCL | T_PyLog | T_CPU/T_PyLog | T_HCL/T_PyLog |
|---|---|---|---|---|---|---|---|---|---|---|---|
| KNN | 109,276 | 74,889 | 425 | 0 | 256.40 | 37.222 | 0.48 | 0.45 | 0.26 | 1.85 | 1.73 |
| K-means | 10,829 | 17,604 | 3 | 7 | 273.97 | 37.429 | 38.16 | 4.24 | 4.45 | 8.58 | 0.95 |
| Jacobi [102] | 93,925 | 111,144 | 96 | 704 | 269.03 | 37.327 | 11.31 | 8.25 | 5.19 | 2.18 | 1.59 |
| Seidel [102] | 47,304 | 57,854 | 30 | 304 | 269.03 | 37.341 | 21.37 | 8.22 | 5.16 | 4.14 | 1.59 |
| Gaussian [102] | 56,580 | 75,846 | 48 | 688 | 147.15 | 37.783 | 23.63 | 7.34 | 5.19 | 4.55 | 1.41 |
| GEMM | 12,868 | 63,759 | 655 | 1024 | 250.00 | 39.657 | 60.34 | 8.13 | 13.05 | 4.62 | 0.62 |
| SpMV | 8294 | 12,787 | 25 | 21 | 273.97 | 37.225 | 0.29 | – | 0.24 | 1.21 | – |
| Histogram [103] | 4096 | 7647 | 13 | 0 | 273.97 | 37.327 | 5.85 | – | 2.07 | 2.83 | – |
| Geometric mean | | | | | | | | | | 3.17 | 1.24 |

T_CPU: execution time on CPU; T_HCL: execution time on HeteroCL [101] generated accelerator; T_PyLog: execution time on PyLog generated accelerator. All time values are in milliseconds (ms); ‘–’ means the implementation is not publicly available.
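The average speedups reported for Table 7 are geometric means of the per-benchmark ratios. The short Python check below recomputes them from the two ratio columns of the table; only the function and variable names are ours.

```python
import math


def geometric_mean(values):
    """Geometric mean, computed through logarithms for numerical stability."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Per-benchmark speedups as listed in Table 7.
cpu_over_pylog = [1.85, 8.58, 2.18, 4.14, 4.55, 4.62, 1.21, 2.83]
hcl_over_pylog = [1.73, 0.95, 1.59, 1.59, 1.41, 0.62]  # SpMV, Histogram: n/a

gm_cpu = geometric_mean(cpu_over_pylog)
gm_hcl = geometric_mean(hcl_over_pylog)
```

A geometric mean is the usual choice for averaging ratios, since a 2× speedup and a 2× slowdown then cancel out, which an arithmetic mean would not reflect.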


5 Efficient Optimizations

Among the many efficient optimization techniques available, this section introduces three key ones: hardware-aware NAS, FPGA/DNN co-design [45], and a unified differentiable co-design approach across different platforms [58].

5.1 Overview of Hardware-aware Neural Architecture Search (NAS)

Neural Architecture Search (NAS) refers to the automated process of neural architectural design [114]. It has been largely successful in producing many state-of-the-art networks. Typically, a NAS process requires three distinct components, as shown in Table 8:

1. Search space. A search space includes all possible network architectures that may follow a predefined template. For example, the networks can be sequential layer-wise architectures [115, 116], cell-based architectures [54], and hierarchical architectures [56]. Hardware parameters can also be included in the search space for HW-aware NAS.
2. Search algorithm. Given the prohibitively large search space, the search algorithm can greatly influence the efficiency of the search and the effectiveness of the final network architecture. Popular search algorithms include evolutionary algorithms, reinforcement learning, gradient-based methods, etc.
3. Network evaluation. Network evaluation is the key to efficient NAS, since fast evaluation is required to estimate the quality of individual networks and to guide the search algorithm toward top-performing architectures in the search space. Network evaluation can be prohibitively expensive due to network training, so various approximation approaches have been proposed to expedite the evaluation, such as few-shot and one-shot training [117, 118], weight-sharing based approaches [110, 113, 115], and proxy tasks [112].

Table 8 A brief overview of Neural Architecture Search components and example algorithms

Search space
  Architecture search space: NASNet [54], DARTS [104], FBNet [55], ProxylessNAS [105], FlexiBERT [106]. . .
  Hardware search space: Quantization, sparsification, tiling parameters, number of PEs, other HW specific parameters. . .
Search algorithm
  Reinforcement learning, evolutionary algorithm, random search, Bayesian optimization, di. . .
Network evaluation
  Weight-sharing based: DARTS, Random sampling [107], SNAS [108], EDD [58], ProxylessNAS, OFA [109]. . .
  Few/one-shot training: Early stopping, NAO [110], NASWOT [111], Synflow [112], GenNAS [113]. . .
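The three components can be illustrated with a toy random-search loop. This is a minimal sketch rather than any published NAS method: the search space `SEARCH_SPACE`, the `proxy_score` function standing in for network evaluation, and all numeric constants are hypothetical.

```python
import random

# Hypothetical search space: one choice per dimension defines an architecture.
SEARCH_SPACE = {
    "depth": [4, 8, 12],
    "width": [16, 32, 64],
    "kernel": [3, 5],
}


def sample_architecture(rng):
    """Search algorithm (plain random search): draw one candidate."""
    return {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}


def proxy_score(arch):
    """Stand-in for network evaluation.

    A real flow would train (or partially train) the network here; this
    made-up formula only rewards capacity and penalizes compute cost.
    """
    capacity = arch["depth"] * arch["width"]
    cost = capacity * arch["kernel"] ** 2
    return capacity / 100.0 - cost / 5000.0


def random_search(num_trials, seed=0):
    """Keep the best-scoring architecture seen over num_trials samples."""
    rng = random.Random(seed)
    best_arch, best_score = None, float("-inf")
    for _ in range(num_trials):
        arch = sample_architecture(rng)
        score = proxy_score(arch)
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score


best_arch, best_score = random_search(num_trials=50)
```

Swapping `sample_architecture` for an evolutionary or gradient-based proposal step, or `proxy_score` for a weight-sharing estimate, changes one component while leaving the other two untouched.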

5.2 HW-Aware NAS Formulation

In recent years, driven by the need to deploy power-hungry DNNs onto resource-constrained devices, hardware-aware NAS (HW-NAS) has emerged as one of the most promising techniques [119]. There is a great amount of hardware-aware work, each of which often adopts a specific hardware device (CPU, GPU, embedded/mobile device) and requires a different hardware-cost metric (e.g., prioritizes latency or energy). For example, FBNet [55] develops a differentiable neural architecture search (DNAS) framework and discovers state-of-the-art DNNs balancing both accuracy and hardware efficiency, by incorporating a loss consisting of both the cross-entropy loss that leads to better accuracy and the latency loss that penalizes the network’s latency on a target device. To provide more integrated co-optimization solutions, EDD [58] fuses the design space of DNN architecture and hardware accelerator and formulates the DNN and hardware design as a co-search problem. EDD aims to discover the most suitable combination of DNN and hardware within the co-search space and maximize software and hardware metrics given the targeted edge AI application. Once for All (OFA) [109] is the first work that proposes an elastic training scheme for the supernet. Once the supernet is trained, high-accuracy architectures are obtained directly by selecting from the OFA network without additional training.

One of the classic search methods for HW-NAS is to first define a template-based search space, and then incorporate hardware performance into the loss function:

$$\mathcal{L} = \mathcal{L}_T + \mathcal{L}_{HW} \quad \text{or} \quad \mathcal{L} = \mathcal{L}_T \cdot \mathcal{L}_{HW} \qquad (5)$$

where $\mathcal{L}_T$ is the task-specific loss of NAS, such as cross-entropy loss for classification tasks or mean squared error (MSE) loss for regression tasks. $\mathcal{L}_{HW}$ is the hardware performance loss, such as measured or estimated execution latency of the network architectures on the target device.
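The two forms of Eq. (5) can rank candidates differently, which a few lines of arithmetic make concrete. The sketch below is illustrative only: the candidate names and loss values are made up, and the hardware loss is assumed to be already normalized to a unitless number.

```python
def hw_aware_loss(task_loss, hw_loss, mode="sum"):
    """Combine the task loss and the hardware loss as in Eq. (5)."""
    if mode == "sum":
        return task_loss + hw_loss
    if mode == "product":
        return task_loss * hw_loss
    raise ValueError(f"unknown mode: {mode!r}")


# Hypothetical candidates: (name, task loss, normalized hardware loss).
candidates = [
    ("net_a", 0.50, 0.25),
    ("net_b", 0.75, 1.00),
    ("net_c", 1.00, 0.0625),
]


def best_candidate(mode):
    """Return the name of the candidate with the smallest combined loss."""
    return min(candidates, key=lambda c: hw_aware_loss(c[1], c[2], mode))[0]


best_by_sum = best_candidate("sum")
best_by_product = best_candidate("product")
```

With these made-up numbers the additive form selects `net_a` (0.75 against 1.75 and 1.0625), while the multiplicative form selects `net_c` (0.0625 against 0.125 and 0.75), because a very small hardware loss dominates a product more strongly than a sum.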

5.3 FPGA/DNN Co-Design

Hao and Chen first proposed the concept of accelerator and DNN co-design in an invited paper titled “Deep Neural Network Model and FPGA Accelerator Codesign: Opportunities and Challenges” [43], where they advocated “automatically generate both DNN models and their corresponding implementations as pairs.” Later, based on the proposed co-design method, we implemented the first simultaneous FPGA/DNN co-design framework [45]. It has two major components, as shown in Fig. 13: (1) a hardware-oriented bottom-up DNN model design, named Auto-DNN, which is an efficient search engine to explore DNN candidates under hardware resource and performance constraints; (2) a DNN-driven top-down FPGA accelerator design, named Auto-HLS, which is a fast board-level design generator to automatically map DNNs onto FPGAs.

Fig. 13 FPGA/DNN co-design framework [45]. Inputs: target ML task, FPGA device (resources), and performance targets (QoS). Auto-DNN (co-search engine): Step 1, basic building block modeling; Step 2, building block selection; Step 3, DNN search and update. Auto-HLS: FPGA accelerator generator. Outputs: software (DNN model) and hardware (FPGA accelerator).

5.3.1 The Key to Co-Design: Bundle

The key to achieving co-design, i.e., to executing Auto-DNN and Auto-HLS simultaneously, is to propose basic building blocks that can be used to construct both DNNs and their accelerators at the same time. We call such building blocks Bundles: the common building blocks of both DNN architectures and their hardware implementations, as shown in Fig. 14. The benefits of Bundles are two-fold. First, a DNN can be constructed by replicating a Bundle for a certain number of layers with pooling layers inserted, which is a common and effective way to construct high-quality DNNs, such as the residual block in ResNet [2] and the Inception block in GoogLeNet [20]; meanwhile, many NAS approaches follow such a cell-based strategy [54, 56, 104]. Second, an accelerator can be constructed by building a hardware module for a given Bundle and reusing it for different DNN layers, given that the DNN is built by replicating the Bundle; this can significantly reduce the resource usage of the accelerator through resource sharing and shorten the hardware development cycle. As an example, a Bundle can be a set of DNN layers including: one 3 × 3 convolution, one batch normalization, one activation, one 1 × 1 convolution, and one activation. Accordingly, the hardware accelerator will need one instance for the 3 × 3 convolution, one instance for the 1 × 1 convolution, and so on.
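The dual role of a Bundle can be sketched as a small data structure. This is an illustrative sketch, not the data model of the framework in [45]: the `Bundle` class, the `build_dnn` helper, the layer-name strings, and the assumption that identical layer types share a single hardware instance are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Bundle:
    """An ordered set of layer types shared by the DNN and its accelerator."""
    layers: Tuple[str, ...]

    def hardware_instances(self) -> List[str]:
        """One hardware instance per distinct layer type, in first-seen order."""
        return list(dict.fromkeys(self.layers))


def build_dnn(bundle: Bundle, repetitions: int, pool_after: Tuple[int, ...]) -> List[str]:
    """Replicate the Bundle, adding a pooling layer after selected repetitions."""
    network: List[str] = []
    for rep in range(1, repetitions + 1):
        network.extend(bundle.layers)
        if rep in pool_after:
            network.append("pool")
    return network


# The example Bundle described in the text.
bundle = Bundle(("conv3x3", "batchnorm", "activation", "conv1x1", "activation"))
dnn = build_dnn(bundle, repetitions=3, pool_after=(1, 2))
instances = bundle.hardware_instances()
```

Here the software view (`dnn`) grows with the number of repetitions, while the hardware view (`instances`) stays fixed at the Bundle's distinct layer types, which is the resource-sharing argument made above.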

Fig. 14 The key to co-design: Bundle, the common basic building block to both DNN design and accelerator design [45]. Software side: a DNN built from a sequence of basic blocks between input and output, each composed of layers such as Conv 3x3, DW Conv 3x3, Conv 1x1, and Activation. Hardware side: the FPGA implements the corresponding modules (Convolution 3x3, Depth-wise Convolution 3x3, Convolution 1x1, ReLU, Pooling).

5.3.2 Progressively Reducing Search Space

It is non-trivial to select an optimal Bundle given the large design space and the prohibitively long DNN training time. Therefore, it is essential to narrow down the search space as early as possible. Our approach proceeds in three progressive steps, filtering out unfavorable Bundles at an early stage and conducting a detailed search at a later stage using the promising ones. The three steps are as follows.

Step 1 Analytical models for performance and resource estimation for Bundles and DNNs. Denoting a Bundle as $bund_i$, the resource of $bund_i$ is computed as:

$$Res^{r}_{bund_i} = \sum_{p_j} Res^{r}_{j} + \Gamma^{r}_{i} \qquad (6)$$

where $Res^{r}_{j}$ is the resource usage of instance $p_j$ of resource type $r$ (including DSP, LUT, FF, and BRAM). $\Gamma^{r}_{i}$ represents other resource overhead such as LUTs consumed by control logic and multiplexers. The latency of a Bundle is estimated as:

$$Lat_{bund_i} = \alpha_i \cdot \sum_{p_j} Comp_j + \frac{\beta_i \cdot \Theta(Data_i)}{bw} \qquad (7)$$


where $Comp_j$ is the computation latency of instance $p_j$, and $\Theta(Data_i)$ is the data amount processed by $bund_i$. $bw$ represents the off-chip memory bandwidth. Denoting the latency of one execution of $p_j$ as $lat_j$ and the total number of reuses of $p_j$ as $reuse_j$, the computation latency $Comp_j$ is estimated as:

$$Comp_j = \sum_{1 \le j \le n} reuse_j \cdot lat_j \qquad (8)$$

where $reuse_j$ can be computed from the input/output dimensions of the data processed by the IP and the data dimensions of $p_j$’s interface. The parameter $\alpha_i$ in Eq. (7) describes how much computation is overlapped because of IP pipelining, and $\beta_i$ describes how much data transfer is overlapped during computations. $\alpha_i$, $\beta_i$, and $\Gamma_i$ will be determined for each $bund_i$ using Auto-HLS sampling. The overall DNN latency based on $Lat_{bund_i}$ in Eq. (7) is estimated as:

$$Lat_{DNN} = \sum_{i=1}^{N} Lat_{bund_i} + \phi \cdot Lat_{DM} \qquad (9)$$

where $N$ is the number of Bundle repetitions of the DNN, and $\phi \cdot Lat_{DM}$ represents the inter-bundle data movement latency. For overall DNN resource utilization, we have:

$$Res_{DNN} = Res_{bund_i} + \gamma \cdot Res_{ctl} \qquad (10)$$

where $Res_{bund_i}$ is the resource of $bund_i$, and $Res_{ctl}$ is additional control logic overhead, e.g., finite state machines and multiplexers. $\phi$, $\gamma$, $Lat_{DM}$, and $Res_{ctl}$ will be decided and calibrated through actual hardware synthesis and implementation.

Step 2 Bundle evaluation and selection. In this step, we evaluate the latency, resource, and accuracy metrics for each Bundle, as defined in Step 1. Since we cannot evaluate the accuracy of a single Bundle, we replicate a Bundle N times to build a DNN and train it for a small number of epochs (20 in the experiment). We plot Pareto curves for the Bundles to examine the tradeoff between DNN accuracy and resource utilization, and the Bundles on the Pareto curve are selected for detailed search in the next step.

Step 3 DNN construction using Bundles and training. After selecting the top-N promising Bundle candidates, we search DNN models under resource and latency constraints. For each Bundle, K initial DNNs are generated and progressively updated by adjusting the number of channels, pooling layer positions, etc., until the latency target is met. Then, we perturb the initial DNNs by changing three variables: the number of Bundle replications, the down-sampling configurations between Bundles, and the channel-expansion configuration. We adopted the Stochastic Coordinate Descent (SCD) algorithm for perturbation, while other heuristic or evolutionary algorithms can be applied as well. The goal of the search algorithm is to find the DNN architecture that meets the performance constraints with the highest accuracy.

Table 9 Performance comparisons (FPGA and GPU competition data are obtained from [120])

Entry | Model | IoU | Latency (ms) | FPS | Power | Energy | Efficiency
Ours | DNN1 | 68.6% | 80.0 (100 MHz) | 12.5 | 2.2 W | 8.80 kJ | 0.18 J/pic
Ours | DNN1 | 68.6% | 57.4 (150 MHz) | 17.4 | 2.5 W | 7.18 kJ | 0.14 J/pic
Ours | DNN2 | 61.2% | 62.6 (100 MHz) | 16.0 | 2.2 W | 7.50 kJ | 0.15 J/pic
Ours | DNN2 | 61.2% | 44.1 (150 MHz) | 22.7 | 2.4 W | 5.51 kJ | 0.11 J/pic
Ours | DNN3 | 59.3% | 47.8 (100 MHz) | 20.9 | 2.2 W | 5.74 kJ | 0.11 J/pic
Ours | DNN3 | 59.3% | 33.7 (150 MHz) | 29.7 | 2.4 W | 4.04 kJ | 0.08 J/pic
1st in FPGA | SSD | 62.4% | 84.6 (150 MHz) | 11.96 | 4.2 W | 17.56 kJ | 0.35 J/pic
2nd in FPGA | – | 49.2% | 38.5 (150 MHz) | 25.97 | 2.5 W | 4.81 kJ | 0.10 J/pic
3rd in FPGA | – | 57.3% | 136.1 (150 MHz) | 7.35 | 2.6 W | 17.69 kJ | 0.35 J/pic
1st in GPU | Yolo | 69.8% | 40.7 (854 MHz) | 24.55 | 12.6 W | 25.66 kJ | 0.51 J/pic
2nd in GPU | Tiny-Yolo | 69.1% | 39.5 (854 MHz) | 25.3 | 13.3 W | 26.28 kJ | 0.53 J/pic
3rd in GPU | Tiny-Yolo | 68.5% | 42.3 (854 MHz) | 23.64 | 10.3 W | 21.79 kJ | 0.44 J/pic
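The Step 1 analytical models in Eqs. (6)–(10) amount to a handful of sums, which the following sketch spells out. It is one possible reading under stated assumptions, not the framework's implementation: the `Instance` class and all numbers are hypothetical, the total computation term of Eq. (7) is taken to be the sum of `reuse * lat` over the Bundle's instances, and the calibrated coefficients (which the text obtains from Auto-HLS sampling and hardware synthesis) are simply passed in as arguments.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Instance:
    """One hardware instance p_j inside a Bundle (all values hypothetical)."""
    resources: Dict[str, float]  # Res_j^r for each resource type r
    lat: float                   # lat_j: latency of one execution, in ms
    reuse: int                   # reuse_j: how many times p_j is reused


def bundle_resource(instances: List[Instance], overhead: Dict[str, float]) -> Dict[str, float]:
    """Eq. (6): per-type sum of instance resources plus the overhead term."""
    total = dict(overhead)
    for inst in instances:
        for rtype, amount in inst.resources.items():
            total[rtype] = total.get(rtype, 0.0) + amount
    return total


def bundle_latency(instances: List[Instance], alpha: float, beta: float,
                   data_amount: float, bw: float) -> float:
    """Eqs. (7)-(8): alpha * compute latency + beta * data amount / bandwidth."""
    comp = sum(inst.reuse * inst.lat for inst in instances)
    return alpha * comp + beta * data_amount / bw


def dnn_latency(bundle_latencies: List[float], phi: float, lat_dm: float) -> float:
    """Eq. (9): sum of Bundle latencies plus inter-bundle data movement."""
    return sum(bundle_latencies) + phi * lat_dm


def dnn_resource(bundle_res: Dict[str, float], gamma: float,
                 res_ctl: Dict[str, float]) -> Dict[str, float]:
    """Eq. (10): Bundle resource plus scaled control-logic overhead."""
    keys = set(bundle_res) | set(res_ctl)
    return {k: bundle_res.get(k, 0.0) + gamma * res_ctl.get(k, 0.0) for k in keys}


# Hypothetical Bundle with two instances.
instances = [
    Instance({"DSP": 64, "LUT": 2000}, lat=0.5, reuse=4),   # e.g. a 3x3 convolution
    Instance({"DSP": 32, "LUT": 1000}, lat=0.25, reuse=4),  # e.g. a 1x1 convolution
]
res_bundle = bundle_resource(instances, overhead={"LUT": 1000})
lat_bundle = bundle_latency(instances, alpha=0.5, beta=0.5, data_amount=8.0, bw=4.0)
lat_dnn = dnn_latency([lat_bundle] * 4, phi=1.0, lat_dm=2.0)
res_dnn = dnn_resource(res_bundle, gamma=1.0, res_ctl={"LUT": 500})
```

With these made-up inputs the Bundle uses 96 DSPs and 4000 LUTs, one Bundle execution takes 2.5 ms, and a DNN of four Bundle repetitions is estimated at 12.0 ms. Because the models are closed-form, thousands of candidates can be screened this way before any of them is synthesized or trained.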

5.3.3 Evaluation Results

To evaluate the effectiveness of the co-design framework, we apply it to a low-power object detection competition [120] and compare against the top-3 winners in both the FPGA and GPU categories. The results are shown in Table 9. We make comparisons in: (1) the Intersection over Union (IoU); (2) the latency for processing one frame and the overall frames per second (FPS); (3) the board power; (4) the energy consumption for all testing data; and (5) the energy efficiency per frame (J/pic). The results are collected from the board-level implementations on Pynq-Z1. The latency refers to the execution time for a single frame in milliseconds, while FPS is measured using the total runtime for the 50K images, including image loading, preprocessing, and DNN inference. Compared to the 1st-place winner of the FPGA category, we achieve 6.2% higher IoU, 40% lower power, and 2.5× better energy efficiency, which we attribute to the effectiveness of an automated co-search instead of manual designs. Compared to GPU-based designs, our DNN1 model is more accurate than the 3rd-place design and has only 1.2% lower IoU than the 1st-place GPU design. Regarding energy efficiency, ours is 3.6× better than the 1st-place GPU design, with 40% longer latency despite a nearly 6× slower clock frequency.
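The quoted ratios can be re-derived from Table 9 with a few lines of arithmetic. The sketch below assumes the comparison uses DNN1 at 150 MHz, since that is the row whose power and efficiency reproduce the figures in the text; the dictionary and variable names are illustrative.

```python
# Values copied from Table 9 (DNN1 at 150 MHz, 1st-place FPGA and GPU entries).
ours = {"iou": 68.6, "latency_ms": 57.4, "power_w": 2.5, "j_per_pic": 0.14, "clock_mhz": 150}
fpga_1st = {"iou": 62.4, "power_w": 4.2, "j_per_pic": 0.35}
gpu_1st = {"iou": 69.8, "latency_ms": 40.7, "j_per_pic": 0.51, "clock_mhz": 854}

# Comparison against the 1st-place FPGA design.
iou_gain_vs_fpga = ours["iou"] - fpga_1st["iou"]                # percentage points
power_saving_vs_fpga = 1 - ours["power_w"] / fpga_1st["power_w"]
efficiency_gain_vs_fpga = fpga_1st["j_per_pic"] / ours["j_per_pic"]

# Comparison against the 1st-place GPU design.
iou_gap_vs_gpu = gpu_1st["iou"] - ours["iou"]                   # percentage points
efficiency_gain_vs_gpu = gpu_1st["j_per_pic"] / ours["j_per_pic"]
extra_latency_vs_gpu = ours["latency_ms"] / gpu_1st["latency_ms"] - 1
clock_ratio_vs_gpu = gpu_1st["clock_mhz"] / ours["clock_mhz"]
```

These evaluate to roughly 6.2 points, 40%, and 2.5× against the FPGA winner, and 1.2 points, 3.6×, 41%, and 5.7× against the GPU winner, matching the rounded figures in the paragraph.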


5.4 EDD: Efficient Differentiable DNN Architecture Search

On top of the FPGA/DNN co-design introduced in Sect. 5.3, we further extend co-design to a more generalized and unified approach, i.e., fully simultaneous neural architecture and implementation co-search, targeting arbitrary hardware platforms. Neural architecture and implementation co-search (NAIS) [48] is the first work to stylize a design methodology targeting both FPGAs and GPUs, while EDD [58] is a fully simultaneous, efficient differentiable DNN architecture and implementation co-search methodology. The overall architecture of EDD is presented in Fig. 15.

5.4.1 Fused Co-Design Space

The key technology is to fuse the design space of DNN architecture search and hardware implementation search. We collectively denote the variables used in DNN search and implementation search as A and I , respectively, and the fused space of co-search is {A, I }. To carry out both DNN architecture and hardware accelerator co-search in the fused DNN/accelerator space as described in Eq. (5), we minimize the following loss function:

$$\min: \mathcal{L} = Acc_{loss}(A, I) \cdot Perf_{loss}(I) + \beta \cdot C^{RES(I) - RES_{ub}} \qquad (11)$$

Fig. 15 The overall architecture of EDD [58]. Neural Architecture Search (NAS) part: each block of the DNN, from input to output, holds several candidate operations with per-operation sampling parameters, and channels expand and shrink inside each candidate operation. Implementation search part: a quantization bit-width is sampled for each operation, together with other implementation variables (FPGA: parallel factors, loop tiling factors, etc.; GPU: batch size, etc.).

In the above equation, $Acc_{loss}$ is the DNN accuracy loss; $Perf_{loss}$ is the hardware performance loss such as end-to-end inference latency, throughput, energy, DNN model complexity, etc.; multiple performance metrics can be optimized simultaneously by defining a single weighted loss. $RES$ is the resource utilization and $RES_{ub}$ is the resource upper bound. $Acc_{loss}$ is a function of $A$ and $I$; $Perf_{loss}$ and $RES$ are functions of $I$. The resource upper bound $RES_{ub}$ is expressed in an exponent term to introduce a large penalty when it is violated. Notably, in existing hardware-aware NAS approaches, only $A$ is searched while $I$ is fixed during NAS. In our proposed co-search formulation, $I$ is variable, and $A$ and $I$ are fused as one design space $\{A, I\}$.

NAS Design Space In the search space, each DNN is composed of $N$ basic building blocks in a single-path fashion without branches [121]. Inside each block, there are $M$ candidate operations. We adopt the most commonly used DNN block in NAS approaches, called MBConv [56], which is composed of sequential layers of conv-1 × 1, dwconv-k × k, and conv-1 × 1, where k is the kernel size. Between conv-1 × 1 and dwconv-k × k, the number of channels expands/shrinks by a ratio of $ch_i^m$ for operation $op_i^m$. The output of each block is calculated based on the outputs of its $M$ candidate operations. Specifically, we adopt the Gumbel-Softmax function in [55], where each operation $op_i^m$ is sampled according to a sampling parameter $\theta_{i,m}$ following the Gumbel-Softmax distribution, which converts the discrete non-differentiable sampling to continuous differentiable sampling. The sampled operations form a complete DNN, which can be evaluated for accuracy and implementation performance.

Implementation Search Space We let each candidate operation $op_i^m$ have its own implementation variables, forming an implementation search space $I_i^m$. The primary implementation variable is quantization $q$, i.e., data precision, since it has a large impact on DNN accuracy, implementation performance, and hardware resource usage. Rather than using a train-and-quantize approach, the quantization is searched together with the DNN structure to provide implementation performance feedback. Besides quantization, other implementation variables can also be integrated into the framework, such as accelerator parallelism, loop tiling factors, batch size, etc.
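The effect of the exponent term in Eq. (11) is easiest to see numerically. The function below is a minimal sketch of the loss as written; the default values of `beta` and `c`, the resource units, and the example inputs are assumptions chosen for illustration, not values from [58].

```python
def co_search_loss(acc_loss, perf_loss, res, res_ub, beta=1.0, c=2.0):
    """Eq. (11): Acc_loss * Perf_loss + beta * C ** (RES - RES_ub).

    beta and c are hypothetical defaults; c > 1 makes the penalty grow
    exponentially once the resource usage exceeds the upper bound.
    """
    return acc_loss * perf_loss + beta * c ** (res - res_ub)


# Same accuracy and performance losses, different resource usage.
within_budget = co_search_loss(acc_loss=0.5, perf_loss=2.0, res=8, res_ub=10)
over_budget = co_search_loss(acc_loss=0.5, perf_loss=2.0, res=13, res_ub=10)
```

Under the bound, the penalty contributes only 0.25 on top of the product term of 1.0; three units over the bound, it contributes 8.0 and dominates the loss. The penalty is therefore a soft constraint: it stays differentiable everywhere while still steering the search away from infeasible implementations.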

5.4.2 Differentiable Performance and Resource Formulation

The key challenge is how to formulate the loss function to be differentiable with respect to the search space $A$ and $I$. Since the NAS search space $A$ is discrete, a differentiable formulation requires continuous relaxation. DARTS [104] is the first work that uses softmax for relaxation, while FBNet uses Gumbel-softmax [122] by sampling from the discrete space. Such relaxation has been demonstrated to be efficient in GPU hours while delivering appealing model accuracy [55, 104, 113]. Motivated by FBNet, a similar technique using Gumbel-softmax can be applied to the implementation space $I$ to convert the discrete implementation search space into a continuous one. Therefore, by descending the loss function on the validation set, $\{A, I\}$ can be learned simultaneously.
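The Gumbel-softmax relaxation itself is short enough to write out. The sketch below uses only the standard library and plain floats, so it shows the sampling step but not the gradient flow; the function name, the example sampling parameters `theta`, and the temperature are hypothetical.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Draw a relaxed one-hot sample over a set of discrete choices."""
    noisy = []
    for logit in logits:
        u = rng.random()
        while u == 0.0:  # avoid log(0); random() returns values in [0, 1)
            u = rng.random()
        g = -math.log(-math.log(u))  # Gumbel(0, 1) noise via inverse transform
        noisy.append((logit + g) / temperature)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical sampling parameters for three candidate operations in a block.
theta = [2.0, 0.5, 0.1]
weights = gumbel_softmax(theta, temperature=0.5, rng=random.Random(0))

# The block output would be the weighted sum of the candidates' outputs.
candidate_outputs = [1.0, 2.0, 3.0]
block_output = sum(w * o for w, o in zip(weights, candidate_outputs))
```

Lower temperatures push `weights` toward a one-hot vector, approaching a discrete choice, while higher temperatures give smoother mixtures. In an automatic-differentiation framework the same computation would be applied to tensors so that gradients reach the sampling parameters; PyTorch, for instance, provides `torch.nn.functional.gumbel_softmax` for this purpose.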

5.4.3 State-of-the-art Results

We demonstrate the results on a subset of the ImageNet dataset randomly sampled from 100 classes and target three hardware architectures, each with a searched DNN model, called EDD-Net: (1) low-latency oriented GPU (EDD-Net-1); (2) folded FPGA architecture (EDD-Net-2), where a single processing element (PE) is reused by all layers; (3) pipelined FPGA architecture (EDD-Net-3), where each layer has its own PE, and all PEs work simultaneously. Each model is produced through EDD within a 12-hour search on a P100 GPU. For the GPU-targeted EDD-Net-1, the results are shown in Table 10, where the GPU latency is tested on a Titan RTX. EDD-Net-1 reaches similar or better accuracy compared with the state-of-the-art DNN models and other NAS approaches while achieving the shortest inference latency. Table 11 shows the accuracy and latency tradeoff of different precisions of EDD-Net-1 on an Nvidia 1080 Ti GPU. For the FPGA-targeted EDD-Net-2, the latency values are collected by running DNN models with CHaiDNN accelerators on a ZCU102 FPGA, as shown in Table 10. EDD-Net-2 delivers the shortest latency on FPGA among all the DNNs. The FPGA-targeted EDD-Net-3 is searched targeting a pipelined FPGA accelerator. As shown in Table 12, EDD-Net-3 achieves higher throughput with a much higher accuracy compared with the state-of-the-art.

Table 10 Comparisons with existing NAS solutions [58]

Model | Top-1 Test Error (%) | Top-5 Test Error (%) | GPU Latency (Titan RTX) | FPGA Latency (ZCU102 [123])
Baseline Models
GoogleNet | 30.22 | 10.47 | 27.75 ms | 13.25 ms
MobileNet-V2 [124] | 28.1 | 9.7 | 17.87 ms | 10.85 ms
ShuffleNet-V2 [125] | 30.6 | 11.7 | 21.91 ms | NA
ResNet18 | 30.2 | 10.9 | 9.71 ms | 10.15 ms
Hardware-aware NAS Models
MNasNet-A1 [56] | 24.8 | 7.5 | 17.94 ms | 8.78 ms
FBNet-C [55] | 24.9 | 7.6 | 22.54 ms | 12.21 ms
Proxyless-cpu [105] | 24.7 | 7.6 | 21.34 ms | 10.81 ms
Proxyless-Mobile [105] | 25.4 | 7.8 | 21.23 ms | 10.78 ms
Proxyless-GPU [105] | 24.9 | 7.5 | 15.72 ms | 10.79 ms
EDD-Net-1 | 25.3 | 7.7 | 11.17 ms | 11.15 ms
EDD-Net-2 | 25.4 | 7.9 | 13.00 ms | 7.96 ms

Table 11 EDD-Net-1 accuracy and latency on 1080 Ti [58]

Precision | Test Error | Latency
32-bit Floating | 25.5% | 2.83 ms
16-bit Floating | 25.3% | 2.29 ms
8-bit Integer | 26.4% | 1.74 ms

Table 12 Comparison of EDD-Net-3 with DNNBuilder [30]

Model | Top-1 Error (%) | Top-5 Error (%) | Throughput (ZC706)
VGG16 | 29.5 | 10.0 | 27.7 fps
EDD-Net-3 | 25.6 | 7.7 | 40.2 fps

6 Conclusion

Emerging DNN-based AI applications are challenging for embedded systems as these applications come with high computation and memory demands as well as diverse application-specific requirements, such as real-time responses, high-throughput performance, and reliable inference accuracy. This chapter introduced a series of effective design methods to overcome these challenges to enable embedded AI solutions. These methods can be categorized into efficient machine learning algorithms, accelerator and compiler designs, and various co-design and optimization strategies. We first proposed ELB-NN and VecQ to strengthen the AI model’s hardware efficiency by enabling extremely low bitwidth quantization during model training and inference. Then, we proposed DNNBuilder and PyLog for customized hardware accelerator design and DNN workload mapping to such accelerators. Finally, we introduced efficient co-design strategies, including FPGA/DNN co-design and EDD, for deploying AI workloads on embedded systems.

We believe embedded AI solutions will involve more effective and comprehensive design methods in the future, covering AI algorithms, customized accelerators, and co-design and co-optimization strategies between algorithm and hardware. For example, our efficient AI algorithm designs, such as ELB-NN and VecQ, can adopt more advanced quantization schemes to minimize network compression loss. Future work will consider more diverse network architectures and layer-wise data distribution features. To facilitate a smoother accelerator design process, we will extend DNNBuilder and PyLog to create frameworks and tools for hardware design, synthesis, and workload compilation. Major directions include (1) heterogeneous computing support, which intends to enable system-level design and optimization for heterogeneous AI systems, and (2) dynamic computational graph scheduling, which enables the generation of runtime adaptive accelerators for future AI applications.
Our future works will also cover more advanced software/hardware codesign for emerging AI models running on heterogeneous systems, which contains a much larger design space and is thus more challenging. For example, multi-modal multi-task (MMMT) models [126] and customized hardware designs [127] working for autonomous driving have demonstrated the importance of heterogeneity in AI model and hardware designs. The co-design and co-optimization methods must be developed for such heterogeneous scenarios.


Acknowledgments The works presented in this book chapter are mainly supported by the IBM-Illinois Center for Cognitive Computing System Research (C3SR) – a research collaboration as part of IBM AI Horizons Network, Semiconductor Research Corporation (SRC), the Campus for Research Excellence and Technological Enterprise (CREATE) program in Singapore, AMD-Xilinx Center of Excellence, and a Google PhD Fellowship to Xiaofan Zhang. The authors also want to thank Chao Zhu, Cheng Gong, Chengyue Wang, Hyunmin Jeong, Jinjun Xiong, Junsong Wang, Kun Wu, Kyle Rupnow, Tao Li, Qiuwen Lou, Wen-mei Hwu, Xinheng Liu, Ye Lu, and Yonghua Lin for their valuable contributions.

References

1. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
2. He, K., Zhang, X., Ren, S., et al.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
3. Vaswani, A., Shazeer, N., Parmar, N., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems (2017)
4. Lombardi, S., Saragih, J., Simon, T., Sheikh, Y.: Deep appearance models for face rendering. ACM Trans. Graph. 37(4), 1–13 (2018)
5. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, vol. 25 (2012)
6. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al.: Mastering the game of go without human knowledge. Nature 550(7676), 354–359 (2017)
7. OpenAI: AI and compute (2018)
8. Zhao, S., Ahmed, S., Liang, Y., et al.: A real-time 3d sound localization system with miniature microphone array for virtual reality. In: Proceedings of the IEEE Conference on Industrial Electronics and Applications (ICIEA) (2012)
9. Chen, D., Cong, J., Gurumani, S., et al.: Platform choices and design demands for IoT platforms: cost, power, and performance tradeoffs. IET Cyber-Physical Systems: Theory and Applications 1(1), 70–77 (2016)
10. Jouppi, N.P., et al.: In-datacenter performance analysis of a tensor processing unit. In: Proceedings of the International Symposium on Computer Architecture (ISCA) (2017)
11. Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A.: XNOR-net: ImageNet classification using binary convolutional neural networks. In: European conference on computer vision, pp. 525–542. Springer, Berlin (2016)
12. Wang, J., Lou, Q., Zhang, X., Zhu, C., Lin, Y., Chen, D.: Design flow of accelerating hybrid extremely low bit-width neural network in embedded FPGA. In: International Conference on Field Programmable Logic and Applications, pp. 163–1636. IEEE, New York (2018)
13. Gope, D., et al.: Ternary hybrid neural-tree networks for highly constrained IoT applications (2019)
14. Gong, C., Chen, Y., Lu, Y., Li, T., Hao, C., Chen, D.: VecQ: Minimal loss DNN model compression with vectorized weight quantization. IEEE Trans. Comput. 70(5), 696–710 (2020)
15. Chen, Y., et al.: T-DLA: An open-source deep learning accelerator for ternarized DNN models on embedded FPGA. In: ISVLSI (2019)
16. Han, S., et al.: Learning both weights and connections for efficient neural network. In: Advances in Neural Information Processing Systems (2015)
17. Han, S., et al.: Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In: International Conference of Learning Representation (2016)
18. Luo, J.-H., et al.: ThiNet: A filter level pruning method for deep neural network compression. In: International Conference of Computer Vision (2017)


19. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
20. Szegedy, C., Liu, W., Jia, Y., et al.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015)
21. Dai, X., et al.: Nest: A neural network synthesis tool based on a grow-and-prune paradigm. IEEE Trans. Comput. 68(10), 1487–1497 (2019)
22. Ren, A., et al.: ADMM-NN: An algorithm-hardware co-design framework of DNNs using alternating direction methods of multipliers. In: Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2019)
23. Ding, X., et al.: Auto-balanced filter pruning for efficient convolutional neural networks. In: AAAI (2018)
24. Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
25. Alwani, M., et al.: Fused-layer CNN accelerators. In: Proceedings of the International Symposium on Microarchitecture (2016)
26. Brown, B.: Intel® math kernel library for deep learning networks (2018)
27. Franklin, D.: NVIDIA Jetson AGX Xavier delivers 32 tera ops for new era of AI in robotics. In: NVIDIA Accelerated Computing| Parallel For all (2018)
28. Zhang, C., et al.: Optimizing FPGA-based accelerator design for deep convolutional neural networks. In: International Symposium on FPGAs (2015)
29. Qiu, J., et al.: Going deeper with embedded FPGA platform for convolutional neural network. In: International Symposium on FPGAs (2016)
30. Zhang, X., Wang, J., Zhu, C., Lin, Y., Xiong, J., Hwu, W.m., Chen, D.: DNNBuilder: An automated tool for building high-performance DNN hardware accelerators for FPGAs. In: International Conference on Computer Aided Design, pp. 1–8. IEEE, New York (2018)
31. Chen, Y.-H., et al.: Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. In: Proceedings of the International Solid-State Circuits Conference (ISSCC) (2016)
32. Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M.A., Dally, W.J.: EIE: Efficient inference engine on compressed deep neural network. In: Proceedings of the International Symposium on Computer Architecture (ISCA) (2016)
33. Papakonstantinou, A., Gururaj, K., Stratton, J.A., Chen, D., Cong, J., Hwu, W.M.: FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs. In: Proceedings of the Symposium on Application Specific Processors (2009)
34. Rupnow, K., Liang, Y., Li, Y., Chen, D.: A study of high-level synthesis: Promises and challenges. In: Proceedings of the International Conference on ASIC (2011)
35. Liu, X., Chen, Y., Nguyen, T., et al.: High level synthesis of complex applications: An H.264 video decoder. In: International Symposium on FPGAs (2016)
36. Zhang, C., et al.: Caffeine: Toward uniformed representation and acceleration for deep convolutional neural networks. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 38(11), 2072–2085 (2018)
37. Ye, H., Zhang, X., Huang, Z., Chen, G., Chen, D.: HybridDNN: A framework for high-performance hybrid DNN accelerator design and implementation. In: 2020 57th ACM/IEEE Design Automation Conference (DAC), pp. 1–6. IEEE, New York (2020)
38. Huang, S., Wu, K., Jeong, H., Wang, C., Chen, D., Hwu, W.M.: PyLog: An algorithm-centric Python-based FPGA programming and synthesis flow. IEEE Trans. Comput. 70(12), 2015–2028 (2021)
39. Jia, Y., et al.: Caffe: Convolutional architecture for fast feature embedding. In: Proceedings of the ACM international conference on Multimedia (2014)
40. Abadi, M., et al.: TensorFlow: A system for large-scale machine learning. In: Proceedings of the USENIX symposium on operating systems design and implementation (OSDI) (2016)


41. Paszke, A., et al.: PyTorch: An imperative style, high-performance deep learning library. In: Proceedings of the Advances in neural information processing systems (2019)
42. Hao, C., Dotzel, J., Xiong, J., Benini, L., Zhang, Z., Chen, D.: Enabling design methodologies and future trends for edge AI: specialization and codesign. IEEE Design and Test 38(4), 7–26 (2021)
43. Hao, C., Chen, D.: Deep neural network model and FPGA accelerator co-design: Opportunities and challenges. In: 2018 14th IEEE International Conference on Solid-State and Integrated Circuit Technology (ICSICT), pp. 1–4. IEEE, New York (2018)
44. Zhang, X., Lu, H., Hao, C., Li, J., Cheng, B., Li, Y., Rupnow, K., Xiong, J., Huang, T., Shi, H., et al.: Skynet: a hardware-efficient method for object detection and tracking on embedded systems. In: Proceedings of Machine Learning and Systems, pp. 216–229 (2020)
45. Hao, C., Zhang, X., Li, Y., Huang, S., Xiong, J., Rupnow, K., Hwu, W.m., Chen, D.: FPGA/DNN co-design: An efficient design methodology for IoT intelligence on the edge. In: 2019 56th ACM/IEEE Design Automation Conference (DAC), pp. 1–6. IEEE, New York (2019)
46. Yang, Y., Huang, Q., Wu, B., et al.: Synetgy: Algorithm-hardware co-design for convnet accelerators on embedded FPGAs. In: International Symposium on FPGAs (2019)
47. Guo, K., Zeng, S., Yu, J., Wang, Y., Yang, H.: A survey of FPGA-based neural network inference accelerators. ACM Transactions on Reconfigurable Technology and Systems (TRETS) 12(1), 1–26 (2019)
48. Hao, C., Chen, Y., Liu, X., Sarwari, A., Sew, D., Dhar, A., Wu, B., Fu, D., Xiong, J., Hwu, W.m., et al.: NAIS: Neural architecture and implementation search and its applications in autonomous driving. arXiv preprint arXiv:1911.07446 (2019)
49. Jiang, W., Yang, L., Sha, E.H.M., Zhuge, Q., Gu, S., Dasgupta, S., Shi, Y., Hu, J.: Hardware/software co-exploration of neural architectures. In: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2020)
50. Wang, J., Zhang, X., Li, Y., et al.: Exploring HW/SW co-optimizations for accelerating large-scale texture identification on distributed GPUs. In: Proceedings of the International Conference on Parallel Processing (ICPP), pp. 1–10 (2021)
51. Zhang, X., Ma, Y., Xiong, J., et al.: Exploring HW/SW co-design for video analysis on CPU-FPGA heterogeneous systems. In: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2021)
52. Fu, Y., Zhang, Y., Li, C., et al.: A3C-S: Automated agent accelerator co-search towards efficient deep reinforcement learning. In: Design Automation Conference (2021)
53. Elsken, T., Metzen, J.H., Hutter, F.: Neural architecture search: A survey. J. Mach. Learn. Res. 20(1), 1997–2017 (2019)
54. Zoph, B., Vasudevan, V., Shlens, J., Le, Q.V.: Learning transferable architectures for scalable image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8697–8710 (2018)
55. Wu, B., Dai, X., Zhang, P., Wang, Y., Sun, F., Wu, Y., Tian, Y., Vajda, P., Jia, Y., Keutzer, K.: FBNet: Hardware-aware efficient convnet design via differentiable neural architecture search. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10734–10742 (2019)
56. Tan, M., Chen, B., Pang, R., Vasudevan, V., Sandler, M., Howard, A., Le, Q.V.: MnasNet: Platform-aware neural architecture search for mobile. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2820–2828 (2019)
57. Jiang, W., Zhang, X., Sha, E.H.M., Yang, L., Zhuge, Q., Shi, Y., Hu, J.: Accuracy vs. efficiency: Achieving both through FPGA-implementation aware neural architecture search. In: Proceedings of the 56th Annual Design Automation Conference 2019, pp. 1–6 (2019)
58. Li, Y., Hao, C., Zhang, X., Liu, X., Chen, Y., Xiong, J., Hwu, W.m., Chen, D.: EDD: Efficient differentiable DNN architecture and implementation co-search for embedded AI solutions. In: 2020 57th ACM/IEEE Design Automation Conference (DAC) (2020)
59. Yang, L., Yan, Z., Li, M., Kwon, H., Lai, L., Krishna, T., Chandra, V., Jiang, W., Shi, Y.: Co-exploration of neural architectures and heterogeneous ASIC accelerator designs targeting multiple tasks. In: 2020 57th ACM/IEEE Design Automation Conference (DAC) (2020)


60. Ma, X., Guo, F.M., Niu, W., Lin, X., Tang, J., Ma, K., Ren, B., Wang, Y.: PCONV: The missing but desirable sparsity in DNN weight pruning for real-time execution on mobile devices. In: AAAI, pp. 5117–5124 (2020) 61. Niu, W., Ma, X., Lin, S., Wang, S., Qian, X., Lin, X., Wang, Y., Ren, B.: PatDNN: Achieving real-time DNN execution on mobile devices with pattern-based weight pruning. In: Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 907–922 (2020) 62. Lin, J., Chen, W.M., Cohn, J., Gan, C., Han, S.: MCUNet: Tiny deep learning on IoT devices. In: Annual Conference on Neural Information Processing Systems (NeurIPS) (2020) 63. PULP—An Open Parallel Ultra-Low-Power Processing-Platform. http://iis-projects.ee.ethz. ch/index.php/PULP 64. Garofalo, A., Tagliavini, G., Conti, F., Rossi, D., Benini, L.: XpulpNN: accelerating quantized neural networks on RISC-V processors through isa extensions. In: 2020 Design, Automation and Test in Europe Conference and Exhibition (DATE), pp. 186–191. IEEE, New York (2020) 65. Garofalo, A., Rusci, M., Conti, F., Rossi, D., Benini, L.: Pulp-NN: accelerating quantized neural networks on parallel ultra-low-power RISC-V processors. Phil. Trans. R. Soc. A 378(2164), 20190155 (2020) 66. Li, F., Zhang, B., Liu, B.: Ternary weight networks. arXiv preprint arXiv:1605.04711 (2016) 67. Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., Zou, Y.: DoReFA-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160 (2016) 68. Zhao, R., Song, W., Zhang, W., Xing, T., Lin, J.H., Srivastava, M., Gupta, R., Zhang, Z.: Accelerating binarized convolutional neural networks with software-programmable FPGAs. In: Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 15–24 (2017) 69. Umuroglu, Y., et al.: Finn: A framework for fast, scalable binarized neural network inference. 
In: International Symposium on FPGAs. ACM, New York (2017) 70. Eriko, N., et al.: Accelerating binarized neural networks: Comparison of FPGA, CPU, GPU, and ASIC. In: International Conference on Field-Programmable Technology (FPT) (2016) 71. Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, Y.: Binarized neural networks. In: Advances in Neural Information Processing Systems, pp. 4107–4115 (2016) 72. Courbariaux, M., Bengio, Y., David, J.P.: BinaryConnect: Training deep neural networks with binary weights during propagations. In: Advances in Neural Information Processing Systems, pp. 3123–3131 (2015) 73. Lin, Z., Courbariaux, M., Memisevic, R., Bengio, Y.: Neural networks with few multiplications (2016) 74. Jin, C., Sun, H., Kimura, S.: Sparse ternary connect: Convolutional neural networks using ternarized weights with enhanced sparsity. In: ASP-DAC, pp. 190–195 (2018) 75. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International Conference Machine Learning, pp. 448–456 (2015) 76. Zhu, C., Han, S., Mao, H., Dally, W.J.: Trained ternary quantization. In: International Conference Learning Representation (2017) 77. Wang, P., Hu, Q., Zhang, Y., Zhang, C., Liu, Y., Cheng, J.: Two-step quantization for low-bit neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4376–4384 (2018) 78. Zhou, A., Yao, A., Guo, Y., Xu, L., Chen, Y.: Incremental network quantization: Towards lossless CNNs with low-precision weights (2017) 79. Leng, C., Dou, Z., Li, H., Zhu, S., Jin, R.: Extremely low bit neural network: Squeeze the last bit out with ADMM. In: AAAI (2018) 80. Redmon, J., Farhadi, A.: Yolo9000: better, faster, stronger. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017) 81. Xiao, Q., et al.: Exploring heterogeneous algorithms for accelerating deep convolutional neural networks on FPGAs. 
In: Proceedings of the 54th Annual Design Automation Conference, pp. 1–6 (2017)


82. Franklin, D.: NVIDIA Jetson TX2 delivers twice the intelligence to the edge. In: NVIDIA Accelerated Computing| Parallel For all (2017) 83. Papakonstantinou, A., Liang, Y., Stratton, J.A., Gururaj, K., Chen, D., Hwu, W.M.W., Cong, J.: Multilevel granularity parallelism synthesis on FPGAs. In: 2011 IEEE 19th Annual International Symposium on Field-Programmable Custom Computing Machines, pp. 178– 185. IEEE, New York (2011) 84. Gurumani, S.T., Tolar, J., Chen, Y., Liang, Y., Rupnow, K., Chen, D.: Integrated CUDAto-FPGA synthesis with Network-on-Chip. In: 2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines, pp. 21–24 (2014) 85. Chen, Y., Gurumani, S.T., Liang, Y., Li, G., Guo, D., Rupnow, K., Chen, D.: FCUDA-NoC: A scalable and efficient network-on-chip implementation for the CUDA-to-FPGA flow. IEEE Trans. Very Large Scale Integr. VLSI Syst. 24(6), 2220–2233 (2016) 86. Nguyen, T., Gurumani, S., Rupnow, K., Chen, D.: FCUDA-SoC: Platform integration for field-programmable soc with the CUDA-to-FPGA compiler. In: Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA ’16), pp. 5–14 (2016) 87. Chen, Y., Nguyen, T., Chen, Y., Gurumani, S.T., Liang, Y., Rupnow, K., Cong, J., Hwu, W.M., Chen, D.: FCUDA-HB: Hierarchical and scalable bus architecture generation on FPGAs with the FCUDA flow. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 35(12), 2032–2045 (2016) 88. Cong, J., Huang, H., Jiang, W.: A generalized control-flow-aware pattern recognition algorithm for behavioral synthesis. In: 2010 Design, Automation and Test in Europe Conference and Exhibition (DATE 2010), pp. 1255–1260 (2010) 89. Cong, J., Liu, B., Neuendorffer, S., Noguera, J., Vissers, K., Zhang, Z.: High-level synthesis for FPGAs: From prototyping to deployment. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 30(4), 473–491 (2011) 90. 
Zhang, Z., Fan, Y., Jiang, W., Han, G., Yang, C., Cong, J.: AutoPilot: A Platform-Based ESL Synthesis System, pp. 99–112. Springer Netherlands, Dordrecht (2008) 91. Cong, J., Fan, Y., Han, G., Jiang, W., Zhang, Z.: Platform-based behavior-level and systemlevel synthesis. In: 2006 IEEE International SOC Conference, pp. 199–202 (2006) 92. Canis, A., Choi, J., Aldham, M., Zhang, V., Kammoona, A., Anderson, J.H., Brown, S., Czajkowski, T.: LegUp: High-level synthesis for FPGA-based processor/accelerator systems. In: Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pp. 33–36 (2011) 93. Canis, A., Choi, J., Aldham, M., Zhang, V., Kammoona, A., Czajkowski, T., Brown, S.D., Anderson, J.H.: LegUp: An open-source high-level synthesis tool for FPGA-based processor/accelerator systems. ACM Trans. Embed. Comput. Syst. 13(2), 1–27 (2013) 94. Ye, H., Hao, C., Cheng, J., Jeong, H., Huang, J., Neuendorffer, S., Chen, D.: ScaleHLS: A new scalable high-level synthesis framework on multi-level intermediate representation. In: 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 741–755 (2022) 95. Chen, D., Cong, J., Fan, Y., Wan, L.: LOPASS: A low-power architectural synthesis system for FPGAs with interconnect estimation and optimization. IEEE Trans. Very Large Scale Integr. VLSI Syst. 18(4), 564–577 (2009) 96. Chen, D., Cong, J., Xu, J.: Optimal module and voltage assignment for low-power. In: Proceedings of the Asia and South Pacific Design Automation Conference, vol. 2, pp. 850– 855 (2005) 97. Chen, D., Cong, J., Xu, J.: Optimal simultaneous module and multivoltage assignment for low power. ACM Trans. Des. Autom. Electron. Syst. 11(2), 362–386 (2006) 98. Vitis HLS. https://www.xilinx.com/support/documentation-navigation/design-hubs/dh0090vitis-hls-hub.html 99. Siemens High-Level Synthesis and Verification. https://eda.sw.siemens.com/en-US/ic/icdesign/high-level-synthesis-and-verification-platform/ 100. 
PYNQ. http://www.pynq.io/


101. Lai, Y.H., Chi, Y., Hu, Y., Wang, J., Yu, C.H., Zhou, Y., Cong, J., Zhang, Z.: HeteroCL: A multi-paradigm programming infrastructure for software-defined reconfigurable computing. In: Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA’19, pp. 242–251, New York, NY, USA, 2019. Association for Computing Machinery, New York (2019) 102. Grauer-Gray, S., Xu, L., Searles, R., Ayalasomayajula, S., Cavazos, J.: Auto-tuning a highlevel language targeted to GPU codes. In: 2012 Innovative Parallel Computing (InPar), pp. 1–10 (2012) 103. Kastner, R., Matai, J., Neuendorffer, S.: Parallel programming for FPGAs (2018) 104. Liu, H., Simonyan, K., Yang, Y.: Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055 (2018) 105. Cai, H., Zhu, L., Han, S.: ProxylessNAS: Direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332 (2018) 106. Tuli, S., Dedhia, B., Tuli, S., Jha, N.K.: FlexiBERT: Are current transformer architectures too homogeneous and rigid? arXiv preprint arXiv:2205.11656 (2022) 107. Li, L., Talwalkar, A.: Random search and reproducibility for neural architecture search. In: Uncertainty in artificial intelligence, pp. 367–377. PMLR, New York (2020) 108. Xie, S., Zheng, H., Liu, C., Lin, L.: SNAS: stochastic neural architecture search. arXiv preprint arXiv:1812.09926 (2018) 109. Cai, H., Gan, C., Wang, T., Zhang, Z., Han, S.: Once-for-all: Train one network and specialize it for efficient deployment. arXiv preprint arXiv:1908.09791 (2019) 110. Luo, R., Tian, F., Qin, T., Chen, E., Liu, T.Y.: Neural architecture optimization. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 7827–7838 (2018) 111. Mellor, J., Turner, J., Storkey, A., Crowley, E.J.: Neural architecture search without training. In: International Conference on Machine Learning, pp. 7588–7598. PMLR, New York (2021) 112. 
Abdelfattah, M.S., Mehrotra, A., Dudziak, Ł., Lane, N.D.: Zero-cost proxies for lightweight NAS. arXiv preprint arXiv:2101.08134 (2021) 113. Li, Y., Hao, C., Li, P., Xiong, J., Chen, D.: Generic neural architecture search via regression. Adv. Neural Inf. Proces. Syst. 34, 20476–20490 (2021) 114. Kyriakides, G., Margaritis, K.: An introduction to neural architecture search for convolutional networks. arXiv preprint arXiv:2005.11074 (2020) 115. Zoph, B., Le, Q.V.: Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578 (2016) 116. Baker, B., Gupta, O., Naik, N., Raskar, R.: Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167 (2016) 117. Zhao, Y., Wang, L., Tian, Y., Fonseca, R., Guo, T.: Few-shot neural architecture search. In: International Conference on Machine Learning, pp. 12707–12718. PMLR (2021) 118. Li, H., Eigen, D., Dodge, S., Zeiler, M., Wang, X. Finding task-relevant features for few-shot learning by category traversal. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 1–10 (2019) 119. Benmeziane, H., Maghraoui, K.E., Ouarnoughi, H., Niar, S., Wistuba, M., Wang, N.: A comprehensive survey on hardware-aware neural architecture search. arXiv preprint arXiv:2101.09336 (2021) 120. Xu, X., et al.: DAC-SDC low power object detection challenge for UAV applications. arXiv:1809.00110 (2018) 121. Stamoulis, D., Ding, R., Wang, D., Lymberopoulos, D., Priyantha, B., Liu, J., Marculescu, D.: Single-path NAS: Device-aware efficient convnet design. arXiv preprint arXiv:1905.04159 (2019) 122. Jang, E., Gu, S., Poole, B.: Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144 (2016) 123. CHaiDNN. https://github.com/Xilinx/CHaiDNN 124. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: Mobilenetv2: Inverted residuals and linear bottlenecks. 
In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4510–4520 (2018)


125. Ma, N., Zhang, X., Zheng, H.T., Sun, J.: ShuffleNet v2: Practical guidelines for efficient CNN architecture design. In: Proceedings of the European conference on computer vision (ECCV), pp. 116–131 (2018) 126. Hao, C., Chen, D.: Software/hardware co-design for multi-modal multi-task learning in autonomous systems. In: 2021 IEEE 3rd International Conference on Artificial Intelligence Circuits and Systems (AICAS), pp. 1–5. IEEE, New York (2021) 127. Talpes, E., Sarma, D.D., Venkataramanan, G., Bannon, P., McGee, B., Floering, B., Jalote, A., Hsiong, C., Arora, S., Gorti, A., et al.: Compute solution for tesla’s full self-driving computer. IEEE Micro 40(2), 25–35 (2020)

A Pedestrian Detection Case Study for a Traffic Light Controller

Alexander Wendt, Horst Possegger, Matthias Bittner, Daniel Schnöll, Matthias Wess, Dušan Malić, Horst Bischof, and Axel Jantsch

A. Wendt · M. Bittner · D. Schnöll · M. Wess · A. Jantsch
Christian Doppler Laboratory for Embedded Machine Learning, Institute of Computer Technology, TU Wien, Austria

H. Possegger · D. Malić · H. Bischof
Christian Doppler Laboratory for Embedded Machine Learning and the Institute of Computer Graphics and Vision, Graz University of Technology, Graz, Austria

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. S. Pasricha, M. Shafique (eds.), Embedded Machine Learning for Cyber-Physical, IoT, and Edge Computing, https://doi.org/10.1007/978-3-031-39932-9_4

1 Introduction

Deep Neural Networks (DNNs) are well suited to a variety of data analysis tasks such as object detection, object identification, or scene segmentation in images. Hence, combining cameras with DNN-based analysis has high potential in a number of application domains. Deploying the DNN processing close to the camera is desirable when latency is tightly constrained, when bandwidth is limited, when privacy and security are prime concerns, or for cost reasons. However, DNNs are compute and memory hungry, and tight power, cost, and size constraints mean that we usually have to find the leanest and least expensive hardware platform that can still meet the quality and precision requirements.

DNN-based machine learning in embedded applications continues to be a great challenge due to the huge and poorly understood design space. Many design decisions can be made at the network architecture, the mapping, and the platform level, but their effects on the implementation metrics are not always obvious and rarely independent of each other. For instance, structured pruning, where entire layers or filters are removed, seems to be more effective for most current hardware platforms than unstructured pruning, where individual connections are selectively removed, even if both result in the same reduction of parameters and computations [30]. Unstructured pruning leads to poor utilization of caches and processing arrays and thus to a smaller decrease in latency and power consumption than the reduction in computation would suggest. There is also a strong trend to improve hardware architectures and adapt them to the characteristics of popular DNNs [8, 51], which means that optimizations that were effective a few years ago may be less useful on more recent hardware platforms. As a result, we can observe co-design and co-evolution of DNN optimization methods and implementation platforms [8, 24, 54].

In order to better understand the specific implications of design choices and the interdependence of decisions, we have conducted a series of experiments using a pedestrian detection application for a traffic light controller.

2 Related Work

Detecting pedestrians in videos or images with object detection algorithms is a problem with a long history [10], and improving detection performance is still an active research topic. Progress has probably been accelerated most by the shift from traditional handcrafted detection pipelines, e.g., region proposal, feature extraction, and region classification, toward DNN- and Convolutional Neural Network (CNN)-based methods [49]. Autonomous driving can be seen as one of the key drivers of the rapid improvement in accuracy and reliability of pedestrian detection systems, since pedestrian detection is indispensable for ensuring the safety of people. Equally indispensable are the embedded devices that turn CNN-based object detectors into operable systems [44]. Since DNNs are known for their high computational complexity, high energy consumption, and performance issues on challenging datasets, a vast amount of research tries to improve these aspects for embedded devices.

This section therefore first gives an overview of current research on pedestrian detection, followed by publications that tackle the challenging task of comparing different combinations of hardware platforms and networks. The last subsection covers work on the quantization of DNNs for hardware platforms, which is crucial for meeting the tight memory constraints of embedded devices.

2.1 Neural Networks for Pedestrian Detection

Pedestrian detection, as a subfield of object detection, can be seen as the simultaneous spatial localization and classification of persons in images. The output of such a detection network is a set of bounding boxes, each with a corresponding class [35]. Modern CNN-based detectors generally consist of two main components: the first extracts features from the images, and the second produces the predictions of bounding boxes and classes.

The feature extraction part is usually called the backbone of the network; its main purpose is to generate an enriched representation of the input image in the form of feature maps. Depending on the structure of the backbone, the feature maps are generated layer by layer. Low-level feature maps, which normally have a higher resolution, provide more accurate information for the localization task. High-level feature maps carry more semantic information, which is important for the classification task. Advances in image classification networks, e.g., VGGNet [45] and ResNet [20], have mitigated problems such as vanishing and exploding gradients. The Feature Pyramid Network (FPN) [26] was introduced to achieve both localization accuracy and semantic richness with the help of multi-scale feature representations; such FPN blocks commonly form the neck that follows the backbone. The MobileNet family of networks [43] reduces the number of parameters and the computational complexity by replacing the standard convolution block with depth-wise separable convolutions, which in turn increases inference speed.

The detection framework builds upon the backbone and uses the extracted feature information to generate bounding boxes and classifications. At the top level, detection frameworks can be split into single-stage and two-stage detectors; historically, most architectures were first developed in a two-stage fashion. The most representative two-stage detectors are the R-CNN family [18, 41], which perform a Region of Interest (ROI) search in the first stage and localization and classification in the second stage. Despite their good accuracy, they suffer from high complexity, which makes them less suitable for embedded applications that require fast inference. In contrast, single-stage detectors, e.g., You Only Look Once (YOLO) [5, 38, 39] and Single Shot Multi-Box Detector (SSD) [28], achieve higher frame rates (FPS) by predicting class and location simultaneously, but they lose accuracy compared to two-stage detectors.

The huge design space spanned by different backbones and detection frameworks allows for the construction of many different networks. Some achieve a slight increase in accuracy, which often depends strongly on the dataset used. Others are slightly faster because they tune, e.g., the network structure, the input resolution, or the convolution kernel sizes to make good use of the hardware resources and computing units. Overall, the trend is toward using and advancing single-stage detectors, but there is a lack of qualitative performance comparisons for standard pedestrian detection networks on embedded devices [44].
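To make the parameter saving of depth-wise separable convolutions mentioned above concrete, the following minimal sketch counts the weights of a standard convolution and of its depth-wise separable replacement. The layer dimensions (3 × 3 kernel, 128 input channels, 256 output channels) and the function names are assumptions chosen for illustration only; they are not taken from MobileNet [43] or from the case study, and bias terms are ignored.

```python
def standard_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights of a standard k x k convolution (bias terms ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k: int, c_in: int, c_out: int) -> int:
    """Weights of a k x k depth-wise convolution followed by a 1 x 1 point-wise one."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes the channels
    return depthwise + pointwise


# Hypothetical layer: 3 x 3 kernel, 128 input channels, 256 output channels.
k, c_in, c_out = 3, 128, 256
std = standard_conv_params(k, c_in, c_out)
sep = depthwise_separable_params(k, c_in, c_out)
print(f"standard: {std}, separable: {sep}, reduction: {std / sep:.2f}x")
# prints "standard: 294912, separable: 33920, reduction: 8.69x"
```

For this hypothetical layer the weight count drops by a factor of roughly 8.7. The number of multiply-accumulate operations shrinks by the same ratio for a fixed output size, but the latency gain on a given platform can be smaller, since it also depends on memory access and hardware utilization.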

2.2 Pedestrian Detection on Embedded Systems

The co-design of neural networks for embedded systems and of hardware accelerators suited to pedestrian detection has developed significantly over the last few years. An early attempt to map a traditional multi-stage pedestrian detection pipeline (region proposals → feature extraction → region classification) to an embedded device was presented by Tomè et al. [49]. Their work proposes a lightweight combination of Locally Decorrelated Channel Features (LDCF) for region proposals, Aggregated Channel Features (ACF) for feature extraction, and AlexNet for region classification. On the detection task of the Caltech Pedestrian dataset [11], they achieve 405 ms per frame at an input resolution of 640 × 480 on an Nvidia Jetson TK1 development board. Two-stage detectors have a higher computational complexity and face bigger barriers when implemented on embedded hardware, which may be one of the main reasons why current research trends toward single-stage detectors.

Ding et al. [9] propose a modified architecture that improves the trade-off between local and global feature extraction. They integrate the FPN block into the SSD detection framework and introduce a Squeeze-and-Excitation network to generate pedestrian-focused feature maps. They evaluate detection performance on KITTI [16], Caltech Pedestrian [11], and ETH [14]; latency is measured on an Nvidia TITAN X and is listed in Table 1.

The lack of tiny pedestrian samples in common datasets motivated the work by Wu et al. [53]. They propose their own dataset and analyze the influence of the input resolution on precision and detection speed for YOLOv3 and YOLOv3-tiny on an Nvidia Jetson TX2. They find that precision increases approximately in proportion to the input resolution, while detection speed decreases. At an input resolution of 416 × 416, they report a detection time of 286 ms for YOLOv3 and 63 ms for YOLOv3-tiny.

Murthy et al. [34] propose a feature fusion model that adds contextual information to an optimized MobileNet + SSD architecture. The proposed model achieves 80.4% average precision on the PASCAL VOC-2007 dataset with a detection time of 29 ms at an input resolution of 300 × 300 on an Nvidia Jetson Nano board; YOLOv3-tiny, evaluated for comparison, requires 40 ms. Tsai et al. [50] present MobileNet-JDE, a lightweight multi-object tracking model for embedded systems. Inference takes 79 ms per image at an input resolution of 512 × 512 on an Nvidia Jetson AGX Xavier.

Table 1 Results of selected related work, grouped by underlying detection network architecture, i.e., SSD, YOLO, and others

Author  Platform                  Network                   Dataset                  Resolution   Latency
[4]     Intel NUC                 SSD+MobileNet             MS COCO [27]             1280 × 1024  170 ms
[4]     Nvidia Jetson TX2         SSD+MobileNet             MS COCO                  1024 × 1280  570 ms
[9]     Nvidia TITAN X            Modified SSD+FPN          KITTI [16]               500 × 1986   200 ms
[9]     Nvidia TITAN X            Modified SSD+FPN          Caltech Pedestrian       480 × 640    59 ms
[34]    Nvidia Jetson Nano        SSD+Optimized MobileNet   Caltech Pedestrian       300 × 300    29 ms
[53]    Nvidia Jetson TX2         YOLOv3                    Caltech Pedestrian       416 × 416    286 ms
[53]    Nvidia Jetson TX2         YOLOv3-tiny               Caltech Pedestrian       416 × 416    63 ms
[34]    Nvidia Jetson Nano        YOLOv3-tiny               Caltech Pedestrian       300 × 300    40 ms
[49]    Nvidia Jetson TK1         LDCF + ACF + AlexNet      Caltech Pedestrian [11]  480 × 640    405 ms
[50]    Nvidia Jetson AGX Xavier  MobileNet-JDE             Merge of 6 datasets      512 × 512    79 ms

Work on hardware comparison is rather limited. The trend is toward devising network improvements and evaluating their efficiency on one dedicated hardware platform. Mittal [32] surveys optimized implementations of deep learning models on the Nvidia Jetson platforms, giving a good overview of the deep learning architectures that have been implemented successfully on them. Some work also compares Intel CPU and Nvidia GPU platforms, e.g., Biddulph et al. [4], who integrate pedestrian detection into a humanoid robot software system and evaluate an SSD-MobileNet pretrained on MS COCO on two hardware platforms, the Intel NUC7i7BNH and the Nvidia Jetson TX2. The results are notable: the NUC7 runs about 3.5 times faster than the TX2, with an inference duration of 170 ms on the Intel platform compared to 570 ms on the TX2. However, the NUC consumes 40.52 W on average, while the Jetson requires only 9.48 W.
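The latency and power figures quoted for the two platforms can be combined into a rough energy-per-frame estimate. The sketch below does this under one loud assumption that the source does not state: the reported average power is drawn for the full duration of each inference. The function name is hypothetical, and the result is an illustration of the trade-off, not a measurement.

```python
def energy_per_frame_joules(latency_ms: float, avg_power_w: float) -> float:
    """Energy of one inference, ASSUMING the average power is drawn for the whole latency."""
    return latency_ms / 1000.0 * avg_power_w


# Figures quoted in the text for the comparison by Biddulph et al. [4].
nuc = energy_per_frame_joules(170.0, 40.52)   # Intel NUC7i7BNH
tx2 = energy_per_frame_joules(570.0, 9.48)    # Nvidia Jetson TX2

latency_ratio = 570.0 / 170.0   # how much faster the NUC is
power_ratio = 40.52 / 9.48      # how much more power the NUC draws

print(f"NUC: {nuc:.2f} J/frame, TX2: {tx2:.2f} J/frame")
print(f"latency ratio: {latency_ratio:.2f}, power ratio: {power_ratio:.2f}")
# prints "NUC: 6.89 J/frame, TX2: 5.40 J/frame"
# prints "latency ratio: 3.35, power ratio: 4.27"
```

Under this assumption, the faster platform still spends more energy per frame, because its power draw is higher by a larger factor (about 4.3) than its speed advantage (about 3.4 from the quoted latencies). Which platform is preferable therefore depends on whether latency or energy is the binding constraint.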

2.3 Quantization

Traditionally, neural networks are defined, trained, and executed in full precision. Quantization reduces the memory footprint, relative to 32-bit floating point, by a factor of 2, 4, or 8 when half precision, 8-bit integers, or 4-bit integers are used, respectively. Some modern devices offer native support for these data types, which increases computational speed compared to full precision. Nvidia's Quadro RTX 6000, for example, doubles its throughput in FLOPS when going from 32-bit floating point (FP32) to 16-bit floating point (FP16). The same factor of two in operations per second also holds when going from tensor FP16 to tensor 8-bit integer (INT8) and from INT8 to INT4.¹

Accelerating arithmetic operations with low-precision data types has the disadvantage of a loss in numerical accuracy and therefore a decrease in the network's performance. The impact depends strongly on the architecture, the target data type, and whether quantization is taken into account during or after training. Research shows that quantizing during training usually yields better results at the cost of training time, but this gap is steadily shrinking [25]. Gholami et al. [17] discuss multiple quantization approaches, e.g., uniform, non-uniform, symmetric, and asymmetric quantization.

Tools that automate the quantization process make it possible to compare multiple networks with regard to model size, inference speed, and network precision. Two such tools are Intel OpenVINO² and Nvidia TensorRT;³ as the names suggest, they were designed for the hardware sold by Intel and Nvidia, respectively. Nvidia TensorRT officially supports only Nvidia devices, as it is built on CUDA,⁴ and the inference engine of Intel OpenVINO officially supports only Intel products.⁵ Both tools provide general optimizations, such as layer fusion, as well as hardware-specific optimizations, such as optimal instruction selection.

¹ https://images.nvidia.com/aem-dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf (accessed: 2022-03-22).
² https://docs.openvino.ai/latest/index.html (accessed: 2022-03-22).
³ https://developer.nvidia.com/tensorrt (accessed: 2022-03-22).
⁴ https://developer.nvidia.com/cuda-zone (accessed: 2022-03-22).
⁵ https://docs.openvino.ai/2021.4/openvino_docs_IE_DG_Device_Plugins.html?highlight=devices (accessed: 2022-03-22).
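As an illustration of the uniform symmetric and asymmetric schemes named in the quantization discussion above, the following sketch quantizes a small list of weights to 8 bits and estimates the memory footprint at different bit widths. It is a textbook-style example in plain Python, not the procedure used by OpenVINO or TensorRT; the weight values, the 5-million-parameter model size, and all function names are hypothetical.

```python
def quantize_symmetric(values, num_bits=8):
    """Uniform symmetric quantization: real zero maps to integer 0."""
    qmax = 2 ** (num_bits - 1) - 1                 # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def quantize_asymmetric(values, num_bits=8):
    """Uniform asymmetric (affine) quantization onto [0, 2^bits - 1]."""
    qmax = 2 ** num_bits - 1                       # 255 for 8 bits
    lo = min(min(values), 0.0)                     # keep real zero representable
    hi = max(max(values), 0.0)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero_point = round(-lo / scale)
    q = [max(0, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point=0):
    """Map integers back to approximate real values."""
    return [(x - zero_point) * scale for x in q]


def footprint_mb(n_params, bits):
    """Storage for n_params weights at the given bit width, in megabytes."""
    return n_params * bits / 8 / 1e6


weights = [-0.62, 0.0, 0.31, 1.0]                  # hypothetical weights
q_sym, s_sym = quantize_symmetric(weights)
q_asym, s_asym, zp = quantize_asymmetric(weights)
print(q_sym)    # prints [-79, 0, 39, 127]
print(q_asym)   # prints [0, 98, 147, 255]

# Hypothetical model with 5 million parameters.
for bits in (32, 16, 8, 4):
    print(bits, footprint_mb(5_000_000, bits))     # 20.0, 10.0, 5.0, 2.5 MB
```

In this example the rounding error of each dequantized weight stays within half a quantization step. The asymmetric variant uses the full integer range for a skewed value distribution, at the price of carrying a zero point through the arithmetic.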


3 Pedestrian Detection Use Case Understanding human motion and behavior is the key enabling factor of many realworld computer vision systems. Impressive advances in relevant application fields, such as visual object tracking and detection, supported by deep learning techniques, allowed wide deployment of such systems. For our case study, we choose a recently developed real-world system, namely an automated pedestrian traffic light [13]. The goal of this system is to increase both traffic safety and efficiency via intent prediction: to this end, a camera system mounted on top of a traffic light pole observes pedestrian movements and infers whether the pedestrian wants to actually cross the road. Instead of simply detecting the presence of pedestrians near a crosswalk, the system is thus able to predict respective pedestrian trajectories and deduce their intention, answering the question: Is a pedestrian about to cross the road? Such a vision-based system offers several advantages over existing conventional solutions. On the one hand, the traffic light scheduler can be triggered while pedestrians are approaching the crosswalk, leading to reduced waiting times for pedestrians and avoiding halting motorized traffic when not necessary. On the other hand, the camera system allows to determine the crowd density at any time. Given this prior, the traffic light schedule can be further optimized, e.g., to allow larger crowds more time to safely clear the crosswalk. Accurately detecting humans from images and estimating their trajectories over time is crucial to the robust operation of this automated pedestrian traffic light. Predicting pedestrian trajectories heavily relies on multi-object tracking (MOT) approaches. 
More specifically, the system’s real-time requirement constrains us to online MOT approaches, which rely on probabilistic inference of the object states solely based on observations up until the current time stamp, i.e., the tracking approach cannot wait for observations from future frames to robustify the estimated trajectories. Pedestrian detection has reached notably high performance levels over the past decade, mainly due to advances in the object detection domain, e.g., [2, 19, 29, 37, 38, 40, 48]. Nowadays, detection performance is sufficiently high that most tracking approaches rely on the tracking-by-detection paradigm, which poses tracking as an association problem, e.g., [3, 36, 42, 46, 52]. Thus, the tracking approach can focus on robustly identifying corresponding detections between consecutive frames and linking them into trajectories. Similarly, the core of our investigated traffic light system is an online multi-object tracking-by-detection approach, which leverages Kalman filters [21] in combination with the Hungarian algorithm [23, 33] to fuse the detections into individual trajectories over time. As for any tracking-by-detection approach, the performance of the automated pedestrian traffic light system depends significantly on the quality of the underlying detection model. Therefore, we find this system ideal for our case study, as better detection results ensure improved safety and traffic flow.

A major challenge regarding reliable detection of pedestrians in this application setting arises from the unusual viewpoint. Motivated by the race to autonomous driving and improved automated driver assistance systems, multiple large-scale pedestrian detection datasets have been published, e.g., Caltech Pedestrians [11], KITTI [16], Waymo Open Dataset [47], nuScenes [6], etc. However, due to their low elevation viewpoints, these publicly available detection datasets almost exclusively capture front views and side views of pedestrians, as shown in Fig. 1b, whereas elevated surveillance viewpoints as in our application scenario exhibit significantly different visual appearance. Such classical surveillance viewpoints are widely used in visual tracking benchmarks, e.g., PETS’09 [15], MOT’17-04 [31], or MOT’20 [7]. While these viewpoints are more similar to our traffic light pole setting, these tracking benchmarks focus on long-range surveillance to observe longer object trajectories, as depicted in Fig. 1c. At the crosswalk, however, we are interested in short-term trajectories instead and thus need to use a top-view camera covering the area close to the crosswalk. An example of the viewpoint is illustrated in Fig. 1a.

Fig. 1 Appearance variation across different pedestrian detection datasets. (a) Traffic Light [13]. (b) Caltech Pedestrian [11]. (c) MOT’20 [7]

Such a camera viewpoint causes notable appearance variation of the pedestrians, which needs to be addressed by a robust detection model. First, there is a huge variety of body poses seen from this viewpoint: for pedestrians close to the traffic light, only the head and shoulder region is visible (Fig. 2a). Closer to the border of the field of view, the visual appearance becomes more similar to standard side-view pedestrians (Fig. 2b), although the human body axis is often not aligned with the vertical image axis. Second, illumination and seasonal conditions cause additional detection challenges, e.g., cast shadows (Fig. 2c, d), which often increase the false positive rate of the detector, or hats and umbrellas (Fig. 2e, f), which occlude major body parts. For our use case, we have to reliably detect pedestrians within the camera field of view despite all these challenges.
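The association step of the tracking-by-detection paradigm described earlier in this section can be illustrated with a small sketch. It is not the implementation of the investigated system: the predicted track positions are assumed to come from a Kalman filter prediction step that is not shown, the names `associate` and `max_dist` are hypothetical, and an exhaustive search over assignments stands in for the Hungarian algorithm, which solves the same kind of assignment problem far more efficiently.

```python
from itertools import permutations
from math import dist


def associate(predicted, detections, max_dist=50.0):
    """Match predicted track positions to detected positions.

    predicted, detections: lists of (x, y) points in pixels.
    Returns {track_index: detection_index}. Pairs farther apart than
    max_dist are never matched (a simple gating rule, assumed here).
    Exhaustive search: only sensible for a handful of objects.
    """
    n_tracks = len(predicted)
    # None slots allow a track to stay unmatched in the current frame.
    slots = list(range(len(detections))) + [None] * n_tracks
    best_key, best = None, {}
    for perm in permutations(slots, n_tracks):
        cost, match = 0.0, {}
        for track, det in enumerate(perm):
            if det is None:
                continue
            gap = dist(predicted[track], detections[det])
            if gap > max_dist:
                break  # gated out: this assignment is not allowed
            cost += gap
            match[track] = det
        else:
            # Prefer more matched tracks, then a lower total distance.
            key = (-len(match), cost)
            if best_key is None or key < best_key:
                best_key, best = key, match
    return best


# Two tracks, three detections; the third detection is a new object.
tracks = [(100.0, 100.0), (300.0, 120.0)]
dets = [(305.0, 118.0), (98.0, 104.0), (600.0, 600.0)]
matches = associate(tracks, dets)
print(matches)
```

Detections left unmatched (index 2 in the example) would typically start new trajectories, while tracks left unmatched for several frames would be terminated; both policies are outside the scope of this sketch.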
Suitable deep learning-based model architectures for this task, e.g., [1, 12, 22], have been studied in more detail by Ertler et al. [12, 13]. Our main focus in this work is to explore the design space for implementing suitable deep neural networks on hardware platforms that meet the functional and performance requirements of this real-world use case. In particular, this challenging outdoor application imposes the following requirements on the selected hardware:
• The system will be in operation 24/7 throughout the year.
• Inference latency must be ≤ 100 ms to ensure a timely response, which enables efficient and anticipatory traffic light scheduling.
• To robustify the tracker’s association step and improve the intent prediction, detections for the image streams are required at a frame rate ≥ 10 fps (i.e., throughput with a batch size of 1).
• The spatial dimensions must not exceed a volume of 3000 cm³ (approximately the size of a shoe box) to allow deployment in the corresponding switch boxes.
• Since there is no option to regulate the temperature within the electrical switch boxes, the computing hardware must withstand both summer and winter temperatures. Due to the typical mounting position and the heat buildup caused by the continuous operation, temperatures within the enclosure will never drop below 0 °C and will reach an anticipated maximum of 45 °C during summer.
• There are no tight power constraints because a wide range of power supply options is available both at the switch box and at the traffic light pole. However, the spatial dimensions and internal working temperatures impose an additional, but implicit, power constraint.
• The hardware cost should be ≤ 600 € per device. In order to observe both ends of a crosswalk, two devices would be installed.

Fig. 2 Samples of the pedestrian detection dataset depicting diverse challenges, which a detector has to overcome in our application use case

Besides meeting these hardware constraints, the goal of this study is to find the best detection model. For a qualitative comparison of all evaluated models, we use the mean average precision (mAP) score at 0.5 IoU (intersection over union), computed over a custom collected dataset which captures all of the aforementioned challenges. Furthermore, to ensure the safety of pedestrians, we need to consider detection failures (i.e., false positives and false negatives) in combination with the observed pedestrian density to find a suitable model. For example, false negatives must be penalized drastically if there are only very few pedestrians near the crosswalk, i.e., to avoid letting pedestrians wait for too long, in which case they are likely to jaywalk and thus put themselves at risk. Similarly, false alarm rates (i.e., frequent false positives) should also be as low as possible to avoid unnecessarily disrupting the traffic flow by stopping the motorists even though no pedestrian is crossing the road.
In more crowded scenarios, on the other hand, both false negatives and false positives are less critical, as these detection errors would only result in a minor miscalculation of the crowd size, which can easily be addressed by corresponding settings in the traffic light controller. We ensure that errors in low-density scenarios are more intensely penalized in our evaluation by carefully collecting examples for our dataset such that these critical scenarios are included more frequently.
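To make the evaluation criterion concrete, the following sketch counts true positives, false positives, and false negatives for a single image at an IoU threshold of 0.5. It is a minimal illustration, not the evaluation code used in this study: the function names `iou` and `count_errors`, the corner-based box format, and the greedy matching in order of descending confidence (a common convention in detection benchmarks) are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_errors(detections, ground_truth, iou_thr=0.5):
    """Count (tp, fp, fn) for one image.

    detections: list of (confidence, box); ground_truth: list of boxes.
    Each ground-truth box can be matched by at most one detection.
    """
    matched = set()
    tp = fp = 0
    # Visit detections from most to least confident.
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best_j, best_iou = None, iou_thr
        for j, gt in enumerate(ground_truth):
            if j in matched:
                continue
            overlap = iou(box, gt)
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j is None:
            fp += 1  # no sufficiently overlapping pedestrian: false alarm
        else:
            matched.add(best_j)
            tp += 1
    fn = len(ground_truth) - len(matched)  # missed pedestrians
    return tp, fp, fn


# Two pedestrians; one is detected, one is missed, one detection is spurious.
gt_boxes = [(0, 0, 10, 10), (50, 50, 60, 60)]
dets = [(0.9, (1, 1, 11, 11)), (0.8, (200, 200, 210, 210))]
print(count_errors(dets, gt_boxes))  # (1, 1, 1)
```

Average precision is then obtained by sweeping the confidence threshold and integrating precision over recall; with a single class ("person"), the mAP reduces to this one AP value.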

4 Results
We have studied 18 network variants from the Yolo, ResNet, MobileNet, and EfficientDet families, and six hardware platforms with several different hardware optimizers and image resolutions. Altogether, we have run 383 experiments that are the basis of the results shown in this section. Note, however, that not every possible combination is covered, mostly for practical reasons. Sometimes, it was simply not feasible to get a specific network-platform-optimizer combination to work. Still, the design space as defined by the given networks and HW platforms is well covered, and we consider our conclusions within this given design space as robust, even though there is no guarantee that the optimal solution for a given objective function is among the 383 solutions. First, in Sect. 4.1, we provide the details of the experimental setup, i.e., the dataset, the platforms, and the networks. Then, in the subsequent sections, we study first the overall best solutions and then solutions under specific constraints.

Fig. 3 The experimentation flow

4.1 Experimentation Setup
Figure 3 shows the flow used in our experiments.

Dataset
The dataset contains 19087 images for training and 13184 images for validation. Only one class has to be identified (“person”). The number of positive images is 9009 in the training set and 3483 in the validation set. Image resolution is 1280 × 720 pixels. All training has been conducted on a powerful Tesla V100 server. The inference has been run on the target device of each experiment. However, the inference of the full validation set on the individual target devices turned out to be infeasible, as it takes four hours on the Xavier platform for one network. Therefore, we took only 10% of the images of the validation set, randomly selected, for validation runs on the target devices.

Platforms
Table 2 lists the platforms used in the experiments. Their raw performance ranges from 0.5 to 32 TOp/s with fairly different architectures. Note that some of the figures denote INT8 operations, while others denote FP16 or FP32. For standard post-training quantization and hardware-specific optimizations, we use OpenVINO for the Intel platforms, where floating-point 32 bits (OVFP32) and floating-point 16 bits (OVFP16) are used. For Nvidia platforms, we make use of TensorRT quantization options, floating-point 32 bits (TRTFP32), floating-point 16 bits (TRTFP16), and integer 8 bits (TRTINT8).

Table 2 Platforms with maximum performance and available memory as used in the experiments. The figures only show the approximate performance and cost of the used platforms. Depending on the platform, the performance figures denote floating-point or integer operations. The costs denote circa figures since the actual prices vary

| Name | Performance [TOp/s] | Memory [GB] | Power [W] | Cost [€] |
|---|---|---|---|---|
| Nvidia Xavier AGX⁶ | 32 | 16 | 10 to 30 | 800 |
| Nvidia Jetson TX2⁷ | 1.3 | 4 | 7.5 to 15 | 260 |
| Nvidia Jetson Nano⁸ | 0.5 | 4 | 5 to 10 | 120 |
| Intel NCS2⁹ | 1 | 0.5 | 5 | 80 |
| Intel NUC CPU (i7-8650U)¹⁰ | 22.4 | 32 | 15 | 600 |
| Intel NUC GPU (Intel UHD 620)¹⁰ | 0.8 | 32 | 15 | 600 |

Networks
We have used 13 networks in our experiments, which are listed in Table 3 in increasing order of model size. The number of parameters is in the range of 2.8 to 87 million.

Table 3 DNNs used with model size in increasing order. The number of operations (last column) is given for images with 640 × 640 resolution

| Name | Framework used | No. of parameters (×10⁶) | GOPS@640 |
|---|---|---|---|
| SSD MobileNet v2 FPNLite | TensorFlow | 2.8 | 6.8 |
| EfficientDet-d0 | TensorFlow | 3.9 | 7.3 |
| SSD MobileNet v2 | TensorFlow | 4.5 | 6.1 |
| Yolo v5s | PyTorch | 7.0 | 16.5 |
| Yolo v3tiny | PyTorch | 8.6 | 12.9 |
| Yolo v5m | PyTorch | 21.0 | 49.0 |
| Yolo v5l | PyTorch | 46.6 | 109.1 |
| SSD ResNet50 v1 FPN | TensorFlow | 50.7 | 157.7 |
| Yolo v3 | PyTorch | 61.4 | 154.8 |
| Yolo v3 spp | PyTorch | 62.5 | 155.7 |
| SSD ResNet101 v1 FPN | TensorFlow | 69.7 | 218.4 |
| SSD ResNet152 v2 FPN | TensorFlow | 85.3 | 279.2 |
| Yolo v5x | PyTorch | 87.1 | 205.5 |

Scope of the Experiments
Given the set of networks, platforms, optimization options, and image resolutions, there are approximately 3500 possible combinations. However, some are not possible in principle (e.g., there is no INT8 quantization option available for the Intel platforms), and many have turned out to be infeasible for practical reasons. The DNN networks, frameworks, and platforms we have used have many incompatibility issues, and although we could solve some of them with available converter tools and our own scripts,¹¹ many combinations were still unobtainable in our case study in the given time frame. Hence, these experiments and their results should be seen as an attempt to cover the design space, as defined by the given networks, platforms, and frameworks, as well as possible with a standard flow that we could make work with reasonable effort. Altogether, we report here the results of 383 experiments, which is only about 10% of the complete space. We have spent significant effort to cover all important and interesting areas in this design space, but there is no guarantee that we have not missed any interesting experiments with superior solutions. On the contrary, we are convinced that, given enough time and effort, more interesting results, trends, and correlations can be identified. Hence, the following results should be seen as a preliminary study which hopefully provides some interesting insights.

Table 4 The tools and optimization options used in the experiments

| Symbol | Framework | Quantization |
|---|---|---|
| TRTINT8 | Nvidia TensorRT | Integer 8 bit |
| TRTFP16 | Nvidia TensorRT | Floating-point 16 bit |
| TRTFP32 | Nvidia TensorRT | Floating-point 32 bit |
| OV16 | Intel’s OpenVINO | Floating-point 16 bit |
| OV32 | Intel’s OpenVINO | Floating-point 32 bit |

6 https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-agx-xavier/ (accessed: 2022-02-27).
7 https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-tx2/ (accessed: 2022-02-27).
8 https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-nano (accessed: 2022-02-27).
9 https://www.intel.com/content/www/us/en/developer/tools/neural-compute-stick/overview.html (accessed: 2022-03-04).
10 https://www.intel.com/content/www/us/en/products/details/nuc.html (accessed: 2022-03-04).
11 https://github.com/embedded-machine-learning/scripts-and-guides.

4.2 No Constraints
We start our analysis assuming that we minimize latency and maximize precision without further constraints. It turns out that the variation across the 383 solutions with respect to these two figures of merit is quite significant. The plot in Fig. 4 depicts all Pareto optimal solutions. Table 4 lists the symbols we use in the following plots for identifying which framework and quantization option has been used. We note that in this Pareto optimal set, Yolo v5 is heavily represented. In addition, MobileNet v2 is present at the low-latency end of the scale. With 5.8 ms, Yolo v5s running on Xavier features the lowest latency among all solutions. The group at the lower left part of the plot shows rather poor precision and is unlikely to be considered in a real application unless the latency constraint is very tight and the precision requirements modest. At the upper right corner of the plot, we have Yolo v5l networks running on Jetson TX2 and Jetson Nano featuring the highest precision of 0.95 mAP and a latency of 444 ms and 1236 ms, respectively. At the upper left part of the plot, we see a few solutions that probably constitute a reasonable compromise. In that group, the latency ranges between 7.3 ms and 87 ms, while the precision is between 0.94 and 0.953 mAP. They all use Yolo v5.

Fig. 4 The Pareto optimal solutions when all experiments and no constraints are considered. ¹ With image resolution 640 × 360. ² With image resolution 1280 × 720
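For readers who want to reproduce this kind of analysis on their own measurements, a Pareto optimal set for the two objectives used here (minimize latency, maximize mAP) can be extracted as in the following sketch. The function name `pareto_front` and all data points are hypothetical placeholders; they are not results from this study.

```python
def pareto_front(solutions):
    """Return the non-dominated solutions, sorted by latency.

    solutions: list of (name, latency_ms, map_score).
    A solution is dominated if another one is at least as good in both
    objectives (lower latency, higher mAP) and strictly better in one.
    """
    front = []
    for name, lat, acc in solutions:
        dominated = any(
            other_lat <= lat and other_acc >= acc
            and (other_lat < lat or other_acc > acc)
            for _, other_lat, other_acc in solutions
        )
        if not dominated:
            front.append((name, lat, acc))
    return sorted(front, key=lambda s: s[1])


# Placeholder measurements (illustrative values only).
candidates = [
    ("A", 6.0, 0.85),
    ("B", 10.0, 0.94),
    ("C", 12.0, 0.90),   # slower and less precise than B
    ("D", 90.0, 0.95),
    ("E", 400.0, 0.95),  # same precision as D but slower
]
for name, lat, acc in pareto_front(candidates):
    print(f"{name}: {lat} ms, {acc} mAP")
```

The quadratic pairwise comparison is adequate for a few hundred experiments; for much larger result sets, sorting by latency and sweeping once over the precision values would be the usual optimization.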

4.3 Cost Constraints The Nvidia platform Xavier is not only powerful but also expensive and power hungry. Assuming it is excluded due to cost reasons, the Pareto optimal set of the remaining solutions is shown in Fig. 5.


Fig. 5 The Pareto optimal solutions excluding the Xavier platform. ¹ With image resolution 640 × 360. ² With image resolution 1280 × 720

Again, we consider the three regions in the plot in turn: lower left, upper right, and upper left. In the lower left part, we have three MobileNet v2 solutions running on the Intel NUC CPU platform. They exhibit latency between 6.4 and 8.8 ms and a rather poor precision between 0.84 and 0.89 mAP. In the upper right part, there is only one Yolo v5 solution, running on the Jetson Nano platform. With a latency of 1236 ms, it needs almost three times the inference time of the second slowest solution in this Pareto plot. The group in the upper left part constitutes a sweet spot and consists only of Yolo v5 networks running on the Intel NUC GPU, TX2, and Jetson Nano platforms. They all show rather high precision above 0.94 mAP, while their latency varies from 26.7 ms to 444 ms.

4.4 Cost, Latency, and Precision Constraints We tighten our requirements and allow only for a maximum latency of 100 ms and a minimum precision of 0.9 mAP, in addition to the cost constraint of the previous section. Since there are only 20 solutions left, we display all of them in Fig. 6.


Fig. 6 The solutions under a maximum latency constraint of 100 ms, a minimum precision constraint of 0.9 mAP, and excluding the Xavier platform

Only one variant of Yolo v5 qualifies running on Nvidia’s Jetson Nano architecture. All solutions on the Pareto front are Yolo v5 networks mainly running on TX2 and NUC GPU platforms, but the Jetson Nano solution is also Pareto optimal. None of the MobileNet v2 solutions make it to the Pareto front due to too low precision.
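The selection applied in this section amounts to a simple filter over the experiment results. The sketch below assumes a hypothetical `Result` record and a `feasible` helper; the thresholds are the ones stated above (latency at most 100 ms, precision at least 0.9 mAP, Xavier excluded), whereas the example entries are illustrative placeholders rather than measured values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    network: str
    platform: str
    latency_ms: float
    map_score: float


def feasible(results, max_latency_ms=100.0, min_map=0.9, excluded=("Xavier",)):
    """Keep results that satisfy the cost, latency, and precision constraints."""
    return [
        r for r in results
        if r.platform not in excluded      # cost constraint: platform excluded
        and r.latency_ms <= max_latency_ms  # latency constraint
        and r.map_score >= min_map          # precision constraint
    ]


# Placeholder entries (illustrative values only).
experiments = [
    Result("net-1", "Xavier", 6.0, 0.94),        # excluded for cost reasons
    Result("net-1", "TX2", 88.0, 0.94),          # satisfies all constraints
    Result("net-2", "Intel NUC CPU", 7.0, 0.85),  # precision too low
    Result("net-3", "Jetson Nano", 1200.0, 0.95),  # latency too high
]
for r in feasible(experiments):
    print(r)
```

Feeding the filtered list into a Pareto extraction then yields the front shown for the constrained case.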

4.5 Effect of Resolution and Quantization
We have studied the effect of image resolution and hardware-specific quantization on latency and precision. Figure 7a shows the overall results regarding resolution. On average, higher resolution increases latency only weakly, and its effect on precision is even weaker. Each hardware platform comes with tools and limited quantization options, ranging from INT8 to FP16 and FP32. We have studied the effects of using those options. There is a clear general correlation with latency, but precision is hardly affected, as shown in Fig. 7b. Hence, as a general rule, the smallest data type should be used. Note that Fig. 7 shows a summary over all our experiments as we conducted them. It does not mean that higher resolution images cannot be used to obtain better precision if an appropriate methodology is applied, and we do not want to generalize to other settings. But we observe that in our experiments the correlation between image resolution and both precision and latency is weak. Also, Fig. 7 displays average figures over different networks and platforms. To better understand the effects of image resolution and HW-specific optimizations, we have separated the effect on latency for different platforms and for specific networks in Figs. 8, 9, and 10.

Fig. 7 Effect of resolution and HW-dependent optimizations. (a) For each image resolution studied, the latency and mAP give the average values over all solutions. (b) For different quantization options of the tools, the latency and mAP are given as the average values over all solutions

Investigating the huge design space of our experiments by analyzing multiple networks, on Intel and Nvidia platforms, with different resolutions and hardware-specific quantization options makes it difficult to draw overall conclusions about their influence on precision and latency. This is due to the fact that each hardware-network-resolution-quantization option will perform differently. In addition, there are also combinations that are not realizable on the hardware. For further analysis, we consider two Pareto optimal networks, SSD MobileNet v2 FPNLite and Yolo v5s. We primarily show the influence of resolution and quantization on latency, since precision is hardly influenced, as Fig. 7 suggests. Figure 8 shows the impact of resolution and HW optimization for the Intel NUC and the networks Yolo v5s and SSD MobileNet v2 FPNLite. Clearly, here a higher resolution leads to longer latency. Also, quantization to 16-bit FP significantly decreases latency for the GPU. It does not have any effect on the CPU because the CPU does not support 16-bit FP operations, which means it uses the same operations as for 32-bit FP. We show here only two networks, but the effect on other Yolo network variants is very similar.

Fig. 8 The effect on the latency of different resolutions and quantization options evaluated for the two Pareto optimal networks on Intel platforms

The same relation for the Nvidia platforms is shown in Figs. 9 and 10. However, for MobileNet v2, no results for TRTFP16 and TRTINT8 could be obtained in our experiments, because we were not able to conduct the necessary conversions. Thus, for MobileNet v2, only the TRTFP32 results are shown in Fig. 9, while for Yolo v5s the results for all three quantizations are shown, except INT8 on the TX2 and the Nano, which again could not be obtained in the experiments. The effects of resolution and quantization on latency are again strong and unambiguous. We note that the speedup when moving from FP32 to FP16 is significant for all platforms and networks and lies in the range between 1.5x and 2.3x. The speedup due to lower image resolution is more varied but potentially higher, ranging between 2x and 10x in our experiments.

Fig. 9 The effect on the latency of different resolutions evaluated for SSD MobileNet v2 FPNLite on Nvidia platforms. Results for FP16 and INT8 quantizations were not obtained in the experiments


Fig. 10 The effect on the latency of different resolutions and quantization options evaluated for Yolo v5s on Nvidia platforms. TRTINT8 quantizations for Jetson Nano and TX2 were not obtained in the experiments
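The averages summarized in Fig. 7b and the speedup factors quoted above are straightforward to compute from a list of experiment records. The following sketch assumes a hypothetical record layout (dictionaries with the keys `option`, `latency_ms`, and `map`) and uses placeholder numbers, not measurements from this study.

```python
from collections import defaultdict
from statistics import mean


def average_by_option(results):
    """Average latency and mAP per quantization option.

    results: list of dicts with keys "option", "latency_ms", "map".
    Returns {option: (mean_latency_ms, mean_map)}.
    """
    groups = defaultdict(list)
    for r in results:
        groups[r["option"]].append(r)
    return {
        option: (
            mean(r["latency_ms"] for r in rows),
            mean(r["map"] for r in rows),
        )
        for option, rows in groups.items()
    }


def speedup(baseline_ms, optimized_ms):
    """Latency ratio; values above 1 mean the optimized variant is faster."""
    return baseline_ms / optimized_ms


# Placeholder records (illustrative values only).
records = [
    {"option": "TRTFP32", "latency_ms": 40.0, "map": 0.94},
    {"option": "TRTFP32", "latency_ms": 60.0, "map": 0.95},
    {"option": "TRTFP16", "latency_ms": 20.0, "map": 0.94},
    {"option": "TRTFP16", "latency_ms": 30.0, "map": 0.95},
]
averages = average_by_option(records)
print(speedup(averages["TRTFP32"][0], averages["TRTFP16"][0]))  # 2.0
```

Averaging across networks and platforms hides per-configuration differences, which is why the separate per-platform plots above are more informative than the aggregate.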

5 Conclusion
Regarding networks, Yolo v5 is certainly the main winner in this case study, achieving an excellent compromise with low latency and high precision. Regarding platforms, four platforms are reasonable choices: Intel NUC GPU, Intel NUC CPU, Jetson Nano, and Jetson TX2. Under the given constraints of 100 ms maximum latency and 0.9 minimum mAP, Fig. 6 shows the main candidates, all of which are Yolo v5 or MobileNet v2 networks; note that the Yolo v5 networks exhibit consistently higher precision than the MobileNet v2 solutions. For very low latency under 50 ms, TX2 and Intel NUC GPU are preferable platforms with an image resolution of 640 × 360 pixels. If the Intel NUC is considered, it should be used with the GPU and with FP16 quantization. If high precision is prioritized, TX2 is the winner in this group, delivering 0.94 mAP at 88 ms. As elaborated above, we consider this study as preliminary in the sense that we have covered only part of the overall design space as defined by the set of networks, platforms, and optimization options. We believe that we have uncovered several useful correlations, or lack of correlations, and formulated some relevant conclusions, but we expect further lessons can be learned as we study the design space in more detail.

References
1. Angelova, A., Krizhevsky, A., Vanhoucke, V.: Pedestrian Detection with a Large-Field-Of-View Deep Network. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) (2015)
2. Angelova, A., Krizhevsky, A., Vanhoucke, V., Ogale, A., Ferguson, D.: Real-Time Pedestrian Detection With Deep Network Cascades. In: Proceedings of the British Machine Vision Conference (BMVC) (2015)
3. Bewley, A., Ge, Z., Ott, L., Ramos, F., Upcroft, B.: Simple Online and Realtime Tracking. In: Proceedings of the IEEE International Conference on Image Processing (ICIP) (2016)
4. Biddulph, A., Houliston, T., Mendes, A., Chalup, S.K.: Comparing computing platforms for deep learning on a humanoid robot. In: ICONIP (2018)
5. Bochkovskiy, A., Wang, C.Y., Liao, H.Y.M.: Yolov4: Optimal speed and accuracy of object detection. ArXiv, abs/2004.10934 (2020)
6. Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., Beijbom, O.: nuScenes: A multimodal dataset for autonomous driving. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
7. Dendorfer, P., Rezatofighi, H., Milan, A., Shi, J., Cremers, D., Reid, I., Roth, S., Schindler, K., Leal-Taixé, L.: MOT20: A benchmark for Multi Object Tracking in Crowded Scenes. arXiv CoRR, abs/1906.04567 (2020)
8. Deng, L., Li, G., Han, S., Shi, L., Xie, Y.: Model compression and hardware acceleration for neural networks: A comprehensive survey. Proc. IEEE 108(4), 485–532 (2020)
9. Ding, L., Wang, Y., Laganière, R., Luo, X., Huang, D., Zhang, H.: Learning efficient single stage pedestrian detection by squeeze-and-excitation network. Neural Comput. Applic. 33(23), 16697–16712 (2021)
10. Dollar, P., Wojek, C., Schiele, B., Perona, P.: Pedestrian detection: A benchmark. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 304–311 (2009)
11. Dollár, P., Wojek, C., Schiele, B., Perona, P.: Pedestrian Detection: An Evaluation of the State of the Art. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 34(4), 743–761 (2012)
12. Ertler, C., Possegger, H., Opitz, M., Bischof, H.: Pedestrian Detection in RGB-D Images from an Elevated Viewpoint. In: Proceedings of the Computer Vision Winter Workshop (CVWW) (2017)
13. Ertler, C., Possegger, H., Opitz, M., Bischof, H.: An Intent-Based Automated Traffic Light for Pedestrians. In: Proceedings of the IEEE International Conference on Advanced Video and Signal based Surveillance (AVSS) (2018)
14. Ess, A., Leibe, B., Van Gool, L.: Depth and appearance for mobile scene analysis. In: 2007 IEEE 11th International Conference on Computer Vision, pp. 1–8 (2007)
15. Ferryman, J.M., Shahrokni, A.: PETS 2009: Dataset and Challenge. In: Proceedings of the IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (Winter-PETS) (2009)
16. Geiger, A., Lenz, P., Urtasun, R.: Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2012)
17. Gholami, A., Kim, S., Dong, Z., Yao, Z., Mahoney, M.W., Keutzer, K.: A survey of quantization methods for efficient neural network inference. CoRR, abs/2103.13630 (2021)
18. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587 (2014)
19. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Region-based convolutional networks for accurate object detection and segmentation. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 38(1), 142–158 (2016)
20. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition (2015)
21. Kálmán, R.E.: A New Approach to Linear Filtering and Prediction Problems. J. Basic Eng. 82(1), 35–45 (1960)
22. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet Classification with Deep Convolutional Neural Networks. In: Proceedings of the Conference on Neural Information Processing Systems (NeurIPS) (2012)
23. Kuhn, H.W.: The Hungarian Method for the Assignment Problem. Naval Research Logistics Quarterly 2, 83–97 (1955)


24. Li, Y., Hao, C., Zhang, X., Liu, X., Chen, Y., Xiong, J., Hwu, W.m., Chen, D.: EDD: Efficient differentiable DNN architecture and implementation co-search for embedded AI solutions. In: 2020 57th ACM/IEEE Design Automation Conference (DAC), pp. 1–6 (2020)
25. Li, Y., Gong, R., Tan, X., Yang, Y., Hu, P., Zhang, Q., Yu, F., Wang, W., Gu, S.: BRECQ: pushing the limit of post-training quantization by block reconstruction. CoRR, abs/2102.05426 (2021)
26. Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 936–944 (2017)
27. Lin, T.Y., Maire, M., Belongie, S.J., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: ECCV (2014)
28. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: SSD: Single shot multibox detector. In: ECCV (2016)
29. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: SSD: Single Shot MultiBox Detector. In: Proceedings of the European Conference on Computer Vision (ECCV) (2016)
30. Ma, X., Lin, S., Ye, S., He, Z., Zhang, L., Yuan, G., Tan, S., Fan, D., Qian, X., Lin, X., Ma, K., Wang, Y.: Non-structured DNN weight pruning–is it beneficial in any platform? In: IEEE Transactions on Neural Networks and Learning Systems, pp. 1–15 (2021)
31. Milan, A., Leal-Taixé, L., Reid, I., Roth, S., Schindler, K.: MOT16: A Benchmark for Multi-Object Tracking. arXiv CoRR, abs/1603.00831 (2016)
32. Mittal, S.: A survey on optimized implementation of deep learning models on the NVIDIA jetson platform. J. Syst. Archit. 97, 428–442 (2019)
33. Munkres, J.: Algorithms for the Assignment and Transportation Problems. J. Soc. Ind. Appl. Math. 5(1), 32–38 (1957)
34. Murthy, C.B., Hashmi, M.F., Keskar, A.G.: Optimized MobileNet + SSD: a real-time pedestrian detection on a low-end edge device. International Journal of Multimedia Information Retrieval 10(3), 171–184 (2021)
35. Papageorgiou, C.P., Oren, M., Poggio, T.: A general framework for object detection. In: Sixth International Conference on Computer Vision (IEEE Cat. No.98CH36271), pp. 555–562 (1998)
36. Possegger, H., Mauthner, T., Roth, P.M., Bischof, H.: Occlusion Geodesics for Online Multi-Object Tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2014)
37. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You Only Look Once: Unified, Real-Time Object Detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
38. Redmon, J., Farhadi, A.: YOLO9000: Better, Faster, Stronger. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
39. Redmon, J., Farhadi, A.: Yolov3: An incremental improvement. ArXiv, abs/1804.02767 (2018)
40. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In: Proceedings of the Conference on Neural Information Processing Systems (NeurIPS) (2015)
41. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39(6), 1137–1149 (2017)
42. Sadeghian, A., Alahi, A., Savarese, S.: Tracking The Untrackable: Learning To Track Multiple Cues with Long-Term Dependencies. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2017)
43. Sandler, M., Howard, A.G., Zhu, M., Zhmoginov, A., Chen, L.C.: Mobilenetv2: Inverted residuals and linear bottlenecks. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4510–4520 (2018)
44. Sha, M., Boukerche, A.: Performance evaluation of CNN-based pedestrian detectors for autonomous vehicles. Ad Hoc Netw. 128, 102784 (2022)


45. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition (2015)
46. Solera, F., Calderara, S., Cucchiara, R.: Learning to Divide and Conquer for Online Multi-Target Tracking. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2015)
47. Sun, P., Kretzschmar, H., Dotiwalla, X., Chouard, A., Patnaik, V., Tsui, P., Guo, J., Zhou, Y., Chai, Y., Caine, B., et al.: Scalability in perception for autonomous driving: Waymo open dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
48. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J.: Rethinking the Inception Architecture for Computer Vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
49. Tomè, D., Monti, F., Baroffio, L., Bondi, L., Tagliasacchi, M., Tubaro, S.: Deep convolutional neural networks for pedestrian detection. Signal Process. Image Commun. 47, 482–489 (2016)
50. Tsai, C.Y., Su, Y.K.: MobileNet-JDE: a lightweight multi-object tracking model for embedded systems. In: Multimedia Tools and Applications (2022)
51. Wang, E., Davis, J.J., Zhao, R., Ng, H.-C., Niu, X., Luk, W., Cheung, P.Y.K., Constantinides, G.A.: Deep neural network approximation for custom hardware: Where we’ve been, where we’re going. CoRR, abs/1901.06955 (2019)
52. Wojke, N., Bewley, A., Paulus, D.: Simple Online and Realtime Tracking with a Deep Association Metric (2017)
53. Wu, J., Men, Y., Chen, D.: Lightweight network and parallel computing for fast pedestrian detection. Int. J. Circuit Theory Appl. 49(4), 1040–1049 (2021)
54. Zhang, X., Lu, H., Hao, C., Li, J., Cheng, B., Li, Y., Rupnow, K., Xiong, J., Huang, T., Shi, H., Hwu, W.M., Chen, D.: SkyNet: a hardware-efficient method for object detection and tracking on embedded systems. In: Dhillon, I., Papailiopoulos, D., Sze, V. (eds.) Proceedings of Machine Learning and Systems, vol. 2, pp. 216–229 (2020)

How to Train Accurate BNNs for Embedded Systems?

F. A. M. de Putter and Henk Corporaal

F. A. M. de Putter · H. Corporaal
Eindhoven Artificial Intelligence Systems Institute and PARsE lab, Eindhoven University of Technology, Eindhoven, The Netherlands
e-mail: [email protected]; [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
S. Pasricha, M. Shafique (eds.), Embedded Machine Learning for Cyber-Physical, IoT, and Edge Computing, https://doi.org/10.1007/978-3-031-39932-9_5

1 Introduction

Deep learning architectures have achieved state-of-the-art results in a variety of fields, e.g., computer vision, speech recognition, and natural language processing, in which they surpass (or are comparable to) human expert performance. Within the popular domain of computer vision, Convolutional Neural Networks (CNNs) have proven successful and are thus widely used. Nevertheless, CNNs require massive amounts of memory and computation for both training and inference. While training can be done offline on powerful GPU-based machines, inference is expected to be done at the edge on resource-constrained embedded systems because of data privacy and strict latency and reliability constraints. State-of-the-art CNNs have billions of parameters and perform billions of operations, making them both computationally and memory intensive and therefore hard to deploy on resource-constrained embedded systems.

Fortunately, there is a lot of room for optimization. Key optimizations are compression and code transformations. Compression reduces the model size of CNNs with a minor loss in accuracy compared to the original model. High compression can be achieved by quantization and/or pruning. Quantization reduces the bit width of both the parameters and the features of a CNN, whereas pruning removes redundant connections and/or neurons. Code transformations enable more energy-efficient mappings of CNNs on embedded systems. A proper CNN mapping makes sure many weights and features are cached on-chip and reused many times before new data is loaded from off-chip, power-hungry DRAM. Therefore, a proper mapping is key to running CNNs on embedded systems.

The focus of this chapter is on extremely quantized CNNs, specifically Binary Neural Networks (BNNs). This is a relatively new area, since the first BNN [9] was published in 2016. BNNs achieve a very high rate of compression by quantizing both features and weights to a single bit. This saves memory and simplifies computation tremendously, which potentially makes BNNs the holy grail for deploying CNNs on resource-constrained embedded systems. However, this extreme form of quantization inescapably causes severe accuracy loss. Moreover, the intrinsic discontinuity of BNNs makes them difficult to train. Fortunately, many training techniques and network topology changes have been proposed that aim to reduce this accuracy loss. This chapter presents a survey of these training techniques and network topology changes.

The problem, however, is that papers on BNNs are not aligned on which improvements should be used to obtain high-accuracy BNNs. Typically, each paper proposes its own improvements, which are not necessarily in line with, and may even contradict, previous work. Therefore, we isolate improvements and empirically evaluate them on benchmarks for image classification. Based on the outcomes of this study, we provide directions for future research. In short, the contributions of this work are:

• Classification and overview of BNN accuracy repair methods under various dimensions.
• Evaluation of the benefits of isolated improvements on two benchmarks: CIFAR10 on ResNet-20 and CIFAR100 on ResNet-18.
• Indication of future directions for researching accuracy repair techniques that allow high-accuracy BNNs.

The remainder of this chapter is structured as follows: the next section reviews related work, while Sect. 3 presents background information on BNNs. Sections 4 and 5 present, respectively, the classification and the overview of accuracy repair methods. In Sect. 6, the empirical review is conducted, and the results are shown. Section 7 discusses the results and indicates future research directions. Finally, in Sect. 8, conclusions are drawn.

2 Related Work

This chapter consists of a survey and an empirical study. With respect to the survey part, three surveys on BNNs have been published previously [33, 40, 51], in 2019, 2020, and 2021, respectively. Since the BNN field is relatively new, it is evolving quickly, and many new papers are published each year. We therefore feel that it is reasonable to present another overview that is updated with new accuracy repair methods. Contrary to the most recent survey [51], we employ a hierarchical classification with different metrics that allows us to present a complete overview in a single table, in which each work may contain several accuracy repair techniques. In addition, none of the previously published surveys includes an empirical study on the benefits of individual repair methods.

With respect to the empirical study, there are four works [1, 4, 29, 42] that research a limited part of our design space. Tang et al. [42] study the effect of certain learning rates and propose to replace weight decay by a regularization loss term on the absolute distance of the latent weights to one. Furthermore, they suggest using a PReLU activation function rather than a ReLU. Alizadeh et al. [1] study the difference between the ADAM and the SGD optimizer, the effect of gradient clipping, and the momentum term in batch normalization. Bethge et al. [4] conduct experiments on a scaling factor that is derived from features and weights. In addition, they study the clipping interval of the binarizer and use the double residual repair method in all experiments. Liu et al. [29] aim to explain why the ADAM optimizer performs better than SGD in their experiments. Additionally, they investigate why the two-stage training scheme is beneficial in their experiments.

Contrary to those studies, we first establish a good baseline without any repairs applied to it. Furthermore, each experiment in our design space is run five times with different random seeds, such that claims can be made about the deviation in accuracy of a repair method. Lastly, as our design space is considerably larger, our study aims to provide the full picture of the benefits of individual repair methods.

3 Background on BNNs

State-of-the-art CNNs need a lot of memory and compute power, which often makes them unsuitable for resource-constrained embedded systems. Quantization is one optimization that may enable CNNs on embedded systems. As the holy grail of quantization, BNNs achieve a 32x compression ratio, compared to 32-bit floating-point CNNs, by quantizing both features and weights to either $+1$ or $-1$ such that they can be stored in a single bit. Section 3.1 explains the inference process, while Sect. 3.2 presents the training process of BNNs.

3.1 Inference

To quantize features or weights from real to binary values, the sign-function is generally used:

\[ x_B = \operatorname{sign}(x_R) = \begin{cases} +1 & \text{if } x_R \geq 0 \\ -1 & \text{otherwise} \end{cases} \tag{1} \]

Fig. 1 An example of the multiply–accumulate operation using binary values: a combination of bitwise XNOR and popcount. [Figure panels: four float32 values (X_R, W_R) processed by Multiply and Accumulate, and the same values as four 1-bit values (X_B, W_B) processed by XNOR and PopCount, where #pos - #neg = 2*PopCount - bitwidth.]

where $x_B$ is the binarized value and $x_R$ the real value. Binarization not only saves storage space but also simplifies the compute logic. Whereas the convolution layers of a CNN (and others, such as fully connected layers) require many real-valued multiplications and accumulations, a binary convolution can be implemented using XNOR and popcount operations (i.e., counting the number of 1's in a bitstring). An example is given in Fig. 1. Note that $-1$ is encoded as a 0. Replacing a real-valued MAC by XNOR-popcount greatly improves energy efficiency. For example, Andri et al. [2] present a BNN accelerator that achieves down to merely 4.48 fJ/Op in GF 22 nm at 0.4 V, where an Op is a binary operation (XNOR or popcount).

The most common network architecture used in BNN papers is illustrated in Fig. 2. It is based on the ResNet-18 [15] architecture, which consists of a single building block that repeats. Note that each repetition has a different number of input channels. This architecture is also used in our review in Sect. 6.

A common layer sequence of BNNs is shown in Fig. 3. In full-precision networks, the batch normalization layer is absorbed into its preceding convolution for inference, which is possible because batch normalization is an affine transformation. In BNNs, however, this absorption is not possible, as it would require the convolution weights to have a much higher precision than a single bit. Thus, in BNNs, the batch normalization layer is instead fused into the succeeding sign-function. The fused operator is given by

\[ \text{sign-batchnorm}(Y) = \begin{cases} +1, & \text{if } \hat{Y} \geq 0 \\ -1, & \text{otherwise} \end{cases} \tag{2} \]

Fig. 2 The binary ResNet-18 architecture is composed of multiple blocks. Each block consists of two convolutions and has a varying number of input channels, denoted by C. Dashed residual lines indicate down-sampling of the height and width of the feature map. Moreover, the first convolution layer inside a block with the dashed residual line employs a stride of two. [Figure: a 3x3 convolution, followed by two blocks each with C=64, C=128, C=256, and C=512, and a fully-connected layer; the block detail shows Sign, 3x3 Convolution, Batch normalization, Hard Tanh, and a residual addition.]

Fig. 3 Common BNN layer sequence: convolution followed by batch normalization and sign-function. The texts above the arrows describe the data format (binary, integer, or floating-point) at that specific location in the network

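where $Y$ is the output of a binary convolution $(X_B * W_B)$ and $\hat{Y}$ the output of the batch normalization layer. A batch normalization layer consists of the moving-average standard deviation $\sigma$, a learnable scale $\gamma$, the mean $\mu$, a small constant $\epsilon$, and a learnable shift factor $\beta$. $\hat{Y}$ can be represented as a function of $Y$ as follows:

\[ \hat{Y} \equiv \gamma \frac{Y - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta \geq 0 \tag{3} \]

\[ \gamma (Y - \mu) \geq -\beta \sqrt{\sigma^2 + \epsilon} \tag{4} \]

\[ \begin{cases} Y \geq \mu - \dfrac{\beta}{\gamma}\sqrt{\sigma^2 + \epsilon}, & \text{if } \gamma > 0 \\[4pt] Y \leq \mu - \dfrac{\beta}{\gamma}\sqrt{\sigma^2 + \epsilon}, & \text{if } \gamma < 0 \end{cases} \tag{5} \]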

Fig. 4 The sign(x) function and the use of an STE to enable gradients. (a) Sign(x). (b) STE: Derivative of Sign(x). [Both panels are plotted over x from −2 to 2; legend: sign function, STE function.]

Eq. (5) can be further simplified by fusing parts into the preceding convolution, which eliminates the need for both a greater-than and a less-than comparison. Since both $\gamma$ and the weights ($W_B$) are defined per feature map, the weights of the feature maps for which $\gamma$ is negative can be multiplied by $-1$ such that the following condition remains:

\[ \text{sign-batchnorm}: \quad X_B * (\pm W_B) \;\geq\; \pm\mu - \frac{\beta\sqrt{\sigma^2 + \epsilon}}{|\gamma|} \tag{6} \]

where $X_B$ is the input to the binary convolution, the upper signs apply for $\gamma > 0$, and the lower signs apply for $\gamma < 0$.
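The two inference steps described in this section can be sketched in a few lines of plain Python. The sketch below is only an illustration under stated assumptions: the helper names (`encode`, `xnor_popcount_dot`, `fuse_batchnorm_sign`) and the example values are hypothetical and not taken from the chapter, and the sign flip of Eq. (6) is applied to the convolution output instead of to the weights, which is equivalent for a single output value.

```python
import math

def encode(values):
    """Pack a list of +1/-1 values into an integer bitstring (+1 -> 1, -1 -> 0)."""
    word = 0
    for v in values:
        word = (word << 1) | (1 if v > 0 else 0)
    return word

def xnor_popcount_dot(x_bits, w_bits, n):
    """Binary dot product of two n-bit words: #pos - #neg = 2 * popcount - n."""
    mask = (1 << n) - 1
    xnor = ~(x_bits ^ w_bits) & mask          # 1 wherever both bits agree
    return 2 * bin(xnor).count("1") - n

def fuse_batchnorm_sign(gamma, beta, mu, var, eps=1e-5):
    """Return (flip, threshold) so that the output is +1 iff flip * Y >= threshold."""
    if gamma == 0:
        raise ValueError("gamma must be non-zero")
    flip = 1 if gamma > 0 else -1             # flip = -1 plays the role of -W_B
    threshold = flip * mu - beta * math.sqrt(var + eps) / abs(gamma)
    return flip, threshold

def sign_batchnorm_reference(y, gamma, beta, mu, var, eps=1e-5):
    """Unfused reference: explicit batch normalization followed by sign."""
    y_hat = gamma * (y - mu) / math.sqrt(var + eps) + beta
    return 1 if y_hat >= 0 else -1

def sign_batchnorm_fused(y, flip, threshold):
    """Fused operator: a single comparison against a precomputed threshold."""
    return 1 if flip * y >= threshold else -1

# Example values (assumed for illustration only).
x = [+1, -1, +1, +1]
w = [-1, -1, +1, +1]
y = xnor_popcount_dot(encode(x), encode(w), len(x))
flip, thr = fuse_batchnorm_sign(gamma=-0.7, beta=0.3, mu=0.5, var=2.0)
print(y, sign_batchnorm_fused(y, flip, thr))
```

The threshold only depends on the trained batch normalization parameters, so it can be computed once offline; at inference time each output then costs one XNOR-popcount and one comparison.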

3.2 Training

Aside from enabling efficient inference, the binary weights of a BNN still need to be learned. In principle, the gradient descent algorithm can be used, as for a real-valued CNN. However, since the sign-function (Eq. (1)) has a derivative that is 0 almost everywhere, BNNs cannot be trained directly with gradient descent. Fortunately, this issue has been resolved by the introduction of a so-called straight-through estimator (STE) [43]. The STE is illustrated in Fig. 4 and can be expressed as a clipped identity function:

\[ \frac{\partial X_B}{\partial X_R} = \mathbf{1}_{|X_R| \leq 1} \tag{7} \]

where $X_R$ is the real-valued input to the sign-function, $X_B$ is the binarized output, and $\mathbf{1}_{|X_R| \leq 1}$ evaluates to 1 if $|X_R| \leq 1$ and 0 otherwise (Fig. 4b). As such, the complete gradient chain of the loss with respect to the real-valued, non-quantized weights is as follows:

\[ \frac{\partial L}{\partial X_R} = \frac{\partial L}{\partial X_B} \frac{\partial X_B}{\partial X_R} = \frac{\partial L}{\partial X_B} \cdot \mathbf{1}_{|X_R| \leq 1} \tag{8} \]

where $L$ is the loss function and the remaining symbols are defined as in Eq. (7). With the help of an STE, BNNs can be trained using the same gradient descent algorithms as ordinary real-valued CNNs. However, note that even with the use of STEs, BNNs cannot be trained up to satisfactory performance, and a substantial accuracy gap with respect to their full-precision counterparts remains.
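As a rough illustration of Eqs. (7) and (8), the following plain-Python sketch performs one gradient step on the real-valued (latent) weights of a single binary neuron. The toy squared-error loss, the learning rate, and the function names (`ste_grad`, `train_step`) are assumptions made for this example and do not come from the chapter.

```python
def sign(x):
    """Binarizer of Eq. (1): +1 for x >= 0, otherwise -1."""
    return 1.0 if x >= 0 else -1.0

def ste_grad(x_real, grad_out, clip=1.0):
    """Eq. (8): pass the incoming gradient if |x_real| <= clip, else cancel it."""
    return grad_out if abs(x_real) <= clip else 0.0

def train_step(w_real, x, target, lr=0.1):
    """One gradient-descent step on the latent weights of a binary neuron."""
    # Forward pass: binarize the latent weights, then accumulate.
    w_bin = [sign(w) for w in w_real]
    y = sum(wb * xi for wb, xi in zip(w_bin, x))
    loss = 0.5 * (y - target) ** 2            # toy loss, assumed for illustration
    # Backward pass: dL/dw_bin = (y - target) * x_i, then apply the STE.
    dl_dy = y - target
    new_w = [w - lr * ste_grad(w, dl_dy * xi) for w, xi in zip(w_real, x)]
    return new_w, loss

w_real = [0.2, -0.4, 1.5]
new_w, loss = train_step(w_real, x=[1.0, -1.0, 1.0], target=1.0)
print(new_w, loss)
```

In this example the third latent weight lies outside $[-1, +1]$, so its gradient is cancelled and the weight is left unchanged, whereas the other two weights are updated as in ordinary gradient descent.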

4 Classification of Accuracy Repair Techniques

Although the premise of BNNs, with high compression and efficient compute logic, is good, the accuracy loss incurred by transforming a real-valued CNN into a BNN is not acceptable. On the one hand, the accuracy loss is due to the large quantization error between full-precision and binary values. On the other hand, it is due to the discrete optimization landscape, with which gradient descent-based optimization has difficulties, even with the use of STEs. Thus, BNNs suffer heavily from accuracy loss. Fortunately, in recent years, many training techniques and network topology changes have been proposed that aim to reduce the accuracy loss. Whereas training techniques aim to improve the optimization process of BNNs, network topology changes aim to reduce the quantization error between full-precision CNNs and their binary counterparts.

In this section, we employ a hierarchical classification of accuracy repair methods, as illustrated in Table 1. This classification starts with two main branches: training techniques and network topology changes. Within these branches are several categories that group accuracy repair techniques together. Note that the repair categories can be applied orthogonally to each other, but repairs within a single category cannot (except for some of the Teacher–Student and regularization approaches).

Table 1 Classification table of accuracy repair methods and legend for Table 2

Property            Category                    Abbreviation  Description
Training technique  1. Binarizer (STE)          LC_|X|        Linear clipped at |X|
                                                LC_A          Linear with adaptive clipping
                                                PN            Polynomial
                                                GPN           Gradual Polynomial
                                                T             Gradual tanh-based sign
                                                EDE           T with magnitude scaling
                                                SS            SwishSign
                                                EWGS          Element-wise Gradient Scaling
                    2. Normalization            LB            Learnable bias
                                                DB            Dynamic bias
                                                STD           Division by standard deviation
                                                MSTD          Zero-mean and division by standard deviation
                                                MSTDB         MSTD divided by b
                                                BN            Batch normalization
                    3. Teacher–Student          TO            Train toward Teacher's output
                                                TB            TO and per-block loss with Teacher
                    4. Regularization           RE            Weight entropy regularization
                                                RD            Weight distance to 1 regularization
                    5. Two-stage training       TST_Y/N       Yes / No
                    6. Optimizer                SGD           SGD optimizer
                                                ADAM          ADAM optimizer
                                                Custom        Custom BNN optimizer
Topology changing   7. Scaling factor           AM            Absolute mean of the weights
                                                LF            Learnable factor
                                                LFI           Learnable factor with initialization
                    8. Ensemble                 EL            Per-layer
                                                EB            Per-block
                                                EN            Per-network
                    9. Activation function      I&H           Identity and htanh
                                                ReLU          ReLU
                                                PReLU         PReLU
                                                RPReLU        RPReLU
                                                DPReLU        DPReLU
                    10. Double residual         2R_Y/N        Residual per convolution Yes/No
                    11. Squeeze-and-excitation  SE_Y/N        Squeeze-and-Excitation Yes/No

Within the training techniques branch, there are the following six categories:

1. Binarizer (STE): Since the derivative of the binarization process (sign-function) is 0 almost everywhere, works have come up with various solutions to create an artificial derivative and thereby enable the use of the backpropagation algorithm.
2. Normalization: Before binarizing the features or weights, some works apply a form of normalization. Normalization includes, but is not limited to, standardizing and/or centralizing. Moreover, since normalization is followed by a binarizer, any multiplication done in the normalization phase will only affect the backward pass, i.e., gradients will be scaled.
3. Teacher–Student: The Teacher–Student methodology aims to help train CNNs by having a complex teacher network transfer knowledge to a simple student network (the BNN). The knowledge transfer can be done at distinct stages in the network.
4. Regularization: Regularization is a technique that introduces another loss term that should indirectly help train the network. The final network loss $L$ can then be described as

\[ L = L_{CE} + \lambda L_R \tag{9} \]

where $L_{CE}$ is the cross-entropy loss, $L_R$ the regularization loss, and $\lambda$ the balancing coefficient.
5. Two-stage training: The default in quantization-aware training is to train with both quantized features and weights simultaneously. In the two-stage training procedure, the first stage employs only binary features, whereas in the second stage both features and weights are binary.
6. Optimizer: Within BNN works, some use the SGD optimizer, whereas others use the ADAM optimizer. Contrary to real-valued networks, in which SGD surpasses ADAM in terms of accuracy, it is not clear whether this is the case for BNNs.

Within the network topology changing branch, there are the following five categories:

7. Scaling factor: Usually, a binary convolution outputs values with a larger range than its real-valued counterpart. Hence, many works employ a scaling factor $\alpha$ to minimize the quantization error

\[ \min_{\alpha} \lVert Y - \alpha Y_B \rVert^2 \tag{10} \]

where $Y_B$ is the output of a binary-valued convolution and $Y$ is its real-valued counterpart.
8. Ensemble: Ensemble methods use multiple instantiations of (a certain part of) the network to obtain better performance than could be obtained from any individual part.
9. Activation function: Next to the binarizer, which can be regarded as the nonlinearity in the network, works have suggested adding another activation function, such as ReLU or PReLU, to improve the accuracy.
10. Double residual: To keep the flow of information rich in the network, BNNs use a residual connection after every convolution, instead of after multiple convolutions.
11. Squeeze-and-Excitation: The concept of Squeeze-and-Excitation is to assign each output feature map channel a different weight based on its importance. These channel-wise weights are based on the input feature maps.

These eleven categories are used to create an overview of the repair methods used by BNN papers, which is presented in Sect. 5.
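For the scaling factor of category 7, the least-squares problem of Eq. (10) has a closed-form solution: setting the derivative with respect to $\alpha$ to zero gives $\alpha^* = \langle Y, Y_B \rangle / \langle Y_B, Y_B \rangle$. The plain-Python sketch below illustrates this on a small vector; the function names and the example values are hypothetical. When the binary vector is the sign of the real-valued one, the solution reduces to the mean absolute value, which corresponds to the "absolute mean" (AM) entry of Table 1.

```python
def optimal_scale(real, binary):
    """Closed-form minimizer of sum((real_i - alpha * binary_i)^2) over alpha."""
    num = sum(r * b for r, b in zip(real, binary))
    den = sum(b * b for b in binary)
    return num / den

def squared_error(real, binary, alpha):
    """Quantization error of Eq. (10) for a given scaling factor."""
    return sum((r - alpha * b) ** 2 for r, b in zip(real, binary))

# Example values (assumed for illustration only).
w_real = [0.5, -1.5, 0.25, -0.75]
w_bin = [1.0 if w >= 0 else -1.0 for w in w_real]

alpha = optimal_scale(w_real, w_bin)
mean_abs = sum(abs(w) for w in w_real) / len(w_real)
print(alpha, mean_abs)
print(squared_error(w_real, w_bin, 1.0), squared_error(w_real, w_bin, alpha))
```

The second print statement compares the error without scaling ($\alpha = 1$) to the error with the optimal factor.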


5 Overview of Accuracy Repair Techniques as Applied in the Literature

Accuracy repair techniques can be grouped into two main branches: training techniques and network topology changes. These two branches can be further split up and classified according to Table 1. Table 2 presents an overview of most BNN works and the repair techniques they use. Additionally, their claimed top-1 accuracies on the ResNet-18 architecture and ImageNet dataset are shown. As a reference, the first row in Table 2 denotes the full-precision ResNet-18 architecture with its accuracy on ImageNet.

5.1 Training Techniques

The training technique branch can be split up into the following categories: binarizer, normalization, teacher–student, regularization, two-stage training, and optimizer. Each repair method within these categories is outlined in the sections below. Note that these repair methods only influence the training phase, not the inference phase.

5.1.1 Binarizer (STE)

Similar to real-valued CNNs, BNNs are trained using the backpropagation algorithm, which requires every function to be differentiable. Since the derivative of the binarizer (sign-function) is 0 almost everywhere, Tieleman and Hinton [43] introduced the default STE (Fig. 4). It passes gradients unchanged for values that fall in the interval $[-1, +1]$ but cancels gradients for values outside this interval. An STE thus only provides an estimate of the gradient. This introduces a mismatch between the actual gradient of the sign-function and the STE, which makes the training of BNNs more difficult. Alternative estimators for the gradient, which should aid the optimization of a BNN, have therefore been widely explored. In recent years, several works have changed the clipping interval of the STE. For example, BinaryDenseNet [4] and MeliusNet [3] use an interval of $[-1.3, +1.3]$, whereas PokeBNN [53] uses an interval of $[-3, +3]$. These papers chose their intervals based on empirical studies. Instead of a fixed interval, Sakr et al. [37] propose to make the interval a learnable parameter, which they constrain to be small using an additional loss term. Likewise, ReCU [49] changes the interval during training from $[-0.85, +0.85]$ to $[-0.99, +0.99]$. According to their analysis, the smaller interval is best for reducing the quantization error, whereas the larger interval ensures maximum information entropy. Bi-Real Net (2018) [31] introduced the Polynomial STE, shown in Fig. 5. The triangular shape of the derivative should resemble the actual derivative of a sign-

SS

2017 n/a

2018 53

Compact BNN [42]

Regularized BNN [11]

2018 n/a

PN

PN

T

2019 67

2019 53.7

Human pose estimation [7]

LC_1.3

LC_A

GroupNet (5x) [56]

BinaryDenseNet 2019 n/a [4]

Continuous Binarization [37]

Bi-Real Net [31] 2018 56.4

LC_1

2017 65

ABCNet (5x5) [26]

LC_1

LC_1

LC_1

2016 42.2

2016 51.2

BinaryNet [9]

2015 69.3

Full-precision ResNet [15]

XNORNet [35]

Year ResNet-18 1. Feature & Binarizer ImageNet Accuracy (%)

Work

BN

DB

BN

T

TO

Y

N

N

N

N

LC_.∞

.M 1

RE

N

N

N

N

ADAM

SGD

ADAM

SGD

ADAM

n/a

ADAM

ADAM

ADAM

SGD

4. Regular- 5. 6. ization Two-stage Optimizer training

N

DB

2. Weight 3. Normaliza- Teacher– tion Student

LC_1.3

n/a

LC_1

SS

LC_1

LC_1

LC_1

LC_1

2. Feature 1. Weight Normaliza- Binarizer tion

Table 2 BNN training techniques classification and evaluation. Refer to Table 1 for a legend

AM

AM

LF

AM

AM

LF

AM

7. Scaling factor

EB

EL

8. Ensemble

PReLU

ReLU

ReLU

n/a

None

n/a

PReLU

I&H

I&H

I&H

N

N

.M 1

(continued)

N

N

.M 1

Y

N

N

N

N

N

N

N

Y

N

N

N

N

N

9. 10. Double 11. Activation residual Squeezefunction andExcitation

XNORNet++ [6] Latent Weights [18] BENN (6x) [55] Circulant BNN [28] CI-BCNN [46] IR-Net [34] BBG [39] Real-to-Binary [32] ReActNet [30] RBNN [25] ProxyBNN [17] Noisy Supervision [14] Information in BNN [19] BNN SISR [47] SI-BNN [44] Adam & TST [29] ReCU [49] Bop and beyond [41] Sub-bit BNN [45] MeliusNet [3]

Table 2 (continued)

2019 2019 2019 2019 2019 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 2021 2021 2021 2021 2021

57.1 n/a 61 61.4 59.9 58.1 59.4 65.4 65.9 59.9 63.7 59.4 61.36 n/a 59.7 n/a 66.4 n/a 55.7 n/a LB STD LB BN

PN LC_1.3

BN

BN LB MSTD BN

BN

BN

LC_1 PN LC_1 EDE LC_1 LC_1 PN GPN LC_1 LC_1 1 .M LC_1 1 .M PN PN

n/a

LC_1 LC_1.3

LC_1 PN LC_1 EDE LC_1 LC_1 LC_1 GPN 1 .M LC_1 T LC_1 LC_1 LC_1 LC_A

n/a

MSTDB

MSTD

1

MSTD .M

TO

TB TO

.M

1

Y N

N N N N N Y Y N N N N N N Y N

N

ADAM Custom ADAM SGD SGD SGD n/a ADAM ADAM SGD ADAM SGD n/a ADAM ADAM ADAM SGD Custom ADAM RADAM AM

AM AM AM LF

AM AM LF AM LFI AM AM

AM

LF EN

PReLU ReLU

n/a PReLU I&H I&H n/a PReLU PReLU I&H PReLU PReLU n/a n/a None PReLU PReLU

n/a

.M

Y

N Y N Y Y Y Y Y Y Y N Y Y Y Y

N

1

N N

N N N N N N N N N N N N N N N

N


1 .M

2021 2021 2021 2021 2021 2021 2021 2021 2021 2021 2021 2021 2021 2021 2021 2021

denotes miscellaneous methods

Increasing Entropy [57] SA-BNN [27] UaBNN [54] Lottery ticket BNN [12] Expert BNN [5] BNN-BN [8] BCNN [36] BNN-BN SR [20] Complex BNN [23] EWGS [22] Equal Bits [24] DyBNN [52] SD-BNN [50] BNN latent weights [48] PokeBNN [53] BoolNet [13]

58.5 61.7 61.9 n/a n/a 61.1 n/a n/a 57.65 55.3 60.4 67.4 66.5 63.8 n/a n/a

EDE PN PN PN LC_1 PN PN PN LC_1 EWGS 1 .M n/a EDE LC_1 LC_3 LC_1

1

BN

DB DB DB BN

BN LB LB

BN BN

.M

EDE LC_1 1 .M LC_1 LC_1 LC_1 LC_1 LC_1 LC_1 EWGS 1 .M n/a EDE LC_1 LC_1 T

1

DB

DB

MSTDB

.M

TO

TO

TO

TB

TB TO

.M

.M

1

1

N N N N Y Y Y N N N N Y Y N N N

SGD ADAM ADAM SGD ADAM ADAM ADAM ADAM ADAM ADAM SGD ADAM SGD ADAM ADAM RADAM AM

LFI

AM LF AM AM LF AM EB

RPReLU ReLU PReLU PReLU PReLU RPReLU I&H ReLU n/a PReLU PReLU PReLU DPReLU None

I&H

Y Y Y N Y Y Y N Y N n/a Y Y Y Y 1 .M

N N N N N N N N N N N Y Y N Y N

Fig. 5 The polynomial STE function and its derivative. [Panels: Polynomial STE and Derivative of Polynomial STE, plotted as y over x from −2 to 2.]

Fig. 6 The SwishSign function and its derivative. [Panels: SwishSign and Derivative of SwishSign, plotted as y over x from −2 to 2.]

function more closely, and therefore the gradient approximation error would be smaller. As can be seen in Table 2, many works have adopted this STE for feature binarization. Regularized BNN (2018) [11] introduced the SwishSign STE, illustrated in Fig. 6. It is based on the sigmoid-function. Similar to the polynomial STE, this function should resemble the actual derivative of a sign-function more closely. However, its derivative includes certain intervals for which the gradient is negative. The equation corresponding to the SwishSign is

\[ SS_\beta(x) = 2\sigma(\beta x)\,\bigl(1 + \beta x\,(1 - \sigma(\beta x))\bigr) - 1 \tag{11} \]

where $\sigma$ represents the sigmoid-function and $\beta$ equals 5 as in Fig. 6.
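To make the behavior of Eq. (11) more concrete, the following plain-Python sketch evaluates the SwishSign function and estimates its derivative numerically. The function names and the central-difference step size are choices made for this illustration, not part of the cited work.

```python
import math

def sigmoid(z):
    """Logistic sigmoid; adequate for the moderate arguments used here."""
    return 1.0 / (1.0 + math.exp(-z))

def swish_sign(x, beta=5.0):
    """SwishSign of Eq. (11)."""
    s = sigmoid(beta * x)
    return 2.0 * s * (1.0 + beta * x * (1.0 - s)) - 1.0

def swish_sign_grad(x, beta=5.0, h=1e-5):
    """Central-difference estimate of the derivative used in the backward pass."""
    return (swish_sign(x + h, beta) - swish_sign(x - h, beta)) / (2.0 * h)

for x in (-1.0, -0.2, 0.0, 0.2, 1.0):
    print(x, swish_sign(x), swish_sign_grad(x))
```

Evaluating the loop shows that the function passes through zero at the origin, slightly overshoots $\pm 1$ for larger $|x|$, and therefore has a small negative slope on the way back to $\pm 1$, which is the negative-gradient interval mentioned above.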

Fig. 7 The gradual tanh($\lambda x$) function, in which $\lambda$ is changed during training. [Panels: sign(x) ≈ tanh(λx) and Derivative of tanh(λx), each plotted over x from −2 to 2 for λ = 1, λ = 5, and λ = 20.]

Human pose estimation BNN (2019) [7] introduced an STE that changes gradually during the training process. It is based on the tanh-function, shown in Fig. 7, and approximates the actual gradient of a sign-function more closely as $\lambda$ is increased. IR-Net (2020) [34] extended the approach of [7] by scaling the magnitude accordingly and calls its STE EDE. EDE can be described by

\[ \mathrm{EDE}(x) = \max\!\left(\frac{1}{\lambda},\, 1\right) \cdot \tanh(\lambda x) \tag{12} \]

\[ \lambda = 10^{-3 + (1 + 3) \cdot T} \tag{13} \]



where $T$ is the percentage of training. This extension is shown in Fig. 8. RBNN (2020) [25] introduced a variant of the polynomial STE, called GPN. The triangular shape of the polynomial is gradually sharpened and compressed based on the stage of training. It is illustrated in Fig. 9, and its equation is  GPN(x) =

.

2 2

k · (−sign(x) λ 2x + k · sign(x),



2λx),

if | x |

Fig. 1 Functionality of a spiking neural network, in which the events are encoded into spikes and the neurons’ output spikes are generated when the membrane potential exceeds the threshold voltage. [Figure labels: membrane potential V, threshold V_th, output spikes, time window T.]

and we showcase the design of a “car vs. background” classifier, called CarSNN, implemented on Loihi.

2 Brain-Inspired Spiking Neural Networks Considered the third generation of neural networks [47], the SNNs follow the wave of success of the deep neural networks (DNNs) to perform complex machine learning tasks [10]. While conventional DNNs process continuous values, SNNs process discrete spike trains, mimicking the information processing behavior of the neurons in the human brain. The key advantage of SNNs, besides the biological plausibility, is that they offer great potential for developing energy-efficient machine learning when co-designed with neuromorphic hardware due to the sparse nature of SNNs. Figure 1 shows the basic functionality of SNNs. The input spikes encode the information using spike trains. The neurons of the network perform the integration of the spikes, which contribute to increasing the neurons’ membrane potential. In this way, the output spikes are generated when the membrane potential exceeds a threshold.

2.1 Spiking Neuron Models

A neuron is considered a simple computational node in the SNN. When a spike coming from a presynaptic neuron arrives at the input of the postsynaptic neuron, the spiking current injected into the body of the neuron, associated with the synaptic weight, is integrated into the membrane, thus contributing to raising its membrane potential. Several neuron models have been proposed in the literature, and this section focuses on the most common models employed for SNNs.

The McCulloch–Pitts neuron [49] is the predecessor model for neural networks. It simply computes the sum of the incoming spikes and emits the Boolean output 1 (fires) if the sum is higher than its threshold, or 0 otherwise. Its main drawback, which is the inability to learn, led to the development of the perceptron model [70], which introduces the concept of learnable weights that are multiplied by the Boolean outputs of the McCulloch–Pitts neuron. A network of perceptrons represents the first generation of neural networks, while the second generation uses the same perceptron model but with a more complex activation function. For the third generation, which corresponds to the SNNs, different neuron models can be employed, leveraging the trade-off between biological plausibility and implementation cost:

• The Hodgkin–Huxley model [31] represents the most biologically plausible but also the most complex model. It involves several differential equations, thus making the development of large SNNs using this model impractical and inefficient.

• The Izhikevich model [33] is able to reproduce different spiking patterns as well as spike shapes of biological cortical neurons while being more computationally efficient than the Hodgkin–Huxley neuron. Its functionality is described through Eqs. (1) to (3), where $v$ is the membrane potential, $I$ is the input current, $u$ is the membrane recovery variable, and $a$, $b$, $c$, $d$ are constants that set the spike shape.

\[ \frac{dv}{dt} = 0.04\,v^2(t) + 5\,v(t) + 140 - u(t) + I(t) \tag{1} \]

\[ \frac{du}{dt} = a\,\bigl(b\,v(t) - u(t)\bigr) \tag{2} \]

\[ \text{if } v \geq V_{th}, \text{ then } \begin{cases} v \leftarrow c \\ u \leftarrow u + d \end{cases} \tag{3} \]

• The integrate-and-fire (IF) model [28] is the simplest model from the computational point of view, which makes it widely used. The model is based on a resistance–capacitance (RC) circuit, similar to a low-pass filter. The evolution over time of the membrane potential of the postsynaptic neuron is described in Eq. (4), from which the time constant $\tau_m = RC$ can be derived as in Eq. (5).

\[ I(t) = \frac{v(t)}{R} + C\,\frac{dv}{dt} \tag{4} \]

\[ \tau_m \frac{dv}{dt} = -v(t) + R\,I(t) \tag{5} \]

When the membrane potential reaches a certain threshold $V_{th}$ at the firing time $t_f$, the postsynaptic neuron produces a spike $\delta(t - t_f)$, after which the membrane potential is reset to a value $v_{rest}$ (often set to 0 as a common assumption [60]), which is lower than $V_{th}$.

• The leaky-integrate-and-fire (LIF) model [94] is a modified version of the IF model that introduces the concept of a refractory period, corresponding to the period of time after a spike in which the membrane potential is unable to increase even if a train of spikes is received at the input. The evolution over time of the membrane potential of a LIF neuron is described by Eq. (6).

\[ \tau_m \frac{dv}{dt} = -v(t) + \Bigl( i_0(t) + \sum_j w_j\, i_j(t) \Bigr) \tag{6} \]

When a spike is received, a synaptic current $i_j(t)$ is generated, modulated by its corresponding synaptic weight $w_j$, and added to the bias current $i_0(t)$. The lower computational complexity of the LIF neuron, compared to the Hodgkin–Huxley model, comes at the price of a lower biological plausibility. The LIF model assumes that the shape of the action potentials is uniform for all spikes, thus limiting the ability to reproduce biological spike patterns and shapes. However, such low complexity allows for the creation of large SNNs and their implementation on neuromorphic hardware.
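The threshold-and-reset dynamics described above can be sketched with a forward-Euler discretization of Eq. (5). The plain-Python example below is a minimal illustration; the function name `simulate_neuron`, the parameter values, and the time step are assumptions chosen for readability and are not taken from the chapter or from any neuromorphic toolchain.

```python
def simulate_neuron(current, tau_m=10.0, resistance=1.0, v_th=1.0,
                    v_rest=0.0, dt=1.0):
    """Forward-Euler integration of tau_m * dv/dt = -v + R * I with reset.

    current: list with one input-current sample per timestep.
    Returns the spike train (0/1 per timestep) and the membrane trace.
    """
    v = v_rest
    spikes, trace = [], []
    for i_t in current:
        v += dt / tau_m * (-v + resistance * i_t)   # one Euler step of Eq. (5)
        if v >= v_th:                               # threshold crossed: fire
            spikes.append(1)
            v = v_rest                              # reset after the spike
        else:
            spikes.append(0)
        trace.append(v)
    return spikes, trace

# A weak constant input never reaches the threshold, a strong one fires regularly.
weak_spikes, _ = simulate_neuron([0.5] * 70)
strong_spikes, _ = simulate_neuron([2.0] * 70)
print(sum(weak_spikes), sum(strong_spikes))
```

With a constant input the membrane potential converges toward $R \cdot I$, so the neuron only fires if that value exceeds $V_{th}$; the firing rate then grows with the input current.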

2.2 Spike Coding Methods

The information in SNNs is represented and propagated through spike trains. Different approaches for encoding the information into spikes [61] are shown in Fig. 2:

• Rate Coding: The intensity of the activation corresponds to the probability of firing a spike, which translates into the mean firing rate in an observation period. This is the most commonly used method because of its simplicity, but it might be more power-consuming than other coding techniques due to the high spike rate.

• Inter-spike-Interval (ISI) Coding: The intensity of the activation is temporally coded as the precise delay between consecutive spikes.

• Time-to-First-Spike (TTFS) Coding: The activation intensity $a$ is coded as the time difference $\Delta t$ between the stimulus and the first spike of a neuron. Such a delay can be either the inverse of the amplitude ($\Delta t = 1/a$) or a linear relation, such as $\Delta t = 1 - a$. The main assumption of this encoding method is that a neuron generates only one spike for any given stimulus. Hence, any subsequent spikes from that neuron are simply ignored [57]. The main advantage is that fast processing is guaranteed, since the information is already transmitted when the first spike is received.
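The rate and TTFS encodings listed above can be illustrated with a short plain-Python sketch. The function names, the normalization of the activation to the range [0, 1], and the use of a seeded pseudo-random generator are assumptions made for this example.

```python
import random

def rate_code(a, n_steps, rng):
    """Rate coding: fire in each timestep with probability a (0 <= a <= 1)."""
    return [1 if rng.random() < a else 0 for _ in range(n_steps)]

def ttfs_delay(a, mode="inverse"):
    """TTFS coding: delay of the single spike as a function of the activation a."""
    if mode == "inverse":
        return 1.0 / a          # requires a > 0
    if mode == "linear":
        return 1.0 - a
    raise ValueError("mode must be 'inverse' or 'linear'")

rng = random.Random(0)          # fixed seed so that the run is repeatable
train = rate_code(0.3, 1000, rng)
print(sum(train) / len(train))  # mean firing rate, close to the activation 0.3
print(ttfs_delay(0.5), ttfs_delay(0.25, mode="linear"))
```

The comparison makes the trade-off visible: rate coding needs many timesteps (and many spikes) before the mean rate approximates the activation, whereas TTFS coding conveys the value with a single spike time.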

Fig. 2 Comparison between rate, inter-spike interval (ISI), and time-to-first-spike (TTFS) encoding techniques

2.3 SNN Learning Methods Different approaches for training SNNs can be followed based on the topology of learning. For unsupervised learning, the possible methods include Hebbian learning [74], spike-time-dependent plasticity (STDP) [81], and the spike-driven synaptic plasticity (SDSP) [25]. The basic idea of the STDP, which is the most common method, is that the strength (or weight) of a synapse depends on the degree of timing correlation between the presynaptic and postsynaptic neuronal spikes. The supervised DNN learning methods, based on the gradient backpropagation, cannot be directly applied to SNNs due to the non-differentiability of the spiking loss function [71]. The possible solutions to this problem are: (1) using DNN-toSNN conversion or (2) approximating the spiking derivative through a surrogate gradient: 1. The DNN-to-SNN conversion approach is based on training the DNN with common gradient-based backpropagation and then converting the trained network into the SNN domain [73]. The accuracy loss during the conversion can be balanced with weight normalization and by carrying out a single forward pass for SNN inference in multiple discrete timesteps. Another hybrid approach [64] consists of training a DNN, converting the DNN into SNN, and then incrementally training the SNN with an approximated backpropagation. While the basic conversion approach can be applied only to static datasets, a pre-processing method based on event accumulation over time can be effectively applied to event-based data to provide the correct inputs to the DNN in preparation for a DNN-to-SNN conversion [48].


A. Marchisio and M. Shafique

2. Surrogate gradient learning [54] circumvents the non-differentiability of a LIF neuron by defining an approximate gradient during backward propagation. This makes it possible to employ common SNN-based backpropagation methods, such as spatio-temporal backpropagation (STBP) [95] or SLAYER [78]. Further modifications and approximations of the learning rule can be applied to achieve online learning on neuromorphic chips [68, 82].
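To make the STDP idea from this section concrete, the sketch below uses a pair-based rule with exponential timing windows. This specific form is a common textbook choice and an assumption here: the text only states that the weight change depends on the timing correlation between presynaptic and postsynaptic spikes. The amplitudes and time constants are hypothetical values.

```python
import math

def stdp_delta_w(t_pre, t_post, a_plus=0.01, a_minus=0.012,
                 tau_plus=20.0, tau_minus=20.0):
    """Weight change for one pre/post spike pair (times in ms, hypothetical constants)."""
    dt = t_post - t_pre
    if dt > 0:
        # Pre fires before post (causal pairing): potentiation, decaying with the delay.
        return a_plus * math.exp(-dt / tau_plus)
    if dt < 0:
        # Post fires before pre (anti-causal pairing): depression.
        return -a_minus * math.exp(dt / tau_minus)
    return 0.0

causal_close = stdp_delta_w(t_pre=10.0, t_post=12.0)
causal_far = stdp_delta_w(t_pre=10.0, t_post=40.0)
anti_causal = stdp_delta_w(t_pre=12.0, t_post=10.0)
```

Closely timed causal pairs yield the largest positive update, which is one way to express the "degree of timing correlation" mentioned above.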

3 Conventional Architectures vs. Neuromorphic Architectures

The recent AI breakthroughs have boosted the intelligence embedded in computing devices. However, as modern AI technologies and algorithms mature, the limitations of their conventional computing infrastructures are emerging. While DNNs can scale to solve complex problems, these gains come at the cost of high computational power and intensive memory usage. Neuromorphic computing represents a fundamental redesign of the computer architecture, inspired by the mechanisms of the biological brain. The roadmap of neuromorphic computing departs from the abstraction layers and algorithms of conventional computing, with the aim of unlocking orders-of-magnitude gains in efficiency and performance compared to conventional architectures. As shown in Fig. 3, the conventional architectures, varying from common desktop processors to the most advanced AI accelerators [34, 69], have much higher power consumption than neuromorphic architectures, such as Intel Loihi [15], IBM TrueNorth [50], SpiNNaker [24], and BrainScaleS [76]. Indeed, the power consumption of such neuromorphic accelerators is comparable to or lower than that of the human biological brain. The following paragraphs briefly discuss the most popular neuromorphic chips, while the Intel Loihi is comprehensively discussed in Sect. 6.

TrueNorth [50] is a digital chip designed and implemented by IBM in a 28-nm CMOS technology. The chip organization consists of a tiled array of 4096 neurosynaptic cores. Each core has 12.75 KB of local SRAM memory, which stores the synapse states, the neuron states, and the parameters of up to 256 LIF neurons.

[Figure content: power axis with marks at tens of mW, 1 W, 20 W, 50-100 W, ~200 W, ~15 kW, and a few MW. Neuromorphic hardware: Intel Loihi, IBM TrueNorth, SpiNNaker, BrainScaleS. Human brain. Conventional hardware: desktop processor, Google TPU, Cerebras WSE-II, supercomputer.]

Fig. 3 Comparison between the power consumption of the human brain, conventional architectures, and neuromorphic architectures


The spike-based communication and asynchronous routing infrastructure enable the integration of multiple TrueNorth chips into larger systems.

SpiNNaker [24] is a digital system designed to simulate large SNNs in real time. Its basic blocks are ARM9 cores that can access a small amount of local memory, while some additional memory is shared across one multi-core chip. 18 processor cores are grouped together to form a chip, and 48 chips are assembled together to form a board. Larger systems can be built by connecting multiple 48-chip boards, such as the 1-million processor system built by the University of Manchester. The second version, named SpiNNaker 2 [44], integrates more cores per chip and linear algebra accelerators to execute sparse deep learning algorithms more efficiently. Its software stack facilitates the SNN deployment using Python-based simulators, such as PyNN [17]. Its interface supports standard neuron models such as the LIF and Izhikevich neurons, and common learning algorithms such as STDP.

BrainScaleS [76] is a hybrid system that combines analog neuron circuits with digital communication networks. It supports the adaptive exponential IF neuron model, which can be configured through a parameter to adapt to different spiking behaviors. A single chip supports up to 512 neurons and 14,000 synapses per neuron. Larger networks can be built by connecting multiple chips directly on the silicon wafer.

NeuroGrid [4] is a platform that employs analog/digital mixed-signal circuits to implement large SNN models in real time. A NeuroGrid board is composed of 16 CMOS NeuroCore chips, and each chip contains an array of 256 × 256 two-compartment neurons. The full NeuroGrid board can scale up to 1 million neurons and billions of synapses thanks to its asynchronous multicast tree routing digital infrastructure.

DYNAP-SE [52] is a chip fabricated by INI Zurich in a 180-nm CMOS technology node. The chip has 4 cores, with 256 neurons each, and supports 64k synapses. The asynchronous digital connectivity between neurons can be reprogrammed at runtime, enabling flexible SNN model implementations, including recurrent networks.

ODIN [23] is a 28-nm CMOS digital neuromorphic chip designed by the Catholic University of Louvain. A core is composed of 256 neurons, which can be configured with the LIF model or the Izhikevich model. The parameters of the neurons are stored in a 4 KB SRAM array, and the 64k synapses are implemented as a 32 KB SRAM array.

µBrain [84] is a digital event-based neuromorphic chip implemented in a 40-nm CMOS technology. The architecture is fully asynchronous, without a clock. Due to its ultra-low power consumption (in the range of a few tens of µW), it is particularly suitable for IoT applications.


4 Event-Based Cameras

Event-based sensors, also commonly called dynamic vision sensors (DVS), take inspiration from the functionality of the human eye's retina. While in traditional frame-based sensors the recording of a scene is obtained by stacking a series of frames at a specific temporal rate, the information recorded by event-based sensors is directly related to the light variations in the scene. More specifically, if and only if a pixel changes its brightness, the camera triggers an event with this information:

• $x, y$: The coordinates of the pixel
• $t$: The timestamp of when the event occurred
• $p$: The polarity of the brightness variation, which is ON or 1 for higher brightness, and OFF or 0 for lower brightness

Thus, the brightness changes in the scene are recorded asynchronously and independently for every pixel as a variable data rate sequence. As shown in Fig. 4, for each pixel, the brightness (measured as log intensity) is memorized when an event is recorded and continuously monitored for a (positive or negative) change of sufficient magnitude, compared to the previously memorized value. The events are transmitted with the asynchronous address event representation (AER) protocol. Thanks to their structure, the spikes generated by an event-based sensor can directly feed the SNNs' inputs without any manipulation. Recently, due

[Figure panels: DAVIS 240C functionality; event generation, with ON and OFF thresholds on Δ(log I); DAVIS chip layout; DAVIS 240C event-based camera]

Fig. 4 Functionality of the DAVIS240C camera [8], showing a simplified circuit diagram of the DAVIS pixel, and the DVS operation of converting light into events. Figure adapted from [26]


to their increased popularity and demand, different high-technology companies, including iniVation [8, 40], Prophesee [21, 62], CelePixel [12, 30], and Samsung [80, 85], have specialized in the commercialization of event-based cameras. In summary, compared to the frame-based sensors, the event-based cameras offer the following improvements:

• High resolution in time: Multiple events can be recorded with a time resolution of a few microseconds. Therefore, common frame-based issues such as oversampling, undersampling, and motion blur are avoided, making event-based sensors suitable for high-speed or low-latency operations.
• Adaptive data rate → less power and memory usage: The data is recorded only when a brightness variation is detected in the scene. Hence, no information is recorded in the absence of light changes, leading to almost zero power consumption and efficient storage of the information.
• High dynamic range (up to 140 dB): The large range (compared to ≈ 60 dB for frame-based sensors) allows event-based sensors to be used in extreme conditions as well, e.g., with very low light.
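The event-generation rule described in this section (memorize the log intensity of each pixel, then emit an ON or OFF event when the change exceeds a threshold) can be sketched in Python as follows. This is a frame-sampled approximation for illustration only: a real sensor operates asynchronously per pixel, and the threshold value, the function name, and the tuple layout (x, y, t, p) are assumptions.

```python
import numpy as np

def generate_events(frames, timestamps, threshold=0.2, eps=1e-6):
    """Emit (x, y, t, p) events from a stack of intensity frames of shape (T, H, W)."""
    log_ref = np.log(frames[0] + eps)        # memorized log intensity per pixel
    events = []
    for frame, t in zip(frames[1:], timestamps[1:]):
        log_now = np.log(frame + eps)
        diff = log_now - log_ref
        on = diff >= threshold               # brightness increased enough: polarity 1
        off = diff <= -threshold             # brightness decreased enough: polarity 0
        for polarity, mask in ((1, on), (0, off)):
            ys, xs = np.nonzero(mask)
            events.extend((int(x), int(y), float(t), polarity)
                          for x, y in zip(xs, ys))
        fired = on | off
        log_ref[fired] = log_now[fired]      # re-memorize only where an event fired
    return events

# Hypothetical 2x2 scene: one pixel brightens at t = 1.0 and darkens again at t = 2.0.
frames = np.ones((3, 2, 2))
frames[1, 0, 1] = 2.0
events = generate_events(frames, timestamps=[0.0, 1.0, 2.0])
```

Pixels whose brightness does not change produce no events at all, which illustrates the adaptive data rate listed among the advantages above.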

5 Applications and Datasets for Event-Based SNNs

Event-based SNNs are well suited for high-dynamics applications deployed in extremely low-power systems. The existing applications cover a wide range of domains, such as industrial automation, IoT, smart mobility, robotics, and healthcare [26, 88]. However, a key feature that boosts the research and development of optimized algorithms and computation mechanisms is the availability of open-source datasets that can be easily accessed by third parties. An overview of the most common open-source event-based vision datasets is shown in Table 1. While the earliest datasets were generated by converting the classical static computer vision datasets into sequences of events [39, 43, 56], in recent years the neuromorphic community has released a larger variety of datasets that are directly generated by recordings from DVS cameras [1, 5, 13, 18, 22, 27, 32, 53, 79, 89, 96]. This wider accessibility has led to the development of end-to-end systems using event-based data implemented on neuromorphic hardware [67, 90, 92].

6 The Loihi Architecture

The Intel Loihi chip [15], based on a neuromorphic mesh of 128 neurocores, executes the neuron computations in a highly parallel and power-efficient asynchronous manner. The neurocore management is guaranteed by 3 embedded x86 processors, and an asynchronous network-on-chip (NoC) allows communication between neurons.


Table 1 An overview of the open-source event-based vision datasets

| Dataset | Reference | Application domain | Generation method | Description |
|---|---|---|---|---|
| N-MNIST & N-Caltech101 | [56] | Digit classification and object classification (10 and 101 categories) | Converted from static datasets (MNIST [38] and Caltech101 [20]) | Recording of static images with an ATIS sensor [62] in motion |
| CIFAR10-DVS | [39] | Object classification (10 categories) | Converted from the static CIFAR10 dataset [36] | Closed-loop smooth movement repetition of frame-based images recorded with a DVS128 camera [40] |
| DvsGesture | [1] | Gesture recognition (11 classes) | Recorded with a DVS128 camera [40] | Hand and arm gestures collected from 29 subjects under 3 different lighting conditions in a stationary background |
| ES-ImageNet | [43] | Image classification (1000 classes) | Converted from the static ILSVRC2012 dataset [75] | Event streams are software-generated by capturing the sequential features of images and placing them on the time axis with timestamps |
| N-Cars | [79] | Cars vs. background classification | Recorded with an ATIS camera [62] | Events captured with the sensor mounted behind the windshield of a car |
| DET | [13] | Lane extraction | Recorded with a Celex-V sensor [12] | High spatial resolution traffic scenes captured in Wuhan City, China |
| AAD | [89] | Cars and pedestrian detection | Recorded with an ATIS camera [62] | Sensor mounted behind the windshield of a car on France roads |
| Event-camera datasets | [53] | Pose estimation, visual odometry, and SLAM | Recorded with a DAVIS240C sensor [8] | A set of different scenes from synthetic and real environments |
| MVSEC | [96] | Optical flow estimation | Recorded with a DAVIS346B sensor | Combined with other sensor data (IMU, lidar, GPS), long indoor and outdoor sequences in a variety of illuminations and speeds |
| DDD17 & DDD20 | [5, 32] | Steering angle prediction | Recorded with a DAVIS346B sensor | Various weather, driving, road, and lighting conditions in Switzerland, Germany, and USA |
| UZH-FPV Drone Racing | [18] | Visual odometry | Recorded with a miniDAVIS346 camera | High-speed event sequences captured on a first-person-view racing quadrotor flown by an expert human pilot |
| Brisbane event VPR | [22] | Visual place recognition | Recorded with a DAVIS346B sensor | Events captured with the sensor mounted behind the windshield of a car in the suburbs of Brisbane, Australia |
| DSEC | [27] | Stereo vision | Recorded with two Prophesee PPS3MVCD sensors | Combined with other sensor data (two standard RGB cameras, lidar, GPS), captured with a variety of illumination conditions in Switzerland |

6.1 Neuron Model

The Loihi architecture implements the well-known CUrrent BAsed (CUBA) leaky integrate-and-fire (LIF) neuron. Each neuron is modeled as a reservoir of charge, and when this overcomes the voltage threshold, a current spike on the output axons is generated. The two internal state variables of the model are the synaptic response current $u_i(t)$ and the membrane potential $v_i(t)$. A postsynaptic neuron $i$ receives in input a train of spikes that are sent by a presynaptic neuron $j$. The spikes can be represented as a train of Dirac delta functions at time $t_k$, as in Eq. (7).

$$\sigma_j(t) = \sum_{k} \delta(t - t_k). \tag{7}$$

The train of spikes is then processed by a synaptic filter input response $\alpha_u(t)$, which is defined as in Eq. (8), where $H(t)$ is the step function, and $\tau_u$ a time constant.

$$\alpha_u(t) = \frac{e^{-t/\tau_u}}{\tau_u} H(t). \tag{8}$$

Each filtered spike train is multiplied by the synaptic weight $w_{ij}$ associated with the synapse that connects neurons $i$ and $j$ and added together with an additional bias current $b_i$ to compute the synaptic response current in Eq. (9).

$$u_i(t) = \sum_{j} w_{ij} (\alpha_u * \sigma_j)(t) + b_i. \tag{9}$$


The synaptic current is then integrated by the membrane potential as in Eq. (10). Every time the membrane potential overcomes the voltage threshold $\theta_i$, the neuron $i$ emits an output spike, and its membrane potential is reset to a $v_{rest}$ value.

$$\dot{v}_i(t) = -\frac{1}{\tau_v} v_i(t) + u_i(t) - \theta_i \sigma_i(t). \tag{10}$$

Note that the time constant $\tau_v$ is responsible for the leaky behavior of the model [15].
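The CUBA LIF dynamics of Eqs. (7)-(10) can be illustrated with a simple forward-Euler simulation in Python. This is a sketch under stated assumptions, not Loihi's fixed-point implementation: the timestep `dt`, the function name, and the choice to model the spike-triggered term of Eq. (10) as a hard reset to `v_rest` (as described in the text) are all illustrative.

```python
import numpy as np

def simulate_cuba_lif(spikes_in, w, b, tau_u, tau_v, theta, dt=1.0, v_rest=0.0):
    """spikes_in: (n_steps, n_pre) array of 0/1; w: (n_post, n_pre); b: (n_post,)."""
    n_steps, n_pre = spikes_in.shape
    n_post = w.shape[0]
    trace = np.zeros(n_pre)            # (alpha_u * sigma_j)(t): filtered input spike trains
    v = np.full(n_post, float(v_rest)) # membrane potentials v_i(t)
    spikes_out = np.zeros((n_steps, n_post), dtype=np.uint8)
    for t in range(n_steps):
        # Exponential synaptic filter of Eq. (8): each spike adds 1/tau_u, then decays.
        trace += dt * (-trace / tau_u) + spikes_in[t] / tau_u
        u = w @ trace + b                          # synaptic response current, Eq. (9)
        v += dt * (-v / tau_v + u)                 # leaky integration, Eq. (10)
        fired = v >= theta
        spikes_out[t] = fired
        v[fired] = v_rest                          # reset after an output spike
    return spikes_out

# Hypothetical single synapse: constant input spiking vs. no input at all.
w = np.array([[1.0]])
b = np.array([0.0])
active = simulate_cuba_lif(np.ones((100, 1)), w, b, tau_u=5.0, tau_v=10.0, theta=1.0)
silent = simulate_cuba_lif(np.zeros((100, 1)), w, b, tau_u=5.0, tau_v=10.0, theta=1.0)
```

With these hypothetical constants, the driven neuron fires repeatedly while the undriven one stays silent, since with zero bias the membrane potential simply rests at `v_rest`.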

6.2 Chip Architecture

As shown in Fig. 5, a Loihi chip is composed of 128 neuromorphic cores (neurocores), and each neurocore can implement up to 1024 primitive spiking neural units (compartments), which emulate trees of spiking neurons. The spikes generated by each neuron are delivered to all the compartments belonging to its synaptic fanout through the asynchronous NoC in the form of packetized messages, following a mesh operation executed over a series of algorithmic timesteps. This process uses a

[Figure labels: neuromorphic cores, x86 processors (LMT), NoC, parallel I/O, FPIO, off-chip interface]

Fig. 5 Architectural view of the Loihi chip [15]


barrier synchronization mechanism to ensure that all neurons are ready to proceed coherently to the next timestep. To implement wider and deeper SNNs that do not fit on a single 128-neurocore chip, multiple chips can be combined without any latency increase due to the message exchange. The off-chip communication interface extends the mesh up to 4096 on-chip cores and up to 16,384 hierarchically connected cores. The 3 embedded x86 processors, called Lakemont (LMT) cores, guarantee the correct functioning of the entire system. They can also be used to probe the performance of the chips and transfer information to the user. The mesh operation is composed of the following sub-operations:

1. Each neurocore independently iterates over its compartments, and if a neuron of a compartment is in a spike firing state, the spike message is generated.
2. All the messages are distributed through the NoC to all the neurocores that contain their synaptic fan-outs.
3. When a neurocore ends the internal distribution of spikes to its neurons, it sends a barrier synchronization message to its neighbor neurocores. This message has the effect of flushing all the spikes that are still traveling. After that, another barrier message notification is generated by the same neurocore.
4. When all the neurocores receive the second signal, the timestep is incremented.

As shown in Fig. 6, the microarchitecture of a single neurocore is composed of 4 units:

1. The Synapse unit is responsible for routing the input spikes to the appropriate compartments and retrieving the corresponding synaptic weights from memory.
2. The Dendrite unit manages the synaptic current and membrane voltage for each compartment. It is also responsible for verifying whether a neuron is firing and delivering this information to the Axon.
3. The Axon unit generates the output spikes, in which each message is associated with the specific address of postsynaptic neurons.

Fig. 6 Microarchitectural view of a single Loihi neurocore [15]


4. The Learning unit, using the spike traces at the output of the neurocore and other local information, updates the synaptic weights according to the learning rule.
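The four mesh sub-operations described in this section can be sketched as a sequential Python model of one algorithmic timestep. The data structures (`firing`, `fanout`, `core_of`) and the function name are hypothetical, and the two-phase barrier exchange is reduced to a single check because a sequential program has no spikes still in flight; only the core and compartment counts come from the text.

```python
from collections import defaultdict

NEUROCORES_PER_CHIP = 128          # from the text
COMPARTMENTS_PER_CORE = 1024       # upper bound per neurocore, from the text
MAX_COMPARTMENTS_PER_CHIP = NEUROCORES_PER_CHIP * COMPARTMENTS_PER_CORE

def run_timestep(firing, fanout, core_of, timestep):
    """firing: core id -> firing compartment ids; fanout: compartment id -> destination ids;
    core_of: compartment id -> core id. Returns (per-core inbox, next timestep)."""
    # Sub-operation 1: each core iterates over its compartments and generates spike messages.
    messages = [
        (src, dst)
        for core in sorted(firing)
        for src in firing[core]
        for dst in fanout.get(src, [])
    ]
    # Sub-operation 2: messages are distributed to the cores that hold the fan-out.
    inbox = defaultdict(list)
    for src, dst in messages:
        inbox[core_of[dst]].append((src, dst))
    # Sub-operation 3: barrier. In this sequential model every delivery is already
    # complete here, so each core trivially reports that it is done.
    barrier_done = {core: True for core in set(core_of.values())}
    # Sub-operation 4: the timestep advances only when all cores passed the barrier.
    if not all(barrier_done.values()):
        raise RuntimeError("barrier not reached by every neurocore")
    return dict(inbox), timestep + 1

# Hypothetical example: compartment 0 (core 0) fires and fans out to compartments 1 and 5.
inbox, next_step = run_timestep(
    firing={0: [0]},
    fanout={0: [1, 5]},
    core_of={0: 0, 1: 0, 5: 1},
    timestep=0,
)
```

The constants also give the per-chip upper bound implied by the text: 128 neurocores with up to 1024 compartments each allow at most 131,072 compartments on a single chip.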

6.3 Second Generation: Loihi 2

Based on the advancements made by the neuromorphic community, which highlighted the need for a larger and more flexible research platform, Intel's designers recently developed the second generation of neuromorphic processors, called Loihi 2 [55]. Compared to the previous version, the key enhancements are as follows:

• Neuron model programmability: Compared to Loihi, which only supports the CUBA LIF neuron, Loihi 2 implements the neuron models in a programmable pipeline, which supports common arithmetic, comparison, and program control flow instructions. Loihi 2's programmability expands its range of neuron models without incurring performance or efficiency overhead compared to Loihi, thereby enabling a richer space of opportunities and applications.
• Generalized event-based messaging: While Loihi supported only binary-valued spike messages, Loihi 2 allows spikes to carry a 32-bit payload with little extra performance and energy cost. Such a generalized spike message protocol preserves the desirable sparse and time-coded communication properties of SNNs, while also providing greater numerical precision.
• Enhanced learning capability: Loihi only supported two-factor learning rules on its synapses, while in Loihi 2 the programmable neuron models can manipulate the synaptic inputs received from their dendritic compartments to map localized third factors to specific synapses. Hence, Loihi 2's support for modulatory factors expands the possibility of using many of the latest neuromorphic learning algorithms.
• Higher density and capacity: Compared to Loihi, the Loihi 2 neurocores, fabricated with a newer technology, are approximately half the size, thus achieving a 2× higher synaptic density. Moreover, Loihi 2 cores support flexible memory bank partitioning to increase the effective core capacity. In addition, Loihi 2 increases the number of embedded processors per chip to 6 from 3 in Loihi.
• Faster computation and connectivity: The asynchronous circuits in Loihi 2 have been redesigned and optimized, thus providing a processing speed gain of between 2× and 10× compared to Loihi. Moreover, the inter-chip interface is improved. Loihi 2 supports the local broadcast of spikes at a destination chip, resulting in 4× faster asynchronous chip-to-chip communication and 10× lower inter-chip bandwidth utilization. Moreover, 3D multi-chip scaling is supported to further reduce the routing distances between chips and the congestion of inter-chip links.


6.4 Tools to Support Loihi Developers

A solid tool flow is essential for enabling large-scale usage of the hardware architecture and conducting cutting-edge research. The software stack abstracts the low-level hardware details away from the users, allowing them to focus on high-level modeling of algorithms, network architectures, learning rules, etc. Shortly after Loihi was made available to researchers, the NxSDK [41] API was released. Through this API, a programmer can define the architecture of the SNN model, as well as its parameters, such as decay time constants, spike impulse values, synaptic weights, refractory delays, spiking thresholds, and custom learning rules. Moreover, external stimulus spikes can be injected into designated connections at specified timesteps. The network state and power/energy consumption can be monitored at runtime. The compiler [42], after checking that the specifications are compliant with the hardware support, greedily assigns network entities (compartments, synapses, learning rules) to the available resources to minimize the number of occupied neurocores.

To ease the development of complex and deep SNNs, NxTF [72] provides a programming interface derived from Keras and a compiler optimized for mapping deep convolutional SNNs to the multi-core Intel Loihi architecture. It supports SNNs trained directly on spikes as well as models converted from traditional DNNs through SNN-ToolBox [73], processing both sparse event-based and dense frame-based datasets. Recently, other SNN simulators such as PyNN [17], Nengo [2, 63], and Brian [51, 83] have been extended with support for mapping the model onto the Loihi neuromorphic hardware.

To extend the support to Loihi 2 and other neuromorphic architectures, the Lava [14] framework has been released. It is extensible and hierarchical to allow programmers to build up levels of abstraction in a modular way. It can also be integrated with third-party frameworks, including TensorFlow, PyTorch, Nengo, and others. Its Magma component adds a low-level interface for mapping and executing SNNs onto the target neuromorphic hardware platform.
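The greedy resource assignment mentioned for the compiler can be illustrated with a toy first-fit packing of compartment groups onto neurocores. This is only a sketch of the general idea and not the NxSDK compiler's algorithm or API: it considers a single resource (compartments, capped at the 1024 per core stated earlier) and ignores synaptic memory, fan-in and fan-out limits, and learning rules. All names are hypothetical.

```python
def greedy_assign(group_sizes, core_capacity=1024):
    """Place groups of compartments on cores, filling already opened cores first.

    Returns (placement, n_cores), where placement[g] is a list of
    (core_index, n_compartments) chunks for group g.
    """
    free = []        # remaining compartment capacity of each opened core
    placement = []
    for size in group_sizes:
        remaining = size
        chunks = []
        # Greedy step: reuse leftover space on opened cores before opening new ones.
        for idx in range(len(free)):
            if remaining == 0:
                break
            take = min(free[idx], remaining)
            if take > 0:
                free[idx] -= take
                chunks.append((idx, take))
                remaining -= take
        # Open new cores only for what still does not fit.
        while remaining > 0:
            take = min(core_capacity, remaining)
            free.append(core_capacity - take)
            chunks.append((len(free) - 1, take))
            remaining -= take
        placement.append(chunks)
    return placement, len(free)

# Hypothetical network with three layers of 1500, 500, and 100 compartments.
placement, n_cores = greedy_assign([1500, 500, 100])
```

For this hypothetical network, the 2100 compartments occupy three cores, which matches the lower bound of ceil(2100 / 1024) for a single-resource model.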

6.5 SOTA Results of Event-Based SNNs on Loihi

While DNN principles led to breakthroughs in machine learning domains, neuromorphic computing aims to take this a step further to computation principles more directly inspired by the functionality of biological neural circuits. The basic paradigm of DNNs consists of applying error backpropagation to modify the parameters and solve a wide range of problems [16]. The most common approaches for deploying SNNs inspired by the deep learning principles are shown in Fig. 7. The DNN-to-SNN conversion approach is based on offline training of the DNN model and then conversion of the network to the SNN domain for inference on


[Figure content: a neural network model and an event-based or static dataset feed the offline training stage (CPU/GPU), which consists of either DNN training with backpropagation (TensorFlow/PyTorch + SNN-ToolBox/NengoDL/…) or SNN backpropagation (TensorFlow/PyTorch + SLAYER/STBP/…), with optional pre-training (TensorFlow/PyTorch). The result is mapped onto neuromorphic hardware (NxSDK, Lava, …) for online execution on Loihi: converted SNN implementation (spiking rates corresponding to DNN activations), direct SNN inference (information encoded in spatiotemporal spike patterns), or online learning (spike sequences evolve to optimize the loss function).]

Fig. 7 Methodologies for deploying spiking deep learning applications on Loihi [15]

[Figure axes: accuracy, Acc. (%), from 0 to 100; latency, T (ms), from 0 to 200; the lowest-latency implementation is marked]

Fig. 8 Comparison in terms of accuracy (Acc.) and latency (T) between different implementations of the DvsGesture recognition problem (SLAYER on Loihi [78], CNN on TrueNorth [1], Retina + Gabor filter on SpiNNaker [45], spatio-temporal filter + CNN on GPU [29], PointNet++ on GPU [93]). Figure adapted from [90]

neuromorphic hardware. This methodology has been showcased for different applications, including keyword spotting [6], image retrieval [46], segmentation [59], gesture recognition [48], and heartbeat classification [9]. Another common method is to directly perform SNN backpropagation-based training [3, 54, 78, 95]. These approaches better optimize latency and energy efficiency, since the encoding and spike propagation in the temporal domain lead to fewer spikes per computation. Different applications have been demonstrated with this approach, including tactile digit recognition [77], gesture recognition [11], robot grasping [87], robot navigation [86], car recognition [90], and lane detection [91]. Moreover, the Loihi learning engine can be exploited to conduct online learning. Since naive backpropagation is ill-suited for continuous learning, different approximation algorithms have been employed [19, 82]. Several works showcased the benefits of implementing the target application on the Loihi neuromorphic chip compared to other hardware platforms. To demonstrate this, Fig. 8 compares SNN classifiers performing gesture recognition


on the DvsGesture dataset [1], implemented on different hardware platforms. The best trade-off between accuracy and latency is achieved by the SLAYER [78] implementation on the Intel Loihi. It also exhibits the lowest energy consumption (0.54 mJ), compared to over 19 mJ consumed by the IBM TrueNorth implementation.

7 Case Study for Autonomous Vehicles: Car Detection with CarSNN

As a case study, this section presents CarSNN [90], an efficient SNN implementation on Loihi for event-based car detection, which is a well-known problem in the context of autonomous driving (AD) applications. This is a crucial task, which, along with other AD tasks (e.g., steering angle prediction, car and pedestrian localization, control of brake pedal and accelerator), requires a fast decision-making system with low latency to minimize the chance of catastrophic accidents due to late decisions. Another requirement is the robustness of the system, which must operate in different conditions, including various levels of illumination and weather conditions. Moreover, the system should be designed and optimized for low power consumption, which is an important design criterion in automotive applications, especially for battery-driven electric mobility.

7.1 Problem Analysis and General Design Decisions

Looking at the target problem in more detail, the focus is on the "cars vs. background" classification. To meet the above-discussed requirements, three main research objectives are identified:

1. The system should use a highly robust vision engine, i.e., an event-based camera.
2. The network should be a low-complexity event-based SNN for energy-constrained and low-latency systems.
3. The developed SNN should fulfill the system constraints to be implemented on a neuromorphic hardware chip.

Following these research objectives, the design, optimization, and implementation of the SNNs are conducted on the Intel Loihi Neuromorphic Research Chip [15], and evaluated on the N-CARS dataset [79], which is based on the Asynchronous Time-based Image Sensor (ATIS) [62]. For the target classification problem, a supervised learning method can be employed to train the network based on the desired task. Every SNN sample is represented by a stream of events, which represents the same object to classify. In the same sample, all the spikes are correlated in time and space with each other [79].


[Figure panels (a) and (b): event-occurrence maps over the pixel coordinates, with the 100 × 100 and 50 × 50 attention windows marked]

Fig. 9 Event occurrences on (a) test samples and (b) train samples of the N-CARS dataset [90]

To achieve good performance, this temporal correlation needs to be exploited by a learning method capable of utilizing this property. As claimed in [95], the STBP is an efficient offline SNN learning method, since it achieves very high classification accuracy in tasks involving event-based camera streams. It also uses both temporal and spatial domains to calculate the gradients and train the SNN. Since a very reactive prediction is needed to operate in real time, only a subset of the input information is used, according to the attention window principle. The event occurrences, in both the training and testing sets of the N-CARS dataset, are analyzed to find the area on which to focus the attention. Figure 9 shows the event occurrences evaluated in different attention windows. The majority of the events are contained in the area of size 50 × 50 in the bottom-left corner, in both the train and test sets.
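The attention-window analysis described above amounts to counting how many events fall inside a candidate square window. A minimal Python sketch is given below; the event coordinates are synthetic placeholders rather than N-CARS data, the function name is hypothetical, and the coordinate origin is assumed to be at the bottom-left corner of the sensor.

```python
import numpy as np

def fraction_in_window(xs, ys, x0, y0, size):
    """Fraction of events whose pixel lies in [x0, x0+size) x [y0, y0+size)."""
    xs = np.asarray(xs)
    ys = np.asarray(ys)
    inside = (xs >= x0) & (xs < x0 + size) & (ys >= y0) & (ys < y0 + size)
    return float(inside.mean()) if inside.size else 0.0

# Synthetic event coordinates (placeholders, not taken from the dataset).
xs = [5, 10, 49, 60, 110]
ys = [3, 20, 49, 10, 90]
frac_50 = fraction_in_window(xs, ys, x0=0, y0=0, size=50)     # 50 x 50 window
frac_100 = fraction_in_window(xs, ys, x0=0, y0=0, size=100)   # 100 x 100 window
```

Applied to the real event streams, a high fraction for the small window would justify cropping the input, which reduces the network size and the number of spikes to process.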

7.2 CarSNN Methodology

The methodology to design the SNN model for the “cars vs. background” classification, called CarSNN, consists of a three-step process, as shown in Fig. 10. Following the aforementioned considerations about different attention windows, the SNN model architectures are discussed in Sect. 7.2.1. The methodological steps for finding the parameters for SNN training and feeding input data are discussed in Sects. 7.2.2 and 7.2.3, respectively.

7.2.1 CarSNN Model Design

To achieve good classification results, the input events are transmitted in two distinct polarity channels, one for positive and one for negative events. Since we consider this problem as a multi-classification task, instead of a straightforward binary


[Figure panels: (1) Networks definition: receive 2 distinct polarity channels as inputs; have 2 output channels; inspired by the SNN for the DvsGesture; use three different attention windows. (2) Training parameters: use the STBP learning rule; neuron model; learning rule (Adam optimizer, MSE loss function, LR). (3) Input parameters: use accumulation in time; batch size.]

Fig. 10 Three-step methodology for CarSNN design [90] with the training and feeding input parameters

classification, the output layer of the CarSNN has two neurons that correspond to the cars and background classes. Since the architecture proposed in [82] achieved high classification accuracy and low latency on the DvsGesture dataset [1], a similar SNN model is designed to function correctly for the N-CARS dataset. Compared to the model of [82], the CarSNN has different values of kernel size, padding, and output channels for the first convolutional layer, and different sizes for the last two dense layers. Based on the attention window analysis, three different SNNs for the three different sizes of input images are developed:

1. Size 128 × 128 (Table 2): The model is very similar to the SNN proposed in [82]. Since this size is larger than the N-CARS dataset image size, which is 120 × 100, the excess pixels do not produce spikes (no event). Besides, this dimension is equal to the resolution of common DVS cameras [40, 62]. Therefore, this network can be directly implemented with event-based sensors.
2. Size 50 × 50 (Table 3): This uses the first attention window (red square in Fig. 9).
3. Size 100 × 100 (Table 4): This uses the second attention window (green square in Fig. 9).


Table 2 CarSNN model for full-size images (input size 128 × 128)

| Layer type | In ch. | Out ch. | Kernel size | Padding | Stride |
|---|---|---|---|---|---|
| Av. pooling | 2 | 2 | 4 | − | − |
| Convolution | 2 | 32 | 3 | 1 | 1 |
| Av. pooling | 32 | 32 | 2 | − | − |
| Convolution | 32 | 32 | 3 | 1 | 1 |
| Av. pooling | 32 | 32 | 2 | − | − |
| Dense | 2048 | 1024 | − | − | − |
| Dense | 1024 | 2 | − | − | − |
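As a consistency check on Table 2, the spatial size of the feature maps can be traced layer by layer in Python to confirm the 2048 input features of the first dense layer. The sketch assumes that each average pooling layer uses a stride equal to its kernel size and no padding, which the table does not state explicitly; the helper names are hypothetical.

```python
def pool_out(size, kernel):
    """Output size of average pooling, assuming stride == kernel and no padding."""
    return size // kernel

def conv_out(size, kernel, padding, stride):
    """Standard output-size formula for a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1

size = 128                          # full-size input, 128 x 128
size = pool_out(size, 4)            # Av. pooling, kernel 4      -> 32
size = conv_out(size, 3, 1, 1)      # Convolution, 3x3, pad 1    -> 32
size = pool_out(size, 2)            # Av. pooling, kernel 2      -> 16
size = conv_out(size, 3, 1, 1)      # Convolution, 3x3, pad 1    -> 16
size = pool_out(size, 2)            # Av. pooling, kernel 2      -> 8
flat_features = 32 * size * size    # 32 channels * 8 * 8 = 2048 inputs to the first dense layer
```

Under these assumptions, the flattened feature count agrees with the "In ch." value of the first dense layer in Table 2.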

Table 3 CarSNN model for first attention window (input size 50 × 50)

| Layer type | In ch. | Out ch. | Kernel size | Padding | Stride |
|---|---|---|---|---|---|
| Av. pooling | 2 | 2 | 4 | − | − |
| Convolution | 2 | 32 | 3 | 1 | 1 |
| Av. pooling | 32 | 32 | 2 | − | − |
| Convolution | 32 | 32 | 3 | 1 | 1 |
| Av. pooling | 32 | 32 | 2 | − | − |
| Dense | 512 | 144 | − | − | − |
| Dense | 144 | 2 | − | − | − |

Table 4 CarSNN model for second attention window (input size 100 × 100)

| Layer type | In ch. | Out ch. | Kernel size | Padding | Stride |
|---|---|---|---|---|---|
| Av. pooling | 2 | 2 | 4 | − | − |
| Convolution | 2 | 32 | 3 | 1 | 1 |
| Av. pooling | 32 | 32 | 2 | − | − |
| Convolution | 32 | 32 | 3 | 1 | 1 |
| Av. pooling | 32 | 32 | 2 | − | − |
| Dense | 1568 | 512 | − | − | − |
| Dense | 512 | 2 | − | − | − |

7.2.2 Parameters for Training

Using an SNN supervised learning rule based on backpropagation, such as the STBP [95], it is possible to set several hyper-parameters:

• Loss function: The mean squared error (MSE) loss criterion is employed, since it achieves the highest performance in [95].
• Optimizer: Adam [35] is used, because it is well suited to the STBP.
• Learning rate (LR): After some preliminary experiments, a good value ranges between 1e−5 and 1e−4; with the latter value, the training is faster and the SNN achieves good accuracy in fewer epochs.

Other specific parameters related to the learning rule implemented on the SNNs with LIF neurons can be adjusted. The membrane potential update ($u_i^{t+1,n}$) using the membrane potential decay factor $\tau$ is formalized in Eq. (11).

Embedded Neuromorphic Using Intel’s Loihi Processor

$$u_i^{t+1,n} = u_i^{t,n}\,\tau\,\left(1 - o_i^{t,n}\right) + \sum_{j=1}^{l(n-1)} w_{ij}^{n}\, o_j^{t+1,n-1} + b_i^{n} \tag{11}$$
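As an illustration of Eq. (11), the following NumPy sketch computes one membrane potential update for a fully connected layer. It is a minimal sketch and not the authors' PyTorch code; the function name `membrane_update` and its argument names are hypothetical. The factor $(1 - o_i^{t,n})$ clears the potential of every neuron that emitted a spike at the previous timestep, while the remaining neurons decay by $\tau$ before the weighted input spikes and the bias are added.

```python
import numpy as np


def membrane_update(u, o_prev, spikes_in, w, b, tau=0.2):
    """One timestep of Eq. (11) for a fully connected layer.

    u         : membrane potentials u_i^{t,n}, shape (n_out,)
    o_prev    : spikes emitted by this layer at time t, shape (n_out,)
    spikes_in : spikes o_j^{t+1,n-1} of the previous layer, shape (n_in,)
    w, b      : weights of shape (n_out, n_in) and biases of shape (n_out,)
    tau       : membrane potential decay factor
    """
    return u * tau * (1.0 - o_prev) + w @ spikes_in + b


u = np.array([1.0, 0.5])
o_prev = np.array([1.0, 0.0])          # the first neuron fired at time t
spikes_in = np.array([1.0, 0.0, 1.0])
w = np.full((2, 3), 0.1)
b = np.zeros(2)
print(membrane_update(u, o_prev, spikes_in, w, b))  # approximately [0.2, 0.3]
```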

Another key parameter of a LIF neuron is its threshold ($V_{th}$). When the membrane potential exceeds this value, the neuron generates an output spike and resets its potential to a specific value. In each experiment, all the neurons have the same $V_{th}$, and the reset value is equal to 0. The third parameter ($a_1/2$) is related to the approximation of the derivative of the spiking nonlinearity. The rectangular pulse function defined in Eq. (12) is used.

$$h_1(u) = \frac{1}{a_1}\,\operatorname{sign}\left(|u - V_{th}| < \frac{a_1}{2}\right) \tag{12}$$
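The spike generation and the rectangular approximation of Eq. (12) can be sketched as follows. This is an illustrative NumPy version, not the authors' code: the function names are hypothetical, the sign term of Eq. (12) is read as an indicator (1 inside the window, 0 outside), and the strict comparison used for firing is an assumption. The default $a_1 = 0.8$ corresponds to $a_1/2 = 0.4$.

```python
import numpy as np


def spike(u, v_th=0.4):
    """Forward pass: emit a spike where the potential exceeds the threshold."""
    return (u > v_th).astype(np.float64)


def rect_surrogate(u, v_th=0.4, a1=0.8):
    """Eq. (12): 1/a1 inside the window |u - v_th| < a1/2, and 0 outside.

    Used in place of the derivative of the non-differentiable spike function
    during backpropagation.
    """
    inside = np.abs(u - v_th) < a1 / 2.0
    return inside.astype(np.float64) / a1


u = np.array([-0.1, 0.1, 0.4, 0.79, 0.9])
print(spike(u))           # [0. 0. 0. 1. 1.]
print(rect_surrogate(u))  # [0.   1.25 1.25 1.25 0.  ]
```

The pulse has width $a_1$ and height $1/a_1$, so its area is 1 regardless of the chosen $a_1$.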

The following experiments are conducted to set these parameters, with a particular focus on $V_{th}$. We made these decisions:

• $V_{th}$: Its value changes from 0.3 to 0.8, and for each value, the accuracy curve is evaluated.
• $a_1/2$: It assumes the same value as the threshold, as this assumption is also made in [95].
• $\tau$: Its value must be small to obtain a good approximation of the neuron model and, in particular, of $f(o_i^{t,n})$. Hence, it is set to 0.2 ms.

To speed up the process and achieve good performance, an accumulation mechanism is introduced. The spikes are accumulated at a constant time rate called sample time ($T_{sample}$). In these experiments, its value is set to 10 ms. Every $T_{sample}$, a new input image that feeds the SNN is constructed. The events composing the image follow the rule that at most one spike per channel can be present in each pixel for a time window of 15 timesteps. Hence, this accumulation mode compresses the input information. The evaluated accuracy refers to every single sample (i.e., accumulated image), with training lasting 300 epochs. Table 5 and Fig. 11 report the results of these experiments for the SNN with the full-size image (Table 2).

Table 5 Experiments to find an efficient value of V_th

Input size  V_th  a_1/2  τ (ms)  T_sample (ms)  Batch size  LR    Accuracy (%)
128 × 128   0.3   0.3    0.2     10             20          1e−5  83.0
128 × 128   0.4   0.4    0.2     10             20          1e−5  84.0
128 × 128   0.5   0.5    0.2     10             20          1e−5  82.4
128 × 128   0.6   0.6    0.2     10             20          1e−5  81.9
128 × 128   0.8   0.8    0.2     10             20          1e−5  82.6

The highest accuracy (84.0%, obtained with V_th = 0.4) is highlighted in bold in the original table.


From Table 5, the best accuracy is achieved with $V_{th} = 0.4$. Moreover, Fig. 11 shows that, while a relatively high accuracy is reached after a few epochs with $V_{th} = 0.3$, the training curve with $V_{th} = 0.4$ is more stable than in the other experiments. These two reasons motivate the choice of 0.4 for the $V_{th}$ parameter.

7.2.3 Parameters for Feeding the Input Data

As previously discussed, the input spikes are given to the SNN with an accumulation strategy to speed up the training. The experiments reported in Table 5 already show a relatively high accuracy. Besides preserving this property, the accumulation decreases the power consumption and increases the reactivity of the system, thanks to the input data compression. Moreover, the upper bound on the latency of the system is set to 10 ms. Hence, for the training, only 10 ms of each dataset sample stream is fetched, starting from a random initial point; this duration is defined as the maximum acceptable sample length ($T_l$). Given this constraint, two different approaches can be adopted:

1. Accumulate the spikes every $T_l$ ($T_{sample} = T_l$) and make the prediction on a single image for the entire input stream, as performed in the previous experiments.
2. Accumulate the spikes so as to have more than one input image for every input stream ($T_{sample} < T_l$), and afterward select the class with the majority of predictions.

Therefore, some analyses are conducted to find efficient values for the sample time and the variation of the LR, with two different batch sizes (BS in Table 6). In these experiments, the second approach for the image accumulation is used, and the parameters are set as follows: $V_{th} = 0.4$, $a_1/2 = 0.4$, $\tau = 0.2$ ms. The training lasts for 200 epochs, with a learning rate equal to 1e−4 and a minimum batch size of 40.

Fig. 11 Accuracy curves for the experiment made to evaluate an efficient value of $V_{th}$, based on the results in [90] (accuracy in % versus training epoch, from 0 to 300, for $V_{th}$ = 0.3, 0.4, 0.5, 0.6, and 0.8; the plot annotations read "High accuracy after few epochs" and "Stable accuracy")

Three different metrics are used to evaluate the accuracy:

• One-shot accuracy on test data ($acc_s$): It is the accuracy computed for all the samples taken at $T_s$ of the test dataset.
• Accuracy on test data ($acc_{test}$): It represents the accuracy for all the sample streams of the test dataset, computed from the majority prediction over the part of the stream with sample length equal to $T_l$.
• Accuracy on train data ($acc_{train}$): It is the accuracy for all the train streams, based on the majority prediction computed with sample length equal to $T_l$.

The results in Table 6 provide the necessary feedback for setting the value of $T_s$. If it is small (e.g., 0.5 ms), there are more points for the same stream sample, but the SNN training becomes difficult since the accumulation has no effect and there is a lower temporal correlation. On the other hand, for high $T_s$ (e.g., 2 ms), the accuracy is low. The best trade-off is obtained when $T_s$ is equal to 1 ms. Moreover, the batch size affects the training process. To achieve high accuracy, the value of BS should be limited to 40.

Table 6 Experiments to find efficient values of T_s, T_l, and batch size

Input size  T_s (ms)  T_l (ms)  BS  LR    acc_s (%)  acc_test (%)  acc_train (%)
128 × 128   1.0       2.0       80  1e−4  80         79            83
128 × 128   1.0       4.0       80  1e−4  80         80            86
128 × 128   1.0       6.0       80  1e−4  51         51            51
128 × 128   1.0       8.0       80  1e−4  80         79            89
128 × 128   1.0       2.0       40  1e−4  80         77            86
128 × 128   1.0       4.0       40  1e−4  80         83            88
128 × 128   1.0       6.0       40  1e−4  72         70            90
128 × 128   1.0       8.0       40  1e−4  81         86            91
128 × 128   1.0       10.0      40  1e−4  80         86            94
128 × 128   2.0       10.0      40  1e−4  51         51            51
100 × 100   0.5       10.0      40  1e−4  75         80            84
100 × 100   1.0       10.0      40  1e−4  81         85            92
100 × 100   2.0       10.0      40  1e−4  51         51            51
50 × 50     0.5       10.0      40  1e−4  67         71            79
50 × 50     1.0       10.0      40  1e−4  71         75            81
50 × 50     2.0       10.0      40  1e−4  74         77            83

In the original table, bold values indicate the highest accuracy for each input size.

In the first experiments of Table 6, only the variation of $T_l$ and BS is analyzed. With constant BS and the same value of $acc_s$, the $acc_{test}$ increases or remains stable with the increase of $T_l$. This behavior is expected since there are more sub-predictions to compute the final result when $T_l$ is large. The changes in $acc_s$ are due to the non-deterministic training process.

Fig. 12 Setup and tool flow for conducting the experiments in [90] (blocks: datasets, SNN model, learning methods, and training parameters feed the SNN training on Nvidia RTX 2080-Ti GPUs; the trained SNN model and weights are deployed on Loihi, which receives samples every $T_{sample}$ from the DVS camera and outputs the predicted class, background or car)
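The difference between the one-shot accuracy and the majority-vote accuracy defined in Sect. 7.2.3 can be illustrated with a small example. The sketch below is not the authors' evaluation code; the function names and the toy predictions are hypothetical, and ties are resolved toward the lower class index as an arbitrary choice.

```python
import numpy as np


def stream_prediction(sample_predictions):
    """Majority vote over per-sample predictions (0 = background, 1 = car)."""
    counts = np.bincount(np.asarray(sample_predictions, dtype=int), minlength=2)
    return int(np.argmax(counts))  # ties resolve to the lower class index


def accuracies(per_stream_predictions, labels):
    """Return (one-shot accuracy, majority-vote accuracy)."""
    pairs = list(zip(per_stream_predictions, labels))
    sample_hits = sum(int(p == y) for preds, y in pairs for p in preds)
    n_samples = sum(len(preds) for preds, _ in pairs)
    stream_hits = sum(int(stream_prediction(preds) == y) for preds, y in pairs)
    return sample_hits / n_samples, stream_hits / len(pairs)


# Two streams with T_l / T_s = 10 sub-predictions each (toy values).
predictions = [
    [1, 1, 0, 1, 1, 0, 1, 1, 1, 0],  # true class: car (1)
    [0, 0, 1, 0, 0, 0, 1, 0, 0, 0],  # true class: background (0)
]
print(accuracies(predictions, labels=[1, 0]))  # (0.75, 1.0)
```

In this toy case a quarter of the individual samples are misclassified, yet both streams are classified correctly, which is consistent with the observation that more sub-predictions per stream tend to help the stream-level accuracy.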

7.3 Evaluation of CarSNN Implemented on Loihi

For the experiments, the second approach shown in Fig. 7 has been adopted. The CarSNN is trained offline and then mapped onto the neuromorphic chip. Figure 12 shows an overview of the tool flow for conducting the experiments.

7.3.1 Experimental Setup

Consistent with the analyses made in the previous sections, the experiments are conducted on the N-CARS dataset [79]. It consists of an 80-min recording with an ATIS camera [62] mounted behind the windshield of a car. The resulting events are converted into gray-scale images and processed with a state-of-the-art object detector [65, 66] to automatically extract bounding boxes of size 120 × 100 around the two classes. The dataset is composed of 7940 car and 7482 background training samples and 4396 car and 4211 background testing samples, each lasting 100 ms. The dataset samples are 1-channel streams with two possible event values (−1 and 1). The SNNs are described using the PyTorch library [58]. In this code, the SNNs' functional behavior is modeled by implementing Eq. (11), which contains the mechanism to update the membrane potential. The experiments are executed on a workstation equipped with Nvidia RTX 2080-Ti GPUs. The hyper-parameter setup is based on the analyses made in Sects. 7.2.1 and 7.2.2 and is summarized in Table 7.

Table 7 Parameters of the experiments

Epochs  T_s (ms)  T_l (ms)  BS  LR            V_th  a_1/2  τ (ms)
200     1.0       10.0      40  1e−3 to 1e−6  0.4   0.4    0.2

The dataset streams are shuffled randomly, and the sample of length $T_l$ is selected starting from a random initial point. The value of BS is set to 40 since it gives the best accuracy in the previous experiments (according to Table 6) and guarantees a reasonable training time. To obtain a fair comparison, the same values of $T_s = 1$ ms and $T_l = 10$ ms are set for the three experiments. Table 6 reports the results leveraging the trade-offs between these two values. The same parameters for the SNN model used in Sect. 7.2.3 are employed. The LR is multiplied by 0.5 every 20 epochs, starting from the value 1e−3. Compared to a fixed LR, the accuracy slightly increases with this approach. To ease the SNN mapping onto the Loihi chip, only the weights are updated during training, while the bias is forced to 0. The training lasts for 200 epochs, and every sample taken at $T_s$ is evaluated for 20 timesteps. With these hardware and software settings, the training for one single epoch on all the dataset samples lasts about 300 seconds. The mean inference latency for all samples, given at the time $T_s$, is about 0.8 ms.

Table 8 Results of the offline training experiments

Input size  acc_s (%)  acc_test (%)  acc_train (%)
128 × 128   80.1       85.7          93.6
100 × 100   80.5       86.3          95.0
50 × 50     72.6       78.7          85.3
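The learning-rate decay used for the offline training (a factor of 0.5 every 20 epochs, starting from 1e−3) can be written as a one-line schedule. The sketch below is illustrative; the function name `learning_rate` is hypothetical, and the authors' training script is not shown in the source.

```python
def learning_rate(epoch, initial_lr=1e-3, factor=0.5, step=20):
    """Step decay: multiply the LR by `factor` every `step` epochs."""
    return initial_lr * factor ** (epoch // step)


for epoch in (0, 20, 100, 199):
    print(epoch, learning_rate(epoch))
# The last epoch (199) uses 1e-3 * 0.5**9, i.e., about 1.95e-6.
```

Over 200 epochs the LR therefore spans roughly three orders of magnitude, in line with the range "1e−3 to 1e−6" listed in Table 7. In PyTorch, the same schedule would typically be obtained with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)`; whether the authors used this scheduler is not stated.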

7.3.2 Accuracy Results for Offline Trained CarSNN

Table 8 shows the results in terms of the accuracy metrics defined in Sect. 7.2.3. The accuracy values measured for the attention window of size 100 × 100 are comparable to the results for the larger image size (128 × 128) and indeed exhibit slightly higher $acc_{test}$ and $acc_{train}$. This observation indicates that the cropped part of the sample is not very significant for the correct classification and might even cause SNN misbehavior. On the other hand, the input values that form a small part of the original image (50 × 50) lead to a significant accuracy decrease. Moreover, the results in Table 8 show an overfitting behavior, visible in the gap between $acc_{test}$ and $acc_{train}$, which limits the upper bound of the accuracy for the developed CarSNN models.


7.3.3 CarSNN Implemented on Loihi

The CarSNN implementation on the Intel Loihi neuromorphic chip exploits some similarities with the offline model used for the previous experiments. Equation (13) reports how the compartment voltage (CompV), which measures the membrane voltage of a neuron, evolves in the neuromorphic hardware [15].

$$\mathit{CompV}_{t+1} = \mathit{CompV}_{t} \cdot \frac{2^{12} - \delta_v}{2^{12}} + \mathit{CompI}_{t+1} + \mathit{bias} \tag{13}$$

The compartment current (CompI) formula in Eq. (14) expresses the accumulation of the weighted incoming spikes from the $j$-th presynaptic neuron.

$$\mathit{CompI}_{t+1} = \mathit{CompI}_{t} \cdot \frac{2^{12} - \delta_i}{2^{12}} + 2^{6 + \mathit{wgtExp}} \sum_{j} w_j\, s_j^{t+1} \tag{14}$$

From Eqs. (13) and (14), the following parameters are defined:

• $\delta_i$: Compartment current decay
• $\delta_v$: Compartment voltage decay
• bias: Bias component on CompV
• wgtExp: Value used to implement very different weights between different SNN layers
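Equations (13) and (14) can be combined into a single update step, as in the following sketch. It is a floating-point illustration only and not Loihi or Nx SDK code: the function name `loihi_step` is hypothetical, and the fixed-point rounding, thresholding, and reset performed by the hardware are deliberately left out.

```python
def loihi_step(comp_v, comp_i, weighted_spikes, delta_v, delta_i, bias=0, wgt_exp=0):
    """One timestep of Eqs. (13) and (14), evaluated in floating point.

    weighted_spikes : sum over presynaptic neurons j of w_j * s_j^{t+1}
    delta_v, delta_i: voltage and current decay, integers in [0, 2**12]
    """
    # Eq. (14): decay the current, then add the scaled synaptic input.
    comp_i_next = comp_i * (2**12 - delta_i) / 2**12 + 2 ** (6 + wgt_exp) * weighted_spikes
    # Eq. (13): decay the voltage, then add the new current and the bias.
    comp_v_next = comp_v * (2**12 - delta_v) / 2**12 + comp_i_next + bias
    return comp_v_next, comp_i_next


# With delta_i = 2**12 the previous current is fully discarded.
v, i = loihi_step(comp_v=100.0, comp_i=50.0, weighted_spikes=2, delta_v=3276, delta_i=4096)
print(v, i)  # 148.01953125 128.0
```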

Comparing the formulation of the offline model (i.e., Eq. (11)) with Eq. (13) of the online model, their similarities are expressed in Eqs. (15)–(18).

$$\mathit{CompV}_{t} = u_{t} \tag{15}$$

$$\mathit{CompI}_{t} = \sum_{j} w_j\, o_j^{t+1} \quad \text{if } \delta_i = 2^{12} \tag{16}$$

$$\frac{2^{12} - \delta_v}{2^{12}} = \tau \tag{17}$$

$$\mathit{bias} = b \tag{18}$$
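Equation (17) can be inverted to obtain the hardware voltage decay from the offline decay factor. The sketch below uses hypothetical helper names, and the truncation to an integer is an assumption about how the value is rounded. For $\tau = 0.2$ it yields 3276, which is the $\delta_v$ value listed in Table 9.

```python
def tau_to_delta_v(tau):
    """Invert Eq. (17): (2**12 - delta_v) / 2**12 = tau."""
    # Assumption: the result is truncated to an integer register value.
    return int(2**12 * (1.0 - tau))


def delta_v_to_tau(delta_v):
    """Effective decay factor realized by an integer delta_v, as in Eq. (17)."""
    return (2**12 - delta_v) / 2**12


delta_v = tau_to_delta_v(0.2)
print(delta_v, delta_v_to_tau(delta_v))  # 3276 0.2001953125
```

The effective decay factor differs from the offline $\tau$ only in the fourth decimal place, so this quantization step alone is unlikely to explain a noticeable accuracy change.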

Only the CarSNN described in Table 2 is implemented online, since it achieves good offline accuracy (as indicated in Table 8) and represents the most complex developed network in terms of latency, power consumption, and number of neurons. The Loihi chip uses only 8 bits for the storage of weights. The maximum weight range is (−7, 6). Since these values are very different across layers and the wgtExp is limited, the following operations are conducted:

1. Weights and $V_{th}$ are multiplied by 25 (this value does not include the default multiplication by $2^6$ of weights and $V_{th}$ performed on the Loihi).
2. All the 8 bits are used to store the values.

According to Eqs. (15)–(18), the other neuromorphic hardware parameters can be adjusted. All the setup parameters are summarized in Table 9.

Table 9 Translation of parameters from offline to online implementation on Loihi

Offline implementation                        Loihi online implementation
Parameter  Value  Precision                   Parameter  Value  Precision
V_th       0.4    64-bit floating point       V_th mant  10     12-bit fixed point
Weight     ×1     64-bit floating point       Weight     ×25    8-bit fixed point
τ          0.2    64-bit floating point       δ_v        3276   12-bit fixed point
b          0      64-bit floating point       Bias       0      8-bit fixed point
−          −      64-bit floating point       δ_i        0      12-bit fixed point

The CarSNN is defined using the Intel Nx SDK API version 0.9.5 and runs on the Nahuku32 partition. In particular, the NxTF layers, such as NxConv2D, NxAveragePooling2D, and NxDense, are employed. This implementation offers a simple way to automatically improve the performance of the SNN. The CarSNN is tested on the N-CARS dataset. Every sample at $T_s$ is replicated for 10 timesteps, and a blank time of 7 timesteps is inserted between samples, thus obtaining 17 timesteps per inference. The real-time constraint of a maximum inference latency of 1 ms must be respected. In the results reported in Table 10, the mean latency, indicating the time used to evaluate every sample at $T_s$, is computed by multiplying the mean total execution time (in timesteps) by the number of timesteps per inference. On the other hand, the maximum latency indicates the maximum "spiking time" for every timestep, i.e., the time required by the Loihi chip to make the classification decision. This value indicates whether the latency constraint is met. It excludes the time overhead used to exchange results between the chip and the host system, which can be reduced by directly using output ports.

Table 10 Results of the CarSNN implemented onto the Loihi chip

acc_s (%)  acc_test (%)  Neurons  Synapses   Neurocores  Mean latency (μs)  Max latency (μs)
72.16      82.99         54,274   5,122,048  151         899.6              ≈ 700

From Table 10, the following observations can be made:

• The gap between offline and online results is due to the simplicity of the model used offline, which represents a high-level approximation of the actual behavior of the neuromorphic chip.
• The $acc_{test}$ for the implementation onto the Loihi chip is 2.6% lower than for the offline application.
• The maximum latency does not exceed $T_s$ (1 ms).
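The neuron and synapse counts of Table 10 can be related to the layer description of Table 2 with a short calculation. The sketch below is one plausible counting convention, not a description of how the Nx SDK reports these numbers: the function name is hypothetical, input pixels are not counted as neurons, and border effects of the padded convolutions are ignored. Under these assumptions, the counts of Table 10 are reproduced exactly.

```python
# Layers of the 128 x 128 CarSNN (Table 2): (type, in ch., out ch., kernel size).
LAYERS = [
    ("pool", 2, 2, 4),
    ("conv", 2, 32, 3),
    ("pool", 32, 32, 2),
    ("conv", 32, 32, 3),
    ("pool", 32, 32, 2),
    ("dense", 2048, 1024, None),
    ("dense", 1024, 2, None),
]


def count_neurons_and_synapses(side=128):
    """Sum the layer outputs (neurons) and their incoming connections (synapses)."""
    neurons = synapses = 0
    for kind, in_ch, out_ch, kernel in LAYERS:
        if kind == "pool":
            side //= kernel                   # 128 -> 32 -> 16 -> 8
            outputs = side * side * out_ch
            fan_in = kernel * kernel          # each channel is pooled separately
        elif kind == "conv":
            outputs = side * side * out_ch    # padding 1 keeps the spatial size
            fan_in = kernel * kernel * in_ch  # border effects are ignored
        else:
            outputs = out_ch
            fan_in = in_ch
        neurons += outputs
        synapses += outputs * fan_in
    return neurons, synapses


print(count_neurons_and_synapses())  # (54274, 5122048)
```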


Table 11 Power and energy consumption of the CarSNN implemented onto the Loihi chip

Lakemont power (mW)  Neurocores power (mW)  System power (mW)  Energy per inference (μJ)
40.8                 314.5                  1375.4             319.7

Table 11 describes the power and energy consumption of the application implemented on the neuromorphic chip. In particular:

• The Lakemont power is the consumption of the embedded processors used to manage neurons and exchange messages with the host system.
• The neurocores power represents the consumption of the neurons.
• The system power is the consumption of the entire system, where a significant contribution is due to the static power consumed in the used partition.
• The energy per inference is the mean energy consumed to classify one sample.

Hence, Table 11 reports the power and energy consumption of the SNN implemented onto the Loihi chip, which are several orders of magnitude lower than the power and energy consumed by GPUs on the same task.
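The figures of Tables 10 and 11 can be cross-checked with simple arithmetic. The calculation below rests on two assumptions that are not stated in the source: that the energy per inference is the sum of the embedded-processor power (40.8 mW) and the neurocores power multiplied by the mean latency, and that the mean latency is spread evenly over the 17 timesteps of one inference. The variable names are illustrative.

```python
embedded_cpu_mw = 40.8       # embedded-processor power from Table 11
neurocores_mw = 314.5        # neurocores power from Table 11
mean_latency_us = 899.6      # mean latency from Table 10
timesteps_per_inference = 10 + 7  # replicated sample plus blank time

# mW * us = nJ, so divide by 1000 to obtain microjoules.
energy_uj = (embedded_cpu_mw + neurocores_mw) * mean_latency_us / 1000.0
time_per_timestep_us = mean_latency_us / timesteps_per_inference

print(round(energy_uj, 2))             # 319.63
print(round(time_per_timestep_us, 1))  # 52.9
```

The result is close to the reported 319.7 μJ, which suggests, without proving it, that the reported energy refers to these two power components rather than to the total system power.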

7.3.4 Comparison with the State of the Art

CarSNN is the first spiking convolutional neural network (CNN) designed to perform event-based "cars vs. background" classification on neuromorphic hardware using statistical analysis of event occurrences to identify different attention windows. A simple yet efficient technique for event accumulation in time maintains the spike temporal correlation. In the related works, the time correlation is maintained with good performance using different methods:

• Histograms of averaged time surfaces (HATS) [79] uses local memory to calculate the average of time surfaces, which represents the recent temporal activity within a local spatial neighborhood.
• Hierarchy of time surfaces (HOTS) [37] computes time surfaces hierarchically across the layers.
• Gabor filter [7] considers the spatial correlation between different events and assigns them to the channels based on this information.

In HATS [79], the methods are evaluated with a simple linear support vector machine (SVM) classifier on the N-CARS dataset. Table 12 compares the results of this simple classifier with CarSNN. The Gabor filter method adopts a two-layer SNN before the SVM. As discussed in Sect. 7.2.2, since the upper bound of $T_l$ is 10 ms, the comparison takes this real-time constraint into account. As highlighted in Table 12, CarSNN achieves better accuracy with a limited $T_l$ than the linear SVMs applied after different and more complicated accumulation approaches.

Table 12 Comparison of results for T_l = 10 ms

Classifier (accumulation approach)     acc_test
Linear SVM (HOTS)                      ≈ 0.54
Linear SVM (Gabor-SNN)                 ≈ 0.66
Linear SVM (HATS)                      ≈ 0.81
CarSNN (128 × 128 attention window)    0.86
CarSNN (100 × 100 attention window)    0.86
CarSNN (50 × 50 attention window)      0.79

The two CarSNN networks with acc_test = 0.86 (bold in the original table) outperform the state of the art in terms of accuracy.

8 Conclusion

Neuromorphic architectures have emerged as efficient computing platforms for implementing SNNs. They are especially appealing for their low power/energy consumption and enable the development of high-performance machine learning applications on resource-constrained devices. They match well with event-based sensors for performing the computations only in the presence of events, thus saving power during idle times. This chapter discusses the opportunities and state-of-the-art advancements of embedded neuromorphic computing, which combines the advancements in theoretical and computational neuroscience with the notions and results from multiple subfields, including electrical engineering, machine learning, signal processing, and robotics. A particular focus is given to the implementation of CarSNN, a spiking network for event-based car detection in the context of autonomous driving, implemented onto the Loihi neuromorphic chip. The results show a power consumption reduction of several orders of magnitude while also increasing the classification accuracy compared to the related works. Such promising results provide a solid basis for developing and implementing innovative features both from the algorithmic perspective (e.g., implementing online lifelong learning on the chip) and from the hardware design perspective (e.g., the integration of processors and memory units into in-memory computing devices).

Acknowledgments This work has been supported in part by the Doctoral College Resilient Embedded Systems, which is run jointly by the TU Wien's Faculty of Informatics and the UAS Technikum Wien. This work was also supported in part by the NYUAD Center for Interacting Urban Networks (CITIES), funded by Tamkeen under the NYUAD Research Institute Award CG001, Center for Cybersecurity (CCS), funded by Tamkeen under the NYUAD Research Institute Award G1104, and Center for Artificial Intelligence and Robotics (CAIR), funded by Tamkeen under the NYUAD Research Institute Award CG010.


References

1. Amir, A., Taba, B., Berg, D.J., Melano, T., McKinstry, J.L., di Nolfo, C., Nayak, T.K., Andreopoulos, A., Garreau, G., Mendoza, M., Kusnitz, J., DeBole, M., Esser, S.K., Delbrück, T., Flickner, M., Modha, D.S.: A low power, fully event-based gesture recognition system. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21–26, 2017, pp. 7388–7397. IEEE Computer Society (2017). https://doi.org/10.1109/CVPR.2017.781
2. Bekolay, T., Bergstra, J., Hunsberger, E., DeWolf, T., Stewart, T.C., Rasmussen, D., Choo, F., Voelker, A., Eliasmith, C.: Nengo: a python tool for building large-scale functional brain models. Frontiers Neuroinformatics 7, 48 (2013). https://doi.org/10.3389/fninf.2013.00048
3. Bellec, G., Salaj, D., Subramoney, A., Legenstein, R., Maass, W.: Long short-term memory and learning-to-learn in networks of spiking neurons. In: Bengio, S., Wallach, H.M., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3–8, 2018, Montréal, Canada, pp. 795–805 (2018). https://proceedings.neurips.cc/paper/2018/hash/c203d8a151612acf12457e4d67635a95-Abstract.html
4. Benjamin, B.V., Gao, P., McQuinn, E., Choudhary, S., Chandrasekaran, A., Bussat, J., Alvarez-Icaza, R., Arthur, J.V., Merolla, P., Boahen, K.: Neurogrid: A mixed-analog-digital multichip system for large-scale neural simulations. Proc. IEEE 102(5), 699–716 (2014). https://doi.org/10.1109/JPROC.2014.2313565
5. Binas, J., Neil, D., Liu, S., Delbrück, T.: DDD17: end-to-end DAVIS driving dataset. CoRR abs/1711.01458 (2017). http://arxiv.org/abs/1711.01458
6. Blouw, P., Choo, X., Hunsberger, E., Eliasmith, C.: Benchmarking keyword spotting efficiency on neuromorphic hardware. CoRR abs/1812.01739 (2018). http://arxiv.org/abs/1812.01739
7. Bovik, A.C., Clark, M., Geisler, W.S.: Multichannel texture analysis using localized spatial filters. IEEE Trans. Pattern Anal. Mach. Intell. 12(1), 55–73 (1990). https://doi.org/10.1109/34.41384
8. Brandli, C., Berner, R., Yang, M., Liu, S., Delbrück, T.: A 240 × 180 130 db 3 μs latency global shutter spatiotemporal vision sensor. IEEE J. Solid State Circuits 49(10), 2333–2341 (2014). https://doi.org/10.1109/JSSC.2014.2342715
9. Buettner, K., George, A.D.: Heartbeat classification with spiking neural networks on the Loihi neuromorphic processor. In: IEEE Computer Society Annual Symposium on VLSI, ISVLSI 2021, Tampa, FL, USA, July 7–9, 2021, pp. 138–143. IEEE (2021). https://doi.org/10.1109/ISVLSI51109.2021.00035
10. Capra, M., Bussolino, B., Marchisio, A., Masera, G., Martina, M., Shafique, M.: Hardware and software optimizations for accelerating deep neural networks: Survey of current trends, challenges, and the road ahead. IEEE Access 8, 225134–225180 (2020). https://doi.org/10.1109/ACCESS.2020.3039858
11. Ceolini, E., Frenkel, C., Shrestha, S.B., Taverni, G., Khacef, L., Payvand, M., Donati, E.: Hand-gesture recognition based on EMG and event-based camera sensor fusion: A benchmark in neuromorphic computing. Front. Neurosci. 14 (2020). https://doi.org/10.3389/fnins.2020.00637. https://www.frontiersin.org/article/10.3389/fnins.2020.00637
12. Chen, S., Guo, M.: Live demonstration: CeleX-V: A 1M pixel multi-mode event-based sensor. In: IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2019, Long Beach, CA, USA, June 16–20, 2019, pp. 1682–1683. Computer Vision Foundation/IEEE (2019). https://doi.org/10.1109/CVPRW.2019.00214. http://openaccess.thecvf.com/content_CVPRW_2019/html/EventVision/Chen_Live_Demonstration_CeleX-V_A_1M_Pixel_Multi-Mode_Event-Based_Sensor_CVPRW_2019_paper.html
13. Cheng, W., Luo, H., Yang, W., Yu, L., Chen, S., Li, W.: DET: A high-resolution DVS dataset for lane extraction. In: IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2019, Long Beach, CA, USA, June 16–20, 2019, pp. 1666–1675. Computer Vision Foundation/IEEE (2019). https://doi.org/10.1109/CVPRW.2019.00210. http://openaccess.thecvf.com/content_CVPRW_2019/html/EventVision/Cheng_DET_A_High-Resolution_DVS_Dataset_for_Lane_Extraction_CVPRW_2019_paper.html
14. Corporation, I.: Lava: A software framework for neuromorphic computing. https://lava-nc.org/
15. Davies, M., Srinivasa, N., Lin, T., Chinya, G.N., Cao, Y., Choday, S.H., Dimou, G.D., Joshi, P., Imam, N., Jain, S., Liao, Y., Lin, C., Lines, A., Liu, R., Mathaikutty, D., McCoy, S., Paul, A., Tse, J., Venkataramanan, G., Weng, Y., Wild, A., Yang, Y., Wang, H.: Loihi: A neuromorphic manycore processor with on-chip learning. IEEE Micro 38(1), 82–99 (2018). https://doi.org/10.1109/MM.2018.112130359
16. Davies, M., Wild, A., Orchard, G., Sandamirskaya, Y., Guerra, G.A.F., Joshi, P., Plank, P., Risbud, S.R.: Advancing neuromorphic computing with Loihi: A survey of results and outlook. Proc. IEEE 109(5), 911–934 (2021). https://doi.org/10.1109/JPROC.2021.3067593
17. Davison, A.P., Brüderle, D., Eppler, J.M., Kremkow, J., Müller, E.B., Pecevski, D., Perrinet, L.U., Yger, P.: PyNN: a common interface for neuronal network simulators. Frontiers Neuroinformatics 2, 11 (2008). https://doi.org/10.3389/neuro.11.011.2008
18. Delmerico, J.A., Cieslewski, T., Rebecq, H., Faessler, M., Scaramuzza, D.: Are we ready for autonomous drone racing? the UZH-FPV drone racing dataset. In: International Conference on Robotics and Automation, ICRA 2019, Montreal, QC, Canada, May 20–24, 2019, pp. 6713–6719. IEEE (2019). https://doi.org/10.1109/ICRA.2019.8793887
19. DeWolf, T., Jaworski, P., Eliasmith, C.: Nengo and low-power AI hardware for robust, embedded neurorobotics. Frontiers Neurorobotics 14, 568359 (2020). https://doi.org/10.3389/fnbot.2020.568359
20. Fei-Fei, L., Fergus, R., Perona, P.: Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. Comput. Vis. Image Underst. 106(1), 59–70 (2007). https://doi.org/10.1016/j.cviu.2005.09.012
21. Finateu, T., Niwa, A., Matolin, D., Tsuchimoto, K., Mascheroni, A., Reynaud, E., Mostafalu, P., Brady, F.T., Chotard, L., LeGoff, F., Takahashi, H., Wakabayashi, H., Oike, Y., Posch, C.: 5.10 A 1280×720 back-illuminated stacked temporal contrast event-based vision sensor with 4.86μm pixels, 1.066geps readout, programmable event-rate controller and compressive data-formatting pipeline. In: 2020 IEEE International Solid-State Circuits Conference, ISSCC 2020, San Francisco, CA, USA, February 16–20, 2020, pp. 112–114. IEEE (2020). https://doi.org/10.1109/ISSCC19947.2020.9063149
22. Fischer, T., Milford, M.: Event-based visual place recognition with ensembles of temporal windows. IEEE Robotics Autom. Lett. 5(4), 6924–6931 (2020). https://doi.org/10.1109/LRA.2020.3025505
23. Frenkel, C., Lefebvre, M., Legat, J., Bol, D.: A 0.086-mm2 12.7-pJ/SOP 64k-synapse 256-neuron online-learning digital spiking neuromorphic processor in 28-nm CMOS. IEEE Trans. Biomed. Circuits Syst. 13(1), 145–158 (2019). https://doi.org/10.1109/TBCAS.2018.2880425
24. Furber, S.B., Galluppi, F., Temple, S., Plana, L.A.: The SpiNNaker project. Proc. IEEE 102(5), 652–665 (2014). https://doi.org/10.1109/JPROC.2014.2304638
25. Fusi, S., Annunziato, M., Badoni, D., Salamon, A., Amit, D.J.: Spike-driven synaptic plasticity: Theory, simulation, VLSI implementation. Neural Comput. 12(10), 2227–2258 (2000). https://doi.org/10.1162/089976600300014917
26. Gallego, G., Delbrück, T., Orchard, G., Bartolozzi, C., Taba, B., Censi, A., Leutenegger, S., Davison, A.J., Conradt, J., Daniilidis, K., Scaramuzza, D.: Event-based vision: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 44(1), 154–180 (2022). https://doi.org/10.1109/TPAMI.2020.3008413
27. Gehrig, M., Aarents, W., Gehrig, D., Scaramuzza, D.: DSEC: A stereo event camera dataset for driving scenarios. IEEE Robotics Autom. Lett. 6(3), 4947–4954 (2021). https://doi.org/10.1109/LRA.2021.3068942
28. Gerstner, W., Kistler, W.M.: Spiking Neuron Models: Single Neurons, Populations, Plasticity. Cambridge University Press (2002). https://doi.org/10.1017/CBO9780511815706
29. Ghosh, R., Gupta, A., Silva, A.N., Soares, A., Thakor, N.V.: Spatiotemporal filtering for event-based action recognition. CoRR abs/1903.07067 (2019). http://arxiv.org/abs/1903.07067


30. Guo, M., Huang, J., Chen, S.: Live demonstration: A 768 × 640 pixels 200Meps dynamic vision sensor. In: IEEE International Symposium on Circuits and Systems, ISCAS 2017, Baltimore, MD, USA, May 28–31, 2017, p. 1. IEEE (2017). https://doi.org/10.1109/ISCAS.2017.8050397
31. Hodgkin, A.L., Huxley, A.F.: A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. 117(4), 500–544 (1952). https://doi.org/10.1113/jphysiol.1952.sp004764. https://physoc.onlinelibrary.wiley.com/doi/abs/10.1113/jphysiol.1952.sp004764
32. Hu, Y., Binas, J., Neil, D., Liu, S., Delbrück, T.: DDD20 end-to-end event camera driving dataset: Fusing frames and events with deep learning for improved steering prediction. In: 23rd IEEE International Conference on Intelligent Transportation Systems, ITSC 2020, Rhodes, Greece, September 20–23, 2020, pp. 1–6. IEEE (2020). https://doi.org/10.1109/ITSC45102.2020.9294515
33. Izhikevich, E.M.: Simple model of spiking neurons. IEEE Trans. Neural Networks 14(6), 1569–1572 (2003). https://doi.org/10.1109/TNN.2003.820440
34. Jouppi, N.P., Young, C., Patil, N., Patterson, D.A., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., Boyle, R., Cantin, P., Chao, C., Clark, C., Coriell, J., Daley, M., Dau, M., Dean, J., Gelb, B., Ghaemmaghami, T.V., Gottipati, R., Gulland, W., Hagmann, R., Ho, C.R., Hogberg, D., Hu, J., Hundt, R., Hurt, D., Ibarz, J., Jaffey, A., Jaworski, A., Kaplan, A., Khaitan, H., Killebrew, D., Koch, A., Kumar, N., Lacy, S., Laudon, J., Law, J., Le, D., Leary, C., Liu, Z., Lucke, K., Lundin, A., MacKean, G., Maggiore, A., Mahony, M., Miller, K., Nagarajan, R., Narayanaswami, R., Ni, R., Nix, K., Norrie, T., Omernick, M., Penukonda, N., Phelps, A., Ross, J., Ross, M., Salek, A., Samadiani, E., Severn, C., Sizikov, G., Snelham, M., Souter, J., Steinberg, D., Swing, A., Tan, M., Thorson, G., Tian, B., Toma, H., Tuttle, E., Vasudevan, V., Walter, R., Wang, W., Wilcox, E., Yoon, D.H.: In-datacenter performance analysis of a tensor processing unit. In: Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA 2017, Toronto, ON, Canada, June 24–28, 2017, pp. 1–12. ACM (2017). https://doi.org/10.1145/3079856.3080246
35. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Bengio, Y., LeCun, Y. (eds.) 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings (2015). http://arxiv.org/abs/1412.6980
36. Krizhevsky, A.: Learning Multiple Layers of Features from Tiny Images. University of Toronto (2012)
37. Lagorce, X., Orchard, G., Galluppi, F., Shi, B.E., Benosman, R.: HOTS: A hierarchy of event-based time-surfaces for pattern recognition. IEEE Trans. Pattern Anal. Mach. Intell. 39(7), 1346–1359 (2017). https://doi.org/10.1109/TPAMI.2016.2574707
38. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998). https://doi.org/10.1109/5.726791
39. Li, H., Liu, H., Ji, X., Li, G., Shi, L.: Cifar10-dvs: An event-stream dataset for object classification. Front. Neurosci. 11 (2017). https://doi.org/10.3389/fnins.2017.00309. https://www.frontiersin.org/article/10.3389/fnins.2017.00309
40. Lichtsteiner, P., Posch, C., Delbrück, T.: A 128×128 120 db 15 μs latency asynchronous temporal contrast vision sensor. IEEE J. Solid State Circuits 43(2), 566–576 (2008). https://doi.org/10.1109/JSSC.2007.914337
41. Lin, C., Wild, A., Chinya, G.N., Cao, Y., Davies, M., Lavery, D.M., Wang, H.: Programming spiking neural networks on Intel's Loihi. Computer 51(3), 52–61 (2018). https://doi.org/10.1109/MC.2018.157113521
42. Lin, C., Wild, A., Chinya, G.N., Lin, T., Davies, M., Wang, H.: Mapping spiking neural networks onto a manycore neuromorphic architecture. In: Foster, J.S., Grossman, D. (eds.) Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2018, Philadelphia, PA, USA, June 18–22, 2018, pp. 78–89. ACM (2018). https://doi.org/10.1145/3192366.3192371
43. Lin, Y., Ding, W., Qiang, S., Deng, L., Li, G.: ES-ImageNet: A million event-stream classification dataset for spiking neural networks. CoRR abs/2110.12211 (2021). https://arxiv.org/abs/2110.12211

Embedded Neuromorphic Using Intel’s Loihi Processor

169

44. Liu, C., Bellec, G., Vogginger, B., Kappel, D., Partzsch, J., Neumärker, F., Höppner, S., Maass, W., Furber, S.B., Legenstein, R., Mayr, C.G.: Memory-efficient deep learning on a SpiNNaker 2 prototype. Front. Neurosci. 12 (2018). https://doi.org/10.3389/fnins.2018.00840. https:// www.frontiersin.org/article/10.3389/fnins.2018.00840 45. Liu, Q., Furber, S.: Real-time recognition of dynamic hand postures on a neuromorphic system. Int. J. Electr. Comput. Eng. 9(5), 507–514 (2015). https://publications.waset.org/vol/101 46. Liu, T., Mahjoubfar, A., Prusinski, D., Stevens, L.: Neuromorphic computing for content-based image retrieval. CoRR abs/2008.01380 (2020). https://arxiv.org/abs/2008.01380 47. Maass, W.: Networks of spiking neurons: The third generation of neural network models. Neural Networks 10(9), 1659–1671 (1997). https://doi.org/10.1016/S0893-6080(97)000117 48. Massa, R., Marchisio, A., Martina, M., Shafique, M.: An efficient spiking neural network for recognizing gestures with a DVS camera on the Loihi neuromorphic processor. In: 2020 International Joint Conference on Neural Networks, IJCNN 2020, Glasgow, United Kingdom, July 19–24, 2020, pp. 1–9. IEEE (2020). https://doi.org/10.1109/IJCNN48605.2020.9207109 49. McCulloch, W.S., Pitts, W.H.: A logical calculus of the ideas immanent in nervous activity. In: Boden, M.A. (ed.) The Philosophy of Artificial Intelligence, Oxford Readings in Philosophy, pp. 22–39. Oxford University Press (1990) 50. Merolla, P.A., Arthur, J.V., Alvarez-Icaza, R., Cassidy, A.S., Sawada, J., Akopyan, F., Jackson, B.L., Imam, N., Guo, C., Nakamura, Y., Brezzo, B., Vo, I., Esser, S.K., Appuswamy, R., Taba, B., Amir, A., Flickner, M.D., Risk, W.P., Manohar, R., Modha, D.S.: A million spiking-neuron integrated circuit with a scalable communication network and interface. Science 345(6197), 668–673 (2014). https://doi.org/10.1126/science.1254642. https://www.science.org/doi/abs/ 10.1126/science.1254642 51. 
Michaelis, C., Lehr, A.B., Oed, W., Tetzlaff, C.: Brian2Loihi: An emulator for the neuromorphic chip Loihi using the spiking neural network simulator Brian. CoRR abs/2109.12308 (2021). https://arxiv.org/abs/2109.12308 52. Moradi, S., Ning, Q., Stefanini, F., Indiveri, G.: A scalable multicore architecture with heterogeneous memory structures for dynamic neuromorphic asynchronous processors (DYNAPs). IEEE Trans. Biomed. Circuits Syst. 12(1), 106–122 (2018). https://doi.org/10.1109/TBCAS. 2017.2759700 53. Mueggler, E., Rebecq, H., Gallego, G., Delbrück, T., Scaramuzza, D.: The event-camera dataset and simulator: Event-based data for pose estimation, visual odometry, and SLAM. Int. J. Robotics Res. 36(2), 142–149 (2017). https://doi.org/10.1177/0278364917691115 54. Neftci, E.O., Mostafa, H., Zenke, F.: Surrogate gradient learning in spiking neural networks: Bringing the power of gradient-based optimization to spiking neural networks. IEEE Signal Process. Mag. 36(6), 51–63 (2019). https://doi.org/10.1109/MSP.2019.2931595 55. Orchard, G., Frady, E.P., Rubin, D.B.D., Sanborn, S., Shrestha, S.B., Sommer, F.T., Davies, M.: Efficient neuromorphic signal processing with Loihi 2. In: IEEE Workshop on Signal Processing Systems, SiPS 2021, Coimbra, Portugal, October 19–21, 2021, pp. 254–259. IEEE (2021). https://doi.org/10.1109/SiPS52927.2021.00053 56. Orchard, G., Jayawant, A., Cohen, G., Thakor, N.V.: Converting static image datasets to spiking neuromorphic datasets using saccades. CoRR abs/1507.07629 (2015). http://arxiv.org/abs/ 1507.07629 57. Pan, Z., Wu, J., Zhang, M., Li, H., Chua, Y.: Neural population coding for effective temporal classification. In: International Joint Conference on Neural Networks, IJCNN 2019 Budapest, Hungary, July 14–19, 2019, pp. 1–8. IEEE (2019). https://doi.org/10.1109/IJCNN.2019. 8851858 58. 
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Köpf, A., Yang, E.Z., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S.: PyTorch: An imperative style, high-performance deep learning library. In: Wallach, H.M., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E.B., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS

170

A. Marchisio and M. Shafique

2019, December 8–14, 2019, Vancouver, BC, Canada, pp. 8024–8035 (2019). https:// proceedings.neurips.cc/paper/2019/hash/bdbca288fee7f92f2bfa9f7012727740-Abstract.html 59. Patel, K., Hunsberger, E., Batir, S., Eliasmith, C.: A spiking neural network for image segmentation. CoRR abs/2106.08921 (2021). https://arxiv.org/abs/2106.08921 60. Paugam-Moisy, H., Bohté, S.M.: Computing with spiking neuron networks. In: Rozenberg, G., Bäck, T., Kok, J.N. (eds.) Handbook of Natural Computing, pp. 335–376. Springer (2012). https://doi.org/10.1007/978-3-540-92910-9_10 61. Ponulak, F., Kasinski, A.J.: Introduction to spiking neural networks: Information processing, learning and applications. Acta Neurobiol. Exp. 71(4), 409–33 (2011) 62. Posch, C., Matolin, D., Wohlgenannt, R.: A QVGA 143 db dynamic range frame-free PWM image sensor with lossless pixel-level video compression and time-domain CDS. IEEE J. Solid State Circuits 46(1), 259–275 (2011). https://doi.org/10.1109/JSSC.2010.2085952 63. Rasmussen, D.: NengoDL: Combining deep learning and neuromorphic modelling methods. CoRR abs/1805.11144 (2018). http://arxiv.org/abs/1805.11144 64. Rathi, N., Srinivasan, G., Panda, P., Roy, K.: Enabling deep spiking neural networks with hybrid conversion and spike timing dependent backpropagation. In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020. OpenReview.net (2020). https://openreview.net/forum?id=B1xSperKvH 65. Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21–26, 2017, pp. 6517–6525. IEEE Computer Society (2017). https://doi.org/10.1109/CVPR.2017.690 66. Ren, S., He, K., Girshick, R.B., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39(6), 1137–1149 (2017). https://doi.org/10.1109/TPAMI.2016.2577031 67. 
Renner, A., Evanusa, M., Orchard, G., Sandamirskaya, Y.: Event-based attention and tracking on neuromorphic hardware. In: 2nd IEEE International Conference on Artificial Intelligence Circuits and Systems, AICAS 2020, Genova, Italy, August 31 - September 2, 2020, p. 132. IEEE (2020). https://doi.org/10.1109/AICAS48895.2020.9073789 68. Renner, A., Sheldon, F., Zlotnik, A., Tao, L., Sornborger, A.T.: The backpropagation algorithm implemented on spiking neuromorphic hardware. CoRR abs/2106.07030 (2021). https://arxiv. org/abs/2106.07030 69. Rocki, K., Essendelft, D.V., Sharapov, I., Schreiber, R., Morrison, M., Kibardin, V., Portnoy, A., Dietiker, J., Syamlal, M., James, M.: Fast stencil-code computation on a wafer-scale processor. In: Cuicchi, C., Qualters, I., Kramer, W.T. (eds.) Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2020, Virtual Event / Atlanta, Georgia, USA, November 9–19, 2020, p. 58. IEEE/ACM (2020). https://doi.org/10.1109/SC41405.2020.00062 70. Rosenblatt, F.: The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review 65(6), 386–408 (1958) 71. Rückauer, B., Känzig, N., Liu, S., Delbrück, T., Sandamirskaya, Y.: Closing the accuracy gap in an event-based visual recognition task. CoRR abs/1906.08859 (2019). http://arxiv.org/abs/ 1906.08859 72. Rueckauer, B., Bybee, C., Goettsche, R., Singh, Y., Mishra, J., Wild, A.: NxTF: An API and compiler for deep spiking neural networks on Intel Loihi. CoRR abs/2101.04261 (2021). https://arxiv.org/abs/2101.04261 73. Rueckauer, B., Lungu, I.A., Hu, Y., Pfeiffer, M., Liu, S.C.: Conversion of continuous-valued deep networks to efficient event-driven networks for image classification. Front. Neurosci. 11 (2017). https://doi.org/10.3389/fnins.2017.00682. https://www.frontiersin.org/article/10.3389/ fnins.2017.00682 74. 
Ruf, B., Schmitt, M.: Hebbian learning in networks of spiking neurons using temporal coding. In: Mira, J., Moreno-Díaz, R., Cabestany, J. (eds.) Biological and Artificial Computation: From Neuroscience to Technology, International Work-Conference on Artificial and Natural Neural Networks, IWANN ’97, Lanzarote, Canary Islands, Spain, June 4–6, 1997, Proceedings, Lecture Notes in Computer Science, vol. 1240, pp. 380–389. Springer (1997). https://doi.org/ 10.1007/BFb0032496

Embedded Neuromorphic Using Intel’s Loihi Processor

171

75. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M.S., Berg, A.C., Fei-Fei, L.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115(3), 211–252 (2015). https://doi.org/10.1007/s11263-0150816-y 76. Schmitt, S., Klähn, J., Bellec, G., Grübl, A., Güttler, M., Hartel, A., Hartmann, S., de Oliveira, D.H., Husmann, K., Jeltsch, S., Karasenko, V., Kleider, M., Koke, C., Kononov, A., Mauch, C., Müller, E., Müller, P., Partzsch, J., Petrovici, M.A., Schiefer, S., Scholze, S., Thanasoulis, V.N., Vogginger, B., Legenstein, R., Maass, W., Mayr, C., Schüffny, R., Schemmel, J., Meier, K.: Neuromorphic hardware in the loop: Training a deep spiking network on the BrainScaleS wafer-scale system. In: 2017 International Joint Conference on Neural Networks, IJCNN 2017, Anchorage, AK, USA, May 14–19, 2017, pp. 2227–2234. IEEE (2017). https://doi.org/10. 1109/IJCNN.2017.7966125 77. See, H., Lim, B., Li, S., Yao, H., Cheng, W., Soh, H., Tee, B.C.K.: ST-MNIST - the spiking tactile MNIST neuromorphic dataset. CoRR abs/2005.04319 (2020). https://arxiv.org/abs/ 2005.04319 78. Shrestha, S.B., Orchard, G.: SLAYER: spike layer error reassignment in time. In: Bengio, S., Wallach, H.M., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3–8, 2018, Montréal, Canada, pp. 1419–1428 (2018). https://proceedings.neurips.cc/paper/2018/hash/ 82f2b308c3b01637c607ce05f52a2fed-Abstract.html 79. Sironi, A., Brambilla, M., Bourdis, N., Lagorce, X., Benosman, R.: HATS: histograms of averaged time surfaces for robust event-based object classification. In: 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18–22, 2018, pp. 1731–1740. Computer Vision Foundation / IEEE Computer Society (2018). 
https://doi.org/10.1109/CVPR.2018.00186. http://openaccess.thecvf.com/content_cvpr_2018/ html/Sironi_HATS_Histograms_of_CVPR_2018_paper.html 80. Son, B., Suh, Y., Kim, S., Jung, H., Kim, J., Shin, C., Park, K., Lee, K., Park, J.M., Woo, J., Roh, Y., Lee, H., Wang, Y.M., Ovsiannikov, I.A., Ryu, H.: 4.1 A 640×480 dynamic vision sensor with a 9μm pixel and 300Meps address-event representation. In: 2017 IEEE International Solid-State Circuits Conference, ISSCC 2017, San Francisco, CA, USA, February 5–9, 2017, pp. 66–67. IEEE (2017). https://doi.org/10.1109/ISSCC.2017.7870263 81. Srinivasan, G., Panda, P., Roy, K.: STDP-based unsupervised feature learning using convolution-over-time in spiking neural networks for energy-efficient neuromorphic computing. ACM J. Emerg. Technol. Comput. Syst. 14(4), 44:1–44:12 (2018). https://doi.org/10. 1145/3266229 82. Stewart, K., Orchard, G., Shrestha, S.B., Neftci, E.: Online few-shot gesture learning on a neuromorphic processor. IEEE J. Emerg. Sel. Topics Circuits Syst. 10(4), 512–521 (2020). https://doi.org/10.1109/JETCAS.2020.3032058 83. Stimberg, M., Brette, R., Goodman, D.F.: Brian 2, an intuitive and efficient neural simulator. eLife 8, e47314 (2019). https://doi.org/10.7554/eLife.47314 84. Stuijt, J., Sifalakis, M., Yousefzadeh, A., Corradi, F.: µBrain: An event-driven and fully synthesizable architecture for spiking neural networks. Front. Neurosci. 15 (2021). https:// doi.org/10.3389/fnins.2021.664208. https://www.frontiersin.org/article/10.3389/fnins.2021. 664208 85. Suh, Y., Choi, S., Ito, M., Kim, J., Lee, Y., Seo, J., Jung, H., Yeo, D., Namgung, S., Bong, J., Yoo, S., Shin, S., Kwon, D., Kang, P., Kim, S., Na, H., Hwang, K., Shin, C., Kim, J., Park, P.K.J., Kim, J., Ryu, H., Park, Y.: A 1280×960 dynamic vision sensor with a 4.95-μm pixel pitch and motion artifact minimization. In: IEEE International Symposium on Circuits and Systems, ISCAS 2020, Seville, Spain, October 10–21, 2020, pp. 1–5. IEEE (2020). https://doi. 
org/10.1109/ISCAS45731.2020.9180436 86. Tang, G., Kumar, N., Michmizos, K.P.: Reinforcement co-learning of deep and spiking neural networks for energy-efficient mapless navigation with neuromorphic hardware. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2020, Las Vegas, NV, USA,

172

A. Marchisio and M. Shafique

October 24, 2020 - January 24, 2021, pp. 6090–6097. IEEE (2020). https://doi.org/10.1109/ IROS45743.2020.9340948 87. Taunyazov, T., Sng, W., Lim, B., See, H., Kuan, J., Ansari, A.F., Tee, B.C.K., Soh, H.: Eventdriven visual-tactile sensing and learning for robots. In: Toussaint, M., Bicchi, A., Hermans, T. (eds.) Robotics: Science and Systems XVI, Virtual Event / Corvallis, Oregon, USA, July 12–16, 2020 (2020). https://doi.org/10.15607/RSS.2020.XVI.020 88. Tayarani-Najaran, M.H., Schmuker, M.: Event-based sensing and signal processing in the visual, auditory, and olfactory domain: A review. Front. Neural Circ. 15 (2021). https://doi.org/ 10.3389/fncir.2021.610446. https://www.frontiersin.org/article/10.3389/fncir.2021.610446 89. de Tournemire, P., Nitti, D., Perot, E., Migliore, D., Sironi, A.: A large scale event-based detection dataset for automotive. CoRR abs/2001.08499 (2020). https://arxiv.org/abs/2001. 08499 90. Viale, A., Marchisio, A., Martina, M., Masera, G., Shafique, M.: CarSNN: An efficient spiking neural network for event-based autonomous cars on the Loihi neuromorphic research processor. In: International Joint Conference on Neural Networks, IJCNN 2021, Shenzhen, China, July 18–22, 2021, pp. 1–10. IEEE (2021). https://doi.org/10.1109/IJCNN52387.2021.9533738 91. Viale, A., Marchisio, A., Martina, M., Masera, G., Shafique, M.: LaneSNNs: Spiking neural networks for lane detection on the Loihi neuromorphic processor. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2022. IEEE (2022) 92. Vitale, A., Renner, A., Nauer, C., Scaramuzza, D., Sandamirskaya, Y.: Event-driven vision and control for UAVs on a neuromorphic chip. In: IEEE International Conference on Robotics and Automation, ICRA 2021, Xi’an, China, May 30 - June 5, 2021, pp. 103–109. IEEE (2021). https://doi.org/10.1109/ICRA48506.2021.9560881 93. Wang, Q., Zhang, Y., Yuan, J., Lu, Y.: Space-time event clouds for gesture recognition: From RGB cameras to event cameras. 
In: IEEE Winter Conference on Applications of Computer Vision, WACV 2019, Waikoloa Village, HI, USA, January 7–11, 2019, pp. 1826–1835. IEEE (2019). https://doi.org/10.1109/WACV.2019.00199 94. Wang, Z., Guo, L., Adjouadi, M.: A biological plausible generalized leaky integrate-and-fire neuron model. In: 36th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBC 2014, Chicago, IL, USA, August 26–30, 2014, pp. 6810–6813. IEEE (2014). https://doi.org/10.1109/EMBC.2014.6945192 95. Wu, Y., Deng, L., Li, G., Zhu, J., Shi, L.: Spatio-temporal backpropagation for training highperformance spiking neural networks. Front. Neurosci. 12 (2018). https://doi.org/10.3389/ fnins.2018.00331. https://www.frontiersin.org/article/10.3389/fnins.2018.00331 96. Zhu, A.Z., Thakur, D., Özaslan, T., Pfrommer, B., Kumar, V., Daniilidis, K.: The multivehicle stereo event camera dataset: An event camera dataset for 3D perception. IEEE Robotics Autom. Lett. 3(3), 2032–2039 (2018). https://doi.org/10.1109/LRA.2018.2800793

Part II

Hardware-Software Co-Design and Co-Optimizations for Embedded Machine Learning

Machine Learning for Heterogeneous Manycore Design

Biresh Kumar Joardar, Janardhan Rao Doppa, and Partha Pratim Pande

1 Introduction

Advanced computing systems have long been an enabler of emerging applications and technology, either through sheer computational power or form factor miniaturization. Large-scale datacenters have enabled complex machine learning algorithms to analyze and decipher massive amounts of raw data. Simultaneously, mainstream CPUs and GPUs have brought many of the lower-complexity algorithms to end-user devices. These innovative models and learning techniques in turn allow us to analyze and interpret large quantities of data, making it possible to exceed human decision-making in multiple domains, including hardware design [1–4]. Overall, there are significant opportunities for tight collaboration between experts in machine learning and computing systems design. Such collaboration will help create a data-driven computing system design framework that integrates both machine learning and expert domain knowledge in hardware.

As machine learning techniques become more complex, we will need low-cost, high-performance, and energy-efficient commodity systems at our disposal. Developing such application-specific hardware must become easy, inexpensive, and as seamless as developing application software to keep up with the rapid evolution of machine learning algorithms. Therefore, it is a high priority to create an innovative design framework that reduces the engineering costs and design time of machine learning-specific computing systems. This framework will enable the democratization of access to application-specific hardware and make these

B. K. Joardar
Department of ECE, University of Houston, Houston, TX, USA

J. R. Doppa · P. P. Pande
School of EECS, Washington State University, Pullman, WA, USA

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
S. Pasricha, M. Shafique (eds.), Embedded Machine Learning for Cyber-Physical, IoT, and Edge Computing, https://doi.org/10.1007/978-3-031-39932-9_7


systems widely available to machine learning researchers and data scientists.

To aid low-cost, energy-efficient, and small form factor implementations for machine learning, computer architects have strived to develop innovative solutions. For instance, heterogeneous manycore systems that integrate multiple CPU and GPU processors on the same die have become popular for deploying ML workloads [2, 5]. By integrating multiple types of processors onto the same die, we can avoid expensive off-chip communication, which leads to lower execution times [2]. The increased level of on-chip parallelism will improve our ability to run machine learning algorithms and Big Data analytics commensurate with the number of cores on the chip.

However, these highly integrated manycore architectures introduce additional problems. As the number of cores increases, system complexity and the number of interdependent design decisions grow. This escalates the need for a holistically optimized design process that makes design decisions across all layers of the system (subsystems), e.g., memory, compute, and interconnect. Additionally, rising levels of variability within the manufacturing process and system workload make it increasingly difficult for manycore systems to quickly adapt to changing system requirements. Together, the growing system design complexity and operational variability make it ever harder to explore the expanding combinatorial design space and optimize manycore systems. To address these challenges, we believe machine learning can provide a natural solution to create new energy-efficient heterogeneous computing systems with significantly less engineering effort. Overall, through tight collaborative efforts between computing system experts and machine learning practitioners, we can create a framework that combines system design knowledge and data-driven decision-making. This interdisciplinary effort will greatly benefit both domains.

Looking ahead, the amount of data at our fingertips will continue to explode and necessitate this mutually beneficial collaborative relationship: machine learning and computing systems will need to inspire and motivate each other to continue innovation in their respective domains. This close collaboration can stimulate and empower the next wave of machine learning algorithms and manycore design methodologies. In this chapter, we present two examples of how machine learning can benefit hardware design. First, we present the design of a 3D heterogeneous manycore system enabled by ML. Next, we show how ML benefits monolithic 3D (M3D)-based heterogeneous manycore design.

2 ML-Enabled 3D CPU/GPU-Based Heterogeneous Manycore Design

Recently, platforms using both CPUs and GPUs have significantly improved the execution time for ML applications [2, 5]. However, existing discrete GPU systems use off-chip interconnects (e.g., PCIe) to communicate with the CPUs. These


interconnects give rise to high data transfer latency and become performance bottlenecks for applications that involve high volumes of data transfer between the CPUs and GPUs. A heterogeneous manycore system that integrates many CPUs and GPUs on a single chip can solve this problem and avoid such expensive off-chip data transfers [6, 7]. In addition, these single-chip systems require a scalable interconnection backbone, a network-on-chip (NoC), to facilitate more efficient communication. To further reduce data transfer costs, three-dimensional (3D) integrated circuits (ICs) have been investigated as a possible solution and have made significant strides towards improving communication efficiency [8, 9]. By connecting planar dies stacked on top of each other with through-silicon vias (TSVs), the communication latency, throughput, and energy consumption can be further improved [9]. 3D ICs together with NoCs enable the design of highly integrated heterogeneous manycore platforms (e.g., combining CPUs, GPUs, and accelerators) for big-data applications [2].

However, the design of 3D NoC-based manycore systems poses unique challenges. Due to the heterogeneity of the cores integrated on a single chip, the communication requirements for each core can vary significantly. For example, in a CPU-GPU-based heterogeneous system, CPUs require low memory latency while GPUs need high-throughput data transfers [10]. In addition to the individual core requirements, 3D ICs allow dense circuit integration but have much higher power density than their 2D counterparts. Therefore, the design process must consider reducing temperature hotspots as an additional objective. Overall, the design of a 3D heterogeneous manycore architecture needs to consider each of these objectives and satisfy all of them simultaneously. Hence, 3D heterogeneous manycore design can be formulated as a multi-objective optimization (MOO) problem.

As the number of cores increases, it becomes extremely challenging and time-consuming to find designs that exhibit good power-performance trade-offs. As we show later, existing optimization techniques (such as AMOSA and NSGA-II [11, 12]) are not scalable and often lead to sub-optimal designs. To address this problem, we present a new MOO algorithm, MOO-STAGE, which extends the machine learning framework STAGE [13]. As opposed to traditional MOO algorithms that only consider the current solution set when making search decisions, MOO-STAGE learns from the knowledge of previous search trajectories to guide the search towards more promising parts of the design space. This significantly reduces the optimization time without sacrificing the solution quality. Using MOO-STAGE, we can take advantage of the characteristics of different applications and incorporate appropriate design objectives to enable quick design space exploration of 3D heterogeneous systems.
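To make the multi-objective formulation concrete, the following minimal sketch shows how candidate designs can be compared when several objectives must be minimized at once. It uses the standard notion of Pareto dominance; it is not an implementation of MOO-STAGE, AMOSA, or NSGA-II. The design labels and objective values are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

# A design is a (label, objectives) pair; every objective is to be minimized.
Design = Tuple[str, Sequence[float]]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b in every objective and strictly better in one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(designs: List[Design]) -> List[Design]:
    """Keep only the designs that no other design dominates."""
    return [
        d for d in designs
        if not any(dominates(other[1], d[1]) for other in designs if other is not d)
    ]


# Hypothetical objective vectors (illustrative numbers, arbitrary units):
# (CPU latency, link-utilization spread, network energy, peak temperature)
candidates = [
    ("A", (10.0, 2.0, 5.0, 80.0)),
    ("B", (12.0, 1.5, 5.5, 78.0)),
    ("C", (11.0, 2.5, 6.0, 85.0)),  # worse than A in every objective
    ("D", (9.0, 3.0, 4.5, 90.0)),
]

front = pareto_front(candidates)
front_labels = [label for label, _ in front]
print(front_labels)  # ['A', 'B', 'D']
```

Design C is discarded because A is at least as good in all four objectives, whereas A, B, and D each win on at least one objective and therefore represent different trade-offs. A MOO search returns such a set of non-dominated designs rather than a single optimum.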

2.1 Related Prior Work

2.1.1 3D Heterogeneous Manycore Systems

Recently, several commercial products have been developed that incorporate multiple types of computing elements, including CPUs and GPUs [14, 15]. Due to their


heterogeneity, CPU-GPU-based systems exhibit some interesting characteristics. GPUs typically communicate with only a few shared last-level caches (LLCs), which results in many-to-few traffic patterns (i.e., many GPUs communicate with a few LLCs) with negligible inter-GPU communication [2, 10]. This can cause the LLCs to become bandwidth bottlenecks under heavy network loads and lead to significant performance degradation. In addition, GPUs tend to monopolize the memory and cause high CPU memory access latency [16]. Designers have also taken advantage of the higher packing density and lower interconnect latency of 3D ICs to improve the performance of manycore systems [8, 9]. The advantages of 3D integration for CPU- and GPU-based manycore systems have been demonstrated in [2]. Hence, we conjecture that 3D heterogeneous manycore systems will be the computing platform of choice for accelerating ML in the future.

However, designing such a manycore platform is challenging. Due to the differences in the thread-level parallelism of CPUs and GPUs, the heterogeneous system should satisfy both CPU and GPU constraints [17]. On top of this, 3D ICs suffer from thermal issues due to higher power density [18]. Multiple techniques, including clever floorplanning and task remapping, have been proposed to address this problem [19, 20]. In contrast to these prior works, we present a MOO algorithm that intelligently places the processing cores and links within a 3D heterogeneous system while jointly considering all relevant design metrics, e.g., latency, throughput, energy, and temperature.

2.1.2 Multi-Objective Optimization Algorithms

Basic MOO algorithms such as genetic algorithms (GAs), e.g., NSGA-II, or simulated annealing-based algorithms, e.g., AMOSA, have been used in different optimization problems [11, 12]. AMOSA has been demonstrated to be superior to GAs or simulated annealing and has been applied to the problem of heterogeneous NoC design in [10]. However, since AMOSA is based on simulated annealing, it needs to be annealed slowly to ensure a good solution, which does not scale well with the size of the search space. In [21], the authors have used a heuristic-based MOO for multicore designs. However, they focus mainly on optimizing individual cores in smaller systems with up to 16 processors. Latency and area have been optimized using GAs to design NoC architectures in [22]. The authors in [23] have used machine learning techniques such as linear regression and neural networks for MOO on different platforms. A learning-based fuzzy algorithm has been proposed to reduce the search time in [24]. However, this methodology requires a threshold to be decided for each application separately. A relatively recent work [25] proposed a branch-and-bound-based algorithm, titled priority and compensation factor-oriented branch and bound (PCBB), for task mapping in a NoC-based platform. However, this work only considers task mapping on relatively small systems, where calculating the bound for each node is significantly easier. These works have mainly considered homogeneous platforms with smaller


system sizes and fewer objectives and are often not suited for large design space exploration problems. 3D heterogeneous manycore design is far more complex since the design must consider the requirements of each component. With additional constraints such as temperature and energy, the required optimization time can become prohibitively high. Therefore, as systems become more complex, we need algorithms that scale with the size of the search space and reduce optimization time without sacrificing solution quality. ML presents a promising direction to explore in this regard.

3 3D Heterogeneous Manycore Design Formulation

In this section, we discuss the design objectives of an example CPU/GPU-based manycore platform. Figure 1 illustrates an example 3D heterogeneous architecture with two layers. For these systems, it is important that we (1) optimize considering both CPU and GPU characteristics; (2) efficiently balance the load of the 3D NoC under many-to-few traffic patterns (as mentioned earlier, many GPUs communicate with a few LLCs and vice versa); (3) minimize the network energy; and (4) minimize the peak temperature of the system. Note that we choose these design objectives as an example only. There may be additional design objectives for specific design cases, which can be included in the design process in a similar way. Overall, the design methodology should optimize the system for individual core requirements along with other design constraints for a high-performance manycore architecture. Here, the design methodology focuses on the placement of the CPUs, GPUs, LLCs, and planar links. We elaborate on how the methodology satisfies each objective next.

CPU Design Objective: CPU cores use instruction-level parallelism to achieve high performance on a limited number of threads. If any of these threads stall, CPUs incur a large penalty. Therefore, memory access latency is a primary concern for CPUs.

Fig. 1 Illustration of a TSV-based 3D heterogeneous manycore system. The system is divided into CPU, GPU, and LLC tiles. Tiles are interconnected via a planar link (intra-layer) or a TSV (inter-layer)


For $C$ CPUs and $M$ LLCs, we can model the average CPU-LLC latency using the following equation [26]:

\[
\mathrm{Lat} = \frac{1}{C \cdot M} \sum_{i=1}^{C} \sum_{j=1}^{M} \left( r \cdot h_{ij} + d_{ij} \right) \cdot f_{ij} \tag{1}
\]

Here, $r$ is the number of router stages, $h_{ij}$ is the number of hops from CPU $i$ to LLC $j$, $d_{ij}$ indicates the total link delay, and $f_{ij}$ represents the amount of interaction between core $i$ and core $j$. The path from core $i$ to core $j$ is determined by the routing algorithm. Note that the above equation is applicable with any routing algorithm.

GPU Design Objective: Unlike CPUs, GPUs rely on high levels of data parallelism. Massive amounts of parallelism coupled with quick context switching allow the GPU to hide most of its memory access latency. However, to do so, GPUs need large amounts of data and rely on high-throughput memory accesses. We can maximize the throughput of GPU-related traffic by load-balancing the network to prevent congestion, which allows more messages to utilize the network at a time. This does not change the total number of packets to be communicated. Instead, it redistributes traffic flows so that fewer links are heavily congested. With less contention for heavily utilized links, links are more readily available and network throughput improves. To balance the expected link utilization (i.e., to load-balance the network), we consider minimizing both the mean ($\bar{U}$) and standard deviation ($\sigma$) of expected link utilization as suitable objectives. The expected utilization of link $k$ ($U_k$) can be obtained by the following equation:

\[
U_k = \sum_{i=1}^{R} \sum_{j=1}^{R} f_{ij} \cdot p_{ijk} \tag{2}
\]

Here, $R$ is the total number of tiles and $p_{ijk}$ indicates whether a planar/vertical link $k$ is used to communicate between core $i$ and core $j$, i.e.:

\[
p_{ijk} =
\begin{cases}
1, & \text{if cores } i \text{ and } j \text{ communicate using link } k \\
0, & \text{otherwise}
\end{cases} \tag{3}
\]

$p_{ijk}$ can be determined from the network connectivity and the routing protocol. Then, the mean ($\bar{U}$) and standard deviation ($\sigma$) of link utilization over the L links can be determined from the following equations:

$$\bar{U} = \frac{1}{L} \sum_{k=1}^{L} U_k \tag{4}$$

$$\sigma = \sqrt{\frac{1}{L} \sum_{k=1}^{L} \left( U_k - \bar{U} \right)^2} \tag{5}$$
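As a rough illustration of Eqs. (1), (2), (4), and (5), the following Python sketch evaluates the CPU objective and the GPU load-balancing objectives on toy inputs. The function names, the dictionary `routes` (which plays the role of $p_{ijk}$ by listing the links used by each source-destination pair), and all numeric values are assumptions made for this example and do not come from the chapter.

```python
import math

def average_cpu_llc_latency(r, hops, delay, traffic):
    """Eq. (1): average of (r*h_ij + d_ij)*f_ij over C CPUs and M LLCs."""
    C, M = len(hops), len(hops[0])
    total = sum((r * hops[i][j] + delay[i][j]) * traffic[i][j]
                for i in range(C) for j in range(M))
    return total / (C * M)

def link_utilization(traffic, routes, num_links):
    """Eq. (2): U_k = sum over (i, j) of f_ij * p_ijk.

    routes[(i, j)] is the set of link ids used from i to j, i.e. the
    links k for which p_ijk = 1 (hypothetical encoding).
    """
    U = [0.0] * num_links
    for (i, j), links in routes.items():
        for k in links:
            U[k] += traffic[i][j]
    return U

def mean_and_std(U):
    """Eqs. (4)-(5): mean and standard deviation of link utilization."""
    L = len(U)
    mean = sum(U) / L
    std = math.sqrt(sum((u - mean) ** 2 for u in U) / L)
    return mean, std

# Toy inputs (illustrative only): 2 CPUs, 2 LLCs, 3 router stages.
lat = average_cpu_llc_latency(
    r=3,
    hops=[[1, 2], [2, 1]],
    delay=[[1, 2], [2, 1]],
    traffic=[[0.5, 0.5], [1.0, 0.0]],
)

# Two tiles and two links: 0->1 uses link 0, 1->0 uses links 0 and 1.
f = [[0, 2], [4, 0]]
routes = {(0, 1): {0}, (1, 0): {0, 1}}
U = link_utilization(f, routes, num_links=2)
U_mean, U_std = mean_and_std(U)
print(lat, U, U_mean, U_std)
```

Because the routing decision is encoded only in `routes`, the same functions can be reused with any routing algorithm, which mirrors the remark that Eq. (1) is routing-agnostic.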

Thermal Requirements One of the key challenges in 3D integration is the high power density and the resulting temperature hotspots. High temperature affects not only performance but also the lifetime of the device. Processing cores that are further away from the sink tend to have higher temperatures than those close to the sink. Therefore, cores must be properly placed, e.g., high power-consuming cores should be placed close to the sink to reduce temperature. To estimate the temperature of a core, we use the fast approximation model presented in [27]. It considers both horizontal and vertical heat flow to accurately estimate the temperature. A manycore system can be divided into N single-tile stacks, each with K layers, where N is the number of tiles on a single layer and K is the total number of layers. The temperature of a core within a single-tile stack n located at layer k from the sink ($T_{n,k}$) due to the vertical heat flow is given by:

$$T_{n,k} = \sum_{i=1}^{k} \left( P_{n,i} \sum_{j=1}^{i} R_j \right) + R_b \sum_{i=1}^{k} P_{n,i} \tag{6}$$

This represents the vertical heat flow in a manycore system [28]. Here, $P_{n,i}$ is the power consumption of the core i layers away from the sink in single-tile stack n, $R_j$ is the vertical thermal resistance, and $R_b$ is the thermal resistance of the base layer on which the dies are placed. The values of $R_j$ and $R_b$ can be obtained using 3D-ICE [29]. The horizontal heat flow is represented through the maximum temperature difference within the same layer k, $\Delta T(k)$:

$$\Delta T(k) = \max_{n} T_{n,k} - \min_{n} T_{n,k} \tag{7}$$

The overall thermal prediction model includes both vertical and horizontal heat flow equations. Following [27], we use T as our comparative temperature metric for any given 3D architecture:

$$T = \left( \max_{n,k} T_{n,k} \right) \cdot \left( \max_{k} \Delta T(k) \right) \tag{8}$$
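The thermal metric of Eqs. (6)-(8) can be sketched in a few lines of Python. This is only an illustrative reading of the model: the function names and the power and resistance values are made up, whereas in the chapter $R_j$ and $R_b$ come from 3D-ICE and the powers from simulation.

```python
def vertical_temperature(P, R, Rb):
    """Eq. (6) for one single-tile stack.

    P[i-1] is the power of the core i layers from the sink and R[j-1]
    is the vertical thermal resistance R_j. Returns [T_1, ..., T_K].
    """
    temps = []
    for k in range(1, len(P) + 1):
        t = sum(P[i - 1] * sum(R[:i]) for i in range(1, k + 1))
        t += Rb * sum(P[:k])
        temps.append(t)
    return temps

def thermal_metric(stacks_P, R, Rb):
    """Eqs. (7)-(8): peak temperature times the largest in-layer spread."""
    T = [vertical_temperature(P, R, Rb) for P in stacks_P]  # T[n][k-1]
    K = len(R)
    peak = max(max(row) for row in T)
    spread = max(max(row[k] for row in T) - min(row[k] for row in T)
                 for k in range(K))
    return peak * spread

# Two stacks with two layers each (illustrative values only).
# Stack 0 puts the high-power core next to the sink; stack 1 does not.
R, Rb = [1.0, 1.0], 0.5
stack0 = vertical_temperature([2.0, 1.0], R, Rb)
stack1 = vertical_temperature([1.0, 2.0], R, Rb)
T_metric = thermal_metric([[2.0, 1.0], [1.0, 2.0]], R, Rb)
print(stack0, stack1, T_metric)
```

In this toy case the stack with the high-power core nearest the sink has the lower top-layer temperature, which is the placement intuition stated in the text.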

Energy Requirements A few long-range links added to the NoC can improve performance [26]. However, these long-range links are costlier in terms of energy. Routers with a higher number of ports can improve path diversity and throughput; however, larger routers are difficult to design and are power hungry. Therefore, router size and link length must be optimized at design time to deliver high performance without consuming high amounts of energy. For a system with N tiles, R routers, L planar links, and V vertical links, the approximate network energy consumed can be obtained using the following equations:

$$E_{router} = \sum_{i=1}^{N} \sum_{j=1}^{N} f_{ij} \cdot \sum_{k=1}^{R} r_{ijk} \cdot \left( E_r \cdot P_k \right) \tag{9}$$

$$E_{link} = \sum_{i=1}^{N} \sum_{j=1}^{N} f_{ij} \cdot \left( \sum_{k=1}^{L} p_{ijk} \cdot d_k \cdot E_{planar} + \sum_{k=1}^{V} q_{ijk} \cdot E_{vertical} \right) \tag{10}$$

$$E = E_{router} + E_{link} \tag{11}$$

Here $E_r$ denotes the average router logic energy per port, and $P_k$ denotes the number of ports available at router k. The total link energy is divided into two parts because planar and vertical links have different physical characteristics. $f_{ij}$ represents the frequency of communication between core i and core j and can be extracted from Gem5-GPU [30] simulations, while $d_k$ represents the physical length of link k. Here, $q_{ijk}$ and $r_{ijk}$ are defined similarly to $p_{ijk}$ (Eq. 3) and indicate whether vertical link k or router k, respectively, is used to communicate between core i and core j. $E_{planar}$ and $E_{vertical}$ denote the energy consumed per flit by planar metal wires and vertical links, respectively. All the required power numbers can be obtained using Synopsys Prime Power. The total network energy E is the sum of the router logic and link energy.

Overall Formulation In the end, the aim is to find a 3D heterogeneous manycore design that improves throughput (by minimizing the mean, $\bar{U}$, and the standard deviation, $\sigma$, of link utilization), minimizes the average latency between CPUs and LLCs (Lat), and reduces both temperature (T) and energy (E). It is important to note that the analytical models for these objectives only need to be accurate in determining which designs are better relative to one another, e.g., lower values of T result in better temperatures. This allows us to quickly compare designs without performing detailed simulations during the optimization search. Since optimizing one objective may negatively affect another, these objectives must be optimized simultaneously. For example, a thermal-only aware placement would move high-power cores closer to the sink [8] and possibly further away from the cores they communicate with most, negatively affecting performance and energy. Figure 2 illustrates the overall formulation. As shown in Fig. 2, the inputs to the MOO solver are the different processing units (e.g., CPU, GPU, and LLC) and the various design objectives (e.g., latency, throughput, and energy). The MOO solver (ML-based in this case) aims to search for good solutions based on the combined objective:

$$D^{*} = \mathrm{MOO}\left( \mathrm{OBJ} = \left\{ \bar{U}(d), \sigma(d), \mathrm{Lat}(d), T(d), E(d) \right\} \right) \tag{12}$$

where $D^{*}$ is the set of Pareto-optimal designs among all possible 3D heterogeneous manycore system configurations D, i.e., $D^{*} \subseteq D$, MOO is a multi-objective solver, and OBJ is the set of all objectives used to evaluate a candidate design d ∈ D. A candidate design d consists of an adjacency matrix for the links (which designates which pairs of tiles are connected via a link) and a tile placement vector (which designates which core is placed at which tile). We also need to ensure that for all d ∈ D, all source-destination pairs have at least one path between them. Since mesh is the most commonly used NoC architecture, every design d has the same number of links as a 3D mesh NoC for fair comparison. Finally, we pick a design from the Pareto-optimal set of manycore systems depending on our requirements in terms of performance, energy, etc.

Fig. 2 Overall design formulation for a heterogeneous manycore design using an ML-based method
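To make the design representation concrete, the sketch below encodes a candidate design as a tile placement vector plus a link adjacency matrix and checks the two validity conditions mentioned above: a fixed link budget and at least one path between every pair of tiles. It assumes bidirectional links (so that graph connectivity implies a path for every source-destination pair); the helper names and the six-tile examples are hypothetical.

```python
from collections import deque

def adjacency(n, links):
    """Build a symmetric 0/1 adjacency matrix from a list of tile pairs."""
    adj = [[0] * n for _ in range(n)]
    for a, b in links:
        adj[a][b] = adj[b][a] = 1
    return adj

def num_links(adj):
    """Count undirected links (upper triangle of the matrix)."""
    n = len(adj)
    return sum(adj[i][j] for i in range(n) for j in range(i + 1, n))

def is_connected(adj):
    """Breadth-first search: True if every tile is reachable from tile 0."""
    n = len(adj)
    seen, queue = {0}, deque([0])
    while queue:
        u = queue.popleft()
        for v in range(n):
            if adj[u][v] and v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == n

# Tile placement vector: entry t names the core placed at tile t.
placement = ["CPU", "LLC", "GPU", "GPU", "LLC", "GPU"]

# Two candidate link placements with the same link budget (6 links).
ring = adjacency(6, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)])
split = adjacency(6, [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3)])

print(num_links(ring), is_connected(ring))    # valid candidate
print(num_links(split), is_connected(split))  # same budget, but disconnected
```

A search procedure would reject `split` even though it respects the link budget, because tiles 0-2 cannot reach tiles 3-5.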

4 MOO-STAGE: ML-Enabled Manycore Design Framework

Next, we discuss MOO-STAGE, an ML-based multi-objective optimization algorithm, and how it can be applied to 3D heterogeneous manycore design. The key idea behind MOO-STAGE is to intelligently explore the search space such that the MOO problem is solved efficiently. More precisely, MOO-STAGE utilizes a supervised learning algorithm that leverages past search experience (local search) to learn an evaluation function that can then estimate the outcome of performing a local search from any given state in the design space (meta search). In practice, the MOO-STAGE algorithm iteratively executes local and meta searches in sequence, as shown in the high-level overview in Fig. 3. The first stage (local search) performs a search from a given starting state, guided by a cost function that considers all objectives. Then, the search trajectories collected from the local search are used in the next stage (meta search) to learn an evaluation function. This evaluation function attempts to learn the potential (quantified using the cost function) of performing a local search starting from a particular state. This allows the algorithm to prune away bad starting states and so reduce the number of local search calls needed to find (near-)optimal designs in the given design space. In contrast, MOO algorithms based on random restarts do not leverage any such knowledge and spend significant time searching from states that MOO-STAGE would reject. Therefore, MOO-STAGE explicitly guides the search towards promising areas of the search space much faster than conventional MOO algorithms. Below we describe the details of the MOO-STAGE algorithm.


Fig. 3 Overview of the MOO-STAGE algorithm

4.1 MOO-STAGE: Local Search

Given an objective, the goal of a local search algorithm (e.g., greedy search or SA) is to traverse a sequence of neighboring states to find a solution that minimizes the objective. To accommodate multiple objectives, we can employ the Pareto hypervolume (PHV) [31] metric to evaluate the quality of a set of solutions (higher is better). The PHV is the hypervolume of the portion of the objective space dominated by the solution set, and it serves as a measure of the quality of a Pareto set approximation. A design P is dominated by design Q (P ≺ Q) if Q is no worse than P in every objective and strictly better in at least one.

[Algorithm 1: listing of the local search procedure]
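For intuition about the PHV metric, the following sketch computes the hypervolume for the two-objective case, where it reduces to an area. It assumes both objectives are minimized and that a reference point `ref` bounds the objective space from above; the reference point, the sweep implementation, and the sample points are choices made for this example and are not taken from [31] or from the MOO-STAGE code.

```python
def dominates(q, p):
    """True if q dominates p: no worse everywhere, strictly better somewhere."""
    return (all(a <= b for a, b in zip(q, p))
            and any(a < b for a, b in zip(q, p)))

def non_dominated(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded by `ref` (minimization).

    Assumes every point is no worse than `ref` in both objectives.
    """
    front = sorted(non_dominated(points))  # objective 1 ascending
    area, prev_y = 0.0, ref[1]
    for x, y in front:
        # Each front point adds the strip between its y and the previous y.
        area += (ref[0] - x) * (prev_y - y)
        prev_y = y
    return area

designs = [(1, 5), (2, 2), (3, 3), (5, 1)]   # (3, 3) is dominated by (2, 2)
phv = hypervolume_2d(designs, ref=(6, 6))
phv_better = hypervolume_2d(designs + [(1, 2)], ref=(6, 6))
print(non_dominated(designs), phv, phv_better)
```

Adding a design that dominates part of the current front increases the PHV, which is why a larger PHV indicates a better solution set.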

4.2 MOO-STAGE: Meta Search

The second and key component of MOO-STAGE is the learning phase, also known as the meta search. For standard local search procedures, a key limitation is that the quality of the result depends critically on the starting point of the search ($d_{start}$). Although algorithms like SA try to mitigate this effect by incorporating some random exploration, they are still limited by the local nature of the search. If the search repeatedly begins near poor local minima, it may never find a high-quality solution. MOO-STAGE attempts to solve this problem by using previous local search data to learn a function approximator (the evaluation function) that can predict the outcome of a local search procedure from a particular starting point. Using this evaluation function, MOO-STAGE intelligently selects starting states with a high potential to lead to better-quality solutions, which significantly reduces the computation time. We discuss the details of this procedure in the following paragraphs.

After completing the local search, MOO-STAGE adds the local optima set to the global optima set ($S_{global}$), ensuring that all states in the global optima set are non-dominated (Algorithm 2, lines 3–4). If the local optima set did not add any new entries to the global optima set, MOO-STAGE terminates and returns the global optima set (Algorithm 2, lines 5–6). Otherwise, it adds the local search trajectory ($S_{traj}$) and the PHV of this trajectory ($PHV_{obj}(S_{traj})$) as a training example to the aggregated training set ($S_{train}$) and learns the evaluation function Eval using $S_{train}$ (Algorithm 2, lines 7–8). We can employ a regression forest as the base learner for creating Eval. The regression forest is only an example; other regression learners that are quick to evaluate and sufficiently expressive to fit the training data can be used to similar effect. Given the function Eval, we can then use a standard greedy search to optimize Eval, beginning at the last state of the local search ($d_{last}$), to find the starting state for the next local search iteration ($d_{restart}$). If $d_{last} = d_{restart}$, MOO-STAGE instead chooses a random design from the design space D (Algorithm 2, lines 9–13). Using these two search processes (local search and meta search), MOO-STAGE progressively learns the structure of the solution space and improves Eval. Essentially, the algorithm attempts to learn a regressor that can predict the PHV of the local optima from any starting design and explicitly guides the search towards starting designs that are predicted to be of high quality.
The MOO-STAGE implementation is available at [28].
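The restart-selection step of the meta search (Algorithm 2, lines 9–13) can be sketched as follows. This is a simplified illustration, not the released MOO-STAGE code: `eval_fn` stands in for the learned regressor (here a hand-written function rather than a trained regression forest), and `neighbors`, `random_design`, and the integer design space are hypothetical placeholders.

```python
import random

def greedy_ascent(eval_fn, start, neighbors):
    """Hill-climb on the evaluation function until no neighbor scores higher."""
    current = start
    while True:
        best = max(neighbors(current), key=eval_fn, default=current)
        if eval_fn(best) <= eval_fn(current):
            return current
        current = best

def next_start(eval_fn, d_last, neighbors, random_design):
    """Choose the starting design for the next local search.

    If greedy search on Eval cannot move away from d_last, fall back to
    a random design, as described for Algorithm 2, lines 9-13.
    """
    d_restart = greedy_ascent(eval_fn, d_last, neighbors)
    return random_design() if d_restart == d_last else d_restart

# Toy design space: the integers 0..10 (illustrative only).
rng = random.Random(0)

def neighbors(d):
    return [n for n in (d - 1, d + 1) if 0 <= n <= 10]

def eval_fn(d):
    # Stand-in for the learned predictor of local-search PHV; peaks at d = 7.
    return -(d - 7) ** 2

def random_design():
    return rng.randrange(11)

print(next_start(eval_fn, 2, neighbors, random_design))  # climbs towards 7
print(next_start(eval_fn, 7, neighbors, random_design))  # random restart
```

The random fallback matters because a greedy search that returns its own starting point would otherwise restart the local search from a state it has just finished exploring.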

Algorithm 2. MOO-STAGE
Input: OBJ (set of optimization objectives), maximum iterations, D (design space)
Output: S_global (non-dominated set of designs)
1: Initialize: S_train ← ∅, S_global ← ∅, d_start ← initial design
2: For each iteration, up to the maximum number of iterations:
3:   Run local search from d_start to obtain the local optima set, the trajectory S_traj, and the last state d_last
4:   Maintain non-dominated global set: add the local optima set to S_global and keep only the non-dominated designs
5:   If the local optima set added no new entries to S_global: [If algorithm converged]
6:     Return S_global
7:   Add training example for each design d ∈ S_traj: S_train ← S_train ∪ {(d, PHV_obj(S_traj))}
8:   Train evaluation function: Eval ← learner trained on S_train
9:   Greedy search: d_restart ← result of greedy search on Eval starting from d_last
10:  If d_last = d_restart:
11:    d_start ← random design from D
12:  Else
13:    d_start ← d_restart
14: Return S_global

5 Experimental Results

5.1 Experimental Setup

To obtain network- and processor-level information, we use the Gem5-GPU full-system simulator [30]. The CPU cores are based on the x86 architecture, while the GPUs are based on the NVIDIA Maxwell architecture. Here, each GPU core is analogous to a streaming multiprocessor (SM) in NVIDIA terminology. The architecture of an individual GPU core is similar to a GPU Compute Unit (CU) described in [30]. The CPUs operate at 2.5 GHz while the GPUs operate at 0.7 GHz. The core power profiles are extracted using GPUWattch and McPAT [33, 34]. The core temperatures are obtained using 3D-ICE [29]. Due to the high power densities in 3D ICs, microfluidic cooling techniques must be used to reduce core temperatures. Reciprocal design symmetry (RDS) floor planning [19] is also adopted to reduce the temperature. The 3D mesh NoCs use XYZ dimension-order routing, while the optimized architectures use ALASH routing [35]. Note that the proposed architectures do not have a regularity constraint, and hence XYZ dimension-order routing cannot be employed as it is for 3D mesh NoCs. Each CPU and GPU has private L1 data and instruction caches of 32 KB each. Each LLC consists of 256 KB of memory.

To evaluate MOO-STAGE, we consider two reference algorithms, AMOSA and PCBB [12, 25]. AMOSA is a widely employed algorithm for multi-objective optimization due to its ability to achieve near-optimal solutions. PCBB, on the other hand, is a branch-and-bound-based technique used for task mapping in an NoC-based system considering multiple objectives. We evaluate the algorithms based on runtime and quality of solutions. Given the set of Pareto-optimal solutions $D^{*}$ specified by Eq. (12) for each MOO solver considered here (i.e., AMOSA, PCBB, and MOO-STAGE), we run detailed simulations on this subset of solutions to get absolute values for energy, performance, and temperature. All experiments have been run on an Intel® Xeon® CPU E5-2620 @ 2 GHz machine with 16 GB RAM running CentOS 6. The code for MOO-STAGE, AMOSA, and PCBB is available on GitHub [28]. We use ten applications from the Rodinia benchmark suite and two machine learning workloads to evaluate MOO-STAGE (shown in Table 1) [36].

Table 1 List of applications and their respective domains

Applications                    Domain/usage
Back propagation (BP)           Pattern recognition
Breadth-first search (BFS)      Graph algorithm
CNN for CIFAR-10 (CDN)          Image classification (RGB)
Gaussian elimination (GAU)      Linear algebra
HotSpot (HS)                    Physics simulation
CNN for MNIST (LEN)             Image classification (Grayscale)
LU decomposition (LUD)          Linear algebra
Needleman-Wunsch (NW)           Bioinformatics
k-nearest neighbors (KNN)       Data mining
PathFinder (PF)                 Grid traversal

To evaluate the performance of MOO-STAGE, we consider different optimization scenarios. System designers often have many different and perhaps conflicting objectives. Therefore, it is important to look at several cases with different numbers of objectives for the 3D heterogeneous architecture. As an example, we consider three different cases:

• Case 1 – {$\bar{U}$, σ} We consider the mean (Eq. 4) and standard deviation (Eq. 5) of link utilization.
• Case 2 – {$\bar{U}$, σ, Lat} We add CPU-LLC latency (Eq. 1) to Case 1.
• Case 3 – {$\bar{U}$, σ, Lat, E} We add energy (Eq. 11) to Case 2.

New objectives can be added judiciously to fit the designer's goals and constraints. Finally, our goal here is to optimize the placement of CPUs, LLCs, GPUs, and planar links such that they improve the design objectives.

5.2 Comparing the Different Algorithms

In this subsection, we investigate the performance of PCBB, AMOSA, and MOO-STAGE for the problem of 3D heterogeneous manycore design. Here, we consider a 64-tile system with 8 CPUs, 16 LLCs, and 40 GPUs. The number of planar links and the number of TSVs are kept the same as in a 3D mesh NoC of similar size for fair comparison. For brevity, in Fig. 4 we present the BFS results averaged over multiple runs as a representative example. Similar observations are made for all other applications as well. Figure 4 shows the evolution of the best solution's EDP over time for AMOSA and MOO-STAGE for all three optimization cases. Since PCBB is not an anytime algorithm, we can only show the total run-time needed to complete the branch-and-bound enumeration, which is reported in Table 2. It is evident from Fig. 4 that MOO-STAGE achieves lower EDP values significantly faster than AMOSA. To further demonstrate this, we define two metrics: $T_{MOO-STAGE}$, which is the time required for MOO-STAGE to converge, and $T_{AMOSA}$, which is the time needed for AMOSA to generate solutions of similar quality.

Fig. 4 Normalized quality of NoC solutions (EDP) obtained using AMOSA and MOO-STAGE for (a) two objectives ($\bar{U}$, σ), (b) three objectives ($\bar{U}$, σ, Lat), and (c) four objectives ($\bar{U}$, σ, Lat, E) for the BFS benchmark

Table 2 MOO-STAGE speed-up over PCBB and AMOSA

Application   Two-obj PCBB   Two-obj AMOSA   Three-obj AMOSA   Four-obj AMOSA
BP            130            1.5             6.4               12.5
BFS           135            2.0             5.0               9.4
CDN           146            1.5             5.8               13.7
GAU           134            1.3             6.0               7.2
HS            144            1.5             8.0               10.0
LEN           145            2.0             5.8               14.2
LUD           140            1.3             5.0               10.0
NW            150            1.5             5.0               11.4
KNN           148            1.2             6.4               7.5
PF            142            1.2             5.0               11.4
Average       141.4          1.5             5.8               10.7

However, AMOSA never finds the best solution that MOO-STAGE obtains, even after significantly longer durations, for the three- and four-objective optimization. It is clear from Fig. 4 that the speed-up MOO-STAGE achieves increases as the number of objectives increases. With four objectives, MOO-STAGE converges after approximately $T_{MOO-STAGE}$ = 9 h, while AMOSA takes approximately $T_{AMOSA}$ = 85 h to come within 3% of MOO-STAGE's solution quality. Therefore, MOO-STAGE achieves an approximately 9.4× speed-up in optimization time compared to AMOSA. This example demonstrates how ML can benefit hardware design by reducing the time it takes to design and optimize new architectures.

The significant improvement in optimization time can be attributed to the fact that MOO-STAGE performs active learning [37]. MOO-STAGE aggregates learning examples over multiple iterations to reduce the amount of training data needed to learn a target concept. This guarantees that only a small number of trajectories are needed to achieve good generalization behavior with the learned function [37] and to accurately evaluate the entire input design space. As a result, after a few iterations, MOO-STAGE obtains a near-accurate evaluation function that speeds up the optimization process.

Table 2 shows the speed-up of MOO-STAGE compared to AMOSA and PCBB for all applications under optimization cases 1, 2, and 3. MOO-STAGE achieves significant gains in convergence time for all applications and numbers of objectives. Note that due to the large execution time of PCBB, we only show the two-objective optimization case (case 1) for PCBB. Increasing the number of objectives reduces the number of branches that are pruned and exponentially increases the run-time, so the three- and four-objective run-times for PCBB would be even worse. As seen from Table 2, even for the simpler two-objective optimization, PCBB takes 141× longer on average to find a solution of similar quality to MOO-STAGE. This is mainly due to the sheer size of the design space of 3D NoCs. For more intuition, in a 4 × 4 × 4 (64-tile) system with 144 links (96 planar + 48 vertical), the total number of possible tile placements is 64 factorial. Each of these tile placements then has $\binom{4 \cdot \binom{16}{2}}{96}$ different ways to place the planar links. Although PCBB manages to prune over 99.99% of this solution space, the tiny fraction that is left still consists of several million possible solutions. This is significantly more than MOO-STAGE or AMOSA explore, leading to worse execution times. On the other hand, MOO-STAGE reduces the optimization time over AMOSA by 1.5×, 5.8×, and 10.7× on average for the two-, three-, and four-objective cases, respectively. Table 2 also shows that MOO-STAGE can obtain high-quality solutions in a shorter amount of time irrespective of the application.

For further analysis, Fig. 5 shows the tile placement and the number of planar links in each layer for the manycore configurations obtained using MOO-STAGE and AMOSA at time $T_{MOO-STAGE}$. The BFS benchmark with four objectives (case 3 described in Sect. 5.1) is shown as an example. Here, we do not consider PCBB as it takes orders of magnitude more time to generate a good NoC design than both MOO-STAGE and AMOSA. From Fig. 5, we note that in the NoC obtained using MOO-STAGE, LLCs tend to remain in the middle layers. This allows the LLCs to access the vertical links in both directions and reduces the average hop count to the other tiles. We also observe that more links are concentrated in layers with LLCs. The presence of more links enables greater path diversity and reduces traffic congestion under the many-to-few traffic pattern, leading to better performance. Interestingly, AMOSA and MOO-STAGE both achieve similar tile placement configurations but very different link placements. This is because the link placement search space is much larger than the tile placement search space. Therefore, AMOSA fails to explore it adequately within time $T_{MOO-STAGE}$ and ends up with a solution similar to the initial 3D mesh (the starting state for all searches is a 3D mesh, which also has a uniform distribution of links among layers). These analyses explain the approximately 9.5% difference in EDP (Fig. 4c) between the best solutions obtained using the two algorithms at time $T_{MOO-STAGE}$. Therefore, by learning the search space, MOO-STAGE is able to achieve better-quality solutions in a shorter span of time.
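The design-space size quoted earlier in this subsection (64! tile placements, each with C(4·C(16,2), 96) planar-link placements) can be evaluated directly with exact integer arithmetic. The short sketch below follows the chapter's counting as stated; the variable names are ours, and the mesh link counts are included only as a consistency check on the "96 planar + 48 vertical" figure.

```python
import math

side, layers = 4, 4
tiles = side * side * layers                     # 4 x 4 x 4 = 64 tiles

# Link budget of the reference 3D mesh.
planar_mesh_links = layers * 2 * side * (side - 1)    # links inside the layers
vertical_mesh_links = side * side * (layers - 1)      # links between layers

# Counting used in the text.
tile_placements = math.factorial(tiles)               # 64!
slots_per_layer = math.comb(side * side, 2)           # tile pairs in one layer
planar_slots = slots_per_layer * layers               # candidate planar links
link_placements = math.comb(planar_slots, planar_mesh_links)
total_designs = tile_placements * link_placements

print("planar + vertical links:", planar_mesh_links + vertical_mesh_links)
print("digits in 64!:", len(str(tile_placements)))
print("digits in link placements:", len(str(link_placements)))
print("digits in total design count:", len(str(total_designs)))
```

Printing the number of decimal digits rather than the values themselves keeps the output readable while still conveying why exhaustive or branch-and-bound enumeration is impractical.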


Fig. 5 Comparison of physical placement of tiles and planar links in solutions obtained using MOO-STAGE and AMOSA

5.3 Comparison with Mesh NoC-Based Heterogeneous Manycore System

For this comparison, we introduce two new optimization cases (extending the cases discussed earlier):

• Case 4 – {T} Thermal-only optimization. We consider peak core temperature (Eq. 8) only.
• Case 5 – {$\bar{U}$, σ, Lat, T, E} Joint performance-thermal optimization. We add temperature (Eq. 8) to Case 3.

In Fig. 6, we compare the results of the manycore systems optimized for case 3 (network efficiency/performance), case 4 (thermal-only), and case 5 (joint network performance-thermal), normalized to the case 3 designs. Figures 6a and 6b show the full-system execution time and EDP, respectively. Figure 6c shows the temperature of the 3D NoC configuration in all three cases. Here, full-system EDP (FS-EDP) is defined as the product of full-system execution time and energy. The full-system execution time is obtained via detailed Gem5-GPU simulations. It is clear from Fig. 6 that incorporating only temperature in the optimization process leads to the best temperature profile but degrades full-system execution time by more than 7% on average. Similarly, the NoC optimized for network efficiency (case 3) achieves the best EDP, but at the cost of an average temperature increase of 20 °C compared to the thermal-only optimized NoC (case 4). On the other hand, the jointly optimized NoC exhibits temperature improvements of 18 °C on average while sacrificing only 2.3% in overall execution time. Therefore, it is important to jointly optimize both performance and temperature to reduce on-chip temperature while delivering high performance. In all cases, the optimized manycore designs perform better than their mesh-based counterparts.

Fig. 6 Performance-thermal trade-offs for 64-tile NoCs: (a) full-system execution time, (b) full-system EDP, (c) temperature comparison for three optimization cases: network efficiency/performance-only (Perf), joint performance-thermal (Joint), and thermal-only (Therm)

6 MOO-STAGE for M3D-Based Manycore Systems

In this section, we discuss the design of the M3D-enabled CPU/GPU-based heterogeneous manycore system. We have seen from the earlier discussion that a 3D TSV-based heterogeneous manycore system can improve performance. However, the relatively large dimensions of TSVs (~μm) present some fundamental limitations in high-performance, low-power manycore architecture design: (a) fine-grained partitioning of logic blocks across multiple tiers is not possible [38], forcing planar implementations of cores and their associated logic elements; (b) thick bonding material introduces heat dissipation challenges [39]; and (c) TSVs add non-negligible area and power overheads. Overall, TSV-based 3D designs cannot achieve the full potential of vertical integration. This can lead to poor performance-thermal trade-offs in 3D CPU/GPU-based heterogeneous manycore design. Meanwhile, monolithic 3D (M3D) has emerged as a promising technology for fine-grained vertical integration. By utilizing extremely small monolithic inter-tier vias (MIVs), M3D architectures have the potential to address the limitations of their TSV-based counterparts. M3D takes advantage of the nano-scale dimensions of MIVs (diameter of ~50 nm) to design hardware logic over multiple tiers [38]. This leads to significant performance gains and better energy efficiency. For instance, an M3D-enabled adder spanning two tiers outperforms conventional designs by 33% [40]. Figure 7 shows an M3D-based heterogeneous manycore architecture and how it differs from its more traditional TSV-based counterpart. As shown in Fig. 7, M3D allows for multi-tier logic blocks, i.e., we can design 3D CPU, GPU, and LLC tiles. Figure 7 shows a two-tier partitioning of the logic and memory blocks, where each logic block is spread across two physical layers; we refer to this architecture as HeM3D (heterogeneous M3D). To highlight the salient features of HeM3D, Fig. 7b shows a regular TSV-based 3D heterogeneous manycore design (the equivalent of HeM3D using TSV).

Fig. 7 Illustration of heterogeneous manycore architectures using (a) M3D (HeM3D) and (b) TSV

The TSV-enabled design (Fig. 7b) utilizes planar components that are stacked on top of each other to create the 3D architecture; the TSV-based manycore architecture in Fig. 7b is equivalent to the design in Fig. 1. Hence, in a TSV-based design the main performance benefits arise from better network connectivity using the vertical links, not from improvements to the individual computing elements. M3D integration is enabled by fabricating two or more silicon layers sequentially on the same substrate and interconnecting the layers using small MIVs, allowing ultrahigh-density, fine-grained vertical integration [41]. This is fundamentally different from 3D integration using TSVs to interconnect separately fabricated dies. Depending on the granularity with which devices are partitioned across multiple tiers, M3D-based architectures can be grouped into three main categories: (a) transistor-level or N/P partitioning, where the nFETs and pFETs of a gate are placed on two separate tiers and connected via intra-gate MIVs [42]; (b) gate-level partitioning, where planar gates are placed in different tiers and connected using inter-gate MIVs [43]; and (c) block-level partitioning, where intellectual property (IP), functional, and memory blocks are placed in different tiers and connected using MIVs [44]. Among these partitioning techniques, gate-level partitioning results in the highest footprint reduction and subsequent performance improvement [43]. By placing different logic gates across multiple tiers, i.e., in 3D, the overall wirelength is reduced significantly. This leads to higher clock frequencies due to lower latency along the critical paths and to a simplified and more energy-efficient clock tree and power delivery network [43]. Therefore, HeM3D adopts gate-level partitioning to develop 3D logic blocks. Figure 8a shows the planar cores used in TSV-based heterogeneous manycore systems.
Figure 8b illustrates the design of such cores using M3D-enabled gate-level partitioning in two tiers. These cores are paired with routers, and the combination of a core and its associated router is referred to as a tile in Fig. 7. The gates spread across different tiers are connected using monolithic inter-tier vias (MIVs), which tend to be much smaller than TSVs. Due to the gate-level partitioning, the dimensions of the multi-tier tiles are considerably smaller than those of the planar tiles, so the critical paths of the multi-tier tiles are correspondingly shorter. As a result, M3D-enabled logic blocks are typically faster than traditional designs.

Fig. 8 Illustration of (a) planar logic block and (b) multi-tier logic block enabled by M3D-based gate-level partitioning. The width and length of the multi-tier core are substantially smaller than those of its planar counterpart. This is for illustration purposes only

Altogether, HeM3D (Fig. 7a) incorporates the M3D-enabled CPUs, GPUs, LLCs, and routers. The CPU, GPU, and LLC tiles and the routers are extended vertically across two tiers in HeM3D. These two-tier components are then distributed among four tiers to obtain the overall 3D structure shown in Fig. 7a. Each tile is connected to an NoC router for communication. The routers are connected to each other via an optimized NoC topology. Overall, by vertically partitioning across multiple tiers in M3D, HeM3D lowers the critical path delay, leading to better performance. In addition, the vertical partitioning of the tiles results in lower power consumption due to the reduced wirelength and fewer repeaters. The improved power efficiency results in an inherent reduction of on-chip temperature, which is otherwise a serious problem in TSV-based 3D designs. Hence, HeM3D can achieve better performance and thermal characteristics than its TSV equivalent, as demonstrated later. However, designing an M3D-based heterogeneous manycore system is challenging because the design space is far larger than that of its TSV-based counterpart. In the next section, we discuss how MOO-STAGE can solve this problem.

6.1 MOO-STAGE for M3D Design

Similar to the TSV-based manycore design, we can also formulate the M3D-based manycore design as a MOO problem, i.e., given C CPUs, G GPUs, L LLCs, and P links, how to place these tiles and links to obtain a suitable manycore system.


However, M3D-based heterogeneous manycore design differs from the TSV-based design in a few key ways that must be considered during the design process. First, as shown in Fig. 8, M3D allows us to design logic blocks in 3D. Hence, we must redesign the CPU, GPU, and LLC tiles to take advantage of M3D. For instance, an M3D-based CPU can be 14% faster than a conventional planar CPU. Similarly, LLC and GPU performance can be improved by 23.3% and 10%, respectively, using M3D [45, 46]. By utilizing this property of M3D, we can also design 3D routers, which extend across multiple physical layers as shown in Fig. 7; recall that individual routers are planar in a TSV-based manycore system. This opens up many new NoC design possibilities where 3D routers act as communication shortcuts between the different layers. However, it also increases the size of the NoC design space. Hence, design space exploration is more challenging for M3D than for TSV-based manycore design.

Next, TSV and M3D systems have very different physical structures, which affect temperature (illustrated in Fig. 9). TSV-based architectures include a layer of bonding material between adjacent silicon tiers that has very poor thermal conductivity (Fig. 9a) [39]. This prevents heat from easily flowing towards the heat sink. In addition, TSVs are much larger than MIVs, resulting in thicker silicon layers and a longer path for heat flowing towards the sink. For these reasons, a major portion of the heat spreads laterally rather than flowing vertically towards the sink (as shown in Fig. 9a). As a result of this gradual heat accumulation among the layers, the overall temperature of the chip increases, which negatively affects performance. Hence, TSV-based 3D integration is not very effective in designing high-performance heterogeneous architectures, as we show later. On the other hand, M3D integration (shown in Fig. 9b) inherently exhibits better thermal properties than its TSV counterpart due to thinner tiers and the absence of any bonding material [39]. The inter-layer dielectric (ILD) in M3D is significantly thinner and has better thermal characteristics than the equivalent “bonding layer” of TSV. This results in different effective thermal resistances for M3D- and TSV-based systems (Rj and Rb in Eq. 6), which we obtain from [39] and 3D-ICE simulations [29]. In addition, the power consumption (Pn,i) in Eq. 6 also varies between TSV and M3D integration. M3D architectures are more power efficient than their TSV counterparts due to the smaller dimensions [39]. Hence, by considering these M3D-specific differences, we can accurately model the thermal profile in HeM3D. The lower temperature in M3D enables us to apply more aggressive performance optimizations to the overall architecture without having to worry about on-chip temperatures, as we show later in this work.

Overall, we can formulate the M3D-based manycore design as a MOO problem (similar to the TSV-based manycore design but with the M3D-specific changes). However, as mentioned earlier, the design space of M3D is significantly larger than its TSV-based counterpart, which makes the MOO problem more challenging. Next, we show the efficacy of MOO-STAGE for M3D-based heterogeneous manycore design. To evaluate the performance of MOO-STAGE, we consider the well-known MOO algorithm AMOSA as the baseline due to its ability to achieve near-optimal solutions [12]. We evaluate both algorithms based on their runtime and quality of


B. K. Joardar et al.

Fig. 9 Illustration of physical structure and heat flow in (a) TSV two-tier cross-section, (b) M3D two-tier cross-section

solutions. Given the set of Pareto-optimal solutions D*, we run detailed simulations on each solution in D* to get accurate performance and temperature measurements. We use the same MOO formulation (Eq. 12) to design the TSV-based baseline architecture. In other words, we optimize the placement of CPUs, GPUs, LLCs, and planar links to improve latency, throughput, and temperature using TSV-specific parameters. All experiments were run on an Intel® Xeon® CPU E5-2630 @ 2.2 GHz machine with 252 GB RAM running CentOS 6.

Figure 10 shows the speed-up in convergence time achieved by MOO-STAGE compared to AMOSA for designing HeM3D and its TSV equivalent for all considered benchmarks. Here, we define convergence as the point in time beyond which subsequent solutions vary in performance by less than 2%. This analysis is done with the joint performance-thermal optimization (case 5 in TSV-based MOO). For the TSV-based design, MOO-STAGE converges 5.48× faster than AMOSA on average. The speed-up increases to 7.38× on average for HeM3D design optimization, because the design space of M3D-enabled architectures is significantly larger than that of their TSV-based counterparts. Conventional search algorithms like AMOSA need to be annealed slowly and do not scale with the size of the search space. In addition, critical parameters like annealing temperature in AMOSA need to be tuned carefully for best results. Even then, AMOSA requires a significant amount of time to yield a solution whose performance-thermal trade-off is comparable to that obtained using MOO-STAGE. On the other hand, by singling out the promising regions of the HeM3D design space, MOO-STAGE effectively reduces the size of the search space and hence the effort required to explore it to uncover (near-)optimal designs. As a result, MOO-STAGE achieves higher speed-up for HeM3D design. This shows that we can use ML-based solvers for designing M3D-based manycore systems as well.
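The convergence criterion and the resulting speed-up can be made concrete with a short sketch. The Python below is illustrative only: the function name `convergence_time`, the trace format (a time-sorted list of (time, performance) pairs), the relative interpretation of the 2% band, and all numbers are assumptions of this sketch and are not taken from the reported experiments.

```python
def convergence_time(trace, tolerance=0.02):
    """Earliest time after which all later solutions stay within
    `tolerance` (relative) of the performance reached at that time.

    trace: list of (time, performance) pairs sorted by time (assumed format).
    """
    if not trace:
        raise ValueError("trace must contain at least one sample")
    for i, (t_i, perf_i) in enumerate(trace):
        later = (p for _, p in trace[i + 1:])
        if all(abs(p - perf_i) < tolerance * abs(perf_i) for p in later):
            return t_i
    # Not reached: the last sample has no later samples, so the loop returns.
    return trace[-1][0]


# Hypothetical search traces (time units and cost values are made up).
baseline_trace = [(0, 100.0), (10, 80.0), (20, 70.0), (30, 66.0), (40, 65.0), (50, 64.8)]
learned_trace = [(0, 100.0), (2, 70.0), (4, 65.5), (6, 65.0), (8, 64.9)]

t_baseline = convergence_time(baseline_trace)   # 30
t_learned = convergence_time(learned_trace)     # 4
speedup = t_baseline / t_learned                # 7.5
print(f"speed-up in convergence time: {speedup:.2f}x")
```

These numbers only illustrate the calculation and bear no relation to the results in Fig. 10.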


Fig. 10 Speed-up achieved by MOO-STAGE compared to AMOSA for designing HeM3D and its TSV-based counterpart

7 Conclusion

In recent years, there has been much focus on the acceleration of machine learning and deep learning algorithms through manycore optimization. On the other hand, we have only scratched the surface of computing systems design using machine learning techniques. In this chapter, we have provided two examples showing how ML can be used for heterogeneous manycore design. 3D CPU-GPU-based heterogeneous architectures provide an opportunity to design high-performance, energy-efficient computing platforms to meet the growing computational needs of deep learning and big-data applications. However, 3D heterogeneous architectures present several new design challenges: (a) multiple potentially conflicting design requirements; (b) 3D integration-induced thermal hotspots; and (c) significantly larger design spaces. As we have shown here, existing algorithms often do not scale to large and complex design problems. ML-inspired methods such as MOO-STAGE can solve this problem: MOO-STAGE finds better manycore designs faster than existing optimization algorithms. Overall, by integrating expert knowledge in hardware with data-driven machine learning, we can create a machine learning-driven hardware design framework. Such a framework can bring engineering costs down, commoditizing and democratizing the optimized computing platforms for the next generation of machine learning algorithms. Using this framework, we can stimulate the relationship between computing systems and machine learning, spurring improvement in both computing infrastructures and algorithms.

References

1. Kim, R.G., et al.: Imitation learning for dynamic VFI control in large-scale manycore systems. IEEE Trans. Very Large Scale Integr. VLSI Syst. 25(9), 2458–2471 (2017)
2. Joardar, B.K., Kim, R.G., Doppa, J.R., Pande, P.P., Marculescu, D., Marculescu, R.: Learning-based application-agnostic 3D NoC design for heterogeneous manycore systems. IEEE Trans. Comput. 68(6), 852–866 (2019)


3. Mirhoseini, A., et al.: Chip placement with deep reinforcement learning. In: arXiv:2004.10746 (2020)
4. Deshwal, A., Jayakodi, N.K., Joardar, B.K., Doppa, J.R., Pande, P.P.: MOOS: a multi-objective design space exploration and optimization framework for NoC enabled manycore systems. ACM Trans. Embed. Comput. Syst. 18(5s, Article 77), 1–23 (2019)
5. Ding, Y., Botzer, N., Weninger, T.: HetSeq: distributed GPU training on heterogeneous infrastructure. Proc. AAAI Conf. Artif. Intell. 35(17), 15432–15438 (2021)
6. Hestness, J., Keckler, S.W., Wood, D.A.: GPU computing pipeline inefficiencies and optimization opportunities in heterogeneous CPU-GPU processors. In: Proceedings of the IEEE IISWC, pp. 87–97. IEEE, Atlanta (2015)
7. Power, J., et al.: Heterogeneous system coherence for integrated CPU-GPU systems. In: Proceedings of the IEEE/ACM MICRO, pp. 457–467. IEEE, Davis (2013)
8. Davis, W.R., et al.: Demystifying 3D ICs: the pros and cons of going vertical. IEEE Des. Test Comput. 22(6), 498–510 (2005)
9. Feero, B.S., Pande, P.P.: Networks-on-chip in a three-dimensional environment: a performance evaluation. IEEE Trans. Comput. 53(1), 32–45 (2008)
10. Choi, W., et al.: On-chip communication network for efficient training of deep convolutional networks on heterogeneous manycore systems. IEEE Trans. Comput. 67(5), 672–686 (2018)
11. Deb, K., Pratap, A., Agarwal, S.: A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evol. Comput. 6(2), 182–197 (2002)
12. Bandyopadhyay, S., Saha, S., Maulik, U., Deb, K.: A simulated annealing-based multiobjective optimization algorithm: AMOSA. IEEE Trans. Evol. Comput. 12(3), 269–283 (2008)
13. Boyan, J.A., Moore, A.W.: Learning evaluation functions to improve optimization by local search. J. Mach. Learn. Res. 1, 77–112 (2001)
14. Apple: https://www.apple.com/newsroom/2021/10/introducing-m1-pro-and-m1-max-themost-powerful-chips-apple-has-ever-built/ (2021). Accessed Apr 2022
15. Qualcomm: https://developer.qualcomm.com/blog/heterogeneous-computing-yourdemanding-apps (2020). Accessed Apr 2022
16. Kayiran, O., et al.: Managing GPU concurrency in heterogeneous architectures. In: Proceedings of the IEEE/ACM MICRO. IEEE, Cambridge (2014)
17. Lee, J., Li, S., Kim, H., Yalamanchilli, S.: Design space exploration of on chip ring interconnection for a CPU-GPU heterogeneous architecture. ACM J. Parallel Distrib. Comput. 73(12), 1525–1538 (2013)
18. Joardar, B.K., et al.: 3D NoC-enabled heterogeneous manycore architectures for accelerating CNN training: performance and thermal trade-offs. In: Proceedings of the IEEE/ACM NOCS. IEEE, Seoul (2017)
19. Alam, S.M., Jones, R.E., Pozder, S., Jain, A.: Die/wafer stacking with reciprocal design symmetry (RDS) for mask reuse in three-dimensional (3D) integration technology. In: Proceedings of the ISQED, pp. 569–575. IEEE, San Jose (2009)
20. Zhou, X., Xu, Y., Du, Y., Zhang, Y., Yang, J.: Thermal management for 3D processors via task scheduling. In: Proceedings of the International Conference on Parallel Processing, pp. 115–122. IEEE, Portland (2008)
21. Mariani, G., Palermo, G., Zaccaria, V., Silvano, C.: OSCAR: an optimization methodology exploiting spatial correlation in multicore design spaces. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 31(5), 740–753 (2012)
22. Morgan, A.A., Elmiligi, H., El-Kharashi, M.W., Gebali, F.: Multi-objective optimization for networks-on-chip architectures using genetic algorithms. In: Proceedings of the IEEE ISCAS, pp. 3725–3728. IEEE, Paris (2010)
23. Ozisikyilmaz, B., Memik, G., Choudhary, A.: Efficient system design space exploration using machine learning techniques. In: Proceedings of the ACM/IEEE DAC, pp. 966–969. IEEE, Anaheim (2008)


24. Ascia, G., Catania, V., Di Nuovo, A.G., Palesi, M., Patti, D.: Efficient design space exploration for application specific systems-on-a-chip. J. Syst. Archit. 53(10), 733–750 (2007)
25. Wu, C., et al.: A multi-objective model oriented mapping approach for NoC-based computing systems. IEEE Trans. Parallel Distrib. Syst. 28(3), 662–676 (2017)
26. Das, S., Doppa, J.R., Pande, P.P., Chakrabarty, K.: Design-space exploration and optimization of an energy-efficient and reliable 3-D small-world network-on-chip. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 36(5), 719–732 (2017)
27. Cong, J., Wei, J., Zhang, Y.: A thermal-driven floorplanning algorithm for 3D ICs. In: Proceedings of the ICCAD, pp. 306–313. IEEE, San Jose (2004)
28. GitHub: https://github.com/CSU-rgkim/TC_2018_code
29. Sridhar, A., Vincenzi, A., Ruggiero, M., Brunschwiler, T., Atienza, D.: 3D-ICE: fast compact transient thermal modeling for 3D ICs with inter-tier liquid cooling. In: Proceedings of the ICCAD, pp. 463–470. IEEE, San Jose (2010)
30. Power, J., Hestness, J., Orr, M., Hill, M., Wood, D.: gem5-gpu: a heterogeneous CPU-GPU simulator. IEEE Comput. Archit. Lett. 14(1), 34–36 (2015)
31. Zitzler, E., Brockhoff, D., Thiele, L.: The hypervolume indicator revisited: on the design of pareto-compliant indicators via weighted integration. In: Proceedings of the EMO, pp. 862–876. IEEE, Matsushima (2007)
32. Auger, A., Bader, J., Brockhoff, D., Zitzler, E.: Theory of the hypervolume indicator: optimal mu-distributions and the choice of the reference point. In: Proceedings of the ACM FOGA, pp. 87–102. IEEE, Orlando (2009)
33. Leng, J., et al.: GPUWattch: enabling energy optimizations in GPGPUs. In: Proceedings of the ISCA, pp. 487–498. IEEE, Tel-Aviv (2013)
34. Li, S., et al.: McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures. In: Proceedings of the IEEE/ACM MICRO, pp. 469–480. IEEE, New York (2009)
35. Lysne, O., Skeie, T., Reinemo, S.A., Theiss, I.: Layered routing in irregular networks. IEEE Trans. Parallel Distrib. Syst. 17(1), 51–65 (2006)
36. Che, S., et al.: Rodinia: a benchmark suite for heterogeneous computing. In: Proceedings of the IISWC, pp. 44–54. IEEE, Austin (2009)
37. Ross, S., Gordon, G.J., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the AISTATS, pp. 627–635. IEEE, Ft. Lauderdale (2011)
38. Samal, S.K., Nayak, D., Ichihashi, M., Banna, S., Lim, S.K.: Monolithic 3D IC vs. TSV-based 3D IC in 14 nm FinFET technology. In: 2016 IEEE SOI-3D-Subthreshold Microelectronics Technology Unified Conference (S3S), pp. 1–2. IEEE (2016)
39. Samal, S.K., Panth, S., Samadi, K., Saedi, M., Du, Y., Lim, S.K.: Fast and accurate thermal modeling and optimization for monolithic 3D ICs. In: 2014 51st ACM/EDAC/IEEE Design Automation Conference (DAC), pp. 1–6. IEEE (2014)
40. Panth, S., Samadi, K., Du, Y., Lim, S.K.: High-density integration of functional modules using monolithic 3D-IC technology. In: 2013 18th Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 681–686. IEEE (2013)
41. Lee, Y., Lim, S.K.: Ultrahigh density logic designs using monolithic 3-D integration. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 32, 1892–1905 (2013)
42. Shi, J., et al.: A 14 nm FinFET transistor-level 3D partitioning design to enable high-performance and low-cost monolithic 3D IC. In: 2016 IEEE International Electron Devices Meeting (IEDM), pp. 2.5.1–2.5.4. IEEE (2016)
43. Liu, C., Lim, S.K.: A design tradeoff study with monolithic 3D integration. In: Thirteenth International Symposium on Quality Electronic Design (ISQED), pp. 529–536. IEEE (2012)
44. Panth, S., Samadi, K., Du, Y., Lim, S.K.: Power-performance study of block-level monolithic 3D-ICs considering inter-tier performance variations. In: 2014 51st ACM/EDAC/IEEE Design Automation Conference (DAC), pp. 1–6. IEEE (2014)


45. Gong, Y., Kong, J., Chung, S.W.: Quantifying the impact of monolithic 3D (M3D) integration on L1 caches. IEEE Trans. Emerg. Top. Comput. 1, 854–865 (2019)
46. Arka, A.I., et al.: HeM3D: heterogeneous manycore architecture based on monolithic 3D vertical integration. ACM Trans. Des. Autom. Electron. Syst. 26(2, Article 16), 1–21 (2021)

Hardware–Software Co-design for Ultra-Resource-Constrained Embedded Machine Learning Inference: A Printed Electronics Use Case

Georgios Zervakis, Mehdi B. Tahoori, and Jörg Henkel

1 Introduction

Embedded machine learning (ML) is a rapidly growing field that comprises ML algorithms, hardware, and software capable of performing on-device sensor data analyses at extremely low power, thus enabling several always-on and battery-powered applications and services [32]. Running ML-based applications on embedded edge devices is attracting vast interest for manifold reasons such as accessibility, privacy, latency, and security [29]. Embedded ML must be energy efficient and low latency while retaining accuracy at acceptable levels, which leads to custom-designed circuits and cross/co-optimization of the software and hardware stack. Nevertheless, as this domain is still in its infancy and fast-changing, avoiding a massive non-recurring engineering (NRE) cost upfront is highly desirable, especially for low-cost embedded ML systems [29].

Printed electronics constitutes one of the most extreme examples of ultra-resource-constrained devices for embedded ML inference. Printed electronics is increasingly recognized as a key enabler for the Internet of Things as part of the “Fourth Industrial Revolution,” whose core technology advances are functionality and low cost [9]. Printed electronics offers unprecedented mechanical flexibility, functionality, and low NRE and fabrication cost; it is often described as the creation of smart lightweight electronics based on cheap abundant materials and manufactured by simple ubiquitous printing processes, enabling new ways of integration [9]. Printed electronics forms a rapidly growing domain, and its market is projected to reach US$73B in 2027.

G. Zervakis · M. B. Tahoori · J. Henkel
Karlsruhe Institute of Technology, Karlsruhe, Germany

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
S. Pasricha, M. Shafique (eds.), Embedded Machine Learning for Cyber-Physical, IoT, and Edge Computing, https://doi.org/10.1007/978-3-031-39932-9_8


G. Zervakis et al.

Printed electronics appears as a viable solution to bring computing and intelligence to domains such as disposables (e.g., packaged foods, beverages), smart packaging, low-end healthcare products (e.g., smart bandages), in situ monitoring, and the 10-trillion market of fast-moving consumer goods (FMCG) [27]. While the impact of computing is almost ubiquitous, such application domains have not seen considerable penetration of computing that could help, for example, with identification and tracking (Is this the pill of today?), quality monitoring (Is this milk still good?), brand authentication (Is this a Granny Smith apple?), or interactivity (Is my beverage at my desired temperature?). Such domains pose requirements for ultra-low cost (even sub-cent) and conformality that lithography-based silicon technologies cannot satisfy. For example, silicon systems cannot meet stretchability, porosity, and flexibility requirements, while the high manufacturing, packaging, and assembly costs of silicon prevent sub-cent cost [8]. On the other hand, printed circuits feature ultra-low-cost additive manufacturing that enables conformal hardware on demand. However, the large feature sizes in printed electronics impose severe restrictions, leading to elevated hardware overheads that can be prohibitive for the realization of complex circuits. As a result, integrating ML classification in printed circuits is very challenging and requires careful design and extensive optimization.

Although work on printed ML is limited, significant advancements have been reported. The authors in [37] develop a printed artificial neuron that supports both multiply–accumulate (MAC) operations and non-linear activation functions. Douthwaite et al. [15] propose a flexible MAC engine. Targeting more complex printed ML inference, the authors in [2, 5, 27, 38, 39] employ a systematic software–hardware co-design, enriched with non-conventional computing in [2, 5, 38] to further boost the hardware gains. Ozer et al. [39] fabricated a natively flexible processing engine with hardwired parameters for an odor recognition application based on their resource-efficient ML algorithm (the “univariate Bayes feature voting classifier”). The authors in [27] and [2] discuss the implementation of several printed ML classifiers, Balaskas et al. [5] focus on ultra-tiny implementations of digital decision trees, and Weller et al. [38] examine the feasibility of shallow printed neural networks.

In this chapter, we further discuss and analyze co-design approaches that enable embedded ML inference on ultra-resource-constrained devices such as our printed electronics use case. In brief, as shown in Fig. 1, such approaches are built upon a fully customized circuit design, further optimization by means of non-conventional computing (e.g., stochastic or mixed signal), and hardware-aware software optimization that comprises, among others, model and architecture selection and/or training.

2 Background on Printed Electronics

By printed electronics, we refer to a fabrication technology that is based on printing processes, e.g., jet printing and screen or gravure printing [14]. These techniques refer to an additive manufacturing process in which functional materials are directly

Hardware–Software Co-design for Ultra-Resource-Constrained Embedded. . .

[Fig. 1 block labels: Dataset/Model; Fully Custom Circuit Design; Optimize HW; Bespoke; Approximate; Stochastic; Mixed-signal; Printed ML Classifier; Co-design; Area & Power Constraints; Optimize Model (SW); HW-aware model selection; HW-aware train; Model approximation]

Fig. 1 Abstract flow of software–hardware co-design, enriched with non-conventional computing, to enable the realization of ultra-resource-constrained ML circuits

deposited on the substrate. This simplifies the production chain compared to subtractive processes in silicon-based systems. The simple manufacturing process combined with the low equipment costs enables fabrication of ultra-low-cost electronic circuits, at a cost far lower than that of silicon-based systems, which require clean rooms and very expensive foundries. Printed electronics and silicon-based systems are complementary technologies and do not compete with each other. The former will never challenge the latter in terms of area, integration density, or performance, but it can realize ultra-low-cost and flexible computing systems and is thus suitable for application domains that silicon systems cannot reach.

Electronics on flexible substrates are made possible by using contactless printing methods, e.g., inkjet printers, combined with highly optimized functional inks, e.g., conductive, semi-conductive, and non-conductive materials. Using these inks, oxide-based [33] or organic [12] transistors can be fabricated. Although organic materials are easily processed, they exhibit lower environmental stability. On the other hand, oxide-based inks feature excellent environmental stability and conductivity, but they are more difficult to print and suffer from impurities as a result of surfactants [14]. Inkjet-printed electrolyte-gated transistor (EGT) technology is an oxide-based inorganic printed technology. EGTs use fluid electrolytes as a dielectric substitute in the transistor, which allows operating voltages below 1 V. Hence, they are excellent candidates for self-powered embedded IoT mobile computing systems.

Despite these attractive features, printed electronics exhibits several prevalent limitations. The large feature sizes lead to high device latencies and low integration density, i.e., orders of magnitude lower than that of silicon VLSI. As a result, designs of low complexity, which feature a limited number of transistors, are mainly favored in order to reduce area overheads and to enable manufacturing with reasonable yield. Still, printed computing systems have been successfully fabricated, such as


Boolean logic [13], digital and analog storage elements [20, 36], amplifiers [22], an ML processing engine [39], and a 32-bit Arm microprocessor [7].

3 Preliminaries

ML Classifiers: Throughout this chapter, several ML classifiers are considered as case studies: decision trees (DTs), multi-layer perceptron classifiers (MLP-Cs), multi-layer perceptron regressors (MLP-Rs), SVM classifiers (SVM-Cs), and SVM regressors (SVM-Rs), trained on several datasets of the UCI ML repository [16]. To train a classifier on a specific dataset, scikit-learn with randomized parameter optimization (RandomizedSearchCV) and 5-fold cross validation is used. Input normalization is applied, and the dataset is randomly split into training and test sets with a 70%/30% ratio. For all the MLPs, the topology is set to one hidden layer with up to five neurons, and ReLU is the activation function. All the SVMs employ a linear kernel, and for the SVM-Cs, 1-vs-1 classification is used.

Hardware Evaluation: To generate the register-transfer level (RTL) description of the ML classifiers, fixed-point arithmetic is used. The precision of the inputs and coefficients varies per use case and ranges from 4 to 8 bits. The RTL of each classifier is synthesized with Synopsys Design Compiler and mapped to the open-source EGT library¹ [8], which, as mentioned above, is an inkjet-printed technology. Note that EGT is low voltage, allowing for battery-powered operation. QuestaSim is used for circuit simulation in order to obtain the circuit’s output as well as its switching activity. Finally, power evaluation is performed with Synopsys PrimeTime using the switching activity obtained from the circuit simulation.
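To illustrate what low-precision fixed-point evaluation of a trained model involves, the following minimal Python sketch quantizes normalized inputs and real-valued coefficients and evaluates one dot product. The chapter only states that fixed-point arithmetic with 4 to 8 bits is used; the unsigned input format, the signed coefficient format with `frac_bits` fractional bits, rounding to nearest, saturation, and all function names are assumptions made for illustration.

```python
def quantize_unsigned(x, bits):
    """Map a normalized input x in [0, 1] to an unsigned `bits`-bit integer."""
    levels = (1 << bits) - 1
    x = min(max(x, 0.0), 1.0)            # saturate out-of-range inputs
    return round(x * levels)             # round to the nearest level


def quantize_signed(w, bits, frac_bits):
    """Map a real coefficient to a signed fixed-point integer with
    `frac_bits` fractional bits, saturating at the `bits`-bit range."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    q = round(w * (1 << frac_bits))
    return min(max(q, lo), hi)


def fixed_point_dot(xs, ws, in_bits, w_bits, frac_bits):
    """Integer multiply-accumulate, rescaled to a real value for inspection."""
    acc = sum(
        quantize_unsigned(x, in_bits) * quantize_signed(w, w_bits, frac_bits)
        for x, w in zip(xs, ws)
    )
    scale = ((1 << in_bits) - 1) * (1 << frac_bits)
    return acc / scale


# Hypothetical 2-feature example: 4-bit inputs, 8-bit coefficients.
print(quantize_unsigned(1.0, 4))                          # 15
print(quantize_signed(0.5, 8, 6))                         # 32
print(fixed_point_dot([1.0, 0.0], [0.5, 0.25], 4, 8, 6))  # 0.5
```

In hardware, only the integer accumulation would be implemented; the final rescaling is included here solely to compare the result with the floating-point model.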

4 Bespoke ML Classification Circuits

Despite the variety of existing printed circuits and the increased research activity on several applications, Mubarik et al. [27] were the first to examine integrating ML algorithms, and thus bringing sophisticated services, into such ultra-resource-constrained environments. As a first step toward an efficient software–hardware co-design, the authors in [27] started by investigating which ML classification algorithms could be feasibly implemented. This depends on the application characteristics that determine the accuracy of an ML algorithm but also on the associated hardware cost (e.g., power) and the respective resource limitations of printed electronics.

1 https://github.com/PrintedComputing.


Table 1 Accuracy vs. computation requirements for varying ML classification models. Source: [27]

Dataset     RF-4 (a)    MLP-1 (b)   LR                SVM-C             SVM-R             DT-2 (c)
            A (d) C (e) A     M (f) A     M     C     A     M     C     A     M     C     A     C
Arrhythmia  0.61  166   0.62  1380  0.63  2893  11    0.64  14k   55    0.25  263   12    0.58  3
Cardio      0.92  225   0.9   110   0.91  57    3     0.9   57    3     0.84  19    4     0.79  3
GasID       0.95  471   0.98  665   0.91  762   6     0.99  1.9k  15    0.99  127   7     0.49  3
HAR         0.99  334   0.99  85    0.94  60    5     0.99  120   10    0.9   12    6     0.82  2
Pendigits   0.96  570   0.94  130   0.92  160   10    0.98  720   45    0.19  16    11    0.32  3
RedWine     0.56  361   0.57  85    0.56  66    6     0.55  165   15    0.57  11    7     0.47  3
WhiteWine   0.56  569   0.56  90    0.54  66    6     0.51  165   15    0.51  11    7     0.48  3

(a) RF classifiers with four estimators
(b) MLP with one hidden layer and up to five hidden nodes
(c) DT with depth two
(d) Accuracy
(e) The number of comparisons
(f) The number of MAC operations

4.1 Resource-Aware ML Algorithm Selection

In [27], five common algorithms for ML classification are considered: support vector machines (SVMs), random forests (RFs), decision trees (DTs), multi-layer perceptrons (MLPs), and logistic regression (LR). These algorithms were evaluated on several datasets [16] that belong to simple ML applications that consume at least one input generated from a sensor and exhibit low duty cycle, sample rate, and precision requirements. The accuracy is obtained as described in Sect. 3, and the hardware costs are estimated by counting the number of required MAC operations (or comparisons in DTs). Table 1 presents some representative examples. Among the examined ML algorithms, the hardware cost of LR, MLPs, RFs, and SVM-C (SVM classification) is considered prohibitive for most printed applications since the estimated area and power overheads range from 21 to 2250 cm² and from 0.078 to 8.2 W, respectively. Hence, the authors in [27] deduced that DTs and SVM-Rs (SVM regression) provide, in most cases, a good balance of attained accuracy and estimated hardware costs.
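The operation counts used as a hardware-cost proxy can be sketched for the simplest models. The closed-form expressions below are assumptions of this sketch (a full binary DT, one weight vector per class for LR, one linear classifier per class pair for 1-vs-1 SVM-C, and a single weight vector for SVM-R), as are the feature and class counts used in the example (11 features and 6 classes for RedWine, 16 features and 10 classes for Pendigits). None of these is stated explicitly in the text, although they do reproduce the corresponding LR, SVM-C, SVM-R, and DT-2 entries of Table 1.

```python
def dt_comparisons(depth):
    """Comparators in a full binary decision tree of the given depth."""
    return 2 ** depth - 1


def lr_cost(n_features, n_classes):
    """Logistic regression: one dot product and one comparison per class."""
    return {"mac": n_features * n_classes, "cmp": n_classes}


def svm_c_ovo_cost(n_features, n_classes):
    """1-vs-1 linear SVM classifier: one dot product per pair of classes."""
    pairs = n_classes * (n_classes - 1) // 2
    return {"mac": n_features * pairs, "cmp": pairs}


def svm_r_macs(n_features):
    """Linear SVM regressor: a single dot product over the input features."""
    return n_features


# Assumed dataset shapes: (number of features, number of classes).
datasets = {"RedWine": (11, 6), "Pendigits": (16, 10)}
for name, (n, k) in datasets.items():
    print(name, lr_cost(n, k), svm_c_ovo_cost(n, k), svm_r_macs(n))
print("DT-2 comparisons:", dt_comparisons(2))
```

Under these assumptions, the 1-vs-1 SVM-C needs 15 times more MACs than the SVM-R for RedWine (165 vs. 11), which is consistent with the observation that SVM-Rs are far cheaper to implement.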

4.2 Bespoke Classifier Implementation

Conventional Baseline: The conventional implementation of a fully parallel DT requires a comparator and two registers for each node in the tree (defined by its depth and the number of supported parallel comparisons per level). One register stores the coefficients (thresholds) for the comparisons, while the other holds an input feature. Then, a multiplexer is used to select the classification based on the results of all the


comparisons made. Similarly, the conventional implementation of a fully parallel SVM-R requires a hardware MAC unit for each MAC operation. Registers are used to hold the input features and coefficients. The number of registers is defined by the maximum number of supported input features. The number of multipliers equals the number of input features, and each multiplier multiplies an input feature with the respective trained coefficient. All the products are then added, and comparators and a class encoder are used to map the sum to the nearest class.

Bespoke Implementations: Per-unit-area fabrication costs and non-recurring engineering (NRE) costs are extremely low (even sub-cent), especially for additive and mask-less technologies, e.g., inkjet printing, that may even allow on-demand in situ printing. This enables highly customized bespoke classifiers, even at low to moderate fabrication volumes. Bespoke classifiers are circuit implementations that are fully customized to a specific model generated for a dedicated application using a given training dataset [27]. That level of optimization/customization is mostly infeasible in lithography-based silicon technologies, mainly due to the high associated costs (e.g., fabrication and NRE). The degree of customization offered by bespoke implementations [10, 28] enables very area-efficient designs targeting ultra-low-area and power-constrained embedded devices. In bespoke DTs, the coefficients (threshold values) are hardwired in the RTL description itself, thus replacing the respective threshold registers of the conventional maximally parallel implementations. In addition, the registers of the input features are replaced by direct connections to the corresponding input ports. As a result, the EDA tool optimizes the netlist by propagating constant values and simplifying/removing unnecessary gates in the design. For example, the comparators now have only one input variable, greatly simplifying the overall circuit implementation. Similarly, in bespoke SVM-Rs, the numbers of input features, coefficients, MACs, and comparators, as well as the precision of the computations and the width of the registers, are tailored to the specific application and SVM model. Again, the coefficient registers are replaced by the trained coefficients that are hardwired in the RTL description. As a result, the conventional input multipliers are replaced by multiplier-by-constant units, which further simplifies the multiplier logic.

Figure 2 presents the hardware gains achieved when using bespoke parallel DTs in place of conventional ones. As shown in Fig. 2, the bespoke DTs outperform their conventional counterparts in all the examined metrics, achieving, on average, 3.9×, 48.9×, and 75.6× lower delay, area, and power, respectively. Similarly, Fig. 3 presents the same comparative analysis when SVM-R is used for classification. Again, the bespoke SVM-Rs feature significantly lower hardware overheads than the conventional ones, delivering, on average, 1.4×, 12.8×, and 12.7× lower delay, area, and power, respectively. The tangible outcome of the co-design methodology of [27] is as follows: the DTs with depths 1 and 2 can be powered by only a Blue Spark 3 mW printed battery, while the DTs with depth 4 and the Cardio, WhiteWine, RedWine, and HAR SVM-Rs can be powered by a Molex 30 mW printed battery. On the other hand, for the remaining circuits, there does


Fig. 2 Hardware gains of bespoke maximally parallel decision trees compared against their conventional counterparts. DT-d is a decision tree of depth d. Source of figure data is [27]

not exist an adequate power supply today. Note that [27] also examined other optimizations, such as memoization and analog computing, which are not analyzed here.
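The idea of a bespoke DT described in Sect. 4.2, with thresholds hardwired into the RTL and inputs connected directly to ports, can be sketched with a small generator. The Python below emits a purely combinational Verilog module from a trained tree; it is a hypothetical illustration and not the RTL generation flow of [27]. The tuple-based tree encoding, the function names, the `x <= threshold` branch convention, and the example tree are all assumptions.

```python
def bespoke_dt_expr(node, in_bits, out_bits):
    """Return a Verilog expression for the tree with all constants inlined.

    node is either ("leaf", class_id) or
    ("node", feature_index, threshold, left_subtree, right_subtree);
    the left subtree is taken when x[feature] <= threshold (assumed).
    """
    if node[0] == "leaf":
        return f"{out_bits}'d{node[1]}"
    _, feature, threshold, left, right = node
    return (
        f"((x{feature} <= {in_bits}'d{threshold}) ? "
        f"{bespoke_dt_expr(left, in_bits, out_bits)} : "
        f"{bespoke_dt_expr(right, in_bits, out_bits)})"
    )


def bespoke_dt_module(name, tree, n_features, in_bits, out_bits):
    """Wrap the expression in a module: no threshold or input registers,
    only input ports and one continuous assignment."""
    ports = ", ".join(f"input [{in_bits - 1}:0] x{i}" for i in range(n_features))
    body = bespoke_dt_expr(tree, in_bits, out_bits)
    return (
        f"module {name}({ports}, output [{out_bits - 1}:0] y);\n"
        f"  assign y = {body};\n"
        "endmodule\n"
    )


# Hypothetical depth-2 tree over two 4-bit features and three classes.
tree = ("node", 0, 9,
        ("node", 1, 3, ("leaf", 0), ("leaf", 1)),
        ("leaf", 2))
print(bespoke_dt_module("dt_bespoke", tree, n_features=2, in_bits=4, out_bits=2))
```

Because every comparator has a constant operand, a synthesis tool can simplify each one to a few gates, which is the effect behind the gains reported in Fig. 2.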

5 Co-Design for Approximate ML Classification Circuits

As discussed in Sect. 4, bespoke circuit implementations [27] deliver significant area and power savings compared to conventional implementations, paving the way toward the realization of printed ML circuits. Leveraging the low fabrication cost and in situ fabrication of printed electronics, Mubarik et al. [27] exploited the high hardware efficiency of bespoke implementations to generate simple printed ML classifiers, e.g., decision trees (DTs). As demonstrated in [27], bespoke circuits and printed electronics are inherently interconnected, since printed electronics enable


Fig. 3 Hardware gains of bespoke maximally parallel SVMs compared against their conventional counterparts. Source of figure data is [27]

the fabrication of bespoke circuits, while bespoke implementations enable the realization of complex printed circuits. However, as shown in Table 1, due to the still considerable area and power requirements, implementing more complex classifiers requires additional optimization. To further reduce the hardware overheads of the ML classifiers, the authors in [2] and [5] employed, for the first time, approximate computing principles on printed electronics. Balaskas et al. [5] focused on further improving the printed DT classifiers introduced in [27], while Armeniakos et al. [2] explored the potential of integrating more complex classifiers such as MLPs and SVMs.

Approximate computing leverages the intrinsic error resilience of a large number of application domains in order to improve metrics such as performance, power, and area by intelligently reducing the accuracy of some of the underlying computations [45]. Such applications can still deliver results of acceptable quality despite some computations being performed in an approximate manner. Owing to the inherent error tolerance of ML algorithms, ML circuits are ideal candidates for approximate computing [45]. Significant research activity is recorded on approximate arithmetic circuits such as adders [3, 31, 42, 44] and multipliers [6, 40, 42, 44], which constitute the fundamental building blocks of ML circuits, on automated design frameworks for approximate circuits [4, 25, 41, 43], and on approximate neural network inference accelerators [11, 18, 26, 34, 35, 44, 46]. However, these works focus on conventional (non-bespoke) implementations that are unsuitable for ultra-resource-constrained printed circuits [27]. On the

Hardware–Software Co-design for Ultra-Resource-Constrained Embedded. . .


other hand, [2, 5] employ a systematic software–hardware approximate co-design approach, tailored to bespoke circuits, in order to optimize the generated ML circuits for the extreme and very demanding use case of printed electronics.

5.1 Approximate MLPs and SVMs

Armeniakos et al. [2] observed that, in bespoke circuits, the hardware overheads depend heavily on the coefficients, since these are hardwired in the circuit implementation itself. For example, assume a bespoke multiplier that performs the computation $x \times w$, where $x$ is an unknown input and $w$ is a predefined constant coefficient. Figure 4 presents the variation of the area of two bespoke multipliers with respect to the value of the coefficient. As shown in Fig. 4, both multipliers feature significant area variation and, most importantly, neighboring coefficient values ($w$) may exhibit significantly different areas. It is noteworthy that in several cases the area may even be nullified, e.g., when the coefficient is a power of two. Inspired by the observation in Fig. 4, the authors in [2] introduced, at the software level, a novel hardware-driven coefficient approximation in which the coefficients $w$ of a given MLP or SVM model are replaced by approximate values $\tilde{w}$ (in close proximity to $w$) that induce lower hardware overheads. For example, for the multipliers of Fig. 4, Fig. 5 reports the potential area savings when $w$ is replaced by $\tilde{w}$ so that $|w - \tilde{w}| \le d$, where $d$ is a small constant (e.g., 1–5 in Fig. 5). As shown in Fig. 5, the coefficient approximation achieves more than 19% median reduction of the area of the bespoke multipliers when $d = 1$. The area reduction increases to 53% when $d = 4$. In several cases, the area reduction is 0% or goes up to 100%. The latter is explained by the fact that $w$ is replaced by a zero-area $\tilde{w}$, and thus the area

Fig. 4 Area variation of a bespoke multiplier with respect to the coefficient value w (x-axis). Two multiplier sizes are examined: 8-bit coefficients with (a) 4-bit inputs and (b) 8-bit inputs. It is noteworthy that the area of the corresponding conventional 4×8 and 8×8 multipliers is 83.61 mm² and 207.43 mm², respectively. Source of the figure data is [2]. (a) x: 4 bit, w: 8 bit. (b) x: 8 bit, w: 8 bit


Fig. 5 Area reduction achieved by the coefficient approximation when $|w - \tilde{w}| \le d$. Source of the figure data is [2]. (a) x: 4 bit, w: 8 bit. (b) x: 8 bit, w: 8 bit

reduction is 100%, since multiplying by $\tilde{w}$ induces zero area overhead. On the other hand, when multiplying by $w$ already features the lowest area in the segment $[w - d, w + d]$, no area reduction can be achieved (i.e., 0%).

The main computationally intensive operation performed by both SVMs and MLPs is a weighted sum:

$$S = \sum_{1 \le j \le K} x_j \cdot w_j, \tag{1}$$

where $w_j$ are the model's trained coefficients (or weights), $x_j$ are the inputs, and $K$ is the number of coefficients. Remember that, in bespoke ML circuits, these coefficients are hardwired within the circuit [27] and no weight read/transfer occurs. When applying the coefficient approximation, the approximate sum becomes

$$S' = \sum_{1 \le j \le K} x_j \cdot \tilde{w}_j, \tag{2}$$

and thus, the error of the weighted sum equals

$$\epsilon_S = \sum_{1 \le j \le K} x_j \cdot \epsilon_{w_j}, \quad \epsilon_{w_j} = (w_j - \tilde{w}_j) \;\Rightarrow\; E[\epsilon_S] = \sum_{1 \le j \le K} E[x_j] \cdot \epsilon_{w_j}. \tag{3}$$

Hence, by balancing the positive with the negative $\epsilon_{w_j}$, the average error of the weighted sum can be minimized. Note that $E[x_j]$ can be easily obtained by running inference on the training dataset. Given that $K$ is usually small, a brute force method


Table 2 Evaluation of baseline exact bespoke printed ML circuits

Classifier  Dataset  Acc^a  T^b       #C^c  Area (cm²)  Power (mW)
MLP-C       Cardio   0.88   (21,3,3)  72    33.4        97.3
MLP-C       RedWine  0.56   (11,2,6)  34    17.6        53.3
MLP-R       Cardio   0.83   (21,3,1)  66    21.6        65.9
MLP-R       RedWine  0.56   (11,2,1)  24    7.1         24.0
SVM-C       Cardio   0.90   3         63    15.1        46.8
SVM-C       RedWine  0.57   15        66    23.5        72.9

a Accuracy when using 8-bit coefficients and 4-bit inputs
b Topology (for SVMs: the number of classifiers)
c Number of coefficients

was used in [2] to find the approximate coefficients $\tilde{w}_j$, $\forall j$, that minimize the average error $E[\epsilon_S]$ of the weighted sum. In addition, [2] combined the software-based coefficient approximation with hardware-based gate-level pruning [42, 44]. In gate-level pruning, gates are removed from the post-synthesis netlist of an input circuit and are replaced with a constant “0” or “1.” In [2], the gates are pruned based on their theoretical maximum error (i.e., the most significant output bit that a gate connects to through any path) and their switching frequency (i.e., gates whose output is constant most of the time are less likely to generate errors after pruning).

Table 2 reports the accuracy and hardware analysis for several exact fully parallel baseline ML classifiers. Figure 6 presents the accuracy vs. normalized area tradeoff when applying the above-described approximations. The blue cross depicts the baseline exact bespoke ML circuit, while the red star and gray “x” represent the coefficient approximation and gate-level pruning techniques when applied in isolation, respectively. The black triangles represent the circuits that apply combined software–hardware co-approximation. As shown in Fig. 6, all the approximate techniques can deliver high area reduction for a small accuracy loss. The hardware-driven coefficient approximation achieves considerable area reduction for minimal accuracy loss, while the designs that apply combined software–hardware co-approximation constitute the Pareto front for each ML circuit.

Table 3 summarizes the hardware characteristics of the approximate circuits for a 1% accuracy loss constraint. As shown in Table 3, the approximate circuits that apply both coefficient approximation and gate-level pruning achieve at least 43% and 38% area and power reduction, respectively. Most importantly, coefficient approximation combined with gate-level pruning delivers in many cases sub-30 mW operation. In other words, these designs can be operated by a Molex 30 mW printed battery. Hence, the combined software–hardware co-approximation enables, in many cases,


Fig. 6 The accuracy–normalized area tradeoff when applying both coefficient approximation and gate-level pruning. Source of the figure data is [2]. (a) RedWine MLP-C. (b) RedWine MLP-R. (c) RedWine SVM-C. (d) Cardio MLP-C. (e) Cardio MLP-R. (f) Cardio SVM-C

Table 3 Area and power evaluation of approximate MLPs/SVMs for up to 1% accuracy loss [2]

                 Coef. approx. and gate prune   Only coef. approx.         Only gate prune
ML circuit       A^a   P^b   AG^c   PG^c        A^a   P^b   AG^c   PG^c    A^a   P^b   AG^c   PG^c
Cardio MLP-C     17    54    48%    44%         20    62    40%    36%     33    97    0%     0%
Cardio MLP-R     12    37    45%    44%         16    49    27%    26%     18    56    16%    15%
Cardio SVM-C     8.7   29    43%    38%         10    33    33%    29%     14    43    8.7%   8.3%
RedWine MLP-C    8.0   27    55%    50%         9.3   30    47%    43%     18    53    0%     0%
RedWine MLP-R    3.3   12    53%    49%         6.0   21    15%    14%     4.6   17    35%    30%
RedWine SVM-C    7.6   26    68%    65%         16    50    32%    31%     15    49    35%    33%

a Area (cm²)
b Power (mW)
c Area and power gain compared to the bespoke baseline

battery-powered printed ML circuits. On the other hand, this is not always the case for the single-approximation methods, which mostly feature power consumption too high for any power supply available today for printed electronics.
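The coefficient approximation of Sect. 5.1 can be illustrated with a minimal sketch. It is not the implementation of [2]: the area model `toy_area` is a hypothetical proxy (set bits minus one, so that zero and powers of two cost nothing), and restricting each coefficient to its lowest-area neighbors before the brute-force search is an assumption made here for brevity. Only the objective, minimizing the magnitude of the average error of the weighted sum, follows Eq. (3).

```python
from itertools import product

def toy_area(w):
    """Hypothetical area proxy: set bits of |w| minus one (0 for zero or a power of two)."""
    ones = bin(abs(w)).count("1")
    return max(ones - 1, 0)

def candidates(w, d):
    """Values within distance d of w that have the lowest proxy area."""
    neighbours = range(w - d, w + d + 1)
    best = min(toy_area(v) for v in neighbours)
    return [v for v in neighbours if toy_area(v) == best]

def approximate_coefficients(weights, mean_inputs, d):
    """Brute-force search over the candidates, minimising |E[eps_S]| as in Eq. (3)."""
    best_choice, best_err = None, float("inf")
    for choice in product(*(candidates(w, d) for w in weights)):
        err = abs(sum(ex * (w - wt)
                      for ex, w, wt in zip(mean_inputs, weights, choice)))
        if err < best_err:
            best_choice, best_err = list(choice), err
    return best_choice, best_err

# Two coefficients of value 3 can each move to 2 or 4; picking one of each
# balances a positive error against a negative one.
approx, err = approximate_coefficients([3, 3, 7], [0.5, 0.5, 0.25], d=1)
print(approx, err)
```

In this toy case the search replaces the two coefficients of value 3 by 2 and 4, so that their errors cancel, which is the balancing effect described after Eq. (3).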


5.2 Approximate Decision Trees

Balaskas et al. [5] followed a similar approach to [2], targeting decision trees. Similarly to the bespoke multiplier, [5] showed that the area of a bespoke comparator is determined by its coefficient value. For example, for 6-bit and 8-bit comparators, Fig. 7 presents the area of the comparator with respect to the coefficient value. As shown in Fig. 7, a high variation is observed. Some coefficients are significantly more hardware-friendly, resulting in highly area-efficient comparator implementations. In addition, as expected, the precision of the comparison also affects the area overhead. Still, as shown in Fig. 7, the tradeoff between area, input precision, and coefficient is not straightforward, e.g., not all the 6-bit comparators are smaller than the 8-bit ones.

The authors in [5] exploited these findings to approximate decision trees and make them suitable for ultra-resource-constrained and battery-powered printed devices. Two approximation techniques are considered: (i) precision scaling for the inputs and (ii) coefficient approximation for the coefficients (i.e., replacing a coefficient with a more area-efficient approximate value). However, both precision scaling and coefficient approximation impact the classification accuracy, and their combined impact on the area is not trivial, which complicates the design problem. A non-dominated sorting genetic algorithm (NSGA-II) is used by [5] to explore the associated design space and identify, for each comparator in the decision tree, the input precision and then an area-efficient approximate value to replace the coefficient obtained from training. Overall, the approximation flow of [5] is depicted in Fig. 8. To guide the search toward more area-efficient designs, the sum of the comparators' area is used as a proxy of the area of the entire decision tree. To avoid hardware evaluations during the genetic optimization, the area values of the bespoke comparators, for varying precisions (2 bit–8 bit), are stored in a look-up table (LUT). The LUTs are created during a one-time offline process. Then, at runtime, given the configuration of each comparator (i.e., input precision and coefficient value), the respective area value is fetched from the LUT. The approximation candidates of

Fig. 7 Area variation of a bespoke comparator with respect to the coefficient value w (x-axis). Two input sizes are examined: (a) 6 bit and (b) 8 bit. Source of the figure data is [5]


Fig. 8 Co-design of approximate decision tree circuits for ultra-resource-constrained applications [5]

Table 4 Evaluation of baseline exact bespoke decision tree circuits. Source [5]

Dataset        Accuracy  #Comparators  Delay (ms)  Area (mm²)  Power (mW)
Arrhythmia     0.564     54            27.0        163         7.55
Balance        0.745     102           28.0        68          3.11
HAR            0.835     178           33.7        551         26.10
Mammographic   0.759     150           34          98.75       4.47
Seeds          0.889     10            20.3        30          1.43
Vertebral      0.850     27            20.9        58          2.68

the two approximation techniques are encapsulated in a chromosome representing each approximate decision tree. The chromosome contains 2M genes, with M being the number of comparators in the decision tree. For each comparator, two genes are stored, i.e., the input precision and a margin d that is used to select the approximate coefficient [5]. The parent population is randomly generated, and all the chromosomes are subjected to the standard NSGA-II iterative process, i.e., tournament selection, simulated binary crossover, and polynomial mutation. From the combined pool of parent and child chromosomes, the non-dominated solutions are obtained through non-dominated sorting and truncation with respect to the crowding distance. After a few generations, the genetic optimization returns an estimated accuracy–area Pareto front of approximate decision trees.

Table 4 reports the accuracy and hardware analysis of the exact fully parallel baseline decision trees. Figure 9 presents the accuracy vs. normalized area tradeoff of the approximate bespoke decision trees obtained by the co-design framework of [5]. As shown in Fig. 9, the area estimation of [5] closely follows the area of the decision trees obtained after synthesis with the EDA tool. Hence, during the genetic optimization, hardware evaluations are avoided, and still the respective

Fig. 9 The accuracy–normalized area tradeoff when applying both hardware-driven coefficient approximation and precision scaling. Source of the figure data is [5]. (a) Arrhythmia. (b) Balance. (c) HAR. (d) Mammographic. (e) Seeds. (f) Vertebral. Each panel plots accuracy against normalized area; legend entries: Estimated Pareto, Approximate, Exact Bespoke

design space is precisely explored. Moreover, it is observed that the approximate decision trees achieve very high area gains for negligible (if any) accuracy loss. Some approximate designs achieve even higher accuracy than their exact baseline counterparts. This might be explained by the fine granularity of precision scaling, which acts as a regularization measure and results in a more efficient, less redundant classifier. Table 5 summarizes the area Pareto-optimal designs of Fig. 9 that feature up to 1% accuracy loss. As shown in Table 5, the approximate decision trees deliver 4.39× and 4.53× area and power reduction, respectively, on average. Importantly, all the designs in Table 5 except HAR (13.70 mW) feature less than 3 mW power consumption and thus can be powered by only a Blue Spark printed battery. It is noteworthy that the approximate Seeds model requires less than 0.1 mW and thus can be self-powered, i.e., by an energy harvester. In addition, the area of almost all the examined models is less than 100 mm², making them well suited to printed circuits. In conclusion, as demonstrated in Fig. 9 and Table 5, co-designing approximate decision trees [5] enables sub-3 mW ML classification with considerable accuracy, suitable for most battery-powered printed applications.
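The chromosome encoding and the LUT-based area estimate described above can be sketched as follows. This is only an illustration under stated assumptions, not the framework of [5]: `comparator_area` is a hypothetical stand-in for the offline area LUT, the rescaling of an 8-bit trained coefficient to a lower precision by right-shifting is assumed, and the margin d is assumed to select the cheapest coefficient within distance d. The NSGA-II loop itself (selection, crossover, mutation) is omitted; only decoding, area estimation, and non-dominated filtering are shown.

```python
def comparator_area(precision, coeff):
    """Hypothetical stand-in for the area LUT: wider inputs and more set bits cost more."""
    return precision + bin(coeff).count("1")

def decode(chromosome, trained_coeffs):
    """Turn 2M genes (precision, margin d per comparator) into (precision, coefficient) pairs."""
    config = []
    for i, w in enumerate(trained_coeffs):
        precision, d = chromosome[2 * i], chromosome[2 * i + 1]
        w_scaled = w >> (8 - precision)          # assumed rescaling of an 8-bit coefficient
        lo = max(w_scaled - d, 0)
        hi = min(w_scaled + d, 2 ** precision - 1)
        # Assumed selection rule: cheapest coefficient within the margin.
        approx = min(range(lo, hi + 1), key=lambda c: comparator_area(precision, c))
        config.append((precision, approx))
    return config

def estimated_area(config):
    """Sum of comparator areas, used as a proxy for the area of the whole tree."""
    return sum(comparator_area(p, c) for p, c in config)

def pareto_front(points):
    """Keep (area, accuracy) points that no other point beats on both objectives."""
    front = []
    for a, acc in points:
        dominated = any(a2 <= a and acc2 >= acc and (a2, acc2) != (a, acc)
                        for a2, acc2 in points)
        if not dominated:
            front.append((a, acc))
    return front

config = decode([8, 1, 6, 0], trained_coeffs=[127, 200])
print(config, estimated_area(config))
```

Because the area comes from a table lookup, each candidate tree can be scored without running synthesis, which is the point made in the text about avoiding hardware evaluations during the genetic optimization.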


Table 5 Area and power evaluation of approximate decision trees for up to 1% accuracy loss [5]

Model          Accuracy  Area (mm²)  Norm. Area  Power (mW)  Norm. Power
Arrhythmia     0.67      22.30       0.137       1.04        0.138
Balance        0.81      27.28       0.401       1.16        0.372
HAR            0.83      294.54      0.534       13.70       0.525
Mammographic   0.81      8.06        0.082       0.38        0.084
Seeds          0.94      2.32        0.077       0.09        0.064
Vertebral      0.86      7.84        0.136       0.38        0.142
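As a quick check of the average gains quoted for Table 5, the short computation below reproduces them from the normalized columns. It assumes that the average reduction is taken as the inverse of the mean normalized value; this reading matches the reported 4.39× and 4.53×, whereas averaging the per-model reduction factors would give a larger number.

```python
# Normalized area and power of the approximate decision trees (Table 5).
norm_area = [0.137, 0.401, 0.534, 0.082, 0.077, 0.136]
norm_power = [0.138, 0.372, 0.525, 0.084, 0.064, 0.142]

def average_reduction(normalized):
    """Reduction factor implied by the mean normalized value (1 / mean)."""
    return len(normalized) / sum(normalized)

area_gain = average_reduction(norm_area)
power_gain = average_reduction(norm_power)
print(f"{area_gain:.2f}x area, {power_gain:.2f}x power")  # prints 4.39x area, 4.53x power
```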

6 Co-design for Stochastic Neural Network Circuits

While the authors in [2, 5] used approximate computing to achieve ultra-low-power inference and reduced area overheads, Weller et al. [38] leveraged the cost efficiency of the stochastic computing paradigm to build, for the first time, mixed-signal stochastic neural network accelerators. Stochastic computing (SC) is a low-cost (area and power) alternative to conventional computing [1]. In SC, signals are represented by a random bit stream. These bit sequences loosely resemble the neural spike trains of the brain [19]. The value represented by a bit stream is obtained by counting the “1’s” and dividing their number by the length of the bit stream, e.g., the sequence “0, 1, 0, 1, 0, 1, 0, 0” represents $A = 3/8$. To obtain negative numbers, bi-polar encoding is used [30] with the transformation $A' = 2 \cdot A - 1$, where $A'$ is the bi-polar representation and $A' \in [-1, 1]$. Since in SC computations are performed sequentially and bit-wise, complex operations can be expressed with simple logic gates. Most notably, in SC, multiplication can be implemented by a single XNOR gate (i.e., only 9 transistors), while addition is implemented by just a multiplexer (i.e., only 7 transistors).

SC has been around since the 1960s [1] and, despite its potential for ultra-low-power operation and extremely low area overheads, has not seen widespread adoption. This is because SC imposes high latency and low throughput, due to the long bit streams required, as well as considerable accuracy loss. Moreover, area is not a primary concern in conventional systems [38]. Nevertheless, in ultra-resource-constrained environments, e.g., printed circuits, SC appears as a promising solution to enable ML inference of complex models. Additionally, SC-based neural network circuits [23] have gained attention by exploiting the inherent error tolerance of neural networks. Especially in the case of printed electronics, SC seems a perfect fit: performance is not a primary concern, as sensor readouts occur only in the order of seconds [21] or even minutes [24], but area is of major concern, since feature sizes are in the range of micrometers. One may thus imagine that printed neural networks may become feasible.
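The encoding rules above can be tried out in a few lines of software. The sketch below is a behavioral model only: the function names are illustrative, and the bit streams come from a seeded pseudo-random generator rather than from hardware SNGs. It decodes the example sequence from the text and shows that a bit-wise XNOR of two bi-polar streams approximates the product of their values.

```python
import random

def unipolar_value(bits):
    """Fraction of ones in the stream."""
    return sum(bits) / len(bits)

def bipolar_value(bits):
    """Bi-polar value A' = 2A - 1 of the stream."""
    return 2 * unipolar_value(bits) - 1

def encode_bipolar(value, length, rng):
    """Random bit stream whose bi-polar value approximates `value` in [-1, 1]."""
    p_one = (value + 1) / 2
    return [1 if rng.random() < p_one else 0 for _ in range(length)]

def xnor_multiply(a_bits, b_bits):
    """Bit-wise XNOR, which multiplies two bi-polar stochastic numbers."""
    return [1 - (a ^ b) for a, b in zip(a_bits, b_bits)]

example = [0, 1, 0, 1, 0, 1, 0, 0]
print(unipolar_value(example), bipolar_value(example))  # prints 0.375 -0.25

rng = random.Random(0)
a = encode_bipolar(0.5, 4096, rng)
b = encode_bipolar(-0.6, 4096, rng)
product = bipolar_value(xnor_multiply(a, b))  # close to 0.5 * -0.6 = -0.3
```

The result is only approximate and its precision improves with the stream length, which reflects the latency and accuracy tradeoff noted in the text.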


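Fig. 10 Stochastic computing neuron, used in [38], for stochastic neural networks. Inputs x1, x2 and weights w1, w2 are converted to bit streams by the iSNGs and wSNGs, multiplied by XNOR gates (w1*x1, w2*x2), accumulated by a MUX into 1/2(w1*x1 + w2*x2), and passed through the ReLU activation function (AF) to produce the output Y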

6.1 Mixed-Signal Stochastic Neuron

Figure 10 presents an SC neuron. The core operation of a neuron is multiply–accumulate. Hence, substantial area and power gains can be obtained when switching to the SC domain. However, while the SC-based counterparts of multipliers (i.e., XNOR) and adders (i.e., MUX) feature ultra-low cost, the required stochastic number generators (SNGs) as well as the activation functions (AFs) still form a resource bottleneck. To reduce the hardware overheads of SNGs and AFs, the authors in [38] adopted mixed-signal computing and replaced the expensive digital SC-based SNGs and AFs with efficient analog implementations. In the mixed-signal SC neuron of [38] (Fig. 10), analog SNGs (wSNG and iSNG) are used to convert the binary numbers of the input features and weights into stochastic bit sequences. The generated bit streams of the input features and weights are multiplied through XNOR logic gates. Then, the obtained products are accumulated through a MUX. An SNG is used to generate the select signals that drive the MUXs for the addition operation. Note though that, in SC-based addition, the sum is scaled by a factor depending on the number of addends. In Fig. 10, a 2-input adder is used, and thus, the sum is scaled by $1/2$. Finally, the sum of the accumulation is fed to an analog ReLU unit that implements the non-linear AF.
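A purely digital, behavioral sketch of the multiply–accumulate path of this neuron is given below. It does not model the analog SNGs of [38]: `to_stream` is a hypothetical software substitute driven by a seeded pseudo-random generator, and the activation function is left out. The sketch shows how a MUX with a random select signal yields the scaled sum, here 1/2 (w1·x1 + w2·x2) for two inputs.

```python
import random

def to_stream(value, length, rng):
    """Software stand-in for an SNG: bi-polar stream for a value in [-1, 1]."""
    p_one = (value + 1) / 2
    return [int(rng.random() < p_one) for _ in range(length)]

def from_stream(bits):
    """Bi-polar value represented by a bit stream."""
    return 2 * sum(bits) / len(bits) - 1

def sc_neuron(x, w, length=1024, seed=0):
    """Scaled weighted sum computed bit by bit with XNOR gates and a MUX."""
    rng = random.Random(seed)
    products = []
    for xi, wi in zip(x, w):
        x_bits = to_stream(xi, length, rng)   # role of the iSNG
        w_bits = to_stream(wi, length, rng)   # role of the wSNG
        products.append([1 - (a ^ b) for a, b in zip(x_bits, w_bits)])  # XNOR
    # MUX: in every cycle a random select signal forwards one product bit.
    select = [rng.randrange(len(products)) for _ in range(length)]
    summed = [products[s][t] for t, s in enumerate(select)]
    return from_stream(summed)

# Expected value: 1/2 * (0.8 * 0.5 + (-0.4) * 0.5) = 0.1, up to stochastic noise.
y = sc_neuron([0.8, -0.4], [0.5, 0.5], length=8192, seed=1)
```

With a uniformly random select signal, each product contributes with probability 1/2, which is where the scaling factor of the SC adder comes from.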

6.2 Analog Stochastic SNG

The analog SNG proposed by Weller et al. [38] is illustrated in Fig. 11. The weights SNG (wSNG) comprises a ring oscillator (RINGO) that is used to generate the oscillating signal. That signal is then applied to the enable transistor of the tuned true random number generator (TTRNG). A bi-stable back-to-back inverter is used to implement the TTRNG. The meta-stability of the TTRNG is caused by random noise (e.g., thermal noise [17]). Whenever the TTRNG is enabled, its output voltage

Fig. 11 Schematic of the analog SNG used in [38]. The weights SNG (wSNG) consists of a RINGO and a TTRNG, while for the inputs SNG (iSNG) an additional transistor is required (blue)

“OUT” is either logic “0” or “1” depending on the ratio of its pull-up resistors but also on the induced random noise. Note that the ratio of the pull-up resistors of the TTRNG corresponds to the neuron's weight obtained after training. Hence, when using the oscillating enable signal generated by the RINGO to drive the TTRNG, a random bit sequence is produced at the “OUT” of the TTRNG. In [38], in order to obtain the desired pull-up resistor ratio, which initially is biased by process variations, the TTRNG is tuned, in a post-fabrication step, by printing additional layers onto the pull-up resistor [17]. The layers behave as multiple resistors connected in parallel, and thus every additional layer reduces the overall resistance [17]. As a result, the probability of producing “0” and “1” is adjusted to implement any stochastic bit stream representing the trained weights. Note that this is a unique capability of printed electronics, and it is not feasible in silicon technology due to its subtractive processes. Finally, by adding one more transistor to the pull-up network (colored in blue in Fig. 11), an SNG (iSNG) for converting the input features (X) into stochastic bit streams is obtained. Whereas the wSNG is one-time programmable, the iSNG is controlled by the input voltage of the additional transistor.
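The effect of printing additional layers follows from the standard formula for resistors in parallel. As a worked illustration, assume (as a simplification not stated in [38]) that all $n$ printed layers have the same resistance $R_{\text{layer}}$:

```latex
R_{\text{eq}} = \left( \sum_{i=1}^{n} \frac{1}{R_i} \right)^{-1},
\qquad
R_i = R_{\text{layer}} \;\; \forall i
\;\Rightarrow\;
R_{\text{eq}} = \frac{R_{\text{layer}}}{n}
```

Under this assumption, a second layer halves the pull-up resistance and a third brings it to one third, so tuning can only lower the resistance, never raise it.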

6.3 Analog Stochastic Activation Function

The analog stochastic AF of [38] is depicted in Fig. 12. The AF receives a stochastic number (random bit stream) as input. A capacitor is used for analog integration of the voltage pulses of the input stochastic bit sequences. The enable signal (EN) controls the operation state of the AF. If the enable signal (EN) is on (logic “1”), then the input (IN) is applied to the charging stage. The capacitor is connected to $V_{dd}$ and charged when the incoming bits are “1,” while it is connected to $V_{SS}$ and discharged when the incoming bits are “0.” When the entire input bit sequence is processed, the

Fig. 12 Schematic of the analog activation function used in [38] that resembles a bi-polar ReLU. The schematic comprises an input stage, a charger, a discharger, a capacitor, and an output stage

charge of the capacitor reflects the number of “1’s” in the bit stream. Next, the enable signal is set to “0,” and the output stage is activated. In addition, the capacitor is disconnected from the charging stage, thus keeping its voltage. If the capacitor's voltage level is above a specific threshold, then the input signal propagates to the output (OUT). If not, then the output is pulled down to logic “0” (permanently, for the entire sequence). This behavior efficiently resembles the functionality of the rectified linear unit (ReLU) AF. However, due to the bi-polar SC-based encoding, a sequence of only logic “0’s” represents the value $-1$. Hence, the AF of Fig. 12 [38] returns $-1$ when its input is negative.
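The behavior of this activation function can be mimicked with a small behavioral model. The sketch below abstracts away all analog effects and assumes a threshold at half charge, i.e., a bi-polar input value of zero; the actual threshold of the circuit in [38] is only described as "a specific threshold", so this value is an assumption.

```python
def bipolar_value(bits):
    """Bi-polar value represented by a bit stream."""
    return 2 * sum(bits) / len(bits) - 1

def analog_relu(bits, threshold=0.5):
    """Integrate the stream, then either pass it through or force all zeros."""
    charge = sum(bits) / len(bits)   # capacitor charge as a fraction of full charge
    if charge > threshold:           # assumed threshold: bi-polar value above 0
        return list(bits)            # the input propagates to OUT
    return [0] * len(bits)           # OUT is pulled to "0" for the whole sequence

positive = [1, 1, 1, 0]              # bi-polar value 0.5
negative = [0, 1, 0, 0]              # bi-polar value -0.5
print(bipolar_value(analog_relu(positive)))  # prints 0.5
print(bipolar_value(analog_relu(negative)))  # prints -1.0
```

The second call shows why a negative input maps to -1 rather than 0: an all-zero stream is the bi-polar encoding of -1.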

6.4 Hardware-Driven Training

As explained above, in the SC-based neuron of Fig. 10, the computed sum is scaled based on the number of synapses, while the employed ReLU (Fig. 12) cannot generate a zero value for negative inputs. Moreover, since a bi-polar encoding is used for the stochastic numbers, the weights and input features are in the range of $[-1, 1]$. To adapt the neural network to these particular irregular characteristics of mixed-signal stochastic inference, Weller et al. [38] implemented a hardware-driven neural network training. First, in order to encompass the stochastic addition in training, instead of the traditional weighted sum, the network is trained with

$$y = \frac{1}{2^{\log_2 M}} \sum_{1 \le j \le M} x_j, \tag{4}$$


where $x_j$ are the $M$ inputs of the adder and $y$ is the sum. Next, the “bi-polar” ReLU is used in training:

$$\mathrm{AF}(y) = \begin{cases} y, & \text{if } y > 0 \\ -1, & \text{otherwise.} \end{cases} \tag{5}$$

Finally, weights are ensured to belong in the $[-1, 1]$ segment by applying clipping after each weight update:

$$\mathrm{clip}(w) = \begin{cases} 1, & \text{if } w > 1 \\ -1, & \text{if } w < -1 \\ w, & \text{otherwise.} \end{cases} \tag{6}$$

Note that the weight update can be implemented using any existing optimizer. For example, the traditional gradient descent would still use the negative loss gradient with respect to the weights.
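The three training modifications of Eqs. (4)–(6) can be written down as small helper functions. This is a minimal sketch in plain Python rather than the training code of [38]: the function names and the plain gradient-descent step are illustrative, and the scaling factor is taken literally from Eq. (4), without any rounding of the exponent for fan-ins that are not a power of two.

```python
import math

def sc_sum(products):
    """Scaled sum of Eq. (4): the adder output is divided by 2**log2(M)."""
    m = len(products)
    return sum(products) / (2 ** math.log2(m))

def bipolar_relu(y):
    """Activation of Eq. (5): negative (and zero) inputs map to -1, not 0."""
    return y if y > 0 else -1.0

def clip(w):
    """Weight clipping of Eq. (6), keeping weights in [-1, 1]."""
    return max(-1.0, min(1.0, w))

def sgd_step(weights, grads, lr=0.1):
    """One illustrative gradient-descent update followed by clipping."""
    return [clip(w - lr * g) for w, g in zip(weights, grads)]

# Forward pass of a 2-input neuron with weights and inputs in [-1, 1].
x, w = [0.8, -0.4], [0.5, 0.5]
y = bipolar_relu(sc_sum([xi * wi for xi, wi in zip(x, w)]))
new_w = sgd_step([0.95, -0.5], grads=[-1.0, 2.0])
```

Any optimizer can take the place of `sgd_step`, as noted above; the only hardware-specific requirement is that clipping is applied after every update.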

6.5 Mixed-Signal Stochastic Inference

Figure 13 presents the hardware and accuracy evaluation of several mixed-signal stochastic neural network accelerators that use the architecture of [38] (neuron of Fig. 10) and follow the hardware-driven training approach. The following datasets from the UCI ML repository [16] are considered: Mammographic (MA), Vertebral Col. 2C (V2), Breast Cancer (BC), Seeds (SE), Balance Scale (BS), and Vertebral Col. 3C (V3). For each dataset, a fully connected neural network is trained using the Adam optimizer for 200 epochs with full-batch updates. The mean squared error is used as the loss function. The topology of each neural network is fixed: #inputs × 3 × #outputs. Training and testing are performed with a random 67%/33% split. The length of the random stochastic bit stream used is 1024. As shown in Fig. 13, all the neural network accelerators feature an area of less than 2.5 cm² and a power of less than 11 mW, being thus suitable for printed applications. Moreover, in many cases, the accuracy loss is well constrained, i.e., the achieved accuracy is within 10% of the accuracy of the conventional digital floating-point model. However, in some cases, the accuracy loss exceeds 20%.


Fig. 13 Area (a), power (b), and accuracy (c) evaluation of several mixed-signal stochastic neural network accelerators [38]

7 Conclusion

In this chapter, we used printed electronics as an extreme testbed for embedded ML applications. We demonstrated that a systematic software–hardware co-design, enhanced with non-conventional computing approaches that may further boost the efficiency of our computing systems, enables bringing intelligence even to such ultra-resource-constrained environments. Printed electronics form a rapidly growing market with unique and unprecedented features, compared to silicon VLSI, but also extraordinary limitations. The state of the art has shown a clear path toward enabling sub-milliwatt ML inference. Nevertheless, despite the remarkable advancements recently reported, there is still room for significant improvement and optimization.

Acknowledgments This work is partially supported by the German Research Foundation (DFG) through the project “ACCROSS: Approximate Computing aCROss the System Stack” HE 2343/161 and by a grant from the Excellence Initiative of Karlsruhe Institute of Technology under the Future Field program “SoftNeuro.”

References

1. Alaghi, A., Hayes, J.P.: Survey of stochastic computing. ACM Trans. Embed. Comput. Syst. 12(2s) (2013). http://dx.doi.org/10.1145/2465787.2465794
2. Armeniakos, G., Zervakis, G., Soudris, D., Tahoori, M.B., Henkel, J.: Cross-layer approximation for printed machine learning circuits. In: Design, Automation & Test in Europe Conference & Exhibition (DATE) (2022)


3. Ayub, M.K., Hasan, O., Shafique, M.: Statistical error analysis for low power approximate adders. In: Design Automation Conference (DAC), pp. 1–6 (2017)
4. Balaskas, K., Zervakis, G., Amrouch, H., Henkel, J., Siozios, K.: Automated design approximation to overcome circuit aging. IEEE Trans. Circ. Syst. I Reg. Pap. 68(11), 4710–4721 (2021)
5. Balaskas, K., Zervakis, G., Siozios, K., Tahoori, M.B., Henkel, J.: Approximate decision trees for machine learning classification on tiny printed circuits. In: International Symposium on Quality Electronic Design (ISQED) (2022)
6. Bhardwaj, K., Mane, P.S., Henkel, J.: Power- and area-efficient approximate Wallace tree multiplier for error-resilient systems. In: Fifteenth International Symposium on Quality Electronic Design, pp. 263–269 (2014)
7. Biggs, J., Myers, J., Kufel, J., Özer, E., Craske, S., Sou, A., Ramsdale, C., Williamson, K., Price, R., White, S.: A natively flexible 32-bit Arm microprocessor. Nature 595, 532–536 (2021)
8. Bleier, N., Mubarik, M., Rasheed, F., Aghassi-Hagmann, J., Tahoori, M.B., Kumar, R.: Printed microprocessors. In: Annu. Int. Symp. Computer Architecture (ISCA), pp. 213–226 (2020)
9. Chang, J.S., Facchetti, A.F., Reuss, R.: A circuits and systems perspective of organic/printed electronics: Review, challenges, and contemporary and emerging design approaches. IEEE J. Emerg. Sel. Top. Circ. Syst. 7(1), 7–26 (2017). https://doi.org/10.1109/JETCAS.2017.2673863
10. Cherupalli, H., Duwe, H., Ye, W., Kumar, R., Sartori, J.: Bespoke processors for applications with ultra-low area and power constraints. In: Annu. Int. Symp. Computer Architecture (ISCA), pp. 41–54 (2017)
11. Choi, J., Venkataramani, S.: Approximate computing techniques for deep neural networks. In: Reda, S., Shafique, M. (eds.) Approximate Circuits: Methodologies and CAD, Springer International Publishing, Cham, pp. 307–329 (2019)
12. Chung, S., Kim, S.O., Kwon, S.K., Lee, C., Hong, Y.: All-inkjet-printed organic thin-film transistor inverter on flexible plastic substrate. IEEE Electron Dev. Lett. 32(8), 1134–1136 (2011)
13. Conti, S., Pimpolari, L., Calabrese, G., Worsley, R., Majee, S., Polyushkin, D.K., Paur, M., Pace, S., Keum, D.H., Fabbri, F., et al.: Low-voltage 2D materials-based printed field-effect transistors for integrated digital and analog electronics on paper. Nature Communications 11(1), 1–9 (2020)
14. Cui, Z.: Printed Electronics: Materials, Technologies and Applications. Wiley (2016)
15. Douthwaite, M., García-Redondo, F., Georgiou, P., Das, S.: A time-domain current-mode MAC engine for analogue neural networks in flexible electronics. In: IEEE Biomedical Circuits and Systems Conference (BioCAS), pp. 1–4 (2019)
16. Dua, D., Graff, C.: UCI machine learning repository (2017)
17. Erozan, A.T., Wang, G.Y., Bishnoi, R., Aghassi-Hagmann, J., Tahoori, M.B.: A compact low-voltage true random number generator based on inkjet printing technology. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 28(6), 1485–1495 (2020)
18. Hanif, M.A., Khalid, F., Shafique, M.: CANN: Curable approximations for high-performance deep neural network accelerators. In: Design Automation Conference (DAC) (2019)
19. Hayes, J.P.: Introduction to stochastic computing and its challenges. In: Design Automation Conference (DAC) (2015)
20. Huber, B., Popp, P., Kaiser, M., Ruediger, A., Schindler, C.: Fully inkjet printed flexible resistive memory. Appl. Phys. Lett. 110(14), 143503 (2017)
21. Kim, J., Jeerapan, I., Imani, S., Cho, T.N., Bandodkar, A., Cinti, S., Mercier, P.P., Wang, J.: Noninvasive alcohol monitoring using a wearable tattoo-based iontophoretic-biosensing system. ACS Sensors 1(8), 1011–1019 (2016)
22. Kondo, M., Uemura, T., Akiyama, M., Namba, N., Sugiyama, M., Noda, Y., Araki, T., Yoshimoto, S., Sekitani, T.: Design of ultraflexible organic differential amplifier circuits for wearable sensor technologies. In: 2018 IEEE International Conference on Microelectronic Test Structures (ICMTS), pp. 79–84. IEEE (2018)


23. Liu, Y., Liu, S., Wang, Y., Lombardi, F., Han, J.: A survey of stochastic computing neural networks for machine learning applications. IEEE Trans. Neural Networks Learn. Syst. 32(7), 2809–2824 (2021)
24. Mostafalu, P., Lenk, W., Dokmeci, M.R., Ziaie, B., Khademhosseini, A., Sonkusale, S.R.: Wireless flexible smart bandage for continuous monitoring of wound oxygenation. IEEE Trans. Biomed. Circ. Syst. 9(5), 670–677 (2015)
25. Mrazek, V., Hanif, M.A., Vasicek, Z., Sekanina, L., Shafique, M.: autoAX: An automatic design space exploration and circuit building methodology utilizing libraries of approximate components. In: Design Automation Conference (2019a)
26. Mrazek, V., Vasicek, Z., Sekanina, L., Hanif, M.A., Shafique, M.: ALWANN: Automatic layer-wise approximation of deep neural network accelerators without retraining. In: International Conference on Computer-Aided Design (ICCAD) (2019b)
27. Mubarik, M.H., Weller, D.D., Bleier, N., Tomei, M., Aghassi-Hagmann, J., Tahoori, M.B., Kumar, R.: Printed machine learning classifiers. In: Annu. Int. Symp. Microarchitecture (MICRO), pp. 73–87 (2020)
28. Ozer, E., Kufel, J., Biggs, J., Brown, G., Myers, J., Rana, A., Sou, A., Ramsdale, C.: Bespoke machine learning processor development framework on flexible substrates. In: 2019 IEEE International Conference on Flexible and Printable Sensors and Systems (FLEPS), pp. 1–3 (2019)
29. Prakash, S., Callahan, T., Bushagour, J., Banbury, C.R., Green, A.V., Warden, P., Ansell, T., Reddi, V.J.: CFU playground: Full-stack open-source framework for tiny machine learning (TinyML) acceleration on FPGAs (2022). CoRR abs/2201.01863. https://arxiv.org/abs/2201.01863
30. Ren, A., Li, Z., Ding, C., Qiu, Q., Wang, Y., Li, J., Qian, X., Yuan, B.: SC-DCNN: Highly-scalable deep convolutional neural network using stochastic computing. In: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 405–418 (2017)
31. Shafique, M., Ahmad, W., Hafiz, R., Henkel, J.: A low latency generic accuracy configurable adder. In: Design Automation Conference (DAC), pp. 1–6 (2015)
32. Shafique, M., Theocharides, T., Reddi, V.J., Murmann, B.: TinyML: Current progress, research challenges, and future roadmap. In: Design Automation Conference (DAC), pp. 1303–1306 (2021). https://doi.org/10.1109/DAC18074.2021.9586232
33. Shao, F., Wan, Q.: Recent progress on jet printing of oxide-based thin film transistors. J. Phys. D Appl. Phys. 52(14), 143002 (2019)
34. Spantidi, O., Zervakis, G., Anagnostopoulos, I., Amrouch, H., Henkel, J.: Positive/negative approximate multipliers for DNN accelerators. In: International Conference on Computer-Aided Design (ICCAD), pp. 1–9 (2021)
35. Tasoulas, Z.G., Zervakis, G., Anagnostopoulos, I., Amrouch, H., Henkel, J.: Weight-oriented approximation for energy-efficient neural network inference accelerators. IEEE Trans. Circ. Syst. I Reg. Pap. 67, 4670–4683 (2020)
36. Weller, D., Marques, G.C., Aghassi-Hagmann, J., Tahoori, M.B.: An inkjet-printed low-voltage latch based on inorganic electrolyte-gated transistors. IEEE Electron Dev. Lett. 39(6), 831–834 (2018)
37. Weller, D.D., Hefenbrock, M., Tahoori, M.B., Aghassi-Hagmann, J., Beigl, M.: Programmable neuromorphic circuit based on printed electrolyte-gated transistors. In: 2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 446–451 (2020). https://doi.org/10.1109/ASP-DAC47756.2020.9045211
38. Weller, D.D., Bleier, N., Hefenbrock, M., Aghassi-Hagmann, J., Beigl, M., Kumar, R., Tahoori, M.B.: Printed stochastic computing neural networks. In: Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 914–919 (2021)
39. Özer, E., Kufel, J., Myers, J., Biggs, J., Brown, G., Rana, A., Sou, A., Ramsdale, C., White, S.: A hardwired machine learning processing engine fabricated with submicron metal-oxide thin-film transistors on a flexible substrate. Nature Electronics 3, 1–7 (2020)



Cross-Layer Optimizations for Efficient Deep Learning Inference at the Edge

Muhammad Abdullah Hanif and Muhammad Shafique

1 Introduction

Deep neural networks (DNNs) have emerged as a promising set of techniques for solving complex AI problems. However, progress in the deep learning (DL) domain has shown that, in most cases, there is a direct relation between the accuracy and the complexity of DNNs. State-of-the-art results, specifically for complex AI problems, are usually achieved through deeper and highly compute-intensive models. This can be observed in Fig. 1, which shows the accuracy and the number of parameters of a few of the most promising DNNs designed for image classification. This resource-hungry nature of state-of-the-art DNNs limits their reliable deployment in resource-constrained embedded scenarios. Therefore, to enable the use of high-accuracy DNNs in resource-constrained embedded systems, these networks have to be optimized to fit the given embedded device and offer resource-efficient inference, which is especially important for battery-operated mobile devices deployed at the edge of the computing stack.

Fig. 1 Characteristics of different DNNs trained for ImageNet classification: top-1 accuracy [%] and number of parameters (in millions) for AlexNet, VGG-16, ResNet-50, EfficientNet, and CoAtNet-7. Data sources: [4, 25, 32–34]

Several techniques have been proposed to reduce the memory and computational requirements of DNNs. At the software level, these techniques mainly include pruning, quantization, and hardware-aware neural architecture search (NAS). At the hardware level, they include dataflow optimizations, specialized hardware accelerator design, and hardware approximations. Although most efforts have focused on demonstrating the effectiveness of each technique individually, there have been some efforts toward developing hardware–software codesign approaches and system-level optimizations. Figure 2 presents a brief overview of the available techniques that can be combined to develop cross-layer approaches that drastically reduce the energy consumption, as well as the memory footprint, of DNNs to enable their efficient deployment on resource-constrained devices.

Fig. 2 Chapter overview: software-level optimizations (Section 3), hardware-level optimizations (Section 3), cross-layer optimizations (Section 4), and end-to-end system-level optimizations (Section 5)

Toward this, this chapter presents an overview of the state-of-the-art techniques for optimizing DNNs. It highlights the main concepts and the foundational works along with some of the most recent works in the domain. After presenting an overview of the individual techniques, the chapter presents a cross-layer approach for significantly reducing the energy consumption of DNN systems. Toward the end, the chapter also covers some end-to-end system-level approximation techniques. Figure 2 presents an overview of the chapter and also highlights the connections between the different sections.

M. A. Hanif · M. Shafique
New York University Abu Dhabi, Abu Dhabi, UAE
e-mail: [email protected]; [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
S. Pasricha, M. Shafique (eds.), Embedded Machine Learning for Cyber-Physical, IoT, and Edge Computing, https://doi.org/10.1007/978-3-031-39932-9_9


2 Preliminaries

A neural network (NN) is composed mainly of neurons arranged in layers, where the neuron is the basic building block of the network. The type of neuron commonly used in DNNs performs a simple dot-product operation between weights and activations and then passes the result through an activation function. The functionality of the neuron can be represented mathematically as

$$O = f\left(\sum_{i=1}^{n} w_i \ast a_i + b\right), \tag{1}$$

where $O$ represents the output, $a_i$ represents the $i$th activation (input), $w_i$ represents the $i$th weight, $b$ represents the bias, $f(\cdot)$ represents the activation function, which is usually a non-linear function (e.g., ReLU or tanh) that adds non-linearity to the network, and $n$ represents the total number of inputs/weights, as shown in Fig. 3a.

Figure 3b shows the structure of a multi-layer perceptron (MLP), where each neuron in one layer is connected to all the neurons in the next layer and the previous layer. It is the most widely known DNN structure, and such networks are also referred to as fully connected neural networks (FCNNs). Another widely used and studied type of DNN is the convolutional neural network (CNN). Figure 3c shows the basic structure of a convolutional layer, and Fig. 3d shows how convolutional layers and fully connected layers are joined together to form DNNs for image processing applications. Note that various types of convolutional layers and DNN structures have been proposed in the literature to achieve ultra-high accuracy. Details related to different variants of CNNs can be found in [23].

Apart from FCNNs and CNNs, recurrent neural networks (RNNs) and graph neural networks (GNNs) have also been proposed. RNNs are mainly used to process time-series data and are therefore widely used in healthcare applications and language processing, while GNNs are used to process graph data such as social networks, digital circuits, and transportation networks. More details related to these types of DNNs can be found in [39] and [41]. Note that, as most of the optimization and approximation techniques have been evaluated for CNNs designed for image classification, this chapter mainly focuses on examples and evaluations related to the same.
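The neuron model of Eq. (1) and the fully connected layer of Fig. 3b can be made concrete with a minimal sketch. The function names and the numeric values below are illustrative assumptions and do not come from the chapter; ReLU is used as the default activation only because the text names it as a common choice.

```python
import math


def relu(x):
    """Rectified linear unit, one common choice for f(.)."""
    return max(0.0, x)


def neuron(weights, activations, bias, f=relu):
    """Eq. (1): O = f(sum_i w_i * a_i + b)."""
    if len(weights) != len(activations):
        raise ValueError("weights and activations must have the same length")
    weighted_sum = sum(w * a for w, a in zip(weights, activations))
    return f(weighted_sum + bias)


def dense_layer(weight_rows, activations, biases, f=relu):
    """One fully connected layer: every neuron sees all input activations."""
    return [neuron(w, activations, b, f) for w, b in zip(weight_rows, biases)]


# Illustrative values only (not taken from the chapter).
single = neuron([0.5, -1.0, 2.0], [1.0, 2.0, 3.0], bias=0.5)
layer = dense_layer([[1.0, 0.0], [0.0, -1.0]], [2.0, 3.0], [0.0, 0.0])
squashed = neuron([1.0], [0.0], bias=0.0, f=math.tanh)
```

Stacking several calls to `dense_layer`, each consuming the output of the previous one, gives the MLP structure described above.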

3 DNN Optimization Techniques

This section presents an overview of different optimization techniques that can be employed to reduce the size and complexity of DNNs. It also covers different approximation techniques that can enable highly efficient DNN inference without affecting the application-level accuracy of DNNs.

Fig. 3 (a) Basic structure of a neuron. (b) An example of a fully connected neural network. (c) Detailed illustration of a convolutional layer. (d) Bird's-eye view of the AlexNet architecture (feature extractor: CONV 1 to CONV 5; classifier: FC 6 to FC 8)

3.1 Pruning

Pruning refers to removing non-essential/ineffectual parameters from a DNN to reduce its size and complexity. Various types of pruning techniques have been proposed in the literature, which differ in their granularity and nature. For example, pruning can be applied at the element level, where each individual weight/parameter is checked against a user-defined criterion and pruned if it is identified as non-essential (or relatively less important). Apart from removing individual parameters, groups of parameters, such as filters, filter kernels, and layers, can also be removed to reduce the complexity of DNNs. Usually, element-level pruning is referred to as fine-grained pruning, and all other types, where groups of parameters are removed, fall under the category of coarse-grained pruning. The following subsections highlight prominent techniques in each category.

3.1.1 Fine-Grained Pruning

The number of connections in a DNN, specifically in a fully connected neural network, defines the total number of parameters in the network. As stated earlier in the chapter, high accuracy comes at the cost of high network complexity; therefore, removing non-essential parameters can play a vital role in reducing the DNN inference cost. The most convenient and most rewarding technique in terms of reducing the total number of parameters in a DNN is element-level/connection pruning, where each individual parameter is checked against a criterion and removed if it falls in the ineffectual category. Figure 4b illustrates an example of element-level/connection pruning, where connections having weights below a predefined threshold are removed.

Fig. 4 (a) An example FCNN. (b) An example of connection pruning. (c) An example of neuron pruning

Earlier works in this direction include the optimal brain damage [26] and optimal brain surgeon [17] methods. More recently, Han et al. [14] proposed a three-step method to reduce the size and computational complexity of DNNs. First, they train the network to learn which connections are important. Then, they remove the unimportant connections. Finally, they fine-tune the weights of the remaining connections. They also showed that learning the right connections is an iterative process; therefore, pruning followed by fine-tuning should be repeated multiple times to achieve higher compression ratios. Deep compression [13] employed a similar method for pruning and combined it with weight sharing and Huffman coding to reduce the memory footprint of AlexNet by 35×.

As DNNs are composed of a number of layers and each layer can have a different number of parameters, different policies can be defined to remove weights from a given DNN. PruNet [27] compared multiple such policies, mainly class distribution, class uniform, and class blind, and showed that specific policies can offer better results than others for a given DNN.
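Threshold-based connection pruning followed by fine-tuning can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any cited work: the weights are a flat list of illustrative values, the criterion is the weight magnitude, and the gradients passed to `masked_update` are hypothetical placeholders for what a real training step would compute.

```python
def prune_by_magnitude(weights, threshold):
    """Return (pruned_weights, mask); mask[i] is 0 where a connection is removed."""
    mask = [1 if abs(w) >= threshold else 0 for w in weights]
    pruned = [w if m else 0.0 for w, m in zip(weights, mask)]
    return pruned, mask


def masked_update(weights, grads, mask, lr):
    """One fine-tuning step that keeps pruned connections at zero."""
    return [w - lr * g if m else 0.0 for w, g, m in zip(weights, grads, mask)]


def sparsity(mask):
    """Fraction of connections that were removed."""
    return 1.0 - sum(mask) / len(mask)


# Illustrative weights; connections with magnitude below 1.0 are removed.
weights = [-2.4, 5.3, 1.2, -0.2, 0.6, 0.1, 4.6, 3.2, 0.9]
pruned, mask = prune_by_magnitude(weights, threshold=1.0)

# Hypothetical gradients, used only to show that the mask survives fine-tuning.
updated = masked_update(pruned, [1.0] * len(pruned), mask, lr=0.1)
```

Repeating the prune and fine-tune steps with a gradually stricter threshold corresponds to the iterative scheme described above.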

3.1.2 Coarse-Grained Pruning

Although fine-grained pruning can significantly reduce the memory footprint of DNNs, it does not necessarily translate to energy or latency improvements in all cases. There are two main reasons: (1) the network parameters are stored in a compressed format and, therefore, have to be decompressed before the corresponding operations; and (2) the underlying hardware architecture is not specifically designed to take advantage of the fine-grained sparsity in the DNN. Therefore, specialized hardware accelerators such as EIE [12] and SCNN [30] have been designed to process compressed-sparse DNNs that are generated through fine-grained pruning.

Scalpel [40] showed that it is important to align the pruning method with the underlying hardware to achieve efficiency gains, and that a mismatch between the pruned network structure and the organization and architecture of the processing units in the hardware can lead to high overheads. Based on this observation, Scalpel proposed two different pruning techniques, namely SIMD-aware weight pruning and node pruning. SIMD-aware weight pruning maintains weights in aligned fixed-size groups to fully utilize the SIMD units, and it is mainly useful for low-parallelism hardware (e.g., micro-controllers). Node pruning removes redundant nodes (e.g., see Fig. 4c), thereby reducing computation and memory footprint without sacrificing the dense matrix format, and it is mainly useful for high-parallelism hardware (e.g., GPUs).

Similar to node pruning, several other techniques have been proposed that fall under the category of structured pruning. Anwar et al. [1] argued that conventional pruning techniques such as [14] result in irregular network connections that not only demand extra representation effort but also result in underutilization of parallel-processing engines. To address this limitation, they proposed the concept of structured sparsity at various scales for CNNs, i.e., channel-wise, kernel-wise, and intra-kernel strided sparsity. Figure 5b illustrates an example of filter pruning, and Fig. 5c highlights the main difference between different types of structured pruning for convolutional layers. The significance of structured pruning can be seen from the fact that most recent works on automatic methodologies for producing compressed models for resource-constrained embedded devices mainly employ structured pruning; for example, see the AMC [18] and APQ [38] frameworks.

Apart from the above-mentioned techniques, pattern-based pruning [29] and layer pruning [6] techniques have also been proposed. Pattern-based pruning falls at the boundary of structured and fine-grained pruning, as it allows each kernel of a filter to be pruned differently based on a pre-defined set of templates. The key advantage of pattern-based pruning is that it enables more compression and efficiency than filter and channel pruning without any significant degradation in the application-level accuracy of the DNNs. Layer pruning techniques add another dimension to the compression techniques, as they enable designers to reduce the depth (the number of layers) of DNNs as well. The authors of [6] argue that layer pruning can offer higher latency benefits compared to other types of pruning.
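The coupling between two consecutive layers in filter pruning (Fig. 5b) can be illustrated with a small sketch: removing a filter from one layer removes one output feature map, so the corresponding channel of every filter in the next layer must be removed as well. The data layout (nested lists of flattened kernels), the function names, and the use of the L1-norm as the saliency measure are assumptions made for illustration.

```python
def l1_norm(filt):
    """L1-norm of one filter, given as a list of flattened kernels."""
    return sum(abs(w) for kernel in filt for w in kernel)


def prune_filters(layer1, layer2, num_to_prune):
    """Drop the lowest-norm filters of layer1 and the matching channels of layer2.

    layer1: N filters, each with M kernels (one per input channel).
    layer2: P filters, each with N kernels (one per layer1 output map).
    """
    norms = [l1_norm(f) for f in layer1]
    order = sorted(range(len(layer1)), key=lambda i: norms[i])
    removed = set(order[:num_to_prune])
    keep = [i for i in range(len(layer1)) if i not in removed]
    new_layer1 = [layer1[i] for i in keep]
    new_layer2 = [[filt[i] for i in keep] for filt in layer2]
    return new_layer1, new_layer2, keep


# Illustrative toy layers: N = 3 filters, M = 1 input channel, P = 2 filters.
layer1 = [[[1.0, -2.0]], [[0.1, 0.1]], [[-3.0, 0.5]]]
layer2 = [[[1.0], [2.0], [3.0]], [[4.0], [5.0], [6.0]]]
new_layer1, new_layer2, keep = prune_filters(layer1, layer2, num_to_prune=1)
```

Both pruned layers stay dense, which is why this form of pruning maps well onto conventional parallel hardware.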

Fig. 5 (a) Processing flow of two consecutive convolutional layers. (b) An example of filter (or channel) pruning. (c) Different types of structured pruning: filter-wise, channel-wise, and shape-wise pruning

3.2 Quantization

Quantization refers to the process of mapping values from a large set to a smaller set of quantization levels. The number of quantization levels defines the number of bits required to represent the data. Therefore, using a reduced-precision format (having fewer quantization levels) for DNN data structures can have a profound impact on the storage requirements and inference cost of DNNs. The main goal of DNN quantization is to reduce the storage requirements as well as the inference cost of DNNs without affecting their application-level accuracy, in order to enable reliable and resource-efficient inference at the edge.

Toward this, multiple DNN quantization techniques have been proposed. The simplest and most commonly used technique is 8-bit range linear quantization, where floating-point weights and activations are converted into an 8-bit fixed-point format to reduce the storage requirements as well as the complexity of the hardware modules. Range linear quantization has two types: (1) asymmetric quantization and (2) symmetric quantization. In asymmetric quantization, the minimum and maximum values observed in the float range are mapped to the minimum and maximum possible values, respectively, in the integer/fixed-point range (see Fig. 6c). In symmetric quantization, by contrast, the float range is computed using the maximum absolute value observed, which ensures zero alignment (see Fig. 6b).
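The two variants of 8-bit range linear quantization can be sketched as follows. This is one common formulation, written under stated assumptions: the symmetric case maps $[-b, b]$ to $[-127, 127]$ with an implicit zero-point of 0, the asymmetric case maps $[a, b]$ to $[-128, 127]$ with an integer zero-point, the helper names are illustrative, and the input is assumed to contain at least two distinct values.

```python
def clamp(v, lo, hi):
    return max(lo, min(hi, v))


def symmetric_scale(values):
    """Scale for mapping [-max|x|, +max|x|] onto [-127, 127]."""
    return max(abs(v) for v in values) / 127.0


def asymmetric_params(values):
    """Scale and zero-point for mapping [min(x), max(x)] onto [-128, 127]."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255.0
    zero_point = round(-128 - lo / scale)
    return scale, zero_point


def quantize(values, scale, zero_point=0, qmin=-128, qmax=127):
    return [clamp(round(v / scale) + zero_point, qmin, qmax) for v in values]


def dequantize(q_values, scale, zero_point=0):
    return [(q - zero_point) * scale for q in q_values]


# Illustrative tensor values.
values = [-2.4, 5.3, 1.2, -0.2, 0.6, 0.1, 0.9, 3.2, 4.6]

s_sym = symmetric_scale(values)
q_sym = quantize(values, s_sym, qmin=-127, qmax=127)

s_asym, zp = asymmetric_params(values)
q_asym = quantize(values, s_asym, zp)
```

Because the observed range here is not centered on zero, the asymmetric mapping uses all 256 levels, whereas the symmetric mapping leaves part of the negative range unused in exchange for a zero-point of exactly 0.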

Fig. 6 (a) Difference between uniform and non-uniform quantization. (b) 8-bit symmetric quantization. (c) 8-bit asymmetric quantization

Apart from linear quantization techniques, which have a uniform interval size between quantization levels, non-linear (or non-uniform) quantization techniques have also been explored in the literature [13, 21]. Such techniques are mainly inspired by the long-tailed nature of DNN data structures [21]. Figure 6a illustrates the difference between uniform and non-uniform quantization schemes. Works such as XNOR-Net [31], Binarized Neural Network (BNN) [20], and DoReFa-Net [43] have even explored the potential of aggressive quantization, down to binary weights and activations. However, aggressive quantization leads to significant accuracy degradation. Therefore, mixed-precision quantization techniques have also been explored, in which different precision levels can be used for different layers, filters, and/or channels in the network [7, 37].
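To give a feel for the most aggressive end of this spectrum, the sketch below binarizes a weight vector to two levels, $\{-\alpha, +\alpha\}$. The choice of $\alpha$ as the mean absolute weight is an assumption made for illustration (a scaling of this kind is associated with XNOR-Net-style binarization, but the cited works differ in their details), and the function names and values are hypothetical.

```python
def binarize(weights):
    """Return (alpha, signs) so that each weight is approximated by alpha * sign."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return alpha, signs


def binary_dot(alpha, signs, activations):
    """Dot product using only sign flips, additions, and one final scaling."""
    return alpha * sum(s * a for s, a in zip(signs, activations))


# Illustrative values only.
weights = [0.5, -1.5, 1.0, -1.0]
activations = [1.0, 2.0, 3.0, 4.0]

alpha, signs = binarize(weights)
exact = sum(w * a for w, a in zip(weights, activations))
approx = binary_dot(alpha, signs, activations)
```

Each weight now needs a single bit of storage plus one shared scaling factor, and the gap between `exact` and `approx` shows the kind of error that motivates mixed-precision schemes.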

3.3 Knowledge Distillation

In machine learning, knowledge distillation refers to the process of transferring knowledge from a larger model to a smaller, less complex one [2, 19]. As large DNNs have higher knowledge capacity and generalize well compared to smaller ones, it is relatively easy to train large models for high accuracy. Moreover, an ensemble of models [5], where multiple models trained on the same data are combined, is also commonly used to achieve high accuracy. However, large DNNs, and ensembles of models in particular, lead to high inference cost and therefore cannot be deployed in resource-constrained scenarios. Knowledge distillation is a technique that enables designers to first train large models and then transfer their knowledge to smaller, less complex models that can offer resource-efficient inference while meeting the user-defined accuracy constraints. Figure 7 shows a high-level view of the knowledge distillation process.

Fig. 7 An overview of the knowledge distillation process. Knowledge from a larger model is transferred to a smaller, less complex model that can be deployed in resource-constrained scenarios to offer resource-efficient inference

In general, knowledge is transferred from the teacher model (larger model) to the student model (smaller model) by training the student model on a transfer set and using soft targets that are generated using the teacher model. For a classification model, the soft targets are generated using the following equation:

$$q_i = \frac{\exp(z_i/T)}{\sum_{j} \exp(z_j/T)}. \tag{2}$$

Here, $z_i$ refers to the $i$th output of the teacher model before it is passed through the softmax function, and $q_i$ is the corresponding probability generated by the softmax function. $T$ is the temperature, which is usually set to 1; however, for generating soft targets, higher values (i.e., greater than 1) are used. The core idea of this process is to train the student model to mimic the teacher model to offer competitive or even superior accuracy. There are multiple types of knowledge distillation, such as response-based distillation and feature-based distillation. Details related to all the types can be found in [11].
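Equation (2) can be sketched directly in code. The function `soft_targets` below implements the temperature-scaled softmax; subtracting the maximum before exponentiation is a standard numerical-stability step that does not change the result. The `distillation_loss` function is an assumption for illustration: it uses the cross-entropy between teacher and student soft targets, which is one common choice rather than a loss prescribed by the chapter, and the logits are made-up values.

```python
import math


def soft_targets(logits, temperature=1.0):
    """Eq. (2): q_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # for numerical stability only
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature):
    """Cross-entropy between teacher and student soft targets (assumed loss)."""
    q = soft_targets(teacher_logits, temperature)
    p = soft_targets(student_logits, temperature)
    return -sum(qi * math.log(pi) for qi, pi in zip(q, p))


# Illustrative teacher logits.
teacher_logits = [6.0, 2.0, -1.0]
hard = soft_targets(teacher_logits, temperature=1.0)
soft = soft_targets(teacher_logits, temperature=4.0)
```

With $T = 4$ the distribution in `soft` is flatter than in `hard`, so the relative scores of the non-winning classes, which carry part of the teacher's knowledge, are more visible to the student.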

3.4 Neural Architecture Search

Conventionally, neural networks have been designed manually by human experts, which is a highly time-consuming process. Due to this and the growing need for high-accuracy models for edge-based applications, there is increasing interest in automated neural architecture search (NAS) methods. Figure 8 shows the general flow adopted by NAS algorithms. First, a search space is defined by specifying a base architecture and the possible types of modules and connections that can be deployed in the architecture. Then, the search strategy defines how to explore the search space. The search strategy faces the exploration/exploitation dilemma: on the one hand, it should lead to an effective solution quickly, while on the other hand, it should not converge to a region of sub-optimal architectures. The evaluation strategy refers to the process of evaluating the accuracy and other desirable metrics of candidate solutions. These performance numbers are then fed back to the search strategy to direct the candidate search toward more effective architectures.

Fig. 8 Overview of the neural architecture search (NAS) process
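The three components of Fig. 8 can be sketched with the simplest possible search strategy, random search. Everything in this example is a hypothetical stand-in: the search space, the parameter names, and especially `evaluate`, which replaces training and validating a candidate with a made-up proxy that rewards capacity and penalizes cost. Random search is pure exploration and is used here only as a baseline, not as the strategy of any cited work.

```python
import random

# Search space: the allowed values of each architectural choice (assumed).
SEARCH_SPACE = {
    "depth": [2, 4, 8],
    "width": [16, 32, 64],
    "kernel": [3, 5],
}


def sample_architecture(rng):
    """Search strategy: draw one candidate uniformly at random."""
    return {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}


def evaluate(arch):
    """Evaluation strategy: hypothetical score trading accuracy against cost."""
    capacity = arch["depth"] * arch["width"]
    accuracy_proxy = 1.0 - 1.0 / capacity ** 0.5
    cost = capacity * arch["kernel"] ** 2
    return accuracy_proxy - 1e-4 * cost


def random_search(num_candidates, seed=0):
    rng = random.Random(seed)
    best_arch, best_score, history = None, float("-inf"), []
    for _ in range(num_candidates):
        arch = sample_architecture(rng)
        score = evaluate(arch)
        history.append((arch, score))
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score, history


best_arch, best_score, history = random_search(num_candidates=20)
```

More elaborate strategies, such as evolutionary or gradient-based methods, differ mainly in how `sample_architecture` uses the scores in `history` to propose the next candidate.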

3.5 Hardware Approximations

DNNs fall under the category of embarrassingly parallel workloads and, therefore, can significantly benefit from hardware-level parallelism. Several hardware accelerators have been proposed to enable energy-efficient DNN inference [3, 22]. These accelerators are mainly composed of massively parallel neural arrays that perform a large number of multiply–accumulate (MAC) operations in parallel. Apart from neural arrays, these accelerators employ a dedicated memory hierarchy to minimize costly DRAM accesses. These optimizations offer efficiency gains only to a certain extent, and in some cases, they are insufficient to meet the design constraints. Approximate computing is a computing paradigm that can push the boundary of efficiency gains further by trading accuracy for efficiency. This accuracy–efficiency trade-off is usually achieved either by functional approximation of arithmetic modules (through logic simplification, as shown in Fig. 9) or by voltage underscaling.

Works such as AxNN [36] argue that DNNs are mostly used in applications where less-than-perfect results are acceptable; therefore, approximations can be employed to achieve high efficiency gains at the cost of a minor accuracy loss. Toward this, AxNN [36] proposed to selectively approximate less important neurons and, moreover, to employ approximation-aware training to contain the accuracy loss. Along similar lines, ALWANN [28] proposed a methodology for approximating the multiplier units in DNN hardware without involving approximation-aware training. Apart from functional approximation of arithmetic modules, techniques based on voltage underscaling, such as ThunderVolt [42] and MATIC [24], can also be employed to increase the efficiency of DNN inference.

Fig. 9 (a) An example of functional approximation, where a 2×2 multiplier is simplified to reduce the area and energy cost; the approximate design differs from the accurate one only for the input 11 × 11, where it outputs 0111 (7) instead of 1001 (9). (b) A generic flow for approximating datapaths using a library of approximate components
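The truth tables in Fig. 9a indicate that the approximate 2×2 multiplier differs from the accurate one for a single input pair: 3 × 3 yields 7 instead of 9. A behavioral sketch of this approximation makes its error profile easy to quantify. The function names are illustrative, and composing a 4×4 multiplier from 2×2 blocks is an assumption added here to show how the error propagates into a larger datapath; it is not a design described in the chapter.

```python
def exact_mul2x2(a, b):
    return a * b


def approx_mul2x2(a, b):
    """Behavioral model of the approximate 2x2 multiplier: 3 x 3 -> 7."""
    if not (0 <= a <= 3 and 0 <= b <= 3):
        raise ValueError("operands must be 2-bit values (0..3)")
    return 7 if (a == 3 and b == 3) else a * b


def error_stats(mul):
    """Error rate, mean absolute error, and maximum error over all 16 inputs."""
    errors = [abs(a * b - mul(a, b)) for a in range(4) for b in range(4)]
    error_rate = sum(1 for e in errors if e) / len(errors)
    mean_error = sum(errors) / len(errors)
    return error_rate, mean_error, max(errors)


def mul4x4(a, b, mul2x2):
    """4x4 multiplier assembled from four 2x2 partial products (assumed design)."""
    a_hi, a_lo = a >> 2, a & 3
    b_hi, b_lo = b >> 2, b & 3
    return ((mul2x2(a_hi, b_hi) << 4)
            + ((mul2x2(a_hi, b_lo) + mul2x2(a_lo, b_hi)) << 2)
            + mul2x2(a_lo, b_lo))


error_rate, mean_error, max_error = error_stats(approx_mul2x2)
largest_output = max(approx_mul2x2(a, b) for a in range(4) for b in range(4))
```

Since the largest output of the approximate block is 7, its result fits in three bits, which is where the logic simplification comes from; the price is an error in 1 of the 16 input combinations.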

4 Cross-Layer Optimization

This section presents a cross-layer approach for combining multiple types of optimization/approximation techniques from different layers of the computing stack to significantly reduce the resource requirements of DNNs during inference.

4.1 Methodology

As highlighted in Sect. 3, various software-level and hardware-level approximation/optimization techniques have been proposed to enable DNN deployment in resource-constrained edge devices. At the software level, pruning and quantization are the most widely used techniques, and at the hardware level, the use of specialized hardware accelerators together with functional approximation of arithmetic modules is considered to be highly effective. These software and hardware techniques can be combined in a synergistic manner to achieve better efficiency gains. Figure 10 presents a cross-layer methodology that combines pruning and quantization techniques with hardware-level approximations [15]. The methodology consists of the following steps:

• Pruning: DNNs are usually highly over-parameterized. Therefore, removing ineffectual weights from the network has proven to be a fairly effective technique for reducing the complexity of DNNs. Based on this, the cross-layer methodology shown in Fig. 10 employs pruning as the first step (i.e., Step 1). An iterative pruning technique is usually used, in which each pruning step is followed by partial retraining to regain the accuracy lost due to pruning. The weights to be pruned in each iteration are selected based on their importance, which is usually estimated using their L1-norm or L2-norm. The iterations are performed until the accuracy of the network drops below the user-defined accuracy constraint. Then, the DNN from the second-to-last iteration is forwarded to the next step for further optimization.
• Quantization: The precision of DNN weights and biases has a significant impact on the overall memory requirements of DNNs. Moreover, the precision of DNN data structures also impacts the complexity of the computational modules required at the hardware level. Therefore, to represent the weights and activations using a low-precision format and thereby reduce the overall DNN inference cost, the cross-layer methodology shown in Fig. 10 employs DNN quantization as the second step (i.e., Step 2). Similar to iterative pruning, DNN quantization can also be coupled with retraining to achieve high compression without much accuracy loss. Moreover, pruning and quantization can be combined in a single framework [35]. However, such methods require sophisticated optimization algorithms to effectively explore the combined design space.
• Hardware Approximations: Specialized accelerators are employed at the hardware level to reduce the energy consumption of DNN inference in real-world systems. These accelerators can be equipped with approximate modules (e.g., approximate memory units, approximate adders, and approximate multipliers) to further boost the efficiency by trading a small amount of application-level accuracy for a significant efficiency gain. Toward this, Step 3 of the cross-layer methodology explores the potential of hardware-level approximations. This step performs a design space exploration to find appropriate approximate modules that offer high efficiency while meeting the user-defined quality constraints.

Fig. 10 A cross-layer optimization flow for DNNs [15]
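The stopping rule of the pruning step (iterate until the accuracy constraint is violated, then keep the model from the second-to-last iteration) can be sketched as follows. This is a minimal illustration under stated assumptions: the model is a flat list of weights, importance is the weight magnitude, `fine_tune` is an optional hypothetical callback that must keep pruned weights at zero, and `toy_accuracy` is a made-up proxy standing in for validation accuracy.

```python
def prune_smallest(weights, num_pruned):
    """Zero out the num_pruned weights with the smallest magnitude."""
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:num_pruned]:
        pruned[i] = 0.0
    return pruned


def iterative_pruning(weights, evaluate, accuracy_constraint,
                      prune_per_iter=1, fine_tune=None):
    """Prune a little more each iteration; return the last model that met the constraint."""
    accepted = list(weights)
    num_pruned = 0
    while num_pruned + prune_per_iter <= len(weights):
        num_pruned += prune_per_iter
        candidate = prune_smallest(accepted, num_pruned)
        if fine_tune is not None:
            candidate = fine_tune(candidate)  # partial retraining (hypothetical)
        if evaluate(candidate) < accuracy_constraint:
            break  # constraint violated: keep the previous iteration
        accepted = candidate
    return accepted


# Illustrative weights and a stand-in accuracy proxy (NOT a real accuracy metric).
original = [-2.4, 5.3, 1.2, -0.2, 0.6, 0.1, 4.6, 3.2, 0.9]
total_magnitude = sum(abs(w) for w in original)


def toy_accuracy(weights):
    return sum(abs(w) for w in weights) / total_magnitude


result = iterative_pruning(original, toy_accuracy, accuracy_constraint=0.9)
```

In a real flow, `evaluate` would run the network on a validation set, and the returned model would be handed to the quantization step.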

Fig. 11 A structured pruning approach [15]. Inputs: pre-trained DNN, training and validation datasets, user-defined cost function (C), and user-defined accuracy constraint (AC)

4.2 Structured Pruning

This section presents an iterative structured pruning technique to highlight the effectiveness of the pruning step in the cross-layer methodology (i.e., Step 1 in Fig. 10). The key concept behind this methodology is to iteratively select one layer at a time, prune some ineffectual filters/neurons from it, and then, after retraining, recompute the saliency of all the remaining filters/neurons before repeating the steps again. Figure 11 presents the overall flow of the pruning methodology, and the main steps of the flow are explained as follows:

1. Given a pre-trained DNN, the first step is to compute the saliency of each filter/neuron in the DNN using a suitable saliency estimation method, e.g., using the L1-norm of the filters/neurons.
2. The next step is to select the most appropriate layer for pruning. For that, for each layer of the input DNN, the methodology creates a copy of the DNN and removes


M. A. Hanif and M. Shafique

Fig. 12 Results of structured pruning when applied to the LeNet5 network trained on the Cifar10 dataset [15]. (a) Impact of structured pruning on accuracy (test accuracy [%] versus model size reduction [%]). (b) Impact of quantization on the accuracy of the model variants having different compression ratios (test accuracy [%] versus bit width); an annotation marks the critical point after which the accuracy starts decreasing rapidly regardless of the pruning ratio. The model variants are marked in (a)

x% of the least significant filters/neurons from the layer while keeping all the rest of the layers intact.
3. The methodology then computes the accuracy and compression ratio of each model using a subset of the validation dataset and registers the accuracy and compression numbers in $\theta$.
4. A user-defined cost function $C$ is then used to compute the cost of each pruned DNN (i.e., each DNN copy). The models are then sorted based on their costs, and the one that has the least cost is selected, while all the rest are discarded.
5. The selected model is then fine-tuned for y epochs.
6. Then, the accuracy of the model is estimated using a subset of the validation dataset and compared against the user-defined accuracy constraint ($A_c$).
7. If the accuracy is greater than $A_c$, the pre-trained model is replaced with the pruned model and the complete process is repeated. Once the accuracy is below $A_c$, the output model from the previous iteration is forwarded as the final output of the methodology.
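The flow above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation from [15]: a model is represented as a plain dictionary from layer name to a list of filters, and `evaluate`, `fine_tune`, and `cost` are hypothetical callables supplied by the user (in practice they would wrap a training framework). The compression ratio passed to `cost` is assumed to be the fraction of parameters removed in the current iteration.

```python
from copy import deepcopy


def l1_saliency(filt):
    """Saliency of one filter/neuron, estimated as the L1-norm of its weights."""
    return sum(abs(w) for w in filt)


def num_params(model):
    """Total number of weights in a model given as {layer: [filter, ...]}."""
    return sum(len(f) for filters in model.values() for f in filters)


def prune_layer(model, layer, x_percent):
    """Return a copy of `model` with the x% least salient filters of `layer` removed."""
    pruned = deepcopy(model)
    filters = pruned[layer]
    # Assumption: remove at least one filter, but never empty a layer completely.
    n_remove = min(len(filters) - 1, max(1, round(len(filters) * x_percent / 100)))
    ranked = sorted(range(len(filters)), key=lambda i: l1_saliency(filters[i]))
    drop = set(ranked[:max(0, n_remove)])
    pruned[layer] = [f for i, f in enumerate(filters) if i not in drop]
    return pruned


def iterative_prune(model, evaluate, fine_tune, cost, x_percent, y_epochs, a_c):
    """Prune one layer per iteration until the accuracy constraint a_c is violated."""
    best = model
    while True:
        theta = []  # (cost, candidate) for every per-layer pruned copy
        for layer in best:
            cand = prune_layer(best, layer, x_percent)
            if num_params(cand) == num_params(best):
                continue  # nothing left to prune in this layer
            compression = 1.0 - num_params(cand) / num_params(best)
            theta.append((cost(evaluate(cand), compression), cand))
        if not theta:
            return best
        selected = min(theta, key=lambda entry: entry[0])[1]
        selected = fine_tune(selected, y_epochs)
        if evaluate(selected) > a_c:
            best = selected   # accept and repeat
        else:
            return best       # output the model from the previous iteration


# Toy usage: accuracy is a made-up function of the parameter count.
toy = {"conv1": [[1.0, -1.0], [0.1, 0.1], [2.0, 2.0], [0.5, -0.5]],
       "fc": [[3.0], [0.2], [1.0], [0.05]]}
toy_eval = lambda m: 40.0 + 5.0 * num_params(m)
toy_cost = lambda acc, comp: 100.0 - (acc + 4.0 * comp)
result = iterative_prune(toy, toy_eval, lambda m, epochs: m, toy_cost,
                         x_percent=25, y_epochs=2, a_c=75.0)
print(num_params(toy), "->", num_params(result))
```

Passing the evaluation, fine-tuning, and cost steps as callables keeps the loop itself independent of any particular framework or dataset.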

The above flow can be employed for pruning any dense DNN. Figures 12a and 13a show the results when the above pruning method is employed for the LeNet5 and the VGG11 networks, both trained on the Cifar10 dataset. For these experiments, the cost function was defined to be $C = 100 - \left(\text{Accuracy} + 4 \cdot P_i / \sum_{j \in \{\text{all layers}\}} P_j\right)$, where Accuracy is the estimated accuracy of the DNN after pruning the ith layer and $P_i$ is the number of parameters in the ith layer. Moreover, x was defined to be 20, and y was defined to be 2. The results presented in Figs. 12a and 13a clearly show that the structured pruning methodology can significantly reduce the number of parameters as well as the computational requirements of DNNs without a significant loss in accuracy. The figures also show that the sensitivity of DNNs to pruning varies from network to network, and at high compression ratios, DNNs become more sensitive to further pruning. Note that intermediate fine-tuning, i.e., $y > 0$, is one of the key factors for achieving a high compression ratio.
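As a worked illustration of the cost function, the sketch below evaluates $C$ for two candidate layers. The layer names, parameter counts, and accuracies are hypothetical, and $P_i$ is assumed to be counted before pruning, which the text does not state explicitly.

```python
def pruning_cost(accuracy, layer_params, pruned_layer):
    """C = 100 - (Accuracy + 4 * P_i / sum_j P_j) for the candidate that pruned layer i."""
    total = sum(layer_params.values())
    return 100.0 - (accuracy + 4.0 * layer_params[pruned_layer] / total)


# Hypothetical two-layer network: parameter counts per layer.
layer_params = {"conv1": 500, "fc1": 4500}

# Hypothetical accuracies measured after pruning each layer in turn.
cost_conv1 = pruning_cost(70.0, layer_params, "conv1")  # about 29.6
cost_fc1 = pruning_cost(69.0, layer_params, "fc1")      # about 27.4

print(min(("conv1", cost_conv1), ("fc1", cost_fc1), key=lambda item: item[1])[0])
```

In this example the candidate that pruned the larger layer is selected even though its accuracy is one point lower, because the second term of $C$ rewards pruning layers that hold a larger share of the parameters.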

Fig. 13 Results of structured pruning when applied to the VGG11 network trained on the Cifar10 dataset [15]. (a) Impact of structured pruning on accuracy (test accuracy [%] versus model size reduction [%]). (b) Impact of quantization on the accuracy of the model variants having different compression ratios (test accuracy [%] versus bit width); an annotation marks the point after which the accuracy starts decreasing rapidly regardless of the pruning ratio of the DNN. The model variants are marked in (a)

4.3 Quantization

This section presents an analysis of a post-training quantization approach to highlight the significance of DNN quantization (i.e., Step 2 in Fig. 10) for reducing the DNN inference cost. For this section, it is assumed that all the DNN data structures, i.e., weights and activations of all the layers, are quantized to the same bit width. To quantize the weights of a layer, the following equations are employed:

$$\hat{W}_i = \operatorname{round}(W_i \times W_{scale}) \qquad (3)$$

$$W_{scale} = 2^{\left\lfloor \log_2 \left( \frac{2^{n-1} - 1}{\max(\operatorname{abs}(W))} \right) \right\rfloor}$$

Here $W$ represents the set of all the weights, $W_i$ represents the ith element in $W$, $\hat{W}$ represents the set of quantized weights, $W_{scale}$ is the scale factor, and $n$ is the bit width. To quantize the activations at a point in the network, first, the activations are profiled using a set of input samples, and then, the scale factor is defined using the following equation:

$$A_{scale} = 2^{\left\lfloor \log_2 \left( \frac{2^{n-1} - 1}{\max(\operatorname{abs}(A))} \right) \right\rfloor}$$

Here $A$ is the set of all the logged activations from the input of the lth layer and $A_{scale}$ is the scale factor. During run time, the activations are scaled using the following equation:

$$\hat{A}_i = \operatorname{round}(A_i \times A_{scale}) \qquad (4)$$

where $\hat{A}$ represents the quantized activations. Note that $W_{scale}$ and $A_{scale}$ are intentionally defined to be powers of 2 to simplify the intermediate operations.

Figure 12b shows the impact of quantization on the application-level accuracy of three different variants of the LeNet5 model having different pruning ratios. The variants are marked in Fig. 12a using the corresponding labels. As can be observed from Fig. 12b, all the variants are almost equally sensitive to the quantization errors, with more pruned variants being slightly more sensitive. Moreover, the accuracy of each network stays close to the baseline until a point, i.e., until 6-bit quantization, and after that the accuracy starts dropping sharply. The same trend is observed for the VGG11 network trained on the Cifar10 dataset (see Fig. 13). From this analysis, it can be concluded that higher pruning levels are usually more beneficial than post-training quantization for achieving a better quality–efficiency trade-off.
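The power-of-2 scale factor and the rounding step in Eqs. (3) and (4) can be illustrated with a short standard-library sketch. The function names and the sample weights are assumptions made for illustration; the same two functions apply to profiled activations. Python's built-in `round` rounds halves to even, which may differ from the rounding mode used in [15].

```python
import math


def pow2_scale(values, n_bits):
    """Largest power-of-2 scale that keeps max|v| * scale within 2**(n_bits-1) - 1."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return 1.0  # assumption: any scale works for an all-zero tensor
    return 2.0 ** math.floor(math.log2((2 ** (n_bits - 1) - 1) / max_abs))


def quantize(values, scale):
    """Scale and round each value to an integer (Eq. 3 for weights, Eq. 4 for activations)."""
    return [round(v * scale) for v in values]


def dequantize(q_values, scale):
    """Map the integers back to real values to inspect the quantization error."""
    return [q / scale for q in q_values]


weights = [0.5, -0.25, 0.1, -0.9]      # hypothetical layer weights
scale8 = pow2_scale(weights, 8)        # floor(log2(127 / 0.9)) = 7, so the scale is 128
q8 = quantize(weights, scale8)         # [64, -32, 13, -115]
scale4 = pow2_scale(weights, 4)        # floor(log2(7 / 0.9)) = 2, so the scale is 4
q4 = quantize(weights, scale4)         # [2, -1, 0, -4]

print(scale8, q8, dequantize(q8, scale8))
print(scale4, q4, dequantize(q4, scale4))
```

Because the scale is a power of 2, rescaling between layers reduces to bit shifts in hardware, which is the simplification the text refers to.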

4.4 Hardware-Level Approximations: Impact of Self-Healing and Non-Self-Healing Approximate Designs on DNN Accuracy

This section presents the impact of using approximate arithmetic modules in DNN accelerators (i.e., Step 3 in Fig. 10) on the application-level accuracy of DNNs and the overall DNN inference cost. For the analysis in this section, modules designed using conventional as well as self-healing methods are considered. Figure 14 presents the key distinction between the conventional and self-healing designs. Figure 14a shows the conventional method for approximating hardware accelerators, where each module in the system is replaced with an appropriate approximate module from a library of approximate circuits. In this method, although the selection can be guided through a sophisticated methodology, the selection process does not specifically focus on error masking or compensation to achieve better accuracy–efficiency trade-offs.

Contrary to the conventional methods, Figs. 14b and c show two different methods for introducing self-healing approximations in a system. The concept is applicable to systems that can be divided into approximation and healing stages. It is mainly proposed for dot-product operations, which involve multiplications followed by accumulation: approximations can be applied in the multiplication part, and the accumulation part can be viewed as the healing stage [9, 10]. The approximations are applied such that some of the modules generate positive errors, while others generate negative errors, and these errors then cancel each other in the accumulation stage to offer a better accuracy–efficiency trade-off.

The dot-product operation is the most common operation in DNN inference. As multiplication is one of its most costly operations, approximations are deployed in the multiplier units for this analysis. Moreover, conventional as well as self-healing approximate multipliers are considered to study the effectiveness of functional

Fig. 14 A comparison of conventional and self-healing approaches [15]. (a) Conventional approximation. (b) Self-healing using complementary approximate modules in the approximation stage. (c) Self-healing using complementary approximate components inside the modules in the approximation stage

approximations in arithmetic modules. Figure 15a shows the baseline 8×8 multiplier design employed in this work. The multiplier is constructed using 2×2 multipliers. The design of the accurate 2×2 multiplier is shown in Fig. 16a. For approximations, the 2×2 multiplier designs shown in Fig. 16b–d are employed, where the designs in Figs. 16b and d approximate $3 \times 3$ to 7 and 5, respectively (i.e., negative error), and the design in Fig. 16c approximates $3 \times 3$ to 11 (i.e., positive error). The 8×8 multiplier configurations used in this analysis are illustrated in Fig. 15b–j, and their performance and error characteristics are listed in Table 1. Note that, for this analysis, it is assumed that the same multiplier design is used for all the multipliers in the DNN hardware accelerator, i.e., the design is assumed to be homogeneous. The approximate multiplier configurations that are composed of accurate 2×2 multipliers and the 2×2 multipliers that generate only negative errors represent the conventional multipliers, i.e., configurations in Fig. 15b–f. The configurations that generate both positive and negative errors represent the self-healing designs, i.e., configurations in Fig. 15g–j. To evaluate the impact of hardware-level approximations of arithmetic modules on the accuracy of DNNs, functional models of these approximate multipliers were integrated in a PyTorch-based simulation framework. Figure 17 shows the results obtained when different approximate multiplier configurations (shown in Fig. 15) are used for the LeNet5 network trained on the Cifar10 dataset. Note that, for this analysis, different variants of the same LeNet5 having different pruning ratios are

Fig. 15 Types of 8×8 approximate multipliers considered for simulations [15]. (a) An 8-bit multiplier design based on the Baugh-Wooley algorithm realized using 2×2 multipliers. (b) Config. 1. (c) Config. 2. (d) Config. 3. (e) Config. 4. (f) Config. 5. (g) Config. 6. (h) Config. 7. (i) Config. 8. (j) Config. 9. Legend: a_i is the ith bit of operand A; b_j is the jth bit of operand B; pp_ij is the partial product of a_i and b_j; P_{O−1} is the MSB of the product; each M is a 2×2 multiplier of a given type

considered. The network variants are marked in Fig. 12a with the corresponding labels. As can be observed from Fig. 17, the variants having higher pruning ratios are slightly more sensitive to hardware-level approximations. Similar results are observed for the VGG11 network (see Fig. 18).
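The difference between conventional and self-healing configurations can be illustrated with a small functional model. The sketch below is a simplification under stated assumptions: it builds an unsigned 8×8 multiplier from sixteen 2×2 multipliers (the design in Fig. 15a is a signed Baugh-Wooley multiplier), and the two configurations are hypothetical choices, not necessarily any of Config. 1 to Config. 9.

```python
from itertools import product

EXACT = 9  # exact result of 3 x 3


def mul2x2(a, b, out_3x3=EXACT):
    """2x2 multiplier that is exact except for 3 x 3, which returns `out_3x3`."""
    return out_3x3 if a == 3 and b == 3 else a * b


def mul8x8(a, b, config):
    """Unsigned 8x8 product built from sixteen 2x2 multipliers.

    `config` maps (i, j) to the value returned for 3 x 3 by the multiplier that
    combines 2-bit digit i of `a` with 2-bit digit j of `b`; missing keys are exact.
    """
    total = 0
    for i, j in product(range(4), repeat=2):
        da = (a >> (2 * i)) & 0b11
        db = (b >> (2 * j)) & 0b11
        total += mul2x2(da, db, config.get((i, j), EXACT)) << (2 * (i + j))
    return total


def error_stats(config):
    """Return (mean error, mean error distance, MSE) over all 8-bit operand pairs."""
    errors = [mul8x8(a, b, config) - a * b for a in range(256) for b in range(256)]
    n = len(errors)
    return (sum(errors) / n,
            sum(abs(e) for e in errors) / n,
            sum(e * e for e in errors) / n)


# Two 2x2 multipliers of equal weight (2**2) are approximated in each configuration.
conventional = {(0, 1): 7, (1, 0): 7}    # both generate negative errors
self_healing = {(0, 1): 7, (1, 0): 11}   # one negative, one positive error

conv_stats = error_stats(conventional)
heal_stats = error_stats(self_healing)
print("conventional (mean, MED, MSE):", conv_stats)
print("self-healing (mean, MED, MSE):", heal_stats)
```

With the same number of approximate 2×2 multipliers, the complementary errors of the self-healing configuration cancel on average: its mean error is 0 and its MSE is 7.5, against a mean error of −1 and an MSE of 8.5 for the conventional one. The self-healing figures (mean error 0, MED 0.9375, MSE 7.5) coincide after rounding with the Ax. 6 row of Table 1, although that correspondence is an observation about this sketch and is not confirmed by the source.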

Fig. 16 2×2 multiplier designs used for building 8×8 approximate multipliers [15]. (a) Accurate 2×2 multiplier. (b) Approximate 2×2 multiplier having 3×3→7. (c) Approximate 2×2 multiplier having 3×3→11. (d) Approximate 2×2 multiplier having 3×3→5. (e)–(h) Truth tables of the designs in (a)–(d), respectively. For 2-bit operands A and B, all four truth tables give the exact 4-bit product for every operand pair except A = B = 11, where the outputs are 1001 in (e), 0111 in (f), 1011 in (g), and 0101 in (h)

Table 1 Error and performance characteristics of the multiplier configurations presented in Fig. 15 [15]. The hardware results are generated for 65 nm technology using the Cadence Genus synthesis tool with the TSMC 65 nm library

| Metric | Accurate | Ax. 1 | Ax. 2 | Ax. 3 | Ax. 4 | Ax. 5 | Ax. 6 | Ax. 7 | Ax. 8 | Ax. 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| MSE | 0 | 0.25 | 9.75 | 266.25 | 3102.30 | 24,806.00 | 7.50 | 78.00 | 2128.00 | 2547.00 |
| MED | 0 | 0.13 | 1.13 | 7.13 | 23.13 | 55.13 | 0.94 | 3.38 | 19.94 | 21.90 |
| Mean error | 0 | −0.13 | −1.13 | −7.13 | −23.13 | −55.13 | 0.00 | 0.00 | −0.25 | −0.13 |
| Area [cell area] | 753 | 716 | 696 | 616 | 609 | 571 | 726 | 727 | 672 | 670 |
| Power [μW] | 46.04 | 44.98 | 44.92 | 40.81 | 40.98 | 38.96 | 45.49 | 45.05 | 43.48 | 42.94 |
| Delay [ns] | 1.92 | 1.86 | 1.73 | 1.73 | 1.73 | 1.73 | 1.95 | 1.87 | 1.73 | 1.77 |
| PDP [fJ] | 88.40 | 83.66 | 77.71 | 70.60 | 70.90 | 67.40 | 88.71 | 84.24 | 75.22 | 76.00 |
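The power-delay product (PDP) row of Table 1 follows directly from the power and delay rows, since 1 μW × 1 ns = 1 fJ. The short check below recomputes it from the tabulated values; the dictionary layout and variable names are illustrative choices.

```python
configs = ["Accurate", "Ax. 1", "Ax. 2", "Ax. 3", "Ax. 4",
           "Ax. 5", "Ax. 6", "Ax. 7", "Ax. 8", "Ax. 9"]
power_uw = [46.04, 44.98, 44.92, 40.81, 40.98, 38.96, 45.49, 45.05, 43.48, 42.94]
delay_ns = [1.92, 1.86, 1.73, 1.73, 1.73, 1.73, 1.95, 1.87, 1.73, 1.77]
pdp_reported_fj = [88.40, 83.66, 77.71, 70.60, 70.90, 67.40, 88.71, 84.24, 75.22, 76.00]

# PDP [fJ] = power [uW] * delay [ns]
pdp_fj = {name: p * d for name, p, d in zip(configs, power_uw, delay_ns)}

for name, reported in zip(configs, pdp_reported_fj):
    saving = 100.0 * (1.0 - pdp_fj[name] / pdp_fj["Accurate"])
    print(f"{name:9s} PDP = {pdp_fj[name]:6.2f} fJ (reported {reported:.2f}), "
          f"saving vs. accurate = {saving:5.1f}%")
```

The recomputed values agree with the reported PDP row to within rounding, and the last column of the printout expresses each configuration's energy-per-operation saving relative to the accurate multiplier.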

Fig. 17 Impact of using approximate multipliers on the accuracy of different pruned variants of the LeNet5 network [15]. The plot shows accuracy [%] for each multiplier configuration, from Accurate and the non-self-healing configurations (Ax. 1 to Ax. 5) to the self-healing configurations (Ax. 6 to Ax. 9). The considered variants are marked in Fig. 12a

Fig. 18 Impact of using approximate multipliers on the accuracy of different pruned variants of the VGG11 network [15]. The plot shows accuracy [%] for each multiplier configuration, from Accurate and the non-self-healing configurations (Ax. 1 to Ax. 5) to the self-healing configurations (Ax. 6 to Ax. 9); an annotation notes that aggressive approximation leads to significant accuracy loss. The considered variants are marked in Fig. 13a


5 End-to-End System-Level Approximations

The emerging trend toward edge-based applications (e.g., autonomous driving, surveillance, smart healthcare, etc.) has resulted in a huge demand for ultra-efficient DNN-based systems that can offer high performance under tight resource constraints. Various optimization techniques have been proposed in the literature to address these conflicting goals (see Sect. 3). However, most of these techniques target only the computational part of the system and hence achieve limited efficiency gains. Computing systems, specifically the ones that are installed in edge devices, are composed of multiple different subsystems, e.g., sensing, memory, processing, and communication subsystems. Therefore, joint optimization across different subsystems can offer better returns compared to single-subsystem optimizations. Toward this, various studies have shown the potential of systematic joint optimization of subsystems to achieve ultra-high efficiency gains.

In [8], Ghosh et al. proposed a methodology for optimizing a DNN-based inference system. They considered a smart-camera system that executes a CNN-based image classification application to demonstrate how different subsystems can be approximated in a systematic manner to achieve significant energy benefits. The considered smart-camera system was composed of sensor, memory, compute, and communication subsystems. For approximating the sensor part, they employed image sub-sampling to reduce the size of the input to the DNN. For approximating the memory subsystem, they considered increasing the refresh interval of DRAM cells, as the DRAM refresh operation is one of the major contributors to the overall energy consumption of a system. First, they divided the data into critical and non-critical parts, where they defined the entire application program and the intermediate outputs of the DNN as critical data and the input image along with the DNN weights as non-critical data. Then, they stored critical data in cells that have the nominal refresh rate, and non-critical data in cells that have a lower-than-nominal refresh rate. Note that reducing the refresh rate results in retention errors, which lead to data corruption, and the greater the reduction in the refresh rate, the higher the probability of retention errors. Therefore, the refresh rate has to be controlled based on the error resilience of the application. For approximating the computing part, they mainly considered structured pruning, as it results in a significant reduction in the network size and computations while maintaining the dense matrix format. For cloud-based simulations, they employed a lossy JPEG compression scheme to reduce the communication cost incurred to transmit the image to a cloud server. They showed that for edge-based execution their methodology can lead to about 1.6× energy reduction, while for cloud-based execution their methodology can offer between 1.4× and 3.4× energy reduction, depending on the type of CNN used.

A similar study for an iris scanning-based biometric security system is presented in [16] by Hashemi et al. The study clearly shows that end-to-end and cross-layer approximations can offer higher benefits, and such techniques can be employed even for security systems.
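The critical/non-critical data partitioning for approximate DRAM can be mimicked in software to study error resilience before committing to a refresh rate. The sketch below is an illustrative model only, not the methodology of [8]: retention errors are assumed to be independent random bit flips with a single per-bit error rate, whereas real DRAM retention failures depend on the cell, the stored value, and temperature. All names are hypothetical.

```python
import random


def inject_retention_errors(data, bit_error_rate, rng):
    """Flip each bit of `data` independently with probability `bit_error_rate`."""
    out = bytearray(data)
    for idx in range(len(out)):
        for bit in range(8):
            if rng.random() < bit_error_rate:
                out[idx] ^= 1 << bit
    return bytes(out)


def read_back(critical, non_critical, bit_error_rate, seed=0):
    """Model one read: critical data sits in nominally refreshed cells and is returned
    unchanged; non-critical data sits in slowly refreshed cells and may be corrupted."""
    rng = random.Random(seed)
    return critical, inject_retention_errors(non_critical, bit_error_rate, rng)


# Hypothetical contents: intermediate outputs are critical, weights/image are not.
intermediate_outputs = bytes([1, 2, 3, 4])
weights_and_image = bytes([0x0F, 0xF0, 0x55, 0xAA])

kept, corrupted = read_back(intermediate_outputs, weights_and_image, bit_error_rate=1e-3)
flipped = sum(bin(a ^ b).count("1") for a, b in zip(weights_and_image, corrupted))
print("critical intact:", kept == intermediate_outputs, "| flipped bits:", flipped)
```

Sweeping `bit_error_rate` and re-evaluating the DNN on the corrupted weights gives a rough picture of how far the refresh rate could be lowered for a given accuracy constraint.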


6 Conclusion

Deep neural networks have emerged as a prominent set of techniques for solving complex AI problems. However, state-of-the-art DNNs have high computational complexity and a large memory footprint. Moreover, they require a significant amount of energy to process each input sample. This chapter covered different DNN optimization techniques that can be employed to reduce the inference cost of DNNs and enable their deployment in resource-constrained edge devices. The chapter also covered a cross-layer methodology that systematically combines software-level and hardware-level optimization techniques to significantly reduce the energy requirements of DNN-based systems. Toward the end, the chapter discussed some works related to end-to-end system-level optimizations to enable ultra-efficient DNN inference.

References

1. Anwar, S., Hwang, K., Sung, W.: Structured pruning of deep convolutional neural networks. ACM J. Emerg. Technol. Comput. Syst. (JETC) 13(3), 1–18 (2017)
2. Buciluǎ, C., Caruana, R., Niculescu-Mizil, A.: Model compression. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 535–541 (2006)
3. Chen, Y.H., Krishna, T., Emer, J.S., Sze, V.: Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE J. Solid State Circ. 52(1), 127–138 (2016)
4. Dai, Z., Liu, H., Le, Q.V., Tan, M.: CoAtNet: Marrying convolution and attention for all data sizes. In: Advances in Neural Information Processing Systems, vol. 34, pp. 3965–3977 (2021)
5. Dietterich, T.G.: Ensemble methods in machine learning. In: International Workshop on Multiple Classifier Systems, pp. 1–15. Springer (2000)
6. Elkerdawy, S., Elhoushi, M., Singh, A., Zhang, H., Ray, N.: To filter prune, or to layer prune, that is the question. In: Proceedings of the Asian Conference on Computer Vision (2020)
7. Gholami, A., Kim, S., Dong, Z., Yao, Z., Mahoney, M.W., Keutzer, K.: A survey of quantization methods for efficient neural network inference. Preprint (2021). arXiv:2103.13630
8. Ghosh, S.K., Raha, A., Raghunathan, V.: Approximate inference systems (axis) end-to-end approximations for energy-efficient inference at the edge. In: Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design, pp. 7–12 (2020)
9. Gillani, G., Hanif, M.A., Verstoep, B., Gerez, S.H., Shafique, M., Kokkeler, A.B.: MACISH: Designing approximate MAC accelerators with internal-self-healing. IEEE Access 7, 77142–77160 (2019)
10. Gillani, G.A., Hanif, M.A., Krone, M., Gerez, S.H., Shafique, M., Kokkeler, A.B.: SquASH: Approximate square-accumulate with self-healing. IEEE Access 6, 49112–49128 (2018)
11. Gou, J., Yu, B., Maybank, S.J., Tao, D.: Knowledge distillation: A survey. Int. J. Comput. Vis. 129(6), 1789–1819 (2021)
12. Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M.A., Dally, W.J.: EIE: Efficient inference engine on compressed deep neural network. ACM SIGARCH Comput. Archit. News 44(3), 243–254 (2016)
13. Han, S., Mao, H., Dally, W.J.: Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. Preprint (2015). arXiv:1510.00149
14. Han, S., Pool, J., Tran, J., Dally, W.: Learning both weights and connections for efficient neural network. In: Advances in Neural Information Processing Systems, vol. 28 (2015). https://doi.org/10.48550/arXiv.1506.02626
15. Hanif, M.A., Shafique, M.: A cross-layer approach towards developing efficient embedded deep learning systems. Microprocess. Microsyst., 103609 (2021)
16. Hashemi, S., Tann, H., Buttafuoco, F., Reda, S.: Approximate computing for biometric security systems: A case study on iris scanning. In: 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 319–324. IEEE (2018)
17. Hassibi, B., Stork, D.: Second order derivatives for network pruning: Optimal brain surgeon. In: Advances in Neural Information Processing Systems, vol. 5 (1992)
18. He, Y., Lin, J., Liu, Z., Wang, H., Li, L.J., Han, S.: AMC: AutoML for model compression and acceleration on mobile devices. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 784–800 (2018)
19. Hinton, G., Vinyals, O., Dean, J., et al.: Distilling the knowledge in a neural network, vol. 2(7). Preprint (2015). arXiv:1503.02531
20. Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, Y.: Binarized neural networks. In: Advances in Neural Information Processing Systems, vol. 29 (2016)
21. Jain, S., Venkataramani, S., Srinivasan, V., Choi, J., Gopalakrishnan, K., Chang, L.: BiScaledDNN: Quantizing long-tailed datastructures with two scale factors for deep neural networks. In: 2019 56th ACM/IEEE Design Automation Conference (DAC), pp. 1–6. IEEE (2019)
22. Jouppi, N.P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., et al.: In-datacenter performance analysis of a tensor processing unit. In: 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pp. 1–12. IEEE (2017)
23. Khan, A., Sohail, A., Zahoora, U., Qureshi, A.S.: A survey of the recent architectures of deep convolutional neural networks. Artif. Intell. Rev. 53(8), 5455–5516 (2020)
24. Kim, S., Howe, P., Moreau, T., Alaghi, A., Ceze, L., Sathe, V.: MATIC: Learning around errors for efficient low-voltage neural network accelerators. In: 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1–6. IEEE (2018)
25. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, vol. 25 (2012)
26. LeCun, Y., Denker, J., Solla, S.: Optimal brain damage. In: Advances in Neural Information Processing Systems, vol. 2 (1989)
27. Marchisio, A., Hanif, M.A., Martina, M., Shafique, M.: Prunet: Class-blind pruning method for deep neural networks. In: 2018 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE (2018)
28. Mrazek, V., Vašíček, Z., Sekanina, L., Hanif, M.A., Shafique, M.: ALWANN: automatic layer-wise approximation of deep neural network accelerators without retraining. In: 2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 1–8. IEEE (2019)
29. Niu, W., Ma, X., Lin, S., Wang, S., Qian, X., Lin, X., Wang, Y., Ren, B.: PatDNN: Achieving real-time DNN execution on mobile devices with pattern-based weight pruning. In: Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 907–922 (2020)
30. Parashar, A., Rhu, M., Mukkara, A., Puglielli, A., Venkatesan, R., Khailany, B., Emer, J., Keckler, S.W., Dally, W.J.: SCNN: An accelerator for compressed-sparse convolutional neural networks. ACM SIGARCH Comput. Archit. News 45(2), 27–40 (2017)
31. Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A.: XNOR-net: ImageNet classification using binary convolutional neural networks. In: European Conference on Computer Vision, pp. 525–542. Springer (2016)
32. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. Preprint (2014). arXiv:1409.1556
33. Srinivas, A., Lin, T.Y., Parmar, N., Shlens, J., Abbeel, P., Vaswani, A.: Bottleneck transformers for visual recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16519–16529 (2021)
34. Tan, M., Le, Q.: EfficientNet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114. PMLR (2019)
35. Tung, F., Mori, G.: CLIP-Q: Deep network compression learning by in-parallel pruning-quantization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7873–7882 (2018)
36. Venkataramani, S., Ranjan, A., Roy, K., Raghunathan, A.: AxNN: Energy-efficient neuromorphic systems using approximate computing. In: 2014 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), pp. 27–32. IEEE (2014)
37. Wang, K., Liu, Z., Lin, Y., Lin, J., Han, S.: HAQ: Hardware-aware automated quantization with mixed precision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8612–8620 (2019)
38. Wang, T., Wang, K., Cai, H., Lin, J., Liu, Z., Wang, H., Lin, Y., Han, S.: APQ: Joint search for network architecture, pruning and quantization policy. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2078–2087 (2020)
39. Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., Yu, P.S.: A comprehensive survey on graph neural networks. IEEE Trans. Neural Networks Learn. Syst. 32(1), 4–24 (2020)
40. Yu, J., Lukefahr, A., Palframan, D., Dasika, G., Das, R., Mahlke, S.: Scalpel: Customizing DNN pruning to the underlying hardware parallelism. ACM SIGARCH Comput. Archit. News 45(2), 548–560 (2017)
41. Yu, Y., Si, X., Hu, C., Zhang, J.: A review of recurrent neural networks: LSTM cells and network architectures. Neural Computation 31(7), 1235–1270 (2019)
42. Zhang, J., Rangineni, K., Ghodsi, Z., Garg, S.: ThunderVolt: enabling aggressive voltage underscaling and timing error resilience for energy efficient deep learning accelerators. In: Proceedings of the 55th Annual Design Automation Conference, pp. 1–6 (2018)
43. Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., Zou, Y.: DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. Preprint (2016). arXiv:1606.06160

Co-designing Photonic Accelerators for Machine Learning on the Edge

Febin P. Sunny, Asif Mirza, Mahdi Nikdast, and Sudeep Pasricha

1 Introduction

Many emerging applications such as self-driving cars, autonomous robotics, fake news detection, pandemic growth and trend prediction, and real-time language translation are increasingly being powered by sophisticated machine learning models. With researchers creating deeper and more complex deep neural network (DNN) architectures, including multilayer perceptron (MLP) and convolutional neural network (CNN) architectures, the underlying hardware platform must consistently deliver better performance while satisfying strict power dissipation limits. In order to achieve better performance per watt, hardware architects design custom accelerators for deep learning, e.g., Google's TPU [1] and Intel's Movidius [2], with much higher performance per watt than that of conventional CPUs and GPUs. Unfortunately, electronic accelerator architectures face fundamental limits in the post-Moore's-law era, where processing capabilities are no longer improving as they did over the past several decades [3]. In particular, moving data electronically on metallic wires in these accelerators creates a major bandwidth and energy bottleneck [4]. In order to enable ultrahigh-bandwidth, low-latency, and energy-efficient communication, silicon photonics has emerged as a suitable technology [5]. Several efforts have explored the design of chip-scale networks with silicon photonics [6–28]. CMOS-compatible photonic interconnects have already replaced metallic ones for light-speed data transmission at almost every level of computing and are now actively being considered for chip-scale integration [29].

F. P. Sunny · A. Mirza · M. Nikdast · S. Pasricha, Department of Electrical and Computer Engineering, 1373 Campus Delivery, Colorado State University, Fort Collins, CO, USA

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. S. Pasricha, M. Shafique (eds.), Embedded Machine Learning for Cyber-Physical, IoT, and Edge Computing, https://doi.org/10.1007/978-3-031-39932-9_10


F. P. Sunny et al.

Remarkably, it is also possible to use optical components to perform computation, e.g., matrix-vector multiplication [30]. Thus, it is now possible to conceive of a new class of DNN accelerators that employ photonic interconnects and photonic integrated circuits (PICs) built with on-chip waveguides, electro-optic modulators, photodetectors, and lasers for low-latency and energy-efficient optical domain data transport and computation. Such photonics-based accelerators can effectively address the fan-in and fan-out problems with linear algebra processors; also, their operational bandwidth can approach the photodetection rate (typically in the hundreds of GHz). This speed of operation can, ideally, be orders of magnitude higher than electronic systems today that operate at a clock rate of a few GHz [31]. Despite the above benefits, several challenges must be addressed before viable photonic DNN accelerators can be realized. Fabrication process and thermal variations can adversely impact the robustness of photonic accelerator designs by introducing undesirable crosstalk noise, optical phase shifts, resonance drifts, tuning overheads, and photo-detection current mismatches. For example, experimental studies have shown that micro-ring resonator (MR) devices used in chip-scale photonic interconnects can experience significant resonant drifts (e.g., ~9 nm reported in [32]) within a wafer due to process variations. This matters because even a 0.25 nm drift can cause the bit error rate (BER) of photonic data traversal to degrade from 10−12 to 10−6 . Moreover, thermal crosstalk in silicon photonic devices such as MRs can limit the achievable precision (i.e., resolution) of weight and bias parameters to a few bits, which can significantly reduce DNN model accuracy. 
Common tuning circuits that rely on thermo-optic phase-change effects to control photonic devices, e.g., when imprinting activations or weights on optical signals, also place a limit on the achievable throughput and parallelism in photonic accelerators. Lastly, at the architecture level, there is a need for a scalable, adaptive, and low-cost computation and communication fabric that can handle the demands of diverse MLP and CNN models.

In this chapter, we highlight CrossLight, a novel silicon photonic neural network accelerator that addresses the challenges highlighted above through a cross-layer design approach. By cross-layer, we refer to the co-design paradigm that involves considering multiple layers in the hardware-software design stack together, for a more holistic optimization of the photonic accelerator. This work was first presented in [33], and this chapter describes the approach from that work. CrossLight involves device-level engineering for resilience to fabrication-process variations and thermal crosstalk, circuit-level tuning enhancements for inference latency reduction, and an optimized architecture-level design that also integrates the device- and circuit-level improvements to enable higher resolution, better energy-efficiency, and improved throughput compared to prior efforts on photonic accelerator design. The novel contributions in CrossLight include the following:

• Improved silicon photonic device designs that we fabricated to make our architecture more resilient to fabrication-process variations.
• An enhanced tuning circuit to simultaneously support large thermal-induced resonance shifts and high-speed, low-loss device tuning.

Co-designing Photonic Accelerators for Machine Learning on the Edge


• Consideration of thermal crosstalk mitigation methods to improve the weight resolution achievable by our architecture.
• Improved wavelength reuse and use of matrix decomposition at the architecture level to increase throughput and energy-efficiency.
• A comprehensive comparison with state-of-the-art accelerators that shows the efficacy of our cross-layer optimized solution.

2 Background and Related Work

Silicon-photonics-based DNN accelerator architectures represent an emerging paradigm that can immensely benefit the landscape of deep learning hardware design [34–39]. A photonic neuron in these architectures is analogous to an artificial neuron and consists of three components: a weighting, a summing, and a nonlinear unit. Noncoherent photonic accelerators, such as [36–38], typically employ the broadcast and weight (B&W) protocol [35] to manipulate optical signal power for setting and updating weights and activations. The B&W protocol is an analog networking protocol that uses wavelength-division multiplexing (WDM), photonic multiplexers, and photodetectors to combine outputs from photonic neurons in a layer. To imprint parameter values, the optical signal power is controlled by tuning the optical loss of the devices employed.

Coherent photonic accelerators, such as [31] and [39], manipulate the electrical field amplitude rather than signal power and typically use only a single wavelength. Weighting occurs with electrical field amplitude attenuation proportional to the weight value and phase modulation that is proportional to the sign of the weight. The weighted signals are then coherently accumulated with cascaded Y-junction combiners. For both types of accelerators, nonlinearity can be implemented with devices such as electro-absorption modulators [31].

Due to the scalability, phase encoding noise, and phase error accumulation limitations of coherent accelerators [40], there is growing interest in designing efficient noncoherent photonic accelerators. In particular, the authors of DEAP-CNN [36] have described a noncoherent neural network accelerator that implements the entirety of the CNN layers using connected convolution units. In these units, the tuned MRs assume the kernel values by using phase tuning to manipulate the energy in their resonant wavelengths.
Holylight [37] is another noncoherent architecture that uses microdisks (instead of MRs) for its lower area and power consumption. It utilizes a “whispering gallery mode” resonance for microdisk operation, which unfortunately is inherently lossy due to a phenomenon called tunneling ray attenuation [41]. More generally, these noncoherent architectures suffer from susceptibility to process variations and thermal crosstalk, which are not addressed in these architectures. Microsecond-granularity thermo-optic tuning latencies further reduce the speed and efficiency of optical computing [42]. The work in [33] addressed these shortcomings as part of the proposed cross-layer optimized noncoherent photonic accelerator.


3 Noncoherent Photonic Computation Overview

Noncoherent photonic accelerators typically utilize the Broadcast and Weight (B&W) photonic neuron configuration with multiple wavelengths. Figure 1 shows an example of this B&W configuration with n neurons in a layer where the colored-dotted box represents a single neuron. In this example, each input to a neuron (activation parameter value) is imprinted onto a unique wavelength (λi) emitted by a laser diode (LD) using a Mach-Zehnder modulator (MZM) for tuning the wavelength. The wavelengths are multiplexed (MUXed) into a single waveguide using an arrayed waveguide grating (AWG) and split into n branches that are each weighted with a micro-ring resonator (MR) bank that alters optical signal power proportional to weight values. A balanced photodetector (BPD) performs summation across positive and negative weight arms at each branch. Optoelectronic devices such as electro-absorption modulators (not shown for brevity) introduce nonlinearity after the multiplication and summation operations.

MRs are the fundamental computing components of this configuration. Weights (and biases) are altered by tuning MRs so that the losses experienced by wavelengths (on which activations have been imprinted) can be modified to realize matrix-vector multiplication. MR weight banks have groups of these tunable MRs, each of which can be tuned to drain energy from a specific resonant wavelength so that the intensity of the wavelength reflects a specific value (after it has passed near the MR).

As an example of performing computation in the optical domain, consider the case where an activation value of 0.8 must be weighted by a value of 0.5 as part of a matrix-vector multiplication in a DNN model inference phase. Let us assume that the red wavelength (λ1) is imprinted with the activation value of 0.8 by using the MZM in Fig. 1 (alternatively, MRs can be used for the same goal, where an MR will be tuned in such a way that 20% of the input optical

Fig. 1 Noncoherent Broadcast-and-weight (B&W) based photonic neuron


Fig. 2 An all-pass MR with output spectral characteristics at the through port with extinction ratio (ER) and free spectral range (FSR) specified in the figure

signal intensity is dropped as the wave traverses the MR). When λ1 passes through an MR bank, e.g., the one in the dotted-blue box in Fig. 1, the MR in resonance with λ1 can be tuned to drop 50% of the input signal intensity. Thus, as λ1 passes this MR, we will obtain 50% of the input intensity at the through port, which is 0.4 (= 0.8 × 0.5). The BPD shown in Fig. 1 then converts the optical signal intensity from that wavelength (and other wavelengths) into an electrical signal that represents an accumulated single value.

An MR is essentially an on-chip resonator, which is said to be in resonance when an optical wavelength on the input port matches the resonant wavelength of the MR. The MR generates a Lorentzian-shaped signal at the through port. Figure 2 shows an example of an all-pass MR and its output optical spectrum. The extinction ratio (ER) and free-spectral range (FSR) are two primary characteristics of an MR. These depend on several physical properties of the MR, including its width, thickness, radius, and the gap between the input and ring waveguide [43]. Changing any of these properties changes the effective index (neff) of the MR, which in turn causes a change in the output optical spectrum. For reliable operation of MRs, it is crucial to maintain the central wavelength at the output optical spectrum. However, MRs are sensitive to fabrication-process variations (FPVs) and variations in surrounding temperature. These cause the central wavelength of the MR to deviate from its original position, causing a drift in the MR resonant wavelength (ΔλMR) [44]. Such a drift (due to FPV or thermal variations) can be compensated using thermo-optic (TO) or electro-optic (EO) tuning mechanisms. Both of these have their own advantages and disadvantages. EO tuning is faster (~ns range) and consumes less power (~4 μW/nm) but has a smaller tuning range [45].
In contrast, TO tuning has a larger tunability range, but it consumes higher power (~27 mW/FSR) and has higher (~μs range) latency [42].

In order to support complex MLP and CNN model executions, with meaningful throughput, a large number of neurons need to be represented simultaneously. At the architecture level, this translates to requiring a large number of MRs. As the number of MRs increases, so does the length of the waveguide that hosts the banks. Unfortunately, this leads to an increase in the total optical signal propagation,


modulation, and through losses experienced, which in turn increases the laser power required to drive the optical signals through the weight banks. This increase in laser power is necessary so that the signals can be detected error-free at the photodetector. An excessive number of parallel arms with MR weight banks (the dotted box in Fig. 1 represents one arm working in parallel with other arms) also increases optical splitter losses. Moreover, if effective crosstalk mitigation strategies are not employed to handle crosstalk noise among MRs (as is the case with previously proposed photonic accelerators), the increased crosstalk noise in the optical signals will drive down the weight resolution of the architecture.

In summary, to design efficient photonic accelerators, there is a need for (i) improved MR device design to better tolerate variations and crosstalk; (ii) efficient MR tuning circuits to quickly and reliably imprint activation and parameter values; and (iii) a scalable architecture design that minimizes optical signal losses. The CrossLight photonic accelerator design addresses all of these concerns and is discussed next.
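Before turning to the architecture, the weighting example from this section (an activation of 0.8 weighted by 0.5) can be restated as a few lines of Python. This is only a numerical toy under stated assumptions: intensities are normalized to [0, 1], each MR passes a fraction |w| of the intensity at the through port, and the sign of a weight is modeled by routing the product to the positive or negative arm of the balanced photodetector. The function name `bw_neuron` is hypothetical and is not part of CrossLight.

```python
def bw_neuron(activations, weights):
    """Toy model of one broadcast-and-weight neuron (illustrative names only).

    activations: normalized optical intensities in [0, 1], one per wavelength.
    weights: values in [-1, 1]; |w| is the fraction of intensity that the MR
             lets through (assumption of this sketch).
    """
    pos_arm = 0.0  # intensity collected on the positive-weight arm
    neg_arm = 0.0  # intensity collected on the negative-weight arm
    for a, w in zip(activations, weights):
        if not 0.0 <= a <= 1.0:
            raise ValueError("activation must be a normalized intensity")
        if not -1.0 <= w <= 1.0:
            raise ValueError("weight must lie in [-1, 1]")
        weighted = a * abs(w)  # e.g. 0.8 * 0.5 = 0.4 at the through port
        if w >= 0:
            pos_arm += weighted
        else:
            neg_arm += weighted
    # The balanced photodetector is modeled as the difference of the two arms.
    return pos_arm - neg_arm


print(bw_neuron([0.8], [0.5]))  # the example from the text: 0.8 weighted by 0.5
```

The single-wavelength call reproduces the 0.4 of the worked example; passing longer lists accumulates several wavelengths into one value, which is the role the text assigns to the BPD.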

4 CrossLight Architecture

Figure 3 shows a high-level overview of the CrossLight noncoherent silicon photonic neural network accelerator. The photonic substrate performs vector dot product (VDP) operations using silicon photonic MR devices and summation using optoelectronic photodetector (PD) devices over multiple wavelengths. An electronic control unit is required for the control of photonic devices, and for communication with a global memory to obtain the parameter values, mapping of the vectors, and for partial sum buffering. The architecture uses digital-to-analog converter (DAC) arrays to convert buffered signals into analog tuning signals for MR tuning mechanisms. Analog-to-digital converter (ADC) arrays are used to map the output analog signals generated by PDs to digital values that are sent back for postprocessing and buffering. The discussion of this accelerator is broken down into three parts (presented in Sects. 4.1, 4.2, and 4.3), corresponding to the contributions at the device, tuning circuit, and architecture levels.

4.1 MR Device Engineering and Fabrication

Process variations are inevitable in CMOS-compatible silicon photonic fabrication, causing undesirable changes in the resonant wavelength of MR devices (ΔλMR). We fabricated a 1.5 × 0.6 mm² chip with high-resolution Electron Beam (EBeam) lithography and performed a comprehensive design-space exploration of MRs to compensate for FPVs while improving MR device insertion loss and Q-factor. In this exploration, we varied the input and ring waveguide widths to find an MR device design that was tolerant to FPVs. We found that in an MR design of any radii and


Fig. 3 An overview of CrossLight, showing dedicated vector dot product (VDP) units for CONV and FC layer acceleration, and the internal architecture

gap, when the input waveguide is 400 nm wide and the ring waveguide is 800 nm wide at room temperature (300 K), the undesired ΔλMR due to FPVs can be reduced from 7.1 to 2.1 nm (70% reduction). This is a significant result, as these optimized MRs require less compensation for FPV-induced resonant wavelength shifts, which can reduce the tuning power consumption of architectures using such MRs. Unfortunately, even with such optimized MR designs, the impact of FPVs is not completely eliminated, and there is still a need to compensate for FPVs.

Thermal variations are another major cause of changes in MR neff, which also lead to undesirable ΔλMR. Thermo-optic (TO) tuners are used to compensate for such deviations. These TO tuners use microheaters to change the temperature of an MR device, which then alters the neff of the MR, changing the device resonant wavelength and correcting the ΔλMR. Unfortunately, high temperatures from such heaters can cause thermal energy dissipation, creating thermal crosstalk across MR devices placed close to each other. One way to avoid thermal crosstalk due to proximity of devices is to place them further apart from each other, typically 120 μm to 200 μm (depending on the number of MR devices in proximity within an MR


bank). But such a large spacing hurts area efficiency and also increases waveguide length, which increases propagation losses and the associated laser power overhead. The CrossLight architecture addresses this challenge at the circuit level, as discussed next.

4.2 Tuning Circuit Design

To reduce thermal crosstalk, we must reduce the reliance on TO tuning. The TO tuning approach is used in all prior photonic neural network accelerators, but it entails high overheads. We propose to use a hybrid tuning circuit where both thermo-optic (TO) and electro-optic (EO) tuning are used to compensate for ΔλMR. Such a tuning approach has previously been proposed in [47] for silicon photonic Mach-Zehnder interferometers with low insertion loss. Such an approach can be easily transferred to an optimized MR for hybrid tuning in our architecture. The hybrid tuning approach supports faster operation of MRs, using fast EO tuning to compensate for small λMR shifts and TO tuning only when large λMR shifts need to be compensated. More specifically, imprinting parameter values onto wavelengths using MRs requires the architecture to induce only a small ΔλMR, under 1 nm, which can be done in the ns range using EO tuning. For correcting FPV-induced shifts, which can be significantly larger than 1 nm, the hybrid tuning approach relies on TO tuning.

To further reduce the power overhead of TO tuning in this hybrid approach, we adapt a method called thermal Eigenmode decomposition (TED), which was first proposed in [48]. Using TED, all the MRs in an MR bank can be tuned collectively, to compensate for large λMR shifts. Collective thermal tuning using TED takes into account the thermal crosstalk impact from nearby MR TO-tuning circuitry and compensates for it in the tuning signals sent to the MR bank. By doing so, we can cancel the effect of thermal crosstalk (i.e., an undesired phase change) in MRs with much lower power consumption. The TO tuning power can be calculated from the amount of phase shift that must be applied to the MRs for them to be at their desired resonant wavelength. The extent of phase crosstalk ratio (due to thermal crosstalk) as a function of the distance between an MR pair is shown in Fig. 4, for our fabricated MR devices.
The results are based on detailed analysis with a commercial 3D heat transport simulation EDA tool for silicon photonic devices (Ansys Lumerical HEAT [46]). It can be seen from the orange line that as the distance between an MR pair increases, the amount of phase crosstalk reduces exponentially. Such a trend has also been observed in [49]. To find a balance between tuning power savings while having reduced crosstalk, we perform a sensitivity analysis based on the distance between two adjacent MRs in our architecture. By utilizing TED, we can place the optimized MRs (described in the previous section) in such a manner that maximum tuning power is saved when they are close to each other while compensating for thermal crosstalk. The results from our analysis (the solid-green line in Fig. 4) indicate that placing each MR pair


Fig. 4 Phase crosstalk ratio and tuning power consumption in a block of 10 fabricated MRs with variable distance between adjacent pair of MRs

at a distance of 5 μm is optimal, as increasing or decreasing such a distance causes an increase in power consumption of individual TO heaters in the MRs. Figure 4 also shows the tuning power required without using the TED approach (blue arrow line), which can be seen to be notably higher. The workflow of the circuit-level hybrid tuning approach can be summarized as follows. When the accelerator is first booted at runtime, a one-time compensation for design-time FPVs is applied using TO tuning. The extent of compensation for crosstalk is calculated offline during the test phase, where the required phase shift in each of the MRs is calculated, and once the system is online, the respective phase shift values are applied to cancel the impact of thermal crosstalk. Subsequently, we apply EO tuning due to its extremely low latency to represent vector elements in each vector operation with MRs (discussed in more detail in the next section). If large shifts in temperature are observed at runtime, we can perform a one-time calibration with TO tuning to compensate for it. In our analysis, runtime TO tuning would be required rarely beyond its first use after the initial bootup of the photonic accelerator platform.
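The choice between the two tuning mechanisms in this workflow can be sketched as a simple rule. The snippet below is a minimal illustration, not the CrossLight tuning circuit: it takes the 1 nm threshold from the text, the latency and power figures listed for EO and TO tuning in Table 2, and the 18 nm FSR quoted in Sect. 5.2. It further assumes that power scales linearly with the size of the shift, which the μW/nm and mW/FSR units suggest but the chapter does not state. It also ignores TED and crosstalk compensation. The function name `plan_tuning` is hypothetical.

```python
EO_RANGE_NM = 1.0            # shifts under ~1 nm are handled by EO tuning
EO_LATENCY_S = 20e-9         # EO tuning latency (Table 2)
EO_POWER_W_PER_NM = 4e-6     # EO tuning power (Table 2)
TO_LATENCY_S = 4e-6          # TO tuning latency (Table 2)
TO_POWER_W_PER_FSR = 27.5e-3 # TO tuning power (Table 2)
FSR_NM = 18.0                # FSR of the optimized MRs (Sect. 5.2)


def plan_tuning(shift_nm):
    """Pick EO or TO tuning for a resonance shift and estimate its cost.

    Assumes power grows linearly with the shift (assumption of this sketch).
    """
    shift = abs(shift_nm)
    if shift < EO_RANGE_NM:
        return {
            "mode": "EO",
            "latency_s": EO_LATENCY_S,
            "power_w": EO_POWER_W_PER_NM * shift,
        }
    return {
        "mode": "TO",
        "latency_s": TO_LATENCY_S,
        "power_w": TO_POWER_W_PER_FSR * shift / FSR_NM,
    }


print(plan_tuning(0.5))  # small shift for imprinting a value: EO
print(plan_tuning(2.1))  # residual FPV-induced shift of the optimized MRs: TO
```

Under these assumptions the two branches differ by roughly two orders of magnitude in latency, which is why the workflow reserves TO tuning for the one-time compensation at boot and for rare recalibration.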

4.3 Architecture Design

The optimized MR devices, layouts, and tuning circuits are utilized within optical vector dot product (VDP) units, which are shown in Fig. 3. We use banks (groups) of MRs to imprint both activations and weights onto the optical signal. Multiple VDP units are combined together to form two architectural sub-components: one to support convolution (CONV) layer acceleration and the other to support fully


connected (FC) layer acceleration. We focus on these two types of layers, as they are the most widely used and consume the most significant amount of latency and power in computational platforms that execute DNNs. In contrast, other layer types (e.g., pooling, batch normalization) can be implemented efficiently in the electronic domain. Note also that we focus on inference acceleration, as done in all photonic DNN accelerators and most electronic DNN accelerators.

4.3.1 Decomposing Vector Operations in CONV/FC Layers

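To map CONV and FC layers from DNN models to our accelerator, we first need to decompose large vector sizes into smaller ones. In CONV layers, a filter performs convolution on a patch (e.g., 2 × 2 elements) of the activation matrix in a channel to generate an element of the output matrix. The operation can be represented as follows:

$$K \otimes A = Y \tag{1}$$

For a 2 × 2 filter kernel and a 2 × 2 activation patch, (1) can be expressed as:

$$\begin{bmatrix} k_1 & k_2 \\ k_3 & k_4 \end{bmatrix} \otimes \begin{bmatrix} a_1 & a_2 \\ a_3 & a_4 \end{bmatrix} = k_1 a_1 + k_2 a_2 + k_3 a_3 + k_4 a_4 \tag{2}$$

Rewriting (2) as a vector dot product, we have:

$$\begin{bmatrix} k_1 & k_2 & k_3 & k_4 \end{bmatrix} \cdot \begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \end{bmatrix} = k_1 a_1 + k_2 a_2 + k_3 a_3 + k_4 a_4 \tag{3}$$

Once we are able to represent the operation as a vector dot product, it is easy to see how it can be decomposed into partial sums. For example:

$$\begin{aligned}
\begin{bmatrix} k_1 & k_2 \end{bmatrix} \cdot \begin{bmatrix} a_1 \\ a_2 \end{bmatrix} &= k_1 a_1 + k_2 a_2 = PS_1 \\
\begin{bmatrix} k_3 & k_4 \end{bmatrix} \cdot \begin{bmatrix} a_3 \\ a_4 \end{bmatrix} &= k_3 a_3 + k_4 a_4 = PS_2 \\
PS_1 + PS_2 &= Y
\end{aligned} \tag{4}$$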
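The decomposition in (3) and (4) can be checked numerically with a short Python sketch. The helper name `partial_sums` and the sample kernel and patch values are illustrative assumptions, not taken from the CrossLight simulator; the chunk size plays the role of the smaller vector dimension handled at once.

```python
def partial_sums(kernel, activations, chunk):
    """Split a dot product into partial sums over chunks of `chunk` elements."""
    if len(kernel) != len(activations):
        raise ValueError("kernel and activations must have the same length")
    if chunk <= 0:
        raise ValueError("chunk must be positive")
    sums = []
    for start in range(0, len(kernel), chunk):
        k_part = kernel[start:start + chunk]
        a_part = activations[start:start + chunk]
        # One low-dimensional dot product, e.g. PS1 = k1*a1 + k2*a2.
        sums.append(sum(k * a for k, a in zip(k_part, a_part)))
    return sums


kernel = [1, 2, 3, 4]  # flattened 2 x 2 kernel (k1..k4), made-up values
patch = [5, 6, 7, 8]   # flattened 2 x 2 activation patch (a1..a4), made-up values

ps = partial_sums(kernel, patch, chunk=2)
y = sum(ps)
print(ps, y)  # [17, 53] 70
```

Summing the partial sums gives the same value as the full four-element dot product, which is the property that allows large vectors to be mapped onto smaller dot product units.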

In FC layers, typically much larger dimension vector multiplication operations are performed between input activations and weight matrices:

$$AW = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix} \begin{bmatrix} w_1 & w_2 & \cdots & w_n \end{bmatrix} \tag{5}$$

$$AW = \begin{bmatrix} a_1 \cdot w_1 + a_1 \cdot w_2 + \cdots + a_1 \cdot w_n \\ a_2 \cdot w_1 + a_2 \cdot w_2 + \cdots + a_2 \cdot w_n \\ \vdots \\ a_n \cdot w_1 + a_n \cdot w_2 + \cdots + a_n \cdot w_n \end{bmatrix} \tag{6}$$

In (5), $a_1$ to $a_n$ represent a column vector of activations (A) and $w_1$ to $w_n$ represent a row vector of weights (W). The resulting vector is a summation of dot products of vector elements (6). Much like with CONV layers, these can be decomposed into lower dimensional dot products, as described in (1)–(5).

4.3.2 Vector Dot Product (VDP) Unit Design

We separated the implementation of CONV and FC layers in CrossLight due to the vastly different orders of vector dot product computations required to implement each layer. For instance, typical CONV layer kernel sizes vary from 2 × 2 to 5 × 5, whereas in FC layers, it is not uncommon to have 100 or more neurons (requiring 100 × 100 or higher order matrix-vector multiplication). State-of-the-art photonic DNN accelerators at the time, e.g., [36], only consider the scales involved at the CONV layer, and either only support CONV layer acceleration in the optical domain, or use the same CONV layer implementation to accelerate FC layers. This will lead to increased latencies and reduced throughput as the larger vectors involved with FC layer calculation must be divided up into much smaller chunks, in the order of the filter kernel size of the CONV layer. For faster execution of FC layers, while providing energy-efficiency for CONV layer operation, we separately support the unique scale and requirements of vector dot products involved in CONV layers and FC layers. For CONV layer acceleration, we consider n VDP units, with each unit supporting an N × N dot product. For FC layer acceleration, we consider m units, with each unit supporting a K × K dot product. Here n > m and K > N, as per the requirements of each of the distinct layers. In each of the VDP units, the original vector dimensions are decomposed into N or K dimensional vectors, as discussed above. We performed an exploration to determine the optimal values for N, K, n, and m. The results of this exploration study are presented in Sect. 5.

4.3.3 Optical Wavelength Reuse in VDP Units

Prior work on photonic DNN accelerator design typically considers a separate wavelength to represent each individual element of a vector. With large vector sizes, this approach leads to an increase in the total number of wavelengths and hence lasers needed in the laser bank which in turn increases power consumption. Beyond employing the decomposition approach discussed above, we also consider wavelength reuse per VDP unit to minimize laser power. In this approach, within VDP units, the N or K dimensional vectors are further decomposed into smaller sized vectors for which dot products can be performed using MRs in parallel, in each arm of the VDP unit. The same wavelengths can then be reused across arms within a VDP to reduce the number of unique wavelengths required. PDs perform summation of the elementwise products to generate partial sums from decomposed vector dot products. The partial sums from the decomposed operations are then converted back to the photonic domain by VCSELs (Fig. 3), multiplexed into a single waveguide, and accumulated using another PD, before being sent for buffering. Thus, our approach leads to an increase in the number of PDs compared to other accelerators but significantly reduces both the number of MRs per waveguide and the overall laser power consumption. In each arm within a VDP unit, we used a maximum of 15 MRs per bank for a total of 30 MRs per arm, to support up to a 15 × 15 vector dot product. The choice of MRs per arm considers not only the thermal crosstalk and layout spacing issues (discussed earlier) and the benefits of wavelength reuse (discussed in previous paragraph) but also the fact that optical splitter losses become non-negligible as the number of MRs per arm increases, which in turn increases laser power requirements. Thus, the selection of MRs per arm within a VDP unit was carefully adjusted to balance parallelism within/across arms and laser power overheads.
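The bookkeeping behind wavelength reuse can be sketched as follows. This is a rough illustration under stated assumptions: vector elements fill each arm with up to 15 elements, every arm has one activation bank and one weight bank (hence up to 30 MRs), the same set of wavelengths is reused in every arm, and a design without reuse needs one wavelength per vector element. The wavelengths used later by the VCSELs to accumulate partial sums are not counted. The function name `vdp_arm_plan` is hypothetical.

```python
import math

MAX_MRS_PER_BANK = 15  # maximum MRs per bank stated in the text


def vdp_arm_plan(vector_len, mrs_per_bank=MAX_MRS_PER_BANK):
    """Estimate arms, wavelengths, and MRs for one decomposed dot product."""
    if vector_len < 1 or mrs_per_bank < 1:
        raise ValueError("vector_len and mrs_per_bank must be positive")
    per_arm = min(vector_len, mrs_per_bank)
    return {
        "arms": math.ceil(vector_len / mrs_per_bank),
        "wavelengths_with_reuse": per_arm,        # same set reused in every arm
        "wavelengths_without_reuse": vector_len,  # one wavelength per element
        "mrs_per_arm": 2 * per_arm,               # activation bank + weight bank
    }


print(vdp_arm_plan(150))
```

For a 150-element vector this gives 10 arms that share 15 wavelengths instead of 150 distinct ones, with 30 MRs per arm, matching the per-arm limit described above.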

5 Evaluation and Simulation Results

5.1 Simulation Setup

To evaluate the effectiveness of our CrossLight accelerator, we conducted several simulation studies. These studies were complemented by our MR-device fabrication and optimization efforts on real chips, as discussed in Sect. 4. We considered the four DNN models shown in Table 1 for execution on the accelerator. Model 1 is LeNet-5 [50], and models 2 and 3 are custom CNNs with both FC and CONV layers. Model 4 is a Siamese CNN utilizing one-shot learning. The datasets used to train these models are also shown in the table. We designed a custom CrossLight accelerator simulator in Python to estimate its performance and power/energy. We


Table 1 Models and datasets considered for evaluation

Model no.   CONV layers   FC layers   Parameters    Dataset
1           2             2           60,074        Sign MNIST
2           4             2           890,410       CIFAR10
3           7             2           3,204,080     STL10
4           8             4           38,951,745    Omniglot

Table 2 Parameters considered for analyses of photonic accelerators

Device               Latency    Power
EO tuning [45]       20 ns      4 μW/nm
TO tuning [42]       4 μs       27.5 mW/FSR
VCSEL [55]           10 ns      0.66 mW
TIA [57]             0.15 ns    7.2 mW
Photodetector [58]   5.8 ps     2.8 mW

used TensorFlow 2.3 along with QKeras [51] for analyzing DNN model accuracy across different parameter resolutions. We compared CrossLight with the DEAP-CNN [36] and Holylight [37] photonic DNN accelerators from prior work. Table 2 shows the optoelectronic parameters considered for this simulation-based analysis. We considered photonic signal losses due to various factors: signal propagation (1 dB/cm [29]), splitter loss (0.13 dB [52]), combiner loss (0.9 dB [53]), MR through loss (0.02 dB [54]), MR modulation loss (0.72 dB [55]), microdisk loss (1.22 dB [56]), EO tuning loss (6 dB/cm [45]), and TO tuning loss (1 dB/cm [42]). We also considered the 1-to-56-Gb/s ADC/DAC-based transceivers from recent work [61]. To calculate laser power consumption, we use the following laser power model:

$$P_{laser} - S_{detector} \geq P_{photo\_loss} + 10 \times \log_{10} N_{\lambda} \tag{7}$$

where $P_{laser}$ is laser power in dBm, $S_{detector}$ is the PD sensitivity in dBm, and $P_{photo\_loss}$ is the total photonic loss encountered by the optical signal, due to all of the factors discussed above.
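As a quick illustration of how (7) is used, the snippet below rearranges it into the minimum laser power that satisfies the inequality. It is a sketch under assumptions: $N_{\lambda}$ is read as the number of wavelengths (the text above does not define it), the loss is given in dB, and the detector sensitivity and loss values in the usage lines are hypothetical placeholders rather than values from the chapter. The function names are illustrative only.

```python
import math


def min_laser_power_dbm(detector_sensitivity_dbm, photonic_loss_db, num_wavelengths):
    """Smallest P_laser (dBm) satisfying Eq. (7).

    num_wavelengths is interpreted as N_lambda (assumption of this sketch).
    """
    if num_wavelengths < 1:
        raise ValueError("num_wavelengths must be at least 1")
    return (detector_sensitivity_dbm
            + photonic_loss_db
            + 10.0 * math.log10(num_wavelengths))


def dbm_to_mw(power_dbm):
    """Convert a power level from dBm to milliwatts."""
    return 10.0 ** (power_dbm / 10.0)


# Hypothetical inputs: -20 dBm sensitivity, 10 dB total loss, 10 wavelengths.
p_dbm = min_laser_power_dbm(-20.0, 10.0, 10)
print(p_dbm, dbm_to_mw(p_dbm))  # 0.0 dBm, i.e. 1.0 mW
```

Because every term is in dB, each additional dB of propagation, splitter, or modulation loss raises the required laser power by the same amount, which is why shorter waveguides and fewer MRs per arm translate directly into laser power savings.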

5.2 Results: CrossLight Resolution Analysis

We first present an analysis of the resolution that can be achieved with CrossLight. We consider how the optical signals from MRs impact each other due to their spectral proximity, also known as inter-channel crosstalk. For this, we use the equations from [59]:

$$\phi(i, j) = \frac{\delta^2}{(\lambda_i - \lambda_j)^2 + \delta^2} \tag{8}$$


In (8), $\phi(i, j)$ describes the noise content from the jth MR present in the signal from the ith MR. As the noise content increases, the resolution achievable with CrossLight will decrease. Also, $(\lambda_i - \lambda_j)$ is the difference between the resonant wavelengths of the ith MR and the jth MR, while $\delta$ (= $\lambda_i / 2Q$) denotes the 3 dB bandwidth of the MRs, with Q being the quality factor (Q-factor) of the MR being considered. The noise power component can thus be calculated as:

$$P_{noise} = \sum_{i}^{n-1} \phi(i, j) \, P_{in}[i] \tag{9}$$

For unit input power intensity, resolution can then be computed as:

$$\text{Resolution} = \frac{1}{\max |P_{noise}|} \tag{10}$$
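Equations (8) to (10) translate into a few lines of Python. The sketch below makes several assumptions that are not in the text: the noise for a channel j is summed over all other channels i (the i = j term is excluded), the wavelengths are spaced evenly around a hypothetical 1550 nm starting point, and the 1.2 nm spacing is simply 18 nm divided by 15 MRs. It returns the resolution value of (10) as is and does not attempt to reproduce the bit counts reported in this section, since the mapping from that value to bits is not spelled out here.

```python
def crosstalk(lam_i_nm, lam_j_nm, q_factor):
    """Eq. (8): Lorentzian crosstalk coefficient, with delta = lam_i / (2 * Q)."""
    delta = lam_i_nm / (2.0 * q_factor)
    return delta ** 2 / ((lam_i_nm - lam_j_nm) ** 2 + delta ** 2)


def noise_powers(wavelengths_nm, p_in, q_factor):
    """Eq. (9) per channel j, summing over the other channels i != j."""
    noise = []
    for j, lam_j in enumerate(wavelengths_nm):
        total = 0.0
        for i, lam_i in enumerate(wavelengths_nm):
            if i != j:
                total += crosstalk(lam_i, lam_j, q_factor) * p_in[i]
        noise.append(total)
    return noise


def resolution(wavelengths_nm, q_factor):
    """Eq. (10) for unit input power on every channel."""
    if len(wavelengths_nm) < 2:
        raise ValueError("need at least two channels to have crosstalk")
    p_in = [1.0] * len(wavelengths_nm)
    return 1.0 / max(abs(p) for p in noise_powers(wavelengths_nm, p_in, q_factor))


# Hypothetical bank: 15 MRs, 1.2 nm apart, starting at 1550 nm, Q = 8000.
bank = [1550.0 + 1.2 * k for k in range(15)]
print(resolution(bank, 8000))
```

The sketch makes the qualitative trend easy to explore: widening the channel spacing or raising Q shrinks every crosstalk coefficient and therefore raises the resolution value.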

From this analysis, we found that with the FSR value of 18 nm and the Q value of ~8000 in our optimized MR designs, and the wavelength reuse strategy in CrossLight, which allows us to have large (λi − λj) values (>1 nm), our MR banks will be able to achieve a resolution of 16 bits for up to 15 MRs per bank (Sect. 4.3.3). This resolution is much higher than the resolution achievable by many photonic accelerators. For instance, DEAP-CNN can only achieve a resolution of 4 bits, whereas Holylight can only achieve a 2-bit resolution per microdisk (this work, however, combines 8 microdisks to achieve an overall 16-bit resolution). Higher resolution ensures better accuracy in inference, which can be critical in some applications. Figure 5 shows the impact of varying the resolution across the weights and activations from 1 bit to 16 bits (we used quantization-aware training to maximize accuracy), for the four DNN models considered (Table 1). It can be observed that model inference accuracy is sensitive to the resolution of weight and activation parameters. Models such as the one for STL10 are particularly sensitive to the resolution. Thus, the high resolution afforded by CrossLight can allow achieving higher accuracies than other photonic DNN accelerators, such as DEAP-CNN.

5.3 Results: CrossLight Sensitivity Analysis

We performed a sensitivity analysis by varying the number of VDP units in the CONV layer accelerator (n) and FC layer accelerator (m), along with the complexity of the VDP units (N and K, respectively). Figure 6 shows the frames per second (FPS; a measure of inference performance) vs. energy per bit (EPB) vs. area of various configurations of CrossLight. We selected the best configuration as the one that had the highest value of FPS/EPB. In terms of (N, K, n, m), the values of the four parameters for this configuration are


Fig. 5 Inference accuracy of the four DNN models considered, across quantization (resolution) range from 1 bit to 16 bits (for both weights and activations)

Fig. 6 Scatterplot of average FPS vs. average EPB vs. area of various CrossLight configurations. The configuration with highest FPS/EPB (and FPS) is highlighted

(20, 150, 100, 60). This configuration also ended up being the one with the highest FPS value, but had a higher area overhead than other configurations. Nonetheless, this area is comparable to that of other photonic accelerators. We used this configuration for comparisons with prior work, as discussed next.
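The selection rule used in this sensitivity analysis (highest FPS/EPB among configurations with n > m and K > N) can be sketched as below. The function name is hypothetical, and the FPS and EPB numbers in the candidate list are made-up placeholders chosen only to exercise the code; they are not results from the chapter, and area is not modeled.

```python
def best_configuration(candidates):
    """Return the candidate with the highest FPS/EPB among valid ones.

    A candidate is valid when n > m and K > N, as required in Sect. 4.3.2.
    """
    valid = [c for c in candidates if c["n"] > c["m"] and c["K"] > c["N"]]
    if not valid:
        raise ValueError("no candidate satisfies n > m and K > N")
    return max(valid, key=lambda c: c["fps"] / c["epb"])


# Placeholder FPS/EPB values (arbitrary units), NOT measurements from the chapter.
candidates = [
    {"N": 20, "K": 150, "n": 100, "m": 60, "fps": 3.0, "epb": 1.0},
    {"N": 10, "K": 100, "n": 50, "m": 30, "fps": 2.0, "epb": 1.0},
    {"N": 30, "K": 20, "n": 40, "m": 80, "fps": 9.0, "epb": 1.0},  # violates both constraints
]

best = best_configuration(candidates)
print(best["N"], best["K"], best["n"], best["m"])
```

In an actual exploration each candidate's FPS and EPB would come from the simulator described in Sect. 5.1, and an area budget could be added as a further filter before taking the maximum.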


Fig. 7 Power consumption comparison among variants of CrossLight vs. photonic accelerators (DEAP-CNN, Holylight), and electronic accelerator platforms (P100, Xeon Platinum 9282, Threadripper 3970x, DaDianNao, EdgeTPU, Null Hop)

5.4 Results: Comparison with State-of-the-Art Accelerators

We compared our CrossLight accelerator against two well-known photonic accelerators, DEAP-CNN and Holylight, within a reasonable area constraint for all accelerators (~16–25 mm²). We present results for four variants of the CrossLight architecture: (1) Cross_base utilizes conventional MR designs (without FPV resilience) and traditional TO tuning; (2) Cross_opt utilizes the optimized MR designs from Sect. 4.1, and traditional TO tuning; (3) Cross_base_TED utilizes the conventional MR designs with the hybrid TED-based tuning approach from Sect. 4.2; and (4) Cross_opt_TED utilizes the optimized MR designs and the hybrid TED-based tuning approach.

Figure 7 shows the power consumption comparison across the four CrossLight variants and the two photonic accelerators from prior work. We also include comparison numbers for electronic platforms: three deep learning accelerators (DaDianNao, Null Hop, and EdgeTPU), a GPU (Nvidia Tesla P100), and CPUs (Intel Xeon Platinum 9282 denoted as IXP9282, and AMD Threadripper 3970x denoted as AMD-TR) [60]. The effectiveness of the optimization approaches adopted is immediately evident from the power values. The variants that use the conventional MR design (Cross_base and Cross_base_TED) consume more power to compensate for FPV. This value becomes nontrivial as the number of MRs increases, and thus the reduced tuning power requirement per MR (in Cross_opt and Cross_opt_TED) becomes a significant advantage. Using the TED-based hybrid tuning approach provides further significant power benefits for Cross_opt_TED over Cross_opt, which uses conventional TO tuning. Cross_opt_TED can be seen to have lower power consumption than both photonic accelerators, as well as the CPU and GPU platforms, although this power is higher than that of the edge/mobile electronic accelerators.
Figure 8 shows a comparison of energy-per-bit (EPB) across all of the photonic accelerators, for the four DNN models. On average, our best CrossLight configuration (Cross_opt_TED) has 1544× and 9.5× lower EPB compared to DEAP-CNN and Holylight, respectively. CrossLight is able to achieve significantly lower EPB as the design for the architecture took into consideration various losses

Co-designing Photonic Accelerators for Machine Learning on the Edge


Fig. 8 Comparison of EPB values of the photonic DNN accelerators

Table 3 Average EPB and kiloFPS/watt values across accelerators

Accelerator       Avg. EPB (pJ/bit)   Avg. kiloFPS/watt
P100              971.31              24.9
IXP 9282          5099.68             2.39
AMD-TR            5831.18             2.09
DaDianNao         58.33               0.65
Edge TPU          697.37              17.53
Null Hop          2727.43             4.48
DEAP_CNN          44453.88            0.07
Holylight         274.13              3.3
Cross_base        142.35              10.78
Cross_base_TED    92.64               16.54
Cross_opt         75.58               20.25
Cross_opt_TED     28.78               52.59

and crosstalk that a photonic DNN accelerator would experience and put in place various optimizations at the device, circuit, and architecture layers to counteract their impact. The utilization of TED-based thermal crosstalk management allows us to place MRs much closer together, which in turn reduces propagation losses. In addition, CrossLight uses a hybrid TO+EO tuning approach, which reduces power and EPB as well. The use of EO tuning in this hybrid approach also provides the advantage of lower latencies for VDP operations, which is apparent in the EPB values. Table 3 summarizes the average values of EPB (in pJ/bit) and performance-per-watt (in kiloFPS/Watt) of the photonic accelerators as well as the electronic accelerators considered in this work. It can be observed that the best CrossLight configuration (Cross_opt_TED) achieves significantly lower EPB and higher performance-per-watt values than all of the accelerators considered. Specifically, against Holylight, which is the better of the two photonic DNN accelerators considered, CrossLight achieves 9.5× lower energy-per-bit and 15.9× higher performance per watt. These results demonstrate the effectiveness of cross-layer design of deep learning accelerators with the emerging silicon photonics technology. With the growing maturity of silicon photonic device fabrication in CMOS-compatible processes, it is expected that the energy costs of device tuning, losses, and laser


F. P. Sunny et al.

power overheads will go further down, making an even stronger case for considering optical-domain accelerators for deep learning inference.
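The headline ratios quoted above can be checked directly against the averages in Table 3. The following is a minimal sketch in Python; the dictionary and function names are illustrative choices, not part of the CrossLight work, and the numbers are copied from Table 3. Note that it takes ratios of the reported averages, which need not equal an average of per-model ratios.

```python
# Average values copied from Table 3: (EPB in pJ/bit, kiloFPS per watt).
TABLE3 = {
    "P100": (971.31, 24.9),
    "IXP 9282": (5099.68, 2.39),
    "AMD-TR": (5831.18, 2.09),
    "DaDianNao": (58.33, 0.65),
    "Edge TPU": (697.37, 17.53),
    "Null Hop": (2727.43, 4.48),
    "DEAP_CNN": (44453.88, 0.07),
    "Holylight": (274.13, 3.3),
    "Cross_base": (142.35, 10.78),
    "Cross_base_TED": (92.64, 16.54),
    "Cross_opt": (75.58, 20.25),
    "Cross_opt_TED": (28.78, 52.59),
}


def epb_reduction(baseline, design="Cross_opt_TED"):
    """Factor by which the design's average EPB is lower than the baseline's."""
    return TABLE3[baseline][0] / TABLE3[design][0]


def perf_per_watt_gain(baseline, design="Cross_opt_TED"):
    """Factor by which the design's average kiloFPS/W exceeds the baseline's."""
    return TABLE3[design][1] / TABLE3[baseline][1]


if __name__ == "__main__":
    for name in ("DEAP_CNN", "Holylight"):
        print(f"{name}: {epb_reduction(name):.1f}x lower EPB, "
              f"{perf_per_watt_gain(name):.1f}x higher kiloFPS/W")
```

Against Holylight, the two helpers reproduce the 9.5× and 15.9× figures to one decimal place.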

6 Conclusion
In this chapter, we presented a cross-layer optimized photonic neural network accelerator called CrossLight. Utilizing silicon photonic device-level fabrication-driven optimizations along with circuit-level and architecture-level optimizations, we demonstrated 9.5× lower energy per bit and 15.9× higher performance per watt compared to state-of-the-art photonic DNN accelerators. CrossLight also shows improvements in these metrics over several CPU, GPU, and custom electronic accelerator platforms considered in our analysis. CrossLight shows the promise of cross-layer optimization strategies in countering various challenges such as crosstalk, fabrication-process variations, high laser power, and excessive tuning power. The results presented in this chapter demonstrate the promise of photonic DNN accelerators in addressing the need for energy-efficient and high performance-per-watt DNN acceleration.

References
1. Jouppi, N.P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., Boyle, R., Cantin, P., Chao, C., Clark, C., Coriell, J., Daley, M., Dau, M., Dean, J., Gelb, B., Ghaemmaghami, T.V., Gottipati, R., Gulland, W., Hagmann, R., Ho, C.R., Hogberg, D., Hu, J., Hundt, R., Hurt, D., Ibarz, J., Jaffey, A., Jaworski, A., Kaplan, A., Khaitan, H., Koch, A., Kumar, N., Lacy, S., Laudon, J., Law, J., Le, D., Leary, C., Liu, Z., Lucke, K., Lundin, A., MacKean, G., Maggiore, A., Mahony, M., Miller, K., Nagarajan, R., Narayanaswami, R., Ni, R., Nix, K., Norrie, T., Omernick, M., Penukonda, N., Phelps, A., Ross, J., Ross, M., Salek, A., Samadiani, E., Severn, C., Sizikov, G., Snelham, M., Souter, J., Steinberg, D., Swing, A., Tan, M., Thorson, G., Tian, B., Toma, H., Tuttle, E., Vasudevan, V., Walter, R., Wang, W., Wilcox, E., Yoon, D.H.: In-datacenter performance analysis of a tensor processing unit. In: ISCA 2017
2. Intel Movidius VPU.: 2020, [Online]: https://www.intel.com/content/www/us/en/products/processors/movidius-vpu/movidius-myriad-x.html
3. Waldrop, M.M.: The chips are down for Moore’s law. Nat. News. 530(7589) (2016)
4. Pasricha, S., Dutt, N.: On-Chip Communication Architectures. Morgan Kaufmann, ISBN 978-0-12-373892-9 (Apr 2008)
5. Ziabari, A.K.K., Abellán, J.L., Ubal, R., Chen, C., Joshi, A., Kaeli, D.: Leveraging silicon-photonic NoC for designing scalable GPUs. In: ACM ICS (2015)
6. Bahirat, S., Pasricha, S.: METEOR: hybrid photonic ring-mesh network-on-chip for multicore architectures. ACM Trans. Embedd. Comput. Syst. 13(3), 1–33 (2014)
7. Bahirat, S., Pasricha, S.: HELIX: design and synthesis of hybrid nanophotonic application-specific network-on-chip architectures. IEEE international symposium on quality electronic design (ISQED), 2014.
8. Bahirat, S., Pasricha, S.: 3D HELIX: design and synthesis of hybrid nanophotonic application-specific 3D network-on-chip architectures. Workshop on exploiting silicon photonics for energy efficient heterogeneous parallel architectures (SiPhotonics), 2014.
9. Bahirat, S., Pasricha, S.: A particle Swarm optimization approach for synthesizing application-specific hybrid photonic networks-on-chip. IEEE international symposium on quality electronic design (ISQED), 2012.
10. Bahirat, S., Pasricha, S.: UC-PHOTON: a novel hybrid photonic network-on-chip for multiple use-case applications. IEEE international symposium on quality electronic design (ISQED), 2010.
11. Bahirat, S., Pasricha, S.: Exploring hybrid photonic networks-on-chip for emerging chip multiprocessors. IEEE/ACM international conference on hardware/software codesign and system synthesis (CODES+ISSS), 2009.
12. Chittamuru, S.V.R., Thakkar, I., Pasricha, S., Vatsavai, S.S., Bhat, V.: Exploiting process variations to secure photonic NoC architectures from snooping attacks. IEEE Trans. Comput. Aid. Des. Integrat. Circuits Syst. 40, 850–863 (2021)
13. Chittamuru, S.V.R., Thakkar, I., Pasricha, S.: LIBRA: thermal and process variation aware reliability management in photonic networks-on-chip. IEEE Trans. Multi-Scale Comput. Syst. 4(4), 758–772 (2018)
14. Chittamuru, S.V.R., Dharnidhar, D., Pasricha, S., Mahapatra, R.: BiGNoC: accelerating Big Data computing with application-specific photonic network-on-chip architectures. IEEE Trans. Parallel. Distrib. Syst. 29(11), 2402–2415 (2018)
15. Chittamuru, S.V.R., Thakkar, I., Pasricha, S.: HYDRA: heterodyne crosstalk mitigation with double microring resonators and data encoding for photonic NoC. IEEE Trans. Very Large Scale Integr. Syst. 26(1), 168–181 (2018)
16. Chittamuru, S.V.R., Desai, S., Pasricha, S.: SWIFTNoC: a reconfigurable silicon-photonic network with multicast enabled channel sharing for multicore architectures. ACM J. Emerg. Technol. Comput. Syst. 13(4), 1–27 (2017)
17. Chittamuru, S.V.R., Pasricha, S.: Crosstalk mitigation for high-radix and low-diameter photonic NoC architectures. IEEE Des. Test. 32(3) (2015)
18. Thakkar, I., Chittamuru, S.V.R., Pasricha, S.: Mitigating the energy impacts of VBTI aging in photonic networks-on-chip architectures with multilevel signaling. IEEE workshop on energy-efficient networks of computers (E2NC), 2018.
19. Pasricha, S., Chittamuru, S.V.R., Thakkar, I., Bhat, V.: Securing photonic NoC architectures from hardware Trojans. In: IEEE/ACM international symposium on networks-on-chip (NOCS), 2018
20. Chittamuru, S.V.R., Thakkar, I., Pasricha, S.: SOTERIA: exploiting process variations to enhance hardware security with photonic NoC architectures. IEEE/ACM design automation conference (DAC), 2018.
21. Thakkar, I., Chittamuru, S.V.R., Pasricha, S.: Improving the reliability and energy-efficiency of high-bandwidth photonic NoC architectures with multilevel signaling. IEEE/ACM international symposium on networks-on-chip (NOCS), 2017.
22. Chittamuru, S.V.R., Thakkar, I., Pasricha, S.: Analyzing voltage bias and temperature induced aging effects in photonic interconnects for manycore computing. ACM system level interconnect prediction workshop (SLIP), 2017.
23. Dang, D., Chittamuru, S.V.R., Mahapatra, R.N., Pasricha, S.: Islands of heaters: a novel thermal management framework for photonic NoCs. IEEE/ACM Asia & South Pacific design automation conference (ASPDAC), 2017.
24. Thakkar, I., Chittamuru, S.V.R., Pasricha, S.: A comparative analysis of front-end and back-end compatible silicon photonic on-chip interconnects. ACM/IEEE system level interconnect prediction workshop (SLIP), 2016.
25. Thakkar, I., Chittamuru, S.V.R., Pasricha, S.: Run-time laser power management in photonic NoCs with on-chip semiconductor optical amplifiers. IEEE/ACM international symposium on networks-on-chip (NOCS), 2016.
26. Chittamuru, S.V.R., Thakkar, I., Pasricha, S.: PICO: mitigating Heterodyne crosstalk due to process variations and intermodulation effects in photonic NoCs. IEEE/ACM design automation conference (DAC), 2016.


27. Chittamuru, S.V.R., Thakkar, I., Pasricha, S.: Process variation aware crosstalk mitigation for DWDM based photonic NoC architectures. IEEE international symposium on quality electronic design (ISQED), 2016.
28. Chittamuru, S.V.R., Pasricha, S.: SPECTRA: a framework for thermal reliability management in silicon-photonic networks-on-chip. IEEE international conference on VLSI design (VLSI), 2016.
29. Pasricha, S., Nikdast, M.: A survey of silicon photonics for energy efficient Manycore computing. IEEE Des. Test. 37(4) (2020)
30. Miller, D.A.: Silicon photonics: meshing optics with applications. Nat. Photonics. 11(7), 403–404 (2017)
31. Shen, Y., Harris, N.C., Skirlo, S., Prabhu, M., Jones, T.B., Hochberg, M., Sun, X., Zhao, S., Larochelle, H., Englund, D., Soljacic, M.: Deep learning with coherent nanophotonic circuits. Nat. Photonics. 11(7), 441–446 (2017)
32. Zortman, W.A., Trotter, D.C., Watts, M.R.: Silicon photonics manufacturing. Opt. Express. 18(23) (2010)
33. Sunny, F., Mirza, A., Nikdast, M., Pasricha, S.: CrossLight: a cross-layer optimized silicon photonic neural network accelerator. ACM/IEEE DAC. (2021)
34. Sunny, F., Taheri, E., Nikdast, M., Pasricha, S.: A survey on silicon photonics for deep learning. ACM J. Emerg. Technol. Comput. Syst. 17(4), 1–57 (2021)
35. Tait, A.N., De Lima, T.F., Zhou, E., Wu, A.X., Nahmias, M.A., Shastri, B.J., Prucnal, P.R.: Neuromorphic photonic networks using silicon photonic weight banks. Sci. Rep. 7(1) (2017)
36. Bangari, V., Marquez, B.A., Miller, H., Tait, A.N., Nahmias, M.A., De Lima, T.F., Peng, H.T., Prucnal, P.R., Shastri, B.J.: Digital electronics and analog photonics for convolutional neural networks (DEAP-CNNs). IEEE JQE. 26(1) (2020)
37. Liu, W., Liu, W., Ye, Y., Lou, Q., Xie, Y., Jiang, L.: HolyLight: a Nanophotonic accelerator for deep learning in data centers. In: IEEE/ACM DATE (2019)
38. Shiflett, K., Wright, D., Karanth, A., Louri, A.: PIXEL: photonic neural network accelerator. In: HPCA 2020
39. Zhao, Z., Liu, D., Li, M., Ying, Z., Zhang, L., Xu, B., Yu, B., Chen, R.T., Pan, D.Z.: Hardware-software co-design of slimmed optical neural networks. In: IEEE/ACM ASPDAC (2019)
40. Mourgias-Alexandris, G., Totovic, A., Tsakyridis, A., Passalis, N., Vyrsokinos, K., Tefas, A., Pleros, N.: Neuromorphic photonics with coherent linear neurons using dual-IQ modulation cells. JLT. 38(4), 811–819 (2020)
41. Pask, C.: Generalized parameters for tunneling ray attenuation in optical fibers. J. Opt. Soc. Am. 68(1), 110–116 (1978)
42. Pintus, P., Hofbaurer, M., Manganelli, C.L., Fournier, M., Gundavarapu, S., Lemonnier, O., Gambini, F.: PWM-driven thermally tunable silicon microring resonators: design, fabrication, and characterization. L&P Rev. 13(9) (2019)
43. Bogaerts, W., Heyn, P.D., Vaerenburgh, T.V., De Vos, K., Selvaraj, S.K., Claes, T., Dumon, P., Bienstman, P., Thourhout, D.V., Baets, R.: Silicon microring resonators. L&P Rev. 6(1) (2012)
44. Nikdast, M., Nicolescu, G., Trajkovic, J., Liboiron-Ladouceur, O.: Chip-scale silicon photonic interconnects: a formal study on fabrication non-uniformity. JLT. 34(16), 3682–3695 (2016)
45. Stefan, A., Stoferie, T., Marchiori, C., Caimi, D., Czornomaz, L., Stuckelberger, M., Sousa, M., Offrein, B.J., Fompeyrine, J.: A hybrid barium titanate–silicon photonics platform for ultraefficient electro-optic tuning. JLT. 34(8), 1688–1693 (2016)
46. Ansys Lumerical Inc.: Ansys Lumerical HEAT. [Online]. Available: https://www.ansys.com/products/photonics/heat
47. Lu, L., Li, X., Gao, W., Li, X., Zhou, L., Chen, J.: Silicon non-blocking 4×4 optical switch chip integrated with both thermal and electro-optic tuners. IEEE Photonics. 11(6) (2019)
48. Milanizadeh, M., Aguiar, D., Melloni, A., Morichetti, F.: Canceling thermal cross-talk effects in photonic integrated circuits. JLT. 37(4), 1325–1332 (2019)
49. De, S., Das, R., Varshney, R.K., Schneider, T.: Design and simulation of thermo-optic phase shifters with low thermal crosstalk for dense photonic integration. In: IEEE Access, vol. 8, (2020)


50. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. In: Proceedings of the IEEE (1998)
51. QKeras.: https://github.com/google/qkeras
52. Frandsen, L.H., Ingo Borel, P., Zhuang, Y.X., Harpøth, A., Thorhauge, M., Kristensen, M., Bogaerts, W., Dumon, P., Baets, R., Wiaux, V., Woulters, J.: Ultralow-loss 3-dB photonic crystal waveguide splitter. Opt. Lett. 29(14) (2004)
53. Tu, Y.C., Fu, P.H., Huang, D.W.: High-efficiency ultra-broadband multi-tip edge couplers for integration of distributed feedback laser with silicon-on-insulator waveguide. IEEE Photonic J. 11(4) (2019)
54. Bahirat, S., Pasricha, S.: OPAL: a multi-layer hybrid photonic NoC for 3D ICs. In: IEEE/ACM ASPDAC (2011)
55. Jayatileka, H., Caverley, M., Jaeger, N.A.F., Shekhar, S., Chrotowski, L.: Crosstalk limitations of Microring-Resonator based WDM Demultiplexers on SOI. In: OIC 2015
56. Timurdogan, E., Sorace-Agaskar, C.M., Hosseini, E.S., Leake, G., Coolbaugh, D.D., Watts, M.R.: Vertical junction silicon microdisk modulator with integrated thermal tuner. In: CLEO:Science and Innovations, OSA (2013)
57. Güngördü, A.D., Dündar, G., Yelten, M.B.: A high performance TIA Design in 40 nm CMOS. In: IEEE ISCAS (2020)
58. Wang, B., Huang, Z., Sorin, W.V., Zeng, X., Liang, D., Fiorentino, M., Beausoleil, R.G.: A low-voltage Si-Ge avalanche photodiode for high-speed and energy efficient silicon photonic links. JLT. 38(12), 3156–3163 (2020)
59. Duong, L.H.K., Nikdast, M., Le Beux, S., Xu, J., Wu, X., Wang, Z., Yang, P.: A case study of signal-to-noise ratio in ring based optical networks-on-chip. IEEE Des. Test. 31(5) (2014)
60. Capra, M., Bussolino, B., Marchisio, A., Shafique, M., Masera, G., Martina, M.: An updated survey of efficient hardware architectures for accelerating deep convolutional neural networks. In: Future Internet 2020
61. Pisati, M., De Bernardinis, F., Pascale, P., Nani, C., Sosio, M., Pozzati, E., Ghittori, N., Magni, F., Garampazzi, M., Bollati, G., Milani, A., Minuti, A., Giunco, F., Uggetti, P., Fabiano, I., Codega, N., Bosi, A., Carta, N., Pellicone, D., Spelgatti, G., Cutrupi, M., Rossini, A., Massolini, R., Cesura, G., Bietti, I.: A sub-250 mW 1-to-56Gb/s continuous-range PAM-4 42.5 dB IL ADC/DAC-based transceiver in 7 nm FinFET. In: IEEE ISSCC 2019

Hardware–Software Co-design of Deep Neural Architectures: From FPGAs and ASICs to Computing-in-Memories
Zheyu Yan, Qing Lu, Weiwen Jiang, Lei Yang, X. Sharon Hu, Jingtong Hu, and Yiyu Shi

1 Introduction
Deep neural networks (DNNs) have become one of the most promising candidates for various perception tasks, including object detection, medical image segmentation, and speech recognition. Thus, it is natural to implement DNNs on embedded systems that require such perception tasks. However, typical DNNs have millions of weights and require billions of operations, which typical embedded processors cannot handle because they have a low power budget and can perform only a limited number of operations per unit time. If DNN models are naively deployed on these embedded processors, the perception task cannot be finished in real time, which harms the effectiveness of the whole system. There are mainly two directions of research targeting this issue. First, from the software perspective, DNN designers offer simpler DNNs so that they can be used on embedded processors. These DNNs achieve performance comparable to that of their larger counterparts but have far fewer parameters and require far fewer operations. For example, MobileNet [13] has only 3% of the weights and requires only 4% of the operations of the then state-of-the-art VGG16 [34], while achieving only 0.9% lower accuracy on ImageNet

Z. Yan · Q. Lu · X. S. Hu · Y. Shi (✉)
University of Notre Dame, Notre Dame, IN, USA
e-mail: [email protected]
W. Jiang
George Mason University, Fairfax, VA, USA
L. Yang
The University of New Mexico, Albuquerque, NM, USA
J. Hu
University of Pittsburgh, Pittsburgh, PA, USA
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
S. Pasricha, M. Shafique (eds.), Embedded Machine Learning for Cyber-Physical, IoT, and Edge Computing, https://doi.org/10.1007/978-3-031-39932-9_11


Z. Yan et al.

classification. Second, from the hardware perspective, embedded system designers provide more powerful and energy-efficient processors specially designed for DNNs. This means designers can run more complex DNNs on embedded processors. For example, the Eyeriss [5] processor can run AlexNet in 120 ms with peak power under 280 mW. However, these designs are not optimal. Each direction of research considers only one part of the design, so an optimal pair of DNN and hardware design is not found. Finding the optimal design pair is not an easy task because the design space to consider grows exponentially. Fortunately, neural architecture search (NAS) has been proposed to tackle this issue. NAS can search through a large search space and keeps updating its search strategy based on the design pairs it finds. With proper hyperparameter settings and a sufficient amount of search time, NAS can automatically find the optimal DNN model-embedded processor design pair in the designated design space. In this chapter, we introduce three lines of work that use NAS to perform hardware–software co-design of embedded DNNs. These works cover three different target hardware platforms: (1) the well-established field programmable gate array (FPGA) platforms [19], (2) the now hotly discussed application-specific integrated circuit (ASIC) platforms [43], and (3) the future-proof compute-in-memory (CiM) platforms [18]. We show that hardware–software co-design can provide better Pareto-optimal DNN-embedded processor pairs than previous state-of-the-art designs. It also offers a more flexible tradeoff scheme for designers who want to focus on one particular design target.

2 Hardware–Software Co-design with Neural Architecture Search
Hardware–software co-design of DNN models and embedded processors incurs a large design space in which manual search is infeasible. Design automation techniques are needed to search through such design spaces automatically and find the optimal solution. One of the most widely used methods is neural architecture search (NAS). NAS targets the following problem: given a finite, discrete design space, find the design in it that offers the highest efficiency, i.e., the optimal design. A typical NAS framework, such as that in [49], is composed of a controller and an evaluator. The search is guided by the controller, which iteratively predicts a design and updates itself according to the efficiency of this design so that it can predict better designs. The evaluator is used to evaluate the efficiency of a given design. The controller update can be implemented by different techniques, such as reinforcement learning or evolutionary algorithms. Here we introduce the widely used reinforcement learning method. Typically, a recurrent neural network (RNN) is implemented in the controller to predict the hyperparameters of a child network. There are generally three kinds of hyperparameters: architecture parameters (e.g., the number of channels for each layer), the quantization parameters

HW-SW Co-design of DNN Architectures: From FPGAs and ASICs to CiM


Fig. 1 Overview of hardware–software co-exploration NAS. The red rectangles convey the metrics that can be optimized in the exploration [19]

(e.g., the bit width of the integer and fraction parts), and circuit/device parameters (e.g., which device to use). All possible combinations of these parameters form the state space in reinforcement learning. In each iteration, the RNN predicts a set of hyperparameters, which is the action in reinforcement learning. At the end of an iteration, the RNN is updated according to the reward so that it predicts better designs. The update procedure is the interaction of the controller with the environment, which is modeled as a Markov Decision Process (MDP) for optimization. Specifically, the Monte Carlo policy gradient algorithm [38] is employed:

\[
\nabla J(\theta) = \frac{1}{m} \sum_{k=1}^{m} \sum_{t=1}^{T} \gamma^{T-t} \, \nabla_{\theta} \log \pi_{\theta}\left(a_t \mid a_{(t-1):1}\right) (R_k - b), \tag{1}
\]

where $m$ is the batch size and $T$ is the total number of steps in each episode. The rewards are discounted at every step by an exponential factor $\gamma$, and the baseline $b$ is the exponential moving average of the rewards. Specifically, as shown in Fig. 1, a NAS framework for hardware–software co-design consists of four steps: (1) Prediction: The controller predicts a design pair of DNN model and hardware design; (2) HW eval: The evaluator evaluates the hardware efficiency (e.g., energy consumption, latency, peak power) of this design pair; (3) DNN eval: The evaluator trains the predicted DNN model to collect its performance on the target task; (4) Update: The controller summarizes the hardware efficiency and model performance into a “reward” using a reward function and updates itself according to the reward. These four steps are repeated until the controller converges, i.e., it keeps predicting the same architecture. It is also worth noting that the evaluation results of steps (2) and (3) are approximate for efficiency reasons. More specifically, evaluating a single design choice should not take too long because typically thousands of design choices are explored. Typically, estimation and modeling-based methods rather than cycle-accurate simulations are used in step (2). If the hardware efficiency does not satisfy a user-defined constraint, step (3) can be skipped because, no matter how good the DNN performance is, the design is invalid. In step (3), the DNN model might be trained for fewer iterations than needed for convergence. Thus, after the search process, several (e.g., 5) design choices are selected as candidates. These candidates are trained to convergence and evaluated with a cycle-accurate tool so that the actual best design is selected.
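The four-step loop and the policy gradient update of Eq. (1) can be illustrated with a toy search. The sketch below is not the controller used in the works discussed here: it assumes independent softmax logits per decision step instead of an RNN, and the search space, the latency model, the accuracy proxy, and the reward weighting are all made-up placeholders.

```python
import math
import random

# Hypothetical search space: one list of options per controller step.
SPACE = [
    [16, 32, 64],   # channels of layer 1 (architecture parameter)
    [16, 32, 64],   # channels of layer 2 (architecture parameter)
    [4, 8, 16],     # weight bit width (quantization parameter)
]
LATENCY_LIMIT = 1.0  # user-defined hardware constraint, arbitrary units


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hw_eval(design):
    """Step 2 stand-in: an analytical estimate, not a cycle-accurate run."""
    c1, c2, bits = design
    return (c1 * c2 * bits) / 32768.0


def dnn_eval(design):
    """Step 3 stand-in: a made-up accuracy proxy instead of real training."""
    c1, c2, bits = design
    return 0.5 + 0.1 * math.log2(c1 * c2) / 12 + 0.02 * math.log2(bits)


def reward_of(design):
    latency = hw_eval(design)
    if latency > LATENCY_LIMIT:   # invalid design: DNN eval is skipped
        return -1.0
    return dnn_eval(design) - 0.1 * latency   # placeholder reward function


def search(episodes=300, batch=8, lr=0.5, gamma=0.99, ema=0.9, seed=0):
    rng = random.Random(seed)
    logits = [[0.0] * len(opts) for opts in SPACE]
    baseline = 0.0                # b: exponential moving average of rewards
    steps = len(SPACE)            # T
    for _ in range(episodes):
        grads = [[0.0] * len(opts) for opts in SPACE]
        rewards = []
        for _ in range(batch):    # m sampled designs per update
            # Step 1, Prediction: sample one action per step.
            probs = [softmax(row) for row in logits]
            actions = [rng.choices(range(len(p)), weights=p)[0] for p in probs]
            design = tuple(SPACE[t][a] for t, a in enumerate(actions))
            r = reward_of(design)  # Steps 2 and 3
            rewards.append(r)
            for t, a in enumerate(actions):
                # gamma^(T - t) with 1-based t, times (R_k - b).
                weight = gamma ** (steps - (t + 1)) * (r - baseline)
                for j, p in enumerate(probs[t]):
                    # Gradient of log softmax w.r.t. logit j: 1[j == a] - p_j.
                    grads[t][j] += weight * ((1.0 if j == a else 0.0) - p)
        # Step 4, Update: gradient ascent on the batch-averaged estimate.
        for t in range(steps):
            for j in range(len(logits[t])):
                logits[t][j] += lr * grads[t][j] / batch
        baseline = ema * baseline + (1 - ema) * sum(rewards) / batch
    return tuple(
        opts[max(range(len(row)), key=lambda j: row[j])]
        for opts, row in zip(SPACE, logits)
    )


if __name__ == "__main__":
    print("most likely design:", search())
```

In a real flow, `hw_eval` would be replaced by an estimation model for the target platform and `dnn_eval` by short training runs, with the few best candidates retrained to convergence afterwards as described above.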

3 Hardware-Aware Neural Architecture Search for FPGA
FPGA is one of the most popular platforms for implementing deep neural networks (DNNs) because of its programmability, high performance, and energy efficiency, especially in low-batch inference for embedded applications. Unlike designing DNNs for GPUs, where the underlying hardware design is fixed, designing DNNs for FPGAs also opens up the opportunity for hardware architecture design. However, the design freedom is still not unlimited because the hardware resources available on a given FPGA platform are fixed. The design space can be opened up further when the design permits using multiple FPGAs for the target task. In this section, we introduce a hardware–software co-design work that automatically finds an optimal DNN-FPGA design pair that provides both a high-inference-accuracy DNN and a hardware-efficient design [19]. For easier understanding, we name this work NAS-F for the rest of this section. In hardware-related designs, multiple metrics, including peak power, energy consumption, and inference latency, are all important factors that need to be optimized. Solving an optimization problem with so many objectives is not an easy task. By analyzing the key properties of FPGA-based designs, the authors of NAS-F show that, in multi-FPGA systems, pipeline efficiency is a representative metric for these hardware-related metrics because it determines the hardware utilization as well as the energy efficiency. Thus, the co-exploration problem is reduced to a bi-objective optimization problem: find a design that offers both high accuracy and high pipeline efficiency. To acquire more realistic results, NAS-F also requires a throughput of at least 30 FPS. Experimental results show that NAS-F can significantly push forward the Pareto frontier. On ImageNet, NAS-F can identify architecture and hardware pairs that achieve the same accuracy, 35.42% higher throughput, and 54.05% higher energy efficiency with reduced search time, compared with the hardware-aware NAS.

3.1 Implementation of DNNs on FPGAs
FPGA has demonstrated its excellent ability to achieve high performance and energy efficiency for low-batch real-time inference. Hence, a large body of work has implemented neural networks on FPGAs, in which tools are developed to


Fig. 2 An overview of implementing a child network onto multiple FPGAs to be organized in a pipelined fashion

automatically design accelerators on FPGAs for a given network architecture. Since then, implementations on multiple FPGAs have become mainstream [11, 12, 45], because the limited resources of a single FPGA become the performance bottleneck. To fully utilize the computation power provided by multiple FPGAs, a typical technique is to implement the neural network on multiple FPGAs in a pipelined fashion. Figure 2 demonstrates one such example, in which a 5-layer network is partitioned into 3 pipeline stages, and each pipeline stage is mapped to a certain FPGA in an available pool. Finally, those FPGAs are connected as a linear array to function in a pipelined fashion. Given a neural network $C$ identified by the NAS-F framework (also called the child network), $C$ is mapped to multiple FPGAs by the following steps:

➁ Partition Child Network to Pipeline Stages. Let $P(C) = \{P_1, P_2, \ldots, P_M\}$ be a set of partitions for the child network $C$, where each $P_i$ is a nonempty subset of the layer set $L$. The following two properties hold: (1) $\bigcup_{P_i \in P(C)} P_i = L$; and (2) $\forall P_i, P_j \in P(C)$, if $i \neq j$, then $P_i \cap P_j = \emptyset$. After the partitioning, each set in $P(C)$ corresponds to a pipeline stage. For example, in Fig. 2 ➁, the given child network is partitioned into 3 pipeline stages, $P_1 = \{l_1\}$, $P_2 = \{l_2, l_3\}$, and $P_3 = \{l_4, l_5\}$.

➂ Assign Pipeline Stages to FPGAs. Each pipeline stage is then assigned to a specific FPGA in an available FPGA pool, as shown in Fig. 2 ➂. An FPGA pool with $n$ FPGAs can be represented by a set $F = \{f_0, f_1, \ldots, f_n\}$. Each FPGA $f_i$ has a set of attributes, including memory $mem_i$, DSP slices $dsp_i$, etc. These attributes are used to model the timing performance of a child network.

➃ Pipelined FPGAs. The pipelined execution of multiple FPGAs is illustrated in Fig. 2 ➃. The system continuously obtains inputs from the dataset at a fixed rate (frames per second) and generates output data from the last pipeline stage. The input rate of the system reflects the throughput specification $TS$, which implies that the latency of each pipeline stage should be no more than $1/TS$.
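The two partition properties and the per-stage latency bound can be checked mechanically. The following sketch uses the three-stage example of Fig. 2; the function names, the per-stage latencies, and the 30 FPS target are illustrative assumptions rather than values reported for NAS-F.

```python
def is_valid_partition(stages, layers):
    """Check that the stages are nonempty, pairwise disjoint, and cover L."""
    seen = set()
    for stage in map(set, stages):
        if not stage or seen & stage:
            return False          # empty stage, or a layer placed twice
        seen |= stage
    return seen == set(layers)    # property (1): the union equals L


def meets_throughput(stage_latencies, ts):
    """Each pipeline stage must finish within one input period, 1 / TS."""
    period = 1.0 / ts
    return all(latency <= period for latency in stage_latencies)


# The example from Fig. 2: five layers split into three pipeline stages.
LAYERS = ["l1", "l2", "l3", "l4", "l5"]
STAGES = [{"l1"}, {"l2", "l3"}, {"l4", "l5"}]

# Hypothetical per-stage latencies in seconds, one per assigned FPGA.
LATENCY_S = [0.020, 0.030, 0.025]

if __name__ == "__main__":
    print(is_valid_partition(STAGES, LAYERS))
    print(meets_throughput(LATENCY_S, ts=30))
```

With these made-up latencies, the slowest stage takes 30 ms, so the pipeline sustains 30 FPS (period of about 33.3 ms) but would miss a 35 FPS target.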

3.2 Co-design Framework for FPGAs

3.2.1 Problem Statement and Solution

The problem in NAS-F can be defined as follows: given a dataset, a pool of FPGAs $F$, and a throughput specification $TS$, co-explore the architecture search space and the hardware design space to find a child network $C$ with:
• para: the parameters of all layers in the child network
• $P$: the partition of the layer set $L$ in the child network
• $\alpha$: the assignment of pipeline stages to the set $F$
such that the accuracy of child network $C$ is maximized, the pipelined FPGA system meets the required throughput $TS$, and the average utilization of all FPGAs is maximized.
This problem can be solved using the co-design framework introduced in Sect. 2. Recall that the framework consists of four steps: Prediction, HW eval, DNN eval, and Update. The Prediction, DNN eval, and Update steps are the same as in the general framework. For HW eval, the authors of NAS-F develop an estimation tool to evaluate model latency and pipeline efficiency. It is worth noting that they also implement the early stopping discussed in Sect. 2: if the hardware efficiency metrics do not meet a user-defined constraint, DNN eval is skipped and a total reward of $-1$ is reported. In this framework, the HW eval step is called Fast Exploration, and the DNN eval step is called Slow Exploration because DNN evaluation generally includes DNN training, which is very time-consuming. Specifically, in the HW eval step, the latency of a pipeline stage under an assignment function can be easily captured with a performance model [44]. For FPGA $f_i$, its latency is denoted as $Lat_i$. Pipeline efficiency is defined as a function of the hardware utilization in each pipeline stage (corresponding to an FPGA). The utilization of FPGA $f_i$ is equal to $Lat_i \times TS$. Higher utilization of an FPGA indicates less idle time in processing and higher energy efficiency. Therefore, high average utilization of all FPGAs is always desired.
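The fast and slow exploration levels can be sketched as follows. The per-FPGA utilization follows the definition above; taking pipeline efficiency to be the plain average of the utilizations, the efficiency threshold, and the equal weighting of accuracy and efficiency in the reward are assumptions made for illustration, since the text does not give the exact reward function.

```python
def utilizations(stage_latencies, ts):
    """Utilization of FPGA f_i is Lat_i * TS, the busy share of each period."""
    return [latency * ts for latency in stage_latencies]


def pipeline_efficiency(stage_latencies, ts):
    """Assumed aggregate: the plain average of the per-FPGA utilizations."""
    values = utilizations(stage_latencies, ts)
    return sum(values) / len(values)


def episode_reward(stage_latencies, ts, min_efficiency, train_child):
    """Fast exploration first; pay for DNN training only if it passes."""
    values = utilizations(stage_latencies, ts)
    efficiency = sum(values) / len(values)
    # A utilization above 1 means the stage misses the period 1 / TS.
    if max(values) > 1.0 + 1e-9 or efficiency < min_efficiency:
        return -1.0               # early stop: DNN eval is skipped
    accuracy = train_child()      # slow exploration (time-consuming)
    return 0.5 * accuracy + 0.5 * efficiency   # hypothetical weighting


if __name__ == "__main__":
    latencies = [0.020, 0.030, 0.025]   # hypothetical, in seconds
    print(utilizations(latencies, ts=30))
    print(episode_reward(latencies, 30, 0.7, train_child=lambda: 0.9))
```

Passing the training routine as a callable makes the early stop explicit: when the hardware check fails, `train_child` is never invoked.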

3.3 Experiments
The NAS-F framework is effective in various tasks. The authors discussed the performance of NAS-F in image classification tasks for the CIFAR-10 and ImageNet datasets (Fig. 3).


Fig. 3 An overview of NAS-F. (1) Prediction: An RNN-based controller predicts the hyperparameters in a child network; (2) HW eval: After evaluating the hardware performance of a child network, the fast exploration level prunes child networks with inferior hardware utilization; (3) DNN eval: The slow exploration level updates controller using hardware utilization and accuracy obtained by training child networks

3.3.1 Search Space Setup

The authors of NAS-F consider an underlying hardware platform with up to three Xilinx FPGAs (XC7Z015), each of which contains 74K logic cells, 4.9 Mb on-chip memory, and 150 DSP Slices. This type of FPGA device provides high-speed serial communication (up to 16.8 Gbps of bandwidth), which is suitable for multi-FPGA pipelines where communications between FPGAs can be a bottleneck. In terms of DNN architecture designs, for CIFAR-10 classification, a convolutional architecture backbone is used. For each convolution layer, three design parameters are considered: filter size (the number of output channels), kernel size, and stride. The filter size can be chosen in [24,36,48,64], the kernel size in [1,3,5,7], and strides in [1,2]. After each layer, the rectified linear units (ReLU) and batch normalization (BN) layers are appended. For ImageNet, the architecture repeats mobile inverted bottleneck convolution layers instead of ordinary convolutional ones, same as that in [4]. The design choices include kernel sizes [3,5,7], strides


Z. Yan et al.

[1,2], and expansion ratios [3,6]. For both tasks, the activations and weights for DNN models are quantized to 16 bits.
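To give a sense of how quickly these per-layer choices multiply, the following sketch enumerates the options listed above. The option lists are taken from the text; the number of layers passed to `search_space_size` is a free parameter here, since the depth is chosen by the search, and the helper names are illustrative.

```python
from itertools import product
from typing import Dict, List

# Per-layer options for the CIFAR-10 convolutional backbone (from the text).
CIFAR10_LAYER_CHOICES: Dict[str, List[int]] = {
    "filters": [24, 36, 48, 64],
    "kernel": [1, 3, 5, 7],
    "stride": [1, 2],
}

# Per-block options for the ImageNet mobile inverted bottleneck backbone.
IMAGENET_BLOCK_CHOICES: Dict[str, List[int]] = {
    "kernel": [3, 5, 7],
    "stride": [1, 2],
    "expansion": [3, 6],
}


def per_layer_configs(choices: Dict[str, List[int]]) -> List[Dict[str, int]]:
    """Enumerate every combination of options for a single layer."""
    keys = list(choices)
    return [dict(zip(keys, combo)) for combo in product(*(choices[k] for k in keys))]


def search_space_size(choices: Dict[str, List[int]], num_layers: int) -> int:
    """Number of architectures if every layer is chosen independently."""
    return len(per_layer_configs(choices)) ** num_layers


if __name__ == "__main__":
    print(len(per_layer_configs(CIFAR10_LAYER_CHOICES)))   # options per conv layer
    print(len(per_layer_configs(IMAGENET_BLOCK_CHOICES)))  # options per block
    print(search_space_size(CIFAR10_LAYER_CHOICES, num_layers=10))
```

Even before hardware parameters are added, the count grows exponentially with depth, which is why the controller samples the space instead of enumerating it.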

3.4 Comparison Results with the Existing NAS Frameworks NAS-F outperforms the NAS frameworks that existed at the time under the same setting, i.e., exploring 10,000 episodes and obtaining the accuracy of 16 child networks in each episode. The baseline, hardware-aware NAS, is discussed in [49]. Unlike NAS-F, hardware-aware NAS does not explore different accelerator designs, as shown in Fig. 2a. For a fairer comparison, the final architectures obtained by hardware-aware NAS are then optimized for hardware implementation to achieve a better design in terms of hardware efficiency. This approach is denoted as “Sequential Optimization” in the results. Figure 4 reports the design space exploration assuming the hardware design space contains up to (a) two FPGAs or (b) three FPGAs. The x-axis and y-axis represent the accuracy and pipeline efficiency, respectively. For clarity, only the architectures whose pipeline efficiency is no less than 85% for two FPGAs (Fig. 4a) and no less than 75% for three FPGAs (Fig. 4b) are included. The red lines represent the Pareto frontiers explored by NAS-F, while the green lines represent the frontiers obtained by hardware-aware NAS. These figures show that, by exploring the hardware design space, NAS-F significantly pushes forward the Pareto frontier of the tradeoff between accuracy and efficiency.
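For readers unfamiliar with the term, a Pareto frontier over (accuracy, pipeline efficiency) keeps only the designs that no other design matches or beats on both axes. The sketch below is a generic illustration, not the plotting procedure used for Fig. 4; the sample points are the CIFAR-10 accuracy and pipeline-efficiency values from Table 1.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (accuracy %, pipeline efficiency %)


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the points not dominated by any other point (both axes maximized)."""
    front = []
    for p in points:
        dominated = any(
            q != p and q[0] >= p[0] and q[1] >= p[1] for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)


if __name__ == "__main__":
    cifar10_designs = [
        (84.53, 73.20),  # HW-aware
        (84.53, 92.20),  # Sequential
        (80.18, 99.69),  # NAS-F HW
        (85.19, 92.15),  # NAS-F SW
    ]
    print(pareto_front(cifar10_designs))
```

With these four points, only the hardware-aware design drops out, because the sequentially optimized design has the same accuracy and a higher pipeline efficiency.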

Fig. 4 Pareto frontiers between accuracy and pipeline efficiency for hardware-aware NAS and NAS-F (co-exploration), both of which are designed under the timing specification of 35 FPS: (a) designs with 2 FPGAs; (b) designs with 3 FPGAs


Table 1 Comparison among NAS-F, hardware-aware NAS, and sequential optimization on CIFAR-10 and ImageNet datasets

| Dataset | Models | Depth | Parameters (M) | Accuracy (Top1, %) | Accuracy (Top5, %) | Pipeline Eff. (%) | FPS | Energy Eff. (GOPS/W) |
|---|---|---|---|---|---|---|---|---|
| CIFAR-10 | HW-aware | 13 | 0.53 | 84.53 | – | 73.20 | 16.2 | 0.84 |
| CIFAR-10 | Sequential | 13 | 0.53 | 84.53 | – | 92.20 | 29.7 | 1.36 |
| CIFAR-10 | NAS-F HW | 10 | 0.29 | 80.18 | – | 99.69 | 35.5 | 2.55 |
| CIFAR-10 | NAS-F SW | 14 | 0.61 | 85.19 | – | 92.15 | 35.5 | 1.91 |
| ImageNet | HW-aware | 15 | 0.44 | 68.40 | 89.84 | 81.07 | 6.8 | 0.34 |
| ImageNet | Sequential | 15 | 0.44 | 68.40 | 89.84 | 86.75 | 10.4 | 0.46 |
| ImageNet | NAS-F HW | 17 | 0.54 | 68.00 | 89.60 | 96.15 | 12.1 | 1.01 |
| ImageNet | NAS-F SW | 15 | 0.48 | 70.24 | 90.53 | 93.89 | 10.5 | 0.74 |

Table 2 NAS-F uses far fewer GPU hours than hardware-aware NAS, benefiting from the early-stage pruning

| Dataset | Approach | Arch for Training | GPU Hours | Impr. |
|---|---|---|---|---|
| CIFAR-10 | Hardware-aware NAS | 108,000 | 16,586 | 1 |
| CIFAR-10 | NAS-F | 308 | 102 + 1.9 = 103.9 | 159× |
| ImageNet | Hardware-aware NAS | 7263 | 36,315 | 1 |
| ImageNet | NAS-F | 53 | 256 + 1.8 = 266.8 | 136× |

Table 1 reports the comparison results on accuracy, pipeline efficiency, throughput, and energy efficiency on CIFAR-10 and ImageNet. All the architectures identified have fewer than 1M parameters, mainly due to the hardware capacity. This inevitably leads to accuracy loss. However, the architecture explored by OptSW (the NAS-F SW rows in Table 1) still achieves 85.19% test accuracy on CIFAR-10 and 70.24% top-1 accuracy on ImageNet. These results demonstrate the effectiveness of the co-exploration approach in resource-limited scenarios. In addition, OptSW outperforms hardware-aware NAS by achieving 54.37% and 35.24% higher throughput and 56.02% and 54.05% higher energy efficiency on CIFAR-10 and ImageNet, respectively. Compared with sequential optimization, OptSW achieves 16.34% and 28.79% improvements in throughput and energy efficiency on CIFAR-10, respectively; on ImageNet, it slightly improves throughput and achieves a 37.84% improvement in energy efficiency. Finally, Table 2 reports the comparison results on normalized search time between hardware-aware NAS and NAS-F. The results show that NAS-F significantly accelerates the search process, using 159× and 136× fewer GPU hours on CIFAR-10 and ImageNet, respectively. The speedup comes from the efficient early-stage pruning at the fast exploration level. In a search space where most of the designs cannot meet the timing constraint, this early stopping technique significantly reduces the search time wasted on training unwanted DNN models.
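The improvement percentages quoted above can be reproduced from the FPS and energy-efficiency columns of Table 1 if the gain is normalized by the NAS-F value rather than by the baseline. That normalization is inferred from the numbers and is not stated in the text, so the sketch below should be read as a consistency check under that assumption. It takes the NAS-F SW rows to be the OptSW results, since their accuracies match those quoted in the text.

```python
def relative_gain(nas_f: float, baseline: float) -> float:
    """Gain in percent, normalized by the NAS-F value: (nas_f - baseline) / nas_f."""
    return 100.0 * (nas_f - baseline) / nas_f


# (FPS, energy efficiency in GOPS/W) copied from Table 1.
TABLE1 = {
    "CIFAR-10": {"HW-aware": (16.2, 0.84), "Sequential": (29.7, 1.36), "NAS-F SW": (35.5, 1.91)},
    "ImageNet": {"HW-aware": (6.8, 0.34), "Sequential": (10.4, 0.46), "NAS-F SW": (10.5, 0.74)},
}

if __name__ == "__main__":
    for dataset, rows in TABLE1.items():
        fps, eff = rows["NAS-F SW"]
        for name in ("HW-aware", "Sequential"):
            base_fps, base_eff = rows[name]
            print(
                f"{dataset} vs {name}: "
                f"throughput {relative_gain(fps, base_fps):.2f}%, "
                f"energy efficiency {relative_gain(eff, base_eff):.2f}%"
            )
```

Under this normalization the computed values agree with the quoted ones to within about 0.01 percentage points; the small residual differences are consistent with rounding or truncation in the reported figures.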


3.5 Comparison Results with the Existing Architectures The effectiveness of NAS-F can also be shown by comparing it with existing architectures: ProxylessNet [4] and MobileNetV2 [31]. For a fairer comparison, the authors of NAS-F assume the same underlying hardware and the same timing constraints and use the same method as NAS-F to find the optimal FPGA design parameters. As shown in Table 3, compared with the manually designed MobileNetV2, NAS-F OptSW achieves 2.33× and 1.57× improvements in throughput and energy efficiency, respectively, with only a 0.47% drop in top-5 accuracy. Similar results can be observed in the comparison with ProxylessNet. The results show that NAS-F makes a better tradeoff between hardware efficiency and architecture accuracy.

3.6 Importance of Co-exploration Finally, we use Fig. 5, presented in the NAS-F paper, to show the importance of co-exploring the NAS and hardware design spaces instead of (1) using a heuristic that restricts the model size for NAS-only exploration or (2) applying hardware-aware NAS exploration.

Table 3 Comparison with the existing architectures on ImageNet with the timing specification of 10 fps

| Models | Depth | Accuracy (Top-1) | Accuracy (Top-5) | FPS | Energy Eff. |
|---|---|---|---|---|---|
| MobileNetV2 [41] | 18 | 71.80% | 91.00% | 4.5 | 0.47 |
| ProxylessNet [8] | 21 | 74.60% | 92.50% | 3.1 | 0.41 |
| NAS-F HW | 17 | 68.14% | 89.60% | 12.1 | 1.01 |
| NAS-F SW | 15 | 70.24% | 90.53% | 10.5 | 0.74 |

Fig. 5 Percentages of valid architectures for different timing specifications: (a) fixed stride of 1; (b) predictable strides


In Fig. 5a, the x-axis and y-axis represent the model size and the pipeline efficiency. Each point in this figure is a design that is optimized using the algorithm in [16]. For the designs whose model size ranges from 120K to 150K, the optimized hardware efficiency ranges from 1.29 to 98.35%. Moreover, for the much narrower range from 149K to 150K, the efficiency still ranges from 7.02 to 98.35%. These results show that hardware efficiency cannot be guaranteed by restricting the model size alone. In Fig. 5b, NAS-F unveils the fundamental difference between co-exploration and hardware-aware architecture search. In this figure, the black crosses and red circles represent the valid design points in the HW-aware NAS and co-exploration search spaces, respectively. HW-aware NAS has a much narrower search space than the proposed co-exploration approach. In essence, HW-aware NAS prunes architectures that have high accuracy but fail to meet the hardware specifications on a fixed hardware design. By opening the hardware design space, however, it becomes possible to find a tailor-made hardware design for the pruned architectures so that they meet the hardware specifications. Therefore, compared with HW-aware NAS, the co-exploration approach enlarges the search space and, as a result, can make better tradeoffs between accuracy and hardware efficiency.

3.7 Concluding Remarks for NAS-F NAS-F opens up the hardware design freedom in neural network design automation. This is driven by the shift from software and hardware designers working on their own parts in isolation to cross-layer design efforts. NAS-F shows that by jointly exploring DNN architecture designs and the hardware design space, the Pareto frontier of the tradeoff between accuracy and hardware efficiency can be significantly pushed forward. The framework presented in this section is the basis for neural architecture and hardware co-exploration, and various follow-up works build on it: first, co-exploring DNN architectures, quantization, and hardware designs [25]; second, co-exploring DNN architectures and hardware designs for multiple tasks [18]; third, co-exploring DNN architectures, hardware architecture designs, circuit designs, and underlying device choices for compute-in-memory platforms [18]. We introduce the latter two follow-up works in the remainder of this chapter.

4 Co-design of Neural Networks and ASICs As discussed in the previous sections, NAS-based hardware co-exploration targeting general computation platforms and programmable platforms [2, 4, 17–19, 25, 39, 42, 47] opens up the hardware design space, making it possible to jointly identify the architecture and hardware design that maximize network accuracy and hardware efficiency.


On the other hand, among all AI accelerating platforms, application-specific integrated circuits (ASICs), composed of processing elements (PEs) connected in different topologies, can provide incomparable energy efficiency, latency, and form factor [5, 27, 40]. Most existing ASIC accelerators [5, 9, 26], however, target common neural architectures and do not reap the power of NAS. Jointly exploring the DNN architecture search space and the hardware design space can identify better solutions than exploring them separately. However, such a task is quite challenging, primarily due to the large design space of ASICs, where the same set of PEs can constitute numerous topologies (and thus dataflows). Enumeration is simply out of the question. In addition, when ASIC accelerators are deployed on the edge, they usually need to handle multiple tasks involving multiple DNNs. For instance, tasks such as object detection, image segmentation, and classification can be triggered simultaneously on augmented reality (AR) glasses [1], each of which relies on one kind of DNN. Since the DNNs for different tasks can have distinct architectures, one dataflow cannot fit all of them; meanwhile, multiple tasks need to be executed concurrently, which requires task-level parallelism. As such, it is best to integrate multiple heterogeneous sub-accelerators (corresponding to different dataflows) into one accelerator to improve performance and energy efficiency, which has been verified in [23]. Yet this further complicates the design space. In this section, we present NASAIC [43], a work that addresses these challenges by establishing a link between NAS and ASIC accelerator design. Instead of a full-blown exploration of the design space, the authors observe that there already exist a few great ASIC accelerator designs such as ShiDianNao [9], NVDLA [26], and Eyeriss [5].
Each of these designs has its unique dataflow, and the accelerator is determined once the hardware resource associated with the dataflow is given. As such, the authors can create a set of ASIC templates, where each template corresponds to one specific dataflow, so that the design space can be significantly narrowed down to the selection of templates to form a heterogeneous accelerator and the allocation of hardware resources (e.g., the number of PEs and NoC bandwidth) to the selected templates. Based on the template concept, a neural architecture and ASIC design co-exploration framework, named NASAIC, is proposed. Experimental results on a workload with mixed classification and segmentation tasks show that, whereas the solutions generated by successive NAS and ASIC design optimization cannot satisfy the design specs, those from NASAIC are guaranteed to meet the design specs, with 17.77%, 2.49×, and 2.32× reductions in latency, energy, and area, respectively, and with only 0.76% average accuracy loss on these two tasks. Furthermore, compared with hardware-aware NAS for a fixed ASIC design, NASAIC can achieve 3.65% higher accuracy. To the best of the authors’ knowledge, this is the first work on neural architecture and ASIC design co-exploration.


4.1 Problem Analysis for DNN-ASIC Co-design In this section, we first introduce multi-task workloads and heterogeneous accelerators NASAIC is looking into and then discuss the problem formulation of neural architecture and ASIC design co-exploration introduced by the authors of NASAIC.

4.1.1 Major Components

Figure 6 demonstrates an overview of the co-exploration, which involves three exploration layers: ➊ “Application,” ➋ “Accelerator,” and ➌ “Synthesis.” The application layer determines the neural architectures to be applied, while the accelerator layer creates an ASIC template set based on the dataflow style of the existing accelerator designs. Acting as the bridge, the synthesis layer allocates a template together with the resources to each sub-accelerator, then maps, and schedules the network layers to sub-accelerators. In the following text, we will introduce each exploration layer in detail.

➊ Application The application workload considered in NASAIC consists of multiple AI tasks, where each task uses an individual DNN model. A workload with $m$ tasks is defined as $W = \langle T_1, T_2, \cdots, T_m \rangle$. Figure 6 shows an example with two tasks (i.e., $T_1$ for classification and $T_2$ for segmentation). Task $T_i \in W$ corresponds to a DNN architecture $D_i$, which forms a set $D$ with $m$ DNN architectures. A DNN architecture is denoted as $D_i = \langle B_i, L_i, H_i, acc_i \rangle$: $D_i$ is composed of a backbone architecture $B_i$, a set of layers $L_i$, a set of hyperparameters $H_i$, and an accuracy $acc_i$. For example, in Fig. 6, the backbone architecture $B_1$ for classification task $T_1$ is ResNet9 [24], and its hyperparameters include the number of filters ($FN$) and the number of skip layers ($SK$) for each residual block, as shown in Fig. 7 (left), while for $T_2$, the backbone architecture $B_2$ is U-Net [30], whose hyperparameters include the height ($Height$) and the filter numbers ($FN$) for each layer.

Fig. 6 Overview: co-exploration with three layers of optimizations. [Figure: (1) Application: T1, image classification (ResNet candidates), and T2, image segmentation (U-Net candidates); (2) Accelerator: dataflow templates DF1 (ShiDianNao style), DF2 (NVDLA style), and DF3; (3) Synthesis: resource allocator, mapping and scheduling, and the resultant accelerator with sub-accelerators aic1 (DF2), aic2 (DF1), and aic3, each with its number of PEs, NoC bandwidth, and NN layers]

Fig. 7 Left: search spaces for both NAS and ASIC accelerator designs. Right: the resultant heterogeneous ASIC accelerator. [Figure: architecture search space for T1, classification (CIFAR-10), with hyperparameters FN and SK for the ith block, and for T2, segmentation (Nuclei), with Height in [1-5] and FN for the ith layer; design space with dataflows shidiannao, nvdla, and rs, a maximum PE number of 4096, and a maximum bandwidth of 64 GB/s; sub-accelerators aic1 (DF2), aic2 (DF1), and aic3 connected through NICs to a global interconnect and a global buffer, to/from DRAM]

➋ ASIC Accelerator A heterogeneous ASIC accelerator formed by multiple sub-accelerators connected in an NoC topology through NICs is shown in Fig. 7 (right). Define $AIC = \langle aic_1, aic_2, \ldots, aic_k \rangle$ to be a set of $k$ sub-accelerators. A sub-accelerator $aic_i = \langle df_i, pe_i, bw_i \rangle$ has three properties: the dataflow style $df_i$, the number of PEs $pe_i$, and the NoC bandwidth $bw_i$. With a set of predefined dataflow templates to choose from, as shown in Fig. 6, the ASIC design space is significantly narrowed down from choosing specific unrolling, mapping, and data reuse patterns to allocating resources (one template with associated PEs and bandwidth) to each sub-accelerator. Note that, given the template and the mapped network layers, the memory size can be determined so as to support the full use of the hardware, as in [22]. Therefore, memory size is not explored in the search space.

➌ Synthesis Based on the definitions of applications and accelerators, the synthesis optimization consists of three parts: resource allocation, mapping, and scheduling. Resource Allocation In this task, a limit is set on the total amount of resources available (a maximum number of PEs and a maximum bandwidth). Given a set of sub-accelerators, these resources are allocated among all sub-accelerators. Mapping and Scheduling On the software side, each layer of the different networks is mapped to one sub-accelerator, and the execution order of the layers is then determined. The synthesis results can be evaluated via four metrics: accuracy, latency, energy, and area. NASAIC aims to maximize the accuracy of the DNNs under the given design specs on latency (LS), energy (ES), and area (AS).

4.1.2 Problem Definition

NASAIC targets the following problem: given a multi-task workload $W$, the backbone neural architecture for each DNN in set $D$, a set of sub-accelerators $AIC$, a set of dataflow templates $DF$, the maximum number of PEs $NP$ and the maximum bandwidth $BW$, and design specs ($LS$, $ES$, $AS$), determine:
• $nas(D_i)$: the architecture hyperparameters of each DNN $D_i \in D$
• $alloc(aic_k)$: the dataflow and resource allocation for each sub-accelerator $aic_k \in AIC$
• $map(l_{i,j})$ and $sch(aic_k)$: the mapping of network layers to sub-accelerators and their schedule orders
such that the maximum accuracy of the DNNs is achieved while all design specs and resource constraints are met, i.e.,

$$\max\ weighted(D) \quad \text{s.t.} \quad rl \le LS,\quad re \le ES,\quad ra \le AS,\quad \sum_{i=1}^{|AIC|} pe_i \le NP,\quad \sum_{i=1}^{|AIC|} bw_i \le BW,$$

where $rl$, $re$, and $ra$ represent the latency, energy, and area of the resultant accelerator, and $weighted$ is a function, defined in the next section, that combines the accuracy of all networks; it can be a function such as avg (maximize the average accuracy) or min (maximize the minimum accuracy).
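The constraint side of this formulation can be sketched as a simple feasibility check. This is an illustration under stated assumptions, not NASAIC's implementation: the class and function names are hypothetical, the resource constraints are taken to be sums over sub-accelerators, and the spec values in the usage example are made up. The sub-accelerator settings and hardware metrics in the example are those listed for NASAIC on W1 in Table 4.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SubAccelerator:
    dataflow: str     # df_i, e.g. "shi", "dla", or "rs"
    num_pes: int      # pe_i
    bandwidth: float  # bw_i in GB/s


@dataclass
class DesignSpecs:
    latency: float  # LS
    energy: float   # ES
    area: float     # AS


def weighted_accuracy(accuracies: List[float], mode: str = "avg") -> float:
    """Combine per-task accuracies with avg or min, as named in the text."""
    if mode == "avg":
        return sum(accuracies) / len(accuracies)
    if mode == "min":
        return min(accuracies)
    raise ValueError(f"unknown mode: {mode}")


def is_feasible(aic: List[SubAccelerator], rl: float, re: float, ra: float,
                specs: DesignSpecs, max_pes: int, max_bw: float) -> bool:
    """Check the design specs and the PE and bandwidth budgets."""
    meets_specs = rl <= specs.latency and re <= specs.energy and ra <= specs.area
    within_budget = (sum(a.num_pes for a in aic) <= max_pes
                     and sum(a.bandwidth for a in aic) <= max_bw)
    return meets_specs and within_budget


if __name__ == "__main__":
    aic = [SubAccelerator("dla", 576, 56), SubAccelerator("shi", 1792, 8)]
    specs = DesignSpecs(latency=8e5, energy=2e9, area=4e9)  # made-up spec values
    print(is_feasible(aic, rl=7.77e5, re=1.43e9, ra=2.03e9,
                      specs=specs, max_pes=4096, max_bw=64))
    print(weighted_accuracy([92.85, 83.74], mode="avg"))
```

A search loop would discard any candidate for which `is_feasible` is false and rank the rest by `weighted_accuracy`.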

4.2 Co-design Framework for ASIC This problem can be solved using the co-design framework introduced in Sect. 2. Recall that the framework consists of four steps: Prediction, HW eval, DNN eval, and Update. The Prediction and Update steps are the same as in other general methods. For the DNN eval step, multiple DNN models, rather than only one, need to be trained and evaluated, and their performances are combined into a single reward term. For the HW eval step, the authors of NASAIC use an open-source estimation tool. More specifically, the HW eval step targets the following problem: given a set of identified DNN architectures $D$ and a set of determined sub-accelerators $AIC$, both provided by the controller, collect the hardware metrics, including latency $rl$, energy $re$, and area $ra$. NASAIC incorporates the state-of-the-art cost model MAESTRO [22] and a mapping and scheduling algorithm to obtain these metrics. Area $ra$ can be obtained directly from MAESTRO for the given sub-accelerators $AIC$. The latency $rl$ and energy $re$ are determined by the mapping and scheduling. To develop an algorithm for mapping and scheduling, NASAIC needs the latency and energy of each layer on different sub-accelerators. Let $L = \bigcup_{D_k \in D} L_k$ be the layer set. For each pair of a network layer $l_i \in L$ and a sub-accelerator $aic_j \in AIC$, inputting them to MAESTRO yields the latency $l_{i,j}$ and energy $e_{i,j}$. The scheduling and mapping problem can be proved to be equivalent to the traditional heterogeneous assignment problem [15, 33]: given the latency $l_{i,j}$ and energy cost $e_{i,j}$ for each layer $i$ on sub-accelerator $j$, the dependency among layers, and a timing constraint $LS$, determine the mapping and scheduling of each layer on one sub-accelerator such that the energy cost $re$ is minimized while the latency satisfies $rl \le LS$. This problem can be solved using an existing algorithm. Because of the limited space, we do not discuss this algorithm in detail here.
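To make the assignment problem concrete, the sketch below solves a tiny instance by exhaustive enumeration. It is not the algorithm used in NASAIC or in [15, 33], which this chapter does not reproduce. It also makes a strong simplifying assumption: layers run strictly one after another, so the total latency is the sum of per-layer latencies and any parallelism across sub-accelerators is ignored. The cost tables in the example are made up.

```python
from itertools import product
from typing import List, Optional, Tuple

Mapping = Tuple[int, ...]  # mapping[i] = index of the sub-accelerator for layer i


def best_mapping(
    latency: List[List[float]],  # latency[i][j]: layer i on sub-accelerator j
    energy: List[List[float]],   # energy[i][j]: layer i on sub-accelerator j
    latency_spec: float,         # LS
) -> Optional[Tuple[Mapping, float, float]]:
    """Return (mapping, total_energy, total_latency) with minimum energy
    among mappings whose summed latency is within the spec, or None."""
    num_layers = len(latency)
    num_accs = len(latency[0])
    best = None
    # Exhaustive search: num_accs ** num_layers candidates, tiny cases only.
    for mapping in product(range(num_accs), repeat=num_layers):
        total_latency = sum(latency[i][j] for i, j in enumerate(mapping))
        if total_latency > latency_spec:
            continue
        total_energy = sum(energy[i][j] for i, j in enumerate(mapping))
        if best is None or total_energy < best[1]:
            best = (mapping, total_energy, total_latency)
    return best


if __name__ == "__main__":
    # Sub-accelerator 0 is slow but frugal; sub-accelerator 1 is fast but costly.
    latency = [[4, 2], [3, 1], [5, 2]]
    energy = [[1, 3], [2, 5], [1, 4]]
    print(best_mapping(latency, energy, latency_spec=9))
```

In this example the cheapest feasible choice keeps the first two layers on the frugal sub-accelerator and moves only the last layer to the fast one, which is the kind of tradeoff the mapping step has to resolve.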

4.3 Experimental Evaluation Here we show the effectiveness of the NASAIC framework using the experimental results reported by its authors. Different application workloads and hardware configurations are used. The results reported in this section demonstrate that NASAIC can efficiently identify accurate neural architectures together with ASIC accelerator designs that are guaranteed to meet the given design specs while achieving high accuracy for multiple AI tasks.

4.3.1 Evaluation Environment

Application Workloads Typical workloads on AR glasses in applications such as driver assistance or augmented medicine are used to demonstrate the efficacy of NASAIC. In these workloads, the core tasks involve image classification and segmentation, where representative datasets such as CIFAR-10, STL-10 [8], and Nuclei [21] are commonly employed, along with light-weight neural architectures. The following three workloads are synthesized:
• W1: Tasks on one classification dataset (CIFAR-10) and one segmentation dataset (Nuclei)
• W2: Tasks on two classification datasets (CIFAR-10, STL-10)
• W3: Tasks on the same classification dataset (CIFAR-10)
The backbone architectures and their search spaces for the above tasks are defined as follows. For the classification tasks, ResNet9 [24] is selected. Parameter options for each block are depicted in Fig. 7 (left). For the segmentation tasks, U-Net [30] is used as the architecture backbone. The search space for this backbone architecture includes the height and the number of filter channels in each layer, as shown in Fig. 7.

Hardware Configuration Accelerator design includes the allocation of hardware resources to sub-accelerators and the selection of the dataflow for each sub-accelerator. For resource allocation, the maximum number of PEs is set to 4096 and the maximum NoC bandwidth to 64 GB/s, in accordance with [23]. Each sub-accelerator uses one of the following dataflows: ShiDianNao (abbr. shi) [9], NVDLA (abbr. dla) [26], and row-stationary [5] style. In the case where one sub-accelerator has no resource allocation, the design degenerates to a single large accelerator, while in the case where the sub-accelerators have exactly the same allocation, the design degenerates to homogeneous accelerators.


4.4 Design Space Exploration Figures 8, 9, and 10 show the exploration results of NASAIC on the three application workloads. In these figures, the x-axis, y-axis, and z-axis represent latency, energy, and area, respectively. The black diamond indicates the design specs (upper bound); each green diamond is a solution (neural architecture-ASIC design pair) explored by NASAIC; each blue cross is a solution based on the smallest neural network in the search space combined with different ASIC designs (lower bound); and the red star refers to the best solution in terms of the average accuracy explored by NASAIC. The numbers in the blue, green, and red rectangles represent the accuracy of the smallest network, the inferior solutions, and the best solutions, respectively. Several observations can be made from Figs. 8, 9, and 10. First, NASAIC can guarantee that all the explored solutions meet the design specs. Second, the identified solutions have high accuracy. The accuracies on CIFAR-10 of the four solutions are 92.85%, 92.62%, 93.23%, and 91.11%, while the accuracy lower bound from the smallest network is 78.93%. Similar trends can be observed for STL-10 and Nuclei. Third, the best solutions of W1 and W3 identified by NASAIC are quite close to the boundary defined by one of the three design specs, which indicates that in these cases the accuracy is bounded by resources.

Fig. 8 Exploration results obtained by NASAIC with CIFAR-10 and STL-10 datasets (W1)

Fig. 9 Exploration results obtained by NASAIC with CIFAR-10 and Nuclei datasets (W2)

Fig. 10 Exploration results obtained by NASAIC with CIFAR-10 dataset (W3)

4.4.1 Results on Multiple Tasks for Multiple Datasets

Table 4 reports the comparison results on multi-dataset workloads. Two additional approaches are used as baselines. First, “NAS→ASIC” indicates successive NAS [49] and brute-force hardware exploration. Second, in “ASIC→HW-NAS,” a Monte Carlo search with 10,000 runs is first conducted to obtain the ASIC design closest to the design specs. Then, for that specific ASIC design, the hardware-aware NAS [36] is extended to identify the best neural architecture under the design specifications. The results in Table 4 demonstrate that, for the neural architectures identified by NAS, none of the accelerator designs explored by the brute-force approach provides a legal solution that satisfies all design specs. In contrast, for both workloads, NASAIC guarantees that the solutions meet all specs, with average accuracy losses of 0.76% and 1.17%, respectively. For workload W1, NASAIC achieves 17.77%, 2.49×, and 2.32× reductions in latency, energy, and area, respectively, against NAS→ASIC. For workload W2, the reductions are 30.39%, 29.58%, and 30.85%. When comparing NASAIC with ASIC→HW-NAS, even though the solution of the latter is closer to the design specs, for W1, NASAIC achieves 0.87% higher accuracy on CIFAR-10 and similar accuracy on Nuclei; for W2, it achieves 3.65% higher accuracy on STL-10 and similar accuracy on CIFAR-10. These results underscore the importance of co-exploring neural architectures and ASIC designs.

Table 4 Comparison between successive NAS and ASIC design (NAS→ASIC), ASIC design followed by hardware-aware NAS (ASIC→HW-NAS), and NASAIC

| Work. | Approach | Hardware | Dataset | Accuracy | L /cycles | E /nJ | A /μm2 |
|---|---|---|---|---|---|---|---|
| W1 | NAS→ASIC | dla, 2112, 48; shi, 1984, 16 | CIFAR-10; Nuclei | 94.17%; 83.94% | 9.45e5 × | 3.56e9 × | 4.71e9 × |
| W1 | ASIC→HW-NAS | dla, 1088, 24; shi, 2368, 40 | CIFAR-10; Nuclei | 91.98%; 83.72% | 5.8e5 ✓ | 1.94e9 ✓ | 3.82e9 ✓ |
| W1 | NASAIC | dla, 576, 56; shi, 1792, 8 | CIFAR-10; Nuclei | 92.85%; 83.74% | 7.77e5 ✓ | 1.43e9 ✓ | 2.03e9 ✓ |
| W2 | NAS→ASIC | dla, 2368, 56; shi, 1728, 8 | CIFAR-10; STL-10 | 94.17%; 76.50% | 9.31e5 ✓ | 3.55e9 × | 4.83e9 × |
| W2 | ASIC→HW-NAS | dla, 2112, 24; shi, 1536, 40 | CIFAR-10; STL-10 | 92.53%; 72.07% | 9.69e5 ✓ | 2.90e9 ✓ | 3.86e9 ✓ |
| W2 | NASAIC | dla, 2112, 40; shi, 1184, 24 | CIFAR-10; STL-10 | 92.62%; 75.72% | 6.48e5 ✓ | 2.50e9 ✓ | 3.34e9 ✓ |

×: violate design specs; ✓: meet design specs. Each hardware entry lists the dataflow, number of PEs, and bandwidth of one sub-accelerator.
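The reductions and accuracy losses quoted above follow directly from the entries of Table 4, as the short check below illustrates. The helper names are illustrative; percentage reductions are taken relative to the NAS→ASIC baseline and the "×" figures as baseline-to-NASAIC ratios, which is the reading that matches the reported numbers.

```python
def percent_reduction(baseline: float, new: float) -> float:
    """Reduction relative to the baseline, in percent."""
    return 100.0 * (baseline - new) / baseline


def ratio_reduction(baseline: float, new: float) -> float:
    """Reduction expressed as a ratio, e.g. 2.49 means 2.49x smaller."""
    return baseline / new


# Latency (cycles), energy (nJ), and area (um^2) copied from Table 4.
W1_NAS_ASIC = {"latency": 9.45e5, "energy": 3.56e9, "area": 4.71e9}
W1_NASAIC = {"latency": 7.77e5, "energy": 1.43e9, "area": 2.03e9}
W2_NAS_ASIC = {"latency": 9.31e5, "energy": 3.55e9, "area": 4.83e9}
W2_NASAIC = {"latency": 6.48e5, "energy": 2.50e9, "area": 3.34e9}

if __name__ == "__main__":
    print(f"W1 latency: {percent_reduction(W1_NAS_ASIC['latency'], W1_NASAIC['latency']):.2f}%")
    print(f"W1 energy:  {ratio_reduction(W1_NAS_ASIC['energy'], W1_NASAIC['energy']):.2f}x")
    print(f"W1 area:    {ratio_reduction(W1_NAS_ASIC['area'], W1_NASAIC['area']):.2f}x")
    for metric in ("latency", "energy", "area"):
        print(f"W2 {metric}: {percent_reduction(W2_NAS_ASIC[metric], W2_NASAIC[metric]):.2f}%")
    # Average accuracy loss of NASAIC relative to NAS->ASIC on W1.
    w1_loss = ((94.17 - 92.85) + (83.94 - 83.74)) / 2
    print(f"W1 average accuracy loss: {w1_loss:.2f}%")
```

The computed values agree with the quoted ones to within about 0.01, with the small differences again consistent with rounding or truncation.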

4.5 Concluding Remarks for NASAIC NASAIC can co-explore neural architectures and ASIC accelerator designs targeting multiple AI tasks on edge devices. It fills the missing link between NAS and ASIC by creating an accelerator template set in terms of the dataflow style. In addition, a novel multi-task-oriented RNN controller has been developed to simultaneously determine multiple neural architectures under a unified design specification. The efficacy of NASAIC is verified through a set of comprehensive experiments.

5 Co-design of Neural Networks and Computing-in-Memory Accelerators Although hardware–software co-design can find the optimal DNN-hardware design pair for a designated hardware platform, all the works introduced in the previous sections are based on the conventional von Neumann architecture, where data and computation are separated. For each operation to be done, data needed for this operation needs to be fetched elsewhere and the result also needs to be stored, which leads to a significant memory access overhead. For DNN applications, the cost of data movement has become a bottleneck in terms of both latency and energy efficiency. This is an inherent property of the von Neumann architecture, and it cannot be eliminated if the architecture is unchanged. This issue is also called the “memory wall.” Computing-in-memory (CiM) has been proved to be able to effectively transcend the memory wall [14] and has been considered to be a promising candidate for neural network computations due to the incomparable architectural benefits. (i) CiM architecture can benefit from the fixed memory access pattern within neural network computation [37] to execute operations in place. (ii) Emerging devices (e.g., ReRAM, STT-RAM) can be efficiently leveraged in the in-memory computing architecture [32] to provide high performance and energy efficiency. In [3, 20], MOSFET-based in-memory processing has been employed for neural network computation, and the improvement in terms of energy and delay is observed compared with the conventional von Neumann architectures. Research works [7, 32] leverage emerging-device-based in-memory computing schemes to construct crossbar architectures that can perform the matrix multiplication in the analog domain, which further optimizes the computation metrics such as area, energy, and delay. 
Most of the existing works on CiM neural accelerator design simply map classic neural networks (e.g., LeNet, AlexNet) to the CiM platform to evaluate their designs and compare against other counterparts. However, without optimizing the neural architectures, the reported metrics (i.e., accuracy, latency, energy, etc.) may be far from optimal. In this chapter, the authors bring CiM neural accelerator design together with neural architecture search, aiming to automatically identify the best device, circuit, and neural architecture with maximized network accuracy and hardware efficiency. The novel device–circuit–architecture co-exploration brings opportunities to boost performance; however, it also incurs many new challenges. First, unlike the conventional von Neumann architecture-based neural architecture co-exploration [19], the design space of CiM-based neural accelerators spans multiple layers, from device type and circuit topology to neural architecture. Second, limited by the computing capacity of each device cell, quantization is essential to improve hardware efficiency [40, 41, 46]; as such, quantization has to be determined automatically during the search process. Third, in addition to the hardware efficiency goals used in the existing co-exploration frameworks for mobile platforms and FPGAs, CiM has extra objectives, such as minimizing area and maximizing lifetime. Last but not least, emerging devices commonly have non-ideal behaviors (known as device variation); that is, if the trained DNN models are mapped directly to the architecture without considering the device variation, a dramatic accuracy loss will be observed, rendering the architecture useless. The authors of [18] propose a device–circuit–architecture co-exploration framework, named NACIM, to automatically identify the best CiM neural accelerators, including the device type, circuit topology, and neural architecture hyperparameters. A typical hardware–software co-exploration NAS framework is used. By configuring the parameters of the framework, i.e., assigning different importance weights to different figures of merit, designers can customize the optimization goals according to their demands. The authors also take device variation into consideration and propose a noise-aware training scheme that yields models more robust to device variations. Experimental results show that the proposed NACIM framework can find a robust neural network with only 0.45% accuracy loss in the presence of device variation, compared with a 76.44% loss from the state-of-the-art NAS that does not consider device variation. In addition, NACIM can significantly push forward the Pareto frontier in terms of the tradeoff between accuracy and hardware efficiency, achieving up to 16.3 TOPS/W energy efficiency, a 3.17× improvement.

5.1 Compute-in-Memory Neural Accelerators The authors of NACIM choose the crossbar as their computation engine. Because crossbar-based accelerators are unfamiliar to many readers, this section discusses the basics of CiM accelerators from three perspectives: (1) the devices used in the CiM platform and their non-ideal behavior; (2) the general architecture of CiM neural accelerators; (3) NeuroSIM, the simulation framework for CiM neural accelerators used by the authors of NACIM.

5.1.1 Device and Its Variations

Non-volatile devices have been widely adopted in crossbar computations. When a crossbar is used to perform inference, different device implementations lead to distinct energy, latency, etc. Here, two factors are considered: (1) the number of precision levels to which the non-volatile device can be configured and (2) the non-ideal behavior of the devices. Both binary devices and multi-level devices are used in the existing crossbar-based computation platforms. For multi-level devices, there are existing works with 4-bit (i.e., 16-level) devices that show good distinction among different levels [48]. Besides the multi-level devices, binary devices (STT-MRAM, etc.) are also considered. Different kinds of devices may affect the on and off currents of the crossbar computation and ultimately impact delay, energy, etc. Different numbers of levels in these devices also require different peripheral circuitry in the crossbar architecture, which is another design space the authors consider in this chapter.

These emerging devices also suffer from various errors [10]. When the circuitry is used for inference, device-to-device variations could be the dominant error source. The variation can be introduced in the fabrication process and in the device programming phase. The other dominant source of error is noise. Among the noise sources, random telegraph noise (RTN) [10] is a main contributor; it is caused by electrons being temporarily trapped within the device, which in turn changes the effective conductance of the device. Other noise sources include thermal noise and shot noise, but they are typically much smaller than RTN [10]. In NACIM, the device variation is modeled as a whole and represented as a Gaussian distribution. The magnitude of the variation is taken from [48], where the variations are obtained from actual measurements.
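The Gaussian variation model can be illustrated with a small NumPy sketch. This is only a sketch under stated assumptions, not the NACIM implementation: the function name `program_weights`, the uniform mapping of weights onto 16 levels, and expressing the standard deviation `sigma` as a fraction of the level spacing are all hypothetical choices made for illustration; the actual variation magnitude in NACIM comes from the measurements in [48].

```python
import numpy as np

def program_weights(weights, n_levels=16, sigma=0.1, rng=None):
    """Map weights onto n_levels uniform device levels, then add Gaussian variation.

    sigma is the standard deviation of the variation, expressed as a fraction
    of the spacing between adjacent levels (an assumption of this sketch).
    Returns (ideal, noisy): the programmed values without and with variation.
    """
    rng = np.random.default_rng() if rng is None else rng
    w = np.asarray(weights, dtype=float)
    w_min, w_max = w.min(), w.max()
    if w_max == w_min:                      # degenerate case: a single level
        return w.copy(), w.copy()
    step = (w_max - w_min) / (n_levels - 1)
    level_idx = np.round((w - w_min) / step)    # integer level per weight
    ideal = w_min + level_idx * step
    noisy = ideal + rng.normal(0.0, sigma * step, size=w.shape)
    return ideal, noisy
```

With `n_levels=16` the sketch corresponds to a 4-bit device; setting `n_levels=2` would mimic a binary device under the same assumptions.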

5.1.2 Crossbar Architecture

Different crossbar-based architectures have been proposed [7, 32]. The authors assume an ISAAC-like architecture [32] in simulation. The architecture is highly parallel and consists of multiple tiles, each containing multiple crossbar arrays. The computation is performed in the analog domain, with DACs and ADCs converting the signals to and from the analog domain. The authors assume that all the weights can be mapped to the crossbar arrays, so the weights do not need to be reprogrammed during computation.
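To make the data path concrete, the following sketch models a single crossbar array as a matrix–vector product with a quantizer on each side. It is a deliberately simplified illustration: the helper names, the 8-bit converter resolutions, and the choice of full-scale range are assumptions of this sketch, and it does not model bit slicing, tiling, or any other detail of ISAAC [32].

```python
import numpy as np

def uniform_quantize(x, bits, x_max):
    """Clip x to [-x_max, x_max] and round it onto a signed uniform grid."""
    x = np.asarray(x, dtype=float)
    if x_max == 0:
        return np.zeros_like(x)
    levels = 2 ** (bits - 1) - 1
    x = np.clip(x, -x_max, x_max)
    return np.round(x / x_max * levels) / levels * x_max

def crossbar_mvm(weights, inputs, dac_bits=8, adc_bits=8):
    """Weight-stationary matrix-vector product with DAC and ADC quantization.

    weights: (rows, cols) array stored in the crossbar (never reprogrammed here).
    inputs:  (rows,) vector applied to the word lines.
    """
    w = np.asarray(weights, dtype=float)
    x = np.asarray(inputs, dtype=float)
    x_max = np.max(np.abs(x))
    v = uniform_quantize(x, dac_bits, x_max)        # DAC: digital -> analog
    currents = v @ w                                # analog sum on each bit line
    # Worst-case column output, used as the ADC full-scale range (assumption).
    full_scale = np.sum(np.abs(w), axis=0).max() * x_max
    return uniform_quantize(currents, adc_bits, full_scale)  # ADC: analog -> digital
```

Lowering `adc_bits` in this sketch shows how converter resolution, a circuit-level choice, directly limits the precision of the layer output.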

5.1.3 NeuroSIM

DNN+NeuroSIM [29] is an integrated framework built for emulating the inference performance or on-chip training performance of deep neural networks (DNNs) on hardware accelerators based on near-memory or in-memory computing architectures. Various device technologies are supported, including SRAM, emerging non-volatile memory (eNVM) based on resistance switching (e.g., RRAM, PCM, STT-MRAM), and ferroelectric FET (FeFET). SRAM is by nature 1 bit per cell; eNVMs and FeFET in this simulator can support either 1 bit or multiple bits per cell. NeuroSIM [6] is a circuit-level macromodel for benchmarking neuro-inspired architectures (including memory array, peripheral logic, and interconnect routing) in terms of circuit-level performance metrics, such as chip area, latency, dynamic energy, and leakage power. With PyTorch and TensorFlow wrappers, the DNN+NeuroSIM framework supports a hierarchical organization from the device level (transistors from 130 nm down to 7 nm, eNVM and FeFET device properties) to the circuit level (periphery circuit modules such as analog-to-digital converters, ADCs), to the chip level (tiles of processing elements built up from multiple sub-arrays, and global interconnect and buffer), and then to the algorithm level (different convolutional neural network topologies), enabling instruction-accurate evaluation of the inference accuracy as well as the circuit-level performance metrics at inference runtime.

5.2 Problem Definition

Figure 11 illustrates the cross-layer optimization from application to hardware. The ultimate goal is to implement the inference of a neural network on computing-in-memory (CiM) systems. Optimization decisions need to be made in five design layers: (a) neural architecture search, (b) quantization determination, (c) dataflow, (d) circuit design, and (e) device selection (Fig. 12).

Search Space. The design spaces of all the layers form an integrated search space. Among the five design layers, the dataflow design layer has the fewest options. Although there are different types of dataflows in terms of the data reuse pattern, the weight-stationary dataflow is commonly used for the CiM platform. All the other design layers provide various design options. For the neural architecture layer, the size of the neural architecture can be adjusted to fit the hardware, which can be implemented by searching for the hyperparameters of the backbone neural architecture. For the quantization layer, different bit widths for both integer and fraction parts can be employed for the network layers. For the circuit layer, tile size, buffer size, and bandwidth should be determined. Finally, for the device layer, the authors can choose among different types of devices.

Problem Statement. Based on the definition of each layer, the authors formally define the problem solved in this work as follows: Given a dataset (e.g., CIFAR-10), a machine learning task (e.g., image classification), and a set of available devices DT, the authors are going to determine:

• A: The neural architecture for the machine learning task
• Q: The quantization of each layer in the architecture A
• D: The device in set DT used for the chip design
• C: The circuit design based on the selected device D

Objective. Such that the inference accuracy of the machine learning task on the resultant circuit is maximized, while the hardware efficiency (e.g., latency, energy efficiency, and area) is optimized. Note that since the above optimization problem has multiple objectives, the authors further propose a framework in the next section that allows designers to specify the metrics to be optimized (e.g., simultaneously maximizing accuracy and minimizing latency and area).

Fig. 11 Cross-layer optimization to identify the best neural architecture on computing-in-memory platform: (a) neural architecture; (b) two possible quantizations for four layers; (c) dataflow of generating output feature maps by using the input feature maps and weights; (d) layout of circuit; (e) different computing-in-memory devices (ReRAM, FeFET, and STT-MRAM)

Fig. 12 Overview of the proposed NACIM framework: ➀ a reward-based controller; ➁ an optimizer selector for architecture A, quantization Q, device D, and circuit C; ➂ an accuracy evaluator for identified neural architecture; ➃ a hardware performance evaluator with the circuit optimization
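One way to picture the integrated search space is as a record with one field per decision variable (A, Q, D, C). The sketch below is illustrative only: the architecture and quantization ranges are taken from the resource-limited settings reported in Sect. 5.4.1 and Table 5, the device names from Fig. 11, while the circuit options (`tile_size`, `buffer_kb`) and all identifiers are hypothetical placeholders. Uniform random sampling stands in for the reward-based controller of Fig. 12.

```python
import random
from dataclasses import dataclass

SEARCH_SPACE = {
    "num_filters": [24, 36, 48, 64],        # resource-limited space, Table 5
    "filter_size": [1, 3, 5, 7],            # resource-limited space, Table 5
    "int_bits": [0, 1, 2, 3],               # Sect. 5.4.1
    "frac_bits": [0, 1, 2, 3, 4, 5, 6],     # Sect. 5.4.1
    "device": ["ReRAM", "FeFET", "STT-MRAM"],
    "tile_size": [64, 128, 256],            # hypothetical circuit option
    "buffer_kb": [16, 32, 64],              # hypothetical circuit option
}

@dataclass
class DesignPoint:
    architecture: list   # A: per-layer (num_filters, filter_size)
    quantization: list   # Q: per-layer (int_bits, frac_bits)
    device: str          # D: one device from the available set
    circuit: dict        # C: circuit parameters for the chosen device

def sample_design(num_layers=8, rng=None):
    """Draw one candidate uniformly at random (a stand-in for the controller)."""
    rng = rng or random.Random()
    s = SEARCH_SPACE
    arch = [(rng.choice(s["num_filters"]), rng.choice(s["filter_size"]))
            for _ in range(num_layers)]
    quant = [(rng.choice(s["int_bits"]), rng.choice(s["frac_bits"]))
             for _ in range(num_layers)]
    circuit = {"tile_size": rng.choice(s["tile_size"]),
               "buffer_kb": rng.choice(s["buffer_kb"])}
    return DesignPoint(arch, quant, rng.choice(s["device"]), circuit)
```

Keeping all four decisions in one record reflects the point of the problem statement: a candidate is only meaningful once architecture, quantization, device, and circuit are fixed together.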

5.3 Co-design Framework for CiM

This problem can be solved using the co-design framework introduced in Sect. 2. Recall that the framework consists of four steps: Prediction, HW eval, DNN eval, and Update. The Prediction and Update steps are the same as in other general methods. For the HW eval step, the authors of NACIM use the open-source estimation tool NeuroSIM introduced above. For the DNN eval step, the impact of device variations needs to be considered in two respects: the performance of a given DNN model must be evaluated accurately, and models that work well under device variations must be trained. The performance of a DNN model under device variations is obtained through the Monte Carlo method. To train a robust DNN model that works well under device variations, the authors of NACIM propose a training method that involves the device variation in the training procedure. The method is composed of two steps: first, the authors use the Monte Carlo method to draw a sample for each weight from a Gaussian distribution whose mean is 0 and whose variance equals the device variance; second, these samples are added to the corresponding weights in the forward pass of training. Since only one sample per weight is required in each forward pass, very little overhead is added to the training process.
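The two-step training scheme and the Monte Carlo evaluation can be sketched on a toy model. The example below uses logistic regression in NumPy rather than a DNN, so it only illustrates the mechanics; the function names, learning rate, and epoch count are assumptions of this sketch, and `sigma` denotes the standard deviation (the square root of the device variance).

```python
import numpy as np

def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_noise_aware(X, y, sigma, epochs=200, lr=0.5, rng=None):
    """Logistic regression trained with Gaussian weight noise in the forward pass."""
    rng = np.random.default_rng() if rng is None else rng
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        # Step 1: draw one zero-mean Gaussian sample per weight.
        noise = rng.normal(0.0, sigma, size=w.shape)
        # Step 2: add the samples to the weights in the forward pass only.
        p = _sigmoid(X @ (w + noise))
        grad = X.T @ (p - y) / len(y)
        w -= lr * grad              # the update is applied to the clean weights
    return w

def monte_carlo_accuracy(X, y, w, sigma, n_samples=100, rng=None):
    """Average accuracy over n_samples random realizations of device variation."""
    rng = np.random.default_rng() if rng is None else rng
    accs = []
    for _ in range(n_samples):
        w_noisy = w + rng.normal(0.0, sigma, size=w.shape)
        accs.append(np.mean((X @ w_noisy > 0) == (y == 1)))
    return float(np.mean(accs))
```

Because only one noise sample is drawn per training step, the cost per step is essentially that of ordinary training, which matches the low-overhead argument made above.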

5.4 Experiments and Results

The NACIM framework is effective in various tasks. The authors evaluate NACIM in several settings, including image classification and segmentation tasks.

5.4.1 Experiment Setup

The authors of NACIM explore two machine learning tasks, image classification and object segmentation, to evaluate the framework. For the image classification task, similar to most existing works on CiM-based neural accelerators [28, 35], the authors use the CIFAR-10 dataset, while for object segmentation, the authors use the Nuclei dataset [21]. Table 5 shows the neural architecture search spaces for these datasets.


Table 5 Experimental settings for three types of backbone on two datasets, CIFAR-10 and Nuclei

Spaces         | # layer  | # filter         | Filter H/W | FC neurons
Res. lim.      | 8        | 24,36,48,64      | 1,3,5,7    | 64,128,256,512
VGG-like space | 11       | 128,256,512,1024 | 1,3,5,7    | 256,512,1024,2048
Enc-dec-like   | 4,6,8,10 | 16,32,64,128     | 3          | –

Filter H/W: Height and width of filter; FC: Fully connected layer

Table 6 Comparison results between the proposed approaches and the state-of-the-art QuantNAS without the consideration of the device during the search process

Approach    | Accuracy | Acc w/ variation | Area (μm²)  | EDP (pJ·ns)  | Speed (TOPs) | E.-E. (TOPs/W)
QuantNAS    | 84.92%   | 8.48%            | 3.24 × 10^6 | 8.08 × 10^12 | 0.285        | 5.14
ptbNAS      | 74.28%   | 72.18%           | 2.57 × 10^6 | 7.9 × 10^12  | 0.117        | 4.99
NACIM$_{hw}$ | 73.58%   | 70.12%           | 1.78 × 10^6 | 2.21 × 10^12 | 0.204        | 12.3
NACIM$_{sw}$ | 73.88%   | 73.45%           | 1.97 × 10^6 | 3.76 × 10^12 | 0.234        | 16.3

For the resource-limited scenario (RLS), the authors also explore the quantization space. The quantization bit width of the activation and weight of each layer is searched separately. For each type of data, the authors determine the number of integer bits ranging from 0 to 3 and the number of fraction bits ranging from 0 to 6. For the device and circuit, in this section, 4-bit ReRAM devices are used in the crossbar computation. The noise model of the device is from [48]. NACIM also searches through layer-wise quantization parameters for each candidate architecture. In this chapter, we show the performance comparisons of NACIM to baselines under the resource-limited scenario (RLS). For more experimental results, readers can refer to the original work [18].
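As an illustration of what a per-layer choice of integer and fraction bits means numerically, the sketch below rounds values onto a fixed-point grid. The function name is hypothetical, and the format (a separate sign bit, saturation at the range limits, round-to-nearest) is an assumption of this sketch; the source does not spell out the exact number format used by NACIM.

```python
import numpy as np

def fixed_point_quantize(x, int_bits, frac_bits):
    """Round x onto a signed fixed-point grid and saturate at the range limits.

    int_bits:  number of integer bits (0 to 3 in the search space above).
    frac_bits: number of fraction bits (0 to 6 in the search space above).
    A separate sign bit is assumed, giving the range
    [-2**int_bits, 2**int_bits - 2**-frac_bits].
    """
    scale = 2.0 ** frac_bits
    max_val = 2.0 ** int_bits - 1.0 / scale
    min_val = -(2.0 ** int_bits)
    q = np.round(np.asarray(x, dtype=float) * scale) / scale
    return np.clip(q, min_val, max_val)
```

For example, with one integer bit and two fraction bits the grid spacing is 0.25, so 0.3 is stored as 0.25; adding fraction bits shrinks this rounding error at the cost of more device cells per value.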

5.4.2 Comparison Results to State-of-the-Art NAS

First, we compare the exploration results of different search methods in Table 6. "QuantNAS" indicates the state-of-the-art quantization–architecture co-exploration method proposed in [25], where the standard training procedure is conducted. "ptbNAS" indicates the noise-aware training and searching method proposed in this chapter, where the switch combination is set as $S_A = 1$, $S_Q = 1$, $S_D = 1$, $S_C = 0$. Note that QuantNAS is the basis of ptbNAS, but ptbNAS integrates noise awareness into the search process. "NACIM" indicates the noise-aware training and searching method along with the hardware resource-aware quantization search, which combines ptbNAS and rNAS. Note also that NACIM can obtain a series of solutions on the Pareto frontier. The authors use the notations NACIM$_{hw}$ and NACIM$_{sw}$ to represent the solution with maximum hardware efficiency and that with maximum accuracy, respectively. In Table 6, the accuracy of each architecture without noise is shown in column "Accuracy," the accuracy after considering the device variation in column "Acc w/ variation," and the area, energy-delay product (EDP), speed (TOPs), and energy efficiency (TOPs/W) in the remaining columns.

The results in Table 6 show that QuantNAS finds the architecture with the highest accuracy. However, when that architecture is deployed on a computing-in-memory circuit with variation, its accuracy drops drastically from 84.92% to 8.48%, rendering the architecture useless. In contrast, with device variation considered in the training process, the network accuracies of ptbNAS, NACIM$_{hw}$, and NACIM$_{sw}$ on the computing-in-memory circuit are 72.18%, 70.12%, and 73.45%, respectively. Moreover, the accuracy loss of NACIM$_{sw}$ is only 0.43%. We can also observe from the table that, by employing the cross-layer optimization, NACIM$_{hw}$ obtains the best hardware efficiency. Compared with QuantNAS, NACIM$_{hw}$ achieves a 1.82× reduction in area and a 3.66× improvement in energy-delay product. Compared with ptbNAS, these figures are 14.01% and 1.89×, respectively. Compared with NACIM$_{sw}$, these figures are 9.64% and 1.70×, respectively. These results demonstrate the capability of NACIM to synthesize cost-effective computing-in-memory chips.

Another observation is that the architectures identified by both QuantNAS and NACIM$_{sw}$ achieve slightly higher speed than that identified by NACIM$_{hw}$. This is because NACIM$_{hw}$ finds many simple structures with fewer operations, but the latency is not improved accordingly since other designs can have more processing elements. In terms of energy efficiency, NACIM$_{hw}$ achieves 2.39× higher energy efficiency than QuantNAS, and NACIM$_{sw}$ achieves 3.17× higher energy efficiency, reaching up to 16.3 TOPs/W. The above observations clearly show the importance of conducting cross-layer optimization to obtain useful neural architectures for hardware-efficient computing-in-memory architectures.
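The ratios quoted against QuantNAS and against the maximum-accuracy NACIM solution follow directly from the entries of Table 6. The short check below recomputes them; the dictionary layout and key names are just a convenience of this sketch, and the numbers are copied from the table.

```python
# Values copied from Table 6: area in 1e6 um^2, EDP in 1e12 pJ*ns,
# energy efficiency (ee) in TOPs/W, accuracies in percent.
table6 = {
    "QuantNAS": {"acc": 84.92, "acc_var": 8.48,  "area": 3.24, "edp": 8.08, "ee": 5.14},
    "NACIM_hw": {"acc": 73.58, "acc_var": 70.12, "area": 1.78, "edp": 2.21, "ee": 12.3},
    "NACIM_sw": {"acc": 73.88, "acc_var": 73.45, "area": 1.97, "edp": 3.76, "ee": 16.3},
}
q, hw, sw = table6["QuantNAS"], table6["NACIM_hw"], table6["NACIM_sw"]

results = {
    # Accuracy lost when device variation is applied (percentage points).
    "quantnas_acc_loss": round(q["acc"] - q["acc_var"], 2),
    "nacim_sw_acc_loss": round(sw["acc"] - sw["acc_var"], 2),
    # NACIM_hw versus QuantNAS.
    "area_reduction_x": round(q["area"] / hw["area"], 2),
    "edp_improvement_x": round(q["edp"] / hw["edp"], 2),
    "hw_ee_gain_x": round(hw["ee"] / q["ee"], 2),
    # NACIM_sw versus QuantNAS.
    "sw_ee_gain_x": round(sw["ee"] / q["ee"], 2),
    # NACIM_hw versus NACIM_sw.
    "area_saving_vs_sw_pct": round(100 * (sw["area"] - hw["area"]) / sw["area"], 2),
    "edp_improvement_vs_sw_x": round(sw["edp"] / hw["edp"], 2),
}

for name, value in results.items():
    print(f"{name}: {value}")
```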

5.4.3 Results of Multi-Objective Optimization

Figure 13 shows the design space exploration tradeoffs between the accuracy and the normalized hardware efficiency. The normalized hardware efficiency, shown on the x-axis, is calculated from weighted hardware metrics, including latency, area, and energy. Each hardware metric has the same weight; the normalized hardware efficiency as a whole accounts for half of the reward, and the inference accuracy accounts for the other half. An interesting observation from the results is that, compared with the bi-objective optimization, NACIM finds more architectures with lower accuracy. This is because the weight of accuracy in the reward is decreased. However, the authors can still find the solution with the highest accuracy and achieve a 1.65× improvement in hardware efficiency.
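A reward of this shape can be sketched as follows. Only the weighting is taken from the text (equal weights for latency, area, and energy; half of the reward for hardware and half for accuracy). The normalization of each metric by a designer-chosen bound, the treatment of the hardware term as a cost to be minimized, and all names are assumptions of this sketch rather than the formula used in NACIM.

```python
HW_METRICS = ("latency", "area", "energy")

def normalized_hw_cost(metrics, bounds, weights=None):
    """Weighted sum of hardware metrics, each normalized to [0, 1] by its bound."""
    weights = weights or {k: 1.0 / len(HW_METRICS) for k in HW_METRICS}
    return sum(weights[k] * min(metrics[k] / bounds[k], 1.0) for k in HW_METRICS)

def reward(accuracy, metrics, bounds, acc_weight=0.5):
    """Combine accuracy (in [0, 1]) and hardware cost into a single scalar reward."""
    hw_score = 1.0 - normalized_hw_cost(metrics, bounds)   # lower cost is better
    return acc_weight * accuracy + (1.0 - acc_weight) * hw_score
```

Changing `acc_weight` or the per-metric weights is how a designer would shift the emphasis between accuracy and individual hardware metrics in this sketch.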

Fig. 13 Multi-objective optimization: inference error vs. normalized hardware efficiency. The hardware efficiency is the weighted sum of hardware area, energy, and latency
[Plot: error (y-axis) versus normalized hardware efficiency (x-axis) for NACIM and pNAS, with the ideal solution marked]

5.5 Concluding Remarks for NACIM

NACIM formally defines the cross-layer optimization problem of automatically identifying neural architectures for computing-in-memory (CiM) platforms. NACIM devises a co-exploration framework that gives designers the flexibility to set different optimization goals. A device variation-aware training method is also proposed in NACIM.

6 Conclusions

In this chapter, we first introduce the general idea of neural architecture search-based software–hardware co-design. After that, we showcase three lines of work that co-design DNNs with three different types of underlying hardware. Experimental results show that software–hardware co-design can provide better Pareto-optimal pairs of DNN and embedded processor than previous state-of-the-art designs. It also offers a more flexible tradeoff scheme that lets designers focus on one particular type of design target.

References

1. Abrash, M.: https://www.oculus.com/blog/inventing-the-future/ (2019). Accessed 26 Nov 2019
2. Bian, S., Jiang, W., Lu, Q., Shi, Y., Sato, T.: NASS: Optimizing secure inference via neural architecture search. In: European Conference on Artificial Intelligence (2020)
3. Biswas, A., Chandrakasan, A.P.: Conv-RAM: An energy-efficient SRAM with embedded convolution computation for low-power CNN-based machine learning applications. In: 2018 IEEE International Solid-State Circuits Conference (ISSCC), pp. 488–490. IEEE (2018)
4. Cai, H., Zhu, L., Han, S.: ProxylessNAS: Direct neural architecture search on target task and hardware. In: International Conference on Learning Representations (2018)
5. Chen, Y.H., Emer, J., Sze, V.: Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks. ACM SIGARCH Comp. Architect. News 44(3), 367–379 (2016)
6. Chen, P., Peng, X., Yu, S.: NeuroSim: A circuit-level macro model for benchmarking neuro-inspired architectures in online learning. IEEE Trans. Comput.-Aided Design Integr. Circuits Syst. 37(12), 3067–3080 (2018). https://doi.org/10.1109/TCAD.2018.2789723
7. Chi, P., Li, S., Xu, C., Zhang, T., Zhao, J., Liu, Y., Wang, Y., Xie, Y.: PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory. In: ACM SIGARCH Computer Architecture News, vol. 44, no. 3, pp. 27–39. IEEE Press (2016)
8. Coates, A., Ng, A., Lee, H.: An analysis of single-layer networks in unsupervised feature learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 215–223 (2011)
9. Du, Z., Fasthuber, R., Chen, T., Ienne, P., Li, L., Luo, T., Feng, X., Chen, Y., Temam, O.: ShiDianNao: Shifting vision processing closer to the sensor. In: Proceedings of the 42nd Annual International Symposium on Computer Architecture, pp. 92–104 (2015)
10. Feinberg, B., Wang, S., Ipek, E.: Making memristive neural network accelerators reliable. In: IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 52–65 (2018)
11. Geng, T., Wang, T., Sanaullah, A., Yang, C., Patel, R., Herbordt, M.: A framework for acceleration of CNN training on deeply-pipelined FPGA clusters with work and weight load balancing. In: International Conference on Field Programmable Logic and Applications (FPL), pp. 394–3944. IEEE (2018)
12. Geng, T., Wang, T., Sanaullah, A., Yang, C., Xu, R., Patel, R., Herbordt, M.: FPDeep: Acceleration and load balancing of CNN training on FPGA clusters. In: International Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 81–84. IEEE (2018)
13. Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: MobileNets: Efficient convolutional neural networks for mobile vision applications. Preprint arXiv:1704.04861 (2017)
14. Ielmini, D., Wong, H.S.P.: In-memory computing with resistive switching devices. Nat. Electron. 1, 333 (2018)
15. Ito, K., Lucke, L.E., Parhi, K.K.: ILP-based cost-optimal DSP synthesis with module selection and data format conversion. IEEE Trans. Very Large Scale Integr. Syst. 6(4), 582–594 (1998)
16. Jiang, W., Sha, E.H.M., Zhuge, Q., Yang, L., Chen, X., Hu, J.: Heterogeneous FPGA-based cost-optimal design for timing-constrained CNNs. IEEE Trans. Comput.-Aided Design Integr. Circuits Syst. 37(11), 2542–2554 (2018)
17. Jiang, W., Sha, E.H.M., Zhang, X., Yang, L., Zhuge, Q., Shi, Y., Hu, J.: Achieving super-linear speedup across multi-FPGA for real-time DNN inference. ACM Trans. Embedd. Comput. Syst. 18(5s), 1–23 (2019)
18. Jiang, W., Lou, Q., Yan, Z., Yang, L., Hu, J., Hu, X.S., Shi, Y.: Device-circuit-architecture co-exploration for computing-in-memory neural accelerators. IEEE Trans. Comput. 70(4), 595–605 (2020)
19. Jiang, W., Yang, L., Sha, E.H.M., Zhuge, Q., Gu, S., Dasgupta, S., Shi, Y., Hu, J.: Hardware/software co-exploration of neural architectures. IEEE Trans. Comput.-Aided Design Integr. Circuits Syst. 39(12), 4805–4815 (2020)
20. Kang, M., Lim, S., Gonugondla, S., Shanbhag, N.: An in-memory VLSI architecture for convolutional neural networks. IEEE J. Emerg. Sel. Top. Circuits Syst. 8, 494–505 (2018)
21. Kumar, N., Verma, R., Sharma, S., Bhargava, S., Vahadane, A., Sethi, A.: A dataset and a technique for generalized nuclear segmentation for computational pathology. IEEE Trans. Med. Imag. 36(7), 1550–1560 (2017)
22. Kwon, H., Chatarasi, P., Pellauer, M., Parashar, A., Sarkar, V., Krishna, T.: Understanding reuse, performance, and hardware cost of DNN dataflow: A data-centric approach. In: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, pp. 754–768 (2019)
23. Kwon, H., Lai, L., Pellauer, M., Krishna, T., Chen, Y.H., Chandra, V.: Heterogeneous dataflow accelerators for multi-DNN workloads. In: 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 71–83. IEEE (2021)
24. Li, C.: https://lambdalabs.com/blog/resnet9-train-to-94-cifar10-accuracy-in-100-seconds (2019). Accessed 24 Nov 2019
25. Lu, Q., Jiang, W., Xu, X., Shi, Y., Hu, J.: On neural architecture search for resource-constrained hardware platforms. In: International Conference on Computer-Aided Design (2019)
26. NVIDIA: NVDLA deep learning accelerator (2017). http://nvdla.org
27. Parashar, A., Rhu, M., Mukkara, A., Puglielli, A., Venkatesan, R., Khailany, B., Emer, J., Keckler, S.W., Dally, W.J.: SCNN: An accelerator for compressed-sparse convolutional neural networks. ACM SIGARCH Comput. Architect. News 45(2), 27–40 (2017)
28. Patil, A.D., Hua, H., Gonugondla, S., Kang, M., Shanbhag, N.R.: An MRAM-based deep in-memory architecture for deep neural networks. In: 2019 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1–5. IEEE (2019)
29. Peng, X., Huang, S., Luo, Y., Sun, X., Yu, S.: DNN+NeuroSim: An end-to-end benchmarking framework for compute-in-memory accelerators with versatile device technologies. In: 2019 IEEE International Electron Devices Meeting (IEDM), pp. 32–5. IEEE (2019)
30. Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241. Springer (2015)
31. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: MobileNetV2: Inverted residuals and linear bottlenecks (2018). Preprint arXiv:1801.04381
32. Shafiee, A., Nag, A., Muralimanohar, N., Balasubramonian, R., Strachan, J.P., Hu, M., Williams, R.S., Srikumar, V.: ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars. ACM SIGARCH Comput. Architect. News 44, 14–26 (2016)
33. Shao, Z., Zhuge, Q., Xue, C., Sha, E.M.: Efficient assignment and scheduling for heterogeneous DSP systems. IEEE Trans. Parallel Distrib. Syst. 16(6), 516–525 (2005)
34. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition (2014). Preprint arXiv:1409.1556
35. Sun, X., Liu, R., Peng, X., Yu, S.: Computing-in-memory with SRAM and RRAM for binary neural networks. In: 2018 14th IEEE International Conference on Solid-State and Integrated Circuit Technology (ICSICT), pp. 1–4. IEEE (2018)
36. Tan, M., Chen, B., Pang, R., Vasudevan, V., Sandler, M., Howard, A., Le, Q.V.: MnasNet: Platform-aware neural architecture search for mobile. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2820–2828 (2019)
37. Sze, V., Chen, Y.-H., Yang, T.J., Emer, J.S.: Efficient processing of deep neural networks: A tutorial and survey. In: Proceedings of the IEEE, pp. 2295–2329. IEEE (2017)
38. Williams, R.J.: Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 8(3), 229–256 (1992)
39. Wu, B., Dai, X., Zhang, P., Wang, Y., Sun, F., Wu, Y., Tian, Y., Vajda, P., Jia, Y., Keutzer, K.: FBNet: Hardware-aware efficient ConvNet design via differentiable neural architecture search. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10734–10742 (2019)
40. Xu, X., Ding, Y., Hu, S.X., Niemier, M., Cong, J., Hu, Y., Shi, Y.: Scaling for edge inference of deep neural networks. Nat. Electron. 1(4), 216–222 (2018)
41. Xu, X., Lu, Q., Yang, L., Hu, S., Chen, D., Hu, Y., Shi, Y.: Quantization of fully convolutional networks for accurate biomedical image segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8300–8308 (2018)
42. Yang, L., Jiang, W., Liu, W., Edwin, H., Shi, Y., Hu, J.: Co-exploring neural architecture and network-on-chip design for real-time artificial intelligence. In: 2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 85–90. IEEE (2020)
43. Yang, L., Yan, Z., Li, M., Kwon, H., Lai, L., Krishna, T., Chandra, V., Jiang, W., Shi, Y.: Co-exploration of neural architectures and heterogeneous ASIC accelerator designs targeting multiple tasks. In: 2020 57th ACM/IEEE Design Automation Conference (DAC), pp. 1–6. IEEE (2020)
44. Zhang, C., Li, P., Sun, G., Guan, Y., Xiao, B., Cong, J.: Optimizing FPGA-based accelerator design for deep convolutional neural networks. In: International Symposium on Field-Programmable Gate Arrays (FPGA), pp. 161–170. ACM (2015)
45. Zhang, C., Wu, D., Sun, J., Sun, G., Luo, G., Cong, J.: Energy-efficient CNN implementation on a deeply pipelined FPGA cluster. In: Proceedings of the 2016 International Symposium on Low Power Electronics and Design, pp. 326–331 (2016)
46. Zhang, J., Raj, P., Zarar, S., Ambardekar, A., Garg, S.: Compact: On-chip compression of activations for low power systolic array based CNN acceleration. ACM Trans. Embed. Comput. Syst. 18(5s), 1–24 (2019)
47. Zhang, X., Jiang, W., Shi, Y., Hu, J.: When neural architecture search meets hardware implementation: from hardware awareness to co-design. In: 2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pp. 25–30. IEEE (2019)
48. Zhao, M., Wu, H., Gao, B., Zhang, Q., Wu, W., Wang, S., Xi, Y.: Investigation of statistical retention of filamentary analog RRAM for neuromorphic computing. In: IEEE International Electron Devices Meeting (IEDM), pp. 39–4 (2017)
49. Zoph, B., Le, Q.V.: Neural architecture search with reinforcement learning. In: International Conference on Learning Representations (ICLR) (2017)

Hardware and Software Optimizations for Capsule Networks

Alberto Marchisio, Beatrice Bussolino, Alessio Colucci, Vojtech Mrazek, Muhammad Abdullah Hanif, Maurizio Martina, Guido Masera, and Muhammad Shafique

1 Introduction

In recent years, Capsule Networks (CapsNets) have become popular among advanced Machine Learning (ML) models [3], due to their high learning capabilities and improved generalization ability compared to traditional Deep Neural Networks (DNNs). The ability to learn hierarchical information of different features (position, orientation, and scaling) in a single capsule makes it possible to achieve high accuracy in machine learning vision applications, e.g., MNIST [15] and Fashion-MNIST [40] classification, and makes CapsNets applicable to other ML application domains, such as speech recognition [39], natural language processing [41], and healthcare [30]. Indeed, CapsNets are able to encapsulate the hierarchical and spatial information of the input features in a way that is closer to our current understanding of the human brain's functionality. This is supported by recent analyses of the robustness of CapsNets against affine transformations and adversarial attacks [7, 20, 29], which show that CapsNets are more resilient against such threats than traditional DNNs with similar classification accuracy.

A. Marchisio · A. Colucci
Technische Universität Wien (TU Wien), Vienna, Austria

B. Bussolino · M. Martina · G. Masera
Politecnico di Torino, Turin, Italy

V. Mrazek
Brno University of Technology, Brno, Czechia

M. A. Hanif · M. Shafique
eBrain Lab, Division of Engineering, New York University Abu Dhabi, Abu Dhabi, UAE

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
S. Pasricha, M. Shafique (eds.), Embedded Machine Learning for Cyber-Physical, IoT, and Edge Computing, https://doi.org/10.1007/978-3-031-39932-9_12

Fig. 1 Comparison of memory footprint and (multiply-and-accumulate operations vs. memory) ratio (MACs/Memory) between the LeNet [15], AlexNet [14], and CapsNet [36] (based on the data presented in [27])
[Plot: memory footprint in MB (log scale) and MACs/Memory ratio for LeNet, AlexNet, and CapsNet]

Fig. 2 Overview of the chapter's content
[Diagram labels: SW-Level, HW-Level, HW/SW Co-Design & Optimizations; CapsNet Models, Datasets, CapsNets Training Methodologies, Efficient Trained CapsNet Models, HW-Aware Neural Architecture Search, CapsNet HW Accelerator (CapsAcc), DSE of PE Array, DSE of Memory Organizations, CapsNets Quantization Framework, Approximate CapsNets Design, Efficient CapsNets Inference]

However, the presence of capsules in the layers introduces an additional dimension compared to the matrices of the convolutional and fully connected layers of traditional DNNs, which significantly increases the computation and communication workload of the underlying hardware. Therefore, the main challenge in deploying CapsNets is their extremely high complexity. They require intense computations due to the multiplications in the matrices of capsules and the iterative dynamic routing-by-agreement algorithm for learning the cross-coupling between capsules. Figure 1 compares the CapsNet [36] with the LeNet [15] and the AlexNet [14] in terms of their memory footprints and the total number of multiply-and-accumulate (MAC) operations needed to perform an inference pass. The MACs/memory ratio is a good metric of the computational complexity of the models, and it demonstrates the more compute-intensive nature of CapsNets compared to traditional DNNs. In this chapter, after discussing the differences between traditional DNNs and CapsNets and their advanced model architectures, we present the state-of-the-art methodologies and optimization techniques for their efficient execution. An overview of the chapter's content is shown in Fig. 2.


2 Traditional DNNs vs. CapsNets

As discussed in [11], traditional DNNs, which are based on convolutional operations, have two major drawbacks: (i) they have too few structural levels, and thus they cannot handle different viewpoints of the same object, and (ii) pooling layers are too naive a form of information encoding, since they make DNNs translation-invariant rather than equivariant. To overcome these problems, the architecture of CapsNets is proposed. The key differences w.r.t. traditional DNNs are summarized in Table 1. Inspired by the concept of inverse graphics, in [11], the neurons are grouped together into vectors to form the so-called capsules. A capsule encodes the instantiation parameters (i.e., the pose, such as width, skew, rotation, and other spatial information), and its length (i.e., its Euclidean norm) is associated with the instantiation probability of the entity. In this way, CapsNets encode the pose of low-level features from the image pixels, and from the pose of the "parts," it is possible to infer the pose of the "whole," i.e., the high-level entities, to make a better prediction. As the activation function, CapsNets use the Squash, a multidimensional nonlinear function that efficiently fits the prediction vector that forms the capsule. Moreover, to overcome the problem that DNNs are not equivariant to translation, the concept of routing is introduced. The (max) pooling operation consists of collecting a group of adjacent neurons and selecting the one with the highest activity, thus discarding the spatial information provided by this group of neurons. For this reason, the pooling layers are responsible for the so-called Picasso problem, in which DNNs classify an image having a nose below the mouth and an eye below the nose as a face, since they lose the spatial relationships between features. To replace the pooling layers, an iterative routing procedure that determines the values of the coupling coefficients between a low-level capsule and high-level capsules is proposed in [36]. It is an iterative process in which the agreements between the capsules of two consecutive layers are measured and updated for a certain number of iterations at runtime during the inference.

Table 1 Key differences between traditional DNNs and CapsNets

                          Traditional DNNs               CapsNets
Basic block               Neuron (scalar value)          Capsule (vector)
Activation function       Rectified linear unit (ReLU)   Squash
Inter-layer connections   Pooling                        Dynamic routing
Detection property        Feature detection              Entity detection


A. Marchisio et al.

3 CapsNet Models and Applications

Hinton et al. [11] first showed the applicability of CapsNets, which adopt the capsules as basic blocks and can learn the features of an image in addition to its deformations and viewing conditions. A more detailed explanation of how poses and probabilities are represented and computed to form a CapsNet is given in [36]. A capsule is a vector of neurons, each representing an instantiation parameter of the entity, and the instantiation probability is measured by the length of the vector. To represent such a probability in the range $[0, 1]$, the Squash function is employed. The iterative procedure for computing the coupling coefficients $c_{ij}$ constitutes the Dynamic Routing-by-Agreement in Algorithm 1. The coupling coefficient determines the amount by which the lower-level capsule i sends its activation to each of the higher-level capsules. In other words, $c_{ij}$ represents the prior probability that an entity detected by a lower-level capsule i belongs to the higher-level entity of capsule j. To satisfy the property that the sum of these coefficients must be unitary, the Softmax function is applied (see line 8 of Algorithm 1). The activation $v_j$ of the capsule j is obtained by applying the Squash function to the pre-activation $s_j$ (line 14). The last step consists of updating the logits $b_{ij}$ to be used in the following

Algorithm 1: Dynamic routing-by-agreement in CapsNets
Input: Prediction Votes $\hat{u}_{i|j}$; Number of Iterations r; Layer l
Output: Activation Vectors $v_j$
 1 for Capsule i in Layer l do
 2   for Capsule j in Layer (l + 1) do
 3     Logits Initialization: $b_{ij} \leftarrow 0$;
 4   end
 5 end
 6 for r Iterations do
 7   for Capsule i in Layer l do
 8     Softmax: $c_{ij} \leftarrow \mathrm{softmax}(b_{ij}) = \frac{e^{b_{ij}}}{\sum_k e^{b_{ik}}}$;
 9   end
10   for Capsule j in Layer (l + 1) do
11     Sum: $s_j \leftarrow \sum_i c_{ij} \cdot \hat{u}_{i|j}$;
12   end
13   for Capsule j in Layer (l + 1) do
14     Squash: $v_j \leftarrow \mathrm{squash}(s_j) = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \frac{s_j}{\|s_j\|}$;
15   end
16   for Capsule i in Layer l do
17     for Capsule j in Layer (l + 1) do
18       Update: $b_{ij} \leftarrow b_{ij} + \hat{u}_{i|j} \cdot v_j$;
19     end
20   end
21 end
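To make the data flow of Algorithm 1 concrete, the following is a minimal pure-Python sketch of the same loop structure. The function names, the nested-list layout `u_hat[i][j]` for the prediction votes, and the default of three iterations are assumptions made for illustration; they are not taken from a reference implementation.

```python
import math


def softmax(logits):
    """Softmax over the higher-level capsules k for one lower-level capsule i."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(b - shift) for b in logits]
    total = sum(exps)
    return [e / total for e in exps]


def squash(s):
    """Squash: keeps the direction of s and maps its length into [0, 1)."""
    norm_sq = sum(x * x for x in s)
    if norm_sq == 0.0:
        return [0.0] * len(s)
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * x for x in s]


def dynamic_routing(u_hat, iterations=3):
    """u_hat[i][j] is the vote vector of lower capsule i for higher capsule j.

    The default of 3 iterations is an assumption of this sketch.
    """
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]          # lines 1-5: logits
    v = []
    for _ in range(iterations):                       # line 6
        c = [softmax(row) for row in b]               # lines 7-9
        v = []
        for j in range(n_out):                        # lines 10-15
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in))
                   for d in range(dim)]
            v.append(squash(s_j))
        for i in range(n_in):                         # lines 16-20
            for j in range(n_out):
                b[i][j] += sum(u_hat[i][j][d] * v[j][d] for d in range(dim))
    return v


def length(vec):
    return math.sqrt(sum(x * x for x in vec))


# Two lower capsules: their votes agree for capsule 0 and cancel for capsule 1.
votes = [[[1.0, 0.0], [1.0, 0.0]],
         [[1.0, 0.0], [-1.0, 0.0]]]
activations = dynamic_routing(votes)
```

In this toy input the votes for the first higher-level capsule agree, so its activation grows over the iterations, while the opposing votes for the second capsule cancel out and its activation stays at zero length.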

[Figure: INPUT (28x28) → Conv1 Layer (9x9 kernel, ReLU, 20x20x256) → PrimaryCaps Layer (9x9 kernel, Squash, 6x6x32x8) → ClassCaps Layer (Dynamic Routing, 16x10) → OUTPUT]

Fig. 3 Architectural model of the vanilla CapsNet [36]

[Figure: INPUT → CONV (ReLU) → CONVCAPS 2D #1 to #15 and one CONVCAPS 3D layer (Squash), joined by skip connections (+) → CLASSCAPS (Dynamic Routing) → OUTPUT]

Fig. 4 Architectural model of the DeepCaps [35]

iteration by computing the agreement through the scalar product between the input prediction votes $\hat{u}_{i|j}$ and the activation $v_j$ (line 18).

The first CapsNet model [36] using the vector capsules and the dynamic routing is shown in Fig. 3. A convolutional layer with kernel $9 \times 9$, stride 1, and 256 output channels is followed by the PrimaryCaps layer, in which the neurons are grouped into 8D vectors, organized in 32 output channels, and form a convolutional capsule layer of kernel size $9 \times 9$ and stride 2, using the Squash activation function. In the last ClassCaps layer, each of the 10 capsules is dedicated to recognizing one of the output classes. The Dynamic Routing analyzes the features encoded by the 1152 8D capsules of the PrimaryCaps layer to generate the 10 16D activations of the ClassCaps layer. For training purposes, a decoder network (i.e., a cascade of three fully connected layers) is built for obtaining the image reconstruction, and the reconstruction loss is then combined with the margin loss (i.e., computed from the instantiation probabilities of the output activations) to form the loss function. Despite being applied mainly to relatively simple tasks, like MNIST [15] and Fashion-MNIST [40] classification, this architecture has been extensively analyzed and studied by the community. Hence, in the following, we refer to it as vanilla CapsNet or simply CapsNet.

A major limitation of this CapsNet is that it is extremely compute-intensive and requires many parameters to reach performance similar to that of traditional DNNs for complex tasks. To overcome these issues, the DeepCaps architecture [35] has been proposed. As shown in Fig. 4, besides increasing the depth, the DeepCaps exploits 3D convolutional capsule layers and 3D dynamic routing, thus significantly reducing the number of parameters. Moreover, the decoder employs deconvolutional layers that capture more spatial relationships than the fully connected layers.
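The layer dimensions quoted for the vanilla CapsNet can be checked with a few lines of arithmetic. The sketch below assumes unpadded ("valid") convolutions, which is consistent with the 28x28, 20x20, and 6x6 feature-map sizes reported for Fig. 3; the helper name `conv_out` is hypothetical.

```python
def conv_out(size, kernel, stride):
    """Output width of an unpadded convolution along one spatial dimension."""
    return (size - kernel) // stride + 1


conv1_size = conv_out(28, kernel=9, stride=1)            # Conv1 feature map width
primary_size = conv_out(conv1_size, kernel=9, stride=2)  # PrimaryCaps grid width

# 32 capsule channels over a primary_size x primary_size grid, each capsule 8D.
num_primary_caps = primary_size * primary_size * 32

# Every PrimaryCaps capsule casts one 16D vote for each of the 10 ClassCaps capsules.
num_votes = num_primary_caps * 10

print(conv1_size, primary_size, num_primary_caps, num_votes)
# prints: 20 6 1152 11520
```

The result reproduces the 1152 8D capsules that feed the dynamic routing of the ClassCaps layer.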
Concurrently, other modifications of the vanilla CapsNets have been proposed. Instead of using vector capsules, Hinton et al. [12] proposed the representation of capsules’ inputs and outputs as matrices and replaced the dynamic routing-by-agreement with the expectation–maximization (EM) algorithm. The EM routing is a clustering process based on the Gaussian mixture model. Compared to the


dynamic routing, it improves the sensitivity to small coupling differences for values close to 1 but implies higher computational time and complexity. Inspired by recent research advancements on transformers, Choi et al. [4] proposed the attention routing and capsule activation, while Hahn et al. [8] proposed self-routing CapsNets, in which the values of the coupling coefficients are learned during training. Other promising CapsNet architectures introduced different variants of the routing algorithm, including the inverted dot-product (IDP) attention routing [38], the self-attention routing [28], and the straight-through (ST) attentive routing [2]. The key features of the different routing methods are summarized in Table 2.

4 Efficient CapsNets Training Methodologies

State-of-the-art learning policies for traditional DNNs are designed to tune the learning rate and batch size values during different training epochs to achieve high accuracy and fast training time. Compared to the baseline policy in which the learning rate is exponentially decreasing and the batch size is kept constant during training, the most popular learning policies include:

• One-Cycle Policy [37]: This technique applies a single cycle of learning rate variation. The training is conducted in three phases. In phase-1 (for the first 45% of training epochs), the learning rate is increased from a minimum to a maximum value. In phase-2 (for another 45% of epochs), the learning rate is decreased in a symmetric way. In phase-3 (for the last 10% of epochs), the learning rate is further decreased.
• Warm Restarts [17]: The learning rate is initialized to a maximum value and cyclically decreased with cosine annealing to a minimum value and then reset to its maximum with a step function. The cyclic repetition of this process emulates a warm restart that allows the model to traverse several saddle points and local minima to obtain fast convergence.
• Adaptive Batch Size (AdaBatch) [6]: Since small batch sizes typically imply convergence in few training epochs, while large batch sizes guarantee high computational efficiency due to high parallel processing in GPU clusters, a good trade-off consists of adaptively increasing the batch size during training. Starting with a small batch size allows fast convergence in early epochs, and progressively increasing the batch size at selected epochs improves the performance due to the larger workload available per processor in later epochs.

The FasTrCaps framework [21] combines the above-discussed techniques and other optimization strategies for efficiently training the CapsNets. As shown in Fig. 5, the methodology is composed of three steps.
First, the learning rate policies are tailored and applied to CapsNets. Then, the adaptive batch size is selected. Among the explored learning rate policies, the Warm Restarts guarantees the most promising results in terms of accuracy, while the AdaBatch provides a good trade-off to obtain fast convergence. Combining these two techniques, a novel learning policy called WarmAdaBatch is designed.

Table 2 An overview of the most common versions of the routing algorithm for CapsNets

[36] Dynamic routing
  Short description: Coupling agreement computed based on cosine similarity, normalized by squash
  Benefits: Dynamic update with multiple iterations
  Potential limitations: Low sensitivity for values close to 1

[12] EM routing
  Short description: Clustering-based algorithm where the agreements follow a Gaussian distribution
  Benefits: Overcomes the dynamic routing limitation
  Potential limitations: Computationally expensive

[4] Attention routing
  Short description: Coupling coefficients computed with an attention module
  Benefits: Only forward computations
  Potential limitations: Uses highly complex capsule activations

[8] Self-routing
  Short description: Each capsule is routed independently by its subordinate routing network (non-iterative procedure)
  Benefits: Competitive robustness performance and viewpoint generalization
  Potential limitations: Scalability issues and high computational cost

[38] IDP attention routing
  Short description: Coefficients computed through an IDP mechanism between high-level capsule states and low-level capsule votes
  Benefits: Fast computations using a concurrent iterative procedure
  Potential limitations: Memory-intensive and complex backbone needed

[28] Self-attention routing
  Short description: Capsules between subsequent layers are routed with a self-attention mechanism
  Benefits: Competitive performance with few parameters
  Potential limitations: Low accuracy for complex tasks

[2] ST attentive routing
  Short description: Connection between high-level and low-level capsules based on binary decision with a straight-through estimator
  Benefits: Differentiability of computations and high accuracy on ImageNet
  Potential limitations: High number of parameters and high training time

[Figure: FasTrCaps framework. Input: CapsNet. Step 1: learning rate policies (Warm Restarts, WR). Step 2: batch size selection (adaptive batch size, variation for different epochs), combined into WarmAdaBatch (WR + AdaBatch). Step 3: CapsNet complexity reduction (reduced-size decoder, weight sharing). Output: optimized CapsNet]

Fig. 5 Overview of the FasTrCaps framework’s functionality [21]

The WarmAdaBatch method, shown in Algorithm 2, is a hybrid learning policy that combines the variations of the learning rate and the batch size. For the first three epochs, the batch size is set to 1, and then it is increased to 16 for the remaining training epochs. The first cycle of the warm restart policy is done during the first three epochs and the second one during the remaining training epochs. For completeness, the procedures describing the AdaBatch and Warm Restarts methods are shown in lines 15–25 and lines 26–34 of Algorithm 2, respectively.

In the third step, the computational complexity is reduced by performing two optimizations. (i) The size of the CapsNet decoder is reduced by around 5%, maintaining only the $1 \times 16$ inputs linked to the capsule that outputs the highest probability. (ii) The weights between the PrimaryCaps and ClassCaps layers are shared by having a single weight tensor associated with all the 8D vectors inside each $6 \times 6$ capsule, achieving more than 15% reduction in the total number of parameters.

The evaluation is conducted on the CapsNet [36] for the MNIST [15] and Fashion-MNIST [40] datasets. Table 3 shows the key results employing the WarmAdaBatch and Weight Sharing. The accuracy values are computed by averaging 5 training runs, each of them lasting for 30 epochs. For the MNIST dataset, the accuracy drop caused by the Weight Sharing is compensated by the WarmAdaBatch. The combination of both techniques results in a slightly higher accuracy (99.38% vs. 99.37% of the baseline), with fewer training epochs. For the Fashion-MNIST dataset, despite requiring slightly more training epochs than the baseline, the combination of WarmAdaBatch and Weight Sharing shows comparable accuracy w.r.t. the baseline.
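The decoder reduction of around 5% in the third step can be reproduced with a short parameter count. The sketch below assumes that the decoder is a cascade of three fully connected layers with 512 and 1024 hidden neurons and 784 outputs (one per MNIST pixel), as in the original CapsNet decoder; these layer sizes are not stated in this chapter, and the helper names are hypothetical.

```python
def fc_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def decoder_params(n_inputs, hidden=(512, 1024), n_outputs=784):
    """Parameter count of a three-layer fully connected decoder.

    The hidden and output sizes are assumptions of this sketch.
    """
    sizes = (n_inputs, *hidden, n_outputs)
    return sum(fc_params(a, b) for a, b in zip(sizes, sizes[1:]))


full = decoder_params(10 * 16)    # all ten 16D ClassCaps outputs
reduced = decoder_params(1 * 16)  # only the capsule with the highest probability
saving = 1 - reduced / full

print(full, reduced, f"{saving:.1%}")
# prints: 1411344 1337616 5.2%
```

Under these assumptions, only the first layer shrinks, which is why the saving stays close to 5% even though the decoder input is ten times smaller.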


Algorithm 2: WarmAdaBatch training method for CapsNets
 1 Procedure WarmAdaBatch($lr_{min}$, $lr_{max}$, MaxEpoch, MaxStep)
 2   $T_{curr} \leftarrow 0$;
 3   for Epoch ∈ {1, ..., MaxEpoch} do
       // Batch size update
 4     AdaBatch(4, Epoch);
 5     if Epoch ≤ 3 then
 6       $T_i \leftarrow 3 * 60{,}000$; // Steps in 3 epochs with batch size 1
 7     else
 8       $T_i \leftarrow 27 * 3{,}750$; // Steps in 27 epochs with batch size 16
 9     end
10     for Step ∈ {1, ..., MaxStep} do
         // Learning rate update
11       $T_{curr} \leftarrow WR(lr_{min}, lr_{max}, T_{curr}, T_i)$;
12     end
13   end
14 end
15 Procedure AdaBatch(P, CurrentEpoch)
16   if CurrentEpoch ≤ 3 then
17     BatchSize ← 1;
18   else if 4 ≤ CurrentEpoch ≤ 8 then
19     BatchSize ← $2^P$;
20   else if 9 ≤ CurrentEpoch ≤ 13 then
21     BatchSize ← $2^{P+1}$;
22   else
23     BatchSize ← $2^{P+2}$;
24   end
25 end
26 Procedure WarmRestarts($lr_{min}$, $lr_{max}$, $T_{curr}$, $T_i$)
27   $lr \leftarrow lr_{min} + \frac{1}{2}(lr_{max} - lr_{min})\left(1 + \cos\left(\pi \frac{T_{curr}}{T_i}\right)\right)$; // Learning rate update
28   if $T_{curr} = T_i$ then // Warm Restart after $T_i$ training steps
29     $T_{curr} \leftarrow 0$;
30   else // Current step update
31     $T_{curr} \leftarrow T_{curr} + 1$;
32   end
33   return $T_{curr}$;
34 end
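The two helper procedures of Algorithm 2 translate almost directly into Python. The sketch below mirrors lines 15–25 and 26–34 as listed; the function names are hypothetical, and, unlike the pseudocode, `warm_restarts` also returns the computed learning rate so that a caller can apply it to an optimizer.

```python
import math


def ada_batch(p, epoch):
    """Batch size schedule of the AdaBatch procedure (lines 15-25)."""
    if epoch <= 3:
        return 1
    if epoch <= 8:
        return 2 ** p
    if epoch <= 13:
        return 2 ** (p + 1)
    return 2 ** (p + 2)


def warm_restarts(lr_min, lr_max, t_curr, t_i):
    """Cosine-annealed learning rate with restart (lines 26-34).

    Returns the learning rate for the current step and the updated step counter.
    """
    lr = lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t_curr / t_i))
    t_next = 0 if t_curr == t_i else t_curr + 1
    return lr, t_next


# One short cycle of 10 steps, followed by the restart.
t = 0
schedule = []
for _ in range(12):
    lr, t = warm_restarts(0.0, 1.0, t, 10)
    schedule.append(lr)
```

In this example the learning rate starts at its maximum, reaches the minimum after ten steps, and jumps back to the maximum on the following step, which is the step-function reset described for the Warm Restarts policy.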

5 Hardware Architectures for Accelerating CapsNets’ Inference

For deploying CapsNets-based systems at the edge, it is crucial to minimize the power/energy consumption and maximize the performance. The unique operations involving capsules, Squash, and the dynamic routing make the existing architectures for accelerating traditional DNNs unsuitable or inefficient. Therefore, specialized architectures and dataflows need to be designed and tailored for CapsNets. A performance analysis of the CapsNets execution is beneficial for identifying the bottlenecks. The rest of the section discusses CapsAcc [19], which is the first


Table 3 Accuracy results obtained with CapsNet for the MNIST and Fashion-MNIST datasets, applying different solutions using the FasTrCaps framework [21]

Learning rate    Weight    Accuracy                   Epochs to reach max accuracy
and batch size   sharing   MNIST     Fashion-MNIST    MNIST    Fashion-MNIST
Baseline         No        99.37%    90.99%           29       17
WarmAdaBatch     No        99.45%    91.47%           8        27
Baseline         Yes       99.26%    90.47%           26       17
WarmAdaBatch     Yes       99.38%    90.67%           11       20

[Figure: two bar charts of execution time in μs on a logarithmic axis from 1.0E+00 to 1.0E+05. (a) Longest time: ClassCaps layer. (b) Longest time: Squash operation]

Fig. 6 Execution time breakdown of the CapsNet [21] on the Nvidia GeForce RTX 2080 Ti GPU. (a) Layer-wise breakdown. (b) Operations in the dynamic routing

accelerator architecture for CapsNets, the FEECA methodology [26] for exploring the design space of the computational units of CapsNets accelerators, and the DESCNet methodology [25] for conducting design space exploration of CapsNets memory units.

5.1 CapsNets Execution on GPUs and Their Bottlenecks

To understand how the CapsNets’ inference is performed, we conduct a comprehensive analysis to measure the performance of the PyTorch-based CapsNet [36] implementation for the MNIST dataset on the NVIDIA GeForce RTX 2080 Ti GPU. Figure 6a shows the execution time breakdown for each layer. The ClassCaps layer is the bottleneck since it is around 10–20× slower than the previous layers, despite having fewer parameters than the PrimaryCaps layer. To obtain more detailed results, the execution time for each operation of the dynamic routing in the ClassCaps layer is analyzed and reported in Fig. 6b. From the results, it is clear that the most compute-intensive operation is the Squash inside the ClassCaps layer. Hence, these analyses motivate the design of a hardware architecture and dataflow that efficiently compute the Squash and dynamic routing.

[Figure: (a) data memory and weight memory feed the data buffer, routing buffer, and weight buffer, which feed a 16x16 PE array followed by the accumulator and the activation unit, all driven by the control unit. (b) Each PE contains a data register, Weight1 and Weight2 registers, and a partial sum register, with 8-bit data and weight inputs, a 16-bit product, and 25-bit partial sums]

Fig. 7 Hardware architecture of the CapsAcc accelerator [19]. (a) Complete accelerator architecture. (b) Architecture of a processing element (PE)

5.2 The CapsAcc Accelerator

The top-level architecture of the CapsAcc accelerator is shown in Fig. 7a. The core of the computations is conducted in the Processing Element (PE) Array for efficiently computing the operations between matrices. The inputs are propagated toward the output of the PE array both horizontally (data) and vertically (weight and partial sum). Each PE, shown in Fig. 7b, consists of a multiply-and-accumulate (MAC) unit and four registers that synchronize the weight and data transfers and the partial sum results at each clock cycle. The Weight$_2$ Register allows the same weight to be stored and reused on different data for convolutional layer operations, while for fully connected operations, only a one-cycle latency overhead is introduced without affecting the throughput. The 8-bit data is multiplied with the 8-bit weight to form a 16-bit product, which is accumulated with the previous partial sum using 25 bits to avoid overflow. The resulting partial sums coming from the PE array are stored in an accumulator, followed by the activation unit, which can selectively perform the ReLU, Squash, normalization, or Softmax. More details on the implementations of these units are discussed in [19]. At each stage of the inference process, the control unit generates the control signals for all the components of the accelerator architecture.

The memory is organized such that all the weights for each operation are stored in the on-chip weight memory, while the input data, which correspond to the pixel intensities of the image, are stored in the on-chip data memory. The data buffer and weight buffer work as cushions for the interface with the PE array at a high access rate. Moreover, the accumulator unit contains a buffer for storing the output partial sums, and the routing buffer stores the coefficients of the dynamic routing. The multiplexers at the input of the PE array are used to handle the different dataflows for each operation.
To maximize the data reuse, the routing buffer stores the values of the coupling coefficients across different iterations of the dynamic routing.
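The bit widths of the PE datapath can be illustrated with a small behavioral model. The sketch below assumes signed two's-complement operands, which the text does not specify, and the function names are hypothetical; it models only the arithmetic of one MAC step, not the registers or the timing.

```python
def fits_signed(value, bits):
    """True if value is representable as a signed two's-complement number."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1


def pe_mac(data, weight, partial_sum):
    """One MAC step: 8-bit data x 8-bit weight, accumulated on 25 bits."""
    if not (fits_signed(data, 8) and fits_signed(weight, 8)):
        raise ValueError("data and weight must fit in 8 bits")
    product = data * weight
    assert fits_signed(product, 16)  # an 8x8-bit product always fits in 16 bits
    result = partial_sum + product
    if not fits_signed(result, 25):
        raise OverflowError("partial sum exceeds 25 bits")
    return result


def worst_case_accumulations(acc_bits=25):
    """Number of largest-magnitude products that fit in the accumulator."""
    largest_product = (-128) * (-128)
    return ((1 << (acc_bits - 1)) - 1) // largest_product


print(worst_case_accumulations())
# prints: 1023
```

Under the signed-operand assumption, the 9 extra bits of the 25-bit accumulator leave headroom for about a thousand worst-case products before an overflow can occur.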

Table 4 Parameters of the synthesized CapsAcc

Tech. node [nm]      45
Voltage [V]          1
Area [mm²]           2.60
Power [mW]           427.44
Clk Freq. [MHz]      250
Bit width            8
On-Chip Mem. [MB]    8

Table 5 Area and power, for the different components of the CapsAcc architecture

Component         Area [µm²]   Power [mW]
PE array          42,867       112.31
Accumulator       32,641       47.57
Activation        29,027       2.21
Data buffer       136,222      199.31
Routing buffer    32,598       47.56
Weight buffer     11,961       17.46
Other             4330         1.10

The complete CapsAcc architecture has been implemented in RTL (VHDL) and synthesized in a 45-nm CMOS technology node using the ASIC design flow with the Synopsys Design Compiler (see the parameters in Table 4). Gate-level simulations are conducted using Mentor ModelSim to obtain precise area, power, and performance figures for our design. Table 5 reports the detailed area and power breakdown, showing that the contributions are dominated by the buffers.

5.3 FEECA Methodology

The CapsAcc architecture represents a specific design solution for accelerating the CapsNets’ inference. For systematically exploring the design space of CapsNets accelerators, the FEECA methodology [26] can be employed. As shown in Fig. 8, its goal is to find Pareto-optimal sets of architectural parameters of the CapsNet accelerator to achieve good trade-offs between the design objectives, which are area, energy, and performance in terms of inference delay. Given a set of configurable parameters, the area, power, and energy of the PE arrays and the memories are computed through Synopsys Design Compiler and CACTI [16], respectively. The evaluation of each candidate solution is based on the model extraction of the CapsAcc accelerator [19] for the CapsNet [36]. For searching in the space of the solutions, the straightforward approach is to use brute force, but for reducing the exploration time, a heuristic multi-objective optimization approach based on the principles of the NSGA-II genetic algorithm [5] is used. It is an iterative

[Figure: FEECA flow. Inputs: CapsAcc accelerator, input CapsNet, and design objectives (Energy-Delay, Area-Delay, Area-Energy-Delay). Configurable parameters: #ROWS of the PE array [1, 50]; #COLS of the PE array [1, 50]; pipeline stages nstg {1, 2}; clock period T {2, 3, 4}; input pairs (weight+data) npe [1, 200]; memory bandwidth membw $\{2^8, 2^9, 2^{10}\}$. The PE array generator (Design Compiler) and the memory generator (CACTI) provide area, power, and energy for the models extraction. Search algorithm: brute-force or NSGA-II. Evaluation: area, energy, delay. Output: set of Pareto-optimal configurations]

Fig. 8 Overview of the FEECA methodology [26] for the design space exploration of CapsNets accelerators

[Figure: three log-scale scatter plots with the lowest-delay solution highlighted: area [µm²] vs. energy-delay product [mJ·s], where the lowest-delay solution is outside the Pareto front; energy [mJ] vs. area-delay product [µm²·s]; delay [s] vs. energy-area product [mJ·µm²]]

Fig. 9 Pareto-optimal set of configurations of CapsNet accelerator generated with the FEECA methodology [26]

algorithm that combines crossover and mutation to explore the solutions, which are progressively selected based on the Pareto frontiers. Figure 9 shows the set of Pareto-optimal solutions that form the output of the FEECA methodology. For visualization purposes, the results are shown in 2D plots, where each pair of objectives is combined into a product, namely energy × delay (EDP), area × delay (ADP), and energy × area (EAP). By reducing the dimension of the space, only a smaller number of solutions remain in the Pareto frontiers, which are represented by the gray lines. The highlighted lowest-delay solution (i.e., the fastest architecture) lies on the Pareto frontier only in the last two plots, i.e., the ADP vs. energy and EAP vs. delay trade-offs, while it is not Pareto-optimal for the other case of EDP vs. area.
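The effect of projecting the three objectives onto a 2D plot can be shown with a brute-force Pareto filter. The sketch below uses four made-up design points given as (area, energy, delay) tuples; the numbers and function names are hypothetical and are not results of the FEECA exploration.

```python
def dominates(a, b):
    """a dominates b if it is no worse in every objective and better in one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(points):
    """Brute-force filter that keeps the non-dominated points (minimization)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical design points: (area, energy, delay).
designs = [(4, 1, 2), (1, 4, 4), (2, 2, 6), (5, 3, 1)]
fastest = min(designs, key=lambda d: d[2])

front_3d = pareto_front(designs)

# Projection used in the first plot: (energy x delay, area).
edp_vs_area = {(e * d, a): (a, e, d) for a, e, d in designs}
front_2d = [edp_vs_area[p] for p in pareto_front(list(edp_vs_area))]

print(fastest in front_3d, fastest in front_2d)
# prints: True False
```

All four points are Pareto-optimal in the 3D space, but after combining energy and delay into a single product, the fastest design is dominated by a smaller design with a lower EDP, mirroring the situation described for the EDP vs. area plot.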


[Figure: the CapsAcc architecture (PE array, activation unit with ReLU, Softmax, and Squash, and control unit) is connected, together with the CPU, DMA, memory controller, and off-chip memory (DRAM), to the on-chip DESCNet scratchpad memory. The scratchpad is organized in banks (Bank 1 to Bank n), each divided into sectors, with one sleep transistor per sector index connected to VDD and controlled by the application-driven power management of the on-chip memory]

Fig. 10 Architectural view of the complete accelerator for CapsNets’ inference, with a focus on the DESCNet scratchpad memory [25]

5.4 Memory Design and Management for CapsNets

Motivated by the results in Table 5, which show that the on-chip area and power consumption of the complete CapsNet accelerator are dominated by the memory buffers, a specialized scratchpad memory organization (DESCNet [25]) is designed. The architectural view in Fig. 10 shows that the DESCNet memory is connected to the off-chip memory and the CapsNet accelerator through dedicated bus lines. The scratchpad memory is partitioned into banks, where each bank consists of equally sized sectors. All sectors with the same index across different banks are connected through power-gating circuitry implemented with sleep transistors to support efficient sector-level power management control at the cost of a certain area overhead. The application-driven memory power management unit determines the appropriate control signals for the sleep transistors.
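The sector-level power management can be pictured as a lookup from the memory usage of each operation to a set of sleep signals. The sketch below assumes that data are allocated contiguously starting from sector index 0 and that one sleep signal controls the sectors with the same index in all banks; the memory size, the usage values, and the function names are hypothetical.

```python
def sectors_on(usage_bytes, memory_bytes, num_sectors):
    """Number of sector indices that must stay powered for a given usage."""
    if memory_bytes % num_sectors != 0:
        raise ValueError("sectors must be equally sized")
    if usage_bytes > memory_bytes:
        raise ValueError("usage exceeds the scratchpad capacity")
    sector_bytes = memory_bytes // num_sectors
    return -(-usage_bytes // sector_bytes)  # ceiling division


def gating_schedule(usage_per_op, memory_bytes, num_sectors):
    """Per-operation sleep signals: True keeps a sector index powered."""
    schedule = {}
    for op, usage in usage_per_op.items():
        active = sectors_on(usage, memory_bytes, num_sectors)
        schedule[op] = [index < active for index in range(num_sectors)]
    return schedule


KIB = 1024
# Hypothetical per-operation memory usage for a 64 KiB scratchpad with 8 sectors.
usage = {"Conv1": 60 * KIB, "PrimaryCaps": 20 * KIB, "Squash": 4 * KIB}
schedule = gating_schedule(usage, memory_bytes=64 * KIB, num_sectors=8)
```

With these example numbers, the operation with the smallest footprint keeps a single sector index powered, while the remaining seven can be put to sleep for its duration.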

[Figure: DESCNet tool flow. (1) Inputs: CapsNet models (CapsNet, DeepCaps, ...) and the CapsNet hardware accelerator (CapsAcc, ...). (2) Analyzer: extracts the operation-wise memory usage (e.g., Squash, PrimaryCaps, ConvCaps2D) in terms of data reads, data writes, and accumulator size, with memory modeling through CACTI-P and PE array synthesis through SYNOPSYS-DC (45 nm technology). (3) Design space exploration: optimizations of memory configurations and sizes over the design options, sizes, and number of banks and sectors. Outputs: memory organization, energy consumption, area]

Fig. 11 DESCNet [25] methodology and tool flow for conducting the design space exploration

This memory model can be generalized for different memory organizations supporting different sizes and levels of parallelism, including multiport memories. Toward this, the following design options are analyzed:

1. Shared Multiport Memory (SMP): A shared on-chip memory with three ports for parallelized access to the weights, input data, and the accumulator’s storage.
2. Separated Memory (SEP): Weights, input data, and partial sums are stored in three separate on-chip memories.
3. Hybrid Memory (HY): A combination of the other two design options, i.e., an SMP coupled with a SEP memory.

Given the different memory organizations, sizes, the number of banks, and sectors per bank, a design space exploration is conducted. The flow of the DESCNet design space exploration methodology is shown in Fig. 11. For each design option, the values of memory organization (i.e., the size and the number of banks and sectors), the energy consumption, and area are generated. Given as input the CapsNet model and hardware accelerator, the memory usage and memory accesses for each operation of the CapsNet inference are extracted. Then, the analyzer collects the statistics for each configuration, and the design space is explored. The memory area and energy consumption estimations, with and without the power-gating option, are conducted through the CACTI-P tool [16].

The different memory architectural options have been evaluated for area and energy consumption. Figure 12a shows the 15,233 different DESCNet architectural configurations for the CapsNet [36] on the MNIST dataset [15], while Fig. 12b shows the 215,693 configurations for the DeepCaps [35] on the CIFAR10 dataset [13].

[Figure: (a) CapsNet on MNIST, lowest-energy configuration: SEP-PG. (b) DeepCaps on CIFAR10, lowest-energy configuration: HY-PG]

Fig. 12 Design space exploration of the DESCNet memory configurations [25]. (a) Results for the CapsNet on the MNIST dataset. (b) Results for the DeepCaps on the CIFAR10 dataset

For each design option (SMP, SEP, and HY) and its corresponding version with power-gating (with the suffix -PG), the Pareto-optimal solutions with the lowest energy are highlighted. Note that while SEP, SEP-PG, and HY-PG belong to the Pareto frontier, HY, SMP, and SMP-PG are dominated by other memory configurations. Using these optimizations, it is possible to achieve up to 80% energy saving and up to 47% area saving compared to the memory organization of CapsAcc [19].

6 Lightweight Optimizations for CapsNets Inference

To further ease the deployment of CapsNets at the edge, other lightweight optimizations can be conducted. A reduction of the wordlength of the weights and activations of a CapsNet for computing the inference not only lightens the memory storage requirements but might also have a significant impact on the energy consumption of the computational units. Moreover, the current trends in approximate computing can be leveraged to approximate the most compute-intensive operations, such as the multiplications, achieve energy-efficient CapsNet hardware architectures, and enable design-/run-time energy-quality trade-offs.

6.1 Quantization of CapsNets with the Q-CapsNets Framework

Despite the considerable energy savings, a wordlength that is too short lowers the accuracy of the CapsNets, which is typically an undesired outcome from the end-user perspective. To find efficient trade-offs between the memory footprint, the energy consumption, and the classification accuracy, the Q-CapsNets framework [22] is applied. As shown in Fig. 13, it explores different layer-wise and operation-wise arithmetic precisions for obtaining the quantized version of a given CapsNet. It tackles in particular the dynamic routing, which is a peculiar feature of the CapsNets involving complex and computationally expensive operations

[Figure: Q-CapsNets framework. Inputs: CapsNet models with FP32 training (CapsNet, DeepCaps, ...); dataset (MNIST, CIFAR10, ...); settings (accuracy tolerance, memory budget, rounding schemes). Outputs: quantized CapsNet models (model_satisfied, model_memory, model_accuracy)]

Fig. 13 Flow of the Q-CapsNets framework [22]

performed iteratively, with a significant impact on the energy consumption. Given the CapsNet model architecture, together with the training and test datasets, and a set of user constraints such as the accuracy tolerance, the memory budget, and the rounding schemes, the Q-CapsNets framework progressively reduces the numerical precision of the data (e.g., weights and activations) in the CapsNet inference, aiming at satisfying both requirements on accuracy and memory usage. A step-by-step description of the framework is the following:

(1) Layer-Uniform Quantization (weights + activations): All the weights and activations are converted to fixed-point arithmetic, with 1-bit integer part, and $Q_w$-bit and $Q_a$-bit fractional part, respectively. Afterward, their precision is further reduced in a uniform way (e.g., $Q_w = Q_a$).
(2) Memory Requirements Fulfillment: In this stage, only the CapsNet weights are quantized. Since the perturbations to weights in the final layers can be more costly than perturbations in the earlier layers, for each layer l its respective $Q_w$ is set such that $(Q_w)_{l+1} = (Q_w)_l - 1$. Having defined these conditions, the model_memory is obtained, having the correct $Q_w$ computed as the maximum integer value such that the sum of the weight memory occupied by each layer is lower than the memory budget. Afterward, if the accuracy of the model_memory is higher than the target accuracy, it continues to (3A) for further quantization steps. Otherwise, it jumps to (3B).
(3A) Layer-wise Quantization of Activations: The activations are quantized in a layer-wise fashion. Progressively, each layer of the CapsNet (except the first one) is selected, and $Q_a$ of the current layer is lowered until the minimum value for which the accuracy remains higher than the target accuracy. This step repeats iteratively until the $Q_a$ for the last layer is set. Afterward, it continues to (4A).
(3B) Layer-Uniform and Layer-wise Quantization of Weights: Starting from the outcome of step (1), only the weights are quantized, first in a uniform and then in a layer-wise manner (as in step 3A) until reaching the target accuracy. The


Table 6 Q-CapsNet’s accuracy results, weight (W) memory, and activation (A) memory reduction for the CapsNets and for the DeepCaps on the MNIST, Fashion-MNIST, and CIFAR10 datasets [22]

Model      Dataset   Accuracy   W mem reduction   A mem reduction
CapsNet    MNIST     99.58%     4.87×             2.67×
CapsNet    MNIST     99.49%     2.02×             2.74×
CapsNet    FMNIST    92.76%     4.11×             2.49×
CapsNet    FMNIST    78.26%     6.69×             2.46×
DeepCaps   MNIST     99.55%     7.51×             4.00×
DeepCaps   MNIST     99.60%     4.59×             6.45×
DeepCaps   FMNIST    94.93%     6.40×             3.20×
DeepCaps   FMNIST    94.92%     4.59×             4.57×
DeepCaps   CIFAR10   91.11%     6.15×             2.50×
DeepCaps   CIFAR10   91.18%     3.71×             3.34×

resulting CapsNet model (model_accuracy) is returned as another output of the framework. (4A) Dynamic Routing Quantization: Due to the computationally expensive operations, such as Squash and Softmax, the wordlength of the dynamic routing tensors may be different as compared to other layers of the CapsNet. Therefore, a specialized quantization process is performed in this step, in which the operators of the dynamic routing can be quantized more than the other activations (i.e., with a wordlength lower than .Qa , which we call .QDR ). The quantized CapsNet model generated at the end of this step is denoted as model_satisfied. Some key results of the Q-CapsNets framework implemented in PyTorch [33] and tested on the CapsNet [36] and DeepCaps [35] models for MNIST [15], Fashion-MNIST [40], and CIFAR10 [13] datasets are shown in Table 6. An efficient model_satisfied for the CapsNet on the MNIST dataset achieves .4.87× weight memory reduction with only .0.09% accuracy loss compared to the baseline. Even larger memory reductions (up to .7.51× weight and .6.45× activation memory reduction) can be obtained for the DeepCaps, with less than .0.15% accuracy loss. Note that the wordlength for the dynamic routing operations can be reduced up to 3 or 4 bits with very little accuracy loss compared to the full-precision model. Such an outcome is attributed to the fact that the operations of the involved coefficients (along with Squash and Softmax) are updated dynamically, thereby tolerating a more aggressive quantization compared to previous layers.
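The fixed-point conversion of step (1) and the memory-driven choice of $Q_w$ in step (2) can be illustrated with a minimal Python sketch. It is not the Q-CapsNets implementation: the function names, the treatment of the single integer bit as a sign bit, the two rounding schemes shown, and the 1-bit floor on the fractional part are all assumptions made for illustration.

```python
import math


def quantize(x, frac_bits, scheme="nearest"):
    """Quantize x to fixed point with a 1-bit integer part and frac_bits fractional bits.

    Assumption: the integer bit acts as a sign bit, so the representable
    range is [-1, 1 - 2**-frac_bits]; out-of-range values saturate.
    """
    scale = 2 ** frac_bits
    scaled = x * scale
    if scheme == "truncate":
        q = math.floor(scaled)
    elif scheme == "nearest":
        q = math.floor(scaled + 0.5)
    else:
        raise ValueError(f"unknown rounding scheme: {scheme}")
    q = max(-scale, min(scale - 1, q))  # saturate to the representable range
    return q / scale


def layer_frac_bits(q_first, num_layers):
    """Apply (Q_w)_{l+1} = (Q_w)_l - 1, floored at 1 bit (the floor is an assumption)."""
    return [max(q_first - layer, 1) for layer in range(num_layers)]


def fit_memory_budget(weights_per_layer, budget_bits, q_max=31):
    """Return the largest first-layer Q_w whose total weight memory is below the budget.

    weights_per_layer holds the number of weights in each layer; every weight
    occupies 1 integer bit plus its layer's fractional bits.
    """
    for q_first in range(q_max, 0, -1):
        frac_bits = layer_frac_bits(q_first, len(weights_per_layer))
        total_bits = sum(n * (1 + q) for n, q in zip(weights_per_layer, frac_bits))
        if total_bits < budget_bits:
            return q_first, frac_bits
    return None  # no setting fits the budget
```

For a hypothetical two-layer model with 100 and 50 weights and a 1000-bit budget, `fit_memory_budget([100, 50], 1000)` selects 5 fractional bits for the first layer and 4 for the second, for a total of 850 bits.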

Fig. 14 Flow of the ReD-CaNe methodology [24]. Inputs: CapsNet operations and an approximate component library. Output: design of an approximate CapsNet for efficient inference

6.2 Approximations for Energy-Efficient CapsNets with the ReD-CaNe Methodology

Approximation errors in hardware have been extensively employed to trade off accuracy for efficiency. Several recent works [9, 10, 18, 32] have studied the resiliency of traditional DNNs to approximations, showing the possibility to achieve high energy savings with minimal accuracy loss. For CapsNets, the resiliency analysis is conducted through the ReD-CaNe methodology [24]. Since the estimated energy consumption of the multipliers accounts for 96% of the total energy share of the computational path of the DeepCaps inference, the approximate multipliers are targeted. The ReD-CaNe methodology, shown in Fig. 14, provides useful strategies for deploying energy-efficient inference, showing the potential to design and apply approximations to specific CapsNet layers and operations (i.e., the more resilient ones) without sacrificing much accuracy. Since the profiling of the error distributions produced by the approximate multipliers of the EvoApprox8B library [31] shows that the majority of the components have a Gaussian-like distribution of the arithmetic error, the approximations can be modeled as a noise injection of a certain magnitude and average. A step-by-step description of the ReD-CaNe methodology follows:

1. Group Extraction: The operations of the CapsNet inference are divided into groups based on the type of operation involved (e.g., MAC, activation function, Softmax, or logits update). This step generates the Groups.
2. Group-wise Resilience Analysis: The test accuracy drop is monitored by injecting noise into different groups.
3. Mark Resilient Groups: Based on the results of the analysis performed in Step 2, the more resilient groups are marked. After this step, there are two categories of Groups, the Resilient and the Non-resilient ones.
4. Layer-wise Resilience Analysis for Non-resilient Groups: For each non-resilient group, the test accuracy drop is monitored by injecting noise at each layer.
5. Mark Resilient Layers for Each Non-resilient Group: Based on the results of the analysis performed in Step 4, the more resilient layers are marked.
6. Select Approximate Components: For each operation, the approximate components from a given library are selected based on the resilience measured as the noise magnitude.

As a case study, the detailed results for the DeepCaps on the CIFAR10 dataset are shown in Fig. 15. In the experiments for Step 2, the same noise is injected into every operation within a group, while the other groups are kept accurate. As shown in Fig. 15a, the Softmax and the logits update groups are more resilient than the MAC outputs and activations, because for the latter two groups the DeepCaps accuracy starts to decrease at a correspondingly lower noise magnitude. Note that, for low noise magnitudes, the accuracy slightly increases due to regularization, an effect similar to that of dropout. Figure 15b shows the resiliency analysis of each layer of the non-resilient groups (i.e., MAC outputs and activations). While the first convolutional layer is the least resilient, the ClassCaps layer and the ConvCaps3D layer are the most resilient ones. Since the latter is the only convolutional layer that employs the dynamic routing algorithm, the higher resilience is attributed to the dynamic routing iterations, in which the coefficients are updated dynamically at runtime and can thus adapt to the approximation errors. Moreover, among the activations, the most resilient layers are the ConvCaps3D, ConvCaps2D#4, and ConvCaps2D#8, which are the layers in the skip connection path of the DeepCaps.
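The noise-injection model and the marking of resilient groups (Steps 2 and 3) can be sketched in Python as follows. This is an illustrative sketch, not the ReD-CaNe code: scaling the Gaussian noise by the range of the values, the `evaluate` callback, the accuracy-drop tolerance, and the resilience threshold are assumptions introduced here.

```python
import random


def inject_noise(values, noise_magnitude, noise_average=0.0, rng=random):
    """Model an approximate component as additive Gaussian noise.

    Assumption: the standard deviation and the mean of the noise are
    noise_magnitude and noise_average times the range of the values.
    """
    value_range = max(values) - min(values)
    std = noise_magnitude * value_range
    mean = noise_average * value_range
    return [v + rng.gauss(mean, std) for v in values]


def max_tolerated_magnitude(evaluate, group, magnitudes, baseline, max_drop):
    """Largest tested noise magnitude whose accuracy drop stays within max_drop.

    evaluate(group, magnitude) is a hypothetical callback that runs inference
    with noise injected only into the given group and returns the accuracy.
    """
    tolerated = 0.0
    for magnitude in sorted(magnitudes):
        if baseline - evaluate(group, magnitude) > max_drop:
            break
        tolerated = magnitude
    return tolerated


def mark_resilient(evaluate, groups, magnitudes, baseline, max_drop, threshold):
    """Split groups into resilient and non-resilient ones (Steps 2 and 3)."""
    tolerated = {
        g: max_tolerated_magnitude(evaluate, g, magnitudes, baseline, max_drop)
        for g in groups
    }
    resilient = [g for g in groups if tolerated[g] >= threshold]
    non_resilient = [g for g in groups if tolerated[g] < threshold]
    return resilient, non_resilient, tolerated
```

The same two helper functions can be reused for Steps 4 and 5 by passing the layers of a non-resilient group in place of the groups.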

Fig. 15 ReD-CaNe methodology applied to the DeepCaps for the CIFAR10 dataset [24]. (a) Step 2: group-wise resilience for group #1 (MAC outputs), #2 (activations), #3 (softmax), and #4 (logits update); Step 3 marks the softmax and logits update groups as resilient. (b) Step 4: layer-wise resilience; for the MAC outputs, the first layer is the least resilient and Step 5 marks ConvCaps3D and ClassCaps (in the dynamic routing) as resilient; for the activations, Step 5 marks ConvCaps3D, ConvCaps2D#4, and ConvCaps2D#8 (in the skip layers) as resilient

7 HW-Aware Neural Architecture Search for DNNs and CapsNets

Manually designing CapsNets is tedious and demands considerable effort. Neural Architecture Search (NAS) methods automatically select the best model for a given application task. Moreover, hardware-aware NAS methodologies are employed to find efficient trade-offs between the accuracy of the models and the efficiency of their execution on specialized hardware accelerators in IoT-Edge/CPS devices. Toward this, the NASCaps framework [23] jointly optimizes the network accuracy and its corresponding hardware efficiency, expressed as energy, memory, and latency of a given hardware accelerator. It supports both the traditional DNN layers and the CapsNet layers and operations, including the dynamic routing.

The overall functionality and workflow of the NASCaps framework are shown in Fig. 16. As input, the framework receives the configuration of the underlying hardware accelerator and a given dataset used for training, as well as a collection of the possible types of layers that can be used to form different candidate DNNs and CapsNets. The evolutionary search is based on the principles of the multi-objective iterative genetic NSGA-II algorithm [5]. Analytical models of the execution of different types of layers and operations in the hardware architecture are developed to quickly estimate the hardware metrics during the design space exploration. To further reduce the exploration time, the accuracy of each candidate DNN/CapsNet is evaluated after a limited number of training epochs, where the number of epochs is selected based on the Pearson correlation coefficient [34] w.r.t. the fully trained networks. Afterward, the Pareto frontiers relative to accuracy, energy consumption, latency, and memory footprint are generated to proceed to the next iteration. At the end of this selection procedure, the Pareto-optimal DNN solutions are fully trained to make an exact accuracy evaluation.

Fig. 16 Overview of the NASCaps framework’s functionality [23]. Inputs: the HW accelerator (with an HW model for the extraction of energy, memory, and latency), the dataset, and the DNN and CapsNet layer library. Within the NSGA-II loop, random initial DNNs are trained with limited epochs and evaluated, the P best individuals are selected, and Q offspring are generated (crossover, mutation) until the termination conditions are met; full training follows. Output: set of Pareto-optimal highly accurate and HW-efficient convolutional CapsNets

The NASCaps framework [23] has been implemented with the TensorFlow library [1] running on GPU-HPC computing nodes equipped with four NVIDIA Tesla V100-SXM2 GPUs. A set of case study experiments for the CIFAR10 dataset [13] running on the CapsAcc architecture [19] is shown in Fig. 17. The maximum number of generations for the genetic algorithm is set to 20, but a maximum time-out of 24 hours has been imposed, thus stopping the algorithm at the 14th iteration. The candidate solutions in the earlier generations in Fig. 17a are quite inefficient, while the networks found in the latest generations outperform the manually designed CapsNet and DeepCaps. Note that at this stage of partially trained networks (i.e., after 10 epochs), some solutions exhibit >20% accuracy improvements compared to the DeepCaps. The Pareto-optimal solutions have been fully trained for 300 epochs, and the results are shown in Fig. 17b. The highlighted solution achieves an accuracy of 85.99% (about 1% accuracy drop), while reducing the energy consumption by 52.12%, the memory footprint by 30.19%, and the latency by 64.34%, compared to the DeepCaps inference run on the CapsAcc accelerator.

Fig. 17 NASCaps framework applied to the CIFAR10 dataset, showing the trade-offs between accuracy, energy, memory footprint, and latency [23]. (a) Partially trained results. (b) Fully trained results. Axes: accuracy [%] versus energy [mJ], latency [s], and memory footprint [kiB]
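The Pareto-based selection over accuracy, energy, latency, and memory can be illustrated with a short Python sketch. It only extracts the first non-dominated front and is not the NASCaps or NSGA-II implementation (NSGA-II additionally ranks the subsequent fronts and uses a crowding distance); the dictionary keys and the candidate values in the usage note are hypothetical.

```python
def objectives(candidate):
    """Map a candidate to a tuple in which every entry is to be minimized."""
    return (
        -candidate["accuracy"],  # accuracy is maximized, so negate it
        candidate["energy"],
        candidate["latency"],
        candidate["memory"],
    )


def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in one."""
    oa, ob = objectives(a), objectives(b)
    no_worse = all(x <= y for x, y in zip(oa, ob))
    strictly_better = any(x < y for x, y in zip(oa, ob))
    return no_worse and strictly_better


def pareto_front(candidates):
    """Return the candidates that are not dominated by any other candidate."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates if other is not c)
    ]
```

As a usage example with made-up numbers, a candidate with 80% accuracy, 12 mJ, 2 s, and 120 kiB is dominated by one with 85% accuracy, 10 mJ, 1 s, and 100 kiB, and is therefore excluded from the front, whereas a candidate with higher accuracy but higher cost in every hardware metric remains on it.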

8 Conclusion

Capsule Networks offer high learning capabilities, which result in high accuracy in several applications and high robustness against vulnerability threats that involve spatial transformations. However, compared to traditional DNNs, the capsule layers introduce an additional dimension, and the iterative dynamic routing makes CapsNets highly compute-intensive. In this chapter, several optimization techniques and frameworks tailored for CapsNets have been proposed. The FasTrCaps framework employs state-of-the-art learning policies and reduces the complexity of CapsNets for efficient training. CapsNet hardware architectures based on the CapsAcc inference accelerator are explored with the FEECA methodology, while the DESCNet methodology optimizes the memory organization based on the CapsNets’ workload. The Q-CapsNets framework produces lightweight quantized CapsNets, and the ReD-CaNe methodology further reduces the energy consumption by employing approximate multipliers. Moreover, the NASCaps framework enables hardware-aware capsule-based neural architecture search for jointly optimizing accuracy, memory, energy, and latency, thus enabling CapsNet deployment in resource-constrained edge devices.

Acknowledgments This work has been supported in part by the Doctoral College Resilient Embedded Systems, which is run jointly by the TU Wien’s Faculty of Informatics and the UAS Technikum Wien.

References

1. Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D.G., Steiner, B., Tucker, P.A., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., Zheng, X.: TensorFlow: A system for large-scale machine learning. In: Keeton, K., Roscoe, T. (eds.) 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, Savannah, GA, USA, November 2–4, 2016, pp. 265–283. USENIX Association (2016). https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi
2. Ahmed, K., Torresani, L.: Star-caps: Capsule networks with straight-through attentive routing. In: Wallach, H.M., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E.B., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8–14, 2019, Vancouver, BC, pp. 9098–9107 (2019). https://proceedings.neurips.cc/paper/2019/hash/cf040fc71060367913e81ac1eb050aea-Abstract.html
3. Capra, M., Bussolino, B., Marchisio, A., Shafique, M., Masera, G., Martina, M.: An updated survey of efficient hardware architectures for accelerating deep convolutional neural networks. Future Int. 12(7), 113 (2020). https://doi.org/10.3390/fi12070113
4. Choi, J., Seo, H., Im, S., Kang, M.: Attention routing between capsules. In: 2019 IEEE/CVF International Conference on Computer Vision Workshops, ICCV Workshops 2019, Seoul, Korea (South), October 27–28, 2019, pp. 1981–1989. IEEE (2019). https://doi.org/10.1109/ICCVW.2019.00247
5. Deb, K., Agrawal, S., Pratap, A., Meyarivan, T.: A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evol. Comput. 6(2), 182–197 (2002). https://doi.org/10.1109/4235.996017
6. Devarakonda, A., Naumov, M., Garland, M.: ADABATCH: Adaptive batch sizes for training deep neural networks. CoRR abs/1712.02029 (2017). http://arxiv.org/abs/1712.02029
7. Gu, J., Tresp, V.: Improving the robustness of capsule networks to image affine transformations. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, June 13–19, 2020, pp. 7283–7291. Computer Vision Foundation/IEEE (2020). https://doi.org/10.1109/CVPR42600.2020.00731. https://openaccess.thecvf.com/content_CVPR_2020/html/Gu_Improving_the_Robustness_of_Capsule_Networks_to_Image_Affine_Transformations_CVPR_2020_paper.html
8. Hahn, T., Pyeon, M., Kim, G.: Self-routing capsule networks. In: Wallach, H.M., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E.B., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8–14, 2019, Vancouver, BC, pp. 7656–7665 (2019). https://proceedings.neurips.cc/paper/2019/hash/e46bc064f8e92ac2c404b9871b2a4ef2-Abstract.html
9. Hanif, M.A., Hafiz, R., Shafique, M.: Error resilience analysis for systematically employing approximate computing in convolutional neural networks. In: Madsen, J., Coskun, A.K. (eds.) 2018 Design, Automation & Test in Europe Conference & Exhibition, DATE 2018, Dresden, March 19–23, 2018, pp. 913–916. IEEE (2018). https://doi.org/10.23919/DATE.2018.8342139
10. Hanif, M.A., Marchisio, A., Arif, T., Hafiz, R., Rehman, S., Shafique, M.: X-DNNs: systematic cross-layer approximations for energy-efficient deep neural networks. J. Low Power Electron. 14(4), 520–534 (2018). https://doi.org/10.1166/jolpe.2018.1575
11. Hinton, G.E., Krizhevsky, A., Wang, S.D.: Transforming auto-encoders. In: Honkela, T., Duch, W., Girolami, M.A., Kaski, S. (eds.) Artificial Neural Networks and Machine Learning - ICANN 2011 - 21st International Conference on Artificial Neural Networks, Espoo, June 14–17, 2011, Proceedings, Part I, Lecture Notes in Computer Science, vol. 6791, pp. 44–51. Springer (2011). https://doi.org/10.1007/978-3-642-21735-7_6
12. Hinton, G.E., Sabour, S., Frosst, N.: Matrix capsules with EM routing. In: 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, April 30–May 3, 2018, Conference Track Proceedings. OpenReview.net (2018). https://openreview.net/forum?id=HJWLfGWRb
13. Krizhevsky, A.: Learning Multiple Layers of Features from Tiny Images. University of Toronto, Toronto (2012)
14. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Bartlett, P.L., Pereira, F.C.N., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3–6, 2012, Lake Tahoe, Nevada, pp. 1106–1114 (2012). https://proceedings.neurips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html
15. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998). https://doi.org/10.1109/5.726791
16. Li, S., Chen, K., Ahn, J.H., Brockman, J.B., Jouppi, N.P.: CACTI-P: architecture-level modeling for SRAM-based structures with advanced leakage reduction techniques. In: Phillips, J.R., Hu, A.J., Graeb, H. (eds.) 2011 IEEE/ACM International Conference on Computer-Aided Design, ICCAD 2011, San Jose, California, November 7–10, 2011, pp. 694–701. IEEE Computer Society (2011). https://doi.org/10.1109/ICCAD.2011.6105405
17. Loshchilov, I., Hutter, F.: SGDR: stochastic gradient descent with warm restarts. In: 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings. OpenReview.net (2017). https://openreview.net/forum?id=Skq89Scxx
18. Marchisio, A., Hanif, M.A., Khalid, F., Plastiras, G., Kyrkou, C., Theocharides, T., Shafique, M.: Deep learning for edge computing: Current trends, cross-layer optimizations, and open research challenges. In: 2019 IEEE Computer Society Annual Symposium on VLSI, ISVLSI 2019, Miami, FL, July 15–17, 2019, pp. 553–559. IEEE (2019). https://doi.org/10.1109/ISVLSI.2019.00105
19. Marchisio, A., Hanif, M.A., Shafique, M.: CapsAcc: An efficient hardware accelerator for CapsuleNets with data reuse. In: Teich, J., Fummi, F. (eds.) Design, Automation & Test in Europe Conference & Exhibition, DATE 2019, Florence, March 25–29, 2019, pp. 964–967. IEEE (2019). https://doi.org/10.23919/DATE.2019.8714922
20. Marchisio, A., Nanfa, G., Khalid, F., Hanif, M.A., Martina, M., Shafique, M.: CapsAttacks: Robust and imperceptible adversarial attacks on capsule networks. CoRR abs/1901.09878 (2019). http://arxiv.org/abs/1901.09878
21. Marchisio, A., Bussolino, B., Colucci, A., Hanif, M.A., Martina, M., Masera, G., Shafique, M.: FasTrCaps: An integrated framework for fast yet accurate training of capsule networks. In: 2020 International Joint Conference on Neural Networks, IJCNN 2020, Glasgow, July 19–24, 2020, pp. 1–8. IEEE (2020). https://doi.org/10.1109/IJCNN48605.2020.9207533
22. Marchisio, A., Bussolino, B., Colucci, A., Martina, M., Masera, G., Shafique, M.: Q-CapsNets: A specialized framework for quantizing capsule networks. In: 57th ACM/IEEE Design Automation Conference, DAC 2020, San Francisco, CA, July 20–24, 2020, pp. 1–6. IEEE (2020). https://doi.org/10.1109/DAC18072.2020.9218746
23. Marchisio, A., Massa, A., Mrazek, V., Bussolino, B., Martina, M., Shafique, M.: NASCaps: A framework for neural architecture search to optimize the accuracy and hardware efficiency of convolutional capsule networks. In: IEEE/ACM International Conference On Computer Aided Design, ICCAD 2020, San Diego, CA, November 2–5, 2020, pp. 114:1–114:9. IEEE (2020). https://doi.org/10.1145/3400302.3415731
24. Marchisio, A., Mrazek, V., Hanif, M.A., Shafique, M.: Red-cane: A systematic methodology for resilience analysis and design of capsule networks under approximations. In: 2020 Design, Automation & Test in Europe Conference & Exhibition, DATE 2020, Grenoble, March 9–13, 2020, pp. 1205–1210. IEEE (2020). https://doi.org/10.23919/DATE48585.2020.9116393
25. Marchisio, A., Mrazek, V., Hanif, M.A., Shafique, M.: DESCNet: developing efficient scratchpad memories for capsule network hardware. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 40(9), 1768–1781 (2021). https://doi.org/10.1109/TCAD.2020.3030610
26. Marchisio, A., Mrazek, V., Hanif, M.A., Shafique, M.: FEECA: design space exploration for low-latency and energy-efficient capsule network accelerators. IEEE Trans. Very Large Scale Integr. Syst. 29(4), 716–729 (2021). https://doi.org/10.1109/TVLSI.2021.3059518
27. Marchisio, A., Bussolino, B., Salvati, E., Martina, M., Masera, G., Shafique, M.: Enabling capsule networks at the edge through approximate softmax and squash operations. In: 2022 IEEE/ACM International Symposium on Low Power Electronics and Design, ISLPED 2022, Boston, MA, August 1–3, 2022, pp. 1–6. IEEE (2022)
28. Mazzia, V., Salvetti, F., Chiaberge, M.: Efficient-CapsNet: Capsule network with self-attention routing. CoRR abs/2101.12491 (2021). https://arxiv.org/abs/2101.12491
29. Michels, F., Uelwer, T., Upschulte, E., Harmeling, S.: On the vulnerability of capsule networks to adversarial attacks. CoRR abs/1906.03612 (2019). http://arxiv.org/abs/1906.03612
30. Monday, H.N., Li, J., Nneji, G.U., Nahar, S., Hossin, M.A., Jackson, J.: Covid-19 pneumonia classification based on neurowavelet capsule network. Healthcare 10(3) (2022). https://doi.org/10.3390/healthcare10030422. https://www.mdpi.com/2227-9032/10/3/422
31. Mrazek, V., Hrbacek, R., Vasícek, Z., Sekanina, L.: EvoApprox8B: Library of approximate adders and multipliers for circuit design and benchmarking of approximation methods. In: Atienza, D., Natale, G.D. (eds.) Design, Automation & Test in Europe Conference & Exhibition, DATE 2017, Lausanne, March 27–31, 2017, pp. 258–261. IEEE (2017). https://doi.org/10.23919/DATE.2017.7926993
32. Mrazek, V., Vasícek, Z., Sekanina, L., Hanif, M.A., Shafique, M.: ALWANN: automatic layer-wise approximation of deep neural network accelerators without retraining. In: Pan, D.Z. (ed.) Proceedings of the International Conference on Computer-Aided Design, ICCAD 2019, Westminster, CO, November 4–7, 2019, pp. 1–8. ACM (2019). https://doi.org/10.1109/ICCAD45719.2019.8942068
33. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Köpf, A., Yang, E.Z., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S.: PyTorch: An imperative style, high-performance deep learning library. In: Wallach, H.M., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E.B., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8–14, 2019, Vancouver, BC, pp. 8024–8035 (2019). https://proceedings.neurips.cc/paper/2019/hash/bdbca288fee7f92f2bfa9f7012727740-Abstract.html
34. Pearson, K., for National Eugenics, G.L.: “Note on Regression and Inheritance in the Case of Two Parents”. Proceedings of the Royal Society. Royal Society (1895). https://books.google.it/books?id=xst6GwAACAAJ
35. Rajasegaran, J., Jayasundara, V., Jayasekara, S., Jayasekara, H., Seneviratne, S., Rodrigo, R.: DeepCaps: Going deeper with capsule networks. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, June 16–20, 2019, pp. 10725–10733. Computer Vision Foundation/IEEE (2019). https://doi.org/10.1109/CVPR.2019.01098. http://openaccess.thecvf.com/content_CVPR_2019/html/Rajasegaran_DeepCaps_Going_Deeper_With_Capsule_Networks_CVPR_2019_paper.html
36. Sabour, S., Frosst, N., Hinton, G.E.: Dynamic routing between capsules. In: Guyon, I., von Luxburg, U., Bengio, S., Wallach, H.M., Fergus, R., Vishwanathan, S.V.N., Garnett, R. (eds.) Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4–9, 2017, Long Beach, CA, pp. 3856–3866 (2017). https://proceedings.neurips.cc/paper/2017/hash/2cad8fa47bbef282badbb8de5374b894-Abstract.html
37. Smith, L.N., Topin, N.: Super-convergence: very fast training of residual networks using large learning rates. CoRR abs/1708.07120 (2017). http://arxiv.org/abs/1708.07120
38. Tsai, Y.H., Srivastava, N., Goh, H., Salakhutdinov, R.: Capsules with inverted dot-product attention routing. In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020. OpenReview.net (2020). https://openreview.net/forum?id=HJe6uANtwH
39. Wu, X., Cao, Y., Lu, H., Liu, S., Wang, D., Wu, Z., Liu, X., Meng, H.: Speech emotion recognition using sequential capsule networks. IEEE ACM Trans. Audio Speech Lang. Process. 29, 3280–3291 (2021). https://doi.org/10.1109/TASLP.2021.3120586
40. Xiao, H., Rasul, K., Vollgraf, R.: Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. CoRR abs/1708.07747 (2017). http://arxiv.org/abs/1708.07747
41. Zhao, W., Peng, H., Eger, S., Cambria, E., Yang, M.: Towards scalable and reliable capsule networks for challenging NLP applications. In: Korhonen, A., Traum, D.R., Màrquez, L. (eds.) Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, July 28–August 2, 2019, Volume 1: Long Papers, pp. 1549–1559. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/p19-1150

Design of Sparsity Optimized Photonic Deep Learning Accelerators

Febin Sunny, Mahdi Nikdast, and Sudeep Pasricha

F. Sunny · M. Nikdast · S. Pasricha, Department of Electrical and Computer Engineering, 1373 Campus Delivery, Colorado State University, Fort Collins, CO, USA
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. S. Pasricha, M. Shafique (eds.), Embedded Machine Learning for Cyber-Physical, IoT, and Edge Computing, https://doi.org/10.1007/978-3-031-39932-9_13

1 Introduction

Over the past decade, convolutional neural networks (CNNs) have exhibited success in many application domains, such as image/video classification, object detection, and even sequence learning. In order to tackle more complex problems in these domains, CNNs have become increasingly compute and memory intensive. This is reflected in the increase in operations needed from ~4.5 million for LeNet-5 [1], proposed in 1998, to ~30 billion for VGG16 [2], proposed in 2014. To keep pace with the continuous increase in CNN resource requirements, accelerator platforms that can cater to these requirements, such as Google’s tensor processing units (TPUs) and custom application-specific integrated circuits (ASICs), have been proposed. Even general-purpose platforms such as CPUs and graphics processing units (GPUs) now have advanced vector processing and tensor processing support. However, these platforms still have low performance and energy efficiency for most CNN applications.

Sparse neural networks (SpNNs) [3] enable a reduced number of neurons and synapses while maintaining the original model accuracy. Therefore, they represent a promising optimization to reduce the overall resource requirements for CNNs in resource-constrained environments. Sparsifying a CNN into an SpNN provides the immediate benefit of easier compression, thereby reducing the memory footprint requirement. Unfortunately, simply deploying an SpNN on an accelerator does not necessarily ensure model performance and energy-efficiency improvements. This is because the strategies for dense neural network acceleration, for which most accelerators today are optimized, are not able to take advantage of the sparsity available in neural networks. Dense neural network accelerators orchestrate dataflow and operations even for parameters that have been sparsified (i.e., zeroed out). Without carefully devised strategies for taking advantage of sparsity, conventional DNN accelerators still perform the operations on these zeroed parameters, whereas skipping them would lower latency and energy consumption. Therefore, sparsity-aware strategies that reduce the number of operations while accelerating SpNNs are essential.

There have been a few recent efforts to design accelerators that provide support for SpNNs [4–6]. But these electronic accelerators face fundamental limitations in the post-Moore era, where processing capabilities are no longer improving as they once did, and metallic wires create new dataflow bottlenecks [7]. Neural network accelerator architectures that leverage silicon photonics for computing and data transfer can enable low-latency and energy-efficient computation solutions [8–12]. However, they are not impervious to the high latency and energy wastage problem when accelerating SpNNs.

In [13], we presented a novel neural network accelerator designed with silicon photonics that is optimized for exploiting sparsity, to enable energy-efficient and low-latency SpNN acceleration. To the best of our knowledge, this work presented the first noncoherent photonic SpNN accelerator. The novel contributions in this paper can be summarized as follows:

• The design of a novel photonic-domain SpNN hardware accelerator architecture that utilizes a modular, vector-granularity-aware structure to enable high throughput and energy-efficient execution across different CNN models.
• Sparsity-aware data compression and dataflow techniques for fully connected and convolution layers, which are tuned for the high throughput operation of this photonic accelerator.
• A comprehensive comparison with state-of-the-art sparse electronic and dense photonic CNN accelerator platforms, to demonstrate the potential of the SONIC accelerator platform.

2 Background and Related Work There is a need for specialized hardware architectures for accelerating SpNNs. In recent years, a few such architectures have been proposed by the electronic machine learning (ML) acceleration research community, e.g., [4–6]. The framework presented in [4] leveraged a custom instruction-set architecture (ISA) for SpNNs. Specialized buffer controller architectures were used, which involved indexing to keep track of sparse elements, thus preventing them from being fed to the processing elements. In [5], a software-hardware co-optimized reconfigurable sparse CNN accelerator design was proposed for FPGAs. The architecture exploited both interand intra-output feature map parallelism. Kernel merging along with structured

Design of Sparsity Optimized Photonic Deep Learning Accelerators

331

sparsity were considered to further improve the overall efficiency. The work in [6] described an FPGA-based implementation of a sparse CNN accelerator. The accelerator made use of an output feature-map compression algorithm, which allowed the accelerator to operate directly on compressed data. To address the demand for lower latency and better energy efficiency, there has been growing interest in using silicon photonics for ML acceleration [14] and manycore computing in general [15]. There is a broad body of work which explores the feasibility of silicon photonics for on-chip communication [16–38]; using this technology for computation, however, is an emerging trend. Silicon photonic neural network accelerators can be broadly classified into two types: coherent and noncoherent. Coherent architectures use a single wavelength to operate and imprint weight and activation parameters onto the electrical field amplitude of optical signals, e.g., [9]. In contrast, noncoherent architectures use multiple wavelengths, where each wavelength can be used to perform an individual neuron operation in parallel with other wavelengths, e.g., [8–12]. In these architectures, parameters are imprinted directly onto signal amplitude. The work in [9] was the first to consider sparsity in the design of coherent photonic neural network accelerators. Structured sparsity techniques were used along with fast Fourier transform (FFT)-based optical convolution, with the goal of reducing the area consumption of coherent architectures. In order to implement matrix multiplication units in coherent networks, the operation has to be encoded into phase changes in the devices (Mach-Zehnder Interferometers (MZIs)). Depending on the matrix sizes being considered, the phase matrices and hence the MZI count can become quite large. In order to reduce phase matrix size and MZI count, a singular value decomposition (SVD)-based approach for phase matrix representation is adopted. 
However, this approach makes these architectures susceptible to accuracy loss, as the experiments in [9] showed. Due to phase encoding noise, phase error accumulation, and scalability limitations of coherent accelerators [52], there has been growing interest in noncoherent photonic accelerators. Noncoherent dense neural network accelerators were proposed in [8] and [10], where the basic devices for the multiply-and-accumulate units are microring resonators [8] and microdisks [10]. The work in [8] also utilized cross-layer device-level and circuit-level optimizations to enable lower power consumption in the optical domain. The SONIC architecture discussed in this chapter represents the first noncoherent photonic SpNN accelerator. The SONIC architecture adapts multiple software optimizations for sparsity, clustering, and dataflow, which are integrated closely with the hardware architecture design for improved energy efficiency and latency, without compromising the inference accuracy of the deployed models.

The rest of this chapter is organized as follows. Section 3 provides an overview of the proposed software and dataflow optimizations for convolution and fully connected layers in CNNs. Section 4 describes the hardware design of the noncoherent photonic accelerator that is tuned for these model optimizations. Section 5 presents the experiments conducted and results. Finally, we draw conclusions in Sect. 6.

F. Sunny et al.

3 Software and Dataflow Optimizations

3.1 Model Sparsification

To generate SpNNs, we adapt a layer-wise, sparsity-aware training approach from [39]. Notably, layer-wise sparsification enables more control over the sparsification process and avoids overly sparsifying sensitive layers which contribute to the overall model accuracy. Hence, we opted for layer-wise sparsity-aware training over considering the whole model for pruning simultaneously. In our approach, for every layer selected to be sparsified, a binary mask variable is added, which is of the same size and shape as the layer’s weight tensor. The algorithm also determines which of the weights participate in the forward execution of the graph. The weights in the chosen layer are then sorted by their absolute values, and the smallest magnitude weights are masked to zero until the user-specified sparsity levels are reached. Also, note that we opt for sparsity-aware training instead of post-training sparsification, as the latter approach can indiscriminately remove neurons, thus adversely affecting the inference accuracy. We also utilize an L2 regularization term during training, to encourage smaller weight values and avoid overfitting, which further helps improve the overall accuracy of the model, post-deployment.
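The masking step described above can be illustrated with a minimal NumPy sketch. It shows only how a per-layer binary mask is derived from weight magnitudes for a user-specified sparsity level; the gradual, training-time schedule of [39] is not reproduced. The function names, layer names, and sparsity targets are hypothetical and chosen purely for illustration.

```python
import numpy as np

def magnitude_mask(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Binary mask (same shape as `weights`) that zeroes the smallest-magnitude
    entries until the requested sparsity (a fraction in [0, 1)) is reached."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    n_prune = int(np.floor(sparsity * weights.size))
    mask = np.ones(weights.size, dtype=np.float32)
    if n_prune > 0:
        # Sort by absolute value; a stable sort keeps ties deterministic.
        order = np.argsort(np.abs(weights).ravel(), kind="stable")
        mask[order[:n_prune]] = 0.0
    return mask.reshape(weights.shape)

def masked_forward(x: np.ndarray, weights: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Forward pass of a dense layer in which only unmasked weights participate."""
    return x @ (weights * mask)

def l2_penalty(weights: np.ndarray, lam: float = 1e-4) -> float:
    """L2 regularization term added to the training loss (lam is an assumed value)."""
    return lam * float(np.sum(weights ** 2))

# Hypothetical two-layer example with layer-wise sparsity targets.
rng = np.random.default_rng(0)
layer_weights = {"layer_a": rng.normal(size=(8, 4)), "layer_b": rng.normal(size=(4, 3))}
target_sparsity = {"layer_a": 0.25, "layer_b": 0.5}

masks = {name: magnitude_mask(w, target_sparsity[name]) for name, w in layer_weights.items()}
achieved = {name: 1.0 - float(m.mean()) for name, m in masks.items()}
out = masked_forward(rng.normal(size=(2, 8)), layer_weights["layer_a"], masks["layer_a"])
```

Keeping one mask per layer is what allows a sensitive layer to be given a lower sparsity target than the others.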

3.2 Weight Clustering

Photonic accelerators must utilize electrical-optical interfaces to tune the optical devices they use in the multiply-and-accumulate (MAC) units. The interfaces make use of digital-to-analog converters (DACs), which have high power overheads and can drive up the overall power consumption of the accelerators, as is the case in [8]. Moreover, a higher resolution requirement for a DAC (i.e., a larger number of bits used to represent each weight and activation parameter) translates into higher power and latency overhead in the DAC. Thus, to reduce the DAC overhead, we perform post-training quantization of the deployed models, in the form of weight clustering. We opt for density-based centroid initialization of the weights for the clustering operation, as described in [40]. For this clustering approach, a cumulative distribution function is built from the model’s weights. The distribution is evenly divided into regions, based on the user-specified number of clusters. The centroid weight values of the evenly distributed regions are then deduced, and these values are used to initialize clustering. This process effectively reduces the variations in weight values and confines the values to the centroids. Therefore, if there are C centroids, and thus C clusters, the model will end up with C unique weights. This implies that the weights can be represented with a resolution of ⌈log2 C⌉ bits, thus reducing the required DAC resolution and enabling power and latency savings. Section 5.1 describes our weight clustering (and sparsification) explorations and parameters in more detail.
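As a rough illustration of this clustering step, the sketch below initializes centroids at equally spaced points of the empirical weight CDF and then refines them with a few one-dimensional k-means iterations. It assumes NumPy. The helper names, the iteration count, and the use of the mid-point of each equal-probability region as the initial centroid are assumptions of this sketch, not details taken from [40].

```python
import numpy as np

def density_based_centroids(weights: np.ndarray, n_clusters: int) -> np.ndarray:
    """Initial centroids: split the empirical CDF into n_clusters
    equal-probability regions and take the weight at the middle of each."""
    probs = (np.arange(n_clusters) + 0.5) / n_clusters
    return np.quantile(weights.ravel(), probs)

def cluster_weights(weights: np.ndarray, n_clusters: int, n_iter: int = 20):
    """1-D k-means (Lloyd iterations) started from density-based centroids.
    Returns the clustered weights and the final centroids."""
    flat = weights.ravel()
    centroids = density_based_centroids(flat, n_clusters)
    for _ in range(n_iter):
        # Assign each weight to its nearest centroid.
        assign = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
        for c in range(n_clusters):
            members = flat[assign == c]
            if members.size:  # leave empty clusters where they are
                centroids[c] = members.mean()
    assign = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
    return centroids[assign].reshape(weights.shape), centroids

def dac_bits(n_clusters: int) -> int:
    """Smallest number of bits that can index n_clusters distinct values."""
    return (n_clusters - 1).bit_length()

rng = np.random.default_rng(1)
w = rng.normal(scale=0.1, size=(50, 20))
w_clustered, centroids = cluster_weights(w, n_clusters=16)
n_unique = np.unique(w_clustered).size
```

After clustering, each weight can be stored as an index into the centroid table, which is why the number of clusters sets the DAC resolution.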

3.3 Dataflow Optimizations

Beyond sparsification and weight clustering optimizations, we also perform enhancements to improve dataflow efficiency in our hardware platform. Fully connected (FC) layers are computationally intensive layers in CNNs, where all neurons in a layer are connected to all neurons in the following FC layer. This results in large matrix-vector product operations to generate the output vector of a single FC layer; the output vector from one FC layer acts as the input activation vector for the subsequent FC layer (Fig. 1a). As the figure shows, many parameters in the weight matrix and the activation vector can be zero. When these matrices and vectors are passed as is to the computational elements, the presence of zero elements causes increased energy consumption and latency. However, steps can be taken to prevent zero parameters from being passed on to the processing elements. To achieve this goal, a compression approach, as depicted in Fig. 1a, b, is utilized. In this approach, we identify the zero parameters in the activation vector and remove the corresponding columns in the weight matrix, which would otherwise be multiplied by these parameters during the dot-product operation. This approach generates dense activation vectors, but the weight vectors can still be sparse, as depicted in the weight matrix in Fig. 1b. This process does not impact the accuracy of the output vector calculation or the output vector dimension.

For convolution (CONV) layers, the main difference from FC layers is the convolution operation itself. We unroll the CONV layer kernels and their associated patch of the input feature (IF) map matrix, to form vector-dot-product operations from the convolution operations (see Fig. 2a). The compression approach for FC layers can be repeated for these unrolled matrix-vector multiplication operations (see Fig. 2b). The compression approach in CONV layers helps generate dense kernel vectors to be passed to the vector-dot-product units (VDUs). Note that the IF vectors (activations) being passed for processing may still have sparsity present, as shown in Fig. 2c. The residual sparsity in the FC layer weight matrices and the CONV layer IF maps is handled at the VDU level, as discussed in more detail in Sect. 4.2.
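The two compression cases can be checked numerically with a short NumPy sketch. The function names and the small example matrices are hypothetical. The sketch only demonstrates that removing matched zero positions leaves the dot-product results unchanged while some residual sparsity remains.

```python
import numpy as np

def compress_fc(weight: np.ndarray, activation: np.ndarray):
    """FC case: drop zero activations and the weight columns they would multiply."""
    keep = activation != 0
    return weight[:, keep], activation[keep]

def compress_conv_patch(kernel: np.ndarray, patch: np.ndarray):
    """CONV case: unroll a kernel and its IF-map patch into vectors, then drop
    zero kernel entries together with the patch entries they would multiply."""
    k, p = kernel.ravel(), patch.ravel()
    keep = k != 0
    return k[keep], p[keep]

# FC example: the activation vector becomes dense, the weights may stay sparse.
W = np.array([[1., 0., 2., 0.],
              [0., 3., 0., 4.],
              [5., 0., 0., 6.]])
a = np.array([2., 0., 0., 1.])
W_c, a_c = compress_fc(W, a)
fc_full, fc_compressed = W @ a, W_c @ a_c

# CONV example: the kernel vector becomes dense, the IF vector may stay sparse.
kernel = np.array([[1., 0., -1.],
                   [0., 2., 0.],
                   [0., 0., 3.]])
patch = np.array([[4., 5., 0.],
                  [7., 0., 9.],
                  [1., 2., 3.]])
k_dense, p_residual = compress_conv_patch(kernel, patch)
conv_full, conv_compressed = float(np.sum(kernel * patch)), float(k_dense @ p_residual)
```

In this example the FC weight matrix shrinks from 3 × 4 to 3 × 2 and the unrolled CONV vectors shrink from nine elements to four, while both results are preserved.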

Fig. 1 FC layer operation, where the product of the weight matrix and activation vector is calculated. (a) Zero element identification in the activation vector and corresponding columns in weight matrix (marked in dotted outlines); (b) compressed matrix and vector, but weight matrix still exhibits parameter sparsity

4 SONIC Hardware Accelerator Overview

Figure 3 shows a high-level overview of the proposed noncoherent SONIC architecture for SpNN inference acceleration. SONIC comprises an optical processing core, which uses vector-dot-product units (VDUs), described in Sect. 4.2, to perform multiply-and-accumulate operations for FC and CONV layers in the photonic domain during inference. Several peripheral electronic modules are also required to interface with the main memory, map the dense and sparse vectors to the photonic VDUs, and perform post-processing operations, such as applying nonlinearities and accumulating the partial sums generated by the photonic core. DAC arrays within the VDUs convert buffered signals into analog tuning signals for the microring resonators (MRs), and vertical-cavity surface-emitting lasers (VCSELs) are used to generate different wavelengths. Analog-to-digital converter (ADC) arrays are used to map the output analog signals generated by photonic summation to digital values that are sent back for post-processing and buffering. The devices, VDU design, and architecture are discussed further in the following subsections.

Fig. 2 (a) Convolution operation between kernel (weight) matrix and input feature map (activations). A patch of the input feature map is convolved with the kernel matrix at a time to generate an output feature-map element (a patch and the corresponding output element are shown in red boxes). (b) Convolution operation unfurled into a vector-matrix-dot-product operation; avenues for compression are indicated by dotted-red outlines. (c) The result of the compression approach, with input feature map still exhibiting parameter sparsity

Fig. 3 An overview of the SONIC architecture, with N CONV layer-specific VDUs and K FC layer-specific VDUs

4.1 Microring Resonators (MRs) and Robust Tuning

MRs are the primary computational devices used within our VDUs to implement matrix-vector multiplication operations. MRs are wavelength-selective silicon photonic devices, which are usually designed to be responsive to a specific “resonant” wavelength (λMR). Such MRs are used to modulate and filter their resonant wavelengths in a carefully controlled manner, via a tuning circuit, to realize multiplications in the optical domain. An MR tuning mechanism can be used to induce a resonance shift (ΔλMR) and thereby change the output wavelength amplitude (Fig. 4a) to realize a scalar multiplication operation. The tuning mechanism in MRs operates by heating (thermo-optic (TO) tuning [42]) or carrier injection (electro-optic (EO) tuning [41]), thereby inducing a change in the effective index (neff), which in turn impacts λMR. The induced ΔλMR increases the loss a wavelength experiences as it passes the MR, modifying the amplitude and imprinting the desired parameter (W1–W3 for MR1–MR3 in Fig. 4b). To improve throughput, wavelength division multiplexing (WDM) signals are used with a group of MRs (i.e., an MR bank, Fig. 4b), where each MR is sensitive to a specific λMR. A large passband can be achieved by cascading several MRs, as in [43]; such a broadband device can tune multiple wavelengths simultaneously, which we use to implement batch normalization layers.

Fig. 4 (a) An all-pass microring resonator (MR) filter (R is the radius of the MR and determines the resonant wavelength). (b) An MR bank where multiple MR filters, each sensitive to a particular wavelength, are arranged to perform vector-matrix multiplication

In SONIC, we make use of a TO+EO hybrid tuning circuit to induce ΔλMR. Such a tuning approach has previously been proposed in [44] for silicon photonic devices with low insertion loss. This approach can be easily transferred to MR banks for hybrid tuning in our architecture. The hybrid tuning approach supports faster operation of MRs by using fast EO tuning to induce small ΔλMR and TO tuning for large ΔλMR. The fast EO tuning, with its small range of operation, is well suited for imprinting parameters onto the wavelengths, whereas TO tuning, with its large tuning range, is utilized primarily for fabrication process variation (FPV) correction. To further reduce the power overhead of TO tuning in our hybrid approach, we adapt a method called thermal eigenmode decomposition (TED), which was first proposed in [45]. Using TED, we can collectively tune the MR bank with lower power consumption.

4.2 Vector-Dot-Product Unit (VDU) Design

As we decompose the operations in FC and CONV layers into vector-dot-product operations, our processing units are effectively vector-dot-product units (VDUs). Figure 5 depicts the VDU design in SONIC. As Figs. 1b and 2c showed, the granularity of the vectors involved in FC and CONV operations can be different. In real models, the CONV kernel sizes are relatively small when compared to FC layers. Also, in our dataflow for CONV layers, the dense vectors are generated from kernel matrices (weights), whereas for FC layers, the dense vectors are generated from activation vectors. Because of the clustering approach we utilize (see Sect. 3.2), the dense vectors in CONV layers need only low-resolution digital-to-analog converters (DACs). Similarly, for FC layers, the sparse vectors can use low-resolution DACs for the same reason. Considering these differences, we separate the VDU implementations for CONV and FC layers. However, both VDU implementations follow the layout illustrated in Fig. 5.

Fig. 5 Vector-dot-product unit in the SONIC architecture

As shown in Fig. 5, VDUs use separate local buffers to store the sparse and dense vector parameter values. The parameters are fed into DAC arrays for driving the optical devices (MRs or VCSELs). Each VDU has a local VCSEL array, which is driven using a DAC array. A DAC drives its corresponding VCSEL to generate an optical signal with an amplitude tuned to reflect its corresponding vector parameter. The VCSEL array is driven by data from the sparse data buffers. By driving VCSELs directly using sparse vector data, our VDU designs prevent unnecessary computations. Encountering a zero element in the sparse vector prevents the VCSEL from being driven (recall that after the compression approach described in Sect. 3.3, there may be residual sparsity in the IF map or weight matrix). This involves power gating the VCSEL, and hence subsequent operations for the dot product will not occur. The power gating thus helps avoid wasteful operations on the zero parameters in the vectors fed to the VDU that were not eliminated by our data compression approach (see Sect. 3.3). The signals from the VCSEL array are fed into an optical multiplexer (MUX in Fig. 5) to generate a WDM signal, which is transmitted to the MR bank via a waveguide. The MR bank comprises several tunable MRs, each of which can be tuned to alter the optical signal amplitude of a specific input wavelength, so that the intensity of the wavelength reflects a specific value, as discussed earlier. We also make use of a broadband MR to tune all wavelengths simultaneously to reflect batch normalization (BN) parameters for a layer (Fig. 5). Once the multiplications between the vector parameters, and with the BN parameters, have been performed, a photodetector is used to convert the optical signal back to an electrical signal, to obtain a single, accumulated value from the VDU.
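A purely functional view of one VDU pass can be sketched in a few lines of NumPy. This is an idealized model under strong assumptions: it ignores optical loss, noise, amplitude normalization, sign handling, and DAC quantization, and the function name and its arguments are hypothetical. It only mirrors the data path described above: the sparse vector drives the VCSELs, zeros are power gated, the MR bank imprints the dense vector, a broadband MR applies a BN scale, and the photodetector accumulates.

```python
import numpy as np

def vdu_dot(sparse_vec, dense_vec, bn_scale: float = 1.0):
    """Idealized model of one VDU pass.

    sparse_vec : values that drive the VCSELs, one wavelength per element;
                 a zero entry means the VCSEL is power gated.
    dense_vec  : values imprinted on each wavelength by the MR bank.
    bn_scale   : single factor applied to all wavelengths (broadband MR).
    Returns (accumulated value, number of VCSELs actually driven).
    """
    sparse_vec = np.asarray(sparse_vec, dtype=float)
    dense_vec = np.asarray(dense_vec, dtype=float)
    if sparse_vec.shape != dense_vec.shape:
        raise ValueError("both vectors must have the same length")
    active = sparse_vec != 0  # VCSELs that are not power gated
    per_wavelength = sparse_vec[active] * dense_vec[active] * bn_scale
    # The photodetector sums the contributions of all wavelengths.
    return float(per_wavelength.sum()), int(active.sum())

value, driven = vdu_dot([0.5, 0.0, 0.25, 0.0], [2.0, 3.0, 4.0, 5.0], bn_scale=2.0)
reference = 2.0 * float(np.dot([0.5, 0.0, 0.25, 0.0], [2.0, 3.0, 4.0, 5.0]))
```

In this example only two of the four VCSELs are driven, yet the accumulated value equals the full scaled dot product.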

4.3 SONIC Architecture

The VDU design discussed above is integrated in the SONIC architecture shown in Fig. 3. As mentioned earlier, we separate the VDU designs for the FC and CONV layer operations. The separate VDU designs account for the vector granularity differences between FC and CONV layer operations, and the differences in DAC requirements for driving the VCSEL and MR arrays. The architecture relies on an electronic control unit for interfacing with the main memory, retrieving the parameters, mapping the compressed parameters, and post-processing the partial sums generated by the VDUs. The optical processing core (see Fig. 3) focuses on CONV, FC, and batch normalization acceleration during the inference phase. Other operations, such as activation and pooling, are implemented electronically, as done in prior works on optical computation. The photonic accelerator core of the SONIC architecture design (Fig. 3) arranges VDUs in an array. For CONV layers, we consider N VDUs, with each unit supporting an n × n dot product. For FC layer acceleration, we consider K VDUs, with each unit supporting an m × m dot product. Here, m > n and N > K, as per the requirements of each of the distinct layers. In each VDU, the original vector dimensions are decomposed into n- or m-dimensional vectors. Here, n and m are dependent on the dense vector granularity we obtain through the compression approach for the CONV and FC layers (Sect. 3.3).
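The decomposition of a long vector into fixed-width pieces can be sketched as follows. This is an assumed, simplified scheduling model in NumPy: the zero padding, the round-based scheduling, and the names `split_into_chunks` and `tiled_dot` are illustrative choices and not details given in the chapter.

```python
import math
import numpy as np

def split_into_chunks(vec, width: int) -> np.ndarray:
    """Zero-pad `vec` to a multiple of `width` and split it into rows of `width`."""
    vec = np.asarray(vec, dtype=float)
    n_chunks = math.ceil(vec.size / width)
    padded = np.zeros(n_chunks * width)
    padded[:vec.size] = vec
    return padded.reshape(n_chunks, width)

def tiled_dot(a, b, width: int, n_units: int):
    """Dot product of a and b computed as partial sums over width-sized chunks.

    Each chunk is assumed to occupy one unit for one pass; `n_units` chunks
    can be processed in parallel. Returns (result, number of rounds needed).
    """
    chunks_a, chunks_b = split_into_chunks(a, width), split_into_chunks(b, width)
    partial_sums = np.einsum("ij,ij->i", chunks_a, chunks_b)  # one per chunk
    rounds = math.ceil(len(partial_sums) / n_units)
    # Accumulating the partial sums is done electronically in the text above.
    return float(partial_sums.sum()), rounds

a = np.arange(1, 13)   # 12-element vector
b = np.ones(12)
result, rounds = tiled_dot(a, b, width=5, n_units=2)
```

With a width of 5 the 12-element vectors need three chunks, so two units finish in two rounds.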

5 Experiments and Results

For our experiments, we consider four custom CNN models with both CONV and FC layers, for the well-known CIFAR10, STL10, SVHN, and MNIST datasets. Details on the baseline models are shown in Table 1. To evaluate the performance of the SONIC architecture, we compare it against two state-of-the-art SpNN accelerators, RSNN [5] and NullHop [6], along with the dense photonic accelerators CrossLight [8] and HolyLight [10], and the photonic binary neural network accelerator LightBulb [51]. Furthermore, we attempted to implement the coherent SpNN photonic accelerator from [9]; however, that work does not provide details or results for latency, power, and energy, which prevented us from comparing against it. We also show comparative results against the NVIDIA Tesla P100 GPU and Intel Xeon Platinum 9282 CPU. We compared all these architectures in terms of throughput (i.e., frames per second (FPS)), energy per bit (EPB), and power efficiency (FPS/W). We devised a custom Python simulator, integrated with TensorFlow v2.5, to evaluate SONIC and the other accelerators. The parameters summarized in Table 2 were used to configure the accelerator simulator to obtain performance and power/energy results.

Table 1 CNN models considered for experiments

Dataset    Conv layers   FC layers   No. of parameters   Baseline accuracy
MNIST      2             2           1,498,730           93.2%
CIFAR10    6             1           552,874             86.05%
STL10      6             1           77,787,738          74.6%
SVHN       4             3           552,362             94.6%

Table 2 Parameters considered for analysis of accelerators

Device               Latency    Power
EO tuning [41]       20 ns      4 μW/nm
TO tuning [42]       4 μs       27.5 mW/FSR
VCSEL [46]           0.07 ns    1.3 mW
Photodetector [47]   5.8 ps     2.8 mW
DAC (16 bit) [48]    0.33 ns    40 mW
DAC (6 bit) [49]     0.25 ns    3 mW
ADC (16 bit) [50]    14 ns      62 mW

Table 3 Summary of the sparsification and clustering results

Dataset    Layers pruned   No. of weight clusters   No. of parameters   Final accuracy
MNIST      4               64                       749,365             92.89%
CIFAR10    7               16                       276,437             86.86%
STL10      5               64                       46,672,643          75.2%
SVHN       5               64                       331,417             95%

5.1 Model Sparsification and Clustering Results

In the first experiment, we focus on software model optimization in SONIC. To obtain the best accuracy possible, we performed layer-wise sparsification in the models considered, as described in Sect. 3.1. We also use this experiment to partially explore the design space of SONIC hardware implementations. As depicted in Fig. 5, we use DACs for driving MRs and VCSELs in our accelerator. To decide on the required DAC resolution (and corresponding power and latency costs), we perform post-training weight clustering, as described in Sect. 3.2. Our goal was to generate models with as much per-layer sparsity as possible and minimal DAC resolution, while exhibiting comparable accuracy to the baseline model. A summary of the optimized models and the final accuracy achieved after sparsification and weight clustering is shown in Table 3. Consistent with the trend in prior works, the final accuracy of the optimized models in Table 3 is comparable to or slightly better than the baseline accuracy.

To arrive at these numbers for each model, we performed a detailed exploration. Figure 6 shows the design space considered during sparsity and clustering exploration for the CIFAR10 model (figures for the other three models are omitted for brevity). Figure 7 further shows the layer-wise breakdown of sparsification for all four models, where the plots show the layer-specific sparsity level for weight parameters (in the best solution for each model from Table 3) and the resulting sparsity in activations as they traverse the sparse layers. Our exploration found that 16 clusters suffice for the best CIFAR10 solution and that at most 64 clusters are needed across the four models (Table 3). We kept activation granularity at 16 bits, which provided us with sufficient accuracy (Table 3). Based on these results, the SONIC accelerator used 6-bit DACs for weight parameter representation (64 clusters require log2 64 = 6 bits) and 16-bit DACs for activation parameter representation.

Fig. 6 Visualization of sparsity and clustering exploration on the CIFAR10 model. Number of layers is the total layers sparsified, sparsity is the average pruning aggressiveness, and number of clusters refers to the total weight clusters. The best (highest accuracy) configuration is indicated by the star
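The arithmetic behind these choices can be reproduced from Tables 1 and 3 with a few lines of plain Python. The parameter and cluster counts below are copied from those tables. Reading the Table 3 parameter counts as the parameters remaining after sparsification is an assumption of this sketch, and the helper name `bits_for_clusters` is hypothetical.

```python
# Values copied from Table 1 (baseline) and Table 3 (after optimization).
baseline_params = {"MNIST": 1_498_730, "CIFAR10": 552_874,
                   "STL10": 77_787_738, "SVHN": 552_362}
optimized_params = {"MNIST": 749_365, "CIFAR10": 276_437,
                    "STL10": 46_672_643, "SVHN": 331_417}
weight_clusters = {"MNIST": 64, "CIFAR10": 16, "STL10": 64, "SVHN": 64}

def bits_for_clusters(n_clusters: int) -> int:
    """Smallest number of bits that can index n_clusters distinct weight values."""
    return (n_clusters - 1).bit_length()

# Fraction of parameters removed, per model.
reduction = {name: 1 - optimized_params[name] / baseline_params[name]
             for name in baseline_params}

# Bits needed per model, and the resolution a shared weight DAC must support.
bits = {name: bits_for_clusters(c) for name, c in weight_clusters.items()}
weight_dac_bits = max(bits.values())

for name in baseline_params:
    print(f"{name}: {reduction[name]:.1%} fewer parameters, {bits[name]}-bit weights")
```

Under this reading, the MNIST and CIFAR10 models keep about half of their parameters and the STL10 and SVHN models keep about 60%, while the largest cluster count (64) sets the 6-bit weight DAC resolution.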

Fig. 7 Sparsity across various layers in the four models considered

5.2 Comparison with State-of-the-Art Accelerators

We explored various (n, m, N, K) configurations for the SONIC architecture, where n and m are the maximum dot-product granularities supported by the CONV and FC VDUs, respectively, and N and K are the total numbers of CONV and FC VDUs employed by the accelerator. From this exploration, we found that the best configuration in terms of FPS/W, EPB, and power consumption is (5, 50, 50, 10). We found that the value of n is heavily dependent on CONV layer kernel values, which were fixed after our model sparsification experiments. Increasing n beyond five did not provide any benefits, as the dense kernel vectors do not exceed five-parameter granularity for the considered models.

Figure 8 shows power consumption, and Fig. 9 shows the power efficiency (in terms of frames per second per watt, or FPS/W) across the accelerators considered. In these figures, NVIDIA Tesla P100 (NP100) is the GPU, and Intel Xeon Platinum 9282 (IXP) is the CPU. We can observe that due to its sparsity-aware, clustering-aware, and dataflow-optimized hardware architecture design, SONIC exhibits substantially higher power efficiency, even though it has higher power consumption than the electronic SpNN accelerators. SONIC, on average, exhibits 5.81× and 4.02× better FPS/W than the NullHop and RSNN electronic SpNN accelerators, respectively. SONIC also exhibits 3.08×, 2.94×, and 13.8× better power efficiency on average than the LightBulb, CrossLight, and HolyLight photonic accelerators, respectively. This better power efficiency stems from the fact that SONIC is designed to take advantage of sparsity and clustering in the models, while none of the other photonic accelerators are optimized in this way.

Fig. 8 Power comparison across the accelerator platforms

Fig. 9 FPS/W comparison across the accelerator platforms

When comparing the energy per bit (EPB) across the accelerators, as shown in Fig. 10, we can again observe that the co-design of the software and dataflow optimizations along with the hardware architecture in SONIC allows it to outperform the photonic and electronic SpNN accelerators. SONIC exhibits, on average, 19.4×, 18.4×, and 27.6× lower EPB than LightBulb, CrossLight, and HolyLight, respectively. SONIC also exhibits 8.4× and 5.78× lower EPB than NullHop and RSNN, respectively. These results highlight the promise of hardware-software co-designed photonic accelerators, such as SONIC, for SpNN deployment on resource-constrained platforms.
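For readers who want to relate the reported ratios to raw measurements, the sketch below shows one plausible way to derive the two metrics from a simulated run. The chapter does not spell out its exact formulas, so the definitions here (FPS divided by average power, and total energy divided by the number of bits processed) are assumptions, as are the function names and every number in the example.

```python
def fps_per_watt(frames: int, seconds: float, avg_power_w: float) -> float:
    """Power efficiency: throughput in frames per second divided by average power."""
    return (frames / seconds) / avg_power_w

def energy_per_bit(avg_power_w: float, seconds: float, bits_processed: float) -> float:
    """Energy per bit in joules: total energy divided by the bits processed."""
    return avg_power_w * seconds / bits_processed

def improvement(ours: float, other: float, lower_is_better: bool = False) -> float:
    """Ratio reported as 'x times better' for a metric."""
    return other / ours if lower_is_better else ours / other

# Hypothetical measurements for two accelerators A and B (not from the chapter).
a_fpsw = fps_per_watt(frames=1000, seconds=2.0, avg_power_w=5.0)
b_fpsw = fps_per_watt(frames=1000, seconds=4.0, avg_power_w=10.0)
a_epb = energy_per_bit(avg_power_w=5.0, seconds=2.0, bits_processed=1e9)
b_epb = energy_per_bit(avg_power_w=10.0, seconds=4.0, bits_processed=1e9)

fpsw_gain = improvement(a_fpsw, b_fpsw)                       # higher is better
epb_gain = improvement(a_epb, b_epb, lower_is_better=True)    # lower is better
```

Note that an accelerator can draw more power and still score better on both metrics if it finishes the same work sufficiently faster.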

Fig. 10 EPB comparison across the accelerator platforms

6 Conclusion

In this chapter, we presented a novel noncoherent photonic sparse neural network accelerator, called SONIC, that integrates several hardware and software optimizations. SONIC exhibits up to 5.8× better power efficiency and 8.4× lower EPB than state-of-the-art sparse electronic neural network accelerators, and up to 13.8× better power efficiency and 27.6× lower EPB than state-of-the-art dense photonic neural network accelerators. These results demonstrate the promising low-energy and low-latency inference acceleration capabilities of our SONIC architecture.

References

1. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE. 86(11), 2278–2324 (1998)
2. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 (2014)
3. Park, J., Li, S., Wen, W., Tang, P.T.P., Li, H., Chen, Y., Dubey, P.: Faster CNNs with direct sparse convolutions and guided pruning. In: Proc. ICLR (2017)
4. Zhang, S., Du, Z., Zhang, L., Lan, H., Liu, S., Li, L., Guo, Q., Chen, T., Chen, Y.: Cambricon-X: an accelerator for sparse neural networks. In: MICRO (2016)
5. You, W., Wu, C.: RSNN: a software/hardware co-optimized framework for sparse convolutional neural networks on FPGAs. IEEE Access. 9 (2021)
6. Aimar, A., Mostafa, H., Calabrese, E., Rios-Navarro, A., Tapiador-Morales, R., Lungu, I.A., Milde, M.B., Corradi, F., Linares-Barranco, A., Liu, S.C., Delbruck, T.: NullHop: a flexible convolutional neural network accelerator based on sparse representations of feature maps. IEEE Trans. Neural Netw. Learn. Syst. 30(3), 644–656 (2019)
7. Waldrop, M.M.: The chips are down for Moore’s law. Nat. News. 530(7589) (2016)
8. Sunny, F., Mirza, A., Nikdast, M., Pasricha, S.: CrossLight: a cross-layer optimized silicon photonic neural network accelerator. In: DAC (2021)
9. Gu, J., Zhao, Z., Feng, C., Liu, M., Chen, R.T., Pan, D.Z.: Towards area-efficient optical neural networks: an FFT-based architecture. In: ASP-DAC (2020)

10. Liu, W., Liu, W., Ye, Y., Lou, Q., Xie, Y., Jiang, L.: HolyLight: a nanophotonic accelerator for deep learning in data centers. In: DATE (2019)
11. Dang, D., Chittamuru, S.V.R., Pasricha, S., Mahapatra, R., Sahoo, D.: BPLight-CNN: a photonics-based backpropagation accelerator for deep learning. ACM JETC. 17(4), 1–26 (2021)
12. Sunny, F., Mirza, A., Nikdast, M., Pasricha, S.: ROBIN: a robust optical binary neural network accelerator. In: ACM TECS (2021)
13. Sunny, F., Nikdast, M., Pasricha, S.: SONIC: a sparse neural network inference accelerator with silicon photonics for energy-efficient deep learning. In: Asia and South Pacific Design Automation Conference (ASP-DAC) (2022)
14. Sunny, F., Taheri, E., Nikdast, M., Pasricha, S.: A survey on silicon photonics for deep learning. ACM JETC. 17(4), 1–57 (2021)
15. Pasricha, S., Nikdast, M.: A survey of silicon photonics for energy efficient manycore computing. IEEE D&T. 37(4), 60–81 (2020)
16. Bahirat, S., Pasricha, S.: METEOR: hybrid photonic ring-mesh network-on-chip for multicore architectures. ACM Trans. Embedd. Comput. Syst. 13(3), 1–33 (2014)
17. Bahirat, S., Pasricha, S.: HELIX: design and synthesis of hybrid nanophotonic application-specific network-on-chip architectures. In: IEEE International Symposium on Quality Electronic Design (ISQED) (2014)
18. Bahirat, S., Pasricha, S.: 3D HELIX: design and synthesis of hybrid nanophotonic application-specific 3D network-on-chip architectures. In: Workshop on Exploiting Silicon Photonics for Energy Efficient Heterogeneous Parallel Architectures (SiPhotonics) (2014)
19. Bahirat, S., Pasricha, S.: A particle swarm optimization approach for synthesizing application-specific hybrid photonic networks-on-chip. In: IEEE International Symposium on Quality Electronic Design (ISQED) (2012)
20. Bahirat, S., Pasricha, S.: UC-PHOTON: a novel hybrid photonic network-on-chip for multiple use-case applications. In: IEEE International Symposium on Quality Electronic Design (ISQED) (2010)
21. Bahirat, S., Pasricha, S.: Exploring hybrid photonic networks-on-chip for emerging chip multiprocessors. In: IEEE/ACM International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS) (2009)
22. Chittamuru, S.V.R., Thakkar, I., Pasricha, S., Vatsavai, S.S., Bhat, V.: Exploiting process variations to secure photonic NoC architectures from snooping attacks. IEEE Trans. Comput. Aided Des. Integrat. Circuits Syst. 40(5), 850–863 (2021)
23. Chittamuru, S.V.R., Thakkar, I., Pasricha, S.: LIBRA: thermal and process variation aware reliability management in photonic networks-on-chip. IEEE Trans. Multi-Scale Comput. Syst. 4(4), 758–772 (2018)
24. Chittamuru, S.V.R., Dharnidhar, D., Pasricha, S., Mahapatra, R.: BiGNoC: accelerating big data computing with application-specific photonic network-on-chip architectures. IEEE Trans. Parallel Distrib. Syst. 29(11), 2402–2415 (2018)
25. Chittamuru, S.V.R., Thakkar, I., Pasricha, S.: HYDRA: heterodyne crosstalk mitigation with double microring resonators and data encoding for photonic NoC. IEEE Trans. Very Large Scale Integrat. Syst. 26(1), 168–181 (2018)
26. Chittamuru, S.V.R., Desai, S., Pasricha, S.: SWIFTNoC: a reconfigurable silicon-photonic network with multicast enabled channel sharing for multicore architectures. ACM J. Emerg. Technol. Comput. Syst. 13(4), 1–27 (2017)
27. Chittamuru, S.V.R., Pasricha, S.: Crosstalk mitigation for high-radix and low-diameter photonic NoC architectures. IEEE Des. Test. 32(3), 29–39 (2015)
28. Thakkar, I., Chittamuru, S.V.R., Pasricha, S.: Mitigating the energy impacts of VBTI aging in photonic networks-on-chip architectures with multilevel signaling. In: IEEE Workshop on Energy-Efficient Networks of Computers (E2NC) (2018)
29. Pasricha, S., Chittamuru, S.V.R., Thakkar, I., Bhat, V.: Securing photonic NoC architectures from hardware Trojans. In: IEEE/ACM International Symposium on Networks-on-Chip (NOCS) (2018)

30. Chittamuru, S.V.R., Thakkar, I., Pasricha, S.: SOTERIA: exploiting process variations to enhance hardware security with photonic NoC architectures. In: IEEE/ACM Design Automation Conference (DAC) (2018)
31. Thakkar, I., Chittamuru, S.V.R., Pasricha, S.: Improving the reliability and energy-efficiency of high-bandwidth photonic NoC architectures with multilevel signaling. In: IEEE/ACM International Symposium on Networks-on-Chip (NOCS) (2017)
32. Chittamuru, S.V.R., Thakkar, I., Pasricha, S.: Analyzing voltage bias and temperature induced aging effects in photonic interconnects for manycore computing. In: ACM System Level Interconnect Prediction Workshop (SLIP) (2017)
33. Dang, D., Chittamuru, S.V.R., Mahapatra, R.N., Pasricha, S.: Islands of heaters: a novel thermal management framework for photonic NoCs. In: IEEE/ACM Asia & South Pacific Design Automation Conference (ASPDAC) (2017)
34. Thakkar, I., Chittamuru, S.V.R., Pasricha, S.: A comparative analysis of front-end and back-end compatible silicon photonic on-chip interconnects. In: ACM/IEEE System Level Interconnect Prediction Workshop (SLIP) (2016)
35. Thakkar, I., Chittamuru, S.V.R., Pasricha, S.: Run-time laser power management in photonic NoCs with on-chip semiconductor optical amplifiers. In: IEEE/ACM International Symposium on Networks-on-Chip (NOCS) (2016)
36. Chittamuru, S.V.R., Thakkar, I., Pasricha, S.: PICO: mitigating heterodyne cross-talk due to process variations and intermodulation effects in photonic NoCs. In: IEEE/ACM Design Automation Conference (DAC) (2016)
37. Chittamuru, S.V.R., Thakkar, I., Pasricha, S.: Process variation aware cross-talk mitigation for DWDM based photonic NoC architectures. In: IEEE International Symposium on Quality Electronic Design (ISQED) (2016)
38. Chittamuru, S.V.R., Pasricha, S.: SPECTRA: a framework for thermal reliability management in silicon-photonic networks-on-chip. In: IEEE International Conference on VLSI Design (VLSI) (2016)
39. Zhu, M.H., Gupta, S.: To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv:1710.01878v2 (2017)
40. Han, S., Mao, H., Dally, W.J.: Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv:1510.00149v5 [cs.CV] (2015)
41. Abel, S., Stöferle, T., Marchiori, C., Caimi, D., Czornomaz, L., Stuckelberger, M., Sousa, M., Offrein, B.J., Fompeyrine, J.: A hybrid barium titanate–silicon photonics platform for ultraefficient electro-optic tuning. JLT. 34(8), 1688–1693 (2016)
42. Pintus, P., Hofbauer, M., Manganelli, C.L., Fournier, M., Gundavarapu, S., Lemonnier, O., Gambini, F.: PWM-driven thermally tunable silicon microring resonators: design, fabrication, and characterization. In: L&P (2019)
43. Xia, J., Bianco, A., Bonetto, E., Gaudino, R.: On the design of microring resonator devices for switching applications in flexible-grid networks. In: ICC (2014)
44. Lu, L., Li, X., Gao, W., Li, X., Zhou, L., Chen, J.: Silicon non-blocking 4×4 optical switch chip integrated with both thermal and electro-optic tuners. IEEE Photonics. 11(6) (2019)
45. Milanizadeh, M., Aguiar, D., Melloni, A., Morichetti, F.: Canceling thermal cross-talk effects in photonic integrated circuits. JLT. 37(4), 1325–1332 (2019)
46. Inti, R., Mansuri, M., Kennedy, J., Qiu, J., Hsu, C.M., Sharma, J., Li, H., Casper, B., Jaussi, J.: A scalable 32-to-56Gb/s 0.56-to-1.28pJ/b voltage-mode VCSEL-based optical transmitter in 28nm CMOS. In: CICC (2021)
47. Wang, B., Huang, Z., Sorin, W.V., Zeng, X., Liang, D., Fiorentino, M., Beausoleil, R.G.: A low-voltage Si-Ge avalanche photodiode for high-speed and energy efficient silicon photonic links. JLT. 38(12), 3156–3163 (2020)
48. Wu, B., Zhu, S., Xu, B., Chiu, Y.: A 24.7 mW 65 nm CMOS SAR assisted CT ΔΣ modulator with second-order noise coupling achieving 45 MHz bandwidth and 75.3 dB SNDR. IEEE J. Solid State Circuits. 51(12), 2893–2905 (2016)
49. Yang, C.M., Kuo, T.H.: A 3 mW 6-bit 4 GS/s subranging ADC with subrange-dependent embedded references. IEEE TCAS (2021)

Design of Sparsity Optimized Photonic Deep Learning Accelerators

347

50. Shen, J., Shikata, A., Fernando, L.D., Guthrie, N., Chen, B., Maddox, M., Mascarenhas, N., Kapusta, R., Coln, M.C.W.: A 16-bit 16-MS/s SAR ADC with on-chip calibration in 55-nm CMOS. IEEE J. Solid State Circuits. 54(4), 1149–1160 (2018) 51. Zokaee, F., Lou, Q., Youngblood, N., Liu, W., Xie, Y., Jiang, L.: LightBulb: a photonicnonvolatile-memory-based accelerator for binarized convolutional neural networks. In: DATE (2020) 52. Banerjee, S., Nikdast, M., Chakrabarty, K.: Modeling silicon-photonic neural networks under uncertainties. In: DATE 2021

Algorithm-System Co-design for Efficient and Hardware-Aware Embedded Machine Learning

Ji Lin, Wei-Ming Chen, and Song Han

1 Overview

Deep learning has made tremendous progress in various areas. However, deploying deep learning models on embedded hardware for edge applications is difficult because resources such as power, memory, and computation are limited (Fig. 1). The overall performance of the system is determined by both hardware design and software design. Therefore, apart from specialized hardware design, we also need a highly efficient software system to unleash the potential of the hardware [36]. In this chapter, we introduce recent progress in efficient software design for embedded machine learning applications. The chapter is organized as follows:

1. First, we introduce the efficient inference system stack for embedded deep learning, including inference libraries and scheduling optimizations (Sect. 2).
2. Second, we introduce techniques that make deep learning models more efficient, including model compression (pruning and quantization) and neural architecture search (Sect. 3).
3. Finally, we show how co-designing the inference system and the deep learning network yields the best end-to-end performance (Sect. 4).

J. Lin · W.-M. Chen · S. Han
Massachusetts Institute of Technology, Cambridge, MA, USA

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
S. Pasricha, M. Shafique (eds.), Embedded Machine Learning for Cyber-Physical, IoT, and Edge Computing, https://doi.org/10.1007/978-3-031-39932-9_14


[Figure 1: bar chart of peak memory (kB, axis from 0 to 8000) for ResNet-50, MobileNetV2, and MobileNetV2 (int8), compared against a 320 kB constraint; the bars are annotated 23×, 22×, and 5×.]

Fig. 1 Embedded deep learning is difficult. The peak memory of most of the deep learning networks (e.g., ResNets [23] and MobileNets [27, 51]) surpasses the hardware resource of embedded devices like microcontrollers by more than 5×, even after model quantization. Figure adapted from [36]

2 Efficient Inference Systems

In this section, we will introduce efficient inference systems for embedded machine learning applications, especially for TinyML based on microcontrollers.

2.1 Inference Frameworks for TinyML

Deploying deep learning models on memory-constrained microcontrollers requires an efficient inference framework. Popular deep learning frameworks for cloud servers (e.g., PyTorch [44] and TensorFlow [1]) are usually not suitable for embedded devices, since their memory consumption is too large. They also require a sophisticated operating system, which is usually not available on edge devices like microcontrollers. To bridge the gap, several deployment frameworks have been designed specifically for TinyML, including TensorFlow Lite Micro (TFLM) [1], CMSIS-NN [30], TinyEngine [36], MicroTVM [11], CMix-NN [9], Embedded Learning Library,¹ uTensor,² etc. These frameworks usually provide low-level implementations of basic deep learning operators (e.g., convolutions, linear layers).

2.1.1 Interpreter-Based vs. Code-Generation

The inference libraries for TinyML can be categorized into two approaches: interpreter-based and code-generation (Fig. 2).

¹ https://microsoft.github.io/ELL/
² https://github.com/uTensor/uTensor


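[Figure 2 diagram. (a) Runtime interpreter: the NN model and all supported ops are stored on the device; meta-information extraction, memory allocation, and inference all happen at runtime. (b) Compile time (offline): the NN model is turned into a memory schedule and specialized ops; only inference happens at runtime.]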

Fig. 2 Interpreter-based vs. code-generation frameworks. The code-generation method reduces the cost by separating compile time and runtime, reducing the workload of the runtime stage. (a) Interpreter-based frameworks. (b) Code-generation frameworks

Interpreter-based methods include TFLM, CMSIS-NN, etc. Such methods store the neural network model and the code for all supported operators on the embedded device. To run inference, the interpreter first interprets the model to generate meta-information (e.g., the graph of the model) and the memory allocation for each weight tensor and activation. Then, the code of the operators actually used is selected for inference. All of these steps happen at runtime, leading to computation and memory overhead:

1. The interpretation stage requires analysis of the model, leading to computation overhead.
2. The extracted meta-information and memory allocation scheme can result in unnecessary memory overhead.
3. Storing the code for all the supported operators leads to storage overhead.

To avoid this overhead, another kind of inference framework is based on code-generation, including TinyEngine [36], MicroTVM [11], Embedded Learning Library, and uTensor. Given a neural network model, at compile time (run offline), the compiler analyzes the model and generates the optimal memory schedule for inference. The schedule, along with the code of the operators actually used, is compiled into binary code that runs on the embedded device. In this way, the device can directly execute the generated binary at runtime without incurring the overhead listed above.
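The difference between the two approaches can be illustrated with a toy sketch. It is not the implementation of any of the frameworks named above: the three-op "model", the operator table, and the function names are all hypothetical, and Python's `exec` merely stands in for an offline C compiler.

```python
# Toy model: a list of (op_name, parameter) layers. All names are hypothetical.
MODEL = [("scale", 2.0), ("add", 1.0), ("relu", None)]

# Interpreter-based: every supported operator ships with the runtime.
OP_TABLE = {
    "scale": lambda x, p: x * p,
    "add": lambda x, p: x + p,
    "relu": lambda x, p: max(x, 0.0),
    "neg": lambda x, p: -x,  # supported but unused by MODEL; still stored
}


def interpret(model, x):
    """Walk the model description at runtime and dispatch each op."""
    for name, param in model:
        x = OP_TABLE[name](x, param)
    return x


def generate_source(model):
    """Offline step: emit straight-line code containing only the used ops."""
    lines = ["def run(x):"]
    for name, param in model:
        if name == "scale":
            lines.append(f"    x = x * {param!r}")
        elif name == "add":
            lines.append(f"    x = x + {param!r}")
        elif name == "relu":
            lines.append("    x = x if x > 0.0 else 0.0")
        else:
            raise ValueError(f"unsupported op: {name}")
    lines.append("    return x")
    return "\n".join(lines)


def compile_model(model):
    """Turn the generated source into a callable (stand-in for a C compiler)."""
    namespace = {}
    exec(generate_source(model), namespace)
    return namespace["run"]


run = compile_model(MODEL)
```

Both `interpret(MODEL, x)` and `run(x)` compute the same function, but the generated `run` contains no dispatch loop and no code for the unused `neg` operator.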

[Figure 3: two bar charts comparing the interpreter-based baseline with code generation. Peak memory (KB): code generation is 1.7× smaller. Throughput (million MAC/s): 52 for the baseline vs. 64 for code generation, 1.2× faster, because code generation eliminates the runtime interpretation overhead.]

Fig. 3 Compared to the interpreter-based method, code-generation results in 1.7× smaller peak memory and higher computation efficiency. The numbers are collected during the design of TinyEngine [36]

To compare the performance of different frameworks, we provide the memory usage (peak memory) and computation efficiency (MAC/second) in Fig. 3. The numbers are collected during the design of TinyEngine [36] and averaged over different networks. Compared to the interpreter-based method, code-generation reduces the peak memory usage by 1.7× and increases the computation efficiency by 1.2× by separating compile time from runtime.

2.1.2 Low-Precision Support

Quantization [14, 20, 31, 36, 47, 50, 55, 63, 64] has become an industry standard for deploying deep learning models on edge devices. It reduces the model size and, if low bit-precision operations are supported, the inference cost. This is especially helpful for embedded machine learning. Current inference frameworks widely support different bit-precisions. By default, TFLM [1] and MicroTVM [11] support fp32 inference. int8 quantization is widely used since it can preserve the accuracy with post-training quantization [29]. It is supported in various inference libraries including TFLM [1], TinyEngine [36], CMSIS-NN [30], MicroTVM [11], etc. Some libraries support even lower bit-precisions. For example, TinyEngine supports int4 quantization, which helps to squeeze in a larger model given the same hardware resources; CMix-NN [9] supports mixed precision to trade off accuracy and cost.

2.2 Scheduling Optimizations

The peak memory required during inference is tightly related to the scheduling. Different schedules can lead to different memory usage, inference speed, and energy usage. Here, we introduce some techniques to improve memory efficiency and arithmetic efficiency.

2.2.1 Reducing Peak Memory with Optimized Scheduling

Here, we introduce some inter-op and intra-op techniques to reduce the peak memory for model inference.

2.2.1.1 Inter-op: Reordering op to Reduce Peak Memory

Some neural networks consist of multiple parallel branches (e.g., ResNet [23] and MobileNetV2 [51] have 2 parallel branches due to residual connections; InceptionV3 [53] has 4 parallel branches). The popularity of Neural Architecture Search (NAS) has led to a family of more complex and irregularly wired neural networks like NASNets [66], AmoebaNets [48], RandWire [58], etc. The parallel structures can be executed in different orders, and different orders have different peak memory [2]. Ahn et al. [2] proposed to use dynamic programming to search for the best op ordering to reduce memory usage. They proposed Adaptive Soft Budgeting to speed up the scheduling and Identity Graph Rewriter to change the topology without altering mathematical integrity, which further reduces the memory cost. The proposed method reduces the memory footprint by 1.86× and off-chip traffic by 1.76×.
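To see why the execution order matters, consider the following minimal sketch. It enumerates every valid order of a small two-branch graph by brute force; this is a simplification for illustration and not the dynamic-programming scheduler of [2]. The graph, the tensor sizes, and the memory model (a tensor stays live until its last consumer has run) are assumptions.

```python
from itertools import permutations

# Hypothetical graph: op -> (output size in kB, list of input ops).
GRAPH = {
    "in": (4, []),
    "a1": (16, ["in"]),  # branch A expands ...
    "a2": (2, ["a1"]),   # ... then shrinks
    "b1": (16, ["in"]),  # branch B does the same
    "b2": (2, ["b1"]),
    "out": (2, ["a2", "b2"]),
}


def is_valid(order):
    """An order is valid if every op runs after all of its inputs."""
    seen = set()
    for op in order:
        if any(dep not in seen for dep in GRAPH[op][1]):
            return False
        seen.add(op)
    return True


def peak_memory(order):
    """Peak of (live tensors + output being written) over the schedule."""
    position = {op: i for i, op in enumerate(order)}
    last_use = {op: position[op] for op in order}
    for op in order:
        for dep in GRAPH[op][1]:
            last_use[dep] = max(last_use[dep], position[op])
    peak = 0
    for step in range(len(order)):
        live = sum(GRAPH[o][0] for o in order[:step + 1] if last_use[o] >= step)
        peak = max(peak, live)
    return peak


orders = [o for o in permutations(GRAPH) if is_valid(o)]
best = min(orders, key=peak_memory)
worst = max(orders, key=peak_memory)
```

In this toy graph, finishing one branch before starting the other keeps only one large intermediate tensor alive at a time, whereas interleaving the branches keeps both alive, so `peak_memory(best)` is lower than `peak_memory(worst)`.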

2.2.1.2 Intra-op: In-place Depth-Wise Convolution

Efficient neural network designs are widely based on the depth-wise block with an inverted bottleneck [8, 28, 36, 37, 51]. It consists of a point-wise convolution (a convolution with kernel size 1 × 1), a spatial depth-wise convolution (with #groups equal to #channels), and another point-wise convolution. The first point-wise convolution increases the number of channels by a factor t (the expansion ratio, typically between 3 and 6), the depth-wise convolution retains the number of channels, and the last point-wise convolution reduces the number of channels, leading to a diamond shape. This is the opposite of a normal bottleneck structure [23], hence the name “inverted.” During the inference of an operator, the memory usage includes both the input and the output activation, so the memory usage of the depth-wise layer is much higher than that of the two point-wise convolutions. Consider an inverted bottleneck block with input channel c, middle channel tc, and output channel c. The spatial dimensions are h and w. Per spatial position, the activation memory is c + tc for each point-wise convolution and 2tc for the depth-wise convolution. For MobileNetV2 [51], t = 6, making the depth-wise convolution 2t/(1 + t) ≈ 1.7× more memory expensive. MCUNet proposes in-place depth-wise convolution to reduce the peak memory of the inverted bottleneck block [36] (Fig. 4). Luckily, depth-wise convolutions do not perform convolutional filtering across channels. Therefore, once the output computation of a certain channel is completed, the input activation of that channel can be discarded. We overwrite it and use it to store the output activation of another


Fig. 4 In-place depth-wise convolution reduces the activation memory consumption from O(2N) to O(N + 1), N being the number of channels. Figure adapted from [36]
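The in-place update illustrated in Fig. 4 can be sketched in plain Python. This is a simplified illustration under stated assumptions, not TinyEngine's kernel: the activation is a list of N channel planes, the kernel is 3 × 3 with stride 1 and zero padding, and each channel's result is computed into one temporary plane and then copied back over that channel's input. The helper `bottleneck_ratio` (a hypothetical name) reproduces the 2tc versus c + tc arithmetic of the inverted bottleneck.

```python
def depthwise_channel(plane, kernel):
    """3x3 depth-wise convolution of one channel, stride 1, zero padding."""
    h, w = len(plane), len(plane[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for ky in range(3):
                for kx in range(3):
                    yy, xx = y + ky - 1, x + kx - 1
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += plane[yy][xx] * kernel[ky][kx]
            out[y][x] = acc
    return out


def depthwise_out_of_place(act, kernels):
    """Baseline: a full output tensor lives next to the input (2N planes)."""
    return [depthwise_channel(p, k) for p, k in zip(act, kernels)]


def depthwise_in_place(act, kernels):
    """Overwrite each input channel with its output (N planes + 1 temp)."""
    for c, kernel in enumerate(kernels):
        temp = depthwise_channel(act[c], kernel)  # the single extra plane
        for y, row in enumerate(temp):
            act[c][y][:] = row  # channel c's input is no longer needed
    return act


def bottleneck_ratio(t):
    """Depth-wise (2tc) vs. point-wise (c + tc) activation memory ratio."""
    return 2 * t / (1 + t)
```

Because a depth-wise convolution never mixes channels, overwriting channel c after its output is complete cannot affect any other channel, so the in-place result matches the out-of-place baseline.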

[Figure 5: memory usage (kB) per block index (0 to 17) of MobileNetV2 under per-layer and per-patch inference. Per-layer inference peaks at 1372 kB in the early high-memory blocks; per-patch inference peaks at 172 kB, which is 8× smaller and below the 256 kB constraint.]

Fig. 5 MobileNetV2 has a very imbalanced memory usage distribution. The peak memory is determined by the first 5 blocks with high peak memory, while the later blocks all share a small memory usage. By using per-patch inference (4 × 4 patches), we are able to significantly reduce the memory usage of the first 5 blocks and reduce the overall peak memory by 8×, fitting MCUs with a 256-kB memory budget. Notice that the model architecture and accuracy are not changed for the two settings. The memory usage is measured in int8. The figure is adapted from [37]

channel, allowing the activation of depth-wise convolutions to be updated in place. This reduces the activation memory consumption from O(2N) to O(N + 1), N being the number of channels. When applied to embedded devices, the technique can reduce the measured memory usage by 1.6×.

2.2.1.3 Intra-op: Patch-Based Inference

Efficient convolutional neural networks (CNNs) usually have a very imbalanced activation memory distribution. Take MobileNetV2 [51] as an example. As shown in Fig. 5, only the first 5 blocks have a high peak memory (>450 kB), becoming the memory bottleneck of the entire network. The remaining 13 blocks have a low memory usage, which can easily fit a 256-kB microcontroller. To break the memory bottleneck, we can use a patch-based inference schedule for the initial memory-intensive layers [37]. Existing deep learning inference frameworks (e.g., TensorFlow Lite Micro [1], TinyEngine [36], microTVM [11], etc.) use layer-by-layer execution. For each convolutional layer, the inference


[Figure 6 diagram: an input of size H × W × C passes through conv1 (stride 1) and conv2 (stride 2). Legend: in memory, not in memory, currently executing, to be executed, overlapped. (a) Per-layer computation (executing first conv). (b) Per-patch computation (executing first patch).]

Fig. 6 Per-patch inference can significantly reduce the peak memory required to execute a sequence of convolutional layers. We study two convolutional layers (strides 1 and 2). Under per-layer computation (a), the first convolution has a large input/output activation size, dominating the peak memory requirement. With per-patch computation (b), we allocate the buffer to host the final output activation and compute the results patch-by-patch. We only need to store the activation from one patch but not the entire feature map, reducing the peak memory (the first input is the image, which can be partially decoded from a compressed format like JPEG). Figure adapted from [37]

library first allocates the input and output activation buffers in SRAM and releases the input buffer after the whole layer computation is finished (if the input is not used by other layers).³ Such an implementation makes inference optimization easy (e.g., im2col, tiling, etc.), but the SRAM has to hold the entire input and output activation for each layer, which is prohibitively large for the initial stage of the CNN. Patch-based inference runs the initial memory-intensive stage in a patch-by-patch manner. Each time, we only run the model on a small spatial region (>10× smaller than the whole area), which effectively cuts down the peak memory usage. After this stage is finished, the rest of the network, which has a small peak memory, is executed in the normal layer-by-layer manner (see the upper notations in Fig. 5). The technique can reduce the peak memory of efficient networks (MobileNetV2 [51], FBNet [57], and MCUNet [36]) by 4–6× when running on microcontrollers (Fig. 7). Figure 6 shows an example with two convolutional layers (with strides 1 and 2). For conventional per-layer computation, the first convolutional layer has a large input and output activation size, leading to a high peak memory. With spatial partial computation, we allocate the buffer for the final output and compute its values patch-by-patch. In this manner, we only need to store the activation from one patch instead

³ The weights are partially fetched from Flash and thus are not counted.


[Figure 7: bar chart of on-device measured peak SRAM (kB) for per-layer inference (with TinyEngine), per-patch inference (2 × 2), and per-patch inference (3 × 3) on MobileNetV2, FBNet-A, and MCUNet variants (labels w0.5, r144; w0.45, r144; w1.0, r144). The annotated reductions are 4.1×, 4.2×, and 4.9×.]

Fig. 7 On-device measurement: patch-based inference reduces the measured peak SRAM usage by 4–6× when running on microcontrollers. Figure adapted from [37]

of the whole feature map. Note that the first activation is the input image, which can be partially decoded from a compressed format like JPEG and does not require full storage (Fig. 7). The significant memory saving comes at the cost of computation overhead. To produce the same output as per-layer inference, the non-overlapping output patches correspond to overlapping patches in the input image (the shadowed area in Fig. 6b). This is because convolutional filters with kernel size >1 enlarge the receptive field, so the border pixels of an output patch depend on inputs that belong to neighboring patches. We discuss how to address this issue in Sect. 4.2.
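The two-layer example of Fig. 6 can be reproduced with a short sketch. It is an illustration only, under assumptions that are not taken from [37]: single-channel feature maps stored as nested lists, 3 × 3 kernels without padding, and a peak-memory proxy that counts only the elements of the intermediate feature map.

```python
def conv2d(x, k, stride):
    """Single-channel 'valid' 2-D convolution (cross-correlation)."""
    kh, kw = len(k), len(k[0])
    oh = (len(x) - kh) // stride + 1
    ow = (len(x[0]) - kw) // stride + 1
    return [[sum(x[i * stride + a][j * stride + b] * k[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(ow)] for i in range(oh)]


def crop(x, top, left, h, w):
    return [row[left:left + w] for row in x[top:top + h]]


def per_layer(image, k1, k2):
    mid = conv2d(image, k1, 1)  # the whole intermediate map is kept in memory
    out = conv2d(mid, k2, 2)
    return out, len(mid) * len(mid[0])


def per_patch(image, k1, k2, patch):
    """Same output, computed in tiles of patch x patch output pixels."""
    oh = ((len(image) - 2) - 3) // 2 + 1  # 3x3 stride-1, then 3x3 stride-2
    ow = ((len(image[0]) - 2) - 3) // 2 + 1
    out = [[0] * ow for _ in range(oh)]
    peak_mid = 0
    for top in range(0, oh, patch):
        for left in range(0, ow, patch):
            ph, pw = min(patch, oh - top), min(patch, ow - left)
            # ph output rows need 2*ph + 1 intermediate rows, which need
            # 2*ph + 3 image rows, so neighbouring input regions overlap.
            region = crop(image, 2 * top, 2 * left, 2 * ph + 3, 2 * pw + 3)
            mid = conv2d(region, k1, 1)
            tile = conv2d(mid, k2, 2)
            peak_mid = max(peak_mid, len(mid) * len(mid[0]))
            for i in range(ph):
                out[top + i][left:left + pw] = tile[i]
    return out, peak_mid
```

For a 13 × 13 input and 2 × 2 output patches, both functions return the same 5 × 5 output, while the largest intermediate map shrinks from 11 × 11 to 5 × 5 elements. The price is that overlapping input regions are convolved more than once.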

2.2.2 Kernel Optimization for Faster Inference

Optimizing the kernel implementation can further reduce inference latency for a given model. Since operators of deep learning models typically require several levels of loops to perform multi-dimensional multiply-and-accumulate (MAC) operations, how the different loop optimization techniques are applied matters greatly. To maximize the effectiveness of these techniques, we can generate specialized kernel code for different layers that takes the memory constraints into account.

1. Loop tiling, also known as loop blocking, transforms the computation and changes the data access pattern to ensure that the activation and weights used in the loop stay in the cache until they are reused. During code-generation, we tile the loops based on the kernel size and available memory, which is different for each layer.
2. Loop unrolling reduces the loop-control overhead caused by branch instructions. For different kernel sizes and strides, we can generate code to fully


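[Figure 8: bar chart of throughput (million MAC/s). Bars are labeled Baseline: ARM CMSIS-NN, Code generation, Specialized Im2col, Op fusion, Loop unrolling, and Tiling, with values 52, 64, 70, 75, 79, and 82; the overall gain is annotated as 1.6× faster.]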

Fig. 8 TinyEngine [36] optimizes the kernels with specialized im2col, op fusion, loop unrolling, tiling, etc. to improve the inference speed, leading to 1.6× faster inference compared to the baseline

unroll the innermost loop to minimize the overhead (e.g., 9 repeated code segments for a 3 × 3 kernel and 25 for a 5 × 5 kernel). We can also fuse multiple operators into one kernel to further reduce latency. For example, ReLU and BN layers can be performed within a convolution layer, so the activation can be updated in place, which also reduces the memory footprint. These specialized optimizations of the computation kernel further increase the inference efficiency by 22%, as shown in Fig. 8.
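One common way to realize the operator fusion mentioned above is to fold the batch-normalization parameters into the convolution weights ahead of time and to apply ReLU in the same pass. The sketch below shows this for a point-wise (1 × 1) convolution at a single spatial position; it is an illustration under these assumptions and not TinyEngine's kernel, and all function names are hypothetical.

```python
import math


def conv_bn_relu_unfused(x, w, b, gamma, beta, mean, var, eps=1e-5):
    """Three separate passes: point-wise conv, batch norm, ReLU."""
    conv = [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(w, b)]
    bn = [g * (c - m) / math.sqrt(v + eps) + bt
          for c, g, bt, m, v in zip(conv, gamma, beta, mean, var)]
    return [max(v, 0.0) for v in bn]


def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Offline step: absorb the BN scale and shift into weights and bias."""
    scale = [g / math.sqrt(v + eps) for g, v in zip(gamma, var)]
    w_f = [[wi * s for wi in row] for row, s in zip(w, scale)]
    b_f = [(bi - m) * s + bt for bi, m, s, bt in zip(b, mean, scale, beta)]
    return w_f, b_f


def conv_relu_fused(x, w_f, b_f):
    """Single pass: each output is produced and clamped in one go."""
    return [max(sum(wi * xi for wi, xi in zip(row, x)) + bi, 0.0)
            for row, bi in zip(w_f, b_f)]
```

The fused kernel writes each output value once, so no intermediate pre-BN or pre-ReLU activation has to be stored.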

3 Efficient Deep Learning Model Design

3.1 Model Compression

Model compression is an important technique to reduce the cost of deep learning models. Common model compression techniques include pruning, quantization, knowledge distillation, etc.

3.1.1 Pruning

Model pruning aims to remove the redundant weights in a neural network to reduce the model size and computation (Fig. 9). Pruning can be performed at different granularities [40]. We analyze the pruning of a convolutional layer, which is the most computationally expensive component in a CNN. The weights of a convolutional layer can be expressed as a 4-dimensional tensor of shape n × c × k_h × k_w, where k_h × k_w is the kernel size, n is the number of output channels (i.e., #filters), and c is the number of input channels (i.e., #channels).

• Fine-grained pruning aims to remove individual elements from the weights [18, 22, 32, 52]. It can achieve very large pruning rates without hurting the model accuracy. For example, AMC [25] can reduce the model size of ResNet-50 [23] by 5× without losing accuracy on ImageNet [15]. However, it usually requires special hardware accelerators to achieve real speed-up [10, 13, 19, 21, 60, 62].


Fig. 9 Pruning removes redundant weights in a deep neural network (figure adapted from [18])

• Coarse-grained pruning aims to remove an entire “channel” or “filter” from the weight tensor [24, 25, 33, 38, 41, 56]. Channel-/filter-level pruning can lead to actual speed-up on existing hardware like CPUs and GPUs [25, 59].

How to determine the importance of weight elements and find the less important ones to prune is essential to the pruning performance. The most straightforward heuristic is based on the weight magnitude, i.e.,

\[ \text{Importance} = |w| \]

for fine-grained pruning and

\[ \text{Importance} = \|W\| \]

for coarse-grained pruning. Despite its simplicity, it achieves competitive performance compared to more complicated designs. Other importance criteria include gradients/derivatives [22, 32], approximated Taylor expansion [42], sensitivity analysis [16], etc.
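Both magnitude heuristics can be sketched in a few lines. The example assumes that each filter has been flattened into a list of c · k_h · k_w weights and that ||W|| is the L2 norm; the text does not specify the norm, so this choice and the function names are assumptions.

```python
import math


def fine_grained_prune(filters, sparsity):
    """Zero the fraction `sparsity` of individual weights with smallest |w|."""
    idx = [(i, j) for i, f in enumerate(filters) for j in range(len(f))]
    idx.sort(key=lambda ij: abs(filters[ij[0]][ij[1]]))
    pruned = [list(f) for f in filters]
    for i, j in idx[:int(len(idx) * sparsity)]:
        pruned[i][j] = 0.0
    return pruned


def channel_prune(filters, n_keep):
    """Keep the n_keep filters with the largest L2 norm; drop the others."""
    norms = [math.sqrt(sum(w * w for w in f)) for f in filters]
    ranked = sorted(range(len(filters)), key=lambda i: -norms[i])
    keep = sorted(ranked[:n_keep])
    return [filters[i] for i in keep], keep
```

Fine-grained pruning leaves the tensor shape unchanged and produces an irregular sparsity pattern, whereas channel pruning returns a smaller dense tensor, which is why the latter maps more directly onto CPUs and GPUs.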

3.1.2 Quantization

Quantization reduces the bit-precision of weights and activation to reduce the model size and, if low-bit instructions are supported, to accelerate inference. Existing quantization schemes include k-means clustering and linear/fixed-point quantization. k-means clustering-based quantization runs the k-means clustering algorithm on the weight tensor to find the centroid of each cluster [20] (Fig. 10). After clustering, we only need to store the centroid of each cluster and the cluster index of each weight (2-bit for four clusters), which


Fig. 10 K-means clustering-based quantization will run k-means clustering algorithm on the weight tensor to find the centroid of each cluster (figure adapted from [20])

can reduce the weight size by roughly 16× (from fp32 to uint2; the centroids are negligible compared to the indices). Such a method can usually reduce the model size greatly. However, it is hard to accelerate on CPUs/GPUs.

Fixed-point/linear quantization aims to quantize the weights and activation into low-bit integers. For example, int8 quantization has become an industry standard for deploying deep learning models on the edge [29]. It rounds a floating-point value to the nearest quantized value after range truncation. Suppose the clipping range is [a, b] and the number of quantization bins is n (256 for int8). We can quantize a floating-point value x into a quantized value q using the following formulas (adapted from [29]):

\[ \mathrm{clamp}(x, a, b) = \min(\max(x, a), b) \tag{1} \]

\[ s(a, b, n) = \frac{b - a}{n - 1} \tag{2} \]

\[ q = \mathrm{round}\!\left(\frac{\mathrm{clamp}(x, a, b) - a}{s(a, b, n)}\right) s(a, b, n) + a \tag{3} \]
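Equations (1) to (3) translate directly into code. In the sketch below, the function names are hypothetical and ties are rounded half up, which is an assumption because the text does not specify the tie-breaking rule. The function returns both the integer bin index that would be stored and the dequantized value q.

```python
import math


def clamp(x, a, b):
    """Eq. (1): truncate x to the clipping range [a, b]."""
    return min(max(x, a), b)


def step(a, b, n):
    """Eq. (2): spacing between neighbouring quantization levels."""
    return (b - a) / (n - 1)


def quantize(x, a, b, n):
    """Eq. (3): return (integer bin index, dequantized value q)."""
    s = step(a, b, n)
    index = math.floor((clamp(x, a, b) - a) / s + 0.5)  # round half up
    return index, index * s + a
```

For example, with the range [-1, 1] and n = 5 levels, the step is 0.5, so `quantize(0.3, -1.0, 1.0, 5)` selects bin 3, which dequantizes to 0.5. For values inside the clipping range, the rounding error is at most half a step.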

The fixed-point quantization scheme is supported in current frameworks. For example, inference libraries like TFLM [1] and TinyEngine [36] can handle int8/int4 quantization; CMix-NN [9] supports mixed bit-precision. Note that for int8 quantization, we can usually perform post-training quantization (PTQ) with minimal loss of accuracy [29, 43]. For lower bit-precisions, we can perform quantization-aware training (QAT) to further reduce the accuracy drop caused by quantization. Since the rounding operator is not differentiable, we can use the Straight-Through Estimator (STE) [5] to estimate the gradient. With STE, the backpropagation gradient is computed using

\[ \frac{\partial L}{\partial x} = \frac{\partial L}{\partial q} \tag{4} \]
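The effect of Eq. (4) can be seen in a one-parameter toy problem. This is a hypothetical illustration rather than a training recipe from the cited works: a scalar weight w is rounded to a grid with step 0.25 in the forward pass, the loss is the squared distance between the rounded value and a target, and the gradient with respect to q is passed unchanged to w.

```python
import math


def fake_quant(w, s=0.25):
    """Forward pass: round w to the nearest multiple of s."""
    return math.floor(w / s + 0.5) * s


def train_with_ste(target, steps=200, lr=0.05, w=0.0):
    """Minimize (q(w) - target)^2 using dL/dw := dL/dq (STE, Eq. 4)."""
    for _ in range(steps):
        q = fake_quant(w)
        grad_q = 2.0 * (q - target)  # exact gradient w.r.t. the rounded value
        w -= lr * grad_q             # STE: pass it straight through to w
    return w, fake_quant(w)
```

The true derivative of the rounding function is zero almost everywhere, so without the estimator the weight would never move from its initial value. With STE, the latent full-precision weight keeps drifting until its rounded value reaches the target level.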

3.1.3 Knowledge Distillation

Knowledge distillation (KD) [6, 26] can transfer the “dark knowledge” inside a large teacher model to a small student model, leading to better student accuracy compared to standalone training. The student model is usually smaller than the teacher model (fewer layers, fewer channels, or a lighter design), which makes inference faster and deployment easier. In a classification task, KD encourages the student model to match the output (i.e., the logits) of the teacher model. The most common choice is to use a KL-divergence loss to encourage logit matching. Apart from the final output, KD can also be applied to intermediate activation, which also contains useful information. Examples include FitNet [49] and Attention Transfer (AT) [61]. However, such methods require the two networks to have the same spatial dimensions, which limits the design of the student model.
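A minimal sketch of the logit-matching loss is given below. The softmax temperature is a commonly used option that the text above does not mention, so it is included as an assumption with a default of 1; the function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over (optionally softened) logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) between the two class distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two distributions diverge; in practice it is typically combined with the ordinary classification loss on the ground-truth labels.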

3.1.4 AutoML-Based Model Compression

Traditionally, model compression relies on manual tuning of hyper-parameters such as the pruning rate and bit-precision of each layer. These hyper-parameters are important for the model compression performance. However, manual tuning is sub-optimal and time-consuming: getting a good result requires human expertise and much trial and error. To solve this problem, researchers proposed AutoML-based methods to automate the process, offering a push-button solution that reaches state-of-the-art performance. He et al. proposed AutoML for Model Compression (AMC) [25], which treats model pruning as a sequential decision process (finding the pruning rate layer-by-layer) and solves it with reinforcement learning (Fig. 11). Given a pretrained

[Figure 11 diagram. Left: “Model Compression by Human: Labor Consuming, Sub-optimal” (original NN to compressed NN) versus “Model Compression by AI: Automated, Higher Compression Rate, Faster” (original NN through the AMC engine to compressed NN). Right: the AMC engine uses a DDPG agent (actor and critic) acting on a channel-pruning environment. For layer t, the agent receives an embedding s_t and outputs an action a_t, the sparsity ratio for that layer (e.g., 50%); the reward is −Error · log(FLOP).]

Fig. 11 Overview of AutoML for Model Compression (AMC). Left: AMC utilizes reinforcement learning to automatically find the best pruning schedule for a given network. Figure adapted from [25]


Fig. 12 The pruning policy (sparsity ratio) given by AMC agent for ResNet-50. AMC finds very salient sparsity pattern across layers: peaks are 1 × 1 convolution and crests are 3 × 3 convolution. The reinforcement learning agent automatically learns that 3 × 3 convolution has more redundancy than 1 × 1 convolution and can be pruned more. Figure adapted from [25]

model, the reinforcement learning agent (DDPG [35]) receives an embedding of the current layer containing information such as the channel number and kernel size, and outputs the pruning rate for this layer. After all the layers are pruned, the accuracy is evaluated as part of the reward function. AMC achieves state-of-the-art performance: it improves the compression ratio of ResNet-50 [23] on ImageNet [15] from 3.4× to 5× without hurting accuracy, and it accelerates MobileNet [27] by 1.81× on an Android phone with only 0.1% loss of ImageNet Top-1 accuracy. AMC works without any human prior. Interestingly, it discovers some patterns similar to human heuristics. We plot the per-layer pruning rates in Fig. 12. The reinforcement learning agent automatically learns that 3 × 3 convolution has more redundancy than 1 × 1 convolution and can be pruned more. Not only can AMC optimize FLOPs and model size, but it can also optimize the inference latency, directly benefiting mobile developers. Take MobileNet [27], a highly compact network, as an example (Fig. 13). Previous attempts to prune MobileNet with handcrafted policies led to significant accuracy degradation, while AMC-pruned MobileNet significantly improves both the accuracy-computation trade-off and the accuracy-latency trade-off on ImageNet (measured on Google Pixel 1). The framework can also be extended to mixed-precision quantization [55] to help determine the per-layer bit-precision, such that the quantized model achieves the best accuracy under the same model size/bit-ops.

3.2 Neural Architecture Search

Neural architecture search (NAS) has proved to be an effective way to automatically design neural networks and replace the labor-intensive manual design process.


Fig. 13 AMC achieves better results compared to human experts when optimizing for both computation and latency reduction. AMC strictly dominates human expert in the pareto optimal curve (inference time measured on Google Pixel 1). (a) Accuracy vs. MACs. (b) Accuracy vs. inference time

The original NAS approach [65] treats neural network architecture design as a sequence generation process and uses an RNN controller to generate architectures, which is trained with reinforcement learning through trial and error. For each sampled architecture, we need to train the network to get the accuracy for learning, which is repeated tens of thousands of times (e.g., 12,800 in [66]), leading to a large search cost (on the order of 10^4 GPU hours). To solve the problem, one-shot NAS methods [4, 7, 8, 17, 39, 45] train a single super network with weight sharing, from which we can sample sub-networks with various architectures. When evaluating the accuracy of different sub-networks, we directly extract the corresponding weights from the super network, so we only need to train the super network once, which greatly cuts down the training cost. After the super network training is done, we can search for the sub-network that satisfies the given constraints (e.g., MACs, model size, etc.). Common search algorithms include evolutionary search, reinforcement learning, Bayesian optimization, gradient-based methods, accuracy predictor-based methods, etc. NAS has produced state-of-the-art efficient CNN models, outperforming their human-designed counterparts. We compare the ImageNet accuracy vs. MACs trade-off of human-designed CNNs and NAS-designed CNNs in Fig. 14. The auto-designed CNNs achieve better accuracy at lower computation. There are several neural architecture search algorithms specialized for TinyML on microcontrollers. For example, TinyNAS [36] adopts a two-stage neural architecture search approach that first optimizes the search space to fit the resource constraints of certain hardware and then specializes the network architecture in the optimized search space. It can automatically handle diverse constraints (e.g., device, latency, energy, memory) at a low search cost. MicroNets [3] observes that for TinyML on MCUs, the latency of the model is linearly related to the


Fig. 14 Comparing ImageNet accuracy vs. MACs trade-off of human-designed CNNs and NAS-designed CNNs. NAS outperforms humans in designing efficient CNNs. Figure adapted from [8]

MACs under a uniform prior and proposes to use MACs as a proxy for latency in differentiable neural architecture search. μNAS [34] specializes deep models for “mid-tier” MCUs with memory requirements ranging from 0.5 to 64 KB.

4 System-Algorithm Co-design

System optimization (Sect. 2) and deep learning model optimization (Sect. 3) can both improve the end-to-end efficiency of embedded machine learning. However, to achieve the best accuracy vs. efficiency trade-off, we need to co-design the system and the algorithm [36].

4.1 Co-design to Achieve the Best Performance

MCUNet (V1 and V2) [36, 37] co-designs the system (TinyEngine for inference scheduling) and algorithm (TinyNAS for neural network architecture) to fit the tight resource constraints on microcontrollers (Fig. 15). Compared to traditional methods that either (a) optimize the neural network using neural architecture search based on a given deep learning library (e.g., TensorFlow, PyTorch) [7, 54, 57] or (b) tune the library to maximize the inference speed for a given network [11, 12], MCUNet can better utilize the resources by system-algorithm co-design to get a better performance.


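[Figure 15 diagram. (a) NAS searches for an NN model on a fixed library. (b) A library is tuned for a given NN model. (c) MCUNet: TinyNAS (efficient neural architecture) and TinyEngine (efficient compiler/runtime) are designed jointly.]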

Fig. 15 MCUNet jointly designs the neural architecture and the inference scheduling to fit the tight memory resource on microcontrollers. TinyEngine makes full use of the limited resources on MCU, allowing a larger design space for architecture search. With a larger degree of design freedom, TinyNAS is more likely to find a high accuracy model compared to using existing frameworks. (a) Search NN model on an existing library. (b) Tune deep learning library given a NN model. (c) MCUNet: system-algorithm co-design

The co-design space is a combinatorial space of the two dimensions. The search space includes:

• Model side: MCUNet uses an MnasNet-like search space [7, 54] for NAS. The space includes different kernel sizes for each inverted residual block k[ ] (3/5/7), different expansion ratios e[ ] (3/4/6), and a different number of blocks for each stage d[ ] (2/3/4). More recent search space designs like MobileNetV3 [28] have a better accuracy-computation trade-off but are hard to quantize due to the Swish activation function [46], making deployment on MCUs difficult.
• System side: the system-side optimization includes all the scheduling optimization knobs, including in-place depth-wise convolution (Sect. 2.2.1), loop unrolling, tiling, etc. For models with larger input resolution (e.g., for detection), we further extend the space to support patch-based inference, which adds the number of layers n that run patch-based inference and the number of spatial patches p.

The system and algorithm are tightly correlated. For example, given the same constraints, we can choose a smaller model that fits per-layer execution (p = 1, no computation overhead) or a larger model with per-patch inference (p > 1, with a small computation overhead). Therefore, we need to put both sides in the same loop. The combinatorial space is very large, so we need search algorithms with high sample efficiency like evolutionary search [17, 36].

System-algorithm co-design leads to better performance compared to system-only or model-only optimization. Here, we provide an example. We are building ImageNet models for the STM32F746 MCU (320 kB SRAM and 1 MB Flash). The baseline is a scaled-down MobileNetV2 [51] model with TFLM [1]. We compare the performance of system-only, model-only, and co-design in Fig. 16. Even though system-only and model-only optimization improve the performance, co-design leads to the best accuracy.
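A joint search over both sides can be sketched as a small evolutionary loop. Only the candidate values for k, e, d and the 320 kB budget come from the text; the number of stages, the patch options, and especially the `peak_memory` and `score` functions are made-up proxies for illustration. A real system would measure memory with the inference engine and estimate accuracy with a trained super network. The loop uses mutation only, which is a simplification.

```python
import random

K, E, D, P = (3, 5, 7), (3, 4, 6), (2, 3, 4), (1, 2, 3, 4)
STAGES = 3             # hypothetical number of stages
MEMORY_BUDGET = 320.0  # kB, the SRAM size quoted for the STM32F746


def random_config(rng):
    return {"k": [rng.choice(K) for _ in range(STAGES)],
            "e": [rng.choice(E) for _ in range(STAGES)],
            "d": [rng.choice(D) for _ in range(STAGES)],
            "p": rng.choice(P)}


def peak_memory(cfg):
    """Made-up proxy: a wider first stage costs more; patches divide it."""
    return 60.0 * cfg["e"][0] / cfg["p"] ** 2 + 40.0


def score(cfg):
    """Made-up accuracy proxy minus a penalty for patch overhead."""
    capacity = sum(k * e * d for k, e, d in zip(cfg["k"], cfg["e"], cfg["d"]))
    return capacity - 10.0 * (cfg["p"] - 1)


def mutate(cfg, rng):
    child = {key: (list(v) if isinstance(v, list) else v)
             for key, v in cfg.items()}
    key = rng.choice(["k", "e", "d", "p"])
    if key == "p":
        child["p"] = rng.choice(P)
    else:
        options = {"k": K, "e": E, "d": D}[key]
        child[key][rng.randrange(STAGES)] = rng.choice(options)
    return child


def evolutionary_search(generations=30, population=16, seed=0):
    rng = random.Random(seed)
    pop = []
    while len(pop) < population:  # feasible initial population
        cfg = random_config(rng)
        if peak_memory(cfg) <= MEMORY_BUDGET:
            pop.append(cfg)
    for _ in range(generations):
        parents = sorted(pop, key=score, reverse=True)[:population // 4]
        children = []
        while len(children) < population - len(parents):
            child = mutate(rng.choice(parents), rng)
            if peak_memory(child) <= MEMORY_BUDGET:
                children.append(child)
        pop = parents + children
    return max(pop, key=score)
```

Because the model knobs and the patch count p are mutated in the same loop, the search can trade a wider first stage (which only fits the budget with p > 1 under this proxy) against the overhead penalty, which is the coupling described above.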


[Figure 16 (bar chart), ImageNet Top-1 accuracy, axis from 35% to 65%: Baseline (MbV2*+CMSIS) 39; System-only (MbV2**+TinyEngine) 49; Model-only (TinyNAS+CMSIS) 56; Co-design (TinyNAS+TinyEngine) 62]

Fig. 16 Customizing ImageNet models for STM32F746 MCU (320 kB SRAM and 1 MB Flash). System-algorithm co-design leads to the best performance (figure adapted from [36])

[Figure 17 (diagram), layer labels: conv 3x3 s=1; conv 3x3 s=2]

Fig. 17 Patch-based inference leads to computation overhead due to spatial overlapping (shadowed area)

4.2 Co-design Broadens the Design Space

As mentioned above (Sect. 2.2.1), patch-based inference [37] can significantly reduce the peak memory required for inference. However, it introduces computation overhead due to spatial overlapping. As shown in Fig. 17, for a patch-based inference stage of two convolutional layers, even though the output feature map has no overlap between patches, the corresponding feature maps in previous layers do overlap because of the growing receptive field. This computational overhead leads to slower inference and higher energy usage, which is undesirable for embedded use cases. If we only optimize the schedule, we are less likely to adopt the method because of the overhead. With system-algorithm co-design, however, we can optimize the model to reduce the computational overhead, so that we enjoy the benefit of smaller peak memory without the extra computation.

We propose receptive field (RF) redistribution to reduce the computation overhead. The basic idea is to (1) reduce the receptive field of the patch-based initial stage and (2) increase the receptive field of the later stage. Reducing the RF of the initial stage reduces the size of each input patch and the amount of repeated computation. However, some tasks may suffer degraded performance if the overall RF is smaller (e.g., detecting large objects). Therefore, we further increase the RF of the later stage to compensate for the performance loss. We take MobileNetV2 as a case study and modify its architecture. The comparison is shown in Fig. 18. We use smaller kernels and fewer blocks in the per-patch inference stage and increase the number of blocks in the later per-layer inference stage.
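The link between receptive field and patch overlap can be checked with a little arithmetic. The sketch below is an illustration under stated assumptions, not the analysis from [37]: it assumes unpadded convolutions, square feature maps, and an output map split evenly into p patches per side, and it uses the ratio of input pixels read as a rough proxy for the repeated work. The function names are hypothetical.

```python
def input_extent(out_size, layers):
    """Input width needed to produce `out_size` outputs (no padding).

    `layers` lists (kernel, stride) pairs in forward order.
    """
    size = out_size
    for kernel, stride in reversed(layers):
        size = (size - 1) * stride + kernel
    return size


def receptive_field(layers):
    """Receptive field of a single output element."""
    return input_extent(1, layers)


def input_overhead(out_size, patches_per_side, layers):
    """Input pixels read per-patch divided by those read per-layer."""
    patch_out = out_size // patches_per_side
    per_patch = patches_per_side * input_extent(patch_out, layers)
    per_layer = input_extent(out_size, layers)
    return (per_patch / per_layer) ** 2


if __name__ == "__main__":
    # Two-layer stage as labelled in Fig. 17: conv 3x3 s=1, then conv 3x3 s=2.
    original = [(3, 1), (3, 2)]
    # Hypothetical variant with a smaller receptive field (1x1 first layer).
    reduced = [(1, 1), (3, 2)]
    for name, layers in (("original", original), ("reduced RF", reduced)):
        print(name, receptive_field(layers), round(input_overhead(4, 2, layers), 2))
```

For a 4x4 output split into 2x2 patches, the original stage reads a 7x7 input region per patch against 11x11 for the whole map, so the patches together read about 1.6 times as many input pixels. With the smaller receptive field the factor drops to about 1.2, which is the effect RF redistribution exploits.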

[Figure 18 (diagram): side-by-side block diagrams of MbV2 and MbV2-RD. Each network runs from a conv3x3 stem through a stack of MB1 and MB6 blocks to pooling and FC. The early blocks form the per-patch stage (large peak memory) and the later blocks form the per-layer stage (small peak memory). In MbV2-RD the per-patch stage is marked "Reduce Receptive Field" and the per-layer stage is marked "Increase Receptive Field".]

Fig. 18 The redistributed MobileNetV2 (MbV2-RD) has a reduced receptive field for the per-patch inference stage and an increased receptive field for the per-layer stage. The two networks have the same level of performance, but MbV2-RD has a smaller overhead under patch-based inference. The mobile inverted block is denoted as MB{expansion ratio} {kernel size}. The dashed border means stride = 2 (figure adapted from [37])

[Figure 19 (diagram): MCUNetV2 jointly searches the neural architecture (#layers, #channels, kernel size) and the inference scheduling (#patches, #layers for patch-based inference, other knobs from TinyEngine*)]

Fig. 19 MCUNetV2: joint neural architecture and inference scheduling search

Redistributing the receptive field by hand requires manual tuning and varies case by case, which is tricky since we need to balance the cost and the accuracy. Therefore, we propose to use joint neural architecture search and inference scheduling search (similar to MCUNet, but with patch-based primitives, see Fig. 19). In summary, system-algorithm co-design uncovers optimization opportunities that are not available with model-only or system-only optimization.

5 Summary

In this chapter, we introduced efficient algorithm and system co-design for embedded machine learning, including efficient inference systems and efficient deep learning models, as well as the joint optimization between them. We believe that system-algorithm co-design will allow us to fully utilize the potential of embedded devices and enable more powerful machine learning applications.


References

1. Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al.: TensorFlow: A system for large-scale machine learning. In: OSDI (2016)
2. Ahn, B.H., Lee, J., Lin, J.M., Cheng, H.-P., Hou, J., Esmaeilzadeh, H.: Ordering chaos: Memory-aware scheduling of irregularly wired neural networks for edge devices (2020). Preprint arXiv:2003.02369
3. Banbury, C., Zhou, C., Fedorov, I., Matas, R., Thakker, U., Gope, D., Janapa Reddi, V., Mattina, M., Whatmough, P.: MicroNets: neural network architectures for deploying TinyML applications on commodity microcontrollers. Proc. Mach. Learn. Syst. 3, 517–532 (2021)
4. Bender, G., Kindermans, P.J., Zoph, B., Vasudevan, V., Le, Q.: Understanding and simplifying one-shot architecture search. In: International Conference on Machine Learning (2018)
5. Bengio, Y., Léonard, N., Courville, A.: Estimating or propagating gradients through stochastic neurons for conditional computation (2013). Preprint arXiv:1308.3432
6. Buciluǎ, C., Caruana, R., Niculescu-Mizil, A.: Model compression. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 535–541 (2006)
7. Cai, H., Zhu, L., Han, S.: ProxylessNAS: Direct neural architecture search on target task and hardware. In: ICLR (2019)
8. Cai, H., Gan, C., Wang, T., Zhang, Z., Han, S.: Once for all: train one network and specialize it for efficient deployment. In: ICLR (2020)
9. Capotondi, A., Rusci, M., Fariselli, M., Benini, L.: CMix-NN: mixed low-precision CNN library for memory-constrained edge devices. IEEE Trans. Circuits Syst. II Express Briefs 67(5), 871–875 (2020)
10. Chen, Y.-H., Krishna, T., Emer, J.S., Sze, V.: Eyeriss: an energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE J. Solid-State Circuits 52, 127–138 (2017)
11. Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Shen, H., Cowan, M., Wang, L., Hu, Y., Ceze, L., et al.: TVM: An automated end-to-end optimizing compiler for deep learning. In: USENIX Symposium on Operating Systems Design and Implementation (OSDI 18) (2018)
12. Chen, T., Zheng, L., Yan, E., Jiang, Z., Moreau, T., Ceze, L., Guestrin, C., Krishnamurthy, A.: Learning to optimize tensor programs. In: Advances in Neural Information Processing Systems (2018)
13. Chen, Y.-H., Yang, T.J., Emer, J., Sze, V.: Eyeriss v2: a flexible accelerator for emerging deep neural networks on mobile devices. IEEE J. Emerg. Sel. Topics Circuits Syst. 9(2), 292–308 (2019)
14. Choi, J., Wang, Z., Venkataramani, S., Chuang, P.I.J., Srinivasan, V., Gopalakrishnan, K.: PACT: Parameterized clipping activation for quantized neural networks (2018). Preprint arXiv:1805.06085
15. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition (2009)
16. Engelbrecht, A.P.: A new pruning heuristic based on variance analysis of sensitivity information. IEEE Trans. Neural Netw. 12(6), 1386–1399 (2001)
17. Guo, Z., Zhang, X., Mu, H., Heng, W., Liu, Z., Wei, Y., Sun, J.: Single Path One-Shot Neural Architecture Search with Uniform Sampling (2019). arXiv
18. Han, S., Pool, J., Tran, J., Dally, W.J.: Learning both weights and connections for efficient neural networks. In: Advances in Neural Information Processing Systems (2015)
19. Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M.A., Dally, W.J.: EIE: Efficient inference engine on compressed deep neural network. ACM SIGARCH Comput. Archit. News 44, 243–254 (2016)


20. Han, S., Mao, H., Dally, W.J.: Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In: International Conference on Learning Representations (2016)
21. Han, S., Kang, J., Mao, H., Hu, Y., Li, X., Li, Y., Xie, D., Luo, H., Yao, S., Wang, Y., et al.: ESE: Efficient speech recognition engine with sparse LSTM on FPGA. In: Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 75–84. ACM (2017)
22. Hassibi, B., Stork, D.G.: Second Order Derivatives for Network Pruning: Optimal Brain Surgeon. Morgan Kaufmann, Burlington (1993)
23. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
24. He, Y., Zhang, X., Sun, J.: Channel pruning for accelerating very deep neural networks. In: Proceedings of the IEEE International Conference on Computer Vision (2017)
25. He, Y., Lin, J., Liu, Z., Wang, H., Li, L.-J., Han, S.: AMC: AutoML for model compression and acceleration on mobile devices. In: Proceedings of the European Conference on Computer Vision (ECCV) (2018)
26. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network (2015). Preprint arXiv:1503.02531
27. Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications (2017). arXiv
28. Howard, A., Sandler, M., Chu, G., Chen, L.-C., Chen, B., Tan, M., Wang, W., Zhu, Y., Pang, R., Vasudevan, V., Le, Q.V., Adam, H.: Searching for MobileNetV3. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2019)
29. Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., Adam, H., Kalenichenko, D.: Quantization and training of neural networks for efficient integer-arithmetic-only inference. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2704–2713 (2018)
30. Lai, L., Suda, N., Chandra, V.: CMSIS-NN: Efficient neural network kernels for Arm Cortex-M CPUs (2018). Preprint arXiv:1801.06601
31. Langroudi, H.F., Karia, V., Pandit, T., Kudithipudi, D.: Tent: Efficient quantization of neural networks on the tiny edge with tapered fixed point (2021). Preprint arXiv:2104.02233
32. LeCun, Y., Denker, J.S., Solla, S.A., Howard, R.E., Jackel, L.D.: Optimal brain damage. In: Advances in Neural Information Processing Systems, vol. 2, pp. 598–605 (1989)
33. Li, H., Kadav, A., Durdanovic, I., Samet, H., Graf, H.P.: Pruning filters for efficient convnets (2016). Preprint arXiv:1608.08710
34. Liberis, E., Dudziak, Ł., Lane, N.D.: μNAS: Constrained neural architecture search for microcontrollers (2020). Preprint arXiv:2010.14246
35. Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Wierstra, D.: Continuous control with deep reinforcement learning (2015). Preprint arXiv:1509.02971
36. Lin, J., Chen, W.-M., Lin, Y., Cohn, J., Gan, C., Han, S.: MCUNet: Tiny deep learning on IoT devices. In: Advances in Neural Information Processing Systems (2020)
37. Lin, J., Chen, W.-M., Cai, H., Gan, C., Han, S.: MCUNetV2: Memory-efficient patch-based inference for tiny deep learning (2021). Preprint arXiv:2110.15352
38. Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., Zhang, C.: Learning efficient convolutional networks through network slimming. In: Proceedings of the IEEE International Conference on Computer Vision (2017)
39. Liu, H., Simonyan, K., Yang, Y.: DARTS: Differentiable architecture search. In: ICLR (2019)
40. Mao, H., Han, S., Pool, J., Li, W., Liu, X., Wang, Y., Dally, W.J.: Exploring the granularity of sparsity in convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 13–20 (2017)
41. Molchanov, P., Tyree, S., Karras, T., Aila, T., Kautz, J.: Pruning convolutional neural networks for resource efficient transfer learning. CoRR, abs/1611.06440 (2016)


42. Molchanov, P., Tyree, S., Karras, T., Aila, T., Kautz, J.: Pruning convolutional neural networks for resource efficient transfer learning. In: International Conference on Learning Representations (2017)
43. Nagel, M., Baalen, M.v., Blankevoort, T., Welling, M.: Data-free quantization through weight equalization and bias correction. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1325–1334 (2019)
44. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: PyTorch: An imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems, vol. 32 (2019)
45. Pham, H., Guan, M.Y., Zoph, B., Le, Q.V., Dean, J.: Efficient neural architecture search via parameter sharing. In: International Conference on Machine Learning (2018)
46. Ramachandran, P., Zoph, B., Le, Q.V.: Searching for activation functions (2017). Preprint arXiv:1710.05941
47. Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A.: XNOR-Net: ImageNet classification using binary convolutional neural networks. In: European Conference on Computer Vision (2016)
48. Real, E., Aggarwal, A., Huang, Y., Le, Q.V.: Regularized evolution for image classifier architecture search. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 4780–4789 (2019)
49. Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., Bengio, Y.: FitNets: Hints for thin deep nets (2014). Preprint arXiv:1412.6550
50. Rusci, M., Capotondi, A., Benini, L.: Memory-driven mixed low precision quantization for enabling deep network inference on microcontrollers. In: Proceedings of Machine Learning and Systems (2020)
51. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.-C.: MobileNetV2: Inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
52. Srinivas, S., Babu, R.V.: Data-free parameter pruning for deep neural networks (2015). Preprint arXiv:1507.06149
53. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826 (2016)
54. Tan, M., Chen, B., Pang, R., Vasudevan, V., Sandler, M., Howard, A., Le, Q.V.: MnasNet: Platform-aware neural architecture search for mobile. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2019)
55. Wang, K., Liu, Z., Lin, Y., Lin, J., Han, S.: HAQ: Hardware-Aware Automated Quantization with Mixed Precision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2019)
56. Wen, W., Wu, C., Wang, Y., Chen, Y., Li, H.: Learning structured sparsity in deep neural networks. In: Advances in Neural Information Processing Systems, pp. 2074–2082 (2016)
57. Wu, B., Dai, X., Zhang, P., Wang, Y., Sun, F., Wu, Y., Tian, Y., Vajda, P., Jia, Y., Keutzer, K.: FBNet: Hardware-aware efficient ConvNet design via differentiable neural architecture search. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2019)
58. Xie, S., Kirillov, A., Girshick, R., He, K.: Exploring randomly wired neural networks for image recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1284–1293 (2019)
59. Yang, T.-J., Howard, A., Chen, B., Zhang, X., Go, A., Sze, V., Adam, H.: NetAdapt: Platform-aware neural network adaptation for mobile applications (2018). Preprint arXiv:1804.03230
60. Yu, J., Lukefahr, A., Palframan, D., Dasika, G., Das, R., Mahlke, S.: Scalpel: Customizing DNN pruning to the underlying hardware parallelism. ACM SIGARCH Comput. Archit. News 45(2), 548–560 (2017)
61. Zagoruyko, S., Komodakis, N.: Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer (2016). Preprint arXiv:1612.03928


62. Zhang, S., Du, Z., Zhang, L., Lan, H., Liu, S., Li, L., Guo, Q., Chen, T., Chen, Y.: Cambricon-X: An accelerator for sparse neural networks. In: 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1–12. IEEE (2016)
63. Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., Zou, Y.: DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients (2016). Preprint arXiv:1606.06160
64. Zhu, C., Han, S., Mao, H., Dally, W.J.: Trained ternary quantization (2016). Preprint arXiv:1612.01064
65. Zoph, B., Le, Q.V.: Neural architecture search with reinforcement learning. In: ICLR (2017)
66. Zoph, B., Vasudevan, V., Shlens, J., Le, Q.V.: Learning transferable architectures for scalable image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)

Efficient Hardware and Software Design for On-device Learning

Yawen Wu, Yue Tang, Dewen Zeng, Xinyi Zhang, Peipei Zhou, Yiyu Shi, and Jingtong Hu

1 Introduction

Recently, the applications of deep neural networks (DNNs) have been shifting from high-performance servers to resource-constrained edge devices. For example, DNNs are deployed to cars for self-driving, to unmanned aerial vehicles (UAVs) for navigation [3], and to various tasks such as search and rescue [4]. In these applications, the model is first trained on powerful cloud servers, and then the trained model is deployed to edge devices. However, since the data captured in the wild on devices is dynamic and can be very different from the pre-training data, the deployed models need to be updated based on the data in the wild. Following a conventional training scheme, the collected data is first transmitted to the cloud server, and then the model is trained on the server using the latest data. After that, the updated model is sent back to the edge devices. The whole process is time-consuming and incurs high communication costs. Therefore, it is desirable to perform learning on the device with the data in the wild. In this way, the model can adapt to new environments in situ, and the accuracy of the model can be improved.

This work consists of papers published in the proceedings of the Design Automation Conference (DAC) [1] and ACM Transactions on Design Automation of Electronic Systems (TODAES) [2].
Y. Wu · Y. Tang · X. Zhang · P. Zhou · J. Hu: University of Pittsburgh, Pittsburgh, PA, USA
D. Zeng · Y. Shi: University of Notre Dame, Notre Dame, IN, USA
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. S. Pasricha, M. Shafique (eds.), Embedded Machine Learning for Cyber-Physical, IoT, and Edge Computing, https://doi.org/10.1007/978-3-031-39932-9_15


Besides, personalization can be achieved by training convolutional neural network (CNN) models on local devices. For instance, models for medical applications such as home monitoring [5] and long-term ECG monitoring [6] rely heavily on the user's physical condition. Different users usually have distinct data distributions, so models need to be fine-tuned for specific users. Compared with using cloud services to log a user's condition over time to improve the model's performance [7], directly updating the model on local devices would be more effective. Furthermore, learning at the edge provides better privacy than learning on cloud servers [8, 9] because the user's data do not need to leave the local device.

To enable efficient on-device learning, both software and hardware problems need to be solved. From the software perspective, it is necessary to perform on-device learning without uploading all the streaming data to servers, because transmitting all the streaming data has prohibitively high communication costs and causes high latency. Besides, unlike conventional training on servers with fully labeled data, most of the streaming data are unlabeled, so on-device learning must use as few labels as possible.

The recently developed self-supervised learning technique, contrastive learning, is effective in unsupervised visual representation learning. It improves the quality of the features generated by the feature extractor (e.g., a set of convolutional layers). Based on the improved feature extractor, the whole model consisting of the feature extractor and classifier can be fine-tuned with a few labeled samples for better classification accuracy. While contrastive learning is promising for on-device learning, it still relies on the conventional training setup, in which a large dataset is completely collected before the training starts.
During conventional training, each mini-batch of data is randomly sampled from the whole dataset [10, 11]. However, on-device learning faces a different setup. On edge platforms, the data collected by sensors such as cameras stream in continuously. To follow the conventional training protocol, the massive amount of constantly generated unlabeled data would need to be continuously stored in storage devices such as Flash memory, which incurs prohibitive storage and energy overhead for writing and reading the data. Without accumulating a large dataset, a small data buffer can instead be used to store the most recent samples and form each mini-batch for training, so that the model learns from new data without forgetting existing knowledge.

The existing contrastive learning methods assume that each mini-batch can be sampled uniformly at random from the pre-collected dataset and that each mini-batch is independent and identically distributed (iid). Under this assumption, each class has representative data in each mini-batch. However, when performing on-device learning, it is challenging to maintain the most representative samples from the historical and current data in the buffer such that learning from this buffer reaches an accurate model. First, the streaming data are temporally correlated, and sequentially forming mini-batches results in correlation within each mini-batch. The streaming data is continuously captured, and a long sequence of data can belong to the same category. For example, when a robot with cameras is moving, the same object can appear in adjacent frames of the continuous stream, while another object will appear in the following sequence of frames. Second, since the streaming data is unlabeled, there is no easy way to select the representative data for each category from the non-iid stream and maintain the exemplars in the buffer. If labels were available for each data frame, the most representative data could be selected for each category from the non-iid stream using the existing method [12]. However, without labels for data selection, directly learning from the non-iid data stream results in slow learning, low-quality visual representations, and low task accuracy.

To improve the learning speed and model accuracy, it is important to select the most representative data from the input stream and maintain them in the buffer for learning. To this end, we propose a score-based data selection method. The contrast score computes the similarity between the features of a sample and those of its horizontally flipped view, and thus measures how well the model encodes each data sample. By using the contrast score, the most representative data are selected and maintained in the buffer. Data for which the model cannot yet generate good representations are valuable for training, since the model improves when it is trained on them. These data are selected and kept in the buffer for learning in the following iterations. On the other hand, data whose representations are already encoded with high quality have been learned effectively and are dropped to make room for the more valuable data.

After the feature extractor is trained by contrastive learning with unlabeled data, the classifier is updated. Since training the classifier with only unlabeled data cannot reach a meaningful accuracy, we send 1% or 10% of the data to the server to acquire labels and use these labeled data to train the classifier for the target task, such as classification.
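The selection rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the exact scoring formula is defined later in the chapter (Sects. 3.2 and 3.3), so the form used here, one minus the cosine similarity between the features of a sample and its flipped view, is an assumption, as are the function names and the tuple layout of a buffered sample.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrast_score(z, z_flipped):
    """Assumed score: high when the two views are encoded differently."""
    return 1.0 - cosine_similarity(z, z_flipped)


def update_buffer(buffer, new_samples, capacity):
    """Keep the `capacity` samples the model currently encodes worst.

    Each sample is a (sample_id, z, z_flipped) tuple, where z and
    z_flipped are the features of the sample and of its flipped view.
    """
    candidates = list(buffer) + list(new_samples)
    candidates.sort(key=lambda s: contrast_score(s[1], s[2]), reverse=True)
    return candidates[:capacity]


if __name__ == "__main__":
    buffer = [("a", (1.0, 0.0), (1.0, 0.0))]      # well encoded, score 0
    stream = [("b", (1.0, 0.0), (0.0, 1.0)),      # poorly encoded, score 1
              ("c", (1.0, 0.0), (1.0, 1.0))]      # in between
    print([s[0] for s in update_buffer(buffer, stream, capacity=2)])
```

No labels are needed at any point: the score depends only on the model's own features, which is what makes the rule usable on an unlabeled stream.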
On the hardware side, an efficient accelerator is needed to run deep learning on low-power, resource-constrained edge devices. However, efficient implementation of CNN training on these accelerators is more challenging than inference for the following three reasons. First, the training process needs more operations than the inference process. Inference only needs to handle forward propagation (FP), while training additionally contains backward propagation (BP) and weight update (WU). Therefore, training requires about 3× the computation and more types of operations than inference [9, 13]. Second, the data dependency between different steps in training makes on-board memory management and data reuse in dynamic random access memory (DRAM) challenging [14]. For example, a large number of activation values from FP are used in BP and WU, and the loss computed by BP needs to be further used by WU. Third, FP, BP, and WU use different memory access patterns, which need to be optimized by considering all three steps. Memory optimization that considers only FP therefore results in low overall memory access efficiency.

To address these challenges in implementing training on edge devices, we propose a new efficient training accelerator named EF-Train. It features a unified channel-level parallelism-based convolution kernel to deal with the computation complexity. Channel-level parallelism means that the kernel allocates the computation resources to process multiple channels of feature maps in parallel. The unified kernel means that the same on-chip computation resources on the device are used in FP, BP, and WU for processing convolution operations. To solve the communication bottleneck in realistic end-to-end training, a compile-time optimization approach, data reshaping, is proposed. In mini-batch training, this method achieves intra-tile and inter-tile memory access continuity and weight reuse. The proposed techniques are implemented on resource-limited FPGAs without sacrificing precision for end-to-end training. Both small and large batch sizes are supported.

Our main contributions are as follows:

• Self-supervised on-device learning framework. We propose a framework to select the most representative data for self-supervised contrastive learning from the unlabeled data stream. The selected data enables the model to adapt to new environments while not forgetting old ones. Only a small data buffer is used, eliminating the need to continuously store all the data on the device (Sect. 3.1).
• Contrast scoring for data selection. We propose contrast scoring to measure the importance of data samples and maintain the most important and representative data in the buffer for effective learning. The streaming data is selected on-the-fly without using any labeling information. The selected data generate large gradients that effectively improve the learning (Sects. 3.2 and 3.3).
• Efficient CNN by EF-Train. An efficient CNN training accelerator with a unified convolution kernel to process FP, BP, and WU with full precision is proposed. Channel-level parallelism is leveraged for high computation utilization, and both small and large batch sizes are supported. Different types of layers, including convolutional (Conv) layers, fully connected (FC) layers, batch normalization (BN) layers, rectified linear unit (ReLU) layers, and pooling layers, can be trained end to end (Sect. 4.1).
• Efficient memory access data reshaping. To solve the off-chip communication bottleneck, a data reshaping approach is developed.
To remove discontinuous memory accesses within a tile, the features and weights are stored in off-chip memory with intra-tile continuous memory allocation. Inter-tile discontinuous memory accesses are reduced by scheduling loop orders between tiles. Weight reuse among multiple images in a mini-batch is further exploited to improve communication efficiency for large batch sizes (Sect. 4.2).
• End-to-end validation on the low-power, resource-constrained edge FPGA. The EF-Train prototype is implemented on a low-power, resource-limited edge FPGA to validate that our design efficiently addresses the computation complexity and communication bottleneck of on-device learning. We conduct end-to-end CNN training without sacrificing precision. The experimental results will be reported in Sect. 5.2.

Experimental results show the effectiveness of the proposed methods. For the software, the proposed techniques significantly improve the accuracy and the learning speed. For example, with 1% labeled data we achieve 28.36% higher accuracy than directly performing supervised learning with these data. Besides, we achieve 13.9% higher accuracy than the state-of-the-art (SOTA) techniques for data selection [15]. We also achieve 2.67× faster learning when reaching the same accuracy as other methods. For the hardware validation, end-to-end CNN training is performed on a ZCU102 for various CNNs on both the CIFAR-10 and ImageNet datasets. Our design achieves a throughput of 46.99 GFLOPS and an energy efficiency of 6.09 GFLOPS/W.
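To illustrate the intra-tile discontinuity that data reshaping targets, the sketch below counts how many separate contiguous address runs one tile occupies when a feature map is stored in plain row-major (channel, height, width) order. The layout, the assumption that the tile lies fully inside the tensor, and the function name are illustrative choices, not details of EF-Train.

```python
def contiguous_runs(shape, tile):
    """Contiguous address runs needed to read one tile.

    `shape` is the (C, H, W) tensor shape in row-major order and `tile`
    is the (tc, th, tw) tile shape, assumed to lie inside the tensor.
    """
    _, height, width = shape
    tc, th, tw = tile
    if tw < width:
        return tc * th   # every tile row is a separate run
    if th < height:
        return tc        # full-width rows merge within each channel
    return 1             # whole channels are back to back in memory


if __name__ == "__main__":
    feature_map = (64, 32, 32)
    for tile in ((8, 8, 8), (8, 8, 32), (8, 32, 32)):
        print(tile, contiguous_runs(feature_map, tile))
```

An 8x8x8 tile of a 64x32x32 feature map is scattered over 64 short runs of 8 elements each, whereas a layout that stores each tile contiguously would let the same tile be fetched as a single run.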

2 Related Works

2.1 Software

Contrastive Learning The recently developed self-supervised learning method, contrastive learning, learns visual representations from unlabeled data. In this chapter, we employ a typical contrastive learning method [10]. For one image $x$, its representation vector $h$ is generated by $h = f(x)$, where $f(\cdot)$ is the backbone of a deep learning model (i.e., a set of convolutional layers). A projection head $g(\cdot)$ is used to project the representations to the latent space as a vector $z = g(h) = g(f(x))$, and the contrastive loss is applied on $z$. The contrastive loss takes positive pairs and negative pairs as inputs. A positive pair $(z_i, z_{i^+})$ is formed by two views $(x_i, x_{i^+})$ of an input image $x$. These two views are fed into the encoder to generate the representations $(h_i, h_{i^+}) = (f(x_i), f(x_{i^+}))$. By projecting and normalizing the representations as a positive pair $(z_i, z_{i^+})$, the contrastive loss $\ell_{i,i^+}$ is calculated as follows:

\[
\ell_{i,i^+} = -\log \frac{\exp(z_i \cdot z_{i^+} / \tau)}{\exp(z_i \cdot z_{i^+} / \tau) + \sum_{i^-} \exp(z_i \cdot z_{i^-} / \tau)}, \tag{1}
\]

where the projected representation vectors .zi − of other data in the same batch are negative pairs. .τ is a hyper-parameter temperature. By optimizing the contrastive loss, high-quality representations can be learned by the encoder. The existing works on contrastive learning rely on conventional training setup and do not work well in streaming settings. He et al. [11] and Chen et al. [10] perform contrastive representation learning for downstream tasks such as classification and segmentation. Knights et al. [16] and Orhan et al. [17] leverage the temporal correlation existing in video data for improving representation learning. The drawback of these works is that they assume the entire dataset is pre-collected before training starts. Samples can be randomly sampled from the entire dataset to form each mini-batch, and each mini-batch has independent and identically distributed (iid) data distribution. However, in streaming settings, the input data on edge devices usually do not follow iid data distribution, and the data is sequentially collected as it is. Besides, random sampling from the data to form iid data distribution requires continuously storing all the input data and is prohibitive


on edge devices. Therefore, to enable efficient and accurate on-device contrastive learning, a technique is needed that forms mini-batches filled with the most representative data on-the-fly.

Data Selection for Streaming and Continual Learning The existing works on data selection for streaming and continual learning are designed for supervised learning and leverage data labels for the selection [18]. A data buffer stores previously seen data for rehearsal [12, 18, 19] so that catastrophic forgetting of previously seen data can be mitigated. Since these works rely on data labels, and labeling all the streaming data is prohibitive on edge devices, they cannot be applied to contrastive learning from unlabeled streaming data.

2.2 Hardware

DRAM Access Issues for Current Inference Accelerators Implementing CNN inference on resource-limited local devices has been widely investigated in recent years [20, 21]. Many inference accelerators [20, 22, 23] focus on selecting optimal design parameters to improve the acceleration performance of individual Conv layers, adopting optimization techniques such as loop tiling and loop unrolling. Although these algorithms achieve higher performance efficiency on a given Conv layer, the designs are presented as isolated accelerators without end-to-end validation. End-to-end validation means that all layers of a neural network are tested continuously, where the layers' intermediate results are usually transferred between off-chip DRAM and on-chip buffers because of the limited on-chip storage. Therefore, the impact of off-chip memory accesses should be considered in realistic scenarios. For some edge-level devices such as FPGAs, direct memory access (DMA) is a commonly used and effective approach for reading data at contiguous addresses. When the limited on-chip memory of such edge devices cannot hold all the features and weights of a Conv layer, the accelerator needs to fetch and process data in tiles based on the computation pattern. However, these tiling schemes break the contiguity of data addresses in DRAM, which reduces the DMA transmission efficiency. Previous work [24] has shown that discontinuous memory access can degrade the DMA transfer speed from about 8 GB/s to around 1 GB/s. However, the isolated accelerators [20, 22, 23] assume that data are well preallocated between adjacent layers so that tiles can be loaded from and stored back to the off-chip memory continuously. In fact, compared to the acceleration time, the allocation overhead is extremely large in realistic end-to-end systems. The detailed analysis will be further discussed in Sect. 4.2.

Solutions for the DRAM Access Issues in the Inference Phase To achieve CNN acceleration in real end-to-end applications, some recent works have focused on DRAM memory access to solve the communication bottleneck. For example, ROMANet [25] proposed a design space exploration (DSE) by searching for the


appropriate data partitioning and scheduling for each layer of a network to reduce the number of memory accesses for tensor processing units (TPUs). DRMap [26] proposed a generic DRAM mapping policy targeting TPUs. A DSE has also been presented to reduce the DRAM access latency and energy. In [27], a multi-bank on-chip memory management (MOMM) problem is defined to minimize the DRAM access overhead in the processing of CNNs on a neural processing unit (NPU) with a multi-bank on-chip memory. Caffeine [28] combined both off-chip and on-chip data reorganizations for the convolutional matrix-multiplication representation to maximize the underlying memory bandwidth utilization for FPGA-based designs. However, all these works are based on the computation and memory access patterns of the inference phase. Unlike the inference phase, which only has FP, the training phase involves FP, BP, and WU. The data access patterns for output features, input features, and weights differ among FP, BP, and WU. Therefore, the above-mentioned approaches cannot be applied directly to CNN training, and a new optimized design that considers FP, BP, and WU together is required.

Current Training Accelerators The training process is much more complicated than the inference process. Therefore, directly adopting the frameworks of inference accelerators for training is sub-optimal. Among current studies on CNN training acceleration, [29] proposed a layer-wise adaptive rate scaling (LARS) algorithm on Google's cloud TPU platform that reduced the 90-epoch ResNet-50 training time from hours to 20 minutes. Venkataramanaiah et al. [30] developed an automatic compiler for a training accelerator on Stratix 10 using 16-bit fixed-point precision. DarkFPGA [31] adopted batch-level parallelism, which means that the images inside a mini-batch are processed in parallel. It used 8-bit integers for training a VGG-like network on the Maxeler MAX5 platform and achieved high throughput when the batch size is large. However, these existing works mainly focused on cloud-level devices with abundant memory and computation resources. For on-device learning, [32] proposed an energy-efficient CNN processor facilitating both inference and training for low-power embedded systems. This work was the first application-specific integrated circuit (ASIC) design for a CNN training accelerator. Venkataramanaiah et al. [33] proposed an accelerator with a low-bit fixed-point data type for CNN training. The accelerator runs a stochastic gradient descent-based training algorithm and is implemented in both 65 nm CMOS ASIC and Intel Stratix-10 FPGA hardware with 16-bit fixed-point precision. These works applied quantized data types to relieve the communication bottleneck, but there is no evidence that such quantization techniques can retain high accuracy on a large dataset (e.g., ImageNet) with dense neural networks. The high computation and memory overhead must be addressed directly because full precision is still preferred in most CNN training scenarios. Therefore, an optimized design is necessary to implement on-device training on resource-limited edge devices without sacrificing precision. To solve both the computation complexity and the communication


bottleneck of end-to-end training, we propose an efficient CNN training accelerator that will be illustrated in Sect. 4.

3 Software: On-device Continuous Self-supervised Contrastive Learning with Selective Data Contrast

To efficiently and effectively learn visual representations from unlabeled streaming data without accumulating a large dataset, which the limited storage on edge devices does not permit, we propose a framework that selects data on-the-fly for on-device learning. In this framework, the Contrast Score is proposed to form the data replacement policy such that the most representative data are maintained in the buffer. Learning from these important data effectively improves the model accuracy. Valuable data that the model cannot yet encode effectively are selected and kept in the buffer for further learning, while data for which the model can already generate high-quality representations are dropped to free space for the important ones. A theoretical analysis supports this simple yet effective contrast score method: data with higher scores are associated with larger gradients and expedite the learning. The overview of the framework is introduced in Sect. 3.1. The proposed contrast scoring method is then described in Sect. 3.2. Finally, the theoretical analysis of the effectiveness of contrast scoring is given in Sect. 3.3.

3.1 Framework Overview

The proposed framework consists of two stages. As shown in Fig. 1, the first stage trains an encoder by self-supervised contrastive learning. The trained encoder is capable of encoding high-dimensional inputs such as images into low-dimensional vectors. The second stage uses the learned encoder as the initialization and trains the classifier with limited labeled data for the downstream tasks. In the first stage, the input streaming data are consumed on-the-fly to train the model for better visual representations. Instead of accumulating all the input data, we use only a small data buffer to maintain the most representative data, and the data in the buffer directly serve as a training mini-batch. As new inputs $I$ stream in, the new data in $I$ and the existing data in buffer $B$ are scored, and the most representative and important ones are selected. For convenience, we set the size of $I$ equal to the size of $B$. After scoring, the data with the highest scores from $B \cup I$ are put back into the buffer. Therefore, the data replacement process always maintains the most representative data among the new and old ones. After each iteration of data replacement, the data maintained in the buffer are used as one mini-batch to train the model once. In the following sections, we describe the data replacement method in detail.

Fig. 1 Framework overview of on-device contrastive learning. The contrast-scoring-based data selection method selects the important data from the unlabeled input stream to train the encoder by self-supervised contrastive learning. After that, a few labeled data (e.g., 1%) will be used to train the classifier by supervised fine-tuning

3.2 Data Replacement by Contrast Scoring

Contrast Scoring For each input image $x_i$, the contrast scoring function $S(x_i)$ evaluates the capability of the encoder $f(\cdot)$ to generate the representation vector $h_i = f(x_i)$ for $x_i$. If the encoder cannot generate a high-quality representation for $x_i$, then $x_i$ is important for training, since learning from $x_i$ improves the encoder. To measure the quality of the representation of $x_i$, whether it comes from the input stream or from the buffer, another view $x_{i^+}$ is formed by horizontal flipping. These two views are then fed into the encoder to generate the corresponding representation vectors $h_i$ and $h_{i^+}$. If the representation quality of $x_i$ is good enough, $h_i$ and $h_{i^+}$ will be very similar or even identical. Based on $h_i$ and $h_{i^+}$, the contrast score for $x_i$ is computed by the scoring function $S(\cdot)$ (Fig. 2). We define the contrast scoring function $S(\cdot)$ as

$$S(x_i) = 1 - \mathrm{similarity}(z_i, z_{i^+}) = 1 - z_i^T z_{i^+}, \quad x_i \in \{B \cup I\}, \tag{2}$$

$$z_i = g(h_i)/\|g(h_i)\|_2, \quad z_{i^+} = g(h_{i^+})/\|g(h_{i^+})\|_2, \tag{3}$$

where we feed an image $x_i$ and its horizontally flipped view $x_{i^+}$ to the encoder $f(\cdot)$ and generate the representation vectors $h_i = f(x_i)$ and $h_{i^+} = f(x_{i^+})$, respectively. We then normalize the projected representations $z_i$ and $z_{i^+}$ from $g(\cdot)$ by applying $\ell_2$-normalization such that $\|z_i\|_2 = \|z_{i^+}\|_2 = 1$. As a result, the value of the dot product is in the range $[-1, 1]$, and $S(x_i)$ is in the range $[0, 2]$.
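The scoring in Eqs. (2) and (3) can be illustrated with a minimal sketch. The names `toy_encoder` and `toy_head` are hypothetical stand-ins for $f(\cdot)$ and $g(\cdot)$ (column sums and an identity map), chosen only so the example runs without a trained network; they are not the chapter's model.

```python
import math

def hflip(image):
    """Horizontally flip an image stored as a list of pixel rows."""
    return [row[::-1] for row in image]

def l2_normalize(v):
    """Scale v to unit l2 norm (assumes v is not the zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def contrast_score(image, encoder, head):
    """S(x) = 1 - z^T z+ (Eq. 2), with z and z+ l2-normalized (Eq. 3)."""
    z = l2_normalize(head(encoder(image)))
    z_pos = l2_normalize(head(encoder(hflip(image))))
    return 1.0 - sum(a * b for a, b in zip(z, z_pos))

# Hypothetical stand-in for the encoder f(.): one feature per image column.
def toy_encoder(image):
    return [sum(col) for col in zip(*image)]

# Hypothetical stand-in for the projection head g(.): identity.
def toy_head(h):
    return list(h)

symmetric = [[1.0, 2.0, 1.0], [3.0, 0.5, 3.0]]   # unchanged by flipping
asymmetric = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]  # mass moves to the other side

score_sym = contrast_score(symmetric, toy_encoder, toy_head)
score_asym = contrast_score(asymmetric, toy_encoder, toy_head)
```

With this toy encoder, the left-right symmetric image yields identical representations for both views and therefore a score near 0, while the asymmetric image yields orthogonal representations and a score near 1. Because only a deterministic flip is used, repeated calls on the same image and encoder return the same score.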

Fig. 2 Contrast scoring for data replacement. Two views of an input image, the original view and the horizontally flipped view, are fed into the encoder to generate representations. The representations are further normalized onto the unit sphere to compute contrast scores

The contrast scoring function $S(x_i)$ measures the cosine dissimilarity between the projected representation vectors of two views (i.e., the original and horizontally flipped views) of an image $x_i$, and a higher score represents a larger difference. The representations of an image need to be independent of its views [34], so the representations of two views should be as similar as possible. Since a higher score means a larger difference, images with higher contrast scores are more important for training the encoder. By updating the encoder with $x_i$ using the contrastive loss [10], which aims to maximize the similarity of two strongly augmented views of $x_i$, the score of $x_i$ in Eq. (2) will decrease, and $x_i$ will have a lower probability of being selected into the next mini-batch in Eq. (4). As a result, valuable data have a higher probability of being selected and used in the next mini-batch, while other data are likely to be dropped without being used for training. The theoretical analysis of the effectiveness of contrast scoring is described in Sect. 3.3.

Design Principle of Contrast Score Contrast scoring measures the capability of the encoder to generate the representation $h_i = f(x_i)$ for an input image $x_i$. Therefore, it should rely only on the input image and the encoder, and not on other factors. To this end, when generating two views of an image $(x_i, x_{i^+})$ as the input to the scoring function $S(\cdot)$, it is important to avoid transformations that involve any randomness, such as random cropping, and to use only deterministic data augmentation (i.e., horizontal flipping). In this way, the deterministic data augmentation generates consistent inputs to $S(\cdot)$ and a consistent score for a given input image $x_i$.

Data Selection Based on Contrast Score After new data arrive at iteration $t$, we select the most informative and representative data from both $I_t$ and $B_t$ to form the next mini-batch $B_{t+1}$, aiming to benefit the model the most. To this end, we use the contrast scoring function $S(\cdot)$ to score the old data in the buffer $B_t$ and the newly arrived data $I_t$. After scoring, the data with the highest contrast scores in $B_t \cup I_t$ are selected for the next mini-batch $B_{t+1}$.

$$B_{t+1} = \left\{ x_i \mid x_i \in B_t \cup I_t,\; i \in \mathrm{topN}\left(\{S(x_i)\}_{i=1}^{2N}\right) \right\}. \tag{4}$$

In the above equation, topN() finds the xi with the top N scores and returns their indices. As a result, by leveraging the contrast scoring, we maintain the most representative data in the buffer.
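The selection step of Eq. (4) can be sketched as follows. The function name `select_top_n` and the fixed lookup table `toy_scores` are illustrative assumptions; in the framework described here the scores come from $S(\cdot)$ and change as the encoder is updated.

```python
def select_top_n(buffer, new_data, score_fn, n):
    """Return the n items of buffer + new_data with the highest scores (Eq. 4)."""
    candidates = list(buffer) + list(new_data)
    scores = [score_fn(x) for x in candidates]
    # topN: indices of the n largest scores; the sort is stable, so ties
    # are resolved in favor of the earlier candidate.
    top = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)[:n]
    return [candidates[i] for i in top]

# Hypothetical precomputed contrast scores standing in for S(x_i).
toy_scores = {"a": 0.1, "b": 0.8, "c": 0.3, "d": 0.7}

buffer_t = ["a", "b"]      # B_t, size N = 2
incoming_t = ["c", "d"]    # I_t, same size as the buffer
buffer_next = select_top_n(buffer_t, incoming_t, toy_scores.get, n=2)
```

Here `buffer_next` keeps one old item and one new item, which shows that the policy ranks old and new data jointly instead of favoring either source.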

3.3 Understanding the Effectiveness of Contrast Score

Analysis of contrast scoring shows that the selected data have large gradient magnitudes and therefore expedite the learning process. Specifically, for the contrastive loss $\ell_{i,i^+}$ in Eq. (1), the gradient with respect to the representation $z_i$ of an input $x_i$ is as follows:

$$\frac{\partial \ell_{i,i^+}}{\partial z_i} = -\frac{1}{\tau}\left( \left(1 - p_{z_{i^+}}\right) \cdot z_{i^+} - \sum_{z_{i^-}} p_{z_{i^-}} \cdot z_{i^-} \right), \tag{5}$$

$$p_z = \frac{\exp\left(z_i^T z/\tau\right)}{\sum_{z_j \in \{z_{i^+}, z_{i^-}\}} \exp\left(z_i^T z_j/\tau\right)}, \quad p_z \in \{p_{z_{i^+}}, p_{z_{i^-}}\}. \tag{6}$$

The probability distribution $p_z$ is generated by applying the softmax function to the similarity $z_i^T z_j$ between $z_i$ and each $z_j \in \{z_{i^+}, z_{i^-}\}$ in the same mini-batch. In the above equation, $p_{z_{i^+}}$ is the matching probability of a positive pair $z_i$ and $z_{i^+}$, and $p_{z_{i^-}}$ is the matching probability of a negative pair $z_i$ and $z_{i^-}$, where $z_{i^-}$ is the representation of another image in the same mini-batch as $z_i$. If an input image $x_i$ has a small contrast score $S(x_i)$, then $z_i^T z_{i^+}$ is close to 1, so $p_{z_{i^+}}$ tends toward 1 and each $p_{z_{i^-}}$ toward 0 when the negatives are less similar to $z_i$. Both terms in Eq. (5) then nearly vanish, and $x_i$ generates a near-zero gradient that makes only a very small contribution to the training. On the other hand, if an input image $x_i$ has a high contrast score $S(x_i)$, it generates a large gradient in Eq. (5) and contributes effectively to the training.
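The relation between the contrast score and the gradient magnitude can be checked numerically with a small sketch of Eqs. (1), (5), and (6). The 2-D unit vectors, the two orthogonal negatives, and the temperature $\tau = 0.1$ are assumptions chosen for illustration only, not values from the chapter.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(v):
    return math.sqrt(dot(v, v))

def contrastive_loss(z_i, z_pos, z_negs, tau):
    """Eq. (1), computed with the log-sum-exp trick for numerical stability."""
    pos = dot(z_i, z_pos) / tau
    logits = [pos] + [dot(z_i, z) / tau for z in z_negs]
    m = max(logits)
    return -(pos - m - math.log(sum(math.exp(l - m) for l in logits)))

def loss_gradient(z_i, z_pos, z_negs, tau):
    """Gradient of Eq. (1) with respect to z_i, following Eqs. (5) and (6)."""
    logits = [dot(z_i, z) / tau for z in [z_pos] + list(z_negs)]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    p = [e / total for e in exps]  # p[0] = p_{z_i+}, p[1:] = p_{z_i-}
    grad = []
    for d in range(len(z_i)):
        pos_term = (1.0 - p[0]) * z_pos[d]
        neg_term = sum(p[k + 1] * z_negs[k][d] for k in range(len(z_negs)))
        grad.append(-(pos_term - neg_term) / tau)
    return grad

tau = 0.1                                  # assumed temperature
z_i = [1.0, 0.0]
negs = [[0.0, 1.0], [0.0, -1.0]]           # assumed negatives, orthogonal to z_i
z_near = [math.cos(math.radians(5)), math.sin(math.radians(5))]      # well-encoded
z_far = [math.cos(math.radians(120)), math.sin(math.radians(120))]   # poorly encoded

score_low = 1.0 - dot(z_i, z_near)         # contrast score per Eq. (2)
score_high = 1.0 - dot(z_i, z_far)
grad_low = loss_gradient(z_i, z_near, negs, tau)
grad_high = loss_gradient(z_i, z_far, negs, tau)
print(score_low, norm(grad_low))
print(score_high, norm(grad_high))
```

In this toy setting, the sample whose two views are nearly aligned has a low score and a gradient norm several orders of magnitude smaller than that of the sample whose views disagree, which is the behavior the analysis above describes. The gradient treats $z_i$ as a free variable, as Eq. (5) does.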

4 EF-Train: Enable Efficient On-device CNN Training on FPGA Through Data Reshaping for Online Adaptation or Personalization

In the hardware-level design, we propose EF-Train, an efficient DNN training accelerator that enables resource-limited, low-power edge devices to continuously learn from new data for domain adaptation and personalization. We first illustrate a CNN training accelerator prototype (Sect. 4.1) with a unified channel-level parallelism-based convolution kernel that can solve the computation complexity


[Figure residue: On-Chip Memory and Data Flow, showing the IFM buffer, weight buffer, Conv kernel (PE array), BN kernel with BN parameters, and pooling kernel with pooling indexes]

1. for(i=0; i