Analog Circuits for Machine Learning, Current/Voltage/Temperature Sensors, and High-speed Communication: Advances in Analog Circuit Design 2021 3030917401, 9783030917401

This book is based on the 18 tutorials presented during the 29th workshop on Advances in Analog Circuit Design. Expert d


English Pages 350 [351] Year 2022


Table of contents :
Preface
The Topics Covered Before in this Series
Contents
Part I Analog Circuits for Machine Learning
1 Mixed-Signal Compute and Memory Fabrics for Deep Neural Networks
1 Introduction
2 Efficiency Limits of Digital DNN Accelerators
3 Analog and Mixed-Signal Computing
4 In-Memory Computing
5 Discussion and Conclusions
References
2 Analog Computation with RRAM and Supporting Circuits
1 Introduction
2 Analog Crossbar Computation
3 Challenges of Crossbar Operation
3.1 Device Nonlinearity
3.2 Mixed-Signal Peripheral Circuitry
4 Non-volatile Crossbar Synapses
4.1 Flash
4.2 Filamentary Resistive-RAM
5 Digital RRAM Crossbar
5.1 Analog Operation with Digital RRAM Cells
6 Analog RRAM Crossbar
6.1 Analog Operation with Analog RRAM Cells
6.2 Fully Integrated CMOS-RRAM Analog Crossbar
6.2.1 RRAM Programming
6.2.2 RRAM Nonlinearity
6.2.3 CMOS Prototype
6.2.4 Measurement Setup
6.2.5 Single-Layer Perceptron Example
6.2.6 System Performance
7 Conclusions
References
3 Analog In-Memory Computing with SOT-MRAM: Architecture and Circuit Challenges
1 Introduction
2 Resistive Element Array
3 SOT MRAM Memory Element
4 SOT MRAM-Based Cell for AiMC
5 MVM Result in SOT Array
6 Impact of LSB Size on ADC Design
6.1 LSB Shrinking on CS SAR DAC
6.2 LSB Shrinking on CS SAR Comparator
7 Conclusions
References
4 Prospects for Analog Circuits in Deep Networks
1 Introduction
2 Review of Circuits for Analog Computing
3 Analog Circuits for Matrix-Vector Multiplication
4 Non-volatile Resistive Crossbars
5 Future of Analog Deep Neural Network Architectures
5.1 Trends in Machine Learning ASICs
6 Conclusion
References
5 SPIRIT: A First Mixed-Signal SNN Using Co-integrated CMOS Neurons and Resistive Synapses
1 Introduction
2 NVM Technology
3 Neural Network Architecture
3.1 Building a SNN
3.2 Learning Strategy
4 Circuit Architecture
4.1 Synapse Implementation
4.2 Neuron Design
4.3 Top Architecture
5 Measurement Results
5.1 Circuit Validation
5.2 Extra Measurements on OxRAMs
6 Discussion
7 Conclusion
References
6 Accelerated Analog Neuromorphic Computing
1 Introduction
2 Overview of the BSS Neuromorphic Architecture
3 The HICANN-X Chip
3.1 Event-Routing Within HICANN-X
3.2 Analog Inference: Rate-Based Extension of HICANN-X
4 Analog Verification of Complex Neuron Circuits
4.1 Interfacing Analog Simulations from Python
4.2 Monte Carlo Calibration of AdEx Neuron Circuits
5 Conclusion
Author Contribution
References
Part II Current, Voltage, and Temperature Sensors
7 Advancements in Current Sensing for Future Automotive Applications
1 Introduction
2 Current Sensing
2.1 Classical Current Sensing
2.2 Improvement of Classical Current Sensing
2.3 From Linear to Switched Concepts
2.4 Current Sensing Goes Digital
2.5 Impact to Future Designs
3 Conclusions
References
8 Next Generation Current Sense Interfaces for the IoT Era
1 Introduction
2 Sensing Interfaces for IoT
2.1 Current Sensing
2.2 Capacitive Sensing
2.3 Inductive Sensing
2.4 Resistive Sensing
3 Multi-sense Interfaces
4 Choosing a Current Sensing ADC
4.1 Two-Step ADCs
4.2 Current Mode Incremental ADC (CI-ADC)
5 Incremental ΔΣ Design Considerations
5.1 Choosing Both Coarse and Fine ADC Order
5.2 Understanding Noise
5.3 Incremental ΔΣ Linearity with Passive Integrators
5.4 Capacitor Sizing
5.5 Decimation Filter
6 Multi-Sense with a CI-ADC
6.1 Current Sensing with a CI-ADC
6.2 Capacitive Sensing with a CI-ADC
6.3 Inductive Sensing with a CI-ADC
6.4 Resistive Sensing with a CI-ADC
7 Measurement Results
7.1 Optical Proximity Results
7.2 Capacitance Sensing Results
7.3 Inductive Sensing Results
7.4 Resistance Sensing
8 Conclusions
References
9 Precision Voltage Sensing in Deep Sub-micron and Its Challenges
1 ADC Overview
1.1 Sampling
1.2 Quantisation
1.3 Other Noise Sources
1.3.1 Aperture Error
1.3.2 Thermal Noise
1.4 ADC Signal to Noise
1.5 Figure of Merits
1.5.1 Walden FoM
1.5.2 Schreier FoM
1.6 Architecture Comparison
1.7 Architecture Selection
2 SAR ADC Architecture
3 Noise-Shaped SAR ADC
4 Error Feedback Design Example
5 Dynamic Amplifier
6 Conclusion
References
10 Breaking Unusual Barriers in Sensor Interfaces: From Minimum Energy to Ultimate Low Cost
1 Introduction
2 Ultra-Low Power All-Dynamic Multimodal Sensor Interface
2.1 Proposed All-Dynamic Versatile Sensing Platform
2.2 Low Power All-Dynamic Temperature Sensing
2.3 All-Dynamic Capacitance Sensor Interface
2.4 All-Dynamic 4-Terminal Resistance Sensor Interface
2.5 SAR ADC
2.6 Measurement Results
3 Ultimate Low-Cost Electronics
4 A Printed Smart Temperature Sensor for Cold Chain Monitoring Applications
4.1 System Architecture
4.2 Circuit Implementation
4.3 Measurement Results
5 Conclusions
References
11 Thermal Sensor and Reference Circuits Based on a Time-Controlled Bias of pn-Junctions in FinFET Technology
1 Introduction
2 Basic Principles
2.1 Bulk Diode Properties
2.2 Capacitive Bias of PN-Junctions
2.3 Forward-Bias Through Negative Charge-Pump
3 Application to a Switch-Cap Reverse Bandgap Reference
4 An Untrimmed Thermal Sensor Using Bulk Diodes for Sensing
4.1 Pulse-Controlled Sensor Principle
4.2 Circuit Realization with C-DAC
4.3 Simulation and Measurement Results
5 Conclusions
References
12 Resistor-Based Temperature Sensors
1 Introduction
2 Theoretical Energy Efficiency of Different Sensors
2.1 Temperature Sensors and Resolution FoM
2.2 BJT Sensor and Theoretical FoM
2.3 Resistor Sensor and Theoretical FoM
2.4 Effect of Readout Circuits
3 Resistor Choice and Sensor Topologies
3.1 Sensing Resistor Choice
3.2 Reference Choice
3.3 Dual-R Sensor Examples
3.4 RC Sensor Examples
4 An Energy-Efficient WhB Sensor Design
4.1 Front-End Design
4.2 Readout Circuit Design
4.3 Measurement Results
5 Summary
References
Part III High-speed Communication
13 Recent Advances in Fractional-N Frequency Synthesis
1 Introduction
2 Noise and Fractional-N Spurs
3 Divider Controller Spurs
4 Loop Nonlinearities
4.1 Loop Filter and Controlled Oscillator
4.2 Frequency Divider
4.3 Time Difference Measurement
5 Interaction Between the Divider Controller and Loop Nonlinearities
6 Spur Mitigation Strategies
7 All-Digital Phase Locked Loops
8 Conclusion
References
14 ADC/DSP-Based Receivers for High-Speed Serial Links
1 Introduction
2 ADC Resolution Requirements and Topologies
3 Digital Equalization
4 A 52 Gb/s ADC-Based PAM-4 Receiver with Comparator-Assisted 2 Bit/Stage SAR ADC and Partially Unrolled DFE
4.1 Receiver Architecture
4.2 ADC Design
4.3 DSP Design
4.4 Measurement Results
5 Conclusion
References
15 ADC-Based SerDes Receiver for 112 Gb/s PAM4 Wireline Communication
1 Introduction
2 ADC-Based Receiver Architecture
2.1 Peak to Main Cursor Ratio (PMR)
2.2 Distributed Equalization
3 112 Gb/s 16 nm Silicon Implementation
3.1 Receiver
3.2 Clocking
3.3 Measurement Results
4 Conclusions
References
16 BiCMOS-Integrated Circuits for Millimeter-Wave Wireless Backhaul Transmitters
1 Introduction
2 Reconfigurable Multi-core Voltage Controlled Oscillator
2.1 Multi-Core VCO Overview
2.2 Effect of Components Mismatch
2.3 VCO Measurement Results
3 Frequency Tripler with High Harmonics Suppression
3.1 Tripler Operating Principle
3.2 Tripler Design and Measurements
4 Wideband I/Q LO Generation with Self-tuned Polyphase Filter
4.1 I/Q LO Generation Architecture and Circuits Design
4.2 I/Q LO Measurement Results
5 E-Band Common-Base PAs Leveraging Current-Clamping
5.1 Principle of Current-Clamping
5.2 PAs Design and Measurements
6 Conclusions
References
17 Optical Communication in CMOS—Bringing New Opportunities to an Established Platform
1 Introduction
2 Schottky Photodiodes in Bulk CMOS
2.1 Electrical Characterization
2.2 Optical Characterization
3 Integrated Receivers Without Equalization
4 Integrated Receivers with Equalization
5 CMOS 1310/1550nm Receiver Chip Implementations
5.1 Receivers Without Equalization
5.2 Receiver with Embedded IIR DFE
6 Conclusions
References
18 Coherent Silicon Photonic Links
1 Introduction
2 Coherent Transceiver Operation
2.1 Transmitter
2.2 Receiver
3 High-Swing Linear Driver
4 Measurement Results
5 Conclusions
References
Index

Pieter Harpe Kofi A.A. Makinwa Andrea Baschirotto   Editors

Analog Circuits for Machine Learning, Current/Voltage/ Temperature Sensors, and High-speed Communication Advances in Analog Circuit Design 2021

Analog Circuits for Machine Learning, Current/Voltage/ Temperature Sensors, and High-speed Communication

Pieter Harpe • Kofi A. A. Makinwa Andrea Baschirotto Editors

Analog Circuits for Machine Learning, Current/Voltage/ Temperature Sensors, and High-speed Communication Advances in Analog Circuit Design 2021

Editors

Pieter Harpe
Eindhoven University of Technology
Eindhoven, The Netherlands

Kofi A. A. Makinwa
Delft University of Technology
Delft, The Netherlands

Andrea Baschirotto
University of Milan
Milan, Italy

ISBN 978-3-030-91740-1    ISBN 978-3-030-91741-8 (eBook)
https://doi.org/10.1007/978-3-030-91741-8

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

This book is part of the Analog Circuit Design series and contains contributions of all 18 speakers of the 29th workshop on Advances in Analog Circuit Design (AACD). The event was organized by Dr. Ivan O'Connell, Nicola Cooney, Catherine Walsh, and Paul Hyland from MCCI—Microelectronic Circuits Centre Ireland, Tyndall National Institute, Cork, Ireland. MCCI also sponsored the workshop. Due to the COVID-19 pandemic, the workshop was held online, from March 22nd to March 30th, 2021.

About AACD
The aim of the AACD workshop is to bring together a group of expert designers to discuss new developments and future options. Each workshop is followed by the publication of a book by Springer in their successful series of Analog Circuit Design. This book is the 29th in this series. The book series can be seen as a reference for all people involved in analog and mixed-signal design. The full list of the previous books and topics in the series is included in this book.

About MCCI
Funded by Enterprise Ireland and the IDA, MCCI's mission is to deliver high impact research for the semiconductor industry and to generate innovative technology. MCCI is a national technology center that works collaboratively in microelectronics circuit design to improve the performance of mixed-signal circuits required by their industry partners. MCCI's research focus is on mixed-signal, analog, and RF circuits. The center has established itself as a single point of contact in Ireland for access to high-caliber academic research in the field of microelectronics. MCCI is committed to the development of an engineering talent pipeline for the global semiconductor industry. For more information, visit www.mcci.ie

This book comprises three parts, each with six papers from experts in the field, covering advanced analog and mixed-signal circuit design topics that are considered highly important by the circuit design community:
• Analog Circuits for Machine Learning
• Current/Voltage/Temperature Sensors
• High-speed Communication


We are confident that this book, like its predecessors, proves to be a valuable contribution to our analog and mixed-signal circuit design community.

Eindhoven, The Netherlands    Pieter Harpe
Delft, The Netherlands        Kofi A. A. Makinwa
Milano, Italy                 Andrea Baschirotto

The Topics Covered Before in this Series

2019  Milan (Italy): Next-Generation ADCs; High-Performance Power Management; Technology Considerations for Advanced Integrated Circuits
2018  Edinburgh (Scotland): Analog Techniques for Power Constrained Applications; Sensors for Mobile Devices; Energy Efficient Amplifiers and Drivers
2017  Eindhoven (The Netherlands): Hybrid ADCs; Smart Sensors for the IoT; Sub-1V and Advanced Node Analog Circuit Design
2016  Villach (Austria): Continuous-time ΔΣ Modulators for Transceivers; Automotive Electronics; Power Management
2015  Neuchâtel (Switzerland): Efficient Sensor Interfaces; Advanced Amplifiers; Low-Power RF Systems
2014  Lisbon (Portugal): High-Performance AD and DA Converters; IC Design in Scaled Technologies; Time-Domain Signal Processing
2013  Grenoble (France): Frequency References; Power Management for SoC; Smart Wireless Interfaces
2012  Valkenburg (The Netherlands): Nyquist A/D Converters; Capacitive Sensor Interfaces; Beyond Analog Circuit Design
2011  Leuven (Belgium): Low-Voltage Low-Power Data Converters; Short-Range Wireless Front-Ends; Power Management and DC-DC
2010  Graz (Austria): Robust Design; Sigma Delta Converters; RFID
2009  Lund (Sweden): Smart Data Converters; Filters on Chip; Multimode Transmitters
2008  Pavia (Italy): High-Speed Clock and Data Recovery; High-Performance Amplifiers; Power Management
2007  Oostende (Belgium): Sensors, Actuators, and Power Drivers for the Automotive and Industrial Environment; Integrated PAs from Wireline to RF; Very High Frequency Front-Ends
2006  Maastricht (The Netherlands): High-Speed AD Converters; Automotive Electronics: EMC issues; Ultra-Low-Power Wireless
2005  Limerick (Ireland): RF Circuits: Wide Band, Front-Ends, DACs; Design Methodology and Verification of RF and Mixed-Signal Systems; Low Power and Low Voltage
2004  Montreux (Swiss): Sensor and Actuator Interface Electronics; Integrated High-Voltage Electronics and Power Management; Low-Power and High-Resolution ADCs
2003  Graz (Austria): Fractional-N Synthesizers; Design for Robustness; Line and Bus Drivers
2002  Spa (Belgium): Structured Mixed-Mode Design; Multi-bit Sigma-Delta Converters; Short-Range RF Circuits
2001  Noordwijk (The Netherlands): Scalable Analog Circuits; High-Speed D/A Converters; RF Power Amplifiers
2000  Munich (Germany): High-Speed A/D Converters; Mixed-Signal Design; PLLs and Synthesizers
1999  Nice (France): XDSL and Other Communication Systems; RF-MOST Models and Behavioral Modeling; Integrated Filters and Oscillators
1998  Copenhagen (Denmark): 1-Volt Electronics; Mixed-Mode Systems; LNAs and RF Power Amps for Telecom
1997  Como (Italy): RF A/D Converters; Sensor and Actuator Interfaces; Low-Noise Oscillators, PLLs, and Synthesizers
1996  Lausanne (Swiss): RF CMOS Circuit Design; Bandpass Sigma Delta and Other Data Converters; Translinear Circuits
1995  Villach (Austria): Low-Noise/Power/Voltage; Mixed-Mode with CAD Tools; Voltage, Current, and Time References
1994  Eindhoven (The Netherlands): Low Power Low Voltage; Integrated Filters; Smart Power
1993  Leuven (Belgium): Mixed-Mode A/D Design; Sensor Interfaces; Communication Circuits
1992  Scheveningen (The Netherlands): OpAmps; ADC; Analog CAD

Contents

Part I  Analog Circuits for Machine Learning

1  Mixed-Signal Compute and Memory Fabrics for Deep Neural Networks ..... 3
   Boris Murmann

2  Analog Computation with RRAM and Supporting Circuits ..... 17
   Justin M. Correll, Seung Hwan Lee, Fuxi Cai, Vishishtha Bothra, Yong Lim, Zhengya Zhang, Wei D. Lu, and Michael P. Flynn

3  Analog In-Memory Computing with SOT-MRAM: Architecture and Circuit Challenges ..... 33
   M. Caselli, J. Doevenspeck, S. Cosemans, I. A. Papistas, A. Mallik, P. Debacker, and D. Verkest

4  Prospects for Analog Circuits in Deep Networks ..... 49
   Shih-Chii Liu, John Paul Strachan, and Arindam Basu

5  SPIRIT: A First Mixed-Signal SNN Using Co-integrated CMOS Neurons and Resistive Synapses ..... 63
   A. Valentian, F. Rummens, E. Vianello, T. Mesquida, C. Lecat-Mathieu de Boissac, O. Bichler, and C. Reita

6  Accelerated Analog Neuromorphic Computing ..... 83
   Johannes Schemmel, Sebastian Billaudelle, Philipp Dauer, and Johannes Weis

Part II  Current, Voltage, and Temperature Sensors

7  Advancements in Current Sensing for Future Automotive Applications ..... 105
   Andreas Kucher, Andrea Baschirotto, and Paolo Del Croce

8  Next Generation Current Sense Interfaces for the IoT Era ..... 115
   Paul Walsh, Oleksandr Kaprin, Andriy Maharyta, and Mark Healy

9  Precision Voltage Sensing in Deep Sub-micron and Its Challenges ..... 137
   Ivan O'Connell, Subhash Chevella, Gerardo Molina Salgado, and Daniel O'Hare

10 Breaking Unusual Barriers in Sensor Interfaces: From Minimum Energy to Ultimate Low Cost ..... 165
   H. Xin, M. Fattori, P. Harpe, and E. Cantatore

11 Thermal Sensor and Reference Circuits Based on a Time-Controlled Bias of pn-Junctions in FinFET Technology ..... 191
   Matthias Eberlein and Harald Pretl

12 Resistor-Based Temperature Sensors ..... 209
   Sining Pan and Kofi A. A. Makinwa

Part III  High-speed Communication

13 Recent Advances in Fractional-N Frequency Synthesis ..... 233
   Michael Peter Kennedy

14 ADC/DSP-Based Receivers for High-Speed Serial Links ..... 247
   Samuel Palermo, Sebastian Hoyos, Shiva Kiran, Shengchang Cai, and Yuanming Zhu

15 ADC-Based SerDes Receiver for 112 Gb/s PAM4 Wireline Communication ..... 269
   Kevin Geary, James Hudner, Declan Carey, Ronan Casey, Kay Hearne, Marc Erett, Chi Fung Poon, Hongtao Zhang, Sai Lalith Chaitanya Ambatipudi, David Mahashin, Parag Upadhyaya, and Yohan Frans

16 BiCMOS-Integrated Circuits for Millimeter-Wave Wireless Backhaul Transmitters ..... 283
   Andrea Mazzanti, Lorenzo Iotti, Mahmoud Mahdipour Pirbazari, Farshad Piri, Elham Rahimi, and Francesco Svelto

17 Optical Communication in CMOS—Bringing New Opportunities to an Established Platform ..... 305
   Wouter Diels and Filip Tavernier

18 Coherent Silicon Photonic Links ..... 331
   Abdelrahman H. Ahmed, Alexander Rylyakov, and Sudip Shekhar

Index ..... 341

Part I

Analog Circuits for Machine Learning

The first part of this book discusses the use of mixed-signal architectures for the realization of integrated neural networks intended for machine learning. The first few chapters discuss mixed-signal architectures that aim to improve the energy efficiency of deep neural networks (DNNs), followed by chapters that discuss the use of mixed-signal architectures in the realization of spiking neural networks (SNNs).

In Chap. 1, Boris Murmann reviews the energy efficiency of fully digital implementations of DNNs, which is mainly limited by the energy required for (external) memory access and data movement. He then discusses the potential benefits of employing mixed analog/digital compute and memory fabrics. This should lead to DNNs with single-digit fJ/MAC energy efficiencies, at least for up to 4-bit matrix-vector arithmetic at the array level.

In-memory computing based on Resistive Random-Access Memory (RRAM) crossbars enables efficient and parallel vector-matrix multiplication. A weight matrix can be mapped onto the crossbar, and multiplication can be efficiently performed in the analog domain. In Chap. 2, Michael Flynn et al. discuss various RRAM crossbar implementations, the associated mixed-signal circuits, and their challenges. As a proof-of-concept, a fully integrated CMOS-RRAM coprocessor is presented.

In Chap. 3, Michelle Caselli et al. discuss the application of Spin-Orbit Torque (SOT) MRAM in neural networks. SOT-MRAM has many attractive properties: high endurance, fast write operations, and low-energy operation at logic-compatible supply voltages. However, its low on/off ratio poses a serious design challenge. To address this, the authors propose a compute array architecture based on pulse-width encoded inputs and a complementary pre-charge-discharge scheme for the summation lines, together with custom-designed summation line ADCs.

In Chap. 4, Shih-Chii Liu et al. discuss the use of analog circuits in the realization of energy-efficient DNNs. A brief history of such circuits is presented, which shows that many key operations used in DNNs can be efficiently implemented by analog subthreshold or charge-domain circuits. The chapter concludes with a discussion of the prospects for using analog circuits and emerging memory technologies to realize low-power deep network accelerators suitable for edge or tiny machine learning applications.

In Chap. 5, Francois Rummens et al. present SPIRIT, a chip that implements a mixed-signal spiking neural network (SNN). Its resistive synapses employ RRAM devices realized in the back-end of a 130 nm CMOS process. It achieves 84% classification accuracy on a set of handwritten digits, which is maintained after more than 750M spikes, attesting to the RRAM's read endurance. At synapse and neuron levels, SPIRIT consumes 3.6 pJ per synaptic event, which is significantly lower than similar chips using formal coding.

In Chap. 6, Johannes Schemmel et al. present the concepts behind the second-generation BrainScaleS (BSS-2) neuromorphic computing architecture, which aims to emulate biological neural networks. It consists of hundreds of analog neuromorphic accelerators and digital compute cores realized on a full wafer. The former emulate the spike-based dynamics of the neural network, while the latter simulate slower biological processes. In order to cope with analog process variation, a custom software toolbox has been developed, which facilitates complex calibrated Monte Carlo simulations.

Chapter 1

Mixed-Signal Compute and Memory Fabrics for Deep Neural Networks

Boris Murmann

Abstract Deep neural networks have emerged as important new drivers for custom VLSI computing hardware in resource-constrained environments. In this context, this chapter reviews asymptotic limits of purely digital implementations and enumerates the potential benefits of employing mixed analog/digital compute fabrics. Given that most deep neural networks are limited by memory access and data movement, the discussion emphasizes these aspects as lead-ins to memory-like and in-memory processing approaches. We show that single-digit fJ/MAC energy efficiencies are feasible for up to 4-bit mixed-signal matrix-vector arithmetic at the array level. However, more research is needed to strongly benefit from these efficient computations at the system level.

B. Murmann, Stanford University, Stanford, CA, USA; e-mail: [email protected]

1 Introduction

Data analytics is becoming an increasingly important aspect within our information technology infrastructure. The massive amount of unstructured data generated by sensors and other sources must be analyzed and interpreted to deduce meaningful actions and decisions. Examples include automatic speech-to-text translation, keyword spotting, and person detection. Most solutions in this space leverage machine learning techniques and thrive on the vast amounts of available data for algorithm training. While there is a nearly endless and growing variety in the algorithmic space, we focus here on deep neural networks (DNNs). More specifically, we discuss opportunities in mixed-signal (analog and digital) design for deep convolutional neural networks (CNNs) in resource-constrained applications. CNNs have proven to achieve the highest performance for a variety of problems and are particularly attractive for computer vision. For an introduction to the inner workings of CNNs, the reader may refer to [1].



Fig. 1.1 Elementary CNN operation. The dot product of a volume of input pixels and a weight tensor is formed to produce one output pixel. The operation slides across the input volume and K filters are used to fill the output volume

Figure 1.1 illustrates the core operation used in CNNs. The dot product of a volume of input pixels and a weight tensor is formed to produce one output pixel. For the first layer in computer vision, the C-dimension is 3 for RGB images. The output pixels (feature maps) are nonlinearly processed (typically rectification, not shown) and sent to subsequent layers for further analysis. State-of-the-art networks can contain tens of layers, use tens of millions of filter weights, and require tens of billions of multiply-add operations, making it challenging to run these algorithms within small chips and at low power dissipation. Training the weights is even more compute-intensive and lies beyond the scope of this chapter. In today's usage paradigm, the weights are typically trained offline (on a server) and uploaded to the CNN processor to perform classification tasks ("inference") with fixed weights.

While it is possible to run a CNN on generic CPUs, both the power dissipation and execution time are typically not acceptable. The next best options are graphics processors (GPUs), field programmable gate arrays (FPGAs), and their recent incarnation as adaptive compute acceleration platforms (ACAPs). These platforms perform better as they can exploit the "embarrassingly parallel" nature of CNNs and benefit from data reuse. This can be understood from Fig. 1.1. As the filters slide across the image, they don't need to be reloaded from memory and partial sums from a previous filter position may be reused for the computation of the next output pixel. For each filter position, a large number of multiply-adds can be performed in parallel using an array of compute cores.

Despite the success of GPUs, FPGAs, and ACAPs in deep learning applications, they typically do not achieve the ultimate energy efficiency demanded by mobile applications or small sensor platforms. For this reason, a large number of R&D efforts have been launched in search of further improvements. Most of these developments target custom digital processors. The reader is referred to [2] for an overview.

This chapter deals with the question of whether some form of analog or mixed-signal processing may be advantageous in full-custom CNN accelerator chips. To investigate, Sect. 2 begins by identifying a baseline for efficiency limits in typical digital accelerators. Next, Sect. 3 looks at mixed-signal approaches using "memory-like" arrays employing switched-capacitor circuits. Section 4 then discusses in-memory computing, specifically highlighting emerging Resistive Random-Access Memory (RRAM). Finally, Sect. 5 reviews some of the main take-homes and suggestions for future research.

2 Efficiency Limits of Digital DNN Accelerators

A typical realization of a digital DNN accelerator is shown in Fig. 1.2. Due to the large memory requirements, the weights are often stored in off-chip DRAM and an elaborate SRAM memory hierarchy handles intermediate buffering for data reuse. The core of the accelerator is typically a processing element (PE) array that handles the multiply-add operations. As shown in Fig. 1.2, the cost of memory access varies widely across the architecture, which makes it important to devise the best possible access schedule for each memory level. Also, note that multiply and add operations consume significantly less energy than accessing the large memory blocks in the system, pointing to data movement as the most significant issue in DNN accelerator design. In other words, simply designing a better multiplier won't lead to significant improvements.

To get a feel for what we would ideally like to achieve, consider a numerical example. Suppose that we have a network requiring ten billion operations (multiply and add) and that we want these operations to complete in 10 ms. This gives a computational load of 1 TOP/s. Assuming that we want to dissipate no more than 1 mW in this processor, we would be looking for a compute efficiency of 1000 TOP/s/W. Is this possible?
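This target can be sanity-checked with a few lines of arithmetic; the minimal sketch below simply restates the numbers of the example above (10 GOP workload, 10 ms latency, 1 mW power budget).

```python
# Back-of-the-envelope check of the efficiency target discussed above.
ops = 10e9         # multiply and add operations per inference (example workload)
latency = 10e-3    # s, allowed time per inference
power = 1e-3       # W, allowed power dissipation

throughput = ops / latency        # operations per second
efficiency = throughput / power   # operations per second per watt

print(f"throughput = {throughput / 1e12:.1f} TOP/s")    # 1.0 TOP/s
print(f"efficiency = {efficiency / 1e12:.0f} TOP/s/W")  # 1000 TOP/s/W
print(f"energy/op  = {1 / efficiency * 1e15:.1f} fJ")   # 1.0 fJ per operation
```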

Fig. 1.2 Digital DNN processor architecture with typical energy numbers for data movement and arithmetic (~28 nm CMOS). RF stands for register file


Fig. 1.3 Typical processing element [3]

To obtain a simple (and optimistic) bound, let us assume that the processor's memory hierarchy is highly efficient, so that the accelerator's PE array becomes the main bottleneck. Each PE contains a set of small register files and multiply accumulate (MAC) arithmetic (see Fig. 1.3). Assuming the numbers from Fig. 1.2 and 8-bit arithmetic and four memory accesses per MAC operation (read input, weight and partial sum, and write output), we find:

$$\frac{\text{Energy}}{\text{Operation}} = \frac{E_{\text{RF}} + E_{\text{MAC}}}{2} = \frac{4 \times 80\ \text{fJ} + 230\ \text{fJ}}{2} \approx 275\ \text{fJ} \tag{1.1}$$
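A short script reproduces Eq. (1.1) and the bounds quoted in the following paragraph; the 80 fJ register-file access and 230 fJ MAC energies are the example values taken from Fig. 1.2.

```python
# Energy per operation when the PE array is the bottleneck (Eq. 1.1).
E_RF_ACCESS = 80e-15   # J, one register-file access (example value from Fig. 1.2)
E_MAC = 230e-15        # J, one 8-bit multiply-accumulate
ACCESSES_PER_MAC = 4   # read input, weight, partial sum; write output

# One MAC counts as two operations (multiply + add), hence the division by 2.
E_op = (ACCESSES_PER_MAC * E_RF_ACCESS + E_MAC) / 2
print(f"{E_op * 1e15:.0f} fJ/op -> {1 / E_op / 1e12:.1f} TOP/s/W")  # 275 fJ/op, ~3.6 TOP/s/W

# Even with free arithmetic, register-file access alone caps the efficiency.
E_op_rf_only = ACCESSES_PER_MAC * E_RF_ACCESS / 2
print(f"RF-only bound: {1 / E_op_rf_only / 1e12:.2f} TOP/s/W")      # 6.25 TOP/s/W
```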

Note that this corresponds to only 3.6 TOP/s/W. Also, if we somehow managed to reduce the multiply-add energy to zero, we would still not be able to surpass 6.25 TOP/s/W due to register file access alone. To get anywhere near 1000 TOP/s/W (1 fJ per operation, or 2 fJ/MAC), we must eliminate or significantly reduce register file size and access. The interested reader may refer to [6] for an overview on the long list of tricks employed by digital designers. Among them, a significant idea is to exploit "sparsity." Due to the fact that the output feature maps typically pass through a rectifying linear unit (ReLU), negative values are mapped to zero, making it possible to skip a large fraction of MAC operations and RF accesses downstream. Also, aggressive quantization is being used to work with lower bit widths. It has been shown that representing weights and output activations using four bits tends to suffice even in highly demanding classification tasks [7]. All said and done, in combination with other strategies, this leads us to efficiency numbers in the tens of TOP/s/W for state-of-the-art digital DNN accelerator designs. Figure 1.4 shows the survey data that corroborates this finding, showing designs in the 10–100 TOP/s/W regime for custom digital DNN processors.

Fig. 1.4 DNN accelerator survey [8]. The points closest to the upper left corner are the mixed-signal designs of [4] (orange triangle) and [5] (orange circle)

In the context of benchmarking, it is worth mentioning that the compute efficiency in TOP/s/W is not a perfect metric and should be taken with a grain of salt. Authors sometimes include only parts of their system power numbers (e.g., DRAM access may be excluded), and the workload size and sparsity may differ vastly between the various data points. Also, a design with large TOP/s/W may actually be quite inefficient in absolute terms if it requires an excess number of operations to complete the classification task. Fair benchmarking of machine learning accelerators is still an underdeveloped topic, but the reader may refer to [9] for basic guidelines and considerations.

Regardless of the uncertainty surrounding the TOP/s/W efficiency metric, Fig. 1.4 shows that there are mixed-signal accelerator prototypes that look very competitive and seem to break into a territory that is not reachable by purely digital designs. Thus, we investigate this aspect next and examine the developments that have led to these promising data points.

3 Analog and Mixed-Signal Computing

There is a rich history of research that promotes the purely analog implementation of neural networks and other machine learning algorithms. Historically, this path has followed neuromorphic principles [10, 11], which build on our (very limited) understanding of the human brain and its "integrate and fire" neurons that are amenable to analog circuit implementation. While the resulting neurons represent an intriguing and biologically plausible emulation of the units found in the human brain, the networks constructed with them tend to lack scalability. It is difficult to array and cascade a large number of analog building blocks and deal with the accumulation of noise and component mismatch. Additionally, and perhaps more significantly, it is challenging to build analog memory cells [12]. For this reason, present explorations in neuromorphic design are dominated by digital emulations, such as IBM's TrueNorth processor [13].

Since purely analog implementations are difficult to scale, could one instead build a processor that uses digital storage and leverages analog/mixed-signal compute for potential efficiency gains? As shown in [14], mixed-signal computing can indeed be lower energy than digital for low resolutions, typically below 8 bits. The most straightforward way to exploit this would be to embed mixed-signal compute macros into the PE blocks of a mainly digital processor. This was considered in [15, 16], which point to the conclusion that the idea will in practice lead to diminishing returns. As discussed in the previous section, in a mostly digital processor with a PE array, large amounts of energy are spent on bringing the data to the compute elements. To fully harvest the benefits of mixed-signal processing, one should create an architecture that reduces this overhead substantially.

This was the goal for our work in [4], which uses a "weight-stationary" compute array (see Fig. 1.5). Each column of the array contains filter weights that are unrolled into a vector. The input activations are also unrolled into a vector and broadcast across the array from the left. The buffer memory access for these input data is amortized across many compute columns. Also, the partial products at each matrix location are not written back to memory but are accumulated along the column. This accumulation could be done with a digital adder tree, but the main feature of a mixed-signal implementation is to collect charges or currents for potential area and energy savings. A/D conversion occurs at the bottom of the columns and is amortized across the number of rows. To simplify further, our first prototype in [4] opted for a binarized neural network [17], leading to a 1-bit A/D interface and XNOR gates as the multipliers in each matrix location. The resulting unit cell is shown in Fig. 1.6. Its compactness allowed the on-chip integration of a 64 × 1024 array that computes 64 outputs in one shot at 772 1b-TOP/s/W (2.6 fJ/MAC).

Since this architecture can also be realized without analog accumulation, we designed an equivalent digital version for comparison purposes. Figure 1.7 shows the total array energy for this custom digital design along with that of the described mixed-signal approach. The latter shows an improvement of about 4.2x. However, at the chip level, the energy savings reduce to approximately 1.8x. The issue is that the compute array is not large enough to hold the weights of a single layer, let alone the entire neural network. The array is thus time multiplexed, and the weights are re-loaded from the SRAM buffer many times during an inference cycle. The associated memory access energy overhead is large and identical for both designs.
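As a behavioral sketch (not the circuit of [4]), one weight-stationary column of such a binarized array can be modeled in a few lines: every cell XNORs a ±1 input with its stored ±1 weight, the column accumulates the results, and a 1-bit A/D decision replaces the full sum. The array height and zero threshold below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1024                              # rows in one column (illustrative)

w = rng.choice([-1, +1], size=N)      # stationary binary weights stored in the column
x = rng.choice([-1, +1], size=N)      # broadcast binary input activations

# For {-1, +1} encoded bits, XNOR is simply the product; the column sums these
# partial products. In the mixed-signal array this sum appears as charge on the
# column, and a comparator (1-bit A/D) against a bias produces the output.
column_sum = int(np.sum(w * x))
bias = 0                              # illustrative threshold
output_bit = 1 if column_sum + bias >= 0 else 0
print(column_sum, output_bit)
```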


Fig. 1.5 Weight-stationary compute array with analog accumulation

Fig. 1.6 Unit element from [4]

The merit of going mixed-signal is limited because the penalty of using digital adders is relatively insignificant compared to the data movement overhead for a 1-bit fabric. This observation was reconfirmed by another digital BinaryNet design in 10 nm CMOS [18]. From here, there are at least two directions for improving the relative merit of mixed-signal arrays. The first option is to generally aim for much denser and larger arrays. This motivates the in-memory computing approaches discussed in Sect. 4. Another direction is to continue working with larger unit elements that can offer multi-bit compute capability that cannot be easily matched by a digital equivalent. We studied multi-bit switched-capacitor arrays in [19] and found that an energy level of ~5 fJ/MAC seems feasible for 4-bit weights and activations in 28 nm CMOS. While this result still needs silicon validation, we recently saw an equivalent digital array design in 22 nm CMOS that achieves 22 fJ/MAC for 4-bit weights and activations [20]. Thus, the jury is still out on whether it makes sense to compete with digital solutions using "memory-like" arrays with moderately dense unit elements.

Fig. 1.7 (a) Conventional processing element (PE) array vs. (b) memory-like mixed-signal array (single-bit implementation from [4])
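The appeal of the switched-capacitor approach is that a multi-bit multiply-accumulate reduces to passive charge collection. The idealized model below is only an illustration of that principle (all element values are assumptions, and it is not the circuit of [19]): each cell contributes a charge packet proportional to w·x, and the shared column node converts the accumulated charge into a voltage.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 256                                  # cells on one column (illustrative)
C_U = 1e-15                              # F, unit capacitor per cell (illustrative)
C_COL = 2e-12                            # F, fixed column integration capacitance (illustrative)
V_FS = 1.0                               # V, full-scale activation drive (illustrative)

w = rng.integers(0, 16, size=N)          # 4-bit weights select w * C_U of capacitance per cell
x = rng.integers(0, 16, size=N) / 15.0   # 4-bit activations mapped to a fraction of V_FS

# Each cell transfers a charge packet Q_i ~ w_i * C_U * (x_i * V_FS) onto the
# shared column node, so the collected charge tracks the dot product w.x.
Q = np.sum(w * C_U * x * V_FS)
V_col = Q / C_COL

print(f"dot(w, x) = {np.dot(w, x):.1f}")
print(f"column voltage ~ {V_col * 1e3:.1f} mV")
```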

4 In-Memory Computing

In-memory computing is a relatively old idea that aims to integrate memory cells and compute into a single dense fabric [21]. Conceptually, the unit element in Fig. 1.6 follows the same idea, but its size is relatively large, so that a denser piece of memory is still required in its periphery to store the weights and activations of a complete neural network. To overcome this issue, denser cells can be designed as illustrated in Fig. 1.8. The most obvious way to increase density is to realize the memory bit with a standard 6-T SRAM cell (see Fig. 1.8b) as explored in [5]. In addition, the logic can be simplified and single-ended signaling can be used to reduce the area further. While the differential "memory-like" cell measures 24,000 F² (where F is the half pitch of the process technology), the SRAM-based cell has an area of only 290 F². Further cell size reductions are possible by migrating to emerging memory technologies (see Fig. 1.8c), as discussed later.

Fig. 1.8 (a) Memory-like cell (from [4]), (b) in-memory computing cell based on SRAM (from [5]), (c) cell based on resistive memory technology

At present, SRAM-based in-memory computing is receiving significant attention in the research community [22] and many circuit and network architecture options are being explored. In [23], a complete processor with SRAM-based in-memory computing is presented (see Fig. 1.9). Owing to the density of SRAM-based cells, the compute array is significantly denser than in [4]. However, due to the simple unit element structure, multi-bit operations must be serialized, generating overhead in the number of required A/D conversions. This design achieves 121 4b-TOP/s/W (16.5 fJ/MAC), defining the current state of the art for complete demonstrators employing mixed-signal computing.

Fig. 1.9 Compute-in-memory core used in [23]

In search of even higher compute density and the potential to store all weights on chip, a wide variety of emerging memory technologies are currently under investigation. For instance, RRAM technology promises to deliver densities that are comparable to DRAM, while being non-volatile and potentially offering multilevel storage. This could open up a future where relatively large machine learning models (>10 MB) can be stored on a single chip to eliminate external DRAM access. In addition, these memory types are compatible with in-memory computing by exploiting current summation on the bitlines [24]. While there are many possible ways to incorporate emerging nonvolatile memory into a machine learning processor, one attractive option is a streaming (pipelined) topology as shown in Fig. 1.10. Here, large in-memory compute tiles are pipelined between small SRAM line buffers that hold only the current input working set [25].

Fig. 1.10 Streaming architecture for neural network processing with RRAM

To form a complete processor, many such tiles can potentially be integrated on a single chip, as described in [26] (see Fig. 1.11). In this architecture, all weights are stationary and the data slides through the fabric with massively parallel computation.

Fig. 1.11 Complete processor with massively parallel in-memory computing [26]

At present, the art of designing machine learning processors using emerging memory is still in its infancy. Key issues include access to process technology as well as challenges with the relatively poor retention and endurance of emerging memory technology (see e.g., [27]). Consequently, most existing chip demonstrators are only sub-systems and/or use relatively small arrays (see e.g., [28, 29]). However, one important aspect that has already become clear from these investigations is that the D/A and A/D interfaces required at the array boundaries can be a significant showstopper. For example, a state-of-the-art ADC consumes about 1 pJ per conversion at approximately 4–8 bits of resolution [30]. If amortized across 100 memory rows, the ADC energy overhead is 10 fJ per MAC operation, which will essentially "destroy" the array's efficiency. This issue is particularly visible in designs that activate even fewer rows per computation (e.g., nine rows in [29]). The solution is to work on innovations that will allow for larger arrays and massively parallel row activation.
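The ADC-amortization argument is easy to quantify. The sketch below uses the 1 pJ/conversion figure quoted above and shows how the per-MAC overhead shrinks as more rows share a single conversion; the row counts are illustrative, with 9 rows corresponding to the example of [29].

```python
# ADC energy overhead per MAC when one conversion is amortized over the active rows.
E_ADC = 1e-12                        # J per conversion (state-of-the-art 4-8 bit ADC, from the text)

for rows in (9, 100, 1000, 10000):   # number of rows sharing one conversion (illustrative)
    overhead = E_ADC / rows
    print(f"{rows:>5} active rows -> {overhead * 1e15:6.1f} fJ of ADC energy per MAC")

# At 100 rows the overhead is already 10 fJ/MAC, which swamps a single-digit
# fJ/MAC compute array; only much taller arrays make the conversion cost negligible.
```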



Fig. 1.12 Dynamic bitline readout approach. The actual implementation is differential (using two cells per element) to enable pos/neg outputs

Our work in [31] takes a step in this direction and explores what might be possible (using SPICE simulations in 40 nm CMOS). Instead of using a current-mode TIA readout, we opt for a fully dynamic scheme as shown in Fig. 1.12. The bitlines are precharged and discharge according to the dot product of the inputs and the weight conductances. At the sample time T, the bitline voltage is digitized using a 2-bit ramp ADC. The DAC block within these converters is integrated and shared within the cell array (see Fig. 1.13), so as to provide a replica with similar nonlinearity as the cells used for the dot product. The output code is found by detecting the zero crossings on each differential bitline pair. For this implementation with 2-bit inputs, 2-bit outputs, and 5-level weights in each RRAM cell, our SPICE simulations predict efficiencies ranging from 300 to 1200 TOP/s/W for array heights between ~3000 and 13,000 cells. These results include benefits from sparsity-aware processing. Specifically, if the ramp ADCs trip in early conversion cycles, further bitline precharges are skipped, leading to an energy benefit of about 2.5x. For additional details and assumptions that lead to these numbers, the reader is referred to [31].
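A first-order model helps connect the dynamic readout to the dot product: the bitline is precharged to VDD and droops at a rate set by the sum of the input-weighted cell conductances, so the voltage sampled at time T encodes Σ x_i·g_i before a coarse ramp ADC digitizes it. All element values below are assumptions for illustration, and the model ignores the cell nonlinearity that the shared ramp replica is meant to track.

```python
import numpy as np

rng = np.random.default_rng(2)

VDD = 0.9          # V, precharge voltage (illustrative)
C_BL = 200e-15     # F, bitline capacitance (illustrative)
T = 1e-9           # s, discharge time before the sample instant (illustrative)
ROWS = 128         # active rows (illustrative; far fewer than in [31])

x = rng.integers(0, 4, size=ROWS) / 3.0                  # 2-bit inputs as fractional drive
g = rng.choice(np.linspace(0.1e-6, 2e-6, 5), size=ROWS)  # 5-level cell conductances, S

# First-order (small-discharge) model: the precharged bitline droops by an amount
# proportional to the input/conductance dot product integrated over T.
dV = (T / C_BL) * VDD * np.sum(x * g)
V_sample = max(VDD - dV, 0.0)

# A coarse (here 2-bit) ramp ADC then maps the sampled voltage to an output code.
code = int(np.digitize(V_sample, bins=[0.225, 0.45, 0.675]))
print(f"dot(x, g) = {np.sum(x * g) * 1e6:.2f} uS, V_sample = {V_sample:.3f} V, code = {code}")
```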


Fig. 1.13 Column design for a 3 × 3 × 128 filter patch. The ramp cells are used to drive the differential bitline voltage to zero

5 Discussion and Conclusions

Using mixed-signal compute arrays is a promising paradigm for machine learning processors as it can help reduce data movement and leverage energy-efficient analog accumulation. SRAM-based implementations are close to being commercialized today and are showing promising improvements over purely digital processors. However, to achieve the "ultimate" compute and memory density (~10–100 MB on a small chip, without external DRAM), alternative memory technologies are needed. As RRAM and similar technologies are now becoming readily available, this provides a fertile ground for research in mixed-signal accelerator design.

While circuit and array-level efficiency metrics are important, the designer should not lose sight of system-level benchmarks [9]. More work is needed to assess and mitigate flexibility limitations imposed by large mixed-signal compute arrays (e.g., due to under-utilization and sparsity). On the other hand, it may be possible to discover neural network architectures that map particularly well onto large mixed-signal arrays via hardware-aware neural architecture search (NAS). This direction is being actively pursued in the digital space (see e.g., [32]), but remains largely unexplored for mixed-signal designs.


References

1. I. Goodfellow, Y. Bengio, A. Courville, Deep Learning (MIT Press, Cambridge, 2016)
2. V. Sze, Y.-H. Chen, T.-J. Yang, J.S. Emer, Efficient processing of deep neural networks: A tutorial and survey. Proc. IEEE 105(12), 2295–2329 (2017). https://doi.org/10.1109/JPROC.2017.2761740
3. Y.-H. Chen, T. Krishna, J.S. Emer, V. Sze, Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE J. Solid State Circuits 52(1), 127–138 (2017). https://doi.org/10.1109/JSSC.2016.2616357
4. D. Bankman, L. Yang, B. Moons, M. Verhelst, B. Murmann, An always-on 3.8 μJ/86% CIFAR-10 mixed-signal binary CNN processor with all memory on chip in 28-nm CMOS. IEEE J. Solid State Circuits 54(1), 158–172 (2019). https://doi.org/10.1109/JSSC.2018.2869150
5. H. Valavi, P.J. Ramadge, E. Nestler, N. Verma, A 64-tile 2.4-Mb in-memory-computing CNN accelerator employing charge-domain compute. IEEE J. Solid State Circuits 54(6), 1789–1799 (2019). https://doi.org/10.1109/JSSC.2019.2899730
6. M. Verhelst, B. Moons, Embedded deep neural network processing: Algorithmic and processor techniques bring deep learning to IoT and edge devices. IEEE Solid-State Circuits Mag. 9(4), 55–65 (2017). https://doi.org/10.1109/MSSC.2017.2745818
7. S.K. Esser, J.L. McKinstry, D. Bablani, R. Appuswamy, D.S. Modha, Learned step size quantization (2019). [Online]. https://arxiv.org/abs/1902.08153. Accessed 1 Mar 2020
8. K. Guo, W. Li, K. Zhong, Z. Zhu, S. Zeng, S. Han, Y. Xie, P. Debacker, M. Verhelst, Y. Wang, Neural network accelerator comparison [Online]. Available: https://nicsefc.ee.tsinghua.edu.cn/projects/neural-network-accelerator/
9. G.W. Burr, S.H. Lim, B. Murmann, R. Venkatesan, M. Verhelst, Fair and comprehensive benchmarking of machine learning processing chips. IEEE Des. Test (2021). https://doi.org/10.1109/MDAT.2021.3063366
10. C. Mead, Neuromorphic electronic systems. Proc. IEEE 78(10), 1629–1636 (1990). https://doi.org/10.1109/5.58356
11. R.A. Nawrocki, R.M. Voyles, S.E. Shaheen, A mini review of neuromorphic architectures and implementations. IEEE Trans. Electron Devices 63(10), 3819–3829 (2016). https://doi.org/10.1109/TED.2016.2598413
12. E.A. Vittoz, Future of analog in the VLSI environment. Proc. ISCAS, 1372–1375 (1990). https://doi.org/10.1109/ISCAS.1990.112386
13. P.A. Merolla et al., Artificial brains. A million spiking-neuron integrated circuit with a scalable communication network and interface. Science 345(6197), 668–673 (2014). https://doi.org/10.1126/science.1254642
14. B. Murmann, D. Bankman, E. Chai, D. Miyashita, L. Yang, Mixed-signal circuits for embedded machine-learning applications, 1341–1345 (2016). https://doi.org/10.1109/ACSSC.2015.7421361
15. D. Bankman, B. Murmann, An 8-bit, 16 input, 3.2 pJ/op switched-capacitor dot product circuit in 28-nm FDSOI CMOS, in Proc. IEEE Asian Solid-State Circuits Conf., Toyama, Japan, Nov. 2016, pp. 21–24. https://doi.org/10.1109/ASSCC.2016.7844125
16. A.S. Rekhi et al., Analog/mixed-signal hardware error modeling for deep learning inference, in 2019 56th ACM/IEEE Design Automation Conference (DAC) (2019), pp. 1–6. https://doi.org/10.1145/3316781.3317770
17. M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, Y. Bengio, Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or −1 (2016). [Online]. https://arxiv.org/abs/1602.02830. Accessed 27 Feb 2020
18. P. Knag, A 617 TOPS/W all digital binary neural network accelerator in 10 nm FinFET CMOS, in Symp. VLSI Circuits Dig. (2020), pp. 1–2
19. B. Murmann, Mixed-signal computing for deep neural network inference. IEEE Trans. Very Large Scale Integr. Syst. 29(1), 3–13 (2021). https://doi.org/10.1109/TVLSI.2020.3020286


20. Y.-D. Chih et al., An 89TOPS/W and 16.3 TOPS/mm² all-digital SRAM-based full-precision compute-in-memory macro in 22 nm for machine-learning edge applications. Digest of Technical Papers—IEEE International Solid-State Circuits Conference 64, 252–254 (2021). https://doi.org/10.1109/ISSCC42613.2021.9365766
21. W.H. Kautz, Cellular logic-in-memory arrays. IEEE Trans. Comput. C-18(8), 719–727 (1969). https://doi.org/10.1109/T-C.1969.222754
22. N. Verma et al., In-memory computing: Advances and prospects. IEEE Solid-State Circuits Mag. 11(3), 43–55 (2019). https://doi.org/10.1109/MSSC.2019.2922889
23. H. Jia et al., A programmable neural-network inference accelerator based on scalable in-memory computing. Digest of Technical Papers—IEEE International Solid-State Circuits Conference 64, 236–238 (2021). https://doi.org/10.1109/ISSCC42613.2021.9365788
24. H. Tsai, S. Ambrogio, P. Narayanan, R.M. Shelby, G.W. Burr, Recent progress in analog memory-based accelerators for deep learning. J. Phys. D. Appl. Phys. 51(28), 283001 (2018). https://doi.org/10.1088/1361-6463/aac8a5
25. A. Shafiee et al., ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars, in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA) (2016), pp. 14–26. https://doi.org/10.1109/ISCA.2016.12
26. M. Dazzi, A. Sebastian, P.A. Francese, T. Parnell, L. Benini, E. Eleftheriou, 5 parallel prism: A topology for pipelined implementations of convolutional neural networks using computational memory (2019). [Online]. http://arxiv.org/abs/1906.03474. Accessed 1 Mar 2020
27. Y.-H. Lin et al., Performance impacts of analog ReRAM non-ideality on neuromorphic computing. IEEE Trans. Electron Devices 66(3), 1289–1295 (2019). https://doi.org/10.1109/TED.2019.2894273
28. S. Yin, X. Sun, S. Yu, J. Seo, High-throughput in-memory computing for binary deep neural networks with monolithically integrated RRAM and 90 nm CMOS (2019). [Online]. http://arxiv.org/abs/1909.07514. Accessed 1 Mar 2020
29. C.-X. Xue et al., 24.1 A 1Mb multibit ReRAM computing-in-memory macro with 14.6 ns parallel MAC computing time for CNN based AI edge processors, in 2019 IEEE International Solid-State Circuits Conference (ISSCC) (2019), pp. 388–390. https://doi.org/10.1109/ISSCC.2019.8662395
30. B. Murmann, ADC performance survey 1997–2020. http://web.stanford.edu/~murmann/adcsurvey.html
31. D. Bankman, J. Messner, A. Gural, B. Murmann, RRAM-based in-memory computing for embedded deep neural networks, in 2019 53rd Asilomar Conference on Signals, Systems, and Computers (2019), pp. 1511–1515. https://doi.org/10.1109/IEEECONF44664.2019.9048704
32. L. Yang et al., Co-exploration of neural architectures and heterogeneous ASIC accelerator designs targeting multiple tasks (2020). [Online]. https://arxiv.org/abs/2002.04116. Accessed 11 Jul 2020

Chapter 2

Analog Computation with RRAM and Supporting Circuits

Justin M. Correll, Seung Hwan Lee, Fuxi Cai, Vishishtha Bothra, Yong Lim, Zhengya Zhang, Wei D. Lu, and Michael P. Flynn

Abstract In-memory computing on RRAM crossbars enables efficient and parallel vector-matrix multiplication. The neural network weight matrix is mapped onto the crossbar, and multiplication is performed in the analog domain. This article discusses different RRAM crossbar implementations, the associated mixed-signal circuits, and their challenges. As a proof-of-concept, a fully integrated CMOS-RRAM coprocessor is presented.

J. M. Correll, Z. Zhang, W. D. Lu, M. P. Flynn, Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI, USA; e-mail: [email protected]
S. H. Lee, Intel, Albuquerque, NM, USA
F. Cai, Applied Materials, Santa Clara, CA, USA
V. Bothra, Apple, Cupertino, CA, USA
Y. Lim, Samsung Electronics, Yongin, South Korea

1 Introduction

The explosion in AI and machine learning is driving the need for efficient matrix operations. In particular, convolutional neural networks (CNNs) depend on large-scale vector-matrix multiplication (VMM). GPUs are an excellent choice for this task because they are much more efficient than CPUs. Nevertheless, GPU systems are still energy-intensive and, therefore, prohibitive for energy-constrained applications. As AI becomes a more prominent component of computing, it is vital to improve its energy efficiency. Energy efficiency is particularly important for energy-constrained edge devices. The slowdown in scaling and the end of Moore's Law make it urgent to find new approaches.

One of the problems of conventional approaches is that they are based on the von Neumann architecture and therefore suffer from the von Neumann memory bottleneck. Even if the compute is very efficient, moving multiplication coefficients from memory to the compute units itself consumes significant energy. This coefficient movement also takes time and therefore introduces latency. Compute-in-memory systems such as those based on resistive crossbars are attractive because they avoid the von Neumann bottleneck.

Machine learning can be divided into training and inference. Accurate learning is critical, and therefore training requires high numerical accuracy, making it expensive in terms of hardware resources. However, in most cases, training can be performed in advance so that the requirements for inference dominate in practical applications. Performing inference at edge devices close to the sensor interface is essential to reduce communication needs and to improve security. In speech recognition and image recognition, coefficients are learned in advance and deployed to the edge devices. These edge devices are often highly energy-constrained, so it is vital to reduce the energy needed for inference. Fortuitously, we can exploit the lower accuracy needed for inference to reduce power consumption significantly. This low accuracy requirement favors analog computation and has driven a resurgence in analog computing.

2 Analog Crossbar Computation

Analog crossbar arrays are biologically inspired structures that model the massive connectivity of the human brain and show promise for energy-efficient VMM. Each of the crossbar array columns mimics a single neuron with many parallel synaptic connections and forms the basis for neural-network matrix multiplication. In the conventional von Neumann architecture, matrix multiplication is a serial process and requires significant time and energy to move the matrix coefficients to and from external memory. Analog crossbar computation with RRAM and other non-volatile emerging devices promises to break the computing bottleneck in machine-learning applications. Crossbar operation exploits simple physics for very efficient computation. Further, the in-memory nature of this computation removes the memory access bottleneck. As shown in Fig. 2.1, a crossbar is an array of row and column conductors with a resistor connecting the row and column at every row-column intersection. The conductances of these resistors form the multiplicand matrix. The crossbar exploits Ohm's Law and Kirchhoff's Current Law for multiply-and-accumulate matrix multiplication operations. We consider the case where the input vector is a set of voltages applied to the rows while the output is current from the columns. Virtual grounds connected to each column collect the column currents.

Fig. 2.1 Analog crossbar matrix

Fig. 2.2 Analog vector-matrix multiplication (a) and matrix-vector multiplication (b) with mixed-signal circuits

The output current of each column is the vector-vector dot product of the input voltages and the crossbar conductances for the column, as shown in Fig. 2.2a. If a voltage vector $x_i$ is applied to the rows, then the resulting current flowing through the conductance $w_{i,j}$ at intersection $(i, j)$ is:

$$ I_{ij} = x_i\, w_{ij} \qquad (2.1) $$

The total column current, $I_j$, represents the vector-vector dot product of the input vector with the crossbar column $W_j$:

$$ I_j = \sum_i x_i W_{ij} = x^T W_j \qquad (2.2) $$


By collecting currents from the columns in parallel, the VMM output is obtained in a single step. Additionally, a crossbar can perform the transpose operation, or backpropagation, by simply applying an input vector to the columns and measuring the current flowing from the rows to the virtual ground (Fig. 2.2b). In this case, the total row current $I_i$ represents the dot product of the input voltage vector $x_j$ with the crossbar row $W_i$:

$$ I_i = \sum_j W_{ij} x_j = W_i\, x \qquad (2.3) $$
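As a rough numerical illustration of Eqs. (2.1)-(2.3), the short sketch below models an ideal crossbar in software: the conductance matrix plays the role of the weights, the forward pass collects column currents, and the transpose pass collects row currents. All values are arbitrary placeholders, and all devices and virtual grounds are assumed ideal.

```python
import numpy as np

# Minimal numerical model of Eqs. (2.1)-(2.3): ideal crossbar, ideal virtual grounds.
# Rows carry the input voltages x (V); W holds the cell conductances (S).
rng = np.random.default_rng(0)
W = rng.uniform(1e-6, 1e-4, size=(4, 3))   # 4 rows x 3 columns of conductances
x = rng.uniform(0.0, 0.3, size=4)          # read voltages applied to the rows

# Forward VMM (Eq. 2.2): each column current is the dot product of x with that column.
I_cols = x @ W            # one current per column, all collected in parallel

# Transpose / backpropagation (Eq. 2.3): drive the columns, read the row currents.
v_cols = rng.uniform(0.0, 0.3, size=3)
I_rows = W @ v_cols       # one current per row

print(I_cols, I_rows)
```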

3 Challenges of Crossbar Operation

3.1 Device Nonlinearity

Before discussing some of the different devices that can be used for crossbar operation, we first review some of the challenges of crossbar operation. The first challenge is that practical crossbar devices, themselves, tend to be highly nonlinear. In other words, doubling the voltage across the device does not lead to doubling the current. Two-point operation is an effective way to circumvent this problem. In this approach, the analog input is encoded as a pulse-width-modulated (PWM) signal, and either a fixed read voltage or 0 V is applied across the device. The column output of the crossbar is integrated over the maximum period of the PWM signal. An advantage of this approach is that it enables excellent multiplication linearity even with nonlinear crossbar devices. However, there are significant challenges. The PWM nature means that operation takes much longer, in the simplest case 2^N clock cycles, where N is the bit-width of the input vector. Second, integration of the column output current is required to determine the analog multiplicand.
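The sketch below illustrates, with a toy quadratic I-V nonlinearity invented for the example, why two-point (PWM) operation linearizes the multiplication: the device always sees the same fixed read voltage, so the integrated charge scales strictly with the on-time, whereas amplitude encoding inherits the device nonlinearity. All component values are assumptions.

```python
# Toy illustration of two-point (PWM) operation with a nonlinear device.
# Assumed device model: I(V) = G*V + k*V**2 (any fixed nonlinearity behaves the same way).
G, k, V_READ = 1e-5, 2e-4, 0.2

def device_current(v):
    return G * v + k * v * v

def charge_amplitude_encoded(x):                 # input encoded in the voltage amplitude
    return device_current(x * V_READ) * 1e-6     # fixed 1 us integration window

def charge_pwm_encoded(x):                       # input encoded in the pulse duration
    return device_current(V_READ) * (x * 1e-6)   # fixed V_READ, on-time = x * 1 us

for x in (0.25, 0.5, 1.0):
    print(x, charge_amplitude_encoded(x), charge_pwm_encoded(x))
# The PWM charge doubles exactly when x doubles; the amplitude-encoded charge does not.
```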

3.2 Mixed-Signal Peripheral Circuitry

Another challenge is that effective analog crossbar operation requires extensive analog support circuitry. As discussed, PWM operation enables linear multiplication with nonlinear memory devices. The generation of the PWM waveforms is relatively straightforward, as it can be based on synchronous digital circuitry. A drawback of the PWM operation is that the output current must be integrated, and this may require a relatively large capacitance. The requirement for a virtual ground is also a challenge since low-impedance wide-bandwidth virtual ground circuits (e.g., a Transimpedance Amplifier or TIA) can consume significant power. Digitization is also a challenge, as maximum crossbar throughput requires a dedicated ADC per column. For many applications, a relatively low-resolution ADC is sufficient (e.g.,


5–8 bits); nevertheless, even low-resolution ADCs can be challenging to place in the narrow memory pitch. Finally, many memory devices require a high voltage for programming—this necessitates the use of high voltage FETs and level-shifters, both of which consume a large area.

4 Non-volatile Crossbar Synapses

4.1 Flash

Flash memory is one of the best-established non-volatile memory technologies and can be effective for building a crossbar (Fig. 2.3). The threshold voltage of the flash device is programmed to set a crossbar weight. In one approach [1], the flash device functions as a programmable current source. A PWM signal drives the gate, and the duration of this PWM signal represents the input. The output current flows at this weight value during the on-time of the PWM signal. The integrated output current represents the product of the weight and the PWM duration. This current-output operation assumes that the MOSFET is operating in the saturation region. The output currents from the entire column can be collected by a virtual ground and then integrated and digitized. The authors in [1] avoid the complexity and energy cost of the virtual ground by integrating the current on the capacitance of the column line. This column capacitance is initially pre-charged and then discharged by the currents of the flash transistors. A low-resolution voltage-mode ADC digitizes the final column voltage.

Fig. 2.3 Flash crossbar array

Fig. 2.4 Comparison of key performance metrics for different resistive crossbar technologies in VMM operations

The nonlinear capacitances and the finite

output resistance of the memory devices limit the accuracy; nevertheless, this approach is efficient for low accuracy operation.
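A minimal behavioural sketch of this column-capacitance readout is given below. The precharge voltage, capacitance, cell currents, and ADC resolution are assumed values chosen only to show the charge bookkeeping, not figures from [1].

```python
# Sketch of the column readout described above: precharge the column capacitance,
# let each flash cell sink its programmed current for its PWM on-time, digitize V.
# Assumes an ideal (linear) column capacitor and constant (saturated) cell currents.
V_PRE, C_COL = 1.0, 200e-15          # assumed precharge voltage and column capacitance

def column_voltage(weights_uA, on_times_ns):
    q = sum(w * 1e-6 * t * 1e-9 for w, t in zip(weights_uA, on_times_ns))
    return V_PRE - q / C_COL         # final voltage encodes sum(weight * input)

def low_res_adc(v, bits=4, v_fs=1.0):
    return int(max(0.0, min(v, v_fs)) / v_fs * (2**bits - 1))

v = column_voltage(weights_uA=[2.0, 0.5, 1.0], on_times_ns=[40, 80, 20])
print(v, low_res_adc(v))
```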

4.2 Filamentary Resistive-RAM

Although flash is a well-established memory technology, flash devices are relatively large. Furthermore, flash transistors require a large programming voltage, limiting the adoption of flash in the most advanced technology nodes. For practical applications, the crossbar resistive element should be both very compact and programmable. These size and programmability requirements have prompted research into emerging memory technologies such as Phase Change Memory (PCM), spin-transfer torque RAM (STT-RAM), and Resistive RAM (RRAM). Figure 2.4 compares different technologies. Filamentary RRAM is a metal-insulator-metal structure that has been widely studied for application in crossbar arrays. The memristive synapse is formed by precisely controlling the defects in metal-oxide materials such as HfOx, WOx, and TiOx. Such analog RRAM devices are CMOS compatible and easily fabricated in scalable, highly dense crossbar arrays that support ultra-low-power compute operations. Crossbar arrays are either passive or active, depending on the bitcell composition, and each type of bitcell presents a unique set of benefits and tradeoffs.

Fig. 2.5 Passive RRAM crossbar array

Passive 1R crossbar arrays are formed with an RRAM synapse at each array crosspoint (Fig. 2.5). In this approach, the physical size of the array is determined by the effective RRAM area of 4F², where F is the minimum feature size. The main advantage of passive arrays is their superior compute density. Furthermore, 3D integration is possible and shows promise as a path towards implementing large-scale neural networks with several million parameters. Another advantage is that the passive array structure naturally supports VMM and backpropagation, as discussed in Sect. 2. A key disadvantage of passive arrays is the sneak-path currents that arise from the inherent parallel current paths present in the array. These parasitic paths cause erroneous currents that corrupt weight storage during programming and degrade matrix multiplication accuracy during read operations. Both conditions ultimately lead to degradation in network performance or classification accuracy. Common approaches to mitigate the sneak-path problem include half-voltage write-protect programming schemes [2, 3] and intentionally fabricating RRAM devices with enhanced nonlinearity [4]. A circuit-level solution to the sneak-path problem adds a selector device in series with the RRAM bitcell. This 1T1R bitcell solves the sneak-path problem by gating the current through the RRAM with an enable signal. This solution, of course, increases the complexity of the periphery circuitry shown in Fig. 2.6 and also increases the area of the bitcell, which in turn drastically reduces the compute density. Generally, the selector device is implemented with a high-voltage IO transistor to support the high voltages and currents associated with programming. For metal-oxide RRAM, the choice of 1R or 1T1R has a critical impact on device-to-device I-V variation. Filament formation in RRAM is an abrupt process and difficult to control. In passive arrays, "forming-free" RRAM devices are used, and the device variation is controlled in the fabrication process. The authors in [5] discuss the challenges and paths forward to implementing passive arrays with state-of-the-art device variation and tuning accuracy. In active 1T1R arrays, the virgin RRAM devices experience a forming step to initiate the oxide filament.
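The severity of the sneak-path problem can be illustrated with a back-of-the-envelope model of a 2 × 2 passive array, sketched below. The resistance values and the assumption that unselected lines float are purely illustrative; the point is that the parasitic series path adds directly to the measured conductance of the selected cell.

```python
# Sneak-path error in a 2x2 passive (1R) array, reading cell (0,0).
# With row 1 and column 1 left floating, the only extra path from the driven row to
# the sensed column goes through cells (0,1), (1,1) and (1,0) in series.
R = [[1e6, 1e6],     # target cell (0,0) and its row neighbour, ohms (assumed values)
     [1e5, 1e5]]     # the other row: two low-resistance cells

G_target = 1.0 / R[0][0]
G_sneak = 1.0 / (R[0][1] + R[1][1] + R[1][0])   # three devices in series
G_measured = G_target + G_sneak

print("relative read error: %.1f %%" % (100 * G_sneak / G_target))
# A series selector (1T1R) or a half-voltage protection scheme suppresses this
# parasitic path, at the cost of bitcell area and periphery complexity.
```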

Fig. 2.6 Active 1T1R crossbar array

A risk in the forming step is that the abrupt formation of the filament can lead to a high conductance state that is unresponsive to device potentiation, or "stuck." These high-current cells significantly impact matrix multiplication accuracy. The selector transistor ensures current compliance, preventing excessive currents that can damage the RRAM. The I-V characteristics of the selector transistor limit the current through the RRAM device. Moreover, the transistor gate is used to precisely control the current flow during programming to overcome device variation during synaptic weight storage.

5 Digital RRAM Crossbar

RRAM is an up-and-coming digital non-volatile memory technology. The very compact bitcell size makes RRAM ideal for storing the digital weights for digital VMM. An RRAM array holds the weights in the simplest case, while conventional CMOS logic performs the vector multiplication itself. Typically, the RRAM memory is configured as a crossbar structure. A single row is activated to access the memory, and then sense amplifiers connected to the bit lines determine the digital output. The small area is a significant advantage and allows the weight memory to be close to the multipliers. Digital RRAM is a relatively mature technology and is now commercially available. Table 2.1 summarizes a few representative examples from different providers [6].

Table 2.1 Summary of commercially available digital RRAM technologies [6]

5.1 Analog Operation with Digital RRAM Cells

A hybrid analog-digital crossbar structure applies single-bit RRAM elements in an analog fashion. In this mode, the RRAM is programmed to be either high resistance or low resistance. The two-state RRAM operation is relatively easy to implement and considered to be more reliable [14]. In the simplest case, there are single-bit weights and a vector of single-bit voltage inputs. Simple digital circuits can generate the single-bit row voltages. A low-impedance virtual ground terminates the bit lines. An ADC digitizes the current flowing into the virtual ground to provide the digital output of the vector multiplication. With single-bit weights and single-bit inputs, the ADC can be low resolution. The use of multiple digital RRAM cells can facilitate multiple-valued weights. One approach uses a separate bit line for each bit of the weight resolution. For example, in Fig. 2.7, two parallel bit lines are combined for two-bit synapse weights. The current outputs of these different bit lines are weighted and summed. This weighting and summing can be done in the digital domain [14] or in the analog domain [15]. Higher input resolutions are enabled by processing the digital input words serially, one bit at a time, and weighting the bit line outputs. Because of its complexity, this approach is limited to small weight and input resolutions, typically two bits.

Fig. 2.7 Digital RRAM pseudo two-bit synapse
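The following sketch reproduces the arithmetic of this pseudo multi-bit scheme in software: one binary column per weight bit, bit-serial single-bit inputs, and a shift-and-add combination of the column sums. It is an idealized functional model, not a description of the circuits in [14] or [15].

```python
import numpy as np

# Pseudo multi-bit operation with single-bit RRAM cells (cf. Fig. 2.7):
# one crossbar column per weight bit, inputs applied bit-serially.
rng = np.random.default_rng(1)
n_rows, w_bits, x_bits = 8, 2, 2
weights = rng.integers(0, 2**w_bits, size=n_rows)         # 2-bit weights
inputs = rng.integers(0, 2**x_bits, size=n_rows)          # 2-bit inputs

# Store each weight bit in its own binary column (LSB column, LSB+1 column, ...).
bit_cols = [(weights >> b) & 1 for b in range(w_bits)]

acc = 0
for xb in range(x_bits):                                   # bit-serial inputs
    x_bit = (inputs >> xb) & 1                              # single-bit row drive
    for wb, col in enumerate(bit_cols):
        col_sum = int(np.dot(x_bit, col))                   # what the column ADC sees
        acc += col_sum << (xb + wb)                          # shift-and-add weighting

print(acc, int(np.dot(inputs, weights)))                    # identical results
```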

6 Analog RRAM Crossbar

An analog crossbar structure utilizes multi-bit RRAM at every crosspoint. The RRAM is programmed to multi-level conductance states. Each RRAM bitcell is analogous to one weight value in a weight matrix. This allows for a more direct mapping of the VMM or MVM weight matrix onto the analog crossbar (Fig. 2.2). The compute occurs in place and in parallel, thus improving the compute density over the digital RRAM approach. Full analog operation requires specialized mixed-signal hardware for programming and matrix multiplication. The degree of parallelization depends upon the area of the mixed-signal DACs, ADCs, and column multiplexers. Furthermore, many challenges exist in the commercial fabrication and integration of reliable multi-bit RRAM.

6.1 Analog Operation with Analog RRAM Cells

Analog RRAM is programmed to multi-level conductance states with dedicated high-voltage DACs. Fixed-amplitude positive and negative pulse trains applied to the RRAM bitcell increase and decrease the conductance, as shown in Fig. 2.8.

Fig. 2.8 Long-term potentiation and long-term depression of analog RRAM

After each programming step, the conductance value is determined by reading the selected bitcell. To perform multiplication, DACs apply voltage inputs to the analog crossbar rows. The resulting column currents are collected at low-impedance virtual grounds and digitized with moderate-to-high resolution ADCs. Multiplication and accumulation are performed along each column of the weight matrix in parallel, without the need for extra crossbar columns or processing steps.
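A behavioural version of this program-and-verify flow is sketched below. The conductance step size, bounds, and noise are invented placeholders; the structure, alternating pulses with a read-verify until the target window is reached, is the point of the example.

```python
import random

# Behavioural write-verify loop: apply SET/RESET pulses until the read-back
# conductance lands inside a tolerance band around the target.
G_MIN, G_MAX, STEP = 2e-6, 60e-6, 1.5e-6       # assumed device model, not real data

def apply_pulse(g, polarity):
    g += polarity * STEP * random.uniform(0.6, 1.4)   # noisy, non-ideal update
    return min(max(g, G_MIN), G_MAX)

def program(g, g_target, tol=0.5e-6, max_pulses=200):
    for n in range(max_pulses):
        g_read = g                                     # read-verify step
        if abs(g_read - g_target) <= tol:
            return g, n
        g = apply_pulse(g, +1 if g_read < g_target else -1)
    return g, max_pulses

random.seed(0)
print(program(g=5e-6, g_target=30e-6))
```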

6.2 Fully Integrated CMOS-RRAM Analog Crossbar

The work in [16, 17] presents a fully integrated analog RRAM crossbar coprocessor that addresses many of the challenges associated with crossbar operation described in Sect. 3. A passive 54 × 108 WOx-based RRAM crossbar performs matrix multiplication in the analog domain. The system includes the passive crossbar, 486 DACs, 162 ADCs, a mixed-signal timing interface, and a RISC processor on a single die. Each row and column of the crossbar has dedicated hardware for programming and matrix multiplication for maximum throughput. The fully reprogrammable system supports forward and backward propagation. The on-chip RISC processor controls write and read operations, making the system standalone.

6.2.1 RRAM Programming

Each row and column features high-voltage and low-voltage write hardware to program the RRAM bitcells. This hardware enables bi-directional programming (SET and RESET) without the need for a negative voltage supply. A high-voltage write DAC is connected to the selected crossbar row to increase bitcell conductance, while a low-voltage write DAC connects to the selected crossbar column. All unselected rows and columns are clamped to a protective half-voltage. Both DACs are pulsed to potentiate the RRAM bitcell with high-voltage “positive” pulses. The


process is reversed to decrease the bitcell conductance. After each programming step, the bitcell conductance is read using a read-verify step. The system flexibly combines RRAM programming, vector-matrix products (forward propagation), and matrix-vector products (backward propagation). The coprocessor supports online learning rules such as gradient-descent [18, 19] and winner-take-all [19].

6.2.2 RRAM Nonlinearity

The crossbar operates in the charge domain instead of the voltage and current domains to address RRAM nonlinearity. The input vector is a pulse-modulated signal, and the column currents are integrated at high-quality virtual grounds terminating each column. Pulse inputs are more accurate than more straightforward PWM inputs for charge-domain operation because the errors due to finite rise and fall times are linear with pulse operation. Two-level linear time-domain RTZ DACs apply pulses to the rows (or columns). A two-stage regulator terminates the columns (or rows) with a robust virtual ground. High-resolution charge-domain ADCs integrate and digitize the resulting currents. The active virtual ground circuit improves robustness by providing programmable current division to accommodate the resistance variation of different RRAM recipes. The 13-bit hybrid ADC consists of a 5-bit first-order incremental for coarse conversion followed by a 9-bit SAR ADC for fine digitization (Fig. 2.9). Although high-resolution is unnecessary for matrix multiplication, high-precision is desirable for weight storage or online learning.

Fig. 2.9 Schematic of the 13-bit charge-domain ADC [16, 17]
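The digital combination in such a two-step conversion can be illustrated with the idealized model below. For simplicity, the fine stage is taken as 8 bits so the codes concatenate directly into 13 bits; the actual converter pairs the 5-bit incremental stage with a 9-bit SAR whose extra range is absorbed by digital correction, which this toy model omits.

```python
# Idealized two-step conversion: coarse quantizer + fine quantizer on the residue.
COARSE_BITS, FINE_BITS = 5, 8

def two_step_adc(x):
    """x is the normalized input in [0, 1); returns a 13-bit output code."""
    x = min(max(x, 0.0), 1.0 - 1e-12)
    coarse = int(x * 2**COARSE_BITS)              # first-stage decision
    residue = x * 2**COARSE_BITS - coarse         # what the coarse stage left over
    fine = int(residue * 2**FINE_BITS)            # second-stage decision
    return (coarse << FINE_BITS) | fine

for x in (0.1, 0.5, 0.9):
    code = two_step_adc(x)
    print(x, code, code / 2**(COARSE_BITS + FINE_BITS))   # reconstruction ~ x
```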


Fig. 2.10 Die photograph of the 180 nm prototype [17]

6.2.3 CMOS Prototype

The prototype is fabricated in 180 nm CMOS and occupies 62 mm2 (Fig. 2.10). This area includes 486 DACs, 162 ADCs, the RISC processor, 64 KB of SRAM, and the RRAM crossbar fabricated on the surface of the die. A single channel occupies 0.13 mm2 and includes three DACs and one ADC. The 54 × 108 RRAM crossbar is fabricated in a post-processing step.

6.2.4 Measurement Setup

The prototype is wire-bonded to a 391-pin package for testing and measurement. A custom printed circuit board provides analog and digital supplies and off-chip biasing for the mixed-signal hardware. An Opal Kelly XEM7001 FPGA board loads compiled C code into the on-chip dual-port SRAM. The developed C code consists of high-level algorithmic instructions as well as the supporting low-level array configuration instructions.

6.2.5 Single-Layer Perceptron Example

A single-layer perceptron (SLP) network is implemented to verify the operation of the integrated chip. Network training and classification are performed using a subset of five Greek letter patterns represented as 5 × 5 binary images. The SLP has 26 inputs which correspond to the 25 pixels in the image plus a bias term. There are five output classes. The input and output neurons are fully connected with 130 synaptic

weights (Fig. 2.11a).

Fig. 2.11 Single-layer perceptron example

The neuron with the highest output is identified by the softmax activation function and is used to classify the corresponding class. Figure 2.11b shows the SLP implementation. The original binary input patterns are converted to voltage pulses, which are applied to the RRAM sub-array with the 600 mV read DACs. Where a white pixel is present, a voltage pulse is applied to the corresponding row. Black pixels correspond to the absence of an input pulse or the common-mode voltage. The bias term is treated as a white pixel that is applied as an extra input. All of the input pulses are of the same duration and amplitude. Each synaptic weight $w_{ij}$ is implemented with two RRAM devices representing a positive and a negative weight, $G^{+}_{ij}$ and $G^{-}_{ij}$, using positive RRAM conductance values. Online learning is performed, and the synaptic weights are updated during training using the batch gradient-descent rule:

$$ \Delta w_{ij} = \eta \sum_{n=1}^{N} \left( t_j^{(n)} - y_j^{(n)} \right) x_i^{(n)} \qquad (2.4) $$

where $x^{(n)}$ is the nth training sample of the input dataset, $y^{(n)}$ is the network output, $t^{(n)}$ is the corresponding label, and η is the learning rate. The update value $\Delta w_{ij}$ for the ith element in the jth class is then implemented in the RRAMs by applying programming pulses with the write DACs, using a pulse width proportional to the desired weight change with 6-bit precision.
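A compact software sketch of this training rule is shown below, combining Eq. (2.4) with the differential G+/G- weight mapping described above. The conductance range, readout gain, learning rate, and the 6-bit quantization of the update are all placeholder assumptions.

```python
import numpy as np

# Batch gradient-descent update of Eq. (2.4) with differential conductance weights.
rng = np.random.default_rng(2)
n_in, n_out, eta = 26, 5, 0.01
G_pos = rng.uniform(10e-6, 20e-6, size=(n_in, n_out))   # placeholder conductances (S)
G_neg = rng.uniform(10e-6, 20e-6, size=(n_in, n_out))

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def train_batch(X, T):
    Y = softmax(X @ (G_pos - G_neg) * 1e5)               # 1e5 = placeholder readout gain
    dW = eta * X.T @ (T - Y)                              # Eq. (2.4), summed over the batch
    dW = np.round(dW * 63) / 63                           # 6-bit pulse-width quantization
    np.add(G_pos, np.maximum(dW, 0) * 1e-6, out=G_pos)    # positive updates go to G+
    np.add(G_neg, np.maximum(-dW, 0) * 1e-6, out=G_neg)   # negative updates go to G-

X = rng.integers(0, 2, size=(16, n_in)).astype(float)     # binary input patterns + bias
T = np.eye(n_out)[rng.integers(0, n_out, size=16)]        # one-hot labels
train_batch(X, T)
```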


The SLP is mapped onto a 26 × 10 sub-array of the crossbar. The SLP is trained and tested with noisy 5 × 5 Greek letter patterns for five distinct classes, including "Ω" and "M." To create the data set, one of the 25 pixels of the original images is flipped to generate 25 noisy images for each Greek letter. Together with the original image, a set of 26 images for each Greek letter is formed. The training set consists of 16 randomly selected images from each class. The remaining 10 images from each class form the test set. The SLP achieves 100% classification accuracy for both the training and testing sets. Figure 2.11c shows the evolution of the output neuron signals during training, averaged over all training patterns for a specific class. The winning neuron is clearly separated from the other neurons and is strengthened during training, verifying the online learning rule.
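The data-set construction just described can be reproduced with a few lines of code, sketched below. The base patterns are random stand-ins, since the actual Greek-letter bitmaps are not reproduced here; only the one-pixel-flip augmentation and the 16/10 split follow the text.

```python
import numpy as np

# Build the noisy data set described above: 25 one-pixel-flip variants per base
# image + the original = 26 images per class, split 16 train / 10 test.
rng = np.random.default_rng(3)
base_images = rng.integers(0, 2, size=(5, 25))      # 5 stand-in 5x5 binary patterns

def expand(img):
    variants = [img.copy()]
    for p in range(25):
        noisy = img.copy()
        noisy[p] ^= 1                                # flip exactly one pixel
        variants.append(noisy)
    return np.array(variants)                        # 26 images for this class

train, test = [], []
for label, img in enumerate(base_images):
    imgs = expand(img)
    order = rng.permutation(26)
    train += [(imgs[i], label) for i in order[:16]]
    test += [(imgs[i], label) for i in order[16:]]

print(len(train), len(test))                         # 80 training / 50 test images
```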

6.2.6 System Performance

At the 148 MHz maximum operating frequency, the mixed-signal core, including read DACs and ADCs, consumes 64.4 mW while performing matrix multiplications. The maximum theoretical throughput is 9.87 M vector-matrix multiplications per second. The mixed-signal energy efficiency is 144 nJ per matrix multiplication or 25 pJ per operation. The measured ADC shorted-input noise is 0.23 codes. The RISC processor and 54 × 108 passive crossbar consume 235 mW and 7 mW, respectively.

7 Conclusions

Matrix multiplication with analog RRAM crossbars has the potential to overcome the von Neumann computation bottleneck. Passive 1R arrays are amenable to 3D integration and are compact enough to facilitate networks with millions of synapses. A drawback of passive 1R arrays is that they suffer from device I-V variation and sneak-path currents. Active 1T1R arrays solve these issues but are larger due to the addition of selector transistors. Multi-level conductance operation of RRAM devices increases the compute density when compared to binary RRAM devices. In all of these cases, mixed-signal circuits are vital to enable pseudo-VMM or full analog VMM. The array structure and choice of RRAM device determine the system architecture.

References

1. S. Cosemans et al., Towards 10000TOPS/W DNN inference with analog in-memory computing – a circuit blueprint, device options and requirements, in IEEE Int. Electron Devices Meeting (IEDM) (San Francisco, CA, USA, 2019), pp. 22.2.1–22.2.4
2. S. Kim, J. Zhou, W.D. Lu, Crossbar RRAM arrays: Selector device requirements during write operation. IEEE Trans. Electron Devices 61(8), 2820–2826 (2019)


3. J. Zhou, K. Kim, W.D. Lu, Crossbar RRAM arrays: Selector device requirements during read operation. IEEE Trans. Electron Devices 61(5), 1369–1376 (2014)
4. M.A. Zidan et al., Memristor-based memory: The sneak paths problem and solutions. Microelectron. J. 44(2), 176–183 (2013)
5. H. Kim, H. Nili, M.R. Mahmoodi, D.B. Strukov, 4K-memristor analog-grade passive crossbar circuit, arXiv:1906.12045 (2019)
6. S.H. Lee, X. Zhu, W.D. Lu, Nanoscale resistive switching devices for memory and computing applications, in Nano Res. (2020)
7. H.D. Lee et al., Integration of 4F2 selector-less crossbar array 2Mb ReRAM based on transition metal oxides for high density memory applications, in Symposium on VLSI Technology (VLSIT) (Honolulu, HI, USA, 2012), pp. 151–152
8. M. Hsieh et al., Ultra high density 3D via RRAM in pure 28nm CMOS process, in IEEE Int. Electron Devices Meeting (IEDM) (Washington, DC, USA, 2013), pp. 10.3.1–10.3.4
9. T. Liu et al., A 130.7 mm2 2-layer 32 Gb ReRAM memory device in 24 nm technology, in IEEE Int. Solid-State Circuits Conf. (ISSCC) (San Francisco, CA, USA, 2013), pp. 210–211
10. J. Zahurak et al., Process integration of a 27 nm, 16 Gb Cu ReRAM, in IEEE Int. Electron Devices Meeting (IEDM) (San Francisco, CA, USA, 2014), pp. 6.2.1–6.2.4
11. R. Fackenthal et al., A 16 Gb ReRAM with 200 MB/s write and 1 GB/s read in 27 nm technology, in IEEE Int. Solid-State Circuits Conf. (ISSCC) (San Francisco, CA, USA, 2014), pp. 338–339
12. S.H. Jo, T. Kumar, S. Narayanan, W.D. Lu, H. Nazarian, 3D-stackable crossbar resistive memory based on field assisted superlinear threshold (FAST) selector, in IEEE Int. Electron Devices Meeting (San Francisco, CA, USA, 2014), pp. 6.7.1–6.7.4
13. S.H. Jo, T. Kumar, S. Narayanan, H. Nazarian, Cross-point resistive RAM based on field-assisted superlinear threshold selector. IEEE Trans. Electron Devices 62(11), 3477–3481 (2015)
14. L. Ni et al., Distributed in-memory computing on binary RRAM crossbar. ACM J. Emerg. Technol. Comput. Syst. 13(3), 1–18 (2017)
15. C. Xue et al., A 1 Mb multibit ReRAM computing-in-memory macro with 14.6 ns parallel MAC computing time for CNN based AI edge processors, in IEEE Int. Solid-State Circuits Conf. (ISSCC) (San Francisco, CA, USA, 2019), pp. 388–390
16. F. Cai et al., A fully integrated reprogrammable memristor-CMOS system for efficient multiply-accumulate operations. Nat. Electron. 2(7), 290–299 (2019)
17. J.M. Correll et al., A fully integrated reprogrammable CMOS-RRAM compute-in-memory coprocessor for neuromorphic applications. IEEE J. Exploratory Solid-State Comput. Dev. Circuits 6(1), 36–44 (2020)
18. M.V. Nair, P. Dudek, Gradient-descent-based learning in memristive crossbar arrays, in Int. Joint Conference on Neural Networks (Killarney, Ireland, 2015), p. 1
19. P.M. Sheridan, C. Du, W.D. Lu, Feature extraction using memristor networks. IEEE Trans. Neural Networks and Learning Systems 27(11), 2327–2336 (2015)

Chapter 3

Analog In-Memory Computing with SOT-MRAM: Architecture and Circuit Challenges

M. Caselli, J. Doevenspeck, S. Cosemans, I. A. Papistas, A. Mallik, P. Debacker, and D. Verkest

M. Caselli: KU Leuven, Leuven, Belgium, and Imec, Leuven, Belgium
J. Doevenspeck · S. Cosemans · I. A. Papistas · A. Mallik · P. Debacker · D. Verkest: Imec, Leuven, Belgium
e-mail: [email protected]

Abstract Analog In-Memory Computing (AiMC) has recently emerged as a promising approach to enable the implementation of highly computation-intensive Deep Neural Networks (DNNs) for AI/ML applications in edge devices. The use of new memory technologies will enable drastic improvements in energy efficiency and compute density, but different memory technologies come with specific circuit challenges. In this work, we discuss the case of Spin-Orbit Torque (SOT) MRAM, which has many attractive properties: high endurance, fast write operations, and low-energy operation at logic-compatible supply voltages. Crucially, it is possible to realize SOT-MRAM devices with high resistances (>1 MΩ) and low resistance variation, which enables the realization of large AiMC arrays. The key circuit challenge for SOT-MRAM is its low on/off ratio. To address this challenge, we present a compute array architecture based on pulse-width encoded inputs and a complementary precharge-discharge scheme for the summation lines, and we discuss the effect of the low on/off ratio on the topology and design of the summation line ADCs.

1 Introduction

Deep neural networks (DNNs) provide state-of-the-art performance in a wide variety of AI/ML applications, from image classification to speech recognition [1]. A convolutional neural network (CNN) consists of several convolutional layers (CONV) used for feature extraction, followed by fully connected layers (FC) for classification, as shown in Fig. 3.1.

Fig. 3.1 Convolutional neural network for image classification


The fundamental mathematical operation of these two layers is the matrix-vector multiplication (MVM):

$$ y = \sum_{i=0}^{N} act_i \cdot w_i, \qquad (3.1) $$

where w are the weights obtained from training and act are the input features (or activations) computed by a previous layer. In the inference phase, when the network is used to perform its task, a massive amount of matrix-vector multiplications must be computed. Due to its energy efficient processing, Analog In-Memory Computing (AiMC) has recently emerged as a valuable option for these tasks. With this approach, the MVM is computed in analog fashion directly inside the memory array, by means of computing memory cells, used to store the weights. Each layer of the network is mapped in hardware onto a crossbar array, capable of storing the weights obtained from training in memory cells. The activations are propagated on rows called activation lines (Fig. 3.2). Therefore, each array column, called a summation line, computes a vector dot-product y. One of the possible implementations of a crossbar array is the resistive element array shown in Fig. 3.3, in which each weight is stored as a programmable conductance G. The following sections discuss the use of spin-orbit torque (SOT) magnetoresistive RAM (MRAM) as a resistive memory element for AiMC. In the proposed system, digital activations are converted by a digital-to-analog converter (DAC) into pulse-width modulated voltage pulses [2]. For the analog computation, each column, comprising two complementary summation lines SP -SN , is precharged to the voltage supply, and then discharged by the MVM operation. The analog voltage difference between the summation lines YA represents the result of the operation. The digital output YD is obtained from YA by an analog-to-digital converter (ADC). The flow of computation is described by the conceptual diagram in Fig. 3.4.


Fig. 3.2 Crossbar array mapping

Fig. 3.3 Array column with complementary summation lines in precharge/discharge scheme

Fig. 3.4 Conceptual diagram of the resistive array operation. From CNN MVM operation to digital ADC output

2 Resistive Element Array

A schematic view of a single resistive memory element is shown in Fig. 3.5. In this conceptual scheme, the weight is stored as the conductance of a programmable resistor, and the input activation is the voltage Vact,i applied to the input activation line. With V0 the voltage applied to the summation line yi, the voltage over the resistor, Vact,i − V0, is proportional to the activation, and the cell current contribution Ii,j is related to the dot-product acti · wi,j. The exact MVM result would require a constant V0 along the summation line. Nonetheless, a partial deviation from the ideal AiMC result (Fig. 3.5) can be tolerated by the network, if the nonlinearity of the array is included in the computational model used for the network training [3].


Fig. 3.5 Resistive element computation and nonlinear transfer function

Fig. 3.6 Resistive array with non-idealities

In the array structure of Fig. 3.6, non-idealities impact the computation accuracy and the design of the periphery circuits. Indeed, in a full crossbar structure, the MVM result is obtained by summing, with Kirchhoff's law, the currents of every memristor involved in the analog operation. For a good linearity of the transfer function, the voltage drop (IR-drop) on the summation lines, generated by the currents running on the wire resistance RWIRE, must be small. During the computing phase, a DC current also runs through each activation line. Therefore, the IR-drop must be small in this section as well, to avoid hard constraints on the DAC implementation [4]. A SPICE simulation of a resistive array column, with all memory cells in the on-state and all cell weights in the low resistance state RLRS, is shown in Fig. 3.7 for different values of RLRS. From the obtained results, to limit the impact of the IR-drop, RLRS should be much larger than RWIRE. The simulation data pattern is a worst case,


Fig. 3.7 IR-drop effect on a 1024-cell summation line. RWIRE = 1 Ω/cell, all weights = LRS, all cells active

nonetheless the array shows a significant voltage drop along the summation line even with RLRS = 1 MΩ = 10^6 × RWIRE per cell. The IR-drop grows with the square of the number of rows N, and it depends on the weight and activation patterns. Therefore, if included in the array model, this non-ideality can be compensated during training to some extent. Since the analog computation physically requires the discharging of a capacitor, the value of RLRS is also relevant during the transient phase. With pulse-width activations, several levels must be encoded in the time scale set by the time constant τ(RLRS CP), where CP is the parasitic capacitance per memory cell. Therefore, the AiMC operation cannot be too fast if multibit activations are used. An example is shown in the graph in Fig. 3.8, where the time required to discharge a summation line to 50% of the precharged value has been computed for different values of RLRS. With RLRS = 100 kΩ, CP = 0.1 fF/cell, and a realistic activation pattern, only 30 ps are required for the summation line discharge. From the graph, RLRS values in the [1–10] MΩ range seem suitable for neural network applications. Indeed, such RLRS values allow multibit pulse-width encoded activations and a reasonable computation time. Summarizing, the deviation of the current sunk by the resistive memory cell from its ideal value, due to IR-drops, affects the classification accuracy and must be limited. This requires, in hardware, increasing the ratio RLRS/RWIRE, allowing in software an effective compensation of the remaining non-ideality. Moreover, the

computation with large RLRS memory elements is beneficial also for the dynamic operations with pulse-width encoded activations.

Fig. 3.8 Time to discharge summation line vs LRS value of the computing cell
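A lumped-RC estimate of this 50% discharge time is sketched below. The activation fraction is an assumption (the text only mentions a "realistic activation pattern"), so the absolute numbers are indicative; the trend with RLRS is the point.

```python
import math

# Lumped-RC estimate of the time to discharge a summation line to 50 % of the
# precharge value. N_ACTIVE is an assumed "realistic" activation fraction.
N_CELLS, C_PER_CELL = 1024, 0.1e-15          # cells per line, parasitic cap per cell
N_ACTIVE = 256                                # assumption: ~25 % of the cells pull down

def t50(r_lrs):
    r_eff = r_lrs / N_ACTIVE                  # active LRS cells discharge in parallel
    c_line = N_CELLS * C_PER_CELL             # every cell loads the line
    return math.log(2.0) * r_eff * c_line

for r in (100e3, 1e6, 10e6):
    print("RLRS = %.0e ohm -> t50 = %.1f ps" % (r, t50(r) * 1e12))
# Larger RLRS slows the discharge, which is what leaves room for multibit
# pulse-width encoded activations.
```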

3 SOT MRAM Memory Element

In the previous section, it was explained why large resistance values are beneficial in a resistive memory array for analog computation. However, several other parameters must be taken into account when evaluating an AiMC memory element. Low variation (σ/μ) of the resistance values in the different states is required to achieve good classification accuracy. Good endurance and data retention are expected from a memory cell for neural network applications. A small computing cell area is necessary to enable deep learning on edge devices, and the capability to compute at core logic-compatible voltages drastically simplifies the accelerator architecture [6]. Table 3.1 reports the main performance of several AiMC memory elements. Among them, the Spin-Orbit Torque (SOT) MRAM emerges as a promising alternative for analog computing due to a good overall performance on the main metrics, and some additional features, like the low-voltage read and write operations. The physical structure of the Magneto-resistive RAM is based on a Magnetic Tunnel Junction (MTJ). Here, two magnetic layers, the reference layer (RL) with


Table 3.1 Performance comparison of memory cells for AiMC

Performance        | Target specs | ReRAM [5]      | SRAM (a)   | STT MRAM | SOT MRAM
Resistance levels  | High         | Low            | High       | Low      | High
Variation          | Low          | Very high (b)  | Low        | Medium   | Medium
Endurance          | High         | Medium         | High       | High     | High
Retention          | High         | Low            | –          | High     | High
Non-volatile       | Yes          | Yes            | No         | Yes      | Yes
Area               | Small        | Small          | Very large | Small    | Large

(a) With long-L readout device
(b) When operating at sufficiently low current level

Fig. 3.9 Binary magneto-resistive RAM

static magnetic orientation, and the free layer (FL), where the orientation can be switched, are separated by a dielectric layer. This intermediate layer, made of Magnesium Oxide (MgO), acts as a tunneling barrier, and the value of the resistance can be tuned by modifying its thickness. In MRAM memory, the information is stored in the relative orientation of the two magnetic layers. When they have parallel orientation, there is a higher tunneling probability. This corresponds to the low resistance state value RP ≡ RLRS. When they have anti-parallel orientation, there is low tunneling probability, and the memory is in the high resistance state RAP ≡ RHRS. With a single pillar structure, the MRAM is a binary memory (Fig. 3.9). The physical structure of the SOT MRAM derives from the spin-transfer torque (STT) MRAM. The STT MRAM is a two-terminal device (top electrode TE and bottom electrode BE), with a common read and write path. For memory writing, a


relatively large current is required through the MTJ; therefore, a high RLRS would require very high voltages across the MTJ. To avoid damage to the junction and to ensure fast read-out for cache memory applications, in STT memories the RLRS is usually limited to tens of kΩ, making this memory element not suitable for AiMC [7]. The SOT MRAM is instead a three-terminal device, with a top electrode TE and two bottom electrodes BE1-BE2, connected by a heavy-metal layer, the SOT track (Fig. 3.10). The two bottom electrodes allow the separation of the write path from the read path, which corresponds to the MVM operation in AiMC. In writing, the current flows in the SOT track, from BE1 to BE2, and it is converted into a spin current running in the MTJ thanks to the SOT interaction. The structure allows the sizing of large RLRS values, suitable for AiMC, since a low voltage can be applied in writing. During the read/MVM phase, the current flows in the MTJ from BE1 to TE, due to the applied voltage. SOT MRAM devices in the literature achieve RLRS on the order of MΩ [8]. SOT devices with RLRS in the [1–60] MΩ range can be realized by operating on the critical dimension (CD), as shown in Fig. 3.11. The resistive memory must also fulfill the requirement of low resistance variation, and an acceptable value of σP/μP < 10% can be obtained for CD = 60 nm. A large tunnel magnetoresistance, TMR = (RHRS − RLRS)/RLRS, expressing the read window between the two states, is fundamental for AiMC applications. The baseline is TMR = 150%, corresponding to an on/off ratio of 2.5.

Fig. 3.10 SOT device with write and read paths separated. On the right: SOT TEM cross-section

4 SOT MRAM-Based Cell for AiMC

The binary SOT memories are used in the MVM array to store the weights derived from network training. In terms of bits of resolution, neural networks with low precision activations and weights can provide good accuracy, close to the floating


Fig. 3.11 Resistance value and normalized variation in SOT device for different CD

Fig. 3.12 Ternary weight with complementary scheme, and circuit implementation with SOT devices

point. The reduction in the number of MAC operations increases the processing efficiency. Ternary weights can be exploited for many DNN applications, as demonstrated in [9] for various networks and datasets, and they can be used in an array structure with SOT devices. Indeed, a ternary weight, with weight values −1, 0, +1, can be constructed in the analog domain with two binary memory cells in a differential structure, as shown in Fig. 3.12. The SOT cell for analog computing requires two selectors per SOT device: one for writing and one for the analog computation. During the writing phase, the summation lines are driven high, the writing switches of the row are enabled, and the wbl lines are driven at the voltage required for writing. In the MVM phase, the MOS


switches SWP-SWN are driven with the same pulse-width encoded activation, and the summation lines discharge according to the stored weight. A conceptual transient graph of an MVM operation on a single column, with the precharge-discharge scheme and complementary summation lines, is shown in Fig. 3.13. Here, the summation lines are precharged to VDD, then they are left floating, and finally the SOT compute cells discharge the lines, due to the pulse-width tP applied at the bottom switches. Each cell contributes an amount of charge Q related to the activation pulse-width times the stored weight. The MVM routine is concluded when the analog differential voltage YA = VSP − VSN is converted to digital by the ADC.

Fig. 3.13 MVM operation with precharge-discharge scheme and complementary summation lines

5 MVM Result in SOT Array

The analog output voltage YA generated by each column in the MVM array is converted into digital with an A/D converter. Since the ADC operative range is tailored to the signal range, a large VRMAX allows relaxed specifications for the ADC. The maximum output effective voltage range VRMAX can be related to the array parameters:

$$ V_{R\,MAX} = 2 \cdot V_{pch} \cdot \left( e^{-t_{P\,MAX}/(R_{HRS} C_P)} - e^{-t_{P\,MAX}/(R_{LRS} C_P)} \right), \qquad (3.2) $$

where Vpch is the summation-line precharge voltage and CP is the computing cell parasitic capacitance. In Eq. 3.2, all the cells on the negative summation line SN are assumed in LRS, and all the cells on SP in HRS. Moreover, the analysis considers RLRS ≫ RWIRE, making the IR-drop negligible and hence assuming that all cells contribute equally along the summation lines. All the activations are at the maximum value, which corresponds to the DAC largest pulse-width tP_MAX.
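Equation (3.2) can be evaluated directly, as in the sketch below. The cell capacitance and maximum pulse width are placeholder values; the sweep simply shows that VRMAX peaks at an RLRS that depends on the on/off ratio α = RHRS/RLRS introduced next.

```python
import math

# Eq. (3.2): maximum differential output range of a complementary summation-line pair.
C_P, T_P_MAX = 0.1e-15, 0.4e-9        # placeholder cell capacitance and max pulse width

def v_rmax(v_pch, r_lrs, alpha):
    r_hrs = alpha * r_lrs
    return 2.0 * v_pch * (math.exp(-T_P_MAX / (r_hrs * C_P))
                          - math.exp(-T_P_MAX / (r_lrs * C_P)))

for r_lrs in (1e6, 2.5e6, 10e6):
    for alpha in (2.5, 5.0):
        print("RLRS = %.1e, alpha = %.1f -> VRMAX/Vpch = %.2f"
              % (r_lrs, alpha, v_rmax(1.0, r_lrs, alpha)))
# With these placeholder values the range peaks around RLRS = 2.5 Mohm for alpha = 2.5,
# and a larger alpha always enlarges the usable output range.
```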


Fig. 3.14 Array output voltage range vs RLRS and α, normalized at Vpch . Maximum theoretical VRMAX _N = 2, with α infinite

A key parameter for the evaluation of the SOT cell performance is introduced:

$$ \alpha = \frac{R_{HRS}}{R_{LRS}} = TMR + 1. \qquad (3.3) $$

Factor α, like the TMR, should be maximized to enlarge the read window between the two resistance states. MATLAB simulation results of VRMAX with respect to RLRS and α, normalized to Vpch, are shown in the graph of Fig. 3.14. Here, different α values correspond to different optimum values of RLRS that maximize VRMAX. Moreover, a large α = RHRS/RLRS and a large TMR are beneficial for a large VRMAX. Currently, the maximum value of α in the literature is below 3 [10]. Therefore, an MVM array with SOT computing cells is capable of generating an output signal range a bit larger than 30% of that provided by an ideal device with an infinite α factor. Figure 3.15 reports MATLAB simulation results of VRMAX with respect to the DAC largest pulse-width tP_MAX. Even if a maximum of VRMAX can be identified, the sizing of the best tP_MAX is directly related to the time constant τ(RLRS CP), and it also has to take into account the effects on other sections of the accelerator. Indeed, a trade-off for tP_MAX exists: too large a value slows down the computation, and too small a value makes the hardware implementation of multilevel input activations difficult. In SOT MVM arrays, VRMAX is limited to hundreds of mV, and it corresponds to the full-scale voltage VFS of the ADC, if no amplification is provided.


Fig. 3.15 Array output voltage range vs activation pulse-width tP , normalized at Vpch . α = 2.5 and RLRS = 2.5 MOhm are here considered

In low-quantized DNN layers, the output data resolution B is usually in the [4–8] bit range. Therefore, despite the optimum sizing of VRMAX, a small ADC minimum voltage step can be expected:

$$ V_{LSB} = \frac{V_{FS}(\alpha, R_{LRS}, t_P, C_P)}{2^{B}}. \qquad (3.4) $$
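For orientation, the sketch below evaluates Eq. (3.4) for a few plausible full-scale voltages and resolutions; the values are illustrative only, but they show how quickly VLSB falls into the low-millivolt range.

```python
# Eq. (3.4): ADC LSB for the (small) full-scale range delivered by the SOT array.
for v_fs in (0.2, 0.4):              # assumed VFS values, hundreds of mV
    for bits in (4, 6, 8):           # output resolution range quoted for DNN layers
        v_lsb = v_fs / 2**bits
        print("VFS = %.1f V, B = %d bits -> VLSB = %.2f mV" % (v_fs, bits, v_lsb * 1e3))
```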

The shrinking of VFS and VLSB , compared with an MVM array with ideal computing devices, has a remarkable impact on the ADC specifications, and it is analyzed in the next section.

6 Impact of LSB Size on ADC Design

For AiMC applications, SAR and Flash ADC architectures are the most commonly proposed solutions in the literature. The former is energy efficient with a small area and a conversion speed suitable for the application. Flash ADCs become competitive at very small resolutions, where the parallel architecture can be fully exploited, keeping the area small with limited energy consumption [11]. The SAR converter is composed of three main sub-blocks: the comparator, which operates as the decision-making circuit; the digital-to-analog converter (DAC), used in feedback to modify the input voltage at every conversion step; and a digital section for the SAR algorithm control and the output code storage.


Fig. 3.16 Schematic view of CS SAR ADC

In the following, the impact of shrinking VFS and VLSB , with N constant, on a charge sharing (CS) SAR ADC topology, with capacitive DAC (Fig. 3.16) is evaluated.

6.1 LSB Shrinking on CS SAR DAC

In the CS SAR topology, the capacitive DAC is made of a binary-scaled capacitor bank of 2^N unit capacitors CU, connected to the input terminals of the ADC. The bottom plate of each capacitor of the bank is driven by the control logic, based on the last comparator decision, when the corresponding bit must be evaluated. Since the area and the energy consumption of the DAC grow linearly with the capacitance, CU must be sized as small as possible. The minimum value of the unit capacitance is obtained from the DAC specifications of thermal noise or mismatch standard deviation σDAC. For good SNDR and ENOB, the thermal noise must be much smaller than the quantization noise, $v^2_{th-DAC} \ll v^2_q$, and the sizing equation is:

Cu
> LRS resistance can be assumed, this arrangement could permit performing a signed analog sum of individual reading currents to obtain a synaptic current. Unfortunately, we had no guarantee about the LRS value distribution, and we could not assume that two devices in LRS connected in parallel would form a smaller resistor than a unique device in LRS. Taking into account the synaptic weight range we needed, we could not take such a risk and decided to digitalize the state of each device to obtain an 8-bit word representing the synaptic weight. Therefore, all eight OxRAMs composing a synapse must be read at the same time, so they are placed on the same line of the matrix and each column is equipped with its own reading circuit, as shown in Fig. 5.13. The readout circuit working principle is based on current comparison. The topology shown in Fig. 5.14 is motivated by a small area requirement and gives the opportunity to tune the reference current distinguishing between the two resistive states. The N1 transistor is added to preload the BL node before activating the

comparison via N0. Preload and Read commands are shared between all reading circuits.

Fig. 5.14 Reading circuit / 1b ADC

4.2 Neuron Design

The IF analog neurons were designed to meet the requirements for mathematical equivalence with the tanh activation function and to be adapted to the synaptic weight digitalization presented above. That is, 8b weights coding for −4 to +4 are integrated onto their membrane; they generate positive and negative spikes as output; they have a refractory period mechanism (note that they keep on integrating stimulations during the refractory period); and they implement a relative reset after emitting a spike. The proposed neuron core (Fig. 5.15) is composed of a MOM 200 fF capacitor, loaded and unloaded by a biphasic current injector (switched current mirrors), and two voltage comparators for spike generation. Each neuron is managed by a dedicated finite-state machine (FSM), which also manages the refractory period and the relative reset (Fig. 5.16). The reset method, which links the behaviors of the ANN and the SNN, consists in subtracting the value of the crossed threshold from the membrane potential (VMEM).

Fig. 5.15 Neuron core

Fig. 5.16 One branch of inhibitory current injector

As a consequence, the value of VMEM after the emission of a spike depends on its value right before the last stimulation. This reset method heavily impacts the VMEM range management: to perform an effective relative reset, VMEM range must spread from −7 to +7. The realization of this relative reset is greatly facilitated by the digitalization of the weights as the FSM simply activates the same current injectors to perform the reset, stimulating +4 after a negative spike and −4 for a positive one (Fig. 5.17). This FSM contains a counter which retains output spikes during the refractory period. It allows the neuron to integrate stimulations that occur during the refractory period without spreading the produced spikes further down the network. This spike counter also reduces the total amount of output spikes as it increments on positive spikes and decrements on negative spikes. Once the refractory period is over, the counter contains either positive or negative spike(s) and the FSM resumes emitting them from the counter.
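A behavioural model of this neuron is sketched below: signed weights are integrated, a crossing of ±threshold queues a signed spike, the reset subtracts the crossed threshold, and a counter holds spikes back during the refractory period. The threshold and the refractory length are illustrative values, not the circuit's.

```python
class IFNeuron:
    """Behavioural integrate-and-fire neuron with relative reset and a spike counter."""
    def __init__(self, threshold=4, refractory_steps=3):
        self.v = 0                      # membrane potential, in weight units
        self.threshold = threshold
        self.refractory_steps = refractory_steps
        self.refractory = 0
        self.pending = 0                # signed spike counter (+1 / -1 per spike)

    def stimulate(self, weight):
        """Integrate one signed synaptic weight; return the spike emitted now, if any."""
        self.v += weight                # integration continues during refractoriness
        if self.v >= self.threshold:
            self.v -= self.threshold    # relative reset: subtract the crossed threshold
            self.pending += 1
        elif self.v <= -self.threshold:
            self.v += self.threshold
            self.pending -= 1
        if self.refractory > 0:         # hold spikes back while refractory
            self.refractory -= 1
            return 0
        if self.pending > 0:
            self.pending -= 1
            self.refractory = self.refractory_steps
            return +1
        if self.pending < 0:
            self.pending += 1
            self.refractory = self.refractory_steps
            return -1
        return 0

n = IFNeuron()
print([n.stimulate(w) for w in (3, 3, -4, 4, 4, 4)])
```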

Fig. 5.17 VMEM range arrangement and worst case of positive spiking

4.3 Top Architecture

The RRAM matrix (Figs. 5.18 and 5.19) design strategy is strongly impacted by the fully connected neural network topology (Fig. 5.7). As our network sees as many input channels as there are pixels, the matrix needs 144 lines, or Word Lines. And, since each pixel is connected through a dedicated synapse to every neuron, we need 10 columns of synapses. Thus, with eight OxRAMs per synapse, it amounts to 80 columns of devices, or Bit Lines (Fig. 5.19). This arrangement permits reading in parallel the 80 bits encoding the 10 synaptic weights, optimizing the reading power consumption and latency. Those 80 bits are then sent to the 10 neuron circuits, which can process simultaneously the 10 synaptic stimulations triggered by each input spike. This RRAM matrix and the 10 analog neurons are driven by a top FSM, which is also connected to two FIFOs. The first one is filled with input spikes representing the image to process, each spike being composed of the emitting pixel address. The second one receives output spikes generated by the neurons, storing their sign and the address of the emitting neuron. Therefore, this top FSM performing the inference consists in: unstacking the first FIFO; reading the WL corresponding to the emitting pixel of the spike; getting the 10 weights; sending these weights to the 10 neurons; letting the neural FSMs integrate their corresponding synaptic weight and check whether they have to generate a spike; collecting the output spikes; and finally stacking them into the second FIFO. This top FSM also includes a SPI slave module for external communication.
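The same flow can be written as a small software loop, sketched below, assuming the eight OxRAMs of each synapse have already been decoded into a signed weight and using a simplified threshold-and-subtract neuron; all names and values other than the 144 × 10 organization are illustrative.

```python
import random
from collections import deque

# Software analogue of the inference FSM: pop an input spike, read the ten weights
# stored on that word line, stimulate the ten neurons, and push any output spikes.
random.seed(4)
N_PIXELS, N_NEURONS, THRESHOLD = 144, 10, 4
weights = [[random.randint(-4, 4) for _ in range(N_NEURONS)] for _ in range(N_PIXELS)]
membrane = [0] * N_NEURONS

def step(fifo_in, fifo_out):
    pixel = fifo_in.popleft()                       # unstack one input spike (@pixel)
    for neuron, w in enumerate(weights[pixel]):     # the ten weights of that word line
        membrane[neuron] += w
        if membrane[neuron] >= THRESHOLD:
            membrane[neuron] -= THRESHOLD           # relative reset
            fifo_out.append((neuron, +1))           # @neuron + sign
        elif membrane[neuron] <= -THRESHOLD:
            membrane[neuron] += THRESHOLD
            fifo_out.append((neuron, -1))

fifo_in = deque(random.randrange(N_PIXELS) for _ in range(50))
fifo_out = deque()
while fifo_in:
    step(fifo_in, fifo_out)
print(len(fifo_out), "output spikes")
```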

Fig. 5.18 RRAM matrix sizes: an OxRAM matrix of 144 × 80, with 144 WL drivers receiving spikes from 12 × 12 pixels and 80 BL drivers providing 10 weights of 8 bits

Fig. 5.19 8-bits synapses weights as encoded in the RRAM matrix (green: LRS state; red: HRS state)

During inferences (Fig. 5.20), only spikes flow through the SPI link: input spikes from the master to SPIRIT and output ones from the chip to the master. So the master is able to count the signed spikes emitted by each neuron. This count forms a score for each digit class. Then the master can compare the two highest scores, and when the gap between those top scores is higher than a predetermined termination parameter (noted ΔS), the master assumes that the classification is over and stops sending input spikes.
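The master-side termination criterion can be sketched as below; the chip's responses are replaced by a random stand-in biased towards one class, purely to exercise the ΔS stopping rule.

```python
import random

# Master-side early stopping: accumulate signed output spikes per class and stop
# once the two best scores are separated by at least DELTA_S.
random.seed(5)
N_CLASSES, DELTA_S = 10, 10
scores = [0] * N_CLASSES

def classification_done():
    ranked = sorted(scores, reverse=True)
    return ranked[0] - ranked[1] >= DELTA_S

sent = 0
while sent < 10000 and not classification_done():
    sent += 1                                        # send one more input spike
    neuron = random.randrange(N_CLASSES)             # stand-in for the chip's response
    sign = random.choice((1, 1, 1, -1)) if neuron == 3 else random.choice((1, -1))
    scores[neuron] += sign

print("input spikes sent:", sent, "-> class", scores.index(max(scores)))
```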

Fig. 5.20 Inference top flow

5 Measurement Results

5.1 Circuit Validation

On-chip inference tests have been performed using all 10 K MNIST test images. SPIRIT (Fig. 5.21) reached an 84% classification rate. This figure may seem a bit low, but it is limited by the very simple neural topology we used: ideal simulations achieve 88% accuracy. We consider the chip functional, as analog variations and noise, without any compensation technique, necessarily decrease the accuracy of the process. At the neural core level, the energy consumption per synaptic event is 3.6 pJ. At the chip level, this figure rises to 180 pJ. This large worsening is mainly due to the SPI communication protocol, which is slow and not optimized. Measurements show that an image classification needs 136 input spikes on average (for ΔS = 10): this is less than one spike accumulated per input, leading to a 5x energy gain compared to an equivalent ANN in a 130 nm node. Figure 5.22 displays the classification accuracy and the number of input spikes needed to process the whole MNIST image test batch for different ΔS. It illustrates an interesting feature of SNNs: ΔS is a natural way to favor either classification accuracy or consumption per processed image. As reading the RRAM implies letting a current flow across the resistive devices, and as inference consists in performing a lot of RRAM reads, we run the risk of unwittingly rewriting the matrix. To draw Fig. 5.23, we let the chip process the whole MNIST test batch more than 200 times, which corresponds to processing 750 million spikes. After each batch, we read the whole RRAM matrix and compared its content to the initial weights we had written. These test results demonstrate the absence of read disturb during inference.

5.2 Extra Measurements on OxRAMs

As our OxRAM reading circuit uses a tunable reference (see Fig. 5.14, IREF), we can monitor quite accurately the resistance value of each resistive device through


Fig. 5.21 Chip micrograph and physical figures

Fig. 5.22 Accuracy/activity tradeoff

multiple readings with multiple reference current values. Thus, thanks to successive writing operations, we managed to group the OxRAMs into different resistance batches; we then checked that 5 M reads do not corrupt those batches. This exploratory work shows that this RRAM technology could more efficiently encode weights ranging from −4 to +4 by combining only two devices per synapse instead of eight [9]. To perform inference using such device behavior, several strategies can be considered. The first possibility is to keep digitalizing the weight and to use one device in multiple level cell (MLC) mode to encode the absolute value of the weight and another one in binary mode for the sign.


Fig. 5.23 Read disturb assessment on inference runs

Fig. 5.24 Multiple valued OxRAM

The second idea is to use two MLC OxRAMs, each one encoding either the positive or the negative weight component, and to sum them directly on the membrane capacitor. This solution would lead to a circuit quite similar to the ideal view displayed in Fig. 5.1. It would no longer need a weight digitalization and would certainly further reduce the power consumption. A hybrid solution would consist of an analog MLC for the absolute value and a binary OxRAM for the sign. This solution would need an analog inversion of the synaptic current before integrating it on CMEM for negative weights. Note that the measurements displayed in Fig. 5.24 are well adapted for the last two solutions, as the batches "2," "3," and "4" are roughly multiples of batch "1." The range "0," corresponding to the absence of a synaptic connection, is mapped to the lowest current range.
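The second option can be sketched as follows; the unit read current is a placeholder loosely inspired by the observation that batches "2," "3," and "4" are roughly multiples of batch "1."

```python
# Two-MLC-per-synapse encoding: weight = (positive level) - (negative level).
I_UNIT = 2e-6          # placeholder read current of conductance batch "1"

def encode(weight):
    assert -4 <= weight <= 4
    pos = max(weight, 0)               # level stored in the "positive" OxRAM
    neg = max(-weight, 0)              # level stored in the "negative" OxRAM
    return pos, neg

def synaptic_current(pos, neg):
    return (pos - neg) * I_UNIT        # signed current summed on the membrane cap

for w in (-4, -1, 0, 3):
    pos, neg = encode(w)
    print(w, pos, neg, synaptic_current(pos, neg))
```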


6 Discussion This chip and its energy performance clearly illustrate the importance of data movement in NN integration. Especially in SNNs, the data flow can be so large that data transport has a stronger impact on the overall consumption than data processing. Although this analog chip design did not include any mismatch compensation, the difference between simulated and measured classification rates remains modest. This small decrease is very probably due to the very shallow depth of the integrated NN, and it is a safe bet that, for a deeper NN, such a naive approach would have given much worse results on silicon. An important part of the design effort has been focused on the mathematical equivalence between the ANN model used for learning and the embedded SNN circuit. It led to a quite complex mixed-signal neuron circuit whose consumption is correspondingly increased. A more interesting way of thinking about this question of equivalence may be to adapt the simulated ANN model to a more efficient neural circuit rather than the other way around. The same RRAM technology has also been co-integrated in a 28 nm FDSOI process. The OxRAM matrix and the analog neurons have been re-designed in this technology and are currently still under fabrication. The RRAM matrix area is divided by a factor of 36, as illustrated in Fig. 5.25, while the neuron area is divided by a factor of 17. At the neural core level, the simulated energy per synaptic event is decreased by 40%. Figure 5.26 shows that re-implementing the SPIRIT chip in 28 nm FDSOI and using MLC techniques enables state-of-the-art performance in terms of energy and area. The energy per synaptic event in 28 nm FDSOI is extrapolated from our 180 pJ measurement, which includes the excessive consumption due to the SPI communication circuits.
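For orientation, applying the quoted 40% reduction to the measured 3.6 pJ neural-core figure gives the rough 28 nm estimate below (a simple extrapolation under that assumption, not a measured value):

    e_core_130nm = 3.6e-12                    # measured energy per synaptic event, 130 nm neural core [J]
    e_core_28nm = e_core_130nm * (1 - 0.40)   # simulated 40% reduction -> about 2.2 pJ
    area_factor_rram, area_factor_neuron = 36, 17   # layout scaling factors from Fig. 5.25
    print(f"estimated 28 nm energy per synaptic event: {e_core_28nm * 1e12:.1f} pJ")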

Fig. 5.25 Layout scaling from 130 nm to 28 nm node


Fig. 5.26 Comparison to the state of the art

Fig. 5.27 Demonstration of live handwritten digits classification (demo setup: Surface tablet, Ethernet interface, Microzed ZYNQ X7CZ020 board, SPI interface, SPIRIT chip; image, input spikes, output spikes, and result are exchanged between the blocks)

7 Conclusion We have designed, fabricated, and tested a functional SNN in a fully integrated technology co-integrating analog neurons and RRAM synapses. We demonstrated 5x lower energy consumption with respect to an equivalent ANN chip. The neural core energy per synaptic event is 3.6 pJ and could be further reduced by fully exploiting the capabilities of the RRAM technology. Synaptic density can also be improved by using the RRAM as multi-level cells. A 28 nm version of this circuit is currently under fabrication and is expected to demonstrate more convincing performance in terms of consumption and density. Note also that a live demonstration of classifying digits drawn on a touch screen interface has been developed and is functional, as illustrated in Fig. 5.27.


References
1. S. Ambrogio et al., Nature 558(778), 60 (2018)
2. R. Mochida et al., VLSI Technology, pp. 175–176 (2018)
3. P. Merolla et al., Science, pp. 668–673 (2014)
4. D.R.B. Ly et al., J. Phys. D: Appl. Phys. 51, 444002 (2018)
5. https://github.com/CEA-LIST/N2D2
6. J. Perez-Carrasco et al., TPAMI, pp. 2706–2719 (2013)
7. A. Grossi et al., VLSI J., pp. 2599–2607 (2018)
8. D. Garbin et al., TED 62(8) (2015)
9. B.Q. Le et al., TED 66(1) (2019)


Chapter 6

Accelerated Analog Neuromorphic Computing Johannes Schemmel, Sebastian Billaudelle, Philipp Dauer, and Johannes Weis

Abstract This chapter presents the concepts behind the BSS accelerated analog neuromorphic computing architecture. It describes the second-generation BrainScaleS-2 (BSS-2) version and its most recent in silico realization, the HICANN-X Application Specific Integrated Circuit (ASIC), as it has been developed as part of the neuromorphic computing activities within the European Human Brain Project (HBP). While the first generation is implemented in a 180 nm process, the second generation uses 65 nm technology. This allows the integration of a digital plasticity processing unit, a highly parallel microprocessor specially built for the computational needs of learning in accelerated analog neuromorphic systems. The presented architecture is based upon a continuous-time, analog, physical model implementation of neurons and synapses, resembling an analog neuromorphic accelerator attached to built-in digital compute cores. While the analog part emulates the spike-based dynamics of the neural network in continuous time, the digital part simulates biological processes happening on a slower timescale, like structural and parameter changes. Compared to biological timescales, the emulation is highly accelerated, i.e., all time constants are several orders of magnitude smaller than in biology. Programmable ion channel emulation and inter-compartmental conductances allow the modeling of nonlinear dendrites, back-propagating action potentials, and NMDA and calcium plateau potentials. To extend the usability of the analog accelerator, it also supports vector-matrix multiplication. Thereby, BSS-2 supports inference of deep convolutional networks as well as local learning with complex ensembles of spiking neurons within the same substrate. A prerequisite to successful training is the calibratability of the underlying analog circuits across the full range of process variations. For this purpose, a custom software toolbox has been developed, which facilitates complex calibrated Monte Carlo simulations.

J. Schemmel () · S. Billaudelle · P. Dauer · J. Weis Heidelberg University, Heidelberg, Germany e-mail: [email protected]



1 Introduction The basic concept of the BrainScaleS systems is the emulation of biologically inspired neural networks with physical models [1]. It differs from comparable neuromorphic approaches based on continuous-time analog circuits [2–4] in many aspects, like the high acceleration factor [5, 6], the usage of wafer-scale integration [7], calibratability toward biologically sound neuron parameters [8, 9], a software interface based on the simulator-agnostic description language PyNN [10, 11], support for nonlinear dendrites and structured neurons [12], and on-chip support for complex plasticity rules based on a combination of analog measurements, internal analog-to-digital conversion, and built-in microprocessors. The first generation, BrainScaleS 1, has been completed [13] and is used mostly for research on connectivity aspects of large accelerated analog neural networks and for the further development of wafer-scale integration technology. The main shortcoming of the BrainScaleS 1 system is the rather inflexible implementation of long-term plasticity based solely on Spike-timing dependent plasticity (STDP), which has been taken over from its predecessor [6]. Already at the very beginning of the BrainScaleS project, this was considered a conceptual weakness, and an upgrade path was devised to implement the more flexible hybrid plasticity scheme [14] in future revisions. With the 180 nm process technology used within BrainScaleS 1, it was not feasible to integrate the necessary standard cell logic without sacrificing too much area to digital circuits in relation to the analog neurons and synapses. Therefore, the decision was made to develop a second BrainScaleS generation, BrainScaleS 2, which is based from the beginning on a smaller process technology, namely 65 nm. Figure 6.1 shows the main elements of the BSS architecture.

Fig. 6.1 Basic elements of the BrainScaleS architecture: wafer, BSS-1 ASIC, BSS-2 neuron, and exemplary membrane voltage trace


At the very left, a BSS-1 wafer containing approx. 500 interconnected ASICs is shown. To its right, a BSS chip illustrates the characteristic layout of BSS neuromorphic chips: a central neuron area surrounded by two large synapse blocks. The sketched overlay shows the rectangular orientation of input (pre-synaptic) and output (post-synaptic) signals: the input is routed horizontally through the synapse array, while the outputs of the synapses connect vertically to the neurons in the center. Next to it, the graphical representation of an emulated structured neuron is shown above a measured voltage trace from the membrane capacitor of a neuron. One major improvement is the inclusion of a digital plasticity processor in the BSS-2 ASIC [14]. This specialized, highly parallel Single Instruction Multiple Data (SIMD) microprocessor adds an additional layer of modeling capabilities, covering all aspects of structural and parameter changes during network operation. By including the necessary logic directly within the analog network core, a communication bottleneck to the host system is avoided. This allows all novel plasticity features to be scaled up for wafer-scale integration within the BrainScaleS 2 system. In the final multi-wafer version of the BrainScaleS 2 system, which is planned to be capable of extending experiments across several hundreds of wafers, the distributed local compute capability will be even more essential. It will not only perform all kinds of plasticity calculations but also the initialization and calibration of the numerous analog mixed-signal circuits within the ASIC. The role of the analog neural network block changes in the transition from BSS-1 to BSS-2: the analog part becomes an attachment to the Central Processing Unit (CPU) cores, similar to a complex accelerator. Figure 6.2 illustrates this architecture.

Fig. 6.2 A neuromorphic SOC consisting of a multitude of digital CPU cores with special vector units attached to analog neuromorphic accelerators. The figure annotations summarize the roles of the building blocks: the network-on-chip prioritizes event data, leaves unused bandwidth to the CPUs, and provides a common address space for neurons and CPUs; special function tiles carry memory controllers, SERDES IO, and purely digital function units; the high-bandwidth link between each vector unit and its neuromorphic core transports weights, correlation data, routing topology, event (spike) IO, and configuration data


The remainder of this publication is organized as follows: Section 2 gives an overview of the BSS-2 architecture. Section 3 presents the current prototype, the single-chip variant of BSS-2, called HICANN-X. Section 4 shows some examples of the complex calibrated Monte Carlo simulations used to verify that the analog neuron circuits are always capable of correctly emulating their biological counterparts, i.e., their calibratability under all process and device variations. The chapter closes with a conclusion in Sect. 5.

2 Overview of the BSS Neuromorphic Architecture As shown in Fig. 6.2, the BSS architecture is based on the close interaction of digital and analog circuit blocks. Because of their primary intended function, the digital processor cores are called Plasticity Processing Units (PPUs). As the main neuromorphic component, the analog core contains synapse and neuron circuits [15, 16], analog parameter memories, PPU interfaces, and all event-related interface components. The PPU is an embedded microprocessor core with a highly parallel SIMD unit optimized for the calculation of plasticity rules in conjunction with the analog core [17]. In the current incarnation of the BSS architecture, BSS-2, two PPUs share an analog core. This allows the most efficient arrangement of the neuron circuits in the center of the analog core. Figure 6.3 depicts the individual function blocks located within the Analog Network Core (ANNCORE).

Fig. 6.3 Block diagram of the Analog Network Core (ANNCORE): four 128×256 synapse arrays, 128 STDF synapse drivers, rows of 128 neuron compartments, 130×24 analog parameter memories, digital neuron control, event routing and random generators, and 256-channel single-slope ADCs, framed by the top and bottom plasticity processing units


Synapse Arrays The total number of synapses is split up into four equally sized blocks to keep the vertical and horizontal lines traversing the sub-arrays as short as possible, thereby reducing their parasitic capacitances (see [16, 17]). Each synapse array resembles a block of static memory, with 16 memory cells located in each synapse, organized as two words of eight bits each (a possible bit-packing of these words is sketched after this overview). A synapse array also contains the sense amplifiers, precharge and write control circuits, and word-line decoders and buffers. Thereby, it can be connected directly to the digital, standard cell-based parts of the chip. Two PPUs connect to the static memory interfaces of the two adjacent synapse arrays, using a fully parallel connection to the 8 × 256 data lines.

Neuron Compartment Circuits Four rows of neuron compartment circuits are located at the edges of the synapse blocks. Each pair of dendritic input lines of a neuron compartment is connected to a column of 256 synapses. The neuron compartments implement the AdEx neuron model. They can be connected to form larger neurons, emulating either point or structured neurons. See [12] for more details about the multi-compartment capabilities.

Analog Parameter Memories Adjacent to each row of neuron compartments is a row of analog parameter storage. These capacitive memories [18] store 24 analog values per neuron and an additional 48 global parameters. They are auto-refreshed from values stored digitally inside the memory block.

Digital Neuron Control Two neuron rows share a digital neuron control block, which synchronizes neural events to the digital system clock of 125 MHz and serializes them onto digital output buses.

Synapse Drivers with Short Term Plasticity The pre-synaptic events are fed into the array via the synapse drivers. Besides timing control and buffering, they contain short-term plasticity circuits emulating a simplified Tsodyks–Markram model [6, 19]. The synapse drivers can handle single- or multi-valued input signals, depending on the current operation mode of the synapse row, which may be either rate or spike based.

Random Event Generators The random generators produce random background events fed directly into the synapse array via the synapse drivers, strongly reducing the external bandwidth usage when stochastic models [20, 21] are used.

Correlation Analog to Digital Converters (ADCs) The top and bottom edges of the ANNCORE are lined by the SIMD units of the top and bottom PPUs. A column-parallel ADC converts the analog data from the synapse arrays as well as selected analog signals from the neurons into the digital representations needed by the PPUs.
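As referenced above, the 16 bits per synapse correspond to the 6 bit weight, the 6 bit pre-synaptic address, and the 4 bit digital calibration described later with Fig. 6.5. A possible packing into the two 8 bit words is sketched below; the actual bit assignment on the chip is not given in the text and is assumed here purely for illustration:

    def pack_synapse(weight, address, calib):
        # Illustrative packing of one synapse into two 8-bit words:
        # word 0: 6-bit weight plus two calibration bits,
        # word 1: 6-bit pre-synaptic address plus the remaining two calibration bits.
        # The real on-chip bit assignment may differ.
        assert 0 <= weight < 64 and 0 <= address < 64 and 0 <= calib < 16
        word0 = (weight & 0x3F) | ((calib & 0x3) << 6)
        word1 = (address & 0x3F) | (((calib >> 2) & 0x3) << 6)
        return word0, word1

    def unpack_synapse(word0, word1):
        weight = word0 & 0x3F
        address = word1 & 0x3F
        calib = ((word0 >> 6) & 0x3) | (((word1 >> 6) & 0x3) << 2)
        return weight, address, calib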


Fig. 6.4 Cutout from the detailed block diagram of the ANNCORE, showing the upper right quadrant and its related I/O circuits

Figure 6.4 shows a zoom-in into the upper right quadrant of the ANNCORE. For compatibility with BSS-1, the synapse drivers and digital neuron control circuits are arranged in a similar substructure as previously: one synapse driver controls two rows of synapses in both adjacent blocks, and the digital neuron control is split into eight blocks controlling 64 neuron compartments each. Four blocks are located in the left and four in the right half of the ANNCORE. Each block contains the so-called neuron builder logic, which allows interconnecting analog membrane and digital spike output signals from neuron compartments that are either vertically or horizontally adjacent to each other. To serialize the up to 64 spike outputs, each digital neuron control block contains priority encoder circuits that arbitrate access to the output bus. It also contains an 8 × 64 neuron source address memory [22]. The pre-synaptic input for the synapse drivers of one chip half comes from a set of local event input buses driven by the central event router. The event router within the ANNCORE mixes global, local, and random event sources. In Fig. 6.4, the synapses are arranged in a two-dimensional array between the PPU and the neuron compartment circuits. Pre-synaptic input enters the synapse array at the left edge. For each row, a set of signal buffers transmits the pre-synaptic pulses to all synapses in the row.


Fig. 6.5 Top: operating principle and basic timing relationships of an accelerated BrainScaleS spiking neuron. Bottom: block diagram of a synapse (12 × 8 µm²), containing a 6 bit SRAM address with comparator, a 6 bit SRAM weight with DAC, a correlation sensor with causal and anti-causal readouts, and 4 bit of SRAM digital calibration

The post-synaptic side of the synapses, i.e., the equivalent of the dendritic membrane of the target neuron, is formed by wires running vertically through each column of synapses. At each intersection between pre- and post-synaptic wires, a synapse is located. To avoid all neuron compartments sharing the same set of pre-synaptic inputs, each pre-synaptic input line transmits, in a time-multiplexed fashion, the pre-synaptic signals of up to 64 different pre-synaptic neurons. Each synapse stores a pre-synaptic address that determines the pre-synaptic neuron it responds to. Figure 6.5 illustrates the basic operation of the BrainScaleS accelerated analog neuron and its associated synapses. Due to space limitations, the dendritic column is rotated by 90° in the figure. The bottom half of the figure shows a block diagram of the synapse circuit. The main functional blocks are the address comparator, the Digital to Analog Converter (DAC), and the correlation sensor. Each of these circuits has its associated memory block. The address comparator receives a 6 bit address and a pre-synaptic enable signal from the periphery of the synapse array as well as a locally stored 6 bit neuron number. If the address matches the programmed neuron number, the comparator circuit generates a pre-synaptic enable signal local to the synapse (pre), which is subsequently used in the DAC and correlation sensor circuits. Each time the DAC circuit receives a pre signal, it generates a current pulse. The height of this pulse is proportional to the stored weight, while the pulse


width is typically 4 ns. This matches the maximum pre-synaptic input rate of the whole synapse row, which is limited to 125 MHz. The remaining 4 ns are needed to change the pre-synaptic address. The current pulse can be shortened below the 4 ns maximum pulse length to emulate short-term synaptic plasticity [6, 23]. Each neuron compartment has two inputs, labeled A and B in Fig. 6.5. Usually, the neuron compartment uses A as excitatory and B as inhibitory input. Each row of synapses is statically switched to either input A or B, meaning that all pre-synaptic neurons connected to this row act either as excitatory or as inhibitory inputs to their target neurons. Due to the address width of 6 bit, the maximum number of different pre-synaptic neurons per row is 64 [24]. The output currents of all synapses discharge the synaptic input capacitance Csyn, which is realized predominantly by the shielding capacitance of the long synaptic input wires. An adjustable MOS resistor, Rsyn, restores the charge. Since the time constant of the synaptic input pulse is short compared to the time constant of the synaptic input line, τinput = Csyn·Rsyn, which is three orders of magnitude longer, the voltage trace Vinput(t) is a single exponential. The ion channel circuits in BrainScaleS are intended to implement the full AdEx neuron model, as is the case in the BSS-1 system [16, 25, 26]. In BSS-2, some terms are still under development at the time of this writing. The minimum configuration available in all prototype versions of BSS-2 is a set of two current-based synaptic inputs, one excitatory and one inhibitory (inputs A and B in Fig. 6.5), in combination with a leak circuit and spike and reset generation [15]. Therefore, the membrane voltage follows the standard Integrate-and-Fire (I&F) neuron model [27]. Typically, the membrane time constant set by the leakage term is another order of magnitude above the time constant of the synaptic input. These temporal relationships are visualized in the small timing diagram inserts in Fig. 6.5. The remaining functional block of the synapse shown in Fig. 6.5 is the correlation sensor. Its task is the measurement of the time difference between pre- and post-synaptic spikes. To determine the time of the pre-synaptic spike, it is connected to the pre signal. The post-synaptic spike time is determined by a dedicated signaling line running from each neuron compartment vertically through the synapse array, connecting to all synapses projecting to input A or B of the compartment [17].
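A minimal numerical sketch of the temporal hierarchy just described is given below: a 4 ns weight-proportional current pulse, a synaptic time constant on the order of 4 µs, and a membrane time constant on the order of 40 µs, as indicated in the timing inserts of Fig. 6.5. Units and the threshold value are arbitrary illustrative choices, not calibrated chip parameters:

    import numpy as np

    # Illustrative current-based I&F emulation with exponential synaptic input.
    dt = 4e-9             # one time step = one 4 ns synapse pulse slot
    tau_syn = 4e-6        # synaptic input time constant (Csyn * Rsyn)
    tau_mem = 40e-6       # membrane time constant set by the leak term
    v_th, v_reset = 1.0, 0.0   # arbitrary units

    def simulate(pulse_train, weight, n_steps):
        # pulse_train: boolean array marking the 4 ns slots carrying a pre-synaptic pulse
        i_syn, v = 0.0, 0.0
        v_trace = np.zeros(n_steps)
        for t in range(n_steps):
            i_syn += weight * pulse_train[t]      # DAC pulse: height proportional to weight
            i_syn -= i_syn * dt / tau_syn         # Rsyn restores the charge on Csyn
            v += (i_syn - v) * dt / tau_mem       # leaky integration on the membrane
            if v > v_th:                          # spike followed by reset
                v = v_reset
            v_trace[t] = v
        return v_trace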

3 The HICANN-X Chip Although the target of the BSS architecture is wafer-scale integration, which offers a cost-effective way to build brain-size spiking neural network models, smaller solutions based upon single ASICs are needed to develop and debug the final design. They also shorten the time to first experiments, a significant proportion of which need only hundreds up to a few thousand neurons and therefore do not necessarily rely on wafer-scale integration. Depending on the complexity of the neuron model they utilize, a few tens of interconnected BSS ASICs might be sufficient. To support these goals, an intermediate version of the second-generation BSS technology has been developed: suited for single- or multi-chip operation,


Fig. 6.6 Block diagram of the HICANN-X ASIC: the analog network core with top and bottom plasticity processing units, the event router and random background generators, the digital core logic with eight HICL SERDES link blocks (links 0 to 7), the fast ADC, analog outputs via an output amplifier, the main PLL, and JTAG and reset

but simultaneously prepared for later wafer-scale integration. This section will introduce said single-chip version of BSS-2, called HICANN-X, in more detail. Figure 6.6 shows a block diagram of HICANN-X. In total, the HICANN-X chip uses 16 differential Low Voltage Differential Signalling (LVDS) lines for the host communication. A single chip has the same bandwidth as the full BSS-1 reticle built from eight individual chips. Using this link arrangement, the HICANN-X chip can be directly connected to one communication module of the BrainScaleS system, providing an easy upgrade path [28]. The layout and photograph of the chip are shown in Fig. 6.7.

3.1 Event-Routing Within HICANN-X HICANN-X uses the same two-level communication infrastructure as the first BSS generation [29]: a real-time address-event layer without handshake, called Event Link Layer 1 (Layer1), and a second layer using time-stamped event packets. (The Layer1 data format codes a neural event as a parallel bit field containing the neuron address and a valid bit; it is real-time data with a temporal resolution of the system clock, which is 250 MHz in HICANN-X.) Figure 6.8 shows the implementation of the central Layer1 digital event routing network. There are two main sources and sinks for event data: the analog network core, which has eight input event and eight output event buses, and the Event Link Layer 2 (Layer2) → Layer1 converter, which provides four links in each direction.


Key features of HICANN-X (cf. Fig. 6.7):
• 65 nm LP-CMOS, power consumption O(10 pJ/synaptic event)
• 128k synapses
• 512 neural compartments (sodium, calcium, and NMDA spikes)
• two SIMD plasticity processing units (PPU)
• PPU internal memory can be extended externally
• fast ADC for membrane voltage monitoring
• 256k correlation sensors with analog storage (> 10 Tcorr/s max)
• 1024 ADC channels for plasticity input variables
• 32 Gb/s neural event IO
• 32 Gb/s local entropy for stochastic neuron operation

Fig. 6.7 Top: layout drawing and chip photograph of the HICANN-X ASIC. Bottom: key features of HICANN-X

Fig. 6.8 Conceptual view of the internal digital event routing matrix of the HICANN-X chip. All 20 sources are shown vertically on the left, while the 12 output channels are listed at the top. At each position marked with a cross, a programmable routing element is located


With the exception of the analog core input buses, each link can handle one event per clock cycle of 4 ns; the ANNCORE input buses are limited to one event every two cycles. All eight links are used for Layer2-based event transport. An event is encoded as a combination of neuron address and time stamp. The conversion between time-stamped Layer2 data and real-time Layer1 data is performed inside the Layer2 → Layer1 converter located in the digital core logic. It uses a globally synchronized system time counter for this purpose. The routing of all Layer1 events is done within the router matrix. Inside this module are several columns of buffered n-to-1 event merger stages, allowing the data of a set of inputs to be combined into one Layer1 output channel. All eight physical links of the chip can be used simultaneously for neuron event data (Layer2), slow control, and PPU global memory accesses. The number of active links can be statically programmed to any number between one and the maximum of eight. This is useful if several chips are to be connected to a single host with a limited number of available links.
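To make the conversion step concrete, a minimal sketch of how time-stamped Layer2 packets could be released as real-time Layer1 events against a 4 ns system clock is given below. The queueing and arbitration policy shown is an assumption for illustration, not the documented hardware behavior:

    import heapq

    CLOCK_NS = 4  # 250 MHz system clock -> one Layer1 slot every 4 ns

    def layer2_to_layer1(events, n_cycles):
        # events: (timestamp_in_cycles, neuron_address) Layer2 packets.
        # Returns one entry per clock cycle: the neuron address released onto the
        # Layer1 bus, or None if the bus is idle. At most one event per cycle;
        # events whose time stamp has passed are released as soon as possible.
        pending = list(events)
        heapq.heapify(pending)
        bus = []
        for cycle in range(n_cycles):
            if pending and pending[0][0] <= cycle:
                _, address = heapq.heappop(pending)
                bus.append(address)
            else:
                bus.append(None)
        return bus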

3.2 Analog Inference: Rate-Based Extension of HICANN-X One of the first neuromorphic systems built in Heidelberg was the Heidelberg AnaloG Evolvable Neural network (HAGEN), a fast analog perceptron-based network chip optimized for hardware-in-the-loop training [30]. Owing to parallel activities within the Heidelberg Electronic Vision(s) research group [31], it was mainly trained by evolutionary algorithms [32], which explains the acronym. Nevertheless, it was perfectly usable for other hardware-in-the-loop based algorithms, similar to the deep learning results that have more recently been achieved by other neural network chips used in a perceptron-like fashion [33]. Although the High Input Count Analog Neural Network (HICANN) architecture has been successfully used to implement deep multilayer networks using rate-based spiking models [34] and back-propagation-based training, it loses some of its power efficiency by emulating a perceptron model. Encoding the activation in the time between spikes can enhance the efficiency significantly [35]. In all spiking solutions, the network operates in continuous time, and therefore the size of the network is limited by the number of neurons and synapses available on the chip. The HAGEN extension, which is part of the HICANN-X chip, allows a seamless mixture of spiking and non-spiking operation within a single chip. Since this rate-based operation is based on discrete-time analog vector-matrix multiplication, a time-multiplexing scheme can be employed, similar to digital accelerators for deep convolutional networks [36]. In this case, the size of the network is limited only by the size of the available external memory. Figure 6.9 visualizes the differences between standard spiking mode and HAGEN mode, which eliminates all temporal dynamics from the neuron. By disabling the leakage term of the neuron, the membrane simply sums up the synaptic input.


Fig. 6.9 Operating principle of the HAGEN extensions in the HICANN-X chip

The excitatory input is added with a positive and the inhibitory input with a negative sign. All input is applied during the time interval tinput, after which the membrane voltage is digitized by the Correlation-readout ADC (CADC) and the neuron is set to the reset voltage Vreset by a reset signal from the PPU. tinput can be as short as 100 ns. It depends on the bandwidth of the synaptic input and the number of synaptic rows used, i.e., the total time required to transfer all input events to the synapses. Since the minimum time is at least a few synaptic time constants and nothing is gained by setting the integration time shorter than the conversion time of the CADC, a typical value for tinput is about 500 ns. In this case, the network can evaluate 2 × 10⁶ × 256 × 512 = 2.62 × 10¹¹ multiply–accumulate operations per second. By shortening the conversion time of the CADC further, speed improvements are possible. In the current chip revision, the neuron circuit is not yet capable of the full speed of the synapse array, and therefore the full multiply–accumulate cycle takes several µs. Since the reset voltage of the neuron membrane can be aligned with the lower bound of the CADC conversion range, the neuron acts like a ReLU unit in this setting [37]. A standard synapse within BrainScaleS reacts to a pre-synaptic event in a digital fashion: the arrival of a pre-synaptic event generates a fixed current pulse. By enabling short-term facilitation or depression [23], the synaptic strength depends on the pre-synaptic firing history. This is achieved by modulating the pulse length generated by the synapse. Instead of using the firing history, in HAGEN mode the pulse length is transmitted together with the pre-synaptic spike and converted into variable-length pulses by the existing Short Time Plasticity (STP) pulse length modulation circuits. The digital pulse length information is transmitted by reusing the 5 lower address bits of the Layer1 event data, since in HAGEN mode the network structure is much more regular and not all pre-synaptic address bits are needed. Figure 6.10 shows results using the activity-based perceptron mode of HICANN-X for analog vector-matrix multiplication. In the left part of the figure, 127 neurons are measured simultaneously. Their synaptic weights increase linearly from −63 to 63, i.e., all synapses connected to a single neuron are set to the same weight, while the weights increase from neuron to neuron. All synapses receive the same input: 0, 3, or 7 for the black, red, and blue traces, respectively.
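Functionally, one HAGEN-mode integration cycle can be summarized as a signed accumulation followed by a ReLU-like digitization, as sketched below. The array sizes and the 500 ns cycle follow the text; the CADC full-scale code and the use of NumPy matrices are illustrative assumptions:

    import numpy as np

    def hagen_mac(x, w_exc, w_inh, adc_max=255):
        # One HAGEN-mode cycle: excitatory rows add, inhibitory rows subtract,
        # the leak-free membrane accumulates the charge, and the CADC digitizes it.
        # With Vreset aligned to the lower CADC bound the neuron behaves like a ReLU.
        acc = w_exc.T @ x - w_inh.T @ x
        return np.clip(acc, 0, adc_max)     # ReLU plus saturation at the ADC full scale

    # Throughput estimate from the text: 256 inputs x 512 neurons every 500 ns
    t_input = 500e-9
    macs_per_second = 256 * 512 / t_input   # = 2.62e11 multiply-accumulate operations/s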

Fig. 6.10 Left: results for analog matrix-vector multiplication (amplitude on the neurons' membranes [LSB] versus synapse weight [LSB] on the individual neurons, for vector entries 0, 3, and 7). Right: confusion matrix for MNIST (true label versus obtained label)

The outputs of all 127 neurons are digitized simultaneously by the CADC, and the digital values are plotted over the weight values of the neurons. Although the neuron circuits are calibrated, some fixed-pattern noise remains visible. The resulting amplitudes approach saturation for large absolute values greater than 100 LSB. This is also caused by the fact that the neuron circuits have not yet been optimized for rate-based operation, which limits the usable dynamic range of the output of the matrix-vector multiplication. The chip has been subsequently used to perform inference on the MNIST dataset of handwritten digits [38]. A three-layer convolutional neural network has been trained in TensorFlow [39] to reach a classification rate of 98.5% using 32 bit floating point weights. The weights and input activations of this network have been quantized to 6 bit weight and 5 bit input resolution, to fit the trained network to the dynamic range of the analog circuits. After training the model with hardware in the loop, similar to the approach followed in [34], a classification accuracy of 98.0 ± 0.1% on the test dataset has been achieved using the HICANN-X chip. The corresponding confusion matrix is shown in the right panel of Fig. 6.10. Within the analog network core, 5 µs is spent per full-chip MAC operation during the MNIST task. The results are discussed in more detail in [40]; further benchmarks are presented in [41] and [42].
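A minimal sketch of the kind of uniform quantization implied above is shown below: signed 6 bit weights and 5 bit inputs with a simple per-tensor scaling. The scaling scheme is an assumption for illustration, not the authors' exact procedure:

    import numpy as np

    def quantize(x, q_max):
        # Scale a float tensor so that its largest magnitude maps to q_max LSBs,
        # then round to integers (simple per-tensor scaling, assumed for illustration).
        scale = q_max / np.max(np.abs(x))
        return np.round(x * scale).astype(int), scale

    # 6 bit weights: signed values in [-63, 63], realized on chip via excitatory or
    # inhibitory 6 bit synapses (cf. Fig. 6.10); 5 bit inputs: values in [0, 31].
    w_q, w_scale = quantize(np.random.randn(256, 512), q_max=63)
    x_q, x_scale = quantize(np.random.rand(256), q_max=31)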

4 Analog Verification of Complex Neuron Circuits The BrainScaleS systems feature complex mixed-signal circuits to emulate the rich properties of their biological counterparts. Our neuron circuits, implementing the AdEx equations [25], possess a multitude of individual subcomponents, such as a leak or adaptation term. Each of these units is parameterized through a number of digital controls as well as analog voltage and current biases. Designed to support a variety of different tasks, ranging from biologically realistic firing patterns to analog


matrix multiplication, these circuits have to be operated at widely different operating points. The correct behavior has to be ensured prior to fabrication. Individual components can often be unit-tested in isolation, making use of conventional simulation strategies. The accessibility of a complete design is, however, limited due to error propagation and inter-dependencies of parameters. A suite of benchmark tasks, evaluated on comprehensive testbenches, is required for pre-tapeout verification. To ensure the required degree of precision over larger arrays of analog circuits, mismatch effects introduced through imperfections in the production process have to be covered through Monte Carlo (MC) simulations. Different incarnations of a circuit can be obtained by individually fixing the MC seed. These virtual instances can then be characterized, very much like their fabricated siblings. Similarly, the worst-case behavior can be characterized for the process corners. In the following paragraphs, we present our simulation strategy and a custom library that aids software-driven simulations within the rich ecosystem of the Python programming language. We then walk through the benchmarking flow for our current generation of AdEx neurons. Similar approaches have successfully been taken for the verification of plasticity circuits and vector-matrix multiplication circuits.

4.1 Interfacing Analog Simulations from Python


Our custom Python module teststand provides a tight integration between analog circuit simulations and the ecosystem of the programming language [43]. It mainly consists of a software layer to interface with the Cadence Spectre simulator and other tools from the Cadence Design Suite (Fig. 6.11). Teststand extracts the testbench’s netlist directly from the target cell view as available in the design library. The data is accessed by querying the database via an OCEAN script executed as a child process. Teststand then reads the netlist

Fig. 6.11 Structure of a teststand-based simulation highlighting the interaction with the Cadence Design Suite (user code and stimuli enter teststand, which performs netlist extraction, simulation with Cadence Spectre, and parsing of the resulting traces for data analysis). Image taken from [43]


and modifies it according to the user's specification. In addition to the schematic description, Spectre netlists also contain simulator instructions. Teststand generates these statements according to the user's Python code. Specifically, the user can define analyses to be performed by the simulator, such as DC, AC, and transient simulations. MC analyses are supported as well and play an important role in the verification strategies presented below. Teststand can be easily extended to support all features provided by the backend. All circuit parameters, stimuli, and nodes to be recorded are specified using an object-oriented interface that resembles Spectre simulation instructions:

    # select the cell view to simulate and the nets to record
    cell = ('mylib', 'mycell', 'schematic')
    nets = ['I0.mynet']
    teststand = Teststand(cds_lib, cell)
    # define a 1 ms transient analysis and run it
    tran = TransientAnalysis('tran', 1e-3)
    simulation = Simulation([tran], params, save=nets)
    result = teststand.simulate(simulation)

The simulate() call executes Spectre as a child process. Basic parallelization features are natively provided via the multiprocessing library. Scheduling can be trivially extended to support custom compute environments. The simulation log is parsed, and potential error messages are presented to the user as Python exceptions. Results are read and provided to the user as structured NumPy arrays. This allows the vast number of data processing libraries available in the Python ecosystem to be used to process and evaluate the recorded data. Most notably, this includes NumPy [44], SciPy [45], and Matplotlib [46]. As a side effect, the latter allows rich publication-ready figures to be generated directly from analog circuit simulations.
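Building on the snippet above (and reusing its teststand, params, and nets objects), a Monte Carlo characterization of many virtual circuit instances could be parallelized roughly as follows. The mc_seed keyword and the analysis helper are assumptions for illustration; the published teststand interface may differ:

    from multiprocessing import Pool

    def characterize_instance(seed):
        # One virtual circuit instance: fix the MC seed, run the transient analysis,
        # and extract a figure of merit from the recorded trace. The mc_seed keyword
        # and extract_membrane_time_constant() are assumed placeholders.
        tran = TransientAnalysis('tran', 1e-3)
        simulation = Simulation([tran], params, save=nets, mc_seed=seed)
        result = teststand.simulate(simulation)
        return extract_membrane_time_constant(result)

    if __name__ == '__main__':
        with Pool(processes=8) as pool:
            time_constants = pool.map(characterize_instance, range(100))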

4.2 Monte Carlo Calibration of AdEx Neuron Circuits As shown in Fig. 6.12a, we used the teststand library, inter alia, for the verification of our AdEx design. The model equations feature a high-dimensional parameter space, allowing for a wide range of behaviors. Our circuit, on the other hand, is parameterized through 24 individual analog bias sources and a set of digital controls. Starting from first-order models of the utilized subcomponents, we characterized the circuit’s dynamics through a set of measurements on the full neuron circuit. With the results stored in a database, we established a transformation between the circuits’ and the models’ parameter spaces. The influence of mismatch effects manifests itself in deviations in these calibration curves for individual neuron instances. We applied the above framework for a large number of neuron incarnations, obtained by fixing the respective MC seeds. The circuit was benchmarked against multiple firing patterns, such as transient spiking, regular bursting, and initial bursting [47]. For each of these targets, a set of

Fig. 6.12 MC calibration workflow of an AdEx neuron circuit using teststand. (a) Testbench (neuron, synapses, parameter storage) and overview of the software stack for characterization and calibration (testbench abstraction layer, measurements and characterization, database, lookup, benchmarks). (b) Membrane traces (red) of a neuron circuit simulation configured for regular bursting, transient spiking, and initial bursting over 500 µs; the results of a numerical integration of the AdEx equations are shown as a reference (gray)

biases, corresponding to the respective parameter set from the literature, was determined through a reverse lookup based on the above transformations. Exemplary results for a single neuron simulation are shown in Fig. 6.12b. The presented approach enforces the development of calibration algorithms before tapeout. Especially for circuits with large parameter spaces, multidimensional dependencies may occur that are hard to resolve. The strategy might also reveal an insufficient parametrization that is not necessarily apparent from individual unit tests. In order to uncover potential regressions due to modifications of a circuit, simulations based on teststand can easily be automated and allow continuous integration testing for full-custom designs.
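For reference, a minimal forward-Euler integration of the AdEx equations of the kind used as the gray reference traces in Fig. 6.12b could look as follows. The parameter values are generic illustrative choices, not the calibrated circuit targets or the exact reference implementation:

    import numpy as np

    def adex_step(v, w, i_ext, p, dt):
        # One forward-Euler step of the AdEx equations [25].
        dv = (-p['gL'] * (v - p['EL'])
              + p['gL'] * p['DT'] * np.exp((v - p['VT']) / p['DT'])
              - w + i_ext) / p['C']
        dw = (p['a'] * (v - p['EL']) - w) / p['tau_w']
        v, w = v + dt * dv, w + dt * dw
        if v >= p['V_spike']:             # spike: reset membrane, increment adaptation
            v, w = p['V_reset'], w + p['b']
        return v, w

    # Illustrative parameters only (not calibrated circuit targets)
    p = dict(C=200e-12, gL=10e-9, EL=-70e-3, VT=-50e-3, DT=2e-3,
             tau_w=100e-3, a=2e-9, b=50e-12, V_spike=0.0, V_reset=-58e-3)
    v, w, dt = p['EL'], 0.0, 1e-5
    trace = []
    for _ in range(50000):                # 0.5 s of model time, 500 pA step current
        v, w = adex_step(v, w, 500e-12, p, dt)
        trace.append(v)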

5 Conclusion The development and implementation of the presented second-generation BrainScaleS architecture will hopefully continue during the coming years. The outcome we hope for is a multi-wafer system, constructed from hundreds of 30 cm silicon wafers, each one directly embedded in a printed circuit board (PCB) and all of them interconnected to form a novel large-scale analog neuromorphic platform, capable of answering questions about learning and development in large-scale, biologically realistic neural networks. Utilizing standard Complementary Metal-Oxide-Semiconductor (CMOS) technology to build large-scale analog accelerated neuromorphic hardware systems places our approach midway between the two major research directions for


AI circuits: digital accelerators and novel persistent memory devices. It presents a complementary option to these technologies. Compared to systems based on novel device technology, it has advantages like the high operational speed, low energy requirements for learning, the possibility to use any standard CMOS process without regard to back-end-of-line compatibility, and the capability to replicate relevant biological structures more easily. In comparison to digital implementations like Loihi or SpiNNaker [48, 49], the fully analog implementation of complex neural structures combined with true in-memory computing allows for time-continuous emulation of neural dynamics and much higher emulation speed at similar energy efficiencies. Most importantly, analog CMOS implementations might be the essential step to uncover the learning rules needed to cope with substrate variations. In our systems, the local learning rules not only train the system to perform a certain task but simultaneously adjust the operating point of the circuits and compensate fixed-pattern noise [23]. This will be an essential property for future novel computing systems based on advanced device technologies as well, since they are all expected to have substantially increased device-to-device variations. We hope that our BSS platform will be useful for the development of robust local learning algorithms in the near future. In the short term, the BSS system allows the combination of energy- and cost-efficient analog inference with local learning rules for a multitude of practical applications, scaling from small systems for edge computing up to high-performance neuromorphic cloud computing. Acknowledgments The authors wish to express their gratitude to Andreas Grübl, Yannik Stradmann, Vitali Karasenko, Korbinian Schreiber, Christian Pehle, Ralf Achenbach, Markus Dorn, and Aron Leibfried for their invaluable help and active contributions in the development of the BrainScaleS 2 ASICs and systems. They are not forgetting the important role their former colleagues Andreas Hartel, Syed Aamir, Gerd Kiene, Matthias Hock, Simon Friedmann, Paul Müller, Laura Kriener, and Timo Wunderlich had in these endeavors. They also want to thank their collaborators Sebastian Höppner from TU Dresden and Tugba Demirci from EPFL Lausanne for their contributions to the BrainScaleS 2 prototype ASIC. Very special thanks go to Eric Müller, Arne Emmel, Philipp Spilger, and the whole software development team, as well as Mihai Petrovici, Sebastian Schmitt, and the late Karlheinz Meier for their invaluable advice. This work has received funding from the European Union Seventh Framework Programme ([FP7/2007-2013]) under grant agreement no 604102 (HBP ramp-up), 269921 (BrainScaleS), 243914 (Brain-i-Nets), the Horizon 2020 Framework Programme ([H2020/2014-2020]) under grant agreement 720270 and 785907 (HBP SGA1 and SGA2), as well as from the Manfred Stärk Foundation.

Author Contribution J.S. created the concept, has been the lead architect of the BSS systems, and wrote the manuscript except for Sect. 4, which was written by S.B. S.B. also created the


teststand software and conceived the simulations jointly with P.D., who performed the simulations and prepared the results. J.W. performed the measurements for the HAGEN mode and created Fig. 6.10. All authors edited the manuscript together.

References 1. J. Schemmel, D. Brüderle, A. Grübl, M. Hock, K. Meier, S. Millner, A wafer-scale neuromorphic hardware system for large-scale neural modeling, in Proceedings of the 2010 IEEE International Symposium on Circuits and Systems (ISCAS) (2010), pp. 1947–1950 2. G. Indiveri, B. Linares-Barranco, T.J. Hamilton, A. van Schaik, R. Etienne-Cummings, T. Delbruck, S.-C. Liu, P. Dudek, P. Häfliger, S. Renaud, J. Schemmel, G. Cauwenberghs, J. Arthur, K. Hynna, F. Folowosele, S. Saighi, T. Serrano-Gotarredona, J. Wijekoon, Y. Wang, K. Boahen, Neuromorphic silicon neuron circuits. Front. Neurosci. 5(0), 2011. http://www. frontiersin.org/Journal/Abstract.aspx?s=755&name=neuromorphicengineering&ART_DOI= 10.3389/fnins.2011.00073 3. B.V. Benjamin, P. Gao, E. McQuinn, S. Choudhary, A.R. Chandrasekaran, J.-M. Bussat, R. Alvarez-Icaza, J.V. Arthur, P.A. Merolla, K. Boahen, Neurogrid: a mixed-analog-digital multichip system for large-scale neural simulations. Proc. IEEE 102(5), 699–716 (2014) 4. R. Douglas, M. Mahowald, C. Mead, Neuromorphic analogue VLSI. Annu. Rev. Neurosci. 18, 255–281 (1995) 5. J. Schemmel, A. Grübl, K. Meier, E. Muller, Implementing synaptic plasticity in a VLSI spiking neural network model, in Proceedings of the 2006 International Joint Conference on Neural Networks (IJCNN) (IEEE Press, Piscataway, 2006) 6. J. Schemmel, D. Brüderle, K. Meier, B. Ostendorf, Modeling synaptic plasticity within networks of highly accelerated I&F neurons, in Proceedings of the 2007 IEEE International Symposium on Circuits and Systems (ISCAS) (IEEE Press, Piscataway, 2007), pp. 3367–3370 7. K. Zoschke, M. Güttler, L. Böttcher, A. Grübl, D. Husmann, J. Schemmel, K. Meier, O. Ehrmann, Full wafer redistribution and wafer embedding as key technologies for a multiscale neuromorphic hardware cluster, in 2017 IEEE 19th Electronics Packaging Technology Conference (EPTC) (IEEE, Piscataway, 2017), pp. 1–8 8. S. Millner, A. Grübl, K. Meier, J. Schemmel, M.-O. Schwartz, A VLSI implementation of the adaptive exponential integrate-and-fire neuron model, in Advances in Neural Information Processing Systems, vol. 23, ed. by J. Lafferty, C.K.I. Williams, J. Shawe-Taylor, R. Zemel, A. Culotta (ACM, New York, 2010), pp. 1642–1650 9. T. Pfeil, A. Grübl, S. Jeltsch, E. Müller, P. Müller, M.A. Petrovici, M. Schmuker, D. Brüderle, J. Schemmel, K. Meier, Six networks on a universal neuromorphic computing substrate. Front. Neurosci. 7, 11 (2013). http://www.frontiersin.org/neuromorphic_engineering/10.3389/fnins. 2013.00011/abstract 10. A.P. Davison, D. Brüderle, J. Eppler, J. Kremkow, E. Muller, D. Pecevski, L. Perrinet, P. Yger, PyNN: a common interface for neuronal network simulators. Front. Neuroinform. 2, 11 (2008) 11. D. Brüderle, M.A. Petrovici, B. Vogginger, M. Ehrlich, T. Pfeil, S. Millner, A. Grübl, K. Wendt, E. Müller, M.-O. Schwartz, D. de Oliveira, S. Jeltsch, J. Fieres, M. Schilling, P. Müller, O. Breitwieser, V. Petkov, L. Muller, A. Davison, P. Krishnamurthy, J. Kremkow, M. Lundqvist, E. Muller, J. Partzsch, S. Scholze, L. Zühl, C. Mayr, A. Destexhe, M. Diesmann, T. Potjans, A. Lansner, R. Schüffny, J. Schemmel, K. Meier, A comprehensive workflow for generalpurpose neural modeling with highly configurable neuromorphic hardware systems. Biol. Cybern. 104, 263–296 (2011). https://doi.org/10.1007/s00422-011-0435-9 12. J. Schemmel, L. Kriener, P. Müller, K. Meier, An accelerated analog neuromorphic hardware system emulating NMDA-and calcium-based non-linear dendrites. Preprint, arXiv:1703.07286 (2017)


13. C.S. Thakur, J.L. Molin, G. Cauwenberghs, G. Indiveri, K. Kumar, N. Qiao, J. Schemmel, R. Wang, E. Chicca, J. Olson Hasler, et al., Large-scale neuromorphic spiking array processors: A quest to mimic the brain. Front. Neurosc. 12, 891 (2018) 14. S. Friedmann, J. Schemmel, A. Grübl, A. Hartel, M. Hock, K. Meier, Demonstrating hybrid learning in a flexible neuromorphic hardware system. IEEE Trans. Biomed. Circuits Syst. 11(1), 128–142 (2017) 15. S.A. Aamir, P. Müller, A. Hartel, J. Schemmel, K. Meier, A highly tunable 65-nm CMOS LIF neuron for a large-scale neuromorphic system, in Proceedings of IEEE European Solid-State Circuits Conference (ESSCIRC) (2016) 16. S.A. Aamir, Y. Stradmann, P. Müller, C. Pehle, A. Hartel, A. Grübl, J. Schemmel, K. Meier, An accelerated LIF neuronal network array for a large-scale mixed-signal neuromorphic architecture. IEEE Trans. Circuits Syst. I Reg. Pap. 65(12), 4299–4312 (2018) 17. S. Friedmann, J. Schemmel, A. Grübl, A. Hartel, M. Hock, K. Meier, Demonstrating hybrid learning in a flexible neuromorphic hardware system. IEEE Trans. Biomed. Circuits Syst. 11(1), 128–142 (2017) 18. M. Hock, A. Hartel, J. Schemmel, K. Meier, An analog dynamic memory array for neuromorphic hardware, in 2013 European Conference on Circuit Theory and Design (ECCTD), Sept 2013, pp. 1–4 19. M. Tsodyks, H. Markram, The neural code between neocortical pyramidal neurons depends on neurotransmitter release probability. Proc. Natl. Acad. Sci. USA 94, 719–723 (1997) 20. T. Pfeil, J. Jordan, T. Tetzlaff, A. Grübl, J. Schemmel, M. Diesmann, K. Meier, The effect of heterogeneity on decorrelation mechanisms in spiking neural networks: a neuromorphichardware study. Preprint, arXiv:1411.7916 (2014) 21. J. Jordan, M.A. Petrovici, O. Breitwieser, J. Schemmel, K. Meier, M. Diesmann, T. Tetzlaff, Deterministic networks for probabilistic computing. Sci. Rep. 9(1), 1–17 (2019) 22. G. Kiene, Mixed-signal neuron and readout circuits for a neuromorphic system. Master thesis, Universität Heidelberg, 2017 23. S. Billaudelle, Design and implementation of a short term plasticity circuit for a 65 nm neuromorphic hardware system. Masterarbeit, Universität Heidelberg, 2017 24. S. Billaudelle, B. Cramer, M.A. Petrovici, K. Schreiber, D. Kappel, J. Schemmel, K. Meier, Structural plasticity on an accelerated analog neuromorphic hardware system. Preprint, arXiv:1912.12047 (2019) 25. R. Brette, W. Gerstner, Adaptive exponential integrate-and-fire model as an effective description of neuronal activity. J. Neurophysiol. 94, 3637–3642 (2005) 26. S. Millner, Development of a multi-compartment neuron model emulation. Ph.D. dissertation, University of Heidelberg, 2012 27. R. Jolivet, T.J. Lewis, W. Gerstner, Generalized integrate-and-fire models of neuronal activity approximate spike trains of a detailed model to a high degree of accuracy. J. Neurophysiol. 92(2), 959–976 (2004) 28. V. Thanasoulis, J. Partzsch, S. Hartmann, C. Mayr, R. Schüffny, Dedicated FPGA communication architecture and design for a large-scale neuromorphic system, in 2012 19th IEEE International Conference on Electronics, Circuits, and Systems (ICECS 2012) (IEEE, Piscataway, 2012), pp. 877–880 29. J. Schemmel, J. Fieres, K. Meier, Wafer-scale integration of analog neural networks, in Proceedings of the 2008 International Joint Conference on Neural Networks (IJCNN) (2008) 30. J. Schemmel, S. Hohmann, K. Meier, F. Schürmann, A mixed-mode analog neural network using current-steering synapses. Analog Integr. Circ. Sig. Process. 
38(2–3), 233–244 (2004) 31. J. Langeheine, M. Trefzer, D. Brüderle, K. Meier, J. Schemmel, On the evolution of analog electronic circuits using building blocks on a CMOS FPTA, in Proceedings of the Genetic and Evolutionary Computation Conference(GECCO2004) (2004) 32. S. Hohmann, J. Fieres, K. Meier, J. Schemmel, T. Schmitz, F. Schürmann, Training fast mixedsignal neural networks for data classification, in Proceedings of the 2004 International Joint Conference on Neural Networks (IJCNN’04) (IEEE Press, Piscataway, 2004), pp. 2647–2652


33. E. Nurse, B.S. Mashford, A.J. Yepes, I. Kiral-Kornek, S. Harrer, D.R. Freestone, Decoding EEG and LFP signals using deep learning: heading truenorth, in Proceedings of the ACM International Conference on Computing Frontiers (2016), pp. 259–266 34. S. Schmitt, J. Klähn, G. Bellec, A. Grübl, M. Güttler, A. Hartel, S. Hartmann, D. Husmann, K. Husmann, S. Jeltsch, V. Karasenko, M. Kleider, C. Koke, A. Kononov, C. Mauch, E. Müller, P. Müller, J. Partzsch, M.A. Petrovici, B. Vogginger, S. Schiefer, S. Scholze, V. Thanasoulis, J. Schemmel, R. Legenstein, W. Maass, C. Mayr, K. Meier, Classification with deep neural networks on an accelerated analog neuromorphic system. arXiv (2016) 35. J. Göltz, A. Baumbach, S. Billaudelle, O. Breitwieser, D. Dold, L. Kriener, A.F. Kungl, W. Senn, J. Schemmel, K. Meier, et al., Fast and deep neuromorphic learning with time-tofirst-spike coding. Preprint, arXiv:1912.11443 (2019) 36. A. Shawahna, S.M. Sait, A. El-Maleh, FPGA-based accelerators of deep learning networks for learning and classification: a review. IEEE Access 7, 7823–7859 (2018) 37. P. Sharma, A. Singh, Era of deep neural networks: a review, in 2017 8th International Conference on Computing, Communication and Networking Technologies (ICCCNT) (IEEE, Piscataway, 2017), pp. 1–5 38. Y. LeCun, C. Cortes, The MNIST database of handwritten digits (1998) 39. M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, X. Zheng, TensorFlow: large-scale machine learning on heterogeneous distributed systems (2015). http:// download.tensorflow.org/paper/whitepaper2015.pdf 40. J. Weis, P. Spilger, S. Billaudelle, Y. Stradmann, A. Emmel, E. Müller, O. Breitwieser, A. Grübl, J. Ilmberger, V. Karasenko, M. Kleider, C. Mauch, K. Schreiber, J. Schemmel, Inference with artificial neural networks on analog neuromorphic hardware, in IoT Streams for Data-Driven Predictive Maintenance and IoT, Edge, and Mobile for Embedded Machine Learning (Springer International Publishing, Cham, 2020), pp. 201–212 41. P. Spilger, E. Müller, A. Emmel, A. Leibfried, C. Mauch, C. Pehle, J. Weis, O. Breitwieser, S. Billaudelle, S. Schmitt, T.C. Wunderlich, Y. Stradmann, J. Schemmel, hxtorch: PyTorch for BrainScaleS-2 — perceptrons on analog neuromorphic hardware, in IoT Streams for DataDriven Predictive Maintenance and IoT, Edge, and Mobile for Embedded Machine Learning (Springer International Publishing, Cham, 2020), pp. 189–200 42. Y. Stradmann, S. Billaudelle, O. Breitwieser, F.L. Ebert, A. Emmel, D. Husmann, J. Ilmberger, E. Müller, P. Spilger, J. Weis, J. Schemmel, Demonstrating analog inference on the brainscales2 mobile system (2021) 43. A. Grübl, S. Billaudelle, B. Cramer, V. Karasenko, J. Schemmel, Verification and design methods for the brainscales neuromorphic hardware system. Preprint (2020). http://arxiv.org/ abs/2003.11455 44. T.E. Oliphant, A Guide to NumPy, vol. 1 (Trelgol Publishing, New York, 2006) 45. E. Jones, T. Oliphant, P. Peterson, SciPy: open source scientific tools for Python (2001). http:// www.scipy.org/ 46. J.D. Hunter, Matplotlib: a 2d graphics environment. Comput. Sci. Eng. 9(3), 90–95 (2007) 47. R. Naud, N. Marcille, C. Clopath, W. 
Gerstner, Firing patterns in the adaptive exponential integrate-and-fire model. Biol. Cybern. 99(4), 335–347 (2008). https://doi.org/10.1007/ s00422-008-0264-7 48. M. Davies, N. Srinivasa, T.-H. Lin, G. Chinya, Y. Cao, S.H. Choday, G. Dimou, P. Joshi, N. Imam, S. Jain, et al., Loihi: a neuromorphic manycore processor with on-chip learning. IEEE Micro 38(1), 82–99 (2018) 49. S.B. Furber, F. Galluppi, S. Temple, L.A. Plana, The spinnaker project. Proc. IEEE 102(5), 652–665 (2014)

Part II

Current, Voltage, and Temperature Sensors

The second part of this book discusses the development of smart sensors for current, voltage, and temperature measurements, describing the implementation issues and the trade-offs imposed by the requirements of specific applications and technologies.

In Chap. 7, Andreas Kucher deals with advancements in current sensing for future automotive applications, where diagnosis and security of power applications are becoming more and more important, with attention to cost optimization at system level, to the new autonomous-driving requirements, and to efficiency. The key point of smart power management is current-sensing accuracy in a hostile automotive environment. This means that the optimization is carried out at system level, which increases the development complexity. Finally, future innovation is discussed in terms of digitalization implemented within the power transistor device itself.

In Chap. 8, Paul Walsh discusses current-sense interfaces for IoT applications. An overview of current-sensing interfaces for multi-sense applications is given. Moreover, the design case of a versatile ratio-metric current-sensing front end that supports capacitive, resistive, photoelectric, and pyroelectric sensing is proposed. For capacitive sensing, the device achieves 17.5 ENOB over a wide capacitance range up to 200 pF. The system is based on an incremental zoom ADC in the front end that is configurable to support multi-sense operation.

In Chap. 9, Ivan O'Connell addresses the challenges in realizing high-precision ADCs in 28 nm and below. Since Sigma-Delta architectures appear incompatible with the reduced voltage rails and transistor gain, Successive Approximation Register (SAR) ADC architectures prove suitable for consistently and reliably achieving 12+ bit performance in scaled technologies. Among them, Noise-Shaped SAR ADCs leverage the advantages of both Sigma-Delta and SAR architectures, but they require further technical developments, as discussed in the chapter.

In Chap. 10, Eugenio Cantatore studies the unusual limits of sensor interfaces, for instance in terms of energy and cost, imposed by IoT applications, which often rely on miniature batteries for power supply but demand only limited accuracy. In these cases, rethinking sensor-interface architectures is necessary to achieve minimum average power and energy per measurement, while offering the possibility of reading different types of sensors at the same time. This also enables innovative IoT applications with unconventional form factors and new fabrication technologies, such as intelligent sensors printed on the packaging of perishable goods to monitor their keeping quality, at a cost comparable to the graphic printing of labels.

In Chap. 11, Matthias Eberlein describes the implementation of thermal-sensor and reference circuits in a 16 nm FinFET technology. The author introduces a new concept to generate PTAT and CTAT voltages precisely through switched-capacitor operation, exploiting the active bulk diode forward-biased by a charge pump. During capacitor discharge, the respective pn-junction voltages are sampled at different time points and combined by charge-sharing techniques. Using this concept, a bandgap reference is developed which reaches an untrimmed accuracy of ±0.73% (3σ) while consuming only 21 nA. Moreover, a thermal sensor with an 8-bit SAR readout is proposed, which achieves a precision of ≤2 °C without calibration on 2500 μm² of silicon area.

In Chap. 12, Sining Pan deals with resistor-based temperature sensors. The discussion opens with an overview of resistor-based sensors, with a focus on their energy efficiency. The theoretical energy-efficiency limit of resistor-based sensors is determined and compared to that of traditional BJT-based sensors. Moreover, a review of the different types of resistor-based sensors is given. Finally, the design case of a high-resolution Wheatstone-bridge sensor is presented. Exploiting a readout with a continuous-time Delta-Sigma modulator, the overall sensor achieves state-of-the-art energy efficiency, with a resolution FoM of 10 fJ·K², approaching the theoretical energy-efficiency limit.

Chapter 7

Advancements in Current Sensing for Future Automotive Applications

Andreas Kucher, Andrea Baschirotto, and Paolo Del Croce

Abstract In automotive power applications, diagnosis and security are becoming more and more important. The drivers for this development are cost optimization at system level, new requirements arising from autonomous driving, and efficiency. The basis for smart power management is accurate current sensing in a rough automotive environment. In many cases optimization at system level is needed; optimization at device level alone is no longer enough. This chapter describes the methods and innovative improvements of today's solutions to support upcoming requirements in accuracy and safety.

1 Introduction

Measuring a current in a vehicle does not seem too complicated, yet the real application environment is rough and reliability must be high. First, we look at the motivation for measuring currents. In the past [1], the main motivation was to detect whether a load is present (open-load detection) and whether the current is within its proper range. The second motivation was protection of the system to avoid fire hazards, realized on device level by overcurrent, overvoltage, and overtemperature protection; to protect wires and loads, the load current has to be measured. Today the situation is quite different. Car manufacturers are striving for autonomous vehicles, and this changes the requirements dramatically. Protection and diagnosis are no longer enough: autonomous systems must know all currents in the car at any time in order to detect any deviation from the expected values. Waiting for a failure, as in the past, is no longer an option.


Fig. 7.1 Typical application drawing

Passengers expect to be given a safe ride when travelling in an autonomous vehicle, which means the system must be able to fail in a safe way. Finally, all these changes make a car cleaner, safer, and smarter. The basic system looks similar to that of many years ago [1], but the requirements and the implementation have changed significantly over the last years. Figure 7.1 shows on the left-hand side the battery of the vehicle. The ECU (Electronic Control Unit) typically contains a microcontroller, an intelligent power switch, and a bus interface to communicate with the rest of the vehicle.

2 Current Sensing

2.1 Classical Current Sensing

A classical concept for current sensing is to replicate the load current, divided by a fixed geometric ratio, into a proportional sense current. The concept is based on regulating the voltage at the source of the sensing transistor M2 so that it equals the voltage at the source of the power transistor M1. Thus M1 and M2 have the same gate-source voltage VGS, and with a fixed geometric ratio from M1 to M2 the currents follow the same ratio. The concept in Fig. 7.2 has two dominating error sources: the offset of the operational amplifier and the geometric matching of the sense transistor M2 to the power transistor M1. Once switched on, the gate-source voltage of M1 and M2 is high and both transistors operate in the linear region. The load current in the linear region is:


Fig. 7.2 Classical current sensing concept

$$I_{\mathrm{load}} = \beta_{\mathrm{load}} \left[ \left(V_{GS,\mathrm{load}} - V_{th}\right) \cdot V_{DS,\mathrm{load}} - \frac{V_{DS,\mathrm{load}}^2}{2} \right] \qquad (7.1)$$

where β is a parameter depending on the technology and the geometric dimensions of the transistor. In the ideal case the sense current is:

$$I_{\mathrm{sense}} = \beta_{\mathrm{sense}} \left[ \left(V_{GS,\mathrm{sense}} - V_{th}\right) \cdot V_{DS,\mathrm{sense}} - \frac{V_{DS,\mathrm{sense}}^2}{2} \right] \qquad (7.2)$$

From Eqs. (7.1) and (7.2) the ratio of load to sense current (KILIS) is:

$$K_{ILIS} = \frac{I_{\mathrm{load}}}{I_{\mathrm{sense}}} = \frac{\beta_{\mathrm{load}}}{\beta_{\mathrm{sense}}} \cdot \frac{V_{ov} \cdot V_{DS,\mathrm{load}} - \frac{V_{DS,\mathrm{load}}^2}{2}}{V_{ov} \cdot V_{DS,\mathrm{sense}} - \frac{V_{DS,\mathrm{sense}}^2}{2}} \qquad (7.3)$$

The VDS,sense of the sense transistor M2 is regulated by an operational amplifier. This operational amplifier has an offset voltage (VOS), which makes the VDS of M2 differ from the VDS of M1:

$$V_{DS,\mathrm{load}} = V_{DS,\mathrm{sense}} \pm V_{os} \qquad (7.4)$$

$$K_{ILIS} = \frac{I_{\mathrm{load}}}{I_{\mathrm{sense}}} = \frac{\beta_{\mathrm{load}}}{\beta_{\mathrm{sense}}} \cdot \frac{1}{1 \pm V_{os}/V_{DS,\mathrm{load}}} \qquad (7.5)$$

As seen in Eq. (7.5), the offset voltage VOS and the drain-source voltage of the power transistor M1 (VDS,load) are the two important parameters in this consideration. By limiting VDS,load to a given minimum voltage this error can be reduced, as described in the next section.
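To make the effect of Eq. (7.5) concrete, the short Python sketch below evaluates the ratio error for a few VDS,load values at a fixed offset. The 1 mV offset and the VDS values are illustrative assumptions, not figures taken from this chapter.

```python
# Quick numerical check of Eq. (7.5): relative KILIS error caused by the
# op-amp offset voltage. All numeric values are illustrative assumptions.
def kilis_error_percent(v_os, v_ds_load):
    """Relative KILIS error (%) for the worst-case sign of the offset."""
    return (1.0 / (1.0 - v_os / v_ds_load) - 1.0) * 100.0

for v_ds in (10e-3, 40e-3, 200e-3):                        # power-transistor V_DS [V]
    err = kilis_error_percent(v_os=1e-3, v_ds_load=v_ds)   # assumed 1 mV offset
    print(f"V_DS,load = {v_ds*1e3:5.0f} mV -> ratio error ~ {err:4.1f} %")
```

The numbers illustrate why clamping VDS,load to a minimum of, for example, 40 mV keeps the offset-induced error in the low-percent range.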

2.2 Improvement of Classical Current Sensing

Figure 7.3 shows an improvement of the concept of Fig. 7.2: an additional operational amplifier limits the minimum VDS,load, for example to 40 mV, by reducing the gate-source voltage VGS at low currents. The current ratio remains as in Eq. (7.5):

$$K_{ILIS} = \frac{I_{\mathrm{load}}}{I_{\mathrm{sense}}} = \frac{\beta_{\mathrm{load}}}{\beta_{\mathrm{sense}}} \cdot \frac{1}{1 \pm V_{os}/V_{DS,\mathrm{load}}} \qquad (7.6)$$

Now the geometrical error of the power-to-sense transistor ratio has to be considered. At high load currents, the standard deviation σ [2] depends on the mismatch of the target current ratio versus the geometrical ratio and is calculated as follows:

$$\sigma^2\!\left(\frac{K_{ILIS}}{K_{geo}}\right) = \frac{\sigma^2(\Delta\mu_n)}{\mu_n^2} + \frac{\sigma^2(\Delta V_{th})}{\left(V_{GS}-V_{th}\right)^2} + \frac{\sigma^2(V_{\mathrm{offset}})}{V_{DS}^2} \qquad (7.7)$$

At low load currents the gate-source voltage VGS has to be reduced until the minimum drain-source voltage (VDS,load) is reached; VGS then drops down to the range of the threshold voltage Vth.


Fig. 7.3 Current sensing concept with limitation of minimum VDS

At this low gate-source voltage VGS the drain current has an exponential behavior, which makes the threshold-voltage mismatch of M1 to M2 the dominant error source. Equation (7.8) shows the drain current at low current:

$$I_{DS} = I_0 \cdot \exp\!\left(\frac{V_{GS}-V_{th}}{n\,V_T}\right) \qquad (7.8)$$

where I0 is the drain current at VGS = Vth and VT is the thermal voltage. The standard deviation is then as described in Eq. (7.9):

$$\sigma^2\!\left(\frac{K_{ILIS}}{K_{geo}}\right) = \left(\frac{1}{n\,V_T}\right)^{2} \sigma^2(\Delta V_{th}) + \left[\frac{1}{V_T\left(\exp\!\left(\frac{V_{DS}}{V_T}\right)-1\right)}\right]^{2} \sigma^2(V_{\mathrm{offset}}) \qquad (7.9)$$

Any mismatch between the power and sense transistors now has a significant impact on current sensing. This limits the accuracy of the concept of Fig. 7.3 at lower current levels.
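The low-current behavior of Eq. (7.9) can be illustrated with a small numerical sketch. The mismatch and bias numbers below are assumptions chosen only to show that the Vth-mismatch term dominates once the transistors operate near threshold; they are not device data from this chapter.

```python
import math

# Evaluation of Eq. (7.9): spread of KILIS/Kgeo at low currents, where the
# drain current is exponential in V_GS. All parameter values are assumptions.
n, V_T = 1.5, 0.026        # subthreshold slope factor, thermal voltage [V]
sigma_dvth = 2e-3          # assumed sigma of V_th mismatch M1 vs. M2 [V]
sigma_vos  = 0.5e-3        # assumed sigma of the op-amp offset [V]
V_DS       = 40e-3         # regulated minimum drain-source voltage [V]

term_vth = (1.0 / (n * V_T)) ** 2 * sigma_dvth ** 2
term_vos = (1.0 / (V_T * (math.exp(V_DS / V_T) - 1.0))) ** 2 * sigma_vos ** 2
sigma_ratio = math.sqrt(term_vth + term_vos)

print(f"V_th mismatch contribution : {math.sqrt(term_vth)*100:.1f} %")
print(f"offset contribution        : {math.sqrt(term_vos)*100:.2f} %")
print(f"total sigma(KILIS/Kgeo)    : {sigma_ratio*100:.1f} %")
```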


Fig. 7.4 Current sensing concept with limitation of minimum VDS and improved operational amplifier

2.3 From Linear to Switched Concepts

For further system improvement, the offset voltage of the operational amplifier has to become significantly smaller. Both architectural innovations and technology changes have enabled new concepts. Figure 7.4 shows an example that uses a chopper amplifier [2]. This concept combines a very low offset voltage with the VDS limitation.
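As a rough intuition for why chopping suppresses offset, the following Python sketch modulates the input with a ±1 chopper clock, adds a static amplifier offset, demodulates, and averages. The gain, offset, and signal values are arbitrary assumptions used only for illustration; they are not parameters of the circuit in Fig. 7.4.

```python
import numpy as np

# Minimal behavioral sketch of offset suppression by chopping.
# All numeric values are illustrative assumptions.
n_samples = 1000
v_in  = 2e-3 * np.ones(n_samples)     # DC sense voltage to be amplified [V]
v_os  = 1e-3                          # static amplifier offset [V]
gain  = 100.0
chop  = np.where(np.arange(n_samples) % 2 == 0, 1.0, -1.0)  # chopper clock

v_amp = gain * (chop * v_in + v_os)   # offset adds after the input modulation
v_out = chop * v_amp                  # demodulation shifts the offset to f_chop
print(f"plain amplifier output : {np.mean(gain * (v_in + v_os)) * 1e3:.0f} mV")
print(f"chopped and averaged   : {np.mean(v_out) * 1e3:.0f} mV "
      f"(ideal: {gain * v_in[0] * 1e3:.0f} mV)")
```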


2.4 Current Sensing Goes Digital

As digitalization penetrates deeper and deeper into electronic control units, the next evolutionary step is to change from analog to digital feedback. The big advantages of digital feedback are the easy interface to the microcontroller and the flexible implementation of input filtering by means of a tracking mechanism. However, such a step towards digitalization requires an ADC interface implemented in a power-optimized BCD technology. Such a technology offers capacitors with relatively poor matching and linearity and has a limited number of metal layers; it is therefore not ideally suited for high-accuracy ADCs. The automotive environment thus presents new challenges for mixed-signal designers, who are required to explore new high-performance solutions in a process that is not optimized for analog design.

Detailed analysis of the system shows that a 10-bit ADC is needed to cover the full current range from 100 mA up to 100 A and also to implement single-bit functions like open-load detection. Moreover, monotonicity is mandatory for the ADC algorithms and the current-measurement processing. In this scenario a current-mode ADC is chosen, since it can be implemented using only MOS devices whose proper sizing guarantees the accuracy and monotonicity requirements, thanks to careful design of the current-mirror building blocks.

The overall system [3], shown in Fig. 7.5, is similar to the analog version of Fig. 7.4. The main difference is that the current feedback loop is closed by a digital path: the low-offset amplifier fixing the VDS of the two KILIS devices is replaced by a comparator embedded in a tracking loop. As soon as a VDS difference is present, the comparator drives an UP/DOWN command to the tracking register, which controls a current-steering DAC whose output current balances the current from the KILIS module and nulls the voltage at the comparator input. A sufficiently high sampling frequency and a small hysteresis allow proper operation of the loop. The UP/DOWN register gives the digital representation of the analog current and can be used for digital signal processing; proper thresholds on this register allow easy implementation of basic functions such as open-load detection.

The key decision is then whether to implement this digital signal processing in an external microcontroller, with large versatility but higher cost, or internally, with limited versatility also due to the low efficiency of digital design in such a power process. It has been demonstrated that some digital functions of limited complexity can be implemented directly in the BCD device with careful design, taking into account the application specification and the limited digital-design capability of such a technology.
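A purely behavioral sketch of such a tracking loop is given below: a comparator decision drives an up/down counter whose code steers a DAC current until it balances the (quantized) load current, after which the code toggles around the target by one LSB, which is the ripple the concept has to manage. The code width, step count, and target value are assumptions chosen for illustration only.

```python
# Behavioral sketch of the up/down tracking loop of Fig. 7.5.
# All parameters (code width, step count, target) are illustrative assumptions.
def track_current(target_lsb, steps=64, code=0, code_max=1023):
    """Return the counter codes produced while tracking a load current
    expressed in DAC LSBs."""
    history = []
    for _ in range(steps):
        # Comparator decision: does the DAC current still undershoot the load current?
        if code < target_lsb:
            code = min(code + 1, code_max)   # UP command
        else:
            code = max(code - 1, 0)          # DOWN command
        history.append(code)
    return history

codes = track_current(target_lsb=37)
print(codes[:8])    # ramp-up phase
print(codes[-6:])   # settled: toggles around the target by one LSB
```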


Fig. 7.5 Digital current sensing concept [3]

One example is the evaluation of the temperature of the wire in which the current is flowing. The wire current-temperature-time relationship is expressed by:

$$\Delta T = \frac{I_{\mathrm{load}}^2\,\alpha}{A}\left(1 - e^{-t/\tau}\right)$$

where ΔT [K] is the temperature variation, t [s] is the time, α [K·m²/A²] is a parameter depending on the thermal and geometric characteristics of the wire, A [m²] is the wire cross-section, and τ [s] is the thermal time constant. This operation can be implemented efficiently in a digital section realized entirely in the power-optimized technology, permitting a single-chip device that performs current measurement, open-load detection, and wire-temperature measurement. Proper and simple checks on the digital temperature value allow an active fuse to be implemented.
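A small sketch of how this relationship can be turned into an active-fuse check is shown below. The wire parameters (α, cross-section, thermal time constant) and the trip threshold are invented placeholder values, not data from this chapter.

```python
import math

# Active-fuse sketch based on the wire current-temperature-time relationship.
# All wire parameters and the trip threshold are illustrative assumptions.
ALPHA   = 2.0e-8   # thermal/geometric wire parameter [K*m^2/A^2]
AREA    = 0.5e-6   # wire cross-section [m^2]
TAU     = 10.0     # thermal time constant [s]
DT_TRIP = 60.0     # allowed temperature rise before switching off [K]

def wire_delta_t(i_load, t):
    """Temperature rise of the wire after carrying i_load [A] for t seconds."""
    return (i_load ** 2 * ALPHA / AREA) * (1.0 - math.exp(-t / TAU))

i_load = 50.0  # measured load current [A]
for t in (1.0, 5.0, 20.0):
    d_t = wire_delta_t(i_load, t)
    status = "-> switch off (active fuse)" if d_t > DT_TRIP else ""
    print(f"t = {t:4.1f} s: dT = {d_t:5.1f} K {status}")
```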

2.5 Impact to Future Designs

The digital concept described in Sect. 2.4 provides digital current information inside the device. With this information it is now possible to process data directly in the intelligent power switch instead of in the microcontroller.


This enables new features and functions to support new requirements such as autonomous driving. As a first step, designers need to move from analog to mixed-signal design methods; the challenge here is the use of technologies designed for power devices. A further challenge is to meet all the requirements for autonomous driving, such as functional safety, reliability, enhanced diagnosis, configurability, and, finally, connectivity. Digital current measurement is the basis for further system integration in many respects. Design, concept, and system engineers now have the possibility to gain an advantage in cost or functionality from these new features.

3 Conclusions

Initially we started with analog concepts and improved them continuously from one generation to the next; further concepts were adapted from one wafer technology to the next. Technology changes are needed for lower production cost and improved functionality. In recent years, new functional-safety and autonomous-driving requirements have appeared, and digital current measurement is the enabler for them. From a circuit point of view, digital concepts are challenging, but they provide many new opportunities in functionality and safety at system level.

References

1. A. Kucher, Protection and Diagnosis of Smart Power High-Side Switches in Automotive Applications (AACD, 2007)
2. A. Tranca, A. Mayer, Robust design of smart power ICs for automotive applications, ESSCIRC
3. A. D'Amico, Chr. Djelassi, D. Haerle, P. Del Croce, A. Baschirotto, A mixed-signal multifunctional system for current measurement and stress detection, PRIME conference, 2017

Chapter 8

Next Generation Current Sense Interfaces for the IoT Era

Paul Walsh, Oleksandr Kaprin, Andriy Maharyta, and Mark Healy

Abstract Key differentiators in today's MCU market are capabilities like low power, security, and sensor interface support. An on-chip voltage ADC is no longer enough to support sensor peripherals in the IoT era. Flexible sensor interfaces provide the opportunity for higher system-level integration with a small form factor at lower cost. This paper provides an overview of current sensing interfaces for multi-sense applications. A versatile ratio-metric current sensing front end that supports capacitance, resistance, photoelectric, and pyroelectric sensing is presented. In the capacitive sensing configuration, it achieves 17.5 ENOB over a wide capacitance range up to 200 pF. The converter consists of an incremental zoom ADC that is front-end configurable to support multi-sense operation.

1 Introduction

The Internet of Things (IoT) offers an environment where devices provide "smart" interfaces that can sense the world around them and coordinate a response without human interaction [1]. An example of an IoT connected device is shown in Fig. 8.1. The "intelligent" edge device is a PSoC™ (Programmable System-on-Chip) microcontroller which connects to a variety of sensors and is capable of relaying information securely to a cloud device for further data analytics. To serve the uncoordinated and exponential growth of the IoT, a flexible sensor interface becomes critical for highly integrated, low-cost solutions.


Fig. 8.1 Example microcontroller sensor interface in the IoT

This flexibility needs to take the form of multi-sensing capability (e.g. touch, temperature, pressure, and proximity) and dynamic configurability to optimize sampling speed, resolution, and power. In the current chapter we examine the requirements of current-sensing front ends in the IoT and present a flexible interface that supports current, capacitive, inductive, and resistive sensing.

2 Sensing Interfaces for IoT

2.1 Current Sensing

Examples of current sensors are shown in Fig. 8.2. A photodiode can measure both light intensity (Fig. 8.2a) and optical proximity (Fig. 8.2b). Optical proximity detection uses an infrared (IR) LED to drive a signal onto the object being detected; the magnitude of the reflected signal determines the proximity of the object. Ambient light can be 100 times greater than the reflected signal, so the front end needs a large dynamic range. Figure 8.2c shows an example schematic of a photodiode sensing system: the output current is fed into a transimpedance amplifier (TIA), converted to a voltage, and processed by a voltage-mode ADC.

A pyroelectric infrared (PIR) sensor detects infrared radiation. Wavelengths in the 6–14 μm range are characteristic of the human body and can be used to sense motion (Fig. 8.2d). A PIR sensor schematic with its signal-conditioning circuitry is shown in Fig. 8.2e. A 100 kΩ resistor converts the sensor output current to a voltage, which is filtered, amplified, and then converted by an ADC. The signal thus transfers from charge to current to voltage before conversion by the ADC.
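To get a feel for the dynamic-range requirement created by the 100:1 ambient-to-signal ratio, the short sketch below sizes a hypothetical TIA feedback resistor and the minimum ADC resolution. All numbers are assumptions for illustration; they are not taken from this chapter or from a specific device.

```python
import math

# Rough sizing sketch for the photodiode front end of Fig. 8.2c.
# All numeric values are illustrative assumptions.
i_signal    = 1e-6                 # reflected-signal photocurrent [A]
i_ambient   = 100 * i_signal       # ambient light ~100x the reflected signal [A]
v_fullscale = 1.2                  # assumed ADC full-scale input [V]

# TIA feedback resistor chosen so ambient + signal just fits full scale.
r_f = v_fullscale / (i_ambient + i_signal)
print(f"TIA feedback resistor : {r_f / 1e3:.1f} kOhm")

# ADC bits needed so one LSB sits ~10x below the reflected-signal amplitude.
lsb_target = (i_signal * r_f) / 10
bits = math.ceil(math.log2(v_fullscale / lsb_target))
print(f"ADC resolution needed : >= {bits} bits")
```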

Fig. 8.2 Current sensing. (a) Light intensity; (b) optical proximity; (c) photodiode sensing circuitry; (d) motion sensing; (e) PIR sensing circuit

2.2 Capacitive Sensing

Capacitive sensing is commonly used in human-machine interfaces (HMI). It consumes less power than alternative human interfaces such as voice [2] and is robust for HMI applications such as buttons, sliders, and touch screens. Figure 8.3a shows a self-capacitance button where one terminal of the capacitor is sensed. Figure 8.3b shows a mutual-capacitance touchscreen where both terminals of the capacitance are sensed by driving a transmit signal onto one and receiving it on the other. Other capacitive-sense applications include non-contact liquid-level sensing (Fig. 8.3c) as well as interfaces to MEMS-based sensors that use capacitive sensing to measure quantities such as pressure [3], humidity [4], and sound [5]. Capacitive proximity sensing can also be used in IoT radios for specific absorption rate (SAR) detection. The SAR standard defines how much RF radiation can be absorbed by human tissue; the FCC specifies 1.6 W/kg [6]. Phone vendors set requirements to detect human proximity to the antenna (e.g. 15 mm from an antenna in 1 mm displacements). A phone's antenna can be small (