English Pages 462 [446] Year 2022
Computer Architecture and Design Methodologies
Mohamed M. Sabry Aly Anupam Chattopadhyay Editors
Emerging Computing: From Devices to Systems Looking Beyond Moore and Von Neumann
Computer Architecture and Design Methodologies

Series Editors
Anupam Chattopadhyay, Nanyang Technological University, Singapore, Singapore
Soumitra Kumar Nandy, Indian Institute of Science, Bangalore, India
Jürgen Teich, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany
Debdeep Mukhopadhyay, Indian Institute of Technology Kharagpur, Kharagpur, West Bengal, India
The twilight zone of Moore’s law is affecting computer architecture design like never before. The strongest impact on computer architecture is perhaps the move from unicore to multicore architectures, represented by commodity architectures like general-purpose graphics processing units (GPGPUs). Besides that, the deep impact of application-specific constraints from emerging embedded applications is presenting designers with new, energy-efficient architectures like heterogeneous multi-core, accelerator-rich Systems-on-Chip (SoCs). These effects, together with the security, reliability, thermal and manufacturability challenges of nanoscale technologies, are forcing computing platforms to move towards innovative solutions. Finally, the emergence of technologies beyond conventional charge-based computing has led to a series of radical new architectures and design methodologies. The aim of this book series is to capture these diverse, emerging architectural innovations as well as the corresponding design methodologies. The scope covers the following.
• Heterogeneous multi-core SoC and their design methodology
• Domain-specific architectures and their design methodology
• Novel technology constraints, such as security, fault-tolerance and their impact on architecture design
• Novel technologies, such as resistive memory, and their impact on architecture design
• Extremely parallel architectures
More information about this series at https://link.springer.com/bookseries/15213
Editors Mohamed M. Sabry Aly School of Computer Science and Engineering Nanyang Technological University Singapore, Singapore
Anupam Chattopadhyay School of Computer Science and Engineering Nanyang Technological University Singapore, Singapore
ISSN 2367-3478 ISSN 2367-3486 (electronic) Computer Architecture and Design Methodologies ISBN 978-981-16-7486-0 ISBN 978-981-16-7487-7 (eBook) https://doi.org/10.1007/978-981-16-7487-7 © Springer Nature Singapore Pte Ltd. 2023 Chapters “Innovative Memory Architectures Using Functionality Enhanced Devices” and “Intelligent Edge Biomedical Sensors in the Internet of Things (IoT) Era” are licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/). For further details see license information in the chapters. This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. 
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore
Contents

Background
Trends in Computing and Memory Technologies
Mohamed M. Sabry Aly and Anupam Chattopadhyay

Devices and Models
Beyond-Silicon Computing: Nano-Technologies, Nano-Design, and Nano-Systems
Gage Hills
Innovative Memory Architectures Using Functionality Enhanced Devices
Levisse Alexandre Sébastien Julien, Xifan Tang, and Pierre-Emmanuel Gaillardon
Interconnect and Integration Technology
Yenai Ma, Biresh Kumar Joardar, Partha Pratim Pande, and Ajay Joshi
Nanomagnetic Logic: From Devices to Systems
Fabrizio Riente, Markus Becherer, and Gyorgy Csaba
Quantum Computing—An Emerging Computing Paradigm
Manas Mukherjee

Circuits and Architectures
A Modern Primer on Processing in Memory
Onur Mutlu, Saugata Ghose, Juan Gómez-Luna, and Rachata Ausavarungnirun
Neuromorphic Data Converters Using Memristors
Loai Danial, Parul Damahe, Purvi Agrawal, Ruchi Dhamnani, and Shahar Kvatinsky
Hardware Security in Emerging Photonic Network-on-Chip Architectures
Ishan G. Thakkar, Sai Vineel Reddy Chittamuru, Varun Bhat, Sairam Sri Vatsavai, and Sudeep Pasricha

Design Automation Flows
Synthesis and Technology Mapping for In-Memory Computing
Debjyoti Bhattacharjee and Anupam Chattopadhyay
Empowering the Design of Reversible and Quantum Logic with Decision Diagrams
Robert Wille, Philipp Niemann, Alwin Zulehner, and Rolf Drechsler
Error-Tolerant Mapping for Quantum Computing
Abdullah Ash Saki, Mahabubul Alam, Junde Li, and Swaroop Ghosh

System-Level Trends
Intelligent Edge Biomedical Sensors in the Internet of Things (IoT) Era
Elisabetta De Giovanni, Farnaz Forooghifar, Gregoire Surrel, Tomas Teijeiro, Miguel Peon, Amir Aminifar, and David Atienza Alonso
Reconfigurable Architectures: The Shift from General Systems to Domain Specific Solutions
Eleonora D’Arnese, Davide Conficconi, Marco D. Santambrogio, and Donatella Sciuto
Background
Trends in Computing and Memory Technologies
Mohamed M. Sabry Aly and Anupam Chattopadhyay

Abstract The current decade is poised to see a clear transition of technologies away from the de facto standards. After supporting tremendous growth in speed, density and energy efficiency, newer CMOS technology nodes provide diminishing returns, thereby paving the way for newer, non-CMOS technologies. Multiple such technologies are already available commercially to satisfy the requirements of specific market segments. Additionally, researchers have demonstrated multiple system prototypes built out of these technologies, which co-exist with CMOS technologies. Apart from clearly pushing the limits of performance and energy efficiency, the new technologies present opportunities to extend the architectural limits, e.g., in-memory computing, and the computing limits, e.g., quantum computing. The eventual adoption of these technologies depends on addressing various challenges at the device, circuit, architecture and system levels, as well as on robust design automation flows. In this chapter, a perspective of these emerging trends is painted in manufacturing technologies, memory technologies and computing technologies. The chapter concludes with a study on the limits of these technologies.
M. M. S. Aly · A. Chattopadhyay
NTU, Singapore, Singapore

© Springer Nature Singapore Pte Ltd. 2023
M. M. S. Aly and A. Chattopadhyay (eds.), Emerging Computing: From Devices to Systems, Computer Architecture and Design Methodologies, https://doi.org/10.1007/978-981-16-7487-7_1

1 Trends in Computing and Applications

Our continuous reliance on electronics has been the main driver of the digital revolution we live in today. Thanks to advances in device technologies, computing systems are infused in our daily lives, from edge near-sensor platforms such as wearables and Internet-of-Things devices, to resource-abundant cloud infrastructures. Such abundance of computing resources has created a heavily interconnected digital world in tandem with our physical reality, where embedded mobile devices collectively create a data deluge with massive amounts of structured and unstructured data (Jagadish et al. 2014; Reed and Dongarra 2015). This data has been the main catalyst to enable new classes of applications with complex functionality and services (e.g., computer vision in autonomous driving, real-time speech-to-text services, natural language processing (Hirschberg and Manning 2015), and knowledge-based systems such as IBM Watson’s DeepQA (Ferrucci et al. 2010)). As we continue to demand higher application complexity with more stringent requirements (faster response time, lower energy, improved security measures) (Ashibani and Mahmoud 2017; Shalf 2020), the underlying hardware infrastructure faces major challenges and bottlenecks.

Electronic systems’ advancements have been mainly driven by Moore’s law (Moore et al. 1965, 1975) (Fig. 1). While the technology roadmap continues to follow this trend, it will very soon hit fundamental limits (Theis and Wong 2017). Moreover, Dennard’s scaling of devices (under which the power density of transistors is retained as devices shrink) has hit a wall. This implies that computing systems consume more power per unit area as devices shrink, which leads to an increase in operating temperature (Frank et al. 2001; Shafique et al. 2014). While computing systems’ designs can aid in alleviating technological shortcomings, e.g., by exploiting parallelism and application specificity, these designs face the challenges of CPU and memory performance mismatch as well as bandwidth bottlenecks.

Moreover, recent applications’ characteristics expose significant inefficiencies in memory subsystem designs. Proliferating abundant-data applications (e.g., graph analytics (Malewicz et al. 2010) and deep learning (LeCun et al. 2015)) process large amounts of data that far exceed the capacity of current on-chip memory (SRAM-based caches or scratchpad memories). As a result, memory accesses to off-chip DRAM, and even persistent storage, dominate the execution time and energy consumption associated with these applications’ workloads. We elaborate further on such limitations, which stem from technology devices, computing and memory systems, in the following sections.

Fig. 1 Number of transistors per chip observed in digital processing units (and systems on chip) progressively since their introduction in the 1970s. The trend follows Moore’s law, where transistor count per unit area doubles roughly every 18 months (Source: Our World in Data (online), https://ourworldindata.org/technological-progress)
2 Technology-Specific Limits

Technology-specific limits originate from the physical phenomena that are harnessed for performing the computations. These are basically bounded by space-time and/or energy-time limits. While space-time limits are dominated by memory bottlenecks, energy-time limits show a large gap between current digital technologies and biological processes. From another perspective, the computing limit of an electronic switch is fundamentally constrained by quantum noise, the speed of light, and thermodynamic barriers (Bremermann 1967; Zhirnov et al. 2014). Taking a hint from the thermodynamic barrier, Landauer proved that the loss of a single bit of information leads to an energy dissipation of at least kT ln 2 (Landauer 1961), where k is the Boltzmann constant and T is the absolute temperature; this was later demonstrated experimentally (Bérut et al. 2012). Since avoiding such information loss can lead to energy-efficient computing, researchers have studied reversible computing (Morita 2009). However, current digital systems are far from achieving that limit, and are dominated by communication energy rather than computing energy, which fuels the research on near-memory and in-memory computing. This brings us back to the limits of space-time for performing a computation.

Taking a hint from energy-time limits, it is observed that biological computing processes are strikingly more energy-efficient than digital systems. A lot of research effort has been put into reducing this gap through, e.g., power/thermal management (Sabry et al. 2011), approximate computing (Mittal 2016; Wang et al. 2016; Constantin et al. 2016), and near-threshold computing (Dreslinski et al. 2010). Yet it appears that energy management falls short, partly owing to the lack of progress in battery and cooling technologies. This directly results in a slowdown of technological progress. Mimicking biological systems can be done by natively embedding the working steps of an algorithm into the devices. Prominent examples of this are DNA computing (Adleman 1994) and, more recently, neuromorphic computing (Mead 1990; Human Brain Project 2020). A deeper understanding of biological processes (Attwell et al. 2012), and a corresponding adaptation of those in digital systems, seems to be the most promising approach towards the design of next-generation computing systems.
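
To give a sense of scale for the Landauer bound discussed above, the following minimal sketch evaluates kT ln 2 numerically. The function name and the choice of 300 K as an operating temperature are assumptions made for illustration; they do not come from the chapter.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23   # Boltzmann constant (exact SI value)
ELECTRON_VOLT_J = 1.602176634e-19  # one electronvolt in joules (exact SI value)


def landauer_limit_joules(temperature_kelvin: float) -> float:
    """Minimum energy dissipated when one bit is erased: k * T * ln 2."""
    if temperature_kelvin <= 0:
        raise ValueError("temperature must be positive (kelvin)")
    return BOLTZMANN_J_PER_K * temperature_kelvin * math.log(2)


if __name__ == "__main__":
    room_temperature_k = 300.0  # assumed operating temperature, for illustration
    e_bit = landauer_limit_joules(room_temperature_k)
    print(f"Landauer limit at {room_temperature_k:.0f} K: {e_bit:.3e} J "
          f"({e_bit / ELECTRON_VOLT_J * 1e3:.1f} meV) per erased bit")
```

At room temperature the bound is on the order of 10^-21 J per erased bit, which is why the text can describe current digital systems as being far from this limit.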
3 Computing Limitations

Today’s digital computing architectures comprise a number of interconnected processing elements, spanning one or multiple chips, that have access to on-chip memory (either private or shared) with limited capacity. These processing elements (such as CPUs, GPU cores, DSPs, special-function blocks, multipliers and adders) can be identical in homogeneous architectures (e.g., multi- and many-core CPUs, GPUs). Alternatively, a computing chip can contain different processing element types in a heterogeneous system-on-chip architecture. Figure 2 illustrates the floorplan of various homogeneous and heterogeneous computing systems.

Historically, until the early 2000s, device shrinking improved both the performance and the energy efficiency of single-core computing systems (i.e., a CPU could run faster with lower energy per operation). Afterwards, a trade-off between performance and energy emerged in CPU design. To alleviate this issue, multi-core designs in a single chip were introduced. While multi-core designs improved energy efficiency, the stagnation of Dennard’s scaling as transistor dimensions shrink imposed significant constraints. This led to the rise of Dark Silicon, where compute units cannot all run at peak performance due to power and thermal constraints (Esmaeilzadeh et al. 2011). This stagnation is illustrated in Fig. 3: single-core CPU power increased significantly as performance improved, whereas multicore designs could initially boost performance without a massive increase in power consumption. As devices continued to shrink, multicore designs could no longer improve performance without an accompanying increase in power.

Computing architectures then focused on the integration of specialised hardware accelerators in a heterogeneous system-on-chip. Application-specific architectures provide higher energy efficiency than general-purpose processors (Horowitz 2014). For instance, the proliferation of deep learning applications and the continuous demand for higher performance and lower energy consumption triggered a massive boom in developing hardware accelerators for deep learning (LeCun 2019; Sze et al. 2017). Despite the significant architecture advances, limitations still arise. The main challenge now stems from recent application trends: they are data-centric, with large amounts of data movement from memory to compute resources, which in turn exposes inefficiencies in current memory system architectures and devices.

Fig. 2 Architectural evolution of processors over time to overcome the stagnation of Dennard’s scaling. To improve performance without an increase in power density, designs transitioned from single to multi-core homogeneous platforms, then to heterogeneous architectures involving CPU, GPU and/or domain-specific accelerators. Prospective architectures will adopt additional device heterogeneity and interwoven compute-to-memory connectivity

Fig. 3 Trends in the number of transistors, performance and power in processing units. Architecture designs adopted multi-core around 2005 due to the slowing down of Dennard’s scaling (in tandem with stagnating single-thread performance and clock frequency). As a result, power consumption has drastically increased (Source: Conte 2015)
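
The link between Dennard's scaling and power density mentioned in this section can be sketched with a first-order model. The sketch below assumes the textbook dynamic-power relation P = C·V²·f and ignores leakage; the function name, its parameters and the scaling factor of 2 are illustrative assumptions rather than values from the chapter.

```python
def scaled_power_density(kappa: float, voltage_scales: bool,
                         frequency_scales: bool) -> float:
    """Relative power density after shrinking linear dimensions by 1/kappa.

    First-order model (an assumption of this sketch): dynamic power per
    transistor is C * V^2 * f, and leakage is ignored.
    """
    capacitance = 1.0 / kappa                       # C shrinks with dimensions
    voltage = 1.0 / kappa if voltage_scales else 1.0
    frequency = kappa if frequency_scales else 1.0
    power_per_transistor = capacitance * voltage ** 2 * frequency
    transistors_per_area = kappa ** 2               # area per device shrinks as 1/kappa^2
    return power_per_transistor * transistors_per_area


if __name__ == "__main__":
    kappa = 2.0  # hypothetical scaling factor, chosen for round numbers
    print("Ideal Dennard scaling:          ",
          scaled_power_density(kappa, voltage_scales=True, frequency_scales=True))
    print("Voltage stuck, frequency scaled:",
          scaled_power_density(kappa, voltage_scales=False, frequency_scales=True))
    print("Voltage stuck, frequency fixed: ",
          scaled_power_density(kappa, voltage_scales=False, frequency_scales=False))
```

Under these assumptions, power density stays constant only when the supply voltage scales with the device dimensions. Once voltage stops scaling, power density grows even at a fixed clock frequency, which is consistent with the Dark Silicon constraint described above.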
4 Memory Limitations

Current computing systems, apart from processing units, use different memory technologies in a hierarchy. Figure 4 illustrates a typical memory hierarchy in current computing systems: SRAM is used for high-speed, low-capacity (up to tens of MBs) cache memory, whereas off-chip DRAM provides GBs of working memory with an access latency of about 100 ns. To increase capacity further, non-volatile data storage devices (such as FLASH devices) are deployed, where capacity can reach the TB scale, with access latencies above 100 µs.
Fig. 4 Memory hierarchy in state-of-the-art processing systems. Registers are designed with standard-cell CMOS libraries; they are the fastest and incur the highest energy and area per bit, so only a limited number of them exist in processors. First-level and multiple lower-level caches are designed with SRAM. On-chip embedded DRAM can provide higher density, and can be used as a last-level cache (e.g., in POWER processors). Off-chip DRAM provides giga- or terabyte-scale working memory. To increase storage, persistent storage devices provide cost effectiveness (number of bits per unit cost), at the cost of significantly higher latency (Source: Wong and Salahuddin 2015)
The use of various memory technologies stems from balancing the trade-offs of each technology. SRAM is fast but comes with a large footprint and high static energy, whereas DRAM is denser, but is not CMOS compatible (cannot be integrated with CMOS and achieve high density) and suffers from significant overheads (e.g., refresh). Non-volatile memory, such as FLASH, consumes low static energy (and can be turned off), offers high storage density, and can even be integrated on chip (particularly in embedded domains). However, FLASH has very high access latency, very high access energy and extremely limited endurance (number of write cycles before permanent failure) (Freitas and Wilcke 2008; Wong and Salahuddin 2015). The current memory hierarchy is therefore characterized by a degradation of access bandwidth as data accesses go to lower levels in the hierarchy. Conventional applications leverage spatial and temporal locality, where data in close spatial proximity are accessed frequently before a new set of data is needed, to overcome the limitations of the memory subsystem. Architectural solutions, such as speculative prefetching (Jegou and Temam 1993; Collins et al. 2001), have also been introduced to overcome some of the memory-access limitations. However, emerging abundant-data applications impose major challenges on the current memory hierarchy and on overall system performance. Their characteristics include random access to large amounts of data (e.g., graph analytics (Satish et al. 2014)), little data locality (e.g., recurrent neural networks (Mikolov et al. 2011)), or high locality to data segments with a footprint larger than the on-chip memory size. As a result, compute units access off-chip DRAM more frequently. Combined with the limited number of off-chip connections to DRAM, the majority of the energy in such systems is consumed by memory accesses or by compute units waiting for memory accesses to be serviced (Aly et al. 2018).
These limitations, combined with other device limitations, have triggered significant research efforts to introduce new devices, architectures and data mapping mechanisms, and even algorithm-architecture co-design, for the next generation of computing systems. A notable direction in this scenario is to leverage local computing, which is enabled by technologies favouring in-memory or near-memory computing (Gaillardon et al. 2016; Bhattacharjee et al. 2017; Haj-Ali et al. 2018). The impact of local computing on the highest achievable performance of an algorithm can be studied analytically through the Bitlet model (Korgaonkar et al. 2019).
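
The effect of poor locality on the memory hierarchy described in this section can be illustrated with the standard average-memory-access-time calculation. The sketch below is not taken from the chapter: the 100 ns DRAM and 100 µs storage latencies follow the figures quoted in the text, while the 1 ns SRAM latency and all hit rates are hypothetical values chosen for illustration.

```python
def average_access_time_ns(levels):
    """Expected access latency for a hierarchy searched level by level.

    `levels` is a list of (latency_ns, hit_rate) pairs ordered from the
    fastest to the slowest level; the last level should have hit_rate 1.0.
    """
    expected_ns = 0.0
    reach_probability = 1.0  # fraction of accesses that get as far as this level
    for latency_ns, hit_rate in levels:
        if not 0.0 <= hit_rate <= 1.0:
            raise ValueError("hit_rate must lie in [0, 1]")
        expected_ns += reach_probability * latency_ns
        reach_probability *= 1.0 - hit_rate
    return expected_ns


if __name__ == "__main__":
    # Latencies: SRAM value is assumed; DRAM and storage follow the text.
    # Hit rates are hypothetical.
    cache_friendly = [(1.0, 0.95), (100.0, 0.9999), (100_000.0, 1.0)]
    low_locality = [(1.0, 0.50), (100.0, 0.99), (100_000.0, 1.0)]
    print(f"cache-friendly workload: {average_access_time_ns(cache_friendly):.1f} ns")
    print(f"low-locality workload:   {average_access_time_ns(low_locality):.1f} ns")
```

Even with these rough numbers, a drop in on-chip hit rate moves the average access time from a few nanoseconds to hundreds of nanoseconds, which is the behaviour that motivates near-memory and in-memory computing.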
5 Book Scope and Organization

The main goal of this book is to provide a holistic and broad view of current trends towards new computing systems. The book covers topics from emerging¹ devices to full system-level challenges and approaches to improve the design performance and efficiency of unconventional computing architectures, or of common computing fabrics that leverage device- and integration-specific benefits. The book consists of four sections, which traverse the stack of computing systems design: devices, circuits and architectures, design flows and system-level trends. Section 2 discusses devices and integration approaches beyond traditional trends. We cover upcoming logic devices, memory devices and new interconnect technologies. An overview study of quantum technologies is also presented. Section 3 focuses on architectures that deviate from conventional designs, powered by new technologies. Here, we discuss near-memory and in-memory computing technologies, brain-inspired neuromorphic computing and novel on-chip communication architectures. To support the design of the emerging architectures in Sect. 3, Sect. 4 focuses on automation tools and flows towards mapping and testing of such architectures. Three chapters in this section present technology mapping for in-memory computing platforms, verification and testing for emerging technologies, and error-tolerant mapping for quantum technologies. Finally, Sect. 5 introduces mechanisms that overcome technological limitations at the system level. There, two chapters detail novel self-aware computing architectures and systems based on reconfigurable architectures, respectively.
¹ The devices discussed in this book are considered emerging by the time of publication.

References

L.M. Adleman, Molecular computation of solutions to combinatorial problems. Science 266(5187), 1021–1024 (1994)
M.M.S. Aly, T.F. Wu, A. Bartolo, Y.H. Malviya, W. Hwang, G. Hills, I. Markov, M. Wootters, M.M. Shulaker, H.-S.P. Wong et al., The N3XT approach to energy-efficient abundant-data computing. Proc. IEEE 107(1), 19–48 (2018)
Y. Ashibani, Q.H. Mahmoud, Cyber physical systems security: analysis, challenges and solutions. Comput. Secur. 68, 81–97 (2017)
D. Attwell, J.J. Harris, R. Jolivet, Synaptic energy use and supply. Neuron 75 (2012)
A. Bérut, A. Arakelyan, A. Petrosyan, S. Ciliberto, R. Dillenschneider, E. Lutz, Experimental verification of Landauer’s principle linking information and thermodynamics. Nature 483 (2012)
D. Bhattacharjee, R. Devadoss, A. Chattopadhyay, Revamp: ReRAM based VLIW architecture for in-memory computing, in Design, Automation Test in Europe Conference Exhibition (DATE) (2017), pp. 782–787
H.J. Bremermann, Quantum noise and information, in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (1967), p. 4
J.D. Collins, H. Wang, D.M. Tullsen, C. Hughes, Y.-F. Lee, D. Lavery, J.P. Shen, Speculative precomputation: long-range prefetching of delinquent loads, in Proceedings 28th Annual International Symposium on Computer Architecture (IEEE, 2001), pp. 14–25
J. Constantin, Z. Wang, G. Karakonstantis, A. Chattopadhyay, A. Burg, Statistical fault injection for impact-evaluation of timing errors on application performance, in Proceedings of the 53rd Annual Design Automation Conference (2016)
T. Conte, IEEE rebooting computing initiative & international roadmap of devices and systems, in Proceedings of the IEEE Rebooting Computer Architecture 2030 Workshop (2015). [Online]. Available: https://arch2030.cs.washington.edu/slides/arch2030_tom_conte.pdf (Original data collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten)
R.G. Dreslinski, M. Wieckowski, D. Blaauw, D. Sylvester, T. Mudge, Near-threshold computing: reclaiming Moore’s law through energy efficient integrated circuits. Proc. IEEE 98(2), 253–266 (2010)
H. Esmaeilzadeh, E. Blem, R.S. Amant, K. Sankaralingam, D. Burger, Dark silicon and the end of multicore scaling, in 2011 38th Annual International Symposium on Computer Architecture (ISCA) (IEEE, 2011), pp. 365–376
D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A.A. Kalyanpur, A. Lally, J. William Murdock, E. Nyberg, J. Prager et al., Building Watson: an overview of the DeepQA project. AI Mag. 31(3), 59–79 (2010)
D.J. Frank, R.H. Dennard, E. Nowak, P.M. Solomon, Y. Taur, H.-S.P. Wong, Device scaling limits of Si MOSFETs and their application dependencies. Proc. IEEE 89(3), 259–288 (2001)
R.F. Freitas, W.W. Wilcke, Storage-class memory: the next storage system technology. IBM J. Res. Dev. 52(4.5), 439–447 (2008)
P. Gaillardon, L. Amar, A. Siemon, E. Linn, R. Waser, A. Chattopadhyay, G. De Micheli, The programmable logic-in-memory (PLIM) computer, in Design, Automation Test in Europe Conference Exhibition (DATE) (2016), pp. 427–432
A. Haj-Ali, R. Ben-Hur, N. Wald, R. Ronen, S. Kvatinsky, Not in name alone: a memristive memory processing unit for real in-memory processing. IEEE Micro 38(5), 13–21 (2018)
J. Hirschberg, C.D. Manning, Advances in natural language processing. Science 349(6245), 261–266 (2015)
M. Horowitz, 1.1 computing’s energy problem (and what we can do about it), in 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC) (IEEE, 2014), pp. 10–14
Human Brain Project (2020), https://www.humanbrainproject.eu/en/. Accessed 07 July 2020
H.V. Jagadish, J. Gehrke, A. Labrinidis, Y. Papakonstantinou, J.M. Patel, R. Ramakrishnan, C. Shahabi, Big data and its technical challenges. Commun. ACM 57(7), 86–94 (2014)
Y. Jegou, O. Temam, Speculative prefetching, in Proceedings of the 7th International Conference on Supercomputing (1993), pp. 57–66
K. Korgaonkar, R. Ronen, A. Chattopadhyay, S. Kvatinsky, The bitlet model: defining a litmus test for the bitwise processing-in-memory paradigm (2019)
R. Landauer, Irreversibility and heat generation in the computing process. IBM J. Res. Dev. 5(3), 183–191 (1961)
Y. LeCun, 1.1 deep learning hardware: past, present, and future, in 2019 IEEE International Solid-State Circuits Conference (ISSCC) (IEEE, 2019), pp. 12–19
Y. LeCun, Y. Bengio, G. Hinton, Deep learning. Nature 521(7553), 436–444 (2015)
G. Malewicz, M.H. Austern, A.J.C. Bik, J.C. Dehnert, I. Horn, N. Leiser, G. Czajkowski, Pregel: a system for large-scale graph processing, in Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (2010), pp. 135–146
C. Mead, Neuromorphic electronic systems. Proc. IEEE 78(10) (1990)
T. Mikolov, S. Kombrink, L. Burget, J. Černocký, S. Khudanpur, Extensions of recurrent neural network language model, in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2011), pp. 5528–5531
S. Mittal, A survey of techniques for approximate computing. ACM Comput. Surv. 48(4) (2016)
G.E. Moore et al., Cramming more components onto integrated circuits (1965)
G.E. Moore et al., Progress in digital integrated electronics. Electron Devices Meeting 21, 11–13 (1975)
K. Morita, Reversible Computing (Springer, New York, NY, 2009), pp. 7695–7712
D.A. Reed, J. Dongarra, Exascale computing and big data. Commun. ACM 58(7), 56–68 (2015)
M.M. Sabry, A.K. Coskun, D. Atienza, T. Rosing, T. Brunschwiler, Energy-efficient multiobjective thermal control for liquid-cooled 3D stacked architectures. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 30(12), 1883–1896 (2011)
N. Satish, N. Sundaram, M.M.A. Patwary, J. Seo, J. Park, M. Amber Hassaan, S. Sengupta, Z. Yin, P. Dubey, Navigating the maze of graph analytics frameworks using massive graph datasets, in Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (2014), pp. 979–990
M. Shafique, S. Garg, J. Henkel, D. Marculescu, The EDA challenges in the dark silicon era: temperature, reliability, and variability perspectives, in Proceedings of the 51st Annual Design Automation Conference (2014), pp. 1–6
J. Shalf, The future of computing beyond Moore’s law. Philos. Trans. R. Soc. A 378(2166), 20190061 (2020)
V. Sze, Y.-H. Chen, T.-J. Yang, J.S. Emer, Efficient processing of deep neural networks: a tutorial and survey. Proc. IEEE 105(12), 2295–2329 (2017)
T.N. Theis, H.-S.P. Wong, The end of Moore’s law: a new beginning for information technology. Comput. Sci. Eng. 19(2), 41–50 (2017)
Z. Wang, G. Karakonstantis, A. Chattopadhyay, A low overhead error confinement method based on application statistical characteristics, in Design, Automation Test in Europe Conference Exhibition (DATE) (2016), pp. 1168–1171
H.-S.P. Wong, S. Salahuddin, Memory leads the way to better computing. Nat. Nanotechnol. 10(3), 191–194 (2015)
V. Zhirnov, R. Cavin, L. Gammaitoni, Minimum Energy of Computing, Fundamental Considerations (2014)
Devices and Models
Beyond-Silicon Computing: Nano-Technologies, Nano-Design, and Nano-Systems
Gage Hills
1 Introduction For decades, humankind has enjoyed the energy efficiency benefits of scaling transistors smaller and smaller, but these benefits are waning. In a worldwide effort to continue improving computing performance, many researchers are exploring a wide range of technology alternatives, ranging from new physics (spin-, magnetic, tunneling-, and photonic-based devices) to new nanomaterials (carbon nanotubes, two-dimensional materials, superconductors) to new devices (non-volatile embedded memories, ferroelectric-based logic and memories, q-bits) to new systems, architectures, and integration techniques (advanced die- and wafer-stacking, monolithic three-dimensional (3D) integration, on-chip photonic interconnects). However, developing new technologies from the ground up is no simple task, and requires an end-to-end approach addressing many challenges along the way. First of all, a detailed analysis of the overall potential benefits of a new technology is essential; it can take years to bring a new technology to the level of maturity required for high-volume production, and so a team of researchers must ensure upfront that they are developing the right technologies for the right applications. For example, many emerging nanotechnologies are subject to nano-scale imperfections and variations in material properties—how does one overcome these challenges at a very-large scale? Will new design techniques be required? Will circuit and system designers even use the same approaches to designing next generation systems, or would an entirely different approach offer much better results? What level of investment will be required to develop these new technologies, designs, and systems, and at the end of the day, will the outcome be worth the effort? These are just examples of the some of the major questions that are essential to consider as early as possible. G. Hills (B) Harvard University, Cambridge, MA, USA e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 
2023 M. M. S. Aly and A. Chattopadhyay (eds.), Emerging Computing: From Devices to Systems, Computer Architecture and Design Methodologies, https://doi.org/10.1007/978-981-16-7487-7_2
G. Hills
To provide a concrete example of how many of these questions are being addressed in practice, in this chapter, we take a deep dive into one specific research area: using carbon nanotubes (CNTs) as the channel material of field-effect transistors, and the resulting next-generation systems that are enabled. We start by providing an overview of the potential benefits of this emerging nanotechnology, as well as the major challenges that have blocked significant progress in the field. We then describe example design techniques that have been used to overcome these challenges (i.e., nano-design techniques), and offer experimental demonstrations of larger-scale circuits that have been enabled using these techniques in practice, and that are now being developed inside commercial manufacturing facilities. Next, we illustrate how carbon nanotube field-effect transistors (CNFETs, and other nano-technologies with similar physical properties) enable entirely new types of computing systems that are impossible to build using today’s silicon-based technologies, namely, monolithic three-dimensional (3D) integrated systems with multiple layers of transistors and multiple layers of memories fabricated directly on top of each other with ultra-dense vertical connectivity. We close by summarizing the potential that these 3D “nanosystems” have to extend progress in energy-efficient computing, and finally offer examples of how they can be combined with advances higher up the computing stack (e.g., with new computer architectures, compilers, or domain-specific programming languages) for even larger benefits. We hope that this technological journey spanning nano-technologies, nano-design, and nano-systems will motivate the reader to pursue further investigation into these cutting-edge research areas.
2 Emerging Nanotechnologies: Opportunities and Challenges
Quantifying the potential benefits of a new technology is essential to guide development of the right technologies for the right applications. Importantly, since today's systems often consist of multiple heterogeneous components (including processor cores, on- and off-chip memories, power distribution, etc.), small-scale benchmarking does not capture important interactions between these components that can limit overall system performance. For example, analyzing the delay and power consumption of a stand-alone 32-bit adder does not account for many effects present in realistic very-large-scale integrated (VLSI) systems (e.g., interconnect routing parasitics, application-level timing constraints, process variations), and does not perform technology-specific optimization of key circuit-level design parameters (e.g., total circuit area, target clock frequency). This can lead to incorrect conclusions, and thus to wasted effort spent developing the wrong technologies. Instead, an end-to-end evaluation framework is essential, including: (a) energy-efficient circuit/system-level techniques to overcome inherent imperfections and variations, (b) full physical design of VLSI systems, and (c) variation-aware power/timing design and
Beyond-Silicon Computing: Nano-Technologies, Nano-Design …
optimization, calibrated to experimental data, running real applications, and meeting circuit-level yield, test, and noise immunity constraints. In this section, we provide an overview of such an analysis framework that compares the energy efficiency benefits of multiple promising technology candidates (shown in Fig. 1) for future technology nodes. This analysis uses complete physical designs of VLSI processor cores in order to account for realistic circuit effects. Energy efficiency is quantified by the Energy-Delay Product (EDP) metric, i.e., the product of total circuit energy consumption and the maximum propagation delay from any input to any output (the critical path delay). Importantly, this comparison leverages industry-practice VLSI designs and design flows, as well as technology parameters that are calibrated to experimental data, not only to quantify the EDP benefits of each technology, but also to provide insight into the sources of those benefits. Using the VLSI design flow shown in Fig. 2, this approach demonstrates that CNFETs offer major energy efficiency benefits for sub-10 nm node digital VLSI circuits. For additional details, we refer the reader to Hills et al. (2018), which also describes how these benefits can be maintained even in the presence of major variations in CNT processing (which we also describe in Sect. 3). While this methodology has so far been used to evaluate the benefits of CNFETs, it can also be extended to evaluate new combinations of technologies moving forward.
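The EDP metric and the selection of an EDP-optimal design from an energy/delay trade-off curve can be illustrated with a minimal sketch. The design points, their names, and the units below are hypothetical placeholders rather than results from Hills et al. (2018); a real flow would obtain energy and delay from full physical design and circuit simulation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DesignPoint:
    name: str
    energy_pj: float  # total circuit energy (hypothetical values, pJ)
    delay_ns: float   # critical path delay (hypothetical values, ns)

    @property
    def edp(self) -> float:
        # Energy-Delay Product: total energy times critical path delay.
        return self.energy_pj * self.delay_ns


def pareto_front(points):
    """Keep only points that no other point beats in both energy and delay."""
    front = []
    for p in points:
        dominated = any(
            q.energy_pj <= p.energy_pj
            and q.delay_ns <= p.delay_ns
            and (q.energy_pj < p.energy_pj or q.delay_ns < p.delay_ns)
            for q in points
        )
        if not dominated:
            front.append(p)
    return front


def edp_optimal(points):
    """Pick the Pareto-optimal point with the smallest EDP."""
    return min(pareto_front(points), key=lambda p: p.edp)


# Hypothetical candidates, e.g. the same core synthesized at different clock targets.
candidates = [
    DesignPoint("slow", 10.0, 4.0),
    DesignPoint("balanced", 14.0, 2.0),
    DesignPoint("fast", 30.0, 1.5),
    DesignPoint("wasteful", 32.0, 2.5),  # dominated by both "balanced" and "fast"
]
best = edp_optimal(candidates)
print(best.name, best.edp)
```

In this toy data set the dominated point is discarded first, and the remaining three points are compared on EDP alone.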
Fig. 1 Advanced technology options for field-effect transistors (FETs) for future technology nodes. For each FET, the drain contact is transparent in order to illustrate a cross-section of the transistor channel, with microscopy images provided underneath each 3D-rendered FET. a FinFET with multiple fins. b Nanowire FET with multiple nanowires both horizontally and vertically (Mertens et al. 2016). c Nanosheet FET with multiple nanosheets integrated vertically on top of each other (Loubet et al. 2017). d Two-dimensional (2D) material FET, in which the FET channel can be made of 2D materials such as MoS2, black phosphorus, or WSe2. e Carbon nanotube FET (CNFET), with multiple carbon nanotubes (CNTs) comprising the FET channel, shown in the top-view scanning electron microscopy (SEM) image
Fig. 2 End-to-end approach to quantify the benefits of emerging technologies for VLSI-scale circuits. a VLSI design and analysis flow. Key components include: experimentally calibrated compact models (as shown for CNFET drain current vs. drain-to-source voltage, I_D vs. V_DS, enabling accurate circuit simulations) (Lee et al. 2015), library cell layouts (enabling extraction of parasitic resistance and capacitance elements), and circuit-level EDP optimization. b Example illustration of a single CNFET, which is used in conjunction with this design flow to analyze the energy and delay of the OpenSparc T2 processor core (OpenSPARC 2011) designed using CNFETs. c The EDP-optimal design is selected from the Pareto-optimal trade-off curve to quantify the energy efficiency of each technology
2.1 VLSI Circuit Benefits of One-Dimensional and Two-Dimensional Nanomaterials
While quantifying EDP benefits at the circuit and system level is certainly required to drive technology development, e.g., for motivating the use of CNFETs, it is equally important to understand where these benefits come from. The detailed analysis flow in Fig. 2 provides valuable insight into the sources of these benefits for CNFETs. In particular, a useful metric for evaluating the potential circuit-level benefits of a FET technology is the electrostatic scale length (Frank et al. 1998), which quantifies how susceptible that FET is to short-channel effects (Kuhn 2012). The scale length should be small to enable shorter gate lengths, thus reducing the energy required to charge the FET gate capacitance without degrading the FET's ability to quickly turn on and off by modulating the gate voltage (quantified by the subthreshold slope). Well-known approaches for improving the scale length include: (1) improving the FET geometry (e.g., changing from FinFET to gate-all-around nanowire FET or nanosheet FET), and (2) reducing the semiconductor body thickness (e.g., reducing the thickness of fins in a FinFET or reducing the diameter of individual nanowires in a nanowire FET). While evolving from today's silicon–germanium (SiGe)-based FinFETs to gate-all-around nanowire FETs or nanosheet FETs reduces
FET scale length, continuing to reduce scale length requires reducing the semiconductor body thickness. Unfortunately, reducing the body thickness can have unwanted side effects. In particular, bulk materials (e.g., all Si-, Ge-, and III-V-based semiconductors) suffer from severely degraded carrier transport as the body thickness scales to sub-10 nm dimensions (Gomez et al. 2007; Hashemi et al. 2014a, b; Sasaki et al. 2015; Suk et al. 2007; Uchida et al. 2003). This degradation in carrier transport arises from increased phonon scattering and surface roughness, and it significantly degrades FET on-current density and thus overall circuit speed. The unfortunate result is that it has become extremely challenging for technologists to create FETs that exhibit both excellent scale length and excellent carrier transport simultaneously. To alleviate this challenge, many technologists have turned to alternative "low-dimensional" materials, i.e., 2D materials and one-dimensional (1D) carbon nanotubes, which inherently maintain superior carrier transport even with very thin body thickness. For example, experimental measurements quantifying carrier transport in CNTs include hole mobility exceeding 2,500 cm²/V·s (Zhou et al. 2005) and hole velocity of 4.1 × 10⁷ cm/s, even for CNT diameters below 2 nm. For reference, measurements of experimental Si FinFETs with body thickness less than 3 nm exhibit mobility under 300 cm²/V·s. Because of these material- and device-level benefits of CNTs (and other low-dimensional materials), there are significant energy efficiency benefits at the circuit level, and we refer the reader to Hills et al. (2018) for quantified results.
Key implications for gaining high-level intuition include:
• Superior CNT carrier transport enables CNFET circuits to operate with reduced supply voltage and simultaneously higher effective drive current (I_EFF) compared to SiGe FinFETs (e.g., 20% lower V_DD with 25% higher I_EFF for the same off-state leakage current density).
• Ultra-thin body thickness (CNT diameter) results in very short scale length, enabling experimental CNFETs to maintain steep sub-threshold slope (SS) with extremely scaled gate length (e.g., SS = 70 mV/decade with 5 nm gate length, which has been shown experimentally for both PMOS and NMOS CNFETs (Qiu et al. 2017)).
• Optimized CNFET circuits exhibit lower total circuit capacitance, which reduces overall energy consumption and also contributes to higher circuit speeds (e.g., 2× lower capacitance for projected CNFET vs. Si/SiGe FinFET (Hills et al. 2018)). This reduction in capacitance comes from multiple sources. First, shorter CNFET gate length not only reduces intrinsic gate-to-channel capacitance (due to smaller gate area), but also reduces parasitic gate-to-source/drain capacitance due to increased physical separation between the gate and the source/drain contacts. Second, high CNFET drive current enables electronic design automation (EDA) tools for logic synthesis and place-and-route to automatically select standard library cells with smaller drive strengths and still meet circuit-level timing constraints. Third, these constraints can be met even when using planar CNFETs, which have lower gate capacitance compared to three-dimensional FinFETs, nanowire FETs, and nanosheet FETs, whose channels extend vertically above the substrate to increase drive current at the cost of higher parasitic capacitance (Hills et al. 2018).
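To build intuition for how these device-level ratios compound at the circuit level, the sketch below applies a textbook first-order CMOS model, which is an assumption introduced here and not the calibrated flow of Fig. 2: switching energy is taken as proportional to C·V_DD², and gate delay as proportional to C·V_DD/I_EFF. The input ratios are the individual "e.g." figures quoted in the bullets above; combining them this way is purely illustrative and is not a result reported by Hills et al. (2018).

```python
def relative_metrics(cap_ratio, vdd_ratio, ieff_ratio):
    """First-order CMOS scaling (assumed model, not from the chapter).

    All arguments are ratios of the new technology to the baseline.
    Returns (energy ratio, delay ratio, EDP ratio).
    """
    energy = cap_ratio * vdd_ratio ** 2           # E ~ C * V_DD^2
    delay = cap_ratio * vdd_ratio / ieff_ratio    # t ~ C * V_DD / I_EFF
    return energy, delay, energy * delay


# Example ratios quoted in the text: 2x lower capacitance,
# 20% lower V_DD, 25% higher I_EFF.
energy, delay, edp = relative_metrics(cap_ratio=0.5, vdd_ratio=0.8, ieff_ratio=1.25)
print(f"energy ratio: {energy:.2f}")
print(f"delay ratio:  {delay:.2f}")
print(f"EDP ratio:    {edp:.3f}")
```

Under these simplified assumptions, energy and delay each drop to roughly a third of the baseline, so their product improves by about an order of magnitude. Effects that the full flow accounts for, such as interconnect parasitics and leakage, are ignored here.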
2.2 Inherent Challenges in Emerging Nanotechnologies
Despite the projected benefits of emerging nanotechnologies, such as CNFETs and FETs based on other low-dimensional nanomaterials, there are significant practical challenges that must be overcome before these benefits can be realized. Specifically, emerging nanotechnologies are inherently subject to nano-scale imperfections and process variations, and without dedicated techniques to address these challenges at the fabrication, design, and system levels, affected nanotechnologies may never see the light of day. As an illuminating example, key challenges that have plagued CNTs for decades include:
• CNT aggregates: during CNT deposition, i.e., when CNTs are deposited on the wafer substrate used for circuit fabrication, CNTs can "bundle" together, forming "CNT aggregates" (an example image is shown in Fig. 4a). The presence of CNT aggregates in CNFET channel regions can lead to incorrect CNFET functionality, reducing overall CNFET circuit yield (Hills et al. 2019).
• CNT CMOS: today's energy-efficient digital circuits rely on having a complementary metal–oxide–semiconductor (CMOS) process that includes both PMOS and NMOS FETs. However, many emerging FET technologies, including CNFETs, have lacked a robust CMOS process. In particular, both PMOS and NMOS CNFETs should: (a) be air-stable, (b) have tunable electrical characteristics (e.g., threshold voltage), and (c) have limited variability (Hills et al. 2019; Lau et al. 2018).
• Metallic CNTs: due to a lack of precise control over CNT properties (e.g., diameter and chirality), CNTs can be either semiconducting (s-CNTs) or metallic (m-CNTs); m-CNTs exhibit little or no bandgap, and so their conductance cannot be effectively modulated by the CNFET gate, leading to increased off-state leakage current and potentially incorrect logic functionality in CNFET circuits (Hills et al. 2015, 2019; Zhang et al. 2009).
• CNT variations: in addition to metallic CNTs, CNTs exhibit variations in CNT density, CNT diameter, alignment, and doping (Fig. 3a). CNT variations can lead to near-zero functional yield, increase susceptibility to noise, and degrade EDP benefits of CNFET digital circuits. But a key question is: which of these variations actually matter from a system point of view, which is ultimately what we care about? Without a systematic methodology to evaluate the system-level impact of CNT variations, one might blindly pursue difficult CNT processing paths with diminishing returns, while overlooking other CNT process advances that enable far larger yield and performance benefits overall. This challenge is further exacerbated by the fact that CNT processing advances can also be combined with CNFET circuit design techniques that reduce the impact of CNT variations (e.g., selective transistor upsizing), which leads to massive design spaces that can be intractable to explore (Fig. 3b); for example, existing approaches rely on trial-and-error-based ad hoc techniques that can be prohibitively time consuming (e.g., requiring computation runtimes exceeding 1.5 months (Hills et al. 2015)).
Fig. 3 Nanotechnology challenges. a CNT variations, including variations in CNT type, density, diameter, alignment, and doping (Hills et al. 2015). b Subset of the massive design space to explore in order to co-optimize CNT process improvements (e.g., in the percentage of metallic CNTs, or variations in CNT spacing) and CNFET circuit design parameters (e.g., target clock speed). Note that these three dimensions represent only a small subset of the entire design space; e.g., for each one of these points, we can also use circuit-level techniques (such as selective transistor upsizing) to reduce the impact of CNT variations at the cost of increased energy consumption
3 Overcoming Challenges: Coordinated Nano-Fabrication + Nano-Design
Isolated improvements in processing or design are insufficient for overcoming the challenges in Sect. 2. Instead, in this section, we start by describing the interplay between advances in nano-fabrication and nano-design that is essential for overcoming CNT aggregates, CNT CMOS, metallic CNTs, and CNT variations in an energy-efficient and computationally efficient manner (Sects. 3.1, 3.2 and 3.3). Section 3.4 presents experimental demonstrations of larger-scale CNFET circuits that have now been realized, showing that these techniques work in practice. In Sect. 3.5, we highlight that many of these techniques are now being transferred to multiple high-volume commercial manufacturing facilities.
3.1 VLSI CNFET Nano-Fabrication
Figure 4 illustrates two nanofabrication techniques that are used to address two of the key challenges described in Sect. 2, i.e., removing CNT aggregates and realizing a robust CMOS process for CNFETs. Specifically, RINSE (Removal of Incubated Nanotubes through Selective Exfoliation) reduces the number of CNT aggregates per unit area by >250× (Hills et al. 2019), and MIXED (Metal Interface engineering crossed with Electrostatic Doping) realizes a VLSI-compatible CMOS process for
Fig. 4 Summary of RINSE and MIXED to overcome CNT aggregates and to enable CNT CMOS. a Scanning electron microscopy (SEM) images of a CNT aggregate on the wafer. b 3-step RINSE process to remove CNT aggregates, resulting in >250× reduction in the number of CNT aggregates per unit area. c Schematic illustration of a PMOS CNFET and an NMOS CNFET using the MIXED process flow. Here, PMOS CNFETs have Platinum source/drain contacts and SiOX doping oxide, and NMOS CNFETs have Titanium source/drain contacts and HfOX doping oxide. d Experimentally measured drain current (I_D) vs. drain-to-source voltage (V_DS) characteristics from fabricated CNFETs, indicating similar drive current for PMOS and NMOS CNFETs. For NMOS (shown in red), the upper-most curve is measured with gate-to-source voltage (V_GS) of 1.8 V, with V_GS decreasing in steps of 0.1 V for each subsequent curve. For PMOS (shown in blue), the upper-most curve is measured with V_GS = −1.8 V, increasing in steps of 0.1 V for each subsequent curve. e I_D vs. V_GS (with V_DS = 1.8 V) for an NMOS CNFET, showing the ability to change the threshold voltage (i.e., to horizontally shift the I_D vs. V_GS curve) by controlling the stoichiometry of the doping oxide. The ratios shown in the legend ("4:1", "2:1", and "1:1") indicate the relative number of Hafnium (Hf) pulses to Oxygen (O) pulses during HfOX deposition to control the stoichiometry of the oxide (Lau et al. 2018)
CNFETs that is air-stable, electrically tunable, and robust (Hills et al. 2019). Details for RINSE and MIXED are provided below:
• RINSE: To enable CNFET circuit fabrication, CNTs must be uniformly deposited across the entire wafer. This can be achieved via solution processing, in which (150 mm) wafers are submerged in solutions that contain dispersed CNTs. Unfortunately, this CNT deposition technique can result in CNT aggregates deposited randomly across the wafer, which are considered manufacturing defects that act as particle contamination and thus reduce die yield. Existing techniques that attempt to remove CNT aggregates in the solution, i.e., before deposition (such as high-power sonication, centrifugation, or excessive filtering), either are insufficient to meet strict yield requirements for large-scale systems or cannot remove CNT aggregates without damaging CNTs, thus degrading CNFET performance (e.g., CNFET on-current density). Instead, by applying RINSE, we are able to
selectively remove CNT aggregates after deposition, without damaging the non-aggregated CNTs. Figure 4b illustrates the 3-step process for RINSE:
Step 1: deposit CNTs on the wafer (by submerging wafers pretreated with a CNT adhesion promoter in a pre-dispersed CNT solution).
Step 2: spin-coat a standard photoresist (polymethylglutarimide) onto the wafer and cure it at ~200 °C.
Step 3: place the wafer in a solvent (N-methylpyrrolidone) for sonication.
Hills et al. (2019) experimentally demonstrates that RINSE reduces CNT aggregate density (i.e., the number of CNT aggregates per unit area) by >250×, without damaging CNTs or affecting CNFET performance.
• MIXED: Energy-efficient CMOS logic circuits using CNFETs rely on the ability to fabricate both PMOS and NMOS CNFETs that are air-stable, robust, and have tunable electrical characteristics (e.g., controlling the threshold voltage to trade off higher CNFET circuit speed vs. lower CNFET leakage power). Existing techniques for CNT CMOS are insufficient, since they have large CNFET-to-CNFET variability, use materials that are not air-stable or not silicon-CMOS compatible, or are not robust. In order to address all of these challenges simultaneously, MIXED uses a combined doping approach that engineers both the oxide deposited over the CNTs to encapsulate the CNFET and the metal source/drain contacts to the CNTs, using a lower-workfunction metal (e.g., Titanium) for NMOS CNFETs and a higher-workfunction metal (e.g., Platinum) for PMOS CNFETs. MIXED leverages only air-stable and silicon-CMOS-compatible materials, and also allows for precise threshold voltage tuning by controlling the stoichiometry of robust atomic layer deposition (ALD) oxides deposited over the CNTs. MIXED also leverages workfunction engineering of the metal-CNT contacts in order to increase drive current for both PMOS CNFETs and NMOS CNFETs (Hills et al. 2019).
3.2 VLSI CNFET Nano-Design
While RINSE and MIXED address CNT aggregates and CNT CMOS, metallic CNTs (m-CNTs) remain an outstanding challenge that has not been overcome by isolated advances in nano-fabrication. M-CNTs increase leakage power in VLSI CNFET circuits (degrading EDP benefits) and also degrade the noise resilience of connected logic stages in digital VLSI circuits, which can lead to incorrect logic functionality (Hills et al. 2019). To quantify the circuit-level impact of m-CNTs, we consider the noise resilience of a pair of connected logic stages (comprising a driving logic stage and a loading logic stage, e.g., two cascaded inverters that form a CMOS buffer); a useful metric for quantifying noise resilience is the static noise margin (SNM). SNM is defined using the Voltage Transfer Characteristics (VTCs, which define the output voltage, V_OUT, as a function of the input voltage, V_IN, for each input of a logic stage) of the driving and loading logic stages. Using the static noise margin to quantify the noise resilience of digital VLSI circuits, one can then define the probability that all pairs of connected logic stages have SNM exceeding a target minimum required SNM,
i.e., SNM_R (SNM_R is chosen by the designer to meet circuit-level noise resilience requirements, and is typically a fraction of the circuit supply voltage V_DD, e.g., SNM_R = V_DD/5). Then the Probability that all static Noise Margin requirements are Satisfied is p_NMS, where p_NMS is the probability that SNM(G_i, G_j) ≥ SNM_R for all pairs of connected logic stages in the circuit, and SNM(G_i, G_j) is the SNM for driving logic stage G_i and loading logic stage G_j (with i ≠ j). M-CNTs can lead to near-zero p_NMS, which is not acceptable for VLSI circuits. To quantify the relationship between p_NMS and the fraction of m-CNTs on the wafer substrate, we also define p_S as the probability that a given CNT is a semiconducting CNT (s-CNT) instead of a metallic CNT (m-CNT). Figure 5h illustrates the relationship between p_NMS and p_S (shown for SNM_R = V_DD/5 for a circuit consisting of approximately one million logic gates); to achieve p_NMS = 99%, p_S must satisfy p_S ≥ 99.999,999%, which corresponds to 1 m-CNT in 10–100 million CNTs. Despite many efforts to remove m-CNTs (Arnold et al. 2006; Patil et al. 2009; Shulaker et al. 2013a, 2015), the highest-purity results achieve p_S ~ 99.99% (1 m-CNT in 10,000 CNTs), i.e., 3–4 orders of magnitude off the target in terms of purity. This is where the benefit of nano-design techniques comes into play. Specifically, Hills et al. (2019) describes a circuit design technique called DREAM ("Designing REsilience Against Metallic CNTs"), which overcomes the presence of m-CNTs entirely through circuit design, and enables VLSI CNFET circuits to meet p_NMS requirements with p_S = 99.99% CNT purity (e.g., p_NMS ≥ 99% for CNFET circuits with one million logic gates: Fig. 5h). Importantly, p_S = 99.99% has already been achieved today, and can be achieved through multiple techniques, e.g., solution-based CNT processing using the RINSE process.
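A deliberately crude probability model helps show why the purity requirement is so extreme for a circuit with about one million logic stage pairs. The sketch below assumes, purely for illustration, that each connected pair of logic stages violates its noise margin independently with some probability q, and that q is roughly the m-CNT fraction 1 − p_S (as if one critical CNT decided each pair). Neither assumption comes from the chapter; the actual analysis in Hills et al. (2019) is based on simulated VTCs and layouts.

```python
import math


def p_nms(pair_failure_prob, num_pairs):
    """Probability that all pairs meet SNM_R, assuming independent failures."""
    # log1p keeps precision when pair_failure_prob is tiny.
    return math.exp(num_pairs * math.log1p(-pair_failure_prob))


def required_pair_failure_prob(p_nms_target, num_pairs):
    """Largest per-pair failure probability q with (1 - q)**num_pairs >= target."""
    return -math.expm1(math.log(p_nms_target) / num_pairs)


NUM_PAIRS = 10**6  # about one million connected logic stage pairs (toy value)

# Toy assumption: per-pair failure probability equals the m-CNT fraction 1 - p_S.
print(p_nms(1e-4, NUM_PAIRS))   # p_S = 99.99%: effectively zero
print(p_nms(1e-8, NUM_PAIRS))   # p_S = 99.999999%: about 0.99
print(required_pair_failure_prob(0.99, NUM_PAIRS))  # about 1e-8
```

Even in this toy model, a one-in-10,000 failure probability per pair drives p_NMS to essentially zero, while p_NMS = 99% needs a per-pair failure probability near 10⁻⁸, the same order of magnitude as the purity gap discussed above.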
The key insight for DREAM is that m-CNTs affect the VTCs of different logic stages differently depending on how each logic stage is implemented (including both its schematic and physical layout). Thus, m-CNTs affect the SNM of a pair of logic stages differently depending on which logic stage is driving and which is loading. In particular, the SNM between a pair of connected logic stages, SNM(G_i, G_j), is more susceptible to m-CNTs for specific combinations of logic stages (G_i, G_j). DREAM first quantifies SNM for all possible combinations of logic stages in a standard cell library, and then applies a logic transformation during logic synthesis (e.g., using Synopsys Design Compiler® or Cadence Genus®) to preferentially avoid the specific combinations of logic stages whose SNM is most susceptible to m-CNTs, while preferring combinations of logic stages whose SNM is more robust to m-CNTs. Importantly, the same overall circuit logic functionality is maintained, since multiple configurations of logic gates can achieve the same overall logic function in a digital circuit. Figure 5g quantifies the SNM in the presence of m-CNTs for different pairs of connected logic stages in an example standard cell library (derived from Clark et al. 2016), and an algorithm for implementing DREAM using standard electronic design automation (EDA) tools for logic synthesis is provided in Hills et al. (2019). As an illustrative example, Fig. 5a–d illustrates the SNM for the four combinations of connected logic stage pairs using a 2-input not-and gate ("nand2") and a 2-input not-or gate ("nor2") in the presence of m-CNTs. Note that a single m-CNT can affect multiple CNFETs simultaneously, since the length of an m-CNT can be much longer
Fig. 5 DREAM overview. a–d Static Noise Margin (SNM) illustrated for four different pairs of connected logic stages using nand2 and nor2 logic stages. Simulation results are derived using a compact model for CNFETs (Lee et al. 2015) with parameters defined in Hills et al. (2019), in conjunction with library cells derived from Clark et al. (2016). Note that SNM is the minimum of the "high" SNM (SNM_H) and the "low" SNM (SNM_L), i.e., SNM = min(SNM_H, SNM_L) (Hills et al. 2015). e SNM illustration from analyzing experimentally measured CNFETs. Current–voltage (I-V) characteristics are measured from 1,000 NMOS CNFETs and 1,000 PMOS CNFETs, and are then used to solve the VTCs of nand2 and nor2 logic stages. Despite using the exact same CNFETs, (nor2, nor2) has better (higher) SNM than (nand2, nor2). See Hills et al. (2019) for details. f Cumulative distribution of SNM for one million combinations of nand2 and nor2 logic stages, solved using the same method as in (e). g Minimum SNM for combinations of logic stages for a projected 7 nm node CNFET technology (Hills et al. 2019). h p_NMS vs. p_S shown with and without DREAM (for SNM_R = V_DD/5, for CNFET circuits with one million logic gates), illustrating that DREAM can relax p_S requirements by 10,000× (Hills et al. 2019)
than the length of the CNFET channel, and so a single m-CNT can comprise part of the channel for multiple different CNFETs depending on their relative physical locations. Importantly, (nand2, nand2) and (nor2, nor2) have better (higher) SNM compared to (nand2, nor2) and (nor2, nand2), despite using the exact same VTCs. Thus, in this case, DREAM can be used to prefer (nand2, nand2) and (nor2, nor2) while avoiding (nand2, nor2) and (nor2, nand2), still permitting use of both nand2 and nor2. DREAM is one technique that emphasizes the essential interplay between emerging nanotechnologies and emerging nano-design. For example, achieving 99.999,999% CNT purity is currently impossible using material synthesis alone,
but VLSI systems can still be demonstrated today by using DREAM to overcome inherent technology challenges.
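The bookkeeping behind the DREAM idea, characterizing every (driver, load) cell pair and then steering the design away from the weakest pairs, can be sketched as follows. The SNM values, the supply voltage, and the tiny netlist are hypothetical numbers chosen only so that same-type pairs look better than mixed pairs, as in the nand2/nor2 example above. The actual algorithm in Hills et al. (2019) operates inside logic synthesis, whereas this sketch only flags offending connections.

```python
# Hypothetical worst-case SNM (in volts) per (driving cell, loading cell) pair.
SNM_TABLE = {
    ("nand2", "nand2"): 0.12,
    ("nor2", "nor2"): 0.11,
    ("nand2", "nor2"): 0.05,
    ("nor2", "nand2"): 0.04,
}


def allowed_pairs(snm_table, snm_required):
    """Cell pairs whose characterized SNM meets the designer's requirement."""
    return {pair for pair, snm in snm_table.items() if snm >= snm_required}


def violations(netlist_edges, cell_of, allowed):
    """Connections (driver, load) whose cell combination is not allowed."""
    return [
        (driver, load)
        for driver, load in netlist_edges
        if (cell_of[driver], cell_of[load]) not in allowed
    ]


V_DD = 0.5            # hypothetical supply voltage in volts
SNM_R = V_DD / 5      # requirement expressed as a fraction of V_DD, as in the text

allowed = allowed_pairs(SNM_TABLE, SNM_R)

# Hypothetical three-gate netlist: u1 drives u2, and u2 drives u3.
cell_of = {"u1": "nand2", "u2": "nand2", "u3": "nor2"}
edges = [("u1", "u2"), ("u2", "u3")]
print(violations(edges, cell_of, allowed))
```

A synthesis tool constrained in this spirit would re-map the flagged connection to a logically equivalent structure built only from allowed pairs, which is why overall logic functionality is preserved.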
3.3 Rapid Co-optimization of Processing and Design to Overcome Nanotechnology Variations
While the above nano-fabrication and nano-design techniques can be combined to overcome CNT aggregates, CNT CMOS, and metallic CNTs, another challenge remains: CNT variations. CNT variations can lead to near-zero functional yield, increase susceptibility to noise (quantified by p_NMS in the previous section), and degrade energy efficiency benefits of CNFET digital circuits (quantified by EDP). To overcome CNT variations, joint exploration and optimization of CNT processing parameters (to be improved during CNFET fabrication) and CNFET digital circuit design are required. However, existing approaches for such exploration and optimization rely on trial-and-error-based ad hoc techniques resulting in very long computation runtimes. Thus, how can a designer efficiently explore the large design space of CNT process improvements and CNFET circuit design, to overcome CNT variations in an energy-efficient manner? In this section, we present a new approach that achieves fast runtimes (e.g., 30 min for a processor core design vs. a month using existing approaches). This approach can be used to derive multiple design points (each representing a combination of parameters for CNT processing and CNFET circuit design) to overcome CNT variations. These design points preserve 90% of the projected EDP benefits of CNFET digital circuits (despite CNT variations), while simultaneously meeting circuit-level yield and noise margin constraints. The derived design points directly influence experimental research on CNFETs, and are thus essential to guide the allocation of valuable research time in developing new technologies.
An existing approach to overcome CNT variations is based on brute-force trial-and-error (Zhang et al. 2011): a designer iterates over many design points (example design points are illustrated in Fig. 3b; each one represents a combination of values for CNT processing parameters and CNFET circuit design parameters, e.g., the percentage of metallic CNTs, the standard deviation of the spacing between CNTs, the target clock frequency, or values that parameterize how many CNFETs are selectively upsized), analyzing each one until a design point is found that satisfies a target clock frequency and target p_NMS with small energy cost (e.g., energy cost: E < 5%). Furthermore, this approach uses highly accurate yet computationally expensive models to calculate delay penalties and PNMV. It suffers from two significant bottlenecks:
(1) The computational runtime required to calculate CNFET circuit delay and p_NMS in the presence of CNT variations limits the number of design points that can be explored.
(2) The number of required simulations can be exponential in the number of CNT processing and CNFET design parameters.
The approach described in Hills et al. (2015) overcomes these bottlenecks as follows:
(1) Degradations in CNFET circuit delay and p_NMS (induced by CNT variations) are computed 100× faster than in the previous approach by using linearized circuit models. This speed-up enables exploration of many more design points while maintaining sufficient accuracy to make correct design decisions (details in Hills et al. (2015)).
(2) An efficient gradient descent search algorithm, which is based on delay and p_NMS sensitivity information with respect to the processing parameters, is used to systematically guide the exploration of design points (details in Hills et al. (2015)).
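The contrast between exhaustive exploration and sensitivity-guided search can be sketched with a toy problem. Everything below is hypothetical: the penalty function stands in for a variation-induced delay penalty, its two inputs are normalized CNT process improvements between 0 (today's process) and 1 (ideal), and the step size and target are arbitrary. The method in Hills et al. (2015) uses linearized circuit models of real designs, which this sketch does not reproduce.

```python
import math


def delay_penalty(x):
    """Hypothetical smooth stand-in for a variation-induced delay penalty."""
    return 0.30 * math.exp(-3.0 * x[0]) + 0.10 * math.exp(-1.0 * x[1])


def value_and_sensitivities(f, x, h=1e-4):
    """Evaluate f and estimate its sensitivity to each parameter (forward difference)."""
    base = f(x)
    grad = []
    for i in range(len(x)):
        shifted = list(x)
        shifted[i] += h
        grad.append((f(shifted) - base) / h)
    return base, grad


def gradient_search(f, x0, target, step=0.5, max_iters=200):
    """Follow the negative gradient until the penalty target is met."""
    x = list(x0)
    evaluations = 0
    for _ in range(max_iters):
        value, grad = value_and_sensitivities(f, x)
        evaluations += 1 + len(x)  # one base evaluation plus one per parameter
        if value <= target:
            return x, value, evaluations
        # Step against the gradient, keeping parameters in their valid range.
        x = [min(1.0, max(0.0, xi - step * gi)) for xi, gi in zip(x, grad)]
    return x, f(x), evaluations


point, penalty, evaluations = gradient_search(delay_penalty, [0.0, 0.0], target=0.10)

# Brute force with 50 samples per parameter grows as 50**n in the parameter count n.
grid_evaluations = 50 ** 2
print(point, penalty, evaluations, grid_evaluations)
```

The guided search needs a number of model evaluations that grows roughly linearly with the number of parameters per step, whereas a grid with 50 samples per parameter needs 50² evaluations for two parameters and 50⁵ for five.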
Figure 6 illustrates that the combination of these techniques can exponentially reduce the required simulation time. Specifically, by leveraging linearized models for variations in circuit delay, energy, and noise, a designer can easily combine these models with high-level optimization techniques, such as a gradient descent search algorithm. Then, by using gradient descent search, each time the designer takes a step with the gradient, they are able to incrementally compute the impact of variations, leveraging computation results from the previous design point instead of starting from scratch. Thus, the combination of all these techniques provides an exponential speed-up compared to brute force, e.g., reducing the required computational runtime from 1.5 months to 30 min for the “fgu” module of OpenSparc T2 (Fig. 6b). Importantly, all of the techniques described in Sect. 3 can be integrated into standard VLSI processing and design flows, using industry-practice electronic design automation (EDA) tools, which is a critical component of accelerating the adoption of a new technology into the mainstream. For example, Fig. 7 illustrates a reference design flow integrating RINSE, MIXED, DREAM, and the rapid co-optimization of CNT processing and CNFET circuit design described here to overcome CNT variations. The experimental CNFET circuit demonstrations described in the next section have leveraged this flow.
Fig. 6 Rapid co-optimization of CNT process improvements and CNFET circuit design. a Combined approach, leveraging linearized circuit models, gradient descent search, and rapid statistical analysis. b Resulting speed-up in computational runtime, shown for modules from the processor core of OpenSparc T2 (Hills et al. 2015)
Fig. 7 VLSI CNFET Nano-Fabrication + Nano-Design flow, including RINSE, MIXED, DREAM, and the rapid co-optimization of CNT processing and CNFET circuit design to overcome CNT variations (each of these steps is highlighted in blue). Details of the “DREAM-enforcing standard cell library” can be found in Hills et al. (2019)
3.4 Experimental CNFET Circuit Demonstrations
This sub-section summarizes how the combined nano-fabrication and nano-design techniques have enabled experimental realizations of larger-scale CNFET circuits. These demonstrations include CNT CMOS analog and mixed-signal circuits (Amer et al. 2019; Ho et al. 2019), static random-access memory (SRAM) arrays using CNFETs (Kanhaiya et al. 2019a, b), and a RISC-V microprocessor built using CNFETs (Hills et al. 2019). Analog, digital, and memory circuits have become essential parts of VLSI computing systems today, and so the ability to yield these types of circuits is an important aspect of technology development for the wide range of new technologies being considered for next-generation computing systems. We refer the reader to the respective references for more details of these experimental CNFET circuit demonstrations.
• CNT CMOS analog and mixed-signal circuits: While CNFET digital logic can maintain correct logic functionality in the presence of m-CNTs (e.g., leveraging DREAM, although increased leakage current can degrade overall EDP and SNM metrics), m-CNTs can result in catastrophic failure mechanisms for analog CNFET circuits. For example, m-CNTs can severely attenuate amplifier gain, resulting in incorrect operation of mixed-signal circuit building blocks, including digital-to-analog converters (DACs) and analog-to-digital converters (ADCs). To overcome the challenge of m-CNTs for analog and mixed-signal circuits, Amer et al. (2019) provides an overview of a combined processing and design technique called SHARC (Self-Healing Analog with RRAM and CNFETs). SHARC leverages programmable Resistive Random-Access Memory (RRAM) elements, which are configured in series with CNFETs, to automatically “self-heal” analog circuits to operate correctly despite the presence of m-CNTs.
SHARC enabled the first analog and mixed-signal CNT CMOS circuits that are robust to m-CNTs, including 4-bit DACs and successive approximation register (SAR) ADCs (Amer et al. 2019; Ho et al. 2019). Additional CNFET analog circuit demonstrations are described in Ho et al. (2019) and are shown in Fig. 8.
• CNT CMOS SRAM arrays: Fig. 9 summarizes experimental demonstrations and measurements of 1 kilobit (32 × 32) 6-transistor (6T) CNFET SRAM arrays, each comprising 6,144 CNFETs (both PMOS and NMOS), with all 1,024 cells functioning correctly while being connected within the same circuit (with shared wordlines and shared bitlines) (Kanhaiya et al. 2019a, b). Additional demonstrations in Kanhaiya et al. (2019a, b) include the first 10-transistor (10T) SRAM cells, which exhibit relatively higher read and write margins (Calhoun and Chandrakasan 2007), and which operate at highly scaled voltages down to VDD = 300 mV (Kanhaiya et al. 2019b). Because CNFETs can be fabricated at low processing temperatures, CNFET SRAM cells can be fabricated directly on top of interconnect routing (additional details in Sect. 4). This enables new circuit-/system-level opportunities for CNFET SRAM, including: (1) fabricating SRAM directly on top of processor cores (Shulaker et al. 2014, 2017), and (2) utilizing back-end-of-line (BEOL) metal routing both above and below CNFETs (e.g., buried power rails (Chava et al. 2018)) to potentially improve SRAM cell density (Kanhaiya et al. 2019b).
• RV16X-NANO: Fig. 10 illustrates a recent demonstration of a microprocessor built entirely from CNFETs, which is based on the RISC-V instruction set (https://riscv.org/specifications), runs standard 32-bit instructions on 16-bit data and addresses, comprises >14,700 CMOS CNFETs (both PMOS and NMOS), and can execute compiled programs while interfacing with memory (Hills et al. 2019). Importantly, it leverages substantial existing infrastructure for both VLSI processing and design, which can more easily facilitate its adoption into high-volume commercial foundries (Sect. 3.5). As alluded to in Fig. 9, since CNFETs can be fabricated on top of back-end-of-line (BEOL) metal interconnects, RV16X-NANO also implements a new physical design architecture with BEOL metal routing both above and below the active CNFET layers. Such routing architectures can help to reduce overall routing congestion, e.g., as standard library cells continue to scale to extreme dimensions; for RV16X-NANO, metal layers above CNFETs are primarily used for power distribution, while metal layers underneath CNFETs are primarily used for signal routing, all of which has been designed using standard electronic design automation (EDA) tools for physical placement-and-routing (Cadence Innovus®). We refer the reader to Hills et al. (2019) for extensive details of the architecture, programs, standard cell libraries, and process design kits (PDKs) used to design and fabricate RV16X-NANO.
Additional experimental demonstrations, which established many of the foundations for the larger-scale demonstrations discussed here, include CEDRIC, a Turing-complete microprocessor built using CNFETs (Shulaker et al. 2013a), and Sacha, the Stanford Carbon Nanotube-based Hand-Shaking Robot, which operated based on a phase-locked loop (PLL) circuit fabricated using CNFETs (Shulaker et al. 2013b).
Fig. 8 CNT CMOS Analog Circuits (Ho et al. 2019), including 2-stage operational amplifier (op-amp) in (a)–(d), and implementation of CNFET op-amp in a current-sensing analog sub-system. a 2-stage op-amp schematic. Annotated CNFET widths are multiples of a CNFET with width W = 5 µm and length L = 3 µm. b Scanning electron microscopy (SEM) image of one fabricated 2-stage op-amp, false-colored to show the PMOS and NMOS CNFETs in the circuit (large squares are probe pads). c Three overlaid measured waveforms from the 2-stage op-amp, showing output voltage (VOUT) as a function of differential input voltage (VIN = VIN+ − VIN−). d Corresponding gain for the same measurements in (c), with gain = ΔVOUT/ΔVIN, where ΔVOUT = VOUT(VIN+) − VOUT(VIN−) (additional figures of merit are provided in Ho et al. (2019)). e–f Schematic and SEM of the current-sensing analog sub-system with external current source. g Measured linear response of the sub-system, converting input current to output voltage (with supply voltage VDD = 0.48 V). h–i 100 repeated measurement cycles, illustrating minimal drift over time, for VDD = 0.48 V (in (h)) and VDD = 2.0 V (in (i)), demonstrating functionality and linearity over a range of supply voltages
Fig. 9 CNT SRAM. a SEM image of 1 kilobit CNFET 6T SRAM memory array. b SEM of individual SRAM cell, which is false-colored to highlight the power rails (VDD and GND), pull-up CNFETs (P1 and P2), pull-down CNFETs (D1 and D2), access CNFETs (A1 and A2), wordline (WL) and bitlines (BL and BLN). Relative CNFET sizing is 2.25:1.5:1 for D1/D2:A1/A2:P1/P2. c Corresponding schematic for each 6T SRAM cell. d–j 6T SRAM cell characterization, including: d read margin, e hold margin, f write margin, all measured from a typical CNFET CMOS 6T SRAM cell. g 1,000 overlaid measurements for a single CNFET SRAM cell. Statistical distributions from 40 CNFET SRAM cells are shown for: h write margin, i read margin, and j hold margin, with summary statistics μWRITE, μREAD, μHOLD to denote the average values and σWRITE, σREAD, and σHOLD to denote the standard deviations. Additional details are provided in Kanhaiya et al. (2019b)
Fig. 10 RV16X-NANO. 16-bit microprocessor designed entirely using CNFETs (RV16X-NANO) (Hills et al. 2019). a Die photograph, including standard library cells comprising the processor core in the center (~7 mm by 7 mm), power rails extending horizontally, and probe pads around the perimeter (core inputs are primarily located toward the top, outputs are primarily located toward the bottom, and power pads are located on the left and right edges). b Zoomed-in photograph illustrating five rows of CNFET library cells with alternating supply voltage (VDD) and ground (GND) power rails (PMOS CNFETs are adjacent to VDD and NMOS CNFETs are adjacent to GND). c Schematic of a CNFET, with source/drain contacts shown in green, and gate contact in red underneath the CNFET channel (for back-gate CNFET geometries). d Top-view scanning electron microscopy (SEM) image of a CNFET channel with false-colored CNTs. e CNT rendering illustrating the location of CNTs in the CNFET channel. f Measured waveforms from the canonical “Hello, world” program, with input 32-bit instructions shown in blue and character output shown in red; the message translated from the ASCII-valued 8-bit char[7:0] (which is valid when char[8] is high) is highlighted at the bottom
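The character output convention in panel f (an 8-bit ASCII value in char[7:0], qualified by a valid flag in char[8]) can be sketched in a few lines. This is only an illustration of that convention: the function name `decode_output` and the sample stream are hypothetical and are not measured data from Hills et al. (2019).

```python
from typing import Iterable

VALID_BIT = 1 << 8   # char[8]: high when char[7:0] holds a valid character
CHAR_MASK = 0xFF     # char[7:0]: ASCII value of the character


def decode_output(samples: Iterable[int]) -> str:
    """Keep only samples whose valid bit is set and map them to ASCII text."""
    return "".join(chr(s & CHAR_MASK) for s in samples if s & VALID_BIT)


# Hypothetical 9-bit sample stream: each valid character is followed by an
# idle sample (valid bit low) that the decoder must ignore.
samples = []
for ch in "Hello, world":
    samples.append(VALID_BIT | ord(ch))  # valid sample
    samples.append(ord("x"))             # idle sample, char[8] low

decoded = decode_output(samples)
print(decoded)
```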
3.5 CNFET Technology Transfer to High Volume Commercial Manufacturing Facilities
Despite progress described in previous sections for developing CNFET technologies, existing demonstrations of CNFETs have been limited to academic institutions and research laboratories. While technology transfer into commercial manufacturing facilities is a necessary step for high-volume proliferation of CNFET technologies, significant obstacles must be overcome beforehand. One of these major challenges is that all of the materials and processes used to fabricate CNFETs must meet the strict compatibility requirements of silicon-based commercial fabrication facilities. In this section, we provide an overview of recent efforts to address a specific aspect of these material- and process-based challenges: how to develop a suitable method for depositing CNTs uniformly over industry-standard large-area substrates. To facilitate high-volume, low-cost manufacturing of CNFETs, such a deposition method for CNTs needs to be manufacturable, compatible with today’s silicon-based technologies, and provide a path to achieving systems with energy efficiency benefits over silicon. Bishop et al. (2020) provides a method that meets these requirements, using a solution-based CNT deposition technique called incubation, in which a substrate is submerged within a CNT solution, allowing CNTs to adhere to its surface (Hills et al. 2019; Zhong et al. 2017). The CNT incubation technique described in Bishop et al. (2020) offers the following key advantages that are particularly useful for the initial adoption of CNFETs in commercial manufacturing facilities:
(1) Low barrier for integration: uniform CNT deposition across 200 mm substrates has been experimentally demonstrated using equipment that is already being used for silicon CMOS fabrication within these facilities, which accelerates adoption of CNTs by leveraging existing infrastructure.
(2) Large quantity production: solution-based CNTs can be synthesized in large quantities for high-volume production while meeting CNT material-level requirements for realizing digital VLSI circuits (e.g., with high semiconducting CNT purity exceeding 99.99%, which meets the requirements described in Sect. 3.2) (Cao et al. 2013; Ding et al. 2015; Green and Hersam 2007; Hills et al. 2019). The creation of these highly purified semiconducting CNT solutions that meet the stringent chemical and particulate contamination requirements is also a key enabler for including CNFETs in commercial facilities (Baltzinger and Delahaye 1999).
(3) Improved throughput and a path for energy efficiency: while various incubation techniques have been demonstrated to be practical and effective (Hills et al. 2019; Kanhaiya et al. 2019b; Srimani et al. 2019), Bishop et al. (2020) offers characterization of the fundamental aspects of CNT incubation with respect to manufacturability, compatibility, and the resulting CNFET performance that can be achieved. This insight has resulted in both increased throughput (reducing the time required to perform CNT incubation from 48 h to 150 s) and VLSI circuit-level power/performance-based analysis demonstrating that incubation enables a path for CNFET circuits to compete with and eventually surpass the energy efficiency of silicon-based circuits at comparable technology nodes.
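To put the two figures quoted in the list above in perspective, the short calculation below converts them into a throughput ratio and an impurity bound. It uses only the numbers stated in the text (48 h, 150 s, 99.99%); the variable names are ours.

```python
SECONDS_PER_HOUR = 3600

# Incubation time before and after the improvements reported in the text.
before_s = 48 * SECONDS_PER_HOUR
after_s = 150
speedup = before_s / after_s
print(f"incubation speed-up: {speedup:.0f}x")  # prints: incubation speed-up: 1152x

# A semiconducting purity above 99.99% leaves fewer than 1 in 10,000 CNTs
# that are not semiconducting.
max_impurity_fraction = round(1 - 0.9999, 6)
print(f"impurity bound: 1 in {1 / max_impurity_fraction:.0f}")
```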
The advances in CNT incubation described in Bishop et al. (2020), together with the co-optimized nano-fabrication and nano-design techniques described previously in this section, have enabled CNFET fabrication within two distinct industry manufacturing facilities: a commercial silicon manufacturing facility (Analog Devices, Inc.) and a high-volume manufacturing semiconductor foundry (SkyWater Technology Foundry). At each of these facilities, CNFETs are fabricated using the same equipment currently being used to fabricate silicon product wafers, explicitly demonstrating that CNFET fabrication can leverage existing infrastructure and is silicon-CMOS-compatible. Figures 11 and 12 illustrate some of the first experimental data from these commercial facilities fabricating CNFETs and CNFET-based circuits, including uniform and reproducible CNFET fabrication across industry-standard 200 mm wafers, with 14,400/14,400 CNFETs distributed across multiple wafers and across 200 mm substrates (Fig. 11) (Bishop et al. 2020), and electrical measurements of the first CNFET-based standard library cells fabricated at SkyWater at a 130 nm technology node (Fig. 12) (Srimani et al. 2020).
Fig. 11 Wafer-scale integration of CNFETs across 200 mm wafers within a commercial silicon foundry. a Processing station within the foundry for performing CNT incubation (details in Bishop et al. (2020)). b–d Images from the foundry, including: b 200 mm wafer with CNFETs, c individual die, and d top-view SEM of a single CNFET (note that the CNTs are not visible in this image due to the CNFET gate embedded underneath the channel region). e Cross-sectional SEM of two CNFETs connected in series (sharing a source/drain contact), with false-colored source/drain metal contacts, high-k gate dielectric, and embedded metal gates (leveraging a back-gate CNFET geometry). The sum of the channel length (~285 nm) and the contact length (~265 nm) sets the contacted gate pitch (CGP) of ~550 nm, suitable for a ~130 nm technology node. f–g SEM images from multiple points across 200 mm wafers, which illustrate CNT deposition (after CNT incubation), and which are used for characterizing CNT density, uniformity, and reproducibility (see Bishop et al. 2020)
Fig. 12 Experimental measurements of CNFET standard library cells fabricated at SkyWater Technology Foundry (schematics for all library cells follow design guidelines for static CMOS logic families, i.e., with PMOS CNFETs comprising the pull-up network and NMOS CNFETs comprising the pull-down network; see Hills et al. (2019) for schematics). a 200 mm wafer with CNFET circuits, including multiple standard library cells shown in (b)–(g). Each unique entry in (b)–(g) corresponds to a unique standard library cell, with the physical layout shown on the left (image from Cadence Virtuoso®), an SEM image shown in the center, and experimentally measured waveforms from multiple instantiations of that cell shown on the right (with supply voltage VDD = 1.8 V). Waveforms include overlaid measurements from at least 100 instantiations of each logic gate. b 2-input not-or “NOR2” logic gate, with logical function “OUT = !(A + B)”. The relationships VOUT vs. VA and VOUT vs. VB are the voltage transfer curves for multiple instantiations of NOR2 logic gates. Gain is the maximum value of ΔVOUT/ΔVIN for VIN in the range of [0, VDD] (where VIN corresponds to either of the A or B inputs being swept, i.e., either VA or VB). c 2-stage buffer “BUF” logic gate. Swing is the difference between the maximum and minimum value of VOUT (as a fraction of VDD) as VA is swept over the range [0, VDD]; thus, for static CMOS logic, swing should approach 1.0 (i.e., VOUT is “rail-to-rail”). d Half-adder logic gates, with “SUM = XOR(A, B)” and “CO = A*B”. Measured waveforms show input voltages (VA and VB) and output voltages (SUM and CO) as functions of time. e Full-adder logic gate, with “SUM = XOR(A, B, C)” and “CO = A*B + B*C + A*C”. Sequential logic elements include D-Latches (shown in (f), with data input “D” and enable input “EN”) and D-Flip-Flops (shown in (g), with data input “D” and clock input “CLK”), both of which have output “Q” to indicate the state.
Additional library cell functions realized (not shown here) include: D-flip-flops with asynchronous reset, D-flip-flops with scan, clock-gating cells, multiplexors, exclusive-or, exclusive-nor, fill cells (to connect power rails during place-and-route), and “decap” cells (to increase capacitance between power supply rails)
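The logic functions and figures of merit named in the Fig. 12 caption can be written out as a minimal sketch. The Boolean functions follow the caption directly; the `gain` helper takes the magnitude of the steepest slope between adjacent samples, which is our assumption about how to handle an inverting gate, and the voltage transfer curve data is a made-up example rather than a measurement.

```python
from itertools import product
from typing import Sequence, Tuple


def nor2(a: int, b: int) -> int:
    """OUT = !(A + B)"""
    return int(not (a or b))


def half_adder(a: int, b: int) -> Tuple[int, int]:
    """SUM = XOR(A, B), CO = A*B"""
    return a ^ b, a & b


def full_adder(a: int, b: int, c: int) -> Tuple[int, int]:
    """SUM = XOR(A, B, C), CO = A*B + B*C + A*C"""
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)


# Sanity check: SUM + 2*CO must equal the arithmetic sum of the inputs.
for a, b, c in product((0, 1), repeat=3):
    s, co = full_adder(a, b, c)
    assert s + 2 * co == a + b + c


def swing(v_out: Sequence[float], vdd: float) -> float:
    """Output range as a fraction of VDD (1.0 means rail-to-rail)."""
    return (max(v_out) - min(v_out)) / vdd


def gain(v_in: Sequence[float], v_out: Sequence[float]) -> float:
    """Largest |dVOUT/dVIN| between adjacent samples (magnitude is an assumption)."""
    return max(
        abs((v_out[i + 1] - v_out[i]) / (v_in[i + 1] - v_in[i]))
        for i in range(len(v_in) - 1)
    )


# Hypothetical transfer curve of an inverting gate at VDD = 1.8 V.
v_in = [0.0, 0.6, 0.8, 1.0, 1.2, 1.8]
v_out = [1.8, 1.75, 1.4, 0.4, 0.05, 0.0]
print(swing(v_out, 1.8), round(gain(v_in, v_out), 3))
```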
4 Next-Generation Nano-Systems
While the sections so far have highlighted CNFETs as an example of an end-to-end approach for developing one specific nanotechnology, finding the “best” transistor or memory technologies alone is insufficient to satisfy future application demands. Instead, heterogeneous integration of multiple technologies, which can be combined to create entirely new computing systems, can result in far larger benefits overall. This is because systems today (including general-purpose processors and domain-specific accelerators) are often limited by system-level inefficiencies; for example, the “memory wall” refers to the vast majority of execution time and energy that is wasted passing data back and forth between processing elements and off-chip memory (e.g., off-chip DRAM). To overcome these outstanding challenges, the device-level benefits of new nanotechnologies must be combined with the novel system architectures that they naturally enable. For example, many nanotechnologies can be fabricated at low processing temperatures (compared to temperatures on the order of 1,000 °C for silicon CMOS), which is a key property that enables the development of monolithic 3D nanosystems. It is projected that monolithic 3D systems, with multiple layers of computation and multiple layers of memory densely integrated directly on top of each other, can improve energy efficiency by over two orders of magnitude compared to systems today (quantified by Energy-Delay Product (EDP)) (Aly et al. 2018). Alternative approaches to 3D integration include “2.5-dimensional” integration (integrating multiple chips on interposers) or 3D chip stacking, but the relatively large pitch of vertical interconnects (such as Through-Silicon Vias: TSVs) limits the density of vertical interconnects. Monolithic 3D systems, on the other hand, leverage standard metal routing vias from the back-end-of-line (BEOL), which can be much denser (e.g., over 2 orders of magnitude denser than TSVs (Aly et al. 2015)) and can translate to massive increases in system-level performance metrics such as processor-to-memory bandwidth (Aly et al. 2015, 2018).
In this section, we provide a summary of the progress and prospects of developing such monolithic 3D “nanosystems”. Figure 13 illustrates that nanosystems are naturally enabled by low-temperature fabrication of emerging nanotechnologies for both logic and memory. Using these technologies, nanosystems offer radically new opportunities to improve energy efficiency (e.g., with separate circuit tiers optimized for processor cores, caches, power delivery, heat removal, etc.), and an example 3D nanosystem is shown in Fig. 14. Section 4.1 presents experimental demonstrations of 3D nanosystem prototypes that have been developed in academic institutions; Sect. 4.2 presents progress toward the development of 3D nanosystems at commercial manufacturing facilities, including advances in both processing and design infrastructures.
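For reference, the energy-delay product used in the comparison above is the product of the energy consumed and the execution time for a given workload. The symbols E, T, α, and β below are notation introduced here for illustration and do not come from Aly et al. (2018); how a given benefit splits between energy and delay is left open.

```latex
\mathrm{EDP} = E \cdot T,
\qquad
\frac{\mathrm{EDP}_{\text{baseline}}}{\mathrm{EDP}_{\text{3D}}}
= \underbrace{\frac{E_{\text{baseline}}}{E_{\text{3D}}}}_{\alpha}
  \cdot
  \underbrace{\frac{T_{\text{baseline}}}{T_{\text{3D}}}}_{\beta}
= \alpha\beta
```

As a purely hypothetical example, a system that used 10× less energy and ran 10× faster on the same workload would show a 100× EDP benefit, i.e., two orders of magnitude.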
Fig. 13 Monolithic 3D integration is naturally enabled by emerging nanotechnologies that can be fabricated at low processing temperatures. For example, for logic, one can use CNFETs (using all the techniques described in the previous sections) or various 2D materials (black phosphorus, MoS2 , WSe2 ), and for memory, there is a wide range of technologies to choose from, including RRAM, spin-transfer torque magnetic RAM (STT-MRAM), conductive bridge RAM (CBRAM), and more, and a designer can choose the technology with characteristics best-suited for a particular application
Fig. 14 Example 3D nanosystem, enabled by the low-temperature fabrication of emerging nanotechnologies. Such nanosystems combine advances from across the computing stack, including nanomaterials such as CNTs for high-performance and energy-efficient transistors, high-density on-chip non-volatile memories, fine-grained 3D integration of logic and memory with ultra-dense connectivity, new 3D architectures for computation immersed in memory, and integration of new materials technologies for efficient heat removal solutions. Resulting nanosystems offer radically new opportunities for computing architectures, e.g., with separate circuit tiers optimized for processor cores, caches, power delivery, heat removal, etc., as shown here
4.1 Experimental 3D Nano-System Demonstrations
Just as with the end-to-end approach for evaluating the potential benefits of CNFETs for 2D circuits, quantifying the benefits of 3D nanosystems is a necessary step before investing resources in their fabrication and experimental development. For extensive analysis of various system configurations and potential paths for continuing to improve 3D nanosystems, we refer the reader to Aly et al. (2018), which serves to motivate the experimental nanosystem demonstrations described in this section. To demonstrate that experimental nanosystems are now becoming a reality, we summarize two representative system-level demonstrations, which not only show that monolithic 3D integration of multiple nanotechnologies is achievable in practice, but also demonstrate some of the application domains enabled by monolithic 3D integrated systems.
• 3D nanosystem integrating layers of computation, memory, and sensing: Fig. 15a illustrates a prototype 3D nanosystem that comprises over two million CNFETs and one megabit of RRAM, all of which are fabricated sequentially over a bottom tier of silicon FETs (Shulaker et al. 2017). In particular, the CNFETs occupy two unique vertical circuit tiers. On the top tier, the CNFETs are exposed to the environment and function as gas sensors, writing their captured data directly into the tier of RRAM memory underneath (the “1-transistor 1-resistor” or “1T1R” memory cells use the bottom layer of silicon FETs for the access transistor). Another tier of CNFET circuits is then used to implement a classification accelerator that extracts features from the data stored in the RRAM memory. Since each CNFET sensor writes directly into its own dedicated memory cell, without the need to be serialized through a memory port interface, this 3D nanosystem can capture massive amounts of data every second and process it on-chip, so that the overall chip output is highly processed information instead of raw CNFET sensor data. As a demonstration, Shulaker et al. (2017) shows how this system is used to classify ambient gases. Furthermore, the fact that the layers are fabricated on top of silicon circuits experimentally demonstrates that 3D nanosystems are silicon-CMOS compatible, i.e., emerging nanotechnologies can be fabricated on top of existing silicon-based technologies.
• Monolithic 3D imaging system: Fig. 15b illustrates an experimentally fabricated and tested 3D nanosystem comprising three vertical circuit tiers: silicon-based imaging pixels on the bottom tier, followed by CNFET circuits on the tier above (tier 2) to perform pre-processing on the image data, and then CNFET circuits on the third tier for executing algorithms. Srimani et al. (2019) offers a demonstration of how this system is used to perform in-situ edge detection. Leveraging the ultra-dense vertical connectivity enabled by monolithic 3D integration, every pixel in parallel sends data vertically through the chip to the upper layers for subsequent processing, instead of having to read out data from each pixel serially, store the raw pixel values in memory, and then compute on the data in memory (e.g., in a conventional 2D system). Thus, this 3D camera system is able to output highly processed information instead of the raw pixel data. This system-level approach can enable high-throughput and low-latency image classification systems that would otherwise be impossible to build using today’s silicon-based technologies.
Additional 3D nanosystem demonstrations, not described here but that we refer the interested reader to, include Wu et al. (2018, 2019).
Fig. 15 Experimental demonstrations of 3D nanosystems. a 4-tier nanosystem comprising two tiers of CNFETs, one tier of RRAM, and one tier of silicon FETs (example applications include high-throughput characterization of ambient gases) (Shulaker et al. 2017). b Monolithic 3D imaging system, with CNFET-based edge detection circuitry fabricated directly on top of silicon-based imaging pixels (Srimani et al. 2019)
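To make the per-pixel nature of in-situ edge detection concrete, here is a minimal software sketch. It is not the circuit of Srimani et al. (2019): the neighbour-difference rule, the `threshold` parameter, and the test frame are assumptions chosen for illustration. The point is that each output depends only on a pixel and its immediate neighbours, which is what lets hardware evaluate all pixels in parallel through dense vertical connections.

```python
from typing import List

Image = List[List[int]]


def edge_map(image: Image, threshold: int) -> Image:
    """Flag a pixel when it differs from its right or lower neighbour by more than `threshold`."""
    rows, cols = len(image), len(image[0])
    edges = [[0] * cols for _ in range(rows)]
    for r in range(rows):          # serial loops in software; independent per pixel
        for c in range(cols):
            here = image[r][c]
            right = image[r][c + 1] if c + 1 < cols else here
            below = image[r + 1][c] if r + 1 < rows else here
            diff = max(abs(here - right), abs(here - below))
            edges[r][c] = 1 if diff > threshold else 0
    return edges


# Hypothetical 3 x 4 frame with a vertical boundary between dark and bright regions.
frame = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [10, 10, 200, 200],
]
result = edge_map(frame, threshold=50)
for row in result:
    print(row)
```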
4.2 Three-Dimensional Nano-Systems in Commercial Foundries
With the ongoing adoption of emerging nanotechnologies in commercial foundries (e.g., as described in Sect. 3.5), the subsequent development of 3D nanosystems is a natural progression to fully capitalize on the benefits that new technologies have to offer. For 3D nanosystems leveraging CNFETs, each individual CNFET circuit tier follows similar processing as described for 2D CNFET systems, and so all of the techniques for overcoming inherent CNT imperfections and variations (described above) can be reused in their entirety for the development of 3D systems. This approach has been taken by SkyWater Technology Foundry, which has demonstrated in Srimani et al. (2020) that it is developing processes for CNFETs and RRAM that can be integrated directly into the back-end-of-line (BEOL), enabling the recent demonstration of monolithic 3D systems being developed at SkyWater with multiple layers of CNFETs and multiple layers of RRAM at a 130 nm technology node (https://spectrum.ieee.org/nanoclast/semiconductors/devices/first-3d-nanotube-and-rram-ics-come-out-of-foundry). In this section, we provide an overview of this technology that is currently being developed, including infrastructure for both VLSI processing and design.
Figure 16 illustrates the initial foundry process, which is implemented across industry-standard 200 mm substrates. The full monolithic 3D stack, which integrates four tiers of active devices distributed throughout the BEOL metal layers, offers 15 metal layers on 13 different physical layers, using 42 mask layers. These active device tiers include two tiers of CMOS CNFETs and two tiers of RRAM, all of which are fabricated using low-temperature and BEOL-compatible process flows. All vertical layers are fabricated sequentially over the same starting substrate, using the same BEOL inter-layer vias that are used to connect standard metal layers (such as the vias connecting “metal 1” and “metal 2”). Due to monolithic 3D integration, the vertical connectivity between tiers can exceed 11 million vertical interconnects per mm² (with via pitch of ~300 nm at the ~130 nm technology node). All fabrication is wafer-scale without any per-unit customization, leveraging existing silicon CMOS high-volume manufacturing processing and infrastructure. As an example of circuits spanning multiple tiers, electrical current–voltage characteristics for 1T-1R memory cells are shown in Fig. 16g, for all four combinations of: either NMOS or PMOS CNFET on tier 3 (for the “1T” element), and RRAM on either tier 1 or tier 2 (for the “1R” element). Electrical characteristics for CNFETs are similar to those shown in Fig. 12, since the same process is used for each CNFET tier.
In addition to the monolithic 3D process infrastructure, this process is accompanied by a complete design infrastructure, so that a designer has everything needed to tape out a monolithic 3D system using this process. An essential component of this 3D design infrastructure has been the development of a monolithic 3D process design kit (PDK), which provides 3D support for: Design Rule Check (DRC), Layout Vs. Schematic (LVS), Parasitic Extraction (PEX), circuit simulation, electromigration/voltage drop analysis (EM/IR), logic synthesis, place-and-route, metal fill, and optical proximity correction (OPC) for final photomask generation. In contrast, many of today’s existing efforts to design 3D systems rely on (manually) stitching together separate circuit tiers, each designed using conventional PDKs; however, this approach can neglect critical effects such as inter-tier parasitics (affecting timing closure), and can also prevent teams from verifying that their designs are correct (lacking tools such as LVS for full 3D systems). Thus, while these alternative approaches may suffice for academic exercises, they can be insufficient for analyzing, verifying, and taping out 3D systems. Figure 17 summarizes the industry-practice VLSI design flow described in Srimani et al. (2020), which corresponds to the process in Fig. 16. In addition to the 3D design tools described above, this design flow also incorporates compact models for CNFETs and RRAM on all circuit tiers that are compatible with standard circuit simulations (e.g., Synopsys HSPICE® and Cadence Spectre®), as well as standard cell libraries with 906 total standard cells, including high-density, high-speed, and low-leakage standard cell variants. Importantly, the design flow leverages existing commercial tools and performs all steps required to transform high-level hardware descriptions into standard layout formats for generating final reticles for fabrication.
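The vertical interconnect density quoted above for a ~300 nm via pitch can be checked with a short calculation. The sketch assumes vias placed on a uniform square grid with one via per pitch-by-pitch area, which is our simplification and not a statement about the actual foundry layout rules; the function name is hypothetical.

```python
UM_PER_MM = 1000.0


def vias_per_mm2(pitch_um: float) -> float:
    """Via count per mm^2 for a square grid with the given pitch (in micrometres)."""
    vias_per_mm = UM_PER_MM / pitch_um
    return vias_per_mm ** 2


density = vias_per_mm2(0.3)  # ~300 nm via pitch
print(f"{density / 1e6:.1f} million vias per mm^2")  # prints: 11.1 million vias per mm^2
```

Under this assumption the result is consistent with the figure of more than 11 million vertical interconnects per mm² stated in the text.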
Fig. 16 Multi-Tier CNFET and RRAM process in a commercial foundry. a Schematic illustration of the process cross-section established within the foundry. This initial process includes 4 device tiers: RRAM memory for tier 1 and tier 2, and CNFET CMOS for tier 3 and tier 4, all of which are fabricated in the BEOL. There are 15 metal layers in total, implemented on 13 physical layers (since source/drain deposition for NMOS CNFETs and PMOS CNFETs use separate metal depositions but occupy the same physical layer). b Cross-section SEM images of NMOS CNFETs (top) and PMOS CNFETs (bottom), highlighting the MIXED CMOS process (Sect. 3.1). c Top-view SEM of a CNFET with multiple fingers, with false coloring to indicate the CNTs in the channel. d Cross-section SEM image showing CNFETs fabricated directly over RRAM memory cells, with routing above and below. Here, bottom metal layers show dummy metal fill (automatically performed using standard electronic design automation tools). e RRAM bypass vias through an RRAM device layer, illustrating the option of using RRAM tier 1 for additional routing resources. f Zoomed-in view of tight-pitched RRAM with corresponding schematic. Colors for (b–f) correspond to coloring in (a). g Typical I-V characteristics of 1T-1R memory cells for different combinations of NMOS/PMOS CNFETs and RRAM tiers, showing the form (“F”), set (“S”), and reset (“R”) events of the RRAM cell through the CNFET select transistor. h Measured distributions of Set Voltage (VSET) and Reset Voltage (VRESET) for 512-bit RRAM arrays fabricated across different BEOL layers in the monolithic 3D integrated circuit
Fig. 17 Design flow for creating monolithic 3D nanosystems, using the process in Fig. 16. This flow leverages a monolithic 3D PDK, standard cell libraries, and standard EDA tools, so that designers can transform a high-level description of a system (e.g., a register transfer level (RTL) description in Verilog) into a 3D layout (e.g., in standard graphic database system (GDS) format) for taping out monolithic 3D nanosystems (Srimani et al. 2020)
5 Outlook
We hope that this chapter has given the reader a taste of the end-to-end approach required for developing new technologies to address growing system-level bottlenecks in today's computing systems. While we have focused on CNFETs and the monolithic 3D nanosystems that they enable, many of the principles described here can and should be applied to any emerging technology, including the wide range of new materials, devices, systems, architectures, and integration approaches that are currently being investigated (including those described in the introduction), and which are at varying levels of maturity. Of course, just as 3D nanosystems are not constrained by the properties of today's silicon technologies, futuristic systems may evolve to become increasingly dissimilar to today's systems, both from a physical perspective and from an architectural perspective. Solutions may require diverse design abstractions, design methodologies adapted for different fields, or statistical methods to model complex system interactions with dynamic environments. No matter how system development continues to progress, we are confident that it will require tight-knit coordination among interdisciplinary researchers in both academia and industry, and we hope that this chapter may spark the reader to start thinking about new revolutions in the development of next-generation electronic systems.
References M.M.S. Aly et al., Energy-efficient abundant-data computing: the N3XT 1,000 x. IEEE Comput. 48(12), 24–33 (2015) M.M.S. Aly, T.F. Wu, A. Bartolo, Y.H. Malviya, W. Hwang, G. Hills, I. Markov et al., The N3XT approach to energy-efficient abundant-data computing. Proc. IEEE 107(1), 19–48 (2018) A.G. Amer, R. Ho, G. Hills, A.P. Chandrakasan, M.M. Shulaker, 29.8 SHARC: self-healing analog with RRAM and CNFETs, in 2019 IEEE International Solid-State Circuits Conference-(ISSCC) (IEEE, 2019), pp. 470–472
M.S. Arnold, A.A. Green, J.F. Hulvat, S.I. Stupp, M.C. Hersam, Sorting carbon nanotubes by electronic structure using density differentiation. Nat. Nanotechnol. 1(1), 60–65 (2006) J.-L. Baltzinger, B. Delahaye, Semiconductor Technologies, ed. by J. Grym (IntechOpen, 1999), Chap. 4, pp. 57–78 M.D. Bishop, G. Hills, T. Srimani, C. Lau, D. Murphy, S. Fuller, J. Humes, A. Ratkovich, M. Nelson, M.M. Shulaker, Fabrication of carbon nanotube field-effect transistors in commercial silicon manufacturing facilities. Nat. Electron. 1–10 (2020) B.H. Calhoun, A.P. Chandrakasan, A 256-kb 65-nm sub-threshold SRAM design for ultra-lowvoltage operation. IEEE J. Solid-State Circuits 42(3), 680–688 (2007) Q. Cao et al., Arrays of single-walled carbon nanotubes with full surface coverage for highperformance electronics. Nat. Nanotechnol. 8, 180–186 (2013) B. Chava, J. Ryckaert, L. Mattii, S.M.Y. Sherazi, P. Debacker, A. Spessot, D. Verkest, DTCO exploration for efficient standard cell power rails, in Design-Process-Technology Co-optimization for Manufacturability XII. International Society for Optics and Photonics, vol. 10588 (2018), p. 105880B L.T. Clark, V. Vashishtha, L. Shifren, A. Gujja, S. Sinha, B. Cline, C. Ramamurthy, G. Yeric, ASAP7: a 7-nm finFET predictive process design kit. Microelectron. J. 53, 105–115 (2016) J. Ding et al., A hybrid enrichment process combining conjugated polymer extraction and silica gel adsorption for high purity semiconducting single-walled carbon nanotubes. Nanoscale 7, 15741–15747 (2015) D.J. Frank, Y. Taur, H.-S. Philip Wong, Generalized scale length for two-dimensional effects in MOSFETs. IEEE Electron Device Lett. 19(10), 385–387 (1998) L. Gomez, I. Aberg, J.L. Hoyt, Electron transport in strained-silicon directly on insulator ultrathinbody n-MOSFETs with body thickness ranging from 2 to 25 nm. IEEE Electron Device Lett. 28(4), 285–287 (2007) A.A. Green, M.C. Hersam, Ultracentrifugation of single-walled nanotubes. Mater. 
Today 10, 59–60 (2007) P. Hashemi et al., Strained Si1-x Gex -on-insulator PMOS FinFETs with excellent sub-threshold leakage, extremely-high short-channel performance and source injection velocity for 10 nm node and beyond, in Proceedings of the Symposium on VLSI Technology (VLSI-Technology), Digest Technical Papers (2014a), pp. 1–2 P. Hashemi et al., First demonstration of high-Ge-content strained- Si1-x Gex (x = 0.5) on insulator PMOS FinFETs with high hole mobility and aggressively scaled fin dimensions and gate lengths for high- performance applications, in Proceedings of the IEEE International Electron Devices Meeting (2014b), pp. 16.1.1–16.1.4 G. Hills, J. Zhang, M.M. Shulaker, H. Wei, C.-S. Lee, A. Balasingam, H.-S. Philip Wong, S. Mitra, Rapid co-optimization of processing and circuit design to overcome carbon nanotube variations. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 34(7), 1082–1095 (2015) G. Hills, M.G. Bardon, G. Doornbos, D. Yakimets, P. Schuddinck, R. Baert, D. Jang et al. Understanding energy efficiency benefits of carbon nanotube field-effect transistors for digital VLSI. IEEE Trans. Nanotechnol. 17(6), 1259–1269 (2018) G. Hills, C. Lau, A. Wright, S. Fuller, M.D. Bishop, T. Srimani, P. Kanhaiya et al., Modern microprocessor built from complementary carbon nanotube transistors. Nature 572(7771), 595–602 (2019) R. Ho, C. Lau, G. Hills, M.M. Shulaker, Carbon nanotube CMOS analog circuitry. IEEE Trans. Nanotechnol. 18, 845–848 (2019) P.S. Kanhaiya, C. Lau, G. Hills, M. Bishop, M.M. Shulaker, 1 Kbit 6T SRAM arrays in carbon nanotube FET CMOS, in 2019 Symposium on VLSI Technology (IEEE, 2019a), pp. T54–T55 P.S. Kanhaiya, C. Lau, G. Hills, M.D. Bishop, M.M. Shulaker, Carbon nanotube-based CMOS SRAM: 1 kbit 6T SRAM arrays and 10T SRAM cells. IEEE Trans. Electron Devices 66(12), 5375–5380 (2019b) K.J. Kuhn, Considerations for ultimate CMOS scaling. IEEE Trans. Electron Devices 59(7), 1813– 1828 (2012)
C. Lau, T. Srimani, M.D. Bishop, G. Hills, M.M. Shulaker, Tunable n-type doping of carbon nanotubes through engineered atomic layer deposition HfOX films. ACS Nano 12(11), 10924– 10931 (2018) C.-S. Lee, E. Pop, A.D. Franklin, W. Haensch, H.-S. Philip Wong, A compact virtual-source model for carbon nanotube FETs in the sub-10-nm regime—Part I: Intrinsic elements. IEEE Trans. Electron Devices 62(9), 3061–3069 (2015) N. Loubet et al., Stacked nanosheet gate-all-around transistor to enable scaling beyond FinFET, in Proceedings of the Symposium on VLSI Technology (2017), pp. T230–T231 H. Mertens et al., Gate-all-around MOSFETs based on vertically stacked horizontal Si nanowires in a replacement metal gate process on bulk Si substrates, in Proceedings of the IEEE Symposium on VLSI Technology (2016), pp. 1–2 OpenSPARC (Dec. 2011), http://www.opensparc.net/opensparc-t2 N. Patil, A. Lin, J. Zhang, H. Wei, K. Anderson, H.-S. Philip Wong, S. Mitra, VMR: VLSIcompatible metallic carbon nanotube removal for imperfection-immune cascaded multi-stage digital logic circuits using carbon nanotube FETs, in 2009 IEEE International Electron Devices Meeting (IEDM) (IEEE, 2009), pp. 1–4 C. Qiu, Z. Zhang, M. Xiao, Y. Yang, D. Zhong, L.-M. Peng, Scaling carbon nanotube complementary transistors to 5-nm gate lengths. Science 355(6322), 271–276 (2017) Y. Sasaki et al., Novel junction design for NMOS Si Bulk-FinFETs with extension doping by PEALD phosphorus doped silicate glass, in Proceedings of the IEEE International Electron Devices Meeting (2015), pp. 21–28 M.M. Shulaker, G. Hills, N. Patil, H. Wei, H.-Y. Chen, H.-S. Philip Wong, S. Mitra. Carbon nanotube computer. Nature 501(7468), 526–530 (2013a) M.M. Shulaker, J.V. Rethy, G. Hills, H. Wei, H.-Y. Chen, G. Gielen, H.-S. Philip Wong, S. Mitra, Sensor-to-digital interface built entirely with carbon nanotube FETs. IEEE J. Solid-State Circuits 49(1), 190–201 (2013b) M.M. Shulaker, K. Saraswat, H.-S. Philip Wong, S. Mitra. 
Monolithic three-dimensional integration of carbon nanotube FETs with silicon CMOS, in 2014 Symposium on VLSI Technology (VLSITechnology): Digest of Technical Papers (IEEE, 2014), pp. 1–2 M.M. Shulaker, G. Hills, T.F. Wu, Z. Bao, H.-S. Philip Wong, S. Mitra. Efficient metallic carbon nanotube removal for highly-scaled technologies, in 2015 IEEE International Electron Devices Meeting (IEDM) (IEEE, 2015), pp. 32–34 M.M. Shulaker, G. Hills, R.S. Park, R.T. Howe, K. Saraswat, H.-S. Philip Wong, S. Mitra. Threedimensional integration of nanotechnologies for computing and data storage on a single chip. Nature 547(7661), 74–78 (2017) T. Srimani, G. Hills, C. Lau, M. Shulaker, Monolithic three-dimensional imaging system: carbon nanotube computing circuitry integrated directly over silicon imager, in 2019 Symposium on VLSI Technology (IEEE, 2019), pp. T24–T25 T. Srimani et al., Heterogeneous integration of BEOL logic and memory in a commercial foundry: multi-tier complementary carbon nanotube logic and resistive RAM at a 130 nm node, in VLSI (2020) S.D. Suk et al., Investigation of nanowire size dependency on TSNWFET, in Proceedings of the IEEE International Electron Devices Meeting (2007), pp. 891–894 K. Uchida, J. Koga, S.-I. Takagi, Experimental study on carrier transport mechanisms in doubleand single-gate ultrathin-body MOSFETs-Coulomb scattering, volume inversion, and/spl T SOI induced scattering, in Proceedings of the IEEE International Electron Devices Meeting Technical Digest (2003), pp. 33–35 T.F. Wu, H. Li, P.-C. Huang, A. Rahimi, G. Hills, B. Hodson, W. Hwang et al., Hyperdimensional computing exploiting carbon nanotube FETs, resistive RAM, and their monolithic 3D integration. IEEE J. Solid-State Circuits 53(11), 3183–3196 (2018) T.F. Wu, B.Q. Le, R. Radway, A. Bartolo, W. Hwang, S. Jeong, H. Li et al., 14.3 A 43 pJ/cycle nonvolatile microcontroller with 4.7 µs shutdown/wake-up integrating 2.3-bit/cell resistive RAM
and resilience techniques, in 2019 IEEE International Solid-State Circuits Conference-(ISSCC) (IEEE, 2019), pp. 226–228 J. Zhang, N.P. Patil, S. Mitra, Probabilistic analysis and design of metallic-carbon-nanotube-tolerant digital logic circuits. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 28(9), 1307–1320 (2009) J. Zhang, N. Patil, H.-S. Philip Wong, S. Mitra, Overcoming carbon nanotube variations through co-optimized technology and circuit design, in Proceedings of the International Electron Devices Meeting (IEDM), Washington, DC, USA, 2011, pp. 4.6.1–4.6.4 D. Zhong, M. Xiao, Z. Zhang, L.M. Peng, Solution-processed carbon nanotubes based transistors with current density of 1.7 mA/µm and peak transconductance of 0.8 mS/µm, in Proceedings of the 2017 IEEE International Electron Devices Meeting (IEDM) (IEEE, 2017), pp. 5–6 X. Zhou, J.-Y. Park, S. Huang, J. Liu, P.L. McEuen, Band structure, phonon scattering, and the performance limit of single-walled carbon nanotube transistors. Phys. Rev. Lett. 95(14) (2005), Art. no. 146805
Innovative Memory Architectures Using Functionality Enhanced Devices
Levisse Alexandre Sébastien Julien, Xifan Tang, and Pierre-Emmanuel Gaillardon
L. A. S. Julien (B): EPFL, Embedded Systems Laboratory, Lausanne, Switzerland
X. Tang · P.-E. Gaillardon: Electrical and Computer Engineering, University of Utah, Salt Lake City, USA
© The Author(s) 2023. M. M. S. Aly and A. Chattopadhyay (eds.), Emerging Computing: From Devices to Systems, Computer Architecture and Design Methodologies, https://doi.org/10.1007/978-981-16-7487-7_3

1 Introduction
Since the introduction of the transistor, the semiconductor industry has been able to deliver ever higher circuit performance at constant cost by scaling the transistor's area. This scaling trend (known as Moore's law) has been followed since the 80s. However, it has faced new constraints and challenges since 2012: standard sub-30 nm bulk CMOS technologies cannot provide sufficient performance while remaining industrially profitable. Various solutions, such as FinFETs (Auth et al. 2012) or Fully Depleted Silicon On Insulator (FDSOI) transistors (Faynot et al. 2010), have therefore been proposed. All these solutions enabled Moore's law scaling to continue. However, when approaching sub-10 nm technology nodes, the story repeats itself: process costs and electrical issues again reduce the profitability of such solutions, and new technologies such as Gate-All-Around (GAA) transistors (Sacchetto et al. 2009) are seen as candidates to replace FinFETs.

On the other hand, alternatives to overly expensive transistor scaling are currently being explored and developed by several academic research centers. Among these approaches, one solution consists of enhancing the transistor's functionality instead of scaling its dimensions. In other words, the transistor's performance is boosted by adding functionality inside the device itself, so that it provides the same computing performance as a more advanced scaled CMOS node with a reduced fabrication cost overhead. Compared to standard Moore's law scaling, this approach could enable a strong cost reduction thanks to relaxed constraints on device scaling. From this perspective, devices with new or boosted functionalities (i.e., Functionality Enhanced Devices, FEDs) are explored in the literature and, among these, this chapter focuses on the Silicon NanoWire Field Effect Transistor (SiNWFET) (De Marchi et al. 2012). The three major enhancements of this technology are (i) dynamic majority carrier type control, (ii) dynamic subthreshold slope control and (iii) dynamic threshold voltage tuning. Thanks to these effects, it has been demonstrated that SiNWFETs enable denser logic gates than CMOS gates of equivalent performance while using fewer transistors. While several works exploring logic gate design (Gaillardon et al. 2016; Amarú et al. 2013) and circuit synthesis (Arnani et al. 2013) have been reported, memory blocks using SiNWFETs remain largely unexplored (Shamsi et al. 2015). This work therefore explores memory blocks using SiNWFETs. The following methodology is adopted: (i) identification of the standard CMOS-based solution, (ii) proposal of standard replacement architectures using SiNWFETs, (iii) exploration of breakthrough architectures and (iv) benchmarking against existing solutions.

The memory market is usually segmented into two categories. The first is volatile memory, which includes several technologies for various uses. The best known are embedded SRAM caches and standalone DRAM memories (Chang et al. 2017; Lee et al. 2014). However, even closer to the logic circuit than SRAM, some storage elements are present in the form of flip-flops. In parallel with logic circuit development, flip-flops using SiNWFETs must therefore be designed. The first part of this chapter explores the design of flip-flops using SiNWFET technology. As microprocessors are expected to operate at high frequencies, it focuses on True Single Phase Clock (TSPC) flip-flops (Tang et al. 2014), which are well suited to high-speed operation thanks to the absence of a cross-coupled latch inside the flip-flop structure.

The other memory market category is Non-Volatile Memory (NVM), best known for its high-density standalone Solid-State Drives (SSDs) and Hard Disk Drives (HDDs) (Park et al. 2015; Mamun et al. 2017). NVM co-integration with advanced CMOS nodes is required to enable microcontroller production, leading to huge design and technology complexity (i.e., costs) (Shum et al. 2017). Due to skyrocketing costs, floating gate-based NVM integration takes longer at each CMOS technology node (4 years for the 90 nm node, more than 10 for the 28 nm node) (Strenz 2012). Given this, it seems unrealistic to develop a floating gate-based NVM for SiNWFET technology. The second part of this chapter thus focuses on emerging NVM technologies (such as filamentary Resistive Switching Technologies, RRAM) that provide low process cost overhead and fabrication-friendly materials. Additionally, these technologies are widely explored and seen as the replacement for floating gate-based embedded NVM by both industrial and academic researchers (Kawahara et al. 2013; Portal et al. 2017). The major contribution of this part is the use of the dynamic polarity control enabled by SiNWFET technology to enhance the density and functionality of such NVM blocks.

This chapter is organized as follows. First, the necessary background on SiNWFET technology is introduced. Then, an extensive study of TSPC flip-flops using SiNWFETs is carried out. Basic TSPC flip-flops as well as enhanced TSPC flip-flops with embedded logic functions are designed and compared to CMOS-based TSPC flip-flops providing the same functionalities. Finally, this chapter addresses the design of NVM blocks using 1 Transistor-1 RRAM (1T1R) bitcells. After the basics of filamentary RRAM technology are presented, the dynamic polarity control of SiNWFET technology is used to provide more efficient and denser 1T1R solutions than equivalent CMOS-based solutions.
2 Polarity Controllable SiNWFET Technology Overview
This section introduces the background on Polarity Controllable (PC) Silicon Nanowire FET (SiNWFET) technology needed for this chapter. In parallel with the evolution of regular CMOS technologies, devices providing dynamic polarity control have been explored in order to enable extremely dense and low-power logic gates (Gaillardon et al. 2016), such as a 4-transistor XOR function (Amarú et al. 2013). Polarity-controllable transistors are therefore actively studied, and various highly scaled FET devices based on silicon nanowires (De Marchi et al. 2012; Heinzig et al. 2012), carbon nanotubes (Lin et al. 2005), graphene (Nakaharai et al. 2014), FinFETs (Zhang et al. 2014a) and WSe2-based bipolar transistors (Resta et al. 2016) have been demonstrated. Among these technologies, the Silicon NanoWire Field Effect Transistor (SiNWFET) (Marchi et al. 2014; De Marchi et al. 2012) using a gate-all-around process appears to be the most natural evolution from FinFET transistors, as its process is based on a FinFET fabrication process (Sacchetto et al. 2009). Figure 1a presents the physical structure of the considered SiNWFET transistors. Polarity-controllable devices provide great flexibility because their behavior can be controlled electrically. Indeed, by controlling the voltages on the two polarity gates (PGs and PGd), on the control gate, and on the drain and the source, several effects can be obtained: polarity control, subthreshold slope control and threshold voltage modulation. This chapter focuses on the polarity control effect presented in Fig. 1b. By changing the voltage bias of the polarity gates, it is possible to switch from n-type MOS behavior to p-type MOS behavior. As shown in Fig. 1c, a PC SiNWFET can be approximated as a transistor with two independent gates by biasing one of the polarity gates at gnd (to create two serial p-type transistors) or at vdd (to create two serial n-type transistors). This configuration is considered in the TSPC section (i.e., Sect. 3). On the other hand, as shown in Fig. 1d, connecting PGd and PGs together creates a single polarity gate (PG). This PG controls the transistor type while keeping the device a single transistor. This property is used in the NVM section (i.e., Sect. 4).

Fig. 1 a 3D view of a polarity controllable SiNWFET structure. b Simulated I-V curves of n-type (black) and p-type (red) operation in a 25 nm equivalent node (Mohammadi et al. 2015) with detailed subthreshold current. c Symbol view of the PC SiNWFET with independent polarity gates and the equivalent circuits. d Symbol view of the PC SiNWFET with the polarity gate (PG) node connected to both PGs and PGd
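The gate configurations described above can be illustrated with an idealized switch-level model. The sketch below is not the compact model used in this chapter: it assumes logic-level gate biases (0 = gnd, 1 = vdd) and assumes that the channel conducts only when the control gate and both polarity gates sit at the same level. The function names are hypothetical.

```python
from itertools import product


def pc_sinwfet_on(cg: int, pgs: int, pgd: int) -> bool:
    """Idealized assumption: the device conducts only when all three gates agree."""
    return cg == pgs == pgd


def polarity(pg: int) -> str:
    """With PGs and PGd tied together, the shared PG level selects the device type."""
    return "n-type" if pg == 1 else "p-type"


# Tied polarity gates (Fig. 1d style): a single transistor whose type is set by PG.
for pg, cg in product((0, 1), repeat=2):
    state = "on" if pc_sinwfet_on(cg, pg, pg) else "off"
    print(f"PG={pg} ({polarity(pg)}), CG={cg}: {state}")

# One polarity gate tied to a rail (Fig. 1c style): the two remaining gates
# behave like two transistors in series.
for cg, pgd in product((0, 1), repeat=2):
    series_p = (cg == 0) and (pgd == 0)  # two serial p-type switches
    series_n = (cg == 1) and (pgd == 1)  # two serial n-type switches
    assert pc_sinwfet_on(cg, 0, pgd) == series_p  # PGs biased at gnd
    assert pc_sinwfet_on(cg, 1, pgd) == series_n  # PGs biased at vdd
```

In this abstraction the tied-gate device is simply "on when CG equals PG", which is why a single PC SiNWFET can stand in for either a pull-up or a pull-down transistor depending on the PG bias.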
3 Silicon NanoWire FET Based True Single Phase Clock Flip-Flops
This section explores the opportunities opened by Functionality-Enhanced Devices (FEDs) in local storage blocks (i.e., flip-flops) embedded in the logic. Its contributions are summarized as follows. Innovative Polarity Controllable (PC) SiNWFET-based True Single Phase Clock flip-flops embedding asynchronous set, reset and logic operations are proposed. These topologies are then compared with equivalent standard CMOS solutions, and energy and area gains are demonstrated.
3.1 True Single-Phase Clock (TSPC) Flip-Flop
Because they are a key component of integrated sequential circuits, flip-flops strongly impact the area, speed and power consumption of modern digital circuits (Teh et al. 2011). Two main kinds of flip-flops are reported: dynamic and static flip-flops (Rabaey et al. 2002). Static flip-flops, which are usually built around two inverter-based latches, are highly sensitive to the inverter design and may suffer from increased area and delay. On the other hand, thanks to a simpler architecture, flip-flops based on dynamic logic offer higher operating speed and area density than static ones (Teh et al. 2011). Additionally, while most static flip-flops require both the clock signal and its complement, dynamic flip-flops do not (Yuan and Svensson 1997), which simplifies clock tree synthesis. Thus, using a single-phase clock and dynamic logic leads not only to a compact design but also to a faster response. Finally, a unique feature of the TSPC flip-flop is the reduction of the overall setup time and delay obtained by embedding logic gates inside the structure (Rabaey et al. 2002). For these reasons, the TSPC flip-flop, widely used in many high-speed applications (Nomura 2008), is explored in this chapter.

A CMOS TSPC flip-flop can be built with only 9 transistors, which is very compact compared to the static version with 22 transistors (Rabaey et al. 2002). A TSPC flip-flop with asynchronous reset and set requires 6 additional transistors for pulling up to vdd or pulling down to gnd at each stage. As depicted in Fig. 2, the CMOS TSPC flip-flop is composed of three stages. (i) A precharge stage enabled when the clock is low. This stage is transparent when CLK = 0, i.e., node X is updated with the value of D only when CLK = 0. (ii) A latch stage, which stores the value of node X at the rising edge of CLK. When CLK = 0, node Y is precharged to vdd. At the rising edge of CLK, node Y is pulled down to gnd if node X = 1. Note that even if node X is pulled down while CLK = 1, the latched value at node Y cannot be modified. (iii) A precharge stage enabled when the clock is high. This last stage is transparent when CLK = 1, i.e., it propagates node Y only when CLK = 1. However, as it is based on precharging and discharging, the TSPC flip-flop relies on internal capacitance for storage, which requires periodic refreshing (on the order of milliseconds) (Rabaey et al. 2002). In addition, logic gates can be directly inserted into the first stage of the CMOS TSPC flip-flop. Figure 3 depicts a TSPC flip-flop preceded by an AND function. The setup time of the TSPC flip-flop alone increases, but compared with an AND gate followed by a standard TSPC flip-flop, the overall setup time decreases (Rabaey et al. 2002).

Fig. 2 Schematic of a CMOS-based TSPC flip-flop embedding an asynchronous set and reset through the S and R signals (Golinescu 1999; Yuan and Svensson 1997). The set and reset paths are marked in the schematic
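The three-stage operation described above can be sketched as a small behavioral model. This is a minimal illustration under stated assumptions, not the authors' simulation setup: each stage is assumed to invert its input (as in the textbook TSPC register), so X holds the complement of D and the third stage drives the complemented output. Charge leakage, set/reset and analog timing are ignored, the input D is assumed stable around the clock edge, and the class name is hypothetical.

```python
class TspcFlipFlop:
    """Switch-level sketch of a positive-edge TSPC flip-flop (no set/reset)."""

    def __init__(self) -> None:
        self.x = 1      # output of stage 1
        self.y = 1      # output of stage 2 (precharged high)
        self.q_bar = 1  # output of stage 3, so Q starts at 0

    def step(self, clk: int, d: int) -> int:
        # Stage 1: transparent when CLK = 0 (assumed inverting: X = NOT D).
        # When CLK = 1 the pull-up is disabled, so X can only be pulled down.
        if clk == 0:
            self.x = 1 - d
        elif d == 1:
            self.x = 0

        # Stage 2: precharged to vdd when CLK = 0; when CLK = 1 it is
        # discharged only if X = 1, otherwise it keeps its value.
        if clk == 0:
            self.y = 1
        elif self.x == 1:
            self.y = 0

        # Stage 3: transparent when CLK = 1, holds its value when CLK = 0.
        if clk == 1:
            self.q_bar = 1 - self.y

        return 1 - self.q_bar  # Q


ff = TspcFlipFlop()
for clk, d in [(0, 1), (1, 1), (1, 0), (0, 0), (1, 0)]:
    print(f"CLK={clk} D={d} -> Q={ff.step(clk, d)}")
```

The third step in the usage loop changes D while CLK = 1: X may be pulled down, but Y and therefore Q keep the value captured at the rising edge, which is the property noted for the latch stage.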
3.2 Standard PC SiNWFET Flip-Flop
This section explores the opportunities opened by PC SiNWFETs in TSPC flip-flop design. First, circuit design opportunities are explored. Then, circuit functionality is verified through electrical simulations. Finally, the area and delay of the PC SiNWFET flip-flops are compared to those of standard CMOS TSPC flip-flops and the results are discussed.
3.2.1 Structure Modifications
As introduced before, the PC SiNWFET leads to area and timing efficiency because a single device is equivalent to two serial transistors (as shown in Fig. 1). In standard CMOS-based TSPC flip-flops, as shown in Fig. 3, serial transistor configurations are widely used. In addition, a PC SiNWFET can be dynamically reconfigured from n- to p-type (Zhang et al. 2013), depending on the signal applied to its gates. Therefore, it is possible to dynamically convert a pull-up transistor into a pull-down one and vice versa. A combination of these two properties is used to compact the original TSPC flip-flop design. Figure 4 depicts a PC SiNWFET TSPC flip-flop with asynchronous set and reset operations (active high). The novel circuit consists of only 8 PC SiNWFETs and its schematic fits the layout style in Bobba et al. (2012), consuming only 3 tiles. By applying the design rules of Fig. 1, all control gates and polarity gates are fully used. Compared with the CMOS design (Fig. 2), the novel circuit removes the pull-up and pull-down transistors at the first and third stages. Transistor N1 (highlighted by an orange rectangle in Fig. 4) plays a dual role, as it replaces part of the pull-up and pull-down branches of the first stage of the CMOS design. Transistor N2 (in red) plays an identical role for the third stage. In addition to merging serial transistors into single devices, improvements are made at the two precharge stages.

TSPC flip-flop first stage improvements
Asynchronous set requires node X to be pulled down to gnd at the first stage even when CLK = 0 and D = 0. In the CMOS design, an additional pull-down transistor, controlled by signal S, is required in order to avoid a path from vdd to gnd when CLK = 1 and D = 0. Then, to prevent another path from vdd to gnd when CLK = 0 and D = 0, a pull-up transistor, also controlled by signal S, is added. Transistor N1 in Fig. 4 realizes this dual functionality. When set is enabled, transistor N1 switches from n-type to p-type, while its source is set to vdd. Thus, transistor N1 is on only when CLK = 0 and X = 0 (when D = 0, X = 0), which ensures that node Y can be pulled up to vdd regardless of CLK and D. In Fig. 2, the asynchronous set path is highlighted with a dashed arrow, while the asynchronous reset path is shown with a solid arrow.

TSPC flip-flop third stage improvements
Asynchronous set requires an immediate logic-high output when enabled. Therefore, the third stage should be pulled down once set is enabled, even when CLK = 0. In the CMOS design, a pull-down transistor is required. In the PC SiNWFET design, however, transistor N2 similarly switches from p-type to n-type, while its source is set to gnd when set is enabled. As node Y is pulled up to vdd when set is enabled, transistor N2 is on, thereby pulling node Q̄ down to gnd.

Fig. 3 Schematic of a CMOS-based TSPC flip-flop with an embedded AND gate (highlighted in red) and asynchronous set and reset through the S and R signals (Golinescu 1999; Yuan and Svensson 1997)

Fig. 4 Schematic of the proposed PC SiNWFET-based TSPC flip-flop embedding asynchronous set and reset features
Fig. 5 Electrical simulation waveform of an asynchronous set in a PC SiNWFET-based TSPC flip-flop (signals shown: S, CLK, D and Q)
3.2.2 Transient Validation
To validate the correct behavior of the cell under asynchronous set and reset, electrical simulations are performed, and transient waveforms are shown in Figs. 5 and 6. A simple 22 nm PC SiNWFET table-based compact model, derived from Zhang et al. (2013), is used with the HSPICE simulator. Figure 5 verifies that the output Q can be pulled up once set is enabled, even in the most challenging case (CLK = 0, D = 0 and Q = 0). For asynchronous reset, the most challenging case occurs when CLK = 0, D = 1 and Q = 1. In Fig. 6, the output Q is observed to be correctly pulled down when the reset operation is triggered. Once the set/reset signals are de-asserted, the output Q switches again according to the next rising clock edge.
3.2.3 Circuit-Level Performance Estimations
To evaluate the performance of the proposed flip-flop design, four major metrics are considered: area, setup time, hold time and clock-to-Q delay. In this section, results obtained by electrical simulation are compared between TSPC flip-flop designs implemented in traditional CMOS (Fig. 2) and in PC SiNWFET (Fig. 4), using a 22 nm technology node. Note that static flip-flops are out of the scope of this chapter and are therefore not discussed. For the CMOS technology, the PTM 22 nm LSTP FinFET model (PTM 2021) is considered. To accurately measure the minimum setup time, hold time and clock-to-Q delay, a binary search approach is used, with a delay tolerance corresponding to 10% of the reference delay and a resolution of 0.01 ps (Cadence 2009). Table 1 compares the two TSPC flip-flop implementations. Thanks to the compactness of PC SiNWFETs, an area saving of up to 20% is achievable. Regarding timing performance, the PC SiNWFET TSPC flip-flop reduces its internal delay by 30% on average. These performance gains come from the area reduction given by PC SiNWFETs as well as the intrinsic parasitic capacitance reduction obtained by using a single device instead of two serial CMOS transistors.

Fig. 6 Electrical simulation waveform of an asynchronous reset in a PC SiNWFET-based TSPC flip-flop (signals shown: R, CLK, D and Q)

Table 1 Comparison between FinFET-based and PC SiNWFET-based TSPC flip-flops. SiNWFET area = 1.5 MOS area

| Benchmark/TSPC | LSTP FinFET | PC SiNWFET | Comparison (%) |
|---|---|---|---|
| Area (# of transistors) | 15 | 12 | −20 |
| Setup time, rise (ps) | −110.13 | −109.22 | 0.83 |
| Setup time, fall (ps) | −79.55 | −81.75 | −2.77 |
| Setup time, average (ps) | −94.84 | −95.5 | −0.68 |
| Hold time, rise (ps) | 86.03 | 86.9 | 1.01 |
| Hold time, fall (ps) | 117.46 | 114.5 | −2.52 |
| Hold time, average (ps) | 101.75 | 100.7 | −1.03 |
| Clock-to-Q delay, rise (ps) | 22.22 | 13.42 | −39.6 |
| Clock-to-Q delay, fall (ps) | 34.88 | 25.35 | −27.32 |
| Clock-to-Q delay, average (ps) | 28.55 | 19.39 | −32.1 |
| Leakage power (nW) | 0.26 | 0.24 | −7 |
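The binary search used to extract timing constraints can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact characterization procedure: the pass criterion is assumed to be "clock-to-Q delay within 10% of the reference delay", the delay is assumed to decrease monotonically as the setup time grows, and `clock_to_q_delay` is a hypothetical analytic stand-in for a transient circuit simulation. The reference delay reuses the FinFET average from Table 1 purely as an example value.

```python
import math

REF_DELAY_PS = 28.55  # example reference clock-to-Q delay (FinFET average, Table 1)


def clock_to_q_delay(setup_ps: float) -> float:
    """Hypothetical stand-in for a circuit simulation: the delay grows sharply
    as the data edge moves closer to the clock edge (smaller setup time)."""
    return REF_DELAY_PS * (1.0 + math.exp(-(setup_ps + 100.0) / 5.0))


def min_setup_time(delay_fn, ref_delay, lo, hi, tolerance=0.10, resolution=0.01):
    """Smallest setup time (within `resolution` ps) whose clock-to-Q delay
    stays within `tolerance` of the reference delay."""
    limit = ref_delay * (1.0 + tolerance)
    if delay_fn(hi) > limit:
        raise ValueError("upper bound must satisfy the delay criterion")
    if delay_fn(lo) <= limit:
        raise ValueError("lower bound must violate the delay criterion")
    while hi - lo > resolution:
        mid = 0.5 * (lo + hi)
        if delay_fn(mid) <= limit:
            hi = mid  # criterion met: try a tighter setup time
        else:
            lo = mid  # criterion violated: back off
    return hi


t_setup = min_setup_time(clock_to_q_delay, REF_DELAY_PS, lo=-150.0, hi=50.0)
print(f"minimum setup time: {t_setup:.2f} ps")
```

Each iteration halves the search interval, so a 200 ps window is narrowed to the 0.01 ps resolution in about 15 evaluations, which matters when every evaluation is a full transient simulation.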
3.3 Enhanced PC SiNWFET FF with Logic Operations
In addition to realizing more compact pull-up and pull-down networks, PC SiNWFETs can also implement the AND logic function natively (Zhang et al. 2013). This feature can be efficiently embedded into the PC SiNWFET flip-flop. For instance, Fig. 7 shows a PC SiNWFET flip-flop with an AND gate embedded in the first stage. In the PC SiNWFET implementation, the first stage is a clocked AND gate, derived from Zhang et al. (2013). The following stages remain unchanged compared to Fig. 4. Note that the clock signal in the first stage is wired to the low-leakage controllability gate, which improves power efficiency at the expense of internal delay. The clock signal replaces the fixed gnd biases at the first stage of the CMOS design. When CLK = 0, the first stage works as an AND gate, i.e., X is equal to A AND B. When CLK = 1, node X is pulled up to vdd, as opposed to the regular CMOS design where node X is left untouched. Therefore, an additional inverter is inserted to avoid the conditions where a path from vdd to gnd can be created (when CLK = 1). Electrical simulations, depicted in Fig. 8, validate the AND/TSPC flip-flop functionality: the output Q equals A AND B at each rising clock edge. Finally, the performance of the CMOS design (Fig. 3) and of the PC SiNWFET design (Fig. 7) are compared in Table 2. The area shrinks by 21%, while the delay is reduced by 6% on average and the leakage power drops by 45%. The leakage gain is attributed to the power efficiency of the PC SiNWFET AND gate (Zhang et al. 2013). Nevertheless, note that the timing performance gain is reduced because of the use of the low-leakage controllability gate of PC SiNWFETs.
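The intended function of the AND-embedded flip-flop can be summarized with a behavioral sketch. It is an illustration only, under simplifying assumptions: the first stage evaluates X = A AND B while CLK = 0 and is pulled to 1 while CLK = 1, and the remaining stages are abstracted as "capture, at the rising edge, the value X had during the preceding low phase". The inverter and all analog timing are abstracted away, and the class name is hypothetical.

```python
class AndTspcFlipFlop:
    """Behavioral sketch of a TSPC flip-flop with a clocked AND first stage."""

    def __init__(self) -> None:
        self.clk = 1  # previous clock level
        self.x = 1    # first-stage node, pulled high while CLK = 1
        self.q = 0    # registered output

    def step(self, clk: int, a: int, b: int) -> int:
        if self.clk == 0 and clk == 1:
            # Rising edge: capture the value evaluated during the low phase.
            self.q = self.x
        # Clocked AND stage: evaluates when CLK = 0, pulled up when CLK = 1.
        self.x = (a & b) if clk == 0 else 1
        self.clk = clk
        return self.q


INPUTS = [(0, 0), (1, 0), (0, 1), (1, 1)]

ff = AndTspcFlipFlop()
for a, b in INPUTS:
    ff.step(0, a, b)      # low phase: first stage evaluates A AND B
    q = ff.step(1, a, b)  # rising edge: result is registered
    print(f"A={a} B={b} -> Q={q}")
```

Changing A or B while CLK = 1 leaves Q untouched in this model, mirroring the edge-triggered behavior validated by the transient simulation.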
Fig. 7 Schematic of the proposed PC SiNWFET-based TSPC flip-flop embedding AND gate and asynchronous set and reset features
Fig. 8 Electrical simulation waveform of an AND operation in a PC SiNWFET-based AND-gate embedded TSPC flip-flop (signals shown: CLK, A, B and Q)

Table 2 Comparison between FinFET-based and PC SiNWFET-based TSPC flip-flops with embedded AND operation. SiNWFET area = 1.5 MOS area

| Benchmark/TSPC | LSTP FinFET | PC SiNWFET | Comparison (%) |
|---|---|---|---|
| Area (# of transistors) | 17 | 13.4 | −20.59 |
| Setup time, rise (ps) | −9.21 | −12.51 | −35.83 |
| Setup time, fall (ps) | 28.14 | 20.99 | −25.41 |
| Setup time, average (ps) | 9.47 | 4.24 | −55.2 |
| Hold time, rise (ps) | 19.65 | 11.96 | −39.13 |
| Hold time, fall (ps) | 16.6 | 17.15 | −3.31 |
| Hold time, average (ps) | 18.13 | 14.56 | −19.7 |
| Clock-to-Q delay, rise (ps) | 131.11 | 89.95 | 31.39 |
| Clock-to-Q delay, fall (ps) | 121.01 | 145.97 | 20.63 |
| Clock-to-Q delay, average (ps) | 126.06 | 117.96 | −6.43 |
| Leakage power (nW) | 0.61 | 0.33 | −45.53 |
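As a worked check of how the "Average" rows of Table 2 relate to the rise and fall values, the following sketch recomputes the averages and their relative change. The dictionary layout and function names are illustrative assumptions; only the numbers are taken from the table.

```python
# Rise/fall values in ps transcribed from Table 2, as (FinFET, PC SiNWFET) pairs.
TABLE2 = {
    "setup": {"rise": (-9.21, -12.51), "fall": (28.14, 20.99)},
    "hold": {"rise": (19.65, 11.96), "fall": (16.6, 17.15)},
    "clk_to_q": {"rise": (131.11, 89.95), "fall": (121.01, 145.97)},
}


def averages(metric):
    """Return the (FinFET, PC SiNWFET) mean of the rise and fall values."""
    rise, fall = TABLE2[metric]["rise"], TABLE2[metric]["fall"]
    return (rise[0] + fall[0]) / 2, (rise[1] + fall[1]) / 2


def relative_change_percent(reference, new):
    """Signed change of `new` with respect to `reference`, in percent."""
    return 100.0 * (new - reference) / reference


for name in TABLE2:
    finfet, pc = averages(name)
    print(name, round(finfet, 2), round(pc, 2),
          round(relative_change_percent(finfet, pc), 2))
```

With this convention a negative percentage means that the PC SiNWFET value is lower than the FinFET reference.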
3.4 Discussion and Conclusions
In this section, TSPC flip-flop circuit designs leveraging the compactness offered by PC SiNWFETs were explored and various solutions were proposed. Experimental results, obtained by electrical characterization, show that the PC SiNWFET implementation improves area, delay and leakage power by nearly 20%, 30% and 7%, respectively, compared to the CMOS design. In addition, this section showed that an AND gate can be embedded into the structure. This leads to an area reduction of 21%, a delay reduction of 6% and a leakage reduction of 45% on average. Beyond improving standard logic functions, PC SiNWFETs therefore also enhance the flip-flop functionality, enabling area and energy reductions while adding functionality. The following section of this chapter focuses on the opportunities opened by PC SiNWFETs for NVM co-integration.
4 Emerging Resistive Memories Architectures Using PC SiNWFETS
In this section, we investigate the implications of using PC SiNWFETs as selectors in the bitcells of dense memory arrays. To that end, the section first gives an overview of filamentary RRAM technologies and their programming conditions. Then, it explores the opportunities opened by the fine-grain dynamic polarity control offered by PC SiNWFET technology. Finally, two PC SiNWFET based bitcells are proposed and explored from an array point of view. The proposed PC SiNWFET RRAM bitcells enable low voltage operation (no gate overdrive required) while providing from 1.5x up to 2.45x higher density than CMOS-based 2T1R bitcells.
4.1 RRAM Technology
On the memory technology side, non-volatile resistance switching devices enabled by simple technology stacks and materials are seen by both industry and academia as an attractive solution for the future of Non-Volatile Memories (NVM) (Burr et al. 2008). These devices, which provide non-volatile resistance switching in a 2-terminal device, are widely explored thanks to the very high integration density enabled by the crosspoint architecture (Patel and Friedman 2014). However, these high-density solutions suffer from medium-voltage constraints during programming operations (i.e., 3 to 5 V of programming voltage) and from a large periphery area overhead (Levisse et al. 2017). For these reasons, 1-Transistor 1-Resistance (1T1R) architectures are still explored by industry (Wei et al. 2011) for the replacement of NOR memories (Giraud et al. 2017). The main resistive memory (RRAM) technologies are the Phase Change Memories (PCM) (Wong et al. 2010; Burr et al. 2016), the Magneto-resistive Memories (MRAM) (Endoh et al. 2016; Apalkov et al. 2013) and the Filamentary Resistive Memories (Oxide-based OxRAM and Conductive-bridge CBRAM) (Vianello et al. 2014; Wong et al. 2012). While the exploration of non-volatile resistive switching properties started in the 1960s with the study of reversible breakdown in Metal/Oxide/Metal (Au/SiO/Au and Au/SiO/Al) stacks (Nielsen and Bashara 1964; Simmons and Verderber 1967), it was later largely abandoned with the continuous scaling of floating gate NVMs. However, with the worries about the “scaling wall” and the end of Moore’s Law (End of Moore’s Law 2016, 2013; Haron and
Hamdioui 2008), it was revived (Beck et al. 2000; Zhuang et al. 2002) and pushed as a future replacement candidate for DRAM (Prenat et al. 2014) and Flash memories (Kawahara et al. 2013; Baek et al. 2005; ExtremeTech 2013; Computerworld 2013). Today, filamentary RRAMs are seen as the most promising candidate compared to the other technologies (PCM, MRAM) thanks to their CMOS-compatible and extremely simple fabrication process (Vianello et al. 2014; Wong et al. 2012). Filamentary RRAMs are organized in three families: Oxide-based (OxRAM), Conductive-Bridge (CBRAM) and Hybrid. While each family relies on a different physical effect, the electrical behaviors of OxRAM, CBRAM and Hybrid devices are almost identical. All these memories rely on the creation (named set operation) and destruction (named reset operation) of a Conductive Filament (CF) inside an insulating material. This leads to a controllable variation of the equivalent resistance state of the material.
4.1.1 Filamentary Based RRAM Technologies
In OxRAM technology, the CF is made of oxygen vacancies in a Transition-Metal-Oxide (TMO) switching layer. Under the combined effect of the electric field and Joule heating, the oxygen ions migrate inside the TMO. The switching materials are generally metal oxides such as HfOx (Goux et al. 2011), AlOx (Lee et al. 2010), NiOx (Seo et al. 2004), TiOx (Kwon et al. 2010) or TaOx (Lee et al. 2011), sandwiched between two metal electrodes (the Top Electrode, TE, and the Bottom Electrode, BE). Two different kinds of OxRAM exist. Unipolar (or Non-Polar) OxRAMs rely only on the thermal effect for the filament destruction and use symmetric stacks such as Pt/HfO2/Pt (Cagli et al. 2011), Pt/NiO/Pt (Seo et al. 2004) or TiW/SiOx/TiW (Chen et al. 2016). Bipolar OxRAMs use the electric field to amplify the thermal effect during reset and rely on non-symmetric stacks with a vacancy-scavenging layer, such as Ti/HfO2 (Vianello et al. 2014). Bipolar OxRAM devices enable a smaller reset current and better switching characteristics than non-polar devices. In CBRAM technology, the CF creation and destruction rely on the electro-migration of metal ions from the TE inside an insulating material. Various insulator/metal couples can be used, such as GeS2/Ag (Jameson et al. 2012) or Al2O3/Cu (Belmonte et al. 2013). Finally, Hybrid resistive memories (Vianello et al. 2014) are a mix between OxRAM and CBRAM: oxygen vacancies are used to improve the mobility of metal ions during the set and reset operations (Molas et al. 2014; Nail et al. 2016). For all the previously introduced filamentary RRAM technologies (OxRAM, CBRAM or Hybrid), the electrical behavior is almost identical and is introduced in Fig. 9. At the beginning of the RRAM life, an electro-forming operation is mandatory to perform a first breakdown of the insulator material and create the first oxygen-vacancy (OxRAM) or metal (CBRAM) CF (Xu et al. 2008).
A compliance current is used to limit the current and make the breakdown reversible. Once the electro-forming step is performed, the RRAM is in the Low Resistance State (LRS). From a LRS, by applying a reverse voltage across the device, a part of the CF can be destroyed, changing the RRAM resistance to a High Resistance State (HRS) (i.e., a reset operation). Finally, a set operation is performed by applying a positive voltage pulse across the RRAM. The set operation is similar to an electro-forming operation but with lower voltages. It is worth noting that the electro-forming step is a critical operation for filamentary RRAM since it requires high voltages and long times (Vianello et al. 2014; Lorenzi et al. 2012) (on the order of a microsecond at 3 V for a HfO2-based OxRAM). From a test point of view, it may be critical and cause the Electrical Wafer Sort test to be overly long and not cost effective. As this point is weakly documented and as most device development teams are trying to lower the electro-forming voltage or create forming-free devices (Kim et al. 2016; Chakrabarti et al. 2014), the following assumes that the electro-forming operation has already been performed and focuses on the set and reset operations.

Fig. 9 I-V curve of a filamentary RRAM with detailed electroforming, reset and set operations, inspired from Thammasack et al. (2017)

Figure 10 presents the relation between the programming current (Iprog) and the LRS value during a set operation. This direct relation between the programming current and the LRS value is explained by the fact that a higher current leads to a wider CF, which results in a lower resistance state. Figure 10a (inspired from Vianello et al. (2014) and Fackenthal et al. (2014); the filament scaling inserts are inspired from Vandelli et al. (2011)) shows that this relation is commonly considered for both OxRAM and CBRAM. Finally, Fig. 10b (inspired from Garbin et al. (2015)) shows the cumulative distribution of the achieved LRS. A lower Iprog results in a wider distribution while a higher Iprog results in a narrower distribution. This behavior can be explained physically by the width of the created CF: a higher Iprog produces a wider filament, which is less sensitive to variations than the small CF resulting from a lower Iprog.
Fig. 10 a Evolution of the LRS value versus the Iprog current for various CBRAM and OxRAM technologies, inspired from Vianello et al. (2014), Fackenthal et al. (2014) and Vandelli et al. (2011). It shows a noticeable dependency, together with the corresponding modeled CF size for various Iprog currents. b Variability of the LRS value for various Iprog currents, inspired from Garbin et al. (2015). It shows that the lower the Iprog, the wider the distribution

Fig. 11 a Evolution of the set time versus the set voltage. b Evolution of the reset time versus the reset voltage for a constant HRS/LRS ratio (HRS/LRS = 5). c Evolution of the HRS value versus the reset voltage for a constant time (t = 1 µs, LRS = 3 kΩ)
Figures 11a and b, inspired from Vianello et al. (2014), show the relation between the programming time and the voltage applied across the RRAM device. During a set pulse, the time required to start the switching operation decreases exponentially with the applied voltage. In a reset operation, the time required to reach a given HRS value (in this example, 5 times the LRS value) also depends exponentially on the applied voltage. Finally, Fig. 11c (also inspired from Vianello et al. (2014)) shows the relationship between the achieved HRS value and the applied voltage for a constant programming time (1 µs), starting from a constant LRS (here 3 kΩ).
This section introduced the metrics and trade-offs of filamentary RRAM technologies. After a first electro-forming operation requiring a long time and a high voltage, the RRAM device ends up in a LRS. From a LRS, a reset operation can be performed by applying a negative voltage across the RRAM. A higher reset voltage induces a higher HRS value at constant time, or a faster reset operation for a given HRS target. From a HRS, a set operation can be performed by applying a positive voltage across the RRAM. A higher voltage induces a faster set operation. Once the switching happens, the current flowing through the RRAM must be limited to ensure a reversible switching. The resulting LRS value depends on the maximum Iprog current value through the RRAM.
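The exponential dependence of the switching time on the applied voltage can be illustrated with a minimal sketch. It assumes a purely exponential law anchored at one reference point; the function name and the parameters `t_ref`, `v_ref` and `v_decade` are hypothetical and are not fitted to any data from this chapter.

```python
def switching_time(voltage, t_ref, v_ref, v_decade):
    """Switching time at `voltage`, assuming a purely exponential law.

    t_ref    : switching time (s) observed at the reference voltage v_ref (V)
    v_decade : voltage increase (V) that makes the switching 10x faster
    All three parameters are placeholders to be extracted from measurements.
    """
    return t_ref * 10.0 ** (-(voltage - v_ref) / v_decade)


# Illustrative values only: 1 us at 1.0 V, one decade faster every 100 mV.
for v in (1.0, 1.1, 1.2):
    print(v, switching_time(v, t_ref=1e-6, v_ref=1.0, v_decade=0.1))
```

On a plot with a logarithmic time axis, such a law appears as a straight line whose slope is set by `v_decade`.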
4.1.2 RRAM Electrical Models
Various filamentary RRAM models are reported in the literature (Hajri et al. 2017; Strukov et al. 2009; Strukov and Williams 2009; Pickett et al. 2009; Kvatinsky et al. 2012, 2015; Jiang et al. 2016; Bocquet et al. 2014b). However, most of these models are not suitable for fast and accurate electrical simulations of RRAM memory arrays. Indeed, to the best of our knowledge, only Jiang et al. (2016) and Bocquet et al. (2014b) include the relationship between the programming current Iprog and the obtained resistance value. Additionally, the model introduced in Bocquet et al. (2014a, b) is compiled for the Eldo simulator (Platform 2021) and fitted with up-to-date silicon data (Vianello et al. 2014). This model is therefore used in the following. It relies on the electric field-induced creation/destruction of oxygen vacancies within the switching layer, as presented in Bocquet et al. (2014a, b). The memory resistance is directly linked to the radius of the conductive filament (CF), which is calculated with a single master equation continuously accounting for both electro-forming/set and reset. The model takes into account various phenomena, including the dependency of the switching time on the applied voltage for all operations, the evolution of the HRS value versus the applied voltage during reset, the evolution of the LRS value versus Iprog, and the temperature impact on all operations (i.e., the relationships from Figs. 10 and 11 are modeled). In the following simulations, the RRAM electro-forming step is considered as already performed.
4.2 PC SiNWFET-OxRAM Co-Integration
In this section, we propose a solution to overcome the VT loss of 1T1R bitcells, which was identified in Jovanovic et al. (2016) and Portal et al. (2017). Figures 12a and b recall the operation of a 1T1R bitcell during a set and a reset operation. During a reset operation, the source and drain roles of the n-type transistor are swapped, so that Vgs is referred to the internal node of the bitcell. This prevents the internal node of the 1T1R bitcell from exceeding Vreset-VT (where VT is the n-type threshold voltage). This effect forces an increase of the 1T1R programming voltages and thus of the overall energy cost of the reset operation. It also reduces the transistor reliability due to additional voltage stress. One solution (Portal et al. 2017), presented in Fig. 12c and d, consists in using a different transistor for each operation in order to keep a good control of the transistor Vgs. Thereby, for both set and reset operations, the transistor Vgs is well controlled between the SourceLine (SL) and the WordLine (WL). However, adding a supplementary p-type transistor strongly increases the overall bitcell area, as shown later on. The proposed solution consists in using a transistor that can be dynamically switched between p-type and n-type. This solution is introduced in Fig. 12e, f for set and reset operations. A set operation is performed by using the PC SiNWFET as an n-type transistor (i.e., by applying the set voltage (Vset) on the Polarity Gate (PG)), while a reset operation is performed by using the PC SiNWFET as a p-type transistor (i.e., by applying gnd to the PG). The resulting bitcell is named 1 Polarity Controllable Transistor-1 RRAM (1PCT1R). As briefly introduced in Shamsi et al. (2015), co-integrating PC SiNWFET and RRAM technologies may also be interesting from a security point of view. By symmetrizing the programming current Iprog during both set and reset operations, it smooths the memory power trace and thus enhances its resilience to Side Channel Attacks. In this work, we only focus on the memory array architecture, density and reliability considerations.

Fig. 12 Schematic of 1T1R, 2T1R and 1PCT1R bitcells during set and reset operations. a and b show a 1T1R, c and d a 2T1R and e and f a 1PCT1R bitcell. Vgs is highlighted for each structure
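The VT loss discussed above can be made concrete with a first-order sketch. It assumes an ideal switch-level view (no body effect, no voltage drop other than the threshold limit, BL at gnd), and the function names as well as the numerical values of Vreset and VT are illustrative assumptions rather than data from this chapter.

```python
def internal_node_1t1r_reset(v_sl, v_wl, v_t):
    """Highest voltage the internal node can reach through an n-type selector.

    During reset the n-type transistor behaves as a source follower, so the
    internal node follows SL only up to V_WL - V_T.
    """
    return max(0.0, min(v_sl, v_wl - v_t))


def wl_voltage_for_full_reset(v_reset, v_t):
    """Word-line voltage needed by a 1T1R bitcell to pass the full V_reset."""
    return v_reset + v_t


def internal_node_pct_reset(v_sl):
    """Idealized p-type operation (gate at gnd): SL is passed without VT loss."""
    return v_sl


# Hypothetical numbers, for illustration only.
V_RESET, V_T = 1.5, 0.5
print(internal_node_1t1r_reset(V_RESET, V_RESET, V_T))  # limited to V_reset - V_T
print(wl_voltage_for_full_reset(V_RESET, V_T))          # required gate overdrive
print(internal_node_pct_reset(V_RESET))                 # full V_reset is available
```

Under these assumptions the 1T1R bitcell either loses VT across the selector or needs its WL driven above Vreset, which is the gate overdrive that the 2T1R and 1PCT1R structures avoid.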
4.3 Bitcell Design
In this section, the physical layout of 1T1R bitcells is reported as proposed in Shen et al. (2012) and Chang et al. (2014) (Fig. 13a), and the physical layout of 2T1R bitcells is introduced based on the design rules of 28nm process technologies (Fig. 13b). As the drains of the p-type and n-type transistors cannot be easily merged, they are connected through a metal 1 interconnect (hatched purple) and the RRAM device is integrated in a via (yellow cross) between the metal 1 and the metal 2 (yellow). The SourceLines (SL) are integrated in metal 1 in parallel with the WordLines controlling the p-type (WLp) and the n-type (WLn) transistors. The p-type transistor width is sized to ensure a drive identical to that of a minimum size n-type transistor; its width is therefore 160 nm instead of 80 nm. As the largest transistor defines the bitcell height, for layout density and uniformity considerations, the n-type transistor width is also increased to 160 nm (as a consequence, a lower WL bias must be used to control a lower set programming current). The minimum 2T1R bitcell area is determined from the minimum spacing between active regions in the same well (for transistors of the same type) and in two different wells (for n-to-p type transistor spacing). This area is estimated at 420 nm * 240 nm (0.1008 µm², i.e., 40.3 F²) for a 2T1R bitcell versus 160 nm * 194 nm (0.031 µm², i.e., 12.4 F²) for a 1T1R bitcell. As a reminder, the area here is expressed in F². In high-density memories, the feature size F is taken as the half pitch of the first metal level. In 28nm technology (which is the reference here), F = 50 nm.

Fig. 13 Physical layout considerations for a 1T1R and b 2T1R bitcells. In the 2T1R, the n-type MOS could be considered minimum size to respect the 2x ratio between n-type and p-type current drive on CMOS technologies

In this section, two PC-based bitcells are presented. The 1PCT1R bitcell is based on a standard 1T1R structure, whereas the 1XPCT1R exhibits an innovative cross-shape structure. For each bitcell, the physical structure and operating conditions are compared to 1T1R and 2T1R MOS-based bitcells. For the PC-based bitcells, as there is an additional terminal in the transistor (the Polarity Gate), a control line is added in the memory array to bias the Polarity Gates: the Polar Line (PL). The design environment (models and physical design assumptions) is first introduced, followed by the bitcell, physical design and functional validation descriptions. It is worth noting that some interesting structures could be envisioned where the BLs or the SLs are merged with the PLs. However, while the memory is being biased, some transient effects could happen and cause unexpected programming operations in the array. Consequently, the focus is put on a standard memory architecture to study scaling properties at the cost of an additional control line. Additionally, memory architectures embedding an additional control line (such as the PL) are standard in eFlash memories (Do et al. 2016).
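The conversion between bitcell dimensions, area in µm² and area in F² used above can be reproduced with a short sketch. The helper names are illustrative assumptions; the dimensions and F = 50 nm are those quoted in the text.

```python
F_NM = 50.0  # half pitch of the first metal level quoted for the 28 nm reference


def area_um2(width_nm, height_nm):
    """Bitcell area in square micrometers from dimensions in nanometers."""
    return width_nm * height_nm * 1e-6


def area_f2(width_nm, height_nm, f_nm=F_NM):
    """Bitcell area normalized to the squared feature size F."""
    return (width_nm * height_nm) / (f_nm ** 2)


for name, (w, h) in {"2T1R": (420, 240), "1T1R": (160, 194)}.items():
    print(name, round(area_um2(w, h), 4), "um2", round(area_f2(w, h), 1), "F2")
```

Normalizing by F² makes bitcells comparable across technology nodes, which is why the comparison in the rest of the section is expressed in this unit.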
4.3.1 Physical Considerations and Simulation Environment
In order to compare with advanced CMOS, a 22nm gate length SiNWFET transistor is considered, together with the design rules of a 28nm process node to determine the minimum spacing between the transistors and the drain/source area and the BEoL minimal spacing (F = 50 nm). In Mohammadi et al. (2015), Zhang et al. (2014b) and Zografos et al. (2014), the operation of a 22nm gate length SiNWFET device was demonstrated using a physical simulation (TCAD) tool. A basic SiNWFET electrical model was described in Zhang et al. (2013, 2014b). This model is based on a parametric table extracted from TCAD simulations whose basic parameters are fitted on the measured device characteristics introduced previously (Marchi et al. 2014). Access resistances are estimated according to the device geometry. Each capacitance is extracted from TCAD simulations as an average value under all possible bias conditions. Moreover, the OxRAM connected in series with the SiNWFET is simulated with the OxRAM model proposed in Bocquet et al. (2014b). Note that the forming step is not included in the simulations below.
4.3.2 Standard 1PCT1R Bitcell
Figure 14a shows the physical layout of a PC SiNWFET transistor. In the schematics, the transistor is represented as an active line between two drain/source contacts (see Fig. 14a, b). The Control Gate (CG) and the Polar Gate (PG) are represented as polysilicon gates even though, in reality, these gates envelop the stacked nanowires and do not require additional spacing with the neighboring polysilicon wires. The two PGs are connected together on one side of the transistor with a poly wire while the CG contacts are placed on the other side to optimize the layout, as is done in Bobba et al. (2012). Figure 14b shows the organization of PC SiNWFET and OxRAM devices in memory arrays. The OxRAM bitcell is integrated on a via between the metal 1 BL and the PC SiNWFET drain. The sources of two PC SiNWFETs are connected together to the SL to enable a higher density bitcell organization. The BL and the PL are routed vertically while the SL and WL are horizontal. For the area estimation, 22nm SiNWFET transistors (Mohammadi et al. 2015) and 28nm FinFET CMOS design rules are considered. Following these design rules, the minimum 1PCT1R bitcell area is defined. Considering the minimum active area and the minimum spacing between two active areas in 28nm technology, the minimum bitcell occupies a 262 nm * 245 nm area (0.064 µm², i.e., 25.67 F²). This bitcell is bigger than a standard 1T1R bitcell (12.4 F²); however, it is expected to provide the performances of the 2T1R bitcell, which occupies 40.3 F².

Fig. 14 Layout of a 2 × 2 bits 1PCT1R cell array in a 25nm SiNWFET process. Transistors drains are shared to reduce the bitcell area

During the programming operations in 1PCT1R arrays, two main constraints must be satisfied: (1) the accessed bitcell has to be activated; (2) the non-accessed bitcells have to be disabled to avoid parasitic write operations. Since the PC SiNWFET and OxRAM behaviors depend on the relative voltages between the PL, SL, BL and WL terminals, the applied voltages can be either positive or negative. However, to simplify the design and to avoid triple well isolation, only positive voltages are used. For the set operation (as shown in Fig. 12e), the PG voltage is set to a high voltage to ensure n-type operation. Then, the CG voltage is set to the set gate voltage (VGset) in order to control the set current. During the reset operation (as shown in Fig. 12f), the PG voltage is set to ground (gnd) in order to ensure p-type operation. Then, the CG voltage is also set to gnd to perform the reset operation with a p-type transistor. Table 3 summarizes the bias voltages used for set, reset and read operations for the selected and non-selected bitcells. In the memory array operations, all the PC SiNWFETs are set to the same polarity (all n-type for set and read operations and all p-type for the reset operation).

Table 3 Overview of the programming voltages and PC SiNWFET type for set, reset and read operations

| Operation | Status | WL | SL | BL | PL | PCT type |
|---|---|---|---|---|---|---|
| Set | Selected | Vgset | 0 | Vset | Vset | n-type |
| Set | Nonselected | 0 | 0 | 0 | Vset | n-type |
| Reset | Selected | 0 | Vreset | 0 | 0 | p-type |
| Reset | Nonselected | Vreset | Vreset | Vreset | 0 | p-type |
| Read | Selected | Vgread | 0 | Vread | Vdd | n-type |
| Read | Nonselected | 0 | 0 | 0 | Vdd | n-type |

Figure 15 presents the simulation waveform of a 2 × 2 array of 1PCT1R bitcells as shown in Fig. 14. Each bitcell is first set to LRS and then reset to HRS. In-between, each bit is read to validate the programmed resistance state. The PL, WL, SL and BL voltages are controlled as introduced in Table 3. As each bitcell is programmed, the current through all the bitcells is shown and indicates that no parasitic programming operation occurs on unselected bitcells.
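The bias scheme of Table 3 can be encoded as a small lookup structure and checked on a 2 × 2 array with a crude switch-level rule. This is only a sketch: the numerical voltages, the threshold `VT` and the conduction rule are hypothetical simplifications, and the result does not replace the electrical simulation of Fig. 15. The routing assumption (WL and SL shared along rows, BL shared along columns) follows the array description given above.

```python
# Illustrative voltages and threshold only; none of these values come from the chapter.
V = {"Vset": 2.0, "Vgset": 0.6, "Vreset": 2.0, "Vread": 0.3, "Vgread": 1.0, "Vdd": 1.0}
VT = 0.3  # hypothetical threshold magnitude, used for both polarities

# Table 3, stored per line as (selected, non-selected); PL is common to the array.
BIAS = {
    "set": {"WL": ("Vgset", 0), "SL": (0, 0), "BL": ("Vset", 0), "PL": "Vset", "type": "n"},
    "reset": {"WL": (0, "Vreset"), "SL": ("Vreset", "Vreset"), "BL": (0, "Vreset"),
              "PL": 0, "type": "p"},
    "read": {"WL": ("Vgread", 0), "SL": (0, 0), "BL": ("Vread", 0), "PL": "Vdd", "type": "n"},
}


def volts(symbol):
    """Translate a symbolic bias name (or a literal 0) into a voltage."""
    return V[symbol] if isinstance(symbol, str) else float(symbol)


def line_voltage(op, line, selected):
    sel, unsel = BIAS[op][line]
    return volts(sel if selected else unsel)


def cell_conducts(op, row, col, sel_row, sel_col):
    """Crude rule: a cell carries current if BL differs from SL and its selector is on."""
    wl = line_voltage(op, "WL", row == sel_row)   # WL and SL are shared along a row
    sl = line_voltage(op, "SL", row == sel_row)
    bl = line_voltage(op, "BL", col == sel_col)   # BL is shared along a column
    if bl == sl:
        return False  # no voltage across the bitcell
    if BIAS[op]["type"] == "n":
        return wl - min(bl, sl) > VT  # n-type: gate above the lower terminal
    return max(bl, sl) - wl > VT      # p-type: gate below the higher terminal


for op in BIAS:
    active = [(r, c) for r in range(2) for c in range(2) if cell_conducts(op, r, c, 0, 0)]
    print(op, active)
```

With these placeholder values only the selected bitcell conducts for each operation, which mirrors the disturb immunity reported for the simulated array.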
Fig. 15 Waveforms of set and reset operations in a 1PCT1R bitcell 2 × 2 array. WL, PL, BL voltages and OxRAM current are shown. The immunity to programming disturb in non-selected bitcells is ensured
4.3.3 Breakthrough 1XPCT1R Bitcell
Thanks to the regularity of the memory array structure, more flexibility is allowed with respect to the design rules than in standard logic physical design. In this section, with the assumption that gates can be deposited in both vertical and horizontal directions, an innovative bitcell using PC SiNWFET transistors organized in a cross shape is proposed. The cross-shaped 1PCT1R bitcell (1XPCT1R) is validated through a physical layout feasibility study and electrical simulations. Figure 16 presents the 1XPCT1R schematic diagram. Four 1PCT1R bitcells are organized in a cross shape with a common transistor source. The CGs of transistors T1 and T2 (resp. T3 and T4) are connected together to WL0 (resp. WL1). The PGs of T2 and T4 (resp. T1 and T3) are in common and connected to PL0 (resp. PL1). The OxRAMs of T2 and T3 are connected to BL1, while the T1 OxRAM is connected to BL0 and the T4 OxRAM to BL2. In Fig. 17, the layout array organization is shown; the 1XPCT1R is a cross-shaped bitcell. Each arm of the cross supports a PC SiNWFET (green) and an OxRAM memory (yellow squares). The minimum replicable block is a 20-bit block occupying 0.828 µm². It leads to 0.041 µm² per bit (almost half of the standard 1PCT1R area of 0.064 µm²) for a PC SiNWFET technology node with 22nm physical rules.

Fig. 16 Schematic of a 2 × 2 bits 1XPCT1R bitcell array. Each 1XPCT1R bitcell is constituted of 4 SiNWFETs with common drain

Figure 17 presents the detailed physical layout of a 4 bits 1XPCT1R block. The common SL is drawn using a metal 3 vertical wire. The connection between the SL and the common source of the transistors is made through a metal 1 wire used to shift the contact over the T1 transistor. BLs (resp. WLs) are drawn using metal 2 horizontal lines and are connected to the OxRAMs (resp. CGs). Each transistor drain supports an OxRAM. This 1XPCT1R array organization needs specific border bitcells, and some bits have to be sacrificed at the border. To make all the BLs, WLs, SLs and PLs accessible, the border crosses are cut and some bits are not connected, as presented in Fig. 18. Incomplete crosses that contain no common SL are sacrificed. This represents one bit out of six for the first and last BLs and SLs. To ease the addressing, the first and last BL and SL can be left unaddressed. Knowing that memory arrays are classically surrounded by a ring of dummy cells, the sacrificed BL and SL can be considered as part of the dummy ring and do not impact the memory area.

Fig. 17 Physical description of a 20-bits bitcell array and, in the insert, a detailed layout of a 2 × 2 bits 1XPCT1R bitcell array. Equivalent bit density is almost 2 times higher than for 1PCT1R bitcell

Fig. 18 Array of 1XPCT1R bitcells with detailed WL, BL, SL and PL. Border bitcells are detailed: unconnected OxRAMs are highlighted in red and incomplete crosses are cut

As before, the programming operations are defined by relative voltage differences and can be operated relative to gnd. The set operation is done with all the PC SiNWFETs in n-type. First, all the SLs, BLs and WLs are put at gnd. Then, the selected WL is biased to VGset and the writing pulse is applied on the selected BL. During the reset operation, all the PC SiNWFETs are put in p-type (PL voltage at gnd) and the array WLs, BLs and SLs are biased at the reset voltage (Vreset). Finally, the selected WL is pulled down to gnd and the writing pulse (from Vreset to gnd) is applied on the selected BL. Due to its non-standard array organization, several cases are possible: two bitcells with common WL, common PL, common WL and SL, common BL and SL, common PL and SL, and common BL. When a non-selected bitcell shares common lines with the selected bitcell, immunity to write disturb has to be demonstrated. Shared WL, SL and BL are standard non-selected bitcell cases (as shown in Fig. 17). A shared PL is not critical for the 1XPCT1R because all the PLs have the same bias during programming or read operations. Figure 19 presents the disturb immunity for common WL and SL and for common BL and SL. During both reset and set operations, the write disturb is avoided by the WL, SL, BL and PL voltages.

Fig. 19 Waveforms of set and reset operations for 1XPCT1R bitcell array. For each operation, immunity of unselected bitcells that share a WL, SL or BL with the selected one is ensured
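The areas quoted in this section can be put side by side with a short calculation. The dictionary and function names below are illustrative assumptions; the dimensions and the 20-bit block area are those given in the text.

```python
# Per-bit areas in um^2, derived from the dimensions quoted in the text.
AREA_UM2 = {
    "1T1R": 0.160 * 0.194,
    "2T1R": 0.420 * 0.240,
    "1PCT1R": 0.262 * 0.245,
    "1XPCT1R": 0.828 / 20,  # 20-bit replicable block of 0.828 um^2
}


def density_gain(reference, candidate):
    """How many times more bits per unit area `candidate` offers than `reference`."""
    return AREA_UM2[reference] / AREA_UM2[candidate]


print(round(density_gain("2T1R", "1PCT1R"), 2))     # about 1.57
print(round(density_gain("2T1R", "1XPCT1R"), 2))    # about 2.43
print(round(density_gain("1PCT1R", "1XPCT1R"), 2))  # about 1.55
```

The first two ratios correspond to the 1.5x to 2.45x range announced at the beginning of Sect. 4, the small difference coming from the rounding of the per-bit area to 0.041 µm².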
4.4 Performances Metrics
In this section, we explore the performances offered by various RRAM-based architectures using PCT transistors as selectors. We compare PCT-based RRAM bitcells with CMOS-based 1T1R and 2T1R bitcells for Set, Reset and Read operations. Overall, in this section we show that (i) during Set operations, PCT-based bitcells offer performances equivalent to CMOS-based bitcells; (ii) during Reset operations, PCT-based bitcells are a solution to the overdrive issue identified in CMOS-based bitcells while only slightly increasing the area overhead; and (iii) some PCT-based bitcells can enable better read performances than CMOS-based bitcells thanks to a lower contact density along the memory lines. In order to compare CMOS and PCT-based bitcells, we follow the considerations described in Sect. 4.3.1 and rely on a 22nm SiNWFET low-power technology (Zhang et al. 2013, 2014b). To ensure a fair apples-to-apples comparison with CMOS, we consider a low-power CMOS technology: the 28nm FDSOI technology PDK from STMicroelectronics. Finally, the RRAM technology is simulated with the filamentary-based model from Bocquet et al. (2014b), calibrated on characterization data (Vianello et al. 2014). Further information about the simulation framework is available in Levisse et al. (2019).
4.4.1 Performances in Set Operation
While the first claim of PCT-based bitcells is to achieve a major performance improvement during Reset, thanks to the good control of the gate-source voltage (Vgs), it is important to show that using such selectors does not degrade the performances in Set. To that end, we simulated set operations using PCT and CMOS selectors. Figure 20 shows the set time versus the BL-SL voltage difference for PCT and CMOS-based bitcells. As both technologies do not behave exactly the same, we tuned the gate voltage to ensure that the programming current is the same for both bitcells. Thereby, in order to target a 60 µA programming current, we apply 0.6 V to a 6-nanowire PCT SiNWFET transistor and 0.9 V to an 80 nm wide MOS transistor. With this flow, we ensure that both PCT and CMOS bitcells have the same set performances.
Fig. 20 Set time versus BL-SL voltage difference for minimum size 1T1R MOS bitcell (red) and PCT-based bitcell (blue). Vertical axis: set time [s], logarithmic scale from 10^-9 to 10^-5. Horizontal axis: BL-SL voltage [V], from 1.6 to 2
4.4.2 Performance in Reset Operation
In this section, we explore the performance gains enabled during reset operation by the proposed PCT-based bitcells. We compare them with CMOS-based 1T1R and 2T1R bitcells. The considered 1T1R bitcell is based on a minimum size transistor (W = 80 nm) following the design proposed in Shen et al. (2012). For the 2T1R, we consider three different configurations: (i) minimum size p-type (80 nm width), (ii) medium size p-type (120 nm width) and (iii) double size p-type (160 nm width). To ensure layout regularity, we size the n-type according to the p-type. Then, during the set operation, as described in Sect. 4.4.1, we underdrive the n-type transistor gate to keep a 60 µA programming current (Iprog). Figure 13 shows the bitcell layout for the 2T1R (configuration (iii)) and for the 1T1R bitcell. From an area perspective, a wider p-type leads to a bigger bitcell: configuration (i) enables a 30.3 F² bitcell (0.0756 µm²), while configuration (ii) leads to a 33.6 F² area. Finally, configuration (iii) is the biggest 2T1R bitcell, with a 40.3 F² bitcell area.
We perform reset operations with various BL, SL and WL voltages for all the 1T1R, 2T1R and 1PCT1R bitcells. We define the reset time as the time required for the RRAM resistance to reach an HRS/LRS ratio of 10. Figure 21 shows the reset time versus the SL-BL voltage for all the considered bitcells. As expected, the 1T1R bitcell requires a huge gate overdrive to achieve sub-100 µs reset times. As a reference, 1T1R bitcells demonstrated in the literature require from 3 to 5 V to enable sub-100 ns reset operations (Grossi et al. 2018; Chen et al. 2012; Yi et al. 2011). In red, the 1T1R bitcell with its gate overdriven from 1.7 V up to 2 V shows poor reset performance while causing high voltage stress on the transistor. Alternatively, 2T1R bitcells (in blue) show stronger reset performance and transistor reliability at the price of bigger bitcells (at least 30.3 F²).
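The 2T1R area figures quoted in this subsection can be cross-checked with a short calculation. The sketch below assumes a conversion factor of 0.0025 µm² per F², which is inferred here from the stated 30.3 F² = 0.0756 µm² equivalence and is not a value taken from the PDK; the names `UM2_PER_F2`, `area_um2` and `area_overhead` are illustrative.

```python
# Assumed conversion factor, inferred from the stated 30.3 F^2 = 0.0756 um^2
# equivalence; it is not a value taken from the PDK.
UM2_PER_F2 = 0.0025

# 2T1R bitcell areas in F^2, keyed by the p-type width in nm given in the text.
AREAS_2T1R_F2 = {80: 30.3, 120: 33.6, 160: 40.3}


def area_um2(area_f2, um2_per_f2=UM2_PER_F2):
    """Convert a bitcell area from F^2 to square micrometers."""
    return area_f2 * um2_per_f2


def area_overhead(area_f2, reference_f2):
    """Ratio of a bitcell area to a reference bitcell area."""
    return area_f2 / reference_f2


if __name__ == "__main__":
    smallest = AREAS_2T1R_F2[80]
    for width_nm, area in AREAS_2T1R_F2.items():
        print(f"p-type {width_nm} nm: {area} F^2 = {area_um2(area):.4f} um^2, "
              f"{area_overhead(area, smallest):.2f}x the minimum size 2T1R")
```

Under this assumed factor, widening the p-type from 80 nm to 160 nm costs roughly a third more bitcell area, which is the overhead that the reset performance of the 2T1R configurations has to be weighed against.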
Finally, the proposed PCT-based bitcells are represented in green. When performing the set operation in n-type configuration and the reset in p-type configuration with a standard PCT bitcell, performance is equivalent to that of 2T1R bitcells. At the same time, it enables an area reduction ranging from 1.35× (25 F² vs. 33.6 F²) for a 1PCT1R vs. a medium size 2T1R, up to 2.6× (16 F² vs. 40.3 F²) for a 1XPCT1R vs. a double size 2T1R. It is interesting to note that the PCT-based bitcell (1.2 V Vgs) shows the same performance as a strongly overdriven (2 V Vgs) 1T1R bitcell for a 1.85 V SL-BL voltage. On the other hand, it scales better at higher SL-BL voltage differences, as the Vgs is independent of the SL-BL voltage. This effect leads to a 105x reset time reduction at 2.2 V SL-BL versus the 2 V overdriven 1T1R CMOS bitcell. Compared to 2T1R CMOS bitcells, while it shows performance equivalent to the medium size p-type (33.6 F²) at 1.6 V SL-BL, it is outperformed by the double size (40.3 F²) 2T1R bitcell, as the p-type series resistance becomes lower. Note that PCT-based bitcell performance during reset can be improved by increasing the number of stacked nanowires. Ultimately, PCT-based bitcells with n-type set and p-type reset enable up to 75x performance improvement at a 2.2 V SL-BL voltage difference versus CMOS-based 2T1R. At a lower operating voltage (1.8 V SL-BL), PCT-based bitcells outperform CMOS-based 2T1R by 5x. Alternatively, considering a p-type based set and an n-type based reset can enable a strong performance improvement (up to 500x at 1.8 V SL-BL) but it would
Fig. 21 Reset time versus SL-BL voltage difference for CMOS 1T1R bitcell (red) and various gate overdrive voltages (from 1.7 to 2 V), CMOS 2T1R bitcells (blue) with various p-type transistor size (from 80 nm up to 160 nm width) and PCT-based bitcells (green) for n-type and p-type reset. Axes: reset operation time [µs], from 0.01 to 10 000, versus SourceLine to BitLine voltage [V], from 1.2 to 2.4. Curves: 1T1R (12.4 F²) at Vgs = 1.7 V, 1.8 V and 2 V; 2T1R at Vgs = 1.2 V (30.3 F², 33.6 F² and 40.3 F²); PCT set n/reset p and PCT set p/reset n at Vgs = 1.2 V (16 to 25 F²). Annotated gains: 5x, 75x, 105x and 500x
require fabricating the RRAM stack in reverse fashion (top electrode first) and may imply a different behavior. Note that equivalent behavior could be observed with 2T1R bitcells (Portal et al. 2017); however, it would not improve area considerations much. Figure 22 shows the energy consumed during a reset operation performed at 1.8 V SL-BL voltage for all the bitcells under study. As expected from Fig. 21, the programming energy of PCT-based bitcells lies between those of the medium size and double size CMOS-based 2T1R bitcells (33.6 F² and 40.3 F²). Equivalent reset energies (1 to 10 nJ) cannot be achieved with a minimum size CMOS 1T1R bitcell without a strong Vgs overdrive (more than 2 V). It is important to note that we do not consider here the energy needed to generate such voltages. While it could seem fair to consider the highest density bitcells and overdrive them, doing so induces stress on the array and periphery gate oxides, diffusions, etc. (bitcells sharing the WL are going to be stressed as well) and may lead to early memory failure (Federspiel et al. 2012).
4.4.3 Performance in Read Operation
This subsection explores the impact of the proposed bitcells during read operation. We compare the proposed PCT bitcells with regular CMOS-based 1T1R and 2T1R bitcells. While most of the results in this section are expected (i.e., bigger bitcells mean lower read frequency), we show that the 1XPCT1R bitcell achieves a shorter access time than smaller bitcells thanks to its unconventional array organization. By using more metal lines per array, as shown in Table 4, it reduces the parasitic capacitance on the SL, BL, PL and WL, enabling faster charge and discharge of the lines and thereby faster read operations. This reduction ratio is 4/5th
Fig. 22 Energy consumed during reset operations for various bitcell architectures versus the required programming voltage. 1T1R bitcell requires an increase of the programming voltage (red) inducing a reduction of the MOS transistor reliability. PC SiNWFET-bitcells (green) enable 2T1R (green) operation voltages without overdrive while using a single PC SiNWFET transistor per OxRAM bitcell. Axes: reset energy [nJ], from 0.1 to 100, versus reset gate voltage [V], from 1 to 2.4. Curves: minimum size (12.4 F²) 1T1R bitcell, 33.6 F² and 40.3 F² 2T1R bitcells, 1PCT1R bitcell (16-25 F²). Annotation: MOS selectors degradation

Table 4 Number of BLs for standard arrays (1T1R, 1PCT1R) and for 1XPCT1R array
Array size    Array BL and SL
              1PCT1R    1XPCT1R
256 bits      16        18
1 kbits       32        36
4 kbits       64        72
16 kbits      128       144
65 kbits      256       287
262 kbits     512       574
of the contact density per line compared to regular bitcells. While the read energy is slightly increased, performance for a 65 kbit array is improved by 12% compared to a 32% bigger 1T1R-based array. We simulated both CMOS and PCT-based arrays during read operations for various memory sizes. We considered a BL precharge and discharge to perform a read and assumed that a sense amplifier is able to complete a read operation from the BL discharge. Following Sect. 4.4.1, we consider a 60 µA programming current (i.e., leading to a 20 kΩ LRS state). We assume that a read operation is triggered by the WL activation after the precharge. Overall, the read time includes (i) the WL charge and discharge time and (ii) the time for the BL to discharge through the accessed bitcell. To do so, we consider the extracted parasitics of a 28 nm CMOS technology node Back-End-of-Line. Figure 23, left-hand side, presents the read time versus the BL length in a memory array while the SL is 512 bitcells long. Figure 23, right-hand side, presents the read time of the different bitcells normalized against the 1XPCT1R
Fig. 23 a Read time versus BL length for a 512 Bitcells long SL for CMOS and PCT bitcells. b Normalized read time versus 1XPCT1R bitcell array. Axes: a read time [ns] versus BL length (0 to 600); b gain versus BL length (0 to 600). Curves: 2T1R, 1PCT1R, 1T1R and 1XPCT1R. Annotations: -67%, -30%, -15% and +8.6% in a; 17% and 27% in b; XPCT1R bitcell array less profitable for shorter BL
bitcell-based array. We show 67% and 15% gains for the 1XPCT1R compared to CMOS-based 2T1R and 1T1R bitcells, respectively, for a 512-bitcell-long BL. For standard 1PCT1R bitcells, the performance gain versus 2T1R is 30%, while there is an 8.6% performance degradation versus 1T1R due to the lower density. Finally, we demonstrate 17% and 27% performance improvements for the 1XPCT1R compared to 1T1R and 1PCT1R, respectively. In detail, while the gate capacitance of the PCT is higher than that of its CMOS counterpart, for the same memory size, when compared to the BL discharge, its contribution only represents 3% (respectively 1%) for a 256 × 256 PCT (respectively 1T1R) array and 6% (respectively 2%) for a 512 × 512 array. When considering non-square arrays, if the PL or SL are long enough, they may take longer to charge than the BL takes to discharge. In that case, the SL length limits the performance of the 1XPCT1R array. As the PL is connected to two polarity gates (PG) while the SL is connected to a single transistor drain, the PL parasitic capacitance is higher. In this context, the PL charging time limits the read speed when a read is performed right after a reset operation (as shown in Fig. 19). As the WL and the BL run in the same direction in 1XPCT1R bitcell arrays, a longer WL charge also corresponds to a longer BL discharge, mitigating the impact of the WL charge on read performance and keeping its effect low, as discussed for square arrays. Figure 24 shows the read time ratio between CMOS 1T1R and 1XPCT1R bitcell arrays for various array sizes (BLs and SLs). The green area corresponds to array sizes for which 1XPCT1R arrays perform better than CMOS-based 1T1R. On the other hand, the red area corresponds to array sizes for which 1T1R is more profitable. For all the aforementioned reasons, 1T1R arrays are more profitable for wide (more than 300 BLs) and thin (fewer than 100 SLs) arrays.
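The read-time decomposition used in this subsection (WL charge and discharge time plus BL discharge through the accessed bitcell) can be illustrated with a lumped RC estimate. This is a minimal sketch and not the simulation flow used for Fig. 23: the per-cell capacitance, the WL time, the precharge and sensing voltages, and the assumption that the BL capacitance scales with the 4/5 contact density ratio are all hypothetical; only the 20 kΩ LRS value comes from the text.

```python
import math


def bl_discharge_time(n_cells, c_cell, r_path, v_pre, v_sense):
    """Time for a precharged BL, modeled as one lumped capacitor, to fall from
    v_pre to v_sense through a resistive path r_path (accessed bitcell in LRS)."""
    c_bl = n_cells * c_cell
    return r_path * c_bl * math.log(v_pre / v_sense)


def read_time(t_wl, n_cells, c_cell, r_path, v_pre, v_sense):
    """Read time as the sum of the WL charge/discharge time and the BL discharge time."""
    return t_wl + bl_discharge_time(n_cells, c_cell, r_path, v_pre, v_sense)


# Hypothetical values for illustration only (not extracted from the 28 nm BEOL).
C_CELL = 0.1e-15   # BL parasitic capacitance contributed by each bitcell [F]
R_PATH = 20e3      # LRS resistance quoted in the text [ohm]
T_WL = 0.1e-9      # WL charge and discharge time [s]

regular = read_time(T_WL, 512, C_CELL, R_PATH, v_pre=1.2, v_sense=0.6)
# Assumption: a 4/5 contact density translates into 4/5 of the line capacitance.
reduced = read_time(T_WL, 512, 0.8 * C_CELL, R_PATH, v_pre=1.2, v_sense=0.6)
print(f"regular: {regular * 1e9:.2f} ns, reduced line capacitance: {reduced * 1e9:.2f} ns")
```

In this simplified model the BL discharge term scales linearly with the line capacitance, which is why a lower contact density per line can offset a larger bitcell footprint for long BLs.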
In this section, we explore the energy consumption difference between the previously studied architectures, with emphasis on the selector voltage stress and the bitcell area. For 1T1R, 2T1R and 1PCT1R, using the performance reported in Vianello et al. (2014), the reset energy versus the reset voltage for a 100 µA set current (LRS = 10 kΩ) is extracted and plotted in Fig. 22. For each reset pulse, the current is integrated from the beginning of the reset operation until the HRS value reaches 100 kΩ. This
Fig. 24 1XPCT1R over 1T1R read time ratio versus array size (BLs and SLs). Except for extremely wide arrays, 1PCT1R arrays are more profitable than CMOS-based ones. Axes: SL number (40 to 100) versus BL number (300 to 500). Zones: one in which 1XPCT1R-based arrays perform better, one in which CMOS 1T1R-based arrays perform better
value is then multiplied by the voltage difference between the SL and the BL to obtain the reset energy. In the 1T1R configuration (red curves), due to the VT loss through the n-type transistor, a higher voltage is needed to ensure a successful operation. 1T1R bitcells require more than 1.8 V on at least one of their terminals to perform energy-efficient programming operations (the dotted red line represents the programming energy when only the WL voltage is increased and the BL is kept at 1.2 V, the solid red line the energy when both SL and WL voltages are increased). On the other hand, reset operation with a p-type transistor enables energy-efficient programming without the management of high voltages and without overstress on the memory array selection transistors (green curve for PC SiNWFET-based bitcells and blue curve for 2T1R). Table 5 summarizes the area, programming time, read time and programming voltages for the bitcells considered in this work: CMOS-based 1T1R and 2T1R, as well as PCT-based 1PCT1R and 1XPCT1R.
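The energy extraction procedure described above (integrate the current until the resistance reaches the HRS target, then multiply by the SL-BL voltage) can be sketched as follows. This is an illustration under stated assumptions, not the authors' post-processing script: the function name `reset_energy`, the trapezoidal integration, and the synthetic waveform (resistance growing one decade per microsecond from a 10 kΩ LRS at a constant 1.8 V) are all hypothetical.

```python
def reset_energy(times, currents, resistances, v_sl_bl, r_target=100e3):
    """Return (energy [J], reset time [s]) for one reset pulse.

    The current is integrated (trapezoidal rule) from the start of the pulse
    until the resistance first reaches r_target, and the resulting charge is
    multiplied by the SL-BL voltage. Returns None if r_target is never reached.
    """
    if resistances[0] >= r_target:
        return 0.0, times[0]
    charge = 0.0
    for k in range(1, len(times)):
        dt = times[k] - times[k - 1]
        charge += 0.5 * (currents[k] + currents[k - 1]) * dt
        if resistances[k] >= r_target:
            return v_sl_bl * charge, times[k]
    return None


# Synthetic waveform (assumption): resistance grows one decade per microsecond
# from a 10 kOhm LRS under a constant 1.8 V SL-BL voltage.
V_SL_BL = 1.8
DT = 1e-9
times = [k * DT for k in range(2001)]
resistances = [10e3 * 10 ** (t / 1e-6) for t in times]
currents = [V_SL_BL / r for r in resistances]

energy, t_reset = reset_energy(times, currents, resistances, V_SL_BL)
print(f"reset time = {t_reset * 1e6:.3f} us, reset energy = {energy * 1e9:.4f} nJ")
```

The same routine also returns the reset time under the HRS/LRS = 10 criterion of Sect. 4.4.2, since a 100 kΩ target for a 10 kΩ LRS corresponds to that ratio.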
4.5 Conclusions
In this section, the opportunities opened by the use of PC SiNWFET transistors for designing OxRAM memory arrays are explored. Two innovative bitcells (1PCT1R and 1XPCT1R) using PCT SiNWFET are presented and validated through electrical simulations. Then, the area and energy consumption of these bitcells during read and write operations are compared with 1T1R and 2T1R CMOS-based bitcells. Thanks to their dynamic p-type/n-type switching feature, PCT-based bitcells enable a low reset voltage like the 2T1R bitcell while providing a more compact bitcell (0.041 µm² for the 1XPCT1R and 0.064 µm² for the 1PCT1R, versus 0.1008 µm² for the 2T1R, i.e., a 1.6x area reduction for the 1PCT1R and 2.45x for the 1XPCT1R bitcell). On the other hand, as
Table 5 Summary of the CMOS and PC SiNWFET-based bitcells area and required voltages Bitcell type 1T1R 2T1R 1PCT1R 1XPCT1R 1-bit Area Sub 20 µs reset gate voltage Reset time 1.8V BL/SL Voltage Vg = 1.2 V (except 1T1R) Read time 512 × 512 array
0.031 µm2 12.4 F2 >2 V
0.1008 µm2 40.3 F2 i
where $u(\cdot)$ denotes the signum neural activation function, and $c$ is a constant that refers to a reference voltage. Each estimator $D_i^{(k)}$ should aim to predict the correct teaching labels $T_i^{(k)}$ for new unseen patterns $V_{in}$. To solve this problem, $W$ is tuned
to minimize some measure of error between the estimated and desired labels, over a $K_0$-long subset of the empirical data, or training set (for which $k = 1, \ldots, K_0$). Then, a common error metric is the LMS error function defined as
$$E_{LMS} = \frac{1}{2} \sum_{k=1}^{K_0} \sum_{i=0}^{N-1} \left( D_i^{(k)} - T_i^{(k)} \right)^2, \tag{8}$$
where the 1/2 coefficient is for mathematical convenience. The performance of the resulting estimators is then tested over a different subset, called the test set ($k = K_0 + 1, \ldots, K$). A reasonable iterative algorithm for minimizing the error (that is, updating $W$, where the initial choice of $W$ is arbitrary) is the following instance of online SGD,
$$W^{(k+1)} = W^{(k)} - \frac{\eta}{2} \nabla_{W^{(k)}} \sum_{i=0}^{N-1} \left( D_i^{(k)} - T_i^{(k)} \right)^2, \tag{9}$$
where $\eta$ is the learning rate, a (usually small) positive constant, and at each iteration $k$, a single empirical sample $V_{in}^{(k)}$ is chosen randomly and presented at the system input. The chain rule, applied to (7) and (8), is used to obtain the outer product:
$$\Delta W_{ij(j>i)}^{(k)} = -\eta \left( T_i^{(k)} - D_i^{(k)} \right) T_j^{(k)}. \tag{10}$$
The training phase continues until the error is below $E_{threshold}$, a small predefined constant threshold that quantifies the learning accuracy. We show in the next section that the error function in (8) after training is proportional to the cost function in (5) and the network energy function in (4). The training algorithm is implemented by the feedback shown in Fig. 4b.
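As a small illustration of the error metric (8) and the update rule (10), the sketch below evaluates both for a single 4-bit sample. It is only a numerical reading of the two formulas: the function names and the example vectors are hypothetical, and the network equation (7) that produces the estimates is not modeled here.

```python
def lms_error(estimates, labels):
    """Eq. (8): half the summed squared error over samples k and bits i."""
    return 0.5 * sum(
        (d - t) ** 2
        for d_k, t_k in zip(estimates, labels)
        for d, t in zip(d_k, t_k)
    )


def weight_update(d_k, t_k, eta):
    """Eq. (10): update of the weights W_ij with j > i for one sample k.

    Entries with j <= i stay at zero because (10) is only defined for j > i.
    """
    n = len(d_k)
    delta = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            delta[i][j] = -eta * (t_k[i] - d_k[i]) * t_k[j]
    return delta


# Illustrative 4-bit sample: estimated bits D and teaching labels T.
D = [0, 1, 1, 0]
T = [1, 1, 0, 0]
print(lms_error([D], [T]))
print(weight_update(D, T, eta=0.1))
```

Only the weights that connect a wrongly estimated bit $i$ to an active teaching bit $j > i$ receive a non-zero update, which reflects the outer-product structure of (10).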
3.2 Neural Network DAC
The simplest type of DAC uses a binary-weighted architecture (Plassche 2013), where $N$ (the number of bits) binary-weighted distributed elements (e.g., current sources, resistors, or capacitors) are combined to provide a discrete analog output with finite resolution. The direct conversion feature can be exploited for high-speed applications (Plassche 2013) because it uses a minimal number of conventional components and a small die area. This DAC topology relies on the working principle of the inverting summing operational amplifier circuit, as shown in Fig. 5. Hence, the output voltage is the inverted sum of the input voltages, weighted by the ratio between the feedback resistor and the series resistance for each input. Digital inputs follow full-scale voltages, which means that logical ‘1’ is equivalent to $V_{DD}$, and similarly, logical ‘0’ is equivalent to 0 V. The LSB input is connected to the highest resistance value, which
Fig. 5 Binary-weighted resistors-based DAC
equals the feedback resistance $R$. Accordingly, the MSB input is connected to the lowest resistance value $R/2^{N-1}$. The output voltage is
$$V_{out} = -\frac{1}{2^N} \sum_{i=0}^{N-1} 2^i V_i, \tag{11}$$
where the minus sign is a result of the inverting operational amplifier, and $V_i$ is the digital voltage input of the bit with index $i$, after it has been attenuated by $2^N$, which is a normalization factor that fits the full-scale voltage. The output voltage is proportional to the binary value of the word $V_{N-1} \ldots V_0$. Despite the simplicity of the binary-weighted DAC concept, critical practical shortcomings have hindered its realization. The variability of the resistors, which defines the ratio between the MSB and LSB coefficients (dynamic range), is enormous and grows exponentially with the number of resolution bits (e.g., for $N$ bits the ratio equals $2^{N-1}$), making accurate matching very difficult and consuming a huge asymmetric area with power-starved resistors. Furthermore, maintaining accurate resistance values over a wide range is problematic. In advanced submicron CMOS fabrication technologies, it is challenging to manufacture resistors over a wide resistance range and preserve an accurate ratio, especially in the presence of temperature variations. Process imperfections degrade the conversion precision and increase the vulnerability to mismatch errors. ANNs inherently perform the following dot product of inputs and weights,
$$A = \sum_{i=0}^{N-1} W_i V_i, \tag{12}$$
where A is an analog result of the digital inputs’ weighted sum. From this stage, both deterministic and non-deterministic equivalence between (11) and (12) are derived. Thus, the discrete voltage of the DAC output as defined in (11) can be seen as a
special case of a single-layer ANN, and (12) could be adjusted, using ML methods, to behave as a binary-weighted DAC with intrinsic variations. We exploit the neural network’s intelligent properties to achieve an adaptive DAC that is trained online by an ML algorithm. Assume a learning system that operates on $K$ discrete trials, with $N$ digital inputs $V_i^{(k)}$, actual discrete output $A^{(k)}$ as in (12), and desired labeled output (i.e., teaching signal) $t^{(k)}$. The weight $W_i$ is tuned to minimize the following LMS of the DAC through the training phase
$$E = \frac{1}{2} \sum_{k=1}^{K} \left( A^{(k)} - t^{(k)} \right)^2. \tag{13}$$
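The equivalence between the binary-weighted DAC output (11) and the weighted sum (12) can be checked numerically. The sketch below assumes ideal weights $W_i = -2^i/2^N$ and digital inputs equal to 0 V or $V_{DD}$ = 1.8 V; the function names are illustrative, and the error of (13) is evaluated over all 16 codes of a 4-bit converter.

```python
from itertools import product


def dac_output(bits, v_dd, n_bits):
    """Eq. (11): V_out = -(1/2^N) * sum_i 2^i * V_i, with bits[0] as the LSB."""
    return -sum((2 ** i) * b * v_dd for i, b in enumerate(bits)) / 2 ** n_bits


def weighted_sum(weights, voltages):
    """Eq. (12): A = sum_i W_i * V_i."""
    return sum(w * v for w, v in zip(weights, voltages))


def lms_error(outputs, targets):
    """Eq. (13): half the summed squared error over all trials."""
    return 0.5 * sum((a - t) ** 2 for a, t in zip(outputs, targets))


N, VDD = 4, 1.8
# Assumed ideal weights that make (12) reproduce (11), including the sign.
ideal_weights = [-(2 ** i) / 2 ** N for i in range(N)]
codes = list(product((0, 1), repeat=N))
targets = [dac_output(code, VDD, N) for code in codes]
outputs = [weighted_sum(ideal_weights, [b * VDD for b in code]) for code in codes]
print(lms_error(outputs, targets))
```

With these weights the error of (13) vanishes up to floating-point rounding, so training the weights of (12) amounts to searching for this binary-weighted solution in the presence of device variations.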
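A reasonable iterative update rule for minimizing (13) (i.e., updating $W$, where initially $W$ is arbitrarily chosen) is the following online SGD iteration,
$$\Delta W_i^{(k)} = -\eta \frac{\partial E}{\partial W_i^{(k)}} \;\Rightarrow\; \Delta W_i^{(k)} = -\eta \frac{\partial E}{\partial A^{(k)}} \cdot \frac{\partial A^{(k)}}{\partial W_i},$$
$$\Delta W_i^{(k)} = -\eta \left( A^{(k)} - t^{(k)} \right) V_i^{(k)}, \tag{14}$$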
where $\eta$ is the learning rate, a small positive constant, and during each iteration $k$, a single empirical sample of the digital input voltage $V^{(k)}$ is chosen randomly. This learning algorithm, called Adaline or LMS (Widrow and Lehr 1990), is widely used in adaptive signal processing and control systems (Widrow and Stearns 1985). The MSB, which divides the data range into two different sections, will begin its adjustment procedure towards the error gradient descent. In the same way, other bits will gradually begin their adjustment procedure later, after they converge to their relevant sections. Therefore, the training dynamics of less significant bits are complex and require more time to be captured. Hence, the expected convergence time is also binary-weighted. The LSB, which represents the most precise quantum, requires the longest resolution match and the lengthiest training time to converge. While the MSB can quickly achieve a stable value, the LSB may still present oscillations, thus continuously changing the collective error function in (13). Concurrently, the MSB will be disturbed and swing back and forth recursively in a deadlock around a fixed point. This problem is aggravated in the presence of noise and variations, and ameliorated by using smaller learning rates. Hence, we propose a slightly modified update rule to guarantee a global minimum of the error, and to fine-tune the weights proportionally to their significance degree. We call the modified rule the binary-weighted time-varying gradient descent learning rule, expressed as
$$\Delta W_i^{(k)} = -\eta(t) \left( A^{(k)} - t^{(k)} \right) \cdot V_i^{(k)}, \tag{15}$$
where η(t) is a time-varying learning rate, decreasing in a binary-weighted manner along with the training time, as shown in Fig. 6. The expression for η(t) is
Fig. 6 Flow of the online binary-weighted time-varying gradient descent training algorithm, which updates the weights according to the error function
$$\eta(t) = \begin{cases} \eta & \text{if } k \le K/2 \\ \eta/2 & \text{if } K/2 < k \le 3K/4 \\ \ldots \\ \eta/2^{N-1} & \text{if } (2^{N-1} - 1) \cdot K/2^{N-1} < k \le (2^N - 1) \cdot K/2^N \end{cases} \tag{16}$$
This learning rule utilizes the convergence time acceleration and the decaying learning rate to reduce bit fluctuations around a fixed point. In Sect. 3.4, we show that this learning rule is better than (14) in terms of training time duration, accuracy, and robustness to learning rate non-uniformity.
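The update rule (15) with the schedule (16) can be sketched in software as follows. This is a minimal behavioral illustration, not the hardware training loop: inputs are normalized to 0 or 1, the inverting sign of the amplifier is dropped, the teaching signal is generated from ideal binary weights, the last learning rate is held for any remaining iterations, and the values of `K`, `eta0` and the seed are arbitrary assumptions.

```python
import random


def eta_schedule(k, K, eta0, n_bits):
    """Eq. (16): halve the learning rate at k = K/2, 3K/4, 7K/8, ... (k is 1-based).

    Assumption: the last value eta0 / 2^(N-1) is held until the end of training.
    """
    for m in range(n_bits - 1):
        if k <= (2 ** (m + 1) - 1) * K / 2 ** (m + 1):
            return eta0 / 2 ** m
    return eta0 / 2 ** (n_bits - 1)


def train_dac(n_bits=4, K=4000, eta0=0.1, seed=0):
    """Online training of the DAC weights with the time-varying rule of Eq. (15)."""
    rng = random.Random(seed)
    weights = [rng.uniform(0.0, 1.0) for _ in range(n_bits)]  # arbitrary initial W
    ideal = [2 ** i / 2 ** n_bits for i in range(n_bits)]     # target binary weights
    for k in range(1, K + 1):
        v = [rng.randint(0, 1) for _ in range(n_bits)]        # random digital sample
        a = sum(w * x for w, x in zip(weights, v))            # actual output, Eq. (12)
        t = sum(w * x for w, x in zip(ideal, v))              # teaching signal
        eta = eta_schedule(k, K, eta0, n_bits)
        weights = [w - eta * (a - t) * x for w, x in zip(weights, v)]  # Eq. (15)
    return weights, ideal


trained, ideal = train_dac()
print([round(w, 6) for w in trained])
```

In this noise-free setting a constant learning rate would also converge; the decaying schedule matters when noise and variations make the less significant bits oscillate, as discussed above.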
3.3 Circuit Design Using Memristors
Here, we present the circuit building blocks of the proposed data converter designs, including their different components: neuron, synapse, and feedback circuits. The design methodologies, operational mechanism, and constraints of the building blocks are specified. For simplicity, we provide the circuit design of the quantization stage and assume that the analog input is sampled separately by means of an external sample-and-hold circuit.
A. System Overview
We leverage the conceptual simplicity, parallelism level, and minimum die size of the neural network architecture by implementing online SGD in hardware, thus achieving reconfigurable, accurate, adaptive, and scalable data converters that can be used for high-speed, high-precision, and cost-effective applications. The proposed architectures for the four-bit neural network ADC and DAC are shown in Figs. 4b and 7, respectively. As mentioned, the device is based on memristive synapses that collectively integrate through the operational amplifier, and a feedback circuit that regulates the value of the weights in real time according to (15). The ADC architecture is composed of ten synapse units, four neuron units, and a feedback circuit for every neuron.
Fig. 7 Schematic of a four-bit adaptive DAC based on a single layer ANN and binary-weighted synapses, trained online by a supervised learning algorithm executed by the feedback
The DAC architecture is composed of four synapse units, one neuron unit, and a synchronous training unit. Depending on the characteristics and requirements of the application, a set of system parameters is determined. First, the sampling frequency $f_s$, which specifies the converter speed, is determined, followed by the number of resolution bits $N$, which specifies the accuracy of the converter, and then the full-scale voltage $V_{FS}$, which specifies the converter input dynamic range. Dynamic specifications concerning the learning rate are taken into consideration during the training phase, in order to address further system requirements such as desired precision level, training time, and power consumption. The supervised learning algorithm is activated by interchangeable synchronous read and write cycles, utilizing the same execution path for both read and write operations in situ. Reading is the intended conversion phase; its final result is sampled at the end of the reading cycle $T_r$, after transient effects are mitigated, and it is latched by a negative-edge triggered latch for the entire writing cycle. The writing cycle $T_w$ activates the feedback circuit, which executes the learning algorithm and compares the actual analog output of the DAC, sampled at the end of the read cycle, to the desired analog value, and the actual digital outputs of the ADC to the desired digital labels, which are supplied by the peripheral circuit. During conversion, the training feedback is disconnected, leaving simple low-power binary-weighted converters (Gao et al. 2013). It is preferable that the reading cycle be equal to the writing cycle, which makes it possible to capture the same non-deterministic behaviors, intrinsic noise, and environmental variations in the synaptic weights while training. Thus, the sampling frequency is
Fig. 8 Building blocks of the neural network 4-bit ADC. a Schematic of the memristive synapse $S_{i,j}$. Note that $W_{ij} = R_f/S_{ij}$. b Schematic of the neural activity, which comprises an inverting OpAmp for integration and a latched-comparator for decision-making. c Digital feedback circuit for the gradient descent algorithm. d Feedback circuit for the gradient descent learning algorithm. e Schematic of the PWM circuit (Shen et al. 2007) that generates fixed amplitude pulses with a time width proportional to the subtraction product between the real and teaching signals
$$f_s = \frac{1}{T_r + T_w}. \tag{17}$$
B. Artificial Synapse
We adopt our synapse circuit design from Soudry et al. (2015), Danial et al. (2018a): a single voltage-controlled memristor, connected to a shared terminal of two MOSFET transistors (p-type and n-type), as shown in Fig. 8a. The circuit utilizes the intrinsic dynamics of the memristive crossbar using 2T1R cells, which inherently implements Ohm’s and Kirchhoff’s laws for dot product, and SGD for training, as the basics of ANN hardware realization (Prezioso et al. 2015). The output of the synapse is the current flowing through the memristor. Thus, the magnitude of the input signal $u$ should be less than the minimum conductance threshold,
$$|u| < \min\left( V_{Tn}, |V_{Tp}| \right). \tag{18}$$
The synaptic weight is modified based on the value of $e$, which selects either input $u$ or $\bar{u}$. Thus, the writing voltage, $V_w$ (or $-V_w$), is applied via the source terminal of both transistors, and must be higher than the threshold voltage for memristive switching:
$$|V_{th,mem}| < |V_w| < \min\left( V_{Tn}, |V_{Tp}| \right). \tag{19}$$
Note that the right terminal of the memristor is connected to the virtual ground of an OpAmp (Danial et al. 2018a), whereas the left terminal is connected to a transistor that operates in the ohmic regime and a shock absorption capacitor (Greshnikov et al. 2016). The memristor value $M_{i,j}$ varies between low and high resistance states, $R_{on}$ and $R_{off}$, respectively. The assumption of transistors in ohmic operation bounds the write and read voltages, and constrains the initial memristive state variable and other design parameters, as further described in Danial et al. (2018a). We use voltage-controlled synapses, unlike the synapses in Soudry et al. (2015). The read voltage $V_r$ must be sufficiently lower than the switching threshold of the memristor to prevent accumulative reads from disturbing the conductance of the memristor (i.e., state drift) after several read operations. Hence,
$$|V_r| < |V_{th,mem}|. \tag{20}$$
A great advantage of these low read and write voltages is the resulting low power consumption (Gao et al. 2013) and low subthreshold current leakage; high leakage would threaten the accuracy of the memristor. Voltages $V_w$ and $V_r$ are attenuated values of the digital DAC inputs and ADC outputs that fit design constraints (19) and (20). The assumption of ohmic operation is valid only if the conductance of the memristor is much smaller than the effective conductance of the transistor, as follows,
$$R_{mem}(s(t)) \gg \frac{1}{K \left( V_{DD} - 2\max\left( V_{Tn}, |V_{Tp}| \right) \right)}, \tag{21}$$
where $K$ is a technology-dependent constant that describes the transistor conduction strength, $V_{DD}$ is the maximum power supply, $s$ is the memristor internal state variable, which lies in [0, 1], and $R_{mem}$ refers to the memristor resistance as a function of the state variable $s$. The latter relationship is chosen to be linear (Kvatinsky et al. 2015)
$$R_{mem}(t) = s(t) \cdot (R_{OFF} - R_{ON}) + R_{ON}. \tag{22}$$
As a result, the memristor resistance level that could be achieved during training is lower bounded. Otherwise, the applied voltage over the memristor during the write cycle will not be sufficient to stimulate it. This constraint is achieved by the following condition:
$$|V_w| \, \frac{R_{mem,min}(s_{min}(t))}{\frac{1}{K (V_{DD} - 2V_T)} + R_{mem,min}(s_{min}(t))} \ge |V_{th,mem}|. \tag{23}$$
The voltage division creates non-uniformity in the writing voltage of each cycle and will explicitly affect the learning rate. A shock absorption capacitor (Greshnikov et al. 2016) was added to eliminate fluctuation spikes derived from either subthreshold leakage or high frequency switching. Its value is bounded by the sampling frequency of the converter,
$$\frac{1}{K (V_{DD} - 2V_T)} \, C_{shock,max} \le \frac{1}{f_s}. \tag{24}$$
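The voltage constraints (19) and (20), the linear memristor model (22) and the sampling frequency (17) can be checked against the values listed in Table 1 with a short script. This is only a sanity-check sketch: the function names are illustrative, and using the larger memristor threshold magnitude for writing and the smaller one for reading as $|V_{th,mem}|$ is an assumption made here, not a definition from the text. Constraints (21), (23) and (24) are not evaluated because no numerical value of $K$ is given.

```python
# Values taken from Table 1 of this chapter.
V_TN, V_TP = 0.56, -0.57     # NMOS and PMOS threshold voltages [V]
V_ON, V_OFF = -0.3, 0.4      # memristor switching thresholds [V]
V_W, V_R = 0.5, -0.1125      # writing and reading voltages [V]
R_ON, R_OFF = 2e3, 100e3     # memristor resistance bounds [ohm]
T_R, T_W = 5e-6, 5e-6        # reading and writing times [s]


def write_voltage_ok(v_w):
    """Eq. (19), taking the larger threshold magnitude as |V_th,mem| (assumption)."""
    v_th = max(abs(V_ON), abs(V_OFF))
    return v_th < abs(v_w) < min(V_TN, abs(V_TP))


def read_voltage_ok(v_r):
    """Eq. (20), taking the smaller threshold magnitude as |V_th,mem| (assumption)."""
    return abs(v_r) < min(abs(V_ON), abs(V_OFF))


def r_mem(s):
    """Eq. (22): linear map from the state variable s in [0, 1] to a resistance."""
    return s * (R_OFF - R_ON) + R_ON


def sampling_frequency(t_r, t_w):
    """Eq. (17): one conversion per read cycle plus write cycle."""
    return 1.0 / (t_r + t_w)


print(write_voltage_ok(V_W), read_voltage_ok(V_R))
print(r_mem(0.0), r_mem(1.0), sampling_frequency(T_R, T_W))
```

With the Table 1 values both voltage checks pass, and the 5 µs read and write cycles correspond to the listed 0.1 MSPS sampling frequency.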
For the design, we have used a 0.18 μm CMOS process, and memristors fitted by the VTEAM model (Kvatinsky et al. 2015) to the Pt/HfOx/Hf/TiN RRAM device with a buffer layer (Sandrini et al. 2016). This device has a high-to-low resistance state (HRS/LRS) ratio of ~50 and low forming, set, and reset voltages. The circuit parameters are listed in Table 1.
C. Artificial Neuron
The neural activation is the de facto activity in neuromorphic computing that collectively integrates analog inputs and fires output by means of a non-linear activation function. In the following, we show the circuit design of the neural activation functions of the DAC and the ADC, respectively.
(1) DAC Neuron
The neuron of the DAC is implemented by an operational amplifier with a negative feedback resistor $R_f$ (Gao et al. 2013). It receives currents from $N$ memristors and sums them simultaneously, as follows:
$$A \approx -\sum_{i=0}^{N-1} \frac{R_f}{R_{mem_i}} V_i, \tag{25}$$
where $V_i$ is a read voltage via a memristor with index $i$, which represents the digital input value of the $i$-th bit. In the reading cycle, only the NMOS transistor is conducting since $e = V_{dd}$, with a negative read voltage to eliminate the inverting sign of the operational amplifier. The resolution of the converter, which equals the minimal quantum, is defined by $r = V_{FS}/2^N$. The maximum analog output is achieved when the digital input ‘11…11’ is inserted, and is equal to $A_{max} = (2^N - 1) V_{FS}/2^N$. Therefore, the read voltage equals $V_r = r = V_{FS}/2^N$, and it should obey the constraints in (19). Based on this read voltage, bounds on the number of resolution bits that the converter could hold were formalized. From (19), we extract the minimal number of resolution bits,
$$N_{min} \ge \log_2 \frac{V_{FS}}{\min\left( V_{Tn}, |V_{Tp}| \right)}, \tag{26}$$
where the maximal number of resolution bits is bounded by the binary-weighted levels within the dynamic range of the memristor, $N_{max} \le \log_2 \frac{R_{OFF}}{R_{ON}}$. Because of the serial transistor resistance, however, it is undesirable to use surrounding levels.
262
L. Danial et al.
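Table 1 Circuit parameters

| Group | Type | Parameter | Value |
|---|---|---|---|
| Device parameters | Power supply | $V_{DD}$ | 1.8 V |
| | NMOS | $W/L$ | 10 |
| | | $V_{Tn}$ | 0.56 V |
| | PMOS | $W/L$ | 20 |
| | | $V_{Tp}$ | −0.57 V |
| | Memristors | $V_{on/off}$ | −0.3 V, 0.4 V |
| | | $K_{on/off}$ | −4.8 mm/s, 2.8 mm/s |
| | | $\alpha_{on/off}$ | 3, 1 |
| | | $R_{ON}$ | 2 kΩ |
| | | $R_{OFF}$ | 100 kΩ |
| | | $f(s)$ | $s \cdot (1 - s)$ |
| Design parameters | Shock capacitor | $C_{shock}$ | 100 fF |
| | Writing voltage | $V_W$ | ±0.5 V |
| | Reading voltage | $V_r$ | −0.1125 V |
| | Feedback resistor | $R_f$ | 45 kΩ |
| | Reading time | $T_r$ | 5 μs |
| | Writing time | $T_w$ | 5 μs |
| | Parasitic capacitance | $C_{mem}$ | 1.145 fF |
| | Parasitic inductance | $L_{mem}$ | 3.7 pH |
| | Input resistance | $R_{in}$ | 45 kΩ |
| | Comparator bandwidth | $BW$ | 4 GHz |
| | OpAmp gain | $A$ | 100 |
| ADC parameters | Sampling frequency | $f_s$ | 0.1 MSPS |
| | Number of bits | $N$ | 4 |
| | Full-scale voltage | $V_{FS}$ | $V_{DD}/2$ to $V_{DD}$ |
| Learning parameters of NN/logarithmic ADC | Learning rate | $\eta$ | 0.01 |
| | Error threshold | $E_{threshold}$ | $2 \cdot 10^{-3}$ |
| DAC parameters | Sampling frequency | $f_s$ | 0.1 MSPS |
| | Number of bits | $N$ | 4 |
| | Full-scale voltage | $V_{FS}$ | $V_{DD}/2$ to $V_{DD}$ |
| Learning parameters of DAC | Maximum learning rate | $\max(\eta)$ | 0.01 |
| | Error threshold | $E_{threshold}$ | $2 \cdot 10^{-3}$ |
| Sub-ADC/DAC parameters | Sampling frequency | $f_s$ | 0.1 MSPS |
| | Full-scale voltage | $V_{FS}$ | $V_{DD}$ |
| Learning parameter of pipelined ADC | Learning rate | $\eta_{ADC/DAC}$ | 1, 1 |
| | Error threshold | $E_{threshold\ ADC/DAC}$ | $4.5 \cdot 10^{-2}$, $9 \cdot 10^{-3}$ |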
Neuromorphic Data Converters Using Memristors
Doing so decreases the number of bits by $\log_2 \frac{1}{R_{ON} K (V_{DD} - 2V_T)}$, which is approximated to be zero in our case because $R_{ON} \gg 1/K(V_{DD} - 2V_T)$. Additionally, in the case of a smaller full-scale voltage, some levels should be reserved. For example, if the full-scale voltage is half of the maximum power supply, $V_{FS} = V_{DD}/2$, then the highest binary-weighted level should be reserved. Doing so will decrease the effective number of bits by $\log_2 (V_{DD}/V_{FS,min})$. The maximum number of bits that the proposed DAC could convert is up to
$$N_{max} \le \log_2 \frac{R_{OFF}}{R_{ON}} - \log_2 \frac{1}{R_{ON} K (V_{DD} - 2V_T)} - \log_2 \frac{V_{DD}}{V_{FS,min}}. \tag{27}$$
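As a numeric check of these bounds, the short Python sketch below evaluates the resolution, the maximum output, and the bit-count limits of (26) and (27) with the values of Table 1. It is only an illustration: the function names are hypothetical, the transistor term of (27) is set to zero as the text approximates, and $V_{FS,min} = V_{DD}/2$ is taken as the example case.

```python
import math

# Values taken from Table 1 of the chapter.
V_DD = 1.8        # power supply [V]
V_TN = 0.56       # NMOS threshold [V]
V_TP = -0.57      # PMOS threshold [V]
R_ON = 2e3        # memristor low resistance [ohm]
R_OFF = 100e3     # memristor high resistance [ohm]


def resolution(v_fs, n_bits):
    """Minimal quantum r = V_FS / 2^N (also the read voltage magnitude)."""
    return v_fs / 2 ** n_bits


def max_output(v_fs, n_bits):
    """Analog output for the all-ones input: (2^N - 1) * V_FS / 2^N."""
    return (2 ** n_bits - 1) * v_fs / 2 ** n_bits


def n_min(v_fs):
    """Lower bound (26): the read voltage must stay below both thresholds."""
    return math.log2(v_fs / min(V_TN, abs(V_TP)))


def n_max(v_fs_min, transistor_term=0.0):
    """Upper bound (27); transistor_term is assumed to be ~0 (an approximation)."""
    return (math.log2(R_OFF / R_ON)
            - transistor_term
            - math.log2(V_DD / v_fs_min))


if __name__ == "__main__":
    print("r      =", resolution(V_DD, 4), "V")
    print("A_max  =", max_output(V_DD, 4), "V")
    print("N_min >=", round(n_min(V_DD), 2))
    print("N_max <=", round(n_max(V_DD / 2), 2))
```

With these inputs the resolution equals the 0.1125 V magnitude of the read voltage in Table 1, and the upper bound falls between four and five bits, consistent with the four-bit limit stated in the text.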
In this case, if the minimal full-scale voltage is $V_{FS} = V_{DD}/2$, then the number of bits that could be converted by a converter with the device parameters listed in Table 1 is at most four. In the same context, the feedback resistor is upper-bounded by the minimal full-scale voltage and the highest resistance of the memristor,
$$R_f \le \frac{R_{OFF} V_{FS}}{V_{DD}}, \tag{28}$$
when considering bi-directional variations of the training above and below the fixed resistance level, respectively. These variations are evaluated as ±10% of the nominal value.
(2) ADC Neuron
The neural activation function in the originally proposed Hopfield neural network (Tank and Hopfield 1986) has some constraints in linearity and monotonicity. Fortunately, in asymmetric Hopfield networks, no such strict constraints are required, and simple digital comparators can be used (Po-Rong et al. 1994), while device mismatch, parasitics, and instability issues of the neuron circuit are adaptively compensated for by the synapse. The ADC neuron circuit is realized, as shown in Fig. 8b, by a trans-impedance amplifier implemented as an inverting operational amplifier (OpAmp), cascaded to a comparator with zero voltage reference, zero voltage as $V_{max}$, and $-V_{dd}$ as $V_{min}$ to generate negative signs for the inhibitory synapses of the LSBs. The comparator is latched using a time-interleaved phased clock, and its decision result (0 V or $-V_{dd}$) is sampled at the end of the reading cycle $T_r$, after transient effects are mitigated and neurons synchronized, and their outputs are forward propagated. It is latched for the entire writing cycle $T_w$, and handled by the feedback circuit. Note that the effective weights are normalized via the OpAmp and equal to $W_{ij, j>i} = R_f / S_{ij, j>i}$, where $R_f$ is the negative feedback resistor and $S_{ij}$ is the effective resistance of $M_{ij}$ and the serial transistor.
D. Feedback Training Unit
The online SGD algorithm is executed by the feedback circuit. Our aim is to design (10) and (15) in hardware and execute basic subtraction and multiplication operations. The ADC system is more sophisticated than the DAC system, however its
training is simpler. The subtraction product $T_i^{(k)} - D_i^{(k)}$ is implemented by a digital subtractor, as shown in Fig. 8c. The subtraction result of each neuron (other than the MSB) is backward propagated as an enable signal $e$ simultaneously to all its synapses. The multiplication is invoked as an AND logic gate via the synapse transistors and controlled by $e$, whereas the attenuated desired digital output $T_j^{(k)}$ is connected via the source of the synapse. After the training is complete ($E \le E_{threshold}$), the feedback is disconnected from the conversion path. Regarding the DAC, the subtraction discrete voltage product (namely, the error) is pulse modulated by a pulse-width modulator (PWM) with time width linearly proportional to the error and $\pm V_{DD}$, 0 V pulse levels. As illustrated in Fig. 8d, the PWM product is applied, via the feedback loop, to the synapse as an enable signal. The PWM (Greshnikov et al. 2016), as shown in Fig. 8e, is controlled by a clock that determines the maximum width of the product pulse. If $\mathrm{sign}(A - T) \ge 0$, then the NMOS is conducting and the subtraction amplitude is compared to a positive ramp with full-scale voltage and clock cycle time width. Otherwise, the PMOS is conducting and the subtraction amplitude is compared to a negative ramp with full-scale voltage and clock cycle time width. As implied in (15), the learning rate is time varying. This is achieved by controlling the PWM with clocks with binary-weighted frequency multiples. Therefore, the multiplication is invoked as an AND logic gate and controlled by the modulated enable signal, whereas the attenuated digital input is connected via the source of the synapse. The input is attenuated to obey the constraint in (19), as specified in Table 1. The learning rate is a key factor of the adaptation performance: it depends on the circuit parameters listed in Table 1, and on the write voltage, pulse-time width, feedback resistor, present state, and memristor device physical properties.
The learning rate is
$$\eta(t) = \frac{\Delta R}{R_f} = \frac{(R_{OFF} - R_{ON})\,\Delta s(t)}{R_f}, \tag{29}$$
where $\Delta s$ is the change in the memristor internal state, and is defined as in the VTEAM model,
$$\Delta s = \int_0^{T_w} K_{on/off} \left(\frac{V_W}{V_{on/off}} - 1\right)^{\alpha_{on/off}} f(s)\,dt, \tag{30}$$
where $K_{on/off}$ and $\alpha_{on/off}$ are constants that describe the state evolution rate and its nonlinearity, respectively, $V_{on/off}$ are voltage thresholds, and $f(s)$ is a window function that adds nonlinearity and state dependency during state evolution.
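To make the dependence of the learning rate on the write pulse concrete, the following Python sketch integrates the VTEAM state equation over one write pulse with a forward-Euler step and then applies the learning-rate relation. It is a minimal illustration, not the simulation setup of the chapter: the function names are hypothetical, and the rate constants `K_ON_NORM` and `K_OFF_NORM` are assumed to be already normalized to the device length (units of 1/s), which is why their values are placeholders rather than the mm/s figures of Table 1.

```python
def window(s):
    """Window function f(s) = s * (1 - s), as listed in Table 1."""
    return s * (1.0 - s)


def vteam_delta_s(v_w, t_w, s0, k_on, k_off, v_on, v_off,
                  a_on, a_off, steps=1000):
    """Forward-Euler estimate of the state change over a pulse of width t_w."""
    dt = t_w / steps
    s = s0
    for _ in range(steps):
        if v_w > v_off:          # above the positive threshold: state increases
            rate = k_off * (v_w / v_off - 1.0) ** a_off
        elif v_w < v_on:         # below the negative threshold: state decreases
            rate = k_on * (v_w / v_on - 1.0) ** a_on
        else:                    # sub-threshold voltages leave the state unchanged
            rate = 0.0
        s += rate * window(s) * dt
        s = min(max(s, 0.0), 1.0)  # keep the normalized state in [0, 1]
    return s - s0


def learning_rate(delta_s, r_on, r_off, r_f):
    """Learning rate (R_OFF - R_ON) * delta_s / R_f."""
    return (r_off - r_on) * delta_s / r_f


# Hypothetical normalized rate constants (1/s); thresholds and exponents
# follow Table 1.
K_ON_NORM, K_OFF_NORM = -4.8e3, 2.8e3
V_ON, V_OFF = -0.3, 0.4
A_ON, A_OFF = 3, 1

if __name__ == "__main__":
    ds = vteam_delta_s(0.5, 5e-6, 0.5, K_ON_NORM, K_OFF_NORM,
                       V_ON, V_OFF, A_ON, A_OFF)
    print("delta_s =", ds)
    print("eta     =", learning_rate(ds, 2e3, 100e3, 45e3))
```

The sketch also shows why the read voltage must stay below the thresholds: any pulse with $V_{on} \le V \le V_{off}$ produces no state change.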
3.4 Evaluation of Neuromorphic Data Converters
In this section, our proposed four-bit data converter designs are evaluated in a SPICE simulation framework (Cadence Virtuoso) using a 0.18 μm CMOS process and the VTEAM memristor model (Kvatinsky et al. 2015). First, the learning algorithm is evaluated in terms of least mean square error and training time. Next, the circuit is statically and dynamically evaluated, and finally power consumption is analyzed. The proposed data conversion functionality and robustness were extensively tested under extreme conditions using MATLAB. The design parameters and constraints are listed in Table 1. Furthermore, circuit variations and noise sources are quantified and validated, as listed in Table 2.
A. Reconfiguration
Figures 9a and 10a show the resistive value of the synapses when two sawtooth training datasets with different full-scale voltage ranges ($V_{DD}$ and $V_{DD}/2$) and different sampling frequencies ($f_s$ and $100 f_s$) are applied successively in real time for training the ADC and DAC, respectively. After approximately 4000 and 2000 training
Fig. 9 ADC training evaluation. a Synaptic weight reconfiguration during the training phase for $V_{FS}$ = 1.8 V and $f_s$ = 100 KSPS. Synapses are then trained for $V_{FS}$ = 0.9 V and $f_s$ = 10 MSPS and shown in real time. The synaptic weight is equal to the ratio between $R_f$ and the corresponding memristor, thus it has no units. b The LMS error function optimization during training until it achieves $E_{threshold}$. c Actual digital outputs $D_i$ (logical value) at three different time stamps during the training; periodic digital outputs are achieved after the training is finished, corresponding to the analog input ramp. d Comparison between the discrete analog values of the teaching dataset and the actual output by connecting it to an ideal DAC, at three different time stamps during the training; an identical staircase is obtained after the training is complete
Fig. 10 DAC training evaluation. a Binary-weighted synaptic adaptation during the training phase for the 1.8 V full-scale output voltage range. Correspondingly, synapses are trained for the 0.9 V full-scale output voltage range and shown in real time. b Comparison between the teaching dataset and the actual neural discrete analog DAC output at three different time stamps during the training; an identical staircase is achieved after the training is complete. c integral and d differential nonlinearities of the ADC at three different time stamps in response to the DC input voltage ramp
samples, which are equal to 40 ms and 20 ms training time for the ADC and DAC, respectively, the errors according to (8) and (13) are below $E_{threshold}$. Furthermore, when the full-scale voltage changes to $V_{DD}/2$ and the sampling frequency changes to $100 f_s$, the ADC and DAC systems converge to a new steady state that quantizes a 0.9 V full scale at a 10 MSPS sampling rate. The LMS error (8) is shown in Fig. 9b. In the same context, the neural activity adaptation of the digital output bits, which denotes optimization along the gradient descent during training, is shown at three different time stamps in Fig. 9c: the initial state before training (samples 0–15), coarse-grained training (i.e., where the error is slightly higher than $E_{threshold}$, samples 220–235), and fine-grained training (i.e., where the error is sufficiently low and the ADC response converges to the desired state, samples 3720–3735). The digital outputs are converted to discrete analog values via an ideal 4-bit DAC that is connected back-to-back and accurately reproduces the ADC's present state, as shown in Fig. 9d at the same three time stamps. Analogously, the comparison between the analog teaching dataset and the actual neural discrete DAC analog output at three different time stamps is shown in Fig. 10b, and an identical staircase is achieved after DAC training is completed.
B. Self-Calibration
The process variation parameters for the memristor are pessimistically chosen (Danial et al. 2018a), randomly generated with a normal distribution, and incorporated into the VTEAM model (Kvatinsky et al. 2015) with a variance of approximately 10% to cover wide reliability margins (Hosticka 1985). Table 2 lists the magnitude of variability for these effects. Transistor parameters such as $V_W$, $W/L$, and $V_T$ in Table 1 are chosen to guarantee a globally optimal solution even under such extreme conditions. In Fig. 9, we show that the proposed training algorithm can not only tolerate such variations over time but also compensate for them by using different synaptic weights. We statically evaluated how the proposed ADC responds to the DC ramp signal at the three given time stamps, as illustrated in Fig. 10c–d. The teaching staircase in Fig. 9d is a subset of the DC ramp input that statically evaluates the ADC at the aforesaid time stamps. The differences between two adjacent digital output decimal codes within the actual ADC output are therefore the differential non-linearities (DNL). Likewise, the differences between the actual ADC output and the ideal staircase for each digital input code are the integral non-linearities (INL) (Plassche 2013). The DNL of the last code is undefined. Results of the maximum DNL and INL are shown, respectively, in Fig. 10c, d. Prior to training, the ADC is completely non-linear and non-monotonic, with several missing codes. Thus, INL ≈ 8 LSB, and DNL ≈ 5 LSB. Improved performance can be seen at the second time stamp (2 ms ~ 200 samples), where the ADC appears monotonic; however, it is still not accurate (INL ≈ −2 LSB, DNL ≈ 2 LSB). After the training is complete (40 ms), the ADC is almost fully calibrated, monotonic, and accurate: INL ≈ 0.4 LSB, and DNL ≈ 0.5 LSB. Since the DAC output is purely analog and is much more sensitive to noise, we discuss its non-linearity calibration in the next sub-section.
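The static metrics described above can be illustrated with a short Python sketch. It assumes, as a simplification, that the ADC is evaluated on a staircase with one sample per ideal code, so that the ideal output is simply 0, 1, 2, … in LSB, and it subtracts the ideal 1 LSB step from each adjacent-code difference, which is the usual convention; the function name `dnl_inl` is hypothetical.

```python
def dnl_inl(codes):
    """Return (DNL, INL) in LSB for measured output codes.

    codes[k] is the decimal output code observed for the k-th ideal code.
    DNL has one entry fewer than INL because the DNL of the last code
    is undefined.
    """
    ideal = list(range(len(codes)))
    # INL: deviation of the actual output from the ideal staircase.
    inl = [c - i for c, i in zip(codes, ideal)]
    # DNL: adjacent-code step minus the ideal step of 1 LSB.
    dnl = [(codes[k + 1] - codes[k]) - 1 for k in range(len(codes) - 1)]
    return dnl, inl


if __name__ == "__main__":
    # A toy 2-bit example with one missing code (code 2 never appears).
    dnl, inl = dnl_inl([0, 1, 1, 3])
    print("DNL:", dnl)
    print("INL:", inl)
    print("max |DNL| =", max(abs(d) for d in dnl),
          "max |INL| =", max(abs(i) for i in inl))
```

A fully calibrated converter yields all-zero lists in this idealized setting, while missing codes and non-monotonic steps show up as large DNL entries.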
The proposed converters were simulated with different sampling frequencies $f_s$ to show their versatility and flexibility to adapt to different conditions that represent different specifications for different applications. At high frequency the memristor is modeled as a resistor in parallel to a capacitor and is connected in series with an inductance on each side (Wainstein and Kvatinsky 2017). The parasitic capacitance between electrodes of the memristor is dominant at high frequencies. As a result, the equivalent impedance of the memristor decays along the frequency. The values of the parasitic capacitance and inductance are listed in Table 1. For simplicity, we will show the spectral analysis on the DAC. The maximum frequency at which the DAC can operate, $f_{max}$, is defined as the frequency at which the high-to-low-impedance ratio will not allow binary-weighted distribution of $N$ bits that covers the half- to full-scale voltage range:
$$\left| \frac{Z_{OFF}}{Z_{ON}} \right| \le 2^{N+1}, \tag{31}$$
where $Z_{OFF}$ and $Z_{ON}$ are high and low impedance states, respectively. At the frequency-band of interest, $Z_{ON} \approx R_{ON}$, $Z_{OFF} \approx R_{OFF} \parallel \frac{1}{2\pi j f_s C_{mem}} =$
Table 2 Circuit variations and noise

| Group | Type | Nominal value | Variance |
|---|---|---|---|
| Device mismatch | Resistor | $W$ = 2 μm, $R_\square$ = 50 Ω/□ | ±0.5% μm |
| | Capacitor | $W$ = 0.15 μm, $C_A$ = 0.68 fF/μm² | ±1% μm |
| | NMOS/PMOS | $W/L$ | ±10% |
| | | $V_T$ | $A_{V_T}/\sqrt{WL}$ = 7.14 mV |
| | Sampler | $\tau$ | 400 ps |
| | OpAmp finite gain | $-\frac{R_f/R_{on}}{1 + (1 + R_f/R_{on})/A}$ | 81 |
| | Comparator | $V_{offset}$ | 5 mV |
| | Memristor | $V_{on/off}$ | ±10% V |
| | | $K_{on/off}$ | ±10% mm/s |
| | | $R_{ON}$, $R_{OFF}$ | ±10% |
| Noise sources | Pulse width modulation noise | White noise | 50 ps |
| | Thermal noise | $2kTg^{-1}$, $kT/C$ | $10^{-16}$ V²s |
| | IR drop | $V_w$ | ±10% V |
| | Quantization noise | $V_{FS}/2^{N+1}$ = 56.25 mV | $V_{FS}/(2^{N+1}\sqrt{3})$ = 32.5 mV |
| Frequency-dependent noise and variations/aging | Input switching noise | $L\,dI/dt$ | ±10% V/√Hz |
| | ADC cutoff frequency | $f_{T,max}$ | 1.668 GHz |
| | Propagation time | $R_{OFF} \cdot C_{in}$ | 30 ps |
| | Input referred noise | $0.5 \log_2 \frac{V_{FS}^2}{6kTRf_s} - 1$ | 1.27 mV |
| | OpAmp input noise | $1/f$ flicker noise | 10 nV/√Hz |
| | Slew rate | $2\pi f V_{FS}$ | 1.13 V/ns |
| | Comparator ambiguity | $\frac{\pi BW}{6.93 f_s} - 1.1$ | 0.625 mV |
| | Jitter | $\log_2\left(\frac{2}{\sqrt{3}\pi f_s \tau_{jitter}}\right) - 1$ | 50 ps |
| | Memristor stochastics | Poisson process ($\tau$) | $2.85 \cdot 10^{-5} e^{-V_W/0.156}$ = 1.1 μs |
| | Memristor OFF impedance | $R_{OFF}$ | $\frac{R_{OFF}}{\sqrt{1 + (R_{OFF} C_{mem} \cdot 2\pi f)^2}}$ |
| | Endurance degradation | $\Delta R$ | 10%/decade |
Fig. 11 a A high impedance state ZOFF as a function of sampling frequency; dashed lines indicate the maximum possible frequency bandwidth for a half- to full-scale voltage range with a high-to-low-impedance ratio of 32 and 16, respectively. b DAC reconfiguration for a 10MSPS sampling frequency, by continuous synaptic update. The frequency-dependent variations were captured by the synaptic weights
$\frac{R_{OFF}}{1 + 2\pi j f_s C_{mem} R_{OFF}}$, and the series inductance is negligible. By solving (31), we find
$$f_{max} = \frac{1}{2\pi R_{OFF} C_{mem}} \cdot \sqrt{\left(\frac{R_{OFF}}{R_{ON} \cdot 2^{N+1}}\right)^2 - 1}. \tag{32}$$
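The closed form in (32) can be checked numerically. The Python sketch below evaluates it with the parameters of Table 1 and verifies that the magnitude of the parallel RC impedance at that frequency indeed equals $2^{N+1} R_{ON}$. The function names are hypothetical, and the series inductance is neglected, as in the text.

```python
import math


def z_off_mag(f, r_off, c_mem):
    """Magnitude of R_OFF in parallel with the parasitic capacitance C_mem."""
    return r_off / math.sqrt(1.0 + (2.0 * math.pi * f * c_mem * r_off) ** 2)


def f_max(r_on, r_off, c_mem, n_bits):
    """Maximum operating frequency from (32); 0 if the DC ratio is too small."""
    ratio = r_off / (r_on * 2 ** (n_bits + 1))
    if ratio <= 1.0:
        return 0.0
    return math.sqrt(ratio ** 2 - 1.0) / (2.0 * math.pi * r_off * c_mem)


if __name__ == "__main__":
    R_ON, R_OFF, C_MEM, N = 2e3, 100e3, 1.145e-15, 4   # Table 1 values
    fm = f_max(R_ON, R_OFF, C_MEM, N)
    print("f_max = %.3f GHz" % (fm / 1e9))
    print("|Z_OFF| / R_ON at f_max =", z_off_mag(fm, R_OFF, C_MEM) / R_ON)
```

For the four-bit case this evaluates to roughly 1.67 GHz, in line with the value quoted in the text, and the impedance ratio at that frequency is 32.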
The decay of $Z_{OFF}$ as a function of frequency is shown in Fig. 11a, along with the maximum frequency bandwidth for different-scale voltages. In our case, for a four-bit DAC and full- to half-scale voltage range, $f_{max}$ = 1.668 GHz, which is below the transit frequency $f_T$ of 0.18 μm CMOS transistors, the cutoff frequency of memristors (Pi et al. 2015), and the OpAmp slew rate. The training dynamics are different in this case because the learning rate is a function of the pulse-width duration, which is a function of the sampling frequency. The higher the sampling frequency, the smaller the learning rate and the higher the number of training samples. Additionally, taking the frequency-dependent variations into consideration, the synaptic weights are different and are able to absorb and compensate for these variations, as shown in Fig. 11b in response to the 10 MSPS sampling frequency. The frequency is 100× higher than 100 KSPS; as a result, the time interval for a single sample is 100× smaller, as is the learning rate. However, the total number of training samples until the error equals $E_{threshold}$ is ~1.5× greater, with ~66× lower training time (~0.45 ms). The ratios are not linear because the convergence time differs among the bits and is not linear. Analogously, parasitic
effects such as capacitance and inductance, as listed in Table 2, which are the dominant factors in ADC accuracy at high frequencies, have been adaptively captured, as simulated at 10 MSPS with a longer training time and shown in Fig. 9a. Data converters are being continuously pushed towards their performance limits as technology scales down and system specifications become more challenging. Several analysis methods have been established to estimate variation sources and their impact on the performance (Nuzzo et al. 2008; Sepke et al. 2009). All these mechanisms are specific and technology dependent, requiring exhaustive characterization, massive validation, and relatively long development time-to-market. Adaptive intelligent systems motivated by ML algorithms are, however, inherently robust to variations, which is a key element in the set of problems they are designed to solve. This suggests that the effects of intrinsic variations on the performance of the analog circuit are relatively small. Therefore, online training algorithms are not exclusive to reconfiguration, but can also be used for self-calibration, adaptation, and noise tolerance with generic standard methodology (Kvatinsky et al. 2015). For this reason, a crude estimation of the magnitude of variability has been extracted from Kvatinsky et al. (2015), Nuzzo et al. (2008), Sepke et al. (2009), Niu et al. (2010), Hu et al. (2011), Torrezan et al. (2011), Chen et al. (2011) and characterized as listed in Table 2:
1. The process variation parameters for the memristor are pessimistically chosen, with a coefficient of variation (CV = standard deviation/mean ~30%) to cover wide reliability margins (Soudry et al. 2015). The variability in the parameters of the memristors is equivalent either to corresponding changes in the synaptic weights or to the learning rate $\eta$.
2. Frequency-dependent variations capture the parasitic capacitance and inductance of the memristor (Torrezan et al. 2011) and model it by a varying impedance as a function of the frequency. In addition, $\Delta R$ degradation (Chen et al. 2011) along switching cycles as a result of oxide defects and device aging is considered.
Endurance is an essential performance criterion of memristive devices for memory applications. Therefore, a qualitative and pessimistically approximate analysis is done to evaluate the DAC's lifetime versus the increasing training time as a result of the memristor's endurance degradation. Endurance failure behavior is observed in Hf-based RRAM (Chen et al. 2011) and can be explained by different physical mechanisms that degrade its switching characteristics and high-to-low resistance ratio. Among these mechanisms is the oxidation-induced interface reaction, a result of high voltage/current during SET. The endurance of the fitted Pt/HfO$_x$/Hf/TiN is ~8 K cycles with 1.15 V for SET and −2.25 V for RESET, as observed in Hu et al. (2011). Decreasing operational voltages considerably improves the endurance while increasing the switching time of the device. According to the fitted parameters in Table 1, the simulated switching time with $\pm V_W$ is 75 μs instead of the reported 400 ns with 1.15 V for SET, and 1 ms instead of the reported 10 μs with −2.25 V for RESET (Sandrini et al. 2016).
The trade-off between write latency and endurance has been well-studied (Strukov 2016), and the relationship between them is formalized (Zhang et al. 2016) as
$$\text{Endurance} \approx \left(\frac{t_{WP}}{t_0}\right)^{Expo\_factor}, \tag{33}$$
where $t_{WP}$ is write latency, $t_0$ is a device related constant, and $Expo\_factor$ is an empirical constant with a typical value of 2. Accordingly, since the RESET time grows 100-fold (from 10 μs to 1 ms), the endurance of the device will increase from ~8 K to $8 \cdot 10^7$ cycles with the proposed writing voltage. Due to the nature of the proposed converters, they will continue training until their training errors equal $E_{threshold}$ and achieve a high ENOB. Thus, the high-to-low resistance ratio degradation is not discernible, as it is compensated for by longer training times. A rough approximation, using logarithmic endurance degradation in time, is modeled by a 10% drop of $\Delta R$ per decade, as listed in Table 2. The training time as a function of the number of switching cycles is shown in Fig. 12a. To prove that the endurance is not a limitation for the proposed DAC, we estimate the number of training epochs until wear-out. As a pessimistic evaluation, we assume that every 1 ms
Fig. 12 a Endurance degradation along device lifetime, in terms of full switching cycles, logarithmically affect R in each training sample and are compensated for by the increasing training time for the whole epoch. b Statistical simulations of randomly generated variations and noise sources show the probability distribution of typical and extreme cases in terms of the effective number of resistive level. c The impact of variations in the number of effective levels on the number of training samples in each case. d ENOB as a function of the number of stable resistive levels, where the minimum is five uniformly distributed binary-weighted levels
of training time equals a full RESET. This assumption is more aggressive for degradation than a total of 200 intermediate switches in 1 ms (Chen et al. 2011). Therefore, the maximum training time is 160 ms and the corresponding minimal number of training epochs until wear-out is $\approx 8 \cdot 10^7 / 160 = 500$ K. This finding implies that, in the worst case, the DAC could be reconfigured ~150 times per day for ~10 years either for new configuration or for calibration-only, depending on the running application (Gines et al. 2009). A similar endurance analysis could be made for the ADC.
C. Noise Tolerance
In contemporary data converters, calibration mechanisms (Li et al. 2003) can be used to compensate for device mismatch and process imperfections, but noise can irreparably degrade performance and is also less straightforward to capture at design time. Noise sources include intrinsic thermal noise from the feedback resistor, memristor, and transistor (Hosticka 1985), in addition to pulse-width modulation noise, quantization noise (Gray 1990), jitter, comparator ambiguity (Walden 1999), input referred noise, non-linear distortions, training label sampling noise and fluctuations, memristor switching stochastics, and frequency-dependent noise (Nemirovsky et al. 2011), as listed in Table 2. Therefore, the effective number of stable resistive levels, as a function of noise margin (NM) due to statistically correlated noise and variation sources, was massively analyzed using Monte-Carlo simulations (Danial et al. 2018a). Furthermore, its impact on the ENOB was determined, with typical results (in 38% of the cases) of 64 resistive levels, ~3% NM, and ~3.7 ENOB. For robust validation of the data conversion functionality in the presence of correlated variations and noise sources, we statistically analyzed the DAC performance for large numbers of randomly generated scenarios, as it has sensitive analog output and simple topology. We show the distribution of the achieved effective number of resistive levels in Fig. 12b. The number of resistive levels, however, is finite and is a function of variations, data retention, noise margin, and amplifier sensitivity (Sepke et al. 2009). Figure 12b shows that extreme cases where the write variation is ±10% and the comparator offset of the PWM is ±5 mV are less likely. Therefore, the effective number of resistive levels in the typical case (approximately 38% of the cases) is ~64. The number of resistive levels has a key role in achieving such adaptive, self-calibrated, noise-tolerant, and highly accurate DACs. 
Due to its self-calibration capability, the DAC can tolerate variations and compensate for them by imposing a penalty of more training samples, as shown in Fig. 12c. Alternately, fewer training samples or stable resistive levels are sufficient for lower accuracy, as shown in Fig. 12d, in terms of ENOB, lower-bounded by five uniformly distributed binary-weighted levels covering a half- to full-scale voltage range. While process variations determine the convergence time and accuracy, noise can cause the network to deviate from the optimum weights with destructive oscillations. In Fig. 13a, the training processes for both gradient descent and the binary-weighted time-varying gradient descent with decaying learning rate are shown. Observe that the regular gradient descent, which succeeded in stabilizing the synapses without the presence of noise, now fails to stabilize the synapses. Conversely, the binary-weighted time-varying gradient descent with decaying learning rate successfully
Fig. 13 Comparison between regular gradient descent (GD) and the proposed binary-weighted time-varying gradient descent (BW TV GD) algorithms in the presence of noise and process variations. a The GD failed to converge the synapses, whereas the BW TV GD succeeded and outperformed the GD with b smaller MSE, better c DNL, and d INL
overcame noise and variations with stable synapses. The comparison is made, accordingly, in terms of MSE, DNL, and INL, as shown in Fig. 13b–d, respectively. The switching non-linearity and threshold of the memristor device mitigate synaptic fluctuations derived from noise and variation sources. Nevertheless, the gradient descent algorithm fails to converge to a global optimum and keeps excessively capturing stochastic dynamics, whereas the time-varying learning rate of the proposed algorithm enhances the network immunity against overfitting (Dietterich and Tom 1995) and achieves reliable predictive performance on unseen data. The ADC non-linear functionality $V_{out} = f(v_i)$ in response to voltage input $v_i = A\cos(\omega t)$, where $A$ is the amplitude and $\omega$ is the frequency, could be qualitatively described as
$$V_{out} = a_0 + a_1 A \cos(\omega t) + \frac{a_2 A^2}{2}\left[1 + \cos(2\omega t)\right] + \ldots, \tag{34}$$
where $a_0$ is the DC constant, $a_1$ is the small-signal gain constant, and $a_2$ is the distortion constant. We show that the proposed algorithm is able to adaptively alleviate non-linear distortions and tolerate noise by estimating the $f(\cdot)$ function. The ADC is dynamically evaluated and analyzed, at the three given time stamps, in response to a sinusoidal input signal with 44 kHz frequency, which meets the Nyquist condition, $f_{input} \le f_s/2$, and applies for coherent fast Fourier transform (FFT) using a Hamming window and a prime number of cycles distributed over 5000 samples, which is sufficient for reliable FFT without collisions and data loss (Solomon 1993). Figure 14a shows the FFT for signal and distortion power as a function of frequency,
Fig. 14 Static conversion evaluation that shows the efficiency of the training algorithm in mismatch calibration a Dynamic conversion evaluation that shows the efficiency of the training algorithm in noise tolerance and distortion mitigation by coherent fast Fourier transform of the ADC output in response to a sinusoidal input signal with 44 kHz frequency at three different time stamps during the training with ENOB calculation. b Power evaluation of the network that shows power optimization during training
where each time stamp is shown in a different color. Figure 14a illustrates that the harmonic distortions are mitigated, the fundamental power increases, and the SNDR and ENOB improve as the training progresses. Synaptic fluctuations arising due to noise and variation sources are mitigated by the switching non-linearity and threshold of the memristor (Danial et al. 2018a). Along with the quantization noise or dither, this helps the network converge to a global minimum, and improve the ENOB, breaking through the thermal noise limit in some cases. This well-known phenomenon is called stochastic resonance and was reported in the past in ANNs (Benzi et al. 1981) and memristors (Stotland and Ventra 2012).
D. Power Optimization
Section 3.1 shows the equivalence between the Hopfield-like energy function of the network given by (4) and the cost function that solves the conversion optimization given by (6). The cost function achieves its minimum, lower bounded by quantization error power, when the synaptic weights are configured so as to guarantee that each analog input is correctly mapped to its corresponding digital output. Consequently, the power consumption is optimized during training until it achieves its minimal value when the training is finished. The power dissipation of the entire network is therefore analyzed, and is attributed to three sources:
1. Neural integration power: the power dissipated on the feedback resistor of the OpAmp is
$$P_{int_i} = \left(V_{in} - 2^i V_{ref} - \sum_{j=i+1}^{N-1} \frac{R_f}{R_{ij}} V_j\right)^2 / R_f. \tag{35}$$
This function solves the ADC quantization after training for each neuron, as described in (3). The total neural integration power dissipated on all neurons is $P_{int} = \sum_{i=0}^{N-1} P_{int_i}$.
2. Neural activation power: the power dissipated on the comparators and OpAmps at the sampling frequency. This power source is constant and negligible: $P_{act_i}$ = 3 μW in 0.18 μm CMOS process in $f_T$. The total activation power dissipated on all neurons is $P_{act} = \sum_{i=0}^{N-1} P_{act_i}$.
3. Synapse power: the power dissipated on synapses, including reconfigurable and fixed synapses for each neuron, is
$$P_{synapse_i} = \frac{V_{in}^2}{R_f} + \frac{2^i V_{ref}^2}{R_f} + \sum_{j=i+1}^{N-1} \frac{V_j^2}{R_{ij}}. \tag{36}$$
The total synaptic power dissipation is $P_{synapse} = \sum_{i=0}^{N-1} P_{synapse_i}$.
Thus, the total power consumption is the sum of the three power sources averaged on a full-scale ramp with $2^N$ samples (epoch), as shown in Fig. 14b during training time. Each point in the horizontal axis represents a full-scale ramp, and its corresponding value in the vertical axis represents the average of the total dissipated power. After the training is finished and the network configured as an ADC, the average of the synapse power on a full-scale ramp is half of the maximum power dissipated, and the neural integration power is minimal. This balance results in optimal power dissipation. Note that the dynamic power consumption as a result of updating memristors during the training phase is not determined and is not considered as conversion power dissipation by FOM definition in (1). We neglect the power dissipation of the feedback because, after the training is finished, the feedback is disconnected and the network maintains the minimal achieved power dissipation level during conversion. We assume that this power source is relatively low because of the small area of training feedback, short training time, and the small ratio between training to conversion cycles during the lifetime of the converter, even at a high rate of application configurations.
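The three power contributions can be combined as in the following Python sketch, which evaluates the neural integration power of (35), the synapse power of (36), and a constant per-neuron activation power for a single input sample. It is a minimal illustration under stated assumptions: the function names and the toy two-neuron values are hypothetical, `r[i][j]` holds the synapse resistance $R_{ij}$ (only entries with $j > i$ are used), and averaging over a full-scale ramp is left to the caller.

```python
def p_int_i(i, v_in, v_ref, v, r, r_f):
    """Neural integration power of neuron i, following (35)."""
    feedback = sum(r_f / r[i][j] * v[j] for j in range(i + 1, len(v)))
    return (v_in - 2 ** i * v_ref - feedback) ** 2 / r_f


def p_synapse_i(i, v_in, v_ref, v, r, r_f):
    """Synapse power of neuron i, following (36)."""
    reconfigurable = sum(v[j] ** 2 / r[i][j] for j in range(i + 1, len(v)))
    return v_in ** 2 / r_f + 2 ** i * v_ref ** 2 / r_f + reconfigurable


def total_power(v_in, v_ref, v, r, r_f, p_act_i=3e-6):
    """Sum of integration, activation, and synapse power over all neurons."""
    n = len(v)
    p_int = sum(p_int_i(i, v_in, v_ref, v, r, r_f) for i in range(n))
    p_syn = sum(p_synapse_i(i, v_in, v_ref, v, r, r_f) for i in range(n))
    return p_int + n * p_act_i + p_syn


if __name__ == "__main__":
    # Toy two-neuron example with normalized (hypothetical) values.
    r = [[0.0, 2.0], [0.0, 0.0]]   # only r[0][1] is used
    v = [0.0, 1.0]                 # neuron output voltages V_j
    print(total_power(v_in=1.0, v_ref=0.25, v=v, r=r, r_f=1.0, p_act_i=0.0))
```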
4 Large-Scale Neuromorphic Mixed-Signal Data Converters
4.1 Scaling Challenges
The four-bit resolution is insufficient for practical applications, while direct scaling of the proposed architecture is challenging. When increasing the scale of the network,
the number of neurons, synapses, and feedbacks are quadratically higher. Consequently, this will increase the area and power consumption substantially, as calculated in Table 3 based on Danial et al. (2018a), Sandrini et al. (2016). Due to the successive nature of the proposed ADC architecture, higher numbers of neurons require longer conversion time as a result of the propagation time, settling time, and decision-making time of each. Therefore, to eliminate signal aliasing, the maximal Nyquist sampling frequency will unfortunately be limited, as determined in Table 3. Additional challenges in scaling are the required high-to-low resistance states ratio of the synaptic weights, the number of resistive levels, cutoff frequency, and endurance. We calculated the maximal number of bits in our previous work (Danial et al. 2018a), which is four bits for the memristive device under test, but devices with higher HRS/LRS are achievable. Moreover, we show in this paper that device-dependent properties are compensated for by longer training time to achieve maximal ENOB, which is equal to ($N - 0.3$) bits regardless of the conversion speed. Overall, the FOM still improves as the number of bits increases, because of the optimal achieved ENOB, as calculated in Table 3. Furthermore, in advanced CMOS technology nodes the FOM will improve due to lower power consumption and higher sampling rates. These findings prove that the proposed architecture is conceptually and practically scalable, even in the presence of the mentioned scaling challenges. These shortcomings still need to be investigated by leveraging mixed-signal architectures and deep neural network concepts. This section presents a high-resolution pipeline of the proposed four-bit data converters, and a low-resolution logarithmic quantization training of neural network data converters.
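The closed-form scaling expressions of Table 3 can be tabulated with a short Python sketch, which makes the quadratic growth in synapses and the slower growth in training effort easy to compare. The function names are hypothetical, and the expressions are transcribed from the general-$N$ row of Table 3; the 4- and 8-bit rows of the table serve as reference points.

```python
import math


def synapses(n):
    """Number of synapses, N(N + 1) / 2."""
    return n * (n + 1) // 2


def area_um2(n):
    """Approximate total area in square micrometers, N(1.1N + 1208)."""
    return n * (1.1 * n + 1208)


def training_ksamples(n):
    """Training samples in thousands, (2 - 2^(1 - N/4)) * 4."""
    return (2 - 2 ** (1 - n / 4)) * 4


def wearout_trainings_per_day(n):
    """Trainings per day sustainable for 10 years, 150 / (2 - 2^(1 - N/4))."""
    return 150 / (2 - 2 ** (1 - n / 4))


def resistive_levels(n):
    """Required number of resistive levels, N * 2^N."""
    return n * 2 ** n


def hrs_lrs_ratio(n, v_dd, v_fs):
    """Required HRS/LRS ratio, 2^(N - 1 + log2(V_DD / V_FS))."""
    return 2 ** (n - 1 + math.log2(v_dd / v_fs))


if __name__ == "__main__":
    for n in (4, 8, 12):
        print(n, synapses(n), round(area_um2(n)), training_ksamples(n),
              round(wearout_trainings_per_day(n), 1), resistive_levels(n),
              hrs_lrs_ratio(n, 1.8, 0.9))
```

The 12-bit line is only an extrapolation of the same formulas and is not a result reported in the chapter.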
4.2 Pipelined Neuromorphic ADC

A. Theory
A scalable and modular neural network ADC architecture based on a pipeline of four-bit converters is proposed. It preserves their inherent advantages in application reconfiguration, mismatch self-calibration, noise tolerance, and power optimization, while approaching higher resolution and throughput at the cost of latency. We propose a hybrid CMOS-memristor design with multiple trainable cores of the previously proposed four-bit NN ADC (Danial et al. 2018b) and NN DAC (Danial et al. 2018a) in a two-stage pipeline. This architecture takes advantage of light-weight, low-power sub-ADC cores combined with the high throughput and resolution achievable through the pipeline. Also, each sub-ADC optimizes the effective number of bits (ENOB) and power dissipation during training for the chosen sampling frequency (Danial et al. 2018b).

B. Proposed Pipelined NN ADC Architecture
An eight-bit two-stage pipelined network is shown in Fig. 15. In the first-stage sub-ADC, a synapse W_ij is present between a pre-synaptic neuron with index j and digital output D_j, and a post-synaptic neuron with index i and digital output D_i. A neuron for each bit collectively integrates inputs from all synapses and produces an output by the signum neural activation function u(·). The sub-ADC coarsely quantizes the sampled input V_in to the four MSBs of the digital code, D_7 D_6 D_5 D_4 (MSB to LSB), as

$$
\begin{cases}
D_7 = u\left(V_{in} - 8V_{ref}\right),\\
D_6 = u\left(V_{in} - 4V_{ref} - W_{6,7}D_7\right),\\
D_5 = u\left(V_{in} - 2V_{ref} - W_{5,6}D_6 - W_{5,7}D_7\right),\\
D_4 = u\left(V_{in} - V_{ref} - W_{4,5}D_5 - W_{4,6}D_6 - W_{4,7}D_7\right).
\end{cases}
\tag{37}
$$

Table 3 Scalability evaluation

| Group | Metric | 4 bits | 8 bits | N bits |
|---|---|---|---|---|
| Area | #Neurons, #feedbacks | 4 | 8 | N |
| Area | #Synapses | 10 | 36 | N(N+1)/2 |
| Area | Total (μm²) | 4850 | 9740 | ≈ N(1.1N + 1208) |
| Time | Conversion rate (GSPS) | 1.66 | 0.74 | 1/(N·t_p + (N−1)/BW) |
| Time | Training (KSamples) | 4 | 6 | (2 − 2^(1−N/4))·4 |
| Time | Wearout (trainings/day for 10 yrs) | 150 | 100 | 150/(2 − 2^(1−N/4)) |
| Power | Power (μW) | 100 | 650 | P_int + P_act + P_synapse |
| FOM | FOM (fJ/conv) | 8.25 | 7.5 | P/(2^(N−0.3)·f_s) |
| Memristor | HRS/LRS | 2^4 | 2^8 | 2^(N−1+log2(V_DD/V_FS)) |
| Memristor | #levels | 64 | 2048 | N·2^N |

Fig. 15 a Proposed architecture of a two-stage pipelined ADC trained online using SGD. The first stage consists of a four-bit single-layer neural network sub-ADC and DAC. The second stage consists of another four-bit neural network ADC. Both stages operate simultaneously to increase the conversion throughput, and their intermediate results are temporarily stored in D-flipflop registers. b Training inputs V_t1 and V_t2 correspond to the digital labels of the two sub-ADCs
The output of the sub-ADC is converted back to an analog signal A by the DAC according to

$$
A = \frac{1}{2^4}\sum_{i=4}^{7} W_i D_i, \tag{38}
$$

where W_i are the synaptic weights. Next, this output is subtracted from the held input to produce a residue Q as

$$
Q = V_{in} - A. \tag{39}
$$
This residue is sent to the next stage of the pipeline, where it is first sampled and held. The second-stage sub-ADC is designed similarly to that of the first stage, except that the resistive weights of the input are modified from R_in = R_f (the feedback resistance of the neuron) to R_f/16. This scales the input from V_FS/16 to the full-scale voltage V_FS. The LSBs of the digital output are obtained from this stage as

$$
\begin{cases}
D_3 = u\left(16Q - 8V_{ref}\right),\\
D_2 = u\left(16Q - 4V_{ref} - W_{2,3}D_3\right),\\
D_1 = u\left(16Q - 2V_{ref} - W_{1,2}D_2 - W_{1,3}D_3\right),\\
D_0 = u\left(16Q - V_{ref} - W_{0,1}D_1 - W_{0,2}D_2 - W_{0,3}D_3\right).
\end{cases}
\tag{40}
$$
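The data flow of Eqs. (37) to (40) can be illustrated with a short behavioral sketch. It is not the trained memristive circuit: it assumes ideal, already-trained binary weights (each feedback term W_ij·D_j contributes 2^j reference steps), digital outputs in {0, 1}, V_ref = V_FS/16, and an ideal first-stage DAC. The function names are hypothetical.

```python
def u(x):
    """Signum-like neural activation used as a 1-bit decision."""
    return 1 if x >= 0 else 0


def sub_adc(v, v_ref):
    """Four-bit SAR-like sub-ADC with the structure of Eq. (37) / Eq. (40).

    Assumption: ideal binary weights, so every already-decided, more
    significant bit k feeds back 2**k * v_ref. Returns the bits MSB first.
    """
    bits = []
    feedback = 0.0
    for k in (3, 2, 1, 0):
        d = u(v - (2 ** k) * v_ref - feedback)
        bits.append(d)
        feedback += d * (2 ** k) * v_ref
    return bits


def sub_dac(bits, v_ref):
    """Ideal four-bit DAC standing in for Eq. (38)."""
    return v_ref * sum(d * 2 ** k for d, k in zip(bits, (3, 2, 1, 0)))


def pipelined_adc(v_in, v_fs=1.8):
    """Two-stage conversion: coarse MSBs, residue (Eq. 39), then LSBs."""
    v_ref = v_fs / 16
    msbs = sub_adc(v_in, v_ref)        # D7..D4
    a = sub_dac(msbs, v_ref)           # analog estimate A
    q = v_in - a                       # residue Q
    lsbs = sub_adc(16 * q, v_ref)      # D3..D0 from the amplified residue
    code = 0
    for d in msbs + lsbs:
        code = (code << 1) | d
    return code


if __name__ == "__main__":
    for v in (0.1, 0.7, 1.25, 1.79):
        print(v, format(pipelined_adc(v), "08b"))
```

The sketch omits the sample-and-hold timing, so both stages are evaluated in sequence for a single sample rather than concurrently on consecutive samples.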
The sample-and-hold circuit enables concurrent operation of the two stages, achieving a high throughput rate, but introduces a latency of two clock cycles. Thus, D-flipflop registers are used to time-align the MSBs and the LSBs.

C. Training Framework
The aim of the training is to configure the network from a random initial state (random synaptic weights) to an accurate eight-bit ADC. This is achieved by minimizing the mean-square-error (MSE) of each sub-ADC and the DAC by using specific teaching labels for the desired quantization. For the training dataset, an analog ramp signal is sampled at 4 · 2^8 (= 1024) points. Four adjacent samples are given the same digital label, providing an eight-bit training dataset, shown as V_t1 in Fig. 15b. The more we train the ADCs with extra labels, the higher the conversion accuracy we achieve, because of the nonlinear nature of the ADC task. The analog ramp input with the corresponding four MSBs is used to train the first-stage ADC. A sawtooth version of this input (V_t2 in Fig. 15b) with the remaining LSBs is used for the training of the second stage. The switch S2 is turned to position 2 when the overall mean-square-error falls below E_threshold.

D. Evaluation Results
Figure 16a shows the variation of the MSE of the first-stage DAC. After approximately 5,000 training samples (312 epochs), which equals 50 ms of training time at a 0.1 MSPS conversion rate, the MSE falls below E_threshold. Figure 16b shows the total MSE of the two sub-ADCs. After approximately 40,000 training samples (39 epochs), which equals 400 ms of training time, the total MSE falls below E_threshold. The analog output is converted through an ideal 8-bit DAC and probed at three different timestamps during training, as shown in Fig. 16e. The output is identical to the input staircase after the training is completed. For a 1.8 V ramp signal sampled by 18 k points at 0.1 MSPS, the differential nonlinearity (DNL) is within ±0.20 LSB and the integral nonlinearity (INL) is lower than ±0.18 LSB, as shown in Fig. 16c. Figure 16d shows the output spectrum at a 0.1 MSPS sampling rate. The input is a 44 kHz 1.8 Vpp sine wave. The converter achieves 47.5 dB SNDR at the end of training.

Fig. 16 Pipeline ADC training evaluation. a Mean square error of the first-stage DAC minimization during its training. b Total mean square error of the two stages during training of the sub-ADCs. c DNL and INL at the end of training. d 2048-point FFT for a 44 kHz sinusoidal input. e Comparison between the teaching dataset and the actual output of the ADC by connecting it to an ideal DAC, at three different timestamps during the training; an identical staircase (time-aligned for latency) is obtained when training is complete

Next, we analyzed the power consumption of the network by considering neural integration power, neural activation power, and synapse power (Danial et al. 2018b). Remarkably, the total power consumption is optimized during training, similarly to Danial et al. (2018b). The ADC consumes 272 μW of power, averaged over a full-scale ramp with 4 · 2^8 samples. The proposed 8-bit pipelined architecture is compared to the scaled version of the neural network ADC (Danial et al. 2018b). As shown in Table 4, the pipelined ADC consumes less power and achieves a higher conversion rate and a better FOM with a lower HRS/LRS device ratio. To test the scalability of our architecture, we performed behavioral simulations in MATLAB. Our results for a 12-bit design with ideal device parameters are summarized in Table 5. Furthermore, when the full-scale voltage is reduced to 0.9 V and the sampling frequency is increased to 10 MSPS, the network converges to a new steady state to operate correctly under different specifications (Danial et al. 2018b).
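The training-set construction from the Training Framework and the sample counts reported above are related by simple arithmetic, sketched below. The sketch makes several assumptions that the chapter does not state explicitly: the sawtooth V_t2 is modeled as the ideal residue scaled by 16 to full scale, one sub-ADC epoch is taken as the 1024-sample ramp, and one DAC epoch is taken as the 16 four-bit codes. All function names are hypothetical.

```python
def build_training_set(v_fs=1.8, repeats=4):
    """Return (v_t1, msb_label, v_t2, lsb_label) tuples for an 8-bit ramp.

    The ramp is sampled at repeats * 2**8 points; `repeats` adjacent
    samples share the same eight-bit label.
    """
    n_samples = repeats * 2 ** 8
    dataset = []
    for k in range(n_samples):
        v_t1 = v_fs * k / n_samples              # analog ramp
        label = k // repeats                     # shared eight-bit label
        msb, lsb = label >> 4, label & 0xF       # labels of the two stages
        # Assumed sawtooth: residue after the ideal MSB estimate, times 16.
        v_t2 = 16 * (v_t1 - v_fs * msb / 16)
        dataset.append((v_t1, msb, v_t2, lsb))
    return dataset


def training_time_ms(n_samples, f_s_hz):
    """Training time when one sample is presented per conversion."""
    return 1e3 * n_samples / f_s_hz


def epochs(n_samples, epoch_size):
    """Number of complete passes over an epoch of the given size."""
    return n_samples // epoch_size


if __name__ == "__main__":
    data = build_training_set()
    print(len(data), data[0], data[-1])
    print(training_time_ms(5_000, 0.1e6), epochs(5_000, 16))
    print(training_time_ms(40_000, 0.1e6), epochs(40_000, len(data)))
```

Under these assumptions, 5,000 samples correspond to 312 passes and 50 ms, and 40,000 samples correspond to 39 passes and 400 ms, which agrees with the figures quoted in the text.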
Table 4 Performance comparison

| Parameter | NN ADC (Danial et al. 2018b) (a) | Proposed pipelined NN ADC |
|---|---|---|
| # Bits | 8 | 8 |
| # Synapses | 36 | 24 |
| Memristor HRS/LRS | 2^8 | 2^4 |
| Max conversion rate (GSPS) | 0.74 | 1.66 |
| Power (μW) | 650 | 272 |
| FOM (fJ/conv) | 7.5 | 0.97 (b) |
| Training time (ms) | 1060 | 400 |

(a) Based on scalability evaluation of the 8b neuromorphic ADC
(b) Extrapolated FOM at the maximum conversion rate

Table 5 Scalability evaluation

| Parameter | Value |
|---|---|
| # Bits | 12 |
| # Synapses | 38 |
| # Samples per epoch | 1 · 2^12 |
| Max \|DNL\| | 0.61 LSB |
| Max \|INL\| | 0.60 LSB |
| Training time (ms) | 2000 |
4.3 Expanding the DAC Design

A DAC is determined by its sampling frequency and its number of resolution bits. These two specifications are challenging to achieve together in conventional DACs, and they are two major bottlenecks. We show an efficient mechanism that achieves the best possible accuracy from the number of real allocated bits N for each sampling frequency f_s. Using the constraints and the design parameters listed in Table 1, the maximum number of bits was at most four. This section discusses large-scale DACs by using the proposed four-bit DAC as a prototype that can be duplicated or cascaded to create a larger architecture. Interestingly, AI techniques that involve deep neural networks and backpropagation algorithms (Indiveri et al. 2013; Danial et al. 2018a) can be exploited and incorporated into the design of large-scale DACs that are based on the four-bit DAC. For example, in Fig. 17, an eight-bit DAC that is based on the four-bit DAC is shown. The analog output of such a DAC is

$$
\begin{cases}
A_1 \approx -\sum_{i=0}^{3} \dfrac{R_f}{R_{mem_i}} V_i,\\[4pt]
A_2 \approx -\sum_{i=4}^{7} \dfrac{R_f}{R_{mem_i}} V_i,\\[4pt]
A_{tot} = W_{21}A_1 + W_{22}A_2,
\end{cases}
\tag{41}
$$

where W_21 and W_22 are the second-layer weights (W_2j = R_f/R_2j, j = 1, 2). Similarly to (5), the error function of the eight-bit deep neural network DAC is

$$
E = \frac{1}{2}\sum_{k=1}^{K}\left(A_{tot}^{(k)} - t^{(k)}\right)^2. \tag{42}
$$

Fig. 17 An eight-bit reconfigurable DAC composed from two four-bit DACs by using a two-layer neural network
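The learning rules of the first-layer synapses W_1i (0 ≤ i ≤ 7) are extracted by using the error gradient descent and backpropagation algorithms:

$$
\Delta W_{1i\,(0\le i\le 3)}^{(k)} = -\eta\frac{\partial E}{\partial W_{1i}^{(k)}} = -\eta\frac{\partial E}{\partial A_{tot}^{(k)}}\cdot\frac{\partial A_{tot}^{(k)}}{\partial A_{1}^{(k)}}\cdot\frac{\partial A_{1}^{(k)}}{\partial W_{1i}^{(k)}} = -\eta W_{21}\left(A_{tot}^{(k)} - t^{(k)}\right)V_i^{(k)}, \tag{43}
$$

$$
\Delta W_{1i\,(4\le i\le 7)}^{(k)} = -\eta\frac{\partial E}{\partial W_{1i}^{(k)}} = -\eta\frac{\partial E}{\partial A_{tot}^{(k)}}\cdot\frac{\partial A_{tot}^{(k)}}{\partial A_{2}^{(k)}}\cdot\frac{\partial A_{2}^{(k)}}{\partial W_{1i}^{(k)}} = -\eta W_{22}\left(A_{tot}^{(k)} - t^{(k)}\right)V_i^{(k)}. \tag{44}
$$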
Using the same design methodology as for the four-bit DAC, this network defines a high-precision eight-bit DAC with adaptive abilities to self-calibrate mismatches and tolerate variations. The weights in the second layer are fixed and predefined at design time; they do not need to be adjustable, and they do not obey the learning rule. Thus, learning rules (43) and (44) depend on predefined parameters which, unlike in multi-layer neural networks trained with a backpropagation algorithm (Indiveri et al. 2013), do not vary during training. The training data-set is applied to the network and compared to the DAC output, which is the second-layer output, and then the error product is backpropagated directly to the first-layer synapses of both four-bit DACs simultaneously. Different learning rates are used for each four-bit DAC. Although resistors are highly prone to manufacturing variations, they can be used effectively for the second layer, since the mismatches in that layer will be calibrated and compensated for by the weights of the first layer. Thus, the proposed large-scale concept will actually take advantage of the defects and handle them robustly. A major challenge that directly relates to large-scale trainable DACs is how to generate the data-set for teaching. We assume that peripheral circuitry is provided and able to generate real-time data-sets with different specifications that fit the required DAC. Larger numbers of bits, smaller full-scale voltages, and higher frequencies, however, will be challenging for these circuits, which are not only technology dependent but also special purpose. For example, pulse-width modulators are bounded by the frequency with which they can work. Therefore, the proposed binary-weighted time-varying gradient descent complicates the design but improves accuracy, compared to the regular gradient descent that uses a uniform learning rate. In future work, we will investigate general purpose peripheral circuitry that generates data-sets in real time.
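The two-layer learning rules (43) and (44) can be illustrated with a small numerical sketch in which the second-layer weights are fixed, mismatched resistors and only the first-layer weights are trained. The sketch is a software analogy under stated assumptions, not the circuit: input bits are normalized to {0, 1}, the inverting-amplifier signs of Eq. (41) are absorbed into the weights, the mismatch values (7% and 5%) and both learning rates are illustrative choices, and the function names are hypothetical.

```python
import random


def dac_output(code, w1, w21, w22):
    """A_tot of Eq. (41) for an 8-bit code, with signs absorbed in w1."""
    a1 = sum(w1[i] * ((code >> i) & 1) for i in range(0, 4))   # lower nibble
    a2 = sum(w1[i] * ((code >> i) & 1) for i in range(4, 8))   # upper nibble
    return w21 * a1 + w22 * a2


def train_two_layer_dac(w21, w22, v_fs=1.8, n_epochs=50, seed=0):
    """Train only the first-layer weights with the updates of (43)/(44)."""
    rng = random.Random(seed)
    w1 = [rng.uniform(0.0, 1.0) for _ in range(8)]
    # Different learning rate per four-bit DAC (assumed values): the lower
    # DAC sees a ~1/16 second-layer weight, so it gets a larger rate.
    eta = [25.6] * 4 + [0.1] * 4
    w2 = [w21] * 4 + [w22] * 4
    codes = list(range(256))
    for _ in range(n_epochs):
        rng.shuffle(codes)
        for code in codes:
            err = dac_output(code, w1, w21, w22) - v_fs * code / 256
            for i in range(8):
                v_i = (code >> i) & 1
                w1[i] -= eta[i] * w2[i] * err * v_i   # Eq. (43) / Eq. (44)
    return w1


if __name__ == "__main__":
    w21, w22 = 1.07 / 16, 0.95        # fixed, mismatched second layer
    w1 = train_two_layer_dac(w21, w22)
    worst = max(abs(dac_output(c, w1, w21, w22) - 1.8 * c / 256)
                for c in range(256))
    print("worst-case error (V):", worst)
```

In this idealized, noise-free setting the first-layer weights absorb the second-layer mismatch, which mirrors the self-calibration argument made in the text.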
4.4 Logarithmic Neuromorphic Data Converters

A. Theory
Logarithmic ADC/DACs are employed in biomedical applications where signals with high dynamic range are recorded. A logarithmic data converter can efficiently quantize the sampled data by reducing the number of resolution bits, sampling rate, and power consumption, albeit with reduced accuracy for high amplitudes, compared with a linear ADC/DAC of the same input dynamic range. A three-bit logarithmic ADC/DAC design based on memristive technology is proposed, which utilizes ML algorithms to train an ANN architecture. An N-bit logarithmic ADC converts an analog input voltage (V_in) to an N-bit digital output code (D_out = D_{N−1}, …, D_0) according to a logarithmic mapping described by

$$
\sum_{i=0}^{N-1} D_i 2^i = \frac{2^N}{C}\log_B\left(\frac{V_{in}}{V_{FS}}B^{C}\right), \tag{45}
$$

where N is the number of bits, B is the base of the logarithmic function, C is the code efficiency factor (Cantarano and Pallottino 1973), and V_FS is the full-scale analog input range. The LSB size of an N-bit logarithmic ADC varies with the input amplitude, as shown in Fig. 18. For small input amplitudes, the LSB size is small and has a minimum value of

$$
LSB_{min} = V_{FS}B^{-C}\left(B^{\frac{C}{2^N}} - 1\right), \tag{46}
$$

when D_out changes from 0 to 1. For large input amplitudes, the LSB size is larger and has a maximum value of

$$
LSB_{max} = V_{FS}\left(1 - B^{-\frac{C}{2^N}}\right), \tag{47}
$$

when D_out changes from 2^N − 2 to 2^N − 1. The dynamic range (DR) of an ADC is defined by the ratio of the maximum input amplitude to the minimum resolvable input amplitude:

$$
DR\,(\mathrm{dB}) = 20\log_{10}\left(\frac{V_{FS}}{LSB_{min}}\right) = 20\log_{10}\left(\frac{B^{C}}{B^{\frac{C}{2^N}} - 1}\right). \tag{48}
$$

Fig. 18 Characteristics of reconfigurable quantization: linear versus logarithmic

The DNL and INL for a logarithmic ADC are defined similarly to those of the linear ADC, except that in a logarithmic ADC the ideal step size varies with each step:

$$
DNL(j) = \frac{V_{j+1} - V_j}{LSB_{ideal}}, \tag{49}
$$

$$
INL(j) = \sum_{i=1}^{j} DNL(i), \tag{50}
$$

where V_j and V_{j+1} are adjacent code transition voltages, and j ∈ {x | 1 ≤ x ≤ 2^N − 2}.

B. Proposed Logarithmic Neuromorphic Data Converters
The learning capabilities of ANNs are utilized, applying linear vector–matrix multiplication and non-linear decision-making operations to train them to perform logarithmic quantization. Therefore, we formulate the logarithmic ADC equations in an ANN-like manner as follows, using three bits as an example:

$$
\begin{cases}
D_2 = u\left(V_{in} - 2^4 V_{ref}\right),\\
D_1 = u\left(V_{in} - 2^2 V_{ref}\bar{D}_2 - 2^6 D_2\right),\\
D_0 = u\left(V_{in} - 2V_{ref}\bar{D}_1\bar{D}_2 - 2^3 D_1\bar{D}_2 - 2^5 \bar{D}_1 D_2 - 2^7 D_1 D_2\right),
\end{cases}
\tag{51}
$$

where V_in is the analog input and D_2 D_1 D_0 is the corresponding digital form (i = 2 is the MSB), while each $\bar{D}_i$ is the complement of the digital bit D_i, and each bit (neuron product) has either zero or full-scale voltage. u(·) denotes the signum neural activation function, and V_ref is a reference voltage equal to LSB_min. Each neuron is a collective integrator of its inputs. The analog input is sampled and successively (by a pipeline) approximated by a combination of binary-weighted inhibitory synaptic connections between different neurons and their complements. The interconnected synaptic weights of the network are described by an asymmetric matrix W, and each element W_ij represents the synaptic weight of the connection from pre-synaptic neuron j to post-synaptic neuron i. In this case, we have additional synaptic connections due to the AND product between neurons and their complements, so the matrix dimensions approach (2^(N−1) + 2). To train this network, the mean square error technique is used.

C. Trainable Neural Network Logarithmic DAC Architecture
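A single-layer ANN with a single neuron is proposed to perform a three-bit logarithmic DAC. The equations are formulated as

$$
\begin{aligned}
V_{out} ={}& 2^0\bar{D}_0\bar{D}_1\bar{D}_2 + 2^1 D_0\bar{D}_1\bar{D}_2 + 2^2\bar{D}_0 D_1\bar{D}_2 + 2^3 D_0 D_1\bar{D}_2\\
&+ 2^4\bar{D}_0\bar{D}_1 D_2 + 2^5 D_0\bar{D}_1 D_2 + 2^6\bar{D}_0 D_1 D_2 + 2^7 D_0 D_1 D_2.
\end{aligned}
$$

D. Circuit Design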
The neural network ADC/DAC architectures and their building blocks, including neurons, synapses, and training feedbacks, are illustrated in Fig. 19. The memristive device used has a high-to-low resistance state (HRS/LRS) ratio of 50 to 1000. The aspect weight ratio of the ADC/DAC is equal to 2^(2^N − 1) (for V_FS = V_DD/2). The HRS/LRS ratio sets an upper bound on the number of conversion bits. For example, a four-bit logarithmic ADC/DAC is infeasible using this device. Thus, we demonstrate a three-bit logarithmic ADC/DAC, which has a better DR than a four-bit linear ADC/DAC (Sundarasaradula et al. 2016). Table 1 lists the circuit parameters. We used the same training circuits as in Danial et al. (2018a, b). While the feedback of the ADC is simple and realized by digital circuits, the feedback of the DAC is implemented by a pulse width modulator (PWM) with time proportional to the error and ±V_DD, 0 V pulse levels. After the training is complete (E ≤ E_threshold), the feedback is disconnected from the conversion path.

E. Results
Fig. 19 Single-layer ANN three-bit logarithmic architectures, trained online using SGD, including synapses W_i,j, neurons N_i, and feedbacks FB_i, in addition to AND gates between neurons and their complements: a ADC with seven synapses, b DAC with eight synapses. c Schematic of the memristive synapse. Neuron and training feedback circuits in the ADC and DAC, using digital circuits and PWM, respectively, are as introduced previously

We statically evaluated how the proposed ADC responds to the DC logarithmic ramp signal. After training, the ADC is almost fully calibrated, monotonic, and accurate: INL ≈ 0.26 LSB and DNL ≈ 0.62 LSB. It is then dynamically evaluated and analyzed, in response to an exponential sinusoidal input signal with frequency where the harmonic distortions are mitigated, and the SNDR and ENOB improve as the training progresses. Power consumption is also analyzed, as specified in Pouyan et al. (2014), during training until it reaches its minimum when the training is finished. The best energetic state of the network is achieved when it is configured in a logarithmic ADC manner. The DAC is evaluated using similar methodologies as in Danial et al. (2018b). We show that the proposed networks can also be trained to perform linear ADC/DAC conversion using linearly quantized teaching data-sets. Table 6 lists the full performance metrics of the proposed logarithmic data converters.
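The logarithmic mapping of Eq. (45), the three-bit network of Eq. (51), and the DAC minterm sum can be cross-checked with a short sketch. The values B = 2 and C = 8 are assumptions chosen here because they make Eq. (45) consistent with the power-of-two thresholds of Eq. (51); the chapter does not state them. All voltages are expressed in units of V_ref (including the terms of Eq. (51) printed without V_ref), bits are in {0, 1}, the weights are ideal, and the function names are hypothetical.

```python
import math


def u(x):
    """Signum-like neural activation used as a 1-bit decision."""
    return 1 if x >= 0 else 0


def log_adc(v_in, v_ref=1.0):
    """Three-bit logarithmic ADC following the structure of Eq. (51)."""
    d2 = u(v_in - 2 ** 4 * v_ref)
    n2 = 1 - d2                                   # complement of D2
    d1 = u(v_in - v_ref * (2 ** 2 * n2 + 2 ** 6 * d2))
    n1 = 1 - d1                                   # complement of D1
    d0 = u(v_in - v_ref * (2 ** 1 * n1 * n2 + 2 ** 3 * d1 * n2
                           + 2 ** 5 * n1 * d2 + 2 ** 7 * d1 * d2))
    return d2, d1, d0


def log_dac(d2, d1, d0, v_ref=1.0):
    """Three-bit logarithmic DAC as a sum of eight weighted minterms."""
    n2, n1, n0 = 1 - d2, 1 - d1, 1 - d0
    minterms = [n0 * n1 * n2, d0 * n1 * n2, n0 * d1 * n2, d0 * d1 * n2,
                n0 * n1 * d2, d0 * n1 * d2, n0 * d1 * d2, d0 * d1 * d2]
    return v_ref * sum(2 ** k * m for k, m in enumerate(minterms))


def log_code(v_in, v_fs, n_bits=3, base=2.0, c=8.0):
    """Right-hand side of Eq. (45) before rounding down to a code."""
    return (2 ** n_bits / c) * math.log(v_in / v_fs * base ** c, base)


def lsb_min(v_fs, n_bits=3, base=2.0, c=8.0):
    """Minimum LSB size of Eq. (46)."""
    return v_fs * base ** (-c) * (base ** (c / 2 ** n_bits) - 1)


if __name__ == "__main__":
    for code in range(8):
        v = 1.5 * 2 ** code                       # mid-step input, V_ref = 1
        bits = log_adc(v)
        print(code, bits, log_dac(*bits), round(log_code(v, 256.0), 3))
```

With these assumed parameters, LSB_min equals V_ref when V_FS = 2^8·V_ref, matching the statement that V_ref is equal to LSB_min.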
4.5 Breaking Through the Speed-Power-Accuracy Tradeoff

We investigate the real-time training of the proposed data converters for general purpose applications. For every selected f_s within the f_t bandwidth, the data converter is trained by a training data-set with the same specifications and achieves the optimal ENOB, as shown in Fig. 20a. The maximal ENOB (∼3.7) is asymptotically bounded by the intrinsic quantization noise, which is not exceeded. Analogously, the power consumption is dynamically optimized for every f_s to achieve the minimal power dissipation of the network, as shown in Fig. 20b. The power dissipated on resistors has a greater effect on the overall power dissipation than the frequency-dependent dissipation (e.g., on capacitors). Simultaneously, because the quantization cost function (6) and the energy function (4) are equivalent once the error function (8) is optimized, co-optimization in terms of both ENOB and power dissipation along the training samples is achieved, as shown in Fig. 20c.
Table 6 Performance evaluation of the proposed logarithmic data converters

| Metric | Logarithmic ADC | Linear ADC (Danial et al. 2018b) |
|---|---|---|
| N | 3 bits | 4 bits |
| INL | 0.26 LSB | 0.4 LSB |
| DNL | 0.62 LSB | 0.5 LSB |
| DR | 42.114 dB | 24.08 dB |
| SNDR | 17.1 dB | 24.034 dB |
| ENOB | 2.55 | 3.7 |
| P | 45.18 μW | 100 μW |
| FOM | 77.19 pJ/conv | 0.136 nJ/conv |
| Training time | 20 ms | 40 ms |

| Metric | Logarithmic DAC | Linear DAC (Danial et al. 2018a) |
|---|---|---|
| N | 3 bits | 4 bits |
| INL | 0.163 LSB | 0.12 LSB |
| DNL | 0.122 LSB | 0.11 LSB |
| Training time | 80 ms | 30 ms |
Fig. 20 Breaking through the speed-power-accuracy tradeoff. a Speed-accuracy tradeoff by achieving maximal ENOB regardless of f s after training is complete. b Speed-power tradeoff by achieving minimal P regardless of f s after the training is complete. The frequency-dependent power dissipation is negligible. c Accuracy-power tradeoff by achieving maximal ENOB and minimal P after the training is complete. d FOM dynamic optimization with training
Interestingly, the collective optimization of the proposed architecture breaks through the speed-power-accuracy tradeoff, and dynamically scales the FOM to achieve a cutting-edge value of 8.25 fJ/conv.step, as shown in Fig. 20d and marked by the green star in Fig. 3.
The versatility of the proposed architecture with regard to reconfiguration, mismatch self-calibration, noise tolerance, and power optimization is attained using a simple and minimalistic design with a reconfigurable single channel. The proposed architecture moreover utilizes the resistive parallel computing of memristors to achieve high speed, in addition to their analog non-volatility, enabling a standard digital ML algorithm to intelligently adjust their conductance precisely and in situ to achieve high accuracy. The minimalistic design results in low power consumption, thus achieving a cost-effective ADC.
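The accuracy and efficiency metrics discussed in this section can be related numerically. The sketch below uses the standard relation ENOB = (SNDR − 1.76)/6.02, which is a general ADC convention and is not stated in this chapter, together with the FOM form P/(2^ENOB · f_s) that appears in Table 3. The 0.1 MSPS sampling rate used for the logarithmic ADC is an assumption borrowed from the pipelined evaluation; the function names are hypothetical.

```python
def enob_from_sndr(sndr_db):
    """Effective number of bits from SNDR (standard sine-wave relation)."""
    return (sndr_db - 1.76) / 6.02


def fom_joule_per_conv(power_w, enob, f_s_hz):
    """Figure of merit P / (2**ENOB * f_s), in joules per conversion."""
    return power_w / (2 ** enob * f_s_hz)


if __name__ == "__main__":
    # SNDR values taken from Table 6 and from the pipelined evaluation.
    for name, sndr in (("logarithmic ADC", 17.1),
                       ("linear ADC", 24.034),
                       ("pipelined ADC", 47.5)):
        print(name, round(enob_from_sndr(sndr), 2))
    # Logarithmic ADC of Table 6, assuming f_s = 0.1 MSPS.
    fom = fom_joule_per_conv(45.18e-6, 2.55, 0.1e6)
    print("FOM (pJ/conv):", round(fom * 1e12, 2))
```

Under these assumptions, the SNDR values of Table 6 give ENOB values close to the tabulated 2.55 and 3.7, and the logarithmic ADC figure of merit comes out near the tabulated 77.19 pJ/conv.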
5 Conclusions

This chapter presents a real-time trainable data conversion architecture for general purpose applications, which breaks through the speed-power-accuracy tradeoff. Motivated by the analogies between mixed-signal circuits and the neuromorphic paradigm, we exploit the intelligent properties of an ANN, and suggest a SAR-like neural network ADC architecture and a binary-weighted-like neural network DAC architecture that are trained online by a supervised ML algorithm. The neural network is realized by means of a hybrid CMOS–memristor circuit design. The trainable mechanism successfully demonstrates collective properties of the network in reconfiguration to multiple full-scale voltages and frequencies, mismatch self-calibration, noise tolerance, power optimization, and FOM dynamic scaling. Furthermore, we proposed a scalable pipelined neural network ADC architecture based on coarse-resolution neuromorphic ADCs and DACs, modularly cascaded in a high-throughput pipeline and precisely trained using the SGD algorithm. With a 1.8 V full-scale voltage, the pipelined architecture achieved a 0.97 fJ/conv FOM at the maximum conversion rate. In addition, we employed reconfigurable quantization by training neuromorphic data converters to perform logarithmic quantization. We believe that the proposed data converters constitute a milestone with valuable results for neuromorphic computing in emerging applications with varying conditions, such as wearable devices, internet-of-things, and automotive applications.
References

R. Benzi, A. Sutera, A. Vulpiani, The mechanism of stochastic resonance. J. Phys. A: Math. Gen. 14(11), L453–L457 (1981)
S. Cantarano, G.V. Pallottino, Logarithmic analog-to-digital converters: a survey. IMS 22(3), 201–213 (1973)
B. Chen et al., Physical mechanisms of endurance degradation in TMO-RRAM, in IEEE International Electron Devices Meeting (IEDM), December 2011, pp. 12.3.1–12.3.4
Y. Chiu, B. Nikolić, P.R. Gray, Scaling of analog-to-digital converters into ultra-deep-submicron CMOS, in Proceedings of the IEEE on Custom Integrated Circuits Conference, September 2005, pp. 368–375
L. Danial, S. Kvatinsky, Real time trainable data converter for general purpose applications, in IEEE/ACM International Symposium on Nanoscale Architectures, July 2018
L. Danial, N. Wainstein, S. Kraus, S. Kvatinsky, DIDACTIC: a data-intelligent digital-to-analog converter with a trainable integrated circuit using memristors. IEEE J. Emerg. Select. Top. Circuits Syst. 8(1), 146–158 (2018a)
L. Danial, N. Wainstein, S. Kraus, S. Kvatinsky, Breaking through the speed-power-accuracy tradeoff in ADCs using a memristive neuromorphic architecture. IEEE Trans. Emerg. Top. Comput. Intell. 2(5), 396–409 (2018b)
T. Dietterich, Overfitting and undercomputing in machine learning. ACM Comput. Surv. 27(3), 326–327 (1995)
L. Gao et al., Digital-to-analog and analog-to-digital conversion with metal oxide memristors for ultra-low power computing, in Proceedings of the IEEE/ACM International Symposium on Nanoscale Architectures, NANOARCH, July 2013, pp. 19–22
A.J. Gines, E.J. Peralias, A. Rueda, A survey on digital background calibration of ADCs, in European Conference on Circuit Theory and Design, August 2009, pp. 101–104
R.M. Gray, Quantization noise spectra. IEEE Trans. Inf. Theory 36(6), 1220–1244 (1990)
S. Greshnikov, E. Rosenthal, D. Soudry, S. Kvatinsky, A fully analog memristor-based multilayer neural network with online backpropagation training, in Proceedings of the IEEE International Conference on Circuits and Systems, May 2016, pp. 1394–1397
B.J. Hosticka, Performance comparison of analog and digital circuits. Proc. IEEE 73(1), 25–29 (1985)
M. Hu et al., Geometry variations analysis of TiO2 thin-film and spintronic memristors, in IEEE Proceedings of the 16th Asia and South Pacific Design Automation Conference, January 2011
G. Indiveri, B. Linares-Barranco, R. Legenstein, G. Deligeorgis, T. Prodromakis, Integration of nanoscale memristor synapses in neuromorphic computing architectures. Nanotechnology 24(38) (2013), Art. ID. 384010
A.K. Jain, J. Mao, K.M. Mohiuddin, Artificial neural networks: a tutorial. IEEE Comput. 29(3), 31–44 (1996)
S.H. Jo et al., Nanoscale memristor device as synapse in neuromorphic systems. Nano Lett. 10(4), 1297–1301 (2010)
B.E. Jonsson, A survey of A/D-converter performance evolution, in Proceedings of the IEEE International Conference on Electronics, Circuits and Systems, December 2010, pp. 766–769
P. Kinget, M.S.J. Steyaert, Impact of transistor mismatch on the speed-accuracy-power trade-off of analog CMOS circuits, in Proceedings of the IEEE Custom Integrated Circuits Conference, May 1996, pp. 333–336
T. Kugelstadt, The operation of the SAR-ADC based on charge redistribution. Texas Instrum. Analog Appl. J. 10–12 (2000)
S. Kvatinsky, M. Ramadan, E.G. Friedman, A. Kolodny, VTEAM: a general model for voltage-controlled memristors. IEEE Trans. Circuits Syst. II Express Briefs 62(8), 786–790 (2015)
J. Li, U. Moon, Background calibration techniques for multistage pipelined ADCs with digital redundancy. IEEE Trans. Circuits Syst. II: Analog Digital Signal Process 50(9), 531–538 (2003)
C. Mead, Neuromorphic electronic systems. Proc. IEEE 78(10), 1629–1636 (1990)
B. Murmann, ADC performance survey 1997–2017 (2021), http://web.stanford.edu/~murmann/adcsurvey.html
Y. Nemirovsky et al., 1/f noise in advanced CMOS transistors. IEEE Instrum. Meas. Mag. 14(1), 14–22 (2011)
D. Niu, Y. Chen, C. Xu, Y. Xie, Impact of process variations on emerging memristor, in Proceedings of the 47th Design Automation Conference - DAC'10, June 2010, pp. 877–882
P. Nuzzo, F. De Bernardinis, P. Terreni, G. Van der Plas, Noise analysis of regenerative comparators for reconfigurable ADC architectures. IEEE Trans. Circuits Syst. I Regul. Pap. 55(6), 1441–1454 (2008)
S. Pi, M. Ghadiri-Sadrabadi, J.C. Bardin, Q. Xia, Nanoscale memristive radiofrequency switches. Nat. Commun. 6(7519), 1–9 (2015)
C. Po-Rong, W. Bor-Chin, H.M. Gong, A triangular connection hopfield neural network approach to analog-to-digital conversion. IEEE Trans. Instrum. Meas. 43(6), 882–888 (1994)
P. Pouyan, E. Amat, A. Rubio, Reliability challenges in design of memristive memories, in Proceedings of the European Workshop on CMOS Variability (VARI), Sept. 2014, pp. 1–6
M. Prezioso et al., Training and operation of an integrated neuromorphic network based on metal-oxide memristors. Nature 521(7550), 61–64 (2015)
J. Sandrini et al., Effect of metal buffer layer and thermal annealing on HfOx-based ReRAMs, in IEEE International Conference on the Science of Electrical Engineering, November 2016, pp. 1–5
T. Sepke, P. Holloway, C.G. Sodini, H.S. Lee, Noise analysis for comparator-based circuits. IEEE Trans. Circuits Syst. I Regul. Pap. 56(3), 541–553 (2009)
D.L. Shen, Y.C. Lai, T.C. Lee, A 10-bit binary-weighted DAC with digital background LMS calibration, in Proceedings of the IEEE Asian Solid-State Circuits Conference, November 2007, pp. 352–355
O.M. Solomon, The use of DFT windows in signal-to-noise ratio and harmonic distortion computations, in IEEE Instrumentation and Measurement Technology Conference, May 1993, pp. 103–108
D. Soudry et al., Memristor-based multilayer neural networks with online gradient descent training. IEEE Trans. Neural Netw. Learn. Syst. 26(10), 2408–2421 (2015)
M. Steyaert, K. Uyttenhove, Speed-power-accuracy trade-off in high-speed analog-to-digital converters: now and in the future, in Analog Circuit Design (Springer, 2000), pp. 3–24
A. Stotland, M. Di Ventra, Stochastic memory: memory enhancement due to noise. Phys. Rev. E 85(1), 011116 (2012)
D.B. Strukov, Endurance-write-speed tradeoffs in nonvolatile memories. Appl. Phys. A 122(4), 1–4 (2016)
Y. Sundarasaradula et al., A 6-bit, two-step, successive approximation logarithmic ADC for biomedical applications, in ICECS (2016), pp. 25–28
D. Tank, J.J. Hopfield, Simple 'neural' optimization networks: an A/D converter, signal decision circuit, and a linear programming circuit. IEEE Trans. Circuits Syst. 33(5), 533–541 (1986)
A.C. Torrezan, J.P. Strachan, G. Medeiros-Ribeiro, R.S. Williams, Sub-nanosecond switching of a tantalum oxide memristor. Nanotechnology 22(48), 1–7 (2011)
K. Uyttenhove, M.S.J. Steyaert, Speed-power-accuracy tradeoff in high-speed CMOS ADCs. IEEE Trans. Circuits Syst. II: Analog Digital Signal Process 49(4), 280–287 (2002)
R.J. van de Plassche, CMOS Integrated Analog-to-Digital and Digital-to-Analog Converters (Springer Science & Business Media, 2013)
N. Wainstein, S. Kvatinsky, An RF memristor model and memristive single-pole double-throw switches, in IEEE International Symposium on Circuits and Systems (2017) (in press)
R.H. Walden, Analog-to-digital converter survey and analysis. IEEE J. Sel. Areas Commun. 17(4), 539–550 (1999)
B. Widrow, M.A. Lehr, 30 years of adaptive neural networks: perceptron, madaline, and backpropagation. Proc. IEEE 78(9), 1415–1442 (1990)
B. Widrow, S.D. Stearns, Adaptive Signal Processing (Prentice-Hall, Englewood Cliffs, NJ, USA, 1985)
L. Zhang et al., Mellow writes: extending lifetime in resistive memories through selective slow write backs, in ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), June 2016, pp. 519–531
Hardware Security in Emerging Photonic Network-on-Chip Architectures

Ishan G. Thakkar, Sai Vineel Reddy Chittamuru, Varun Bhat, Sairam Sri Vatsavai, and Sudeep Pasricha
Abstract Photonic networks-on-chip (PNoCs) enable high bandwidth on-chip data transfers by using photonic waveguides capable of dense-wavelength-division-multiplexing (DWDM) for signal traversal and microring resonators (MRs) for signal modulation. A Hardware Trojan in a PNoC can manipulate the electrical driving circuit of its MRs to cause the MRs to snoop data from the neighboring wavelength channels in a shared photonic waveguide. This introduces a serious security threat. This chapter presents a novel framework called SOTERIA that utilizes process variation based authentication signatures along with architecture-level enhancements to protect data in PNoC architectures from snooping attacks. With minimal overheads of up to 10.6% in average latency and of up to 13.3% in energy-delay-product (EDP), our approach can significantly enhance the hardware security in DWDM-based PNoCs.
I. G. Thakkar
Electrical and Computer Engineering Department, University of Kentucky, Lexington, KY 40506, USA

S. V. R. Chittamuru
Micron Technology, Inc., Austin, TX, USA

V. Bhat
Qualcomm, San Diego, CA, USA

S. S. Vatsavai
Electrical Engineering, University of Kentucky, Lexington, KY 40506, USA

S. Pasricha (B)
Electrical and Computer Engineering Department, Colorado State University, Fort Collins, CO 80523, USA

© Springer Nature Singapore Pte Ltd. 2023
M. M. S. Aly and A. Chattopadhyay (eds.), Emerging Computing: From Devices to Systems, Computer Architecture and Design Methodologies, https://doi.org/10.1007/978-981-16-7487-7_9
1 Introduction
Since the end of Dennard scaling in the mid-2000s, continued CMOS transistor scaling has allowed the transistor count per microprocessor chip to keep increasing. This growing number of transistors, which already runs into a few billion today, has been utilized to integrate an increasingly greater number of cores per microprocessor chip. Such microprocessor chips with multiple integrated cores are often referred to as chip-multiprocessors (CMPs). To enable efficient communication between the cores within a CMP, conventional bus-based interconnects have been replaced with networks-on-chip (NoCs) (e.g., Dally and Towles 2001; Benini and Micheli 2002). In NoCs, processing cores use routers connected by segmented links for efficient and scalable communication devoid of global wire delays.
2 State of the Art in NoCs
In CMPs, electrical NoCs (ENoCs) have emerged as the standard communication fabric, as they are more scalable and modular than traditional bus-based interconnects. Many CMPs have been implemented with ENoCs, such as Intel’s 48-core SCC (Held 2010), Tilera’s 72-core TILE-Gx (Mattina 2014), Kalray’s 256-core MPPA (de Dinechin et al. 2013), Intel’s 80-core TeraFlops chip (Hoskote et al. 2007), Sun’s Niagara (McGhan 2006), and MIT’s RAW chip (Kim et al. 2003). Moreover, a 496-core CMP with a mesh-topology ENoC was recently implemented in Rovinski et al. (2019). Increasing core counts and decreasing transistor sizes elevate the concerns associated with reliability, power consumption, performance, and security in ENoCs. This is also reflected in the ongoing research efforts in the field of NoCs. With shrinking technology node size, ENoCs become prone to faults that arise due to manufacturing defects, increased demand for traffic, and aging of various components. Several techniques have been proposed to address fault tolerance in ENoCs (e.g., Priya et al. 2018; Pinheiro et al. 2019; Boraten and Kodi 2018; Kim et al. 2013; DiTomaso et al. 2016). For instance, in Priya et al. (2018), a bypass path around faulty nodes is set up to improve fault tolerance in ENoCs. In addition to tolerating faults, avoiding congestion, deadlock, and livelock in ENoCs is also important, as the performance of ENoCs depends on it. Congestion occurs in an ENoC when a majority of data packets traverse a common path, deadlock is caused by a cyclic buffer dependency, and livelock occurs when packets keep spinning around the network without progressing to their destination. To avoid congestion, deadlock, and livelock conditions in ENoCs, several effective routing algorithms have been proposed in prior works (e.g., Bahrebar and Stroobandt 2015; Parasar et al. 2018; Ramrakhyani and Krishna 2017; Ramrakhyani et al. 2019; Wang et al. 2016).
For instance, Bahrebar and Stroobandt (2015) use the Hamiltonian routing strategy to adaptively route packets through deadlock-free paths in a diametrical 2D mesh ENoC.
Furthermore, the increase in the number of cores and the shrinking feature size also increase power density in CMPs, which causes electromigration, negative bias temperature instability, hot carrier injection, and time-dependent dielectric breakdown. All of these effects accelerate aging in CMPs, which degrades their reliability and lifetime. Substantial research has been done to address the reliability and lifetime issues in ENoCs (e.g., Kim et al. 2018; Huang et al. 2018; Rathore et al. 2019). For example, an adaptive routing algorithm is utilized in Kim et al. (2018) to improve the lifetime reliability of ENoCs. Moreover, high dynamic energy consumption in ENoCs is also a critical challenge, as ENoCs consume a significant portion of the overall CMP power budget (Kim et al. 2011; Bezerra et al. 2010). Several prior works have focused on minimizing the dynamic energy consumption in ENoCs (e.g., Moghaddam 2017; Bezerra et al. 2010; Samih et al. 2013; Sun and Zhang 2017; Ascia et al. 2018). For instance, Moghaddam (2017) manages the dynamic energy consumption in ENoCs by combining thread migration with dynamic voltage and frequency scaling techniques. Further, a good amount of research is also being conducted on using CMPs with NoCs to develop application-specific processors for datacenters (e.g., SmarCo; Fan et al. 2018), cloud-based computing (e.g., Piton; McKeown et al. 2017), and reconfigurable accelerator systems (e.g., MITRACA; Ben Abdelhamid et al. 2019). Another critical challenge for ENoCs is their poor performance scalability. With increasing core count and the resultant increase in communication distances, the achievable throughput and latency for data transfers with ENoCs substantially degrade. With the motivation of providing better communication over longer distances, the use of wireless networks-on-chip (WiNoCs) has been proposed in place of ENoCs (Elmiligi et al. 2015). A WiNoC architecture is proposed in Wang et al.
(2011) that uses on-chip antennas to transfer data across long distances, while an arbitration mechanism that ensures high-average and guaranteed performance for WiNoCs was developed in Baharloo et al. (2018). In DiTomaso et al. (2015), a WiNoC uses an adaptable algorithm that works in the background along with a token sharing scheme to utilize the wireless bandwidth efficiently. Furthermore, recent advancements in emerging technologies like silicon nanophotonics have made it possible to realize photonic NoCs (PNoCs). Therefore, several PNoC architectures have been proposed for CMPs, e.g., Corona (Vantrease et al. 2008), FireFly (Pan et al. 2009), LumiNoC (Li et al. 2012), and CAMON (Wang et al. 2019). These PNoC architectures leverage the high bandwidth density and low dynamic power consumption of photonics to address the scalability issues of ENoCs. Another crucial challenge faced by ENoCs is security. ENoCs are vulnerable to various types of security attacks, such as Hardware Trojans, hijacking, extraction of sensitive information, and Denial of Service (DoS). Physical hardware security concerns due to Hardware Trojan (HT) attacks are especially important. An HT is realized through malicious modification of an integrated circuit (IC) during its design or fabrication phase. Such malicious ICs are generally third-party IPs that are used for reducing the design time of CMPs. Such HT-related security risks need to be detected and mitigated in order to ensure trusted functionality. Therefore, several recent works have addressed these security issues in ENoCs (e.g., Boraten and Kodi
2018; Madden et al. 2018; Lebiednik et al. 2018; Biswas 2018; Das et al. 2018). To mitigate DoS and timing attacks in ENoCs, Boraten and Kodi (2018) proposed the use of non-interference-based adaptive routing. Similarly, to improve hardware security in WiNoCs, Madden et al. (2018) employed a machine-learning-based engine to protect WiNoCs from DoS, spoofing, and eavesdropping attacks.
3 Photonic NoCs (PNoCs) and Related Security Challenges
Due to the ever-increasing core count needed to meet the growing performance demands of modern Big Data and cloud computing applications, ENoCs suffer from high power dissipation and severely reduced performance (Zhou and Kodi 2016). Crosstalk and electromagnetic interference are also increasing with technology scaling, which further reduces the performance and reliability of electrical NoCs (Pasricha and Dutt 2008). Thus, there is a crucial need for a new viable interconnect technology that can address the shortcomings of ENoCs. Recent developments in silicon photonics have enabled the integration of photonic components and interconnects with CMOS circuits on a chip. The ability to communicate at near light speed, larger bandwidth density, and lower dynamic power dissipation are the key advantages that Photonic NoCs (PNoCs) provide over their metallic counterparts (Miller 2009). These advantages motivate the use of PNoCs for inter-core communication in modern CMPs (Batten et al. 2008). Several PNoC architectures have been proposed to date (e.g., Sarangi et al. 2008; Xiao et al. 2007; Sun et al. 2012). These architectures employ on-chip photonic links, each of which connects two or more gateway interfaces. A gateway interface (GI) connects a cluster of processing cores to the PNoC. Each photonic link comprises one or more photonic waveguides, and each waveguide can support a large number of dense-wavelength-division-multiplexed (DWDM) wavelengths. Each wavelength carries one data signal. Typically, each source GI generates multiple data signals in the electrical domain (as sequences of logical 1 and 0 voltage levels) that are modulated onto the multiple DWDM carrier wavelengths simultaneously, using a bank of modulator MRs at the source GI (Chittamuru and Pasricha 2016).
The data-modulated carrier wavelengths traverse a link to a destination GI, where an array of detector MRs filters them and drops them on photodetectors to regenerate electrical data signals. In general, each GI in a PNoC is able to send and receive data in the optical domain on all of the utilized carrier wavelengths. Therefore, a bank of modulator MRs (i.e., modulator bank) and a bank of detector MRs (i.e., detector bank) are present at each GI. Each MR in a bank resonates with and operates on a specific carrier wavelength. High-bandwidth parallel data transfers are enabled in PNoCs by the excellent wavelength selectivity of MRs and the DWDM capability of waveguides. However, the same wavelength selectivity of MRs and DWDM capability of waveguides also impose serious hardware security threats in PNoCs. The hardware security issues in PNoCs are especially exacerbated by the complexity of hardware in modern CMPs. This is because, to meet the growing performance
demands of modern Big Data and cloud computing applications, the complexity of hardware in modern CMPs has increased. To reduce the hardware design time of these complex CMPs, third-party hardware IPs are frequently used. But these third-party IPs can introduce security risks (Chakraborty et al. 2009; Tehranipoor and Koushanfar 2009). For instance, the presence of Hardware Trojans (HTs) in third-party IPs can lead to leakage of critical and sensitive information from modern CMPs (Skorobogatov and Woods 2012). Thus, security researchers are now increasingly interested in overcoming hardware-level security risks in addition to the traditional focus on software-level security. CMPs with PNoCs are also expected to use several third-party IPs, as CMPs with ENoCs do, and are therefore vulnerable to security risks (Ancajas et al. 2014). For instance, if the entire PNoC used within a CMP is a third-party IP, then this PNoC, with HTs within the control units of its GIs, can snoop on packets in the network. A malicious core (a core running a malicious program) in the CMP can then extract sensitive information from these snooped packets. Unfortunately, the MRs of PNoCs are especially susceptible to security-threatening manipulations from HTs. In particular, the MR tuning circuits that are essential for supporting data broadcasts and for counteracting MR resonance shifts due to process variations (PV) make it easy for HTs to retune MRs and initiate snooping attacks. To enable data broadcast, the tuning circuits of detector MRs partially detune them from their resonance wavelengths (Pan et al. 2009; Li et al. 2012; Chittamuru et al. 2017), such that a significant portion of the photonic signal energy in the data-carrying wavelengths continues to propagate in the waveguide to be absorbed in the subsequent detector MRs. In addition, process variations (PV) induce resonance wavelength shifts in MRs (Selvaraja 2011).
MR tuning circuits are used to counteract PV-induced resonance shifts in MRs by retuning the resonance wavelengths using carrier injection/depletion or thermal tuning (Batten et al. 2008). An HT can manipulate these tuning circuits to partially tune a detector MR to a passing wavelength in the waveguide, which enables snooping of the data that is modulated on the passing wavelength. Such covert data snooping is a serious security risk in PNoCs. In this work, we present a framework that improves the hardware security in PNoCs by protecting data from snooping attacks. Our framework can be easily implemented in any existing DWDM-based PNoC without major changes to the architecture, and it has low overhead. To the best of our knowledge, this is the first work that attempts to improve hardware security for PNoCs. Our novel contributions are:
• We analyze security risks in photonic devices and extend this analysis to the link level, to determine the impact of these risks on PNoCs;
• We propose a circuit-level PV-based security enhancement scheme that uses PV-based authentication signatures to protect data from snooping attacks in photonic waveguides;
• We propose an architecture-level reservation-assisted security enhancement scheme to improve security in DWDM-based PNoCs;
• We combine the circuit-level and architecture-level schemes into a holistic framework called SOTERIA, and analyze it on the Firefly (Pan et al. 2009) and Flexishare (Pan et al. 2010) crossbar-based PNoC architectures.
4 Related Work
Several prior works (Ancajas et al. 2014; Gebotys et al. 2003; Kapoor et al. 2013) discuss the presence of security threats in ENoCs and propose solutions to mitigate them. In Ancajas et al. (2014), a three-layer security system that uses data scrambling, packet certification, and node obfuscation was presented to protect against data snooping attacks. A symmetric-key based cryptography design was presented in Gebotys et al. (2003) for securing the NoC. In Kapoor et al. (2013), a framework was presented that uses permanent keys and temporary session keys for NoC transfers between secure and non-secure cores. However, no prior work has analyzed security risks in photonic devices and links, or considered the impact of these risks on PNoCs. Fabrication-induced PV impact the cross-section, i.e., the width and height, of photonic devices such as MRs and waveguides. Thermal tuning and localized trimming are device-level techniques used to counteract the PV-induced drifts in the resonant wavelengths of MRs (Batten et al. 2008). Trimming can induce blue shifts in the resonance wavelengths of MRs using carrier injection into MRs, whereas thermal tuning can induce red shifts in MR resonances through heating of MRs using integrated heaters. Device-level trimming/tuning techniques are indispensable for remedying PV, but their use also enables partial detuning of MRs, which can be used to snoop data from a shared photonic waveguide. In addition, the impact of PV-remedial techniques on crosstalk noise, and techniques to mitigate it, were discussed in prior works (Chittamuru et al. 2016a, b). None of the prior works analyze the impact of PV-remedial techniques on hardware security in PNoCs. Our proposed framework in this chapter is novel, as it enables security against snooping attacks in PNoCs for the first time. Our framework improves security for any DWDM-based PNoC architecture: it is network agnostic, mitigates PV, and incurs minimal overhead.
5 Hardware Security Concerns in PNoCs
5.1 Device-Level Security Concerns
Undesirable changes in MR widths and heights due to process variation (PV) cause “shifts” in MR resonance wavelengths, which can be remedied using localized trimming and thermal tuning methods. The localized trimming method injects (or
Fig. 1 Impact of a malicious modulator MR, b malicious detector MR on data in DWDM-based photonic waveguides (Chittamuru et al. 2018b)
depletes) free carriers into (or from) the Si core of an MR using an electrical tuning circuit, which reduces (or increases) the MR’s refractive index owing to the electro-optic effect, thereby remedying the PV-induced red (or blue) shift in the MR’s resonance wavelength. Thermal tuning employs an integrated micro-heater to adjust the temperature and refractive index of an MR (owing to the thermo-optic effect) for PV remedy. Typically, modulator and detector MRs use the same electro-optic effect (i.e., carrier injection/depletion), implemented through the same electrical tuning circuit as used for localized trimming, to move in and out of resonance (i.e., switch ON/OFF) with a wavelength (Chittamuru et al. 2016a). An HT can manipulate this electrical tuning circuit, which may lead to malicious operation of modulator and detector MRs, as discussed next. Figure 1a shows the malicious operation of a modulator MR. A malicious modulator MR is partially tuned to a data-carrying wavelength (shown in purple) that is passing by in the waveguide. The malicious modulator MR draws some power from the data-carrying wavelength, which can ultimately lead to data corruption, as optical ‘1’s in the data can lose enough power to be altered into ‘0’s. Alternatively, a malicious detector MR that is partially tuned to a passing data-carrying wavelength (Fig. 1b) can filter only a small amount of its power and drop it on a photodetector for data duplication. This small amount of filtered power does not alter the data in the waveguide, so the data continues to travel to its target detector for legitimate communication (Li et al. 2012). Further, a malicious detector MR can also cause data corruption (by partially tuning to a wavelength) and denial of communication (by fully tuning to a wavelength). Thus, both malicious modulator and detector MRs can corrupt data (which can be detected and corrected) or cause Denial of Service (DoS) type of security attacks.
In addition, malicious detector MRs can also snoop data from the waveguide without altering it. Thus, the major security threat in photonic links is malicious detector MRs snooping data from the waveguide without altering it. Note that malicious modulator MRs only corrupt data (which can be detected) and do not covertly duplicate it, and are thus not a major security risk.
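To make the difference between partial and full tuning concrete, the following minimal sketch estimates how much of a wavelength's power an MR removes as a function of its detuning. It assumes an idealized, lossless Lorentzian drop response and a 0.1 nm linewidth; both the line shape and the number are illustrative assumptions that do not come from this chapter, and `dropped_fraction` is a hypothetical helper name.

```python
def dropped_fraction(detuning_nm: float, fwhm_nm: float = 0.1) -> float:
    """Fraction of a wavelength's power that an MR drops.

    Assumes an idealized Lorentzian response with full width at half
    maximum `fwhm_nm` (illustrative value, not taken from the chapter).
    """
    return 1.0 / (1.0 + (2.0 * detuning_nm / fwhm_nm) ** 2)


# Fully tuned (0 nm detuning) removes the signal: denial of communication.
# A small detuning removes a large share: '1's may be read as '0's.
# A larger detuning taps only a small share: covert snooping.
for detuning in (0.0, 0.05, 0.15, 0.5):
    tapped = dropped_fraction(detuning)
    print(f"detuning {detuning:.2f} nm: tapped {tapped:.3f}, remaining {1 - tapped:.3f}")
```

Under these assumptions, the further a malicious detector MR is detuned, the less power it removes, which is why a snooping MR can duplicate data while the legitimate receiver still sees an essentially unaltered signal.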
5.2 Link-Level Security Concerns
Typically, a photonic link contains one or more DWDM-based photonic waveguides. A DWDM-based photonic waveguide uses a modulator bank (a series of modulator MRs) at the source GI and a detector bank (a series of detector MRs) at the destination GI. DWDM-based waveguides can be broadly classified into four types: single-writer-single-reader (SWSR), single-writer-multiple-reader (SWMR), multiple-writer-single-reader (MWSR), and multiple-writer-multiple-reader (MWMR). We restrict our link-level analysis to MWMR waveguides, as SWSR, SWMR, and MWSR waveguides are subsets of MWMR. An MWMR waveguide typically passes through multiple GIs, connecting the modulator banks of some GIs to the detector banks of the remaining GIs. In an MWMR waveguide, multiple GIs (referred to as source GIs) can send data using their modulator banks, and multiple GIs (referred to as destination GIs) can receive (read) data using their detector banks. An MWMR waveguide with two source GIs and two destination GIs is shown in Fig. 2 as an example. The impact of malicious source and destination GIs on this MWMR waveguide is presented in Fig. 2a and b, respectively. The modulator bank of source GI S1 is sending data to the detector bank of destination GI D2. When source GI S2, which is in the communication path, becomes malicious with an HT in its control logic, it can manipulate its modulator bank to modify the existing ‘1’s in the data to ‘0’s, leading to data corruption. For example, in Fig. 2a, S1 is supposed to send ‘0110’ to D2, but due to data corruption by malicious GI
Fig. 2 Impact of a malicious modulator (source) bank, b malicious detector bank on data in DWDM-based photonic waveguides (Chittamuru et al. 2018b)
S2, ‘0010’ is received by D2. However, parity or error correction code (ECC) bits in the data can be used to detect and correct this type of data corruption. Thus, malicious source GIs do not cause major security risks in DWDM-based MWMR waveguides. Let us consider another scenario for the same data communication path (i.e., from S1 to D2). When destination GI D1, which is in the communication path, becomes malicious with an HT in its control logic, the detector bank of D1 can be partially tuned to the utilized wavelength channels to snoop data. In the example shown in Fig. 2b, D1 snoops ‘0110’ from the wavelength channels that are destined to D2. The sensitive information can be determined by transferring the snooped data from D1 to a malicious core within the CMP. Since the intended communication among CMP cores is not disrupted, this type of snooping attack from malicious destination GIs is hard to detect. Therefore, there is a pressing need to address the security risks imposed by snooping GIs in DWDM-based PNoC architectures. To address this need, we propose a novel framework, SOTERIA, that improves hardware security in DWDM-based PNoC architectures.
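The contrast between the two attacks in Fig. 2 can be sketched in a few lines. The example below is only an illustration: it uses a single even-parity bit as the simplest stand-in for the parity/ECC bits mentioned above, it assumes the parity bit itself arrives intact, and the helper names are hypothetical. A single parity bit detects only an odd number of flipped bits; a real ECC would be needed for correction.

```python
def corrupt_by_modulator(bits: str, drain_mask: str) -> str:
    """A malicious modulator can only drain power, turning '1's into '0's."""
    return "".join("0" if m == "1" else b for b, m in zip(bits, drain_mask))


def even_parity(bits: str) -> int:
    """Parity bit that makes the total number of '1's even."""
    return bits.count("1") % 2


sent = "0110"                      # S1 -> D2, as in Fig. 2
parity_bit = even_parity(sent)     # assumed to reach D2 unaltered

# Fig. 2a: malicious S2 drains the second bit, so D2 receives '0010'.
received = corrupt_by_modulator(sent, "0100")
corruption_detected = even_parity(received) != parity_bit

# Fig. 2b: malicious D1 copies the data; D2 still receives '0110'.
snooped = sent
snooping_detected = even_parity(sent) != parity_bit

print(received, corruption_detected)   # corruption changes the parity
print(snooped, snooping_detected)      # snooping leaves no trace at D2
```

The sketch reflects the argument of this section: corruption by a malicious source GI changes what the destination receives and can therefore be caught, whereas snooping leaves the received data untouched and cannot be caught by checks at the destination.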
6 SOTERIA Framework: Overview
Our proposed multi-layer SOTERIA framework enables secure communication in DWDM-based PNoC architectures by integrating circuit-level and architecture-level enhancements. Figure 3 gives a high-level overview of this framework. The PV-based security enhancement (PVSC) scheme uses the PV profile of the destination GIs’ detector MRs to encrypt data before it is transmitted via the photonic waveguide. This scheme is sufficient to protect data from snooping GIs if they are not aware of the target destination GI. However, a snooping GI can decipher the encrypted data if the target
Fig. 3 Overview of proposed SOTERIA framework that integrates a circuit-level PV-based security enhancement (PVSC) scheme and an architecture-level reservation-assisted security enhancement (RVSC) scheme (Chittamuru et al. 2018b)
destination GI information is known. Many PNoC architectures (e.g., Ancajas et al. 2014; Chen and Joshi 2013) use the same waveguide to transmit both the destination GI information and the actual data, making them vulnerable to data snooping attacks despite using PVSC. To enhance the security of these PNoCs, we devise an architecture-level reservation-assisted security enhancement (RVSC) scheme that uses a secure reservation waveguide to prevent snooping GIs from stealing the destination GI information. The next two sections present details of our PVSC and RVSC schemes.
7 PV-Based Security Enhancement
As discussed earlier (Sect. 5.2), malicious destination GIs can snoop data from a shared waveguide. Data encryption can be used to address this security concern, so that the malicious destination GIs cannot decipher the snooped data. For the encrypted data to be truly undecipherable, the encryption key should be kept secret from the snooping GIs, which can be challenging as the identity of the snooping GIs in a PNoC is not known. Therefore, it becomes very difficult to decide whether or not to share the encryption key with a destination GI (that can be malicious) for data decryption. To resolve this conundrum, each destination GI can have a different key, so that a key that is specific to a secure destination GI does not need to be shared with a malicious destination GI for decryption. Moreover, to keep these destination-specific keys secure, the malicious GIs in a PNoC must not be able to clone the algorithm (or method) used to generate these keys. To generate unclonable encryption keys, our PV-based security (PVSC) scheme uses the PV profiles of the destination GIs’ detector MRs. As discussed in Selvaraja (2011), PV induces random shifts in the resonance wavelengths of the MRs used in a PNoC. These resonance shifts can range from −3 to 3 nm (Selvaraja 2011). PV profiles are different for the MRs that belong to different GIs in a PNoC. In fact, the MRs that belong to different MR banks of the same GI also have different PV profiles. Due to their random nature, these MR PV profiles cannot be cloned by the malicious GIs, which makes the encryption keys generated using these PV profiles truly unclonable. PVSC uses the PV profiles of detector MRs to generate a unique encryption key for each detector bank of every MWMR waveguide in a PNoC. Our PVSC scheme generates encryption keys during the testing phase of the CMP chip, by using a dithering signal based in-situ method (Padmaraju et al.
2013) to generate an anti-symmetric analog error signal for each detector MR of every detector bank that is proportional to the PV-induced resonance shift in the detector MR. Then, it converts the analog error signal into a 64-bit digital signal. Thus, a 64-bit digital error signal is generated for every detector MR of each detector bank. We consider 64 DWDM wavelengths per waveguide, and hence, we have 64 detector MRs in every detector bank and 64 modulator MRs in every modulator bank. For each detector bank, our PVSC scheme XORs the 64 digital error signals (of 64 bits each) from each of the 64 detector MRs to create a unique 64-bit encryption key. Note
Fig. 4 Overview of proposed PV-based security enhancement scheme (Chittamuru et al. 2018b)
that our PVSC scheme also uses the same anti-symmetric error signals to control the carrier injection and heating of the MRs to remedy the PV-induced shifts in their resonances. To understand how the 64-bit encryption key is utilized to encrypt data in photonic links, consider Fig. 4, which depicts an example photonic link that has one MWMR waveguide and connects the modulator banks of two source GIs (S1 and S2) with the detector banks of two destination GIs (D1 and D2). PVSC creates two 64-bit encryption keys corresponding to the two destination GIs on the link, and stores them at the source GIs. When data is to be transmitted by a source GI, the key for the appropriate destination is used to encrypt data at the flit-level granularity, by performing an XOR between the key and the data flit. This requires an encryption key that matches the data flit size. We consider the size of data flits to be 512 bits. Therefore, the 64-bit encryption key is concatenated eight times to generate a 512-bit encryption key. In Fig. 4, the 512-bit encryption keys (for destination GIs D1 and D2) are stored in the local ROM of each source GI, whereas every destination GI stores only its own 512-bit key in its ROM. We store the full 512-bit keys, rather than the 64-bit keys, to eliminate the latency overhead of concatenating 64-bit keys into 512-bit keys at run time, at the cost of a reasonable area/energy overhead in the ROM. As an example, if S1 wants to send a data flit to D2, then S1 first accesses the 512-bit encryption key corresponding to D2 from its local ROM, XORs the data flit with this key in one cycle, and then transmits the encrypted data flit over the link. As the link employs only one waveguide with 64 DWDM wavelengths, the encrypted 512-bit data flit is transferred on the link to D2 in eight cycles. At D2, the data flit is decrypted by XORing it with the 512-bit key corresponding to D2 from the local ROM.
In this scheme, D1 cannot decipher the data even if D1 snoops the data intended for D2, as it does not have access to the correct key (corresponding to D2) for decryption. Thus, our PVSC encryption scheme protects data against snooping attacks in DWDM-based PNoCs.
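The key generation and encryption steps described above can be sketched as follows. This is a minimal software illustration under stated assumptions, not the hardware implementation: the measured PV error signals are replaced by seeded pseudo-random numbers (`fake_pv_signature` is a hypothetical stand-in), and all function names are illustrative. The constants (64 wavelengths, 64-bit keys, 512-bit flits) are the values used in this chapter.

```python
import random
from functools import reduce
from typing import Sequence

WAVELENGTHS = 64   # detector MRs per detector bank (value used in the chapter)
KEY_BITS = 64      # width of each digitized error signal and of the bank key
FLIT_BITS = 512    # data flit size (value used in the chapter)


def bank_key(error_signals: Sequence[int]) -> int:
    """XOR the digitized error signals of one detector bank into a 64-bit key."""
    return reduce(lambda acc, sig: acc ^ sig, error_signals, 0)


def expand_key(key64: int) -> int:
    """Concatenate the 64-bit key eight times to match the 512-bit flit size."""
    key512 = 0
    for _ in range(FLIT_BITS // KEY_BITS):
        key512 = (key512 << KEY_BITS) | key64
    return key512


def xor_flit(flit: int, key512: int) -> int:
    """Encrypt or decrypt a flit; XOR with the same key is its own inverse."""
    return flit ^ key512


def fake_pv_signature(rng: random.Random) -> list:
    """Hypothetical stand-in for the measured PV error signals of one bank."""
    return [rng.getrandbits(KEY_BITS) for _ in range(WAVELENGTHS)]


rng = random.Random(2023)  # seeded only so that the sketch is repeatable
# ROM contents at a source GI: one 512-bit key per destination GI.
rom = {gi: expand_key(bank_key(fake_pv_signature(rng))) for gi in ("D1", "D2")}

flit = rng.getrandbits(FLIT_BITS)
encrypted = xor_flit(flit, rom["D2"])          # S1 encrypts a flit for D2
print(xor_flit(encrypted, rom["D2"]) == flit)  # decryption with D2's key
print(xor_flit(encrypted, rom["D1"]) == flit)  # decryption attempt with D1's key
print(FLIT_BITS // WAVELENGTHS)                # transfer cycles on one waveguide
```

Because XOR is its own inverse, the same `xor_flit` operation serves for both encryption at the source GI and decryption at the destination GI, which matches the single-cycle XOR step described above.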
Limitations of PVSC: The PVSC scheme can protect data from being deciphered by a snooping GI if the following two conditions about the underlying PNoC architecture hold true: (i) the snooping GI does not know the target destination GI for the snooped data, and (ii) the snooping GI cannot access the encryption key corresponding to the target destination GI. As discussed earlier, an encryption key is stored only at the source GIs and at the corresponding destination GI, which makes it physically inaccessible to a snooping destination GI. However, if more than one GI in a PNoC is compromised due to HTs in their control units, and if these HTs launch a coordinated snooping attack, then it may be possible for the snooping GI to access the encryption key corresponding to the target destination GI. For instance, consider the photonic link in Fig. 4. If both S1 and D1 are compromised, then the HT in S1’s control unit can access the encryption keys corresponding to both D1 and D2 from its ROM and transfer them to a malicious core (a core running a malicious program). Moreover, the HT in D1’s control unit can snoop the data intended for D2 and transfer it to the malicious core. Thus, the malicious core may have access to the snooped data as well as the encryption keys stored at the source GIs. Nevertheless, accessing the encryption keys stored at the source GIs is not sufficient for the malicious GI (or core) to decipher the snooped data. This is because the compromised ROM typically has multiple encryption keys corresponding to multiple destination GIs, and choosing the correct key that can decipher the data requires knowledge of the target destination GI. Thus, our PVSC encryption scheme can secure data communication in PNoCs as long as the malicious GIs (or cores) do not know the target destinations of the snooped data. Unfortunately, many PNoC architectures (e.g., Ancajas et al.
2014; Chen and Joshi 2013) that employ photonic links with multiple destination GIs utilize the same waveguide to transmit both the target destination information and the actual data. In such PNoCs, if a malicious GI manages to tap the target destination information from the shared waveguide, it can then access the correct encryption key from the compromised ROM to decipher the snooped data. Thus, there is a need to conceal the target destination information from malicious GIs (cores). This motivates us to propose an architecture-level solution, as discussed next.
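The coordinated attack described above can be illustrated with a small sketch. It uses 8-bit toy values in place of 512-bit keys and flits, and the ROM contents, destination names, and helper names are all hypothetical. The sketch follows the chapter's assumption that, without the destination information, the attacker has no way to tell which candidate plaintext is the real one.

```python
def candidate_plaintexts(encrypted: int, compromised_rom: dict) -> dict:
    """Without the destination, every stored key yields one candidate plaintext."""
    return {dest: encrypted ^ key for dest, key in compromised_rom.items()}


def decipher_with_destination(encrypted: int, compromised_rom: dict, dest: str) -> int:
    """With the tapped destination, the attacker picks the correct key directly."""
    return encrypted ^ compromised_rom[dest]


# Toy ROM leaked by a compromised source GI (illustrative 8-bit keys).
compromised_rom = {"D1": 0x0F, "D2": 0xF0, "D3": 0x3C}
flit = 0xA5
encrypted = flit ^ compromised_rom["D2"]   # flit sent to D2 and snooped by D1

# Keys only: three different candidates, one per destination GI.
print({d: hex(p) for d, p in candidate_plaintexts(encrypted, compromised_rom).items()})

# Keys plus destination information tapped from the shared waveguide.
print(hex(decipher_with_destination(encrypted, compromised_rom, "D2")))
```

The number of candidates grows with the number of destination GIs on the link, whereas a single tapped reservation signal collapses them to one, which is why the destination information needs to be concealed.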
8 Reservation-Assisted Security Enhancement
In PNoCs that use photonic links with multiple destination GIs, data is typically transferred in two time-division-multiplexed (TDM) slots called the reservation slot and the data slot (Ancajas et al. 2014; Chen and Joshi 2013). To minimize photonic hardware, these PNoCs use the same waveguide to transfer both slots, as shown in Fig. 5a. To enable reservation of the waveguide, each destination is assigned a reservation selection wavelength. In Fig. 5a, λ1 and λ2 are the reservation selection wavelengths corresponding to destination GIs D1 and D2, respectively. Ideally, when a destination GI detects its reservation selection wavelength in the reservation slot, it switches ON its detector bank to receive data in the next data slot. But in the presence of an HT,
Fig. 5 Reservation-assisted data transmission in DWDM-based photonic waveguides a without RVSC, b with RVSC (Chittamuru et al. 2018b)
a malicious GI can snoop signals from the reservation slot using the same detector bank that is used for data reception. For example, in Fig. 5a, malicious GI D1 is using one of its detectors to snoop λ2 from the reservation slot. By snooping λ2, D1 can identify that the data it will snoop in the subsequent data slot is intended for destination D2. Thus, D1 can now decipher its snooped data by choosing the correct encryption key from the compromised ROM. To address this security risk, we propose an architecture-level reservation-assisted security enhancement (RVSC) scheme. In RVSC, a reservation waveguide is added, whose main function is to carry reservation slots, whereas data slots are carried by the data waveguide. We use double MRs to switch the signals of reservation slots from the data waveguide to the reservation waveguide, as shown in Fig. 5b. Double MRs are used instead of single MRs for switching to ensure that the switched signals do not reverse their propagation direction after switching (Chittamuru et al. 2018a). Double MRs also have lower signal loss than single MRs due to the steeper roll-off of their filter responses (Chittamuru et al. 2018a). In a photonic link, the double MRs are switched ON only in a reservation slot; otherwise, they are switched OFF to let the signals of the data slot pass by in the data waveguide. Furthermore, in RVSC, each destination GI has only one detector on
I. G. Thakkar et al.
the reservation waveguide, which corresponds to its reservation selection wavelength. For example, in Fig. 5b, D1 and D2 have detectors corresponding to their reservation selection wavelengths λ1 and λ2, respectively, on the reservation waveguide. As Fig. 5b shows, this makes it difficult for the malicious GI D1 to snoop λ2 from the reservation slot, as D1 does not have a detector corresponding to λ2 on the reservation waveguide. However, the HT in D1's control unit may still attempt to snoop other reservation wavelengths (e.g., λ2) in the reservation slot by retuning D1's λ1 detector. To succeed in these attempts, the HT would be required to perfect both the timing and the target wavelength of its snooping attack, which is very difficult due to the large number of utilized reservation wavelengths. Thus, D1 cannot identify the correct encryption key to decipher the snooped data. In summary, RVSC enhances security in PNoCs by protecting data from snooping attacks, even if the encryption keys used to secure data are compromised.
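The effect of RVSC on what a compromised GI can observe can be illustrated with a toy model. The sketch below is not part of the chapter's evaluation: the function names, the wavelength labels, and the simplification that a GI observes exactly the reservation wavelengths it has detectors for are assumptions made for illustration, and detector retuning by the HT is deliberately ignored.

```python
# Toy model (illustrative assumption, not the chapter's simulator):
# a GI can read a reservation wavelength only if it has a detector for it
# on the waveguide that carries the reservation slot.

def observable_reservations(own_wavelength, all_wavelengths, rvsc_enabled):
    """Return the reservation wavelengths a destination GI can detect."""
    if rvsc_enabled:
        # With RVSC: a single detector on the reservation waveguide,
        # tuned to the GI's own reservation selection wavelength.
        return {own_wavelength}
    # Without RVSC: the data detector bank sits on the shared waveguide,
    # so every wavelength in the reservation slot is exposed.
    return set(all_wavelengths)


def can_pick_key(snooper_wavelength, target_wavelength, all_wavelengths, rvsc_enabled):
    """True if a malicious GI learns which destination the next data slot targets."""
    seen = observable_reservations(snooper_wavelength, all_wavelengths, rvsc_enabled)
    return target_wavelength != snooper_wavelength and target_wavelength in seen


wavelengths = ["lambda1", "lambda2"]  # D1 -> lambda1, D2 -> lambda2, as in Fig. 5
print(can_pick_key("lambda1", "lambda2", wavelengths, rvsc_enabled=False))  # True
print(can_pick_key("lambda1", "lambda2", wavelengths, rvsc_enabled=True))   # False
```

In this simplified view, D1 can match snooped data to the key of D2 only in the shared-waveguide case, which mirrors the argument made for Fig. 5a and Fig. 5b.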
9 Implementing SOTERIA Framework on PNoCs
We characterize the impact of SOTERIA on two popular PNoC architectures: Firefly (Pan et al. 2009) and Flexishare (Pan et al. 2010), both of which use DWDM-based photonic waveguides for data communication. We consider a Firefly PNoC with an 8 × 8 SWMR crossbar (Pan et al. 2009) and a Flexishare PNoC with a 32 × 32 MWMR crossbar (Pan et al. 2010) with 2-pass token stream arbitration. We adapt the analytical equations from Chittamuru et al. (2018a) to model the signal power loss and required laser power in the SOTERIA-enhanced Firefly and Flexishare PNoCs. XOR gates are required to enable parallel encryption and decryption of 512-bit data flits at each source and destination GI of the SOTERIA-enhanced Firefly and Flexishare PNoCs. The overhead for encryption and decryption of every data flit was taken as a 1-cycle delay. The overall laser power and delay overheads for both PNoCs are quantified in the results section.
Firefly PNoC: Firefly PNoC (Pan et al. 2009), for a 256-core system, has 8 clusters (C1–C8) with 32 cores in each cluster. Firefly uses reservation-assisted SWMR data channels in its 8 × 8 crossbar for inter-cluster communication. Each data channel consists of 8 SWMR waveguides, with 64 DWDM wavelengths in each waveguide. To integrate SOTERIA with Firefly PNoC, a reservation waveguide was added to every SWMR channel. This reservation waveguide has 7 detector MRs to detect reservation selection wavelengths corresponding to 7 destination GIs. Furthermore, 64 double MRs (corresponding to 64 DWDM wavelengths) are used at each reservation waveguide to implement RVSC. To enable PVSC, each source GI has a ROM with seven entries of 512 bits each to store seven 512-bit encryption keys corresponding to seven destination GIs. In addition, each destination GI requires a 512-bit ROM to store its own encryption key.
Flexishare PNoC: We also integrate SOTERIA with the Flexishare PNoC architecture (Pan et al. 2010) with 256 cores. 
We considered a 64-radix 64-cluster Flexishare
PNoC with four cores in each cluster and 32 data channels for inter-cluster communication. Each data channel has four MWMR waveguides, each with 64 DWDM wavelengths. In SOTERIA-enhanced Flexishare, we added a reservation waveguide to each MWMR channel. Each reservation waveguide has 16 detector MRs to detect reservation selection wavelengths corresponding to 16 destination GIs. Enabling PVSC requires a ROM with 16 entries of 512 bits each at every source GI to store the encryption keys, and a 512-bit ROM at every destination GI.
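The per-flit encryption step and the key-storage figures quoted above can be sketched as follows. This is a minimal illustration that assumes the encryption is a plain bitwise XOR of the 512-bit flit with a 512-bit key; the function names are hypothetical, and the key and payload are random values generated only for the example.

```python
import secrets

FLIT_BITS = 512  # flit and key width stated in the text


def xor_flit(flit: int, key: int) -> int:
    """Bitwise XOR of a 512-bit flit with a 512-bit key (encrypts or decrypts)."""
    return (flit ^ key) & ((1 << FLIT_BITS) - 1)


def source_rom_bits(num_destinations: int) -> int:
    """ROM capacity at a source GI: one 512-bit key per destination GI."""
    return num_destinations * FLIT_BITS


# Hypothetical key and payload, generated only for this example.
key = secrets.randbits(FLIT_BITS)
flit = secrets.randbits(FLIT_BITS)

cipher = xor_flit(flit, key)
assert xor_flit(cipher, key) == flit  # XOR with the same key restores the flit

print(source_rom_bits(7))   # Firefly source GI: 7 keys  -> 3584 bits
print(source_rom_bits(16))  # Flexishare source GI: 16 keys -> 8192 bits
```

Because XOR is its own inverse, the same gate array can serve both the source (encryption) and the destination (decryption), which is consistent with the single-cycle overhead assumed in the text.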
10 Evaluations
10.1 Evaluation Setup
To evaluate our proposed SOTERIA (PVSC + RVSC) security enhancement framework for DWDM-based PNoCs, we integrate it with the Firefly (Pan et al. 2009) and Flexishare (Pan et al. 2010) PNoCs, as explained in Sect. 9. We modeled and performed simulation-based analysis of the SOTERIA-enhanced Firefly and Flexishare PNoCs using a cycle-accurate SystemC-based NoC simulator, for a 256-core single-chip architecture at 22 nm. The power dissipation and energy consumption were validated using the DSENT tool (Sun et al. 2012). We used real-world traffic from the PARSEC benchmark suite (Bienia et al. 2008). GEM5 full-system simulation (Binkert et al. 2011) of parallelized PARSEC applications was used to generate traces that were fed into our NoC simulator. We set a "warmup" period of 100 million instructions and then captured traces for the subsequent 1 billion instructions. These traces are extracted from parallel regions of execution of PARSEC applications. We performed geometric calculations for a 20 mm × 20 mm chip size to determine the lengths of the SWMR and MWMR waveguides in Firefly and Flexishare. Based on this analysis, the time needed for light to travel from the first to the last node was estimated as 8 cycles at a 5 GHz clock frequency (Chittamuru et al. 2017). We use a 512-bit packet size, as advocated in the Firefly and Flexishare PNoCs. Similar to Chittamuru et al. (2018a), we adapt the VARIUS tool (Sarangi et al. 2008) to model random and systematic die-to-die (D2D) as well as within-die (WID) process variations in MRs for the Firefly and Flexishare PNoCs. The static and dynamic energy consumption values for electrical routers and concentrators in Firefly and Flexishare PNoCs are based on results from DSENT (Sun et al. 2012). We model and consider the area, power, and performance overheads for our framework implemented with the Firefly and Flexishare PNoCs as follows. 
SOTERIA with Firefly and Flexishare PNoCs has an electrical area overhead of 12.7 mm² and 3.4 mm², respectively, and a power overhead of 0.44 W and 0.36 W, respectively, estimated using gate-level analysis and the CACTI 6.5 (CACTI 2021) tool for memory and buffers. The photonic area of Firefly and Flexishare PNoCs is 19.83 mm² and 5.2 mm², respectively, based on the physical dimensions (Xiao et al. 2007) of their waveguides, MRs, and splitters. For energy consumption of photonic devices, we
adapt model parameters from recent work (Chittamuru et al. 2017; Thakkar et al. 2016) with 0.42 pJ/bit for every modulation and detection event and 0.18 pJ/bit for the tuning circuits of modulators and photodetectors. The MR trimming power is 130 µW/nm (Dang et al. 2017) for current injection and tuning power is 240 µW/nm (Dang et al. 2017) for heating.
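As a quick sanity check on the setup parameters, the sketch below converts the quoted figures into per-packet quantities. It assumes that every bit of a 512-bit packet incurs one modulation-and-detection event plus the associated tuning-circuit energy; that accounting, like the constant and function names, is an illustrative assumption rather than the chapter's energy model.

```python
PACKET_BITS = 512             # packet size used in the evaluation
E_MOD_DET_PJ_PER_BIT = 0.42   # modulation and detection energy (pJ/bit)
E_TUNING_PJ_PER_BIT = 0.18    # tuning-circuit energy (pJ/bit)
CLOCK_HZ = 5e9                # 5 GHz clock
FLIGHT_CYCLES = 8             # first-to-last-node propagation, in cycles


def packet_device_energy_pj(bits=PACKET_BITS):
    """Photonic device energy for one packet, under the per-bit assumption above."""
    return bits * (E_MOD_DET_PJ_PER_BIT + E_TUNING_PJ_PER_BIT)


def flight_time_ns(cycles=FLIGHT_CYCLES, clock_hz=CLOCK_HZ):
    """Time of flight across the waveguide in nanoseconds."""
    return cycles / clock_hz * 1e9


print(round(packet_device_energy_pj(), 1))  # 307.2 (pJ per 512-bit packet)
print(round(flight_time_ns(), 2))           # 1.6 (ns for 8 cycles at 5 GHz)
```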
10.2 Overhead Analysis of SOTERIA on PNoCs
Our first set of experiments compares the baseline (without any security enhancements) Firefly and Flexishare PNoCs with their SOTERIA-enhanced variants. As described in Sect. 9, all 8 SWMR waveguide groups of the Firefly PNoC and all 32 MWMR waveguide groups of the Flexishare PNoC are equipped with PVSC encryption/decryption and with reservation waveguides for the RVSC scheme. The total signal loss at the detectors of the worst-case power loss node ($N_{WCPL}$), which corresponds to router C4R0 for the Firefly PNoC (Pan et al. 2009) and node R63 for the Flexishare PNoC (Pan et al. 2010), was calculated by adapting the analytical models from Chittamuru et al. (2018a). Figure 6a summarizes the worst-case signal loss results for the baseline and SOTERIA configurations for the two PNoC architectures. On average, the loss increases by 1.6 dB for the Firefly PNoC with SOTERIA and by 1.2 dB for the Flexishare PNoC with SOTERIA, compared to their respective baselines. Compared to the baseline PNoCs that have no single or
Fig. 6 Comparison of a worst-case signal loss and b laser power dissipation of SOTERIA framework on Firefly and Flexishare PNoCs with their respective baselines considering 100 process variation maps (Chittamuru et al. 2018b)
double MRs to switch the signals of the reservation slots, the double MRs used in the SOTERIA-enhanced PNoCs to switch the wavelength signals of the reservation slots increase through losses in the waveguides, which ultimately increases the worst-case signal losses in the SOTERIA-enhanced PNoCs. Using the worst-case signal losses shown in Fig. 6a, we determine the total photonic laser power and the corresponding electrical laser power for the baseline and SOTERIA-enhanced variants of Firefly and Flexishare PNoCs, shown in Fig. 6b. From this figure, the laser power overheads are 44.7% and 31.4% on average for the Firefly and Flexishare PNoCs with SOTERIA, compared to their baselines.
Figure 7 presents detailed simulation results that quantify the average packet latency and energy-delay product (EDP) for the two configurations of the Firefly and Flexishare PNoCs. Results are shown for twelve multi-threaded PARSEC benchmarks. From Fig. 7a, Firefly with SOTERIA has 5.2% and Flexishare with SOTERIA has 10.6% higher latency on average compared to their respective baselines. The increase in average latency is caused by the additional delay for encryption and decryption of data (Sect. 7) with PVSC. From the results for EDP shown in Fig. 7b, Firefly with SOTERIA has 4.9% and Flexishare with SOTERIA has 13.3% higher EDP on average compared to their respective baselines. The increase in EDP for the SOTERIA-enhanced PNoCs results from the increase in their average packet latency and from the additional RVSC reservation waveguides, which increase the required photonic hardware (e.g., a larger number of MRs). This in turn increases static energy consumption (i.e., laser energy and trimming/tuning energy), ultimately increasing the EDP.
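The reported laser power overheads can be related to the added worst-case loss with a back-of-the-envelope conversion. The sketch below assumes that the required laser power scales directly with the linear value of the worst-case loss and that nothing else changes; this is a simplification for illustration, not the analytical model adapted from Chittamuru et al. (2018a), and the function name is hypothetical.

```python
def laser_power_overhead_pct(extra_loss_db):
    """Relative increase in required laser power for an added loss given in dB."""
    return (10 ** (extra_loss_db / 10) - 1) * 100


# (architecture, added worst-case loss in dB, reported laser power overhead in %)
cases = [("Firefly", 1.6, 44.7), ("Flexishare", 1.2, 31.4)]

for name, extra_db, reported in cases:
    estimate = laser_power_overhead_pct(extra_db)
    print(f"{name}: estimated {estimate:.1f}% vs reported {reported}%")
```

Under this assumption the estimates come out at roughly 44.5% and 31.8%, close to the reported 44.7% and 31.4%, which suggests that the added loss from the double MRs accounts for most of the laser power overhead.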
10.3 Analysis of Overhead Sensitivity
Our last set of evaluations explores how the overhead of SOTERIA changes with varying levels of security in the network. Typically, in a manycore system, sensitive information (i.e., keys) is present in only a certain portion of the data, and hence only a certain number of communication links need to be secure. Therefore, for our analysis in this section, we secure only a certain number of channels using SOTERIA, instead of securing all data channels of the Flexishare PNoC. Out of the total 32 MWMR channels in the Flexishare PNoC, we secure 4 (FLEX-ST-4), 8 (FLEX-ST-8), 16 (FLEX-ST-16), and 24 (FLEX-ST-24) channels, and evaluate the average packet latency and EDP for these variants of the SOTERIA-enhanced Flexishare PNoC. In Fig. 8, we present average packet latency and EDP values for the five SOTERIA-enhanced configurations of the Flexishare PNoC. From Fig. 8a, FLEX-ST-4, FLEX-ST-8, FLEX-ST-16, and FLEX-ST-24 have 1.8%, 3.5%, 6.7%, and 9.5% higher latency on average compared to the baseline Flexishare. Increasing the number of SOTERIA-enhanced MWMR waveguides increases the number of packets that are transferred through the PVSC encryption scheme, which contributes to the increase in average packet latency across these variants. From the results for EDP shown
Fig. 7 a Normalized average latency and b energy-delay product (EDP) comparison between different variants of Firefly and Flexishare PNoCs that include their baselines and their variant with SOTERIA framework, for PARSEC benchmarks. Latency results are normalized with their respective baseline architecture results. Bars represent mean values of average latency and EDP for 100 PV maps; confidence intervals show variation in average latency and EDP across PARSEC benchmarks (Chittamuru et al. 2018b)
in Fig. 8b, FLEX-ST-4, FLEX-ST-8, FLEX-ST-16, and FLEX-ST-24 have 2%, 4%, 7.6%, and 10.8% higher EDP on average compared to the baseline Flexishare. EDP in the Flexishare PNoC increases with the number of SOTERIA-enhanced MWMR waveguides. The overall EDP across these variants increases because of the higher average packet latency and the additional signal loss caused by the larger number of reservation waveguides and double MRs.
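The trend in these numbers can be made explicit by normalizing each overhead by the number of secured channels. The sketch below uses only the percentages reported in Sects. 10.2 and 10.3 (the 32-channel entry is the fully secured Flexishare from Sect. 10.2); the helper name and the per-channel normalization are illustrative choices, not part of the original analysis.

```python
secured_channels = [4, 8, 16, 24, 32]
latency_overhead_pct = [1.8, 3.5, 6.7, 9.5, 10.6]  # Fig. 8a and Fig. 7a
edp_overhead_pct = [2.0, 4.0, 7.6, 10.8, 13.3]     # Fig. 8b and Fig. 7b


def per_channel(overheads, channels):
    """Overhead contributed per secured channel, in percentage points."""
    return [o / c for o, c in zip(overheads, channels)]


latency_per_channel = per_channel(latency_overhead_pct, secured_channels)
edp_per_channel = per_channel(edp_overhead_pct, secured_channels)

for n, lat, edp in zip(secured_channels, latency_per_channel, edp_per_channel):
    print(f"{n:2d} channels: latency {lat:.3f} %/channel, EDP {edp:.3f} %/channel")
```

The per-channel cost stays below about half a percentage point and shrinks slightly as more channels are secured, so the overhead grows somewhat sublinearly with the secured fraction of the network.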
Fig. 8 a Normalized latency and b energy-delay product (EDP) comparison between Flexishare baseline and Flexishare with 4, 8, 16, and 24 SOTERIA enhanced MWMR waveguide groups, for PARSEC benchmarks. Latency results are normalized to the baseline Flexishare results (Chittamuru et al. 2018b)
11 Conclusion
We presented a novel security enhancement framework called SOTERIA that secures data during unicast communications in DWDM-based PNoC architectures from snooping attacks. Our proposed SOTERIA framework shows interesting trade-offs between security, performance, and energy overhead for the Firefly and Flexishare PNoC architectures. Our analysis shows that SOTERIA enables hardware security in crossbar-based PNoCs with minimal overheads of up to 10.6% in average latency
and of up to 13.3% in EDP compared to the baseline PNoCs. Thus, SOTERIA is an attractive solution to enhance hardware security in emerging DWDM-based PNoCs.
Acknowledgements This research is supported by grants from the University of Kentucky and NSF (CCF-1813370).
References
D.M. Ancajas et al., Fort-NoCs: mitigating the threat of a compromised NoC, in Proceedings of the DAC (2014)
G. Ascia, V. Catania, S. Monteleone, M. Palesi, D. Patti, J. Jose, Improving energy consumption of NoC based architectures through approximate communication, in 2018 7th Mediterranean Conference on Embedded Computing (MECO), Budva (2018), pp. 1–4
M. Baharloo, A. Khonsari, P. Shiri, I. Namdari, D. Rahmati, High-average and guaranteed performance for wireless networks-on-chip architectures, in IEEE Computer Society Annual Symposium on VLSI (ISVLSI), Hong Kong (2018), pp. 226–231
P. Bahrebar, D. Stroobandt, Hamiltonian path strategy for deadlock-free and adaptive routing in diametrical 2D mesh NoCs, in 5th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, Shenzhen (2015), pp. 1209–1212
C. Batten et al., Building manycore processor-to-dram networks with monolithic silicon photonics. Hot I, 21–30 (2008)
R. Ben Abdelhamid, Y. Yamaguchi, T. Boku, MITRACA: manycore interlinked torus reconfigurable accelerator architecture, in 2019 IEEE 30th International Conference on Application-Specific Systems, Architectures and Processors (ASAP), New York, NY, USA (2019), pp. 38–38
L. Benini, G.D. Micheli, Networks on chips: a new SoC paradigm. IEEE Comput. 35, 70–78 (2002)
G.B.P. Bezerra, S. Forrest, M. Forrest, A. Davis, P. Zarkesh Ha, Modeling NoC traffic locality and energy consumption with rent’s communication probability distribution, in Proceedings of the International Workshop on System Level Interconnect Prediction (SLIP’10) (2010), pp. 3–8
C. Bienia et al., The PARSEC benchmark suite: characterization and architectural implications, in PACT, Oct. 2008
N. Binkert et al., The gem5 simulator, in CA News, May 2011
A.K. Biswas, Efficient timing channel protection for hybrid (packet/circuit-switched) network-on-chip. IEEE Trans. Parallel Distrib. Syst. 29(5), 1044–1057 (2018)
T. Boraten, A.K. Kodi, Runtime techniques to mitigate soft errors in network-on-chip (NoC) architectures. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 37(3), 682–695 (2018)
T.H. Boraten, A.K. Kodi, Securing NoCs against timing attacks with non-interference based adaptive routing, in Twelfth IEEE/ACM International Symposium on Networks-on-Chip (NOCS), Turin (2018), pp. 1–8
CACTI 6.5 (2021), http://www.hpl.hp.com/research/cacti/
R. Chakraborty, S. Narasimhan, S. Bhunia, Hardware Trojan: threats and emerging solutions, in Proceedings of the HLDVT, Nov. 2009, pp. 166–171
C. Chen, A. Joshi, Runtime management of laser power in silicon-photonic multibus NoC architecture, in Proceedings of the IEEE JQE (2013)
S.V.R. Chittamuru, S. Pasricha, SPECTRA: a framework for thermal reliability management in silicon-photonic networks-on-chip, in Proceedings of the VLSID, Jan 2016
S.V.R. Chittamuru, I. Thakkar, S. Pasricha, Analyzing voltage bias and temperature induced aging effects in photonic interconnects for manycore computing, in Proceedings of the SLIP, June 2017
S.V.R. Chittamuru, I. Thakkar, S. Pasricha, HYDRA: heterodyne crosstalk mitigation with double microring resonators and data encoding for photonic NoCs. TVLSI 26(1) (2018a)
S.V.R. Chittamuru, I.G. Thakkar, V. Bhat, S. Pasricha, SOTERIA: exploiting process variations to enhance hardware security with photonic NoC architectures, in 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), San Francisco, CA (2018b)
S.V.R. Chittamuru, I. Thakkar, S. Pasricha, PICO: mitigating heterodyne crosstalk due to process variations and intermodulation effects in photonic NoCs, in Proceedings of the DAC, June 2016a
S.V.R. Chittamuru, I. Thakkar, S. Pasricha, Process variation aware crosstalk mitigation for DWDM based photonic NoC Architectures, in Proceedings of the ISQED, March 2016b
S.V.R. Chittamuru, S. Desai, S. Pasricha, SWIFTNoC: a reconfigurable silicon photonic network with multicast enabled channel sharing for multicore architectures. ACM JETC 13(4), 58 (2017)
W.J. Dally, B. Towles, Route packets, not wires, in Proceedings of the DAC (2001)
D. Dang, S.V.R. Chittamuru, R. Mahapatra, S. Pasricha, Islands of heaters: a novel thermal management framework for photonic NoCs, in Proceedings of the ASPDAC, Jan 2017
S. Das, K. Basu, J.R. Doppa, P.P. Pande, R. Karri, K. Chakrabarty, Abetting planned obsolescence by aging 3D networks-on-chip, in Twelfth IEEE/ACM International Symposium on Networks-on-Chip (NOCS), Turin (2018), pp. 1–8
B. de Dinechin, R. Ayrignac, P.-E. Beaucamps, P. Couvert, B. Ganne, P. de Massas, F. Jacquet, S. Jones, N. Chaisemartin, F. Riss, T. Strudel, A clustered manycore processor architecture for embedded and accelerated applications, in High Performance Extreme Computing Conference (HPEC) (2013)
D. DiTomaso, T. Boraten, A. Kodi, A. Louri, Dynamic error mitigation in NoCs using intelligent prediction techniques, in 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Taipei (2016), pp. 1–12
D. DiTomaso, A. Kodi, D. Matolak, S. Kaya, S. Laha, W. Rayess, A-WiNoC: adaptive wireless network-on-chip architecture for chip multiprocessors. IEEE Trans. Parallel Distrib. Syst. 26(12), 3289–3302 (2015)
H. Elmiligi, F. Gebali, M. Watheq El-Kharashi, A.A. Morgan, Traffic analysis of multi-core body sensor networks based on wireless NoC infrastructure, in 2015 IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM), Victoria, BC (2015), pp. 201–204
D. Fan, SmarCo: an efficient many-core processor for high-throughput applications in datacenters, in IEEE International Symposium on High Performance Computer Architecture (HPCA), Vienna (2018), pp. 596–607
C.H. Gebotys et al., A framework for security on NoC technologies, in Proceedings of the ISVLSI, Feb. 2003
J. Held, Single-chip cloud computer: an experimental many-core processor from intel labs, in Presented at Intel Labs Single-chip Cloud Computer Symposium (Santa Clara, California, 2010)
Y. Hoskote, S. Vangal, A. Singh, N. Borkar, S. Borkar, A 5-GHz mesh interconnect for a teraflops processor. IEEE Micro 51–61 (2007)
L. Huang, A lifetime-aware mapping algorithm to extend MTTF of networks-on-chip, in 23rd Asia and South Pacific Design Automation Conference (ASP-DAC), Jeju (2018), pp. 147–152
H.K. Kapoor et al., A security framework for NoC using authenticated encryption and session keys, in CSSP (2013)
J.S. Kim, J. Beom Hong, J.Y. Kang, T. Hee Han, Lifetime improvement method using threshold-based partial data compression in NoC, in International SoC Design Conference (ISOCC), Daegu, Korea (South) (2018), pp. 269–270
H. Kim, P. Ghoshal, B. Grot, P.V. Gratz, Reducing network-on-chip energy consumption through spatial locality speculation, in Proceedings of the International Symposium on Networks-On-Chip (NOCS’11) (2011), pp. 233–240
J.S. Kim, M.B. Taylor, J. Miller, D. Wentzlaff, Energy characterization of a tiled architecture processor with on-chip networks, in ISLPED ’03, New York, NY, USA (ACM, 2003)
H. Kim, A. Vitkovskiy, P.V. Gratz, V. Soteriou, Use it or lose it: wear-out and lifetime in future chip multiprocessors, in 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Davis, CA (2013), pp. 136–147
B. Lebiednik, S. Abadal, H. Kwon, T. Krishna, Architecting a secure wireless network-on-chip, in Twelfth IEEE/ACM International Symposium on Networks-on-Chip (NOCS), Turin (2018), pp. 1–8
C. Li et al., Energy-efficient optical broadcast for nanophotonic networks-on-chip, in Proceedings of the OIC (2012), pp. 64–65
C. Li, M. Browning, P.V. Gratz, S. Palermo, LumiNOC: a power-efficient, high-performance, photonic network-on-chip for future parallel architectures, in 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT), Minneapolis, MN (2012), pp. 421–422
K. Madden, J. Harkin, L. McDaid, C. Nugent, Adding security to networks-on-chip using neural networks, in IEEE Symposium Series on Computational Intelligence (SSCI), Bangalore, India (2018), pp. 1299–1306
M. Mattina, Architecture and Performance of the TILE-GX Processor Family. White Paper (Tilera Corporation, 2014)
H. McGhan, Niagara 2. Microprocessor Report (2006)
M. McKeown et al., Piton: a manycore processor for multitenant clouds. IEEE Micro 37(2), 70–80 (2017)
D.A.B. Miller, Device requirements for optical interconnects to silicon chips. JPROC 97(7), 1166–1185 (2009)
M.G. Moghaddam, Dynamic energy and reliability management in network-on-chip based chip multiprocessors, in Eighth International Green and Sustainable Computing Conference (IGSC), Orlando, FL (2017), pp. 1–4
K. Padmaraju et al., Wavelength locking and thermally stabilizing microring resonators using dithering signals. JLT 32(3) (2013)
Y. Pan et al., Firefly: illuminating future network-on-chip with nanophotonics, in Proceedings of the ISCA (2009)
Y. Pan, J. Kim, G. Memik, Flexishare: channel sharing for an energy efficient nanophotonic crossbar, in Proceedings of the HPCA (2010)
M. Parasar, A. Sinha, T. Krishna, Brownian bubble router: enabling deadlock freedom via guaranteed forward progress, in Twelfth IEEE/ACM International Symposium on Networks-on-Chip (NOCS), Turin (2018), pp. 1–8
S. Pasricha, N. Dutt, On-chip Communication Architectures (Morgan Kauffman, 2008)
A.C. Pinheiro, J.A.N. Silveira, D.A.B. Tavares, F.G.A. Silva, C.A.M. Marcon, Optimized fault-tolerant buffer design for network-on-chip applications, in IEEE 10th Latin American Symposium on Circuits and Systems (LASCAS), Colombia, Armenia (2019), pp. 217–220
S. Priya, S. Agarwal, H.K. Kapoor, Fault tolerance in network on chip using bypass path establishing packets, in 2018 31st International Conference on VLSI Design and 2018 17th International Conference on Embedded Systems (VLSID), Pune (2018), pp. 457–458
A. Ramrakhyani, T. Krishna, Static bubble: a framework for deadlock-free irregular on-chip topologies, in IEEE International Symposium on High Performance Computer Architecture (HPCA), Austin, TX (2017), pp. 253–264
A. Ramrakhyani, P.V. Gratz, T. Krishna, Synchronized progress in interconnection networks (SPIN): a new theory for deadlock freedom. IEEE Micro 39(3), 110–117 (2019)
V. Rathore, V. Chaturvedi, A.K. Singh, T. Srikanthan, M. Shafique, Towards scalable lifetime reliability management for dark silicon manycore systems, in IEEE 25th International Symposium on On-Line Testing and Robust System Design (IOLTS), Greece, Rhodes (2019), pp. 204–207
A. Rovinski, A 1.4 GHz 695 Giga Risc-V Inst/s 496-core manycore processor with mesh on-chip network and an all-digital synthesized PLL in 16nm CMOS, in Symposium on VLSI Circuits, Kyoto, Japan (2019), pp. C30–C31
A. Samih, R. Wang, A. Krishna, C. Maciocco, C. Tai, Y. Solihin, Energy-efficient interconnect via router parking, in IEEE 19th International Symposium on High Performance Computer Architecture (HPCA), Shenzhen (2013), pp. 508–519
S. Sarangi et al., VARIUS: a model of process variation and resulting timing errors for microarchitects. IEEE TSM 21(1), 3–13 (2008)
S.K. Selvaraja, Wafer-scale fabrication technology for silicon photonic integrated circuits. Ph.D. Thesis (Ghent University, 2011)
S. Skorobogatov, C. Woods, Breakthrough silicon scanning discovers backdoor in military chip, in Proceedings of the CHES, Sept. 2012, pp. 23–40
C. Sun et al., DSENT: a tool connecting emerging photonics with electronics for opto-electronic networks-on-chip modeling, in NOCS (2012)
J. Sun, Y. Zhang, An energy-aware mapping algorithm for mesh-based network-on-chip architectures, in 2017 International Conference on Progress in Informatics and Computing (PIC), Nanjing (2017), pp. 357–361
M. Tehranipoor, F. Koushanfar, A survey of hardware Trojan taxonomy and detection. IEEE Des. Test 10–25 (2009)
I. Thakkar, S.V.R. Chittamuru, S. Pasricha, Improving the reliability and energy-efficiency of high-bandwidth photonic NoC architectures with multilevel signaling, in Proceedings of the NOCS, Oct. 2017
I. Thakkar, S.V.R. Chittamuru, S. Pasricha, Mitigation of homodyne crosstalk noise in silicon photonic NoC architectures with tunable decoupling, in Proceedings of the CODES+ISSS, Oct. 2016
D. Vantrease, Corona: system implications of emerging nanophotonic technology, in International Symposium on Computer Architecture, Beijing (2008), pp. 153–164
C. Wang, W. Hu, N. Bagherzadeh, A wireless network-on-chip design for multicore platforms, in 2011 19th International Euromicro Conference on Parallel, Distributed and Network-Based Processing, Ayia Napa (2011), pp. 409–416
Z. Wang et al., CAMON: low-cost silicon photonic chiplet for manycore processors. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. (2019)
L. Wang, X. Wang, T. Mak, Adaptive routing algorithms for lifetime reliability optimization in network-on-chip. IEEE Trans. Comput. 65(9), 2896–2902 (2016)
S. Xiao, M.H. Khan, H. Shen, M. Qi, Modeling and measurement of losses in silicon-on-insulator resonators and bends. Opt. Express 15(17), 10553–10561 (2007)
L. Zhou, A.K. Kodi, PROBE: prediction-based optical bandwidth scaling for energy-efficient NoCs, in Proceedings of the IEEE/ACM International Symposium on Networks-on-Chip (NOCS) (2016)
Design Automation Flows
Synthesis and Technology Mapping for In-Memory Computing Debjyoti Bhattacharjee and Anupam Chattopadhyay
Abstract In this chapter, we introduce the preliminaries of in-memory computing using processing-in-memory platforms, such as memristive Memory Processing Units (mMPU), which allow leveraging data locality and performing stateful logic operations. To allow computing of arbitrary Boolean functions using such novel computing platforms, the development of design automation (EDA) flows is of critical importance. Typically, EDA flows consist of multiple phases. Technology-independent logic synthesis is the first step, where the input Boolean function is restructured without any specific technology constraints. It is generally followed by a technology-dependent optimization phase, where technology-specific hints are used to optimize the data structure obtained from the first step. The final step is technology mapping, which takes the optimized function representation and implements it under technology-specific constraints. In this chapter, we present an end-to-end mapping framework for mMPU with various mapping objectives. We begin the chapter by presenting an optimal technology mapping method with the goal of mapping a Boolean function on a single row of the mMPU. Thereafter, we propose a Look-Up Table (LUT) based mapping that attempts to minimize the delay of the mapping, without any area constraints. We then extend this method to work with area constraints. The proposed framework is modular and can be improved with more efficient heuristics as well as technology-specific optimizations. We present benchmarking results against other approaches throughout this chapter.
D. Bhattacharjee (B)
IMEC, Leuven, Belgium
A. Chattopadhyay
School of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore
© Springer Nature Singapore Pte Ltd. 2023
M. M. S. Aly and A. Chattopadhyay (eds.), Emerging Computing: From Devices to Systems, Computer Architecture and Design Methodologies, https://doi.org/10.1007/978-981-16-7487-7_10
D. Bhattacharjee and A. Chattopadhyay
1 Introduction
The separation between the processing units and the memory unit requires data transfer over energy-hungry buses. This data transfer bottleneck is popularly known as the memory wall. The overhead in terms of energy and delay associated with this transfer of data is considerably higher than the cost of the computation itself (Pedram et al. 2016). Extensive research has been conducted to overcome the memory wall, ranging from the classic memory hierarchy to the close integration of processing units within the memory (Aga et al. 2017; Seshadri et al. 2017). However, these methods still require transfer of data between the processing blocks and the memory, thus falling into the category of von Neumann architectures.
Processing data within the memory has emerged as a promising alternative to the von Neumann architecture. This is generally referred to as Logic-in-Memory (LiM). The primary approach to performing LiM is to store input variables and/or logic outputs in a memory cell. This is possible when the physical capabilities of the memory can be used for both data storage (as memory) and computation (as logic). Various memory technologies, including Resistive RAM (RRAM), Phase Change Memory (PCM), Spin-transfer torque magnetic random-access memory (STT-MRAM), and others, have been used to realize LiM computation (Lehtonen and Laiho 2009; Agrawal et al. 2018; Linn et al. 2012; Gaillardon et al. 2016; Kingra et al. 2020; Kvatinsky et al. 2014; Hamdioui et al. 2015).
RRAM is one of the contending technologies for logic-in-memory computation. RRAMs permit stateful logic, where the logical states are represented as the resistive states of the devices, which are at the same time capable of computation. Multiple functionally complete logic families have been successfully demonstrated using RRAM devices (Reuben et al. 2017). In the following, three prominent logic families are presented. 
Material Implication Logic (Lehtonen and Laiho 2009): Consider two RRAM devices p and q with internal states $S_p$ and $S_q$, respectively, as shown in Fig. 1a. By applying voltages to the terminals, material implication can be computed, with the next state (NS) of device p set to the result of the computation.

$$NS_p = S_p \rightarrow S_q \tag{1}$$

Majority Logic (Gaillardon et al. 2016): In this approach, shown in Fig. 1b, the wordline voltage ($V_{wl}$) and the bitline voltage ($V_{bl}$) act as logic inputs, while the internal resistive state $S_x$ of device x acts as a third input. The next state of device x is therefore a function of three inputs, as shown in the following equation.

$$NS_x = M_3(S_x, V_{wl}, V_{bl}) \tag{2}$$
Memristor-Aided loGIC (MAGIC) (Kvatinsky et al. 2014): MAGIC allows in-memory compute operations by using the internal resistive state of single or multiple RRAM devices as input. The exact number of inputs (k) depends on the specific device used for computation. The result of the computation is written to a new device (r), as shown in Fig. 1c. The internal resistive states of the input devices remain unchanged. Using MAGIC operations, multi-input NOR and NOT can be realized.

$$NS_r = \mathrm{NOR}(S_{i_1}, S_{i_2}, \ldots, S_{i_k}) \tag{3}$$

$$NS_r = \mathrm{NOT}(S_i) \tag{4}$$

Fig. 1 Logic primitives realized using memristors. a Material implication, b majority logic, c memristor aided logic (MAGIC)
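The three primitives of Eqs. (1) to (4) can be written down as plain Boolean functions to make their behaviour concrete. The sketch below models only the logic, not the device physics or the applied voltages, and the function names are chosen for this example.

```python
from itertools import product


def imply(s_p, s_q):
    """Material implication, Eq. (1): next state is S_p -> S_q."""
    return int((not s_p) or s_q)


def maj3(s_x, v_wl, v_bl):
    """Three-input majority, Eq. (2): 1 when at least two inputs are 1."""
    return int(s_x + v_wl + v_bl >= 2)


def magic_nor(*states):
    """MAGIC k-input NOR, Eq. (3): 1 only when every input is 0."""
    return int(not any(states))


def magic_not(s_i):
    """MAGIC NOT, Eq. (4): a single-input NOR."""
    return magic_nor(s_i)


for a, b in product((0, 1), repeat=2):
    print(a, b, "IMP:", imply(a, b), "NOR:", magic_nor(a, b),
          "MAJ(a,b,0):", maj3(a, b, 0), "MAJ(a,b,1):", maj3(a, b, 1))
```

The printed table shows two familiar identities: fixing one majority input to 0 or 1 yields AND or OR, and implication with a constant 0 on the right-hand side yields NOT.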
General-purpose architectures have been proposed based on these primitives. A bit-serial Programmable Logic in Memory (PLiM) architecture that uses majority as the logic primitive was proposed by Gaillardon et al. (2016). PLiM relies on using the same crossbar for storage of instructions as well as for computation. The RRAM-based Very long instruction word (VLIW) Architecture for in-Memory comPuting (ReVAMP) was proposed by Bhattacharjee et al. (2017); it uses an Instruction Memory for instruction storage and a separate RRAM crossbar as data storage and computation memory. Haj-Ali et al. proposed the memristive Memory Processing Unit (mMPU) (Haj-Ali et al. 2018). The mMPU consists of memristive memory arrays, along with Complementary Metal Oxide Semiconductor (CMOS) periphery and control circuits that support computations as well as conventional data read and write operations. To perform a computation within the mMPU, a compute command is sent to the mMPU controller. The controller generates the corresponding control signals and applies them to the crossbar array to perform the actual MAGIC operations. The mMPU allows MAGIC NOR and NOT gates to be executed within any part of the crossbar array, so that storage of data and computation happen in the same array. Compared to the architectures based on Material Implication and Majority logic, MAGIC provides an inherent advantage: for MAGIC, control signals do not depend on the output of a compute operation.
Wider acceptance of these architectures and technologies critically relies on efficient design automation flows, including logic synthesis and technology mapping. In this chapter, we focus on the design automation challenges for architectures supporting MAGIC operations. Intuitively, a Boolean function (represented using logic
D. Bhattacharjee and A. Chattopadhyay
level intermediate form) is processed by the technology mapping flow to generate a sequence of MAGIC operations which are executed on the crossbar. The remainder of the chapter is organized as follows. We introduce in-memory computing using MAGIC operations in Sect. 2, alongside preliminaries of EDA flows. Thereafter, we present a brief outline of the technology mapping algorithms for mMPU which are targeted towards various optimization objectives in Sect. 3. We explain the algorithms in detail in the following sections. We conclude the chapter in Sect. 7.
2 Background on MAGIC Operations

We begin with the basics of computing using MAGIC operations. As shown in Fig. 2a, a 2-input MAGIC NOR gate consists of two input memristors ($IN_1$ and $IN_2$) and one output memristor ($OUT$). The memristive state of the output memristor changes in accordance with the resistive states of the input memristors. The low resistive state is interpreted as logical ‘1’ while the high resistive state is interpreted as logical ‘0’. The NOR gate operation is realized by applying $V_G$ to the input memristors while the output memristor is grounded. Note that the output memristor has to be initialized to the low resistive state before the NOR operation is carried out. After the voltage is applied, the resistance of the output memristor is set based on the ratio between the resistances of the input and the output memristors, which results in a NOR operation. The MAGIC NOR operation can be performed with the devices arranged in a crossbar configuration, as shown on the right-hand side of Fig. 2a. By extending this approach, it is feasible to perform logical n-input NOR and NOT operations.

Multiple MAGIC operations can be performed in parallel. Parallel execution of multiple NOR gates is possible whenever the inputs and outputs of the n-input NOR gates are aligned in the same rows or columns of a crossbar, as shown in Fig. 2b. For example, in Fig. 2c, two 3-input NOR operations are performed in parallel:

$M_{1,4} = \mathrm{NOR}(M_{1,1}, M_{1,2}, M_{1,3})$
$M_{2,4} = \mathrm{NOR}(M_{2,1}, M_{2,2}, M_{2,3})$

Vertical operations are also allowed, as shown in Fig. 2d:

$M_{3,4} = \mathrm{NOR}(M_{1,4}, M_{2,4})$

A single-input NOR operation is a NOT gate, as shown in Fig. 2e:

$M_{3,5} = \mathrm{NOT}(M_{3,4})$

Thus, both n-input NOR and NOT gates can be executed by MAGIC operations. It is also possible to reset the devices in the crossbar to ‘1’ in parallel, either row-wise or column-wise.
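The sequence of operations in Fig. 2c to 2e can be mimicked with a small logical model. The following is a minimal sketch, assuming an idealized crossbar in which every device holds exactly 0 or 1; the class `Crossbar`, its method names and the chosen input values are illustrative assumptions and do not model voltages or resistances.

```python
class Crossbar:
    """Logical (not electrical) model of a crossbar; each cell stores 0 or 1."""

    def __init__(self, rows, cols):
        # All devices start in the reset state, logical '1' (low resistance).
        self.cells = [[1] * cols for _ in range(rows)]

    def get(self, r, c):
        return self.cells[r - 1][c - 1]      # 1-based indices, as in M_{r,c}

    def put(self, r, c, bit):
        self.cells[r - 1][c - 1] = bit

    def row_nor(self, rows, in_cols, out_col):
        """One cycle: the same horizontal NOR on every listed row in parallel."""
        for r in rows:
            assert self.get(r, out_col) == 1, "output must be initialised to '1'"
            self.put(r, out_col, int(not any(self.get(r, c) for c in in_cols)))

    def col_nor(self, col, in_rows, out_row):
        """One cycle: a vertical NOR along a single column."""
        assert self.get(out_row, col) == 1, "output must be initialised to '1'"
        self.put(out_row, col, int(not any(self.get(r, col) for r in in_rows)))


xb = Crossbar(3, 5)
for col, bit in enumerate((0, 0, 0), start=1):
    xb.put(1, col, bit)                      # illustrative inputs of row 1
for col, bit in enumerate((0, 1, 0), start=1):
    xb.put(2, col, bit)                      # illustrative inputs of row 2

xb.row_nor(rows=[1, 2], in_cols=[1, 2, 3], out_col=4)   # M1,4 and M2,4 together
xb.col_nor(col=4, in_rows=[1, 2], out_row=3)            # M3,4 = NOR(M1,4, M2,4)
xb.row_nor(rows=[3], in_cols=[4], out_col=5)            # M3,5 = NOT(M3,4)
```

The two row-wise NOR gates share one call because their inputs and outputs sit in the same columns, which is the alignment condition stated above.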
Synthesis and Technology Mapping for In-Memory Computing
Fig. 2 Basic MAGIC operations on a crossbar array
3 Technology Mapping Problem for MAGIC

Before presenting the technology mapping problem, we briefly review the basics of the design automation flow. Figure 3 shows a representative design automation flow that maps a Boolean function to a technology-specific list of operations. For logic synthesis, a classification of different Intermediate Representations (IRs) has been proposed in Soeken and Chattopadhyay (2016). First, there are functional approaches, where the IR is used to explicitly express the logic function. Examples of such IRs are Boolean truth tables, Look-Up Tables (LUTs) and Binary Decision Diagrams (BDDs). Second, there are structural approaches, where the IR is used to represent the structure of the circuit, e.g., using And-Inverter Graphs (AIGs). For technology mapping on memristive crossbars, both types of approaches have been adopted, depending on how closely the IR fits the device-level operations. Logic synthesis transforms the input logic network to optimize some criterion, such as the number of vertices or the number of logic levels. An excellent background on the topic can be found in Micheli (1994). A variety of tools for logic synthesis and verification have been developed by both industrial and academic efforts (Sentovich et al. 1992; McGeer et al. 1993; Synopsys Design Compiler 2018; Xilinx Design Suite 2018; Berkeley Logic Synthesis and Verification Group 2016). ABC is an open-source tool that allows scalable logic transformations based on AIGs, with a variety of innovative algorithms (Berkeley Logic Synthesis and Verification Group 2016). Generic logic synthesis techniques based on Majority-Inverter Graphs (MIGs) have been developed extensively in recent years (Amarú et al. 2013; Amaru et al. 2016). The output of the logic synthesis tools can be used directly as input to the technology mapping phase.
Fig. 3 Design automation flow from Boolean function specification to a technology mapped netlist. [Flow shown in the figure: Boolean function → logic representation (AIG: And-Inverter Graph, MIG: Majority-Inverter Graph, BDD: Binary Decision Diagram) → technology-independent logic synthesis → technology-aware logic synthesis (technology constraints: state update function, fan-in constraints) → technology mapping (technology-mapping constraints: delay constraint, area constraint, available gate library) → mapped netlist / instruction sequence]
3.1 Boolean Function Representation

MAGIC devices realize multi-input NOR operations, which do not allow a direct mapping from structures such as AIGs or Majority-Inverter Graphs. There are two commonly used approaches for effectively representing the input Boolean function.

1. Mapping an AIG/MIG to a NOR/NOT netlist.
2. Representing the Boolean function as a Look-Up Table (LUT) graph, with each LUT expressed in NOR-of-NOR representation.

In the first approach, the output of the logic synthesis phase is passed to a standard cell mapping tool, such that a NOR/NOT netlist is generated. This netlist can be used directly for technology mapping. ABC^1 can be used to generate such netlists with a customized standard cell library.^2 The alternative is to directly represent the Boolean function as a LUT graph, with each LUT expressed in NOR-of-NOR representation. The rationale for using a LUT graph is that it allows mapping of all forms of Boolean functions. We formally define the LUT graph and the NOR-of-NOR representation below.
1 https://github.com/berkeley-abc/abc.
2 https://github.com/debjyoti0891/MAGICNetlistGen.
LUT graph: Any arbitrary Boolean function can be represented as a directed acyclic graph (DAG) G = (V, E), with each vertex having at most k predecessors (Berkeley Logic Synthesis and Verification Group 2016). Each vertex v ∈ V with k predecessors represents a k-input Boolean function, or simply a k-input LUT. Each edge u → v represents a data dependency from the output of node u to an input of node v.

LUT Graph Generation using ABC
    abc
    UC Berkeley, ABC 1.01 (compiled Feb 25 2020 13:45:47)
    abc 01> read_blif cm151a.blif
    abc 02> strash
    abc 03> if -K 4
    abc 04> write_verilog cm151a.v
    abc 05> quit
Example 1 Figure 4 shows the cm151a benchmark from LGSynth91 as a DAG with k = 4. The benchmark has 12 primary inputs, a to l, and two primary outputs, m and n. LUT 16 has a dependency on LUTs 17 and 18 and on primary input j.

NOR-of-NOR representation: A Boolean function $F : \mathbb{B}^n \to \mathbb{B}$, expressed in sum-of-products (SoP) form, can be converted to the NOR-of-NORs (NoN) representation by the following simple transformations.
Fig. 4 cm151a benchmark partitioned into LUTs with k = 4. Each triangular node represents a primary input, while the inverted triangles represent primary outputs. Each round node represents a LUT. The LUT id and its functionality in SoP form are shown inside the node. [Figure content: primary inputs a to l, primary outputs m and n, LUTs 15 to 22. LUT functions as printed: LUTs 17, 18, 20, 21: {-10 1, 1-1 1}; LUTs 16, 19: {-01 1, 0-0 1}; LUT 15: {-001 1, 0-00 1}; LUT 22: {-001 0, 0-00 0}]
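1. Replace the ∨ and ∧ operations with $\overline{\vee}$ (NOR).
2. Flip the polarity of each primary input.
3. Negate the result.

For example, we can express F in NoN representation as follows.

$$F = (a \wedge \overline{b}) \vee (\overline{a} \wedge b \wedge c) = \overline{\overline{(\overline{a} \vee b)} \vee \overline{(a \vee \overline{b} \vee \overline{c})}} \qquad (5)$$

Alternatively, we can express this NoN as:

| Variables | a | b | c |
|---|---|---|---|
| 1st product term | 0 | 1 | – |
| 2nd product term | 1 | 0 | 0 |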
3.2 Problem Definition

The technology mapping problem for ReRAM-based in-memory computing is to determine a sequence of inputs to be applied to the wordlines and the bitlines of the ReRAM devices in order to compute a given function. The delay of a mapping is equal to the number of steps that the mapping contains. We say that the mapping is delay optimal if the number of steps is minimum. The number of devices used in a mapping solution determines the area of the solution. We use the terms operation and instruction interchangeably. In the context of technology mapping, we look at three distinct variants of the problem.

1. Single row mapping: In this variant, a Boolean function is mapped on a single row of the crossbar. This might result in higher latency for the computation of a single Boolean function. However, it allows SIMD computation, where each row processes the same function but with a different data stream.
2. Delay-constrained crossbar mapping: The Boolean function is mapped with the goal of minimizing the delay of the mapping, without any constraint on the size of the crossbar. This mapping is useful when the latency of computing a function is of vital importance and a sufficiently large crossbar is available.
3. Area-constrained crossbar mapping: Given a fixed crossbar dimension, the Boolean function is mapped to minimize the delay of the mapping. This mapping is useful for mapping arbitrary Boolean functions on a fixed-size crossbar.

In the following sections, we present solutions for each of the variants of the technology mapping problem.
4 Single Row Optimal Mapping

In this section, we present an optimal solution for mapping a Boolean function on a single row of the crossbar, while considering device reuse. This mapping can be replicated across multiple rows of the array for SIMD operations, harnessing the parallelism offered by the MAGIC operations for high throughput, albeit at the cost of a higher latency. We consider the logic function to be represented in terms of NOT and NOR gates, without any constraint on the number of inputs of the NOR gates.

Since the goal is to map the function on a single row of the crossbar, only a single MAGIC operation can be performed in a cycle. In addition, the position of the operands in the row does not impact the overall delay of the mapping. Without loss of generality, we consider each primary input present in the netlist to be already allocated a device prior to the start of computation. A gate in the netlist can be computed exactly once. A device that holds the result of a computation that is no longer required can be reset. Once an output has been computed on a memristor, the memristor cannot be reset.

The key difference from the constraints used in SIMPLE (Hur et al. 2017) is that the current work does not consider the geometry of the crossbar in the formulation. This allows a considerable reduction in the number of constraints, since mapping a node to any device is the same as mapping it to any other device. Furthermore, the constraints present in SIMPLE that have to be solved to permit parallel execution of nodes are not present in the current formulation.

We present OptiSimpler, a procedure that uses SAT constraints to determine a feasible sequence of operations to compute a given mapped netlist using a given number of available memristors. The inputs provided to OptiSimpler are summarized in Table 1, and the variables used in the SAT formulation are shown in Table 2.

Table 1 Input parameters to OptiSimpler

| Parameter | Description |
|---|---|
| G | A graph G = (V, E) is created from a mapped NOR/NOT netlist. Each gate $g_i$ is represented by a vertex i in the graph. An edge i → j ∈ G, if the output of gate $g_i$ is an input to gate $g_j$ |
| PO | A subset of vertices V that represent the gates that drive the primary outputs |
| D | D is the number of devices available for mapping, D > 1 |
| S | S is a positive integer that determines the number of steps that are available to the SAT solver for mapping [Refer to example for details] |
Table 2 Variables used in SAT formulation

| Variable | Type | Description |
|---|---|---|
| $a_i^s$ | Binary | 1 indicates that vertex i is assigned to a device in step s. 0 indicates that no device has been allocated to vertex i in step s |
| $f_i^s$ | Binary | 1 indicates that vertex i was assigned a device in some step $s'$, such that $s' \le s$ |
4.1 Clauses for SAT Formulation

In this subsection, we list the constraints for optimally solving the single row technology mapping problem.

1. No vertex has a device allocated at the beginning.

$a_i^0 = 0, \quad \forall i \in V$ (6)

2. Each output vertex must have a device allocated at the end.

$a_i^S = 1, \quad \forall g_i \in PO$ (7)

3. A vertex can be allocated only if all the predecessors of the node are currently allocated or if it is already allocated.

$a_i^s \implies (a_{p_1}^{s-1} \wedge a_{p_2}^{s-1} \wedge \ldots \wedge a_{p_k}^{s-1}) \vee a_i^{s-1}$ (8)

where $p_j \in pred(i)$, $1 \le s \le S$.

4. In each step, the number of devices allocated cannot be more than the number of available devices D.

$\mathrm{AtMost}(a_1^s, a_2^s, \ldots, a_{|V|}^s, D), \quad \forall\, 1 \le s \le S$ (9)

5. A vertex can be allocated only once. This is also known as one-shot computation. $\forall i \in V$ and $1 \le s \le S$:

$f_i^s = f_i^{s-1} \vee (a_i^s \wedge (\neg a_i^{s-1}))$ (10)

$f_i^s \implies \neg a_i^s$ (11)
The output of the SAT solver is a feasible assignment of the variables, such that all the constraints are True. We can construct a valid sequence of MAGIC operations from the assignment of the variables $a_i^s$ using Algorithm 1.

Algorithm 1: Determining sequence of MAGIC operations from SAT solver assignment A

    Procedure MAGICOpGen(G, A, D, S)
        Ops = List();
        verToDev = Map();
        freeDev = Stack();
        for d = 1; d ≤ D; d++ do
            freeDev.push(d);
        for s = 1; s ≤ S; s++ do
            for i = 1; i ≤ |V|; i++ do
                if ¬a_i^s && a_i^{s−1} then
                    dev = verToDev[i];
                    freeDev.Push(dev);
                    Ops.add(dev, Set);
            for i = 1; i ≤ |V|; i++ do
                if a_i^s && ¬a_i^{s−1} then
                    dev = freeDev.Pop();
                    verToDev[i] = dev;
                    Ops.add(dev, g_i);
        return Ops;
4.2 Determining Minimum Number of Devices

The OptiSimpler formulation can determine whether mapping a function is feasible with a given number of devices. This can be used to determine the minimum number of devices required to map the Boolean function. Algorithm 2 presents the procedure. Since it performs a binary search over the number of devices, the algorithm needs about $\log_2 |V|$ calls to OptiSimpler to determine the minimum number of devices.
4.3 Demonstrative Example

Figure 5a shows a representative DAG. Nodes 0–7 without any fan-in are the primary inputs while node 8 is the primary output. For the constraint formulation, we set the number of steps S to 10 for this example.
Algorithm 2: Determining minimum number of devices using SAT

    Procedure FindMinDev(G, PO)
        D = |V|;
        S = 2D;
        feasible, feasibleAlloc = OptiSimpler(G, PO, D, S);
        feasibleDev = D;
        high = D;
        low = 1;
        while low ≤ high do
            mid = (low + high)/2;
            feasible, alloc = OptiSimpler(G, PO, mid, S);
            if feasible == sat then
                // The solver found a satisfiable assignment
                feasibleAlloc = alloc;
                feasibleDev = mid;
                high = mid − 1;
            else
                low = mid + 1;
        return feasibleDev, MAGICOpGen(G, feasibleAlloc, feasibleDev, S);
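The search in Algorithm 2 can be sketched independently of the solver by treating the feasibility check as a black box. In the sketch below, `is_feasible` is a hypothetical stand-in for a call to OptiSimpler with a fixed number of steps, and the search assumes that feasibility is monotone, i.e., a mapping that works with d devices also works with d + 1.

```python
def find_min_devices(num_vertices, is_feasible):
    """Binary search for the smallest device count accepted by is_feasible."""
    best = None
    low, high = 1, num_vertices
    while low <= high:
        mid = (low + high) // 2
        if is_feasible(mid):
            best = mid          # remember the feasible count and try fewer
            high = mid - 1
        else:
            low = mid + 1       # infeasible: more devices are needed
    return best


# Stub oracle for illustration: pretend that at least 5 devices are needed.
calls = []

def stub_oracle(devices):
    calls.append(devices)
    return devices >= 5

min_devices = find_min_devices(7, stub_oracle)
```

With the stub above, the search probes 4, 6 and 5 devices in turn before settling on 5, so the number of oracle calls grows logarithmically with the number of vertices.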
Fig. 5 A representative DAG, along with the assignment of variable by the SAT solver corresponding to the optimal mapping for the DAG with 5 devices
As stated in Algorithm 2, the procedure uses a binary search to find the minimum number of devices. Figure 6a shows the result of each SAT solver invocation, along with the corresponding delay achieved. Note that for each call to the SAT solver, we use the same number of steps (2|V|) available for mapping. The minimum number of devices that gives a feasible mapping for the DAG is 5. The solution obtained using 5 devices is shown in Fig. 5b. Note that a time step s does not correspond directly to a cycle of executing MAGIC operations. If there is more than one operation in a time step, the operations are performed sequentially: the Reset operations first, followed by the MAGIC operations. Initially, at s = 0, none of the vertices is allocated a device and hence all the entries are 0. At step s = 1, nodes 11 and 12 are assigned a device and computed. Similarly, 9 and 10 are allocated and computed in s = 3. In s = 4, 13 is computed. In step s = 5, the devices allocated to 9 and 10 are Reset, and one of these devices is used for computing 14. The rest of the assignments can be interpreted similarly. Note that we consider only one time step if the assignments in consecutive time steps are identical (e.g., s = 1 and s = 2). The sequence of MAGIC operations obtained from the assignment using Algorithm 1 is presented in Fig. 6b.

Fig. 6 Determining the optimal mapping for DAG in Fig. 5a using Algorithm 2
4.4 Results

The results of benchmarking are presented in Table 3. The proposed algorithm was implemented in Python, with Z3 as the constraint solver. The algorithm is available in the arche framework through the rowsat command (Bhattacharjee 2020). The flags -md and -ms request mapping with the minimum number of devices and the minimum number of steps, respectively. The input benchmarks are in the form of a mapped netlist, consisting of NOR and NOT gates only. The experiments were run on an Intel(R) Xeon(R) CPU E5-2695 v2 @ 2.40 GHz with 30 GB RAM, running Ubuntu 12.04. The runtime of each benchmark was limited to roughly 24 h.
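Table 3 Benchmarking results for SAT-based single row mapping

| Benchmark | Nodes | Dmin | Dunsat | Magic | Reset | Cycles | Device reduction (%) | α (%) |
|---|---|---|---|---|---|---|---|---|
| x2 | 78 | 19 | NA | 68 | 49 | 117 | 75.64 | 50.00 |
| misex | 86 | 10 | 7 | 78 | 68 | 146 | 88.37 | 69.77 |
| parity | 92 | 11 | NA | 76 | 65 | 141 | 88.04 | 53.26 |
| 5xp1 | 119 | 30 | NA | 112 | 82 | 194 | 74.79 | 63.03 |
| clip | 161 | 40 | NA | 152 | 112 | 264 | 75.16 | 63.98 |
| b9 | 188 | 21 | 20 | 147 | 126 | 273 | 88.83 | 45.21 |
| apex2 | 374 | 93 | NA | 336 | 243 | 579 | 75.13 | 54.81 |
| x4 | 547 | 102 | NA | 453 | 351 | 804 | 81.35 | 46.98 |
| aes | 365 | 45 | NA | 357 | 312 | 669 | 87.67 | 83.29 |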
Executing OptiSimpler

    python3 arche.py
    Synthesis and technology mapping for emerging technologies
    arche>read experiments/fa_map.v
    arche>rowsat -ms -md
    ...
    Min devices : 6
    Min steps : 18
    arche>quit
The Dmin column indicates the minimum number of devices for which a satisfiable solution (sat) was found, while Dunsat indicates the number of devices for which the solver established that no feasible solution exists. Results where Dmin ≠ Dunsat + 1 (including those with no Dunsat entry) indicate that the solver was terminated before optimality could be established. It should be noted that the time the SAT solver takes to complete varies with the number of available devices D, even when the other inputs are kept identical. Therefore, the overall execution time for determining the minimum number of devices cannot be estimated directly from one call to OptiSimpler. The Cycles column indicates the delay of the mapping, consisting of Magic and Reset operations. The device reduction percentage (the column preceding α) is the reduction in device usage compared to the trivial allocation that assigns one device per node (equal to the number of nodes). α% is the percentage overhead in the number of cycles, compared to the delay achievable using the trivial allocation.
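The two percentage columns can be recomputed from the other columns of Table 3. The sketch below assumes that the trivial allocation uses one device and one cycle per node, which is consistent with the tabulated values; the function names are illustrative.

```python
def device_reduction(nodes, d_min):
    """Percentage of devices saved relative to one device per node."""
    return 100.0 * (nodes - d_min) / nodes


def cycle_overhead(nodes, cycles):
    """Percentage of extra cycles relative to one cycle per node."""
    return 100.0 * (cycles - nodes) / nodes


# Row x2 of Table 3: 78 nodes, Dmin = 19, 68 Magic + 49 Reset = 117 cycles.
x2_reduction = round(device_reduction(78, 19), 2)    # 75.64
x2_alpha = round(cycle_overhead(78, 68 + 49), 2)     # 50.0

# Row b9 of Table 3: 188 nodes, Dmin = 21, 273 cycles.
b9_reduction = round(device_reduction(188, 21), 2)   # 88.83
b9_alpha = round(cycle_overhead(188, 273), 2)        # 45.21
```

The extra cycles come almost entirely from the Reset operations that make device reuse possible.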
4.5 Summary

In this section, we presented an algorithm to obtain the minimum number of devices required for mapping an arbitrary netlist using a single row of memristors realizing MAGIC operations. Using a SAT solver, an optimal solution can be obtained for an arbitrary number of devices available for mapping. Benchmarking results indicate that a considerable reduction in device usage (up to 88%) can be obtained by permitting device reuse, but at the cost of increased delay. Further reduction in device usage might be achieved by permitting nodes to be computed more than once.
5 Delay-Constrained 2D Crossbar Mapping

MAGIC operations allow multiple parallel operations on a crossbar array. In this section, we present a technology mapping flow that focuses solely on minimizing the delay of mapping in a 2D crossbar array. The proposed delay-constrained flow comprises three distinct stages, which are described below.

1. LUT graph generation: A Boolean function is represented as a LUT graph, which permits searching for various multiple-input NOR gates for minimizing delay.
2. Crossbar Mapping: In this phase, the geometrical shape of individual LUTs is used to determine a mapping of the LUTs to the crossbar in order to maximize parallel operations.
3. Optimal Mapping Extraction: In this phase, the mapping design space is explored through a Pareto analysis in order to extract the solution that minimizes both latency and crossbar area occupation.
5.1 LUT Graph Generation

In this phase, the input Boolean function is partitioned into a LUT graph by using ABC. We begin by reading the Boolean function as an AIG and then attempt to reduce the cardinality of the AIG by using the following heuristic methods: resyn, resyn2, resyn2rs, and compress2rs. The resulting netlist is then partitioned into k-input LUTs using the commands described in Sect. 3.1, with k ∈ {2, 3, 4, 7, 10}. These LUT graphs are used as input to the crossbar mapping phase. MAGIC enables n-input NORs as long as the defined voltage supply constraints are met. For this algorithm, we limit the number of inputs to 4. Therefore, LUTs with 7 and 10 inputs are decomposed into LUTs with a smaller number of inputs, as shown in Fig. 7. This theoretically limits the number of parallel NOR gates to 16 ($2^4$).
Fig. 7 Various k-input LUTs. Decomposition for 7 and 10 input LUTs
5.2 Crossbar Mapping

The goal of this phase is to map the individual nodes (LUTs) of the input LUT graph on the crossbar so as to minimize the delay of computing. LUTs in the same topological level of the LUT graph do not depend on one another and can therefore be scheduled in parallel. The mapping of the LUTs is performed row-wise, level by level, as shown in Fig. 8a. The crossbar is divided into tiles which host the LUTs. A tile is identified by (l, y) coordinates that represent, respectively, the l-th logic level of the netlist and the vertical index of the tile (starting from the top). One LUT is allocated to one tile; for example, $S_{1,1}$ is the LUT allocated to tile (1, 1) of the crossbar. The mapping algorithm starts by stacking the LUTs of the first level sorted by their size, i.e., the number of their LUT terms, in decreasing order. The mapping of cascading LUTs is subject to a dimensional condition: a LUT whose size is greater than or equal to that of a LUT in a different level cannot be placed at the same vertical position y as that LUT. Although this could seem a limitation, the proposed mapping scheme relies on the following rule of thumb: as the logic level increases, the LUT size decreases.

Fig. 8 Crossbar mapping for delay-constrained technology mapping

In order to permit computation of multiple LUTs in parallel, we utilize the NOR-of-NOR representation of the LUT function. Since the NoN representation consists of only NOR and NOT operations, it can be computed by MAGIC operations directly in 3 cycles, ignoring the initialization cycle(s). All the variables in a product term are aligned in rows in the appropriate polarity (inverted or regular). For the variables which are not present in a product term, the corresponding memristor is set to ‘1’, which is the state of the memristor after reset. This is followed by computing the NOR of all the product terms horizontally in a single cycle. In the next cycle, a vertical NOR of the above results produces the negated output. In the last cycle, we negate this result to get the output of the computed function.

Example 2 The computation of F in Eq. (5) using MAGIC operations is shown in Fig. 9. Row 1 and row 2 hold the inputs for the 1st and 2nd product terms, respectively. These inputs are NORed in parallel to compute the product terms, with the outputs written to $M_{1,4}$ (H1) and $M_{2,4}$ (H2). The product terms are vertically NORed to compute $\overline{F}$ in $M_{3,4}$. In the final step, $\overline{F}$ is inverted using a NOT operation to compute F (in $M_{3,2}$).

Each LUT has a dedicated output-to-input alignment row (golden rows in Fig. 8b). It is used to relocate the output position w.r.t. cascading LUTs. The output-to-input alignment is performed through a sequence of two NOT operations (double arrows in Fig. 8b) using an intermediate temporary memristor; alternatively, if a negated input is required by any of the LUT terms, the alignment is done with a single NOT operation (dashed arrow).
LUTs in all levels must be separated at least by one row of memristors in order to avoid the overlapping of alignment rows, as shown for Si+1,1 and Si,1 . When a don’t care condition appears in a LUT row, no realignment copies are required in cascading LUTs, since the value is fixed to a logic “0” or “1” by the scheduler. Thus, a larger k value leads to more don’t care conditions, improving the quality of the computation.
Fig. 9 Computation of $F = \overline{\overline{(\overline{a} \vee b)} \vee \overline{(a \vee \overline{b} \vee \overline{c})}}$ with 3 inputs and 2 product terms using MAGIC operations on a 3 × 4 crossbar
5.2.1 Demonstrative Example
A 1-bit full adder mapped with the proposed scheme is shown in Fig. 10. First, $S_{1,1}$ and $S_{1,2}$ are placed, separated by a crossbar row. Then, the algorithm tries to place $S_{2,3}$, the largest LUT in level 2; once it has verified that the size of $S_{2,3}$ is larger than or equal to that of $S_{1,1}$ and $S_{1,2}$, $S_{2,3}$ is mapped below $S_{1,2}$, namely, the smallest LUT of the previous level. Lastly, as $S_{3,3}$ has the same size as $S_{2,1}$ and $S_{1,2}$, the first available vertical position is the one next to $S_{2,3}$.

Figure 10 also shows an example of how the SAID-based full adder works, reporting the logic operation per cycle. The sequence of crossbar operations is loaded in memory beforehand and executed by the crossbar scheduler. First, the scheduler sets the primary inputs (PI) of each LUT. For simplicity, we do not include the time needed by the scheduler to place each PI in the total count of computation cycles. The operations of the first logic level are executed in the first 4 cycles: the LUT terms of all LUTs are computed with the $H_{NOR}(S_{1,1}, S_{1,2})$ operation; then the $V_{NOR}(S_{1,1}, S_{1,2})$ operation in cycle 2 computes the NOR of the results of the LUT terms in $S_{1,1}$; the NOT operation in cycle 3 extracts the output value for $S_{1,1}$, whereas in cycle 4 a NOT operation runs on $S_{1,2}$, whose result will be used in level 3 to compute $C_{out}$. Note that, having a single row, $S_{1,2}$ has already generated its output value at cycle 1.

After the execution of level 1, an output-to-input alignment is needed. The negated output of $S_{1,1}$ is aligned to the inputs of $S_{2,1}$ and to the first LUT term of $S_{2,3}$. The second LUT term of $S_{2,3}$ needs $S_{1,1}$, thus a two-cycle alignment is needed (cycles 7 to 8). At this point, the LUTs in level 2 are executed (cycles 9 to 12); the output S is obtained in this level from $S_{2,3}$. After the alignment operations (cycles 13 to 16), $S_{3,3}$ is executed in only one cycle (the 17th), generating $C_{out}$.

Fig. 10 SAID-based implementation of a 1-bit full adder
5.3 Optimal Mapping Extraction

The design space is inspected through a Pareto analysis which solves a multi-objective optimization problem. The best solution is the one that minimizes the area occupation, i.e., the number of memristors required for in-memory computing, and the latency, represented by the number of time steps needed to compute all LUTs. In particular, by varying the parameter k, the best solution in terms of the area/latency trade-off is determined. The proposed SAID logic synthesis reduces the latency by concurrently executing multiple LUTs of a logic level. However, this comes at the price of a higher area due to the redundancy introduced by LUTs.
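A Pareto analysis over the LUT size k can be sketched as follows. The candidate numbers are invented for illustration and are not results from this chapter; the final selection rule (smallest cycles × devices product) is only one possible tie-breaker and is an assumption, since the text does not state how a single point on the front is chosen.

```python
def pareto_front(candidates):
    """Return the k values whose (cycles, devices) pair is not dominated.

    A candidate is dominated if another one is no worse in both objectives
    and strictly better in at least one of them.
    """
    front = []
    for k, (cycles, devices) in candidates.items():
        dominated = any(
            c2 <= cycles and d2 <= devices and (c2 < cycles or d2 < devices)
            for k2, (c2, d2) in candidates.items() if k2 != k)
        if not dominated:
            front.append(k)
    return sorted(front)


# Hypothetical mapping results per LUT size k: (cycles, devices).
results_per_k = {
    2: (400, 300),
    3: (250, 350),
    4: (200, 500),
    7: (210, 900),
    10: (260, 1200),
}

front = pareto_front(results_per_k)

# Assumed tie-breaker: pick the front point with the smallest area-delay product.
best_k = min(front, key=lambda k: results_per_k[k][0] * results_per_k[k][1])
```

In this made-up data set, k = 7 and k = 10 are dominated because a smaller k achieves fewer cycles with fewer devices.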
5.4 Results

In this section, we present the results of mapping the ISCAS’85 and IWLS’93 benchmark suites. Tables 4 and 5 report the experimental results obtained for the ISCAS’85 and the IWLS’93 benchmark suites, respectively. We report the best k-LUT configuration, the number of cycles (Cycles column) and the number of devices (D column) required to map each benchmark. We compare our method against state-of-the-art methods w.r.t. the speedup (η) obtained by the proposed method and the overhead (γ) in terms of devices required.

For the ISCAS’85 benchmarks, the highest average speedup of 3.31× is achieved w.r.t. Talati et al. (2016). In addition, an average area saving of 28% is also achieved. A favorable trade-off between speed and area is obtained w.r.t. Gharpinde et al. (2018), with an average 1.84× speedup and a negligible area overhead of only 7%. Lastly, the comparison with Thangkhiew and Datta (2018) demonstrates that our approach to the problem is more efficient than specific meta-heuristic mapping solutions (speedup of 1.7× with 11% more area). The IWLS’93 experiments demonstrate the scalability of the proposed approach. On average, the delay-constrained mapping was 3.89× faster, at the cost of a 31% area overhead.

A last observation is related to the best k-LUT configuration. In most cases, the best solution is found with k ≥ 3. This demonstrates that using multiple-input NOR gates is a viable way to improve the figures of merit of LiM applications.
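As a worked example of the two metrics, the sketch below recomputes η and γ for two rows of Table 5. The definitions used here (η as reference cycles divided by our cycles, γ as our devices divided by reference devices) are inferred from the tabulated values and should be read as an assumption; the function names are illustrative.

```python
def speedup(reference_cycles, our_cycles):
    """eta: how many times faster the delay-constrained mapping is."""
    return reference_cycles / our_cycles


def device_overhead(our_devices, reference_devices):
    """gamma: ratio of devices used; values below 1 mean an area saving."""
    return our_devices / reference_devices


# Row 5xp1 of Table 5: ours 106 cycles / 251 devices, reference 228 / 244.
eta_5xp1 = round(speedup(228, 106), 2)              # 2.15
gamma_5xp1 = round(device_overhead(251, 244), 2)    # 1.03

# Row e64 of Table 5: ours 134 cycles / 394 devices, reference 1394 / 1647.
eta_e64 = round(speedup(1394, 134), 2)              # 10.4
gamma_e64 = round(device_overhead(394, 1647), 2)    # 0.24
```

The e64 row shows that a speedup does not always cost area: there the delay-constrained mapping is both faster and smaller than the reference.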
Table 4 Synthesis and mapping results on ISCAS’85 benchmarks. η = speedup of the delay-constrained method in terms of cycles. γ = overhead of the delay-constrained method in terms of number of devices

| Bench | k | Cycles | D | η Talati et al. (2016) | γ Talati et al. (2016) | η Gharpinde et al. (2018) | γ Gharpinde et al. (2018) | η Thangkhiew and Datta (2018) | γ Thangkhiew and Datta (2018) |
|---|---|---|---|---|---|---|---|---|---|
| c432 | 7 | 156 | 631 | 4.25 | 0.9 | 2.24 | 1.42 | 1.7 | 1.70 |
| c499 | 4 | 420 | 1399 | 2.3 | 1.39 | 2.75 | 1.08 | 2.23 | 1.29 |
| c880 | 3 | 482 | 1113 | 2.21 | 0.99 | 1.58 | 1.21 | 1.56 | 1.26 |
| c1355 | 3 | 554 | 1182 | 3.68 | 0.57 | 1.94 | 0.96 | 1.69 | 1.10 |
| c1908 | 3 | 627 | 1095 | 4.26 | 0.41 | 1.68 | 0.95 | 1.55 | 1.03 |
| c2670 | 4 | 643 | 1249 | 5.2 | 0.35 | 2.32 | 0.62 | 2.18 | 0.65 |
| c3540 | 4 | 1566 | 3261 | 2.66 | 0.77 | 1.53 | 1.28 | 1.54 | 1.22 |
| c5315 | 4 | 1754 | 2937 | 4.04 | 0.4 | 1.88 | 0.79 | 1.85 | 0.77 |
| c6288 | 2 | 4069 | 6067 | 0.72 | 2.05 | 0.93 | 1.54 | 1.23 | 1.14 |
| c7552 | 3 | 2565 | 4003 | 3.82 | 0.4 | 1.53 | 0.89 | 1.49 | 0.91 |
| Average | | 1283.60 | 2293.70 | 3470.90 | 0.82 | 1927.60 | 1.07 | 1974.70 | 1.11 |
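Table 5 Synthesis and Mapping Results on IWLS’93 benchmarks

| Bench | k | Delay-constrained Cycles | Delay-constrained D | Thangkhiew and Datta (2018) Cycles | Thangkhiew and Datta (2018) D | η Thangkhiew and Datta (2018) | γ Thangkhiew and Datta (2018) |
|---|---|---|---|---|---|---|---|
| 5xp1 | 4 | 106 | 251 | 228 | 244 | 2.15 | 1.03 |
| 9sym | 7 | 160 | 1026 | 552 | 594 | 3.45 | 1.73 |
| alu4 | 7 | 1124 | 5331 | 2681 | 2794 | 2.39 | 1.91 |
| apex1 | 7 | 1764 | 3170 | 5251 | 5547 | 2.98 | 0.57 |
| apex2 | 7 | 241 | 880 | 667 | 736 | 2.77 | 1.20 |
| apex4 | 4 | 4477 | 5050 | 7105 | 7442 | 1.59 | 0.68 |
| apex5 | 7 | 777 | 2223 | 1966 | 2319 | 2.53 | 0.96 |
| b12 | 10 | 26 | 283 | 137 | 169 | 5.27 | 1.67 |
| bw | 10 | 53 | 718 | 367 | 385 | 6.92 | 1.86 |
| clip | 7 | 135 | 451 | 239 | 261 | 1.77 | 1.73 |
| con1 | 10 | 5 | 52 | 35 | 48 | 7.00 | 1.08 |
| cordic | 7 | 28 | 164 | 117 | 164 | 4.18 | 1.00 |
| duke2 | 10 | 300 | 1632 | 1261 | 1342 | 4.20 | 1.22 |
| e64 | 10 | 134 | 394 | 1394 | 1647 | 10.40 | 0.24 |
| ex1010 | 7 | 2104 | 9945 | 6947 | 7290 | 3.30 | 1.36 |
| ex5p | 10 | 277 | 2061 | 1229 | 1286 | 4.44 | 1.60 |
| inc | 10 | 55 | 280 | 264 | 282 | 4.80 | 0.99 |
| misex1 | 10 | 17 | 184 | 141 | 157 | 8.29 | 1.17 |
| misex2 | 10 | 30 | 125 | 239 | 287 | 7.97 | 0.44 |
| misex3c | 7 | 518 | 2551 | 2976 | 3094 | 5.75 | 0.82 |
| misex3 | 10 | 1053 | 6184 | 1494 | 1563 | 1.42 | 3.96 |
| pdc | 7 | 1959 | 8403 | 2142 | 2237 | 1.09 | 3.76 |
| rd53 | 4 | 31 | 83 | 75 | 86 | 2.42 | 0.97 |
| rd73 | 4 | 150 | 379 | 262 | 280 | 1.75 | 1.35 |
| rd84 | 4 | 252 | 745 | 410 | 428 | 1.63 | 1.74 |
| sao2 | 10 | 79 | 559 | 309 | 331 | 3.91 | 1.69 |
| seq | 7 | 1957 | 7792 | 4658 | 4884 | 2.38 | 1.60 |
| spla | 4 | 1050 | 4036 | 2312 | 2399 | 2.20 | 1.68 |
| squar5 | 10 | 15 | 98 | 101 | 112 | 6.73 | 0.88 |
| t481 | 4 | 30 | 106 | 114 | 140 | 3.80 | 0.76 |
| table3 | 3 | 1766 | 2242 | 4256 | 4418 | 2.41 | 0.51 |
| table5 | 7 | 1169 | 2009 | 3851 | 4028 | 3.29 | 0.50 |
| vg2 | 10 | 55 | 280 | 289 | 345 | 5.25 | 0.81 |
| xor5 | 3 | 12 | 35 | 22 | 31 | 1.83 | 1.13 |
| Average | | 644.38 | 2050.65 | 1590.91 | 1687.35 | 3.89 | 1.31 |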
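The η and γ columns can be cross-checked against the cycle and device counts. The sketch below assumes, following the caption of Table 4, that η is the ratio of the reference method's cycles to the delay-constrained cycles and that γ is the ratio of the delay-constrained device count to the reference device count. The function name and argument names are illustrative only.

```python
def speedup_and_overhead(dc_cycles, dc_devices, ref_cycles, ref_devices):
    """Return (eta, gamma) for one benchmark row.

    eta   : speed-up of the delay-constrained method in cycles
    gamma : overhead of the delay-constrained method in devices
    """
    eta = ref_cycles / dc_cycles
    gamma = dc_devices / ref_devices
    return eta, gamma


# Row 5xp1 of Table 5: delay-constrained (106 cycles, 251 devices),
# Thangkhiew and Datta (228 cycles, 244 devices).
eta, gamma = speedup_and_overhead(106, 251, 228, 244)
print(f"eta = {eta:.2f}, gamma = {gamma:.2f}")  # eta = 2.15, gamma = 1.03
```

A value of γ below 1 (for example apex1) indicates that the delay-constrained mapping used fewer devices than the reference for that benchmark.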
5.5 Summary
In this section, we presented a delay-constrained mapping on a 2D crossbar without any area constraints. The method outperforms existing methods by a significant margin in terms of delay, at the cost of a higher number of devices required for mapping. In the next section, we present CONTRA, an area-constrained mapping flow for MAGIC operations. To allow area constraints, we do not enforce the LUT placement patterns that the delay-constrained mapping method relies on. To support arbitrary placement of LUTs, we introduce an A* search technique for optimally moving the intermediate results to any desired location.
6 Area-Constrained Technology Mapping Flow
In this section, we describe CONTRA, a 2-dimensional area-Constrained Technology mapping fRAmework for the memristive memory processing unit, which is shown in Fig. 11.
6.1 LUT Placement on Crossbar
The LUTs are topologically ordered and grouped by the number of inputs. The LUTs are placed one below another with inputs aligned until we are limited by the height of the crossbar. Consider n LUTs, each with k inputs. Once the LUTs are aligned one below another, we can compute the horizontal NOR of all LUTs in one cycle. This is because the inputs and outputs of all the LUTs are aligned and the voltage applied to the columns applies to all LUTs. In the next n cycles, we perform the vertical NOR operations to compute the inverted outputs of the n stacked LUTs. Thus, (n + 1) cycles are required to compute the n stacked LUTs. Let each k-input LUT $L_i$ have $p_i$ product terms, 1 ≤ i ≤ n. Then, the area $L_{area}$ required to compute the n LUTs in parallel is:

$$L_{area} = \sum_{i=1}^{n} (p_i + 1) \times (k + 1) \qquad (12)$$
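Equation (12) and the (n + 1)-cycle bound can be written down directly. The following is a minimal sketch; the function names and the product-term counts in the example are hypothetical and not taken from a benchmark.

```python
def stacked_lut_area(product_terms, k):
    """Devices needed for LUTs stacked vertically, following Eq. (12).

    product_terms : list with the number of product terms p_i of each LUT
    k             : number of inputs of every LUT in the stack
    """
    return sum((p + 1) * (k + 1) for p in product_terms)


def stacked_lut_cycles(n):
    """One horizontal NOR cycle for all LUTs plus one vertical NOR per LUT."""
    return n + 1


# Hypothetical stack: two 3-input LUTs with two product terms each.
print(stacked_lut_area([2, 2], k=3))  # 24
print(stacked_lut_cycles(2))          # 3
```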
The LUT placement strategy is from top to bottom and from left to right. The spacing parameter specifies the number of rows that are left empty between two LUTs stacked vertically. If we do not have enough free devices to place a new LUT, the crossbar is scanned row-wise and column-wise to check in which rows or columns the intermediate results are present. These are considered blocked, and the rest of the crossbar is reset either row-wise or column-wise, whichever results in fewer devices being blocked. The process is repeated until all the LUTs are placed. The overall flow is presented in Algorithm 3. Note that in the delay-constrained mapping approach, a fixed ordering is enforced during placement of LUTs vertically as well as horizontally. In the area-constrained approach, we do not enforce such an order. We also allow resetting the crossbar to permit reuse of devices, which was not considered for delay-constrained mapping.

Fig. 11 CONTRA: area-Constrained Technology mapping fRAmework for Memristive Memory Processing Unit
Example 3 For cm151a, we stack LUTs 17 and 18 in the crossbar, as shown in Fig. 12. Since enough space is not available vertically, we stack LUTs 20 and 21 on the right. We then reset the crossbar without resetting columns 4 and 8, as these columns contain the intermediate results. We continue placing the other LUTs in a similar manner. The first three steps are shown below.
1 [LUT17(1, 1) → (3, 4)][LUT18(4, 1) → (6, 4)]
2 [LUT20(1, 5) → (3, 8)][LUT21(4, 5) → (6, 8)]
3 [Reset columns except {4, 8}]
4 ...
Note that we are effectively computing the inverted output of each LUT. Therefore, for the output LUTs, an additional NOT operation is required, as specified in lines 17–19 of Algorithm 3.
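The choice between a row-wise and a column-wise reset can be sketched as follows. This is an illustration under stated assumptions, not the CONTRA implementation: the function name is hypothetical, the cell positions of the intermediate results in the usage example are assumed to be the bottom-right cells of the four LUT regions of Example 3, and ties are broken in favour of a column-wise reset.

```python
def choose_reset(result_cells, rows, cols):
    """Pick the reset direction that blocks fewer devices.

    result_cells : set of (row, col) cells holding intermediate results
    rows, cols   : crossbar dimensions R and C
    Returns (direction, preserved_lines, blocked_devices).
    """
    blocked_rows = {r for r, _ in result_cells}
    blocked_cols = {c for _, c in result_cells}
    # A row-wise reset must skip whole rows; a column-wise reset whole columns.
    row_cost = len(blocked_rows) * cols
    col_cost = len(blocked_cols) * rows
    if col_cost <= row_cost:  # tie-break towards columns (assumption)
        return "column", sorted(blocked_cols), col_cost
    return "row", sorted(blocked_rows), row_cost


# Assumed result positions on the 8 x 8 crossbar of Example 3.
print(choose_reset({(3, 4), (6, 4), (3, 8), (6, 8)}, rows=8, cols=8))
# ('column', [4, 8], 16)
```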
Fig. 12 LUT placement phase on an 8 × 8 crossbar for the cm151a benchmark
Algorithm 3: Area-constrained technology mapping.
Input : G, R, C, spacing
Output: Mapping of G to crossbar R × C.
1  do
2      L_set = Pick LUTs in a topological level with equal number of inputs.
3      if limited by space vertically then
4          Start placing from next available column;
5      end
6      if limited by both vertical and horizontal space then
7          Reset the cells keeping the intermediate outputs intact.
8      end
9      Place L_set stacked together vertically with spacing rows empty between subsequent LUTs.
10     Schedule all the LUTs in L_set in the same time slot of the schedule.
11 while There is a LUT not yet placed;
12 for Each set of LUTs stacked together do
13     Place the inputs for these LUTs, using A* search and vertical copies;
14     Compute intermediate results in parallel using Horizontal NORs;
15     Compute inverted output of LUTs in sequence using Vertical NORs;
16 end
17 for Each inverted output of G do
18     Invert using NOT operation to compute outputs of G.
19 end
6.2 LUT Input Placement Technique
For some of the LUTs, we require the intermediate outputs from previous computations as inputs. We use A∗ search to find the shortest path to copy an intermediate value from source $(R_S, C_S)$ to destination $(R_D, C_D)$ with a minimum number of NOT operations. The cost of a location (r, c) is cost(r, c) = f(r, c) + g(r, c), where f(r, c) is the number of copy operations used to reach (r, c) from $(R_S, C_S)$ and g(r, c) estimates the remaining cost:

$$g(r, c) = \begin{cases} 0, & \text{if } (r, c) \text{ is the destination} \\ 1, & \text{if } r = R_D \text{ or } c = C_D \\ 2, & \text{otherwise} \end{cases} \qquad (13)$$
All empty cells in the row and column of the current location are considered its neighbours. The search starts at the source, updates the cost of the neighbouring locations and picks the location with the least cost. The process is repeated until the goal state is reached. If the path length is odd, the polarity of the input is reversed, while for an even-length path, the polarity is preserved. This is due to an odd or even number of NOT operations respectively. If the inputs of a NoN have only positive or only negative terms, but not both, we need to choose the copy path to be even or odd accordingly. If the inputs are of mixed polarity, we can choose the shorter path, as the polarity does not matter. Thereafter, the input variable is vertically copied to different rows as required for the other product terms in the LUT, according to the NoN representation.

Fig. 13 Placement of the inputs for LUTs 16 and 17 and the corresponding literals for NOR-of-NOR computation

Example 4 LUT 16 uses the output of LUT 17 as input, with the NoN representation shown in Fig. 13. We copy the value from $M_{3,4}$ to $M_{3,1}$ using a sequence of NOT operations, obtained using A∗ search.
NOT($M_{3,4}$ → $M_{3,6}$), NOT($M_{3,6}$ → $M_{3,1}$), NOT($M_{3,1}$ → $M_{2,1}$)
The state of the crossbar after placing all the inputs (LUTs 17, 18, 20 and 21, primary inputs i and j) for LUT 16 and 19 is shown in the last sub-figure of Fig. 13.
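The search described in this section can be sketched as a standard A∗ loop over crossbar cells. The code below is a minimal illustration, not the arche implementation: the function names, the 0-indexed (row, column) coordinates and the `occupied` set are assumptions. It uses the number of copies so far as the path cost and the estimate of Eq. (13) as the heuristic, and treats every empty cell in the current row or column as a neighbour.

```python
import heapq


def heuristic(cell, dest):
    """Estimated remaining NOT operations, following Eq. (13)."""
    if cell == dest:
        return 0
    if cell[0] == dest[0] or cell[1] == dest[1]:
        return 1
    return 2


def copy_path(src, dest, occupied, rows, cols):
    """Return the cells visited from src to dest, or None if unreachable.

    occupied : set of (row, col) cells that cannot be used as copy targets
    """
    frontier = [(heuristic(src, dest), 0, src)]
    steps_to = {src: 0}
    parent = {src: None}
    while frontier:
        _, steps, cell = heapq.heappop(frontier)
        if cell == dest:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        if steps > steps_to[cell]:
            continue  # stale queue entry
        r, c = cell
        same_row = [(r, cc) for cc in range(cols) if cc != c]
        same_col = [(rr, c) for rr in range(rows) if rr != r]
        for nxt in same_row + same_col:
            if nxt in occupied:
                continue
            if steps + 1 < steps_to.get(nxt, float("inf")):
                steps_to[nxt] = steps + 1
                parent[nxt] = cell
                priority = steps + 1 + heuristic(nxt, dest)
                heapq.heappush(frontier, (priority, steps + 1, nxt))
    return None


def inverts_polarity(path):
    """An odd number of NOT copies reverses the polarity of the value."""
    return (len(path) - 1) % 2 == 1


path = copy_path((2, 3), (2, 0), occupied={(2, 1)}, rows=8, cols=8)
print(path, inverts_polarity(path))  # [(2, 3), (2, 0)] True
```

When a specific polarity is required, one option under this sketch is to run the search again with the direct neighbour excluded, or to extend the state with the path parity; either choice is a design decision that the text above leaves open.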
6.3 Input Alignment for Multiple LUTs
Multiple LUTs scheduled together for execution often share common inputs. If the common inputs are assigned to the same column, then only a single A∗ search is required to bring the input to the column, followed by vertical copies to the appropriate rows. This reduces the delay as well as the number of devices involved in copying. The goal is to find an assignment of the inputs of the individual LUTs to columns that maximizes the number of aligned inputs in a set of stacked LUTs. We encode the constraints of this problem and solve it optimally using a Satisfiability Modulo Theories (SMT) solver.

$$\text{Maximize} \quad \sum_{c=1}^{k} \sum_{l_i=1}^{n} \sum_{l_j=1}^{n} align_{c,l_i,l_j} \qquad (14)$$

$$A_{c,l} = v \mid \exists v \in \text{inputs of LUT } l, \quad 1 \le c \le k \text{ and } 1 \le l \le n \qquad (15)$$

$$align_{c,l_i,l_j} = 1 \text{ if } A_{c,l_i} = A_{c,l_j}, \quad 1 \le c \le k,\ 1 \le l_i \le n \text{ and } 1 \le l_j \le n \qquad (16)$$
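The objective of Eq. (14) can be evaluated for a given column assignment as shown below. This is a small sketch for checking candidate assignments, not the SMT encoding itself; the function name and the two-LUT example are hypothetical. Note that the sum as written also counts the pairs with $l_i = l_j$, which contribute a constant n × k to every assignment.

```python
def alignment_score(assignment):
    """Evaluate the objective of Eq. (14).

    assignment : n x k matrix; assignment[l][c] is the variable placed in
                 column c of LUT l (all rows are assumed to have k entries)
    """
    n = len(assignment)
    k = len(assignment[0])
    return sum(
        1
        for c in range(k)
        for li in range(n)
        for lj in range(n)
        if assignment[li][c] == assignment[lj][c]
    )


# Two hypothetical 2-input LUTs sharing inputs a and b.
print(alignment_score([["a", "b"], ["b", "a"]]))  # 4 (only the l_i = l_j terms)
print(alignment_score([["a", "b"], ["a", "b"]]))  # 8 (both columns aligned)
```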
Algorithm 4: Input Alignment
Input : M
Output: M_align
L = Ordered list of variables in the matrix in descending order by count.
M_align = initialize n × k matrix with φ;
for variable v in L do
    R_v = {r if v ∈ row r of M};
    target_c = None;
    for col c in matrix M do
        if M_align[r][c] == φ | ∀r ∈ R_v then
            target_c = c;
            break;
        end
    end
    if target_c == None then
        Place v in any free column in each row ∈ R_v;
    else
        Place v in column target_c in each row ∈ R_v;
    end
end
return M_align;
The assignment $A_{c,l} = v$ indicates that variable v is assigned to column c of LUT l. For n LUTs each with k inputs, a brute-force approach would have a time complexity of $(k!)^{n-1}$. As the SMT solver takes a long time to solve and has to be executed multiple times while mapping a benchmark, we propose a greedy algorithm for faster mapping. Consider n k-input LUTs stacked together. This can be represented as a matrix with dimensions n × k, where each row holds the input variables of one LUT. As the inputs of a LUT are unique, each variable occurs at most once in each row of the matrix. The detailed alignment approach is shown in Algorithm 4. We explain the algorithm with a representative example.

Example 5 Consider three 4-input LUTs with their input variables arranged as an unaligned matrix, as shown below. The variables are sorted in descending order of frequency: L = {a:3, b:2, c:2, d:1, e:1, h:1, g:1, x:1}. We start the alignment by placing ‘a’ in the first column. In the next step, we place ‘b’. As rows 1 and 2 of column 1 are already occupied by ‘a’, we place ‘b’ in column 2. Similarly, we continue the process until all the variables are placed.
| Unaligned | Step 1 | Step 2 | ... | Aligned |
|---|---|---|---|---|
| a b c d | a φ φ φ | a b φ φ | | a b c d |
| b c e a | a φ φ φ | a b φ φ | | a b c e |
| h a g x | a φ φ φ | a φ φ φ | | a h g x |

Fig. 14 Snippet of MAGIC instructions generated by CONTRA on mapping cm151a benchmark on 8 × 8 crossbar with k = 4 and spacing = 0
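The greedy procedure of Algorithm 4 can be sketched in a few lines. The code below is an illustrative reading of the algorithm rather than the arche source: the function name is hypothetical, `None` stands in for φ, ties in the frequency ordering are broken by first appearance in the matrix, and "any free column" is taken to be the first free column of each row.

```python
from collections import Counter

EMPTY = None  # stands in for the empty marker (phi) of Algorithm 4


def align_inputs(matrix):
    """Greedily assign LUT inputs to columns so that shared inputs line up.

    matrix : n x k list of lists; row l holds the k distinct inputs of LUT l
    """
    n, k = len(matrix), len(matrix[0])
    counts = Counter(v for row in matrix for v in row)
    # Descending by frequency; sorted() is stable, so ties keep first-seen order.
    order = sorted(counts, key=lambda v: -counts[v])
    aligned = [[EMPTY] * k for _ in range(n)]
    for v in order:
        rows = [r for r, row in enumerate(matrix) if v in row]
        # First column that is free in every row that uses v, if any.
        target = next(
            (c for c in range(k) if all(aligned[r][c] is EMPTY for r in rows)),
            None,
        )
        for r in rows:
            col = target if target is not None else aligned[r].index(EMPTY)
            aligned[r][col] = v
    return aligned


unaligned = [["a", "b", "c", "d"],
             ["b", "c", "e", "a"],
             ["h", "a", "g", "x"]]
for row in align_inputs(unaligned):
    print(" ".join(row))
# a b c d
# a b c e
# a h g x
```

On the matrix of Example 5 this reproduces the aligned result shown above. The fallback branch is only taken when no single column is free in all rows that use a variable.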
Example 6 For LUTs 16 and 19, the result of alignment is shown in the first sub-figure of Fig. 13, specified by the variables in pink. The variables 17, 18 and j are assigned columns 1, 2 and 3 for LUT 16, while the variables 20, 21 and j are assigned columns 1, 2 and 3 for LUT 19, thereby aligning input variable j.

This completes the description of the technique for area-constrained mapping. The output of mapping the cm151a benchmark to an 8 × 8 crossbar with k = 4 and spacing = 0 is shown in Fig. 14. The benchmark was mapped in 71 cycles. Each line signifies one or more operations with the corresponding input and gate names (pi, old_n_18, etc.) that are executed in the same cycle. In the next section, we present the results of benchmarking the proposed method.
6.4 Experimental Results
This section presents the experimental results of the area-constrained technology mapping flow, CONTRA, for computing arbitrary functions using MAGIC operations. We have implemented the proposed flow in the arche framework (Bhattacharjee 2020). We demonstrate the mapping of the cm151a benchmark in the code snippet below, using the arche framework.
Executing Contra
    python3 arche.py
    arche>setlog contralog.txt
    set log file : contralog.txt
    arche>sacmap -f cm151a.blif -d results/ -k 4 -R 8 -C 8
    Mapping with crossbar dimension 8x8
    Completed partitioning benchmark cm151a.blif
    File stats [results/k4cm151a.blif] : 12 inputs, 2 outputs, 6 wires, 8 assign
    LUT placement for cm151a.blif begins.
    LUT placement for cm151a.blif completed.
    Detailed placement starts.
    LUT Schedule generated with 7 slots
    Placing crossbar for timeslots [2, 3, 4, 5, 6, 7, 8]
    placing 1/7 timeslots
    placing 2/7 timeslots
    placing 3/7 timeslots
    placing 4/7 timeslots
    placing 5/7 timeslots
    placing 6/7 timeslots
    placing 7/7 timeslots
    Detailed placement completed.
    Verifying equivalence of cm151a.blif and results/Cr_8_8_k4_cm151a.blif.v
    Verifying completed with return code 0
    Files cm151a.blif and results/Cr_8_8_k4_cm151a.blif.v are logically equivalent
    arche> quit
The arche framework supports a variety of input formats for the benchmarks, including blif, structural Verilog and aig. Internally, arche uses ABC (Berkeley Logic Synthesis and Verification Group 2016) for generating the LUT graph and the SOP representation of LUT functions, which we convert to the NoN representation for mapping. For each benchmark, we generate cycle-accurate MAGIC instructions. A representative output of mapping is shown in Fig. 14. We developed an in-house mMPU simulator for executing MAGIC instructions. We used the simulator to generate execution traces, which were converted into Verilog. The generated Verilog and the input benchmarks were formally checked for functional equivalence using the cec command of ABC.

We benchmark our tool using the ISCAS’85 benchmarks (Hansen et al. 1999), which have been used extensively for evaluation of automation flows for MAGIC. The experiments were run on a shared cluster with 16 Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20 GHz, running Red Hat Enterprise Linux 7. Table 6 shows the results of mapping the benchmarks for three crossbar dimensions. We report the results for the best delay (in cycles) obtained by varying k from 2 to 4. As expected, increasing the crossbar dimensions generally results in a lower delay of execution.

Table 6 Benchmarking results for the ISCAS’85 benchmark for three crossbar sizes. We ran each benchmark with k = {2, 3, 4} and spacing set to {0, 2, 4, 6}. For each benchmark, the best results were obtained for k = 4 and spacing set to 6

| Bench | PI | PO | Cycles, (R, C) = (64, 64) | Cycles, (R, C) = (128, 64) | Cycles, (R, C) = (128, 128) |
|---|---|---|---|---|---|
| c432 | 36 | 7 | 797 | 774 | 770 |
| c499 | 41 | 32 | 1391 | 1341 | 1343 |
| c880 | 60 | 26 | 1314 | 1268 | 1263 |
| c1355 | 41 | 32 | 1390 | 1341 | 1344 |
| c1908 | 33 | 25 | 1511 | 1470 | 1469 |
| c2670 | 233 | 140 | 2132 | 2066 | 2060 |
| c3540 | 50 | 22 | 3751 | 3575 | 3575 |
| c5315 | 178 | 123 | 5022 | 4827 | 4831 |
| c6288 | 32 | 32 | 8176 | 7890 | 7881 |
| c7552 | 207 | 108 | 7308 | 7039 | 7036 |

We also report the results of mapping for the EPFL benchmarks (https://github.com/lsils/benchmarks). We report the results for the EPFL MIG benchmarks in Table 7 for three crossbar dimensions. For the larger EPFL arithmetic and random control benchmarks, we report the results for a crossbar with 256 × 256 dimensions in Tables 8 and 9 respectively. We observe that for most of the results, the best delay was obtained for k = 4. This is because setting a higher value of k leads to fewer LUTs in the LUT graph. Since multiple LUTs can be scheduled in parallel (based on the constraints mentioned in Algorithm 3), this reduces the number of cycles needed to compute the benchmark by exploiting a higher degree of parallelism. For a large benchmark such as voter in Table 7 and a very small crossbar dimension (64, 64), the mapping flow fails.

Table 7 Benchmarking results for the EPFL MIG benchmarks for three crossbar sizes. We ran each benchmark with k = {2, 3, 4} and spacing set to 6. For each benchmark, the best results were obtained for k = 4

| Bench | PI | PO | Cycles, (R, C) = (64, 64) | Cycles, (R, C) = (128, 64) | Cycles, (R, C) = (128, 128) |
|---|---|---|---|---|---|
| arbiter | 256 | 129 | 81941 | 81582 | 81434 |
| cavlc | 10 | 11 | 3808 | 3672 | 3686 |
| ctrl | 7 | 26 | 786 | 759 | 757 |
| dec | 8 | 256 | 1399 | 1253 | 1284 |
| i2c | 147 | 142 | 6698 | 6656 | 6692 |
| int2float | 11 | 7 | 1369 | 1340 | 1323 |
| priority | 128 | 8 | 5479 | 5398 | 5389 |
| router | 60 | 30 | 1150 | 1121 | 1153 |
| voter | 1001 | 1 | – | 68777 | 68758 |
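Table 8 Benchmarking results for the EPFL arithmetic benchmarks for 256 × 256 crossbar size

| Bench | PI | PO | LUTs | k | Spacing | Cycles |
|---|---|---|---|---|---|---|
| adder | 256 | 129 | 339 | 4 | 6 | 4398 |
| bar | 135 | 128 | 1408 | 4 | 6 | 12216 |
| div | 128 | 128 | 57239 | 2 | 6 | 342330 |
| hyp | 256 | 128 | 64228 | 4 | – | – |
| log2 | 32 | 32 | 10127 | 4 | 1 | 128647 |
| max | 512 | 130 | 1057 | 4 | 6 | 9468 |
| multiplier | 128 | 128 | 10183 | 3 | 0 | 90925 |
| sin | 24 | 25 | 1915 | 4 | 6 | 21761 |
| sqrt | 128 | 64 | 8399 | 4 | 6 | 101694 |
| square | 64 | 128 | 6292 | 4 | 0 | 74614 |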
Table 9 Benchmarking results for the EPFL control benchmarks for 256 × 256 crossbar size, with spacing set to 6. We ran each benchmark with k = {2, 3, 4} and the best results were obtained for k = 4

| Benchmark | PI | PO | LUTs | Cycles |
|---|---|---|---|---|
| ac97_ctrl | 2255 | 2250 | 3926 | 27742 |
| comp | 279 | 193 | 8090 | 74379 |
| des_area | 368 | 72 | 1797 | 17273 |
| div16 | 32 | 32 | 2293 | 22047 |
| hamming | 200 | 7 | 725 | 9414 |
| i2c | 147 | 142 | 423 | 3133 |
| MAC32 | 96 | 65 | 3310 | 40007 |
| max | 512 | 130 | 1866 | 16072 |
| mem_ctrl | 1198 | 1225 | 3031 | 22021 |
| MUL32 | 64 | 64 | 2758 | 31344 |
| pci_bridge32 | 3519 | 3528 | 23257 | 110318 |
| pci_spoci_ctrl | 85 | 76 | 446 | 3621 |
| revx | 20 | 25 | 3056 | 31603 |
| sasc | 133 | 132 | 204 | 1476 |
| simple_spi | 148 | 147 | 305 | 2307 |
| spi | 274 | 276 | 1581 | 13115 |
| sqrt32 | 32 | 16 | 989 | 11326 |
| square | 64 | 127 | 6083 | 67602 |
| ss_pcm | 106 | 98 | 159 | 968 |
| systemcaes | 930 | 819 | 3207 | 26981 |
| systemcdes | 314 | 258 | 1128 | 9468 |
| tv80 | 373 | 404 | 3044 | 25986 |
| usb_funct | 1860 | 1846 | 5265 | 41029 |
| usb_phy | 113 | 111 | 187 | 1156 |
The mapping of voter on the 64 × 64 crossbar fails because, during the placement phase of the flow, multiple columns are blocked with intermediate results, which does not leave enough free devices to map the rest of the LUTs.
6.5 Impact of Spacing Parameter
Spacing is the number of rows that are left free between two LUTs stacked vertically, as described in Algorithm 3. We analyze the impact of spacing on three large ISCAS’85 benchmarks for k = 4 and two crossbar dimensions, 64 × 64 and 128 × 128. The results of the analysis are summarized in Fig. 15. For most of the benchmarks, the delay decreases considerably when spacing is increased from 0 to 4 or 6 (depending on the benchmark). However, increasing spacing further leads to an increase in delay. Leaving empty rows helps in finding shorter paths between source and destination locations on the crossbar during A* search, which reduces delay. However, setting a large value (such as 8 or higher) for the spacing parameter leaves less space in the crossbar for the actual placement of the LUTs, which reduces the number of parallel operations and increases delay.
Fig. 15 Impact of spacing parameters on delay for three benchmarks, considering two crossbar dimensions 64 × 64 and 128 × 128, with k = 4
6.6 Impact of Crossbar Dimensions
Figure 16 shows the impact of crossbar dimensions on the delay of mapping, while keeping the number of devices (R × C) constant. We considered k = {2, 3, 4}, spacing = {0, 2, 4, 6} and three large ISCAS’85 benchmarks. The best delay for all the benchmarks was obtained for k = 4 and spacing = 6. We observe that the delay of mapping decreases as the number of rows increases and the number of columns decreases. As discussed in Sect. 6, LUTs are stacked in vertical orientation and can be executed in parallel as long as there are no data dependencies and the number of inputs is the same. Increasing the number of rows allows a greater number of parallel operations to be executed. When only a small number of columns is available, the mapping delay increases (as observed when changing the crossbar dimensions from 1024 × 64 to 2048 × 32). This is because fewer devices are available when columns are blocked for preserving intermediate results, and the alignment overhead increases as well.
6.7 Copy Overhead
Figure 17 shows the overhead of copy operations as a percentage. As is evident from Fig. 17, copy operations constitute a large overhead in the computation of a benchmark. As we use the A* search algorithm to align the inputs, the number of copy operations used for each alignment is optimal. However, in order to limit run time, we do not attempt to schedule multiple copy operations in parallel by considering multiple source and destination locations simultaneously. This could be investigated in the future, at the cost of a higher execution time of the search algorithm.
Fig. 16 Impact of crossbar dimensions on delay of mapping, while keeping the number of devices constant
Fig. 17 Overhead of primary input placement and copying intermediate results for LUT input [bar chart; y-axis: Overhead (Cycles), 0% to 80%; x-axis: ISCAS’85 benchmarks c432, c499, c880, c1355, c1908, c2670, c3540, c5315, c6288, c7552; series: input, copy]
6.8 Comparison with Existing Works
The existing technology mapping approaches for MAGIC do not consider area constraints in mapping and focus only on minimizing the delay. Given a benchmark, the existing methods report the crossbar dimensions required to map the benchmark, along with the delay of mapping. These works therefore cannot map benchmarks to arbitrarily sized crossbar arrays. For comparison, we determine the smallest crossbar dimension for which the mapping was feasible using CONTRA. In the absence of area constraints, our method can achieve a delay identical to that of the delay-constrained method. CONTRA requires significantly lower area to map in comparison to existing methods, while having relatively higher delay. As none of the existing methods support area constraints, we use the Area-Delay Product (ADP) as a composite metric for direct comparison.

$$ADP = R \times C \times Cycles \qquad (17)$$

$$\text{Improvement in } ADP = \frac{ADP_{E_i}}{ADP_{CONTRA}} \qquad (18)$$
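Equations (17) and (18), together with the geometric mean used to summarize the comparison, can be sketched as follows. The function names are illustrative, and the example tuples are the c432 entries as read from Table 10 for CONTRA and E2, so they should be treated as an assumption about how that table is laid out.

```python
import math


def adp(rows, cols, cycles):
    """Area-Delay Product of Eq. (17)."""
    return rows * cols * cycles


def adp_improvement(existing, contra):
    """Eq. (18): ADP of an existing method divided by the ADP of CONTRA.

    existing, contra : (rows, cols, cycles) tuples
    """
    return adp(*existing) / adp(*contra)


def geometric_mean(values):
    """Geometric mean of a non-empty sequence of positive numbers."""
    values = list(values)
    return math.prod(values) ** (1.0 / len(values))


contra_c432 = (20, 12, 824)  # assumed reading of Table 10
e2_c432 = (22, 42, 225)      # assumed reading of Table 10
print(round(adp_improvement(e2_c432, contra_c432), 2))  # 1.05
```

An improvement above 1 means CONTRA has the lower ADP for that benchmark; a value close to 1, as in this example, means the two mappings are roughly on par.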
The existing works we compare CONTRA against are as follows:
• E1 (Gharpinde et al. 2017): A NOR/INV netlist is mapped using MAGIC operations by replicating specific logic levels or gates in order to achieve the maximum parallelism while guaranteeing a square-shaped allocation of memristors.
• E2 (Zulehner et al. 2019): A staircase structure is utilized to reach an almost square-like layout, with a focus on minimizing the number of time steps and utilized memristors.
• E3, E4 (Thangkhiew and Datta 2018): These methods correspond to the delay optimisation and crossbar orientation optimisation methods using a simulated annealing approach.
• E5, E6 (Yadav et al. 2019): These methods correspond to the Look Ahead with Parallel Mapping and Look Ahead Heuristic and Parallel Mapping methods presented by Yadav et al. The look-ahead heuristic attempts to minimize the number of copy operations. The parallel mapping approach tries to maximize the evaluation of gates in parallel.

Table 10 Comparison of CONTRA with existing works. Note that the delay for the existing works does not consider the placement overhead of primary inputs. R = Number of Rows, C = Number of Columns, k = Number of inputs to generate LUT Graph

| Bench | Proposed k | Proposed R×C | Proposed Cycles | E1 R×C | E1 Cycles | E2 R×C | E2 Cycles | E3 R×C | E3 Cycles | E4 R×C | E4 Cycles | E5 R×C | E5 Cycles | E6 R×C | E6 Cycles |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| c432 | 3 | 20 × 12 | 824 | 146 × 9 | 349 | 22 × 42 | 225 | 62 × 11 | 265 | 51 × 47 | 342 | 36 × 150 | 338 | 69 × 13 | 290 |
| c499 | 3 | 20 × 16 | 1140 | 323 × 13 | 1155 | 96 × 44 | 242 | 73 × 37 | 935 | 83 × 55 | 1059 | 45 × 182 | 903 | 116 × 31 | 707 |
| c880 | 3 | 32 × 22 | 1389 | 383 × 5 | 761 | 67 × 39 | 427 | 124 × 30 | 750 | 103 × 73 | 913 | 69 × 73 | 726 | 107 × 14 | 613 |
| c1355 | 3 | 36 × 16 | 1092 | 359 × 10 | 1072 | 96 × 63 | 236 | 72 × 43 | 938 | 91 × 55 | 1060 | 49 × 163 | 825 | 103 × 28 | 757 |
| c1908 | 3 | 32 × 22 | 1489 | 312 × 13 | 1056 | 83 × 85 | 517 | 60 × 60 | 970 | 70 × 66 | 1075 | 42 × 88 | 928 | 93 × 33 | 648 |
| c2670 | 4 | 38 × 34 | 2267 | 664 × 9 | 1490 | 66 × 92 | 551 | 301 × 45 | 1401 | 385 × 245 | 1495 | 202 × 137 | 1278 | 340 × 29 | 1183 |
| c3540 | 4 | 60 × 26 | 3726 | 650 × 16 | 2396 | 137 × 164 | 1435 | 153 × 150 | 2418 | 160 × 161 | 2589 | 71 × 221 | 2007 | 109 × 55 | 1761 |
| c5315 | 4 | 64 × 48 | 5365 | 1261 × 11 | 3295 | 221 × 136 | 1361 | 298 × 73 | 3239 | 449 × 179 | 3382 | 249 × 122 | 2676 | 547 × 22 | 2251 |
| c6288 | 4 | 32 × 30 | 8744 | 2297 × 6 | 3776 | 151 × 870 | 3751 | 436 × 98 | 5007 | 265 × 265 | 5515 | 33 × 892 | 3161 | 49 × 115 | 3104 |
| c7552 | 4 | 64 × 48 | 8009 | 845 × 14 | 3929 | 214 × 175 | 2182 | 321 × 320 | 3824 | 381 × 379 | 4012 | 220 × 57 | 3031 | 542 × 22 | 2486 |
| GM Reduction (Area) | | | | 5.9× | | 10.8× | | 9.4× | | 19.8× | | 12.3× | | 4.7× | |
| GM Overhead (Delay) | | | | | 1.6× | | 3.5× | | 1.7× | | 1.5× | | 1.9× | | 2.2× |

Fig. 18 Comparison of the ADP of CONTRA with existing works, along with Geometric Mean (GM) of improvement in ADP of CONTRA over existing works

We present the comparison results in Table 10. The main observations are: (1) CONTRA requires less crossbar area compared to all other methods. (2) Not only is the total area smaller, but the size of each dimension is also smaller, which makes mapping of logic into memory significantly more feasible. (3) These benefits come with a higher delay, with a geometric-mean overhead between 1.5× and 3.5× (Table 10). None of the previous works on technology mapping for MAGIC consider the overhead of placing the primary inputs on the crossbar (Gharpinde et al. 2017; Zulehner et al. 2019; Thangkhiew and Datta 2018; Yadav et al. 2019). However, we considered the cost of placing the primary inputs in all our mapping results. From Fig. 17, we can observe that the overhead of input placement in terms of number of cycles could be as high as 49% for smaller benchmarks. This strongly suggests that the overhead of input placement must be considered during mapping. Therefore, comparing our proposed method directly in terms of delay with existing works is unfair. In Fig. 18, we plot the improvement in ADP for individual test cases from the ISCAS’85 benchmarks. Barring two cases (c432 for E2 and c880 for E6), there is a considerable improvement in ADP for the proposed algorithm for all the benchmarks against all the existing implementations. We present the geometric mean of improvement in ADP of CONTRA over the existing methods. CONTRA achieves the best geometric mean improvement of 13.1× over E4. From Fig. 18, we can also rank existing methods on the basis of their ADP. After CONTRA, E6 has the next best ADP, followed closely by E1 and E2, whereas E3, E4 and E5 are significantly worse.
7 Conclusion
In-memory computing is a new technology enabler that is poised to become a mainstream computer architecture feature. Multiple devices and technologies support in-memory computing, and therefore various flavours of the architecture have already been reported in the literature, including prototype studies. The emergence and eventual acceptance of these architectures critically hinges on the availability of efficient design automation flows. In this chapter, we presented multiple techniques for automatically synthesizing and mapping a given Boolean function on to a memristive memory processing unit. The discussion covered the basics of memristive technologies and the data structures for the design automation flows. In particular, we presented design automation targeting optimality of area and runtime. These are further extended with area and delay constraints. As the memory is arranged in crossbar structures, we presented design automation flows specifically targeting crossbar structures. Overall, the studies presented in this chapter, along with the demonstrative examples, represent the state-of-the-art technology mapping flow for in-memory computing.
References
S. Aga, S. Jeloka, A. Subramaniyan, S. Narayanasamy, D. Blaauw, R. Das, Compute caches, in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA) (IEEE, 2017), pp. 481–492
A. Agrawal, A. Jaiswal, C. Lee, K. Roy, X-SRAM: enabling in-memory Boolean computations in CMOS static random access memories. IEEE Trans. Circuits Syst. I Regul. Pap. 65(12), 4219–4232 (2018)
L. Amarú, P.-E. Gaillardon, G. De Micheli, BDS-MAJ: a BDD-based logic synthesis tool exploiting majority logic decomposition, in Proceedings of the 50th Annual Design Automation Conference (ACM, 2013), p. 47
L. Amarú, P.-E. Gaillardon, G. De Micheli, Majority-inverter graph: a new paradigm for logic optimization. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 35(5), 806–819 (2016)
Berkeley Logic Synthesis and Verification Group, ABC: A System for Sequential Synthesis and Verification (2016), http://www.eecs.berkeley.edu/~alanmi/abc/. Accessed 31 October 2017
D. Bhattacharjee, R. Devadoss, A. Chattopadhyay, ReVAMP: ReRAM based VLIW architecture for in-memory computing, in 2017 Design, Automation & Test in Europe Conference & Exhibition (DATE) (IEEE, 2017), pp. 782–787
D. Bhattacharjee, Arche: A Framework for Technology Mapping of Emerging Technologies (2020), https://github.com/debjyoti0891/arche. Accessed 28 October 2020
P.E. Gaillardon, L. Amarú, A. Siemon, E. Linn, R. Waser, A. Chattopadhyay, G.D. Micheli, The programmable logic-in-memory (PLIM) computer, in DATE (2016), pp. 427–432
R. Gharpinde, P.L. Thangkhiew, K. Datta, I. Sengupta, A scalable in-memory logic synthesis approach using memristor crossbar. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 26(2), 355–366 (2017)
R. Gharpinde, P.L. Thangkhiew, K. Datta, I. Sengupta, A scalable in-memory logic synthesis approach using memristor crossbar. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 26(2), 355–366 (2018)
A. Haj-Ali, R. Ben-Hur, N. Wald, R. Ronen, S. Kvatinsky, Not in name alone: a memristive memory processing unit for real in-memory processing. IEEE Micro 38(5), 13–21 (2018)
S. Hamdioui, L. Xie, H.A.D. Nguyen, M. Taouil, K. Bertels, H. Corporaal, H. Jiao, F. Catthoor, D. Wouters, L. Eike et al., Memristor based computation-in-memory architecture for data-intensive applications, in DATE (EDA Consortium, 2015), pp. 1718–1725
M.C. Hansen, H. Yalcin, J.P. Hayes, Unveiling the ISCAS-85 benchmarks: a case study in reverse engineering. IEEE Des. Test Comput. 16(3), 72–80 (1999)
R.B. Hur, N. Wald, N. Talati, S. Kvatinsky, Simple magic: synthesis and in-memory mapping of logic execution for memristor-aided logic, in Proceedings of the 36th International Conference on Computer-Aided Design (IEEE Press, 2017), pp. 225–232
S.K. Kingra, V. Parmar, C.-C. Chang, B. Hudec, T.-H. Hou, M. Suri, SLIM: simultaneous logic-in-memory computing exploiting bilayer analog OxRAM devices. Sci. Rep. 10(1), 1–14 (2020)
S. Kvatinsky, D. Belousov, S. Liman, G. Satat, N. Wald, E.G. Friedman, A. Kolodny, U.C. Weiser, Magic-memristor-aided logic. IEEE Trans. Circuits Syst. II Express Briefs 61(11), 895–899 (2014)
E. Lehtonen, M. Laiho, Stateful implication logic with memristors, in NanoArch (IEEE Computer Society, 2009), pp. 33–36
E. Linn, R. Rosezin, S. Tappertzhofen, U. Böttger, R. Waser, Beyond von Neumann-logic operations in passive crossbar arrays alongside memory operations. Nanotechnology 23(30) (2012)
P. McGeer, J. Sanghavi, R. Brayton, A.S. Vincentelli, Espresso-signature: a new exact minimizer for logic functions, in Proceedings of the 30th International Design Automation Conference (ACM, 1993), pp. 618–624
G.D. Micheli, Synthesis and Optimization of Digital Circuits (McGraw-Hill Higher Education, 1994)
A. Pedram, S. Richardson, M. Horowitz, S. Galal, S. Kvatinsky, Dark memory and accelerator-rich system optimization in the dark silicon era. IEEE Design Test 34(2), 39–50 (2016)
J. Reuben, R. Ben-Hur, N. Wald, N. Talati, A.H. Ali, P.-E. Gaillardon, S. Kvatinsky, Memristive logic: a framework for evaluation and comparison, in 2017 27th International Symposium on Power and Timing Modeling, Optimization and Simulation (PATMOS) (IEEE, 2017), pp. 1–8
E.M. Sentovich, K.J. Singh, L. Lavagno, C. Moon, R. Murgai, A. Saldanha, H. Savoj, P.R. Stephan, R.K. Brayton, A. Sangiovanni-Vincentelli, SIS: a system for sequential circuit synthesis (1992)
V. Seshadri, D. Lee, T. Mullins, H. Hassan, A. Boroumand, J. Kim, M.A. Kozuch, O. Mutlu, P.B. Gibbons, T.C. Mowry, Ambit: in-memory accelerator for bulk bitwise operations using commodity dram technology, in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture (2017), pp. 273–287
M. Soeken, A. Chattopadhyay, Unlocking efficiency and scalability of reversible logic synthesis using conventional logic synthesis, in 2016 53nd ACM/EDAC/IEEE Design Automation Conference (DAC) (IEEE, 2016), pp. 1–6
Synopsys Design Compiler (2018), https://www.synopsys.com/implementation-and-signoff/rtl-synthesis-test/dc-ultra.html
N. Talati, S. Gupta, P. Mane, S. Kvatinsky, Logic design within memristive memories using memristor-aided logic (magic). IEEE Trans. Nanotechnol. 15(4), 635–650 (2016)
P.L. Thangkhiew, K. Datta, Scalable in-memory mapping of Boolean functions in memristive crossbar array using simulated annealing. J. Syst. Architect. 89, 49–59 (2018)
Xilinx Design Suite (2018), https://www.xilinx.com/products/design-tools/vivado.html
D.N. Yadav, P.L. Thangkhiew, K. Datta, Look-ahead mapping of Boolean functions in memristive crossbar array. Integration 64, 152–162 (2019)
A. Zulehner, K. Datta, I. Sengupta, R. Wille, A staircase structure for scalable and efficient synthesis of memristor-aided logic, in Proceedings of the 24th Asia and South Pacific Design Automation Conference (ACM, 2019), pp. 237–242
Empowering the Design of Reversible and Quantum Logic with Decision Diagrams Robert Wille, Philipp Niemann, Alwin Zulehner, and Rolf Drechsler
Abstract Reversible computation has received significant attention in recent years as an alternative computation paradigm which can be beneficial e.g. for encoder circuits, low-power design, adiabatic circuits, and verification, to name just a few examples. Aside from those applications in the design of (conventional) integrated circuits, reversible logic components are also a key ingredient in many quantum algorithms, i.e. in the field of quantum computing, which has itself emerged as a very promising computing paradigm and is steadily gaining relevance. All this has led to a steadily increasing demand for methods that allow for an efficient and correct design of corresponding circuits. Decision diagrams play an important role in the design of conventional circuitry. In recent years, their benefits for the design of the newly emerging reversible and quantum logic circuits have also become evident. In this overview paper, we review and illustrate previous and ongoing work on decision diagrams for such circuits and sketch corresponding design methods relying on them. By this, we demonstrate how broadly decision diagrams can be employed in this area and how they empower the design flow for these emerging technologies.
R. Wille (B) · A. Zulehner Institute for Integrated Circuits, Johannes Kepler University, 4040 Linz, Austria
P. Niemann · R. Drechsler Department of Computer Science, University of Bremen, 28359 Bremen, Germany
© Springer Nature Singapore Pte Ltd. 2023 M. M. S. Aly and A. Chattopadhyay (eds.), Emerging Computing: From Devices to Systems, Computer Architecture and Design Methodologies, https://doi.org/10.1007/978-981-16-7487-7_11
1 Introduction
While the vast majority of circuits and systems rely on a conventional computation paradigm, alternative schemes of computation may provide potential for further technologies and/or applications. Reversible circuits are a corresponding example, as they have been shown to be beneficial e.g. for coding/encoding (Zulehner and Wille 2017d), low-power design (Landauer 1961; Bennett 1973; Berut et al. 2012), adiabatic circuits (Athas and Svensson 1994; Rauchenecker et al. 2017), verification (Amarù et al. 2016), or on-chip interconnects (Wille et al. 2012, 2016). All these applications exploit the main property of reversible logic, namely that corresponding circuits only realize bijections which map each input pattern to a unique output pattern and, by this, allow for computations in both directions. Besides that, and related to reversible logic, the domain of quantum circuits (Nielsen and Chuang 2000) has received steadily increasing attention. Here, qubits rather than conventional bits are utilized, which allows for exploiting quantum-physical phenomena such as superposition and entanglement. This provides the basis for quantum parallelism, i.e. the ability to conduct operations on an exponential number of basis states concurrently. As a result, plenty of applications exist in domains such as quantum chemistry, machine learning, cryptography, search, or simulation where conventional systems reach their limits (Grover 1996; Shor 1997; Montanaro 2016). Particularly in the recent past, this domain has gained more and more relevance, with companies such as Google, IBM, Microsoft, etc. getting increasingly involved. Since all quantum computations are inherently reversible in nature, reversible and quantum circuits share many similarities, which is why many accomplishments in the domain of reversible circuits can also be employed in the domain of quantum circuits.
This broad variety of applications eventually led to a steadily increasing demand for methods that efficiently and correctly design such circuits. Accordingly, this area has been intensely considered by researchers worldwide in the past years (for overviews of the respective work we refer to Saeedi and Markov (2013), Drechsler and Wille (2011)). In this overview paper,¹ we put a particular emphasis on corresponding methods which rely on decision diagrams. In fact, decision diagrams such as Binary Decision Diagrams (BDDs, Bryant 1986), Kronecker Functional Decision Diagrams (KFDDs, Drechsler and Becker 1998), or Binary Moment Diagrams (BMDs, Bryant and Chen 1995) have played an important role in the design of conventional circuitry and have been applied to numerous design tasks in this domain (see e.g. Malik et al. 1988; Drechsler et al. 2004). And although the concepts of reversible and particularly quantum circuits are fundamentally different from that, some decision diagram solutions have already been proposed and proven useful for design automation in these emerging domains. More precisely, the most prominent proposals comprise the Quantum Decision Diagram (QDD, Abdollahi and Pedram 2006), the Quantum Information Decision Diagram (QuIDD, Viamontes et al. 2004), the X-decomposition Quantum Decision Diagram (XQDD, Wang et al. 2008) as well as the Quantum Multiple-Valued Decision Diagram (QMDD, Niemann et al. 2016). These decision diagrams have been used in various applications such as synthesis (see e.g. Abdollahi and Pedram 2006; Soeken et al. 2012), simulation (see e.g. Goodman et al. 2007; Viamontes et al. 2009; Zulehner and Wille 2017a), and verification (see e.g. Viamontes et al. 2007; Wang et al. 2008; Wille et al. 2009) of reversible and quantum circuits.²
In the following, we will review and illustrate selected previous work on decision diagrams for reversible and quantum circuits as well as corresponding design methods relying on them. To this end, we will focus on decision diagrams identical or similar to the QMDD, whose concepts are briefly motivated and reviewed in the following section. Afterwards, Sect. 3 illustrates how QMDDs can be utilized for typical design tasks such as synthesis, verification, and simulation. Throughout, we will not provide a comprehensive description, but aim to sketch the main ideas while referring to the respective original work for a more detailed treatment. By this, we demonstrate by example how broadly decision diagrams are already employed in this area and what benefits they yield for these emerging technologies. References are provided to equip the interested reader with more comprehensive descriptions and implementations, respectively.
¹ This work is an extended version of Wille et al. (2018) and also takes into account the most recent developments in the field.
2 Decision Diagrams for Reversible and Quantum Logic
We start by motivating the use of decision diagrams in reversible and quantum logic. As in conventional logic, representations of reversible and quantum functions suffer from an exponential complexity, which decision diagrams aim to cope with. This is discussed in more detail next. After that, an intuition of the main concepts of the decision diagrams considered in this paper is provided.
2.1 Motivation
Reversible and quantum circuits realize reversible and quantum functions, respectively. For details on the background for both, we refer to the respective literature such as Nielsen and Chuang (2000) and focus on their function representation in the following. In fact, reversible and quantum functions can be represented by matrices defined as follows.
Definition 1 A reversible Boolean function $f : \mathbb{B}^n \to \mathbb{B}^n$ defines an input/output mapping where the number of inputs is equal to the number of outputs and where each input pattern is mapped to a unique output pattern. This can be described using a permutation matrix describing a permutation $\pi$ of the set $\{0, \ldots, 2^n - 1\}$, i.e. a $2^n \times 2^n$ matrix $P = [p_{i,j}]_{2^n \times 2^n}$ with $p_{i,j} = 1$ if $i = \pi(j)$ and $0$ otherwise, for all $0 \le i, j < 2^n$. Each column (row) of the matrix represents one possible input pattern (output pattern) of the function. If $p_{i,j} = 1$, then the input pattern corresponding to column $j$ maps to the output pattern corresponding to row $i$.
² For a comprehensive overview of these diagrams we refer to Niemann and Wille (2017, Chap. 3).
Fig. 1 Representation for reversible and quantum functions
Example 1 Figure 1a shows a permutation matrix describing a reversible function that maps e.g. input pattern 00 to the output pattern 10 (denoted by the 1-entry in the first column of the matrix).
Quantum functions are similar, but work on so-called qubits rather than conventional (i.e. Boolean) bits. A qubit can represent two basis states 0 and 1 as well as superpositions of the two. More formally:
Definition 2 A qubit is a two-level quantum system, described by a two-dimensional complex Hilbert space. The two orthogonal basis states $|0\rangle \equiv \begin{pmatrix} 1 \\ 0 \end{pmatrix}$ and $|1\rangle \equiv \begin{pmatrix} 0 \\ 1 \end{pmatrix}$ are used to represent the Boolean values 0 and 1. The state of a qubit may be written as $|x\rangle = \alpha|0\rangle + \beta|1\rangle$, where the amplitudes $\alpha$ and $\beta$ are complex numbers with $|\alpha|^2 + |\beta|^2 = 1$. The quantum state of a single qubit is denoted by the vector $\begin{pmatrix} \alpha \\ \beta \end{pmatrix}$. The state of a quantum system with $n > 1$ qubits can be represented by a complex-valued vector of length $2^n$, called the state vector.
According to the postulates of quantum mechanics, the evolution of a quantum system can be described by a series of transformation operations satisfying the following.
Definition 3 A quantum operation over $n$ qubits can be represented by a unitary transformation matrix, i.e. a $2^n \times 2^n$ matrix $U = [u_{i,j}]_{2^n \times 2^n}$ with
• each entry $u_{i,j}$ assuming a complex value and
• the inverse $U^{-1}$ of $U$ being the conjugate transpose matrix (adjoint matrix) $U^\dagger$ of $U$ (i.e. $U^{-1} = U^\dagger$).
Every quantum operation is reversible since the matrix defining any quantum operation is invertible. At the end of the computation, a qubit can be measured, causing it to collapse to a basis state. Then, depending on the current state of the qubit, either a 0 (with probability $|\alpha|^2$) or a 1 (with probability $|\beta|^2$) results. The state of the qubit is destroyed by the act of measuring it.
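As a small illustration of Definitions 1 and 3, the following sketch builds the permutation matrix of a 2-bit reversible function and checks that it is unitary. Only the mapping 00 to 10 is taken from Example 1; the remaining entries of `pi` and all function names are assumptions made for this sketch.

```python
def permutation_matrix(pi):
    """Build P with P[i][j] = 1 if i == pi[j] and 0 otherwise (Definition 1)."""
    size = len(pi)
    if sorted(pi) != list(range(size)):
        raise ValueError("pi must be a permutation of 0 .. len(pi) - 1")
    return [[1 if i == pi[j] else 0 for j in range(size)] for i in range(size)]


def matmul(a, b):
    """Plain matrix product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transpose(m):
    return [list(column) for column in zip(*m)]


# Hypothetical 2-bit function: only pi[0] = 2 (00 -> 10) comes from Example 1.
pi = [2, 0, 3, 1]
P = permutation_matrix(pi)

identity = [[int(i == j) for j in range(4)] for i in range(4)]
# For a real-valued matrix the adjoint is the transpose, so this is the
# unitarity condition of Definition 3.
is_unitary = matmul(transpose(P), P) == identity
```

The check succeeds for every valid permutation, which reflects that permutation matrices are a special case of unitary matrices.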
Example 2 Consider the quantum operation $H$ defined by the unitary matrix shown in Fig. 1b, which is the well-known Hadamard operation (Nielsen and Chuang 2000). Applying $H$ to the input state $|x\rangle = \begin{pmatrix} 1 \\ 0 \end{pmatrix}$, i.e. computing $H \times |x\rangle$, yields a new quantum state $|x'\rangle = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ 1 \end{pmatrix}$. For $|x'\rangle$, $\alpha = \beta = \frac{1}{\sqrt{2}}$. Measuring this qubit would either lead to a Boolean 0 or a Boolean 1 with a probability of $|\frac{1}{\sqrt{2}}|^2 = 0.5$ each. This computation represents one of the simplest quantum circuits: a single-qubit random number generator.
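The computation of Example 2 can be retraced with a few lines of plain Python. This is only a sketch on dense lists; the helper names `apply` and `measurement_probabilities` are chosen for illustration.

```python
import math

S = 1 / math.sqrt(2)
# Hadamard operation as a dense 2 x 2 matrix.
H = [[S, S],
     [S, -S]]


def apply(matrix, state):
    """Matrix-vector product: the new state after applying the operation."""
    return [sum(matrix[i][j] * state[j] for j in range(len(state)))
            for i in range(len(matrix))]


def measurement_probabilities(state):
    """Probability of observing each basis state (squared amplitude magnitudes)."""
    return [abs(amplitude) ** 2 for amplitude in state]


ket0 = [1.0, 0.0]                # the basis state |0>
new_state = apply(H, ket0)       # amplitudes alpha = beta = 1/sqrt(2)
probs = measurement_probabilities(new_state)
back = apply(H, new_state)       # H is its own inverse, so this restores |0>
```

Both probabilities come out as 0.5 up to floating-point rounding, and applying `H` a second time returns the original state, which illustrates the reversibility of the operation.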
2.2 Compact Representation of Matrices
The core idea for obtaining compact representations of the permutation and transformation matrices occurring in the study of reversible and quantum functions is to identify redundancies in the matrices and to represent recurring patterns, i.e. identical or similar sub-matrices, by shared structures. To this end, a matrix of dimension $2^n \times 2^n$ is partitioned into four sub-matrices of dimension $2^{n-1} \times 2^{n-1}$ as follows:
$$M = \begin{pmatrix} M_{00} & M_{01} \\ M_{10} & M_{11} \end{pmatrix}$$
The partitioning process can recursively be applied to each of the sub-matrices and to each of the subsequent levels of sub-matrices until one reaches the terminal case where each sub-matrix is a single matrix entry. Now, the core idea of QMDDs (Niemann et al. 2016) is to represent this matrix decomposition in terms of a Directed Acyclic Graph (DAG) and to represent identical sub-matrices by shared nodes. As QMDDs additionally allow edges to be annotated with weights, shared nodes can also be used for structurally equivalent sub-matrices that differ only by a scalar factor.
Example 3 Figure 2b shows the QMDD for the transformation matrix from Fig. 2a. Here, the single root node (labeled $q_0$) represents the whole matrix and has four outgoing edges to nodes representing the top-left, top-right, bottom-left, and bottom-right sub-matrix (from left to right). This decomposition is repeated at each partitioning level until the terminal node (representing a single matrix entry) is reached. Note that a single node at the $q_1$ level is sufficient in our case, since the first three $2 \times 2$ sub-matrices are identical and the bottom-right sub-matrix differs only by the scalar factor $-1$. During the construction of the QMDD, this scalar factor is identified and annotated to the corresponding edge. Similarly, the common multiplier $\frac{1}{\sqrt{2}}$ is extracted and annotated to the root edge.
Moreover, efficient algorithms have been presented for applying operations like matrix addition or multiplication directly on the QMDD data-structure.
These algorithms are polynomial in the size of the QMDD representations, such that a compact representation with a high number of shared nodes is highly desirable. Note, however, that there is a trade-off between compactness and accuracy that results from representing complex/irrational numbers as floating-point numbers (Zulehner et al. 2019a). On the one hand, some redundancies might not be found due to rounding errors if the machine accuracy is too high (i.e., values that should be equal are treated as different, leading to a less compact representation). On the other hand, spurious redundancies might be found if the machine accuracy is too low (i.e., values that differ are treated as equal, leading to a less accurate or even corrupted computation). As a consequence, deliberately lowering the machine accuracy might be a suitable way to further increase the efficiency of the computations, while algebraic number representations can be employed if perfect accuracy is of the essence.
Overall, QMDDs allow for both a compact representation and an efficient manipulation of permutation/transformation matrices. As a consequence, they have been used in a broad variety of applications in the design of reversible and quantum circuits. This will be discussed in more detail in the following.
Fig. 2 Representations for $U = H \otimes I_2$
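The decomposition, node sharing, and tolerance trade-off described in this section can be sketched in a few lines. The following is a strongly simplified toy model and not the QMDD package itself: the class name `DDBuilder`, the normalization by the first non-zero edge weight, and the grid-based comparison of weights are assumptions made for illustration only.

```python
import math


class DDBuilder:
    """Toy decision-diagram builder with shared nodes and weighted edges."""

    def __init__(self, tolerance=1e-10):
        self.tolerance = tolerance
        self.unique = {}  # (matrix size, normalized child edges) -> node id

    def _snap(self, w):
        # Snap a complex weight to a grid so nearly equal weights compare equal.
        return (round(w.real / self.tolerance), round(w.imag / self.tolerance))

    def build(self, m):
        """Return an edge (weight, node id); node id 0 is the terminal node."""
        if len(m) == 1:
            return (complex(m[0][0]), 0)
        h = len(m) // 2
        # Sub-matrices in the order top-left, top-right, bottom-left, bottom-right.
        blocks = [[row[c:c + h] for row in m[r:r + h]]
                  for r in (0, h) for c in (0, h)]
        edges = [self.build(b) for b in blocks]
        edges = [(0j, 0) if abs(w) < self.tolerance else (w, n) for w, n in edges]
        # Assumed normalization: factor out the first non-zero edge weight.
        norm = next((w for w, _ in edges if w != 0), 0j)
        if norm == 0:
            return (0j, 0)  # all-zero sub-matrix: zero-weight edge to the terminal
        children = tuple((self._snap(w / norm), n) for w, n in edges)
        node = self.unique.setdefault((len(m), children), len(self.unique) + 1)
        return (norm, node)


s = 1 / math.sqrt(2)
U = [[s, 0, s, 0],
     [0, s, 0, s],
     [s, 0, -s, 0],
     [0, s, 0, -s]]   # U = H (x) I_2 as in Fig. 2a

exact = DDBuilder()
root_weight, root_node = exact.build(U)

# The same matrix with a tiny numerical error in one entry.
noisy = [row[:] for row in U]
noisy[0][0] += 1e-13
loose = DDBuilder(tolerance=1e-10)
loose.build(noisy)
strict = DDBuilder(tolerance=1e-15)
strict.build(noisy)
```

For the exact matrix the builder creates two non-terminal nodes (one per level) and a root weight of $\frac{1}{\sqrt{2}}$, in line with Example 3. For the perturbed matrix, the looser tolerance still shares the lower-level node, whereas the stricter tolerance needs an additional node; this mirrors the compactness/accuracy trade-off discussed above.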
3 Application in the Design of Reversible and Quantum Circuits
This section illustrates how decision diagrams such as the one reviewed above can be utilized for typical design tasks such as synthesis, verification, and simulation. To this end, we first sketch the premise of the respective design task followed by an illustration of how decision diagrams help in this regard.
3.1 Synthesis
Synthesis constitutes the task of finding a realization of a given function as a reversible or quantum circuit. This is one of the most important steps in the design of circuits and systems, as it provides the user with first realizations of the desired function. To this end, a circuit model including a gate library (realizing the respective reversible or quantum operations) is used. In terms of reversible circuits, the commonly used gate library is formed by generalized Toffoli gates.
Definition 4 A Multiple-Controlled Multiple-Polarity Toffoli (MCMPT) gate $g_i = \mathrm{TOF}(C_i, t_i)$ is composed of a set $C_i \subseteq \{x_j^+ \mid x_j \in X\} \cup \{x_j^- \mid x_j \in X\}$ of positive and negative control lines (where $X$ denotes the set of all circuit lines) and a target line $t_i \in X$ with $\{t_i^-, t_i^+\} \cap C_i = \emptyset$. Furthermore, a line must not occur both as positive and as negative control line in a gate, i.e. $\{x_j^+, x_j^-\} \cap C_i \neq \{x_j^+, x_j^-\}$. Then, the value of the target line $t_i$ is inverted by gate $g_i$ iff all positive (negative) control lines are assigned one (zero). All other lines are passed through the gate unaltered. A cascade of such gates $G = g_1 g_2 \ldots g_l$ forms a reversible circuit.
Fig. 3 Reversible circuit
Example 4 Figure 3 shows a reversible circuit composed of three circuit lines and four Toffoli gates. Furthermore, the circuit lines are annotated with their respective value when applying input combination $x_3 x_2 x_1 = 111$. The first gate $g_1$ inverts the value of target line $x_2$, because the positive control line $x_3^+$ is assigned 1. Gate $g_2$ inverts the value of target line $x_3$, because the control lines $x_2^-$ and $x_1^+$ are assigned 0 and 1, respectively. The remaining two gates do not alter the value on any circuit line, because their control lines are not assigned accordingly.
It has been known since the seminal work of Tom Toffoli (1980) that one particular kind of Toffoli gate (with two positive controls) is sufficient to realize any reversible Boolean function. Several approaches have been proposed which conduct synthesis for a given reversible Boolean function by successively applying Toffoli gates until the represented functionality is transformed into the identity function (typical representatives of such a scheme are Transformation-Based Synthesis (Shende et al.
2002; Miller et al. 2003) as well as approaches based on Reed-Muller expansion (Gupta et al. 2006) or Reed-Muller spectra (Maslov et al. 2007)). Since the original function $f$ is a bijection, a unique inverse $f^{-1}$ exists and the composition $f \cdot f^{-1} = \mathrm{id}$ yields the identity function. Consequently, if a cascade $G$ of gates is determined which transforms $f$ to the identity, a circuit realizing $f^{-1}$ has been obtained. From this, a circuit that realizes $f$ can be obtained by inverting $G$, i.e. by reversing the gate order and replacing each gate with its inverse, which in the case of Toffoli gates is trivial since all such gates are self-inverse. However, methods relying on such a scheme (such as Shende et al. 2002; Miller et al. 2003; Gupta et al. 2006; Maslov et al. 2007) suffer from the exponential complexity of the function description, e.g. in terms of a truth-table or permutation matrix.
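The gate semantics of Definition 4 and the inversion of a cascade by reversing the gate order can be sketched as follows. Only the first two gates of Example 4 are described in the text, so the cascade below contains just these two; the names `Toffoli`, `tof`, `run`, and `inverse` are assumptions of this sketch.

```python
from itertools import product
from typing import FrozenSet, NamedTuple


class Toffoli(NamedTuple):
    positive: FrozenSet[str]  # control lines that must carry 1
    negative: FrozenSet[str]  # control lines that must carry 0
    target: str


def tof(positive, negative, target):
    """Create a gate after checking the side conditions of Definition 4."""
    pos, neg = frozenset(positive), frozenset(negative)
    if target in pos | neg:
        raise ValueError("the target line must not be a control line")
    if pos & neg:
        raise ValueError("a line cannot be both a positive and a negative control")
    return Toffoli(pos, neg, target)


def apply_gate(gate, state):
    """Invert the target iff all controls carry their required values."""
    fires = (all(state[c] == 1 for c in gate.positive)
             and all(state[c] == 0 for c in gate.negative))
    result = dict(state)
    if fires:
        result[gate.target] ^= 1
    return result


def run(circuit, state):
    for gate in circuit:
        state = apply_gate(gate, state)
    return state


def inverse(circuit):
    """Toffoli gates are self-inverse, so reversing the order is sufficient."""
    return list(reversed(circuit))


g1 = tof({"x3"}, set(), "x2")     # TOF({x3+}, x2)
g2 = tof({"x1"}, {"x2"}, "x3")    # TOF({x2-, x1+}, x3)
circuit = [g1, g2]

out = run(circuit, {"x3": 1, "x2": 1, "x1": 1})
restored = run(inverse(circuit), out)

# Running the circuit followed by its inverse is the identity on all inputs.
round_trips = all(
    run(inverse(circuit), run(circuit, dict(zip(("x3", "x2", "x1"), bits))))
    == dict(zip(("x3", "x2", "x1"), bits))
    for bits in product((0, 1), repeat=3)
)
```

For the input 111, the two gates behave as described in Example 4: $g_1$ flips $x_2$ to 0, and $g_2$ then flips $x_3$ to 0. Note that this simulation enumerates patterns explicitly; it does not address the exponential size of the function description mentioned above.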
Fig. 4 Effects of applying Toffoli gates to QMDDs
Since decision diagrams often allow for a compact representation of the function to be synthesized, they provide a suitable solution to this problem. Moreover, since they additionally offer efficient capabilities for function manipulation, corresponding transformations can easily be conducted. Example 5 Consider the root node of the QMDD shown in Fig. 4a. To establish the identity for this node (the top-right and the bottom-left sub-matrix are zero matrices), we apply the gate TOF(∅, x2 ). This simply exchanges the first (third) and the second (fourth) edge of the root node of the QMDD. The resulting QMDD is shown in Fig. 4b. To establish the identity for the right-most x1 -node, we again need to apply a Toffoli gate with target line x1 . To ensure that the other node labeled x1 is not modified either (this node already represents the identity), we add a positive control line x2+ to the gate (i.e. TOF({x2+ }, x1 )). This way, only the nodes are affected that can be reached through the fourth edge of the root node (i.e. the node labeled x2 )— eventually resulting in the QMDD representing the identity shown in Fig. 4c. Approaches such as proposed in Soeken et al. (2012) and further improved in Zulehner and Wille (2017b) successfully realize these concepts and allow for an efficient (and scalable) synthesis of reversible circuits using decision diagrams. Considering quantum logic, a similar strategy can be applied, although the problem is significantly more complicated. In fact, additionally quantum-mechanical effects have to be considered when determining a gate sequence yielding the identity. These effects become evident as multiple complex-valued entries per column in the unitary matrix—in contrast to pure permutation matrices in the case of reversible logic. Initial synthesis approaches have considered different decompositions of the considered unitary transformation matrices, e.g. the Cosine-Sine-Decomposition (CSD) in Shende et al. 
(2006) or the Quantum Shannon Decomposition (QSD) in Saeedi et al. (2011a). However, these have the drawback that they lead to a significant number of gates (even for a small number of qubits) and that they rely on a set of arbitrary one-qubit gates. The latter poses a severe obstacle since, in physical realizations, these must be approximated by a restricted set of gates, in particular when fault-tolerant methods are applied. Moreover, decision diagrams do not seem to provide any benefit here. In contrast, dedicated approaches have been proposed that are by construction compatible with decision diagrams and not only employ their compact function representation to overcome limitations of previous work, but even exploit certain specific properties of the decision diagrams (Niemann et al. 2014a, 2018). In these approaches, the given unitary matrix is transformed in three steps (as also illustrated in Fig. 5):
(a) Eliminate superposition, i.e. apply quantum gates so that the multiple non-zero matrix entries in each column are combined into a single non-zero entry.
(b) Move to diagonal, i.e. apply quantum gates which move the remaining non-zero entries to the diagonal of the matrix.
(c) Remove phase shifts, i.e. apply quantum gates which transform the diagonal entries to 1, eventually yielding the identity matrix.
Fig. 5 Proposed global synthesis scheme
Note that the second step (Move to diagonal) is very similar to synthesizing a reversible Boolean function, since the unitary matrices at this point are essentially permutation matrices with the relaxation that the non-zero entries can be any complex number with an absolute value of 1 (not only 1 itself). Accordingly, the same approaches as for reversible Boolean functions can be applied here.
Finally, note that all above-mentioned approaches require a reversible description of the function to be synthesized in order to work properly. However, it is often the case that (Boolean) functions are to be realized which are originally described in a non-reversible way, i.e. output patterns are not unique. Then, a so-called embedding has to be conducted first. To this end, corresponding extensions (e.g. in terms of additional primary outputs, called garbage outputs) are applied to the function which make it possible to explicitly distinguish non-unique output patterns, making the function reversible. Since here, too, a function has to be considered in its entirety, decision diagrams as a means for compact representation have successfully been utilized for this purpose (see e.g. Zulehner and Wille 2017c).
Moreover, they have even been employed to get rid of this extra step entirely and, instead, to realize a one-pass synthesis scheme which combines embedding and synthesis (see e.g. Zulehner and Wille 2018a, b).
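To make the idea of embedding more tangible, the following sketch works on an explicit truth table. It counts how often the most frequent output pattern occurs and derives from that how many garbage outputs are needed to tell all occurrences apart, a standard bound that is not spelled out in the text above. It then appends an occurrence index as garbage bits. The function names are hypothetical, and the sketch only makes the output patterns unique; completing the result to a full bijection (with additional constant inputs) is omitted. The decision-diagram-based methods cited above avoid enumerating the truth table in this explicit way.

```python
from collections import Counter


def min_garbage_outputs(truth_table):
    """Number of garbage outputs needed to distinguish equal output patterns."""
    mu = max(Counter(truth_table.values()).values())
    # ceil(log2(mu)) computed with integers only.
    return (mu - 1).bit_length()


def embed(truth_table):
    """Append garbage bits so that every input maps to a unique output pattern."""
    g = min_garbage_outputs(truth_table)
    seen = Counter()
    embedded = {}
    for inputs in sorted(truth_table):
        out = truth_table[inputs]
        index = seen[out]          # how often this output occurred before
        seen[out] += 1
        garbage = tuple((index >> b) & 1 for b in reversed(range(g)))
        embedded[inputs] = out + garbage
    return embedded


# Non-reversible example: the 2-input AND function.
and_table = {(0, 0): (0,), (0, 1): (0,), (1, 0): (0,), (1, 1): (1,)}
garbage_count = min_garbage_outputs(and_table)
embedded_and = embed(and_table)
```

For the AND function, the output pattern 0 occurs three times, so two garbage outputs are required and the embedded function has three outputs in total.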
3.2 Verification
Verification means the task of checking whether a given function representation and/or circuit is indeed correct. A typical example of a verification task is equivalence checking, in which it is checked whether two structurally different function descriptions are functionally equivalent or not. Typical use cases include e.g. the situation in which a designer wants to confirm whether the generated circuit indeed realizes the desired function or whether an optimization conducted on a circuit did not change its functionality. In particular with the emergence of sophisticated design flows for quantum computing (which include steps such as compiling to elementary gates (Amy et al. 2013; Miller et al. 2011) followed by mapping to a description that satisfies coupling constraints (Saeedi et al. 2011b; Li et al. 2019; Zulehner et al. 2019b)), equivalence checking gains importance, as this leads to different versions of a quantum circuit which may differ in their structure and gate library, but are supposed to remain functionally equivalent. However, equivalence checking of quantum circuits is an exponentially hard problem (in fact, it has been proven to be QMA-complete (Janzing et al. 2005)). For tasks like these, decision diagrams have already earned a strong reputation for conventional circuits, since the corresponding representations are inherently canonic (assuming a fixed variable order) (Bryant 1986). Luckily, the same property exists for the decision diagrams considered here. In fact, as proven in Niemann et al. (2016), the representation illustrated in Sect. 2.2 is canonic. Accordingly, the equivalence of two reversible or quantum functions can be verified by simply generating their decision diagrams in the same fashion and comparing them (this has e.g. been evaluated in Wille et al. (2009), Niemann et al. (2016)).³
Fig. 6 Equivalence checking for quantum circuits
Example 6 Consider the two quantum circuits shown in Fig. 6. To determine whether these two circuits are equivalent, we construct a QMDD describing the functionality for each circuit. Since both circuits lead to the same QMDD (also shown in Fig. 6), their equivalence is proven.
³ Note that this approach can also be generalized for multiple-valued reversible and quantum functionality as demonstrated in Niemann et al. (2014b).
Moreover, in case the considered circuits are indeed equivalent, certain characteristics of quantum circuits and their corresponding decision diagrams can be utilized. More precisely, consider two quantum computations whose equivalence shall be investigated and which are provided as quantum circuits $G = g_1 \ldots g_m$ and $\hat{G} = \hat{g}_1 \ldots \hat{g}_n$. Due to the inherent reversibility of quantum operations, the inverse of a quantum circuit may be easily determined by taking the adjoint (conjugate transpose) of each gate's representation and inverting the order of gates, i.e. $\hat{G}^{-1} = (\hat{g}_n)^\dagger \ldots (\hat{g}_1)^\dagger$. If two computations are equivalent, this certainly allows for the conclusion that $G \cdot \hat{G}^{-1} = \mathrm{id}$, where $\mathrm{id}$ denotes the identity function. If a strategy can now be employed so that the respective gates from $G$ and $\hat{G}^{-1}$ are applied in a fashion that frequently yields the identity, the entire procedure can be conducted rather efficiently, since the identity constitutes the best case for decision diagram representations and only requires a linear number of nodes with respect to the number of qubits. Preliminary investigations exploiting this have been conducted in Burgholzer and Wille (2020).
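The relation $G \cdot \hat{G}^{-1} = \mathrm{id}$ can be tried out on a tiny single-qubit instance. The sketch below uses dense matrices in plain Python rather than decision diagrams, checks strict equality (no global phase is factored out), and does not model the gate-interleaving strategy mentioned above; all helper names are assumptions. As an example, it uses the well-known identity that applying H, then Z, then H has the same effect as a single X gate.

```python
import math

S = 1 / math.sqrt(2)
H = [[S, S], [S, -S]]
X = [[0, 1], [1, 0]]
Z = [[1, 0], [0, -1]]


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def adjoint(m):
    """Conjugate transpose of a square matrix."""
    n = len(m)
    return [[complex(m[j][i]).conjugate() for j in range(n)] for i in range(n)]


def circuit_matrix(gates, dim):
    """Matrix of a circuit whose gates are applied from left to right."""
    result = [[complex(i == j) for j in range(dim)] for i in range(dim)]
    for gate in gates:
        result = matmul(gate, result)
    return result


def inverse_circuit(gates):
    """Reverse the gate order and replace each gate by its adjoint."""
    return [adjoint(g) for g in reversed(gates)]


def is_identity(m, tol=1e-9):
    n = len(m)
    return all(abs(m[i][j] - (1 if i == j else 0)) <= tol
               for i in range(n) for j in range(n))


def equivalent(g, g_hat, dim=2):
    """Check whether G followed by the inverse of G-hat yields the identity."""
    return is_identity(circuit_matrix(list(g) + inverse_circuit(g_hat), dim))


same = equivalent([H, Z, H], [X])
different = equivalent([H], [X])
```

The comparison against the identity uses a small tolerance, because floating-point rounding means the product is only approximately the identity matrix.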
3.3 Simulation
Simulation constitutes the task of determining the output state for a given input state applied to a reversible or quantum circuit. This is relevant to quickly validate the intended behavior of a circuit without the burden of conducting a full verification. In case of quantum circuits, simulation also becomes important, since e.g. (preliminary versions of) quantum applications are usually evaluated through simulation first, before the mature realizations are actually executed on real machines. This is because real quantum computers are usually expensive, hardly available, and still limited in the number of qubits as well as their fidelity. Besides that, simulators may give additional insights, since, e.g., the precise amplitudes of a quantum state are explicitly determined (while they are not observable in a real quantum computer).
For reversible circuits, simulation is rather trivial, as basically only Boolean values are applied to the inputs, which can easily be evaluated considering reversible gates such as Toffoli gates. Figure 3 nicely shows this: the input pattern can easily be propagated from left to right by checking whether all control lines of a gate are set to the appropriate value (i.e. 1 for positive and 0 for negative controls) and flipping the target line accordingly, while, at the same time, all remaining values are passed through the gate unaltered. In case of quantum circuits, however, a substantially harder problem results.⁴ In fact, in this case, the respective input state has to be provided in terms of a state vector so that it can be evaluated with respect to a unitary matrix representing the quantum operation to be simulated. The simulation step itself can then be conducted by a matrix-vector multiplication.
⁴ This also comes as no surprise since, if simulating quantum circuits were trivial on a conventional machine, there would be no need for a quantum circuit in the first place.
Example 7 Consider a quantum system composed of two qubits which is currently in state $|x\rangle = |00\rangle$. Applying an $H$-operation to the first qubit (as defined by the matrix shown in Fig. 2a) yields a new state vector determined by
$$|x'\rangle = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{pmatrix} \times \begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \end{pmatrix} = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ 0 \\ 1 \\ 0 \end{pmatrix}.$$
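The matrix-vector multiplication of Example 7 can be reproduced with dense lists in plain Python. The sketch builds $H \otimes I_2$ with a small Kronecker-product helper and applies it to $|00\rangle$; the helper names are assumptions, and a dense representation like this grows exponentially with the number of qubits, which is exactly what decision diagrams aim to avoid.

```python
import math

S = 1 / math.sqrt(2)
H = [[S, S], [S, -S]]
I2 = [[1, 0], [0, 1]]


def kron(a, b):
    """Kronecker product of two matrices given as nested lists."""
    return [[a[i][j] * b[k][l]
             for j in range(len(a[0])) for l in range(len(b[0]))]
            for i in range(len(a)) for k in range(len(b))]


def apply(matrix, state):
    """One simulation step: matrix-vector multiplication."""
    return [sum(matrix[i][j] * state[j] for j in range(len(state)))
            for i in range(len(matrix))]


U = kron(H, I2)                  # H on the first qubit, identity on the second
ket00 = [1, 0, 0, 0]             # state vector of |00>
new_state = apply(U, ket00)      # (1/sqrt(2)) * (1, 0, 1, 0)
probabilities = [abs(a) ** 2 for a in new_state]
```

The resulting amplitudes are $\frac{1}{\sqrt{2}}$ for the basis states $|00\rangle$ and $|10\rangle$ and 0 otherwise, so a measurement would yield each of these two basis states with probability 0.5.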
Also here, decision diagrams as reviewed in Sect. 2.2 can help, as they already provide a compact representation for unitary matrices. However, a representation of state vectors is additionally required. Moreover, the quantum operations to be simulated (and, accordingly, methods to manipulate matrices and vectors) as well as the (non-reversible) measurement step need to be supported. To this end, corresponding solutions using decision diagrams have been proposed in Zulehner and Wille (2017a). Eventually, this led to substantial improvements with respect to currently available simulators such as LIQUi|⟩ (Wecker and Svore 2014) from Microsoft or qHiPSTER (Smelyanskiy et al. 2016) from Intel. In fact, while e.g. LIQUi|⟩ is capable of simulating Shor's Algorithm (a well-known quantum method for factorization (Shor 1997)) for at most 31 qubits in more than 30 days, the simulation approach based on decision diagrams completes this task within a minute, an impressive display of the capabilities of these data-structures. Recent developments have improved upon that even further. For example, the work in Zulehner and Wille (2019) showed that a proper combination of gate operations or even entire sub-circuits allows for further speed-ups. This is because, using decision diagrams, a matrix-matrix multiplication (which combines two gates) might sometimes be cheaper than a (theoretically less expensive) matrix-vector multiplication. On the other hand, typical optimization schemes applied for quantum circuit simulation cannot yet be applied directly to decision diagrams. As an example, matrix/vector operations can usually be parallelized quite easily, which is why most methods utilize corresponding concurrent schemes. But since decision diagrams heavily rely on sharing, the synchronization overhead this requires often prevents improvements due to parallelization (Hillmich et al. 2020).
4 Conclusions
In this work, we provided an overview of decision diagrams for reversible and quantum circuits as well as their potential for the design of these emerging technologies. While already established in conventional circuit design for many decades, decision diagrams for the area considered here are still not that common. However, with the approaches and the potential from the recent past as discussed in this work as well as alternative diagram types such as QDDs, QuIDDs, or XQDDs, a case can be made that decision diagrams might become as important for reversible and quantum circuit design as they have become for the design of conventional circuits and systems.
Acknowledgements We sincerely thank all co-authors and collaborators who worked with us in the past in this exciting area. This work has partially been supported by the European Union through the COST Action IC1405 as well as the LIT Secure and Correct Systems Lab funded by the State of Upper Austria.
References A. Abdollahi, M. Pedram, Analysis and synthesis of quantum circuits by using quantum decision diagrams, in Design, Automation and Test in Europe (2006), pp. 317–322 L.G. Amarù, P. Gaillardon, R. Wille, G.D. Micheli, Exploiting inherent characteristics of reversible circuits for faster combinational equivalence checking, in Design, Automation and Test in Europe (2016), pp. 175–180 M. Amy, D. Maslov, M. Mosca, M. Roetteler, A meet-in-the-middle algorithm for fast synthesis of depth-optimal quantum circuits. IEEE Trans. CAD 32(6), 818–830 (2013) W.C. Athas, L.J. Svensson, Reversible logic issues in adiabatic CMOS, in Proceedings Workshop on Physics and Computation, PhysComp’94 (1994), pp. 111–118. https://doi.org/10.1109/ PHYCMP.1994.363692 C. Bennett, Logical reversibility of computation. IBM J. Res. Dev. 17(6), 525–532 (1973) A. Berut, A. Arakelyan, A. Petrosyan, S. Ciliberto, R. Dillenschneider, E. Lutz, Experimental verification of Landauer’s principle linking information and thermodynamics. Nature 483, 187– 189 (2012) R.E. Bryant, Graph-based algorithms for Boolean function manipulation. IEEE Trans. Comp. 35(8), 677–691 (1986) R.E. Bryant, Y. Chen, Verification of arithmetic circuits with binary moment diagrams, in Design Automation Conference (1995), pp. 535–541. https://doi.org/10.1145/217474.217583. http://doi. acm.org/10.1145/217474.217583 L. Burgholzer, R. Wille, Improved DD-based equivalence checking of quantum circuits, in ASP Design Automation Conference (2020) R. Drechsler, B. Becker, Ordered Kronecker functional decision diagrams-a data structure for representation and manipulation of boolean functions. IEEE Trans. CAD 17(10), 965–973 (1998) R. Drechsler, J. Shi, G. Fey, Synthesis of fully testable circuits from BDDs. IEEE Trans. CAD 23(3), 440–443 (2004) R. Drechsler, R. Wille, From truth tables to programming languages: progress in the design of reversible circuits, in International Symposium on Multiple-Valued Logic (2011), pp. 78–85. 
https://doi.org/10.1109/ISMVL.2011.40 D. Goodman, M.A. Thornton, D.Y. Feinstein, D.M. Miller, Quantum logic circuit simulation based on the QMDD data structure, in International Reed-Muller Workshop (2007) L.K. Grover, A fast quantum mechanical algorithm for database search, in Symposium on the Theory of Computing (1996), pp. 212–219. https://doi.org/10.1145/237814.237866. http://doi.acm.org/ 10.1145/237814.237866 P. Gupta, A. Agrawal, N.K. Jha, An algorithm for synthesis of reversible logic circuits. IEEE Trans. CAD 25(11), 2317–2330 (2006) S. Hillmich, A. Zulehner, R. Wille, Concurrency in DD-based quantum circuit simulation, in ASP Design Automation Conference (2020) D. Janzing, P. Wocjan, T. Beth, Non-identity check is QMA-complete. Int. J. Quantum Inf. 03(03), 463–473 (2005) R. Landauer, Irreversibility and heat generation in the computing process. IBM J. Res. Dev. 5(3), 183–191 (1961) G. Li, Y. Ding, Y. Xie, Tackling the qubit mapping problem for NISQ-era quantum devices, in ASPLOS (2019), pp. 1001–1014
S. Malik, A. Wang, R. Brayton, A. Sangiovanni-Vincentelli, Logic verification using binary decision diagrams in a logic synthesis environment, in International Conference on CAD (1988), pp. 6–9 D. Maslov, G.W. Dueck, D.M. Miller, Techniques for the synthesis of reversible Toffoli networks. ACM Trans. Des. Autom. Electron. Syst. 12(4) (2007) D.M. Miller, D. Maslov, G.W. Dueck, A transformation based algorithm for reversible logic synthesis, in Design Automation Conference (2003), pp. 318–323 D.M. Miller, R. Wille, Z. Sasanian, Elementary quantum gate realizations for multiple-control Toffoli gates, in International Symposium on Multiple-Valued Logic (2011), pp. 288–293 A. Montanaro, Quantum algorithms: an overview. NPJ Quantum Inf. 2, 15,023 (2016) M. Nielsen, I. Chuang, Quantum Computation and Quantum Information (Cambridge University Press, 2000) P. Niemann, R. Wille, Compact Representations for the Design of Quantum Logic (Springer, 2017) P. Niemann, R. Wille, R. Drechsler, Efficient synthesis of quantum circuits implementing Clifford group operations, in ASP Design Automation Conference (2014a), pp. 483–488 P. Niemann, R. Wille, R. Drechsler, Equivalence checking in multi-level quantum systems, in Conference on Reversible Computation (2014b), pp. 201–215 P. Niemann, R. Wille, R. Drechsler, Improved synthesis of Clifford+T quantum functionality, in Design, Automation and Test in Europe (2018) P. Niemann, R. Wille, D.M. Miller, M.A. Thornton, R. Drechsler, QMDDs: efficient quantum function representation and manipulation. IEEE Trans. CAD 35(1), 86–99 (2016) A. Rauchenecker, T. Ostermann, R. Wille, Exploiting reversible logic design for implementing adiabatic circuits, in 2017 MIXDES-24th International Conference on Mixed Design of Integrated Circuits and Systems (IEEE, 2017), pp. 264–270 M. Saeedi, M. Arabzadeh, M.S. Zamani, M. Sedighi, Block-based quantum-logic synthesis. Quantum Inf. Comput. 11(3 & 4), 262–277 (2011a) M. Saeedi, R. Wille, R. 
Drechsler, Synthesis of quantum circuits for linear nearest neighbor architectures. Quantum Inf. Process. 10(3), 355–377 (2011b) M. Saeedi, I.L. Markov, Synthesis and optimization of reversible circuits: a survey. ACM Comput. Surv. 45(2), 21 (2013) V.V. Shende, S.S. Bullock, I.L. Markov, Synthesis of quantum-logic circuits. IEEE Trans. CAD 25(6), 1000–1010 (2006) V.V. Shende, A.K. Prasad, I.L. Markov, J.P. Hayes, Reversible logic circuit synthesis, in International Conference on CAD (2002), pp. 353–360 P.W. Shor, Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer. SIAM J. Comput. 26(5), 1484–1509 (1997) M. Smelyanskiy, N.P.D. Sawaya, A. Aspuru-Guzik, qHiPSTER: the quantum high performance software testing environment. CoRR (2016), arXiv:1601.07195 M. Soeken, R. Wille, C. Hilken, N. Przigoda, R. Drechsler, Synthesis of reversible circuits with minimal lines for large functions, in ASP Design Automation Conference (2012), pp. 85–92 T. Toffoli, Reversible computing, in Automata, Languages and Programming, ed. by W. de Bakker, J. van Leeuwen (Springer, 1980), p. 632. Technical Memo MIT/LCS/TM-151, MIT Lab. for Comput. Sci G. Viamontes, I. Markov, J.P. Hayes, High-performance QuIDD-based simulation of quantum circuits, in Design, Automation and Test in Europe (2004), pp. 1354–1355 G. Viamontes, I. Markov, J.P. Hayes, Checking equivalence of quantum circuits and states, in International Conference on CAD (2007), pp. 69–74 G. Viamontes, I. Markov, J.P. Hayes, Quantum Circuit Simulation (Springer, 2009) S.A. Wang, C.Y. Lu, I.M. Tsai, S.Y. Kuo, An XQDD-based verification method for quantum circuits. IEICE Trans. 91-A(2), 584–594 (2008) D. Wecker, K.M. Svore, LIQUi|>: a software design architecture and domain-specific language for quantum computing. CoRR (2014), arXiv:1402.4467
R. Wille, R. Drechsler, C. Osewold, A.G. Ortiz, Automatic design of low-power encoders using reversible circuit synthesis, in Design, Automation and Test in Europe (2012), pp. 1036–1041. https://doi.org/10.1109/DATE.2012.6176648 R. Wille, D. Große, D.M. Miller, R. Drechsler, Equivalence checking of reversible circuits, in International Symposium on Multiple-Valued Logic (2009), pp. 324–330 R. Wille, O. Keszocze, S. Hillmich, M. Walter, A.G. Ortiz, Synthesis of approximate coders for on-chip interconnects using reversible logic, in Design, Automation and Test in Europe (2016), pp. 1140–1143 R. Wille, P. Niemann, A. Zulehner, R. Drechsler, Decision diagrams for the design of reversible and quantum circuits, in International Symposium on Devices, Circuits and Systems (ISDCS) (2018), pp. 1–6. https://doi.org/10.1109/ISDCS.2018.8379626 A. Zulehner, P. Niemann, R. Drechsler, R. Wille, Accuracy and compactness in decision diagrams for quantum computation, in Design, Automation and Test in Europe (2019a), pp. 280–283. https://doi.org/10.23919/DATE.2019.8715040 A. Zulehner, A. Paler, R. Wille, An efficient methodology for mapping quantum circuits to the IBM QX architectures. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 38(7), 1226–1236 (2019b) A. Zulehner, R. Wille, Advanced simulation of quantum computations. CoRR (2017a), arXiv:1707.00865 A. Zulehner, R. Wille, Improving synthesis of reversible circuits: exploiting redundancies in paths and nodes of QMDDs, in Conference on Reversible Computation (2017b), pp. 232–247 A. Zulehner, R. Wille, Make it reversible: efficient embedding of non-reversible functions, in Design, Automation and Test in Europe (2017c), pp. 458–463 A. Zulehner, R. Wille, Taking one-to-one mappings for granted: advanced logic design of encoder circuits, in Design, Automation and Test in Europe (2017d), pp. 818–823 A. Zulehner, R. 
Wille, Exploiting coding techniques for logic synthesis of reversible circuits, in ASP Design Automation Conference (2018a), pp. 670–675 A. Zulehner, R. Wille, One-pass design of reversible circuits: combining embedding and synthesis for reversible logic. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. (2018b) A. Zulehner, R. Wille, Matrix-vector vs. matrix-matrix multiplication: potential in DD-based simulation of quantum computations, in Design, Automation and Test in Europe (2019), pp. 90–95
Error-Tolerant Mapping for Quantum Computing Abdullah Ash Saki, Mahabubul Alam, Junde Li, and Swaroop Ghosh
Abstract Quantum computers are built with fragile, noise- and error-prone qubits. Prominent errors include decoherence/dephasing, gate error, readout error, leakage, and crosstalk. Furthermore, qubits vary in quality: some are healthy whereas others are prone to errors. This presents an opportunity to exploit good-quality qubits to improve the computation outcome. This chapter reviews the state-of-the-art mapping techniques for error tolerance. We use quantum benchmarks as well as approximate algorithms for applications covering MaxCut, object detection, and factorization to illustrate various optimization challenges and opportunities.
A. A. Saki · M. Alam · J. Li · S. Ghosh (corresponding author)
Pennsylvania State University, University Park, State College, PA 16802, USA
© Springer Nature Singapore Pte Ltd. 2023. M. M. S. Aly and A. Chattopadhyay (eds.), Emerging Computing: From Devices to Systems, Computer Architecture and Design Methodologies, https://doi.org/10.1007/978-981-16-7487-7_12

1 Introduction
Quantum computers can speed up certain classes of problems such as factorization (Shor 1999), database search (Grover 1996), chemistry (Kandala et al. 2017), and machine learning (Biamonte et al. 2017). Quantum computers are made of quantum bits, or qubits, that store data as quantum states. These states are manipulated using operations called quantum gates to perform computation. In recent years, quantum computing has progressed considerably, from low-level hardware to high-level software, with Google's demonstration of quantum advantage as one example of many. On the one hand, ongoing research is developing various qubit technologies such as superconducting qubits (Koch et al. 2007) and trapped ions (Cirac and Zoller 1995). On the other hand, the number of qubits is increasing, with 49–72 qubit systems reported by vendors such as IBM (2017), Intel (2018), and Google (2018). Moreover, prototype machines of 5–15 qubits are publicly available from IBM (2020).

Despite all the promise and potential, quantum computers are still at a nascent stage. Existing quantum computers have limited qubit connectivity: two arbitrary qubits cannot be entangled directly unless they are connected. Moreover, qubits are noisy and have a short lifetime. Such machines are called Noisy Intermediate-Scale Quantum (NISQ) computers (Preskill 2018). Several orthogonal approaches have been proposed to address the noise issue. The first is to make qubits and gate operations more robust and accurate. There has been a thousand-fold improvement in decoherence time over the last two decades (Steffen 2011); however, further improvement is still needed. The second approach is Quantum Error Correction (QEC), e.g., the Shor code (Shor 1995) and the surface code (Bravyi and Kitaev 1998). QEC encodes a logical qubit using many physical qubits, so applying it is not practical on existing devices with a limited number of qubits. Because of these hardware limitations, the research community has started exploring software techniques such as mapping to squeeze performance out of NISQ devices for useful applications.

Mapping of quantum circuits, i.e., allocating logical qubits to physical hardware qubits, is necessary to address the connectivity constraints of NISQ devices. Existing mapping techniques (e.g., Zulehner et al. 2018) formulate the problem so as to minimize the depth of the circuit in addition to resolving the connectivity issue. Minimizing circuit depth helps to cope with the short qubit lifetime, i.e., decoherence. Moreover, existing quantum computers exhibit qubit-to-qubit variation in error rates (gate error, readout error, etc.). One segment of the community has developed techniques that exploit these variations and intelligently map operations to less noisy qubits to improve program fidelity.
Therefore, the concept of error-aware mapping of quantum circuits emerged in the community. In this chapter, we present the motivation for, and the state of the art of, error-tolerant mapping techniques. The chapter is organized as follows. In Sect. 2, we present the basics of quantum computing that are necessary to understand the core concepts of error-tolerant mapping. In Sect. 3, we discuss the mapping of quantum circuits, its objectives, and its considerations. In Sect. 4, we present the current state of the art of error-tolerant mapping techniques. We discuss several applications and their performance under error-tolerant mapping in Sect. 5. A summary of related work and a future outlook are presented in Sect. 6. We conclude in Sect. 7.
2 Basics of Quantum Computing
2.1 Qubits
The qubit is the building block of a quantum processor. It is a two-level system that stores data as a quantum state. For example, electron spin can be exploited to realize a qubit, where up-spin represents data ‘1’ and down-spin represents data ‘0’. The most popular qubit technology is the superconducting qubit. Its basic structure consists of a capacitor in parallel with a superconducting Josephson junction. The Josephson junction works as a non-linear inductor, and the capacitor-inductor combination works as an anharmonic LC oscillator. The lowest two energy states of this oscillator serve as the computational basis states $|0\rangle$ (the ground state) and $|1\rangle$ (the first excited state).

Unlike a classical bit, which is either ‘0’ or ‘1’ at any given time, a qubit can be in a combination of both 0 and 1 simultaneously due to quantum superposition. Mathematically, a qubit state is represented by the state vector $|\psi\rangle = a|0\rangle + b|1\rangle$, where $|a|^2$ and $|b|^2$ are the probabilities of reading out ‘0’ and ‘1’, respectively. Suppose a qubit is in the state $|\psi\rangle = 0.707|0\rangle + 0.707|1\rangle$. Reading out this state many times will theoretically generate 0s and 1s with equal probability (as $0.707^2 \approx 0.5$). A qubit state is often visualized on a Bloch sphere, where the poles of the sphere are 0 and 1 and the equator corresponds to an equal superposition.

Moreover, qubit states can be entangled, so that the states of two or more qubits are correlated. Performing an operation on one of the entangled qubits may alter the states of the other entangled qubits. Finally, qubit states exhibit quantum interference. The amplitude of a qubit state can be negative as well as positive. Thus, a quantum algorithm designer can tailor the gate operations such that a negative amplitude cancels a positive amplitude of the same state. This technique is widely used in applications such as Grover search to single out the desired result by canceling the other states. Quantum superposition, entanglement, and interference are believed to be at the core of the speed-ups of quantum algorithms.
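The relation between amplitudes and readout statistics described above can be illustrated with a minimal Python sketch. It assumes an ideal, noise-free readout; the function names `measurement_probabilities` and `sample_readouts` are hypothetical helpers introduced here for illustration only.

```python
import math
import random

def measurement_probabilities(a: complex, b: complex) -> tuple[float, float]:
    """Return (P(0), P(1)) for the state a|0> + b|1>, normalising if needed."""
    norm = abs(a) ** 2 + abs(b) ** 2
    return abs(a) ** 2 / norm, abs(b) ** 2 / norm

def sample_readouts(a: complex, b: complex, shots: int, seed: int = 0) -> dict[str, int]:
    """Simulate `shots` ideal (noise-free) readouts of the state a|0> + b|1>."""
    p0, _ = measurement_probabilities(a, b)
    rng = random.Random(seed)
    counts = {"0": 0, "1": 0}
    for _ in range(shots):
        counts["0" if rng.random() < p0 else "1"] += 1
    return counts

amp = 1 / math.sqrt(2)  # the amplitude written as 0.707 in the text
p0, p1 = measurement_probabilities(amp, amp)
counts = sample_readouts(amp, amp, shots=10_000)
print(p0, p1, counts)  # both probabilities are close to 0.5
```

With many shots, the two counts approach each other, which is the "equal probability" behaviour stated for the 0.707 example.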
2.2 Quantum Gates
Quantum gates are operations that modulate the data (current state) of a qubit to perform the desired computation. However, unlike classical logic gates, quantum gates are not physically fabricated. Instead, they are realized using pulses. For superconducting qubits (used in IBM machines), the pulses are electromagnetic (radio-frequency) in nature. Figure 1 shows a micrograph of a 7-qubit IBM quantum processor. The zig-zag lines are the waveguides that carry the gate pulses that modulate the qubit states. Intuitively, the gate pulses induce rotations about different axes of the Bloch sphere, where the amount of rotation depends on the pulse amplitude, duration, and shape. For example, the Hadamard (H) gate takes a qubit in the $|0\rangle$ state to a superposition state on the equator of the Bloch sphere, $\pi/2$ radians away from the pole (Fig. 2).

Fig. 1 Micrograph of a 7-qubit quantum computer from IBM. The boxes are the superconducting qubits and the zig-zag lines are waveguides that carry the gate and readout pulses. Credit: IBM Research (IBM Research 2017; Kandala et al. 2017). Licensed under Creative Commons (CC BY-ND 2.0), https://creativecommons.org/licenses/by-nd/2.0/

Fig. 2 Bloch sphere representation of the qubit state $|0\rangle$ and the effect of the Hadamard (H) gate on it. On the Bloch sphere, the poles represent the states $|0\rangle$ and $|1\rangle$, and the equator represents the equal superposition $(|0\rangle + |1\rangle)/\sqrt{2}$

Mathematically, quantum gates are represented by unitary matrices (a matrix $U$ is unitary if $UU^\dagger = I$, where $U^\dagger$ is the adjoint of $U$ and $I$ is the identity matrix). For an $n$-qubit gate, the dimension of the unitary matrix is $2^n \times 2^n$. Any unitary matrix can be a quantum gate. However, existing systems natively support only a handful of gates, which are known as the native gates or basis gates of that quantum processor. For IBM systems, the basis gates are U1, U2, U3, ID, and CNOT. CNOT is the only 2-qubit gate; the others are single-qubit gates. Any non-native gate in a quantum circuit is first decomposed into native gates. For example, the Hadamard (H) gate is decomposed as U2(0, π) in IBM systems. The matrices of some commonly used quantum gates are as follows:

$$H = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}, \qquad T = \begin{bmatrix} 1 & 0 \\ 0 & e^{i\pi/4} \end{bmatrix}, \qquad X = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$$

$$U3(\theta, \phi, \lambda) = \begin{bmatrix} \cos(\theta/2) & -e^{i\lambda}\sin(\theta/2) \\ e^{i\phi}\sin(\theta/2) & e^{i(\lambda+\phi)}\cos(\theta/2) \end{bmatrix}$$
$$\mathrm{CNOT} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{bmatrix}, \qquad \mathrm{CZ} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & -1 \end{bmatrix}$$

Fig. 3 Coupling graph of ibmq_16_melbourne and Rigetti Aspen (panels: ibmq_16_melbourne; Rigetti Aspen-4-16Q-A)
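The gate matrices of this section can be checked numerically. The sketch below uses only the Python standard library and assumes the U3 convention of OpenQASM 2/Qiskit, $U3(\theta,\phi,\lambda)$ with entries $\cos(\theta/2)$, $-e^{i\lambda}\sin(\theta/2)$, $e^{i\phi}\sin(\theta/2)$, $e^{i(\lambda+\phi)}\cos(\theta/2)$; under that assumption H equals U3(π/2, 0, π), i.e., U2(0, π). All helper names are illustrative.

```python
import cmath
import math

def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def dagger(m):
    """Conjugate transpose (adjoint) of a square matrix."""
    return [[complex(m[j][i]).conjugate() for j in range(len(m))]
            for i in range(len(m))]

def is_close(a, b, tol=1e-9):
    return all(abs(x - y) < tol for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def is_unitary(m):
    identity = [[1 if i == j else 0 for j in range(len(m))] for i in range(len(m))]
    return is_close(matmul(m, dagger(m)), identity)

def u3(theta, phi, lam):
    """U3 in the assumed OpenQASM 2 / Qiskit convention."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [[c, -cmath.exp(1j * lam) * s],
            [cmath.exp(1j * phi) * s, cmath.exp(1j * (phi + lam)) * c]]

r = 1 / math.sqrt(2)
H = [[r, r], [r, -r]]
X = [[0, 1], [1, 0]]
T = [[1, 0], [0, cmath.exp(1j * math.pi / 4)]]
CNOT = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
CZ = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, -1]]

all_unitary = all(is_unitary(g) for g in (H, X, T, CNOT, CZ, u3(0.3, 1.1, -0.7)))
h_matches_u3 = is_close(u3(math.pi / 2, 0, math.pi), H)

# CNOT flips the target when the control is 1: |10> (index 2) -> |11> (index 3).
state_10 = [[0], [0], [1], [0]]
state_after_cnot = matmul(CNOT, state_10)
print(all_unitary, h_matches_u3, state_after_cnot)
```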
Figure 3 shows the qubit connectivity graphs of an IBM and a Rigetti computer, respectively. Here, the nodes (circles) are the qubits. These coupling graphs convey an important message: the native 2-qubit gate (CNOT in IBM and CZ in Rigetti devices) can only be applied to qubits that are connected, i.e., that share an edge in the graph. For instance, a CNOT can be applied to qubit-1 and qubit-2 of ibmq_16_melbourne, as there is an edge between these qubits in the graph. However, a CNOT cannot be applied directly between qubit-1 and qubit-3, as they are not connected. This limited connectivity presents a challenge in quantum circuit mapping and is often referred to as the coupling constraint (discussed in Sect. 2.4).
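As a small illustration of the coupling constraint, the sketch below encodes the ibmq_16_melbourne coupling graph using the qubit pairs listed in the CNOT columns of Table 1 and checks whether a two-qubit gate is directly applicable. Treating the edges as undirected is an assumption of this sketch, and the function names are hypothetical.

```python
# Qubit pairs taken from the "Pair" columns of Table 1 (ibmq_16_melbourne).
MELBOURNE_EDGES = [
    (1, 0), (1, 2), (2, 3), (4, 3), (4, 10), (5, 4), (5, 6), (5, 9),
    (6, 8), (7, 8), (9, 8), (9, 10), (11, 3), (11, 10), (11, 12),
    (12, 2), (13, 1), (13, 12), (13, 14), (14, 0),
]

def build_adjacency(edges):
    """Return {qubit: set of neighbouring qubits}, treating edges as undirected."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency

def can_apply_two_qubit_gate(adjacency, a, b):
    """A native two-qubit gate is allowed only between coupled qubits."""
    return b in adjacency.get(a, set())

adjacency = build_adjacency(MELBOURNE_EDGES)
print(can_apply_two_qubit_gate(adjacency, 1, 2))  # True: Q1 and Q2 share an edge
print(can_apply_two_qubit_gate(adjacency, 1, 3))  # False: Q1 and Q3 are not coupled
```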
2.3 Errors in Near-Term Quantum Processors
2.3.1 Gate Error
Gate errors are due to imperfect logical operations. As mentioned earlier, quantum gates are realized with pulses, and pulses can be imperfect. For example, consider the $R_Y(\pi/2)$ gate. Due to variation, the pulse intended for a π/2 rotation may not produce an exact π/2 rotation. It may under- or over-rotate, leading to an erroneous logical operation. Gate errors are quantified by an experimental protocol called randomized benchmarking (RB) (Knill et al. 2008). For existing systems, 2-qubit gate errors (e.g., CNOT error) are an order of magnitude higher than 1-qubit gate errors. Single-qubit and multi-qubit gate errors for ibmq_16_melbourne are tabulated in Table 1. The more gates a quantum circuit contains, the more gate error accumulates and the lower the reliability of the quantum program. Thus, one objective of compilation and error-tolerant mapping is to reduce the number of gates in the quantum circuit.
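To make the accumulation of gate error concrete, the following sketch estimates a circuit's success probability as the product of per-gate success probabilities. This independent-error model is a simplifying assumption and not a method taken from the chapter. The error values are those of Q0 and of the pair Q1–Q0 in Table 1, reading the 1-qubit gate error as given in units of 10^-3 (an assumption about the table header); the gate counts are hypothetical.

```python
def estimated_success_probability(n_1q, n_2q, err_1q, err_2q):
    """Product of per-gate success probabilities, assuming independent gate errors."""
    return (1.0 - err_1q) ** n_1q * (1.0 - err_2q) ** n_2q

ERR_1Q = 0.50e-3  # 1-qubit gate error of Q0 (Table 1, assumed to be in units of 1e-3)
ERR_2Q = 0.025    # CNOT error of the pair Q1-Q0 (Table 1)

# Hypothetical circuit: 20 single-qubit gates and 10 CNOTs.
p_original = estimated_success_probability(20, 10, ERR_1Q, ERR_2Q)

# Same circuit after inserting two SWAPs; a SWAP is commonly decomposed
# into three CNOTs, so two SWAPs add six CNOTs.
p_with_swaps = estimated_success_probability(20, 10 + 6, ERR_1Q, ERR_2Q)

print(p_original, p_with_swaps)
```

Under this model every additional CNOT multiplies the success estimate by the same factor, which is why SWAP insertion is costly on devices whose CNOT error dominates.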
Table 1 Gate errors, readout errors, and T1 and T2 times of a publicly accessible quantum computer from IBM, ibmq_16_melbourne (date accessed: 31-Mar-2020; values vary over time). GE = 1-qubit gate error, RE = readout error, MGE = multi-qubit (CNOT) gate error

Qubit  GE (×10⁻³)  RE    T1 (µs)  T2 (µs)  Pair     MGE    Pair     MGE
Q0     0.50        0.02  78.40    22.76    Q1–Q0    0.025  Q12–Q2   0.063
Q1     1.75        0.03  56.44    18.06    Q1–Q2    0.029  Q13–Q1   0.070
Q2     1.96        0.06  49.44    58.85    Q2–Q3    0.037  Q13–Q12  0.030
Q3     0.60        0.06  80.94    62.84    Q4–Q3    0.017  Q13–Q14  0.062
Q4     1.79        0.04  57.30    31.67    Q4–Q10   0.026  Q14–Q0   0.027
Q5     5.15        0.09  18.92    36.40    Q5–Q4    0.037
Q6     0.76        0.02  86.20    110.6    Q5–Q6    0.055
Q7     1.34        0.03  42.59    76.30    Q5–Q9    0.042
Q8     0.76        0.27  37.21    59.90    Q6–Q8    0.028
Q9     1.51        0.04  45.00    89.87    Q7–Q8    0.032
Q10    1.28        0.04  76.88    55.95    Q9–Q8    0.036
Q11    0.90        0.03  55.88    68.65    Q9–Q10   0.036
Q12    0.86        0.05  111.2    54.77    Q11–Q3   0.024
Q13    2.33        0.11  27.70    46.87    Q11–Q10  0.033
Q14    0.89        0.05  35.62    43.90    Q11–Q12  0.017

2.3.2 Decoherence
Decoherence is related to the short lifetime of qubits. A qubit may interact with its environment and spontaneously lose its stored state. To illustrate the issue, Fig. 4 shows the effect of relaxation, one type of decoherence. Due to relaxation, a qubit in state 1 spontaneously loses energy and ends up in state 0. Decoherence of a qubit is usually characterized by the relaxation time T1 and the dephasing time T2. If the gate time is $t_g$, then the probability that a state $|1\rangle$ is damped is roughly $1 - \exp(-t_g/T_1)$. This means that the longer the gate (operation) time, the more likely the qubit is to lose its state. T1 and T2 times for ibmq_16_melbourne are reported in Table 1. Thus, an error-tolerant mapping must try to minimize the depth of the quantum circuit.
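The relaxation estimate $1 - \exp(-t_g/T_1)$ can be evaluated directly. The sketch below compares the qubits with the shortest and longest T1 in Table 1 (Q5 and Q12); the elapsed times of 1 µs and 10 µs are hypothetical values chosen only for illustration.

```python
import math

def relaxation_error(t_us: float, t1_us: float) -> float:
    """Approximate probability that |1> relaxes to |0> within t_us microseconds."""
    return 1.0 - math.exp(-t_us / t1_us)

T1_Q5 = 18.92   # microseconds, shortest T1 in Table 1
T1_Q12 = 111.2  # microseconds, longest T1 in Table 1

for elapsed in (1.0, 10.0):  # hypothetical elapsed times in microseconds
    print(elapsed, relaxation_error(elapsed, T1_Q5), relaxation_error(elapsed, T1_Q12))
```

For the same elapsed time, the qubit with the shorter T1 always has the larger estimated error, and for either qubit the error grows with circuit duration, which is the motivation for depth minimization.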
2.3.3 Readout or Measurement Error
Readout error is a classical bit-flip error caused by imperfect measurement circuitry. Due to readout error, reading out a qubit that stores a 1 may yield 0, and vice versa. A very simple protocol can be used to quantify the readout error probabilities. For a single qubit, it involves preparing the qubit in each possible binary state (i.e., 0 and 1) and reading it out, repeating both preparation and measurement multiple times.
Fig. 4 Illustration of relaxation and dephasing
By default, qubits in IBM machines are initialized in the 0 state. Therefore, to prepare state ‘1’, a quantum NOT (X) gate has to be applied to the 0 state. Ideally, if the process of preparing 0 (1) and reading out is repeated N times, it should yield 0 (1) every time. However, due to readout error, a flipped bit is read in some cases. For example, say state 0 is prepared and measured 10,000 times. Out of these 10,000 trials, the readout is 0 in 9000 trials and 1 in the other 1000. Thus, the measurement error rate $M_{01}$ is 1000/10,000 = 0.1 ($M_{xy}$ is the probability of reading out state ‘y’ when the prepared state is ‘x’; thus, $M_{00}$ = 9000/10,000 = 0.90). For multi-qubit readout characterization, the same protocol applies, but more binary states need to be prepared and read. For example, to characterize a 3-qubit system, $2^3 = 8$ binary states (000, 001, 010, ..., 110, and 111) need to be prepared and measured (N times each). Readout errors for ibmq_16_melbourne are reported in Table 1. The reported errors are the average of the 1-flip and 0-flip errors. Unlike gate error and decoherence, which depend on the number of gates in the circuit, readout error is agnostic to gate count. It depends solely on the state being read. Thus, minimizing the number of gates in the circuit does not reduce readout error.
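The single-qubit characterization protocol above amounts to building a small table of conditional readout probabilities. A minimal sketch follows; the counts for the prepared-0 experiment are the ones from the text (9000 and 1000), whereas the counts for the prepared-1 experiment are hypothetical, as is the function name.

```python
def readout_error_rates(counts_prepared_0, counts_prepared_1):
    """Return M_xy = P(read y | prepared x) and the average bit-flip error."""
    n0 = sum(counts_prepared_0.values())
    n1 = sum(counts_prepared_1.values())
    m01 = counts_prepared_0.get("1", 0) / n0  # prepared 0, read 1
    m10 = counts_prepared_1.get("0", 0) / n1  # prepared 1, read 0
    return {
        "M00": counts_prepared_0.get("0", 0) / n0,
        "M01": m01,
        "M10": m10,
        "M11": counts_prepared_1.get("1", 0) / n1,
        "average_error": (m01 + m10) / 2,
    }

rates = readout_error_rates(
    {"0": 9000, "1": 1000},  # counts from the example in the text
    {"0": 500, "1": 9500},   # hypothetical counts for the prepared-1 experiment
)
print(rates)
```

The `average_error` entry mirrors how the text describes the reported readout error, namely as the average of the two flip probabilities.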
2.3.4 Crosstalk
Crosstalk is another kind of error present in near-term quantum processors. Quantum gates are realized using pulses (e.g., microwave pulses for superconducting qubits, laser pulses for ion-trap qubits, etc.). Ideally, the effect of a gate operation on one qubit is independent of what is happening on other qubits. However, the gate pulse intended for one qubit can leak to an unintended qubit. This unintentional excitation is called ‘crosstalk’. Gate errors may exhibit conditional dependence due to crosstalk. That is, the error of an isolated gate operation may differ from the error of the same gate when another gate operation runs in parallel. A recent experimental study reveals that the gate error with another operation in parallel can be 2–3× higher than that of an isolated gate operation (Murali et al. 2020). Thus, crosstalk introduces conflicting compilation objectives. For example, reducing decoherence error requires shortening the depth of a quantum circuit, and one way to achieve this is to maximize the number of parallel gate operations. However, more crosstalk is introduced into the system when many gates act in parallel. Thus, the scheduling of parallel gate operations must strike a compromise between crosstalk and decoherence.
2.3.5 Qubit-to-Qubit Variation
Error rates show qubit-to-qubit variation. Figure 5 shows the error-rate variation among the qubits of an IBM quantum computer. The figure clearly shows that some qubits have a smaller error rate than others. Thus, using qubits with a smaller error rate to perform more of the operations can benefit the overall reliability of the quantum program. A number of works (Tannu and Qureshi 2019; Murali et al. 2019; Ash-Saki et al. 2019) leverage this spatial variability of qubit error rates to improve the error tolerance of a quantum program.
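A variation-aware mapper starts from exactly this kind of calibration data. As a minimal sketch (not the method of any of the cited works), the code below ranks the CNOT pairs of ibmq_16_melbourne by the error values listed in Table 1 and picks the most and least reliable ones; the function name is hypothetical.

```python
# CNOT (MGE) error rates copied from Table 1; keys are (qubit, qubit) pairs.
CNOT_ERROR = {
    (1, 0): 0.025, (1, 2): 0.029, (2, 3): 0.037, (4, 3): 0.017, (4, 10): 0.026,
    (5, 4): 0.037, (5, 6): 0.055, (5, 9): 0.042, (6, 8): 0.028, (7, 8): 0.032,
    (9, 8): 0.036, (9, 10): 0.036, (11, 3): 0.024, (11, 10): 0.033,
    (11, 12): 0.017, (12, 2): 0.063, (13, 1): 0.070, (13, 12): 0.030,
    (13, 14): 0.062, (14, 0): 0.027,
}

def most_reliable_pairs(cnot_error, k):
    """Return the k qubit pairs with the lowest CNOT error (ties keep table order)."""
    return sorted(cnot_error, key=cnot_error.get)[:k]

best_pairs = most_reliable_pairs(CNOT_ERROR, 3)
worst_pair = max(CNOT_ERROR, key=CNOT_ERROR.get)
spread = CNOT_ERROR[worst_pair] / CNOT_ERROR[best_pairs[0]]
print(best_pairs, worst_pair, round(spread, 2))
```

In this snapshot the worst pair is roughly four times as error-prone as the best one, which illustrates why steering CNOT-heavy parts of a circuit to good pairs can pay off.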
Fig. 5 (a, b) Trends of the 1-qubit gate error and the readout error collected over a 43-day period from the 5-qubit IBMQX4 quantum computer. The trends clearly show that the error rates vary over time. Moreover, the error-rate curves of different qubits do not coincide, indicating qubit-to-qubit variation
2.3.6 Temporal Variation
Besides qubit-to-qubit variation, error rates exhibit temporal variation: they tend to drift over time. Figure 5 shows the trend. Quantum circuits such as classifiers, which are compiled once and used multiple times, are heavily affected by the temporal variation in qubit quality (Alam et al. 2019a).
2.4 Mapping Challenges in Quantum Processors
The coupling constraint briefly discussed in Sect. 2.2 is a challenge in NISQ devices. To better understand the issue, consider the illustration in Fig. 6. The coupling graph of a hypothetical 4-qubit system (Q1, Q2, Q3, Q4) is shown in Fig. 6b. This coupling graph indicates that the physical qubits are linearly coupled and that two-qubit operations are allowed between the following qubit pairs: {(Q1, Q2), (Q2, Q3), (Q3, Q4)}. Due to coupling constraints, the logical qubits of a quantum circuit have to be mapped to the physical qubits of the quantum computer such that all the gate operations can be carried out. However, it is usually not possible to find a mapping that satisfies all the constraints throughout the whole circuit. For instance, the circuit in Fig. 6a has 6 CNOT operations. If we try to execute this circuit on the hardware shown in Fig. 6b, no logical-to-physical qubit mapping allows us to execute all 6 CNOT operations. For instance, the logical-to-physical qubit mapping shown in Fig. 6b allows us to execute only two of the 6 CNOT gates (CNOT q1 q2 and CNOT q3 q4). The remaining 4 CNOT operations cannot be executed under the current qubit mapping.

To circumvent this issue, additional SWAP operations are inserted into the circuit to dynamically move the logical qubits to other physical qubits. A SWAP operation interchanges the states of two neighboring qubits without modifying the logical qubit states. For instance, after executing the first two CNOT operations, two additional SWAP operations (SWAP q1 q2 followed by SWAP q1 q3) enable us to execute a further two CNOT operations (CNOT q2 q3 and CNOT q1 q4). Two more SWAP operations are required to execute the last two CNOT operations, as shown in Fig. 6c.

Fig. 6 a A hypothetical 4-qubit quantum circuit, b the 4 physical qubits chosen to implement the circuit with the initial logical-to-physical qubit mapping, and c meeting the coupling constraints with additional SWAP operations

This process of transforming a quantum circuit to meet the hardware constraints is commonly referred to as Qubit Mapping in the literature. The final SWAP-inserted version is referred to as a nearest-neighbor (NN) compliant circuit, and it is directly executable on a quantum computer. Note that the additional SWAP operations also need to be decomposed into the basis gates of the target hardware for execution. Any conventional Qubit Mapping procedure involves (i) the selection of physical qubits on the hardware for the logical qubits of the given circuit (qubit allocation), (ii) an initial one-to-one mapping of the logical qubits to the physical qubits (initial placement), and (iii) the addition of a (minimum) number of SWAP operations to meet the hardware constraints for the whole circuit.
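The effect of the two SWAPs in this example can be traced with a short sketch. It models the linear coupling graph of Fig. 6b and the first two layers of the circuit as described in the text, and it checks executability layer by layer: a layer counts as executable only if every CNOT in it acts on coupled physical qubits. The third layer is not modelled because its gates are not spelled out in the text; all function names are illustrative.

```python
# Linear coupling graph of the hypothetical 4-qubit device in Fig. 6b.
COUPLING = {("Q1", "Q2"), ("Q2", "Q3"), ("Q3", "Q4")}

def coupled(p, q):
    return (p, q) in COUPLING or (q, p) in COUPLING

def layer_executable(layer, placement):
    """True if every CNOT of the layer acts on two coupled physical qubits."""
    return all(coupled(placement[a], placement[b]) for a, b in layer)

def swap(placement, a, b):
    """Exchange the physical locations of logical qubits a and b (must be neighbours)."""
    if not coupled(placement[a], placement[b]):
        raise ValueError(f"{a} and {b} are not on neighbouring physical qubits")
    updated = dict(placement)
    updated[a], updated[b] = placement[b], placement[a]
    return updated

placement = {"q1": "Q1", "q2": "Q2", "q3": "Q3", "q4": "Q4"}  # initial placement
layer1 = [("q1", "q2"), ("q3", "q4")]
layer2 = [("q2", "q3"), ("q1", "q4")]

layer1_ok = layer_executable(layer1, placement)
layer2_ok_before = layer_executable(layer2, placement)

# The two SWAPs named in the text: SWAP q1 q2 followed by SWAP q1 q3.
after_swaps = swap(swap(placement, "q1", "q2"), "q1", "q3")
layer2_ok_after = layer_executable(layer2, after_swaps)
print(layer1_ok, layer2_ok_before, layer2_ok_after, after_swaps)
```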
3 Mapping Quantum Circuits to Quantum Processors
3.1 Reducing Swap Gates
The additional SWAP gates affect the reliability of the quantum circuit by increasing the cumulative gate count and circuit depth. A larger gate count increases the potential for errors due to imperfect gate operations. A high-depth circuit requires more execution time on hardware than a low-depth one. Longer execution time reduces the algorithmic speed and makes the circuit more susceptible to decoherence errors. Therefore, the number of additional SWAP gates should be kept as small as possible.

A multitude of optimization techniques have been proposed in recent years to minimize the number of SWAP operations added during the Qubit Mapping procedure. The tool-flow that transforms a given quantum circuit for the target hardware through circuit decomposition, SWAP insertion, and other forms of circuit optimization (e.g., single/two-qubit gate cancellation, rotation merging, etc. (Nam et al. 2018)) is termed a compiler or a transpiler. The vast majority of quantum circuit compilers, including the state-of-the-art compiler available in IBM's quantum software design toolkit (Qiskit), partition the quantum circuit into multiple layers of gates such that all the gates in a layer can be executed in parallel. Two consecutive gates in a quantum circuit can be executed in parallel as long as they operate on disjoint sets of qubits. For instance, the first two CNOT operations in the circuit shown in Fig. 6a act on two different qubit pairs: (q1, q2) and (q3, q4). Hence, these two operations constitute the first layer of the circuit. Similarly, the other two layers of the circuit can be formed as shown in Fig. 6a. SWAP operations are added between consecutive layers where a change in the logical-to-physical qubit mapping is warranted. The collective SWAP operations between two consecutive circuit layers are commonly referred to as the permutation layer in the literature. Figure 6c shows the construction of the two permutation layers for the circuit and the target hardware shown in Fig. 6a and b, respectively.

The Qubit Mapping problem is thought to be NP-complete (Siraichi et al. 2018). Two distinct approaches are followed to minimize the number of SWAP operations during the Qubit Mapping procedure. In the first approach, the problem is formulated as an instance of a constraint satisfaction problem, and powerful reasoning engines (e.g., SMT, ILP, or SAT solvers) are then used to find the best possible solution that meets these constraints (Bhattacharjee et al. 2019; Murali et al. 2019; Wille et al. 2019). Although such approaches often lead to near-optimal solutions for small circuits, they are not suited to large circuits, where they often incur an enormous compilation time overhead. The second approach relies on efficient heuristics that incrementally lead towards a solution (Zulehner et al. 2018; Li et al. 2019). However, the decision taken at each step focuses on maximizing the gain in the current step, which may lead to sub-optimal solutions (e.g., local optima).

For example, an A* search is used to tackle the Qubit Mapping problem in Zulehner et al. (2019). In each iteration of the algorithm, the approach chooses a single SWAP operation that minimizes the cumulative SWAP distance of all the two-qubit operations in the layer under consideration. The SWAP distance is defined as the minimum number of SWAP operations required to make a logical qubit the nearest neighbor of another qubit. For instance, the SWAP distance between q1 and q4 in the quantum circuit in Fig. 6a is 2 for the logical-to-physical qubit mapping shown in Fig. 6b. The procedure continues until the cumulative SWAP distance becomes zero for the current layer.
Once a set of SWAP operations has been found that yields a new logical-to-physical qubit mapping satisfying the hardware constraints for all the gate operations in the current layer, the algorithm moves on to the next layer. A variant of this approach also includes the cumulative SWAP distances of the following layers in the cost function used to choose SWAP operations for the current layer; the authors refer to it as the look-ahead approach. The look-ahead approach often leads to a better solution at the expense of higher compilation time.
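The SWAP-distance cost and the per-layer search can be sketched for the 4-qubit linear example of Fig. 6. The code below is not the A* implementation of the cited work: it uses a plain breadth-first search over placements (equivalent to A* with a zero heuristic), which is only feasible for tiny devices, and all names are illustrative. It computes the SWAP distance as the shortest-path length minus one and searches for a shortest SWAP sequence that brings the cumulative SWAP distance of a layer to zero.

```python
from collections import deque

EDGES = [("Q1", "Q2"), ("Q2", "Q3"), ("Q3", "Q4")]  # linear device of Fig. 6b
ADJ = {}
for a, b in EDGES:
    ADJ.setdefault(a, set()).add(b)
    ADJ.setdefault(b, set()).add(a)

def hops(src, dst):
    """Number of coupling-graph edges on a shortest path from src to dst."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in ADJ[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("qubits are not connected")

def swap_distance(placement, a, b):
    """SWAPs needed to make logical qubits a and b nearest neighbours."""
    return max(hops(placement[a], placement[b]) - 1, 0)

def layer_cost(placement, layer):
    return sum(swap_distance(placement, a, b) for a, b in layer)

def apply_swap(placement, edge):
    """Exchange the logical qubits sitting on the two physical qubits of `edge`."""
    p, q = edge
    return {logical: (q if phys == p else p if phys == q else phys)
            for logical, phys in placement.items()}

def minimal_swaps(placement, layer):
    """Breadth-first search for a shortest SWAP sequence with zero layer cost."""
    queue = deque([(placement, [])])
    seen = {tuple(sorted(placement.items()))}
    while queue:
        current, swaps = queue.popleft()
        if layer_cost(current, layer) == 0:
            return swaps, current
        for edge in EDGES:
            nxt = apply_swap(current, edge)
            key = tuple(sorted(nxt.items()))
            if key not in seen:
                seen.add(key)
                queue.append((nxt, swaps + [edge]))
    raise ValueError("no placement satisfies the layer")

START = {"q1": "Q1", "q2": "Q2", "q3": "Q3", "q4": "Q4"}
LAYER2 = [("q2", "q3"), ("q1", "q4")]

distance_q1_q4 = swap_distance(START, "q1", "q4")
swaps, final_placement = minimal_swaps(START, LAYER2)
print(distance_q1_q4, swaps, final_placement)
```

In this instance no single SWAP lowers the layer cost, so a purely greedy one-SWAP-at-a-time choice would stall; this is one reason why a search over SWAP sequences, optionally with look-ahead, is used in practice.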
3.1.1 Choosing Routing Paths
Before inserting a SWAP gate, the method in Zulehner et al. (2019) checks all the mapped qubits along with the qubits connected to them. For example, in the 6-qubit circuit in Fig. 7a, qubits are mapped to Q1, Q2, Q3, Q4, Q5, and Q9 of ibmq_16_melbourne. To bring the qubits in layer-2 together, Zulehner et al. (2019) check all the possible SWAPs between these qubits along with the qubits connected to them, i.e., Q0, Q13, Q12, Q11, Q10, and Q8. This results in checking 13 possible SWAP candidates (Fig. 7b). However, Li et al. (2019) reasoned that this number can be reduced, as not all the physical qubits have the same ‘priority’ in the routing decision. They consider the active qubits in Layer-2 (which they name the ‘front layer’), i.e., Q2 and Q9, and the qubits connected to them, i.e., Q1, Q12, Q3, Q5, Q8, and Q10, as priority qubits. They check
Fig. 7 a A 6 qubit example circuit mapped to physical qubits Q1, Q2, Q3, Q4, Q5, and Q9 of ibmq_16_melbourne. CNOT operation in layer-2 cannot be directly executed as there is no connection between Q2 and Q9. Therefore, SWAPs need to be inserted to bring the qubits closer, b one method (Zulehner et al. 2019) checks 13 possible SWAP candidates, c another method (Li et al. 2019) checks 6 possible SWAPs, and d yet another method (Murali et al. 2019) reserves a rectangle for routing and checks one-bend paths
the SWAPs between this reduced set of qubits for routing, which results in checking 6 SWAPs (Fig. 7c). This reduction can be significant for larger quantum computer architectures and larger quantum circuits. Another routing optimization can be found in Murali et al. (2019). Their proposal reserves a rectangle bounded by the control and target qubits of the CNOT under consideration (in this case Q2 and Q9; the rectangle is highlighted in gray in Fig. 7d), and uses one-bend paths for routing, as in Fig. 7d.
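The difference between the two candidate sets can be reproduced with a short Python sketch. The edge list below is an assumed undirected coupling map for ibmq_16_melbourne, reconstructed to be consistent with the qubits named in the text and in Fig. 7; the function name is hypothetical.

```python
# Assumed undirected coupling map of ibmq_16_melbourne (an illustration, not
# an authoritative device description).
MELBOURNE_EDGES = [
    (0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6),
    (6, 8), (7, 8), (8, 9), (9, 10), (10, 11), (11, 12), (12, 13), (13, 14),
    (0, 14), (1, 13), (2, 12), (3, 11), (4, 10), (5, 9),
]

def swap_candidates(edges, qubits):
    """All coupling links (possible SWAPs) that touch at least one given qubit."""
    return [edge for edge in edges if edge[0] in qubits or edge[1] in qubits]

mapped = {1, 2, 3, 4, 5, 9}   # physical qubits holding logical qubits (Fig. 7a)
front_layer = {2, 9}          # qubits of the CNOT that cannot be executed yet

all_candidates = swap_candidates(MELBOURNE_EDGES, mapped)            # Zulehner et al.
priority_candidates = swap_candidates(MELBOURNE_EDGES, front_layer)  # Li et al.
print(len(all_candidates), len(priority_candidates))
```

Under this assumed coupling map the two counts agree with the 13 and 6 candidates quoted for Fig. 7b and Fig. 7c.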
3.1.2 Initial Placement
The initial placement of logical qubits on physical qubits influences the number of additional SWAPs. Figure 8 demonstrates this. Suppose we are trying to map a 4-qubit quantum circuit (Fig. 8a) to 4 qubits of a quantum computer (say, IBMQ16 in Fig. 3). For the initial placement on linearly connected qubits as in Fig. 8b (suppose (a, b, c, d) = (1, 2, 3, 4) of IBMQ16), one SWAP needs to be added to satisfy the coupling constraint. However, if we choose an initial placement as in Fig. 8c (suppose (a, b, c, d) = (1, 2, 3, 12) of IBMQ16; a T-topology), 2 SWAPs are needed. This example shows that the initial placement influences the total number of additional SWAPs required. Therefore, a systematic approach is necessary to get a better initial placement. In Paler (2019), the author investigated the influence of initial qubit placement during NISQ circuit compilation. Three relevant questions are presented: (1) which search criteria have to be optimized? (2) where should the search start from? (3) does the starting point influence the quality of the compiled circuit? In Li et al. (2019), the authors proposed another method to refine the initial mapping. They use the reversibility of quantum circuits to update the initial mapping for a better result, and name the approach reverse traversal.
Fig. 8 a Original circuit. NN compliant circuits considering, b linear topology, c T topology. An initial placement to linear topology needs 1 SWAP whereas initial placement to a T topology requires 2 SWAPs clearly showing the influence of initial placement on the total number of SWAPs. d Linear nearest neighbor (LNN) topology and e T topology
Fig. 9 Conceptual illustration of Reverse traversal (Li et al. 2019) method to refine the initial placement
Their approach can be explained with the aid of Fig. 9. They employ a heuristic-based search to generate the NN-compliant map. The search begins with a randomly generated initial mapping and generates an NN-compliant map, which the authors name the final mapping. Quantum circuits are reversible in nature; that is, the gate order in the circuit can be reversed to generate the input from the output. The authors use the final mapping from the first step as the initial mapping of the next step: they run the heuristic search again on the reversed version of the original circuit to generate another final map. This final map of the reversed circuit is the updated initial map of the original circuit. In their method, they perform this reverse traversal 3 times to get a better initial mapping.
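The control flow of reverse traversal can be sketched in a few lines of Python. The router below is a hypothetical stand-in for the heuristic search of Li et al. (2019): it only handles a linear chain and moves the first qubit of each gate step by step towards the second. The circuit, the starting mapping, and the interpretation of one "traversal" as a forward pass followed by a backward pass are assumptions.

```python
def toy_line_router(circuit, mapping):
    """Hypothetical router for a linear chain of physical qubits 0, 1, 2, ...

    Returns (final_mapping, number_of_swaps). mapping is logical -> position.
    """
    pos = dict(mapping)
    swaps = 0
    for a, b in circuit:
        while abs(pos[a] - pos[b]) > 1:
            target = pos[a] + (1 if pos[b] > pos[a] else -1)
            for q, p in pos.items():
                if p == target:          # move the occupant into a's old position
                    pos[q] = pos[a]
                    break
            pos[a] = target
            swaps += 1
    return pos, swaps

def reverse_traversal(circuit, mapping, route, rounds=3):
    """Refine an initial mapping by routing the circuit forwards and backwards."""
    for _ in range(rounds):
        final_mapping, _ = route(circuit, mapping)        # original circuit
        mapping, _ = route(circuit[::-1], final_mapping)  # reversed circuit
    return mapping

# Hypothetical 4-qubit workload and starting placement.
circuit = [("q0", "q3"), ("q0", "q3"), ("q1", "q2")]
initial = {"q0": 0, "q1": 1, "q2": 2, "q3": 3}
refined = reverse_traversal(circuit, initial, toy_line_router)

swaps_before = toy_line_router(circuit, initial)[1]
swaps_after = toy_line_router(circuit, refined)[1]
```

For this toy instance the refined placement needs fewer SWAPs than the starting one; with a real router the improvement is not guaranteed for every circuit.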
4 Error-Tolerant Mapping of Quantum Circuits
This section discusses several works that introduced error awareness into the mapping process. The error specifications vary among qubits. The works presented in this section exploit this variation and propose techniques that use more reliable qubits for operations instead of blind allocation. The work by Tannu and Qureshi (2019) first proposed to take 2-qubit gate-error variation into account and to allocate and move qubits accordingly. The work in Ash-Saki et al. (2019) considered decoherence along with
gate-error rates. Finally, the work in Bhattacharjee et al. (2019) proposed an integrated flow from NN-compliant circuit generation to error-aware mapping.
4.1 Consideration to Gate Error
As mentioned before, gate errors reduce the reliability of a quantum circuit. In a large quantum circuit, the accumulated gate errors from all the erroneous gate operations may render the output of the circuit meaningless. Therefore, a Qubit Mapping procedure should consider minimizing the accumulated gate errors while searching for a mapping that meets the hardware constraints. Due to the spatial variation in gate errors (discussed previously), this minimization approach focuses on pushing most of the gate operations to the better quality physical qubits. This goal can be achieved by modifying each level of the conventional Qubit Mapping procedure as discussed below.
4.1.1 Variation-Aware Qubit Allocation (VQA)
The first step in any conventional Qubit Mapping procedure is the selection of ‘n’ physical qubits to execute a given quantum circuit with ‘n’ logical qubits (qubit allocation). A qubit-to-qubit variation-unaware procedure will generally pick a subset of the available physical qubits based on some heuristic that may reduce SWAP operations in the subsequent steps (e.g. the qubit sub-graph with the maximum number of edges). In Tannu and Qureshi (2019), the authors presented a variation-aware qubit allocation (VQA) approach where the subset of physical qubits is chosen to maximize their cumulative connectivity strength. The connectivity strength of a qubit is defined as the sum of all its coupling-link success probabilities. For instance, a hypothetical 6-qubit coupling graph is shown in Fig. 10a with variation in the coupling-link success probabilities (i.e., variation in the success probabilities of the allowed two-qubit gates between different qubit pairs). To implement a 3-qubit circuit on this hardware, 9 possible qubit-triplets (with continuous connectivity within the triplets) can be chosen from the available 6 physical qubits as shown in Fig. 10b. The connectivity strength of Q1 is (0.8 + 0.9) or 1.7 while it is (0.92 + 0.95 + 0.98) or 2.85 for Q5. The cumulative connectivity strengths of the qubit triplets are shown in Fig. 10b. According to Tannu and Qureshi (2019), choosing {Q2, Q5, Q4} is better for error tolerance.
Fig. 10 a A hypothetical 6-qubit system (variable reliability/success probability of two-qubit operations), b possible 3-qubit triplet choices for implementing a 3-qubit circuit and their corresponding cumulative connectivity strengths, c possible routes to move a qubit state from Q1 to Q4, and d success probabilities of the routes
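The allocation rule can be written as a small Python sketch. Only the link values 0.80 and 0.85 and the strengths 1.7 (Q1) and 2.85 (Q5) are taken from the text; the topology and the placement of the remaining success probabilities are assumptions, since Fig. 10a is not reproduced here. The function names are hypothetical.

```python
from itertools import combinations

# Hypothetical 6-qubit coupling graph with link success probabilities.
LINKS = {
    ("Q1", "Q2"): 0.80, ("Q2", "Q3"): 0.85,                       # quoted in the text
    ("Q1", "Q6"): 0.90, ("Q3", "Q4"): 0.90,                       # assumed
    ("Q2", "Q5"): 0.95, ("Q4", "Q5"): 0.98, ("Q5", "Q6"): 0.92,   # assumed placement
}
QUBITS = sorted({q for link in LINKS for q in link})

def connectivity_strength(qubit):
    """Sum of the success probabilities of all links attached to the qubit."""
    return sum(p for link, p in LINKS.items() if qubit in link)

def is_connected(subset):
    """True if the qubits in subset form a connected sub-graph."""
    subset = set(subset)
    start = next(iter(subset))
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for a, b in LINKS:
            if node in (a, b):
                other = b if node == a else a
                if other in subset and other not in seen:
                    seen.add(other)
                    stack.append(other)
    return seen == subset

def allocate(k):
    """Connected k-qubit subset with the largest cumulative connectivity strength."""
    candidates = [c for c in combinations(QUBITS, k) if is_connected(c)]
    return max(candidates, key=lambda c: sum(connectivity_strength(q) for q in c))

best_triplet = allocate(3)
```

With these assumed values the sketch selects {Q2, Q4, Q5}, in line with the choice attributed to Tannu and Qureshi (2019).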
4.1.2 Variation-Aware Qubit Movement (VQM)
As mentioned earlier, during the Qubit Mapping procedure, qubits are moved dynamically using SWAP operations to ensure that all the operations in the given circuit can be conducted on the hardware. Generally, there exist multiple paths (sets of links) which can be used to move a logical qubit from one physical qubit to another. A variation-unaware approach will choose the shortest path. In Tannu and Qureshi (2019), the authors presented a variation-aware qubit movement (VQM) approach where the path is chosen based on the success probability of the path (route). The success probability of a path is defined as the product of the success probabilities of its individual links. For instance, the success probabilities of the links (Q1, Q2) and (Q2, Q3) in Fig. 10c are 0.80 and 0.85 respectively. In simple words, a success probability of 0.80 for the link (Q1, Q2) means that interchanging the qubit states between Q1 and Q2 is 80% likely to be successful. The route (Q1 → Q2 → Q3) has a success probability of 0.80 × 0.85 = 0.68. To move a logical qubit from Q1 to Q4, one can use one of 4 possible routes as shown in Fig. 10c. While Route-1, Route-2, and Route-3 have the same overall length (which is 3), their success probabilities are vastly different, as shown in Fig. 10d. Route-4 has the lowest success probability and the maximum length. The variation-aware qubit movement approach presented in Tannu and Qureshi (2019) will choose Route-3 to maximize the success probability during the Qubit Mapping procedure. Up to 2.5X performance improvement using VQM was reported in Tannu and Qureshi (2019).
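The route selection can be sketched in Python as follows. As before, only the link values 0.80 for (Q1, Q2) and 0.85 for (Q2, Q3) come from the text; the rest of the graph is an assumed stand-in for Fig. 10c, and the route labels of the figure are not reproduced. The function names are hypothetical.

```python
# Hypothetical coupling graph; only the first two values are quoted in the text.
LINKS = {
    ("Q1", "Q2"): 0.80, ("Q2", "Q3"): 0.85,
    ("Q1", "Q6"): 0.90, ("Q3", "Q4"): 0.90,
    ("Q2", "Q5"): 0.95, ("Q4", "Q5"): 0.98, ("Q5", "Q6"): 0.92,
}

def link_success(a, b):
    """Success probability of the (undirected) link between a and b."""
    return LINKS[(a, b)] if (a, b) in LINKS else LINKS[(b, a)]

def neighbors(qubit):
    return [b if a == qubit else a for a, b in LINKS if qubit in (a, b)]

def simple_paths(src, dst, path=()):
    """Yield every route from src to dst that visits no qubit twice."""
    path = path + (src,)
    if src == dst:
        yield list(path)
        return
    for nxt in neighbors(src):
        if nxt not in path:
            yield from simple_paths(nxt, dst, path)

def path_success(path):
    """Product of the success probabilities of the links along the path."""
    prob = 1.0
    for a, b in zip(path, path[1:]):
        prob *= link_success(a, b)
    return prob

routes = list(simple_paths("Q1", "Q4"))
best_route = max(routes, key=path_success)   # variation-aware choice
shortest = min(routes, key=len)              # variation-unaware choice
```

For larger devices, enumerating all simple paths is impractical; running a shortest-path search on link weights of -log(success probability) would give the same most reliable route.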
4.2 Qubit Re-allocation (QURE)
The core idea of QURE (Ash-Saki et al. 2019) is to find a set of physical qubits that has smaller error rates and thus results in better program fidelity. This approach starts with a depth-optimal (SWAP-inserted and NN-compliant) version of a quantum circuit, which can be generated using mapping techniques like those of Zulehner et al. (2018, 2019). This version of the circuit is named the initial mapping (IM). The IM itself is a sub-graph of the coupling graph (parent graph) of the quantum computer. There can be multiple isomorphic sub-graphs (ISGs) of the IM in the parent graph. The IM can be mapped to any of these sub-graphs without any modification or gate
insertion. Since gate error varies from qubit to qubit (Sect. 2.3.1), some ISGs may be better in terms of fidelity than others for the same quantum program. QURE systematically checks a number of the ISGs to find an ISG with a better fidelity. It re-allocates already allocated (mapped) logical qubits to a different set of physical qubits to optimize for gate error; hence the name qubit re-allocation. Two different QURE approaches are discussed below.
4.2.1 QURE: Sub-graph Search
The sub-graph search approach (Ash-Saki et al. 2019) requires solving two problems: (i) finding the isomorphic sub-graphs of the IM, and (ii) computing the program fidelity on each ISG. This approach can be applied to any type of workload. Thus, it is termed appNISQ_Universal.
Finding Isomorphic Sub-graphs: Finding all the ISGs is believed to be NP-complete (Eppstein 2002). This can be intractable for larger qubit systems. However, it has been noted that the quantum hardware developed so far shows some regularity in its coupling graph (Ash-Saki et al. 2019), generally following a grid-like architecture (e.g., ibmq_16_melbourne). This observation can reduce the time and space complexity of finding a reduced set of ISGs. The steps are described with an example in Fig. 11. First, a rectangular grid architecture from the coupling graph
Fig. 11 Finding ISGs for a given workload and target quantum hardware
is extracted (Fig. 11b), which is called the V(H)-Grid. The V(H)-Grid is rotated by 90° to obtain the H(V)-Grid. For instance, for the workload and IM shown in Fig. 11a–b, a 3 × 2 V-Grid and a 2 × 3 H-Grid (Fig. 11c) can be generated. If the extracted grid is a square, then the V-Grid and H-Grid are the same. Next, the V- and H-Grids are slid over the parent graph to generate new grids or sub-graphs. Inside each V-Grid and H-Grid, the given workload can be mapped in at least 4 different ways as shown in Fig. 11d. The above steps generate a pool of ISGs from which the best needs to be selected in terms of program fidelity. For the example in Fig. 11, this sliding and mirroring generates 48 sub-graphs. Although the true number of possible ISGs can be much larger, the above approach reduces the search space and makes the search more scalable for QCs with a large number of qubits. Thus, the proposed approach offers a trade-off between scalability and the theoretically best solution. To pick the best ISG among the $N_{ISG}$ candidates, a simpler model based on success probability is employed (Ash-Saki et al. 2019), which can be approximated using the equation $SP = \prod_{i=1}^{N} (1 - \eta_i)$. Here, $SP$ denotes the probability of success, $N$ is the total number of gates, and $\eta_i$ is the error rate of the $i$th gate operation (e.g., the gate-error rates in Table 1). The ISG that yields the highest $SP$ should be used to execute the program on the real device.
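The success-probability model and the selection of the best ISG can be illustrated with the following Python sketch. The program, the two candidate ISGs, and the link error rates are hypothetical values chosen for illustration; single-qubit gate errors are ignored for brevity.

```python
import math

def success_probability(gate_error_rates):
    """SP = product over all gates of (1 - error rate of the gate)."""
    return math.prod(1.0 - eta for eta in gate_error_rates)

def program_sp(program, mapping, link_error):
    """SP of a list of two-qubit gates under a logical -> physical mapping."""
    return success_probability(
        link_error[frozenset((mapping[a], mapping[b]))] for a, b in program
    )

def pick_best_isg(program, isgs, link_error):
    """Name of the candidate sub-graph with the highest success probability."""
    return max(isgs, key=lambda name: program_sp(program, isgs[name], link_error))

# Hypothetical two-qubit gate error rates on a 4-qubit linear device.
link_error = {frozenset((0, 1)): 0.02, frozenset((1, 2)): 0.05, frozenset((2, 3)): 0.01}
program = [("a", "b"), ("b", "c"), ("a", "b")]          # NN-compliant workload
isgs = {
    "ISG-1": {"a": 0, "b": 1, "c": 2},
    "ISG-2": {"a": 3, "b": 2, "c": 1},
}
best = pick_best_isg(program, isgs, link_error)
```

Here ISG-2 is preferred because the most frequently used logical pair (a, b) lands on the link with the lowest error rate.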
4.2.2 QURE: Greedy Approach
For a linearly coupled NN circuit, a greedy approach can be employed to find the optimal mapping. The authors term this approach appNISQ_LNN. In this approach, the logical qubits are allocated to the physical qubits one by one following a ranking of the logical and physical qubits. All physical qubits in the target hardware are ranked in ascending order of their average two-qubit gate error rates. The logical qubits are ranked in descending order of the total number of two-qubit gate operations each logical qubit is involved in. Consider the example in Fig. 12 to understand this approach. For simplicity, consider a 9Q-Square QC with physical qubit ranking [Q5, Q0, Q3, Q7, Q8, Q4, Q1, Q6, Q2] (i.e., Q5 is the best qubit in terms of average 2-qubit gate error). A hypothetical 4-qubit LNN workload has the following ranking of logical qubits: [q2, q1, q3, q0] (i.e., q2 is involved in the highest number of 2-qubit operations) (Fig. 12b). During the allocation process, two lists are maintained, Assigned_Physical_Qubits and Allocated_Logical_Qubits, along with two supplementary lists, unallocated (logical) neighbors and unallocated physical neighbors (UPN). In the first iteration, Assigned_Physical_Qubits and Allocated_Logical_Qubits are empty, as no logical qubit is allocated to any physical qubit at the very beginning. Thus, unallocated (logical) neighbors = [q2, q1, q3, q0]. Considering the ranking, logical qubit q2 and physical qubit Q5 are selected. Therefore, at the end of the first iteration q2 is allocated to Q5. In the next iteration, the states of the lists are Assigned_Physical_Qubits = [Q5], Allocated_Logical_Qubits = [q2], unallocated (logical) neighbors (of q2) = [q1, q3], and unallocated physical neighbors (of Q5) = [Q8, Q4, Q2]. The “unallocated physical neighbors” list contains the neighbors of the last assigned physical qubit. Considering the ranking, q1 is selected among these two logical neighbors and Q8 is selected among these 3 physical neighbors. Thus, the allocation in this iteration is q1 → Q8. Following the same approach, q3 will be assigned to Q4 in iteration-3 and q0 to Q7 in the final iteration.
Fig. 12 Greedy allocation of given LNN workload on target hardware
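The worked example can be reproduced with a short Python sketch of the greedy allocation. It is one possible reading of the procedure, not the authors' implementation: each new logical qubit is placed on the best-ranked free physical neighbor of the qubit hosting its already allocated logical neighbor. The row-major numbering of the 9Q-Square device and the chain q0-q1-q2-q3 for the LNN workload are assumptions consistent with the neighbor lists given in the text.

```python
def grid_neighbors(rows, cols):
    """Neighbor lists of a rows x cols grid, qubits numbered row by row (assumed)."""
    nbrs = {}
    for r in range(rows):
        for c in range(cols):
            around = ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
            nbrs[f"Q{r * cols + c}"] = [
                f"Q{rr * cols + cc}" for rr, cc in around
                if 0 <= rr < rows and 0 <= cc < cols
            ]
    return nbrs

def greedy_allocate(logical_rank, logical_nbrs, physical_rank, physical_nbrs):
    """Allocate logical qubits one by one following the two rankings."""
    allocation = {logical_rank[0]: physical_rank[0]}   # best logical -> best physical
    while len(allocation) < len(logical_rank):
        # Highest-ranked unallocated logical qubit with an allocated neighbor.
        q = next(l for l in logical_rank
                 if l not in allocation and any(n in allocation for n in logical_nbrs[l]))
        anchor = next(n for n in logical_nbrs[q] if n in allocation)
        used = set(allocation.values())
        free = [p for p in physical_nbrs[allocation[anchor]] if p not in used]
        # min() raises ValueError if no free neighbor is left (dead end).
        allocation[q] = min(free, key=physical_rank.index)
    return allocation

physical_rank = ["Q5", "Q0", "Q3", "Q7", "Q8", "Q4", "Q1", "Q6", "Q2"]
logical_rank = ["q2", "q1", "q3", "q0"]
logical_nbrs = {"q0": ["q1"], "q1": ["q0", "q2"], "q2": ["q1", "q3"], "q3": ["q2"]}

allocation = greedy_allocate(logical_rank, logical_nbrs, physical_rank, grid_neighbors(3, 3))
```

The sketch does not backtrack, so on devices where the chosen path runs into a dead end it fails instead of searching for an alternative.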
4.3 Multi-constraint Quantum Circuit Mapping
Multi-constraint quantum circuit mapping (Bhattacharjee et al. 2019) delineates various considerations to reduce the search space so that a noise-aware mapping can be found in a reasonable time. The flow starts with a high-level quantum circuit description, the coupling graph of a target quantum computer, and the error specifications (gate errors) of the qubits. The flow generates a depth-optimal, SWAP-inserted (i.e., NN-compliant), and noise-aware mapping of the input circuit. The proposed technology mapping flow is shown in Fig. 13. The steps are described below:
Fig. 13 Proposed multi-constraint quantum circuit mapping to NISQ computer
1. Topology Sub-graph Selection: In the first step, sub-graphs with k qubits each are extracted from the coupling graph (parent graph) of the n-qubit quantum computer such that k ≤ n. Here, k is the number of logical qubits in the input quantum circuit. The authors limit the number of extracted sub-graphs, as the extraction may become intractable for large n. The LNN and T-topology in Fig. 8 are examples of extracted topology sub-graphs.
2. Logical Qubit to Topology Graph Node Mapping: In this step, logical qubits are assigned to the vertices of an extracted sub-graph. The number of these mappings is limited to keep the search tractable. Each such mapping is defined as a configuration, which is similar to the concept of ‘initial mapping/placement’ used in other works.
3. Nearest Neighbour (NN) Compliance: This step achieves NN compliance by inserting SWAP gates. The authors use a constraint-based approach with an integer linear programming (ILP) solver. The objectives of the solution are to minimize the overall depth of the circuit and the number of additional SWAP gates. The authors equip the ILP solver with ‘look-ahead’ capability. The number of possible routing paths is reduced by extracting the topology sub-graph in the first step.
4. Fidelity-aware mapping of NN-compliant circuit to QC: Finally, a noise-aware allocation is obtained by applying QURE (Ash-Saki et al. 2019) (described in Sect. 4.2).
It has been shown that the choice of sub-graph, initial configuration, and look-ahead window size can all optimize the final gate-count and improve the fidelity of the circuit (up to 6.76X).
5 Error-Tolerant Mapping for Application
Existing and near-term quantum computers are noisy and have a very limited number of qubits and limited qubit-to-qubit connectivity. Utilizing quantum computers to implement quantum algorithms (e.g. Shor’s factorization, Grover’s search, etc.) that may outperform their classical counterparts, at least theoretically, requires fully error-corrected qubits, which are probably decades away. To utilize the power of small-scale noisy quantum computers to perform useful computations, quantum-classical hybrid variational algorithms (e.g. variational quantum eigensolver, quantum approximate optimization algorithm, quantum classifiers, quantum generative adversarial network, etc.) have been developed over the past few years (Farhi et al. 2014; Kandala et al. 2017; Schuld et al. 2020; Dallaire-Demers and Killoran 2018). The quantum approximate optimization algorithm (QAOA) is one such hybrid algorithm, targeted at solving hard combinatorial optimization problems on gate-based universal quantum computers. QAOA has been demonstrated to solve NP-hard combinatorial optimization problems such as MaxCut, MaxSAT, etc. quite efficiently (Farhi et al. 2014; Zhou et al. 2018; Crooks 2018; Wecker et al. 2016). In this section, we describe the basics of QAOA, its applicability in solving various optimization problems, and the role of error-aware qubit mapping.
5.1 Background on QAOA and Impact of Noise
In QAOA, each of the binary variables in the target cost function $C(z)$ is represented by a qubit. The classical objective function $C(z)$ is converted into a quantum problem Hamiltonian by promoting each binary variable $z_i$ into a Pauli-Z operator $\sigma_i^z$: $H_C = C(\sigma_1^z, \sigma_2^z, \ldots, \sigma_N^z)$. After initializing the qubits in the superposition state $|+\rangle^{\otimes N}$, the problem (i.e., cost) Hamiltonian and a mixing Hamiltonian ($H_B = \sum_{j=1}^{N} \sigma_j^x$) are applied repeatedly ($p$ times for a $p$-depth QAOA) to generate a variational wavefunction: $|\psi_p(\gamma, \beta)\rangle = e^{-i\beta_p H_B} e^{-i\gamma_p H_C} \cdots e^{-i\beta_1 H_B} e^{-i\gamma_1 H_C} |+\rangle^{\otimes N}$ (Zhou et al. 2018). Here, $\gamma_1, \ldots, \gamma_p$ and $\beta_1, \ldots, \beta_p$ are quantum gate parameters. Then, the expectation value of $H_C$ in the variational wavefunction (i.e., the output state of the circuit), $E_p(\gamma, \beta) = \langle \psi_p(\gamma, \beta) | H_C | \psi_p(\gamma, \beta) \rangle$, is computed through a repetitive sampling process (Farhi et al. 2017; Zhou et al. 2018). A classical optimizer iteratively updates the QAOA parameters $(\gamma, \beta)$ so as to maximize $E_p(\gamma, \beta)$. In a gate-based quantum computer, the Hamiltonians are decomposed into a set of native gates of the computer. A figure of merit for benchmarking the performance of QAOA is the approximation ratio $r = E_p(\gamma, \beta) / C_{max}$ or its conjugate $(1 - r)$ (Crooks 2018), where $C_{max}$ is the maximum cost of the given cost-maximization problem. The lowest-depth version of QAOA ($p = 1$) has a provable performance guarantee which is considerably better than a random guess for certain classes of problems. For example, Farhi et al. (2014) showed that, for the maximum cut problem
Fig. 14 QAOA-MaxCut solution space for the 4-node 3-regular graph a noiseless, and b with noises (Alam et al. 2019b)
(discussed later) of 3-regular graphs, QAOA always finds a cut that is at least 0.6924 times the size of the optimal cut at p = 1. However, there exist classical algorithms that have better performance guarantees. The best-known classical approximation algorithm, due to Goemans and Williamson (1995), guarantees a cut that is at least 0.878 times the size of the optimal cut for similar problem instances. However, QAOA performance can only improve with the number of levels p (Farhi et al. 2017). Therefore, QAOA is touted to outperform classical algorithms at higher p-levels (Zhou et al. 2018; Crooks 2018). From an application standpoint, evolving the system with a cost Hamiltonian means applying a certain quantum circuit on the qubits. The quantum circuit of a given cost Hamiltonian consists of Controlled-PHASE (CPHASE) operations for every quadratic term in the cost function. Every CPHASE operation is decomposed into a combination of 2 CNOT operations and an RZ operation in IBM quantum machines (Alam et al. 2019b). For every ternary or higher-order term in the cost function, the circuit can have corresponding Multi-Controlled-PHASE operations. Note that the evolution of the system deviates from the intended one due to quantum noise (e.g. gate errors, decoherence, etc.), which can affect the algorithmic performance. Recent studies indicate that quantum noise flattens the solution space of QAOA, and thereby degrades the quality of the solution (Alam et al. 2019b; Xue et al. 2019). The entire solution space of a QAOA-MaxCut instance is shown in Fig. 14 with and without various noises (Alam et al. 2019b). Note that the magnitudes of the peaks are indicators of the performance that can be achieved using QAOA. Flattened peaks indicate a lower achievable performance in the presence of noise. Hence, efficient mapping methodologies to reduce the impact of noise on the quantum circuits (e.g. 
minimizing circuit depth to mitigate decoherence errors, minimizing gate-count to reduce gate error accumulation, variation-aware mapping policies, etc.) may prove useful to extract greater performance from such algorithms on noisy devices.
For example, choosing the qubit-triplet {Q2, Q5, Q4} over {Q1, Q2, Q3} in the hypothetical qubit architecture shown in Fig. 10a for a QAOA optimization procedure of 3 binary variables will maximize the fidelity of the output state and hence, will ensure a less flattened solution space. For optimization problems such as MaxCut, a good mapping translates to a higher probability of getting the global maximum cut solution. For a factorization problem, it offers a higher probability of getting the correct solution.
5.2 MaxCut
The maximum cut (MaxCut) problem, which is frequently encountered in VLSI design automation (Sherwani 2012), can be solved using QAOA. Given a graph G = (V, E) with nodes V and edges E, MaxCut finds a subset S ⊆ V such that the number of edges between S and its complementary subset is maximized. Finding an exact solution of MaxCut is NP-hard (Karp 1972). However, efficient polynomial-time classical algorithms exist to find an approximate answer within a fixed multiplicative factor of the optimum (Papadimitriou and Yannakakis 1991). In a classical setup, if the nodes of a target N-node graph are represented by the binary spin (−1/1) variables $\{z_1, z_2, \ldots, z_N\}$, a MaxCut solving procedure maximizes the following cost function: $\frac{1}{2} \sum_{(i,j) \in E} C_{ij} (1 - z_i z_j)$, where $C_{ij} = 1$ if the nodes are connected and 0 otherwise. To solve the MaxCut problem with QAOA, we first convert the classical cost function into a cost Hamiltonian ($H_C$) by replacing the binary variables with Pauli-Z operators: $\frac{1}{2} \sum_{(i,j) \in E} C_{ij} (1 - \sigma_i^z \sigma_j^z)$. The following mixing Hamiltonian ($H_B$) can be used (Farhi et al. 2014): $\frac{1}{2} \sum_{i=1}^{N} \sigma_i^x$. Note that the cost Hamiltonian for MaxCut only has quadratic terms, which translate to CPHASE operations in the QAOA-circuit (2 CNOT and 1 RZ operation are required for a single CPHASE operation in IBM machines, as shown in Fig. 18). Furthermore, these CPHASE operations are commutative (Crooks 2018). This commutation property can be utilized to develop circuit compilation methodologies that can significantly reduce the compiled QAOA-circuit depth and gate-count (and therefore reduce the accumulation of errors) (Alam et al. 2020). This can be achieved by following a two-step approach: (i) reducing the number of layers in the circuit by re-ordering the CPHASE gates (maximizing parallel gate operations to reduce the circuit depth), and (ii) heuristics-driven iterative compilation of the circuit with re-ordered layers to reduce the number of added SWAP operations.
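To make the MaxCut cost function and the approximation ratio concrete, the following pure-Python sketch simulates a noiseless p = 1 QAOA on the 4-node 3-regular graph (the complete graph on 4 nodes) with a tiny state-vector simulator. It is an illustration under stated assumptions: the mixing Hamiltonian is taken as the plain sum of Pauli-X operators (a constant prefactor only rescales β), the parameters are scanned on a coarse grid instead of being tuned by a classical optimizer, and all function names are hypothetical.

```python
import cmath
import math

def cut_values(n, edges):
    """Diagonal of H_C: cut size of every basis state (bit i of idx = node i)."""
    values = []
    for idx in range(2 ** n):
        z = [1 - 2 * ((idx >> i) & 1) for i in range(n)]   # spins +1 / -1
        values.append(0.5 * sum(1 - z[i] * z[j] for i, j in edges))
    return values

def apply_mixer(state, n, beta):
    """Apply exp(-i*beta*X) to every qubit (the X terms commute)."""
    c, s = math.cos(beta), -1j * math.sin(beta)
    for q in range(n):
        mask = 1 << q
        new_state = list(state)
        for idx in range(len(state)):
            if not idx & mask:
                a, b = state[idx], state[idx | mask]
                new_state[idx] = c * a + s * b
                new_state[idx | mask] = s * a + c * b
        state = new_state
    return state

def qaoa_expectation(n, edges, gammas, betas):
    """E_p(gamma, beta) for the MaxCut cost Hamiltonian, without noise."""
    costs = cut_values(n, edges)
    state = [complex(1 / math.sqrt(2 ** n))] * (2 ** n)    # |+> on every qubit
    for gamma, beta in zip(gammas, betas):
        state = [cmath.exp(-1j * gamma * c) * a for c, a in zip(costs, state)]
        state = apply_mixer(state, n, beta)
    return sum(abs(a) ** 2 * c for a, c in zip(state, costs))

K4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]      # 4-node 3-regular graph
c_max = max(cut_values(4, K4))
grid = [k * math.pi / 16 for k in range(16)]
best_e = max(qaoa_expectation(4, K4, [g], [b]) for g in grid for b in grid)
ratio = best_e / c_max                                     # approximation ratio r
```

At γ = 0 the circuit leaves the uniform superposition unchanged, so the expectation equals the average cut (half the number of edges); any useful parameter setting must improve on that baseline.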
Instruction parallelization: In Alam et al. (2020), the proposed instruction parallelization approach starts with a random sequence of the CPHASE gates for any given problem. The algorithm generates the circuit layers (each consisting of a few CPHASE gates acting on different qubits) by following a greedy layer formation procedure described below:
Fig. 15 a 4-node 3-regular graph, b layer formation procedure (Alam et al. 2020), and c generated circuit layers
Step-1: An empty layer of qubits is defined without assigning any CPHASE operation to any of the qubits. The number of qubits in the layer is equal to the number of logical qubits in the problem.
Step-2: The algorithm iterates through the CPHASE gates. It adds a CPHASE gate to the layer if neither of the two qubits involved in this CPHASE operation is already occupied. The added CPHASE gate is then removed from the initial sequence of CPHASE gates and the qubits are marked as occupied.
Step-3: If all the qubits in the current layer are occupied, or the end of the sequence is reached before all the CPHASE operations have been assigned to layers, a new empty layer is defined and the algorithm iterates through the remaining CPHASE gates from Step-2. The procedure terminates when the CPHASE gate sequence is empty.
A demonstrative example is shown in Fig. 15. The example QAOA problem instance has 6 CPHASE gates between the logical qubit pairs {(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)}. A total of 3 layers are constructed through the procedure as shown in Fig. 15b, c. Each iteration in the procedure is shown in Fig. 15b. The constructed circuit after the layer formation procedure is shown in Fig. 16b. A randomly generated circuit for the same problem required 6 layers as shown in Fig. 16a.
Layer re-ordering: In Alam et al. (2020), the authors also pointed out that the relative order of the layers can affect the number of added SWAP operations during the Qubit Mapping procedure. To take advantage of this, the authors proposed an iterative compilation procedure guided by heuristics. The procedure starts with the layer order produced during the layer construction procedure (termed the root order). In each iteration, it considers exchanging the order of all possible pairs of layers for a given QAOA circuit instance. Each of these exchanges produces a distinct order of the layers. The circuit is compiled for all these possible layer orders in the current iteration, and the order that provides the highest gain in the target objective (e.g. minimizing the circuit depth with respect to the root order) is picked as the new root order for the next iteration. The procedure terminates based on a predefined termination policy (e.g. no gain in the target objective).
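The greedy layer formation procedure (Step-1 to Step-3) can be sketched in Python as follows. The function name is hypothetical, and the gate sequence is taken in the order listed for the example of Fig. 15 rather than in a random order.

```python
def form_layers(cphase_gates):
    """Greedily pack commuting CPHASE gates into layers of disjoint qubit pairs."""
    remaining = list(cphase_gates)
    layers = []
    while remaining:                          # Step-3: open a new layer if needed
        occupied, layer, leftover = set(), [], []
        for a, b in remaining:                # Step-2: scan the remaining gates
            if a in occupied or b in occupied:
                leftover.append((a, b))       # a qubit is busy, retry in a later layer
            else:
                layer.append((a, b))
                occupied.update((a, b))
        layers.append(layer)
        remaining = leftover
    return layers

gates = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
layers = form_layers(gates)
```

Scanning to the end of the sequence after all qubits are occupied is harmless, since no further gate can be added to a full layer.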
Fig. 16 a A randomly constructed QAOA-circuit, and b generated circuit following (Alam et al. 2020) for the MaxCut problem of the graph in Fig. 15a
An example of the procedure is shown in Fig. 17. Starting with a hypothetical QAOA instance with 4 CPHASE layers (L1 L2 L3 L4), the circuit is compiled with a traditional compiler backend and critical circuit depth is found to be 100 (minimization of the circuit depth is the target objective). In the current iteration, all possible two-layer interchanges (a total of 6) are considered and the circuit is compiled for each of the layer orders produced by these interchanges. Interchanging L1 with L4 results in a maximum reduction in circuit depth i.e., 28. Hence, this layer order is picked as the root for the next iteration. In iteration-2, none of the layer orders produced by the two-layer interchange approach gives any gain. Therefore, the procedure stops (taking no gain in target objective as the termination policy).
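The iterative layer re-ordering loop can be sketched in Python as follows. The `cost` callable stands in for "compile the circuit with this layer order and report the target objective" (e.g. circuit depth); the weighted-position cost used in the example is a toy assumption and not a model of a real compiler backend.

```python
from itertools import combinations

def reorder_layers(layers, cost):
    """Hill-climb over pairwise layer interchanges until no interchange helps."""
    root, root_cost = list(layers), cost(layers)
    while True:
        best, best_cost = None, root_cost
        for i, j in combinations(range(len(root)), 2):    # all two-layer interchanges
            trial = list(root)
            trial[i], trial[j] = trial[j], trial[i]
            trial_cost = cost(trial)                       # "compile" this order
            if trial_cost < best_cost:
                best, best_cost = trial, trial_cost
        if best is None:                                   # termination: no gain
            return root, root_cost
        root, root_cost = best, best_cost                  # new root order

# Toy objective (assumption): later positions are more expensive for heavy layers.
WEIGHT = {"L1": 1, "L2": 2, "L3": 3, "L4": 4}

def toy_cost(order):
    return sum(position * WEIGHT[layer] for position, layer in enumerate(order))

final_order, final_cost = reorder_layers(["L1", "L2", "L3", "L4"], toy_cost)
```

With 4 layers each iteration evaluates 6 interchanges, matching the count in the example of Fig. 17; the number of compilations per iteration grows quadratically with the number of layers.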
Fig. 17 Hypothetical iteration steps of an iterative compilation procedure (depth minimization as the target objective) with 4 layers of CPHASE gates (Alam et al. 2020)
A reduction of gate-count up to 23.21% and circuit depth up to 53.65% were reported in Alam et al. (2020) following these approaches.
5.3 Factorization
The security of many modern public key cryptography protocols (e.g. RSA) lies in the difficulty of finding the prime factors of semi-prime numbers (Paar and Pelzl 2009). The factorization problem can be converted into a cost function minimization and therefore, it can be solved using QAOA (Anschuetz et al. 2019). The cost function is problem specific (i.e., it depends on the target integer) and is derived from a set of algebraic equations involving Boolean variables, which is generated according to the binary number multiplication rule. The number of qubits required equals the number of unknowns in the equation set. However, due to the limited number of qubits in NISQ-era computers, it is essential to reduce the number of unknowns before running QAOA. One approach is to simplify the clauses in the equations classically using the properties of Boolean algebra. Factoring 143 will be used as an example, for which the simplified equation set is: $p_1 + q_1 = 1$, $p_2 + q_2 = 1$, $p_2 q_1 + p_1 q_2 = 1$ (Anschuetz et al. 2019). The factorization of 143 becomes a cost minimization problem of the following function: $C_p = (p_1 + q_1 - 1)^2 + (p_2 + q_2 - 1)^2 + (p_2 q_1 + p_1 q_2 - 1)^2$. Each of the unknown binary variables is mapped into the qubits’ space to construct the cost Hamiltonian: $p_1 = \frac{1 - \sigma_1^z}{2}$, $p_2 = \frac{1 - \sigma_2^z}{2}$, $q_1 = \frac{1 - \sigma_3^z}{2}$ and $q_2 = \frac{1 - \sigma_4^z}{2}$, where $\sigma_i^z$ is the Pauli-Z operator on spin $i$. A direct mathematical representation of the cost Hamiltonian ($H_c$) becomes: $3 - \sigma_1^z - \sigma_2^z - \sigma_3^z - \sigma_4^z + 2\sigma_1^z \sigma_3^z - \sigma_2^z \sigma_3^z + 2\sigma_2^z \sigma_4^z + 2\sigma_1^z \sigma_2^z \sigma_3^z \sigma_4^z$. Note that $H_c$ includes a quaternary term which translates to a 4-qubit ZZZZ-interaction (a Multi-Controlled-PHASE gate with 3 control qubits) in the corresponding QAOA circuit. In IBM machines, 6 CNOT operations will be required for this single 4-qubit interaction. The circuit decomposition of each of these terms in $H_c$ (for IBM machines) is shown in Fig. 18.
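The classical cost function for factoring 143 is small enough to check by brute force, which shows what the QAOA circuit is expected to find. The sketch below enumerates all 16 assignments of the four unknowns; the bit layout used to decode the factors (p = 1 p2 p1 1 and q = 1 q2 q1 1 in binary) is an assumption based on the usual formulation and is not stated in the text.

```python
from itertools import product

def cost(p1, p2, q1, q2):
    """C_p for factoring 143 after classical simplification of the clauses."""
    return (p1 + q1 - 1) ** 2 + (p2 + q2 - 1) ** 2 + (p2 * q1 + p1 * q2 - 1) ** 2

def decode(p1, p2, q1, q2):
    """Assumed bit layout: p = (1 p2 p1 1)_2 and q = (1 q2 q1 1)_2."""
    return 8 + 4 * p2 + 2 * p1 + 1, 8 + 4 * q2 + 2 * q1 + 1

assignments = list(product((0, 1), repeat=4))
min_cost = min(cost(*a) for a in assignments)
solutions = [a for a in assignments if cost(*a) == min_cost]
factors = [decode(*s) for s in solutions]
print(min_cost, solutions, factors)
```

The two minimizers are mirror images of each other (the roles of p and q exchanged), so a QAOA run that samples either of them yields the same factorization.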
Note that efficient transformation techniques exist (such as the Gröbner transformation and the Schaller and Schützhold transformation (Qiu et al. 2020)) that can be used to reduce qubit-to-qubit interactions in $H_c$. For instance, a Schaller transformation of the above-mentioned $H_c$ produces the following:
$$5 - 3\sigma_1^z - \sigma_2^z - \sigma_3^z + 2\sigma_1^z \sigma_3^z - 3\sigma_2^z \sigma_3^z + 2\sigma_1^z \sigma_2^z \sigma_3^z - 3\sigma_4^z + \sigma_1^z \sigma_4^z + 2\sigma_2^z \sigma_4^z + 2\sigma_2^z \sigma_3^z \sigma_4^z$$
(Qiu et al. 2020), which reduces the maximum qubit-to-qubit interaction from 4 to 3. These transformation techniques provide unique design knobs that can be utilized while searching for an error-tolerant mapping of the cost Hamiltonian onto target architectures using the techniques discussed before.
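The reduction in interaction order can be read directly off the two expressions. The sketch below stores each Hamiltonian as a dictionary from qubit-index tuples to coefficients, copied term by term from the text, and reports the largest term size. The dictionary encoding and the function names are illustrative assumptions, not notation from Qiu et al. (2020).

```python
# Pauli-Z terms as {tuple of qubit indices: coefficient}; () is the constant.
H_DIRECT = {
    (): 3, (1,): -1, (2,): -1, (3,): -1, (4,): -1,
    (1, 3): 2, (2, 3): -1, (2, 4): 2, (1, 2, 3, 4): 2,
}
H_SCHALLER = {
    (): 5, (1,): -3, (2,): -1, (3,): -1, (4,): -3,
    (1, 3): 2, (2, 3): -3, (1, 4): 1, (2, 4): 2,
    (1, 2, 3): 2, (2, 3, 4): 2,
}


def max_interaction_order(hamiltonian):
    """Largest number of qubits coupled by any single term."""
    return max(len(term) for term in hamiltonian)


def terms_by_order(hamiltonian):
    """Count the terms of each order (the constant term is skipped)."""
    counts = {}
    for term in hamiltonian:
        if term:
            counts[len(term)] = counts.get(len(term), 0) + 1
    return counts


if __name__ == "__main__":
    for name, h in (("direct", H_DIRECT), ("Schaller", H_SCHALLER)):
        print(name, max_interaction_order(h), terms_by_order(h))
```

The transformed expression trades the single four-qubit term for two three-qubit terms and one extra two-qubit term, so the maximum order drops while the number of multi-qubit terms grows.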
Fig. 18 Circuit decompositions for individual clauses in $H_c$ for IBM machines (Qiu et al. 2020)
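The count of 6 CNOTs for the four-qubit ZZZZ term is consistent with the standard parity-ladder construction: three CNOTs accumulate the parity of all four qubits on the last qubit, a single RZ rotation applies the phase, and three more CNOTs undo the ladder. The sketch below checks this on all 16 computational-basis states. It is a textbook construction assumed here for illustration and is not necessarily the exact circuit of Fig. 18; it also assumes that the CNOT pairs used by the ladder are directly coupled on the device.

```python
import cmath
from itertools import product


def cnot(bits, control, target):
    """CNOT on a computational-basis state: flip target if control is 1."""
    bits = list(bits)
    bits[target] ^= bits[control]
    return tuple(bits)


def ladder_phase(bits, theta):
    """Apply CNOT ladder, RZ(theta), reverse ladder; return (bits, phase)."""
    n = len(bits)
    state = tuple(bits)
    # Forward ladder: accumulate the parity of all qubits on the last qubit.
    for k in range(n - 1):
        state = cnot(state, k, k + 1)
    # RZ(theta) = diag(exp(-i theta/2), exp(+i theta/2)) on the last qubit.
    sign = 1 - 2 * state[-1]  # +1 if the parity bit is 0, -1 if it is 1
    phase = cmath.exp(-1j * theta / 2 * sign)
    # Reverse ladder: uncompute the parity so the basis state is restored.
    for k in reversed(range(n - 1)):
        state = cnot(state, k, k + 1)
    return state, phase


def zzzz_phase(bits, theta):
    """Phase that exp(-i theta/2 * Z Z Z Z) applies to a basis state."""
    eigenvalue = 1
    for b in bits:
        eigenvalue *= 1 - 2 * b  # Z eigenvalue: |0> -> +1, |1> -> -1
    return cmath.exp(-1j * theta / 2 * eigenvalue)


if __name__ == "__main__":
    theta = 0.7
    ok = all(
        ladder_phase(b, theta)[0] == b
        and abs(ladder_phase(b, theta)[1] - zzzz_phase(b, theta)) < 1e-12
        for b in product((0, 1), repeat=4)
    )
    print("ladder matches ZZZZ phase on all basis states:", ok)
```

Because every gate in the ladder maps basis states to basis states up to a phase, agreement on all basis states is enough to establish that the two operators are equal. With this construction a k-qubit Z interaction uses 2(k - 1) CNOTs, which gives 6 for k = 4.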
5.4 Object Detection
Redundant-location suppression in object detection has been solved by converting the suppression task into a Quadratic Unconstrained Binary Optimization (QUBO) problem (Rujikietgumjorn and Collins 2013). Traditionally, the resulting QUBO problem is solved by classical algorithms such as Tabu search and greedy search. With the help of quantum systems, such a problem can now be tackled either by a universal gate quantum computer such as IBM QX or by a quantum annealer such as the D-Wave system.
Quadratic unconstrained binary optimization: Existing object detection approaches almost exclusively use Non-Maximum Suppression (NMS) to remove redundant detections. However, finding the optimal detections to keep from the raw detections generated by an object detector is essentially a combinatorial optimization problem. The optimal solution is a binary vector $x = (x_1, x_2, \ldots, x_N)^T$ that maximizes the objective function $C(x)$. The standard objective function is typically formulated with linear and quadratic terms as follows:
$$C(x) = \sum_{i=1}^{N} c_i x_i + \sum_{i=1}^{N} \sum_{j=i}^{N} c_{ij} x_i x_j = x^T Q x \qquad (1)$$
where $\forall i,\ x_i \in \{0, 1\}$ are binary variables, $c_i$ are linear coefficients and $c_{ij}$ are quadratic coefficients. Each object detection instance is represented by a specific upper triangular matrix $Q$, where the $c_i$ terms make up the diagonal elements and the $c_{ij}$ terms constitute the off-diagonal ones. For the object detection problem, $N$ denotes the number of detection bounding boxes, $c_i$ denotes the detection score of each box $b_i$, and $c_{ij}$ denotes the negative value of the overlap $or(b_i, b_j)$, which penalizes high overlap between a pair of bounding boxes. Theoretically, QUBO is a suitable alternative to greedy NMS for redundancy suppression, as the overlap $or(b_i, b_j)$ between each pair of detections is fully considered in QUBO, while NMS only considers the overlap $or(M, b_i)$ between the highest-scoring detection ($M$) and the others. Besides, QUBO is also able to uniformly combine multiple linear and quadratic metrics from different strategies and even different detectors. Put differently, the matrix $Q$ can be formed as the sum of multiple weighted linear and pair matrices for more accurate detection. Below is an example that adopts a single linear matrix $L$ and two pair matrices $P_1$ and $P_2$:
$$Q = w_1 L - w_2 P_1 - w_3 P_2 \qquad (2)$$
where $L$ is a diagonal matrix of objectness scores, $P_1$ is a pair matrix measuring IOU (intersection over union) overlap, and $P_2$ is another pair matrix measuring the feature correlation coefficient. An image instance demonstrating pedestrian detection results is shown in Fig. 19, where suppression using QUBO filters out more false positive bounding boxes generated by the Faster R-CNN detector (Ren et al. 2015) than NMS with an overlap threshold of 0.5.
Fig. 19 Object detection using NMS and QUBO for removing false positive detections. Both methods use a confidence threshold of 0.8: (a) suppressing non-maximum detection locations from Faster R-CNN (Ren et al. 2015) with a threshold of 0.5; (b) filtering out redundant locations from Faster R-CNN using QUBO (Li et al. 2020)
Universal gate quantum computing: QAOA is a suitable hybrid quantum-classical algorithm for solving the above-mentioned QUBO problem. The QUBO problem is NP-hard, and the runtime of an exhaustive classical search grows exponentially, as $O(2^n)$, with the number $n$ of detection boxes. The complexity of QAOA, however, depends primarily on the quantum circuit depth, which may make it faster than classical algorithms at handling image instances with a large number of objects. QAOA performance can be improved hierarchically by a few levels of mechanisms, such as a dedicated classical optimizer, gate parameter analysis, and gate rescheduling. QAOA performance is measured by the Approximation Ratio (AR) and the number of Function Calls (FC). AR reflects the optimality of the generated binary output vector compared to the brute-force output, while FC reflects the efficiency (physical runtime) of the algorithm. In Li et al. (2020), a global classical optimizer (e.g., differential evolution) is considered since it achieves the highest approximation ratio (increasing monotonically with circuit depth). However, it takes around 35X more function calls than Nelder-Mead and 47X more than L-BFGS-B. All the following experiments are therefore performed with L-BFGS-B, as it runs faster than the other two and achieves comparable accuracy. The second-level mechanism using quantum gate parameter analysis speeds
up the algorithm 5.5X by restricting the parameter range from $(-\pi, \pi)$ to $(0, \pi)$, which eliminates degeneracies in Hilbert space. Further speedup is achieved by exploiting the parameter regression technique. Another level of improvement is accomplished by quantum gate rescheduling, whose implementation relies on qubit allocation. Gate operation scheduling optimization harnesses more parallelism in the execution of the logical quantum gates. The gate sequence can only be reordered among gates that commute; otherwise, the quantum circuit implements an incorrect functionality. Optimal qubit allocation reduces the number of SWAP operations and makes gate scheduling more efficient. Fidelity is calculated only between $\rho_{nonsch}$ and $\rho_{resch}$. The results show that $\mathrm{Fid}(\rho_{nonsch}, \rho_{resch})$ is 99.96% for the level-1 circuit and 99.94% for the level-2 circuit. The improvement in circuit delay from gate rescheduling becomes more pronounced with increasing circuit depth (Li et al. 2020).
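The QUBO formulation of Eq. (1) and its relation to greedy NMS can be sketched classically. The example below builds an upper triangular $Q$ from detection scores and pairwise IoU overlaps, maximizes $x^T Q x$ by exhaustive search, and computes an approximation ratio against that optimum. The boxes, the scores, the `overlap_weight` parameter and every function name are hypothetical; the exhaustive search is a classical reference and does not reproduce the QAOA flow of Li et al. (2020).

```python
from itertools import product


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def build_q(boxes, scores, overlap_weight=1.0):
    """Upper triangular Q: scores on the diagonal, -weight * IoU above it."""
    n = len(boxes)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        q[i][i] = scores[i]
        for j in range(i + 1, n):
            q[i][j] = -overlap_weight * iou(boxes[i], boxes[j])
    return q


def objective(q, x):
    """C(x) = x^T Q x for a binary vector x and upper triangular Q."""
    n = len(x)
    return sum(q[i][j] * x[i] * x[j] for i in range(n) for j in range(i, n))


def solve_qubo_brute_force(q):
    """Exhaustive O(2^N) search over all binary vectors."""
    return max(product((0, 1), repeat=len(q)), key=lambda x: objective(q, x))


def greedy_nms(boxes, scores, threshold=0.5):
    """Greedy NMS: keep the best box, drop boxes that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= threshold]
    return sorted(keep)


def approximation_ratio(q, x, x_opt):
    """AR of a candidate x relative to the brute-force optimum x_opt."""
    return objective(q, x) / objective(q, x_opt)


# Hypothetical detections: box 1 nearly duplicates box 0.
BOXES = [(0, 0, 10, 10), (1, 0, 11, 10), (20, 0, 30, 10), (6, 0, 16, 10)]
SCORES = [0.9, 0.8, 0.7, 0.85]

if __name__ == "__main__":
    q = build_q(BOXES, SCORES)
    x_opt = solve_qubo_brute_force(q)
    print("QUBO keeps:", [i for i, xi in enumerate(x_opt) if xi])
    print("NMS keeps: ", greedy_nms(BOXES, SCORES))
    print("AR of keeping only boxes 0 and 2:",
          round(approximation_ratio(q, (1, 0, 1, 0), x_opt), 3))
```

On this toy instance both methods discard the near-duplicate box 1 and keep the rest. Raising `overlap_weight` makes the QUBO objective penalize every pairwise overlap more strongly, a knob that greedy NMS with a single threshold does not offer.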
6 Other Related Works and Future Outlook
6.1 Other Related Works
Apart from the works mentioned so far, a number of works address NN-compliant mapping, reduction of the number of SWAPs, and error-awareness from different directions. These works tackle the mapping problem in 1D/2D architectures using mixed integer linear programming (Bhattacharjee and Chattopadhyay 2017; Shafaei et al. 2014), graph partitioning (Chakrabarti et al. 2011), temporal planners (Booth et al. 2018; Venturelli et al. 2017, 2018), exact methods (Lye et al. 2015), greedy randomized search (Oddi and Rasconi 2018), pseudo-Boolean optimization (Wille et al. 2014), dynamic programming (Siraichi et al. 2018), graph isomorphism (Siraichi et al. 2019), etc. The above-mentioned works use a mathematical formulation and a software solver to achieve the NN mapping. As mapping is an NP-hard problem (Maslov et al. 2008), these solutions do not scale well with an increasing number of qubits.
Another type of work exploits properties of quantum operations such as gate commutation (Itoko et al. 2019) and re-ordering (Hattori and Yamashita 2018) to minimize the number of additional SWAPs. If two gates commute, changing their order does not change the logical outcome.
Finally, some recent works address error types other than gate error and decoherence, such as readout error and crosstalk. In Murali et al. (2019), the readout error is included in the noise-adaptive mapping flow. The heuristic cost function is defined with weighted gate error and readout error. By choosing the individual error weights from {0, 0.5, 1}, the algorithm optimizes for gate error only, for readout error only, or for both. An SMT formulation to map qubits is also proposed in the same work. The work in Murali et al. (2020) addresses crosstalk using software techniques. However, this work focuses on gate scheduling instead of mapping. Two parallel gates can suffer from crosstalk between each other. The proposed technique prescribes scheduling them in two different time steps. However, this may increase the program execution time and can have a negative decoherence effect. Therefore, the technique compares the crosstalk error with the decoherence error, and it schedules the operations in different time steps only if the decoherence error is smaller than the crosstalk error. The work in Tannu and Qureshi (2019) proposes to fuse the outcomes of different mappings of the same circuit to filter out dissimilar errors and improve the inferencing strength of a quantum circuit. Table 2 summarizes various mapping techniques and the types of errors addressed by recent papers.
Table 2 Summary of error considerations of different mapping techniques (H = heuristics, R-T = reverse traversal, Rand = random). Columns: Zulehner et al. (2018), Li et al. (2019), Tannu and Qureshi (2019), Murali et al. (2019), Ash-Saki et al. (2019), Bhattacharjee et al. (2019), Murali et al. (2020). Rows: Methodology, Gate error, Decoherence, Readout, Crosstalk, Initial mapping.
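The crosstalk-versus-decoherence comparison described above can be illustrated with a toy decision rule. The sketch below is not the formulation of Murali et al. (2020); it assumes a simple exponential model for the decoherence error incurred by waiting one extra gate duration, and all function names, parameters and numbers are hypothetical.

```python
import math


def decoherence_error(extra_time_ns, coherence_time_ns):
    """Assumed model: error from idling extra_time_ns, 1 - exp(-t / T)."""
    return 1.0 - math.exp(-extra_time_ns / coherence_time_ns)


def choose_schedule(crosstalk_error, gate_time_ns, coherence_time_ns):
    """Serialize two crosstalk-prone gates only if that costs less error."""
    serial_penalty = decoherence_error(gate_time_ns, coherence_time_ns)
    return "serial" if serial_penalty < crosstalk_error else "parallel"


if __name__ == "__main__":
    # Hypothetical numbers: 300 ns gate, 50 us coherence time.
    print(choose_schedule(0.05, 300, 50_000))    # strong crosstalk
    print(choose_schedule(0.001, 300, 50_000))   # weak crosstalk
```

With these illustrative numbers, the extra idle time costs roughly 0.6% error, so the gates are serialized when crosstalk is strong and left in parallel when it is weak.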
6.2 Future Outlook
Error sources such as gate error, readout error, and decoherence/dephasing have been studied in the existing literature, along with spatial and temporal variation in qubit quality. Techniques such as mapping and remapping have been developed to optimize the resilience of quantum circuits (both random and parametric circuits) in the presence of gate errors, decoherence, and variations. However, error sources such as crosstalk remain largely unexplored. Other open issues include the optimization of approximate algorithms such as QAOA and VQE for problems beyond MaxCut, e.g., bond-length estimation, object detection, and path planning, to name a few.
7 Conclusion
Various error sources reduce the computing power of quantum computers. We presented an overview of these error sources and their impact on NISQ computers. We described various mapping and compilation techniques that mitigate these errors to improve the fidelity of computation. We also highlighted some of the important research challenges, open problems, and opportunities.
Acknowledgements This work is supported by SRC (2847.001), NSF (CNS-1722557, CCF-1718474, DGE-1723687, DGE-1821766, OIA-2040667 and DGE-2113839) and a Seed Grant award from the Institute for Computational and Data Sciences (ICDS) at the Pennsylvania State University. The authors would like to thank the students in the LOGICS lab, Penn State, and Ling Qiu for contributions on VQF. This content is solely the responsibility of the authors and does not necessarily represent the views of the ICDS and NSF.
References
M. Alam, A. Ash-Saki, S. Ghosh, Addressing temporal variations in qubit quality metrics for parameterized quantum circuits, in 2019 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED) (2019a), pp. 1–6
M. Alam, A. Ash-Saki, S. Ghosh, Analysis of quantum approximate optimization algorithm under realistic noise in superconducting qubits (2019b), arXiv preprint: arXiv:1907.09631
M. Alam, A. Ash-Saki, S. Ghosh, An efficient circuit compilation flow for quantum approximate optimization algorithm, in 57th Annual Design Automation Conference (2020)
E. Anschuetz, J. Olson, A. Aspuru-Guzik, Y. Cao, Variational quantum factoring, in International Workshop on Quantum Technology and Optimization Problems (Springer, 2019), pp. 74–85
A. Ash-Saki, M. Alam, S. Ghosh, QURE: qubit re-allocation in noisy intermediate-scale quantum computers, in Proceedings of the 56th Annual Design Automation Conference (2019), pp. 1–6
D. Bhattacharjee, A. Ash Saki, M. Alam, A. Chattopadhyay, S. Ghosh, MUQUT: multi-constraint quantum circuit mapping on NISQ computers, in 38th IEEE/ACM International Conference on Computer-Aided Design, ICCAD 2019 (Institute of Electrical and Electronics Engineers Inc., 2019), p. 8942132
D. Bhattacharjee, A. Chattopadhyay, Depth-optimal quantum circuit placement for arbitrary topologies (2017), arXiv preprint: arXiv:1703.08540
J. Biamonte, P. Wittek, N. Pancotti, P. Rebentrost, N. Wiebe, S. Lloyd, Quantum machine learning. Nature 549(7671), 195–202 (2017)
K.E.C. Booth, M. Do, J.C. Beck, E. Rieffel, D. Venturelli, J. Frank, Comparing and integrating constraint programming and temporal planning for quantum circuit compilation, in Twenty-Eighth International Conference on Automated Planning and Scheduling (2018)
S.B. Bravyi, A.Y. Kitaev, Quantum codes on a lattice with boundary (1998), arXiv preprint: arXiv:quant-ph/9811052
A. Chakrabarti, S. Sur-Kolay, A. Chaudhury, Linear nearest neighbor synthesis of reversible circuits by graph partitioning (2011), arXiv preprint: arXiv:1112.0564
J.I. Cirac, P. Zoller, Quantum computations with cold trapped ions. Phys. Rev. Lett. 74, 4091–4094 (1995)
G.E. Crooks, Performance of the quantum approximate optimization algorithm on the maximum cut problem (2018), arXiv preprint: arXiv:1811.08419
P.-L. Dallaire-Demers, N. Killoran, Quantum generative adversarial networks. Phys. Rev. A 98(1), 012324 (2018)
D. Eppstein, Subgraph isomorphism in planar graphs and related problems, in Graph Algorithms And Applications I (World Scientific, 2002), pp. 283–309
E. Farhi, J. Goldstone, S. Gutmann, A quantum approximate optimization algorithm (2014), arXiv preprint: arXiv:1411.4028
E. Farhi, J. Goldstone, S. Gutmann, H. Neven, Quantum algorithms for fixed qubit architectures (2017), arXiv preprint: arXiv:1703.06199
M.X. Goemans, D.P. Williamson, Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J. ACM (JACM) 42(6), 1115–1145 (1995)
Google AI Blog, A Preview of Bristlecone, Google’s New Quantum Processor (2018), https://ai.googleblog.com/2018/03/a-preview-of-bristlecone-googles-new.html. Accessed 30 March 2020
L.K. Grover, A fast quantum mechanical algorithm for database search, in Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing (1996), pp. 212–219
W. Hattori, S. Yamashita, Quantum circuit optimization by changing the gate order for 2d nearest neighbor architectures, in International Conference on Reversible Computation (Springer, 2018), pp. 228–243
IBM, IBM Announces Advances to IBM Quantum Systems and Ecosystem (2017), https://www-03.ibm.com/press/us/en/pressrelease/53374.wss. Accessed 30 March 2020
IBM, IBM Quantum Experience (2020), http://quantum-computing.ibm.com/. Accessed 30 March 2020
IBM Research, IBM 7 Qubit Device (2017), https://www.flickr.com/photos/ibm_research_zurich/37028171191. Accessed 30 March 2020
Intel, The Future of Quantum Computing is Counted in Qubits (2018), https://newsroom.intel.com/news/future-quantum-computing-counted-qubits/. Accessed 30 March 2020
T. Itoko, R. Raymond, T. Imamichi, A. Matsuo, A.W. Cross, Quantum circuit compilers using gate commutation rules, in Proceedings of the 24th Asia and South Pacific Design Automation Conference (2019), pp. 191–196
A. Kandala, A. Mezzacapo, K. Temme, M. Takita, M. Brink, J.M. Chow, J.M. Gambetta, Hardware efficient variational quantum eigensolver for small molecules and quantum magnets. Nature 549(7671), 242–246 (2017)
R.M. Karp, Reducibility among combinatorial problems, in Complexity of Computer Computations (Springer, 1972), pp. 85–103
E. Knill, D. Leibfried, R. Reichle, J. Britton, R.B. Blakestad, J.D. Jost, C. Langer, R. Ozeri, S. Seidelin, D.J. Wineland, Randomized benchmarking of quantum gates. Phys. Rev. A 77(1), 012307 (2008)
J. Koch, T.M. Yu, J. Gambetta, A.A. Houck, D.I. Schuster, J. Majer, A. Blais, M.H. Devoret, S.M. Girvin, R.J. Schoelkopf, Charge-insensitive qubit design derived from the Cooper pair box. Phys. Rev. A 76, 042319 (2007)
J. Li, M. Alam, A. Ash-Saki, S. Ghosh, Hierarchical improvement of quantum approximate optimization algorithm for object detection, in 2020 21st International Symposium on Quality Electronic Design (ISQED) (IEEE, 2020), pp. 335–340
G. Li, Y. Ding, Y. Xie, Tackling the qubit mapping problem for NISQ-era quantum devices, in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (2019), pp. 1001–1014
A. Lye, R. Wille, R. Drechsler, Determining the minimal number of swap gates for multi-dimensional nearest neighbor quantum circuits, in The 20th Asia and South Pacific Design Automation Conference (IEEE, 2015), pp. 178–183
D. Maslov, S.M. Falconer, M. Mosca, Quantum circuit placement. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 27(4), 752–763 (2008)
P. Murali, J.M. Baker, A. Javadi-Abhari, F.T. Chong, M. Martonosi, Noise-adaptive compiler mappings for noisy intermediate-scale quantum computers, in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (2019), pp. 1015–1029
P. Murali, D.C. McKay, M. Martonosi, A. Javadi-Abhari, Software mitigation of crosstalk on noisy intermediate-scale quantum computers (2020), arXiv preprint: arXiv:2001.02826
P. Murali, A. Javadi-Abhari, F.T. Chong, M. Martonosi, Formal constraint-based compilation for noisy intermediate-scale quantum systems. Microprocess. Microsyst. 66, 102–112 (2019)
Y. Nam, N.J. Ross, Y. Su, A.M. Childs, D. Maslov, Automated optimization of large quantum circuits with continuous parameters. NPJ Quantum Inf. 4(1), 1–12 (2018)
A. Oddi, R. Rasconi, Greedy randomized search for scalable compilation of quantum circuits, in International Conference on the Integration of Constraint Programming, Artificial Intelligence, and Operations Research (Springer, 2018), pp. 446–461
C. Paar, J. Pelzl, Understanding Cryptography: A Textbook for Students and Practitioners (Springer Science & Business Media, 2009)
A. Paler, On the influence of initial qubit placement during NISQ circuit compilation, in International Workshop on Quantum Technology and Optimization Problems (Springer, 2019), pp. 207–217
C.H. Papadimitriou, M. Yannakakis, Optimization, approximation, and complexity classes. J. Comput. Syst. Sci. 43(3), 425–440 (1991)
J. Preskill, Quantum computing in the NISQ era and beyond. Quantum 2, 79 (2018)
L. Qiu, M. Alam, A. Ash-Saki, S. Ghosh, Analyzing resilience of variational quantum factoring under realistic noise, in Annual GOMACTech Conference, San Diego, CA, USA (2020)
S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: towards real-time object detection with region proposal networks, in Advances in Neural Information Processing Systems (2015)
S. Rujikietgumjorn, R.T. Collins, Optimized pedestrian detection for multiple and occluded people, in 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2013, pp. 3690–3697
M. Schuld, A. Bocharov, K.M. Svore, N. Wiebe, Circuit-centric quantum classifiers. Phys. Rev. A 101(3), 032308 (2020)
A. Shafaei, M. Saeedi, M. Pedram, Qubit placement to minimize communication overhead in 2d quantum architectures, in 2014 19th Asia and South Pacific Design Automation Conference (ASP-DAC) (IEEE, 2014), pp. 495–500
N.A. Sherwani, Algorithms for VLSI Physical Design Automation (Springer Science & Business Media, 2012)
P.W. Shor, Scheme for reducing decoherence in quantum computer memory. Phys. Rev. A 52(4), R2493 (1995)
P.W. Shor, Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer. SIAM Rev. 41(2), 303–332 (1999)
M.Y. Siraichi, V.F. dos Santos, S. Collange, F.M.Q. Pereira, Qubit allocation, in Proceedings of the 2018 International Symposium on Code Generation and Optimization (2018), pp. 113–125
M.Y. Siraichi, V.F. dos Santos, C. Collange, F.M.Q. Pereira, Qubit allocation as a combination of subgraph isomorphism and token swapping. Proc. ACM Programm. Lang. 3(OOPSLA), 1–29 (2019)
M. Steffen, Superconducting Qubits are Getting Serious (2011), https://physics.aps.org/articles/pdf/10.1103/Physics.4.103. Accessed 30 March 2020
S.S. Tannu, M. Qureshi, Ensemble of diverse mappings: improving reliability of quantum computers by orchestrating dissimilar mistakes, in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (2019), pp. 253–265
S.S. Tannu, M.K. Qureshi, Not all qubits are created equal: a case for variability-aware policies for NISQ-era quantum computers, in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (2019), pp. 987–999
D. Venturelli, M. Do, E.G. Rieffel, J. Frank, Temporal planning for compilation of quantum approximate optimization circuits. IJCAI 4440–4446 (2017)
D. Venturelli, M. Do, E. Rieffel, J. Frank, Compiling quantum circuits to realistic hardware architectures using temporal planners. Quantum Sci. Technol. 3(2), 025004 (2018)
D. Wecker, M.B. Hastings, M. Troyer, Training a quantum optimizer. Phys. Rev. A 94(2), 022309 (2016)
R. Wille, L. Burgholzer, A. Zulehner, Mapping quantum circuits to IBM QX architectures using the minimal number of swap and h operations, in 2019 56th ACM/IEEE Design Automation Conference (DAC) (IEEE, 2019), pp. 1–6
R. Wille, A. Lye, R. Drechsler, Optimal swap gate insertion for nearest neighbor quantum circuits, in 2014 19th Asia and South Pacific Design Automation Conference (ASP-DAC) (IEEE, 2014), pp. 489–494
C. Xue, Z.-Y. Chen, Y.-C. Wu, G.-P. Guo, Effects of quantum noise on quantum approximate optimization algorithm (2019), arXiv preprint: arXiv:1909.02196
L. Zhou et al., Quantum approximate optimization algorithm: performance, mechanism, and implementation on near-term devices (2018), arXiv preprint: arXiv:1812.01041
A. Zulehner, A. Paler, R. Wille, An efficient methodology for mapping quantum circuits to the IBM QX architectures. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 38(7), 1226–1236 (2018)
A. Zulehner, H. Bauer, R. Wille, Evaluating the flexibility of A* for mapping quantum circuits, in Reversible Computation, ed. by M.K. Thomsen, M. Soeken (Springer International Publishing, Cham, 2019), pp. 171–190
System-Level Trends
Intelligent Edge Biomedical Sensors in the Internet of Things (IoT) Era
Elisabetta De Giovanni, Farnaz Forooghifar, Gregoire Surrel, Tomas Teijeiro, Miguel Peon, Amir Aminifar, and David Atienza Alonso
Abstract The design of reliable wearable systems for real-time and long-term monitoring presents major challenges, although they are poised as the next frontier of innovation in the context of the Internet of Things (IoT) to provide personalized healthcare. This new generation of biomedical sensors aims to be interconnected in ways that improve our lives and transform the medical industry. Therefore, they offer an excellent opportunity to integrate the next generation of artificial intelligence (AI) based techniques in medical devices. However, several key challenges remain in achieving this potential due to the inherent resource-constrained nature of wearable systems for Big Data medical applications, which need to detect pathologies in real time. Concretely, in this chapter, we discuss the opportunities for edge computing and edge AI in next-generation intelligent biomedical sensors in the IoT era and the key challenges in wearable systems design for pathology detection and health/activity monitoring in the context of IoT technologies. First, we introduce the notion of self-awareness toward the conception of the next-generation intelligent edge biomedical sensors to trade off machine-learning performance versus system lifetime, according to the application requirements of the medical monitoring systems. Subsequently, we present the implications of personalization and multi-parametric sensing in the context of the system-level architecture of intelligent edge biomedical sensors. Thus, they can adapt to the real world, as living organisms do, to operate efficiently according to the target application requirements and the energy available at any moment in time. Then, we discuss the impacts of self-awareness and low-power requirements at the circuit level for sampling, through a paradigm shift to react to the input signal itself. Finally, we conclude by highlighting that the techniques discussed in this chapter may be applied jointly to design the next-generation intelligent biomedical sensors and systems in the IoT era.

E. De Giovanni · F. Forooghifar · G. Surrel · T. Teijeiro · M. Peon · A. Aminifar
Embedded Systems Laboratory (ESL), Faculty of Engineering (STI), Ecole Polytechnique Fédérale de Lausanne (EPFL), EPFL-STI-IEM-ESL, Station 11, 1015 Lausanne, Switzerland
e-mail: [email protected]
F. Forooghifar e-mail: [email protected]
G. Surrel e-mail: [email protected]
T. Teijeiro e-mail: [email protected]
M. Peon e-mail: [email protected]
A. Aminifar
Lund University, Lund, Sweden
D. Atienza Alonso (B)
Professor of EE and Head of Embedded Systems Laboratory (ESL), Ecole Polytechnique Fédérale de Lausanne (EPFL), EPFL-STI-IEM-ESL, Station 11, 1015 Lausanne, Switzerland
e-mail: [email protected]
© The Author(s) 2023
M. M. S. Aly and A. Chattopadhyay (eds.), Emerging Computing: From Devices to Systems, Computer Architecture and Design Methodologies, https://doi.org/10.1007/978-981-16-7487-7_13
1 Introduction to Self-Aware and Adaptive Internet of Things
Remote health monitoring has attracted a lot of attention over the past decades because it offers the opportunity for early detection and prediction of pathological health conditions. This early detection and prediction not only improves the quality of life of patients but also significantly reduces the load on their family members. Moreover, it reduces the socioeconomic burden caused by patients' inability to work owing to their health conditions. Wearable and Internet of Things (IoT) technologies offer a promising solution for pervasive health monitoring by relaxing the constraints with respect to time and place.
Today, wearable systems face fundamental barriers in terms of battery lifetime and Quality of Service (QoS). Indeed, the main challenge in wearable systems is increasing their battery lifetime while maintaining the reliability of the system. A recently proposed concept for overcoming this challenge is self-awareness, which equips the system with two key capabilities: learning and reasoning. In the learning phase, the system gains knowledge about itself and its environment, and in the reasoning phase, this information is used to make a decision and act in a way that fulfills the pre-defined goals of the system (Lewis et al. 2016). Thus, the main goal of self-awareness is to give the system the ability to monitor its own performance with respect to changes in itself and in its environment, adapt to these changes and, as a result, improve autonomously (Lewis et al. 2011). These three steps, which are repeated in a loop as shown in Fig. 1, keep the system aware of changes in its situation and help it move continuously toward its goals.
Fig. 1 Observe-Analyze-Act loop in self-aware systems. The system (1) observes the information from itself and the environment, (2) analyzes the information and the new situation, and (3) takes action to adapt to the new situation and move toward its goals
Self-awareness can be applied to many different categories of applications in systems equipped with control mechanisms and units, and it has three major properties: self-reflection, self-prediction and self-adaptation (Jantsch et al. 2017). According to these properties, each self-aware system is aware of its architecture, execution environment and operational goals, and of their corresponding dynamic changes during operation. It is also able to predict the effect of dynamic changes and proactively adapt itself as the environment evolves, in order to ensure that quality-of-service or energy requirements are satisfied.
The notion of self-awareness can be adopted in various design aspects of IoT systems, in spite of highly dynamic changes in the environment. In the system-on-chip domain, self-awareness helps to deal better with the complexity coming from the system itself, from the environment, and from the exceedingly diverse goals and objectives (Hoffmann et al. 2012; Jantsch et al. 2017). In the self-triggered control domain, the controllers execute only when the expected performance is about to be violated (Filieri et al. 2014; Aminifar et al. 2016; Andrén et al. 2017). Finally, in the remote health monitoring domain, where the quality of the results is significantly affected by different conditions of the patient (such as age and medical history) as well as of the environment (such as temperature and …), self-awareness adapts the system to the situation to guarantee the quality of results (Anzanpour et al. 2017; Masinelli et al. 2020).
The target applications in this chapter are wearable monitoring systems used for pathology detection and for health and activity monitoring. We will discuss the self-aware techniques that can be applied in different design domains of such systems with the goal of improving the performance and/or increasing the battery lifetime of wearable monitoring systems. We will start with the application-level aspects of the system, covering machine learning and artificial intelligence. Afterwards, we move to the circuit architecture, discussing self-aware signal acquisition and sampling techniques. Finally, we cover the self-aware system architecture and platform, taking into account personalization and multi-sensor systems.
• The overview of a typical wearable system used for health monitoring is shown in Fig. 2. The main phases of these systems are the acquisition and preprocessing of the bio-signals, the extraction of the corresponding features, and the machine learning module, which is developed in the train phase using data previously gathered from the patients and is used in the test phase to detect the pathology in newly acquired patient data. At the application level, which corresponds to the learning part of the monitoring system, we will see the contribution of self-awareness in distributing the monitoring over different levels of learning models with different performance and energy consumption profiles. We will also see the efficient distribution of monitoring over higher computational layers, i.e., the fog and cloud layers, using the concept of self-awareness.
• One of the main goals of self-awareness is energy efficiency, in order to extend the battery lifetime of wearable systems. Between hardware-level and application-level optimizations, power management at the platform layer is one of the means to achieve this goal. In a single-core platform, this translates into alternating between low-energy modes and the active mode depending on the application duty cycle. When parallelization is required, a platform with multiple cores can handle different clock-gating and power-gating states for the different blocks of cores. We will present two examples, a single-core and a multi-core platform, to illustrate power management and when it is worthwhile to choose one or the other given the application duty cycle.
• Self-awareness of a system can also be applied at the level of signal acquisition. By changing the paradigm used for sampling, a self-aware sampling strategy reacts to the input signal itself, changing its sampling constraints accordingly. It thus becomes possible to lower the energy spent on data acquisition without lowering the performance of the target application. We will see how bio-signals can benefit from this novel approach to analog-to-digital conversion.
Fig. 2 Overview of a typical wearable system for monitoring pathological health conditions. The system (1) acquires the input bio-signal, (2) preprocesses the data by applying some filters, (3) extracts the features corresponding to the type of bio-signal and pathology, (4) develops a model in the train phase and uses it to predict the pathology in the test phase (Forooghifar et al. 2018)
2 Self-Aware Machine Learning and Artificial Intelligence
In this section, we discuss how to embed the notion of self-awareness in machine learning and artificial intelligence algorithms, either to improve the performance of the target systems or to reduce their energy consumption. We first describe the self-aware learning algorithm in general terms. We then analyze this method in an epileptic seizure detection system (Forooghifar et al. 2019) as a real case study, to give a more detailed overview of the technique. We also present some simple results for this specific application to show the positive effect of self-awareness on the system.
Intelligent Edge Biomedical Sensors in the Internet of Things (IoT) Era
To exploit the notion of self-awareness in systems that implement machine learning methods, we take advantage of the fact that most inputs can be classified reliably by a simple classifier, without the need for a complex model. Thus, we can define several levels of our machine learning algorithm with different complexities, and use self-awareness to switch to the complex classifier only when it is necessary to obtain a reliable result. We motivate this approach with a small example and then discuss more elaborate variations of the same technique based on distributed machine learning. With these variations, higher-level infrastructures such as the fog and the cloud assist the wearable system by performing the complex classifications.
2.1 Motivational Example
To illustrate the main idea of our approach, we start with a small example. Without loss of generality, and for simplicity of presentation, we consider two sets of features, calculated in the feature extraction phase of Fig. 2, for a binary classification problem; they are shown in Fig. 3. We assume that the computational complexity of the feature set along the vertical axis (feature set 2) is higher than that of the feature set along the horizontal axis (feature set 1). For instance, feature set 1 may contain time-domain features of the dataset, whereas feature set 2 may contain frequency-domain features. Time-domain features have a complexity of $O(n)$, where $n$ is the signal's length, whereas frequency-domain features have a complexity of $O(n \log_2 n)$, as their calculation requires additional signal transformations, such as the Fourier transform. In this example, we consider 25 dot-shaped samples of class 1 and another 25 cross-shaped samples of class 2. For instance, in the case of a pathology, dot-shaped samples belong to people suffering from the pathology, whereas cross-shaped ones belong to healthy subjects. Let us suppose that $n = 1024$. Depending on the required confidence level, we can build three different classifiers.
Fig. 3 Motivational example on self-awareness concept (horizontal axis: feature set 1; vertical axis: feature set 2)
The first classifier is shown by the dashed line in Fig. 3. This classifier uses only feature set 1 to separate the two classes. As can be observed in Fig. 3, if we use this classifier, some samples within the shaded gray area will be misclassified. The accuracy and the expected computational complexity of this classifier are $\text{Accuracy}_{\text{dashed}} = 88\%$ and $\text{Complexity}_{\text{dashed}} = n = 1024$. Therefore, the expected computational complexity of this classifier is low, but its classification accuracy is also below that of the optimal solution. An alternative is to combine both feature sets. This is done by a second classifier, shown by the solid line in Fig. 3. The accuracy and the expected computational complexity of this classifier are $\text{Accuracy}_{\text{solid}} = 100\%$ and $\text{Complexity}_{\text{solid}} = n \log_2 n = 10240$, respectively. This classifier outperforms the first one in terms of classification accuracy, but its computational complexity is ten times higher. Finally, the third classifier combines the two previous ones. To obtain the best trade-off between accuracy and complexity, depending on the confidence level that we want to have, we can use either the classifier based on feature set 1 (the dashed line) or the full one, which uses both sets of features (the solid line). The main goal of this scheme is to reduce the classifier complexity, in terms of the number of features used for the final classification, while maintaining a high classification accuracy. As shown in Fig. 3, the first classifier cannot make confident decisions for samples that happen to lie in the shaded gray area. For these samples, the second classifier, i.e., the classifier that uses all the available features, should be used if we target a medical application that truly requires a high confidence level.
Hence, once we identify the region in which the first classifier does not provide high-confidence results, the next step is to check, for each test sample, whether it falls into this shaded area and, if so, to use the second classifier. Otherwise, we use the first classifier, i.e., the classifier with the reduced number of features. For this particular example, let us suppose that we have found the region in which the first classifier does not provide high-confidence results, as shown in Fig. 3. If we use the two-level classifier scheme that we propose in this work, the classification accuracy is $\text{Accuracy}_{\text{two-level}} = 100\%$, whereas the expected classification complexity is calculated as:
$$E(C) = \frac{30}{50} \cdot n + \frac{20}{50} \cdot n \cdot \log_2 n,$$
where the two fractions are the shares of samples handled by the first and by the second classifier, respectively. For $n = 1024$, the expected classification complexity is $E_{\text{our approach}}(C) = 4710.4$, whereas with the classical approach that uses all available features (shown by the solid line in Fig. 3) we obtain $E_{\text{classical}}(C) = \frac{50}{50} \cdot n \cdot \log_2 n = 10240$. Hence, for this motivational example and $n = 2^{10}$, our approach reduces the classification complexity by a factor of roughly 2.
In summary, for our motivational example shown in Fig. 3, we have presented three different classifiers. The first classifier (the dashed line) has a low computational complexity, but its performance is much lower than that of the second one (the solid line). On the other hand, the second classifier has a high classification accuracy, but its computational complexity is much higher than that of the first-level one. Our proposed (third) classifier combines the two previous classifiers to achieve a high classification accuracy and a low computational complexity.
[Figure 4: flow diagram. Pre-processed data, feature extraction, confidence model, decision "Confident?": Yes leads to the simple model, No leads to the complex model.]
Fig. 4 Two-level classification in wearable monitoring systems using self-awareness and concept of confidence (Forooghifar et al. 2019)
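The arithmetic of the motivational example can be reproduced with a few lines of Python. This is only a sketch: the function name `expected_complexity` is ours, and we assume that 30 of the 50 samples are resolved by the simple classifier and 20 are escalated to the full one, which is the split that yields the value 4710.4 quoted in the text.

```python
import math


def expected_complexity(n, n_simple, n_complex):
    """Expected per-sample cost of the two-level scheme.

    n         -- signal length
    n_simple  -- samples resolved by the simple classifier, cost n
    n_complex -- samples escalated to the full classifier, cost n * log2(n)
    """
    total = n_simple + n_complex
    cost_simple = n
    cost_complex = n * math.log2(n)
    return (n_simple / total) * cost_simple + (n_complex / total) * cost_complex


n = 1024
# Assumed split: 30 samples outside the shaded area, 20 inside it.
two_level = expected_complexity(n, n_simple=30, n_complex=20)
# Classical approach: every sample uses all features.
classical = expected_complexity(n, n_simple=0, n_complex=50)
reduction = classical / two_level
print(two_level, classical, reduction)
```

The reduction factor grows with the share of samples that the simple classifier can handle on its own, so it depends on the application and on how the shaded region is chosen.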
2.2 Centralized Two-Level Learning
Embedding the two-level classification technique in the wearable monitoring system results in the system shown in Fig. 4. To exploit the notion of self-awareness, we define the concept of ‘confidence’ for the simple model. The model is considered confident if it can make a reliable decision about its current input. If the model predicts the result correctly, its confidence is ‘1’; otherwise it is ‘0’. Based on this labeling, a separate model is developed in the training phase to predict the confidence of the simple model. In the test phase, if the confidence of the simple model is predicted as ‘1’, the system uses this model to decrease its complexity and energy consumption while maintaining the performance. Otherwise, if the simple model is not confident, the system switches to the complex model and trades energy for performance. In our epileptic seizure detection case study, we define a simple model, which is trained with a small number of simple features, and a complex model, which uses more complex features. In fact, the entire set of features is used for seizure detection only when a confident classification based on the set of simple features is not possible. Then, we take advantage of the multi-mode execution possibilities of the platform, in a self-aware fashion, so that the energy consumption is reduced while the detection performance remains at an acceptable level for medical use. If we denote the energy consumption of the confidence calculation, the simple classification, and the complex classification as $E_C$, $E_1$ and $E_2$, respectively, the total energy consumed by our self-aware classification technique is:
$$E_{\text{execution}} = E_C + P_1 \cdot E_1 + (1 - P_1) \cdot E_2, \qquad (1)$$
[Figure 5: the three layers of the IoT infrastructure: monitoring systems, fog, and cloud.]
Fig. 5 Distribution of health monitoring over the Internet of Things (IoT) infrastructure, including monitoring systems, fog and cloud (Forooghifar et al. 2019)
where $P_1$ is the probability of invoking the simple model (Forooghifar et al. 2019). As a result, the higher the fraction of inputs handled by the simple classifier, which mainly depends on the application, the lower the energy consumption of the system.
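The control flow of Fig. 4 and the energy model of Eq. (1) can be sketched as follows. This is an illustrative sketch, not the implementation of Forooghifar et al. (2019): the function names, the toy models and all numeric values are hypothetical stand-ins chosen only to make the example runnable.

```python
from typing import Callable, List, Sequence, Tuple


def two_level_predict(
    simple_features: Sequence[float],
    extract_complex_features: Callable[[], Sequence[float]],
    confidence_model: Callable[[Sequence[float]], int],
    simple_model: Callable[[Sequence[float]], int],
    complex_model: Callable[[Sequence[float]], int],
) -> Tuple[int, str]:
    """Return (label, model_used) following the flow of Fig. 4."""
    if confidence_model(simple_features) == 1:
        return simple_model(simple_features), "simple"
    # The costly features are extracted only when the simple model is not confident.
    all_features: List[float] = list(simple_features) + list(extract_complex_features())
    return complex_model(all_features), "complex"


def expected_energy(e_conf: float, e_simple: float, e_complex: float, p_simple: float) -> float:
    """Eq. (1): E_C + P1 * E1 + (1 - P1) * E2."""
    if not 0.0 <= p_simple <= 1.0:
        raise ValueError("p_simple must be a probability")
    return e_conf + p_simple * e_simple + (1.0 - p_simple) * e_complex


# Hypothetical toy models, for illustration only.
def toy_confidence(f):
    return 1 if abs(f[0]) > 0.5 else 0


def toy_simple(f):
    return int(f[0] > 0)


def toy_complex(f):
    return int(sum(f) > 0)


easy = two_level_predict([0.9], lambda: [-2.0], toy_confidence, toy_simple, toy_complex)
hard = two_level_predict([0.1], lambda: [-2.0], toy_confidence, toy_simple, toy_complex)
energy = expected_energy(e_conf=0.2, e_simple=1.0, e_complex=10.0, p_simple=0.8)
print(easy, hard, energy)
```

Passing the complex feature extraction as a callable keeps its cost out of the common path, which is where the energy saving of the scheme comes from.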
2.3 Decentralized Multi-Level Learning
Due to the limited computational resources of wearable devices, migrating complex and energy-hungry tasks to higher-level infrastructures that can provide more computational resources is crucial (Forooghifar et al. 2019). Different computation infrastructures, including the fog (personal devices such as cellphones and smart watches) and the cloud, are available for interaction with wearable devices, as shown in Fig. 5. Deciding whether to communicate with higher layers depends on the trade-off between communication and computation costs, with the aim of reducing the overall energy consumption of wearable devices and improving their battery lifetime. When tasks are distributed over higher computation layers via communication, self-awareness can provide the information needed to determine whether this communication contributes to reducing the total energy. We consider the same two-level classification technique, where the complex model is implemented on the fog or cloud. Whenever the simple model is confident, we execute the classification on the device. Otherwise, based on the communication cost, we choose whether to perform the classification on the wearable device or to distribute it to the fog/cloud. In the epileptic seizure detection case study, the simple feature set is used in the wearable device and the complex features are implemented on the fog/cloud. In the formulation of the latency of this system, the first two terms are the latency of calculating the confidence ($L_C$) and of using the simple classifier ($P_1 \cdot L_1$), respectively. In addition, the latency of task distribution to the fog/cloud consists of two elements: the latency of communication with the fog/cloud ($L_{1 \to 2}$) and the latency of classification, which is the latency of the complex classification ($L_2$) multiplied by the speed-up factor of the fog/cloud ($\gamma_2$). As a result, the execution latency of this system is:
$$L_{\text{execution}} = L_C + P_1 \cdot L_1 + (1 - P_1) \cdot (L_{1 \to 2} + \gamma_2 \cdot L_2). \qquad (2)$$
Table 1 Summarizing the trade-offs between centralized and decentralized systems with and without applying self-aware technique (Forooghifar et al. 2019)

| Scenario | Performance (%) | Latency (ms) | Energy ×10⁻⁷ |
|---|---|---|---|
| Complex classifier | 82.53 | 3270.80 | 13.11 |
| Simple classifier | 75.16 | 554.00 | 1.18 |
| Self-aware classifier | 80.83 | 1287.54 | 4.41 |
| E → F | 80.83 | 1420.24 | 3.65 |
| E → C | 80.83 | 1.06 × 10⁹ | 2369.84 |
We estimate the energy consumption of the wearable system as:
$$E_{\text{execution}} = E_C + P_1 \cdot E_1 + (1 - P_1) \cdot E_{1 \to 2}, \qquad (3)$$
where $E_C$, $E_1$, and $E_{1 \to 2}$ are the energy consumption of the confidence calculation, the simple classification, and the communication with the fog/cloud, respectively.
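Comparing Eq. (1) with Eq. (3) shows that, from the point of view of the wearable device, offloading pays off exactly when the communication energy $E_{1 \to 2}$ is lower than the energy of the local complex classification $E_2$, since the two expressions differ by $(1 - P_1)(E_{1 \to 2} - E_2)$. The following sketch evaluates Eqs. (1) to (3) side by side; the function names and all numeric values are hypothetical and serve only as an illustration.

```python
def local_energy(e_conf, e_simple, e_complex, p_simple):
    """Eq. (1): everything runs on the wearable device."""
    return e_conf + p_simple * e_simple + (1 - p_simple) * e_complex


def offload_energy(e_conf, e_simple, e_comm, p_simple):
    """Eq. (3): the complex classification is replaced by a transmission."""
    return e_conf + p_simple * e_simple + (1 - p_simple) * e_comm


def offload_latency(l_conf, l_simple, l_comm, l_complex, gamma, p_simple):
    """Eq. (2): L_C + P1 * L1 + (1 - P1) * (L_1to2 + gamma2 * L2)."""
    return l_conf + p_simple * l_simple + (1 - p_simple) * (l_comm + gamma * l_complex)


def should_offload(e_comm, e_complex):
    """Offloading saves wearable energy iff E_1to2 < E_2 (difference of Eqs. 1 and 3)."""
    return e_comm < e_complex


# Hypothetical values in arbitrary units.
e_local = local_energy(e_conf=0.5, e_simple=1.0, e_complex=12.0, p_simple=0.75)
e_fog = offload_energy(e_conf=0.5, e_simple=1.0, e_comm=9.0, p_simple=0.75)
l_fog = offload_latency(l_conf=10, l_simple=500, l_comm=200, l_complex=3000,
                        gamma=0.5, p_simple=0.75)
print(e_local, e_fog, l_fog, should_offload(e_comm=9.0, e_complex=12.0))
```

In this formulation a value of `gamma` below 1 means that the remote layer executes the complex classification faster than the wearable device would.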
2.4 Case Study: Epileptic Seizure Detection
In this part, we present some simple experimental results of applying self-awareness to the epileptic seizure detection system. Table 1 compares the performance, latency, and energy consumption of the centralized and decentralized epileptic seizure detection systems. We observe that the self-aware classifier improves the detection performance by 5.67% compared to the simple classifier, and is only 1.7% below the quality of the complex classification. At the same time, according to Table 1, the latency of the self-aware classifier is only 39.4% of that of the complex classifier, and its energy consumption is about 33.6%. According to this table, among the self-aware solutions, the most energy-efficient choice is to offload the computationally complex tasks to the fog. This solution requires the lowest energy among them (3.65 mJ), and the latency overhead is only approximately 10.4% of the entire end-to-end latency. In our case study, communication with the cloud engines is only used to notify the hospital in case of an emergency, due to the limited bandwidth and the major energy overhead of transmission via this protocol.
In conclusion, in this section we have shown how to introduce the notion of self-awareness into the machine learning module of wearable systems as a novel approach to reduce their energy consumption while guaranteeing the quality of service of the system. We considered an epileptic seizure detection system as our real-life case study and validated our approach. Overall, using different levels of classification based on the demands of the system and the application is a promising self-aware technique to reach the system's goals at the application level.
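The relative figures discussed above can be recomputed directly from the rows of Table 1. The sketch below does so; the dictionary layout and key names are ours, and the values are copied from the table as printed, in the units given there.

```python
# Rows of Table 1: performance (%), latency (ms), energy (table units).
table1 = {
    "complex":       {"performance": 82.53, "latency_ms": 3270.80, "energy": 13.11},
    "simple":        {"performance": 75.16, "latency_ms": 554.00,  "energy": 1.18},
    "self_aware":    {"performance": 80.83, "latency_ms": 1287.54, "energy": 4.41},
    "edge_to_fog":   {"performance": 80.83, "latency_ms": 1420.24, "energy": 3.65},
    "edge_to_cloud": {"performance": 80.83, "latency_ms": 1.06e9,  "energy": 2369.84},
}

sa, simple, cplx, fog = (table1[k] for k in ("self_aware", "simple", "complex", "edge_to_fog"))

# Performance gain over the simple classifier and loss versus the complex one.
gain_vs_simple = sa["performance"] - simple["performance"]
loss_vs_complex = cplx["performance"] - sa["performance"]

# Latency and energy of the self-aware classifier relative to the complex one.
latency_ratio = sa["latency_ms"] / cplx["latency_ms"]
energy_ratio = sa["energy"] / cplx["energy"]

# Extra latency of offloading to the fog, relative to on-device self-aware execution.
fog_latency_overhead = (fog["latency_ms"] - sa["latency_ms"]) / sa["latency_ms"]

print(f"{gain_vs_simple:.2f} {loss_vs_complex:.2f} "
      f"{latency_ratio:.3f} {energy_ratio:.3f} {fog_latency_overhead:.3f}")
```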
3 Self-Aware System Architecture and Platform
Energy efficiency is an important factor to take into account in any wearable sensor design to ensure remote long-term health monitoring (Sinha et al. 2000). To achieve accurate inference with minimal power consumption, wearable sensor nodes (WSNs) have evolved (Guk et al. 2019) from single-core systems (Rincon et al. 2012; Surrel et al. 2018) into ultra-low power (ULP) (Konijnenburg et al. 2019) and multi-core parallel computing platforms (Conti et al. 2016; Duch et al. 2017; Pullini et al. 2019). These modern ULP platforms combine several techniques to achieve high computing performance when required, while reducing their overall energy consumption. Moreover, they can leverage information about their working scenario, which varies for each concrete patient and situation, to adjust their use of internal resources to fulfill the required tasks with the minimum use of energy. In this section, we describe two different paradigms to reduce energy consumption: platform-aware design and the emerging patient-aware design.
3.1 Platform-Aware Application Design
When designing a biomedical application for remote health monitoring we find two broad groups of platforms: single- and multi-core. Whereas single-core platforms were traditionally simpler and cheaper, modern multi-core platforms can improve the execution of inherently parallel computations, such as multi-lead electrocardiogram (ECG) signal processing, by distributing the work among several cores. This distribution of tasks allows the system to meet all the deadlines while operating at a lower clock frequency and correspondingly lower voltage level, which results in significant energy savings. In the following paragraphs we describe the most relevant aspects of WSN platforms for energy efficiency.
3.1.1 Sleeping Modes
Embedded platforms can typically clock-gate their elements to reduce dynamic power. Clock-gating is a very effective technique that allows fine-grained control over individual elements, frequently allowing them to resume their work with a delay of a single cycle. Unfortunately, clock-gating cannot reduce leakage current, whose impact is becoming more pronounced as technology evolves towards smaller transistors. In contrast, power-gating suppresses leakage current, but the time required to reactivate an element (particularly if clock generators are stopped) limits the minimum duration of the sleeping periods. Moreover, whereas clock-gating can be implemented with a small area overhead through specific gates, power-gating often requires careful design of local power supply networks with significant area overhead. Therefore, modern ULP platforms offer several sleeping modes, from fine-grained clock-gating to power-gating of large blocks. A careful match between resource activation and computational requirements allows the platform to operate at an optimal energy point.
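The trade-off between the two gating techniques can be made concrete with a simple break-even model. The sketch below assumes that power-gating carries a fixed transition energy (powering down and waking up again) and a wake-up delay, whereas clock-gating has neither; the function names and all numeric values are hypothetical.

```python
def break_even_idle_time(p_clock_gated_w, p_power_gated_w, e_transition_j):
    """Idle time (s) above which power-gating uses less energy than clock-gating.

    Clock-gating costs p_clock_gated_w * t.
    Power-gating costs p_power_gated_w * t + e_transition_j.
    """
    saving_per_second = p_clock_gated_w - p_power_gated_w
    if saving_per_second <= 0:
        return float("inf")  # power-gating never pays off
    return e_transition_j / saving_per_second


def choose_sleep_mode(idle_s, wakeup_s, p_clock_gated_w, p_power_gated_w, e_transition_j):
    """Pick the gating technique for an idle period of known length."""
    if wakeup_s >= idle_s:
        return "clock-gating"  # the block could not wake up in time
    t_star = break_even_idle_time(p_clock_gated_w, p_power_gated_w, e_transition_j)
    return "power-gating" if idle_s > t_star else "clock-gating"


# Hypothetical block: 100 uW when clock-gated, 4 uW when power-gated,
# 48 uJ per power-down/wake-up cycle, 1 ms wake-up time.
t_star = break_even_idle_time(100e-6, 4e-6, 48e-6)
short_idle = choose_sleep_mode(0.1, 1e-3, 100e-6, 4e-6, 48e-6)
long_idle = choose_sleep_mode(2.0, 1e-3, 100e-6, 4e-6, 48e-6)
print(t_star, short_idle, long_idle)
```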
3.1.2 Multi-banked Memories
Dividing the platform memories into smaller banks that can be independently powered on or off, or placed into retention mode, enables fine-grained control over their energy use. For example, applications that acquire and process input signals in “windows” can control which bank is active because it is currently being used by the direct memory access (DMA) module, which banks must retain their contents until the next processing interval, and which ones can remain off because they do not yet contain any data. Another important characteristic is the existence of a memory hierarchy. Since smaller banks generally have a lower energy cost per access, placing the most frequently accessed data in them can reduce the total energy consumption.
3.1.3 DMA Modules
Direct memory access (DMA) modules are crucial to achieve low-power operation in applications that capture windows of data before processing, as the cores, which consume more energy, can be kept deactivated until enough samples have been acquired. Advanced DMA modules are also used to transfer data between different levels of a memory hierarchy, for instance to implement double buffering.
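The idea of waking the cores only when a full window is available can be illustrated with a software model of a double (ping-pong) buffer. This is a behavioural sketch only: in a real platform the writes are performed by the DMA hardware, and the class and method names used here are hypothetical.

```python
class DoubleBuffer:
    """Ping-pong buffer: one half is filled while the other one is processed."""

    def __init__(self, window_size):
        if window_size <= 0:
            raise ValueError("window_size must be positive")
        self.window_size = window_size
        self._halves = [[], []]
        self._fill = 0  # index of the half currently written by the (simulated) DMA

    def dma_write(self, sample):
        """Store one sample; return a full window when the halves swap, else None."""
        half = self._halves[self._fill]
        half.append(sample)
        if len(half) < self.window_size:
            return None  # the core can stay asleep
        # Swap halves: acquisition continues in the other half while this one is processed.
        self._fill ^= 1
        self._halves[self._fill] = []
        return half


db = DoubleBuffer(window_size=4)
windows = []
for sample in range(10):
    full = db.dma_write(sample)
    if full is not None:
        windows.append(full)  # the only moments at which a core must wake up
print(windows)
```

With ten samples and a window of four, the core is woken twice, and the two remaining samples stay in the half that is still being filled.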
3.1.4 Efficient Hardware-Based Synchronization
Efficient parallelization in multi-core platforms requires hardware-based mechanisms that enable single-cycle synchronization and fine-grained clock-gating of the cores that are waiting for an event (Duch et al. 2017; Flamand et al. 2018).
3.1.5 Example Platforms
In this section, we study one representative platform from each of the two categories.
Single-core platform: An instance of the previous generation of single-core low-power platforms is the EFM32LG-STK3600, which contains an EFM32™ Leopard Gecko 32-bit MCU (Silicon Labs 2017): a 48 MHz ARM Cortex-M3 processor with a 3 V supply, 256 KB of flash memory and 32 KB of RAM. The platform can be paired with the corresponding Simplicity Studio software energy profiler as a tool for energy consumption analysis. Power management is implemented through five working modes, controlled by the energy management unit (EMU): an active mode (EM0) and four low-energy modes (EM1–EM4), in descending order of energy consumption and increasing wake-up time (Table 2). The results of the analysis on this platform can be used as a basis for ECG-based devices like the SmartCardia INYU (Surrel et al. 2018) or electroencephalogram monitoring devices such as the e-Glass (Sopic et al. 2018), which includes a microcontroller unit (MCU) of the same family as the EFM32.

Table 2 Current consumption in different power modes of the EFM32LG platform (summarized from Silicon Labs (2017))

| Mode | Current (µA MHz⁻¹) | Current (total) | Wake-up (µs) | Notes |
|---|---|---|---|---|
| EM0 | 211 | 10 mA | – | Fully active at 48 MHz |
| EM1 | 63 | 3 mA | 0 | CPU sleeping, DMA available |
| EM2 | – | 0.95 µA | 2 | Deep sleep, RTC on, CPU and RAM retention, I/O available |
| EM3 | – | 0.65 µA | 2 | Stop, CPU & RAM retention, no I/O peripherals |
| EM4 | – | 20 nA | 160 | Shutoff |

Multi-core platform: GAP8 (Flamand et al. 2018) is a commercial RISC-V implementation based on the PULP project (Conti et al. 2016) and built in a 55 nm technology. It consists of a main processor, termed “fabric controller” (FC), which performs short tasks and manages the complete platform, and a cluster of eight additional cores. The cluster cores are activated only during compute-intensive phases; they can cooperate on a single task or work independently. A hardware event unit implements single-cycle synchronization primitives and clock-gating of waiting cores. The memory is divided into two blocks: the first one, L2 (512 KB), is connected to the FC, whereas the second one, L1 (64 KB), provides energy-efficient data access to the cluster cores. Data transfers between both levels are performed by a dedicated DMA module. A second DMA module is in charge of data acquisition without intervention of the FC. Power numbers for GAP8 at different operating frequencies and voltage points are reported in Flamand et al. (2018). GAP8 implements full retention of the 512 KB L2 memory at only 32 µW (8 µW per 128 KB bank). Furthermore, advanced power management reduces power down to 3.6 µW in deep sleep mode. A typical processing cycle in GAP8 starts with the DMA receiving data from external sources while the FC is clock-gated and the cluster is completely power-gated. Once enough data are received, the FC activates the cluster cores and programs the DMA to transfer data in and out of the L1 memory. While the cluster cores are processing, the FC can be clock-gated to conserve energy. Once the heaviest parts of the computation are completed, the FC can power down the cluster (and its L1 memory).
[Figure 6: current I (mA) versus time over one period of the application (10% active, 90% sleep). Panel (a): active and sleep phases of the single-core platform. Panel (b): cluster-active, FC-active and sleep phases of the multi-core platform.]
Fig. 6 Opportunities for power management in single- and multi-core platforms for an application with a duty cycle of 10%. (a) Single-core platform with clock-gating. (b) Multi-core PULP platform. Main processor (FC) with clock-gating and parallel cluster (CL) with power-gating.
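Returning to the energy modes of Table 2, choosing a low-energy mode for an idle period amounts to picking the mode with the lowest current among those whose wake-up time and retained functionality are acceptable. The sketch below encodes the table for that purpose. The currents and wake-up times come from Table 2; the two boolean flags are our reading of its Notes column (in particular, that I/O remains usable in EM1 and that EM4 retains nothing), and the function name is hypothetical.

```python
from typing import NamedTuple, Optional


class EnergyMode(NamedTuple):
    name: str
    current_a: float      # total current from Table 2
    wakeup_us: float      # wake-up time from Table 2
    io_available: bool    # assumption derived from the Notes column
    ram_retention: bool   # assumption derived from the Notes column


LOW_ENERGY_MODES = [
    EnergyMode("EM1", 3e-3, 0, True, True),
    EnergyMode("EM2", 0.95e-6, 2, True, True),
    EnergyMode("EM3", 0.65e-6, 2, False, True),
    EnergyMode("EM4", 20e-9, 160, False, False),
]


def deepest_mode(max_wakeup_us: float, need_io: bool, need_retention: bool) -> Optional[EnergyMode]:
    """Return the lowest-current mode that satisfies the constraints, or None."""
    candidates = [
        m for m in LOW_ENERGY_MODES
        if m.wakeup_us <= max_wakeup_us
        and (m.io_available or not need_io)
        and (m.ram_retention or not need_retention)
    ]
    return min(candidates, key=lambda m: m.current_a) if candidates else None


print(deepest_mode(1, True, True).name)      # tight wake-up budget
print(deepest_mode(10, True, True).name)     # I/O needed while sleeping
print(deepest_mode(10, False, True).name)    # only retention needed
print(deepest_mode(200, False, False).name)  # nothing needs to be kept
```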
3.1.6 Power Management
To achieve continuous monitoring, different power management techniques that exploit the characteristics of biomedical applications are available to conserve energy. Figure 6 shows the periodic life cycle of a typical biomedical application, both on a single-core (Fig. 6a) and on a multi-core (Fig. 6b) platform. These applications are typically pseudo-periodic, with a duty cycle (DC) defined as the ratio of the computation time to the length of a period that includes sample acquisition and computation (10% in the figure). These variations of the computational demand over time create opportunities to conserve energy through appropriate power state management. Using the values reported in Table 2 for a single-core platform like the EFM32, Fig. 7 shows the impact of three possible power management strategies on the average current drawn by the platform, which relates directly to average power and to energy consumption over time. Taking advantage of the shallowest sleeping mode (EM1), which guarantees the fastest reaction time to external events, a decrease in current of 63% can be achieved. If the system can afford longer reaction times, and has external buffering (i.e., in the sampling or communication devices themselves), EM2 can be used, reducing the average current by up to 90% with respect to the original.
[Figure 7: stacked bars of average current (mA). EM0: active 10, sleeping 0. EM0+EM1: active 1, sleeping 2.7. EM0+EM2: active 1, sleeping 0.00085.]
Fig. 7 Average current in a single-core platform using different power management policies, for a duty cycle of 10%
In the case of a multi-core platform, such as GAP8 (Fig. 6b), the main processor (FC) can be used on its own during light computation stages while the cluster is power-gated (off). The cluster can be activated to take advantage of parallelization and reduce the total execution time of more complex tasks. Figure 8 shows the impact of power management and parallelization on the energy consumption of the platform. Since the speed-up achieved through parallelization changes the execution time, it directly affects the final energy consumption. Hence, the figure shows energy consumption per second of computation, rather than average current or power, to illustrate the impact of the parallelization (and the corresponding reduction of processing time) on the final energy consumption of the system. The figure shows two groups of bars: the left one for a DC of 10%, and the right one for 100%. In both cases, the first bar corresponds to the energy consumed in single-core mode without any power management, whereas the second one corresponds to single-core mode using clock-gating during idle periods; under a DC of 100%, both cases are of course equivalent. The remaining bars correspond to the energy consumption in multi-core mode. We make two important observations. First, the multi-core version can achieve significant improvements (up to 41%) with respect to the single-core version with power management. Second, the parallelization needs to reach a minimum speed-up to attain any energy savings. For example, with a 4-core platform, if the speed-up obtained is 2×, then the multi-core version will consume even more energy than the single-core one because the additional hardware is not correctly exploited.
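The single-core savings quoted above follow from a duty-cycle-weighted average of the currents in Table 2. The sketch below reproduces them; it is a first-order model that ignores the time and energy spent in mode transitions, and the names used are ours.

```python
ACTIVE_MA = 10.0  # EM0 total current from Table 2 (mA)

# Current drawn during the idle part of the period (mA), from Table 2.
SLEEP_MA = {
    "EM0": 10.0,     # no power management: the core stays active
    "EM1": 3.0,
    "EM2": 0.95e-3,  # 0.95 uA
}


def average_current_ma(duty_cycle, sleep_mode):
    """Duty-cycle-weighted average current, ignoring transition overheads."""
    return duty_cycle * ACTIVE_MA + (1.0 - duty_cycle) * SLEEP_MA[sleep_mode]


def reduction(duty_cycle, sleep_mode):
    """Relative reduction with respect to staying in EM0 all the time."""
    baseline = average_current_ma(duty_cycle, "EM0")
    return 1.0 - average_current_ma(duty_cycle, sleep_mode) / baseline


for mode in ("EM0", "EM1", "EM2"):
    print(mode, round(average_current_ma(0.10, mode), 6), f"{reduction(0.10, mode):.1%}")
```

For a duty cycle of 10%, the model gives 3.7 mA with EM1 (a 63% reduction) and about 1.0 mA with EM2 (a reduction of about 90%), in line with the values discussed in the text.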
[Figure 8: stacked bars of sleeping and active energy, normalized to the single-core always-on case, for DC = 10% and DC = 100%. Bars in each group: single-core always-on, single-core with clock-gating, and four multi-core configurations using 4 or 8 cores with different speed-up factors.]
Fig. 8 Normalized average energy consumption in a multi-core platform for different DCs, number of cores and speed-up factors. Power-gating applied to idle cores. Savings increase for higher DCs because for low DCs energy consumption is bounded by the sleeping mode
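The observation that parallelization needs a minimum speed-up can be captured with a first-order model: if the multi-core configuration draws `power_ratio` times the active power of the single core and finishes `speedup` times faster, its active energy relative to the single-core run is `power_ratio / speedup`. The sketch below uses this model; the value of `power_ratio` is a hypothetical input, not a measured GAP8 figure, and sleeping energy is ignored.

```python
def relative_active_energy(power_ratio, speedup):
    """Active energy of the parallel run relative to the single-core run."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return power_ratio / speedup


def min_speedup_for_savings(power_ratio):
    """Speed-up above which the parallel run uses less active energy."""
    return power_ratio


# Hypothetical configuration drawing 2.4x the single-core active power.
POWER_RATIO = 2.4
worse = relative_active_energy(POWER_RATIO, speedup=2)   # above 1: more energy
better = relative_active_energy(POWER_RATIO, speedup=4)  # below 1: energy saved
print(worse, better, min_speedup_for_savings(POWER_RATIO))
```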
3.1.7 Memory Management
Quite frequently, biosignal processing applications need to acquire a number of samples to complete a window of processing. The sampling period typically extends over several seconds, with a low acquisition rate (e.g., 250 Hz). During the sampling period, the processor is typically clock- or power-gated; the system keeps active only the DMA controller, the memory and the devices required for signal acquisition, such as analog-to-digital converters (ADCs) or bus interfaces. However, even with this minimal amount of hardware active, the energy consumed during the acquisition phase can be quite large. In fact, the energy consumed by the memories during the sample acquisition period, which is measured in seconds, can become comparable in magnitude to the energy consumed during the processing period, which is typically measured in tens or hundreds of milliseconds. For this reason, modern platforms have memories divided into multiple banks that can be independently switched on or off. Most memories also support a retention mode in which they keep their contents but cannot be accessed. To minimize the energy consumption during sampling, a system should use the smallest feasible bank size and keep all the banks off except the one currently receiving new samples; as a power mode transition requires some time, it may be better to keep that bank active between sampling periods. When the bank is filled, it should be placed into retention mode. As capturing progresses, the banks move from the disconnected state to active and, finally, to retention. When the sampling period ends, all the banks containing data can be activated before starting the computation.
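The bank life cycle described above (off, then active while being filled, then retention, and finally active again for processing) can be modelled as a small state machine. The following sketch simulates it; the bank count, the bank capacity and all names are hypothetical and chosen only for illustration.

```python
from enum import Enum


class BankState(Enum):
    OFF = "off"
    ACTIVE = "active"
    RETENTION = "retention"


def acquisition_states(num_banks, samples_per_bank, total_samples):
    """Return the tuple of bank states after each acquired sample."""
    states = [BankState.OFF] * num_banks
    history = []
    for i in range(total_samples):
        bank = i // samples_per_bank
        if bank >= num_banks:
            raise ValueError("window does not fit in the available banks")
        if states[bank] is BankState.OFF:
            states[bank] = BankState.ACTIVE  # power on the bank that receives samples
        # ... the DMA writes sample i into `bank` here ...
        if (i + 1) % samples_per_bank == 0:
            states[bank] = BankState.RETENTION  # bank is full: keep data, stop accesses
        history.append(tuple(states))
    return history


def wake_for_processing(states):
    """Activate every bank that holds data; empty banks stay off."""
    return tuple(BankState.ACTIVE if s is not BankState.OFF else s for s in states)


history = acquisition_states(num_banks=4, samples_per_bank=2, total_samples=5)
final = wake_for_processing(history[-1])
print([s.value for s in history[-1]], [s.value for s in final])
```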
3.2 Patient-Aware Applications
In health monitoring, targeted and personalized diagnosis and treatment are essential for a successful prognosis. One example is the group of patients suffering from paroxysmal atrial fibrillation (PAF), which is caused by heterogeneous mechanisms and can be asymptomatic. In our work (De Giovanni et al. 2017), we show how a patient-aware approach to predict the onset of PAF significantly increases the accuracy compared to methods that consider inter-patient variability. In that work, we propose to use an abstraction of the ECG signal (termed “delineation”) by selecting specific relevant points for each patient. Then, we train different models for the different patients, automatically adjusting the complexity of the model (i.e., the number of features or delineated points) to the specific condition of each patient. Since the extraction of each of the delineation points, or features, has a different computational cost, the patient-aware approach changes the computational complexity of the same application for each patient: by choosing groups of ECG delineation points with different computational loads per patient, the energy consumption of the algorithm is scaled to the specific patient. These considerations can be used in conjunction with the previously explored platform-aware techniques to achieve optimal computation. For example, for patients for whom few (easy) points are delineated, the algorithm may be able to work in single-core mode or at a lower frequency-voltage point. If the set of required points makes the delineation process more complex, then a higher frequency-voltage point can be used to guarantee that all the deadlines are met.
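The last step, matching the per-patient workload to a frequency-voltage point, can be sketched as a small selection problem: among the operating points that finish the delineation before the deadline, pick the one with the lowest energy. All names, operating points and cycle counts below are hypothetical; they do not describe a specific platform or patient.

```python
from typing import NamedTuple, Optional, Sequence


class OperatingPoint(NamedTuple):
    name: str
    freq_hz: float
    power_w: float  # active power at this frequency-voltage point


def pick_operating_point(cycles: float, deadline_s: float,
                         points: Sequence[OperatingPoint]) -> Optional[OperatingPoint]:
    """Lowest-energy point that meets the deadline, or None if none does."""
    feasible = []
    for p in points:
        exec_time = cycles / p.freq_hz
        if exec_time <= deadline_s:
            feasible.append((p.power_w * exec_time, p))  # energy = power * time
    if not feasible:
        return None
    return min(feasible, key=lambda item: item[0])[1]


# Hypothetical operating points and per-patient workloads (cycles per window).
POINTS = [
    OperatingPoint("low", 50e6, 5e-3),
    OperatingPoint("high", 150e6, 25e-3),
]
patient_few_points = pick_operating_point(20e6, 1.0, POINTS)
patient_many_points = pick_operating_point(100e6, 1.0, POINTS)
patient_too_heavy = pick_operating_point(300e6, 1.0, POINTS)
print(patient_few_points.name, patient_many_points.name, patient_too_heavy)
```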
3.3 Towards Adaptive and Multi-parametric Applications
Newer WSN applications for remote health monitoring or for tracking performance in sports are becoming multi-parametric and adaptive. These applications use multiple sensors, such as respiratory activity (RSP) or photoplethysmography (PPG), to estimate vital parameters such as heart rate (Giovanni et al. 2016), blood pressure and oxygen saturation (Murali et al. 2017). Due to the performance challenges that these new applications create, modern platforms are evolving to handle those larger workloads, but also to follow their performance requirements more closely. For example, Mr. Wolf (Pullini et al. 2019) is a PULP-based platform that can run applications such as seizure detection with 23 EEG electrodes (Benatti et al. 2016) or online learning and classification for EMG-based gesture recognition (Benatti et al. 2019). These new platforms offer parallel processing and floating-point support, but can also operate at different voltage-frequency points. In that way, designers can pick among multiple combinations of working states: single-core at low frequency-voltage (minimum performance, minimum energy consumption), single-core at high frequency-voltage, multi-core at low frequency-voltage, and multi-core at high frequency-voltage (highest performance, highest energy consumption). Consequently, the relevant problem is now how to match the performance requirements of the application with the platform resources to meet all deadlines while avoiding energy wastage. One example of this effort is adaptive applications that employ algorithms of increasing complexity in cascade, activating only the least complex ones required to achieve a satisfactory precision. For example, in Forooghifar et al. (2019) the authors propose to use multiple support vector machines (SVMs) of increasing complexity to process increasing numbers of biosignals in cognitive workload monitoring. A fundamental feature of SVMs is that they produce a classification (e.g., “stressed” versus “not stressed”), but they also produce a certainty score. Using that score, the designer can determine whether the current SVM is adequate or whether the next, more complex one should be used. The successive SVMs require more complete inputs, which increases both the complexity of feature extraction and the computation of the SVMs themselves. In conclusion, a carefully designed application can determine the required resources at each stage and configure the platform resources according to its requirements.
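The cascade described above generalizes the two-level scheme to any number of stages. A minimal sketch follows, assuming that each stage is a callable returning a label together with a certainty score in [0, 1]; this interface, the threshold and the toy stages are hypothetical stand-ins for trained SVMs and their scores.

```python
def cascade_classify(stages, threshold):
    """Run stages of increasing cost until one is certain enough.

    stages    -- callables ordered from cheapest to most expensive,
                 each returning (label, certainty in [0, 1])
    threshold -- minimum certainty required to stop early
    Returns (label, index of the stage that produced it).
    """
    if not stages:
        raise ValueError("at least one stage is required")
    last = len(stages) - 1
    for level, stage in enumerate(stages):
        label, certainty = stage()
        # The last stage always decides, whatever its certainty.
        if certainty >= threshold or level == last:
            return label, level


calls = []  # records which stages were actually executed


def make_stage(name, label, certainty):
    def stage():
        calls.append(name)
        return label, certainty
    return stage


stages = [
    make_stage("svm_small", "not stressed", 0.55),
    make_stage("svm_medium", "stressed", 0.90),
    make_stage("svm_large", "stressed", 0.99),
]
result = cascade_classify(stages, threshold=0.8)
print(result, calls)
```

In the example the cheapest stage is not certain enough, the second one is, and the most expensive stage is never executed.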
4 Self-Aware Signal Acquisition and Sampling IoT devices are small embedded systems, often constrained in resources. While the more powerful IoT devices have a permanent power supply, other devices are more limited. This is the case for remote wireless systems dedicated to data collection powered by a battery. While in the former case, the energy budget is often, in the later case, each Joule must be used wisely. In the most constrained setups, sampling data from the environment is often one of the main energy expenditures. Therefore, the components used for the signal acquisition need to be as low-power as the design constraints allow. Indeed, the performance of the device must not be lowered below the threshold of acceptability. The Analog-to-Digital Converters (ADCs) used in this context are prominently based on the Successive Approximation Register (SAR) architecture, because they are lower low-power than other architectures for a given bit-depth, as seen in Fig. 9. An additional benefit that cannot be represented in the figure is the evolution of the current draw depending on the sampling rate. Indeed, while the current consumed
E. De Giovanni et al.
by the Delta-Sigma architecture is constant, it scales with the number of samples taken in the case of SAR ADCs. This means that lowering the sampling rate directly leads to significant power savings.
4.1 When the Design Drives the Sampling: The Data Deluge
There are inherent limits to how low the energy consumption of ADCs can be. Indeed, even with optimised ADCs using a low-power architecture, there is a minimum number of samples required to accurately capture the signal. This number is defined by the Shannon-Nyquist theorem, according to which the highest frequency in the signal determines the sampling rate to use. For example, when recording an audio signal, the maximum frequency (highest pitch) that can be heard by the human ear is close to 20 kHz, which requires a sampling rate of at least 40 kHz. Adding a low-pass filter with a 2.05 kHz transition band to remove frequencies above 20 kHz raises the requirement to 2 × (20 + 2.05) kHz, which gives the common 44.1 kHz sampling frequency.
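The audio arithmetic above can be checked with a few lines of code. This is only a sketch; the function name and its parameters are illustrative and do not come from the chapter.

```python
def min_sampling_rate_hz(max_signal_hz: int, transition_band_hz: int = 0) -> int:
    """Lowest uniform sampling rate that satisfies the Shannon-Nyquist
    criterion when the anti-aliasing filter needs `transition_band_hz`
    of extra bandwidth above the highest frequency of interest."""
    return 2 * (max_signal_hz + transition_band_hz)


# Audible band only: 2 x 20 kHz
audio_ideal = min_sampling_rate_hz(20_000)
# With the 2.05 kHz transition band of the low-pass filter
audio_practical = min_sampling_rate_hz(20_000, 2_050)
```

The same rate is paid at every instant, whether or not the signal actually contains its highest expected frequency at that moment.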
Fig. 9 Comparison of the three main ADC architectures taking into account the resolution and typical power consumption. Each marker represents an ADC extracted from a manufacturer’s catalogue. Generally, Delta-Sigma ADCs reach a higher resolution than SAR and Pipeline architectures. From an energy point of view, SAR-based devices can reach lower power than other architectures, partly because their consumption scales with the number of samples, whereas the Delta-Sigma architecture imposes a constant current draw. The main benefit of pipeline-based ADCs, not visible in this figure, is the acquisition speed, which is an order of magnitude faster than that of other architectures
Intelligent Edge Biomedical Sensors in the Internet of Things (IoT) Era
Following the Shannon-Nyquist sampling theorem, a constant sampling rate is defined for the acquisition. This sampling rate is directly tied to the highest frequency that could occur in the signal, which might not even occur. As a consequence, the signal is over-sampled at every instant where the maximum expected frequency is not present in the signal.
4.2 When the Signal Drives the Sampling: The Event-Driven Strategy
As the energy consumption of SAR ADCs is linked to the number of samples taken, lowering this number leads to overall energy savings. One such strategy is Compressed Sensing (Mamaghanian et al. 2011), where the system follows an irregular sampling pattern according to a mathematical model: the signal’s sparsity can be used to recover the original signal by solving an underdetermined linear system. This is already an improvement over the uniform sampling approach, but it has a few shortcomings. First, it relies on the probability of getting significant data at the right time. When the system does not sample the signal while a significant event is happening, the event is either totally lost or poorly acquired. Second, reconstructing the full signal from the compressed measurements is computationally expensive. A low-power IoT sensor node would not have the energy budget to recover the signal and react if necessary. If the data processing is pushed towards the remote IoT node, an alternative design must be chosen. The best approach is to have the signal itself drive the sampling. There are multiple strategies that implement this way of reasoning about signal acquisition. In the following parts, event-triggered sampling is motivated and illustrated with the specific case of the main bio-signal of the heart, the electrocardiogram (ECG). ECG signals combine periods of high frequency when the beat happens, and lower frequencies otherwise. Each heartbeat in an ECG is observed as a sequence of three wave components (annotated in Figs. 10 and 11):
1. P wave: electrical activation of the atria,
2. QRS complex: electrical activation of the ventricles,
3. T wave: electrical recovery of the ventricles.
4.2.1 Level-Crossing Event Triggering
With the traditional uniform sampling shown in Fig. 10, the sampling frequency chosen here is not sufficient to correctly capture all the signal’s characteristics. However, raising the sampling rate to reliably capture the R peak would over-provision the other parts of the signal. Over-sampling is detrimental for resource-constrained medical systems (Surrel et al. 2018; Sopic et al. 2018), as more samples mean more energy required to process, store, or transmit the acquired data.
Fig. 10 Sampling of a single ECG sinus beat, with a traditional uniform sampling strategy. This means there is a fixed number of samples per unit of time. If the sampling frequency is too low, details of the signal are lost, such as the accurate shape of the QRS complex. This means there is a trade-off between the energy spent digitizing the signal and the quality of the sampled signal. (Source Record A00848 from the PhysioNet Challenge 2017 database, between t = 2.8 s and t = 3.8 s.)
Fig. 11 Sampling of a single ECG sinus beat, with an event-driven sampling strategy. The samples are collected each time the signal crosses a threshold, represented as horizontal grey bars. For the system designer, the performance of the system is no longer defined by the sampling rate but by the distribution of the levels. Compared to Fig. 10, the number of samples is similar, but the QRS complex is accurately sampled. (Source Record A00848 from the PhysioNet Challenge 2017 database, between t = 2.8 s and t = 3.8 s.)
Indeed, a medical device is less usable for patients if battery life is a limiting factor. Even though the full ECG can be very informative for detecting symptoms of diseases, it is not always necessary to have full details about the P, QRS, and T waves. Depending on the application, desired accuracy and use case, partial data can be sufficient to run the required diagnostics. For instance, it is possible to perform online detection of Obstructive Sleep Apnea (OSA) on a wearable device relying only on the time between heartbeats (Surrel et al. 2018). The accuracy is improved when the peaks’ amplitude is also used. Because all the processing is performed on the device, it must use its energy sparingly. For this example application, lowering the sampling frequency runs into two problems. First, it is impossible to significantly reduce the sampling frequency, as we need to accurately detect heartbeats. Second, the algorithm needs accurate timing between the heartbeats, otherwise the quality of the results decreases dramatically. Lowering the sampling frequency lowers the temporal resolution of the heartbeat detection. As a consequence, any energy saved from lowering the sampling frequency is paid for with reduced performance. Switching to an event-driven signal acquisition is beneficial for two reasons. First, the heartbeat is the highest peak in the signal. Therefore, it will quickly cross multiple thresholds, clearly flagging its presence in the triggers received. In contrast to Compressed Sensing, it is not possible to miss it in the signal. Second, even with a coarse configuration (i.e., a low number of thresholds), we only lose precision in the peak height, while the time at which the heartbeat happens is preserved.
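A software sketch of level-crossing sampling may help to make the idea concrete. It assumes evenly spaced levels (multiples of a hypothetical `level_spacing`) and uses a densely sampled sequence as a stand-in for the analog signal, with crossing times found by linear interpolation; a real front-end would detect the crossings with analog comparators instead.

```python
import math
from typing import List, Sequence, Tuple


def level_crossing_sample(times: Sequence[float],
                          values: Sequence[float],
                          level_spacing: float) -> List[Tuple[float, float]]:
    """Return one (time, level) event each time the signal reaches a
    multiple of `level_spacing` while moving up or down."""
    if level_spacing <= 0:
        raise ValueError("level_spacing must be positive")
    events: List[Tuple[float, float]] = []
    for i in range(len(values) - 1):
        t0, t1 = times[i], times[i + 1]
        v0, v1 = values[i], values[i + 1]
        if v1 > v0:
            # Rising: levels strictly above v0, up to and including v1.
            first = math.floor(v0 / level_spacing) + 1
            last = math.floor(v1 / level_spacing)
            crossed = range(first, last + 1)
        elif v1 < v0:
            # Falling: levels strictly below v0, down to and including v1.
            first = math.ceil(v0 / level_spacing) - 1
            last = math.ceil(v1 / level_spacing)
            crossed = range(first, last - 1, -1)
        else:
            continue  # flat segment, no level is crossed
        for k in crossed:
            level = k * level_spacing
            # Linear interpolation of the crossing instant.
            t = t0 + (level - v0) / (v1 - v0) * (t1 - t0)
            events.append((t, level))
    return events
```

A steep peak such as the R wave crosses several levels within one short interval and therefore produces a burst of events, whereas a flat baseline produces none. The number of events is governed by the level spacing rather than by a sampling clock.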
4.2.2 Error-Based Event Triggering
The classical level-crossing method is less than ideal for two reasons. First, oscillations or noise around a threshold will trigger many samples. Second, a linear evolution of the signal, that is to say one with a constant derivative, will generate samples regularly spaced both in time and value. This brings no useful information compared to simply having the first and last points of the linear section. An alternative approach is to consider the error between the raw signal and its sampled version as the trigger for sampling. Putting the focus on the signal reconstruction error, the event-triggered sampling task becomes a minimization problem, looking for the minimum number of samples that allow us to obtain a digital representation of the analog signal that is sufficient for ECG processing. A family of methods well suited for this problem is polygonal approximation, also called piece-wise linear representation or linear path simplification (Keogh et al. 2004). These methods assume that the input signal can be represented as a sequence of linear segments, and they apply different techniques to obtain the minimum number of segments satisfying some error criterion. Within this family, one method especially suitable for sampling time series is the Wall-Danielsson algorithm (Wall and Danielsson 1984). This method has linear complexity, works online, and only needs one signal sample in advance to estimate the approximation error. Conversely, it guarantees optimality neither in the number of points,
Fig. 12 Drawbacks of the level-crossing sampling strategy (red crosses) compared to the polygonal approximation sampling (green pluses). In the first part, the signal oscillating around a level generates many samples, while in the second part, because of the linear evolution of the signal, the samples are taken at regular time intervals. In both cases, samples in the middle of each zone do not bring any meaningful information. The polygonal approximation lowers the total number of samples while retaining the information about the behavior of the signal
nor in the selected samples. This method follows a bottom-up approach, in which points are merged into the current linear segment until an error threshold is reached, and then a new segment is created. The error is measured as the area deviation between the original signal and the current segment. This algorithm overcomes the two main shortcomings of the classical level-crossing method visible in Fig. 12. First, an almost constant signal oscillating near a level generates more events than required. Second, fast linear changes generate numerous events. With polygonal approximation, the number of samples is not affected by constant displacements of the signal level, and linear changes are always represented by just two samples, no matter the slope value.
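The bottom-up merging can be sketched as below. This is a simplified illustration in the spirit of the area-deviation criterion, not a faithful reproduction of the published Wall-Danielsson algorithm: the signed area between the signal path and the chord from the last kept sample is accumulated, and a new segment is started when that area exceeds `max_error` times the chord length. The function name and the exact normalisation are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def polygonal_sample(times: Sequence[float],
                     values: Sequence[float],
                     max_error: float) -> List[Tuple[float, float]]:
    """Keep only the points needed for a piece-wise linear approximation.
    Works online: a segment is closed at the previous point as soon as
    the current point makes the area deviation too large."""
    n = len(values)
    if n != len(times):
        raise ValueError("times and values must have the same length")
    if n == 0:
        return []
    kept = [0]
    anchor = 0
    area = 0.0  # signed area between the path and the chord from the anchor
    for i in range(1, n):
        # Coordinates relative to the current anchor point.
        x_prev = times[i - 1] - times[anchor]
        y_prev = values[i - 1] - values[anchor]
        x_cur = times[i] - times[anchor]
        y_cur = values[i] - values[anchor]
        # Shoelace increment: area of the triangle (anchor, previous, current).
        area += 0.5 * (x_prev * y_cur - x_cur * y_prev)
        chord = math.hypot(x_cur, y_cur)
        if abs(area) > max_error * chord:
            # Close the segment at the previous point and start a new one.
            anchor = i - 1
            kept.append(anchor)
            area = 0.0  # the new segment (anchor -> i) is a straight line
    if kept[-1] != n - 1:
        kept.append(n - 1)
    return [(times[k], values[k]) for k in kept]
```

On a purely linear ramp the accumulated area stays at zero, so only the first and last points are kept regardless of the slope. A flat section followed by a ramp yields three points: the start, the corner and the end.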
4.2.3 Self-Aware Sampling for ECG Signals
For a self-aware system, an important ECG feature that can be exploited to reduce the amount of data is the physiological regularity observed in the signal. In particular, under a normal sinus rhythm situation, the same heartbeat pattern is repeated between 60 and 100 times per minute. Thus, if this situation is detected on a signal fragment, from that point onward it would be enough to capture just the information needed to identify a change in the rhythm.
Fig. 13 Self-aware triggered sampling of an ECG fragment using polygonal approximation. Top: Original signal, sampled at 360 Hz. Bottom: Resulting signal of the adaptive sampling method. The detection of a regular rhythm enables a substantial reduction of the sampling frequency by using a much coarser representation of the signal. After a rhythm change (block with an average sampling frequency of 34.9 Hz), the sampling frequency is increased to capture the details of the abnormal area. (Source Record 119 of the MIT-BIH Arrhythmia DB, lead MLII, between 17:10 and 17:24)
This idea is illustrated in Fig. 13, showing a 24-s ECG segment. As long as we observe three regular P-QRS-T heartbeat patterns with a normal distance between them, we drastically reduce the detail of the signal just to be able to check that the regularity is maintained. This results in a rougher approximation of the signal, but detailed enough to observe the regular heartbeats at the expected time points. When an unexpected event breaks this regularity, the procedure is able to lower the error between the signal and its sampled version, hence supporting a more precise analysis of the new situation.
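One way to picture this behaviour is a small rule that switches the error threshold of an error-based sampler according to the rhythm observed so far. The sketch below is an assumption-laden illustration, not the authors' procedure: the `fine` and `coarse` threshold values and the 15% regularity tolerance are hypothetical, and only the 60 to 100 beats per minute range for normal sinus rhythm comes from the text.

```python
from typing import Sequence

MIN_RR_S = 60.0 / 100.0  # beat-to-beat interval at 100 beats per minute
MAX_RR_S = 60.0 / 60.0   # beat-to-beat interval at 60 beats per minute


def select_error_threshold(beat_times_s: Sequence[float],
                           fine: float = 0.01,
                           coarse: float = 0.1,
                           tolerance: float = 0.15) -> float:
    """Return the coarse threshold only when the last three detected beats
    are regularly spaced at a normal heart rate, and the fine one otherwise."""
    if len(beat_times_s) < 3:
        return fine  # not enough evidence of a regular rhythm yet
    recent = beat_times_s[-3:]
    intervals = [b - a for a, b in zip(recent, recent[1:])]
    normal_rate = all(MIN_RR_S <= rr <= MAX_RR_S for rr in intervals)
    steady = abs(intervals[1] - intervals[0]) <= tolerance * intervals[0]
    return coarse if normal_rate and steady else fine
```

As long as the beats keep arriving at the expected times, the sampler runs with the coarse threshold; the first beat that breaks the pattern switches it back to the fine threshold so that the abnormal area is captured in detail.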
4.3 Evaluation of Event-Driven Sampling
The potential of event-driven sampling is illustrated with ECG signals by comparing the performance of a standard QRS detection algorithm, provided in the WFDB software package from PhysioNet, under different sampling strategies. The output of the QRS detection algorithm is compared against the manual annotation done by a medical doctor, using the bxb application from the WFDB toolkit. The dataset used is the MIT-BIH Arrhythmia database available on PhysioNet, which is widely used in the literature to evaluate QRS detection algorithms. This database contains 48 ECG records of 30 min duration sampled at 360 Hz. Table 3 shows the performance comparison between the proposed method and the other sampling strategies, including ordinary uniform sampling (U.S.), compressed
sensing (C.S.), level-crossing (L.C.), and finally self-aware adaptive sampling (S.A.). The considered performance metrics are sensitivity (Se), positive predictivity (+P) and the combined F1 score. The compressed sensing method has been applied as explained in Mamaghanian et al. (2011), while the adopted level-crossing scheme is linear with a threshold every 200 µV. The results show that, for a similar F1 score, compressed sensing halves the sampling frequency while level-crossing divides it by more than eight. The self-aware adaptive approach outperforms the two other strategies, reaching an average sampling rate of 13.6 Hz. The relevance of event-driven sampling is necessarily application-specific, and the performance must be evaluated carefully for each application. However, given the properties of such a system, where the samples are taken depending on the signal itself rather than an external factor, it is expected to bring significant energy savings for an equivalent system performance in multiple domains.
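The F1 column of Table 3 and the reduction factors quoted above can be recomputed from the reported figures. The short sketch below assumes the usual definition of F1 as the harmonic mean of sensitivity and positive predictivity; the dictionary simply restates the values of Table 3, and the variable names are illustrative.

```python
def f1_score(sensitivity: float, positive_predictivity: float) -> float:
    """Harmonic mean of sensitivity (Se) and positive predictivity (+P)."""
    return (2.0 * sensitivity * positive_predictivity
            / (sensitivity + positive_predictivity))


# (Se, +P, average sampling rate in Hz) as reported in Table 3
table3 = {
    "U.S.": (99.73, 99.85, 360.0),
    "C.S.": (99.64, 99.82, 180.0),
    "L.C.": (99.66, 99.83, 43.7),
    "S.A.": (99.62, 99.84, 13.6),
}

f1 = {name: round(f1_score(se, pp), 2) for name, (se, pp, _) in table3.items()}
# Reduction of the average sampling rate relative to uniform sampling
reduction = {name: table3["U.S."][2] / fs for name, (_, _, fs) in table3.items()}
```

The recomputed F1 values match the published column, and the reduction factors are 2 for compressed sensing, slightly above 8 for level-crossing and above 26 for the self-aware strategy.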
Table 3 QRS detection performance comparison among different sampling strategies and resulting average sample rate for the 46 selected records from the MIT-BIH Arrhythmia DB

Sampling strategy            Se (%)   +P (%)   F1 (%)   fs (Hz)
Uniform Sampling (U.S.)      99.73    99.85    99.79    360.0
Compressed Sensing (C.S.)    99.64    99.82    99.73    180.0
Level-Crossing (L.C.)        99.66    99.83    99.74    43.7
Self-Aware (S.A.)            99.62    99.84    99.73    13.6

5 Conclusions
When optimizing a system for a given task, it is often necessary to change our mindset. The majority of systems benefit from dynamically adapting to changes according to their capabilities and limits. This adaptive behavior opens the way to significant energy savings while maintaining the required performance. The design of a self-aware system can take place on one or multiple levels, depending on the final goal and constraints. Each layer is very different when considering the impact of making it self-aware, with trade-offs involving design complexity, implementation cost, or technological availability. This chapter presented three practical examples of self-aware designs, one for each layer considered. In Fig. 14, the layer closest to the analog world, the hardware layer, is involved in the analog-to-digital conversion of a signal, i.e., sampling it. One layer up is the architecture layer, i.e., the electronic components that run software algorithms. It offers specific capabilities for building self-aware systems. Finally, the layer closest to the user is the application layer, which is involved in data processing and yielding the final results. Each one of these layers has great potential for full customization in a self-aware system.
1. The Application Layer: An epileptic seizure detection classifier has been presented. An initial lightweight classification is first performed. Depending on the results, the classifier decides if a more in-depth analysis is required. This approach brings significant savings for all cases where the lightweight analysis is sufficient, with a minimal overhead when performing the full analysis.
2. The Architecture Layer: In transitioning to the application layer, an important factor to consider is the structure of the different platforms available and how it can affect the energy efficiency. Power management must be taken into account in the application design process for any platform. The need for parallelization in multi-core platforms, expressed as energy savings compared to the single-core implementation, depends on the application duty cycle and the speed-up on the cluster of cores. This analysis can enable choosing the optimal setup to achieve the maximum energy savings given certain general features of the application.
3. The Hardware Layer: At the beginning of the data processing chain comes the signal digitization. In low-power systems, the energy budget of sampling can be lowered by changing the data acquisition strategy, moving from uniform sampling to an adaptive event-driven one. In the application presented, this non-Nyquist sampling could lower the total number of samples collected by more than 25× without any significant decrease in performance.
The design constraints may require a single layer to be self-aware. In that case, it is likely that targeting the application layer is the right thing to do. If needed, it is possible to go further by making the architecture layer or the hardware layer self-aware.
Finally, a fully self-aware system will bring a definite advantage, i.e., the possibility to have a much lower energy consumption without any major loss in terms of performance.
Fig. 14 A top-down design approach for self-aware systems, starting from the application layer, moving down to the circuit architecture level, and finally the hardware level
Acknowledgements This work has been partially supported by the ML-Edge Swiss National Science Foundation (NSF) Research project (GA No. 200020_182009/1), by the H2020 DeepHealth Project (GA No. 825111), by the ReSoRT Project (GA No. REG-19-019) funded by Botnar Foundation, and by the PEDESITE Swiss NSF Sinergia project (GA No. SCRSII5 193813/1).
References
A. Aminifar, P. Tabuada, P. Eles, Z. Peng, in 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE) (IEEE, 2016), pp. 636–641
M.T. Andrén, B. Bernhardsson, A. Cervin, K. Soltesz, in 2017 IEEE 56th Annual Conference on Decision and Control (CDC) (IEEE, 2017), pp. 5438–5444
A. Anzanpour, I. Azimi, M. Götzinger, A.M. Rahmani, N. TaheriNejad, P. Liljeberg, A. Jantsch, N. Dutt, in Proceedings of the Conference on Design, Automation & Test in Europe (European Design and Automation Association, 2017), pp. 1056–1061
S. Benatti, F. Montagna, D. Rossi, L. Benini, in 2016 IEEE Biomedical Circuits and Systems Conference (BioCAS) (2016), pp. 86–89. https://doi.org/10.1109/BioCAS.2016.7833731
S. Benatti, F. Montagna, V. Kartsch, A. Rahimi, D. Rossi, L. Benini, IEEE Trans. Biomed. Circuits Syst. 13(3), 516 (2019). https://doi.org/10.1109/TBCAS.2019.2914476
F. Conti, D. Rossi, A. Pullini, I. Loi, L. Benini, J. Signal Process. Syst. 84(3), 339 (2016). https://doi.org/10.1007/s11265-015-1070-9
E. De Giovanni, A. Aminifar, A. Luca, S. Yazdani, J.M. Vesin, D. Atienza, in Proceedings of CINC, vol. 44 (2017), pp. 285–191
L. Duch, S. Basu, R. Braojos, G. Ansaloni, L. Pozzi, D. Atienza, IEEE TCAS-I 64(9), 2448 (2017). https://doi.org/10.1109/TCSI.2017.2701499. http://ieeexplore.ieee.org/document/7936544/
A. Filieri, H. Hoffmann, M. Maggio, in Proceedings of the 36th International Conference on Software Engineering (2014), pp. 299–310
E. Flamand, D. Rossi, F. Conti, I. Loi, A. Pullini, F. Rotenberg, L. Benini, in IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP) (2018), pp. 1–4. https://doi.org/10.1109/ASAP.2018.8445101
F. Forooghifar, A. Aminifar, D.A. Alonso, in 2018 21st Euromicro Conference on Digital System Design (DSD) (IEEE, 2018), pp. 426–432
F. Forooghifar, A. Aminifar, L. Cammoun, I. Wisniewski, C. Ciumas, P. Ryvlin, D. Atienza, Mobile Networks and Applications (2019), pp. 1–14
F. Forooghifar, A. Aminifar, D. Atienza, IEEE Trans. Biomed. Circuits Syst. 13(6), 1338 (2019)
E.D. Giovanni, S. Murali, F. Rincon, D. Atienza, in 2016 Euromicro Conference on Digital System Design (DSD) (2016), pp. 553–560. https://doi.org/10.1109/DSD.2016.101
K. Guk, G. Han, J. Lim, K. Jeong, T. Kang, E.K. Lim, J. Jung, Nanomaterials 9(6) (2019). https://doi.org/10.3390/nano9060813. https://www.ncbi.nlm.nih.gov/pubmed/31146479
H. Hoffmann, J. Holt, G. Kurian, E. Lau, M. Maggio, J.E. Miller, S.M. Neuman, M. Sinangil, Y. Sinangil, A. Agarwal et al., in Proceedings of the 49th Annual Design Automation Conference (2012), pp. 259–264
A. Jantsch, N. Dutt, A.M. Rahmani, IEEE Design & Test 34(6), 8 (2017)
E. Keogh, S. Chu, D. Hart, M. Pazzani, in Data Mining in Time Series Databases (World Scientific, 2004), pp. 1–21
M. Konijnenburg, R. van Wegberg, S. Song, H. Ha, W. Sijbers, J. Xu, S. Stanzione, C. van Liempd, D. Biswas, A. Breeschoten, P. Vis, C. Van Hoof, N. Van Helleputte, in IEEE International Solid-State Circuits Conference (ISSCC) (2019), pp. 360–362. https://doi.org/10.1109/ISSCC.2019.8662520
P.R. Lewis, A. Chandra, S. Parsons, E. Robinson, K. Glette, R. Bahsoon, J. Torresen, X. Yao, in 2011 Fifth IEEE Conference on Self-Adaptive and Self-Organizing Systems Workshops (SASOW) (IEEE, 2011), pp. 102–107
P.R. Lewis, M. Platzner, B. Rinner, J. Tørresen, X. Yao, Self-aware Computing Systems (Springer, 2016)
H. Mamaghanian, N. Khaled, D. Atienza, P. Vandergheynst, IEEE TBME 58(9), 2456 (2011)
G. Masinelli, F. Forooghifar, A. Arza, A. Aminifar, D. Atienza, IEEE Design & Test (2020)
S. Murali, F. Rincón, S. Baumann, E. Pérez Marcos, Monitoring device for monitoring of vital signs (international patent WO 2019/102242 A1) (2017), https://patents.google.com/patent/WO2019102242A1/en?q=ppt+blood+pressure+rincon&oq=ppt++blood+pressure+rincon
A. Pullini, D. Rossi, I. Loi, G. Tagliavini, L. Benini, IEEE J. Solid-State Circuits 54(7), 1970 (2019). https://doi.org/10.1109/JSSC.2019.2912307. https://ieeexplore.ieee.org/document/8715500/
F. Rincon, P.R. Grassi, N. Khaled, D. Atienza, D. Sciuto, in Engineering in Medicine and Biology Society (IEEE, 2012), pp. 2472–2475. https://doi.org/10.1109/EMBC.2012.6346465. http://ieeexplore.ieee.org/document/6346465/
Silicon Labs, EFM32 Leopard Gecko Reference Manual (2017), https://www.silabs.com/products/mcu/32-bit/efm32-leopard-gecko
A. Sinha, A. Wang, A.P. Chandrakasan, in Proceedings of the ISLPED (ACM Press, 2000), pp. 31–36. https://doi.org/10.1145/344166.344188. http://portal.acm.org/citation.cfm?doid=344166.344188
D. Sopic, A. Aminifar, A. Aminifar, D. Atienza, IEEE TBioCaS (99), 1 (2018)
D. Sopic, A. Aminifar, D. Atienza, in 2018 IEEE International Symposium on Circuits and Systems (ISCAS) (IEEE, 2018), pp. 1–5
G. Surrel, A. Aminifar, F. Rincon, S. Murali, D.A. Atienza, IEEE Trans. Biomed. Circuits Syst. pp. 1–12 (2018). https://doi.org/10.1109/TBCAS.2018.2824659. https://ieeexplore.ieee.org/document/8355729/
G. Surrel, A. Aminifar, F. Rincón, S. Murali, D. Atienza, IEEE TBioCaS 12(4), 762 (2018)
K. Wall, P.E. Danielsson, Comput. Vis. Graph. Image Process. 28(2), 220 (1984). https://doi.org/10.1016/S0734-189X(84)80023-7. https://www.sciencedirect.com/science/article/pii/S0734189X84800237
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this chapter are included in the chapter’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Reconfigurable Architectures: The Shift from General Systems to Domain Specific Solutions
Eleonora D’Arnese, Davide Conficconi, Marco D. Santambrogio, and Donatella Sciuto
Abstract Reconfigurable computing is an expanding field that, during the last decades, has evolved from a relatively closed community, where highly skilled developers deployed high-performance systems based on their knowledge of the underlying physical system, into an attractive solution for both industry and academia. In this chapter, we explore the different lines of development in the field, namely the need for new tools to shorten the development time, the creation of heterogeneous platforms that couple hardware accelerators with general-purpose processors, and the demand to move from general to specific solutions. Starting with the identification of the main limitations that have led to improvements in the field, we explore the emergence of a wide range of Computer Aided Design tools that allow the use of high-level languages and guide the user through the whole process of system deployment. This opening to a wider public, together with the high performance and relatively low power consumption of these systems, facilitates their spread in data centers, where, apart from the undeniable benefits, we also explore critical issues. We conclude with the latest trends in the field, such as the use of hardware as a service and the shift to Domain Specific Architectures based on reconfigurable fabrics.
E. D’Arnese · D. Conficconi · M. D. Santambrogio · D. Sciuto
Politecnico di Milano, Piazza Leonardo Da Vinci 32, Milano, Italy
© Springer Nature Singapore Pte Ltd. 2023
M. M. S. Aly and A. Chattopadhyay (eds.), Emerging Computing: From Devices to Systems, Computer Architecture and Design Methodologies, https://doi.org/10.1007/978-981-16-7487-7_14
E. D’Arnese et al.
1 Introduction
The aim of this chapter is to provide an extensive overview of the evolution of reconfigurable systems from different points of view, with a specific focus on those based on Field Programmable Gate Arrays (FPGAs). We will explore the evolution of these systems from standalone to hybrid solutions (Choi et al. 2019), the evolution of the toolchains developed both to increase productivity and to widen the user base of these reconfigurable fabrics (Tessier et al. 2015), and the differentiation of the paradigms employed and of the application scenarios (Kachris and Soudris 2016). Considering the magnitude of the topics, we will cover the time span between a period when only a restricted elite knew and exploited reconfigurable systems and the present day, when they are often integrated into data centers and provided as services to a wider audience. Following this path, we can identify three main trends that helped the adoption of reconfigurable fabrics, specifically FPGAs, in a variety of computation systems: the democratization of the programming paradigm, the development of heterogeneous platforms, and new accessibility solutions. At first, reconfigurable fabrics were employed mostly in telecommunications, thanks to the possibility of easy and fast reconfiguration, and in signal processing, for speeding up the computation of specific algorithms (De La Piedra et al. 2012). At that time, deployment was slow and limited to a few skilled developers (Pocek et al. 2013). Although the computational benefits were undeniable compared to general-purpose processors, the development bottlenecks were limiting the potential of such platforms. For these reasons, a huge effort was put into developing toolchains and abstraction layers to help developers, not necessarily highly skilled in the topic, to benefit from reconfigurable fabrics (Nane et al. 2015). Some examples of High Level Synthesis (HLS) tools are Vivado HLS (commercial) (Cong et al.
2011) and LegUp (academic) (Canis et al. 2011), while Darkroom (Hegarty et al. 2014) and PYNQ (Xilinx 2016) are examples of abstraction layers. The second attempt to widen the user base of reconfigurable systems was to combine them with general-purpose processors and, later on, with software-programmable vector engines. The coupling with microcontrollers and hard processors opens up different application scenarios but also introduces new challenges in interconnection and memory coherency (Choi et al. 2019). Indeed, the aforementioned heterogeneity and high connectivity favor the adoption of reconfigurable systems in the cloud computing ecosystem, where the power wall encountered with dark silicon (Esmaeilzadeh et al. 2011) makes providers crave energy-efficient solutions such as reconfigurable systems. Finally, the newest trend in the reconfigurable systems field is to further promote their use by users closer to the software world. In this sense, we are witnessing an increasing number of providers offering reconfigurable platforms in the cloud as services for the final users (Kachris and Soudris 2016).
Combined with the opening to the wider public, various efforts have been put into the development of Domain Specific Architectures (DSAs), which enable the user to develop applications that run on a reconfigurable system using a Domain Specific Language (DSL). Based on all these considerations, our review will start by providing a description of the state of the art around 2010, highlighting the main limitations that pushed for the improvements that we have anticipated. Then, following the different lines of development in the past ten years, we will end by describing the current solutions and the possible trends that we will see in the coming years. Even though we acknowledge that dynamic reconfiguration and low-level technological details are relevant aspects of reconfigurable systems, they are outside the scope of this chapter; we therefore refer the reader to Vipin and Fahmy (2018), Compton and Hauck (2002), and Kuon et al. (2008). Moreover, reconfigurable systems, with their spatial computation features, have proven to be more effective than general-purpose processors on many applications (DeHon 2000; Guo et al. 2004), though lacking in usability. Besides, the benefits and trade-offs of using reconfigurable systems have been widely discussed in DeHon (2000), Guo et al. (2004), Kuon and Rose (2007), Tessier et al. (2015), and Compton and Hauck (2002). We will start our analysis with a description of the situation around 2010, when reconfigurable fabrics were reserved for a small community of highly skilled developers. Section 2 covers the rise of awareness of the limitations of the employed tools and of the long deployment time. Section 3 then reports the birth of more structured Computer Aided Design tools, as also shown in Fig. 1, which, together with the coupling of reconfigurable fabrics with general-purpose processors, opens the reconfigurable computing world to a different public. This trend leads to new challenges and paradigms of use.
Hence, in Sect. 4, we explore the spread of reconfigurable systems in heterogeneous architectures and the shift to a different employment paradigm.
Fig. 1 Timeline of a selection of some relevant works following the taxonomy proposed in this chapter
In fact, as shown in Fig. 1, we witness an increase in the number of providers of reconfigurable systems as a service and a growing interest in new directions, such as domain specialization. Finally, Sect. 5 concludes with a summary of the described trends and a glance at the possible future trends in reconfigurable fabrics and their applications.
2 Overview and Motivation
As anticipated, reconfigurable fabrics, and specifically FPGAs, were initially employed mostly for prototyping and as generic programmable hardware accelerators applicable in various fields such as telecommunications and signal processing, thanks to their efficiency and reprogrammability (Pocek et al. 2013). Though they provide a high level of flexibility, close to that of software, and good performance, they require a body of knowledge drawn from a wide mixture of fields (Tessier et al. 2015). Therefore, during this time, the development of FPGA-based solutions was confined to a few groups strongly specialized in hardware development, due both to flaws in the development tools and to the relatively low use of these fabrics. The production of custom-made hardware solutions requires the collaboration of engineers coming from different disciplines, and it is highly time-consuming, even though such solutions have proven to reach higher throughput with less energy consumption than general-purpose processors (DeHon 2000). To this extent, the main users of reconfigurable computing systems were those who had developed system-wide competencies, from the high-level software down to the single look-up table (LUT) (Tessier et al. 2015). As we have said, reconfigurable fabrics offer a flexibility similar to software, but their adoption is widely limited by the burden of a highly time-consuming design process. Therefore, this gap between software and hardware productivity drove a large portion of research investment into Computer Aided Design (CAD) tools and the raising of abstraction levels (Trimberger 2018). At the dawn of reconfigurable computing, hardware description languages (HDLs) were the first attempts to move from schematics to a higher-productivity approach, but, to reach satisfactory results, the developer needs an in-depth knowledge of the underlying physical architecture of the fabric (Tessier et al. 2015).
Even though HDLs enable the use of tools starting from the register transfer level (RTL), the developer is still required to provide a specification at a low level of abstraction (Nane et al. 2015). The required knowledge prevents a wider audience from accessing the field and, for this reason, new environments were developed. AutoPilot, known as Vivado HLS since 2013, succeeded in introducing High Level Synthesis (HLS), i.e., programming the hardware with a high-level language such as C, into the reconfigurable computing industry (Cong et al. 2011). Unfortunately, the efforts of the reconfigurable computing community were not enough. Indeed, a study by a heterogeneous group of researchers demonstrates that, although HLS tools remarkably increase productivity, there is still room for improvement (Nane et al. 2015). Nane et al. (2015) describe the possible optimizations an HLS tool can apply to the high-level source code. Some of them, such as polyhedral analysis, come from the compiler world and are thus familiar to some software developers. Other techniques are instead specific to hardware, like datapath bitwidth optimization, spatial parallelism exploitation, and hardware speculation. The authors present a study of three academic tools and a commercial one, comparing performance and area on a set of benchmarks for designs produced in a push-button way and for designs with manual optimizations. The results show that, with the current status of HLS tools, a software developer cannot program FPGAs using software paradigms alone, but must be aware of the huge difference between software optimizations and hardware ones (Nane et al. 2015). While HLS represents an impressive step towards widening the audience of reconfigurable systems, the computing world was shaken by the first version of Project Catapult (Putnam et al. 2014), a reconfigurable fabric in a custom data-center designed to accelerate large-scale software services. This success story from a big corporation such as Microsoft broke the cliché of reconfigurable systems being confined to a very small niche. The first deployment was achieved in 2014, and Microsoft pushed the project forward, leading to the second version in 2016, with the FPGA directly attached to the network card (Caulfield et al. 2016). Nevertheless, the gap between hardware and software programming was still far from being filled. Indeed, platforms such as Graphics Processing Units (GPUs) have gained traction with the community and have been adopted as the standard way to accelerate computational workloads. The main reason is the ease of programming those devices, thanks to the abstraction offered by APIs such as the Compute Unified Device Architecture (CUDA) (NVIDIA 2007).
The parallel computation power and the easy-to-use programming models have made GPUs the de facto winner in several application fields, such as machine learning model training.
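As a concrete illustration of the datapath bitwidth optimization mentioned above, the following Python sketch propagates value ranges through a small multiply-accumulate datapath and derives the narrowest width each signal needs. It is a simplified illustration under our own assumptions, not the algorithm of any of the surveyed tools: the function names, the unsigned 8-bit input, the 4-bit coefficient, and the 9-tap accumulation are all hypothetical.

```python
from typing import Tuple

Range = Tuple[int, int]  # inclusive (lowest, highest) value a signal can take


def bits_for_range(lo: int, hi: int) -> int:
    """Smallest width that holds every value in [lo, hi].

    Unsigned encoding is assumed when lo >= 0, two's complement otherwise.
    """
    if lo > hi:
        raise ValueError("empty range")
    if lo >= 0:
        return max(1, hi.bit_length())
    n = 1
    # Two's complement on n bits covers [-2^(n-1), 2^(n-1) - 1].
    while not (-(1 << (n - 1)) <= lo and hi <= (1 << (n - 1)) - 1):
        n += 1
    return n


def add_range(a: Range, b: Range) -> Range:
    """Range of a + b."""
    return (a[0] + b[0], a[1] + b[1])


def mul_range(a: Range, b: Range) -> Range:
    """Range of a * b, obtained from the four corner products."""
    corners = [x * y for x in a for y in b]
    return (min(corners), max(corners))


# Hypothetical datapath: sum of 9 products of an 8-bit pixel and a 4-bit weight.
pixel: Range = (0, 255)
weight: Range = (0, 15)
product = mul_range(pixel, weight)
acc: Range = (0, 0)
for _ in range(9):
    acc = add_range(acc, product)

product_bits = bits_for_range(*product)
acc_bits = bits_for_range(*acc)
print(f"product needs {product_bits} bits, accumulator needs {acc_bits} bits")
```

In software both signals would typically be declared as 32-bit integers, whereas the range analysis shows that narrower multipliers and adders suffice, which is the kind of area saving that has no counterpart in software optimization.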
3 Towards Reconfigurable System Democratization
Because of the aforementioned huge productivity gap between reconfigurable systems and general purpose processors, neither software programmers nor ASIC developers were encouraged to embrace the reconfigurable computing world. Although the flexibility of reconfigurable systems enabled applications such as FPGA-based wireless sensor networks, with tasks spanning from sensor fusion to small co-processors (De La Piedra et al. 2012), device complexity kept increasing. Indeed, in 2011 the two main players in the FPGA market, i.e., Xilinx and Altera, introduced two heterogeneous systems composed of an FPGA tightly coupled with a hard processor. Xilinx presented the Zynq technology (Xilinx 2016), where an ARM processor and an FPGA are on the same chip, whereas Intel presented a Xeon coupled with an Altera FPGA (Oliver et al. 2011) through the Intel QuickPath Interconnect (QPI) technology (Ziakas et al. 2010). Further technology advancements were on the roadmap for 2015 (Chandrakar et al. 2015; Shannon et al. 2015), but the productivity gap and the increasing device complexity kept reconfigurable devices confined to a niche. To increase their adoption, the research community has worked along two main lines of development, which are described in this and the following Section. The first line is the democratization of reconfigurable systems through the improvement of Computer Aided Design (CAD) tools and of the abstraction level provided to the final users, which takes the first steps towards domain specialization, further presented in Sect. 4.3. The second line focuses on a shift in how reconfigurable systems are employed, starting from Hardware-as-a-Service, moving to increased specialization through heterogeneity, and finishing with Domain-Specific Architectures (DSAs).
3.1 Design Automation Tools for FPGAs
One of the main efforts in the reconfigurable computing world has centered on Computer Aided Design tools for FPGAs. The increasing complexity of the available platforms and the increasing demand for computational efficiency foster research on CAD for reconfigurable computing systems. A large number of tools have been developed at both the academic and industrial levels. We provide an overview of some of the most significant ones, clustering them into industrial and academic tools, and into closed and open source. Commercial tools are currently almost all closed source for obvious market reasons. Here we report some of the most relevant vendors on the market, namely Xilinx and Intel FPGAs, known as Altera up to 2015. What follows is not intended to be a complete list of commercial tools. Starting in 2012, Xilinx released the Vivado design suite to support its latest platforms (Feist 2012). Vivado, with Intel Quartus (Intel 2020a) as its counterpart, performs system design tasks such as synthesis, place and route, and final bitstream generation, as shown in the left-hand side of Fig. 2. Both companies provide a commercial HLS tool to enable fast prototyping and deployment, namely Vivado HLS (Cong et al. 2011) and the Intel HLS Compiler (Intel 2020b). Finally, both Intel FPGAs and Xilinx support the OpenCL language, with the Altera FPGA SDK in 2012 (Czajkowski 2012; Altera 2013) and Xilinx SDAccel in 2014 (Xilinx 2014), which has since been unified into Xilinx Vitis (Xilinx 2019). One of the peculiarities of SDAccel, for example, is the complete abstraction from the underlying system, which leaves the development of the custom accelerator as the only duty of the final user. The tool then integrates this custom computing accelerator with a basic shell of logic blocks, thanks to a partial reconfiguration of the FPGA.
On the other hand, the academic community continuously works to push research on CAD tools further. A great effort has been put into HLS toolchains, reviewed in Nane et al. (2015). Among those in the survey, there are two open-source solutions, BAMBU (Pilato and Ferrandi 2013) and LegUp (Canis et al. 2011), and a closed-source one named DWARV (Nane et al. 2012). As mentioned in Sect. 2, the results show that approaching HLS tools is still discouraging for a software developer. Koeplinger et al. (2016) present a closed-source CAD framework for reconfigurable accelerators. The framework leverages a custom intermediate representation (IR) called DHDL to provide easily customizable architectural templates. This and many other papers from this field (Grant et al. 2011; Fricke et al. 2018; Chin et al. 2017) enable the development of Coarse Grained Reconfigurable Architectures (CGRAs) such as Plasticine (Prabhakar et al. 2017), which leverages the DHDL IR. Furthermore, many research groups focus on the effective usage of the polyhedral model (Karp et al. 1967), widely used in software compilers, for efficient automatic code generation targeting reconfigurable systems (Schmid et al. 2013; Pouchet et al. 2013; Zuo et al. 2013; Natale et al. 2016). Other works try to overcome limitations of many HLS tools, such as the restriction to static parallelism and static scheduling (Nane et al. 2015). An example comes from LegUp (Choi et al. 2017), which supports the expression of multi-core hardware systems through multi-threaded execution. Differently, TAPAS (Margerm et al. 2018), an open-source CAD framework, tries to push the thread parallelism approach further in a dynamic way. Indeed, this work aims at supporting dynamic execution, e.g., with cache misses, dynamic synchronization, and dynamic instruction scheduling, on the target reconfigurable system. Another body of work leverages existing toolchains to provide a testing environment for custom algorithms, both for the HLS phase, such as new architectural templates or new programming models, and for the design flow, such as new place and route algorithms. An example is the CAOS platform (Rabozzi et al. 2017), which aims at increasing the adoption of reconfigurable computing systems in the HPC community. It provides an easy-to-use hardware-software co-design platform that drives the user through a semi-automated flow for hardware-software partitioning, the choice of architectural templates, and the communication infrastructure (Tucci et al. 2017).
The other key feature of CAOS is the modularity of the platform. Specifically, it is designed to host externally developed modules through a set of specific APIs and interfaces, enabling customization by the community (Rabozzi et al. 2017). Following this trend, other open-source projects have aimed at encouraging the development of custom algorithms within the FPGA flow. In 2011, RapidSmith proposed a set of tools and APIs for creating “your own custom CAD toolchain” on top of the Xilinx Design Language (XDL) (Lavin et al. 2011). In 2018, Lavin and Kaviani presented RapidWright (Lavin and Kaviani 2018), a toolchain from Xilinx Research Labs. RapidWright is an open-source platform for plugging custom modules into an FPGA flow, which aims at increasing productivity and design performance in combination with the Vivado toolchain. Thanks to the proposed gateway to Vivado, called design checkpoints (DCPs), the authors want to create an ecosystem around FPGA-based CAD tools (Lavin and Kaviani 2018). On the wave of CAD development for FPGAs, there is an interest in creating an ecosystem of open-source hardware tools. Indeed, the Symbiflow project (Symbiflow 2018) aims at providing a fully free and completely open-source toolchain for commercial FPGAs, with a flow from HDL down to bitstream generation, as reported in Fig. 2. Thanks to its first sub-project, named IceStorm (IceStorm 2015), they can reproduce a bitstream for Lattice iCE40 FPGAs, while currently they are documenting the Xilinx 7-Series bitstreams (X-Ray 2017). In Shah et al. (2019), the authors present their toolchain along with custom-computing machines, such as a low-power neural network and a Linux-bootable RISC SoC. Given the increasing complexity of reconfigurable platforms and the struggles related to time to market, these open-source CAD tools can either improve the commercial tools through community contributions or democratize reconfigurable systems.
Fig. 2 Overview of the main blocks of design flows from source code down to reconfigurable system configuration
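To give a flavor of the loop transformations that the polyhedral model discussed in this Section reasons about, the following Python sketch checks whether a transformation of a two-dimensional loop nest preserves its data dependences. It is a deliberately reduced setting and an assumption on our side: dependences are uniform (constant distance vectors) and transformations are unimodular integer matrices, whereas actual polyhedral tools handle general affine dependences. All names are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[int, ...]
Matrix = List[List[int]]


def apply(t: Matrix, d: Sequence[int]) -> Vector:
    """Image of the distance vector d under the transformation matrix t."""
    return tuple(sum(a * b for a, b in zip(row, d)) for row in t)


def compose(a: Matrix, b: Matrix) -> Matrix:
    """Matrix product a*b, i.e., apply b first and then a."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def lex_positive(d: Sequence[int]) -> bool:
    """True if the first non-zero component of d is positive."""
    for x in d:
        if x != 0:
            return x > 0
    return False


def is_legal(t: Matrix, deps: Sequence[Vector]) -> bool:
    """Legal if every dependence still points forward in the new iteration order."""
    return all(lex_positive(apply(t, d)) for d in deps)


# Loop nest:  for i: for j: a[i][j] = a[i-1][j+1] + 1
# Iteration (i, j) reads what (i-1, j+1) wrote: distance vector (1, -1).
DEPS = [(1, -1)]
INTERCHANGE = [[0, 1], [1, 0]]   # (i, j) -> (j, i)
SKEW = [[1, 0], [1, 1]]          # (i, j) -> (i, i + j)
SKEW_THEN_INTERCHANGE = compose(INTERCHANGE, SKEW)

print("interchange legal:", is_legal(INTERCHANGE, DEPS))
print("skew legal:", is_legal(SKEW, DEPS))
print("skew then interchange legal:", is_legal(SKEW_THEN_INTERCHANGE, DEPS))
```

In this example a plain loop interchange reverses the dependence and is rejected, while skewing first makes the interchange legal. This is the kind of reasoning a code generator can automate before emitting a restructured loop nest.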
3.2 The Abstraction Level Rise Towards Domain Specific Languages
Another important improvement of reconfigurable computing systems concerns the rise of the abstraction level. After the first CAD efforts exploiting C/C++ based HLS, the software community has started to approach the reconfigurable ecosystem, bringing an explosion of higher-level languages, such as Python, and an increasing use of Domain Specific Languages (DSLs), such as Halide. Table 1 shows an overview of some of the relevant works in this trend. Indeed, a large amount of work aims at adopting high-level languages other than C/C++ for new hardware-software co-design techniques (left-hand side of Fig. 2) and for run-time management (right-hand side of Fig. 2), enabling a wider set of users.
Table 1 Examples of research works in the abstraction level rise towards DSLs
Language | Hardware design | Run-time management
Python | - | PYNQ (Xilinx 2016)
OpenCL | Altera (Singh 2011), Xilinx (Wirbel 2014) | Altera (Singh 2011), Xilinx (Wirbel 2014)
Java | MaxJ (Maxeler 2011) | Max run-time (Maxeler 2011)
Scala | Chisel (Bachrach et al. 2012) | -
Halide | FROST (Del Sozzo et al. 2017), Pu et al. (2017) | Pu et al. (2017)
Darkroom | Hegarty et al. (2014) | -
For instance, the PYNQ project (Xilinx 2016) is an open-source framework that enables Python programmers to use complex SoCs, or accelerator boards, supported by a set of predefined libraries and drivers. Another great effort, by Altera, is the integration of the OpenCL language in the FPGA-based design flow, first presented in 2011 (Singh 2011). In 2012, Altera released an official compilation framework for OpenCL-based designs, along with a library for PCIe-based host-FPGA communication (Czajkowski 2012; Settle et al. 2013), with encouraging results. Xilinx followed these efforts and provided an integration in its HLS toolchain (Wirbel 2014). Both Altera and Xilinx exploit the versatility of the OpenCL standard for managing both the run-time of the target platform and the hardware design, effectively opening up the idea of integrating PCIe-based reconfigurable accelerators in a server rack. A different approach is adopted by Maxeler Technologies, which provides server-class CPUs integrated with accelerators based on the dataflow model (Dennis et al. 1980). Moreover, they provide a design language called MaxJ, based on Java, which, compiled by the MaxCompiler tool (Maxeler 2011), enables applications such as the development of design tools for CGRAs, as in Koeplinger et al. (2016). On the design side, several efforts have aimed at improving hardware-software co-design. Among these, Chisel (Bachrach et al. 2012) is a hardware design language developed as a domain specific language (DSL) embedded in Scala (middle column of Fig. 2). Although not specifically designed for FPGAs, it has been employed in projects like the Edge Tensor Processing Unit (TPU) (Google 2018), and has been the enabler of other reconfigurable hardware works (Margerm et al. 2018). Among the different DSLs in the literature, Halide (Ragan-Kelley et al. 2013) has been particularly attractive for the reconfigurable computing field.
Halide is a high-level language that focuses on image processing, whose distinctive feature is the decoupling of the code that defines the computation from the code that defines its scheduling. The image processing field is especially attractive for reconfigurable computing since several of its applications are characterized by a high degree of parallelism. Darkroom (Hegarty et al. 2014) is a framework for image processing pipelines able to target different backend devices. In particular, it can transform high-level descriptions of image processing pipelines into hardware code for both ASIC and FPGA devices. In 2017, Pu et al. presented a framework that performs DSL-to-FPGA transformation based on the Halide language (Pu et al. 2017). The framework can generate the bitstream for Zynq-based systems alongside a multi-threaded software program that controls the hardware. In the same year, FROST (Del Sozzo et al. 2017) was presented. FROST is a backend compatible with both the Halide IR and the Tiramisú compiler (Ray 2018; Baghdadi et al. 2019), which enables the transformation from a high-level language to an FPGA design, further processed by the SDAccel toolchain, as in Fig. 2. As highlighted in this Section, all these steps aim at opening the reconfigurable computing world to a wider public, by means of tools that make these devices more user-friendly, and pave the way for the first attempts at a paradigm shift, such as domain specialization. The following Section guides the reader through the second line of development introduced at the beginning of this Section, namely the shift in the use of reconfigurable systems.
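The separation between computation and scheduling that characterizes Halide can be illustrated with a minimal sketch. The code below is plain Python and does not use the Halide API: the function names, the 3-tap blur, and the tile sizes are assumptions chosen only for illustration. The algorithm states what each output pixel is, while each schedule only decides in which order the pixels are visited.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Image = List[List[int]]
Point = Tuple[int, int]  # (x, y)


def blur_x(img: Image, x: int, y: int) -> int:
    """Algorithm: 3-tap horizontal box blur with clamped borders."""
    w = len(img[0])
    left = img[y][max(x - 1, 0)]
    right = img[y][min(x + 1, w - 1)]
    return (left + img[y][x] + right) // 3


def row_major(w: int, h: int) -> Iterator[Point]:
    """Schedule 1: visit pixels row by row."""
    for y in range(h):
        for x in range(w):
            yield x, y


def tiled(w: int, h: int, tx: int, ty: int) -> Iterator[Point]:
    """Schedule 2: visit pixels tile by tile (tiles of at most tx by ty)."""
    for y0 in range(0, h, ty):
        for x0 in range(0, w, tx):
            for y in range(y0, min(y0 + ty, h)):
                for x in range(x0, min(x0 + tx, w)):
                    yield x, y


def realize(algorithm: Callable[[Image, int, int], int],
            img: Image, schedule: Iterable[Point]) -> Image:
    """Evaluate the algorithm at every point, in the order given by the schedule."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for x, y in schedule:
        out[y][x] = algorithm(img, x, y)
    return out


image = [[(3 * x + 5 * y) % 17 for x in range(7)] for y in range(5)]
by_rows = realize(blur_x, image, row_major(7, 5))
by_tiles = realize(blur_x, image, tiled(7, 5, 4, 2))
print("same result under both schedules:", by_rows == by_tiles)
```

Because the result does not depend on the schedule, a backend is free to pick the traversal that best matches the target, for instance one that maps tiles onto on-chip buffers of an FPGA.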
4 Recent Trends in the Reconfigurable Systems Spotlight
The aim of this Section is to provide an overview of the latest trends in the reconfigurable computing community. Specifically, reconfigurable computing systems have become one of the standard commodities available in the cloud, and have grown more heterogeneous, with an increasing number of problems linked to the communication infrastructure. Last but not least, there is still an open question about the most suitable use of reconfigurable systems. Should we tailor the reconfigurable system to a single domain with a wide range of applications, should we exploit a single reconfigurable domain specific architecture (DSA) engine, or is an automated tool to rule them all the better choice?
4.1 Reconfigurable Computing in the Cloud: Hardware-as-a-Service
The year 2014 marked a turning point in the model of employing FPGAs in the cloud: Microsoft (Putnam et al. 2014) deployed Altera FPGAs, and both IBM (Chen et al. 2014) and Baidu (Ouyang et al. 2014) deployed Xilinx FPGAs, to improve their services. The idea of reconfigurable accelerators for cloud computing gained more and more traction. In 2016, while Microsoft deployed a renewed and improved version of Catapult (Caulfield et al. 2016), with FPGAs attached directly to the network, one of the first surveys on this topic was published (Kachris and Soudris 2016).
The year after the second version of Catapult, FPGA-as-a-Service (FaaS), or Hardware-as-a-Service (HaaS), became a steady reality. In 2017, Amazon presented the AWS F1 instances (Pellerin 2017), with 1, 2, or 8 FPGA devices attached, and Huawei presented its F1 instances of the FACS cloud service (Huawei 2017). From then on, anyone could develop and deploy an FPGA-based service, system, or application without owning the device, letting a cloud provider manage and maintain the infrastructure. The FaaS revolution opens up new business models and new markets, and companies like Nimbix started to deploy their own HPC solutions based on reconfigurable accelerators (Hosseinabady and Núñez-Yáñez 2017; Dang and Skadron 2017), while critical issues arise in the virtualization of reconfigurable fabrics (Niemiec et al. 2019). Moreover, a project such as FireSim (Karandikar et al. 2018) could rely on the FPGAs publicly available in the cloud. FireSim is an FPGA-accelerated hardware simulation environment that enables a more accurate representation of new data-center-like contexts, in order to test hardware or software design iterations without the need for a real deployment (Champion 2017), thereby limiting the cost for the final user.
4.2 Increasing Heterogeneity in Reconfigurable Systems
Following the aforementioned improvements, and considering that homogeneous multi-core processors are failing to provide the desired energy efficiency and performance, especially in data-centers, new devices have been deployed, specifically heterogeneous architectures (Choi et al. 2019). The integration of hardware accelerators in these architectures is gaining interest as a promising solution. Among the different possibilities, FPGA-based heterogeneous devices apply to a wide range of fields thanks to their reconfigurability and their high performance at low power consumption (Choi et al. 2016). Based on these advantages, the industry has produced various platforms, which differ in physical integration and memory coherency. Although these solutions are appealing, they pose different challenges to developers, such as choosing the most suitable one for a specific application (Choi et al. 2019). As a result, we can identify not only different approaches for coupling CPU and FPGA, but also the integration of the FPGA with both CPUs and software programmable accelerators, e.g., in the Xilinx Versal ACAP (Gaide et al. 2019) (right-hand side of Fig. 3). Looking at pure CPU-FPGA coupling, Choi et al. (2019) have provided an interesting classification of the platforms on the market, together with a guide that helps developers decide which platform is best suited for a specific computational paradigm. For the scope of this work, we present some characterizing examples of the various approaches. Traditionally, the FPGA is connected to the CPU through the PCIe interface, with each of them having its own private memory, as in the first version of Microsoft Catapult (Putnam et al. 2014). Other examples, which also allow the final user to implement a custom accelerator in high-level languages, are the Amazon F1 instances (Pellerin 2017) and the Alpha Data FPGA boards (AlphaData 2020). These solutions not only spread reconfigurable fabrics as services to the final users but also, by allowing the use of high-level languages, open them to a wider public. Also based on the PCIe interface, other vendors have proposed coherent shared memory between CPU and FPGA, such as IBM with its Coherent Accelerator Processor Interface (CAPI) for POWER8 (Wile 2014). Following the path of coherent shared memory, but aiming at a tighter CPU-FPGA connection, the first version of the Intel Xeon+FPGA platform, from 2011, exploits a QuickPath Interconnect (QPI) (Oliver et al. 2011). This idea evolved over the years, and in 2016 a further improvement of the Intel Xeon+FPGA platform was presented (Gupta 2017) (center of Fig. 3). This version is a System in Package (SiP) where one or more reconfigurable accelerators are tightly coupled in the same package through a hybrid CPU-FPGA connection based on PCIe and QPI (Gupta 2017). Although the improvements of the communication infrastructure were fostering the potential of these devices, as highlighted in Choi et al. (2019), a coherent interconnection was still impractical because of insufficient bandwidth and high-latency cache designs. To address this, two consortia of companies were born: the OpenCAPI group in 2016 (OpenCAPI Consortium 2016), and the CCIX consortium in 2017 (CCIX 2017). Their work produced two coherent communication standards, called OpenCAPI and CCIX, where CCIX is compatible with, and built on top of, the PCIe stack, while OpenCAPI is a new, open interface (right-hand side of Fig. 3). Indeed, OpenCAPI aims at being an open interface architecture that allows different kinds of accelerators, from smart NICs to reconfigurable or ASIC-based systems, to be connected to the same high-performance, coherent bus in a way that is agnostic to the processor architecture (OpenCAPI 2016).
Fig. 3 Three classes of heterogeneous reconfigurable systems. While the first two, a and b, devise a tight integration with or without a (private) shared memory, e.g., DRAM, the third one, c, may or may not exploit memory coherency through interconnections such as PCIe, OpenCAPI, or CCIX, or may even be directly attached to the network interface card
4.3 Towards Domain-Specialization
With the approach of physical technology limits, such as the end of Dennard scaling (Esmaeilzadeh et al. 2011) and the slowdown of Moore’s law (Hennessy and Patterson 2019), computer architects have to face the growing computing demand differently. Unless a new disruptive technology appears on the market, general purpose computing has reached a limit, and DSAs are the most viable way towards energy-efficient computation (Hennessy and Patterson 2019). In this regard, reconfigurable computing systems represent one of the possible solutions for building custom, domain-specific computing platforms. Indeed, the literature presents a large number of works that focus on a single domain, intended as a wide range of problems solvable with a common approach, but a major question remains open. Is it better to have a single DSA, i.e., a single computation engine to rule them all, or a tool able to generate an application-specific architecture for a target algorithm? Besides, the term DSA has different meanings in the reconfigurable computing world: is a DSA a “fixed” architecture and datapath that exploits the adaptability of a reconfigurable platform, or is it an architecture that is reconfigurable at a coarse grain, at the datapath or processing element level, more like a coarse grained reconfigurable architecture (CGRA)? Though CGRAs could have a remarkable impact, thanks to their low reconfiguration time compared to FPGAs and their high specialization, “they are still immature in terms of programmability, productivity, and adaptability”, as advocated in Liu et al. (2019), who present a comprehensive and in-depth analysis of CGRAs. To avoid misleading definitions, we refer to FPGAs as fine-grained reconfigurable architectures or, at the time of writing, the devices available on the market. Instead, we refer to CGRAs as computing platforms that are reconfigurable at a coarse-grained, or processing element, level (Liu et al. 2019).
Finally, we use the term reconfigurable processor for a programmable CPU with custom logic, i.e., an Application-Specific Instruction-set Processor (ASIP), in which a portion of reconfigurable logic is devoted to implementing reconfigurable functional units (Chattopadhyay et al. 2008; Barat et al. 2002). Among the several possible domains, machine learning (ML) has found application in several fields, from computational theory to software engineering, from fraud detection to video segmentation, and many companies have been founded to address problems within this domain. The reconfigurable computing world contributes to the growth of machine learning, especially for the inference of neural networks (NNs) and deep neural networks (DNNs). Among the startups born in this context, DeePhi, now acquired by Xilinx, focuses on a hardware-software co-design methodology to efficiently map NN-based computations on either an FPGA-based deep processing unit (DPU) or an ASIC-based DPU (Guo et al. 2016). Several methodologies to build efficient FPGA-based NN inference accelerators have been proposed in the literature, and we refer the reader interested in more details to Guo et al. (2019). Indeed, Guo et al.
present a survey of efficient techniques to build NN accelerators, focusing on the design of the architecture, on how to compress the model, and on how to design the system. A particular example not reported in the survey is the neural processing unit (NPU) architecture of the Brainwave project (Chung et al. 2018), which aims at serving DNN-based applications in real time at cloud scale. A different example of a DSA for the ML world is described in Nowatzki et al. (2017). The authors propose a stream-dataflow execution model, based on a CGRA architecture organization. The reconfigurable datapath of the Softbrain microarchitecture has shown performance and energy efficiency comparable to state-of-the-art ASICs, while keeping enough flexibility to reduce design and verification time, and thus costs and time to market. Alongside these considerations, Venieris et al. propose an extensive survey of toolchains for mapping convolutional NNs (CNNs) on FPGAs (Venieris et al. 2018), which we suggest as a reference for the topic. Specifically, the survey analyzes in depth the software-hardware automation tools used for mapping CNNs on FPGAs and proposes an interesting classification of the considered architectures. The target hardware of these toolchains can be divided into streaming architectures, which build on top of a highly optimized basic block composed differently for each CNN, such as FINN (Umuroglu et al. 2017), and single computation engines, i.e., a fixed architecture that generally varies the software instruction sequence for different CNNs, such as FP-DNN (Guan et al. 2017). Finally, both Guo et al. (2019) and Venieris et al. (2018) conclude their surveys by calling for an increased effort on hardware-software co-design tools in this domain.
Moving to a completely different domain where reconfigurable computing plays a crucial role, namely communication networks, many works in the literature are torn between two approaches: a DSL-to-hardware mapping or a single DSA to rule them all. In this regard, the reconfigurable match tables (RMT) architecture was proposed in 2013 (Bosshart et al. 2013). Even if the authors lean towards an ASIC-based DSA, the proposed architecture is a reconfigurable packet processing architecture that is software programmable. This work led to the birth of a DSL for packet processing in switch architectures, which is nowadays widely recognized in the community and known as P4 (Bosshart et al. 2014). As a DSL, P4 is designed to be “reconfigurable”, or software programmable, and protocol independent. Moreover, it abstracts completely from any specific packet format, and it is architecture-agnostic, meaning that the burden of targeting the underlying architecture is left to the compiler. The first work that automates the generation of HDL from high-level P4 programs was presented by Benácek et al. (2016), and was then extended with further levels of parallelism by Wang et al. (2017), reaching an outstanding throughput in the order of Tbit/s on an FPGA. On the other hand, Pontarelli et al. present their DSA, called FlowBlaze (Pontarelli et al. 2019). The abstraction model they build upon differs from that of P4. Instead, it uses the same abstraction as the RMT architecture, called OpenFlow (McKeown et al. 2008), and can therefore be considered an extension of RMT. Table 2 summarizes how the approaches to domain specialization are divided for the fields reported in this Section.
Table 2 Summary of different approaches to the domain-specialization of reconfigurable systems
Approaches | DSA: “Fixed” architecture, software programmable | Tool: One architecture for each problem to tackle | Hybrid: “Semi-fixed” architecture with a reconfigurable datapath
Machine Learning | NPU (Chung et al. 2018), FP-DNN (Guan et al. 2017), DeePhi (Guo et al. 2016) | FINN (Umuroglu et al. 2017) | Softbrain (Nowatzki et al. 2017)
Networking | FlowBlaze (Pontarelli et al. 2019) | P4-to-VHDL (Benácek et al. 2016; Wang et al. 2017) | N/A
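The match-action abstraction that underlies the networking approaches discussed in this Section can be sketched in a few lines of Python. This is a minimal software model under our own assumptions, not P4 code and not the RMT or FlowBlaze design: it supports exact matching only, and the field names (dst, ttl, egress_port, drop), the table contents, and the helper names are hypothetical.

```python
from typing import Callable, Dict, List

Packet = Dict[str, int]              # parsed header fields and metadata
Action = Callable[[Packet], None]    # an action rewrites the packet in place


class MatchTable:
    """Exact-match table: looks up one header field and runs the bound action."""

    def __init__(self, key_field: str, default: Action) -> None:
        self.key_field = key_field
        self.default = default
        self.entries: Dict[int, Action] = {}

    def add_entry(self, key: int, action: Action) -> None:
        """Populate the table at run time, as a control plane would."""
        self.entries[key] = action

    def apply(self, pkt: Packet) -> None:
        action = self.entries.get(pkt.get(self.key_field), self.default)
        action(pkt)


def forward(port: int) -> Action:
    """Action factory: send the packet to the given egress port."""
    def act(pkt: Packet) -> None:
        pkt["egress_port"] = port
    return act


def drop(pkt: Packet) -> None:
    pkt["drop"] = 1


def dec_ttl(pkt: Packet) -> None:
    pkt["ttl"] -= 1


def run_pipeline(tables: List[MatchTable], pkt: Packet) -> Packet:
    """Apply the tables in order, stopping as soon as the packet is dropped."""
    for table in tables:
        if pkt.get("drop"):
            break
        table.apply(pkt)
    return pkt


ttl_table = MatchTable("ttl", default=dec_ttl)
ttl_table.add_entry(0, drop)             # expired packets are dropped
routing = MatchTable("dst", default=drop)
routing.add_entry(10, forward(1))
routing.add_entry(20, forward(2))
pipeline = [ttl_table, routing]

print(run_pipeline(pipeline, {"dst": 10, "ttl": 64}))
print(run_pipeline(pipeline, {"dst": 99, "ttl": 5}))
```

The pipeline structure is fixed while the table entries and the bound actions are changed by software. This mirrors, in a very reduced form, the sense in which such architectures are described as software programmable.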
5 Summary and Future Directions
In this chapter, we have summarized the evolution of the reconfigurable computing world, focusing mainly on FPGA-based systems and on the concurrent development of new tools that have helped these technologies spread up to the present day. In our journey through the reconfigurable fabric world, we have gone through different paradigms of use and through the improvement of the tools supporting developers, from the first attempts, addressed to an already skilled audience, to the more recent solutions that open up to a new user base, such as software engineers. Along the way, we have moved from solutions that can reach top performance only with an in-depth knowledge of the underlying system to the latest tools, which guide users through the entire process, from high-level code to the deployment of an entire system. In the wake of this opening to different audiences, reconfigurable fabrics have become a valuable option in data-centers, given their reconfigurability, high performance, and lower power consumption compared to general purpose multi-core architectures. Such an opening leads to the need for heterogeneous systems that couple the existing general purpose processors with hardware accelerators, with all the related problems of interconnection and memory coherency. On the other hand, thanks to their entry into the cloud market, a new paradigm has become popular: hardware as a service (HaaS), where the final users exploit the hardware resources of a cloud provider, which takes care of the maintenance and management costs of the physical board. Nowadays, the coverage of reconfigurable fabrics is widening, and they are gaining more and more popularity in different application fields, such as data-centers and highly compute-intensive applications.
Though several domains benefit from reconfigurable fabrics, the database domain still seems to suffer from the interconnect I/O and programmability issues discussed earlier, which continue to prevent the adoption of FPGA systems in the database industry (Fang et al. 2020).
In this regard, new generations of reconfigurable systems will become key players in memory-bandwidth-hungry applications through the integration of hard blocks for interconnectivity, such as OpenCAPI (OpenCAPI 2016), and of high-bandwidth memories, such as HBM (Lee et al. 2014). Indeed, one possible solution to many of these problems could be the Xilinx ACAP (Gaide et al. 2019), which is planned to have the memory controller and hardened interfaces built into its architecture instead of being deployed in the programmable logic. ACAP integrates, within the same chip, a combination of CPUs, programmable logic resources, and multi-core, tile-based, software-programmable engines. The proposed heterogeneity would certainly increase the performance obtainable from a single reconfigurable system, but it poses huge challenges for hardware/software co-design and increases the complexity of the design itself. To this end, we think that CGRAs (Liu et al. 2019) could play a key role, thanks to their flexibility and computation capabilities, though the reconfigurable computing community will have to invest significant effort in extending to CGRAs the work done so far for FPGAs. Another trend that we envision is the growing relevance of reconfigurable systems in cloud computing and HPC. Among the ongoing projects, Honeycomb (MSR 2020), from Microsoft Research (MSR), envisions distributed systems with CPU-free nodes, where specialized hardware, currently FPGAs, removes the overhead of generality that comes with CPUs. Along with these architectural and usage improvements, CAD tools and software-hardware co-design methodologies must grow to make all these features usable. Without usability, hardware improvements will not lead to further progress, only to powerful hardware that nobody can exploit.
Therefore, in this struggle for domain specialization, we envision continuous efforts on an open question: will it be a dedicated tool for each domain’s algorithms, or a single engine, that wins the competition? Last but not least, one trend that we believe can have an impact on all the research fields of reconfigurable computing, and beyond, is the open-source revolution and agile-style development for hardware. While software has advanced thanks to the many contributions of the open-source world, the hardware community is still largely closed, which does not help an ecosystem grow. Some steps forward have been made with the revolution brought by the open-sourcing of RISC-V (Waterman et al. 2011). Likewise, while the waterfall development model used to be the dominant one, iterative development cycles are now gaining importance (Lee et al. 2016). Thanks to works such as hardware generators (Asanovic et al. 2016) and fast-to-deploy large-scale emulation environments (Karandikar et al. 2018), among others, an agile hardware development lifecycle has been born, alongside research groups working on this topic (Stanford’s Agile Hardware Center 2019). Summing up, we strongly believe that reconfigurable systems are one of the main vehicles of the iterative development style. We also believe that reconfigurable systems play, and will continue to play, a dominant role both in fast and flexible hardware development and in the reduction of time to market and its related costs.
References
AlphaData, Alphadata reconfigurable computing for HPC boards (2020), https://www.alpha-data.com/dcp/
Altera, Implementing FPGA design with the opencl standard (2013), https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/wp/wp-01173-opencl.pdf
K. Asanovic, R. Avizienis, J. Bachrach, S. Beamer, D. Biancolin, C. Celio, H. Cook, D. Dabbelt, J. Hauser, A. Izraelevitz et al., The rocket chip generator. Technical Report (EECS Department, University of California, Berkeley, UCB/EECS-2016-17, 2016)
J. Bachrach, H. Vo, B. Richards, Y. Lee, A. Waterman, R. Avižienis, J. Wawrzynek, K. Asanović, Chisel: constructing hardware in a scala embedded language, in DAC Design Automation Conference (IEEE, 2012), pp. 1212–1221
R. Baghdadi, J. Ray, M.B. Romdhane, E. Del Sozzo, A. Akkas, Y. Zhang, P. Suriana, S. Kamil, S. Amarasinghe, Tiramisu: a polyhedral compiler for expressing fast and portable code, in IEEE/ACM International Symposium on Code Generation and Optimization (CGO) (IEEE, 2019), pp. 193–205
F. Barat, R. Lauwereins, G. Deconinck, Reconfigurable instruction set processors from a hardware/software perspective. IEEE Trans. Softw. Eng. 28(9), 847–862 (2002)
P. Benácek, V. Pu, H. Kubátová, P4-to-VHDL: automatic generation of 100 gbps packet parsers, in IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM) (IEEE, 2016), pp. 148–155
P. Bosshart, G. Gibb, H.-S. Kim, G. Varghese, N. McKeown, M. Izzard, F. Mujica, M. Horowitz, Forwarding metamorphosis: fast programmable match-action processing in hardware for SDN. ACM SIGCOMM Comput. Commun. Rev. 43(4), 99–110 (2013)
P. Bosshart, D. Daly, G. Gibb, M. Izzard, N. McKeown, J. Rexford, C. Schlesinger, D. Talayco, A. Vahdat, G. Varghese et al., P4: programming protocol-independent packet processors. ACM SIGCOMM Comput. Commun. Rev. 44(3), 87–95 (2014)
A. Canis, J. Choi, M. Aldham, V. Zhang, A. Kammoona, J.H. Anderson, S. Brown, T. Czajkowski, Legup: high-level synthesis for FPGA-based processor/accelerator systems, in Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays (2011), pp. 33–36
A.M. Caulfield, E.S. Chung, A. Putnam, H. Angepat, J. Fowers, M. Haselman, S. Heil, M. Humphrey, P. Kaur, J.-Y. Kim, A cloud-scale acceleration architecture, in 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) (IEEE, 2016), pp. 1–13
CCIX, CCIX: a new coherent multichip interconnect for accelerated use cases (2017), https://www.ccixconsortium.com/wp-content/uploads/2018/08/ArmTechCon17-CCIX-A-New-CoherentMultichip-Interconnect-for-Accelerated-Use-Cases.pdf
M. Champion, Bringing datacenter-scale hardware-software co-design to the cloud with FireSim and Amazon EC2 F1 instances, in AWS Compute Blog (2017)
S. Chandrakar, D. Gaitonde, T. Bauer, Enhancements in ultrascale CLB architecture, in Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (2015), pp. 108–116
A. Chattopadhyay, R. Leupers, H. Meyr, G. Ascheid, Language-Driven Exploration and Implementation of Partially Reconfigurable ASIPs (Springer Science & Business Media, 2008)
F. Chen, Y. Shan, Y. Zhang, Y. Wang, H. Franke, X. Chang, K. Wang, Enabling FPGAs in the cloud, in Proceedings of the 11th ACM Conference on Computing Frontiers (2014), pp. 1–10
S.A. Chin, N. Sakamoto, A. Rui, J. Zhao, J.H. Kim, Y. Hara-Azumi, J. Anderson, CGRA-ME: a unified framework for CGRA modelling and exploration, in 2017 IEEE 28th International Conference on Application-Specific Systems, Architectures and Processors (ASAP) (IEEE, 2017), pp. 184–189
J. Choi, S.D. Brown, J.H. Anderson, From pthreads to multicore hardware systems in legup high-level synthesis for FPGAs. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 25(10), 2867–2880 (2017)
Y.-K. Choi, J. Cong, Z. Fang, Y. Hao, G. Reinman, P. Wei, A quantitative analysis on microarchitectures of modern CPU-FPGA platforms, in Proceedings of the 53rd Annual Design Automation Conference (2016), pp. 1–6
Y.-K. Choi, J. Cong, Z. Fang, Y. Hao, G. Reinman, P. Wei, In-depth analysis on microarchitectures of modern heterogeneous CPU-FPGA platforms. ACM Trans. Reconfigurable Technol. Syst. (TRETS) 12(1), 1–20 (2019)
E. Chung, J. Fowers, K. Ovtcharov, M. Papamichael, A. Caulfield, T. Massengill, M. Liu, D. Lo, S. Alkalay, M. Haselman et al., Serving DNNs in real time at datacenter scale with project brainwave. IEEE Micro 38(2), 8–20 (2018)
K. Compton, S. Hauck, Reconfigurable computing: a survey of systems and software. ACM Comput. Surv. (CSUR) 34(2), 171–210 (2002)
J. Cong, B. Liu, S. Neuendorffer, J. Noguera, K. Vissers, Z. Zhang, High-level synthesis for FPGAs: From prototyping to deployment. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 30(4), 473–491 (2011)
T.S. Czajkowski, U. Aydonat, D. Denisenko, J. Freeman, M. Kinsner, D. Neto, J. Wong, P. Yiannacouras, D.P. Singh, From opencl to high-performance hardware on FPGAs, in 22nd International Conference on Field Programmable Logic and Applications (FPL) (IEEE, 2012), pp. 531–534
V. Dang, K. Skadron, Acceleration of frequent itemset mining on FPGA using SDAccel and Vivado HLS, in 2017 IEEE 28th International Conference on Application-Specific Systems, Architectures and Processors (ASAP) (IEEE, 2017), pp. 195–200
A. De La Piedra, A. Braeken, A. Touhafi, Sensor systems based on FPGAs and their applications: a survey. Sensors 12(9), 12235–12264 (2012)
A. DeHon, The density advantage of configurable computing. Computer 33(4), 41–49 (2000)
E. Del Sozzo, R. Baghdadi, S. Amarasinghe, M.D. Santambrogio, A common backend for hardware acceleration on FPGA, in 2017 IEEE International Conference on Computer Design (ICCD) (IEEE, 2017), pp. 427–430
J.B. Dennis, Data flow supercomputers. Computer 11, 48–56 (1980)
L. Di Tucci, M. Rabozzi, L. Stornaiuolo, M.D. Santambrogio, The role of CAD frameworks in heterogeneous FPGA-based cloud systems, in IEEE International Conference on Computer Design (ICCD) (IEEE, 2017), pp. 423–426
H. Esmaeilzadeh, E. Blem, R.S. Amant, K. Sankaralingam, D. Burger, Dark silicon and the end of multicore scaling, in 38th Annual International Symposium on Computer Architecture (ISCA) (IEEE, 2011), pp. 365–376
J. Fang, Y.T. Mulder, J. Hidders, J. Lee, H.P. Hofstee, In-memory database acceleration on FPGAs: a survey. VLDB J. 29(1), 33–59 (2020)
T. Feist, Vivado design suite. White Paper 5, 30 (2012)
F. Fricke, A. Werner, K. Shahin, M. Hübner, CGRA tool flow for fast run-time reconfiguration, in International Symposium on Applied Reconfigurable Computing (Springer, 2018), pp. 661–672
B. Gaide, D. Gaitonde, C. Ravishankar, T. Bauer, Xilinx adaptive compute acceleration platform: versal™ architecture, in Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (2019), pp. 84–93
Google, Experiences building edge TPU with chisel (2018), https://www.youtube.com/watch?v=x85342Cny8c
D. Grant, C. Wang, G.G. Lemieux, A CAD framework for Malibu: an FPGA with time-multiplexed coarse-grained elements, in Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays (2011), pp. 123–132
Y. Guan, H. Liang, N. Xu, W. Wang, S. Shi, X. Chen, G. Sun, W. Zhang, J. Cong, FP-DNN: An automated framework for mapping deep neural networks onto FPGAs with RTL-HLS hybrid templates, in IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM) (IEEE, 2017), pp. 152–159
Z. Guo, W. Najjar, F. Vahid, K. Vissers, A quantitative analysis of the speedup factors of FPGAs over processors, in Proceedings of the 2004 ACM/SIGDA 12th International Symposium on Field Programmable Gate Arrays (2004), pp. 162–170
K. Guo, L. Sui, J. Qiu, S. Yao, S. Han, Y. Wang, H. Yang, From model to FPGA: Software-hardware co-design for efficient neural network acceleration, in IEEE Hot Chips 28 Symposium (HCS) (IEEE, 2016), pp. 1–27
K. Guo, S. Zeng, J. Yu, Y. Wang, H. Yang, [DL] A survey of FPGA-based neural network inference accelerators. ACM Trans. Reconfigurable Technol. Syst. (TRETS) 12(1), 1–26 (2019)
P. Gupta, Accelerating datacenter workloads, in 26th International Conference on Field Programmable Logic and Applications (FPL), vol. 2017 (2016), p. 20
J. Hegarty, J. Brunhaver, Z. DeVito, J. Ragan-Kelley, N. Cohen, S. Bell, A. Vasilyev, M. Horowitz, P. Hanrahan, Darkroom: compiling high-level image processing code into hardware pipelines. ACM Trans. Graph. 33(4), 144–1 (2014)
J.L. Hennessy, D.A. Patterson, A new golden age for computer architecture. Commun. ACM 62(2), 48–60 (2019)
M. Hosseinabady, J.L. Núñez-Yáñez, Pipelined streaming computation of histogram in FPGA OpenCL, in PARCO (2017), pp. 632–641
Huawei, Huawei releases the new-generation intelligent cloud hardware platform Atlas (2017), https://www.huawei.com/us/news/global/2017/201709061557
IceStorm, Project IceStorm website (2015), http://www.clifford.at/icestorm/
Intel, Intel HLS documentation (2020b), https://www.intel.com/content/www/us/en/programmable/products/design-software/high-level-design/intel-hls-compiler/support.html
Intel, Intel quartus documentation (2020a), https://www.intel.com/content/www/us/en/programmable/products/design-software/fpga-design/quartus-prime/user-guides.html
C. Kachris, D. Soudris, A survey on reconfigurable accelerators for cloud computing, in 26th International Conference on Field Programmable Logic and Applications (FPL) (IEEE, 2016), pp. 1–10
S. Karandikar, H. Mao, D. Kim, D. Biancolin, A. Amid, D. Lee, N. Pemberton, E. Amaro, C. Schmidt, A. Chopra, Firesim: FPGA-accelerated cycle-exact scale-out system simulation in the public cloud, in ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA) (IEEE, 2018), pp. 29–42
R.M. Karp, R.E. Miller, S. Winograd, The organization of computations for uniform recurrence equations. J. ACM (JACM) 14(3), 563–590 (1967)
D. Koeplinger, R. Prabhakar, Y. Zhang, C. Delimitrou, C. Kozyrakis, K. Olukotun, Automatic generation of efficient accelerators for reconfigurable hardware, in ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA) (IEEE, 2016), pp. 115–127
I. Kuon, J. Rose, Measuring the gap between FPGAs and ASICs. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 26(2), 203–215 (2007)
I. Kuon, R. Tessier, J. Rose et al., FPGA architecture: survey and challenges. Found. Trends® Electron. Des. Autom. 2(2), 135–253 (2008)
C. Lavin, A. Kaviani, Rapidwright: enabling custom crafted implementations for FPGAs, in IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM) (IEEE, 2018), pp. 133–140
C. Lavin, M. Padilla, J. Lamprecht, P. Lundrigan, B. Nelson, B. Hutchings, RapidSmith: do-it-yourself CAD tools for Xilinx FPGAs, in 2011 21st International Conference on Field Programmable Logic and Applications (IEEE, 2011), pp. 349–355
D.U. Lee, K.W. Kim, K.W. Kim, K.S. Lee, S.J. Byeon, J.H. Kim, J.H. Cho, J. Lee, J.H. Chun, A 1.2 v 8 gb 8-channel 128 gb/s high-bandwidth memory (HBM) stacked dram with effective I/O test circuits. IEEE J. Solid-State Circuits 50(1), 191–203 (2014)
Y. Lee, A. Waterman, H. Cook, B. Zimmer, B. Keller, A. Puggelli, J. Kwak, R. Jevtic, S. Bailey, M. Blagojevic et al., An agile approach to building RISC-V microprocessors. IEEE Micro 36(2), 8–20 (2016)
L. Liu, J. Zhu, Z. Li, Y. Lu, Y. Deng, J. Han, S. Yin, S. Wei, A survey of coarse-grained reconfigurable architecture and design: taxonomy, challenges, and applications. ACM Comput. Surv. (CSUR) 52(6), 1–39 (2019)
S. Margerm, A. Sharifian, A. Guha, A. Shriraman, G. Pokam, TAPAS: generating parallel accelerators from parallel programs, in 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) (IEEE, 2018), pp. 245–257
Maxeler, Maxcompiler white paper (2011), https://www.maxeler.com/media/documents/MaxelerWhitePaperMaxCompiler.pdf
N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, J. Turner, Openflow: enabling innovation in campus networks. ACM SIGCOMM Comput. Commun. Rev. 38(2), 69–74 (2008)
MSR, Honeycomb (2020), https://www.microsoft.com/en-us/research/project/honeycomb/
R. Nane, V.-M. Sima, B. Olivier, R. Meeuws, Y. Yankova, K. Bertels, DWARV 2.0: a CoSy-based C-to-VHDL hardware compiler, in 22nd International Conference on Field Programmable Logic and Applications (FPL) (IEEE, 2012), pp. 619–622
R. Nane, V.-M. Sima, C. Pilato, J. Choi, B. Fort, A. Canis, Y.T. Chen, H. Hsiao, S. Brown, F. Ferrandi et al., A survey and evaluation of fpga high-level synthesis tools. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 35(10), 1591–1604 (2015)
G. Natale, G. Stramondo, P. Bressana, R. Cattaneo, D. Sciuto, M.D. Santambrogio, A polyhedral model-based framework for dataflow implementation on FPGA devices of iterative stencil loops, in 2016 IEEE/ACM International Conference on Computer-Aided Design (ICCAD) (IEEE, 2016), pp. 1–8
G.S. Niemiec, L.M. Batista, A.E. Schaeffer-Filho, G.L. Nazar, A survey on FPGA support for the feasible execution of virtualized network functions. IEEE Commun. Surv. Tutorials (2019)
T. Nowatzki, V. Gangadhar, N. Ardalani, K. Sankaralingam, Stream-dataflow acceleration, in ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA) (IEEE, 2017), pp. 416–429
NVIDIA, Compute unified device architecture programming guide, in Nvidia website (2007)
N. Oliver, R.R. Sharma, S. Chang, B. Chitlur, E. Garcia, J. Grecco, A. Grier, N. Ijih, Y. Liu, P. Marolia et al., A reconfigurable computing system based on a cache-coherent fabric, in 2011 International Conference on Reconfigurable Computing and FPGAs (IEEE, 2011), pp. 80–85
OpenCAPI Consortium, Tech leaders unite to enable new cloud datacenter server designs for big data, machine learning, analytics, and other emerging workloads (2016), https://opencapi.org/2016/10/tech-leaders-unite-to-enable-new-cloud-datacenter-server-designs-for-big-datamachine-learning-analytics-and-other-emerging-workloads/, October 2016
OpenCAPI, Opencapi a data-centric approach to server design (2016), https://opencapi.org/wpcontent/uploads/2016/09/OpenCAPI-Exhibit-SC17.pdf
J. Ouyang, S. Lin, W. Qi, Y. Wang, B. Yu, S. Jiang, SDA: software-defined accelerator for large-scale DNN systems, in IEEE Hot Chips 26 Symposium (HCS) (IEEE, 2014), pp. 1–23
D. Pellerin, FPGA accelerated computing using AWS F1 instances (AWS Public Sector Summit, 2017)
C. Pilato, F. Ferrandi, Bambu: a modular framework for the high level synthesis of memory-intensive applications, in 2013 23rd International Conference on Field programmable Logic and Applications (IEEE, 2013), pp. 1–4
K. Pocek, R. Tessier, A. DeHon, Birth and adolescence of reconfigurable computing: a survey of the first 20 years of field-programmable custom computing machines, in IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines (IEEE, 2013), pp. 1–17
S. Pontarelli, R. Bifulco, M. Bonola, C. Cascone, M. Spaziani, V. Bruschi, D. Sanvito, G. Siracusano, A. Capone, M. Honda, F. Huici, G. Siracusano, Flowblaze: stateful packet processing in hardware, in 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19), USENIX Association, Boston, MA, Feb. 2019, pp. 531–548, https://www.usenix.org/conference/nsdi19/presentation/pontarelli
L.-N. Pouchet, P. Zhang, P. Sadayappan, J. Cong, Polyhedral-based data reuse optimization for configurable computing, in Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays (2013), pp. 29–38
R. Prabhakar, Y. Zhang, D. Koeplinger, M. Feldman, T. Zhao, S. Hadjis, A. Pedram, C. Kozyrakis, K. Olukotun, Plasticine: a reconfigurable architecture for parallel patterns, in ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA) (IEEE, 2017), pp. 389–402
Project X-Ray, Project X-ray repository (2017), https://github.com/SymbiFlow/prjxray
J. Pu, S. Bell, X. Yang, J. Setter, S. Richardson, J. Ragan-Kelley, M. Horowitz, Programming heterogeneous systems from an image processing DSL. ACM Trans. Archit. Code Optim. (TACO) 14(3), 1–25 (2017)
A. Putnam, A.M. Caulfield, E.S. Chung, D. Chiou, K. Constantinides, J. Demme, H. Esmaeilzadeh, J. Fowers, G.P. Gopal, J. Gray, A reconfigurable fabric for accelerating large-scale datacenter services, in ACM/IEEE 41st International Symposium on Computer Architecture (ISCA) (IEEE, 2014), pp. 13–24
M. Rabozzi, R. Brondolin, G. Natale, E. Del Sozzo, M. Huebner, A. Brokalakis, C. Ciobanu, D. Stroobandt, M.D. Santambrogio, A CAD open platform for high performance reconfigurable systems in the extra project, in IEEE Computer Society Annual Symposium on VLSI (ISVLSI) (IEEE, 2017), pp. 368–373
M. Rabozzi, G. Natale, E. Del Sozzo, A. Scolari, L. Stornaiuolo, M.D. Santambrogio, Heterogeneous exascale supercomputing: the role of CAD in the exaFPGA project, in Design, Automation & Test in Europe Conference and Exhibition (DATE) (IEEE, 2017), pp. 410–415
J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, S. Amarasinghe, Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. ACM Sigplan Notices 48(6), 519–530 (2013)
J.M. Ray, A unified compiler backend for distributed, cooperative heterogeneous execution. Ph.D. Dissertation (Massachusetts Institute of Technology, 2018)
M. Schmid, F. Hannig, R. Tanase, J. Teich, High-level synthesis revised: generation of FPGA accelerators from a domain-specific language using the polyhedron model (2013)
S.O. Settle et al., High-performance dynamic programming on FPGAs with OpenCL, in Proceedings of the IEEE High Performance Extreme Computing Conference (HPEC) (2013), pp. 1–6
D. Shah, E. Hung, C. Wolf, S. Bazanski, D. Gisselquist, M. Milanovic, Yosys+ nextpnr: an open source framework from verilog to bitstream for commercial FPGAs, in IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM) (IEEE, 2019), pp. 1–4
L. Shannon, V. Cojocaru, C.N. Dao, P.H. Leong, Technology scaling in FPGAs: trends in applications and architectures, in IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines (IEEE, 2015), pp. 1–8
D. Singh, Implementing FPGA design with the opencl standard. Altera White Paper 1 (2011)
Stanford’s Agile Hardware Center, Creating an agile hardware flow, in 2019 IEEE Hot Chips 31 Symposium (HCS) (2019)
Symbiflow, Symbiflow project website (2018), https://symbiflow.github.io/
R. Tessier, K. Pocek, A. DeHon, Reconfigurable computing architectures. Proc. IEEE 103(3), 332–354 (2015)
S.M.S. Trimberger, Three ages of FPGAs: a retrospective on the first thirty years of fpga technology: this paper reflects on how Moore’s law has driven the design of FPGAs through three epochs: the age of invention, the age of expansion, and the age of accumulation. IEEE Solid-State Circuits Mag. 10(2), 16–29 (2018)
Y. Umuroglu, N.J. Fraser, G. Gambardella, M. Blott, P. Leong, M. Jahre, K. Vissers, FINN: a framework for fast, scalable binarized neural network inference, in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Ser. FPGA’17 (ACM, 2017), pp. 65–74
S.I. Venieris, A. Kouris, C.-S. Bouganis, Toolflows for mapping convolutional neural networks on FPGAs: a survey and future directions (2018), arXiv:1803.05900
K. Vipin, S.A. Fahmy, FPGA dynamic and partial reconfiguration: a survey of architectures, methods, and applications. ACM Comput. Surv. (CSUR) 51(4), 1–39 (2018)
H. Wang, R. Soulé, H.T. Dang, K.S. Lee, V. Shrivastav, N. Foster, H. Weatherspoon, P4FPGA: a rapid prototyping framework for p4, in Proceedings of the Symposium on SDN Research (2017), pp. 122–135
A. Waterman, Y. Lee, D.A. Patterson, K. Asanovic, The RISC-V instruction set manual, volume I: Base user-level ISA, vol. 116. Technical Report (EECS Department, UC Berkeley, UCB/EECS-2011-62, 2011)
B. Wile, Coherent accelerator processor interface (CAPI) for power8 systems white paper, in IBM Systems and Technology Group (2014)
L. Wirbel, Xilinx SDAccel: a unified development environment for tomorrow’s data center (The Linley Group Inc., 2014)
Xilinx, Pynq: Python for productivity for Zynq (2016), http://www.pynq.io/
Xilinx, Sdaccel Press Release (2014), https://www.xilinx.com/news/press/2014/xilinxannounces-sdaccel-development-environment-for-opencl-c-and-c-delivering-up-to-25xbetter-performance-watt-to-the-data-center.html
Xilinx, Xilinx vitis unified software platform (2019), https://www.xilinx.com/products/designtools/vitis/vitis-platform.html, October 2019
Xilinx, Zynq SoC family (2016), https://www.xilinx.com/products/silicon-devices/soc/zynq-7000.html
D. Ziakas, A. Baum, R.A. Maddox, R.J. Safranek, Intel® quickpath interconnect architectural features supporting scalable system architectures, in 18th IEEE Symposium on High Performance Interconnects (IEEE, 2010), pp. 1–6
W. Zuo, P. Li, D. Chen, L.-N. Pouchet, S. Zhong, J. Cong, Improving polyhedral code generation for high-level synthesis, in 2013 International Conference on Hardware/Software Codesign and System Synthesis (CODES + ISSS) (IEEE, 2013), pp. 1–10