Software Defined Chips: Volume II (ISBN 981197635X, 9789811976353)

This book is the second volume of a two-volume book set which introduces software-defined chips.

Table of contents:
Foreword
Preface
Contents
Introduction
1 Programming Model
1.1 Dilemma of the Programming Model of Software-Defined Chips
1.2 Three Routes
1.3 Three Obstacles
1.3.1 Von Neumann Architecture and Random Access Machine Model
1.3.2 Memory Wall
1.3.3 Power Wall
1.3.4 I/O Wall
1.4 Impossible Trinity
1.5 Three Types of Exploration
1.5.1 Spatial Domain Parallelism and Irregular Application
1.5.2 Programming Model of Spatial Domain Parallelism
1.6 Summary and Prospect
References
2 Hardware Security and Reliability
2.1 Security
2.1.1 Countermeasures Against Fault Attacks
2.1.2 Countermeasures Against Side Channel Attacks
2.1.3 PUF Technology Based on SDC
2.2 Reliability
2.2.1 Topology Reconfiguration Method Based on Maximum Flow Algorithm
2.2.2 Multi-objective Mapping Optimization Method for Reconfigurable Network-on-Chip
References
3 Technical Difficulties and Development Trend
3.1 Analysis of Technical Difficulties
3.1.1 Flexibility: Programmability Design Coordinating the Software and Hardware
3.1.2 Efficiency: Tradeoff Between Hardware Parallelism and Utilization
3.2 Instruction-Level Parallelism
3.3 Data-Level Parallelism
3.4 Memory-Level Parallelism
3.5 Task-Level Parallelism
3.6 Speculation Parallelism
3.6.1 Ease of Use: Optimizing Virtualized Hardware with Software Scheduling
3.6.2 Prospects on Development Trend
3.7 Independent Task-Level Parallelism
3.8 Data-Level Parallelism
3.9 Bit-Level Parallelism
3.10 Optimization of Memory Access Patterns
3.10.1 Multi-Level Parallelism Design for In-/near-Memory Computing
3.11 Implementation of Instruction-Level Parallelism in SDCs
3.12 Implementation of Data-Level Parallelism in SDCs
3.13 Implementation of Task-Level Parallelism in SDCs
3.14 Implementation of Speculation Parallelism in SDCs
3.15 Efficiency of Memory in the SDC
3.15.1 Software-Transparent Hardware Dynamic Optimization Design
3.16 Virtualization of SDCs
3.17 Online Training by Means of Machine Learning
References
4 Current Application Fields
4.1 Analysis of Application Fields
4.2 Artificial Intelligence
4.2.1 Algorithm Analysis
4.2.2 State-of-the-Art Artificial Intelligence Chips
4.2.3 Software-Defined Artificial Intelligence Chip
4.3 5G Communication Baseband
4.3.1 Algorithm Analysis
4.3.2 State-of-the-Art Research on Communication Baseband Chips
4.3.3 Software-Defined Communication Baseband Chip
4.4 Cryptographic Computation
4.4.1 Analysis of Cryptographic Algorithms
4.4.2 Current Status of the Research on Cryptographic Chips
4.4.3 Software-Defined Cryptographic Chips
4.5 Hardware Security of the Processor
4.5.1 Background
4.5.2 Analysis of CPU Hardware Security Threats
4.5.3 Existing Countermeasures
4.5.4 CPU Hardware Security Technology Based on Software-Defined Chips
4.6 Graph Computation
4.6.1 Background of Graph Algorithms
4.6.2 Programming Model of Graph Computation
4.6.3 Research Progress of Hardware Architecture for Graph Computing
4.6.4 Outlook
References
5 Future Application Prospects
5.1 Evolutionary Computing
5.1.1 Background and Concept of Evolutionary Computing
5.1.2 The Evolution and State-Of-The-Art Research
5.1.3 Software-Defined Evolutionary Computing Chip
5.2 Post-Quantum Cryptography
5.2.1 Concept and Application of Post-Quantum Cryptographic Algorithms
5.3 Current Status of Post-Quantum Cryptographic Algorithms
5.3.1 Status Quo of the Research on Post-Quantum Cryptographic Chips
5.3.2 Software-Defined Post-Quantum Cryptographic Chip
5.4 Fully Homomorphic Encryption
5.4.1 Concept and Application of Fully Homomorphic Encryption
5.4.2 Status Quo of the Research on Fully Homomorphic Encryption Chips
5.4.3 Software-Defined Fully Homomorphic Encryption Computing Chip
References

Leibo Liu · Shaojun Wei · Jianfeng Zhu · Chenchen Deng

Software Defined Chips Volume II


Leibo Liu School of Integrated Circuits Tsinghua University Beijing, China

Shaojun Wei School of Integrated Circuits Tsinghua University Beijing, China

Jianfeng Zhu School of Integrated Circuits Tsinghua University Beijing, China

Chenchen Deng Beijing National Research Center for Information Science and Technology Tsinghua University Beijing, China

ISBN 978-981-19-7635-3 ISBN 978-981-19-7636-0 (eBook) https://doi.org/10.1007/978-981-19-7636-0 Jointly published with Science Press The print edition is not for sale in China mainland. Customers from China mainland please order the print book from: Science Press. © Science Press 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publishers, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publishers nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publishers remain neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Foreword

When my old friend Prof. Shaojun Wei asked me to write a foreword for his book Software Defined Chip, I really felt a little uneasy. Prof. Shaojun Wei is an expert in the field of integrated circuits in China, while I am a layman when it comes to chips. To be honest, I don't think I'm qualified for this. What emboldened me to accept this mission is that, on the one hand, chips and software are closely connected, both being the most fundamental elements of the various systems of this information era, while, on the other hand, the "software defined" in the name of the book is related to my expertise and is also what I have been progressively promoting and publicizing for all these years.

Looking back at the history of computer development, software existed as an "appurtenance" of hardware (mainly integrated circuit chips) for a long period of time after the first-generation computers were born in the 1940s. After high-level programming languages emerged in the late 1950s, the word "software" began to be used in parallel with "hardware", and has since gradually become independent and formed an independent branch of computer science. However, software did not break away from hardware and become an independent product and commodity until the late 1970s, when the software industry surged. For decades, the collaborative development of software and hardware has underpinned the modern information industry, and provided a constantly upgraded source of power for the development of an information-based human society. Especially after the large-scale commercialization of the Internet in the mid-1990s, a massive and influential social and economic reform took place. The Wintel system we are familiar with is an example of the collaborative development of software and hardware. Software and integrated circuits are the cores and souls of the information technology industry, playing a huge role as enablers and radiators.

In recent years, next-generation information technologies and their applications, represented by cloud computing, big data, artificial intelligence, and the Internet of Things, have widely covered and influenced every part of our society, economy, and lives. Digital transformation and development have become a trend of the times for traditional industries. The digital economy is now a new economic form after the


industrial economy. The digital civilization is approaching, and the human community is now on the verge of an information society. As one of the core enabling technologies of this era, software has been ubiquitously pervading all walks of life, evoking profound changes from the inside. Software is not only an important part of the information infrastructure, but is also becoming the infrastructure for the social and economic activities of mankind in the information era by redefining the infrastructure of the traditional physical world and of social and economic activities. It is a key support for the progression and advancement of human civilization. In this sense, we are stepping into an era where software defines everything, in which everything is interconnected and programmable.

Originating from the "software-defined network", the term "software-defined" has been a popular one in the field of information technologies in recent years. The software-defined network has had a significant impact on the network communication industry, redefining the traditional network architecture and even reshaping the structure of the traditional communication industry. Subsequently, software-defined memory, software-defined environments, and software-defined data centers kept popping up one after another. Currently, "Software-Defined Everything (SDX)" for ubiquitous information technology resources is reshaping the traditional information technology system and has become an important development trend of the information technology industry. Also, the term "software-defined" has begun to extend out of the information world and reach the physical world and human society to play its critical role of "enabling, assignment, and intelligentization". It has also begun to redefine the worldwide landscape where humans, machines, and things are combined.

From the perspective of a software technology researcher, software-defined technology is essentially the "virtualization of basic resources" plus the "programmability of management tasks". In fact, these have always been the principles behind the design and implementation of computer operating systems. The focus is to virtualize the underlying infrastructure resources and open up APIs to achieve flexible and customizable resource management by programmable means. Meanwhile, it condenses and carries the commonalities of the industry to better support and adapt to the needs and changes of the upper-level business systems. Therefore, I regard "software-defined" as a methodology based on platform thinking. The so-called "SDX" means constructing an "operating system" for "X".

For years, my team and I have been focusing on software-defined technologies in the fields of computing systems and the Industrial Internet of Things, and we have achieved some positive results. I have also tried my best to promote and publicize the "software-defined" concept on different occasions, covering manufacturing, equipment, smart cities, smart homes, etc. However, chips were never among them. As software must run on a chip, my inertial thinking was that only a system built on chips could be defined by software. The first time I heard the term "software-defined chip" (SDC) was from Prof. Wei. At the third Future Chip Forum, organized by Prof. Wei at the end of 2018 with the theme "Reconfigurable Chip Technology", I was invited to give a report on software-defined everything. This was an opportunity for me to learn about SDCs and expand my knowledge of "software-defined".


Integrated circuits are among the most complicated design and manufacturing technologies in human history, fully symbolizing the fruits of human wisdom. The integrated circuit industry is also a strategic, fundamental, and leading industry that supports national economic and social development and guarantees national security; it is a research field of national strategic importance. As information technology makes continuous breakthroughs, numerous emerging applications keep springing up and gaining strong development momentum, raising highly demanding requirements for data processing and computing efficiency. The traditional chip architecture is being greatly challenged, and the fact that digital chips cannot reach high energy efficiency and high flexibility simultaneously has become a problem recognized by the international community.

Based on their profound accumulation in the research of integrated circuit design methodologies, Prof. Wei and his team proposed the SDC architecture and its design paradigm, with which software dynamically defines the chip functions, and promoted the transformation of the digital chip architecture and design paradigm. This study not only leads the generalized field of computing chips, but also provides an important lesson for the common problems faced in the era of software-defined technologies.

I am very glad to see that they have expanded in depth on the development background, technical connotation, key applications, and future development of SDCs, based on their understanding of the technical trends of the information industry and the research outcomes long accumulated in the relevant fields, and published China's first book in this field. Here, I would like to extend my sincere congratulations! I believe that this book can act as an important reference for information technology researchers and practitioners, thereby deepening their knowledge and understanding of "software-defined", and greatly contributing to the cultivation of information technology talents, industrial development, and ecosystem construction in China.

Summer of 2021

Hong Mei

Preface

Our team has been studying dynamically reconfigurable chips since 2006. In 2014, we wrote the book "Reconfigurable Computing" and published it with Science Press. That book introduces the basic concepts of reconfigurable computing, as well as the hardware architecture and mapping mechanisms of dynamically reconfigurable chips. In the past 5–6 years, we have further worked on the theories and technologies of SDCs based on our previous outcomes on reconfigurable computing. Software-defined chips (SDCs) and dynamically reconfigurable chips are much alike yet greatly different. Dynamically reconfigurable chips hold an upward view based on chips, with the focus on solving the problems of the circuit itself, while SDCs hold a downward view, with the focus on, in addition to the circuit as always, programming paradigms, compiling systems, etc. After unremitting efforts, our team has published dozens of influential academic papers in solid-state circuits, computer architecture, electronic design automation and the other fields involved in SDCs, and many invention patents have been granted in China and the US. We have enabled a series of market-oriented technical applications for major national projects, and solved some practical challenges encountered in industrial production. Now we compile our research results on SDCs, our analysis of cutting-edge technologies, and our thoughts about the future development of computing chips into a book and share it with all of you. However, as we still lack sufficient knowledge, we look forward to your criticisms and suggestions if there is anything improper in the book.

The SDC is a new paradigm of computing chip architecture design. It is expected to fill the gap between software and hardware, and to directly define the runtime functions of hardware with software, so that the chip can swiftly adjust to software changes while featuring high performance, low power consumption, high flexibility, high programmability, unrestricted capacity, and ease of use, which are rarely seen simultaneously on traditional computing chips. There are two main reasons why SDCs have such technical advantages. Firstly, they feature mixed-grained but mostly coarse-grained reconfigurable processing elements instead of the traditional fine-grained lookup-table logic, which greatly reduces redundant resources. The energy efficiency is improved by one or two orders of magnitude compared with traditional programmable devices (such as FPGAs), and by two or three orders of


magnitude compared with instruction-driven processors (such as CPUs), rivaling that of application-specific circuits (such as ASICs). Secondly, SDCs support dynamic partial reconfiguration, and the switching time can be within a few nanoseconds. Therefore, the capacity can be expanded through fast time-division multiplexing of the hardware, and is no longer limited by the physical scale of the circuits. In other words, similar to a CPU that can run software code of any size, an SDC can hold digital logic of any size and any number of gates, which is very different from an FPGA. Meanwhile, dynamic reconfiguration fits the serial characteristics of software programs better than static reconfiguration, and it is more efficient when programming in high-level languages. Software developers who have no knowledge of circuits can efficiently program SDCs with purely software-based thinking. Lowering the threshold of use will enable agile chip development, speed up application iteration and system deployment, and greatly expand the use of chips.

Software Defined Chip has two volumes. The first volume mainly introduces the conceptual evolution, technical principles, key issues, hardware architecture, and compiling system of SDCs. This book is the second volume and focuses on the following topics: How is the usability of SDCs achieved? What challenges does the programming model face? What are the intrinsic advantages in security and reliability? What technical difficulties are SDCs still facing? How will the technology develop in the future? What applications have been achieved? What are the advantages of these applications compared with traditional computing chips? Which areas have better development prospects in the future?

This book is divided into five chapters. Chapter 1 introduces the programming model of SDCs. By reviewing the co-evolution of the architectures and programming models of modern general-purpose processors, it analyzes the programming models of SDCs as an emerging computing architecture, discusses how the chip design can address the problems of the "memory wall", "power wall", and "I/O wall" brought by the unbalanced development of semiconductor device technologies, and sums up the ternary paradox of programming models, that is, a programming model cannot achieve high generality, high development efficiency, and high execution efficiency at the same time. This chapter also proposes three possible research directions for the programming models of SDCs. Chapter 2 introduces the intrinsic security and reliability of SDCs. In terms of security, it takes the fault attacks on a cryptographic chip as an example to introduce how to use the dynamic partial reconfiguration feature to improve the resistance against side-channel attacks, and how to make full use of the abundant computing units and interconnections to construct a physical unclonable function (PUF) to improve hardware security. In terms of reliability, it takes the network-on-chip (NoC) of SDCs as an example to introduce an efficient topology reconfiguration method to improve the fault tolerance of the system, along with the algorithm mapping optimization technology used after the topology is dynamically changed. Chapter 3 focuses on the main technical bottlenecks faced by SDCs in terms of flexibility, efficiency, and usability, expands on the possibility of new design concepts, and envisions the future development trend of SDC technologies. Chapter 4 analyzes the target application fields of SDCs, and introduces design cases of SDCs in artificial intelligence, 5G communications, cryptography, graph


computing, network protocol processing, and other applications. Chapter 5 envisions the application of SDCs in emerging scenarios in the future, with the focus on emerging technologies such as evolutionary computing, post-quantum cryptography, and fully homomorphic encryption.

This book embodies the collective wisdom accumulated by the reconfigurable computing team at Tsinghua University over the past 10 years. Thanks to the many postdoctoral researchers, doctoral, master and undergraduate students, and engineers for their unremitting efforts. They are Jianfeng Zhu, Chenchen Deng, Wenping Zhu, Honglan Jiang, Jiaji He, Bohan Yang, Zhaoshi Li, Neng Zhang, Huiyu Mo, Xingchen Man, Longlong Chen, Yufeng Huang, Yibo Wu, Weiyi Sun, Dibei Chen, Baofen Yuan, Liwei Sun, Ang Li, Jinyi Chen, Xiangyu Kong, Hanning Wang, and Siming Kou. Thanks to Prof. Shaojun Wei for his helpful support and guidance on the writing of this book, and special thanks to Academician Hong Mei, a well-known expert in system software and software engineering, for reviewing this book and writing a foreword. Finally, I would also like to thank my wife and children (Tuo Tuo and Dou Dou) for their understanding and support of my work. You are an important driver of my work and advancement in the future!

Tsinghua Garden in June 2021

Leibo Liu


Introduction

Software Defined Chip has two volumes, and this book is the second volume. By reviewing the co-evolution of modern general-purpose processors and programming models, this book analyzes the research focus of the programming model of software-defined chips (SDCs). How to utilize the dynamic reconfigurability of the SDC to improve the security and reliability of the chip hardware is also presented. The technical challenges facing SDCs and the directions of future technological breakthroughs are discussed as well. This book covers the latest research on SDCs in artificial intelligence, cryptographic computing, 5G communications, and other fields, as well as future-oriented emerging applications. This book is suitable for scientific researchers, senior graduate students, and engineers in related industries engaged in electronic engineering and computer science.


Chapter 1

Programming Model

All problems in computer science can be solved by another level of indirection, except for the problem of too many layers of indirection. —David Wheeler [1].

The main difference between software-defined chips (SDCs) and ASICs is that SDCs need to execute user-written software like a general-purpose processor. An ASIC is designed only for specific applications; it only needs to provide special APIs, without considering how programmers will program it, whereas the function of an SDC is finally realized by programmers. A necessary condition for a piece of hardware to attract a large number of users to invest in the development of software is that the software on the hardware is forward compatible: even if the new generation of hardware design has changed dramatically, the software previously written by users can still run correctly on the new chip. The "language" for the dialogue between software and hardware is the programming model.

The programming model in the general sense refers to all levels of abstraction from the application to the chip. In the long development process of general-purpose processor chips, a complex hierarchical layer-of-indirection model composed of abstraction levels such as the programming language, compiler intermediate representation and instruction set architecture has gradually formed. In these models, each upper layer of indirection hides the complexity of the layer of indirection below it in turn. For example, to hide the complex flow control caused by the fact that the program counter can jump arbitrarily (e.g., the jump instructions in the x86 instruction set), the programming language layer provides a variety of flow-control statements, such as those in the C language. In this way, when developing applications, programmers only need to develop applications for a specific layer of indirection without considering the complexity of the underlying implementation.

However, as a new computing architecture that differs from both general-purpose processors and ASICs in chip architecture and in programming model design, the SDC faces a "chicken-and-egg" dilemma in the programming model: without an SDC programming model, the design of SDCs is like "water without a source", lacking software to guide the direction of chip design; without the design of SDCs,


the programming model design of SDCs is like "a tree without roots", lacking hardware to test the effectiveness of the programming model. To break the dilemma, this chapter will review the co-evolution of modern general-purpose processor architectures and programming models. Section 1.1 analyzes the causes and effects of the dilemma in detail. Section 1.2 examines the layer-of-indirection structure of modern programming models, and then summarizes three design routes for programming models. Section 1.3 examines how the chip design and programming model should deal with the "three walls" caused by the unbalanced process development of semiconductor devices, namely the "memory wall", "power wall" and "I/O wall". More and more complex hardware has spawned a variety of programming models. Section 1.4 summarizes the "Impossible Trinity of programming models" from the evolution of programming models: a new programming model cannot obtain high generality, high development efficiency and high execution efficiency at the same time; at most, it can achieve two of these goals and must abandon the third. Combined with how hardware complexity is handled at the abstraction levels of the computing system, the rationality of the Impossible Trinity can be explained empirically. Finally, based on the "Impossible Trinity", Sect. 1.5 puts forward three possible research directions for the programming model dilemma of SDCs.

1.1 Dilemma of the Programming Model of Software-Defined Chips

In the past 60 years, humans have created a spectacle: the performance of chips continues to grow exponentially, and the applications based on chips become more and more complex and diverse. As the contract between the chip and the application, the programming model ensures that past applications can be easily transplanted to future chips through the consistency of the contract. However, the end of Moore's law, like taking away the firewood from under the cauldron, has destroyed this spectacle of the computing industry. For SDCs, the chip design should be freed from the constraints of the old contract and be reconsidered from the relationship between chips, programming models and applications.

Without an SDC programming model, the architecture design of SDCs is like "water without a source", lacking software to guide the direction of hardware design. In the research of architecture, the most direct response to a paradigm shift of the hardware is to invent a new (domain-specific) programming model. Although a new programming model is attractive in the short term, it usually means that programmers must rewrite their code, and it will bring serious obstacles to understanding and communication within the software development team, making the learning curve steep. In the rapid iteration stage of a hardware architecture, a lot of human and material resources are directly spent, and it is unrealistic to design and develop an automated


compiler for the evolving architecture. This makes it difficult for the target application to respond quickly to the decisions in hardware design when designing a new hardware paradigm, resulting in the dilemma of no software available. Without the architecture of SDCs, the programming model design of SDCs is like “a tree without roots”, lacking a hardware to support the development of programming model. The programming model functions to hide the complex hardware mechanism. In an era when Moore’s law is still effective in enhancing the performance of general-purpose processors, the design of programming models is much simpler than today. Although the hardware mechanism of the processor may change greatly between generations, the instruction set architecture (ISA) of the new generation processor only needs to add a few or several types of instructions. Therefore, the programming model, compiler and programming language of the previous generation can be applied to the new generation processor with only a few changes made. However, with the failure of Moore’s law in enhancing processor performance, specialization has become the most important performance source of the new generation of hardware. It is difficult to abstract this specialized hardware with a unified or similar ISA. Therefore, different new hardware requires different programming models. When the emerging hardware paradigm has not been finalized, it is difficult for the programming model to clarify which hardware mechanisms to hide. If the paradox of “chicken-and-egg” cannot be solved, the development of SDCs will face two outcomes, that is, either the stagnation of hardware development due to the inability of software to adapt, or the inability of software to use hardware for innovation. To break this dilemma, we need to fundamentally rethink how to design, program and use SDCs. We believe that we can gather the fragmented common sense, and then more consistently understand the design method of SDCs programming model through review of the co-evolution of modern general-purpose processor architecture and programming model, reflection on historical experience and discussion on concepts.

1.2 Three Routes

As mentioned in the introduction of this chapter, the layer of indirection is the main driving force for the growth and productivity progress of the computing industry. Today, most computer architects may not know the working principles of modern microprocessors, nor the technological processes of semiconductor manufacturing. However, by maintaining these interrelated layers of indirection, computer professionals can efficiently code (e.g., using Python) at a higher abstraction level. This is how today's applications are developed. Figure 1.1 shows the typical layers of indirection from top (application) to bottom (chip) in today's computing industry. According to the traditional software and hardware partitioning method, software is above the ISA and hardware is below the ISA. The higher the abstraction level of the layer of indirection, the higher the development efficiency of the program; conversely, the higher the complexity in the lower layer of indirection, the higher the execution efficiency of the program.

Fig. 1.1 Typical diagram of layers of indirection from top (application) to bottom (chip) in computer science (from top to bottom: application, algorithm, programming language, assembly language, instruction set architecture, microarchitecture, register transfer level, physical layer; the layers above the ISA are conventionally regarded as software and those below it as hardware, and they are served, from top to bottom, by application developers, compiler designers, architecture designers, and hardware developers)

A new layer of indirection is introduced to hide the complexity of the layer of indirection below it, thus improving the development efficiency. If the wide range of applications in the whole computing industry are like rows of high-rise buildings, then each layer of indirection is a floor, and the programming model is the cement that binds them together. The programming model in the narrow sense refers to the contract between the layers from the application layer down to the microarchitecture layer. Specifically, the programming model specifies which behaviors in the upper layer are legal and what the execution mechanism of each behavior is in the lower layer. Similar contracts also exist from the microarchitecture layer down to the physical layer; for example, netlist files are used as the contract from the register transfer level to the device layer. These contracts are not in the scope of the programming model discussed in this chapter, since application developers do not deal with them.

However, as stated in the second half of the introduction of this chapter, an excessive number of layers of indirection is a difficult problem to solve. A key problem here is that the introduction of each layer of indirection causes a loss of performance on the chip; more layers of indirection cause a greater performance loss. Therefore, programming languages with a high abstraction level, such as Python and JavaScript, are mainly designed to improve development efficiency and expand the scope of application. To achieve these two goals, high-level languages share many common features. For example, they are usually interpreted in a single thread and have a garbage collection mechanism based on simple algorithms such as reference counting. Because of these characteristics, the execution efficiency of high-level languages is very low.

In 2020, Science magazine published a paper on computer architecture, "There's plenty of room at the Top" [2]. An example in it shows that the execution time of a matrix multiplication program written in Python is 100–60,000 times that of a program written in highly optimized C by developers at the same level, as shown in Table 1.1. Not only that, high-level languages also need more memory to execute. For example, integers in Python occupy 24 bytes instead of the 4 bytes in C (because each object carries type information, a reference count, etc.), and the memory overhead of data structures such as lists or dictionaries is more than 4 times that of C++. Of course, these high-level languages are not designed to make efficient use of hardware. However, when the performance of the chip no longer increases with the progress of Moore's law, the execution efficiency gap between high-level languages and high-performance languages has become a gold mine that has not been fully explored.

Table 1.1 Comparison of acceleration of 4096 × 4096 matrix multiplication performed by different programs [2]

Version  Implementation               Running time/s  GFLOPS   Absolute acceleration  Relative acceleration  Fraction of peak/%
1        Python                       25,552.48       0.005    1                      --                     0.00
2        Java                         2,372.68        0.058    11                     10.8                   0.01
3        C                            542.67          0.253    47                     4.4                    0.03
4        Parallel loops               69.80           1.969    366                    7.8                    0.24
5        Parallel divide and conquer  3.80            36.180   6,727                  18.4                   4.33
6        Plus vectorization           1.10            124.914  23,224                 3.5                    14.96
7        Plus AVX intrinsics          0.41            337.812  62,806                 2.7                    40.45

Note: Each version represents a successive refinement of the original Python source code. "Running time" is the execution time of the version. "GFLOPS" is the number of 64-bit floating-point operations per second (in billions) performed by the version. "Absolute acceleration" is the speedup relative to the Python version, while "relative acceleration", shown with an additional digit of precision, is the speedup compared with the previous version. "Fraction of peak" is the ratio of the version's GFLOPS to the computer's peak of 835 GFLOPS.
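For orientation, the sketch below shows the kind of plain triple-loop code that the "C" row (version 3) of Table 1.1 stands for; versions 4 to 7 keep the same arithmetic but add thread-level parallelism, cache-friendly divide-and-conquer blocking, and vector instructions. It is only an illustrative reconstruction; the code actually measured in [2] differs in its details (timing harness, compiler flags, and so on).

    /* Naive triple-loop matrix multiplication, C = A * B, for n-by-n matrices
     * stored in row-major order. This is the style of code behind the
     * "C" entry (version 3) in Table 1.1. */
    void matmul_naive(int n, const double *A, const double *B, double *C)
    {
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += A[i * n + k] * B[k * n + j];
                C[i * n + j] = sum;
            }
        }
    }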

According to which layer of indirection developers mainly use in the development process, practitioners in the computing industry can be roughly divided into four types (Fig. 1.1): hardware developers are responsible for designing circuits and manufacturing chips, mainly designing the ALU, cache and other modules at the circuit level; architecture designers are responsible for designing the microarchitecture and the ISA, building the computing system out of the modules designed by the hardware developers, and providing the functions of the computing system to the upper-level developers in the form of an ISA or APIs; compiler designers are responsible for designing the programming language and the compiler tool chain according to the application requirements and the architecture characteristics, so that applications written by application developers can be automatically transformed into machine code that can be executed by the target architecture; and application developers are responsible for using the programming language to develop applications. Referring to the previous definition, the programming model can be regarded as the language for the dialogue between application developers and hardware developers, and this language is designed by the architecture designers and compiler designers.

Considering which type of practitioner is responsible for hiding a complex hardware mechanism, we can briefly summarize three design routes for programming models. First, some hardware mechanisms only need to be considered by the architecture designer, and generally do not require the intervention of the compiler. For example, in today's popular domain-specific accelerators, architects usually


provide a set of simple APIs or special instructions for upper-level compilers and application developers to call directly. Secondly, some hardware mechanisms can be handled by the compiler designer without being understood by the application developer. For example, hundreds of registers in the CPU can be allocated automatically by the compiler. Finally, the performance potential of many hardware mechanisms must be fully developed by application developers according to the needs of applications. For example, the concurrent execution mechanism of multithreaded processor needs application developers to write programs in parallel programming language to be fully utilized. The three design routes bring different characteristics to the programming model. The development of programming model is the process of balancing these three routes. The design motivation and programming methods of typical hardware mechanisms will be reviewed in chronological order.
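The third route is the one that places the largest burden on application developers. As a minimal sketch (the array partitioning and the function names are ours, purely for illustration), the C program below uses POSIX threads to split an element-wise array addition across two threads; neither the hardware nor the compiler discovers this parallelism, the developer has to express it explicitly.

    #include <pthread.h>
    #include <stdio.h>

    #define N 1000000
    static double a[N], b[N], c[N];

    struct range { long begin; long end; };

    /* Each thread processes its own half of the arrays. */
    static void *add_range(void *arg)
    {
        struct range *r = (struct range *)arg;
        for (long i = r->begin; i < r->end; i++)
            c[i] = a[i] + b[i];
        return NULL;
    }

    int main(void)
    {
        pthread_t t0, t1;
        struct range r0 = { 0, N / 2 }, r1 = { N / 2, N };

        /* Thread creation, work partitioning and joining are written by hand:
         * this is parallelism developed explicitly by the application developer. */
        pthread_create(&t0, NULL, add_range, &r0);
        pthread_create(&t1, NULL, add_range, &r1);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);

        printf("c[0] = %f\n", c[0]);
        return 0;
    }

By contrast, under the first two routes none of this boilerplate would appear in application code: the mechanism would either be wrapped behind an accelerator API or exploited automatically by the compiler.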

1.3 Three Obstacles

Gene Amdahl is world famous for "Amdahl's Law" [3], which points out that the marginal benefit of parallel computing performance decreases as the number of threads increases. However, Amdahl also proposed a second principle in 1967 [3], called "Amdahl's Rule of Thumb" or "Another Amdahl's Law": hardware architecture design needs to balance computing power, memory bandwidth and I/O bandwidth. The ideal ratio of processor computing performance, memory bandwidth and I/O bandwidth is 1:1:1, that is, one million instructions per second (MIPS) of processor computing performance requires 1 MB of memory and 1 Mbit/s of I/O bandwidth.

"Amdahl's Rule of Thumb" was once regarded as a golden rule when it was proposed, but it is little known today. The reason is that since 1985, due to the development of integrated circuit technology, the memory bandwidth and I/O bandwidth of computing systems have not been able to maintain the ideal 1:1:1 ratio with computing performance. As shown in Fig. 1.2, the growth rates of CPU computing performance, memory bandwidth, disk bandwidth and network bandwidth differ in different time periods. Just as the dislocation between two plates in crustal movement forms cliffs, the performance dislocation between different modules of the computing system also forms "high walls". Today, 60 years after the birth of the integrated circuit, the three "high walls" recognized by industry and academia are: the "memory wall" formed by the dislocation of memory performance and CPU performance after 1995, the "power wall" formed by the dislocation of CPU performance and chip power consumption after 2005, and the "I/O wall" formed by the dislocation of CPU performance and I/O bandwidth after 2015.

To cross these three walls and maintain the balance of the system, researchers in architecture, programming models, compilers and software engineering have designed many complex mechanisms.


Fig. 1.2 Changes of CPU computing performance, memory bandwidth, disk bandwidth and network bandwidth over time from 1980 to 2020 (when the tension of hardware performance dislocation cannot be solved in the previous architecture—programming model design, the computing system encountered “memory wall”, “power wall” and “I/O wall” [4]) (see color chart)

Taking the programming model as the axis, some mechanisms can be implemented purely through hardware design without changing the programming model, such as the multilevel cache; other mechanisms require changing the programming model, but the transformation of old applications to new ones can be completed through automatic compilation technology, so that they remain transparent to programmers, such as VLIW technology; and there are also mechanisms that must be explicitly developed and utilized by programmers, such as multithreading technology. Although the design goal of all programming models is to help programmers develop and utilize the underlying hardware mechanisms, due to the complexity of these mechanisms the corresponding programming models are also miscellaneous and difficult to unify [5].
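The multilevel cache is the clearest example of the first kind of mechanism. In the hedged sketch below (the array size and function names are ours), the two functions are semantically identical and neither of them refers to the cache in any way; yet the row-by-row traversal matches the row-major layout of C arrays and therefore hits in the cache far more often than the column-by-column traversal, so it typically runs much faster. The hardware mechanism is exploited, or wasted, without any change to the programming model.

    #define N 4096
    static double m[N][N];

    /* Row-major traversal: consecutive iterations touch adjacent addresses,
     * so most accesses hit in the cache. */
    double sum_row_major(void)
    {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += m[i][j];
        return s;
    }

    /* Column-major traversal: consecutive iterations are N * sizeof(double)
     * bytes apart, so most accesses miss in the cache, although the code is
     * functionally identical to the version above. */
    double sum_col_major(void)
    {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += m[i][j];
        return s;
    }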

1.3.1 Von Neumann Architecture and Random Access Machine Model

To clarify the technical context of the programming model and provide a breakthrough idea for the design of the SDC programming model, this section will return to the classical Von Neumann architecture and the random access machine (RAM) programming model and start the journey of traceability. During the journey, we will take the memory ordering relationship between "write data" and "write flag" in the message queue data structure as an example to explain how to meet the application requirements with the emergence of increasingly rich hardware mechanisms and programming models. The original computer only loaded programs with fixed purposes, and its hardware was composed of various gate circuits. A specific program is executed by a fixed circuit board assembled from these gate circuits. Therefore, if the program function


needs to be modified, the circuit board must be reassembled. In 1945, Von Neumann put forward the design concept of “stored program” computer. Its basic idea is to encode computer instructions and store them in the computer memory, that is, storedprogram computer. By treating instructions as a special type of static data, a storedprogram computer can easily change its program and change its work tasks under program control. This is the beginning of Von Neumann computer system. This design concept led to the separation of software and hardware, which gave birth to the profession of programmer. At the same time, the practice of treating instructions as data gave birth to assembly language, compiler, and other automatic programming tools, and introduced the prototype of programming model. In addition, with the help of “automatically programming programs”, that is, compiler, programmers can write programs in a way that is easier for humans to understand. Figure 1.3 shows the structure diagram of Von Neumann architecture. Von Neumann’s paper identified five components in the “computer structure”: processing element, controller, memory, input device and output device. Since then, the processing element and controller unit are integrated in the processor, the capacity of memory is expanding, and the input and output devices are constantly updated. The evolution of these basic components is the development process of modern computing system. However, the performance dislocation of processor, memory, and peripherals in the evolution forces humans to design more and more complex hardware mechanisms and programming models. The programming model corresponding to Von Neumann architecture is RAM model [5]. RAM model is a kind of Turing machine, which is equivalent to general Turing machine. In the RAM model, the execution state and data of the application are stored in a limited number of registers in the processor and external memory, as shown in Fig. 1.4. The registers in the processor only save the intermediate state of application execution, and all data should be reflected in the memory finally. To give a main line in the traceability journey, this section briefly introduces the process of inserting elements in the circular queue. Queue is a basic abstract data structure and a linear table of FIFO. Figure 1.5 shows a circular queue implemented using an array, which needs to maintain two flags: the queue head flag and the queue tail flag. The queue only allows insert operations at the backend and read operations at the frontend. To simplify the discussion, only the

Fig. 1.3 Design concept of the Von Neumann structure (a CPU containing the arithmetic logic unit (ALU), registers and control logic; a memory holding data and instructions; an input device; and an output device)

Fig. 1.4 Sequence of two writes of the processor and two writes in memory under the RAM model (processor registers hold only the intermediate state of execution; all data is finally reflected in the monolithic memory)

Fig. 1.5 When adding elements to the circular queue, it is necessary to ensure that the write of the data comes before the write of the flag (Processor: write data → write flag; Memory: write data → write flag)

    void enqueue(int x) {
        if (!queue.full()) {   /* read flag: make sure there is free space */
            queue[tail] = x;   /* write data */
            tail++;            /* write flag */
        }
    }

single-producer single-consumer queue is considered here, that is, at any time, at most one write thread inserts data into the queue and one read thread reads data from the queue. When inserting an element into the queue, the program needs to query the tail flag status (!queue.full()), then write the data (queue[tail] = x), and finally update the queue tail flag (tail++), as shown in the code in Fig. 1.5. Next, we will gradually explore how various hardware mechanisms and programming models can meet the order-preserving requirement of "read flag - write data - write flag". As the first stop of the journey, the order-preserving method in the RAM model is simple and direct. As long as the processor executes the application with the reading of the flag before the writing of the data, and the writing of the data before the writing of the flag, the queue data in the memory will be updated before the queue tail flag.
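To make the running example concrete, the following self-contained C sketch (the names and the modulo-based index wrap-around are ours, not from the figure) spells out both sides of such a single-producer single-consumer ring buffer. Under the RAM model assumed at this first stop, the program order of the three steps is by itself enough to guarantee that the data reaches memory before the flag does; later discussion revisits what extra measures become necessary on more aggressive hardware.

    #include <stdbool.h>

    #define QUEUE_SIZE 16

    static int buffer[QUEUE_SIZE];
    static int head = 0;   /* next slot to read; advanced only by the consumer  */
    static int tail = 0;   /* next slot to write; advanced only by the producer */

    static bool queue_full(void)  { return (tail + 1) % QUEUE_SIZE == head; }
    static bool queue_empty(void) { return head == tail; }

    /* Producer: read flag, write data, write flag -- in that order. */
    bool enqueue(int x)
    {
        if (queue_full())                 /* read flag  */
            return false;
        buffer[tail] = x;                 /* write data */
        tail = (tail + 1) % QUEUE_SIZE;   /* write flag */
        return true;
    }

    /* Consumer: the mirror image (read flag, read data, write flag). */
    bool dequeue(int *out)
    {
        if (queue_empty())                /* read flag  */
            return false;
        *out = buffer[head];              /* read data  */
        head = (head + 1) % QUEUE_SIZE;   /* write flag */
        return true;
    }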

1.3.2 Memory Wall

Since Robert Noyce and Jack Kilby invented the integrated circuit in 1958, the various components of the Von Neumann architecture began to be gradually replaced by integrated circuits: first the processor, and then the memory (IBM invented DRAM based on the integrated circuit in 1965, and the early magnetic memory was replaced by integrated circuits). Moore's Law in 1965 and Dennard's Law in 1974 set a road map for the development of integrated circuits. Because both the processor and the memory scaled according to Moore's Law, "Amdahl's Rule of Thumb" was also observed. For a long time, the performance of computing systems advanced with Moore's Law. However, a crisis lurked in the prosperous times. Due to the limitations of transistor-level circuit design, the read latency of DRAM was the first to lag behind Moore's Law. As shown in Fig. 1.6, the unit storing 1-bit data in DRAM memory is composed of a

Fig. 1.6 Schematic diagram of a 1-bit DRAM cell (data is stored in Cstorage, controlled by the read–write transistor on the word line, and read out through the sense amplifier)

Fig. 1.7 Based on the performance in 1980, the gap between processor performance (the time interval between two memory accesses of the processor) and DRAM access latency gradually widened; around 2005 the gap narrowed as processor performance became limited by power consumption [4]

capacitor and a transistor. The capacitor stores the data, and the transistor controls the charging and discharging of the capacitor. When data is read, the transistor is gated, and the charge stored on the capacitor changes the source voltage very slightly. The sense amplifier then detects this slight change and amplifies a small positive change of the voltage to the high level (representing logic 1) and a small negative change to the low level (representing logic 0). Sensing is a slow process, and as the transistor and the capacitor become smaller, it takes even longer. The time of the sensing process determines the access time of the DRAM. Therefore, the rate at which DRAM access time decreases falls far behind the rate at which the interval between two memory accesses of the processor shrinks under Moore's Law. This is the "memory wall" encountered in the development of the Von Neumann architecture. Figure 1.7 clearly shows the "memory wall" problem. If each memory access takes tens to hundreds of cycles to wait for the response of the DRAM, the performance improvement of the processor becomes meaningless. To solve this problem, two swords have been forged in computer architecture research: the cache and memory-level parallelism (MLP). The cache uses the principle of locality to reduce the number of CPU accesses to main memory. In short, the instructions and data being accessed by the processor


and the nearby DRAM area are likely to be accessed many times in the future. Therefore, when this area is accessed for the first time, it is copied into the cache; later accesses to the instructions or data of this area no longer need to reach the DRAM. After the introduction of the cache, the memory in the Von Neumann architecture became a hierarchical storage structure. The cache is a completely transparent part of the programming model: although programmers can optimize program code according to the characteristics of the cache to obtain better performance, they usually cannot directly intervene in its operation. Therefore, the cache is generally not considered in the exploration of the programming model design space (lower-level programming models do expose the cache line size to programmers as an important processor parameter, but this has become a consensus of all programming models and needs no further exploration in the design space).

MLP refers to the ability of the processor to process multiple memory access instructions at the same time. Multiple access requests from the processor can be processed concurrently between the cache and the DRAM of the hierarchical storage structure, and between multiple banks of the DRAM. For example, when a memory access instruction misses in the cache and has to wait for data from the DRAM, a subsequent memory access instruction that hits in the cache can be completed first, so that the processor does not block on the long-latency missed access. Although MLP cannot reduce the access latency of a single operation, it increases the available bandwidth of the storage system and improves the overall performance of the system.

To realize MLP, hardware developers design hardware mechanisms such as concurrent multithreading, instruction multi-issue and instruction reordering. Their purpose is to introduce multiple concurrent and independent memory access instructions to exploit MLP, but their interaction with the programming model is much more complex than that of the cache. Thread-level parallelism requires application developers to use a parallel programming language to develop and debug explicitly. Although the scheduling of concurrently executing threads on the processor can be handled entirely by hardware mechanisms and remain transparent to the programming model, the parallelism among multiple threads in the application must be exposed by the developer according to the requirements of the target application. Compiler-based automatic parallelization has always been a focus of programming language and compiler research, but up to now the exploitation of task-level and thread-level parallelism in practical applications still depends on the efforts of application developers, and thread-level parallel programming remains a high-threshold task. Instruction multi-issue requires discovering instructions without dependences. This step can be completed dynamically by hardware mechanisms when the processor executes instructions, that is, the superscalar processor architecture; or it can be done statically by the compiler during compilation, that is, the VLIW

processor architecture. For general-purpose processors, the performance of the VLIW architecture is much worse than that of the superscalar architecture, because it is difficult for static compiler analysis to find enough instructions to issue together. Since the performance lost to the compiler exceeds the performance and power cost of a pure hardware implementation, the instruction multi-issue mechanism of general-purpose processors finally abandoned VLIW and is implemented entirely in hardware. Similar to the cache, instruction multi-issue is ultimately completely transparent to the programming model.

Instruction reordering suspends instructions with a long delay, especially memory access instructions that miss in the cache, and executes subsequent instructions first. Although the reorder buffer of a superscalar processor can reorder a few dozen instructions, reordering over a wider window leads to a sharp increase in hardware design complexity, and the marginal cost soon exceeds the marginal utility. Therefore, reordering across a wide range of instructions (hundreds of instructions) can only be done by static compiler analysis. Finally, through the joint efforts of hardware developers and compiler designers, instruction reordering is transparent to application developers.

The interaction between a single hardware mechanism and the programming model already requires this many design considerations; the coexistence of multiple hardware mechanisms makes the design of the programming model even more complex. Here we use the "write data - write flag" example to observe the trouble that a processor with multithreading and instruction reordering brings to the design of the programming model. Figure 1.8 shows a case of developing MLP using multithreading and instruction reordering in a single-producer single-consumer queue. Thread 0 and Thread 1 perform write and read operations on the message queue, respectively, and the memory access instructions in the two threads can be executed concurrently. Under the total store order (TSO) model of the x86 instruction set architecture, the order of Thread 0's two writes is exactly the order seen in memory. However, to develop MLP on a larger scale, the compiler will also reorder write instructions, and the interaction of these two mechanisms makes the programming model more complex. Figure 1.8 shows a possible error: if the compiler reorders the write-data (I1) and write-flag (I2) instructions, the processor will execute I2 first and then I1; if the processor switches from Thread 0 to Thread 1 after executing I2, Thread 1 will believe, according to the result of reading the flag, that Thread 0's data has already been written, and will then read wrong data. It can be seen that the original programming model fails under the combined action of the two mechanisms.

Fig. 1.8 In the case of multithreading, compiler reordering of instructions may cause errors at execution time because the write order changes (initially tail = 0 and Q[0], Q[1] are undefined; if I2 executes before I1 and the thread switches after I2, Thread 1 reads R1 = 1 while the data it then reads is still undefined)

To avoid wrong results from compiler instruction reordering in a multi-threaded environment, the programming model needs further refinement. Different applications have completely different requirements on whether specific instructions may be reordered, which is difficult to cover with one set of static compiler heuristics. Therefore, the application developer takes on the task of deciding whether specific instructions may be reordered. In the C language, there are two kinds of semantics related to instruction reordering: (1) instructions corresponding to variables with volatile


qualifiers will not be reordered by the compiler at all. In Fig. 1.8, application developers can add the volatile qualifier when defining the tail variable so that the compiler will not reorder the instructions that read and write tail. However, volatile also prevents many reorderings that would not have caused errors, resulting in a performance loss. (2) Barrier programming primitives prevent the instructions before and after the barrier from being reordered by the compiler. In Fig. 1.8, the application developer can insert a barrier primitive between the write of the data and the write of the flag to prevent the compiler from reordering them. However, if the program is not as simple as in this example, accurately placing barrier primitives requires the application developer to deeply understand the process of multi-threaded concurrent execution and to do a great deal of debugging, which greatly reduces the usability of the programming model.
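The two options can be illustrated with a short sketch. This is a minimal illustration rather than code from the book: the variable names follow the earlier ring-buffer example, and the barrier shown is the GCC/Clang compiler-barrier idiom (other toolchains provide equivalent intrinsics). Both techniques only constrain the compiler; the processor's own reordering is governed by the memory model discussed later in this section.

int Q[1024];

// (1) volatile: the compiler will not reorder or elide accesses to tail_v itself,
//     at the cost of also suppressing many harmless optimizations. Strictly, the
//     standard only orders volatile accesses against each other, so relying on it
//     to keep the ordinary store to Q[] in place is a compiler-dependent idiom.
volatile int tail_v = 0;

void enqueue_volatile(int x) {
    Q[tail_v] = x;          // I1: write data
    tail_v = tail_v + 1;    // I2: write flag
}

// (2) compiler barrier: only the marked point forbids compiler reordering.
//     The empty asm statement is the GCC/Clang idiom and emits no instructions.
int tail_b = 0;

void enqueue_barrier(int x) {
    Q[tail_b] = x;                   // I1: write data
    asm volatile("" ::: "memory");   // barrier: I1 may not be moved below this point
    tail_b = tail_b + 1;             // I2: write flag
}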

It can be seen from this example that, for a general programming model on a general-purpose processor, if a hardware mechanism needs to be handled by the programming model, the design can go one of two ways: either execution efficiency is lost because the compiler adds a layer of indirection, as with the volatile qualifier; or development efficiency is lost because the application developer must understand the hardware mechanism, as with the barrier primitive.

The cache and MLP are hardware mechanisms that were widely adopted and proven in practice during this period. However, many hardware mechanisms have never been used because no appropriate way of using them was found. One of the most typical examples is the scratchpad, which is completely controlled by the programmer. The scratchpad buffers data on demand, reducing accesses to the DRAM. Figure 1.9 compares the similarities and differences between the scratchpad and the cache.

Fig. 1.9 Comparison between the cache and the scratchpad in a general-purpose processor: (a) cache configuration, where the cache (SRAM) and main memory (DRAM) share one address space; (b) scratchpad configuration, where the scratchpad (SRAM) and main memory (DRAM) occupy separate address spaces

Both are memory banks separate from main memory, and their read and write speed is generally much faster than that of main memory. However, the cache has the same address space as the

main memory, which makes it transparent to the programming model, while the scratchpad and the main memory belong to different address spaces, which must be used explicitly by application developers or compiler designers. Since there is no need to maintain the complex tags of a cache, the scratchpad achieves better performance and power consumption than the cache when executing the same data flow. However, the scratchpad never found a way to integrate with the programming model in the application field of general-purpose processors. First, the scratchpad introduces address spaces with different behaviors, which breaks the unified-address memory model of the RAM programming model. Second, unlike compiler-based cache optimization, explicit data movement using scratchpads must fully handle the remapping of main-memory addresses related to virtual memory. These disadvantages prevented it from ever becoming a mainstream mechanism of general-purpose processors; the scratchpad was not used on a large scale until it appeared in GPUs more than ten years later.

In short, in the "memory wall" period, new hardware mechanisms tried to avoid destroying the illusion of "single thread + single memory" created by the RAM programming model of the Von Neumann architecture. The cache mechanism adds faster SRAM to the original DRAM, but this SRAM only caches copies of the data in the main-memory DRAM, maintaining the illusion of a single memory. In the era when the single-core processor was the mainstream platform, ordinary application developers did not need to worry about developing MLP with multiple threads; moreover, since there is physically only one core and only one thread executes at any point in time, verifying the correctness of multi-threaded programs is much easier than on multi-core processors. In contrast, hardware mechanisms that destroy this illusion, such as the scratchpad, never entered mainstream hardware design.

1.3.3 Power Wall Moore’s Law ensures that the speed of a single transistor increases exponentially, and the area and cost decrease exponentially. With the increasing number and density


of transistors integrated on a single chip, heat dissipation must be considered in chip design. In 1974, Dennard et al. [6] proposed that if the feature size of the chip is reduced to 1/S and the frequency is increased by S times, then as long as the operating voltage of the chip is correspondingly reduced to 1/S, the power consumption per unit area remains constant. Figure 1.10 shows how Dennard scaling ensures constant power consumption per unit area of the chip: under a new-generation semiconductor process, the number of transistors per unit area increases by S² and the frequency increases by S, yet the power consumption per unit area can remain unchanged.

Fig. 1.10 Dennard scaling ended around 2006 because the threshold voltage could not be lowered further (S is the scaling factor between two generations of semiconductor process; typically S = 1.4, that is, the area of a single transistor in the next-generation process is 1/2 of the previous generation, with length and width each reduced to 1/1.4, the transistor is 1.4 times faster, and its capacitance is 1/1.4 of the previous generation)

Guaranteed by Dennard scaling, chip manufacturers such as Intel could rapidly raise the operating frequency of their chips and integrate more transistors to provide more complex functions without worrying about heat dissipation. From the Intel 4004 in 1971 to the Intel Core 2 processor in 2006, the operating voltage of chips gradually decreased from 15 V to about 1 V. However, by 2005 the operating voltage had dropped to about 0.9 V, very close to the threshold voltage of the transistor (0.4–0.8 V). Limited by the material and structure of the transistor, the threshold voltage is difficult to reduce further, so the operating voltage of the chip can no longer be lowered. From then on, the power consumption per unit area doubles every time the transistor area is halved. Worse still, as the operating voltage approaches the threshold voltage, the leakage power from the transistor gate to the substrate accounts for an increasing proportion of the total power consumption [7]. Each new semiconductor process generation therefore faces increasing power consumption per unit area, which is the "power wall" problem of the chip.
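The constant-power-density argument can be written out explicitly. The following is a standard back-of-the-envelope restatement of Dennard scaling (dynamic power only, leakage ignored), added here for convenience rather than taken from the book.

% Dynamic switching power of one transistor: P = C V^2 f.
% Dennard scaling shrinks C and V by 1/S and raises f by S:
P' \;=\; \frac{C}{S}\left(\frac{V}{S}\right)^{2}(S f) \;=\; \frac{C V^{2} f}{S^{2}} \;=\; \frac{P}{S^{2}}
% With n transistors in area A, power density is n P / A; the next generation has
% S^2 n transistors in the same area, so the power density is unchanged:
\frac{(S^{2} n)\,(P/S^{2})}{A} \;=\; \frac{n P}{A}
% Once V is stuck near the threshold voltage, per-transistor power stays at
% P' = (C/S) V^{2} (S f) = C V^{2} f = P, and with S^2 times the density the power
% per unit area grows by about S^2 (roughly 2x) per generation: the "power wall".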

To overcome the "power wall" problem, the idea of "dark silicon" has been widely adopted in chip design since 2005 [8]: the area of the chip that runs at full speed ("lights up") at any moment is limited through multi-core and heterogeneous architecture design, so that the chip meets the power consumption

constraints. Figure 1.11 shows the processor layout of the Intel Skylake architecture.

Fig. 1.11 Processor layout of the Intel Skylake architecture (there are four CPU cores and one GPU on a single chip; at any given point in time only some of the circuits are working, so the chip meets its power consumption constraint [9]) (see color chart)

Taking Skylake's layout as an example, the hardware mechanisms for realizing "dark silicon" can be divided into the following three categories. (1) Add low-frequency modules, such as larger caches and SIMD execution units. "Frequency" here refers both to the clock frequency and to how often the module is used. Taking the cache as an example, nearly half of the CPU area in the "dark silicon" era is used for caches (in Fig. 1.11 the CPU caches include the last-level L3 cache plus the L1 and L2 caches in each CPU core). On the one hand, thanks to MLP, the cache can run at a lower clock frequency than the processor data path; for example, in Intel's earlier Sandy Bridge architecture, the voltage and frequency of the L3 cache and of the cores can be controlled separately. On the other hand, since the CPU cache is composed of many SRAM blocks, the control logic of SRAM blocks that are not currently being accessed can be clock-gated or use other techniques to reduce power consumption. The mechanism of lowering hardware frequency by enlarging the cache is completely transparent to the programming model. (2) Add parallel hardware modules. According to the Dennard scaling principle in Fig. 1.10, when the transistor size in an advanced process is reduced to 1/2 of the previous generation, a single processor core whose frequency remains unchanged shrinks to 1/4 of its original area, and its power consumption (owing to the reduced capacitance of a single transistor) drops to 1/2 of the original. In this way, if the power budget of the chip remains unchanged, three more lower-frequency cores can be added to the chip. At the same time, because only some of the CPU cores are working at any given point in time, the chip can


close some cores through technologies such as clock gating and power gating to further reduce power consumption. In reality, since 2005, x86 architecture processors no longer focus on increasing the frequency of chips, but increasing the number of processor cores on new chips. The BIG-LITTLE architecture of ARM architecture places high-performance BIG core and energy-efficient LITTLE core on one chip at the same time, making full use of the design space brought by the parallel hardware module mechanism. However, to make full use of the performance of the chip, application developers have to learn the skills of multithreading programming. The parallel mechanism based on Multithreading makes the performance of hardware difficult to be fully developed by application developers. (3) Specialized hardware. To ensure its generality, general-purpose processor will have a lot of hardware redundancy. In 2010, a study on running H.264 decoding on general-purpose processor [10] showed that the energy consumed by the execution unit of general-purpose processor accounted for only 5% of the total energy, and a large amount of energy was consumed in the fetch, decoding and other modules of the processor. Therefore, designing specialized hardware modules on a chip according to the application requirements and then constituting a heterogeneous system-on-chip (SoC) can greatly improve its energy efficiency. The GPU integrated on the Skylake architecture processor in Fig. 1.11 is customized for processing image rendering. In addition, SIMD processing elements such as Intel SSE/AVX can also be regarded as specialized hardware. SIMD processing element uses a controller to control multiple processing elements, and performs the same operation on each data in a group of data vectors at the same time, so as to amortize the processor’s overhead of fetching and decoding on this group of data vectors. Of course, SIMD units are only suitable for scenarios where there is very regular data-level parallelism. The popularity of specialized hardware and heterogeneous SoC has brought about the “Babel Problem” of programming model: different specialized hardware modules need different programming models; when the same application runs on different specialized hardware, it takes a lot of manpower to “translate” the application. Although the “power wall” problem did not stop the pace of Moore’s law, it destroyed the RAM programming model on Von Neumann architecture. As chip multi-processor (CMP) and heterogeneous architecture have gradually become the mainstream hardware design methods, application developers are forced to face parallel programming model and heterogeneous programming model. The parallel programming model breaks the illusion of “single thread” in RAM model, that is, physically, multiple cores need applications to provide thread-level parallelism, and SIMD vector processing elements in a single core need applications to provide data-level parallelism. These impose higher requirements for application developers. Before the emergence of “power wall”, parallel programming was a difficult knowledge for a few supercomputer application developers; after the


emergence of “power wall”, almost all application developers have to face the challenge of parallel programming. How to reduce the difficulty of parallel application development has become the core issue of architecture, programming model and programming language. The heterogeneous programming model breaks the illusion of “single memory” of RAM model, that is, the CPU-GPU heterogeneous architecture violates the mechanism of a single CPU facing a unique address space for the first time. Application developers must start to consider how to efficiently migrate data between different storage structures or address spaces. It is worth mentioning that the problems encountered by the scratchpad in the “memory wall” era, such as designing the ISA/API for data movement across multiple address space and remapping virtual memory, which are not friendly to application developers, have been partially solved by specialization on the CPU-GPU system. In addition, the interaction of multiple concurrent threads and multiple concurrent banks will bring greater complexity. For example, when a single-chip multi-core processor is combined with an on-chip multilevel cache storage system, different processors on the chip may obtain inconsistent results when accessing data at the same address because the timing of data update in different caches is inconsistent. These problems are collectively called cache-consistency model problems, which greatly affect the design of processor architecture, programming model and programming language. At the time of the collapse of existing programming models, emerging programming models are flourishing. Their characteristics are hard to sum up. Therefore, this section will continue to use the example of inserting elements in circular queues to trace the development direction of programming models in the era of “power wall”. Specifically, this section will analyze how to ensure the order of “write data write flag” when inserting elements with thread-level parallel programming model, cache-consistency model and heterogeneous programming model. First, since the general-purpose processor turned to multi-core architecture, programming models for developing thread-level parallelism have emerged. From the early Pthread and OpenMP based on C language to the recent Golang and C++ 20 co-routine, almost all new routes have made a good commitment to solving the problems of multi-core parallel programming. Unfortunately, “there has been nothing perfect since the olden days”. These new routes either fail to exploit the full potential of a specific application scenario on a specific architecture (such as Java), or have a very steep learning curve (such as C++). Java is the most widely used parallel programming language at present. It is widely used in enterprise web application development and mobile application development. One of the motives for Sun Microsystems to design Java in 1990 was that the C language widely used at that time needed the help of platform related Pthread or OpenMP libraries and lacked native support for multi-threaded features. Java provides multithreading support at the language level by introducing keywords such as synchronized when it was originally designed. For circular queues, application developers can directly call the thread-safe queue provided in the Java library, or


they can carefully read the Java memory model and concurrency control primitives to design a queue by themselves. However, the performance problem of Java has been widely criticized. Many language features of Java are to support cross-platform and full-scene development (improve generality). At the same time, Java also provides a large number of libraries to reduce the overhead of learning and debugging and improve development efficiency. For the thread-safe queue (java.util.concurrent.ArrayBlockingQueue) in the Java library, the user only needs to directly call its add() function to complete the operation of inserting elements, without considering the order relationship between the above multiple reads and writes. However, to adapt to various queue application scenarios, ArrayBlockingQueue provides a general queue implementation that supports “multiple-producer multiple-consumer”, that is, any number of threads can insert elements concurrently, and any number of threads can take out elements concurrently. For other application scenarios, such as “single-producer single-consumer” queue, although application developers can obtain higher performance through simpler synchronization, calling ArrayBlockingQueue directly misses this opportunity. Research shows that compared with the native ArrayBlockingQueue, the queue design for “single-producer single-consumer” in Java can improve the throughput dozens of times. However, to achieve better performance, application developers need to fully understand their own needs, Java memory model, operating system thread scheduling mechanism, multiprocessor cache-consistency mechanism and so on. It can be seen from this example that in the face of the problem of thread-level parallel programming, Java gives priority to ensuring generality. At the same time, application developers can choose between development efficiency and operation efficiency according to their own needs. Second, as the design of CMP architecture becomes more and more complex, hardware designers begin to try to open more and more hardware mechanisms directly to application developers, expecting to make the hardware more effectively adapt to the use scenario with the help of application developers. Among them, adding std::memory_order to the standard library after C++ 11 allows application developers to independently determine the order relationship between multiple reads and writes according to the multi-core cache-consistency model of the specific processor platform. Due to the subtle differences among cache-consistency models of different processor platforms, application developers need to understand the design details of the processor when using it. Therefore, the learning cost and development difficulty of std::memory_order are very high. Next, taking the “write data - write flag” sequence of circular queue insertion as an example, we will briefly introduce design philosophy of std::memory_order. Figure 1.12 shows a simplified on-chip dual core processor architecture. The two processor cores exchange data through shared memory. Compared with Fig. 1.8, the order of the two write operations is influenced by a new factor, in addition to the reordering of instructions by the compiler and processor cores: under the action of cache, even if processor core 0 (Core 0) and main memory complete two write operations in the order of “write data - write flag”, processor core 1 (core 1) may not observe the writing results in this order. Consider the two threads in Fig. 1.8.

Fig. 1.12 In the cache-consistency model of an on-chip multi-core processor, it is necessary to consider not only whether the write order seen by main memory is consistent with the order on the core that initiated the writes, but also whether other cores can observe the same order (Core 0 and Core 1 each issue their writes through a reorder buffer, a store buffer and an L1 cache before reaching the shared main memory)

Suppose Thread 0 runs on processor core 0 and Thread 1 runs on processor core 1. When processor core 0 executes I1 and I2 of Fig. 1.8 and writes Q[tail] and tail respectively, the two writes may propagate to processor core 1 at different speeds. For example, I1 may suffer a write miss on Q[tail] and have to wait hundreds of cycles for the cache line containing Q[tail] to be fetched from main memory into the L1 cache, while the later instruction I2 has a write hit in the L1 cache when it updates tail. At this point, should the later write wait until the result of the earlier write can be observed by processor core 1 before being written to the L1 cache? Different processor architectures answer this question differently. The cache-consistency model of a processor architecture specifies whether multiple processor cores observe the same order of write operations, read operations and atomic operations. Figure 1.13 shows several mainstream cache-consistency models. In the figure, A–F represent multiple sequential memory accesses initiated by one thread, including reads (e.g., =A), writes (e.g., B=), atomic acquire and atomic release operations. An arrow between two memory accesses indicates that other processor cores are guaranteed to observe these two accesses in that order; "E=" and "F=" are two write operations on one processor core.

Fig. 1.13 Several mainstream cache-consistency models [4]: (a) sequential consistency; (b) total order consistency; (c) partial order consistency; (d) weak order consistency; (e) release consistency

It can be seen that under the TSO model represented by the x86 instruction set architecture, all processor cores observe a completely consistent "write data - write flag" order. Application developers do not need to consider the order of write accesses on x86 multi-core platforms, which reduces the difficulty of application development. However, every write access on the x86 platform must wait until the effect of the previous write access in the same thread can be observed by all processor cores before it can be propagated to other cores, which severely limits the scalability of the processor cores. To obtain better scalability, instruction set architectures proposed in recent years, such as ARMv8 and RISC-V, adopt the release consistency model. As can be seen from Fig. 1.13, the release consistency model imposes no requirement on the propagation order of two writes on one processor core. If the application developer has requirements


for the propagation order of writes, it needs to rely on the programming language or assembly language to explicitly invoke atomic operations. Similar to the volatile tag in the C language, the C++11 standard library provides the atomic variable template std::atomic. To enable programmers to fully tap the potential of an instruction set architecture according to the specific application scenario during cross-platform development, the C++11 standard library std::memory_order adds four kinds of ordering attributes to atomic accesses of such variables: relaxed ordering, release-acquire ordering, release-consume ordering, and sequentially-consistent ordering. By default, the attribute of std::atomic is sequentially-consistent ordering: multiple atomic read–write operations in a thread, as well as the order between atomic and ordinary reads and writes, keep their original order when propagating to other processor cores; the default sequentially-consistent std::atomic therefore plays much the same role as the volatile tag in C. The default std::atomic is very easy to use, but there is a performance loss. For example, on RISC-V, if the tail variable in Fig. 1.8 is declared as std::atomic, not only will the writes to Q[tail] and tail be propagated in program order (that is, the write of tail will not become visible to other processors before the write of Q[tail]), but the order between any other read–write operation and tail will also be propagated in order. std::memory_order allows finer control of the memory read–write order, as shown in Fig. 1.14: the tail flag is declared as type std::atomic, std::atomic variables are read and written with the load()/store() functions, and the std::memory_order attribute can be specified for each read and write.

Fig. 1.14 Fine-grained control over the "write data - write flag" order using std::memory_order in C++11:

Thread 0:
    I1: Q[tail] = 1;
    I2: tail.store(1, std::memory_order_release);

Thread 1:
    I3: R1 = tail.load(std::memory_order_acquire);
    I4: R2 = Q[R1];

If the write result of I2 is read by I3, then I4 must be able to observe the write performed by I1.

Figure 1.14 uses the release–acquire ordering: none of the reads and writes before the store() may be moved after the store(), and none of the reads and writes after the load() may be moved before the load(). In this way, if the value written by I2 is successfully read by I3, all writes to memory made by Thread 0 before I2 are visible to Thread 1 at that point. This example shows that exposing mechanisms such as chip multiprocessor cache consistency directly to application developers who know the hardware architecture well can fully tap the potential of such hardware mechanisms in specific application scenarios. Of course, the learning curve is steep, and the development cost is also very high: when application developers face the complexity of the hardware mechanism directly, they need a great deal of debugging and verification to ensure that real applications run correctly.
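For completeness, the release–acquire pattern of Fig. 1.14 can be folded back into the single-producer single-consumer ring buffer used throughout this section. The following is a minimal illustrative sketch (C++11 or later) rather than code from the book; the names and sizes are chosen for the example only.

#include <atomic>
#include <cstddef>

constexpr std::size_t N = 1024;
int buf[N];
std::atomic<std::size_t> head{0};   // advanced only by the consumer
std::atomic<std::size_t> tail{0};   // advanced only by the producer

bool enqueue(int x) {                                    // producer thread
    std::size_t t = tail.load(std::memory_order_relaxed);
    if ((t + 1) % N == head.load(std::memory_order_acquire))
        return false;                                    // full
    buf[t] = x;                                          // write data
    tail.store((t + 1) % N, std::memory_order_release);  // write flag: publish
    return true;
}

bool dequeue(int &x) {                                   // consumer thread
    std::size_t h = head.load(std::memory_order_relaxed);
    if (h == tail.load(std::memory_order_acquire))
        return false;                                    // empty
    x = buf[h];               // the acquire on tail guarantees the data write is visible
    head.store((h + 1) % N, std::memory_order_release);  // free the slot
    return true;
}

The release store on tail pairs with the acquire load in the consumer, so the data written into buf is guaranteed to be visible before the new tail value is observed, which is exactly the "write data - write flag" requirement.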

Third, after this turbulent period, the most significant change is that, with the popularity of the GPU, application developers began to learn and adapt to the heterogeneous programming model represented by CUDA. Under the constraint of the "power wall", the improvement of CPU computing power is much slower than the growth of data volume. In contrast, the GPU saves a great deal of fetch and decode overhead by specializing for data parallelism, so it can invest more of its power budget in raising computing power. Figure 1.15 compares the single-precision and double-precision floating-point peak computing power of CPUs and GPUs in recent years; the gap between the two keeps widening.

Fig. 1.15 Comparison of floating-point peak computing power (GFLOPS/s) between CPU and GPU [11] (see color chart)

Of course, for computing chips it is not enough to improve only the peak performance; the key to large-scale use is having suitable target applications and programming methods. At present, the most important applications on the GPU are data-intensive applications, and the most mainstream programming method is NVIDIA's CUDA programming language. Because the control logic is simplified, current GPGPU programming requires application developers to understand the details of the underlying hardware mechanisms. Although these cumbersome details seriously affect developer productivity, most application developers still choose the GPU over the CPU to obtain higher performance for data-intensive applications. Specifically, application developers need to package 32 or 64 mutually independent threads (different architectures have different SIMD concurrency requirements) into a lock-step thread group in the control flow, so as to amortize the overheads of control logic such as fetching and decoding in multithreaded SIMD

processors; application developers should create as many thread groups as possible for each multithreaded SIMD processor to hide the access delay of DRAM; and they also need to keep or distribute data addresses in one or several memory blocks to achieve the expected memory performance. At present, all GPU programming models, including CUDA and OpenCL, have similar and steep programming method learning curves. Overall, the GPGPU programming model sacrifices usability for high energy efficiency. Initially, CUDA was only used in areas where there were few and sophisticated application developers such as scientific computing. These developers tend to pay higher learning and development costs in exchange for better application execution performance. Most ordinary application developers have no motivation or opportunity to learn CUDA. However, after 2012, the rise of deep learning applications represented by AlexNet [12] promoted the popularity of GPGPU programming methods represented by CUDA among ordinary application developers. The emerging big data processing applications represented by deep learning bring massive data-level parallelism, which can make full use of the computing power of GPU. This makes application developers willing to pay additional learning costs and be familiar with various hardware mechanisms of GPGPU, so as to develop and utilize the performance potential of GPGPU for data intensive applications. Today, the GPGPU programming model represented by CUDA has been widely accepted. More and more applications with high computing power have been customized and tailored according to the requirements of GPGPU programming model, so as to make full use of the high computing power of GPGPU. CUDA and OpenCL have also become the most widely used programming methods in general data intensive applications, and have been extended to FPGA, CGRA and other architectures for data intensive applications [13].


Since the computational part of CUDA is neither suitable for nor required to implement queues, this section will not discuss the CUDA programming model in detail. However, the process by which the CPU dispatches tasks to the GPGPU is also carried out in the form of a task queue, and with the evolution of GPU architecture the implementation of this task queue has become more and more complex. Existing GPGPUs are vertically integrated: the driver manages the task queue on behalf of the architecture, so that only the APIs for inserting and removing elements of the task queue need to be exposed to application developers and compiler designers. Figure 1.16 shows the typical ways of connecting the GPU and the CPU in a heterogeneous system. The CPU is connected to an integrated GPU through an internal processor interface (such as the Intel cache consistency interface, CCI) and to a discrete GPU through the PCIe bus. An integrated GPU communicates with the CPU within the same chip at very low latency (about tens to hundreds of nanoseconds), making it easy to maintain cache consistency with the CPU. A discrete GPU, on the other hand, communicates with the CPU via PCIe with latency on the order of microseconds, making it difficult to maintain cache consistency with the CPU. Therefore, the design considerations for the task queue with an integrated GPU are not much different from those for a multi-core processor, while for a discrete GPU one must consider how to use the mechanisms of PCIe peripherals to improve the efficiency of reading and writing the queue. For example, PCIe provides a doorbell mechanism to avoid frequent PCIe reads and writes when elements are inserted into the queue. Application developers do not need to understand the details of PCIe or the differences between PCIe and the processor's internal interface; they only need to call the task-queue APIs, and the underlying implementation details are entirely the responsibility of the architecture layer and the driver. To sum up, the GPGPU task queue greatly enhances usability and execution efficiency by sacrificing generality.

Fig. 1.16 Two methods of connecting the CPU and GPU in a heterogeneous system (an integrated GPU sits inside the cache-consistent area next to the CPU cores and their L1/L3 caches, while a discrete GPU with its own on-chip cache is attached outside it)


1.3.4 I/O Wall Von Neumann architecture can be divided into three functional modules: memory, processing unit, and peripherals. The previous two crises were caused by the mismatch between memory speed and computing speed, and the mismatch between computing power consumption and memory power consumption. In the past computer textbooks, it has been emphasized that the speed of peripherals is much slower than the computing speed. However, since around 2015, this dogma that has lasted for decades has gradually lapsed. Today, the mismatch between peripheral speed and computing/storage speed is forming a new “I/O wall”. Figure 1.17 shows the change trend of bandwidth among hard disk, network, and CPU-DRAM from 1995 to 2020. It can be seen that the network bandwidth has increased rapidly after 2010; after 2015, the speed of hard disk has increased rapidly. In the same period, the bandwidth growth rate of CPU-DRAM lags far behind the growth rate of network and hard disk. There are two main reasons for this phenomenon: (1) the successive emergence of “memory wall” and “power wall” makes the growth rate of bandwidth of CPU-DRAM far behind the pace of Moore’s Law. At present, the bandwidth of DDR interface of the mainstream CPU-DRAM is doubled every 5–7 years; (2) the technological breakthrough of network and hard disk in the physical layer has led to the explosive growth of their bandwidth. In the first decade of the twenty-first century, the mainstream network transmission medium is copper twisted pair, and the mainstream storage medium is disk. After 2010, the optical communication module in the data center began to popularize. Compared

Fig. 1.17 Changes of hard disk, network, and DRAM bandwidth over time [14] (curves: hard disk bandwidth per device, network bandwidth per cable, and DRAM bandwidth per CPU socket) (see color chart)


with network cards based on copper twisted pair, the bandwidth of a single optical-fiber network card increased rapidly from 1 to 10 Gbit/s. After 2015, the solid-state drive (SSD) gradually replaced the hard disk drive (HDD). HDDs used mechanical motors to address data on a spinning disk; limited by the rotation speed (about 15,000 r/min), the bandwidth of a single HDD could only reach hundreds of megabytes per second. With the development of flash technology, and especially the progress in flash endurance, the SSD began to replace the HDD. The addressing process of an SSD is similar to that of DRAM and is driven entirely by electrical signals, so its bandwidth is no longer limited by addressing speed. Since the growth rate of I/O bandwidth now far exceeds that of CPU-DRAM bandwidth, the classical Von Neumann architecture has difficulty meeting the bandwidth requirements of peripherals, and new architecture designs have emerged in quick succession. For example, Intel added data direct I/O (DDIO) technology to the "Xeon" processor product line for the data center, which bypasses DRAM and allows peripherals (mainly network cards) to write data packets directly into the CPU's LLC. Although DDIO successfully bypassed the "memory wall" between the CPU and DRAM and achieved commercial success, with single-NIC bandwidth continuing to grow faster than Moore's Law, the computing speed of the CPU still struggles to keep up with the bandwidth of the network card. For example, for the mainstream 100 Gbit/s network cards in today's data centers, with 64 B network packets the CPU must process one packet every 3.3 ns, while the clock frequency of a typical data center CPU is about 2 GHz and the latency of accessing the LLC is about 5 ns (20 processor cycles). Mainstream CPUs therefore already have difficulty keeping up with the bandwidth of 100 Gbit/s NICs, and are far from sufficient for the next generation of 400 Gbit/s NICs. In terms of storage, with the SSD becoming mainstream, storage bandwidth is limited entirely by the speed of the CPU's I/O interface. The mainstream storage device interface currently adopts the NVMe standard, which is interconnected with the CPU over PCIe. With the popularization of PCIe Gen 5, it is foreseeable that the computing speed of the CPU will have difficulty keeping up with the growth of storage bandwidth. The "I/O wall" problem has increasingly become a key bottleneck restricting the performance of computing systems. For the earlier example of inserting elements into the circular queue, the emergence of the "I/O wall" also brings new design space. The InfiniBand networks commonly used in high-performance computing provide remote direct memory access (RDMA), which allows the local CPU to read and write remote memory directly through the RDMA network card, bypassing the remote CPU and thus overcoming the obstruction of the "I/O wall" to a certain extent. However, limited by the complexity of the network card, existing RDMA support for memory read–write primitives includes only read, write and atomic CAS (compare-and-swap) operations. Moreover, the latency of RDMA is much larger than local memory access latency: even for the lowest-latency InfiniBand networks available, end-to-end access latency is still well above 1 µs.
If RDMA is used to implement circular queue insertion elements, considering that there may be multiple hosts reading and writing to the queue at the same time, a “multiple-producer multiple-consumer”


insertion operation needs to be supported: first check whether there is enough free space in the queue, then modify the pointer atomically to allocate space in the queue, then write the data, and finally write the flag to make the data valid (while also detecting whether the write was corrupted). With RDMA, these operations can only be performed on the remote page through a series of RDMA requests. As shown in Fig. 1.18, an RDMA read first checks the available space, an atomic CAS operation then claims the position, a further request writes the data, and finally the flag must be written. Overall it takes at least three network round trips to complete this operation (the data write and the flag write can share the same network packet), which is at least 5 µs even on the fastest networks available today. Moreover, this latency is already approaching the physical limit of signal (e.g., optical) propagation in the medium and is difficult to reduce as technology advances.

Fig. 1.18 Implementing circular-queue insertion with the existing RDMA read(), write() and CAS() primitives requires at least three network round trips between the local application, the local and remote NICs and the remote memory; a new atomic-append primitive completes the insertion in a single round trip [15]

To reduce the impact of network latency on the performance of an RDMA-based remote circular queue, academia began to explore adding more complex RDMA primitives so as to reduce the number of round trips. For the insert operation, one optimization is to add an "atomic append" primitive to RDMA, as shown in Fig. 1.18: the local node sends a single append-element primitive to the remote node, and the remote node performs the corresponding reads and writes against its local memory, avoiding frequent back-and-forth remote reads and writes.
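The round-trip structure can be sketched in code. The wrappers below (rdma_read_header, rdma_cas_tail, rdma_write, rdma_atomic_append) are hypothetical helpers introduced purely for illustration; a real implementation would post work requests through an RDMA verbs library, and the queue layout is deliberately simplified.

#include <cstddef>
#include <cstdint>

struct RemoteQueue { std::uint64_t header_addr, slots_addr, capacity, slot_size; };
struct QueueHeader { std::uint64_t head, tail; };

// Hypothetical one-sided RDMA wrappers; each call costs one network round trip.
QueueHeader rdma_read_header(const RemoteQueue &q);
bool rdma_cas_tail(const RemoteQueue &q, std::uint64_t expected, std::uint64_t desired);
void rdma_write(const RemoteQueue &q, std::uint64_t offset, const void *buf, std::size_t len);

// Insertion with plain RDMA primitives: at least three round trips.
bool enqueue_rdma(const RemoteQueue &q, const void *item) {
    QueueHeader h = rdma_read_header(q);                       // round trip 1: check free space
    if ((h.tail + 1) % q.capacity == h.head)
        return false;                                           // queue full
    if (!rdma_cas_tail(q, h.tail, (h.tail + 1) % q.capacity))   // round trip 2: claim the slot
        return false;                                           // another producer won; caller retries
    // round trip 3: write the data and the validity flag (they can share one packet)
    rdma_write(q, h.tail * q.slot_size, item, q.slot_size);
    return true;
}

// With a hypothetical "atomic append" primitive, the remote node performs all of
// the above against its local memory, so only one round trip is needed.
bool rdma_atomic_append(const RemoteQueue &q, const void *item, std::size_t len);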

Today (2021), how to solve the "I/O wall" problem is still an open question. Academia and industry have put forward a large number of studies and solutions, most of which try to make the Von Neumann architecture more heterogeneous: computing functions are added to the network and to storage so that work originally done on the CPU can be offloaded to the peripherals. Exploration in these two directions has been collectively summarized by the academic community as software-defined networking/storage. Specifically, the software-defined network offloads parts of the network protocol stack, such as the transport layer security (TLS) protocol in TCP/IP and some underlying data-access APIs, to the intelligent

network card for computation; software-defined storage offloads some requirements of the storage system, such as logging, RAID and compression, to processing elements close to the storage devices. The combination of software-defined networking, software-defined storage and SDCs is expected to make a splash in the next decade.

1.4 Impossible Trinity In Sect. 1.3, with the development of semiconductor technology, some hardware mechanisms designed by Von Neumann architecture to solve the three obstacles of “memory wall”, “power wall” and “I/O wall” are briefly introduced. Figure 1.19 roughly shows the evolution of the hardware architecture in response to the “three high walls”. Overall, the hardware architecture is developing towards specialization and parallelization. Figure 1.20 is a summary of some hardware mechanisms involved in Sect. 1.3 and their advantages and disadvantages to programming. It can be seen that these hardware mechanisms can be roughly divided into four categories according to their effects: (1) Some hardware mechanisms represented by cache are completely transparent to the programming model. These hardware mechanisms are very friendly to upper application developers and compiler designers, and are universal to most applications. However, the power consumption/area overhead of these hardware mechanisms may be large, but most of them have entered the mainstream architecture design. (2) The cost for learning and developing some hardware mechanisms represented by thread-level parallelism is very high. The premise of fully developing the performance potential of these hardware mechanisms is that application developers should write programs according to the needs of applications and the organization of hardware architecture. However, to excellent developers, these

I/O device

I/O device

I/O device Accelerator

Accelerator

Cache

Cache

Cache

Memory

Main memory

Main memory

Main memory

Von Neumann architecture

Memory wall (1995 to present)

Power wall (since 2005)

I/O wall (2015 to present)

Fig. 1.19 Evolution of Von Neumann architecture to hardware specialization after encountering “three high walls”

1.4 Impossible Trinity

29

Classification

Advantages

Disadvantages

a)

Completely transparent to the programming model

High hardware overhead

High execution efficiency and universal

High learning and development costs

Hardware Mechanism Multilevel cache Instruction multi-issue Thread-level parallelism

b)

Scratchpad C++ std::memory_order Multithreaded secure queue container

c)

Easy to develop and universal

Not fully tap the potential of architecture

d)

Easy to develop and high execution efficiency

Poor compatibility

Instruction reordering RDMA-based queue GPU task queue

Fig. 1.20 Some hardware mechanisms of “memory wall”, “power wall” and I/O design and their advantages and disadvantages for programming

hardware mechanisms usually have high performance and good application generality. Some of these hardware mechanisms have been widely used after their invention, such as a variety of atomic operation instructions; some have become standard in subsequent hardware architectures as application developers have become more familiar with them years later, such as multithreading, scratchpad, etc.; many more mechanisms have been forgotten in the archives of history because they cannot be efficiently used by application developers. Perhaps in the future, when facing new challenges, these hardware mechanisms will be discovered and effectively utilized again from the old pile. (3) Some hardware mechanisms represented by instruction reordering have high requirements for compilation technology, and the existing compilation technology is difficult to develop the full performance potential of a specific application on a specific architecture. To fully develop the performance of this kind of hardware, it is usually necessary for expert programmers to bypass the compiler and write programs directly using assembly instructions. For example, SIMD execution unit on x86 architecture requires application developers to directly use SIMD instruction set to realize data-level parallelism of specific applications; the multi-port register file in GPU stream processor requires application developers to distribute SIMT access requests on different ports according to the access characteristics of specific applications. However, the existing compiler technology is difficult to meet these needs, so that the performance obtained by the average application developer using existing compilers will have a large gap compared to that of expert programmers. It is worth noting that for a certain hardware mechanism, the execution efficiency (or performance) of a programming model is poor, which means that for a series of applications, there is a great gap between the performance obtained by using this programming model and the performance obtained by the best programming model.


(4) The hardware mechanisms represented by the GPGPU task queue have poor compatibility, in that they can only be used for specific applications on specific architectures. This kind of mechanism is usually provided to application developers as a few simple instructions or APIs, which are easy to learn, and the specialized design can give full play to the potential of the specific architecture. Most domain-specific accelerators, represented by the Google TPU, adopt similar hardware mechanisms and programming models.

From the experience of existing programming models, we can summarize the Impossible Trinity of the programming model: a new and effective programming model cannot simultaneously achieve high development efficiency (ease of use), high operating efficiency (high performance) and good generality. Figure 1.21 depicts the Impossible Trinity as an "impossible triangle": the programming model represented by CUDA has high performance and good generality, but low development efficiency; the programming model represented by NumPy is easy to use and universal, but its performance is low; the programming model represented by TensorFlow has high performance and good usability, but its generality is poor (it is applicable only to deep learning). These three types of programming models correspond, in turn, to leaving the complex hardware mechanisms that are difficult to hide completely to be handled by application developers, by compiler designers, or by architecture designers, as shown in Fig. 1.22.

Fig. 1.21 Impossible Trinity of programming model

Fig. 1.22 Causes of Impossible Trinity: how to deal with the hardware mechanism that is difficult to completely hide


Fig. 1.23 Typical hardware mechanisms are ordered according to the difficulty of hiding (parallelism, from easy to hide to hard to hide: instruction-level parallelism, pipeline parallelism, memory-level parallelism, predictive parallelism, data-level parallelism, thread-level parallelism, spatial domain parallelism; specialization: heterogeneous SoC, domain accelerator, near-data processing, scratchpad)

not polarized and are evolving all the time. First, in reality, programming models are often located inside the triangle, biased toward a vertex or an edge. For example, CUDA loses generality to some extent, but it is not extremely difficult to develop with. Programming models that are widely used need to find a compromise between the three goals. Secondly, programming models keep evolving. For example, the earliest CUDA had poor generality and was only suitable for graphics rendering and some scientific computing applications. As NVIDIA gradually added new mechanisms such as the unified address space to CUDA, and as deep learning became widely used in machine learning applications, CUDA has become more and more general. Figure 1.23 shows typical parallelism and specialization hardware mechanisms, ordered by how difficult they are to hide from the programming model. Among them, data-level parallelism, spatial domain parallelism and the scratchpad, which are difficult to hide, are common hardware mechanisms in SDCs, making programmability one of the most important challenges of SDCs.

1.5 Three Types of Exploration

According to the “Impossible Trinity”, the programming model design of SDCs should consider, in combination with existing programming models, how to let application developers make more effective use of the newly added hardware mechanisms. The most important new hardware mechanism in SDCs is spatial domain parallelism. Therefore, combined with the application fields, this section first discusses, from the perspective of applications, the similarities and differences between spatial domain parallelism and existing hardware mechanisms such as thread-level parallelism, data-level parallelism and data flow parallelism. Specifically, compared with other kinds of parallelism, spatial domain parallelism can execute applications with irregular memory access and irregular control flow more efficiently. Then, starting from existing programming models that can exploit spatial domain parallelism, this section explores how to add support for spatial domain parallelism to the existing SDC programming model from the three directions suggested by the “Impossible Trinity”.


1.5.1 Spatial Domain Parallelism and Irregular Application

Irregular applications generally refer to applications in the fields of sparse matrix algebra and graph computing, as well as applications using nonlinear data structures such as trees and sets. Many practical problems are irregular applications. For example, the collaborative filtering problem in machine learning and Bayesian graphical models belong to graph computing; cluster analysis methods such as the K-means algorithm in data mining need the set structure; relational databases use the B+ tree data structure to implement indexes. Among the 13 classes of the algorithm space given by the University of California, Berkeley [16], only two types of algorithms (dense linear algebra and spectral analysis) have almost no irregular parts, while in four types of algorithms (unstructured mesh, combinatorial logic, finite state machine, and alternation and backtracking) the irregular parts dominate. Spatial domain parallelism has the potential to deal with these various irregularities. To illustrate this point, this book first classifies the irregularity in applications with reference to existing research [17], and then points out the potential advantages of SDCs in dealing with irregular applications by comparing SDCs with the current mainstream accelerators, GPGPUs and ASICs [18].

The first type of irregularity is the control dependence introduced by complex control flow. In the C language, it appears as branch statements, bounded/unbounded loops, subfunction calls and so on. Because the control dependence expressed by control flow determines the actual execution order of statements, complex control flow leads to serial execution of statements. In addition, the statements in different branches generated by control flow often need to be mutually exclusive, which makes it difficult to execute them at the same time. Therefore, once complex control flow appears in an application, it is often difficult to parallelize it efficiently.

The second type of irregularity is the runtime dependence introduced by shared data structures. Many common data structures, such as trees and graphs, introduce shared data with irregular accesses into the algorithm. Whether these accesses can proceed concurrently must be determined from the specific access addresses at run time and cannot be determined at compile time. The mainstream parallel programming methods under the Von Neumann architecture provide a variety of synchronization primitives based on the parallel random access machine (PRAM) model to solve this problem, such as locks, semaphores and so on. These synchronization primitives are tailored to the shared memory model, and it is much more difficult to implement them on SDCs based on non-Von Neumann architectures than on a general-purpose processor. At the same time, the uncertain dependencies make it difficult for the compiler to schedule and prefetch data at compile time, which results in large delays when the chip accesses irregular data. Therefore, how to achieve irregular access to shared data without relying on the synchronization primitives of the PRAM model, while effectively hiding the access latency, is the key to solving this kind of irregularity.

In most irregular applications, these two kinds of irregularity coexist. For example, in the field of graph computing, to access the dynamic graph data structure, it is often


necessary to use nested unbounded loops to traverse the vertices of the graph; in the field of branch-and-backtracking algorithms, the branches not yet entered are generally saved on a shared stack to temporarily store previous branch results for backtracking. Therefore, to deal with irregular applications, we need to consider the implementation of these two kinds of irregularity at the same time.

At present, the two most popular emerging hardware architectures are the GPGPU, based on the SIMT computing paradigm, and the neural network accelerator, based on the data flow computing paradigm. These two kinds of hardware architectures exploit flexibility in the time domain and the spatial domain respectively, and achieve good speedups on regular applications. Figure 1.24 shows how a sequential thread on a general-purpose processor (the color from light to dark indicates the order of operations in the thread) is extended to parallel implementations under the SIMT paradigm and the data flow paradigm. In the SIMT paradigm, a thread is replicated and allocated to the large number of parallel computing resources provided by the GPGPU in space. Because all threads are controlled by the same instruction flow, the concurrent computing resources do not incur additional fetching and decoding operations, so higher energy efficiency is obtained than with general-purpose processors. In the data flow paradigm, spatial computing resources correspond, in turn, to each operation in the thread and thus form a spatial pipeline. When there is no data dependence between different threads, the data corresponding to each thread is sent into the pipeline in turn according to the compiler’s schedule. Each stage of the pipeline is activated when its data arrives, which completely avoids the additional fetching and decoding overhead of the general-purpose processor.

When there is complex control flow in the task, the single instruction flow in the spatial domain of the SIMT paradigm leads to a waste of computing resources, whereas the data flow paradigm can convert the control flow into data flow and integrate it freely into the spatial-domain pipeline. Figure 1.25 illustrates this situation by taking a branch statement as an example. In the SIMT paradigm, since there is only one instruction flow in the spatial domain controlling the execution of all threads, the instruction flow must first direct all threads to execute the first branch, and then direct all threads to execute the second branch. When different threads evaluate the branch statement and enter different branches, each thread still executes the first branch and the second branch successively under the control of


Fig. 1.24 Parallelizing a regular single threaded program using SIMT computing paradigm and data flow computing paradigm (see color chart)



Fig. 1.25 Parallelism of branch statements (if else) in tasks under SIMT paradigm and data flow paradigm (see color chart)


a single instruction flow, while each thread only needs to execute one branch on a general-purpose processor. Since SIMT paradigm is not free in spatial domain, it will waste a lot of computing resources on unnecessary branches when executing complex control flow, especially when there are nested branch statements in the application. The judgment results of branch statements in the data flow paradigm are transmitted to the next level computing unit as the data in the pipeline, and different branches correspond to different computing units in space. When a branch is selected for the data corresponding to a thread, only the computing unit corresponding to the branch is activated by the data flow of the thread. Therefore, the data flow paradigm uses its flexibility of computing resources in spatial domain to avoid wasting computing resources on unnecessary branches. When there are runtime dependencies in the task, the SIMT paradigm can hide the access delay through fast thread switching in the time domain, and the pipeline of the data flow paradigm can continue to execute only after all potential dependencies in the time domain are solved. Figure 1.26 illustrates this situation. In the data flow paradigm, once the compiler finds that there is a dependence between the two stages of the pipeline that cannot be solved by the compiler, even if the dependence does not exist at run time (for example, two threads read and write two addresses successively through pointers. If the compiler cannot confirm that the two addresses must be different, it must assume that there is a dependence between these reads and writes to ensure correctness), the next stage in the pipeline still needs to wait for all operations of the previous stage to complete before execution. Since the data flow paradigm is not free in the time domain, its corresponding pipeline must wait for all dependencies to be resolved in the time domain before it can continue to run. SIMT paradigm has similar flexibilities to general-purpose processor in time domain. It can solve some runtime dependencies by using the synchronization primitives under the PRAM model. At the same time, the GPGPU supports the cheap switching of threads. Once the data required by one group of threads to continue computing is not ready, the computing resources can quickly switch to another group of threads to continue computing under the control of the instruction flow until all the data of this group of threads are ready. Therefore, SIMT paradigm uses its flexibility of computing resources in time domain to hide the access delay of shared data, and simplifies the


Fig. 1.26 Parallelism of runtime dependencies in tasks under SIMT paradigm and data flow paradigm (see color chart)


programming of shared data structures by supporting some PRAM synchronization primitives.

SDCs based on the reconfigurable computing paradigm have the characteristics of “array computing and function reconfiguration”, manifested in the parallel execution of multiple PEs and interconnects in the spatial domain and the continuous, rapid reconfiguration of the array’s functions in the time domain. Figure 1.27 shows the flexibility of SDCs in the time domain and the spatial domain, taking as an example the mapping of eight iterations of a regular loop containing three operations onto an SDC. To simplify the discussion, it is assumed that there are only four processing elements (PEs) in the SDC architecture, and only one feasible mapping of the loop onto the four PEs is shown here. Since each PE and the interconnect between PEs are specified by configuration information at run time, the function of SDCs in the spatial domain is very flexible. To reduce the number of times configuration information must be read and loaded, a spatial-pipeline mapping similar to the data flow paradigm is adopted in Fig. 1.27, so that the data in the loop flows between PE1–PE3 and PE2–PE4 respectively. Also, since the functions of each PE and interconnect can be reconfigured at run time, PE1 and PE2 at the upstream of the pipeline in Fig. 1.27 are reconfigured to the third operation in the loop after completing the first operation. This not only ensures that the loop can be unrolled onto the computing resources, but also avoids being limited by the amount of computing resources. As can be seen from the example in Fig. 1.27, SDCs are flexible in both the time domain and the spatial domain. Therefore, there may be a design scheme that solves the above two kinds of irregularity on SDCs at the same time, so as to extend the application fields that SDCs can support to irregular applications. Of course, SDCs are not without shortcomings. Compared with the free switching of instructions in the time domain of the SIMT paradigm, SDCs are more expensive to reconfigure and need to avoid frequent switching; compared with the free data transmission in the spatial domain of the data flow paradigm, SDCs need to sacrifice some data transmission flexibility to keep the cost of reconfigurable data transmission under control. The specific implementation of the flexibility of SDCs in the time domain and the spatial domain still needs to be carefully weighed according to the target application field.


Fig. 1.27 Flexibilities of reconfigurable computing paradigm in time domain and spatial domain (see color chart)
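To make the two kinds of irregularity discussed above concrete, the hypothetical C fragment below combines a data-dependent branch (irregular control flow) with accesses through an index array and through pointers that the compiler cannot prove distinct (runtime dependence); all names are invented for this sketch.

/* Both kinds of irregularity in one loop (illustrative only). */
void accumulate(const int *idx, const double *val,
                double *out_a, double *out_b, int n) {
    for (int i = 0; i < n; i++) {
        int j = idx[i];          /* indirect access: the address is known only at run time */
        if (val[j] > 0.0)        /* data-dependent branch: irregular control flow           */
            out_a[j] += val[j];  /* out_a and out_b may alias, so the compiler must assume  */
        else                     /* a dependence between iterations, which prevents static  */
            out_b[j] -= val[j];  /* scheduling and prefetching of these accesses            */
    }
}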

Table 1.2 summarizes the expected results of SDCs compared to the SIMT paradigm and the data flow paradigm in dealing with the two types of irregularities. The SDC paradigm is relatively free in both the time and spatial domains and can handle irregular applications that are difficult to address with other types of parallelism. This is both its strength and the source of the problems it faces. The flexibility of a processor is not determined by what applications its architecture has the potential to execute, but by what applications programmers can actually develop on the architecture. When designing any new feature of a hardware architecture, it is important to consider how programmers will use it. To achieve collaborative optimization in the time domain and the spatial domain, a new programming model has to be designed on the basis of the original programming model.

Table 1.2 Expected results of the SDCs compared to the SIMT paradigm and the data flow paradigm in dealing with two types of irregularities

Complex control flow
SIMT paradigm: the spatial domain is not flexible; computing resources are wasted on unnecessary branches
Data flow paradigm: the spatial domain is flexible; control flow flows freely in the spatial pipeline
SDCs: the spatial domain is relatively flexible; control flow is converted into data flow in the spatial domain

Runtime dependence
SIMT paradigm: the time domain is free; multiple groups of threads switch freely to hide data access latency
Data flow paradigm: the time domain is not free; computing resources must wait until all dependencies are resolved
SDCs: the time domain is relatively free; reconfiguration in the time domain mitigates the impact of data access latency


1.5.2 Programming Model of Spatial Domain Parallelism

At present, the hardware architectures and programming models for spatial domain parallelism are still evolving rapidly, and industry and academia have not reached a consensus on the architecture design for spatial domain parallelism. Dynamic data flow architectures, FPGA, CGRA, TIA and other architectures are all considered suitable for spatial domain parallelism. Starting from the most widely used FPGA programming model, this section explores programming methods for spatial domain parallelism.

(1) Exploration of sacrificing the generality

On the road of sacrificing generality, hardware mechanisms such as spatial domain parallelism are mainly handled by architecture designers, while application developers describe their requirements in the form of a domain-specific library or a domain-specific language. Each application has its own unique problems. Taking the sparse linear algebra field, represented by sparse matrix–vector multiplication, as an example, this section briefly introduces the design considerations of a programming model that sacrifices generality. For the design of programming models in other application fields, such as graph computing and artificial intelligence, refer to Chap. 4. Sparse matrix–vector multiplication (SpMV) means that a sparse matrix A is multiplied by a vector x to obtain another vector y, as follows: y = Ax

(1.1)

where A is an m × n sparse matrix, x is the n-dimensional vector to be multiplied, and y is the m-dimensional result vector. Sparse matrices were first proposed in the 1960s to make full use of sparsity when solving linear equations; they can handle some problems that dense matrices cannot [4]. There is no strict quantitative mathematical definition of a sparse matrix; it usually refers to a matrix with a large number of zero elements. The most significant application of sparse matrices is the iterative solution of sparse linear systems, in which SpMV takes the longest time: studies have shown that SpMV accounts for more than 75% of the total computation time when solving large-scale systems of linear equations with indirect methods [19]. In addition, the computation of convolution in neural networks can usually be converted into matrix multiplication operations, which in turn consist of a series of matrix–vector multiplications, so the convolution operations in sparse neural networks can ultimately be converted into SpMV as well. Accelerating the SpMV algorithm is therefore highly worthwhile. For bandwidth-constrained applications such as SpMV, bandwidth utilization (BU) is generally used as the performance evaluation metric, defined as GFLOPS per unit bandwidth. However, due to the irregular control flow and irregular data access pattern of sparse matrix–vector multiplication, the computation and memory


access cannot be well matched on the traditional Von Neumann architecture. This leads to low utilization of processing elements and bandwidth, which reaches only 0.1–10% of the peak performance. With the progress of semiconductor manufacturing technology, the FPGA, based on the reconfigurable computing paradigm, has developed into a high-performance computing platform with highly parallel computing and deep pipelining resources, and it is increasingly used in servers to accelerate large parallel applications. Moreover, thanks to the reconfigurability of the hardware, an FPGA can be reprogrammed to perform new types of computing tasks, so as to meet the ever-changing needs of a wide range of industries. The implementation and optimization of SpMV, a typical large-scale parallel application, on FPGAs has therefore become a hot spot in scientific research and engineering. Although FPGAs can accelerate SpMV well with their abundant parallel computing resources, performance degrades when the size of the matrix exceeds the capacity of the FPGA's on-chip storage resources. In other words, the FPGA alone cannot solve the large-scale SpMV problem. At present, the mainstream sparse matrix representations include compressed sparse row (CSR) and coordinate (COO), which were designed for the Von Neumann architecture. On an FPGA, sparse matrix operations can be efficiently implemented with the spatial-domain pipeline computing mode, but the existing representations carry a large amount of redundant information in this mode, which wastes memory bandwidth. Designing a new sparse matrix representation around the spatial-domain pipeline computing mode of the reconfigurable computing paradigm is therefore expected to improve the bandwidth utilization of FPGAs when processing sparse matrix applications.

A sparse matrix contains a large number of zero elements. To save storage space and avoid redundant computation on zero elements, compressed storage is usually used. The storage format of the sparse matrix is also closely related to the performance of sparse matrix–vector multiplication, so many studies improve SpMV performance by optimizing the storage format. At present, the commonly used storage formats are COO, CSR and compressed sparse column (CSC). The COO format is a triplet format, that is, it consists of three arrays: val, row_idx and col_idx, which store the numerical value, row index and column index of the non-zero elements of the sparse matrix, respectively. Generally, the non-zero elements of a sparse matrix are stored in these three arrays in order from left to right and from top to bottom. However, since the COO format records the value, row index and column index of every non-zero element, the non-zero elements are independent of each other and can be stored in any order. Compared with the storage of a dense matrix, the COO format can save a lot of storage space. Among the compressed storage formats, COO is relatively flexible, but compared with the formats described below its storage space is not optimal: it records a lot of information, it is not very efficient for accessing specific elements because of its unordered arrangement, and its storage footprint is still large. Therefore, researchers have proposed a further compressed format, namely the CSR format. The CSR format is also composed of three arrays, but the elements of the arrays no longer correspond one-to-one. The three arrays are val, col_idx and row_ptr.
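As a small, hypothetical illustration of the COO layout just described (together with the row_ptr array of the CSR format discussed next), consider the 4 × 4 matrix in the C fragment below; the array names follow the val/row_idx/col_idx/row_ptr naming used in the text.

/* Hypothetical 4 x 4 sparse matrix with 6 non-zero elements:
 *   [ 5 0 0 1 ]
 *   [ 0 8 0 0 ]
 *   [ 0 0 3 0 ]
 *   [ 6 0 0 4 ]
 * COO stores one (value, row index, column index) triple per non-zero
 * element, here listed in row-major order although COO allows any order. */
double val[]     = {5, 1, 8, 3, 6, 4};
int    row_idx[] = {0, 0, 1, 2, 3, 3};
int    col_idx[] = {0, 3, 1, 2, 0, 3};

/* CSR keeps val and col_idx (in row-major order) but replaces row_idx with
 * row_ptr: row_ptr[i] is the offset of the first non-zero of row i, and the
 * last entry equals the total number of non-zero elements. */
int    row_ptr[] = {0, 2, 3, 4, 6};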


Here, the val and col_idx arrays, as in the COO format, store the values and column indices of the non-zero elements of the sparse matrix, but the elements must be stored from left to right and from top to bottom, that is, in row-major order. The third array, row_ptr, stores the offset of the first non-zero element of each row within val and col_idx, that is, row_ptr[i] is the position in the val and col_idx arrays of the first non-zero element of row i of the sparse matrix. The size of the row_ptr array is usually N + 1, where N is the number of rows of the matrix, and the last element of row_ptr is the total number of non-zero elements of the sparse matrix. Corresponding to CSR, which compresses by row, there is also the CSC format, which compresses by column. Compared with the COO format, although the elements of the three arrays in the CSR and CSC formats no longer correspond one-to-one, they store more concise information, save more storage space, and allow faster access to specific element values. CSR and CSC are the most common compressed storage formats at present; for example, the CSC format is used to store sparse matrices in MATLAB.

At present, the problems with SpMV performance mainly come from the following challenges:

(1) Irregular memory access. Since sparse matrices are usually compressed, the accesses to vector x are usually irregular. Taking the CSR storage format as an example, the address used to access vector x comes from the col_idx array, so each col_idx element must first be loaded from memory and then used as the address for accessing x. This indirect access directly leads to irregular accesses to vector x and ultimately affects the computational performance of SpMV.

(2) Load imbalance. Since the number of non-zero elements differs from row to row (or column to column) of the sparse matrix, the sizes of the data sets to be accumulated in different rows and columns differ when computing partial multiply-accumulates. This leads to load imbalance among the PEs responsible for different rows: PEs responsible for smaller data sets finish first and then sit idle, reducing the overall computational efficiency.

(3) Large-scale matrix problem. Many current designs rely on the on-chip storage resources of FPGAs and eliminate the problem of irregular memory access by caching vector x or intermediate results in the block random access memory (BRAM) on the FPGA. However, when the matrix size grows beyond the FPGA's on-chip memory capacity, these designs become powerless. Although most designs resort to a blocking strategy, blocking causes intermediate results to move back and forth between the on-chip and off-chip memories of the FPGA, increases the amount of memory traffic, and ultimately hurts overall performance. Moreover, blocking also introduces the problem of zero rows or short rows, which further affects performance.

(4) Zero row or short row problem. A zero row or short row means that the number of non-zero elements contained in some rows of the sparse matrix is zero or very small. Short rows can lead to the need for zero-padding in some PEs, resulting


in idling, which has a great impact on performance. Zero rows are generally generated during sparse matrix blocking, and eliminating them requires additional circuits and more complex control, which increases the difficulty and complexity of the design.

To solve these problems, this book designs a heterogeneous accelerator for SpMV that exploits the characteristics of the CPU-FPGA heterogeneous platform: the irregular memory accesses are executed on the CPU side and the irregular control flow is executed on the FPGA side, which enables the whole design to perform high-bandwidth, non-blocking computation with much higher bandwidth utilization. First, a new sparse matrix storage format adapted to the CPU-FPGA architecture, namely the coordinated compressed column (CCC) format, is introduced and compared with the common storage formats; in particular, the memory accesses of the different formats are compared, because this factor directly affects the performance of bandwidth-constrained applications. Then the overall design framework of the SpMV accelerator is presented, followed by the data acquisition algorithm on the CPU side, which guarantees that the data arriving at the FPGA never conflicts, so the computation can proceed without blocking. Finally, the hardware implementation of the FPGA data flow is described.

As the name suggests, the CCC format is a combination of the COO format and the CSC format and consists of three arrays. First, the matrix is partitioned by rows according to a threshold (a quantity related to the computing resources on the FPGA side), that is, each partition contains threshold rows of the matrix (any remaining rows, fewer than threshold, form the last partition), as shown in Fig. 1.28. Taking the partition as the unit, data is stored within each partition in a manner similar to CSC, that is, in column-major order, from top to bottom and from left to right; but unlike the three arrays of the CSC format, the three arrays of CCC are the same as those of COO, namely val, row_idx and col_idx, storing the value, row index and column index of the non-zero elements respectively. The advantage of this is that the linear character of the non-zero arrays is retained, so that when the CPU assembles the data, the corresponding elements of vector x can directly replace col_idx in the CCC format. In this way, the data transmitted from the CPU to the FPGA is better suited to the SpMV computation. Assuming that the threshold is 2, the CCC format of the sparse matrix A of Fig. 1.28 is shown in Fig. 1.29 [20]. Once the CCC format is adopted, the data can flow to the processing units on the FPGA side without blocking. The computation process of the whole design is roughly

Fig. 1.28 Sparse matrix after one partition



Fig. 1.29 Sparse matrix in Fig. 1.28 in CCC format

as follows: first, the CPU loads the matrix data and uses the corresponding elements of vector x to replace the column index information in the CCC format. Then the value, the row index and the corresponding x elements of the matrix flow to the FPGA side for data flow processing. The computation proceeds partition by partition: the computation of the next partition does not begin until all non-zero elements of the partition currently being processed have been computed. Since the products of non-zero elements from the same row need to be accumulated together, non-zero elements of the same row need to be marked (here, with the row index), which is represented by tokens on the FPGA. In this way, the irregular memory address accesses are executed on the CPU, while the irregular control part, namely the token comparison and reduction operations, is executed on the FPGA.

Based on the characteristics of the CPU-FPGA heterogeneous architecture and the proposed CCC format, the whole SpMV accelerator is designed. The overall design idea is to place the irregular data accesses on the CPU and execute the irregular control flow on the FPGA. On the CPU side, to make better use of data locality, the sparse matrix is accessed in the column-major order of the CCC format, so that the accesses to the elements of vector x are sequential. On the FPGA side, the multiplication and accumulation units are separated: the multipliers compute the products of the matrix non-zero elements and the x vector in column-major order, and the accumulators are responsible for accumulating the products of non-zero elements from the same row. Because the resources on the FPGA side are limited, the number of accumulators is limited; this number is exactly the threshold introduced above. At any time, only non-zero elements from threshold rows can be undergoing multiplication and accumulation, which requires that the CPU extract elements from at most threshold rows at a time. Therefore, unlike the CSC format, which spans all rows in column-major order, only threshold rows are crossed here; this also explains the basis of the partition threshold in the CCC format.

In this way, the computation of the whole system is divided as follows: the CPU first converts the original sparse matrix into the CCC format, and then takes the non-zero elements in CCC format together with the elements of the corresponding vector x and assembles them into a transaction. The transaction is transmitted to the FPGA side through shared memory. As soon as the FPGA side reads the data, it performs the computation; after computing a partition, it sends the result back to the CPU. On CPU-FPGA heterogeneous platforms with shared memory, a transaction generally corresponds to a cache line. The whole execution process can be completely pipelined without blocking, so the bandwidth utilization is very high. The execution process of the whole system is shown in Fig. 1.30.
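The following C sketch illustrates, under simplifying assumptions, the CPU-side step just described: walking the non-zero elements of one CCC partition in their stored (column-major) order and packing (value, row index, x element) triples into transactions, with the column index already consumed to fetch the matching element of x. The Entry type, the emit() callback, the group size K and all names are hypothetical; the real design additionally filters out same-row conflicts within a transaction, as described by the data extraction algorithm below.

#define K 3                      /* triples per transaction (one cache line in this sketch) */

typedef struct {                 /* one packed element sent to the FPGA                      */
    double val;                  /* non-zero value of the matrix                             */
    int    row_idx;              /* row index, used as the token on the FPGA side            */
    double x;                    /* element of vector x that replaces the column index       */
} Entry;

/* Pack the nnz non-zeros of one CCC partition into transactions of up to K
 * entries; val, row_idx and col_idx hold the partition in CCC order and
 * emit() stands for writing one transaction into the shared memory that the
 * FPGA reads. */
void pack_partition(int nnz, const double *val, const int *row_idx,
                    const int *col_idx, const double *x,
                    void (*emit)(const Entry *entries, int count)) {
    Entry t[K];
    int k = 0;
    for (int i = 0; i < nnz; i++) {
        t[k].val = val[i];
        t[k].row_idx = row_idx[i];
        t[k].x = x[col_idx[i]];          /* column index is consumed here, on the CPU */
        if (++k == K) { emit(t, k); k = 0; }
    }
    if (k > 0) emit(t, k);               /* flush the last, possibly partial, group */
}

Under the 8-byte value and 4-byte index packing assumed here, three such triples amount to roughly 60 bytes, which is consistent with the 64-byte cache-line transactions discussed later in this subsection.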


Fig. 1.30 Overall design architecture of SpMV accelerator customized for spatial-domain pipeline

Specifically, the workflow of this design can be summarized in the following three steps:

(1) Format conversion. Format conversion converts the storage format of the sparse matrix from the original format (such as COO, CSR or CSC) into the CCC format.

(2) Data extraction. In this step, the CPU assembles the sparse matrix in CCC format and the vector x into transactions according to the data acquisition algorithm, writes them to the shared memory, and then transmits them to the FPGA. To achieve non-blocking data flow computation on the FPGA side, the matrix non-zero data in the cache line read by the FPGA in each cycle must come from different rows, so that the partial products do not flow to the same accumulator after the multiplication, thus avoiding blocking caused by data conflicts. Therefore, this book designs a data extraction algorithm on the CPU side that filters out data from the same row during extraction.

(3) Computation execution. This step performs the main multiply and add operations of SpMV. As soon as the FPGA side reads the data, it starts to execute. This process is completely and deeply pipelined and can fully exploit data and pipeline parallelism.

According to the above discussion, to meet the requirement of pipelined non-blocking execution, the number of accumulators on the FPGA side must be equal to the partition threshold of the CCC format. Intuitively, a smaller threshold seems better, because a larger threshold requires more accumulator resources. However, the threshold cannot be too small, mainly for two reasons. On the one hand, there must be enough non-zero elements from different rows to assemble into a cache line, so that the CPU can fetch data effectively. On the other hand, a larger threshold means that the data in a transaction is more likely to come from different rows, so fewer invalid elements need to be filled in, which greatly benefits the CPU data extraction algorithm. See the following


contents for the data extraction algorithm. Therefore, weighing these factors, the design in this book sets the threshold to 32.

According to the above discussion, each accumulator on the FPGA side is responsible for accumulating the products of non-zero elements from one row. If two or more non-zero elements in a cache line come from the same row, a data conflict occurs because they would flow to the same accumulator in the same clock cycle. There are two ways to resolve this conflict. The first is to design a buffer that caches the conflicting data. However, this causes congestion and waiting, which affects the throughput between the CPU and the FPGA and is incompatible with the goal of non-blocking execution; worse, in extreme cases all non-zero elements would need to be cached, which is clearly impractical. The other method is to ensure, through a specific algorithm used when the CPU extracts the data, that the data within one transaction never comes from the same row. The data extraction process continues until all non-zero elements have been sent to the FPGA side.

After the FPGA side reads the data transmitted from the CPU side, it immediately starts to perform the multiply and add operations. On the FPGA side, the hardware implementation mainly comprises three units: PE_MULT, responsible for multiplication; PE_MUX, responsible for steering the data flow; and PE_REDUCE, responsible for the accumulation reduction, as shown in Fig. 1.31. Since a cache line contains three combined triples of val, row_idx and vector x elements, the number of multipliers is set to 3. It is worth noting that more multipliers would not bring additional performance, because the FPGA operates in data flow mode: as long as data is available, it is processed immediately. After the FPGA side receives the computation start signal, the three elements of the val array and the corresponding vector x elements first flow to the multipliers, and the corresponding row_idx elements are used as the gating signal of PE_MUX. The outputs of the multipliers then flow to the PE_MUX unit, which is connected to 32 accumulators, each responsible for accumulating the partial products of non-zero elements from one row. In each clock cycle, three outputs from the PE_MULT unit flow to the PE_MUX unit and, according to the row index information, are steered to the corresponding accumulator circuits. When the CPU fetches data, an identification signal is set at the end of each partition to mark its end; once an accumulator circuit detects this signal, it finishes the accumulation of this section of data and produces an output result. In the next clock cycle, the accumulator starts to accumulate the products of the data of the next partition.

To evaluate the performance of the heterogeneous domain-specific accelerator, the target platform selected for the experiments is Intel HARP-2. HARP-2 is an experimental server with a shared-memory CPU-FPGA heterogeneous architecture. It integrates the CPU and the FPGA in one chip, so that the interconnection bandwidth between them is higher and the latency is lower. This book implements the whole design on the HARP-2 platform [21]. The overall resource utilization is shown in Table 1.3. It can be seen that the whole design uses less than 40% of the logic resources on the FPGA and very few


Fig. 1.31 Overall framework of SpMV computing kernel at FPGA end

DSP computing resources (less than 1%, because the design is limited by the system bandwidth between the CPU and the FPGA). If the bandwidth increases, the design can easily be scaled up by simply adding more multipliers, multiplexers and accumulators, and there are enough resources left on the FPGA to meet this demand; the design therefore has strong scalability. For computing applications, the performance of a design is generally measured by the number of operations performed per unit time. Since the problem under study is double-precision floating-point SpMV, the absolute performance of the design can be measured by the number of floating-point operations per unit time, i.e., GFLOPS. An SpMV computation performs a total of Nz floating-point multiplications (where Nz is the number of non-zero elements of the sparse matrix) and about Nz floating-point additions. Therefore, GFLOPS is expressed as 2Nz divided by the total computation time T, i.e. GFLOPS = 2Nz/T

(1.2)

Among them, T includes the time consumed by the CPU side to extract data, the data transmission time from the CPU to the FPGA, the multiplication and addition computation time at the FPGA side, and the transmission time of the result vector

Table 1.3 Resource usage of heterogeneous sparse linear algebra system at FPGA end
Accelerator function unit: adaptive logic modules 34%, block random access memory 10%, DSP blocks less than 1%
Cache consistency interface: adaptive logic modules 5%, block random access memory 5%, DSP blocks 0%


back to the CPU from the FPGA. T does not include the time when the sparse matrix is converted from the original storage format to CCC format. This is part of the operation that is preprocessed. In irregular operations such as SpMV, preprocessing is acceptable. The reason why GFLOPS is absolute performance is that the characteristics and related parameters of various systems are different, such as system bandwidth, the number of accelerators, FPGA type and other characteristics. These factors affect the performance of the final design. In particular, the system bandwidth directly affects the performance of bandwidth constrained applications such as SpMV. The larger the system bandwidth, the better the final performance of SpMV, which is particularly obvious in our proposed design. Therefore, if we want to fairly compare the performance of SpMV accelerator under different systems, it is not accurate to only use the absolute performance GFLOPS as the metric. Based on this, this book uses bandwidth utilization (BU) as a relative metric to describe the final performance. BU is the absolute performance per unit bandwidth, that is, the designed GFLOPS is divided by the system bandwidth, and its unit is GFLOPS/GB, as shown in Eq. (1.3): BU = GFLOPS/bandwidth

(1.3)
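As a purely illustrative calculation with hypothetical numbers (not measured results), a design that sustains 1.2 GFLOPS on a platform whose CPU-FPGA bandwidth is 12 GB/s would have

BU = 1.2 GFLOPS / 12 GB/s = 0.1 GFLOPS/GB

If the bandwidth of such a bandwidth-bound design were doubled, its GFLOPS would at best double as well, leaving BU unchanged, which is why BU is the fairer metric for comparing accelerators across platforms with different bandwidths.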

Another metric, the designed throughput, can also indirectly reflect the utilization of bandwidth. Throughput is expressed as the total amount of cache line data transferred, cl_trans, divided by the transaction transmission time from the CPU to the FPGA, T_trans, as shown in Eq. (1.4): throughput = cl_trans/T_trans

(1.4)

Since not all the data contained in a cache line is valid, the throughput also does not fully reflect the bandwidth utilization. The invalid data in a cache line arises mainly in the following cases: (1) in the CPU-side data fetch algorithm, if the data candidates for the three slots of a transaction conflict, invalid padding data is filled in; (2) in the data fetch algorithm, an identification signal marking the end of each partition is filled in at the end of that partition; (3) the cache line size does not exactly match the data size: the combination of the val and row_idx arrays and the x vector elements occupies 60 bytes, while the cache line is 64 bytes, so 4 bytes of each cache line carry no payload. Nevertheless, the throughput reflects well how non-blocking the execution of the system is: the closer the throughput is to the system bandwidth, the better the non-blocking execution of the design. Therefore, the throughput of the benchmarks is also reported for this design. The performance and bandwidth utilization results of the benchmarks on the heterogeneous sparse linear algebra acceleration module are shown in Fig. 1.32. It can be seen that the GFLOPS values of the 17 test matrices are very close. This is expected in our design, because the performance bottleneck is the bandwidth between the CPU and the FPGA, which is independent of the attributes of the input sparse matrix; therefore, although the sparsity and the distribution of non-zero elements differ from matrix to matrix, the measured GFLOPS are roughly the same.


Fig. 1.32 Performance results of heterogeneous sparse linear algebra acceleration module in test matrix

Similarly, the throughput of the 17 test matrices can also be calculated. The results show that the throughput of all test matrices is close to 12 GB/s, the same as the system bandwidth, indicating that the design occupies the entire system bandwidth; this is almost full-bandwidth computation with efficient non-blocking execution. It can therefore be expected that the bandwidth utilization is also high. The GFLOPS comparison with the design of Grigoras et al. [19] on the Maxeler Vertics platform is shown in Fig. 1.33. It can be seen that the design architecture proposed in this section achieves GFLOPS values of the same order of magnitude as the implementation on the Maxeler Vertics platform. For matrices with higher sparsity (i.e., fewer non-zero elements), the GFLOPS of the design in this section is higher. This is because the design in Ref. [19] generates an optimal architecture according to the attributes of the input sparse matrix, so that the access continuity of each block is as close as possible to the access characteristics of dense matrix–vector multiplication; for denser matrices this approach performs better, while its performance gradually decreases as the matrix becomes sparser. The design proposed in this section, by contrast, is insensitive to the attributes and structure of the sparse matrix, because the irregular data accesses and the irregular computation are placed on the CPU and the FPGA respectively, so its GFLOPS is similar regardless of whether the sparse matrix is sparser or denser. It is worth noting that the CPU-FPGA bandwidth of the Maxeler Vertics platform used in Ref. [19] is three times that of the HARP-2 platform used in our design, so the bandwidth utilization of the design in this section is much higher than that on the Maxeler Vertics platform. The bandwidth utilization results of both are shown in Fig. 1.33: the average bandwidth utilization of the architecture proposed in this section is 0.094 GFLOPS/GB, while that of the architecture of Ref. [19] on the Maxeler Vertics platform is 0.031 GFLOPS/GB, an improvement of about 2 times (roughly three times the bandwidth utilization) over Ref. [19].


Fig. 1.33 Comparison of performance and bandwidth utilization with Ref. [19]

From the perspective of implementing the programming model, the design in this book is very friendly to application developers and compiler designers. The architecture designer only needs to provide a set of APIs for converting between the common sparse matrix formats and the CCC format, together with APIs for common sparse matrix operators, to let application developers use the sparse linear algebra acceleration module based on the heterogeneous reconfigurable architecture. Moreover, this programming model is forward compatible: no matter what changes are made to the reconfigurable architecture, such as sharing DRAM directly with the CPU or switching from an FPGA to a CGRA, the hardware changes can remain completely transparent to the compiler designer and the application developer, because the programming model leaves the complexity of the hardware architecture to the architecture designer. In short, the CCC format for sparse matrices sacrifices generality for better performance and development efficiency.

(2) Exploration of sacrificing the development efficiency

One of the key issues that makes it difficult for application developers to use spatial domain parallelism is that they are more familiar with the PRAM model of the Von Neumann architecture and with imperative programming languages based on it, such as C and Java. Since each thread on a general-purpose processor executes step by step according to its instructions, and the human brain is more used to describing tasks step by step, imperative programming is friendlier to application developers. Today, most programming languages are imperative.


In imperative programming languages, parallelism is usually expressed as thread-level parallelism and data-level parallelism. Different threads execute concurrent instructions at the same time; when synchronization between threads is required, application developers use shared-memory synchronization primitives such as locks, semaphores and barriers as needed. Data-level parallelism is expressed explicitly by application developers through SIMD instructions. However, it is difficult for imperative programming languages to express spatial domain parallelism. On the one hand, spatial domain parallelism contains many concurrent modules; with the thread abstraction of an imperative programming language, each module would have to correspond to a thread, which would make programming extremely complicated. On the other hand, synchronization between spatially parallel modules is very frequent: for example, when multiple modules form a pipeline, a handshake synchronization may be required between two modules every clock cycle. The shared-memory synchronization primitives of imperative programming languages can hardly satisfy such frequent synchronization requirements.

To address this challenge, an attempt is made here to replace the existing imperative programming model with the message-passing channel primitives of the communicating sequential processes (CSP) concurrent programming model, so as to exploit the spatial domain parallelism in applications more efficiently [22]. CSP describes an application as a set of independent processes that interact with each other only through message-passing channels. A channel for exchanging messages is pre-declared between processes, and each process then synchronizes by writing data to and reading data from the channel at run time. In CSP, the channel is a primitive for communication between multiple modules with low latency and high concurrency. At present, mainstream FPGA HLS language vendors have also added channel syntax to their languages, for example Intel’s FPGA SDK for OpenCL. In the hardware implementation on SDCs, a channel can be mapped onto concrete FIFO queues; that is, with the channel mechanism, different kernels can be linked directly through FIFO queues. Figure 1.34 shows the overall framework of the channel mechanism; the FIFOs in the figure are the channel primitives. The following describes some characteristics of channel data transmission with a specific example.

Table 1.4 shows a code example of synchronization between modules using channel primitives. The code describing the single-channel data transmission can be synthesized on SDCs into the data transmission hardware structure shown in Fig. 1.35. In Table 1.4, the Producer kernel writes 10 elements ([0, 9]) into channel t0, and the Consumer kernel reads 5 elements from the channel each time it is executed. On the first read, as can be seen from Fig. 1.35, the Consumer kernel reads the five values 0–4. Since the data in the channel persists until the entire design loaded onto the FPGA has finished executing, the Consumer kernel does not stop running after reading the channel only once: after reading the first five values, it executes a second time and reads the values 5–9.
The above example code will not have a problem when the Producer executes only once, but it may deadlock when the Producer kernel needs to execute multiple times.


Fig. 1.34 Synchronization between modules (i.e., Kernel in the figure) in spatial domain parallelism using channels

Table 1.4 Code example of synchronization between modules using channels

Example of single-channel data transmission code:

channel int t0;

__kernel void Producer() {
    for (int c = 0; c < 10; c++)
        WRITE_CHANNEL(t0, c);
}

__kernel void Consumer(__global int *dst) {
    for (int i = 0; i < 5; i++)
        dst[i] = READ_CHANNEL(t0);
}

Table 1.5 Pseudo code example of implementing K-means by using work-stealing (fragment)

16      …
17        end if
18      end if
19    end while
20    barrier
21    if i = 0 then M ← reduce(Ms)
22  end kernel


Fig. 1.36 Hardware diagram of K-means filtering algorithm implemented using work-stealing and stack


Without work-stealing, the load of the work items is determined by the topology of the kd-tree. In extreme cases, if all nodes of the kd-tree have only a left subtree, only one work item will execute. With work-stealing, whether or not the subtrees of the kd-tree are balanced, the corresponding tasks are distributed over the three work items as a whole, so the parallelism of the whole program is well guaranteed. Although work-stealing solves the problem of dynamically balancing tasks among multiple parallel work items, many parts still execute sequentially at run time, so the parallelism that can actually be exploited remains limited. According to Table 1.5, the implementation can be abstracted into three major steps, GET, PROCESS and PUSH, from which a data flow diagram and a timing diagram of their execution can be drawn. As shown in Fig. 1.37, in the case of two work items, each iteration cannot end until all GET, PROCESS and PUSH operations of this round have finished in both work items, and only then can the next round begin. As the timing diagram shows, although the algorithm has some parallelism (the two work items execute in parallel), the algorithm's efficiency within a single work item is very low, possibly even lower than that of ordinary serial execution. This is because, at the end of each round, even if a single work item has quickly finished its GET, PROCESS and PUSH operations, it cannot immediately start the next round; it must wait until the slowest of all work items has finished. Obviously, this is very inefficient, which leaves room to further improve the parallelism of the algorithm. One essential reason for this reduced parallelism is that, during the execution of the loop, the data to be read in the next iteration must come from the data written in the previous iteration. This creates an inter-iteration dependence during program execution, and this dependence causes the start of the next iteration to wait for the slowest


Fig. 1.37 Data flow diagram and timing diagram of the work-stealing implementation

execution of all work items to finish, thus reducing the parallelism of the program's execution. If the read and write operations could be put into a task queue, and the task queue could be operated on in parallel, such dependencies would no longer exist and the parallelism of the program could be improved further. In terms of concrete hardware, this idea needs to be implemented with a double-ended FIFO that has independent read and write ports. However, previous imperative programming languages lack the means to describe such a double-ended FIFO that can be accessed in parallel. For the implementation in Table 1.5, achieving such read/write parallelism would require two nested while loops to poll the double-ended FIFO, and in current HLS languages such nested loops cannot be pipelined, because the internal initiation interval (II) cannot be determined accurately in such cases, so these operations cannot be expanded into a pipeline. Fortunately, such concurrent polling of an SPSC queue can be implemented with channels. In OpenCL, a non-blocking channel with a specified depth corresponds to a double-ended FIFO on the FPGA, which greatly simplifies the design of the double-ended FIFO and makes it possible to improve the parallelism of the algorithm by implementing a single-producer single-consumer parallel queue. Such a concurrent data structure can greatly improve the parallelism of the K-means algorithm; it is simple to design and highly stable, and, while ensuring performance, its results are exactly as accurate as those of previous methods. This is because, with such a parallel queue structure, the tasks executed by the work items can conveniently be pipelined, and pipelining greatly improves program performance.

Table 1.6 is a pseudo code example of implementing the K-means algorithm using a single-producer single-consumer queue. This sample code consists of two kernels. The PROCESS kernel traverses the entire kd-tree and performs the necessary computation and filtering of the cluster centroids during the traversal; the UPDATE kernel updates the set of cluster centroids when the corresponding conditions are met during the processing of the PROCESS kernel. In the example code, two channels are instantiated. One of them, update_c, is used to pass, from the PROCESS kernel to the UPDATE kernel, the signals that meet the criteria for updating the set of cluster centroids. The other, task_c, is an important part of the


Table 1.6 Pseudo code example of implementing K-means using the SPSC queue

     1   channel Task task_c;
     2   channel Update update_c;
         kernel PROCESS(global Tree *tree, global Centerset *m)
     3     while true do
     4       Task t = READ_CHANNEL(task_c);
     5       if Update u = FILTER(t, tree, m) then
     6         WRITE_CHANNEL(update_c, u);
     7       else
     8         WRITE_CHANNEL(task_c, t.left);
     9         WRITE_CHANNEL(task_c, t.right);
    10       end if
    11     end while
         kernel UPDATE(global Centerset *m)
    12     bool terminated = false; // flag for finishing traversal
    13     while not terminated do
    14       Update u = READ_CHANNEL(update_c);
    15       terminated = UPDATE_CENTER(m, u);
    16     end while

This task queue either generates two new tasks (lines 8 and 9 in Table 1.6) or passes a signal to the UPDATE kernel to update the set of cluster centroids (line 6). The FILTER function in Table 1.6 calculates the distance between the current node and the cluster centroids in the candidate list and finds the one with the smallest distance. If the current node is a leaf node, its data are directly counted toward the classification of the nearest cluster centroid; if it is not a leaf node, the cluster centroids that are "inferior" to the nearest one are excluded, thus reducing the number of distance computations. After the exclusion, a further check is made: if only one cluster centroid remains in the candidate list, then all data objects under this node are assigned to the range of that cluster centroid. If not, the operation proceeds according to the children of the current node: if there is only a left child, its content is sent to the parallel queue for subsequent processing; likewise, if there is only a right child, its content is sent to the parallel queue; if both children exist, both are sent to the parallel queue. This cycle continues until an update signal appears or all nodes of the kd-tree have been traversed, and then the update of the cluster centroids starts. In the UPDATE module, signals and data from the parallel task queue are received through the channel. When a signal to update the cluster centroids is received, the UPDATE module calculates the mean value of the data nodes assigned to each cluster centroid and transmits the new cluster centroids to the parallel task queue, which in turn transmits the corresponding data and signals to the PROCESS kernel for the next step.
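As an illustration only, the following C++ sketch mimics the distance-and-pruning step performed by FILTER; the Point type, the use of a node midpoint, and in particular the pruning threshold are hypothetical simplifications of the real bounding-box test, not the actual kernel code.

    #include <cstdio>
    #include <vector>

    struct Point { double x, y; };

    double dist2(const Point& a, const Point& b) {
        double dx = a.x - b.x, dy = a.y - b.y;
        return dx * dx + dy * dy;
    }

    // Returns the surviving candidate indices for a node represented by its midpoint.
    std::vector<int> filter(const Point& node_mid,
                            const std::vector<Point>& centroids,
                            const std::vector<int>& candidates) {
        int best = candidates.front();
        for (int c : candidates)
            if (dist2(node_mid, centroids[c]) < dist2(node_mid, centroids[best])) best = c;

        std::vector<int> survivors{best};
        for (int c : candidates) {
            if (c == best) continue;
            // Simplified pruning test: keep a candidate only if it is not clearly
            // farther from the node than the best one (a placeholder for the
            // bounding-box test used by the real filtering algorithm).
            if (dist2(node_mid, centroids[c]) < 4.0 * dist2(node_mid, centroids[best]))
                survivors.push_back(c);
        }
        // If only one candidate survives, all points under this node belong to it.
        return survivors;
    }

    int main() {
        std::vector<Point> centroids = {{0, 0}, {5, 5}, {9, 9}};
        std::vector<int> cand = {0, 1, 2};
        auto kept = filter({1.0, 1.0}, centroids, cand);
        std::printf("%zu candidate(s) survive\n", kept.size());   // prints 1
    }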


Throughout the algorithm, updating, passing the updated data, and receiving the update signals are repeated until the cluster centroids no longer change, or change only within a given error range; only then does the algorithm truly finish. Figure 1.38 shows the hardware diagram on the FPGA after synthesizing the K-means filtering algorithm implemented with the SPSC queue. As can be seen from Fig. 1.38, the three modules PROCESS, UPDATE, and SPSC are all connected by channels. Since the whole channel is a double-ended FIFO, the data fed into the SPSC at the same time have no dependence on each other, so the whole process of feeding data in and out is non-blocking, and there is almost no waiting time in the PROCESS module between processing the data of one round and the next. As soon as input data start feeding in, data are generated continuously until the whole kd-tree is traversed or the termination condition of the algorithm is met, so the whole operation is highly pipelined. Figure 1.39 shows a case after this method is adopted. This removes the bottleneck of the previous K-means filtering algorithm implemented on FPGAs, where multiple modules had to wait for each other and the execution time was determined by the slowest module; further experiments will show the large performance improvement brought by this change. Although the single-producer single-consumer parallel queue is a good solution to the problem of pipelining operations within a single work item, and thus further enhances the parallelism of program execution, this design has a fatal problem: when assigning tasks using the double-ended single-producer single-consumer queue, it is equivalent to performing a breadth-first search (BFS) on the kd-tree. This means the queue must have a capacity equal to half of the nodes in the kd-tree, which is the only way to ensure that no deadlock occurs during the whole operation. In other words, if sufficient capacity is not provided, the queue may get stuck and cause a deadlock in the PROCESS module. Generally, the number of nodes in the kd-tree is huge, and if the queue needs a capacity of half the number of nodes, the demand on hardware resources is considerable; given the limited hardware resources, this further limits the application of this method to large-scale data structures.
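For readers more familiar with software, the following C++ sketch is a rough software analogue of such a bounded double-ended FIFO: the producer only advances the tail and the consumer only advances the head, so both sides can operate in parallel, and the fixed capacity is exactly what creates the deadlock risk discussed above. It is an illustration under these assumptions, not the OpenCL channel implementation.

    #include <array>
    #include <atomic>
    #include <cstddef>
    #include <optional>

    template <typename T, std::size_t Capacity>
    class SpscQueue {
    public:
        bool try_push(const T& v) {                       // producer side only
            std::size_t t = tail_.load(std::memory_order_relaxed);
            std::size_t next = (t + 1) % (Capacity + 1);
            if (next == head_.load(std::memory_order_acquire)) return false;  // full
            buf_[t] = v;
            tail_.store(next, std::memory_order_release);
            return true;
        }
        std::optional<T> try_pop() {                      // consumer side only
            std::size_t h = head_.load(std::memory_order_relaxed);
            if (h == tail_.load(std::memory_order_acquire)) return std::nullopt;  // empty
            T v = buf_[h];
            head_.store((h + 1) % (Capacity + 1), std::memory_order_release);
            return v;
        }
    private:
        std::array<T, Capacity + 1> buf_{};
        std::atomic<std::size_t> head_{0}, tail_{0};
    };

    int main() {
        SpscQueue<int, 4> q;                              // small, fixed capacity
        q.try_push(42);
        return q.try_pop().value_or(-1) == 42 ? 0 : 1;
    }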


Fig. 1.38 Hardware diagram of K-means algorithm implemented by SPSC queue

Fig. 1.39 Timing diagram of the PROCESS module processing data in the ideal case (RdC: read data from channel; WrC: write data back to channel)

To overcome this limitation, a good choice is to perform a pre-order traversal using a stack instead of a queue. In this case, the size of the stack only needs to equal the depth of the kd-tree, which is far less than half the number of its nodes, providing much better scalability for large-scale data applications. Table 1.7 shows the pseudo code implemented with a single-producer single-consumer parallel stack. To ensure a pre-order traversal in the K-means algorithm, the channel task_c instantiated in the single-producer single-consumer queue needs to be replaced by a dedicated kernel named SPSC_STACK (see the code in Table 1.7).

Table 1.7 SPSC stack pseudo code

     1   channel Task push_c;
     2   channel Task pop_c;
     3   __attribute__((autorun))
         kernel SPSC_STACK()
     4     local Stack stack;
     5     while true do
     6       Task t;
     7       if READ_CHANNEL_NB(push_c, &t) then
     8         pushStack(&stack, t);
     9       end if
    10       if t = peekStack(&stack) then
    11         if WRITE_CHANNEL_NB(pop_c, t) then
    12           popStack(&stack);
    13         end if
    14       end if
    15     end while


In Table 1.7, the autorun attribute in line 3 declares that SPSC_STACK is a kernel module that runs automatically once the program is loaded onto the FPGA. Through such a declaration, the SPSC_STACK module can serve as a standalone service kernel. A stack data structure is instantiated in OpenCL local memory and mapped to dual-ported BRAM, and it persists for the entire execution of SPSC_STACK. In each cycle or iteration, the SPSC_STACK kernel polls the push channel (line 7 in the code) and, if a task arrives, pushes it onto the stack (here, a minor modification is made to the syntax of read_channel_nb in Intel FPGA OpenCL to better fit this application). At the same time, if the stack is not empty, it tries to read the top of the stack and pop a task out through the pop channel (line 11 of the code). Owing to the non-blocking nature of the channels, the interface of the SPSC_STACK kernel is very simple and easy to use compared with the imperative programming paradigm: PROCESS only needs to poll the push and pop channels for task insertion and extraction. The K-means algorithm implemented using the framework in Fig. 1.38 does not make good use of the FPGA resources; even with this framework, a lot of FPGA resources remain idle. A way is therefore needed to extend the design to maximize the use of idle resources in the FPGA, which would also maximize the parallelism of the algorithm. The first simple idea is to instantiate multiple copies of the framework. Although this approach can use the idle resources in the FPGA, the traversal of the kd-tree is dynamic; with simple replication, the balance between the multiple modules is difficult to maintain, and it is likely that multiple modules will end up waiting for the slowest one to finish, so the pipelining within the program would be greatly affected. Therefore, to maximize the use of idle FPGA resources while ensuring sufficient parallelism, dynamic load balancing must be maintained while extending the framework of Fig. 1.38. To solve this problem, this section constructs a concurrent multiple-producer multiple-consumer (MPMC) stack based on the single-producer single-consumer stack. When constructing this MPMC stack, a regular work-allocation strategy is adopted to ensure dynamic load balance. Work distribution is a proactive, synchronous strategy (from the perspective of the task creator) in which new tasks are evenly distributed to idle processing units, whereas work-stealing is a reactive, asynchronous strategy in which idle processing units try to steal tasks from other processing units. Both are viable ways to achieve dynamic balancing. One of the main reasons for choosing load balancing based on work allocation is that, in the Intel FPGA OpenCL syntax, having multiple read operations or multiple write operations acting on the same channel is not recommended, because such a design is very unfriendly to pipelining. Figure 1.40 shows the framework of the K-means algorithm implemented with the multiple-producer multiple-consumer stack, which is also implemented on the HARP-2 platform.
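The work-allocation idea can be sketched in plain C++ as follows: a distributor hands newly created tasks to per-worker stacks in round-robin order, so each stack keeps a single producer and a single consumer. The Task type, the toy tree expansion, and the sequential simulation of the workers are assumptions for illustration; this is not the HARP-2 implementation.

    #include <cstdio>
    #include <stack>
    #include <vector>

    struct Task { int node; int depth; };

    int main() {
        constexpr int kWorkers = 4;
        std::vector<std::stack<Task>> stacks(kWorkers);
        int next = 0;                                   // round-robin cursor

        auto distribute = [&](const Task& t) {          // the DISTRIBUTOR role
            stacks[next].push(t);
            next = (next + 1) % kWorkers;
        };

        distribute({0, 0});                             // root of a toy kd-tree
        int processed = 0;
        for (bool busy = true; busy;) {
            busy = false;
            for (auto& s : stacks) {                    // each worker drains its own stack
                if (s.empty()) continue;
                busy = true;
                Task t = s.top(); s.pop();
                ++processed;
                if (t.depth < 3) {                      // expand left/right children
                    distribute({2 * t.node + 1, t.depth + 1});
                    distribute({2 * t.node + 2, t.depth + 1});
                }
            }
        }
        std::printf("processed %d tasks\n", processed); // 15 for this depth-3 toy tree
    }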
The three K-means filtering algorithms described in this section can all be implemented on FPGA using the Intel OpenCL for FPGA toolchain. The experiments begin by comparing the constructed concurrent single-producer single-consumer stack with a sequential baseline; although sequential, this baseline is carefully optimized.


Fig. 1.40 Implementation of the K-means clustering algorithm with the multiple-producer multiple-consumer stack

In the experiments, the baseline executes in 0.0198 s per iteration on the HARP-2 platform, while the constructed concurrent SPSC stack executes in 0.0013 s per iteration, about a 15.2-fold improvement. Figure 1.41 compares the performance improvement of the algorithm implemented with the concurrent multiple-producer multiple-consumer stack against the ideal improvement. On top of the 15.2-fold speedup of the concurrent single-producer single-consumer stack over the baseline, the MPMC version achieves a further speedup of up to about 3.5 times (with four work items). The figure also shows that the performance improvement scales nearly linearly with the number of work items and stays very close to the ideal case, which fully demonstrates that the constructed concurrent multiple-producer multiple-consumer stack is scalable in improving program parallelism, and this is of great significance for program parallelism.

Fig. 1.41 Comparison between multiple-producer multiple-consumer stack version and ideal speed increase


Fig. 1.42 Comparison of hardware resources implemented by different K-means filtering algorithms (see color chart)

The hardware resources consumed by the program are also a very important reference metric, and they matter for further optimization and improvement of the algorithm. Figure 1.42 compares the hardware resources of the different implementations. The SPSC version uses slightly fewer logic resources than the baseline, because after using channels to build the concurrent data structure, the logic of the whole program is clearer and there are fewer complex dependences, so the proportion of the corresponding hardware logic resources decreases. In terms of RAM resources, the SPSC version shows a slight increase, because using channels to build a concurrent data structure inevitably adds memory overhead, which is one of the main reasons for the increase in RAM resources. In terms of DSP consumption, there is little difference between the SPSC version and the baseline; the SPSC version's DSP consumption is slightly lower, which may also be an optimization brought by the simplified logic. In Fig. 1.42, MPMC_P2 and MPMC_P4 represent the resource consumption of the MPMC version with 2 and 4 work items, respectively. Compared with the SPSC version, both consume more logic, RAM, and DSP resources. This is because the MPMC version essentially replicates the SPSC version; when the hardware resources permit, it preferentially uses up the available resources so as to maximize the utilization of idle hardware. After switching from an imperative programming language to a CSP programming language, and from imperative synchronization primitives such as locks and atomic operations to channel synchronization primitives, spatial-domain parallelism can be developed and utilized more effectively. SDCs, which count spatial-domain parallelism among their main advantages, can adopt a CSP programming language (such as Golang) as an alternative to imperative programming languages. It should be noted that the channel synchronization primitive actually sacrifices the development efficiency of the application. On the one hand, the human brain is more accustomed to the imperative style, that is, describing tasks step by step, so when switching


from an imperative programming language to a CSP programming language, application developers need to "translate" existing applications. On the other hand, the existing program debugging process is based on imperative programming languages; common debugging methods such as breakpoints are almost completely ineffective for CSP programs, and a significant amount of time was spent debugging the concurrent data structures developed in this section. In short, CSP channel primitives develop spatial-domain parallelism by sacrificing development efficiency in exchange for generality and execution efficiency.

(3) Exploration of sacrificing the execution efficiency

According to the analysis in Sect. 1.5.1 [18, 24], the complex control flow and runtime dependences of irregular applications make it difficult for existing FPGA programming methods to exploit their parallelism. Referring to the runtime parallel programming methods for irregular applications on general-purpose processors, this section proposes a fine-grained pipelined parallel programming model for FPGA and CGRA to extract fine-grained parallelism from irregular applications. A common feature of the high-level programming methods already applied to FPGA (such as HLS and stream-computing programming methods) is that the parallelism in the application is extracted at compile time. As a result, these programming methods are only applicable to regular applications with relatively simple control flow and structured data access patterns. However, many important compute-intensive applications, such as graph computing and computational graphics, have control flow that is difficult to predict by static profiling and data structures with poor locality. Therefore, current implementations of these applications on FPGAs can only use the underlying HDL to extract the inherent parallelism of the application. For example, to handle sparse-matrix applications, computational resources on FPGAs can be manually mapped to dedicated sparse-matrix preprocessing logic that analyzes the input data to parallelize the computation; graph computing applications often customize the FPGA memory access path to tap the parallelism of graph data structures. This shows that although there is a lot of parallelism to be developed in these irregular applications, the existing high-level programming models of FPGA cannot effectively express it and can only rely on the underlying HDL to specialize the parallelism of these applications. The key to solving the programmability problem of irregular applications is to deal with the mismatch between the FPGA execution mechanism and existing high-level programming methods. According to Sect. 1.5.1, the execution mechanism of FPGA needs to specify the concrete implementation of the time-domain and spatial-domain flexibilities in the computing paradigm. HDL, with its lower abstraction level, matches this execution mechanism by providing an abstraction of multiple mutually independent processes operating in parallel. However, designing large-scale applications in HDL requires reasoning about the collaboration of a large number of concurrent processes at a low abstraction level, which makes it difficult for application developers to write and debug HDL programs efficiently. In contrast, HLS based on high-level languages such as C/C++ reduces the development difficulty by raising the abstraction level. However, these high-level languages are built on the von Neumann computing paradigm and impose a sequential programming model.


Therefore, the current mainstream HLS technology can only be applied to regular applications, relying on the compiler to discover the parallelism in the programmer's code. Taking the BFS algorithm in graph computing as an example, the characteristics of fine-grained parallelism can be analyzed. Figure 1.43a shows the pseudo code of the BFS algorithm. The BFS algorithm traverses a graph and labels each vertex v with v.level, the number of edges on the shortest path from the source vertex root to v. In the pseudo code, the struct Visit denotes a visit to a vertex. The BFS algorithm maintains a first-in-first-out queue of Visit tasks (line 3). Initially, the v.level of every vertex is set to infinity. After the source vertex is added to the queue (line 5), each iteration (line 6) reads a Visit from the task queue and accesses all neighbors of Visit::vertex. When a vertex whose v.level is larger than Visit::level is visited, the v.level of that vertex is updated (line 9), and a new task Visit(v, t.level + 1) is added to the task queue (line 10).
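For reference, the following is a plain sequential C++ sketch of the task-queue BFS described above. The Visit structure and the line references follow the description of Fig. 1.43a; the exact pseudo code is not reproduced here, so details such as the relabeling condition are reasonable assumptions.

    #include <climits>
    #include <cstdio>
    #include <queue>
    #include <vector>

    struct Visit { int vertex; int level; };

    std::vector<int> bfs_levels(const std::vector<std::vector<int>>& adj, int root) {
        std::vector<int> level(adj.size(), INT_MAX);   // all levels start at "infinity"
        std::queue<Visit> tasks;                       // FIFO task queue (line 3)
        level[root] = 0;
        tasks.push({root, 0});                         // enqueue the source vertex (line 5)
        while (!tasks.empty()) {                       // outer for-each loop (line 6)
            Visit t = tasks.front(); tasks.pop();
            for (int v : adj[t.vertex]) {              // inner for-all loop (line 7)
                if (level[v] > t.level + 1) {          // unlabeled or farther vertex
                    level[v] = t.level + 1;            // relabel it (line 9)
                    tasks.push({v, t.level + 1});      // create a new Visit task (line 10)
                }
            }
        }
        return level;
    }

    int main() {
        std::vector<std::vector<int>> adj = {{1, 2}, {0, 3}, {0, 3}, {1, 2}};
        for (int l : bfs_levels(adj, 0)) std::printf("%d ", l);   // prints 0 1 1 2
        std::printf("\n");
    }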

Fig. 1.43 BFS algorithm and its analysis


Figure 1.43b shows the CDFG of the BFS algorithm's nested loops (lines 6–11). Each vertex represents a basic operation in a Visit task, and each edge represents a dependence between operations. If the compiler looks for parallelism in the BFS algorithm only at compile time, it can find that all loop bodies of the inner loop (line 7) are independent of each other, because for a given vertex there is no dependence between the operations that query the v.level of each of its neighbors. This is also shown in Fig. 1.43b. The irregularity of BFS is mainly reflected in two aspects. First, the creation of Visit tasks is dynamic, which makes static scheduling by the compiler very difficult. In the BFS algorithm, the tasks created by the outer loop depend on the connectivity between the vertexes in the graph being processed (dependences i and ii in Fig. 1.43b), while the number of inner-loop tasks varies with the number of outgoing edges of each vertex, i.e., the number of tasks that depend on that task (dependence iv). To resolve these dependences, the implementation of the BFS algorithm on a general-purpose processor requires complex control flow with multiple layers of nested loops and branch statements. Second, dynamic access to shared memory leads to runtime dependences (dependence iii). For the BFS algorithm, the outer loop's accesses to shared memory must be synchronized with PRAM synchronization primitives to avoid conflicts when reading and writing vertexes. The existing HLS-based programming model explores the parallelism of the BFS algorithm only at compile time. After synthesizing an OpenCL description of BFS with the Altera OpenCL (AOCL) SDK and analyzing the result, the scheduling diagram in Fig. 1.43c can be obtained. The BFS is split into two kernels: Kernel 1 checks whether a neighbor vertex has been accessed and marks it if not; Kernel 2 accesses the marked vertexes. In the AOCL implementation of BFS, the above irregularities have to be resolved by the host general-purpose processor (Host) interacting with the FPGA. Each kernel corresponds to multiple pipelines on the FPGA. The Host then lets the FPGA execute Kernel 1 and Kernel 2 sequentially through reconfiguration until Kernel 2 cannot find any vertex that still needs to be accessed. Through the Host, all inter-loop dependences in Fig. 1.43b are resolved. However, the execution process is over-serialized: the Host program inserts a barrier operation between two kernel calls to ensure that there is no collision in data access between loops. HDL implementations of the BFS algorithm are already available on CGRAs with better performance than general-purpose processors. Analyzing these implementations reveals that they all use structures that cannot be expressed in existing HLS programming approaches. First, task collection and allocation in the HLS implementation are carried out through a complex instruction flow on the host, while in HDL they are driven by data flow. Second, in HLS all possible memory access conflicts are avoided through barrier operations, while in HDL the index is checked dynamically at runtime to avoid conflicting accesses. Thanks to these two improvements, the BFS performance of the HDL implementation is much better than that of HLS. Figure 1.44 compares the scheduling diagrams.

Fig. 1.44 Scheduling diagrams of the HLS (BFS-AOCL) and HDL (BFS-SPEC) BFS implementations on CGRA: (a) input graph structure and simplified BFS CDFG; (b) comparison of BFS vertex access order under the HLS and HDL implementations (see color chart)

Without loss of generality, the five operators in the previous CDFG are combined into two operators, corresponding to the outer loop body and the inner loop body of the BFS pseudo code, respectively. On the FPGA, the two operators of the CDFG are mapped to two spatial-domain computing resources. In the HLS implementation, the two operators execute alternately and are synchronized by barrier operations; in the HDL implementation, the two operators are pipelined, tasks are transmitted between the two stages in the form of data flow, and the results of the later pipeline stage are fed back to the earlier stage to avoid collisions during vertex access. Thus, the HDL implementation makes more efficient use of the CGRA's spatial-domain computing resources, resulting in superior performance. The inherent parallelism between operators in the BFS algorithm must be explored dynamically at runtime. To exploit this parallelism, an algorithm can be partitioned into fine-grained tasks. When executing each task, the processor checks, according to the input data, whether the potential dependences between tasks actually hold, and executes the tasks without dependences in parallel. This parallelism is called fine-grained parallelism. The HDL implementation of BFS fully exploits the inherent fine-grained parallelism of the algorithm and thereby makes full use of the computing resources of the FPGA. Fine-grained parallelism can be tapped on general-purpose processors by thread-level speculation (TLS) technology [25]. In TLS, fine-grained tasks correspond to threads on the general-purpose processor. The creation and allocation of tasks are managed as a thread pool in the runtime system. The dependences between tasks are declared by the programmer using specific programming methods, and at runtime the threads that can be executed in parallel are selected by a runtime system that checks the dependences between tasks based on the input data. An analysis of a variety of irregular applications [26] points out that fine-grained parallelism is widely available in a large number of application domains and that all of these applications can be parallelized using a TLS-like approach [27]. However, the abstraction of a thread comes from the instruction flow of the von Neumann computing paradigm. For reconfigurable computing, the thread overhead


of implementing a general-purpose processor by reconfiguring computing resources in the time domain is too large. Therefore, TLS technology is difficult to apply to FPGAs. This chapter therefore attempts to propose a new programming method that develops the fine-grained parallelism in applications based on the spatial-domain flexibility of FPGAs and the way tasks are carried out in the spatial domain in HDL implementations. This section presents a new programming model for reconfigurable computing. The model can exploit the fine-grained parallelism in an application according to the programmer's description and then implement the application on FPGA as a runtime parallel execution pipeline. By handing part of the parallelism-development work to the compiler and runtime system, the programming model in this section offers high development efficiency, but it has to sacrifice a certain amount of execution efficiency. Figure 1.45 shows the development flow of the programming model. First, programmers analyze the irregular application and express it using a parallel programming method based on fine-grained pipelining. In this programming method, applications are decomposed into fine-grained tasks, and the dependences between tasks are described by rules. To facilitate debugging, this section designs a pure-software debugging environment based on the general-purpose processor to debug the programmer's declarations. There is no additional dependence between tasks and rules, so multiple tasks or rules can easily be pipelined in parallel, and a single task or rule can be turned into a pipeline on the FPGA through HLS. To combine tasks and rules, this section designs a set of rule templates so that tasks and rules can be executed in parallel; finally, the irregular application can be implemented on the FPGA through this flow. In the fine-grained pipelined parallel programming method, the two core problems are how to describe tasks and how to describe rules. To solve them, irregular applications are first formally decomposed into fine-grained tasks, which tames the complex control flow in irregular applications; then the syntax of rules is given, and the runtime dependences in irregular applications are expressed in terms of rules. Generally speaking, applications with fine-grained parallelism are organized around loops. A loop can be abstracted into a task: at execution time it fetches tasks from the task queue and then adds new tasks to it. Classical compilation theory [28] divides loops into for-all and for-each loops

Fig. 1.45 Fine-grained pipelined parallel programming flow on CGRA


Fig. 1.46 Fine-grained task abstraction from two levels of nested loops of the BFS algorithm

according to the execution order. Their semantics and sequential execution mechanisms are as follows.

(1) for-all loop: all loop iterations can be executed in parallel. During sequential execution, the system selects any iteration to execute.
(2) for-each loop: a later iteration needs the execution results of the previous iterations. During sequential execution, the runtime system executes the iterations in order.

In fine-grained parallelism, the loop bodies of both kinds of loops can be abstracted as tasks, and a task can create new tasks as it executes. The order relationship between tasks is determined by the loop type. Figure 1.46 shows how to extract tasks from code and the order relationship between tasks, taking the two-level nested loop of the BFS algorithm as an example. Here the outer loop is a for-each loop, i.e., there is a sequential relationship between its iterations (indicated by a one-way arrow in the figure), while the inner loop is a for-all loop, i.e., there is no sequential relationship between its iterations (indicated by a two-way arrow in the figure). The two types of loops correspond to two types of tasks, respectively. Fine-grained parallelism defers the resolution of inter-task dependences to runtime, thus avoiding the over-serialization of tasks caused by compile-time scheduling. However, it is not easy to express the runtime dependences between tasks in a form that the compiler can understand. This section draws on the event-condition-action (ECA) syntax from database programming [29] to customize a programming approach for expressing runtime dependences on FPGAs.

(1) Event refers to the activation of a task, or a task reaching a specific operation in its loop body. When an event occurs in the system, the index and data fields that generated the event are broadcast to all rules in the system.
(2) Condition is a Boolean expression composed of the index and data fields of the event and those of the parent task that generated the rule.
(3) Action is limited to returning a Boolean value to the parent task that generated the rule; the task can use that Boolean value to make a decision.


Reviewing the pseudo code and CDFG of the BFS algorithm in Fig. 1.43, dependences i, ii and iv in the BFS CDFG have already been expressed through the for-each fine-grained task set. Dependence iii in the CDFG can be expressed in natural language as follows: during the execution of task Visit i, if a concurrent task Visit j in the system accesses the same vertex and j < i, then task Visit i re-executes; otherwise, task Visit i writes back its result. In the previous HDL implementation of BFS, the pipeline follows a similar natural-language description to resolve dependences at runtime. Using tasks and rules, programmers can describe fine-grained parallelism without resorting to complex control flow. Later, at execution time, the runtime system can execute tasks concurrently using the set of well-ordered tasks and resolve runtime dependences based on the input data with the help of the rules. A single task or rule is very simple and can be implemented on FPGA with existing HLS methods. However, the combination of multiple tasks and rules cannot be generated automatically with existing HLS methods. To solve this problem, this section proposes a templated rule engine, as shown in Fig. 1.47. From left to right, Fig. 1.47 shows the task declaration, the pipeline implementation of the task on FPGA, the rule engine on FPGA, and the rule declaration. Since the inter-task dependences are translated into rules, the task pipeline can be composed simply by arranging the operators of the tasks in space using HLS. Each kind of task corresponds to one or more pipelines. Figure 1.47 uses spatial-domain parallelism to realize runtime optimistic execution, thereby developing fine-grained parallelism in applications. Optimistic parallel execution at runtime generally refers to the following execution mechanism: the compiler does not completely resolve the dependences between parallel tasks; the runtime system schedules tasks optimistically and then decides, according to the input data, which tasks can run in parallel.
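As an illustration of how such a rule might look in code, the C++ sketch below casts dependence iii into the event-condition-action form: the event is a concurrent Visit touching a vertex, the condition compares the task order, and the action tells the parent task whether to re-execute. The Event and Rule types are hypothetical and are not the rule syntax of the actual tool flow.

    #include <cstdio>

    struct Event { int task_id; int vertex; };          // broadcast when a task touches a vertex

    struct Rule {
        int parent_task;                                // the Visit task that installed the rule
        int parent_vertex;                              // the vertex it is working on
        // Condition: an earlier concurrent task (smaller id) touched the same vertex.
        bool condition(const Event& e) const {
            return e.vertex == parent_vertex && e.task_id < parent_task;
        }
        // Action: return true to tell the parent task to re-execute.
        bool action(const Event& e) const { return condition(e); }
    };

    int main() {
        Rule r{/*parent_task=*/7, /*parent_vertex=*/42};
        Event conflicting{/*task_id=*/3, /*vertex=*/42};
        Event harmless{/*task_id=*/9, /*vertex=*/42};
        std::printf("re-execute? %d %d\n",
                    (int)r.action(conflicting), (int)r.action(harmless));   // prints 1 0
    }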

Fig. 1.47 Cooperative execution of tasks and rules on CGRA

Specifically, given the set of tasks and the runtime dependences between tasks described in terms of rules, the runtime optimistic execution mechanism in Fig. 1.47 has the following two implementations.

(1) Speculative parallelization (SPEC): multiple tasks are scheduled to execute in parallel at compile time regardless of conflicts with other tasks, and each task is then checked for conflicts at runtime. Taking BFS as an example, a runtime Visit task is re-executed if and only if an earlier Visit writes to the vertex it is about to write to.
(2) Coordinative parallelization (COOR): the runtime system ensures that only tasks without conflicts become ready. Taking BFS as an example, a Visit task may create new tasks if and only if it is the task with the minimum level in the system.

In speculative parallelization, a task executes successfully only when it does not conflict with the task of minimum level, while in coordinative parallelization, only the task with the minimum level can create concurrent tasks. Interestingly, both optimistic parallelization mechanisms can also be implemented in pure software. To let programmers confirm the correctness of an application after describing it, this section implements a version of the above programming method and execution mechanism on the general-purpose processor, based on a thread pool and condition variables, using the C language and PRAM synchronization primitives. This pure software version is very close to the idea of TLS. However, a large number of PRAM synchronization primitives are difficult to implement effectively on FPGA through HLS, which is the fundamental reason why HLS struggles to develop fine-grained parallelism on spatial-domain computing architectures. Figure 1.48 shows the speedup ratios relative to the Xeon processor. The FPGA implementation achieves a performance improvement of 2.2–5.9 times over the sequential implementation and is comparable to the 10-core parallel implementation (a speedup ratio of 0.6–2.1 times).

Fig. 1.48 Speedup ratio of 6 algorithm implementations on FPGAs versus serial (single core) and parallel (10 cores and 20 threads) implementations on general-purpose processors

Among existing HLS work on FPGA, only Ref. [30] gives an OpenCL implementation of BFS, in which hand annotation is widely used to help the compiler generate better-performing FPGA implementations at compile time. This section reproduces the implementation of Ref. [30] and compares its performance with the FPGA implementations of the two runtime-parallel BFS algorithms. On the USA-road dataset, the execution time of the BFS in Ref. [30] is 124.1 s, while in this section the execution time of the speculative parallelization BFS (SPEC-BFS) is 0.47 s and that of the coordinative parallelization BFS (COOR-BFS) is 0.64 s. It can be seen that the programming model proposed in this section greatly improves the performance of the BFS algorithm. Apart from BFS and single-source shortest path (SSSP), the algorithms tested in this section can only be parallelized at runtime. Therefore, these other algorithms can only be implemented sequentially under the current AOCL framework and can hardly use the parallel computing resources of the FPGA, and there is accordingly no parallel implementation of them in the existing HLS literature. By relying on the runtime parallel execution mechanism to exploit fine-grained parallelism, this section implements the automatic parallelization of these algorithms on FPGA for the first time. Although the approach in this section provides a huge performance improvement over existing HLS approaches, there is still a large performance gap compared with FPGA hardware accelerators implemented in HDL. In short, the fine-grained pipelined programming model trades execution efficiency for better development efficiency while retaining the generality of the programming model.
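The speculative (SPEC) mechanism can be illustrated with the following simplified, single-threaded C++ simulation: tasks read their vertex optimistically, commit in order, and a task whose vertex was changed underneath it is squashed and re-executed with fresh data. The batch contents and the conflict test are assumptions chosen only to show the squash-and-retry idea, not the FPGA rule-engine implementation.

    #include <algorithm>
    #include <climits>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    struct Task { int id; int vertex; int level; };

    int main() {
        std::vector<int> level(4, INT_MAX);
        // Two tasks race on vertex 2 with different proposed levels.
        std::vector<Task> batch = {{3, 2, 1}, {5, 2, 4}, {7, 1, 2}};

        // Optimistic "parallel" execution: every task reads before anyone commits.
        std::vector<int> snapshot(batch.size());
        for (std::size_t i = 0; i < batch.size(); ++i) snapshot[i] = level[batch[i].vertex];

        // Commit in task order; squash a task if its vertex changed under it.
        std::vector<Task> retry;
        for (std::size_t i = 0; i < batch.size(); ++i) {
            const Task& t = batch[i];
            if (level[t.vertex] != snapshot[i]) { retry.push_back(t); continue; }  // conflict
            level[t.vertex] = std::min(level[t.vertex], t.level);
        }
        // Re-execute squashed tasks with fresh data (here they find a smaller level
        // already in place and make no further update).
        for (const Task& t : retry)
            level[t.vertex] = std::min(level[t.vertex], t.level);

        std::printf("level[2] = %d, re-executed %zu task(s)\n", level[2], retry.size()); // 1, 1
    }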

1.6 Summary and Prospect

Taking the "three high walls" faced by the development of semiconductor technology as the main line, this chapter traces the co-evolution of hardware and software in the computing chip industry over the past 60 years and arrives at a seemingly pessimistic Impossible Trinity: a new programming model cannot simultaneously achieve generality, development efficiency, and execution efficiency. This is a great challenge for the innovation of SDC programming models. To a certain extent, the generality of the programming model determines the flexibility of the chip, the development efficiency determines the chip's software ecosystem, and the execution efficiency determines the chip's efficiency. To reduce the NRE cost of the chip, it is necessary to ensure that the chip has a large enough application market, with a programming model that covers as many applications as possible to meet the needs of as many customers as possible; to improve the computing and energy efficiency of the chip, the chip architecture and programming model need to be domain-specific; and to build a good software development ecosystem, the chip should have enough customers on the one hand and be friendly to application developers on the other. Under the Impossible Trinity, it seems that these three goals can never be achieved simultaneously. Fortunately, the GPGPU architecture and deep-learning applications that have emerged in the last decade systematically illustrate how to improve the energy efficiency of chips for specific applications and build a healthy software ecosystem


without destroying the flexibility of chips. Ten years ago, only a few expert programmers in the fields of graphics rendering and scientific computing could fully utilize the computing potential of GPUs, and, limited by the computing power of CPUs, deep learning algorithms were difficult to extend to large-scale data sets of practical significance. The convolutional neural network AlexNet, which appeared in 2012, changed this situation. AlexNet redesigned the software algorithm for the hardware architecture of the GPU, effectively exploited the GPU's high computing power, turned deep learning's quantitative progress into a qualitative breakthrough, and attracted a large number of programmers into the GPU software ecosystem. Since then, building on the existing GPU architecture, GPU architects have customized the arithmetic logic units and datapaths of the hardware for the computation patterns of deep learning. For example, the Nvidia Volta architecture incorporates dedicated tensor processing elements in each streaming multiprocessor for the tensor operations that are widespread in deep learning applications such as convolutional neural networks. A recent study [31] showed that, in deep learning applications, the energy efficiency of the latest GPU chips is much higher than that of CPUs and FPGAs and close to that of ASICs. At the same time, thanks to the original architecture and the progress of semiconductor processes, these GPUs can execute applications such as graphics rendering and scientific computing even more efficiently. In addition, GPU design companies such as Nvidia and AMD avoid completely redesigning the overall architecture by incrementally extending their existing GPU architectures with specialized designs for deep learning applications, thereby balancing chip flexibility and energy efficiency and amortizing the growing NRE cost of chip design and manufacturing over increasing volumes. By adding domain-specific arithmetic logic units and datapaths to the existing GPU architecture, GPUs can provide highly energy-efficient computing power while preserving flexibility. The example of deep learning on GPUs also shows that obtaining better speedups and higher energy efficiency from a new architecture requires application developers to modify and optimize the original algorithms. For applications that did not previously run on the GPU, the existing algorithms are usually highly customized for the CPU and cannot reach optimal efficiency when run on the GPU architecture. For example, high-performance algorithms designed for CPUs usually emphasize a balance between computation and memory access; GPU architectures, by reducing instruction overhead and providing specialized execution units, have much lower computation overhead than CPUs, while the memory access overhead remains almost unchanged. This causes an "incompatibility" problem when a CPU-optimized algorithm is directly ported to the GPU: the computing speed on the GPU far exceeds the memory access speed, and the algorithm becomes completely limited by memory bandwidth. Thus, when porting a CPU algorithm to GPUs, it is essential to exploit data locality and use the on-chip SRAM of GPUs to reduce the number of external memory accesses during execution. However, at present, many emerging architectures try to subvert the existing programming model already in the R&D stage, ignoring user habits, which imposes great learning costs on software developers. It is worth noting that GPUs have made a lot of effort in guiding application developers to modify algorithms according to


the new architecture, and have finally formed a guidance process for beginners: by directly replicating the programming approach used on CPUs and replacing loop-level parallelism with data-level parallelism, the application's performance on the GPU becomes equal to that on CPUs or slightly better; by then gradually modifying the algorithm according to GPU programming best practices, the application eventually achieves a performance improvement of tens to hundreds of times. Starting from a development process familiar to application developers, the GPU gradually guides them toward the GPU architecture, and the GPU programming model is finally applied in more and more fields. Therefore, the SDC first needs to be used in several critical applications, where its performance and energy efficiency can be improved by dozens of times over existing architectures, so as to attract users in these application areas; then the ecosystem of the existing hardware architecture should be leveraged to gradually guide application developers to adapt to the new hardware features. This is the starting point from which this chapter explores three categories of programming models for SDCs, starting from the FPGA programming model.

References

1. Contributors W, David Wheeler (computer scientist)—Wikipedia. https://en.wikipedia.org/w/index.php?title=David_Wheeler_(computer_scientist)&oldid=989191659 [2020-10-20]
2. Leiserson CE, Thompson NC, Emer JS et al (2020) There's plenty of room at the top: what will drive computer performance after Moore's law? Science 368(6495). https://doi.org/10.1126/science.aam9744
3. Contributors W, Amdahl's law—Wikipedia. https://en.wikipedia.org/w/index.php?title=Amdahl%27s_law&oldid=991970624 [2020-10-20]
4. Hennessy JL, Patterson DA (2011) Computer architecture: a quantitative approach. Elsevier, Amsterdam
5. Contributors W, Random-access machine—Wikipedia. https://en.wikipedia.org/w/index.php?title=Random-access_machine&oldid=991980016 [2020-10-20]
6. Dennard RH, Gaensslen FH, Rideout VL et al (1974) Design of ion-implanted MOSFET's with very small physical dimensions. IEEE J Solid-State Circuits 9(5):256–268
7. Horowitz M (2014) Computing's energy problem (and what we can do about it). In: IEEE international solid-state circuits conference, pp 10–14
8. Taylor MB (2012) Is dark silicon useful?: harnessing the four horsemen of the coming dark silicon apocalypse. In: Proceedings of the 49th annual design automation conference, pp 1131–1136
9. Skylake (quad-core) (annotated).png—WikiChip. https://en.wikichip.org/wiki/File:skylake_(quad-core)_(annotated).png [2020-10-20]
10. Hameed R, Qadeer W, Wachs M et al (2010) Understanding sources of inefficiency in general-purpose chips. In: ISCA'10, pp 37–47
11. Linux performance in cloud (2019). http://techblog.cloudperf.net/2019 [2020-10-20]
12. Krizhevsky A, Sutskever I, Hinton GE (2017) ImageNet classification with deep convolutional neural networks. Commun ACM 60(6):84–90
13. Voitsechov D, Etsion Y (2014) Single-graph multiple flows: energy efficient design alternative for GPGPUs. In: ISCA'14, pp 205–216
14. Kruger F, CPU bandwidth: the worrisome 2020 trend. https://blog.westerndigital.com/cpu-bandwidth-the-worrisome-2020-trend [2020-12-24]


15. Blanas S, Scaling the network wall in data-intensive computing. https://www.sigarch.org/scaling-the-network-wall-in-data-intensive-computing [2019-02-20]
16. Asanovic K, Bodik R, Catanzaro BC et al (2006) The landscape of parallel computing research: a view from Berkeley. University of California, Berkeley
17. Nowatzki T, Gangadhar V, Sankaralingam K (2015) Exploring the potential of heterogeneous von Neumann/data flow execution models. In: Proceedings of the 42nd annual international symposium on computer architecture, pp 298–310
18. Li Z (2018) Research on key technologies of programming model and hardware architecture of highly flexible reconfigurable processor. Tsinghua University, Beijing
19. Grigoras P, Burovskiy P, Luk W (2016) CASK: open-source custom architectures for sparse kernels. In: Proceedings of the 2016 ACM/SIGDA international symposium on field-programmable gate arrays, pp 179–184
20. Lu K, Li Z, Liu L et al (2019) ReDESK: a reconfigurable data flow engine for sparse kernels on heterogeneous platforms. In: IEEE/ACM international conference on computer-aided design, pp 1–8
21. Gupta PK (2015) Intel Xeon+ FPGA platform for the data center. In: Workshop presentation, reconfigurable computing for the masses, really, pp 1–10
22. Yan H, Li Z, Liu L et al (2019) Constructing concurrent data structures on FPGA with channels. In: Proceedings of the ACM/SIGDA international symposium on field-programmable gate arrays, pp 172–177
23. Kun Z, Qiming H, Rui W et al (2008) Real-time KD-tree construction on graphics hardware. ACM Trans Graph 126:189–193
24. Li Z, Liu L, Deng Y et al (2017) Aggressive pipelining of irregular applications on reconfigurable hardware. In: The 44th annual international symposium on computer architecture, pp 575–586
25. Gayatri R, Badia RM, Aygaude E (2014) Loop level speculation in a task based programming model. In: International conference on high performance computing, pp 1–5
26. Pingali K, Nguyen D, Kulkarni M et al (2011) The Tao of parallelism in algorithms. In: ACM SIGPLAN conference on programming language design and implementation, pp 1–7
27. Hassaan MA, Nguyen DD, Pingali KK (2015) Kinetic dependence graphs. In: Architectural support for programming languages and operating systems, pp 457–471
28. Kennedy K (2002) Optimizing compilers for modern architectures: a dependence-based approach. Morgan Kaufmann Publishers, San Francisco
29. Ho C, Kim SJ, Sankaralingam K (2015) Efficient execution of memory access phases using data flow specialization. In: International symposium on computer architecture, pp 118–130
30. Krommydas K, Feng WC, Antonopoulos CD et al (2016) OpenDwarfs: characterization of Dwarf-based benchmarks on fixed and reconfigurable architectures. J Sig Process Syst 1–21
31. Dally W, Yatish T, Song H (2020) Domain-specific hardware accelerators. Commun ACM 63(7):48–57

Chapter 2

Hardware Security and Reliability

Moving target defense enables us to create, analyze, evaluate, and deploy mechanisms and strategies that are diverse and that continually shift and change over time to increase complexity and cost for attackers, limit the exposure of vulnerabilities and opportunities for attack, and increase system resiliency. —National Science and Technology Council, December 2011

The vulnerabilities Meltdown and Spectre, disclosed in early 2018, are generally considered among the most serious hardware security problems so far. Because their root cause lies in the hardware design, software can only mitigate the impact and cannot solve the problem completely. Given the many processors already deployed with these vulnerabilities, upgrading and replacing the hardware in a short period of time is not realistic. Attack strategies evolve together with countermeasures, causing countermeasures to fail; at the same time, new attack methods keep emerging, so the methods of resisting attacks must be updated and deployed quickly and iteratively. Therefore, how to respond to changing and rapidly evolving attack methods after the chip has been manufactured is a recognized problem. In SDCs, the hardware changes rapidly with the software and can adapt to algorithm requirements by dynamically changing the circuit architecture. By effectively using the dynamic reconfigurability of SDCs, the hardware security and reliability of the chip can be greatly improved. The SDC has intrinsic security. On the one hand, the partial and dynamic reconfigurability of the processing element array (PEA) can be fully exploited to develop countermeasures based on temporal and spatial randomization. When the cryptographic algorithm is executed at different times and different spatial positions of the array on every run, an attacker's precision attacks come to naught. More specifically, when the attacker targets a specific implementation of the cryptographic algorithm, the randomization method makes the position of the sensitive point change rapidly, so the attack is difficult even for an attacker with back-door knowledge. On the other hand, the architecture of the SDC, namely the array of processing elements and the interconnections among them, can be fully exploited to improve security. Through resource reuse, the additional overhead can be significantly reduced. For example, a physical unclonable function (PUF) can be built on the basis of the PEA


to realize lightweight authentication or secure key generation while the basic encryption and decryption operations are being performed. An effective fault-tolerance mechanism is very important to ensure the reliability of integrated circuits. SDCs usually integrate a large number of processing elements (PEs), and idle PEs can be used as spares to replace faulty components and maintain the correctness of the whole system. Because the number of PEs is limited, how to make maximum use of the redundant hardware resources on the chip to improve the repair rate and reliability is a problem worth studying. By designing an efficient topology reconfiguration method, the fault tolerance of the system can be greatly improved. At the same time, how to ensure that algorithms can still be efficiently mapped onto the SDC after the topology is dynamically reconfigured also needs to be considered. Section 2.1 of this chapter introduces the intrinsic security of the SDC against physical attacks on cryptographic chips. Examples, covering semi-invasive attacks such as fault attacks and non-invasive attacks such as side-channel attacks, are given to demonstrate the enhanced resistance provided by partial and dynamic reconfiguration. In addition, it describes how to use the computing resources of the SDC to build PUFs and how to use PUFs to improve the security of software-defined cryptographic chips. Section 2.2 takes the network-on-chip of the SDC as an example to introduce how to use redundant units efficiently to repair the on-chip communication network and improve reliability when a router or PE fails, and how to find an optimal mapping that considers performance, energy consumption, and reliability when the topology and routing algorithm change dynamically.

2.1 Security

2.1.1 Countermeasures Against Fault Attacks

As a physical attack, a fault attack obtains confidential information by maliciously injecting faults and poses a great potential threat to hardware security. Meanwhile, the accuracy of fault injection has improved greatly in the past few years. For example, the spatial and temporal accuracy of laser injection has reached logic-gate level and sub-nanosecond level [1], which makes it possible to inject two faults at the sensitive points of a cryptographic computation at the same time. However, current countermeasures cannot resist this double-fault attack. The most commonly used countermeasure against fault attacks today is redundant computing, e.g., duplicating hardware with the same computation function and determining whether an error has occurred by comparing the two results. When fault injection is accurate enough to inject the same fault into both computation paths at the same time, the fault cannot be detected effectively by comparison. Although it might seem that double-fault attacks can be handled by increasing the number of redundant circuits, when even more advanced fault injection methods appear, such


countermeasures against fault attacks based on redundancy and comparison do not seem to be sustainable, and they bring a lot of overhead. The SDC has the characteristic of dynamic reconfiguration, which can introduce spatial randomness into the computation path and randomly change the computation time, so that the probability of successful fault injection is greatly reduced. This characteristic plays an important role in resisting double-fault attacks [2]. In contrast to current passive fault-detection methods, this section introduces three active defense methods based on the SDC, which greatly reduce the probability of successful fault injection and thereby improve the chip's resistance to double-fault attacks. These three methods are discussed in detail below [3].

1. Round-based relocation (RBR)

The key to a successful fault attack is to find the precise time and spatial location that produces a particular kind of faulty ciphertext; such a location is defined as a sensitive point in this book. The key of the round-based relocation technique, which is based on absolute spatial randomization, is to change the configuration of the computing array every time the algorithm is executed, so as to randomize the spatial position of the sensitive points. To make full use of hardware resources and improve the randomness, spatial randomization can be applied in each control step of RBR. When the data flow diagram has s stages, the highest achievable randomness is also s. Figure 2.1 shows an example of RBR. The round function in the figure has three stages, and the PEs in the same row of the computation array are assigned to the same stage. When the spatial randomness is 3, the three mapping methods in the figure are adopted randomly, each with a probability of 1/3. Compared with the original fixed mapping, which needs only 7 PEs to complete the computation, the RBR method requires two additional PEs. In different application scenarios, the randomness can be made larger or smaller than 3 according to the application requirements and the available hardware resources. For example, when the spatial randomness is 4, the three additional PEs in row 4 also need to be used, so the randomness increases and so does the additional hardware overhead. Generally speaking, because of the diffusion property of iterative ciphers, most sensitive points lie in the last few rounds of the cryptographic algorithm, so round-based relocation based on absolute spatial randomization is mostly applied to those rounds; it can also be used in other rounds to further improve security. Figure 2.2 is a hardware diagram supporting the implementation of RBR. To implement random mapping, the configuration controller that controls the functions of the PEs and the interconnect needs a random number generator (RNG). Compared with a configuration without randomization, it is not necessary to store each stage's configuration information repeatedly, but the correspondence between configuration contexts and actual hardware resources needs to be changed. Reconfiguring the round function requires fetching the configuration contexts from the context memory, which takes several additional cycles, but compared with the entire cryptographic computation (usually on the order of hundreds of cycles) this performance overhead can be ignored.
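A minimal C++ sketch of the RBR idea is given below: before every execution, a random offset chooses which rows of the PE array the s stages of the round function are mapped to, so the physical position of any sensitive point changes from run to run. The array sizes and the mapping rule are illustrative assumptions, not the actual configuration-controller logic.

    #include <cstdio>
    #include <random>

    int main() {
        constexpr int kStages = 3;       // stages of the round-function data flow graph
        constexpr int kRows = 5;         // PE rows available for this mapping
        constexpr int kRandomness = 3;   // distinct placements (<= kRows - kStages + 1)

        std::random_device rd;
        std::mt19937 rng(rd());
        std::uniform_int_distribution<int> pick(0, kRandomness - 1);

        for (int run = 0; run < 4; ++run) {
            int offset = pick(rng);      // chosen anew for every execution
            std::printf("run %d: stage->row mapping:", run);
            for (int s = 0; s < kStages; ++s) std::printf(" %d->%d", s, s + offset);
            std::printf("\n");
        }
    }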

Fig. 2.1 Schematic diagram of RBR

This fine-grained reconfiguration method effectively reduces the hardware resource overhead of the countermeasure compared with the traditional approach of dividing the computation by algorithm sub-functions.

Fig. 2.2 Schematic diagram of RBR hardware support
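To make the effect of round-based relocation concrete, the following minimal Python sketch (an illustration only, not the book's implementation; the stage count, the attacker model and the fixed attacked row are assumptions) simulates how randomly re-mapping the s stages of a round function onto the PE array lowers the probability that a fault injected at a fixed physical location hits the sensitive stage.

```python
import random

def rbr_mapping(num_stages=3, spatial_randomness=3):
    """Pick one of `spatial_randomness` stage-to-row mappings at random,
    as RBR does before each execution of the round function."""
    offset = random.randrange(spatial_randomness)   # driven by the on-chip RNG
    return {stage: stage + offset for stage in range(num_stages)}

def fault_hits(mapping, sensitive_stage, attacked_row):
    """A fault aimed at a fixed physical row succeeds only if the sensitive
    stage happens to be mapped onto that row in this execution."""
    return mapping[sensitive_stage] == attacked_row

trials, hits = 100_000, 0
for _ in range(trials):
    hits += fault_hits(rbr_mapping(), sensitive_stage=1, attacked_row=1)

# With spatial randomness 3, a fixed-location fault succeeds ~1/3 of the time.
print("fault success rate:", hits / trials)
```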

2. Register pair swap (RPS) technology

The register pair swap technique, based on relative spatial randomization, randomly changes the interconnection between the ALU and the registers in each PE pair, so that the output of an ALU executing a sensitive operation is stored in a different register in different executions. Registers are targeted because previous research shows that distributed memory units such as registers are more vulnerable to fault injection than purely combinational logic [1]. With RPS, if one of the two PEs is a sensitive point and the fault is injected into its corresponding register, the success probability of the fault injection drops to 50% of the original, i.e., the corresponding spatial randomness is 2. The spatial randomness can be increased further by extending the swap to the registers of more PEs; however, care must be taken that the additional multi-input MUXes do not lengthen the critical path of the computing array, so that a reduced clock frequency does not degrade the performance of the whole array. Figure 2.3 shows the hardware support for RPS. To ensure that the ALU output can still be sent to the correct PE after being randomly stored in a register, each PE is equipped with two 2-to-1 MUXes controlled by the same 1-bit RNG. The additional hardware of RPS therefore consists of the 2-to-1 MUXes and the RNG, which add roughly tens and thousands of logic gates respectively and are negligible compared with the whole SDC computing array (more than one million gates). The RBR technique based on absolute spatial randomization, introduced above, does not exploit the randomness among PEs of the same stage; RPS further improves the spatial randomness by treating PEs of the same stage as PE pairs. This randomness is introduced directly by the random number generator changing the connection between the ALU and the registers, and it is independent of the mapping method, so the hardware of the configuration controller does not need to be changed. RBR and RPS introduce spatial randomness between stages and within a stage, respectively; when they are used together, the total spatial randomness can be regarded as the product of their individual randomness.

Fig. 2.3 RPS hardware support
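The register-pair swap can be pictured with a similarly small sketch (again only an illustration under assumed signal names): a 1-bit random value selects which register of the pair receives the sensitive ALU output, so a fault aimed at one fixed register succeeds only half of the time, and this factor of 2 multiplies with the randomness already provided by RBR.

```python
import random

def rps_store(sensitive_result, other_result):
    """One PE pair with register-pair swap: a 1-bit RNG drives the two
    2-to-1 MUXes and decides which physical register holds which result."""
    swap = random.getrandbits(1)
    regs = [None, None]
    regs[swap] = sensitive_result
    regs[1 - swap] = other_result
    return regs, swap

# A fault injected into physical register 0 corrupts the sensitive value
# only when swap == 0, i.e. with probability 1/2; combined with RBR of
# randomness s, the overall spatial randomness becomes 2 * s.
trials = 100_000
hits = sum(rps_store("sensitive", "other")[1] == 0 for _ in range(trials))
print("fraction of runs with the sensitive value in register 0:", hits / trials)
```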

3. Random delay insertion (RDI) technology

As shown in Fig. 2.4, random delay insertion introduces randomness in the time dimension by inserting a random number of redundant cycles before the last few rounds of operations in which the sensitive points are located. Assuming the sensitive points lie in the t-th round of the computation, the t-th round starts only after a random number of cycles has elapsed following the end of the (t − 1)-th round. Higher temporal randomness could in principle be achieved by reordering operations of the cryptographic algorithm, provided the order of two operations does not affect the final result; however, for cryptographic algorithms with high security requirements most operations are not commutative, so inserting a random number of redundant cycles is the more practical approach for cryptographic computing chips. To avoid timing attacks, the PEs in the computing array keep performing dummy operations during the randomly inserted redundant cycles (the original operations, but on irrelevant operands), so the inserted cycles cannot be distinguished by the markedly lower power consumption of idle PEs. Implementing RDI also requires an RNG in the configuration logic, and the additional cycles bring a performance overhead determined by the ratio of the redundant cycles to the computation time originally required by the algorithm.

Fig. 2.4 Schematic diagram of RDI method
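The timing effect of RDI can be sketched as follows (illustrative only; the round length, the sensitive round index and the maximum number of inserted cycles are assumptions): a random number of dummy cycles, executed on busy PEs with irrelevant operands, shifts the cycle at which the sensitive round starts, so a fault or probe triggered at a fixed cycle rarely coincides with it.

```python
import random

def sensitive_round_start(sensitive_round=9, cycles_per_round=4, max_delay=8):
    """Cycle at which the sensitive round starts when RDI inserts a random
    number of redundant cycles (spent on dummy operations) before it."""
    delay = random.randrange(max_delay + 1)
    return (sensitive_round - 1) * cycles_per_round + delay

nominal = sensitive_round_start(max_delay=0)      # start cycle without RDI
trials = 100_000
hits = sum(sensitive_round_start() == nominal for _ in range(trials))
# An attacker triggering at the fixed nominal cycle now succeeds only when
# the inserted delay happens to be zero (~1/9 here).
print("probability of hitting the sensitive round at a fixed cycle:", hits / trials)
```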

Taking an AES implementation facing the highly threatening double fault attack as an example, introducing these countermeasures into a reconfigurable cryptographic processor significantly improves its resistance to fault attacks at very low cost: when the three countermeasures are applied together, under the constraints of a 5% throughput reduction, 35% area overhead and 10% power overhead, the resistance of the reconfigurable cryptographic processor is improved by 2–4 orders of magnitude [3].

2.1.2 Countermeasures Against Side Channel Attacks

Fault-injection attacks are fairly general, in that they can target different types of SDCs, whereas side channel attacks are a specialized attack technique aimed at cryptographic chips; software-defined cryptographic chips are therefore also threatened by side channel attacks. This section first introduces the basic concepts and methods of side channel attacks and discusses generally applicable countermeasures; it then presents, in view of the intrinsic characteristics of software-defined cryptographic chips, a high-level evaluation technique for their side channel security. On the basis of effectively locating the sources of side channel vulnerability, this section introduces countermeasures against electromagnetic attacks based on redundant resources and randomization from the aspects of hardware architecture, configuration and circuit design, with a focus on new multi-probe local electromagnetic attack methods and the trade-off between energy efficiency and flexibility.

1. Overview of countermeasures against side channel attacks

Cryptographic algorithms ensure the confidentiality, integrity and availability of information, and their hardware implementation (i.e., cryptographic chip technology) is the physical basis of information system security. As the hardware carrier of cryptographic algorithms, the cryptographic chip plays a key role in information security applications [4]. In recent years, frequent cyberspace security incidents have drawn ever wider attention to information security, whose underlying basis is the security of the integrated circuit hardware. For all kinds of cryptographic chips, attackers constantly try various means to obtain sensitive information such as passwords and keys, and thereby threaten the security of the whole information system. The growing demand for information security applications has brought unprecedented challenges to the design and application of software-defined cryptographic chips, and in actual use the security design of the cryptographic chip itself is the core of the defense strategy. Because cryptographic algorithms involve intensive control execution and data processing, the SDC has natural advantages in implementing different kinds of cryptographic algorithms.

However, unlike traditional general-purpose processors driven by an instruction stream and cryptographic ASICs driven by a data stream, the software-defined cryptographic chip adopts a computing model that combines the flexibility of instruction-driven structures with the high energy efficiency of data-driven structures. Like other types of cryptographic chips, the SDC faces security threats from side channel attacks. Attacks on cryptographic chips fall mainly into three classes: invasive, semi-invasive and non-invasive attacks [5]. An invasive attack destroys the circuit package by decapsulation to expose the bare die and then directly obtains the sensitive information inside the chip by reverse engineering and microprobing; its technical threshold and cost are very high, and it causes irreversible permanent damage to the target [6]. Semi-invasive attacks also remove the chip package but do not need to establish an actual electrical connection with the chip under test, so they cause no substantial mechanical damage to the circuit [7]; fault injection attacks based on lasers and similar means generally belong to this class, and they remain difficult to carry out. A non-invasive attack does not damage the package at all but obtains sensitive data such as keys by analyzing information produced while the integrated circuit is operating [8]; it is easy to implement and inexpensive, which makes it the mainstream attack approach. The side channel attack is the most important type of non-invasive attack: it collects side channel information such as power consumption, electromagnetic emission and delay generated by the chip, and uses data analysis to recover the sensitive information inside the chip from its correlation with the input/output data and the algorithm key. It is currently one of the most serious security threats faced by cryptographic chips; the general flow of a side channel attack is shown in Fig. 2.5. Side channel attacks were first proposed by the American cryptologist Kocher in the late 1990s [9]. In terms of actual effect, side channel attacks are far more efficient than traditional cryptanalysis and pose a great threat to the security of cryptographic chips. Taking power side channel attacks as an example, researchers have recovered sensitive information of a variety of cryptographic algorithms such as RSA [10], AES [11], ECC [12], SM3 [13], SM4 [14] and PRESENT [15] on hardware platforms including smart cards [16], processors [16], microcontrollers [17], FPGAs [18] and ASICs [19]. The attack algorithms cover simple power analysis (SPA) [20], differential power analysis (DPA) [8], correlation power analysis (CPA) [21], mutual information analysis (MIA) [22], template analysis (TA) [23], and so on. In addition, different kinds of side channel information and analysis methods can be combined into new attack strategies, such as combining fault attacks with power analysis [24].
Although no side channel attack against a software-defined cryptographic chip has been reported so far, such chips are nevertheless vulnerable to side channel attacks: they are driven by the hybrid operation mode of "configuration contexts plus data flow", and the operation state of the chip is closely related to the characteristics of the cryptographic algorithm being executed.

Fig. 2.5 General flow of side channel attack (plaintext and secret key are fed into the cryptographic chip; run-time power consumption and electromagnetic radiation are acquired and analyzed to crack the key)

Moreover, unlike a traditional side channel attack that aims only at sensitive information such as keys, a side channel attack against a software-defined cryptographic chip can additionally recover the operation state of the chip, which not only leaks sensitive information such as keys but may also steal the "configuration stream", a core piece of intellectual property. Traditional defenses against side channel attacks aim to make it harder for attackers to recover sensitive information such as keys, and generally fall into two classes: masking and hiding [25]. Masking removes the correlation by introducing randomness into the intermediate values of the computation: the data to be protected are masked first, the cryptographic chip operates on the masked data, and the mask is removed only before the final output, so the data remain masked throughout the computation, which raises the difficulty of the attack. Hiding conceals the sensitive information either by introducing random noise to reduce the signal-to-noise ratio, or by carefully designing the circuit so that the side channel information of different operations is almost identical throughout the computation; hiding requires neither additional operations on the encrypted data nor deep knowledge of the cryptographic algorithm. Many countermeasures applicable at different levels have been proposed, but exploiting the reconfigurability of cryptographic chips to resist side channel attacks is still in its infancy, and there is no mature solution that can be applied to software-defined cryptographic chips; only a few researchers have studied side channel security strategies for SDCs on FPGA platforms.
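As a concrete illustration of the masking idea described above (a generic first-order Boolean masking sketch, not the scheme of any particular chip; the key and data values are placeholders), the sensitive value is split into two shares whose XOR equals the original value, every intermediate value handled by the hardware is then independent of the secret, and the mask is removed only at the end.

```python
import secrets

def mask(value, width=8):
    """Split `value` into two Boolean shares with a fresh random mask:
    value == share0 XOR share1."""
    m = secrets.randbits(width)
    return value ^ m, m

def masked_xor(shares_a, shares_b):
    """XOR is linear, so it can be computed share-wise without unmasking."""
    return shares_a[0] ^ shares_b[0], shares_a[1] ^ shares_b[1]

def unmask(shares):
    return shares[0] ^ shares[1]

# The hardware only ever handles the shares, never `key` or `data` directly;
# the mask is removed just before the final output.
key, data = 0x2B, 0x3A
result = unmask(masked_xor(mask(key), mask(data)))
assert result == key ^ data
```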

Some researchers have analyzed the DPA attack principle on the FPGA platform and implemented countermeasures such as hiding, confusion and noise injection [26], but without studying or exploiting the reconfiguration capability of FPGAs. In 2012, researchers cooperating with industry explored countermeasures against side channel attacks for a series of hardware platforms including FPGAs [27] and discussed the balance between flexibility and security. For cryptographic chips with reconfigurable characteristics, countermeasures based on software-defined cryptographic chips were proposed in 2014 [28], as shown in Fig. 2.6: for a given computation configuration composed of the gray PEs, the "idle" PEs are configured to perform dummy or complementary operations so as to hide the real operations. However, because this scheme only schedules the "idle" PEs and does not optimize the cryptographic operation sequence itself, it fails to hide the key information at the source, and a risk of side channel leakage remains. In 2015, dynamic logic reconfiguration was implemented on a Spartan-6 series FPGA platform [29], where the lookup tables were optimized to realize the S-box and to verify the resistance against power side channel attacks. In 2019, based on ZYNQ UltraScale+ series FPGAs, researchers further explored software-defined techniques to dynamically change the layout of the cryptographic circuit [30] and thereby raise the security level against side channel attacks. As for tapping the potential of software-defined cryptographic chips to resist side channel attacks, some researchers have studied the principle of fault attacks in depth [3, 31] and optimized the hardware architecture, but have not analyzed the side channel vulnerability.

Fig. 2.6 Schematic diagram of countermeasures against side channel attacks based on idle PEs

Considering the unique operation characteristics and the special development process of software-defined cryptographic chips, traditional countermeasures are often difficult to apply directly. Therefore, to implement side channel protection more effectively while taking the intrinsic characteristics of the software-defined cryptographic chip into account, this section introduces an analysis of the software-defined hardware architecture and configuration strategy oriented to side channel vulnerability, develops countermeasures suitable for software-defined cryptographic chips under the guidance of classical masking and hiding methods, and explores techniques transferable to general software-defined cryptographic hardware platforms.

2. High-level evaluation of side channel security for software-defined cryptographic chips

Considering the complexity of IC design, even though the software-defined cryptographic chip is reconfigurable after fabrication, remedial measures will still be very costly if obvious side channel weaknesses are found only after tape-out. In addition, to develop countermeasures more efficiently, the sources of side channel vulnerability must be known in advance and located accurately. Therefore, alongside pre-silicon security evaluation and the development of countermeasures, it is critical to evaluate the side channel security level of the software-defined cryptographic chip accurately and quantitatively. Given its special operating mechanism, a dedicated evaluation framework is needed; moreover, because the chip is dynamically reconfigured in real time to change its function, the operation state of the chip must be considered in the quantitative side channel security evaluation. Existing side channel evaluation techniques measure how hard it is to recover keys and other sensitive information by collecting the side channel information of the cryptographic chip during algorithm execution and applying statistical analysis under a certain information leakage model. Since the software-defined cryptographic chip is a complex system comprising both hardware circuits and a dynamic configuration controller, its side channel evaluation method must differ from that of traditional cryptographic chips and fully consider the characteristics of its computing form. Combined with the existing hardware architecture and compilation paradigm, the information leakage weights of the "configuration stream" and the "data flow" should be differentiated, and the information leakage model should be adapted to the circuit modules that potentially leak sensitive information. In Eq. (2.1), R(t) is the radiation measurement varying with time, and A_i and B_i are the initial and final operating states of the i-th logic unit in the circuit. On the basis of the traditional Hamming distance model, a differentiated weight F_i is added so that the information leakage model is better suited to evaluating software-defined hardware:

R(t) = \sum_{i=1}^{n} F_i \times (A_i \oplus B_i)    (2.1)
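A minimal sketch of how Eq. (2.1) can be evaluated is given below (the per-unit weights and the state vectors are made-up illustrative values; in the actual methodology the weights F_i would be assigned according to whether a unit is driven by the configuration stream or by the sensitive data flow).

```python
def leakage_model(initial_states, final_states, weights):
    """Eq. (2.1): R(t) = sum over the n logic units of F_i * (A_i XOR B_i)."""
    return sum(f * (a ^ b)
               for f, a, b in zip(weights, initial_states, final_states))

# Six single-bit logic units; units driven by the sensitive data flow are
# given a larger weight F_i than units driven only by the configuration stream.
A = [0, 1, 1, 0, 1, 0]              # states before the clock edge
B = [1, 1, 0, 0, 0, 0]              # states after the clock edge
F = [2.0, 2.0, 2.0, 0.5, 0.5, 0.5]
print("predicted leakage R(t) =", leakage_model(A, B, F))   # 4.5
```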

Based on this optimized information leakage model, a non-uniform probability distribution and Student's t-test are introduced for the spatio-temporal hybrid scenario. Following the test vector leakage assessment (TVLA) methodology, the side channel difference of the chip is evaluated in terms of the expectation of successfully cracking the key, forming a quantitative side channel security evaluation algorithm suited to the computation- and control-intensive nature of the software-defined cryptographic chip. In Eq. (2.2), \bar{X}_1 and \bar{X}_2 are the mean values of the side channel information of the circuit under two different driving stimuli, S_1^2 and S_2^2 are the corresponding variances, and n_1 and n_2 are the sample sizes. T-distribution theory is used to infer the probability of a difference and thus judge whether the difference between the two means is significant; if the absolute value of T exceeds 4.5, significant information leakage is considered to exist. The model can also be used to evaluate individual circuit modules quantitatively.

T = \dfrac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}    (2.2)
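A minimal numpy sketch of this evaluation step, under the assumption that two sets of traces (e.g., fixed-versus-random TVLA sets) have already been collected, computes the statistic of Eq. (2.2) for every sample point and flags |T| > 4.5.

```python
import numpy as np

def pooled_t_statistic(traces1, traces2):
    """Student's t statistic of Eq. (2.2), computed point-wise over two sets
    of traces with shape (n_traces, n_samples)."""
    n1, n2 = len(traces1), len(traces2)
    x1, x2 = traces1.mean(axis=0), traces2.mean(axis=0)
    s1, s2 = traces1.var(axis=0, ddof=1), traces2.var(axis=0, ddof=1)
    pooled = ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)
    return (x1 - x2) / np.sqrt(pooled * (1.0 / n1 + 1.0 / n2))

# Synthetic example: set 1 carries a small data-dependent offset at sample 100.
rng = np.random.default_rng(0)
set1 = rng.normal(0.0, 1.0, (2000, 500))
set2 = rng.normal(0.0, 1.0, (2000, 500))
set1[:, 100] += 0.3
t = pooled_t_statistic(set1, set2)
print("points with |T| > 4.5:", np.flatnonzero(np.abs(t) > 4.5))
```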

To verify the resistance of the software-defined cryptographic chip against side channel attacks in real cases, a side channel experimental platform must be built. Traditional test platforms usually only stimulate the chip under test and collect the leaked side channel information without synchronously monitoring the chip's operation state, so additional monitoring of the operation state is needed to complete the side channel security evaluation of a software-defined cryptographic chip. The framework of the test platform is shown in Fig. 2.7: on the basis of a traditional side channel test system, the working state of the chip is additionally monitored through the debugging interface. Through actual tests of the software-defined cryptographic chip, the improvement of the security level can then be verified. The hardware core of the software-defined cryptographic chip is the reconfigurable processing unit (RPU). A software-defined array often contains multiple block computing units (BCUs), data exchange modules, high-speed buses and other peripheral circuits, as shown in Fig. 2.8. At the same time, the software-defined cryptographic chip can perform the corresponding operations correctly only with the support of an appropriate configuration strategy: the configuration contexts schedule the chip's hardware and allocate computing resources, while the data flow carries the data to be processed and the returned results. A BCU parses the header of the configuration contexts, encrypts or decrypts the plaintext/ciphertext data according to the control commands, steers the data flow, and completes the computation.

Fig. 2.7 Block diagram of test system (including chip condition monitoring): a digital oscilloscope captures the amplified side channel signal from a probe on the SDC chip, while the host PC drives the chip through PCIE/GPIO and monitors configuration control and operation execution through the debugging interface

A BCU contains multiple processing elements (PEs), which mainly include ALU, S-BOX, SHIFT, GF, DP, BENES and other modules; they can perform arithmetic and logic operations (addition, subtraction, AND, OR, XOR, etc.), S-box table lookup, left/right shifts (including rotations), permutation, finite field multiplication, and so on. The fundamental reason for side channel leakage is the correlation between the side channel information, which the hardware circuit generates during operation, and the intermediate values of the algorithm. Therefore, it is necessary to analyze the nonlinear logic operation circuit modules in the PEs one by one.

Fig. 2.8 Hardware architecture of software-defined cryptographic chip (BCU #0 through BCU #31, each with PEs, an ALU, data registers, memory, a data store/load unit and ports, connected to other peripheral circuits; configuration contexts and data flow enter through the chip ports)

Single-bit decomposition and tracking of the data flow are then carried out to locate the bits directly related to sensitive information such as the key, and the mapping between the data flow and the side channel leakage is established on the basis of the Hamming distance leakage model. Unlike a traditional cryptographic chip, the software-defined cryptographic chip reconfigures its functions during operation and changes the interconnection and data paths of the internal PEA. Since the software-defined array contains multiple BCUs, the way the configuration contexts change the computing mode of the chip and schedule the computing resources must also be considered, so as to further decouple the mapping between the configuration contexts and data flow on the one hand and the circuit modules that easily leak sensitive information on the other. In this way a side channel vulnerability analysis and localization method for software-defined cryptographic chips is obtained. The XOR operation and the multivariate XOR operation used in the widely deployed S-box computations of symmetric cryptographic algorithms serve as examples, listed in Table 2.1. The software-defined cryptographic chip performs the S-box-with-XOR operation with the instruction SBOXB_PREXOR, which XORs the previous round's encryption result rs1 with the expanded key rs2 and stores the S-box output in rd, i.e., rd = sboxa(rs1 ^ rs2). To evaluate the possible information leakage of the actual circuit for this instruction, the BCU and the internal PE performing the operation must be determined from the configuration contexts, and the specific operands must be determined from the data flow. As another example, the multivariate XOR instruction TRIRS.XOR performs the XOR of rs1, rs2 and rs3 and stores the result in rd, i.e., rd = rs1 ^ rs2 ^ rs3; again the specific operands and the executing PE must be determined, and the side channel differences between different PEs working cooperatively must be considered. Following this per-instruction analysis, the existing instruction set is summarized, and the mapping between the configuration contexts and data flow and the circuit modules that easily leak sensitive information is established in combination with the complete configuration policy. As shown in Fig. 2.9, the instruction flow is analyzed at fine granularity: the configuration contexts and data flow are decomposed, and the specific data and corresponding hardware modules related to the intermediate values of the algorithm at each moment are determined along the encryption timeline, so that the information leakage can be computed and evaluated.

Table 2.1 Example of instruction flow analysis

Instruction name | Instruction format | Function description
SBOXA_PREXOR | 0000001_rs2_rs1_010_rd_0001011 | rd = sbox (rs1 XOR rs2)
TRIRS.XOR | rs3_2'b11_rs2_rs1_000_rd_1000011 | rd = rs1 XOR rs2 XOR rs3
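The per-instruction analysis of Table 2.1 can be prototyped as below (a hypothetical sketch: the 4-bit S-box and the register contents are placeholders, and only the leakage-relevant intermediate value is modeled, not the chip's real decoder): for each instruction the intermediate value that reaches the targeted PE is computed, and its Hamming distance from the previous register content gives the quantity to correlate against.

```python
# Placeholder 4-bit S-box, purely for illustration (not the chip's real S-box).
SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]

def hamming_distance(a, b):
    return bin(a ^ b).count("1")

def intermediate_value(mnemonic, rs1, rs2, rs3=0):
    """Intermediate value written to rd for the two instructions of Table 2.1."""
    if mnemonic == "SBOXA_PREXOR":        # rd = sbox(rs1 XOR rs2)
        return SBOX[(rs1 ^ rs2) & 0xF]
    if mnemonic == "TRIRS.XOR":           # rd = rs1 XOR rs2 XOR rs3
        return rs1 ^ rs2 ^ rs3
    raise ValueError(mnemonic)

# Leakage point: Hamming distance between the old content of rd and the new
# intermediate value, to be weighted by the F_i of Eq. (2.1) for the PE involved.
old_rd = 0x9
new_rd = intermediate_value("SBOXA_PREXOR", rs1=0x7, rs2=0x2)
print("intermediate value:", hex(new_rd),
      " modelled leakage:", hamming_distance(old_rd, new_rd))
```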

Fig. 2.9 Side-channel analysis of software-defined cryptographic chip (the 32-bit instruction is decomposed into its fields, bits 0~6, 7~11, 12~14 and 15~31, for control-instruction and data-instruction analysis, mapped onto the BCUs and peripheral circuits, and correlated with power consumption and electromagnetic radiation)

3. Countermeasures against side channel attack based on redundant resources and randomization

The first two subsections introduced side channel attacks on and evaluation of the software-defined cryptographic chip. This subsection describes countermeasures against side channel attacks that build on that security evaluation and exploit the intrinsic characteristics of the software-defined cryptographic chip, so as to improve its reliability and security level. A spatial and temporal dynamic hybrid reconfiguration strategy based on hiding sensitive information is introduced from the aspects of hardware architecture, configuration and control, and circuit design. For software-defined cryptographic chips, which must switch among different cryptographic algorithms in real applications, masking-based schemes place higher demands on the development of the reconfiguration strategy and require hardware modifications to support masking and unmasking operations, whereas hiding-based schemes do not affect the cryptographic operation itself, are universal across algorithms, and are therefore well suited to software-defined cryptographic chips. First, the principle of reducing side channel leakage through hiding is analyzed. The side channel leakage of the software-defined cryptographic chip is denoted H_overall and generally consists of three parts: leakage H_d closely related to the data operation, leakage H_ind not directly related to the data operation, and leakage H_n caused by various kinds of noise, as shown in Eq. (2.3), where H_d is the variable the attacker cares about.

H_{overall} = H_d + H_{ind} + H_n    (2.3)

To break the correlation underlying the existing side channel leakage, H_d can be hidden by introducing additional "interference variables" related to the data operation. The introduced interference is denoted H_Δ, as shown in Eq. (2.4):

H_{overall} = H_d + H_{ind} + H_n + H_\Delta    (2.4)

The core idea of a general side channel attack is to use a large number of side channel traces and an information leakage model to compute the correlation between the traces and the intermediate values derived from a guessed key. Taking the widely used Pearson correlation coefficient as an example, the correlation ρ is given by Eq. (2.5), where E(·) and Var(·) denote the mean and variance of the corresponding variable and SNR is the signal-to-noise ratio between H_d + H_Δ and H_ind + H_n. The derivation shows that the correlation coefficient of the overall leakage H_overall depends directly on Var(H_Δ) and Var(H_d), so the exploitable leakage can be reduced both by adding the interference H_Δ and by reducing the data-dependent leakage H_d.

\rho(W, H_{overall}) = \dfrac{E(W \cdot H_{overall}) - E(W)E(H_{overall})}{\sqrt{Var(W) \cdot Var(H_d + H_{ind} + H_n + H_\Delta)}}
                     = \dfrac{E(W \cdot H_d) - E(W)E(H_d)}{\sqrt{Var(W) \cdot Var(H_d)} \cdot \sqrt{1 + \dfrac{Var(H_\Delta)}{Var(H_d)}} \cdot \sqrt{1 + \dfrac{Var(H_{ind} + H_n)}{Var(H_d + H_\Delta)}}}
                     = \rho(W, H_d) \cdot \dfrac{1}{\sqrt{1 + \dfrac{1}{SNR}}} \cdot \dfrac{1}{\sqrt{1 + \dfrac{Var(H_\Delta)}{Var(H_d)}}}    (2.5)
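The reduction predicted by Eq. (2.5) can be checked numerically with the small simulation below (synthetic Gaussian leakage components with arbitrary illustrative variances; this is a sketch of the formula, not a model of any real chip): adding the interference component H_Δ and noise shrinks the measurable correlation by the two square-root factors.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200_000

H_d = rng.normal(0.0, 1.0, N)             # leakage tied to the data operation
W = H_d + rng.normal(0.0, 0.5, N)         # attacker's (imperfect) model of H_d
H_ind = rng.normal(0.0, 1.0, N)           # leakage unrelated to the data
H_n = rng.normal(0.0, 1.0, N)             # noise
H_delta = rng.normal(0.0, 2.0, N)         # injected interference (hiding)
H_overall = H_d + H_ind + H_n + H_delta

rho_d = np.corrcoef(W, H_d)[0, 1]
rho_overall = np.corrcoef(W, H_overall)[0, 1]
snr = np.var(H_d + H_delta) / np.var(H_ind + H_n)
predicted = rho_d / (np.sqrt(1 + 1 / snr) *
                     np.sqrt(1 + np.var(H_delta) / np.var(H_d)))

print("measured  rho(W, H_overall):", round(rho_overall, 4))
print("Eq. (2.5) prediction       :", round(predicted, 4))
```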

The post-silicon reconfigurability of the software-defined cryptographic chip should be fully exploited: the redundant hardware resources on the chip are scheduled comprehensively, the mechanisms of the attacks are analyzed on the basis of the existing hardware architecture, and countermeasures are developed in both the temporal and the spatial domain. In the temporal domain, dynamic random fake operations are introduced outside the main computation sequence to break the correlation between the internal circuit activity and the side channel information, increasing H_Δ; meanwhile, dynamic random dummy operations are introduced into the main computation sequence to reduce H_d. In the spatial domain, the main, fake and dummy operation sequences are mixed across different BCUs and PEs on the chip, forming a hybrid spatial-temporal dynamic reconfiguration scheme. The added operations and their insertion times are also studied to evaluate the maximum randomness that can be introduced. As shown in Fig. 2.10, unprotected sequences execute in order on the same PE of the same BCU and are therefore more vulnerable, whereas with the protection strategy the operations are randomly dispersed over different BCUs and PEs and dynamic random fake and dummy operations are additionally inserted.
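A toy scheduler illustrating the hybrid strategy (purely illustrative; the BCU/PE counts and the insertion probability are assumptions, and real dependencies and configuration costs are ignored) scatters the ordered real operations over random BCUs, PEs and time slots and interleaves RNG-driven fake/dummy slots.

```python
import random

REAL_OPS = ["a", "b", "c", "d", "e", "f", "g"]   # ordered real operations
NUM_BCUS, NUM_PES, INSERT_PROB = 32, 4, 0.3

def protected_schedule(ops=REAL_OPS):
    """Keep the data dependencies (original order of the real operations)
    but randomize where each runs, and insert fake/dummy slots in between."""
    schedule, cycle = [], 0
    for op in ops:
        while random.random() < INSERT_PROB:     # RNG-driven fake/dummy slots
            schedule.append((cycle, random.randrange(NUM_BCUS),
                             random.randrange(NUM_PES), "fake"))
            cycle += 1
        schedule.append((cycle, random.randrange(NUM_BCUS),
                         random.randrange(NUM_PES), op))
        cycle += 1
    return schedule

for slot in protected_schedule():
    print("cycle %2d  BCU#%02d  PE%d  %s" % slot)
```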

Fig. 2.10 Schematic diagram of spatial and temporal dynamic hybrid reconfiguration strategy (see color chart): without protection, real operations a–g execute in consecutive cycles on the same BCU/PE; with protection, the RNG-driven configuration controller disperses them over different BCUs and time slots and interleaves dynamic fake and dummy operations

Thanks to the spatial randomness it introduces, the reconfiguration scheme with the protection strategy can resist not only power side channel attacks but also local electromagnetic side channel attacks. In addition, since dynamic random reconfiguration increases the amount of configuration contexts and slows down chip configuration, functional integrity and energy efficiency must also be verified. By appropriately raising the operating frequency, reusing identical configuration information, and feeding the verification results back through iterative optimization, the configuration strategy is tuned to minimize the extra overhead caused by the security improvement.

4. Prospect of security design automation

In practice, software-defined cryptographic chips require algorithm analysis for each of the various cryptographic algorithms. The existing flow generally pursues only high energy efficiency and low resource overhead, giving insufficient consideration to security. The countermeasures based on redundant resources and randomization introduced in the previous subsections need to balance parameters such as power consumption, overhead and security level.

Fig. 2.11 Security design automation for software-defined cryptographic chips (the general development process, covering hardware architecture design, the processing element array, on-chip interconnection, the configuration controller, on-chip memory and the design of the reconfiguration strategy and mapping method, is extended with side channel security requirements: side channel vulnerability analysis guides the spatial and temporal dynamic hybrid reconfiguration strategy, and side channel information acquisition and analysis feeds back into the quantitative security evaluation)

The best approach is to introduce security design automation, adding support for security attributes and security-related options to the analysis of the various algorithms. Security design automation must be tightly integrated with the existing development process, as shown in Fig. 2.11. On top of the general development flow, the side channel security requirement is added: first, the spatial and temporal dynamic hybrid reconfiguration strategy is incorporated into the existing design flow and an initial security estimate is obtained with the high-level security evaluation technique; then, after the algorithm has been developed and implemented on the software-defined cryptographic chip, the final quantitative evaluation of side channel security is completed with the actual side channel information acquisition and analysis system. Iterating this process realizes security design automation.

2.1.3 PUF Technology Based on SDC

As the ancient Greek philosopher Heraclitus said, "People cannot step twice into the same river": an identical design layout and production process cannot guarantee that two chips are identical. The unavoidable process deviations of chip production make it possible to identify each individual chip, and this unique identification enables a variety of cryptographic applications. The semiconductor physical unclonable function (PUF) is regarded as an essential module of the hardware root of trust in hardware security systems; it extracts the unique characteristics of an integrated circuit chip. Such a module is deliberately designed to be very sensitive to the process deviations of chip production, so the PUF on each chip produces a unique response for a given challenge input. PUFs can be classified in many ways; the most common classification is based on how the number of implemented challenge-response pairs (CRPs) grows with area (Fig. 2.12): for a weak PUF the number of CRPs grows polynomially with area, while for a strong PUF it grows exponentially. In practical applications, weak PUFs are often used in place of costly secure nonvolatile memory to store sensitive information such as keys in an encryption system, while strong PUFs are used not only for key generation but also for lightweight authentication.

Fig. 2.12 PUFs classified into weak PUFs and strong PUFs

The SDC endows the PUF with dynamic reconfigurability, which provides a new foundation and security anchor for the wider application of PUFs: dynamic reconfiguration can enhance the availability of a PUF and improve its resistance to machine learning attacks. When an existing PUF is used to generate a key, the key extraction function is fixed at design time and cannot be updated after chip production; once the key reaches the end of its service life or is at risk of leakage, it can be neither revoked nor updated. The reconfiguration feature of the SDC enhances the availability of such key-generating PUFs. Strong PUFs used for lightweight authentication are often threatened by machine learning modeling; to date, almost all strong PUFs have been broken completely or partially by various machine learning algorithms. The dynamic reconfigurability of the SDC can purposefully change the computing circuit of the PUF implementation, making machine learning attacks much harder or even infeasible.

1. Evaluation metrics of PUFs

A PUF exploits the uncontrollable and unavoidable process deviations of integrated circuit manufacturing to produce a unique output on each individual chip: under the same manufacturing process and parameters, each chip has its own unique CRP relationship because the chip-internal characteristics are never perfectly consistent. Depending on the application field, four metrics of a PUF are usually evaluated: uniformity, uniqueness, reliability and the avalanche effect. Uniformity is the ratio of 1s to 0s in the responses of a PUF to different challenges; the ideal value is 50%. Uniqueness means that the responses of different chips to the same challenge are sufficiently random; the ideal value is 50%. Reliability is the consistency of the outputs generated by the same PUF at different times for the same challenge, measured by the bit error rate; the ideal value is 0, i.e., no bit errors. The avalanche effect measures the importance of each challenge bit: ideally, flipping any single bit of the challenge changes the response with a probability of 50%.
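These metrics can be estimated directly from measured responses, as in the sketch below (synthetic random data stands in for real chips, and the 2% flip rate is an assumed noise level; the avalanche effect is omitted for brevity).

```python
import numpy as np

def uniformity(responses):
    """Fraction of 1s in one chip's responses to many challenges (ideal 50%)."""
    return responses.mean()

def uniqueness(resp_a, resp_b):
    """Inter-chip Hamming distance for the same challenges (ideal 50%)."""
    return np.mean(resp_a != resp_b)

def reliability_ber(reference, remeasured):
    """Bit error rate of repeated evaluations vs. a reference (ideal 0)."""
    return np.mean(remeasured != reference)

rng = np.random.default_rng(7)
chip1 = rng.integers(0, 2, 1024)                       # responses of chip 1
chip2 = rng.integers(0, 2, 1024)                       # responses of chip 2
noisy = chip1 ^ (rng.random(1024) < 0.02).astype(int)  # chip 1 re-measured

print("uniformity:", uniformity(chip1))
print("uniqueness:", uniqueness(chip1, chip2))
print("BER       :", reliability_ber(chip1, noisy))
```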

As shown in Fig. 2.12, PUFs can be divided into weak PUFs and strong PUFs according to the algorithmic properties of their CRP behavior. The CRP space of a weak PUF is generally small and can be attacked directly by tabulation. A strong PUF cannot be attacked by tabulation because of its large CRP space, but most strong PUF designs can still be modeled mathematically through side channel attacks and machine learning attacks. The arbiter PUF (APUF) and the ring oscillator PUF (ROPUF) are classic examples of a strong and a weak PUF, respectively; their structures are shown in Figs. 2.13 and 2.14. PUFs can also be classified into delay PUFs and memory-based PUFs according to how the process deviation is extracted: a delay PUF such as the APUF generates its response from the delay differences of circuit elements and wires [32], while a memory-based PUF such as the SRAM PUF [33] generates its response from bistable memory elements.

2. Attacks on PUFs and countermeasures

1) Power attack and countermeasures

Power attacks obtain information by analyzing the instantaneous power consumption or current of the circuit and are generally divided into simple power analysis and differential power analysis. Simple power analysis can attack the single arbiter PUF and the XOR arbiter PUF; its basic principle is to analyze the instantaneous power traces to identify the 0-to-1 transition of the latch in the arbiter PUF and thus determine the current state of the arbiter's latch.

Fig. 2.13 Structure diagram of arbiter PUF

Fig. 2.14 Structure diagram of ROPUF

PUF, so as to determine the current single arbiter PUF latch unit state. For the XOR arbiter PUF composed of multiple arbiters PUF in parallel, the number of 0 and 1 in the latch or arbiter can be analyzed by simple power consumption, but the specific location with output of 1 or 0 cannot be located. Therefore, it is necessary to consider the structural characteristics when attacking [34]. Reference [34] adopts the divide and conquer method to attack this kind of PUF. After obtaining the total number of outputs of 1 by using simple power analysis, it selectively uses CRP conducive to the attack to carry out machine learning attacks on each parallel arbiter PUF, that is, CRP with the output of all parallel arbiters PUF of 1 or 0 is selected for training the model, to realize the high-precision modeling of a single arbiter PUF, and then realize the high-accuracy attack. Differential power analysis is an effective attack method for XOR arbiter PUF. By comparing the power traces before and after response, the fitting model with minimum variance is obtained by gradient optimization algorithm, and the corresponding model is obtained [35]. To resist power consumption attack, the hiding method can be adopted. A common way is to balance the power consumption of each gate in different output logic, such as using dual track circuit [36]. The two data tracks of each gate are charged and discharged in each clock cycle. The two tracks are charged equally, and the charge and discharge are data independent. The overall charge and discharge are always a constant. Complementary operation is also a common hiding method. Two different output corresponding operations of 0 and 1 are carried out at the same time, so that the total power consumption corresponding to different actual outputs is consistent. For example, the arbiter PUF can use two symmetric reverse output signals and two latches to balance the power consumption. 2) Fault attack and countermeasures Fault attack performs attacks by faulty information and behaviors. The output response of PUF is not completely stable due to the thermal noise and environmental changes in practical use. Using the reliability information of PUF response, combined with the covariance matrix adaptation evolution strategy (CMA-ES), the machine learning attack can attack the XOR arbiter PUF. This method focuses on attacking a single PUF, and the unreliability introduced by other PUFs is regarded as noise. Each additional PUF only adds additional noise, which transforms the relationship between the complexity of machine learning and the number of XORs from exponential relationship to linear relationship, reducing the difficulty of attack [37]. If it is assumed that the attacker can get access to PUFs, it can trigger faulty behavior by adjusting the power supply voltage or changing environmental factors, and increase unstable CRP to accelerate the attack and improve the accuracy of the attack. According to the principle of fault attack, error detection technology can be used to reduce errors and improve attack resistance. Common methods include classic error detection techniques such as spatial redundancy and temporal redundancy. If the attacker cannot access and control PUF, the attacker cannot get error information or reliability information by limiting the number of CRP reuse, that is, each CRP is allowed to be used only once. In addition, more different PUF designs aim to

94

2 Hardware Security and Reliability

APUF d2

Response1

Multiplexer

APUF d1

Response2

… APUF d2n

Response

Response2n Select

n

APUF s1

APUF s2



APUF sn

Challenge

Fig. 2.15 MPUF

Figure 2.15 shows a multiplexer PUF (MUX PUF, MPUF) structure consisting of a single MUX and multiple arbiter PUFs. The reliability of the MPUF response decreases as the number of arbiter PUFs grows, but at a lower rate than for the XOR arbiter PUF, so it offers higher resistance to the related attacks [38].

3) Electromagnetic attack and countermeasures

Electromagnetic (EM) attacks are mainly used against oscillator-based PUF designs such as the ROPUF; by obtaining the oscillator frequency information, an electromagnetic attack can guess the output of the PUF with better than 50% accuracy. To attack a ROPUF, the frequency-amplitude spectrum is computed from the captured electromagnetic traces and the amplitude differences are compared to distinguish the frequency ranges of the oscillators; the average frequency difference or average amplitude then identifies the chip area with the highest leakage, and by averaging spectra obtained over many measurements the two distinct frequencies of the two ring oscillators compared each time can be extracted, finally yielding a complete model of the ROPUF [39]. To reduce electromagnetic leakage, special transistor-level techniques can hide the electromagnetic information, but they require extra design overhead and in most cases degrade circuit performance; designers may need to prepare a large number of special standard cells for each key element, and careful place and route is required. Preventing microprobe-based electromagnetic attacks requires a special package that stops the attacker from opening the package and approaching the chip surface, but this greatly increases the manufacturing cost.

Another protection method is to install an active shield on or around the chip, although driving the shield consumes additional power. Electromagnetic sensors can also be designed to detect electromagnetic attacks: the electric coupling between a probe and the measured object makes it impossible for the probe to measure without disturbing the original field, and a sensor based on an LC oscillator can detect such intrusion; this applies to any electromagnetic analysis or fault injection attack in which the probe is placed near the target circuit [40]. For a specific ROPUF, resistance can be improved through a compact layout, for example placing adjacent oscillator arrays in sine-and-cosine-like positions so that the electromagnetic fields leaked by two adjacent oscillators overlap and the detector cannot distinguish the frequency of a single ring oscillator. Resistance can also be enhanced by specially designing the operating mode of the ROPUF, for example using each ring oscillator only once per comparison to prevent identification of individual oscillators, or comparing all ring oscillators at the same time; the drawback is that both methods increase the hardware overhead.

4) Machine learning attack and countermeasures

(1) Attack strategy

Machine learning attacks are powerful attacks on PUFs and can be divided into white-box, gray-box and black-box attacks according to the attacker's knowledge of the target. The black-box attack is the simplest: the attacker needs neither to know the internals of the target nor to model it in advance, only to collect enough CRPs; its drawback is lower accuracy. A white-box attack requires the attacker to fully understand the internal structure and operating condition of the target, model it, and mount the machine learning attack on that model; it has the highest complexity and accuracy. A gray-box attack lies between the two. Neural networks (NN), logistic regression (LR), the covariance matrix adaptation evolution strategy (CMA-ES) and support vector machines (SVM) are common machine learning attack methods. Most existing PUFs were designed without considering resistance to machine learning attacks. For the classic arbiter PUF, all four methods succeed even in the black-box setting; because the arbiter PUF has a simple mathematical model, the accuracy of a white-box attack can approach 100%.
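The reason the plain arbiter PUF falls so easily is its well-known additive linear delay model: after the standard parity transform of the challenge, the response is the sign of a linear combination of per-stage delay differences, which logistic regression learns readily. The sketch below illustrates this textbook model on synthetic delays (scikit-learn is assumed to be available; this is not an attack on any specific chip described in this book).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
N_STAGES, N_CRPS = 64, 20_000

# Additive delay model: one delay-difference weight per stage, plus a bias term.
w = rng.normal(0.0, 1.0, N_STAGES + 1)

def parity_features(challenges):
    """Standard transform: phi_i = product of (1 - 2*c_j) for j >= i."""
    signs = 1 - 2 * challenges                       # 0/1 -> +1/-1
    phi = np.cumprod(signs[:, ::-1], axis=1)[:, ::-1]
    return np.hstack([phi, np.ones((len(challenges), 1))])

challenges = rng.integers(0, 2, (N_CRPS, N_STAGES))
features = parity_features(challenges)
responses = (features @ w > 0).astype(int)           # simulated arbiter outputs

# Model-building attack: fit on half of the CRPs, predict the rest.
model = LogisticRegression(max_iter=2000)
model.fit(features[:N_CRPS // 2], responses[:N_CRPS // 2])
print("prediction accuracy:", model.score(features[N_CRPS // 2:],
                                          responses[N_CRPS // 2:]))
```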

Fig. 2.16 XOR PUF

(2) Attack countermeasures

To strengthen a PUF against machine learning attacks, the attack difficulty can be raised by increasing the complexity of the PUF's structure. The XOR PUF is one design that resists machine learning attacks to a certain extent [41]: as shown in Fig. 2.16, multiple identical PUFs are combined through XOR gates to produce the final output. Compared with a single PUF, the XOR PUF improves resistance to machine learning attacks; however, newer attack methods have greatly weakened it. For example, the reliability-based machine learning attack reduces its resistance from exponential in the number of XOR inputs to linear [37]. Various variants of the arbiter PUF, such as the lightweight secure arbiter PUF and the feedforward arbiter PUF [42], have been proposed to counter machine learning attacks, but they too have been broken: attacking 128-stage lightweight secure PUFs with 5 XORed modules using LR achieves 99% accuracy, and CMA-ES-based attacks on 128-stage feedforward PUFs with 10 feedforward loops reach more than 97% accuracy [43].

3. SDC-based physical unclonable function technology

A traditional physical unclonable function has static challenge-response behavior: ignoring possible noise in the response, the same challenge always produces the same response. The SDC-based physical unclonable function, also called the reconfigurable physical unclonable function (rPUF), was first proposed in Lim's master's thesis [44] in 2004. It can transform an existing PUF into a new one through a variety of reconfiguration methods, and the newly generated PUF has a new, uncontrollable and unpredictable set of CRPs. On the one hand, reconfigurability provides revocability for existing PUFs used in lightweight authentication and key generation; on the other hand, it extends the lifetime of PUFs in lightweight authentication systems whose number of authentications is limited. When the SDC-based PUF is reconfigured into a new PUF, the new PUF should inherit all the security and non-security features of the original one, the only change being the CRPs; to ensure that reconfiguration introduces no new security risks, the reconfiguration process itself must be uncontrollable.

The reconfiguration technology of SDC-based physical unclonable functions falls into two kinds: intrinsic reconfiguration, which changes the intrinsic physical features of the hardware carrier of the PUF, and reconfiguration of the computing circuit in the PUF implementation.

1) Reconfigurable PUF based on updating of intrinsic physical properties

Under the influence of external factors, the intrinsic physical parameters used by the PUF change uncontrollably, resulting in an uncontrollable and irreversible renewal of the PUF's CRPs; this process is regarded as a reconfiguration operation based on updating intrinsic physical differences. Two examples are introduced below.

(1) Polymer-based optical reconfigurable PUF

A reconfigurable optical PUF is proposed in Ref. [45]. Its physical properties come from the light-scattering particles on the surface of a polymer: the challenge space is defined by the position and incident angle of the laser beam, and the response space contains all possible speckle patterns; Fig. 2.17 shows two example responses [45]. This PUF works in two modes: (1) in normal mode, the laser illuminates the polymer surface and the PUF generates a stable speckle pattern; (2) when the laser operates at a higher current, the high-intensity beam melts the polymer surface, changing the positions of the scattering particles, and the changed structure freezes after cooling, thereby reconfiguring the PUF. Although existing technology can already integrate the laser source closely with materials suitable for optical PUFs [46], the cost of integrating the laser source on chip, together with the sensor required to extract the speckle pattern, remains a practical obstacle to the wide application of the reconfigurable optical PUF.

Fig. 2.17 Optically reconfigurable PUF: (a) before reconfiguration; (b) after reconfiguration

(2) Reconfigurable PUF based on phase change memory

As a new type of fast nonvolatile memory, phase change memory can also serve as the carrier of a reconfigurable physical unclonable function. Phase change memory cells are usually made of chalcogenide glass containing one or more chalcogenide compounds; common phase change materials are based on germanium-antimony-tellurium (GeSbTe) alloys. After heating in specific modes, the material switches between the crystalline and the amorphous state, which have different resistance values and can therefore store different values. Generally, the GeSbTe alloy has a high resistance in the amorphous state and a low resistance in the crystalline state; the two states can be encoded as logic "1" and "0", or intermediate states can be encoded into multiple bits to increase the storage capacity of a cell. Reference [45] observed that in some phase change memories the phase can be controlled less precisely than the resistance can be measured. Suppose the controllable accuracy of the phase change divides the attainable resistance range into only n coarse segments, while the measurement accuracy can subdivide each segment further: as shown in Fig. 2.18, the resistance measurement allows each segment to be split into a left and a right interval, which can then be encoded. The single bit encoded by this left/right interval is easy to read out, but it is not controlled during reconfiguration, i.e., when the phase is changed; reconfiguration therefore generates a nonvolatile random state through an uncontrollable process. Since phase change memory can be used as embedded memory in integrated circuits, a reconfigurable physical unclonable function based on phase change memory is feasible.
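The left/right-interval encoding can be mimicked with the toy model below (arbitrary resistance range and segment count; real programming and read circuits are far more involved): coarse programming only chooses a segment, the fine-grained read decides the half within the segment, and re-programming (reconfiguration) re-draws that uncontrollable half.

```python
import random

R_MIN, R_MAX, N_SEGMENTS = 1_000.0, 100_000.0, 8     # ohms, coarse segments
SEG_WIDTH = (R_MAX - R_MIN) / N_SEGMENTS

def program_cell(target_segment):
    """Coarse control: we can only steer the resistance into a segment;
    where it lands inside that segment is uncontrollable."""
    low = R_MIN + target_segment * SEG_WIDTH
    return random.uniform(low, low + SEG_WIDTH)

def read_bit(resistance):
    """Fine measurement: 0 if the value sits in the left half of its
    segment, 1 if it sits in the right half."""
    offset = (resistance - R_MIN) % SEG_WIDTH
    return int(offset >= SEG_WIDTH / 2)

cell = program_cell(target_segment=3)
print("stable bit before reconfiguration:", read_bit(cell))
cell = program_cell(target_segment=3)      # reconfigure: re-draw within segment
print("bit after reconfiguration        :", read_bit(cell))
```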

Fig. 2.18 Reconfigurable physical unclonable function based on phase change memory (the resistance range is divided into Logic 0/Logic 1 segments, each subdivided into a left r = 0 and a right r = 1 interval)

As shown in Fig. 3.1 in Volume I, a CGRA is composed of PEs and the interconnections between them. A PE is the basic unit for performing operations and includes a MUX, registers, an ALU and other components. Through the programming of configuration contexts, the function of each PE can be changed. The interconnections between PEs, also determined by configuration contexts, can take various forms such as 2D-mesh and mesh-plus. A single SDC can be configured to execute multiple cryptographic algorithms at the same time, and at run time the idle PEs on the chip can be used to perform fake operations to improve resistance to physical attacks, so the SDC is considered one of the ideal platforms for cryptographic applications. A PUF based on a dynamically reconfigurable PEA can provide a high-quality randomness source and secure key storage. As mentioned earlier, a delay PUF generates a response by comparing the delays of two paths with identical layout, and a memory-based PUF generates a response by extracting the metastable behavior of SRAM cells or other cross-coupled inverters. The SDC has rich computing resources for composing diverse delay paths, which makes it easy to implement delay PUFs.

(1) Reconfigurable physical unclonable function based on PE interconnections

In Ref. [47], the delay PUF based on a single PE of the SDC is called a PEPUF. A PEPUF generates a single-bit output by comparing the delays of PEs with the same architecture: a signal generated at the same instant propagates along two delay paths with the same topology, and an arbiter at the end of the paths judges the arrival order of the signals and produces the single-bit output. The arbiter can be an RS latch, which has a small area and good transient characteristics. The challenge of the PUF is mapped to the inputs of the PEs. Taking the PEPUF shown in Fig. 2.19 as an example, each PE has two groups of inputs, X0 = (x0, x1, x2, x3) and X1 = (x4, x5, x6, x7), and generates the four-bit outputs O0 = (o0, o1, o2, o3) and O0' = (o0', o1', o2', o3') respectively. The response of the PEPUF is determined by the arbiter judging the arrival order of the two PE output signals. A PEPUF is thus a delay PUF that generates its response from the intrinsic operation delay inside the PEs.

The PEPUF shown in Fig. 2.19 is a single-stage PEPUF, and the available challenges are the configuration contexts and the inputs. Different configuration contexts configure the PE into different functions, so a specific internal computation module can be selected as the basic unit for extracting intrinsic process deviation. Common configurable functions include addition and subtraction in arithmetic computation and AND, OR, NOT and XOR in logic operation. The interconnections between PEs are another important resource for implementing PUFs. To make full use of these resources, single PEPUFs can be interconnected to form a multi-stage PEPUF, and the intrinsic delay of the interconnections themselves can be used to add more entropy. In a multi-module PEPUF, part of the challenge serves as the choice of interconnection path.

Fig. 2.19 Structure of a single-module PEPUF

As shown in the example of the multi-module PEPUF structure in Fig. 2.20, the PEPUF connected through the light gray shaded modules and the PEPUF connected through the dark gray shaded modules have completely identical topologies. When the signals reach the arbiter, the delay difference between the two lines comes only from the process deviation introduced during the production of the SDC.

Fig. 2.20 Example of multi-module PEPUF structure

(2) Analysis of PEPUF implementation flexibility

Here the analysis explores the implementation space of the PEPUF by changing the interconnection form on a CGRA implementation with multiple interconnections. Figure 2.21 shows a structure with 32 delay units on a specific CGRA, arranged in 4 rows and 8 columns.


Fig. 2.21 A CGRA connecting method

For convenience of description, the unit in row X and column Y of the structure is marked as (X, Y), with rows and columns counted from 0. To simplify the study, it is specified that the starting signal must be input to the upper-left corner unit and that the output of the lower-left corner delay unit is connected to the arbiter: the left end of unit (0, 0) in the figure is connected to the starting signal, and the left end of output unit (3, 0) is connected to the arbiter. An intermediate general unit (X, Y) can be connected through configuration to six adjacent delay units, located at (X − 1, Y − 1), (X, Y − 1), (X + 1, Y − 1), (X − 1, Y + 1), (X, Y + 1) and (X + 1, Y + 1). In addition to these general connections, there are special connections for the delay units in columns 0 and 7 of the CGRA structure. The special connections in Fig. 2.21 are as follows: the left ends of delay units (1, 0) and (2, 0) are connected to each other; the right ends of delay units (0, 7) and (1, 7) are connected to each other; and the right ends of delay units (2, 7) and (3, 7) are connected to each other. Connections between units in Fig. 2.21 are drawn as dotted lines, indicating that they can be dynamically configured at run time, so the length of a PUF constructed from these basic units is not fixed. To facilitate the study, the first unit must be (0, 0) and the last unit must be (3, 0), and the path of a PEPUF cannot pass through the same PE twice. Under these three constraints, the construction algorithm traverses the CGRA and counts the PEPUFs and interconnections that can be used (a sketch of such a search is given after Fig. 2.22). The results show that 19,220 PEPUFs of length 16 and 56,675 PEPUFs of length 32 can be constructed in the 32-unit structure. There is only one special connection between every two rows in Fig. 2.21, which limits the flexibility of the PEPUF. Therefore, the number of special connections per row in the CGRA structure can be increased to improve the flexibility of the PEPUF path, as shown in Fig. 2.22.


Fig. 2.22 A CGRA connecting method (two special connections per layer)
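As a concrete illustration of the constraint-based search referred to above, the following C++ sketch counts simple paths of a given length between the fixed start unit (0, 0) and end unit (3, 0) on a 4 × 8 array. The neighbour rule implements the six general connections and the three special connections of Fig. 2.21 as read from the description; because some connection details are not fully specified here, the counts it prints are not guaranteed to reproduce the 19,220/56,675 figures quoted above.

#include <cstdio>
#include <utility>
#include <vector>

const int ROWS = 4, COLS = 8;

// Neighbours of (r, c): six general connections plus the three special
// same-column connections drawn in Fig. 2.21 (an assumption based on the text).
std::vector<std::pair<int,int>> neighbours(int r, int c) {
    std::vector<std::pair<int,int>> out;
    const int dr[] = {-1, 0, 1, -1, 0, 1}, dc[] = {-1, -1, -1, 1, 1, 1};
    for (int k = 0; k < 6; ++k) {
        int nr = r + dr[k], nc = c + dc[k];
        if (nr >= 0 && nr < ROWS && nc >= 0 && nc < COLS) out.push_back({nr, nc});
    }
    const int sp[3][4] = {{1, 0, 2, 0}, {0, 7, 1, 7}, {2, 7, 3, 7}};
    for (auto& s : sp) {
        if (r == s[0] && c == s[1]) out.push_back({s[2], s[3]});
        if (r == s[2] && c == s[3]) out.push_back({s[0], s[1]});
    }
    return out;
}

// Depth-first enumeration under the three constraints: start fixed at (0,0),
// end fixed at (3,0), and no delay unit visited twice.
long long dfs(int r, int c, int len, int target, std::vector<std::vector<bool>>& vis) {
    if (r == 3 && c == 0) return len == target ? 1 : 0;
    if (len >= target) return 0;
    long long total = 0;
    for (auto [nr, nc] : neighbours(r, c)) {
        if (vis[nr][nc]) continue;
        vis[nr][nc] = true;
        total += dfs(nr, nc, len + 1, target, vis);
        vis[nr][nc] = false;
    }
    return total;
}

int main() {
    std::vector<std::vector<bool>> vis(ROWS, std::vector<bool>(COLS, false));
    vis[0][0] = true;                               // start unit (0, 0)
    std::printf("paths of length 16: %lld\n", dfs(0, 0, 1, 16, vis));
    return 0;
}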

2.2 Reliability

With the continuous expansion of the scale of the SDC PEA, interconnection plays a vital role in the whole system. The network-on-chip (NoC) has gradually become one of the most commonly used interconnection architectures because of its reconfigurability, high performance and low power consumption, and it has an important impact on the reliability of the whole SDC. Taking the network-on-chip as an example, this section introduces how to effectively improve the reliability of the system on chip from two aspects: a topology reconfiguration method and multi-objective joint mapping optimization.

2.2.1 Topology Reconfiguration Method Based on Maximum Flow Algorithm

Since a large number of PEs are usually integrated on an SDC, some of the idle PEs can be used as spare components to replace faulty components and maintain the correctness of the whole system. Because the number of PEs is limited, how to make the best use of these components to improve the repair rate and reliability is a problem worth studying. This section introduces a topology reconfiguration algorithm based on the maximum flow algorithm [49]. Inspired by network flow algorithms in graph theory, the problem of repairing faulty PEs is transformed into a network flow problem, a mathematical model is built, and the maximum flow algorithm is used to solve it. By introducing the concept of a virtual topology, the burden that topology changes place on the operating system is greatly reduced.

1. Design metrics


Assume that there is a group of PEs, some of which work properly while the rest are faulty. The problem we want to solve is how to reconfigure the interconnections between these PEs to obtain a NoC with correct functionality. The goal is to increase the repair rate as much as possible while reducing costs such as longer reconfiguration time, changes in topology, larger area, lower throughput and higher latency. Since area, throughput and delay are common metrics for evaluating NoCs, they are not repeated here; the other three evaluation metrics, namely repair rate, reconfiguration time and topology, are described below.

Repair rate is an important metric for evaluating the effectiveness of a repair method. It is defined as the probability that all faulty PEs can be repaired by spare PEs, and it differs between methods. Since the hardware resources on the chip are limited, the goal is to provide as many repair schemes as possible for each defective PE so that they can be repaired effectively.

Reconfiguration time determines whether a method can be executed while the system is running, and it depends on the computation time of the repair algorithm. The reconfiguration time has a great impact on the performance of the whole system: if an error can be detected during operation and repaired through reconfiguration, the system does not need to stop and wait for repair, which improves performance. In addition, when the chip is in mass production and test, the reconfiguration time is also an important metric because it is closely related to the production cost of the chip. Therefore, the reconfiguration time should be minimized.

Topology also needs to be considered during reconfiguration. Because the locations of faulty PEs are not known in advance, replacing a faulty PE with a spare PE may make the resulting topology irregular and degrade performance. For example, Fig. 2.23a shows a 4 × 4 two-dimensional mesh structure. Suppose a column of spare PEs is added to improve the reliability of the chip, as shown in Fig. 2.23b. When faulty PEs are replaced by spare PEs, as shown in Fig. 2.23c, d, different chips may end up with different topologies, and these topologies may differ from the desired structure. The operating system would then carry a heavy burden when optimizing parallel programs on these different topologies [50]. To solve this problem, some concepts related to topology are introduced below.

The reference topology is defined as the desired topology [51]; for example, Fig. 2.23a is a 4 × 4 two-dimensional mesh reference topology. Figure 2.24a has 4 spare PEs and 4 faulty PEs; the faulty PEs are No. 2, No. 7, No. 8 and No. 19. The physical topology is the structure composed of the PEs that work normally, as shown in Fig. 2.24b. Although this topology is different from the reference topology, these PEs can still form a 4 × 4 processor array. In the reconfigured chip, each PE is considered to be virtually connected to the PEs around it, and the resulting topology is defined as the virtual topology. Figure 2.24c is an example of a 4 × 4 two-dimensional mesh virtual topology. In Fig. 2.24c, PEs No. 3, No. 6, No. 9 and No. 13 are the four virtual neighbours of PE No. 12, and PE No. 13 is virtually considered to be located below PE No. 12, although physically the two are adjacent left and right. Although PE No. 9 and PE No. 12 are actually 3 hops apart, they are considered adjacent in the virtual topology. For the operating system and other applications, the virtual topology is the same regardless of the actual physical topology.
In this way, the operating system can more easily optimize parallel programs and allocate tasks.
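As a minimal illustration of this idea (not the reference implementation), the sketch below shows how a per-router lookup table could translate a packet's virtual destination coordinate into the physical coordinate of the PE that currently plays that role; the names, coordinates and table entries are assumptions chosen for the example.

#include <cstdio>
#include <map>
#include <utility>

using Coord = std::pair<int, int>;            // (x, y)

struct ReconfigTable {
    std::map<Coord, Coord> virt_to_phys;      // filled in after the repair paths are computed

    // Translate a packet's virtual destination into the physical destination.
    Coord route(Coord virtual_dst) const {
        auto it = virt_to_phys.find(virtual_dst);
        return it != virt_to_phys.end() ? it->second : virtual_dst;  // identity if unchanged
    }
};

int main() {
    ReconfigTable t;
    t.virt_to_phys[{0, 2}] = {0, 3};          // faulty PE (0,2) replaced by its neighbour
    t.virt_to_phys[{0, 3}] = {0, 4};          // which is in turn replaced by a spare PE
    Coord dst = t.route({0, 2});
    std::printf("packet for virtual (0,2) goes to physical (%d,%d)\n", dst.first, dst.second);
    return 0;
}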

Fig. 2.23 Faulty PEs changing the topology of the target design: (a) desired target design; (b) implementation on the chip; (c) chip with faulty PEs, case I; (d) chip with faulty PEs, case II

2. Reconfiguration method and algorithm

Two methods are presented here to solve the topology reconfiguration problem, together with an analysis of their reconfiguration time, throughput and latency. The first is the maximum flow (MF) algorithm, which assumes that the data transmission between PEs is evenly distributed and does not change the basic topology of the system. However, in real applications the amount of data transmitted between different PEs is not evenly distributed, so an improved algorithm, the minimum-cost maximum flow algorithm, is also proposed, which adds a cost attribute to model the differences between PEs.

Consider first the case in which the data transmission volume between PEs is evenly distributed. The fault information is obtained in the manufacturing test, and the information about which PEs are faulty and which are spare is stored in a centralized topology reconfiguration controller, implemented here by an ARM7 processor. Assume that the non-spare PE at (x, y) is faulty. In an effective repair scheme, the function of this PE is taken over by a normally working PE at (x', y'); more specifically, the PE at (x', y') is renumbered as (x, y) in the reconstructed network, and packets originally sent to (x, y) are sent to (x', y').


Fig. 2.24 Schematic diagram of the reference topology of a 4 × 4 two-dimensional mesh: (a) chip topology with faulty PEs; (b) physical topology; (c) virtual topology

The renumbering of the faulty PE is completed during reconfiguration. Because only the numbering changes while the routing policy stays the same, there is no additional overhead at run time, and packets are still sent to logically adjacent nodes through the NoC. When the PE at (x, y) is replaced by the PE at (x', y'), the PE at (x', y') is in turn replaced by the PE at (x'', y''), and so on, until the replacement process ends with a spare PE. The ordered sequence (x, y), (x', y'), (x'', y''), … produced by this substitution process is defined as the repair path; it is the sequence that logically replaces the faulty PE with a spare PE. Inspired by the compensation path proposed in Ref. [52], a general method for reconstructing the network is proposed, namely determining the repair paths. Once the repair paths are determined, the virtual adjacent nodes of each PE can be determined, and the virtual topology can then be obtained.

Figure 2.25 shows an example that illustrates the concepts of repair path and reconstructed virtual topology. At the top of Fig. 2.25a there are three faulty PEs, each connected to a spare PE by a repair path. PE No. 3 is faulty, so it is replaced by PE No. 4, and PE No. 4 is replaced by spare PE No. 5. In this way, a packet that should have been sent to PE No. 3 is sent to PE No. 4 instead. For example, in the original topology, if data is to be sent from PE No. 9 to PE No. 3, the transmission path is 9-8-3; in the reconstructed topology, the transmission path is 10-9-4.

Fig. 2.25 Schematic diagram of a 4 × 4 mesh indicating the repair path and its topology after reconfiguration: (a) 4 × 4 network structure indicating the repair paths; (b) reconstructed virtual topology

This indicates that each PE in the original topology is renumbered in the virtual topology. This mapping is implemented through a lookup table stored in each router; after the reconfiguration controller calculates the repair paths, new coordinates are assigned to each PE.

Once a faulty PE is found, the repair path used to repair it must start from the faulty PE and end at a spare PE. Since data packets can only be transmitted to physically adjacent nodes through the NoC, the PE sequence on a repair path must be physically connected, that is, the repair path must be continuous. If there are multiple repair paths in the network, these paths cannot intersect: in the virtual topology each PE can only be mapped to one coordinate, and an intersection would map the PE at the crossing point to two coordinates, which is not allowed. To summarize, a set of repair paths must meet the following conditions: (1) each repair path is continuous; (2) the set of repair paths must cover all faulty non-spare PEs; (3) no two repair paths may intersect.

The MF repair algorithm introduced next analyzes whether a network can be completely repaired and, if so, how to obtain the repair paths. If all faulty PEs in the network can be repaired, MF generates a complete set of repair paths; if the network cannot be completely repaired, MF generates a set of repair paths that repairs as many faulty PEs as possible. The problem of generating a set of non-intersecting, continuous repair paths can be transformed into the maximum flow problem, a classic combinatorial optimization problem: determining the maximum flow between a source and a target in a network with capacity constraints on nodes and edges. The relationship between repair paths and the MF algorithm is described below (Fig. 2.26). The mesh is regarded as a directed graph, where each repair path corresponds to a unit flow from a faulty PE to a spare PE.

Fig. 2.26 Determination of the repair path using the maximum flow algorithm: (a) multi-source multi-target network; (b) single-source single-target network with flow and repair path

Thus, the maximum flow problem here has multiple sources and multiple targets. The capacity limit of each node and each edge is set to 1, so that each node and each edge appears at most once in a repair path. A super source point is added with edges to all faulty PEs, and all spare PEs are merged into a super target point, yielding a single-source, single-target network. Since each repair path is a unit flow from the source to the target, the maximum flow of the network equals the number of faulty PEs that can be repaired. If every faulty PE can find a repair path, that is, all faulty PEs can be repaired, then the maximum flow equals the total number of faulty PEs. In this way, the NoC topology reconfiguration problem is transformed into the maximum flow problem of graph theory.

A mesh can be represented by a directed graph G(V, E), where V is the set of nodes in the network, E is the set of edges, and F is the set of faulty nodes. Each node represents a PE and its corresponding router, and a directed edge connecting two nodes is the link between the routers. The capacity of each edge and each node is 1. The mathematical description of the problem is as follows:

(1) The node set is defined as V' = V ∪ {S, T}, where S is the source point and T is the target point.
(2) The edge set connecting the nodes in V' is defined as follows: ➀ for each pair of adjacent nodes (i, j) in the mesh, the two edges i → j and j → i are defined; ➁ for each spare node v ∈ V, an edge v → T is defined; ➂ for each faulty node v ∈ F, an edge S → v is defined.
(3) The capacity of each edge is defined as 1.
(4) The capacity of each node is defined as 1.
(5) For the graph constructed above, the maximum flow problem is solved.
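The construction in steps (1)–(5) can be made concrete with a short sketch. The C++ code below builds the flow network as described — unit node capacities are enforced by splitting each PE into an entry/exit pair — and solves it with a plain BFS-based augmenting-path (Edmonds-Karp) search. The 4 × 5 mesh, the fault positions and the choice of the rightmost column as spares are illustrative assumptions, and a full implementation would also read the individual repair paths back out of the resulting flow.

#include <cstdio>
#include <cstring>
#include <queue>
#include <utility>
#include <vector>

const int MAXN = 64;                 // 4 x 5 mesh split into 40 nodes, plus S and T
int cap[MAXN][MAXN];                 // residual capacity matrix
int n;                               // number of vertices actually used

// Edmonds-Karp: repeatedly augment along a shortest path found by BFS.
int maxflow(int s, int t) {
    int flow = 0;
    while (true) {
        std::vector<int> prev(n, -1);
        std::queue<int> q;
        q.push(s);
        prev[s] = s;
        while (!q.empty() && prev[t] == -1) {
            int u = q.front(); q.pop();
            for (int v = 0; v < n; ++v)
                if (prev[v] == -1 && cap[u][v] > 0) { prev[v] = u; q.push(v); }
        }
        if (prev[t] == -1) break;                     // no augmenting path left
        for (int v = t; v != s; v = prev[v]) {        // all capacities are 1
            cap[prev[v]][v] -= 1;
            cap[v][prev[v]] += 1;
        }
        ++flow;
    }
    return flow;
}

int main() {
    const int R = 4, C = 5;                                        // assumed 4 x 5 mesh (one spare column)
    auto in  = [&](int r, int c) { return (r * C + c) * 2; };      // node "entry" half
    auto out = [&](int r, int c) { return (r * C + c) * 2 + 1; };  // node "exit" half
    const int S = R * C * 2, T = S + 1;                            // super source / super target
    n = T + 1;
    std::memset(cap, 0, sizeof(cap));

    for (int r = 0; r < R; ++r)                                    // node capacity 1
        for (int c = 0; c < C; ++c)
            cap[in(r, c)][out(r, c)] = 1;

    const int dr[] = {0, 0, 1, -1}, dc[] = {1, -1, 0, 0};          // mesh links, both directions
    for (int r = 0; r < R; ++r)
        for (int c = 0; c < C; ++c)
            for (int k = 0; k < 4; ++k) {
                int nr = r + dr[k], nc = c + dc[k];
                if (nr >= 0 && nr < R && nc >= 0 && nc < C)
                    cap[out(r, c)][in(nr, nc)] = 1;
            }

    std::vector<std::pair<int, int>> faulty = {{0, 1}, {2, 2}};    // assumed faulty PEs
    for (auto& f : faulty) cap[S][in(f.first, f.second)] = 1;      // S -> faulty PEs
    for (int r = 0; r < R; ++r) cap[out(r, C - 1)][T] = 1;         // spare column -> T

    std::printf("repairable faulty PEs: %d of %zu\n", maxflow(S, T), faulty.size());
    return 0;
}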


Solving the above problem yields the maximum flow of the graph and each individual flow. The maximum flow indicates how many faulty PEs can be repaired, and each unit flow represents a repair path. From the repair paths, the virtual topology can be obtained, as shown in Fig. 2.26b; the virtual number of each PE is stored in its router and consulted at run time. If the maximum flow is not equal to the number of faulty PEs, some faults cannot be repaired; that is, under the current fault pattern, a set of repair paths covering all faulty PEs cannot be found. Figure 2.27 shows a fault pattern that cannot be completely repaired: there are 6 faulty PEs and 6 spare PEs, one of which is itself faulty. In this pattern only 3 faults can be repaired, and the corresponding repair paths are drawn in the figure.

3. Theoretical analysis of reconfiguration time and performance

Finding the maximum flow between the source and target ensures that as many faulty PEs as possible are repaired by spare PEs, improving the reliability of the network. The time complexity of the maximum flow problem is polynomial.

Fig. 2.27 Example of a fault pattern that cannot be completely repaired

Generally, the time complexity is O(V^3) [53], where V is the number of nodes. For medium-sized sparse graphs with integer edge capacities, the time complexity can be reduced to O(VE log U), where E is the number of edges and U is the upper limit of the edge capacity [54]. The execution time of the traditional simulated annealing (SA) algorithm, by contrast, is uncertain and is not bounded by a polynomial [53]; as the problem scale increases, its time cost becomes very high. Since the method proposed here runs in polynomial time, it can be executed once after chip fabrication or periodically at runtime without incurring a large reconfiguration time cost.

From the perspective of the NoC, a model is needed to evaluate the performance degradation of different virtual topologies. A metric called the distance factor (DF) is introduced in Ref. [51]. In an irregular topology the average distance between nodes is greater than in the reference topology, so the delay is longer; DF describes the average distance between virtual adjacent nodes and therefore also reflects the average delay and throughput of the network. The distance factor between nodes m and n is defined as their physical distance (DF_mn = Hops_mn). The distance factor DF_n of node n is defined as the average distance factor between node n and its k virtual adjacent nodes:

\[ \mathrm{DF}_n = \frac{1}{k} \sum_{m=1}^{k} \mathrm{DF}_{mn} \quad (2.6) \]

The distance factor DF of a topology is defined as the average of DF_n over all N nodes in the topology:

\[ \mathrm{DF} = \frac{1}{N} \sum_{n=1}^{N} \mathrm{DF}_n \quad (2.7) \]
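A minimal sketch of this computation, assuming each node records its physical mesh coordinates and the indices of its virtual neighbours (the 2 × 2 example data are made up for illustration):

#include <cstdio>
#include <cstdlib>
#include <vector>

struct Node {
    int x, y;                      // physical coordinates of the PE
    std::vector<int> neighbours;   // indices of its virtual adjacent nodes
};

// DF_mn: physical hop distance between two nodes (Manhattan distance on a mesh).
static int hops(const Node& a, const Node& b) {
    return std::abs(a.x - b.x) + std::abs(a.y - b.y);
}

// DF of the whole topology: average over all nodes of the average distance
// to their virtual neighbours, per Eqs. (2.6) and (2.7).
static double distance_factor(const std::vector<Node>& nodes) {
    double sum = 0.0;
    for (const Node& n : nodes) {
        double dfn = 0.0;
        for (int m : n.neighbours) dfn += hops(n, nodes[m]);
        sum += dfn / n.neighbours.size();      // Eq. (2.6)
    }
    return sum / nodes.size();                 // Eq. (2.7)
}

int main() {
    // A 2 x 2 virtual mesh laid out on physically adjacent PEs -> DF = 1.
    std::vector<Node> nodes = {
        {0, 0, {1, 2}}, {0, 1, {0, 3}},
        {1, 0, {0, 3}}, {1, 1, {1, 2}},
    };
    std::printf("DF = %.3f\n", distance_factor(nodes));
    return 0;
}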

Obviously, the DF of the reference topology is the smallest, because every node is physically adjacent to its virtual neighbours; for example, in a two-dimensional mesh, DF = 1, indicating that the distance between each pair of virtual adjacent nodes is 1. A smaller DF means that the transmission delay between virtual neighbouring nodes is smaller and that they are physically close to each other.

The performance of the MF repair algorithm is evaluated by experiments. Two repair schemes are used as comparison baselines, namely the shifting scheme [55] and the switching scheme [56], denoted "N:1" and "N:2" respectively; Fig. 2.28 shows examples of both. The shifting method adds a column of spare PEs on the right border of the mesh: if there is a faulty PE in a row, the PEs shift from the faulty PE onward and the row is repaired with a spare PE, so one faulty PE per row can be tolerated. The switching method adds a column of spare PEs on both the left and the right border of the mesh, so each row can tolerate two faults. Accordingly, two types of topology are used in the experiments, N × (N + 1) and N × (N + 2) meshes, corresponding to these two repair methods.


The MF repair algorithm uses exactly the same spare hardware resources: an N × (N + 1) topology is used for both methods when MF is compared with N:1, and an N × (N + 2) topology when MF is compared with N:2. Under the same topology and fault pattern, each method may produce a different virtual topology, so the performance may differ. Figure 2.28d, e shows examples of the different virtual topologies obtained by the N:2 and MF repair algorithms.

Two mesh sizes, N = 4 and N = 8, are used in the simulation, with PE failure rates ranging from 1 to 10%, and 10,000 random fault patterns are generated for each size. For some fault patterns MF can fix all faults while N:1 and N:2 cannot; for example, if there are more than two faults in a row, MF can still obtain a functionally correct virtual topology but N:1 and N:2 cannot. This situation is excluded from the following experiments because DF cannot be calculated in that case: DF is computed only when both methods under comparison can completely repair the network.

Figure 2.29 shows the DF comparison results. DF increases as the failure rate increases. N:1 and MF have the same DF, because for fault patterns that both methods can repair, the solution obtained by N:1 is exactly the same as that obtained by MF; in other words, the N:1 solutions are a subset of the MF solutions. The DF of MF is smaller than that of N:2, since N:2 can only repair a faulty PE with a spare PE in the same row and does not take the characteristics of the repair path into account. Figure 2.28d, e gives the results of a 4 × 6 faulty mesh repaired with the N:2 and MF methods, respectively. As shown in Fig. 2.28c, PE No. 10 is faulty. Using the N:2 method the repair path can only be 9-8-7, whereas MF is more flexible and the repair path can be 4-5-6; the DF of N:2 is 1.3958, while the DF of MF is 1.2187. MF is based on breadth-first search and finds the shortest repair paths as far as possible, an advantage that becomes more pronounced as the mesh size increases. In the comparison between N:2 and MF, any solution obtained by N:2 for a given fault pattern can also be obtained by MF, which shows that MF never produces a solution worse than N:2. According to the definition of DF, the average distance between virtual adjacent nodes obtained by MF is the same as that of N:1 and smaller than that of N:2; it can therefore be predicted that the throughput and delay obtained by MF are better than those of the two comparison schemes, and this prediction is confirmed by the throughput and latency results measured in the simulations.

4. Minimum-cost MF approach

In the discussion above, all elements in the system are considered identical. In real applications, the amount of data transmitted between PEs and the load on each link may differ, and these differences should be given different weights to model the system more accurately. To address this, an improved minimum-cost MF method [57] is proposed to obtain the maximum flow while reducing the cost under the given constraints. The construction of the directed flow graph is similar to MF: there are a super source point and a super target point, and the capacity of each node and edge is limited to 1. In addition, a new variable, cost, is introduced into the graph. Cost can be defined on nodes or edges and can model different metrics in the network, such as link delay, hardware cost and production cost. The problem is then solved by the minimum-cost maximum flow algorithm, whose time complexity is also polynomial [54].


Fig. 2.28 Example of reconfiguration using N:1, N:2, and MF: (a) 4 × 5 network with faulty PEs; (b) virtual topology reconstructed using N:1; (c) 4 × 6 mesh with faulty PEs; (d) virtual topology reconstructed using N:2; (e) virtual topology reconstructed using MF

The experimental results show that, with the same hardware resources, the MF method improves the repair rate by 50%. Compared with a fault-free system, the throughput decreases by only 2.5% and the delay increases by less than 4%. The MF method makes maximum use of the redundant hardware resources, and its reliability is higher than that of previous methods.


Fig. 2.29 DF comparison results of MF, N:1 and N:2 for meshes of different sizes and different failure rates

The reconfiguration time of this method is polynomial, so it can be used for real-time reconfiguration. In addition, the concept of virtual topology greatly reduces the burden on the operating system. By transforming the topology reconfiguration problem into a network flow problem in graph theory, the reliability of the reconfigurable system is improved. The idea of using the maximum flow algorithm to improve the utilization of redundant hardware resources is not limited to two-dimensional mesh topologies; it can be applied to other topologies as well, and the locations of faulty and spare PEs can be arbitrary. The core idea is the same: all faulty PEs are connected to a super source point and all spare PEs to a super target point, forming a directed graph, and the maximum flow problem is then solved on this graph.

2.2.2 Multi-objective Mapping Optimization Method for Reconfigurable Network-on-Chip

To meet the requirements of SDC dynamic reconfiguration, the best mapping of the network-on-chip needs to be determined by an efficient mapping method guided by an accurate and flexible evaluation model. This section therefore discusses three aspects: modeling analysis, the mapping method and experimental verification [58].

1. Modeling analysis

1) Background

In this research, the target application is usually represented by the application characteristic graph (APCG) shown in Fig. 2.30a. The APCG is a bidirectional graph in which each node corresponds to an IP and each edge connecting two nodes represents the communication between those IPs.

Fig. 2.30 Examples of the target application characteristic graph and the NoC architecture: (a) target application characteristic graph; (b) NoC architecture

The traffic between IPs is defined by the weight of the edges, as shown by the number on each edge. Figure 2.30b is an example of a network-on-chip architecture, which is also bidirectional: each node represents a router and a PE, each edge represents the link between routers, and the link between every two nodes is a bidirectional line. If the total number of communication links of the network-on-chip is defined as N, then N = 24 in Fig. 2.30b. The application mapping discussed here is to find the best node and edge correspondence between the APCG and the NoC architecture.

The errors occurring in an SDC computing array can usually be classified as hard errors or soft errors, and they may occur in a router, a PE or a communication path. PEs affected by hard or soft errors can be replaced with spare PEs [59], and the optimal mapping then needs to be found again by remapping after the replacement. Soft errors in routers and communication paths are handled by waiting for the soft error to disappear, but their effects on reliability, communication energy consumption and performance are considered in the modeling process. A soft error in a router may affect one or more links connected to it, causing data transmission to the destination router to fail. A worst-case assumption is adopted in this study: any error in a router is treated as if all links connected to that router were faulty. In this way router errors can be attributed to link errors, and the errors that need to be considered in the modeling can be simplified to soft errors on links. The probability of link errors in the NoC may be affected by many factors, such as the temperature of the chip near the links and errors in adjacent modules. Specific models of how different factors affect link errors are not discussed here; instead, the failure probability of each link is treated separately in the multi-objective optimization modeling, so the method can be applied to any link error model. When n of the N links are faulty, considering all possible locations of the faulty links there are a total of M = C_N^n error scenarios.


Since the corresponding reliability, communication energy consumption and performance in these M cases can be very different, all M error cases are taken into account in the modeling process.

2) Reliability model

Although the actual way of dealing with soft errors on links in the simulation is to wait for the soft error to disappear before transmitting data, the impact of link errors on communication paths must still be considered in the reliability model, because during the cycles in which an error is present, data cannot be transmitted over the faulty link. The reliability model therefore has to distinguish communication paths with errors from those without. It is built in two steps: ➀ find all feasible communication paths according to the routing algorithm; ➁ judge, given the link errors, whether there is a communication path between the source and the destination that can successfully deliver the data, and assign a binary value to the reliability. When n links are faulty, the reliability of the ith error scenario from source S to destination D is defined as R^{SD}_{i,n}.

For example, Fig. 2.31 shows two different fault scenarios with three faulty links, where R4 is the source and R9 is the destination. In Fig. 2.31a, errors occur on links 6, 14 and 21. For a deterministic routing algorithm the communication path is fixed in advance: with X-Y routing the path is R4-R5-R6-R9, and with Y-X routing it is R4-R7-R8-R9. In the error case of Fig. 2.31a, data cannot be transmitted from R4 to R9 under X-Y routing, that is, R^{49}_{1,3} = 0; under Y-X routing the data can be transmitted successfully, so R^{49}_{1,3} = 1. For adaptive routing there may be multiple feasible transmission paths between source and destination; for example, with shortest-distance adaptive routing there are three communication paths between R4 and R9, namely R4-R5-R6-R9, R4-R5-R8-R9 and R4-R7-R8-R9. In the faulty case of Fig. 2.31a, although data cannot be transmitted through R4-R5-R6-R9, the other two paths remain feasible, so whichever path is selected, the reliability from R4 to R9 in this error case is R^{49}_{1,3} = 1. Figure 2.31b shows the second error scenario, with errors on links 6, 13 and 23; here data cannot be transmitted from R4 to R9 regardless of the routing algorithm, i.e. R^{49}_{2,3} = 0.

When considering the reliability of the communication path between source S and destination D, all M = C_N^n fault scenarios with n faulty links are taken into account, as shown in Eq. (2.8):

\[ R^{SD} = \sum_{n=0}^{N} \sum_{i=1}^{M} R^{SD}_{i,n} P_{I,n} \quad (2.8) \]

where P_{I,n} is the failure probability of the ith fault scenario when n links are faulty, as given by Eq. (2.9), and I is the set of n faulty links in the ith error case.

Fig. 2.31 Example of two different error conditions when errors occur in three communication paths

\[ P_{I,n} = \prod_{j=1,\, j \in I}^{N} p_j \times \prod_{j=1,\, j \notin I}^{N} (1 - p_j) \quad (2.9) \]

where p_j is the failure probability of link j, so the failure probability of each link is allowed to be different.
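For intuition, the following sketch evaluates Eqs. (2.8) and (2.9) by brute force, enumerating every fault set I over a handful of links. The link probabilities and the reachability predicate (which stands in for the routing-algorithm check described above) are toy assumptions, and a real evaluation would restrict or sample the enumeration.

#include <cstdio>
#include <functional>
#include <vector>

// R^{SD} = sum over all fault sets I of [reachable(I)] * P_{I,n}, Eqs. (2.8)-(2.9).
double expected_reliability(const std::vector<double>& p,
                            const std::function<bool(unsigned)>& reachable) {
    const unsigned N = p.size();
    double r = 0.0;
    for (unsigned I = 0; I < (1u << N); ++I) {       // bitmask I encodes the faulty-link set
        double prob = 1.0;                           // P_{I,n}, Eq. (2.9)
        for (unsigned j = 0; j < N; ++j)
            prob *= ((I >> j) & 1u) ? p[j] : (1.0 - p[j]);
        if (reachable(I)) r += prob;                 // R_{i,n} is 0/1, Eq. (2.8)
    }
    return r;
}

int main() {
    // Toy example: two disjoint 2-link routes between S and D; transmission
    // succeeds if at least one route has both of its links fault-free.
    std::vector<double> p = {0.01, 0.01, 0.02, 0.02};
    auto reachable = [](unsigned I) {
        bool routeA = !(I & 1u) && !(I & 2u);
        bool routeB = !(I & 4u) && !(I & 8u);
        return routeA || routeB;
    };
    std::printf("R_SD = %.6f\n", expected_reliability(p, reachable));
    return 0;
}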

3) Communication energy consumption model

The energy consumption of a practical application on a network-on-chip mainly includes three parts: the computation energy of the PEs, the static energy of the network-on-chip, and the communication energy of the data transmitted over the network-on-chip. The PEs here are assumed to be homogeneous, so when the same application is run with different mapping methods the total computation energy remains the same; this first part is therefore not considered in the modeling. The static energy consumption is usually affected by the fabrication process, working temperature and working voltage, of which only the working temperature differs slightly between mapping methods. Previous research showed that different mappings cause temperature changes of 1-2 °C [60], and HSpice simulation results show that such a temperature change causes at most a 9.44% change in static energy consumption. At the 14 nm process node, static energy accounts for about 5-20% of the total energy consumption [61], so a 9.44% change in static energy corresponds to only 0.5-1.9% of the total; the effect of different mapping methods on working temperature, and hence on the static energy of the second part, is therefore not considered either. Different mapping methods have the greatest impact on the communication energy of the third part, and the focus here is on the energy difference between mapping methods rather than on absolute values.


Therefore, the modeling only considers the communication energy consumption of the data transmitted over the network-on-chip (the third part). The communication energy is calculated using the Bit Energy model [62]: the energy consumed by transmitting data between source router S and destination router D is computed from the energy consumed by a single bit passing through a router, E_{Rbit}, and through an interconnection wire, E_{Lbit}. When n interconnection wires are faulty, the communication energy in the ith case is given by Eq. (2.10):

\[ E^{SD}_{i,n} = V^{SD} \left[ E_{Rbit} \left( d^{SD}_{i,n} + 1 \right) + E_{Lbit}\, d^{SD}_{i,n} \right] \quad (2.10) \]

where V^{SD} is the communication volume from S to D, and d^{SD}_{i,n} is the number of interconnection wires on the communication path in the ith error case when n interconnection wires are faulty. As in the reliability model, the energy consumption under all M error cases is taken into account, and the energy consumption of the communication path from S to D is given by Eq. (2.11):

\[ E^{SD} = \sum_{n=0}^{N} \sum_{i=1}^{M} E^{SD}_{i,n} P_{I,n} \quad (2.11) \]

where P_{I,n}, as given by Eq. (2.9), is the failure probability of the ith case when n interconnection wires are faulty. For a deterministic routing algorithm, the communication path between the source and destination routers is determined in advance, so the number of routers and interconnection wires the data passes through is also known in advance. For an adaptive routing algorithm, the communication path is updated at run time along with the data transmission, so d^{SD}_{i,n} in Eq. (2.10) is also updated at run time. The communication energy consumption model is therefore suitable for both deterministic and adaptive routing algorithms.

4) Performance model

The performance of a communication network is mainly evaluated from two aspects: delay and throughput. The performance modeling here analyzes the delay quantitatively, while the throughput is analyzed qualitatively through the bandwidth limit. The delay can be divided into three parts: ➀ the time to transmit data from the source router to the destination router when there is no error and no blocking on the interconnection wires; ➁ the time spent waiting for soft errors to disappear; ➂ the waiting time caused by blocking. The performance model uses wormhole switching, and the delay of body flits and tail flits is the same as that of the head flit, so to simplify the model only the head flit delay is modeled, defined as the time interval between the creation of the head flit at the source node and its reception at the destination node. The delay of the first part is the time required to transmit the head flit from the source node to the destination node, as shown in Eq. (2.12):

\[ C^{SD}_{i,n} = t_w d^{SD}_{i,n} + t_r \left( d^{SD}_{i,n} + 1 \right) \quad (2.12) \]

where t_w and t_r are the times required to transmit a flit through an interconnection wire and through a router, respectively. In this study, when the head flit encounters a faulty interconnection wire, it waits for a clock cycle and retries the transmission until the soft error disappears and the transmission succeeds. Since the number of cycles each interconnection wire needs to recover from a soft error cannot be estimated exactly, the average waiting time is used to represent the waiting time caused by the jth interconnection wire, as shown in Eq. (2.13):

\[ F_j = \lim_{T \to \infty} \left( p_j + 2 p_j^2 + 3 p_j^3 + \cdots + T p_j^T \right) = \frac{p_j}{(1 - p_j)^2} \quad (2.13) \]
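The closed form is the standard power-series identity (stated here for completeness; it is not taken from Ref. [58]):

\[ \sum_{k=1}^{\infty} k p_j^k = p_j \frac{\mathrm{d}}{\mathrm{d}p_j} \sum_{k=0}^{\infty} p_j^k = p_j \frac{\mathrm{d}}{\mathrm{d}p_j} \left( \frac{1}{1 - p_j} \right) = \frac{p_j}{(1 - p_j)^2}, \qquad 0 \le p_j < 1 . \]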

where p_j is the failure probability of the jth interconnection wire and T is the number of clock cycles the head flit has to wait. As with the communication energy consumption in Eq. (2.10), Eqs. (2.12) and (2.13) also apply to adaptive routing: when the communication path is updated according to the system state, the length of the data transmission path between the source and destination nodes is updated accordingly.

During actual operation of the application, multiple head flits may need to pass through one router at the same time, which causes the blocking delay of the third part. A first-in-first-served queueing model is adopted to calculate this delay, with each router regarded as a service desk. For deterministic routing, once the source and destination nodes are fixed the transmission path of the head flit is fixed, so when blocking occurs the waiting head flit can only join one queue, i.e., only one service desk can serve the data. For adaptive routing, the communication path is updated with the system state, so the head flit may be forwarded to different routers depending on the network state, which means several service desks may serve it. Let the number of service desks be m. Each router also has vc virtual channels, which can likewise act as service desks in the queueing model. Therefore a G/G/m-FIFO queue is used to calculate the waiting time caused by blocking; in this queue, the inter-arrival times and service times are treated as independent general random distributions. Using the Allen-Cunneen approximation [63], the waiting time from the uth input port to the vth output port of the Kth router, W^K_{u→v}, is given by Eqs. (2.14)-(2.16):

W0K

W0K )( ) ∑ ∑ K K 1 − Ux=u ρx→v 1 − Ux=u+1 ρx→v

C 2A K + C S2 K Pm+vc u→v v × × ρvK = 2(m + vc)ρ μvK

(2.14)

(2.15)

118

2 Hardware Security and Reliability

⎧ Pm + cv =

ρ m+vc +ρ , 2 m+vc+1

ρ

2

,

ρ > 0.7 ρ≤0.7

(2.16)

where, C 2A K is the variation coefficient of the arrival queue of router K, and the arrival u→v queue of each router in the arrival network is determined by the target application. C S2 K is the variation coefficient of the service queue of router K. The service time at v the vth output port of the Kth router consists of the following three parts: ➀ the mere time of transmitting a flit from the Kth router to the K + 1th router; ➁ the waiting time for data allocation from the uth input port to the vth output port of the K + 1th router; ➂ waiting time for the vth output port of the K + 1th router to become idle, that is, the service time of the vth output port of the K + 1th router. Since each output port of the K + 1th router has an important impact on the service time of the vth output port of route K, correlation tree and recursive algorithm are used for computation. Taking the K + 1th router as the root node, the next router will be added to the correlation tree only when the next router communicates with the current router, otherwise the router will be abandoned. The addition of correlation tree nodes will continue until the router only communicates with the processing unit and has no communication with other routers. When the correlation tree is established, the service time of the leaf node is calculated first, and then the service time of the parent node is backtracked upward using a recursive algorithm until the service time of the root node is calculated. The average service time of the vth output port of the Kth router is SvK , the second moment of the service time distribution of the output port v ( )2 of router K is SvK , and the service array variation coefficient of Server K is C S2 K , v as shown in Eq. (2.17): SvK =

V ∑ λK

u→x

x=1

(

SvK

)2

=

λxK

V ∑ λK x=1

K +1 (tw + tr + Wu→x + SxK +1 )

u→x λxK

)2 ( K +1 tw + tr + Wu→x + SxK +1 (

C S2vK

=(

SvK SvK

(2.17)

)2 )2 − 1

The parameters used in Eqs. (2.14)–(2.17) are determined by the actual target application and communication network, and the relevant definitions are shown in Table 2.2. To sum up, when the ith error occurs in n interconnection lines, the head flit delay between source node S and target node D is shown in Eq. (2.18):

2.2 Reliability

119

Table 2.2 Definitions of relevant parameters in delay model Parameter Definition K ρx→v

The ratio of the time when the output port v of router K is occupied by the input port X

ρvK , ρ

ρvK =

K λx→v

Average flit information input rate (flit/cycle)

U ∑ x=1

K λx→v μvK , ρ =

V ∑

ρvK

v=1

μvK

Average service rate (cycle/flit)

U

The total number of input ports of a router

V

The total number of output ports of a router

SD di,n +1

SD

SD SD L i,n = Ci,n +

di,n ∑

FJ (t) +

t=1



(t) WUK(t)→V (t)

(2.18)

t=1

where J(t) is a function for computing the serial number of the tth interconnection wire on the communication path; K(t) is a function for computing the serial number of the tth router on the communication path; U(t) and V (t) are functions for computing the input port and output port numbers of the tth router, respectively. Similarly, when various error conditions are taken into account in the total delay computation, the total delay is shown in Eq. (2.19): L SD =

N ∑ M ∑

SD L i,n PI,n

(2.19)

n=0 i=1

where, PI,n is the failure probability of the ith error condition when n interconnection wires occur. The throughput is considered based on the quantitative analysis of bandwidth constraints. As shown in Eq. (2.20), bandwidth constraints can balance the traffic of each node as much as possible to ensure throughput. ∑[

] f (Pmap(S),map(D) , l) × V S D ≤ B(l)

(2.20)

S,D

where, l is the serial number of the interconnect, and B(l) is the maximum bandwidth of the lth interconnection wire. Pmap(S),map(D) is the communication path mapped from source node S to target node D. The binary function shown in Eq. (2.21) f (Pmap(S),map(D) , l) represents whether the lth interconnection wire is on the corresponding communication path. ⎧ f (Pmap(S),map(D) , l) =

1, 0,

l ∈ Pmap(S),map(D) l∈ / Pmap(S),map(D)

(2.21)

120

2 Hardware Security and Reliability

(5) Reliability efficiency model The key to find the optimal mapping method is to find the best balance among reliability, energy consumption and performance, but in fact, the relationship between the three is usually mutually restricted. For example, the increase in reliability is likely to be at the cost of increased energy consumption and reduced performance. Therefore, reliability efficiency model is proposed to measure the relationship among reliability, energy consumption and performance. Reliability efficiency (R/EL) is defined as the reliability benefit brought by a unit of power-delay product. In different application scenarios, the corresponding reliability, energy consumption and performance requirements may be very different. For example, the primary requirement for most mobile devices is low power consumption, while the primary requirement for devices used for space exploration is high reliability. Therefore, it is necessary to introduce a weight parameter into the reliability efficiency model to distinguish the importance of the three. After introducing the weight parameter, the reliability efficiency model from the source node to the target node is shown in Eq. (2.22): SD Reff =

(R S D − minre)α 1 + E SD × L SD

(2.22)

where, α is the weight parameter; minre is the minimum reliability requirement of the system for the communication path. Weight parameter α is the weight used to adjust the relative importance of reliability and power-delay product. The value of minre depends on many factors, such as the surrounding environment of the networkon-chip, application scenarios, system requirements and so on. For example, the reliability requirement of a parallel supercomputer only allows its interconnection architecture to lose packets at most once within 10,000 working hours [12], so that the corresponding minre value can be calculated according to the failure probability. In practical application, users can determine the specific value of minre in a more systematic way according to different needs. For a specific mapping mode, the total reliability efficiency is defined by Eq. (2.23): Reff =



SD Reff × Y SD

(2.23)

S,D

where, Y S D is a binary function used to indicate whether there is data transmission between source node S and target node D. If there is communication between S and D, then Y S D = 1, otherwise Y S D = 0. 2. Mapping method Network-on-chip can be regarded as a complete graph, and the method of finding the optimal mapping can be abstracted as finding a path with the least cost that can cover all nodes in the complete graph. In this way, the problem of finding the optimal mapping can be transformed into the traveling salesman problem. It

2.2 Reliability

121

has been proved that the traveling salesman problem is a non-deterministic polynomial complete (NPC) problem. In most cases, the computational complexity of NPC problems is very high. Classical algorithms such as simulated annealing and branch and bound (BB) methods are used to reduce the computational complexity of such problems. The main goal of this section is to reduce the computational complexity as much as possible to meet the requirements of dynamic reconfiguration on the premise of finding the optimal mapping. To further reduce the computational complexity, a priority and compensation factor oriented branch and bound (PCBB) mapping method is proposed. Before introducing the PCBB mapping method, a brief introduction of the BB mapping method is given. 1) BB mapping method BB mapping method finds the optimal solution of an objective function by establishing a search tree. In the process of finding the optimal mapping, the objective function is the reliability efficiency model established earlier. The process of finding the optimal solution can be realized by BB mapping method. Figure 2.32 is a simple example of a search tree, showing the process of mapping an application with three IPs to NoC. Each box in the figure represents a possible mapping method and is also a node of the search tree. The root node of the search tree is the starting point of the search tree and also the beginning of the mapping. The three spaces in the box indicate that no IPs have been mapped to the NoC. As the IP is mapped to the first node of NoC, a new search tree node is generated. The intermediate node located on the branch of the search tree represents the partial mapping that some IPs have been mapped to the NoC. The number in the box represents the sequence number of the IP mapped to the NoC, and the position corresponding to the number and space represents the sequence number of the node on the NoC. For example, the intermediate node “23_” represents that IP2 and IP3 are mapped to the first and second NoC nodes respectively, while IP1 has not been mapped. After the leaf node is generated, that is, all IPs are successfully mapped to the corresponding nodes on the NoC, the establishment of the search tree is completed. In the process of establishing the search tree, to reduce the computational complexity, only the nodes that may become the optimal solution will be established and retained. For any intermediate node, if the maximum benefit of the node is less than the maximum benefit of the current optimal solution, the node cannot become the optimal solution and will be deleted directly. When these intermediate nodes are deleted, the child nodes of this node will not be generated, which greatly reduces the complexity of search. 2) PCBB mapping method From the BB mapping method, it can be seen that the intermediate node closer to the root node has more child nodes, which means higher computational complexity. If there is a non-optimal mapping in this part of intermediate nodes, the node can be identified and deleted as soon as possible, then the overhead of the whole search algorithm will be greatly reduced. Therefore, the priority allocation method can be

122

2 Hardware Security and Reliability

Root node

2__ 23_ Leaf node

___ 1__

3__

21_

13_

12_

231

132

123

Fig. 2.32 Example of mapping an application with three IPs to a search tree on NoC

used to identify and delete these non-optimal mapping intermediate nodes close to the root node as soon as possible, so as to reduce the computational complexity of the whole search process. The IP is prioritized according to the total IP communication volume of the target application. The IP with the largest communication volume has the highest priority. In the mapping process, the mapping order of IPs is from high to low according to its priority. In the BB mapping method, when the maximum benefit of the intermediate node is less than the maximum benefit of the current optimal solution, the intermediate node will be deleted, where the maximum benefit is calculated based on the reliability benefit model proposed earlier. Accurately computing the reliability benefit can help us find a mapping model closer to the actual optimal solution, but it also means higher computational complexity. Therefore, to find a compromise between accuracy and computational complexity, a compensation factor β is introduced to the criteria for deleting intermediate nodes, as shown in Eq. (2.24): UB < max{Reff }/(1 + β)

(2.24)

where, UB is the maximum value of reliability benefit, including the following three parts: ➀ reliability efficiency UBm,m between mapped IPs; ➁ reliability efficiency UBm,n between mapped IP and unmapped IP; ➂ reliability efficiency UBn,n between unmapped IPs. Where, UBm,m can be directly calculated by Eq. (2.23), while UBm,n and UBn,n can be estimated by mapping the remaining IPs through greedy algorithm respectively. Although the estimated value will be slightly higher than the real value, this will not affect the deletion criteria in Eq. (2.24). Because if the estimated maximum reliability benefit of the intermediate node is still less than the maximum value of the current optimal solution, the real reliability benefit must also satisfy Eq. (2.24). Otherwise, the intermediate node will be generated as one of the comparison objects for the next mapping. 3) Remapping process The previous discussion is to find the best mapping mode considering reliability, communication energy consumption and performance when the target application is mapped to NoC for the first time. Here, another aspect of the method of mapping


3) Remapping process

The previous discussion focused on finding the best mapping, considering reliability, communication energy consumption and performance, when the target application is mapped to the NoC for the first time. Here, another aspect of mapping the target application to the NoC is considered: remapping. Remapping needs to be performed dynamically when an application is running on the NoC and an error occurs in an interconnect, router or processing unit, or when the application requirements change. In other words, when the NoC is reconfigured at runtime, the best mapping needs to be updated dynamically. To improve the efficiency of remapping, when a demand for recalculating the optimal mapping arises, such as when an error occurs, the mapping used before the error is first stored in memory as the current optimal mapping. In this way, the overhead of computing many intermediate nodes from scratch is saved, which speeds up the remapping computation. At the same time, the reliability efficiency of the current mapping is calculated as the maximum value of the current reliability benefit. Subsequent candidate mappings are then compared, generated or discarded step by step according to the PCBB mapping method described above. The code for the whole remapping process is shown in Fig. 2.33.

    if (need reconfiguration) {
        Read(LastMap);                       // reuse the mapping used before the error
        MaxGain = LastMap->gain;
        MaxUpperBound = LastMap->upperbound;
    } else {
        MaxGain = -1;
        MaxUpperBound = -1;
    }
    initialization;
    while (Q is not empty) {
        Establish(child);                    // generate the next candidate partial mapping
        if (child->gain < MaxGain or child->upperbound < MaxGain)
            Delete(child);                   // prune (reconstructed branch, cf. Eq. 2.24)
        else {
            if (child->gain > MaxGain) MaxGain = child->gain;
            if (child->upperbound > MaxUpperBound) MaxUpperBound = child->upperbound;
        }
    }

Fig. 2.33 Remapping process code


4) Analysis of computational complexity

The PCBB mapping method reduces the computational complexity through priority allocation on the basis of the BB mapping method, and uses the compensation factor to balance computational overhead and accuracy. Here, the complexity of finding the optimal mapping is analyzed from two aspects: ➀ the computational complexity of computing the reliability efficiency Reff of each intermediate node; ➁ the computational overhead caused by the comparison and deletion decisions for nodes in the mapping process. In the actual implementation, much of the intermediate data (e.g., the routing table and the communication energy consumption) can be stored in memory in advance, so similar basic operations are required to compute the reliability efficiency of each intermediate node. The computational complexity of this part scales with the number of NoC nodes NNoC as O(NNoC^2.5). Since it is impossible to accurately estimate the number of deleted nodes on each layer of the branch, it is assumed that the number of remaining intermediate nodes on each layer is q, and the total computational complexity is shown in Eq. (2.25):

CC = O(NNoC^2.5 × q^(NNoC/2 − 1))    (2.25)

3. Experimental verification

The verification experiment covers three aspects: the accuracy of the optimal mapping method, reconfigurability, and the efficiency of the solution method. A software platform written in C++ is used to find the optimal mapping and to measure the time required for the search process. The optimal mapping method is evaluated by mapping the target applications to the network-on-chip for layout-level simulation. A 7 × 7 2-D Mesh network-on-chip is designed for the experiment. The network-on-chip is first implemented in Verilog at the register transfer level and then synthesized with the TSMC 65 nm process to obtain the layout after placement and routing. The area of the NoC is about 1.14 × 0.96 mm2, with a working frequency of up to 200 MHz. The designed network-on-chip is also verified on a Xilinx XC7V585TFFG1157-3 FPGA, where it occupies only 13.45% of the total FPGA resources.

In the reliability efficiency model defined in Eq. (2.22), the models of reliability, energy consumption and delay are established on the premise that the failure probabilities of the interconnection wires are independent of each other. This means that interconnection wire error models developed now or even in the future can be embedded in the method proposed above. A simple interconnection wire error model is used to show that the method in this section is also applicable when the failure probabilities of interconnection wires are different. The interconnection wires are divided into two groups according to their communication volume: the group with larger communication volume has a higher failure probability ph, while the group with smaller communication volume has a lower failure probability pl. In all simulations, errors are randomly injected into the on-chip interconnection network with these two failure probabilities according to the location of each interconnection wire. In the experiment, it is assumed that there is no need to place special emphasis on reliability, communication energy consumption or performance, so the weight coefficient α is 1 and the minimum reliability constant minre is 0. Reliability in the simulation is measured by the probability of successfully transmitting a flit from the source node to the target node, and the communication energy consumption is obtained directly from the layout-level simulation results. In all simulations, a packet consists of 8 flits, including 1 head flit, 6 body flits and 1 tail flit. The delay is determined by the delay of the head flit, that is, the time interval between when the head flit is generated at the source node and when it is received by the target node. The throughput is obtained from the average number of flits transmitted by each router in the simulation.
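As a toy illustration of how these metrics can be evaluated in such a simulation environment (the names below are hypothetical and not part of the C++ platform mentioned above):

    // Toy illustration of the simulation metrics: a packet of 8 flits
    // (1 head, 6 body, 1 tail) traverses a path of interconnection wires,
    // each failing independently with its own probability.
    #include <vector>

    constexpr int kFlitsPerPacket = 8;   // 1 head + 6 body + 1 tail

    // Probability that a single flit crosses every wire on its path intact,
    // assuming independent wire failures as in the reliability model.
    double flitDeliveryProbability(const std::vector<double>& wireFailureProb) {
        double p = 1.0;
        for (double pf : wireFailureProb) p *= (1.0 - pf);
        return p;
    }

    // Probability that the whole 8-flit packet is delivered intact.
    double packetDeliveryProbability(const std::vector<double>& wireFailureProb) {
        double p = flitDeliveryProbability(wireFailureProb);
        double result = 1.0;
        for (int i = 0; i < kFlitsPerPacket; ++i) result *= p;
        return result;
    }

    // Latency is taken as the head-flit latency: the interval between the cycle
    // in which the head flit is injected at the source and the cycle in which
    // it is received at the target node.
    long headFlitLatency(long injectCycle, long receiveCycle) {
        return receiveCycle - injectCycle;
    }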

Table 2.3 Target applications and specific information

Application name | IP number | Min/Max volumes   | NoC size
MPEG4            | 9         | 1/942             | 9
Telecom          | 16        | 11/71             | 16
Ami25            | 25        | 1/4               | 25
Ami49            | 49        | 1/14              | 49
H264             | 14        | 3280/124,417,508  | 16
HEVC             | 16        | 697/1,087,166     | 16
Freqmine         | 12        | 12/6174           | 16
Swaption         | 15        | 145/47,726,417    | 16

1) Accuracy verification

To verify the optimal mapping method proposed in this section, the PCBB, BB [64] and SA mapping methods are compared from three aspects: reliability, communication energy and performance. A total of eight applications are selected, as shown in Table 2.3. The first four are the same applications used with the BB mapping method [64] and are commonly used in practice, while the last four are applications with higher communication volumes, chosen to match the needs of increasingly complex reconfigurable computing arrays based on the network-on-chip. H264 and HEVC are two video coding standards, while Freqmine and Swaption are selected from the Princeton Application Repository for Shared-Memory Computers (PARSEC) [65]. The number of IPs for each application is determined according to the principle of balancing the communication volume between IPs as much as possible. The BB mapping method is limited to the two-dimensional Mesh topology and X-Y routing, so the same setting is used in the accuracy verification. The SA mapping method is a classical algorithm that finds a local optimal solution of an objective function through a probabilistic search, and it lacks models of reliability, communication energy and performance; therefore, for a fair comparison, the results of this method are based on the reliability efficiency model proposed above.

When the interconnection wire failure probability is greater than 50%, the whole communication network is likely to fail to work normally. Therefore, in the simulation, the failure probability of the interconnection wires is set to be no greater than 0.5, with pl = 0.0001, 0.001, 0.025, 0.04, 0.0625, 0.1, 0.25, 0.5 and ph = 0.001, 0.025, 0.04, 0.0625, 0.1, 0.25, 0.5. Table 2.4 lists the maximum, minimum and average values of 5000 groups of failure-probability injection experiments. It can be seen from the average values in the table that the PCBB mapping method has significant advantages over the BB and SA mapping methods in all aspects. The reliability, communication energy and performance results are explained in detail in the following parts.
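A minimal sketch of this two-group fault injection is given below; the Wire structure and injectFaults() are illustrative names, not the simulator's actual interface.

    // Wires with high communication volume fail with probability ph,
    // the remaining wires with probability pl; faults are injected randomly.
    #include <random>
    #include <vector>

    struct Wire {
        bool highTraffic;      // belongs to the high-communication-volume group
        bool faulty = false;
    };

    void injectFaults(std::vector<Wire>& wires, double ph, double pl, unsigned seed) {
        std::mt19937 gen(seed);
        std::uniform_real_distribution<double> dist(0.0, 1.0);
        for (Wire& w : wires) {
            double p = w.highTraffic ? ph : pl;   // failure probability of this group
            w.faulty = (dist(gen) < p);
        }
    }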


Table 2.4 Overview of the optimal mapping found by the PCBB mapping method compared with the BB mapping method and SA mapping method in various aspects (unit: %)

Comparison item      | Compared with BB mapping method | Compared with SA mapping method
                     | Minimum | Maximum | Average     | Minimum | Maximum | Average
Reliability          | −0.96   | 106.8   | 8.2         | −20.8   | 167.0   | 4.2
Communication energy | −1.1    | 52.3    | 21.2        | −8.2    | 65.1    | 6.7
Latency              | 2.4     | 37.1    | 15.5        | −3.5    | 25.3    | 8.9
Throughput           | 0.7     | 22.2    | 9.3         | 3.5     | 22.2    | 8.5

(1) Reliability

The reliability of the PCBB, BB and SA mapping methods is compared for the eight applications under different interconnection wire failure probabilities. When the failure probability is small (pl ≤ 0.025), the reliability of the optimal mapping found by the PCBB mapping method is almost the same as that of the other two methods, because with a small failure probability almost all mappings achieve high reliability. As the failure probability increases (pl > 0.025), the PCBB mapping method shows obvious advantages over the other methods. Although the results of the PCBB mapping method are not as good as those of the SA mapping method in a few cases (25%), the computation time of the SA mapping method is much longer than that of the PCBB mapping method. The likely reason is that some accuracy of the PCBB mapping method is sacrificed for the improvement in speed; a more detailed discussion is given in the relevant sections. Overall, compared with the BB and SA mapping methods, the PCBB mapping method obtains 8.2% and 4.2% reliability gains respectively.

(2) Communication energy consumption

In the comparison of communication energy consumption, in most cases (73.4%) the communication energy consumption of the optimal mapping found by the PCBB mapping method is smaller than that of the other two methods. The remaining cases are probably due to the adoption of acceleration and the deletion of optimal intermediate nodes. In terms of the overall average, the communication energy consumption of the PCBB mapping method is still 21.2% and 6.7% lower than that of the BB and SA mapping methods respectively.

(3) Performance

The performance comparison includes latency and throughput, and latency itself is closely related to throughput. Therefore, the latency results are compared by varying the throughput after finding the optimal mapping under the failure probabilities pl = 0.01 and ph = 0.1. When the throughput is low, the latency is about 15 cycles/flit; as the throughput increases, the whole network gradually enters saturation. From the average results, the latency of the PCBB mapping method is 15.5% lower than that of the BB mapping method. This advantage comes from the quantitative analysis of latency in the reliability model.


Mappings with more blocking that are strongly affected by faulty interconnections can be easily eliminated, while mappings with lower latency are retained as candidates for the optimal mapping. Compared with the SA mapping method based on the same model, the PCBB mapping method also achieves an average latency reduction of 8.5%, which further demonstrates the accuracy of the method of finding the optimal solution.

The changes of throughput under different failure probabilities are also compared. When the injection rate is small, the throughput is basically the same for different failure probabilities. To study the impact of different failure probabilities on throughput, the injection rate in the experiment is set to 1 flit/(cycle · node), at which point the network has reached saturation. As the failure probability increases, the throughput decreases. Although throughput is only modeled qualitatively through bandwidth constraints, without a corresponding quantitative model, the optimal mapping found by the PCBB mapping method still has obvious advantages. On average, the throughput gain of the PCBB mapping method reaches 9.3% and 8.5% compared with the BB and SA mapping methods respectively.

2) Verification of computation efficiency

To meet the needs of remapping at runtime, the computational complexity needs to be reduced as much as possible while ensuring accuracy. Here, the computation time required to map the target application to the NoC for the first time is compared among the PCBB, BB and SA mapping methods. All three run on the same computing platform: two Intel® Xeon® E5520 CPUs with a 2.27 GHz main frequency and 16 GB of memory. In the PCBB mapping method, the compensation factor β is set to 0 so as to maximize the speed of finding the optimal solution. The computation time and the corresponding ratios for the optimal mapping of the eight applications are shown in Table 2.5. It can be seen that the PCBB mapping method finds the optimal mapping faster than both the BB and SA mapping methods, while the previous discussion shows that its accuracy in finding the optimal mapping is also better than both in most cases. The speed advantage is mainly due to the priority allocation in the search method: IPs with large communication volumes are mapped first, and intermediate nodes closer to the root node are deleted earlier, which greatly reduces the computational overhead.

3) Verification of reconfigurability

(1) Feasibility

In addition to minimizing the time overhead of the first mapping, it is also necessary to support recalculating the optimal mapping at runtime for a new topology and routing algorithm when errors or other situations require remapping. The following experiments are conducted for the different scenarios that require remapping.

According to the needs of different applications, the SDC computing array may need to change its interconnection structure, that is, the topology and routing algorithm of the network-on-chip may change during operation.


Table 2.5 Comparison of computation time of PCBB, BB and SA mapping methods

Application name | PCBB/s | BB/s  | SA/s     | BB/PCBB | SA/PCBB
MPEG4            | 0.02   | 0.03  | 37.4     | 1.5     | 1870
Telecom          | 0.65   | 2.65  | 156.7    | 4.1     | 241
Ami25            | 7.38   | 9.62  | 1397.3   | 1.3     | 189
Ami49            | 18.14  | 31.6  | 13,064.4 | 1.7     | 720
H264             | 0.50   | 2.68  | 486.8    | 5.4     | 974
HEVC             | 0.64   | 3.19  | 340.3    | 5.0     | 532
Freqmine         | 0.87   | 1.77  | 335.5    | 2.0     | 386
Swaption         | 0.45   | 2.03  | 385.5    | 4.5     | 857

This experiment runs the eight applications shown in Table 2.3, starting from the most commonly used two-dimensional Mesh topology and X-Y deterministic routing algorithm, then changes the topology and routing algorithm of the NoC and recalculates the optimal mapping for the reconfigured NoC. The four topology and routing algorithm combinations used after the change are taken from a review article that surveys 60 network-on-chip proposals [66]. The survey covers 66 topologies and 67 routing algorithms. The three most common topologies are Mesh/Torus (56.1%), self-customized topologies (12.1%) and ring (7.6%), while routing algorithms are usually divided into two categories: deterministic routing (62.7%) and adaptive routing (37.3%). Therefore, Torus, Spidergon, de Bruijn Graph and Mesh are selected as verification topologies from these three common categories, together with both deterministic and adaptive routing algorithms, as shown in Table 2.6. The remapping time for these four NoC topology and routing algorithm combinations is shown in Table 2.7. For each application, the remapping time gradually increases with the application scale, which is consistent with the theoretical analysis of Eq. (2.25). More importantly, the remapping time under all conditions is within the requirements of the reconfigurable architecture.

Hard errors in interconnection wires, routers and processing units are handled by redundant replacement. When a replacement occurs, the communication between nodes changes, so the optimal mapping also needs to be found again.

Table 2.6 Combinations of different network-on-chip topologies and routing algorithms to verify reconfigurability

Combination No. | Topology name   | Topology classification | Routing algorithm name | Routing algorithm classification
I               | Torus           | Mesh/Torus              | OddEven                | Adaptive routing
II              | Spidergon       | Ring                    | CrossFirst             | Deterministic routing
III             | de Bruijn Graph | Self-customization      | Deflection             | Deterministic routing
IV              | Mesh            | Mesh/Torus              | Fulladaptive           | Adaptive routing


Table 2.7 Comparison of computation time of remapping after NoC topology and routing algorithm change (unit: s)

Application name | I    | II   | III  | IV
MPEG4            | 0.02 | 0.01 | 0.01 | 0.03
Telecom          | 0.57 | 0.56 | 0.57 | 0.56
Ami25            | 3.67 | 3.7  | 3.81 | 3.66
Ami49            | 8.06 | 6.26 | 6.96 | 6.81
H264             | 0.56 | 0.56 | 0.57 | 0.57
HEVC             | 0.56 | 0.57 | 0.58 | 0.56
Freqmine         | 0.57 | 0.57 | 0.57 | 0.57
Swaption         | 0.57 | 0.56 | 0.57 | 0.57

A router failure can be attributed to failures of all the interconnection wires connected to the router. Table 2.8 shows the computation time of finding the optimal mapping again after failures of different numbers of interconnection wires, which is one order of magnitude smaller than the results in Table 2.7. Related research [67] proposed two methods of remapping after errors occur in processing units. The PCBB mapping method is compared with them on the same applications, as shown in Table 2.9. Although the PCBB mapping method is slightly slower than LICF, they are basically of the same order of magnitude. At the same time, the remapping time of the PCBB mapping method is much shorter than that of MIQP, while the search space of the PCBB mapping method is much larger than that of both methods. In general, the computing time of remapping caused by errors also meets the requirements of dynamic reconfiguration.

Table 2.8 Comparison of computation time of remapping after interconnection line error (unit: ms)

Application name | Number of error interconnection lines
                 | 3  | 6  | 9  | 12
MPEG4            | 10 | 12 | 13 | 12
Telecom          | 25 | 24 | 23 | 25
Ami25            | 50 | 54 | 53 | 54
Ami49            | 83 | 83 | 81 | 80
H264             | 24 | 23 | 25 | 23
HEVC             | 26 | 27 | 24 | 27
Freqmine         | 24 | 25 | 25 | 24
Swaption         | 25 | 24 | 22 | 22


Table 2.9 Comparison of the computation time for remapping after an error occurs in the processing unit (unit: s)

Application (number of IPs) | NoC size | Number of errors | LICF [67] | MIQP [67] | PCBB (this section)
Auto-Indust (9 IP)          | 4 × 4    | 2                | 0.01      | 0.2       | 0.03
                            |          | 4                | 0.02      | 2.51      | 0.04
                            |          | 6                | 0.04      | 51.62     | 0.06
                            |          | 7                | 0.04      | 177.72    | 0.08
TGFF-1 (12 IP)              |          | 2                | 0.01      | 0.44      | 0.02
                            |          | 3                | 0.02      | 1.34      | 0.05
                            |          | 4                | 0.03      | 4.3       | 0.06

(2) Energy consumption

After exploring the feasibility of reconfigurability, the energy consumption of remapping using the PCBB mapping method also needs to be considered. For the case where the entire NoC topology and routing algorithm are changed due to application requirements, remapping is the only option, so its cost can be ignored for now. However, for the case where an error occurs and redundancy is used for replacement, the simplest and most straightforward alternative to remapping is to move the IP on the faulty node to the nearest available node without changing the mapping locations of the other IPs. For this case, the energy consumption of remapping is discussed using the example of Swaption mapped to a two-dimensional Mesh network-on-chip with X-Y routing. Table 2.10 compares the costs of nearby replacement and remapping in different cases. The cost in Table 2.10 is defined by Eq. (2.26):

Cost = (energy consumption of the new mapping + energy consumption of moving IPs) − energy consumption of the mapping before the error occurs    (2.26)
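A small sketch of this cost comparison, with illustrative names; in practice the energy terms come from the layout-level simulation.

    // Sketch of the cost comparison implied by Eq. (2.26).
    struct RemapOption {
        double newMappingEnergy;   // energy of the mapping after the change
        double moveEnergy;         // energy spent moving IPs to their new nodes
    };

    // Cost relative to the mapping used before the error (Eq. 2.26);
    // a negative value means the new mapping actually saves energy.
    double cost(const RemapOption& opt, double energyBeforeError) {
        return opt.newMappingEnergy + opt.moveEnergy - energyBeforeError;
    }

    // Choose between full remapping and simply moving the failed IP nearby.
    bool preferRemapping(const RemapOption& remap, const RemapOption& nearby,
                         double energyBeforeError) {
        return cost(remap, energyBeforeError) < cost(nearby, energyBeforeError);
    }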

Table 2.10 Comparison of energy consumption cost between remapping and nearest replacement

Distance between a redundant unit and an error unit | Number of mobile IPs | Nearby replacement cost/J | Remapping cost/J
1 | 1 | −0.03 | 0.02
2 | 4 | −0.10 | −0.05
3 | 3 | −0.08 | −0.08
4 | 2 | 0.04  | −0.10
5 | 4 | 0.14  | −0.16
6 | 3 | 0.32  | −0.23
7 | 2 | 0.43  | −0.16
8 | 2 | 0.83  | −0.27


It can be seen from the comparison results in Table 2.10 that when the number of IPs to be moved is small and the redundant unit is close to the faulty unit, simply replacing the faulty unit nearby is more advantageous. However, when the redundant unit is far away from the faulty unit or a large number of IPs must be moved because of the communication paths, remapping with multi-objective optimization reduces the energy consumption even though some IPs need to be moved, whereas simple nearby replacement brings a significant increase in energy consumption. From the perspective of energy, the advantages of remapping are therefore obvious.

References

1. van Woudenberg JGJ, Witteman MF, Menarini F (2011) Practical optical fault injection on secure microcontrollers. In: Workshop on fault diagnosis and tolerance in cryptography, pp 91–99
2. Bo W (2018) Research on high energy efficiency reconfigurable cryptographic processor architecture and its anti physical attack technology. Tsinghua University, Beijing
3. Wang B, Liu L, Deng C et al (2016) Against double fault attacks: injection effort model, space and time randomization based countermeasures for reconfigurable array architecture. IEEE Trans Inf Forensics Secur 11(6):1151–1164
4. Leibo L, Bo W, Shaojun W et al (2018) Reconfigurable computing cryptographic processor. Science Press, Beijing
5. Anderson R, Kuhn M (1996) Tamper resistance—a cautionary note. In: Proceedings of the 2nd Usenix workshop on electronic commerce, pp 1–11
6. Skorobogatov S (2011) Physical attacks on tamper resistance: progress and lessons. In: Proceedings of the 2nd ARO special workshop on hardware assurance, pp 1–10
7. Skorobogatov SP (2005) Semi-invasive attacks: a new approach to hardware security analysis
8. Kocher P, Jaffe J, Jun B (1999) Differential power analysis. In: Annual international cryptology conference, pp 388–397
9. Kocher PC (1996) Timing attacks on implementations of Diffie-Hellman, RSA, DSS, and other systems. Springer, Berlin
10. Yen S, Lien W, Moon S et al (2005) Power analysis by exploiting chosen message and internal collisions: vulnerability of checking mechanism for RSA-decryption. Springer, pp 183–195
11. Ors SB, Gurkaynak F, Oswald E et al (2004) Power-analysis attack on an ASIC AES implementation. In: International conference on information technology: coding and computing, pp 546–552
12. Kadir SA, Sasongko A, Zulkifli M (2011) Simple power analysis attack against elliptic curve cryptography processor on FPGA implementation. In: Proceedings of the 2011 international conference on electrical engineering and informatics, pp 1–4
13. Guo L, Wang L, Li Q et al (2015) Differential power analysis on dynamic password token based on SM3 algorithm, and countermeasures. In: The 11th international conference on computational intelligence and security, pp 354–357
14. Qiu S, Bai G (2014) Power analysis of a FPGA implementation of SM4. In: The 5th international conference on computing, communications and networking technologies, pp 1–6
15. Duan X, Cui Q, Wang S et al (2016) Differential power analysis attack and efficient countermeasures on PRESENT. In: The 8th IEEE international conference on communication software and networks, pp 8–12
16. Li H, Wu K, Peng B et al (2008) Enhanced correlation power analysis attack on smart card. In: The 9th international conference for young computer scientists, pp 2143–2148
17. Adegbite O, Hasan SR (2017) A novel correlation power analysis attack on PIC based AES-128 without access to crypto device. In: The 60th international midwest symposium on circuits and systems, pp 1320–1323
18. Masoomi M, Masoumi M, Ahmadian M (2010) A practical differential power analysis attack against an FPGA implementation of AES cryptosystem. In: International conference on information society, pp 308–312
19. Sugawara T, Suzuki D, Saeki M et al (2013) On measurable side-channel leaks inside ASIC design primitives. Springer, pp 159–178
20. Courrège J, Feix B, Roussellet M (2010) Simple power analysis on exponentiation revisited. In: International conference on smart card research and advanced applications, pp 65–79
21. Brier E, Clavier C, Olivier F (2004) Correlation power analysis with a leakage model. In: International workshop on cryptographic hardware and embedded systems, pp 16–29
22. Gierlichs B, Batina L, Tuyls P et al (2008) Mutual information analysis. In: International workshop on cryptographic hardware and embedded systems, pp 426–442
23. Chari S, Rao JR, Rohatgi P (2002) Template attacks. In: International workshop on cryptographic hardware and embedded systems, pp 13–28
24. Dassance F, Venelli A (2012) Combined fault and side-channel attacks on the AES key schedule. In: Workshop on fault diagnosis and tolerance in cryptography, pp 63–71
25. Güneysu T, Moradi A (2011) Generic side-channel countermeasures for reconfigurable devices. In: International workshop on cryptographic hardware and embedded systems, pp 33–48
26. Moradi A, Mischke O, Paar C (2011) Practical evaluation of DPA countermeasures on reconfigurable hardware. In: IEEE international symposium on hardware-oriented security and trust, pp 154–160
27. Beat R, Grabher P, Page D et al (2012) On reconfigurable fabrics and generic side-channel countermeasures. In: The 22nd international conference on field programmable logic and applications, pp 663–666
28. Shan W, Shi L, Fu X et al (2014) A side-channel analysis resistant reconfigurable cryptographic coprocessor supporting multiple block cipher algorithms. In: Proceedings of the 51st annual design automation conference, pp 1–6
29. Sasdrich P, Moradi A, Mischke O et al (2015) Achieving side-channel protection with dynamic logic reconfiguration on modern FPGAs. In: IEEE international symposium on hardware oriented security and trust, pp 130–136
30. Hettwer B, Petersen J, Gehrer S et al (2019) Securing cryptographic circuits by exploiting implementation diversity and partial reconfiguration on FPGAs. In: Design, automation & test in Europe conference & exhibition, pp 260–263
31. Wang B, Liu L, Deng C et al (2016) Exploration of Benes network in cryptographic processors: a random infection countermeasure for block ciphers against fault attacks. IEEE Trans Inf Forensics Secur 12(2):309–322
32. Devadas S, Suh E, Paral S et al (2008) Design and implementation of PUF-based "unclonable" RFID ICs for anti-counterfeiting and security applications. In: IEEE international conference on RFID, pp 58–64
33. Guajardo J, Kumar SS, Schrijen G et al (2007) FPGA intrinsic PUFs and their use for IP protection. In: Cryptographic hardware and embedded systems, Vienna, pp 63–80
34. Mahmoud A, Rührmair U, Majzoobi M et al. Combined modeling and side channel attacks on strong PUFs [EB/OL]. https://eprint.iacr.org/2013/632. Accessed 20 Dec 2020
35. Rührmair U, Xu X, Sölter J et al (2014) Efficient power and timing side channels for physical unclonable functions. In: Cryptographic hardware and embedded systems, pp 476–492
36. Tiri K, Hwang D, Hodjat A et al (2005) Prototype IC with WDDL and differential routing–DPA resistance assessment. In: Cryptographic hardware and embedded systems, pp 354–365
37. Delvaux J, Verbauwhede I (2013) Side channel modeling attacks on 65nm arbiter PUFs exploiting CMOS device noise. In: IEEE international symposium on hardware-oriented security and trust, pp 137–142
38. Sahoo DP, Mukhopadhyay D, Chakraborty RS et al (2018) A multiplexer-based arbiter PUF composition with enhanced reliability and security. IEEE Trans Comput 67(3):403–417
39. Merli D, Schuster D, Stumpf F et al (2011) Semi-invasive EM attack on FPGA RO PUFs and countermeasures. In: Workshop on embedded systems security, pp 1–9
40. Homma N, Hayashi Y, Miura N et al (2014) EM attack is non-invasive?—design methodology and validity verification of EM attack sensor. In: Cryptographic hardware and embedded systems, pp 1–16
41. Suh GE, Devadas S (2007) Physical unclonable functions for device authentication and secret key generation. In: Design automation conference, pp 9–14
42. Lee JW, Lim D, Gassend B et al (2004) A technique to build a secret key in integrated circuits for identification and authentication applications. In: Symposium on VLSI circuits, pp 176–179
43. Jing Y, Guo Q, Yu H et al (2018) Modeling attacks on strong physical unclonable functions strengthened by random number and weak PUF. In: IEEE VLSI test symposium, pp 1–6
44. Lim D (2004) Extracting secret keys from integrated circuits. Massachusetts Institute of Technology, Cambridge
45. Kursawe K, Sadeghi AR, Schellekens D et al (2009) Reconfigurable physical unclonable functions-enabling technology for tamper-resistant storage. In: IEEE international workshop on hardware-oriented security and trust, pp 22–29
46. Škorić B, Tuyls P, Ophey W (2005) Robust key extraction from physical unclonable functions. In: International conference on applied cryptography and network security, pp 407–422
47. Dongxing W (2018) Research on key technologies of constructing physical unclonable functions by dynamically reconstructing computing arrays. Tsinghua University, Beijing
48. Zhou Z (2018) Research on hardware security key technologies of reconfigurable computing processor. Tsinghua University, Beijing
49. Liu L, Ren Y, Deng C et al (2015) A novel approach using a minimum cost maximum flow algorithm for fault-tolerant topology reconfiguration in NoC architectures. In: The 20th Asia and South Pacific design automation conference, pp 48–53
50. Stallings W (2017) Operating systems: internals and design principles, 9th edn. Pearson, Englewood
51. Zhang L, Han Y, Xu Q et al (2009) On topology reconfiguration for defect-tolerant NoC-based homogeneous manycore systems. IEEE Trans Very Large Scale Integr (VLSI) Syst 17(9):1173–1186
52. Varvarigou TA, Roychowdhury VP, Kailath T (1993) Reconfiguring processor arrays using multiple-track models: the 3-track-1-spare-approach. IEEE Trans Comput 42(11):1281–1293
53. Karzanov A (1974) Determining the maximal flow in a network by the method of preflows. In: Soviet mathematics Doklady, pp 1–8
54. Edmonds J, Karp RM (1972) Theoretical improvements in algorithmic efficiency for network flow problems. J ACM 19(2):248–264
55. Chang Y, Chiu C, Lin S et al (2011) On the design and analysis of fault tolerant NoC architecture using spare routers. In: The 16th Asia and South Pacific design automation conference, pp 431–436
56. Kang U, Chung H, Heo S et al (2010) 8 Gb 3-D DDR3 DRAM using through-silicon-via technology. IEEE J Solid-State Circuits 45(1):111–119
57. Yu R (2014) Research on key technologies of high reliability on-chip internetwork design. Tsinghua University, Beijing
58. Wu C (2015) Research on multi-objective joint optimization mapping method for reconfigurable network-on-chip. Tsinghua University, Beijing
59. Chatterjee N, Chattopadhyay S, Manna K (2014) A spare router based reliable network-on-chip design. In: IEEE international symposium on circuits and systems, pp 1957–1960
60. Kornaros G, Pnevmatikatos D (2014) Dynamic power and thermal management of NoC-based heterogeneous MPSoCs. ACM Trans Reconfig Technol Syst 7(1):1
61. Shahidi GG (2019) Chip power scaling in recent CMOS technology nodes. IEEE Access 7:851–856
62. Ye TT, Benini L, de Micheli G (2002) Analysis of power consumption on switch fabrics in network routers. In: Proceedings 2002 design automation conference, pp 524–529
63. Bolch G, Greiner S, de Meer H, Trivedi KS (2006) Queueing networks and Markov chains: modeling and performance evaluation with computer science applications. Wiley, Manhattan
64. Ababei C, Kia HS, Yadav OP et al (2011) Energy and reliability oriented mapping for regular networks-on-chip. In: Proceedings of the 5th ACM/IEEE international symposium on networks-on-chip. Association for Computing Machinery, Pittsburgh, Pennsylvania, pp 121–128
65. The PARSEC Benchmark Suite [EB/OL]. https://parsec.cs.princeton.edu/overview.htm. Accessed 30 May 2020
66. Salminen E, Kulmala A, Hamalainen TD (2008) Survey of network-on-chip proposals. White Paper OCP-IP 1:13
67. Li Z, Li S, Hua X et al (2013) Run-time reconfiguration to tolerate core failures for real-time embedded applications on NoC manycore platforms. In: 2013 IEEE 10th international conference on high performance computing and communications & 2013 IEEE international conference on embedded and ubiquitous computing, pp 1990–1997

Chapter 3

Technical Difficulties and Development Trend

A new golden age for computer architecture: Domain-specific hardware/software co-design, enhanced security, open instruction sets, and agile chip development. —John Hennessy and David Patterson, Turing Award, 2018

Chapter 2 of Volume I introduces the key issues to be considered in the research of SDCs, while Chaps. 3 and 4 of Volume I and Chaps. 1 and 2 of Volume II introduce the design space of these key issues from different layers and perspectives. This chapter focuses on the main technical difficulties faced in solving these key issues. Fundamentally, the key problem that the SDC hopes to solve is to realize high flexibility, ease of use and computing efficiency on a single chip at the same time. However, these goals restrict one another: functional flexibility means more hardware resource redundancy, and ease of use means fewer hardware optimization opportunities, both of which make it difficult to improve computing efficiency. To this day, most mainstream computing chip designs reflect the awkward trade-offs made by chip developers. The CPU realizes functional flexibility through time-division multiplexing of a small number of computing logic units and other components, but computing logic occupies only a very small area of the chip; when dealing with many applications, the CPU's non-computing resources (e.g., instruction flow control logic) are essentially redundant. At the other extreme, the ASIC maximizes the computing efficiency of its underlying circuits through spatial-domain parallelism, pipelining and specialized design, but its productivity is very poor: design and production costs of tens or even hundreds of millions of dollars and years of development time mean that ASICs can only be applied to a small number of widely used application fields. The key problem of SDCs, however, is not simply to make a trade-off between flexibility, ease of use and computing efficiency to meet target constraints (which is what many accelerator chips do at present), but to find the shortcomings of existing design methods and theories from the perspective of architecture innovation, so as to improve flexibility, computing efficiency or ease of use without sacrificing the other two objectives. Therefore, this chapter analyzes the main technical difficulties of realizing the SDC with existing chip design methods and ideas in terms of flexibility, efficiency and ease of use.


In view of these technical difficulties, the possibility of new design concepts is further discussed, and the future development trend of SDCs is prospected.

3.1 Analysis of Technical Difficulties

FPGA, CPU and even GPU chips have formed mature ecosystems from software to hardware. For example, programming FPGAs in the Verilog language has become a standard method, and GPGPUs can be programmed using the CUDA or OpenCL standards. These software and hardware workflows offer good efficiency, performance and ease of use, and enable applications to make full use of the parallel processing capability of the underlying hardware, so they are widely accepted. However, the SDC is still at the stage of establishing its software and hardware ecosystem, and there is no mature standard that has been widely commercialized. Many problems remain to be solved in establishing and improving such an ecosystem or standard. Most of these problems are not unique to SDCs. For example, how to expose more of the hardware's parallel processing capability to applications without making the programming model too complex and daunting, or the "memory wall" problem caused by limited memory bandwidth, are common problems in chip design. However, the solutions to these problems are not universal: different underlying hardware may require completely different approaches, so these problems remain the focus of SDC research. Of course, some challenges are unique to the SDC and are related to its specific hardware and its programming, computing and execution models. Generally speaking, the challenges faced by the SDC fall into three aspects: how to conduct programmability design coordinating the software and hardware to obtain flexibility, how to balance the parallelism and utilization of hardware to achieve efficiency, and how to use software-scheduled virtualization for hardware optimization to improve ease of use.

3.1.1 Flexibility: Programmability Design Coordinating the Software and Hardware

The SDC needs to be flexible enough to support different computing tasks and applications, which inevitably leads to redundancy of hardware resources. The SDC should reduce this redundancy as much as possible while meeting the application requirements, so as to improve hardware utilization. This requires close cooperation between software and hardware to find an optimal balance point for the system. Therefore, the first challenge of the SDC is how to realize programmability design that coordinates the software and hardware.


One of the most direct problems in hardware/software co-design is how to design the programming model. The programming model defines how software or applications use the underlying hardware. It serves two purposes. The first is computing efficiency: it is desirable to expose more details of the underlying hardware, so that the software can make full use of the capabilities the hardware provides and improve hardware utilization. The second is programming efficiency: it is desirable to abstract the hardware as much as possible, so that software designers can program without an in-depth understanding of the underlying hardware. These two purposes are obviously contradictory. The hardware cannot be abstracted too heavily if application performance needs to be deeply optimized, and it is not practical to expose all the underlying details to the programmer if the system aims at ease of programming. Therefore, a compromise acceptable to both sides is needed. At present, there is little research on this topic, and no effective and easy-to-use programming model exists. Different applications or SDCs in different studies often use different programming models, but no programming model offers programmability and performance that meet the requirements for wide commercialization. Finding the programming model most suitable for the SDC is urgent; it is one of the most important preconditions for the SDC ecosystem to be widely accepted by academia and industry. Therefore, programmability remains the main challenge faced by the SDC.

For SDCs, it is natural to explore, from both the hardware and software perspectives, what the most suitable programming model is and what the challenges are. From the perspective of hardware, the SDC is a computing architecture with much higher complexity than other computing architectures. Different from the CPU, the SDC is a two-dimensional distributed spatial computing array, which supports higher-dimensional parallelism and stronger computing power. These hardware capabilities need to be made available to software in an easy-to-program manner, which is even more difficult than for the VLIW programming model: although VLIW provides parallel programming of PEs, the control logic of its different instructions is simply packed into a long instruction in a decoupled manner, whereas in an SDC all PEs may communicate data with their neighbors at the same moment. Different from the GPU, the SDC has dynamically configurable interconnections, and its PEs do not communicate through shared memory but communicate directly. Generally speaking, the GPU first moves data from main memory to the GPU's VRAM through PCIe and then executes the program based on a shared memory model, whereas SDCs require the data flow to be defined explicitly in order to function. SDCs also differ from FPGAs in that their underlying computational units are coarse-grained, can perform different functional computations, and can be switched dynamically according to different configurations; therefore, the scheduling and partitioning of tasks on an SDC is more complex than on an FPGA. In addition, although HLS technology has been developed for some time, programming FPGAs with high-level languages is still not widely used, mainly because the performance gap between HLS and programming directly in low-level languages is too large. Therefore, it can be concluded that programming SDCs with high-level languages is even more challenging than programming FPGAs.


Clearly, the complex hardware architecture of the SDC is one of the most fundamental and difficult challenges in programmability design coordinating software and hardware. Moreover, the hardware design of the SDC can itself be adjusted according to the needs of the application. This introduces a new variable into programming model design, namely the hardware function, which greatly increases the difficulty of designing the programming model.

From the perspective of software, the mainstream programming models are sequential in style and cannot make full use of the two-dimensional fine-grained parallel processing capability of the SDC, so they are not suitable for software and hardware co-design. As mentioned above, the challenge in programming model design is essentially that software can only provide coarse-grained parallelism if it abstracts away the underlying architecture at a high level, while programmability becomes too poor if the developer programs the underlying architecture directly at a low level. From the software perspective, programmers cannot be expected to fully understand the implementation details of the underlying hardware. A good programming model should expose the core features of the underlying architecture as much as possible without making the programming work too complex, so that the software can control and exploit most of the parallelism provided by the hardware. In short, the software should be able to easily extract, to the greatest extent, the computing power provided by the hardware. This is usually not an easy task. As seen on FPGAs, HLS provides the possibility of high-level programming by automatically synthesizing high-level languages into hardware description languages, but the resulting performance is far from practical; manual programming in low-level hardware description languages such as VHDL requires an enormous amount of work and offers poor programmability. There are also successful examples: CUDA is a good example of using a high-level language to program GPGPUs, because the GPU is a very regular parallel processing unit with a single computational mode, so the CUDA model, which abstracts that mode, can exploit the parallelism of the GPU efficiently while imposing little burden on programmers.

It is very challenging to design a general programming model for SDCs. However, most SDCs are designed for specific fields, such as machine learning. Therefore, drawing on the idea of the DSL, a more feasible way is to design different programming models for different application fields. By restricting the computing model, a programming model designed for a domain can be optimized in advance according to the data and control characteristics of that domain, so the functionality provided by the hardware can be relatively more fixed, which drastically reduces the design space of the programming model. As a result, programming models designed for a domain are likely to achieve higher hardware utilization than general-purpose programming models. Nevertheless, this direction still faces the challenges of extracting application characteristics for different application domains, considering their efficient implementation on the underlying hardware, and incorporating them into the design of the programming model.


Overall, the flexibility of SDCs needs to be guaranteed by programmability design that coordinates the hardware and software. One of the main issues is how to design a programming model for SDCs. The complexity of the hardware, the limitations of existing software and the customizability of hardware functions together make the design space of the programming model very large and complex. This shows that finding the optimal system-level combination of hardware and software is very difficult, yet it must urgently be addressed for the development of SDCs.

3.1.2 Efficiency: Tradeoff Between Hardware Parallelism and Utilization

The performance of a computing architecture is mainly defined by its throughput, that is, how much data can be processed, or how many operations can be completed, per unit time. For example, the number of floating-point operations per second (FLOPS) or the number of specific operations per second (e.g., multiply-add operations) is used to define the computing power of high-performance computers and application accelerators, while the amount of data transmitted per second (i.e., bandwidth) is used to evaluate the performance of network switches or routers. Throughput is essentially an indicator of system parallelism: it reflects how many requests or items a system can process at the same time. In a modern computing architecture, many forms of parallelism can be exploited in each part. For example, a superscalar CPU can issue multiple instructions and perform branch prediction and out-of-order execution, and a multi-core processor can process multiple threads at the same time. These are three different levels of parallelism, and they contribute greatly to the improvement of processor performance. Of course, different levels of parallelism inevitably affect each other within a system. An architecture that can provide and exploit diverse parallelism well while keeping it conflict-free performs better, but this also leads to greater power consumption and area. For convenience of study, the parallelism of a system is generally categorized into the following types in the field of computer architecture.

3.2 Instruction-Level Parallelism

Instruction-level parallelism (ILP) refers to the ability of a system to process multiple instructions at the same time. Instruction-level parallelism is defined primarily for CPUs, although many other architectures also exploit it. For example, superscalar processors can issue multiple instructions at a time, and VLIW processors can split long instructions and execute their operations in parallel; these are methods of exploiting instruction-level parallelism at compile time, before execution. In addition, an out-of-order processor can exploit instruction-level parallelism dynamically


while the code is running, executing multiple instructions that have no data dependences at the same time. The same software code obviously executes faster on processors with higher ILP.
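As a small, generic C++ illustration (not tied to any particular processor), the two loops below perform the same number of additions, but the second exposes far more ILP because its four accumulators are independent and can be executed in parallel by a superscalar or out-of-order core.

    // A long dependence chain limits ILP: each addition must wait for the previous one.
    double sum_serial(const double* a, int n) {
        double s = 0.0;
        for (int i = 0; i < n; ++i) s += a[i];      // serial dependence chain
        return s;
    }

    // Four independent accumulators can be issued and executed in parallel.
    double sum_ilp(const double* a, int n) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        int i = 0;
        for (; i + 4 <= n; i += 4) {                // four independent chains
            s0 += a[i]; s1 += a[i + 1]; s2 += a[i + 2]; s3 += a[i + 3];
        }
        for (; i < n; ++i) s0 += a[i];              // remaining elements
        return (s0 + s1) + (s2 + s3);
    }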

3.3 Data-Level Parallelism

Data-level parallelism (DLP) refers to the ability to process multiple data items, or groups of data, in one operation. When the computational pattern is relatively uniform, using the SIMD model to process multiple data items simultaneously is a common way of implementing data-level parallelism. The GPU is one paradigm of SIMD: a single instruction can control multiple stream processors, distinguished by their indices, to compute on a large amount of data at the same time. When a single instruction controls a single computation, operations such as instruction fetching, decoding and committing often limit further improvement of throughput. Data-level parallelism can bypass this instruction-control bottleneck and greatly improve system throughput with high energy efficiency. Intel's x86 instruction set has also evolved many times; over time, many vector instruction extensions using the SIMD mode have been added, such as the streaming SIMD extensions (SSE), AVX2 and AVX512, to provide higher parallelism and throughput for high-performance computing.
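A short illustration of DLP using AVX2 intrinsics (assuming a compiler with AVX2 enabled, e.g. a -mavx2 build): one instruction operates on eight single-precision values at a time.

    // Element-wise addition written with AVX2 intrinsics.
    #include <immintrin.h>

    void add_simd(const float* a, const float* b, float* c, int n) {
        int i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));   // 8 additions per instruction
        }
        for (; i < n; ++i) c[i] = a[i] + b[i];                // scalar tail
    }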

3.4 Memory-Level Parallelism

Memory-level parallelism (MLP) refers to the ability to execute multiple different memory access requests at the same time. Moore's Law has led to a steady increase in the throughput of computing chips and, with it, an ever-growing demand for memory access bandwidth. However, the development of main memory bandwidth cannot keep up with this pace, which is known as the "memory wall" problem. To make full use of instruction-level and data-level parallelism, a system must be designed with sufficient memory bandwidth. Since the single-access latency of memory has hardly decreased in the past 20 years, memory-level parallelism has become the most important and effective way to improve memory bandwidth. From the perspective of the computing system, techniques such as non-blocking caches, multiple memory controllers, and allowing multiple accesses to execute simultaneously are all effective ways to provide memory-level parallelism. From the perspective of the memory system, main memory that provides multiple access channels to access different banks and ranks at the same time is also a key factor in improving MLP.
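A simple C++ illustration of the idea: a single pointer chase allows only one cache miss in flight at a time, whereas walking several independent lists lets the memory system service several misses concurrently, which is exactly what memory-level parallelism provides.

    // A linked list forces sequential misses; independent lists overlap them.
    struct ListNode { ListNode* next; long value; };

    long walk_one(ListNode* p) {                 // at most one miss in flight
        long s = 0;
        for (; p; p = p->next) s += p->value;
        return s;
    }

    long walk_four(ListNode* a, ListNode* b, ListNode* c, ListNode* d) {
        long s = 0;                              // up to four independent misses in flight
        while (a || b || c || d) {
            if (a) { s += a->value; a = a->next; }
            if (b) { s += b->value; b = b->next; }
            if (c) { s += c->value; c = c->next; }
            if (d) { s += d->value; d = d->next; }
        }
        return s;
    }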


3.5 Task-Level Parallelism

Task-level parallelism refers to the ability of a system to execute multiple tasks at the same time. For example, a multi-core processor can execute multiple threads simultaneously and, compared with a single-core processor, can greatly improve multitasking performance, which is ultimately reflected in higher throughput. In addition, a single task can be divided into decoupled subtasks, and task-level parallelism can then be used to improve throughput.
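A minimal C++ sketch of task-level parallelism: one task is split into decoupled subtasks, each executed by its own thread on a multi-core processor.

    // Each thread sums its own slice of the array; the partial sums are combined at the end.
    #include <algorithm>
    #include <cstddef>
    #include <numeric>
    #include <thread>
    #include <vector>

    double parallel_sum(const std::vector<double>& data, unsigned numThreads) {
        std::vector<double> partial(numThreads, 0.0);
        std::vector<std::thread> workers;
        std::size_t chunk = (data.size() + numThreads - 1) / numThreads;
        for (unsigned t = 0; t < numThreads; ++t) {
            workers.emplace_back([&, t] {
                std::size_t begin = t * chunk;
                std::size_t end = std::min(data.size(), begin + chunk);
                for (std::size_t i = begin; i < end; ++i) partial[t] += data[i];
            });
        }
        for (auto& w : workers) w.join();
        return std::accumulate(partial.begin(), partial.end(), 0.0);
    }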

3.6 Speculation Parallelism

Speculation parallelism (SP) means speculating on the operations that are likely to follow, and preparing data or executing operations in advance, so as to reduce waiting delays and improve bandwidth and throughput. If the speculation is wrong, the previous context is restored and the correct operation is then performed. Branch prediction and out-of-order execution in CPUs are implementations of speculation parallelism. The forms of parallelism discussed above improve system performance by increasing the number of requests executed at the same time, while speculation parallelism further improves system performance by reducing waiting time and delay. According to Amdahl's law, when the parallelizable parts of a system are executed fully in parallel, the final execution time of the whole program is determined by the serial part. Many applications contain a large number of serial instructions; for example, branch instructions in SPEC2006 account for about 20% of the total instructions [1]. This part is often the bottleneck of overall system performance, but its performance cannot be improved by any form of parallelism other than speculation parallelism. When the prediction accuracy is high enough, the cost of recovering from a misprediction is small, so speculation parallelism can improve the performance of the serial part and greatly improve the throughput of the system. Speculation parallelism mainly targets two kinds of dependences: the control dependence of instructions and the ambiguous dependence of data. Control dependence refers to the situation in which the result of a previous instruction determines whether a subsequent instruction is executed, while ambiguous data dependence is a partial data dependence: the memory access address is determined by a register, so whether two instructions have a data dependence can only be determined when the instructions are executed. The main SP techniques that address these two dependences in the CPU are branch prediction and out-of-order execution. Branch prediction records the historical decisions of a branch instruction, or of other related branch instructions, predicts the future behavior of this branch instruction according to that history, and executes the predicted instructions in advance. Of course, this requires the hardware to provide detection of the prediction result and a mechanism to flush the pipeline after a prediction failure.
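As a small, generic C++ illustration of why branch prediction matters: the loop below contains a data-dependent branch, and on a modern CPU its speed depends heavily on how predictable that branch is (almost always predicted correctly on sorted data, frequently mispredicted and forcing pipeline flushes on random data).

    // The branch outcome is speculated by the branch predictor; the instruction
    // count is identical regardless of the input, but mispredictions are costly.
    long count_large(const int* a, int n, int threshold) {
        long count = 0;
        for (int i = 0; i < n; ++i)
            if (a[i] > threshold)      // data-dependent, speculatively executed branch
                ++count;
        return count;
    }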


Therefore, some additional registers are required to record the system state at the branch point for rollback after a misprediction. Out-of-order execution is used to handle the ambiguous dependence of data: although true data dependences must be executed sequentially, instructions without true data dependences can be executed simultaneously by renaming operands. Speculation parallelism can greatly improve the performance of the system, but it also greatly increases power consumption and area. Speculative execution does not improve the energy efficiency of the system, because the correct operations always need to be executed, while wrong operations may also be executed and then rolled back. This is why the energy efficiency of out-of-order CPUs is far lower than that of in-order CPUs.

For a computing system, more parallelism at every level is not always better; for example, GPUs rarely use SP. This is because providing different levels of parallelism in hardware is not trivial. For example, implementing the AVX512 instruction extensions requires adding more decode control logic and arithmetic units to the CPU, which increases the CPU's power and area overhead. Similarly, SDCs also need to balance providing greater parallelism against controlling power consumption and area: they not only need to provide greater parallelism to improve system performance, but also need to ensure that energy efficiency and area efficiency remain high enough. Otherwise, although parallelism is increased, the power and area cost of the system is often too large to be acceptable.

According to the classification of SDC computing models in Reference [2], the SCSD computing model executes a single configuration on a single data set at a time, which is a widely accepted, simple but effective computing model. This approach can provide a large amount of instruction-level parallelism, because one configuration can integrate a large number of instructions and map them onto the spatial array for simultaneous execution. The SCMD computing model can operate on multiple data sets with the same configuration, which makes full use of idle PEs; it provides greater data-level parallelism on top of SCSD and, similar to the GPU, is suitable for vector or streaming applications such as multimedia. Finally, the MCMD computing model can execute multiple different configurations in the PEA at the same time and can switch PEA configurations asynchronously. Compared with SCMD, this approach mainly exploits the task-level parallelism of the SDC. More complex computing models can obviously provide greater parallelism; for example, MCMD generally achieves much higher PEA utilization than SCSD and therefore better performance. However, the parallelism provided by complex computing models requires more complex hardware or more powerful compilers, which cannot be underestimated. For example, the MCMD computing model requires mapping different tasks onto different places and time steps of the PEAs, and communication between MCMD tasks requires explicit message passing through the interconnections because there is no shared memory; both pose great challenges for SDC compilers.

Many studies have explored how to exploit different levels of parallelism in SDCs [3–5]. For example, TRIPS [3] supports three working modes, which provide instruction-level parallelism, data-level parallelism and task-level parallelism respectively, while memory-level parallelism and data-level parallelism are essential to SDCs when dealing with big data applications.


Many studies have explored how to take advantage of different levels of parallelism in SDCs [3–5]. For example, TRIPS [3] supports three working modes, which provide instruction-level parallelism, data-level parallelism and task-level parallelism respectively; memory-level parallelism and data-level parallelism are essential for SDCs when dealing with big-data applications. In addition to the instruction-level, data-level and task-level parallelism discussed above, SDCs can also support speculation parallelism. Research [6] shows that speculatively executing loop instructions, eliminating their dependences and parallelizing them can improve performance by more than 60%. From the perspective of performance, this explains why speculation parallelism should be provided on the SDC; after all, it is very easy for SDCs to exploit the parallelism of loops. However, implementing speculation parallelism on an SDC is not easy, because the "instructions" of an SDC are essentially a set of configuration information, and a large number of operations are performed within one configuration. If branch prediction and out-of-order execution are applied, the memory accesses of this large number of operations must be recorded and reordered, which places a great burden on the power consumption and area of the SDC. Moreover, the accuracy of history-based branch predictors in SDCs has not yet reached an acceptable level, because configuration switching is not as frequent as instruction execution [7]. Therefore, if speculation parallelism is to be implemented on SDCs, research is also needed on improving prediction accuracy and reducing the cost of prediction misses. In addition, the SDC also needs to provide memory-level parallelism to increase the memory access bandwidth. This is the key to ensuring that the other forms of parallelism can be fully utilized. However, memory-level parallelism is still rarely discussed in current SDC research, possibly because implementing it requires not only the cooperation of the SDC's computation fabric and compiler but also a specially designed memory system, so the research cost and threshold are relatively high. There are three main challenges in memory-level parallelism design for SDCs. First, existing memory systems are deeply optimized for sequential streaming access: a row buffer in the DRAM array caches adjacent data and interacts with the processing chip through prefetching and burst mode to improve bandwidth. However, the two-dimensional, distributed PEs of an SDC often produce distributed, sparse access requests. Therefore, compared with CPUs, GPUs and other computing architectures, existing memory systems are less suitable for SDCs, which makes memory-level parallelism harder to realize. Second, although many studies have explored how an SDC can customize and optimize its computing architecture for different applications, the memory system and the application's memory access pattern receive no targeted consideration. Existing SDC research often uses a traditional cache or scratchpad memory, the main memory is often a simple abstraction, and few systems combine computing and memory. As a result, memory often restricts the performance of the whole system. Third, the SDC is very suitable for combination with in-memory computing, which can greatly increase memory-level parallelism. However, the DRAM process is optimized for density and is not suitable for implementing computing logic.


Therefore, it is not efficient to implement an SDC directly in a DRAM process. How to combine memory with the PEs of the SDC to enable more efficient memory access has therefore also become one of the challenges of memory-level parallelism. In short, an efficient SDC requires the hardware to provide a large amount of parallelism at different levels while ensuring utilization. However, as discussed in this section, although the SDC is able to provide parallelism at all levels of computing and memory, coordinating the different forms of parallelism and achieving high utilization remains very difficult, which restricts the improvement of SDC performance.

3.6.1 Ease of Use: Optimizing Virtualized Hardware with Software Scheduling

New hardware architecture designs for SDCs are booming. Different architectures often adopt different programming models, which makes it difficult to port SDC programs: a small architectural update may force the program to be rewritten. Ease of use is therefore an important challenge for SDCs. One of the main ways to resolve the incompatibility of programs across architectures is virtualization, which improves ease of use through software scheduling of virtualized entities. Virtualization is not a cutting-edge concept; the virtualization technology of FPGAs and CPUs is already very mature. For example, Intel and AMD have proposed the VT-x and AMD-V technologies respectively. In essence, they add virtualization-specific instructions to the x86 instruction set, support these instructions in the microarchitecture, and add CPU running modes that support the virtualized state. The main purpose of CPU virtualization is to safely run programs compiled for other instruction set architectures together with their operating systems. Virtualization of SDCs is closer to that of FPGAs: its main purpose is to improve ease of use. Specifically, it provides a layer of indirection between software and hardware. The application software only needs to be programmed against the virtual model, without considering how the virtual layer is implemented on the specific hardware. This is different from the programming model design discussed in Sect. 3.1.1: the layer of indirection introduced by virtualization is not the software/hardware interface itself but a complete, albeit virtual, hardware model of the SDC, and programming against this layer requires the programming model abstracted from it. At present, there are many different hardware implementations of SDCs, which often use different computing models and execution models. The current situation is that when the hardware architecture changes, the software has to be rewritten, which is a great impediment to the development of SDCs. This is partly because there are no widely adopted commercial SDC products, partly because commonly accepted performance metrics, fundamental research platforms and compilation methods are lacking, and moreover because paradigms and standards for virtualizing SDCs are lacking.


Fig. 3.1 Virtualization and related support system of SDCs

Virtualizing the SDC and letting compilation replace manual re-implementation for different hardware architectures can better solve the problems of long development cycles, heavy workloads and unnecessary engineering. As shown in Fig. 3.1, virtualization needs to unify the various hardware implementations of the SDC into one abstract model, and the application only needs to be programmed against this model. The unified model can be naturally integrated into the operating system, which can then dynamically schedule and execute configurations that fit the model. An application written against the unified SDC model is compiled into configuration information by the compiler and the other tools of the specific hardware architecture, and finally dynamically scheduled for execution on the hardware. However, virtualization of SDCs faces many challenges. First, it is not easy to unify the different SDC architectures found in academic research and engineering applications into a single computing model. FPGA virtualization has been widely studied since the 1990s, but comparable SDC research is still lacking. At present, SDC control strategies, interfaces, PE functions, memory systems and even interconnects all differ, and it seems difficult to capture them in a single model. The second challenge is compilation. The SDC has complex hardware, and dynamic compilation is not widely accepted because of its high hardware overhead; today's compilers rely mainly on static compilation. Therefore, when the SDC is virtualized, how to map the virtualized model onto the specific hardware is also a big challenge. In today's SDC systems, this work is mainly done by the static compiler. This is not a wise approach, because the static compiler of an SDC is already too complex to also take on virtualization, leading to low utilization and poor performance at run time.
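To make the idea of a layer of indirection concrete, the following sketch shows, under invented interfaces (VirtualSDC, ExampleBackend and deploy are hypothetical names, not a standard SDC virtualization API), an abstract virtual SDC model that applications program against, with backends that lower it to a specific hardware architecture:

```python
from abc import ABC, abstractmethod

class VirtualSDC(ABC):
    """Abstract (virtual) SDC model that applications program against."""

    @abstractmethod
    def map_kernel(self, dataflow_graph):
        """Lower a hardware-independent dataflow graph to configuration information."""


class ExampleBackend(VirtualSDC):
    """Stand-in for one concrete architecture's toolchain (purely illustrative)."""

    def map_kernel(self, dataflow_graph):
        # A real backend would run placement, routing and configuration generation here.
        return {"configuration": list(dataflow_graph), "target": "example-architecture"}


def deploy(app_graph, backend: VirtualSDC):
    # The application is written once against VirtualSDC; when the underlying
    # hardware architecture changes, only the backend is replaced, not the application.
    return backend.map_kernel(app_graph)


config = deploy(["load", "mul", "add", "store"], ExampleBackend())
```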


The ease of use of SDCs needs to be guaranteed by virtualization technology. However, as described in this section, SDC virtualization still suffers from difficult abstraction, poor performance and the high cost of scheduling hardware from software. How to efficiently realize virtualization that combines software scheduling with hardware optimization has therefore become an important challenge for SDCs.

3.6.2 Prospects on Development Trend

Section 3.1 discussed three main challenges, or obstacles, faced by the development of SDCs. This section discusses how SDCs are developing with respect to these three challenges and looks forward to future development trends. First, for the problem of flexibility, SDCs do not yet have an acknowledged solution, but because an FPGA can be viewed as a fine-grained reconfigurable SDC, its more mature solutions and research ideas for flexibility are worth learning from. It can be concluded that the flexibility of SDCs will in the future depend on application-driven software and hardware co-design. In terms of efficiency, exploiting parallelism at all levels and using speculation is the cutting-edge trend in computing models; in addition, combination with emerging memory processes such as three-dimensional stacked DRAM is a hot research direction. In terms of ease of use, the virtualization of SDCs is a field that still needs to be developed and studied further; here, too, ideas from FPGA virtualization schemes can be used for reference. In addition, owing to the reconfigurability of the SDC, the hardware can dynamically optimize for its tasks and perform online self-training, which helps not only the ease of use of the SDC but also its efficiency and flexibility. This section looks at the development trends of flexibility, efficiency and ease of use in turn.

3.6.2.1 Application-Driven Software and Hardware Co-Design

Application-driven means that the architecture design of the SDC is optimized for specific applications and iterated rapidly. Nowadays, most SDCs are not designed to be general-purpose computing chips; positioned between ASICs and FPGAs, mainstream SDCs are application-driven domain accelerators. Early SDC design flows often used an application-specific iterative architecture optimization approach similar to that used for ASICs [8–10]. In the ASIC design flow, for a specific application written in a high-level language, tools are used to analyze its features and the hot spots or hot regions in its execution; the architecture is then designed to exploit parallelism in the parts that are easy to optimize, and results are obtained after compilation and simulation. These results are combined with the application's features in the next iteration to analyze unresolved hot spots or new bottlenecks, and the architecture design is iterated again. This process is very effective for optimizing fixed applications and converges quickly in most cases.
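The iterative flow described above can be summarized in pseudocode. The sketch below is only a schematic rendering of such a loop; the profile, exploit_parallelism and simulate functions are toy placeholders invented here, not tools named in this book:

```python
def profile(application, architecture):
    """Placeholder profiler: returns the application's hot regions (illustrative)."""
    return [region for region in application if region.get("hot")]

def exploit_parallelism(architecture, hotspots):
    """Placeholder optimizer: widen the architecture for each remaining hot spot."""
    return {**architecture, "lanes": architecture["lanes"] + len(hotspots)}

def simulate(application, architecture):
    """Placeholder evaluation: a toy performance model standing in for compile + simulate."""
    return architecture["lanes"] / 2.0

def iterate_architecture(application, architecture, max_iters=10, target_speedup=2.0):
    """Schematic application-driven iteration: profile, optimize, evaluate, repeat."""
    for _ in range(max_iters):
        hotspots = profile(application, architecture)
        if not hotspots:
            break
        architecture = exploit_parallelism(architecture, hotspots)
        if simulate(application, architecture) >= target_speedup:
            break   # stopping requires a trusted performance criterion, as discussed below
    return architecture

app = [{"name": "loop1", "hot": True}, {"name": "io", "hot": False}]
print(iterate_architecture(app, {"lanes": 2}))
```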


It is very suitable for ASIC design because an ASIC has a very large hardware design space and can be optimized for any hot spot or application-specific performance bottleneck; each iteration may discover new hot spots, so each step can deliver a relatively large performance improvement and the final performance is very good. When the hardware design space is large, an optimization loop that requires only a small amount of human participation is appropriate. However, the design space of an SDC is not as large as that of an ASIC, and its computation and execution modes are regular. In this case, the benefits of adopting the same design process no longer outweigh the problems it brings. For example, the automated application-specific analysis and optimization method requires a performance criterion to decide when to stop iterating, but for early SDCs there was no reliable performance criterion for judging the strengths and weaknesses of a design; many designs, for instance, gave no clear description of which benchmarks their design flow referenced [11–14]. Because SDCs usually target application domains rather than a specific algorithm, it is often difficult to find a benchmark with similar domain coverage, and if the performance benchmark is chosen poorly, the final optimization result may not be applicable to the target field. In addition, although this process can eventually achieve high performance for an SDC, its efficiency is often low because of the long iteration cycle. Compared with an ASIC, the hardware architecture and computing mode of an SDC are relatively fixed and the design space to be explored is not so broad; if manual heuristics and top-level design are used, productivity and efficiency will be much higher than with an automated process. The software and hardware design process of the SDC should be application-specific, and integrating human intelligence into this process is an effective way to improve design productivity and efficiency. The main way to do so is to integrate the programming model into the whole SDC design process and treat it as a central object of software/hardware co-design, as shown in Fig. 3.2. The software flow runs from "software application" to "executable code", and the hardware flow runs from "engineer" to "architecture description". On the one hand, the programming model guides how the software application is programmed for the hardware; on the other hand, it imposes requirements on the hardware design and guides the design of the computation model and execution model beneath it as well as the final architecture implementation. The virtualization template provides the abstract description for SDC virtualization; in forming the architecture description, it greatly helps the ease of use of the SDC. The specific virtualization methods will be discussed in Sect. 3.2.3. As can be seen in Fig. 3.2, the programming model occupies a central position in the software/hardware co-design process: it must extract the application's features for the design, and it must also make full use of the potential of the hardware.
In such a software and hardware co-design process, different from that of the ASIC, ease of use and flexibility are greatly enhanced and the development cycle is shortened, because engineers participate in the design of the programming model.


Fig. 3.2 Software and hardware co-design integrated with programming models (see color diagram)

Although Sect. 3.1.1 noted that there are still many challenges in designing programming models for SDCs, we believe that the design and study of programming models is an urgent problem and one of the development trends of SDCs. The same trend has appeared earlier in the FPGA field. As mentioned above, an FPGA can be regarded as a fine-grained SDC, so this section draws on representative published research from that field; it is worth noting that new results on FPGA programming model design are still being published from time to time. Several examples illustrate the FPGA design process. Reference [15] proposes a high-level DSL for FPGAs, which optimizes common computing patterns such as map, reduce and enumeration for the FPGA hardware and integrates them into the DSL. After learning this programming model, programmers can directly generate correspondingly optimized hardware modules for these computing patterns. The idea is similar to designing a domain-specific instruction set for a CPU, except that the programming model here generates hardware instead of calling specific hardware functions at execution time. On this basis, References [16–18] optimize the way hardware modules are compiled and the method of exploring the design space, and propose new programming models and design frameworks. These programming models consider new functions, such as thread prediction, and can provide higher performance when utilized effectively. These research efforts incorporate the programming model into the FPGA design process. This design process differs from the automated application-driven process: the programming model lets humans play an active role in the design iteration and makes it possible to use predefined hardware modules directly. Following the FPGA's example, the SDC can also introduce the programming model into its design process; it can then be developed more quickly and effectively and will be more flexible and easier to use.
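As a flavor of what such pattern-based programming looks like, the sketch below expresses the map and reduce patterns as higher-order functions that a hardware generator could lower to dedicated modules. This is a generic illustration in Python syntax, not the DSL of Reference [15]:

```python
from functools import reduce as _reduce

# Each pattern below is a candidate for a dedicated, pre-optimized hardware module:
# 'map' becomes an array of parallel PEs, 'reduce' becomes an operator tree.

def pattern_map(f, xs):
    """map pattern: apply f to every element independently (data-level parallelism)."""
    return [f(x) for x in xs]

def pattern_reduce(op, xs, init):
    """reduce pattern: combine elements with an associative operator (a tree in hardware)."""
    return _reduce(op, xs, init)

# A dot product written purely in terms of the two patterns:
def dot(a, b):
    products = pattern_map(lambda p: p[0] * p[1], zip(a, b))
    return pattern_reduce(lambda x, y: x + y, products, 0)

print(dot([1, 2, 3], [4, 5, 6]))   # 32
```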


SDCs have some development-friendly features that should be fully utilized in their programming model design, some of which have not been considered in the programming model design of FPGAs. They are briefly described and analyzed below.

3.7 Independent Task-Level Parallelism

Communication between PEs in an SDC relies on the interconnect, and execution relies on explicit compilation, which places high demands on compiler design. When there is no or little data exchange between tasks, however, the SDC can easily map different tasks onto PEAs simultaneously or sequentially. For example, an SDC supporting the MCMD computing model can execute the configurations of multiple tasks in the same PEA at the same moment. Since this form of parallelism is relatively easy to implement on the computing architecture of the SDC, the programming model should be designed to support it.

3.8 Data-Level Parallelism

Data-level parallelism is very common in FPGAs, CPUs, GPUs and memory, because the data they process very often carry no dependences between elements, whether in compute-intensive or data-intensive applications. The SDC has a spatial computing architecture with many PEs; it supports data-level parallelism almost naturally and in a relatively fixed pattern. For example, SDCs using the SCMD model can execute multiple data computations in the same PEA at the same time. Data-level parallelism should therefore not be ignored when defining the programming model of the chip.

3.9 Bit-Level Parallelism

The SDC is a coarse-grained computing architecture and often does not support bit-level operations. However, if some fine-grained units are integrated into the SDC, the bit-level parallel optimization opportunities that are common in fields such as cryptography can be better exploited. If the SDC is applied to cryptography, the implementation of bit-level parallelism should not be ignored in the design of its programming model.
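A simple way to see bit-level parallelism is bit-slicing, where one word-wide logic operation acts on many independent one-bit values at once. The example below is a generic illustration, not tied to any particular cipher:

```python
# Pack 64 independent 1-bit values into one 64-bit word; a single bitwise
# operation then performs 64 bit-level operations in parallel.
MASK64 = (1 << 64) - 1

def bitsliced_xor_and(a_bits, b_bits, c_bits):
    """Compute (a XOR b) AND c for 64 independent bit positions at once."""
    return ((a_bits ^ b_bits) & c_bits) & MASK64

a = 0xFFFF0000FFFF0000
b = 0x00FF00FF00FF00FF
c = 0xF0F0F0F0F0F0F0F0
print(hex(bitsliced_xor_and(a, b, c)))
```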


3.10 Optimization of Memory Access Patterns

The memory access patterns of applications often differ greatly, which is very expensive for DRAM, which is optimized for sequential reads and writes. The SDC is a two-dimensional computing architecture, and each PE can access memory independently. If the programming model of the SDC can explicitly exploit the memory access characteristics of different applications, use the spatial layout to achieve sequential access, and place data as close as possible to the computation, then the memory bandwidth requirement can be effectively minimized. This is of great significance given today's limited memory bandwidth. Existing work on SDC design flows has given some consideration to these aspects. For example, Reference [19] proposed a streaming data flow model; its design flow also starts from an analysis of the application field, summarizes the memory access, instruction access and computing characteristics of the applications, puts forward a corresponding execution model to optimize them, and then abstracts the required programming model from that execution model. This mainly provides an optimization scheme for memory access characteristics. Reference [19] also considered the mapping problem of the parallel programming model and mapped some loops or nested patterns onto the PEA using multi-stage pipelining, which is an implementation of task-level and data-level parallelism. Research on SDC programming models is still scarce and needs to be improved; at the same time, application-driven hardware and software design using programming models will continue to mature, which is one of the future development trends of SDCs.
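To illustrate why access patterns matter, the sketch below contrasts a strided traversal with a blocked (tiled) traversal of the same matrix; tiling keeps data near the computation and turns scattered accesses into sequential, burst-friendly ones. This is a generic locality example, not a scheme from Reference [19]:

```python
def column_sum_strided(matrix):
    """Column-major traversal of a row-major matrix: large strides, poor for DRAM bursts."""
    rows, cols = len(matrix), len(matrix[0])
    return [sum(matrix[r][c] for r in range(rows)) for c in range(cols)]

def column_sum_tiled(matrix, tile=4):
    """Blocked traversal: each tile of rows is walked sequentially, improving locality."""
    rows, cols = len(matrix), len(matrix[0])
    sums = [0] * cols
    for r0 in range(0, rows, tile):
        for r in range(r0, min(r0 + tile, rows)):   # sequential within the tile
            row = matrix[r]
            for c in range(cols):
                sums[c] += row[c]
    return sums

m = [[r * 10 + c for c in range(8)] for r in range(8)]
assert column_sum_strided(m) == column_sum_tiled(m)
```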

3.10.1 Multi-Level Parallelism Design for In-/Near-Memory Computing

Section 3.2.1 described the different levels of parallelism and the challenges of enabling them in SDCs. Some forms of parallelism, such as data-level and task-level parallelism, are easy to implement in an SDC, while speculation parallelism may greatly improve performance but is not easy to implement. This section discusses how existing research exploits the different levels of parallelism in SDCs and what the development trends are for improving SDC parallelism.


3.11 Implementation of Instruction-Level Parallelism in SDCs

The pipeline is one of the most important implementations of instruction-level parallelism in general-purpose processors. However, pipelining is not easy to implement in the two-dimensional spatial computing architecture of the SDC, because spatial computation is not a sequential process: it cannot be pipelined simply by adding some registers and control logic, not to mention that the SDC must support dynamic reconfiguration, which requires the pipeline itself to be reconfigurable. Although the hardware can hardly absorb the idea of pipelining directly, some instruction-level parallelism can still be exploited with software pipelining techniques; for example, References [20–23] explore the use of software pipelining to unroll constructs such as loops, which can eventually be mapped onto an SDC. Instruction-level parallelism in the SDC is exploited through both static compilation and dynamic scheduling, and the two are integrated and supported together; this differs from VLIW, which depends mainly on compilation, and from out-of-order processors, which rely mainly on dynamic scheduling. Regarding static compilation, a VLIW or superscalar machine is essentially a one-dimensional spatial computing architecture that can process multiple different instructions at the same time, while the SDC is a more complex two-dimensional spatial architecture to which the same idea can be applied: a configuration of an SDC is the spatial implementation of an instruction sequence, in essence instruction-level parallelism that trades space for time. In addition, the data flow method can be used in the SDC. The software is first statically compiled into a data flow graph, and the hardware then manages the data dependences and the execution order of operations: an operation can proceed whenever its data are ready. This dynamic data flow method is similar to out-of-order execution, but out-of-order execution in a processor works on instructions, while the data flow method works on the data flow graph compiled from the instructions; in essence, both implement instruction-level parallelism through dynamic scheduling. The data flow method is widely used in SDCs [3, 19, 24, 25]; its main purpose is to eliminate unnecessary control overhead and make computation data-centric.
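The dynamic data flow idea can be sketched as a simple firing rule: any node whose operands have all arrived may execute, regardless of program order. The toy interpreter below assumes a graph format invented here purely for illustration:

```python
def run_dataflow(nodes, initial_values):
    """Toy dynamic-dataflow interpreter: fire any node whose inputs are all ready."""
    values = dict(initial_values)            # token storage: name -> value
    pending = dict(nodes)                    # name -> (operation, [input names])
    while pending:
        ready = [n for n, (_, ins) in pending.items() if all(i in values for i in ins)]
        if not ready:
            raise RuntimeError("deadlock: unresolved dependences")
        for n in ready:                      # these could all fire in parallel in hardware
            op, ins = pending.pop(n)
            values[n] = op(*[values[i] for i in ins])
    return values

graph = {
    "t1": (lambda a, b: a + b, ["x", "y"]),
    "t2": (lambda a, b: a * b, ["x", "y"]),
    "out": (lambda a, b: a - b, ["t1", "t2"]),
}
print(run_dataflow(graph, {"x": 3, "y": 4})["out"])   # (3+4) - (3*4) = -5
```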

3.12 Implementation of Data-Level Parallelism in SDCs

In fact, the SDC is better suited to MIMD-style computation and execution, because a configuration running on its two-dimensional computing architecture can be regarded as the spatial mapping of a group of instructions, and it is easier to execute several different groups of instructions at the same time in the two-dimensional space. Nevertheless, some SDC research also explores how to implement SIMD on it.


As mentioned earlier, SIMD is the standard implementation of data-level parallelism and the foundational technique of GPGPU computing. SCMD can be used in the SDC to realize data-level parallelism and thus reduce the energy and area cost incurred by instruction processing [5, 26].

3.13 Implementation of Task-Level Parallelism in SDCs

Task-level parallelism requires the asynchronous execution of different, unrelated tasks, but the PEA in many SDCs does not itself support asynchronous execution. This is because a synchronously operating PEA only needs to load one configuration to achieve good energy efficiency, hardware utilization and ease of programming. Such an SDC controls the chip in a centralized management mode, which is effective and energy-efficient but does not support task-level parallelism [27–29]. A synchronously operating PEA cannot provide the asynchrony required by task-level parallelism, which causes synchronization conflicts and many problems in inter-task communication. Of course, a single PEA can also be treated as a processor core and multiple PEAs integrated in one SDC to realize coarse-grained task-level parallelism, an idea similar to that of multi-core processors; Reference [30] is a typical example. Also as in multi-core processors, the bottleneck of coarse-grained task-level parallelism is that PEA utilization may not be high: on the one hand, SDCs are often domain-customized and the generality of a PEA is not as high as that of a CPU; on the other hand, the learning curve of software-managed task-level parallelism on an SDC is steep. Fine-grained task-level parallelism inside a PEA is not impossible either: with the data flow method, not only instruction-level parallelism but also fine-grained task-level parallelism can be realized. There are two data flow approaches: one forms a static data flow graph entirely at compile time; the other responds and executes dynamically through run-time scheduling and detection. Both can achieve task-level parallelism. With static data flow, the data flow graphs of multiple tasks can be compiled together so that different tasks share a PEA in space and are then mapped onto the PEA for computation [3]. This is effective when the data dependences of the different tasks can be expressed statically and explicitly, but it is not applicable when ambiguous memory data dependences are present that static compilation cannot resolve. The dynamic data flow method supports interleaving different data flow graphs in time on top of the static basis, and dynamic scheduling can make better use of task-level parallelism; see Reference [25] for research in this direction. On the other hand, task-level parallel programming on a CPU usually has to be implemented with the help of MPI or OpenMP, and communication between different tasks is a major bottleneck: in hardware, tasks in the processor can only communicate through shared memory, and the limited on-chip SRAM serves only as a cache and is not optimized for communication. Communication between different tasks on an SDC, by contrast, can be realized in many ways.


In addition to shared memory, SDCs also provide more diverse communication methods, such as on-chip distributed FIFOs or communication through circuit-switched on-chip interconnects. These communication methods offer much higher energy efficiency and performance than shared memory and are an essential part of task-level parallelism in the SDC. Although such explicit communication places much higher demands on programmers and compilers, the SDC's support for these mechanisms provides the freedom to choose different task-level communication modes for different situations and applications, thereby enlarging the space for design and optimization.
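The explicit, FIFO-based communication style can be sketched with a software analogue: two tasks connected by a bounded queue instead of shared memory. This is a generic producer/consumer illustration, not a specific SDC interconnect protocol:

```python
import queue
import threading

# Two tasks communicate only through an explicit bounded FIFO, mirroring the
# distributed on-chip FIFOs an SDC can provide instead of shared memory.
fifo = queue.Queue(maxsize=8)

def producer(n):
    for i in range(n):
        fifo.put(i * i)        # blocks when the FIFO is full (backpressure)
    fifo.put(None)             # end-of-stream token

def consumer(results):
    while True:
        item = fifo.get()      # blocks when the FIFO is empty
        if item is None:
            break
        results.append(item + 1)

results = []
t1 = threading.Thread(target=producer, args=(16,))
t2 = threading.Thread(target=consumer, args=(results,))
t1.start(); t2.start(); t1.join(); t2.join()
print(results[:4])             # [1, 2, 5, 10]
```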

3.14 Implementation of Speculation Parallelism in SDCs

Speculation parallelism executes instructions in advance even when they depend on results that are not yet known; if the prediction fails, the completed operations are rolled back and the correct instructions are executed again. Speculation parallelism can be combined with instruction-level, data-level and even task-level parallelism to jointly improve the parallelism and performance of the system, but additional hardware support must be provided. For example, extra hardware units are required to detect when the predicate of a predicated operation arrives, and the predicated computation must be buffered to isolate it from normal operations and committed only when the predicate arrives; likewise, rollback of completed operations must be supported when a prediction fails, and if memory operations are involved, additional memory access latency is introduced. The out-of-order processor is the most classic case of combining instruction-level parallelism with speculation parallelism. In terms of task-level parallelism, dependent threads or tasks in a multitasking system can likewise be executed speculatively; as long as rollback is guaranteed and the misprediction rate is low, speculation generally improves performance. There are three main ways to realize speculation parallelism in the computing and control architecture of an SDC. Although there is not much related research so far, these are the only paths to exploiting speculation parallelism in SDCs, so they represent an important future development trend. The three implementations are described below. Figure 3.3 shows a simple control flow: the judgment operation A determines whether the next step is to compute B or C, and the result D is finally obtained. Speculation parallelism executes B or C by default and rolls back in case of a misprediction.


Fig. 3.3 Schematic diagram of a predictable control flow

The first implementation of speculation parallelism, shown in Fig. 3.4, places it in the host controller. The host controller (usually a general-purpose processor or a finite state machine) statically compiles the predicted operations into a sequence of configuration packages and then maps them onto the PEA for execution. In the case of Fig. 3.4, if A is predicted to be true, B is executed in advance. If the prediction succeeds, the subsequent operations simply continue. If the prediction fails and A's judgment is actually false, the incorrect configuration must be flushed and the correct configuration reloaded, i.e., the configuration of C is executed and this part of the computation is performed again. At the same time, since the speculative configuration has already been executed, the operations it performed must be undone.

Fig. 3.4 Speculation parallelism based on the host controller


Figure 3.5 shows another possibility: implementing SP inside the PEA. Two different paths are compiled into one configuration package and loaded into the PEA, so that both paths execute at the same time: A, B and C all execute simultaneously, one of B and C is necessarily the correct prediction, and when A produces its result one of the two is selected. The performance advantage is that if the operation time of B and C is shorter than that of A, their latency can be completely hidden by SP. Moreover, SP inside the PEA needs no configuration reloading, which reduces the cost of misprediction [31, 32]; in fact this method incurs no misprediction loss at all, because no operation has to be revoked and the correct result is always produced. This enumerate-then-select idea is widely used in digital circuit design; for example, a carry-select adder computes the result for both possible carry values and then selects the correct one once the carry is known. Broadly speaking, this is also a space-for-time trade: all possibilities must be enumerated. When few cases need to be enumerated this method has advantages; otherwise the gain is not worth the cost. On the one hand, the PEA cannot map too many computing blocks that need to be enumerated at the same time (in Fig. 3.5, B and C must both reside on the PEA simultaneously); on the other hand, if too many cases are enumerated, the fraction of power and area spent on the computation that turns out to be correct shrinks, and PEA utilization becomes very low. A further limitation is that this method cannot realize reverse control dependences. Therefore, the speculation parallelism provided in this way is relatively limited, and it may work better when the application field is restricted.

Fig. 3.5 Speculation parallelism in PE array

Figure 3.6 shows a method of implementing speculation parallelism inside the PEs. This requires the PEs to be autonomous: there is no external host controller managing the PEA, and each PE has its own control and decision-making capability, i.e., a distributed control mode, as in TIA [33]. In this case, B and A are executed simultaneously, and when the correct path turns out to be C, only B is switched to C; of course, the side effects of B's execution must be eliminated at the same time. Compared with the host-controller-based speculation of Fig. 3.4, speculation based on the PEs themselves does not require switching the overall configuration.


Fig. 3.6 Speculation parallelism in autonomous PEs

If the external host controller can execute multiple configurations on the PEA at the same time to achieve fine-grained configuration parallelism, the fine-grained reconfiguration speculation shown in Fig. 3.6 can also be realized on that basis. Existing research on speculation parallelism in SDCs falls into these three levels. References [31] and [34] propose partial predicated execution, which allows branches to execute at the same time as the branch decision, but the operations of the mispredicted branch have to be confined to the current array. Reference [35] integrates branch predictors into the computing blocks so that each block can make its own predictions, which is similar to implementing speculation parallelism in autonomous PEs. In addition, Reference [36] uses the forwarding-control capability of the PEA to realize speculation parallelism within a single configuration, which is exactly the method of Fig. 3.5. In general, speculation inside the PEA is the most effective of the three. However, it differs essentially from the other two: it executes the wrong and correct branches at the same time, trading space for time, at the cost of higher power consumption and lower hardware utilization. The other two approaches are efficient only when prediction accuracy is high, so that few wrong branches are executed. Moreover, since no single strategy can deliver good gains in a more general context, speculation in the PEA and speculation in autonomous PEs need to be explored and studied together; selectively applying different levels of speculation according to the characteristics of different computing blocks, such as whether the prediction accuracy is high or low and whether the prediction outcome is biased, is a promising direction in the design space. Finally, speculation parallelism also requires an efficient memory system, because the direct consequence of a misprediction is that data in memory must be repaired. Exploring more efficient implementations of speculation parallelism is one of the important development trends of SDCs.
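The PEA-internal method of Fig. 3.5 amounts to computing both branch bodies and selecting one result when the predicate arrives. The following sketch shows that select-based structure in software form, with toy lambda operations standing in for configured PE blocks (illustrative only):

```python
def speculate_both_paths(a_inputs, b_op, c_op, a_op, data):
    """Enumerate-and-select speculation: B and C run alongside A, and A's
    result merely picks one of the two already-computed values."""
    result_b = b_op(data)        # in hardware, B, C and A occupy the PEA concurrently
    result_c = c_op(data)
    take_b = a_op(*a_inputs)     # the branch decision arrives last
    d = result_b if take_b else result_c   # a multiplexer; nothing is rolled back
    return d

# Toy example: A checks a threshold, B doubles, C negates.
print(speculate_both_paths((7, 5), lambda x: 2 * x, lambda x: -x, lambda p, q: p > q, 10))  # 20
```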


3.15 Efficiency of Memory in the SDC

In today's big data era, the storage space occupied by data keeps growing, and more and more data must be processed per unit time. Because the storage capacity of the processor is limited, large amounts of data have to be moved from main memory to the processor for computation and the results moved back to main memory. If the computation performed on the data is relatively simple, the data processing capability of the whole system is limited by the bandwidth between main memory and processor. This is the increasingly serious "memory wall" problem [37]. No matter what the computing architecture or form, once its data processing throughput reaches a certain level it inevitably faces this problem, and the SDC is no exception. At present, there are four main approaches to the "memory wall" problem. The first is to use SRAM to optimize the system. The cache in a general-purpose processor's memory system is a very classic design: it copies data segments with good spatial locality or reusability from main memory into a small, fast on-chip SRAM on first use. Since SRAM has lower latency and higher bandwidth than main-memory DRAM, caching gives the system a huge performance bonus when data can be reused, and a considerable proportion of the area of a modern general-purpose processor is SRAM for caches. But caching has its limitations: if a piece of data is used only once by the processor, the cache brings no benefit and even increases the power consumption of the whole system because of the redundant data movement; in fact, the cache is one of the largest power consumers in modern processors. Another way to use SRAM is to expose an on-chip SRAM space that programs can address, giving programmers a way to use this efficient storage explicitly, which turns the optimization problem into a programming problem and reduces hardware cost. For example, in a GPGPU a group of stream processors shares a fast on-chip SRAM block, called shared memory, which CUDA programs can index and access explicitly through qualifiers. Providing an explicit interface draws on the wisdom of algorithm designers and lets them decide when to use this high-performance storage, but it still does not help when data reusability is very poor. Another classic method is memory compression, which serves two purposes: first, less frequently used main-memory data can be compressed to save main memory space, similar to reserving part of a hard disk as swap space; second, only compressed data needs to be transferred between the processor and main memory, reducing the demand on memory bandwidth. There are many compression schemes, such as Huffman coding and arithmetic coding, which eliminate redundancy in the data according to its characteristics in order to approach the Shannon limit.
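As a minimal illustration of the compression idea, the sketch below uses run-length encoding, chosen here only for brevity; real systems use stronger codes such as Huffman or arithmetic coding:

```python
def rle_encode(data: bytes) -> list:
    """Run-length encode a byte string into (value, count) pairs."""
    encoded = []
    for b in data:
        if encoded and encoded[-1][0] == b and encoded[-1][1] < 255:
            encoded[-1][1] += 1
        else:
            encoded.append([b, 1])
    return encoded

def rle_decode(encoded: list) -> bytes:
    """Inverse transform; for an SDC this step would correspond to online
    hardware decoding of compressed configuration information."""
    return bytes(b for b, count in encoded for _ in range(count))

config = bytes([0x00] * 40 + [0x3A, 0x3A, 0x07] + [0xFF] * 20)
packed = rle_encode(config)
assert rle_decode(packed) == config
print(len(config), "bytes ->", 2 * len(packed), "bytes")   # fewer bytes to transfer
```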


As for the SDC, References [38–40] discuss compressing the SDC's configuration information in advance: during execution only the compressed configurations are transferred to the chip, and the original configuration information is recovered through online hardware decoding. However, online decoding is not an easy operation and places demanding requirements on the hardware; more complex codes usually compress better but are harder to decode. Memory compression is therefore not yet common in SDCs, although it has become one of the industrial standards in today's general-purpose processors. The third method is to use more advanced integrated circuit technology to manufacture memory closer to the computation. One approach is embedded DRAM (eDRAM), i.e., DRAM implemented in a CMOS logic process to provide large on-chip memory. In high-performance servers many commercial products have adopted eDRAM; IBM's z-series processors, for example, are loyal supporters of it. The z15 integrates 256 MB of eDRAM as L3 cache on the computing chip, and a separate interconnect chip devotes the vast majority of its area to 960 MB of eDRAM for a shared L4 cache [41]. Although eDRAM provides more memory space than on-chip SRAM, its capacity per unit area is much lower than that of DRAM chips, which are manufactured in mature processes optimized for capacity over decades, because eDRAM is built in a CMOS logic process. Due to this capacity constraint, eDRAM cannot substitute for external modular main memory. In addition, DRAM must be refreshed constantly to retain data, which poses a significant challenge to the power consumption and heat dissipation of the computing chip; Reference [42] specifically addresses application-aware optimization of eDRAM refresh. Today eDRAM is mostly used as a cache, as in the IBM z-series processors. Few studies so far have discussed whether eDRAM can improve performance or alleviate the "memory wall" problem in SDCs. Considering that SDCs, although they rarely use caches, do integrate explicitly accessible scratchpads or distributed SRAM FIFOs that also face capacity bottlenecks, the use of eDRAM may improve SDC performance in some application fields. Besides eDRAM, 3D stacking is another new process that brings memory closer to computation. Micron's HMC and the high bandwidth memory standards adopted by JEDEC (HBM, HBM2 and HBM2E) all stack multiple memory dies on top of each other. Specifically, three-dimensional stacking uses through-silicon vias (TSVs) to stack multiple memory dies into a three-dimensional chipset. This high-capacity chipset can be stacked directly on a logic die (as in HMC) or connected to the logic die through a silicon interposer with interconnect (as in HBM), shortening the physical distance from memory to computation and greatly increasing bandwidth. While HMC has since been abandoned by Micron, HBM is widely used in high-performance GPUs; AMD's RX Vega series, for example, uses HBM2 to provide large bandwidth. In 2019, SK Hynix announced that its HBM2E product stacks eight memory dies to provide a total capacity of 16 GB and a bandwidth of up to 460 GB/s.


In addition to GPUs, high-performance domain accelerators such as Google's TPUv2 and TPUv3 for neural network training have switched from the DDR3 used by the original TPU to HBM [43], because the main memory bandwidth demanded by neural network training is far higher than what the original TPU provided. For SDCs, using three-dimensional stacked memory such as HBM is likewise a very direct and effective way to boost bandwidth. The 3D stacking process also enables near-memory computing, in which some computing cores are placed closer to the memory. HRL [44] stacks a logic die at the bottom of a 3D-stacked memory chip, mixing coarse-grained and fine-grained reconfigurable PEs with multiplexer units for branch-select outputs. NDA [45] explores using TSVs to stack commercially available conventional DDR or LPDDR (low-power DDR) with SDCs, analyzing performance and performing design space exploration. However, these studies do not consider communication and synchronization between different stacks, nor do they discuss in depth the possibility of using the wider communication bit widths that TSVs could support. The fourth method is to compute directly with the memory cells, which can fundamentally solve the memory bandwidth problem, because the number of bits needed to transfer computational control is generally negligible compared with the number of bits needed to transfer the data themselves. This idea was already proposed in the twentieth century [46] and was an academic hot spot for some time, but it was difficult to realize given the hardware and process limitations of the day; moreover, because Moore's law still delivered considerable performance gains, the "memory wall" problem was not yet severe, the idea was not adopted by industry, and it was shelved by academia. In recent years, as Moore's law has faded and emerging applications with massive data processing demands, such as neural networks and bioinformatics, have appeared, this research direction has been revitalized. It has been found that computation can be carried out in many kinds of memory device arrays, collectively called PIM: logic operations can be realized in DRAM [47, 48], and computation can be performed in SRAM [49–51], in spin-transfer-torque magnetoresistive random access memory (STT-MRAM) [52], in ReRAM [53], and even in phase-change memory (PRAM) [54]. The optimization of memory systems in SDCs has focused on interface design and on integration with new processes. For interface design there are two opportunities. On the one hand, for applications with regular memory access patterns, such as neural networks and image processing, a streaming memory interface can be designed to improve memory parallelism and utilization; DySER [28] and Eyeriss [56] are examples of streaming access interfaces. On the other hand, the accesses of most applications are fragmented and hard to coalesce; in this case memory access can be exposed explicitly in the instruction set so that programmers can improve memory utilization through manual static optimization [57, 58]. In addition, new memory technologies bring new opportunities to the SDC; for example, HRL connects the SDC and a DRAM die through TSVs, which can greatly increase the memory bandwidth of the system and accelerate applications.


Generally speaking, a multi-level parallel architecture combined with in-/near-memory computing is the inevitable trend in the development of SDCs.

3.15.1 Software-Transparent Hardware Dynamic Optimization Design

Today's SDCs mostly rely on static compilation, but a dynamically compiled architecture that can optimize the data flow at run time may be more efficient for today's mainstream applications. A development trend of the SDC is therefore to explore runtime online hardware optimization, drawing on related research in hardware dynamic optimization. Two technologies may help realize software-transparent hardware dynamic optimization: one is the virtualization of the SDC, and the other is the use of machine learning for online training and dynamic optimization of the hardware.

3.16 Virtualization of SDCs

Virtualization is not only a guarantee of the ease of use of SDCs but also a way to achieve software-transparent dynamic optimization of the hardware. The virtual threads or processes formed after virtualizing the SDC need the operating system or an underlying runtime system to dynamically schedule them onto the hardware resources; this process is invisible to the software or application. Different hardware settings and resource allocations can be made according to the characteristics of different virtual processes, which is itself a dynamic optimization process. The exploration of virtualization for SDCs is still in its infancy, but given the similarity to FPGAs, the key techniques of FPGA virtualization design are worth learning from. The first key technology is standardization, i.e., the use of standardized hardware interfaces, software call interfaces and protocols. Standardization is not difficult to realize for SDCs, but it needs to be promoted by industry and academia. The second is the overlay. The SDC can itself be viewed as a kind of overlay on an FPGA [59, 60]: the overlay abstracts away the underlying details and makes the hardware usable without hardware programming, which is a necessary condition for agile development. Although overlays for SDCs have not been widely discussed, the classification of SDCs in Reference [2] can serve as a guide for exploring them. The third is the virtualization process technology. According to the dynamic scheduling and optimization strategy, a virtualized process is allocated some PEs and memory resources; the virtualized process also has to respect the software and hardware interfaces and protocols. Generally speaking, FPGAs and SDCs can be used not only as accelerators but also as independent co-processors, although they are more often designed and used as domain accelerators.


These two modes place different requirements on the hardware process and its design. Nevertheless, because the SDC can dynamically schedule its coarse-grained hardware resources, designing and executing hardware processes is easier than on an FPGA. The virtualization methods and dynamic optimization strategies required by SDCs for different functions will still differ, however. For hardware dynamic optimization, resource scheduling is the most important part of virtualization, especially since the SDC's two-dimensional spatial computing architecture and explicit data communication make scheduling very difficult. The SDC has no single predefined architectural template: it can be reconfigured at the PEA level, as in ADRES [27]; at the PE-row level, as in PipeRench [29]; or at the level of individual PEs, as in TIA [33]. These different modes correspond to different hardware resources, and it is not trivial for the runtime system to schedule and utilize them. However, if the hardware resources can be scheduled effectively to match the requirements of different virtualized processes, performance will be greatly improved while the higher-level software remains unchanged. This is the significance of hardware dynamic optimization.

3.17 Online Training by Means of Machine Learning

In the field of computer architecture, hardware for accelerating machine learning keeps emerging. At the same time, in recent years people have begun to explore whether machine learning can in turn help hardware design and system performance optimization; for some problems, machine learning offers large improvements over traditional methods. Generally speaking, machine learning can be divided into supervised learning, unsupervised learning and reinforcement learning. Supervised learning takes a large amount of labeled data as input and is suitable for problems such as finding good solutions in huge search spaces, fitting complex functional relationships, and classification; convolutional neural networks (CNNs), for example, are widely used in image recognition and computer vision, while recurrent neural networks (RNNs) are more common in speech recognition research. Unsupervised learning takes unlabeled data as input and is mainly used when labeled data are scarce. Reinforcement learning uses statistical methods to learn a mapping from the current state to the action required to optimize a specific goal, and is therefore suitable for optimizing complex systems toward specific goals. In computer architecture, there are many areas where both supervised and reinforcement learning can be useful, such as performance modeling and simulation of computing systems: because the parts of a computing system all affect each other, predicting system performance with traditional methods is often difficult and inaccurate, which makes it an area where supervised learning is appropriate.


Similarly, for design space exploration of computing architectures, the design space is large and manual exploration is extremely laborious, so supervised learning may provide guidance for hardware design and point out effective optimization directions, saving labor. The examples above concern hardware design; there are also many places where machine learning can be applied while the system is running, which is the focus of this section. In the SDC, energy optimization, interconnect performance optimization, configuration scheduling and speculative execution, and even the memory controller can all achieve dynamic hardware adaptation with the help of machine learning. DVFS dynamically adjusts power according to the workload of each hardware resource in the system and is a natural fit for reinforcement learning: the adjustment of voltage and frequency is treated as the action, and the optimization goal is system energy consumption; reinforcement learning can greatly reduce system energy [61, 62]. There are a large number of interconnections between the PEs of an SDC, and when there are many PEs and data forwarding is allowed, a network-on-chip is formed. Machine learning has many applications in computer networks, such as load balancing and traffic engineering; likewise, in on-chip networks, machine learning can be used to control network traffic better and dynamically limit the traffic generated by each node so as to achieve the highest network utilization. Moreover, the error correction system of the network-on-chip can also be improved by machine learning, with large gains in energy efficiency, delay and reliability compared with cyclic redundancy check (CRC) [63, 64]. The SDC is mostly designed and used as an application accelerator and is therefore often part of a heterogeneous system. In a heterogeneous system, if the host needs to dynamically allocate resources and schedule the tasks offloaded to the accelerator, machine learning can take the long-term impact of task allocation into account, and models can be trained online to dynamically optimize task scheduling. In addition, the PEA of the SDC contains many PEs; if the MCMD computation model is adopted, decisions must be made continually about how to allocate PEs on the array and which configurations to execute, and good decisions greatly improve overall system performance. These decisions can also be optimized through reinforcement learning. The most classic application of machine learning in hardware design is the branch predictor. New branch predictors that use a perceptron or a CNN to learn from historical decisions achieve 3% to 5% fewer mispredictions per kilo-instructions (MPKI) than the most accurate traditional two-level branch predictors [65]; the accuracy of machine-learning-based branch prediction is well beyond the best achievable with traditional methods. SDCs also need branch predictors to support speculation parallelism; as mentioned earlier, speculation parallelism is an important research direction for SDCs, and reducing the misprediction rate and the cost of misprediction is the only way to make speculation parallelism truly effective.
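A perceptron-based predictor of the kind referred to above can be sketched compactly: a weight vector is dotted with the recent branch-history bits, the sign gives the prediction, and the weights are nudged after the outcome is known. This is a simplified rendering of the perceptron predictor idea, with table sizes and threshold chosen arbitrarily here:

```python
class PerceptronPredictor:
    """Simplified perceptron branch predictor: one weight vector per table entry."""

    def __init__(self, history_len=16, entries=256, threshold=30):
        self.history = [1] * history_len                       # +1 taken, -1 not taken
        self.weights = [[0] * (history_len + 1) for _ in range(entries)]
        self.entries, self.threshold = entries, threshold

    def _output(self, pc):
        w = self.weights[pc % self.entries]
        return w[0] + sum(wi * hi for wi, hi in zip(w[1:], self.history))

    def predict(self, pc):
        return self._output(pc) >= 0                           # True = predict taken

    def update(self, pc, taken):
        w = self.weights[pc % self.entries]
        y, t = self._output(pc), 1 if taken else -1
        if (y >= 0) != taken or abs(y) <= self.threshold:      # train on miss or low confidence
            w[0] += t
            for i, hi in enumerate(self.history):
                w[i + 1] += t * hi
        self.history = self.history[1:] + [t]                  # shift in the new outcome


p = PerceptronPredictor()
guess = p.predict(0x400000)
p.update(0x400000, taken=True)
```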


Machine learning may also improve the memory controller of the SDC, and thus memory access and overall system performance. Reinforcement learning can take key factors of the memory controller, such as latency and concurrency, into account, and treat the commands issued by the memory controller as the actions of the learning agent, so as to optimize the energy consumption or performance of the memory system. In addition, Sect. 3.2.2 mentioned that SDCs require an efficient memory system and that near-memory computing is a promising research direction. SDCs contain many different PEs, and multiple SDCs can form a computing system together. How to distribute the workload to different computing locations according to the principle of near-memory computing can also be decided and optimized by machine learning. Although many aspects of the SDC can be dynamically optimized in hardware using machine learning, online training is not trivial: a high-performance machine learning model inevitably requires a lot of computing resources, which is exactly the problem machine learning accelerators are built to solve. To realize dynamic optimization in the SDC, the benefit of dynamic optimization must be balanced against the additional hardware area and power consumption required to implement it. Software-transparent hardware dynamic optimization is a cutting-edge field. As mentioned above, there are many problems to be solved in the programming model of the SDC. Using the adaptive ability of the hardware, improved performance can be obtained without changing the software. This is also an important development direction of the SDC.
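As an illustration of the kind of lightweight online learning such a controller could afford, the sketch below implements tabular Q-learning for a dynamic adaptation loop in the style of the DVFS and memory-controller examples above. The state discretization, action set, reward (negative energy) and the callback names are assumptions made for the sketch, not a published controller design.

```python
import random

# Tabular Q-learning sketch for an online hardware-adaptation policy.
# States: coarse utilization buckets; actions: e.g. voltage/frequency levels (assumed).
STATES = range(4)          # PE-array utilization quantized into 4 buckets
ACTIONS = range(3)         # three operating points
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1

Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}

def choose_action(state):
    if random.random() < EPS:                       # occasional exploration
        return random.choice(list(ACTIONS))
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    # Standard Q-learning update; the reward could be -energy or -(energy * delay).
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

# Control-loop skeleton: each epoch, observe utilization, pick an action,
# apply it, then measure the resulting energy to form the reward for learning.
def control_epoch(observe_state, apply_action, measure_energy, prev=None):
    s = observe_state()
    a = choose_action(s)
    apply_action(a)
    if prev is not None:                            # learn from the previous epoch
        ps, pa, pe = prev
        update(ps, pa, -pe, s)
    return (s, a, measure_energy())
```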

References 1. Bird S, Phansalkar A, John L K, et al. Performance characterization of spec CPU benchmarks on intel’s core microarchitecture based processor[C]//SPEC Benchmark Workshop, 2007: 1–7 2. Liu L, Zhu J, Li Z et al (2019) A survey of coarse-grained reconfigurable architecture and design: Taxonomy, challenges, and applications[J]. ACM Comput Surv 52(6):1–39 3. Sankaralingam K, Nagarajan R, Liu H et al (2004) TRIPS: A polymorphous architecture for exploiting ILP, TLP, and DLP[J]. ACM Trans Arch Code Optim 1(1):62–93 4. Park H, Park Y, Mahlke S (2009) Polymorphic pipeline array: A flexible multicore accelerator with virtualized execution for mobile multimedia applications[C]. In: Proceedings of the 42nd Annual IEEE/ACM international symposium on microarchitecture, 370–380 5. Prabhakar R, Zhang Y, Koeplinger D, et al. (2017) Plasticine: A reconfigurable architecture for parallel patterns[C]. In: The 44th annual international symposium on computer architecture, 389–402 6. Packirisamy V, Zhai A, Hsu W, et al. (2009) Exploring speculative parallelism in SPEC[C]. In: IEEE international symposium on performance analysis of systems and software, 77–88 7. Robatmili B, Li D, Esmaeilzadeh H, et al. How to implement effective prediction and forwarding for fusable dynamic multicore architectures[C]. In: The 19th International Symposium on High Performance Computer Architecture, 2013: 460–471 8. Chattopadhyay A. (2013) Ingredients of adaptability: A survey of reconfigurable processors[J]. VLSI Design


9. Karuri K, Chattopadhyay A, Chen X, et al. (2008) A design flow for architecture exploration and implementation of partially reconfigurable processors[J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 16(10): 1281–1294 10. Stripf T, Koenig R, Becker JA (2011) Novel ADL-based compiler-centric software framework for reconfigurable mixed-ISA processors[C]. In: International conference on embedded computer systems: architectures, modeling and simulation, 157–164 11. Bouwens F, Berekovic M, Kanstein A, et al. (2007) Architectural exploration of the ADRES coarse-grained reconfigurable array[C]. In: International workshop on applied reconfigurable computing, 1–13 12. Chin S A, Sakamoto N, Rui A, et al. (2017) CGRA-ME: A unified framework for CGRA modelling and exploration[C]. In: The 28th international conference on application-specific systems, architectures and processors (ASAP), 184–189 13. Suh D, Kwon K, Kim S, et al. (2012) Design space exploration and implementation of a high performance and low area coarse grained reconfigurable processor[C]. In: International conference on field-programmable technology, 67–70 14. Kim Y, Mahapatra R N, Choi K. (2009) Design space exploration for efficient resource utilization in coarse-grained reconfigurable architecture[J]. IEEE transactions on very large scale integration (VLSI) systems, 18(10): 1471–1482 15. George N, Lee H, Novo D, et al. (2014) Hardware system synthesis from domain-specific languages[C]. In: The 24th international conference on field programmable logic and applications (FPL), 1–8 16. Prabhakar R, Koeplinger D, Brown KJ et al (2016) Generating configurable hardware from parallel patterns[J]. ACM Sigplan Not 51(4):651–665 17. Koeplinger D, Prabhakar R, Zhang Y, et al. (2016) Automatic generation of efficient accelerators for reconfigurable hardware[C]. In: The 43rd annual international symposium on computer architecture, 115–127 18. Li Z, Liu L, Deng Y, et al. (2017) Aggressive pipelining of irregular applications on reconfigurable hardware[C]. In: The 44th annual international symposium on computer architecture, 575–586 19. Nowatzki T, Gangadhar V, Ardalani N, et al. (2017) Stream-data flow acceleration[C]. In: The 44th annual international symposium on computer architecture,:416–429 20. Rau BR, Glaeser CD (1981) Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing[J]. ACM SIGMICRO Newsl 12(4):183– 198 21. Mei B, Vernalde S, Verkest D, et al. (2003) Exploiting loop-level parallelism on coarse-grained reconfigurable architectures using modulo scheduling[C]. In: Design, automation and test in europe conference and exhibition, 296–301 22. Hamzeh M, Shrivastava A, Vrudhula S. (2012) EPIMap: Using epimorphism to map applications on CGRAs[C]. In: Proceedings of the 49th annual design automation conference, 1284–1291 23. Hamzeh M, Shrivastava A, Vrudhula S (2013) REGIMap: Register-aware application mapping on coarse-grained reconfigurable architectures (CGRAs)[C]. In: proceedings of the 50th annual design automation conference, 1–10 24. Swanson S, Schwerin A, Mercaldi M et al (2007) The wavescalar architecture[J]. ACM Trans Comput Syst 25(2):1–54 25. Voitsechov D, Etsion Y (2014) Single-graph multiple flows: Energy efficient design alternative for GPGPUs[J]. ACM SIGARCH Comput Arch News 42(3):205–216 26. Singh H, Lee M, Lu G et al (2000) MorphoSys: An integrated reconfigurable system for data-parallel and computation-intensive applications[J]. 
IEEE Trans Comput 49(5):465–481 27. Mei B, Vernalde S, Verkest D, et al. (2003) ADRES: An architecture with tightly coupled VLIW processor and coarse-grained reconfigurable matrix[C]. In: International conference on field programmable logic and applications, 61–70 28. Govindaraju V, Ho C, Nowatzki T et al (2012) Dyser: Unifying functionality and parallelism specialization for energy-efficient computing[J]. IEEE Micro 32(5):38–51


29. Goldstein SC, Schmit H, Budiu M et al (2000) PipeRench: A reconfigurable architecture and compiler[J]. Computer 33(4):70–77 30. Pager J, Jeyapaul R, Shrivastava A (2015) A software scheme for multithreading on CGRAs[J]. ACM Trans Embed Comput Syst 14(1):1–26 31. Chang K, Choi K (2008) Mapping control intensive kernels onto coarse-grained reconfigurable array architecture[C]. In: International SoC Design Conference, 362 32. Lee G, Chang K, Choi K (2010) Automatic mapping of control-intensive kernels onto coarsegrained reconfigurable array architecture with speculative execution[C]. In: ieee international symposium on parallel & distributed processing, workshops and PHD forum, 1–4 33. Parashar A, Pellauer M, Adler M et al (2014) Efficient spatial processing element control via triggered instructions[J]. IEEE Micro 34(3):120–137 34. Mahlke SA, Hank RE, McCormick JE, et al. (1995) A comparison of full and partial predicated execution support for ILP processors[C]. In: Proceedings of the 22nd annual international symposium on computer architecture, 138–150 35. Kim C, Sethumadhavan S, Govindan M S, et al. (2007) Composable lightweight processors[C]. In:The 40th annual IEEE/ACM international symposium on microarchitecture, 381–394 36. Mahlke SA, Lin DC, Chen WY et al (1992) Effective compiler support for predicated execution using the hyperblock[J]. ACM SIGMICRO Newsl 23(1–2):45–54 37. Kagi A, Goodman JR, Burger D (1996) Memory bandwidth limitations of future microprocessors[C]. In: The 23rd annual international symposium on computer architecture, 78. 38. Jafri S M, Hemani A, Paul K, et al. (2011) Compression based efficient and agile configuration mechanism for coarse grained reconfigurable architectures[C]. In: IEEE international symposium on parallel and distributed processing, workshops and PHD forum, 290–293 39. Kim Y, Mahapatra RN (2009) Dynamic context compression for low-power coarse-grained reconfigurable architecture[J]. IEEE transactions on very large scale integration (VLSI) systems, 18(1): 15–28 40. Suzuki M, Hasegawa Y, Tuan VM, et al. (2006) A cost-effective context memory structure for dynamically reconfigurable processors[C]. In: The 20th IEEE International Parallel & Distributed Processing Symposium, 8 41. Saporito A (2020) The IBM z15 processor chip set[C]. IEEE hot Chips 32 symposium, 1–17 42. Tu F, Wu W, Yin S, et al. (2018) RANA: Towards efficient neural acceleration with refreshoptimized embedded DRAM[C]. In: The 45th annual international symposium on computer architecture, 340–352 43. Norrie T, Patil N, Yoon D H, et al. (2020) Google’s training chips revealed: TPUv2 and TPUv3[C]//IEEE Hot Chips 32 Symposium (HCS), IEEE Comput Soc, 1–70 44. Gao M, Kozyrakis C. HRL: Efficient and flexible reconfigurable logic for near-data processing[C]. In: IEEE international symposium on high performance computer architecture, 2016: 126–137 45. Farmahini-Farahani A, Ahn JH, Morrow K, et al. (2015) NDA: Near-DRAM acceleration architecture leveraging commodity DRAM devices and standard memory modules[C]. In: The 21st international symposium on high performance computer architecture, 283–295 46. Patterson D, Anderson T, Cardwell N et al (1997) A case for intelligent RAM[J]. IEEE Micro 17(2):34–44 47. Seshadri V, Lee D, Mullins T, et al. (2017) Ambit: In-memory accelerator for bulk bitwise operations using commodity DRAM technology[C]. In:The 50th annual IEEE/ACM international symposium on microarchitecture, 273–287 48. Li S, Niu D, Malladi K T, et al. 
(2017) Drisa: A dram-based reconfigurable in-situ accelerator[C]. In: The 50th annual IEEE/ACM international symposium on microarchitecture, 288–301. 49. Zhang J, Wang Z, Verma N (2016) A machine-learning classifier implemented in a standard 6T SRAM array[C]. In: IEEE symposium on VLSI circuits (VLSI-Circuits), 1–2 50. Chen D, Li Z, Xiong T, et al. (2020) CATCAM: Constant-time alteration ternary CAM with scalable in-memory architecture[C]. In: The 53rd annual IEEE/ACM international symposium on microarchitecture, 342–355


51. Eckert C, Wang X, Wang J, et al. (2018) Neural cache: Bit-serial in-cache acceleration of deep neural networks[C]. In: The 45th annual international symposium on computer architecture, 383–396 52. Guo Q, Guo X, Patel R, et al. (2013) AC-DIMM: Associative computing with STT-MRAM[C] In: Proceedings of the 40th annual international symposium on computer architecture, 189–200 53. Chi P, Li S, Xu C et al (2016) Prime: A novel processing-in-memory architecture for neural network computation in reram-based main memory[J]. ACM SIGARCH Comput Arch News 44(3):27–39 54. Sebastian A, Tuma T, Papandreou N et al (2017) Temporal correlation detection using computational phase-change memory[J]. Nat Commun 8(1):1–10 55. Cong J, Huang H, Ma C, et al. (2014) A fully pipelined and dynamically composable architecture of CGRA[C]. In: The 22nd annual international symposium on Field-programmable custom computing machines, 9–16 56. Chen Y, Krishna T, Emer JS et al (2016) Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks[J]. IEEE J Solid-State Circuits 52(1):127–138 57. Ciricescu S, Essick R, Lucas B, et al. (2003) The reconfigurable streaming vector processor (RSVP/spl trade/)[C]. In: The 36th annual IEEE/ACM international symposium on microarchitecture, 141–150. 58. Ho C, Kim SJ, Sankaralingam K (2015) Efficient execution of memory access phases using data flow specialization[C]. In: Proceedings of the 42nd annual international symposium on computer architecture, 118–130 59. Jain AK, Maskell DL, Fahmy SA. (2016) Are coarse-grained overlays ready for general purpose application acceleration on fpgas?[C]. In: The 14th international conference on dependable, autonomic and secure computing, the 14th international conference on pervasive intelligence and computing, the 2nd international conference on big data intelligence and computing and cyber science and technology congress, 586–593 60. Liu C, Ng H, So HK (2015) QuickDough: A rapid FPGA loop accelerator design framework using soft CGRA overlay[C]. In: International conference on field programmable technology, 56–63 61. Khawam S, Nousias I, Milward M, et al. (2007) The reconfigurable instruction cell array[J]. IEEE transactions on very large scale integration (VLSI) systems, 16(1): 75–85 62. Venkatesh G, Sampson J, Goulding N et al (2010) Conservation cores: Reducing the energy of mature computations[J]. ACM Sigplan Not 45(3):205–218 63. Waingold E, Taylor M, Srikrishna D et al (1997) Baring it all to software: Raw machines[J]. Computer 30(9):86–93 64. Swanson S, Michelson K, Schwerin A, et al. (2003) WaveScalar[C]. In: Proceedings of the 36th annual IEEE/ACM international symposium on microarchitecture, 291–302 65. Bondalapati K, Prasanna VK (2002) Reconfigurable computing systems[J]. Proc IEEE 90(7):1201–1217

Chapter 4

Current Application Fields

To improve is to change; to be perfect is to change often. —Winston S. Churchill, 1925.

With the rapid social and technological development of recent years, a torrent of new applications keeps emerging, demanding computing power far beyond that of the past. Meanwhile, computing chips face growing demands for performance, energy efficiency and flexibility. For example, there are many types of cryptographic algorithms, and algorithm standards are constantly updated: when an old standard expires, a new one is established. The number of cryptographic algorithms in security protocols keeps increasing, and their forms keep changing. Existing algorithms may be compromised and invalidated, and safer algorithms will be proposed soon afterwards. This poses great challenges to the flexibility of cryptographic computing chips. Software-defined chips use software to define chips dynamically and in real time. Circuits can perform nanosecond-level functional reconfiguration aligned with changes in algorithm requirements, so as to agilely and efficiently realize multi-domain applications. In order to exploit the efficiency of hardware in computing, software-defined chips need to sacrifice some of the flexibility of software; that is, the acceleration will not target all types of software. For example, the acceleration of some NumPy library functions may be supported, but not all Python libraries. Therefore, software-defined chips are mainly positioned as domain-specific accelerators at present, rather than as general-purpose platforms. Take the CGRA, a typical software-defined chip, as an example. It provides flexible coarse-grained computing resources and interconnections, and also enhances the support for flexible memory access and data reuse, so it is very suitable for data-intensive computing. This is consistent with the requirements of many emerging applications. Meanwhile, the reconfigurable computing resources of software-defined chips are usually abundant at runtime, and these redundant resources can be used to enhance security. For example, a security check module based on redundant resources can resist physical attacks and hardware Trojan attacks. Therefore, software-defined chips are also very suitable for fields that require high information security.



In this chapter, we first analyze the advantageous application fields of software-defined chips in detail, and then select typical applications of software-defined chip technology from six different areas, i.e., artificial intelligence, 5G communication baseband, cryptographic computing, hardware security, graph computing, and network protocols. We evaluate their performance, efficiency and security, and compare them with traditional computing architectures (CPU, FPGA, ASIC).

4.1 Analysis of Application Fields

Software-defined chips are neither geared toward general-purpose computing, nor do they accelerate a single application like ASICs. Among the many acceleration architectures, software-defined chips are positioned between FPGAs and ASICs. They are mainly applied to the acceleration of a variety of applications with similar features, so they are often used as domain-specific accelerators. In general, the energy efficiency of software-defined chips mainly benefits from specialization, that is, specialized architecture design according to the characteristics of applications in the domain. The performance of software-defined chips benefits from parallelization at multiple levels, including parallel computing in the spatial domain and pipelined computing in the time domain, as well as instruction-level parallelism, data-level parallelism, and task-level parallelism. But even so, software-defined chips are not suitable for all application fields. The following part briefly discusses the application fields of software-defined chips from several perspectives.

1. Application diversity

First of all, there should be diversified application types in the target field. The core algorithms of different applications should have similar computing features and acceleration requirements, or some features of the application should be constantly changing over time. If applications in the field are of a single type and only one fixed core algorithm needs to be accelerated, then an ASIC accelerator specifically customized for that application is the best option. On the contrary, if applications are diverse and the core algorithm is constantly changing, the reconfigurability of software-defined chips provides more flexibility and allows the same hardware structure to accelerate multiple applications. For example, there are many algorithms with different computing paradigms in the field of graph computing, such as BFS, DFS, PageRank, etc., which have different requirements for hardware; in the field of artificial intelligence, the scale and structure of deep neural networks are constantly changing. In these fields, ASICs can only accelerate a few applications, while software-defined chips are more adaptive to the requirements of diverse and changing algorithms.

2. Mixed data granularity

We discussed the impact of data granularity on algorithm precision and hardware overhead in Chap. 3 of Volume I. Software-defined chips are particularly suitable
for computing scenarios with mixed and variable granularities. For example, neural network acceleration, communication, and cryptographic computing often require different data granularities because their algorithms vary. An ASIC usually does not offer variable granularity, while an FPGA builds coarse-grained units for all algorithms starting from single-bit logic, which consumes a lot of hardware overhead and compilation time. In contrast, software-defined chips can provide basic units of, for example, 4-bit or 8-bit granularity as required by different scenarios, and form complex processing elements (PEs) with simple configurations, so they can efficiently meet the needs of different algorithms in these fields.

3. Computing density

Different from other architectures, the key for software-defined chips to achieve high energy efficiency and high-performance computing is to make full use of the spatial data flow computing mode offered by the computing array. This requires a high proportion of arithmetic operation instructions and high parallelism in the application; otherwise it is difficult to effectively utilize the large number of computing resources in the array, which seriously reduces the acceleration efficiency. Generally speaking, the higher the computing density in the application and the higher the proportion of computing time in the total operating cycle, the greater the energy efficiency improvement of software-defined chips. For example, in compute-intensive tasks with high data parallelism, such as CNNs and scientific computing, software-defined chips can achieve hundreds of times higher energy efficiency than general-purpose platforms such as multi-core processors or GPUs. In addition, for the core arithmetic operation instructions in the application, most fields only need multiplication and addition operations, which keeps the internal structure of the PE simple and reduces hardware overhead. Some algorithms require special instructions, such as bit-manipulation instructions for encryption and decryption algorithms, or the Softmax function for neural network output layers. Software-defined chips designed for these fields often use specialized hardware units to implement these complex but common operations, achieving further performance and energy efficiency improvements at a small overhead.

4. Application regularity

Application regularity refers to the proportion of instructions in the application whose latency and timing can be statically predicted. In fact, usually only the latency of some instructions (such as arithmetic operations) is completely predictable statically, and it is difficult to statically predict the execution timing of most of the remaining instructions: their specific behavior can only be determined at runtime based on the input data or intermediate results. This leads to irregularities, such as condition and branch instructions, and memory access instructions with non-fixed latency. Generally speaking, most algorithms that perform arithmetic operations on static data structures such as arrays are regular (such as FFT, CNN, etc.). If the algorithm processes dynamic data structures such as linked lists, graphs, and trees, the instruction stream will include a large number of dynamic condition evaluation and dynamic address
computation instructions related to the data structures, which greatly increases the irregularity of the application. Control irregularity severely reduces the degree of parallelism of the computation, increases the pipeline initiation interval, reduces the execution efficiency of the hardware pipeline, and limits the acceleration space of the application. Similar to most other accelerators, the target applications of software-defined chips should contain no control flow, or only a small amount of control flow with a fixed pattern, to ensure high energy efficiency. Although irregularity can be alleviated to a certain extent by dynamic scheduling mechanisms, this leads to higher architectural complexity and greatly increases the hardware overhead, while the gains are limited; in some cases, the energy efficiency of the accelerator may even be lower than that of general-purpose processors. In typical irregular application scenarios such as graph computing, which will be introduced later in this chapter, graph algorithms are usually converted into equivalent and more regular matrix operations to obtain higher regularity at the cost of some memory overhead.

5. Memory access mode and data reuse

In addition to higher computational parallelism, another main factor in the high energy efficiency of software-defined chips is that they use high-bandwidth on-chip buffers for data reuse, and can be specialized and optimized for domain-specific memory access modes, thereby effectively mitigating the memory access bottleneck. If the algorithm has good locality, the software-defined chip can statically move all the data required at runtime into the on-chip buffer, and there is no need to access the low-speed, high-power off-chip memory system during operation, thus effectively improving the efficiency of memory access. If the range of data accessed during the operation of the algorithm can be statically predicted, the software-defined chip can further improve the efficiency of data access by using scratchpad memory as the on-chip buffer, because explicitly specifying data access and rewrite instructions avoids complex hardware logic such as data replacement in a cache. In addition, for software-defined chips designed with decoupled memory access and computation, if the memory access pattern is regular and the access address sequence can be statically predicted, coarse-grained memory access can be performed through static configuration of the address generation unit, making full use of the memory bandwidth and devoting more resources to arithmetic computation to further improve performance. Conversely, if the memory access addresses in the application are random and the access patterns are irregular, for example nested (indirect) memory accesses, the memory access requests are sent to the cache system serially, which not only wastes a lot of bandwidth but may also cause data conflicts and other issues.

To sum up, software-defined chips are suitable for accelerating a domain with diversified applications, and their hardware flexibility can be fully utilized by algorithms with mixed data granularities. Applications with a higher degree of data parallelism, a higher computation density, and the lowest possible control irregularity can obtain higher performance and energy efficiency. In addition, in order to take full advantage of the on-chip memory system of software-defined chips, the memory access pattern of the application should be as regular and fixed as possible, with sound data reusability. In fact, many acceleration domains have the above-mentioned characteristics, such as artificial intelligence, 5G communication, cryptographic computing, graph computing and network protocols. In the following sections, we will introduce the implementation of algorithms in these typical domains and briefly describe typical design cases of their acceleration architectures.
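To make the regularity and memory-access criteria above concrete, the following sketch contrasts a regular kernel (a dense matrix-vector product, whose addresses and trip counts are statically known) with an irregular one (a pointer-chasing traversal, whose control flow and addresses depend on runtime data). The data structures are illustrative only.

```python
# Regular kernel: fixed loop bounds, affine addresses, no data-dependent branches.
# An address generation unit can be configured statically for a[i][j] and x[j].
def dense_matvec(a, x):
    n, m = len(a), len(x)
    y = [0] * n
    for i in range(n):
        for j in range(m):
            y[i] += a[i][j] * x[j]        # pure multiply-accumulate, easy to pipeline
    return y

# Irregular kernel: trip count, branches and addresses all depend on runtime data,
# so latency cannot be predicted statically and the pipeline stalls frequently.
def linked_list_sum(node):
    total = 0
    while node is not None:               # data-dependent loop bound
        if node["value"] > 0:             # data-dependent branch
            total += node["value"]
        node = node["next"]               # pointer chasing: non-affine address
    return total
```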

4.2 Artificial Intelligence

4.2.1 Algorithm Analysis

John McCarthy, the American computer scientist known as the father of artificial intelligence, gave his own definition of artificial intelligence in the 1950s. He believed that artificial intelligence, as a kind of scientific engineering, can create intelligent machines that complete various tasks in real life as humans do. This concept is still used today. Nowadays, the development and progress of artificial intelligence has brought great convenience to human society, enriched our daily lives, improved industrial production efficiency, and therefore effectively promoted the advancement and development of society. The main flow of current artificial intelligence technology is: first extract features from the input, then design corresponding models according to specific application scenarios, and then perform output, classification, detection, and segmentation based on these features, as human brains do. Models are usually trained on some given data. Early feature extraction methods in artificial intelligence mainly relied on traditional descriptors. Such descriptors are mostly based on prior knowledge of images; mathematical expressions are used to capture this prior knowledge, which is finally implemented as computations on image pixels. Features extracted in this way are called hand-crafted features. Current mainstream artificial intelligence technologies all adopt deep learning. Unlike traditional feature descriptors, deep learning can learn the distribution of the data from a large amount of data, so that high-level features can be extracted and used to perform tasks. At present, artificial intelligence technologies based on deep learning have surpassed humans in many respects [1].

So far, the human brain is the most intelligent “machine”, so it is natural to look for artificial intelligence models that take the human brain as a template. The human brain transmits and processes information through nerve synapses. In order to imitate this process, the artificial neural network model was proposed. Neurons are usually connected to each other by dendrites and axons, as shown in Fig. 4.1. Neurons receive signals from other neurons through dendrites and axons, process them internally, and generate new output signals. These input and output signals are defined as activation values. The connection between an axon and a dendrite is called a synapse. The input activation values are first multiplied by the weights in the neuron, and the results are then added up. However, neurons usually do not output this sum directly, because if the sum were output directly, a chain of neurons would be equivalent to a single linear neuron. Therefore, each neuron is usually followed by an activation function f(·), whose role is to introduce nonlinearity and improve the ability to extract features.

Fig. 4.1 Neuron connections in the brain (x_i, w_i, f(·) and b represent the activation value, weight, activation function and bias, respectively)

Figure 4.2a shows a typical artificial neural network model. The neurons in the input layer receive the input values for processing, and then pass the results to the middle layer, usually called the hidden layer. After these values are processed by the hidden layer, they are passed to the output layer, which produces the output of the entire network. Figure 4.2 shows the computation of each layer of the network:

$$y_j = f\left(\sum_i w_{ij} x_i + b\right)$$

where $w_{ij}$, $x_i$ and $y_j$ represent the weight, the input activation value and the output activation value, respectively, $f(\cdot)$ is the nonlinear activation function, and $b$ is the bias.

Fig. 4.2 A typical model of an artificial neural network
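As a minimal illustration of this per-layer computation, the following NumPy sketch evaluates y = f(Wx + b) for one fully connected layer with a ReLU activation; the layer sizes are arbitrary assumptions.

```python
import numpy as np

def relu(z):
    # A common choice of nonlinear activation function f(.)
    return np.maximum(z, 0.0)

def dense_layer(x, w, b, f=relu):
    # y_j = f(sum_i w_ij * x_i + b): one matrix-vector product plus bias.
    return f(w @ x + b)

# Example: a layer with 4 inputs and 3 output neurons (illustrative sizes).
rng = np.random.default_rng(0)
x = rng.standard_normal(4)          # input activations x_i
w = rng.standard_normal((3, 4))     # weights w_ij
b = rng.standard_normal(3)          # biases
y = dense_layer(x, w, b)            # output activations y_j
```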

The number of hidden layers in mainstream neural networks now exceeds a thousand; such networks are called deep neural networks. Generally speaking, the deeper the network, the higher-level the features of the input information that can be obtained. In such a network, the original image pixels are used as the input, and each level of the network extracts features; the features gradually evolve from general low-level features to high-level features related to the task. The last layers are often special layers that are combined to output the final result.

Starting from around 2010, with the increase of training data and the enhancement of hardware computing capabilities, deep neural networks have developed rapidly. In particular, the introduction of ImageNet, a database used for image recognition and classification, provides a common metric for the development of many neural networks. Nowadays, most neural network architectures are trained, validated and implemented on this data set. The accuracy, complexity, and model size of the mainstream networks on ImageNet are shown in Fig. 4.3. In 2012, a team from Toronto used GPUs to train their network, called AlexNet [2], which reduced the error of the previously best model by approximately 10%. On the basis of AlexNet, more and more excellent network models have been proposed, and these models keep setting new accuracy records on ImageNet. The accuracy metrics on ImageNet are mainly divided into two categories: Top-1 and Top-5. The Top-1 error measures whether the highest-probability result is the correct classification of an image. The Top-5 error measures whether the correct result is included in the top five scoring categories; if it is, the model is considered to have classified the image sample correctly.

Fig. 4.3 Complexity of mainstream networks and their Top-1 and Top-5 accuracy on ImageNet (the complexity is represented by the floating-point operations (FLOPS) required to process one image, and the size of each circle represents the size of the network model, that is, the number of parameters) (see color picture)

As shown in Fig. 4.3, the accuracy of image recognition is increasing day by day in both Top-1 and Top-5. It is worth noting that the human Top-5 error on ImageNet is about 5%, while that of ResNet is now below 5%. However, as the accuracy increases, the size of the model, and especially the computational complexity, also increases sharply, which is very unfavorable for edge computing in today's Internet of Things. First of all, many vision tasks, such as autonomous driving, require real-time data processing at the edge and cannot rely on
high-latency cloud computing. Meanwhile, most of these vision tasks involve video processing, which means a large amount of complex data to handle, including both the data to be processed and the parameters of the model itself. This brings great challenges to edge hardware with limited area resources. Also, many edge devices, such as embedded platforms like mobile phones, have extremely limited hardware resources and power budgets. Therefore, it is extremely important to figure out how to implement deep learning networks efficiently on such platforms. At the same time, artificial intelligence algorithms are complex and diverse; even if dedicated hardware is built for each specific task, the area overhead and cost are unacceptable. Only by solving the above-mentioned problems can artificial intelligence algorithms based on deep neural networks be deployed in the era of the Internet of Things, and software-defined chips can be an excellent solution. First of all, software-defined chips are proposed as a chip solution that supports the large-scale computation of artificial intelligence algorithms. Also, software-defined chips can fully exploit the efficiency of the chip, thereby improving energy efficiency and reducing power consumption. Dynamically reconfiguring chip resources according to task requirements increases the flexibility to support a variety of tasks.

4.2.2 State-of-the-Art Artificial Intelligence Chips

Among current artificial intelligence algorithms, the CNN has been investigated most extensively. It is mainly composed of fully connected (FC) layers and convolutional (CONV) layers. The operations of both layer types are based on multiply-and-accumulate (MAC) operations, which are easy to scale up and to parallelize. In the early days, CPUs or GPUs were used to process CNNs. On these platforms, the FC and CONV layers of the CNN are mapped into matrix multiplications for computation. At the same time, in order to increase the degree of parallelism, single instruction multiple data (SIMD) or single instruction multiple thread (SIMT) execution is often used by CPUs and GPUs to increase the parallelism of data operations. However, when convolution operations are performed on these two platforms, the input activation values need to be replicated repeatedly to meet the requirements of matrix operations, which leads to complex memory access operations and severely impairs the efficiency of the memory system. Although software packages have been developed to optimize the matrix multiplication in convolution, this generally results in extra accumulation operations and more irregular memory access operations.

In order to execute CNNs more efficiently, the current mainstream architectures include ASICs and FPGAs. Compared with GPUs and CPUs, these hardware platforms are more specialized. Especially for CNN inference, ASICs and FPGAs can make full use of their own advantages to improve energy efficiency by dozens or even hundreds of times. On these hardware platforms, the bottlenecks mainly lie in how to reduce memory accesses and how to avoid irregular memory access operations. A MAC operation may require memory reads of the weight, the input activation value and the partial sum, and a memory write of the partial sum after the computation is completed. The worst case is that all data are stored in off-chip DRAM, so that every memory access has to go through the off-chip memory. This seriously damages the throughput and energy efficiency of the overall computation, because one off-chip memory access consumes several orders of magnitude more energy than one MAC operation [3]. Artificial intelligence chips were designed to solve the above problems. They mainly introduce several levels of local memory hierarchy to alleviate the energy overhead caused by data movement during processing. The energy overhead differs at each level of the hierarchy: the closer the memory is to the PE, the lower the energy required for reading and writing data. Therefore, a superior data flow should reduce the number of reads from the memories that consume a lot of energy and instead read data from memory as close to the PE as possible. However, considering area overhead and cost, the amount of data that can be stored in low-energy memory is often very limited. Therefore, a major challenge in current data flow design is how to improve the reuse ratio of data in the low-energy storage levels based on the convolution pattern of CNNs.
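The trade-off described above can be captured by a simple accounting model: total data-movement energy is the sum over memory levels of the number of accesses at that level times the energy per access at that level, so increasing reuse at the levels near the PE shifts accesses away from the expensive levels. The sketch below expresses this bookkeeping; the per-level energy costs and access counts are placeholders supplied by the caller, not measured values.

```python
# Toy data-movement energy model for a dataflow (illustrative, not calibrated).
# accesses:   {"DRAM": n0, "global_buffer": n1, "PE_regfile": n2, ...}
# energy_per: energy cost of one access at each level, supplied by the caller
def data_movement_energy(accesses, energy_per):
    return sum(n * energy_per[level] for level, n in accesses.items())

# Example: reuse keeps most weight/activation reads in the PE register file,
# so only compulsory misses reach DRAM. The numbers below are placeholders
# meant to show the bookkeeping, not a real workload or real energy values.
energy_per = {"DRAM": 200.0, "global_buffer": 6.0, "PE_regfile": 1.0}
no_reuse   = {"DRAM": 1_000_000, "global_buffer": 0, "PE_regfile": 0}
with_reuse = {"DRAM": 10_000, "global_buffer": 90_000, "PE_regfile": 900_000}

print(data_movement_energy(no_reuse, energy_per))
print(data_movement_energy(with_reuse, energy_per))
```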


Fig. 4.4 Potential data reuse opportunities in CNN

A classic example is neuFLOW [4], which uses 8 convolutional units to process a 10 × 10 convolutional kernel. There are a total of 100 MAC units, and each MAC unit reserves a weight to support the weight stationary data flow. As shown in Fig. 4.5a, the input activation value is broadcast to all MAC units, and the partial sums are added up between the MAC units. In order to add the partial sums correctly, it is necessary to allocate an extra delay storage elements in the MAC unit to store partial sums. Other architectures also use this data flow [5–11]. (2) Output stationary data flow The main goal of output stationary data flows is to minimize the energy consumption of reading and writing partial sums. This kind of data flow saves the data flow of the same output result in the same register file. The common implementation is to stream the input activation value in the, and then broadcast the weight in the entire PE array, as shown in Fig. 4.5b. A typical example is ShiDianNao [12]. In this architecture, each PE generates corresponding output by obtaining input activation values from adjacent PEs. When the PE array executes a specific network structure, the data will be propagated vertically and horizontally. Each PE has a register to store the required data for a certain period. At the system level, the global buffer transmits input activation values in a stream, while weights are transmitted in the PE array in the form of broadcast. Partial sums will be accumulated in each PE, and once a complete result is generated, it will be sent back to the on-chip memory. The architectures in [13, 14] both use this kind of data flow. Since the output value can come from different dimensions, the output stationary data flow has many forms, as shown in Fig. 4.6. For example, OSA mainly acts on the convolutional layer, so it mainly processes the results on the same output channel of the convolutional layer at the same time, so that the data reuseing ratio can


Fig. 4.5 Data flow used in CNN processing (see color picture)

be maximized. OSC acts on the fully-connected layer. Since there is only one value on each output channel, this data flow mainly processes output values on different channels. OSB is somewhere in between. There are several representative structure of the above three data flows [12–14]. (3) No local reuse data flow The register files are of high efficiency in terms of energy (pJ/bit), but not efficient in terms of area (µm2 /bit). In order to maximize the efficiency of on-chip memory and minimize the bandwidth of off-chip memory, the PE is no longer provided with local storage, but all storage is allocated to the global buffer, as shown in Fig. 4.5c. Therefore, the difference between the no local reuse data flow and the first two is that no data will remain stationary in the PE array. The problem is that the increased traffic during the interaction with the global buffer. What is different is that it need to multicast the input activation value, and single-cast the filter weights, and then the accumulation operation is performed in the PE array. An architecture proposed by UCLA uses this scheme [15]. The filter weights and the input activations are first read from the global buffer, and then processed in the

178

4 Current Application Fields

Fig. 4.6 Several cases of output static data flow

MAC unit, and then the product is further added and summed using a adder tree, and the above steps are completed within one cycle. The result will be sent back to the global buffer. Another example is DianNao [16]. Unlike UCLA, DianNao uses specialized registers to save partial sums in the PE array, which can further reduce the energy overhead of reading/writing partial sums. (4) Row stationary data flow The Eyeriss architecture proposed a new row stationary data flow [17], which can maximize the data reuse rate in the register files level, and greatly improve the overall energy efficiency. Unlike the previous data flow, it not only optimizes the reuse rate of the weights or the input activations. As shown in Fig. 4.7, when performing onedimensional convolution, the data flow saves the entire row of weights of the convolutional kernel in the PE, and then streams the input activations. The PE processes one sliding window each time, and only needs one memory space to store the partial sums. Because the input activations has overlaps in different sliding windows, and those partially overlapping values can also be stored in the PE for reuseing. From steps 1 to 3 in the figure, it can be seen that the data flow can maximize the reuseed input and weights and the result of partial sums when performing one-dimensional convolution. Each PE can process a one-dimensional convolution operation, so the twodimensional convolution operation can be implemented by a combination of multiple PEs, as shown in Fig. 4.8. For example, in order to form the output result of the first row, it is required to provide three rows of weights and three rows of input activation values. Therefore, we can set three PEs in one column, with each one processing one row of convolution operation. Then their partial sums can be added in the vertical direction to output the first row of results. In order to generate the second row of output, another column of PEs can be arranged, and the three rows of input activation

MAC units, then the products are summed by an adder tree, and all of the above steps are completed within one cycle. The result is sent back to the global buffer. Another example is DianNao [16]. Unlike the UCLA design, DianNao uses dedicated registers to hold partial sums inside the PE array, which further reduces the energy overhead of reading and writing partial sums.

(4) Row stationary data flow

The Eyeriss architecture proposed a new row stationary data flow [17], which maximizes the data reuse rate at the register-file level and greatly improves the overall energy efficiency. Unlike the previous data flows, it does not optimize the reuse of only the weights or only the input activations. As shown in Fig. 4.7, when performing one-dimensional convolution, this data flow keeps an entire row of the convolutional kernel weights in the PE and streams the input activations through it. The PE processes one sliding window at a time and needs only one memory location to store the partial sum. Because the input activations of different sliding windows overlap, the partially overlapping values can also be kept in the PE for reuse. From steps 1 to 3 in the figure, it can be seen that this data flow maximizes the reuse of inputs and weights and of the partial-sum results when performing one-dimensional convolution.

Fig. 4.7 One-dimensional row stationary data flow

Each PE can process one one-dimensional convolution, so a two-dimensional convolution can be implemented by a combination of multiple PEs, as shown in Fig. 4.8. For example, in order to produce the first row of the output, three rows of weights and three rows of input activation values are required. Therefore, we can arrange three PEs in one column, each processing one row of the convolution; their partial sums are then added in the vertical direction to output the first row of results. In order to generate the second row of output, another column of PEs can be arranged, and the three rows of input activation values are shifted downward: the first row of input activations is discarded and the fourth row is added, while the weights remain unchanged, thereby producing the second row of results. Similarly, to output the third row of results, an additional column of PEs is set up. This two-dimensional PE array reduces the accesses to the global buffer. For example, each row of weights is reused across the PEs in the same row, each row of input activations is reused across diagonal PEs, and the partial sums of each row are accumulated vertically. Therefore, in this data flow, data reuse in two-dimensional convolution is maximized.

Fig. 4.8 Two-dimensional row stationary data flow

In order to handle the high-dimensional convolutions of convolutional layers, multiple rows of input data and weights are mapped to the same PE, as shown in Fig. 4.9. In order to reuse weights within one PE, the input activation values of different rows are concatenated and pass through the one-dimensional convolution in that PE. In order to reuse input activation values within one PE, the weights of different rows are concatenated in the PE for the one-dimensional convolution. Finally, in order to increase the number of partial-sum accumulations inside the PE, the input activation values and weights from different channels are interleaved and processed in one PE, because the partial sums of different input channels can naturally be accumulated to produce the final result.

Fig. 4.9 Different input channels and multiple rows of the convolutional kernel

The number of convolutional kernels and the number of input and output channels that can be processed at the same time are programmable. For any model there is an optimal configuration, which mainly depends on the layer parameters of the network model and the hardware resources provided, such as the number of PEs and the size of each level of memory. Because the parameters of the network model are known in advance, the most appropriate way is to design a compiler that performs this mapping offline to achieve the best results, as shown in Fig. 4.10.


Fig. 4.10 Mapping based on hardware resources and the network model

Fig. 4.11 Eyeriss hardware architecture

two parts: 14 × 5 and 13 × 5, and each part is vertically mapped to the PE array. The remaining PEs will be clock gated to save energy consumption. 2. Recent design of artificial intelligence chips The design of artificial intelligence chips needs to be optimized according to the characteristics of the neural network. By analyzing the state-of-the-art artificial intelligence chips, it is not difficult to see that an excellent chip design often digs into the characteristics of neural network algorithm models. Therefore, according to the model optimization method used in the chip design, recent research on artificial intelligence chips can be divided into the following four categories.

Fig. 4.12 The two mapping methods of replication and folding (unused PEs in the physical PE array are clock gated)

(1) Chip design using sparse computing

Sparsity is a property that exists widely in neural networks: it refers to the situation where zero values account for a large proportion of the model parameters. Sparsity brings great convenience to hardware design. It allows a model that contains a large number of parameters to be compressed, reducing the bandwidth, energy consumption and storage overhead of the model, so that the model can be deployed on embedded systems or edge computing chips. A typical representative of sparse chip design is Cnvlutin [18]. The architecture of Cnvlutin is derived from the DaDianNao processor [19] and solves the problem that DaDianNao cannot skip computations on zero values because of its rigid internal data flow. Cnvlutin exploits only the sparsity of the activation values. The basic idea is, first, to decouple the movement of the activation vector and the weight vector, so that they no longer have to advance in lockstep; and second, to build an index of the non-zero activation values so that only the corresponding weights and non-zero activations are read to complete the computation. Therefore, Cnvlutin can skip the operations whose activation value is zero. Its unit structure and computation process are shown in Fig. 4.13.

Fig. 4.13 Cnvlutin’s computing unit structure and data flow

Sparse computation inevitably leads to irregular computation, and the resulting unbalanced use of hardware resources is a challenge that chip designs need to solve. Aiming at this problem, the SparTen accelerator [20] provides methods to address the unbalanced use of hardware resources. The basic idea of SparTen is to use the sparsity of the activation values and the weights at the same time, finding online the pairs whose activation value and weight are both non-zero, so that only non-zero operations are performed. However, as the number of non-zero weights in each convolutional kernel differs, when the activation vector is processed by different PEs and different convolutional kernels, the hardware resource utilization becomes unbalanced. To solve this problem, SparTen sorts the convolutional kernels by sparsity and puts kernels with complementary sparsity into the same unit to process the activation values, so that the computation time of different units is roughly balanced. However, this disrupts the relative order of the convolutional kernels, so the computation results need to be reordered and output through a large permutation network.
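The zero-skipping idea behind Cnvlutin-style designs can be sketched in a few lines: the activations are stored in compressed form together with an index of their original positions, and the MAC loop iterates only over that index, so the work is proportional to the number of non-zero activations rather than to the vector length. This is a software illustration of the principle, not the Cnvlutin microarchitecture.

```python
def compress_activations(acts):
    # Keep only non-zero activations plus an index of their positions ("offsets").
    pairs = [(i, a) for i, a in enumerate(acts) if a != 0]
    idx = [i for i, _ in pairs]
    vals = [a for _, a in pairs]
    return idx, vals

def sparse_dot(acts, weights):
    # MAC loop driven by the non-zero-activation index: zero activations are skipped,
    # and only the weights at the surviving positions are fetched.
    idx, vals = compress_activations(acts)
    return sum(v * weights[i] for i, v in zip(idx, vals))

# Example: 6 of 8 activations are zero, so only 2 MACs are performed.
acts = [0, 3, 0, 0, 1, 0, 0, 0]
weights = [2, 4, 6, 8, 1, 3, 5, 7]
assert sparse_dot(acts, weights) == 3 * 4 + 1 * 1
```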

(2) Chip design using predictive computing

Predictive computing is an emerging research direction in the design of artificial intelligence chips. It focuses on eliminating the “invalid” computations that still remain after sparse computing is applied, that is, computations whose results cannot be passed on to the next layer (such as computations that produce negative values before a ReLU activation, or computations that produce non-maximum values before a max-pooling layer). “Invalid” computations exist widely in neural network models, but they do not affect the accuracy of the model; if they can be eliminated effectively, the performance and power consumption of the hardware can be improved considerably. In this context, Song et al. [21] proposed an accelerator design method based on predictive computing. In this design, each input neuron is divided into two parts, a high-order-bit part and a low-order-bit part, which are multiplied by the weights at different stages,
as shown in Fig. 4.14. The first stage is prediction. In this stage, the high-order bits part and the weight of the activation value are calculated. Because the size of the data is mainly determined by the high-bit part, this result can be used as a predictor to indicate whether the result is “invalid”. The second stage is execution. In this stage, only the low-order bits part and the weight of the corresponding activation value at the valid position are calculated, and then added to the predictor at the corresponding position in the previous stage to obtain a complete result. Although the design has more computation stages, the computation time saved is negligible. The sacrifice of accuracy can lead to relaxation of prediction conditions. If a certain degree of accuracy loss is tolerated in the application scenario, the predictive computing will reduce more computation operations. Therefore, SnaPEA [22] proposed a design method that effectively balances accuracy and computation. It tries to reduce as much computation as possible within the limited range of accuracy loss by setting the prediction threshold for each layer of the neural network. SnaPEA provides more flexibility for chip design, allowing application scenarios with different accuracy requirements to take advantage of the benefits of predictive computing. (3) Chip design using quantitation strategies Quantization, also called low-precision, is a method of transforming the 32-bit floating-point operations of a neural network into a lower bit-width fixed-point. Since there are many redundant operations in neural network operations, although quantization reduces the operations, the accuracy of models will not be greatly affected. Meanwhile, quantization can effectively reduce the computational complexity, the network scale and memory. More and more the chip design using quantitative strategies are emerging. BitFusion [23] is a highly flexible chip design that supports multiple bit-width parameter computations. In the quantized neural network model, different layers have different parameter bit-width requirements, which requires a

Fig. 4.14 Implementation form of predictive computing (in the prediction stage, the input high-order bits and the weight produce the predictor; in the execution stage, the input low-order bits and the weight are combined with the predictor through a shifter under the controller to produce the output)
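A simplified software model of this two-stage scheme is sketched below for a dot product feeding a ReLU: the high-order bits produce the predictor, and the low-order-bit stage is skipped when the predictor is already negative. The 4/4 bit split and the skipping rule are illustrative assumptions, and the skip can occasionally mispredict, which is exactly the accuracy trade-off that SnaPEA manages with per-layer thresholds.

import numpy as np

HIGH_BITS = 4   # illustrative split of an 8-bit activation
LOW_BITS = 4

def predictive_relu_dot(acts, weights):
    """Two-stage dot product preceding a ReLU: the high-order bits of each
    activation are used first; if the partial (predicted) sum is already
    negative, the low-order bits are never processed (simplified model)."""
    acts = acts.astype(np.int32)
    w = weights.astype(np.int32)
    high = (acts >> LOW_BITS) << LOW_BITS     # high-order part, kept at scale
    low = acts & ((1 << LOW_BITS) - 1)        # low-order part
    predictor = int(high @ w)                 # prediction stage
    if predictor < 0:                         # predicted "invalid" (ReLU would zero it)
        return 0, True                        # skip the execution stage
    result = predictor + int(low @ w)         # execution stage completes the sum
    return max(result, 0), False

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    acts = rng.integers(0, 256, size=64)      # unsigned 8-bit activations
    weights = rng.integers(-8, 8, size=64)
    out, skipped = predictive_relu_dot(acts, weights)
    print("output:", out, "low-bit stage skipped:", skipped)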


(3) Chip design using quantization strategies
Quantization, also called low-precision computing, transforms the 32-bit floating-point operations of a neural network into lower bit-width fixed-point operations. Since there are many redundant operations in neural networks, quantization reduces the amount of computation without greatly affecting model accuracy. Meanwhile, quantization effectively reduces the computational complexity, the network scale, and the memory footprint, so more and more chip designs using quantization strategies are emerging. BitFusion [23] is a highly flexible chip design that supports computations on parameters with multiple bit widths. In a quantized neural network model, different layers have different parameter bit-width requirements, which calls for a hardware design that supports different bit-width computations; otherwise, hardware resources are wasted. In BitFusion, 2-bit PEs are designed to split the multiplication of parameters with multiple bit widths into many parts, so that each part can be implemented in a 2-bit PE. With the support of the shift operation, the results of all parts are accumulated to generate the output. The process is shown in Fig. 4.15. This design reduces the granularity of the PE bit width and supports multiple bit widths more flexibly, so that the chip can effectively exploit the quantization strategy to improve its performance. With the development of quantization technologies, binary and ternary neural networks have also appeared one after another. Although they noticeably compromise accuracy, they greatly increase computation speed and are therefore applied in many scenarios. Hyeonuk et al. [24] designed hardware on the basis of the binary neural network. They use the characteristics of binary parameters to decompose the convolutional kernel into two parts; the similarity of the internal parameters in the decomposed kernels is greatly improved, so the amount of computation and the energy consumption can be further reduced.
(4) Chip design using bit-level computing


With the development of artificial intelligence technologies, especially the reduction of parameter bit widths brought by quantization methods, researchers have turned their attention to bit-level computing for chip design. Bit-level computing can simplify hardware design, because a smaller computation bit width requires a smaller

Fig. 4.15 Basic operation process of BitFusion


PE bit width. For single-bit computing, an AND gate can even be used in place of the multiplier. Therefore, in recent years, excellent chip designs have continuously emerged in this field. Stripes [25] is an accelerator that utilizes single-bit sequential computing. When processing an activation value, it sequentially inputs each bit of the activation value into the PE from the high-order bit to the low-order bit, and the multiplication can be completed with an AND gate and the weight. The structure is shown in Fig. 4.16. Compared with an implementation using conventional parallel multipliers, this operation requires more computation cycles, but it is simpler to implement, and the cost in computation cycles can be alleviated by using more parallel PEs. The design of the PRA [26] accelerator is a further improvement of Stripes. It also uses bit-sequential computation of the activation value, but it proposes a method to avoid calculating zero-valued bits. The basic idea is to encode each effective bit (a bit equal to 1) in the activation value as an offset that guides the weight to perform a shift operation, and to accumulate the shifted weights into a complete output result. It should be pointed out that, because this method exploits bit sparsity, it also faces the problem of unbalanced computation time, which is likewise considered in the hardware design of PRA. Bit-level computing can also be exploited by encoding the data. Laconic [27] uses the Booth encoding format to encode activation values and weights, and then calculates only the valid bits. The number of significant bits in Booth-encoded data is further reduced, so the computation cycles and hardware resources can be reduced accordingly. The above architectures and methods that support CNNs are all based on ASICs or FPGAs. With the help of the specialization of these hardware platforms, high area and energy efficiency have been achieved. However, the flexibility of these hardware platforms is greatly limited; that is to say, this hardware can only be used in certain specific tasks in most cases. Once they are applied to more cases, their usability is greatly compromised, because the wiring in the ASIC is fixed and cannot be changed in the future; and although the FPGA is programmable, it cannot reconfigure the hardware at runtime according to task changes during operation.

Fig. 4.16 Basic PE of Stripes
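The bit-serial idea used by Stripes and PRA can be mimicked in a few lines of Python: the activation is consumed one bit at a time with only an AND-style select, a shift and an add, and the PRA-style variant skips zero bits entirely. This is a functional sketch under assumed bit widths, not the accelerators' actual dataflow.

def bit_serial_mac(activation, weight, bits=8):
    """Multiply an unsigned 'bits'-wide activation by a weight one activation
    bit at a time, MSB first, using only an AND-style select and shifts
    (software analogue of Stripes-style bit-serial computation)."""
    acc = 0
    for i in range(bits - 1, -1, -1):          # MSB to LSB
        bit = (activation >> i) & 1
        acc += (weight << i) if bit else 0     # AND gate + shifter + adder
    return acc

def bit_serial_mac_skip_zero(activation, weight, bits=8):
    """PRA-style variant: encode only the positions of the 1-bits and skip
    zero-valued bits entirely, so the cycle count tracks bit sparsity."""
    offsets = [i for i in range(bits) if (activation >> i) & 1]
    return sum(weight << i for i in offsets)

if __name__ == "__main__":
    a, w = 0b01011010, 7                       # 90 * 7 = 630
    assert bit_serial_mac(a, w) == a * w
    assert bit_serial_mac_skip_zero(a, w) == a * w
    print("bit-serial product:", bit_serial_mac(a, w))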


In order to solve this problem and ensure the high efficiency of chips when executing artificial intelligence algorithms, a software-defined artificial intelligence chip will be an outstanding alternative.

4.2.3 Software-Defined Artificial Intelligence Chip

For IoT edge devices, people have proposed the concept of intelligent IoT. By definition, this concept is a combination of industrial IoT, artificial intelligence and advanced computing power. In essence, intelligent IoT applies artificial intelligence technologies to every node of the IoT, turning them into intelligent nodes: each node can independently adapt to and analyze its environment, and finally make the best decision. The Internet of Things will greatly change people's lifestyles and promote the progress and development of human society. The key to the implementation and popularization of the Internet of Things lies in the design of the chips in the edge devices. The traditional view of the IoT holds that an IoT is built once devices are connected through 2G and 3G basebands; in fact, the intelligent IoT should combine multiple forms of communication. In contrast, traditional edge devices only have rudimentary digital processing capabilities. As artificial intelligence spreads, the intelligent IoT requires a large amount of computing power at the edge. From primary learning on edge devices to massive applications of artificial intelligence, what is required is the computing power and time efficiency of the chips in the edge devices. This is especially important for the Internet of Vehicles: cars need to respond very quickly to emergencies to avoid accidents, which means the underlying algorithm must run in real time. Besides, as there is a large amount of data to be processed, transmitting it to the cloud in real time would require a very large bandwidth, so it is desirable to perform computations locally. Therefore, a powerful chip is required to complete the task [28]. The demand for edge device chips mainly includes the following four aspects: ➀ high performance, low power consumption, low price, and ease of use; ➁ high security, with encryption functions to protect the related IPs; ➂ the accuracy of the algorithm itself must meet the requirements of specific scenarios; low accuracy is unacceptable, even if it is fast; ➃ a truly effective smart chip cannot be used without a sound ecosystem, which needs to be developed and constructed together with upstream and downstream suppliers. For chip manufacturers, the more advanced the manufacturing process they use, the more they need to invest. From the 65 nm technology to the most advanced 5 nm technology, the initial investment rises from several million to hundreds of millions of US dollars, and the break-even point also rises from several million units to hundreds of millions of units [28]. It can be seen that it is difficult for chip manufacturers to make profits if they rely on a single market only. In addition, as tasks become larger, chip designs are becoming more complex, product life cycles are getting shorter, and product differentiation is getting broader. Therefore, in the era of the Internet of Things, where the market is becoming more fragmented and the task requirements more complex, chip design


companies need to constantly change their business models and keep iterating on their own technological innovations. On the other hand, large-scale chip companies can quickly iterate their products from top to bottom because they occupy a large market share, which is difficult for small and medium-sized companies. Software-defined chips can be modified at the software level to change the chip's behavior without repeated design and tape-out, offering a bottom-up approach to chip design and innovation. To achieve energy-efficient artificial intelligence computing under the demands of intelligent IoT scenarios, in addition to hardware-oriented optimization of algorithms, designing an architecture that is easier to integrate with ever-changing algorithm models is also a main solution at present. Among these, the reconfigurable architecture is a highly anticipated approach for improving the energy efficiency of artificial intelligence chips; it is an effective way to reduce power consumption while ensuring the efficiency and accuracy of artificial intelligence computations [28]. Although artificial intelligence chips are mushrooming, and some can even perform intelligent tasks that humans cannot, they are only optimized and accelerated for specific tasks, and they seriously lack flexibility and adaptability in actual scenarios. The software-defined chip, which is based on a reconfigurable architecture, can dynamically adjust the hardware at runtime to support different tasks flexibly. Also, based on the upper-level software configuration, the chip can efficiently schedule hardware resources, achieving an area and energy efficiency similar to that of ASICs. In order to meet increasingly diversified artificial intelligence application scenarios, the reconfigurable architecture needs to support different types of network layers, including various convolutional layers, fully connected layers, and hybrid networks. Some computing architectures design different PEs to handle different network layers, but this reduces the reuse rate of hardware resources, and the flexibility and energy efficiency of the hardware are limited when dealing with hybrid networks. Moreover, hybrid networks exhibit high fault tolerance and variation, which means that the bit-width precision, convolutional kernel size, and activation function used in each layer may differ. Finally, the convolutional layer and the fully connected layer differ in memory access pattern, computation density, and data reuse mode. In order to improve efficiency, the reconfigurable architecture should support various efficient data flows and the corresponding hardware modules. A reconfigurable artificial intelligence hardware architecture compiles mainstream models in the artificial intelligence field through a compiler, generates configuration information that covers mainstream network operators, and stores it on the chip. The hardware is dynamically reconfigured based on the configuration information for efficient computation. Taking Thinker [29] as an example, the hardware architecture is shown in Fig. 4.17. The reconfigurable architecture is composed of two PE arrays, and each PE array is composed of 16 × 16 PEs. The memory includes two 144 KB on-chip buffers, a 1 KB shared weight buffer, and two 16 KB local buffers in the PE arrays.
The PE array is dynamically reconfigurable, and PEs are mainly divided into two categories: general PEs and super PEs. Both types of PEs can support one 16 × 16-bit multiplication or two 8 × 16-bit multiplications. As shown in Fig. 4.18, general PE


Fig. 4.17 Thinker hardware architecture

(a) General PE architecture

(b) Super PE architecture

(c) Configuration mode + status

Fig. 4.18 General PE architecture, super PE architecture and configuration mode + state

supports MAC operations in various network layers. The function of a PE is mainly controlled by a 5-bit configuration word. The super PE has five additional operations compared with the general PE: pooling, tanh and Sigmoid activation functions, scalar multiplication and addition operations of the pooling layer, and recursive neural network gating operations. The super PE is controlled by a 12-bit configuration word. Figure 4.19 shows the PE functions implemented by different configurations. Figure 4.19a shows the CONV operation, where gating technology is used to avoid the power consumption overhead caused by zero-value operations; Fig. 4.19b shows the FC operation, which is similar to the CONV operation; Fig. 4.19c shows the use of several activation functions; Fig. 4.19d shows the pooling operation. It can be seen that the PE can change its configuration so that different modules are invoked to achieve different functions.


Fig. 4.19 Convolution operation, fully connected operation, activation function, recursive neural network and pooling (see color picture)
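Since the exact bit layout of Thinker's configuration words is not given here, the sketch below decodes a hypothetical 5-bit general-PE word just to illustrate how a short configuration word can select a PE's function; the field names, widths and bit positions are assumptions, not Thinker's actual encoding.

from dataclasses import dataclass

@dataclass
class GeneralPEConfig:
    opcode: int      # which MAC/bypass function to perform (assumed field)
    src_sel: int     # operand source selection (assumed field)
    gate_zero: bool  # gate the MAC when an operand is zero (assumed field)

def decode_general_pe(word: int) -> GeneralPEConfig:
    """Decode a hypothetical 5-bit general-PE configuration word:
    bits [1:0] opcode, bits [3:2] operand source, bit [4] zero-gating.
    (The real Thinker bit layout may differ; this is only illustrative.)"""
    assert 0 <= word < 2 ** 5
    return GeneralPEConfig(
        opcode=word & 0b11,
        src_sel=(word >> 2) & 0b11,
        gate_zero=bool((word >> 4) & 1),
    )

if __name__ == "__main__":
    cfg = decode_general_pe(0b10110)   # zero-gating on, source 1, opcode 2
    print(cfg)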

The PE state is controlled by a finite-state machine, and state transitions are driven by the configuration context. Thinker's configuration contexts are mainly divided into three levels, the PE array level, the neural layer level, and the PE level, as shown in Fig. 4.20. Array-level configuration information includes data flow information, batch number, number of network layers, the base address of the network layer parameters, etc. The neural-layer-level configuration information controls the operation of a particular layer, including the input activation value and weight base addresses, as well as the size of the convolutional kernel, the number of output channels and other information. PE-level configuration information directly controls the state and function of each PE. In neural network operations, real-valued operations are mainly used to ensure accuracy. However, with the development of algorithms, binary/ternary networks are gradually emerging, and their precision is constantly approaching that of real-valued networks. Because the weights of a binary/ternary network are represented by two values {1, −1} or three values {1, 0, −1}, the hardware implementation is simple and requires no multipliers, thus greatly reducing area and power consumption. At the same time, because the convolution method, kernel size, weight bit width (binary/ternary), input activation value, and activation function may all differ, the reconfigurable architecture can provide efficient support for these high flexibility requirements. The architecture in [30] uses reconfiguration technology to dynamically support arbitrary binary/ternary networks. This architecture can support input activation


Fig. 4.20 Thinker configuration information (the unit b in the figure stands for bit)

values of multiple bit widths, and its architecture is shown in Fig. 4.21a. The main component of the architecture is a computing engine composed of 16 PE groups, and each PE group contains two PEs. The memory controller allows the PEs in a group to exchange input weights and output activations. All PE units are controlled by a 12-bit configuration word, as shown in Fig. 4.21b. S0–S2 configure the adder tree to support input activation values of different bit widths; S3–S4 configure the computation mode; S5–S11 select the activation function and whether the pooling and other layers are valid; S12 is a control word used to control load balancing. In addition, the architecture

Fig. 4.21 Binary/ternary reconfigurable architecture hardware and state machine (b stands for bit) (see color picture): (a) the overall architecture with the memory system, configuration interface, controller and 16 PE groups; (b) the state-machine table mapping the configuration bits S0–S12 to the adder tree mode (8 × 2 bit, 4 × 4 bit, 2 × 8 bit or 1 × 16 bit), the calculation method, the ReLU/PReLU mode, the pooling mode, batch normalization, the binarize/ternarize quantization mode and load balancing

also includes 32 KB of integral SRAM, 128 KB of data SRAM, 64 KB of weight SRAM, and an integral calculation unit. In the binary/ternary network, the critical path is usually the accumulation of input activation values. In order to shorten this critical path, the architecture uses a five-stage pipelined configurable adder tree that adds 32 pieces of 16-bit data, as shown in Fig. 4.22c. To flexibly support activation values of different bit widths, a configurable addend adder tree and eight carry adder trees are designed. The addend adder, shown in Fig. 4.22a, is a 16-bit configurable and divisible adder tree. Each configurable adder consists of eight 2-bit regular adders and seven multiplexers, so that the carry can be controlled according to the bit width of the input activation value, as shown in Fig. 4.22b. This adder tree adds two 16-bit operands to generate one 16-bit datum and an 8-bit carry. Each carry adder tree is used to add one bit position of the 8-bit carries. The addend adder tree yields one 16-bit datum, and the eight carry adder trees output eight 6-bit data. According to the bit width of the input activation values, these data are concatenated and combined into four 64-bit data, and S0S1S2 then determines which of the four output values is sent to the next accumulator. The accumulator is composed of a 64-bit adder, a multiplexer and a 64-bit register.
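To make the carry-gating idea concrete, the following Python sketch behaviourally models a 16-bit adder built from eight 2-bit adders whose inter-chunk carries are cut at configurable lane boundaries. It is only a functional illustration under assumed lane widths (2/4/8/16 bits); the separate carry adder trees and the concatenation logic of the real design are not modelled.

def configurable_add16(a, b, lane_bits):
    """Model a 16-bit adder built from eight 2-bit adders whose carry inputs
    are multiplexed: carries propagate inside a lane of 'lane_bits' bits
    (2, 4, 8 or 16) and are cut at lane boundaries. Returns the 16-bit sum
    word and the per-chunk carry-outs."""
    assert lane_bits in (2, 4, 8, 16)
    sum_word, carries, cin = 0, [], 0
    for chunk in range(8):                     # eight 2-bit adders, LSB first
        if (chunk * 2) % lane_bits == 0:       # lane boundary: mux selects 0
            cin = 0
        x = (a >> (chunk * 2)) & 0b11
        y = (b >> (chunk * 2)) & 0b11
        s = x + y + cin
        sum_word |= (s & 0b11) << (chunk * 2)
        cin = s >> 2                           # carry to the next 2-bit adder
        carries.append(cin)
    return sum_word, carries

if __name__ == "__main__":
    a, b = 0xABCD, 0x1234
    full, _ = configurable_add16(a, b, 16)     # one 16-bit addition
    assert full == (a + b) & 0xFFFF
    lanes4, _ = configurable_add16(a, b, 4)    # four independent 4-bit additions
    print(hex(full), hex(lanes4))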

4.3 5G Communication Baseband

The development of communication technologies has always accompanied the progress of human society: advances in communication promote exchanges among different regions and peoples, foster the fusion of technologies and cultures, and raise production to a whole new stage, which in turn boosts the development of communication technologies. From the first-generation analog telecommunication system to the fifth-generation digital communication system, the communication capacity and quality have improved greatly, and the latency is

Fig. 4.22 16-bit configurable adder tree, adder tree structure and configurable adder tree (b stands for bit): (a) the 16-bit configurable adder; (b) the adder tree structure built from 2-bit adders with multiplexed carries controlled by S0S1S2; (c) the configurable adder tree combining the addend adder tree, the carry adder trees, concatenation and a 64-bit accumulator

lower, and the forms of communicated information are becoming more diversified. The 5G communication system can now slice the network according to the needs of different applications, offering each application the best choice among mobile broadband, high reliability with low latency, and large-scale access. Different 5G communication technologies involve different communication standards, communication algorithms, and antenna sizes. Software-defined chips have such advantages as flexibility, scalability, high throughput, high energy efficiency, and


low latency, and therefore have promising prospects in 5G communication. Traditional baseband chips can be divided into two types: application-specific integrated circuits (ASIC) and instruction set architecture processors (ISAP). ASICs are often designed for a specific communication standard or algorithm and feature high data throughput, high energy efficiency, and low latency. However, once fabricated, ASICs cannot provide further customized and specialized designs or support the evolution of communication standards and algorithms. Further, the high cost and long development cycle of advanced fabrication technologies mean that the inflexible ASIC solution has many limitations. The ISAP solution usually includes hardware implementations such as the general-purpose processor (GPP), DSP, GPGPU, etc. Although these hardware solutions using instruction set architectures boast a certain degree of flexibility, the ISAP solution suffers from low energy efficiency and high power consumption and area overhead, which are essential metrics for both base stations and mobile devices. The software-defined chip is a promising 5G communication baseband chip solution, as the hardware can be configured at runtime by software to obtain sufficient flexibility and scalability while achieving high energy efficiency.

4.3.1 Algorithm Analysis

On the basis of previous communication technologies, 5G communication introduces advanced technologies including multiple access, multi-antenna, code modulation, and new waveform design. In 5G communication, the massive multiple-input multiple-output (MIMO) technology is usually combined with the orthogonal frequency division multiplexing (OFDM) technology introduced in 4G communication to improve system bandwidth utilization while increasing signal transmission rate and reliability. The use of the massive MIMO technology has greatly increased the amount of data that the baseband needs to process. As the processing of multiple input data streams has become the computing power bottleneck of baseband chips, this section introduces the baseband processing algorithm and its core MIMO detection algorithms, which are divided into linear detection algorithms and non-linear detection algorithms [31].

1. Baseband processing algorithm

Figure 4.23 shows the flow of a typical 5G communication baseband processing algorithm, which applies both the MIMO and OFDM technologies. The system decomposes MIMO baseband signal processing into multiple single-channel OFDM signal processing chains [32]. In single-channel signal processing, the transmitted signal undergoes channel encoding and interleaving, and is then modulated and mapped. After serial-to-parallel conversion, sub-carrier mapping is performed, the transmission data is loaded onto multiple orthogonal sub-carriers using the IFFT, and the transmission data flow is then obtained through parallel-to-serial conversion. After cyclic prefix (CP) insertion and low-pass filtering (LPF), the signal is converted into an analog signal and sent out. The received signal goes through the reverse process, and the original data is recovered from the orthogonal sub-carriers using the FFT.

Fig. 4.23 MIMO-OFDM system baseband algorithm processing flow (each transmitter chain performs channel encoding, signal modulation, MIMO encoding, subcarrier modulation, IFFT, CP insertion, LPF (FIR) filtering and D/A conversion; the receiver chains perform the reverse operations together with channel estimation, MIMO detection and channel decoding)
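The IFFT/CP portion of this flow can be illustrated with a short NumPy sketch; the subcarrier count, cyclic-prefix length and QPSK mapping are arbitrary choices, and the channel, filtering and D/A stages are omitted.

import numpy as np

N_SC, CP_LEN = 64, 16      # illustrative subcarrier count and cyclic-prefix length

def ofdm_modulate(symbols):
    """Map frequency-domain symbols to one OFDM time-domain symbol:
    IFFT onto orthogonal subcarriers, then prepend the cyclic prefix."""
    time = np.fft.ifft(symbols, n=N_SC)
    return np.concatenate([time[-CP_LEN:], time])   # CP = last CP_LEN samples

def ofdm_demodulate(samples):
    """Reverse process at the receiver: drop the cyclic prefix and apply FFT."""
    return np.fft.fft(samples[CP_LEN:], n=N_SC)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    qpsk = (rng.choice([-1, 1], N_SC) + 1j * rng.choice([-1, 1], N_SC)) / np.sqrt(2)
    tx = ofdm_modulate(qpsk)
    rx = ofdm_demodulate(tx)                          # ideal, noiseless channel
    assert np.allclose(rx, qpsk)
    print("recovered", N_SC, "subcarrier symbols without error")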

The baseband processing part of the 5G communication system includes channel encoding and decoding, signal modulation and demodulation, MIMO signal detection, fast Fourier transform (FFT) and finite impulse response (FIR) filter module. Because the massive MIMO technology is used in the 5G communication system, the system needs to receive and transmit a large amount of data, which raises higher requirements on the MIMO detection hardware module, including higher energy efficiency, flexibility, and scalability. Figure 4.24 is a simplified schematic diagram of the MIMO system. The antenna at the receiving end will receive the signal from the transmitting end. We can use y to represent the received signal vector. From the communication theory, we know Eq. (4.1): y = Hs + n

(4.1)

where y is the received signal; H is the channel matrix; s is the transmitted signal vector; n is the noise vector. The most important task in signal detection is to estimate the transmitted signal vector by using the received signal vector

Fig. 4.24 MIMO system (the transmitted signals S1, S2, …, Sm pass through the channel, are corrupted by noise n1, …, nr, and arrive at the receiving antennas x1, …, xnr, whose outputs feed the detector)


y and the estimated channel matrix H. The MIMO detection algorithm focuses on detection performance, which is usually measured by the bit error rate (BER).

2. Linear massive MIMO detection algorithm

For massive MIMO signal detection, it is critical to figure out how to efficiently and accurately detect the transmitted signal of the massive MIMO system. For massive detection algorithms, attention is paid to the precision and complexity of the detection algorithm, because they affect the detection performance, the hardware complexity and the cost of the hardware implementation. Massive MIMO detection algorithms can be divided into linear and nonlinear massive MIMO detection algorithms. Although linear massive MIMO detection algorithms are less precise than nonlinear detection algorithms, they excel in low complexity. Therefore, in scenarios where power consumption matters more than communication quality, linear detection algorithms can be used for MIMO signal detection. In linear detection algorithms, the computation bottleneck often lies in the inversion of large matrices, especially when the MIMO system is large; in this case, the complexity of the algorithm is very high, and so is the cost of the hardware implementation. In actual computations, linear iterative algorithms are often used to avoid complex matrix inversion. This section focuses on linear massive MIMO detection algorithms. Common linear massive detection algorithms can be divided into zero-forcing (ZF) detection algorithms and minimum mean square error (MMSE) detection algorithms [33]. In the ZF detection algorithm, the noise vector is ignored. According to the channel model given by Eq. (4.1), after ignoring the noise, we have y = Hs

(4.2)

Left-multiplying both sides of Eq. (4.2) by the conjugate transpose H^H of the channel matrix and combining with Eq. (4.3), we have Eq. (4.4):

y_MF = H^H y    (4.3)

s = (H^H H)^(−1) y_MF    (4.4)

Since noise is not considered, there is an error in Eq. (4.4). Based on the above derivation, the transmitted signal s can be estimated through the matrix W in Eq. (4.5): ŝ = W y

(4.5)

where ŝ represents the estimated transmitted signal, and the estimation of the transmitted signal can now be converted into the estimation of the matrix W. In the ZF detection algorithm, when the additive noise is ignored, the transmitted signal s can


be estimated by estimating the matrix W. If the influence of the additive noise n is considered, the influence of the noise is incorporated into the matrix W, and W is obtained by making the estimated signal approximate the real transmitted signal s. This is the MMSE detection algorithm. In order to make the estimated signal as close to the true value as possible, we use Eq. (4.6) as the objective function:

W_MMSE = arg min_W E‖s − W y‖²    (4.6)

Taking the partial derivative of this objective with respect to W and setting it to zero to find the extremum, we obtain the estimate of the matrix W (where N0 is the spectral density of the noise and Ns is the spectral density of the signal) in Eq. (4.7):

W = (H^H H + (N0/Ns) I_Nt)^(−1) H^H    (4.7)
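As a minimal NumPy sketch of Eqs. (4.4)–(4.7), the following code builds the ZF and MMSE matrices W and applies them to a received vector as in Eq. (4.5); the antenna sizes, noise level, N0/Ns ratio and the QPSK constellation are illustrative assumptions.

import numpy as np

def zf_matrix(H):
    """Zero-forcing equalizer: W = (H^H H)^(-1) H^H, cf. Eq. (4.4)."""
    return np.linalg.inv(H.conj().T @ H) @ H.conj().T

def mmse_matrix(H, n0_over_ns):
    """MMSE equalizer: W = (H^H H + (N0/Ns) I)^(-1) H^H, cf. Eq. (4.7)."""
    nt = H.shape[1]
    return np.linalg.inv(H.conj().T @ H + n0_over_ns * np.eye(nt)) @ H.conj().T

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    nr, nt = 8, 4                                   # 8 BS antennas, 4 user streams
    H = (rng.standard_normal((nr, nt)) + 1j * rng.standard_normal((nr, nt))) / np.sqrt(2)
    s = rng.choice([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j], size=nt) / np.sqrt(2)  # QPSK
    n = 0.1 * (rng.standard_normal(nr) + 1j * rng.standard_normal(nr))
    y = H @ s + n                                   # channel model of Eq. (4.1)
    s_zf = zf_matrix(H) @ y                         # Eq. (4.5) with the ZF W
    s_mmse = mmse_matrix(H, 0.02) @ y               # Eq. (4.5) with the MMSE W
    print("ZF estimate  :", np.round(s_zf, 2))
    print("MMSE estimate:", np.round(s_mmse, 2))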

For the ZF and MMSE algorithms, the most compute-intensive part of estimating the matrix W is the inversion of a large matrix, which is difficult to run on hardware quickly and energy-efficiently. In order to avoid the huge complexity of the inversion operation, a variety of linear iterative algorithms have been proposed. These algorithms iterate between vectors or matrices to avoid inverting large matrices. Commonly used linear iterative algorithms include the Neumann series approximation algorithm, the Chebyshev iteration algorithm, the Jacobi iteration algorithm and the conjugate gradient algorithm.

3. Nonlinear massive MIMO detection algorithm

As introduced above, although linear MIMO detection algorithms have the advantage of low complexity, they lack accuracy, especially when the number of user antennas is close to or equal to the number of antennas at the base station [34] or when high quality of the received signal is required; in such cases nonlinear detection algorithms need to be used. The maximum likelihood (ML) and TASER algorithms [35] are two common nonlinear MIMO detection algorithms. The ML algorithm is the most accurate nonlinear detection algorithm, but its complexity increases exponentially with the number of antennas at the transmitting end, which makes it impractical for massive MIMO systems [36]. The SD detector [37] and the K-best detector [38] are based on the ML algorithm and achieve a balance between computational complexity and performance by controlling the number of nodes searched in each layer. The TASER algorithm is based on semi-definite relaxation; in a system with a low bit rate and a fixed modulation scheme, it achieves detection performance similar to that of the ML algorithm at a polynomial level of computational complexity [39]. These two algorithms will be introduced in detail below. The ML detection algorithm detects the MIMO signal by traversing the set of constellation points and taking the closest constellation point as the estimate of the transmitted signal. Solving for s over the set of candidate constellation vectors with the least mean square optimization method, we have Eq. (4.8):


ŝ = arg max_s P(y|H, s) = arg min_s ‖y − H s‖²    (4.8)

Performing QR decomposition on the channel matrix and using the properties of the upper triangular matrix R, we have Eq. (4.9):

ŝ = arg min_s [ f_Nt(s_Nt) + ··· + f_1(s_Nt, s_Nt−1, ···, s_1) ]    (4.9)

Among them, f_k(s_Nt, s_Nt−1, ···, s_k) can be expressed as Eq. (4.10):

f_k(s_Nt, s_Nt−1, ···, s_k) = | y_k − Σ_{j=k}^{Nt} R_{k,j} s_j |²    (4.10)

For the optimal estimation objective in Eq. (4.9), the optimal solution can be found by constructing a search tree. As shown in Fig. 4.25 [40], there are S nodes in the first layer of the search tree (S is the number of possible values for each point in the modulation scheme), and their values are f_Nt(s_Nt). The sum of the node values along a path from the root node to a bottom node is the mean-square value of the objective function for that path, so finding the optimal path among all paths amounts to finding the optimal solution of the detection algorithm. The ML detection algorithm estimates the transmitted signal by traversing all nodes, which makes it the best-performing nonlinear MIMO detection algorithm. However, as can be seen from the search tree in Fig. 4.25, the complexity of the ML detection algorithm increases exponentially with the number of transmitting antennas. Such an NP-hard detection problem is obviously not suitable for an actual communication system, and some approximations need to be applied to reduce the time complexity of the algorithm. The SD detector and the K-best detector are two approximate optimizations of the ML detection algorithm. The K-best detector applies a pruning operation to the search tree in Fig. 4.25 and, at each layer, only keeps the nodes on the K paths with the smallest accumulated metrics. Although the K-best detection algorithm reduces the time complexity, the complexity is still high when the number of transmitting antennas is large, and a small K value increases the bit error rate. The SD detector searches the hypersphere near the received signal vector to find the most likely transmitted signal. Therefore, while obtaining near-optimal performance, the time complexity of finding the optimal estimate of the received signal remains at a polynomial level. In the linear space formed by the received signal, the norm is defined as the Euclidean distance d = ‖y − H s‖². Then Eqs. (4.11) and (4.12) simply need to be minimized:

ŝ = arg min_s ‖y − H s‖² = arg min_s d    (4.11)

Fig. 4.25 Search tree of the ML signal detection algorithm (example with binary symbols s1, s2, s3 ∈ {1, −1}: node metrics f1, f2, f3 accumulate along each path from the root node through the first-, second- and third-layer nodes)

d_{i+1} = d_i + | y_i − Σ_{j=i}^{Nt} R_{i,j} s_j |²    (4.12)

As shown in Fig. 4.26, when the nodes of the search tree are traversed starting from the last level of leaf nodes, a search path is discarded whenever the accumulated Euclidean distance along it exceeds the given radius D, since the corresponding signal is considered to lie outside the hypersphere. The SD detection algorithm searches, starting from the bottom leaf nodes, for paths whose Euclidean distance stays within the given radius D, until it finds the optimal path to the root node. The complexity and performance of the SD detector are affected by the parameter D. The SD-pruning detection algorithm optimizes the SD detector by adapting the value of D while traversing the search tree. The TASER detection algorithm targets two major scenarios: coherent data detection in high-order multi-user MIMO (MU-MIMO) systems, and joint channel estimation and data detection in large-scale SIMO systems. With semi-definite relaxation, in a system with a low bit rate and a fixed modulation scheme, performance similar to that of the ML detection algorithm can be achieved at polynomial complexity.


Fig. 4.26 Search tree of the SD signal detection algorithm
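The following Python sketch illustrates the K-best style of tree search built on Eqs. (4.9)–(4.12): the channel is QR-decomposed and, level by level from the last antenna, only the K partial paths with the smallest accumulated metrics are kept. The constellation, K value and antenna count are illustrative assumptions, and no sphere-radius (SD-style) pruning is modelled.

import numpy as np

def k_best_detect(y, H, constellation, K=4):
    """K-best tree search sketch: QR-decompose H and expand the search tree
    from the last antenna to the first, keeping only the K partial paths with
    the smallest accumulated metric at each level (cf. Eqs. (4.9)-(4.12))."""
    Q, R = np.linalg.qr(H)
    z = Q.conj().T @ y                     # rotated receive vector
    nt = H.shape[1]
    paths = [([], 0.0)]                    # (partial symbol list, metric)
    for level in range(nt - 1, -1, -1):    # bottom of the tree = last antenna
        candidates = []
        for symbols, metric in paths:
            for s in constellation:
                trial = [s] + symbols      # symbols for antennas level..nt-1
                interference = sum(R[level, level + i] * trial[i]
                                   for i in range(len(trial)))
                inc = abs(z[level] - interference) ** 2
                candidates.append((trial, metric + inc))
        candidates.sort(key=lambda c: c[1])
        paths = candidates[:K]             # prune to the K best partial paths
    best_symbols, best_metric = paths[0]
    return np.array(best_symbols), best_metric

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    nt = 4
    constellation = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)
    H = (rng.standard_normal((nt, nt)) + 1j * rng.standard_normal((nt, nt))) / np.sqrt(2)
    s = rng.choice(constellation, nt)
    y = H @ s + 0.05 * (rng.standard_normal(nt) + 1j * rng.standard_normal(nt))
    s_hat, metric = k_best_detect(y, H, constellation, K=4)
    print("detected correctly:", np.allclose(s_hat, s))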

4.3.2 State-of-the-Art Research on Communication Baseband Chips

In terms of chip architecture, MIMO detection chips can be divided into ASICs and ISAPs. The ASIC is based on the idea of specialization and can achieve very high area efficiency and energy efficiency for specific MIMO detection algorithms. The ISAP is based on processors using instruction sets, including general-purpose processors (GPP), general-purpose graphics processing units (GPGPU), DSPs, and ASIPs. Due to the use of instruction sets, the ISAP has higher flexibility.

1. MIMO detection chips based on ISAP

ISAP-based MIMO detection chips can be roughly divided into two categories, namely those using existing processor architectures such as GPP, GPGPU or DSP, and those using ASIPs. The former focuses on the mapping and optimization of algorithms onto existing architectures to obtain better performance and energy efficiency, while the latter completes the detection algorithm more efficiently by optimizing the instruction set architecture (ISA) and micro-architecture for the detection algorithm. A solution that combines multi-core processors and GPUs is proposed in [41]. As shown in Fig. 4.27, the multi-core CPU preprocesses the channel matrix based on column-norm ordering, and the MIMO signal is then detected in the GPU. Heterogeneous solutions using multi-core CPUs and GPUs can realize highly parallel processing of MIMO signals, which greatly improves the detection throughput. Targeting the MU-MIMO scenario on the base station side, the literature [42] divides the antenna array into multiple clusters based on the GPU processing architecture and then detects the distributed antenna signals on each array cluster. This solution greatly reduces the bandwidth required for communication between the decentralized detection units.


Fig. 4.27 Multi-core CPU-GPU processing framework diagram (the multi-core CPU executes algorithm 1 on data buffered at successive time steps, while the GPU executes algorithms 2 and 3 using shared and global memory)

The ASIP adopts a customized instruction set and obtains better performance and energy efficiency than general-purpose processors by optimizing the hardware architecture. napCore [43] is an ASIP for efficient software-defined radio. The chip improves performance and energy efficiency through a customized instruction set and optimized memory access techniques. napCore supports SIMD extension, and its typical application is linear MIMO detection. napCore achieves energy efficiency approaching that of an ASIC while maintaining high flexibility. Figure 4.28 shows the seven-stage pipeline structure of napCore. The last four stages are arithmetic stages: EX1 and EX2 realize complex-number multiplication, while RED1 and RED2 realize the addition; RED2 can also realize the multiply-and-accumulate operation by reading the vector memory. In order to improve the SIMD throughput, napCore optimizes the fetching of multiple operands, as shown in Fig. 4.29. Through multiple multiplexers, control codes generated at instruction compilation time select the input operands of the arithmetic path. In addition, napCore introduces architectural innovations including a bypass unit and permutation networks to improve the energy efficiency and area efficiency of the chip.

2. ASIC-based MIMO detection chip

An ASIC generally adopts a fully-customized or semi-customized chip design method and performs hardware design for a specific MIMO detection algorithm, which can usually achieve performance, energy efficiency and area efficiency far superior to the ISAP. According to the scale of MIMO, ASIC-based MIMO detection chips can be divided into small and medium-scale MIMO detection ASICs and massive MIMO detection ASICs.


Fig. 4.28 Schematic diagram of napCore pipeline structure

Fig. 4.29 Schematic diagram of data acquisition for the first operand

Figure 4.30 shows the ASIC detection module of a small and medium-scale MIMO system [44, 45], which is used for 4 × 4 detection and decoding with the MMSE linear detection algorithm. The MMSE detection decoder adopts a four-stage pipeline, shortens the longest critical path through the retiming technology, and improves the throughput of the detected input signal. As shown in the hardware architecture diagram of the MMSE detector, the first-stage pipeline is used to generate the estimated channel matrix, while the second-stage and third-stage pipelines are used for the LU


Fig. 4.30 Diagram of MMSE detector module

decomposition of the channel matrix. In order to increase the computation speed of the second-stage LU decomposition and prevent the working frequency of the entire detection module from being limited, the detector uses a parallel reciprocal structure to reduce the latency by 33.3%. According to measurements reported in the literature, the ASIC-based MIMO detection chip implemented in the 65 nm technology achieves a data throughput of 1.38 Gbit/s, a power consumption of 26.5 mW, and an energy efficiency of 19.2 pJ/bit. With the evolution of communication standards, the scale of antenna arrays used in communication systems has become larger and larger, and MU-MIMO has become an indispensable part of 5G communication standards. ASICs for massive MIMO signal detection have gradually become a research hotspot, and massive MIMO detection is being applied more and more widely as it can achieve high data throughput while reducing the overhead per unit area. ASIC chips have been designed for both massive MIMO linear detection [46, 47] and non-linear detection [48]. The Chebyshev iterative algorithm is used to optimize the matrix inversion in the MMSE detection algorithm [46], avoiding tedious inversion operations, and a fully pipelined hardware architecture based on parallel Chebyshev iteration is designed, as shown in Fig. 4.31. The six-stage pipeline structure can be divided into three modules: an initial module, an iterative module, and an approximate LLR processing module. This ASIC solution adopts the 65 nm TSMC technology, and its energy efficiency and area efficiency reach 2.46 Gbit/(s·W) and 0.53 Gbit/(s·mm²) respectively. The literature [47] uses a parallel PE array on the input side to improve the detection throughput, applies a conjugate-gradient-based user depth pipeline in the parallel PE array to estimate the received signal, and finally obtains the optimal detection signal on the


Fig. 4.31 Diagram of MMSE linear detection algorithm module

Fig. 4.32 Diagram of MMSE detection hardware structure

receiving side. Figures 4.32 and 4.33 show the diagrams of the top-level architecture, the parallel processing array, and the user-defined pipeline structure of the ASIC chip. With TSMC's 65 nm technology, the energy efficiency and area efficiency of the ASIC chip are 2.69 Gbit/(s·W) and 1.09 Gbit/(s·mm²) respectively. An ASIC chip for the nonlinear detection algorithm in massive MIMO detection is designed on the basis of the K-best nonlinear detection algorithm [48]. Chebyshev decomposition is used to simplify the QR decomposition preprocessing of the channel matrix and to reduce the number of multiplications while increasing the degree of parallelism. In addition, the pipeline of this chip adopts the partially iterative lattice reduction method to improve the accuracy of the detection results. By using the lattice basis algorithm with sorted QR decomposition, the number of comparators in the K-best signal detection stage is greatly reduced. Figure 4.34 shows the top-level structure of the ASIC.


Fig. 4.33 Diagram of the computational array structure: (a) parallel processing array; (b) pipeline structure

Fig. 4.34 Diagram of Chebyshev detection algorithm structure (sorted QR decomposition unit, partially iterative lattice reduction unit, inversion unit, initialization unit, buffer, post vector unit and K-best unit)
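The Chebyshev-iteration datapath of [46] is not reproduced here; instead, the sketch below uses the simpler Neumann series expansion (one of the linear iterative methods listed in Sect. 4.3.1) to show how the explicit inversion of the MMSE Gram matrix can be avoided. The matrix sizes, regularization term and number of series terms are assumptions for illustration only.

import numpy as np

def neumann_inverse(A, terms=3):
    """Approximate A^(-1) with a truncated Neumann series around the diagonal:
    A^(-1) ~ sum_{k<terms} (I - D^(-1) A)^k D^(-1), with D = diag(A).
    For massive MIMO Gram matrices (many more BS antennas than users) this
    converges quickly and avoids an explicit matrix inversion."""
    D_inv = np.diag(1.0 / np.diag(A))
    E = np.eye(A.shape[0]) - D_inv @ A
    approx = np.zeros_like(A)
    term = np.eye(A.shape[0], dtype=A.dtype)
    for _ in range(terms):
        approx = approx + term @ D_inv
        term = term @ E
    return approx

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    nr, nt = 64, 8                          # massive MIMO: 64 BS antennas, 8 users
    H = (rng.standard_normal((nr, nt)) + 1j * rng.standard_normal((nr, nt))) / np.sqrt(2)
    A = H.conj().T @ H + 0.1 * np.eye(nt)   # MMSE Gram matrix, cf. Eq. (4.7)
    err = np.linalg.norm(neumann_inverse(A, 3) - np.linalg.inv(A))
    print("approximation error after 3 terms:", err)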


3. Limitations of traditional MIMO detection chips

With the increase of the antenna scale, existing ISAP-based MIMO detection chips need to process an exponentially increasing amount of data, so the system cannot realize real-time data processing, which severely restricts the application of existing ISAP-based MIMO detection chips in 5G and future communication systems. ASIC-based MIMO detection chips implement dedicated hardware circuits for different MIMO detection algorithms, and the circuits can be optimized according to the characteristics of each algorithm. Therefore, the ASIC has the advantages of high data throughput, low latency, low power consumption per unit area, and high energy efficiency. However, as MIMO detection algorithms constantly advance, communication algorithm standards and protocols are also constantly updated, which requires the hardware to adapt to these changes. ASIC-based MIMO detection chips cannot change their functions after being fabricated, so they need to be redesigned and produced again to support different algorithms. With the increase of design cost and time, it is becoming increasingly difficult for MIMO detection chips using ASICs to keep up with the iterative updates of communication protocols and algorithm standards. Therefore, the fixed hardware of ASIC-based MIMO detectors cannot meet the requirements of flexibility and scalability.

4.3.3 Software-Defined Communication Baseband Chip

Massive MIMO detection is one of the most critical tasks in baseband processing. As communication technologies develop, it is necessary to provide personalized and specialized services based on different communication standards, different antenna array sizes, different MIMO detection algorithms, and the different communication quality and energy efficiency required by communication services. Therefore, modern MIMO detection processors need to be sufficiently flexible to adapt to different scenarios and different communication protocols and standards. Scalability is also required as baseband processing algorithms rapidly develop and evolve. Software-defined chips boast the advantages of high performance, high flexibility, and high energy efficiency, making them a promising solution for massive MIMO baseband signal processing. Software-defined chips achieve performance similar to ASICs at the expense of a certain degree of generality. Based on the analysis of massive MIMO signal detection algorithms, the PEs, interconnections, memories, and configurations can be optimized for this application domain.

1. Analysis of massive MIMO detection algorithms

How to accurately recover, at the base station, the signals sent by the users has always been essential in signal detection technology. It is even more difficult to make trade-offs between performance, power consumption, and R&D cost among different MIMO signal detection solutions. ASIC chips can achieve the theoretically highest performance and energy efficiency for specific detection algorithms. However, due to the diversity of communication standards and the iterative updates of communication algorithms, multiple ASIC chips are required when the ASIC is used as a solution, so its power consumption is not necessarily advantageous, and its high research and development cost and long development time become constraints. The ISAP chip uses an instruction set to implement baseband signal processing, therefore enjoying a high degree of flexibility but suffering a natural disadvantage in energy consumption. With the development of the MIMO technology, the scale of the antenna array has increased from 4 × 4 to 8 × 8 and to the higher 16 × 16, which makes even base


stations need to consider using low-power technologies. By changing the configuration of the PEs with software, software-defined chips can achieve energy efficiency close to that of ASICs while maintaining flexibility, which is the biggest advantage of this solution. Another big advantage of software-defined chips is that different detection schemes can be implemented through configuration on the same chip and switched according to communication requirements to obtain the optimal implementation. For example, MIMO signal detection algorithms can be divided into linear and non-linear signal detection algorithms. Linear signal detection algorithms have lower computational complexity but are not as accurate as non-linear signal detection algorithms, while non-linear detection algorithms have better accuracy at the cost of higher power consumption. Software-defined baseband chips allow us to select the optimal communication scheme in accordance with the communication environment and requirements. The configuration of the computing space is the most critical part of software-defined baseband chips. It requires analyzing the behavioral pattern of massive MIMO signal detection and the algorithms' parallel strategies, and extracting the core operators. To analyze the behavioral pattern of massive MIMO signal detection algorithms, one needs to identify the common and unique features of different algorithms and analyze their main features, including the basic structure, operation types, operation frequency, data dependences between operations, and data scheduling strategy. By analyzing the features of various signal detection algorithms, common features are extracted from multiple algorithms to determine a set of representative algorithms with common features. In addition, in order to make full use of the advantages of software-defined chips for massive MIMO signal detection in spatial-domain operations, it is necessary to perform a parallel strategy analysis on the massive MIMO signal detection algorithms. The results provide a basis for the parallelism and pipeline design in the algorithm mapping solution. The parallel strategy analysis takes a set of representative algorithms, instead of a single algorithm, as its object, which helps to transfer parallel features among the representative algorithms. With the development of the massive MIMO technology, new signal detection algorithms keep emerging. If the analysis is based on a set of representative algorithms, a new algorithm can be classified into a representative algorithm set according to its features, and the parallel strategy and mapping algorithm of that set can be consulted for its analysis, greatly saving energy and time. After analyzing the behavior pattern and the parallel strategy, it is necessary to extract the core operators of the massive MIMO signal detection application. The core operators provide an important basis for the design of the reconfigurable processing element array (PEA), especially the reconfigurable PE. The extraction of core operators requires a proper balance between the generality and complexity of the operators to avoid restricting the performance and security of the algorithm.

2. Hardware architecture of software-defined communication baseband chips

The core computing component of software-defined communication baseband chips is the reconfigurable PEA, which is mainly composed of the master control interface,

Fig. 4.35 Diagram of PEA structure (see color picture): the ARM7 master controller issues coprocessor instructions to the master control interface and communicates over the advanced high-performance bus (AHB) with the configuration memory, configuration controller, data controller, PEA controller, shared memory and PE array; control flow, configuration flow and data flow are shown separately

the configuration controller, the data controller, the PE array controller and the PE array, as shown in Fig. 4.35. Software-defined communication baseband chips can use the master control interface, the configuration controller and the data controller to exchange data. The master control interface is a coprocessor interface or the AHB. As the main module of the master control interface, the ARM processor can load the relevant data. As a main module on the AHB, the configuration controller initiates read requests to the configuration memory and transmits the configuration packets to the PEA. The data controller, as another main module on the AHB, initiates read-write requests to the shared memory (the on-chip memory shared by the ARM7 and the PEA). The data exchange between the shared memory and the main memory is performed by the ARM processor, which drives the memory access controller to handle the data and completes the data transmission between the PE array and the shared memory. The basic computing unit in the PE array is the PE, and the basic unit of time is the machine cycle (the machine cycle is the time period from when the PE starts to execute the task in the configuration packet to the end of that execution). The ALU generates its output by operating on its inputs in each machine cycle. When a PE completes the computation of a machine cycle, it waits for all other PEs to complete the current machine cycle, and then enters the next machine cycle together with all other PEs. A PE notifies the PE array after completing the execution of its configuration packet, and after receiving the signals that all PEs have completed their execution, the PE array terminates this set of configurations. PEs do not need to execute exactly the same number of machine cycles for a set of configuration packets, so a PE can finish its part of the configuration early.
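A toy software model of this execution model is sketched below: each PE steps through its configuration packet one machine cycle at a time, all active PEs advance together, and the configuration terminates once every PE has finished. The operation names and packet lengths are made up purely for illustration.

def run_configuration(packets):
    """Toy model of the PE-array execution model: each PE owns a list of
    per-machine-cycle operations (its configuration packet). All active PEs
    advance one machine cycle together; a PE that exhausts its packet stops
    early, and the array terminates once every PE has finished."""
    cursors = [0] * len(packets)
    cycle = 0
    while any(cursors[i] < len(packets[i]) for i in range(len(packets))):
        for pe, packet in enumerate(packets):
            if cursors[pe] < len(packet):          # PE still has work this cycle
                op = packet[cursors[pe]]
                print(f"cycle {cycle}: PE{pe} executes {op}")
                cursors[pe] += 1
        cycle += 1                                  # global barrier between cycles
    print(f"configuration done after {cycle} machine cycles")

if __name__ == "__main__":
    # PE0 needs 3 machine cycles, PE1 needs 1, PE2 needs 2 (illustrative).
    run_configuration([["mul", "add", "store"], ["add"], ["mul", "store"]])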


3. Computing module of software-defined communication baseband chips

The computing module of a software-defined communication baseband chip is composed of the PE array, the on-chip memory and the interconnection. The PE array is the core computing part of the software-defined communication baseband detection chip. The PE array and the corresponding data memory constitute the data path of a software-defined communication baseband detection chip, and the architecture of the data path directly determines the flexibility, performance and energy efficiency of the processor. Different massive MIMO signal detection algorithms differ greatly in the granularity of the basic operations of the PEs (from one-bit logic operations to thousand-bit finite-field operations). This section discusses the mixed-granularity PE architecture. It involves not only the basic designs such as the ALU, the data and configuration interfaces, and the registers, but also the optimization of the proportion of PEs with different granularities in the array and their corresponding positions. In addition, mixed granularity also brings new challenges to the interconnection topology. Since PEs of different granularities also differ in the granularity of data processing, their interconnection may involve data merging and data splitting; in a heterogeneous interconnection system, the interconnection cost and the mapping characteristics of the algorithm need to be considered. The memory provides data support for the reconfigurable PE array. Compute-intensive and data-intensive reconfigurable massive MIMO signal detection processors require a great deal of parallel computation; therefore, the data throughput of the memory can easily become the performance bottleneck of the entire processor, which is called the "memory wall". Therefore, collaborative design is required in terms of memory organization, memory capacity, memory access arbitration mechanism, memory interface, etc., to ensure that the performance of the PE array is not affected while minimizing the area and power consumption overhead of the memory.

(1) PE structure

As the basic computing unit in the PE array, the PE is composed of an ALU and private register files. Figure 4.36 shows the basic structure of the PE. The basic time unit of the PE is the machine cycle, the duration for the PE to complete one operation. A global synchronization mechanism is adopted between PEs in each machine cycle. For the same group of configuration packets, the PE array terminates the group of configuration information after receiving the feedback signals that all PEs have completed the group of configuration packets; however, different PEs do not need to execute exactly the same number of machine cycles for a set of configuration packets. The bit width of the data processed in parallel in the PE array is determined by the granularity of the PEs. On the one hand, if the computation granularity is too fine, it cannot match the signal detection algorithms that the processor needs to support: bit truncation affects the precision of the algorithm, and splitting a computation into multiple operations reduces the efficiency of the interconnection, controller resources and resource allocation, ultimately reducing the performance and energy efficiency of the overall implementation. On the other hand, if the computation granularity is too


Fig. 4.36 Composition of PE

This leads to redundant computing resources and therefore affects overall metrics such as area and latency. The computation granularity must therefore match the set of detection algorithms supported by software-defined communication baseband chips. Linear and non-linear detection algorithms each have their own characteristics, so the computation granularity of the PE is determined after performing fixed-point analysis on multiple signal detection algorithms. The analysis shows that a 32-bit word length is sufficient to support the current requirements for computation accuracy. In addition, the word length of the special operators required by some algorithms can be limited to 16 bits in fixed point. Therefore, the ALU design adds data concatenating and splitting operators as well as operations that handle the high and low halves separately. The bit width required by the linear signal detection algorithms is basically around 32 bits. Generally speaking, the PE computation granularity of a massive MIMO signal detection processor is recommended to be equal to or greater than 32 bits, and, considering the PE interconnections, the selected granularity should be a power of 2. The granularity is therefore chosen as 32 bits. It should be noted that, in an actual architecture design, if a special algorithm set is required, the PE processing granularity may need to be adjusted accordingly to better meet the application requirements.
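The concatenating/splitting operators and the separate handling of the high and low halves mentioned above can be illustrated with a small sketch. The helper names below (concat16, split32, add_high_low) are illustrative assumptions, not the chip's actual operator set.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF

def concat16(high: int, low: int) -> int:
    """Pack two 16-bit fixed-point operands into one 32-bit PE word."""
    return ((high & MASK16) << 16) | (low & MASK16)

def split32(word: int) -> tuple[int, int]:
    """Unpack a 32-bit PE word back into its high and low 16-bit halves."""
    return (word >> 16) & MASK16, word & MASK16

def add_high_low(a: int, b: int) -> int:
    """Add the high halves and the low halves independently (no carry between
    them), the kind of half-word operation a mixed-granularity ALU exposes."""
    ah, al = split32(a)
    bh, bl = split32(b)
    return concat16((ah + bh) & MASK16, (al + bl) & MASK16)

if __name__ == "__main__":
    x = concat16(0x1234, 0xF000)
    y = concat16(0x0001, 0x2000)
    print(hex(add_high_low(x, y)))   # 0x12351000
```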


(2) Design of on-chip memory
Software-defined communication baseband chips use shared on-chip memory, and each shared memory has 16 banks, a number determined by the number of PEs in each PE array. When memory access conflicts occur between PEs, multiple banks help reduce the memory access latency. By default, the shared memory address contains 10 bits, of which the first two bits are tag bits identifying which bank the data is stored in. The data is aligned word by word, and each word has two bytes. Each bank has an arbiter, and each PE is connected to an arbiter. When multiple PEs access a bank, the priority is determined by the arbiter. There is a dedicated interface between the shared memory and the PE array; the bit width of its address lines is 4 × 8 and that of its data lines is 4 × 32. In each machine cycle, each bank can handle one data access, so a shared memory can handle up to 16 accesses (when requests target all 16 banks). Each bank has 16 inputs with a fixed priority in the order from 1 to 16: when multiple inputs conflict during an access (read or write), the corresponding memory operations are executed according to input priority 1–16. The arbiter also supports broadcasting: if multiple PEs issue read requests to the same address in one cycle, the arbiter can satisfy all of them in that cycle. During initialization, the data in the shared memory is read from the external memory by the ARM processor, and the results are written back to the external memory by the ARM processor. The shared memory can be accessed in two modes: (1) interaction with only one PE array (the index of the PEA matches that of the shared memory; for example, PEA0 only interacts with shared memory 0); (2) interaction with adjacent PE arrays.
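The fixed-priority arbitration with broadcast support described above can be sketched behaviorally as follows. The Request structure and the single-winner-per-write policy are illustrative assumptions, not the chip's exact arbitration logic.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Request:
    pe_id: int          # input port 1..16, lower number = higher priority
    is_write: bool
    address: int
    data: Optional[int] = None

def arbitrate_bank(requests: List[Request]) -> List[Request]:
    """Return the requests a single bank serves in one machine cycle.
    Fixed priority: the lowest pe_id wins. Broadcast: if the winner is a read,
    every other read to the same address is served in the same cycle."""
    if not requests:
        return []
    winner = min(requests, key=lambda r: r.pe_id)
    if winner.is_write:
        return [winner]
    return [r for r in requests
            if not r.is_write and r.address == winner.address]

if __name__ == "__main__":
    reqs = [Request(7, False, 0x40), Request(3, False, 0x40), Request(5, True, 0x80)]
    print([r.pe_id for r in arbitrate_bank(reqs)])   # [7, 3]: broadcast read of 0x40
```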


(3) On-chip interconnection
At present, the communication between sub-systems is a weak point of software-defined chips used for massive MIMO detection. Compared with a traditional ASIC architecture, however, the software-defined communication baseband chip adopts reconfiguration technology, so the size of the PE array is greatly reduced and the PEA can be limited to 4 × 4. In next-generation mobile communication systems with high data-throughput and low-latency demands, many detection algorithms require the massive MIMO signal detection hardware to provide a very high degree of parallelism in order to improve detection efficiency and thus system performance. In addition, a reconfigurable system for massive MIMO signal detection involves frequent data exchanges, which challenges the traditional bus structure in terms of communication latency and efficiency. Since its appearance, MIMO technology has evolved from ordinary MIMO to massive MIMO: the antenna array keeps growing, the number of mobile devices that the system can accommodate keeps increasing, and new detection algorithms keep being proposed. A future massive MIMO signal detection system must therefore be highly scalable, which the traditional bus structure can no longer provide. Compared with the bus structure, the network-on-chip (NoC) has the following advantages [49]:
(1) Scalability: Because its structure is flexible and variable, the number of nodes that can be integrated is theoretically unlimited.
(2) Concurrency: The system provides good parallel communication capabilities to improve data throughput and overall performance.
(3) Multiple clock domains: Unlike the single-clock synchronization of the bus structure, the NoC is globally asynchronous and locally synchronous; each node has its own clock domain, and different nodes communicate asynchronously through routing protocols. This fundamentally solves the area and power consumption problems caused by the huge clock tree of the bus structure.
4. Configuration module of software-defined communication baseband chips
The configuration method of software-defined communication baseband chips mainly involves the organization of configuration information, the configuration mechanism, and the configuration hardware design. The organization of configuration information covers the definition of configuration bits, the structural organization of configuration information, and the compression of configuration information [50–52]. Since massive MIMO signal detection algorithms have high computational complexity and involve operations that require a great deal of configuration information (such as large look-up tables), the amount of configuration information is usually large. The organization and compression of configuration information therefore become the key to running signal detection algorithms efficiently on the reconfigurable processor. The configuration mechanism focuses on how to schedule the configuration information onto the corresponding computing resources. Massive MIMO signal detection algorithms usually need to switch frequently between multiple subgraphs, so a corresponding configuration mechanism needs to be established to minimize the impact of configuration switching on performance. Finally, the organization method and configuration mechanism must be supported by the configuration hardware. The design of the configuration module mainly includes the configuration memory, the configuration interface and the configuration controller. Figure 4.37 briefly describes the organization mode and configuration mechanism of the configuration information.
(1) Configuration interface module
The configuration interface module mainly includes the master control interface, the configuration controller, and the configuration packet design, as shown in Fig. 4.38. The master control interface realizes the collaboration between the main processor and the software-defined baseband processor. Through its register files, three functions are realized: first, the main processor can send configuration information to the software-defined baseband processor; second, the software-defined baseband processor can send operating state information to the main processor; third, the main processor can quickly exchange data with the coprocessor. The configuration controller is responsible for parsing, reading and distributing the received configuration information. The configuration information itself is organized as a configuration packet, which can be regarded as a large scheduling table that controls the flow of data on the chip and the state of the PEs in the array.
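Viewed as a scheduling table, a configuration packet can be pictured with a small data structure. The field names below are illustrative and do not reflect the chip's actual bit-level encoding.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class PEConfig:
    opcode: str                         # e.g. "ADD", "SHIFT", "SBOX"
    operand_sources: Tuple[str, ...]    # where each input comes from (neighbour PE, memory, ...)
    result_destination: str             # where the output is routed

@dataclass
class ConfigurationPacket:
    """A schematic 'scheduling table': for every machine cycle it states what each
    PE computes and how data flows between PEs and the shared memory."""
    packet_id: int
    schedule: Dict[int, List[PEConfig]] = field(default_factory=dict)  # machine cycle -> per-PE configs

    def cycles(self) -> int:
        return len(self.schedule)

if __name__ == "__main__":
    pkt = ConfigurationPacket(0, {
        0: [PEConfig("LOAD", ("shared_mem[0]",), "pe0.reg0")],
        1: [PEConfig("ADD", ("pe0.reg0", "pe1.out"), "shared_mem[4]")],
    })
    print(pkt.cycles())   # 2
```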


Fig. 4.37 Structural diagram of configuration information (see color picture)

Fig. 4.38 Configuration structure diagram

(2) Mapping method
In order to fully exploit the advantages of the software-defined chip, how massive MIMO signal detection algorithms are configured onto the software-defined chip architecture is crucial. The software-defined chip architecture differs from the traditional von Neumann architecture: a configuration stream is introduced on top of the traditional instruction stream and data stream, which makes mapping MIMO signal detection applications to the hardware architecture more complicated. As shown in Fig. 4.39, the main steps of mapping are generating the data flow graph of the massive MIMO signal detection algorithm, dividing the data flow graph into sub-graphs, mapping the sub-graphs to the reconfigurable massive MIMO signal detection PE array, and generating the corresponding configuration information.


Fig. 4.39 Configuration mapping flow

The generation of the data flow graph mainly includes the expansion of the core loop, scalar replacement and the distribution of intermediate data. During partitioning, the complete data flow graph is divided, according to the computing resources of the reconfigurable PE array, into multiple subgraphs with data dependences in the time domain. During mapping, each subgraph is mapped onto the PE array of the software-defined communication detection chip, and the effective configuration information is finally generated.
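These three steps can be outlined as a small tool-flow sketch. The partitioning and placement below are placeholders under stated assumptions (a node-count capacity constraint and a trivial placement); this is not the actual compilation flow.

```python
from typing import Dict, List, Set, Tuple

Node = str
Edge = Tuple[Node, Node]

def generate_dfg(kernel_ops: List[Edge]) -> Dict[Node, Set[Node]]:
    """Build a data flow graph (adjacency sets) from (producer, consumer) pairs."""
    dfg: Dict[Node, Set[Node]] = {}
    for src, dst in kernel_ops:
        dfg.setdefault(src, set()).add(dst)
        dfg.setdefault(dst, set())
    return dfg

def partition(dfg: Dict[Node, Set[Node]], max_nodes: int) -> List[List[Node]]:
    """Split the DFG into subgraphs no larger than the PE array capacity.
    Real tools partition along the time domain while respecting data dependences;
    here we simply chunk a placeholder node ordering."""
    order = sorted(dfg)
    return [order[i:i + max_nodes] for i in range(0, len(order), max_nodes)]

def map_to_array(subgraph: List[Node], array_size: int) -> Dict[Node, int]:
    """Assign each operation of a subgraph to a PE index (placeholder placement)."""
    return {node: idx % array_size for idx, node in enumerate(subgraph)}

if __name__ == "__main__":
    dfg = generate_dfg([("load_a", "mul"), ("load_b", "mul"), ("mul", "acc")])
    for sub in partition(dfg, max_nodes=2):
        print(map_to_array(sub, array_size=16))   # one configuration per subgraph
```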

4.4 Cryptographic Computation

As the fundamental technology guaranteeing the security of data storage, communication and processing, cryptographic algorithms are widely used in data centers, network equipment, edge devices and IoT nodes across all aspects of the national economy and people’s livelihood. However, different application scenarios and functions set different priorities when selecting cryptographic algorithms in terms of latency, power consumption, and energy efficiency. As the physical carrier of cryptographic algorithms, cryptographic chips also need to be flexible enough to switch between different demands and algorithms. The software-defined cryptographic chip is an ideal solution to this problem. This section describes the application of software-defined chips in the field of cryptographic computing from the perspectives of the computational attributes of cryptographic algorithms, the state-of-the-art cryptographic chips, and the design and implementation of software-defined cryptographic chips.

4.4.1 Analysis of Cryptographic Algorithms
1. Overview of cryptographic algorithms
As a core technology for achieving information security, cryptography is a technical field that spans multiple disciplines such as mathematics, computer science, electronics and communications. Modern cryptography focuses on designing a variety of provably secure cryptographic systems for different application environments and security threats while remaining efficient. As shown in Fig. 4.40, cryptographic algorithms provide the following basic functional attributes.
(1) Confidentiality: the protection of sensitive information in storage or in transit; unauthorized individuals, entities or processes must not be able to access it.
(2) Integrity: maintaining the consistency of information, that is, preventing unauthorized tampering with information during its generation, transmission, storage, and use.
(3) Authenticity: confirming the authenticity of a message or entity and establishing trust in it.
(4) Non-repudiation: ensuring that system operators or information processors cannot deny their actions or processing results.
Different cryptographic algorithms have been developed to realize these functional attributes. A cryptographic algorithm includes five basic elements: plaintext, ciphertext, key, encryption, and decryption. The plaintext is the original information to be encrypted, and the ciphertext is the confidential information obtained after encryption.

Fig. 4.40 Requirements of cryptographic technologies for information security functions: confidentiality (encryption algorithms), integrity (hash functions), authenticity (message authentication codes), and non-repudiation (digital signatures)


The key is the sensitive information used to ensure the correct implementation of encryption and decryption. According to Kerckhoffs’ principle, “even if all details of the cryptographic system are known, as long as the key is not leaked, the system remains secure”; the key is therefore the core data that needs to be protected during the implementation of cryptographic algorithms. Encryption refers to the process of transforming plaintext into ciphertext through an encryption algorithm, and decryption is the process of recovering the plaintext through a decryption algorithm. According to how keys are used, cryptographic algorithms can be divided into three major categories: symmetric cryptographic algorithms, asymmetric cryptographic algorithms (also known as public key cryptographic algorithms), and hash functions. Symmetric cryptography refers to a cryptographic scheme that uses the same key for encryption and decryption, while a public key cryptographic algorithm uses different keys for encryption and decryption. The hash function is a cryptographic algorithm that does not require a key. Figure 4.41 shows a general symmetric cryptographic algorithm model, in which the ciphertext is obtained by encrypting the plaintext with the key and is transmitted to the message receiver via an untrusted channel. The receiver uses the same key as the sender to decrypt the ciphertext and obtain the original plaintext. The key can be generated by a third party and distributed to the sender and the receiver through a secure channel, or generated by the message sender and transmitted to the receiver through a secure channel. According to the size of the plaintext processed in each encryption operation, symmetric cryptographic algorithms can be further divided into block ciphers and stream ciphers: a block cipher encrypts plaintext blocks of a fixed size at a time, while a stream cipher encrypts the data bit by bit. Figure 4.42 shows the model of a general public key cryptographic algorithm. A public key cryptographic algorithm includes two keys, a public key and a private key. Only the private key needs to be kept secret, and it is mainly used by the message receiver for decryption; the public key is public and is mainly used by the message sender to encrypt the plaintext. The core of public key cryptography is the trapdoor one-way function: it is easy to compute in one direction but very difficult to invert. In other words, it is difficult to infer the correct plaintext knowing only the ciphertext and the public key.

Fig. 4.41 Symmetric cryptographic algorithm


Fig. 4.42 Model of public key cryptography algorithms: the plaintext is encrypted with the public key, and the resulting ciphertext is decrypted with the private key to recover the plaintext

The main advantage of the public key cryptographic algorithm is that the two communicating parties do not need to share a key through a secure channel in advance: only the public key is involved in the data communication process, while the private key never needs to be transmitted or shared. The hash function is an algorithm that converts input information of any length (the message) into output information of a fixed length (the hash value); the output length is independent of the input length. The hash function has two main features: collision resistance and one-wayness. Collision resistance means that different inputs should produce different hash values, and it is very difficult to find two different inputs with the same hash value. One-wayness means that the input of the hash function cannot be derived from the hash value. Hash functions are widely used in file verification, digital signatures, message authentication, and pseudo-random number generation.
2. Analysis of cryptographic algorithm features
Nowadays there are two cryptographic algorithm systems, the international standards and China’s domestic standards. International cryptographic standards are mainly drafted under the National Institute of Standards and Technology (NIST) with the participation of cryptography experts around the world, and a relatively mature standard technology system and application ecosystem have been established. Considering the important impact of cryptographic technologies on information and national security, China has established an independent commercial cryptographic system, such as the SM algorithms. Table 4.1 shows the international and domestic cryptographic algorithms of the different cryptographic types. Next, symmetric cryptography, public key cryptography and hash functions are analyzed from three perspectives: algorithm features, core operators and algorithm parallelism. AES is the most common block cipher and comes in versions with different key lengths, such as AES128, AES192, and AES256. An international cryptographic standard released in 2001, it is based on the typical SPN structure with a block length of 128 bits. Each encryption involves ten rounds of iteration: in the first nine rounds, byte substitution, row shifting, column mixing, and key addition are performed periodically in sequence, and the last round omits column mixing. The decryption process is similar to encryption except that the operation sequence is changed to key addition, inverse column mixing, row shifting and inverse byte substitution, and the round keys are used in reverse order. In addition, both the encryption and the decryption processes begin with an initial key addition operation.

Table 4.1 Mainstream international and domestic cryptographic algorithms

Type of algorithm                       Standard of algorithm    Name of algorithm
Symmetric cryptography, stream cipher   International standard   RC4
                                        Domestic standard        ZUC
Symmetric cryptography, block cipher    International standard   AES, DES, 3DES
                                        Domestic standard        SM1, SM4, SM7
Public key cryptography                 International standard   RSA, ECC, DH, ECDHE
                                        Domestic standard        SM2, SM9
Hash function                           International standard   MD5, SHA1, SHA2, SHA3
                                        Domestic standard        SM3

Most block ciphers are based on similar design theories; the main structures are the Feistel network and the SP network. Different block ciphers therefore share many core computing operators, such as basic logical operations (XOR, AND, OR and NOT), integer operations (addition, subtraction, modular addition, modular subtraction, modular multiplication and modular inverse), shift operations with fixed or variable amounts, S-Box substitution operations, permutation operations, etc. Unlike symmetric cryptographic algorithms, which rely on substitution and permutation, public key cryptographic algorithms are constructed from hard mathematical problems. According to the underlying problem, there are mainly three types of public key cryptographic algorithms: those based on large integer factorization, represented by RSA; those based on the discrete logarithm problem, represented by DSA, Diffie-Hellman and ElGamal; and those based on the elliptic curve discrete logarithm problem, represented by ECDSA and ECC. For public key algorithms based on large integer factorization and discrete logarithm problems, the key length refers to the number of bits required for the binary representation of the modulus n. The longer the key, the higher the security level, and of course the greater the amount of computation and the slower the speed. For the RSA algorithm in real applications, the modulus n is required to be a wide integer, generally 1024 or 2048 bits and in some cases as large as 15360 bits. The two prime numbers p and q into which the modulus n factors have similar bit widths, each about half that of n. Public key algorithms based on the elliptic curve discrete logarithm problem can use much smaller keys than RSA while providing a comparable or higher security level. For an ECC algorithm over the prime field GF(p), the key length refers to the number of bits required for the binary representation of p, which is usually a prime with a bit width of 160 to 571 bits. The core computation of public key algorithms consists of modular exponentiation, modular multiplication and modular addition of large integers or large polynomials. Modular exponentiation can be converted into a series of modular multiplications and squarings through the square-and-multiply algorithm. Modular multiplication is both the most frequently used and the most time-consuming operation, and its speed directly determines the processing speed of a public key cryptographic algorithm; realizing fast large-integer modular multiplication is therefore the key to fast public key cryptography. The most direct way to compute a modular multiplication is to multiply first and then reduce, whereas the Montgomery modular multiplication algorithm avoids division by transforming the modular reduction into multiplications and shifts. The core common operations of public key cryptographic algorithms are therefore multiplication, addition, and shift.

The hash algorithms include more than ten mainstream algorithms such as SM3, MD5, and SHA1/2/3, among which SHA-3 and SM3 are the most typical. NIST announced Keccak as the SHA-3 standard algorithm in October 2012. The SHA-3 algorithm consists of message padding, message expansion, and iterative compression. Message padding pads an input message of any length to an integer multiple of the message block length; message expansion zero-pads the message block and expands it to the width of the compression function; iterative compression repeatedly applies a compression function, whose output is called the chaining value. The compression function is the key part of the hash algorithm and comprises 24 iterations, each consisting of 5 steps of permutation and substitution. SM3 is a hash algorithm published by China’s State Cryptography Administration in 2012; it has high security strength, a simple structure and efficient hardware and software implementations, and is likewise composed of message padding, message expansion, and iterative compression. Different hash functions can all be implemented on a common architecture consisting of a control path and a data path. The control path is a finite-state machine that generates the control signals of the data path according to parameters such as the number of iterations; the number of algorithm cycles is determined and the arithmetic modules are selected accordingly. In the data path, the input data successively goes through message padding and message expansion, then through the compression function to obtain the chaining value, which is stored in the compression function register after the output transformation. The control unit then checks whether the required number of iterations has been reached: if so, the hash value is output; if not, the compression function is applied again for the next round of compression. In fact, various hash algorithms can be supported by adjusting the bit width, the compression function used in the iterations, the number of iterations, and the parameters in the constant memory.
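This common padding-expansion-iteration organization can be sketched in a few lines. The sketch below is schematic only, assuming a toy padding rule and a placeholder compression function rather than SHA-3 or SM3.

```python
from typing import Callable

def pad(message: bytes, block_bytes: int) -> bytes:
    """Toy padding: append 0x80 then zero bytes up to a block boundary
    (real schemes such as SHA-3 or SM3 define their own padding rules)."""
    padded = message + b"\x80"
    padded += b"\x00" * (-len(padded) % block_bytes)
    return padded

def iterative_hash(message: bytes,
                   block_bytes: int,
                   iv: bytes,
                   compress: Callable[[bytes, bytes], bytes],
                   rounds: int) -> bytes:
    """Generic skeleton: the chaining value is updated block by block; the control
    path only decides how many blocks/rounds to run, the data path is `compress`."""
    chain = iv
    data = pad(message, block_bytes)
    for offset in range(0, len(data), block_bytes):
        block = data[offset:offset + block_bytes]
        for _ in range(rounds):                      # iterated compression function
            chain = compress(chain, block)
    return chain

if __name__ == "__main__":
    # Placeholder compression: byte-wise modular addition (insecure, for shape only).
    toy_compress = lambda chain, block: bytes((c + b) & 0xFF for c, b in zip(chain, block))
    digest = iterative_hash(b"hello", block_bytes=16, iv=bytes(16),
                            compress=toy_compress, rounds=2)
    print(digest.hex())
```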

Fig. 4.43 Design space of cryptographic chips: performance (high computing speed, low power consumption), security (resistance against physical attacks), and flexibility (support for multiple types of algorithms)

3. Design space of cryptographic chips
Unlike the design of traditional digital integrated circuits, which pursues performance, power consumption, area and similar metrics, cryptographic chips also need to consider security during design and implementation. Cryptographic chips therefore need to strike a reasonable balance among three metrics: functional flexibility, computing performance, and security. Figure 4.43 shows the design space of cryptographic chips formed by the constraints of these three technical metrics.
(1) Flexibility
There are many types of cryptographic algorithms, and the standards are constantly updated: when an old standard expires, a new one is established. The number of cryptographic algorithms used in security protocols keeps increasing and their forms keep changing; existing algorithms may be broken and invalidated, with safer algorithms proposed soon afterwards. Cryptographic algorithms thus vary greatly in type, quantity, frequency of modification and upgrade, and frequency of dynamic function switching. All these application requirements pose a huge challenge to the functional flexibility of cryptographic processors.
(2) Performance
Cryptographic chips face two contradictory requirements, namely high processing performance and low computing power consumption. The former mainly targets scenarios such as data centers and servers, where fast computing speed and high throughput are required; the latter mainly targets applications such as edge computing, the Internet of Things and wearable devices, where higher energy efficiency and lower power consumption are pursued.
(3) Security
Since the cryptographic chip is the physical carrier of cryptographic algorithms, the timing, electromagnetic emissions, power consumption, and even sound generated while a cryptographic algorithm is being processed may leak the algorithm's sensitive information.


Moreover, as physical attacks become more mature and easier to deploy, cryptographic chips operating in open environments face increasingly severe security challenges. It is therefore necessary to build countermeasures against possible physical attacks into the design of cryptographic chips, such as constant-time circuit implementations against timing attacks and masking and hiding methods against electromagnetic and power consumption attacks. Countermeasures against physical attacks are independent of the cryptographic chip's function and inevitably cause extra overhead in area, power consumption, and even performance. How to obtain the best resistance against physical attacks with minimal additional overhead has therefore always been a hot research topic in the field of hardware security.
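As a concrete illustration of the constant-time idea mentioned above, the following software sketch compares two byte strings without an early exit, so that the running time does not depend on where the first mismatch occurs. It is an analogue of the hardware technique, not a chip-level countermeasure; for real software use, Python's standard hmac.compare_digest provides the same behavior.

```python
def constant_time_equal(a: bytes, b: bytes) -> bool:
    """Compare two byte strings without an early exit, so the running time does
    not depend on the position of the first mismatch (a software analogue of a
    constant-time circuit)."""
    if len(a) != len(b):
        return False
    diff = 0
    for x, y in zip(a, b):
        diff |= x ^ y          # accumulate differences instead of returning early
    return diff == 0

if __name__ == "__main__":
    print(constant_time_equal(b"secret-tag", b"secret-tag"))   # True
    print(constant_time_equal(b"secret-tag", b"secret-tah"))   # False
```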

4.4.2 Current Status of the Research on Cryptographic Chips
According to the different optimization directions in the design space, current cryptographic chips can be divided into three categories: performance-driven, flexibility-driven, and security-driven cryptographic chips.
1. Performance-driven cryptographic chips
According to the application requirements, cryptographic chips mainly target either server applications featuring high throughput and high performance, or IoT and wearable computing scenarios featuring low power consumption and high energy efficiency; both pursue extreme performance. The key technologies for high-performance, high-throughput processing include pipeline design and retiming [53–55]. A high-throughput AES architecture supporting 128-bit, 192-bit and 256-bit keys and four modes (ECB, CBC, CTR and CCM) is proposed in [54]. The overall architecture, shown in Fig. 4.44, is mainly composed of the I/O interface, FIFOs and the AES core. The datapath of the AES core adopts a two-stage pipeline design to match the timing of the datapath, and the key is also generated by a pipeline. The architecture can alternately process two data flows on a single datapath; in CCM mode it can process two different data flows in parallel because CCM only requires encryption, which effectively improves throughput. In addition, the architecture uses retiming to optimize the XOR operations and multiplexers, further improving the critical path. The hardware is implemented in a 0.13 µm standard CMOS technology; its frequency reaches 333 MHz and the throughput can reach up to 4.27 Gbit/s. A new pipeline architecture based on the round function for the CBC encryption mode of AES is proposed in [55]. The functional modules of the round function adopt a normalized design, so that the affine transformation and the linear mapping can use the same architecture and the entire round function needs only a single 128-bit 4-to-1 multiplexer.


Fig. 4.44 High-performance AES processor architecture

By contrast, similar architectures usually require multiple multiplexers, so this design effectively reduces the pseudo-critical path. The architecture also applies operation reordering and register retiming, so that the inversion operation of encryption and decryption can share the same hardware without incurring additional latency. For encryption, the affine transformation, column mixing, and key addition are merged by exchanging the affine transformation and the row shift; for decryption, the positions of the inversion and the inverse shift within byte substitution are swapped, so that the inverse transformation of byte substitution is placed at the very beginning of the round. The hardware is implemented in 65 nm, 45 nm, and 15 nm standard CMOS technologies; compared with other architectures, the throughput per unit area is increased by 53–72%. A high-performance, highly flexible dual-field ECC processor with a key length of up to 576 bits is designed in [53]. By initializing the elliptic curve parameters and the instruction codes stored in the ROM, it can realize the basic operations required by any dual-field ECC, making it applicable to different elliptic curve scalar multiplication algorithms, such as the binary method and the Montgomery algorithm, and to different ECC standards, such as FIPS 186-2 (Federal Information Processing Standard), IEEE P1363 and ANSI X9.62 (American National Standards Institute). The architecture includes the ECC controller, the modular arithmetic logic unit (MALU), ROMs, register files and an advanced high-performance bus interface.


In order to realize the high flexibility of the processor, the MALU integrates different modular arithmetic modules, including modular addition and subtraction, modular multiplication and modular division. In addition, to reduce the latency of the data path, a basic processing unit based on a carry-save adder and a ripple-carry adder is designed to implement the proposed radix-4 interleaved multiplication, modular doubling and modular quadrupling. By reusing the basic processing element of the MALU, the hardware utilization of the processor is improved. The hardware is implemented in the XMC 55 nm standard CMOS technology with an equivalent gate count of 189 K. One 163-bit ECC operation takes 0.60 ms, and one 571-bit ECC operation takes 6.75 ms; on a Xilinx Virtex-4 FPGA, the 192-bit ECC algorithm takes 7.29 ms and the 521-bit ECC takes 49.6 ms. The key technologies for low-power and high-area-efficiency designs include hardware reuse and energy-efficient circuit design [56–58]. A highly energy-efficient AES architecture is designed in [57]. Compared with the conventional architecture, the row shift operation in the round function is omitted, and the flip-flops in the data and key storage are replaced by latches through retiming, saving 25% of the area and 69% of the power consumption. A glitch-reduction technique adds a flip-flop to the path, and retiming of the S-Box balances the path latency. The architecture is implemented in a 40 nm CMOS technology with a circuit area of 2228 equivalent gates and a supply voltage of 0.47 V; it achieves an energy efficiency of 446 Gbps/W and a throughput of 46.2 Mbit/s. An 8-bit lightweight nanoAES accelerator for ultra-low-power mobile systems-on-chip is designed in [58]. The row shift operation in nanoAES is moved to the start of the round operation, and the shift is realized through a sequential scan chain. nanoAES uses only an 8-bit S-Box, and the basic operations required, such as addition, squaring and inversion, use 4-bit logic circuits. To reduce the critical path latency, the mapping operation is moved outside the critical path of the round computation, and polynomial optimization is used to reduce the circuit area by 18% and the critical path latency by 12%. The hardware is implemented in a 22 nm CMOS technology with a die area of 0.19 mm²; the encryption and decryption accelerators occupy 2200 µm² and 2736 µm² and contain 1947 and 2090 equivalent gates, respectively. The peak energy efficiency reaches 289 Gbps/W. A high-area-efficiency hardware architecture for the BLAKE algorithm (one of the second-round SHA-3 candidates) is proposed in [56]. To reduce the circuit area, the round function G is implemented by iterating 10 times over a 32-bit adder. The module computing the G function is composed of two 32-bit XOR operations, a rotator and an adder; the selected state words are reordered and used for the computation of the chaining value h, and the value stored in the intermediate register can be derived from the new chaining value. In addition, a semi-custom dedicated 4 × 32-bit memory based on a clock-gated latch array is designed to store the chaining value. This architecture requires a total of 5 registers, which reduces the memory area by 34% compared with a flip-flop-based memory. The hardware of the BLAKE-32 architecture is implemented in the UMC 1P/6M 0.18 µm technology, with an area of 0.127 mm².


2. Flexibility-driven cryptographic chips
Literature [59] proposed Recryptor, a processing-in-memory (PIM) cryptographic processor based on the ARM Cortex-M0 architecture for Internet of Things applications, which uses near-data and in-memory computing techniques to improve energy efficiency. The processor uses 10 SRAM units to support bit-level computing operations up to 512 bits wide. Meanwhile, high-throughput near-data processing is achieved by placing custom-designed shifters, rotators, and S-Boxes close to the memory. The processor has certain reconfigurable features and supports common public key cryptographic algorithms, symmetric cryptographic algorithms and hash algorithms. The system architecture of Recryptor is shown in Fig. 4.45. It includes an ARM Cortex-M0 microprocessor with 32 KB of memory, a low-power serial bus for accessing off-chip data, an internal arbiter, and a memory composed of 4 banks; except for one custom-designed SRAM used for cryptographic acceleration, the other three SRAMs are generated by a standard memory compiler. Recryptor was taped out in a 40 nm CMOS technology with a chip area of 0.128 mm². At an operating voltage of 0.7 V and an operating frequency of 28.8 MHz, the processing speed and energy consumption are improved by 6.8 times and 12.8 times, respectively, compared with related work. A heterogeneous multi-core processor for public key cryptographic algorithms is proposed in [60]. The processor offers low latency and high throughput and consists of two clock domains with different functions: the high-clock-frequency domain includes 4 PEs, and the low-clock-frequency domain includes a reduced instruction set computing (RISC) processor. The two parts are interconnected through FIFOs, and the RISC processor generates macro-instructions that control the PEs to execute computation functions. The PEs are programmable and provide high-performance arithmetic computations such as long-word-length modular multiplication and addition.

Fig. 4.45 Computing architecture of the cryptographic processor Recryptor


It has a 5-stage pipeline structure similar to a RISC processor and can execute 292-bit long-word-length modular additions. The architecture is implemented in the TSMC 65 nm CMOS technology with a maximum frequency of 960 MHz; a 1024-bit RSA encryption takes 0.087 ms.
3. Security-driven cryptographic chips
Although a cryptographic algorithm may be mathematically proven secure in theory, the side channel signals generated by its physical carrier, the cryptographic chip, such as power consumption, electromagnetic emissions, and timing information, still pose a risk of leaking sensitive information. For cryptographic chips in physically accessible scenarios (especially in areas such as wearables and the Internet of Things), appropriate countermeasures need to be considered. For timing attacks it suffices to guarantee a constant-time realization of the circuit, so the discussion here mainly focuses on resistance against power consumption and electromagnetic attacks. Current side channel countermeasures can be divided into the logic layer, the architecture layer and the circuit layer. In side channel security research on cryptographic chips, the minimum traces to disclosure (MTD) is usually used to measure the side channel resistance of different technologies. The protection goal of the logic layer is to equalize the power consumption of the chip in each clock cycle as much as possible, in order to hide the specific computation logic of the circuit operation; typical technologies include dual-rail logic [61], dynamic differential logic [62], and gate-level masking [63]. For circuit implementation, these technologies usually require a specially designed library file and also cause relatively large area and power consumption overheads. Architecture-layer techniques mainly weaken the strong correlation between side channel information and the algorithm's processing flow by inserting dummy operations and executing operations out of order; however, the achieved side channel resistance is strongly related to the algorithm and the specific architecture. Another class of techniques is custom design during the physical implementation of the circuit; typical examples include current balancing based on switched capacitors [64] and the low-voltage linear regulator [65]. Such techniques, first, require more specialized custom circuit design capabilities and, second, may themselves introduce new leakage sources and thus additional security risks. Based on an analysis of the white-box model of an AES cryptographic chip, current-domain signature suppression is used in [66] to increase the resistance against electromagnetic and power consumption attacks by multiple orders of magnitude. By combining current-domain signature attenuation with local lower-level metal routing, the current of critical correlated signals is greatly suppressed before reaching the supply pins, and the current on the top metal connected to the external pins is suppressed as well. In this work, protected and unprotected versions of AES-256 are implemented in a 65 nm technology. The experimental results show that, compared with existing protection schemes, the resistance against correlation power and electromagnetic attacks is improved by more than 100 times with similar power consumption and area overhead.
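The masking idea referenced above (e.g., gate-level masking) can be illustrated in software with a first-order Boolean masking toy example: the secret is split into two random shares so that neither share alone is correlated with the secret. This is an illustration of the principle only, not the countermeasure used in [63] or [66].

```python
import secrets

def mask(value: int, bits: int = 8) -> tuple[int, int]:
    """Split a secret value into two Boolean shares: value = share0 XOR share1."""
    share0 = secrets.randbits(bits)
    return share0, value ^ share0

def masked_xor(shares_a: tuple[int, int], shares_b: tuple[int, int]) -> tuple[int, int]:
    """XOR is linear, so it can be computed share-wise without ever
    recombining the secret intermediate values."""
    return shares_a[0] ^ shares_b[0], shares_a[1] ^ shares_b[1]

def unmask(shares: tuple[int, int]) -> int:
    return shares[0] ^ shares[1]

if __name__ == "__main__":
    key_shares = mask(0x3A)
    data_shares = mask(0x5C)
    print(hex(unmask(masked_xor(key_shares, data_shares))))   # 0x66 = 0x3A ^ 0x5C
```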


4.4.3 Software-Defined Cryptographic Chips
This section focuses on software-defined cryptographic chips that support mainstream symmetric cryptographic algorithms and hash algorithms [67, 68]. The chip includes a dynamically and partially reconfigurable PE array and an interconnection structure to improve energy efficiency and area efficiency while ensuring full functional flexibility. In addition, based on the spatio-temporal random dynamic features of software-defined chips, effective countermeasures against side channel attacks are realized. Key technologies adopted in the design of the chip include a distributed control network, parallel computation and configuration, and configuration compression and organization. The following is a detailed description of the software-defined cryptographic chip from the perspectives of the basic architecture, the key technologies, and the chip implementation results.
1. Computing architecture
The system architecture of the cryptographic chip is shown in Fig. 4.46. It is mainly composed of two parts: a data processing engine and a configuration controller.
(1) Data processing engine
The data processing engine includes four coarse-grained reconfigurable processing element arrays; each array includes 4 × 8 reconfigurable processing elements (PEs) and 8 internal routing units for connecting adjacent rows of PEs. Each PE has four inputs and four outputs and supports 8-bit, 16-bit and 32-bit operations. Considering the area overhead, two types of PEs, T0 and T1, are constructed and arranged in interlaced rows. Both types of PE include a basic function unit (BFU) and a special function unit (SFU). The BFU provides all the basic functions of the PE, including arithmetic, logic and shift functions: the arithmetic function of each PE includes a 32-bit adder, and there is one 16-bit multiplier in every four rows; the logic functions mainly include AND, OR, NOT and XOR; and the shift function supports logical and cyclic shifts of up to 64 bits.
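The BFU's operation classes can be sketched behaviorally as a small dispatch table over 32-bit words. The operation names and the particular selection below are illustrative, not the BFU's actual instruction encoding.

```python
MASK32 = 0xFFFFFFFF

def rotl32(x: int, n: int) -> int:
    """Cyclic left shift of a 32-bit word."""
    n &= 31
    return ((x << n) | (x >> (32 - n))) & MASK32

BFU_OPS = {
    "ADD":  lambda a, b: (a + b) & MASK32,          # 32-bit adder
    "AND":  lambda a, b: a & b,                     # logic functions
    "OR":   lambda a, b: a | b,
    "XOR":  lambda a, b: a ^ b,
    "NOT":  lambda a, _b: (~a) & MASK32,
    "SHL":  lambda a, b: (a << (b & 31)) & MASK32,  # logical shifts
    "SHR":  lambda a, b: (a & MASK32) >> (b & 31),
    "ROTL": rotl32,                                 # cyclic shift
}

def bfu(op: str, a: int, b: int = 0) -> int:
    return BFU_OPS[op](a, b)

if __name__ == "__main__":
    print(hex(bfu("ROTL", 0x80000001, 1)))   # 0x3
    print(hex(bfu("ADD", 0xFFFFFFFF, 1)))    # 0x0 (modulo 2^32)
```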

Fig. 4.46 Computing architecture of the software-defined cryptographic chip (see color picture)


The difference between T0 and T1 lies mainly in the SFU. The SFU in T0 is an 8 × 8 S-box with 4 configurable input/output bit widths, while the SFU in T1 is a 64-bit Benes network providing non-blocking transmission from input to output for permutation operations. With these two types of PEs, all operations of the current mainstream cryptographic algorithms can be covered. Under different configurations, the function of each PE can be switched by changing the multiplexer inputs of that PE, and the configuration information also determines the interconnection relationships between PEs. Because of the large scale of the PE array, debugging is performed by collecting the intermediate results of the PE array and the PE status; with the help of the collected information, the debug logic can determine whether a PE is currently operating normally and make the necessary adjustments. The token driven network (TDN) uses a token register chain to control the computation sequence of the PEs, and each PE is enabled by a separate token register. There are 16 independent token register chains in total, which transmit the tokens generated by the PE input FIFOs of one row to the PEs of the next row. The clock signal of a PE that is not activated by its token register is turned off to further reduce power consumption. The register channels are used to reorder the output results of the 16 threads, and each channel is 32 bits wide. The GPRF is used to exchange data between different configurations and is composed of 256 32-bit registers, ensuring that all intermediate results of the PEs can be loaded into the GPRF.
(2) Configuration controller
The configuration information that defines the PE functions and the interconnections is mainly generated by the configuration controller. Since a large amount of configuration information is identical across different algorithms, PE rows and PEs, a hierarchical configuration mechanism is used to minimize duplicate configuration information. In the configuration controller, three levels of configuration storage are established for the three abstraction levels of PEs, PE rows, and tasks. This configuration structure saves about 70% of the duplicate configuration information and further reduces the storage overhead and the configuration time. The parsing register fetches a command from L3 and, after three-level parsing, writes it into the configuration information register. When the PE array is idle, the configuration switching module loads the computing configuration information into the PEs and runs it. Using a data-flow-driven mode similar to an ASIC, the PE array can complete the corresponding encryption/decryption function without changing the PE functions; when the computing task is finished, the PE functions can be switched according to the configuration information.
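The token-driven enabling of PEs described above can be sketched behaviorally as follows. The single-token chain and the per-row operations are simplifying assumptions for illustration, not the chip's control logic.

```python
from typing import Callable, List

def run_token_chain(row_ops: List[Callable[[int], int]], seed: int) -> int:
    """Shift a single token down a chain of PE rows: only the row holding the
    token computes in a given step; all other rows are treated as clock-gated."""
    token_value = seed
    token_position = 0
    while token_position < len(row_ops):
        for row, op in enumerate(row_ops):
            if row == token_position:            # token register of this row is set
                token_value = op(token_value)    # the enabled row computes
            # rows without the token do nothing (clock gated)
        token_position += 1                      # token shifts to the next row's register
    return token_value

if __name__ == "__main__":
    rows = [lambda v: v + 1, lambda v: v * 2, lambda v: v ^ 0xFF]
    print(run_token_chain(rows, seed=3))   # ((3 + 1) * 2) ^ 0xFF = 247
```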


Fig. 4.47 Configuration acceleration system

2. Key core technologies
The design of this software-defined cryptographic chip mainly applies two technologies to improve the efficiency of the processor, namely the configuration acceleration system (CAS) and the multi-channel storage network (MCSN).
(1) Configuration acceleration system
An efficient scheduling system is the key to improving the utilization of computing resources. In traditional software-defined chips, configuration and computation are usually performed sequentially; in other words, all processing units remain idle in configuration mode, which wastes a great deal of computing resources. This problem can be solved by overlapping the system configuration with multiple independent computing tasks. As shown in Fig. 4.47, the configuration acceleration system is mainly composed of three modules: task injection, multi-task scheduling, and the context analyzer. After initialization is complete, the task injection module sends commands to the multi-task scheduling module; the multi-task scheduling module analyzes the context of multiple tasks in advance and distributes the tasks to different channels accordingly; the context analyzer decodes the information sent by the multi-task scheduling module and loads the corresponding computation content into the corresponding processing unit. The L3 storage in the task injection module holds the command set loaded from the external interface; its depth is 256, which means up to 256 commands can be loaded at a time. A cryptographic algorithm generally requires 1–20 commands to implement, and each command is a high-level abstraction of configuration information. When configuration switching is required, the command address and read mode corresponding to the next algorithm are loaded from L3. Generally speaking, about 20 different algorithms can be loaded in L3 at a time. Two command read modes are currently supported: increment counter and lookup table. When the commands of an algorithm are stored contiguously, the increment-counter mode is used; otherwise, the lookup-table mode is used. The multi-task scheduling system realizes multi-task scheduling by constructing N task channels and the corresponding scheduling logic between the command queue and the reconfigurable processing units. The required computing resources are scheduled under the control of the commands sent by the task injection module. If the computing resources required by the current task do not exceed the number of available processing units, the task is assigned to the first idle task channel in the queue; since that channel has no task being executed, the task mapping can be completed immediately. Otherwise, the task is allocated to a task channel whose computing resources exceed its demand, but it cannot start executing until the task currently running on that channel is completed.


The evaluation of each task channel is performed sequentially. For the scheduling of computing resources, all task channels have the same priority. A task channel can schedule 1–4 processing units for each task according to the task requirements; in general, the 4 processing units can be fully utilized by multiple tasks. The multi-task scheduling system thus improves the resource utilization of the processing units while improving computational efficiency. The context analysis system analyzes the content sent from the multi-task scheduling system and transmits the configuration information to a processing unit according to the state of that processing unit. There are three levels of configuration storage and control. As the bottom layer, L1 (computational information) stores the operation codes of the processing units and the interconnection configuration information. L2 (thread information) mainly stores the configuration information of an entire row of processing units, including the configuration index information of the processing units, the interconnection of the processing units, the intermediate results, and special functions such as the Benes network. L3 (task information) mainly stores the task configuration information received from the multi-task scheduling system, including the number of rows of processing units required to complete the task and the index into the corresponding L2 storage. This three-level organization of the configuration information and the corresponding parser reduces the overall size of the configuration information while improving the efficiency of configuration switching. When the computing task currently executed by the processing element array ends, the pre-parsed configuration information can immediately complete the configuration switch for the new task: the parsing of the configuration information does not have to wait until the computation is completed but is executed in parallel with it, reducing the configuration switching time between two computing tasks to only 3 to 4 clock cycles.
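The three-level configuration storage (L1/L2/L3) described above can be sketched with index-based sharing: identical low-level entries are stored once and referenced, which is how duplicate configuration information is saved. The entry contents below are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class L1Entry:                    # computational information for one PE
    opcode: str
    interconnect: Tuple[int, ...]

@dataclass
class L2Entry:                    # thread information for one PE row
    l1_indices: List[int]

@dataclass
class L3Entry:                    # task information
    rows_needed: int
    l2_indices: List[int]

def expand_task(task: L3Entry, l2: List[L2Entry], l1: List[L1Entry]) -> List[List[L1Entry]]:
    """Resolve a task down to the per-PE configurations of every row it uses."""
    return [[l1[i] for i in l2[row].l1_indices] for row in task.l2_indices]

if __name__ == "__main__":
    l1 = [L1Entry("XOR", (0, 1)), L1Entry("SBOX", (2,))]
    l2 = [L2Entry([0, 0, 1]), L2Entry([1, 1])]     # rows reuse the same L1 entries
    task = L3Entry(rows_needed=2, l2_indices=[0, 1])
    for row in expand_task(task, l2, l1):
        print([entry.opcode for entry in row])
```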


(2) Multi-channel storage network
Since the PEs produce a large number of intermediate results for subsequent computations, it is critical to read and write these intermediate data efficiently. In this design, a multi-channel storage network (MCSN) is proposed to realize high-bandwidth parallel data exchange between the PEs and the GPRF. Besides the inner router (IRT), the GPRF acts as the main data exchange channel in the array and has a significant impact on the efficiency of parallel computing. The MCSN structure of this design is shown in Fig. 4.48. Unlike a storage solution with a single read–write interface, the GPRF is divided into 16 storage segments to provide more physical interfaces. Meanwhile, a virtual interface is constructed for each PE vector, and a three-level interconnection between the real physical interfaces and the virtual interfaces is used to reduce the overall area. In this way, each PE can access storage independently and in parallel. L2 connects the 4 read–write interfaces of a row to the memory through the storage interface of the basic module (composed of 8 rows of PEs); by configuring the multiplexer, one row consisting of 4 independent read–write ports in the basic module is selected at a time. The selection signal of the multiplexer is obtained dynamically from the read/write enable signals of each row, so it is possible to switch between different rows without reconfiguring the multiplexer. L3 mainly implements the direct connection between each basic module and each memory through the real ports of the address decoding. Compared with a direct storage access design using register files and multiplexers, the area overhead of storage access per PE is reduced to about 1/20 of the original, and the complicated configuration of each multiplexer is avoided. The dynamic configuration switching is determined by the real-time status of the PEs, so as to avoid latency penalties in specific application scenarios. It should be noted that when more than 16 PEs access different addresses in the same storage, or multiple rows of PEs access the same addresses in the storage, the latency of the MCSN will be worse than that of a decoder-based scheme. Scheduling tools are therefore needed to avoid this non-ideal situation as much as possible: in this design, configuration and simulation tools are used to check the latency, and regression or genetic algorithms are used to search for the optimal selection, so that the average latency overhead is kept within 15%.
3. Prototype chip implementation and performance test
In order to further validate the advancement of the above technologies for cryptographic chips, the TSMC 65 nm technology was used to tape out a prototype of the proposed software-defined cryptographic chip. Figure 4.49 shows the die photo of the chip. The area of the processor is 9.91 mm², and the operating frequency is 500 MHz. The processor can support all common symmetric ciphers and hash functions. Table 4.2 shows the performance statistics of the cryptographic algorithms supported by this chip.

Fig. 4.48 Multi-channel storage network
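The banked organization of the GPRF behind the MCSN described above (Fig. 4.48) can be sketched behaviorally as follows. The interleaving rule and the cycle-by-cycle grouping are simplifying assumptions for illustration, not the chip's interconnect.

```python
from collections import defaultdict
from typing import Dict, List

class BankedGPRF:
    """Global register file split into independent segments (banks).
    Accesses that hit different segments complete in the same cycle;
    accesses that collide on one segment are serialized."""
    def __init__(self, segments: int = 16, words_per_segment: int = 16):
        self.segments = segments
        self.mem = [[0] * words_per_segment for _ in range(segments)]

    def segment_of(self, address: int) -> int:
        return address % self.segments           # simple interleaving

    def schedule(self, reads: List[int]) -> List[List[int]]:
        """Group read addresses into cycles: one access per segment per cycle."""
        by_segment: Dict[int, List[int]] = defaultdict(list)
        for addr in reads:
            by_segment[self.segment_of(addr)].append(addr)
        depth = max((len(v) for v in by_segment.values()), default=0)
        return [[v[i] for v in by_segment.values() if i < len(v)]
                for i in range(depth)]

if __name__ == "__main__":
    gprf = BankedGPRF()
    # 0 and 16 collide on segment 0, so they need two cycles; 1 and 2 go in parallel.
    print(gprf.schedule([0, 1, 2, 16]))   # [[0, 1, 2], [16]]
```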


throughput rate of the three block ciphers in the non-feedback mode can reach 64Gbit/s, and the average throughput rate of the overall block cipher can reach 28.3 Gbit/s. Due to the feedback features in stream ciphers and hash functions, the throughput rate of these algorithms will be relatively low (from 0.35 Gbit/s to 8Gbit/s). The cryptographic chip has high energy efficiency, and the power consumption used to run any algorithm is kept within 1 W (from 0.422 to 0.704 W). In order to further evaluate the comprehensive advantages of the cryptographic chip in terms of performance and flexibility, we compared it with a reconfigurable cryptographic processor. Among the 7 algorithms that are both supported, 3 algorithms have the same performance and the other 4 algorithms have their performance improved by 1.5–4 times. A more in-depth analysis and comparison is shown in Table 4.3. In the following table, the most commonly used AES-128 algorithm is taken as an example to compare with related work in terms of energy consumption and area efficiency. Most of the current work is to achieve the maximum throughput rate through pipeline acceleration, that is, 128 bits per cycle. The energy overhead of each operation is used as an indicator of energy efficiency to compare with the coarse-grained reconfigurable cryptographic processor. This design improves the energy efficiency by 6.2 times while maintaining similar area efficiency; the energy efficiency is increased by 44.5

Fig. 4.49 Die photo of software-defined cryptographic chips

232

4 Current Application Fields

Table 4.2 Performance comparison of different cryptographic algorithms

| Type | Algorithm | Throughput rate/(Gbit/s) | Power consumption/W |
|---|---|---|---|
| Block cipher | AES | 64 | 0.625 |
| | SM4 | 64 | 0.578 |
| | Serpent | 1.81 | 0.574 |
| | DES | 32 | 0.588 |
| | Camellia | 64 | 0.614 |
| | Twofish | 32 | 0.588 |
| | MISTY1 | 12 | 0.495 |
| | SEED | 20 | 0.493 |
| | IDEA | 14.25 | 0.473 |
| | SHACAL2 | 3.8 | 0.422 |
| Stream cipher | AESGCM | 20.4 | 0.704 |
| | MORUS640 | 3.63 | 0.484 |
| | ZUC | 5.32 | 0.588 |
| | SNOW | 5.82 | 0.612 |
| | RC4 | 2 | 0.612 |
| Hash function | SHA256 | 0.8 | 0.577 |
| | SHA3 | 0.35 | 0.538 |
| | SM3 | 0.66 | 0.57 |
| | MD5 | 8 | 0.623 |

The energy efficiency is increased by 44.5 times in comparison with the FPGA-based implementation scheme; the energy efficiency and area efficiency are improved by more than two orders of magnitude compared with a general-purpose processor with 1000 cores; and the energy efficiency is improved by 9.1 times and the area efficiency by 6390 times compared with a Cortex-M0 processor applying the low-power processing-in-memory scheme.
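The efficiency figures quoted above follow directly from the throughput, power and area numbers in Table 4.3 (energy efficiency = throughput/power, area efficiency = throughput/area). A quick cross-check of the stated ratios, with the baseline efficiencies back-computed from the comparison above:

```python
# Software-defined cryptographic chip (AES-128, Table 4.3)
sdc_energy_eff = 64 / 0.625        # = 102.4 Gbit/(s*W)
sdc_area_eff = 64 / 9.91           # ~ 6.46 Gbit/(s*mm^2)

# Baseline energy efficiencies implied by the ratios quoted in the text
fpga_energy_eff = 2.3              # 102.4 / 2.3  ~ 44.5x improvement
m0_pim_energy_eff = 11.2           # 102.4 / 11.2 ~ 9.1x improvement

print(round(sdc_energy_eff / fpga_energy_eff, 1))     # ~44.5
print(round(sdc_energy_eff / m0_pim_energy_eff, 1))   # ~9.1
```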

4.5 Hardware Security of the Processor

The popularity of personal computers has brought long-term growth momentum to the integrated circuit industry, and the mobile Internet allows everyone to carry electronic chips of some form with them. As Internet of Things technology spreads, chips will be deployed all over the world. Processor chips appear in daily life in various forms and applications, from server centers for cloud computing and financial services, to the base stations spread across cities for mobile communication, and from bank card chips to medical electronic equipment.

Table 4.3 Comparison of area efficiency and energy efficiency

| | General-purpose processor 1 [69] | General-purpose processor 2 [59] | FPGA [70] | Reconfigurable processor [71] | Software-defined chip |
|---|---|---|---|---|---|
| Technology/nm | – | – | 40 | 45 | 65 |
| Frequency/MHz | – | – | 319 | 1000 | 500 |
| Throughput rate/(Gbit/s) | – | – | 40.8 | 128 | 64 |
| Area/mm² | – | – | 1.275 | 6.32 | 9.91 |
| Power consumption/W | – | – | – | 6.2 | 0.625 |
| Energy efficiency/(Gbit/(s·W)) | 0.68 | 11.2 | 2.3 | 14.3 | 102.4 |
| Area efficiency/(Gbit/(s·mm²)) | 0.043 | 9.32 × 10⁻⁴ | 32 | 6.72 | 6.46 |

The security of processor hardware is related to various fields of the national economy and people's livelihood.

4.5.1 Background

With the globalization of the integrated circuit supply chain, malicious code and backdoors may be implanted at various stages of the design and manufacturing process, for example through commercial third-party IP cores, EDA software, or wafer manufacturing. In addition, potential security vulnerabilities caused by design flaws, such as Meltdown [72] and Spectre [73], also make CPUs insecure. How to ensure the security of critical hardware such as CPUs when a fully controllable supply chain is infeasible is an urgent problem. From the perspective of design, manufacturing, and the supply chain, integrated circuit hardware security can be divided into two settings: an incompletely trusted production and supply chain, and a completely trusted one. Although the refined division of labor in the globalized industrial chain has greatly lowered the entry barrier of the semiconductor industry, it also introduces potential risks to the security of integrated circuits. As shown in Fig. 4.50, the supply chain of integrated circuits is divided into design, validation, production, encapsulation, testing, and final application, and each step poses potential security threats: IP leakage, third-party IP trust, design vulnerabilities, and malicious circuit implantation in the design stage; invalid validation and specification vulnerabilities in the validation stage; mask tampering, malicious implantation, and malicious fuse programming during production, encapsulation, and testing; and invasive and non-invasive attacks by malicious users at the final application end. Threats to hardware security are present everywhere, from the initial design, through manufacturing and test, to the final deployment and application.

Fig. 4.50 Risk of the integrated circuit supply chain: design, validation, production, encapsulation, test, and application


4.5.2 Analysis of CPU Hardware Security Threats

Given the highly globalized supply chain of CPUs, the design and implementation of a CPU is basically a black box for end users, yet the hardware security of CPUs is critical to the normal functioning of production and daily life in modern society. Hardware vulnerabilities, hardware front doors, and malicious hardware are the three prevailing CPU hardware security threats. Most hardware vulnerabilities are design vulnerabilities rooted in the working principles of modern CPU technology; there is not enough information at the software level to detect and prevent such attacks, and a ground-breaking innovation in CPU architecture would be required to solve the problem completely at the hardware level. The hardware front door usually refers to non-public channels reserved by the designer and manufacturer for updates and long-term maintenance; however, it may also leave a channel open for illegal attacks. Malicious hardware includes a variety of hardware Trojan horses and backdoors. Modern chips can contain billions to tens of billions of transistors, and modifying only dozens of them is enough to implant a Trojan horse or backdoor; locating and analyzing Trojans and backdoors among so many transistors in the traditional way is tantamount to finding a needle in a haystack. Considering that only limited public information about hardware front doors and Trojan horse backdoors is available, as it involves trade secrets, this section mainly expands on hardware vulnerabilities.

Since the Spectre and Meltdown attacks were published in 2018, attack methods based on transient execution have mushroomed. This kind of memory-leak attack exploits the branch prediction function of modern processors. In 2019, a new type of attack based on the leakage of information from CPU-internal buffers was published; typical examples include RIDL [74] (rogue in-flight data load), Fallout [75] and ZombieLoad [76]. Attackers use microarchitectural data sampling to execute faulting load instructions in advance and leak critical sensitive data through side channels. These new vulnerabilities exploit behavioral patterns at the processor microarchitecture level, such as out-of-order execution, speculative execution, and other transient execution patterns.

Spectre was independently discovered by Jann Horn of Google Project Zero; the problem was also discovered by the collaborative research of Paul Kocher, Daniel Genkin, Mike Hamburg, Moritz Lipp, and Yuval Yarom. Spectre is not a single vulnerability that can be easily repaired; it now refers to a whole class of vulnerabilities, all of which take advantage of by-products of the speculative execution that modern CPUs use to speed up execution. A Spectre attack uses a timing-based side channel: a malicious process can obtain information in the mapped memory of other programs after speculative execution on sensitive data. The Meltdown attack was independently discovered by three research teams: Jann Horn of Google Project Zero; Werner Haas and Thomas Prescher of Cyberus Technology; and Daniel Gruss, Moritz Lipp, Stefan Mangard and Michael Schwarz of Graz University of Technology. Meltdown attacks rely on the out-of-order execution


of instructions after an exception occurs. Certain CPUs allow transient instructions in the pipeline to use the result of an instruction that is about to raise an exception before that instruction is retired. In this way, low-privileged processes can read data in memory regions protected at a higher privilege level without obtaining that privilege. Meltdown vulnerabilities exist in most Intel CPUs with the x86 instruction set; some IBM POWER processors and some ARM processors are also affected. Fallout, RIDL, and ZombieLoad attacks are all based on microarchitectural data sampling vulnerabilities related to hyper-threading in Intel x86 processors. As a result, security boundaries that should be guaranteed by the architecture can be breached, causing data leakage. Unlike Spectre and Meltdown attacks, which leak data through the CPU cache, RIDL and Fallout exploit speculative execution to collect information from the CPU's internal buffers, including the line fill buffer, the load ports and the store buffer.

4.5.3 Existing Countermeasures

For CPU hardware security issues, traditional methods aimed at design vulnerabilities can correct the related problems to a certain extent. In this section, we introduce five countermeasures to common hardware design vulnerabilities and discuss their pros, cons and limitations. Some of them are permanent or temporary fixes to security vulnerabilities that have been disclosed on specific CPUs, and some are future design improvements promised by CPU manufacturers.

(1) Kernel page-table isolation (KPTI) is a software-level solution adopted by the Linux kernel that strengthens the isolation of user-space and kernel-space memory to mitigate the Meltdown hardware defect in x86 CPUs. x86 CPUs that support process-context identifiers (PCID) can use KPTI while avoiding flushes of the translation lookaside buffer. Literature [77] reported that the overhead of KPTI may be as high as 30%, even with the PCID optimization.

(2) The load fence (lfence) scheme. A memory barrier is a type of synchronization instruction used to enforce the ordering of memory operations by the CPU and the compiler. The read barrier instruction on x86 CPUs is lfence, and the corresponding instruction on the ARM architecture is csdb (consumption of speculative data barrier). Literature [78] mentioned that Microsoft inserts lfence serializing instructions in its C/C++ compiler to mitigate Spectre attacks. However, on the one hand, it is not easy to select an appropriate


position to insert the lfence instruction. On the other hand, lfence can only defeat some Spectre variants in specific situations, and can cost up to 60% of performance.

(3) In January 2018, Google introduced the Retpoline technique on its security blog, aiming to mitigate Spectre vulnerabilities efficiently. The technique replaces indirect jump instructions with return-instruction sequences to prevent vulnerable speculative execution of indirect branches. Google engineers believe that the solution they proposed for x86 CPUs can also be used on other platforms such as ARM. Retpoline can defeat Spectre attacks based on the branch target buffer (BTB), but it has no effect on attacks based on other CPU structures. Intel also pointed out that the control-flow enforcement technology (CET) in future CPUs may raise false alarms on code using Retpoline. A paper published by Google in February 2019 argued that protection relying only on software cannot completely avoid Spectre vulnerabilities, and that the design of CPUs must be modified.

(4) CPU manufacturers can add a feature that prohibits speculative execution of indirect branches by updating the microcode. Meanwhile, the operating system also needs to be modified to prohibit speculative execution of indirect jumps and to prevent the other thread of a physical core from influencing indirect branch speculation, so that subsequent indirect branch predictions are not controlled by a previous indirect jump. However, these measures cause a large loss of CPU performance and require the simultaneous modification of microcode and system software [79].

(5) CPU manufacturers can redesign the processor microarchitecture against known attack methods. For example, Intel claimed to use a "virtual fences" technique to isolate speculative execution. There are many variants of the Spectre and Meltdown attacks: the redesigned Xeon processors can avoid variant 2 (CVE-2017-5715, Spectre) and variant 3 (CVE-2017-5754, Meltdown), but variant 1 (CVE-2017-5753, Spectre) is still not addressed [80].

On the whole, Meltdown attacks can currently be prevented with KPTI, but Spectre attacks cannot, as the software level still lacks sufficient information: on the one hand, it is difficult to extract feature codes of the malware; on the other hand, software cannot observe CPU behavior between the issue and the retirement of instructions. The problem cannot be solved at the software level alone, and the cost of trying is high. CPU manufacturers' modifications cannot guarantee immunity against future attacks, and the updated protection capabilities lack third-party validation. There are many types of unit modules in the microarchitecture, and existing software protection methods only protect one or two of them [79].


4.5.4 CPU Hardware Security Technology Based on Software-Defined Chips

The essential issue underlying CPU hardware security threats is that it is impossible to validate the security consistency between the CPU hardware implementation and the CPU design specification [81]. On the one hand, security validation against malicious hardware insertion requires an enormous test space; on the other hand, malicious design vulnerabilities lack a golden model for security validation. The general idea of the CPU hardware security dynamic monitoring and control technology based on software-defined chips is that architectural security is the foundation and behavioral security is its observable representation: insecurity in the underlying hardware will manifest itself as insecure behavior, so the hardware security of the CPU can be represented by the security validation of the CPU's hardware behavior. The purpose of this technology is to detect behavioral security at runtime. The monitoring scheme uses a security judgment mechanism with a whitelist as the main means and a blacklist as a supplement, covering most hardware behavior on most channels. With this technology, attacks that use cache side channels, such as Meltdown and Spectre, hidden backdoors, and illegal instructions in the CPU can be monitored effectively at an overall cost of less than 10%, and uncontrollable front-door threats such as the management engine (ME) subsystem and microcode updates can be eliminated.

The effectiveness of the CPU hardware security dynamic monitoring and control technology rests on the fact that CPU behavior follows the Turing machine model: the behavior is deterministic, so the behavioral safety of the CPU hardware can be checked for equivalence against an equivalent Turing machine model. Existing methods use another independent CPU for recording and replay to achieve this kind of monitoring [82]. However, considering the complexity and relative closedness of contemporary CPUs, replay of the target CPU at the instruction set architecture level cannot be achieved with another CPU or a commercial FPGA product, not to mention the requirements on energy efficiency and power consumption. The software-defined chip RCP, by contrast, is in many respects an excellent platform for recording and analyzing the behavior of the target CPU in the hardware dynamic security check (DSC). The dynamic reconfigurability of the RCP means that the instruction set models of the target CPU need not all be configured on the RCP chip at the same time; they are loaded and configured dynamically only when needed. The architecture models of multiple CPUs can also be retained in the configuration information, providing high flexibility and power efficiency, and the RCP can be configured with multiple cryptographic security components as needed to meet the corresponding system security requirements.

The equivalent Turing machine model is one of the keys to monitoring the security of CPU hardware. The core instruction state, input and output data, and transfer function of the security-extended Turing machine model in Fig. 4.51 correspond to the instruction set model and the behavior-level model. The micro-architecture state


corresponds to the RTL model of the CPU, and the physical state corresponds to the transistor-level model. The outer layers of the model have higher simulation costs but are closer to the actual system, so their checks are more effective. The instruction set and behavior-level models correspond to the security of the output channel and the storage channel, and the RTL-level model corresponds to software side-channel security. If an accurate and credible transistor-level model were available, it would even become possible to monitor hardware side-channel security; in reality, however, it is difficult to obtain credible RTL-level and transistor-level models of the CPU hardware. The CPU hardware dynamic security check (DSC) technology is therefore a dynamic monitoring technology based on the instruction-set-level CPU model and hardware behavior security assertions. As shown in Fig. 4.52, the main process of DSC is divided into three stages: collection, validation and control.

Fig. 4.51 CPU hardware security dynamic monitoring and control technology

Fig. 4.52 DSC technology schematic


In the record stage, CPU behavior samples of limited length are collected, and the hardware state transition process is preserved: the start state of the CPU, the end state of the CPU, and the inputs and outputs of the CPU during this period are all recorded. In the validation stage, the DSC system replays the samples, checks the equivalence of the recorded and replayed results, and identifies unanticipated behaviors. Once an unanticipated behavior is found, in the control stage the specific hardware attack behavior is reported and responsive measures are taken on the hardware.

1. System framework

As shown in Fig. 4.53, the main body of the DSC dynamic monitoring and control system includes two parts: the monitoring and control chip die and the supervised commercial CPU chip die. The monitoring and control module collects the behavior of the commercial CPU chip at regular intervals while it is running; the duration of a single collection is called the collection window. The monitoring and control module records the initial state and end state of the CPU at the beginning and end of the collection window. During the entire collection window, the monitoring and control module takes an incremental snapshot of main memory, recording only the changed and accessed data and their addresses. Meanwhile, the I/O data and asynchronous events of external devices, such as network cards, graphics cards and MEs, are also collected and recorded. As shown in Fig. 4.54, in the behavior analysis stage, the instruction set architecture model of the target CPU is used as a golden reference to replay the collected behaviors and judge whether they are unexpected. In the behavior control stage, the results of behavior analysis are combined with a security policy, constructed from the definition of security attributes and the security level specified by the administrator, to respond to unanticipated behaviors.

2. Key technologies of DSC

The DSC system involves many key technologies. In this section, we introduce the key technologies used in behavior sampling, analysis, and security control.

(1) The key technology in the behavior sampling stage

The key technology in the behavior sampling stage of the DSC system concerns the collection of CPU state and behavior. It mainly includes two techniques: non-intrusive sampling of CPU hardware behavior, and the recording of asynchronous event data on the high-speed bus together with their alignment to instruction boundaries. The goal of the non-intrusive behavior sampling technique is to collect the start and end states of the CPU and the data transmission behavior of all CPU interfaces without interfering with the operation of the CPU. The challenge lies in designing behavior sampling that does not rely on the microarchitecture; the sampling also needs to be transparent to software to ensure compatibility of the software operating environment, and the complete runtime memory has to be recorded at a low cost.
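As a schematic illustration of the record, replay and compare flow described above, the following minimal sketch may help; the record fields, the ISA-level model interface and the control hook are simplified placeholders, not the actual DSC interfaces.

```python
def replay(start_regs, mem_in, isa_model, n_instructions):
    """Re-execute one collection window on an instruction-set-level model.
    isa_model(regs, mem) advances the modeled CPU by one instruction (placeholder)."""
    regs, mem = dict(start_regs), dict(mem_in)
    for _ in range(n_instructions):
        isa_model(regs, mem)
    return regs, mem

def validate_window(record, isa_model):
    """record holds one sampled window: start/end register state, the memory the
    window read ('mem_in') and the memory locations it was observed to write ('mem_out')."""
    regs, mem = replay(record["start_regs"], record["mem_in"],
                       isa_model, record["n_instructions"])
    ok = (regs == record["end_regs"] and
          all(mem.get(a) == v for a, v in record["mem_out"].items()))
    if not ok:                                    # control stage: report and respond
        print("unanticipated hardware behavior in window", record.get("id"))
    return ok
```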


Fig. 4.53 DSC framework and components

Fig. 4.54 DSC analysis of CPU behaviors: replay and judgment


While sampling the CPU interfaces in hardware, it is necessary to remain compatible with the existing hardware interfaces. The existing method is divided into three parts: a virtual machine monitor interface (hypervisor) is used to collect the internal register information of the CPU; a custom dedicated memory tracer (MTR) is used to collect the accessed memory; and a custom dedicated I/O tracer (ITR) is used to collect the I/O behavior of the CPU. The dedicated I/O tracer introduces a PCIe switch to address software compatibility and point-to-point signal integrity. This behavior sampling technique allows collection without interference and at low cost: the performance loss caused by collecting the start and end states of the CPU is less than 2%, and the performance loss caused by sampling the CPU's memory and I/O accesses during the sampling window is less than 1%.

The goal of the asynchronous event recording and alignment technique is to record the exact location at which CPU asynchronous events occur, providing the basis for subsequent replay checks. The difficulty lies in capturing and locating external interrupts and DMA (direct memory access), and in precisely locating asynchronous events inside loops and recursive calls: because of jump instructions, the PC pointer alone cannot uniquely locate an event in the instruction stream. Accurate positioning can be achieved by combining the PC with a register that increases monotonically with the execution of instructions, such as the branch counter. An external interrupt causes the CPU to exit the virtual machine operating mode, and the interrupt information is recorded in the virtual machine control block, so capture can be completed by extracting the interrupt information from the virtual machine control block. For DMA, the ITR intercepts and temporarily holds the current DMA request and raises an interrupt to the CPU; the CPU exits the virtual machine operating mode, the DSC records the current running state through the virtual machine monitor interface, and the positioning of the DMA asynchronous event is completed. After that, the ITR releases the intercepted DMA operation, and the DMA request is written to memory.

(2) The key technology in the behavior analysis stage

The main purpose of DSC behavior analysis is to determine whether the CPU behavior is in line with expectations and whether vulnerability exploitation has occurred. Two behavior judgment methods are adopted: unanticipated behavior judgment based on sample replay on the CPU security behavior model, and CPU vulnerability attack judgment based on speculative replay. The goal of unanticipated behavior judgment based on sample replay is to accurately replay the CPU instruction behavior and identify unanticipated behaviors. The challenges of this method lie in building a model of the CPU's complete instruction set architecture, mapping the hardware behavior onto the instruction set, and injecting non-deterministic events.
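The PC-plus-branch-counter alignment described above can be illustrated with a minimal sketch (the data layout is purely illustrative): each asynchronous event is tagged at capture time with the pair (PC, number of retired branches), and during replay it is injected only when both values match, which disambiguates repeated visits to the same PC inside loops and recursion.

```python
from collections import defaultdict

def build_injection_table(events):
    """events: list of (pc, branch_count, payload) captured by the ITR/hypervisor."""
    table = defaultdict(list)
    for pc, branches, payload in events:
        table[(pc, branches)].append(payload)
    return table

def replay_with_events(instr_stream, table, inject):
    """instr_stream yields (pc, is_branch); inject() applies an event to the model."""
    branches = 0
    for pc, is_branch in instr_stream:
        for payload in table.get((pc, branches), ()):   # unique point in the stream
            inject(payload)                              # e.g. external interrupt or DMA
        if is_branch:
            branches += 1
```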


DSC obtains the instruction sequence according to the collected program counter (PC). During replay, the simulated instructions are executed on the simulation model of the CPU, including the transitions of the CPU state and the effects of input and output. It is also necessary to inject the corresponding asynchronous events at the correct instruction boundaries. Experimental tests have shown that this analysis method can find unanticipated behaviors and undefined instructions of the CPU.

The purpose of CPU vulnerability attack judgment based on speculative replay is to detect attacks that exploit CPU speculative execution vulnerabilities. The challenge is that the instruction-level behavior of such an exploit is completely in line with expectations, software solutions cannot observe speculative execution at the microarchitecture level, and matching on instruction features alone has both a high error rate and a high cost. As shown in Fig. 4.55, the judgment based on speculative replay uses two replay logics: normal replay and speculative replay. The two replay logics record memory access addresses into the virtual cache, which holds the normal address list, and into the speculative address list (SAL), respectively. The speculative replay logic terminates when a memory access is not found in the memory recording module. If a normal replay instruction tries to measure memory access latency, the CPU security model judges the source of the accessed address: if it comes from the SAL and is not in the normal virtual cache, the access is determined to be an attack (a minimal sketch of this rule follows Fig. 4.55).

(3) The key technology in the security control stage

The main purpose of security control is to ensure the security of the DSC system components when the CPU behavior does not meet expectations. The key technology adopted is trusted system startup and firmware verification. The main goals are to ensure the predictable startup of the behavior collection and analysis system and the controlled execution of CPU hardware behavior; how to establish the security identity of the system and ensure controllable and reliable updates of the CPU microcode remain open problems. The startup process of the DSC system includes the attestation of the RCP configuration firmware and a boot loader process based on signatures and

Fig. 4.55 Schematic diagram of the principle of speculative replay (SAL: speculative address list; VCACHE: virtual cache, i.e., the normal address list)
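A minimal sketch of the detection rule illustrated in Fig. 4.55 (the trace format is a simplified placeholder): the speculative replay fills the SAL with addresses that only a mispredicted path would have touched, the normal replay fills the virtual cache with architecturally performed loads, and a latency-measuring load whose address appears in the SAL but not in the virtual cache is flagged as a Spectre-style attack.

```python
def detect_spectre(trace):
    """trace: sequence of dicts such as
       {"op": "load", "addr": 0x1000, "timed": False, "speculative": False};
       'speculative' marks accesses produced by the speculative replay logic."""
    vcache, sal = set(), set()
    for ins in trace:
        if ins["op"] != "load":
            continue
        if ins.get("speculative"):
            sal.add(ins["addr"])              # touched only under a mispredicted branch
            continue
        if ins.get("timed") and ins["addr"] in sal and ins["addr"] not in vcache:
            return True                       # latency probe on speculatively loaded data
        vcache.add(ins["addr"])               # normal replay: architecturally visible load
    return False

# toy example: a speculative load brings 0x4000 in, then a timed probe reads it
demo = [{"op": "load", "addr": 0x4000, "speculative": True},
        {"op": "load", "addr": 0x4000, "timed": True}]
print(detect_spectre(demo))                   # True
```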


the one-time programmable (OTP) memory. The microcode updates of the supervised commercial CPU also need to be controlled. Generally speaking, CPU microcode can be updated through multiple channels, including the basic input/output system (BIOS) and the CPU driver and system patches in the operating system. The operation of writing microcode can be intercepted by the security control technology; the pre-programmed credential is then used to verify the signature of the microcode and to complete identity validation and tamper-proofing.

3. DSC prototype system

A Xeon processor is used as the monitoring and control target to validate the effectiveness of the DSC system. The software-defined processing chip RCP, the ITR and the MTR are used to monitor and control the Xeon processor online. The DSC prototype system architecture is shown in Fig. 4.56. The RCP chip is used to track the internal registers of the Xeon processor and to replay the x86 instruction set architecture model; the ITR chip is used to track the information in the I/O channel; and the MTR is used to track data accesses to memory. The Xeon cores, RCP and ITR are packaged together in an LGA-3647 package, while the MTR and DDR4 are integrated on the DIMM module. Figure 4.57 shows the module architecture of the RCP in DSC. The reconfigurable PEs and arrays can be used to map the instruction set architecture models of various CPUs. In the behavior analysis stage, behaviors of the Xeon processor are replayed

Fig. 4.56 CPU hardware security dynamic monitoring and control system for Xeon processors


Fig. 4.57 RCP function module in DSC

by the RCP at the instruction set architecture level, and all abnormal hardware behaviors that are inconsistent with the replay can be detected. The RCP also records and analyzes key and vulnerable hardware details that are not covered by the ISA model, such as branch prediction, so that it can detect Spectre attacks. The system also includes software-level support, mainly a host operating system that supports the virtual machine monitoring interface and the related configuration information of the RCP. The host operating system selected here is CentOS 7.4 with support for the processor check module (PCM); the guest operating system is Red Hat 7.3. Test results show that when 300,000 servers work at the same time, 99.8% of hardware Trojan attacks can be detected; the performance loss on a single server is only 0.98%; and the power consumption is 33 W (about 7% of the power consumption of a single server). Meltdown and Spectre V1 can be detected at the same cost. Tests with public attack demonstration programs show that more than 90% of Meltdown attacks and more than 99% of Spectre attacks can be detected when a 100-µs collection window is used and collection is performed once per second. As shown in Fig. 4.58, the DSC system has been applied in Montage Technology's Jintide® server CPU and deployed in Lenovo high-performance servers.

Fig. 4.58 DSC prototype based on the Jintide® server CPU: an industrial server CPU (2.0 GHz, 24 cores, TDP 95–150 W) monitored by the RCP chip (reconfigurable logic arrays and a µController, 15 × 20 mm², 0.5 GHz, TDP 40 W), the ITR chip on the PCIe path, and MTRs on the DDR4 DIMM

4.6 Graph Computation

The graph processing architecture is currently a hot area in industry and academia [83–95]. It is a typical domain-specific accelerator (DSA), and its development route also reflects the typical features of software-defined chips.

Fig. 4.65 (a) Push operator; (b) pull operator

The push operator reads the value of the "central node" and updates the value of the adjacent nodes of its outgoing edges. The pull operator reads the values of the adjacent nodes on the incoming edges of the "central node", possibly together with the weights of the incoming edges, and then updates the value of the "central node" [97]. Figure 4.65a shows the execution of the push-type "maximum propagation operator" with the node whose value is 8 in Fig. 4.60 as the central node: the value of the central node is propagated to the adjacent nodes on its outgoing edges, and the values of some adjacent nodes are updated after comparison. Figure 4.65b shows the execution of the pull-type "maximum propagation operator" with the node whose value is 6 in Fig. 4.60 as the central node.

The vertex-centric model executes the complete algorithm by executing these operators repeatedly [96]. The specific process [96] is as follows:

(1) At the beginning of the algorithm, each node is given an initial value and an initial activation state. The initial activation state varies across algorithms: in the "maximum propagation" algorithm every node is activated at the beginning, whereas in the single-source shortest path algorithm only the source node is activated.

(2) In the current iteration, the system executes an operator on each active node, and activates the nodes whose values are updated. For the push operator, the activated nodes are those whose values were updated; for the pull operator, the activated nodes are all the nodes connected by the outgoing edges.

(3) The iteration is repeated until there are no more activated nodes.

Figure 4.66 shows the process of executing the "maximum propagation" algorithm on the graph shown in Fig. 4.60; the process shown in the figure adopts the bulk-synchronous parallel mode and a push operator.
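A minimal sketch of the vertex-centric process just listed, using a push operator and bulk-synchronous execution for the "maximum propagation" example (the graph and data layout are illustrative only):

```python
def max_propagation(out_edges, values):
    """out_edges: node -> neighbors on outgoing edges; values: initial node values."""
    active = set(values)                          # every node starts activated
    while active:
        updates = {}                              # BSP: updates take effect next iteration
        for u in active:                          # push operator, run once per active node
            for v in out_edges.get(u, ()):
                if values[u] > updates.get(v, values[v]):
                    updates[v] = values[u]
        active = {v for v, x in updates.items() if x > values[v]}
        values.update(updates)
    return values

graph = {0: [1], 1: [2], 2: [3], 3: [0]}          # a small ring
print(max_propagation(graph, {0: 9, 1: 8, 2: 7, 3: 6}))   # all nodes converge to 9
```

Replacing the comparison with an addition of the edge weight and keeping the smaller value turns the same skeleton into the Bellman-Ford operator discussed next.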


Fig. 4.66 Execution process of “maximum propagation algorithm” [96]

The versatility of the above model is obvious. For example, suppose a push operator is used and its function is defined as follows: visit the adjacent nodes on all outgoing edges of the central node; when visiting an adjacent node, first compute the sum of the central node's value and the corresponding edge weight, compare it with the adjacent node's value, and keep the smaller of the two as that node's value in subsequent iterations. In this way, the Bellman-Ford algorithm mentioned above is implemented, and when every edge weight is 1, the same operator also implements the breadth-first search algorithm. As another example, if a pull operator is used and its function is defined as follows: accumulate, over all adjacent nodes of the central node, the quotient of each node's PageRank value and its number of outgoing edges, then apply a multiply–add with the pre-specified constants (as in line 14 of Fig. 4.64) and use the result as the new PageRank value of the central node. Then the PageRank algorithm is implemented [2].

During the actual execution of a graph computing model, the handling of updated values is closely related to the parallel execution of the algorithm. Generally speaking, graph computing can be divided into two execution modes [97]: bulk-synchronous parallelism (BSP) and asynchronous parallelism. There is only one core difference between the two, namely whether an updated node value is immediately visible within the current iteration: in BSP an updated node value only takes effect in the next iteration, while in asynchronous parallelism it takes effect immediately.
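The single difference between the two modes, whether an update is read back within the same pass, can be made explicit with a minimal sketch (illustrative only); the synchronous variant corresponds to the Jacobi-style iteration and the asynchronous one to the Gauss-Seidel-style iteration discussed next.

```python
def iterate(values, out_edges, better, synchronous=True):
    """One pass over all nodes. better(a, b) returns the preferred value.
       synchronous=True  -> BSP: reads see the previous iteration's values (Jacobi-like).
       synchronous=False -> asynchronous: updates are visible immediately (Gauss-Seidel-like)."""
    src = dict(values) if synchronous else values   # snapshot vs. in-place
    for u in list(values):
        for v in out_edges.get(u, ()):
            values[v] = better(src[u], values[v])
    return values

graph = {0: [1], 1: [2], 2: [3]}
print(iterate({0: 9, 1: 0, 2: 0, 3: 0}, graph, max, synchronous=True))   # {0:9, 1:9, 2:0, 3:0}
print(iterate({0: 9, 1: 0, 2: 0, 3: 0}, graph, max, synchronous=False))  # {0:9, 1:9, 2:9, 3:9}
```

The asynchronous pass propagates the value across the whole chain in one sweep, which is exactly the faster convergence (and the stronger ordering dependence) described in the text.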


We can also understand the difference between the two by comparing the Jacobi iteration and the Gauss-Seidel iteration in numerical computation. Generally speaking, BSP is simpler to implement and easier to parallelize massively, but it converges more slowly (more iterations are required for convergence) [97]. Asynchronous parallelism converges faster, but it is not easy to parallelize: to make updated values immediately visible, complex synchronization operations are required [97]. In addition, asynchronous parallelism can further improve the convergence speed by aggressively scheduling the processing order of different nodes [90, 97]. A typical example is the SSSP algorithm: the Bellman-Ford algorithm uses BSP, while the Dijkstra algorithm and its variants are asynchronous. In fact, the Dijkstra algorithm needs only one iteration to obtain the final result, but it is almost impossible to parallelize and its scheduling cost is extremely high.

2. Matrix perspective of the vertex-centric model [97, 100]

The above algorithm model can also be viewed from a matrix perspective. Although the vertex-centric model is easy to use, it tends to confine one's view to a local region of the graph or a local operation of the graph algorithm. Viewing the graph algorithm model from the matrix perspective helps one grasp the execution of a graph algorithm as a whole, and is more conducive to understanding graph analysis frameworks. The matrix perspective rests on the following mathematical fact: one iteration of most graph algorithms can be regarded as a generalized matrix–vector multiplication defined on a certain semiring [97, 100]. The only difference between this generalized matrix–vector multiplication and an ordinary one is that the multiplication and addition operations are replaced by the user-defined "edge process" and "reduce" operators, respectively [100]. Generally speaking, the matrix involved is the transposed adjacency matrix of the graph, and the vector involved is the node value vector obtained in the previous iteration; the vector produced by the matrix–vector multiplication is then applied to the old node value vector to obtain the new one.

Figure 4.67 depicts, from the matrix perspective, the execution process of the algorithm in Fig. 4.66. In this algorithm, the "edge process" operator plays the role of multiplication, simply passing the old value of the adjacent node to the "reduce" operator; the "reduce" operator plays the role of addition and takes the maximum value, max{·, ·}; and the "apply" operator likewise takes the maximum. We do not explain it further, as the figure is very intuitive. It is worth pointing out that, in the matrix perspective, accessing the edges column by column from top to bottom (or from bottom to top) is equivalent to executing the push operator, while accessing the edges row by row from left to right (or from right to left) is equivalent to executing the pull operator.
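A minimal sketch of this generalized matrix–vector multiplication (edge storage and operator signatures are illustrative): the transposed adjacency matrix is traversed as (destination, source, weight) triples, multiplication is replaced by a user-defined process operator, addition by a reduce operator, and an apply operator folds the temporary vector into the old node value vector.

```python
def generalized_spmv(edges, values, process, reduce_, apply_):
    """edges: iterable of (dst, src, weight) from the transposed adjacency matrix."""
    temp = {}
    for dst, src, w in edges:
        contrib = process(values[src], w)                 # replaces multiplication
        temp[dst] = reduce_(temp[dst], contrib) if dst in temp else contrib
    return {v: apply_(values[v], temp.get(v)) for v in values}   # fold into old vector

# "maximum propagation": process passes the source value through; reduce/apply take max
edges = [(1, 0, 1), (2, 1, 1), (3, 2, 1), (0, 3, 1)]
values = {0: 9, 1: 8, 2: 7, 3: 6}
step = generalized_spmv(edges, values,
                        process=lambda x, w: x,
                        reduce_=max,
                        apply_=lambda old, new: old if new is None else max(old, new))
print(step)   # {0: 9, 1: 9, 2: 8, 3: 7} after one iteration
```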


Fig. 4.67 Matrix perspective of the execution process of “maximum propagation algorithm” (see color picture)

3. Difficulties in the implementation of graph computing

There are three difficulties in implementing the above graph computing models: a low computation-to-memory-access ratio, irregularity, and extremely large data sets [97].

(1) The computation-to-memory-access ratio is very low. In all of the graph algorithms mentioned above, only a few computations are performed each time an edge is visited; for example, PageRank performs only one multiply–add per edge visited (note that the division can be transformed into a multiplication). Therefore, even if the parallelism of the computing units is fully exploited, execution is ultimately limited by the memory access bandwidth, and the abundant computing resources on the chip cannot participate in the computation effectively.

(2) Fine-grained irregularity.

➀ Irregularity in memory access. Either the source node accesses or the destination node accesses must involve random memory accesses, while node storage is fine-grained (in many cases a node value occupies only 4 bytes). In addition, the activation operations also cause random memory accesses. These fine-grained irregular memory accesses cause cache blocks to be evicted prematurely, which in turn makes it difficult for the cache system to


discover whatever memory access locality exists. Since data transfers from the main memory system are always coarse-grained (one cache block, i.e., 64 bytes, is transmitted at a time), fine-grained random memory accesses that go directly to main memory inevitably waste a large amount of bandwidth. Moreover, highly random accesses often cause row misses in the DRAM devices, so the DRAM chips are frequently activated and precharged, which further reduces the memory access bandwidth and increases the memory access latency.

➁ Irregularity in parallelism. The execution of operators always requires atomicity, whether in BSP or in asynchronous parallelism. The graph structure contains extensive irregular data dependences, which lead to frequent write–write and read–write conflicts when graph algorithms are executed in parallel, and avoiding these conflicts inevitably leads to irregularity in parallelism. Asynchronous parallelism introduces even more dependences, since it requires the result of an operator to be immediately visible, and thus greatly increases the irregularity in parallelism. In short, the difference between BSP and asynchronous parallelism reflects different trade-offs between convergence and irregularity in parallelism: better convergence reduces the total workload, while greater irregularity in parallelism increases the cost of completing the same workload.

(3) The data set is extremely large. The scale of a very large graph may exceed the capacity of DRAM, which means the disk may have to be accessed during graph computing, and the low performance of the disk brings an even more serious bandwidth bottleneck.

It can be seen that the fundamental performance bottleneck of graph computing is memory access bandwidth. In addition, the irregularity in parallelism increases the complexity of graph computing architecture design and decreases bandwidth utilization. Software optimization on current hardware architectures can hardly remove these bottlenecks: the cache-based multi-core architecture is ill-suited to irregular fine-grained parallelism, and the current main memory system limits the achievable memory bandwidth and storage capacity. This has greatly stimulated research on graph computing in the hardware architecture field, which is the main topic discussed below.
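To make difficulty (1) concrete, a rough back-of-the-envelope estimate for PageRank under an assumed compressed sparse layout with 4-byte indices and values (the byte counts are illustrative, not measured):

```python
# Per edge, PageRank performs roughly one multiply-add (2 operations) but must read
# at least a 4-byte neighbor index and a 4-byte source value, plus amortized traffic
# for offsets and destination updates.
ops_per_edge = 2
bytes_per_edge = 4 + 4 + 4            # index + source value + amortized other traffic
print(ops_per_edge / bytes_per_edge)  # ~0.17 op/byte: far too low to keep compute units busy
```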

4.6.3 Research Progress of Hardware Architecture for Graph Computing

As we will see, the research ideas behind graph computing hardware architectures are fundamentally consistent: they are all intended to deal with the three challenges mentioned above. Graph computing accelerators that follow the traditional technological


approach still rely on the traditional main memory system, so in fact they cannot cope with the first and third challenges, namely the bandwidth and capacity bottlenecks of the main memory system; their contributions therefore mainly concern the irregularity of graph computing [83, 84, 89, 90, 92]. To truly break the bandwidth bottleneck of graph computing, academia has shifted its attention to near-data processing architectures based on 3D stacking [87, 88, 95]. However, it is not easy to fully exploit the potential of this type of architecture because of the inherent irregularity of graph computing, and current research on this architecture also revolves around this issue. To cope with the third challenge, research on Flash SSD (solid state drive) based architectures has appeared in recent years [85, 101]. Finally, there is considerable research on graph computing acceleration based on existing CPU/GPU architectures [93, 94]; the purpose of this line of work is to greatly improve the efficiency of graph processing on existing architectures by introducing low-overhead changes (such as a DSA module with small area and power consumption). Of course, limited by length, what we discuss in this section is far from the full picture of graph computing architectures. For example, to overcome the memory access bottleneck, in-memory graph computing architectures based on memristors have also emerged in recent years [86, 91], and research on graph computing based on existing CPU/GPU architectures goes far beyond the SCU mentioned below. The main purpose of this section is to briefly introduce several representative lines of work and help readers establish a basic understanding of the extensive research in this field.

1. Graph computing accelerators of the traditional technological approach: dealing with irregularities in graph computing

As mentioned above, graph computing accelerators with a traditional chip architecture can only deal with the irregularities of graph computing, but this does not mean that their design is simple. In fact, without support from emerging technologies, the pursuit of excellent graph computing performance only becomes more challenging. The first question designers face is whether to choose BSP or asynchronous parallelism; the second is how to effectively handle the various irregularities of the chosen parallel mode. Next, we discuss typical accelerator designs for the different parallel modes.

(1) Accelerators for the bulk-synchronous parallel mode [83, 84]

Bulk-synchronous parallel accelerator designs using the traditional chip architecture include Graphicionado [83] from the 49th International Symposium on Microarchitecture (MICRO'49) and GraphDynS [84] from MICRO'52. The former establishes the basic form of the accelerator architecture in this mode, while the latter improves on it; the former is therefore the focus of our discussion. Graphicionado elegantly solves the problems of fine-grained random memory access and fine-grained irregular parallelism in the BSP mode. There are two ingenious


ideas in this design: one is that an eDRAM (embedded DRAM) on-chip scratchpad of up to 64 MB, combined with simple graph slicing, is used to eliminate off-chip random memory accesses; the other is that simple hashing is used to distribute computing tasks, which not only achieves good load balance but also removes the need for atomic operations [83]. Specifically, compared with off-chip accesses to main memory, on-chip random memory accesses provide higher effective bandwidth, a granularity that matches application requirements, and higher energy efficiency. Therefore, serially prefetching the data that may be accessed onto the chip and then performing random accesses on chip as actually needed is an effective way to eliminate irregular random memory accesses [83]. For graphs whose node value vector exceeds the on-chip memory capacity, the transposed adjacency matrix can be sliced horizontally by destination node (in the matrix view), so that the destination node vector corresponding to each sub-matrix fits completely on chip. This slicing scheme, however, brings additional overhead, including repeated reads of the source node values and reads of more edge index information. To minimize these overheads, the number of sub-matrices must be as small as possible, that is, each sub-matrix and its destination node vector must be as large as possible, which requires the on-chip memory capacity to be as large as possible. Graphicionado exploits the increasingly mature eDRAM technology and introduces an on-chip scratchpad of up to 64 MB, which minimizes the extra overhead caused by graph slicing; experiments show that this approach is extremely effective [83]. In addition, using on-chip memory in the form of a scratchpad instead of a cache also avoids improper eviction of cache blocks [83].

The implementation of Graphicionado is based on push operators [83, 100]. As shown in Fig. 4.68, the "source-oriented" front end of the parallel pipeline is used to read the activated source nodes and their outgoing edges. To hide memory access latency, the outgoing edges of multiple nodes are read in parallel in the same pipeline. For algorithms in which all source nodes are always active (such as PageRank), simple prefetching can be used [83].

Fig. 4.68 Graphicionado’s parallel pipeline architecture (the Apply stage is omitted in the figure) [83]
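The hash-based task distribution mentioned above, and described in more detail below, can be sketched as follows (sizes and names are illustrative): routing each update by the low bits of its destination node number guarantees that different rear pipelines own disjoint destination sets, so no atomic operations are required.

```python
NUM_PIPELINES = 8

def route(update_stream):
    """update_stream: iterable of (dst_node, value) produced by the source-oriented front end."""
    queues = [[] for _ in range(NUM_PIPELINES)]
    for dst, value in update_stream:
        queues[dst % NUM_PIPELINES].append((dst, value))   # crossbar by low bits of the id
    return queues

# each queue owns a disjoint set of destination nodes, so each rear pipeline can apply
# its reduce operator to its own eDRAM slice without locks
for q in route([(10, 1.0), (18, 2.0), (7, 3.0)]):
    print(q)
```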


After the edges are read, the "edge process" and "reduce" computations begin. If the pipeline remained organized around the source node at this point, conflicts between different pipelines would be inevitable because of the atomicity requirement. To circumvent this problem, Graphicionado adopts a simple hash distribution strategy: through a crossbar, the computing tasks associated with different destination nodes are allocated to the computing unit and corresponding eDRAM block whose number equals the low-order bits of the destination node number (if the number of pipelines is 8, this amounts to giving a task whose destination node number leaves remainder k when divided by 8 to the processing element numbered k [83]). In this way, the destination node sets handled by different computing units (that is, the different "destination-oriented" rear pipelines in Fig. 4.68) are guaranteed to be disjoint, which solves the problem of different computing units accessing the same destination node at the same time. Moreover, if the distribution of the number of edges per node is independent of the node number, the task load between different computing units is relatively balanced in the long run.

(2) Asynchronous parallel mode accelerators [89, 92]

Accelerator designs based on the asynchronous parallel mode include the work of Ozdal et al. at ISCA'16 [92] and GraphPulse [89] at MICRO'53. The former detects and resolves data dependences with a synchronization unit similar to a reorder buffer (ROB) [89, 92]; according to the experimental results, the complexity of this implementation limits the parallelism that the design can exploit [92]. The latter is a dataflow-style asynchronous parallel implementation based on "event-driven" scheduling [89]. The following discussion focuses on the design of GraphPulse.

In fact, "event-driven" exploitation of irregular parallelism is a design method that has been widely adopted, discussed, and researched in the software and architecture fields in recent years, and it has been proven to efficiently exploit task parallelism with a large number of irregular data dependences. The main difficulty in introducing this mechanism into asynchronous parallel graph computing is that if a new event is generated for every outgoing edge of every activated source node (because an activated source node updates the values of different destination nodes via each of its outgoing edges), the number of events quickly exceeds the capacity of the on-chip memory [89]: real graphs often have hundreds of millions or even tens of billions of edges, and even if only a small fraction of the edges are activated, the storage of a normal chip is far exceeded. The main contribution of GraphPulse is the introduction of an "event-driven" mechanism together with a special "coalescing" method that greatly compresses the event queue [89]. Figure 4.69 shows the overall design of GraphPulse, a typical "event-driven" architecture, and Fig. 4.70 shows the idea of "coalescing": the information of two events that are in the same queue and have the same destination node can be merged through the "reduce" operation, so that the two events coalesce into one [89]. In this way, the size of the on-chip queue of GraphPulse never exceeds the number of destination nodes in the subgraph currently being processed.
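A minimal sketch of the event coalescing just described (the event format is illustrative): two in-flight events that target the same destination node are merged with the algorithm's reduce operator, so the on-chip queue holds at most one pending event per destination node.

```python
def enqueue(queue, event, reduce_):
    """queue: dict mapping destination node -> pending contribution ("direct mapping"
       by destination, loosely analogous to the on-chip event storage)."""
    dst, value = event
    queue[dst] = reduce_(queue[dst], value) if dst in queue else value

events = [(5, 3.0), (7, 1.0), (5, 9.0)]      # two events share destination node 5
queue = {}
for e in events:
    enqueue(queue, e, max)                   # e.g. "maximum propagation" uses max as reduce
print(queue)                                 # {5: 9.0, 7: 1.0}: the queue size stays bounded
```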


Fig. 4.69 Overall architecture of GraphPulse [89]

Fig. 4.70 Principle of the “coalesce” operation [89] (see color picture)

Figure 4.71 shows the concrete implementation of the "coalesce" operation. To coalesce quickly, GraphPulse must be able to find coalescable events quickly. This is achieved through the "direct mapping" shown in the figure: events with the same destination node are mapped to the same address, somewhat like the direct-mapping rule of a cache [89]. In addition, by recording and coalescing event information in the event queue, GraphPulse also eliminates off-chip random memory accesses [89].

(3) GraphABCD: between bulk synchronization and asynchronization [90]

GraphABCD, presented at ISCA'20, steps completely outside the existing graph analysis frameworks. It examines graph computing from the perspective of block coordinate descent (BCD), which has been thoroughly studied in the field of machine learning, and proposes a new asynchronous framework that lies between BSP and traditional asynchronous parallelism.

Fig. 4.71 Implementation of the "coalesce" operation [89]: (a) on-chip memory address mapping of events; (b) execution flow (see color picture)

GraphABCD was proposed for two reasons. First, from the perspective of the underlying hardware, the view of the computing architecture needs to be extended from a single ASIC to heterogeneous computing in order to make full use of the various computing resources in a data center; however, synchronization between heterogeneous computing units often introduces significant overhead. Second, from the perspective of high-level algorithms, we want to understand how the design parameters of a graph computing framework affect the convergence speed of the algorithm, and how to balance algorithm-level convergence speed against the synchronization overhead of heterogeneous computing. To this end, GraphABCD introduces the BCD perspective to examine and design a heterogeneous graph computing framework. As shown in Fig. 4.72, the BCD framework slices the matrix. Unlike BSP and traditional asynchronous parallelism, BCD processes one sub-matrix at a time and makes the result of processing that sub-matrix immediately visible to subsequent processing. This means that ➀ when processing a sub-matrix block, GraphABCD does not face the severe data dependence problems of asynchronous parallelism, and ➁ when processing different sub-matrices, the computation results of previous sub-matrices can be used, so convergence is accelerated. In this way, GraphABCD can achieve a good trade-off between convergence speed and synchronization overhead. The specific algorithm flow is shown in Fig. 4.72b: GraphABCD repeatedly selects and processes a matrix block according to the scheduling algorithm until no matrix block remains to be processed. The design space includes the size of the matrix blocks and the block scheduling algorithm. Generally speaking, the smaller the matrix block, the better the convergence but the greater the synchronization overhead.

(a) Division of graphs
(b) Execution process

Fig. 4.72 Division of graphs in GraphABCD and the execution flow on graph division [90]

Simple scheduling algorithms, such as cyclically processing matrix blocks, generally feature low overhead but slow convergence, while complex priority scheduling algorithms have higher overhead but faster convergence. Under this general framework, GraphABCD proposed a more specific design based on a CPU-FPGA platform, as shown in Fig. 4.73. It uses an operator that mixes pull and push and the corresponding Gather-Scatter programming model (Gather corresponds to pull, Scatter corresponds to push). To exploit the strengths of the different computing platforms, GraphABCD assigns Gather tasks, which involve more computation than random memory access, to the FPGA, which is well suited to computing tasks, and assigns Scatter tasks, which involve a large number of random memory accesses, to the CPU, which handles them better.
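To make the BCD execution flow concrete, the following is a minimal software sketch of the block-scheduling loop described above. The use of PageRank as the example algorithm, the block size, the round-robin scheduler, and the convergence test are illustrative assumptions, not GraphABCD’s actual implementation.

```python
# Sketch of a block coordinate descent (BCD) style scheduling loop: repeatedly
# pick a sub-matrix block, process it, and make its results immediately
# visible to the blocks processed afterwards.
import numpy as np

def bcd_pagerank(adj, num_blocks=4, damping=0.85, tol=1e-6, max_rounds=100):
    """Toy PageRank solved block by block; `adj` is assumed column-stochastic."""
    n = adj.shape[0]
    rank = np.full(n, 1.0 / n)
    blocks = np.array_split(np.arange(n), num_blocks)

    for _ in range(max_rounds):
        max_delta = 0.0
        for blk in blocks:                      # simple cyclic (round-robin) scheduler
            # Update one block using the *latest* values of all other blocks.
            new_vals = (1 - damping) / n + damping * adj[blk, :] @ rank
            max_delta = max(max_delta, np.max(np.abs(new_vals - rank[blk])))
            rank[blk] = new_vals                # results visible to subsequent blocks
        if max_delta < tol:                     # converged
            break
    return rank
```

Shrinking the blocks corresponds to finer-grained updates (faster convergence, more scheduling and synchronization steps), which matches the trade-off noted above.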

Fig. 4.73 GraphABCD’s hardware architecture [90]

2. Near-data processing graph computing accelerator based on the 3D stacked architecture: a significant increase in bandwidth [87, 88, 95]

Interestingly, the graph computing work that first appeared at the top conferences in the architecture field was not based on traditional chip architectures, but was the near-data processing graph computing accelerator based on the 3D stacked architecture [95]. This phenomenon is both unexpected and, in a sense, reasonable. After all, compared with traditional technological paths, the 3D stacked architecture offers a real chance of breaking through the memory access bottleneck of graph computing. This line of work can be traced back to ISCA’15’s Tesseract [95], which adopts an HMC-like architecture to provide terabyte-magnitude memory access bandwidth for graph computing. However, as a pioneer in this field, Tesseract is also bold but rough, a feature common to many pioneering designs. The simple on-chip communication scheme adopted in its implementation prevents it from achieving better performance, and this has become the object of repeated discussion and optimization by latecomers [87, 88]. After Tesseract, many optimization papers based on similar architectures began to emerge. For example, GraphP [87] of the International Symposium on High-Performance Computer Architecture (HPCA) ’18 proposed reasonable graph division (“source-node-based” division), careful communication link design and usage, and other methods to reduce on-chip communication and optimize overall performance. GraphQ [88] of MICRO’52 is inspired by the static and structured communication scheduling of distributed computing and completely eliminates irregular data movement on the chip, thus greatly improving the performance of graph computing under the 3D stacked architecture. The evolution from Tesseract to GraphP and then to GraphQ represents a fairly clear development path. However, limited by length, we only discuss Tesseract and GraphQ in this section.

(1) Tesseract: the pioneer of near-data processing accelerators [95]

Tesseract is the first graph computing architecture based on the HMC, as shown in Fig. 4.74. A complete Tesseract architecture consists of 16 HMCs, which are connected through the SerDes links shown in Fig. 4.74a. Each HMC contains eight 8 Gb DRAM layers, which are divided into 32 vaults in the vertical direction, as shown in Fig. 4.74b. Each vault is connected to a crossbar network via a serial bus. In this way, vaults within the same HMC can communicate through the crossbar, while vaults in different HMCs can exchange messages through the crossbars in the HMCs and the SerDes links between HMCs. As shown in Fig. 4.74c, a processing core is integrated under each vault to perform computation and communication tasks. The adjacency matrix and the node value vector of the graph are divided across the vaults for parallel processing by their respective processing cores. In this way, tremendous bandwidth is obtained through parallel access to such a large number of vaults.

When talking about this kind of non-shared-memory parallel processing architecture, it is natural to think of distributed parallel computing. In fact, distributed parallel graph computing also obtains memory access bandwidth several times that of a single node by means of parallel processing across multiple nodes. So, what is the fundamental difference between near-data processing and distributed computing? Firstly, near-data processing accesses DRAM through TSVs and uses on-chip links for communication, so it has extremely low power consumption.

(a) HMC network
(b) HMC internal vault and processing core network
(c) Internal architecture of the processing core

Fig. 4.74 Overall architecture [95]

Secondly, in terms of performance alone, the fundamental advantage of near-data processing is that these vaults are interconnected with ultra-high bandwidth and extremely low latency. In the HMC, the bandwidth between a vault and the crossbar network can reach 40 GB/s, and the bandwidth of the SerDes link between HMCs can reach 120 GB/s [95]. Such high bandwidth is rarely provided in data centers. Compared with distributed computing, the very large interconnection bandwidth of the architecture shown in Fig. 4.74 greatly reduces the communication overhead of the multi-node system and greatly improves the scalability of system performance. It is on this basis that near-data processing can truly and effectively utilize the memory access bandwidth of each vault and contribute to the overall performance of the system. However, even such a large interconnection bandwidth does not mean that communication cannot become an obstacle to further performance improvement of a near-data processing system. In fact, the interconnection bandwidth between HMCs is significantly smaller than the total memory access bandwidth within an HMC [88, 95]. Compared with memory access, this means that an insufficiently optimized communication design may turn communication into a new bottleneck of the near-data processing system. This is the case with Tesseract: although it was designed with non-blocking remote function calls and various prefetching techniques [95], the irregularity of graph computing itself still causes a large amount of fine-grained and irregular on-chip data movement [88]. And this is exactly the problem that GraphQ aims to solve.

(2) GraphQ: Communication optimization of near-data processing architectures [88]

In [88], the authors pointed out the similarities between near-data processing and distributed computing, and then achieved a large optimization of inter-cube and intra-cube communication by using batching, coalescing, and reasonable communication scheduling from distributed computing, as well as a heterogeneous division of computing tasks among the processing cores.

For the processing of edges whose destination nodes are in another HMC while the source node values are in the local HMC, Tesseract transfers the relative offset information of the source node values to the other HMC through a remote call, and the remote HMC then relies on this information to read the corresponding source node values from the respective local HMCs and complete the corresponding computation. Inspired by “batched communication” in distributed computing, GraphQ instead adopts the communication optimization strategy shown in Fig. 4.75a. As shown on the right side of Fig. 4.75b, these source node values first go through the “reduce” computation locally, so that all the values destined for the same remote node are reduced into one value. This reduce process is performed in parallel for all local source nodes associated with the same remote HMC, and the final reduction results are combined into one “batch” message and sent to the remote HMC for further computation. In this way, the size and number of messages are greatly reduced, the amount of communication drops sharply, and the actual bandwidth utilization improves. Meanwhile, to further reduce the communication overhead, GraphQ adopts the overlapped communication strategy shown in Fig. 4.76. Compared with scattered remote calls, each communication step of this strategy occurs at a fixed moment and between fixed partners, which on the one hand greatly simplifies communication scheduling and on the other hand avoids contention for communication resources. In addition, the above communication can also overlap with computation (such as the “reduce” process mentioned above), which further hides the communication overhead.
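The batching idea can be illustrated with a short software sketch. The message format, the per-destination `reduce` operator (`min` here), and the node-to-cube mapping function are illustrative assumptions rather than GraphQ’s actual interfaces.

```python
# Sketch of batched inter-cube communication: locally reduce updates that target
# the same remote destination node, then pack one batch message per remote cube
# instead of sending one fine-grained message per edge.
from collections import defaultdict

NUM_CUBES = 16
cube_of = lambda node: node % NUM_CUBES          # assumed node-to-cube mapping

def build_batches(local_updates, reduce_op=min):
    """local_updates: iterable of (dst_node, value) produced by local edge processing."""
    batches = defaultdict(dict)                  # remote cube -> {dst_node: reduced value}
    for dst, val in local_updates:
        slot = batches[cube_of(dst)]
        slot[dst] = reduce_op(slot[dst], val) if dst in slot else val
    # One combined message per remote cube.
    return {cube: list(slot.items()) for cube, slot in batches.items()}

updates = [(17, 3.0), (33, 5.0), (17, 1.0), (2, 4.0)]
print(build_batches(updates))   # {1: [(17, 1.0), (33, 5.0)], 2: [(2, 4.0)]}
```

Because the per-destination reduction happens before transmission, both the number of messages and their total size shrink, which is what raises the effective utilization of the inter-cube links.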

(a) Overall schematic

(b) Batched message generation within a single HMC

Fig. 4.75 Generation and transmission of batched messages [88] (see color picture)

(a) Graph division, allocation, and execution sequence between HMCs
(b) Overlapped computation/communication mechanism between HMCs

Fig. 4.76 Graph division, allocation, and execution sequence between HMCs in GraphQ, and the computation/communication overlap mechanism between HMCs [88]

To improve the memory access efficiency within an HMC, GraphQ divides the computing cores in the HMC into Process units and Apply units. The former mainly involve sequential memory access, while the latter involve the necessary random memory accesses, as shown in Fig. 4.77c. Separating them prevents their completely different memory access patterns from interfering with each other and allows each to be optimized separately, thereby significantly improving memory access efficiency. In addition to the above two design optimizations, GraphQ also discussed and optimized the communication between multiple near-data processing chip nodes; limited by length, this is not discussed further here.

Fig. 4.77 Comparison between the multi-core architectures in GraphQ’s and Tesseract’s single HMCs and the traditional multi-core architecture [88]

3. Graph computing architecture based on Flash SSD: efficient computation of very large graphs [85, 101]

Architecture research based on Flash SSDs mainly includes GraFBoost [85] of ISCA’18 and GraphSSD [101] of ISCA’19. Their design goal is to handle very large graphs (billions of nodes and tens of billions of edges) that are difficult to store in a DRAM main memory system. GraphSSD optimizes graph data access and update through a special SSD controller design, while GraFBoost proposed a more general algorithm for optimizing SSD accesses and designed a dedicated accelerator for this algorithm. Limited by length, only the design idea of GraFBoost is introduced here.

The main architecture of GraFBoost includes a Flash SSD, a 1 GB DRAM memory, and an accelerator [85]. Compared with DRAM, the key characteristic of an SSD is that random and fine-grained accesses are even less tolerable. Therefore, the core of GraFBoost is an algorithm, sort-reduce, that converts all SSD accesses associated with graph computing into sequential accesses, together with an accelerator for this algorithm [85]. The core idea of sort-reduce is to merge-sort the intermediate results that are generated by the push operator of the graph computing model and stored in DRAM according to the destination node number to which they belong, and to compress the number of intermediate results in each merge step by exploiting the fact that intermediate results belonging to the same destination node can be reduced, thereby obtaining a sequentialized and compressed update vector, as shown in Fig. 4.78b [85]. The final sequentialized and compressed update vector optimizes the accesses to the SSD, while reducing the intermediate results in each merge step improves the execution efficiency of the algorithm.

Based on the above algorithm, the architecture and execution flow of the corresponding accelerator can be derived, as shown in Figs. 4.79 and 4.80. The implementation is quite intuitive and entirely consistent with the idea of merge sorting: first, divide the sequence that needs to be sorted and reduced into small blocks that fit on the accelerator (an FPGA), and use a simple merge network on the accelerator to sort the small blocks. Then use a merge tree to sort and reduce, in a streaming manner, sequence blocks that are larger than the on-chip memory. Finally, this process can simply be extended to sequence blocks that exceed the DRAM storage space and repeated to obtain the final result [85].
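A minimal software sketch of the sort-reduce idea is given below. The pair format, the `reduce` operator (`min`, as in SSSP), and the use of Python’s streaming k-way merge are simplifying assumptions; GraFBoost’s accelerator uses hardware merge networks and a merge tree rather than this software routine.

```python
# Sketch of sort-reduce: merge runs of (dst, value) update pairs that are each
# sorted by destination, reducing pairs with the same destination during the
# merge itself, so the output is both sequential and compressed.
import heapq

def sort_reduce(runs, reduce_op=min):
    """runs: list of lists of (dst, value) pairs, each already sorted by dst."""
    merged = heapq.merge(*runs, key=lambda p: p[0])   # streaming k-way merge
    out = []
    for dst, val in merged:
        if out and out[-1][0] == dst:
            out[-1] = (dst, reduce_op(out[-1][1], val))   # reduce duplicates early
        else:
            out.append((dst, val))
    return out

run_a = [(1, 5.0), (4, 2.0), (9, 7.0)]
run_b = [(1, 3.0), (4, 8.0), (6, 1.0)]
print(sort_reduce([run_a, run_b]))
# [(1, 3.0), (4, 2.0), (6, 1.0), (9, 7.0)]
```

Reducing during the merge is what shrinks the intermediate data at every level of the hierarchy, so the final update vector that touches the SSD is both sorted (sequential accesses) and compressed.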

Fig. 4.78 Execution idea of sorting and reducing (obviously (b) is better) [85]

Fig. 4.79 Data flow of the “sort-reduce” accelerator [85]

(a) Sort-reduce based on on-chip memory and DRAM: merge smaller sort-reduced data blocks into larger sort-reduced data blocks
(b) Merge larger sort-reduced data blocks into the final result

Fig. 4.80 Hierarchical sorting and reducing process [85]

Experiments show that GraFBoost achieves sound performance, and that this performance does not depend on the capacity of the DRAM system [85]. In addition, GraFBoost's performance degrades only slightly as the graph grows larger [85].

4. Graph computing enhancement of traditional architectures (CPU/GPU) [93, 94]

In addition to the types of designs mentioned above, there are many designs based on existing architectures (such as the CPU and GPU). These works typically first analyze and identify an inefficient step or process of the graph computing framework on the current architecture, and then design a low-overhead additional circuit specifically for that step or process and add it to the architecture, thereby greatly improving the performance of graph computing on the current architecture [93, 94]. The advantage of this type of design is that it preserves the compatibility of the DSA with currently popular architectures and existing programming models.

A typical GPU-based graph computing enhancement is the stream compaction unit (SCU) of ISCA’19 [94]. Stream compaction is the operation used in GPU graph computing to extract the nodes or edges that are activated in the current iteration. To run graph computation efficiently, the stream compaction operation needs to identify the graph elements activated in this iteration, read them out, and store them sequentially and compactly at contiguous memory addresses. In this way, after the next iteration starts, the GPU’s computing cores can efficiently access the activated elements by accessing this segment of contiguous addresses. However, the stream compaction operation is not well suited to GPU implementation. On the one hand, the GPU’s stream computing units are designed specifically for computation, while stream compaction only involves data movement [94]. On the other hand, stream compaction involves a large number of sparse and fine-grained random memory accesses; these accesses cannot be coalesced effectively, so they are unfriendly to the lock-step execution of the stream computing units [94]. Therefore, although the stream compaction operation does not seem complicated, experiments show that its share of the execution time is extremely high, sometimes close to 60%, as shown in Fig. 4.81 [94]. Literature [94] therefore proposed adding a dedicated low-overhead stream compaction unit to the GPU to improve its graph computing performance, as shown in Fig. 4.82. The design of the SCU is not complicated; the key lies in its sound compatibility with the original GPU graph computing in terms of both the programming model and performance. We will not discuss it further here.
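To make the operation concrete, the following sketch shows what stream compaction computes at the software level: packing the active elements of this iteration into a dense, contiguous array. The NumPy formulation and function names are illustrative; this is not the SCU’s hardware interface.

```python
# Sketch of stream compaction: extract the elements activated in this iteration
# and pack them contiguously so the next iteration can read them with
# sequential, coalesced accesses.
import numpy as np

def stream_compact(active_mask, element_ids):
    """active_mask: boolean array; element_ids: node or edge IDs of the same length."""
    # An exclusive prefix sum gives each active element its slot in the compact output.
    positions = np.cumsum(active_mask) - active_mask
    compact = np.empty(int(active_mask.sum()), dtype=element_ids.dtype)
    compact[positions[active_mask]] = element_ids[active_mask]
    return compact

mask = np.array([0, 1, 1, 0, 1, 0], dtype=bool)
ids = np.arange(6)
print(stream_compact(mask, ids))   # [1 2 4]
```

The prefix-sum-plus-scatter structure is simple arithmetically, but the scattered, fine-grained writes are exactly the kind of memory traffic that maps poorly onto lock-step GPU execution, which is why a small dedicated unit pays off.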

4.6.4 Outlook

As a typical DSA, the graph computing architecture and its development trajectory reflect the distinctive features of “software-defined chips”. At present, although new results are still emerging one after another, the discussion of architectures based on the traditional vertex-centric model has verged on maturity. However, new opportunities have arrived: in recent years, the emerging graph computing applications represented by graph mining [102–104] and graph convolutional networks (GCN) [105, 106] have posed new challenges to research on graph computing hardware architectures.

Fig. 4.81 Computation time breakdown of BFS, SSSP and PageRank on NVIDIA GTX980 and Tegra X1 [94]

Fig. 4.82 Basic architecture of GPU-SCU [94]

There are two distinct differences between these two types of applications and the traditional graph computation described above.

(1) The difference in data structure, which is reflected in two aspects. ➀ The data structure of the node value is different [104, 106]. In traditional graph computing, the value or attribute of a single node is often just a simple integer or floating-point number of only 4 bytes, as in SSSP and PageRank. In graph mining algorithms based on graph embedding [102, 104] and in graph neural networks [105, 106], the attribute of a single node is itself a vector whose length may reach tens or even hundreds of bytes. This greatly increases the storage and bandwidth consumption of node values, and makes the strategy of eliminating random node-value accesses with the on-chip scratchpad in

the Graphicionado-like architecture very inefficient. Obviously, this difference imposes new requirements on the memory access optimization of the graph computing architecture. ➁ The features of the graph structure may be different. A typical example is the GCN: the graph structures faced by GCNs in different application fields differ markedly in sparsity and graph size [107]. In applications such as compound structure analysis, the graphs to be processed by a GCN are much smaller than the large graphs used by traditional graph applications; even the entire adjacency matrix can be stored completely on chip [107]. In this case, it is clearly inappropriate to keep the original memory access and computation strategy, because the on-chip scratchpad can be used to eliminate off-chip memory accesses entirely. However, considering the irregularity of graph computation itself, an efficient implementation of on-chip graph processing may not be easy.

(2) The difference in the execution mode of the graph algorithms. Obviously, since neural network computations are involved, the execution of a GCN is significantly different from traditional graph computing [105, 106]. Another typical example is the graph embedding mining algorithm based on random walks, whose basic principle is to reduce the computation workload of an iteration by introducing randomness while maintaining statistical precision to a certain extent [102]. Obviously, this also differs from traditional graph computing, and it brings greater challenges to the efficient implementation of memory access and parallelism.

Proposing new techniques and ideas to address the emerging challenges mentioned above is a major task for the future development of graph computing architectures. The more profound problem, however, is that these emerging graph computing applications are not fully compatible with the graph computing models and architectures established above, or cannot achieve sound performance under these architectures. In short, the universality of traditional graph computing architectures within their own domain has been challenged by emerging applications. This is a common problem faced by research on domain-specific architectures: a design that is domain-flexible today may soon no longer be flexible enough. And this is also the greatest challenge faced by the practical application of domain-specific architectures. “How do we design a model framework or interface that has sound domain flexibility and performance both now and in the future?” This is a question that needs to be asked and answered constantly in all studies of domain-specific architectures, including graph computing.

References 1. Mo H, Liu L, Zhu W et al (2019) Face alignment with expression- and pose-based adaptive initialization. IEEE Trans Multimedia 21(4):943–956 2. Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. In: Proceedings of the 25th international conference on neural information

processing systems, vol 1, pp 1097–1105 3. Horowitz M (2014) 1.1 Computing’s energy problem (and what we can do about it). In: IEEE international solid-state circuits conference digest of technical papers, pp 10–14 4. Gokhale V, Jin J, Dundar A et al (2014) A 240 G-ops/s mobile coprocessor for deep neural networks. In: IEEE conference on computer vision and pattern recognition workshops, pp 696–701 5. Sankaradas M, Jakkula V, Cadambi S et al (2009) A massively parallel coprocessor for convolutional neural networks. In: The 20th IEEE international conference on applicationspecific systems, architectures and processors, pp 53–60 6. Sriram V, Cox D, Tsoi KH et al (2010) Towards an embedded biologically-inspired machine vision processor. In: International conference on field-programmable technology, pp 273–278 7. Mo H, Liu L, Zhu W et al (2020) A multi-task hardwired accelerator for face detection and alignment. IEEE Trans Circuits Syst Video Technol 30(11):4284–4298 8. Mo H, Liu L, Zhu W et al (2020) A 460 GOPS/W improved-mnemonic-descent-method-based hardwired accelerator for face alignment. IEEE Trans Multimedia 99:1 9. Chakradhar S, Sankaradas M, Jakkula V et al (2010) A dynamically configurable coprocessor for convolutional neural networks. ACM Sigarch Comput Arch News 38(3):247–257 10. Park S, Bong K, Shin D et al (2015) 4. 6 A1. 93TOPS/W scalable deep learning/inference processor with tetra-parallel MIMD architecture for big-data applications. In: IEEE international solid-state circuits conference digest of technical papers, pp 1–3 11. Cavigelli L, Benini L (2017) Origami: A 803-GOp/s/W convolutional network accelerator. IEEE Trans Circuits Syst Video Technol 27(11):2461–2475 12. Du Z, Fasthuber R, Chen T et al (2015) ShiDianNao: shifting vision processing closer to the sensor. In: The 42nd annual international symposium on computer architecture, pp 92–104 13. Gupta S, Agrawal A, Gopalakrishnan K et al (2015) Deep learning with limited numerical precision. In: Proceedings of the 32nd international conference on international conference on machine learning, vol 37, pp 1737–1746 14. Peemen M, Setio AAA, Mesman B et al (2013) Memory-centric accelerator design for convolutional neural networks. In: The 31st international conference on computer design, pp 13–19 15. Zhang C, Li P, Sun G et al (2015) Optimizing FPGA-based accelerator design for deep convolutional neural networks. In: Proceedings of the 2015 ACM/SIGDA international symposium on field-programmable gate arrays, Monterey, pp 161–170 16. Chen T, Du Z, Sun N et al (2014) DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning. In: International conference on architectural support for programming languages & operating systems, pp 1–5 17. Chen Y, Emer J, Sze V (2016) Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks. In: The 43rd annual international symposium on computer architecture, pp 367–379 18. Albericio J, Judd P, Hetherington T et al (2016) Cnvlutin: ineffectual-neuron-free deep neural network computing. In: The 43rd annual international symposium on computer architecture, pp 1–13 19. Chen Y, Luo T, Liu S et al (2014) DaDianNao: a machine-learning supercomputer. In: The 47th annual IEEE/ACM international symposium on microarchitecture, pp 609–622 20. Gondimalla A, Chesnut N, Thottethodi M et al (2019) SparTen: a sparse tensor accelerator for convolutional neural networks. In: The 52nd annual IEEE/ACM international symposium, pp 1–7 21. 
Song M, Zhao J, Hu Y et al (2018) Prediction based execution on deep neural networks. In: The 45th annual international symposium on computer architecture, pp 752–763 22. Akhlaghi V, Yazdanbakhsh A, Samadi K et al (2018) SnaPEA: predictive early activation for reducing computation in deep convolutional neural networks. In: The 45th annual international symposium on computer architecture, pp 662–673 23. Sharma H, Park J, Suda N et al (2018) Bit fusion: bit-level dynamically composable architecture for accelerating deep neural network. In: The 45th annual international symposium on computer architecture, pp 764–775

24. Hyeonuk K, Jaehyeong S, Yeongjae C et al (2017) A kernel decomposition architecture for binary-weight convolutional neural networks. In: The 54th ACM/EDAC/IEEE design automation conference, pp 1–6 25. Judd P, Albericio J, Hetherington T et al (2016) Stripes: bit-serial deep neural network computing. In: The 49th annual IEEE/ACM international symposium on microarchitecture, pp 1–12 26. Albericio J, Delmás A, Judd P et al (2017) Bit-pragmatic deep neural network computing. In: The 50th annual IEEE/ACM international symposium on microarchitecture, pp 382–394 27. Sharify S, D Lascorz A, Mahmoud M et al (2019) Laconic deep learning inference acceleration. In: The 46th annual international symposium on computer architecture, pp 304–317 28. Sze V, Chen Y, Yang T et al (2017) Efficient processing of deep neural networks: a tutorial and survey. Proc IEEE 105(12):2295–2329 29. Yin S, Ouyang P, Tang S et al (2018) A high energy efficient reconfigurable hybrid neural network processor for deep learning applications. IEEE J Solid-State Circuits 53(4):968–982 30. Yin S, Ouyang P, Yang J et al (2019) An energy-efficient reconfigurable processor for binaryand ternary-weight neural networks with flexible data bit width. IEEE J Solid-State Circuits 54(4):1120–1136 31. Kim S, Sanchez JC, Rao YN et al (2006) A comparison of optimal MIMO linear and nonlinear models for brain-machine interfaces. J Neural Eng 3(2):145–161 32. Chen M (2015) Research on key reconfigurable computing technologies oriented to communication baseband signal processing. Southeast University, Nanjing 33. Trimeche A, Boukid N, Sakly A et al (2012) Performance analysis of ZF and MMSE equalizers for MIMO systems. In: The 7th international conference on design & technology of integrated systems in nanoscale era, pp 1–6 34. Wu M, Yin B, Wang G et al (2014) Large-scale MIMO detection for 3GPP LTE: algorithms and FPGA implementations. IEEE J Sel Topics Sig Process 8(5):916–929 35. Castaneda O, Goldstein T, Studer C (2016) Data detection in large multi-antenna wireless systems via approximate semidefinite relaxation. IEEE Trans Circuits Syst I Regul Pap 63(12):2334–2346 36. Gao X, Dai L, Hu Y et al (2015) Low-complexity signal detection for large-scale MIMO in optical wireless communications[J]. IEEE J Sel Areas Commun 33(9):1903–1912 37. Chu X, McAllister J (2012) Software-defined sphere decoding for FPGA-based MIMO detection. IEEE Trans Signal Process 60(11):6017–6026 38. Huang Z, Tsai P (2011) Efficient implementation of QR decomposition for gigabit MIMOOFDM systems. IEEE Trans Circuits Syst I Regul Pap 58(10):2531–2542 39. Jalden J, Ottersten B (2008) The diversity order of the semidefinite relaxation detector. IEEE Trans Inf Theory 54(4):1406–1422 40. Liu L, Peng G, Wei S (2019) Massive MIMO detection algorithm and VLSI architecture. Springer, Singapore 41. Roger S, Ramiro C, Gonzalez A et al (2012) Fully parallel GPU implementation of a fixedcomplexity soft-output MIMO detector. IEEE Trans Veh Technol 61(8):3796–3800 42. Li K, Sharan RR, Chen Y et al (2017) Decentralized baseband processing for massive MUMIMO systems. IEEE J Emerg Sel Topics Circuits Syst 7(4):491–507 43. Guenther D, Leupers R, Ascheid G (2016) Efficiency enablers of lightweight SDR for MIMO baseband processing. IEEE Trans Very Large Scale Integr (VLSI) Syst 24(2):567–577 44. Tang W, Chen C, Zhang Z (2019) A 2.4 mm2 130-mW MMSE-nonbinary LDPC iterative detector decoder for 4×4 256-QAM MIMO in 65-nm CMOS. IEEE J Solid-State Circuits 54(7):2070–2080 45. 
Tang W, Chen C, Zhang Z (2016) A 0.58mm2 2.76Gb/s 79.8pJ/b 256-QAM massive MIMO message-passing detector. In: IEEE symposium on VLSI circuits (VLSI-Circuits), pp 1–2 46. Peng G, Liu L, Zhang P et al (2017) Low-computing-load, high-parallelism detection method based on Chebyshev iteration for massive MIMO systems with VLSI architecture. IEEE Trans Signal Process 65(14):3775–3788

47. Liu L, Peng G, Wang P et al (2020) Energy- and area-efficient recursive-conjugate-gradientbased MMSE detector for massive MIMO systems. IEEE Trans Signal Process 68:573–588 48. Peng G, Liu L, Zhou S et al (2018) Algorithm and architecture of a low-complexity and high-parallelism preprocessing-based K-best detector for large-scale MIMO systems. IEEE Trans Signal Process 66(7):1860–1875 49. Yang Z (2012) Modeling and simulation of reconfigurable network-on-a-chip oriented to multiple topological structures. Nanjing University of Aeronautics and Astronautics, Nanjing 50. Atak O, Atalar A (2013) BilRC: an execution triggered coarse grained reconfigurable architecture. IEEE Trans Very Large Scale Integr (VLSI) Syst 21(7):1285–1298 51. Lu Y, Liu L, Deng Y et al (2017) Minimizing pipeline stalls in distributed-controlled coarsegrained reconfigurable arrays with triggered instruction issue and execution. In: Proceedings of the 54th annual design automation conference, pp 1–6 52. Liu L, Wang J, Zhu J et al (2016) TLIA: efficient reconfigurable architecture for control-intensive kernels with triggered-long-instructions. IEEE Trans Parallel Distrib Syst 27(7):2143–2154 53. Liu Z, Liu D, Zou X (2016) An efficient and flexible hardware implementation of the dual-field elliptic curve cryptographic processor. IEEE Trans Industr Electron 64(3):2353–2362 54. Lin S, Huang C (2007) A high-throughput low-power AES cipher for network applications. In: Asia and South Pacific design automation conference, pp 595–600 55. Ueno R, Morioka S, Homma N et al (2016) A high throughput/gate AES hardware architecture by compressing encryption and decryption datapaths. In: International conference on cryptographic hardware and embedded systems, pp 538–558 56. Henzen L, Aumasson J, Meier W et al (2010) VLSI characterization of the cryptographic hash function BLAKE. IEEE Trans Very Large Scale Integr (VLSI) Syst 19(10):1746–1754 57. Zhang Y, Yang K, Saligane M et al (2016) A compact 446Gbps/W AES accelerator for mobile SoC and IoT in 40nm. In: IEEE symposium on VLSI circuits, pp 1–2 58. Mathew S, Satpathy S, Suresh V et al (2015) 340mv—1.1v, 289Gbps/w, 2090-gate nanoaes hardware accelerator with area-optimized encrypt/decrypt GF (24) 2 polynomials in 22nm tri-gate cmos. IEEE J Solid-State Circuits 50(4):1048–1058 59. Zhang Y, Xu L, Dong Q et al (2018) Recryptor: a reconfigurable cryptographic cortex-M0 processor with in-memory and near-memory computing for IoT security. IEEE J Solid-State Circuits 53(4):995–1005 60. Han J, Dou R, Zeng L et al (2015) A heterogeneous multicore crypto-processor with flexible long-word-length computation. IEEE Trans Circuits Syst I Regul Pap 62(5):1372–1381 61. Bucci M, Giancane L, Luzzi R et al (2006) Three-phase dual-rail pre-charge logic. In: Cryptographic hardware and embedded systems, pp 232–241 62. Hwang D D, Tiri K, Hodjat A et al (2006) AES-based security coprocessor IC in 0. 18$muhbox m $CMOS with resistance to differential power analysis side-channel attacks. IEEE J Solid-State Circuits 41(4):781–792 63. Popp T, Kirschbaum M, Zefferer T et al (2007) Evaluation of the masked logic style MDPL on a prototype chip. In: Cryptographic hardware and embedded systems, pp 81–94 64. Tokunaga C, Blaauw D (2009) Securing encryption systems with a switched capacitor current equalizer. IEEE J Solid-State Circuits 45(1):23–31 65. Singh A, Kar M, Chekuri VCK et al (2019) Enhanced power and electromagnetic SCA resistance of encryption engines via a security-aware integrated all-digital LDO. 
IEEE J Solid-State Circuits 55(2):478–493 66. Das D, Danial J, Golder A et al (2020) EM and power SCA-resilient AES-256 through >350x current-domain signature attenuation and local lower metal routing. IEEE J Solid-State Circuits 56(1):136–150 67. Liu L, Wang B, Deng C et al (2018) Anole: a highly efficient dynamically reconfigurable crypto-processor for symmetric-key algorithms. IEEE Trans Comput Aided Des Integr Circuits Syst 37(12):3081–3094 68. Deng C, Wang B, Liu L et al (2019) A 60 Gb/s-level coarse-grained reconfigurable cryptographic processor with less than 1-W power. IEEE Trans Circuits Syst II Express Briefs 67(2):375–379

69. Bohnenstiehl B, Stillmaker A, Pimentel JJ et al (2017) KiloCore: a 32-nm 1000-processor computational array. IEEE J Solid-State Circuits 52(4):891–902 70. Wang Y, Ha Y (2013) FPGA-based 40. 9-Gbits/s masked AES with area optimization for storage area network. IEEE Trans Circuits Syst II Express Briefs 60(1):36–40 71. Sayilar G, Chiou D (2014) Cryptoraptor: high throughput reconfigurable cryptographic processor. In: IEEE/ACM international conference on computer-aided design, pp 155–161 72. Lipp M, Schwarz M, Gruss D et al (2018) Meltdown: reading kernel memory from user space. In: The 27th USENIX security symposium, pp 46–56 73. Kocher P, Horn J, Fogh A et al (2019) Spectre attacks: exploiting speculative execution. In: IEEE symposium on security and privacy, pp 1–19 74. Van Schaik S, Milburn A, Österlund S et al (2019) RIDL: rogue in-flight data load. In: IEEE symposium on security and privacy, pp 88–105 75. Canella C, Genkin D, Giner L et al (2019) Fallout: leaking data on meltdown-resistant CPUs. In: ACM conference on computer and communications security, pp 769–784 76. Schwarz M, Lipp M, Moghimi D et al (2019) ZombieLoad: cross-privilege-boundary data sampling. In: ACM conference on computer and communications security, pp 753–768 77. Corbet J (2020) KAISER: hiding the kernel from user space. https://lwn.net/Articles/738975 78. Kocher P (2020) Spectre mitigations in Microsoft’s C/C++ Compiler. https://www.paulko cher.com/doc/MicrosoftCompilerSpectreMitigation.html 79. O’Donnell L (2020) Intel’s “Virtual Fences” spectre fix won’t protect against variant 4. https:// threatpost.com/intels-virtual-fences-spectre-fix-wont-protect-against-variant-4/132246 80. Intel (2020) Intel analysis of speculative execution side channels. https://newsroom.intel. com/wp-content/uploads/sites/11/2018/01/Intel-Analysis-of-Speculative-Execution-SideChannels.pdf 81. Bhunia S, Hsiao MS, Banga M et al (2014) Hardware trojan attacks: threat analysis and countermeasures. Proc IEEE 102(8):1229–1247 82. Shalabi Y, Yan M, Honarmand N et al (2018) Record-replay architecture as a general security framework. In: IEEE symposium on high-performance computer architecture, pp 180–193 83. Ham TJ, Wu L, Sundaram N et al (2016) Graphicionado: a high-performance and energyefficient accelerator for graph analytics. In: The 49th annual IEEE/ACM international symposium on microarchitecture, p 56 84. Yan M, Hu X, Li S et al (2019) Alleviating irregularity in graph analytics acceleration: a hardware/software co-design approach. In: Proceedings of the 52nd annual IEEE/ACM international symposium on microarchitecture, pp 615–628 85. Jun S, Wright A, Zhang S et al (2018) GraFboost: using accelerated flash storage for external graph analytics. In: Proceedings of the 45th annual international symposium on computer architecture, pp 411–424 86. Challapalle N, Rampalli S, Song L et al (2020) GaaS-X: graph analytics accelerator supporting sparse data representation using crossbar architectures. In: Proceedings of the ACM/IEEE 47th annual international symposium on computer architecture, pp 433–445 87. Zhang M, Zhuo Y, Wang C et al (2018) GraphP: reducing communication for PIM-based graph processing with efficient data partition. In: IEEE international symposium on high performance computer architecture, pp 544–557 88. Zhuo Y, Wang C, Zhang M et al (2019) GraphQ: scalable PIM-based graph processing. In: Proceedings of the 52nd annual IEEE/ACM international symposium on microarchitecture, pp 712–725 89. 
Rahman S, Abu-Ghazaleh N, Gupta R (2020) GraphPulse: an event-driven hardware accelerator for asynchronous graph processing. In: Proceedings of the 53rd annual IEEE/ACM international symposium on microarchitecture. Association for Computing Machinery, pp 908–921 90. Yang Y, Li Z, Deng Y et al (2020) GraphABCD: scaling out graph analytics with asynchronous block coordinate descent. In: Proceedings of the ACM/IEEE 47th annual international symposium on computer architecture, pp 419–432

91. Song L, Zhuo Y, Qian X et al (2018) GraphR: accelerating graph processing using ReRAM. In: IEEE international symposium on high performance computer architecture, pp 531–543 92. Ozdal M M, Yesil S, Kim T et al (2016) Energy efficient architecture for graph analytics accelerators. In: Proceedings of the 43rd international symposium on computer architecture, pp 166–177 93. Mukkara A, Beckmann N, Abeydeera M et al (2018) Exploiting locality in graph analytics through hardware-accelerated traversal scheduling. In: Proceedings of the 51st annual IEEE/ACM international symposium on microarchitecture, pp 1–14 94. Segura A, Arnau J, González A (2019) SCU: a GPU stream compaction unit for graph processing. In: Proceedings of the 46th international symposium on computer architecture, pp 424–435 95. Ahn J, Hong S, Yoo S et al (2015) A scalable processing-in-memory accelerator for parallel graph processing. In: Proceedings of the 42nd annual international symposium on computer architecture, pp 105–117 96. Malewicz G, Austern MH, Bik AJC et al (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, pp 135–146 97. Lenharth A, Nguyen D, Pingali K (2016) Parallel graph analytics. Commun ACM 59(5):78–87 98. Satish N, Sundaram N, Patwary MMA et al (2014) Navigating the maze of graph analytics frameworks using massive graph datasets. In: Proceedings of the 2014 ACM SIGMOD international conference on management of data, pp 979–990 99. Whang JJ, Lenharth A, Dhillon IS et al (2015) Scalable data-driven PageRank: algorithms, system issues, and lessons learned[C]//Euro-Par. Parallel Process 2015:438–450 100. Sundaram N, Satish N, Patwary MMA et al (2015) GraphMat: high performance graph analytics made productive. Proc VLDB Endow 8(11):1214–1225 101. Matam KK, Koo G, Zha H et al (2019) GraphSSD: graph semantics aware SSD. In: Proceedings of the 46th international symposium on computer architecture, pp 116–128 102. Yang K, Zhang M, Chen K et al (2019) KnightKing: a fast distributed graph random walk engine. In: Proceedings of the 27th ACM symposium on operating systems principles, pp 524–537 103. Yao P, Zheng L, Zeng Z et al (2020) A locality-aware energy-efficient accelerator for graph mining applications. In: Proceedings of the 53rd annual IEEE/ACM international symposium on microarchitecture. Association for Computing Machinery, pp 895–907 104. Zhang M, Wu Y, Chen K et al (2016) Exploring the hidden dimension in graphprocessing. In: Proceedings of the 12th USENIX conference on operating systems design and implementation, pp 285–300 105. Yan M, Deng L, Hu X et al (2020) HyGCN: a GCN accelerator with hybrid architecture. In: IEEE international symposium on high performance computer architecture, pp 15–29 106. Geng T, Li A, Shi R et al (2020) AWB-GCN: a graph convolutional network accelerator with runtime workload rebalancing. In: Proceedings of the 53rd annual IEEE/ACM international symposium on microarchitecture. Association for Computing Machinery, pp 922–936 107. Dwivedi VP, Joshi CK, Laurent T et al (2020) Benchmarking graph neural networks. arXiv preprint. arXiv: 2003.00982

Chapter 5

Future Application Prospects

The best way to predict the future is to invent it. —Alan Kay, InfoWorld, 1982

Driven by artificial intelligence, cloud computing, autonomous driving, the Internet of Things, blockchain, quantum computing, and other emerging technologies and applications, computing systems are becoming data-driven, flexible and self-adaptive, and diversified for different requirements and personalized scenarios. As the physical carrier of computing, traditional ASICs with fixed functions can no longer meet the real-time functional reconfiguration requirements of future applications. Meanwhile, FPGAs, despite their functional reconfigurability, suffer from bottlenecks in real-time agility and energy efficiency. Therefore, software-defined chips, as an energy-efficient and highly flexible solution, will play an increasingly prominent role in future application scenarios. At present, data has become an important factor of production, like oil, and data-driven computation and analysis have brought ground-breaking changes to productivity in all sectors. In this section, we focus on future intelligent computing, data security, and privacy protection, analyzing and envisioning the application of software-defined chips in evolutionary intelligent computing, post-quantum cryptography, fully homomorphic encryption, and other emerging technologies. First of all, although artificial intelligence has been deeply applied in computer vision, natural language processing, and other fields, it can only be applied in “weak intelligence” scenarios because of its inherent characteristics such as non-interpretability and data dependence. The future “strong intelligence” requires the ability to evolve continuously and autonomously; to this end, smart chips must have the ability to support flexible functions, adaptive architectures, and agile development. Secondly, in the face of the approaching security threats that quantum computing poses to current public-key cryptosystems, cryptographers have begun to design post-quantum cryptographic algorithms based on harder mathematical problems and capable of resisting quantum attacks. The standardization of post-quantum cryptographic algorithms is steadily advancing, and the corresponding implementation work and application performance evaluation are also very important.

As post-quantum cryptographic algorithms feature diverse mathematically hard problems with multiple parameter selection options, complex computing modes, and insufficiently studied physical security, they require support from a computing architecture with dynamically reconfigurable functions. In addition, hybrid-mode cryptographic chips compatible with both classic public-key and post-quantum cryptographic algorithms also need to be implemented with software-defined chips. Finally, increasingly serious data security and privacy issues have become a constraint on further releasing the value of massive data. Homomorphic encryption, as a computing mode, can make data “available but invisible” under the premise of data protection, and properly addresses the problem of privacy-preserving computing when data owners and service providers are separated. But like post-quantum cryptographic algorithms, homomorphic encryption schemes are also in a process of rapid iterative evolution, and their highly complex computing requirements and memory overhead keep them a long way from practical application. Based on software-hardware co-design and supporting fast, run-time function reconfiguration, software-defined chips can accelerate the processing of homomorphic encryption, optimize computational efficiency, and boost the practical application of homomorphic encryption.

5.1 Evolutionary Computing

At present, artificial intelligence still operates in a relatively fixed mode: specific dataset, training, then testing. The problem with this mode is that the model cannot cope with the varied and complicated scenes encountered in practice; its accuracy drops sharply once environmental factors change, which is unacceptable for increasingly complex scenarios. Therefore, future artificial intelligence will develop in the direction of evolutionary computing, which self-adjusts the model according to changes in the scene and the environment to maintain high accuracy. Meanwhile, most current artificial intelligence chips can only support and accelerate specific models. Even if a chip supports different parameters of the same model, it still cannot cope with changes in the model architecture. Therefore, the design of future smart chips will also be directed towards efficiently supporting the development of evolutionary computing models.

5.1.1 Background and Concept of Evolutionary Computing

Different from traditional learning from datasets with predefined categories, evolutionary computing is oriented to open environments and changing scenarios. It needs to be driven by both knowledge and data to further improve the generalization and robustness of the model, as shown in Fig. 5.1.

Fig. 5.1 Evolutionary system

The model can discover new knowledge, and learn and update itself in the process of self-evolution. Evolutionary computing also needs to deal with noise interference and intentional noise attacks from the environment; this requires it to be able to distinguish the various samples in the environment and determine whether a sample is a new sample or noise. To support the efficient operation of evolutionary computing models, the hardware needs new capabilities: it must not only support the efficient training of evolutionary models (different from current mainstream accelerators, which only support the inference of algorithmic models), but also ensure the correctness and security of training. Evolutionary computing, including model training and hardware design, has a wide range of application prospects and can be regarded as a pillar of the new era of artificial intelligence. Many current artificial intelligence application scenarios impose strong constraints on the environment and the objects to be detected, such as face recognition, intelligent beautifying, and license plate recognition. However, daily applications such as smart home, autonomous driving, smart medical care, and situational awareness are typical dynamic perception tasks in open environments. These environments are characterized by frequent changes, endless noise, and unexpected new samples, and evolutionary computing models will be able to cope with these challenges well. In addition, in the coming era of the Internet of Things, hundreds of millions of IoT edge devices will be in use, and the algorithm model of each device will need to be updated iteratively on the local hardware. IoT devices based on evolvable intelligence will put forward new requirements for chip functions and power consumption, so evolvable intelligence chips are also key to the successful application of artificial intelligence technologies and will become a mainstream research trend.

5.1.2 The Evolution and State-Of-The-Art Research

Machine perception and pattern recognition (machines perceiving and understanding the environment through artificial intelligence technologies) is one of the core branches and research directions in the field of artificial intelligence.

In the past 60 years, the theories and methods in this field have achieved tremendous results, as shown in Fig. 5.2. Especially since the introduction of deep learning methods and deep neural networks in 2006, the performance of visual perception (image classification, object detection and recognition, behavior recognition, etc.) and auditory perception (speech recognition) has improved significantly by combining big data and GPU parallel computing, almost completely surpassing traditional pattern recognition methods. Deep neural networks used in natural language processing, the game of Go (AlphaGo), and other fields have also produced significant results. Traditional pattern recognition methods estimate the conditional probability density P(x|ci) or the posterior probability P(ci|x) of a predefined category on the basis of manually extracted features; the former is called a generative model and the latter a discriminative model. A deep neural network automatically learns the discriminative features of the task according to the data distribution and the final requirements, so it has stronger perception and recognition capabilities. Deep learning models have made continuous breakthroughs in the recognition rate of specific static tasks; recognition records on public standard databases are broken constantly, sometimes even surpassing human recognition. However, once used in actual open environments, a deep learning model encounters various problems that degrade its performance relative to what is achieved in the laboratory. This is because the pattern recognition system in the laboratory environment mostly relies on a large number of labeled samples and on offline learning; it lacks the ability of logical inference and continuous autonomous learning, and is therefore not suitable for open-environment perception.

At the hardware level, in order to accelerate the deployment of artificial intelligence, especially on edge platforms with demanding power consumption and latency requirements, a large number of artificial intelligence inference accelerators have been designed. While ensuring algorithm precision, their energy efficiency and area efficiency are increasing day by day.

Fig. 5.2 History of evolutionary computing systems (from early artificial intelligence through machine learning and deep learning to evolutionary computing)

Artificial intelligence accelerators started from supporting simple matrix multiplication operations (including hardware implementations of support vector machines, random forests, and feature descriptors) and then moved on to support artificial neural networks. This was followed by optimizations for convolutional and recurrent neural networks (including quantization, prediction, pruning, etc.) and the corresponding hardware architecture designs (including weight-stationary, output-stationary, and no-local-reuse dataflows), whose goal is to increase the data reuse rate, thereby reducing memory accesses, avoiding redundant computation, and improving energy and area efficiency. However, such accelerators can only support trained and quantized models. Once the model changes, it cannot be updated at the hardware level; the model needs to be retrained, and the hardware may even need to be redesigned to support the new model, which generates a huge overhead.

To avoid redesigning models for every static task, extensive research has been carried out on automatic machine learning (AutoML). Traditional deep learning model construction often includes data preparation, model construction, parameter selection, training methods, and so on. These steps are usually carried out separately and manually, especially the selection of models and hyper-parameters. Even with prior knowledge, there are still many potential optimal solutions; manual operation is not only inefficient, but also prone to settling on locally optimal results. In addition, once a model is established for a static task, it is difficult to migrate it directly to a different task, resulting in a waste of resources. AutoML searches for the best hyper-parameters and model architecture among all possible models through autonomous search, reinforcement learning, evolutionary algorithms, and gradient descent algorithms. Once the AutoML algorithm is established, only the dataset needs to be provided, and AutoML automatically searches for the optimal hyper-parameters and model architecture for the current task to meet its latency and accuracy requirements. However, although the AutoML method is applicable to many different tasks, retraining is required for each specific task, and manual intervention cannot be completely avoided. Meanwhile, AutoML requires substantial hardware resources, such as multi-GPU distributed training, and is very time-consuming, making it almost impossible to deploy directly at the edge: the search can only be done in the cloud, and the resulting model is then deployed at the edge. Therefore, like traditional artificial intelligence chips, such deep learning chips can only support the inference process of the network model, cannot perform on-chip online updates, and are therefore not suitable for evolutionary computing.

To address the problem that the hardware only supports inference, many scholars have begun in the past two years to study how to implement efficient on-chip training at the edge. As shown in Fig. 5.3, on-chip training has wide uses, especially in customizing artificial intelligence models to individual consumers' habits; it makes products more practical and more personalized. On-chip training mainly includes two aspects: ➀ compiler design for the training model; ➁ improvement of hardware resources, including hard-wire adjustment, gradient computation module design, multi-batch computation support, etc.
The compiler that supports training mainly maps the high-level neural network model onto the hardware according to the user's needs, as shown in Fig. 5.4.

Fig. 5.3 Potential usage scenarios of on-chip training

According to the operations of each layer of the network and the available hardware resources, an optimized hardware-language module is selected from a pre-built module library, and the hard wiring and hardware resources are then adjusted to implement this module. These modules are designed to support the operations required for model training, and only the selected modules are synthesized. During training, each iteration over a batch of data performs the weight update layer by layer, as in the inference process; the samples in each batch are processed one by one [1]. The hardware implementation needs to support both the training and the inference process. The following is an example [2], whose architecture is shown in Fig. 5.5.

Fig. 5.4 Compiler process supporting training

Figure 5.5a shows the training and inference process of a CNN. In the inference process, each convolutional layer accepts Nc input channels and produces Nr output channels. The fully connected layer is treated as a 1 × 1 convolution. Training uses the stochastic gradient descent method: in each data iteration, the weights of each layer of the network are updated, following the stochastic gradient descent rule, according to the back-propagated error gradient, the preset learning rate, and other parameters. To obtain all the weight gradients, the feature-map gradients and weight gradients of each layer need to be calculated. For the ith layer, the gradient of its output feature map is first calculated from the (i + 1)th layer's feature-map gradient and weights. Once this is done, the weight gradient can be calculated from the (i + 1)th layer's feature-map gradient and the input feature map, and each weight is then updated by the product of its weight gradient and the learning rate. To efficiently support on-chip training and inference, the hardware architecture must be highly flexible: fixed-point arithmetic resources are sufficient for model inference but not well suited to on-chip training. As shown in Fig. 5.5b, the architecture supporting training operations includes 8 processing units, 8 pooling units, and an optimized Softmax module. To solve this problem, 16-bit floating-point processing elements and 10/5-bit fixed-point processing elements are added to support the training and inference processes. Similar works [3–7] can also be referred to.

Evolutionary computing mainly addresses the problem of how to allow the model to evade noise and discover new categories by itself during operation, and to improve its generalization capability. Specifically, evolutionary computing needs to fight against intentional or unintentional noise attacks and interference from the external environment; it needs to find samples that differ from the existing categories among the collected samples and classify them as new samples or noise according to the data volume. In addition, evolutionary computing needs to support on-chip model updates at the hardware level, and needs to improve its own hardware security to be able to resist intentional noise attacks at the hardware level.
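As a concrete illustration of the layer-wise gradient computation and weight update described above, the following is a minimal NumPy sketch of standard backpropagation for a fully connected layer (treated as a 1 × 1 convolution). It is a software-level illustration of the arithmetic that on-chip training hardware must support, not the design of the accelerator in [2]; the function name and dimensions are hypothetical.

```python
# Minimal sketch of one stochastic-gradient-descent step for a fully connected
# layer y = W @ x: compute the gradient with respect to the input (passed back
# to the previous layer), the gradient with respect to the weights, and update W.
import numpy as np

def fc_backward_and_update(W, x, grad_y, lr=0.01):
    """W: (Nr, Nc) weights, x: (Nc,) input feature map, grad_y: (Nr,) output gradient."""
    grad_x = W.T @ grad_y          # gradient propagated back to the previous layer
    grad_W = np.outer(grad_y, x)   # weight gradient from output gradient and input feature map
    W -= lr * grad_W               # weight update: learning rate times weight gradient
    return grad_x

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))    # Nr = 4 output channels, Nc = 8 input channels
x = rng.standard_normal(8)
grad_y = rng.standard_normal(4)
grad_x = fc_backward_and_update(W, x, grad_y)
print(grad_x.shape, W.shape)       # (8,) (4, 8)
```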

Fig. 5.5 Inference and training process and system architecture of the network model: (a) computation flow for inference and training; (b) system architecture
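As a plain software reference for the layer-by-layer gradient computation just described, the numpy sketch below propagates the output gradient backwards through two layers treated as matrix multiplications (fully connected layers, i.e., 1 × 1 convolutions) and applies the stochastic-gradient-descent weight update. The shapes, the ReLU activation, and the squared-error loss are illustrative choices and are not taken from the design in [2].

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 16))                 # input feature map
w1 = rng.standard_normal((16, 32)) * 0.1         # layer-1 weights
w2 = rng.standard_normal((32, 8)) * 0.1          # layer-2 weights
target = rng.standard_normal((1, 8))
lr = 0.01

# forward pass (inference path): compute the feature map of each layer
f1 = np.maximum(x @ w1, 0.0)                     # ReLU activation
f2 = f1 @ w2
loss = 0.5 * np.sum((f2 - target) ** 2)

# backward pass (training path)
g2 = f2 - target                                 # gradient of the layer-2 output feature map
gw2 = f1.T @ g2                                  # weight gradient: input feature map x output gradient
g1 = (g2 @ w2.T) * (f1 > 0)                      # layer-1 feature-map gradient from layer-2 gradient and weights
gw1 = x.T @ g1

# SGD weight update: weight gradient scaled by the learning rate
w2 -= lr * gw2
w1 -= lr * gw1
```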


In order to realize evolutionary computing, three major problems at the algorithm and hardware level need to be solved:

(1) Small-sample learning: In a real environment, we usually do not have sufficient samples, and new samples appear frequently. Therefore, from the perspective of algorithm models, evolutionary computing needs to improve its capability for small-sample learning and for generalization to new tasks through the mutual exploitation of cross-modal information. At the hardware level, the chip not only needs to support the gradient computation when updating the model with small samples, but also needs to accumulate all the gradients generated by the small set of samples and then update the weights stored on the chip. In addition, special hardware modules are needed to generate supplementary knowledge information.

(2) Unsupervised learning: Besides being few in number, the available samples may not have been manually labeled, that is, no manual annotation is provided. From the perspective of algorithm models, it is possible to construct self-supervised learning tasks with logical inference capabilities, for example, a rough classification based on pre-calibrated data instead of an accurate classification of each sample; another example is to infer the correct arrangement from a disordered sequence of image blocks, or to restore the logical word order from scrambled natural language (a small sketch of such a pretext task follows this list). The purpose is to allow the model to learn automatically from unlabeled data and obtain generalizable feature representations with semantic and logical inference capability. At the hardware level, the chip needs to randomly scramble the input data, which involves disordered access to the data in memory. Designing the memory so that the data rearrangement involved in unsupervised learning is hardware-friendly and does not become a bottleneck in the training process is a problem similar to the “memory wall” in traditional artificial intelligence chips.

(3) Design of a trustworthy model: A very important point is that the result output for the current task must be trustworthy, from the algorithm down to the underlying hardware implementation. At the algorithm level, the trustworthiness of the model can be improved by using traditional prior knowledge of probability density distributions and by learning structural pattern recognition systems based on primitive attributes or component decomposition; that is to say, each prediction result must come with a reasonable confidence estimate. At the hardware level, the first thing to do is to improve the security level of the chip to prevent intentional or unintentional noise interference from causing substantial damage to the results. In addition, the probability model needs a separate hardware module with a higher security level that evaluates the trustworthiness of each output of the main model.
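The sketch below illustrates the kind of self-supervised pretext task mentioned in item (2): an image is cut into patches, the patches are shuffled, and the permutation itself becomes the training label, so no manual annotation is needed. It is only a data-preparation sketch; the network that predicts the permutation and the on-chip memory layout are left out.

```python
import numpy as np

def jigsaw_pretext(image, grid=3, rng=None):
    """Cut `image` (H, W) into grid*grid patches, shuffle them, and return the
    shuffled patches together with the permutation used: the permutation is
    the self-supervised label the model must learn to predict."""
    rng = rng or np.random.default_rng()
    h, w = image.shape[0] // grid, image.shape[1] // grid
    patches = [image[i * h:(i + 1) * h, j * w:(j + 1) * w]
               for i in range(grid) for j in range(grid)]
    perm = rng.permutation(grid * grid)
    shuffled = [patches[k] for k in perm]
    return shuffled, perm

# usage: a model trained to recover `perm` from `shuffled` learns
# transferable features from completely unlabeled images
img = np.arange(36, dtype=np.float32).reshape(6, 6)
patches, label = jigsaw_pretext(img, grid=3)
```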


5.1.3 Software-Defined Evolutionary Computing Chip

In evolutionary computing, the chip plays a very critical role, because it needs to support the operation of the evolutionary model efficiently and ensure its performance and effectiveness. However, even if the chip already has high flexibility and sufficient hardware resources, it still needs the upper-level software to schedule those resources reasonably for specific tasks in order to execute them efficiently. Therefore, defining the chip with software is an effective approach to evolutionary computing. Software-defined evolutionary computing chips mainly involve two aspects: ➀ research on the theory of automatic intelligent hardware generation, including software-defined hardware primitive design, application-driven hardware design, and agile development methods; ➁ research on the real-time independent evolution of intelligent hardware, including online training methods for hardware evolution, circuit reconfiguration technology, and online configuration information generation technology.

Automatic intelligent hardware generation mainly refers to automatically dividing and allocating the hardware according to the specific upper-level task and the resources of the underlying hardware, in order to achieve the highest hardware utilization efficiency; this can be done by an evolutionary computing compiler. The compiler not only needs to support a variety of different tasks, but also needs to divide the hardware resources reasonably among the different modules of a task. It also needs to generate the configuration information that determines how the hardware performs model updates. This configuration information directly determines how the existing hardware resources handle newly collected samples or noise in an open environment, and the corresponding form of model update.

First of all, evolutionary computing chips not only need to support model updates with the traditional stochastic gradient descent method, but also need to support other, more effective gradient descent algorithms such as Adadelta and Adam (the update rules are sketched at the end of this subsection). In stochastic gradient descent, a single sample is used to update the network parameters each time, whereas these methods update iteratively over a batch of data; moreover, the updated quantities include not only the weights but also hyper-parameters such as the learning rate and momentum. Because the gradient produced by stochastic gradient descent can oscillate strongly from sample to sample, the parameters may change frequently and the optimization may settle into a local optimum and stop improving; together with the overshoots caused by frequent fluctuations and the time it consumes, this makes it clearly unsuitable for evolutionary computing, which must update and evolve in an open environment. In addition, as for the generation of auxiliary information in the small-sample problem, we can consider converting the special needs of the scene into configuration information through the evolutionary computing compiler at the software level in the specific application scenario, and then calling an additional dedicated module to support the generation of supplementary knowledge.


As for the unsupervised learning of evolutionary computing, the chip should be designed with a more effective and reasonable memory structure and data scheduling method to minimize data rearrangement and loading. Finally, in order to improve the security level of the chip and avoid interference and damage from external noise, additional encryption chips can be considered to ensure the safe operation of evolutionary computing chips and the accuracy of the final results.
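As referenced above, the update rules of the optimizers discussed in this subsection can be written compactly. The sketch below shows a plain SGD step next to an Adam step with its first- and second-moment estimates; these are the standard textbook formulas with the usual default hyper-parameters, given only as a software reference, not as a hardware mapping.

```python
import numpy as np

def sgd_step(w, grad, lr=0.01):
    # plain stochastic gradient descent update
    return w - lr * grad

def adam_step(w, grad, state, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * grad         # first-moment (momentum) estimate
    state["v"] = b2 * state["v"] + (1 - b2) * grad ** 2     # second-moment estimate
    m_hat = state["m"] / (1 - b1 ** state["t"])             # bias correction
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)

# usage
w = np.zeros(4)
state = {"t": 0, "m": np.zeros(4), "v": np.zeros(4)}
grad = np.array([0.1, -0.2, 0.05, 0.0])
w = adam_step(w, grad, state)
```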

5.2 Post-Quantum Cryptography

In recent years, “quantum supremacy”, a relatively unfamiliar term, has been mentioned more and more frequently. This is because of the 54-qubit superconducting quantum computer “Sycamore” [8] released by Google in 2019, which completed in just 200 s a computational task that would take about 10,000 years on an IBM supercomputer; accordingly, Google claimed that it had achieved “quantum supremacy”. Also, in an issue of Science in 2020, Pan Jianwei's research team from the University of Science and Technology of China reported the 76-photon quantum computer “Jiuzhang” [9], which is 100 trillion times faster than the fastest supercomputer in processing the Gaussian boson sampling problem. In fact, “quantum supremacy” means that the computing power for specific problems exceeds that of classical supercomputers; the next stage is to realize quantum simulation systems with practical application value, which can play an active role in combinatorial optimization, machine learning, quantum chemistry, etc.; the ultimate goal is to realize a general programmable quantum computer.

Having achieved exponential growth in memory and computing, quantum computers bring unprecedented improvements to the solution of complex scientific problems. At the same time, they bring increasingly urgent security risks to cryptography. Against future quantum computer attacks there are mainly two quantum-security technologies: post-quantum cryptography based on more complex mathematical problems, and quantum cryptography based on quantum theory. The former is inherited from the traditional cryptography currently in use, is compatible with classical computers, and can be easily implemented on silicon. Therefore, in this section we only discuss post-quantum cryptography, which is compatible with the currently mature silicon-based digital integrated circuit technology.


5.2.1 Concept and Application of Post-Quantum Cryptographic Algorithms

In this section, we introduce the background and concepts of post-quantum cryptography, and the standardization progress of current post-quantum cryptographic algorithms.

1. Background of post-quantum cryptography

Quantum computing is a new computing mode that performs computation by manipulating quantum information units according to the laws of quantum mechanics. The superposition of quantum states allows each qubit to express the two values “0” and “1” at the same time, so that n qubits can express 2^n values simultaneously. This feature enables quantum computers to achieve exponential growth in memory and computing power compared with classical computers, forming the so-called quantum superiority. Meanwhile, the Shor quantum algorithm [10] proposed in 1994 can solve complex mathematical problems such as factoring large integers and discrete logarithms in polynomial time, and the Grover quantum algorithm [11] can reduce the search time over an unordered database to the square root of the original: when a specific entry needs to be found among N unordered entries, a classical computer can only query them one by one through exhaustive search, while the Grover algorithm needs only about √N queries. It should be emphasized that these two algorithms cannot run on traditional classical computers. However, it is generally believed that a quantum computer with at least one million qubits is needed to truly crack the classical cryptographic algorithms currently in use. Although current quantum computers have not yet reached one hundred qubits, according to the quantum computer roadmap released by IBM [12], its quantum computers will reach a thousand qubits around 2023 and are expected to reach one million within ten years. DigiCert commissioned a professional research company to survey information technology professionals in 400 companies located in the United States, Germany, and Japan [13]; 59% of them said they were considering or already deploying hybrid certificates with quantum-resistant capabilities. According to the Quantum Risk Assessment Report issued by the Global Risk Institute (GRI) in Canada [14], the commonly used RSA-1024 and RSA-2048 could be broken by quantum computers with 4.8 million and 9.66 million physical qubits within 13 min and 50 min, respectively. In the face of such severe security challenges, the National Security Agency (NSA) called for switching to a post-quantum cryptosystem as early as 2015 to deal with the security threats posed by quantum computers. Considering the time needed to standardize quantum-safe cryptographic algorithms and to update the cryptographic infrastructure (usually more than 10 years), and considering that in some application scenarios (such as the confidentiality of national, institutional, and personal sensitive information) an attacker can store intercepted information now and crack it once quantum attacks become feasible, it is particularly important to conduct research on cryptographic algorithms with quantum-resistant capability in advance, and to complete the update and upgrade of the cryptographic infrastructure and cryptosystem before quantum computers are put into practical use on a large scale.


Next, we analyze the impact of quantum computing attacks on the current classical cryptosystem. As introduced in Sect. 4.4, cryptographic algorithms are mainly divided into public-key cryptography, symmetric cryptography, and hash functions. As shown in Table 5.1, since most currently used public-key cryptosystems such as RSA and ECC are based on the difficult problems of large-integer factoring and discrete logarithms, these public-key algorithms will no longer be secure in the era of quantum computing. As for symmetric cryptography, although quantum computers running the Grover algorithm can crack symmetric ciphers faster, the same security strength as before can be obtained simply by doubling the key length of the symmetric algorithm, so symmetric cryptography is currently considered quantum-safe. However, since in most application scenarios the keys of symmetric algorithms are exchanged through public-key algorithms, the leakage of symmetric keys caused by the cracking of public-key cryptography still impairs the security of symmetric cryptography. As for hash functions, which involve no key, security can likewise be maintained by doubling the output length.

Table 5.1 Influence of large-scale quantum computers on the classical cryptosystem

Cryptosystem            | Cryptographic algorithm | Impact and solutions
Public-key cryptography | RSA                     | Completely cracked
Public-key cryptography | ECDSA                   | Completely cracked
Public-key cryptography | Diffie-Hellman          | Completely cracked
Symmetric cryptography  | AES                     | Security strength is reduced; a larger key size is required
Symmetric cryptography  | 3-DES                   | Security strength is reduced; a larger key size is required
Hash function           | SHA-1/2/3               | Security strength is reduced; a longer output is required

At present, the research on quantum-safe cryptography is mainly divided into two directions: post-quantum cryptography (PQC) and quantum cryptography. Post-quantum cryptography, also known as quantum-resistant cryptography (QRC) or quantum-safe cryptography (QSC), refers to public-key cryptographic algorithms that are based on more complex mathematically difficult problems and are believed to be secure against both traditional attacks and known quantum attacks. Due to its good compatibility with current cryptographic algorithms and classical computers, it has gained wide attention. It should be emphasized that these mathematically difficult problems can withstand all known quantum algorithm attacks, and are considered quantum-safe until there is further evidence showing that they are vulnerable to quantum attacks.
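As a back-of-the-envelope illustration of why doubling the symmetric key length in Table 5.1 restores the security margin, the snippet below compares the roughly √N Grover query count for different key sizes (pure arithmetic, no quantum simulation).

```python
from math import isqrt

def grover_queries(key_bits):
    # key space of size N = 2**key_bits; Grover needs on the order of sqrt(N) queries
    return isqrt(2 ** key_bits)

print(grover_queries(128) == 2 ** 64)    # AES-128 under Grover: about 2^64 work
print(grover_queries(256) == 2 ** 128)   # AES-256 under Grover: about 2^128 work,
                                         # comparable to AES-128 against a classical attacker
```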


5.3 Current Status of Post-Quantum Cryptographic Algorithms

Since 2015, many telecommunications unions and international academic organizations such as NIST, the European Telecommunications Standards Institute (ETSI), and the Institute of Electrical and Electronics Engineers (IEEE) have started studying and standardizing post-quantum cryptography. The Chinese Association for Cryptologic Research (CACR) held a national cryptographic algorithm design competition [15] in 2019, which included two tracks: block ciphers and public-key cryptography. In the public-key track, a variety of quantum-resistant public-key algorithm proposals based on lattices, multivariate systems, and super-singular homology were received; among them, the LAC algorithm from the Institute of Information Engineering of the Chinese Academy of Sciences and the Aigis algorithm from Fudan University won the first prize, and China took this opportunity to start the standardization of post-quantum cryptographic algorithms.

In terms of international influence, the post-quantum cryptography standardization carried out by NIST [16] has at present received the most extensive participation and attention worldwide. As shown in Fig. 5.6, since the call for algorithms started in December 2016, NIST has received a total of 82 algorithm proposals. As of December 2017, excluding the proposals that had been cracked, voluntarily withdrawn, or had failed to pass the review, there were 64 effective algorithm proposals, including 45 public-key encryption (PKC) algorithms and 19 digital signature (DS) algorithms. In January 2019, NIST announced a second round of 26 algorithms, including 17 PKC algorithms and 9 DS algorithms. In the third round, announced in July 2020, NIST selected 7 finalist algorithms and 8 alternate candidates. According to the NIST statement, the initial algorithm standard will be announced at the end of 2021 or in early 2022, and will mainly be selected from the 7 finalist algorithms; the period from 2022 to 2024 is the drafting stage of the PQC standard, and if the algorithms in this stage turn out to have security risks or other issues, a new standard algorithm will be selected on the basis of an in-depth analysis of the alternate algorithms.

Fig. 5.6 NIST evolution process of post-quantum cryptography standardization (December 2016: formal call for submissions; December 2017: 64 first-round candidates, 45 public-key encryption and 19 digital signature; January 2019: 26 second-round candidates, 17 public-key encryption and 9 digital signature; July 2020: 7 finalist and 8 alternate third-round schemes; end of 2021 or early 2022: draft standards)

(1) Evaluation criteria


According to the requirements of NIST, three metrics are mainly used to evaluate the overall performance of post-quantum cryptographic algorithms, namely security, cost and performance, and the implementation features of the algorithms.

Security: Security must be achieved not only against classical computers, but also against quantum computers. As shown in Table 5.2, the strength required to crack the AES and SHA algorithms is used to quantify the security strength of post-quantum cryptographic algorithms.

Table 5.2 NIST's definition of the security levels of post-quantum cryptography

Security level | Security strength
I   | Exhaustive key search to break AES128
II  | Collision search to break SHA256
III | Exhaustive key search to break AES192
IV  | Collision search to break SHA384
V   | Exhaustive key search to break AES256

Cost and performance: Cost includes computing efficiency and memory requirements, and mainly involves the size of public keys, ciphertexts, and signatures, the computing efficiency of key generation and of public- and private-key operations, and the probability of decryption errors. Computing efficiency refers to the execution speed of an algorithm; NIST hopes that the candidate algorithms can approach or even exceed the execution speed of current public-key algorithms. Memory requirements refer to the size of the software code, the RAM requirements, and the number of equivalent gates in a hardware implementation.

Algorithm and implementation features: Algorithms with higher flexibility have advantages over competing algorithms. Flexibility here includes the ability to run efficiently on multiple platforms, to exploit a certain degree of parallelism, or to support instruction set extensions to achieve higher performance. Furthermore, simple and efficient design solutions are preferred.

(2) Current algorithm types

At present, post-quantum cryptographic algorithms can be divided into the following four categories based on their underlying mathematically difficult problems.

(1) The hash-based post-quantum cryptographic algorithm, which is mainly used for digital signatures. The hash-based signature algorithm evolved from a one-time signature scheme and uses Merkle's hash-tree authentication mechanism. The root of the hash tree is the public key, and the one-time authentication keys are the leaf nodes of the tree. The security of hash-based signature algorithms depends on the collision resistance of the hash function. Since there is no effective quantum algorithm that can quickly find hash collisions, the hash-based construction can resist quantum computer attacks as long as the output length is long enough. Besides, the security of hash-based signature algorithms does not depend on a specific hash function: even if some of the currently used hash functions are compromised, a more secure hash function can be used to directly replace the compromised one.

(2) The multivariate-based post-quantum cryptographic algorithm, which uses quadratic polynomial systems with multiple variates over finite fields to construct encryption, signature, and key exchange algorithms. The security of multivariate cryptography depends on the difficulty of solving systems of nonlinear equations, that is, the multivariate quadratic polynomial problem, which has been proven to be NP-hard. There are currently no known classical or quantum algorithms that can quickly solve multivariate equation systems over finite fields. Compared with classical cryptographic algorithms based on number theory, multivariate algorithms have a faster computation speed but a larger public-key size, so they are suitable for application scenarios where frequent public-key transmission is not required, such as IoT devices.

(3) The lattice-based post-quantum cryptographic algorithm, which is considered one of the most promising post-quantum cryptographic algorithms due to its good balance between security, public/private key size, and computation speed. Compared with cryptographic constructions based on number-theoretic problems, lattice-based algorithms can significantly increase the computation speed and achieve higher security strength while only slightly increasing the communication overhead. Compared with other post-quantum cryptographies that have been implemented, lattice-based schemes have smaller public and private keys, higher security, and faster computation [8]. In addition, lattice-based post-quantum cryptography can realize various cryptographic constructions such as encryption, digital signature, key exchange, attribute-based encryption, functional encryption, and fully homomorphic encryption. In recent years, lattice-based constructions built on the learning with errors (LWE) problem and the ring learning with errors (RLWE) problem have developed rapidly and are considered one of the technical routes most likely to be standardized.

(4) The code-based post-quantum cryptographic algorithm, which uses error-correcting codes to correct deliberately added random errors, and is mainly used for public-key encryption and key exchange. McEliece uses a random binary irreducible Goppa code as the private key, and the public key is a general linear code obtained by transforming the private key. Courtois, Finiasz, and Sendrier used the Niederreiter public-key encryption algorithm to construct a code-based signature scheme. The main problem of code-based algorithms (such as McEliece) is that the public-key size is too large.

Table 5.3 shows NIST's statistics of the post-quantum cryptographic algorithms in the third round. It can be seen that among the 7 finalist algorithms, there is only one code-based algorithm, Classic McEliece, and one multivariate-based algorithm, Rainbow, used for public-key encryption and digital signature respectively. According to the NIST statement, if no particular security risks or technical difficulties arise, these two algorithms will by default become finalist standard algorithms.

In the meantime, one of the three lattice-based post-quantum cryptographic algorithms for public-key encryption and one of the two lattice-based post-quantum cryptographic algorithms for digital signatures will be selected as finalist standard algorithms.

Table 5.3 NIST's statistics of the post-quantum cryptographic algorithms in the third round

Third round of algorithms | Mathematically difficult problems | Public-key encryption        | Digital signature
Finalist algorithms       | Lattice                           | Crystals-Kyber, NTRU, Saber  | Crystals-Dilithium, Falcon
Finalist algorithms       | Encoding                          | Classic McEliece             | -
Finalist algorithms       | Multivariate                      | -                            | Rainbow
Alternate algorithms      | Lattice                           | FrodoKEM, NTRUPrime          | -
Alternate algorithms      | Encoding                          | BIKE, HQC                    | -
Alternate algorithms      | Super-singular homology           | SIKE                         | -
Alternate algorithms      | Multivariate                      | -                            | GeMSS
Alternate algorithms      | Hash                              | -                            | SPHINCS+, Picnic

5.3.1 Status Quo of the Research on Post-Quantum Cryptographic Chips

As mentioned above, post-quantum cryptographic algorithms are designed on the basis of more difficult mathematical problems than existing public-key cryptographic algorithms. As a result, the current candidate post-quantum cryptographic algorithms are far more demanding than classical cryptographic algorithms in terms of computational complexity, storage overhead, and bandwidth requirements. It is therefore particularly necessary to accelerate these algorithms with various forms of cryptographic hardware such as ASICs, FPGAs, and ISAPs, so as to better promote the industrial application of post-quantum cryptography. At the same time, since the post-quantum cryptographic algorithm standard has not yet been finalized, and each candidate algorithm undergoes major or minor changes in every iteration, research on post-quantum cryptographic chips is not yet extensive.

1. Post-quantum cryptographic chips for ASIC implementation

Currently, the lattice-based post-quantum cryptographic processor published by Professor Chandrakasan's team at MIT [17, 18] is one of the few post-quantum cryptographic chips implemented in silicon.


Fig. 5.7 System architecture of the reconfigurable cryptographic processor Sapphire (main blocks: Keccak computing core with seed registers, 1 KB instruction memory, polynomial caches, modular arithmetic unit, samplers for uniform, binomial, Gaussian, and ternary distributions, NTT constants RAM, instruction decode and control, and a read/write interface with reset, clock, address, data, and interrupt signals)

As a post-quantum cryptographic chip for IoT applications, this processor achieves both low power consumption and a certain degree of configurability, and can support up to 5 lattice-based post-quantum cryptographic algorithms. As shown in Fig. 5.7, in order to fully support the sampling functions adopted by the target algorithms, the processor implements uniform, ternary, discrete Gaussian, and centered binomial sampling in dedicated hardware, at the cost of some area overhead. Meanwhile, in order to reduce the power consumption of encryption and decryption, a single butterfly processing element is reused sequentially to complete the public-key encryption function.

Another work explores the design of a post-quantum cryptographic chip using the number theoretic transform (NTT) and the module learning with rounding (MLWR) algorithm. This work proposed a number theoretic transform and inverse transform method of low computational complexity [19], together with an efficient post-quantum cryptographic hardware architecture (Fig. 5.8). It not only reduces the computational complexity of a class of lattice-based post-quantum cryptographic algorithms, but also reduces the hardware resource overhead while increasing the execution speed. Experimental results show that, compared with mainstream designs at the time, the computation speed of this design is more than 2.5 times faster, and the area-latency product is reduced by 4.9 times. The crux of the low efficiency of existing lattice-cipher-oriented number theoretic transform architectures is that the forward and inverse transforms require pre-processing and post-processing, respectively; this pre- and post-processing involves a large amount of computation and is the bottleneck restricting the processing speed. By fusing the pre-processing into the time-domain-decomposed fast Fourier transform, and fusing the post-processing into the frequency-domain-decomposed fast Fourier transform, both parts of the computation are completely eliminated. Compared with the classic fast Fourier transform, this method has no additional time overhead, and the hardware cost is also very small.
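Before turning to the NTT-oriented architecture of Fig. 5.8, the sketch below gives a plain software reference for one of the sampling functions listed above: the centered binomial distribution that several lattice-based schemes use to generate small error polynomials. The parameter names and values are illustrative and are not taken from the Sapphire design.

```python
import secrets

def centered_binomial(eta):
    """Sample sum(a_i) - sum(b_i) over eta uniform bit pairs: a value in
    [-eta, eta] that approximates a narrow discrete Gaussian."""
    a = sum(secrets.randbits(1) for _ in range(eta))
    b = sum(secrets.randbits(1) for _ in range(eta))
    return a - b

def small_error_poly(n, eta, q):
    """Error polynomial with n small coefficients, stored mod q."""
    return [centered_binomial(eta) % q for _ in range(n)]

# e.g. 256 coefficients with eta = 2 and q = 3329, parameters in the range
# used by NIST lattice candidates
poly = small_error_poly(256, 2, 3329)
```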


Fig. 5.8 Hardware architecture of the NTT-friendly NewHope algorithm (main blocks: polynomial decoding and encoding, binomial and rejection sampling, Keccak, a butterfly unit, coefficient and twiddle-factor RAMs, and compression/decompression logic)

Also, the researchers have proposed a compact processing element architecture that can support two butterfly operations. For the specific modulus of the NewHope algorithm, they proposed a constant-time modular reduction method that does not require multiplication operations, and accordingly designed a low-complexity number theoretic transform hardware architecture, which features the fastest execution speed among NTT hardware implementations of the same scale and reduces the area-latency product by nearly 3 times. In addition, this research used architectural optimization techniques such as double-bandwidth matching and timing hiding to further reduce the number of clock cycles needed to execute the NewHope algorithm, and designed a NewHope hardware architecture with constant processing time.

The Saber algorithm has also received extensive attention and in-depth research. In Literature [20], the author proposed a hierarchical Karatsuba multiplication for the degree-256 polynomial multiplication used in the Saber algorithm, and customized the design for the computation patterns of the Saber algorithm. As shown in Fig. 5.9, the architecture uses a multiplier array to complete the Karatsuba multiplication between the 16 coefficients of two polynomials. Meanwhile, the corresponding multipliers are custom-designed according to the data sampling characteristics of the binomial distribution in the Saber algorithm.


Fig. 5.9 Accelerator architecture optimized for the Saber algorithm (main blocks: a multiplier array with binomial-sampling-aware multipliers, pseudo-random number generators and sampler, intermediate-result storage, public- and private-key storage, data trimming and alignment logic, and adders)
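The NTT mentioned repeatedly above is the core arithmetic kernel of these lattice-based designs. As a plain software reference, not the architecture of any work cited above, the sketch below multiplies two polynomials in Z_q[x]/(x^n + 1) using a psi-weighted forward NTT, pointwise multiplication, and an inverse NTT. The toy parameters n = 8, q = 257 are chosen only so that a 2n-th root of unity exists; NewHope itself uses n = 512 or 1024 with q = 12289.

```python
def find_psi(n, q):
    """Brute-force a primitive 2n-th root of unity psi mod q (psi^n = -1 mod q)."""
    for x in range(2, q):
        if pow(x, n, q) == q - 1 and pow(x, 2 * n, q) == 1:
            return x
    raise ValueError("need 2n to divide q - 1")

def ntt(a, q, w):
    """Iterative radix-2 Cooley-Tukey NTT: A[k] = sum_j a[j] * w^(jk) mod q."""
    a, n = list(a), len(a)
    j = 0
    for i in range(1, n):                      # bit-reversal permutation
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:                         # butterfly stages
        wm = pow(w, n // length, q)            # twiddle factor for this stage
        for start in range(0, n, length):
            t = 1
            for k in range(length // 2):
                u = a[start + k]
                v = a[start + k + length // 2] * t % q
                a[start + k] = (u + v) % q
                a[start + k + length // 2] = (u - v) % q
                t = t * wm % q
        length <<= 1
    return a

def negacyclic_mul(a, b, q):
    """c = a*b mod (x^n + 1, q) via psi-weighting, NTT, pointwise mul, inverse NTT."""
    n = len(a)
    psi = find_psi(n, q)
    w = psi * psi % q                                  # primitive n-th root of unity
    psi_inv, w_inv, n_inv = pow(psi, q - 2, q), pow(w, q - 2, q), pow(n, q - 2, q)
    a_hat = ntt([a[i] * pow(psi, i, q) % q for i in range(n)], q, w)
    b_hat = ntt([b[i] * pow(psi, i, q) % q for i in range(n)], q, w)
    c_hat = ntt([x * y % q for x, y in zip(a_hat, b_hat)], q, w_inv)   # inverse NTT = NTT with w^-1
    return [c_hat[i] * n_inv % q * pow(psi_inv, i, q) % q for i in range(n)]

if __name__ == "__main__":
    n, q = 8, 257
    a, b = [3, 1, 4, 1, 5, 9, 2, 6], [2, 7, 1, 8, 2, 8, 1, 8]
    ref = [0] * n                                      # schoolbook reference mod (x^n + 1)
    for i in range(n):
        for j in range(n):
            s = -1 if i + j >= n else 1
            ref[(i + j) % n] = (ref[(i + j) % n] + s * a[i] * b[j]) % q
    assert negacyclic_mul(a, b, q) == ref
    print(ref)
```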

2. Post-quantum cryptographic chips based on the FPGA platform

The research team of Kris Gaj at George Mason University has long been committed to evaluating the hardware performance of algorithms during international cryptographic standardization, and played a vital role in the standardization of the AES and SHA algorithms. As shown in Fig. 5.10, the team is currently using high-level synthesis (HLS) tools to carry out software and hardware co-design of the current post-quantum cryptographic algorithms on Xilinx's FPGA platform [21]. The optimization techniques mainly used in high-level synthesis include loop unrolling and loop pipelining. Loop unrolling executes functional units within a loop that have no data dependences in parallel; it is mainly used in scenarios that are sensitive to computation latency and amounts to trading resources for time. Loop pipelining pipelines the functional units that are executed sequentially within a loop, thereby reducing the execution time of the overall loop computation.


However, the conversion from C/C++ algorithm implementations to RTL-level hardware code is not completely automatic: manual optimization of the original algorithm code is still required. In addition, current high-level synthesis tools cannot provide good support for dynamic arrays, system functions, and pointers, which need to be modified before synthesis. In Literature [22], the author used HLS to explore the design space of the core computing module of lattice-based cryptography, the NTT, and compared the result with hand-designed and hand-optimized code, finding that the performance of the high-level synthesis implementation is significantly lower than that of manual designs. However, HLS can shorten the design cycle and enable diversified exploration of the design space.

3. Post-quantum cryptographic chips based on the ISAP architecture

The research team of the Technical University of Munich, Germany, published a RISC-V-based post-quantum cryptographic processor [23]. While adding custom designs for the new computing operations of post-quantum cryptographic algorithms, it also extends the RISC-V instruction set. In this work, the author divides current post-quantum cryptographic chips into two forms: tightly coupled accelerators and loosely coupled accelerators. Loosely coupled accelerators are ASIC-based hardware accelerators that perform complete cryptographic algorithm functions; their disadvantage lies in the relatively large data communication overhead.

Fig. 5.10 System architecture of the GMU team's software and hardware co-design (a Zynq processing system connected to the hardware accelerator through AXI, an AXI timer, and input/output FIFOs)


Fig. 5.11 System architecture of the RISCQ-V processor

As shown in Fig. 5.11, this work first designed and implemented a series of hardware accelerators for the computing patterns in post-quantum cryptographic algorithms, including parallel butterfly operations, random polynomial generation, vectorized modular arithmetic, and twiddle-factor generation. Secondly, 28 new instructions were added to the RISC-V instruction set to support the computations in post-quantum cryptographic algorithms. Finally, the design was evaluated both on an FPGA platform and as an ASIC implementation. In addition, the research team of Fudan University published a domain-specific post-quantum cryptographic processor architecture based on RISC-V for lattice-based cryptographic algorithms [24]. The main novelty of this work is mining the data-level parallelism in RLWE and MLWE candidate algorithms and vectorizing the NTT and sampling processes. Literature [25] proposed an architecture for NTT computation under the RISC-V instruction set. As polynomial multiplication is the most complex bottleneck module of lattice-based post-quantum cryptography, its efficient hardware implementation is the priority; this work integrates the NTT into the RISC-V pipeline to accelerate NTT processing.

5.3.2 Software-Defined Post-Quantum Cryptographic Chip

A challenge for software-defined post-quantum cryptographic chips is that there is no algorithm standard yet. Although the number of candidate algorithms has been greatly reduced, there is still a variety of algorithms and parameter sets. Meanwhile, as the standardization work progresses, each algorithm may still be modified and iterated.


At present, from the perspective of hardware implementation, the top priority is the computational complexity of the algorithms. At this stage, security is the primary design indicator for the algorithm design teams, and the difficulty of implementing the algorithms, especially the convenience of hardware implementation, has not been fully considered. Meanwhile, these algorithms often involve advanced mathematics such as number theory, abstract algebra, and coding theory, and it is still difficult for hardware designers with a background in electronic engineering to fully understand their documentation. Although the C-language reference implementations can be used as input to high-level synthesis tools, the design space cannot be fully explored this way, and some potential optimization opportunities are lost.

Also, current post-quantum cryptographic algorithms fall into multiple categories based on different mathematical problems, and each category differs from the others. Furthermore, lattice-based algorithms are divided into structured and unstructured lattices, and code-based algorithms are divided by NIST into algebra-based, short-Hamming, and low-rank algorithms. On the whole, the computation types involved in post-quantum cryptographic algorithms are very different from those of traditional public-key cryptographic algorithms. Large-integer multiplication (including modular multiplication with respect to large prime numbers, multiplication of two large prime numbers, etc.) determines the computational complexity of current public-key cryptographic algorithms, but among post-quantum algorithms this applies only to the super-singular homology type. Some post-quantum cryptographic algorithms also have a certain decryption error probability, which may cause time-consuming repeated computations that in turn affect the average and worst-case decryption time. In addition, some encryption, key generation, and key encapsulation operations require random numbers following particular distributions as input; such random sampling requires a true random number generator together with post-processing for the different distributions, circuit functions that are rarely involved in current cryptographic chip design.

Side-channel protection is a further issue. For high-performance, server-oriented implementations, physical access by an attacker is usually assumed to be impossible, so it is sufficient to guarantee constant execution time during the design. However, for lightweight implementations in mobile and IoT devices, possible timing, power-consumption, and electromagnetic attacks as well as memory leaks must be fully considered. Moreover, many side-channel protection methods are not universal but are closely tied to the computational features of specific algorithms. Therefore, side-channel protection is also a challenge in the hardware design process, as are the additional resource overhead it causes and the evaluation of the protection effect.


5.4 Fully Homomorphic Encryption

Cloud computing is a new service model for providing information technology resources. It gathers massive computing and storage resources and provides them as commodities to users conveniently via the Internet. Because it can significantly save users the cost of purchasing software and hardware and of operating and maintaining systems, it is highly favored by individuals and small and medium-sized users. However, since the storage and computation of user data are entirely handled by the cloud service provider, these data are completely visible to the service provider, so some security-sensitive applications and data are not suitable for the cloud-based service model. The traditional cryptosystem does not support processing data while it remains encrypted: users have to decrypt the data first if they want to perform computation on it. The emergence of fully homomorphic encryption (FHE) technology fundamentally solves this problem, as it supports direct processing of data that stays in ciphertext form, without decryption. Therefore, fully homomorphic encryption has bright application prospects in cloud computing and other untrusted scenarios.

The fully homomorphic encryption technology is still developing rapidly. Its major problems at present are low performance and excessive storage requirements. It encounters the following problems on currently mainstream computing platforms: insufficient performance on general-purpose processors, excessive power consumption on graphics processors, a huge amount of configuration data on field-programmable logic devices, and insufficient flexibility of application-specific integrated circuits. Fully homomorphic encryption is characterized by large parameters, intensive computation, high flexibility requirements, and rich parallelism. The high performance, high flexibility, high energy efficiency, and efficient configuration of software-defined chips align well with these characteristics, giving them great potential as an implementation platform for fully homomorphic encryption. Based on an introduction to fully homomorphic encryption, in this section we discuss the potential of realizing it with the software-defined chip method, and analyze possible implementations of its key components.

5.4.1 Concept and Application of Fully Homomorphic Encryption

Traditional encryption technology is widely used to protect sensitive information. However, an information system can then only store the encrypted information and cannot perform any computation on it. The homomorphic encryption technology was created to overcome this difficulty: it refers to a cryptographic scheme that allows direct computation on encrypted information without revealing the plaintext.


Homomorphic encryption schemes are classified into the following three types according to the allowed operation types and counts [26]: partially homomorphic encryption (PHE), which allows only one type of operation but places no limit on the number of operations; somewhat homomorphic encryption (SWHE), which allows a limited number of operations of multiple types; and fully homomorphic encryption, which allows an unlimited number of operations of any type. Using the symbol m to represent sensitive information, En(m) to represent the encrypted sensitive information, and f(·) to represent any function with which we want to process the information, a fully homomorphic encryption scheme can be expressed as follows: there exists a function F(·) such that F(En(m)) = En(f(m)) can be computed directly on the ciphertext En(m) without knowing the key or the plaintext m.

Figure 5.12 shows a simple example of using cloud computing services through fully homomorphic encryption. The process can be roughly divided into 6 steps:
(1) The user first generates the public key and the private key required for fully homomorphic encryption, and uses the public key to encrypt its private data;
(2) The user sends the encrypted data to the cloud server, where it is stored as ciphertext;
(3) When the user wants to perform a certain operation on the data, the user sends the algorithm to be executed to the cloud server;
(4) The cloud server maps the requested algorithm to the corresponding homomorphic algorithm and processes the ciphertext data;
(5) The cloud server sends the processed ciphertext result to the user; the cloud server cannot know the content of the result;
(6) The user decrypts the ciphertext result with the private key and recovers the plaintext computation result.

Fig. 5.12 Cloud computing flow chart of fully homomorphic encryption (user and cloud)

Since fully homomorphic encryption can operate on ciphertext without revealing the underlying information, many application scenarios become realistic and practical. Take secure search as a typical example: a series of encrypted, unsorted data is uploaded to and saved on a server in advance, and the server does not hold the decryption key. The user sends an encrypted query instruction to the server; after processing the instruction, the server returns an index and the data satisfying the query to the user, and the returned value is encrypted and invisible to the server. This function is applicable to many application scenarios, such as secure search of private e-mails, confidential military documents, or sensitive company business documents,

searching for patient records in medical databases under specific conditions, and secure search engines. In the past, secure search protocols always faced one of the following problems: the protocol provides only a limited search function; the computational complexity of the protocol is linear in the size of the database, making it very inefficient; or the security of the protocol is insufficient and important search information is leaked [27]. Secure search based on fully homomorphic encryption is expected to solve these problems.

Ciphertext statistics is another typical application: the user provides a large amount of encrypted data to an outsourcer, which performs statistical analysis on the data in its ciphertext state, obtains encrypted statistical results, and returns them to the user. Since the data handed to current cloud computing service providers is transparent to them, cloud computing cannot be used to perform statistical analysis on confidential or private data; fully homomorphic encryption makes it possible to hand such private data to cloud service providers for statistical analysis.

Machine learning has been a hot research topic in many fields in recent years. It can acquire knowledge efficiently by learning from large amounts of data, and provide various services more intelligently and in a more personalized way. However, in some fields, data privacy makes it inconvenient to use machine learning. For example, in medicine and bioinformatics, machine learning can be used to efficiently analyze large amounts of medical and genomic data, but these data cannot be shared freely for ethical and regulatory reasons. Fully homomorphic encryption can overcome these problems and enable secure machine learning in privacy-sensitive applications [28].

The concept of homomorphic encryption was first proposed by Rivest et al. [29] in 1978 to describe a cryptographic system that can compute on encrypted data without decryption. After this concept was proposed, researchers tried to construct different homomorphic cryptosystems, but they were either partially homomorphic or somewhat homomorphic. It was not until 2009 that Gentry proposed the first ideal-lattice-based fully homomorphic encryption scheme [30]. Since then, schemes based on the approximate greatest common divisor (AGCD) problem [31], schemes based on LWE or RLWE [32], and schemes based on NTRU (number theory research unit) [33] have appeared one after another. The bootstrapping method proposed by Gentry in the first FHE scheme can convert a somewhat homomorphic encryption (SWHE) scheme into a fully homomorphic encryption scheme, and most subsequently proposed SWHE schemes are part of an FHE scheme [26]. Because the bootstrapping operation involves an excessive amount of computation, researchers also began to study bounded/leveled fully homomorphic encryption schemes that can only perform a pre-determined, limited number of operations; such schemes can be regarded as SWHE schemes, so SWHE and FHE schemes are discussed together below. At present, the mainstream fully homomorphic encryption schemes include the following and their optimized variants: the Brakerski-Gentry-Vaikuntanathan (BGV) scheme [34], the Brakerski/Fan-Vercauteren (BFV) scheme [35, 36], the Lopez-Alt-Tromer-Vaikuntanathan (LTV) scheme [33], the Gentry-Sahai-Waters (GSW) scheme [37], the Cheon-Kim-Kim-Song (CKKS) scheme [38], the TFHE scheme [39], etc.

The BGV scheme is based on the difficult LWE or RLWE problems. It eliminates the bootstrapping step and adopts a leveled homomorphic scheme. The BGV scheme uses SIMD to encode multiple plaintexts in one ciphertext at the same time, which greatly improves the efficiency of ciphertext operations; Literature [40] constructed a fully homomorphic circuit model based on this scheme and performed homomorphic computations of the AES algorithm. The GSW scheme is based on the difficult LWE problem. The LTV scheme is based on the difficult NTRU problem. The BFV, CKKS, and TFHE schemes are all based on the RLWE problem. The TFHE scheme is an RLWE-based variant of the GSW scheme; it is good at logic computations and can perform fast bootstrapping after each logic gate. The CKKS scheme is good at floating-point computations and can perform approximate computations quickly.
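To make the homomorphic property F(En(m)) = En(f(m)) and the noise growth mentioned above concrete, here is a toy symmetric scheme loosely modeled on the integer-based (AGCD-style) approach. It is purely illustrative: the parameters are far too small to be secure, and it is not the construction of any of the cited schemes.

```python
import random

def keygen(bits=256):
    return random.getrandbits(bits) | 1          # secret odd modulus p

def encrypt(p, m, noise_bits=16, q_bits=512):
    assert m in (0, 1)
    q, r = random.getrandbits(q_bits), random.getrandbits(noise_bits)
    return p * q + 2 * r + m                     # ciphertext; noise term is 2r + m

def decrypt(p, c):
    return (c % p) % 2                           # correct while the noise term stays below p

def noise_size(p, c):
    return (c % p).bit_length()

p = keygen()
c1, c2 = encrypt(p, 1), encrypt(p, 1)
print(decrypt(p, c1 + c2))   # homomorphic XOR: 1 xor 1 = 0
print(decrypt(p, c1 * c2))   # homomorphic AND: 1 and 1 = 1

# each homomorphic multiplication roughly squares the noise; once it
# approaches p, decryption fails, which is why real FHE schemes need
# bootstrapping or leveled parameter sets
c = encrypt(p, 1)
for depth in range(1, 7):
    c = c * c
    print(depth, noise_size(p, c), decrypt(p, c))
```

The final loop shows the depth/noise trade-off that drives the parameter choices discussed later in Sect. 5.4.3.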

5.4.2 Status Quo of the Research on Fully Homomorphic Encryption Chips

Although fully homomorphic encryption schemes have been continuously improved in recent years and their efficiency has increased greatly, there is still a long way to go before they meet actual application requirements. The low implementation performance of fully homomorphic encryption is the main bottleneck preventing it from delivering its huge application value. There are four ways to address this problem: the first is to study new fully homomorphic schemes that reduce the complexity of the scheme itself; the second is to improve the concrete implementation algorithms, reducing the number and computational complexity of key operations such as matrix multiplication, polynomial multiplication, and large-integer multiplication; the third is to optimize the ciphertext application algorithm, reducing the number and depth of ciphertext multiplications; the fourth is to improve the processing capability of the fully homomorphic encryption hardware platform, using a variety of software and hardware techniques. The first three approaches reduce the computational workload, while the fourth increases the computing power of the platform. At present, the major hardware platforms for fully homomorphic encryption schemes include general-purpose processors, GPUs, FPGAs, and ASICs, and research on fully homomorphic encryption chips mainly studies hardware architectures based on FPGAs or ASICs.

Thanks to its flexibility, the FPGA has become an important hardware platform for accelerating homomorphic encryption. The FPGA is generally used as a co-processor that completes the high-complexity computations and forms a heterogeneous architecture with a general-purpose processor to jointly complete the fully homomorphic encryption computations.


Generally, the FPGA implements the large-integer or polynomial multiplication, and sometimes only the NTT conversion and coefficient-wise multiplication. For example, Literature [41] implemented a 768-Kbit multiplier based on a Stratix-V FPGA. Literature [42] implemented homomorphic multiplication, homomorphic addition, and key exchange for RLWE-based fully homomorphic encryption schemes (such as YASHE) on a Stratix-V FPGA. Literature [43] implemented homomorphic ciphertext operations of the YASHE scheme and the SIMON-64/128 block algorithm on a Virtex-7 XC7V1140T. Literature [44] implemented the AES and Prince algorithms under the LTV scheme on a Virtex-7 XC7VX690T. Literature [45] implemented the encryption, decryption, and re-encryption operations of the FV scheme on a Virtex-6 FPGA. Literature [46] proposed a hybrid number theoretic transform (NTT) method and an NTT-decoupled hardware architecture for the large-integer and high-order polynomial multiplications in fully homomorphic encryption, as shown in Fig. 5.13; this architecture decomposes the NTT and INTT, reducing the storage overhead by half, and adopts a parallel computing architecture based on interleaved memory access, reducing the number of clock cycles by half. FPGA-based hardware platforms greatly improve the performance of fully homomorphic encryption and consume less power than GPUs. However, FPGAs are general-purpose devices that are not optimized for fully homomorphic applications; their fine-grained reconfiguration leads to massive configuration data and prevents dynamic reconfiguration, and due to the lack of good support for high-level languages, FPGA development is more difficult than development for CPUs and GPUs.

ASIC hardware platforms are relatively inflexible: they are generally designed for fixed fully homomorphic encryption schemes and fixed parameters while pursuing extreme performance, area efficiency, or energy efficiency, with specialized hardware for acceleration.

Fig. 5.13 Hardware architecture of NTT-decoupled multipliers


Compared with FPGAs, ASICs further improve the processing performance of fully homomorphic encryption and reduce power consumption. Literature [47] used an IBM 90 nm technology to implement 768-Kbit integer multiplication. Literature [48] used a TSMC 90 nm technology to implement million-bit integer multiplication. Literature [49] used a TSMC 90 nm technology to implement the encryption, decryption, and re-encryption of the GH fully homomorphic scheme, essentially realizing the complete operation of fully homomorphic encryption. The NTT-decoupling architecture proposed in Literature [46] can also be used in ASIC designs. Literature [50] used a 55 nm technology to design a low-power chip for IoT devices that supports homomorphic encryption and decryption operations. DARPA launched the DPRIVE program in 2020, which plans to develop a hardware accelerator for fully homomorphic encryption; its system architecture is shown in Fig. 5.14. The large arithmetic word size (LAWS) architecture proposed in DPRIVE is expected to process data widths of thousands of bits, greatly reducing the execution time of fully homomorphic encryption. ASIC-based hardware platforms can improve the implementation of fully homomorphic encryption in terms of performance, area, power consumption, and cost-effectiveness. However, application-specific integrated circuits are difficult to design, have high development costs, and offer poor flexibility: once a chip is designed, it cannot be programmed or reconfigured, making it difficult to adapt to the constantly evolving schemes and potentially diverse applications of fully homomorphic encryption.

Fig. 5.14 DPRIVE system architecture (fully homomorphic algorithms compiled through a low-level fully homomorphic programming model that optimizes data representation and computation, running on the LAWS hardware architecture with local memory optimized for cache/data access, latency, and power consumption, a storage management unit, and add, multiply, modular, shift, and conversion units)


5.4.3 Software-Defined Fully Homomorphic Encryption Computing Chip

Implementation platforms based on FPGAs or ASICs, given their respective characteristics, have difficulty meeting the comprehensive requirements of fully homomorphic encryption for a highly flexible, high-performance, low-power, and easy-to-develop hardware platform. Fully homomorphic encryption is characterized by high flexibility requirements, intensive computation, and rich parallelism, while the software-defined chip is a hardware platform with outstanding overall indicators in flexibility, performance, power consumption, and ease of use; it is therefore very suitable for implementing fully homomorphic encryption. In this section, we first analyze the necessity of using the software-defined chip method for fully homomorphic encryption, and then discuss how to design its key modules with the software-defined chip technology.

1. Necessity of a software-defined fully homomorphic encryption computing chip

The necessity of software-defined fully homomorphic encryption computing chips is mainly reflected in two aspects: on the one hand, fully homomorphic encryption applications place high requirements on the flexibility, speed, and energy efficiency of the implementation platform; on the other hand, fully homomorphic computation is characterized by large parameters, intensive computation, and rich parallelism. In the following, we analyze the necessity of the software-defined chip approach from these two aspects.

(1) Application requirements

From the perspective of application requirements, fully homomorphic encryption demands high flexibility from its implementation platform, which is reflected in several ways. First of all, since fully homomorphic encryption is not yet a fully mature technology, there are a large number and many kinds of schemes, and optimizations of existing schemes as well as new schemes are proposed every year. These schemes have respective advantages and disadvantages in terms of computational complexity, key size, ciphertext size, and noise growth rate, and at present no scheme holds an absolute advantage in all aspects. Therefore, different schemes need to be selected for different requirements in different application scenarios. Since fully homomorphic encryption schemes are built on multiple foundations, namely ideal lattices, AGCD, LWE or RLWE, and NTRU, the basic data types processed by these different types of schemes differ considerably, requiring many types of processing operations. This places high requirements on the flexibility of the implementation platform.

Secondly, after a specific fully homomorphic encryption scheme is selected, each scheme still has multiple parameter sets corresponding to different security levels and noise tolerances.

the same as that of traditional cryptographic algorithms: the higher the required security level, the larger the corresponding parameter values. Regarding noise, fully homomorphic encryption introduces a certain amount of noise when encrypting plaintext into ciphertext, and every homomorphic addition or multiplication in the ciphertext domain causes the noise to grow. Once the accumulated noise exceeds a certain limit, decryption errors occur. Because different application scenarios require different numbers of homomorphic operations, different parameter sets need to be selected to provide the corresponding noise tolerance and guarantee the correctness of computation. In addition, some schemes use different parameters during the computation even after the parameter set has been determined. For example, the LTV scheme has multiple levels of homomorphic computation, and the modulus q used at each level is different: the modulus at the first level is the largest and decreases as the level increases, so the value of the modulus must be updated as the computation proceeds through the levels. In summary, different fully homomorphic encryption schemes and different parameter sets need to be selected for different application scenarios, and some schemes additionally change parameters during computation, which requires a highly flexible implementation platform. However, on the more flexible platforms, such as general-purpose processors, graphics processors and field-programmable logic devices, fully homomorphic encryption schemes suffer from slow speed and excessive power consumption. While pursuing flexibility, high speed and high energy efficiency must also be ensured, so software-defined chip technology is a highly promising platform for fully homomorphic encryption.

(2) Computing features

The specific computing features of fully homomorphic encryption are somewhat different from those of traditional cryptographic algorithms, and these features also make it very suitable for implementation using the software-defined chip method. First, the fully homomorphic encryption scheme features huge parameters, for two reasons: practicability and security. Practicability mainly refers to the depth of the fully homomorphic circuit: the greater the depth, the more operations that can be performed homomorphically, and hence the wider the application range. Current research focuses on fully homomorphic circuits with a depth of 40 to 80 levels, which is enough to homomorphically evaluate a computation comparable to the 10 rounds of AES. Take the LWE-based fully homomorphic encryption scheme as an example. An increase in circuit depth is accompanied by an increase in noise; to guarantee correct computation, the noise tolerance must be raised, which in turn means increasing the modulus q. Ensuring a larger computational depth therefore requires q to be as large as possible. As for security, increasing the modulus q alone reduces the difficulty of the underlying LWE problem and thus the security. To increase the modulus q while maintaining security, the dimension n of the polynomial or matrix must be increased at the same time. Therefore, in order to ensure both computational depth and security, the fully homomorphic
encryption scheme often uses large q and n. In most cases, the number of polynomial terms n is around 2^15 or 2^16, and each coefficient is between 1200 and 2500 bits wide [43].

Fully homomorphic encryption is computationally intensive. For a polynomial with a large number of terms and large coefficients, multiplication requires a very large amount of computation. Polynomial multiplication is used in all stages of a fully homomorphic encryption scheme, such as key generation, encryption and decryption, homomorphic multiplication, and relinearization. Fully homomorphic encryption therefore contains many polynomial multiplications with a huge amount of computation, which makes it a computationally intensive algorithm and one that can benefit greatly from the software-defined chip method.

The computation of fully homomorphic encryption also features rich parallelism. Since polynomial multiplication is the most common and most time-consuming operation in every fully homomorphic encryption scheme, we mainly analyze the parallelism of polynomial multiplication here. Performing polynomial multiplication in the schoolbook way requires n^2 modular multiplications on polynomial coefficients, and these modular multiplications can be fully parallelized. In practice, however, naive schoolbook multiplication faces two problems: the high computational complexity and the excessive bit width of the coefficients. Firstly, the computational complexity of schoolbook multiplication is O(n^2); with the typical parameter n = 2^15, about one billion modular multiplications are needed, consuming a lot of logic resources or computing time. Secondly, since the polynomial coefficients are on the order of kilobits long, it is very difficult to store and compute an entire coefficient as the basic unit. To address these two difficulties, researchers often use the Chinese remainder theorem (CRT) transformation and the NTT to optimize polynomial multiplication [51].

The method of using the CRT for optimization is shown in Fig. 5.15. The CRT can disaggregate each large coefficient into multiple smaller numbers. In this way, every coefficient of polynomial A can be disaggregated by the same CRT to obtain multiple polynomials with smaller coefficients, namely A^(1), A^(2), ..., A^(k). Disaggregating another polynomial B in the same way gives B^(1), B^(2), ..., B^(k). After that, the small-coefficient polynomials A^(i) and B^(i) are multiplied, and the inverse Chinese remainder theorem (ICRT) transform is applied to the products C^(i) to obtain the result C = A × B of the original polynomial multiplication. The CRT thus turns one polynomial multiplication with large coefficients into multiple parallel polynomial multiplications with small coefficients; if the coefficients are disaggregated into k groups, the parallelism is increased by a factor of k. While the CRT transform converts polynomial multiplication with large coefficients into polynomial multiplications with small coefficients, the NTT algorithm reduces the computational complexity of polynomial multiplication from O(n^2) to O(n log n). The method of accelerating polynomial multiplication with the NTT is similar to the method of accelerating convolution with the FFT.
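Before walking through the NTT steps below, the CRT decomposition just described can be illustrated with a minimal Python sketch. The small prime moduli, the toy polynomial and the helper names are illustrative assumptions rather than values from the text; a real design would pick NTT-friendly primes matching the hardware word width.

from math import prod

moduli = [7681, 12289, 65537]            # assumed pairwise co-prime small primes m_1..m_k
M = prod(moduli)                          # the large modulus M = m_1 * m_2 * ... * m_k

def crt_decompose(poly):
    """Split one large-coefficient polynomial into k small-coefficient residue polynomials."""
    return [[c % m for c in poly] for m in moduli]

def icrt(residues):
    """Recombine the per-modulus residues of one coefficient into a value mod M."""
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)      # r_i * M_i * (M_i^{-1} mod m_i)
    return x % M

# Toy example: decompose a two-coefficient polynomial A, then recombine coefficient-wise.
A = [123456789, 987654321]
A_parts = crt_decompose(A)                # A^(1), ..., A^(k), one residue polynomial per modulus
recovered = [icrt([part[i] for part in A_parts]) for i in range(len(A))]
assert recovered == [a % M for a in A]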
Fig. 5.15 CRT transform to optimize polynomial multiplication (A and B are each decomposed by the CRT into A^(1), ..., A^(k) and B^(1), ..., B^(k), multiplied pairwise into C^(1), ..., C^(k), and recombined by the ICRT into C)

The specific steps are as follows: pass the n coefficients of polynomial A through the NTT to obtain n
NTT-domain coefficients; transform polynomial B to the NTT domain in the same way; multiply the coefficients of the two NTT-domain polynomials pairwise; and finally apply the inverse number theoretic transform (INTT) to the result to obtain the product polynomial.

The parallelism of this method is analyzed as follows. NTT-based polynomial multiplication can be divided into three steps, namely the NTT, the pairwise multiplication of coefficients, and the INTT, where the NTT and the INTT are computed in essentially the same way. Firstly, the pairwise multiplication of coefficients can obviously be fully parallelized. Secondly, the NTT/INTT consists of log n layers of operations, and each layer includes n/2 multiplications, n/2 additions, and n/2 subtractions. Within one layer, the multiplications can be parallelized with each other, and the additions and subtractions can be parallelized with each other; however, the multiplications cannot be parallelized with the additions and subtractions of the same layer because of the data dependences between them, and different layers cannot be parallelized with each other either.

In short, the fully homomorphic encryption scheme has large parameters, which makes the computation intensive and able to benefit greatly from the software-defined chip method, and its computation also features rich parallelism, which is very suitable for implementation with the software-defined chip method.

2. Design of commonly used key modules

In fully homomorphic encryption computations, the commonly used key modules mainly include high-order polynomial multiplication, the Chinese remainder theorem, and others. In this section, we discuss the issues involved in designing these modules with the software-defined chip method.

(1) High-order polynomial multiplication

As mentioned above, the polynomial order used in fully homomorphic encryption schemes is very high. Using schoolbook polynomial multiplication directly leads to an excessively large computational complexity of O(n^2), so the NTT algorithm and the Karatsuba algorithm are often used to reduce the complexity. When the modulus q is a prime number, the NTT algorithm reduces the complexity of polynomial multiplication from O(n^2) to O(n log n). The main computation then becomes the forward and inverse NTT, so in the following we mainly analyze the forward and inverse NTT. The actual computation of the NTT is similar to that of the FFT and is composed of butterfly operations arranged in log n layers.
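To make the three steps and the layered butterfly structure concrete, the following is a minimal iterative radix-2 NTT multiplication sketch in Python. The modulus q = 17, the root of unity w = 4 and the toy length n = 4 are illustrative assumptions rather than parameters from the text, and the negacyclic reduction modulo x^n + 1 used by most schemes is omitted; the sketch computes a cyclic product modulo x^n - 1.

def ntt(a, q, w):
    """Iterative Cooley-Tukey NTT of a length-n list (n a power of two),
    where w is a primitive n-th root of unity modulo the prime q."""
    n = len(a)
    a = a[:]
    j = 0                                 # bit-reversal permutation of the input
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2                            # log n layers of n/2 butterflies each
    while length <= n:
        wlen = pow(w, n // length, q)     # primitive length-th root for this layer
        for start in range(0, n, length):
            wk = 1
            for k in range(length // 2):  # one butterfly per iteration
                u = a[start + k]
                v = a[start + k + length // 2] * wk % q
                a[start + k] = (u + v) % q
                a[start + k + length // 2] = (u - v) % q
                wk = wk * wlen % q
        length <<= 1
    return a

def intt(a, q, w):
    """Inverse NTT: forward NTT with w^-1, then scale by n^-1 mod q."""
    inv_n = pow(len(a), -1, q)
    return [x * inv_n % q for x in ntt(a, q, pow(w, -1, q))]

def poly_mul_ntt(a, b, q, w):
    """Three steps: NTT both inputs, multiply pointwise, apply the INTT."""
    fa, fb = ntt(a, q, w), ntt(b, q, w)
    return intt([x * y % q for x, y in zip(fa, fb)], q, w)

# (1 + 2x)(3 + 4x) modulo (x^4 - 1, q = 17), with w = 4 of order 4 mod 17:
print(poly_mul_ntt([1, 2, 0, 0], [3, 4, 0, 0], 17, 4))   # -> [3, 10, 8, 0]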

The n/2 butterfly operations of each layer can be parallelized. In addition to the main butterfly operations, the NTT needed for multiplication on a polynomial ring often requires pre-processing and post-processing, but these can be folded into the butterfly operations with special handling.

First, we discuss the array structure when the NTT algorithm is designed with the software-defined chip method. The most straightforward array structure maps every butterfly operation of the NTT to a processing unit, which requires n/2 rows and log n columns of processing units. Its advantage is that the entire NTT can be computed with a single configuration, and multiple NTT tasks can be pipelined efficiently; its disadvantage is that the required array is too large and cannot be fully realized when the polynomial order is very high. The second array structure maps one layer of the NTT onto the processing units, requiring n/2 processing units; one complete layer of butterflies is computed each time. However, because the operations of the NTT layers differ considerably, reconfiguration is required after every layer, which is inefficient and calls for a dedicated configuration strategy. The third array structure uses a small rectangular processing unit array, for example 8 rows by 4 columns: when processing the first four layers of the NTT, eight butterflies per layer are computed at a time, and the computation can also be pipelined. Its disadvantage is that the data dependences of the later layers are more complicated, so the configuration must be adjusted according to the layer number, which adds considerable workload. In practical applications, the appropriate array structure should be selected according to the specific requirements and the available resources.

Next, let's discuss the functional design of the processing units. In this mapping, each processing unit corresponds to one butterfly operation, so its functions include storing the pre-computed twiddle factors; performing the modular multiplication, modular addition and modular subtraction of the butterfly; and adjusting the order of these operations according to the NTT type. The registers required by the processing unit therefore include registers for the coefficient inputs, registers for the pre-computed twiddle factors, a register for the modulus, and a status register for selecting decimation in time (DIT) or decimation in frequency (DIF). The logic required includes one modular multiplication module, two modular addition/subtraction modules, and control logic that selects the data path according to the configured function.

Now let's discuss the granularity design of the processing units. The granularity here mainly refers to the granularity of the modulo operation, so it depends on the magnitude of the modulus. Although the bit width of the modulus varies across fully homomorphic encryption schemes and parameter sets, after the Chinese remainder theorem is applied, the large modulus can be disaggregated into a series of small prime moduli.
By choosing these small prime moduli to be primes of the same bit width, the coefficients participating in the NTT operations all share the same granularity. To improve the efficiency of data transmission and processing, primes with the same bit width as the rest of the system are usually selected, generally 32-bit or 64-bit.
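As an illustration of the processing-unit function and granularity just described, the following is a minimal behavioral model of such a butterfly processing unit in Python. The register fields, the DIT/DIF selection and the toy modulus are illustrative assumptions, not a definitive datapath.

class ButterflyPE:
    """Behavioral sketch of one butterfly processing unit (not RTL)."""

    def __init__(self, modulus, twiddles, mode="DIT"):
        self.q = modulus                  # modulus register (e.g., a 32-bit NTT prime)
        self.twiddles = list(twiddles)    # locally stored pre-computed twiddle factors
        self.mode = mode                  # status register: "DIT" or "DIF" butterfly

    def configure(self, layer):
        # hierarchical configuration: only the layer number is sent at run time,
        # and the PE picks the corresponding locally stored twiddle factor
        self.w = self.twiddles[layer % len(self.twiddles)]

    def step(self, a, b):
        # one butterfly: one modular multiplication plus one modular add/sub pair,
        # ordered according to the configured NTT type
        if self.mode == "DIT":            # multiply first, then add/subtract
            t = b * self.w % self.q
            return (a + t) % self.q, (a - t) % self.q
        else:                              # "DIF": add/subtract first, then multiply
            return (a + b) % self.q, (a - b) * self.w % self.q

# Toy usage with a small NTT-friendly prime (assumed values):
pe = ButterflyPE(modulus=17, twiddles=[1, 4, 16, 13], mode="DIT")
pe.configure(layer=1)
print(pe.step(3, 5))                       # -> (6, 0), since 5*4 = 20 = 3 mod 17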

Then let's discuss the interconnection structure of the array, which is closely related to the overall array structure and the mapping method. If the first array structure described above is adopted, the interconnection between processing units is very simple: each processing unit only needs the two interconnection patterns corresponding to DIT and DIF, so we do not expand on it here. For the other array structures, processing units in the same layer need no interconnection, and the processing units of adjacent layers do not need full interconnection because their connection relationship follows certain rules. A specific analysis of the NTT algorithm shows that each processing unit only needs to be connected to log n processing units in the preceding layer. In addition, note that when the array cannot map all layers of the NTT, a data path must be provided from the output of the last layer of the array back to the input of the first layer.

Finally, let's discuss the configuration strategy. For the processing units of the NTT algorithm, the information to be configured includes the function selection of the processing unit, the values of the modulus and the twiddle factors, and the interconnection structure. To reduce the volume of configuration information and speed up configuration, a hierarchical configuration method is used instead of sending complete configuration information every time: the configuration information is divided into multiple levels according to how frequently each type of information changes, so as to minimize its size. In addition, since each processing unit only needs a limited number of twiddle factors, multiple twiddle factors can be stored locally in the processing unit during the initial configuration, and subsequent configurations only need to transmit the NTT layer number; the processing unit then selects the corresponding twiddle factor according to the layer number. This further reduces the configuration workload during computation, at the cost of more registers in the processing unit.

Another commonly used algorithm for accelerating polynomial multiplication in fully homomorphic encryption is the Karatsuba algorithm. It is typically used when the modulus is not a prime number and the NTT cannot be applied. The following is a brief introduction to the Karatsuba algorithm. For simplicity but without loss of generality, we take even-term polynomials as an example; odd-term polynomials can also use the Karatsuba algorithm through zero extension or asymmetric decomposition. Assume that the input polynomials A and B have n terms each. Each polynomial can be split into its high n/2 terms and low n/2 terms, that is, A = A_L + A_H·x^(n/2) and B = B_L + B_H·x^(n/2). Computing the product of A and B directly gives A × B = (A_L + A_H·x^(n/2))(B_L + B_H·x^(n/2)) = A_L·B_L + (A_L·B_H + A_H·B_L)·x^(n/2) + A_H·B_H·x^n, which requires four multiplications of n/2-term polynomials.
The Karatsuba algorithm instead computes the middle term (A_L·B_H + A_H·B_L) as (A_L + A_H)(B_L + B_H) − A_L·B_L − A_H·B_H, where A_L·B_L and A_H·B_H have already been computed, so only (A_L + A_H)(B_L + B_H) needs to be computed additionally. The Karatsuba algorithm thus reduces the four n/2-term polynomial multiplications to three, at the cost of two extra additions in the pre-processing and a few additions and subtractions in the post-processing. When the number of polynomial terms is very large, the Karatsuba algorithm is applied recursively: the high-order polynomials are disaggregated into small polynomials with fewer terms, schoolbook polynomial multiplication is applied to each small polynomial, and the final product is then restored through the post-processing of each recursion level.
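A minimal recursive sketch of this idea in Python follows. The threshold at which it falls back to schoolbook multiplication is an assumed illustrative value, polynomial lengths are taken to be powers of two, and coefficient reduction modulo q is omitted.

def schoolbook_mul(a, b):
    """Schoolbook product of two coefficient lists (lowest degree first)."""
    res = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            res[i + j] += ai * bj
    return res

def karatsuba_mul(a, b, threshold=8):
    """Karatsuba multiplication of two n-term polynomials (n a power of two)."""
    n = len(a)
    if n <= threshold:                                   # small enough: schoolbook
        return schoolbook_mul(a, b)
    half = n // 2
    a_lo, a_hi = a[:half], a[half:]                      # A = A_L + A_H * x^(n/2)
    b_lo, b_hi = b[:half], b[half:]
    lo = karatsuba_mul(a_lo, b_lo, threshold)            # A_L * B_L
    hi = karatsuba_mul(a_hi, b_hi, threshold)            # A_H * B_H
    mid = karatsuba_mul([x + y for x, y in zip(a_lo, a_hi)],
                        [x + y for x, y in zip(b_lo, b_hi)], threshold)
    mid = [m - l - h for m, l, h in zip(mid, lo, hi)]    # A_L*B_H + A_H*B_L
    res = [0] * (2 * n - 1)
    for i, v in enumerate(lo):
        res[i] += v
    for i, v in enumerate(mid):
        res[i + half] += v
    for i, v in enumerate(hi):
        res[i + 2 * half] += v
    return res

# Sanity check on a toy 8-term example:
import random
a = [random.randrange(100) for _ in range(8)]
b = [random.randrange(100) for _ in range(8)]
assert karatsuba_mul(a, b, threshold=2) == schoolbook_mul(a, b)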

These three steps, pre-processing, schoolbook polynomial multiplication and post-processing, all have good parallelism and can be implemented with the software-defined chip method.

Firstly, we discuss the array structure when the software-defined chip method is used to accelerate the Karatsuba algorithm. The goal is to speed up all three steps. The pre-processing and post-processing steps have irregular structures because they are recursive, while the schoolbook polynomial multiplication step accounts for the largest part of the computation and has a very regular structure. The array structure can therefore be designed around the requirements of the polynomial multiplication step, with the other two steps mapped onto this array as efficiently as possible. Once the Karatsuba recursion reaches a certain depth, the multiplications saved are offset by the extra additions and subtractions, so the number of terms to which different fully homomorphic encryption schemes decompose their polynomials is very similar, and the specific number of terms can be determined from the relative cost of multiplication and addition on the target platform. The array structure can therefore be a square matrix whose side length equals this number of polynomial terms, with the number of such matrices determined by the resource constraints and speed requirements.

The processing unit must satisfy the requirements of all three steps. The polynomial multiplication step requires the processing unit to multiply, accumulate and temporarily store partial products; the pre-processing step requires it to add and temporarily store coefficients; and the post-processing step requires addition as well as the addition and subtraction of three numbers. Based on these requirements, the processing unit should contain three logic modules, namely a multiplier, an adder/subtractor and an adder, together with more than one coefficient register and one function control module.

Now let's discuss the granularity design of the processing units. The considerations are similar to those in the NTT design above: generally, the CRT is used so that the granularity matches the rest of the system. To also support algorithms that do not use the CRT, a sequential multi-cycle scheme can be adopted in the multiplier to save hardware resources, so that the granularity of the multiplier can remain 32-bit or 64-bit.

Then, let's discuss the interconnection structure of the array.
According to the requirements of the polynomial multiplication mapping, each processing unit needs three data inputs, coming from the unit to its left, the unit to its lower left, and the broadcast input of its column, while its output needs to be connected to the unit to its right and
the unit to its upper right. In addition, to facilitate mapping the pre-processing and post-processing onto the array, each processing unit can be interconnected with its eight neighboring processing units, which improves the mapping efficiency for these two steps.

Finally, let's discuss the configuration strategy. As with the configuration strategy of the NTT algorithm, a hierarchical configuration method is needed to reduce the configuration information and increase the configuration speed. Since the function of each processing unit is almost unchanged during polynomial multiplication, the configuration information for this function can be greatly compressed, so that each control signal is decoded locally in the processing unit; the configuration information then only contains the polynomial multiplication function code, the number of terms, and the moduli. For pre-processing and post-processing, the operation of each processing unit is relatively simple but the operation structure is irregular, and frequent reconfiguration would waste a lot of time; one may consider idling some units for a few cycles when mapping these functions, so as to reduce the frequency of configuration changes.

In the preceding text, we introduced the use of the NTT algorithm and the Karatsuba algorithm to accelerate high-order polynomial multiplication, and analyzed how to implement these two algorithms with the software-defined chip technology in terms of array structure, processing unit function design, processing unit granularity, array interconnection structure, and configuration strategy. These aspects must be considered whenever the software-defined chip method is used for design, and the same analysis approach can be applied when implementing other modules.

(2) Chinese remainder theorem

Moduli of hundreds or even thousands of bits are often used in fully homomorphic encryption schemes. To reduce the computational complexity caused by such large coefficients, researchers often use the CRT to convert computation under one large modulus into computations under several small moduli. The forward and inverse CRT transforms are required in the key switching and modulus switching steps commonly used in these schemes. Before accelerating the CRT with the software-defined chip method, we first analyze the key operations of the algorithm. Suppose the moduli used by the CRT are M and m_i, satisfying M = ∏ m_i. The forward transform computes, for a number x in [0, M − 1], the residue a_i = x mod m_i for every m_i. Let M_i = M/m_i and c_i = M_i × (M_i^(−1) mod m_i); the inverse transform then computes Σ a_i·c_i mod M. Suppose the bit width of M is N and the bit width of m_i is n. The key operations involved in the forward and inverse transforms are the following three: reducing an N-bit number modulo an n-bit number; multiplying an N-bit number by an n-bit number and summing the products; and reducing a number of about N + n bits modulo an N-bit number. If the Montgomery reduction algorithm or the Barrett reduction algorithm is used for the modulo operations, the key operation becomes the multiplication of two numbers with a bit width of about N.
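As a hedged illustration of how this modulo operation reduces to large-integer multiplication, the following is a textbook Barrett reduction sketch in Python; it is not a datapath from the text, and the modulus in the toy check is an arbitrary assumption.

def barrett_reduce(x, m):
    """Compute x mod m for 0 <= x < m*m using Barrett reduction.
    The work is dominated by two wide multiplications, which is why the modulo
    operation above boils down to large-integer multiplication."""
    k = m.bit_length()
    mu = (1 << (2 * k)) // m             # precomputed once per modulus
    q = (x * mu) >> (2 * k)              # underestimate of x // m
    r = x - q * m
    while r >= m:                        # at most a couple of correction steps
        r -= m
    return r

# Toy check against Python's built-in modulo:
m = (1 << 61) - 1                         # an assumed large modulus
x = 123456789123456789123456789 % (m * m)
assert barrett_reduce(x, m) == x % m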

Therefore, the software-defined chip method can be used to accelerate this large-integer multiplication. The specific design ideas are discussed below.

First, let's discuss the design of the array structure. Since the bit width of the integers to be multiplied is large, a divide-and-conquer method can be used: each operand is divided into several parts of bit width n, and the partial products are computed and then accumulated. An array with N/n rows and N/n columns can therefore be used. In each cycle, the partial products of one multiplication are computed in the same column; in the next cycle, each partial product result is passed to the unit to the upper right and the carry is passed to the unit to the right. Pipelining can increase the throughput. If hardware resources are limited, a single-column array with N/n rows can also be used, completing one multiplication in N/n cycles.

Next, let's discuss the functional design of the processing units. The processing unit only needs very simple functions, mainly a multiply-accumulate unit, an operand register and a partial product register. The multiply-accumulate unit has four inputs, namely two multiplicand inputs of bit width n, a partial product input of bit width 2n, and a carry input of bit width log2(N/n), and two outputs, namely a partial product output of bit width 2n and a carry output of bit width log2(N/n).

Now let's discuss the granularity design of the processing units. Because the bit width n of the small moduli m_i is usually 32 or 64 bits when the fully homomorphic encryption scheme uses the CRT, the multiplier can be designed with the corresponding granularity.

Then let's discuss the interconnection structure of the array. If the rectangular array with N/n rows and columns is used, each unit needs four input connections: the operand input from the left, the operand broadcast input of its column, the carry input from the left, and the partial product input from the lower left. Each column of units therefore needs a broadcast input connection, the left and right neighboring units need a connection of bit width n + log2(N/n), and the lower-left and upper-right neighboring units need a connection of bit width 2n. If the single-column array with N/n rows is used, only the broadcast input connection and the interconnection between vertically adjacent units are required.

Finally, let's discuss the configuration strategy. Because the function of each processing unit is almost unchanged throughout this computation, only the input and output of the whole array need to be controlled. The configuration information of each unit merely selects the computing mode according to the bit width of the operands involved, and does not need to change during the computation.
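To show the divide-and-conquer structure that this array implements, here is a minimal limb-based multiplication sketch in Python. The 32-bit limb width and the helper names are illustrative assumptions, and the sequential loops stand in for the partial products and carries that the array would compute in parallel, column by column.

LIMB_BITS = 32                            # assumed granularity n (matching the CRT moduli)
LIMB_MASK = (1 << LIMB_BITS) - 1

def to_limbs(x, count):
    """Split an N-bit integer into count = N/n limbs, least significant first."""
    return [(x >> (LIMB_BITS * i)) & LIMB_MASK for i in range(count)]

def limb_mul(a_limbs, b_limbs):
    """Schoolbook limb multiplication: each (i, j) partial product corresponds to
    one multiply-accumulate that a processing unit of the array would perform."""
    res = [0] * (len(a_limbs) + len(b_limbs))
    for i, ai in enumerate(a_limbs):
        carry = 0
        for j, bj in enumerate(b_limbs):
            acc = res[i + j] + ai * bj + carry     # partial product + incoming carry
            res[i + j] = acc & LIMB_MASK           # kept 2n-bit partial product slice
            carry = acc >> LIMB_BITS               # carry passed to the next unit
        res[i + len(b_limbs)] += carry
    return res

def from_limbs(limbs):
    return sum(l << (LIMB_BITS * i) for i, l in enumerate(limbs))

# Toy check with two operands of 4 limbs each (N/n = 4):
a, b = 0x1234_5678_9ABC_DEF0_1111_2222, 0x0FED_CBA9_8765_4321_3333_4444
assert from_limbs(limb_mul(to_limbs(a, 4), to_limbs(b, 4))) == a * b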

References

1. Venkataramanaiah SK, Ma Y, Yin S et al (2019) Automatic compiler based FPGA accelerator for CNN training. In: The 29th international conference on field programmable logic and applications, pp 166–172

2. Lu C, Wu Y, Yang C (2019) A 2.25TOPS/W fully-integrated deep CNN learning processor with on-chip training. In: IEEE asian solid-state circuits conference (A-SSCC), pp 65–68 3. Dey S, Chen D, Li Z et al (2018) A highly parallel FPGA implementation of sparse neural network training. In: International conference on reconfigurable computing and FPGAs, pp 1–4 4. Chen Z, Fu S, Cao Q et al (2020) A mixed-signal time-domain generative adversarial network accelerator with efficient subthreshold time multiplier and mixed-signal on-chip training for low power edge devices. In: IEEE symposium on VLSI circuits, pp 1–2 5. Zhao Z, Wang Y, Zhang X et al (2019) An energy-efficient computing-in-memory neuromorphic system with on-chip training. In: IEEE biomedical circuits and systems conference, pp 1–4 6. Tu F, Wu W, Wang Y et al (2020) Evolver: a deep learning processor with on-device quantization-voltage-frequency tuning[J]. IEEE J Solid-State Circuits 56(2):658–673 7. Siddhartha S, Wilton S, Boland D et al (2018) Simultaneous inference and training using on-FPGA weight perturbation techniques. In: International conference on field-programmable technology, pp 306–309 8. Arute F, Arya K, Babbush R et al (2019) Quantum supremacy using a programmable superconducting processor. Nature 574(7779):505–510 9. Zhong H, Wang H, Deng Y et al (2020) Quantum computational advantage using photons. Science 370(6523):1460–1463 10. Shor PW (1994) Algorithms for quantum computation: discrete logarithms and factoring. In: The 35th annual symposium on foundations of computer science, pp 124–134 11. Grover LK (1996) A fast quantum mechanical algorithm for database search. In: The 28th annual ACM symposium of theory of computing, pp 212–219 12. Gambetta J (2020) IBM’s roadmap for scaling quantum technology[EB/OL]. https://www.ibm. com/blogs/research/2020/09/ibm-quantum-roadmap [2020-10-01] 13. Digicert. Prospects and Risks of Quantum: 2019 DIGICERT Post-quantum Encryption Survey [EB/OL]. http://www.digicert.com/resources/industry-report/2019-Post-Quantum-Gypto-Sur vey-cn.pdf [2020-05-01] 14. Michele M, Vlad G (2020) A resource estimation framework for quantum attacks against cryptographic functions—improvements[EB/OL]. https://globalriskinstitute.org/publications/ quantum-risk-assessment-report-part-4-2 [2020-12-02] 15. Chinese Association for Cryptologic Research. Announcement on the evaluation result of the national cryptographic algorithm design competition [EB/OL]. https://www.cacrnet.org.cn/ site/content/854.html [2020-12-10] 16. NIST. Post-quantum cryptography standardization [EB/OL]. https://csrc.nist.gov/Projects/ post-quantum-cryptography/post-quantum-cryptography-standardization [2020-09-01] 17. Banerjee U, Pathak A, Chandrakasan AP (2019) An energy-efficient configurable lattice cryptography processor for the quantum-secure Internet of Things. In: IEEE international solid-state circuits conference, pp 46–48 18. Banerjee U, Ukyab TS, Chandrakasan AP (2019) Sapphire: a configurable crypto-processor for post-quantum lattice-based protocols. IACR Trans Cryptographic Hardware Embed Syst 4:17–61 19. Zhang N, Yang B, Chen C et al (2020) Highly efficient architecture of NewHope-NIST on FPGA using low-complexity NTT/INTT. IACR Trans Cryptographic Hardware Embed Syst 49–72 20. Zhu Y, Zhu M, Yang B et al (2020) A high-performance hardware implementation of saber based on Karatsuba algorithm[EB/OL]. https://eprint.iacr.org/2020/1037 [2020-11-01] 21. 
Mohajerani K, Haeussler R, Nagpal R et al (2020) FPGA benchmarking of round 2 candidates in the NIST lightweight cryptography standardization process: methodology, metrics, tools, and results[EB/OL]. https://eprint.iacr.org/2020/1207 [2020-03-10] 22. Ozcan E, Aysu A (2019) High-level-synthesis of number-theoretic transform: a case study for future cryptosystems. IEEE Embed Syst Lett 12(4):133–136 23. Fritzmann T, Sigl G, Sepúlveda J (2020) RISQ-V: tightly coupled RISC-V accelerators for post-quantum cryptography. IACR Trans Cryptographic Hardware Embed Syst 4:239–280

24. Xin G, Han J, Yin T et al (2020) VPQC: a domain-specific vector processor for post-quantum cryptography based on RISC-V architecture. IEEE Trans Circuits Syst I Regul Pap 67(8):2672– 2684 25. Karabulut E, Aysu A (2020) RANTT: a RISC-V architecture extension for the number theoretic transform. In: The 30th international conference on field-programmable logic and applications, pp 26–32 26. Acar A, Aksu H, Uluagac AS et al (2018) A survey on homomorphic encryption schemes: theory and implementation. ACM Comput Surv 51(4):1–35 27. Akavia A, Feldman D, Shaul H (2018) Secure search via multi-ring fully homomorphic encryption. IACR Cryptol ePrint Arch 245 28. Wood A, Najarian K, Kahrobaei D (2020) Homomorphic encryption for machine learning in medicine and bioinformatics. ACM Comput Surv 53(4):1–35 29. Rivest RL, Adleman L, Dertouzos ML (1978) On data banks and privacy homomorphisms. Academic Press, New York 30. Gentry C (2009) A fully homomorphic encryption scheme. Stanford University, Stanford 31. van Dijk M, Gentry C, Halevi S et al (2010) Fully homomorphic encryption over the integers. In: The 29th annual international conference on the theory and applications of cryptographic techniques, pp 24–43 32. Brakerski Z, Vaikuntanathan V (2011) Fully homomorphic encryption from ring-LWE and security for key dependent messages. In: Proceedings of the 31st annual conference on advances in cryptology, pp 505–524 33. López-Alt A, Tromer E, Vaikuntanathan V (2012) On-the-fly multiparty computation on the cloud via multikey fully homomorphic encryption. In: The 44th annual ACM symposium on theory of computing, pp 1219–1234 34. Brakerski Z, Gentry C, Vaikuntanathan V (2012) (Leveled) fully homomorphic encryption without bootstrapping. In: Proceedings of the 3rd innovations in theoretical computer science conference, pp 309–325 35. Brakerski Z (2012) Fully homomorphic encryption without modulus switching from classical GapSVP. In: The 32nd annual cryptology conference, pp 868–886 36. Fan J, Vercauteren F (2012) Somewhat practical fully homomorphic encryption. IACR Cryptol ePrint Arch 144 37. Gentry C, Sahai A, Waters B (2013) Homomorphic encryption from learning with errors: conceptually-simpler, asymptotically-faster, attribute-based. In: The 33rd annual cryptology conference, pp 75–92 38. Cheon JH, Kim A, Kim M et al (2017) Homomorphic encryption for arithmetic of approximate numbers. In: Advances in cryptology—ASIACRYPT, pp 409–437 39. Chillotti I, Gama N, Georgieva M et al (2020) TFHE: fast fully homomorphic encryption over the torus. J Cryptol 33(1):34–91 40. Gentry C, Halevi S, Smart NP (2012) Homomorphic evaluation of the AES circuit. In: The 32nd annual international cryptology conference, pp 850–867 41. Wang W, Huang XM (2013) FPGA implementation of a large-number multiplier for fully homomorphic encryption. In: IEEE international symposium on circuits and systems, pp 2589– 2592 42. Poppelmann T, Naehrig M, Putnam A et al (2015) Accelerating homomorphic evaluation on reconfigurable hardware. In: The 17th international workshop on cryptographic hardware and embedded systems, pp 143–163 43. Sinha Roy S, Järvinen K, Vercauteren F et al (2015) Modular hardware architecture for somewhat homomorphic function evaluation. In: The 17th international workshop on cryptographic hardware and embedded systems, pp 164–184 44. Ozturk E, Doroz Y, Savas E et al (2016) A custom accelerator for homomorphic encryption applications. IEEE Trans Comput 99:1 45. 
Roy SS, Vercauteren F, Vliegen J et al (2017) Hardware assisted fully homomorphic function evaluation and encrypted search. IEEE Trans Comput 99:1

46. Zhang N, Qin Q, Yuan H et al (2020) NTTU: an area-efficient low-power NTT-uncoupled architecture for NTT-based multiplication. IEEE Trans Comput 69(4):520–533 47. Wang W, Huang XM, Emmart N et al (2014) VLSI design of a large-number multiplier for fully homomorphic encryption. IEEE Trans Very Large Scale Integr (VLSI) Syst 22(9):1879–1887 48. Doroz Y, Ozturk E, Sunar B (2014) A million-bit multiplier architecture for fully homomorphic encryption. Microprocess Microsyst 38(8):766–775 49. Doroz Y, Ozturk E, Sunar B (2015) Accelerating fully homomorphic encryption in hardware. IEEE Trans Comput 64(6):1509–1521 50. Yoon I, Cao N, Amaravati A et al (2019) A 55 nm 50 nJ/encode 13 nJ/decode homomorphic encryption crypto-engine for IoT nodes to enable secure computation on encrypted data. In: IEEE custom integrated circuits conference, pp 1–4 51. Dai W, Doröz Y, Sunar B (2014) Accelerating NTRU based homomorphic encryption using GPUs. In: IEEE high performance extreme computing conference, pp 1–6